
Hadoop HDFS


May 26, 2021 Hadoop



Hadoop - HDFS

Brief introduction

HDFS (Hadoop Distributed File System) is Hadoop's distributed file system.

Architecture

[Figure: Hadoop HDFS architecture]

  • Block

    1. The basic storage unit, typically 64 MB. Blocks are made this large mainly to reduce seek overhead: hard disks transfer data much faster than they seek, so larger blocks spend proportionally less time seeking;

    2. A large file is split into blocks that are stored on different machines. If a file is smaller than the block size, it occupies only its actual size on disk;

    3. The basic unit of reads and writes, analogous to a disk page; data is read and written one block at a time;

    4. Each block is replicated to multiple machines, with 3 copies by default (see the write sketch after this list).
  • NameNode

    1. Stores the file system metadata. All metadata is kept in memory at runtime, so the total number of files HDFS can hold is limited by the NameNode's memory size;

    2. Each block corresponds to a record in the NameNode (a block typically consumes about 150 bytes), so a large number of small files consumes a lot of memory. At the same time, the number of map tasks is determined by the number of splits, so processing many small files with MapReduce produces too many map tasks, and the task-management overhead lengthens the job. Processing many small files is therefore much slower than processing large files of the same total size, which is why Hadoop recommends storing large files (see the namespace-cost sketch after this list);

    3. The metadata is persisted to local disk on a schedule, but without the blocks' location information; that is reported by each DataNode when it registers and is maintained at runtime (DataNode-related information is not saved in the NameNode's file system image, but is rebuilt dynamically by the NameNode after each restart);

    4. If the NameNode fails, the entire HDFS cluster fails, so the NameNode's availability must be guaranteed
  • Secondary NameNode

    1. Synchronizes with the NameNode on a schedule: it periodically merges the file system image (fsimage) with the edit log, ships the merged image back to the NameNode to replace its old image, and empties the edit log, similar to a checkpoint mechanism (see the checkpoint-tuning sketch after this list). It is not automatic failover, however: when the NameNode fails, the Secondary NameNode must still be promoted to primary manually
  • DataNode

    1. Stores the actual block data;

    2. Handles data reads, writes, and replication;

    3. Reports its currently stored block information to the NameNode at startup, then sends scheduled reports on changes;

    4. DataNodes communicate with each other to copy data blocks and ensure data redundancy (see the read sketch after this list)
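
The write sketch below illustrates the Block points above with the Java HDFS client API: it writes a file with an explicit 64 MB block size and replication factor 3, then asks the NameNode which DataNodes hold each block. The NameNode URI (hdfs://namenode:9000) and the /demo path are placeholders for illustration, not values from this article.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode URI

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/demo/blocks.txt"); // assumed path

            // create(path, overwrite, bufferSize, replication, blockSize):
            // 3 replicas and 64 MB blocks, matching the values described above.
            try (FSDataOutputStream out =
                     fs.create(path, true, 4096, (short) 3, 64L * 1024 * 1024)) {
                out.writeUTF("hello hdfs");
            }

            // The NameNode answers metadata queries such as block locations.
            FileStatus status = fs.getFileStatus(path);
            for (BlockLocation block :
                     fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
            }
        }
    }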
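
Because every file, directory, and block costs the NameNode roughly 150 bytes of heap, the small-files cost can be estimated from the namespace counts. The namespace-cost sketch below assumes at least one block per file; the 150-byte figure comes from the text, while the URI is again a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceCostDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode URI

            FileSystem fs = FileSystem.get(conf);
            ContentSummary summary = fs.getContentSummary(new Path("/"));

            long files = summary.getFileCount();
            long dirs = summary.getDirectoryCount();
            // ~150 bytes of NameNode heap per namespace object (file,
            // directory, or block); assume one block per file as a floor.
            long objects = files + dirs + files;
            System.out.printf("files=%d dirs=%d estimated heap=%.1f MB%n",
                files, dirs, objects * 150 / (1024.0 * 1024.0));
        }
    }

For example, 10 million single-block files amount to about 20 million namespace objects, or roughly 3 GB of NameNode heap, no matter how little data the files actually hold.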
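
The frequency of the fsimage/edit-log merge described under Secondary NameNode is tunable. The checkpoint-tuning sketch below shows the relevant keys via the Java Configuration API for consistency with the other sketches; in practice these are set in hdfs-site.xml on the cluster, and the values shown are Hadoop's usual defaults rather than anything specified in this article.

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointTuning {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Merge fsimage and the edit log at least once an hour...
            conf.setLong("dfs.namenode.checkpoint.period", 3600);
            // ...or sooner, once this many uncheckpointed transactions accumulate.
            conf.setLong("dfs.namenode.checkpoint.txns", 1000000);
            System.out.println(conf.get("dfs.namenode.checkpoint.period"));
        }
    }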
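
Reads follow the division of labor above: the client obtains block locations from the NameNode, then streams the bytes directly from the DataNodes that hold them. A minimal read sketch, reusing the placeholder URI and path from the write sketch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class BlockReadDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode URI

            FileSystem fs = FileSystem.get(conf);
            // open() fetches block locations from the NameNode; the stream
            // then reads the block data from the DataNodes that hold it.
            try (FSDataInputStream in = fs.open(new Path("/demo/blocks.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }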