May 26, 2021 Hadoop
HDFS (Hadoop Distributed File System), Hadoop's distributed file system
Block
The basic storage unit, 64 MB by default. Blocks are configured large mainly to reduce seek overhead: hard disk transfer rates are much higher than seek rates, so a larger block amortizes each seek over more data (a per-file block-size override is sketched after this list)
A large file is split into blocks that are stored on different machines. If a file is smaller than the block size, it occupies only its actual size on disk rather than a full block
The basic unit of reads and writes, similar to a disk page; HDFS reads and writes one block at a time
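A minimal Java sketch (not from the original notes) of overriding the block size for a single file through the HDFS client API; the 128 MB value and the /tmp path are illustrative assumptions, and the cluster-wide default comes from dfs.blocksize in hdfs-site.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            long blockSize = 128L * 1024 * 1024;   // per-file block size (illustrative value)
            short replication = 3;                 // replicas per block
            int bufferSize = 4096;                 // client I/O buffer size
            // This create() overload lets the caller pick the block size per file.
            FSDataOutputStream out = fs.create(
                    new Path("/tmp/blocksize-demo.txt"),  // hypothetical path
                    true, bufferSize, replication, blockSize);
            out.writeUTF("hello hdfs");
            out.close();
            fs.close();
        }
    }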
NameNode
Stores the file system metadata. All metadata is held in memory at runtime, so the total number of files HDFS can store is limited by the NameNode's memory
Each block corresponds to a record in the NameNode (a block typically consumes about 150 bytes), so a large number of small files consumes a lot of memory. At the same time, the number of map tasks is determined by the number of input splits, so using MapReduce on many small files generates too many map tasks, and the task-management overhead increases job time. Processing a large number of small files is therefore much slower than processing large files of the same total size, which is why Hadoop recommends storing large files (a rough memory estimate follows this list)
Metadata is persisted to local disk periodically, but block location information is not persisted; it is reported by each DataNode when it registers and is maintained at runtime (DataNode-related information is not saved to the NameNode's on-disk files but is rebuilt dynamically after each NameNode restart)
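A back-of-the-envelope sketch of the small-files cost, using the roughly 150 bytes-per-block figure above; real NameNode memory use also includes inode and directory objects, so these numbers are lower bounds:

    public class NameNodeMemoryEstimate {
        public static void main(String[] args) {
            long bytesPerBlockRecord = 150;           // rough per-block cost in NameNode memory
            long blockSize = 64L * 1024 * 1024;       // 64 MB default block size

            // Case 1: one terabyte stored as 1 MB files -> one block per file.
            long smallFileBlocks = (1L << 40) / (1L << 20);    // 1,048,576 blocks
            System.out.println("1 TB as 1 MB files: "
                    + smallFileBlocks * bytesPerBlockRecord / (1 << 20)
                    + " MB of block records");                 // ~150 MB

            // Case 2: the same terabyte as large files of 64 MB blocks.
            long largeFileBlocks = (1L << 40) / blockSize;     // 16,384 blocks
            System.out.println("1 TB as 64 MB blocks: "
                    + largeFileBlocks * bytesPerBlockRecord / 1024
                    + " KB of block records");                 // ~2,400 KB
        }
    }

Same stored bytes, roughly 64x more NameNode memory in the small-file case, which is the arithmetic behind the recommendation to store large files.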
Secondary NameNode
Periodically merges the NameNode's fsimage with its edit log to produce a new fsimage, keeping the edit log from growing without bound; despite the name, it is not a hot standby for the NameNode
DataNode
Stores the actual block data
Handles data read and write operations as well as block replication
When a DataNode starts, it reports its currently stored block information to the NameNode, and thereafter sends scheduled reports as blocks change (the block-location query sketched below shows the result of these reports)
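A minimal Java sketch of asking the NameNode which DataNodes hold each block of a file; the locations it returns come from exactly the DataNode block reports described above, which is why they live only in NameNode memory. The path is the hypothetical one from the earlier example, and a reachable cluster configuration is assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/tmp/blocksize-demo.txt"));
            // The NameNode answers from its in-memory block map built from block reports.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }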