May 26, 2021 Hadoop
1. The client first writes the file data to a temporary file on its local disk, managed by the HDFS client
2. When the temporary file reaches the size of one block, the HDFS client notifies the NameNode and requests to write the file
3. The NameNode creates the file in the HDFS namespace and returns to the client the block ID and the list of DataNodes to write to
4. On receiving this information, the client writes the contents of the temporary file to those DataNodes
5. When the file is finished (the client closes it), the NameNode commits the file. Only then does the file become visible; if the NameNode crashes before the commit, the file is lost.
fsync: this only guarantees that the information about the data has been written to the NameNode. It does not guarantee that the data itself has been written to the DataNodes.
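The five steps above can be sketched as a toy in-memory model. This is only an illustration under simplifying assumptions, not the real HDFS client API: the names ToyNameNode, ToyClient and BLOCK_SIZE are made up for this example, the "temporary file" is a byte buffer, and the block size is tiny so the flow is easy to follow.

```python
# Toy in-memory model of the write path; all names here are hypothetical.
BLOCK_SIZE = 4  # bytes, tiny on purpose (real HDFS blocks are far larger)


class ToyNameNode:
    def __init__(self, datanodes, replication=3):
        self.datanodes = datanodes        # DataNode name -> {block_id: bytes}
        self.replication = replication
        self.pending = {}                 # path -> block ids, not yet visible
        self.committed = {}               # path -> block ids, visible
        self.next_block_id = 0

    def allocate_block(self, path):
        # Step 3: record a new block for the file and return the block id
        # together with the DataNodes that should store it.
        block_id = self.next_block_id
        self.next_block_id += 1
        self.pending.setdefault(path, []).append(block_id)
        targets = sorted(self.datanodes)[: self.replication]
        return block_id, targets

    def commit(self, path):
        # Step 5: the file becomes visible only once it is committed.
        self.committed[path] = self.pending.pop(path, [])


class ToyClient:
    def __init__(self, namenode, path):
        self.namenode = namenode
        self.path = path
        self.buffer = b""                 # stands in for the local temporary file

    def write(self, data):
        # Step 1: stage data locally. Step 2: hand off every full block.
        self.buffer += data
        while len(self.buffer) >= BLOCK_SIZE:
            self._flush(self.buffer[:BLOCK_SIZE])
            self.buffer = self.buffer[BLOCK_SIZE:]

    def _flush(self, chunk):
        # Steps 3 and 4: ask for a block, then write the chunk to each DataNode.
        block_id, targets = self.namenode.allocate_block(self.path)
        for name in targets:
            self.namenode.datanodes[name][block_id] = chunk

    def close(self):
        # Send the last partial block, then ask the NameNode to commit.
        if self.buffer:
            self._flush(self.buffer)
            self.buffer = b""
        self.namenode.commit(self.path)


datanodes = {"dn1": {}, "dn2": {}, "dn3": {}, "dn4": {}}
namenode = ToyNameNode(datanodes)
client = ToyClient(namenode, "/demo.txt")

client.write(b"hello world")
visible_before_close = "/demo.txt" in namenode.committed
client.close()
visible_after_close = "/demo.txt" in namenode.committed
```

In this model the file only shows up in `committed` after `close()`, which mirrors the point in step 5 that an uncommitted file is not visible.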
Rack awareness
The configuration file specifies the mapping between rack names and the DNS names of the nodes
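One common way to supply this mapping is an external topology script, referenced from core-site.xml through the net.topology.script.file.name property. Hadoop calls the script with one or more addresses as arguments and reads one rack path per address from its output. Below is a minimal sketch of such a script; the addresses, host names and rack names in the table are invented for illustration and would need to match the actual cluster.

```python
#!/usr/bin/env python3
import sys

# Hypothetical address-to-rack table; replace with the real cluster layout.
RACKS = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
    "dn3.example.com": "/rack2",
}
# Fallback for addresses that are not listed in the table.
DEFAULT_RACK = "/default-rack"


def resolve(addresses):
    """Return one rack path per address, in the same order."""
    return [RACKS.get(addr, DEFAULT_RACK) for addr in addresses]


if __name__ == "__main__":
    # Addresses arrive as command-line arguments; racks go to stdout.
    print(" ".join(resolve(sys.argv[1:])))
```

Unknown addresses fall back to a default rack here, so a node missing from the table still gets a valid location instead of breaking the lookup.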
Suppose the replication factor is 3: when a file is written, one replica is stored on the local rack and the other two replicas are stored on another rack (transfers within the same rack are faster, which improves performance)
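The placement rule for a replication factor of 3 can be sketched as follows. This is a simplified illustration, not HDFS's actual placement code: the function place_replicas and the topology dictionary are made up for the example, and it assumes the writer runs on a DataNode, which then holds the local replica.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick 3 DataNodes: one on the writer's rack, two on one other rack.

    topology maps rack name -> list of DataNode names; writer is a DataNode name.
    """
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    # The remote rack must have at least two nodes for the other two replicas.
    remote_racks = [
        r for r, nodes in topology.items()
        if r != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need another rack with at least two DataNodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer, second, third]


# Hypothetical three-rack cluster.
topology = {
    "/rack1": ["dn1", "dn2"],
    "/rack2": ["dn3", "dn4", "dn5"],
    "/rack3": ["dn6", "dn7"],
}
rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
replicas = place_replicas("dn1", topology, random.Random(0))
```

Putting the second and third replicas on the same remote rack means the copy between them stays inside one rack, while the data still survives the loss of either rack.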
The HDFS cluster as a whole should be kept load balanced, with data spread evenly across the DataNodes, to get the most benefit from the cluster
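One way to make "load balanced" concrete is the criterion used by the HDFS balancer tool: a DataNode counts as balanced when its utilization (used space divided by capacity) is within a threshold, 10 percentage points by default, of the utilization of the whole cluster. The sketch below only checks that condition; the function name and the sample numbers are invented for illustration, and it does not move any data.

```python
def unbalanced_nodes(usage, threshold_pct=10.0):
    """Return the DataNodes whose utilization is too far from the cluster's.

    usage maps DataNode name -> (used_bytes, capacity_bytes).
    """
    total_used = sum(used for used, _ in usage.values())
    total_cap = sum(cap for _, cap in usage.values())
    cluster_pct = 100.0 * total_used / total_cap
    return sorted(
        name
        for name, (used, cap) in usage.items()
        if abs(100.0 * used / cap - cluster_pct) > threshold_pct
    )


# Hypothetical usage figures: dn1 is over-used, dn3 is under-used.
usage = {"dn1": (90, 100), "dn2": (55, 100), "dn3": (40, 100)}
out_of_balance = unbalanced_nodes(usage)
```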