About Hadoop

May 26, 2021 10:00 Hadoop


These notes record the basic principles, processing flows, and key concepts of Hadoop's components, including HDFS, YARN, and MapReduce.

This tutorial is from Penny Wong

Update date    Update content
2015-5-7       Hadoop documentation


  • People and machines are generating data faster and faster, and more data usually beats better algorithms, so a different way of processing data is needed.
  • Hard drive capacity has increased, but access speed has not kept up. The solution is to split the data across multiple hard drives and read from them at the same time. But this raises some problems:

Hardware failure: solved by keeping redundant copies of the data (the approach RAID takes)

Combining data for analysis: an analysis needs to read and combine data from different hard drives, which MapReduce handles

Hadoop provides both:

1. Reliable shared storage (distributed storage)
2. An abstract analysis interface (distributed analysis)
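
To see why reading from several drives at once helps, here is a rough back-of-the-envelope sketch. The data size (1 TB), the per-drive transfer rate (100 MB/s), and the function name are illustrative assumptions, not figures from this tutorial.

```python
def read_time_seconds(data_mb: float, rate_mb_per_s: float, drives: int = 1) -> float:
    """Time to read data_mb when it is split evenly across `drives` disks read in parallel."""
    if drives < 1:
        raise ValueError("drives must be at least 1")
    # Each drive only has to deliver its own share of the data.
    return (data_mb / drives) / rate_mb_per_s


# Assumed numbers for illustration only.
DATA_MB = 1_000_000      # about 1 TB
RATE_MB_PER_S = 100      # per-drive sequential transfer rate

single = read_time_seconds(DATA_MB, RATE_MB_PER_S)
parallel = read_time_seconds(DATA_MB, RATE_MB_PER_S, drives=100)

print(f"one drive:  {single / 60:.1f} minutes")
print(f"100 drives: {parallel / 60:.1f} minutes")
```

The sketch ignores coordination overhead and drive failures, which are exactly the two problems listed above.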

Big data


Data that cannot be processed using one machine

At the heart of big data is the idea that the sample equals the population: all of the data is analyzed rather than a subset of it


  • Volume: In big data, a single file is typically at least tens or hundreds of GB
  • Velocity: Reflected in how rapidly data is generated and how frequently it changes
  • Variety: Refers to the diversity of data types and sources, with data structures further classified as structured, semi-structured, and unstructured
  • Variability: Along with velocity, data flows are also volatile. Unstable data flows show periodic peaks by day or season, as well as peaks triggered by specific events
  • Veracity: Also known as data assurance. Data collected through different methods and channels can vary greatly in quality. The error and credibility of analysis and output results depend to a large extent on the quality of the data collected
  • Complexity: Reflected in the management and operation of data. Extracting, transforming, loading, joining, and correlating data to capture the useful information it contains has become increasingly challenging

Key technologies

1. The data is distributed across multiple machines

Reliability: Each block of data is copied to multiple nodes

Performance: Multiple nodes process data at the same time
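
The reliability point above can be sketched with a toy placement routine. This is a simplified round-robin placement under assumed node names and an assumed replication factor of 3; it is not HDFS's actual placement policy, and `place_blocks` and `surviving_blocks` are hypothetical helpers.

```python
def place_blocks(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (a toy policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes starting at a per-block offset, so replicas never share a node.
        placement[block] = [nodes[(block + i) % len(nodes)] for i in range(replication)]
    return placement


def surviving_blocks(placement: dict[int, list[str]], failed: set[str]) -> set[int]:
    """Blocks that still have at least one replica on a live node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    }


nodes = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
placement = place_blocks(6, nodes, replication=3)

for block, replicas in placement.items():
    print(f"block {block}: {replicas}")

# With 3 replicas per block, losing any 2 nodes still leaves every block readable.
alive = surviving_blocks(placement, failed={"node1", "node2"})
print("all blocks survive:", alive == set(placement))
```

Because every block lives on several nodes, the same copies that protect against failure also let several nodes work on the data at once.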

2. Calculations go with the data

Network IO is much slower than local disk IO, so a big data system tries to assign each task to the machine closest to its data. When the program runs, the program and its dependencies are copied to the machine where the data is located.

Moving the code to the data avoids migrating large volumes of data across the network. As far as possible, all computation on a piece of data runs on the same machine.
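
A minimal sketch of this locality preference follows. The function `assign_task` and the node names are hypothetical, and a real scheduler weighs more factors (such as rack distance and queue capacity) than this toy version does.

```python
def assign_task(block_replicas: list[str], free_nodes: set[str]) -> tuple[str, bool]:
    """Pick a node for a task over one block; returns (node, runs_locally)."""
    if not free_nodes:
        raise RuntimeError("no free nodes available")
    # Prefer a node that already stores a replica of the block.
    for node in block_replicas:
        if node in free_nodes:
            return node, True
    # Otherwise fall back to any free node; the block must then cross the network.
    return sorted(free_nodes)[0], False


# The block is stored on node2 and node3; node3 is free, so the task runs there.
print(assign_task(["node2", "node3"], {"node1", "node3"}))

# No replica holder is free, so the task runs remotely on another node.
print(assign_task(["node2"], {"node1", "node4"}))
```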

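3. Sequential IO replaces random IO

Seek time is large compared with the transfer time of a small read, so reading data sequentially in large blocks avoids repeated seeks. Data is generally not modified after it is written.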
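
The effect of seeks can be estimated with simple arithmetic. The seek time (10 ms) and transfer rate (100 MB/s) below are assumed round numbers for illustration, and `read_time_ms` is a hypothetical helper that charges one seek per chunk read.

```python
import math


def read_time_ms(total_mb: float, chunk_mb: float, seek_ms: float, rate_mb_per_s: float) -> float:
    """Time to read total_mb in chunks of chunk_mb, paying one seek per chunk."""
    chunks = math.ceil(total_mb / chunk_mb)
    transfer_ms = total_mb / rate_mb_per_s * 1000
    return chunks * seek_ms + transfer_ms


# Assumed disk characteristics for illustration only.
SEEK_MS = 10
RATE_MB_PER_S = 100

large_chunks = read_time_ms(1000, 100, SEEK_MS, RATE_MB_PER_S)  # 10 seeks
small_chunks = read_time_ms(1000, 1, SEEK_MS, RATE_MB_PER_S)    # 1000 seeks

print(f"100 MB chunks: {large_chunks / 1000:.1f} s")
print(f"1 MB chunks:   {small_chunks / 1000:.1f} s")
```

Under these assumptions the transfer time is the same in both cases; only the number of seeks changes, which is why large sequential reads are preferred.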