These notes record the basic principles, processing workflows, and key concepts of Hadoop's components, including HDFS, YARN, and MapReduce.
People and machines are generating data faster than ever, and for many problems more data beats better algorithms, so a different way of processing data is needed.
Hard-drive capacity has grown enormously, but access speed has not kept pace. The solution is to split the data across multiple drives and read them in parallel.
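The gain from reading in parallel can be sketched with back-of-envelope arithmetic (the drive speed and data size below are illustrative assumptions, not measurements):

```python
# Back-of-envelope: reading 1 TB with one drive vs. 100 drives in parallel.
TB = 10**12
drive_speed = 100 * 10**6   # bytes/sec, a typical spinning disk (assumed)

one_drive = TB / drive_speed               # seconds with a single drive
hundred_drives = TB / (drive_speed * 100)  # data split across 100 drives

print(f"1 drive:    {one_drive / 3600:.1f} h")    # ~2.8 h
print(f"100 drives: {hundred_drives / 60:.1f} min")  # ~1.7 min
```

The same amount of data drops from hours to minutes simply by spreading it over more spindles, which is exactly the bet HDFS makes.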
But this raises two problems:
Hardware failure: solved by replicating the data (as RAID does)
Analysis needs to combine data read from different drives: solved by MapReduce
Hadoop offers both:
1. Reliable shared storage (distributed storage) 2. An abstract interface for analysis (distributed analysis)
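The MapReduce model mentioned above can be sketched in a few lines. This is a minimal simulation of the programming model, not the Hadoop API: map emits (key, value) pairs, the framework groups them by key (the shuffle), and reduce aggregates each group.

```python
from collections import defaultdict

def map_fn(line):             # map: one line of text -> (word, 1) pairs
    return [(w, 1) for w in line.split()]

def reduce_fn(word, counts):  # reduce: sum the counts for one word
    return word, sum(counts)

def run_mapreduce(lines):
    groups = defaultdict(list)            # shuffle: group pairs by key
    for line in lines:
        for k, v in map_fn(line):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_mapreduce(["hello hadoop", "hello hdfs"]))
# {'hadoop': 1, 'hdfs': 1, 'hello': 2}
```

In real Hadoop the map and reduce functions run on many machines, but the contract is the same: the programmer writes the two functions, and the framework handles distribution, grouping, and fault tolerance.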
Big data: data that cannot be processed on a single machine.
At the heart of big data is the idea that the sample is the population: instead of analyzing a sample, you analyze all the data.
Volume: in big data, a single file is typically tens or hundreds of GB or more.
Velocity: data is generated rapidly and changes frequently.
Variety: the diversity of data types and sources; by structure, data is further classified as structured, semi-structured, and unstructured.
Volatility: because of the velocity of data, data flows are also volatile. Unstable flows show periodic spikes triggered by specific events, times of day, and seasons.
Veracity: also known as data assurance. Data collected through different channels can vary greatly in quality. The accuracy and credibility of analysis results depend largely on the quality of the collected data and on how it is managed and operated.
How to extract, transform, load, connect, and correlate data in order to grasp the useful information it contains has become increasingly challenging.
1. Data is distributed across multiple machines
Reliability: each block of data is replicated to multiple nodes
Performance: multiple nodes process the data in parallel
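Block splitting and replication can be illustrated with a small sketch. This is not HDFS's real placement policy (which is rack-aware); the round-robin placement, block size, and node names here are assumptions for illustration only:

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Cut a file into fixed-size blocks and assign each block
    to `replication` distinct nodes (illustrative round-robin)."""
    n_blocks = -(-file_size // block_size)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
for block, replicas in place_blocks(300, 128, nodes).items():
    print(f"block {block}: {replicas}")
# block 0: ['node1', 'node2', 'node3']
# block 1: ['node2', 'node3', 'node4']
# block 2: ['node3', 'node4', 'node1']
```

With three replicas per block, any single node can fail and every block is still readable from the surviving nodes.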
2. Computation moves to the data
Because network IO is much slower than local disk IO, a big data system tries to schedule each task on the machine closest to the data it reads (at run time, the program and its dependencies are copied to the machine where the data resides).
Migrating code to the data avoids large-scale data movement; as far as possible, each piece of data is processed on the machine that stores it.
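The scheduling preference described above can be sketched as follows (a simplified illustration, not YARN's actual scheduler; node names are hypothetical):

```python
def schedule(block_replicas, free_nodes):
    """Prefer a free node that already holds a replica of the block;
    fall back to any free node only when no replica host is free."""
    for node in free_nodes:
        if node in block_replicas:
            return node, "data-local"   # no network transfer needed
    return free_nodes[0], "remote"      # data must move over the network

print(schedule({"node2", "node3"}, ["node1", "node3"]))  # ('node3', 'data-local')
print(schedule({"node2"}, ["node1", "node4"]))           # ('node1', 'remote')
```

Only when no replica host has a free slot does the task run remotely and pull the block over the network.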
3. Sequential IO replaces random IO
Seek time is expensive relative to transfer time, and data is generally not modified after it is written, so large sequential reads and writes are preferred.
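The cost difference is easy to estimate. With typical (assumed) figures of a 10 ms seek and 100 MB/s sustained transfer, reading 100 MB sequentially versus as many small random reads looks like this:

```python
# Why sequential IO wins on disk (assumed, typical figures).
seek_ms = 10          # one random seek, milliseconds (assumed)
transfer_mb_s = 100   # sustained sequential transfer rate (assumed)

data_mb = 100
sequential_s = data_mb / transfer_mb_s          # one initial seek is negligible

# Reading the same 100 MB as 4 KB random reads: one seek per read.
reads = data_mb * 1024 // 4                     # 25,600 reads
random_s = reads * seek_ms / 1000 + sequential_s

print(f"sequential: {sequential_s:.0f} s")  # 1 s
print(f"random:     {random_s:.0f} s")      # 257 s
```

Seeks dominate the random case by two orders of magnitude, which is why HDFS is built around large, write-once, sequentially-read blocks.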