May 30, 2021
Pandas is a NumPy-based tool created for data analysis. It brings together a number of libraries and standard data models, and it provides functions and methods for manipulating large data sets quickly and easily.
Recommended lessons: Pandas Chinese Tutorials, Python3 Advanced: Data Analysis and Visualization.
1, import the pandas library
import pandas as pd
columns: column index; index: row index; values: element data
Mode 1:
df = pd.DataFrame(
    data=[['alex', 20, 'male', '0831'], ['tom', 30, 'female', '0830']],
    index=['a', 'b'],  # optional; row labels default to 0, 1, ... when omitted
    columns=['name', 'age', 'sex', 'class'],
)  # build the DataFrame
print(df)  # print the data
   name  age     sex class
a  alex   20    male  0831
b   tom   30  female  0830
Mode 2:
df1 = pd.DataFrame(data={'name': ['tom', 'alex'], 'age': [18, 20], 'sex': ['male', 'female'], 'class': ['0831', '0831']})
print(df1)  # no index was given, so the row labels start from 0 by default
   name  age     sex class
0   tom   18    male  0831
1  alex   20  female  0831
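The two construction modes can be compared side by side. The following is a minimal sketch, not the only way to build a DataFrame; the variable names `rows` and `cols` are illustrative and do not come from the article.

```python
import pandas as pd

# Mode 1: a list of rows plus explicit column names.
rows = pd.DataFrame(
    data=[["tom", 18, "male", "0831"], ["alex", 20, "female", "0831"]],
    columns=["name", "age", "sex", "class"],
)

# Mode 2: a dict that maps each column name to its values.
cols = pd.DataFrame(
    data={
        "name": ["tom", "alex"],
        "age": [18, 20],
        "sex": ["male", "female"],
        "class": ["0831", "0831"],
    }
)

# Both frames hold the same data under the default 0, 1, ... row labels.
same = rows.equals(cols)
print(same)
```

Mode 1 reads naturally when the data arrives record by record, while Mode 2 is convenient when each column is already a separate list.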
Because pandas is built on NumPy, a DataFrame also has the attributes of a NumPy ndarray.
- df.shape: structure (rows, columns)
- df.ndim: number of dimensions
- df.size: number of elements
- df.dtypes: data type of each column
- df.columns: column index
- df.index: row index
- df.values: element data
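As a quick check of the attributes listed above, here is a small sketch that rebuilds the two-row example frame and prints each attribute. The data mirrors the Mode 1 example and is otherwise arbitrary.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["alex", 20, "male", "0831"], ["tom", 30, "female", "0830"]],
    index=["a", "b"],
    columns=["name", "age", "sex", "class"],
)

print(df.shape)          # (number of rows, number of columns)
print(df.ndim)           # number of axes
print(df.size)           # rows * columns
print(df.dtypes)         # one dtype per column
print(list(df.columns))  # column index
print(list(df.index))    # row index
print(df.values)         # the element data as a NumPy array
```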
1, index a single column
Selecting one column with df1['name'] returns a one-dimensional Series.
print(df1['name'])  # select one column
0     tom
1    alex
2, selecting multiple columns
print(df1[['name', 'age']])
   name  age
0   tom   18
1  alex   20
print(type(df1[['name', 'age']]))  # a list of column names returns a two-dimensional DataFrame; a Series is one-dimensional and has only one axis
<class 'pandas.core.frame.DataFrame'>
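The difference between single and double brackets is easy to verify. This sketch assumes the same `df1` data as the example above; `one` and `two` are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame(
    data={
        "name": ["tom", "alex"],
        "age": [18, 20],
        "sex": ["male", "female"],
        "class": ["0831", "0831"],
    }
)

one = df1["name"]           # a single label selects one column as a Series
two = df1[["name", "age"]]  # a list of labels selects a DataFrame

print(type(one).__name__, one.ndim)
print(type(two).__name__, two.ndim)
```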
3, slicing by index
Method 1:
print(df[['name', 'age']][:2])  # slices rows by position only; a specific row cannot be selected by label this way
   name  age
a  alex   20
b   tom   30
Method 2:
Slicing with loc: df.loc[row label or condition, column label]
print(df.loc['a', 'name'])
alex
df.loc['a', ['name']]  # <class 'pandas.core.series.Series'> if either the row or the column is a single string, the result is one-dimensional
df.loc[['a'], ['name']]  # <class 'pandas.core.frame.DataFrame'> if both arguments are lists, the result is two-dimensional
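The three return shapes of `loc` can be seen together in a short sketch. It assumes the two-row frame indexed by 'a' and 'b'; the names `scalar`, `series` and `frame` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["alex", 20, "male", "0831"], ["tom", 30, "female", "0830"]],
    index=["a", "b"],
    columns=["name", "age", "sex", "class"],
)

scalar = df.loc["a", "name"]      # two single labels: one element
series = df.loc["a", ["name"]]    # one single label, one list: a Series
frame = df.loc[["a"], ["name"]]   # two lists: a DataFrame

print(scalar)
print(type(series).__name__)
print(type(frame).__name__)
```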
4, conditional index: boolean slicing
mask = df['age'] > 18  # True for every student older than 18, False otherwise
mask2 = df['sex'] == 'female'  # True for every female student
mask3 = mask & mask2  # combine the two masks; the keyword `and` does not work here, use the & operator
print(mask3)
a    False
b     True
dtype: bool
print(df.loc[mask3, :])  # select rows using the mask
  name  age     sex class
b  tom   30  female  0830
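The following sketch replays the mask example end to end and also shows why `and` cannot replace `&`: pandas refuses to reduce a whole Series to a single True or False. The flag `and_failed` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["alex", 20, "male", "0831"], ["tom", 30, "female", "0830"]],
    index=["a", "b"],
    columns=["name", "age", "sex", "class"],
)

mask = df["age"] > 18          # element-wise comparison gives a boolean Series
mask2 = df["sex"] == "female"
mask3 = mask & mask2           # element-wise logical AND

result = df.loc[mask3, :]      # keep only the rows where the mask is True
print(result)

# `and` asks for the truth value of an entire Series, which is ambiguous.
try:
    mask and mask2
    and_failed = False
except ValueError:
    and_failed = True
print(and_failed)
```

The same pattern works with `|` for OR and `~` for NOT. Wrap each comparison in parentheses when combining them inline, because `&` binds more tightly than `>` or `==`.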
5, positional index: df.iloc[row position, column position]; the interval is closed on the left and open on the right
print(df.iloc[:1, :])
   name  age   sex class
a  alex   20  male  0831
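To make the half-open interval concrete, this sketch contrasts a positional slice with `iloc` against a label slice with `loc`. The label-slice comparison is an addition for illustration and is not part of the original walkthrough.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["alex", 20, "male", "0831"], ["tom", 30, "female", "0830"]],
    index=["a", "b"],
    columns=["name", "age", "sex", "class"],
)

by_position = df.iloc[0:1, :]   # position 0 is included, position 1 is excluded
by_label = df.loc["a":"b", :]   # label slices include both end labels

print(list(by_position.index))
print(list(by_label.index))
```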
1, add a column by key assignment
# Two ways: df['address'] = ['Beijing', 'Shanghai'] assigns one value per row, while assigning a single value such as 'Beijing' fills every row with it
df['address'] = 'Beijing'
   name  age     sex class  address
a  alex   20    male  0831  Beijing
b   tom   30  female  0830  Beijing
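Both assignment styles are shown in this small sketch. The second column name, `city`, is hypothetical and only there so the two styles can sit side by side.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["alex", 20, "male", "0831"], ["tom", 30, "female", "0830"]],
    index=["a", "b"],
    columns=["name", "age", "sex", "class"],
)

df["address"] = "Beijing"             # a single value is repeated for every row
df["city"] = ["Beijing", "Shanghai"]  # a list supplies one value per row

print(df)
```

A list must contain exactly as many values as the frame has rows.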
2, append rows
df_mini = pd.DataFrame(data={
    'name': ['jerry', 'make'],
    'age': [15, 18],
    'sex': ['male', 'female'],
    'class': ['0831', '0770'],
    'address': ['Beijing', 'Henan']
}, index=['a', 'b'])
df4 = pd.concat([df, df_mini])  # DataFrame.append was removed in pandas 2.0; on older versions df.append(df_mini) gives the same result
print(df4)
    name  age     sex class  address
a   alex   20    male  0831  Beijing
b    tom   30  female  0830  Beijing
a  jerry   15    male  0831  Beijing
b   make   18  female  0770    Henan
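Since `DataFrame.append` is no longer available in pandas 2.0 and later, a sketch using `pd.concat` may be useful here. It assumes the same four students as the example above; `ignore_index=True` is an optional extra that renumbers the rows.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "name": ["alex", "tom"],
        "age": [20, 30],
        "sex": ["male", "female"],
        "class": ["0831", "0830"],
        "address": ["Beijing", "Beijing"],
    },
    index=["a", "b"],
)
df_mini = pd.DataFrame(
    data={
        "name": ["jerry", "make"],
        "age": [15, 18],
        "sex": ["male", "female"],
        "class": ["0831", "0770"],
        "address": ["Beijing", "Henan"],
    },
    index=["a", "b"],
)

df4 = pd.concat([df, df_mini])                          # keeps the original row labels
renumbered = pd.concat([df, df_mini], ignore_index=True)  # fresh 0..n-1 labels

print(df4)
print(list(renumbered.index))
```

Keeping the original labels means 'a' and 'b' each appear twice in `df4`, which matters when rows are later selected or dropped by label.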
axis: whether to delete rows (axis=0) or columns (axis=1)
inplace: whether to modify the original table
a = df4.drop(labels=['address', 'class'], axis=1)  # deleting columns returns a new table, so assign the result to a variable
df4.drop(labels=['a'], axis=0, inplace=True)  # delete rows by label, modifying df4 in place
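The effect of `axis` and `inplace` can be checked with a short sketch. It assumes a four-row frame with the repeated labels 'a' and 'b'; `trimmed` is an illustrative name.

```python
import pandas as pd

df4 = pd.DataFrame(
    data={
        "name": ["alex", "tom", "jerry", "make"],
        "age": [20, 30, 15, 18],
        "sex": ["male", "female", "male", "female"],
        "class": ["0831", "0830", "0831", "0770"],
        "address": ["Beijing", "Beijing", "Beijing", "Henan"],
    },
    index=["a", "b", "a", "b"],
)

# axis=1 drops columns; without inplace the original frame is untouched.
trimmed = df4.drop(labels=["address", "class"], axis=1)
print(list(trimmed.columns))
print(list(df4.columns))

# axis=0 drops rows; inplace=True changes df4 itself and returns None.
df4.drop(labels=["a"], axis=0, inplace=True)
print(df4)
```

Because the label 'a' occurs twice, the in-place call removes both of those rows.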
Select the target data with loc, then assign the new value
df4.loc[df4['name'] == 'tom', 'class'] = 'has problems'
print(df4)
    name  age     sex         class  address
a   alex   20    male          0831  Beijing
b    tom   30  female  has problems  Beijing
a  jerry   15    male          0831  Beijing
b   make   18  female          0770    Henan
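Here is a minimal sketch of a conditional assignment with `loc`. The frame is a cut-down version of the four-student example with default row labels.

```python
import pandas as pd

df4 = pd.DataFrame(
    data={
        "name": ["alex", "tom", "jerry", "make"],
        "age": [20, 30, 15, 18],
        "class": ["0831", "0830", "0831", "0770"],
    }
)

# The row condition and the column label go into a single loc call,
# so the assignment writes into df4 itself rather than into a copy.
df4.loc[df4["name"] == "tom", "class"] = "has problems"

print(df4)
```

The assignment is a statement and does not return the modified frame, so print `df4` afterwards to inspect the result.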
1, 10 statistical methods carried over from NumPy
min() argmin() max() argmax() std() var() sum() mean() cumsum() cumprod()
2, methods in pandas
df['age'].min() df['age'].max() df['age'].argsort()
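A few of these methods applied to a single column are sketched below. The ages are the four values used elsewhere in the article; labelling them by student name is an assumption made for readability.

```python
import pandas as pd

ages = pd.Series([20, 30, 15, 18], index=["alex", "tom", "jerry", "make"])

smallest = ages.min()      # smallest value
where_min = ages.argmin()  # integer position of the smallest value
total = ages.sum()
average = ages.mean()
running = ages.cumsum()    # running total, same length as the input
order = ages.argsort()     # positions that would sort the values ascending

print(smallest, where_min, total, average)
print(running.tolist())
print(order.tolist())
```

`std()` and `var()` in pandas use the sample formula (dividing by n - 1) by default, whereas NumPy's functions divide by n unless `ddof=1` is passed.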
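3, mode, non-null count, frequency
df['age'].mode()  # most frequent value(s) of one column, returned as a Series because there can be ties
df.mode()  # most frequent value(s) of every column; columns with fewer modes are padded with NaN
    name  age     sex class  address
0   alex   15  female  0831  Beijing
1  jerry   18    male   NaN      NaN
2   make   20     NaN   NaN      NaN
3    tom   30     NaN   NaN      NaN
df['age'].count()  # number of non-null elements
df['name'].value_counts()  # how many times each value occurs
tom      1
make     1
alex     1
jerry    1
Name: name, dtype: int64
4, for the DataFrame type
df.idxmax(axis=1)  # horizontal comparison: for each row, the column label of the largest value; the columns must hold comparable (for example numeric) values
df.idxmax(axis=0)  # vertical comparison: for each column, the row label of the largest value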
5, describe
df['age'].describe()
#        age
# count   4.00   number of non-null values
# mean   20.75   average
# std     6.50   standard deviation
# min    15.00   minimum
# 25%    17.25   first quartile
# 50%    19.00   median
# 75%    22.50   third quartile
# max    30.00   maximum
df['name'].describe()
# count: number of non-null values
# unique: number of distinct values after removing duplicates
# top: the most frequent value (the mode)
# freq: how many times the most frequent value occurs
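The summary figures above can be reproduced with a short sketch. It assumes the four ages 20, 30, 15 and 18 and four distinct names; `summary` and `name_summary` are illustrative variable names.

```python
import pandas as pd

ages = pd.Series([20, 30, 15, 18], name="age")
names = pd.Series(["alex", "tom", "jerry", "make"], name="name")

summary = ages.describe()        # numeric data: count, mean, std, min, quartiles, max
name_summary = names.describe()  # text data: count, unique, top, freq

print(summary)
print(name_summary)
```

For numeric data `describe()` reports the sample standard deviation and linearly interpolated quartiles, which is why the 25% value can fall between two observed ages.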
Pandas can read a variety of data formats; here is how to read Excel data:
pd.read_excel(r'file path')