
Python: a first look at pandas


May 30, 2021 Article blog




Preface

pandas is a NumPy-based library built for data analysis tasks. It incorporates a number of libraries and standard data models, and it provides functions and methods for manipulating large data sets quickly and efficiently. Recommended lessons: Pandas Chinese Tutorials, Python3 Advanced: Data Analysis and Visualization.

First, the pandas workflow

  1. Adding, deleting, and modifying table data;
  2. Processing multiple tables together;
  3. Data cleaning: handling missing values, duplicate values, and outliers, plus data standardization and data conversion;
  4. Reproducing Excel's special operations, such as generating pivot tables and cross tables;
  5. Carrying out the statistical analysis.

Second, creating pandas objects

1, import the pandas library

import pandas as pd

2, build a DataFrame from table-structured data

columns: column index; index: row index; values: element data

Mode 1:

df = pd.DataFrame(
    data=[['alex', 20, 'male', '0831'], ['tom', 30, 'female', '0830']],
    index=['a', 'b'],  # optional; defaults to 0, 1, ...; custom labels are also allowed
    columns=['name', 'age', 'sex', 'class'],
)  # build the DataFrame

print(df)  # print the data

   name  age     sex class
a  alex   20    male  0831
b   tom   30  female  0830

Mode 2:

df1 = pd.DataFrame(data={'name': ['tom', 'alex'], 'age': [18, 20], 'sex': ['male', 'female'], 'class': ['0831', '0831']})

print(df1)  # no index is specified, so the rows are numbered from 0 by default

   name  age     sex class
0   tom   18    male  0831
1  alex   20  female  0831

3, the properties of a DataFrame

Because pandas is built on NumPy, a DataFrame also has the properties of NumPy's ndarray.

  • df.shape: the structure (rows, columns)
  • df.ndim: the number of dimensions
  • df.size: the number of elements
  • df.dtypes: the data type of the elements in each column
  • df.columns: the column index
  • df.index: the row index
  • df.values: the element data
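
As a quick check of the properties listed above, the following sketch rebuilds the two-row table from the creation step and reads each attribute. The data is copied from the article; nothing else is assumed.

```python
import pandas as pd

# Same two-row table as in the creation step.
df = pd.DataFrame(
    data=[['alex', 20, 'male', '0831'], ['tom', 30, 'female', '0830']],
    index=['a', 'b'],
    columns=['name', 'age', 'sex', 'class'],
)

print(df.shape)          # (rows, columns)
print(df.ndim)           # a DataFrame has two dimensions
print(df.size)           # rows * columns
print(df.dtypes)         # one dtype per column
print(list(df.columns))  # column labels
print(list(df.index))    # row labels
print(type(df.values))   # the element data as a NumPy ndarray
```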

Third, looking up data in a df

1, select a single column

Selecting one column, df1['name'], returns a one-dimensional Series.

print(df1['name'])  # select one column

0     tom
1    alex

2, selecting multiple columns

print(df1[['name', 'age']])

   name  age
0   tom   18
1  alex   20

print(type(df1[['name', 'age']]))  # a list of column names returns a two-dimensional DataFrame; only a single name, as in df1['name'], returns a one-dimensional Series

<class 'pandas.core.frame.DataFrame'>

3, selecting rows and columns together

Method 1:

print(df[['name', 'age']][:2])  # takes the first two rows; this form accepts only a row slice, not a single row label

   name  age
a  alex   20
b   tom   30

Method 2:

Index with df.loc[row labels or condition, column labels]

print(df.loc['a', 'name'])

alex

df.loc['a', ['name']]     # <class 'pandas.core.series.Series'>  one argument is a single label and the other is a list, so the result is one-dimensional

df.loc[['a'], ['name']]   # <class 'pandas.core.frame.DataFrame'>  both arguments are lists, so the result is two-dimensional
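
The effect of single labels versus lists on the result type can be checked directly. This is a minimal sketch on the article's two-row table; the variable names `scalar`, `series`, and `frame` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['alex', 20, 'male', '0831'], ['tom', 30, 'female', '0830']],
    index=['a', 'b'],
    columns=['name', 'age', 'sex', 'class'],
)

scalar = df.loc['a', 'name']      # two single labels: one element
series = df.loc['a', ['name']]    # one label, one list: a Series
frame = df.loc[['a'], ['name']]   # two lists: a DataFrame

print(type(scalar).__name__)
print(type(series).__name__)
print(type(frame).__name__)
```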

4, conditional index: boolean mask

mask = df['age'] > 18  # True for every student older than 18, False otherwise

mask2 = df['sex'] == 'female'  # True for every female student

mask3 = mask & mask2  # combine the two masks; use the & operator, not the keyword and

print(mask3)

a    False
b     True
dtype: bool

print(df.loc[mask3, :])  # select rows using the mask

   name  age     sex class
b   tom   30  female  0830
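
To see why `&` is needed rather than `and`, the sketch below builds the same mask in one expression and then tries the `and` form, which pandas rejects because a whole Series has no single truth value. The flag `and_failed` is a hypothetical name used only for this demonstration.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['alex', 20, 'male', '0831'], ['tom', 30, 'female', '0830']],
    index=['a', 'b'],
    columns=['name', 'age', 'sex', 'class'],
)

# Parentheses are required because & binds tighter than > and ==.
mask = (df['age'] > 18) & (df['sex'] == 'female')
selected = df.loc[mask, :]
print(selected)

# The keyword `and` asks for the truth value of a whole Series,
# which pandas refuses with a ValueError.
and_failed = False
try:
    (df['age'] > 18) and (df['sex'] == 'female')
except ValueError:
    and_failed = True
print(and_failed)
```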

5, positional query: df.iloc[row positions, column positions]; slices include the start and exclude the end

print(df.iloc[:1, :])

   name  age     sex class
a  alex   20    male  0831
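
The half-open rule applies to iloc only. As a small comparison on the same table, a label slice with loc includes its end label, while a positional slice with iloc stops before its end position. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['alex', 20, 'male', '0831'], ['tom', 30, 'female', '0830']],
    index=['a', 'b'],
    columns=['name', 'age', 'sex', 'class'],
)

by_position = df.iloc[0:1, :]    # position 0 only; position 1 is excluded
by_label = df.loc['a':'b', :]    # labels 'a' through 'b', end label included
first_two_columns = df.iloc[:, :2]

print(len(by_position), len(by_label))
print(list(first_two_columns.columns))
```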

Fourth, adding data to a df

1, add a column by key assignment

# There are two ways: assign a list with one value per row, such as df['address'] = ['Beijing', 'Shanghai'], or assign a single value such as 'Beijing', which sets every row to Beijing

df['address'] = 'Beijing'

   name  age     sex class  address
a  alex   20    male  0831  Beijing
b   tom   30  female  0830  Beijing
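
Both assignment forms can be tried side by side. The sketch below also shows what happens when the list length does not match the number of rows; the names `broadcast`, `per_row`, and `length_error` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    data=[['alex', 20, 'male', '0831'], ['tom', 30, 'female', '0830']],
    index=['a', 'b'],
    columns=['name', 'age', 'sex', 'class'],
)

df['address'] = 'Beijing'                # a single value fills every row
broadcast = df['address'].tolist()

df['address'] = ['Beijing', 'Shanghai']  # a list supplies one value per row
per_row = df['address'].tolist()

# A list whose length differs from the row count is rejected.
length_error = False
try:
    df['address'] = ['Beijing']
except ValueError:
    length_error = True

print(broadcast, per_row, length_error)
```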

2, append rows

df_mini = pd.DataFrame(data={
    'name': ['jerry', 'make'],
    'age': [15, 18],
    'sex': ['male', 'female'],
    'class': ['0831', '0770'],
    'address': ['Beijing', 'Henan'],
}, index=['a', 'b'])

df4 = pd.concat([df, df_mini])  # written as df.append(df_mini) before pandas 2.0, which removed DataFrame.append

print(df4)

    name  age     sex class  address
a   alex   20    male  0831  Beijing
b    tom   30  female  0830  Beijing
a  jerry   15    male  0831  Beijing
b   make   18  female  0770    Henan
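
Appending keeps the original row labels, so the result above has two rows labeled a and two labeled b. A minimal sketch using `pd.concat`, with `ignore_index=True` as one way to renumber the rows (the name `df4_renumbered` is hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    data={'name': ['alex', 'tom'], 'age': [20, 30], 'sex': ['male', 'female'],
          'class': ['0831', '0830'], 'address': ['Beijing', 'Beijing']},
    index=['a', 'b'],
)
df_mini = pd.DataFrame(
    data={'name': ['jerry', 'make'], 'age': [15, 18], 'sex': ['male', 'female'],
          'class': ['0831', '0770'], 'address': ['Beijing', 'Henan']},
    index=['a', 'b'],
)

df4 = pd.concat([df, df_mini])                                # labels a, b, a, b
df4_renumbered = pd.concat([df, df_mini], ignore_index=True)  # labels 0, 1, 2, 3

print(list(df4.index))
print(list(df4_renumbered.index))
```

Duplicate labels are legal, but they mean that a label-based operation such as `df4.loc['a']` or `df4.drop(labels=['a'])` touches every row with that label.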

Fifth, deleting data

axis: whether to delete rows (axis=0) or columns (axis=1)

inplace: whether to modify the original table

a = df4.drop(labels=['address', 'class'], axis=1)  # deleting columns returns a new table, so assign the result to a variable

df4.drop(labels=['a'], axis=0, inplace=True)  # deletes the rows labeled 'a' from df4 itself
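
The difference between the two calls can be verified on a small copy of the table. This sketch assumes the four-row table from the append step, rebuilt here directly; `without_columns` is an illustrative name.

```python
import pandas as pd

df4 = pd.DataFrame(
    data={'name': ['alex', 'tom', 'jerry', 'make'], 'age': [20, 30, 15, 18],
          'sex': ['male', 'female', 'male', 'female'],
          'class': ['0831', '0830', '0831', '0770'],
          'address': ['Beijing', 'Beijing', 'Beijing', 'Henan']},
    index=['a', 'b', 'a', 'b'],
)

# Without inplace, drop returns a new table and leaves df4 untouched.
without_columns = df4.drop(labels=['address', 'class'], axis=1)
columns_after_copy_drop = list(df4.columns)

# With inplace=True, df4 itself loses every row labeled 'a'.
df4.drop(labels=['a'], axis=0, inplace=True)

print(list(without_columns.columns))
print(df4['name'].tolist())
```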

Sixth, modifying data

Select the target data with loc, then assign the new value.

df4.loc[df4['name'] == 'tom', 'class'] = 'has problems'

print(df4)

With df4 as the four-row table from the append step, this prints:

    name  age     sex         class  address
a   alex   20    male          0831  Beijing
b    tom   30  female  has problems  Beijing
a  jerry   15    male          0831  Beijing
b   make   18  female          0770    Henan
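
One detail worth checking when modifying through loc: in Python, a chained assignment of the form `c = target = value` binds `c` to the value itself, not to the modified table. The sketch below illustrates this on a rebuilt copy of the four-row table, using a default 0 to 3 index for simplicity.

```python
import pandas as pd

df4 = pd.DataFrame(
    data={'name': ['alex', 'tom', 'jerry', 'make'], 'age': [20, 30, 15, 18],
          'sex': ['male', 'female', 'male', 'female'],
          'class': ['0831', '0830', '0831', '0770'],
          'address': ['Beijing', 'Beijing', 'Beijing', 'Henan']},
)

# Chained assignment: c receives the string, and the table is also updated.
c = df4.loc[df4['name'] == 'tom', 'class'] = 'has problems'

print(c)                      # the assigned string, not a table
print(df4['class'].tolist())  # only tom's row changed
```

To inspect the result of the modification, print the table (`df4`) rather than the assigned value.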

Seventh, statistical analysis

1, ten statistical methods carried over from NumPy

min() argmin() max() argmax() std() var() sum() mean() cumsum() cumprod()

2, the method in pandas

df['age'].min() df['age'].max() df['age'].argsort()
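
A short sketch of these methods on the four ages used in this article. The Series `ages` and its name-based index are constructed here only for illustration. One difference from NumPy is worth knowing: pandas `var()` and `std()` default to the sample form (`ddof=1`), while `numpy.var` defaults to the population form (`ddof=0`).

```python
import numpy as np
import pandas as pd

# Ages of alex, tom, jerry, and make, in that order.
ages = pd.Series([20, 30, 15, 18], index=['alex', 'tom', 'jerry', 'make'])

print(ages.min(), ages.max(), ages.sum(), ages.mean())
print(ages.argmin(), ages.argmax())  # positions of the smallest and largest value
print(ages.argsort().tolist())       # positions that would sort the values
print(ages.cumsum().tolist())        # running total
print(ages.cumprod().tolist())       # running product

sample_var = ages.var()                   # pandas default: ddof=1
population_var = np.var(ages.to_numpy())  # NumPy default: ddof=0
print(sample_var, population_var)
```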

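3, mode, non-null count, frequency

Here df4 is the four-row table from the append step and df is the two-row table.

df4['age'].mode()  # the most frequent value(s); every age here is unique, so all four are returned

df4['age'].count()  # the number of non-null elements (4 here)

df4['name'].value_counts()  # how many times each value occurs

tom      1
make     1
alex     1
jerry    1
Name: name, dtype: int64

4, for the df type

df.min()  # the minimum of every column; strings are compared alphabetically

name          alex
age             20
sex         female
class         0830
address    Beijing
dtype: object

df.idxmax(axis=1, numeric_only=True)  # horizontal comparison: for each row, the column holding the largest numeric value

a    age
b    age
dtype: object

df.idxmax(axis=0, numeric_only=True)  # vertical comparison: for each numeric column, the row label holding the largest value

df4.mode()  # the mode of every column; columns with fewer modes are padded with NaN

    name  age     sex class  address
0   alex   15  female  0831  Beijing
1  jerry   18    male   NaN      NaN
2   make   20     NaN   NaN      NaN
3    tom   30     NaN   NaN      NaN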
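
The methods in this part can be cross-checked on a rebuilt copy of the four-row table. This is a sketch with a default 0 to 3 index so that `idxmax` returns an unambiguous label; the variable names are illustrative.

```python
import pandas as pd

df4 = pd.DataFrame(
    data={'name': ['alex', 'tom', 'jerry', 'make'], 'age': [20, 30, 15, 18],
          'sex': ['male', 'female', 'male', 'female'],
          'class': ['0831', '0830', '0831', '0770'],
          'address': ['Beijing', 'Beijing', 'Beijing', 'Henan']},
)

age_modes = df4['age'].mode().tolist()      # all values tie, so all are returned
class_modes = df4['class'].mode().tolist()  # '0831' occurs twice
non_null = df4['age'].count()               # number of non-null ages
sex_counts = df4['sex'].value_counts().to_dict()
oldest_row = df4['age'].idxmax()            # row label of the largest age

print(age_modes, class_modes, non_null, sex_counts, oldest_row)
```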

5, describe: summary statistics

df4['age'].describe()

#          age
# count   4.00   number of non-null values
# mean   20.75   average
# std     6.50   standard deviation
# min    15.00   minimum
# 25%    17.25   first quartile
# 50%    19.00   median
# 75%    22.50   third quartile
# max    30.00   maximum

df4['name'].describe()

# count: number of non-null values
# unique: number of distinct values after deduplication
# top: the most frequent value
# freq: how many times the most frequent value occurs
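
The result of `describe()` is itself a Series, so individual statistics can be read by label. A minimal sketch on the four ages and names used above (the variable names are made up for this example):

```python
import pandas as pd

df4 = pd.DataFrame(
    data={'name': ['alex', 'tom', 'jerry', 'make'], 'age': [20, 30, 15, 18]},
)

age_summary = df4['age'].describe()    # numeric column: count, mean, std, quartiles
name_summary = df4['name'].describe()  # text column: count, unique, top, freq

print(age_summary['mean'], age_summary['std'])
print(age_summary['25%'], age_summary['50%'], age_summary['75%'])
print(name_summary['unique'], name_summary['freq'])
```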

Eighth, reading Excel files

pandas can read many data formats; here is how to read Excel data.

pd.read_excel(r'file path')