Jun 01, 2021 Article blog
At work, efficiency is a key factor: how efficiently you do things determines how much time you spend.
So when a project involves some basic coding, using the pandas library can save you a lot of time and increase your productivity.
Pandas is an open source package. It helps with data analysis and data manipulation in the Python language.
In addition, it provides us with flexible data structures.
Here are a few practical tips for pandas.
First, data exploration is a necessary step, and pandas provides a quick and easy way to perform a variety of analyses.
One of the most important techniques is selecting rows, or filtering data, based on criteria.
Rows can be selected on a single condition or on several conditions combined in one statement with logical operators.
For example, I use a dataset about loan prediction.
We'll select the applicants who are not graduates and whose income is 5,400 pounds or less. Let's see how.
import pandas as pd
data = pd.read_csv('../Data/loan_train.csv')
data.head()
data2 = data.loc[(data['Education'] == 'Not Graduate') & (data['ApplicantIncome'] <= 5400)]
data2
Note: Remember to put each condition in parentheses, because & binds more tightly than comparison operators such as == and <=.
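To try the filtering pattern without the loan file, here is a minimal sketch on a small in-memory frame. The rows are made up for illustration; only the column names follow the loan dataset above. It also shows query() as an alternative way to write the same filter.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the loan dataset.
loans = pd.DataFrame({
    'Education': ['Graduate', 'Not Graduate', 'Not Graduate', 'Graduate'],
    'ApplicantIncome': [5849, 4583, 6000, 2583],
})

# Each condition is wrapped in parentheses and combined with & (and).
mask = (loans['Education'] == 'Not Graduate') & (loans['ApplicantIncome'] <= 5400)
selected = loans.loc[mask]

# The same filter written with query(), which takes the conditions as a string.
selected_q = loans.query("Education == 'Not Graduate' and ApplicantIncome <= 5400")

print(selected)
```

Without the parentheses, Python would try to evaluate 'Not Graduate' & loans['ApplicantIncome'] first, which raises an error instead of filtering.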
Data can be of two types, continuous and discrete, and which one we need depends on our analytical requirements. Sometimes we don't need the exact value of a continuous variable, only the group to which it belongs.
For example, suppose your data has a continuous variable, age, but you need age groups to analyze, such as children, adolescents, adults, and the elderly.
Binning is well suited to this problem.
To perform binning, we use the cut() function.
It converts a continuous variable into a discrete one.
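import pandas as pd
from sklearn.utils import shuffle

df = pd.read_csv('titanic.csv')
# Shuffle the rows (fixed seed for reproducibility)
df = shuffle(df, random_state=42)
df.head()
# Bin edges and one label per interval: (0, 4], (4, 17], (17, 65], (65, 99]
bins = [0, 4, 17, 65, 99]
labels = ['Toddler', 'Child', 'Adult', 'Elderly']
category = pd.cut(df['Age'], bins=bins, labels=labels)
# Insert the new column at position 2
df.insert(2, 'Age Group', category)
df.head()
# Number of passengers in each age group
df['Age Group'].value_counts()
# Missing values per column
df.isnull().sum()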
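How cut() treats values that sit exactly on a bin edge, fall outside every bin, or are missing is easy to overlook. The sketch below uses a handful of made-up ages (not taken from the Titanic file) with the same bins and labels as above.

```python
import pandas as pd

# Made-up ages, including a missing value and one above the last edge.
ages = pd.Series([0.5, 4, 5, 17, 18, 65, 66, 120, None])

bins = [0, 4, 17, 65, 99]
labels = ['Toddler', 'Child', 'Adult', 'Elderly']

# By default each bin includes its right edge: (0, 4], (4, 17], (17, 65], (65, 99].
groups = pd.cut(ages, bins=bins, labels=labels)

# Show each age next to the group it was assigned to.
print(pd.DataFrame({'age': ages, 'group': groups}))
```

An age of exactly 4 lands in 'Toddler' and 17 in 'Child', while 120 and the missing value get no group at all (NaN). Passing right=False to cut() would make the left edge inclusive instead.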
Grouping data is a routine task for data scientists and analysts.
Pandas provides a basic function to perform data grouping: groupby.
A groupby operation involves splitting the object based on specific criteria, applying a function, and then combining the results.
Let's look at the loan prediction dataset again, assuming I want to look at the average loan amount for people from different property areas, such as rural, semi-urban, and urban areas. Take a moment to understand the problem statement and think about how to solve it.
The pandas groupby can solve this problem very effectively. Start by splitting the data by property area. Second, apply the mean() function to each group.
Finally, combine the results and print them as a new data frame.
# Import the dataset
import pandas as pd
df = pd.read_csv('../Data/loan_train.csv')
df.head()
# Average income by gender
df.groupby(['Gender'])[['ApplicantIncome']].mean()
# Average loan amount by property area, such as urban and rural
df.groupby(['Property_Area'])[['LoanAmount']].mean()
# Compare loan status counts across education levels
df.groupby(['Education'])[['Loan_Status']].count()
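To make the split, apply, and combine steps concrete, the sketch below performs them by hand and compares the result with a single groupby call. The rows are invented for the example; only the column names match the loan dataset.

```python
import pandas as pd

# Hypothetical loan rows; the amounts are invented for the example.
df = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Urban', 'Semiurban', 'Rural'],
    'LoanAmount': [100.0, 150.0, 200.0, 120.0, 50.0],
})

# Split: one sub-frame per area. Apply: take the mean of each.
# Combine: collect the per-area means into a single Series.
manual = pd.Series({
    area: part['LoanAmount'].mean()
    for area, part in df.groupby('Property_Area')
})

# groupby performs the same three steps in one call.
grouped = df.groupby('Property_Area')['LoanAmount'].mean()

print(grouped)
```

Both approaches give the same per-area averages; the one-line form is simply shorter and usually faster.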
map is another important operation that provides a high degree of flexibility and practical application.
The pandas map() method maps each value in a Series to another value based on an input correspondence.
This input can be a Series, a dictionary, or even a function.
Let's look at an interesting example. We have a hypothetical employee dataset with the following columns: name, age, profession, city.
Now you need to add another column giving the state each city belongs to. What would you do? If the dataset has only 10 rows, you can do it manually, but what if there are thousands of rows?
Using pandas map is far more practical.
import pandas as pd

# Sample data
data = {'name': ['A', 'B', 'C', 'D', 'E'],
        'age': [22, 26, 33, 44, 50],
        'profession': ['data engineer', 'data scientist', 'entrepreneur', 'business analyst', 'self-employed'],
        'city': ['Gurgaon', 'Bangalore', 'Gurgaon', 'Pune', 'New Delhi']}
df = pd.DataFrame(data)
df
# Cities and their states
map_city_to_states = {'Gurgaon': 'Haryana',
                      'Bangalore': 'Karnataka',
                      'Pune': 'Maharashtra',
                      'New Delhi': 'Delhi'}
# Map the city column to states
df['state'] = df['city'].map(map_city_to_states)
df
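The text mentions that map() also accepts a function, and it helps to know what happens when a value has no entry in the dictionary. The sketch below illustrates both with a short made-up Series; 'Mumbai' is deliberately left unmapped.

```python
import pandas as pd

cities = pd.Series(['Gurgaon', 'Pune', 'Mumbai'])

# 'Mumbai' is deliberately left out of the dictionary.
city_to_state = {'Gurgaon': 'Haryana', 'Pune': 'Maharashtra'}

# Dictionary input: values without a matching key become NaN.
states = cities.map(city_to_state)

# Function input: the function is called once per value.
name_lengths = cities.map(len)

# fillna() supplies a fallback for the unmapped values.
states_filled = states.fillna('Unknown')

print(states_filled)
```

With thousands of rows, checking states.isna().sum() after the mapping is a quick way to spot cities missing from the dictionary.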
This is one of my favorite pandas tips.
It lets me visually locate data that meets specific conditions.
You can use the pandas style property to apply conditional formatting to data frames.
Conditional formatting applies a visual style to a data frame based on a condition.
While pandas offers many styling options, I'll show you a simple one here. For example, we have sales data for each salesperson.
What I want to see is the sales values above 80.
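import pandas as pd
data = pd.read_excel("../Data/salesman_performance.xlsx")
data
data.style
# Return green text for sales above 80 and black text otherwise
def highlight_green(sales):
    color = 'green' if sales > 80 else 'black'
    return 'color: %s' % color
# Style columns 1 to 5 cell by cell (Styler.applymap is named Styler.map from pandas 2.1 on)
formatting = data.iloc[:,1:6].style.applymap(highlight_green)
formatting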
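Since the spreadsheet is not included here, the following is a rough sketch of the same idea on invented figures. The column names, the threshold parameter, and the helper names highlight_above and style_sales are all assumptions for illustration. The fallback between Styler.map and Styler.applymap is there because the method was renamed in pandas 2.1.

```python
import pandas as pd

# Invented sales figures; the real spreadsheet is not reproduced here.
sales = pd.DataFrame({
    'salesman': ['A', 'B', 'C'],
    'Q1': [75, 92, 80],
    'Q2': [88, 61, 95],
})

def highlight_above(value, threshold=80):
    """Return a CSS rule: green text above the threshold, black otherwise."""
    color = 'green' if value > threshold else 'black'
    return f'color: {color}'

def style_sales(frame):
    """Apply highlight_above to the Q1 and Q2 cells and return the Styler."""
    styler = frame.style
    # Styler.map replaces Styler.applymap in pandas 2.1+; fall back on older versions.
    elementwise = getattr(styler, 'map', None) or styler.applymap
    return elementwise(highlight_above, subset=['Q1', 'Q2'])
```

In a notebook, ending a cell with style_sales(sales) renders the table with the qualifying values in green. The subset argument keeps the text column out of the numeric comparison.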
Those are five practical pandas tips that I hope will help you get the job done better and faster.