Coding With Fun
Home Docker Django Node.js Articles Python pip guide FAQ Policy

Pandas Group Aggregation Do you understand?


May 30, 2021 Article blog


Table of contents


One: Groupby

  • Group the datasets, and then analyze each group statistically
  • SQL is able to filter data and aggregate it in groups
  • Pandas can use groupby for more complex grouping operations
  • Grouping process: split->apply->combine split: Group by application: Merge the calculation rules run by each group: Combine the calculation results for each group

 Pandas Group Aggregation Do you understand?1

Sample code:

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

Results:

      data1     data2 key1   key2
0 0.974685 -0.672494 a one
1 -0.214324 0.758372 b one
2 1.508838 0.392787 a two
3 0.522911 0.630814 b three
4 1.347359 -0.177858 a two
5 -0.264616 1.017155 b two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three

GroupBy objects: DataFrameGroupBy, SeriesGroupBy

1. Group operations

groupby() grouping, GroupBy objects do not perform actual operations, but the intermediate data that contains the grouping is grouped by column name: obj.groupby ('label')

Sample code:

print(type(df_obj.groupby('key1'))) # dataframe根据key1进行分组
print(type(df_obj['data1'].groupby(df_obj['key1']))) # dataframe的 data1 列根据 key1 进行分组

Results:

<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>

2. Grouping operations

GroupBy objects are grouped/multi-grouped, and non-numeric data such as mean() is not grouped

Sample code:

grouped1 = df_obj.groupby('key1') # 分组运算
print(grouped1.mean())

grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())

Results:

         data1     data2
key1
a 0.437389 -0.230101
b 0.014657 0.802114
key1
a 0.437389
b 0.014657
Name: data1, dtype: float64
size() returns the number of elements per grouping

Sample code:

print(grouped1.size()) # size
print(grouped2.size())

Results:

key1
a 5
b 3
dtype: int64
key1
a 5
b 3
dtype: int64

3. Group by custom key

obj.groupby (self_def_key) Custom keys can be lists or multi-tier lists obj.groupby ('label1', 'label2'))-> multi-tier dataframe

Sample code:

self_def_key = [0, 1, 2, 3, 3, 4, 5, 7] # 按自定义key分组,列表
print(df_obj.groupby(self_def_key).size())

print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size()) # 按自定义key分组,多层列表

grouped2 = df_obj.groupby(['key1', 'key2']) # 按多个列多层分组
print(grouped2.size())

grouped3 = df_obj.groupby(['key2', 'key1']) # 多层分组按key的顺序进行
print(grouped3.mean())

print(grouped3.mean().unstack()) # unstack可以将多层索引的结果转换成单层的dataframe

Results:

0    1
1 1
2 1
3 2
4 1
5 1
7 1
dtype: int64

key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64


key1 key2
a one 2
three 1
two 2
b one 1
three 1
two 1
dtype: int64


data1 data2
key2 key1
one a 0.174988 -0.110804
b -0.214324 0.758372
three a -1.019229 -1.143825
b 0.522911 0.630814
two a 1.428099 0.107465
b -0.264616 1.017155

data1 data2
key1 a b a b
key2
one 0.174988 -0.214324 -0.110804 0.758372
three -1.019229 0.522911 -1.143825 0.630814
two 1.428099 -0.264616 0.107465 1.017155

Second, the GroupBy object supports iterative operations

Each iteration returns a tuple (group_name, group_data) that can be used to group specific operations of the data

1. Single-tier grouping

Sample code:

for group_name, group_data in grouped1: # 单层分组,根据key1
print(group_name)
print(group_data)

Results:

a      data1     data2 key1   key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three

b
data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two

2. Multi-tiered grouping

Sample code:

for group_name, group_data in grouped2: # 多层分组,根据key1 和 key2
print(group_name)
print(group_data)

Results:

('a', 'one')
data1 data2 key1 key2
0 0.974685 -0.672494 a one
6 -0.624708 0.450885 a one

('a', 'three')
data1 data2 key1 key2
7 -1.019229 -1.143825 a three

('a', 'two')
data1 data2 key1 key2
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two

('b', 'one')
data1 data2 key1 key2
1 -0.214324 0.758372 b one

('b', 'three')
data1 data2 key1 key2
3 0.522911 0.630814 b three

('b', 'two')
data1 data2 key1 key2
5 -0.264616 1.017155 b two

GroupBy objects can be converted into lists or dictionaries

Sample code:

print(list(grouped1)) # GroupBy对象转换list

print(dict(list(grouped1))) # GroupBy对象转换dict

Results:

[('a',       data1     data2 key1   key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three),
('b', data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two)]

{'a': data1 data2 key1 key2
0 0.974685 -0.672494 a one
2 1.508838 0.392787 a two
4 1.347359 -0.177858 a two
6 -0.624708 0.450885 a one
7 -1.019229 -1.143825 a three,
'b': data1 data2 key1 key2
1 -0.214324 0.758372 b one
3 0.522911 0.630814 b three
5 -0.264616 1.017155 b two}

1. Group by column, by data type

Sample code:

print(df_obj.dtypes) # 按列分组

print(df_obj.groupby(df_obj.dtypes, axis=1).size()) # 按数据类型分组
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())

Results:

data1    float64
data2 float64
key1 object
key2 object
dtype: object

float64 2
object 2
dtype: int64

float64 object
0 0.302191 a one
1 0.544048 b one
2 1.901626 a two
3 1.153725 b three
4 1.169501 a two
5 0.752539 b two
6 -0.173823 a one
7 -2.163054 a three

2. Other grouping methods

Sample code:

df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['A', 'B', 'C', 'D', 'E'])
df_obj2.ix[1, 1:4] = np.NaN
print(df_obj2)

Results:

   a    b    c    d  e
A 7 2.0 4.0 5.0 8
B 4 NaN NaN NaN 1
C 3 2.0 5.0 4.0 6
D 3 1.0 9.0 7.0 3
E 6 1.0 6.0 8.0 1

3. Group by dictionary

Sample code:

mapping_dict = {'a':'Python', 'b':'Python', 'c':'Java', 'd':'C', 'e':'Java'} # 通过字典分组
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count()) # 非NaN的个数
print(df_obj2.groupby(mapping_dict, axis=1).sum())

Results:

C         1
Java 2
Python 2
dtype: int64

C Java Python
A 1 2 2
B 0 1 1
C 1 2 2
D 1 2 2
E 1 2 2

C Java Python
A 5.0 12.0 9.0
B NaN 1.0 4.0
C 4.0 11.0 5.0
D 7.0 12.0 4.0
E 8.0 7.0 7.0

4. By grouping functions, the arguments passed in by the function are row indexes or column indexes

Sample code:

 
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5,5)),
columns=['a', 'b', 'c', 'd', 'e'],
index=['AA', 'BBB', 'CC', 'D', 'EE']) # 通过函数分组

def group_key(idx): #df_obj3
"""
idx 为列索引或行索引
"""
#return idx
return len(idx)

print(df_obj3.groupby(group_key).size())

# 以上自定义函数等价于
#df_obj3.groupby(len).size()

Results:

1    1
2 3
3 1
dtype: int64

5. Group by index level

Sample code:

columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
['A', 'A', 'B', 'C', 'B']], names=['language', 'index']) # 通过索引级别分组
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)

print(df_obj4.groupby(level='language', axis=1).sum()) # 根据language进行分组

print(df_obj4.groupby(level='index', axis=1).sum()) # 根据index进行分组

Results:

language Python Java Python Java Python
index A A B C B
0 2 7 8 4 3
1 5 2 6 1 2
2 6 4 4 5 2
3 4 7 4 3 1
4 7 4 3 4 8

language Java Python
0 11 13
1 3 13
2 9 12
3 10 9
4 8 18

index A B C
0 9 11 4
1 7 8 1
2 10 6 5
3 11 5 3
4 11 11 4

Two: Aggregation

  • The process by which an array produces a scale, such as mean(), count(), and so on
  • It is often used to calculate data after grouping

Sample code:

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randint(1,10, 8),
'data2': np.random.randint(1,10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)

Results:

   data1  data2 key1   key2
0 3 7 a one
1 1 5 b one
2 7 4 a two
3 2 4 b three
4 6 4 a two
5 9 9 b two
6 3 5 a one
7 8 4 a three

1. Built-in aggregation functions

sum(), mean(), max(), min(), count(), size(), describe()

Sample code:

print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())

Results:

      data1  data2
key1
a 27 24
b 12 18

data1 data2 key2
key1
a 8 7 two
b 9 9 two

data1 data2 key2
key1
a 3 4 one
b 1 4 one

data1 data2
key1
a 5.4 4.8
b 4.0 6.0

key1
a 5
b 3
dtype: int64

data1 data2 key2
key1
a 5 5 5
b 3 3 3

data1 data2
key1
a count 5.000000 5.000000
mean 5.400000 4.800000
std 2.302173 1.303840
min 3.000000 4.000000
25% 3.000000 4.000000
50% 6.000000 4.000000
75% 7.000000 5.000000
max 8.000000 7.000000
b count 3.000000 3.000000
mean 4.000000 6.000000
std 4.358899 2.645751
min 1.000000 4.000000
25% 1.500000 4.500000
50% 2.000000 5.000000
75% 5.500000 7.000000
max 9.000000 9.000000

2. Customizable functions, passed into the agg method

The parameter of grouped.agg (func) func is the record corresponding to the groupby index

Sample code:

def peak_range(df): # 自定义聚合函数
"""
返回数值范围
"""
#print type(df) #参数为索引所对应的记录
return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df : df.max() - df.min()))

Results:

      data1  data2
key1
a 5 3
b 8 5

data1 data2
key1
a 2.528067 1.594711
b 0.787527 0.386341
In [25]:

3. Apply multiple aggregation functions

Apply multiple functions to aggregate operations at the same time, using a list of functions

Sample code:

'''应用多个聚合函数

同时应用多个聚合函数'''
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range])) # 默认列名为函数名

print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)])) # 通过元组提供新的列名

Results:

         data1                                data2                           
mean std count peak_range mean std count peak_range
key1
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341

data1 data2
mean std count range mean std count range
key1
a 0.437389 1.174151 5 2.528067 -0.230101 0.686488 5 1.594711
b 0.014657 0.440878 3 0.787527 0.802114 0.196850 3 0.386341

4. Use dict for aggregation functions that act differently on different columns

Sample code:

dict_mapping = {'data1':'mean',
'data2':'sum'} # 每列作用不同的聚合函数
print(df_obj.groupby('key1').agg(dict_mapping))

dict_mapping = {'data1':['mean','max'],
'data2':'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

Results:

         data1     data2
key1
a 0.437389 -1.150505
b 0.014657 2.406341

data1 data2
mean max sum
key1
a 0.437389 1.508838 -1.150505
b 0.014657 0.522911 2.406341

5. Common built-in aggregation functions

 Pandas Group Aggregation Do you understand?2

Sample code:

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randint(1, 10, 8),
'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_') # 按key1分组后,计算data1,data2的统计信息并附加到原始表格中,并添加表头前缀
print(k1_sum)

Results:

   data1  data2 key1   key2
0 5 1 a one
1 7 8 b one
2 1 9 a two
3 2 6 b three
4 9 8 a two
5 8 3 b two
6 3 5 a one
7 8 3 a three

sum_data1 sum_data2
key1
a 26 26
b 17 17
The aggregate operation changes the shape of the original data, how do you maintain the shape of the original data?

1. merge

Using the external connection of the merge is more complex

Sample code:

k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True) # 方法1,使用merge
print(k1_sum_merge)

Results:

   data1  data2 key1   key2  sum_data1  sum_data2
0 5 1 a one 26 26
2 1 9 a two 26 26
4 9 8 a two 26 26
6 3 5 a one 26 26
7 8 3 a three 26 26
1 7 8 b one 17 17
3 2 6 b three 17 17
5 8 3 b two 17 17

2. transform

Transform's calculations are consistent with the shape of the original data, such as: grouped.transform (np.sum)

Sample code:

k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_') # 方法2,使用transform
df_obj[k1_sum_tf.columns] = k1_sum_tf
print(df_obj)

Results:

   data1  data2 key1   key2 sum_data1 sum_data2           sum_key2
0 5 1 a one 26 26 onetwotwoonethree
1 7 8 b one 17 17 onethreetwo
2 1 9 a two 26 26 onetwotwoonethree
3 2 6 b three 17 17 onethreetwo
4 9 8 a two 26 26 onetwotwoonethree
5 8 3 b two 17 17 onethreetwo
6 3 5 a one 26 26 onetwotwoonethree
7 8 3 a three 26 26 onetwotwoonethree
Custom functions can also be passed in.

Sample code:

def diff_mean(s): # 自定义函数传入transform
"""
返回数据与均值的差值
"""
return s - s.mean()

print(df_obj.groupby('key1').transform(diff_mean))

Results:

      data1     data2 sum_data1 sum_data2
0 -0.200000 -4.200000 0 0
1 1.333333 2.333333 0 0
2 -4.200000 3.800000 0 0
3 -3.666667 0.333333 0 0
4 3.800000 2.800000 0 0
5 2.333333 -2.666667 0 0
6 -2.200000 -0.200000 0 0
7 2.800000 -2.200000 0 0

groupby.apply(func)

Func functions can also be called separately on each grouping, and the final result is assembled together by pd.concat (data merge)

Sample code:

import pandas as pd
import numpy as np

dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek',
'TotalHours', 'APM'])

def top_n(df, n=3, column='APM'):
"""
返回每个分组按 column 的 top n 数据
"""
return df.sort_values(by=column, ascending=False)[:n]

print(df_data.groupby('LeagueIndex').apply(top_n))

Results:

                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1 2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
2 3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
3 1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
4 2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
5 3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
6 734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
7 3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
8 3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518

1. Produce a hierarchical index: The outer index is a grouping name, and the inner index is the row index of the df_obj

Sample code:

print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age')) # apply函数接收的参数会传入自定义的函数中

Results:

                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1 3146 1 40.0 12.0 150.0 38.5590
3040 1 39.0 10.0 500.0 29.8764
2 920 2 43.0 10.0 730.0 86.0586
2437 2 41.0 4.0 200.0 54.2166
3 1258 3 41.0 14.0 800.0 77.6472
2972 3 40.0 10.0 500.0 60.5970
4 1696 4 44.0 6.0 500.0 89.5266
1729 4 39.0 8.0 500.0 86.7246
5 202 5 37.0 14.0 800.0 327.7218
2745 5 37.0 18.0 1000.0 123.4098
6 3069 6 31.0 8.0 800.0 133.1790
2706 6 31.0 8.0 700.0 66.9918
7 2813 7 26.0 36.0 1300.0 188.5512
1992 7 26.0 24.0 1000.0 219.6690
8 3340 8 NaN NaN NaN 189.7404
3341 8 NaN NaN NaN 287.8128

2. Prohibit hierarchical indexes, group_keys

Sample code:

print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))

Results:

      LeagueIndex   Age  HoursPerWeek  TotalHours       APM
2214 1 20.0 12.0 730.0 172.9530
2246 1 27.0 8.0 250.0 141.6282
1753 1 20.0 28.0 100.0 139.6362
3062 2 20.0 6.0 100.0 179.6250
3229 2 16.0 24.0 110.0 156.7380
1520 2 29.0 6.0 250.0 151.6470
1557 3 22.0 6.0 200.0 226.6554
484 3 19.0 42.0 450.0 220.0692
2883 3 16.0 8.0 800.0 208.9500
2688 4 26.0 24.0 990.0 249.0210
1759 4 16.0 6.0 75.0 229.9122
2637 4 23.0 24.0 650.0 227.2272
3277 5 18.0 16.0 950.0 372.6426
93 5 17.0 36.0 720.0 335.4990
202 5 37.0 14.0 800.0 327.7218
734 6 16.0 28.0 730.0 389.8314
2746 6 16.0 28.0 4000.0 350.4114
1810 6 21.0 14.0 730.0 323.2506
3127 7 23.0 42.0 2000.0 298.7952
104 7 21.0 24.0 1000.0 286.4538
1654 7 18.0 98.0 700.0 236.0316
3393 8 NaN NaN NaN 375.8664
3373 8 NaN NaN NaN 364.8504
3372 8 NaN NaN NaN 355.3518
apply can be used to process missing data fills within different groupings, filling the mean of that group.

Recommended lessons: Python Automation Office, Python Automation Management