Last modified on 01 Oct 2021.

In this note, I use df as DataFrame, s as Series.

Libraries

import pandas as pd # import pandas package
import numpy as np

Dataframe

dataquest_aio = 'https://raw.githubusercontent.com/dinhanhthi/dataquest-aio/master/step-2-data-analysis-and-visualization/'
dataset_url = dataquest_aio + 'course-4-data-cleaning-and-analysis/data/World_Happiness_2015.csv'
df = pd.read_csv(dataset_url) # read the data set
df.head()
Country Region Happiness Rank Happiness Score Standard Error
0 Switzerland Western Europe 1 7.587 0.03411
1 Iceland Western Europe 2 7.561 0.04884
2 Denmark Western Europe 3 7.527 0.03328
3 Norway Western Europe 4 7.522 0.03880
4 Canada North America 5 7.427 0.03553

Group dataset using groupby()

Group df by column Region and then selct the column Western Europe,

df.groupby('Region').get_group('Western Europe') # returns a df
Country Region Happiness Rank Happiness Score Standard Error
0 Switzerland Western Europe 1 7.587 0.03411
1 Iceland Western Europe 2 7.561 0.04884
2 Denmark Western Europe 3 7.527 0.03328
3 Norway Western Europe 4 7.522 0.03880
5 Finland Western Europe 6 7.406 0.03140

Select just the Happiness Score column and then find the mean,

df.groupby('Region')['Happiness Score'].mean()
# other methods: size, max, min, count
Region
Australia and New Zealand          7.285000
Central and Eastern Europe         5.332931
Eastern Asia                       5.626167
Latin America and Caribbean        6.144682
Middle East and Northern Africa    5.406900
North America                      7.273000
Southeastern Asia                  5.317444
Southern Asia                      4.580857
Sub-Saharan Africa                 4.202800
Western Europe                     6.689619
Name: Happiness Score, dtype: float64

Apply multiple/custom functions,

def max_min(group):
  return group.max() - group.min()

df.groupby(['Country', 'Region']).agg([np.mean, np.max, max_min]).head()
Happiness Rank Happiness Score Standard Error
mean amax max_min mean amax max_min mean amax max_min
Country Region
Afghanistan Southern Asia 153 153 0 3.575 3.575 0.0 0.03084 0.03084 0.0
Albania Central Europe 95 95 0 4.959 4.959 0.0 0.05013 0.05013 0.0
Algeria Middle Africa 68 68 0 5.605 5.605 0.0 0.05099 0.05099 0.0

If you wanna apply different functions on different columns

df.groupby(['Country', 'Region']).agg({
    'Happiness Rank': max_min,
    'Happiness Score': ['min', 'max'],
    'Standard Error': 'count'
}).head(3)
Happiness Rank Happiness Score Standard Error
max_min min max count
Country Region
Afghanistan Southern Asia 0 3.575 3.575 1
Albania Central Europe 0 4.959 4.959 1
Algeria Middle Africa 0 5.605 5.605 1

Or using apply and lambda function,

orders.groupby('shoes').price.apply(lambda x: np.min(x, 25)).reset_index()

Group using pivot_table()

An example of pivotting by a single column An example of pivotting by a single column (ref)

Group by Region (as an index) and choosing GDP and City columns,[ref]

df.pivot_table(values=['GDP', 'City'], index='Region') # returns df
Happiness Rank Standard Error
Region
Australia and New Zealand 9.5 0.037270
Central and Eastern Europe 79.0 0.045208
Eastern Asia 64.5 0.037225

Apply some functions,

df.pivot_table(['GDP', 'City'], 'Region', aggfunc=[np.mean, np.max], margins=True)
# margins shows the "All" row
mean amax
Happiness Rank Standard Error Happiness Rank Standard Error
Region
Australia and New Zealand 9.5 0.037270 10 0.04083
Central and Eastern Europe 79.0 0.045208 134 0.06913
Eastern Asia 64.5 0.037225 100 0.05051

Reorganizing df using pivot()

An example of multi-column pivoting An example of multi-column pivoting (ref)

Make values in one columns be columns in a new “pivot” table,[ref]

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
                           'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6],
                   'zoo': ['x', 'y', 'z', 'q', 'w', 't']})

pivot_1 = df.pivot(index='foo', columns='bar', values='baz')
pivot_2 = df.pivot(index='foo', columns='bar')['baz']
pivot_3 = df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])

display_side_by_side(df, pivot_1, pivot_2, pivot_3)
foo bar baz zoo
0 one A 1 x
1 one B 2 y
2 one C 3 z
3 two A 4 q
4 two B 5 w
5 two C 6 t
bar A B C
foo
one 1 2 3
two 4 5 6
bar A B C
foo
one 1 2 3
two 4 5 6
baz zoo
bar A B C A B C
foo
one 1 2 3 x y z
two 4 5 6 q w t

For one who wanna know display_side_by_side: ref this note.

Change shape of df with melt()

Contrary to pivot, we now want to transform several columns into values of a single column,[ref]

df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
                   'B': {0: 1, 1: 3, 2: 5},
                   'C': {0: 2, 1: 4, 2: 6}})

df1 = pd.melt(df, id_vars=['A'], value_vars=['B'])
df2 = pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])

display_side_by_side(df, df1, df2)
A B C
0 a 1 2
1 b 3 4
2 c 5 6
A variable value
0 a B 1
1 b B 3
2 c B 5
A variable value
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6

References