Pandas extra

Last modified on 01 Oct 2021.

In this note, a general dataframe is called df (type pandas.core.frame.DataFrame), a general series is call s (type pandas.core.series.Series).

Import library

import pandas as pd
import numpy as np # import numpy if necessary

Read/Write `.csv` file

# READ
df = pd.read_csv('filename.csv', sep=';') # default sep=','

# if 1st col contains 0,1,2,...
df = pd.read_csv('filename.csv', index_col=1)

# with datetime info
df = pd.read_csv(PATH_DATA_FOLDER+"raw_data.csv",
                  parse_dates=['timestamp'],
                  infer_datetime_format=True,
                  cache_dates=True)

# WRITE
df.to_csv(path, index=False) # don't incldue index

Create a dataframe

# FROM A LIST
pd.DataFrame(a_list, colummns=['col_name'])

# FROM A DICTIONARY
names = ['John', 'Thi', 'Bi', 'Beo', 'Chang']
ages =  [10, 20, 21, 18, 11]
marks = [8, 9, 10, 6, 8]
city = ['Ben Tre', 'Paris', 'Ho Chi Minh Ville', 'New York', 'DC']

my_dict = {'Name':names, 'Ages':ages, 'Marks':marks, 'Place': city}
students = pd.DataFrame(my_dict)

	Name	Ages	Marks	Place
0	John	10	8	Ben Tre
1	Thi	20	9	Paris
2	Bi	21	10	Ho Chi Minh Ville
3	Beo	18	6	New York
4	Chang	11	8	DC

Adding

# a column
df['new_col] = [new_values]

# a row
df.loc['new_index'] = [new_value]

# add a new col based on another's values
df_im = df0.copy()[['col']]
df_im['status'] = df0['col'].apply(lambda row: 1 if row>=80 else 0)

Shuffle rows

# shuffle all rows and reset the index
df_new = df.sample(frac=1).reset_index(drop=True)

Sorting

df.sort_values(by='col1', ascending=False)

Select rows/columns/item(s)

👉 Indexing and selecting data — pandas 1.1.2 documentation

Select Single value

Select a single value (with condition): Get the mark of Thi (9).

# interchange `.values[0]` and `.iloc[0]`
df[df.Name=='Thi'].Marks.values[0]
df.loc[df.Name=='Thi', 'Marks'].values[0]

# with indexes
df.iloc[1,2] # row 2, column 3

# column's name with row's index
df[['Marks']].iloc[1].values[0] # column 'Marks', row 2

# column's index with row's value
df[df.Name=='Thi'].iloc[:,2].values[0] # column 3, row of 'Thi'

Select integer rows and named columns

df.loc[1:5, 'col']

Select columns

Select a column (returns a Series)

# with column's name
df['Name']
df.loc[:, 'Name']

# with an index
df.iloc[:,0]

Returns a pd.DataFrame,

df[['Name']]
df.loc[:, ['Name']]

# with an index
df.iloc[:,[0]]

Select multi-columns (type DataFrame): Get columns Name & Place:

# using columns's names
df[['Name', 'Place']]
df.loc[:, ['Name', 'Place']]

# using indexes
df.iloc[:, [0,-1]]

Select rows

Select a row (returns a Series)

# with an index
df.iloc[1]

# with a condition
df[df['Name']=='Thi'] # DataFrame
df[df['Name']=='Thi'].iloc[0] # Series

df[df.Name=='Thi'] # DataFrame
df[df.Name=='Thi'].iloc[0] # Series
df[df.Name=='Thi'].values[0] # ndarray

Select multi-rows (type DataFrame)

# using indexes
df.iloc[:3]
df.loc[:2]

# with conditions
df[df['A'].isin([3, 6])]

MultiIndex

👉 MultiIndex / advanced indexing — pandas 1.1.2 documentation

All multiindex

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'], ['one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.random.randn(3, 6), index=['A', 'B', 'C'], columns=index)

	bar		baz		foo
	one	two	one	two	one	two
A	-0.752333	0.490581	0.774629	0.487185	1.767773	0.028956
B	-0.057864	-0.221516	-0.568726	-0.563732	1.362453	-0.563213
C	-0.338319	-0.346590	0.012845	0.755455	1.260937	-0.038209

Selection,

df.loc['A', ('baz', 'two')]

0.487185

df.loc[:,('baz', 'two')]

A    0.487185
B   -0.563732
C    0.755455
Name: (baz, two), dtype: float64

With a single name column

If there are some column with single name,

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'], [i for i in range(2)]*3]
index = pd.MultiIndex.from_arrays(arrays)
df1 = pd.DataFrame(np.random.randn(3, 6), index=['A', 'B', 'C'], columns=index)

Good practice

# GOOD PRACTICE
df1['time'] = [1,2,3]
df_rs2 = df1

	bar		baz		foo		time
	0	1	0	1	0	1
A	-1.386119	-0.496755	1.482855	0.943795	-1.173290	-0.445365	1
B	-0.900710	-1.571009	1.086964	1.546927	-1.564426	0.622763	2
C	0.712231	0.235247	-0.807031	0.671802	0.597149	0.111332	3

Selection,

# FOR GOOD PRACTICE
df_rs2.loc['A', ('baz', 1)]
df_rs2.loc['A', 'baz']

943795
  1.482855
  0.943795

Bad practice

# BAD PRACTICE
df2 = pd.DataFrame([1,2,3], index=['A', 'B', 'C'], columns=['time'])
df_rs1 = pd.concat([df1, df2], axis=1)

	(bar, 0)	(bar, 1)	(baz, 0)	(baz, 1)	(foo, 0)	(foo, 1)	time
A	-1.386119	-0.496755	1.482855	0.943795	-1.173290	-0.445365	1
B	-0.900710	-1.571009	1.086964	1.546927	-1.564426	0.622763	2
C	0.712231	0.235247	-0.807031	0.671802	0.597149	0.111332	3

Selection,

# FOR BAD PRACTICE
df.loc['A', [('baz', 0)]]
df_rs1.loc['A', [('baz', i) for i in [0,1]]]

(baz, 0)    0.729023
(baz, 0)    1.482855
(baz, 1)    0.943795

Rename multiindex

# all columns' name at the level 1
df.columns.set_levels(['b1','c1','f1'], level=1, inplace=True)

Drop multiindex

df.columns = df.columns.droplevel()

   a
   b  c         b c
0  1  2   ->  0 1 2
1  3  4       1 3 4

Compare 2 dataframes

df1.equals(df2)

True / False

# Invert True/False value in Series
s = pd.Series([True, True, False, True])
~s

# Convert True / False to 1 / 0
df['col'] = df['col'].astype(int)
# int or float

Import library

Read/Write .csv file

Create a dataframe

Adding

Shuffle rows

Sorting

Select rows/columns/item(s)

Select Single value

Select integer rows and named columns

Select columns

Select rows

MultiIndex

All multiindex

With a single name column

Good practice

Bad practice

Rename multiindex

Drop multiindex

Compare 2 dataframes

True / False

Read/Write `.csv` file