Last modified on 01 Oct 2021.
In this note, a general dataframe is called df
(type pandas.core.frame.DataFrame
), a general series is call s
(type pandas.core.series.Series
).
Import library
import pandas as pd
import numpy as np # import numpy if necessary
Read/Write .csv
file
# READ
df = pd.read_csv('filename.csv', sep=';') # default sep=','
# if 1st col contains 0,1,2,...
df = pd.read_csv('filename.csv', index_col=1)
# with datetime info
df = pd.read_csv(PATH_DATA_FOLDER+"raw_data.csv",
parse_dates=['timestamp'],
infer_datetime_format=True,
cache_dates=True)
# WRITE
df.to_csv(path, index=False) # don't incldue index
Create a dataframe
# FROM A LIST
pd.DataFrame(a_list, colummns=['col_name'])
# FROM A DICTIONARY
names = ['John', 'Thi', 'Bi', 'Beo', 'Chang']
ages = [10, 20, 21, 18, 11]
marks = [8, 9, 10, 6, 8]
city = ['Ben Tre', 'Paris', 'Ho Chi Minh Ville', 'New York', 'DC']
my_dict = {'Name':names, 'Ages':ages, 'Marks':marks, 'Place': city}
students = pd.DataFrame(my_dict)
Name | Ages | Marks | Place | |
---|---|---|---|---|
0 | John | 10 | 8 | Ben Tre |
1 | Thi | 20 | 9 | Paris |
2 | Bi | 21 | 10 | Ho Chi Minh Ville |
3 | Beo | 18 | 6 | New York |
4 | Chang | 11 | 8 | DC |
Adding
# a column
df['new_col] = [new_values]
# a row
df.loc['new_index'] = [new_value]
# add a new col based on another's values
df_im = df0.copy()[['col']]
df_im['status'] = df0['col'].apply(lambda row: 1 if row>=80 else 0)
Shuffle rows
# shuffle all rows and reset the index
df_new = df.sample(frac=1).reset_index(drop=True)
Sorting
df.sort_values(by='col1', ascending=False)
Select rows/columns/item(s)
👉 Indexing and selecting data — pandas 1.1.2 documentation
Select Single value
Select a single value (with condition): Get the mark of Thi
(9
).
# interchange `.values[0]` and `.iloc[0]`
df[df.Name=='Thi'].Marks.values[0]
df.loc[df.Name=='Thi', 'Marks'].values[0]
# with indexes
df.iloc[1,2] # row 2, column 3
# column's name with row's index
df[['Marks']].iloc[1].values[0] # column 'Marks', row 2
# column's index with row's value
df[df.Name=='Thi'].iloc[:,2].values[0] # column 3, row of 'Thi'
Select integer rows and named columns
df.loc[1:5, 'col']
Select columns
Select a column (returns a Series
)
# with column's name
df['Name']
df.loc[:, 'Name']
# with an index
df.iloc[:,0]
Returns a pd.DataFrame
,
df[['Name']]
df.loc[:, ['Name']]
# with an index
df.iloc[:,[0]]
Select multi-columns (type DataFrame
): Get columns Name
& Place
:
# using columns's names
df[['Name', 'Place']]
df.loc[:, ['Name', 'Place']]
# using indexes
df.iloc[:, [0,-1]]
Select rows
Select a row (returns a Series
)
# with an index
df.iloc[1]
# with a condition
df[df['Name']=='Thi'] # DataFrame
df[df['Name']=='Thi'].iloc[0] # Series
df[df.Name=='Thi'] # DataFrame
df[df.Name=='Thi'].iloc[0] # Series
df[df.Name=='Thi'].values[0] # ndarray
Select multi-rows (type DataFrame
)
# using indexes
df.iloc[:3]
df.loc[:2]
# with conditions
df[df['A'].isin([3, 6])]
MultiIndex
👉 MultiIndex / advanced indexing — pandas 1.1.2 documentation
All multiindex
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'], ['one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.random.randn(3, 6), index=['A', 'B', 'C'], columns=index)
bar | baz | foo | ||||
---|---|---|---|---|---|---|
one | two | one | two | one | two | |
A | -0.752333 | 0.490581 | 0.774629 | 0.487185 | 1.767773 | 0.028956 |
B | -0.057864 | -0.221516 | -0.568726 | -0.563732 | 1.362453 | -0.563213 |
C | -0.338319 | -0.346590 | 0.012845 | 0.755455 | 1.260937 | -0.038209 |
Selection,
df.loc['A', ('baz', 'two')]
0.487185
df.loc[:,('baz', 'two')]
A 0.487185
B -0.563732
C 0.755455
Name: (baz, two), dtype: float64
With a single name column
If there are some column with single name,
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo'], [i for i in range(2)]*3]
index = pd.MultiIndex.from_arrays(arrays)
df1 = pd.DataFrame(np.random.randn(3, 6), index=['A', 'B', 'C'], columns=index)
Good practice
# GOOD PRACTICE
df1['time'] = [1,2,3]
df_rs2 = df1
bar | baz | foo | time | ||||
---|---|---|---|---|---|---|---|
0 | 1 | 0 | 1 | 0 | 1 | ||
A | -1.386119 | -0.496755 | 1.482855 | 0.943795 | -1.173290 | -0.445365 | 1 |
B | -0.900710 | -1.571009 | 1.086964 | 1.546927 | -1.564426 | 0.622763 | 2 |
C | 0.712231 | 0.235247 | -0.807031 | 0.671802 | 0.597149 | 0.111332 | 3 |
Selection,
# FOR GOOD PRACTICE
df_rs2.loc['A', ('baz', 1)]
df_rs2.loc['A', 'baz']
0.943795
0 1.482855
1 0.943795
Bad practice
# BAD PRACTICE
df2 = pd.DataFrame([1,2,3], index=['A', 'B', 'C'], columns=['time'])
df_rs1 = pd.concat([df1, df2], axis=1)
(bar, 0) | (bar, 1) | (baz, 0) | (baz, 1) | (foo, 0) | (foo, 1) | time | |
---|---|---|---|---|---|---|---|
A | -1.386119 | -0.496755 | 1.482855 | 0.943795 | -1.173290 | -0.445365 | 1 |
B | -0.900710 | -1.571009 | 1.086964 | 1.546927 | -1.564426 | 0.622763 | 2 |
C | 0.712231 | 0.235247 | -0.807031 | 0.671802 | 0.597149 | 0.111332 | 3 |
Selection,
# FOR BAD PRACTICE
df.loc['A', [('baz', 0)]]
df_rs1.loc['A', [('baz', i) for i in [0,1]]]
(baz, 0) 0.729023
(baz, 0) 1.482855
(baz, 1) 0.943795
Rename multiindex
# all columns' name at the level 1
df.columns.set_levels(['b1','c1','f1'], level=1, inplace=True)
Drop multiindex
df.columns = df.columns.droplevel()
a
b c b c
0 1 2 -> 0 1 2
1 3 4 1 3 4
Compare 2 dataframes
df1.equals(df2)
True / False
# Invert True/False value in Series
s = pd.Series([True, True, False, True])
~s
# Convert True / False to 1 / 0
df['col'] = df['col'].astype(int)
# int or float