In this note, I use `df` as a DataFrame and `s` as a Series.
Libraries
import pandas as pd # import pandas package
import numpy as np
Other tasks
Things to check (see the sketch after this list):
- csv file:
  - Values are separated by `,` or `;`?
  - Encoding?
  - Timestamp type?
- Indexes are sorted?
- Indexes are continuous with step 1 (especially after using `.dropna()` or `.drop_duplicates()`)?
- Are there `NaN` values? Drop them?
- Are there duplicates? Drop them?
- How many unique values?
- For `0/1` features, do they have only 2 unique values (`0` and `1`)? Use a KDE plot to check the distribution of values.
- The number of columns?
- Unique labels?
- Time series:
  - Time range.
  - Time step.
  - Timestamp's type.
  - Timezone.
  - Timestamps are monotonic?
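A minimal sketch of these checks, assuming `df` is already loaded and `col` is a hypothetical column of interest:
df.index.is_monotonic_increasing  # indexes/timestamps are sorted?
df.isna().any().any()             # are there any NaN values?
df.duplicated().any()             # are there duplicated rows?
df['col'].nunique()               # how many unique values?
df['col'].plot.kde()              # KDE plot of the value distribution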
Deal with columns
Remove or Keep some
# REMOVING COLUMNS
df.drop('New', axis=1, inplace=True) # drop column 'New'
df.drop(['col1', 'col2'], axis=1, inplace=True)
# ONLY KEEP SOME
kept_cols = ['col1', 'col2', ...]
df = df[kept_cols]
# ALL EXCEPT SOME
df[df.columns.difference(['b'])]
Rename columns
# IMPLICITLY
df.columns = ['Surname', 'Years', 'Grade', 'Location']
# EXPLICITLY
df.rename(columns={'Name': 'Surname', 'Ages': 'Years'}, inplace=True)
# A SPECIFIC COLUMN
df.rename(columns={'gdp': 'log(gdp)'}, inplace=True)
# RENAME INDEX COLUMN
df.index.name = 'new_name'
Make index
# COLUMN HAS UNIQUE VALUES?
df['col'].is_unique # True if yes
# INDEX -> NORMAL COLUMN
df.reset_index(inplace=True)
# NORMAL COLUMN -> INDEX
df.set_index('column')
df.set_index(['col1', 'col2'])
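Note that `set_index` returns a new DataFrame by default, so assign the result back (or pass `inplace=True`). A small usage sketch with made-up column names:
df = df.set_index('col1')    # assign the result back
df.sort_index(inplace=True)  # then sort the index if needed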
Drop duplicates
# check duplicates
df['Student'].duplicated().any()
# remove duplicates in some columns
df.drop_duplicates(['col1', 'col2'])
# use ignore_index=True if you want to reset indexes to 0,1,...,n-1
Combine different columns
df = df0[['Date', 'Heure', 'tH (°C)']].copy()
df['timestamp'] = df['Date'] + ' ' + df['Heure']
# without `.copy()`, you will get this warning:
# "A value is trying to be set on a copy of a slice from a DataFrame."
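To get a real timestamp type instead of plain strings, a hedged follow-up using `pd.to_datetime` (the `format` string is an assumption about the data; `errors='coerce'` turns unparseable values into `NaT`):
# format is assumed; adapt it to your actual 'Date' + 'Heure' strings
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d/%m/%Y %H:%M', errors='coerce')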
Deal with missing values (`NaN`)
Drop `NaN` values
The full reference of `dropna` is in the pandas documentation.
# Drop any rows which have any nans
df.dropna()
# Drop columns where all values are NaN
df.dropna(how='all', axis=1)
# Drop columns that have any nans
df.dropna(axis=1)
# Only keep columns with at least 90% non-NaN values
df.dropna(thresh=int(df.shape[0]*.9), axis=1)
# Only keep rows having >=2 non-NA values
df.dropna(thresh=2)
# Only consider some cols
df.dropna(subset=['col1', 'col2'])
# MultiIndex columns: give the subset as tuples
df.dropna(subset=[(1,'a'), (1,'b'), (2,'a'), (2,'b')])
# consider all columns under the top levels 1 and 2
df.dropna(subset=df.loc[[], [1,2]].columns)
Fill `NaN` with other values
Check other methods of `fillna` in the pandas documentation.
# Fill NaN with ' '
df['col'] = df['col'].fillna(' ')
# Fill NaN with 99
df['col'] = df['col'].fillna(99)
# Fill NaN with the mean of each column (use df.median() for medians)
df.fillna(df.mean())
# Fill NaN with the mean of the column
df['col'] = df['col'].fillna(df['col'].mean())
# Fill NaN with the mean of each row
m = df.mean(axis=1)
for col in df.columns:
    df.loc[:, col] = df.loc[:, col].fillna(m)
Work with conditions
np.where(if_this_condition_is_true, do_this, else_this)
df['new_column'] = np.where(df['col'] > 10, 'foo', 'bar')  # example
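For more than two cases, a sketch using `np.select` (the column name and thresholds here are made up for illustration):
conditions = [df['col'] > 10, df['col'] > 5]  # checked in order
choices = ['high', 'medium']
df['new_column'] = np.select(conditions, choices, default='low')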
Work with text data
There are a lot of methods for working with text data (`pd.Series.str`). We can use them in combination with regular expressions.
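A minimal sketch of some common `pd.Series.str` methods, where `col` is a hypothetical text column:
df['col'].str.lower()                           # lowercase all strings
df['col'].str.contains(r'\d+', regex=True)      # which rows contain digits?
df['col'].str.replace(r'\s+', ' ', regex=True)  # collapse whitespace
df['col'].str.extract(r'(\d{4})')               # extract the first 4-digit group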