Last modified on 01 Oct 2021.
This is the final project for the course “Applied Data Science Capstone” given by IBM on Coursera. We will explore the venues in different districts of Ho Chi Minh City to find the best place to set up our business.
Data presentation
To answer these questions, we use the following data:
- List of Ho Chi Minh City administrative units from Wikipedia.
- List of the coordinates (latitude, longitude) of all urban districts in HCMC. This list can be generated from the name of each district using `geopy.geocoders.Nominatim`.
- List of average housing prices in HCMC.
- A `.json` file containing the coordinates used to draw a choropleth map of the Housing Sales Price Index of HCMC. I created this file myself using OpenStreetMap.
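The coordinate lookup mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names, the `user_agent` string and the query format (`"<district>, Ho Chi Minh City, Vietnam"`) are assumptions. The lookup function is passed in as a parameter so the logic can be tried without network access.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


def make_nominatim_lookup(user_agent: str = "hcmc-districts-demo") -> Callable[[str], Optional[Coordinates]]:
    """Build a lookup function backed by geopy's Nominatim geocoder.

    The user_agent value is a placeholder; Nominatim asks for an
    application-specific string.
    """
    # Imported lazily so the rest of this sketch runs without geopy installed.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    # Wait at least one second between requests to the public server.
    geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

    def lookup(query: str) -> Optional[Coordinates]:
        location = geocode(query)
        if location is None:  # Nominatim found nothing for this query
            return None
        return (location.latitude, location.longitude)

    return lookup


def geocode_districts(
    names: Iterable[str],
    lookup: Callable[[str], Optional[Coordinates]],
    city: str = "Ho Chi Minh City, Vietnam",
) -> Dict[str, Optional[Coordinates]]:
    """Map each district name to (latitude, longitude), or None if not found."""
    return {name: lookup(f"{name}, {city}") for name in names}


# Usage (needs geopy and network access):
# lookup = make_nominatim_lookup()
# coords = geocode_districts(["District 1", "Binh Thanh"], lookup)
```

Districts that come back as `None` would need a manual fix or a differently worded query.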
Methodology (TL;DR;)
- Get the data:
  - Scrape the data from a website using `requests` and `bs4.BeautifulSoup`: data of districts (`df_hcm`) and data of housing price (`df_housing_price`).
  - Using `geopy.geocoders.Nominatim` to find coordinates (longitude, latitude) of districts based on their name.
- Using `folium` to plot the map.
- Using Foursquare API to find the venues of each district.
- Explore the venues in each district:
  - List of unique categories.
  - Number of venues in each district.
  - Number of venues in each category.
  - Number of categories in each district.
- Data preprocessing:
  - Remove all `,` in numbers.
  - Create a new feature called `Density`, which is the population divided by the area.
  - Remove the word `District` and only keep the name of the district.
  - Remove Vietnamese accents.
  - Merge the 2 dataframes into 1 `df`.
  - Using `pd.get_dummies()` to convert categorical features into dummy ones (one-hot encoding).
- Using `groupby` to find the top 10 venue categories for each district.
- Using K-Means clustering to cluster districts by category. We try different values of `k` and use the “elbow” method to choose the best `k` for the K-Means.
for the K-Means. - Next, we examine the range of Average Housing Price (AHP): low, medium, high and very high.
- Finally, based on the map, we can choose the best district to set up our business: one where many people live (high density), where there are few existing cafés, and where the average housing price is low.
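The text-cleaning part of the data preprocessing step can be sketched with the standard library alone. The column names (`District`, `Population`, `Area`) and the sample row are assumptions made for illustration; the figures are not real data.

```python
import unicodedata


def parse_number(text: str) -> float:
    """Convert a string such as '1,234,567' to a float by dropping the commas."""
    return float(text.replace(",", "").strip())


def remove_accents(text: str) -> str:
    """Strip Vietnamese diacritics, e.g. 'Bình Thạnh' -> 'Binh Thanh'."""
    # 'đ' and 'Đ' have no Unicode decomposition, so they are mapped by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks left over after decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_district_name(name: str) -> str:
    """Drop the word 'District' and the accents: 'Gò Vấp District' -> 'Go Vap'."""
    without_word = " ".join(name.replace("District", "").split())
    return remove_accents(without_word)


def preprocess_row(row: dict) -> dict:
    """Clean one scraped row and add the Density feature (population / area)."""
    population = parse_number(row["Population"])
    area = parse_number(row["Area"])
    return {
        "District": clean_district_name(row["District"]),
        "Population": population,
        "Area": area,
        "Density": population / area,
    }


# Illustrative values only, not real figures.
sample = {"District": "Bình Thạnh District", "Population": "500,000", "Area": "20.0"}
cleaned = preprocess_row(sample)
print(cleaned)
```

Removing accents before merging matters because the two scraped tables may spell the same district with and without diacritics.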
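The one-hot encoding and top-categories steps can be illustrated on a tiny made-up venue table. The column names, the sample venues and the choice to rank categories by their mean frequency per district are assumptions, not necessarily what the notebook does.

```python
import pandas as pd

# Hypothetical venue table: one row per venue found in a district.
venues = pd.DataFrame(
    {
        "District": ["1", "1", "1", "Binh Thanh", "Binh Thanh", "Binh Thanh", "Go Vap"],
        "Category": ["Café", "Café", "Hotel", "Café", "Bakery", "Bakery", "Market"],
    }
)

# One-hot encode the venue category and keep the district next to it.
onehot = pd.get_dummies(venues["Category"], dtype=float)
onehot.insert(0, "District", venues["District"])

# The mean of each dummy column per district is the share of venues in that category.
grouped = onehot.groupby("District").mean()


def top_categories(frequencies: pd.DataFrame, n: int = 10) -> dict:
    """Return the n most frequent venue categories for each district."""
    top = {}
    for district, row in frequencies.iterrows():
        # Ignore categories that do not occur in this district at all.
        ranked = row[row > 0].sort_values(ascending=False)
        top[district] = ranked.index[:n].tolist()
    return top


top = top_categories(grouped, n=2)
print(top)
```

The `grouped` table of category frequencies is also a natural input for the clustering step, since each district becomes one numeric row.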
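For the elbow method, a minimal sketch with scikit-learn is shown below. The synthetic feature matrix stands in for the per-district category frequencies, and the function name, the range of `k` and the `random_state` are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_inertias(features: np.ndarray, k_values, random_state: int = 0) -> dict:
    """Fit K-Means for each k and record the inertia (within-cluster sum of squares)."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        model.fit(features)
        inertias[k] = model.inertia_
    return inertias


# Synthetic stand-in data: three well-separated groups of 8 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
features = np.vstack([c + 0.3 * rng.standard_normal((8, 2)) for c in centers])

inertias = elbow_inertias(features, range(1, 7))
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.2f}")
```

Plotting inertia against `k` should show a sharp drop up to the true number of groups and a flat tail afterwards; the "elbow" is where the curve bends. On real venue data the bend is usually less clear-cut, so the choice of `k` remains a judgment call.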