Last modified on 01 Oct 2021.

This is the final project for the course “Applied Data Science Capstone” given by IBM on Coursera. We will explore the venues in different districts of Ho Chi Minh City to find the best place to set up our business.


Data presentation

To answer this question, we use the following data.

  1. List of Ho Chi Minh City administrative units from Wikipedia.
  2. List of the coordinates (latitude, longitude) of all urban districts in HCMC. This list can be generated from the name of each district using geopy.geocoders.Nominatim.
  3. List of average housing prices per m² in HCMC.
  4. A .json file containing all the coordinates needed to create a choropleth map of the Housing Sales Price Index of HCMC. I created this file myself using OpenStreetMap.
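
As a rough sketch of item 2, the snippet below turns district names into (latitude, longitude) pairs. The helper names (build_query, geocode_districts, make_nominatim_geocoder), the query format and the user_agent string are assumptions made for illustration and are not taken from the original notebook. Nominatim needs network access and its usage policy allows at most one request per second, which is why the sketch wraps the call in geopy's RateLimiter.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]


def build_query(district: str, city: str = "Ho Chi Minh City",
                country: str = "Vietnam") -> str:
    """Build the free-text query sent to the geocoder (assumed format)."""
    return f"{district}, {city}, {country}"


def geocode_districts(districts: Iterable[str],
                      geocode: Callable) -> Dict[str, Optional[Coordinates]]:
    """Map each district name to (latitude, longitude), or None if not found.

    `geocode` is any callable that takes a query string and returns an object
    with `.latitude` and `.longitude` attributes (or None), which is the
    behaviour of geopy's Nominatim.geocode.
    """
    coords: Dict[str, Optional[Coordinates]] = {}
    for name in districts:
        location = geocode(build_query(name))
        if location is None:
            coords[name] = None  # the geocoder found no match
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords


def make_nominatim_geocoder(user_agent: str = "hcmc-districts-example"):
    """Return a rate-limited Nominatim geocode function (needs network)."""
    # Imported lazily so the helpers above work without geopy installed.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)
```

With geopy installed and a network connection, geocode_districts(["District 1", "Binh Thanh District"], make_nominatim_geocoder()) would return a dictionary of coordinates. Passing the geocoder in as an argument keeps the lookup logic testable offline.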

Methodology (TL;DR)

  • Get the data:
    • Scrape the data from the web using requests and bs4.BeautifulSoup: district data (df_hcm) and housing price data (df_housing_price).
    • Use geopy.geocoders.Nominatim to find the coordinates (latitude, longitude) of each district based on its name.
  • Use folium to plot the map.
  • Use the Foursquare API to find the venues in each district.
  • Explore the venues in each district:
    • List of unique categories.
    • Number of venues in each district.
    • Number of venues in each category.
    • Number of categories in each district.
  • Data preprocessing:
    • Remove the thousands separators (,) from numbers.
    • Create a new feature called Density, defined as population divided by area.
    • Remove the word District and keep only the name of each district.
    • Remove Vietnamese accents.
    • Merge the two dataframes into one.
    • Use pd.get_dummies() to convert categorical features into dummy variables (one-hot encoding).
  • Use groupby to find the top 10 venue categories for each district.
  • Use K-Means clustering to cluster districts by venue category. We try different values of k and use the “elbow” method to choose the best one.
  • Next, we group the Average Housing Price (AHP) into ranges: low, medium, high and very high.
  • Finally, based on the map, we choose the best district to set up our business: one with many residents (high density), few existing cafés and a low average housing price.
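
The preprocessing, one-hot encoding and top-category steps above can be sketched with pandas as follows. This is a minimal illustration and not the original notebook code: the column names, the helper names (strip_accents, preprocess_districts, top_categories) and every number in the sample tables are made up for the example. The K-Means step is not covered here.

```python
import unicodedata

import pandas as pd


def strip_accents(text: str) -> str:
    """Remove Vietnamese diacritics, e.g. 'Bình Thạnh' -> 'Binh Thanh'."""
    # 'đ'/'Đ' have no Unicode decomposition, so they are mapped by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess_districts(df: pd.DataFrame) -> pd.DataFrame:
    """Clean numbers, add Density, and normalize district names."""
    out = df.copy()
    # Remove thousands separators and convert to numeric types.
    out["Population"] = out["Population"].str.replace(",", "", regex=False).astype(int)
    out["Area"] = out["Area"].str.replace(",", "", regex=False).astype(float)
    # Density = population over area.
    out["Density"] = out["Population"] / out["Area"]
    # Drop the word "District" and the accents from the name.
    out["District"] = (
        out["District"].str.replace("District", "", regex=False)
        .str.strip()
        .map(strip_accents)
    )
    return out


def top_categories(freq: pd.DataFrame, n: int = 10) -> dict:
    """Return the n most frequent venue categories for each district."""
    return {
        district: row.sort_values(ascending=False).index[:n].tolist()
        for district, row in freq.iterrows()
    }


# Made-up sample rows standing in for the scraped tables.
df_hcm = pd.DataFrame({
    "District": ["District 1", "Bình Thạnh District", "Gò Vấp District"],
    "Area": ["8.0", "20.0", "20.0"],
    "Population": ["200,000", "500,000", "600,000"],
})
df_housing_price = pd.DataFrame({
    "District": ["1", "Binh Thanh", "Go Vap"],
    "AHP": [400.0, 150.0, 100.0],  # placeholder values, arbitrary unit
})

df_clean = preprocess_districts(df_hcm)
df = df_clean.merge(df_housing_price, on="District", how="left")

# Made-up venue list standing in for the Foursquare results.
venues = pd.DataFrame({
    "District": ["1", "1", "1", "Binh Thanh", "Binh Thanh", "Go Vap"],
    "Category": ["Café", "Café", "Hotel", "Café",
                 "Vietnamese Restaurant", "Market"],
})

# One-hot encode the categories, then average per district to get
# the share of each category among that district's venues.
onehot = pd.get_dummies(venues["Category"], dtype=int)
onehot.insert(0, "District", venues["District"])
freq = onehot.groupby("District").mean()

top = top_categories(freq, n=2)
```

Averaging the one-hot columns per district gives category frequencies, so districts with different numbers of venues remain comparable. Note that when a district has fewer than n categories, the list is padded with categories whose frequency is zero.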