Anh-Thi DINH
A final report for the course "Applied Data Science Capstone" given by IBM on Coursera
(Please read more in the final report!)
By using Data Science and some geometric factors about the relations between districts in HCMC, we can give the investors good answers to the following questions, so that they have a better vision not only of the café business but also of other venues in Ho Chi Minh City (HCMC).
First, we need to collect the data by scraping the table of HCMC administrative units on the Wikipedia page and the average housing price (AHP) on a real-estate website. The BeautifulSoup package is very useful in this case.
The column Density is calculated later from the columns Population and Area of each district.
Throughout the project, we use the numpy and pandas packages to manipulate dataframes.
We use geopy.geocoders.Nominatim to get the coordinates of the districts and add them to the main dataframe.
We use the folium package to visualize the map of HCMC and its districts. The central coordinate of each district is represented as a small circle on top of the city map.
We use the Foursquare API to explore the venues in each district and segment the districts based on them.
To cluster the "Café" venues across districts, we use the K-Means clustering method, and the scikit-learn package helps us implement the algorithm on our data. To decide on the number of clusters K, we try values of K from 1 to 9 and use the "elbow" method to choose the most appropriate one.
To visualize the charts, we use the matplotlib package.
We use the folium package again to visualize the clusters on the main map and the choropleth map of AHP.
import pandas as pd
import numpy as np
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize
# Scrape the web to get the data
from bs4 import BeautifulSoup
import requests
import folium # map rendering library
# import k-means from clustering stage
from sklearn.cluster import KMeans
# compute distances (used by the elbow method)
from scipy.spatial.distance import cdist
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
# translate Vietnamese character (with accents)
# to the closest possible representation in ascii text
from unidecode import unidecode
We don't have an all-in-one table, so we have to collect the necessary information from various data tables.
First, we scrape the list of all 19 urban districts from the table of administrative units on the Wikipedia page of Ho Chi Minh City.
source_wiki_hcm = requests.get("https://en.wikipedia.org/wiki/Ho_Chi_Minh_City").text
soup = BeautifulSoup(source_wiki_hcm, 'lxml')
table_wiki_hcm = ( soup.find("span", {"id": "Demographics"})
.parent.previous_sibling.previous_sibling )
table_rows = table_wiki_hcm.tbody.find_all("tr")
res_hcm = []
for tr in table_rows:
td = tr.find_all("td")
    row = [cell.text for cell in td]
res_hcm.append(row)
df_hcm = pd.DataFrame(res_hcm, columns=["District", "Subdistrict",
"Area (km2)", "", "", "", "", "Population 2015", ""])
df_hcm.drop("", axis=1, inplace=True)
# keep only the rows of the 19 urban districts
df_hcm = df_hcm.iloc[3:22].reset_index(drop=True)
df_hcm["Population 2015"] = (
df_hcm["Population 2015"].str.replace("\n", "")
.str.replace(",", "")
.str.replace(".", "")
.str.strip()
)
# Add the "Density" column = Population / Area
df_hcm["Density (pop/m2)"] = round(df_hcm["Population 2015"].astype(float)
/ df_hcm["Area (km2)"].astype(float)
, 3)
# remove the word "District"
df_hcm["District"] = ( df_hcm["District"]
.str.replace("District", "")
.str.strip()
)
# remove Vietnamese accents
df_hcm["District"] = df_hcm["District"].apply(unidecode)
df_hcm
Next, we collect the housing prices in the different districts of HCMC.
source_housing_price = requests.get("https://mogi.vn/gia-nha-dat").text
soup = BeautifulSoup(source_housing_price, 'lxml')
table_housing_price = soup.find("div", class_="mt-table")
table_rows = table_housing_price.find_all("div", class_="mt-row")
res_housing_price = []
for tr in table_rows:
district = tr.find("div", class_="mt-street").a.text
medium_price = tr.find("div", class_="mt-vol").span.text
row = [district, medium_price]
res_housing_price.append(row)
df_housing_price = pd.DataFrame(res_housing_price,
columns=["District", "Average Housing Price (1M VND)"])
df_housing_price = df_housing_price.iloc[:19].reset_index(drop=True)  # keep the 19 urban districts
# Remove the word "Quận"
df_housing_price["District"] = ( df_housing_price["District"]
.str.replace("\n", "").str.replace("Quận", "")
.str.strip()
)
# Remove Vietnamese accents
df_housing_price["District"] = df_housing_price["District"].apply(unidecode)
# Remove the word "triệu" (It's 10^6 in Vietnamese)
df_housing_price["Average Housing Price (1M VND)"] = ( df_housing_price["Average Housing Price (1M VND)"]
.str.replace("triệu", "")
.str.replace(",", ".")
.str.strip()
)
df_housing_price
Merge the two dataframes df_hcm and df_housing_price into one table called df.
df = pd.merge(df_hcm, df_housing_price, how='left', on='District')
df
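Because the merge keys come from two independently scraped sources, it is worth checking that every district found a match. This quick sanity check (my addition) lists any district left without a housing price after the left merge:
# districts with no match in df_housing_price
df[df["Average Housing Price (1M VND)"].isnull()]["District"].tolist()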
Next, we find the coordinates of all urban districts in HCMC from their names. To do that, we create a function that does the same job for each district.
def find_coor(name):
address = name + " Ho Chi Minh City Vietnam"
geolocator = Nominatim(user_agent="hcmc")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
return [latitude, longitude]
Remark: We cannot find the coordinates of "District Go Vap", only of "Go Vap". However, we do need the word "District" to find the coordinates of districts whose names contain a number, like "District 1".
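Since Nominatim returns None when a query does not match anything, a more defensive variant of find_coor could try several naming patterns before giving up (a sketch; the helper name find_coor_safe is my own):
def find_coor_safe(name):
    geolocator = Nominatim(user_agent="hcmc")
    # try "District X" first, then "X District", then the bare name
    for query in ("District " + name, name + " District", name):
        location = geolocator.geocode(query + " Ho Chi Minh City Vietnam")
        if location is not None:
            return [location.latitude, location.longitude]
    return [None, None]  # the caller must handle missing coordinates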
# For District 1 to 12 (numbers)
coords = [find_coor("District " + dist) for dist in df["District"].iloc[:12].tolist()]
# For the other districts (letters)
coords = coords + [find_coor(dist + " District") for dist in df["District"].iloc[12:].tolist()]
df_coords = pd.DataFrame(coords, columns=["Latitude", "Longitude"])
df["Latitude"] = df_coords["Latitude"]
df["Longitude"] = df_coords["Longitude"]
df
[hcm_lat, hcm_long] = find_coor("")
print('The geographical coordinates of Ho Chi Minh City are {}, {}.'.format(hcm_lat, hcm_long))
Plot the map of HCMC.
map_hcm = folium.Map(location=[hcm_lat, hcm_long], zoom_start=11)
for lat, lng, dis in zip(df['Latitude'], df['Longitude'], df['District']):
label = '{}'.format(dis)
label = folium.Popup(label, parse_html=True)
folium.CircleMarker(
[lat, lng],
radius=5,
popup=label,
color='blue',
fill=True,
fill_color='#3186cc',
    fill_opacity=0.7).add_to(map_hcm)
map_hcm
In the public repository on GitHub, I removed these credentials for privacy!
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'
VERSION = '20180605'
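To keep such credentials out of the notebook entirely, one common option is to read them from environment variables instead (a sketch; the variable names are my own choice):
import os
# read the Foursquare credentials from the environment instead of hard-coding them
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET")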
First, let's create a function to repeat the same process for all the districts of HCMC.
def getNearbyVenues(names, latitudes, longitudes, radius=1500, LIMIT=150):
venues_list=[]
for name, lat, lng in zip(names, latitudes, longitudes):
# print(name)
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
lat,
lng,
radius,
LIMIT)
# make the GET request
results = requests.get(url).json()["response"]['groups'][0]['items']
# return only relevant information for each nearby venue
venues_list.append([(
name,
lat,
lng,
v['venue']['name'],
v['venue']['location']['lat'],
v['venue']['location']['lng'],
v['venue']['categories'][0]['name']) for v in results])
nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = ['District',
'District Latitude',
'District Longitude',
'Venue',
'Venue Latitude',
'Venue Longitude',
'Venue Category']
    return nearby_venues
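Note that the list comprehension above assumes every venue has at least one category; if Foursquare ever returned a venue with an empty categories list, v['venue']['categories'][0] would raise an IndexError. A defensive helper (my addition, not part of the original notebook) could be used in its place:
def get_category(venue):
    # return the first category name, or a placeholder when there is none
    categories = venue.get('categories', [])
    return categories[0]['name'] if categories else 'Uncategorized'
Replacing v['venue']['categories'][0]['name'] with get_category(v['venue']) makes the request loop robust to such venues.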
Now, we apply the above function to our dataframe.
hcm_venues = getNearbyVenues(names=df['District'],
latitudes=df['Latitude'],
longitudes=df['Longitude']
)
hcm_venues.head()
Let's check how many venues were returned for each district.
hcm_venues_group = hcm_venues.groupby('District').count().reset_index()
hcm_venues_group
print('In the above table, there are {} unique categories.'.format(len(hcm_venues['Venue Category'].unique())))
hcm_venues['Venue Category'].unique()[:50]
We plot a chart to visually compare the number of venues across districts.
ax = hcm_venues_group.sort_values(by="Venue", ascending=False).plot(x="District", y="Venue", kind="bar")
ax.set_ylabel("Number of venues")
most_venues = hcm_venues.groupby('Venue Category').count().sort_values(by="Venue", ascending=False)
most_venues.head(15)
hcm_venues_group_cat = (
hcm_venues.groupby(['District','Venue Category'])
.count().reset_index()[['District', 'Venue Category']]
.groupby('District').count().reset_index()
)
# hcm_venues_group_cat
ax = hcm_venues_group_cat.sort_values(by="Venue Category", ascending=False).plot(x="District", y="Venue Category", kind="bar")
ax.set_ylabel("Number of categories")
# one hot encoding
hcm_onehot = pd.get_dummies(hcm_venues[['Venue Category']], prefix="", prefix_sep="")
# add district column back to dataframe
hcm_onehot['District'] = hcm_venues['District']
# move district column to the first column
fixed_columns = [hcm_onehot.columns[-1]] + list(hcm_onehot.columns[:-1])
hcm_onehot = hcm_onehot[fixed_columns]
# group the rows by district, taking the mean of the frequency of occurrence of each category
hcm_grouped = hcm_onehot.groupby('District').mean().reset_index()
hcm_grouped.head()
def return_most_common_venues(row, num_top_venues):
row_categories = row.iloc[1:]
row_categories_sorted = row_categories.sort_values(ascending=False)
return row_categories_sorted.index.values[0:num_top_venues]
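A toy illustration of how this function works (the data here is made up): the first entry of the row is the district name and the rest are category frequencies.
toy_row = pd.Series(["X", 0.10, 0.05, 0.20],
                    index=["District", "Café", "Bakery", "Vietnamese Restaurant"])
return_most_common_venues(toy_row, 2)  # -> array(['Vietnamese Restaurant', 'Café'], dtype=object)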
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['District']
for ind in np.arange(num_top_venues):
try:
columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
hcm_10 = pd.DataFrame(columns=columns)
hcm_10['District'] = hcm_grouped['District']
for ind in np.arange(hcm_grouped.shape[0]):
hcm_10.iloc[ind, 1:] = return_most_common_venues(hcm_grouped.iloc[ind, :], num_top_venues)
hcm_10
hcm_grouped_cafe = hcm_grouped[["District", "Café"]]
hcm_grouped_cafe
We want to cluster the districts by the category "Café" only. We will use K-Means clustering to do this, but first we need to determine which value of k to use. The "elbow" method helps us find a good k.
# try values of k from 1 to 9 to find the best one
Ks = 10
distortions = []
hcm_cafe_clustering = hcm_grouped_cafe.drop('District', axis=1)
for k in range(1, Ks):
# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(hcm_cafe_clustering)
# find the distortion w.r.t each k
distortions.append(
sum(np.min(cdist(hcm_cafe_clustering, kmeans.cluster_centers_, 'euclidean'), axis=1))
/ hcm_cafe_clustering.shape[0]
)
plt.plot(range(1, Ks), distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
We see that the "elbow" appears at k=3.
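As an optional cross-check that is not part of the original analysis, we can also compute the silhouette score for each candidate k; higher values indicate better-separated clusters:
from sklearn.metrics import silhouette_score
for k in range(2, Ks):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(hcm_cafe_clustering)
    print(k, round(silhouette_score(hcm_cafe_clustering, labels), 3))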
nclusters = 3
kmeans = KMeans(n_clusters=nclusters, random_state=0).fit(hcm_cafe_clustering)
Let's create a new dataframe that looks like hcm_grouped_cafe but also contains the cluster label of each district.
df_cafe = hcm_grouped_cafe.copy()
df_cafe["Cluster Labels"] = kmeans.labels_
# join the columns of df (including Latitude and Longitude) into df_cafe
df_cafe = df_cafe.join(df.set_index("District"), on="District")
# sort the table by cluster labels
df_cafe.sort_values(["Cluster Labels"], inplace=True)
# Drop some unnecessary columns
df_cafe = df_cafe.drop(["Subdistrict", "Area (km2)"], axis=1)
# change to a numeric data type
df_cafe['Average Housing Price (1M VND)'] = df_cafe['Average Housing Price (1M VND)'].astype(float)
df_cafe
# create map
map_clusters = folium.Map(location=[hcm_lat, hcm_long], zoom_start=11)
# set color scheme for the clusters
x = np.arange(nclusters)
ys = [i+x+(i*x)**2 for i in range(nclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
for lat, lon, poi, cluster in zip(
df_cafe['Latitude'],
df_cafe['Longitude'],
df_cafe['District'],
df_cafe['Cluster Labels']
):
label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
folium.CircleMarker(
[lat, lon],
radius=5,
popup=label,
color=rainbow[cluster-1],
fill=True,
fill_color=rainbow[cluster-1],
fill_opacity=0.7).add_to(map_clusters)
map_clusters
count, division = np.histogram(df_cafe['Average Housing Price (1M VND)'], bins = [30, 100, 200, 300, 400])
df_cafe['Average Housing Price (1M VND)'].plot.hist(bins=division, rwidth=0.9)
The number of districts in each range of AHP:
count
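For readability, the bin edges and counts can be printed together (a small formatting addition of mine):
for lo, hi, c in zip(division[:-1], division[1:], count):
    print("{:.0f}-{:.0f} (1M VND): {} district(s)".format(lo, hi, c))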
Now, we want to classify the AHP of each district into the above ranges by creating a new column named "AHP Level". To do that, we first define a function.
def classify_ahp(price):
if price <= 100:
return "Low"
elif price <= 200:
return "Medium"
elif price <= 300:
return "High"
else:
return "Very High"
df_cafe["AHP Level"] = df_cafe["Average Housing Price (1M VND)"].apply(classify_ahp)
df_cafe
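The same binning could also be written with pd.cut, which maps each price onto labeled intervals in a single call (an equivalent sketch, assuming all prices are positive):
df_cafe["AHP Level"] = pd.cut(df_cafe["Average Housing Price (1M VND)"],
                              bins=[0, 100, 200, 300, float("inf")],
                              labels=["Low", "Medium", "High", "Very High"])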
These labels help us understand the clusters.
Now we want to create a choropleth map of AHP, coupled with the map of clusters created in the previous section.
hcm_geo = r'hcm_urban.geojson' # geojson file
map_ahp = folium.Map(location=[hcm_lat, hcm_long], zoom_start=11)
map_ahp.choropleth(
geo_data=hcm_geo,
name='choropleth',
data=df_cafe,
columns=['District', 'Average Housing Price (1M VND)'],
key_on='feature.properties.name',
fill_color='YlGn',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Average Housing Price'
)
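A note on the API: newer folium releases have removed the Map.choropleth method in favor of the folium.Choropleth class, which accepts the same keyword arguments as far as I know:
folium.Choropleth(
    geo_data=hcm_geo,
    name='choropleth',
    data=df_cafe,
    columns=['District', 'Average Housing Price (1M VND)'],
    key_on='feature.properties.name',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Average Housing Price'
).add_to(map_ahp)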
# folium.LayerControl().add_to(map_ahp)
# add the cluster markers to the map
for lat, lon, poi, cluster in zip(
df_cafe['Latitude'],
df_cafe['Longitude'],
df_cafe['District'],
df_cafe['Cluster Labels']
):
label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
folium.CircleMarker(
[lat, lon],
radius=5,
popup=label,
color=rainbow[cluster-1],
fill=True,
fill_color=rainbow[cluster-1],
fill_opacity=0.7).add_to(map_ahp)
map_ahp
map_density = folium.Map(location=[hcm_lat, hcm_long], zoom_start=11)
map_density.choropleth(
geo_data=hcm_geo,
name='choropleth',
data=df_cafe,
    columns=['District', 'Density (pop/km2)'],
key_on='feature.properties.name',
fill_color='YlGn',
fill_opacity=0.7,
line_opacity=0.2,
    legend_name='Population density (pop/km2)'
)
# folium.LayerControl().add_to(map_ahp)
# add the cluster markers to the map
for lat, lon, poi, cluster in zip(
df_cafe['Latitude'],
df_cafe['Longitude'],
df_cafe['District'],
df_cafe['Cluster Labels']
):
label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
folium.CircleMarker(
[lat, lon],
radius=5,
popup=label,
color=rainbow[cluster-1],
fill=True,
fill_color=rainbow[cluster-1],
fill_opacity=0.7).add_to(map_density)
map_density
From all the above results, we conclude that the best place to set up a new café is District 4, because many people live there (high density), there are not many existing cafés (cluster 0), and the average housing price is low.