Take Home Exercise 03: Predicting HDB Public Housing Resale Prices using Geographically Weighted Models

Published

March 11, 2023

Modified

April 3, 2023

1 Our Objective

Conventional predictive models for housing resale prices are built using the Ordinary Least Squares (OLS) method, but this approach does not account for spatial autocorrelation and spatial heterogeneity in geographic data sets. The presence of spatial autocorrelation means that using OLS to estimate predictive housing resale pricing models could result in biased, inconsistent, or inefficient outcomes. For this exercise, we will be using geographically weighted models, in particular Geographically Weighted Regression (GWR), to calibrate a predictive model for housing resale prices. Please feel free to follow along.
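For context, the core idea behind GWR (stated here in its standard textbook form, not something specific to this exercise) is that every regression coefficient is allowed to vary with the location $(u_i, v_i)$ of observation $i$:

$$
y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{m} \beta_k(u_i, v_i)\,x_{ik} + \varepsilon_i
$$

Each local set of coefficients is estimated by weighted least squares, with weights that decay the further a neighbouring observation is from $(u_i, v_i)$.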

Note

Before I begin, I would like to mention that this exercise was done with the help of many resources online, including some of my seniors’ works. All references are listed at the end of this webpage. Please refer to them should you require any further explanation.

For this exercise, we will be using the independent variables recommended by Professor Kam. Please view them here.

2 Data Source

Fig. 1: Datasets used

| Type       | Name                                          | Format   | Source                  |
|------------|-----------------------------------------------|----------|-------------------------|
| Geospatial | MPSZ-2019                                     | .shp     | Professor Kam Tin Seong |
| Aspatial   | Resale Flat Prices                            | .csv     | data.gov.sg             |
| Geospatial | Bus Stop locations (Feb 2023)                 | .shp     | LTA Data Mall           |
| Geospatial | Parks                                         | .shp     | OneMapSG API            |
| Geospatial | Hawker Centres                                | .shp     | OneMapSG API            |
| Geospatial | Supermarkets                                  | .geojson | data.gov.sg             |
| Geospatial | MRT & LRT Stations (Train Station Exit Point) | .csv     | LTA Data Mall           |
| Geospatial | Shopping Malls                                | .csv     | Wikipedia               |
| Geospatial | Primary Schools                               | .csv     | data.gov.sg             |
| Geospatial | Eldercare                                     | .shp     | OneMapSG API            |
| Geospatial | Childcare Centres                             | .shp     | OneMapSG API            |
| Geospatial | Kindergartens                                 | .shp     | OneMapSG API            |

3 Data Preparation

3.1 Install R Packages

Code for Package Installation
pacman::p_load(
  sf, spdep, GWmodel,                  # spatial data handling and geographically weighted models
  olsrr, car, gtsummary, Metrics,      # OLS diagnostics, VIF, model summaries, error metrics
  ranger, SpatialML,                   # random forest and geographical random forest
  tmap, ggpubr, gridExtra,             # mapping and plot arrangement
  onemapsgapi, rvest, httr, jsonlite,  # OneMapSG API calls and web scraping
  tidyverse, stringr, readr,           # general data wrangling
  gdata, matrixStats, units            # trim(), rowMins(), drop_units()
)

3.2 Retrieving Geospatial Data

We will need to prepare the geospatial data; by the end of this section, we should have the locations of all the points-of-interest (e.g., Primary Schools, Shopping Malls, Supermarkets) in a .shp, .csv, or .geojson format. While I was able to find data for Bus Stop locations and MRT & LRT Stations, the rest were not readily available on the Internet. Here is how I collated the rest of the data…

(1) OneMapSG API

The data was extracted by calling the OneMapSG API. Most API endpoints in OneMapSG require a token. Please register for one here if you would like to perform some of the steps shown below.
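As a quick illustration (a minimal sketch with placeholder credentials, not part of the original write-up), the onemapsgapi package provides a get_token() helper:

Code to retrieve a OneMapSG API token (illustrative)
# the email and password below are placeholders - use your registered credentials
token <- onemapsgapi::get_token(email = "your_email@example.com",
                                password = "your_password")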

I used the get_theme() function to retrieve the location coordinates of facilities. If we only have a facility’s name, we can use the /commonapi/search endpoint to retrieve its coordinates (this was used for the Primary School data set retrieved from data.gov.sg).

(2) Scrape data from websites using the rvest library

Some facilities’ names had to be scraped from Wikipedia using the rvest package. The OneMapSG API /commonapi/search endpoint was then used on the extracted facility names to get their longitude and latitude (this was used to collect the Mall data).

(3) Download .shp, .geojson files from LTA or data.gov.sg

The shapefiles and geojson files were directly downloaded from LTA Data Mall or data.gov.sg and read as an sf object using st_read().

Here is the code I used to get the data for the following facilities-of-interest:

# search for themes related to "health"
health_themes = onemapsgapi::search_themes(token, "health")

# pick a suitable theme from the output above
# use get_themes() with the queryname of the theme selected
eldercare_tibble = onemapsgapi::get_theme(token, "eldercare")

# convert it into an sf object
eldercare_sf = st_as_sf(eldercare_tibble, coords=c("Lng", "Lat"), crs=4326)
  
  
# let's write it into a shapefile for recurrent use
st_write(eldercare_sf, dsn="data/geospatial", layer="eldercare", driver= "ESRI Shapefile")

# search for themes related to "parks"
park_themes = onemapsgapi::search_themes(token, "parks")

# pick a suitable theme from the output above
# use get_themes() with the queryname of the theme selected
parks_tibble = onemapsgapi::get_theme(token, "nationalparks")

# convert it into an sf object
parks_sf = st_as_sf(parks_tibble, coords=c("Lng", "Lat"), crs=4326)
  
  
# let's write it into a shapefile for recurrent use
st_write(parks_sf, dsn="data/geospatial", layer="parks", driver= "ESRI Shapefile")

# search for themes related to "hawker"
hawker_themes = onemapsgapi::search_themes(token, "hawker")

hawker_tibble = onemapsgapi::get_theme(token, "hawkercentre")

hawker_sf = st_as_sf(hawker_tibble, coords=c("Lng", "Lat"), crs=4326)
  
st_write(hawker_sf, dsn="data/geospatial", layer="hawker", driver= "ESRI Shapefile")

# list all available themes to find the kindergarten queryname
all_themes = onemapsgapi::search_themes(token)

kindergarten_tibble = onemapsgapi::get_theme(token, "kindergartens")

kindergarten_sf = st_as_sf(kindergarten_tibble, coords=c("Lng", "Lat"), crs=4326)

st_write(kindergarten_sf, dsn="data/geospatial", layer="kindergarten", driver= "ESRI Shapefile")

# list all available themes to find the childcare queryname
all_themes = onemapsgapi::search_themes(token)

childcare_tibble = onemapsgapi::get_theme(token, "childcare")

childcare_sf = st_as_sf(childcare_tibble, coords=c("Lng", "Lat"), crs=4326)

st_write(childcare_sf, dsn="data/geospatial", layer="childcare", driver= "ESRI Shapefile")

To get the latest shopping mall data, I had to scrape it from Wikipedia. There is an excellent guide by GitHub user ValaryLim that I referenced for this section. I managed to get a pretty accurate list of shopping malls in Singapore (including the smaller ones in the Heartlands).

Code to scrape malls off of Wikipedia
# from the wiki page we can see that the malls are all in unordered lists
url = "https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore"
document <- read_html(url)
ul_li_elems <- as.list(document %>% html_elements("ul") %>% html_elements("li") %>% html_text())

# remove those that aren't malls
temp <- ul_li_elems[-c(1:46)]
temp <- temp[-c(168:170)]
temp <- temp[-c(341:360)]

# normalise the names so duplicated malls can be detected and removed
temp = as.list(gsub("\\[[^][]*]", "", temp))  # strip footnote markers like [1]
temp = as.list(gsub("\\([^][]*)", "", temp))  # strip parenthesised remarks
temp = as.list(toupper(temp))
temp = as.list(gsub("THE", "", temp))         # drop "THE" so "THE X" and "X" match
temp = trim(temp)                             # gdata::trim() removes surrounding whitespace
all_malls = as.list(unique(temp))             # keep each mall once

# remove duplicates and malls that are not operational
all_malls = all_malls[-c(176, 31, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 20, 193, 185, 169, 191)]

# some of the mall names are not accurate - let's change them!
all_malls[[39]] = "GR.ID"
all_malls[[95]] = "DJIT SUN MALL"
all_malls[[42]] = "SHAW HOUSE"
all_malls[[198]] = "SHAW CENTRE"
all_malls[[51]] = "TEKKA MARKET"
all_malls[[158]] = "GRANTRAL MALL @ CLEMENTI"
all_malls[[199]] = "GRANTRAL MALL @ MACPHERSON"
all_malls[[176]] = "CAPITOL BUILDING SINGAPORE"
all_malls[[117]] = "NORTHSHORE PLAZA I"
all_malls[[200]] = "NORTHSHORE PLAZA II"

# replace spaces with %20 - so we can use it in the api endpoint
all_malls = as.list(gsub(" ", "%20", all_malls))
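As an aside, base R’s URLencode() offers a more general way to percent-encode the names (a hedged alternative, not what I used above):

Code for an alternative way to percent-encode the mall names (optional)
# URLencode() turns spaces into %20 (and handles other unsafe characters too)
all_malls = as.list(sapply(all_malls, URLencode))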

With all the mall names, let’s use the OneMapSG API to retrieve the lat and long coords.

Code to retrieve lat and lng coords using OneMapSG API
# get lat and lng data using onemapsg api
get_lat_lng <- function(location) {
  result = tryCatch({
    query = str_glue("https://developers.onemap.sg/commonapi/search?searchVal={location}&returnGeom=Y&getAddrDetails=Y")
    print(query)
    res = GET(query)
    res_data = fromJSON(rawToChar(res$content))$results[1,]
    lat =  res_data$LATITUDE
    lng = res_data$LONGITUDE
    return (c(lat, lng))
  },
  error = function(e) {
    print(e)
    return(c("INVALID LOCATION", "INVALID LOCATION"))
  })
}

shopping_mall_df = data.frame()
# h = hash()
for (mall in all_malls) {
  lat_long = get_lat_lng(mall)
  # print(lat_long)
  name = gsub("%20", " ", mall)
  # h[[str_glue("{name}")]] = lat_long
  row <- c(str_glue("{name}"), lat_long[[1]], lat_long[[2]])
  shopping_mall_df <- rbind(shopping_mall_df, row)
}

# rename dataframe
colnames(shopping_mall_df) <- c("Mall", "Lat", "Lng")

# let's manually retrieve the lat and lng values for those that can't be retrieved using the API

shopping_mall_df$Lat[shopping_mall_df$Mall == "CITY GATE MALL"] <- "1.30231590504573"
shopping_mall_df$Lng[shopping_mall_df$Mall == "CITY GATE MALL"] <- "103.862331661034"

shopping_mall_df$Lat[shopping_mall_df$Mall == "CLARKE QUAY CENTRAL"] <- "1.2887904"
shopping_mall_df$Lng[shopping_mall_df$Mall == "CLARKE QUAY CENTRAL"] <- "103.8424709"

shopping_mall_df$Lat[shopping_mall_df$Mall == "MANDARIN GALLERY"] <- "1.3021529"
shopping_mall_df$Lng[shopping_mall_df$Mall == "MANDARIN GALLERY"] <- "103.8363372"

shopping_mall_df$Lat[shopping_mall_df$Mall == "OD MALL"] <- "1.3379938"
shopping_mall_df$Lng[shopping_mall_df$Mall == "OD MALL"] <- "103.7935013"

# some of the lat and lng retrieved are incorrect so let's manually fix them
# the correct coords were manually found on the onemapsg website

shopping_mall_df$Lat[shopping_mall_df$Mall == "ADELPHI"] <- "1.29118503658447"
shopping_mall_df$Lng[shopping_mall_df$Mall == "ADELPHI"] <- "103.851184338699"

shopping_mall_df$Lat[shopping_mall_df$Mall == "APERIA"] <- "1.3110137"
shopping_mall_df$Lng[shopping_mall_df$Mall == "APERIA"] <- "103.8639307"

shopping_mall_df$Lat[shopping_mall_df$Mall == "BEDOK MALL"] <- "1.3248556"
shopping_mall_df$Lng[shopping_mall_df$Mall == "BEDOK MALL"] <- "103.9292532"

shopping_mall_df$Lat[shopping_mall_df$Mall == "BUANGKOK SQUARE"] <- "1.3845127"
shopping_mall_df$Lng[shopping_mall_df$Mall == "BUANGKOK SQUARE"] <- "103.8816655"

shopping_mall_df$Lat[shopping_mall_df$Mall == "BUGIS JUNCTION"] <- "1.2993706"
shopping_mall_df$Lng[shopping_mall_df$Mall == "BUGIS JUNCTION"] <- "103.8554388"

shopping_mall_df$Lat[shopping_mall_df$Mall == "BUGIS+"] <- "1.2996817"
shopping_mall_df$Lng[shopping_mall_df$Mall == "BUGIS+"] <- "103.8543123"

shopping_mall_df$Lat[shopping_mall_df$Mall == "CATHAY"] <- "1.2992061"
shopping_mall_df$Lng[shopping_mall_df$Mall == "CATHAY"] <- "103.8478268"

shopping_mall_df$Lat[shopping_mall_df$Mall == "CENTREPOINT"] <- "1.3019784"
shopping_mall_df$Lng[shopping_mall_df$Mall == "CENTREPOINT"] <- "103.8397590"

shopping_mall_df$Lat[shopping_mall_df$Mall == "CHINATOWN POINT"] <- "1.2851156"
shopping_mall_df$Lng[shopping_mall_df$Mall == "CHINATOWN POINT"] <- "103.8447261"

shopping_mall_df$Lat[shopping_mall_df$Mall == "CLEMENTI MALL"] <- "1.3148962"
shopping_mall_df$Lng[shopping_mall_df$Mall == "CLEMENTI MALL"] <- "103.7644231"

shopping_mall_df$Lat[shopping_mall_df$Mall == "DUO"] <- "1.2995343"
shopping_mall_df$Lng[shopping_mall_df$Mall == "DUO"] <- "103.8584017"

shopping_mall_df$Lat[shopping_mall_df$Mall == "ELIAS MALL"] <- "1.3786093"
shopping_mall_df$Lng[shopping_mall_df$Mall == "ELIAS MALL"] <- "103.9420270"

shopping_mall_df$Lat[shopping_mall_df$Mall == "ESPLANADE MALL"] <- "1.2896478"
shopping_mall_df$Lng[shopping_mall_df$Mall == "ESPLANADE MALL"] <- "103.8562673"

shopping_mall_df$Lat[shopping_mall_df$Mall == "FORUM SHOPPING MALL"] <- "1.3060975"
shopping_mall_df$Lng[shopping_mall_df$Mall == "FORUM SHOPPING MALL"] <- "103.8286786"

shopping_mall_df$Lat[shopping_mall_df$Mall == "HDB HUB"] <- "1.3320088"
shopping_mall_df$Lng[shopping_mall_df$Mall == "HDB HUB"] <- "103.8485316"

shopping_mall_df$Lat[shopping_mall_df$Mall == "HOUGANG 1"] <- "1.3757153"
shopping_mall_df$Lng[shopping_mall_df$Mall == "HOUGANG 1"] <- "103.8794723"

shopping_mall_df$Lat[shopping_mall_df$Mall == "ION ORCHARD"] <- "1.3039797"
shopping_mall_df$Lng[shopping_mall_df$Mall == "ION ORCHARD"] <- "103.8320323"

shopping_mall_df$Lat[shopping_mall_df$Mall == "MARINA BAY SANDS"] <- "1.2834542"
shopping_mall_df$Lng[shopping_mall_df$Mall == "MARINA BAY SANDS"] <- "103.8608090"

shopping_mall_df$Lat[shopping_mall_df$Mall == "NOVENA SQUARE"] <- "1.3199675"
shopping_mall_df$Lng[shopping_mall_df$Mall == "NOVENA SQUARE"] <- "103.8438506"

shopping_mall_df$Lat[shopping_mall_df$Mall == "ORCHARD GATEWAY"] <- "1.3004433"
shopping_mall_df$Lng[shopping_mall_df$Mall == "ORCHARD GATEWAY"] <- "103.8394428"

shopping_mall_df$Lat[shopping_mall_df$Mall == "PEOPLE'S PARK COMPLEX"] <- "1.2841340"
shopping_mall_df$Lng[shopping_mall_df$Mall == "PEOPLE'S PARK COMPLEX"] <- "103.8425200"

shopping_mall_df$Lat[shopping_mall_df$Mall == "PEOPLE'S PARK CENTRE"] <- "1.2857701"
shopping_mall_df$Lng[shopping_mall_df$Mall == "PEOPLE'S PARK CENTRE"] <- "103.8439801"

shopping_mall_df$Lat[shopping_mall_df$Mall == "POIZ"] <- "1.3313212"
shopping_mall_df$Lng[shopping_mall_df$Mall == "POIZ"] <- "103.8680699"

shopping_mall_df$Lat[shopping_mall_df$Mall == "SENGKANG GRAND MALL"] <- "1.3829816"
shopping_mall_df$Lng[shopping_mall_df$Mall == "SENGKANG GRAND MALL"] <- "103.8927210"

shopping_mall_df$Lat[shopping_mall_df$Mall == "SHAW HOUSE"] <- "1.3058481"
shopping_mall_df$Lng[shopping_mall_df$Mall == "SHAW HOUSE"] <- "103.8315082"

shopping_mall_df$Lat[shopping_mall_df$Mall == "SOUTH BEACH"] <- "1.2948335"
shopping_mall_df$Lng[shopping_mall_df$Mall == "SOUTH BEACH"] <- "103.8560375"

shopping_mall_df$Lat[shopping_mall_df$Mall == "SQUARE 2"] <- "1.3207051"
shopping_mall_df$Lng[shopping_mall_df$Mall == "SQUARE 2"] <- "103.8441607"

shopping_mall_df$Lat[shopping_mall_df$Mall == "TAMPINES 1"] <- "1.3543014"
shopping_mall_df$Lng[shopping_mall_df$Mall == "TAMPINES 1"] <- "103.9450922"

shopping_mall_df$Lat[shopping_mall_df$Mall == "TANJONG PAGAR CENTRE"] <- "1.2765836"
shopping_mall_df$Lng[shopping_mall_df$Mall == "TANJONG PAGAR CENTRE"] <- "103.8459363"

shopping_mall_df$Lat[shopping_mall_df$Mall == "TEKKA MARKET"] <- "1.3061777"
shopping_mall_df$Lng[shopping_mall_df$Mall == "TEKKA MARKET"] <- "103.8506100"

# remove VELOCITY - it's in NOVENA SQUARE which is already accounted for
shopping_mall_df = shopping_mall_df %>% filter(Mall != "VELOCITY@NOVENA SQUARE")

With that, we have all the malls and their locational data in a data.frame. Let’s write this out to a .csv file so we can use it later on.

Code to write dataframe into .csv file for analysis
write.csv(shopping_mall_df, "data/geospatial/mall.csv")

For the Primary Schools, I used this dataset from data.gov.sg and queried the OneMapSG API with each school name to get the lat and lng data.

Code for function to get lat and lng information
get_lat_lng <- function(location) {
  result = tryCatch({
    query = str_glue("https://developers.onemap.sg/commonapi/search?searchVal={location}&returnGeom=Y&getAddrDetails=Y")
    print(query)
    res = GET(query)
    res_data = fromJSON(rawToChar(res$content))$results[1,]
    lat =  res_data$LATITUDE
    lng = res_data$LONGITUDE
    return (c(lat, lng))
  },
  error = function(e) {
    print(e)
    return(c("INVALID LOCATION", "INVALID LOCATION"))
  })
}
Code to get all Primary School Names and Coords
# as mentioned earlier I got this dataset from data.gov.sg - I will read it, filter out all the primary schools, and extract the names

school_info <- readr::read_csv("data/geospatial/school_info.csv")
primary_schools = as.list(subset(school_info, mainlevel_code == "PRIMARY" | mainlevel_code == "MIXED LEVELS")$school_name)

# remove those that aren't primary schools
primary_schools <- primary_schools[-c(8, 44, 74, 101, 111, 134, 136, 142, 149, 157, 169)]

# use the onemapsg api to retrieve lat and lng coords
primary_schools_df = data.frame()
for (school in primary_schools) {
  temp = gsub(" ", "%20", school)
  lat_long = get_lat_lng(temp)
  row <- c(school, lat_long[[1]], lat_long[[2]])
  primary_schools_df <- rbind(primary_schools_df, row)
}

# rename the dataframe columns
colnames(primary_schools_df) <- c("School", "Lat", "Lng")

While we have retrieved the data, some of the coordinates are wrong. Let’s rectify them manually.

Code to rectify faulty coords and remove non-operational schools
primary_schools_df$Lat[primary_schools_df$School == "JURONG PRIMARY SCHOOL"] <- "1.348861338692006"
primary_schools_df$Lng[primary_schools_df$School == "JURONG PRIMARY SCHOOL"] <- "103.73297053951389"

primary_schools_df$Lat[primary_schools_df$School == "KUO CHUAN PRESBYTERIAN PRIMARY SCHOOL"] <- "1.349716245801766"
primary_schools_df$Lng[primary_schools_df$School == "KUO CHUAN PRESBYTERIAN PRIMARY SCHOOL"] <- "103.8552420147255"

primary_schools_df$Lat[primary_schools_df$School == "MAYFLOWER PRIMARY SCHOOL"] <- "1.3776322593463508"
primary_schools_df$Lng[primary_schools_df$School == "MAYFLOWER PRIMARY SCHOOL"] <- "103.84324844673546"

primary_schools_df$Lat[primary_schools_df$School == "METHODIST GIRLS' SCHOOL(PRIMARY)"] <- "1.3331343330353658"
primary_schools_df$Lng[primary_schools_df$School == "METHODIST GIRLS' SCHOOL(PRIMARY)"] <- "103.78342078588985"

primary_schools_df$Lat[primary_schools_df$School == "PUNGGOL PRIMARY SCHOOL"] <- "1.377228287998256"
primary_schools_df$Lng[primary_schools_df$School == "PUNGGOL PRIMARY SCHOOL"] <- "103.89467211203765"

primary_schools_df$Lat[primary_schools_df$School == "ST. ANTHONY'S PRIMARY SCHOOL"] <- "1.3643281311526845"
primary_schools_df$Lng[primary_schools_df$School == "ST. ANTHONY'S PRIMARY SCHOOL"] <- "103.74890532821937"

primary_schools_df$Lat[primary_schools_df$School == "TAMPINES PRIMARY SCHOOL"] <- "1.3498992868953197"
primary_schools_df$Lng[primary_schools_df$School == "TAMPINES PRIMARY SCHOOL"] <- "103.94425668320217"

primary_schools_df$Lat[primary_schools_df$School == "TAO NAN SCHOOL"] <- "1.305174050370657"
primary_schools_df$Lng[primary_schools_df$School == "TAO NAN SCHOOL"] <- "103.91138035251849"

primary_schools_df = primary_schools_df[! (primary_schools_df$School == "JUYING PRIMARY SCHOOL"),]

Let’s save this data.frame into a .csv file for us to use later on.

Code to write dataframe into .csv file for analysis
write.csv(primary_schools_df, "data/geospatial/school.csv")

3.3 Importing Data

Import Singapore Subzone Boundary
sg_sf <- st_read(dsn = "data/geospatial", layer = "MPSZ-2019")
Reading layer `MPSZ-2019' from data source 
  `C:\guga-nesh\IS415-GAA\take-home_ex\take-home_ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
Import Eldercare data
eldercares_sf <- st_read(dsn = "data/geospatial", layer = "eldercare")
Reading layer `eldercare' from data source 
  `C:\guga-nesh\IS415-GAA\take-home_ex\take-home_ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 133 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.7119 ymin: 1.271472 xmax: 103.9561 ymax: 1.439561
Geodetic CRS:  WGS 84
Import Hawker Centre data
hawkers_sf <- st_read(dsn = "data/geospatial", layer = "hawker")
Reading layer `hawker' from data source 
  `C:\guga-nesh\IS415-GAA\take-home_ex\take-home_ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 125 features and 18 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6974 ymin: 1.272716 xmax: 103.9882 ymax: 1.449017
Geodetic CRS:  WGS 84
Import MRT & LRT data
trains_sf <- st_read(dsn = "data/geospatial", layer = "Train_Station_Exit_Layer")
Reading layer `Train_Station_Exit_Layer' from data source 
  `C:\guga-nesh\IS415-GAA\take-home_ex\take-home_ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 562 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
Import Parks data
parks_sf <- st_read(dsn = "data/geospatial", layer = "parks")
Reading layer `parks' from data source 
  `C:\guga-nesh\IS415-GAA\take-home_ex\take-home_ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 421 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6929 ymin: 1.214491 xmax: 104.0538 ymax: 1.462094
Geodetic CRS:  WGS 84
Import Shopping Malls data
malls <- read_csv("data/geospatial/mall.csv")

# for now we will be reading the data using EPSG:4326 (i.e., WGS84)
malls_sf <- st_as_sf(malls, coords = c("Lng", "Lat"), crs=4326)
Import Supermarket data
supermarkets_sf <- st_read("data/geospatial/supermarkets-geojson.geojson")
Reading layer `supermarkets-geojson' from data source 
  `C:\guga-nesh\IS415-GAA\take-home_ex\take-home_ex03\data\geospatial\supermarkets-geojson.geojson' 
  using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range:       zmin: 0 zmax: 0
Geodetic CRS:  WGS 84
Import Kindergartens data
kindergartens_sf <- st_read(dsn = "data/geospatial", layer = "kindergarten")
Reading layer `kindergarten' from data source 
  `C:\guga-nesh\IS415-GAA\take-home_ex\take-home_ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 448 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6887 ymin: 1.247759 xmax: 103.9717 ymax: 1.455452
Geodetic CRS:  WGS 84
Import Childcare Centres data
childcares_sf <- st_read(dsn = "data/geospatial", layer = "childcare")
Reading layer `childcare' from data source 
  `C:\guga-nesh\IS415-GAA\take-home_ex\take-home_ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 1925 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 103.6878 ymin: 1.247759 xmax: 103.9897 ymax: 1.462134
Geodetic CRS:  WGS 84
Import Bus Stops data
busstops_sf <- st_read(dsn = "data/geospatial", layer = "BusStop")
Reading layer `BusStop' from data source 
  `C:\guga-nesh\IS415-GAA\take-home_ex\take-home_ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Import Primary Schools data
pri_schs <- read_csv("data/geospatial/school.csv")

# for now we will be reading the data using EPSG:4326 (i.e., WGS84)
pri_schs_sf <- st_as_sf(pri_schs, coords = c("Lng", "Lat"), crs=4326)
Import Resale Flat Prices
resale_tbl <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

# filtering the data

# train set: Jan '21 - Dec '22
# test set: Jan '23 - Feb '23
# for the purposes of this study we will be looking at 3 Room Flats ONLY
three_rm_flats = resale_tbl %>% filter(flat_type == "3 ROOM")

resale_tbl <- three_rm_flats %>% filter(month >= "2021-01")

# take a look at the dataframe with glimpse()
glimpse(resale_tbl)

3.4 Geospatial Data Transformation

From the previous section, we saw that not all of the data sets are using Singapore’s Projected Coordinate System (i.e., SVY21 - EPSG:3414). This is just one of the issues we will need to fix in this section. Here’s what we will be doing:

  1. remove unnecessary columns using select()
  2. check for invalid geometries using st_is_valid() and missing values using is.na()
  3. transform the crs to the appropriate one using st_transform()

For our geospatial data sets, we only need the name of the facility and the geometry column. So let’s remove all the unnecessary columns using select().

Code to remove unnecessary columns
# for the geospatial datasets, the facility names are all in the first col

busstops_sf <- busstops_sf %>% select(c(1))
childcares_sf <- childcares_sf %>% select(c(1))
eldercares_sf <- eldercares_sf %>% select(c(1))
hawkers_sf <- hawkers_sf %>% select(c(1))
kindergartens_sf <- kindergartens_sf %>% select(c(1))
malls_sf <- malls_sf %>% select(c(1))
parks_sf <- parks_sf %>% select(c(1))
pri_schs_sf <- pri_schs_sf %>% select(c(1))

# for this supermarket dataset, the facility names were not given - the Name column only holds placeholder ids (kml_1, kml_2, ...)
supermarkets_sf <- supermarkets_sf %>% select(c(1))

# each train station has multiple exits - appending the exit code to the station name differentiates them
trains_sf$NAME <- paste(trains_sf$stn_name, toupper(trains_sf$exit_code))
trains_sf <- trains_sf %>% select(c(4))   # keep only the new NAME column (and geometry)
Code to check for invalid geometries and missing values
# returns number of invalid geometries
print(length(which(st_is_valid(busstops_sf) == FALSE)))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(busstops_sf[rowSums(is.na(busstops_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] BUS_STOP_N geometry  
<0 rows> (or 0-length row.names)
Code to check for invalid geometries and missing values
# returns number of invalid geometries
print(length(which(st_is_valid(trains_sf) == FALSE)))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(trains_sf[rowSums(is.na(trains_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Projected CRS: SVY21
[1] NAME     geometry
<0 rows> (or 0-length row.names)
Code to check for invalid geometries and missing values
# returns number of invalid geometries
length(which(st_is_valid(childcares_sf) == FALSE))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(childcares_sf[rowSums(is.na(childcares_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] NAME     geometry
<0 rows> (or 0-length row.names)
Code to check for invalid geometries and missing values
# returns number of invalid geometries
length(which(st_is_valid(eldercares_sf) == FALSE))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(eldercares_sf[rowSums(is.na(eldercares_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] NAME     geometry
<0 rows> (or 0-length row.names)
Code to check for invalid geometries and missing values
# returns number of invalid geometries
length(which(st_is_valid(hawkers_sf) == FALSE))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(hawkers_sf[rowSums(is.na(hawkers_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] NAME     geometry
<0 rows> (or 0-length row.names)
Code to check for invalid geometries and missing values
# returns number of invalid geometries
length(which(st_is_valid(kindergartens_sf) == FALSE))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(kindergartens_sf[rowSums(is.na(kindergartens_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] NAME     geometry
<0 rows> (or 0-length row.names)
Code to check for invalid geometries and missing values
# returns number of invalid geometries
length(which(st_is_valid(malls_sf) == FALSE))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(malls_sf[rowSums(is.na(malls_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
# A tibble: 0 × 2
# … with 2 variables: Mall <chr>, geometry <GEOMETRY [°]>
Code to check for invalid geometries and missing values
# returns number of invalid geometries
length(which(st_is_valid(parks_sf) == FALSE))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(parks_sf[rowSums(is.na(parks_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] NAME     geometry
<0 rows> (or 0-length row.names)
Code to check for invalid geometries and missing values
# returns number of invalid geometries
length(which(st_is_valid(pri_schs_sf) == FALSE))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(pri_schs_sf[rowSums(is.na(pri_schs_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
# A tibble: 0 × 2
# … with 2 variables: School <chr>, geometry <GEOMETRY [°]>
Code to check for invalid geometries and missing values
# returns number of invalid geometries
length(which(st_is_valid(supermarkets_sf) == FALSE))
[1] 0
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(supermarkets_sf[rowSums(is.na(supermarkets_sf)) != 0,])
Simple feature collection with 0 features and 1 field
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] Name     geometry
<0 rows> (or 0-length row.names)
Code to check for invalid geometries and missing values
# returns number of invalid geometries
length(which(st_is_valid(sg_sf) == FALSE))
[1] 6
Code to check for invalid geometries and missing values
# return rows that contain NA values
print(sg_sf[rowSums(is.na(sg_sf)) != 0,])
Simple feature collection with 0 features and 6 fields
Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
Geodetic CRS:  WGS 84
[1] SUBZONE_N  SUBZONE_C  PLN_AREA_N PLN_AREA_C REGION_N   REGION_C   geometry  
<0 rows> (or 0-length row.names)

Great news! There are no missing values. However, we can see that sg_sf has 6 invalid geometries. Let’s fix this using st_make_valid().

Code to fix invalid geometries
# make the geometries valid, then re-check the count of invalid ones
sg_sf <- st_make_valid(sg_sf)
length(which(st_is_valid(sg_sf) == FALSE))
[1] 0

It worked! Now, we can move on to transforming the Coordinate Reference Systems.

We will be using the st_transform() method to do this.

Code to transform CRS
sg_sf <- st_transform(sg_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(sg_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
pri_schs_sf <- st_transform(pri_schs_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(pri_schs_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
parks_sf <- st_transform(parks_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(parks_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
malls_sf <- st_transform(malls_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(malls_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
kindergartens_sf <- st_transform(kindergartens_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(kindergartens_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
hawkers_sf <- st_transform(hawkers_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(hawkers_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
eldercares_sf <- st_transform(eldercares_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(eldercares_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
childcares_sf <- st_transform(childcares_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(childcares_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
busstops_sf <- st_transform(busstops_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(busstops_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
trains_sf <- st_transform(trains_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(trains_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]
Code to transform CRS
supermarkets_sf <- st_transform(supermarkets_sf, 3414)

# use st_crs() to check if the crs was assigned correctly
st_crs(supermarkets_sf)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

One last step is required for our Geospatial Data Transformation. This step is only applicable to supermarkets_sf. Let’s take a look at this data set:

Code to view supermarkets_sf
supermarkets_sf
Simple feature collection with 526 features and 1 field
Geometry type: POINT
Dimension:     XYZ
Bounding box:  xmin: 4901.188 ymin: 25529.08 xmax: 46948.22 ymax: 49233.6
z_range:       zmin: 0 zmax: 0
Projected CRS: SVY21 / Singapore TM
First 10 features:
     Name                      geometry
1   kml_1 POINT Z (35561.22 42685.17 0)
2   kml_2 POINT Z (32184.01 32947.46 0)
3   kml_3 POINT Z (33903.48 39480.46 0)
4   kml_4 POINT Z (37083.82 35017.47 0)
5   kml_5  POINT Z (41320.3 37283.82 0)
6   kml_6 POINT Z (41384.47 37152.14 0)
7   kml_7 POINT Z (30186.63 38602.77 0)
8   kml_8 POINT Z (28380.83 38842.16 0)
9   kml_9 POINT Z (34383.76 37311.19 0)
10 kml_10 POINT Z (29010.23 45755.51 0)

This data set is 3-dimensional (XYZ) when we only need the XY dimensions. Let’s drop the Z dimension using st_zm().

Code to remove Z and/or M dimensions
supermarkets_sf <- st_zm(supermarkets_sf)
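To confirm the Z dimension has been dropped (a small check of my own), printing the object again should now report Dimension: XY.

Code to confirm the Z dimension was dropped
supermarkets_sf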

With that our Geospatial Data is transformed and ready to be merged with the Aspatial Data for further analysis.

3.5 Visualising Geospatial Data

Let’s check if the data can be visualised using tmap. I have split them up into the following categories:

Code to plot MRT and LRT Exit Points
# the LRT exit rows were identified manually by row index; the remaining rows are MRT exits
mrts_only <- trains_sf[-c(4:15, 36:52, 102, 104, 182:201, 210:219, 278, 318:320, 322:327, 345, 449),]

lrts_only <- trains_sf[c(4:15, 36:52, 102, 104, 182:201, 210:219, 278, 318:320, 322:327, 345, 449),]

tmap_mode("view")

tm_shape(sg_sf) +
  tm_borders(alpha = 0.5) +
tm_shape(mrts_only) +
  tm_dots(col="coral", size=0.03, alpha=0.5) +
tm_shape(lrts_only) +
  tm_dots(col="lightblue", size=0.03, alpha=0.5) +
  tm_view(set.view = c(103.8198, 1.3521, 11))
Code to plot MRT and LRT Exit Points
tmap_mode("plot")
Code to plot bus stops
tmap_mode("view")

tm_shape(sg_sf) +
  tm_borders(alpha = 0.5) +
tm_shape(busstops_sf) +
  tm_dots(col="purple", size=0.01, alpha=0.5) +
  tm_view(set.view = c(103.8198, 1.3521, 11))
Code to plot bus stops
tmap_mode("plot")
Code to plot malls, supermarkets, hawker centres, and parks
tmap_mode("view")

tm_shape(sg_sf) +
  tm_borders(alpha = 0.5) +
tm_shape(malls_sf) +
  tm_dots(col="yellow", size=0.03, alpha=1) +
tm_shape(supermarkets_sf) +
  tm_dots(col="firebrick", size=0.03, alpha=0.7) +
tm_shape(hawkers_sf) +
  tm_dots(col="peachpuff", size=0.03, alpha=0.5) +
tm_shape(parks_sf) +
  tm_dots(col="aquamarine", size=0.03, alpha=0.3) +
  tm_view(set.view = c(103.8198, 1.3521, 11))
Code to plot malls, supermarkets, hawker centres, and parks
tmap_mode("plot")
Code to plot Childcare and eldercare facilities
tmap_mode("view")

tm_shape(sg_sf) +
  tm_borders(alpha = 0.5) +
tm_shape(childcares_sf) +
  tm_dots(col="honeydew", size=0.03, alpha=1) +
tm_shape(eldercares_sf) +
  tm_dots(col="wheat", size=0.03, alpha=0.7) +
  tm_view(set.view = c(103.8198, 1.3521, 11))
Code to plot Childcare and eldercare facilities
tmap_mode("plot")
Code to plot primary schools and kindergartens
tmap_mode("view")

tm_shape(sg_sf) +
  tm_borders(alpha = 0.5) +
tm_shape(pri_schs_sf) +
  tm_dots(col="mediumblue", size=0.03, alpha=1) +
tm_shape(kindergartens_sf) +
  tm_dots(col="darkslateblue", size=0.03, alpha=0.7) +
  tm_view(set.view = c(103.8198, 1.3521, 11))
Code to plot primary schools and kindergartens
tmap_mode("plot")

3.6 Aspatial Data Transformation

We will need to perform some data wrangling on the aspatial data set so that it can be used for the models we will be calibrating later on. Here are the steps we will be performing:

  1. transform columns storey_range and remaining_lease
  2. geocode the data set by calling the OneMapSG API endpoint

First we need to transform some of the columns as such:

  • storey_range - encode the string range as an ordered numeric value so it can be used in the model

    • we will sort the distinct storey_range values and assign each a rank. For instance, “01 TO 03” will be 1, “04 TO 06” will be 2, and so on…
  • remaining_lease - convert the string into the number of years as a numeric data type

Code to transform storey_range data
storeys <- sort(unique(resale_tbl$storey_range))
storey_order <- 1: length(storeys)
storey_range_order <- data.frame(storeys, storey_order)

resale_tbl <- left_join(resale_tbl, storey_range_order, by = c("storey_range" = "storeys"))
Code to transform remaining_lease data
# first we split the values and only take the numbers
# then we put them together (e.g., 61 years and 04 months --> 61.33)

str_list = str_split(resale_tbl$remaining_lease, " ")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
    years <- as.numeric(unlist(str_list[i])[1])
    months <- as.numeric(unlist(str_list[i])[3])
    resale_tbl$remaining_lease[i] <- years + round(months/12, 2)
  }
  else {
    years = as.numeric(unlist(str_list[i])[1])
    resale_tbl$remaining_lease[i] <- years
  }
}
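One caveat worth flagging: remaining_lease is stored as a character column, so the numeric values assigned in the loop above are silently coerced back to character. If the column is to be used directly in a model, an explicit conversion is needed; this one-liner is my own addition, not part of the original code:

Code to convert remaining_lease to numeric (assumed follow-up step)
# the loop wrote numbers into a character column - convert it explicitly
resale_tbl$remaining_lease <- as.numeric(resale_tbl$remaining_lease)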

Then we need to geocode the data. Let’s do this by calling the OneMapSG API endpoint with the address of each unit, which is essentially the block number and street name. Since the dataset is pretty clean, we do not need to do any pre-processing at this stage.

Code to geocode the train and test datasets
# GEOCODE FUNCTION
geocode <- function(blk, st) {
  url <- "https://developers.onemap.sg/commonapi/search"
  addr <- paste(blk, st, sep = " ")
  query <- list("searchVal" = addr,
                "returnGeom" = "Y",
                "getAddrDetails" = "N",
                "PageNum" = "1")
  
  response <- GET(url, query = query)
  restxt <- content(response, as = "text")
  
  result <- fromJSON(restxt) %>%
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)
  
  return(result)
}


# CREATE LAT AND LNG COLS AND POPULATE THEM FOR RESALE DATASET
resale_tbl$LATITUDE <- 0
resale_tbl$LONGITUDE <- 0

for(i in 1:nrow(resale_tbl)) {
  temp <- geocode(resale_tbl[i, 4], resale_tbl[i, 5])
  resale_tbl$LATITUDE[i] <- temp$results.LATITUDE
  resale_tbl$LONGITUDE[i] <- temp$results.LONGITUDE
}
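One caveat: unlike get_lat_lng() earlier, geocode() has no error handling, so a single address that returns no results would abort the whole loop. A defensive variant of the loop (my own sketch, not part of the original pipeline) could look like this:

Code for a defensive version of the geocoding loop (optional)
# wrap each call in tryCatch so one failed lookup doesn't abort the run
for (i in 1:nrow(resale_tbl)) {
  temp <- tryCatch(
    geocode(resale_tbl[i, 4], resale_tbl[i, 5]),
    error = function(e) data.frame(results.LATITUDE = NA, results.LONGITUDE = NA)
  )
  resale_tbl$LATITUDE[i] <- temp$results.LATITUDE
  resale_tbl$LONGITUDE[i] <- temp$results.LONGITUDE
}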

3.7 Combining the Geospatial and Aspatial Data Sets

For this section, we will be combining all the data we have collected into a single sf object. This object will then be used for our analysis. For a recap on all the features we will have in our sf object please refer to Section 1. Here are the steps for what we will be doing in this section:

  1. convert the aspatial tibble data.frame into sf object and set the appropriate CRS (i.e., EPSG: 3414)
  2. create tbl for CBD (i.e., Central Business District) - this is used to compute proximity to CBD
  3. create tbl for “Good Primary Schools” - this is used to compute proximity to good primary schools
  4. create function to calculate proximity of facility to the resale flat in question
  5. create function to calculate number of facilities within a set radius
  6. create the final sf object and write it as a .rds file for future use
Code to convert tbl data.frame into sf object and update the CRS
# the steps are the same as before
resale_sf <- st_as_sf(resale_tbl,
                             coords = c("LONGITUDE", "LATITUDE"),
                             crs = 4326) %>%
  st_transform(crs = 3414)

Let’s check to make sure the CRS has indeed been updated correctly.

Code to check CRS
st_crs(resale_sf)

For this step, we first need to identify the latitude and longitude of the CBD in Singapore. According to this Wikipedia page, the CBD refers to the central area of Singapore that lies within the Downtown Core, so we will use the coordinates of the Downtown Core. The coordinates were lifted from Google Maps. With this we can create cbd_sf.

Code to create cbd_sf
lat <- 1.2804523669386627
lng <- 103.85563624594974

cbd_sf <- data.frame(lat, lng) %>%
  st_as_sf(coords = c("lng", "lat"),
           crs = 4326) %>%
  st_transform(crs = 3414)

There is no need to validate this data: cbd_sf contains only a single point, so we can simply inspect the environment variables to confirm it has been populated accurately.

Now, we will create gd_pri_schs_sf, a subset of pri_schs_sf that keeps only the best primary schools. I took the top 9 Gifted Education Programme (GEP) schools mentioned in schoolbell’s blog to be the “good primary schools”. Let’s create an sf object for these schools.

Code to create gd_pri_schs_sf
# simply use the c() function to combine all the relevant rows you want
# please refer to the schoolbell blog page linked above for the 9 schools
gd_pri_schs_sf <- pri_schs_sf[c(132, 95, 146, 23, 66, 96, 154, 9, 127),]
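Because the rows are selected by hard-coded indices that depend on the row order of pri_schs_sf, it is worth printing the selected names to confirm they match the nine intended schools (a small check of my own):

Code to verify the selected schools
gd_pri_schs_sf$School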

For this step, we will be using st_distance() to compute the distance matrix and rowMins() (from the matrixStats package) to get the distance to the nearest facility.

Code for prox_fn()
prox_fn <- function(df1, df2, varname) {
  # distance matrix between every flat in df1 and every facility in df2
  dist <- st_distance(df1, df2) %>%
    drop_units()
  
  # distance to the nearest facility, stored under the supplied column name
  df1[, varname] <- rowMins(dist)
  return(df1)
}

We will be using a similar function for the next step. Note that instead of the proximity, we now want the number of facilities within a radius, so we use rowSums() to count the entries of dist that are less than or equal to the radius used.

Code for count_within_radius()
count_within_radius <- function(df1, df2, varname, radius) {
  
  # calculate the distance, drop the units and save the dist as a data.frame
  dist <- st_distance(df1, df2) %>%
    drop_units() %>%
    as.data.frame()
  
  # get the count of rows where dist is less than or equal to radius
  df1[, varname] <- rowSums(dist <= radius)
  return(df1)
}

Running the prox_fn() function

Code to run prox_fn() for train dataset
resale_sf <- prox_fn(resale_sf, cbd_sf, "PROX_CBD") %>%
  prox_fn(., eldercares_sf, "PROX_ELDERCARE") %>%
  prox_fn(., hawkers_sf, "PROX_HAWKER") %>%
  prox_fn(., trains_sf, "PROX_MRT") %>%
  prox_fn(., parks_sf, "PROX_PARK") %>%
  prox_fn(., gd_pri_schs_sf, "PROX_GDPRISCH") %>%
  prox_fn(., malls_sf, "PROX_MALL") %>%
  prox_fn(., supermarkets_sf, "PROX_SUPERMKT")

Running the count_within_radius() function. Please note that the distance for all the sf objects is measured in metres. This can be seen from the results of Section 3.4, under “Transform CRS”.

Code to run count_within_radius() for train dataset
# number of kindergartens within 350m
resale_sf <- count_within_radius(resale_sf, kindergartens_sf, "NUM_KNDRGRTN", 350) %>%
  
  # number of childcare centres within 350m
  count_within_radius(., childcares_sf, "NUM_CHLDCRE", 350) %>%
  
  # number of bus stops within 350m
  count_within_radius(., busstops_sf, "NUM_BUSSTOPS", 350) %>%
  
  # number of primary schools within 1km (i.e., 1000m)
  count_within_radius(., pri_schs_sf, "NUM_PRISCHS", 1000)

Let’s check for any missing (NA) values in the dataset.

Code to check for NULL values
sum(is.na(resale_sf))
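
sum(is.na()) only gives the total count. If it ever comes back non-zero, a per-column breakdown (a small sketch) makes it easier to locate the culprit:

Code to count NA values per column
# drop the geometry column first, then count missing values per column
colSums(is.na(st_drop_geometry(resale_sf)))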

3.8 Final Checks

Let’s check the dataset one last time before we move on…

Here is a list of the features that we will be including into the model:

  1. Structural factors
    • Area of unit

    • Floor Level

    • Remaining lease

    • Age of unit

  2. Locational factors
    • Prox. to CBD

    • Prox. to eldercare

    • Prox. to hawker centres

    • Prox. to MRT

    • Prox. to park

    • Prox. to good primary school

    • Prox. to shopping mall

    • Prox. to supermarket

    • Num. of kindergartens within 350m

    • Num. of childcare centres within 350m

    • Num. of bus stops within 350m

    • Num. of primary schools within 1000m

We should check to make sure that the dataset has these features and that the values’ data types are valid. While many people use glimpse() to view their data, I prefer doing it this way (see code below) for datasets that have many columns and rows.

Code to see all columns and data types
resale_sf

Two issues have been identified - let’s fix them.

One of the features is missing. Let’s add it in. I have calculated the age of the unit by taking the year that it was put on sale and subtracting the year that the lease started. For this step, we will be using the mutate() function to create a new column.

Code to create AGE column
resale_sf <- resale_sf %>%
  # month is of the form "YYYY-MM": take the sale year and subtract the lease start year
  mutate(AGE = as.numeric(str_sub(month, 1, 4)) - lease_commence_date)

The remaining_lease column data should be of a numeric type. We will use the as.numeric() function for this.

Code to convert remaining_lease to numeric
resale_sf$remaining_lease <- as.numeric(resale_sf$remaining_lease)
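
One caveat: in the raw data.gov.sg download, remaining_lease for transactions from 2017 onwards is stored as free text such as "61 years 04 months", in which case as.numeric() alone would return NAs. If that applies to your copy of the data, here is a rough parsing sketch, assuming that exact text format:

Code to parse a text-based remaining_lease (only if needed)
# extract the year and month counts, then combine into a decimal number of years
yrs  <- as.numeric(str_extract(resale_sf$remaining_lease, "\\d+(?= year)"))
mths <- as.numeric(str_extract(resale_sf$remaining_lease, "\\d+(?= month)"))
mths[is.na(mths)] <- 0  # some rows omit the months portion
resale_sf$remaining_lease <- yrs + mths / 12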

4 Let’s take a Break!

That was a lot… so what have we done so far? Well, all of that was just the data preparation (and some data exploration) for the two models we will be building soon. We will build a Non-Spatial Multiple Linear Regression Model and compare its performance against a Geographically Weighted Random Forest Model (GWRF). That’s not all: we will also be looking at how to improve the GWRF.

Before we dive into our analysis and predicting resale flat prices, let us use the write_rds() function to write the datasets we have created so far into .rds files. This is done for easier access and better data management.

If you are following along, please remember to structure your folders accordingly. Do note that I have a parent folder data which contains geospatial and aspatial folders for the respective data collected. The data folder also has a model folder where I store all the .rds files that I would like to retrieve later on.

-- data

  • aspatial (contains all aspatial data - i.e., resale flat prices dataset)

  • geospatial (contains all geospatial data extracted)

  • model (contains all .rds files)

Code to write output into rds files
# write the output into rds files - for easier access and data management
write_rds(resale_sf, "data/model/resale_sf.rds")

5 Non-Spatial Multiple Linear Regression (MLR)

To truly understand the benefits of using geographically weighted models, let us compare them with models that do not account for spatial variability and heterogeneity like the Non-Spatial Multiple Linear Regression Model.

But before that, let’s start by answering whether or not we even need to account for spatial variability/autocorrelation in the first place…

Code to read the saved .rds file
resale_sf <- read_rds("data/model/resale_sf.rds")

5.1 Statistical Point Map

Code to plot out the resale prices for all 2022 flats
all_flats_2022 <- resale_sf %>% filter(month >= "2022-01" & month < "2023-01")

tmap_mode("view")

tm_shape(all_flats_2022) +
  tm_dots(col = "resale_price", alpha = 0.6, style="quantile") +
  tm_view(set.view = c(103.8198, 1.3521, 11)) +
  tm_basemap("OpenStreetMap")
Code to switch tmap back to plot mode
tmap_mode("plot")

The above output is a simple plot of the resale prices of all the 3-room flats sold in 2022. We can see clusters across the map, with the larger ones in the north-eastern and eastern regions of Singapore. While this is not statistical proof that resale prices are affected by location, it does show that prices vary across space.

This information shows us why geographically-weighted models are important.

5.2 Visualising the relationships between independent variables

Code to remove some columns before plotting the correlation matrix
resale_for_corrplot <- resale_sf %>% select(c(7, 10, 12, 14:26)) %>% st_drop_geometry()

# rename everything to uppercase
names(resale_for_corrplot) <- toupper(names(resale_for_corrplot))

We will be using the corrplot library for this.

Code to draw out correlation matrix
# using corrplot to visualise correlations
corrplot::corrplot(cor(resale_for_corrplot), diag = FALSE, order = "AOE",
         tl.pos = "td", tl.cex = 0.5, method = "number", type = "upper")

AGE and REMAINING_LEASE are highly correlated (roughly ±0.8):

This makes sense, since the older the flat, the fewer the years left on its lease.

So we will remove AGE from our analysis.

While we did not do this, I found out that one way to improve this part would be to compare AGE and REMAINING_LEASE to see which of the two has the higher average correlation with the remaining variables, and then remove that one.
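
Here is a small sketch of that comparison, reusing resale_for_corrplot from above (the exact column set depends on your select() call, so treat this as illustrative):

Code to compare the average correlations of AGE and REMAINING_LEASE
# mean absolute correlation of each candidate with all the other predictors
cor_mat <- abs(cor(resale_for_corrplot))
others  <- setdiff(colnames(cor_mat), c("AGE", "REMAINING_LEASE"))

mean(cor_mat["AGE", others])             # drop AGE if this value is higher
mean(cor_mat["REMAINING_LEASE", others]) # drop REMAINING_LEASE otherwise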

Code to remove AGE
resale_sf <- resale_sf %>% select(-AGE)

5.3 Building the MLR model

First, we need to split the data into train and test data sets. We will be using the resale flats from June to December 2022 as the train data set and those from January to the end of February 2023 as the test data set.

Ideally we should use at least 2 years worth of data for the train data set. Since my laptop is unable to compute such large amounts of data, I have opted to reduce the data set just so I can complete this exercise.

Code to split train and test dataset
train_sf <- resale_sf %>% filter(month >= "2022-06" & month < "2023-01")
test_sf <- resale_sf %>% filter(month >= "2023-01" & month <= "2023-02")
Code to write train and test data into .rds files
# write them into .rds files for easy access
write_rds(train_sf, "data/model/train_sf.rds")
write_rds(test_sf, "data/model/test_sf.rds")
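
A quick sanity check on the split never hurts; a minimal sketch:

Code to verify the train/test split
# confirm the number of rows and the month ranges in each set
nrow(train_sf); range(train_sf$month)
nrow(test_sf); range(test_sf$month)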

We will be using the lm() function to build the model.

Code to build the Multiple Linear Regression Model
# using lm() to get the formula
resale_price.mlr <- lm(formula = resale_price ~ floor_area_sqm + storey_order + remaining_lease +
                         PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_GDPRISCH +
                         PROX_MALL + PROX_SUPERMKT + NUM_KNDRGRTN + NUM_CHLDCRE + NUM_BUSSTOPS + NUM_PRISCHS,
                       data = train_sf)


summary(resale_price.mlr)

Call:
lm(formula = resale_price ~ floor_area_sqm + storey_order + remaining_lease + 
    PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT + PROX_PARK + 
    PROX_GDPRISCH + PROX_MALL + PROX_SUPERMKT + NUM_KNDRGRTN + 
    NUM_CHLDCRE + NUM_BUSSTOPS + NUM_PRISCHS, data = train_sf)

Residuals:
    Min      1Q  Median      3Q     Max 
-196561  -26764   -3316   23368  471002 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -1.316e+05  9.849e+03 -13.359  < 2e-16 ***
floor_area_sqm   5.218e+03  1.322e+02  39.466  < 2e-16 ***
storey_order     1.045e+04  4.412e+02  23.682  < 2e-16 ***
remaining_lease  3.897e+03  5.941e+01  65.592  < 2e-16 ***
PROX_CBD        -7.615e+00  2.605e-01 -29.236  < 2e-16 ***
PROX_ELDERCARE  -3.662e+00  1.505e+00  -2.434   0.0150 *  
PROX_HAWKER     -6.476e+00  2.073e+00  -3.124   0.0018 ** 
PROX_MRT        -1.564e+01  2.143e+00  -7.297 3.58e-13 ***
PROX_PARK       -1.276e+01  2.193e+00  -5.817 6.50e-09 ***
PROX_GDPRISCH   -1.473e+00  3.705e-01  -3.975 7.18e-05 ***
PROX_MALL       -1.453e+01  2.341e+00  -6.204 6.09e-10 ***
PROX_SUPERMKT    3.076e+01  4.179e+00   7.360 2.24e-13 ***
NUM_KNDRGRTN     6.582e+03  9.963e+02   6.606 4.50e-11 ***
NUM_CHLDCRE     -1.921e+03  4.713e+02  -4.076 4.68e-05 ***
NUM_BUSSTOPS     1.296e+03  3.016e+02   4.297 1.78e-05 ***
NUM_PRISCHS     -2.656e+03  5.958e+02  -4.457 8.55e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44990 on 3743 degrees of freedom
Multiple R-squared:  0.7325,    Adjusted R-squared:  0.7315 
F-statistic: 683.4 on 15 and 3743 DF,  p-value: < 2.2e-16
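
Before accepting the model, it is also worth confirming that no serious multicollinearity remains among the predictors. Since the car package is already loaded, a quick sketch:

Code to check the variance inflation factors
# VIF values above roughly 5-10 would warrant a second look at the predictors
car::vif(resale_price.mlr)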

With that, we have the Non-Spatial Multiple Linear Regression Model. Let’s save it into an .rds file so we can retrieve it whenever we want.

Code to write resale_price.mlr into a .rds file
write_rds(resale_price.mlr, "data/model/resale_price_mlr.rds")

5.4 Making predictions using MLR

Let’s use the model to predict the resale_price using test data. We will be using the predict() function for this.

Code to read files that we have saved earlier
train_sf <- read_rds("data/model/train_sf.rds")
test_sf <- read_rds("data/model/test_sf.rds")
resale_price.mlr <- read_rds("data/model/resale_price_mlr.rds")
Code to predict resale_price using lm() model and test data
mlr_pred <- predict(resale_price.mlr,
                    test_sf)

# convert to tbl df
mlr_pred_df <- as.data.frame(mlr_pred)
Code to write mlr_pred_df into a .rds file
# write it as an rds file to be used later
write_rds(mlr_pred_df, "data/model/mlr_pred_df.rds")

5.5 Calculating RMSE for MLR

Let’s calculate the root mean square error (RMSE) of mlr_pred. This will be the main point of comparison with the RMSE of the GWRF model’s prediction. We will be using the rmse() function from the Metrics package for this.

Code to calculate RMSE of mlr_pred
# we will need to bind the test data and the mlr_pred to calculate the RMSE
mlr_test_pred <- cbind(test_sf, mlr_pred_df)
Code to write mlr_test_pred into a .rds file
# write this as an rds file in case we want to access it later
write_rds(mlr_test_pred, "data/model/mlr_test_pred.rds")
Code to calculate RMSE of MLR model
# calculate the RMSE of the Non-Spatial MLR model
rmse(mlr_test_pred$resale_price,
     mlr_test_pred$mlr_pred)
[1] 45865.15

Please note that the RMSE for the Non-Spatial Multiple Linear Regression Model is 45865.15.
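
For reference, rmse() is just a convenience wrapper; computing the metric by hand makes it explicit what we are comparing. A one-line equivalent:

Code to compute the RMSE manually
# RMSE = sqrt(mean((actual - predicted)^2))
sqrt(mean((mlr_test_pred$resale_price - mlr_test_pred$mlr_pred)^2))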

6 Geographically Weighted Random Forest Model (GWRF)

6.1 Preparing coordinates data

Code to extract x, y coordinates of the train and test datasets
# we will need to use the coordinate data for our GWRF functions
coords_train <- st_coordinates(train_sf)
coords_test <- st_coordinates(test_sf)
Code to write coords_train and coords_test into .rds files
# write them into rds files
write_rds(coords_train, "data/model/coords_train.rds")
write_rds(coords_test, "data/model/coords_test.rds")

6.2 Computing adaptive bandwidth for GWRF

We will be using the grf.bw() function from the SpatialML package for this.

Code to drop geometry of train data
# drop geometry
train_nogeo <- train_sf %>% st_drop_geometry()
Code to write train_nogeo into a .rds file
# write them into rds files
write_rds(train_nogeo, "data/model/train_nogeo.rds")
Code to get the optimal adaptive bandwidth for our model
# remember to set seed for simulations
set.seed(1234)
bw_gwrf_adaptive <- grf.bw(formula = resale_price ~ floor_area_sqm +
                            storey_order + remaining_lease +
                            PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT +
                            PROX_PARK + PROX_GDPRISCH +
                            PROX_MALL + PROX_SUPERMKT + NUM_KNDRGRTN + NUM_CHLDCRE 
                           + NUM_BUSSTOPS + NUM_PRISCHS,
                          train_nogeo,
                          kernel = "adaptive",
                          coords = coords_train,
                          trees = 30)

# write this into rds file
write_rds(bw_gwrf_adaptive, "data/model/bw_gwrf_adaptive.rds")

Hm… unfortunately, even after over 30 hours of waiting, the code run did not conclude as the laptop I am currently running this on is not very powerful 😟

Hence, for the purpose of completing this exercise I have no other choice but to estimate the bandwidth for the calibration of the Geographically Weighted Random Forest Model.

When I ran the code, the first bandwidth it tried was 188, and by the time I stopped it, it had checked up to 969. After looking through all the bandwidths tested, I found that a bandwidth of 739 yielded the highest R² score of 0.87. Hence, I will be using 739 as the bandwidth for my GWRF model.

Here is a screenshot of the output from the above code-chunk:

Please note that this is NOT the correct way of building the model and will lead to a much less accurate result. Ideally, you should wait for the code chunk to finish running and use the best.bw value of its output as the bandwidth for the GWRF model calibration.

6.3 Building the model

We will be using the grf() function from the SpatialML package, with ntree = 30 for this particular model.

The ntree argument sets the number of trees grown for each local random forest. We will be modifying this argument in Section 8 to see if we can improve the accuracy of the model.

Code using grf() to calibrate the GWRF model
set.seed(1234)
gwrf_adaptive <- grf(formula = resale_price ~ floor_area_sqm +
                            storey_order + remaining_lease +
                            PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT +
                            PROX_PARK + PROX_GDPRISCH +
                            PROX_MALL + PROX_SUPERMKT + NUM_KNDRGRTN + NUM_CHLDCRE 
                           + NUM_BUSSTOPS + NUM_PRISCHS,
            dframe=train_nogeo, 
            bw=739,
            kernel="adaptive",
            coords=coords_train,
            ntree = 30)
Ranger result

Call:
 ranger(resale_price ~ floor_area_sqm + storey_order + remaining_lease +      PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT + PROX_PARK +      PROX_GDPRISCH + PROX_MALL + PROX_SUPERMKT + NUM_KNDRGRTN +      NUM_CHLDCRE + NUM_BUSSTOPS + NUM_PRISCHS, data = train_nogeo,      num.trees = 30, mtry = 5, importance = "impurity", num.threads = NULL) 

Type:                             Regression 
Number of trees:                  30 
Sample size:                      3759 
Number of independent variables:  15 
Mtry:                             5 
Target node size:                 5 
Variable importance mode:         impurity 
Splitrule:                        variance 
OOB prediction error (MSE):       744755666 
R squared (OOB):                  0.9011935 
 floor_area_sqm    storey_order remaining_lease        PROX_CBD  PROX_ELDERCARE 
   3.559451e+12    3.386695e+12    9.740823e+12    4.518961e+12    4.936195e+11 
    PROX_HAWKER        PROX_MRT       PROX_PARK   PROX_GDPRISCH       PROX_MALL 
   8.892520e+11    9.166803e+11    5.840046e+11    1.769627e+12    7.871447e+11 
  PROX_SUPERMKT    NUM_KNDRGRTN     NUM_CHLDCRE    NUM_BUSSTOPS     NUM_PRISCHS 
   3.933218e+11    2.016603e+11    2.411770e+11    1.919996e+11    2.259632e+11 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-305000.0  -15010.4    -152.2    -718.5   12429.3  374500.0 
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-47950.56  -3516.57    -83.33     -0.15   3406.49  50045.56 
                         Min          Max         Mean          StD
floor_area_sqm   16086975278 4.383519e+12 7.656887e+11 1.027084e+12
storey_order     43229925727 3.407597e+12 5.949948e+11 6.456105e+11
remaining_lease 289997155657 7.750744e+12 2.131441e+12 1.903177e+12
PROX_CBD         37918903243 1.635315e+12 3.003264e+11 2.329608e+11
PROX_ELDERCARE   13782170390 5.807095e+11 1.196824e+11 1.152210e+11
PROX_HAWKER      12530717621 1.166765e+12 1.951931e+11 2.130978e+11
PROX_MRT         20366871725 1.155758e+12 2.138980e+11 2.059923e+11
PROX_PARK        22041622500 1.241832e+12 1.783101e+11 1.811973e+11
PROX_GDPRISCH    28910227819 1.082675e+12 2.325057e+11 1.563032e+11
PROX_MALL        16992685964 1.029310e+12 1.368653e+11 1.269357e+11
PROX_SUPERMKT    14429531277 2.977781e+12 1.905317e+11 3.593707e+11
NUM_KNDRGRTN      1492580218 3.553448e+11 3.016804e+10 3.500185e+10
NUM_CHLDCRE       5472959683 2.419665e+11 4.702208e+10 3.224005e+10
NUM_BUSSTOPS      4496659177 3.282395e+11 4.635190e+10 4.338273e+10
NUM_PRISCHS       2696029308 3.940998e+11 4.118088e+10 3.560561e+10
Code to write gwrf_adaptive model into a .rds file
# write this into rds file to be used later
write_rds(gwrf_adaptive, "data/model/gwrf_adaptive.rds")

6.4 Making predictions using GWRF

We will be using predict.grf() to predict the resale value with the test data and gwrf_adaptive.

The local.w and global.w arguments tell the predict.grf() function how much weight to put on the local and the global model predictions respectively, allowing semi-local predictions.
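
To make the weighting concrete, here is a toy illustration of the blending idea (my understanding of the semi-local prediction, not SpatialML’s internal code):

Code to illustrate blending local and global predictions
# each fused prediction is a weighted average of the two forests' outputs
blend_pred <- function(local_pred, global_pred, local_w, global_w) {
  local_w * local_pred + global_w * global_pred
}

blend_pred(500000, 480000, 0.7, 0.3)  # leans towards the local estimate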

Code to create test_nogeo
# combine test data with coordinates data
test_nogeo <- cbind(test_sf, coords_test) %>%
  st_drop_geometry()
Code using predict.grf() to predict values
# for now we will be using local.w = 1
gwrf_pred <- predict.grf(gwrf_adaptive,
                         test_nogeo,
                         x.var.name = "X",
                         y.var.name = "Y",
                         local.w = 1,
                         global.w = 0)

# convert to tbl df
gwrf_pred_df <- as.data.frame(gwrf_pred)
Code to write gwrf_pred_df and test_nogeo into .rds files
# write this into an rds file to be used later
write_rds(gwrf_pred_df, "data/model/gwrf_pred_df.rds")
write_rds(test_nogeo, "data/model/test_nogeo.rds")
Code to calculate RMSE of gwrf_pred
# we will need to bind the test data and gwrf_pred to calculate the RMSE
gwrf_test_pred <- cbind(test_nogeo, gwrf_pred_df)

# write this as an rds file in case we want to access it later
write_rds(gwrf_test_pred, "data/model/gwrf_test_pred.rds")

# calculate the RMSE of the Geographically Weighted Regression Model
rmse(gwrf_test_pred$resale_price,
     gwrf_test_pred$gwrf_pred)
[1] 27312.09

Please take note that the RMSE of the Geographically Weighted Model is 27312.09.

7 Comparison between Non-Spatial MLR and Geographically Weighted Random Forest Models

As noted in Section 5.5, the MLR model has an RMSE of 45865.15, while the GWRF model’s is 27312.09. This is a huge improvement: the GWRF model has considerably less error than the MLR one. This tells us that if we want to predict the resale prices of HDB flats in Singapore, we need to account for spatial factors. It will be easier to visualise the fit if we plot the predicted values against the actual resale prices on a scatter plot, so let’s do that.

7.1 Visualising the predicted values

We will be plotting scatter plots of predicted against actual values for both the MLR and GWRF models for a better comparison between the two.

Code to visualise the actual values against the predicted values
# plot for GRF
gwrf_plot <- ggplot(data = gwrf_test_pred, 
                    aes(x = gwrf_pred, 
                        y = resale_price)
                    ) + geom_point() + geom_smooth(method=NULL) + ggtitle("GRF Residual Scatter Plot")

# plot for MLR
mlr_plot <- ggplot(data = mlr_test_pred, 
                    aes(x = mlr_pred, 
                        y = resale_price)
                    ) + geom_point() + geom_smooth(method="lm") + ggtitle("MLR Residual Scatter Plot")

# arrange the plots
grid.arrange(mlr_plot, gwrf_plot, ncol = 2)

Since the GWRF plot (on the right) has its scatter points (i.e., the predicted values) lying closer to the diagonal line (i.e., where predicted equals actual), it is the better predictive model.

8 Improving the GWRF model

In this section, we aim to improve the existing GWRF model. I will be attempting to do this by:

  1. Changing the number of trees (i.e., the ntree argument in the grf() function)
  2. Changing the weightage between the local and global model predictors (i.e., local.w and global.w arguments) in the predict.grf() function.

8.1 Increasing number of trees

My hypothesis is that the predictive performance tends to increase as the number of trees increases. Let’s test to see if this is true.

We will be creating a function that takes in the number of trees and return the RMSE of the prediction model that uses that number of trees. We will then run the function on these options: 30, 40, 50, 60, 70.

Code to initialise variable with all the trees we want to test on
all_trees <- list(30,40,50,60,70)
Code to create ntree_test function
ntree_test <- function(tree) {
  set.seed(1234)
  gwrf_adaptive <- grf(formula = resale_price ~ floor_area_sqm +
                            storey_order + remaining_lease +
                            PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT +
                            PROX_PARK + PROX_GDPRISCH +
                            PROX_MALL + PROX_SUPERMKT + NUM_KNDRGRTN + NUM_CHLDCRE 
                           + NUM_BUSSTOPS + NUM_PRISCHS,
                          dframe=train_nogeo,
                          
                          # using the same bw we got from the earlier section
                          bw=739,
                          kernel="adaptive",
                          coords=coords_train,
                          ntree = tree)
  
  # for now we will be using local.w = 1
  gwrf_pred <- predict.grf(gwrf_adaptive,
                           test_nogeo,
                           x.var.name = "X",
                           y.var.name = "Y",
                           local.w = 1,
                           global.w = 0)
  
  # convert to tbl df
  gwrf_pred_df <- as.data.frame(gwrf_pred)
  
  # we will need to bind the test data and gwrf_pred to calculate the RMSE
  gwrf_test_pred <- cbind(test_nogeo, gwrf_pred_df)
  
  # calculate the RMSE of the Geographically Weighted Regression Model
  RMSE <- rmse(gwrf_test_pred$resale_price,
       gwrf_test_pred$gwrf_pred)
  
  return(RMSE)
  
}
Code to run ntree_test() on all_trees
tree_output <- list()

for (tree in all_trees) {
  # store each RMSE under the number of trees used, e.g. tree_output[["50"]]
  tree_output[[as.character(tree)]] <- ntree_test(tree)
}
Code to show the different rmse values for different number of trees input into grf()
print(tree_output)

Results

Hm… the results are not what I expected, as the RMSE fluctuates quite a bit rather than falling steadily. Still, we can conclude that (for this data set and model) using 50 trees gives a better result than using 30 trees, as its RMSE of 26870.06 is lower than 27312.09.
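
To make the fluctuation easier to see, we can plot the RMSE values against the number of trees; a small sketch, assuming tree_output holds one RMSE per entry of all_trees:

Code to plot RMSE against number of trees
# collect the results into a data.frame and plot them
rmse_df <- data.frame(ntree = unlist(all_trees),
                      rmse  = unlist(tree_output))

ggplot(rmse_df, aes(x = ntree, y = rmse)) +
  geom_line() +
  geom_point() +
  ggtitle("RMSE against number of trees")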

8.2 Changing weightage of local and global model predictors

Now we will be trying to see if we can improve the GWRF model by changing the weight of the local and global model predictors.

Code to initialise variable with all different weights we want to use
local_weights <- list(1, 0.7, 0.3, 0)
global_weights <- list(0, 0.3, 0.7, 1)
Code to create pred_weight_test() function
pred_weight_test <- function(local_w,global_w){
  set.seed(1234)
  gwrf_adaptive <- grf(formula = resale_price ~ floor_area_sqm +
                            storey_order + remaining_lease +
                            PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT +
                            PROX_PARK + PROX_GDPRISCH +
                            PROX_MALL + PROX_SUPERMKT + NUM_KNDRGRTN + NUM_CHLDCRE 
                           + NUM_BUSSTOPS + NUM_PRISCHS,
                          dframe=train_nogeo,
                          
                          # using the same bw we got from the earlier section
                          bw=739,
                          kernel="adaptive",
                          coords=coords_train,
                          
                          # we will use 50 trees - check previous section
                          ntree = 50)
  
  # use the local and global weights supplied to the function
  gwrf_pred <- predict.grf(gwrf_adaptive,
                           test_nogeo,
                           x.var.name = "X",
                           y.var.name = "Y",
                           local.w = local_w,
                           global.w = global_w)
  
  # convert to tbl df
  gwrf_pred_df <- as.data.frame(gwrf_pred)
  
  # we will need to bind the test data and gwrf_pred to calculate the RMSE
  gwrf_test_pred <- cbind(test_nogeo, gwrf_pred_df)
  
  # calculate the RMSE of the Geographically Weighted Regression Model
  RMSE <- rmse(gwrf_test_pred$resale_price,
       gwrf_test_pred$gwrf_pred)
  
  return(RMSE)
}
Code to run pred_weight_test() on each pair of weights
weight_output <- list()

for (i in seq_along(local_weights)) {
  # label each RMSE by its local/global weight combination, e.g. "0.7/0.3"
  label <- paste(local_weights[[i]], global_weights[[i]], sep = "/")
  weight_output[[label]] <- pred_weight_test(local_weights[[i]], global_weights[[i]])
}
Code to show the different rmse values for each weight proportion in predict.grf()
print(weight_output)

Results

The results show that this model improves as we give a higher weightage to the global model predictor. Hence, we have learnt that for this dataset and model, a higher weightage should be given to the global model predictor if we want the lowest RMSE.

Do take note, however, that the RMSE worsens again when zero weightage is given to the local model predictor.

9 References