An interactive data visualisation featuring Singapore Demographic Trend
This take-home exercise requires the application of appropriate interactivity and animation methods to design an age-sex pyramid based data visualisation to show the changes of demographic structure of Singapore by age cohort and gender between 2000-2020 at planning area level.
For this task, the data set used is the Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010 and Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020 The data set is obtained from Department of Statistics.
The code chunk below is used to install and load the required packages onto RStudio.
packages = c('gganimate', 'tidyverse', 'gifski', 'plotly')
for(p in packages){
if(!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The source file is in csv format, hence read_csv of readr package is used to import the dataset.
Popdata2010 <- read_csv("data/respopagesextod2000to2010.csv")
Popdata2020 <- read_csv("data/respopagesextod2011to2020.csv")
glimpse(Popdata2010)
Rows: 1,040,592
Columns: 7
$ PA <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio",~
$ SZ <chr> "Cheng San", "Cheng San", "Cheng San", "Cheng San", "Ch~
$ AG <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to~
$ Sex <chr> "Males", "Males", "Males", "Males", "Males", "Males", "~
$ TOD <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-R~
$ Pop <dbl> 20, 480, 220, 80, 0, 0, 0, 0, 20, 390, 200, 90, 0, 0, 0~
$ Time <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2~
glimpse(Popdata2020)
Rows: 984,656
Columns: 7
$ PA <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio",~
$ SZ <chr> "Ang Mo Kio Town Centre", "Ang Mo Kio Town Centre", "An~
$ AG <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to~
$ Sex <chr> "Males", "Males", "Males", "Males", "Males", "Males", "~
$ TOD <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-R~
$ Pop <dbl> 0, 10, 30, 50, 0, 0, 40, 0, 0, 10, 30, 60, 0, 0, 40, 0,~
$ Time <dbl> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2~
After confirming that the 2 datasets, Popdata2010 and Popdata2020, contain the same variables, rbind function in Base R is used to combine both datasets into 1 dataset.
PopdataAll <- rbind(Popdata2010,Popdata2020)
glimpse(Popdata2010)
Rows: 1,040,592
Columns: 7
$ PA <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio",~
$ SZ <chr> "Cheng San", "Cheng San", "Cheng San", "Cheng San", "Ch~
$ AG <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to~
$ Sex <chr> "Males", "Males", "Males", "Males", "Males", "Males", "~
$ TOD <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-R~
$ Pop <dbl> 20, 480, 220, 80, 0, 0, 0, 0, 20, 390, 200, 90, 0, 0, 0~
$ Time <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2~
Drawing reference from Take-home Ex01, Pop values of Males are transformed to negative values, and binded with the Female dataset using the following code chunk:
PopdataAll_M <- PopdataAll %>%
filter(`Sex` == "Males") %>%
mutate (Pop = -Pop)
#
PopdataAll_F <- PopdataAll %>%
filter(`Sex` == "Females")
PopdataAll_MF <- rbind(PopdataAll_M,PopdataAll_F)
glimpse(PopdataAll_MF)
Rows: 2,025,248
Columns: 7
$ PA <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio",~
$ SZ <chr> "Cheng San", "Cheng San", "Cheng San", "Cheng San", "Ch~
$ AG <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to~
$ Sex <chr> "Males", "Males", "Males", "Males", "Males", "Males", "~
$ TOD <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-R~
$ Pop <dbl> -20, -480, -220, -80, 0, 0, 0, 0, -20, -620, -260, -120~
$ Time <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2~
The following code chunk uses factor() and as.integer() from Base R to convert the variables AG and PG to factors, and Time as integers. This step is required for the data groups to be ordered correctly in the chart (e.g. for Time, when using {frame_time} function, it will become a continuous variable with decimal values if not converted to integers).
PopdataAll_MF$AG <- factor (PopdataAll_MF$AG, levels = unique(PopdataAll_MF$AG))
PopdataAll_MF$PA <- factor (PopdataAll_MF$PA, levels = unique(PopdataAll_MF$PA))
PopdataAll_MF$Time <-as.integer(PopdataAll_MF$Time)
Next we compute the frequency count of the total population by planning area, time, age group and sex. In the code chunk below, group_by() of dplyr package is used to group the data by PA, Time, AG and Sex, and summarise() from the same package is used to count the number of residents.
PopdataAll_Grp <- PopdataAll_MF %>%
group_by(`PA`, `Time`, `AG`, `Sex`) %>%
summarise('Count'= sum(`Pop`)) %>%
ungroup()
head(PopdataAll_Grp)
# A tibble: 6 x 5
PA Time AG Sex Count
<fct> <int> <fct> <chr> <dbl>
1 Ang Mo Kio 2000 0_to_4 Females 4460
2 Ang Mo Kio 2000 0_to_4 Males -4830
3 Ang Mo Kio 2000 5_to_9 Females 5880
4 Ang Mo Kio 2000 5_to_9 Males -5890
5 Ang Mo Kio 2000 10_to_14 Females 5520
6 Ang Mo Kio 2000 10_to_14 Males -5710
As the requirement is to plot the pyramid by planning area, 4 planning areas (out of a total of 57 planning areas) are randomly selected using subset() of Base R.
[1] Ang Mo Kio Bedok
[3] Bishan Boon Lay/Pioneer
[5] Bukit Batok Bukit Merah
[7] Bukit Panjang Bukit Timah
[9] Central Water Catchment Changi
[11] Changi Bay Choa Chu Kang
[13] Clementi Downtown Core
[15] Geylang Hougang
[17] Jurong East Jurong West
[19] Kallang Lim Chu Kang
[21] Mandai Marina East
[23] Marina South Marine Parade
[25] Museum Newton
[27] North-Eastern Islands Novena
[29] Orchard Outram
[31] Pasir Ris Paya Lebar
[33] Punggol Queenstown
[35] River Valley Rochor
[37] Seletar Sembawang
[39] Sengkang Serangoon
[41] Simpang Singapore River
[43] Southern Islands Straits View
[45] Sungei Kadut Tampines
[47] Tanglin Tengah
[49] Toa Payoh Tuas
[51] Western Islands Western Water Catchment
[53] Woodlands Yishun
[55] Not Stated Boon Lay
[57] Pioneer
57 Levels: Ang Mo Kio Bedok Bishan Boon Lay/Pioneer ... Pioneer
To plot the static chart, drawing reference from Take-home Ex01, the relevant codes from ggplot2 package are used.
xbrks <- seq(-300000, 300000, 5000)
xlabls <- paste0(as.character(c(seq(300, 0, -5), seq(5, 300, 5))))
p <- ggplot(subset(PopdataAll_Grp, PA %in% c("Ang Mo Kio", "Sengkang", "Tampines", "Jurong East")),
aes (x = AG, y = Count, fill = Sex)) +
geom_bar(stat = "identity", width = .6 ) +
scale_y_continuous(breaks = xbrks, labels = xlabls, name = "Population Size ('000)") +
xlab("Age Groups") +
coord_flip() +
theme_bw()
Using facet_wrap of ggplot2, we are able to plot the population pyramids of the 4 different planning areas into a 2 by 2 panel. ggplotly() is then used to convert to an interactive plot.
xbrks1 <- seq(-300000, 300000, 50000)
xlabls1 <- paste0(as.character(c(seq(300, 0, -50), seq(50, 300, 50))))
p1 <- p + facet_wrap (~ PA, nrow = 2, ncol = 2) +
xlab("Age Groups") +
scale_y_continuous(breaks = xbrks1, labels = xlabls1, name = "Population Size ('000)")
ggplotly(p1)
While this plot using ggplotly allows interactivity, such as viewing the data for any selected data point, it does not show the changes in data over the years.
To show changes in the demographic over the years by respective planning area, functions from gganimate package is used:
p2 <- p +
facet_wrap (~ PA, nrow = 2, ncol = 2) +
transition_time(Time) +
labs(title = 'Singapore Age-Sex Population Pyramid 2000-2020 \n\nYear: {frame_time}') +
enter_fade() +
exit_fade() +
ease_aes('cubic-in-out')
p2
Finally, animate() of gganimate is used to render the the chart object and finetune the animation. 24 frame per second over a duration of 20 seconds is chosen for smoother motion. The size of the object is also adjusted to 750 x 600 for better visualisation.
animate (p2, fps = 24, duration = 20,
width = 750,
height = 600
)
From the visualisation, we can easily see that across most of the planning areas, with the exception of Sengkang, Singapore’s birth rate is slowing down over the years, and the population is ageing. For Sengkang, it is one of the newer estates which sees an increasing population of younger families over the years.
As the R codes are reproducible, we can easily add in other planning areas for comparison depending on the areas of interest.
The main challenge with this take-home exercise is the amount of time given and having sufficient experience in using the correct R packages. With suitable R packages and functions, more layers can be added to the chart object depending on user needs. For instance, using crosstalk package, we can have a drop down list to allow users to select the planning area, and a slider for users who prefer to manually filter the planning area and year.