Take-home Exercise 1

Creating data visualisation beyond default

Author

Affiliation

Published

Jan. 26, 2022

DOI

The Task

This take-home exercise aims to sharpen the skill of building data visualisation programmatically using R ggplot2. The specific requirements are:

Create a pareto chart showing showing the distribution of returns by product sub-category, using the superstore-2021.xls file.
Create a age-sex pyramid showing the demographic structure of Singapore by age cohort and gender for June 2021. The data set is obtained from Department of Statistics.

You can find the links to download the dataset below:

Download Superstore-2021.xls

Download respopagesextod2021.csv

1.0 Distribution of Returns by Sub-category

1.1 Sketch of Proposed Design

1.2 Installing and loading the required libraries

The packages required are tidyverse (included relevant packages for data analyses such as ggplot2, readr and dplyr), readxl, ggrepel and knitr.

The code chunk below is used to install and load the required packages onto RStudio.

packages = c('tidyverse','readxl', 'knitr', 'ggrepel')
for(p in packages){
  if(!require(p,character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

1.3 Importing the dataset

The source file is in xls format, and consists of three worksheets, namely: Orders, People and Returns. Hence, read_csv() of the readr package is used to import the data.

orders <- read_xls("data/Superstore-2021.xls", sheet = "Orders")
returns <- read_xls("data/Superstore-2021.xls", sheet =  "Returns")
glimpse(orders)

Rows: 9,994
Columns: 21
$ `Row ID`         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, ~
$ `Order ID`       <chr> "CA-2020-152156", "CA-2020-152156", "CA-202~
$ `Order Date`     <dttm> 2020-11-08, 2020-11-08, 2020-06-12, 2019-1~
$ `Ship Date`      <dttm> 2020-11-11, 2020-11-11, 2020-06-16, 2019-1~
$ `Ship Mode`      <chr> "Second Class", "Second Class", "Second Cla~
$ `Customer ID`    <chr> "CG-12520", "CG-12520", "DV-13045", "SO-203~
$ `Customer Name`  <chr> "Claire Gute", "Claire Gute", "Darrin Van H~
$ Segment          <chr> "Consumer", "Consumer", "Corporate", "Consu~
$ `Country/Region` <chr> "United States", "United States", "United S~
$ City             <chr> "Henderson", "Henderson", "Los Angeles", "F~
$ State            <chr> "Kentucky", "Kentucky", "California", "Flor~
$ `Postal Code`    <dbl> 42420, 42420, 90036, 33311, 33311, 90032, 9~
$ Region           <chr> "South", "South", "West", "South", "South",~
$ `Product ID`     <chr> "FUR-BO-10001798", "FUR-CH-10000454", "OFF-~
$ Category         <chr> "Furniture", "Furniture", "Office Supplies"~
$ `Sub-Category`   <chr> "Bookcases", "Chairs", "Labels", "Tables", ~
$ `Product Name`   <chr> "Bush Somerset Collection Bookcase", "Hon D~
$ Sales            <dbl> 261.9600, 731.9400, 14.6200, 957.5775, 22.3~
$ Quantity         <dbl> 2, 3, 2, 5, 2, 7, 4, 6, 3, 5, 9, 4, 3, 3, 5~
$ Discount         <dbl> 0.00, 0.00, 0.00, 0.45, 0.20, 0.00, 0.00, 0~
$ Profit           <dbl> 41.9136, 219.5820, 6.8714, -383.0310, 2.516~

glimpse(returns)

Rows: 800
Columns: 2
$ Returned   <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", ~
$ `Order ID` <chr> "CA-2018-100762", "CA-2018-100762", "CA-2018-1007~

1.4 Data Wrangling

Joining two tables

As we are interested in plotting the count of returns by sub-category, and the information is found in two separate sheets, Orders and Returns, we need to join the data tables. Using Order ID as the unique identifier, left_join() of dplyr is used to join the Returns data frame and Orders data frame.

joined_tab <- left_join(returns, orders, 
                        by=c('Order ID' = 'Order ID'))

Compute frequency count of returns by sub-category

Then, the frequency count of returns by sub-category is computed by using the group_by method found in dplyr.

Group_by method

freq_returned <- joined_tab %>%
  group_by(`Sub-Category`) %>%
  summarise('Returns' = n ()) %>% 
  ungroup()

The alternative method is to use the count method in dplyr. By default, count() will return a new field called n to store the result. Hence, rename() of dplyr is used to rename n to Returns.

Count method

freq_returned2 <- joined_tab %>%  
  count(`Sub-Category`) %>% 
  rename(Returns = n)

The new dataframe, freq_returned, is shown in the following using paged_table of the rmarkdown package.

library(rmarkdown)
paged_table(freq_returned)

ABCDEFGHIJ0123456789

Sub-Category <chr>	Returns <int>
Accessories	251
Appliances	177
Art	177
Binders	552
Bookcases	51
Chairs	238
Copiers	21
Envelopes	50
Fasteners	94
Furnishings	266

Sorting Data

As shown above, the values of the tibble data frame is sorted according to the values of the first column, alphabetically by Sub-category. To obtain the cumulative frequency, the values in the the Returns field is first sorted in decreasing order using arrange() of dplyr package.

freq_sorted <- freq_returned %>%
  arrange(desc(Returns))

paged_table(freq_sorted)

ABCDEFGHIJ0123456789

Sub-Category <chr>	Returns <int>
Binders	552
Paper	487
Phones	309
Furnishings	266
Accessories	251
Chairs	238
Storage	233
Appliances	177
Art	177
Tables	116

Computing cumulative frequency

Next, mutate of dylyr and cumsum of Base R are used to compute the cumulative frequency of Returns. Using sum of Base R, the cumulative frequency is divided by the sum of Returns to obtain the cumulative percentage of Returns. As the computed figures are in decimal, a factor of 100 is then applied to convert the values to percentage.

freq_cum <- freq_sorted %>%
  mutate(cumfreq = cumsum(Returns)/sum(Returns)*100)

paged_table(freq_cum)

ABCDEFGHIJ0123456789

Sub-Category <chr>	Returns <int>	cumfreq <dbl>
Binders	552	17.11097
Paper	487	32.20707
Phones	309	41.78549
Furnishings	266	50.03100
Accessories	251	57.81153
Chairs	238	65.18909
Storage	233	72.41166
Appliances	177	77.89833
Art	177	83.38500
Tables	116	86.98078

1.5 Creating the Pareto Chart

Plotting Returns in a bar chart

There are two types of bar charts: geom_bar() and geom_col() in ggplot. geom_bar() makes the height of the bar proportional to the number of cases in each group, while the height of the bar in geom_col represent values in the data. In our case, since we already computed the frequency count and we want the bar height to represent the value in the dataset, geom_col() is used.

ggplot(data=freq_cum, 
        aes(x = `Sub-Category`, y = Returns)) +
       geom_col(fill = "light blue")

As the bars are plotted in alphabetical order by default, the reorder function in Base R is used to sort the bars based on the count of the returns in decreasing order.

g1 <- ggplot(data=freq_cum, 
        aes(x = reorder(`Sub-Category`, -Returns), y = Returns)) +
        geom_col(fill = "light blue") +
        xlab("Sub-Category")
g1

Plotting cumulative frequency line

Next, geom_point from ggplot2 package is added to use points to represent the cumulative frequency, and geom_line is added to connect the points.

g2 <- g1 + 
      geom_point(aes(y = `cumfreq`), colour = 'orange', size = 2) +
      geom_line(aes(y = `cumfreq`), colour = 'orange', group = 1) 
g2

Formatting the Chart to improve visualisation

Lastly, a secondary y-axis is added using sec_axis, which basically adds a new y-axis by transforming the primary y-axis using a coefficient. After some trial and error, a coefficient of 0.18 is selected i.e. primary y-axis is multiplied by 0.18 to get the secondary y axis. The corresponding values of the cumulative frequency also needs to be transformed using the coefficient.

Using geom_hline, a reference line representing 80% is added to make it clearer which are the sub-categories that contribute 80% of the returns.

By adjusting the settings using theme, the labels n the x-axis are also rotated so that they do not overlap each order. The complete code chunk to create the pareto chart and final formatted chart are shown below.

coeff <- 0.18
ggplot(data=freq_cum, aes(x = reorder(`Sub-Category`, -Returns), y = Returns)) +
  geom_col(fill = "light blue") +
  geom_point(aes(y = `cumfreq`/coeff), colour = 'orange', size = 2) +
  geom_line(aes(y = `cumfreq`/coeff), colour = 'orange', group = 1) +
  geom_hline(yintercept = 80/coeff, colour = 'dark grey', linetype = 'dashed') +
  scale_y_continuous(name =  "Count of Returns", breaks = seq(0, 1000, 100), 
  sec.axis = sec_axis(~.*coeff, name = "Percentage of cumulative sum of Returns (%)")) +
  xlab("Sub-Category") +
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  annotate("text", x = 'Envelopes', y = 85/coeff, label = "80%", colour = "dark grey")

2.0 Distribution of Population by Age and Sex

2.1 Sketch of Proposed Design

2.2 Installing and loading the required libraries

The packages required are tidyverse (included relevant packages for data analyses such as ggplot2, readr and dplyr), ggrepel and knitr. As the required packages to plot the population pyramid are already loaded in 1.2, we do not have to run new codes.

2.2 Importing the dataset

The source file is in csv format, hence read_csv of readr package is used to import the dataset.

Popdata <- read_csv("data/respopagesextod2021.csv")
glimpse(Popdata)

Rows: 100,928
Columns: 7
$ PA   <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio",~
$ SZ   <chr> "Ang Mo Kio Town Centre", "Ang Mo Kio Town Centre", "An~
$ AG   <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to~
$ Sex  <chr> "Males", "Males", "Males", "Males", "Males", "Males", "~
$ TOD  <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-R~
$ Pop  <dbl> 0, 10, 10, 30, 0, 0, 50, 0, 0, 0, 10, 20, 0, 0, 30, 0, ~
$ Time <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2~

2.3 Data Wrangling

Transforming the data of one gender

For the population pyramid, we want the Pop values of one gender to appear on the left side of the chart, and the other to go on the right. Here, male is chosen, and the Pop values need to be transformed to negative values. The mutate function of dplyr package is used to add a new variable that consist of the negative male Pop values.

Popdata_males <- Popdata %>%
  filter(`Sex` ==  "Males") %>%
  mutate (Pop = -Pop)
glimpse(Popdata_males)

Rows: 50,464
Columns: 7
$ PA   <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio",~
$ SZ   <chr> "Ang Mo Kio Town Centre", "Ang Mo Kio Town Centre", "An~
$ AG   <chr> "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to_4", "0_to~
$ Sex  <chr> "Males", "Males", "Males", "Males", "Males", "Males", "~
$ TOD  <chr> "HDB 1- and 2-Room Flats", "HDB 3-Room Flats", "HDB 4-R~
$ Pop  <dbl> 0, -10, -10, -30, 0, 0, -50, 0, 0, -10, -10, -40, 0, 0,~
$ Time <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2~

Binding to form new dataset

Next, we bind the new dataset containing negative Pop values of Males with the females data of the original dataset using rbind function in Base R.

Popdata_females <-Popdata %>%
  filter(`Sex` ==  "Females") 
Popdata_T <- rbind(Popdata_males,Popdata_females)
paged_table(Popdata_T)

ABCDEFGHIJ0123456789

PA <chr>	SZ <chr>	AG <chr>	Sex <chr>
Ang Mo Kio	Ang Mo Kio Town Centre	0_to_4	Males
Ang Mo Kio	Ang Mo Kio Town Centre	0_to_4	Males
Ang Mo Kio	Ang Mo Kio Town Centre	0_to_4	Males
Ang Mo Kio	Ang Mo Kio Town Centre	0_to_4	Males
Ang Mo Kio	Ang Mo Kio Town Centre	0_to_4	Males
Ang Mo Kio	Ang Mo Kio Town Centre	0_to_4	Males
Ang Mo Kio	Ang Mo Kio Town Centre	0_to_4	Males
Ang Mo Kio	Ang Mo Kio Town Centre	0_to_4	Males
Ang Mo Kio	Ang Mo Kio Town Centre	5_to_9	Males
Ang Mo Kio	Ang Mo Kio Town Centre	5_to_9	Males

2.4 Creating the Population Pyramid

Initial plot

geom_bar of ggplot2 is used to plot the bar chart, and coord_flip is used to flip the x and y axis to form the pyramid.

ggplot(Popdata_T, aes (x = AG, y = Pop, fill = Sex)) +
  geom_bar(stat = "identity") +
  coord_flip()

Formatting the axis

In the initial chart, it is observed that the age group 5 to 9 appeared in the middle of the axis. AG is initially a character field. factor in Base R is then used to encode the variable AG to factor and to recognise that each value in AG is a unique level.

Popdata_T$AG <- factor (Popdata_T$AG, levels = unique(Popdata_T$AG)) 
# To convert AG to a factor with unique values
glimpse(Popdata_T$AG)

 Factor w/ 19 levels "0_to_4","5_to_9",..: 1 1 1 1 1 1 1 1 2 2 ...

The default scale of the x axis is also not easy to read.To reformat the x axis, seq in Base R is used to sequence the axis with each interval having a length of 50000, and the labels of the x-axis to range from 0 to 200000 on both ends.

xbrks <- seq(-200000, 200000, 50000)
xlabls <- paste0(as.character(c(seq(200, 0, -50), seq(50, 200, 50))), "k")
# creating new intervals and labels for x-axis

Final Visualisation

The final chart after formatting is shown below.

ggplot(Popdata_T, aes (x = AG, y = Pop, fill = Sex)) +
  geom_bar(stat = "identity", width = .6 ) +
  scale_y_continuous(breaks = xbrks, labels = xlabls, name = "Population Size") +
  xlab("Age Groups") +
  coord_flip() +
  theme_bw()

3.0 Comparing R and Tableau on Data Visualisation

Similar visualisations were created using Tableau in class. Comparing with Tableau, creating the charts programmatically using R have the following benefits:

With R, you know what each chart is composed of and what goes into every point on the chart. Hence, it is easier to know what and how to customise the charts. For instance, you can specify the width of the bar, change the sequence and labels of the interval.
As the codes are reproducible, it is easier to replicate similar visualisation if you change the data using the same codes. For Tableau, although you can drag and drop, you will have to restart the entire process everytime you change the data.
R has many packages and has endless possibilities on how you want to create your visualisation. In fact, there is a package ggQC available to create pareto chart with simpler step though I did not try in this exercise. With Distil, it allows you to create a webpage seamlessly in a single platform (R studio) and publish in different formats. This is not possible with Tableau.

Back to homepage

Take-home Exercise 1

Author

Affiliation

Published

DOI

The Task

1.0 Distribution of Returns by Sub-category

1.1 Sketch of Proposed Design

1.2 Installing and loading the required libraries

1.3 Importing the dataset

1.4 Data Wrangling

Joining two tables

Compute frequency count of returns by sub-category

Group_by method

Count method

Sorting Data

Computing cumulative frequency

1.5 Creating the Pareto Chart

Plotting Returns in a bar chart

Plotting cumulative frequency line

Formatting the Chart to improve visualisation

2.0 Distribution of Population by Age and Sex

2.1 Sketch of Proposed Design

2.2 Installing and loading the required libraries

2.2 Importing the dataset

2.3 Data Wrangling

Transforming the data of one gender

Binding to form new dataset

2.4 Creating the Population Pyramid

Initial plot

Formatting the axis

Final Visualisation

3.0 Comparing R and Tableau on Data Visualisation

Footnotes