Analyzing Canadian Census Data

According to the official Statistics Canada website, the Canadian census began in 1871 as a 211-question survey conducted every ten years; today Statistics Canada conducts it every five years, and it most recently included questions on basic population demographics such as population, age and sex, type of dwelling, families, households and marital status, language, income, immigration and ethnocultural diversity, housing, Aboriginal peoples, education, labour, journey to work, language of work, and mobility and migration. Socioeconomic data collected from the nation-wide census can help address questions such as: Where is population density highest? What kind of housing is most available? Where could housing potentially be densified?
An important relationship that can be exploited in spatial analysis is the one that exists between spatial neighbours (such as census tracts). Specifically, spatial autocorrelation analysis examines neighbouring attributes to determine whether they are more similar to, or different from, the observation in question relative to the population average. The following tutorial details the steps required to perform spatial autocorrelation analyses in R, using dwelling-occupancy census data to understand the variability in housing in the Capital Regional District and the Cowichan Valley.

Collecting and Mapping Census Data

Census data is freely available for download: the 2016 census AGGREGATE DISSEMINATION AREA shapefile can be found here, and the AGGREGATE DISSEMINATION AREA population and dwelling count csv here. Then, in RStudio:

#import relevant libraries
library(plyr)     #data manipulation helpers
library(dplyr)    #data manipulation
library(spdep)    #neighbours, spatial weights, Moran's I
library(GISTools) #choropleth mapping; loads RColorBrewer for brewer.pal()
library(raster)   #shapefile() for reading shapefiles
library(maptools) #spatial data handling

Access the dissemination area shapefile and select the areas of interest (Capital and Cowichan Valley regions) and get the census population data:

dissem <- shapefile("lada000b16a_e.shp")
dissem.sub <- dissem[dissem$CDNAME %in% c("Capital", "Cowichan Valley"),]
census.16 <- read.csv("T1701EN.csv", check.names = FALSE)

Clean up the data:

#fix the column names:
#replace spaces and punctuation with "_"
names(census.16) <- gsub("[[:space:]]|[[:punct:]]", "_", names(census.16), fixed = FALSE)
#collapse double underscores (run twice to catch any that remain)
names(census.16) <- gsub("__", "_", names(census.16), fixed = TRUE)
names(census.16) <- gsub("__", "_", names(census.16), fixed = TRUE)
#drop the unneeded 12th column
census.16 <- census.16[,-12]
#remove the columns whose names contain "french"
census.16 <- census.16[,-grep("french", names(census.16))]

Select the columns we want to work with and create new columns for the occupied dwelling density analysis, as well as a generalized vacancy count for the dwelling vacancy density analysis:

census.16 <- census.16[,names(census.16) %in% c("Geographic_code", "Population_2016", "Land_area_in_square_kilometres_2016", "Total_private_dwellings_2016", "Private_dwellings_occupied_by_usual_residents_2016", "Population_density_per_square_kilometre_2016")]

#add new columns to dataframe
#for occupied dwelling density
census.16$Private_dwelling_occupied_density_per_square_kilometre_2016 <- census.16$Private_dwellings_occupied_by_usual_residents_2016/census.16$Land_area_in_square_kilometres_2016

#for generalization of private dwelling vacancies
census.16$Private_dwelling_vacancies_2016 <- census.16$Total_private_dwellings_2016 - census.16$Private_dwellings_occupied_by_usual_residents_2016

#for dwelling vacancy density
census.16$Private_dwelling_vacancy_density_per_square_kilometre_2016 <- census.16$Private_dwelling_vacancies_2016/census.16$Land_area_in_square_kilometres_2016

Now we want to give our data some geographic context so we can visualize it on maps. We do this by merging the cleaned-up table with the area shapefile using a common attribute (in this case ‘Geographic_code’ and ‘ADAUID’ hold the same values).

#rename "Geographic_code" as "ADAUID" to match the dissemination shapefile
names(census.16)[1] <- "ADAUID"
#now merge with the shapefile based on dissemination ID
crd.data <- merge(dissem.sub, census.16, by = "ADAUID")
#remove NA
crd.data <- crd.data[!is.na(crd.data$Population_2016),]

crd.data is now our working dataset, of class SpatialPolygonsDataFrame, and it includes the census data as attributes! You can print a summary of the data (including the bounding-box coordinates, projection, attributes, and min/max/median/mean statistics for the population, dwelling, and density columns) with the commands:

class(crd.data)
summary(crd.data)

Create choropleth maps of occupied dwelling density and of dwelling vacancy density per km² in 2016 using vectors of these densities (the columns from crd.data) so that you can visualize the geographic spread of dwellings, and add informative colours, titles, and legends to your maps.

#vectors 
OccDen <- crd.data$Private_dwelling_occupied_density_per_square_kilometre_2016
VacDen <- crd.data$Private_dwelling_vacancy_density_per_square_kilometre_2016
#create a list of 6 colours
shades.vacancy <- auto.shading(VacDen, n=6, cols = brewer.pal(6, 'Blues'))
shades.occupied <- auto.shading(OccDen, n=6, cols = brewer.pal(6, 'Purples'))
#format title text
legend.vacancy.text = expression(paste("Vacancy Density per km"^"2"," in 2016"))
legend.occupied.text = expression(paste("Occupied Density per km"^"2"," in 2016"))
#optionally select a location on the map for your legend (otherwise the legend may overlay your plot by default)
#note: locator() needs an open plot to click on, so run this after an initial plot
legend_coor <- locator(1)
#print out the coordinates to use below
legend_coor
#map the data with associated colours, titles, and legends
choropleth(crd.data, OccDen, shades.occupied, main = "Occupied Density for Capital and Cowichan Regional Districts")
choro.legend(3653863, 2047977, shades.occupied, title = legend.occupied.text, cex = 0.45)
choropleth(crd.data, VacDen, shades.vacancy, main = "Vacancy Density for Capital and Cowichan Regional Districts")
choro.legend(3653863, 2047977, shades.vacancy, title = legend.vacancy.text, cex = 0.45)

The output choropleth maps illustrate varying levels of occupied and vacant dwelling density. As might be expected, downtown Victoria and the surrounding area show high occupied dwelling density (the darkest purple areas, with over 1200 occupied homes per km²). The distribution of vacancy density appears to mirror that of occupied homes, though on a smaller scale, and the areas of highest vacancy (30 to over 68 vacant homes per km²) also appear to be in downtown Victoria, suggesting the two are related in some way.

Creating a Neighbourhood Weights Matrix

One important component of spatial analysis is the neighbourhood weights matrix, which defines the dependence between pairs of spatial units and thus identifies the neighbours (j) of each observation i in the subsequent spatial autocorrelation analyses. A weights matrix (often denoted simply as W) is an n x n matrix with a weight value (wij) for each pair of units i,j. The idea is that farther-away locations receive lower weight, in keeping with Tobler's law that near things are more related than distant things. The following Queen's-neighbour and Rook's-neighbour matrices are based on shared boundaries (contiguity) between neighbouring spatial units. From the neighbours map below, it can be seen that the two are very similar: only a few links in the Queen's case (red) are absent from the Rook's case (yellow dashed), meaning most census subdivisions share the same neighbours under either definition. For the purposes of this tutorial the Queen's-neighbour weights matrix will be used for subsequent analyses, and the summary of the spatial weights list object shows an average of about four links for each of the 86 census subdivisions.

The neighbours (j) of i considered in a Rook’s vs Queen’s weighted matrix
#Queen's neighbour
crd.nb <- poly2nb(crd.data)

#Rook's neighbour
crd.nb2 <- poly2nb(crd.data, queen = FALSE)

#Now, create a new map that overlays the rook neighbours file (in yellow) onto the queen neighbours file (in red).
plot(crd.data, border = "lightgrey")
plot(crd.nb, coordinates(crd.data), add = TRUE, col = "red")
plot(crd.nb2, coordinates(crd.data), add = TRUE, col = "yellow", lwd = 2, lty = "dashed")
#Create the spatial weights neighbour list using the Queen's case
crd.lw <- nb2listw(crd.nb, zero.policy = TRUE, style = "W")
print.listw(crd.lw, zero.policy = TRUE)
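The style = "W" argument row-standardizes the weights, so that each unit's neighbour weights sum to one (zero.policy = TRUE permits units with no neighbours at all). In notation, with N(i) denoting the neighbour set of unit i:

```latex
w_{ij} =
\begin{cases}
  \dfrac{1}{|N(i)|} & j \in N(i) \\[4pt]
  0                 & \text{otherwise}
\end{cases}
\qquad\Rightarrow\qquad
\sum_{j} w_{ij} = 1 \text{ for every } i \text{ with } |N(i)| > 0
```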
 

Mapping Lagged Means

The lagged means calculation provides a measure of the neighbourhood average, which we can easily visualize on a map with the code below. With the Queen's weights matrix, the resulting maps show that darker regions are those whose neighbourhoods have a higher average density and, conversely, lighter areas are those whose neighbourhoods have a lower average density.
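Concretely, the lagged mean at location i is the weighted sum of its neighbours' values, which with row-standardized Queen's weights reduces to the simple average of the neighbours:

```latex
\tilde{x}_i = \sum_{j} w_{ij}\, x_j = \frac{1}{|N(i)|} \sum_{j \in N(i)} x_j
```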

#lagged means calculation
OccDen.lagged.means = lag.listw(crd.lw, OccDen, zero.policy = TRUE)
VacDen.lagged.means = lag.listw(crd.lw, VacDen, zero.policy = TRUE)

#Map the lagged means using a new shading object
shades2.occupied <- auto.shading(OccDen.lagged.means, n=6, cols = brewer.pal(6, 'Purples'))
shades2.vacancy <- auto.shading(VacDen.lagged.means, n=6, cols = brewer.pal(6, 'Blues'))
choropleth(crd.data, OccDen.lagged.means, shades2.occupied, main = "Lagged Means Occupied Density")
choro.legend(3653863, 2047977, shades2.occupied, title = legend.occupied.text, cex = 0.45)
choropleth(crd.data, VacDen.lagged.means, shades2.vacancy, main = "Lagged Means Vacancy Density") 
choro.legend(3653863, 2047977, shades2.vacancy, title = legend.vacancy.text, cex = 0.45)

Global Moran’s I

One way to test for spatial autocorrelation is the Global Moran's I statistic, which determines whether a variable is more positively or negatively spatially autocorrelated than we would expect given a random distribution. The four important parts of the Global Moran's I formula are (a) the difference between the observation i and the mean, which we multiply by (b) the difference between its neighbours and the mean, which we then multiply by (c) the weight given to those neighbours, and finally (d) the denominator, which standardizes the values. The resulting I value will be greater than or less than the value expected under randomness, indicating positive or negative spatial autocorrelation respectively.
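The formula can be written in standard notation (this is the usual form of the statistic, as computed by spdep's moran.test), together with its expectation under randomness and the z-score used below:

```latex
I = \frac{n}{\sum_i \sum_j w_{ij}}
    \cdot
    \frac{\sum_i \sum_j w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_i (x_i - \bar{x})^2},
\qquad
E[I] = \frac{-1}{\,n - 1\,},
\qquad
z = \frac{I - E[I]}{\sqrt{\mathrm{Var}(I)}}
```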

Global Moran’s I and Z-test
#Moran's I Test
#Again use the OccDen and VacDen vectors with the queen's spatial weights neighbour list
mi <- moran.test(OccDen, crd.lw, zero.policy = TRUE)
mi2 <- moran.test(VacDen, crd.lw, zero.policy = TRUE)
mi
mi2

#To contextualize your Moran's I value, retrieve range of potential Moran's I values.
moran.range <- function(lw) {
  wmat <- listw2mat(lw)
  return(range(eigen((wmat + t(wmat))/2)$values))
}
moran.range(crd.lw)

#Perform the z-test: z = (I - E[I]) / sqrt(Var[I])
#estimate[1] is Moran's I, estimate[2] is its expectation, estimate[3] is its variance
z.i <- (mi$estimate[1] - mi$estimate[2])/sqrt(mi$estimate[3])
z.i

z.i2 <- (mi2$estimate[1] - mi2$estimate[2])/sqrt(mi2$estimate[3])
z.i2


The above results show that the Moran's I values of 0.689 for occupied density and 0.595 for vacancy density both fall on the positive side of the range of possible Moran's I values for this weights matrix (-1.480 to 1.504), well above the value expected for a random pattern, indicating positive spatial autocorrelation. The z-scores of 2.450 and 2.681, respectively, indicate that at the 98% confidence level (p < 0.02) the occupied and vacancy densities differ significantly from a random distribution and are thus significantly positively spatially autocorrelated.

Local Moran’s I

The following code demonstrates how to compute the Moran's I statistic for each individual location, i.e. the Local Moran's Ii. This statistic also uses the difference between the observation and the mean, but multiplies it by the weighted sum of the differences between its neighbours j and the mean. Using the ZIi test, it can be determined whether the value at each location i differs significantly from what would be expected if the attribute values in the neighbourhood of i were randomly distributed.
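In standard notation (the form implemented by spdep's localmoran), the Local Moran's Ii and its z-test are:

```latex
I_i = \frac{(x_i - \bar{x})}{m_2} \sum_j w_{ij}\,(x_j - \bar{x}),
\qquad
m_2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2,
\qquad
Z_{I_i} = \frac{I_i - E[I_i]}{\sqrt{\mathrm{Var}(I_i)}}
```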

#Local Moran's I
#Again use the OccDen and VacDen vectors with the Queen's spatial weights neighbour list
lisa.occupied.test <- localmoran(OccDen, crd.lw)
lisa.occupied.test
#summarize the table
summary(lisa.occupied.test)

lisa.vacancy.test <- localmoran(VacDen, crd.lw)
lisa.vacancy.test #the last column provides the p-values, which relate to different confidence levels (e.g. 99%, 95%, etc.)
#summarize the table
summary(lisa.vacancy.test)


#Create choropleth maps of the LISA values
lisa.occupied.shades <- auto.shading(c(lisa.occupied.test[,1],-lisa.occupied.test[,1]), cols = brewer.pal(5,"PRGn"))
choropleth(crd.data, lisa.occupied.test[,1], shading = lisa.occupied.shades, main = "Local Moran's I for Occupied Dwelling Density")
choro.legend(3653863, 2047977, lisa.occupied.shades, fmt = "%6.2f", cex = 0.45, title = "LISA Values")

lisa.vacancy.shades <- auto.shading(c(lisa.vacancy.test[,1],-lisa.vacancy.test[,1]), cols = brewer.pal(5,"PRGn"))
choropleth(crd.data, lisa.vacancy.test[,1], shading = lisa.vacancy.shades, main = "Local Moran's I for Vacancy Dwelling Density")
choro.legend(3653863, 2047977, lisa.vacancy.shades, fmt = "%6.2f", cex = 0.45, title = "LISA Values")

The above choropleth maps give an idea of where values are clustered or dispersed with respect to the Local Moran's I, making the calculated statistic easier to visualize. To visualize the level of significance of these clusters, create the following maps using the LISA p-values.

#Get p-values from the last column of the LISA table
p.lisa.occupied = lisa.occupied.test[,5]
p.lisa.occupied.shades <- auto.shading(c(lisa.occupied.test[,5],-lisa.occupied.test[,5]), cols = brewer.pal(5,"PRGn"))
p.lisa.vacancy = lisa.vacancy.test[,5]
p.lisa.vacancy.shades <- auto.shading(c(lisa.vacancy.test[,5],-lisa.vacancy.test[,5]), cols = brewer.pal(5,"PRGn"))
#Create choropleth maps of the LISA p-values
choropleth(crd.data, p.lisa.occupied, shading = p.lisa.occupied.shades, main = "Occupied Dwelling LISA P-values")
choro.legend(3786037, 2047977, p.lisa.occupied.shades, fmt = "%6.2f", cex = 0.45, title = "P-values")
choropleth(crd.data, p.lisa.vacancy, shading = p.lisa.vacancy.shades, main = "Vacancy Dwelling LISA P-values")
choro.legend(3786037, 2047977, p.lisa.vacancy.shades, fmt = "%6.2f", cex = 0.45, title = "P-values")

The Local Moran's I values can be plotted in a scatterplot to show the relationship between each location and its neighbourhood mean. The x-axis represents the values at location i, while the y-axis represents the (spatially lagged) values in the neighbourhood of location i. To visualize spatial autocorrelation for the dwelling densities, create Moran's I scatterplots as follows. Points in the top-right quadrant represent locations where the attribute at i and at its neighbours are above the mean; points in the bottom-left quadrant represent locations where both are below the mean. Both of these quadrants therefore indicate positive spatial autocorrelation, and the solid black diagonal line is a best-fit line through the points, whose upward slope indicates positive spatial autocorrelation across the entire dataset. Diamond-shaped points are values deemed significant at a specific confidence level (the numbers next to these points are simply their ID numbers).
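A property worth noting: with row-standardized weights, the slope of that best-fit line is exactly the Global Moran's I, since the line is the regression of the lagged values on the original values. Writing the deviations as z_i:

```latex
I = \frac{\sum_i z_i \left( \sum_j w_{ij} z_j \right)}{\sum_i z_i^2},
\qquad
z_i = x_i - \bar{x}
```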

#Create Moran's I scatterplots
moran.plot(OccDen, crd.lw, zero.policy = NULL, spChk = NULL, labels = NULL,
           xlab = "Occupied Density", ylab = "Spatially Lagged Occupied Density",
           quiet = NULL, main = "Moran's I")
moran.plot(VacDen, crd.lw, zero.policy = NULL, spChk = NULL, labels = NULL,
           xlab = "Vacancy Density", ylab = "Spatially Lagged Vacancy Density",
           quiet = NULL, main = "Moran's I")

Summary

The results of the previous spatial autocorrelation analyses indicate a high degree of positive clustering of attributes between neighbours with regard to the density of occupied and vacant dwellings (in both high- and low-density areas), according to the 2016 census data. Hopefully this tutorial highlighted the key strengths of using R to perform spatial autocorrelation analysis in a clear, descriptive, and visual way. The ability to glean important information from the wealth of data collected in the Canadian census through spatial analysis can help us understand the current and changing state of the diversity of our country. What better way to learn more about where you live than to find relationships between you and your neighbours!
