Being interested in the quantitative analysis of all things urban and social, I decided to explore and visualise the daily commuter flow data available from the UK Census. This proved to be an interesting exercise in data manipulation as well as a chance to learn ggplot2 for visualisation. By the end of my exploration session I had made two map-based visualisations using data for my adopted home city, Bristol.

In exploring the data the top 3 questions that I sought to answer were:

- How many commuters are coming into the city center on a daily basis?
- What is the modal split of these commuters?
- What is the average commute distance and what distribution characterises the variability in commute distance?

## How many commuters are coming into the city center on a daily basis?

It turns out that according to the dataset there are **137,617** daily commuters traveling within and into the city center of Bristol.

To calculate the **number of commuters** I filtered the flows between Middle Layer Super Output Areas (MSOAs) in the UK to keep only those from within a 50 km radius of Bristol to areas within central Bristol (a 2.5 km radius). I then used an SQL SELECT query to sum the flows by area of residence, and summed the result to get the total number of commuters. You can see my code below:

    library(sqldf)

    # Flow data for all MSOAs, plus the MSOA codes within 2.5K and 50K of Bristol
    dt      <- read.csv("wu03ew_v2.csv", stringsAsFactors = FALSE)
    msoas2  <- read.csv("MSOA_Within_2.5KBristol.csv", stringsAsFactors = FALSE)
    msoas50 <- read.csv("MSOA_Within_50KBristol.csv", stringsAsFactors = FALSE)

    # Filter data to ensure only flows from 50K radius to 2.5K radius
    dt <- with(dt, dt[Area.of.residence %in% msoas50$MSOA11CD, ])
    dt <- with(dt, dt[Area.of.workplace %in% msoas2$MSOA11CD, ])

    names(dt) <- c("Area.of.residence", "Area.of.workplace", "All", "Home",
                   "Lightrail", "Train", "Bus", "Taxi", "Motorcycle", "Car",
                   "CarPassenger", "Bicycle", "Pedestrian", "Other")

    ## Q1. How many people are commuting in on a daily basis
    # Sum the number of people from each origin giving us flows into the
    # 2.5k centre for all categories
    count.df <- sqldf('SELECT "Area.of.residence", SUM("All")
                       FROM dt
                       GROUP BY "Area.of.residence"')
    names(count.df) <- c("MSOA11CD", "PPL")
    totalCommuters <- sum(count.df$PPL)

As you can see, I used sqldf to perform the aggregation in SQL. SQL is a very powerful language for aggregations, so it seemed natural to use it rather than write the equivalent R code.
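
For comparison, the same group-and-sum can be written in base R with `aggregate`. This is only a sketch on a small made-up data frame: the area codes and counts are invented for illustration, and the names `flows`, `count.toy` and `total.toy` are hypothetical, chosen to mirror `dt`, `count.df` and `totalCommuters` above.

```r
# Toy flow table: invented area codes and counts, for illustration only
flows <- data.frame(
  Area.of.residence = c("A", "A", "B", "C"),
  Area.of.workplace = c("X", "Y", "X", "Y"),
  All               = c(10, 5, 7, 3),
  stringsAsFactors  = FALSE
)

# Base R equivalent of:
#   SELECT "Area.of.residence", SUM("All") FROM dt GROUP BY "Area.of.residence"
count.toy <- aggregate(All ~ Area.of.residence, data = flows, FUN = sum)
names(count.toy) <- c("MSOA11CD", "PPL")

# Grand total of commuters across all origins
total.toy <- sum(count.toy$PPL)
```

Either approach gives one row per area of residence; which reads better is largely a matter of taste.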

## What is the modal split of the commuters?

To calculate the **modal split** I simply used an SQL SELECT query to sum the number of commuters for each mode of transport, then divided each sum by the total number of people to give each mode's share of the total.

    ## Q2. What is the modal split of the commuters?
    # Sum the number of people across all origins, giving us flows into the
    # 2.5k centre for each mode of transport
    counts.df <- sqldf('SELECT SUM("Car"), SUM("CarPassenger"), SUM("Train"),
                               SUM("Bus"), SUM("Pedestrian"), SUM("Bicycle"),
                               SUM("Taxi"), SUM("Motorcycle"), SUM("Lightrail"),
                               SUM("Home"), SUM("Other")
                        FROM dt')
    modalSplit <- counts.df / sum(counts.df)
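
To make the arithmetic concrete, here is a minimal sketch of the same calculation on invented numbers, using base R's `colSums` instead of SQL. The data frame `flows` and its values are hypothetical; only the column names echo those in `dt`.

```r
# Toy flows by mode: invented numbers, for illustration only
flows <- data.frame(
  Car        = c(40, 10),
  Bus        = c(20, 5),
  Bicycle    = c(10, 5),
  Pedestrian = c(5, 5)
)

# Sum each mode over all flows, then divide by the grand total
mode.totals <- colSums(flows)
modal.split <- mode.totals / sum(mode.totals)

# Express the shares as percentages, rounded to one decimal place
round(100 * modal.split, 1)
```

The shares sum to 1 by construction, which is a handy sanity check on the real data too.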

## What is the average commute distance and what distribution characterises the variability in commute distance?

It turns out that the average commute distance considering all types of transport is **7.5km**. I plotted the histogram below to illustrate the distribution of travel distances.

To calculate the average commute distance I calculated the distance between the originating MSOA and the destination MSOA using their centroids as arrival and departure points. I got my shapefile containing the MSOA boundaries from ONS.

    ## Q3. What is the average commute distance and what distribution
    ## characterises the variability in commute distance?
    # Get the distance from the centroid of each origin to destination
    library(rgeos)
    library(maptools)

    msoaSHP <- readShapeSpatial("MSOA_Within_50KBristol.shp")
    msoaCoords <- data.frame(msoaSHP@data$MSOA11CD, coordinates(msoaSHP))

    # Attach origin centroid coordinates
    names(msoaCoords) <- c("MSOA11CD", "oX", "oY")
    dt <- merge(dt, msoaCoords, by.x = "Area.of.residence", by.y = "MSOA11CD")

    # Attach destination centroid coordinates
    names(msoaCoords) <- c("MSOA11CD", "dX", "dY")
    dt <- merge(dt, msoaCoords, by.x = "Area.of.workplace", by.y = "MSOA11CD")

    # Straight-line distance between origin and destination centroids
    distance <- function(dt) {
      return(sqrt((dt$dX - dt$oX)^2 + (dt$dY - dt$oY)^2))
    }
    dt$distance <- distance(dt)

    # Calculate the mean distance travelled, weighted by the number of trips
    weighted.mean(dt$distance, dt$All)  # Weighted mean for all transport types

The above code is straightforward thanks to the merge function, which maps MSOA IDs in the flow data to MSOA IDs in the shapefile. R also provides an off-the-shelf weighted mean function, making the code concise 🙂
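
The weighting matters because a few heavily used short links can dominate the average. The sketch below shows the difference on three invented origin-destination pairs; the data frame `od` and its coordinates are hypothetical, and I am assuming the coordinates are in a projected system measured in metres.

```r
# Toy origin-destination pairs with centroid coordinates
# (invented values, assumed to be in metres)
od <- data.frame(
  oX  = c(0, 0, 0),
  oY  = c(0, 0, 0),
  dX  = c(3000, 6000, 0),
  dY  = c(4000, 8000, 1000),
  All = c(10, 2, 8)          # number of trips on each link
)

# Straight-line (Euclidean) distance between centroids
od$distance <- sqrt((od$dX - od$oX)^2 + (od$dY - od$oY)^2)

# Unweighted mean treats every link equally
unweighted <- mean(od$distance)

# Weighted mean counts each link once per trip made along it
weighted <- weighted.mean(od$distance, od$All)
```

Here the link distances are 5000, 10000 and 1000, so the weighted mean is (10 × 5000 + 2 × 10000 + 8 × 1000) / 20 = 3900, noticeably lower than the unweighted mean because most trips use the shorter links.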

Glancing at the histogram above, it would seem that the frequency of trips decays with distance according to some sort of gravity model, i.e. the number of trips is proportional to 1 / d^m.

A simple way of finding the parameters of the decay function is to run a linear regression of the log of the number of journeys on the log of travel distance, which gives the exponent m and a constant k.

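    # Characterising the distribution:
    # fit log(trips) = log(k) - m * log(distance)
    dt0 <- subset(dt, All > 0 & distance > 0)
    llm <- lm(log(All) ~ log(distance), data = dt0)
    k <- exp(llm$coefficients[1])
    # Under trips = k / d^m the fitted slope equals -m, so negate it
    m <- -llm$coefficients[2]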

This produced a reasonable line of best fit for the data, giving m a value of **1.04**. An exponent close to 1 suggests that the number of trips is roughly inversely proportional to distance, which is a power-law decay rather than an exponential one.
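
To see why a log-log regression recovers the exponent, note that taking logs of trips = k / d^m gives log(trips) = log(k) - m log(d), a straight line whose slope is -m. The sketch below checks this on synthetic data with a known exponent. Everything in it is an assumption made for illustration: the distances, the true values `k.true` and `m.true`, and the noise level are invented, not taken from the census data.

```r
set.seed(1)

# Synthetic distances in metres (invented range)
d <- seq(500, 50000, length.out = 200)

# Generate trips from a known power law with multiplicative noise
k.true <- 5e5
m.true <- 1
trips  <- k.true / d^m.true * exp(rnorm(length(d), sd = 0.1))

# Fit a straight line in log-log space
fit <- lm(log(trips) ~ log(d))

# The slope estimates -m and the intercept estimates log(k)
m.hat <- -unname(coef(fit)[2])
k.hat <- exp(unname(coef(fit)[1]))
```

With this little noise, `m.hat` should land close to the true exponent of 1. On real flow data the fit will be rougher, so it is worth plotting the residuals before reading too much into the exponent.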

To get a feel for the overall spatial distribution of journeys, I finally created a flow diagram that colours the links by how frequently they are used.

The code for this is a modification of the work of James Cheshire (UCL), to which I added code to colour the links according to their frequency of travel.

    library(ggplot2)

    # Read centroids of MSOAs for the whole UK
    centroids <- read.csv("msoa_popweightedcentroids.csv")
    # Read MSOAs from within 50K radius of Bristol
    msoas <- read.csv("MSOA_Within_50KBristol.csv", stringsAsFactors = FALSE)
    # Read MSOAs from within 2.5K radius of Bristol
    msoasBristol <- read.csv("MSOA_Within_2.5KBristol.csv", stringsAsFactors = FALSE)

    # Map labels, leaving out a few towns
    annotations <- read.csv("MapAnnotations.csv", stringsAsFactors = FALSE)
    ex <- c("Reading", "Trowbridge", "Chippenham")
    annotations <- with(annotations, annotations[!Name %in% ex, ])

    # Read in flow data for all MSOAs within the UK
    input <- read.csv("wu03ew_v2.csv", stringsAsFactors = FALSE)
    names(input) <- c("origin", "destination", "total", "home", "metro",
                      "train", "bus", "taxi", "motorbike", "car",
                      "carpassenger", "cycle", "foot", "other")

    # Keep flows whose origin or destination is in central Bristol.
    # A single filter keeps each flow once; binding two separate filters
    # would duplicate flows that both start and end in central Bristol.
    input <- with(input, input[origin %in% msoasBristol$MSOA11CD |
                               destination %in% msoasBristol$MSOA11CD, ])

    # Following code is based on James Cheshire
    # http://blogs.casa.ucl.ac.uk/category/r-spatial/
    # merge() puts the key column first, so rename only the four centroid
    # columns appended at the end rather than relabelling every column.
    or.xy <- merge(input, centroids, by.x = "origin", by.y = "Code")
    n <- ncol(or.xy)
    names(or.xy)[(n - 3):n] <- c("o_name", "oX", "oY", "oShowName")

    dest.xy <- merge(or.xy, centroids, by.x = "destination", by.y = "Code")
    n <- ncol(dest.xy)
    names(dest.xy)[(n - 3):n] <- c("d_name", "dX", "dY", "dShowName")

    makeViz <- function(trips, title) {
      dest.xy$trips <- trips
      # Order from smallest to largest trip
      dest.xy <- dest.xy[order(dest.xy$trips), ]

      # Now for plotting with ggplot2.
      # This first step removes the axes in the resulting plot.
      xquiet <- scale_x_continuous("", breaks = NULL)
      yquiet <- scale_y_continuous("", breaks = NULL)
      quiet <- list(xquiet, yquiet)

      # Let's build the plot. First we specify the dataframe we need,
      # with a filter excluding flows of 10 or fewer
      p <- ggplot(dest.xy[which(dest.xy$trips > 10), ], aes(oX, oY)) +
        geom_segment(aes(x = oX, y = oY, xend = dX, yend = dY,
                         alpha = trips, col = trips)) +
        scale_colour_gradientn(colours = topo.colors(10)) +
        scale_alpha_continuous(range = c(0.03, 0.9), guide = "none") +
        ggtitle(paste(title,
                      "Daily Commuter Travel Destined To/From Central Bristol",
                      sep = " ")) +
        theme(panel.background = element_rect(fill = "white", colour = "white")) +
        quiet +
        coord_equal()

      # Add place-name labels; annotate() draws each label once instead of
      # once per row of the plot data
      p <- p + annotate("text", x = annotations$X, y = annotations$Y - 2500,
                        label = annotations$Name, size = 4)
      return(p)
    }

    p1 <- makeViz(dest.xy$total, "Total")

To create the visual at the top of the page, I coloured the MSOAs by the total number of people commuting into central Bristol. It was generated using the following code:

    # Visualise the number of people at each origin flowing to central Bristol
    library(rgeos)
    library(maptools)
    library(plyr)
    library(ggplot2)

    # Read in the shapefile
    msoaSHP <- readShapeSpatial("MSOA_Within_50KBristol.shp")
    msoaSHP@data$id <- rownames(msoaSHP@data)
    msoaSHP.points <- fortify(msoaSHP, region = "id")
    msoaSHP.df <- join(msoaSHP.points, msoaSHP@data, by = "id")
    msoaSHPc.df <- join(msoaSHP.df, count.df, by = "MSOA11CD")

    # Now for plotting with ggplot2.
    # This first step removes the axes in the resulting plot.
    xquiet <- scale_x_continuous("", breaks = NULL)
    yquiet <- scale_y_continuous("", breaks = NULL)
    quiet <- list(xquiet, yquiet)

    # Create visualisation for each zone
    ggplot(msoaSHPc.df) +
      ggtitle("Flows Into Bristol City Centre") +
      aes(long, lat, group = group, fill = PPL) +
      geom_polygon() +
      geom_path(color = "white") +
      scale_fill_gradientn(colours = rev(rainbow(20, alpha = 1))[-1:-5],
                           name = "Flows\n (ppl)") +
      coord_equal() +
      theme(panel.background = element_rect(fill = "white", colour = "white")) +
      quiet