In part 1 we developed our strategy and R code for measuring the efficiency of the placement of tube stations. In this post we will scale this up so that we can feed in empirical data gathered from OpenStreetMap and Transport for London (TfL).
To briefly recap
We are measuring the efficiency of station placement by:
- finding the sum of the squared shortest distance (sssd) from every building (in the area of interest) to a tube station – this is the measure of current performance;
- using R's optim function to find the optimal placement of stations by minimising the sssd from every building to a tube station;
- finding the sssd from every building (in the area of interest) to randomly placed tube stations to create a baseline measure of performance;
- comparing the difference in sssd between (1) current, (2) optimal and (3) baseline.
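The recap above can be sketched in a few lines of base R. This is a minimal illustration on made-up toy coordinates, not the listing from part 1: the names `sssd` and `objective`, the number of buildings and stations, and the use of optim's default Nelder-Mead method are all assumptions made for the example.

```r
# Toy data (hypothetical): building and station coordinates in metres.
set.seed(1)
buildings <- cbind(x = runif(200, 0, 1000), y = runif(200, 0, 1000))
stations  <- cbind(x = runif(3, 0, 1000),   y = runif(3, 0, 1000))

# Sum of squared shortest distances from each building to its nearest station.
sssd <- function(stations, buildings) {
  # Squared distance matrix: one row per building, one column per station.
  d2 <- outer(buildings[, 1], stations[, 1], "-")^2 +
        outer(buildings[, 2], stations[, 2], "-")^2
  sum(apply(d2, 1, min))
}

# optim works on a plain numeric vector, so the station matrix is flattened
# (all x values, then all y values) and rebuilt inside the objective.
objective <- function(par, buildings) {
  sssd(matrix(par, ncol = 2), buildings)
}

current <- sssd(stations, buildings)                       # (1) current placement
fit     <- optim(as.vector(stations), objective, buildings = buildings)
optimal <- fit$value                                       # (2) optimised placement
```

Starting optim from the current placement means the optimised sssd cannot come out worse than the current one, which makes the comparison between (1) and (2) easy to read.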
The R code that we developed was as follows:
Scaling this up to London?
In this blog post I will share with you how I went from using toy data to using OpenStreetMap for the locations of buildings and the Transport for London website (https://www.tfl.gov.uk/cdn/static/cms/documents/stations.kml) for the locations of the 301 tube and DLR stations that make up the TfL station infrastructure.
- Download the TfL station location data and import it into QGIS by adding it as a new vector layer.
- Convert the locations into WGS84-UTM30N coordinates as a shapefile so that one unit is one meter.
- Download the building data from OpenStreetMap using the built-in features in QGIS, i.e. Vector->Openstreetmap->download data
- Export the polygon layer and filter away non-buildings e.g. surface water, agriculture.
- Calculate the centroids of your buildings using the built-in QGIS features, e.g. Vector->Geometry tools->Polygon centroid
- Save your building centroids as WGS84-UTM30N coordinates within a shapefile.
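To make the centroid step less of a black box, here is a small base R sketch of the standard area-weighted (shoelace) centroid of a simple polygon. It only illustrates the idea; `polygon.centroid` is a hypothetical helper, and I am assuming rather than asserting that the QGIS tool computes the same quantity for every polygon type.

```r
# Area-weighted centroid of a simple (non self-intersecting) polygon.
# x, y: vertex coordinates in order, without repeating the first vertex.
polygon.centroid <- function(x, y) {
  xn <- c(x[-1], x[1])        # next vertex, wrapping round to the first
  yn <- c(y[-1], y[1])
  cross <- x * yn - xn * y    # shoelace terms
  area  <- sum(cross) / 2     # signed area
  c(x = sum((x + xn) * cross) / (6 * area),
    y = sum((y + yn) * cross) / (6 * area))
}

# A 10 m by 20 m rectangular "building" footprint.
centre <- polygon.centroid(c(0, 10, 10, 0), c(0, 0, 20, 20))
```

Because the coordinates are projected into UTM, the result is in metres and can be used directly in the distance calculations.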
To process this empirical data in R, I modified our code into the following:
As can be seen, the code is more or less the same as before, except that maptools is used to read in the shapefiles and some preprocessing is performed to massage the data into the format expected by our previous code.
- The only modification perhaps worthy of explanation is the new ‘min.distance.compare’ function, which returns a vector of the shortest distance in meters from each building to its nearest tube station. This function is useful because you can plot the distribution of distances and compute summary statistics from its output.
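A function of this kind could look roughly like the sketch below. It is not the original ‘min.distance.compare’ listing: the name `min.distance.sketch`, the argument order and the toy coordinates are assumptions, and both inputs are taken to be two-column matrices of projected coordinates in meters.

```r
# Shortest distance from each building to its nearest station.
# buildings, stations: two-column matrices (x, y) in metres.
min.distance.sketch <- function(buildings, stations) {
  # Squared distance matrix: one row per building, one column per station.
  d2 <- outer(buildings[, 1], stations[, 1], "-")^2 +
        outer(buildings[, 2], stations[, 2], "-")^2
  sqrt(apply(d2, 1, min))
}

# Three toy buildings and two toy stations.
buildings <- rbind(c(0, 0), c(100, 0), c(0, 300))
stations  <- rbind(c(0, 0), c(0, 400))

d <- min.distance.sketch(buildings, stations)
summary(d)           # the kind of summary statistics reported in this post
total <- sum(d^2)    # the sssd follows directly from the distance vector
```

The full distance matrix has one entry per building and station pair, so for a very large number of buildings it may be worth processing the buildings in chunks.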
As can be seen below, I chose to limit my analysis to a 10 km radius around central London. The coverage of the OSM data is pretty good, with the vast majority of areas well populated with building polygons, from which we generated the centroid points.
Using the output of min.distance.compare we are able to identify how likely you are to be within n meters of a TfL station. From the histogram below, the walking distances from buildings to TfL stations appear to be roughly exponentially distributed. The sssd (sum of squared shortest distances) was 1.95e11 and the summary statistics (in meters) were as follows:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|------|---------|--------|------|---------|------|
| 1.108 | 338.600 | 567.000 | 950.500 | 1253.000 | 5844.000 |
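The "how likely are you to be within n meters" question can be answered from the distance vector with an empirical cumulative distribution function. The sketch below uses simulated exponential distances as a stand-in for the real output, so the numbers it produces are illustrative only; `share.within` is a hypothetical name.

```r
# Stand-in for the vector of building-to-station distances (metres).
# Simulated here; in practice this would be the output of min.distance.compare.
set.seed(42)
d <- rexp(1000, rate = 1 / 950)

# ecdf() returns a function giving the share of buildings within n metres.
share.within <- ecdf(d)

p500 <- share.within(500)             # share of buildings within 500 m
cutoffs <- share.within(c(250, 500, 1000))

# The same figure computed by hand, as a sanity check.
p500.check <- mean(d <= 500)
```

Plotting `share.within` with `plot(share.within)` gives the cumulative picture that complements the histogram.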
For the purpose of comparison I generated a set of random station points using a uniform distribution between the min and max observed building coordinates – this can be easily achieved using the ‘runif’ function. The sssd was 7.83e10 whilst the summary statistics were as follows:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|------|---------|--------|------|---------|------|
| 2.863 | 465.500 | 704.500 | 753.900 | 988.600 | 2246.000 |
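The random baseline can be generated along the following lines. This is a sketch under assumptions: the building coordinates are simulated placeholders, and drawing 301 random stations (the full TfL count mentioned earlier) is my choice for the example rather than a figure confirmed for the 10 km study area.

```r
# Placeholder building centroids in projected metres (hypothetical values).
set.seed(123)
buildings <- cbind(x = runif(500, 690000, 710000),
                   y = runif(500, 5700000, 5720000))

# Assumed number of random stations to draw.
n.stations <- 301

# Uniform draws between the min and max observed building coordinates.
random.stations <- cbind(
  x = runif(n.stations, min(buildings[, "x"]), max(buildings[, "x"])),
  y = runif(n.stations, min(buildings[, "y"]), max(buildings[, "y"]))
)
```

Note that this samples from the bounding box of the buildings, so a single draw is only one realisation of the baseline; repeating the draw several times would show how much the baseline sssd varies.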
Interestingly enough, the random allocation does not perform badly in comparison to the current placement of TfL stations. It outperforms the current TfL setup on sssd (7.83e10 against 1.95e11), mean distance (753.9 m against 950.5 m) and maximum distance (2246 m against 5844 m). The current TfL allocation does, however, have a better median (567 m against 704.5 m), but at the expense of poor coverage in south-east and north-east London.
In the next post I will share the outcome of the optimisation and attempt to quantify how far from optimal the placement of stations is for walking distances.