Geographic referencing (or “geo-referencing”) is an increasingly important process in the production of National Statistics, allowing greater data accuracy and facilitating the sharing and aggregation of data.
This guide describes the process and explains why geographic referencing is an improvement on the previous process of manual postcode referencing.
The production of National Statistics involves the collection, processing and output of statistical data.
Most data events can be referenced to a known location, and this means that most statistics can be output using a geographic classification.
For example, we might produce statistics of unemployment rates by electoral ward or statistics of birth rates by local authority district (LAD).
Between the late 1970s and 2000 the approach to data referencing was to manually allocate the event postcode.
Although this was a valuable method, it was not without its limitations, and so we moved to a new approach: geographic referencing.
This involves referencing events to a specific and fixed point, usually a grid reference; the advantages of this are described below.
3. Geographic Referencing
There is great potential for data visualisation, as grid-referenced events can be located on a map and viewed in relation to other geographic features. These include administrative areas and boundaries, as well as physical features such as roads, coastlines and buildings.
As well as simply viewing the data, we also have the potential to use geographic information systems (GIS) to carry out detailed analysis and modelling.
We can also readily link between different datasets, as we simply need to identify events with a common grid reference.
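As a rough illustration of this kind of linkage, the sketch below joins two small datasets on a shared grid reference. The record layouts, field names and values are hypothetical and are not taken from any real dataset; real linkage would also need to allow for differences in coordinate precision.

```python
from collections import defaultdict

# Hypothetical records: eastings and northings in metres, plus one attribute each.
births = [
    {"easting": 451200, "northing": 207300, "birth_id": "B1"},
    {"easting": 451950, "northing": 207810, "birth_id": "B2"},
]
housing = [
    {"easting": 451200, "northing": 207300, "tenure": "owned"},
    {"easting": 452400, "northing": 208100, "tenure": "rented"},
]


def link_by_grid_reference(left, right):
    """Pair each left record with every right record at the same grid reference."""
    index = defaultdict(list)
    for rec in right:
        index[(rec["easting"], rec["northing"])].append(rec)

    linked = []
    for rec in left:
        for match in index.get((rec["easting"], rec["northing"]), []):
            # Merge the two records; the grid reference fields are identical.
            linked.append({**match, **rec})
    return linked


linked = link_by_grid_reference(births, housing)
```

Only the first birth record shares a grid reference with a housing record, so only that pair is linked.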
There are a number of possibilities for geographic referencing:
3.1. Geographic referencing using the postcode centroid
Since 2000, under the Gridlink® initiative, our postcode directories have provided the centroid grid reference for each postcode. This is the mean grid reference of all addresses in the postcode, snapped to the nearest property, and so approximates the geographic centre of the postcode.
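The centroid calculation can be sketched as follows. This is a minimal illustration that assumes "snapping" means choosing the address point closest to the mean position; the exact Gridlink® procedure is not described here, and the function name and coordinates are hypothetical.

```python
import math


def postcode_centroid(addresses):
    """Return the address point closest to the mean of all address points.

    `addresses` is a list of (easting, northing) tuples in metres.
    """
    mean_e = sum(e for e, _ in addresses) / len(addresses)
    mean_n = sum(n for _, n in addresses) / len(addresses)
    # Snap the mean position to the nearest actual address (assumed behaviour).
    return min(addresses, key=lambda p: math.hypot(p[0] - mean_e, p[1] - mean_n))


# Hypothetical addresses sharing one postcode.
addresses = [(451200, 207300), (451210, 207310), (451260, 207320), (451400, 207500)]
centroid = postcode_centroid(addresses)
```

Because the result is always one of the input addresses, the centroid cannot fall on empty land between properties.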
This is a good start and may be the most accurate reference possible, as we may not have any more detailed locational information for the data event. However:
3.1.1. Postcodes do not map directly to other geographic areas
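Postcodes do not take account of administrative boundaries (or any other geography), so a postcode can “straddle” a boundary, with some of its addresses on either side.
Each postcode is nevertheless assigned to a single administrative area, and all of its addresses are treated as lying in that area.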
The result is that some addresses lying close to administrative boundaries may appear to be assigned to the ‘other’ area.
For small areas such as output areas (OAs), the resulting statistical differences can sometimes be considerable.
Fortunately the differences are less significant for larger areas, as:
there are proportionally fewer postcodes straddling the boundaries
the differences are more likely to be cancelled out, as data that is misallocated to one area may be balanced by an opposite misallocation elsewhere on the area boundary. This cancellation effect is even stronger in datasets with a large number of observations
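The straddling and cancellation effects described above can be illustrated with a toy example. The sketch assumes a single hypothetical boundary running north to south, and assumes that each postcode is assigned to the area containing its mean address position; the real assignment rule may differ, and all values are invented.

```python
from collections import Counter

BOUNDARY_EASTING = 1000  # hypothetical boundary: area "A" lies west, "B" east


def area_of(easting):
    """Return the area that a single point truly lies in."""
    return "A" if easting < BOUNDARY_EASTING else "B"


def assigned_area(eastings):
    """Assign a whole postcode to the area containing its mean easting (assumed rule)."""
    return area_of(sum(eastings) / len(eastings))


# Hypothetical postcodes: eastings of their addresses (northings omitted for brevity).
postcodes = {
    "PC1": [900, 950, 980, 1010],    # mostly in A, one address in B
    "PC2": [990, 1020, 1040, 1060],  # mostly in B, one address in A
    "PC3": [700, 720, 760],          # wholly in A
}

true_counts = Counter()      # addresses counted by their true area
postcode_counts = Counter()  # addresses counted by their postcode's assigned area
misallocated = 0
for eastings in postcodes.values():
    area = assigned_area(eastings)
    for easting in eastings:
        true_counts[area_of(easting)] += 1
        postcode_counts[area] += 1
        if area_of(easting) != area:
            misallocated += 1
```

In this example two addresses are misallocated, one in each direction, so the area totals still agree. With fewer postcodes or a smaller area, the errors would be less likely to cancel.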
3.1.2. Postcodes can move around
Royal Mail assigns postcodes to address locations for the sole purpose of providing an efficient mail delivery service.
Postcodes may be terminated, reassigned and reused as a result of demolitions and new building activity.
Royal Mail may occasionally decide to reuse these terminated postcodes in another part of the same postcode sector, so the physical location of a postcode may shift.
This could cause data to be assigned to the wrong area unless care is taken to use the correct year's directory (note though that Royal Mail aims not to reuse a postcode for at least two years after it has been discarded).
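One way to guard against this is to look up each event in the directory for the year in which it occurred. The sketch below assumes a hypothetical structure holding one extract per year; the postcode, years and coordinates are invented, and real directories are organised differently.

```python
# Hypothetical directory extracts: year -> {postcode: (easting, northing)}.
directories = {
    2005: {"AB1 2CD": (451200, 207300)},
    2009: {"AB1 2CD": (453900, 209100)},  # same postcode reused at a new location
}


def locate(postcode, event_year):
    """Look up a postcode in the directory for the year the event occurred."""
    directory = directories.get(event_year)
    if directory is None:
        raise KeyError(f"no directory loaded for {event_year}")
    # Returns None if the postcode was not live in that year's directory.
    return directory.get(postcode)


location_2005 = locate("AB1 2CD", 2005)
location_2009 = locate("AB1 2CD", 2009)
```

The same postcode resolves to two different locations, so an event from 2005 referenced against the 2009 extract would be placed in the wrong position.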
3.1.3. Area boundaries keep changing
The UK has a very high level of electoral and administrative boundary change. For example, between 2001 and 2010 there were over 8,000 electoral ward/division boundary changes in England and Wales alone.
This further complicates referencing postcodes to administrative areas.
Once a ward boundary has changed, the allocation of some properties may change.
In addition, when the next version of the postcode directory is released, it will once again be affected by straddling. For example, a postcode that previously lay wholly within ward A may now be split between ward A and ward B.
All properties in the split postcode will end up referenced to either ward A or ward B, so a proportion of them may appear to be referenced to the wrong ward.
Although we can relate the grid reference of the postcode centroid to a map and perform detailed analysis on the associated data, this method does not solve the problems of straddling and boundary change.
3.2. Geographic referencing using address-level grid references
Address-level grid referencing, which we are working towards, is even more powerful.
Whereas the postcode centroid gives an approximate location of a data event, the address-level grid reference describes precisely where it occurs.
This has several advantages:
straddling is no longer an issue, as each event is referenced to an individual address rather than to a postcode covering multiple addresses
dealing with administrative boundary change is much easier. We simply load the new boundary set into a GIS and, knowing the events are precisely located, can very quickly produce accurate statistics for the new boundaries
outputs and analysis can be even more flexible. For example, if we wanted to consider whether there is a relationship between how close people live to a motorway and the incidence of a particular disease, our data is now referenced with the accuracy required to do this
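The boundary-change advantage can be sketched without a full GIS. The example below uses a standard ray-casting point-in-polygon test to count the same events under an old and a new boundary set. The ward names, polygons and event coordinates are all hypothetical, and a production system would normally rely on GIS software rather than hand-written geometry.

```python
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon (a list of vertices)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_by_area(events, boundaries):
    """Count events falling inside each named boundary polygon."""
    counts = Counter()
    for x, y in events:
        for name, polygon in boundaries.items():
            if point_in_polygon(x, y, polygon):
                counts[name] += 1
                break
    return counts


# Hypothetical address-level event locations.
events = [(100, 100), (450, 200), (550, 300), (900, 800)]

# Hypothetical ward boundaries before and after a boundary change.
old_boundaries = {
    "Ward A": [(0, 0), (500, 0), (500, 1000), (0, 1000)],
    "Ward B": [(500, 0), (1000, 0), (1000, 1000), (500, 1000)],
}
new_boundaries = {
    "Ward A": [(0, 0), (400, 0), (400, 1000), (0, 1000)],
    "Ward B": [(400, 0), (1000, 0), (1000, 1000), (400, 1000)],
}

old_counts = count_by_area(events, old_boundaries)
new_counts = count_by_area(events, new_boundaries)
```

The events themselves are never re-referenced; only the boundary set changes. Here the event at (450, 200) moves from Ward A to Ward B under the new boundaries.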
Note, however, that although address-level grid referencing is powerful, it does have limitations:
not all data can be assigned to an address
the automated assigning of grid references to addresses is more difficult than it is for postcodes. This is because, unlike postcodes, addresses can be lengthy, complicated and inconsistent. For example, the first line of an address may be a building number and street name, the number of a flat within a building or the name of a property
as data relates to individual addresses, greater security precautions may be required to protect the confidentiality of individuals
3.3. Other forms of locational referencing
Address-level grid referencing is appropriate for data events that relate to residential and business properties, but some events relate to other types of location.
For example, if the data event is the occurrence of a specific type of cereal crop, the location will be a field.
Such events can be assigned to land parcels via identifiers such as the GeoPlace® unique property reference number (UPRN), or by reference to Land Registry parcel boundaries.
Other events (for example, the location of a street crime) may simply need a grid reference.
An alternative may be to reference to the nearest address.
The key point, though, is that all data needs suitable, consistent and unambiguous geographic identifiers.
The approach of using postcodes to reference geographic data is a valuable tool but is subject to a number of limitations, especially when trying to produce statistics for small areas.
Geographic referencing based on the postcode centroid offers many advantages in terms of facilitating event linkage, data visualisation and data analysis, but it doesn't eliminate the problems caused by straddling and boundary change.
If a reference can be given at address level, the potential is even greater, allowing for detailed and accurate small-area statistics.
Different types of data will of course require different types of referencing, and issues such as ensuring confidentiality are crucial.
Therefore we are giving a great deal of attention to ensuring that we utilise geographic referencing in the best possible way.