Importing Scottish places

This month I’ve started working on importing large amounts of data for places in Great Britain. I’m writing this blog post as I go, so that I can easily remember what I did. Imports of authors seem like old news now, so I might not write any more detailed posts about how I did those. This is a very long post by today’s standards, so the ‘Too Long, Didn’t Read’ version is that I’ve imported pages for over 6,000 Scottish settlements, plus the islands they sit on. See below for the details of how I did the geodata.

I’m starting with Scotland because that seems like the easiest job. England and Wales both have extra complications to deal with but will start with the same basic technique as Scotland. (Or so I thought when I started. I now know that the Scottish islands make Scotland quite complicated and it’s taken much longer than I expected.)

The kind of places I’m looking for are what the By The Sword Linked wiki classes as settlements, though some other projects call them populated places. This can be any built-up area from a hamlet to a city. The key thing is that it represents a collection of physical buildings and streets, not an administrative jurisdiction. I’m following the common practice of locating settlements by the coordinates of a single point, because mapping the boundaries of a built-up area as distinct from the administrative jurisdiction is very difficult and probably not worth doing. Single points are very easy to put on a map, as you can see from the map of combat events. For now I don’t care how precise these points are as long as they’re somewhere in or near the settlement they represent. In future it will be possible to edit the coordinates manually to bring them closer to where the settlements were in the 17th century, which might be slightly different from where the centre is now. This project uses WGS84 coordinates because that system is very widely used and understood (by Google Maps, Wikimedia projects, many GPS systems and more) and covers the whole world.

There are tens of thousands of settlements in Great Britain, so manually creating a wiki page for each one would be a huge waste of time, if it was even possible. It’s much easier to import existing data. In this case, I’m starting with the Ordnance Survey of Great Britain, which has made lots of geodata free to reuse for all purposes under the Open Government Licence. The dataset I’m using is Open Names, which includes streets and postcodes as well as settlements. It can be downloaded as Comma Separated Values (CSV), which is an easy format for doing batch imports, and it includes identifiers that link to an online Linked Open Data service. For each place, there are coordinates for a single point. These are in the Ordnance Survey National Grid, which is the most precise and convenient coordinate system for Great Britain but doesn’t cover anywhere else. This is a different coordinate system from WGS84, but we’ll come back to that later.

Because this dataset includes every settlement and road that exists now, it’s very large. To make it more manageable, the Ordnance Survey has split it into many smaller files. This creates some challenges of its own because you can’t easily open the whole dataset in a spreadsheet. Instead, I wrote a simple Python script which opens each file in turn, finds only the rows that refer to settlements, not other kinds of named place, and merges them into one file for each of England, Scotland, and Wales. The file for Scotland has just over 6,000 rows, which is small enough to open in a spreadsheet or other software.

The Ordnance Survey data is based on what is true now. In most cases, the current place names are useful as standard forms. It’s not worth trying to make the page names the same as the seventeenth century spellings because these could vary so much, although the description property can contain any known variants, and forms in languages other than English. As I said before, the point coordinates are near enough to be useful even if they’re not exactly the same as the location of the place in the seventeenth century. The data gives the current local authority that a place is under, but many of these are completely different from the counties that were used in the 17th century. I need to add the historic counties in order to:

  • qualify the page name by county as well as country. This is especially useful if two places in different counties have the same place name. In any case, it’s helpful for users to have an idea of where the place is when they see it in search suggestions.
  • add a direct relationship between the settlement and county using the relationship type ancestor area, which makes it easy to query for all the settlements in a county. Constructing a query that drills down through all levels of the local government hierarchy would be much more difficult, and in practice not all levels will be represented for a long time.

The Historic County Borders Project provides shape files that represent the boundaries of the historic counties of the UK (including Northern Ireland, but not the south) taken from early Ordnance Survey maps. These are free to download (the terms of use are effectively the same as Creative Commons Attribution). There are several options. I’ve chosen:

  • definition B, which treats exclaves of a county as part of the county they are exclaves of, not part of the county that surrounds them. This is more suitable for finding administrative jurisdictions.
  • OSGB36 coordinate reference system, because this matches the data that I’ve downloaded from the Ordnance Survey. WGS84 is also available but that’s not compatible with the data I’m working with at this stage, even though it’s what I want to end up with.
  • full resolution, because I want the boundaries to be as precise as possible. Smaller low resolution files are also available if precision isn’t so important to you.

Even after taking care over getting the most precise and rigorous definition of the county boundaries, there may be anomalies where a settlement switched from one county to another between the civil wars of the mid-17th century and the creation of the first Ordnance Survey maps in the 19th century. For England I can check the standard reference book by Frederic Youngs, but I don’t know of any similar source for Scotland, Wales or Ireland, so for now I’ll just have to live with the possibility of a few places being in the wrong county. As the project uses a wiki, it will be easy to update things if new information turns up.

The next step is to use a Geographical Information System (GIS) to combine the settlements with the county boundaries. I’m using QGIS because it’s free (and also easy to install from a repository if you use Linux and don’t mind getting an older version of QGIS).

QGIS is very powerful but can be difficult to learn. It can probably do what you need quite easily, but the trick is knowing what QGIS calls it. In this case, what I need to find the county for each settlement is a ‘spatial join’ or ‘join attributes by location’, which matches point coordinates to the area they fall within. The first step is to create a new project in QGIS, go into the project properties, set the Coordinate Reference System (CRS) to OSGB36 (no need to set a projection), and save the project. If you don’t do this before importing data, things can go wrong (as I’ve just discovered!). Then you can load each set of data into a separate layer (instructions here). I imported a vector layer with the county boundary shapefiles for the whole of Great Britain and Northern Ireland, and a delimited text file for settlements in Scotland. Now it’s easy to combine the data with a spatial join following this tutorial by Ujaval Gandhi, although what I’m doing is a bit simpler. All I really needed to set up was:

  • target or input layer (the name varies between versions of QGIS): the layer with the settlements. This is what I want to add new data to.
  • join layer: the one with the county boundaries. This is where the extra data comes from. It gets copied to every matching row in the list of settlements.
  • one-to-one relationship, because I expect each point to be in exactly one county.

Everything else can be left on the default settings. The result is a new layer with a row of data for every settlement containing all the original Ordnance Survey data plus a copy of the Historic Counties data for the county that the coordinates are within. These can be exported as CSV files.
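For intuition, joining points to polygons is essentially a point-in-polygon test applied row by row. This toy sketch in plain Python shows the idea; it is not what QGIS actually does (real GIS software uses spatial indexes and handles boundary edge cases far more carefully):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from (x, y) crosses; an odd count means the point is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def spatial_join(settlements, counties):
    """Attach the matching county's name to each settlement point,
    one-to-one, like the QGIS 'join attributes by location' tool."""
    joined = []
    for s in settlements:
        match = next((c for c in counties
                      if point_in_polygon(s["x"], s["y"], c["boundary"])),
                     None)
        joined.append({**s, "county": match["name"] if match else None})
    return joined
```

The one-to-one setting above corresponds to taking only the first matching polygon for each point.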

(At this point I should also have done a spatial join with the Scottish Islands shapefile from National Records of Scotland to find out which island each settlement is on, but I didn’t think of that so I had to come back to it later. The shape data needed some manual cleanup in QGIS to make sure that English and Gaelic names were recorded in separate fields and that the identities of islands were unambiguous. The difficulty of identifying and disambiguating similarly named islands, and inconsistencies and omissions in the NRS data, prompted me to also make a wiki page for each island with the new entity type island. All this made Scotland much more complicated than I expected.)

It should be possible for QGIS to convert coordinates from OSGB36 to WGS84 when exporting data, but I haven’t been able to make it work in the version of QGIS I’m using. Instead I used the Ordnance Survey’s GridInquestII tool to convert from OSGB36 to latitude and longitude in ETRS89, which isn’t exactly the same as WGS84 but is near enough for my purposes. It would be even easier if the Ordnance Survey’s CSV data already included the WGS84 coordinates, which are already available through their online Linked Open Data service.

Before doing any more with the settlement data, I also need some data on constituencies of the Parliament of Scotland so I can link every settlement to the constituency it was in. Most settlements are in the shire constituency that matches their county, but some are in separate burghs. I got a list of constituencies from Wikidata using this SPARQL query. I barely understand SPARQL, but I was able to cobble the code together from examples provided by the query service. I downloaded the result as CSV and opened it in a spreadsheet for manual editing, then used the links to check each Wikipedia page. Although these pages are very vague, they sometimes contain more information than the Wikidata item, which allowed me to delete some burghs that didn’t exist in the mid-17th century. If there are more that I’ve missed, I can delete them from my wiki later. The rest were expanded into all the data that I need to create wiki pages for them. I’ll come back to this data later.

The next thing to do with the settlement data is import it into OpenRefine. This software is a little bit like a spreadsheet in that it displays one dataset in rows and columns and allows you to sort and filter the data. But it can do so much more than a spreadsheet because it allows you to do batch operations on large numbers of cells. These can involve fairly complex expressions written in OpenRefine’s own programming language or in Python. And that’s before we even get onto moving data into and out of Wikidata…

First I did some basic checking, cleaning and rearrangement. For example, the coordinates exported from GridInquestII have separate columns for latitude and longitude. OpenRefine can easily join them into one string with a comma between the two numbers, which is the format that Semantic MediaWiki needs coordinates to be in. The Ordnance Survey data has two columns for names, plus columns to show which language each name is in. I would expect these to be only English and Gaelic, but using a facet to see a list of unique values in each column revealed that the language codes are incomplete and inconsistent. In four cases, Gaelic names were wrongly identified as Welsh. In another case, Greenfield in Glasgow was wrongly given the Welsh name for Greenfield in Wales, which is Maes-glas. Another combination of facets showed that 16 places have two names each but neither name has a language code. These were quite easy to resolve manually. Once every place with two names had correct language codes, I created new columns for English names and Gaelic names, using a Python expression to check the language codes to see which was which, because they’re not in a fixed order in the original data. (I should also have used facets to check that a pair of names with English and Gaelic language codes looked like they were the right way round, but I only discovered later that the Ordnance Survey has the language codes the wrong way round in about 15 cases, and in a couple there are two English names but one is wrongly identified as Gaelic.)
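The Python expression for the name columns is just a conditional on the language codes. Written as ordinary functions outside OpenRefine, it might look like this; the column names and the language codes ‘eng’ and ‘gla’ are my assumptions from reading the Open Names data, so check them against your copy:

```python
def pick_name(row, lang):
    """Return the name whose language code matches lang ('eng' or 'gla').
    The columns NAME1/NAME1_LANG/NAME2/NAME2_LANG are assumed names,
    since the two names aren't in a fixed order in the original data."""
    if row.get("NAME1_LANG") == lang:
        return row.get("NAME1")
    if row.get("NAME2_LANG") == lang:
        return row.get("NAME2")
    return None

def smw_coordinate(lat, lon):
    """Join latitude and longitude into the single comma-separated
    string that Semantic MediaWiki expects."""
    return f"{lat},{lon}"
```

In OpenRefine itself the same logic goes into a Jython expression reading from `cells`, but the branching is identical.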

Once the English and Gaelic names were arranged it was quite easy to join fields together to make columns for:

  • wiki page name in English in the form: [Settlement name], [County name], Scotland
  • redirect in Gaelic in the form: [Settlement name], [County name], Alba
  • description in the form: Settlement in [County name], Scotland. Gaelic: [Gaelic name].

Where a settlement is on an island other than the mainland, the name of the island is also added.
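Building the three fields is simple string concatenation. A sketch of how they might be assembled, where the placement of the island qualifier and the exact wording are my guesses at the scheme described above:

```python
def page_fields(name, county, gaelic=None, island=None):
    """Build the page name, Gaelic redirect, and description for one
    settlement. Putting the island before the county is an assumption."""
    qualifier = f"{island}, {county}" if island else county
    fields = {
        "page_name": f"{name}, {qualifier}, Scotland",
        "description": f"Settlement in {qualifier}, Scotland.",
    }
    if gaelic:
        fields["redirect"] = f"{gaelic}, {qualifier}, Alba"
        fields["description"] += f" Gaelic: {gaelic}."
    return fields
```

In OpenRefine this is a handful of column joins rather than a function, but the pieces are the same.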

Once page names were constructed, I faceted them and then filtered by choice count to find any that occurred more than once. These had to have extra qualifiers manually added to make them different, because two wiki pages can’t have the same name. The qualifiers were usually the name of a nearby larger town, loch or river, or points of the compass. To work them out I needed to look at a map. Initially I used links to the Ordnance Survey ID but I found that the Ordnance Survey website was far too slow. Instead I used OpenRefine to turn the coordinates into a link to Wikimedia’s GeoHack website, which is much quicker.
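The expression for the GeoHack link column is short; as a standalone function it might look like this (the GeoHack host and parameter format are from memory, so compare against a real GeoHack link before relying on them):

```python
def geohack_url(lat, lon):
    """Build a GeoHack link for a settlement's WGS84 coordinates,
    using the degrees-with-hemisphere parameter format."""
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return ("https://geohack.toolforge.org/geohack.php"
            f"?params={abs(lat)}_{ns}_{abs(lon)}_{ew}")
```

Clicking through from a facet of duplicate names to one of these links makes it quick to find a disambiguating landmark.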

(At this point I could have reconciled the data with Wikidata to get Wikidata IDs and extra data but I’ve got enough from the Ordnance Survey data for now.)

Finally, some more editing with OpenRefine, a custom Python script, and a spreadsheet left me with CSV files that I could run through my usual import process. Once the XML file was generated, it took about 20 minutes to import through the command line script importDump.php, which sounds like a long time but is a much quicker, easier, and more reliable way of importing 6,000 pages and 500 redirects than the Special:Import page, which is how we used to do it at Linking Experiences of WW1.

Now it’s all done, you can see a list and count of all addresses in Category:Addresses and a list of Scottish settlements in Concept:Settlements in Scotland. Where there are settlements on an island other than the mainland, the island page should display a map of everything on the island, for example, the Isle of Skye.

Some limitations of this data:

  • it could include settlements that didn’t exist as long ago as the 17th century. These will need to be weeded out later.
  • settlements that existed in the 17th century but later completely disappeared through depopulation or coastal erosion are missing.
  • coverage of Gaelic names is patchy, and the ones I’ve got could be mis-spelt because they were cobbled together from different sources.
  • most Scots names are missing where they differ from the English names.
  • coverage of Geonames IDs is patchy, and most Wikidata IDs are missing (just because I didn’t have time to reconcile them), but I should be able to add these later.

But this is a good start, and even with the unexpected complications, importing a batch was much quicker than manually creating pages would be.