Importing English places

I’ve been working on importing wiki pages for settlements in England. This post follows on from the one about importing Scottish places and will refer back to that instead of repeating all the details, but for England some things will be different.

I had already extracted separate CSV files for England and Wales from the Ordnance Survey data at the same time as doing Scotland. Since then I have decided to put England and Wales together in case there are places that have changed country as well as county. That required adapting the Python script that extracts settlements from the Ordnance Survey data, because the CSV file for England is so big that it’s difficult to open it in a text editor and paste in the Wales data. The new script merges places from England and Wales into the same CSV file and drops columns that I don’t need, which makes the file much smaller and easier to deal with.
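A merge like this can be sketched in a few lines of Python. This is only an illustration, not the script I actually used: the column names in KEEP and the added COUNTRY column are assumptions, and the real Ordnance Survey headers may differ.

```python
import csv

# Hypothetical column names, chosen for illustration only.
KEEP = ["NAME1", "LOCAL_TYPE", "GEOMETRY_X", "GEOMETRY_Y"]

def merge_places(sources, out, keep=KEEP):
    """Stream rows from several (country, file handle) pairs into one CSV.

    Only the columns listed in `keep` are written, plus a COUNTRY column
    recording which source each row came from. Returns the row count.
    """
    writer = csv.DictWriter(out, fieldnames=["COUNTRY"] + list(keep),
                            lineterminator="\n")
    writer.writeheader()
    count = 0
    for country, handle in sources:
        # Rows are read one at a time, so a very large file never has
        # to be held in memory or opened in an editor.
        for row in csv.DictReader(handle):
            slim = {col: row.get(col, "") for col in keep}
            slim["COUNTRY"] = country
            writer.writerow(slim)
            count += 1
    return count
```

With real files, each handle would be opened with `open(path, newline="", encoding="utf-8")` and passed in as `("England", handle)` or `("Wales", handle)`.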

The next step is to import the CSV file into QGIS for a spatial join, which is explained in more detail in the post about Scotland. As well as counties, I will also be joining places to National Character Areas for England and Wales. These areas are defined according to landscape rather than administrative boundaries. The definitions start with geology, which doesn’t change over historical time, but also summarize historical land use including a rough idea of when and how an area was enclosed. This is very useful for military history because it tells us how easy or hard it was for cavalry to operate effectively in an area. You could also use this data for the kind of social and cultural analysis that David Underdown did in Revel, Riot and Rebellion if you want to. A few places couldn’t be joined to any character area in Wales. I’m not sure why but I can deal with these later in OpenRefine. For now, I’ve got a CSV file with all of the English and Welsh settlements in the Ordnance Survey data, all joined to historic counties, and most joined to a character area.
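QGIS does the spatial join for me, but the underlying idea is simple enough to sketch in plain Python. The example below is a simplified illustration using a ray-casting point-in-polygon test. It assumes each area is a single polygon given as a list of (x, y) vertices in the same coordinate system as the points, which real shape files (with multipart polygons and holes) do not guarantee. The function names are my own.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    from (x, y) towards +x crosses. An odd count means inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y value can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def spatial_join(points, areas):
    """points: {place name: (x, y)}; areas: {area name: [(x, y), ...]}.

    Returns {place name: area name}, with None for any place that
    falls outside every polygon.
    """
    joined = {}
    for name, (x, y) in points.items():
        joined[name] = next(
            (area for area, poly in areas.items()
             if point_in_polygon(x, y, poly)),
            None,
        )
    return joined
```

Places that come back as None are the equivalent of the ones that couldn’t be joined to any character area.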

(After this, I should also have done a spatial join on the ridings of Yorkshire and Lincolnshire, but I didn’t find the shape files for these at Wikishire until later. At the same time, I also adapted Wikishire’s shape file for Hampshire to make a polygon for the Isle of Wight. Because these shape files are only available as WGS84, I used Grid InQuest to convert the point coordinates in the CSV file at this stage.)

The next thing to do is open this file in a spreadsheet and manually check against Frederic Youngs, Local Administrative Units of England. This allows me to:

  • find places that changed counties between the British Civil Wars in the 17th century and the creation of Ordnance Survey maps in the 19th century (because the county shape files are based on 19th-century OS maps).
  • only include places that had administrative units (such as parishes, townships, or chapelries) named after them in the 17th century. This will exclude places that didn’t exist in the 17th century, which is essential because the Ordnance Survey data has over 30,000 populated places for England and I expect to need fewer than half of them for the 17th century. There will be some false negatives because not all places that existed in the 17th century had an administrative unit named after them. These can be added to the wiki manually as they’re discovered. While checking against Youngs I was also able to include some extra places that I remember being mentioned in historical records.
  • find alternative names for some places.

This method should avoid infringing compilation copyright in this reference book because the data I’m actually using is legally licensed from the Ordnance Survey and I’m mostly using Youngs to exclude records from this data. I only expected to copy data from Youngs in a small number of cases where a place changed county, which I didn’t think would be a substantial part of his work. In the end this was only 8 places out of over 12,000, plus about 30 relationships with the Ainsty of York and liberty of Lincoln, so definitely not a substantial part.

While doing these checks, I found some anomalies:

  • about 1,100 place names that are listed as administrative units in Youngs but have no obvious match with settlements in the Ordnance Survey dataset. These will be dealt with later. Some of these places were later absorbed by other towns and lost their identity, but some seem to be defects in the Ordnance Survey data. The next step will be to reconcile them against Geonames.
  • parishes that were split between two different counties. In these cases, I wasn’t sure which part of the parish contained the settlement that the parish was named after. I decided to stick with the counties that came out of the spatial join. If any are later found to be wrong, they can be changed.

Checking against Youngs took a long time: from about 15 minutes for Rutland to 4 hours for Yorkshire. After the first pass there were about 12,500 English places flagged to be used. These included over 1,000 places that needed further disambiguation, usually because there were multiple places in the OS data with the same name, or because there wasn’t an exact match between OS and Youngs but there was a place with a similar name.
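I did this checking by hand, but the "similar name" cases could in principle be shortlisted automatically. The sketch below uses `difflib.get_close_matches` from the Python standard library. The 0.8 cutoff and the function name are assumptions for illustration, and any suggestions would still need checking by eye.

```python
import difflib

def suggest_matches(reference_names, os_names, cutoff=0.8):
    """For each reference name with no exact match in the OS names,
    list up to three similarly spelt OS names, best match first."""
    os_set = set(os_names)
    suggestions = {}
    for name in reference_names:
        if name in os_set:
            continue  # exact match, nothing to resolve
        suggestions[name] = difflib.get_close_matches(
            name, os_names, n=3, cutoff=cutoff
        )
    return suggestions
```

Names that come back with an empty list have no near match at all and would need to be looked up elsewhere.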

Only 8 places needed the county to be changed, and only one had changed from England to Wales.

Next I put the whole CSV file into OpenRefine and used facets to export it into three separate CSV files:

  • English settlements to be imported
  • English settlements not to be imported (these will be saved in case I need them later)
  • Welsh settlements (these may well be discarded because I discovered further complications with Wales which mean that I’ll probably have to start again)
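I did the split with OpenRefine facets, but the same three-way partition could be sketched in Python. The column names "country" and "use" are hypothetical stand-ins for whatever columns hold the country and the import flag.

```python
def split_rows(rows):
    """Partition settlement rows into the three groups described above.

    Each row is a dict with (hypothetical) 'country' and 'use' keys.
    """
    buckets = {"english_import": [], "english_skip": [], "welsh": []}
    for row in rows:
        if row["country"] == "Wales":
            buckets["welsh"].append(row)
        elif row["use"] == "yes":
            buckets["english_import"].append(row)
        else:
            buckets["english_skip"].append(row)
    return buckets
```

Each bucket could then be written out with `csv.DictWriter` to its own file.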

As with Scotland, I also need to link every settlement to the parliamentary constituency it was in. I got all the English and Welsh constituencies from Wikidata and checked them against Mary Frear Keeler’s book about the Long Parliament. All constituencies need to be linked to each Parliament that they sent MPs to, and most also need to be linked to the jurisdiction of the sheriff who received the writs for elections and returned the winning candidates. I manually made a spreadsheet of shrievalties based on this list at the Internet Archive. As with Scotland, I used OpenRefine to create a new column for the constituency that a settlement was in, initially based on the county. Then I manually adjusted the values for boroughs. Borough constituencies and the settlements they shared a name with had to be checked against each other to make sure the names were spelt consistently and that known variations were recorded for both. While doing this, I found that the settlements for some boroughs were missing from my data because they never had parishes or lower-level units named after them. This was especially true of rotten boroughs in Cornwall. Constituencies, shrievalties and landscape areas were imported separately before the settlements.
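The logic of the constituency column can be sketched as follows. This is an illustration in Python rather than the OpenRefine steps I used, and the data structures and function names are assumptions: every settlement defaults to its county constituency, boroughs override the default, and a second function lists boroughs with no matching settlement in the data.

```python
def assign_constituency(settlements, boroughs):
    """settlements: list of dicts with 'name' and 'county' keys.
    boroughs: {(settlement name, county): borough constituency name}.

    Sets a 'constituency' key on each settlement: the borough if one
    is listed for that settlement, otherwise the county.
    """
    for s in settlements:
        key = (s["name"], s["county"])
        s["constituency"] = boroughs.get(key, s["county"])
    return settlements

def boroughs_without_settlement(settlements, boroughs):
    """Return borough keys that have no matching settlement row."""
    present = {(s["name"], s["county"]) for s in settlements}
    return sorted(key for key in boroughs if key not in present)
```

The second function corresponds to the check that turned up boroughs whose settlements were missing from my data.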

Back to settlements, I used OpenRefine to generate and disambiguate page names. This was similar to Scotland except that I also checked duplicate page names against Youngs. This allowed me to resolve about half of the ambiguous names that were previously flagged. While doing this disambiguation, I found that there are some duplicates in the Ordnance Survey data, and that the OS coordinates are sometimes not very accurate. Resolving the other half of the ambiguous names took several more days but was worth it to get rid of irrelevant data. I ended up with just under 12,200 settlements to import. It took about 25 minutes to import all the pages, which is a bit quicker than I was expecting.
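As a rough sketch of the disambiguation step, assuming a naming scheme where a duplicated name gets its county appended (the scheme here is for illustration and may not match the wiki’s actual conventions), page names could be generated like this in Python:

```python
from collections import Counter

def page_names(places):
    """places: list of (name, county) pairs.

    Names that occur only once are used as they are; names that occur
    more than once get ', County' appended.
    """
    counts = Counter(name for name, _ in places)
    return [
        f"{name}, {county}" if counts[name] > 1 else name
        for name, county in places
    ]

def unresolved(names):
    """Page names that are still duplicated after disambiguation,
    for example two same-named places in one county."""
    return sorted(n for n, c in Counter(names).items() if c > 1)
```

Anything returned by `unresolved` would still need manual attention, which is where most of the time went.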

This concept lists all of the English places that have pages. Added to the Scottish places I imported before, this makes more than 18,000 pages. This is an important milestone for the project as a big and difficult task is now out of the way, and it opens up more possibilities for record linkage. It took much longer than I thought, but in July I can concentrate on adding more manuscript transcripts.