How did I get here? – By The Sword Linked

By The Sword Linked will use Semantic MediaWiki to create human-readable and machine-readable data about the British Civil Wars. This didn’t just come from nowhere. These are some other projects I’ve been involved with that helped me to come up with this idea and gain the skills to put it into practice.

Bibliographic databases

When I started my PhD in 1997, I was still using handwritten file cards to manage references to books and articles. In 1999 I got my own laptop and entered all these records into a Microsoft Access database. That made things much easier to manage, and I could even use mailmerge to generate my bibliography in a Word document. In 2006 I decided I needed something better, so I started making a new bibliographic database using PHP and MySQL running on localhost (this may have been a strange way to do it, but it was what I knew at the time, and having a web front end seemed convenient). There was a web page for each author and each book or article (I don’t think I represented different levels such as work and edition). There were link tables in the underlying database to link authors to works that they wrote or edited, and an author’s page included a query that displayed a list of everything that was linked to them. Pages for books or articles could have notes entered on them, and I think there was some way of tagging subjects as well. There were also containers to represent journals, serials or edited collections, so that articles or chapters could be linked to them. I imported all the data from my old Access database and started adding more manually. Before I tackled the problem of how to import new records automatically, I found out about Zotero and so my own experiment fell by the wayside. In 2007, I transferred all my existing bibliographic data into Zotero, where it still is, along with lots more that I’ve scraped since then. Zotero can do so much more than what I had planned for my own database, let alone what I was capable of implementing myself, so I stopped even thinking about doing it my own way, and all the old source code and data have long since been deleted. But now I can see that my approach was kind of like what is now known as linked data, even though I didn’t know that at the time.
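The link-table idea described above can be illustrated with a minimal sketch. This is not the original PHP and MySQL code, which no longer exists: it uses Python's built-in sqlite3 module as a stand-in, and the table names, column names and sample records are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with authors, works and a link table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE work (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        -- Link table: one row per (author, work, role) relationship.
        CREATE TABLE author_work (
            author_id INTEGER NOT NULL REFERENCES author(id),
            work_id INTEGER NOT NULL REFERENCES work(id),
            role TEXT NOT NULL CHECK (role IN ('author', 'editor')),
            PRIMARY KEY (author_id, work_id, role)
        );
    """)
    # Placeholder records, invented purely for illustration.
    conn.executemany(
        "INSERT INTO author (id, name) VALUES (?, ?)",
        [(1, "Example Author"), (2, "Example Editor")],
    )
    conn.executemany(
        "INSERT INTO work (id, title) VALUES (?, ?)",
        [(1, "A Placeholder Monograph"), (2, "A Placeholder Edited Collection")],
    )
    conn.executemany(
        "INSERT INTO author_work (author_id, work_id, role) VALUES (?, ?, ?)",
        [(1, 1, "author"), (1, 2, "editor"), (2, 2, "editor")],
    )
    return conn


def works_for_author(conn, author_id):
    """Return (title, role) pairs for everything linked to one author."""
    rows = conn.execute(
        """
        SELECT w.title, aw.role
        FROM work AS w
        JOIN author_work AS aw ON aw.work_id = w.id
        WHERE aw.author_id = ?
        ORDER BY w.title
        """,
        (author_id,),
    )
    return rows.fetchall()
```

Calling `works_for_author(build_demo_db(), 1)` returns the list that an author's page would display. Keeping the relationship in its own table, rather than in a column on the work, is what lets one work have several authors or editors.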

From the Page

Somehow through the mid-00s history blogosphere, I got into contact with Ben Brumfield, now of Brumfield Labs. Ben was one of the first people to tackle the problem of developing a platform for crowdsourced manuscript transcription and indexing. He created From the Page and I played a small part in testing it by using it to publish and index images and transcripts of my great-grandfather’s letters from the First World War. From the Page works like a wiki: when editing a transcript, you can index names by putting double square brackets around them. This creates a link to a page about that named entity, which automatically lists all transcribed pages that mention the same entity. Entity pages can be categorised as people, places or whatever, and you can add general information about the entity, and external links to other online sources.
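To show the general idea of bracket-based indexing, here is a rough sketch in Python. It is not From the Page's actual code, and it assumes only the plain `[[Name]]` form described above; the function name, page identifiers and sample text are hypothetical.

```python
import re
from collections import defaultdict

# Matches text wrapped in double square brackets, e.g. [[Example Town]].
LINK_PATTERN = re.compile(r"\[\[([^\[\]]+)\]\]")


def index_entities(pages):
    """Map each bracketed name to the sorted list of page ids that mention it.

    `pages` is a dict of page id -> transcript text.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for match in LINK_PATTERN.finditer(text):
            index[match.group(1).strip()].add(page_id)
    return {name: sorted(ids) for name, ids in index.items()}
```

Given two transcripts that both mention `[[A. Person]]`, the returned dictionary lists both page ids under that name, which is the information an entity page needs in order to show every mention.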

Sandall’s history

While researching my great-grandfather’s service in the First World War, I found that the published official history of his battalion, written by the commanding officer, was out of copyright. That gave me the idea that I could publish my own digital edition. This wasn’t just some page images and bad OCR like you would get on the Internet Archive, and it wasn’t just a very carefully proofread plain text file like you would get from Project Gutenberg. The finished version has a web page for each chapter. Each name of a person or place is a hyperlink which leads to an interactive index where you can see a list of links to all mentions of that entity, or you can view, sort and filter a list of all entities. The person index links to external sources about soldiers: medal index cards in TNA’s catalogue, and entries in the Commonwealth War Graves Commission database if they died as a result of the war (although these links are now broken because CWGC have changed all their URLs and not left redirects: link rot is a big problem for digital history). The place index also allows you to view a map (although this is not currently working properly: I won’t rely on Google in future). What most people who use the site don’t need to know is that the index and map run on the Exhibit API (I feel lucky that this still works after all these years), and that the whole site is generated from a TEI XML file by XSLT.

Your Archives

Your Archives was an experimental wiki which the UK National Archives set up in 2007 and shut down in 2012. I was a major contributor and eventually a community moderator. The project was a notorious failure but rather than go on and on about what TNA did wrong, I’ll focus on positive things that don’t get talked about much. First of all, volunteering on this project helped me to get much better at using MediaWiki, and this experience is still valuable now. I used Your Archives to share notes from my own research, especially about SP 28, and Tom Crawshaw said that he found my page about Essex’s army to be very useful for his PhD research. Although the wiki didn’t take off to the extent that TNA (and I) hoped, a small and dedicated community grew up around transcribing and publishing wills, mostly from PROB 11. The Probate Transcripts category had several hundred transcripts of wills, although the quality and transcription conventions varied a lot (Your Archives never had any official policy or guidance about transcription conventions!). Another success was a page that I created to link to transcripts of official unit war diaries from the First World War that were already available elsewhere on the web (this was long before Operation War Diary and the digitization of WO 95). This became the third most viewed page on the whole site (see statistics), and several other editors contributed to it (see history), so it was a genuine collaboration, like wikis are supposed to be. This gave me a more ambitious idea: a project to create a wiki page for every British Army unit in the First World War, which would contain basic information about it, and links to its war diaries and other official documents. I made a few demo pages but there wasn’t much interest in it, I suspect because TNA had already decided to pull the plug long before they actually announced the site’s closure. 
As part of the planning for this project I investigated ways of scraping data from TNA’s catalogue and automatically generating wiki pages. MediaWiki’s documentation included how to export XML files from one wiki and import them into another. I realised that it was easy to generate these XML files outside MediaWiki and use them to import new pages.
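A minimal sketch of generating such a file outside MediaWiki, using only Python's standard library, might look like this. It keeps only the page, title, revision and text elements. Real export files also declare a versioned XML namespace on the root element and carry more metadata, so the output should be checked against a file exported from the target wiki. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_import_xml(pages):
    """Build a bare-bones MediaWiki-style import document.

    `pages` is an iterable of (title, wikitext) pairs. This is a simplified
    skeleton: it omits the namespace declaration and metadata found in
    real export files.
    """
    root = ET.Element("mediawiki")
    for title, wikitext in pages:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        revision = ET.SubElement(page, "revision")
        ET.SubElement(revision, "text").text = wikitext
    # ElementTree escapes characters such as & and < in the page text.
    return ET.tostring(root, encoding="unicode")
```

Building the document with an XML library rather than string concatenation means that special characters in catalogue descriptions are escaped automatically.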

Linking Experiences of WW1

In 2014, Mia Ridge happened to have a vaguely similar idea to create a website with a page for each military unit in the First World War, but she improved it in two crucial ways: first, the emphasis was on linking personal narratives and putting them in context; second, there would be machine-readable Linked Open Data. She got a CENDARI fellowship and set about putting it into practice. I found out via Twitter (I think I’d started following Mia because she knew Ben Brumfield and we’re all interested in crowdsourcing) and offered to help, because I had valuable experience of failing to do a similar thing at Your Archives (this is not a joke: we can learn an awful lot from failures). The TLDR version is that we failed again, partly because a three-month fellowship wasn’t enough for such an ambitious project, and partly because a lot of the data we needed to link to still doesn’t exist anywhere, but we failed better. Helping with this project taught me a lot of new things. First of all we had intense discussions about how to model military units and relationships between them as structured data, which involved dealing with questions that hadn’t properly been answered, or even asked, before. How do we define this thing? What do we call it? What sources do we need to reference? For me, one of the most shocking revelations was that there isn’t a canonical source for the official names of all British Army units in the First World War. My biggest contribution was generating and importing thousands of unit pages. This built on the ideas that I had for Your Archives but I improved them and made them work in practice. It was easier by this time because TNA’s old catalogue had been replaced by Discovery, which can export search results as CSV files, so you can download large amounts of data without even using the API. 
With a mixture of manual data cleaning and Python scripts, I turned WO 95 catalogue records into wiki pages for each unit which linked back to the catalogue records for its war diaries. For units that didn’t have war diaries in WO 95, I manually created batches in spreadsheets and then used the same Python scripts to turn them into wiki XML. The project is now on hiatus because Mia and I are busy with bigger and better things, but the website is still there. It runs on MediaWiki (not Semantic) with Scribunto. There are pages for just over 7,000 units, with structured data in wiki templates plus free text. We didn’t get as far as making the data machine-readable. Out of all the projects I’ve been involved with, this is the biggest influence on By The Sword Linked.
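The step from spreadsheet rows to wiki pages can be sketched roughly as follows. This is an illustration, not the project's actual scripts: the column names `Title` and `Reference`, the template name `Unit`, and its parameter names are all assumptions.

```python
import csv
import io


def row_to_wikitext(row):
    """Turn one CSV row into a template call holding the structured data."""
    name = row["Title"].strip()
    reference = row["Reference"].strip()
    return "\n".join(
        ["{{Unit", f"|name={name}", f"|reference={reference}", "}}"]
    )


def csv_to_pages(csv_text):
    """Read CSV text and return (page title, wikitext) pairs, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Title"].strip(), row_to_wikitext(row)) for row in reader]
```

The resulting pairs are the kind of input that could then be wrapped in wiki import XML. Values containing characters that are meaningful inside a template, such as `|`, would need cleaning first, which is where the manual part of the work comes in.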

Marine Lives

Marine Lives is a long-running project which trains volunteers in 17th-century palaeography and gets them to transcribe text from images of documents about maritime history, especially High Court of Admiralty records from The National Archives. In 2015, they moved all of their material to Semantic MediaWiki. I haven’t been heavily involved in the project but I’ve sometimes helped out, and used it to test my own ideas. This has encouraged me to learn more about Semantic MediaWiki. In 2016, I experimented with ways of making better use of semantic properties for record linkage (see this page), although I drifted away from it because I got too busy with paid work. At this time I was already thinking about something like By The Sword Linked and wanted to test out possible techniques. Now I think that what I tried at Marine Lives was too complicated. By The Sword Linked will use simpler ways of linking records.

Other projects

This post is already very long by today’s standards (although not by the standards of history blogging 10+ years ago) so I’ll be brief. Other projects that I’ve volunteered on and been influenced by include Lives of the First World War and Six Degrees of Francis Bacon. These both highlight the usefulness of unique identifiers for people. I was also influenced by the way that Lives allows and encourages linking internal and external sources to a person’s profile. I’ve also used and been influenced by Early Modern Letters Online (EMLO) but haven’t been involved in editing it. The Cultures of Knowledge project, which runs EMLO, is now planning projects to create linked data about early-modern people and places. I will be keeping an eye on these projects with a view to linking to them and not duplicating too much of what they’re doing. I’ve also been inspired by the BCW Project Regimental Wiki, which has a human-readable page for each regiment (but not for lower levels of unit, and no semantic data). I will be linking to it, and they will be free to copy and reuse my data. One perhaps surprising omission is that I’ve never got round to trying Recogito.

So these are my most important influences. Now I’m putting together all my experiences of success and failure to build something that combines the best aspects of all of these projects and that I’m fairly sure will work.