Basics (and complications) of Linked Data

By The Sword Linked will be based on the principles of Linked Open Data. Most users won’t need to know the technical details of what goes on behind the scenes, but it’s still useful to sketch out the basic principles and the way they work in Semantic MediaWiki, because these things influence the structures of the data that I’m creating and how users will be able to navigate it. I won’t say much about the technical details of machine-readable RDF code. This is partly because I don’t think many people actually want to use RDF directly. It’s also because Semantic MediaWiki automatically creates and publishes RDF as a side-effect of the wiki templates and semantic properties that have to be defined for the wiki itself to work. So even I don’t really need to learn much about RDF (but it will be there if you want it).

Triples

The triple is the basic concept of linked data. It functions like a simple sentence in a human language: it has a subject, a verb and an object.

The subject is the entity (eg person, place, organization, source) that the statement is about. In Semantic MediaWiki, this is the entity that is the subject of the page that you’re looking at.

The verb describes a relationship between this entity and another entity. In Semantic MediaWiki, the verb is a property, and each property also has its own page to define what it means. It’s often said to be best practice to give properties names that are verbs or include verbs, such as ‘is’ or ‘has’, but this is a convention rather than a rule.

The object is another entity that is related to the subject in the way defined by the verb. In Semantic MediaWiki, these entities also have their own pages, which are the targets of the wikilinks created by properties embedded in the subject’s page.

Semantic MediaWiki can use these links to run queries that find all the entities that are linked to a certain entity in a certain way. These queries can be ready-made and embedded in templates so that anyone can benefit from them without writing queries themselves. When a linked page changes, the query results will automatically change.

For example, Marine Lives uses semantic properties to link manuscript pages to their parent volumes like this:

Subject (manuscript page): HCA 13/68 f.25r Annotate
Verb (semantic property): Parent volume
Object (manuscript volume): HCA 13/68

The parent volume’s wiki page includes a list of all the pages that are linked to it via the ‘Parent volume’ property. No-one had to type out this list, and it would update itself if more pages were linked to the volume. The links still have to be entered manually on the child side of the relationship, but the parent side is automatically updated by a query that looks for any pages that are linked to it by the ‘Parent volume’ property.
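To make the logic concrete, here is a minimal sketch in Python of triples and the kind of reverse lookup described above. It is only an illustration of the idea, not how Semantic MediaWiki stores or queries its data. The function name is made up, and the second manuscript page is a hypothetical addition so that the list has more than one entry.

```python
# Each triple is (subject, property, object).
# The first row is the Marine Lives example; the second page is hypothetical.
triples = [
    ("HCA 13/68 f.25r Annotate", "Parent volume", "HCA 13/68"),
    ("HCA 13/68 f.26r Annotate", "Parent volume", "HCA 13/68"),  # hypothetical
]


def pages_linked_to(target, prop, store):
    """Return every subject that is linked to `target` by `prop`."""
    return [s for (s, p, o) in store if p == prop and o == target]


# The parent page never stores this list; it is worked out from the links
# entered on the child pages each time the query runs.
children = pages_linked_to("HCA 13/68", "Parent volume", triples)
print(children)
```

Adding another triple to the list and running the function again would return the longer list without anything on the parent side being edited, which is the behaviour described above.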

Identifiers and spines

Every entity that is used in a triple needs a unique and persistent identifier. I’m not going to try to deal with the technical differences between a URI and URL, or between an entity and a resource about that entity. All we really need to know is that an identifier is usually a web address that has two main parts: one that identifies the website and one that identifies the individual entity.

For example, this is my ORCID URL:

https://orcid.org/0000-0003-3818-8996

And this is the ORCID URL for M.H. Beals:

https://orcid.org/0000-0002-2907-3313

You can see that every ORCID URL starts with https://orcid.org/ and that this is followed by a number that identifies an individual person.

You can also see this in the example triple from Marine Lives above. The web addresses of the pages in it all start with http://www.marinelives.org/wiki/ and are followed by a wiki page name. A page name has to be unique within the wiki, but when linked to from outside the wiki it’s qualified by the rest of the web address, so you don’t have to worry about making your page names unique everywhere. It’s usual for wiki page names to be human-readable and descriptive, whereas ORCID IDs and many other identifiers are just arbitrary numbers. Inside a wiki, you can link to pages just by using the page name, and the software will automatically take care of the rest.

Semantic MediaWiki extends this ease of use to external identifiers as well as internal page names. If a semantic property has the type ‘External identifier’ it can be set up to format the value as a link to an identifier at another site. Users only have to enter the unique value that identifies the entity (the second part of the URL in the examples above) and the software will automatically add the first part. So if there was a property called ‘Has ORCID ID’, users could enter 0000-0003-3818-8996 as the value and the other bit – https://orcid.org/ – would be added for them. If the structure of the links changed in future, it would be easy to change the format without manually editing every page.
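The idea can be sketched in a few lines of Python. This is an illustration only: the dictionary, the function name and the use of $1 as a placeholder are assumptions made for the example, and ‘Has ORCID ID’ is the hypothetical property mentioned above.

```python
# Hypothetical registry: one URL pattern per property, defined in one place.
# "$1" marks where the value entered by the user goes (an assumed convention).
FORMATTERS = {
    "Has ORCID ID": "https://orcid.org/$1",
}


def identifier_url(prop, value):
    """Build the full web address from the short value a user entered."""
    return FORMATTERS[prop].replace("$1", value)


print(identifier_url("Has ORCID ID", "0000-0003-3818-8996"))
# → https://orcid.org/0000-0003-3818-8996
```

Because the pattern lives in one place, changing it would change every link built from it, without touching the values stored on individual pages.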

External identifiers are very important because they help to rigorously identify entities and link together different resources about them in a way that both humans and computers can work with. There are lots of sources of identifiers. For example, these identifiers all refer to Prince Rupert:

Oxford Dictionary of National Biography: 24281
Early Modern Letters Online: ea030b7f-cdb5-4bd3-8265-a3ad22abc474
Six Degrees of Francis Bacon: 10013300
Virtual International Authority File: 266222116
International Standard Name Identifier: 0000000386821119

Linking to every possible identifier is a lot of work for one project and duplicates work already done by other projects. A spine is a site that provides an identifier for an entity and links to lots of other identifiers for the same entity at other sites. I’ll be using Wikidata as a spine because it already includes lots of identifiers, it’s freely reusable, and I can help to improve it myself. Prince Rupert’s Wikidata ID is Q76930, and his page there links to all the identifiers listed above and many more.
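A spine can be thought of as a lookup table keyed by one identifier. The sketch below is a simplified Python model of that idea, using the Prince Rupert identifiers listed above. The data structure and function names are assumptions for illustration, and it does not call Wikidata itself.

```python
# A toy spine: one Wikidata ID pointing to identifiers at other sites.
SPINE = {
    "Q76930": {  # Prince Rupert
        "Oxford Dictionary of National Biography": "24281",
        "Early Modern Letters Online": "ea030b7f-cdb5-4bd3-8265-a3ad22abc474",
        "Six Degrees of Francis Bacon": "10013300",
        "Virtual International Authority File": "266222116",
        "International Standard Name Identifier": "0000000386821119",
    },
}


def find_spine_id(scheme, value, spine=SPINE):
    """Find the spine ID for an entity known by an identifier elsewhere."""
    for spine_id, identifiers in spine.items():
        if identifiers.get(scheme) == value:
            return spine_id
    return None


def other_identifiers(scheme, value, spine=SPINE):
    """Starting from one identifier, return all the others via the spine."""
    spine_id = find_spine_id(scheme, value, spine)
    if spine_id is None:
        return {}
    return {k: v for k, v in spine[spine_id].items() if k != scheme}


print(find_spine_id("Virtual International Authority File", "266222116"))
# → Q76930
```

A project that records only the spine ID can still reach every other identifier, which is why I only need to store one link per entity.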

When triples are not enough

Triples are perfect for representing a relationship that is true now, or that is always true, but when we’re dealing with historical data we often need to qualify relationships with start or end dates. For example, we could use a triple to say ‘Oliver Cromwell held office Lord Protector’ but the same triple couldn’t tell us that this relationship started in the year 1653 and ended in 1658. We might also need to create semantic links to sources that support a factual claim.

At Linking Experiences of WW1, we tackled this problem by having a repeatable template that created an infobox to represent each command relationship that a military unit was in (this template was adapted from Wikipedia). This allowed us to group together these facts:

the other unit in the relationship
the type of relationship (such as tactical or administrative)
the start date
the end date
sources to prove that the relationship existed

We didn’t get as far as turning this into machine-readable data, but the logical structure is a useful way to think about it.

In Semantic MediaWiki, these infoboxes can be represented as subobjects. A subobject is a semantic object embedded within a wiki page. It represents an entity in its own right, but is also linked to the parent entity in whose page it’s embedded. This means that a subobject can represent a relationship between two entities as a third entity, and the relationship itself can then have multiple properties, each of which is a triple. This lets us represent the type, start date, end date etc but group them together to make it clear that they refer to the same relationship. In practice, doing it this way can make it a bit harder to access data through queries, but many of the queries will be predefined so that most people won’t need to know exactly how they work.
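Here is a minimal Python sketch of that structure, using the Oliver Cromwell example from earlier in this section. The property names, the ‘Is subobject of’ link and the way the subobject is named are all assumptions made for illustration, not Semantic MediaWiki’s actual internals.

```python
def subobject_triples(parent, name, properties):
    """Turn one qualified relationship into several ordinary triples."""
    node = f"{parent}#{name}"  # hypothetical naming scheme for the subobject
    triples = [(node, "Is subobject of", parent)]
    triples.extend((node, prop, value) for prop, value in properties.items())
    return triples


store = subobject_triples("Oliver Cromwell", "office-1", {
    "Held office": "Lord Protector",
    "Start date": 1653,
    "End date": 1658,
})


def offices_held_in(year, store):
    """Find (person, office) pairs that were current in a given year."""
    # The query first has to regroup the flat triples by subobject.
    by_node = {}
    for node, prop, value in store:
        by_node.setdefault(node, {})[prop] = value
    return [
        (props["Is subobject of"], props["Held office"])
        for props in by_node.values()
        if "Held office" in props
        and props["Start date"] <= year <= props["End date"]
    ]


print(offices_held_in(1655, store))
# → [('Oliver Cromwell', 'Lord Protector')]
```

The extra regrouping step in the query is a small example of why subobjects are a bit harder to query than simple triples.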