A large part of what I’m trying to do with By The Sword Linked involves indexing and citing historical sources. This means that I need to model bibliographic data about published sources, which seems simple as long as you don’t think about it. The more you think about it, the more you realise how complicated it is. This post explains how I’m doing it and why. I’d be grateful for any comments or criticisms.
I started with Wikidata’s book project. This represents books at three levels:
- work: the broadest category that covers all versions that can reasonably be classed as the same book
- edition: different versions, such as the first edition, second edition etc
- exemplar: an individual physical copy of a book
This model is derived from Functional Requirements for Bibliographic Records (FRBR). The books project page describes this as ‘a widely used and famous conceptual framework in library science’. Maybe it’s because I don’t move in library science circles, but in my limited experience, FRBR seems to be more talked about than actually used. Anyway, FRBR has four levels:
- work
- expression (can only have one parent work)
- manifestation (can have more than one parent expression)
- item (can only have one parent manifestation)
Work is the same in FRBR and Wikidata. Exemplar and item are the same thing with different names. But Wikidata collapses expression and manifestation into one level called edition because the distinction between them is hard to understand. I thought I was going to do the same, but I’ve ended up with a model that’s almost the same as FRBR.
When I first looked at FRBR, I found the terms ‘expression’ and ‘manifestation’ too essentialist and Romantic with a capital R, as if a work is a Platonic ideal that is expressed through the genius of a creator and then manifested in reality. The words that FRBR uses imply a top-down approach where I would prefer a bottom-up approach, using increasingly abstract categories to group things that are similar in some way. But when I approached it that way, I found that splitting edition into two levels was a good idea. This is because versions of a text that have distinct identifiers, such as ESTC numbers for old books or ISBNs for more recent books, can be very similar. The first edition of a book might have a paperback and a hardback that have different ISBNs. Or there might be two ESTC numbers because of relatively minor variations in the printing. But in these cases, the text is substantially the same, so in that way they can still be grouped together and classed as the same edition. I want to rigorously distinguish between entities that have different identifiers, but it’s also convenient to group them according to substantial revisions of the text. So my model did end up having four levels for books:
- work: same as FRBR and Wikidata
- major edition: similar to FRBR expression but can be an edition of more than one work at the same time, whereas a FRBR expression can only have one parent work
- minor edition: similar to FRBR manifestation but can only have one parent major edition
- printed copy: same as FRBR item and Wikidata exemplar
Or looking at it another way, my major and minor editions could both be different aspects of manifestation, and my model doesn’t represent expression at all. Or it’s a bit of both.
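As a rough sketch of the four levels and their parent links, the model could be written as Python dataclasses. This is not the actual By The Sword Linked schema: the class names, field names and the `works_for_copy` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    title: str


@dataclass
class MajorEdition:
    label: str
    # Unlike a FRBR expression, a major edition may belong to several works
    works: List[Work] = field(default_factory=list)


@dataclass
class MinorEdition:
    # Exactly one external identifier, e.g. scheme "ESTC", identifier "S121933"
    scheme: str
    identifier: str
    # Exactly one parent major edition
    major_edition: MajorEdition


@dataclass
class PrintedCopy:
    # Hypothetical field: where this physical copy is held
    location: str
    # Exactly one parent minor edition
    minor_edition: MinorEdition


def works_for_copy(copy: PrintedCopy) -> List[str]:
    """Walk up from a physical copy to the titles of the works it contains."""
    return [w.title for w in copy.minor_edition.major_edition.works]
```

The only many-to-many link is between major edition and work. Every other level points at a single parent.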
Here’s an example that’s relevant to the British Civil Wars:
- work: John Cruso’s Military Instructions for the Cavalry
  - major edition: 1st edition, 1632
    - minor edition: ESTC S121933
    - minor edition: ESTC S126413, a variant printing but with substantially the same text
  - major edition: 2nd edition, 1644, with revised text probably partly copied from John Vernon
    - minor edition: ESTC R23795
    - minor edition: ISBN 978-1240416813, EEBO print-on-demand reprint
  - major edition: Peter Young’s edition, 1972, with new introduction and illustrations
    - minor edition: SBN 900093242, hardback published by Roundwood Press
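The Cruso example can also be written out as data, to show how distinct identifiers are grouped under a shared major edition. This is only a sketch: the tuple layout and the two helper functions are assumptions, the major edition labels are shortened, and the placement of each identifier under a major edition follows the order of the list above.

```python
# One row per minor edition: (major edition label, identifier scheme, identifier)
CRUSO_MINOR_EDITIONS = [
    ("1st edition, 1632", "ESTC", "S121933"),
    ("1st edition, 1632", "ESTC", "S126413"),
    ("2nd edition, 1644", "ESTC", "R23795"),
    ("2nd edition, 1644", "ISBN", "978-1240416813"),
    ("Peter Young’s edition, 1972", "SBN", "900093242"),
]


def group_by_major_edition(rows):
    """Collect the identifiers of the minor editions under each major edition."""
    grouped = {}
    for major, scheme, identifier in rows:
        grouped.setdefault(major, []).append(f"{scheme} {identifier}")
    return grouped


def major_edition_of(rows, scheme, identifier):
    """Look up the major edition that a given external identifier belongs to."""
    for major, row_scheme, row_identifier in rows:
        if (row_scheme, row_identifier) == (scheme, identifier):
            return major
    return None
```

Calling `major_edition_of(CRUSO_MINOR_EDITIONS, "ESTC", "S126413")` returns the 1632 first edition, even though that printing has its own ESTC number.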
The main deviations from FRBR are that a major edition can be of more than one work at the same time, and that a major edition blurs the boundary between abstract text and physical book. This is less theoretically rigorous than FRBR, but I find it more convenient in cases where an edition is a compilation of more than one work. The main drawback of my concept of ‘major edition’ is that it’s so idiosyncratic that it can’t be linked to any external identifiers, but this is a side effect of making sure that the ‘minor edition’ level corresponds to exactly one external identifier, so it’s no great loss.
In my model, a major edition can also have a parent serial. This combination of edition and serial can be used to represent:
- 17th-century newsbooks
- record society series
- a series of monographs or edited collections
Articles (including chapters in edited collections) have a simpler model. Every version of an article (including self-archived versions) is treated as a major edition in its own right, and the minor edition level isn’t represented. For the version of record, the online and printed versions are collapsed into the same entity. Again this is for convenience even though it’s theoretically messy: in practice there are likely to be fewer versions of an article than of a book, which can have hardback, paperback and ebook versions of each edition with a different identifier for each format. Each article has a parent work, and also a parent book or serial that it was published in. If it’s a chapter in an edited collection, it also links to the parent book at major edition level.
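A minimal sketch of the article model might look like this. The class, its field names and the `versions_of` helper are hypothetical, and nothing here enforces which container a given version must have.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArticleVersion:
    """One version of an article, treated as a major edition in its own right."""
    work: str     # parent work
    version: str  # e.g. "version of record" or "self-archived"
    serial: Optional[str] = None              # parent serial, if published in one
    book_major_edition: Optional[str] = None  # parent edited collection, if a chapter


def versions_of(work: str, versions: List[ArticleVersion]) -> List[ArticleVersion]:
    """Group article versions under their shared parent work."""
    return [v for v in versions if v.work == work]
```

There is no minor edition class here. The online and printed forms of the version of record would be a single `ArticleVersion`.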
Works can also be used to group manuscripts. A manuscript text collapses both edition levels and copy level into one entity. If the text exists in more than one version, these can have parent works. A work can group together manuscript copies and printed editions of the same text.
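To illustrate how a work can group manuscripts and printed editions side by side, here is a small sketch. All the names are assumptions, including the `shelfmark` field and the `split_members` helper, and any values passed in would be placeholders rather than real records.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class ManuscriptText:
    # A single entity standing in for both edition levels and the copy level
    shelfmark: str  # hypothetical identifying field


@dataclass
class MajorEdition:
    label: str


@dataclass
class Work:
    title: str
    # A work may group manuscript texts and printed major editions together
    members: List[Union[ManuscriptText, MajorEdition]] = field(default_factory=list)


def split_members(work: Work) -> Tuple[List[ManuscriptText], List[MajorEdition]]:
    """Separate a work's members into manuscript texts and printed editions."""
    manuscripts = [m for m in work.members if isinstance(m, ManuscriptText)]
    printed = [m for m in work.members if isinstance(m, MajorEdition)]
    return manuscripts, printed
```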
So that’s a fairly superficial overview of how I’m trying to model bibliographic data and how I arrived at it. It will probably make more sense once the public demo is online, but for now, does anyone have any comments or criticisms?
On Twitter, Rhys Owen pointed me to FRBRoo, which has identified some vagueness and inconsistencies in FRBR concepts. Rigorously correcting these problems has led to something even more complicated. I need to look at it in more detail, but it looks like what I call ‘minor edition’ is exactly the same thing as FRBRoo’s ‘F3 Manifestation Product Type’: ‘For example, hardcover and paperback are two distinct publications (i.e. two distinct instances of F3 Manifestation Product Type) even though authorial and editorial content are otherwise identical in both publications.’ What I’m calling a manuscript text seems to include all the same characteristics as FRBRoo’s ‘F4 Manifestation Singleton’, but probably combines it with at least some aspects of Expression. FRBRoo still has the top-down approach that I criticised above: ‘Just like any product of the human mind, a Work necessarily begins to exist in the material world at a given point in time’.