
Importance of Narrative


1        Introduction

2        Background

3        Types of Text

4        Computer Representation

4.1       Advantages

5        STEMMA

6        Citations

7        Meta-data

8        Copyright

 


 

1      Introduction

This paper discusses the importance of narrative text in micro-history data and how STEMMA® addresses it. The paper suggests how text may be given structure so that it can be integrated into micro-history data, as opposed to being an adjunct or attachment, and how this might even help with the Semantic Web.

 

It is hoped that it will help offset the current trend to distil family history data down to a set of discrete facts and conclusions.

 

At the time of writing, no commercial product or data format adequately accommodates narrative content in the context of micro-history. Elizabeth Shown Mills has advocated narrative genealogy by using a word-processor. See also Randy Seaver’s associated blog post. However, this does not adequately integrate the narrative with the structures of your core data.

 

Structured narrative is neither plain-text notes in your data nor rich-text narrative in separate documents; it is marked-up text segments cross-linked with other entities in one all-embracing micro-history schema. A separate presentation of the structured-narrative concept may be downloaded from Structured Narrative.

2      Background

Online content largely consists of transcribed facts such as details from census returns, BMD registrations, and parish records. In the interests of economy, only enough key facts are transcribed to support computer indexing and searching. The original data may be put online as scanned images but its content is not accessible to a computer search. Even when it is typed, rather than hand-written, bulk text is rarely transcribed to be computer searchable — the most obvious exception being newspaper archives.

 

We should therefore understand the rationale for why content providers and archives focus on discrete key facts, and not assume that this is an inherent property of micro-history data.

 

Genealogy in its literal sense (i.e. biological lineage, usually expressed as a family tree or pedigree chart) may not need much more than this. However, family history (see Genealogy & Family History — The Difference), and micro-history in general, require much more in the way of narrative text. Professional genealogists may be more used to writing narrative text, especially to justify their conclusions in a report. However, if a universal representation of micro-history data does not accommodate such narrative then the combination of current online content and the capabilities of current software products may diminish its status to that of eccentricity.

 

3      Types of Text

The most obvious type of text is biographical narrative for a Person, or historical narrative for a Place or a Group. Such text might be extensive and will undoubtedly reference other entities such as Persons, Places, Animals, Groups, Events, and even raw dates.

 

Another type of text that would commonly be used in narrative would be footnotes and endnotes, whether for reference-note citations or for general discursive notes. Although citations are more formalised than discursive notes, they may have analytical notes associated with them that will have less structure.

 

Other uses of text include:

 

  • Narrative Essays — Family-history stories, making frequent reference to conclusion entities such as Persons and Places.
  • Narrative Reports — Reports of personal research presented in narrative form for general readability.
  • Research Reports — Report on the findings of a paid research assignment.
  • Simple Notes — Commentary attached to one-or-more entities in our data.
  • Research Notes — Everything we know about an associated person, place, family, etc., expressed in a raw form.
  • Inference and Logic — Explanation of how information supports or contradicts some claim or proposition.
  • Transcription — Transcribed edition of a document, prior work, or statements. These would use mark-up to provide a faithful reproduction of the relevant nuances. The source material may be an abstract, extract, translation, or in manuscript/typescript, and this will need an indication.
  • Tasks or “ToDos” — Aids to further research required for the data.

 

A number of properties may also be associated with the text, and these may be inclusive of those categories associated with transcribed editions.

 

  • The language of the text, preferably using an ISO designation.
  • A surety or confidence assessment. This applies both to transcriptions of data and to conclusions or inferences.
  • Some indication of how sensitive or controversial the data might be, or some control over its privacy.

 

STEMMA uses a percentage value as an indication of the confidence in a piece of evidence, or in an inference (see its Surety attribute). The reason for doing this, rather than simple integers as used by GEDCOM, is that it allows some basic arithmetic to assess the confidence of derived data. For instance, the confidence of A may depend on the confidence of ‘B and C’, or of ‘B or C’, which is something that can be handled mathematically. Another potential advantage is that of ‘collective assessment’. Given three alternatives, X, Y, & Z, simple integers might allow an assessment of X against Y, or X against Z, but not X against all the remaining alternatives (i.e. Y+Z).
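The arithmetic mentioned above can be illustrated with a minimal sketch. It treats a percentage Surety as a probability and assumes that the contributing items are statistically independent; neither assumption is stated by STEMMA itself. The function names are hypothetical, and the "against the rest" calculation is only one possible reading of collective assessment.

```python
def surety_and(*percentages):
    """Confidence that ALL items hold, assuming they are independent."""
    result = 1.0
    for p in percentages:
        result *= p / 100.0
    return result * 100.0


def surety_or(*percentages):
    """Confidence that AT LEAST ONE item holds, assuming they are independent."""
    none_hold = 1.0
    for p in percentages:
        none_hold *= 1.0 - p / 100.0
    return (1.0 - none_hold) * 100.0


def surety_against_rest(candidate, *others):
    """Weight of one alternative as a share of all alternatives combined."""
    total = candidate + sum(others)
    return candidate / total * 100.0 if total else 0.0


b, c = 80.0, 50.0
print(round(surety_and(b, c), 1))   # A depends on 'B and C'
print(round(surety_or(b, c), 1))    # A depends on 'B or C'
print(round(surety_against_rest(60.0, 30.0, 10.0), 1))  # X against Y+Z
```

With simple integer grades, none of these three calculations would have a meaningful result, which is the point being made about percentages.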

 

The use of a numeric representation of confidence is controversial. The subject of "Structured Indications of Uncertainty" is discussed in the context of TEI here: Structured Uncertainty in section 17.1.2. A further discussion directly related to genealogy may be found at:  You're Probably Right.

4      Computer Representation

Most data formats for family history have a NOTE element or record type, e.g. GEDCOM. However, these generally provide for small-scale notes and commentary rather than large-scale narrative. STEMMA stands alone in the way it supports narrative for micro-history (see below). It must be stressed that its support is not the same as the marked-up text that might be found on a wiki or a blog; these have no documented data model, and their text is in isolation from structured information such as lineage, timelines, and geography.

 

Computer software cannot create narrative from raw facts. If a family history program wants to show text then it has to load it from some content that already exists. Presenting an image of the text is OK but the associated content will not have been assimilated into the family history data and so it will be of limited use. Storing it in a separate word-processor document is almost there (the text is inherently computer-readable) but being stored separately from your core data just wastes the content. Also, there is no mark-up that makes the content usable in an historical context by, say, some software product; it is just text.

 

The ideal is to store the text as an intrinsic part of the micro-history data, but with a mechanism that identifies the semantics of key parts. For instance, identifying references to Persons, Places, Animals, Groups, Events, raw dates, or links to other pieces of text, and making them all computer-readable. The latter type is not dissimilar to a footnote and could be used for that purpose, or it could be used to link conclusions and reasoning to supporting evidence.

 

In effect, this means you need some sort of mark-up language to create structured narrative. A mark-up language is familiar to anyone who has written an HTML (HyperText Markup Language) page, or even a word-processor document, although in that case the mark-up is generated by the software and not generally visible to you. The essential similarity is that the visible text is annotated with extra information, not unlike the original marking-up of a manuscript.

 

Using the terminology from Markup_language, there are two forms of mark-up that are required for micro-history narrative:

 

  • Presentational: This mark-up controls the layout and presentation of the text. Control over explicit physical rendition such as colour, bold, italic, underline, font name, and font size is best left to the tool presenting the text, unless it’s essential for a faithful transcription of something. However, structural control (such as lines and paragraphs) can be supplemented by logical rendition such as emphasis and strong, as defined by XHTML.
  • Semantic (or descriptive): This mark-up provides information about part of the text without indicating how it should be handled or depicted. It is precisely what is needed to identify the entities listed above such as Persons and Places.

 

As an example of semantic mark-up, consider the case of an embedded URL in an HTML or wiki page. The mark-up language provides the computer with the knowledge of the target address, but at the same time provides a separate element of text for the display. There are effectively two bits of information for the same element — one for the end-user and one for the computer.

 

As another example, consider the citation support in HTML 5. Here’s an example:

 

According to <cite title="HTML & XHTML: The Definitive Guide. Published by O'Reilly Media, Inc.; fifth edition (August 1, 2002)">Chuck Musciano and Bill Kennedy</cite>, the HTML cite tag actually exists!

 

The <cite> tag provides a formal citation, which can be taken out of line by the computer software, and a separate piece of substitution text for the end-user to read. This example would display the following text in the main body and use the citation elsewhere:

 

According to Chuck Musciano and Bill Kennedy, the HTML cite tag actually exists!

 

The actual substitution text might be selectable if presented on a computer display, and used to navigate to the citation.
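The separation of display text from the out-of-line citation can be sketched with Python's standard html.parser module. This is only an illustration of the idea; the class name CiteExtractor is hypothetical, and the ampersand in the title has been escaped as &amp; for the sake of the parser.

```python
from html.parser import HTMLParser


class CiteExtractor(HTMLParser):
    """Collects the readable text and takes each <cite> title out of line."""

    def __init__(self):
        super().__init__()
        self.display = []     # text fragments the end-user reads
        self.citations = []   # formal citations for use elsewhere

    def handle_starttag(self, tag, attrs):
        if tag == "cite":
            title = dict(attrs).get("title")
            if title:
                self.citations.append(title)

    def handle_data(self, data):
        self.display.append(data)


SNIPPET = (
    'According to <cite title="HTML &amp; XHTML: The Definitive Guide. '
    "Published by O'Reilly Media, Inc.; fifth edition (August 1, 2002)\">"
    "Chuck Musciano and Bill Kennedy</cite>, the HTML cite tag actually exists!"
)

parser = CiteExtractor()
parser.feed(SNIPPET)
parser.close()  # flushes any text still buffered by the parser

print("".join(parser.display))
print(parser.citations)
```

The first print shows the sentence as the reader would see it; the second shows the citation that software could place in a footnote or endnote.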

 

NB: Although irrelevant to this discussion, the HTML <cite> tag is not a practical model for a similar element in a micro-history data format. The citation style is fixed, the regional preferences (e.g. date/time display format) are fixed, and there is no identification of the distinct elements of the citation (e.g. author) for semantic tagging.

 

4.1    Advantages

There are multiple advantages to using semantic mark-up in narrative text. Allowing computer software to recognise a specific item means that it can use that data, or reference it, in a special way such as for the creation of a footnote. Similarly, it can decide to display the item using special formatting or highlight rules selected from a style gallery.

 

The following examples cover some of the possibilities:

 

  • Persons — Having a reference to a Person entity embedded in your text allows some canonical version of that person’s name to be automatically displayed in its place. If the Person details are later modified then all embedded references will automatically show the modified name thereafter. The software can automatically highlight the surname portion using bold, italic, underline, or a specific colour. This should eliminate the tradition of uppercasing such name parts, which is not culturally neutral (see Letter Case). On a computer display, as opposed to a printed format, the visible name may also be made into a hyperlink that can take you to full details of that Person, or of their family, etc.
  • Places — Being able to embed a reference to a Place entity allows a hyperlink to be generated that can be selected to obtain further details. As well as presenting details from your own data, such a link might consult a Place Authority (see Place Authority) to obtain full geographical and historical data for that Place.
  • Dates — It is important for software to be able to understand a date value (see Dates and Calendars). If a date is embedded in a computer-readable fashion then it allows software to relate that to other Events or timelines. It also allows the software to display a version that is automatically formatted according to your regional settings and preferences, whatever they happen to be. A different end-user might see them formatted according to different settings. A date such as “yesterday” or “next week” may make sense to us but not to computer software.
  • Annotation Notes — If one section of narrative text includes a link to another section then the software can add a traditional superscript numeral indicating the presence of the extra text. In a printed form, that extra text might appear as a footnote or an endnote. On a computer display, the superscript may be selectable and could take you to that text if clicked. Similarly, if a specific datum (e.g. a date of birth) has a link to a section of text then that could be handled in the same way whenever that datum is displayed, and it might provide insights into how the datum was derived. Citations are a particular form of note and will be discussed further in a different section.
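As a small illustration of the point about dates, the sketch below stores one computer-readable value and displays it differently for different end-users. The pattern table and the display_date function are hypothetical; real software would take the preferences from the user's regional settings, and would also have to cope with imprecise dates and other calendars.

```python
from datetime import date

# Hypothetical display patterns keyed by locale code.
DISPLAY_PATTERNS = {
    "en_GB": "%d/%m/%Y",
    "en_US": "%m/%d/%Y",
    "sv_SE": "%Y-%m-%d",
}


def display_date(iso_value, locale_code):
    """Format a stored ISO-style date value for a given reader's locale."""
    stored = date.fromisoformat(iso_value)
    return stored.strftime(DISPLAY_PATTERNS[locale_code])


for code in ("en_GB", "en_US", "sv_SE"):
    print(code, display_date("1956-06-09", code))
```

The stored value never changes; only the presentation does, which is not possible when the date is merely typed into the narrative as plain text.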

5      STEMMA

Although there are many possible uses for narrative text there are two important categories that STEMMA has strived to unify. They are for transcriptions and for generating new narrative work (e.g. essays, reports, inference, etc.). These have markedly different characteristics as follows:

 

  • Transcriptions — requires support for anomalies (uncertain characters, marginalia, footnotes, interlinear/intralinear notes), indications of original emphasis (e.g. italics), indications of alternative spellings/meanings, and semantic mark-up for references to persons, places, groups, events, and dates. The latter semantic mark-up also needs to clearly distinguish objective information (e.g. that a reference is to a person) from subjective information (e.g. a conclusion as to whom that person is).
  • Narrative work — requires support for layout and presentational mark-up. It needs to be able to generate references to known persons, places, and dates that result in a similar mark-up to that for transcriptions. The difference here is that a textual reference is being generated from the ID of a Person entity, say, as opposed to marking an existing textual reference and possibly linking it to a Person with a given ID. Also needs to be capable of generating reference-note citations and general discursive notes.

 

All narrative is expressed using the following element structure:

 

<Narrative [Key='key']>
    [ <Title> narrative-title </Title> ]
    { <Text [Key='key'] [Language='code' | Locale='code'] [TEXT_TYPE] … [DATA_ATTRIBUTE] ... >
        [ <Title> text-title </Title> ]
        …text with embedded entity links…
    </Text> } ...
</Narrative>

 

 

The optional Language attribute provides an explicit ISO 639-2 three-letter code for the narrative language. If omitted then the language defaults to the prevailing language of the STEMMA Dataset. The Locale attribute provides a more detailed specification since it involves both an ISO 639-1 two-letter language code and an ISO 3166-1 two-letter territory code, e.g. “en_GB” for British English.
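The rules for these two attributes can be sketched as follows. The function names, the returned dictionary, and the "eng" Dataset default are all assumptions made for illustration; only the code shapes (three-letter language, or two-letter language plus two-letter territory) come from the description above.

```python
import re

# Two-letter language, underscore, two-letter territory, e.g. "en_GB".
LOCALE_PATTERN = re.compile(r"^([a-z]{2})_([A-Z]{2})$")


def parse_locale(code):
    """Split a Locale value into its language and territory parts."""
    match = LOCALE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"not a language_TERRITORY code: {code!r}")
    return match.group(1), match.group(2)


def describe_text_language(attributes, dataset_default="eng"):
    """Work out the language of a Text segment from its attributes."""
    if "Language" in attributes and "Locale" in attributes:
        raise ValueError("Language and Locale are alternatives")
    if "Locale" in attributes:
        language, territory = parse_locale(attributes["Locale"])
        return {"language": language, "territory": territory}
    # Fall back to the prevailing language of the Dataset.
    return {"language": attributes.get("Language", dataset_default), "territory": None}


print(describe_text_language({"Locale": "en_GB"}))
print(describe_text_language({"Language": "fra"}))
print(describe_text_language({}))
```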

 

A <Narrative> element is divided into separate Text segments, each of which has different properties. These properties correspond to most of the types and usages presented under Types of Text.

 

Both the <Narrative> element and any of the individual <Text> elements may specify a key and this allows them to be referenced from elsewhere using a NoteRef element.

 

References to other STEMMA entities can be embedded in a <Text> element using the following:

 

<PersonRef [Key='key']/>
<AnimalRef [Key='key']/>
<PlaceRef [Key='key']/>
<GroupRef [Key='key']/>
<EventRef [Key='key']/>

<ResourceRef Key='key'/>
<CitationRef Key='key'/>

 

The first set of these can also be used to mark-up existing references in a transcription, and optionally link them to a conclusion entity such as a Person.

 

In STEMMA, a Resource is a separate item in the micro-history collection — typically a separate image or photograph. A Citation makes reference to an external source of information but the concept is generalised and so includes traditionally separate categories such as a section in a published work, the published work itself, and the repository that holds it. See Citations.

 

Date references may be embedded using a DateRef mark-up and specifying either a STEMMA date-value string or a full STEMMA date-entity. The same element can mark-up an existing reference during a transcription and optionally attach a conclusion date. Some simplified examples might be:

 

<DateRef Value='1956-06-09' Mode='Short'/>
<DateRef Value='1903-03-17'> St Patrick’s Day, 1903 </DateRef>

 

The date-entity structure allows for different degrees of granularity, imprecision, and multiple calendars for synchronised dates such as Gregorian/Julian Dual Dates.

 

Here’s an example that references both a Place and a date:

 

<Text Key='tDemiseJessamine'>
    <Title> Demise of Jessamine Cottages </Title>
    <PlaceRef Key='wJessamine' Mode='Hierarchy'/> were demolished in <DateRef Value='1956'/>
</Text>

 

This text could be referenced from another Text section using the key name tDemiseJessamine. It might generate the text title in place of the reference, but the following text might pop up when it is selected.

 

Jessamine Cottages, Nottingham were demolished in 1956

 

Both the name of the Place and the date might be further selectable, as implied here.
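How a viewing tool might turn such a Text segment into readable output can be sketched with Python's standard xml.etree.ElementTree module. The PLACES table stands in for Place entities that would really come from the Dataset, and the treatment of Mode='Hierarchy' (the place name followed by its parents) is an assumption based on the output shown above rather than a statement of the STEMMA specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for Place entities: key -> names from most to least specific.
PLACES = {"wJessamine": ["Jessamine Cottages", "Nottingham"]}

SOURCE = """<Text Key='tDemiseJessamine'>
<Title> Demise of Jessamine Cottages </Title>
<PlaceRef Key='wJessamine' Mode='Hierarchy'/> were demolished in <DateRef Value='1956'/>
</Text>"""


def render_reference(element):
    """Return the display text for one embedded element."""
    if element.tag == "PlaceRef":
        names = PLACES[element.get("Key")]
        return ", ".join(names) if element.get("Mode") == "Hierarchy" else names[0]
    if element.tag == "DateRef":
        # Prefer any marked-up original text, otherwise show the date value.
        return (element.text or "").strip() or element.get("Value")
    return ""  # Title and unrecognised elements add nothing to the body text


def render_text(xml_source):
    """Flatten a Text segment into (title, body) strings."""
    root = ET.fromstring(xml_source)
    parts = [root.text or ""]
    for child in root:
        parts.append(render_reference(child))
        parts.append(child.tail or "")
    title = (root.findtext("Title") or "").strip()
    return title, " ".join("".join(parts).split())


title, body = render_text(SOURCE)
print(title)
print(body)
```

A real tool would keep the link targets rather than flattening them, so that the Place name and the date remain selectable.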

 

Here’s an example that references a Person this time:

 

<Narrative>
    <Text Inference='1'>
        Head of household is Elizabeth Wildgoose (b. <Date><Value Margin='1'>1802</Value></Date>) and is almost certainly a relative of <PersonRef Key='pSarahElliott'/>, nee Wildgoose
    </Text>
</Narrative>

 

It might generate the following text when presented on the screen:

 

Head of household is Elizabeth Wildgoose (b. c1802) and is almost certainly a relative of Sarah Elliott (b. c1842), nee Wildgoose

 

 

A STEMMA file is deemed a Document, and this is broken down into one-or-more Datasets. Each Dataset has a separate self-contained set of entities, distinguished by author, geography, surname, or multiple criteria. Although STEMMA was initially conceived as a format for long-term storage, such as archive or backup, and secondarily as an exchange format, the presence of structured narrative (incl. the aforementioned mark-up) means that it can be used as a traditional document format. Following this observation, a viewing tool was prototyped that loaded a specific Dataset in one pass from a Document, indexed it in memory, and immediately provided a user interface to navigate around its content and follow the hyperlinks. This is not intended to displace the need for more complicated products, or their associated indexed databases, but it is an interesting digression on the purpose of a file format. On one hand, it provides a way of peeking inside a file without having to learn some low-level data syntax, such as XML, and without having to load it into some proprietary database. On the other hand, it provides a “genealogical document” that has both content and structure, including lineage, events/timelines, geography, and narrative, that can be navigated and presented with a generic tool. This bundling of information as a “genealogical document” could also make it usable for automatic upload to some online framework (see What to Share, and How - Part II) or transmission as a genealogical report to clients. This would neither limit the content nor reduce editorial control and narrative freedom.

6      Citations

Citations are a fundamental part of micro-history data. However, the concept of a reference note, source label, and a source list (as described under Worldwide Family History Data) can be generalised through the use of narrative.

 

Most readers will think of a citation in terms of its printed reference-note form, e.g.

 

C. Dallett Hemphill, Bowing to Necessities: A History of Manners in America, 1620-1860 (New York: Oxford University Press, 1999), p.114.

 

It would be fairly straightforward to represent the essence of such a citation in micro-history data such that a printed form can be generated in the preferred style (e.g. CMOS) and with the regional preferences of any particular reader (see Meta-data).
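A minimal sketch of that idea follows: the citation is held as discrete elements and the printed form is generated on demand. The element names and the two formatting functions are hypothetical, the styles are simplified approximations rather than complete implementations of any style guide, and the name inversion is deliberately naive.

```python
# Citation held as discrete, style-neutral elements (names are illustrative).
CITATION = {
    "author": "C. Dallett Hemphill",
    "title": "Bowing to Necessities: A History of Manners in America, 1620-1860",
    "place": "New York",
    "publisher": "Oxford University Press",
    "year": 1999,
    "page": "114",
}


def reference_note(c):
    """Render the elements in a reference-note form."""
    return (f"{c['author']}, {c['title']} "
            f"({c['place']}: {c['publisher']}, {c['year']}), p.{c['page']}.")


def source_list_entry(c):
    """Render the same elements in a source-list form, surname first."""
    given, _, surname = c["author"].rpartition(" ")
    return (f"{surname}, {given}. {c['title']}. "
            f"{c['place']}: {c['publisher']}, {c['year']}.")


print(reference_note(CITATION))
print(source_list_entry(CITATION))
```

The same stored elements yield both forms, so the choice of style can be deferred until the citation is presented to a particular reader.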

 

When commentary is added, though, citations can get much more complicated. Consider this example:

 

Death notices, Ulster Gazette and Daily National Intelligencer, both dated 24 January 1815. Corra Bacon-Foster, "The Story of Kalorama," Records of the Columbia Historical Society (1910), 108, states Louisa left four children; three have been identified. In 1810, Charles "Cating" and a female, both over 44, were enumerated with one male and female aged 26-44; one male and female aged 16-25; and one male under 10 - suggesting that George, Louisa, and their first son may have been living in the Catton household. See 1810 U.S. census, Ulster County, New York, New Paltz, p. 116, line 6; NA micropublication M252, roll 37.

 

What this is doing is effectively wrapping one or more simple citations in some commentary by the current author in order to create what might be called a complex citation.

 

There may be similar uses of analytical notes, such as suggesting some bias or the fact that the source information may be second-hand.

 

As a further generalisation we encounter discursive notes which may or may not include any citations at all. These are simply some text that has been taken out of line; a digression. In a printed publication, there may be a superscript at the point where it is relevant, followed by a footnote or endnote containing that text. In an electronic document (e.g. in a browser), the footnote/endnote concepts are less appropriate and the text may be popped-up when the superscript, or some other link at the point of reference, is selected or hovered-over.

 

The point being made here is that structured narrative can be used to create generalised notes. Also, by virtue of its ability to embed references to other entities using semantic mark-up (including Citation entities), it can be used for full citations, including complex ones. An example complex citation is presented under Data Model as a combination of a single footnote containing multiple embedded citations. A more in-depth discussion of handling different types of note may be found at: Cite Seeing.

7      Meta-data

In the context of micro-history data, meta-data is data about data. Meta-data is an important concept because it allows data to be processed or utilised appropriately. In a computer context, it allows software to understand (in a primitive sense) what data it is dealing with.

 

STEMMA’s structured narrative provides a form of meta-data in the semantic mark-up it uses to embed references to Persons, Places, Animals, Events, etc. Without that meta-data, all the narrative would contain is the name or textual description of the same entities.

 

The Semantic Web movement strives to supplement the currently unstructured Web content with meta-data. The idea is that this will allow computers to search, correlate, and combine information more easily. This will therefore be an important part of micro-history representation in online content. It doesn’t necessarily mean that a standard data model needs to incorporate its RDF tags since it would be wrong to tie it to such a specific technology. However, it does mean that such a data model must provide for meta-data, and that a physical serialisation format (as with file formats) that was derived from the data model for the Semantic Web could use RDF tags.

 

Another situation where meta-data comes up, and is still hotly debated, is citation-elements. These are the elements of data that would constitute a computer-readable citation, e.g. author(s), title, publisher, etc. A computer-readable citation differs from a printed citation in that factors such as style (e.g. CMOS) and regional settings are removed, and can be reapplied later for the context of a specific end-user.

 

At its most simple, you might imagine such a citation to be represented by a number of discrete XML elements such as:

 

<CitationReference>

<Elements>

<Author> name </Author>

<Title> title </Title>

…etc…

</Elements>

</CitationReference>

 

Irrespective of whether each element has a specific tag (e.g. <Author>) or a qualified generic tag (e.g. <Element Name="Author">), it still basically identifies the datum rather than the nature of the datum. There has been some considerable discussion on BetterGEDCOM about whether these tags should follow the Dublin Core scheme and have shared semantics. Dublin Core started with a vocabulary of 15 shared meta-data tags, which included things like Creator, although it has since been extended with extra tags and contextual refinements of existing ones. I have been a critic of this for citation-elements since it’s analogous to having relational databases share a common set of column names, each with fixed semantics. Also, the citation-element tags are not themselves meta-data tags, as already mentioned. If the citation-elements were part of a Semantic Web contribution then they would be annotated with RDF meta-data tags indicating their nature; it would not be deduced from the XML data tags. There may be other schemes, too, but they would be part of the corresponding physical serialisation format.

 

A related discussion of the importance of meta-data may be found at Technophoo, have no fear, although this does not differentiate between the formal data and meta-data concepts for citation-elements. Hence, although Dublin Core (which is now represented by ISO 15836:2009) is a viable standard, it should be acknowledged that it is a standard for meta-data tags rather than data tags.

8      Copyright

Some of the things that cannot be copyrighted include facts and raw unarranged data[1]. From this perspective, mere details transcribed from vital records cannot be the subject of copyright.

 

Even the building of family pedigree charts showing marriages between people and linking their respective offspring is little more than a re-arrangement of facts that can be looked up by someone else. Although that linking of Person entities according to their biological lineage constitutes part of a conclusion model, and may have been the product of some research, the expression embodies nothing that can be copyrighted in any practical way.

 

This means that there is no legal impediment to online collaborative trees that include nothing more than facts and conclusions of the pedigree variety. Unfortunately, we all know where that leads. Online trees with no citations, no reasoning, no evidence supporting their conclusions, and no attribution, are “ten-a-penny” (or “a dime-a-dozen”). These trees are replicated just as easily as they’re created and that compounds the problem to a point where it almost kills serious genealogy. It has even been likened to a virus by blogger Ben Sayer.

 

Once micro-history data makes use of structured narrative then it starts to become a creative work: a work of academic research that includes reasoning, conclusions, and opinions. Such a work is automatically copyright by virtue of the Berne Convention.

 

Such data might be published under a Creative Commons licence but that only avoids the legal issues. The fact that those reliable and thoroughly-researched contributions will have taken someone a long time to produce (possibly a life’s work) means that they will be understandably less-inclined to just share it with everyone, especially if some of the weaker researchers would simply pass it off as their own.

 

Is this an argument against collaborative trees? No, it’s simply an indication that the current naïve approach is demonstrably wrong, and will lead to further issues when we include narrative.

 

A case is made under Evidence and Conclusion for distinguishing three types of data rather than the two implied by this name; the third being all those parts that justify, or prove, the conclusions. Having three distinct parts gives greater flexibility for handling copyright and sharing issues. Also, What to Share, and How - Part II presents an alternative model that incorporates narrative in such a way as to provide automatic attribution, reduce the need for copying, and avoid edit wars when there are differences of opinion on a shared tree.



® STEMMA is a registered trademark of Tony Proctor.

[1] This is the case in the US, but a “sweat of the brow” doctrine still exists in Europe and that has paved the way for database rights. See the “Analysis” section of A Copyright Casualty — Part II.