

A Citation entity identifies the source of some information mentioned in the Dataset. Examples might include books, newspaper articles, BMD certificates, census data, tax records, court records, film, tombstones, military service records, journey manifests, cemetery records, oral history, church records, pension records, land or property transfers, etc.


These are common historical sources, and specific printed citation formats are applicable to each of them, but this Citation entity goes further. It can also identify a collection of works, a repository or institution, or even represent attribution to an individual.


In citations of normal written or printed works, two main citation modes may be employed within the text (source labels, or document labels, apply instead to documents or images thereof). Citations may involve reference notes linked to inline superscript indicators in the main text. Alternatively, they may involve a source list or bibliography at the end of the work. The parenthetical in-text citations commonly associated with a bibliography, such as “Smith (2004, p. 39) claims that...”, or “…(Smith 2004, p. 39)…” if all details are parenthesised, are less appropriate for genealogical or historical citations. This is because they do not accommodate the source provenance or analytical notes that are frequently required.


There are citation conventions that apply to different source types in order to ensure consistency, and these have precise specifications for layout, quotation marks, punctuation, and use of italics. Several citation styles are in common use. For instance, in the humanities there are: Modern Language Association (MLA), Harvard referencing, Modern Humanities Research Association (MHRA), and the Chicago Manual of Style (CMOS). There are other styles commonly used in law or the sciences too.


The Board for Certification of Genealogists (BCG) recommends CMOS, which utilises footnotes, endnotes, and bibliographies. The requirements of genealogy are very demanding in the varieties of sources that need to be cited, and Elizabeth Shown Mills[1] has extended conventional CMOS style guidelines to include many of those additional source types. It should be understood, though, that these citation styles and modes relate to the final-form written or printed citations. Their application is therefore relevant to genealogical reports, including on-screen computer displays and charts, rather than computer storage (see Cite Seeing).


Those final-form citations are designed to be human-readable, and so embody elements of a specific locale, culture, and preferred style. This is a problem for electronic documents because such citations are not computer-readable, and so cannot be adjusted to suit the locale or preferences of an arbitrary end-user. It is therefore necessary to go back to the essence of a citation rather than consider specific physical implementations, i.e. to provide sufficient information through a digested citation to uniquely identify a source, enabling it to be re-examined if necessary, and to support the formatting appropriate for the current end-user. The scheme presented here is a generalised computer-readable one intended to cope with all possible source types. It does not strive to enumerate all possible source types, or specify what parameters they require, or mandate a particular presentation style. The main goals of this scheme are to keep it open-ended so that source types can be defined freely, to parameterise the scheme so that it can interface to external citation-templates, and to give it a hierarchical structure for representing different layers of a citation (e.g. for provenance or location).




<Citation Key='key' [Abstract='boolean']>
[ <Title> citation-title </Title> ]
[ <DisplayFormat> format-string </DisplayFormat> ]
<URI> source-type-uri </URI>
[ <Params>
{ PARAM_DEF... } | { PARAM_VALUE … }
</Params> ]
[ <Quality> source-quality </Quality> ]
[ <Credibility> information-credibility </Credibility> ]
[ <Reliability> information-reliability </Reliability> ]
[ <ParentCitationLnk Key='key'>
</ParentCitationLnk> ]
[ <BaseCitationLnk Key='key'>
</BaseCitationLnk> ]
</Citation>







<Param Name='name' [Type='type'] [SemType='sem-type'] [ItemList='boolean'] [Optional='boolean']>







{ <Param Name='name' [Key='key']>
</Param> }

{ <Param Name='name'>
{ <Item [Key='key']> value </Item> } …
</Param> }



Note that STEMMA syntax does not differentiate between citing a specific source of information, citing a collection or work that the information was contained within, or citing a repository or institution hosting that work or collection; they are all citing something in the more literal sense. The <Citation> entity is therefore hierarchical so that these related data can be arranged in a chain (actually a tree) using the <ParentCitationLnk> to indicate each parent. This avoids duplication and provides a stronger representation overall.
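The parent chain described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Citation class, its field names, and the sample keys and titles are hypothetical and are not part of STEMMA; only the idea of following each <ParentCitationLnk> to its parent comes from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Citation:
    """Hypothetical in-memory model of a <Citation> entity."""
    key: str
    title: str
    parent_key: Optional[str] = None  # target of <ParentCitationLnk>, if any


def citation_chain(key: str, citations: Dict[str, Citation]) -> List[Citation]:
    """Return the Citation for `key` followed by each ancestor in turn."""
    chain: List[Citation] = []
    seen = set()
    current: Optional[str] = key
    while current is not None:
        if current in seen:
            # A parent loop would make the chain endless, so reject it.
            raise ValueError(f"cycle in parent links at {current!r}")
        seen.add(current)
        citation = citations[current]
        chain.append(citation)
        current = citation.parent_key
    return chain


# Illustrative data only: an entry, the work containing it, and the repository.
citations = {
    "cEntry": Citation("cEntry", "Baptism entry", parent_key="cRegister"),
    "cRegister": Citation("cRegister", "Parish register", parent_key="cArchive"),
    "cArchive": Citation("cArchive", "County archive"),
}
titles = [c.title for c in citation_chain("cEntry", citations)]
```

Because several Citations may name the same parent, the stored data forms a tree, while any single Citation still resolves to one linear chain.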


The Dublin Core Metadata Initiative has encountered the issue of a chain but has tried to solve it by adding additional terms and namespaces (see dc-citation-guidelines/). Basically, the simple Dublin Core terms cannot clearly distinguish, for instance, the title of an article from the title of a journal containing that article, or provide a clear indication of other data related to the containing journal such as publication date (as distinct from the article submission date), or the volume and issue numbers. That same page recommends the use of the OpenURL (ANSI/NISO standard, Z39.88-2004) ContextObject for representing the context of a bibliographic citation, although it does not take this to the level of a hierarchical chain. The OpenURL concept is designed to provide the context of a citation in a machine-readable form that can be resolved by an unspecified library or archive. In other words, the Dublin Core recommendation does not cite a source directly but instead provides a library-independent hyperlink to content.


The STEMMA scheme described here is fully in keeping with those Dublin Core recommendations but is not specifically tied to them. It allows each type of source to be represented by a source-type-uri. Parameters can be applied to build up a citation description for a specific instance of that source-type. The source-type-uri also acts as a global key for retrieving localised text for soliciting Parameter values, data-types for validating the Parameter values, and for interfacing to a citation-template system in order to generate a formatted string for the user.


The STEMMA Citation hierarchy allows the individual parts of a layered citation to be described, and re-used for related references. It generally places a reference to the indefinite source at the lowest level, and then links that to an actual instance, or the definite source, if a specific derivative was consulted. Further layers might identify the original source, and the location of the originals. With a book, for example, the indefinite source could be identified by the title, author, publisher, and edition. The definite source could have been an online copy, or a translation. For a digital image of a church record, the indefinite source would be the entry in the parish register. A subsequent layer might identify a scanned copy at, say, findmypast or FamilySearch. Although not condoned in professional circles, there will be instances where a researcher is not interested in the provenance of a published source, and believes that a mere identification of the indefinite source will be sufficient; this is also implicit in the OpenURL scheme. The flexibility of this mechanism, and the ease of creating private source-type-uris, means that it is equally capable of describing abstract source citations where just a catalogue code or digital identifier is given (e.g. a book ISBN, or TNA code for a page of the census of England & Wales). Although effective to a point, such an approach does not allow a different reader to immediately assess the strength of the cited source.


The display format is part of the Citation entity for convenience. However, many citation types will require formatting to a given style and locale. A later version may allow styles to be automatically selected from Citation Style Language (CSL) templates. CSL is an open XML-based language for defining the parameters and formatting for different citation types. These styles can be browsed and searched via the Zotero Style Repository. CSL currently has no concept of a URI string, which is unfortunate because it would be a convenient handle to distinguish the templates and applicable source-types in the repository. A problem with such citation-template schemes is that they try to format plain textual elements into a simple template, whereas STEMMA assumes that objects representing, say, a Person or a Contact can be provided. The advantage of the latter is that the template system can call back on well-defined methods to obtain a particular style of name, or specific contact details; otherwise the genealogical software product is assumed to have intimate knowledge of the specific template. In the absence of any external formatting support for citations, the <DisplayFormat> element is used as a default.


The parameterisation is available in the citation-title, the format-string, and the values of Parameters themselves (e.g. within a Params or ResourceLnk element).


Note that Parameter names are local to the corresponding source-type. There is no sharing of Parameter names between different source-types, and no implied semantics in any of their names. If two source-types each have a Parameter called 'Publisher' then they are each interpreted in the context of their respective source-types. In effect, no semantics are conveyed directly by the Parameter name; that is the purpose of the SemType attribute.


The valid Parameter data-types are documented at: Data Types. The same ItemList approach to lists is taken as for Property values. The semantic type is indicated by the SemType attribute, which may use the Dublin Core vocabulary, e.g. SemType='DC:Title' or SemType='DC:Publisher.CorporateName.Address'. The default value for the Optional attribute is 0 (i.e. false), which means that a non-blank value must be provided.
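The effect of the Optional default can be illustrated with a short Python sketch. It assumes a hypothetical ParamDef class, and it assumes that ItemList defaults to false, which the text above does not state; only the rule that a non-optional Parameter needs a non-blank value is taken from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Value = Union[str, List[str]]


@dataclass
class ParamDef:
    """Hypothetical model of a Parameter definition."""
    name: str
    item_list: bool = False  # ItemList attribute (default assumed here)
    optional: bool = False   # Optional attribute; defaults to false


def missing_params(defs: List[ParamDef], values: Dict[str, Value]) -> List[str]:
    """Return the names of mandatory Parameters lacking a non-blank value."""
    missing = []
    for d in defs:
        if d.optional:
            continue
        value = values.get(d.name)
        if d.item_list:
            # A list Parameter needs at least one non-blank item.
            items = value if isinstance(value, list) else []
            blank = not any(item.strip() for item in items)
        else:
            blank = not isinstance(value, str) or not value.strip()
        if blank:
            missing.append(d.name)
    return missing


defs = [
    ParamDef("Author", item_list=True),
    ParamDef("Title"),
    ParamDef("Pages", optional=True),
]
values = {"Author": ["James Granger"], "Title": "   "}
result = missing_params(defs, values)
```

A check of this kind would presumably run only once all values are known, since a Citation definition may leave a Parameter blank for a later link to fill in.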


The Quality, Credibility, and Reliability elements characterise the confidence in a source, and of information derived from it. Note that these do not relate to a specific datum from the source. The Surety data-attribute is provided for that case. See Extended Vocabularies for defining custom values.


  • Unknown – Unknown or unspecified assessment.
  • Credibility – Expert. Information from someone with relevant expertise.
  • Credibility – Questionable. Questionable credibility of information, as in interviews and oral genealogies, or with potential for bias as in an autobiography.
  • Credibility – Trusted. Information from a trusted source.
  • Credibility – Unsubstantiated. Unsubstantiated claims or opinions.
  • Quality – Original. Material in its original recorded form.
  • Quality – Copy. Facsimile of original, e.g. image copy, certified copy.
  • Quality – Derivative. Manipulated version of original, e.g. translation, abstract, extract.
  • Quality – Authored. Narrative work using other sources but providing independent conclusions.
  • Reliability – Primary. Details provided by someone with first-hand knowledge.
  • Reliability – Secondary. Details provided by someone with second-hand or more-distant knowledge.


The BaseCitationLnk element may nominate an Abstract Citation from which data may be inherited by the current Citation, in much the same vein as base classes and derived classes in software programming. An Abstract Citation must define no embedded Keys, can only reference other abstract entities, and must contain Parameter definitions rather than Parameter settings. Any application of Parameter substitution must therefore occur after the inheritance process has completed. If an implementation creates a temporary conglomerate entity in memory by doing a physical merge then it must not be persisted back to the data file, otherwise it constitutes a data corruption.
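One way to respect the rule against persisting a merged entity is to build the merge as a separate temporary object. The Python sketch below is a minimal illustration under assumptions: the Citation class, its fields, the choice of which data are inherited, the `${...}` placeholder text, and the 'Text' data-type name are all hypothetical rather than taken from STEMMA.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical in-memory model; field names are illustrative only."""
    key: str
    abstract: bool = False
    display_format: Optional[str] = None
    param_defs: Dict[str, str] = field(default_factory=dict)  # name -> data-type
    base_key: Optional[str] = None  # target of <BaseCitationLnk>, if any


def resolve(key: str, store: Dict[str, Citation],
            _seen: FrozenSet[str] = frozenset()) -> Citation:
    """Return a temporary merged view; stored entities are never modified."""
    if key in _seen:
        raise ValueError(f"cycle in base links at {key!r}")
    citation = store[key]
    if citation.base_key is None:
        return citation
    base = store[citation.base_key]
    if not base.abstract:
        raise ValueError(f"{base.key!r} is not an Abstract Citation")
    base = resolve(base.key, store, _seen | {key})
    # Local settings take precedence over inherited ones.
    return replace(
        citation,
        display_format=citation.display_format or base.display_format,
        param_defs={**base.param_defs, **citation.param_defs},
    )


store = {
    "cBookType": Citation("cBookType", abstract=True,
                          display_format="${Author}, ${Title}",
                          param_defs={"Author": "Text", "Title": "Text"}),
    "cOldNottm": Citation("cOldNottm", base_key="cBookType"),
}
merged = resolve("cOldNottm", store)
```

Only the entries in `store` would ever be written back to the data file; `merged` is discarded after use, and any Parameter substitution would be applied to it rather than to the stored entities.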


Here’s a simple example of a traditional book citation:


<Citation Key='cOldNottm'>
<Title>Old Nottingham Notes</Title>
<URI> http://stemma </URI>
<Params>
<Param Name='Author'>James Granger</Param>
<Param Name='Title'>OLD NOTTINGHAM : Its Streets, People, etc</Param>
<Param Name='Publisher'>Nottingham Daily Express Office</Param>
<Param Name='Date'>1904</Param>
<Param Name='Pages'/>
</Params>

Reprinted from the Nottingham Daily Express, October 3rd, 1903 – July 9th, 1904

</Citation>





A corresponding citation might appear as:


<CitationLnk Key='cOldNottm'>
<Param Name='Pages'>46-48</Param>
</CitationLnk>



Whether this generates a source-list reference or a short/long reference note depends on the selected citation mode.
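The interplay between the Citation definition and the link above can be sketched in Python. This is an illustration only: the `${Name}` placeholder syntax is borrowed from Python's string.Template and is an assumption, the format string is invented for the example and does not follow any published citation style, and the helper names are hypothetical.

```python
from string import Template
from typing import Dict


def merge_values(definition: Dict[str, str], link: Dict[str, str]) -> Dict[str, str]:
    """Overlay non-blank values from a citation link onto the definition's values."""
    merged = dict(definition)
    merged.update({name: value for name, value in link.items() if value.strip()})
    return merged


def render(display_format: str, values: Dict[str, str]) -> str:
    """Substitute Parameter values into a format string (placeholder syntax assumed)."""
    return Template(display_format).safe_substitute(values)


# Values from the Citation definition; 'Pages' is deliberately left blank there.
definition_values = {
    "Author": "James Granger",
    "Title": "OLD NOTTINGHAM : Its Streets, People, etc",
    "Publisher": "Nottingham Daily Express Office",
    "Date": "1904",
    "Pages": "",
}
# Value supplied by the citation link.
link_values = {"Pages": "46-48"}

# Hypothetical format string standing in for a DisplayFormat.
display_format = "${Author}, ${Title} (${Publisher}, ${Date}), ${Pages}"
rendered = render(display_format, merge_values(definition_values, link_values))
```

Keeping the page range on the link rather than in the definition lets the same Citation be reused for references to other pages of the book.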


The Board for Certification of Genealogists (BCG) has an interesting ‘work sample’ on their Web site that presents a multi-part citation:


Evidence Explained: Citing History Sources from Artifacts to Cyberspace (Baltimore: Genealogical Publishing Co., 2007)—or the earlier abridged edition, Evidence! Citation & Analysis for the Family Historian (1997) together with its companion QuickSheet: Citing Online Historical Resources Evidence! Style (rev. 2007).


This might be represented as a compound Citation that brings together the referenced simple Citations through its Narrative support.


Citations can become even more complex than this since the author will want to cite not only the source, and the information obtained from that source, but also the context of how it substantiates or contradicts their assertions and conclusions. This often involves some type of analytical commentary in the citation. For instance:


Death notices, Ulster Gazette and Daily National Intelligencer, both dated 24 January 1815. Corra Bacon-Foster, "The Story of Kalorama," Records of the Columbia Historical Society (1910), 108, states Louisa left four children; three have been identified. In 1810, Charles "Cating" and a female, both over 44, were enumerated with one male and female aged 26-44; one male and female aged 16-25; and one male under 10 - suggesting that George, Louisa, and their first son may have been living in the Catton household. See 1810 U.S. census, Ulster County, New York, New Paltz, p. 116, line 6; NA micropublication M252, roll 37.


This type of text could make extensive use of the STEMMA Narrative support but should it appear in the Citation definition, the relevant CitationLnk, or elsewhere? The Narrative Structure section illustrates how Evidence and Conclusion (E&C) trees can be built up from named Narrative elements.


It is important to retain a clear view of the distinction between a Citation and a Resource. As another example, consider UK BMD references. These might be linked to the defining body, say with something like, in order to create a unique abstract source citation. However, if you wanted to be able to pull up the appropriate index page on some Web site then that would be done via a Resource entity.

[1] Elizabeth Shown Mills, Evidence Explained: Citing History Sources from Artifacts to Cyberspace (Baltimore: Genealogical Publishing Co., 2009)