Recording Evidence

A number of features are required to correctly record source information in a transcription. This section illustrates how STEMMA deals with them.

 

  • Positional anomalies, meaning text added outside the main body or flow. This may have been added by the author, or by hand after the document was printed or published. The following types are supported by the <Anom> semantic mark-up element:
    • Footnotes and Endnotes. Text added at the end of a page or document.
    • Tablenotes. Text added at the end of a table.
    • Marginalia. Text added in a margin.
    • Interlinear notes. Text added between lines.
    • Intralinear notes. Text inserted within a line, usually marked with a caret. This includes the identification of sublinear and supralinear variants.
  • Audio anomalies, including non-verbal gestures or movements, noises, and significant pauses.
  • Marking sections from different contributors, or suspected different contributors. This includes different (written) hands and different voices. See ‘id’ attribute of <ms>, <ts>, and <voice> elements.
  • Uncertain characters. Sequences of characters may be unreadable or uncertain (i.e. there are several distinct possibilities). Recording this correctly is essential for accurate searching. See <Ucf> element below.
  • Struck-out characters. Characters crossed out in the original. See the <s> element in Descriptive Mark-up.
  • Uncertain interpretation. Adding a suggested meaning or spelling correction to a word or phrase that is readable but is unusual or not recognised. Similarly with unusual pronunciation in audio transcription. Supported via the <Alt> mark-up.
  • Specific emphasis, such as bold, italic, or underline. See the respective elements in Descriptive Mark-up.
  • Stylistic variations from a given contributor, including different colours, different fonts, different intonation. See ‘scheme’ attribute of <ms>, <ts>, and <voice> elements.
  • Numbering of pages, columns, paragraphs, and lines. See <page>, <col>, <p>, and <line> mark-up.
  • Linking textual transcription to locations in an image (see ‘x’ and ‘y’ attributes on various elements), and audio transcription to locations in a recording (see <time> element).

 

Some of these terms and concepts may be found in Editorial Methods for Journals, volume 1, and The Conventions of Textual Treatment, chapter five. For other attempts at audio transcription, see http://clu.uni.no/icame/manuals/WSC/MARKCONV.HTM and https://www.univie.ac.at/voice/documents/VOICE_mark-up_conventions_v2-1.pdf.

 

Traditional editorial notations for uncertain characters are not well-suited to digital text as they do not facilitate efficient and accurate searching within the limits of what is known. TEI has elements such as <choice> and <unclear>, and a comprehensive formalised notation may be found at: http://igenie.org under Transcriptions. Although less comprehensive, perhaps the most compact is the UCF (Uncertain Character Format) devised by FreeUKGEN. This is based on the regex pattern-matching language, although it must be remembered that the notation appears within target strings rather than search strings. Regex, in turn, is an extension of traditional wildcard characters[1]. This UCF is the basis of the notation used within STEMMA, and the following table is from the FreeBMD pages:

 

 

_ (Underscore)

A single uncertain character. It could be anything but is definitely one character. It can be repeated for each uncertain character.

* (Asterisk)

Several adjacent uncertain characters. A single * is used when there are 1 or more adjacent uncertain characters. It is not used immediately before or after a _ or another *. Note: If it is clear there is a space, then * * is used to represent 2 words, neither of which can be read.

[abc]

A single character that could be any one of the contained characters and only those characters. There must be at least two characters between the brackets. For example, [79] would mean either a 7 or a 9, whereas [C_] would mean a C or possibly some other character.

{min,max}

Repeat count - the preceding character occurs somewhere between min and max times. max may be omitted, meaning there is no upper limit. So _{1,} would be equivalent to *, and _{0,1} means that it is unclear if there is any character.

 

UCF also defines a ‘?’ character that is used to represent the situation where all of the characters have been read but you remain uncertain of the word, e.g. “RACHARD?” This is not used within STEMMA because it is ambiguous with ‘?’ representing an absent value, and the equivalent feature is supported by <Alt> mark-up.

 

Some examples:

 

[lt]        Can't tell if it's an l or a t.

___         Three unreadable characters.

[x_]        I think the character is an ‘x’.

_{2,3}      Two or three unreadable characters.

*           Unknown number of unreadable characters.

_{0,1}      Not sure if there's a letter or an ink blob.
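
Because UCF appears in the stored (target) text rather than in the search term, one way to search such data is to translate each UCF string into a regex and then test a candidate reading against it. The following Python sketch illustrates this for the four constructs tabulated above. The function names ucf_to_regex and ucf_matches are hypothetical, and treating ‘_’ inside brackets as "any character" is an assumption drawn from the [C_] example; it is not taken from any FreeBMD or STEMMA software.

```python
import re

def ucf_to_regex(ucf: str) -> str:
    """Translate a UCF string into a regex pattern (illustrative sketch only)."""
    out = []
    i = 0
    while i < len(ucf):
        ch = ucf[i]
        if ch == "_":
            out.append(".")                  # exactly one unknown character
        elif ch == "*":
            out.append(".+")                 # one or more unknown characters
        elif ch == "[":
            end = ucf.index("]", i)          # ValueError if the bracket is unterminated
            members = ucf[i + 1:end]
            if "_" in members:
                out.append(".")              # assumption: [C_] admits any character
            else:
                out.append("[" + re.escape(members) + "]")
            i = end
        elif ch == "{":
            end = ucf.index("}", i)
            out.append(ucf[i:end + 1])       # {min,max} has the same form in regex
            i = end
        else:
            out.append(re.escape(ch))        # ordinary, certain character
        i += 1
    return "".join(out)

def ucf_matches(ucf: str, candidate: str) -> bool:
    """True if the candidate reading is consistent with the UCF target string."""
    return re.fullmatch(ucf_to_regex(ucf), candidate) is not None

print(ucf_matches("Sm[iy]th", "Smyth"))
print(ucf_matches("RICH_{0,1}RD", "RICHARD"))
```

A search for "Smith" or "Smyth" would then find a record transcribed as Sm[iy]th, while "Smeth" would not, which is the sense in which the notation supports searching within the limits of what is known.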

 

Early STEMMA designs considered using an ANSI escape sequence to bracket a set of UCF characters. For instance, <APC>_12[68]<ST> where APC=0x9F and ST=0x9C. This was partly to avoid unconditionally reserving a whole set of characters but also to allow them in attribute values as well as element data. The current version accommodates them in a <Ucf> element:

 

<Ucf> ucf-sequence </Ucf>
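
One consequence of confining UCF to its own element is that characters such as ‘*’ or ‘_’ remain ordinary text everywhere else. The following Python sketch shows how a consumer might separate the two kinds of content. The <Example> wrapper element and the function name split_segments are hypothetical, used only to make the fragment well-formed, and trimming the white space around the UCF sequence is an assumption rather than a stated STEMMA rule.

```python
import xml.etree.ElementTree as ET

def split_segments(xml_fragment: str):
    """Return (text, is_ucf) pairs for a fragment containing <Ucf> elements."""
    root = ET.fromstring(xml_fragment)
    segments = []
    if root.text:
        segments.append((root.text, False))          # literal text before any child
    for child in root:
        if child.tag == "Ucf":
            # Assumption: surrounding white space inside <Ucf> is not significant.
            segments.append(((child.text or "").strip(), True))
        else:
            segments.append(("".join(child.itertext()), False))
        if child.tail:
            segments.append((child.tail, False))     # literal text after the child
    return segments

fragment = "<Example>See note * for Sm<Ucf> [iy] </Ucf>th, aged 4<Ucf>_</Ucf></Example>"
for text, is_ucf in split_segments(fragment):
    print("UCF " if is_ucf else "text", repr(text))
```

Here the asterisk in "See note *" is reported as literal text, whereas the [iy] and _ sequences are flagged for UCF interpretation.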



[1] Wildcard characters represent variable sequences. There are several schemes but most allocate a single character to represent 0-or-more unknown characters (e.g. ‘*’) and another to represent exactly one unknown character (e.g. ‘?’). These may be combined so that, for instance, ‘?*’ represents 1-or-more unknown characters. Note that since ‘*?’ ≡ ‘?*’ and ‘**’ ≡ ‘*’ then any contiguous sequence of ‘*’ and ‘?’ can be simplified to just [?...][*], i.e. 0-or-more ‘?’ followed by an optional ‘*’.
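
The simplification described in this footnote can be sketched in Python as follows. The function name simplify_wildcards is hypothetical, and the sketch assumes the common scheme in which ‘*’ means 0-or-more and ‘?’ means exactly one unknown character.

```python
import re

def simplify_wildcards(pattern: str) -> str:
    """Rewrite each run of '*' and '?' as its '?'s followed by at most one '*'."""
    def canonical(match: re.Match) -> str:
        run = match.group(0)
        # Each '?' demands exactly one character; any number of '*' collapse to one.
        return "?" * run.count("?") + ("*" if "*" in run else "")
    return re.sub(r"[*?]+", canonical, pattern)

print(simplify_wildcards("a*?*?b"))
```

The run ‘*?*?’ requires at least two characters with no upper limit, so it reduces to ‘??*’.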