This paper discusses the names of Persons and Places as used in micro-history data. It looks at their similarities and their differences, and considers how their data representation in STEMMA® has benefitted from that analysis.
It also looks at the concept of authorities for definitive lists of names and places.
Place names share many aspects with personal names:
From a computer software point of view, there are other similarities between the handling of their names. For instance, the way two names are matched must be “relaxed” and consider both of the following:
One aspect that isn’t really shared identically between Persons and Places is their parentage. A person has a fixed biological parentage but a place may move from one parent region to another as its geographical or administrative boundaries are changed. This is discussed in detail later.
STEMMA takes full advantage of these similarities by using the same Name Variants element for both Persons and Places. This treats all names as a sequence of tokens, and provides support for each token in a sequence being selected from a 0-or-more set (i.e. optional choice) or a 1-or-more set (i.e. mandatory choice). The sequences can also be given date ranges to indicate when they would be applicable from and to. Note that the applicable date ranges may be overlapping — they are not disjoint.
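The token-sequence idea can be illustrated with a small sketch. This is not the STEMMA schema: the class names `TokenSlot` and `NameSequence` are hypothetical, and I have assumed that each position in a sequence contributes at most one token, with an optional position being skippable.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass
class TokenSlot:
    """One position in a name sequence (hypothetical structure)."""
    choices: Sequence[str]
    mandatory: bool = True  # False means the position may be skipped


@dataclass
class NameSequence:
    slots: List[TokenSlot]
    valid_from: Optional[date] = None  # None means open-ended
    valid_to: Optional[date] = None    # inclusive; None means open-ended

    def applies_on(self, when: date) -> bool:
        after_start = self.valid_from is None or self.valid_from <= when
        before_end = self.valid_to is None or when <= self.valid_to
        return after_start and before_end

    def matches(self, tokens: Sequence[str]) -> bool:
        return _match(self.slots, [t.casefold() for t in tokens])


def _match(slots: List[TokenSlot], tokens: List[str]) -> bool:
    """Recursively consume tokens, one slot at a time."""
    if not slots:
        return not tokens  # success only if every token was consumed
    slot, rest = slots[0], slots[1:]
    accepted = {c.casefold() for c in slot.choices}
    if tokens and tokens[0] in accepted and _match(rest, tokens[1:]):
        return True
    # An optional slot may be skipped without consuming a token.
    return (not slot.mandatory) and _match(rest, tokens)


def applicable(sequences: List[NameSequence], when: date) -> List[NameSequence]:
    """Date ranges may overlap, so more than one sequence can apply."""
    return [s for s in sequences if s.applies_on(when)]
```

Because the same structure serves a personal name and a place name, a caller does not need to know which kind of entity it is matching against.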
A Person or a Place may have alternative names in different languages, and a personal-name example is given on the main STEMMA pages. For both a Person and Place reference, the local and foreign versions of a name are called Endonyms and Exonyms respectively.
The same token-matching rules are used for both entity types, including tokenisation, relaxed character matching, handling of common misspellings, abbreviations, etc.
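As a rough illustration of relaxed matching, the following sketch tokenises a name, ignores case and diacritics, and expands a few abbreviations. The abbreviation table and the function names are hypothetical, and the handling of common misspellings is deliberately left out; a real implementation would need culture-specific tables.

```python
import re
import unicodedata
from typing import List

# Hypothetical abbreviation table; a real one would be far larger
# and probably locale-dependent.
ABBREVIATIONS = {"rd": "road", "sq": "square", "wm": "william"}


def strip_diacritics(text: str) -> str:
    """Decompose characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenise(name: str) -> List[str]:
    """Split into case-folded tokens, expanding known abbreviations."""
    cleaned = strip_diacritics(name).casefold()
    raw_tokens = re.findall(r"[^\W_]+", cleaned)  # runs of letters/digits
    return [ABBREVIATIONS.get(tok, tok) for tok in raw_tokens]


def relaxed_equal(a: str, b: str) -> bool:
    """True if two names agree after relaxed normalisation."""
    return tokenise(a) == tokenise(b)
```

The same `relaxed_equal` call works for `"Abbey Rd."` against `"abbey road"` and for `"Wm Brontë"` against `"William Bronte"`, which is the point of sharing the rules between entity types.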
Advantages and disadvantages of this unified approach that are specific to Persons or Places are presented below. NB: Person groups share many of the similarities presented here for Places. That is leveraged by the STEMMA Group concept which is a top-level entity alongside Person and Place.
There are a number of terms here that regularly get used interchangeably but which have slightly different meanings. We first need to refine our terminology in order to be sure of what we’re talking about, and to eliminate ambiguities.
The first of these is easy to define, and easy to distinguish from the other two. Obviously postal addresses do not apply to every place or location. For instance, it may be an historical one, or it may no longer exist, or it could be the name of a vehicle in transit such as a ship, or post may never be delivered there. Conversely, an address such as a “P. O. Box Number” is an abstract collection point for the recipient and does not correspond to a physical place or location. In genealogy, addresses might be employed as part of the general category of contact details, such as during data attribution. They are discussed further at Addresses and under Worldwide Family History Data.
Distinguishing a Place from a Location is more subtle (see difference-between-location-and-place for instance). Even though the terms are indistinct in everyday life, they have more precise meanings in geography. I have chosen semantics for micro-history data which should be both agreeable and meaningful. In order to illustrate the difference, consider a major urban redevelopment where streets are torn down and remade in a different fashion. A household on a new street constitutes a different Place to a household on an old street, even though they may be at the same physical Location. It therefore makes sense to talk about the “location of a place”.
A Place has a granularity associated with it. By this I mean that it could be a very specific fine-grained Place, such as a household, or a more general coarse-grained Place, such as a country. Since geographical coordinates are not appropriate for all Places (in contrast to Locations), we need a unique textual reference based on their names. The obvious choice is a hierarchical identification starting with a recognised outermost Place and working inwards until the specified Place is reached [NB: the visible representation may be small-to-large or large-to-small, depending on personal and cultural preferences]. We’ll call this concept a Place-hierarchy, and the full textual reference a Place-hierarchy-path.
The first question is which type of hierarchy. Ideally, it should be a geographical one rather than an administrative one, a religious one (e.g. ecclesiastical parishes), a judicial one, a political one (e.g. electoral wards and polling areas), etc. However, nations are typically organised according to administrative jurisdictions and so a combined geographical and administrative hierarchy would be more useful. The underlying premise is that every place has a unique bounding parent place at any given time. The other jurisdictional zones could be represented as non-hierarchical connections (see RelatedTo), although alternative hierarchies can also be constructed as long as that premise is not violated. These other aspects of a Place will still be important, such as a civil birth registration occurring in an administrative area but a baptism occurring in a religious one, so we still have to relate them somehow.
A Place-hierarchy could begin with a continent but they’re ill-defined — both in number and in content — and some Places do not sit within a continent. We’ll therefore bypass them and begin with the respective country. Here are a few examples that go as far as a city:
[US, California, San Francisco]
[England, Nottinghamshire, Nottingham]
We immediately have another question here. In the second example, could England be replaced by UK? Well, ideally not since the United Kingdom is not a country; it is a sovereign state. However, the fact that the UK has not always existed raises the bigger issue that the valid terms need to cover other time periods as well as the present day. ISO 3166-1 only defines codes for present-day countries. Also, ISO 3166-2 defines codes for the names of the principal present-day subdivisions of the countries in ISO 3166-1 (e.g. provinces or states). This does not include subdivisions such as Shires although they are still historically relevant.
There is a similar standard to ISO 3166-2 developed independently by the European Union and called the Nomenclature of Units for Territorial Statistics (NUTS).
The divisional entities within each country not only have many different names but their relative size and organisation cannot be assumed to be similar in different countries. For instance, if a country has both states and counties, are states a subdivision of counties, or vice versa, or are they relative equals? This is not a big problem though. What is needed is a compiled list of entity names that is valid not just in the present age but in the past too. From this, we can define something called a “controlled vocabulary” which can be applied to all countries. For instance:
Authorities, Boroughs, Counties, Departments, Dependencies, Districts, Islands, Municipalities, Parish (Civil), Provinces, Regions, Republics, Shires, States, Territories, Townlands, Townships
Each country would then have a selection of these names that are applicable to it, and a relative ordering specified between them.
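A minimal sketch of that arrangement follows. The vocabulary is the list given above, but the per-country selections and orderings are purely illustrative assumptions, not authoritative statements about either country, and the function names are hypothetical.

```python
from typing import Dict, List

CONTROLLED_VOCABULARY = frozenset([
    "Authorities", "Boroughs", "Counties", "Departments", "Dependencies",
    "Districts", "Islands", "Municipalities", "Parish (Civil)", "Provinces",
    "Regions", "Republics", "Shires", "States", "Territories", "Townlands",
    "Townships",
])

# Illustrative only: each country selects some terms and orders them
# from the broadest division to the narrowest.
COUNTRY_ORDERINGS: Dict[str, List[str]] = {
    "US": ["States", "Counties", "Townships"],
    "England": ["Counties", "Districts", "Parish (Civil)"],
}


def validate_orderings(orderings: Dict[str, List[str]]) -> None:
    """Reject terms outside the vocabulary and repeated terms."""
    for country, terms in orderings.items():
        unknown = [t for t in terms if t not in CONTROLLED_VOCABULARY]
        if unknown:
            raise ValueError(f"{country}: unknown terms {unknown}")
        if len(set(terms)) != len(terms):
            raise ValueError(f"{country}: repeated term in ordering")


def is_broader(country: str, first: str, second: str) -> bool:
    """True if `first` sits above `second` in that country's ordering."""
    order = COUNTRY_ORDERINGS[country]
    return order.index(first) < order.index(second)
```

Keeping the ordering per country avoids any assumption that, say, a county means the same relative level everywhere.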
When we get to the level of a street, a building, or a house, then a Place-hierarchy is fairly easy to work with and is obviously geographical in nature (see STEMMA). However, in between the country and a street, things get a little muddier. Entities like those defined above are generally administrative rather than geographical. That is probably an acceptable necessity but the mud gets thicker.
Looking at the concept of counties in England, for example, we find there are several flavours so we would need to be explicit about which entities are being employed. The historic counties of England were established for administration by the Normans, and in most cases were based on earlier kingdoms and shires established by the Anglo-Saxons. These geographic counties existed before the local government reforms of 1965 and 1974. Counties are now primarily an administrative division composed of districts and boroughs. The overall administrative hierarchy is described at Subdivisions_of_England. However, the ceremonial counties are still used as geographical entities. The postal counties of the UK, now known officially as the former postal counties, were postal subdivisions in routine use by the Royal Mail until 1996. The registration counties were a statistical unit for the registration of births, marriages, and deaths and for the output of census information.
See also Modern_Counties which associates the UK historic counties with the more modern counties and county boroughs.
This scheme does not suggest how to deal with overlapping entities. It assumes that every place has a unique bounding place at any given time. Although these are primarily of a geographical/administrative nature, alternative hierarchies can also be constructed as long as that premise is not violated. However, a country may be split due to a political change, or a property may be split during inheritance, or a street may be completely torn down and the area redeveloped with new streets. These examples “overlap” in that they are related and may even occupy the same physical location. There are also informal place references which do not correspond with any single official place, and yet may still be encountered in a census return. An example might be The Potteries in England.
STEMMA defines a partially controlled vocabulary for a Place-hierarchy. This is significant in several respects:
It takes the Place-hierarchy right down to a given household or building. This allows easy association with other Places on the same street, or in the same village, and so helps software to make correlations about different families in the same data. It may seem an unnecessary overhead to describe each household using a separate Place entity but the fact that all the common data is factored out quickly makes it quite economical. That common data will include the rest of the Place-hierarchy, and dates of establishment or destruction, alternative names or spellings, and any historical narrative. It also reduces the chances of errors in your data by reducing the need for duplication. See also Place Authority.
The variability of Place Hierarchies is handled in the Place entity through the use of <ParentPlaceLnk> and <Creation>/<Demise> elements. These provide date ranges for a particular parent Place, thus supporting a variable hierarchical relationship, and for the establishment or destruction of the Place. The date ranges for successive parents are disjoint and will not overlap. Note that these dates are not directly related to the dates in the Name Variants element. The Name Variants dates control the accepted local names for the current Place, and not the hierarchy that the Place happens to be within. Those name variants will also be applicable long after a Place ceases to exist and so they are not limited by the <Demise> event.
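The effect of date-ranged parent links can be sketched as follows. The classes only loosely mirror the elements named above and are not the STEMMA schema; the place names, keys, and dates are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class ParentLink:
    parent_key: str
    valid_from: Optional[date] = None  # None means open-ended
    valid_to: Optional[date] = None    # inclusive; None means open-ended


@dataclass
class Place:
    key: str
    name: str
    parents: List[ParentLink] = field(default_factory=list)


def parent_on(place: Place, when: date) -> Optional[str]:
    """Return the key of the parent on a date (ranges are disjoint)."""
    for link in place.parents:
        starts_ok = link.valid_from is None or link.valid_from <= when
        ends_ok = link.valid_to is None or when <= link.valid_to
        if starts_ok and ends_ok:
            return link.parent_key
    return None


def hierarchy_path(places: Dict[str, Place], key: str, when: date) -> List[str]:
    """Walk up the parent links and return names, outermost first."""
    path, seen = [], set()
    current: Optional[str] = key
    while current is not None:
        if current in seen:
            raise ValueError("cycle in parent links")
        seen.add(current)
        place = places[current]
        path.append(place.name)
        current = parent_on(place, when)
    return list(reversed(path))


# Invented data: a village that changes county at a boundary reform.
PLACES = {
    "c": Place("c", "Exampleland"),
    "old": Place("old", "Oldshire", [ParentLink("c")]),
    "new": Place("new", "Newshire", [ParentLink("c")]),
    "v": Place("v", "Exampleton", [
        ParentLink("old", valid_to=date(1974, 3, 31)),
        ParentLink("new", valid_from=date(1974, 4, 1)),
    ]),
}
```

Calling `hierarchy_path(PLACES, "v", some_date)` therefore yields a different Place-hierarchy-path depending on the date of the event being recorded.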
STEMMA also generalises the concept of a Place to include a ship or other named vehicle in transit. It is then applicable to either end of a journey, say during emigration.
See Related Entities for a description of how STEMMA tackles the issues of non-hierarchical relationships raised at the end of the previous section.
Postal address formats vary widely around the world, and some information may be found at: international-address-formats. There is no available standard yet, although ISO 19160 is still being developed.
At the building level there may be a number and/or a name for the building, and these alternatives may be mixed even for houses on the same street. For apartments or flats within a building or condominium, there may be a further identification within that entity in order to locate the ‘Addressable Object’ (using ISO 19160 terminology). For very rural properties such as farms, there may be no actual street or road name.
Given this variability, most forms and databases resort to fields such as AddressLine1 – AddressLine3 for the initial part of an address string. Just as with Place Hierarchies, the larger entities in a postal address differ between countries. Let’s look at a few for comparison:
The <Address> element used by STEMMA <ContactDetails> takes a generic stance as follows:
Each element, if non-blank, is an explicit part of the address string. The StateOrProvince address term should be interpreted as the top-level administrative division used in an address for the respective country, e.g. a State or a County. CountryCode is an ISO 3166-1 2-or-3 letter code.
Toponymy is the scientific study of place names (toponyms). A very useful selection of place name information around the world may be found at: Toponymy.
When comparing place names, we need to take account of abbreviations as well as alternative spellings and misspellings, just as with personal names. Common address abbreviations for a few English-speaking countries (e.g. St, Rd, Sq, Gnds) may be found at the following sites:
A recent site has appeared to help locate the UK Traditional County from a UK postcode: http://www.postal-counties.co.uk/. The traditional counties are viewed as important from a cultural and geographical point of view by many, and yet attempts to redefine them with newer (and short-lived) administrative counties have caused confusion and resentment.
A given name is used to distinguish members of a family group. The term implies that the name is purposefully chosen when the child is born and contrasts with inherited parts of their personal name. In the West, a given name is often called a first name, or forename, but this presupposes the order of the name parts. See Given_name.
A surname is an inherited part of a personal name added to a given name, and is usually a family name. Many dictionaries actually define ‘surname’ as a synonym of ‘family name’ but this is not true where a culture uses patronymic or matronymic names, i.e. where a surname is based on the given name of a male or female ancestor, respectively. In the West, a surname is often called a last name but that presupposes the order of the name parts. In North and South America, as well as in most of Europe, a surname is placed after a person's given name. In China, Japan, Korea, and many other East Asian countries, as well as in Hungary, the family name is placed before a person's given name. In Spain and most Spanish-speaking countries, two or more surnames are commonly used. See Surname.
A more in-depth look at personal names around the world may be found under Worldwide Family History Data. According to that examination, names may be broken down into the following generalised token categories:-
In principle, once name tokens have been categorised then a set of rules appropriate to a given culture can be applied to indicate how a personal name should be sorted, or how it should be presented in different contexts. Those contexts might include:
Let’s call those rules a name scheme. As an English-language example, consider the unwieldy name of a well-known TV character:
General Sir Anthony Cecil Hogmanay Melchett VC DSO KCB
The nine tokens in this name could be categorised as follows:
Prefix = 1 2
Given = 3
Middle = 4 5
Family = 6
Suffix = 7 8 9
Hence, an appropriate set of rules could generate the following name forms:
Formal = "General Sir Anthony Cecil Hogmanay Melchett VC DSO KCB"
Informal = "Anthony Melchett"
SemiFormal = "Anthony C. H. Melchett"
Listing = "Melchett, General Sir Anthony Cecil Hogmanay VC DSO KCB"
The same name scheme could define rules for sorting, collation, and case conversion. It could also define fields (with locale-dependent annotation) for a form that would solicit such a name from the end-user and automatically categorise the tokens.
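A name scheme of this kind can be sketched in a few lines. The function names are hypothetical, and the rules shown are only those needed to reproduce the four forms listed above for this one English-language example.

```python
from typing import Dict, List

TOKENS = "General Sir Anthony Cecil Hogmanay Melchett VC DSO KCB".split()

# Token positions are 1-based, as in the categorisation above.
CATEGORIES: Dict[str, List[int]] = {
    "Prefix": [1, 2],
    "Given": [3],
    "Middle": [4, 5],
    "Family": [6],
    "Suffix": [7, 8, 9],
}


def pick(tokens: List[str], categories: Dict[str, List[int]], name: str) -> List[str]:
    """Return the tokens belonging to one category, in order."""
    return [tokens[i - 1] for i in categories.get(name, [])]


def name_forms(tokens: List[str], categories: Dict[str, List[int]]) -> Dict[str, str]:
    """Apply one illustrative English-language name scheme."""
    prefix = pick(tokens, categories, "Prefix")
    given = pick(tokens, categories, "Given")
    middle = pick(tokens, categories, "Middle")
    family = pick(tokens, categories, "Family")
    suffix = pick(tokens, categories, "Suffix")
    initials = [m[0] + "." for m in middle]  # middle names become initials
    return {
        "Formal": " ".join(prefix + given + middle + family + suffix),
        "Informal": " ".join(given + family),
        "SemiFormal": " ".join(given + initials + family),
        "Listing": " ".join(family) + ", " + " ".join(prefix + given + middle + suffix),
    }
```

Even this small sketch hard-codes assumptions, such as the family name leading a listing and middle names reducing to initials, that would not hold for every culture.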
Unfortunately, this formalised approach gets complicated very quickly when addressing other cultures and historical names. It may be the case that there is no universal categorisation that will not admit exceptions. The consensus reached under Worldwide Family History Data was that it is just as easy — but more portable — to simply store the set of names accepted on input, and a smaller set to be used for presentation in specific contexts. This is the route adopted by STEMMA.
It has been mentioned already how STEMMA takes advantage of the great similarities between personal names and place names. This results in a very streamlined approach that easily supports alternative names and separates the idea of canonical names from the name-matching process during input.
The original intention was to analyse the tokens of a personal name and use them accordingly during sorting, collation, and case conversion. However, research into structural variations in personal names quickly demonstrated that rigorous categorisation of name tokens, and the development of culturally-dependent rules to handle them, was fraught with problems and so a culturally-neutral approach was taken for all names (person, place, and group). See under Worldwide Family History Data and The Game of the Name.
Basically, STEMMA separates the variations of a name accepted during input, searching, and lookup, from the presentation styles used during output. Any number of name variations can be declared for input, including alternative names, names in different languages, phonetic versions, abbreviations, and time-dependent names. For output, a smaller number of canonical names are provided with different style classifications such as Formal and Informal.
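The separation of input variants from output styles might be pictured like this. The class `NamedEntity` and its fields are hypothetical and do not reflect the STEMMA file format; the abbreviated variant is an invented example.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


def _normalise(text: str) -> str:
    """Case-fold and collapse runs of whitespace."""
    return " ".join(text.casefold().split())


@dataclass
class NamedEntity:
    # Small set of canonical names for output, keyed by style.
    canonical: Dict[str, str]
    # Any number of variations accepted for input, searching, and lookup.
    variants: Set[str] = field(default_factory=set)

    def accepts(self, text: str) -> bool:
        wanted = _normalise(text)
        return any(wanted == _normalise(v) for v in self.variants)

    def display(self, style: str, fallback: str = "Formal") -> str:
        """Return the canonical name for a style, else the fallback style."""
        return self.canonical.get(style, self.canonical[fallback])


melchett = NamedEntity(
    canonical={
        "Formal": "General Sir Anthony Cecil Hogmanay Melchett VC DSO KCB",
        "Informal": "Anthony Melchett",
    },
    variants={"Anthony Melchett", "A. C. H. Melchett", "Anthony Cecil Hogmanay Melchett"},
)
```

No token categorisation is needed here: matching consults only the declared variants, and presentation consults only the canonical names.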
This may be an issue for citation references when an author’s name has to appear in an appropriate form. The data actually made available to citation template engines (see Zotero and CSL) is still hotly debated, but there’s an implicit assumption that a cooperating software unit will only provide a name in one form, and that the template engine will have to parse it and categorise the tokens itself, which is the route already abandoned by STEMMA.
A very good list of given names used in different countries may be found at: Given_name_appendices. A list of common abbreviations for English given names may be found at: Abbreviations_for_English_given_names.
An authority is a definitive source of information. In the context of genealogy, it may be applied to personal names, places, and place names. There may be other applications, such as for significant events (including when census returns were made), dates & calendars, and cited sources, but these are not considered here.
Authorities cannot be applied to people themselves since proving the existence of a specific, distinct individual is considerably more difficult than proving the existence of, say, a place, and is one of the prime goals of micro-history research. In effect, there can be no single ‘authority’ there.
In the context of computer software, an authority will usually be a Web service or some other type of API accessible over the Internet. It would allow content providers and genealogy programs to consult an index of names and places, and be able to verify newly-entered details or fill in missing parts.
Personal names and place names share a number of common requirements that a name authority can provide support for:
Such an authority must be provided with a locale identifier to tailor its service to a particular region and language. The above list doesn’t mention bilingual countries where place names routinely have spellings for each language, or where people may elect to have variations of their name in both languages. Neither of these cases falls directly under the ‘Alternative spellings’ category above. People adopt such variations on an individual basis; they are not an intrinsic part of the personal name itself. Also, places may have a different form of their name in different languages (e.g. “Rue de la Paix” rather than “Peace Street”) but support for that should be a function of a place authority rather than a mere name authority.
Links to useful name resources may be found at Place Name Resources and at Given Name Resources. Many Web sites and software products currently have similar lists of alternatives, abbreviations, and misspellings built into them which then results in differences between them. Having a single authority would provide consistency and reduce errors or omissions.
An authority for places is slightly different because it relates to something that physically exists, or existed, rather than simply a decoupled name.
When dealing with simple names, an authority can provide lists of the associated alternatives. A place, however, may have many properties, including historical narrative, and they may get updated over time. In this case, the authority might provide Web access to that information rather than a copy of it.
This raises an important question: should a place authority provide a persistent identifier for the place (e.g. a URI or a UUID), or should it return a unified textual place reference (e.g. a place name hierarchy)? Let’s look at both cases:
In effect, both schemes have merit and so a hybrid would be more useful than either of them separately. For instance, the authority could return a hierarchy with both a Place name and a persistent identifier for each level defined within it. The Place entities within the micro-history data would then store the name of each level in the Place-hierarchy and any persistent identifiers relevant to them. It means something like a Place-hierarchy-path can easily be generated without the need for a live connection to an authority, but the option is there to obtain further details of the Place using the identifier if available.
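One possible shape for such a hybrid response is sketched below. The `PlaceLevel` structure, the function names, and the `urn:example:` identifiers are all assumptions made for illustration; no existing authority returns data in this form.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlaceLevel:
    name: str
    identifier: Optional[str] = None  # URI or UUID, if the authority has one


def place_hierarchy_path(levels: List[PlaceLevel],
                         large_to_small: bool = True,
                         sep: str = ", ") -> str:
    """Build the textual path locally, with no call to the authority."""
    ordered = levels if large_to_small else list(reversed(levels))
    return sep.join(level.name for level in ordered)


def identifier_for(levels: List[PlaceLevel], name: str) -> Optional[str]:
    """Find the persistent identifier stored for a named level, if any."""
    for level in levels:
        if level.name.casefold() == name.casefold():
            return level.identifier
    return None


# Hypothetical response stored alongside the micro-history data.
RESPONSE = [
    PlaceLevel("England", "urn:example:place:1"),
    PlaceLevel("Nottinghamshire", "urn:example:place:2"),
    PlaceLevel("Nottingham"),  # no identifier available for this level
]
```

The path can be rendered in either order without a live connection, while `identifier_for` gives the key needed for a later lookup when one is available.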
A very practical problem with the concept of a Place authority is its update and maintenance if held in a central repository. Because different countries, and probably administrative levels within each country, will want to retain responsibility for their specific content, the only way to handle this is to have a federated authority. In other words, have a single root service that delegates to more specific services elsewhere. This requires a fair amount of standardisation in terms of the service addresses, the request and response data formats, the language of the response, the controlled vocabulary, etc. The allocation of persistent identifiers does not need to be done from a central place as both URIs and UUIDs may be allocated in a distributed fashion.
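The delegation idea can be sketched without any networking. Here the delegates are plain in-memory functions standing in for remote services; the class `FederatedAuthority`, the lookup table, and the `example:` identifiers are hypothetical.

```python
from typing import Callable, Dict, List, Optional

# A delegate takes the remainder of a place path (below its own level)
# and returns a persistent identifier, or None if it does not know it.
Resolver = Callable[[List[str]], Optional[str]]


class FederatedAuthority:
    """Root service that only routes requests to registered delegates."""

    def __init__(self) -> None:
        self._delegates: Dict[str, Resolver] = {}

    def register(self, top_level: str, resolver: Resolver) -> None:
        self._delegates[top_level.casefold()] = resolver

    def resolve(self, path: List[str]) -> Optional[str]:
        if not path:
            return None
        delegate = self._delegates.get(path[0].casefold())
        if delegate is None:
            return None  # nobody has taken responsibility for this branch
        return delegate(path[1:])


# Stand-in for a country-level service with its own content.
_ENGLAND_TABLE = {
    ("nottinghamshire", "nottingham"): "example:eng/notts/nottingham",
}


def england_resolver(rest: List[str]) -> Optional[str]:
    return _ENGLAND_TABLE.get(tuple(part.casefold() for part in rest))


root = FederatedAuthority()
root.register("England", england_resolver)
```

A delegate could itself be another `FederatedAuthority` bound to a lower administrative level, which is how responsibility would be pushed further down in practice.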
For the specific case of England and Wales, the www.ukbmd.org site contains excellent information on the registration counties and registration districts but this does not yet constitute an ‘authority service’ as described here.
In a similar vein, the UK National Archives have a partially complete street index into the various England and Wales census returns: Census_Street_searches and Historical_Streets_Project. Besides being enormously useful, this could be a great starting point for a street-level authority. The ability to relate census pages to specific street-level places is a great example of what could be accomplished. Unfortunately, the ‘your Archives’ project was being closed down at the beginning of 2012 and the future of that index is uncertain.
Switching to mainland Europe, there is a great resource for places in Germany and Middle Europe, roughly translated as "The Historic Gazetteer": http://wiki-de.genealogy.net/GOV. At the last check it contained over 300,000 “objects” in its database and these included places, churches, structures, etc. The search tool is at http://gov.genealogy.net/search. Each link from the results shows a page of associated information and statistics, e.g. the city Straßburg/Strasbourg is at: http://gov.genealogy.net/item/show/STRURGJN38VN. Notice that the GOV-id is akin to a persistent identifier. There is still no programmatic service interface though.
® STEMMA is a registered trademark of Tony Proctor.