IT Services, University of Oxford

1. Linking, segmentation and alignment

In some texts we need to be able
  • to link disparate elements without using the xml:id attribute
  • to segment text into elements and to mark arbitrary points within documents
  • to represent correspondence or alignment among groups of text elements
  • to synchronize elements of a text, representing temporal correspondences and alignments among text elements
  • to specify that one text element is identical to or a copy of another
  • to aggregate possibly noncontiguous elements
  • to specify that different elements are alternatives to one another and to express preferences among the alternatives
  • to store markup separately from the data it describes

2. Underlying assumptions

  • Use W3C identifying, pointing and linking mechanisms where possible
  • Use xml:id to identify an element directly
  • Use XPointer to point to elements that do not have an xml:id

3. Complex pointing

The standard URI scheme allows for pointing
  • to documents other than the current document
  • to a particular element in a document other than the current document using its xml:id;
but we also need to point
  • to a particular element using its position in the XML element tree (standard XPointer schemes)
  • at arbitrary content in any XML document using TEI-defined XPointer schemes

4. Some XPointer schemes

From http://www.w3.org/2005/04/xpointer-schemes/; ones marked with a ➠ were specified by the TEI itself:
element
Identifies elements by position within parent, recursively.
➠ left
Locates the point immediately preceding its argument. The sole argument is a pointer, which is treated as if it were a fragment identifier itself. The argument may return a node, node set, range, or point.
➠ match
Takes as arguments a pointer, a string, and an optional integer. Designates the result of a literal match of the argument string within the string-value of the pointer argument.
➠ range
Locates a range between two points in an XML information set. Takes two pointer arguments which locate the boundaries of the range by two points, and are interpreted as fragment identifiers.

5. Some XPointer schemes (2)

➠ right
Locates the point immediately following its argument. The sole argument is a pointer, which is treated as if it were a fragment identifier itself.
➠ string-range
Locates a range based on character positions. Takes three arguments: a pointer, an offset, and a length.
xmlns
Binds a namespace prefix for use in subsequent pointer parts, e.g. xmlns(xs=http://www.w3.org/2001/XMLSchema)

6. Some XPointer schemes (3)

➠ xpath1
Locates a node or node set within an XML Information Set. The single argument is an XPath path as defined in the W3C XPath 1 Recommendation.
xpath2
Locates a node or node set within an XML Information Set. The single argument is an XPath path as defined in the W3C XPath 2 Recommendation.
xpointer
The rich scheme, including XPaths and ranges, described in the W3C xpointer() scheme Working Draft.

7. Test document for XPointer schemes


<!-- seven divs here -->
  <div xml:id="lastterm">
   <p>
    <emph>'But'</emph>, said
    <name key="Stalky">Stalky</name>,
       ‘come to think of it, we've done more giddy
       jesting with the Sixth since we've been
       passed over than any one else in the last
       seven years.’</p>
  </div>

9. Examples for XPointer schemes

<ptr target="stalky.xml#element(lastterm)"/>
<ptr target="stalky.xml#element(/1/1/8)"/>
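How a processor might resolve the two element() pointers above can be sketched in Python with only the standard library. This is one possible reading of the W3C element() scheme, not a reference implementation: the function name resolve_element_scheme and the sample document (a TEI/text/body skeleton with eight divs, hence the child sequence /1/1/1/8) are assumptions made for illustration and do not reproduce stalky.xml.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_element_scheme(root, data):
    """Resolve the argument of element(), e.g. 'lastterm' or '/1/1/1/8'.

    Hypothetical helper: returns the matching element, or None.
    """
    head, *steps = data.split("/")
    if head:
        # Form 1: a name, looked up here as an xml:id value.
        node = next((e for e in root.iter() if e.get(XML_ID) == head), None)
    else:
        # Form 2: a bare child sequence, whose first step must select
        # the document element, i.e. "/1".
        if not steps or steps[0] != "1":
            return None
        node, steps = root, steps[1:]
    for step in steps:
        if node is None or not step.isdigit():
            return None
        children = list(node)  # element children only; the default parser drops comments
        index = int(step)
        if not 1 <= index <= len(children):
            return None
        node = children[index - 1]
    return node


# Assumed stand-in for stalky.xml: seven empty divs, then the one of interest.
SAMPLE = (
    "<TEI><text><body>"
    + "<div/>" * 7
    + '<div xml:id="lastterm"><p>But, said Stalky</p></div>'
    + "</body></text></TEI>"
)
root = ET.fromstring(SAMPLE)
by_id = resolve_element_scheme(root, "lastterm")
by_position = resolve_element_scheme(root, "/1/1/1/8")
```

In this sample both pointers select the same div. A name followed by a child sequence, such as lastterm/1, selects the first child element of that div.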
xpointer() and xmlns()


<ptr target="stalky.xml#xmlns(t=http://www.tei-c.org/ns/1.0)

Note that the last expression returns multiple nodes.

11. A daily use for XPointer

The W3C XInclude specification is a good way to write composite documents; the <include> element's href attribute allows for XPointers:
<div xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="stalky.xml#

12. Generic linking

The core TEI <ptr> and <ref> elements let you do the point to point linking we are used to on web pages, relying on XML IDs for internal links:
<p>Wikipedia has a good starter page on
<ref target="…">waving cats</ref>, with links to more esoteric
resources; our own pictures are in
section <ref target="#cats">3</ref></p>

13. …

The linking module adds <link> to let you specify a point to point relationship between two or more elements:
<p xml:id="beetle1">You're a despondin' brute, Beetle</p>
<p xml:id="beetle2">An' who the dooce is this
Raymond Martin, M.P.?’ demanded Beetle</p>
<link targets="#beetle1 #beetle2"/>
Note that this is establishing a connection, not a direction.
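Since a <link> only names its targets, any application has to look them up itself. The Python sketch below shows one way to do that for local "#id" pointers, using the standard library; the function name resolve_links is hypothetical, and the sample omits the TEI namespace to keep the lookup short.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def resolve_links(root):
    """Return, for each <link>, the list of elements its targets point to."""
    by_id = {e.get(XML_ID): e for e in root.iter() if e.get(XML_ID)}
    resolved = []
    for link in root.iter("link"):
        # targets holds whitespace-separated pointers; only local "#id"
        # pointers are handled in this sketch.
        ids = [t[1:] for t in link.get("targets", "").split() if t.startswith("#")]
        resolved.append([by_id[i] for i in ids if i in by_id])
    return resolved


SAMPLE = """<div>
  <p xml:id="beetle1">You're a despondin' brute, Beetle</p>
  <p xml:id="beetle2">An' who the dooce is this
  Raymond Martin, M.P.?' demanded Beetle</p>
  <link targets="#beetle1 #beetle2"/>
</div>"""

groups = resolve_links(ET.fromstring(SAMPLE))
linked_ids = [[e.get(XML_ID) for e in group] for group in groups]
```

The result is a plain list of element groups; nothing in it says which end is the source, which matches the point that a <link> records a connection rather than a direction.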

14. Groups of links

<linkGrp> is provided to group together sets of <link>s. In the following example, it allows for stand-off notes, and characterisation of those notes:
<l xml:id="l2.79">A place there is, betwixt earth,
air and seas</l>
<l xml:id="l2.80">Where from Ambrosia, Jove
retires for ease.</l>
<l xml:id="l2.88">Sign'd with that Ichor which from Gods distills.</l>
<note xml:id="n2.79">
 <bibl>Ovid Met. 12.</bibl>
 <quote xml:lang="la">
  <l>Orbe locus media est, inter terrasq; fretumq;</l>
  <l>Cœlestesq; plagas —</l>
 </quote>
</note>
<note xml:id="n2.88">Alludes to <bibl>Homer, Iliad 5</bibl>
</note>
<linkGrp type="imitationnotes">
 <link targets="#n2.79 #l2.79"/>
 <link targets="#n2.88 #l2.88"/>
</linkGrp>

15. Segmenting text, and marking arbitrary points within documents

This module adds three useful new elements:
<ab> marks a block of text with no special semantic interpretation
<seg> marks a range of text with no special semantic interpretation
<anchor> marks an arbitrary point in the text
The first two have helpful type and subtype attributes.

16. Marking points

<anchor> is comparable to HTML anchors:
<p>He was merely working up to a peroration, and the
boys knew it; but McTurk cut through the frothing
sentence, the others echoing:</p>
<p><anchor xml:id="MTa"/>I appeal to the Head, sir.’</p>
<p><anchor xml:id="Ba"/>I appeal to the head, sir.’</p>
<p><anchor xml:id="Sa"/>I appeal to the Head, sir.’</p>
<p>It was their unquestioned right. Drunkenness meant
expulsion after a public flogging. They had been
accused of it. The case was the Head's, and the
Head's alone.</p>

17. Anonymous blocks

In this inscription the lines are separate, but they are neither verse nor paragraphs, so we isolate them with <ab>:
 <ab>JOSEPH STORY</ab>
 <ab>ONLY SON OF</ab>
 <ab>BORN MAY 3rd 1847</ab>
 <ab>AT BOSTON U.S.A</ab>
 <ab>DIED NOV. 23rd 1853</ab>
 <ab>AT ROME</ab>

18. Segments

There are more specific elements elsewhere in the TEI for marking sentences, words and characters, but sometimes we need to mark an arbitrary span, using <seg>:
<q>Don't say <q>
  <seg type="stutter">I-I-I</seg>'m afraid,</q>
Melvin, just say <q>I'm afraid.</q></q>

19. Correspondence and alignment

First, consider the representation of a manuscript page:

<ab xml:id="N6">
 <lb/>and hat hire
don in obedience ðe cnoweð hire manere
<lb/>and hire strencðe. he mai ðe vttre
riwle chaungen efter <lb/>wisdom alse he
isihð te inre mai beon best iholden.
<anchor xml:id="N_6"/>
 <lb/>Non ancre bi
mine rede ne schal makien professiun.
<lb/>þet is. bihoten ase hest. bute
þreo þinges. þet beoð. o-<lb/>bedience.
chastete. and studestaþeluestnesse.
þet heo ne schal <lb/>þene stude neuer
more chaungen; bute vor neod one.
<lb/>alse strengðe and deaþes dred.
obedience of hire bischope; <lb/>
oþer of hire herre. vor whoa se
nimeð þing an hond and bi-<lb/>hat
hit god alse heste to donne.
heo bint hire þerto. and su-
<lb/>negeð deadliche i ðe bruche;
3if heo hit brekeð willes and
wol<lb/>des. …</ab>

20. Correspondence and alignment (cont.)

Now let's look at an edited version and a translation:

<p xml:id="edited_6">Nan ancre bi mi read ne schal
makien professiun—þet is, bihaten
ase heast—bute þreo þinges,
þet beoð obedience, chastete, ant
stude-steaðeluestnesse (þet ha ne
schal þet stude neauer mare
changin bute for nede ane,
as strengðe ant deaðes dred,
obedience of hire bischop oðer of
his herre). For hwa-se
nimeð þing on hond ant bihat hit
Godd as heast forte don hit, ha
bint hire þer-to, ant
sunegeð deadliche i þe bruche 3ef
ha hit brekeð willes.</p>
<p xml:id="translated_6">My advice
is that no anchoress should make
profession—that is, bind herself to
a vow—of more than three things,
which are obedience, chastity, and
stability of abode (that she should
never move elsewhere afterwards
unless it is absolutely necessary,
as in the case of violence and fear
of death, or obedience to her
bishop or his superior). For
whoever undertakes something and
promises God to carry it out as a
vow binds herself to it, and
commits a mortal sin if she
voluntarily breaks her vow. …</p>

21. Correspondence and alignment (cont.)

We can express a relationship between the texts as follows:
<linkGrp type="translations">
 <link targets="#edited_6 #translated_6"/>
<!-- … -->
</linkGrp>
<linkGrp type="editions">
 <link targets="#N-f2r #N6"/>
<!-- … -->
</linkGrp>
meaning ‘this paragraph in the translated edition corresponds to text at that anchor in the original’.

There are many other ways of dealing with material like this!

22. Synchronizing time-based material

If you are linking together sequences which are aligned by time, there is a special stand-off linking element <when>, grouped inside a <timeline>. It has attributes:
absolute: an absolute time for the event
interval: the length of the gap since the last event
unit: the unit of time in which the interval value is expressed
since: a link to the previous event
<timeline xml:id="tl1" origin="#w0" unit="ms">
 <when xml:id="w0" absolute="11:30:00"/>
 <when xml:id="w1" interval="unknown" since="#w0"/>
 <when xml:id="w2" interval="100" since="#w1"/>
 <when xml:id="w3" interval="200" since="#w2"/>
 <when xml:id="w4" interval="150" since="#w3"/>
 <when xml:id="w5" interval="250" since="#w4"/>
 <when xml:id="w6" interval="100" since="#w5"/>
</timeline>

These <when> elements can be used in a <link> to relate points in time to points in the text.
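Because each <when> records only its distance from the event named in since, the time between two arbitrary events has to be computed by walking that chain. The Python sketch below illustrates the idea under some assumptions: the function name elapsed_between is hypothetical, every <when> is taken to use the unit given on the <timeline>, and an interval that is not a plain non-negative integer (such as "unknown") is treated as unmeasurable.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def elapsed_between(timeline, start_id, end_id):
    """Sum the intervals along the since-chain from end_id back to start_id.

    Returns None if the chain never reaches start_id, or if it crosses an
    interval that is not a plain non-negative integer.
    """
    whens = {w.get(XML_ID): w for w in timeline.iter("when")}
    total, current, seen = 0, end_id, set()
    while current != start_id:
        when = whens.get(current)
        if when is None or current in seen:
            return None  # dangling pointer or circular chain
        seen.add(current)
        since, interval = when.get("since"), when.get("interval", "")
        if since is None or not interval.isdigit():
            return None
        total += int(interval)
        current = since.lstrip("#")
    return total


SAMPLE = """<timeline xml:id="tl1" origin="#w0" unit="ms">
 <when xml:id="w0" absolute="11:30:00"/>
 <when xml:id="w1" interval="unknown" since="#w0"/>
 <when xml:id="w2" interval="100" since="#w1"/>
 <when xml:id="w3" interval="200" since="#w2"/>
 <when xml:id="w4" interval="150" since="#w3"/>
 <when xml:id="w5" interval="250" since="#w4"/>
 <when xml:id="w6" interval="100" since="#w5"/>
</timeline>"""

timeline = ET.fromstring(SAMPLE)
w1_to_w6 = elapsed_between(timeline, "w1", "w6")  # measured in the timeline's unit
w0_to_w6 = elapsed_between(timeline, "w0", "w6")  # crosses the unknown interval
```

With this timeline, the span from w1 to w6 can be summed, while anything measured from w0 cannot, because the first interval is unknown.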

23. Aggregating non-contiguous elements

The <join> element is used like <link>, pointing to two or more identified fragments of text. It claims that they could be joined to create a new virtual element (named by the result attribute). <joinGrp> is provided to aggregate <join>s.
 <l><seg xml:id="L1">E</seg>lizabeth it is in vain you say</l>
 <l>"<seg xml:id="L2">L</seg>ove not" — thou sayest it in so sweet a way:</l>
 <l><seg xml:id="L3">I</seg>n vain those words from thee or L.E.L.</l>
 <l><seg xml:id="L4">Z</seg>antippe's talents had enforced so well:</l>
 <l><seg xml:id="L5">A</seg>h! if that language from thy heart arise,</l>
 <l><seg xml:id="L6">B</seg>reath it less gently forth — and veil thine eyes.</l>
 <l><seg xml:id="L7">E</seg>ndymion, recollect, when Luna tried</l>
 <l><seg xml:id="L8">T</seg>o cure his love — was cured of all beside —</l>
 <l><seg xml:id="L9">H</seg>is follie — pride — and passion — for he died.</l>
 <join targets="#L1 #L2 #L3 #L4 #L5 #L6 #L7 #L8 #L9" result="name">
  <desc>The beloved's name</desc>
 </join>
(from Edgar Allan Poe).
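The virtual element claimed by a <join> can be materialised by concatenating its targets in the order listed. The following Python sketch does this with the standard library; joined_text is a hypothetical helper, and the verse lines in the sample are abbreviated to their opening words.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def joined_text(root, join):
    """Concatenate the text of the elements a <join> points to, in target order."""
    by_id = {e.get(XML_ID): e for e in root.iter() if e.get(XML_ID)}
    parts = []
    for pointer in join.get("targets", "").split():
        target = by_id.get(pointer.lstrip("#"))
        if target is not None:
            # itertext() covers the element's own content, not the text after it.
            parts.append("".join(target.itertext()))
    return "".join(parts)


# Abbreviated version of the acrostic; only the <seg> initials matter here.
SAMPLE = """<div>
 <l><seg xml:id="L1">E</seg>lizabeth it is in vain you say</l>
 <l>"<seg xml:id="L2">L</seg>ove not"</l>
 <l><seg xml:id="L3">I</seg>n vain those words</l>
 <l><seg xml:id="L4">Z</seg>antippe's talents</l>
 <l><seg xml:id="L5">A</seg>h! if that language</l>
 <l><seg xml:id="L6">B</seg>reath it less gently forth</l>
 <l><seg xml:id="L7">E</seg>ndymion, recollect</l>
 <l><seg xml:id="L8">T</seg>o cure his love</l>
 <l><seg xml:id="L9">H</seg>is follie</l>
 <join targets="#L1 #L2 #L3 #L4 #L5 #L6 #L7 #L8 #L9" result="name">
  <desc>The beloved's name</desc>
 </join>
</div>"""

root = ET.fromstring(SAMPLE)
join = root.find("join")
name = joined_text(root, join)
```

Here name holds the nine initials in target order, which is the content of the virtual name element that the result attribute describes.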

24. Elements as alternatives to one another

The <alt> element is used to indicate that two elements are mutually exclusive. <altGrp> is provided to aggregate <alt>s.

Example: the first time we transcribed this text, we saw
but on another look it says
Can this be a genuine change since our first visit? Or just a mistake? Let's keep both:
<ab xml:id="W1">WILLILAM W. AND EMELYN STORY</ab>
<ab xml:id="W2">WILLIAM W. AND EMELYN STORY</ab>
<alt mode="excl" targets="#W1 #W2"/>
The weights and mode attributes assign weight to the judgement, and allow for relationships other than mutual exclusion.

25. Another way to express alternation

The global exclude attribute can be used by any element to indicate another element to which it is allergic:
<ab exclude="#W4" xml:id="W3">WILLILAM W. AND EMELYN STORY</ab>
<ab exclude="#W3" xml:id="W4">WILLIAM W. AND EMELYN STORY</ab>
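One way a processor could act on exclude is to ask the reader which alternative to trust and then drop every element that declares itself incompatible with that choice. The Python sketch below is only an illustration: the function name reading is hypothetical, and placing the two transcriptions after the ONLY SON OF line of the inscription is an assumption about context.

```python
import xml.etree.ElementTree as ET


def reading(root, chosen_id):
    """List the text of each <ab>, skipping those that exclude chosen_id."""
    kept = []
    for ab in root.iter("ab"):
        # exclude holds whitespace-separated pointers to incompatible elements.
        excluded = {p.lstrip("#") for p in ab.get("exclude", "").split()}
        if chosen_id in excluded:
            continue
        kept.append("".join(ab.itertext()))
    return kept


SAMPLE = """<div>
 <ab>ONLY SON OF</ab>
 <ab exclude="#W4" xml:id="W3">WILLILAM W. AND EMELYN STORY</ab>
 <ab exclude="#W3" xml:id="W4">WILLIAM W. AND EMELYN STORY</ab>
</div>"""

root = ET.fromstring(SAMPLE)
first_look = reading(root, "W3")
second_look = reading(root, "W4")
```

Each call keeps the unmarked line plus exactly one of the two competing transcriptions.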

26. Conclusions

The linking module provides a wide range of tools to let you describe relationships between parts of your text. If you need these, remember:
  • You should work out a naming scheme to assign ID attributes. You will need a lot of them.
  • There are often several ways to do things; use the more specialized markup when you can to make it easier for others to read. Don't rely on type attributes with undefined meanings everywhere.
  • Control your vocabulary for token attributes like type.
  • The TEI only takes you as far as markup. Implementing all this to make a fancy interactive text exploration web site may be a lot of work.

27. Characters in TEI: Unicode

  • Unicode is the only supported character encoding scheme. This means that entities for characters are deprecated, and the recommended daily use is for UTF-8 encoded text, as in
    <persName xml:lang="el-grc">Φλ. Θάλλος</persName>
  • There is a clean mechanism to use non-Unicode characters
  • all appropriate text content models are set to allow a mixture of CDATA and <g> (where <g> is a reference to a non-Unicode character)
  • all elements have an attribute xml:lang to record the language used
  • there are no places where an attribute is used to hold pure text

28. Non-Unicode characters

If you wish to encode characters, or specific glyphs, which do not appear in Unicode:
  1. define them in a <charDecl> element, inside <encodingDesc> in the TEI header
  2. refer to them using the <g> element in the body of your text
Inside <charDecl>:
  • the <char> element defines a character which is not available in the current document character set
  • the <glyph> element annotates an existing character (usually providing a glyph that shows how a character appeared in the original document)

29. Glyph and character properties

<charName> contains the name of a character, expressed following Unicode conventions.
<charProp> provides a name and value for some property of the parent character or glyph. This allows for all the details provided in Unicode.
<glyphName> contains the name of a glyph, expressed following Unicode conventions for character names.
<localName> contains a locally defined name for some property.
<mapping> contains one or more characters which are related to the parent character or glyph in some respect, as specified by the type attribute.
<unicodeName> contains the name of a registered Unicode normative or informative property.
<value> contains a single value for some property, attribute, or other analysis.
You can also use <graphic> to provide a picture of your character or glyph.

30. Defining a character

A new character can be assigned to a position in the Unicode Private Use Area (PUA), and also described in terms of Unicode combining characters:
 <char xml:id="ydotacute">
  <mapping type="composed">&#x0079;&#x0307;&#x0301;</mapping>
  <mapping type="PUA">U+E0A4</mapping>
 </char>

31. Defining a local glyph

A new glyph variant can also be assigned to a position in the Unicode Private Use Area (PUA), with a standardized form provided as a fallback:
 <glyph xml:id="z103">
  <glyphName>LATIN LETTER Z WITH TWO STROKES</glyphName>
  <mapping type="standardized">Z</mapping>
  <mapping type="PUA">U+E304</mapping>
 </glyph>
This can now be referred to using the <g> element, as in
<g ref="#z103"/>

32. More use of <g>

It is also possible to override what appears in the text by using markup like this
<g ref="#z103">z</g>
where the content of the <g> element can be used immediately without any lookup.

Note that, at the time of writing, few pieces of TEI-processing software are likely to have implemented support for the <charDecl> / <g> markup.
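As a rough idea of what such support involves, the Python sketch below builds a lookup table from the <glyph> and <char> declarations and uses it to replace each <g> when flattening text. It is a minimal sketch under stated assumptions: mapping_table and flatten are hypothetical names, PUA mappings are assumed to be written in the "U+XXXX" notation used in these slides, and the sample omits the TEI namespace and most of the header.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def mapping_table(root, mapping_type):
    """Map each declared <char>/<glyph> id to the string of one mapping type."""
    table = {}
    for decl in root.iter():
        if decl.tag not in ("char", "glyph"):
            continue
        for mapping in decl.findall("mapping"):
            if mapping.get("type") == mapping_type:
                value = (mapping.text or "").strip()
                # Assumed notation: "U+XXXX" names a code point; convert it.
                if value.startswith("U+"):
                    value = chr(int(value[2:], 16))
                table[decl.get(XML_ID)] = value
    return table


def flatten(element, table):
    """Return the text of element with every <g> replaced via table.

    A <g> whose ref is not in the table falls back to its own content.
    """
    out = [element.text or ""]
    for child in element:
        if child.tag == "g":
            key = child.get("ref", "").lstrip("#")
            out.append(table.get(key, "".join(child.itertext())))
        else:
            out.append(flatten(child, table))
        out.append(child.tail or "")
    return "".join(out)


SAMPLE = """<TEI>
 <charDecl>
  <glyph xml:id="z103">
   <glyphName>LATIN LETTER Z WITH TWO STROKES</glyphName>
   <mapping type="standardized">Z</mapping>
   <mapping type="PUA">U+E304</mapping>
  </glyph>
 </charDecl>
 <p>a<g ref="#z103"/>b<g ref="#undeclared">z</g>c</p>
</TEI>"""

root = ET.fromstring(SAMPLE)
paragraph = root.find("p")
standardized = flatten(paragraph, mapping_table(root, "standardized"))
private_use = flatten(paragraph, mapping_table(root, "PUA"))
```

Choosing the mapping type at flattening time lets the same transcription feed both a plain search index (standardized forms) and a display font that covers the Private Use Area.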

Sebastian Rahtz. Date: 2008-07-07
Copyright University of Oxford