IT Services, University of Oxford

1. Introduction to the TEI

This session provides an overview of the Recommendations of the Text Encoding Initiative

1.1. TEI Infrastructure

  • The TEI encoding scheme consists of a number of modules
  • These declare XML elements and their attributes
  • An element's declaration assigns it to one (or more) model classes
  • Another part declares its possible content and attributes with reference to these classes
  • This indirection allows strength and flexibility
  • It makes it easy to add new elements, or exclude existing ones, by referencing existing classes
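
As a rough sketch of this indirection, a customization might declare a new element along the following lines. The element name recipe and its namespace are hypothetical, chosen only for illustration; the content model is written in RELAX NG, as was usual for TEI P5 specifications at the time of this session.
<elementSpec ident="recipe" ns="http://example.org/ns/custom" mode="add">
 <desc>(hypothetical) contains a cooking recipe</desc>
 <classes>
<!-- joins the paragraph-like model class, so it may appear wherever a paragraph may -->
  <memberOf key="model.pLike"/>
<!-- joins the global attribute class, so it inherits xml:id, n, etc. -->
  <memberOf key="att.global"/>
 </classes>
 <content>
<!-- content is declared by reference to a class, not to a fixed list of elements -->
  <rng:zeroOrMore xmlns:rng="http://relaxng.org/ns/structure/1.0">
   <rng:choice>
    <rng:text/>
    <rng:ref name="model.phrase"/>
   </rng:choice>
  </rng:zeroOrMore>
 </content>
</elementSpec>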

1.2. What is a module?

  • A convenient way of grouping together a number of element declarations
  • These are usually on a related topic or specific application
  • Most chapters of the TEI Guidelines focus on elements drawn from a single module, which that chapter then defines
  • A TEI Schema is created by selecting modules and adding or removing elements from them as needed
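
A minimal sketch of such a selection, assuming the schema is specified in a TEI ODD file. The identifier myTEI and the choice of which element to remove are illustrative only.
<schemaSpec ident="myTEI" start="TEI">
<!-- select the modules to include -->
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>
 <moduleRef key="figures"/>
<!-- remove one element supplied by a selected module -->
 <elementSpec ident="formula" module="figures" mode="delete"/>
</schemaSpec>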

1.3. Modules

Module name     Chapter
analysis        Simple Analytic Mechanisms
certainty       Certainty and Responsibility
core            Elements Available in All TEI Documents
corpus          Language Corpora
dictionaries    Dictionaries
drama           Performance Texts
figures         Tables, Formulae, and Graphics
gaiji           Representation of Non-standard Characters and Glyphs
header          The TEI Header
iso-fs          Feature Structures
linking         Linking, Segmentation, and Alignment
msdescription   Manuscript Description
namesdates      Names, Dates, People, and Places
nets            Graphs, Networks, and Trees
spoken          Transcriptions of Speech
tagdocs         Documentation Elements
tei             The TEI Infrastructure
textcrit        Critical Apparatus
textstructure   Default Text Structure
transcr         Representation of Primary Sources
verse           Verse

2. Default Text Structure and Header

Two of the three major modules used in all TEI documents (the third is Core):
  • Text Structure
  • TEI Header

2.1. Structure of a TEI Document

There are two basic structures of a TEI Document:
  • <TEI> (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.
  • <teiCorpus> contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.

2.2. TEI basic structures (1)

<teiCorpus>
 <teiHeader>
<!-- required -->
 </teiHeader>
 <TEI>
<!-- required -->
 </TEI>
</teiCorpus>

2.3. TEI basic structures (2)

<TEI>
 <teiHeader>
<!-- required -->
 </teiHeader>
 <facsimile>
<!-- optional, new in TEI P5 -->
 </facsimile>
 <text>
<!-- required if no facsimile -->
 </text>
</TEI>

2.4. <text>

What is a text?
  • A text may be unitary or composite
    • unitary: forming an organic whole
    • composite: consisting of several components which are in some important sense independent of each other
  • a unitary text contains
    • optional front matter
    • <body> (required)
    • optional back matter

2.5. Composite texts

A composite text contains
  • optional front matter
  • <group> (required)
  • optional back matter

A corpus is a collection of text and header pairs. It has its own header.

<group> tags may self-nest.

2.6. TEI text structure (1)

<text>
 <front>
<!-- optional -->
 </front>
 <body>
<!-- required -->
 </body>
 <back>
<!-- optional -->
 </back>
</text>

2.7. TEI text structure (2)

<text>
 <front>
<!-- ... -->
 </front>
 <group>
  <text>
   <body>
    <p>...</p>
   </body>
  </text>
 </group>
 <back>
<!-- ... -->
 </back>
</text>

2.8. A text usually has divisions

<div>
  • generic, hierarchic subdivisions, each incomplete
  • the type attribute is used to label a particular level e.g. as 'part' or 'chapter'
  • the n attribute gives a particular division a name or number
  • the xml:id attribute gives a particular division a unique identifier
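
For example, the three attributes might be combined as follows (the values and headings are invented for illustration):
<body>
 <div type="part" n="1" xml:id="part1">
  <head>Part One</head>
  <div type="chapter" n="1" xml:id="part1.ch1">
   <head>Chapter 1</head>
   <p>...</p>
  </div>
  <div type="chapter" n="2" xml:id="part1.ch2">
   <head>Chapter 2</head>
   <p>...</p>
  </div>
 </div>
</body>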

2.9. Divisions may have heads and trailers

<div>
 <head>Chapter 1</head>
 <p>
<!-- content of the div -->
 </p>
 <trailer>...</trailer>
</div>

2.10. numbered and unnumbered divs

The level can be made explicit by using 'numbered' divs (div1, div2). Opinions vary:

<div1> vs. <div n="1">
  • numbered: the number indicates the depth of this particular division within the hierarchy, the largest such division being ‘div1’, any subdivision within it being ‘div2’, etc.
  • unnumbered: nest recursively to indicate their hierarchic depth. (And computers can count very well!)
The two styles must not be combined within a single <front>, <body>, or <back> element.

N.B. Divisions always tessellate
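
The same two-level structure in each style, as a sketch (headings invented for illustration):
<!-- numbered: the element name states the depth -->
<body>
 <div1 type="part">
  <head>Part One</head>
  <div2 type="chapter">
   <head>Chapter 1</head>
   <p>...</p>
  </div2>
 </div1>
</body>
<!-- unnumbered: the depth follows from the nesting -->
<body>
 <div type="part">
  <head>Part One</head>
  <div type="chapter">
   <head>Chapter 1</head>
   <p>...</p>
  </div>
 </div>
</body>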

2.11. Grouped and Floating Texts

The <group> element should be used to represent a collection of independent texts which is to be regarded as a single unit for processing or other purposes.

<floatingText> contains a single text of any kind, whether unitary or composite, which interrupts the text containing it at any point and after which the surrounding text resumes.

2.12. Grouped texts Example

<TEI>
 <teiHeader>
<!-- header information for the whole collection -->
 </teiHeader>
 <text>
<!-- optional front matter -->
  <group>
   <text>
<!-- optional front matter -->
    <body>
<!-- First Body -->
    </body>
   </text>
   <text>
<!-- optional front matter -->
    <body>
<!-- Second Body-->
    </body>
   </text>
  </group>
 </text>
</TEI>

2.13. Floating texts

As mentioned above, <div>s must tessellate over the entire text:
<div1>
 <div2>
<!-- content -->
 </div2>
 <div2>
<!-- content -->
 </div2>
</div1>
is valid, while
<div1>
<!-- content -->
 <div2>
<!-- content -->
 </div2>
<!-- content -->
</div1>
is not valid.

2.14. Floating texts (2)

In the second case, div2 is a 'floating' text and its content must be encoded using the <floatingText> element.

The <floatingText> element is a member of the model.divPart class, and can thus appear within any division level element in the same way as a paragraph.

2.15. Floating text Example

<p>She was thus ruminating, when a Gentleman enter'd the
Room, the Door being a jar... calling for a Candle, she
beg'd a thousand Pardons, engaged him to sit down, and let
her know, what had so long conceal'd him from her
Correspondence. </p>
<pb n="5"/>
<floatingText>
 <body>
  <head>The Story of <hi>Captain Manly</hi>
  </head>
  <p>
<!-- Captain Manly's story here -->
  </p>
 </body>
</floatingText>
<pb n="37"/>
<p>The Gentleman having finish'd his Story ...
<!-- more -->
</p>

2.16. Virtual divisions

Where the whole of a division can be automatically generated, for example because it is derived from another part of this or another document, an encoder may prefer not to represent it explicitly but instead simply mark its location by means of a processing instruction, or by using the special purpose <divGen> element:
<front>
<!-- <titlePage>...</titlePage> -->
 <divGen type="toc"/>
 <div>
  <head>Preface</head>
  <p>...</p>
 </div>
</front>
(intended primarily for use in document production or manipulation, rather than in transcription of pre-existing material)

2.17. The TEI Header

The TEI header was designed with two goals in mind
  • needs of bibliographers and librarians trying to document ‘electronic books’
  • needs of text analysts trying to document ‘coding practices’ within digital resources
The result is that discussion of the header tends to be pulled in two directions...

2.18. The Librarian’s Header

  • Conforms to standard bibliographic model, using similar terminology
  • Organized as a single source of information for bibliographic description of a digital resource, with established mappings to other such records (e.g. MARC)
  • Emerging code of best practice in its use, endorsed by major digital collections
  • Pressure for greater and more exact constraints to improve precision of description: preference for structured data over loose prose

2.19. Everyman’s Header

  • Gives a polite nod to common bibliographic practice, but has a far wider scope
  • Supports a (potentially) huge range of very miscellaneous information, organized in fairly ad hoc ways
  • Many different codes of practice in different user communities
  • Unpredictable combinations of narrowly encoded documentation systems and loose prose descriptions

2.20. TEI Header Structure

The TEI header has four main components:
  • <fileDesc> (file description) contains a full bibliographic description of an electronic file.
  • <encodingDesc> (encoding description) documents the relationship between an electronic text and the source or sources from which it was derived.
  • <revisionDesc> (revision description) summarizes the revision history for a file.
  • <profileDesc> (text-profile description) provides a detailed description of non-bibliographic aspects of a text, specifically the languages and sublanguages used, the situation in which it was produced, the participants and their setting (just about everything not covered in the other header elements).

Only <fileDesc> is required; the others are optional.
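
A skeleton showing all four components together might look like this. The contents are placeholders, not a recommended practice; note that in the TEI schema <revisionDesc>, when present, comes last.
<teiHeader>
 <fileDesc>
<!-- titleStmt, publicationStmt and sourceDesc go here -->
 </fileDesc>
 <encodingDesc>
  <p>How was the source transcribed and encoded?</p>
 </encodingDesc>
 <profileDesc>
  <langUsage>
   <language ident="en">English</language>
  </langUsage>
 </profileDesc>
 <revisionDesc>
<!-- placeholder date -->
  <change when="2008-01-01">What was changed?</change>
 </revisionDesc>
</teiHeader>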

2.21. Example Header: Minimal required header

<teiHeader>
 <fileDesc>
  <titleStmt>
   <title>A title?</title>
  </titleStmt>
  <publicationStmt>
   <p>Who published?</p>
  </publicationStmt>
  <sourceDesc>
   <p>Where from?</p>
  </sourceDesc>
 </fileDesc>
</teiHeader>

2.22. Example Header: TEI corpus

<teiCorpus>
 <teiHeader type="corpus">
<!-- corpus-level metadata here -->
 </teiHeader>
 <TEI>
  <teiHeader type="text">
<!-- metadata specific to this text here -->
  </teiHeader>
  <text>
<!-- ... -->
  </text>
 </TEI>
 <TEI>
  <teiHeader type="text">
<!-- metadata specific to this text here -->
  </teiHeader>
  <text>
<!-- ... -->
  </text>
 </TEI>
</teiCorpus>

3. Elements Available in All TEI Documents

The so-called 'Core' module groups together elements which may appear in any kind of text and the tags used to mark them in all TEI documents. This includes:
  • paragraphs
  • highlighting, emphasis and quotation
  • simple editorial changes
  • basic names, numbers, dates, addresses
  • simple links and cross-references
  • lists, notes, annotation, indexing
  • graphics
  • reference systems, bibliographic citations
  • simple verse and drama

3.1. Paragraphs

<p> (paragraph) marks paragraphs in prose
  • Fundamental unit for prose texts
  • <p> can contain all the phrase-level elements in the core
  • <p> can appear directly inside <body> or inside <div> (divisions)
<p>It was a cottage, the cottage of a
dream. And by a cottage I mean, not
four plain rooms and a kitchen, but one
surprising room opening into another;
rooms all on different levels and of
different shapes, with delightful places
to bump your head on; open fireplaces;
a large square hall, oak-beamed, where
your guests can hang about after breakfast,
while deciding whether to play
golf or sit in the garden. Yet all so
cunningly disposed that from outside
it looks only a cottage or, at most, two
cottages persuaded into one.</p>

3.2. Highlighting

By highlighting we mean the use of any combination of typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings. It is typically used for words and phrases which are:
  • distinct in some way (e.g. foreign, archaic, technical)
  • emphatic or stressed when spoken
  • not really part of the text (e.g. cross references, titles, headings)
  • a distinct narrative stream (e.g. an internal monologue, commentary)
  • attributed to some other agency inside or outside the text (e.g. direct speech, quotation)
  • set apart in another way (e.g. proverbial phrases, words mentioned but not used)

3.3. Highlighting Examples

  • <hi> (general purpose highlighting)
    <p>[The rest of this communication is
    omitted owing to considerations of
    space.—<hi rend="sc">Ed</hi>.]</p>
  • <distinct> (linguistically distinct)
    But then I remind myself
    that the Russian ballet is nothing if not
    <distinct>bizarre</distinct>.
  • Other similar elements include: <emph>, <mentioned>, <soCalled>, <term> and <gloss>
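
A short sketch of these other elements in use (the sentences are invented for illustration):
<p>You did <emph>what</emph> with the car?</p>
<p>The word <mentioned>encode</mentioned> has two syllables.</p>
<p>He described the demolition as <soCalled>progress</soCalled>.</p>
<p>An <term>element</term> is <gloss>a named unit of markup</gloss>.</p>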

3.4. Quotation

Quotation marks can be used to set off text for many reasons, so the TEI has the following elements:
  • <q> (separated from the surrounding text with quotation marks)
  • <said> (speech or thought)
  • <quote> (passage attributed to an external source)
  • <cit> (groups a quotation and citation)
<p>
 <said who="#Celia">I know a lovely tin of potted
   grouse,</said> said Celia, and she went off
to cut some sandwiches. By twelve
o'clock we were getting out of the
train.
</p>
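
By way of contrast, a quotation from an external source can be grouped with its citation using <cit>. A minimal sketch, in which the form of the reference is only one possibility:
<cit>
 <quote>To be, or not to be, that is the question</quote>
 <bibl>
  <author>Shakespeare</author>,
  <title>Hamlet</title>, Act 3, Scene 1
 </bibl>
</cit>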

3.5. Simple Editorial Changes: <choice> and Friends

  • <choice> (groups alternative editorial encodings)
  • Errors:
    • <sic> (apparent error)
    • <corr> (corrected error)
  • Regularization:
    • <orig> (original form)
    • <reg> (regularized form)
  • Abbreviation:
    • <abbr> (abbreviated form)
    • <expan> (expanded form)

3.6. Choice Example

I profess not to know how women's
<choice>
 <orig>heartes</orig>
 <reg>hearts</reg>
</choice> are wooed and won. To me they have
always been <choice>
 <sic>maters</sic>
 <corr>matters</corr>
</choice> of riddle and <choice>
 <abbr>admirat'n</abbr>
 <expan>admiration</expan>
</choice>.

3.7. Additions, Deletions, and Omissions

  • <add> (addition to the text, e.g. marginal gloss)
  • <del> (phrase marked as deleted in the text)
  • <gap> (indicates point where material is omitted)
  • <unclear> (contains text unable to be transcribed clearly)

3.8. Example of <add>, <del>, <gap>, and <unclear>

<add place="left">The Cause</add> The immediate
cause, however, of the prevalence of supernatural

<del>tales</del>
<add place="supra">stories</add>
in these parts, was doubtless owing to the
<unclear reason="blood splatter">vicinity</unclear>
of Sleepy Hollow.
<gap reason="illegible">
 <desc>The rest of this paragraph is covered
   in dried blood.</desc>
</gap>

3.9. Basic Names

  • <name> (a name in the text, contains a proper noun or noun phrase)
  • <rs> (a general-purpose name or referencing string)

The type attribute is useful for categorizing these, and they both also have key, ref, and nymRef attributes.

3.10. Basic Names Example

<p>The scene opens at a party given by <name
   nymRef="http://www.meanings-of-name.com/potiphar.html">
Potiphar</name>
in <name ref="http://en.wikipedia.org/wiki/Venice" type="place">Venice</name>. </p>
<p>It is when the natural end of the story is reached, and <name xml:id="SIMON">Simon</name> has come into his own and has just been
wedded to his proper affinity, that the structure seems to me to fall
with a crash. I might perhaps, though not without reluctance, have
pardoned an impertinent railway accident which leaves <rs corresp="#SIMON">the young man</rs> apparently crippled for life.</p>

3.11. Addresses

  • <email> (an electronic mail address)
  • <address> (a postal address)
  • <addrLine> (a non-specific address line)
  • <street> (a full street address)
  • <postCode> (a postal (or zip) code)
  • <postBox> (a postal box number)
  • <name> can also be used
  • and the 'namesdates' module extends this with more geographic names

3.12. Basic Address Example

<email>gbs@heaven.com</email>
<address>
 <name>George Bernard Shaw</name>
 <addrLine>Shaw's Corner</addrLine>
 <settlement>Ayot St Lawrence</settlement>
 <district>Hertfordshire</district>
 <postCode>HE 1 XXX</postCode>
 <country>England.</country>
</address>

3.13. Basic Numbers and Measures

  • <num> (marks a number of any sort)
  • <measure> (marks a quantity or commodity)
  • <measureGrp> (groups specifications relating to a single object)
  • While <num> has simple type and value attributes, <measure> has type, quantity, unit and commodity attributes

3.14. Number and Measure examples

<l>They went off at a pace I am bound to deplore,</l>
<l>For they did <num value="20">twenty</num> yards in a minute or more</l>
<l>And a yard or <num value="2">two</num> over, a capital score</l>
<l>For Farnaby Fullerton Rigby.</l>
<p>If neither of these values is available, a value of <num>20,35</num>
for ash content can be assumed initially and checked, after the
sampling has been carried out, using one of the methods described in
ISO 13909-7.</p>
It is on these days that we travel to our Castle of Stopes; as the
crow flies, <measure quantity="24140" unit="m">fifteen miles</measure>
away. Indeed, that is the way we get to it, for it is a castle in the
air.
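
The commodity attribute and <measureGrp> are not shown above. A minimal sketch, with invented values:
<measure quantity="2" unit="kg" commodity="flour">two kilos of flour</measure>
<measureGrp>
 <measure type="height" quantity="30" unit="cm">30 cm</measure> by
 <measure type="width" quantity="20" unit="cm">20 cm</measure>
</measureGrp>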

3.15. Dates

  • <date> (contains a date in any format and includes a when attribute for a regularised form and a calendar attribute to specify what calendar system)
  • <time> (contains a time in any format and includes a when attribute for a regularised form)
<p>At <time when="09:30:00">9.30 o'clock</time>,
as the fog lifted somewhat, the rescuing steamer
Lyonnesse had sighted the Gothland, fast on the rocks, with a bad
list to starboard, and apparently partly filled with water.</p>
<p>House of Commons, <date when="1914-06-22">Monday, June 22,
   1914</date>.</p>
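
The when attribute takes a value in a standard W3C format (yyyy-mm-dd for a full date, hh:mm:ss for a time), and a date may be given at reduced precision. A few illustrative values:
<date when="1914">1914</date>
<date when="1914-06">June 1914</date>
<date when="1914-06-22">22 June 1914</date>
<time when="09:30:00">half past nine</time>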

3.16. Simple Linking

  • <ptr> (defines a pointer to another location)
  • <ref> (defines a reference to another location, with optional linking text)
  • Both elements have:
    • target attribute taking a URI reference
    • cRef attribute for canonical referencing schemes
  • <ptr> and <ref> can appear in the same places; <ptr> suits cases where the linking text can be generated automatically.

3.17. Simple Linking Example


See <ref target="#Section12">section 12 on page 34</ref>.

See <ptr target="#Section12"/>.
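
Both pointers above assume that some element elsewhere in the same document carries the identifier Section12, for instance as sketched below. The target attribute can equally hold a full URI pointing outside the document.
<div xml:id="Section12">
 <head>Section 12</head>
 <p>...</p>
</div>

See <ref target="http://www.tei-c.org/">the TEI website</ref>.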

3.18. Lists

  • <list> (a sequence of items forming a list)
  • <item> (one component of a list)
  • <label> (label associated with an item)
  • <headLabel> (heading for column of labels)
  • <headItem> (heading for column of items)

3.19. Simple List Example

<div>
 <head>Lists</head>
 <p>
  <list>
   <item>
    <gi>list</gi> (a sequence of
       items forming a list)</item>
   <item>
    <gi>item</gi> (one component of
       a list)</item>
   <item>
    <gi>label</gi> (label
       associated with an item)</item>
   <item>
    <gi>headLabel</gi> (heading for
       column of labels)</item>
   <item>
    <gi>headItem</gi> (heading for
       column of items)</item>
  </list>
 </p>
</div>
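
The example above uses only <list> and <item>. A sketch of the same material as a glossary-style list, using <label>, <headLabel> and <headItem> (the type value and the column headings are illustrative):
<list type="gloss">
 <headLabel>Element</headLabel>
 <headItem>Meaning</headItem>
 <label>list</label>
 <item>a sequence of items forming a list</item>
 <label>item</label>
 <item>one component of a list</item>
 <label>label</label>
 <item>label associated with an item</item>
</list>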

3.20. Notes

  • <note> (contains a note or annotation)
  • Notes can be those existing in the text, or provided by the editor of the electronic text
  • A place attribute can be used to indicate the physical location of the note
  • Although a note should usually be encoded where its identifier or mark first appears, notes can also be kept separately and point back to their location with a target attribute

3.21. Note Example

<p>It is not only misfortune that makes strange bedfellows. <note place="foot">By-the-by, it is denied that Sir <name>Joseph
     Beecham</name> was in any way responsible for the Government's
 <title>Pills for Earthquakes</title>, by which it was hoped to avert
   the Irish crisis.</note>
</p>
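
A sketch of the alternative, with the note kept apart from the text and pointing back to it. The identifier bedfellows is hypothetical, and the note's content is elided.
<p xml:id="bedfellows">It is not only misfortune that makes strange bedfellows.</p>
<!-- elsewhere, for instance in a division of notes in the back matter -->
<note place="end" target="#bedfellows">...</note>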

3.22. Indexing

  • If converting an existing index, use nested lists. For auto-generated indexes:
  • <index> (marks an index entry) with optional indexName attribute
  • The <term> element is used to mark a term inside an <index> element
  • The <index> element can self-nest for hierarchical index entries

3.23. Indexing Example

<p>… activated sludge treatment<index>
  <term>activated sludge</term>
  <index>
   <term>treatment</term>
  </index>
 </index> process for the
biological treatment of wastewater in which a mixture of wastewater
and <hi>activated sludge</hi> is agitated and aerated. The
<hi>activated sludge</hi> is subsequently separated from the
<hi>treated wastewater</hi> by
<term>sedimentation</term>
 <index>
  <term>sedimentation</term>
 </index>,
and is removed or returned to the process as required.</p>

3.24. Graphics

  • <graphic> (indicates the location of an inline graphic, illustration, or figure)
  • <binaryObject> (encoded binary data embedding a graphic or other object)
  • The 'figures' module provides <figure> and <figDesc> for more complex graphics
<figure>
 <graphic url="images/014.png"/>
 <head>Garden City Washing-day.</head>
 <p>Our sensitive artist insists on a harmonious colour-scheme.</p>
 <figDesc>A bearded man sits in a deckchair and wags his
   finger at a woman hanging up washing</figDesc>
</figure>

3.25. Bibliographic Citations

  • <bibl> (loosely structured bibliographic citation)
  • <biblStruct> (structured bibliographic citation)
  • <listBibl> (a list of bibliographic citations such as a bibliography)
  • The 'header' module also includes <biblFull> (fully-structured bibliographic citation based on the TEI fileDesc element)

3.26. Simple <bibl> Example


Keble is, of course, named after the hymn-writer and divine; and
Balliol, where C. S. C. played the wag so divertingly, after Balliol.
<hi rend="it">À propos</hi> of Oxford, it is a question whether that
extremely amusing book, <bibl>
 <title>Verdant Green</title>
</bibl>, is
still much read by freshers.

3.27. Simple <biblStruct> Example

<biblStruct>
 <monogr>
  <title>Magnalia Christi Americana: or, The
     ecclesiastical history of New-England, ...</title>
  <author>Mather, Cotton (1663-1728)</author>
  <imprint>
   <publisher>Printed for Thomas Parkhurst, at the
       Bible and Three Crowns in Cheapside.</publisher>
   <pubPlace>London</pubPlace>
   <date when="1702">MDCCII</date>
  </imprint>
 </monogr>
</biblStruct>
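
The two works cited in the examples above could be gathered into a <listBibl>. A minimal sketch using loosely structured <bibl> entries (the heading is invented):
<listBibl>
 <head>Works mentioned</head>
 <bibl>
  <title>Verdant Green</title>
 </bibl>
 <bibl>
  <author>Mather, Cotton</author>.
  <title>Magnalia Christi Americana</title>.
  <pubPlace>London</pubPlace>,
  <date when="1702">1702</date>.
 </bibl>
</listBibl>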

3.28. Verse

  • <l> (a line of verse)
  • <lg> (a line group such as stanza or paragraph)
<lg>
 <l>There were eight pretty walkers who went up a hill;</l>
 <l>They were Jessamine, Joseph and Japhet and Jill,</l>
 <l>And Allie and Sally and Tumbledown Bill,</l>
 <l rend="i10">And Farnaby Fullerton Rigby.</l>
</lg>

3.29. Drama

  • <sp> (an individual speech in a performance text)
  • <speaker> (the name of the speaker(s) as given in the performance text)
  • <stage> (a stage direction of any sort within a dramatic text)

3.30. Dramatic example from Punch

<sp>
 <speaker>Greece.</speaker>
 <said> ISN'T IT TIME WE STARTED FIGHTING AGAIN?</said>
</sp>
<sp>
 <speaker>Turkey.</speaker>
 <said> YES, I DARESAY. HOW SOON COULD YOU BEGIN?</said>
</sp>
<sp>
 <speaker>Greece.</speaker>
 <said> OH, IN A FEW WEEKS.</said>
</sp>
<sp>
 <speaker>Turkey.</speaker>
 <said> NO GOOD FOR ME. SHAN'T BE READY TILL
   THE AUTUMN.</said>
</sp>
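
The example above has no stage directions. A sketch of <stage> alongside <sp>, with invented text and with type values taken as illustrative:
<stage type="setting">A conference room. Evening.</stage>
<sp>
 <speaker>First Delegate.</speaker>
 <p>Shall we begin?</p>
</sp>
<stage type="exit">Exit Second Delegate.</stage>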


Date: 2008-07-07
Copyright University of Oxford