
1. Introduction to the Course

Aims of Course

  1. Examine the concept of markup and XML encoding
  2. Provide hands-on experience in using TEI XML markup
  3. Introduce some of the TEI scheme and its assumptions
  4. Provide routes into more detailed information for exploration at your leisure
  5. Provide opportunities for questions and discussions relating to your own encoding needs and priorities

1.1. Course Structure: Session 1

Session 1: Basic transcription in XML

  • Introducing the rules of XML
  • Survey of the most common TEI elements
  • Introducing the oXygen editor
  • Exercise: Basic transcription

1.2. Course Structure: Session 2

Session 2: Metadata

  • The TEI Header
  • Annotating names of people and places
  • Optional exercise: Adding metadata

1.3. Course Structure: Session 3

Session 3: Detailed transcription with TEI

  • Transcription and editorial phenomena
  • Exercise: Adding detail to the transcription
  • Conclusion: TEI Community and Questions

1.4. Course Materials

  • All course materials including:
    • All slides from lectures
    • All exercises
    • All materials for the exercises
    are available on the TEI @ Oxford website.
  • The URL is http://tei.oucs.ox.ac.uk/Oxford/2011-10-helsinki/ and this is where you will need to download your materials for the exercises!
  • All these materials are licensed with a Creative Commons Attribution license, which means they are freely available for re-use (though do let us know!)

1.5. After the workshop...

  • After the workshop, if you have questions, it is better to mail the TEI-L mailing list than to write to us privately, because:
    • we'll still try to answer as well as we would privately
    • you get answers not only from us, but TEI experts around the world
    • questions from those of all levels of ability stop the list becoming too technical
    • everyone benefits from having the answers be public -- and you benefit by reading (and sometimes answering!) others' problems

2. An Introduction to XML

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML also now plays an indispensable role in the exchange of a wide variety of data on the Web and elsewhere.

2.1. XML: what it is and why you should care

  • XML is structured data represented as strings of text
  • XML looks like HTML, except that:-
    • XML is extensible
    • XML must be well-formed
    • XML can be validated
  • XML is application-, platform-, and vendor-independent
  • XML empowers the content provider and facilitates data integration

2.2. XML is an international standard

  • XML requires use of ISO 10646 (also known as Unicode)
    • a large character repertoire including most human writing systems (originally a 31-bit code space, now limited to code points up to U+10FFFF)
    • encoded as UTF-8 or UTF-16
  • other encodings may be specified at the document level
  • language may be specified on any element using the W3C-defined xml:lang attribute

The xml:id attribute is another W3C-defined attribute.

2.3. XML terminology

An XML document may contain:-
  • elements, possibly bearing attributes
  • processing instructions
  • comments
  • entity references
  • marked sections (CDATA in the document itself; IGNORE and INCLUDE only within a DTD)

An XML document must be well-formed and may be valid

2.4. XML terminology Example

<?xml version="1.0" ?>
<root>
 <element attribute="value"> content </element>
 <!-- comment -->
</root>

2.5. The rules of the XML Game

  • An XML document represents a (kind of) tree
  • It has a single root and many nodes
  • Each node can be
    • a subtree
    • a single element (possibly bearing some attributes)
    • a string of character data
  • Each element has a name or generic identifier

2.6. Representing an XML tree

  • An XML document is encoded as a linear string of characters
  • It begins with a special processing instruction
  • Element occurrences are marked by start- and end-tags
  • The characters < and & are Magic and must always be "escaped" if you want to use them as themselves
  • Comments are delimited by <!-- and -->
  • CDATA sections are delimited by <![CDATA[ and ]]>
  • Attribute name/value pairs are supplied on the start-tag and may be given in any order
  • Entity references are delimited by & and ;

2.7. Parts of an XML document

<?xml version="1.0"?>
<greetings xmlns="http://www.example.org/greetings">
 <hello>hello world!</hello>
</greetings>
  • The XML declaration
  • Namespace declarations
  • The root element of the document itself
  • Other elements and content
  • Attribute and value

2.8. The XML declaration

An XML document should begin with an XML declaration (optional in XML 1.0, but recommended), which does three things:
  • specifies that this is an XML document
  • specifies which version of the XML standard it follows
  • specifies which character encoding the document uses
<?xml version="1.0" ?>
<?xml version="1.0" encoding="iso-8859-1" ?>
The default, and recommended, encoding is ‘UTF-8’ (Unicode)
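
As a rough illustration of what the encoding declaration does, the sketch below serializes the same invented text in two encodings and lets Python's xml.etree.ElementTree parser decode each one from its own declaration. The byte strings differ, but the parsed text is identical.

```python
import xml.etree.ElementTree as ET

# The same text in two encodings; each declaration names the encoding actually used.
latin1_doc = '<?xml version="1.0" encoding="iso-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'.encode("utf-8")

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(latin1_text == utf8_text)   # same characters after parsing
print(latin1_doc == utf8_doc)     # different bytes on disk
```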

2.9. Namespace declarations

All TEI documents are declared within the TEI namespace: <TEI xmlns="http://www.tei-c.org/ns/1.0"> ... </TEI>

XML documents can include elements declared in different namespaces.

  • a namespace declaration associates a namespace prefix with an external URI-like identifier
  • the default namespace may be declared using an xmlns attribute without a prefix
  • other namespaces must all use a specially declared prefix

The xml namespace is used by the TEI for global attributes xml:id and xml:lang
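
The following sketch shows how a namespace-aware parser sees these declarations, using Python's xml.etree.ElementTree. The document is a deliberately tiny, well-formed fragment (not a valid TEI document, since it has no header), and the prefix tei used for searching is our own arbitrary choice.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # always bound to the xml: prefix

doc = (
    f'<TEI xmlns="{TEI_NS}">'
    '<text xml:lang="en" xml:id="t1"><body><p>Hello</p></body></text>'
    '</TEI>'
)
root = ET.fromstring(doc)

# ElementTree spells a namespaced name as {namespace-URI}local-name.
print(root.tag)

# A prefix map lets searches use short prefixes of our own choosing.
text = root.find("tei:text", {"tei": TEI_NS})
lang = text.get(f"{{{XML_NS}}}lang")
ident = text.get(f"{{{XML_NS}}}id")
print(lang, ident)

# Searching without the namespace finds nothing.
print(root.find("text"))
```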

2.10. The Tempest

<?xml version="1.0" encoding="utf-8" ?>
<div n="1">
 <head>SCENE I. On a ship at sea: a tempestuous noise of
   thunder and lightning heard.</head>
 <stage>Enter a Master and a Boatswain</stage>
 <ab>Here, master: what cheer?</ab>
 <ab>Good, speak to the mariners: fall to't, yarely,</ab>
 <ab>or we run ourselves aground: bestir, bestir.</ab>
</div>

2.11. An XML Tree For The Tempest

2.12. XML syntax: the small print

What does it mean to be well-formed?

  1. there is a single root node containing the whole of an XML document
  2. each subtree is properly nested within the root node
  3. names are always case sensitive
  4. start-tags and end-tags are always mandatory (except that a combined start-and-end tag may be used for empty nodes)
  5. attribute values are always quoted

Note: You can be valid in addition to being well-formed. This means you obey the rules of a specified schema, such as the TEI.

2.13. Test your XML knowledge

  • Which are correct?
    • <seg>some text</seg>
    • <seg><foo>some</foo> <bar>text</bar></seg>
    • <seg><foo>some <bar></foo> text</bar></seg>
    • <seg type="text">some text</seg>
    • <seg type='text'>some text</seg>
    • <seg type=text>some text</seg>
    • <seg type="text">some text<seg/>
    • <seg type="text">some text<gap/></seg>
    • <seg type="text">some text</Seg>
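
Once you have made your own guesses, one way to check them is to hand each fragment to a parser. This is a minimal sketch using Python's xml.etree.ElementTree; the helper name is_well_formed is our own, and it tests well-formedness only, not validity against a schema.

```python
import xml.etree.ElementTree as ET

CANDIDATES = [
    '<seg>some text</seg>',
    '<seg><foo>some</foo> <bar>text</bar></seg>',
    '<seg><foo>some <bar></foo> text</bar></seg>',
    '<seg type="text">some text</seg>',
    "<seg type='text'>some text</seg>",
    '<seg type=text>some text</seg>',
    '<seg type="text">some text<seg/>',
    '<seg type="text">some text<gap/></seg>',
    '<seg type="text">some text</Seg>',
]

def is_well_formed(fragment):
    """True if the fragment parses as a complete XML document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False

results = [is_well_formed(c) for c in CANDIDATES]
for candidate, ok in zip(CANDIDATES, results):
    print("ok " if ok else "BAD", candidate)
```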

3. Default Text Structure

All TEI documents are structured in a particular manner. This section attempts to describe the different variations on this as briefly as possible.

3.1. Structure of a TEI Document

There are two basic structures of a TEI Document:
  • <TEI> (TEI document) contains a single TEI-conformant document, comprising a TEI header and a text, either in isolation or as part of a teiCorpus element.
  • <teiCorpus> contains the whole of a TEI encoded corpus, comprising a single corpus header and one or more TEI elements, each containing a single text header and a text.

3.2. TEI basic structures (1)

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
<!-- required -->
 </teiHeader>
 <text>
<!-- required -->
 </text>
</TEI>

3.3. TEI basic structures (2)

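<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
<!-- required -->
 </teiHeader>
 <facsimile>
<!-- optional, new in TEI P5 -->
 </facsimile>
 <text>
<!-- required if no facsimile -->
 </text>
</TEI>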

3.4. <text>

What is a text?
  • A text may be unitary or composite
    • unitary: forming an organic whole
    • composite: consisting of several components which are in some important sense independent of each other
  • a unitary text contains
    • optional front matter
    • <body> (required)
    • optional back matter

3.5. Composite texts

A composite text contains
  • optional front matter
  • <group> (required)
  • optional back matter

A corpus is a collection of text and header pairs. It has its own header.

<group> tags may self-nest.

3.6. TEI text structure (1)

<text>
 <front>
<!-- optional -->
 </front>
 <body>
<!-- required -->
 </body>
 <back>
<!-- optional -->
 </back>
</text>

3.7. TEI text structure (2)

<text>
 <front> </front>
 <group>
  <text>
<!-- ... -->
  </text>
  <text>
<!-- ... -->
  </text>
 </group>
 <back> </back>
</text>

3.8. Another Grouped Text Example

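<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader>
<!-- header information for the whole collection -->
 </teiHeader>
 <text>
  <front>
<!-- optional front matter -->
  </front>
  <group>
   <text>
    <front>
<!-- optional front matter -->
    </front>
    <body>
<!-- First Body -->
    </body>
   </text>
   <text>
    <front>
<!-- optional front matter -->
    </front>
    <body>
<!-- Second Body-->
    </body>
   </text>
  </group>
 </text>
</TEI>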

4. Common TEI Elements

The so-called 'Core' module groups together elements which may appear in any kind of text and the tags used to mark them in all TEI documents. This includes:
  • paragraphs
  • highlighting, emphasis and quotation
  • simple editorial changes
  • basic names, numbers, dates, addresses
  • simple links and cross-references
  • lists, notes, annotation, indexing
  • graphics
  • reference systems, bibliographic citations
  • simple verse and drama

4.1. Paragraphs

<p> (paragraph) marks paragraphs in prose
  • Fundamental unit for prose texts
  • <p> can contain all the phrase-level elements in the core
  • <p> can appear directly inside <body> or inside <div> (divisions)
<p>It was a cottage, the cottage of a dream. And by a cottage
I mean, not four plain rooms and a kitchen, but one
surprising room opening into another; rooms all on
different levels and of different shapes, with delightful
places to bump your head on; open fireplaces; a large
square hall, oak-beamed, where your guests can hang about
after breakfast, while deciding whether to play golf or sit
in the garden. Yet all so cunningly disposed that from
outside it looks only a cottage or, at most, two cottages
persuaded into one.</p>

4.2. Highlighting

By highlighting we mean the use of any combination of typographic features (font, size, hue, etc.) in a printed or written text in order to distinguish some passage of a text from its surroundings. It is typically applied to words and phrases which are:
  • distinct in some way (e.g. foreign, archaic, technical)
  • emphatic or stressed when spoken
  • not really part of the text (e.g. cross references, titles, headings)
  • a distinct narrative stream (e.g. an internal monologue, commentary)
  • attributed to some other agency inside or outside the text (e.g. direct speech, quotation)
  • set apart in another way (e.g. proverbial phrases, words mentioned but not used)

4.3. Highlighting Examples

  • <hi> (general purpose highlighting)
    <p>[The rest of this communication is omitted owing to
    considerations of space.—<hi rend="sc">Ed</hi>.]</p>
  • <distinct> (linguistically distinct)
    But then I remind
    myself that the Russian ballet is nothing if not
  • Other similar elements include: <emph>, <mentioned>, <soCalled>, <term> and <gloss>

4.4. Quotation

Quotation marks can be used to set off text for many reasons, so the TEI has the following elements:
  • <q> (separated from the surrounding text with quotation marks)
  • <said> (speech or thought)
  • <quote> (passage attributed to an external source)
  • <cit> (groups a quotation and citation)
 <said who="#Celia">I know a lovely tin of potted
   grouse,</said> said Celia, and she went off to cut some
sandwiches. By twelve o'clock we were getting out of the

4.5. Simple Editorial Changes: <choice> and Friends

  • <choice> (groups alternative editorial encodings)
  • Errors:
    • <sic> (apparent error)
    • <corr> (corrected error)
  • Regularization:
    • <orig> (original form)
    • <reg> (regularized form)
  • Abbreviation:
    • <abbr> (abbreviated form)
    • <expan> (expanded form)
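
One practical benefit of <choice> is that a single encoding can yield more than one reading. The sketch below, using Python's xml.etree.ElementTree, keeps either the <sic> or the <corr> child of each <choice>. The sentence, the misspelling, and the helper name reading are all invented for illustration; namespaces are left out to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence: the error and its correction are invented.
SOURCE = '<p>I saw the <choice><sic>docter</sic><corr>doctor</corr></choice> today.</p>'

def reading(xml_text, keep):
    """Return plain text, keeping only the `keep` child of every <choice>."""
    root = ET.fromstring(xml_text)
    for choice in list(root.iter("choice")):
        for child in list(choice):
            if child.tag != keep:
                choice.remove(child)   # drop the alternative we do not want
    return "".join(root.itertext())

diplomatic = reading(SOURCE, "sic")    # text as it stands in the source
corrected = reading(SOURCE, "corr")    # text with the editor's correction
print(diplomatic)
print(corrected)
```

The same approach would work for the <orig>/<reg> and <abbr>/<expan> pairs by passing a different element name.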

4.6. Choice Example

I profess not
to know how women's <choice>
</choice> are wooed and won. To me they have always been <choice>
</choice> of riddle and <choice>

4.7. Additions, Deletions, and Omissions

  • <add> (addition to the text, e.g. marginal gloss)
  • <del> (phrase marked as deleted in the text)
  • <gap> (indicates point where material is omitted)
  • <unclear> (contains text unable to be transcribed clearly)

4.8. Basic Names

  • <name> (a name in the text, contains a proper noun or noun phrase)
  • <rs> (a general-purpose name or referencing string )

The type attribute is useful for categorizing these, and they both also have key, ref, and nymRef attributes.

4.9. Basic Names Example

<p>The scene opens at a party given by <name>Potiphar</name> in <name ref="http://en.wikipedia.org/wiki/Venice" type="place">Venice</name>. </p>
<p>It is when the natural end of the story is reached, and
<name xml:id="SIMON">Simon</name> has come into his own
and has just been wedded to his proper affinity, that the
structure seems to me to fall with a crash. I might
perhaps, though not without reluctance, have pardoned an
impertinent railway accident which leaves <rs corresp="#SIMON">the young man</rs> apparently crippled
for life.</p>

4.10. Addresses

  • <email> (an electronic mail address)
  • <address> (a postal address)
  • <addrLine> (a non-specific address line)
  • <street> (a full street address)
  • <postCode> (a postal (or zip) code)
  • <postBox> (a postal box number)
  • <name> can also be used
  • and the 'namesdates' module extends this with more geographic names

4.11. Basic Address Example

<address>
 <name>George Bernard Shaw</name>
 <addrLine>Shaw's Corner</addrLine>
 <settlement>Ayot St Lawrence</settlement>
 <postCode>HE 1 XXX</postCode>
</address>

4.12. Basic Numbers and Measures

  • <num> (marks a number of any sort)
  • <measure> (marks a quantity or commodity)
  • <measureGrp> (groups specifications relating to a single object)
  • While <num> has simple type and value attributes, <measure> has type, quantity, unit and commodity attributes

4.13. Number and Measure examples

<l>They went off at a pace I am bound to deplore,</l>
<l>For they did <num value="20">twenty</num> yards in a
minute or more</l>
<l>And a yard or <num value="2">two</num> over, a capital
<l>For Farnaby Fullerton Rigby.</l>
<p>If neither of these values is available, a value of
<num>20,35</num> for ash content can be assumed initially
and checked, after the sampling has been carried out, using
one of the methods described in ISO 13909-7.</p>
It is on these
days that we travel to our Castle of Stopes; as the crow
flies, <measure quantity="24140" unit="m">fifteen
miles</measure> away. Indeed, that is the way we get to it,
for it is a castle in the air.
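
The point of the quantity and unit attributes is that software can work with the normalized value while the text keeps its original wording. A minimal sketch in Python, assuming the international mile of 1609.344 metres (the constant name is our own):

```python
import xml.etree.ElementTree as ET

METRES_PER_MILE = 1609.344   # international mile (our assumption for the conversion)

measure = ET.fromstring('<measure quantity="24140" unit="m">fifteen miles</measure>')
metres = float(measure.get("quantity"))
miles = metres / METRES_PER_MILE

# The normalized value agrees with the wording of the text, to the nearest mile.
print(measure.text, "is about", round(miles), "miles;", measure.get("unit"), "=", metres)
```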

4.14. Dates

  • <date> (contains a date in any format and includes a when attribute for a regularised form and a calendar attribute to specify what calendar system)
  • <time> (contains a time in any format and includes a when attribute for a regularised form)
<p>At <time when="09:30:00">9.30 o'clock</time>, as the fog
lifted somewhat, the rescuing steamer Lyonnesse had sighted
the Gothland, fast on the rocks, with a bad list to
starboard, and apparently partly filled with pater.</p>
<p>House of Commons, <date when="1914-06-22"> Monday, June 22,
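
Because the when attribute holds a regularised ISO-style value, it can be read directly by standard date libraries. The sketch below uses Python's datetime module on two elements modelled on the examples above; the closing of the <date> element is our own assumption, since only its opening appears here.

```python
from datetime import date, time
import xml.etree.ElementTree as ET

# Modelled on the examples above; the end of the <date> content is assumed.
date_el = ET.fromstring('<date when="1914-06-22">Monday, June 22</date>')
time_el = ET.fromstring('<time when="09:30:00">9.30 o\'clock</time>')

when_date = date.fromisoformat(date_el.get("when"))
when_time = time.fromisoformat(time_el.get("when"))

# weekday() counts from Monday = 0, so this checks the text against the attribute.
print(when_date.weekday() == 0, when_time.hour, when_time.minute)
```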

4.15. Simple Linking

  • <ptr> (defines a pointer to another location)
  • <ref> (defines a reference to another location, with optional linking text)
  • Both elements have:
    • target attribute taking a URI reference
    • cRef attribute for canonical referencing schemes
  • Where the linking text can be generated automatically, <ptr> can be used in the same places as <ref>.

4.16. Simple Linking Example

See <ref target="#Section12">section 12 on page 34</ref>.
See <ptr target="#Section12"/>.
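
A target such as #Section12 points at the element whose xml:id is Section12, which is how link text for a <ptr> can be generated. The following Python sketch resolves such a pointer with xml.etree.ElementTree; the surrounding document, the heading text, and the helper name resolve are invented for illustration.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical document: the div and its heading are invented.
doc = ET.fromstring(
    '<body>'
    '<p>See <ptr target="#Section12"/>.</p>'
    '<div xml:id="Section12"><head>Simple Linking</head></div>'
    '</body>'
)

# Index every element that carries an xml:id.
by_id = {el.get(XML_ID): el for el in doc.iter() if el.get(XML_ID)}

def resolve(pointer):
    """Follow a same-document target such as '#Section12'; None if not found."""
    target = pointer.get("target", "")
    return by_id.get(target[1:]) if target.startswith("#") else None

ptr = doc.find(".//ptr")
link_text = resolve(ptr).find("head").text   # text a stylesheet might generate
print(link_text)
```

Targets that are full URIs or that point into other documents are outside the scope of this sketch and simply resolve to None.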

4.17. Lists

  • <list> (a sequence of items forming a list)
  • <item> (one component of a list)
  • <label> (label associated with an item)
  • <headLabel> (heading for column of labels)
  • <headItem> (heading for column of items)

4.18. Simple List Example

<p>The previous slide contained only:
 <list>
  <item><gi>list</gi> (a sequence of items forming a list)</item>
  <item><gi>item</gi> (one component of a list)</item>
  <item><gi>label</gi> (label associated with an item)</item>
  <item><gi>headLabel</gi> (heading for column of labels)</item>
  <item><gi>headItem</gi> (heading for column of items)</item>
 </list>
</p>

4.19. Notes

  • <note> (contains a note or annotation)
  • Notes can be those existing in the text, or provided by the editor of the electronic text
  • A place attribute can be used to indicate the physical location of the note
  • Although a note should usually be encoded where its identifier/mark first appears, notes can also be kept separately and point back to their location with a target attribute

4.20. Note Example

<p>It is not only misfortune that makes strange bedfellows.
<note place="foot">By-the-by, it is denied that Sir
 <name>Joseph Beecham</name> was in any way responsible
   for the Government's <title>Pills for
     Earthquakes</title>, by which it was hoped to avert the
   Irish crisis.</note></p>

4.21. Graphics

  • <graphic> (indicates the location of an inline graphic, illustration, or figure)
  • <binaryObject> (encoded binary data embedding a graphic or other object)
  • The figure module provides <figure> and <figDesc> for more complex graphics
 <graphic url="images/014.png"/>
 <head>Garden City Washing-day.</head>
 <p>Our sensitive artist insists on a harmonious
 <figDesc>A bearded man sits in a deckchair and wags his
   finger at a woman hanging up washing</figDesc>

5. Introduction to the oXygen XML editor

For our exercises we're going to be using the oXygen XML editor, made by a Romanian company called SyncRO Soft. This has quickly become the market leader in XML editors, but I thought I should explain why we use it. There are alternatives which you are free to use, but they don't have the vast array of features that oXygen does.

6. Editor types

Editing tools cover a wide spectrum:
  • Basic text editors
  • General programmers' editors
  • XML-aware programmers' editors
  • XML-specific editors
  • Word-processors which can export XML
  • Data-entry forms
  • Image-specific editors
It is likely that people in different roles need different tools.

7. Things to look for in specialist XML editors

  • schema-aware
  • constraining element entry
  • IDE features
  • customizable
  • validation, preferably continual
  • Multiple display views (as tree, with tags, formatted etc)
  • folding structures
  • context-sensitive help
For XML editing, oXygen, Emacs, jEdit, XMetaL, XMLSpy, Stylus Studio, Arbortext Adept are all worth a look.

For image markup try the University of Victoria Image Markup Tool.

8. oXygen Features (1)

  • Multiple modes for editing XML documents: Author (CSS based), Grid, Text
  • TEI Support including: New document templates; Author mode CSS; Transformations to HTML and PDF
  • Ability to add/extend/customise for other frameworks
  • Available as an Eclipse plugin (Java IDE)
  • Java API for developer add-ons

9. oXygen Features (2)

  • Support for many schema languages, including RELAX NG, Schematron, XML Schema, DTDs, NVDL and NRL
  • Content completion based on TEI Relax NG schemas
  • Tooltip documentation based on TEI Relax NG schemas
  • NVDL easily validates TEI documents in multiple namespaces

10. oXygen Features (3)

  • XQuery directly against local/remote XML databases like eXist
  • XSLT and FOP support for transformations to XML/HTML/PDF etc.
  • WebDAV and FTP support for access to files on remote servers/CMS
  • Built-in Subversion client for collaborative version control and visual change management
  • xml:lang-aware spell checking as you type
  • Included graphical XML Diff to analyse differences between documents

11. oXygen Features (4)

But maybe most important...

  • Multi-platform: oXygen is available on Windows, Mac OS X, Linux, Solaris, etc.
  • They have an enlightened academic pricing policy ($64 USD) (Oxford has a site license)
  • The named-user based license allows the same user to use any oXygen distribution on any platform or machine: the same license covers you at work, laptop, and home.
  • They are nice enough to give us trial licenses to teach workshops with!

12. oXygen

13. Basic oXygen Editing

14. Adding An Element

15. Adding An Attribute

16. Surround With Element

17. Surround With Element Result

18. Another Surround With Element

19. Or With Russian Text

20. Or If You Generate Your TEI Schema In Chinese...

21. XPath Searching Built In

22. Tagless editing in oXygen

23. Why use oXygen?

  1. It is probably the best and most complete XML development IDE available.
  2. Ready-to-use support for a large number of document types (including TEI).
  3. Continuous and active development with a proactive user community.
  4. Free support.

    oXygen provides very responsive support to all its users free of charge.

  5. Huge academic discounts and additional discounts for TEI members.

    The Academic license of oXygen is heavily discounted: it costs $64 and has the same set of features as the ‘Enterprise’ license, which costs $543. TEI members also benefit from an additional 20% discount.

24. Exercise: Basic transcription

Now, if we have time, a quick demonstration of the kind of thing you are shortly going to be asked to do in the exercise.

James Cummings. Date: August 2010
Copyright University of Oxford