
1. Introduction to the Course

Aims of Course

  1. Examine the concept of markup and XML encoding
  2. Provide hands-on experience in using TEI XML markup
  3. Introduce the TEI scheme, its assumptions, and its organization
  4. Survey the whole landscape of the TEI recommendations
  5. Demonstrate how the TEI scheme may be customized to particular needs
  6. Demonstrate some real world applications of the TEI scheme
  7. Provide routes into more detailed information for exploration at your leisure
  8. Provide opportunities for questions and discussions relating to your own encoding needs and priorities

1.1. Course Structure

The lecture times are not set in stone: we will go as fast or as slowly as students need, adapting as necessary. However, we will attempt to follow the structure below:
  • 10:00 - 11:00: Lecture 1
  • 11:00 - 11:30: Coffee Break
  • 11:30 - 12:30: Lecture 2
  • 12:30 - 13:00: Practical Exercise
  • 13:00 - 13:30: Lecture 3

1.2. Day 1: Monday 13 April, 2009 -- Introductions

Lecture 1:
Introduction to Course, Markup, XML, and the oXygen XML Editor
Lecture 2:
The TEI, TEI Structure and Core Elements
Practical:
Exercise: Editing XML in oXygen
Lecture 3:
Tools for Editing and Publishing TEI Documents

1.3. Day 2: Tuesday 14 April, 2009 -- Metadata

Lecture 1:
The TEI Header
Lecture 2:
Manuscript Description and Facsimile
Practical:
Exercise: Describing a Manuscript in TEI XML
Lecture 3:
Marking Up Images

1.4. Day 3: Wednesday 15 April, 2009 -- Transcription and Pointing

Lecture 1:
Transcription and Critical Apparatus
Lecture 2:
Names / Dates / People / Places
Practical:
Exercise: More TEI Editing
Lecture 3:
Pointing, Linking, and Stand Off Markup

1.5. Day 4: Thursday 16 April, 2009 -- Corpora, Genres, and Glyphs

Lecture 1:
Analysis, Speech, and Linguistics
Lecture 2:
Verse, Drama, and Dictionaries
Practical:
Exercise: Even More TEI Editing
Lecture 3:
TEI, Unicode, and Non-standard Characters

1.6. Day 5: Friday 17 April, 2009 -- Using the TEI

Lecture 1:
Documenting TEI Customisations
Lecture 2:
Exploring the TEI Community
Practical:
Using Roma
Lecture 3:
Conclusions and Group Discussion

1.7. Course Materials

  • All course materials including:
    • All slides from lectures (in TEI XML, HTML, and PDF)
    • All exercises (in TEI XML, HTML, and PDF)
    • All materials for the exercises
    • A PDF booklet combining all these with 'TEI Lite'
    are available on the TEI @ Oxford website.
  • The URL is: http://tei.oucs.ox.ac.uk/Oxford/index.xml
  • All these materials are licensed with a Creative Commons Attribution license, which means they are freely available for re-use (though do let us know!)
  • To save you downloading a huge zip with all the workshop materials, I'll now pass around a USB key or two for you to copy the 'materials' folder from onto your computer

1.8. After the workshop...

  • After the workshop, if you have questions, it is better to email the TEI-L mailing list than to email us privately, because:
    • we'll still try to answer as well as we would privately
    • you get answers not only from us, but TEI experts around the world
    • questions from those of all levels of ability stop the list becoming too technical
    • everyone benefits from having the answers be public -- and you benefit by reading (and sometimes answering!) others' problems

2. An Introduction to Textual Markup

In order to talk about texts, markup, and the encoding of texts, we need to understand what we mean by these basic concepts. When we talk about text encoding, what do we mean by a text? What is in a text, and what assumptions do we make in reading it?

2.1. What's in a text?

2.2. What's in a text (2)?

BL MS Cotton Vitellius A xv, fol. 129r

2.3. What's in a text (3)?

2.4. The ontology of text

Where is the text?
  • in the shape of letters and their layout?
  • in the original from which this copy derives?
  • in the stories we read into it? or in its author's intentions?

A "text" is an abstraction, created by or for a community of readers. Markup encodes and makes concrete such abstractions.

2.5. Encoding of texts

  • Texts are more than sequences of encoded glyphs
    • They have structure and content
    • They also have multiple readings
  • Encoding, or markup, is a way of making these things explicit

Only that which is explicit can be reliably processed

2.6. Styles of markup

  • In the beginning there was procedural markup
    RED INK ON; print balance; RED INK OFF
  • which, being generalised, became descriptive markup
    <balance type='overdrawn'>some numbers</balance>
  • also known as encoding or annotation

Descriptive markup allows for easier re-use of data

2.7. Some more definitions

  • Markup makes explicit the distinctions we want to make when processing a string of bytes
  • Markup is a way of naming and characterizing the parts of a text in a formalized way
  • It's (usually) more useful to mark up what we think things are than what they look like

2.8. What's the point of markup?

  • To make explicit (to a machine) what is implicit (to a person)
  • To add value by supplying multiple annotations
  • To facilitate re-use of the same material
    • in different formats
    • in different contexts
    • by different users

2.9. Separation of form and content

  • Presentational markup cares more about fonts and layout than meaning
  • Descriptive markup says what things are, and leaves the rendition of them for a separate step
  • Separating the form of something from its content makes its re-use more flexible
  • It also allows easy changes of presentation across a large number of documents
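
As a rough illustration of this separation, the sketch below takes the descriptive <balance> example from section 2.6 and produces two different renditions from the same source. The figure -42.50 and the two rendering functions are invented for illustration; they are not part of the course materials.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: it says what the number is, not how to print it.
# The amount is an invented placeholder value.
source = "<balance type='overdrawn'>-42.50</balance>"
balance = ET.fromstring(source)


def render_html(el):
    """Hypothetical rendition 1: overdrawn balances are shown in red."""
    colour = "red" if el.get("type") == "overdrawn" else "black"
    return f'<span style="color: {colour}">{el.text}</span>'


def render_plain(el):
    """Hypothetical rendition 2: plain text, with a label instead of colour."""
    label = " (OVERDRAWN)" if el.get("type") == "overdrawn" else ""
    return f"{el.text}{label}"


print(render_html(balance))
print(render_plain(balance))
```

Changing how overdrawn balances look means editing one rendering function, not every document that contains a balance.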

2.10. Markup as a scholarly activity

  • The application of markup to a document can be an intellectual activity
  • In deciding what markup to apply, and how this represents the original, one is undertaking the task of an editor
  • There is (almost) no such thing as neutral markup -- all of it involves interpretation
  • Markup can assist in answering research questions, and deciding what markup is needed to enable such questions to be answered can be a research activity in itself
  • Good textual encoding is never as easy or quick as people would believe
  • Detailed document analysis is needed before encoding for the resulting markup to be useful

2.11. What does markup capture?

Compare
<hi rend="dropcap">H</hi>&WYN;ÆT WE GARDE <lb/>na in
gear-dagum þeod-cyninga <lb/>þrym gefrunon, hu ða æþelingas
<lb/>ellen fremedon. oft scyld scefing sceaþe
<add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl
<add>a</add>
<lb/>of<damage>
 <desc>blot</desc>
</damage>teah ...
and
<lg>
 <l>Hwæt! we Gar-dena in gear-dagum</l>
 <l>þeod-cyninga þrym gefrunon,</l>
 <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
 <l>Oft Scyld Scefing sceaþena þreatum,</l>
 <l>monegum mægþum meodo-setla ofteah;</l>
 <l>egsode Eorle, syððan ærest wearþ</l>
 <l>feasceaft funden...</l>
</lg>
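
One way to see what the second encoding captures is to ask it a question that the first cannot answer directly: how many verse lines are in each line group? The sketch below uses Python's standard xml.etree.ElementTree module; the <body> wrapper is an assumption added only so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# The <body> wrapper is added here only to give the fragment one root element.
edited = """<body>
<lg>
 <l>Hwæt! we Gar-dena in gear-dagum</l>
 <l>þeod-cyninga þrym gefrunon,</l>
 <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
 <l>Oft Scyld Scefing sceaþena þreatum,</l>
 <l>monegum mægþum meodo-setla ofteah;</l>
 <l>egsode Eorle, syððan ærest wearþ</l>
 <l>feasceaft funden...</l>
</lg>
</body>"""

root = ET.fromstring(edited)

# Verse structure is explicit, so it can be counted and extracted.
line_groups = root.findall("lg")
lines_per_group = [len(lg.findall("l")) for lg in line_groups]
first_line = root.find("lg/l").text

print(lines_per_group)
print(first_line)
```

The first encoding supports a different set of questions (where the manuscript line breaks fall, what was added or damaged), which is the point of the comparison.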

2.12. A useful mental exercise

Imagine you are going to mark up several thousand pages of complex material....
  • Which features are you going to mark up?
  • Why are you choosing to mark up this feature?
  • How reliably and consistently can you do this?

Now, imagine your budget has been halved. Repeat the exercise!

2.13. Some alphabet soup

SGML Standard Generalized Markup Language
HTML Hypertext Markup Language
W3C World Wide Web Consortium
XML eXtensible Markup Language
DTD Document Type Definition (or Declaration)
CSS Cascading Style Sheet
XPath XML Path Language
XSLT eXtensible Stylesheet Language - Transformations
XQuery XML Query Language
RELAX NG REgular LAnguage for XML (Next Generation)

Oh, and then there's also TEI, the Text Encoding Initiative

3. An Introduction to XML

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML also now plays an indispensable role in the exchange of a wide variety of data on the Web and elsewhere.

3.1. XML: what it is and why you should care

  • XML is structured data represented as strings of text
  • XML looks like HTML, except that:-
    • XML is extensible
    • XML must be well-formed
    • XML can be validated
  • XML is application-, platform-, and vendor-independent
  • XML empowers the content provider and facilitates data integration

3.2. XML terminology

An XML document may contain:-
  • elements, possibly bearing attributes
  • processing instructions
  • comments
  • entity references
  • marked sections (CDATA; IGNORE and INCLUDE sections occur only in DTDs)

An XML document must be well-formed and may be valid

3.3. XML terminology Example

<?xml version="1.0" ?>
<root>
 <element attribute="value"> content </element>
<!-- comment -->
</root>
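
To connect these terms to what a program sees, here is a minimal sketch that parses the example with Python's standard xml.etree.ElementTree module. Note that this particular parser discards comments by default, so only the element, its attribute, and its content survive.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0" ?><root>
 <element attribute="value"> content </element>
<!-- comment -->
</root>"""

root = ET.fromstring(document)
child = root[0]

print(root.tag)           # name of the root element
print(child.tag)          # name of its only child element
print(child.attrib)       # attributes as a dictionary
print(repr(child.text))   # character data, whitespace included
print(len(root))          # number of child elements (comments are not kept)
```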

3.4. The rules of the XML Game

  • An XML document represents a (kind of) tree
  • It has a single root and many nodes
  • Each node can be
    • a subtree
    • a single element (possibly bearing some attributes)
    • a string of character data
  • Each element has a name or generic identifier

3.5. Representing an XML tree

  • An XML document is encoded as a linear string of characters
  • It begins with a special processing instruction
  • Element occurrences are marked by start- and end-tags
  • The characters < and & are Magic and must always be "escaped" if you want to use them as themselves
  • Comments are delimited by <!-- and -->
  • CDATA sections are delimited by <![CDATA[ and ]]>
  • Attribute name/value pairs are supplied on the start-tag and may be given in any order
  • Entity references are delimited by & and ;
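
A short sketch of the escaping rule, using Python's standard library: escape() from xml.sax.saxutils replaces & and < (and also >) with entity references, and the parser turns them back into the original characters. The sample string is invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "AT&T says 1 < 2"   # invented sample text containing both magic characters

# Without escaping, the document is not well-formed and the parser rejects it.
try:
    ET.fromstring(f"<p>{raw}</p>")
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False

# With escaping, the parser accepts it and gives back the original characters.
escaped = escape(raw)
element = ET.fromstring(f"<p>{escaped}</p>")

print(unescaped_parses)
print(escaped)
print(element.text)
```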

3.6. Parts of an XML document

<?xml version="1.0"?>
<greetings xmlns="http://www.example.org/greetings">
 <hello xmlns="http://www.example.org/greetings"
  type="sarcastic">hello world!</hello>
</greetings>
  • The XML declaration
  • Namespace declarations
  • The root element of the document itself
  • Other elements and content
  • Attribute and value

3.7. The XML declaration

An XML document should begin with an XML declaration (it is optional in XML 1.0, but recommended), which does three things:
  • specifies that this is an XML document
  • specifies which version of the XML standard it follows
  • specifies which character encoding the document uses
<?xml version="1.0" ?>
<?xml version="1.0" encoding="iso-8859-1" ?>
The default, and recommended, encoding is ‘UTF-8’ (Unicode)
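
The effect of the encoding declaration can be sketched with Python's standard xml.etree.ElementTree module. The sample word is invented; the point is that the same bytes are read correctly only when the declaration matches the encoding actually used.

```python
import xml.etree.ElementTree as ET

text = "<p>caf\u00e9</p>"   # invented sample containing one non-ASCII character

# Declared and encoded as ISO-8859-1.
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1" ?>' + text.encode("iso-8859-1")
# No encoding declared, so the parser assumes UTF-8.
utf8_doc = b'<?xml version="1.0" ?>' + text.encode("utf-8")
# ISO-8859-1 bytes with no declaration: the parser tries UTF-8 and fails.
mislabelled_doc = b'<?xml version="1.0" ?>' + text.encode("iso-8859-1")

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

try:
    ET.fromstring(mislabelled_doc)
    mislabelled_parses = True
except ET.ParseError:
    mislabelled_parses = False

print(latin1_text, utf8_text, mislabelled_parses)
```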

3.8. Namespace declarations

All TEI documents are declared within the TEI namespace: <TEI xmlns="http://www.tei-c.org/ns/1.0"> ... </TEI>

XML documents can include elements declared in different namespaces.

  • a namespace declaration associates a namespace prefix with an external URI-like identifier
  • the default namespace may be declared using the xmlns attribute with no prefix
  • other namespaces must all use a specially declared prefix
<TEI
xmlns="http://www.tei-c.org/ns/1.0"
xmlns:math="http://www.mathml.org">
<p>...<math:expr>...</math:expr>...</p>...</TEI>

The xml namespace is used by the TEI for global attributes xml:id and xml:lang
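
To show what a namespace-aware parser makes of these declarations, the sketch below parses a small document modelled on the example above with Python's standard xml.etree.ElementTree module. That module writes a qualified name as {namespace-URI}local-name. The xml:id and xml:lang values and the element content are invented, and the math namespace URI is the illustrative one used above.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"   # bound to the xml: prefix by definition

doc = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:math="http://www.mathml.org">
 <p xml:id="p1" xml:lang="en">text <math:expr>x</math:expr></p>
</TEI>"""

root = ET.fromstring(doc)

# Unprefixed elements fall in the default (TEI) namespace.
p = root.find("tei:p", {"tei": TEI_NS})
# Prefixed elements take the namespace bound to their prefix.
expr = p[0]

print(root.tag)
print(expr.tag)
print(p.get(f"{{{XML_NS}}}id"), p.get(f"{{{XML_NS}}}lang"))
```

The prefix chosen in the query ("tei" here) is arbitrary; only the namespace URI has to match.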

3.9. The Doctype Declaration

You may sometimes find an optional "Document Type" declaration at the start of a document:

<?xml version="1.0" ?>
<!DOCTYPE greeting SYSTEM "greeting.dtd" []>
  • The DOCTYPE declaration is one way of associating the document with its schema, in this case a DTD (W3C XML Schema and RELAX NG do not use it for this purpose)
  • The DTD subset is used to provide declarations additional to those in the schema, for example for external files
  • The DTD subset may be internal, external, or both

DTDs are now considered old-fashioned -- RELAX NG schemas are preferred.

3.10. The Tempest

<?xml version="1.0" encoding="utf-8" ?>
<div n="1">
 <head>SCENE I. On a ship at sea: a tempestuous noise of thunder and lightning heard.</head>
 <stage>Enter a Master and a Boatswain</stage>
 <sp>
  <speaker>Master</speaker>
  <ab>Boatswain!</ab>
 </sp>
 <sp>
  <speaker>Boatswain</speaker>
  <ab>Here, master: what cheer?</ab>
 </sp>
 <sp>
  <speaker>Master</speaker>
  <ab>Good, speak to the mariners: fall to't, yarely,</ab>
  <ab>or we run ourselves aground: bestir, bestir.</ab>
 </sp>
 <stage>Exit</stage>
</div>

3.11. Example deconstructed: root node

<?xml version="1.0" encoding="utf-8" ?>
<div n="1">
<!-- .... -->
</div>

3.12. Example deconstructed: head

<head>SCENE I. On a ship at sea: a tempestuous noise of thunder and
lightning heard.</head>

3.13. Example deconstructed: stage direction and speech

<stage>Enter a Master and a Boatswain</stage>
<sp>
 <speaker>Master</speaker>
 <ab>Boatswain!</ab>
</sp>

3.14. An XML Tree For The Tempest
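
A text outline of the tree can be produced with a few lines of Python. This is only a sketch using the standard xml.etree.ElementTree module on an abridged copy of the scene (the third speech is left out), and the outline() helper is a name invented here.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the scene from section 3.10.
scene = """<div n="1">
 <head>SCENE I. On a ship at sea: a tempestuous noise of thunder and lightning heard.</head>
 <stage>Enter a Master and a Boatswain</stage>
 <sp>
  <speaker>Master</speaker>
  <ab>Boatswain!</ab>
 </sp>
 <sp>
  <speaker>Boatswain</speaker>
  <ab>Here, master: what cheer?</ab>
 </sp>
 <stage>Exit</stage>
</div>"""


def outline(element, depth=0):
    """Return one indented line per element, visiting the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(scene))
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one level of nesting: the <div> root contains the <head>, the <stage> directions, and the <sp> speeches, and each speech contains a <speaker> and an <ab>.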

3.15. XML syntax: the small print

What does it mean to be well-formed?

  1. there is a single root node containing the whole of an XML document
  2. each subtree is properly nested within the root node
  3. names are always case sensitive
  4. start-tags and end-tags are always mandatory (except that a combined start-and-end tag may be used for empty nodes)
  5. attribute values are always quoted

Note: You can be valid in addition to being well-formed. This means you obey the rules of a specified schema, such as the TEI.

3.16. Test your XML knowledge

  • Which are correct?
    • <seg>some text</seg>
    • <seg><foo>some</foo> <bar>text</bar></seg>
    • <seg><foo>some <bar></foo> text</bar></seg>
    • <seg type="text">some text</seg>
    • <seg type='text'>some text</seg>
    • <seg type=text>some text</seg>
    • <seg type="text">some text<seg/>
    • <seg type="text">some text<gap/></seg>
    • <seg type="text">some text< /seg>
    • <seg type="text">some text</Seg>
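
Once you have made your own guesses, one way to check them is to hand each candidate to a parser. The sketch below uses Python's standard xml.etree.ElementTree module, which rejects anything that is not well-formed; the is_well_formed() helper is a name invented here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if the parser accepts the fragment, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


candidates = [
    '<seg>some text</seg>',
    '<seg><foo>some</foo> <bar>text</bar></seg>',
    '<seg><foo>some <bar></foo> text</bar></seg>',
    '<seg type="text">some text</seg>',
    "<seg type='text'>some text</seg>",
    '<seg type=text>some text</seg>',
    '<seg type="text">some text<seg/>',
    '<seg type="text">some text<gap/></seg>',
    '<seg type="text">some text< /seg>',
    '<seg type="text">some text</Seg>',
]

results = [is_well_formed(candidate) for candidate in candidates]
for candidate, ok in zip(candidates, results):
    print("well-formed" if ok else "NOT well-formed", candidate)
```

This checks well-formedness only. Whether <seg>, <foo>, or <gap> are allowed where they appear is a question of validity against a schema.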

3.17. XML is an international standard

  • XML requires use of ISO 10646 (also known as Unicode)
    • a 31 bit character repertoire including most human writing systems
    • encoded as UTF-8 or UTF-16
  • other encodings may be specified at the document level
  • language may be specified at the element level using xml:lang

The xml:id attribute is another W3C-defined attribute.

4. Introduction to the oXygen XML editor

For our exercises we're going to be using the oXygen XML editor, made by a Romanian company called Syncro Soft. It has quickly become the market leader in XML editors, but I thought I should explain why we use it. There are alternatives which you are free to use, but they don't have the vast array of features that oXygen does.

4.1. Why use oXygen?

  1. It is probably the best and most complete XML development IDE available.
  2. Ready-to-use support for a large number of document types (including TEI).
  3. Continuous and active development with a proactive user community.
  4. Free support.

    oXygen provides very responsive support to all its users, free of charge.

  5. Huge academic discounts and additional discounts for TEI members.

    Academic licenses of oXygen are heavily discounted: they cost $48 and have the same set of features as the Professional license, which costs $299. TEI members also benefit from an additional 20% discount.

4.2. Basic oXygen Editing

4.3. Closing Side Views

4.4. Surround With Element

4.5. Or With Russian Text

4.6. Adding An Element

4.7. Adding An Attribute

4.8. Or If You Generate Your TEI Schema In Chinese...

4.9. If You Really Hate Tags...

4.10. XPath Searching Built In
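
oXygen's own XPath search box is not reproduced here. As a stand-in, this sketch shows the kind of query you might type into it, run instead through Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath. The scene is the one from section 3.10.

```python
import xml.etree.ElementTree as ET

scene = """<div n="1">
 <head>SCENE I. On a ship at sea: a tempestuous noise of thunder and lightning heard.</head>
 <stage>Enter a Master and a Boatswain</stage>
 <sp>
  <speaker>Master</speaker>
  <ab>Boatswain!</ab>
 </sp>
 <sp>
  <speaker>Boatswain</speaker>
  <ab>Here, master: what cheer?</ab>
 </sp>
 <sp>
  <speaker>Master</speaker>
  <ab>Good, speak to the mariners: fall to't, yarely,</ab>
  <ab>or we run ourselves aground: bestir, bestir.</ab>
 </sp>
 <stage>Exit</stage>
</div>"""

root = ET.fromstring(scene)

# Every <speaker> anywhere in the document, in document order.
speakers = [s.text for s in root.findall(".//speaker")]

# Every <ab> inside a speech whose <speaker> child is exactly "Master".
master_lines = [ab.text for ab in root.findall(".//sp[speaker='Master']/ab")]

print(speakers)
print(master_lines)
```

Queries like these work only because the speeches and speakers were marked up explicitly in the first place.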



James Cummings. Date: April 2009
Copyright University of Oxford