IT Services, University of Oxford

1. Introduction to the Course

Goals and Talks

  • A brief introduction to markup, XML and the TEI
  • A series of practical exercises
  • Familiarisation with the TEI's manuscript description module
  • An exploration of various other aspects of the TEI

1.1. Schedule

  • Talk 1: Introductions to the Course, Markup, XML, and the TEI
  • Exercise 1: Document Description
  • Talk 2: Manuscript Description
  • Exercise 2: A Basic <msDesc>
  • Talk 3: TEI Structure
  • Exercise 3: <msDesc> in Context
  • Talk 4: Exploring the TEI

I've not timetabled these very rigorously as hopefully the exercises will go quickly, but you never know!

1.2. What's on your USB key?

  • All the teaching slides and exercises in XML and PDF format
  • All the course materials
  • A PDF booklet of all the slides, exercises, and materials
  • A bootable Ubuntu Linux distribution with the oXygen XML editor and other software installed
  • The TEI Guidelines

1.3. Materials

  • F101-19: Alexander (Alexander), King of Poland and Grand Duke of Lithuania
  • F101-21: Sigismund the Old (Sigismundus), King of Poland and Grand Duke of Lithuania
  • For both we have images, transcription, and manuscript description
  • They are both in the Martynas Mažvydas National Library of Lithuania
  • We'll mainly be looking at F101-19 in the exercises, but examples come from both documents, and both are on your USB key

1.4. F101-19

1.5. F101-21

2. An Introduction to Textual Markup

In order to talk about texts, markup, and the encoding of texts, we need to understand what we mean by these basic concepts. When we talk about text encoding, what do we mean by a text? What is in a text, and what assumptions do we make in reading it?

2.1. What's in a text?

2.2. What's in a text (2)?

2.3. What's in a text (3)?

2.4. The ontology of text

Where is the text?
  • in the shape of letters and their layout?
  • in the original from which this copy derives?
  • in the stories we read into it? or in its author's intentions?

A "text" is an abstraction, created by or for a community of readers. Markup encodes and makes concrete such abstractions.

2.5. Encoding of texts

  • Texts are more than sequences of encoded glyphs
    • They have structure and content
    • They also have multiple readings
  • Encoding, or markup, is a way of making these things explicit

Only that which is explicit can be reliably processed

2.6. Styles of markup

  • In the beginning there was procedural markup
    RED INK ON; print balance; RED INK OFF
  • which being generalised became descriptive markup
    <balance type='overdrawn'>some numbers</balance>
  • also known as encoding or annotation

Descriptive markup allows for easier re-use of data

2.7. What's the point of markup?

  • To make explicit (to a machine) what is implicit (to a person)
  • To add value by supplying multiple annotations
  • To facilitate re-use of the same material
    • in different formats
    • in different contexts
    • by different users

It's (usually) more useful to mark up what we think things are than what they look like

2.8. Separation of form and content

  • Presentational markup cares more about fonts and layout than meaning
  • Descriptive markup says what things are, and leaves the rendition of them for a separate step
  • Separating the form of something from its content makes its re-use more flexible
  • It also allows easy changes of presentation across a large number of documents
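
As a rough illustration of the points above, the following Python sketch takes one descriptively marked-up element and renders it in two different ways. The balance value and both rendering functions are invented for this example; they are not part of any TEI tool.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: it says what the thing is, not how it should look.
# The value -42.00 is a made-up stand-in for "some numbers".
source = "<balance type='overdrawn'>-42.00</balance>"
element = ET.fromstring(source)

def render_plain(el):
    """Plain-text rendition: flag overdrawn balances with a word."""
    flag = " (OVERDRAWN)" if el.get("type") == "overdrawn" else ""
    return f"Balance: {el.text}{flag}"

def render_html(el):
    """HTML rendition: show overdrawn balances in red."""
    colour = "red" if el.get("type") == "overdrawn" else "black"
    return f'<span style="color: {colour}">{el.text}</span>'

plain = render_plain(element)
html = render_html(element)
print(plain)
print(html)
```

The source element never changes; only the separate rendering step does, which is what makes a change of presentation across many documents cheap.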

2.9. Markup as a scholarly activity

  • The application of markup to a document can be an intellectual activity
  • In deciding what markup to apply, and how this represents the original, one is undertaking the task of an editor
  • There is (almost) no such thing as neutral markup -- all of it involves interpretation
  • Markup can assist in answering research questions, and deciding what markup is needed to enable such questions to be answered can be a research activity in itself
  • Good textual encoding is never as easy or quick as people would believe
  • Detailed document analysis is needed before encoding for the resulting markup to be useful

2.10. What does markup capture?

Compare
<hi rend="dropcap">H</hi>&WYN;ÆT WE GARDE <lb/>na in gear-dagum þeod-cyninga <lb/>þrym gefrunon,
hu ða æþelingas <lb/>ellen fremedon. oft scyld scefing sceaþe
<add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl
<add>a</add>
<lb/>of<damage>
 <desc>blot</desc>
</damage>teah ...
and
<lg>
 <l>Hwæt! we Gar-dena in gear-dagum</l>
 <l>þeod-cyninga þrym gefrunon,</l>
 <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
 <l>Oft Scyld Scefing sceaþena þreatum,</l>
 <l>monegum mægþum meodo-setla ofteah;</l>
 <l>egsode Eorle, syððan ærest wearþ</l>
 <l>feasceaft funden...</l>
</lg>
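
Each encoding supports different questions. As a small sketch of this, the Python below counts verse lines per line group in a shortened copy of the second encoding. The wrapping <text> element is an assumption added here only so that the fragment has a single root.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the verse encoding, wrapped in a hypothetical
# <text> element so that the fragment has a single root.
verse = """<text>
<lg>
 <l>Hwæt! we Gar-dena in gear-dagum</l>
 <l>þeod-cyninga þrym gefrunon,</l>
 <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
 <l>Oft Scyld Scefing sceaþena þreatum,</l>
 <l>monegum mægþum meodo-setla ofteah;</l>
</lg>
</text>"""

root = ET.fromstring(verse)
line_groups = root.findall("lg")

# Because verse lines are explicit, counting them is trivial.
lines_per_group = [len(lg.findall("l")) for lg in line_groups]
first_line = line_groups[0].find("l").text

print(lines_per_group)
print(first_line)
```

The first encoding could not answer this question directly, since it records manuscript line breaks, additions, and damage rather than verse lines; equally, the second says nothing about the physical page.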

2.11. A useful mental exercise

Imagine you are going to mark up several thousand pages of complex material....
  • Which features are you going to mark up?
  • Why are you choosing to mark up this feature?
  • How reliably and consistently can you do this?

Now, imagine your budget has been halved. Repeat the exercise!

3. An Introduction to XML

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.

3.1. XML: what it is and why you should care

  • XML is structured data represented as strings of text
  • XML looks like HTML, except that:-
    • XML is extensible
    • XML must be well-formed
    • XML can be validated
  • XML is application-, platform-, and vendor- independent
  • XML empowers the content provider and facilitates data integration

3.2. XML terminology

An XML document may contain:-
  • elements, possibly bearing attributes
  • processing instructions
  • comments
  • entity references
  • CDATA marked sections (IGNORE and INCLUDE sections may appear only in a DTD)

An XML document must be well-formed and may be valid

3.3. XML terminology Example

<?xml version="1.0" ?>
<root xmlns="http://www.example.com/">
 <element attribute="value">content</element>
 <!-- comment -->
</root>
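
To see how a parser reports these parts, here is a minimal sketch using Python's standard xml.etree.ElementTree module on a tidied copy of the example. Note that ElementTree writes namespaced names as {uri}local and, by default, discards comments.

```python
import xml.etree.ElementTree as ET

# A tidied copy of the example document.
doc = """<?xml version="1.0" ?>
<root xmlns="http://www.example.com/">
 <element attribute="value">content</element>
 <!-- comment -->
</root>"""

root = ET.fromstring(doc)
child = root[0]  # the only child element; the comment is not kept

print(root.tag)      # element name, qualified by its namespace
print(child.tag)
print(child.attrib)  # attributes as a dictionary
print(child.text)    # character data inside the element
```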

3.4. The rules of the XML Game

  • An XML document represents a (kind of) tree
  • It has a single root and many nodes
  • Each node can be
    • a subtree
    • a single element (possibly bearing some attributes)
    • a string of character data
  • Each element has a name or generic identifier

3.5. Representing an XML tree

  • An XML document is encoded as a linear string of characters
  • It begins with the XML declaration, which looks like a special processing instruction
  • Element occurrences are marked by start- and end-tags
  • The characters < and & are Magic and must always be "escaped" if you want to use them as themselves
  • Comments are delimited by <!-- and -->
  • CDATA sections are delimited by <![CDATA[ and ]]>
  • Attribute name/value pairs are supplied on the start-tag and may be given in any order
  • Entity references are delimited by & and ;
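
A brief sketch of escaping in practice, using Python's standard library. The sample sentence is invented; escape() from xml.sax.saxutils replaces &, < and > with entity references, and the parser turns them back into the original characters.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "Fish & chips cost < 5 pounds"

# Escape the magic characters before placing the text inside an element.
escaped = escape(raw)
xml_text = f"<p>{escaped}</p>"
round_trip = ET.fromstring(xml_text).text

# Alternatively, a CDATA section lets the same characters stand unescaped.
cdata_text = ET.fromstring(
    "<p><![CDATA[Fish & chips cost < 5 pounds]]></p>"
).text

print(escaped)
print(round_trip)
print(cdata_text)
```

Both routes give the parser the same character data; they differ only in how the source file is written.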

3.6. Parts of an XML document

<?xml version="1.0"?>
<greetings xmlns="http://www.example.org/greetings">
 <hello type="sarcastic">hello world!</hello>
</greetings>
  • The XML declaration
  • Namespace declarations
  • The root element of the document itself
  • Other elements and content
  • Attribute and value

3.7. The XML declaration

An XML document should begin with an XML declaration (strictly optional in XML 1.0, but recommended), which does two things:
  • specifies that this is an XML document, and which version of the XML standard it follows
  • specifies which character encoding the document uses
<?xml version="1.0" ?>
<?xml version="1.0" encoding="iso-8859-1" ?>
The default, and recommended, encoding is UTF-8
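
The effect of the encoding declaration can be seen by storing the same text as two different byte sequences. In this sketch (the sample word is taken from the Beowulf example; the variable names are arbitrary) the parser reads the declaration, or falls back to UTF-8, to decode the bytes.

```python
import xml.etree.ElementTree as ET

# The same document stored in two encodings: the bytes differ,
# but the declaration tells the parser how to read them.
latin1_bytes = '<?xml version="1.0" encoding="iso-8859-1"?><p>þrym</p>'.encode("iso-8859-1")
utf8_bytes = '<?xml version="1.0"?><p>þrym</p>'.encode("utf-8")

latin1_text = ET.fromstring(latin1_bytes).text
utf8_text = ET.fromstring(utf8_bytes).text

print(latin1_bytes != utf8_bytes)  # different bytes on disk
print(latin1_text == utf8_text)    # same characters once parsed
```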

3.8. Namespace declarations

All TEI documents are declared within the TEI namespace: <TEI xmlns="http://www.tei-c.org/ns/1.0"> ... </TEI>

XML documents can include elements declared in different namespaces.

  • a namespace declaration associates a namespace prefix with an external URI-like identifier
  • the default namespace may be declared using the xmlns attribute, without a prefix
  • other namespaces must all use a specially declared prefix
<TEI xmlns="http://www.tei-c.org/ns/1.0"
xmlns:math="http://www.mathml.org">
<p>...<math:expr>...</math:expr>...</p>...</TEI>

The xml namespace is used by the TEI for global attributes xml:id and xml:lang
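
A minimal sketch of how a namespace-aware parser sees such a document, using Python's xml.etree.ElementTree. The sample content and the xml:id value are invented; the math URI is simply the one used in the example above. ElementTree expands every prefix to its URI and writes names as {uri}local.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
MATH_NS = "http://www.mathml.org"  # the URI used in the example above
XML_NS = "http://www.w3.org/XML/1998/namespace"  # bound to the xml: prefix

doc = f"""<TEI xmlns="{TEI_NS}" xmlns:math="{MATH_NS}">
<p xml:id="p1" xml:lang="en">text <math:expr>x</math:expr> more</p>
</TEI>"""

root = ET.fromstring(doc)
p = root.find(f"{{{TEI_NS}}}p")        # default namespace applies to <p>
expr = p.find(f"{{{MATH_NS}}}expr")    # prefixed element
paragraph_id = p.get(f"{{{XML_NS}}}id")

print(root.tag)
print(expr.tag)
print(paragraph_id)
```

The prefix itself (math, or any other) is only a local abbreviation; what identifies the element is the URI it stands for.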

3.9. Test your XML knowledge

  • Which are correct?
    • <seg>some text</seg>
    • <seg><foo>some</foo> <bar>text</bar></seg>
    • <seg><foo>some <bar></foo> text</bar></seg>
    • <seg type="text">some text</seg>
    • <seg type='text'>some text</seg>
    • <seg type=text>some text</seg>
    • <seg type = "text">some text</seg>
    • <seg type="text">some text<seg/>
    • <seg type="text">some text<gap/></seg>
    • <seg type="text">some text< /seg>
    • <seg type="text">some text</Seg>
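
Once you have made your own guesses, one way to check them is to hand each candidate to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the is_well_formed helper is written for this exercise and tests well-formedness only, not validity against any schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(fragment):
    """Return True if the parser accepts the fragment as a complete document."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        return False
    return True

candidates = [
    "<seg>some text</seg>",
    "<seg><foo>some</foo> <bar>text</bar></seg>",
    "<seg><foo>some <bar></foo> text</bar></seg>",
    '<seg type="text">some text</seg>',
    "<seg type='text'>some text</seg>",
    "<seg type=text>some text</seg>",
    '<seg type = "text">some text</seg>',
    '<seg type="text">some text<seg/>',
    '<seg type="text">some text<gap/></seg>',
    '<seg type="text">some text< /seg>',
    '<seg type="text">some text</Seg>',
]

results = [is_well_formed(candidate) for candidate in candidates]
for candidate, ok in zip(candidates, results):
    print("well-formed    " if ok else "NOT well-formed", candidate)
```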

4. An Introduction to the TEI

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts chiefly in the humanities, social sciences and linguistics.

4.1. Why the TEI?

The TEI provides
  • a language-independent framework for defining markup languages
  • a very simple consensus-based way of organizing and structuring textual (and other) resources...
  • ... which can be enriched and personalized in highly idiosyncratic or specialised ways
  • a very rich library of existing specialised components
  • an integrated suite of standard stylesheets for delivering schemas and documentation in various languages and formats
  • a large and active open source style user community

4.2. Relevance

Why would you want those things?
  • because we need to interchange resources
    • between people
    • (increasingly) between machines
  • because we need to integrate resources
    • of different media types
    • from different technical contexts
  • because we need to preserve resources
    • cryogenics is not the answer!
    • we need to preserve metadata as well as data

4.3. The virtuous circle of encoding

4.4. The scope of intelligent markup

Even within the original scope of the TEI we have
  • basic structural and functional components
  • diplomatic transcription, images, annotation
  • links, correspondence, alignment
  • data-like objects such as dates, times, places, persons, events (named entity recognition)
  • meta-textual annotations (correction, deletion, etc)
  • linguistic analysis at all levels
  • contextual metadata of all kinds
  • ... and so on and so forth

Is it possible to delimit encyclopaedically all possible kinds of markup?

4.5. Reasons for attempting to define a common framework

  • re-usability and repurposing of resources
  • modular software development
  • lower training costs
  • ‘frequently answered questions’ — common technical solutions for different application areas

The TEI was designed to support multiple views of the same resource

4.6. The wrong way of thinking about the TEI

  • A traditional (if large) research project with soft funding, driven by academic curiosity
  • A codification of best practice, with no formal maintenance method
  • Uncertain licensing and development practices
  • Unmanageably complex except by the priesthood -- or, at the same time, too simple for real scholarly work
  • Lack of specific tools to do something with a TEI text
  • Failure to market the advantages of rich markup

4.7. The good way of thinking about the TEI

  • A funded international consortium supported by major institutions
  • Proper open source licence, with openly visible development on SourceForge
  • Architecture rethought to facilitate expansion and integration with other systems
  • Self documenting, each release fully validated, delivered using standard mechanisms
  • Publicly available processing tools managed together with the Guidelines
  • Active developer community, wiki, SIGs, test files, exemplars, regular updates...

4.8. Support for many schema languages

The TEI uses a subset of itself called TEI ODD as a base to generate both project documentation and schemas:
  • TEI schemas can be generated for
    • ISO RELAX NG language
    • W3C Schema Language
    • XML DTD language
  • Internally, content models are defined using RELAX NG syntax
  • Datatypes are defined in terms of W3C datatypes
  • Some facilities (e.g. alternation, namespaces) cannot be expressed in DTDs -- RELAX NG schema is recommended
  • Additional constraints can be expressed in Schematron

4.9. Two reasons why standards fail

  • The theory is not yet ripe
  • The "not invented here" attitude: the community of users is too diverse

4.10. Coping with partially-baked ideas

In a TEI ODD, you can ...
  • constrain the domain of a value list
  • enforce Schematron rules about e.g. codependency
  • provide new elements in your own namespace
  • remove (non-mandatory) child elements

From the single TEI ODD you can then generate the required schemas, as well as your project documentation.

4.11. Do not re-invent the wheel

  • TEI P5 has extensive I18N features for translation of ...
    • schema objects
    • schema documentation
  • TEI is hospitable to other namespaces:
    • You can use SVG for graphics, MathML for math, or any other markup if you like
  • TEI ODD also includes an <equiv> element for mapping to external ontologies

4.12. For example

Embedding SVG within TEI:
<figure>
 <svg xmlns="http://www.w3.org/2000/svg"
   width="6cm" height="5cm" viewBox="6 3 6 5">
  <ellipse style="fill: #ffffff"
    cx="9.75"
    cy="6.35"
    rx="2.75"
    ry="2.35"/>
 </svg>
</figure>
A user-defined attribute:
<div xmlns:my="http://www.example.org/ns/nonTEI">
 <p n="12" my:topic="rabbits">Flopsy, Mopsy, Cottontail, and Peter...</p>
</div>

NVDL processors validate against multiple namespace schemas, so you can validate each part individually
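
The idea of treating each namespace separately can be sketched in a few lines of Python. This is not an NVDL implementation and performs no validation; it only groups elements by namespace, which is the first step such a processor takes. The sample combines the two examples above inside a TEI-namespaced <div>, an arrangement assumed here for illustration, and namespace_of is a helper written for this sketch.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"
SVG_NS = "http://www.w3.org/2000/svg"
MY_NS = "http://www.example.org/ns/nonTEI"

doc = f"""<div xmlns="{TEI_NS}" xmlns:my="{MY_NS}">
 <p n="12" my:topic="rabbits">Flopsy, Mopsy, Cottontail, and Peter...</p>
 <figure>
  <svg xmlns="{SVG_NS}" width="6cm" height="5cm">
   <ellipse cx="9.75" cy="6.35" rx="2.75" ry="2.35"/>
  </svg>
 </figure>
</div>"""

def namespace_of(tag):
    """Return the namespace URI from an ElementTree '{uri}local' name."""
    return tag[1:].split("}")[0] if tag.startswith("{") else ""

root = ET.fromstring(doc)

# Count the elements belonging to each namespace.
elements_per_namespace = Counter(namespace_of(el.tag) for el in root.iter())

# The user-defined attribute is addressed by its namespace URI, not its prefix.
topic = root[0].get(f"{{{MY_NS}}}topic")

print(dict(elements_per_namespace))
print(topic)
```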

4.13. Conformance issues

A document is TEI Conformant if and only if it:
  • is a well-formed XML document
  • can be validated against a TEI Schema, that is, a schema derived from the TEI Guidelines
  • conforms to the TEI Abstract Model
  • uses the TEI Namespace (and other namespaces where relevant) correctly
  • is documented by means of a TEI Conformant ODD file which refers to the TEI Guidelines
or if it can be transformed automatically into such a document using some TEI-defined procedures (it is then considered TEI-Conformable).

Standardization should not mean ‘Do what I do’, but rather ‘Explain what you do in terms I can understand’

4.14. Exercise 1

You will be provided with a handout containing a prose manuscript description of one of the manuscripts. Follow the instructions on the handout and make notes as to which major sections you think the description could be divided into. We'll report back in a few minutes.



Date: March 2009
Copyright University of Oxford