IT Services, University of Oxford

1. 1986 was a long time ago...

  • The first IBM PC virus – Brain – appears
  • Construction of the Channel Tunnel begins
  • The Soviet Union launches space station Mir
  • Olof Palme assassinated
  • Records of the year: Raising Hell (Run-DMC) and Graceland (Paul Simon)

2. ...but we used computers then

  • Corpus linguistics
  • Databases on CD ROM
  • Large-scale lexical resources already existed (e.g. TLF, TLG, LASLA...)
  • Digital lexicography (e.g. OED)
  • Document management systems (e.g. TeX, Scribe, troff...)
    • some proprietary (and expensive), some research
  • Text archives
  • Hypertext theory

but no World Wide Web and not many desktop PCs...

3. Birth of the Text Encoding Initiative

  • Spring 1987: European workshops on standardisation of historical data (J.P. Genet, M. Thaller)
  • Autumn 1987: NEH funds an exploratory international workshop on the feasibility of defining "text encoding guidelines"
Figure 1. Vassar College, Poughkeepsie

4. Today's question:

  • So the TEI is very old!
  • It comes from a time before the Web, before the DVD, the mobile phone, cable TV, or Microsoft Word
  • Not much in computing survives 5 years, never mind 20
  • What relevance can it possibly have today?
  • Why is it still here, and how has it survived?

5. Is the TEI still relevant?

  • With XML everyone can create their own markup system and still share data!
  • In the Semantic Web, XML systems will all understand each other's data!
If we have
  • historical data marked up with a Historical Markup Language
  • linguistic data marked up with a Linguistic Markup Language
  • metadata marked up with a Metadata Markup Language
how will we integrate resources or ask interesting questions?

Haven't we been here before?

6. Relevance 1

The TEI provides
  • a language-independent framework for defining markup languages
  • a very simple consensus-based way of organizing and structuring textual (and other) resources...
  • ... which can be enriched and personalized in highly idiosyncratic or specialised ways
  • a very rich library of existing specialised components
  • an integrated suite of standard stylesheets for delivering schemas and documentation in various languages and formats
  • a large and active user community, run in open-source style

7. Relevance 2

Why would you want those things?
  • because we need to interchange resources
    • between people
    • (increasingly) between machines
  • because we need to integrate resources
    • of different media types
    • from different technical contexts
  • because we need to preserve resources
    • cryogenics is not the answer!
    • we need to preserve metadata as well as data

8. The virtuous circle of encoding

9. The scope of ‘intelligent’ markup

Even within the original scope of the TEI we have
  • basic structural and functional components
  • diplomatic transcription, images, annotation
  • links, correspondence, alignment
  • data-like objects such as dates, times, places, persons, events (‘named entity recognition’)
  • meta-textual annotations (correction, deletion, etc)
  • linguistic analysis at all levels
  • contextual metadata of all kinds
  • ... and so on and so forth

Is it possible to delimit encyclopaedically all possible kinds of markup?

10. Reasons for attempting to define a common framework

  • re-usability and repurposing of resources
  • modular software development
  • lower training costs
  • ‘frequently answered questions’ — common technical solutions for different application areas

The TEI was designed to support multiple views of the same resource

11. Old Skool TEI

  • A traditional (if large) research project with soft funding, driven by academic curiosity
  • a codification of best practice, with no formal maintenance method
  • uncertain licensing and development practices
  • perceived as unmanageably complex except by the priesthood — or simultaneously as too simple for real scholarly work
  • lack of specific tools to do something with a TEI text
  • failure to market the advantages of rich markup

12. TEI New

  • Proper open source licence, with visible development on SourceForge
  • Architecture rethought to facilitate expansion and integration with other systems
  • Self-documenting, each release fully validated, delivered using standard mechanisms
  • Publicly available processing tools managed together with the Guidelines
  • Active developer community, wiki, etc. Test files, exemplars, regular updates...
  • New governance structure, new tools, new modules...

13. Three important things about TEI P5

  1. Being a good digital citizen:
    • Support for multiple schema languages and namespaces
    • Reliance on XML, and hence on Unicode
    • Validation of attributes and datatyping
    • Use of W3C pointers and paths
  2. Making it flexible:
    • ODD: a single specification language for developers, users, and teachers, integrating schema and documentation;
    • Verifiable conformance
  3. Old annoyances removed and some new topics added

14. One Specification Language

  • A set of TEI documents is described by an ODD, which is itself a TEI document that combines:
    • references to existing declarations
    • formal declarations for elements and attributes
    • documentation and usage notes
  • Underlying this:
    • a conceptual model which abstracts from specific elements to generic classes
    • a modular architecture for combining sets of definitions
  • specifications are chainable; modifications are written in ODD with ODD as input and output
  • Roma is one interface to this: there will be others

15. For example

A TEI ODD file is a valid TEI document, containing as much discursive prose as you want, and a <schemaSpec> element to define the schema it documents

<text>
 <body>
  <div>
   <head>Our Project manual</head>
   <p>In this project we use the basic TEI structures
       with a few minor modifications to exclude elements
       we don't plan to use.</p>
   <schemaSpec ident="TEI-minimal" start="TEI">
    <moduleRef key="tei"/>
    <moduleRef key="header"/>
    <moduleRef key="core"/>
    <moduleRef key="textstructure"/>
    <!-- We don't need these drama elements: -->
    <elementSpec ident="sp" mode="delete" module="core"/>
    <elementSpec ident="speaker" mode="delete" module="core"/>
   </schemaSpec>
  </div>
 </body>
</text>

16. Support for many schema languages

  • TEI schemas can be generated for
    • Traditional XML DTD language
    • ISO RELAX NG language
    • W3C XML Schema language
  • Content models are defined using RELAX NG syntax
  • Datatypes are defined in terms of W3C datatypes
  • Some facilities (e.g. alternation, namespaces) cannot be expressed in DTD
  • Additional constraints can be expressed in Schematron

17. Two reasons why standards fail

  • the theory is not yet ripe
  • "not invented here": the community of users is too diverse

18. Coping with partially-baked ideas

In a TEI ODD, you can
  • constrain the domain of a value list
  • enforce Schematron rules about e.g. codependency
  • provide new elements in your own namespace
  • remove (non-mandatory) child elements

19. New elements

A schema is a grammar. How can you add new terminals to an existing syntax?

  • All content models are expressed indirectly, by reference to element classes rather than elements
  • Hence adding a new element is simply a matter of saying which class/es it belongs to

The TEI schema is also enriched with semantics. How can you explain what a new element means?

  • class membership also conveys some semantics
  • ODD includes detailed documentation
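
For instance, a new element might be declared along the following lines. The element name species and the example.org namespace are assumptions made for this sketch; model.nameLike and att.global are existing TEI classes.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:rng="http://relaxng.org/ns/structure/1.0"
             ident="species" ns="http://www.example.org/ns/nonTEI" mode="add">
 <gloss>species name</gloss>
 <desc>contains the name of a biological species (a hypothetical
   element, not part of the TEI).</desc>
 <classes>
  <!-- membership of a model class makes the element available
       wherever other name-like elements are already permitted -->
  <memberOf key="model.nameLike"/>
  <!-- membership of an attribute class supplies the global attributes -->
  <memberOf key="att.global"/>
 </classes>
 <content>
  <rng:text/>
 </content>
</elementSpec>
```

No existing content model has to be edited: the class reference does the work, while gloss and desc carry the human-readable semantics.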

20. Coping with the NIH Syndrome

  • TEI P5 has extensive I18N features for translation of
    • schema objects
    • schema documentation
  • Cf. Roma at http://www.tei-c.org/Roma/
  • TEI is hospitable to other namespaces
    • so you can use SVG for graphics, MathML for math, Word Table markup if you like
    • (but note this doesn't solve the Other Overlap Problem)
  • ODD also includes an <equiv> element for mapping to external ontologies
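
A hedged sketch of how these features might be combined in one customization: the docLang and targetLang attributes ask the processor to prefer French for documentation and for schema object names where translations exist, altIdent supplies a local name for an element, and equiv points at an external definition. The identifier TEI-fr, the name para, and the ontology URI are all invented for illustration.

```xml
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="TEI-fr" start="TEI" docLang="fr" targetLang="fr">
 <moduleRef key="tei"/>
 <moduleRef key="header"/>
 <moduleRef key="core"/>
 <moduleRef key="textstructure"/>

 <elementSpec ident="p" module="core" mode="change">
  <!-- local name to be used in the generated schema -->
  <altIdent>para</altIdent>
  <!-- hypothetical mapping to a concept in an external ontology -->
  <equiv name="Paragraph" uri="http://www.example.org/ontology/"/>
 </elementSpec>
</schemaSpec>
```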

21. For example

Embedding SVG within TEI:
<figure>
 <svg xmlns="http://www.w3.org/2000/svg"
  width="6cm" height="5cm" viewBox="6 3 6 5">
  <ellipse xmlns="http://www.w3.org/2000/svg"
    style="fill: #ffffff"
    cx="9.75"
    cy="6.35"
    rx="2.75"
    ry="2.35"/>
 </svg>
</figure>
A user-defined attribute:
<div   xmlns:my="http://www.example.org/ns/nonTEI">
 <p n="12" my:topic="rabbits">Flopsy, Mopsy, Cottontail, and Peter...</p>
</div>
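
A user-defined attribute of this kind could be declared in the project's ODD roughly as follows. This is a sketch only: the description and the choice of plain text as the datatype are assumptions, not something the example above prescribes.

```xml
<elementSpec xmlns="http://www.tei-c.org/ns/1.0"
             xmlns:rng="http://relaxng.org/ns/structure/1.0"
             ident="p" module="core" mode="change">
 <attList>
  <!-- the ns attribute places the new attribute outside the TEI namespace -->
  <attDef ident="topic" ns="http://www.example.org/ns/nonTEI"
          mode="add" usage="opt">
   <desc>names the topic of the paragraph (an illustrative attribute,
     not part of the TEI)</desc>
   <datatype>
    <rng:text/>
   </datatype>
  </attDef>
 </attList>
</elementSpec>
```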

James Clark's oNVDL processor validates a document against schemas for multiple namespaces

22. Conformance issues

A document is TEI Conformant iff it
  • is a well-formed XML document
  • can be validated against a TEI Schema, that is, a schema derived from the TEI Guidelines
  • conforms to the TEI Abstract Model
  • uses the TEI Namespace (and other namespaces where relevant) correctly
  • is documented by means of a TEI Conformant ODD file which refers to the TEI Guidelines
  • is ‘conformable’, that is, can be transformed automatically using some TEI-defined procedures into a TEI Conformant document.

Standardization should not mean ‘Do what I do’, but rather ‘Explain what you do in terms I can understand’

23. Evolution works...

  • make modifications in your own namespace
  • document them in an ODD
  • propose them to the TEI Council as amendments or feature requests
  • TEI P5 now has a six-month release cycle...

Visit http://www.tei-c.org for more background info

Visit http://tei.sf.net to download



Lou Burnard. Date: Marrakech, 2008
Copyright University of Oxford