Text only | Skip links
Skip links||IT Services, University of Oxford

1. What is markup?

In order to talk about texts, markup and encoding of texts, we need to understand what we mean by these basic concepts. When we talk about text encoding, what do we mean by a text? What is in a text and what assumptions do we make in reading them?

1.1. What's in a text?

1.2. What's in a text (2)?

1.3. What's in a text (3)?

1.4. The ontology of text

Where is the text?
  • in the shape of letters and their layout?
  • in the original from which this copy derives?
  • in the stories we read into it? or in its author's intentions?

A "text" is an abstraction, created by or for a community of readers. Markup encodes and makes concrete such abstractions.

1.5. Encoding of texts

  • Texts are more than sequences of encoded glyphs
    • They have structure and content
    • They also have multiple readings
  • Encoding, or markup, is a way of making these things explicit

Only that which is explicit can be reliably processed

1.6. Styles of markup

  • In the beginning there was procedural markup
    RED INK ON; print balance; RED INK OFF
  • which being generalised became descriptive markup <balance type='overdrawn'>some numbers</balance>
  • also known as encoding or annotation

descriptive markup allows for easier re-use of data

1.7. What's the point of markup?

  • To make explicit (to a machine) what is implicit (to a person)
  • To add value by supplying multiple annotations
  • To facilitate re-use of the same material
    • in different formats
    • in different contexts
    • by different users

It's (usually) more useful to markup what we think things are than what they look like

1.8. Separation of form and content

  • Presentational markup cares more about fonts and layout than meaning
  • Descriptive markup says what things are, and leaves the rendition of them for a separate step
  • Separating the form of something from its content makes its re-use more flexible
  • It also allows easy changes of presentation across a large number of documents

1.9. Markup as a scholarly activity

  • The application of markup to a document can be an intellectual activity
  • In deciding what markup to apply, and how this represents the original, one is undertaking the task of an editor
  • There is (almost) no such thing as neutral markup -- all of it involves interpretation
  • Markup can assist in answering research questions, and the deciding what markup is needed to enable such questions to be answered can be a research activity in itself
  • Good textual encoding is never as easy or quick as people would believe
  • Detailed document analysis is needed before encoding for the resulting markup to be useful

1.10. What does markup capture?

Compare
<hi rend="dropcap">H</hi>&amp;WYN;ÆT WE GARDE <lb/>na in gear-dagum þeod-cyninga <lb/>þrym gefrunon,
hu ða æþelingas <lb/>ellen fremedon. oft scyld scefing sceaþe
<add>na</add>
<lb/>þreatum, moneg<expan>um</expan> mægþum meodo-setl
<add>a</add>
<lb/>of<damage>
 <desc>blot</desc>
</damage>teah ...
and
<lg>
 <l>Hwæt! we Gar-dena in gear-dagum</l>
 <l>þeod-cyninga þrym gefrunon,</l>
 <l>hu ða æþelingas ellen fremedon,</l>
</lg>
<lg>
 <l>Oft Scyld Scefing sceaþena þreatum,</l>
 <l>monegum mægþum meodo-setla ofteah;</l>
 <l>egsode Eorle, syððan ærest wearþ</l>
 <l>feasceaft funden...</l>
</lg>

1.11. A useful mental exercise

Imagine you are going to markup several thousand pages of complex material....
  • Which features are you going to markup?
  • Why are you choosing to markup this feature?
  • How reliably and consistently can you do this?

Now, imagine your budget has been halved. Repeat the exercise!

2. XML, TEI, etc

Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.

2.1. Why the TEI?

The TEI provides
  • a language-independent framework for defining markup languages
  • a very simple consensus-based way of organizing and structuring textual (and other) resources...
  • ... which can be enriched and personalized in highly idiosyncratic or specialised ways
  • a very rich library of existing specialised components
  • an integrated suite of standard stylesheets for delivering schemas and documentation in various languages and formats
  • a large and active open source style user community

2.2. Relevance

Why would you want those things?
  • because we need to interchange resources
    • between people
    • (increasingly) between machines
  • because we need to integrate resources
    • of different media types
    • from different technical contexts
  • because we need to preserve resources
    • cryogenics is not the answer!
    • we need to preserve metadata as well as data

2.3. The virtuous circle of encoding

2.4. The scope of intelligent markup

Even within the original scope of the TEI we have
  • basic structural and functional components
  • diplomatic transcription, images, annotation
  • links, correspondence, alignment
  • data-like objects such as dates, times, places, persons, events (named entity recognition)
  • meta-textual annotations (correction, deletion, etc)
  • linguistic analysis at all levels
  • contextual metadata of all kinds
  • ... and so on and so forth

Is it possible to delimit encyclopaedically all possible kinds of markup?

2.5. Reasons for attempting to define a common framework

  • re-usability and repurposing of resources
  • modular software development
  • lower training costs
  • ‘frequently answered questions’ — common technical solutions for different application areas

The TEI was designed to support multiple views of the same resource

2.6. The wrong way of thinking about the TEI

  • A traditional (if large) research project with soft funding, driven by academic curiosity
  • a codification of best practice, with no formal maintenance method
  • uncertain licencing and development practices
  • unmanageably complex except by the priesthood — or simultaneously as too simple for real scholarly work
  • lack of specific tools to do something with a TEI text
  • failure to market the advantages of rich markup

2.7. The good way of thinking about the TEI

  • Proper open source licence, with visible development on Sourceforge
  • Architecture rethought to facilitate expansion and integration with other systems
  • Self documenting, each release fully validated, delivered using standard mechanisms
  • Publicly available processing tools managed together with the Guidelines
  • Active developer community, wiki, etc. Test files, exemplars, regular updates...

2.8. One Specification Language

  • A set of TEI documents is described by an ODD, which is itself a TEI document that combines:
    • references to existing declarations
    • formal declarations for elements and attributes
    • documentation and usage notes
  • Underlying this:
    • a conceptual model which abstracts from specific elements to generic classes
    • a modular architecture for combining sets of definitions
  • specifications are chainable; modifications are written in ODD with ODD as input and output
  • Roma is one interface to this: there will be others

2.9. Support for many schema languages

  • TEI schemas can be generated for
    • XML DTD language
    • ISO RELAX NG language
    • W3C Schema Language
  • Content models are defined using RELAX NG syntax
  • Datatypes are defined in terms of W3C datatypes
  • Some facilities (e.g. alternation, namespaces) cannot be expressed in DTD
  • Additional constraints can be expressed in Schematron

2.10. Coping with partially-baked ideas

In a TEI ODD, you can ...
  • constrain the domain of a value list
  • enforce schematron rules about e.g. codependency
  • provide new elements in your own namespace
  • remove (non-mandatory) child elements

2.11. New elements

A schema is a grammar. How can you add new terminals to an existing syntax?

  • All content models are expressed indirectly, by reference to element classes rather than elements
  • Hence adding a new element is simply a matter of saying which class/es it belongs to

The TEI schema is also enriched with semantics. How can you explain what a new element means?

  • Class membership also conveys some semantics
  • ODD includes detailed documentation

2.12. Do not re-invent the wheel

  • TEI P5 has extensive I18N features for translation of ...
    • schema objects
    • schema documentation
  • TEI is hospitable to other namespaces:
    • You can use SVG for graphics, MathML for math, Word Table markup if you like
  • ODD also includes an <equiv> element for mapping to external ontologies

2.13. For example

Embedding SVG within TEI:
<figure>
 <svg xmlns="http://www.w3.org/2000/svg"
  width="6cmheight="5cmviewBox="6 3 6 5">

  <ellipse xmlns="http://www.w3.org/2000/svg"
  
    style="fill: #ffffff"
    cx="9.75"
    cy="6.35"
    rx="2.75"
    ry="2.35"/>
</svg>
</figure>
A user-defined attribute:
<div   xmlns:my="http://www.example.org/ns/nonTEI">
 <p n="12my:topic="rabbits">Flopsy, Mopsy, Cottontail, and Peter...</p>
</div>

NVDL processors validate against multiple namespace schemas

2.14. Conformance issues

A document is TEI Conformant if and only if it ...
  • is a well-formed XML document
  • can be validated against a TEI Schema, that is, a schema derived from the TEI Guidelines
  • conforms to the TEI Abstract Model
  • uses the TEI Namespace (and other namespaces where relevant) correctly
  • is documented by means of a TEI Conformant ODD file which refers to the TEI Guidelines

Or if it can be transformed automatically using some TEI-defined procedures into such a document. (it is TEI-conformable)

Standardization should not mean ‘Do what I do’, but rather ‘Explain what you do in terms I can understand’



Sebastan Rahtz and other TEI@Oxford authors. Date: February 2009
Copyright University of Oxford