IT Services, University of Oxford


1. What do I need to make GOOD XML documents?

In this session we consider what kinds of constraints we might want to apply to our XML documents, and what components we need to implement them.

1.1. In XML a schema is optional!

XML lets you make up your own tags and mix tags from different namespaces, and it doesn't actually require a schema...

  • The XML concept is dangerously powerful:
    • XML elements are light in semantics
    • one man's <p> is another's <para> (or is it?)
    • the appearance of interchangeability may be worse than its absence
  • But XML is too good to ignore
    • mainstream software development
    • proliferation of tools
    • the language of the web
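
As a minimal sketch of the point that no schema is required, the Python fragment below parses a document whose tags are invented and which borrows one element from the XHTML namespace. The document content and the function name `list_tags` are assumptions made for illustration. The parser checks only that the text is well-formed; it has no opinion on whether `<para>` means the same thing as anyone else's `<p>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: invented tags plus one element borrowed from XHTML.
DOC = """<recipe xmlns:h="http://www.w3.org/1999/xhtml">
  <title>Toast</title>
  <para>Brown the <h:em>bread</h:em> on both sides.</para>
</recipe>"""


def list_tags(xml_text):
    """Parse without any schema and return every element name in document order."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


tags = list_tags(DOC)
for tag in tags:
    # Namespaced elements are reported as {namespace-uri}local-name.
    print(tag)
```

Nothing here says which elements are allowed or where they may appear; that is the job a schema takes on.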

1.2. What can a schema do for you?

  • ensure that your documents use only predefined elements, attributes, and entities
  • enforce structural rules such as ‘every chapter must begin with a heading’ or ‘recipes must include an ingredient list’
  • make sure that the same thing is always called by the same name

Schema languages vary in the amount of validation they support.
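
To make a structural rule such as "every chapter must begin with a heading" concrete, here is a hand-written check in Python. It is only a sketch of the kind of constraint a schema expresses declaratively; it is not a schema language and not how any particular validator works. The element names `chapter` and `head`, the function name, and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def chapters_missing_heading(xml_text, chapter="chapter", heading="head"):
    """Return 1-based positions of chapters whose first child element is not a heading.

    Assumption: any text before the first child element is ignored.
    """
    root = ET.fromstring(xml_text)
    bad = []
    for position, chap in enumerate(root.iter(chapter), start=1):
        children = list(chap)
        if not children or children[0].tag != heading:
            bad.append(position)
    return bad


SAMPLE = """<book>
  <chapter><head>One</head><p>Text.</p></chapter>
  <chapter><p>No heading here.</p><head>Late</head></chapter>
</book>"""

print(chapters_missing_heading(SAMPLE))
```

In practice a rule like this would be stated once in the schema, so that every editor and validator enforces it in the same way.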

1.3. What kinds of validation do we need?

1.4. What can the TEI do for you?

The TEI provides a framework for the definition of multiple schemas

  • it defines and names several hundred useful textual distinctions
  • it provides a set of modules that can be used to define schemas making those distinctions
  • it provides a customization mechanism for modifying and combining those definitions with new ones using the same conceptual model

1.5. Where did the TEI come from?

  • Originally, a research project within the humanities
    • Sponsored by three professional associations
    • Funded 1990-1994 by US NEH, EU LE Programme etc.
  • Major influences
    • digital libraries and text collections
    • language corpora
    • scholarly datasets
  • International consortium established June 1999 (see http://www.tei-c.org/)

1.6. Goals of the TEI

  • better interchange and integration of scholarly data
  • support for all texts, in all languages, from all periods
  • guidance for the perplexed: what to encode — hence, a user-driven codification of existing best practice
  • assistance for the specialist: how to encode — hence, a loose framework into which unpredictable extensions can be fitted

These apparently incompatible goals result in a highly flexible, modular environment.

1.7. TEI Deliverables

  • A set of recommendations for text encoding, covering both generic text structures and some highly specific areas based on (but not limited by) existing practice
  • A very large collection of element definitions with associated declarations for various schema languages
  • A modular system for creating personalized schemas or DTDs from the foregoing
  • Software that transforms TEI documents

For the full picture see http://www.tei-c.org/TEI/Guidelines/

1.8. Legacy of the TEI

  • a way of looking at what text really is
  • a codification of current scholarly practice
  • (crucially) a set of shared assumptions and priorities about the digital agenda:
    • focus on content and function (rather than presentation)
    • identify generic solutions (rather than application-specific ones)

2. Software Options

This section provides a brief overview of available technology for creating and editing TEI XML documents.

2.1. What tools do we need?

  • Appropriately expressive vocabularies (eg TEI XML)
  • Syntax-checking document creation tools (ie editors)
  • Document transformation tools
  • Document delivery tools
  • Document storage and management tools
  • Programming interfaces
  • Specialized applications

2.2. Two stages to get a TEI text

  • capture the text
  • create the markup
These stages often happen simultaneously, but not always.

Note that the markup does not necessarily all have to be in the same file.

2.3. Creating the text

  • scanning/OCR
  • scouring the web
  • data-entry vendors
  • software to add tagging automatically
  • editors
followed by
  • validators, well-formedness checkers
  • proofing aids, data integrity checkers
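
A well-formedness checker can be very small. The sketch below uses Python's standard `xml.etree.ElementTree` parser and reports where the first error occurs; the function name `check_well_formed` is an assumption for illustration. It checks well-formedness only: this parser does not validate against a DTD or schema, so a separate validator is still needed for that step.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, (line, column, message))."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        line, column = error.position
        return False, (line, column, str(error))
    return True, None


for candidate in ("<text><p>fine</p></text>", "<text><p>unclosed</text>"):
    ok, problem = check_well_formed(candidate)
    print(candidate, "->", "well-formed" if ok else problem)
```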

2.4. OCR/Data Entry

  • Scanning and OCR software generally produces only minimally marked-up HTML or Word output (e.g., recognizing paragraph breaks, font changes etc).
  • Data-entry vendors will in theory insert whatever markup you want, but at a price. They generally prefer HTML, TEI Lite, or some similarly well-known DTD.
  • (TEI has sponsored creation of 'TEI Tite': a standard slimmed-down vocabulary for initial encoding in mass-digitisation projects)

2.5. For hackers only...

For the Punch project, we found the Gutenberg archive of HTML versions useful!

  • wget utility to hoover up a website
  • lynx utility to save formatted plaintext version of an HTML file
  • tidy utility to convert arbitrary HTML to well-formed XHTML, which can then be tweaked with XSLT scripts
  • ... or just plain old perl
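
The utilities above are command-line tools. As an alternative sketch in Python (not what the project itself used), the fragment below extracts rough plain text from an HTML page with the standard `html.parser` module, in the spirit of the lynx step. The class and function names and the list of block-level tags are assumptions, and the output is far cruder than lynx's formatted rendering.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, starting a new line at block-level elements."""

    BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore the contents of script and style elements.
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_text("<h1>Punch</h1><p>Vol. 1 &amp; 2</p><script>x()</script>"))
```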


2.6. Editor types

Editing tools cover a wide spectrum:
  • Basic text editors
  • General programmers' editors
  • XML-aware programmers' editors
  • XML-specific editors
  • Word-processors which can export/import XML
  • Data-entry forms
  • Image-specific editors

Different strokes suit different folks.

2.7. Spoilt for choice...

Figure 1. Almost anyone can write an XML editor

2.8. Things to look for in specialist XML editors

  • schema-aware
  • constraining element entry
  • IDE features
  • customizable
  • validation, preferably continual
  • multiple display views (as tree, with tags, formatted etc)
  • folding structures
  • context-sensitive help
For XML editing, Emacs, oXygen, jEdit, XMLSpy, Stylus Studio, and Arbortext Adept are all worth a look.

2.9. oXygen screenshot 1

2.10. oXygen screenshot 2

2.11. oXygen screenshot 3

2.12. Tagless editing in oXygen

2.13. What is missing, or hard, in the TEI editing world

  • Only a few editors, such as oXygen 9 or XMetaL, combine visual feedback with code editing
  • Visual, or WYSIWYG, editors embedded in web applications (eg in a CMS); most web editors are for XHTML (cf Google Docs)
  • Reliable conversion to and from Word and OpenOffice styles, which is hampered by:
    • the general inability of word-processors to nest inline inside inline, or block inside block
    • the difficulty of extrapolating a hierarchical structure from a sequence of free-standing headings at assorted levels
    • the tedious programming required to trace the ancestry of styles in Word and OO
    • the lack of a facility in OO to stop the user formatting by hand
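
The difficulty of inferring hierarchy from free-standing headings can be seen in a small sketch. The Python below nests a flat list of (level, title) pairs using a stack; it is one plausible heuristic, not the method of any particular converter, and the function names and sample headings are assumptions. Levels are assumed to be positive integers.

```python
def nest_headings(headings):
    """Turn a flat list of (level, title) pairs into a tree of nested sections.

    Each section is a dict: {"level": int, "title": str, "children": [...]}.
    A heading becomes a child of the nearest earlier heading with a smaller level.
    """
    root = {"level": 0, "title": None, "children": []}
    stack = [root]
    for level, title in headings:
        # Close every open section that is at the same depth or deeper.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"level": level, "title": title, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


def outline(sections, depth=0):
    """Render the tree as indented lines, to make the inferred nesting visible."""
    lines = []
    for section in sections:
        lines.append("  " * depth + section["title"])
        lines.extend(outline(section["children"], depth + 1))
    return lines


FLAT = [
    (1, "Introduction"),
    (3, "A stray sub-sub-heading"),
    (2, "Background"),
    (1, "Method"),
]
print("\n".join(outline(nest_headings(FLAT))))
```

The stray level-3 heading ends up at the same depth as the level-2 heading that follows it. Whether that is what the author intended cannot be recovered from the styles alone, which is why this kind of conversion is hard to make reliable.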

Date: 2008-03-05
Copyright University of Oxford