IT Services, University of Oxford

1. Project starting points

ENRICH
EU project on manuscript description
OxGarage
Service derived from ENRICH to perform document conversion
OUCS
maintaining web pages of Oxford University Computing Service
TEI
systems for producing TEI Guidelines and documentation from ODD
ISO
managing ISO standards using TEI markup

I shall be addressing the problem of Word <--> TEI XML conversion for these projects.

2. ISO

ISO Central Secretariat want to introduce XML into their production workflow. Some constraints:
  • Authors almost all work in Word
  • ISO want an XML representation of the document
  • For drafting revisions, it must be possible to go back to editing a finished standard in Word again
  • Management of additions and corrections is essential
  • Standards go through several stages of drafting independently of ISO, then several stages with secretariat
  • The documents have a very rigid structure, components and business rules
  • The metadata is managed by a conventional database at ISO

3. Some ISO business rules

  • A Foreword clause in the front matter is mandatory
  • A Scope clause in the body is mandatory
  • There must be a set of normative references to other standards
  • Terminology must be defined in a specific section
  • A clause must contain at least two subclauses

There is a whole ISO standard on how to draft ISO standards…
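
Rules of this kind can be checked mechanically once the document is in XML. The following is a minimal Python sketch only, not the project's implementation (which records such rules in Schematron). The type values "foreword" and "scope" and the n attribute on clauses are assumptions made for illustration, and the subclause rule is read here as "a clause that is subdivided must have at least two subclauses".

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}


def check_rules(root):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    # Hypothetical type values; the real ISO customization may differ.
    if root.find(".//tei:front/tei:div[@type='foreword']", NS) is None:
        problems.append("front matter has no Foreword clause")
    if root.find(".//tei:body/tei:div[@type='scope']", NS) is None:
        problems.append("body has no Scope clause")
    # A clause with exactly one subclause breaks the "at least two" rule.
    for div in root.iter(f"{{{TEI}}}div"):
        if len(div.findall("tei:div", NS)) == 1:
            problems.append(f"clause {div.get('n', '?')} has only one subclause")
    return problems


# Toy document, invented for this sketch.
SAMPLE = f"""<TEI xmlns="{TEI}"><text>
  <front><div type="foreword"><p>Foreword text</p></div></front>
  <body>
    <div type="scope" n="1"><p>Scope text</p></div>
    <div n="2"><div n="2.1"><p>Lonely subclause</p></div></div>
  </body>
</text></TEI>"""

problems = check_rules(ET.fromstring(SAMPLE))
for problem in problems:
    print(problem)
```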

4. The TEI ISO project

  • Following a public tender issued in 2007, OUCS/TEI was selected as a supplier in March 2008 and delivered the first proof of concept in November 2008, with assistance from Brigham Young University (Alan Melby and Jarom McDonald) on the Word template
  • Second stage pilot completed in January 2010
  • Third stage implementation completed May 2010

This is a project on authoring, not publication. No workflow is yet defined to typeset from ISO TEI XML files.

5. ISO prerequisites

  • Authors will use an ISO word template with detailed and strict guidelines on use of styles, and a Wizard to help them create structured documents
  • Lossless round-tripping between TEI and Word is essential — but that means Word --> TEI --> Word round-trip, not TEI --> Word --> TEI!
  • Enriched semantic XML where possible, eg structured references and terminology
  • I18N features, such as variant representation of numbers, must be supported: 12,5 vs 12.5
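
One way to support variant number representation is to store a single neutral value and choose the decimal sign only when rendering. The sketch below assumes exactly that. The style names "comma" and "point" and the function render_number are hypothetical, not part of the ISO template or the TEI schema.

```python
from decimal import Decimal

# Hypothetical style names mapped to the decimal sign they use.
DECIMAL_SIGN = {"point": ".", "comma": ","}


def render_number(value, style):
    """Render a neutral decimal string such as "12.5" in the given style."""
    # Decimal keeps the digits exactly as written, including trailing zeros.
    text = str(Decimal(value))
    return text.replace(".", DECIMAL_SIGN[style])


print(render_number("12.5", "comma"))
print(render_number("12.5", "point"))
```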

6. Some key technical decisions for ISO work

TEI purity
use the TEI when the semantics really are the same — no abuse
Specialist namespaces
use MathML for formulae, CALS for tables, TBX for terminology
ISO-specific data
use ISO namespace for extra attributes on existing elements
Word fudges
embed small islands of Word data if absolutely necessary (eg system allows for keeping Word tables if needed)
Validation
record business rules using Schematron

7. Ugly-headed problems

  1. Some constituencies want to author directly in XML (eg for writing a standard using ODD markup)
  2. Not all authoring environments use complex multi-namespace schemas well
  3. We convert to XML to (inter alia) check structure — but how do we report errors back to the author?

8. Implementation

  • Define single ODD file defining three schemas (Lite, Normal, and Normal_with_ODD)
  • Embed Schematron constraints in schemas
  • Write XSLT 2.0 transformations to turn body of OO XML into TEI XML and vice versa
  • Unpack a suitable empty Word document as template and replace selected components. All styling is determined by template document
  • Create web servlet to manage transformation, processing of graphics, and packing/unpacking of .docx zip files
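
The "unpack a template and replace selected components" step can be pictured as copying a zip package while swapping one part. This is a hedged Python sketch of that idea only. The real system is a web servlet driving XSLT, and the helper names replace_part and make_toy_template are invented here.

```python
import io
import zipfile


def replace_part(template_bytes, part_name, new_content):
    """Copy a .docx package, swapping one named part for new content."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == part_name:
                data = new_content
            else:
                data = src.read(item.filename)  # keep template part as-is
            dst.writestr(item.filename, data)
    return out.getvalue()


def make_toy_template():
    """Build a tiny stand-in for an empty Word document (not a valid .docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("[Content_Types].xml", "<Types/>")
        z.writestr("word/styles.xml", "<styles/>")
        z.writestr("word/document.xml", "<document>empty</document>")
    return buf.getvalue()


result = replace_part(make_toy_template(), "word/document.xml",
                      b"<document>converted body</document>")
```

Because every other part is copied unchanged, styling stays under the control of the template, which matches the design choice stated in the list above.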

9. The OOXML data format

Microsoft Office 2007 (Office 2008/2011 on a Mac) is more or less an implementation of ISO/IEC 29500 (OOXML); this defines
  • a family of interlinked XML schemas to describe office documents
  • a file hierarchy structure
  • a packaging format (zip)
There is a (smallish) difference between what is actually in Word and what should be there according to the spec.

10. The architecture of a Word docx (OOXML) file

(Useful picture from http://en.wikipedia.org/wiki/Office_Open_XML)

11. XML namespaces in Word

urn:schemas-microsoft-com:mac:vml
Drawing
http://schemas.microsoft.com/office/mac/office/2008/main
http://schemas.openxmlformats.org/markup-compatibility/2006
urn:schemas-microsoft-com:office:office
http://schemas.openxmlformats.org/officeDocument/2006/relationships
Links
http://schemas.openxmlformats.org/officeDocument/2006/math
Maths
urn:schemas-microsoft-com:vml
Another bit of drawing
urn:schemas-microsoft-com:office:word
http://schemas.openxmlformats.org/wordprocessingml/2006/main
Normal text
http://schemas.microsoft.com/office/word/2006/wordml
http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing
More drawing

12. The contents of the package

13. What are the files for?

[Content_Types].xml             MIME types of files
_rels/.rels                     links between names and objects
word/_rels/document.xml.rels    links between names and support files
word/document.xml               document body
word/media/image1.jpeg          picture
docProps/thumbnail.jpeg         document thumbnail
word/settings.xml               settings
word/webSettings.xml            settings for HTML export
word/styles.xml                 style definitions
word/numbering.xml              numbering schemes
docProps/core.xml               document properties
word/fontTable.xml              font details
docProps/app.xml                application details

All of these, except media files, are XML files (despite some weird names).
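
The claim is easy to test on any package by trying to parse each part. The sketch below does so on a toy package built in memory. The package contents and the function name xml_parts are made up for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def xml_parts(package_bytes):
    """Map each part name to True if it parses as XML, else False."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        for name in z.namelist():
            try:
                ET.fromstring(z.read(name))
                result[name] = True
            except ET.ParseError:
                result[name] = False
    return result


# Toy package: two XML parts (one with the odd .rels name) and fake image bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("_rels/.rels", "<Relationships/>")
    z.writestr("word/document.xml", "<document/>")
    z.writestr("word/media/image1.jpeg", b"\xff\xd8\xff\xe0JFIF")

report = xml_parts(buf.getvalue())
```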

14. Simple text in Word

The main building blocks are
<p>
block-level object (‘paragraph’)
<r>
inline object (‘run’)
<t>
text within a run
with corresponding style objects:
<pPr>
block-level object style rules
<rPr>
inline style rules
There is no hierarchy, just a flat set of block-level objects.
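
To make the flat model concrete, here is a small Python sketch that walks the paragraphs of a body and returns each one's style name and concatenated text. The sample input is hand-written for this sketch, using the conventional w: prefix, and is far simpler than real Word output.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs(body_xml):
    """Yield (style name or None, text) for each w:p in document order."""
    root = ET.fromstring(body_xml)
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        name = style.get(f"{{{W}}}val") if style is not None else None
        # A paragraph's text is the concatenation of all its w:t children.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        yield name, text


SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
    <w:r><w:t>Normative references</w:t></w:r></w:p>
  <w:p><w:r><w:t>ISO </w:t></w:r><w:r><w:t>13909-2:2001,</w:t></w:r></w:p>
</w:body>"""

flat = list(paragraphs(SAMPLE))
```

The heading and the paragraph after it come out as siblings. Nothing in the markup says that the second belongs to the first.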

15. Example: references in Word

16. Example: references in OOXML (Word) — 1

<p xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

  rsidR="008A0CE8"
  rsidRPr="00250571"
  rsidRDefault="008A0CE8"
  rsidP="008A0CE8">

 <pPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 >

  <pStyle xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   val="Heading1"/>

  <tabs xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <tab xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    val="clear" pos="400"/>

   <tab xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    val="clear" pos="560"/>

   <tab xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    val="left" pos="403"/>

   <tab xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    val="left" pos="562"/>
</tabs></pPr>
 <bookmarkStart xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  id="8" name="_Toc201542376"/>

 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  rsidRPr="00250571">

  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
Normative references</t></r>
 <bookmarkEnd xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  id="8"/>
</p>
<p xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

  rsidR="008A0CE8"
  rsidRPr="00250571"
  rsidRDefault="008A0CE8"
  rsidP="008A0CE8">

 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  rsidRPr="00250571">

  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
The following referenced documents are indispensable for
     the application of this document. For dated references, only
     the edition cited applies. For undated references, the latest
     edition of the referenced document (including any amendments)
     applies.</t></r></p>

17. Example: references in OOXML (Word) — 2

<p xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"

  rsidR="008A0CE8"
  rsidRPr="00250571"
  rsidRDefault="008A0CE8"
  rsidP="008A0CE8">

 <pPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 >

  <pStyle xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   val="RefNorm"/>
</pPr>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 >

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <sz xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    val="19"/>

   <szCs xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    val="19"/>
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
ISO </t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  rsidRPr="00250571">

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <sz xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    val="19"/>

   <szCs xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    val="19"/>
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
13909-2:2001,</t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  rsidRPr="00250571">

  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   xml:space="preserve">
</t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  rsidRPr="00250571">

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   />
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
Hard coal and coke</t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 >

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   />
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
 —</t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  rsidRPr="00250571">

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   />
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   xml:space="preserve">
Mechanical sampling</t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 >

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   />
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
 —</t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  rsidRPr="00250571">

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   />
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   xml:space="preserve">
</t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 >

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   />
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
Part </t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  rsidRPr="00250571">

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   />
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
2: Coal</t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
 >

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   />
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >
 —</t></r>
 <r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  rsidRPr="00250571">

  <rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
  >

   <i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   />
</rPr>
  <t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   xml:space="preserve">
Sampling from moving streams</t></r></p>
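
The example shows one italic title split across many runs that differ only in revision attributes. A converter will usually want to coalesce adjacent runs with the same formatting before interpreting them. The following Python sketch shows one way to do that. It is an illustration under assumptions, not the project's XSLT, and the helper names run_key and merged_runs are invented.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def run_key(run):
    """Formatting signature of a run: its rPr children, ignoring rsid attributes."""
    rpr = run.find(f"{{{W}}}rPr")
    if rpr is None:
        return ()
    props = []
    for child in rpr:
        local_name = child.tag.split("}")[1]
        props.append((local_name, tuple(sorted(child.attrib.items()))))
    return tuple(sorted(props))


def merged_runs(paragraph):
    """Coalesce adjacent runs with identical formatting into (key, text) pairs."""
    runs = paragraph.findall(f"{{{W}}}r")
    merged = []
    for key, group in groupby(runs, key=run_key):
        text = "".join(t.text or ""
                       for r in group for t in r.findall(f"{{{W}}}t"))
        merged.append((key, text))
    return merged


# Hand-written sample: three italic runs followed by one plain run.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:rPr><w:i/></w:rPr><w:t xml:space="preserve">Hard coal </w:t></w:r>
  <w:r w:rsidRPr="00250571"><w:rPr><w:i/></w:rPr><w:t xml:space="preserve">and </w:t></w:r>
  <w:r><w:rPr><w:i/></w:rPr><w:t>coke</w:t></w:r>
  <w:r><w:t>, 2001</w:t></w:r>
</w:p>"""

merged = merged_runs(ET.fromstring(SAMPLE))
```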

18. Example: references in XML (TEI)

<div type="normativeReferences">
 <head>Normative references</head>
 <p>The following referenced documents are indispensable
   for the application of this document. For dated
   references, only the edition cited applies. For undated
   references, the latest edition of the referenced document
   (including any amendments) applies.</p>
 <listBibl type="normativeReferences">
  <bibl type="dated">
   <publisher>ISO</publisher>
   <idno type="docNumber">13909</idno>
   <idno type="docPartNumber">1</idno>
   <edition>2001</edition>
   <title rend="italic">Hard coal and coke — Mechanical
       sampling —<seg/>Part 1: General
       introduction</title>
  </bibl>
 </listBibl>
</div>

19. Example: math in Word

20. Example: math in XML (MathML)

<p>The required overall precision on a lot should be agreed between the parties concerned. In the absence of such agreement, a value of one tenth of the ash content may be assumed.</p>
<p>The theory of precision is given in ISO 13909-7. The following equation is derived:</p>
<p>
 <formula>
  <m:math>
   <m:msub>
    <m:mrow>
     <m:mi>P</m:mi>
    </m:mrow>
    <m:mrow>
     <m:mtext>L</m:mtext>
    </m:mrow>
   </m:msub>
   <m:mo>=</m:mo>
   <m:mn>2</m:mn>
   <m:msqrt>
    <m:mfrac>
     <m:mrow>
      <m:mfrac>
       <m:mrow>
        <m:msub>
         <m:mrow>
          <m:mi>V</m:mi>
         </m:mrow>
         <m:mrow>
          <m:mtext>l</m:mtext>
         </m:mrow>
        </m:msub>
       </m:mrow>
       <m:mrow>
        <m:mi>n</m:mi>
       </m:mrow>
      </m:mfrac>
      <m:mo>+</m:mo>
      <m:mfenced separators="|">
       <m:mrow>
        <m:mn>1</m:mn>
        <m:mo>-</m:mo>
        <m:mfrac>
         <m:mrow>
          <m:mi>u</m:mi>
         </m:mrow>
         <m:mrow>
          <m:mi>m</m:mi>
         </m:mrow>
        </m:mfrac>
       </m:mrow>
      </m:mfenced>
      <m:msub>
       <m:mrow>
        <m:mi>V</m:mi>
       </m:mrow>
       <m:mrow>
        <m:mtext>m</m:mtext>
       </m:mrow>
      </m:msub>
      <m:mo>+</m:mo>
      <m:msub>
       <m:mrow>
        <m:mi>V</m:mi>
       </m:mrow>
       <m:mrow>
        <m:mtext>PT</m:mtext>
       </m:mrow>
      </m:msub>
     </m:mrow>
     <m:mrow>
      <m:mi>u</m:mi>
     </m:mrow>
    </m:mfrac>
   </m:msqrt>
  </m:math>
  <lb/>
  <c rend="tab"/>(1)</formula>
</p>

21. Example: terminology in Word

22. Example: terminology in XML (TBX)

<termEntry xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
 id="user_3.3.1">

 <note xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
 >
The specific linguistic means of expression always include subject-specific <hi xmlns="http://www.tei-c.org/ns/1.0"
   rend="italic">
terminology</hi> (3.5.1) and phraseology and also may cover stylistic or syntactic features.</note>
 <descripGrp xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
 >

  <descrip xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
   type="definition">
language used in a <ref xmlns="http://www.tei-c.org/ns/1.0"
   >
domain</ref> (3.1.2) and characterized by the use of specific linguistic means of expression</descrip></descripGrp>
 <langSet xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
  xml:lang="">

  <ntig xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
  >

   <termGrp xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
   >

    <term xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     id="user_3.3.1-1">
special language</term>
    <termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     type="partOfSpeech">
noun</termNote>
    <termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     type="administrativeStatus">
preferredTerm-admn-sts</termNote></termGrp></ntig>
  <ntig xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
  >

   <termGrp xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
   >

    <term xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     id="user_3.3.1-2">
language for special purposes</term>
    <termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     type="partOfSpeech">
noun</termNote>
    <termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     type="administrativeStatus">
admittedTerm-admn-sts</termNote></termGrp></ntig>
  <ntig xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
  >

   <termGrp xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
   >

    <term xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     id="user_3.3.1-3">
LSP</term>
    <termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     type="partOfSpeech">
noun</termNote>
    <termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     type="administrativeStatus">
admittedTerm-admn-sts</termNote>
    <termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
     type="termType">
abbreviation</termNote></termGrp></ntig></langSet></termEntry>

23. Challenges in the XSLT conversion

interpolating hierarchy from flat section headings
we use XSLT 2.0 <for-each-group> heavily to create document structure
making decisions depend on generated structure
the conversion makes 3 passes over the data with the one XSLT transform, each time adding more goodness
table management
a Word table differs from a CALS table in how it models spanning cells and tables, which causes considerable problems in mapping
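
The first challenge, interpolating hierarchy from flat headings, can be sketched outside XSLT as well. The Python below uses a stack of open sections in place of <for-each-group>. It is a stand-in to show the idea, not the stylesheet's logic, and the (level, text) input format is an assumption.

```python
def nest(paragraphs):
    """Turn a flat list of (level, text) pairs into nested sections.

    Level 0 means body text; levels 1, 2, ... mean a heading of that depth.
    """
    root = {"head": None, "content": [], "sections": []}
    stack = [(0, root)]  # (heading level, section), outermost first
    for level, text in paragraphs:
        if level == 0:
            stack[-1][1]["content"].append(text)
            continue
        # Close every open section at this depth or deeper.
        while stack[-1][0] >= level:
            stack.pop()
        section = {"head": text, "content": [], "sections": []}
        stack[-1][1]["sections"].append(section)
        stack.append((level, section))
    return root


# Invented sample: heading levels as they might be read from Word styles.
FLAT = [
    (1, "Scope"),
    (0, "This standard specifies sampling procedures."),
    (1, "Terms and definitions"),
    (2, "General terms"),
    (0, "A definition."),
    (2, "Specific terms"),
    (1, "Sampling"),
]

tree = nest(FLAT)
```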

24. Questions about ISO work

What about OpenOffice?
in theory we could translate between docx and odt, but it is perhaps better to maintain a parallel set of XSLT stylesheets. This is not yet a requirement.
Do ISO support ODD?
Not at present. ‘Are aware of’, perhaps
Is all this open source?
The stylesheets are all managed under the TEI project on Sourceforge, with the ISO variations distinct from generic docx conversion
How do I add support for new TEI elements or new Word styles?
It all depends …

25. Handling incoming Word style

We use TEI rend a lot to preserve style names

<xsl:template
  match="w:p[w:pPr/w:pStyle/@w:val='Figure text']"
  mode="paragraph">

 <p>
  <xsl:if test="w:pPr/w:jc/@w:val">
   <xsl:attribute name="iso:align">
    <xsl:value-of select="w:pPr/w:jc/@w:val"/>
   </xsl:attribute>
  </xsl:if>
  <xsl:attribute name="rend">
   <xsl:text>Figure text</xsl:text>
  </xsl:attribute>
  <xsl:apply-templates/>
 </p>
</xsl:template>

26. Handling incoming TEI element

<xsl:template
  match="tei:front/tei:div/tei:p[@type='foreword']">

 <xsl:call-template name="block-element">
  <xsl:with-param name="pPr">
   <pPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
   >

    <pStyle xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    >

     <xsl:attribute name="w:val">
      <xsl:value-of
        select="concat(translate(substring(parent::tei:div/@type,1,1),$lowercase,$uppercase),substring(parent::tei:div/@type,2))"/>

     </xsl:attribute></pStyle></pPr>
  </xsl:with-param>
 </xsl:call-template>
</xsl:template>

27. Displaying validity errors

The bad way:
macbookpro:TEIISO rahtz$ cat Test2010-18_validation.txt
Test2010-18.xml:1:1578: error: bad character content for element
Test2010-18.xml:1:4840: error: bad value for attribute "version"
Test2010-18.xml:1:4889: error: bad value for attribute "target"
Test2010-18.xml:1:6622: error: attribute "style" from namespace "http://www.w3.org/1999/xhtml" not allowed at this point; ignored
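
A first step towards anything better is to turn that output into structured records that can be attached to locations in the document. Here is a minimal Python sketch, assuming every message has the file:line:column: error: shape seen above. The Issue record and parse_report are hypothetical names.

```python
import re
from collections import namedtuple

Issue = namedtuple("Issue", "file line column message")

# Start of one message: file name, line, column, then the literal "error: ".
PATTERN = re.compile(r"(?P<file>\S+?):(?P<line>\d+):(?P<column>\d+): error: ")


def parse_report(text):
    """Split validator output into Issue records, one per error message."""
    matches = list(PATTERN.finditer(text))
    issues = []
    for i, m in enumerate(matches):
        # A message runs until the next message starts, or to the end.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        issues.append(Issue(m["file"], int(m["line"]), int(m["column"]),
                            text[m.end():end].strip()))
    return issues


REPORT = (
    "Test2010-18.xml:1:1578: error: bad character content for element\n"
    'Test2010-18.xml:1:4840: error: bad value for attribute "version"\n'
)

issues = parse_report(REPORT)
```

Since the whole file sits on line 1, the column offset is the only usable locator here, which is part of why raw validator output is a poor thing to show a Word author.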

28. Displaying validity errors (2)

A better way

29. Corrigenda and addenda (TEI XML)

<p>This fourth edition cancels and replaces the third
edition (ISO 6579:<del when="2009-10-30T13:19:00Z" type="COR" n="1">1993</del>
 <add when="2009-10-30T13:19:00Z" type="COR" n="1">1999</add>), which has been technically revised.</p>
<bibl>
 <add when="2009-10-30T09:27:00Z" type="AMD" n="1">ISO/TS 11133-1, <title rend="italic">Microbiology of food and animal feeding stuffs — Guidelines on preparation and production of culture media — Part 1: General guidelines on quality assurance for the preparation of culture media in the laboratory</title>
 </add>
</bibl>
 </add>
</bibl>
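
With changes marked up this way, a list of corrigenda and addenda can be pulled out directly. The Python sketch below assumes each <add> and <del> carries when, type and n attributes as in the example, and it ignores namespaces for brevity. The function name changes is invented.

```python
import xml.etree.ElementTree as ET


def changes(xml_text):
    """List (kind, type, n, when, text) for each add/del, in document order."""
    root = ET.fromstring(xml_text)
    found = []
    for el in root.iter():
        if el.tag in ("add", "del"):
            # Collapse whitespace in the text affected by the change.
            text = " ".join("".join(el.itertext()).split())
            found.append((el.tag, el.get("type"), el.get("n"),
                          el.get("when"), text))
    return found


SAMPLE = (
    '<p>This fourth edition cancels and replaces the third edition (ISO 6579:'
    '<del when="2009-10-30T13:19:00Z" type="COR" n="1">1993</del> '
    '<add when="2009-10-30T13:19:00Z" type="COR" n="1">1999</add>), '
    'which has been technically revised.</p>'
)

log = changes(SAMPLE)
```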

30. Displaying corrigenda and addenda (HTML)

31. Recommendations and conclusions

  1. Define a very specific TEI customization, document it with local examples, and enforce it
  2. Don't just document your business rules in prose, but implement them using Schematron
  3. Do not be afraid of namespaces — use the vocabulary suited to the task
  4. Achieving genuinely lossless transformations is very hard, so make sure you really know what you are trying to achieve
  5. Don't go anywhere near word-processors :-}
  6. It has to be questioned whether authoring web pages in TEI XML has any real advantages over simply using HTML 5


Sebastian Rahtz. Date: TEI members meeting, Zadar, 2010-11-13
Copyright University of Oxford