1. Project starting points
- ENRICH
- EU project on manuscript
description
- OxGarage
- Service derived from ENRICH to
perform document conversion
- OUCS
- maintaining web pages of Oxford University Computing Service
- TEI
- systems for producing TEI Guidelines and documentation from ODD
- ISO
- managing ISO standards using TEI markup
I shall be addressing the problem of Word <--> TEI XML
conversion for these projects.
2. ISO
ISO Central Secretariat want to introduce XML into their
production workflow. Some constraints:
- Authors almost all work in Word
- ISO want an XML representation of the document
- For drafting revisions, it must be possible to go back
to editing a finished standard in Word again
- Management of additions and corrections is essential
- Standards go through several stages of drafting
independently of ISO, then several stages with
the secretariat
- The documents have a very rigid structure, components
and business rules
- The metadata is managed by a conventional database at ISO
3. Some ISO business rules
- A Foreword clause in the front matter is mandatory
- A Scope clause in the body is mandatory
- There must be a set of normative references to other standards
- Terminology must be defined in a specific section
- A clause must contain at least two subclauses
There is a whole ISO standard on how to draft ISO standards…
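Rules like these are mechanical enough to test automatically. The project records them in Schematron; purely as an illustration, here is a minimal Python sketch of three of the rules. The div type values ("foreword", "scope", "clause") are hypothetical, and the subclause rule is read here as "a subdivided clause may not have exactly one subclause".

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical sample: the type values are illustrative, not the ISO schema's.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>
<front><div type="foreword"><p>Foreword text</p></div></front>
<body><div type="clause"><div type="sub"><p>only one</p></div></div></body>
</text></TEI>"""


def check_business_rules(root):
    """Return a list of human-readable rule violations (empty if none)."""
    problems = []
    front = root.find(f".//{TEI}front")
    body = root.find(f".//{TEI}body")

    def has_div(parent, div_type):
        # True if any div of the given type occurs inside parent.
        if parent is None:
            return False
        return any(d.get("type") == div_type for d in parent.iter(f"{TEI}div"))

    if not has_div(front, "foreword"):
        problems.append("front matter has no Foreword clause")
    if not has_div(body, "scope"):
        problems.append("body has no Scope clause")
    if body is not None:
        for div in body.iter(f"{TEI}div"):
            # Only direct child divs count as subclauses of this clause.
            if len(div.findall(f"{TEI}div")) == 1:
                problems.append(
                    f"clause '{div.get('type', '?')}' has only one subclause")
    return problems
```

Calling check_business_rules(ET.fromstring(SAMPLE)) reports the missing Scope clause and the lone subclause.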
4. The TEI ISO project
- Following a public tender issued in 2007, OUCS/TEI was selected as
a supplier in March 2008, and delivered the first proof of concept in
November 2008, with assistance from Brigham Young University (Alan Melby and
Jarom McDonald) on the Word template
- Second stage pilot completed in January 2010
- Third stage implementation completed May 2010
This is a project on authoring, not
publication. No workflow is yet defined to typeset from ISO TEI
XML files
5. ISO prerequisites
- Authors will use an ISO Word template with detailed
and strict guidelines on use of styles, and a Wizard to help
them create structured documents
- Lossless round-tripping between TEI and Word is
essential — but that means Word --> TEI --> Word round-trip, not
TEI --> Word --> TEI!
- Enriched semantic XML where possible, eg structured
references and terminology
- I18N features, such as variant representation of
numbers, must be supported: 12,5 vs 12.5
6. Some key technical decisions for ISO work
- TEI purity
- use the TEI when the semantics really are the same — no abuse
- Specialist namespaces
- use MathML for formulae, CALS for tables, TBX for terminology
- ISO-specific data
- use ISO namespace for extra attributes on existing elements
- Word fudges
- embed small islands of Word
data if absolutely necessary (eg system allows for keeping Word tables if needed)
- Validation
- record business rules using Schematron
7. Ugly-headed problems
- Some constituencies want to author directly in XML
(eg for writing a standard using ODD markup)
- Not all authoring environments use complex
multi-namespace schemas well
- We convert to XML to (inter alia)
check structure — but how do we report errors back to the author?
8. Implementation
- Write a single ODD file defining three schemas (Lite, Normal,
and Normal_with_ODD)
- Embed Schematron constraints in schemas
- Write XSLT 2.0 transformations to turn body of OO XML into TEI XML
and vice versa
- Unpack a suitable empty Word document as template and
replace selected components. All styling is determined by
template document
- Create web servlet to manage transformation, processing of
graphics, and packing/unpacking of .docx zip files
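The "unpack a template and replace selected components" step can be pictured with a short Python sketch. This is not the project's servlet code. The helper names make_package and repack are invented for illustration, and the toy parts stand in for a real template.

```python
import io
import zipfile


def make_package(parts):
    """Build an in-memory zip from a {part name: content} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, data in parts.items():
            package.writestr(name, data)
    return buffer.getvalue()


def repack(template_bytes, replacements):
    """Copy every part of the template, swapping in replacements by name."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as template, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as result:
        for info in template.infolist():
            # Parts not named in replacements are copied through untouched,
            # so styling keeps coming from the template.
            data = replacements.get(info.filename,
                                    template.read(info.filename))
            result.writestr(info.filename, data)
    return out.getvalue()


def read_part(package_bytes, name):
    """Return the raw bytes of one named part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.read(name)
```

In this picture the transformation output would replace word/document.xml, while word/styles.xml is carried over from the template unchanged.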
9. The OOXML data format
Microsoft Office 2007 (Office 2008/2011 on a Mac) is more or
less an implementation of ISO/IEC 29500 (OOXML); this defines
- a family of interlinked XML schemas to describe office
documents
- a file hierarchy structure
- a packaging format (zip)
There is a (smallish) difference between what
is in Word and what
should be there according to the spec.
10. The architecture of a Word docx (OOXML) file
(Useful picture from http://en.wikipedia.org/wiki/Office_Open_XML)

11. XML namespaces in Word
- urn:schemas-microsoft-com:mac:vml
- Drawing
- http://schemas.microsoft.com/office/mac/office/2008/main
- http://schemas.openxmlformats.org/markup-compatibility/2006
- urn:schemas-microsoft-com:office:office
- http://schemas.openxmlformats.org/officeDocument/2006/relationships
- Links
- http://schemas.openxmlformats.org/officeDocument/2006/math
- Maths
- urn:schemas-microsoft-com:vml
- Another bit of drawing
- urn:schemas-microsoft-com:office:word
- http://schemas.openxmlformats.org/wordprocessingml/2006/main
- Normal text
- http://schemas.microsoft.com/office/word/2006/wordml
- http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing
- More drawing
12. The contents of the package

13. What are the files for?
[Content_Types].xml | mime types of files |
_rels/.rels | links between names and objects |
word/_rels/document.xml.rels | links between names and support files |
word/document.xml | document body |
word/media/image1.jpeg | picture |
docProps/thumbnail.jpeg | document thumbnail |
word/settings.xml | settings |
word/webSettings.xml | settings for HTML export |
word/styles.xml | style definitions |
word/numbering.xml | numbering schemes |
docProps/core.xml | document properties |
word/fontTable.xml | font details |
docProps/app.xml | application details |
All of these, except media files, are XML files (despite some weird names).
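Because the package is an ordinary zip file, its contents can be listed with any zip library. The Python sketch below builds a toy three-part package (not a valid Word document) and lists it. The test for "XML part" by file extension is a simplifying assumption.

```python
import io
import zipfile


def build_toy_package():
    """Create a tiny in-memory zip that mimics a few docx part names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("_rels/.rels", "<Relationships/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/media/image1.jpeg", b"\xff\xd8\xff")
    return buffer.getvalue()


def list_parts(package_bytes):
    """Return the sorted names of all parts in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return sorted(package.namelist())


def xml_parts(package_bytes):
    """Return the parts that hold XML, including the oddly named .rels files."""
    return [name for name in list_parts(package_bytes)
            if name.endswith((".xml", ".rels"))]
```

On a real .docx the same list_parts call would show the files tabulated above.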
14. Simple text in Word
The main building blocks are
- <p>
- block-level object (‘paragraph’)
- <r>
- inline object
- <t>
- text ‘run’
with corresponding style objects:
- <pPr>
- block-level object style rules
- <rPr>
- inline style rules
There is no hierarchy, just a flat set of block-level objects.
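The flat model is easy to see when the building blocks are read programmatically. The following Python sketch (an illustration, not part of the project's XSLT) walks every paragraph and returns its style name and concatenated text. The sample fragment is made up, and attributes are written with the w: prefix as they appear in a real document.xml.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Made-up fragment; a real document.xml wraps this in w:document/w:body.
SAMPLE = (
    '<w:body xmlns:w='
    '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    '<w:r><w:t>Normative references</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>ISO </w:t></w:r><w:r><w:t>13909-2:2001,</w:t></w:r></w:p>'
    '</w:body>'
)


def paragraphs(document_xml):
    """Return a list of (style name or None, text) for every w:p."""
    root = ET.fromstring(document_xml)
    result = []
    for p in root.iter(f"{W}p"):
        style = p.find(f"{W}pPr/{W}pStyle")
        style_name = style.get(f"{W}val") if style is not None else None
        # A paragraph's text is spread over the w:t children of its runs.
        text = "".join(t.text or "" for t in p.iter(f"{W}t"))
        result.append((style_name, text))
    return result
```

The heading and the body text come back as siblings: nothing in the markup says that the second paragraph belongs to the first.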
15. Example: references in Word

16. Example: references in OOXML (Word) — 1
<p xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidR="008A0CE8"
rsidRPr="00250571"
rsidRDefault="008A0CE8"
rsidP="008A0CE8">
<pPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<pStyle xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="Heading1"/>
<tabs xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<tab xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="clear" pos="400"/>
<tab xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="clear" pos="560"/>
<tab xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="left" pos="403"/>
<tab xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="left" pos="562"/></tabs></pPr>
<bookmarkStart xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
id="8" name="_Toc201542376"/>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidRPr="00250571">
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>Normative references</t></r>
<bookmarkEnd xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
id="8"/></p>
<p xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidR="008A0CE8"
rsidRPr="00250571"
rsidRDefault="008A0CE8"
rsidP="008A0CE8">
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidRPr="00250571">
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>The following referenced documents are indispensable for
the application of this document. For dated references, only
the edition cited applies. For undated references, the latest
edition of the referenced document (including any amendments)
applies.</t></r></p>
17. Example: references in OOXML (Word) — 2
<p xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidR="008A0CE8"
rsidRPr="00250571"
rsidRDefault="008A0CE8"
rsidP="008A0CE8">
<pPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<pStyle xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="RefNorm"/></pPr>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<sz xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="19"/>
<szCs xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="19"/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>ISO </t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidRPr="00250571">
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<sz xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="19"/>
<szCs xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
val="19"/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>13909-2:2001,</t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidRPr="00250571">
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xml:space="preserve"> </t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidRPr="00250571">
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>Hard coal and coke</t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
> —</t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidRPr="00250571">
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xml:space="preserve"> Mechanical sampling</t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
> —</t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidRPr="00250571">
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xml:space="preserve"> </t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>Part </t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidRPr="00250571">
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>2: Coal</t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
> —</t></r>
<r xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
rsidRPr="00250571">
<rPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<i xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
/></rPr>
<t xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xml:space="preserve"> Sampling from moving streams</t></r></p>
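Word splits what a reader sees as one italic title into many runs, so a converter has to merge adjacent runs that share formatting before it can recognise the title. A minimal Python sketch of that merging, looking only at the presence of an italic marker, might be as follows. It is a simplification: it ignores forms such as an explicit "false" value on the marker, and it is not the project's XSLT.

```python
from itertools import groupby
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Shortened, made-up paragraph in the spirit of the listing above.
SAMPLE = (
    '<w:p xmlns:w='
    '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t>ISO </w:t></w:r>'
    '<w:r><w:t>13909-2:2001,</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> </w:t></w:r>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>Hard coal </w:t></w:r>'
    '<w:r><w:rPr><w:i/></w:rPr><w:t>and coke</w:t></w:r>'
    '</w:p>'
)


def merged_runs(p):
    """Merge consecutive runs with the same italic flag into (italic, text)."""
    def is_italic(run):
        return run.find(f"{W}rPr/{W}i") is not None

    def run_text(run):
        return "".join(t.text or "" for t in run.iter(f"{W}t"))

    runs = p.findall(f"{W}r")
    return [(italic, "".join(run_text(r) for r in group))
            for italic, group in groupby(runs, key=is_italic)]
```

Five runs collapse to two spans: the reference number and the italic title.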
18. Example: references in XML (TEI)
<div type="normativeReferences">
<head>Normative references</head>
<p>The following referenced documents are indispensable
for the application of this document. For dated
references, only the edition cited applies. For undated
references, the latest edition of the referenced document
(including any amendments) applies.</p>
<listBibl type="normativeReferences">
<bibl type="dated">
<publisher>ISO</publisher>
<idno type="docNumber">13909</idno>
<idno type="docPartNumber">1</idno>
<edition>2001</edition>
<title rend="italic">Hard coal and coke — Mechanical
sampling —<seg/>Part 1: General
introduction</title>
</bibl>
</listBibl>
</div>
19. Example: math in Word

20. Example: math in XML (MathML)
<p>The required overall precision on a lot should be agreed between the parties concerned. In the absence of such agreement, a value of one tenth of the ash content may be assumed.</p>
<p>The theory of precision is given in ISO 13909-7. The following equation is derived:</p>
<p>
<formula>
<m:math>
<m:msub>
<m:mrow>
<m:mi>P</m:mi>
</m:mrow>
<m:mrow>
<m:mtext>L</m:mtext>
</m:mrow>
</m:msub>
<m:mo>=</m:mo>
<m:mn>2</m:mn>
<m:msqrt>
<m:mfrac>
<m:mrow>
<m:mfrac>
<m:mrow>
<m:msub>
<m:mrow>
<m:mi>V</m:mi>
</m:mrow>
<m:mrow>
<m:mtext>l</m:mtext>
</m:mrow>
</m:msub>
</m:mrow>
<m:mrow>
<m:mi>n</m:mi>
</m:mrow>
</m:mfrac>
<m:mo>+</m:mo>
<m:mfenced separators="|">
<m:mrow>
<m:mn>1</m:mn>
<m:mo>-</m:mo>
<m:mfrac>
<m:mrow>
<m:mi>u</m:mi>
</m:mrow>
<m:mrow>
<m:mi>m</m:mi>
</m:mrow>
</m:mfrac>
</m:mrow>
</m:mfenced>
<m:msub>
<m:mrow>
<m:mi>V</m:mi>
</m:mrow>
<m:mrow>
<m:mtext>m</m:mtext>
</m:mrow>
</m:msub>
<m:mo>+</m:mo>
<m:msub>
<m:mrow>
<m:mi>V</m:mi>
</m:mrow>
<m:mrow>
<m:mtext>PT</m:mtext>
</m:mrow>
</m:msub>
</m:mrow>
<m:mrow>
<m:mi>u</m:mi>
</m:mrow>
</m:mfrac>
</m:msqrt>
</m:math>
<lb/>
<c rend="tab"/>(1)</formula>
</p>
21. Example: terminology in Word

22. Example: terminology in XML (TBX)
<termEntry xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
id="user_3.3.1">
<note xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
>The specific linguistic means of expression always include subject-specific <hi xmlns="http://www.tei-c.org/ns/1.0"
rend="italic">terminology</hi> (3.5.1) and phraseology and also may cover stylistic or syntactic features.</note>
<descripGrp xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
>
<descrip xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
type="definition">language used in a <ref xmlns="http://www.tei-c.org/ns/1.0"
>domain</ref> (3.1.2) and characterized by the use of specific linguistic means of expression</descrip></descripGrp>
<langSet xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
xml:lang="">
<ntig xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
>
<termGrp xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
>
<term xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
id="user_3.3.1-1">special language</term>
<termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
type="partOfSpeech">noun</termNote>
<termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
type="administrativeStatus">preferredTerm-admn-sts</termNote></termGrp></ntig>
<ntig xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
>
<termGrp xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
>
<term xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
id="user_3.3.1-2">language for special purposes</term>
<termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
type="partOfSpeech">noun</termNote>
<termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
type="administrativeStatus">admittedTerm-admn-sts</termNote></termGrp></ntig>
<ntig xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
>
<termGrp xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
>
<term xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
id="user_3.3.1-3">LSP</term>
<termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
type="partOfSpeech">noun</termNote>
<termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
type="administrativeStatus">admittedTerm-admn-sts</termNote>
<termNote xmlns="http://www.lisa.org/TBX-Specification.33.0.html"
type="termType">abbreviation</termNote></termGrp></ntig></langSet></termEntry>
23. Challenges in the XSLT conversion
- interpolating hierarchy from flat section headings
- we use XSLT 2.0 <for-each-group> heavily to create
document structure
- making decisions depend on generated structure
- the conversion makes 3 passes over the data with the
one XSLT transform, each time adding more goodness
- table management
- a Word table differs from a CALS table in how it
models spanning cells and tables, which causes considerable
problems in mapping
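The first challenge, interpolating hierarchy from flat headings, can be illustrated outside XSLT. The project uses <for-each-group>; the Python sketch below uses a stack instead, which is a different technique with the same result. The input shape, a list of (level, text) pairs with level 0 for ordinary paragraphs, is an assumption made for the example.

```python
def nest(paragraphs):
    """Turn flat (level, text) pairs into a tree of sections.

    level 0 is body text; level 1, 2, ... are heading depths.
    """
    root = {"head": None, "level": 0, "content": [], "children": []}
    stack = [root]  # stack[-1] is the section currently being filled
    for level, text in paragraphs:
        if level == 0:
            stack[-1]["content"].append(text)
            continue
        # Close every open section at the same depth or deeper.
        while stack[-1]["level"] >= level:
            stack.pop()
        section = {"head": text, "level": level,
                   "content": [], "children": []}
        stack[-1]["children"].append(section)
        stack.append(section)
    return root


FLAT = [
    (1, "Scope"),
    (0, "This document specifies ..."),
    (1, "Terms and definitions"),
    (2, "General"),
    (0, "For the purposes of this document ..."),
    (1, "Annex"),
]
```

nest(FLAT) yields three top-level sections, with "General" nested inside "Terms and definitions".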
24. Questions about ISO work
- What about OpenOffice?
- theoretically we
could translate docx/odt, but it is perhaps better to maintain
a parallel set of XSLT stylesheets. This is not yet a requirement.
- Do ISO support ODD?
- Not at
present. ‘Are aware of’, perhaps
- Is all this open source?
- The stylesheets are all managed under the TEI project
on Sourceforge, with the ISO variations distinct from
generic docx conversion
- How do I add support for new TEI elements or new Word
styles?
- It all depends …
25. Handling incoming Word style
We use TEI rend a lot to preserve style names
<xsl:template
match="w:p[w:pPr/w:pStyle/@w:val='Figure text']"
mode="paragraph">
<p>
<xsl:if test="w:pPr/w:jc/@w:val">
<xsl:attribute name="iso:align">
<xsl:value-of select="w:pPr/w:jc/@w:val"/>
</xsl:attribute>
</xsl:if>
<xsl:attribute name="rend">
<xsl:text>Figure text</xsl:text>
</xsl:attribute>
<xsl:apply-templates/>
</p>
</xsl:template>
26. Handling incoming TEI element
<xsl:template
match="tei:front/tei:div/tei:p[@type='foreword']">
<xsl:call-template name="block-element">
<xsl:with-param name="pPr">
<pPr xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<pStyle xmlns="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
>
<xsl:attribute name="w:val">
<xsl:value-of
select="concat(translate(substring(parent::tei:div/@type,1,1),$lowercase,$uppercase),substring(parent::tei:div/@type,2))"/>
</xsl:attribute></pStyle></pPr>
</xsl:with-param>
</xsl:call-template>
</xsl:template>
27. Displaying validity errors
The bad way:
macbookpro:TEIISO rahtz$ cat Test2010-18_validation.txt
Test2010-18.xml:1:1578: error: bad character content for element
Test2010-18.xml:1:4840: error: bad value for attribute "version"
Test2010-18.xml:1:4889: error: bad value for attribute "target"
Test2010-18.xml:1:6622: error: attribute "style" from namespace
"http://www.w3.org/1999/xhtml" not allowed at this point; ignored
28. Displaying validity errors (2)
A better way
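One possible first step towards friendlier reporting, sketched in Python and not taken from the project itself, is to parse the raw validator output from the previous slide into structured records that a later stage can attach to the author's document. The regular expression assumes the file:line:column: severity: message layout seen in that listing, with wrapped lines treated as continuations.

```python
import re

ERROR_LINE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+):(?P<column>\d+): "
    r"(?P<severity>\w+): (?P<message>.*)$"
)

REPORT = '''Test2010-18.xml:1:4840: error: bad value for attribute "version"
Test2010-18.xml:1:6622: error: attribute "style" from namespace
"http://www.w3.org/1999/xhtml" not allowed at this point; ignored'''


def parse_report(text):
    """Return one dict per error; wrapped lines extend the previous message."""
    errors = []
    for raw in text.splitlines():
        match = ERROR_LINE.match(raw)
        if match:
            record = match.groupdict()
            record["line"] = int(record["line"])
            record["column"] = int(record["column"])
            errors.append(record)
        elif errors and raw.strip():
            # Continuation of a message that was wrapped onto a new line.
            errors[-1]["message"] += " " + raw.strip()
    return errors
```

Every record here has line 1, because the generated XML sits on a single line; that is part of what makes the raw report so unhelpful to an author working in Word.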

29. Corrigenda and addenda (TEI XML)
<p>This fourth edition cancels and replaces the third
edition (ISO 6579:<del when="2009-10-30T13:19:00Z" type="COR" n="1">1993</del>
<add when="2009-10-30T13:19:00Z" type="COR" n="1">1999</add>), which has been technically revised.</p>
<bibl>
<add when="2009-10-30T09:27:00Z" type="AMD" n="1">ISO/TS 11133-1, <title rend="italic">Microbiology of food and animal feeding stuffs — Guidelines on preparation and production of culture media — Part 1: General guidelines on quality assurance for the preparation of culture media in the laboratory</title>
</add>
</bibl>
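Recording changes as <add> and <del> means both the old and the new text can be derived from one file. A small Python sketch of that idea follows. It is illustrative only: the element names are left un-namespaced as in the fragment above, whereas real TEI files place them in the TEI namespace.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<p>This fourth edition cancels and replaces the third edition '
    '(ISO 6579:<del type="COR" n="1">1993</del>'
    '<add type="COR" n="1">1999</add>), '
    'which has been technically revised.</p>'
)


def text_with_changes(element, accept=True):
    """Flatten an element to plain text.

    accept=True keeps <add> and drops <del> (the corrected text);
    accept=False does the reverse (the text before the change).
    """
    dropped = "del" if accept else "add"
    parts = [element.text or ""]
    for child in element:
        if child.tag != dropped:
            parts.append(text_with_changes(child, accept))
        # Text after a child belongs to the parent and is always kept.
        parts.append(child.tail or "")
    return "".join(parts)
```

With accept=True the sample reads "ISO 6579:1999"; with accept=False it reads "ISO 6579:1993".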
30. Displaying corrigenda and addenda (HTML)

31. Recommendations and conclusions
- Define a very specific TEI customization, document it with local
examples, and enforce it
- Don't just document your business rules in prose, but
implement them using Schematron
- Do not be afraid of namespaces — use the vocabulary
suited to the task
- Achieving genuinely lossless transformations is very
hard, so make sure you really know what you are trying to achieve
- Don't go anywhere near word-processors :-}
- It has to be questioned whether authoring web pages in
TEI XML has any real advantages over simply using HTML