IT Services, University of Oxford

1. Analysis and annotation

  • All markup results from analysis; all markup is expressed as annotation.
  • However, there is a general feeling that assertions such as
    "This is a paragraph"
    are different from assertions such as
    "This is a verb-noun complementizer"
  • Standardization of linguistic annotation, as practiced in the NLP community in particular, was one of the original goals of the TEI (though subsequent history showed this may have been over-ambitious)
  • The TEI provides a range of very general facilities

2. Leech's principles of linguistic annotation

  • the annotation should be separable from the text
  • multiple annotations may co-exist within the text
  • annotation should be
    • self-documenting
    • explicit
    • reproducible
    • formally verifiable

3. Some varieties of annotation

  • identification of segments, locations, and spans
  • alignment and correspondence: identification of associations between segments (e.g. translation equivalence, anaphoric reference...)
  • classification of identified structures, e.g. POS tagging, syntactic function, analytic category

4. Uses of segmentation

  • Segmentation allows for components to be identified and accessed at any level
    • for reference purposes e.g. this occurs at ....
    • for scoping purposes e.g. find this within a that
    • for analytic purposes e.g. 90% of these of type that contain a the other

(overlap happens)

5. Segmentation elements

  • general-purpose:
    <s> for end-to-end segmentation
    <seg> for arbitrary nestable segmentation
  • linguistically-motivated:
    <cl> (clause) represents a grammatical clause.
    <phr> (phrase) represents a grammatical phrase.
    <w> (word) represents a grammatical (not necessarily orthographic) word.
    <m> (morpheme) represents a grammatical morpheme.
    <c> (character) represents a character.

From the att.segLike class these elements all inherit type and function attributes
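
As a rough illustration of how these elements nest, the sentence used in the following slides might be segmented as below, using <s>, <cl>, <phr>, <w>, <m> and <c>. This is only a sketch: the type and function values (declarative, NP, subject and so on) are hypothetical labels chosen for the example, not values prescribed by the TEI.

```xml
<s n="11">
 <cl type="declarative">
  <!-- type and function values here are illustrative assumptions -->
  <phr type="NP" function="subject">
   <w>The</w>
   <w>export</w>
   <!-- ... -->
  </phr>
  <phr type="VP" function="predicate">
   <w>is</w>
   <!-- a grammatical word split into its morphemes -->
   <w><m type="stem">prohibit</m><m type="suffix">ed</m></w>
  </phr>
 </cl>
 <c>.</c>
</s>
```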

6. Word or sentence level annotation is quite easy...

 <s>The export of sardines in oil from
   Sweden is prohibited. </s>
<s n="11">
<!-- ... -->
</s>

7. ... syntactic structures less so

((The export of (sardines in oil) (from Sweden)) is prohibited.)
<s n="11">
<!-- ... -->
</s>
Although XML was designed to represent linearized tree structures,
  • there are problems with discontinuity and overlap
  • typing of relationships can be problematic

8. ... e.g. the next sentence...

 <cl>Some resentment
   is felt <phr>at the order</phr>
  <phr>by the Germans</phr>
 </cl>, <cl>who <phr>with their customary
     ingenuity</phr> have <phr>for some time</phr> been
   importing <phr>india-rubber sardines in <!-- ... --></phr>
  <phr>without detection</phr>
 </cl>

9. Discontinuity: using pointers

... (Germans, who (with their customary ingenuity) have (for some time) been importing)...
 <w xml:id="s1" next="#s2">who</w>
 <phr>with their customary ingenuity</phr>
 <w xml:id="s2" prev="#s1" next="#s3">have</w>
 <phr>for some time</phr>
 <w xml:id="s3" prev="#s2">been</w>

The part attribute can also be used to indicate that segments are incomplete.
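
A minimal sketch of the part approach, applied to the same fragments and assuming the values I (initial), M (medial) and F (final) for the part attribute. The choice of <cl> as the fragmented element is an assumption made for the example.

```xml
<!-- each fragment of the interrupted clause is marked as incomplete -->
<cl part="I">who</cl>
<phr>with their customary ingenuity</phr>
<cl part="M">have</cl>
<phr>for some time</phr>
<cl part="F">been <!-- ... --></cl>
```

Unlike next and prev, part only records that a segment is fragmentary; it does not say which other fragments complete it, so it works best where the fragments are adjacent enough for this to be obvious.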

10. Discontinuity: using “standoff” technique

<w xml:id="W1">who</w>
<phr>with their customary ingenuity</phr>
<w xml:id="W2">have</w>
<phr>for some time</phr>
<w xml:id="W3">been</w>
<!--... -->
<join targets="#W1 #W2 #W3" result="seg"/>

11. Translation pairs

<s corresp="#ALRTP1" xml:lang="EN" xml:id="RTP1">For a long time I used to go to bed early</s>
<!-- ... -->
<s xml:id="ALRTP1" corresp="#RTP1" xml:lang="FR">Longtemps je me couchais de bonne heure</s>
<linkGrp type="trans">
 <link targets="#RTP1 #ALRTP1"/>
</linkGrp>

12. Anaphoric reference

<title xml:id="shirl">Shirley</title>, which made its Friday night
debut only a month ago, was not listed on <name xml:id="nbc">NBC</name>'s new schedule, although <seg corresp="#nbc">the network</seg> says <seg corresp="#shirl">the
show</seg> still is being considered.
or, stand-offishly,
<title xml:id="SHIRL">Shirley</title>, which made its Friday night
debut only a month ago, was not listed on <name xml:id="NBC">NBC</name>'s new schedule, although
<seg xml:id="NWK">the network</seg> says
<seg xml:id="SHOW">the show</seg> still is being considered.

<linkGrp type="anaphor">
 <link targets="#SHIRL #SHOW"/>
 <link targets="#NWK #NBC"/>
</linkGrp>

13. Generic elements for stand-off interpretation

The <span> element can be used to identify arbitrary discontinuous segments:
 <ab xml:id="eye_start">Lest it see more, prevent it. Out, vile jelly!</ab>
 <ab>Where is thy lustre now?</ab>
 <ab>All dark and comfortless. Where's my son Edmund?</ab>
 <ab>Edmund, enkindle all the sparks of nature,</ab>
 <ab xml:id="eye_end">To quit this horrid act.</ab>
<span from="#eye_start" to="#eye_end">the eye is pulled out</span>

14. Stand-off interpretation (cont)

The <interp> element is used to define any kind of interpretation, for example a discourse or narrative function.

The global ana attribute can then point from any part of the text to the interpretation that applies to it.

15. A simple example

 <interp xml:id="quote">
  <desc>A quotation, usually from the press</desc>
 </interp>
 <interp xml:id="comment">
  <desc>A humorous comment on such a quotation</desc>
 </interp>
<cit ana="#quote">
  <p>"105 Canadian Dogs to go with Sir E. Shackleton."</p>
  <title>Daily Express.</title>
</cit>
<p ana="#comment">A gay lot, these Canadians.</p>

16. The ana attribute

  • provides one way of associating an element with some analysis of it
  • points to an analysis which may be defined in any of the following ways:
    • a bald prose description
    • an <interp> element
    • a formally defined feature-structure

The type attribute provides an alternative method of categorisation

17. Simple word-level analyses

<s n="11">
 <w ana="#DT">The</w>
 <w ana="#NN">export</w>
 <w ana="#IN">of</w>
 <w ana="#NNS">sardines</w>
 <w ana="#IN">in</w>
 <w ana="#NN">oil</w>
 <w ana="#IN">from</w>
 <w ana="#NP">Sweden</w>
 <w ana="#VBZ">is</w>
 <w ana="#VVN">prohibited</w>
 <c ana="#SENT">.</c>
</s>
This requires, somewhere, a definition of DT, NN, IN etc. such as
<interp xml:id="DT"><!-- ... --></interp>
<interp xml:id="NN">
 <desc>singular noun</desc>
</interp>
<interp xml:id="NNS">
 <desc>plural noun</desc>
</interp>
<!-- ... etc -->

18. Or alternatively...

<s n="11">
 <w type="DT">The</w>
 <w type="NN">export</w>
 <w type="IN">of</w>
 <w type="NNS">sardines</w>
<!-- ... -->
</s>
This requires an ODD in which the legal values for the type attribute have been defined, using a modified declaration such as
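<elementSpec ident="w" mode="change">
  <attList>
    <attDef ident="type" mode="replace">
      <valList type="closed">
        <valItem ident="DT"><!-- ... --></valItem>
        <valItem ident="NN">
          <desc>singular noun</desc>
        </valItem>
        <valItem ident="NNS">
          <desc>plural noun</desc>
        </valItem>
<!-- ... -->
      </valList>
    </attDef>
  </attList>
</elementSpec>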

19. A word on feature structure representation

  • The feature structure is a widely-used concept in theoretical linguistics
  • Any analysis can be represented by bundles of named feature-value pairs
  • TEI representation of this is now the basis of an ISO standard, providing a theoretically neutral and pragmatic solution to the problem of intermachine communication
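
By way of illustration, the NNS tag used earlier could be defined as a feature structure rather than as an <interp>. This is a sketch only: the feature names (pos, number), their values, and the type value wordClass are assumptions made for the example.

```xml
<!-- a bundle of named feature-value pairs; names and values are illustrative -->
<fs xml:id="NNS" type="wordClass">
 <f name="pos"><symbol value="noun"/></f>
 <f name="number"><symbol value="plural"/></f>
</fs>
<!-- ... -->
<!-- the word points at the feature structure with ana, just as it would at an interp -->
<w ana="#NNS">sardines</w>
```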

20. Classification and categorization at higher levels

TEI also provides mechanisms for representing classification or analysis of higher level objects, such as text divisions, or whole texts in a corpus.

  • <div> elements can be typed in the same way as <w> elements
  • or they can use the decls attribute to point to relevant metadata elements in the header
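
The two options might look something like this. The identifier LU1, the type value letter, and the choice of <langUsage> as the metadata element pointed at are all hypothetical, chosen only to show the mechanism.

```xml
<!-- in the TEI header: a declaration that applies only where it is pointed at -->
<langUsage xml:id="LU1">
 <language ident="fr">French</language>
</langUsage>
<!-- ... -->
<!-- in the text: the division is typed, and uses decls to select the declaration -->
<div type="letter" decls="#LU1">
 <!-- ... -->
</div>
```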

And the <catRef> and <taxonomy> elements can be used to specify text-level analyses, as you already know...
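
As a reminder of the general shape, a text-level classification might be declared and used as follows. The taxonomy, its categories, and their identifiers are invented for the example.

```xml
<!-- in the encodingDesc of the corpus header: the classification scheme -->
<classDecl>
 <taxonomy xml:id="genre">
  <category xml:id="genre.news">
   <catDesc>newspaper reportage</catDesc>
  </category>
  <category xml:id="genre.fiction">
   <catDesc>prose fiction</catDesc>
  </category>
 </taxonomy>
</classDecl>
<!-- ... -->
<!-- in the profileDesc of an individual text: the category it belongs to -->
<textClass>
 <catRef target="#genre.news" scheme="#genre"/>
</textClass>
```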

Copyright University of Oxford