1. Analysis and annotation
2. Leech's principles of linguistic annotation
- the annotation should be separable from the text
- multiple annotations may co-exist within the text
- annotation should be
  - self-documenting
  - explicit
  - reproducible
  - formally verifiable
3. Some varieties of annotation
- Segmentation
  - identification of segments, locations, and spans
- Alignment and correspondence
  - identification of associations between segments
    (e.g. translation equivalence, anaphoric reference...)
- Categorization
  - classification of identified structures, e.g. POS tagging,
    syntactic function, analytic category
4. Uses of segmentation
- Segmentation allows for components to be
  identified and accessed at any level
  - for reference purposes
    e.g. this occurs at ....
  - for scoping purposes
    e.g. find this within a that
  - for analytic purposes
    e.g. 90% of these of type that contain a the other
    (overlap happens)
5. Segmentation elements
- general-purpose:
  - <s>: for end-to-end segmentation
  - <seg>: for arbitrary nestable segmentation
- linguistically-motivated:
  - <cl>: (clause) represents a grammatical clause.
  - <phr>: (phrase) represents a grammatical phrase.
  - <w>: (word) represents a grammatical (not necessarily orthographic) word.
  - <m>: (morpheme) represents a grammatical morpheme.
  - <c>: (character) represents a character.
These elements all inherit the type and function attributes
from the att.segLike class.
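As a rough illustration of how these elements nest and carry type and function, here is a minimal sketch. The attribute values (main, NP, VP, subject, predicate, stem, suffix) are assumptions chosen for the example, not values prescribed by the TEI.

```xml
<s>
  <cl type="main">
    <!-- type and function values below are illustrative assumptions -->
    <phr type="NP" function="subject">
      <w>The</w>
      <w>export</w>
    </phr>
    <phr type="VP" function="predicate">
      <w>is</w>
      <!-- one grammatical word divided into two morphemes -->
      <w><m type="stem">prohibit</m><m type="suffix">ed</m></w>
    </phr>
  </cl>
  <c>.</c>
</s>
```

A project would normally document its own set of permitted values for both attributes.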
6. Word- or sentence-level annotation is quite easy...
<p>
<s>The export of sardines in oil from
Sweden is prohibited. </s>
</p>
<s n="11">
<w>The</w>
<w>export</w>
<w>of</w>
<w>sardines</w>
<w>in</w>
<w>oil</w>
<w>from</w>
<w>Sweden</w>
<w>is</w>
<w>prohibited</w>
<c>.</c>
</s>
7. ... syntactic structures less so
((The export of (sardines in
oil) (from Sweden)) is prohibited.)
<s n="11">
  <seg>
    <w>The</w>
    <w>export</w>
    <w>of</w>
    <seg>
      <w>sardines</w>
      <w>in</w>
      <w>oil</w>
    </seg>
    <seg>
      <w>from</w>
      <w>Sweden</w>
    </seg>
  </seg>
  <w>is</w>
  <w>prohibited</w>
  <c>.</c>
</s>
Although XML was designed to represent linearized tree structures,
- there are problems with discontinuity and overlap
- typing of relationships can be problematic
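One possible way of recording what kind of constituent each nested segment is uses the type attribute, sketched below. The labels NP and PP are assumed values, not part of the original example.

```xml
<s n="11">
  <!-- type values NP and PP are illustrative assumptions -->
  <seg type="NP">
    <w>The</w> <w>export</w> <w>of</w>
    <seg type="NP"><w>sardines</w> <w>in</w> <w>oil</w></seg>
    <seg type="PP"><w>from</w> <w>Sweden</w></seg>
  </seg>
  <w>is</w> <w>prohibited</w><c>.</c>
</s>
```

The type attribute labels each segment, but nesting alone does not say how one segment relates to another. That is arguably where the typing of relationships becomes awkward.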
8. ... e.g. the next sentence...
<s>
<cl>Some resentment
is felt <phr>at the order</phr>
<phr>by the
Germans</phr>
</cl>, <cl>who <phr>with their customary
ingenuity</phr> have <phr>for some time</phr> been
importing <phr>india-rubber sardines in
petrol</phr>
<phr>without detection</phr>
</cl>
</s>
9. Discontinuity: using pointers
... (Germans, who (with their customary ingenuity) have
(for some time) been importing)...
<seg>
<w xml:id="s1" next="#s2">who</w>
<phr>with their customary ingenuity</phr>
<w xml:id="s2" prev="#s1" next="#s3">have</w>
<phr>for some time</phr>
<w xml:id="s3" prev="#s2">been</w>
<w>importing</w>
</seg>
The part attribute can also be used to indicate that segments
are incomplete.
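A minimal sketch of the part attribute applied to the same fragment, assuming the TEI values I (initial), M (medial) and F (final). Using <seg> rather than <w> for the fragments is a choice made for this example.

```xml
<!-- three fragments of one discontinuous segment: who ... have ... been -->
<seg part="I">who</seg>
<phr>with their customary ingenuity</phr>
<seg part="M">have</seg>
<phr>for some time</phr>
<seg part="F">been</seg>
<w>importing</w>
```

Unlike next and prev, part only flags each piece as incomplete. It does not say which other pieces complete it.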
10. Discontinuity: using “standoff” technique
<w xml:id="W1">who</w>
<phr>with their customary ingenuity</phr>
<w xml:id="W2">have</w>
<phr>for some time</phr>
<w xml:id="W3">been</w>
<w>importing</w>
<join targets="#W1 #W2 #W3" result="seg"/>
11. Translation pairs
<s corresp="#ALRTP1" xml:lang="EN" xml:id="RTP1">For a long time I used to go to bed early</s>
<s xml:id="ALRTP1" corresp="#RTP1" xml:lang="FR">Longtemps je me couchais de bonne heure</s>
And/Or...
<linkGrp type="trans">
<link targets="#RTP1 #ALRTP1"/>
</linkGrp>
12. Anaphoric reference
<title xml:id="shirl">Shirley</title>, which made its Friday night
debut only a month ago, was not listed on <name xml:id="nbc">NBC</name>'s new schedule, although <seg corresp="#nbc">the network</seg> says <seg corresp="#shirl">the
show</seg> still is being considered.
or, stand-offishly,
<title xml:id="SHIRL">Shirley</title>, which made its Friday night
debut only a month ago, was not listed on <name xml:id="NBC">NBC</name>'s new schedule, although
<seg xml:id="NWK">the network</seg> says
<seg xml:id="SHOW">the show</seg> still is being considered.
<linkGrp type="anaphor">
<link targets="#SHIRL #SHOW"/>
<link targets="#NWK #NBC"/>
</linkGrp>
13. Generic elements for stand-off interpretation
The <span> element can be used to identify arbitrary discontinuous segments:
<sp>
<speaker>CORNWALL</speaker>
<ab xml:id="eye_start">Lest it see more, prevent it. Out, vile jelly!</ab>
<ab>Where is thy lustre now?</ab>
</sp>
<sp>
<speaker>GLOUCESTER</speaker>
<ab>All dark and comfortless. Where's my son Edmund?</ab>
<ab>Edmund, enkindle all the sparks of nature,</ab>
<ab xml:id="eye_end">To quit this horrid act.</ab>
</sp>
<span from="#eye_start" to="#eye_end">the eye is pulled out</span>
14. Stand-off interpretation (cont)
The <interp> element is used to define any kind of
interpretation, for example a discourse or narrative
function.
The global ana attribute can then point from any part of the text
to the interpretation that applies to it.
15. A simple example
<interpGrp>
<interp xml:id="quote">
<desc>A quotation, usually from the press</desc>
</interp>
<interp xml:id="comment">
<desc>A humorous comment on such a quotation</desc>
</interp>
</interpGrp>
<cit ana="#quote">
<quote>
<p>"MEN FOR THE ANTARCTIC.</p>
<p> 105 Canadian Dogs to go with Sir E. Shackleton."</p>
</quote>
<bibl>
<title>Daily Express.</title>
</bibl>
</cit>
<p ana="#comment">A gay lot, these Canadians.</p>
16. The ana attribute
- provides one way of associating an element with some analysis of it
- points to an analysis which may be defined in any of the following ways:
  - a bald prose description
  - an <interp> element
  - a formally defined feature structure
The type attribute provides an alternative
method of categorisation
17. Simple word-level analyses
<s n="11">
<w ana="#DT">The</w>
<w ana="#NN">export</w>
<w ana="#IN">of</w>
<w ana="#NNS">sardines</w>
<w ana="#IN">in</w>
<w ana="#NN">oil</w>
<w ana="#IN">from</w>
<w ana="#NP">Sweden</w>
<w ana="#VBZ">is</w>
<w ana="#VVN">prohibited</w>
<c ana="#SENT">.</c>
</s>
This requires, somewhere, a definition of DT, NN, IN etc. such as
<interp xml:id="DT">
<desc>determiner</desc>
</interp>
<interp xml:id="NN">
<desc>singular noun</desc>
</interp>
<interp xml:id="NNS">
<desc>plural noun</desc>
</interp>
18. Or alternatively...
<s n="11">
<w type="DT">The</w>
<w type="NN">export</w>
<w type="IN">of</w>
<w type="NNS">sardines</w>
</s>
This requires an ODD in which the legal values for the
type attribute have been defined, using a modified
declaration such as
<elementSpec ident="w" mode="change">
  <attList>
    <attDef ident="type" mode="replace">
      <valList>
        <valItem ident="DT">
          <desc>determiner</desc>
        </valItem>
        <valItem ident="NN">
          <desc>singular noun</desc>
        </valItem>
        <valItem ident="NNS">
          <desc>plural noun</desc>
        </valItem>
      </valList>
    </attDef>
  </attList>
</elementSpec>
19. A word on feature structure representation
- The feature structure is a widely-used concept in
theoretical linguistics
- Any analysis can be represented by bundles of named
feature-value pairs
- The TEI representation of this is now the basis of an ISO standard
  (ISO 24610-1), providing a theoretically neutral and pragmatic
  solution to the problem of intermachine communication
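As a hedged sketch of what such a bundle of feature-value pairs can look like in TEI, the NNS tag from the earlier slides might be defined as a feature structure instead of an <interp>. The feature names pos and number, and their values, are assumptions made for this example.

```xml
<!-- defined once, somewhere outside the running text -->
<fs xml:id="NNS">
  <!-- feature names and values are illustrative assumptions -->
  <f name="pos"><symbol value="noun"/></f>
  <f name="number"><symbol value="plural"/></f>
</fs>

<!-- in the text: ana points at the feature structure -->
<w ana="#NNS">sardines</w>
```

The word-level markup is unchanged. Only the target of ana becomes more formal.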
20. Classification and categorization at higher levels
TEI also provides mechanisms for representing classification or
analysis of higher level objects, such as text divisions, or whole
texts in a corpus.
- <div> elements can be typed in the same way as <w>
elements
- or they can use the decls attribute to point to
relevant metadata elements in the header
And the <catRef> and <taxonomy> elements can be used
to specify text-level analyses, as you already know...
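The pieces mentioned above might fit together roughly as follows. This is a sketch only: the identifiers texttypes, cutting and tc-cutting, the category description, and the placeholder paragraph are all hypothetical.

```xml
<!-- In the TEI header: declare the categories -->
<classDecl>
  <taxonomy xml:id="texttypes">
    <category xml:id="cutting">
      <catDesc>Press cutting with a humorous comment</catDesc>
    </category>
  </taxonomy>
</classDecl>

<!-- In the TEI header: assign a category -->
<textClass xml:id="tc-cutting">
  <catRef scheme="#texttypes" target="#cutting"/>
</textClass>

<!-- In the text: a typed division pointing at that metadata -->
<div type="cutting" decls="#tc-cutting">
  <p>Text of the division.</p>
</div>
```

Here type gives a quick local label, while decls and <catRef> tie the division to categories defined once in the header.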