3 The rationale of declarative markup

3.1 XML as a markup meta-language

As mentioned in the technical background section of this document, the acronym XML stands for eXtensible Markup Language. XML is indeed, first of all, a markup or encoding language: a language for annotating texts.

To be more precise, XML, developed by the W3C (World Wide Web Consortium), is a simplified derivative of SGML (Standard Generalized Markup Language). It is actually a markup meta-language, that is, a language for describing other languages. A meta-language defines a methodology and describes markup schemas, together with some requirements for achieving a ‘standardised’ encoding. Indeed, a first important difference between HTML (HyperText Markup Language) and XML is that HTML consists of a fixed set of tags, whereas XML is extensible: it can be used to create new sets of tags.

3.2 XML as a declarative markup system

There are different types of markup. XML can be defined as a declarative markup language, that is to say a markup system based on the logical document model. It aims at describing the logical structure of a document independent of its physical representation.

This type of markup is also called descriptive or semantic markup, since the portions of a text are declared and defined on the basis of their function or meaning. While HTML is mostly a presentational markup language, XML describes what a textual portion means rather than how it should appear. See, for instance, the following example:

  • HTML
    In both of the following cases <i> stands for italics, and browsers would render the text in italics:
    <i>Divina Commedia</i>
    <i>mens sana in corpore sano</i>
  • XML
    <title>Divina Commedia</title>
    <foreign>mens sana in corpore sano</foreign>
    <title> indicates that the portion of text it wraps is a title, no matter how it is going to be rendered; while <foreign> indicates that the relevant text is in a language different from the one of the main text, no matter how it is going to be displayed.

Thus, in XML:

  • in contrast with a presentational approach, the meaning of the text is considered as separate from its display or presentation;
  • the meaning of the text is made explicit through one or more levels of interpretation which are added to the text.

Indeed, an XML document aims to embed the meaning or meanings of the textual data it describes, thus giving an interpretative representation of the actual content of the text in question. It does so by excluding all the procedural information related to the processing of the content for presentational purposes. The textual content is therefore perceived as separate from the presentation formats it can potentially assume.

For this reason, the XML document can be considered as a sort of matrix from which different presentational outputs can be generated: outputs to different media (e.g. web, print) or different devices (e.g. computer, PDA, mobile, speech browser) in different formats (e.g. HTML, TXT, PDF, XML, XHTML).
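
The idea of a single source feeding several outputs can be illustrated with a minimal sketch in Python, using the standard library module xml.etree.ElementTree. The sample sentence, the wrapping <p> element and the function names to_html and to_text are assumptions made for illustration. Rendering both elements in italics is only one possible presentational choice, not something XML itself prescribes.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample: <title> and <foreign> follow the example above;
# the sentence and the wrapping <p> are invented for illustration.
SOURCE = (
    "<p>She quoted <title>Divina Commedia</title> and the motto "
    "<foreign>mens sana in corpore sano</foreign>.</p>"
)

def to_html(elem):
    """Render every child element in italics (one possible presentation)."""
    parts = [html.escape(elem.text or "")]
    for child in elem:
        parts.append("<i>" + to_html(child) + "</i>")
        parts.append(html.escape(child.tail or ""))
    return "".join(parts)

def to_text(elem):
    """Drop all markup and keep only the character data."""
    return "".join(elem.itertext())

root = ET.fromstring(SOURCE)
html_output = "<p>" + to_html(root) + "</p>"
text_output = to_text(root)

print(html_output)
print(text_output)
```

The XML source is never modified: each output format is produced by a separate function, so a different rendering (for example, quotation marks for titles) could be added without touching the encoded text.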

3.2.1 Structured textual models

In addition, while HTML is a markup that can be defined as linear, XML is a hierarchical markup language, where portions of the text are integrated within a hierarchical structure. So, for instance, the HTML tags <h1> and <h2> specify a function of the portions of text they wrap (heading of level 1 and heading of level 2, respectively), but they can be inserted freely in the text without restrictions on the relationships between different headings in the page or between the headings and the relevant content underneath. This is a useful approach for those types of texts which feature a poor or inconsistent structure. Some have associated this type of linear markup with the image of a necklace made of a string of beads.

<body>
 <h2>Welcome</h2>
 <p>Welcome to the homepage...</p>
 <h3>News</h3>
 <p>Latest news...</p>
</body>

By contrast, in a declarative hierarchical markup, portions of a textual document are defined on the basis of their position. Indeed, XML specifies both:

  • which portions of a textual document are contained within other portions;
  • which portions are required in specific contexts.

This hierarchical structure typical of XML has been compared to the image of a set of Chinese boxes or nesting Russian dolls.

<div>
 <head>Welcome</head>
 <p>Welcome to the homepage...</p>
 <div>
  <head>News</head>
  <p>Latest news...</p>
 </div>
</div>
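
Because the divisions are nested, the level of each heading does not need to be stated in the markup: it can be derived from the position of its <div>. The following Python sketch illustrates this with the standard library module xml.etree.ElementTree. The function name outline is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# The nested structure shown above.
SOURCE = """
<div>
 <head>Welcome</head>
 <p>Welcome to the homepage...</p>
 <div>
  <head>News</head>
  <p>Latest news...</p>
 </div>
</div>
"""

def outline(div, depth=1):
    """Return (depth, heading) pairs for a <div> and the <div>s nested in it."""
    entries = []
    head = div.find("head")  # looks at direct children only
    if head is not None:
        entries.append((depth, head.text))
    for child in div.findall("div"):
        entries.extend(outline(child, depth + 1))
    return entries

root = ET.fromstring(SOURCE.strip())
headings = outline(root)

# Indent each heading according to its nesting depth.
for depth, text in headings:
    print("  " * (depth - 1) + text)
```

In the linear HTML version the relationship between "Welcome" and "News" has to be inferred from the choice of <h2> and <h3>; here it follows directly from the containment of one <div> in another.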

3.3 Independence, portability and relative stability

Since the information in an XML document is stored as plain text, XML is independent of any particular hardware or software. As an international standard, it is therefore considered particularly well suited to the exchange of data. Indeed, if you create an XML document on your PC and email it to a friend who has a Mac, she will be able to open it, read it, edit it if need be, and send it back to you. By accepting and sending information in plain-text format, programs running on disparate platforms can communicate with each other and be interoperable.

The international use of XML is strengthened even more by the fact that it fully supports the Unicode standard for character encoding. Unicode is an internationally recognized standard which provides a unique number for every character, no matter what the platform, the program or the language.
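
What "a unique number for every character" means in practice can be shown with a short Python sketch. The sample, a Greek word inside a <foreign> element, is an assumption made for illustration. The characters are written as escapes so that their Unicode numbers are visible in the code itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the Greek word "logos", written with escapes
# so that each character's Unicode number appears in the source.
word = "\u03bb\u03cc\u03b3\u03bf\u03c2"
xml_bytes = f"<foreign>{word}</foreign>".encode("utf-8")

# The parser decodes the bytes back into the same characters.
parsed = ET.fromstring(xml_bytes).text
code_points = [f"U+{ord(ch):04X}" for ch in parsed]
print(code_points)

# A numeric character reference names a character by its number.
from_reference = ET.fromstring("<foreign>&#x3BB;</foreign>").text
print(from_reference == parsed[0])
```

The same characters are recovered whether they are stored as encoded bytes or referred to by number, which is part of what allows an XML document to travel between platforms unchanged.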

The fact that XML is platform- and software-independent makes it relatively immune to changes in technology and, at the same time, flexible enough to adapt to new technologies.

3.4 Benefits of an abstract approach to texts

When we think of a text, whatever type of text, what comes to mind is not an unordered sequence of words, but a rather organized and meaningful structure that can be more or less complex depending on the material in question. Indeed, each discipline in the humanities claims to deal with, observe, analyse, interpret and study specific types of text. These types of text feature some idiosyncratic characteristics that can be generally subsumed under the following multiple textual aspects and structures:

  • physical e.g. the binding of a codex as opposed to its gatherings, the erasures of some words on a version of a text, the presence of colours in a text etc.
  • structural e.g. divisions of a text into chapters, columns, sections, stanzas etc.
  • implicit or extra-textual e.g. information on the text's author and date, or on the provenance of a textual source and the archival collection where it is physically held etc.
  • semantic or interpretative e.g. editorial interventions, identification of variants, abbreviations, readings, names of persons and places that occur in the text etc.

These broad definitions of textual aspects are not necessarily exhaustive and can be expanded and tailored to cover particular research aims and textual phenomena. Indeed, potentially there are infinite ways of making, describing and interpreting models of texts.

As a declarative markup language, XML can be used to mark the interpretation of a text explicitly on the basis of a set of conventions to be followed while encoding the text itself. Given its ability to express the structures of texts, XML can be very powerful, especially when the texts to be encoded follow a consistent organization and are substantial in number.

Based on the methodologies and conventions of a specific discipline, on the editorial strategy, and on the objectives of study, a text can therefore be organised in various abstract units of different type and size.

Because this idea of text as a structurally meaningful object is deeply rooted in the humanities, and given the specific features of XML as described above, it is easy to understand why XML is particularly suited to support the study of texts for scholarly purposes.

In addition, because of the importance given to the hierarchical structure of texts, XML supports an approach to the text that focuses very much on the concept of type of document, on the meaningful structural constituents of text rather than on its presentational appearance. So, for instance, consider the following XML structure to encode a chapter within a book:
<div type="chapter">
 <head>Chapter heading here</head>
 <div type="lessons">
  <p>text of paragraph here</p>
  <p>other paragraphs would follow</p>
 </div>
 <div type="assignments">
  <div type="readings">
   <p>text of paragraph here</p>
   <p>other paragraphs would follow</p>
  </div>
  <div type="writing">
   <p>text of paragraph here</p>
   <p>other paragraphs would follow</p>
  </div>
  <div type="discussion">
   <p>text of paragraph here</p>
   <p>other paragraphs would follow</p>
  </div>
  <div type="key_terms">
   <p>text of paragraph here</p>
   <p>other paragraphs would follow</p>
  </div>
 </div>
 <div type="resources">
  <div type="additional_readings">
   <p>text of paragraph here</p>
   <p>other paragraphs would follow</p>
  </div>
  <div type="web_links">
   <p>text of paragraph here</p>
   <p>other paragraphs would follow</p>
  </div>
 </div>
 <div type="assessment">
  <p>text of paragraph here</p>
  <p>other paragraphs would follow</p>
 </div>
</div>

This structure may be useful to encode only those types of chapters that fit within it (for instance, the chapters of a particular course handbook), while it will be completely unsuited for those types of documents that do not respect this kind of composition. It follows that a scholar who uses XML to encode her texts of interest has to concentrate on the structural organisation of these texts and therefore on those textual features which are directly relevant to her study and purposes of analysis.
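
Whether a given chapter fits an expected structure can be checked mechanically. In practice this is usually done with a schema language such as a DTD, RELAX NG or XML Schema; the Python sketch below is only a simplified stand-in for that idea. The rule in REQUIRED_TYPES, the function name missing_sections and the two sample chapters are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: every chapter must contain these two kinds of division.
REQUIRED_TYPES = {"lessons", "resources"}

def missing_sections(chapter_xml):
    """Return the required division types absent from a chapter, sorted."""
    chapter = ET.fromstring(chapter_xml)
    # Only divisions that are direct children of the chapter are considered.
    present = {div.get("type") for div in chapter.findall("div")}
    return sorted(REQUIRED_TYPES - present)

# Two invented chapters: the first fits the rule, the second does not.
fits = (
    '<div type="chapter"><head>One</head>'
    '<div type="lessons"><p>text</p></div>'
    '<div type="resources"><p>text</p></div></div>'
)
does_not_fit = (
    '<div type="chapter"><head>Two</head>'
    '<div type="lessons"><p>text</p></div></div>'
)

print(missing_sections(fits))
print(missing_sections(does_not_fit))
```

A check of this kind only becomes possible once the scholar has decided which structural units matter for her purposes, which is the point made above.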

However, making explicit a structure which is claimed to exist implicitly is not an easy task. First of all, the texts that need to be encoded in XML may already exist in a format which is particularly hard to re-mediate. A good example is a print dictionary. When we browse it to look up a word, we are not necessarily aware of its structure, or of the fact that different chunks of text in an entry may bear different meanings and functions. But if we want to produce an XML document which expresses the structure of a typical page of this dictionary, we need to analyse the dictionary and make explicit all those elements that may look the same on a printed page.

Once the document analysis (i.e. the identification of the meaningful constituents of a text for XML encoding purposes) is done and the XML document has been produced in a more or less definitive form, every kind of search based on the document structure can be performed, multiple displays can be arranged, the document can be re-purposed for different media and formats, and its data can be re-used and exchanged. In a way, the difficulties of expressing a document structure in XML are repaid by the potential advantages of carrying out further processing with it, such as producing sophisticated deliveries and advanced combinatorial searches based on the structure of the text.

Note: To give a very simple example, if in a text containing both prose and poetry, both lines of verse and passages of prose are encoded in XML, it will be possible to search for specific occurrences of text within prose or poetry separately, if needed and if meaningful for specific research questions.
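
The structure-based search described above can be sketched in a few lines of Python. The sample text and the element names <text>, <p> (prose paragraph), <lg> (group of lines) and <l> (line of verse) are assumptions made for illustration, as is the function name find_in. The matching is a plain case-insensitive substring test, far simpler than a real search tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mixing prose (<p>) and verse (<l> inside <lg>).
SOURCE = """
<text>
 <p>The rose garden was empty that morning.</p>
 <lg>
  <l>A rose is red at break of day,</l>
  <l>the violet waits beside the way.</l>
 </lg>
 <p>Nobody mentioned the violet again.</p>
</text>
"""

def find_in(root, tag, word):
    """Return the text of every <tag> element that contains the given word."""
    hits = []
    for elem in root.iter(tag):
        content = "".join(elem.itertext())
        if word.lower() in content.lower():
            hits.append(content)
    return hits

root = ET.fromstring(SOURCE.strip())
rose_in_prose = find_in(root, "p", "rose")
rose_in_verse = find_in(root, "l", "rose")

print(rose_in_prose)
print(rose_in_verse)
```

The same word is found in different places depending on which element is searched, a distinction that would be unavailable if prose and verse had not been encoded separately.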

Date: 2013-03-21