As mentioned in the technical bakcgounrd sectionof this document, the acronym XML stand for eXtensible Mark-up Language. XML is indeed, first of all, a language of markupor encoding, a language to annotate texts.
To be more precise, being a simplified derivative of SGML (Standard Generalized Markup Language) developed by W3C (World Wide Web Consortium), XML is actually a meta-language of mark-up, a language for describing other languages. A meta-language is a language which defines a methodology and describes schemas of mark-up together with some requirements for achieving a ‘standardised’ encoding. Indeed, a first important difference between HTML (Hypertextual Markup Language) and XML is the fact that while HTML consists of a fixed set of tags, XML is extensible. This means that it is possible to create new sets of tags with it.
There are different types of markup. XML can be defined as a declarative markup language, that is to say a markup system based on the logical document model. It aims at describing the logical structure of a document independent of its physical representation.
This type of markup is also called descriptive or semantic markup language since the portions of a text are declared and defined on the basis of their function or meaning. While HTML is mostly a presentational mark-up language, XML describes what a textual portion means rather than how it should appear. See, for instance, the following example:
Thus, in XML:
Indeed, an XML document aims to embed the meaning/s of the textual data it describes, thus giving an interpretative representation of the actual content of the text in question. It does so by excluding all the procedural information related to the processing of the content for presentational purposes. The textual content is therefore perceived as separate from the formats of presentations it can potentially assume.
For this reason, the XML document can be considered as a sort of matrix from which different presentational outputs can be generated: outputs to different media (e.g. web, print) or different devices (e.g. computer, PDA, mobile, speech browser) in different formats (e.g. HTML, TXT, PDF, XML, XHTML).
In addition, while HTML is a markup that can be defined as linear, XML is a hierarchical markup language, where portions of the text are integrated within a hierarchical structure. So, for instance, the HTML tags <h1> and <h2> specify a function of the portions of text they wrap (heading of level 1 and heading of level 2, respectively), but they can be inserted freely in the text without restrictions on the relationships between different headings in the page or between the headings and the relevant content underneath. This is a useful approach for those types of texts which feature a poor or inconsistent structure. Some have associated this type of linear markup with the image of a necklace made of a string of beads.
<p>Welcome to the homepage...</p>
On the contrary, in a declarative hierarchical markup portions of a textual document are defined on the basis of their position. Indeed, XML specifies both:
This hierarchical structure typical of XML has been compared to the image of a set of Chinese boxes or nesting Russian dolls.
<p>Welcome to the homepage...</p>
Since information in an XML document is stored in plain-text, a special feature of it is its independence from hardware and software. Therefore XML as international standard is considered to be particularly efficient for the exchange of data. Indeed, if you email from your PC an XML document created with your PC to a friend who has a MAC, she will be able to open it, read it and eventually edit it and send it back to you. By accepting and sending information in plain text format, programs running on disparate platforms can communicate with each other and be interoperable.
The international and worldwide use of XML is strengthen even more by the fact that it fully supports Unicode standard for character encoding. Unicode is an internationally recognized standard which provides a unique number for every character, no matter what the platform, the program, the language.
The fact that XML is platform and software independent makes it at the same time relatively immune to changes in technology and flexible enough to adaptation to new technologies.
When we think of a text, whatever type of text, what comes to mind is not a unordered sequence of words, but a rather organized and meaningful structure that can be more or less complex depending on the material in question. Indeed, each discipline in the humanities claims to deal with, observe, analyse, interpret and study specific types of text. These types of text feature some idiosyncratic characteristics that can be generally subsumed under the following multiple textual aspects and structures:
These broad definitions of textual aspects are not necessarily exhaustive and can be expanded and tailored to cover particular research aims and textual phenomena. Indeed, potentially there are infinite ways of making, describing and interpreting models of texts.
As a declarative markup language, XML can be used to mark the interpretation of a text explicitly on the basis of a set of conventions to be followed while encoding the text itself. Given its ability to express the structures of texts, XML can be very powerful, especially when the texts to be encoded follow a consistent organization and are of a substantial amount.
Based on the methodologies and conventions of a specific discipline, on the editorial strategy, and on the objectives of study, a text can therefore be organised in various abstract units of different type and size.
Because this idea of text as structurally meaningful object is deeply rooted in the humanities and given the specific features of XML as described above, it is easy to understand why XML is particularly suited to support the study of texts for scholarly purposes.
This structure may be useful to encode only those types of chapters that fit within it (for instance, the chapters of a particular course handbook), while it will be completely unsuited for those types of documents that do not respect this kind of composition. It follows that a scholar who uses XML to encode her texts of interest has to concentrate on the structural organisation of these texts and therefore on those textual features which are directly relevant to her study and purposes of analysis.
However, the process of making explicit a structure which is claimed to exist implicitly is not an easy task. First of all, the texts that need to be encoded in XML may already exist in a format which is particularly hard to re-mediate. A good example is for instance a print dictionary. When we browse it for looking up a word, we are not necessary aware of its structure, of the fact that different chunks of text in an entry may bear different meanings and functions. But if we want to produce an XML document which expresses the structure of a typical page of this dictionary, we would need to analyse the dictionary and make explicit all those elements that on a print page may look the same.
Once the document analysis (i.e. the identification of the meaningful constituents of a text for XML encoding purposes) is done and the XML document has been produced in a more or less definite form, every kind of search based on the document structure can be performed, a multiple display can be arranged, the document can be re-purposed for different media and formats, its data can be re-used and exchanged. In a way, the difficulties of expressing a document structure in XML are paid back by the potential advantages of carrying out further processing with it, such as producing sophisticated deliveries and advanced combinatorial search based on the structure of the textNote: To give a very simple example, if in a text containing both prose and poem, both lines of verses and lines of prose are encoded in XML, it will be possible to search specific occurrences of text within prose or poem separately, if needed and if meaningful for specific research questions..