TEI and South Asian Texts

Basic questions about the TEI standards and their application to research in South Asia.

What is TEI?

TEI stands for the Text Encoding Initiative, a “consortium which collectively develops and maintains a standard for the representation of texts in digital form.” The current standard published by the TEI is the P5 Guidelines, which provide detailed instructions for how to represent various aspects of a text in digital form.

Encoding just means representing a particular feature of a text in a particular form. What the TEI focuses on is the actual feature that is encoded in a text, rather than the way in which it happens to be presented in a given medium. For example, let’s say I have a text that mentions the name of an author. The name is the feature of the text we’ll focus on. In a printed edition, this name might be represented in bold:

tathā cōktaṁ śrībhōjēna

Depending on how you’re creating your text, you might choose to replicate the presentation of such a feature, for instance:

  • with bold text in a word-processor document;
  • with the <b>bold</b> tag in an HTML document; or
  • with **bold** in a Markdown document or other text-based markup.

But this approach comes with some liabilities.

  1. Presentation isn’t specific enough. Unless you’ve decided to use bold for personal names that occur in the text and nothing else, there’s a strong likelihood that “bold” in your text will also mean something else, like emphasis, or a citation of the base-text in a commentary, etc. This means that, if you wanted to isolate all of the personal names, you couldn’t just search for everything you’ve put in bold; you’d have to do this and then manually filter out the other stuff.
  2. Presentation might change. What if you decide that you don’t want personal names to be in bold, but in a different color? In an approach that combines structure and presentation, you’d have to change every instance of personal names in the text so that they’re displayed in a different color rather than in bold. If you were smart enough to reserve bold only for personal names, it might be easy to make this kind of change. If not, then once again, you have a lot of manual filtering to look forward to.
  3. Human error. What if you forget your own convention and write a name in italics rather than in bold, or use square brackets where you should have used curly brackets? When working with plain text, HTML, or a word processor, there’s nothing to tell you when you’ve made a mistake, because the computer isn’t aware of your conventions.
  4. Idiosyncrasy. Even if you manage to create an error-free text that uses a particular set of conventions consistently, they are still just your own conventions. They might make sense to you, and they might make sense to some of your colleagues, but without you spelling out explicitly what they are, other people might have difficulty understanding them, and if they ever make their own changes to the text, there’s a good chance they’ll break your conventions and introduce their own.

There are approaches that minimize some of these liabilities, especially the conflation between structure and presentation. You might, for instance, use a particular plain-text convention like @@name:śrībhōjēna@@, which manages to convey that śrībhōjēna is a name without committing you to displaying that name in bold. And in the context of specific projects, this kind of approach might make sense.
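
A plain-text convention of this kind can usually be converted to XML markup mechanically later on. The following is a minimal Python sketch, assuming the @@name:...@@ convention shown above; the function name plain_to_tei is made up for illustration, and the sketch handles only this one marker.

```python
import re
from xml.sax.saxutils import escape

# The plain-text convention from the paragraph above: @@name:...@@
NAME_PATTERN = re.compile(r"@@name:(.+?)@@")

def plain_to_tei(line: str) -> str:
    """Turn @@name:X@@ markers into <name>X</name>, escaping &, < and >."""
    parts = []
    pos = 0
    for match in NAME_PATTERN.finditer(line):
        # Text before the marker is copied over with XML escaping applied.
        parts.append(escape(line[pos:match.start()]))
        parts.append("<name>" + escape(match.group(1)) + "</name>")
        pos = match.end()
    parts.append(escape(line[pos:]))
    return "".join(parts)

print(plain_to_tei("tathā cōktaṁ @@name:śrībhōjēna@@"))
# prints: tathā cōktaṁ <name>śrībhōjēna</name>
```

Escaping matters here because characters such as & and < are ordinary in plain text but have a special meaning in XML.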

But the TEI provides an answer to all of these problems at once. To encode a text according to the TEI standards means to use a well-defined and consistent markup strategy. That strategy is specified in a file called a schema, which allows your text editor to check whether you’re doing the right thing, and it focuses on structural features of the text, such as names, rather than on features of the text’s presentation, like typography. The markup is done in a language, XML, that is flexible, powerful, and relatively simple, and for which many tools are available.

In a TEI document, the above passage would look like this:

tathā cōktaṁ <name>śrībhōjēna</name>

A TEI document is fundamentally an XML document that uses the elements and attributes defined in the TEI P5 schema.
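
Once names are marked up as elements, isolating them no longer requires manual filtering. Here is a small sketch using Python's standard library. The <p> wrapper is an assumption added so that the fragment has a single root element, and a complete TEI file would also declare the TEI namespace, which this fragment leaves out.

```python
import xml.etree.ElementTree as ET

# The passage above, wrapped in <p> so that it parses as one XML document.
passage = "<p>tathā cōktaṁ <name>śrībhōjēna</name></p>"

root = ET.fromstring(passage)

# iter("name") visits every <name> element, however deeply it is nested.
names = ["".join(el.itertext()) for el in root.iter("name")]
print(names)
# prints: ['śrībhōjēna']
```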

TEI is the most common approach to encoding for digital text projects, and the guidelines are designed to support encoding all kinds of different information. Obviously not all of them are relevant to every project. But here are some aspects of texts that can be encoded in TEI:

to give just a small sample of the kinds of encoding the TEI has been used for.

Should I use TEI?

TEI is a relatively large investment. It takes a while to learn its conventions, and it might take a while to get used to editing text documents, validating them against a schema, and so on. It will also generally take some time to transform TEI documents into a form that is useful to you and your readers or colleagues. So it’s not for everyone. If you just want to type up a text to search, then TEI might be overkill.

Then again, if you want to share that text with other people, and if there are specific aspects of that text that you’ve taken care to represent in your transcription, then you might consider using TEI for your transcription, or at least converting your document to a TEI format afterwards. That way, your markup will in principle be available to others.

If, however, you are considering a project that makes heavy use of textual encoding—for instance, annotating a corpus of texts, identifying quotations, specifying relations between texts, and so on—then TEI is a natural choice. TEI is also, increasingly, a good choice for editions of texts, because you can put an arbitrarily large amount of information into your edition, and then later on you can decide how to display this information, and what kind of information you want to omit.

One of the great virtues of TEI is that it is an encoding format, not a presentation format, which means that a single TEI document can be used to generate many different output formats, such as PDF, EPUB, and HTML. The process of going from TEI to these output formats generally involves stylesheets, which will be discussed in the next section.

A great many admirable projects use TEI under the hood, and they use it for different reasons and with different effects. Here is a selection:

What can I do with TEI?

One reason that more people don’t use TEI is that TEI documents, in themselves, don’t do anything. They are text documents. They sit there, on your computer, until you “do something” with them.

One of the things you can do with them is to load them into a database, like eXist-DB. This is an XML database (which several of the projects mentioned above use) that allows you to perform database-like queries on TEI documents, and hence it’s good if you have a lot of documents to deal with. This has a steep learning curve—it is a database platform, after all—but I find it to be relatively straightforward and powerful, and it is open-source.

Most of the time, TEI documents are transformed into other kinds of documents using XSLT Stylesheets. These are documents that tell a program (an XSLT Processor) how to change an XML document into another kind of document. In my case, it is usually either an HTML document or a LaTeX document (the latter being used to produce PDF documents).
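
The idea behind a stylesheet is that presentation rules live outside the encoded text. The sketch below is not XSLT (Python's standard library has no XSLT processor); it only imitates the principle in Python, with hypothetical rule tables HTML_RULES and LATEX_RULES that decide how a <name> is displayed in each output format.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical presentation rules: what each output format puts around <name>.
HTML_RULES = {"name": ("<b>", "</b>")}
LATEX_RULES = {"name": (r"\textbf{", "}")}

def render(el, rules, escape=lambda s: s):
    """Flatten an element's content to a string, wrapping children per rules."""
    out = [escape(el.text or "")]
    for child in el:
        before, after = rules.get(child.tag, ("", ""))
        out.append(before + render(child, rules, escape) + after)
        # child.tail is the text that follows the child inside its parent.
        out.append(escape(child.tail or ""))
    return "".join(out)

source = ET.fromstring("<p>tathā cōktaṁ <name>śrībhōjēna</name></p>")

as_html = render(source, HTML_RULES, escape=html.escape)
as_latex = render(source, LATEX_RULES)  # LaTeX special characters not escaped here

print(as_html)
print(as_latex)
```

Changing how names are displayed means editing one entry in a rule table, while the encoded text stays untouched. A real XSLT stylesheet expresses the same kind of rule as a template that matches the element.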

XSLT processors aren’t especially intuitive to use, but if you are looking for HTML output, most modern web browsers include an XSLT processor that can be called from JavaScript. This means that you can transform XML (and therefore TEI) into HTML on the fly in a web browser. Some examples for doing so can be found on my Mālatīmādhava website.

How do I start with TEI?

First you need a text editor. A text editor is different from a word processor. Word processors operate on complex files (like DOCX or ODT files) that are actually zipped bundles of XML files, under the hood, but which the word processor interprets to look like a piece of paper with text on it. Text editors allow you to interact with the text directly. When you are dealing with text documents, and attempting to add markup to those documents, you do not want a word processor imposing different kinds of markup on them. TEI documents are text files, not word processor files.

There are a number of text editors available for different platforms, but you will want one with the following features:

  • syntax highlighting: your editor will color elements, attributes, and text nodes for you.
  • validation: your editor should check, first of all, whether your XML document is well-formed, and secondly, whether it validates against the TEI schema.
  • completion: your editor should close tags for you, and if it is plugged into a schema, it should also suggest which tags are possible to use at a given location in the document.

oXygen XML Editor is a text editor specifically designed for XML editing, and comes with out-of-the-box support for TEI. It’s available for several platforms, but it’s commercial software. Sublime Text Editor, another piece of cross-platform commercial software, apparently has XML schema validation through the Exalt plugin. BBEdit is a simple text editor for Mac, which is also commercial; I cannot attest to its support for XML. Vim is a great piece of free, open-source, cross-platform software that can be used for XML editing (see here), although it has a relatively steep learning curve.

My preferred text editor, for several years now, has been GNU Emacs. It also has a relatively steep learning curve. But it is powerful, free, and open-source. I use it on Linux, but it can be used on Mac or Windows. Recent versions include nXML mode, which makes it very easy to edit XML documents, including validating against a schema. I highly recommend Emacs, especially since it can be used for lots of other things.

Basics of XML

An XML document is a plain-text document that is structured into things called elements. An element in XML is represented using tags that are set off in angle brackets: most elements will have a beginning and an end, represented by an opening tag and a closing tag, as follows:

<p>This is a paragraph.</p>

Note that the name of the element is contained within the angle brackets of both the opening and closing tag, and that the closing tag has a slash before the name of the element.

Elements can have content, as this example shows. But besides text content, elements can contain other elements:

<p><s>This is a sentence.</s></p>

They can also have mixed content, i.e., both text content and other elements:

<s>I think that was <name>Browning</name>.</s>
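
To see how software reads mixed content, here is a short sketch with Python's xml.etree.ElementTree. The text and tail attributes belong to that particular library and are not XML terminology.

```python
import xml.etree.ElementTree as ET

s = ET.fromstring("<s>I think that was <name>Browning</name>.</s>")
name = s.find("name")

print(repr(s.text))     # text of <s> before its first child: 'I think that was '
print(repr(name.text))  # text inside <name>: 'Browning'
print(repr(name.tail))  # text of <s> that follows <name>: '.'

# itertext() walks all of these pieces in document order.
full_sentence = "".join(s.itertext())
print(full_sentence)    # I think that was Browning.
```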

Not all XML elements have content. Some of them are empty, such as the TEI milestone elements, which are used to mark page and line breaks:

<s>I think that<pb/> was <name>Browning</name>.</s>

In this example, the <pb/> element has no content, and is marked as such by the use of the slash before the closing angle-bracket.

In addition to its content, any XML element can also have a series of attributes, which provide information about that element in machine-readable form. For example:

<s>I think that <pb n="4"/> was <name>Browning</name>.</s>

The n attribute of the element <pb/> provides the number of the page.

Elements can have multiple attributes:

<s>I think that <pb n="4" ed="#1960"/> was <name>Browning</name>.</s>

The use of the ed attribute here tells us that the beginning of the fourth page, which is represented by the element <pb n="4"/>, occurs in a particular edition, which is here referred to by means of an identifier.
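
Because attributes are machine-readable, a program can retrieve them directly. This is a minimal sketch with Python's standard library; the attribute name facs in the last lookup is used only to show what happens when an attribute is absent.

```python
import xml.etree.ElementTree as ET

s = ET.fromstring(
    '<s>I think that <pb n="4" ed="#1960"/> was <name>Browning</name>.</s>'
)
pb = s.find("pb")

page = pb.get("n")         # '4' (attribute values are always strings)
edition = pb.get("ed")     # '#1960'
facsimile = pb.get("facs") # None, because this <pb/> has no such attribute

print(page, edition, facsimile)
print(len(pb), pb.text)    # 0 None: an empty element has no children and no text
```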

One important property of an XML document is well-formedness. A document is well-formed if there are no errors with its syntax: that is, no elements are opened without being closed, no elements are closed without being opened, every attribute is assigned a value with quotation marks, and so on. Most XML editors will automatically check to see if a document is well-formed, but the command-line tool xmllint will do this as well.
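
A well-formedness check amounts to asking an XML parser whether it can read the document at all. The sketch below does this with Python's standard library; the helper name is_well_formed and the sample strings are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

samples = [
    "<p>This is a paragraph.</p>",      # fine
    "<p>This is a paragraph.",          # opened but never closed
    "This is a paragraph.</p>",         # closed but never opened
    "<p><s>Badly nested.</p></s>",      # closing tags in the wrong order
    "<pb n=4/>",                        # attribute value without quotation marks
]

for sample in samples:
    print(is_well_formed(sample), sample)
```

Only the first sample passes. Note that this says nothing about whether the elements are valid TEI; that is a separate question.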

Schemas and Validation

XML is “extensible” in the sense that you can always define your own elements and attributes. But in the context of big projects like the TEI, it’s important to have consistency in the repertoire of elements and their behavior. That’s why XML is usually used with a schema, which is a machine-readable document that tells your XML editor (or validator) precisely what elements are allowed in what contexts, and what attributes they’re allowed to have. Schemas are what separate out “plain” XML documents from XML documents that are crafted to reflect a certain formal specification of elements and attributes, such as TEI documents. As an example, an XML document can begin with any element you want, but a TEI document must begin with the element <TEI>.

Schemas are provided by the TEI in a number of formats. Your XML editor or validator might prefer a certain format; Emacs’ nXML mode prefers RNC (RELAX NG Compact syntax). Checking a document against a schema is called validation. Once your editor is aware of the schema, it will probably help you insert and complete TEI elements.
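
Real validation requires a schema-aware tool, which Python's standard library does not provide. The sketch below is therefore not validation: it checks only the single rule mentioned above, that the document begins with <TEI>. It also assumes the TEI namespace URI, which is not discussed in this text, and the function name has_tei_root is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI P5 documents (an addition; not covered in the text above).
TEI_NS = "http://www.tei-c.org/ns/1.0"

def has_tei_root(xml_text: str) -> bool:
    """Check one schema rule only: the root element is <TEI> in the TEI namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags as "{namespace}localname".
    return root.tag == "{" + TEI_NS + "}TEI"

tei_like = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/></TEI>'
plain_xml = "<p>This is a paragraph.</p>"

print(has_tei_root(tei_like))   # True
print(has_tei_root(plain_xml))  # False
```

A document that passes this check can still be invalid in many other ways, which is why a full schema and a validating editor are worth setting up.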