Digital Tools for South Asian Studies

Contents: Searching Online Corpora, Using grep, Using Regular Expressions

Searching Online Corpora

There are a few services that provide searchable corpora of texts in Indian languages, meaning an online interface which allows you to put queries to a corpus of texts and which will display the results:

SARIT (“Search and Retrieval of Indic Texts”): This site contains a modest but growing number of texts, at this point largely related to Buddhist philosophy and Mīmāṁsā, which are also available for download on the project’s GitHub website. The web application allows you to search for texts in Dēvanāgarī or transliteration, using wildcards, and presents results in a keyword-in-context display. (Full disclosure: I have worked on this project.)
The SARIT Corpus has also been loaded onto a Philologic application, where it can be searched as well.
DCS (“Digital Corpus of Sanskrit”). This is a large and extremely rich corpus of Sanskrit texts that have been entered and tagged by Oliver Hellwig (with his SanskritTagger program). The searches are lemmatized.
Muktabodha. A large collection of mostly plain-text files which can be searched using regular expressions; it also contains transcriptions of unpublished manuscripts.
Pandanus Sanskrit e-Texts.

Most of these tools have their own instructions for searching texts, which you are encouraged to read. In order to effectively use these tools, you must answer the following questions:

What transliteration system do I use for inputting a search term?

Although we typically use Unicode to write in Indian languages (either in an Indic script, such as Dēvanāgarī, or in a standard romanization format, such as IAST or ISO-15919), some tools require that searches be typed in an ASCII transliteration system, which maps the sounds of an Indian language onto combinations of plain ASCII letters. Common ASCII transliteration systems include Harvard-Kyoto, ITRANS, Velthuis (those of us who used the Velthuis LaTeX package are familiar with this), and SLP1. SLP1 is very commonly used today in the back end of applications, since it provides a one-to-one match between Sanskrit sounds and ASCII characters, which makes it ideal for computation.

Some tools, such as SARIT, allow search terms to be input in Unicode. Others, such as Muktabodha and Pandanus, require search terms to be input in an ASCII transliteration. Muktabodha, for example, requires terms in the Velthuis transliteration to be enclosed in curly brackets, and terms in the Harvard-Kyoto transliteration to be enclosed in angle brackets.

Can I use wildcards or regular expressions?

Wildcards are a kind of “light” version of regular expressions: they can expand your search to match more than the literal value of what you’ve typed in. For example, SARIT supports the wildcards * and ?, which will match any number of letters and one letter respectively. See the search guidelines for more information.

Some sites, such as Muktabodha, support relatively robust regular expressions in their searches, for which see below.

Can I search for words in proximity to each other?

This is a feature that is not very widely supported, although SARIT allows you to formulate queries with the boolean AND operator, which will find passages where two (or more) terms co-occur.

Using grep

Before Google, there was grep. It is a program, introduced to Unix in 1973, that allows one to search through a file, or set of files, for a given expression.

Getting text files onto your computer

One recommendation is to download the GRETIL texts as a zip file and put them somewhere on your computer. Other corpora also offer downloads, such as SARIT (here is a direct download link).

Finding them in your terminal

Because grep is a command-line tool, you need to use it in a terminal. If you use Linux, you shouldn’t need any help with this (I use GNOME Terminal). If you use a Mac, see this tutorial: macOS is a Unix-based operating system, so it already comes with a terminal and with programs like grep, even if you might have to look around for them.

Windows is, as usual, more of a problem: its built-in command-line interfaces (Command Prompt and PowerShell) do not include standard Unix programs like grep. One option is the Windows Subsystem for Linux, which gives you a Linux terminal with grep. There are also some grep clones that I have not tried, such as AstroGrep, which might be worth a shot.

Get to grepping

First, use the cd command to navigate to the directory where your texts live. Mine are in /home/andrew/Dropbox/e-texts.

$ cd /home/andrew/Dropbox/e-texts

Now you can use grep to search. The basic usage is:

$ grep -R "dharma" *

This will invoke the command grep with the following arguments:

-R is a flag that tells the program to search recursively, that is, including all of the subdirectories of the current directory, and their subdirectories, etc.
"dharma" is the search string. The quotation marks set off the search string; they are not part of it.
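* stands for every file and directory in the current directory (as we will see, * is computer-speak for “everything”). Strictly speaking, it is the shell, not grep, that expands * into the list of names; grep then searches each of them.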
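A few other flags are worth knowing from the start. What follows is a minimal sketch rather than a complete tour: it builds a tiny throwaway corpus (the file names and contents are made up for illustration) and assumes a grep that supports -R, such as GNU grep on Linux or the grep that ships with macOS.

```shell
# Build a tiny throwaway corpus (hypothetical file names and contents).
corpus=$(mktemp -d)
mkdir "$corpus/sub"
printf '%s\n' 'Dharma eva hato hanti' 'dharmo rakshati rakshitah' > "$corpus/a.txt"
printf '%s\n' 'na dharmah' > "$corpus/sub/b.txt"

# -n prefixes each match with its line number; -i ignores case,
# so "Dharma" is found as well as "dharmah".
numbered=$(grep -R -n -i "dharma" "$corpus" | sort)
printf '%s\n' "$numbered"

# -l lists only the names of the files that contain a match.
names=$(grep -R -l "dharm" "$corpus" | sort)
printf '%s\n' "$names"

# -c counts the matching lines in a file instead of printing them.
count=$(grep -c -i "dharm" "$corpus/a.txt")
printf '%s\n' "$count"    # 2
```

The output is piped through sort only because the order in which grep walks through directories is not guaranteed.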

The full list of options for grep is available, as usual, in the manual page, which can be accessed by typing the following into the terminal:

$ man grep
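grep has no single option for finding two terms together, but the kind of co-occurrence search that SARIT offers with AND can be approximated by chaining two grep commands. This is only a sketch under stated assumptions: the sample files are invented, and "together" here means either "on the same line" or "in the same file", which is cruder than a passage-level search.

```shell
# Build a tiny throwaway corpus (hypothetical file names and contents).
corpus=$(mktemp -d)
printf '%s\n' 'dharma artha kama' 'dharma eva hato hanti' > "$corpus/a.txt"
printf '%s\n' 'artha eva pradhanah' > "$corpus/b.txt"

# Lines that contain BOTH terms: the second grep filters the output of the first.
# -h suppresses the file names, so the second grep only sees the text itself.
both=$(grep -R -h "dharma" "$corpus" | grep "artha")
printf '%s\n' "$both"     # dharma artha kama

# Files that contain both terms anywhere, not necessarily on the same line:
# -l prints file names, and xargs hands those names to the second grep.
files=$(grep -R -l "dharma" "$corpus" | xargs grep -l "artha")
printf '%s\n' "$files"    # the path of a.txt
```

The xargs version assumes that the file names contain no spaces.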

Using Regular Expressions

Regular expressions (regex) are like search terms, but more powerful. Like a simple search term, such as dharm, a regular expression is meant to be evaluated against a text or set of texts, and it will produce a number of matches. But unlike simple search terms, regular expressions contain characters, such as ., +, and *, that are not matched against the corresponding characters in the text, but tell the matching program something about how it should look for matches.
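To make the three special characters just mentioned concrete, here is a small sketch using grep with the -E flag (extended regular expressions). The word list is invented for illustration, and the anchors ^ and $, which tie the expression to the beginning and end of a line, are added so that only whole-line matches are reported.

```shell
# A made-up word list, one word per line.
words=$(mktemp)
printf '%s\n' 'dharma' 'dharmya' 'dharmma' 'darma' 'karma' > "$words"

# . matches any single character.
dot=$(grep -E '^dhar.a$' "$words")
printf '%s\n' "$dot"      # dharma

# + matches one or more repetitions of the preceding character.
plus=$(grep -E '^dharm+a$' "$words")
printf '%s\n' "$plus"     # dharma, dharmma

# * matches zero or more repetitions of the preceding character.
star=$(grep -E '^dh*arma$' "$words")
printf '%s\n' "$star"     # dharma, darma
```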

Regex is supported in a wide range of programs. But these programs sometimes differ in the “flavor” of regex they use: mostly these differences have to do with how the special characters are escaped. I will speak first about using regex in the standard Unix program grep, and then about using it in GNU Emacs. Mileage will vary with your text editor.
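grep itself illustrates the escaping differences, since it understands two flavors: basic regular expressions (the default) and extended regular expressions (with -E). The sketch below, run on an invented word list, shows the same character being literal in one flavor and special in the other.

```shell
# A made-up word list, one entry per line.
words=$(mktemp)
printf '%s\n' 'dharma' 'dharmadharma' 'a+b' > "$words"

# Basic flavor (plain grep): an unescaped + is an ordinary character.
literal=$(grep 'a+b' "$words")
printf '%s\n' "$literal"  # a+b

# Extended flavor (grep -E): + means "one or more", so nothing matches here.
# "|| true" keeps the script going when grep finds no match.
operator=$(grep -E 'a+b' "$words" || true)

# In the extended flavor, a literal + has to be escaped.
escaped=$(grep -E 'a\+b' "$words")
printf '%s\n' "$escaped"  # a+b

# Grouping and repetition counts: escaped in basic, bare in extended.
bre=$(grep '\(dharma\)\{2\}' "$words")
ere=$(grep -E '(dharma){2}' "$words")
printf '%s\n' "$bre" "$ere"   # dharmadharma, twice
```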

There are some good general resources about regular expressions available:

regular-expressions.info tutorial
using regular expressions with grep

Regular Expressions with grep

See this tutorial.
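As a quick taste before you read the tutorial, here is a minimal sketch of the kind of search that regular expressions make possible in grep: collecting the inflected forms of two stems at once. The sample lines are invented and written without diacritics to keep the example simple.

```shell
# A made-up text file.
lines=$(mktemp)
printf '%s\n' 'dharmasya tattvam' 'dharmena hinah' 'karmana jayate' 'adharmam iti' > "$lines"

# (dharm|karm) matches either stem; [a-z]+ matches the run of lowercase
# letters that follows it. -o prints only the matched text, not the whole line.
forms=$(grep -E -o '(dharm|karm)[a-z]+' "$lines")
printf '%s\n' "$forms"    # dharmasya, dharmena, karmana, dharmam

# ^ anchors the expression to the start of a line, so "adharmam" is not counted.
initial=$(grep -c '^dharm' "$lines")
printf '%s\n' "$initial"  # 2
```

Note that the first search finds dharmam inside adharmam, since nothing in the expression requires the match to begin at the start of a word.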

Regular Expressions in GNU Emacs

Emacs offers powerful regex support, for which a general overview can be found here and here. Some Emacs tools that use regex are:

M-x query-replace-regexp: Searches the buffer for matches of a given regular expression and interactively replaces the matched text with the given text. Note that query-replace-regexp (which is usually bound to C-M-%, that is, Control-Alt-%) uses the string syntax for regex.
M-x regexp-builder: a way to build regular expressions interactively. As you write the expression in the buffer, it will highlight the matches in the text. Note that by default regexp-builder uses the read syntax. I prefer to use the string syntax, especially if I am constructing expressions for use in query-replace-regexp. You can switch between syntaxes with C-c TAB, or set your preferred one permanently with M-x customize-variable RET reb-re-syntax. For more information see here.

Regular Expressions in SublimeText

See this cheatsheet.