How to find words and phrases in digital texts.
There are a few services that provide searchable corpora of texts in Indian languages, meaning an online interface which allows you to put queries to a corpus of texts and which will display the results:
Most of these tools have their own instructions for searching texts, which you are encouraged to read. In order to effectively use these tools, you must answer the following questions:
Although we typically use Unicode to write in Indian languages (either in an Indic script, such as Dēvanāgarī, or in a standard romanization format, such as IAST or ISO-15919), some tools require that searches be typed in an ASCII transliteration system, which maps the sounds of an Indian language onto combinations of plain ASCII letters. Common ASCII transliteration systems include Harvard-Kyoto, ITRANS, Velthuis (familiar to those of us who have used the Velthuis LaTeX package), and SLP1. SLP1 is very commonly used today in the back end of applications, since it provides a one-to-one match between Sanskrit sounds and ASCII characters, which makes it ideal for computation.
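To make the idea of an ASCII transliteration concrete, here is a minimal sketch that converts a Harvard-Kyoto string into IAST using sed. The function name hk_to_iast is made up for this illustration, and the mapping covers only the letters on which the two schemes differ, so treat it as a demonstration of the principle rather than a complete converter.

```shell
#!/bin/sh
# hk_to_iast: hypothetical helper, not part of any of the tools named above.
# Converts Harvard-Kyoto to IAST for the letters where the two schemes differ.
# Longer sequences (lRR, lR, RR) are replaced before the single letter R.
hk_to_iast() {
    printf '%s\n' "$1" | sed \
        -e 's/lRR/ḹ/g' -e 's/lR/ḷ/g' \
        -e 's/RR/ṝ/g'  -e 's/R/ṛ/g' \
        -e 's/A/ā/g'   -e 's/I/ī/g'  -e 's/U/ū/g' \
        -e 's/M/ṃ/g'   -e 's/H/ḥ/g' \
        -e 's/G/ṅ/g'   -e 's/J/ñ/g' \
        -e 's/T/ṭ/g'   -e 's/D/ḍ/g'  -e 's/N/ṇ/g' \
        -e 's/z/ś/g'   -e 's/S/ṣ/g'
}

hk_to_iast 'kRSNa'    # prints kṛṣṇa
hk_to_iast 'zAstra'   # prints śāstra
```

The reverse direction works the same way with the substitutions swapped, which is roughly what a search tool has to do when it accepts ASCII input but stores its texts in Unicode.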
Some tools, such as SARIT, allow search terms to be input in Unicode. Others, such as Muktabodha and Pandanus, require search terms to be input in an ASCII transliteration. Muktabodha, for example, requires terms in the Velthuis transliteration to be enclosed in curly brackets, and terms in the Harvard-Kyoto transliteration to be enclosed in angle brackets.
Wildcards are a kind of “light” version of regular expressions: they can expand your search to match more than the literal value of what you’ve typed in. For example, SARIT supports the wildcards * and ?, which will match any number of letters and one letter respectively. See the search guidelines for more information.
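The shell has its own wildcard patterns in which * and ? carry the same meanings, which makes it a convenient place to try them out. The sketch below is only an analogy and not SARIT's search engine; glob_match is a hypothetical helper name, and a shell pattern has to match the whole word, which a search interface may or may not require.

```shell
#!/bin/sh
# glob_match WORD PATTERN: succeeds if WORD matches the shell wildcard PATTERN.
# (hypothetical helper, for illustration only)
glob_match() {
    case "$1" in
        $2) return 0 ;;   # unquoted, so * and ? act as wildcards
        *)  return 1 ;;
    esac
}

for word in dharma dharmin adharma karma; do
    if glob_match "$word" 'dharm*'; then echo "dharm*  matches $word"; fi
    if glob_match "$word" '?arma';  then echo "?arma   matches $word"; fi
done
```

Here dharm* picks out dharma and dharmin, while ?arma picks out only karma, since ? stands for exactly one letter.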
Some sites, such as Muktabodha, support relatively robust regular expressions in their searches, for which see below.
This is a feature that is not very widely supported, although SARIT allows you to formulate queries with the boolean AND operator, which will find passages where two (or more) terms co-occur.
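On plain text files, a rough equivalent of an AND query can be had with grep (introduced below) by piping one search into another. This is only an approximation and says nothing about how SARIT implements its operator: the file names and contents are invented for the example, and co-occurrence is tested per line rather than per passage.

```shell
#!/bin/sh
# Build a tiny throwaway corpus so the example is self-contained.
dir=$(mktemp -d)
printf 'dharma artha kama\n' > "$dir/a.txt"
printf 'dharma alone\n'      > "$dir/b.txt"
printf 'artha alone\n'       > "$dir/c.txt"

# Lines containing BOTH terms: the first grep finds "dharma"
# (-h leaves out the file names), the second keeps only those
# lines that also contain "artha".
both_lines=$(grep -h 'dharma' "$dir"/*.txt | grep 'artha')
echo "$both_lines"

rm -r "$dir"
```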
Before Google, there was grep. It is a program, introduced to Unix in 1973, that allows one to search through a file, or set of files, for a given expression.
One recommendation is to download the GRETIL texts as a zip file and put them somewhere on your computer. Other corpora, such as SARIT, also offer downloads (here is a direct download link).
Because grep is a command-line tool, you need to use it with a terminal. If you use Linux, you shouldn’t need any help with this (I use GNOME Terminal). If you use a Mac, see this tutorial: the Apple OS is a Unix-based operating system, so it comes with programs like a terminal and grep already, even if you might have to look around for them.
Windows is, as usual, much more of a problem: its command line interfaces (if it still has any?) are DOS-based and don’t have standard Unix programs like grep. There are some grep clones that I have not tried, such as AstroGrep, which might be worth a shot.
First, use the cd command to navigate to the directory where your texts live. Mine are in /home/andrew/Dropbox/e-texts.
$ cd /home/andrew/Dropbox/e-texts
Now you can use grep to search. The basic usage is:
$ grep -R "dharma" *
This will invoke the command grep with the following arguments:
-R is a flag that tells the program to search recursively, that is, including all of the subdirectories of the current directory, and their subdirectories, etc.
"dharma" is the search string. The quotation marks set off the search string; they are not part of it.
* tells grep to look at every file (as we will see, * is computer-speak for “everything”) in the current directory.
The full list of options for grep is available, as usual, in the manual page, which can be accessed by typing the following into the terminal:
$ man grep
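A few other commonly used flags are sketched below on a small throwaway corpus, so that the commands can be tried safely. The file names and contents are invented for the example; the flags themselves are available in both GNU and BSD grep, but it is worth checking man grep on your own system.

```shell
#!/bin/sh
# Throwaway corpus with one subdirectory (names and contents are made up).
corpus=$(mktemp -d)
mkdir "$corpus/sub"
printf 'dharma eva hato hanti\nDharmo rakshati rakshitah\n' > "$corpus/one.txt"
printf 'na dharmam parityajet\n' > "$corpus/sub/two.txt"
cd "$corpus" || exit 1

grep -R -n "dharma" .    # -n: show the line number of each match
grep -R -i "dharm" .     # -i: ignore case, so "Dharmo" also matches
grep -R -c "dharma" .    # -c: count the matching lines in each file

# -l: print only the names of the files that contain a match
files_with_match=$(grep -R -l "dharma" . | sort)
echo "$files_with_match"

cd / && rm -r "$corpus"
```

The -l flag is especially handy with a large corpus, since it tells you which texts are worth opening before you read through every matching line.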
Regular expressions (regex) are like search terms, but more powerful. Like a simple search term, such as dharm, a regular expression is meant to be evaluated against a text or set of texts, and it will produce a number of matches. But unlike simple search terms, regular expressions contain characters, such as ., +, and *, that are not matched against the corresponding characters in the text, but instead tell the matching program something about how it should look for matches.
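A small sketch of what these three characters do, using grep with the -E flag (which selects the "extended" regex syntax) on an invented word list. The anchors ^ and $ (start and end of line) and the class [a-z] (any lowercase letter) are additional constructs brought in here only to keep the matches easy to read.

```shell
#!/bin/sh
# An invented word list, one word per line.
words='dharma
dharmin
dharm
adharma
karma'

# "." stands for any single character between "dh" and "rma".
dot=$(printf '%s\n' "$words" | grep -E '^dh.rma$')
# "+" means one or more of the preceding item: at least one letter after "dharm".
plus=$(printf '%s\n' "$words" | grep -E '^dharm[a-z]+$')
# "*" means zero or more of the preceding item: bare "dharm" matches too.
star=$(printf '%s\n' "$words" | grep -E '^dharm[a-z]*$')

echo "dot:  $dot"
echo "plus: $plus"
echo "star: $star"
```

Only dharma matches the first pattern; dharma and dharmin match the second; and the third adds dharm, because * also allows zero repetitions.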
Regex is supported in a wide range of programs. But these programs sometimes differ in the “flavor” of regex they use: mostly these differences have to do with how the special characters are escaped. I will speak first about using regex in the standard Unix program grep, and then about using it in GNU Emacs. Mileage will vary with your text editor.
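The difference in flavor can be seen within grep itself, which has a "basic" syntax (the default) and an "extended" one (selected with -E). The sketch below uses two invented sample lines to show that + is an ordinary character in the basic syntax but a special one in the extended syntax.

```shell
#!/bin/sh
# Two sample lines: one with repeated letters, one with a literal plus sign.
sample='aa
a+'

# Basic syntax (plain grep): "+" is an ordinary character,
# so only the line containing a literal "a+" matches.
basic=$(printf '%s\n' "$sample" | grep 'a+')

# Extended syntax (grep -E): "+" means "one or more of the preceding item",
# so both lines match.
extended=$(printf '%s\n' "$sample" | grep -E 'a+')

echo "basic:    $basic"
echo "extended: $extended"
```

If a pattern that works in one program finds nothing in another, a mismatch of this kind is the first thing to check.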
There are some good general resources about regular expressions available:
See this tutorial.
Emacs offers powerful regex support, for which a general overview can be found here and here. Some Emacs tools that use regex are:
M-x query-replace-regexp: Searches the buffer for matches of a given regular expression and interactively replaces the matched text with the given text. Note that query-replace-regexp (which is usually bound to CTL-ALT-%) uses the string mode of parsing regex.
M-x regexp-builder: A way to build regular expressions interactively. As you write the expression in the buffer, it will highlight the matches in the text. Note that regexp-builder uses the read syntax by default. I prefer to use the string syntax, especially if I am constructing expressions for use in query-replace-regexp. You can change the modes with CTL-C + TAB (or M-x customize-variable RET reb-re-syntax). For more information see here.
See this cheatsheet.