Towards a machine-readable Mahābhārata

The following text documents the first stage in my work on the electronic text of the Mahābhārata: the techniques described were employed during the mid-1990s to produce a usable but still very flawed text. Subsequently that text became the basis of a project, conducted from 1997 to 2001 in association with the Bhandarkar Oriental Research Institute in Pune, and with generous funding from the British Academy and the Society for South Asian Studies, to employ a team of assistants to correct the entire electronic Mahābhārata “by hand”. The text available from this website is the result of that project.

Professor Muneo Tokunaga of Kyoto has typed up the complete Sanskrit text of both the Rāmāyaṇa and the Mahābhārata, and has made these electronic texts available via the Internet. In principle, this means that certain highly intensive tasks (for example, metrical analysis, analysis of diction, or the building of concordances), which would previously have been almost unthinkable for such enormous works, are now much more feasible. And at a smaller scale, it also becomes possible to check through the texts for usage of particular words or phrases, and to manipulate them in other ways — such as printing them in high-quality Devanagari script.

A number of problems arise for users who wish to use these texts for purposes other than those which Prof. Tokunaga had in mind. There are inconsistencies in format and spelling; the transliteration used is hard to read and often ambiguous; the form in which the texts are presented differs markedly from “normal” Sanskrit; there are frequent typographic errors; and the Mahābhārata, in particular, suffers from the consequences of a policy decision which essentially results in all compounds being split up into apparently separate “words”.

Despite such problems, these electronic texts are potentially so valuable that I have tried to convert them into a more generally usable form. This has meant the following broad objectives:

  1. the format in which the texts are stored should be standardised;
  2. a “legible” eight-bit encoding should replace Tokunaga's seven-bit system, in which long vowels are represented by doubled characters and retroflexes by capitalised characters (so that the heroes of the Mahābhārata are the paaNDavas);
  3. the text should be a simple transcription — compounds should not be split up, sandhi should not be undone;
  4. typographic errors should be corrected;
  5. the broken compounds resulting from Tokunaga's use (in his version of the Mahābhārata only) of a single character to represent both “end of word” and “end of compound member” should be mended.
As far as possible, I have tried to tackle these problems by means of automatic procedures, rather than by hand: I have written a number of Perl programs to carry out particular emendations or check for particular problems (the directory where I store these currently contains 21 such programs). The greater part of objectives 1-4 is met by a single program named mconv, while a suite of programs seeks to address objective 5. However, a huge amount of “hands on” work has also been necessary.
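To illustrate the kind of mechanical conversion involved in objective 2, here is a minimal sketch, in Python rather than Perl, of how a seven-bit transcription with doubled long vowels and capitalised retroflexes can be rewritten in a legible eight-bit form. The mapping shown is my own assumption for illustration, not the actual rule set of mconv:

```python
# A sketch only: doubled vowels become long vowels, capitalised
# consonants become retroflexes. The real scheme (and mconv itself)
# handles many more cases than the handful listed here.

SEVEN_TO_EIGHT = [
    ("aa", "ā"), ("ii", "ī"), ("uu", "ū"),   # doubled = long vowel
    ("T", "ṭ"), ("D", "ḍ"), ("N", "ṇ"),      # capital = retroflex
    ("S", "ṣ"),
]

def convert(text: str) -> str:
    for old, new in SEVEN_TO_EIGHT:
        text = text.replace(old, new)
    return text

print(convert("paaNDavas"))  # -> pāṇḍavas
```

Applied to the example above, the heroes of the Mahābhārata duly become the pāṇḍavas rather than the paaNDavas.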

Using these methods to improve the Mahābhārata text to the point where it could be made public took me about a year; thereafter I continued to make specific improvements on an occasional basis — i.e. when I noticed some particularly annoying problem.

1. Format

As far as format is concerned, I have followed Tokunaga's general policy but tried to introduce consistency.

In attempting to get this format standardised, I have checked mechanically to make sure that these conventions are observed consistently throughout the text.

There may well still be mistakes in this general area, but I think they must be few in number.

2. Encoding

There is nothing intrinsically wrong with Tokunaga's seven-bit ASCII system of transcription, but it is difficult to read and therefore prone to errors. I have converted his texts into the eight-bit CSX encoding. I chose this not for its inherent merits (it has few) or because it is well suited to the Unix environment in which I work (it is very badly suited) but because it is the only attempt at a standard eight-bit encoding known to me, and standards are precious things. In converting the texts I have done my best to resolve the ambiguities in Tokunaga's original material, where “m” may be the labial nasal or anusvara, “h” may be the voiced breathing or visarga, and “n” may be the dental, palatal or velar nasal.

The only area where it is likely that errors may remain is the conversion of “n” to velar “ṅ”, which largely has to be done on a case-by-case basis. If errors do remain here, they are certainly not numerous.
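The contextual reasoning involved in resolving the nasals can be sketched as follows. The rules shown are simply the ordinary homorganic-nasal rules of Sanskrit (a nasal shares the point of articulation of a following stop); the code illustrates the principle, and is not the conversion program actually used:

```python
# Disambiguate "n" by inspecting the following character: a velar stop
# (k, g, including kh, gh) implies ṅ, a palatal stop (c, j) implies ñ,
# otherwise the dental n is kept. A sketch only.

def disambiguate_n(word: str) -> str:
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == "n" and nxt in ("k", "g"):      # velar stop follows
            out.append("ṅ")
        elif ch == "n" and nxt in ("c", "j"):    # palatal stop follows
            out.append("ñ")
        else:
            out.append(ch)
    return "".join(out)

print(disambiguate_n("anga"))     # -> aṅga
print(disambiguate_n("pancama"))  # -> pañcama
```

The genuinely hard residue is the small class of cases, such as word-final velar ṅ, where the following context does not settle the question and a human judgement is needed.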

3. Transcription method

Tokunaga's texts are entered with vowel sandhi undone and compounds broken up, ostensibly to facilitate word searches. I do not believe that this is desirable, since the texts ought to be usable for other purposes (printing high-quality copy in Devanagari, metrical analysis, analysis of diction, etc.), and since there is not in fact any real difficulty in performing word searches on normal Sanskrit texts — with care, even a “difficult” word like api can be isolated and searched for. I have therefore normalised the sandhi and attempted to rejoin the compounds.
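As an illustration of the claim that even api can be searched for in normal sandhi'd text, the forms under which it can surface (plain api; 'pi after a word ending in -e or -o, as in so 'pi; fused āpi, as in cāpi or tathāpi) can be captured with a pattern along these lines. The list of surface forms is my own simplification for the sketch:

```python
import re

# Match the surface forms of api in sandhi'd text: "api" as a free
# word, "'pi" after avagraha, or "āpi" fused onto a preceding a/ā.
pattern = re.compile(r"(?:\bapi\b|'pi\b|āpi\b)")

line = "so 'pi rājā vanaṃ yayau; tathāpi cāpi na śocati; api ca"
print(pattern.findall(line))  # -> ["'pi", 'āpi', 'āpi', 'api']
```

A pattern of this kind will overgenerate slightly (āpi may belong to some other word), but the point stands: careful searching on unbroken text is entirely practicable.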

In the MBh, consonant sandhi was handled by Tokunaga in such a way as to necessitate undoing and then redoing it all. (For example, final “c” in a word might represent the form after sandhi, as in tac ca, or it might represent an idealised pre-sandhi form, as in vāc.)
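The sort of rule that has to be redone can be sketched with the tac ca example itself: a final dental t assimilates to a following palatal. Only this one rule is shown, purely as an illustration of reapplying consonant sandhi; the actual programs handle the full set:

```python
# Reapply one external-sandhi rule: final t before a palatal stop
# becomes the corresponding palatal (tat + ca -> tac ca,
# tat + jalam -> taj jalam). All other junctions are left alone here.

def external_sandhi(first: str, second: str) -> str:
    if first.endswith("t") and second[0] in ("c", "j"):
        return first[:-1] + second[0] + " " + second
    return first + " " + second

print(external_sandhi("tat", "ca"))     # -> tac ca
print(external_sandhi("tat", "jalam"))  # -> taj jalam
```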

Reapplying vowel sandhi automatically rejoins many compound members. Other compounds pose no problem in the R, where Tokunaga distinguishes between “.” (= hyphen between compound members) and “..” (= space between words). In the MBh there are extremely difficult problems in this area (see section 5 below).
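Why reapplying vowel sandhi rejoins compound members can be seen from a sketch like the following: when a break falls between a final and an initial vowel, the vowels must coalesce, and the space disappears with them. Only a few rules from the sandhi table are shown, and the code is an illustration, not the program used:

```python
# Coalesce vowels across a break: a/ā + a/ā -> ā, a + i -> e,
# a + u -> o, etc. A consonant junction is left as simple joining.

VOWEL_SANDHI = {
    ("a", "a"): "ā", ("a", "ā"): "ā", ("ā", "a"): "ā", ("ā", "ā"): "ā",
    ("a", "i"): "e", ("a", "u"): "o",
}

def join_with_sandhi(first: str, second: str) -> str:
    key = (first[-1], second[0])
    if key in VOWEL_SANDHI:
        return first[:-1] + VOWEL_SANDHI[key] + second[1:]
    return first + second

print(join_with_sandhi("mahā", "ātman"))  # -> mahātman
print(join_with_sandhi("rāja", "indra"))  # -> rājendra
```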

4. Correction of typographic errors

The MBh text appears to me to have been less carefully typed up than the R. I have not done a great deal in the way of textual correction of the R. In the MBh, I have corrected typographic errors wherever I have been able to identify them.

5. Eliminating spaces from within compounds

In the MBh, Tokunaga uses “.” to indicate both end of word and end of compound member. When this is translated into the standardised system I am aiming at, the result is that compounds are broken up by spaces. Eliminating these “false spaces” is a very serious difficulty. I believe that I have so far managed to remove about 75% of them by the semi-automatic process described below.

The method that I have used attempts to teach the computer how to distinguish “false spaces” from genuine spaces by comparing sections of Tokunaga text with corrected equivalents.
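The principle of the comparison can be sketched as follows: align a “broken” passage against its corrected equivalent, record every pair of adjacent tokens whose intervening space turned out to be false, and then close up those pairs wherever they recur elsewhere. The real process (mkmconv and its associated programs) is considerably more elaborate; this only illustrates the idea:

```python
# Learn "false space" contexts from a broken/corrected pair of lines,
# then apply them to new text. A deliberately simple sketch.

def learn_false_spaces(broken: str, corrected: str) -> set:
    pairs = set()
    tokens = broken.split()
    corrected_words = set(corrected.split())
    for i in range(len(tokens) - 1):
        if tokens[i] + tokens[i + 1] in corrected_words:
            pairs.add((tokens[i], tokens[i + 1]))
    return pairs

def close_spaces(pairs: set, text: str) -> str:
    tokens = text.split()
    out = [tokens[0]]
    for tok in tokens[1:]:
        if (out[-1], tok) in pairs:
            out[-1] += tok          # false space: join the tokens
        else:
            out.append(tok)         # genuine space: keep the break
    return " ".join(out)

pairs = learn_false_spaces("dharma kṣetre kuru kṣetre",
                           "dharmakṣetre kurukṣetre")
print(close_spaces(pairs, "kuru kṣetre samavetā"))  # -> kurukṣetre samavetā
```

Once learned, such pairs can be swept across the whole text; the limitation, discussed below, is that the method can only ever close up junctions it has actually been shown.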

It was only after I had gone through this process with the first four sections of Mahābhārata text mentioned above that I looked for the first time at Tokunaga's Rāmāyaṇa. I then realised that his use of “.” and “..” in the R allowed me to create (so to speak) deliberately incorrect versions of every kāṇḍa, as well as “corrected” versions. I did this, and fed the results into mkmconv in just the same way as with sections of MBh text, since there was likely to be significant overlap between the two epics. This produced large numbers of further improvements.
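The trick with the Rāmāyaṇa can be sketched very simply. Because “.” and “..” distinguish compound-internal breaks from true word breaks, both a deliberately “broken” version and a “corrected” version of a line can be generated mechanically; the separator conventions below follow the description above, but the details of the actual files are assumptions:

```python
# From raw R input, generate a "broken" version (every separator
# becomes a space, as in the MBh) and a "corrected" version (only
# ".." becomes a space; "." simply closes up). Replace ".." first
# so it is not consumed as two single dots.

def broken_version(line: str) -> str:
    return line.replace("..", " ").replace(".", " ")

def corrected_version(line: str) -> str:
    return line.replace("..", " ").replace(".", "")

raw = "dharma.kSetre..kuru.kSetre"
print(broken_version(raw))     # -> dharma kSetre kuru kSetre
print(corrected_version(raw))  # -> dharmakSetre kurukSetre
```

Feeding such pairs into the learning process gives training material at no manual cost, which is why the whole of the R could be exploited in this way.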

This method has been very successful, but it has intrinsic limitations, and I think it may be nearing the end of its usefulness — the graph of corrections achieved against effort expended is levelling off, and there are a lot of cases that it simply cannot handle properly. A major example is proper names. My system cannot close up spaces after all occurrences of (e.g.) bhīma because bhīma may be a vocative; it is restricted to closing up specific cases that happen to have been drawn to its attention, such as bhīma parākram. The next task must be to deal with all cases of bhīma<space> etc. by means of “sweeps” through the entire Mahābhārata using an editor interactively. Unfortunately the MBh has a large cast of characters, many of them known by numerous different names; in addition there are many other words that require similar treatment because they cannot always be assumed to be stems.

Back to John Smith's home page