Towards a machine-readable Mahābhārata

The following text documents the first stage in my work on the electronic text of the Mahābhārata: the techniques described were employed during the mid-1990s to produce a usable but still very flawed text. Subsequently that text became the basis of a project, conducted from 1997 to 2001 in association with the Bhandarkar Oriental Research Institute in Pune, and with generous funding from the British Academy and the Society for South Asian Studies, to employ a team of assistants to correct the entire electronic Mahābhārata “by hand”. The text available from this website is the result of that project.

Professor Muneo Tokunaga of Kyoto has typed up the complete Sanskrit text of both the Rāmāyaṇa and the Mahābhārata, and has made these electronic texts available via the Internet. In principle, this means that certain highly intensive tasks (for example, metrical analysis, analysis of diction, or the building of concordances), which would previously have been almost unthinkable for such enormous works, are now much more feasible. And at a smaller scale, it also becomes possible to check through the texts for usage of particular words or phrases, and to manipulate them in other ways — such as printing them in high-quality Devanagari script.

A number of problems arise for users who wish to use these texts for purposes other than those which Prof. Tokunaga had in mind. There are inconsistencies in format or spelling; the transliteration used is hard to read and often ambiguous; the form in which the texts are presented differs markedly from “normal” Sanskrit; there are frequent typographic errors; and the Mahābhārata text suffers from the consequences of a policy decision which essentially results in all compounds being split up into apparently separate “words”.

Despite such problems, these electronic texts are potentially so valuable that I have tried to convert them into a more generally usable form. This has meant the following broad objectives:

1. the format in which the texts are stored should be standardised;
2. a “legible” eight-bit encoding should replace Tokunaga's seven-bit system, in which long vowels are represented by doubled characters and retroflexes by capitalised characters (so that the heroes of the Mahābhārata are the paaNDavas);
3. the text should be a simple transcription: compounds should not be split up, sandhi should not be undone;
4. typographic errors should be corrected;
5. the broken compounds resulting from Tokunaga's use (in his version of the Mahābhārata only) of a single character to represent both “end of word” and “end of compound member” should be mended.

As far as possible, I have tried to tackle these problems by means of automatic procedures, rather than by hand: I have written a number of Perl programs to carry out particular emendations or check for particular problems (the directory where I store these currently contains 21 such programs). The greater part of objectives 1-4 is met by a single program named mconv, while a suite of programs seeks to address objective 5. However, a huge amount of “hands on” work has also been necessary.

Using these methods to improve the Mahābhārata text to the point where it could be made public took me about a year; thereafter I continued to make specific improvements on an occasional basis — i.e. when I noticed some particularly annoying problem.

1. Format

As far as format is concerned, I have followed Tokunaga's general policy but tried to introduce consistency.

Comments are introduced by the “%” sign; I have used these only at the top of each file to allow me to detail its history and current status.
Normal text lines begin with a line number, whose layout differs slightly between the Mahābhārata (MBh) and the Rāmāyaṇa (R):
- In the MBh, the line number 123456781 would mean “book 12, chapter 345, stanza 678, pādas 1 [and 2]”. If the pāda number is replaced by a capital letter, it signifies a section of prose. If it is replaced by a space, it signifies an X uvāca line.
- In the R, where there are fewer than ten books, no prose and no X uvācas, 12345671 would mean “book 1, chapter 234, stanza 567, pādas 1 [and 2]”.
In both texts, anuṣṭubh lines are printed with two pādas on each line (as is normal in Devanagari editions); triṣṭubh and other metres also have two pādas per line, but separated by a semicolon. This caused a major problem in the MBh, because Tokunaga uses “;” both for that purpose and to indicate hiatus between vowels. The problem did not really arise in the R, where he indicates triṣṭubhs by the device of repeating their pādas.
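
The line-number layout described above can be illustrated with a short parser. This is a minimal sketch in Python rather than the Perl actually used for this project; the function names, the `MbhRef` type and the sample lines are hypothetical, and it assumes that the number always occupies the first columns of the line.

```python
import re
from typing import NamedTuple

class MbhRef(NamedTuple):
    book: int
    chapter: int
    stanza: int
    kind: str   # "verse", "prose" or "uvaca"
    part: str   # pāda digit, prose section letter, or "" for an uvāca line

# MBh: 2-digit book, 3-digit chapter, 3-digit stanza, then one marker character.
_MBH_REF = re.compile(r"^(\d{2})(\d{3})(\d{3})([0-9A-Z ])")
# R: 1-digit book, 3-digit chapter, 3-digit stanza, then the pāda digit.
_R_REF = re.compile(r"^(\d)(\d{3})(\d{3})(\d)")

def parse_mbh_ref(line: str) -> MbhRef:
    match = _MBH_REF.match(line)
    if match is None:
        raise ValueError(f"no valid MBh line number at start of {line!r}")
    book, chapter, stanza, marker = match.groups()
    if marker.isdigit():
        kind = "verse"
    elif marker == " ":
        kind = "uvaca"
    else:
        kind = "prose"
    return MbhRef(int(book), int(chapter), int(stanza), kind, marker.strip())

def parse_r_ref(line: str) -> tuple:
    match = _R_REF.match(line)
    if match is None:
        raise ValueError(f"no valid R line number at start of {line!r}")
    return tuple(int(group) for group in match.groups())
```

With this sketch, `parse_mbh_ref("123456781 ...")` yields book 12, chapter 345, stanza 678, and a verse line beginning at pāda 1.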

In attempting to get this format standardised, I have checked mechanically to make sure

that all lines begin with a valid line number containing the right number of digits/capital letters in the right positions;
that every line number can legitimately follow the line number preceding it;
that all “long” lines in both the MBh and the R do contain one and only one semicolon.

There may well still be mistakes in this general area, but I think they must be few in number.
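
Checks of this kind might be sketched as follows. The code is Python, not the original Perl, and rests on several assumptions that are not stated above: the succession rule in `can_follow` is only one plausible reading of “legitimately follow”, `long_line_length` is a hypothetical threshold for what counts as a “long” line, and only the MBh number layout is handled.

```python
import re

# Assumed MBh layout: 2-digit book, 3-digit chapter, 3-digit stanza, 1 marker.
LINE_NO = re.compile(r"^(\d{2})(\d{3})(\d{3})([0-9A-Z ])")

def can_follow(prev, cur):
    """One guess at legitimate succession: same stanza, next stanza,
    first stanza of the next chapter, or first stanza of the next book."""
    book, chapter, stanza = prev
    return cur in (
        (book, chapter, stanza),
        (book, chapter, stanza + 1),
        (book, chapter + 1, 1),
        (book + 1, 1, 1),
    )

def check_lines(lines, long_line_length=60):
    """Return (index, message) pairs for every line that fails a check."""
    problems = []
    prev = None
    for index, line in enumerate(lines):
        if line.startswith("%"):          # comment line: nothing to check
            continue
        match = LINE_NO.match(line)
        if match is None:
            problems.append((index, "bad line number"))
            continue
        cur = tuple(int(group) for group in match.groups()[:3])
        if prev is not None and not can_follow(prev, cur):
            problems.append((index, "line number out of sequence"))
        prev = cur
        text = line[9:].strip()           # text after the 9-character number
        if len(text) > long_line_length and text.count(";") != 1:
            problems.append((index, "long line without exactly one semicolon"))
    return problems
```

A run over a whole file would then produce a list of suspect lines to be inspected by hand, which is all that a mechanical check of this sort can offer.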

2. Encoding

There is nothing intrinsically wrong with Tokunaga's seven-bit ASCII system of transcription, but it is difficult to read and therefore prone to errors. I have converted his texts into the eight-bit CSX encoding. I chose this not for its inherent merits (it has few) or because it is well suited to the Unix environment in which I work (it is very badly suited) but because it is the only attempt at a standard eight-bit encoding known to me, and standards are precious things. In converting the texts I have done my best to resolve the ambiguities in Tokunaga's original material, where “m” may be the labial nasal or anusvāra, “h” may be the voiced breathing or visarga, and “n” may be the dental, palatal or velar nasal.

The only area where it is likely that errors may remain is the conversion of “n” to velar “ṅ”, which has to be largely done on a case-by-case basis. If errors do remain here, they are certainly not numerous.
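
The unambiguous part of such a conversion can be sketched in a few lines. This is an illustration in Python, not the mconv program: the two tables are inferred only from the conventions mentioned above (doubled long vowels, capitalised retroflexes) and are certainly incomplete, the output is Unicode rather than CSX, and the ambiguous “m”, “h” and “n” are deliberately left untouched.

```python
# Hypothetical, partial tables inferred from the two stated conventions.
LONG_VOWELS = {"aa": "ā", "ii": "ī", "uu": "ū"}
RETROFLEXES = {"T": "ṭ", "D": "ḍ", "N": "ṇ", "S": "ṣ"}

def to_diacritics(text: str) -> str:
    """Convert doubled vowels and capitalised retroflexes to diacritic forms."""
    for doubled, long_vowel in LONG_VOWELS.items():
        text = text.replace(doubled, long_vowel)
    return "".join(RETROFLEXES.get(char, char) for char in text)

print(to_diacritics("paaNDava"))
```

Applied to the example given earlier, the function turns paaNDava into pāṇḍava. Resolving the ambiguous letters would need context-sensitive rules and, as noted above for velar “ṅ”, a good deal of case-by-case judgement.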

3. Transcription method

Tokunaga's texts are entered with vowel sandhi undone and compounds broken up, ostensibly to facilitate word searches. I do not believe that this is desirable, since the texts ought to be usable for other purposes (printing high-quality copy in Devanagari, metrical analysis, analysis of diction, etc.), and since there is not in fact any real difficulty in performing word searches on normal Sanskrit texts — with care, even a “difficult” word like api can be isolated and searched for. I have therefore normalised the sandhi and attempted to rejoin the compounds.

In the MBh, consonant sandhi was handled by Tokunaga in such a way as to necessitate undoing and then redoing it all. (For example, final “c” in a word might represent the form after sandhi, as in tac ca, or it might represent an idealised pre-sandhi form, as in vāc.)

Reapplying vowel sandhi automatically rejoins many compound members. Other compounds pose no problem in the R, where Tokunaga distinguishes between “.” (= hyphen between compound members) and “..” (= space between words). In the MBh there are extremely difficult problems in this area (see section 5 below).

4. Correction of typographic errors

The MBh text appears to me less carefully typed up than the R. I have not done a great deal in the way of textual correction of the R. In the MBh, I have

made global changes to certain commonly misspelt words (e.g. cest- for ceST-, i.e. ceṣṭ-);
searched out and eliminated all “impossible” letters (f, q etc.);
searched out and eliminated all “impossible” 2-letter sequences (ao, bk etc.);
made an attempt to correct spellings in the second half of the Śāntiparvan, where Tokunaga did not distinguish retroflexes from dentals, and also in a smaller section of the Āraṇyakaparvan, where a less extreme version of the same problem was found (this last exercise has certainly not eliminated all the errors in these two sections of the MBh);
made corrections by hand wherever I have noticed specific errors.
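
A search for “impossible” letters and sequences is easy to mechanise. The following Python sketch is not one of the original Perl programs; it uses only the examples named above (f, q, ao, bk), whereas a real list would be much longer, and the sample lines in the usage example are invented.

```python
import re

# Only the examples mentioned in the text; a real list would be longer.
IMPOSSIBLE_LETTERS = "fq"
IMPOSSIBLE_PAIRS = ("ao", "bk")

PATTERN = re.compile(
    "[" + re.escape(IMPOSSIBLE_LETTERS) + "]|"
    + "|".join(re.escape(pair) for pair in IMPOSSIBLE_PAIRS)
)

def find_suspects(lines):
    """Return (line_number, column, match) for each impossible letter or pair.

    Line numbers start at 1, columns at 0; comment lines are skipped.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("%"):
            continue
        for match in PATTERN.finditer(line):
            hits.append((number, match.start(), match.group()))
    return hits

sample = ["% fq in a comment is ignored", "010010013 kaorava fala"]
for hit in find_suspects(sample):
    print(hit)
```

The search is case-sensitive on purpose, since capital letters carry meaning in the seven-bit transcription. It only locates suspects; deciding what each one should have been remains a manual task.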

5. Eliminating spaces from within compounds

In the MBh, Tokunaga uses “.” to indicate both end of word and end of compound member. When this is translated into the standardised system I am aiming at, the result is that compounds are broken up by spaces. Eliminating these “false spaces” is a very serious difficulty. I believe that I have so far managed to remove about 75% of them by the semi-automatic process described below.

The method that I have used attempts to teach the computer how to distinguish “false spaces” from genuine spaces by comparing sections of Tokunaga text with corrected equivalents.

First, I correct a section of text by hand, choosing passages which I know well enough to be able to do fairly quickly and easily; so far these have been the Gītā, Nala, the Ambopākhyāna at the end of the Udyogaparvan, chapters 66-68 of the Sabhāparvan, and 13 chapters from the battle narrative of the Bhīṣmaparvan.
Next, I feed the uncorrected and corrected versions of this passage into a Perl program named mkmconv, which isolates all cases where the difference between the two is simply the presence/absence of a space.
Mkmconv ends by invoking edmconv, which presents me with these cases one at a time, after first eliminating ones it already knows about, and asks me whether (a) the form preceding the false space can always be assumed to be a stem forming part of a compound, as in e.g. ratna dhanāni, or whether (b) it might be able to stand on its own as a word, as in e.g. bhīma parākramaḥ.
If I tell it that the form is an (a) type, the program constructs a command meaning “remove all spaces immediately following the stem ratna”.
If I tell it that the form is a (b) type, the program constructs a command meaning “remove all spaces occurring between bhīma and parākram”, stripping obvious case-inflections off the second word in order to make the command apply as generally as possible. (This has its dangers, which the program tries to be alert to.)
The resulting collections of commands are then added to the program mconv.spacing, ready to be applied to the rest of the MBh text. (Also, obviously, the corrected section of text is inserted in its proper place.)
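
The core of this procedure can be sketched as follows. The code is Python and is not mkmconv, edmconv or mconv.spacing; all function names are hypothetical. It assumes that the two versions of a line are identical apart from spacing, and it leaves the stripping of case-inflections to the caller instead of attempting it automatically.

```python
import re

def boundaries(text):
    """Offsets of word ends, counted in non-space characters."""
    offsets, count = set(), 0
    for word in text.split():
        count += len(word)
        offsets.add(count)
    return offsets

def false_spaces(uncorrected, corrected):
    """Word pairs separated by a space in uncorrected but joined in corrected."""
    if uncorrected.replace(" ", "") != corrected.replace(" ", ""):
        return []                      # the lines differ in more than spacing
    genuine = boundaries(corrected)
    words = uncorrected.split()
    pairs, count = [], 0
    for left, right in zip(words, words[1:]):
        count += len(left)
        if count not in genuine:
            pairs.append((left, right))
    return pairs

def make_rule(left, right_stem=None):
    """Type (a) rule if right_stem is None, otherwise a type (b) rule."""
    pattern = r"\b(" + re.escape(left) + r") "
    if right_stem is not None:
        pattern += r"(?=" + re.escape(right_stem) + r")"
    return re.compile(pattern)

def apply_rules(text, rules):
    for rule in rules:
        text = rule.sub(r"\1", text)   # keep the stem, drop the space
    return text

rules = [make_rule("ratna"), make_rule("bhīma", "parākram")]
print(false_spaces("ratna dhanāni ca", "ratnadhanāni ca"))
print(apply_rules("bhīma parākramasya ratna saṃcayaḥ bhīma uvāca", rules))
```

In the usage lines, the type (a) rule closes up every space after ratna, while the type (b) rule closes up bhīma only before parākram-, so that the bhīma of bhīma uvāca is left standing as a separate word.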

It was only after I had gone through this process with the first four sections of Mahābhārata text mentioned above that I looked for the first time at Tokunaga's Rāmāyaṇa. I then realised that his use of “.” and “..” in the R allowed me to create (so to speak) deliberately incorrect versions of every kāṇḍa, as well as “corrected” versions. I did this, and fed the results into mkmconv in just the same way as with sections of MBh text, since there was likely to be significant overlap between the two epics. This produced large numbers of further improvements.
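
Producing the two versions from a line of Rāmāyaṇa text might look like this. It is a Python sketch under simplifying assumptions: the function name and the sample string are invented, and sandhi and the encoding conversion are ignored entirely.

```python
def r_training_pair(raw: str):
    """Return (deliberately_incorrect, corrected) versions of one R line.

    ".." marks a space between words and "." a join between compound members.
    """
    words = raw.split("..")
    corrected = " ".join(word.replace(".", "") for word in words)
    incorrect = " ".join(word.replace(".", " ") for word in words)
    return incorrect, corrected

print(r_training_pair("raama.lakSmaNau..vanam"))
```

The “incorrect” version treats every separator as a space, imitating the MBh text, while the “corrected” version joins compound members; the pair can then be compared in the same way as a hand-corrected MBh passage and its original.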

This method has been very successful, but it has intrinsic limitations, and I think it may be nearing the end of its usefulness — the graph of corrections achieved against effort expended is levelling off, and there are a lot of cases that it simply cannot handle properly. A major example is proper names. My system cannot close up spaces after all occurrences of (e.g.) bhīma because bhīma may be a vocative; it is restricted to closing up specific cases that happen to have been drawn to its attention, such as bhīma parākram. The next task must be to deal with all cases of bhīma<space> etc. by means of “sweeps” through the entire Mahābhārata using an editor interactively. Unfortunately the MBh has a large cast of characters, many of them known by numerous different names; in addition there are many other words that require similar treatment because they cannot always be assumed to be stems.
