Towards a machine-readable Mahabharata
The following text documents the first stage in my work on the
electronic text of the Mahabharata: the techniques described
were employed to produce a usable but still very flawed text.
Subsequently that text became the basis of a project, conducted in
association with the Bhandarkar Oriental Research Institute in Pune,
and with generous funding from the British Academy and the Society
for South Asian Studies, to employ a team of assistants to correct
the entire electronic Mahabharata "by hand". The text
available from this website is the result of that project.
Professor Muneo Tokunaga of Kyoto has typed up the complete Sanskrit
text of both the Ramayana and the Mahabharata, and has
made these electronic texts available via the Internet. In principle,
this means that certain highly intensive tasks (for example, metrical
analysis, analysis of diction, or the building of concordances), which
would previously have been almost unthinkable for such enormous works,
are now much more feasible. And at a smaller scale, it also becomes
possible to check through the texts for usage of particular words or
phrases, and to manipulate them in other ways -- such as printing them
in high-quality Nagari script.
A number of problems arise for users who wish to use these texts
for purposes other than those which Prof. Tokunaga had in mind.
There are inconsistencies in format or spelling; the transliteration
used is hard to read and often ambiguous; the form in which the
texts are presented differs markedly from "normal" Sanskrit; there
are frequent typographic errors; and the Mahabharata text suffers
from the consequences of a policy decision which essentially results
in all compounds being split up into apparently separate "words".
Despite such problems, these electronic texts are potentially so
valuable that I have tried to convert them into a more generally
usable form. This has meant the following broad objectives:
- the format in which the texts are stored should be standardised;
- a "legible" eight-bit encoding should replace Tokunaga's seven-bit
system, in which long vowels are represented by doubled characters
and retroflexes by capitalised characters (so that the heroes of
the Mahabharata are the paaNDavas);
- the text should be a simple transcription -- compounds should not
be split up, sandhi should not be undone;
- typographic errors should be corrected;
- the broken compounds resulting from Tokunaga's use (in his version
of the Mahabharata only) of a single character to represent both
"end of word" and "end of compound member" should be mended.
As far as possible, I have tried to tackle these problems by means of
automatic procedures, rather than by hand: I have written a number of
Perl programs to carry out particular emendations or check for
particular problems (the directory where I store these currently
contains 21 such programs). The greater part of objectives 1-4 is met
by a single program named mconv, while a suite of programs seeks to
address objective 5. However, a huge amount of "hands on" work has
also been necessary.
Using these methods to improve the Mahabharata text to the
point where it could be made public took me about a year; thereafter
I continued to make specific improvements on an occasional basis --
i.e. when I noticed some particularly annoying problem.
1. Format
As far as format is concerned, I have followed Tokunaga's general
policy but tried to introduce consistency.
- Comments are introduced by the "%" sign; I have used these only
at the top of each file to allow me to detail its history and
current status.
- Normal text lines begin with a line number, whose form differs
slightly between the Mahabharata (MBh) and the Ramayana (R):
- In the MBh, the line number 123456781 would mean
"book 12, chapter 345, stanza 678, padas 1 [and 2]". If the
pada number is replaced by a capital letter, it signifies a
section of prose. If it is replaced by a space, it signifies
an X uvaaca line.
- In the R, where there are fewer than ten books, no
prose and no X uvaacas, 12345671 would mean "book 1,
chapter 234, stanza 567, padas 1 [and 2]".
- In both texts, anustubh lines are printed with two padas on each
line (as is normal in Nagari editions); tristubh and other metres
also have two padas per line, but separated by a semicolon. This
caused a major problem in the MBh, because Tokunaga uses
";" both for that purpose and to indicate hiatus between vowels.
The problem did not really arise in the R, where he
indicates tristubhs by the device of repeating their padas.
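The line-number layout described above can be sketched in Python. This is an illustration only, not the Perl code actually used. It assumes that the number sits at the very start of the line and that single-digit MBh book numbers are zero-padded to two digits; the names LineRef and parse_ref are invented for the example.

```python
import re
from typing import NamedTuple, Optional


class LineRef(NamedTuple):
    book: int
    chapter: int
    stanza: int
    kind: str   # "verse", "prose" or "speaker" (an "X uvaaca" line)
    slot: str   # the raw final character of the line number


# MBh: 2-digit book, 3-digit chapter, 3-digit stanza, then one slot that is
# a pada digit, a capital letter (prose) or a space ("X uvaaca").
MBH = re.compile(r"^(\d{2})(\d{3})(\d{3})([0-9A-Z ])")
# R: 1-digit book, 3-digit chapter, 3-digit stanza, pada digit.
RAM = re.compile(r"^(\d)(\d{3})(\d{3})(\d)")


def parse_ref(line: str, epic: str) -> Optional[LineRef]:
    """Return the parsed line number, or None if it is malformed."""
    pattern = MBH if epic == "MBh" else RAM
    m = pattern.match(line)
    if m is None:
        return None
    book, chapter, stanza, slot = m.groups()
    if slot.isdigit():
        kind = "verse"
    elif slot == " ":
        kind = "speaker"
    else:
        kind = "prose"
    return LineRef(int(book), int(chapter), int(stanza), kind, slot)
```

For instance, parse_ref("123456781 ...", "MBh") gives book 12, chapter 345, stanza 678, and parse_ref("12345671 ...", "R") gives book 1, chapter 234, stanza 567.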
In attempting to get this format standardised, I have checked
mechanically to make sure
- that all lines begin with a valid line number containing the
right number of digits/capital letters in the right positions;
- that every line number can legitimately follow the line number
preceding it;
- that all "long" lines in both the MBh and the R do
contain one and only one semicolon.
There may well still be mistakes in this general area, but I think
they must be few in number.
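A minimal Python sketch of mechanical checks of this kind follows, for the MBh layout only. It is not the original Perl. Two points are assumptions: "can legitimately follow" is reduced here to "book, chapter and stanza never decrease", and the decision whether a line counts as "long" is left to a caller-supplied function (is_long), since the text does not say how long lines were recognised.

```python
import re

# MBh line number: 2-digit book, 3-digit chapter, 3-digit stanza, then a
# pada digit, a capital letter (prose) or a space ("X uvaaca" line).
MBH_REF = re.compile(r"^(\d{2})(\d{3})(\d{3})([0-9A-Z ])")


def check_lines(lines, is_long):
    """Return a list of (line_index, message) for suspicious lines."""
    problems = []
    previous = None
    for n, line in enumerate(lines, start=1):
        if line.startswith("%"):        # comment line, nothing to check
            continue
        m = MBH_REF.match(line)
        if m is None:
            problems.append((n, "bad line number"))
            continue
        key = tuple(int(g) for g in m.groups()[:3])
        # Assumed ordering rule: (book, chapter, stanza) must not go backwards.
        if previous is not None and key < previous:
            problems.append((n, "out of sequence"))
        previous = key
        text = line[m.end():]
        if is_long(text) and text.count(";") != 1:
            problems.append((n, "long line without exactly one semicolon"))
    return problems
```

A caller might pass something as crude as lambda text: len(text) > 40 for is_long; that threshold is purely hypothetical.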
2. Encoding
There is nothing intrinsically wrong with Tokunaga's seven-bit ASCII
system of transcription, but it is difficult to read and therefore
prone to errors. I have converted his texts into the
eight-bit CSX encoding. I chose
this not for its inherent merits (it has few) or because it is well
suited to the Unix environment in which I work (it is very badly
suited) but because it is the only attempt at a standard eight-bit
encoding known to me, and standards are precious things. In converting
the texts I have done my best to resolve the ambiguities in Tokunaga's
original material, where "m" may be the labial nasal or anusvara, "h"
may be the voiced breathing or visarga, and "n" may be the dental,
palatal or velar nasal.
The only area where it is likely that errors may remain is the
conversion of "n" to velar "n", which has to be largely done on a
case-by-case basis. If errors do remain here, they are certainly not
numerous.
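The unambiguous part of such a conversion can be sketched in Python as follows. The sketch covers only the two conventions stated earlier (doubled characters for long vowels, capitals for retroflexes) and makes no attempt to resolve the ambiguous "m", "h" and "n". As an assumption made for readability, it writes Unicode diacritics rather than CSX byte values, and the function name to_diacritics is invented.

```python
import re

# Doubled vowels stand for long vowels.
LONG_VOWELS = {"aa": "ā", "ii": "ī", "uu": "ū"}
# Capital letters stand for retroflexes; only these four are handled here.
RETROFLEXES = {"T": "ṭ", "D": "ḍ", "N": "ṇ", "S": "ṣ"}


def to_diacritics(text: str) -> str:
    """Convert doubled vowels and capitalised retroflexes to diacritics."""
    text = re.sub(r"aa|ii|uu", lambda m: LONG_VOWELS[m.group(0)], text)
    return "".join(RETROFLEXES.get(ch, ch) for ch in text)
```

With this, paaNDava becomes pāṇḍava and kaaNDa becomes kāṇḍa; any other capital letter is passed through unchanged.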
3. Transcription method
Tokunaga's texts are entered with vowel sandhi undone and compounds
broken up, ostensibly to facilitate word searches. I do not believe
that this is desirable, since the texts ought to be usable for other
purposes (printing high-quality copy in Nagari, metrical analysis,
analysis of diction, etc.), and since there is not in fact any real
difficulty in performing word searches on normal Sanskrit texts --
with care, even a "difficult" word like api can be isolated and
searched for. I have therefore normalised the sandhi and attempted to
rejoin the compounds.
In the MBh, consonant sandhi was handled by Tokunaga in such a
way as to necessitate undoing and then redoing it all. (For example,
final "c" in a word might represent the form after sandhi, as in
tac ca, or it might represent an idealised pre-sandhi form,
as in vaac.)
Reapplying vowel sandhi automatically rejoins many compound members.
Other compounds pose no problem in the R, where Tokunaga
distinguishes between "." (= hyphen between compound members) and ".."
(= space between words). In the MBh there are extremely difficult
problems in this area (see section 5 below).
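To illustrate why reapplying vowel sandhi rejoins forms automatically, here is a small Python sketch. It is not the code of mconv, and it implements only a subset of the standard rules: a or aa merging with a following vowel, and i+i and u+u merging into a long vowel. Everything else (final diphthongs, vocalic r, i or u before a dissimilar vowel, consonant sandhi) is left untouched and returned with a space. The function name is invented.

```python
import re

FINAL_VOWEL = re.compile(r"(aa|ii|uu|a|i|u)$")
INITIAL_VOWEL = re.compile(r"^(aa|ai|au|ii|uu|a|i|u|e|o)")

# Result of a or aa followed by each initial vowel.
A_PLUS = {
    "a": "aa", "aa": "aa",
    "i": "e", "ii": "e",
    "u": "o", "uu": "o",
    "e": "ai", "ai": "ai",
    "o": "au", "au": "au",
}


def join_with_vowel_sandhi(left: str, right: str) -> str:
    """Merge left and right if a covered vowel sandhi rule applies."""
    unjoined = left + " " + right
    if left.endswith(("ai", "au")):     # final diphthongs are not handled
        return unjoined
    lm = FINAL_VOWEL.search(left)
    rm = INITIAL_VOWEL.match(right)
    if lm is None or rm is None:
        return unjoined
    stem, rest = left[:lm.start()], right[rm.end():]
    final_class = lm.group(1)[0]        # "a", "i" or "u"
    initial = rm.group(1)
    if final_class == "a":
        return stem + A_PLUS[initial] + rest
    if final_class == "i" and initial in ("i", "ii"):
        return stem + "ii" + rest
    if final_class == "u" and initial in ("u", "uu"):
        return stem + "uu" + rest
    return unjoined
```

So ca followed by api comes out as the single form caapi, and mahaa followed by indra as mahendra, whether the original separator marked a word boundary or a compound boundary.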
4. Correction of typographic errors
The MBh text appears to me less carefully typed up than the
R. I have not done a great deal in the way of textual
correction of the R. In the MBh, I have
- made global changes to certain commonly misspelt words (e.g.
cest- for ceST-);
- searched out and eliminated all "impossible" letters (f, q etc.);
- searched out and eliminated all "impossible" 2-letter sequences
(ao, bk etc.);
- made an attempt to correct spellings in the second half of the
Santiparvan, where Tokunaga did not distinguish retroflexes from
dentals, and also in a smaller section of the
Aranyakaparvan, where a less extreme version of the same
problem was found. This last exercise has certainly not
eliminated all the errors in these two sections of the
MBh.
- made corrections by hand wherever I have noticed specific errors.
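The searches for "impossible" letters and sequences can be sketched in a few lines of Python. The sets below contain only the examples named above (f, q, ao, bk); the full lists are not given here, so a real run would need them extended. The function name is invented.

```python
# Only the examples mentioned in the text; the real lists were longer.
IMPOSSIBLE_LETTERS = set("fq")
IMPOSSIBLE_PAIRS = {"ao", "bk"}


def find_suspects(lines):
    """Return (line_index, column, offending_text) for every hit."""
    hits = []
    for n, line in enumerate(lines, start=1):
        # Single letters that should never occur in the transliteration.
        for col, ch in enumerate(line):
            if ch in IMPOSSIBLE_LETTERS:
                hits.append((n, col, ch))
        # Two-letter sequences that should never occur.
        for col in range(len(line) - 1):
            pair = line[col:col + 2]
            if pair in IMPOSSIBLE_PAIRS:
                hits.append((n, col, pair))
    return hits
```

Each hit still has to be inspected and corrected by hand; the search only locates candidates.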
5. Eliminating spaces from within compounds
In the MBh, Tokunaga uses "." to indicate both end of word and
end of compound member. When this is translated into the standardised
system I am aiming at, the result is that compounds are broken up by
spaces. Eliminating these "false spaces" is a very serious difficulty.
I believe that I have so far managed to remove about 75% of them by
the semi-automatic process described below.
The method that I have used attempts to teach the computer how to
distinguish "false spaces" from genuine spaces by comparing sections
of Tokunaga's text with corrected equivalents.
- First, I correct a section of text by hand, choosing passages
which I know well enough to be able to do fairly quickly and
easily; so far these have been the Gita, Nala,
the Ambopakhyana at the end of the Udyogaparvan,
chapters 66-68 of the Sabhaparvan, and 13 chapters from
the battle narrative of the Bhismaparvan.
- Next, I feed the uncorrected and corrected versions of this
passage into a Perl program named mkmconv, which isolates all cases
where the difference between the two is simply the presence/absence
of a space.
- Mkmconv ends by invoking edmconv, which presents me with these
cases one at a time, after first
eliminating ones it already knows about, and asks me whether (a)
the form preceding the false space can always be assumed
to be a stem forming part of a compound, as in e.g. ratna
dhanaani, or whether (b) it might be able to stand on its own
as a word, as in e.g. bhiima paraakramaH.
- If I tell it that the form is an (a) type, the program constructs
a command meaning "remove all spaces immediately following the
stem ratna".
- If I tell it that the form is a (b) type, the program constructs
a command meaning "remove all spaces occurring between
bhiima and paraakram", stripping obvious
case-inflections off the second word in order to make the command
apply as generally as possible. (This has its dangers, which the
program tries to be alert to.)
- The resulting collections of commands are then added to the
program mconv.spacing,
ready to be applied to the rest of the MBh text. (Also,
obviously, the corrected section of text is inserted in its
proper place.)
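The steps above can be sketched in Python. This is a simplified illustration, not the code of mkmconv, edmconv or mconv.spacing: the interactive question is omitted, the list of case endings is a tiny hypothetical sample, and all function names are invented.

```python
import re

# Hypothetical, deliberately tiny list of case endings, longest first.
CASE_ENDINGS = ("asya", "ena", "aH", "am")


def false_spaces(uncorrected, corrected):
    """List (left, right) word pairs separated by a space that exists
    only in the uncorrected text."""
    if uncorrected.replace(" ", "") != corrected.replace(" ", ""):
        raise ValueError("texts differ by more than spacing")
    pairs, i, j = [], 0, 0
    while i < len(uncorrected) and j < len(corrected):
        if uncorrected[i] == corrected[j]:
            i += 1
            j += 1
        elif uncorrected[i] == " ":     # false space
            left = uncorrected[:i].split(" ")[-1]
            right = uncorrected[i + 1:].split(" ")[0]
            pairs.append((left, right))
            i += 1
        else:                           # space only in the corrected text
            j += 1
    return pairs


def strip_ending(word):
    """Strip one obvious case ending so that a rule applies more widely."""
    for ending in CASE_ENDINGS:
        if word.endswith(ending) and len(word) > len(ending) + 2:
            return word[:-len(ending)]
    return word


def rule_a(stem):
    """Type (a): remove all spaces immediately following the stem."""
    return re.compile(r"(?<!\S)" + re.escape(stem) + r" +(?=\S)"), stem


def rule_b(left, right):
    """Type (b): remove spaces only between left and the stripped right."""
    follower = re.escape(strip_ending(right))
    return re.compile(r"(?<!\S)" + re.escape(left) + r" +(?=" + follower + ")"), left


def apply_rules(text, rules):
    for pattern, replacement in rules:
        text = pattern.sub(lambda m, r=replacement: r, text)
    return text
```

With rule_a("ratna") and rule_b("bhiima", "paraakramaH") in force, ratna dhanaani and bhiima paraakramasya are both closed up, while a vocative such as bhiima gaccha is left alone.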
It was only after I had gone through this process with the first
four sections of Mahabharata text mentioned above that I
looked for the first time at Tokunaga's Ramayana. I then
realised that his use of "." and ".." in the R allowed me to
create (so to speak) deliberately incorrect versions of every
kaaNDa, as well as "corrected" versions. I did this, and fed
the results into mkmconv in just the same way as with
sections of MBh text, since there was likely to be
significant overlap between the two epics. This produced large
numbers of further improvements.
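The idea of deriving both versions from the R text can be sketched as follows in Python. The sketch ignores sandhi and all other processing and looks only at the two separators; the sample string in the usage note is made up for illustration, and the function name is invented.

```python
def training_pair(r_text: str):
    """Return (deliberately_incorrect, corrected) versions of R text.

    In the R, ".." marks a space between words and "." a join between
    compound members.
    """
    # Corrected: keep word spaces, close up compound members.
    corrected = r_text.replace("..", " ").replace(".", "")
    # Deliberately incorrect: treat both separators as spaces, imitating
    # the single "." of the MBh text.
    incorrect = r_text.replace("..", " ").replace(".", " ")
    return incorrect, corrected
```

For a made-up fragment raaja.putra..gacchati this yields "raaja putra gacchati" and "raajaputra gacchati", a pair of exactly the kind that mkmconv compares.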
This method has been very successful, but it has intrinsic
limitations, and I think it may be nearing the end of its usefulness
-- the graph of corrections achieved against effort expended is
levelling off, and there are a lot of cases that it simply cannot
handle properly. A major example is proper names. My system cannot
close up spaces after all occurrences of (e.g.) bhiima because
bhiima may be a vocative; it is restricted to closing up
specific cases that happen to have been drawn to its attention, such
as bhiima paraakram. The next task must be to deal with all
cases of bhiima<space> etc. by means of "sweeps"
through the entire Mahabharata using an editor interactively.
Unfortunately the MBh has a large cast of characters, many of
them known by numerous different names; in addition there are many
other words that require similar treatment because they cannot always
be assumed to be stems.
John Smith can be contacted as jds10@cam.ac.uk