Utility programs

Here you can find various sets of programs for use with Indian-language text.

There is a set of conversion utilities written in the programming language Perl, which is installed by default on most Unix systems and is freely available for all systems. These programs (csx2tex, dn2tex, tex2csx, tex2dn, iscii2csx, tex2norman, norman2tex) convert between different encodings used to represent Indian-language text: (1) CSX, (2) the DN encoding used in conjunction with Frans Velthuis's “Devanagari for TeX” package, (3) the ISCII standard used by much Indian software, (4) the encoding popularised by Professor K. R. Norman, and (5) my variation on standard TeX (in which “\.” represents a subscript dot, “\:” a superscript dot). Some of the programs accept options on the command line to modify their behaviour: these have a “-h” option which provides basic help.
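As a rough illustration of the TeX-style convention mentioned above, the following Python sketch turns “\.” plus a letter into that letter with a dot below, and “\:” plus a letter into that letter with a dot above. It is not the code of any of the Perl programs listed: the function name tex_dots_to_unicode is hypothetical, and the assumption that the accent command is followed directly by a single unbraced ASCII letter is mine.

```python
import re
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE

def tex_dots_to_unicode(text: str) -> str:
    """Convert \\.x to x with a dot below and \\:x to x with a dot above.

    Assumes (hypothetically) that each accent command is followed
    directly by one ASCII letter, with no braces.
    """
    text = re.sub(r"\\\.([A-Za-z])", lambda m: m.group(1) + DOT_BELOW, text)
    text = re.sub(r"\\:([A-Za-z])", lambda m: m.group(1) + DOT_ABOVE, text)
    # Compose base letter + combining mark into a single character where one exists.
    return unicodedata.normalize("NFC", text)

print(tex_dots_to_unicode(r"sa\.msk\.rta"))
print(tex_dots_to_unicode(r"a\:nga"))
```

The real converters handle many more accents and target CSX, DN or Norman encodings rather than Unicode; the sketch only shows the shape of such a mapping.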

A second set of conversion programs is written not in Perl but in C; source code (*.c) and Win32 executables (*.exe) are provided. Csx2isc and csxp2isc convert respectively from CSX and CSX+ to ISCII. Csxp2ur converts text from CSX+ to accented Unicode Roman. A2c and c2a convert between CSX and Harvard-Kyoto ASCII. Iscii2ud converts from ISCII to Unicode Devanagari, and ud2iscii converts in the opposite direction. Ur2ud converts from Unicode Roman to Unicode Devanagari; it can read and write UTF-8 and other standard Unicode formats; Roman transliteration adheres to the ISO 15919 standard. There is also a Unicode format converter uconv, which can convert between UTF-8 and the two UCS-2 variants (big- and little-endian). Both ur2ud and uconv have a “-h” option to provide help on usage.
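To make the difference between UTF-8 and the two UCS-2 byte orders concrete, here is a minimal Python sketch of that kind of format conversion. It is not the source of uconv: the function names are hypothetical, and it relies on the fact that for characters in the Basic Multilingual Plane (which includes Devanagari and accented Roman) UCS-2 and UTF-16 produce the same bytes.

```python
def utf8_to_ucs2(data: bytes, big_endian: bool = True) -> bytes:
    """Re-encode UTF-8 bytes as UCS-2 (no byte-order mark is written)."""
    text = data.decode("utf-8")
    # UCS-2 cannot represent code points above U+FFFF.
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character outside the BMP cannot be stored in UCS-2")
    return text.encode("utf-16-be" if big_endian else "utf-16-le")

def ucs2_to_utf8(data: bytes, big_endian: bool = True) -> bytes:
    """Re-encode UCS-2 bytes (in the given byte order) as UTF-8."""
    text = data.decode("utf-16-be" if big_endian else "utf-16-le")
    return text.encode("utf-8")

ka = "\u0915".encode("utf-8")          # DEVANAGARI LETTER KA, three bytes in UTF-8
print(utf8_to_ucs2(ka, big_endian=True).hex())   # high byte first
print(utf8_to_ucs2(ka, big_endian=False).hex())  # low byte first
```

The same character occupies three bytes in UTF-8 and two in UCS-2, with the two bytes swapped between the big- and little-endian variants.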

Pc2mac and mac2pc are Win32 utility programs to allow the transfer of Word documents using “private” character encodings (such as Norman or CSX+) between PCs and Macintoshes. By default such documents are garbled in the transfer, as Word assumes that each file is in the native encoding of the source machine, and translates it to the native encoding of the destination machine. Note that these programs are provided to assist users who find themselves in difficulties because of the incompatibilities caused by the use of legacy character-sets; however, in most cases a better solution is to abandon such outdated fonts and convert to Unicode. Word macros to help with this can be found here; suitable Unicode fonts can be found here.
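The garbling described above can be imitated in a few lines of Python. This is only a sketch of the problem, not the method used by pc2mac or mac2pc: it assumes that the PC's native encoding is Windows code page 1252 and the Macintosh's is Mac Roman, and the function names are hypothetical.

```python
def simulate_pc_to_mac(raw: bytes) -> bytes:
    """Imitate the automatic translation: bytes are read as Windows-1252
    characters and rewritten as the same characters in Mac Roman."""
    return raw.decode("cp1252").encode("mac_roman")

def undo_pc_to_mac(garbled: bytes) -> bytes:
    """Reverse the translation to recover the original byte values."""
    return garbled.decode("mac_roman").encode("cp1252")

# Two byte values that a "private" font might use for its own glyphs.
original = bytes([0xE0, 0xE9])
moved = simulate_pc_to_mac(original)
print(original.hex(), "->", moved.hex())
print(undo_pc_to_mac(moved) == original)
```

A font with a private encoding draws its glyphs by byte value, so once the byte values have been changed in this way the text displays as the wrong characters. Bytes with no counterpart in the other character set would make these simple functions raise an error, which a real tool has to handle.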

There are three Sanskrit-related utilities written in Perl. Sscan is a simple program that generates metrical analyses of Sanskrit verse texts. It is particularly geared to the texts of the two epics, but stands a good chance of working with any CSX-encoded text in a reasonably sane format. Ssort is a similarly simple utility to sort lists of CSX-encoded Sanskrit forms into Devanagari “alphabetic” order. The length of file it can handle is dependent on the amount of memory available; longer files can be sorted with ssort.unix, but this is unlikely to work on non-Unix systems, since it invokes the standard Unix sort program. Finally, vaccent is intended for use in conjunction with ur2ud (available in the conversion programs above). It reads in an accented Vedic text in Unicode Roman transliteration, which must adhere to ISO 15919 conventions, and outputs the same file with Vedic accents added; this output file can then be processed with ur2ud -s to produce a Devanagari version of the text correctly accented according to the system used in the RV, AV, TS, etc.
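Sorting into Devanagari order means comparing words sound by sound rather than byte by byte, treating digraphs such as “kh” or “ai” as single letters. The Python sketch below shows the idea on Unicode Roman (IAST-style) input rather than CSX; it is not the algorithm of ssort, the names ORDER, RANK and devanagari_key are hypothetical, and the placement of anusvara and visarga directly after the vowels is a simplification.

```python
import unicodedata

# Simplified Devanagari alphabetical order in IAST-style romanisation (an assumption).
ORDER = ["a", "ā", "i", "ī", "u", "ū", "ṛ", "ṝ", "ḷ", "e", "ai", "o", "au",
         "ṃ", "ḥ",
         "k", "kh", "g", "gh", "ṅ", "c", "ch", "j", "jh", "ñ",
         "ṭ", "ṭh", "ḍ", "ḍh", "ṇ", "t", "th", "d", "dh", "n",
         "p", "ph", "b", "bh", "m", "y", "r", "l", "v", "ś", "ṣ", "s", "h"]
RANK = {unicodedata.normalize("NFC", letter): i for i, letter in enumerate(ORDER)}

def devanagari_key(word: str) -> list:
    """Return a sort key: the rank of each sound, matching digraphs first."""
    word = unicodedata.normalize("NFC", word.lower())
    key, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])
            i += 2
        elif word[i] in RANK:
            key.append(RANK[word[i]])
            i += 1
        else:
            # Unknown characters sort after all known letters.
            key.append(len(ORDER) + ord(word[i]))
            i += 1
    return key

words = ["kha", "ka", "ga", "āgama", "agni", "śiva", "sūrya"]
print(sorted(words, key=devanagari_key))
```

A plain code-point sort of the same list would place “ga” before “ka” and “āgama” after “sūrya”, which is why a dedicated key is needed.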

The zip-file accfonts.zip contains three Perl programs which address the same requirement using the same basic algorithms: their aim is to make it easy to create versions of existing fonts containing whatever accented characters the user may need, arranged according to whatever encoding he/she may favour. Mkt1font does this by reading in the two files that define a Type 1 PostScript font and writing out new versions of them; vpl2vpl does it by reading in the file that defines a TeX virtual font and writing out a new version of it; vpl2ovp does it in the same way as vpl2vpl, but generates virtual fonts for Omega, the 16-bit Unicode-aware development from TeX. In each case information about what accented characters are required and where they should be located is supplied by means of a simple definition file, which has the same format for all three programs. For more details consult the README file provided.

Please email any problems to jds10 <at> cam.ac.uk.
