
Existing standards for codes in respect of Indian Scripts

Internal representation of text in Indian languages may be viewed as the problem of assigning codes to the aksharas of the languages. The complexities of the syllabic writing systems in use have presented difficulties in standardizing internal representations. TeX was an inspiration in the late 1980s, but TeX was suited more to typesetting than to text processing per se. In the absence of appropriate fonts, interactive applications could not be attempted, and when fonts became available, applications simply used the glyph positions as the codes, with the number of glyphs restricted on account of the eight-bit fonts. The following representations still apply, as many applications have been written to use one or the other. It must be remembered that these representations primarily address the issue of internal representation for rendering text.

- Use of Roman letters with diacritic marks
- ISCII codes
- Unicode for Indian scripts
- The ISFOC standard from CDAC

Of the above, the first has been discussed in the section on transliteration principles. The ISFOC standard applies more to the standardization of fonts for different scripts and cannot really be thought of as an encoding standard. We confine our discussion in this section to ISCII and Unicode. A brief note on ISFOC will be found in a separate page.

Indian Script Code for Information Interchange (ISCII)

ISCII was proposed in the eighties and a suitable standard was evolved by 1991. Here are the salient aspects of the ISCII representation.

- It is a single representation for all the Indian scripts.
- Codes have been assigned in the upper ASCII region (160-255) for the aksharas of the language.
- The scheme also assigns codes for the matras (vowel extensions).
- Special characters have been included to specify how a consonant in a syllable should be rendered. Rendering of Devanagari has been kept in mind.
- A special attribute character has been included to identify the script to be used in rendering specific sections of the text.

Shown below is the basic assignment in the form of a table. There is also a version of this table known as PC-ISCII, where no characters are defined in the range 176-223. In PC-ISCII, the first three columns of the ISCII-91 table have been shifted to the starting location of 128. PC-ISCII has been used in many applications based on the GIST card, a hardware adapter which supported Indian language applications on an IBM PC. In the table, some code values have not been assigned. Six columns of 16 assignments each start at the hexadecimal value of A0, which is equivalent to decimal 160.

The following observations are made.

1. The ISCII code is reasonably well suited for representing the syllables of Indian languages, though one must remember that a multiple-byte representation is inevitable, which could vary from one byte to as many as 10 bytes for a syllable.

2. The ISCII code has effected a compromise in grouping the consonants of the languages into a common set that does not preserve the true sorting order of the aksharas across the languages. Specifically, some aksharas of Tamil, Malayalam and Telugu are out of place in the assignment of codes.

3. The ISCII code provides for some tricks to be used in representing some aksharas, specifically the case of Devanagari aksharas representing Persian letters. ISCII uses a concept known as the Nukta character to indicate the required akshara.

4. When forming conjuncts, ISCII specifications require that the halanth character be used once or twice depending on whether the halanth form of the consonant or the half form of the consonant is present. This results in more than one internal representation for the same syllable. Also, ISCII provides for the concept of the soft halanth as well as an invisible consonant to handle representations of special letters. Parsing a text string made up of ISCII codes is a fairly complex problem requiring a state machine which is also language dependent; a simplified parsing sketch is given after these observations. This is a consequence of the observation that languages like Tamil do not support conjuncts made up of three or more differing consonants. In fact, it is stated that Tamil has no conjunct aksharas. What is probably implied here is that a syllable in Tamil is always split into its basic consonants and the matra. Several decades ago, Tamil writing on palm leaves did show geminated consonants in special forms. Though representation at the level of a syllable is possible in ISCII, processing a syllable can become quite complex, i.e., linguistic processing may pose specific difficulties due to the variable length codes for syllables.

5. The code assignments, though language independent, do not admit of clean and error-free transliteration across languages, especially into Tamil from Devanagari.

6. It is difficult to perform a check on an ISCII string to see if arbitrary syllables are present. Though theoretically many syllables are possible, in practice the set is limited to about 600-800 basic syllables which can also combine with all the vowels. The standard provides for arbitrary syllables to handle cases where new words may be introduced in the language or syllables from other languages are to be handled.

It must be stated here that ISCII represents the very first attempt at syllable-level coding of Indian language aksharas. Unfortunately, outside of CDAC, which promoted ISCII through their GIST technology, very few seem to use ISCII. ISCII codes have nothing to do with fonts, and a given text in ISCII may be displayed using many different fonts for the same script. This requires specific rendering software which can map the ISCII codes to the glyphs in a matching font for the script. Multibyte syllables will have to be mapped into multiple glyphs in a font-dependent and language-dependent manner. It is primarily this complexity that has rendered ISCII less popular. Details of ISCII are covered in the Bureau of Indian Standards document No. IS 13194:1991. Shown below are some examples of strings in Devanagari and other scripts along with their ISCII representations.
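To make observation 4 above concrete, here is a minimal sketch (in Python) of how syllable segmentation over an ISCII-91 byte string might look. The code ranges and special values are indicative only and must be verified against IS 13194:1991; handling of the soft halanth, INV, attribute and extension codes is omitted.

```python
# Minimal sketch of syllable segmentation over an ISCII-91 byte string.
# The ranges below are indicative (approximate ISCII-91 layout); consult
# IS 13194:1991 for the authoritative assignments.
VOWELS     = range(0xA4, 0xB3)   # independent vowels (assumed range)
CONSONANTS = range(0xB3, 0xDA)   # consonants (assumed range)
MATRAS     = range(0xDA, 0xE8)   # vowel signs / matras (assumed range)
HALANTH    = 0xE8                # halanth (assumed value)
NUKTA      = 0xE9                # nukta (assumed value)

def segment(data: bytes):
    """Group ISCII bytes into candidate syllables (variable-length units)."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b in VOWELS:                       # a pure vowel is a syllable
            j = i + 1
        elif b in CONSONANTS:
            j = i + 1
            if j < len(data) and data[j] == NUKTA:
                j += 1                        # nukta modifies the consonant
            # any number of halanth + consonant pairs extend the conjunct
            while j + 1 < len(data) and data[j] == HALANTH and data[j + 1] in CONSONANTS:
                j += 2
            if j < len(data) and data[j] in MATRAS:
                j += 1                        # optional medial vowel
            elif j < len(data) and data[j] == HALANTH:
                j += 1                        # dead consonant at word end
        else:
            j = i + 1                         # punctuation, digits, etc.
        out.append(data[i:j])
        i = j
    return out
```

Even this simplified state machine shows why a syllable becomes a variable-length, context-dependent unit under ISCII.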

Unicode for Indian Languages

Unicode was the first attempt at producing a standard for multilingual documents. Unicode owes its origin to the concept of the ASCII code extended to accommodate international languages and scripts. Short character codes (7 bits or 8 bits) are adequate to represent the letters of the alphabets of many languages of the world. The fundamental idea behind Unicode is that a superset of characters from all the different languages/scripts of the world be formed so that a single coding scheme could effectively handle almost all the alphabets of all the languages. What this implies is that the different scripts used in the writing systems followed by different languages be accommodated in the coding scheme. In Unicode more than 65000 different characters can be referenced. This large set includes not only the letters of the alphabet from many different languages of the world but also punctuation and special shapes such as mathematical symbols, currency symbols etc.

The term code space is often used to refer to the full set of codes, and in Unicode the code space is divided into consecutive regions spanning typically 128 code values. Essentially this assignment retains the ordering of the characters within the assigned group and is therefore very similar to the ASCII assignments which were in vogue earlier. Unicode assignments may be viewed geometrically as a stack of planes, each plane having one and possibly multiple chunks of 128 consecutive code values. Logically related characters or symbols have been grouped together in Unicode to span one or more regions of 128 code values. We may view these regions as different planes in the code space, as illustrated in the figure below. Data processing software using Unicode will be able to identify the language of the text for each character by identifying the plane the character is located in, and use an appropriate font to display it or invoke some meaningful linguistic processing.

Technically, Unicode can handle many more languages than the supported scripts if these languages use the same script in their writing systems. By consolidating a complete set of symbols used in the writing systems across a family of languages, one can get a script that caters to all of them.

The Latin script with its supplementary characters and extended symbols has about 550 different characters, and this is quite adequate to handle almost anything that has appeared in print in respect of the Latin script. Hence, in the geometrical view above, some planes may be larger (wider) than others, and more than one script could have characters from logically similar groups specified in a plane. The fact that several languages/scripts of the world require many more than 128 codes has necessitated assignments of more than one basic plane (i.e., multiples of 128 code values) for them. Languages such as Greek, Arabic or Chinese have larger planes assigned to them. In particular, Unicode has allowed nearly 20000 characters of Chinese, Japanese and Korean scripts to be included in a contiguous region of the code space. Currently fewer than a hundred different groups of symbols or specific scripts are included in Unicode. Even though it is a sixteen-bit code and can therefore handle more than 65000 code values, Unicode should not be viewed as a scheme which allows several thousand characters for each and every language. It has provision for fewer than 128 characters for many scripts, since many languages do not require more than 128 characters to display text.

In respect of Indian languages, which use syllabic writing systems, one might think that Unicode would have provided several thousands of codes for the syllables, similar to the nearly 11000 Hangul syllables already included. On the contrary, Unicode has pretty much accepted the concept behind ISCII and has provided only for the most basic units of the writing systems, which include the vowels, consonants and the vowel modifiers. Unlike ISCII, which has a uniform coding scheme for all the languages, Unicode has provided individual planes for the nine major scripts of India. Within these planes of 128 code values each, assignments are language specific though the ISCII base has been more or less retained. Consequently, Unicode suffers from the same limitations that ISCII runs into. There are some questionable assignments in Unicode in respect of matras. A matra is not a character by itself. It is a representation of a combination of a vowel and consonant, in other words the representation of a medial vowel. A vowel, and NOT its matra, is the basic linguistic unit. Consequently, linguistic processing will be difficult with Unicode for Indian languages, just as in ISCII.

Here is the Unicode assignment for Sanskrit (Devanagari). The language code for Sanskrit (Devanagari) is 09 (hex) and so the codes span the range 0901 to 097f (hexadecimal values). In this chart, the characters of Devanagari with a dot beneath are grouped in the range 0958 to 095f. These are the characters used in Hindi which are derived from Persian and seen in Urdu as well. Likewise, in locations 0929, 0931 and 0934 the letters are dotted. The codes are similar to ISCII in ordering, but Unicode includes characters not specified in ISCII. Also, the assignments for each language more or less adhere to the same relative locations for the basic vowels and consonants as in ISCII but include many language dependent codes. The code positions in Unicode will not exactly match the corresponding ISCII assignments.
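As a simple illustration of the plane/block idea, the following sketch identifies the Indic script of a character from the standard Unicode block it falls in; each block here spans 128 code values, starting at U+0900 for Devanagari.

```python
# Identify the Indic script of a character from its Unicode block.
# The ranges are the standard 128-code-value Indic blocks.
INDIC_BLOCKS = {
    "Devanagari": range(0x0900, 0x0980),
    "Bengali":    range(0x0980, 0x0A00),
    "Gurmukhi":   range(0x0A00, 0x0A80),
    "Gujarati":   range(0x0A80, 0x0B00),
    "Oriya":      range(0x0B00, 0x0B80),
    "Tamil":      range(0x0B80, 0x0C00),
    "Telugu":     range(0x0C00, 0x0C80),
    "Kannada":    range(0x0C80, 0x0D00),
    "Malayalam":  range(0x0D00, 0x0D80),
}

def script_of(ch: str) -> str:
    cp = ord(ch)
    for name, block in INDIC_BLOCKS.items():
        if cp in block:
            return name
    return "Other"

print(script_of("\u0915"))   # DEVANAGARI LETTER KA -> Devanagari
print(script_of("\u0B95"))   # TAMIL LETTER KA      -> Tamil
```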

Shown below are the Unicode representations for some strings in different scripts. These are the same strings shown earlier under ISCII.

From the discussion above, it will be seen that ISCII and Unicode provide multibyte representations for syllables. This is not unlike the case for English and other European languages, where syllables are shown only with the basic letters of the alphabet. However, in all the writing systems used in India, each syllable is individually identifiable through a unique shape, and one has to provide for thousands of shapes while rendering text. While these thousands of shapes may be composed from a much smaller set of basic shapes for the vowels, consonants and vowel modifiers, one must admit that several hundreds of syllables have unique shapes which cannot be derived by putting together the basic shapes. It is estimated that in practice more than 600 different glyphs would be required to adequately represent all the different syllables in most of the scripts. The main problem in dealing with Unicode for Indian languages/scripts has to do with the mapping between a multibyte code for a syllable and its displayed shape. This is a very complex issue requiring further understanding of rendering rules. As such, a full discussion of this would require that the viewer understand the intricacies of the writing systems of India. We cover this in a separate page.
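The point about multibyte representations can be seen directly in code. A single visible akshara such as the Devanagari conjunct "ksha" is three Unicode code points, and any cursor, selection or width computation has to group code points into such units first. The grouping below is a deliberately simplified sketch for Devanagari only; real applications follow the Unicode text segmentation rules with script-specific tailoring.

```python
# Simplified grouping of a Devanagari string into visible syllable units.
# Real segmentation follows the Unicode text segmentation rules; this sketch
# only absorbs dependent signs and virama-joined consonants.
VIRAMA = "\u094D"
DEPENDENT_SIGNS = {chr(c) for c in range(0x093E, 0x094D)} | {"\u0901", "\u0902", "\u0903"}

def aksharas(text: str):
    units, i = [], 0
    while i < len(text):
        j = i + 1
        while j < len(text):
            if text[j] in DEPENDENT_SIGNS:
                j += 1
            elif text[j] == VIRAMA:
                j += 2 if j + 1 < len(text) else 1   # virama joins the next consonant
            else:
                break
        units.append(text[i:j])
        i = j
    return units

word = "\u0915\u094D\u0937\u093F"        # KA + VIRAMA + SSA + vowel sign I ("kshi")
print(len(word), aksharas(word))         # 4 code points, but a single visible akshara
```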

Specific technical problems with ISCII and Unicode

It must be observed, in the light of the above discussion, that displaying a Unicode string in an Indian language requires a complex piece of processing software to identify the syllables and get the corresponding glyphs from an appropriate font for the script.

The multibyte nature of Unicode (for a syllable) makes a table-driven approach to this quite difficult. Even though it is possible to write such modules which can go from Unicode to the display of text using some font, one faces a formidable problem in respect of data entry, where formation of syllables from multiple key sequences is truly overwhelming. With the limited number of keys available on standard keyboards, it is often not possible to accommodate all the symbols one would require to produce meaningful printouts in each script consistent with quality typesetting systems. Unicode-based applications employ the concept of "Locales" to permit data entry of multilingual text. Each Locale is associated with its own keyboard mapping, and application software can switch Locales to permit data entry of multilingual text. It will be seen that for Indian scripts the Locales themselves have limitations, since they do not permit a full complement of letters and special characters to be typed in, much less the standard punctuation that has become part of Indian scripts today. While it is possible to write special keyboard driver programs which implement a state machine to handle key sequences to produce syllables, the approach is not universal enough to be included into the operating systems, certainly not when a single driver should cater to all the Indian scripts. There is no meaning in having a Hindi version of an OS with its own data entry convention which differs substantially from a Tamil or Telugu version.

Here is a summary of the issues that confront us when dealing with Unicode for Indian scripts.

- Rendering text in a manner that is uniform across applications is quite difficult. Windowing applications with cut, copy and paste features suffer due to problems in correctly identifying the width of each syllable on the screen. Also, applications have to worry about specific rendering issues when modifier codes are present. How applications run into difficulties in rendering even simple strings is illustrated with examples in a separate page.
- Interpreting the syllabic content involves context dependent processing, that too with a variable number of codes for each syllable.
- A complete set of symbols used in standard printed text has not been included in Unicode for almost all the Indian scripts.
- Displaying text in scripts other than those Unicode supports is not possible. For instance, many of the scripts used in the past, such as the Grantha script, Modi, Sharada etc., cannot be used to display Sanskrit text. This will be a fairly serious limitation in practice when thousands of manuscripts written over the centuries have to be preserved and interpreted.
- Transliteration across Indian scripts will not be easy to implement since appropriate symbols currently recommended for transliteration are not part of the Unicode set. In the Indian context, transliteration is very much a requirement.
- The Unicode assignments bear little resemblance to the linguistic base on which the aksharas of Indian scripts are founded. While this is not a critical issue, it is desirable to have codes whose values are based on some linguistic properties assigned to the vowels and consonants, as has been the practice in India.

In a separate web page, we discuss the problems associated with Unicode for linguistic processing of text in Indian languages. Details of Unicode for Indian scripts have been published in the standard available from the Unicode consortium. The Unicode web site does have useful information but one will have to resort to the printed text to get the real details. These are also available in PDF format from the above web site.

Is Unicode for Indian Languages meaningless then?

The answer is certainly no. The main purpose of Unicode is to transport information across computer systems. As of today, Unicode is reasonably adequate to do this job, since it does provide for representing text at the syllable level, though not in fixed size units (bytes). Applications dealing with Indian languages will have to include a special layer which transforms Unicode text into a more meaningful representation for linguistic or text processing purposes. The point to keep in mind is that the seven-bit ASCII based representation for most world languages serves both purposes well, i.e., not only are text strings transferable across systems, but linguistic processing is consistent with the seven-bit representation. It so happens that the phonetic nature of our Indian languages has necessitated a different representation for linguistic analysis.

With the majority of the languages of the world, which use a relatively small set of symbols to represent the letters of their alphabet, 8-bit (or even 7-bit) character codes are adequate to represent the letters.

Unicode for Indian Languages: A discussion


Support for Unicode in applications catering to Indian languages is a highly debated issue. Though Unicode has emerged as a viable standard and is finding increasing use all over the world, there are some real difficulties in using it in practice for building applications supporting multilingual user interfaces in Indian languages. The conceptual basis for Unicode, though well accepted for the western languages (scripts), does not fully conform to the linguistic requirements seen in our languages. At the Systems Development Laboratory, IIT Madras, where some meaningful multilingual solutions consistent with the linguistic requirements for all the Indian languages have been developed and distributed as well, there is a strong feeling that Unicode will not really help. It is true that Unicode is a world standard proposed and accepted by a large community of academics, professionals and users. Unfortunately, it does not really blend with the syllabic writing systems used in India, much less provide the means to express linguistic content without ambiguity and in a manner that ties in well with our own understanding of languages. What we have tried to say here reflects the above view.

Multilingual Computing: A view from SDL


Introduction
Viewpoint
Idiosyncrasies of the writing systems
Defining Linguistic requirements
Dealing with Text consistent with Linguistic requirements
Multilingual computing requirements (for India)

Unicode for Indian Languages


The conceptual basis for Unicode
Unicode for Indian languages/scripts
Data entry and associated problems
Issues in rendering Unicode
Using a shaping engine to render Unicode text
Discussion on sorting or collation
The conceptual basis of the Open type font

Unicode support in Microsoft applications


Uniscribe, the shaping engine
Limitations of Uniscribe
A review of some Microsoft applications in respect of handling linguistic content

Recommendations for Developers of Indian language Applications


Use of True type fonts to render Unicode text
Can we simplify handling Unicode text?
Guidelines for development under Linux

Examples of Unicode Rendering by different applications (Windows and Linux)

circa 2003
circa 2007

Summary of Observations
The experiences of the lab in working with Unicode are summarized in the linked page. As of this update (June 2006), one has not seen an application in any of the Indian languages that can be cited as a satisfactory implementation based on Unicode. Though a number of developers are counting on using Unicode, it is not going to be easy to effect localization of our languages consistent with the requirements of computing with Indian languages.

Unicode - A Brief Introduction


Introduction

In the context of internationalization and providing uniformity in the handling of text based information across the languages of the world, Unicode has gained considerable importance. The fundamental concept behind Unicode is that the text representation (Unicode based text) retains the linguistic content that must be conveyed, while at the same time providing for this content to be displayed in human readable form. By catering to both these requirements, Unicode has emerged as the best choice for representing text in a computer application, specifically one that deals with multilingual content. Developers across the world are committing themselves to providing Unicode support in all their applications.

Multilingual information processing is one of the essential requirements when it comes to computerization in India. Here, the development of applications requires that interactive user interfaces in different regional languages be part of each application. A specific regional language may be supported through one or more scripts, despite the fact that a given script may be used for more than one language. A very important issue, from a conceptual angle at least, is whether support for a script is equivalent to supporting a language.

During the initial phases of development of applications in Indian languages, one was concerned more with the rendering aspects of text, a formidable problem in itself on account of the syllabic writing system followed for all the Indian languages. No one really felt compelled to take into consideration text processing issues. The majority of the early applications required text entry and display, with computation effected on numbers rather than text per se. It is not surprising therefore that whatever standardization was attempted emphasized mostly the aspects of the writing system without really catering to the linguistic requirements. In essence, the standardization mentioned above (ISCII and Unicode) requires context dependent text processing of each character as opposed to simple handling of a character by itself.

In western scripts, the writing system employs a relatively small set of shapes and symbols, as this is sufficient to satisfy the requirement that linguistic content as well as rendering information be exactly specified through the same set of codes. Consequently, text processing could be comfortably achieved using a small set of codes. In respect of our languages, the complexities of the writing systems demand that a large number of written shapes (typically in thousands) be used, though the linguistic content may still be specified using a small set of codes for the vowels and consonants (typically less than a hundred). Hence it is not possible to use the same set of codes to satisfy both requirements. In their wisdom, the designers of ISCII and subsequently Unicode essentially struck a compromise where the smaller set of codes was recommended. Yet, they yielded to the temptation of incorporating codes to include rendering information as well. These codes conveying rendering information took care of Devanagari derived writing systems but do not adequately address the writing systems of the South. The problem that we face today, in respect of efficient representation of text in our languages, is precisely one of not being able to do either effective linguistic processing or meet the real requirements of the writing systems.
The Multilingual Systems Development Project at IIT Madras had taken the view that efficient text processing is absolutely essential and is perhaps more important than precise rendering of text so long as ambiguities are avoided. The consequence of this decision was that the coding structure should preserve linguistic content as well as provide complete rendering information within the flexibilities offered by the writing system. Such a coding scheme would require syllables to be coded since the linguistic content is expressed through syllables and the writing system displays syllables. The multilingual software applications developed at IIT Madras have successfully demonstrated that linguistic text processing at the syllable level is not only possible but can also be accomplished by using conventional algorithms which work with fixed size codes. In contrast with this, application development with Unicode support has raised a number of issues which must be thoroughly discussed and understood before one accepts Unicode as a viable standard for computing with Indian languages.
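The following is a highly simplified sketch of the idea of a fixed-size syllable code referred to above (it is not the actual IIT Madras encoding): an index for the consonant cluster and an index for the vowel are packed into a single 16-bit value, so that ordinary fixed-width string algorithms apply directly.

```python
# Sketch only: pack a consonant-cluster index and a vowel index into one
# 16-bit syllable code. The field widths and indices are illustrative and
# do not reproduce the actual IIT Madras scheme.

def pack(cluster: int, vowel: int) -> int:
    assert 0 <= cluster < 2048 and 0 <= vowel < 32
    return (cluster << 5) | vowel            # 11 bits for the cluster, 5 for the vowel

def unpack(code: int):
    return code >> 5, code & 0x1F

code = pack(cluster=57, vowel=3)             # hypothetical indices
print(hex(code), unpack(code))               # every syllable is exactly one 16-bit unit
```

With such a representation, sorting, searching and indexing operate on one code per syllable, which is the simplification argued for in this section.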

In the light of the above, the Systems Development Laboratory, IIT Madras, is pleased to share with viewers the Lab's experiences in dealing with linguistic and rendering issues of text in all the important scripts of India.

Unicode: A Viewpoint from SDL


The Multilingual Systems project at IIT Madras was started around the time ISCII had evolved into a standard. It was clear to the development team that though ISCII was conceived as the basis for syllabic representation of text in Indian languages, one had to reckon with the need to process a variable number of bytes to identify a proper syllable. The variable length code makes text processing very complex, especially in the presence of codes which do not have linguistic significance but are required for correctly rendering the syllable. In recent years, software developers have indeed given serious thought to supporting Unicode for Indian languages. Unicode for Indian languages has basically evolved from ISCII and has retained the essence of the eight-bit coding scheme, though script specific codes have been assigned for the different scripts. The world over, there has been a continuing debate about the real suitability of Unicode for applications in Indian languages, but the open commitment given by Microsoft has led many developers to toe the line towards Unicode. From the very beginning, the Multilingual Systems project at IIT Madras had seen the futility of attempting to do text and linguistic processing with variable length codes for syllables and had therefore evolved a uniform two byte scheme to simplify text processing. The question of adhering to a meaningful standard where developers see distinct advantages is an important issue, but a standard becomes meaningful only if most of what we have successfully attempted earlier can be accommodated. In this respect, Unicode for Indian languages does pose fairly serious challenges, and to this date (March 2005) no satisfactory implementation of useful applications can be cited as an example.

The purpose of this article is not to present an argument against using Unicode but to bring out the real difficulties in coping with its implementation for Indian languages. Many of the complexities involved in rendering Unicode text through Uniscribe (Microsoft's shaping engine) or equivalent interfaces will be taken up one by one, and the difficulties faced in linguistic processing will be explained. Where required, test files have been included for viewers to download and verify the points made. The information provided here will probably convince the reader that it is quite difficult to work with Unicode for Indian languages. Hence one should seriously consider alternatives for text processing. On the issue of using Unicode for transporting information across systems, there is enough consensus however.

Idiosyncrasies of Writing systems in India

Writing systems followed in India are considered complex on account of the rules which specify how a syllable should be written. The reader is advised to look at the page discussing the principles of writing systems before looking at the current page, which concentrates on the problem of rendering syllables on a computer. By and large, most languages of India follow a syllabic writing system which represents syllables rather than pure consonants and vowels. Though there can be thousands of syllables, the writing systems generally follow some rules by which the syllables are shaped. These rules allow a syllable to be built up from a smaller set of shapes which include the vowels, consonants and the representations for the medial vowels. This smaller set is usually made available in a font, and on a computer a syllable is shaped typically by placing the glyphs in the required order.
It will help if we specify the manner in which a syllable is shaped by examining the structure of the syllable. A syllable may be made up of:

1. A pure vowel. This usually applies to a vowel appearing at the beginning of a word, though in some languages a pure vowel may be seen inside a word. A pure vowel has a unique shape and is written using this shape wherever it occurs.

2. A consonant with an implied "ah". The consonants of our languages cannot be pronounced easily unless a vowel is attached to the consonant or other consonants follow. Unlike in western scripts, where a consonant is always written in its generic form, consonants in India are almost always written with an implied "ah", so that one can pronounce an independent consonant directly without having to refer to it by a name (unlike in western languages where each letter has a name).

For example, "m" is normally referred to as "em", and only when an "a" comes with it, as in "ma", will one say it as "ma". In Sanskrit (and in other Indian languages), when you see the consonant 'm', you will know that it is to be pronounced "ma". This subtle distinction has to be retained when a child is taught the writing system. In Indian scripts, a generic consonant occurs only as part of a syllable and not by itself, except that a word may end in a generic consonant. Hence the writing convention includes a special form for the same by attaching a "halanth" ligature. So m is the generic form of m, but it is not easy to pronounce it by itself. (Try saying "hmm".) A pure consonant is written using the shape assigned to the consonant.

3. A consonant vowel combination. In India, one refers to the consonant as the body and the vowel as the one that gives a consonant its life. Hence the vowel symbolically represents life. This simple syllable is almost always written by adding a ligature to the shape of the consonant, a ligature which depends on the vowel. This medial vowel representation has specific forms in specific scripts. There are exceptions to this rule as well in some of the scripts (Tamil and Malayalam).

In the above, we see three scripts where the syllables with "ta" have been formed with all the vowels. Notice that in Tamil, the matra (ligature) can have components on both sides of the consonant, while in Telugu the components may be written above and below the consonant as well as on one side.

4. Two or more consonants and a vowel. Very simply, we can say this conforms to the ccv, cccv, ccccv etc. format. It will be useful to point out here that one cannot really have arbitrarily long syllables. It will become almost impossible to pronounce them. By and large, two and three consonant syllables are common, and very few have four or five consonants. One sees long syllables even in English (Angstrom!). Across all the languages of India, approximately eight hundred to a thousand syllables (with implied vowel "ah") are known to be present in spoken and written form. Since a basic syllable can include any of the vowels, the number of actual syllables will be of the order of about eight thousand, for all the vowels may not be seen with a base syllable which has two or more consonants in it.

Rules for generating the display

1. A pure vowel or a basic consonant has an individual shape associated with it. This shape has evolved over a period of time, but one does find significant variations in older manuscripts. A pure vowel or a basic consonant is always displayed by drawing the associated shape.

The forms for all the vowels and pure consonants are defined uniquely in each script.

2. A consonant vowel combination is written with a matra (ligature) attached to the basic consonant. The matra may be drawn on either side of the consonant, and in some cases it is written on both sides or above and below a consonant. This applies to Tamil, Telugu, Malayalam, Bengali and older scripts such as Grantha. Now, it is also true that in Tamil and Malayalam there is no specific matra in respect of the vowel "uh" and its long version. No matras are applicable here and these will have to be remembered as exceptions. In most scripts, there will be such exceptions for specific combinations, and these exceptions will have to be kept in mind when rendering the syllable.

3. The shape for a consonant in a syllable may be roughly specified by applying the rules observed in practice for each script. These rules vary across scripts. Some of the rules are explained below. The half form of a consonant is normally used in many cases, especially with scripts which are closer to Devanagari, e.g., Gujarati. The half form is also referred to as the joining form. Usually, the half form has enough resemblance to the full form of the consonant.

However, the half form is not defined for all the consonants, especially those which do not have a vertical stroke in them (Devanagari). Several consonants which do not have a clearly defined half form are shown in the figure above. In these cases, a form diminished in size, but drawn in a manner where the consonants can be written one below the other, is considered useful. Again, examples are seen in the figure above. The one-below-the-other form is actually the default for South Indian scripts, except Tamil. In these, there is no half form for a consonant. The first consonant in the syllable is written first, the second is written below in reduced size, and the third may also get written below this combination. Since one seldom finds arbitrarily long syllables, and most of the three or four consonant syllables end with "ra" or "ya", the actual need to write three consonants one below the other arises only rarely. The syllables with "ra" or "ya" as the last consonant have a special form.

Composing syllables with generic consonants

The shape of a syllable can always be built by using the generic form of the consonants. This will be linguistically correct though not conforming to convention. Using generic consonants to write syllables generally results in a smaller set of shapes for the writing system. Among the Indian languages, Tamil employs a simple script where a syllable is always shown in this manner.
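A hedged sketch of the script-dependent choice just described: for a two-consonant syllable one picks half-form joining (Devanagari-like scripts), below-base stacking (most southern scripts), or full decomposition with explicit generic consonants (Tamil). The strategy table and glyph labels are purely illustrative.

```python
# Illustrative only: choose how to lay out a two-consonant syllable (c1, c2)
# with vowel v, depending on the convention of the script.
STRATEGY = {
    "Devanagari": "half_form",    # join the half form of c1 to c2
    "Gujarati":   "half_form",
    "Telugu":     "below_base",   # write c2 in reduced size below c1
    "Kannada":    "below_base",
    "Malayalam":  "below_base",
    "Tamil":      "decompose",    # write c1 as a generic consonant, then c2 + matra
}

def compose(c1: str, c2: str, v: str, script: str):
    how = STRATEGY.get(script, "decompose")
    if how == "half_form":
        return [f"{c1}.half", c2, f"{v}.matra"]
    if how == "below_base":
        return [c1, f"{c2}.below", f"{v}.matra"]
    return [c1, "halanth", c2, f"{v}.matra"]

print(compose("k", "t", "i", "Devanagari"))   # ['k.half', 't', 'i.matra']
print(compose("k", "t", "i", "Tamil"))        # ['k', 'halanth', 't', 'i.matra']
```

Real renderers also have to handle the exceptions noted above (unique syllable shapes, special "ra"/"ya" forms, matras split around the consonant), which is why the rules are better kept in data than in code.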

Syllable Representation Examples

When we compare the rules across different scripts, the following seem to apply in general, though different rules may apply in different scripts for the same syllable. In other words, several displayed forms may refer to the same sound.

- Concatenate half forms except for the last consonant.
- Write the consonants one below the other but retain their basic shapes with diminished size.
- Use special ligatures for specific vowel combinations in some of the scripts.
- Use unique forms for a syllable.
- Just decompose any syllable into its consonants and the vowel.
- Use special ligatures for "ra" in Devanagari based scripts. The ligature will depend on where "ra" occurs within the syllable.
- Use special ligatures for other consonants as well. This applies to Telugu.
- The medial vowel representations may have ligatures on both sides of the consonant.

The following are illustrative of syllable formation in different scripts. The variations in the writing systems will be seen by examining these carefully. This is not an exhaustive set but is provided only as an example.

Coding schemes: Linguistic requirements


1. Accommodate all basic sounds. All the basic vowels and consonants should find a place in the code space. All the symbols that convey related information about the text (Vedic symbols, accounting symbols etc.) should also be coded. Punctuation marks, consistent with the use of the scripts in use today, and the ten numerals should also be accommodated in the code space, irrespective of whether they have been accommodated with other scripts or not.

2. Lexical ordering. A meaningful ordering of the vowels and consonants will help in text processing. Over the years, online dictionaries have become very meaningful. Arrangement of words within a dictionary should conform to some known lexical ordering. Lexical ordering of the aksharas may not really conform to any known arrangement for different languages since no standards have been recommended or proposed. The ordering currently in vogue is somewhat arbitrary and differs across languages.

3. Coding structure to reflect linguistic information. When codes are assigned to the basic vowels and consonants, it would be of immense help to relate the code value to some linguistic information. For instance, the consonants in our languages are grouped into classes based on the manner in which the sound is generated, such as the cerebrals, palatals etc. It would certainly help if, looking at a code, one could immediately recognize the class. In fact, the system of using aksharas to refer to numerals is a well known approach to specifying numbers, and this system, familiar to many as the "katapayadi" system, has been followed in India for ages.

4. Ease of data entry. The scheme proposed for data entry must provide for typing in all the symbols without having to install additional software or use multiple keyboard schemes. It is also important that data entry modules restrict data entry to only those strings that carry meaningful linguistic content. In the context of Unicode, data entry schemes may permit typing in any valid Unicode character though it may convey nothing linguistically. It would therefore help if the schemes allowed only linguistically valid text strings.

5. Transliteration across scripts. It is important that the coding structure allows codes corresponding to one script to be easily displayed using other scripts as well. In a country such as India, where a lot of common information has to be disseminated to the public, one should not be burdened with the task of generating the text independently for each script. The Unicode assignments for linguistically equivalent aksharas across languages are not sufficiently uniform to permit quick and effective transliteration. One requires independent tables for each pair of scripts. ISCII assignments were uniform across the scripts and made transliteration easier. Transliteration is quite complex with Unicode. The problem of finding equivalents requires that characters assigned in one script but not in the other be mapped based on some phonetic content. This may not always be possible with current Unicode assignments. The illustration below is typical of what one may prefer; a naive code-shifting sketch is also given at the end of this section. Three consonants in Tamil have their Unicode equivalents specified only in Devanagari but not for other scripts. This means that proper transliteration of Tamil text into, say, Bengali or Gujarati may not be feasible with the existing Unicode assignments and only nearest equivalents may be shown. Transliteration based on nearest phonetic equivalents may not be appropriate from a linguistic angle.

This brings up another important issue as well. In the Unicode assignment for Devanagari, equivalent codes for aksharas from Tamil have been specifically provided for. But the Unicode book also allows the same aksharas to be rendered using two Unicode characters, the first corresponding to the basic phonetic equivalent and the second the Nukta character, which identifies the dot in the preceding character. This creates problems in practice when two different Unicode strings result in identical text displays, for tracing back to the correct internal representation will be difficult. This shows the bias exhibited by Unicode towards a coding structure which also specifies rendering information, as opposed to rigidly specifying syllables alone.

6. String matching issues. Archives of text in Indian languages may have to be indexed and stored for purposes of retrieval against specific queries. The query string may pertain to text in a given language but the result may actually be text in another language. Here is a situation which illustrates this. A journalist might have filed a report in a language for publication in a magazine. At a later time, a similar event may have to be reported in another region, and information from the earlier report might prove useful. Here the journalist covering the latter event may actually query a database for keywords in the original language in which the earlier report was written, but submit the query in a different script containing the same linguistic information. The question of correctly forming a query string is also something that one must think about, for it is quite easy to make spelling errors while typing in the query string. How would one find a match? This is a typical scenario in India, where centralized information sources cater to dissemination of the information in different regional languages.

7. Handling spelling errors. One of the major difficulties in preparing a query string is getting the spelling right. With syllabic writing systems, it is entirely possible that conjuncts (i.e., syllables with multiple consonants) are typed in with some error. Often the string is derived on the basis of its pronunciation. With errors in spelling, string matching on the basis of syllables can be very difficult. The problem indicated here assumes significance when central databases are queried in regional scripts. A person in Tamilnadu may desire to look up information about places in the Himalayas and submits a query in Tamil for a match against the name. The characters in the Tamil string will have to be transliterated into appropriate codes for Devanagari text in which the information may be kept. The syllables in Tamil are always written in decomposed form, and this will result in differences between the Tamil and Devanagari strings, causing the string matching program to report either a spelling error or the absence of a match. In respect of Indian scripts, it will be too much to expect users to know the correct spelling. Thus string matching will be required on the basis of close sounds rather than on the internal representation. This argument will also apply to applications that might attempt to check spelling in a data entry program.
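Returning to requirement 5 above, here is the naive code-shifting sketch referred to there. It relies on the largely parallel layout that Unicode inherited from ISCII across the Indic blocks; characters with no counterpart in the target block (such as the Tamil consonants cited above) simply have no valid mapping, which is exactly the limitation being discussed. The block starts are the standard Unicode values; everything else is illustrative.

```python
import unicodedata

# Naive offset-based transliteration between Indic blocks, possible only
# because Unicode largely preserved the parallel ISCII layout.
BLOCK_START = {"Devanagari": 0x0900, "Tamil": 0x0B80, "Bengali": 0x0980,
               "Gujarati": 0x0A80, "Telugu": 0x0C00}

def transliterate(text: str, src: str, dst: str) -> str:
    out = []
    for ch in text:
        offset = ord(ch) - BLOCK_START[src]
        if 0 <= offset < 0x80:
            candidate = chr(BLOCK_START[dst] + offset)
            try:
                unicodedata.name(candidate)      # is the target code point assigned?
                out.append(candidate)
            except ValueError:
                out.append("?")                  # no equivalent in the target script
        else:
            out.append(ch)                       # leave non-Indic characters alone
    return "".join(out)

# KA, MA, LA map cleanly from Devanagari to Tamil; script-specific letters will not.
print(transliterate("\u0915\u092E\u0932", "Devanagari", "Tamil"))
```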

Linguistic issues in text processing


Dealing with Text consistent with linguistic requirements
Text processing with linguistic requirements in mind can be effected with a minimal set of characters and a few special symbols. By this we mean that a displayed text string can be interpreted with respect to the language it represents. When we are looking for the meaning of a word in a text string, the language does come into the picture, and a computer program may actually match the string with a set of words in order to arrive at a linguistically important feature in the word. Interestingly, what associates a word with a language is not the script in which the word is written but the sounds associated with the word. For example, the bilingual text we see in railway stations in India conveys the same linguistic information even though written in different scripts. Unfortunately, computers have forced us to work with scripts rather than the sounds, constraining us to handle representations of the shapes of the written letters. The reader will agree with this readily once he/she reads the following text strings and relates them all to the same linguistic content.

An important consequence of the above observation is that in the case of two of the scripts (Roman diacritics and Greek), a minimal set of about 30-40 shapes is adequate to represent virtually any text one wishes to display. In the case of the other two (Devanagari and Tamil), hundreds of shapes may have to be used, since each shape is associated with a unique sound, in contrast with the other situation where a sequence of shapes from a small set is placed one after the other. In other words, while in the western scripts a syllable is always shown in decomposed form, in Indian scripts a syllable is usually shown in its individual form, though this individual form may conform to some convention in respect of how it is generated. In the context of Indian scripts, one seldom runs into a problem of reading the text correctly, since the reader automatically associates the shapes with the sounds, whereas there is enough room for incorrect reading with the Roman script. Thus the shapes of the symbols used in Indian scripts relate more directly to linguistic content without ambiguity when one pronounces the sounds as inferred from the shapes.

This brings us to an important problem of text representation. If we want to code the text in a way that the linguistic content and the shape are mapped one to one, we will have to find a code for each syllable, and we will have to provide for thousands of these, even for a single language. The reader who is familiar with language primers in elementary schools will immediately remember the very basic set consisting of all the consonant vowel combinations. Shown below is a portion of the table of syllable representations in their most basic form, with just one vowel with a consonant, and this includes the case where the generic consonant is represented as well. Thus the total set equals the product of the number of consonants and the number of vowels, together with the set of vowels, and this may be what constitutes the bare minimum requirement for syllable representation. This set is linguistically adequate, though the writing conventions may require special ligatures when specific conjuncts are formed.
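A rough count makes the size of this basic set concrete. The numbers below are assumed round figures (about 35 consonants and 16 vowels; actual counts vary by language) and are only meant to show the order of magnitude.

```python
# Order-of-magnitude count of the basic syllabary (assumed round figures).
consonants, vowels = 35, 16
cv_combinations = consonants * vowels                # every consonant with every vowel
basic_set = cv_combinations + vowels + consonants    # plus pure vowels and generic consonants
print(cv_combinations, basic_set)                    # 560 and 611: several hundred basic shapes
```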

This large set of displayed shapes has certainly posed problems for the computer scientists who had always worked with a limited set of letters. The new requirement can be met only with schemes that allow more than eight bits per code, since the required number is far in excess of 256.

Till recently, the majority of computer applications had been written only to work with eight bit codes for text representation, except perhaps those meant for use with Chinese, Japanese and Korean, where more than 20,000 shapes are required. Surprisingly, individual codes have been assigned to each of these (a very tedious process indeed, but one that had been handled meticulously). To circumvent the data entry problem with that many symbols, a dictionary based approach is used for these specific languages, where the name of the shape is typed in using a very small set of letters (called kana) and the application substitutes the shapes (called ideographs).

Handling Indian scripts

Computer applications written for the western scripts can handle about 150-200 shapes (letters, accented letters and symbols). Designers have thought of clever approaches to dealing with Indian scripts by identifying a minimal set of primitive shapes from which the required shape for any syllable could be constructed. For Indian scripts, the basic set of consonant vowel combinations can be easily accommodated through a minimal set of basic shapes involving only the vowels, consonants and the matras. When we write text in our languages, we can in fact build the required shape of the syllable from these, but writing conventions are such that for almost all the scripts (except Tamil) many syllables have independent shapes. It is very likely that as writing systems evolved in India, the syllables which did occur more frequently got special shapes assigned to them. We observe that there are about a hundred and fifty of these special shapes which will have to be included in our set if we wish to generate displays conforming to most of the conventions.

These basic shapes can be used as the glyphs in a font so that one can generate meaningful displays conforming to the writing conventions. If we look at the number of glyphs, we will find that about 230-240 may be adequate to build almost all the syllables in use. However, fonts used in computers cannot really support this many glyphs. Each system, Win9X, Unix or the Macintosh, has its own specifications for the correct handling of fonts, and the common denominator that all these platforms can truly cater to is only about 190 glyphs, though individually the Macintosh can support many more. For most scripts, multiple copies of the matras, each one magnified or reduced in size and located appropriately to blend with the consonant or conjunct, will be required. In some cases, it may be difficult to add a matra by overlaying two glyphs because the basic shape of the consonant may not permit an attachment that is not individually tailored to it. This happens for example with the "u" matra for the consonant "ha". In these cases, new glyphs are invariably added.

The observations made above may not hold for the case of text representation through Unicode, which provides a large code space of more than 64000 codes. Yet, within this large space, each language (identified through the script associated with it) will be confined to a much smaller set of codes, but this set can truly exceed 256. Thus Unicode, used with an appropriate 16 bit font, can accommodate a fairly large number of characters for a script. The Western Latin set has more than 450 assigned codes to cater to most European requirements. We will now make some specific observations about handling our scripts and assigning codes.

1. If we agree to represent text using codes assigned to shapes used in building up the displayed symbols, we will certainly be able to store and display the text, and possibly handle data entry as well, using the same methods adopted for plain ASCII text. However, tracing the displayed text to the linguistic content requires us to map the displayed shape into the consonants and vowels that make up the syllable. This makes linguistic processing quite complicated. Also, this approach will not work uniformly across fonts since each font has its own selection of basic glyphs and ligatures.

2. We can agree to assign codes to the basic vowels and consonants of our languages, which run into about fifty-one symbols. However, these codes cannot be directly mapped to shapes in the displayed text. A string containing these codes will necessarily have to be parsed to identify syllable boundaries and the result mapped to a shape. If we do what is done in the western scripts, we will end up with a situation such as seen below. If we take the approach through ISCII and try to display text directly with the codes, we will also run into similar difficulties.

In the use of ISCII, the situation similar to Roman is acceptable so long as the convention of including the vowel shape on only one side of the consonant is retained. The group of codes will indeed contribute to identifying the linguistic content properly, but the display may require swapping of glyphs if the matra addition follows a different rule. The main advantage of ISCII is that it provides for codes that relate to the linguistic content (sounds), and thus these could be used uniformly across the Indian languages, which are based on a more or less common set of sounds. However, this simplistic view does not always hold, for ISCII also prescribed the means for interpreting specific codes to result in a specific display form. It achieved this through two special codes called INV and Nukta. Going from an ISCII string to displayed shapes requires one to identify syllable boundaries and also properly interpret the INV and Nukta characters. This approach will be script dependent as well as font dependent. Such a program will code into itself the rules of the writing system followed for a language when using the script. Clearly, writing such programs to handle multiple scripts in the same document will not be easy. Also, since the writing system rules are coded into the program, handling a new script for a language will require the program to be modified and recompiled. It is, however, possible to read the rules into the program, if the program is written in an appropriate manner involving data structures that directly specify the rules and are read in at run time from appropriate files (tables or simple structures can help).
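The table-driven possibility mentioned at the end of the previous paragraph might look roughly like this: the script- and font-specific rules live in a data file read at run time, and the program itself stays script-independent. The file name, rule format and glyph identifiers are all hypothetical.

```python
import json

# Sketch of table-driven rendering: rules are plain data, loaded at run time,
# mapping a syllable (a sequence of phonetic codes) to font glyph identifiers.
# The rule format and glyph names are hypothetical.

def load_rules(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def to_glyphs(syllable, rules):
    key = "+".join(syllable)              # e.g. "k+sh+i" for a conjunct
    if key in rules:
        return list(rules[key])           # the syllable has a unique shape
    glyphs = []
    for unit in syllable:                 # otherwise fall back to basic shapes
        glyphs.extend(rules.get(unit, ["?"]))
    return glyphs

# Inline stand-in for a rules file:
rules = {"k": ["glyph_ka"], "sh": ["glyph_sha"], "a": [],
         "i": ["glyph_i_matra"], "k+sh+i": ["glyph_ksha", "glyph_i_matra"]}
print(to_glyphs(["k", "sh", "i"], rules))   # ['glyph_ksha', 'glyph_i_matra']
```

Supporting a new script or font then means supplying a new rules file rather than modifying and recompiling the program.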

Going from the displayed shape to the internal representation

How easy or difficult will it be for us to retrace the steps and go from a displayed shape to the ISCII codes which generated the shape? This problem is faced in practice when we perform copy-paste operations. The problem is quite difficult to handle, since the display is based on codes corresponding to the glyphs in the font, while the internal representation conforms to ISCII (or Unicode). What is recommended in practice is the approach through a backing store for the displayed string, typically implemented as a buffer in memory that retains the internal codes of the displayed text. This buffer will have to be maintained in addition to any other buffer maintained by the application for manipulating the text. When a block of text is selected on the screen, a copy of the display is generated again from the internal buffer and this is compared with the codes corresponding to the display. In other words, one really does not go from the displayed codes to the internal codes but rather matches the displayed codes by generating a virtual display and comparing the two. We now appreciate the fact that if the displayed code and the internal code were the same, there would be no difficulty at all in doing this. The writing systems which are syllable based do not permit this, however. Tracing back can be quite complicated when the same syllable gets displayed in alternate forms, as in the illustration below.

One has perfect freedom in choosing any of the above forms when displaying text, and no one would complain that the text is not readable since all the forms are accepted as equivalent. The assignment of ISCII or Unicode values does not specify in which form a syllable should be rendered so long as the result is acceptable. The rendering in practice will have to take into account the availability of the required basic shapes to build up the final form. Hence the rendering process will depend on the font used for the script. Experience tells us that, at least in respect of Devanagari, the first and the fourth forms above are seen only in some commercially available fonts which are normally recommended for high quality typesetting.

Summary and specific observations

1. The characters defined in any coding scheme should meet the basic linguistic requirements as applicable to a language. It is also necessary to accommodate all the special symbols used in the writing system to add syntactic value to a string. For instance, the Vedic marks used in Sanskrit text or the accounting symbols used in Tamil provide additional information which may not be strictly linguistic in nature but is useful for interpreting the contents.

2. As far as possible, every text string must conform to the basic requirement that the displayed shape always carry specific linguistic information. That is, some amount of semantic detail must also be part of the information conveyed by the string. In the absence of this, an application will have great difficulty in interpreting a text string from a linguistic angle, though the string may contain only valid codes.

3. The same linguistic information may be conveyed by more than one displayed shape. The coding schemes must permit alternative representations to be traced back to specific linguistic content.

Multilingual Computing Requirements (for India)


The multilingual, multicultural environment in India calls for new approaches to computing with Indian languages. There is a need for applications across the country which deal with information common to all the regions. Individually, within each region, where there is homogeneity in terms of the language spoken, applications need to address local requirements. Government and public institutions in the country have traditionally resorted to bilingual documentation, with English as the base and the regional language as the means of communicating with the people. Today we need to cater to the following requirements.

- Dissemination of centralized information to different regions of the country, both in English and the regional language. Information flows back from the region to the center as well, and this is invariably done in English.
- Exchange of information across the states. This is accomplished through English, though the use of the National Language has been encouraged.
- Dissemination of information within a region. This is done primarily in the local language and to some extent in English. In the rural setup, the regional language is always used.
- Exchange of information between institutions (public and private) which have offices distributed throughout India.

In almost all cases, information originates in the regional languages even if it is subsequently transmitted in English. Often bilingual documents are exchanged between Government institutions in different regions. Data shared across the country is usually independent of a specific language since the nature of the shared information usually relates to demographic details, schedules of national events, prices of agricultural products etc.

Types of applications
Applications catering to bilingual document preparation. Though only two languages may be involved at a time, the regional language can be any one of the national languages. These are covered by nine scripts. Urdu should also be supported since it is a national language. Applications catering to multilingual documents. Such applications are called for in the study of scriptures and old manuscripts preserved in India. Such documents are also required to be generated when data has to be displayed in public places attended by people from different regions of the country. Typically, signs and posters in railway stations display information in three or four different scripts. Applications used in the teaching of languages. Introducing one language through another is useful and is also effective when the cultural background of the people speaking the different languages is common or similar. The language is easier to understand since many words would be common. Creation of centralized data bases where the stored information is common to all the regions. While English may be well suited for such applications, what is desirable is an approach to storing information in a language/script independent fashion. Postal information remains essentially the same across the country and this could be an example. All the rules and regulations in effect throughout the country will also qualify here. Applications catering to linguistic analysis. Machine translation programs, user interfaces based on natural language queries and generation of linguistic corpora are examples of these applications. Internet based applications such as email, chat and search engines also qualify as multilingual applications. It will be important to permit localized versions of these applications where knowledge of English will not be required on the part of the user to run the applications. The different applications mentioned above may be grouped into: document preparation and data entry; creating conventional data bases and accessing the data through a web interface or client applications running on standard systems; creating and managing text data bases which include indexing; and command processors similar to a shell where text based applications may be run with ease. The commands may include standard shell commands to manipulate files, invoke applications, manage files and directories etc.

While it is entirely conceivable that any application currently supported through English will qualify for localization, there are still many questions to be answered about items of information which need to be identified on a global basis. It is unlikely that in the near future totally localized applications will be available matching corresponding applications running in English. Basically one is looking for applications that can be handled comfortably in the mother tongue so that a large population in the country may use computers in a meaningful manner.

Requirements to be met
1. Ease of data entry in the regional script as well as in English. Keyboard mapping must be flexible enough to support all the aksharas, traditional symbols and punctuation. The use of the keyboard should be uniform across the scripts.

2. Ease of transliteration across the scripts. This is important to disseminate common information, set up centralized data bases etc. There is also the need to transliterate between English and the regional script to help people learn the language. 3. User interfaces should be uniform across the languages, platforms or operating systems so that training will be easy both for the person learning computers and the trainer himself/herself. Training programs in different regions may be easily handled by experts who may not speak the language but can communicate through a common language other than English. The phonetic nature of India's languages is indeed very useful for this. 4. Applications should cater to a large population of children and adults with serious disabilities, specifically visual impairments.

Conceptual Basis for Unicode


A coding scheme provides for representing text in a computer. The text presumably comes from some language. The writing system employed for a language utilizes shapes associated with the linguistic elements which are fundamental to the language. These are usually the vowels and consonants present in the language and this set is normally known as the alphabet. Assigning codes to the letters of the alphabet has been the standard practice in respect of processing information with computers. Codes are generally assigned on the basis of linguistic requirements. A code is essentially a number that is associated with a letter of the alphabet. Working with numbers is easier and a good deal of text processing can be effected by just manipulating the numbers. For example, an upper case letter in English can be changed to its lower case by applying a simple formula. The number of codes required in practice for any particular language will be decided by the totality of shapes associated with the writing system such as upper case letters, punctuation symbols, numerals etc. Typically this would be a set with less than a hundred codes for most Western languages. Traditionally, computer applications dealt with text corresponding to only one language. Subsequently the need to work with multilingual text was felt and this brought in additional requirements in respect of codes. The letters from different languages cannot normally be distinguished on the basis of their codes, for across different languages, the numerical values assigned for the codes fall in the same range. Thus one might find that the code assigned to the letter "a" in English is really the same as the code assigned for the Greek letter "alpha" or an equivalent letter in the Cyrillic alphabet. A multilingual document with text from different languages cannot really be identified as one, unless a mechanism is available to specifically mark sections of the text as belonging to a specific language/script. The traditional way of solving this was to embed descriptors in the text in a default language/script and allow these descriptors to specify multilingual content. Typically one would use different fonts to identify different languages and the application would use the specified font to display portions of the text in a particular language/script. This way, at least the display of multilingual information was possible though it was still difficult to associate a code, i.e., a character in the text, with its language, unless the application kept track of the context. Keeping track of the context requires that an application should necessarily examine the text in the document from the beginning to the current letter, for only then can the language associated with the letter be ascertained without doubt. In the eight bit coding schemes, the codes are typically in the range 32-127 though values above 128 are also used. Since different characters from different languages are assigned codes in the same range, identification of the language for a given code is rather difficult unless the context is also specified. The concept of the "character set" was introduced precisely for this purpose so that each language/script could be identified through the name given to the character set. The character set name would figure in the document (in a default language) and thus the context could be established. This is predominantly the method used in most word processor documents as well as web pages displayed through web browsers.
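
The "simple formula" for changing case, and the ambiguity of eight bit codes across character sets, can be made concrete with a short sketch; the byte value 0xE1 below is chosen purely for illustration.

    # Upper to lower case in ASCII/Latin text is simple code arithmetic:
    # the two cases differ by a fixed offset of 32 (0x20).
    def to_lower(code):
        return code + 32 if 65 <= code <= 90 else code    # 'A'..'Z' -> 'a'..'z'

    print(chr(to_lower(ord('Q'))))    # q

    # The same eight bit value means different things in different character
    # sets, so an application must carry the character set name as context.
    print(bytes([0xE1]).decode('iso-8859-1'))   # a with acute accent in Latin-1
    print(bytes([0xE1]).decode('iso-8859-7'))   # Greek small alpha in the Greek set
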
Linguistic processing with codes can proceed only when the language associated with the codes is known. Keeping track of the context of the language is cumbersome though not impossible. The idea behind Unicode is to present the language information associated with each character code in a manner that an application can readily associate the character with the particular language. Clearly, the need to identify the set of languages/scripts which would qualify for processing comes up first and Unicode first examined the different scripts used in the writing systems of the world and provided a comprehensive set of codes to cover most of the languages of importance. The rationale for this is the following. Typically, the writing systems employ shapes or symbols which are directly related to the alphabet and so by providing for the script, one would also provide for the language or languages which use the same script (though with minor variations). The majority of the languages of the world could be handled this way including Japanese, Chinese and Korean where literally twenty thousand or more shapes are required. Unicode indeed set aside a very large range of numbers to cater to these.

The basic idea in Unicode was to assign codes over a much larger range of numbers from 0 to nearly 65000. This large range would be apportioned to different languages/scripts by assigning chunks of 128 consecutive numbers to each script which may also include a group of special symbols. The size of the alphabet in many languages is much less than 50 and so this minimal range of 128 is quite adequate even to cover additional symbols, punctuation, etc. The list of languages supported in the current version of Unicode (Version 3.2) is given at the Unicode web site. An important concept in Unicode is that codes are assigned to a language on the basis of linguistic requirements. Thus, for most languages of the world which use the letters of their alphabet in the writing system, the linguistic requirement is basically satisfied if all the letters are covered along with special symbols. Display of text would proceed by identifying the letters through their assigned Unicode values both in the input string and the displayed string, which for most languages/scripts would be identical. Thus a Unicode font for a language need incorporate only the glyphs corresponding to the letters of the alphabet and the glyphs in the font would be identified with the same codes used for the letters they represent. As a concept, Unicode provides for a very effective way of dealing with multilingual information both in respect of text display and linguistic processing. Unfortunately, we encounter special problems with languages which use syllabic writing systems where the shapes of the displayed text may not bear a one to one relationship with the letters of the alphabet. In other words, for those languages of the world where the writing system employed displays syllables, the one to one relationship between the letters of the alphabet and the displayed shape does not apply. The languages of the South Asian region as well as the Semitic languages like Hebrew, Arabic, Persian etc., typically employ the syllabic writing system. Unicode assignment for these languages does meet the basic linguistic requirements. However, the issue of display or text rendering has to be addressed separately for these languages.

Unicode for Indian Languages


A Perspective
A brief introduction The essential concept underlying Unicode relates to the assignment of codes for a superset of world languages (essentially scripts used in different writing systems) such that a single coding scheme would adequately handle multilingual text in any document. In Unicode, it is generally possible to identify the language/script and the letter of the alphabet or a language specific symbol from a unique code made up of sixteen bits. It is important to keep in mind the fact that the need for handling different languages of the world had been felt long before Unicode was thought of. The earlier solution was a simple one. Collect the set of letters to be displayed and give the set a name or an identification. A computer application could then be told to interpret a character code with respect to a character set. The idea of the character set was simply that a set of values, typically 128 or in some cases going up to 255, would relate to a set of displayed shapes or symbols for a specific language associated with the character set. The character set name would be given as a parameter to the application which would then choose an appropriate font to display the text specified by the eight bit code values in a text string. The only issue that had to be taken care of with the earlier approach was that the application always had to work in the context of some language to be able to correctly interpret the code. Since the codes were common to all the character sets (being eight bit codes), it would not be easy for an application to interpret a given code unless the associated character set was also known. This would be a constraint to reckon with while handling multilingual text. For most western scripts, the number of distinct shapes to convey information through displayed text is usually small, typically of the order of about 70 and perhaps about 100 if the new symbols which have become meaningful in the context of electronic data processing get included. In some of the western scripts, accented characters are present which will have to be treated as independent linguistic entities. Otherwise, an accented letter may be viewed as a composite with a base letter and an accent mark. Viewed in the light of this, the normal ISO-Latin character set has about 94 displayable characters without accents and perhaps another 90 which include accented letters, the accents themselves and other special symbols. An eight bit code is entirely adequate to meet all linguistic requirements here. Computer applications render text by using the rendering support provided by the Operating System. Given that a code value is associated with a character set, the application will choose an appropriate font containing the letters and symbols for the script associated with the character set. Traditionally most of the fonts were eight bit fonts providing a maximum of about 190-200 Glyphs for each character set. Multilingual documents An application rendering multilingual text should know which portion of the document should be rendered in a particular script. Typically, the format of multilingual documents included the means to identify portions of the text as having certain attributes which include the font, colour and the size of text. The Rich Text Format standardized by Microsoft or the HTML specification allows a document to describe itself using descriptors made up of symbols from the set of letters used in the script. Readers familiar with Word processors will readily appreciate the fact that the
document contains a lot of formatting details all of which are described using only the characters from the set. These are generally known as tags. HTML documents contain a lot of tags which tell the browser application how to present the text in a window. Formats for documents which allow the document to describe itself are usually known as Mark up languages. RTF, HTML and XML all belong to this category of Mark up languages. While this approach appears meaningful, there are practical difficulties in using self describing tags where the tags themselves appear as text in the document; the specifications for the document usually provide for handling such situations through the concept of Entities, where an entity may uniquely describe a specific character in the text through a unique name assigned to the character. Multilingual text is usually tagged in ASCII but the tags can confuse Web Browsers if not handled properly. Unicode was introduced as the solution to the problem of handling multilingual text where any character in the text could be individually and uniquely identified as belonging to a script/language. In Unicode for Indian languages, each character is identified through a field within the code which specifies the language and a field which specifies an individual letter within that language. Though sixteen bits are used to specify each code, the number of codes assigned to any language is small and is often just about 128, with very few exceptions.

The Unicode experts may actually describe Unicode as one single scheme for dealing with all the scripts and languages of the world, where the code space of 65536 has been apportioned to the different languages, one after another. So the idea of splitting the code into two fields does not really apply in general. However, when only 128 code values have been assigned for a language, it is very easy to see that the two fields can be uniquely discerned. Among the Indian languages, Unicode assignment has been effected for all the basic scripts: Devanagari, Bengali, Oriya, Gurmukhi, Gujarati, Tamil, Telugu, Kannada and Malayalam. For these, the language descriptor part of the code occupies nine bits and the remaining seven refer to the consonants, vowels and the matras along with special symbols.

Devanagari - 128 code values from 0900
Bengali - 128 code values from 0980
Oriya - 128 code values from 0B00
Gurmukhi - 128 code values from 0A00
Gujarati - 128 code values from 0A80
Tamil - 128 code values from 0B80
Telugu - 128 code values from 0C00
Kannada - 128 code values from 0C80
Malayalam - 128 code values from 0D00

The Unicode book specifies a unique English name for each code. This is typically a combination of the language name and an individual name for each of the 128 characters in the range. For most of the Indian scripts, several code values in the set of 128 for each may be reserved. The actual code assignments may be seen from the web pages at the Unicode Consortium web site. Unicode and conformity to linguistic requirements. The Unicode Book is specific in respect of implementing schemes to render text in a manner which is consistent with the linguistic requirements of the language. Here the original intent of Unicode was to represent only the basic linguistic elements forming the alphabet and not a specific rendered form. For example, an accented character which may be used in German or French is identified as a single letter though composed of a letter and an independent accent mark. Since such accented characters belong to the set of letters used in the writing system, they are assigned individual codes. An accented character could well be described by two codes, one for the letter and one for the accent but in the wisdom of the designers of Unicode, almost all accented characters have been assigned individual codes to make text processing simpler.
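
Since every block above starts at a multiple of 128, the script and the position of a character within its block can be recovered with simple masking, as the small sketch below illustrates; the table is just a restatement of the assignments listed above.

    # Starting offsets of the 128-code blocks listed above (hexadecimal).
    BLOCKS = {
        0x0900: "Devanagari", 0x0980: "Bengali",   0x0A00: "Gurmukhi",
        0x0A80: "Gujarati",   0x0B00: "Oriya",     0x0B80: "Tamil",
        0x0C00: "Telugu",     0x0C80: "Kannada",   0x0D00: "Malayalam",
    }

    def script_and_offset(code_point):
        """Split a sixteen bit code into its block (the upper nine bits) and
        the position of the character within the 128-code block (lower seven bits)."""
        base = code_point & ~0x7F          # clear the low seven bits
        return BLOCKS.get(base, "other"), code_point & 0x7F

    print(script_and_offset(0x0915))   # ('Devanagari', 21) -- the letter KA
    print(script_and_offset(0x0B95))   # ('Tamil', 21)      -- the letter KA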

In normal Roman (standard English), one does not see such characters and so the basic set for Roman excludes them. However, these are linguistically important and so they are included as an extension to the normal Latin character set, called the Latin supplement where each accented character is assigned a unique code. (Refer to the chart at the Unicode Web site). Unicode consortium did not however specify how they would be typed in along with English. This was the responsibility of the application. Even today, very few applications can actually permit direct data entry of accented characters from the standard keyboard without resorting to a keyboard switch. The generic concept of Unicode works well for the western languages where there is only one shape associated with one and only one code value. That is, each code value can directly refer to a glyph index and when the glyphs are placed side by side, the required display is achieved. In this case, a text string is rendered simply by horizontally concatenating the shapes (Glyphs) of the letters. Thus a Unicode font for a western script need have only one glyph for each character code. The Glyph index and the code value can therefore be exactly the same. When the glyph indices are given, the original text is also known exactly due to the one to one mapping. Most languages whose writing system is based on the Latin alphabet come under this category. This simplistic view does not help when the displayed shape does not correspond to a single letter but relates to a group of consonants and a vowel which constitute a linguistic quantum. In the South East Asian region, writing systems are based on rendering syllables and not the consonants and vowels. The accented characters mentioned earlier may also be viewed in this light as being made up of two or more shapes derived from two or more codes. The problem at hand in respect of Indian languages is one of finding a way to display thousands of such combinations of basic letters where each combination is recognized as a proper syllable. This corresponds to a situation where a string of character codes map to a single shape. In the context of Indian scripts, the code for a consonant followed by a code for the vowel will usually imply a simple syllable often rendered by adding a matra (ligature) to the consonant, though there are enough exceptions to this rule. Those responsible for assigning Unicode values to Indian languages had known about the complexity of rendering syllables. But they felt that the assigned codes correctly reflected the linguistic information in the syllable and so suggested that there was no need to assign codes to each syllable. It would be (and should be) possible to identify the same from a string of consonant and vowel codes (Just as syllables are identified in English). What was specifically recommended was that an appropriate rendering engine or shaping engine should be used to actually generate the display from the multibyte representation of a syllable. Since Unicode evolved from ISCII, there was also the special provision of Unicode values to specify the context in which a consonant or vowel was being rendered as part of a syllable. In other words, Unicode also provided for explicit representations achieved by forcing the shaping engine to build up a shape for a syllable, different from what might be a default. 
The zero width modifier characters accomplish this along with the Nukta character, when dotted characters (the Persian or Urdu characters in Hindi) have to be handled. These do not directly belong to the basic set of vowels and consonants but are sort of derived shapes.

The idea of assigning codes to displayed shapes may appear to contradict the original intent of Unicode where codes would be assigned only to the linguistic elements. This is usually justified on the following grounds. You always require a font containing the basic letter shapes and ligatures to render text as per the rules of the writing system. It is not going to hurt to add a few characters in the input string which may influence the selection of specific glyphs for a given context so long as the application does not interpret the string linguistically and performs only string matching. This is perfectly acceptable in situations where serious text processing is not attempted (e.g., parsing the input string to identify prefixes or suffixes in a verb). However, in the context of Indian languages, a word has to be interpreted properly to extract linguistic information and this requires analyzing the syllable structure. It is here that the multibyte representation can cause serious headaches for a programmer, for the algorithms working with multibyte structures are usually quite complex. The presence of characters which do not carry linguistic information will only compound the problem and there is also the possibility that the algorithm would fail when ambiguities arise. In the context of text processing in Indian languages, an interactive application which supports a find and replace feature may actually fail to identify the string in question because it is difficult for the user to correctly identify the actual codes used in the text though the display may look familiar. This is in fact what happens with Unicode when different text strings get rendered identically. That all these different text strings convey the same linguistic information may not be easy to discern in an application unless all possible representations (i.e., Unicode text strings) for a syllable are examined. This will not be easy at all. Given below is an example of a word with three syllables represented in twelve different ways, all linguistically identical but very different in terms of Unicode representation. The associated file containing the Unicode characters is downloadable as aditya.txt.
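
The multiplicity of representations can be reproduced with nothing more than the standard joiner characters. The short sketch below uses the single syllable "tya" rather than the full three syllable word of the example; it is only meant to show that three legal spellings of one syllable compare as three different strings.

    # Three Unicode spellings of the Devanagari syllable "tya"; all are
    # legal and render acceptably, yet they compare as different strings.
    TA, VIRAMA, YA = "\u0924", "\u094D", "\u092F"
    ZWJ, ZWNJ = "\u200D", "\u200C"           # zero width joiner / non-joiner

    variants = [
        TA + VIRAMA + YA,          # default: conjunct form left to the engine
        TA + VIRAMA + ZWJ + YA,    # request the half form of "ta"
        TA + VIRAMA + ZWNJ + YA,   # request the explicit halanth form
    ]
    print(len(set(variants)))      # 3 -- a naive find/replace sees three words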

Data entry issues with Unicode


When text is prepared through data entry, the user should be provided with a natural interface to generate the desired text from the keystrokes. In the past, several schemes had been proposed depending on the coding chosen or the font used to display the aksharas (Ref. section on Data entry in Indian languages). The Inscript layout had been traditionally recommended for use with ISCII and data entry in Microsoft applications supporting Indian languages is based on this layout. In this scheme, the keystrokes correspond to pure vowels, consonants, matras and special characters, all of which have been assigned specific Unicode values. One will observe that since the Matras have been assigned codes, it will be possible to type them in standalone form though the matra may be seen with a dotted circle so as to identify where it will be located with respect to a consonant. In the implementation of Unicode based data entry, the basic understanding is that each keystroke will register internally as a Unicode character and it will be the responsibility of the application to form the desired syllables from the codes for the consonants, vowels and matras. The normal rule is that a syllable is formed when a series of consonants is terminated with a vowel or a matra, conforming to the form CCCCV. Here "C" refers to a generic consonant without a vowel. By convention, C usually refers to a consonant with the built in vowel "a", and so one forms syllables by typing in a halanth character in between, ChCh..ChM, where "h" refers to the halanth character and "M" a matra. Thus the Unicode assignment for a consonant is actually a syllable with one consonant and the vowel "a". The generic consonant will therefore have to be distinguished during data entry through the use of the halanth character, as well as some context defining characters such as the zero width modifiers. The general form of the syllable in Unicode will therefore be ChCh...ChM, with no specific restrictions on the number of consonants.
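
A bare bones sketch of the grouping rule just described is given below for Devanagari. It is only an illustration of the ChCh...ChM pattern; a real implementation would have to deal with the zero width modifiers, the Nukta, independent vowels, digits and several other cases.

    # Group a Devanagari Unicode string into syllables using the ChCh...ChM rule.
    def is_consonant(ch): return "\u0915" <= ch <= "\u0939"   # KA..HA
    def is_matra(ch):     return "\u093E" <= ch <= "\u094C"   # AA..AU signs
    HALANTH = "\u094D"

    def syllables(text):
        out, current = [], ""
        for ch in text:
            if is_consonant(ch) and current and not current.endswith(HALANTH):
                out.append(current)            # previous syllable is complete
                current = ch
            else:
                current += ch
                if is_matra(ch):
                    out.append(current)
                    current = ""
        if current:
            out.append(current)
        return out

    # ka ma la -> three syllables; ka halanth ra pa aa-matra -> two syllables
    print(syllables("\u0915\u092E\u0932"))
    print(syllables("\u0915\u094D\u0930\u092A\u093E"))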

Interested viewers may perform a small experiment to see the vagaries of text editing under Windows 2000/XP. We have prepared three different files containing the same linguistic information, i.e., the same text string in different scripts. The text files can be opened under Notepad, Wordpad or Word. The RTF files may be opened under Wordpad or Word. Notice the differences in the actual display when seen in the three applications and also check out how the applications behave differently while editing. Editing backwards from the end of a string under Word is quite some experience! Text processing algorithms lose their simplicity and elegance when they have to examine multiple byte strings that are arbitrarily long, to extract the linguistic information contained in the strings. When the same linguistic quantum is given two or more different representations (all perfectly acceptable as equivalents), processing becomes involved, often leading to unpredictable results. It just happens that one cannot really predict what syllable will come in a string. In the screen shot below, one sees what happens in Word when a series of keystrokes is input. The key corresponds to a matra. Word merely displays the matra with a dotted circle up to a point beyond which it gets confused. One additional input can cause the application to run into confusion! Worse still, try and type in four or five lines of the same matra in Wordpad, block the text and copy it. The application runs into an error situation and outputs a message. Often it just crashes!

Legal Unicode strings but with no Linguistic content! An application supporting Unicode based data entry in Indian languages is also expected to allow data entry of all legal Unicode values. It is therefore possible to type in perfectly legal Unicode strings but without any linguistic content as in the illustration below. While there is no harm in permitting data entry of all legal Unicode values, it will be a complex issue to identify whether the string has a valid linguistic content. Many applications suffer due to bugs in the implementation of this feature which basically boils down to identifying the quanta that can be handled by the shaping engine displaying the syllables.
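
A sketch of the simplest such check is given below: a matra or halanth that is not preceded by a consonant is legal Unicode but carries no linguistic content. This is only the first of many conditions a real validator would have to apply.

    # Legal Unicode, but linguistically empty: a matra must follow a consonant.
    def looks_linguistic(text):
        prev = ""
        for ch in text:
            if "\u093E" <= ch <= "\u094D":               # matra or halanth
                if not ("\u0915" <= prev <= "\u0939"):   # no consonant before it
                    return False
            prev = ch
        return True

    print(looks_linguistic("\u0915\u093E"))          # True  (ka + aa sign)
    print(looks_linguistic("\u093E\u093E\u093E"))    # False (bare matras)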

Special symbols and punctuation marks. It is true that traditional manuscripts written in India do not include punctuation. In line with the western tradition, punctuation is now standard with most scripts. In assigning Unicode values, it was assumed that punctuation symbols from the western scripts would not be assigned again in other scripts and so a single assignment would suffice. Typically, the keyboard would provide for all important punctuation marks to be keyed in directly. In respect of Indian scripts, the keyboard layout used for data entry utilizes most of the keys to type in one or other letter of the script and thus does not directly provide for all punctuation marks to be entered. In the Inscript layout seen in Microsoft applications, one sees this problem. It may not be possible to type in a punctuation mark unless the keyboard is switched. In Tamil for instance, at least four important symbols (the question mark, the exclamation mark, the parentheses, etc.) cannot be typed in as the keys corresponding to these have been assigned Tamil letters. With Devanagari, the parentheses can be typed in but not the question mark and the exclamation mark. Switching keyboards is not an issue that we can ignore since it requires additional effort on the part of the operator.

Rendering Unicode Text


Unicode rendering issues The following illustration and subsequent paragraphs give a summary of the issues involved in dealing with Unicode, either for rendering the text or for specific processing. The viewer should get an overview of the steps involved in handling a Unicode text string in any application. The image below provides a good illustration.

The application is responsible for mapping the keystrokes to their individual Unicode values. This should be handled properly by the input module which may allow data entry using an appropriate keyboard. This internal string will have Unicode values only in the assigned code space for each script.

The internally stored string may be processed by the application for any purpose. Typically the application should display the string if an interactive user interface is supported. To render the text properly, the application may make use of the Uniscribe engine provided by Microsoft or use other approaches. Going through Uniscribe might permit a degree of standardization but may still be inadequate for specific applications. In such a case, the application might use the Open type services library (OTLS) to query the nature of support provided by an Open type font and identify the glyphs to be displayed. Uniscribe does this too. In either case, the application receives enough information about the glyph indices in the Open type font which would actually constitute the display. The font rendering may be easily accomplished by using suitable OS services such as those provided by Rich Edit. Note that the glyph indices are meant only for identifying the glyphs in the font and the text string indicated as "B" in the above diagram does not represent the code values for the consonants and vowels. Should the application support cut/copy/paste features, the onus is on the application to maintain a backing string containing only the input codes so that a match could be effected between the glyph codes and the assigned codes by regenerating the display virtually and comparing the codes with the already displayed ones. This process would identify the portion in the input string corresponding to the blocked text in the display. The application can make use of the OTLS to query for specific features supported by the Open type font such as the availability of glyph substitution, alternate glyphs for a code value, positioning information for ligatures etc. In this case, the application should know how to relate the linguistic requirement to the script. This method gives the developer much flexibility but requires him/her to know the display rendering issues as well. This is a very stringent requirement and one which is quite difficult to meet in practice since software developers are not linguistic experts and do not in general know the nuances of actual rendering or linguistic processing of the text. Alternatively, the application could use the Uniscribe functions to accomplish the same. However, it is not guaranteed that the implementation of Uniscribe is correct or even adequate from the linguistic angle. Uniscribe itself relies on OTLS services to figure out how text should be rendered consistent with the conventions followed for a language. It is a bit odd that this approach emphasizes the script and builds in support to cater to specific languages supported by the script. It is usually the other way: the text pertains to a language but is rendered using any appropriate script.
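
The flow described above can be pictured with the toy sketch below. The function names are hypothetical stand-ins and do not correspond to the actual Uniscribe or OTLS entry points; the sketch only shows the separation between the internal string ("A"), the glyph string ("B") and the cluster map that ties the two together for cut/copy/paste.

    # Hypothetical stand-ins for the stages discussed above.
    def itemize(text):
        """Break the text into runs, each handled by one script's shaping engine."""
        return [text]                                # toy case: a single run

    def shape(run, font):
        """Return glyph indices (the string "B") and a cluster map giving, for
        every input character, the cluster of glyphs it contributed to."""
        glyphs = [font.get(ch, 0) for ch in run]     # toy: one glyph per character
        clusters = list(range(len(run)))
        return glyphs, clusters

    backing_store = "\u0915\u093E"                   # internal string ("A"): ka + aa sign
    toy_font = {"\u0915": 201, "\u093E": 305}        # invented glyph numbers
    glyphs, clusters = shape(itemize(backing_store)[0], toy_font)
    print(glyphs, clusters)   # the cluster map is what cut/copy/paste relies on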

Shaping Engine for Rendering Text


The complexity of rendering Indian scripts Uniscribe (or its equivalent) is the programming interface which allows Unicode text strings to be interpreted for display in a Microsoft Windows environment. As we know, the Unicode text strings for Indian scripts will consist of only the basic vowels, consonants, matras and a few additional symbols. The purpose of Uniscribe is to generate the information for display in terms of Glyph codes, consistent with the conventions of the writing system. A computer program which is given a Unicode text string in any of the Indian languages will have to identify how the string should be broken up for generating the display. This is the basic process of identifying the syllables which make up the text string. Suppose the string in question is

In other words, we need to force an intermediate code to tell the shaping engine to do something different. This is in fact what Unicode recommends through the use of zero width joiners and non joiners. One might argue that this is a pathological example which is unlikely to be encountered in practice. The truth is that when we teach a writing system to children we tell them that there are equivalent ways of writing the same syllable. That is, the same linguistic content may be shown differently using different scripts or even in the same script through permitted variations. You will find that it is pretty much impossible to get the Microsoft shaping engine to render the same text string differently though such a provision will be very helpful in practice to handle the variations in the writing systems practiced in different regions. Assuming that one decides to change the rendering to a different standard, we will have to modify the shaping engine to change the rendering rule. This will not only require rewriting the module but also recompilation and distribution of the new module. Such flexibility is not easily provided in Microsoft applications where one recommends an upgrade rather than a patch or file substitution. It cannot be assumed that the mapping from the Unicode text to the rendered shape is unique and will be frozen for ever, permitting a one time shaping engine to be written. We will find that when we have to reproduce thousands of manuscripts preserved in India (written as well as printed) we will necessarily have to accommodate variations. The problem can be handled somewhat if we allow the rendering rules in the shaping engine to be read in dynamically rather than remain hard coded. This provision will not be an easy one since the shaping engine will have to map a multibyte string into a final shape that may depend on a supplied parameter. If Unicode were devoid of context specifying codes such as the ZWJ and ZWNJ, this would be much easier. Unfortunately, the presence of these codes can really complicate string processing.

Philosophically, Unicode would remain a meaningful scheme for our scripts if only it confined itself to specifying the linguistic content and nothing more. As observed by other experts, Unicode's bias towards rendering is an issue one has to reckon with in implementing the shaping engine. What this implies is that certain Unicode values have no linguistic content but are used only to guide the rendering process so that the displayed shape is forced to conform to a specific pattern. Such codes are seldom required in European scripts since each Unicode character maps directly to one and only one shape. If we are required to perform linguistic processing on a Unicode text string, the presence of special characters will certainly pose problems. Let us consider an example.

We now see that the conventional fixed width codes certainly aid in string processing if each code carries only linguistic information. Unfortunately we are not able to provide for this if we take the Unicode route. The pertinent question is, can one have fixed width codes for the syllables? That is, can we have each syllable coded into a fixed number of bytes? The answer is certainly yes, though one must admit that there are at least 5000 syllables (bare minimum) which are in regular use and across the different languages, one might even see the need for more than ten thousand. The Multilingual software developed at the Systems Development Lab., IIT Madras, is indeed an example of a system that is based on fixed width syllable level codes. The software uses a sixteen bit code for each syllable where the linguistic content is very clearly identified in terms of the consonants and the vowel present in it. The conceptual basis for the shaping engine. The Uniscribe script engine is faithful to the specification of Unicode in rendering syllables. Unfortunately, the rendering rules are hard coded into the modules of the engine though these rules conform to some default conventions in the writing system. Consequently variations in the displayed syllable shapes cannot be honoured. Nor can we introduce a new script for the language without rewriting the shaping engine. Unicode character names are bound to the name of the script and it is quite unlikely that one will be able to introduce new scripts for Indian languages based on Unicode. Many Indian languages used different scripts at different times without any loss of linguistic content, e.g., Grantha for Sanskrit, Modi for Marathi. The essential steps involved in rendering Unicode text through the shaping engine go as follows. 1. Identify syllable boundaries or special characters. 2. Apply the rendering rules for each syllable by examining the consonants and identifying the specific rendered form applicable to each consonant. For example, if "ra" is present in the syllable, see if it is the first consonant or the middle one or even the last one. The form chosen for display will now be based on the nature of the consonant occurring before "ra". If that consonant has a vertical line in its shape, then "ra" would be formed with a short diagonal stroke joining the vertical line in the lower half of the consonant. If the previous consonant were one without a vertical stroke, then the form of "ra" chosen may be that resembling the caret sign placed below the consonant. 3. The shaping engine may also apply some rules that call for reordering of the consonants and associating suitable shapes with the reordered consonants. This happens when "ra" comes in as the first consonant of a syllable and the displayed shape involves the "reph" form. The Uniscribe engine has enough complexity to identify the rules for a large number of syllables of arbitrary length running into many consonants. It will now be clear to the reader that not only are the rendering rules hard coded but they assume the availability of the associated shapes in the font used for display. This can cause problems in applications which may prefer to use high quality fonts for typesetting, which fonts may not have the expected features in respect of the shaping engine but otherwise be adequate for high quality printouts. Uniscribe requires that an Open Type font be used along with it and not any True type font which may be entirely adequate for the purpose. As of this writing (Mar.
2003) the Devanagari font supported under WinXP/2000 cannot cater to many requirements called for in normal writing in spite of being rated as an effective Open Type font for the script. It is quite unlikely that one single but adequate font for Devanagari text rendering will be developed since special software tools are required for creating meaningful Open Type fonts. Designing fonts for Indian Scripts requires the designer to understand the writing system thoroughly so that all the ligatures of importance are included in the font. In the Open Type font, a syllable can be mapped into the required shape by graphically positioning the component shapes (glyphs) which are related to the consonants and the vowel in the syllable. The Uniscribe engine would differentiate the shapes to be used for consonants based on the syllable. That is, the choice of the shapes building up the final form for a syllable will be context dependent based on the actual consonants. The same consonant may get rendered using different shapes in different syllables.
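
The kind of context dependent rule described in steps 2 and 3 above can be caricatured in a few lines; the shape descriptions below are informal labels taken from the discussion, not glyph names from any actual font.

    # A caricature of the context rules for Devanagari "ra" described above.
    def ra_shape(position_in_syllable, previous_has_vertical_stem):
        if position_in_syllable == 0:
            # step 3: "ra" first in the syllable is reordered and shown as a reph
            return "reph placed above the rest of the syllable"
        if previous_has_vertical_stem:
            return "short diagonal stroke joining the vertical line of the previous consonant"
        return "caret-like sign placed below the previous consonant"

    print(ra_shape(0, False))   # ra as the first consonant
    print(ra_shape(1, True))    # ra after a consonant with a vertical stem
    print(ra_shape(1, False))   # ra after a consonant without a vertical stem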

No doubt the whole process is complex and quite involved since the font designer and the Uniscribe developer have to work together to arrive at a good solution. One finds top font designers who may not know the intricacies of the writing system. Likewise, a linguistic expert may not really concern himself/herself with the nuances of the font file. This is perhaps the reason why we have basically one Open Type font available for Devanagari.

Open Type fonts for Indian languages generally require a large number of glyphs running into several hundreds. The essential idea of the Open Type font is to map a syllable into a shape. Since there are thousands of syllables, it is not meaningful to design a font which has an individual glyph for each syllable. The general idea is that a default shape formation rule be applied to a syllable, with exceptions handled where appropriate. The default rule will probably work for about 70% of the syllables where the required matra is added to the consonant's shape. The graphic positioning of the matra may be important from the typesetting point of view since the matra cannot be put in a fixed place around the glyph. See the illustration below.

Designers of True type fonts knew this requirement and had simply included two or more glyphs for the same matra to handle variations in its placement with different consonants. Typically the matra is overlaid with the glyph of the consonant with an appropriate displacement with respect to the coordinates of the graphical shape of the consonant. In the Open Type font, since typography was also an important consideration, the font specification provides for precise positioning of a glyph with respect to another when a new glyph is required to be shaped from two or three component glyphs. Thus it will be possible for us to have just one glyph designed for the matra but use it with any consonant by positioning it at an appropriate location with respect to each consonant. In the Open Type font, the designers have made a provision for handling this through the concept of a composite glyph which is a new glyph obtained from two or more basic glyphs in the font. This specific feature is exploited by Uniscribe to quickly identify the composite glyphs which can be rendered for a specific Unicode string for a syllable. However, a large number of composite glyphs will be required in this case. One will remember that composite glyphs were permitted even in True type fonts but precisely locating one glyph with respect to the other was not handled, only simple overlays. In fact, Microsoft experts recommend that a good way to design Open Type fonts for Indian scripts is to use as many composite glyphs as possible since the Uniscribe engine could easily map the Unicode strings to the component glyphs. The Open Type font can lead you to just one glyph from multiple character codes and it is now clear why this type of font is being promoted for use with Indian languages where multiple character codes map to a shape. The Mangal font for Devanagari supplied with Win2000 has nearly all its glyphs specified as composite glyphs. Designing an Open Type font is however not a simple proposition. Special tools are required. Worse, the Open Type font will have to carry a digital signature if it has to be allowed for use in Microsoft applications. Getting the font digitally signed is some task indeed! Summarizing the discussions: 1. The Open Type font provides for multiple character codes to be mapped into a single shape. This is an important feature which distinguishes the Open Type from True type where one is invariably tied to a one code - one glyph mapping. 2. For Indian scripts, an Open Type font is inevitable if an application goes through Uniscribe (or its equivalent) in rendering Unicode text. It must be emphasized here that language dependent calls will have to be made to Uniscribe to handle the required rendering. This simply means that an application cannot be written in a language independent manner. This in our view is a fairly serious limitation of the Unicode based approach to computing with Indian languages. The common linguistic base across the languages can actually help the development of multilingual applications which can work transparently with any language. 3. Open Type fonts will invariably include a very large number of glyphs, most of which may be composite in nature. Yet, the same can be provided through a True type font which includes only the component glyphs and hence can be much smaller in size. 4. The Uniscribe shaping engine cannot permit multiple representations for the same Unicode string by specifying a parameter for each representation.
This is the responsibility of the application.
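
One way to see what substitution and positioning support a given Open Type font actually advertises is to read its GSUB and GPOS tables. The sketch below assumes the freely available fontTools Python library; the file name is illustrative.

    from fontTools.ttLib import TTFont

    # List the OpenType layout features a font advertises (GSUB for glyph
    # substitution, GPOS for glyph positioning).
    font = TTFont("mangal.ttf")
    for tag in ("GSUB", "GPOS"):
        if tag in font:
            table = font[tag].table
            scripts = [s.ScriptTag for s in table.ScriptList.ScriptRecord]
            features = sorted({f.FeatureTag for f in table.FeatureList.FeatureRecord})
            print(tag, "scripts:", scripts)
            print(tag, "features:", features)
    print("number of glyphs:", font["maxp"].numGlyphs)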

A computer program cannot easily generate the display for a syllable electronically, unless it knows that it can provide the display consistent with the user's requirements. Put simply, an application will necessarily have to know which syllable will have to be constructed with ZWJ or ZWNJ codes, if the shape desired is different from what Uniscribe defaults to. The subtle message carried by the above statement is that localization of an application will not be easy since every application handling a script must know how to code the syllables using Unicode characters, to have conformity with conventions of the writing systems that are not coded into Uniscribe. Linguistic text processing will be quite difficult under the circumstances.

Sorting order with Unicode


The debate on Unicode sorting order or collation One of the issues which has received much attention in respect of Indian languages and Unicode is the problem of sorting order (called collation by some experts). Traditionally, the assignment of codes to the characters of a language took into consideration the order in which the letters of the alphabet would be arranged for purposes of creating lists which could be viewed easily and scanned quickly by a person. Almost all the classical sorting algorithms (including indexing of data bases) arrange the letters in the increasing or decreasing order of the assigned codes. It is clearly known that Unicode has not taken into account the required lexical ordering of the aksharas in any of the Indian scripts. This is understandable, for Unicode was essentially derived from ISCII where the ordering was based on similar sounding aksharas rather than the actual ordering conventions and this applied mainly to the Southern Languages. ISCII gave a uniform set of codes for all the languages however and perhaps on account of this no one really raised the issue. Unicode made a departure by assigning language specific (actually script specific) codes to our aksharas but in essence retained the basic structure of ISCII. Specific instances of aksharas that were ordered differently are shown below.

The two "ra"s of Tamil are placed together though they are separated by four consonants in the conventional order. The two "na"s in Tamil are placed together where as they are separated by nine consonants. The very soft "na" in Tamil actually comes at the end. The consonants in our languages are also grouped together linguistically and it will be necessary to keep this in mind when attempting any sort of Linguistic Text processing. Lexical ordering of text is desirable whenever we prepare information for manual view as in a dictionary or a list of names of students in a class. A recent paper written by an expert at Microsoft titled " Issues in Indic language collation" argues that in general, assignment of character codes for several world languages has not taken into consideration the lexical ordering and that the Unicode assignment cannot be faulted. The expert's assertion is that culturally and linguistically appropriate collation is influenced by a language and not the script. The author goes on to state that it will be shown in the paper that Unicode, as an encoding, is more than sufficient to support Indic scripts and languages, since it is only one step among many to develop culturally and linguistically appropriate software for India. One must read the statement carefully, for Microsoft has accepted that coding alone is not the issue but the application as well. It has also emphasized that an application (which is based on the code) must be culturally and linguistically appropriate. No one can deny the correctness of these observations. In placing the script above the language, i.e., emphasizing the need to handle the script in the computer rather than the linguistic content, a very peculiar situation has emerged, in respect of computing with Indian languages. The real issue is whether such applications can indeed be written with Unicode as the base. That is, in the context of linguistic processing can an application supporting Unicode truly incorporate the features called for in providing a culturally and linguistically appropriate solution to the problem at hand? This question can be easily answered.

A text processing application that places the script ahead of the language will necessarily have to examine the context in which a Unicode character is seen within a text string. A perfectly valid Unicode string is not necessarily valid in terms of its linguistic content and so every application must build into itself a great deal of linguistic information to map a given Unicode string into the linguistic entity that the user will understand. Such applications are not only very difficult to write but will be heavily influenced by the script itself, making it virtually impossible to handle a truly multilingual interface. In the first place, it is a difficult proposition indeed to write any text processing application which has to work with multiple characters to arrive at a linguistic quantum, namely the syllable, which is central to all the Indian languages. If Unicode had concentrated on the linguistic content alone and had not prescribed rendering rules, the situation would be a little better. This is not the case however and linguistic processing with Unicode will require very complex algorithms to actually infer the context in which each character appears by examining the characters appearing before as well as those appearing after it. Consider the situation in respect of the Matras. The matra itself is not a proper linguistic unit but a representation of a medial vowel, i.e., a vowel occurring in a syllable in the middle or end of a word. Matras have been assigned codes so that a computer program can quickly identify a syllable boundary in a text string. If we ask ourselves the question, "How many times does a given vowel occur in some text?", the program will have to match not only the occurrence of that vowel but its matra as well. This is two comparisons. Worse still, a vowel can occur in its basic form right in the middle of a word as shown below.

This means that to check for the presence of the vowel, one will have to perform two comparisons for each character, but even that can be accepted. However, the two comparisons will still not yield the correct results since the matra can be accepted only if it is preceded by a valid consonant. Now we begin to appreciate the complexity involved. Imagine checking the occurrences of the vowel shown in the illustration below. One begins to have second thoughts about the claim that Microsoft applications can preserve linguistic content in a culturally appropriate manner!
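
The sketch below shows the two comparison problem for the Devanagari vowel "i": the independent letter and the dependent sign must both be matched, and the sign counts only when a consonant precedes it. It is a simplified illustration, not a complete matching procedure.

    # Count occurrences of the vowel "i" in Devanagari text: it may appear as
    # the independent letter (U+0907) or as the dependent sign (U+093F), and
    # the sign counts only when a consonant precedes it.
    IND_I, SIGN_I = "\u0907", "\u093F"
    def is_consonant(ch): return "\u0915" <= ch <= "\u0939"

    def count_vowel_i(text):
        count, prev = 0, ""
        for ch in text:
            if ch == IND_I:
                count += 1
            elif ch == SIGN_I and is_consonant(prev):
                count += 1
            prev = ch
        return count

    # "iti": independent i, ta, i-sign -> the vowel occurs twice
    print(count_vowel_i("\u0907\u0924\u093F"))    # 2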

Observation: A valid or legal Unicode string is not necessarily linguistically legal (nonsense words are always linguistically legal). Getting linguistic content out of any Unicode string is a very difficult task on account of the multibyte nature of the syllable when expressed as a Unicode string. Also, the presence of codes which have no linguistic content but only provide rendering information further complicates the processing. As of this writing (March 2003), linguistic collation has not been properly incorporated into any of the Microsoft applications which are known to provide Unicode support for Indian Languages. In the screen shot below, one can see the results of sorting a column of words in a table. Both Devanagari and Tamil examples are illustrated. It is clearly seen that only the Unicode ordering is preserved and not the conventional linguistically accepted ordering. The document was typed into Wordpad under Windows 2000, pasted into Word and the words placed inside a table using the convert text to table feature of Word.

For those who would like to try this out for themselves, we have provided a downloadable version of the file containing the words in Devanagari and Tamil which will open with Wordpad, Notepad or Word under Win2000/XP. sorttest.doc (open with Wordpad or Word under Win2000). It is equally amusing to observe the differences in the displayed text in each of the three applications. The team at SDL was originally under the impression that Microsoft had problems in rendering zero width glyphs in True type fonts but Microsoft's own Open Type font is no exception. The culprit is not the font but the application. One can verify this by
opening a sample html file (sorttest.html) we have provided, with Netscape 4.7 or later; the file contains the same text in UTF-8. Shown below is a screen shot of Netscape rendering the text referenced above.

In the illustration below, the screen shot corresponds to the text copied from Netscape and pasted into Microsoft Word. Notice the problems arising out of incorrect interpretation of the Unicode string. Not only do we see problems with the placement of the words but the last Unicode character in each line seems to be rendered independently.

The screen shot below shows the same text copied from Internet Explorer and pasted into Wordpad. Notice how the last Unicode character has been missed during the rendering process!

If the onus is on the application to render a Unicode text string to conform to a linguistically appropriate form, one can immediately see the futility of attempting to write applications that deal with multilingual text, even assuming that we take support from Microsoft provided modules such as Uniscribe. The current implementations of Unicode support seem to concentrate mainly on data entry and not really any text processing. The wisdom of our linguistic experts. Linguistics has been an important subject of exposition and discussion in respect of Indian languages (Sanskrit and Tamil in particular) from early times. The great scholars and grammarians had clearly stated that the sound is more important than the shape, and hence one must master the art of discerning sounds correctly from any utterance. The script was secondary and we all know that the same sound can be represented in different scripts. Thus the ability to discern the sounds from written shapes was not considered important and in fact discouraged since distortions could occur on account of the variations in representation. In the stone inscriptions of Ashoka one finds occasional instances of conjuncts where the order of the consonants in writing one below the other is reversed. A reader familiar only with the script will no doubt read it incorrectly. Scholars known to the author of this paper have however opined that this is a classic example of a distortion when the person who does the carving fails to hear the sounds carefully. The context however tells us what the akshara should really be. Correct linguistic handling of text in Indian languages requires that a written shape is uniquely traced to a proper linguistic quantum which is usually a syllable but can well be a special symbol. Unicode will not be able to do this

efficiently. That Unicode as an encoding is more than sufficient for supporting Indian scripts is not something one can accept. We must remember that the language comes first and then only a script for it. If you concentrate on the script and provide for dealing with it in a computer, you will be severely limited by what the computer program can actually display. On the other hand, Unicode is sufficient for carrying information that can be displayed leaving the viewer to extract the linguistic content from the display. Thus going from Unicode to display makes sense since the viewer will interpret the text linguistically but going back from the display so as to preserve the linguistic content requires extremely complex processing and it is not clear whether multilingual applications can really benefit from the use of Unicode.

Open type Fonts: A discussion


The conceptual basis for the Open type font

Fonts are used when we display text in a computer application. The glyphs in a font correspond to the shapes of the letters and special symbols used in the writing system for a language. We generally associate a character code with a glyph so that a text string specified by a series of character codes is displayed simply by horizontally placing the associated glyphs one after the other.

The text string should contain the necessary linguistic content for it to be processed consistent with the requirements of the application. The glyph string on the other hand has meaning only for the display, and the linguistic content is expected to be inferred by the person viewing the display. In most western writing systems, a letter of the alphabet is individually mapped to a shape and so a one to one mapping exists between the characters in the text string and the glyphs in the display. Hence given a text string, the glyph string is obtained by a simple table lookup where the table is kept as part of the font. Each character in the text string is identified with a name and the table merely maps a name to a glyph value. We have seen elsewhere that this is the principle of font encoding. Thus a True type font or a bit-mapped font will have a table inside, mapping the character names to an integer (usually an eight bit value). Displaying text involves the process of graphically positioning the glyphs one after another in a horizontal sequence. The situation is different with syllabic writing systems, where the display is built up by applying the writing rules to each syllable and where special shapes may be associated with specific syllables. The text to be displayed could indeed be specified in terms of the consonants and vowel in a syllable, which are the basic linguistic units in the language. But the desired shape for the syllable cannot be effected by simply placing the shapes for the consonants and the vowel in sequence.
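The contrast can be made concrete with a small sketch. The dictionary and glyph numbers below are invented purely for illustration; a real font keeps this table in its character map.

    # One-code-one-glyph model of western scripts: display is a plain lookup.
    cmap = {"A": 36, "B": 37, "C": 38}          # made-up glyph indices

    def glyphs_for(text):
        return [cmap[ch] for ch in text]        # one glyph per character

    print(glyphs_for("CAB"))                    # [38, 36, 37]

    # A syllabic script breaks this model: the syllable क्ष is the three
    # characters क + ् + ष, yet it is drawn as one special shape, so several
    # codes must map to one glyph (or to a quite different set of glyphs).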

A font for the language/script which follows a syllabic writing system will have shapes to build up the required display for any syllable but the number of shapes (Glyphs) will be much more than the set of vowels and consonants since the writing system has additional shapes for vowels which occur as part of syllables in the middle of a word. Also, unique shapes for certain syllables will be required. Seen below are some typical syllables and the shapes used to build them.

The illustration above clearly shows that the one code - one glyph relationship does not hold when the character codes differ from the glyph indices. In fact, many applications supporting the display of Indian language text merely used the glyph codes as the representation (i.e., use the glyph codes themselves as character codes) since they could use conventional font rendering methods to generate the display. The complexity of the application increases considerably when multiple codes have to be mapped into multiple glyphs. We note that linguistic processing is not impossible when font glyphs are used for representing the text, but the processing would be dependent on the font. In the past there has been no attempt at standardizing a font for any Indian script. The usefulness of restricting the input string to contain only the codes for the consonants and vowels is seen when one thinks of linguistic processing. This is in fact the basis on which ISCII and Unicode work. However, with this simple assignment, the onus is on the application to render the text using any appropriate method. Typically a font may be used, or one could convert the text into a TeX document and get an output which is typographically superior, or just use an XY plotter to draw curves and thus generate the shapes. If a font is used, the application is expected to have specific knowledge of the glyphs in the font so that appropriate glyphs could be selected to form the display. The application is expected to know the rules of the writing system so as to arrive at the choice of the right glyphs for each syllable. This is a difficult task since it is not easy for an application to actually know what glyphs a given font offers. Even assuming that this can be done, the application would be tied to the availability of the specific font. On the other hand, if we pay attention to the rules of the writing system alone but have the provision to find out if a font has the support in terms of specific glyphs, perhaps some degree of standardization is possible. A conventional font cannot offer such a facility since only rendering information is stored inside the font file. Hence the concept of a new font format which can tell us what sort of glyphs it provides, in the context of a given writing system using a specific script. Unicode support for applications was conceived with the possibility of providing standardized support for rendering a string of consonants and a vowel. In other words, the issue under consideration is whether one could incorporate the rules of the writing system into a program and provide an interface to the application to invoke a specific rule to render a given syllable. Though the rules are well known, there is enough complexity in the process since alternate representations are permitted in practice. Yet, in principle it is possible to think of a model for generating the shape for the syllable somewhat along the lines indicated below. A pure vowel or a consonant with an implied "ah" in it would be rendered in its basic form.

A consonant vowel combination would be rendered by adding a matra (ligature) except for cases where unique shapes are specified (Tamil and Malayalam have special shapes for "uh" and "ouh"); a list of exceptions could be maintained. A syllable with two or more consonants will be rendered typically using half forms in most Devanagari based scripts and one below the other in the Southern scripts. Special forms would apply for all cases where the shape of the syllable is well known. This set is typically of the order of a hundred and fifty. A full list of specific cases will however be maintained. Special ligatures for "ra" (and "ya" in some of the Southern scripts) would be used depending on the position occupied by the consonant in the syllable, i.e., whether it occurs in the beginning, middle or the end of the string of consonants. By more or less listing all the rules observed in standard practice, one could conceivably code them into a module. Such a module would nevertheless be required to work with a font which has all the necessary glyphs. Also, the process of identifying the glyph indices will involve an exhaustive application of each writing rule to a syllable to see if glyphs conforming to the requirements can be chosen for the display. When alternate forms of display are permitted, this becomes an essential requirement. Also, it helps the application default to a very simple form for display if complex ligatures are not present. The writing system always allows any syllable to be represented using the generic shape for all but the last consonant in the syllable.

The Open type font is a concept where it would be possible to find out whether a glyph satisfying a requirement in a syllable is indeed available in the font. As opposed to a conventional font which only stores rendering information for each glyph, the open type font can also provide information which relates a group of glyphs to a specific requirement. The multiple code to multiple glyphs mapping is essentially what is being attempted with this new font format. In every rendered syllable, there is some feature in the display that either identifies the presence of a specific consonant or a unique form for the syllable. This "some feature" may be ascertained with some effort.
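The decision procedure sketched in the preceding paragraphs can be summarized in code. The sketch below is only a schematic rendering of those rules; the exception tables are left empty and the consonant names are illustrative, not tied to any particular encoding.

    # A rough sketch of the rule selection described above, for one syllable
    # given as (list of consonants, vowel). Real tables would list the
    # exceptions and the well-known special conjuncts.
    SPECIAL_FORMS = {}        # e.g. {("k", "ss"): "ksha ligature"}
    MATRA_EXCEPTIONS = set()  # e.g. Tamil/Malayalam "uh"/"ouh" combinations

    def render_rule(consonants, vowel):
        if not consonants:
            return "independent vowel form"              # pure vowel
        if len(consonants) == 1:
            if vowel in (None, "a"):
                return "basic consonant form"            # implied "ah"
            if (consonants[0], vowel) in MATRA_EXCEPTIONS:
                return "unique consonant-vowel shape"
            return "base glyph + matra"
        if tuple(consonants) in SPECIAL_FORMS:
            return "special conjunct glyph"              # ~150 well-known forms
        if "r" in consonants:
            return "conjunct with position-dependent ra ligature"
        return "half forms (north) or stacked forms (south) + matra"

    print(render_rule(["k"], "a"))
    print(render_rule(["k", "ss"], "a"))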

If each of the glyphs in the font is related to one or more codes, then in principle one could incorporate a table into the font which specifies the code-to-glyph mappings. Unlike the earlier fonts (True type or bit mapped) where a glyph is related to only one character, this new font called the Open Type font will incorporate features where a glyph would be specified in terms of other glyphs through a process of substitution or relative positioning. For instance, the shape of a conjunct may be recorded as a substitution for the sequence of glyphs of its component consonants.

By providing a library of services to an application, where the services support querying for specific features incorporated into the font by way of alternate forms or representations for a character, substitutions for a given glyph string, and positioning information for specific ligatures, one could in principle implement the rules of the writing system. The application will obtain the required glyphs from an Open type font which supports the required features in an exhaustive manner for all practically encountered syllables. It must be emphasized here that an open type font is not required at all for rendering text where a character maps into exactly one glyph. For writing systems which render syllables, the shaping engine which implements the rules could certainly benefit from the availability of an Open type font since it can select appropriate glyphs by querying the services provided by OTLS (Open type library services). With conventional fonts, this querying is ruled out.

The specifications for an Open type font are quite complex since several tables have to be incorporated into the font file. These tables invariably reflect the idiosyncrasies of the writing system. The documentation provided by Microsoft and Adobe should be adequate for a designer to develop an Open type font. Yet, this is a complex process for most of the scripts merely on account of the large number of glyph substitutions and glyph positioning entries required in practice. In the Mangal Open type font, the one below the other form is pretty much absent for many important syllables. So designing an Open type font is not easy, unlike a regular True type font which may have more or less the same base glyphs. There are tools such as VOLT (Microsoft's Visual OpenType Layout Tool) which give some hints on converting existing True type fonts to Open type.

A client application supporting Unicode for Indian languages will typically use the Open type Library services provided by Microsoft. This is not without its accompanying complexity, though it appears that there is greater flexibility in rendering text since alternate forms could be used. The application must necessarily code into itself the rules of the writing system and use the OTLS to select glyphs matching the requirements. Much of this complexity can be reduced by introducing a shaping engine which does the job of implementing the rules and thus isolates the application from the actual rendering. This approach permits a degree of standardization in rendering text, but the shaping engine's default behaviour may not offer the required flexibility which conventional practice demands. Microsoft has provided this shaping engine; it is known as Uniscribe.

The real problem of dealing with Open type fonts

When fonts are designed, the basic requirement will be to incorporate enough glyphs to cover all the shapes for the different syllables. One cannot think of a very large number of glyphs since the font will become unwieldy. Moreover, mapping of the codes to the glyphs (substitutions) will require very large tables. On the other hand, a smaller number of glyphs might not adequately display all the syllables as per convention. Glyph design is hence a compromise between what would be a minimal set of shapes considered adequate and a set of shapes that will meet all the basic rules of the writing system. Thus the font designer is expected to know how all the required syllables should be rendered given the constraint on the number of glyphs.
In an Open type font it is not merely enough to provide the required glyphs but more importantly identify how the composite glyphs are formed (how a set of glyphs map into another).
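As an illustration of the kind of query the OTLS supports, the sketch below inspects the substitution features advertised by an Open type font using the fontTools library. The library and the font file name are assumptions made for the example; they are not part of the services described above.

    # List the GSUB feature tags a font advertises; a shaping engine uses
    # such tags (e.g. 'half', 'akhn', 'blwf') to decide which writing rules
    # the font can actually support.
    from fontTools.ttLib import TTFont

    font = TTFont("Mangal.ttf")                 # example file name
    if "GSUB" in font:
        gsub = font["GSUB"].table
        tags = sorted({fr.FeatureTag for fr in gsub.FeatureList.FeatureRecord})
        print(tags)
    else:
        print("No GSUB table: nothing to query, only plain rendering data.")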

To summarize: the shaping engine incorporates the rules of the writing system for each script and helps select appropriate glyphs from the Open type font. Experience tells us that the rules of the writing system are not rigid and conventions can vary. If the application must cater to different conventions, a default behaviour may not be appropriate and a parameter based selection of the display shape will be required. This parameter may have to be specified in the context of the syllable under consideration. This is what really complicates the design of the application. While a desired shape for a syllable may be easily forced by using a zero width modifier, the complexity of linguistic processing automatically increases. The Open type font may not be the right way to go if applications are required to effect efficient text processing and also support an interactive user interface. It is conceivable that the tables we mentioned earlier, which are included in the Open Type font, may actually be brought out of the Open type font and given a standard representation. This way, the shaping engine can work with the additional flexibility of dynamically choosing the glyphs from different fonts and thus meet different requirements. This idea is certainly implementable since table lookup is a fairly simple process.

About Uniscribe
Uniscribe: Rendering Unicode text in Windows applications

The main function of Uniscribe is to take an arbitrarily long Unicode string and map it into a sequence of syllables for display. It is assumed that the input string correctly represents the Unicode characters entered from an application through the keyboard or has been generated electronically. The Unicode characters come from a set of assigned Unicode values for the script in use. Those having access to Windows XP/2000 can actually generate the keystrokes and see how Wordpad or Microsoft Word (or even Notepad) handle the input. In the illustration, zwj and zwnj refer to specific Unicode values which convey rendering information. You have to type them in not as zwj in English but through the decimal equivalents of their Unicode values. The zero width joiner (zwj) is typed in by holding the ALT key down while entering the decimal value 08205. For the zero width non joiner, the value is 08204. This seems to work in Word and Wordpad.
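For readers who prefer to see the code points directly, the sketch below builds the three variants of a conjunct that the joiners make possible. The syllable chosen (क + ् + ष) is only an example; the joiner values are the ones quoted above (U+200D is decimal 8205, U+200C is decimal 8204).

    # The joiners carry rendering hints, not linguistic content.
    ZWJ, ZWNJ = "\u200d", "\u200c"

    full_conjunct    = "\u0915\u094d\u0937"                # क ् ष -> ksha ligature
    half_form        = "\u0915\u094d" + ZWJ + "\u0937"     # requests the half form
    explicit_halanth = "\u0915\u094d" + ZWNJ + "\u0937"    # keeps the halanth visible

    for s in (full_conjunct, half_form, explicit_halanth):
        print(len(s), [hex(ord(c)) for c in s])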

Built into the Uniscribe shaping engine are the rules for going from the Unicode string to the shape, consistent with the rendering recommendations from the Unicode consortium. Thus Uniscribe is nothing but a set of hard coded rules to render syllables. These rules are rigid (as implemented by Microsoft) and hence a user does not have the flexibility to get alternative representations except to code them differently using possibly the zero width joiners and non joiners. In the examples shown above, the same syllable is shown in different displayed forms but generated from different Unicode strings. The implementation of Uniscribe is such that part of the shaping information is derived from the font used for the script and this font must be an open type font. Open type fonts for Indian languages require the designer to be thoroughly familiar with the writing system and this can be a rather exacting requirement. On account of the basic structure where the open type font allows a single glyph to be selected from a sequence of character codes, the font tends to become unwieldy. The currently available Mangal Open Type font for WinXP/2000 has nearly 650 glyphs, many of which are derived from a much smaller set of basic glyphs. It would not be incorrect to state that the motivation for Open Type fonts came more from languages with a syllabic writing system with many ligatures and combined shapes than other typesetting considerations. In fact text in Indian languages can be comfortably typeset with existing Truetype fonts for the different scripts. The issue of concern is Data Entry. The names of Unicode characters (along with code values) are rigidly specified and there is absolutely no way new characters can be introduced without going through the consortium. When you do succeed in that, every application that is based on Unicode will have to be rewritten to accommodate the change.

Unicode, though a meaningful concept to represent text from different languages of the world (more appropriately scripts), emphasizes the script first and then only the language. This is quite the opposite of our approach to languages. It is the language (defined by the sounds) that comes first and then only the script. We all know that any of the Indian languages can use any writing system so long as the sounds can be preserved. There will be no confusion in the process, as we all know well that Sanskrit can be written in Devanagari, Sharada (from Kashmir) or Grantha from the south. All these retain the phonetic information in the script through properly formed rules for mapping the syllable into a shape. Marathi used to be written in a script known as Modi though one uses Devanagari these days. Unicode has a bias towards the rules of the writing system which cannot be denied. There are valid code values that will not refer to a linguistic element but to a shape. The zero width joiner and non joiner are examples of this provision. Hence deriving the linguistic content from a string of Unicode values is not as easy as simple string matching when such characters are present. Even a simple application such as a text editor requires linguistic processing when a find or search and replace operation is to be supported. For those willing to experiment with the idiosyncrasies of Microsoft's implementation of Unicode support for Indian languages, the following is worth an attempt. In the screen shot below, try and figure out the expression to be typed in to get a match for the strings shown. A copy of the file is available for download. Open the file with Wordpad and see if you can type in expressions to match all the strings. Even though some strings look identical, their Unicode representations are not. When the file is opened under Wordpad, the window which pops up when you select the find option does not seem to permit the entry of the zero width joiner or non joiner characters.
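The kind of preprocessing a find operation would need is easy to sketch: strip the joiner characters, which carry only rendering hints, before comparing strings. This is a minimal illustration, not a full linguistic normalizer.

    JOINERS = {"\u200c", "\u200d"}     # ZWNJ, ZWJ

    def linguistic_key(s):
        # drop characters that affect rendering but not linguistic content
        return "".join(c for c in s if c not in JOINERS)

    a = "\u0915\u094d\u0937"               # क्ष
    b = "\u0915\u094d\u200d\u0937"         # the same syllable written with a ZWJ
    print(a == b)                                      # False: raw strings differ
    print(linguistic_key(a) == linguistic_key(b))      # True: same content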

In respect of data entry today, most Indian languages require the use of punctuation marks and the few but important mathematical signs such as the plus, minus etc. Since these are not explicitly included in the Unicode assignments for Indian languages, data entry would require frequent switching of the keyboard. Many keyboards for Indian language data entry (including the Microsoft keyboard which is based on the Inscript layout) pack so many shapes into the keys that even standard symbols cannot be accommodated. (See if you can type in the parentheses in the Microsoft Tamil keyboard!) Though Uniscribe is meant to provide the required representation of a syllable for display and printing, the onus is on the application to correctly handle the spacing of the text. What this means is that an application is intricately tied to Uniscribe and the associated Open type font, and the developer must know the actual capabilities of Uniscribe's shaping behaviour. This is rather unfair, for developers should concentrate on the processing of the information and not be burdened with formatting details. Elsewhere in this analysis, we have provided examples of three different Microsoft applications that compute the widths of the same text string totally differently. It turns out that when you copy and paste a Unicode text string into Word, cursor movement no longer applies at the syllable level as required but more at the individual Unicode character level. Cursor positioning to edit the copied text cannot be ascertained by moving the cursor to the required syllable. Amusing results will be seen if you try and do this. Much of this can be inferred from the illustration above.

The case against arbitrarily long syllables

The basic assignment of Unicode allows arbitrarily long syllables to be constructed even though they will make no sense. Uniscribe attempts to process long text strings to identify syllables and this can lead to absurdities. From what is known in India, there are only about a thousand meaningful syllables, most of which have only two consonants and rarely three or four consonants. There is virtually no need to allow new shapes for a new syllable even if it be built with three or four consonants, because the writing system permits the syllable to be written in split form. While one may feel pleased that there is no limit to the syllables that can be formed by Uniscribe, one can readily see that a perfectly valid Unicode string can cause enough confusion to the shaping engine. We have already seen an example of this. Uniscribe could well stop with three or four consonant syllables to make the text preparation process simpler. Editing at the syllable level is not without its problems in Microsoft applications.

Keeping track of two representations

The need to correctly identify syllables along with the need to maintain correct spacing of text on the screen requires very complex processing. The problem arises as a consequence of the display being managed in terms of codes referring to glyphs while the text itself is handled using assigned character codes (Unicode) for the script. The irony is that the Open type font is also a Unicode font with valid glyph codes, but without a one to one relationship between the stored characters and the glyphs. Errors are bound to occur in any computation that has to struggle hard to keep track of two different representations at the same time. Copy/paste features in an application heavily rely on the ability of the application to trace back to the internally stored text from the displayed text. For most western scripts this is straightforward, but for any writing system that follows a syllabic representation, this requirement is not easy to fulfill.
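Identifying syllable boundaries is itself a non-trivial piece of the processing described above. The sketch below is a deliberately rough Devanagari segmenter: a consonant followed by a virama stays bound to the next consonant, while matras, the virama and the joiners attach to the preceding consonant. It ignores many real cases (anusvara, nukta, vowel sign combinations) and is only meant to show the flavour of the task.

    VIRAMA = "\u094d"
    JOINERS = "\u200c\u200d"                        # ZWNJ, ZWJ

    def is_consonant(c): return "\u0915" <= c <= "\u0939"
    def is_vowel(c):     return "\u0905" <= c <= "\u0914"

    def open_conjunct(cur):
        # a syllable stays open if it currently ends in a virama (ignoring joiners)
        return cur.rstrip(JOINERS).endswith(VIRAMA)

    def syllables(text):
        out, cur = [], ""
        for c in text:
            if (is_consonant(c) or is_vowel(c)) and cur and not open_conjunct(cur):
                out.append(cur)
                cur = ""
            cur += c                                # matras, virama, joiners attach
        if cur:
            out.append(cur)
        return out

    print(syllables("\u0905\u0928\u094d\u0928"))    # अन्न -> ['अ', 'न्न']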

Limitations of Uniscribe
We have already seen the conceptual basis for Uniscribe. An application will examine the Unicode string to be processed and perform whatever linguistic processing is required. The result to be displayed will then be given to Uniscribe. Uniscribe will apply the rules of the writing system consistent with the language and return to the application the associated glyph string to be used with the font specified for the script. Uniscribe implements the rules of the writing system for a language (associated with the script) and decides if the display will be consistent with the rules by querying the Open Type Library Services (OTLS). This should return information to the querying program about the features supported in the font. Uniscribe will see if the feature supported will satisfy the writing system rule to be implemented and will select the glyphs to be shown if the rule is satisfied. Otherwise, Uniscribe may default to a form of display for the syllable. Uniscribe cannot work by itself and render text since it must know whether the specific rule can be implemented with the glyphs provided in the Open type font. Clearly some default behaviour is expected from Uniscribe when a rule cannot be implemented in the required manner. There can be a choice of displays even in the default behaviour since alternate forms for a syllable are always permitted. The real issue is one of deciding which form is better suited for the application. The limitations of Uniscribe may be examined from three different perspectives:
problems specific to the Unicode assignments themselves;
the extent to which the rules of the writing system have been correctly implemented;
the default behaviour of Uniscribe, which is perhaps influenced by the features supported in an Open type font.

Conceptual problems with Unicode assignments

Unicode puts the script ahead of the language and assumes that the writing system is influenced by the language. This means that we cannot associate a new language with the script without modifying Uniscribe. Perhaps there will be no need for this in practice, for it may be argued that a script is a means to giving a visual representation for a sound and a language is specified by the sounds the speaker utters. This is a wrong view to take, because what is important is that a person should be in a position to identify the sound associated with a given shape to get the linguistic content. So long as the person knows that the same sound can be represented in different forms in different scripts, he/she can comfortably read the text. So text in a given language could be written in any script that correctly relates shapes to the sound. It is common practice in India to write text in a language in many different scripts.
Sanskrit - Devanagari, Brahmi, Grantha, Sharada, Phonetic alphabet, Telugu
Tamil - Modern Tamil script, Vattezhuthu (700 AD - 1300 AD), Tamil Brahmi
Marathi - Devanagari, Modi
Sindhi - Arabic script, Devanagari
It will be almost impossible for us to use any of the above scripts with Uniscribe except Devanagari and Tamil, for the writing system rules are very different for each script and the requirement cannot be simply handled by creating a suitable Open type font.

Extent to which the rules of the writing system can be implemented

This is largely a matter of exhaustively listing the writing conventions including all the alternate forms for all the syllables seen in common use. The person who does this must have both linguistic knowledge as well as knowledge of the script to vouch for the correctness of the rule. Such persons are rare and some of them are known to have an aversion for computers! You require experts who have learnt about the development of typography over nearly a hundred and fifty years just to identify how manuscripts were typeset earlier, consistent with the writing seen on palm leaf manuscripts or other writing media common in the country. Today, many states in the country have greatly simplified the rules by standardizing on a small set of shapes which can fit into a manual typewriter. So it may be virtually impossible for a person to present text consistent with some older manuscript, if Uniscribe implements only the modern rules. As of now, a rule can be implemented only if the associated Open type font provides for glyphs consistent with the rule. An Open type font for Devanagari will become truly unwieldy should it become necessary to support glyphs conforming to the conventions which have been followed for years. Thus Uniscribe will also be limited by the capabilities of the font.

Default behaviour of Uniscribe

The default behaviour of Uniscribe is dictated by the extent to which the required features are supported in the Open type font. Also, the rules for syllable formation cannot be ignored. Given a Unicode string, identifying the syllables which have to be rendered is not an easy task if there are Unicode characters such as the zero width joiner and non joiner. The state machine which examines the string can indeed get confused if such characters are present in the input. In fact this happens with Microsoft Word. Uniscribe assumes that arbitrarily long syllables may also be input and defaults to amusing shapes for certain syllables. Try a syllable with four "ra"s. It will be quite difficult for any application to decide on an appropriate form for default rendering of the syllable, unless it knows what alternatives are available. This requires exhaustive querying of the Open Type Library Services and can make the application unnecessarily complex. String processing is best attempted when the quantum of information that is handled at a time is a data item of a known fixed size such as a byte, two bytes or even four bytes. Regular expression matching will work best only when this is satisfied. In the absence of a fixed size quantum, any string processing will become complex and unwieldy.

Review of Microsoft applications


(with Unicode support for Indian languages)

Unicode support for Indian languages/scripts is in principle available under Windows 2000/XP. Currently Notepad, Wordpad and Word 2000 seem to have provided application level support and allow data entry and word processing in Devanagari and Tamil. Towards this Microsoft includes two open type fonts, Mangal and Latha, for Devanagari and Tamil respectively. Data entry is based on the INSCRIPT keyboard layout standardized for ISCII. This keyboard mapping is uniform across the languages in respect of keystrokes for the basic vowels and consonants. With the INSCRIPT method it may not be possible to type in the full complement of aksharas consistent with the conventions followed in the writing systems. This layout also does not have keys for some of the punctuation marks. There are no specific keys for typing in the zero width modifier characters. This will have to be accomplished only by typing in the decimal equivalent of the Unicode value while keeping the ALT key pressed. Among the applications in the Office 2000 suite, Word 2000 seems to implement text rendering using Uniscribe. Excel does not seem to go by the shaping engine. The extent to which data entry is supported consistent with the requirements of Unicode seems to vary across the applications. Find and replace boxes do not seem to support the entry of Unicode characters based on their decimal equivalents. Text rendering across applications is not consistent and is quite arbitrary. Word 2000 runs into problems in estimating the length of words and this causes unacceptable gaps between words. Editing is effected differently when you backspace or delete: delete removes a whole syllable to the right while backspace deletes the last part of the syllable before the cursor. Cutting and pasting across applications results in many inconsistencies. There is very little support by way of linguistic processing. String matching in Word 2000 seems to match syllables but fails in the presence of some zero width modifiers. Text rendered in Devanagari departs from convention for many syllables which are written one below the other. This is not a serious problem for Hindi, but the alternate shapes as indicated are as per normal convention. We have used the IITM software to generate these forms and pasted them into the document.

Microsoft's implementation of Uniscribe conforms to the recommendations in the Unicode book. However, a valid Unicode string in any Indian language need not contain linguistically meaningful information. Quite likely, algorithms which look for linguistic content in a Unicode string will get confused! The availability of Uniscribe to shape Unicode text does not guarantee anything in respect of linguistic processing of text. This is the responsibility of the application and each application must code into itself enough linguistic knowledge to effect any meaningful text processing. The multibyte representation for a syllable, coupled with the need to filter out characters which relate to rendering information can cause the applications to become really messy. In the illustration below, the same linguistic content is displayed in twelve different ways, all legal in terms of Unicode representation. For an application to actually figure out that the strings convey the same linguistic information, very complex text processing will be required.

Download a copy of the file aditya.txt (Unicode Text file). Keep the shift key pressed while clicking on the link to prevent your browser from displaying the contents! The file may then be opened under Wordpad.

Open type fonts and True type fonts


Do we really require an Open type font to work with Unicode text in Indian Languages? Handling text in a computer can involve mere data entry and display (or printing) or more complex processing such as matching strings, generating a web page through a script, etc. A suitable font is almost always required for generating the display in an interactive application. It is well known that eight bit fonts (fonts with 255 glyphs or fewer) are adequate to render text in all the Indian languages, even though a few representations involving complex ligatures might not be available. The real issue however is not the number of glyphs but how one would identify the glyphs to be rendered, given a representation of the text. The easiest method has been to simply generate and store the text in terms of the glyph codes themselves and use conventional methods of rendering ASCII strings. By and large most applications supporting text preparation in Indian languages seem to have adopted this method. When glyph codes are used, data entry is not intuitive and keyboard mappings tend to confuse the user, since one will be typing in ligatures quite frequently and not just the basic consonants and vowels. The use of ISCII as a standard for storing text required that a suitable processing module be used to arrive at the glyph codes from the internally stored ISCII codes for the consonants, vowels and the matras. It is clear that such a processing module can become quite complex since it has to first identify syllable boundaries from the input codes and map the same to the required shape by piecing together an appropriate set of glyphs from the font. As an illustration, the syllable seen below may be built from three glyphs in the Sanskrit98 font. Among the three, the middle glyph has zero width, with the corresponding ligature drawn on the left of the vertical axis.
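The heart of such a processing module is a table that takes a complete syllable, expressed in the stored codes, to the list of glyphs that draw it. The sketch below shows only the shape of that table; the ISCII byte values and glyph numbers are invented placeholders, not the real ISCII-91 or Sanskrit98 assignments.

    # Hypothetical syllable-to-glyph table: keys are ISCII byte sequences for
    # one syllable, values are glyph indices in some eight bit font.
    SYLLABLE_TO_GLYPHS = {
        (0xB6, 0xE8, 0xB3, 0xE8, 0xBB): [112, 47, 181],   # e.g. a three-glyph conjunct
        (0xB3,): [65],                                     # e.g. a lone consonant
    }

    def render_iscii(iscii_bytes):
        # a real module would first locate syllable boundaries; here the whole
        # input is assumed to be one syllable
        return SYLLABLE_TO_GLYPHS.get(tuple(iscii_bytes), [0])   # 0 = fallback glyph

    print(render_iscii(b"\xb6\xe8\xb3\xe8\xbb"))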

Zero width glyphs have helped build the required complex shapes through a simple process of concatenating the glyphs. In other words, zero width glyphs help create shapes which are formed by overlapping many shapes (usually two or three). Typically the matras in Devanagari are overlapped with the consonant shapes. The main advantage with this approach is that text can be rendered on most systems which can just take a string of eight bit codes. Most font rendering schemes have realized the need to correctly handle zero width glyphs (Win9X, Linux, Mac, Postscript). In these cases, the rendering engine is quite simple and just concatenates the shapes together. The question which has always been asked is "Can every display requirement be handled through the use of zero width glyphs (in respect of most scripts in India)?" While the answer to this question is certainly "yes", a large number of such glyphs will be required in practice to handle all the shapes which can be generated only by overlapping more basic shapes. It is quite difficult to accommodate a large number of these glyphs in an eight bit font. It may be noted here that TeX has indeed shown that an eight bit font may be all that we need for our scripts, but the approach cannot be used in interactive applications. Developers who desire to use Unicode for Indian languages face the problem of building up the required shape for each syllable using only a Unicode font. For a majority of the languages of the world, a Unicode font need have only one glyph for every Unicode character defined for the language. In respect of Indian languages, the situation is very different, since the Unicode font will have to accommodate literally thousands of glyphs. Certainly one could think of a Unicode font with several thousand glyphs where each glyph is directly a representation of a syllable. Unfortunately, when Unicode assignments were made, the experts felt that a scheme similar to ISCII would be sufficient. So, each Indian language got an assignment of a limited set of 128 code values from which it was assumed all syllables will be derived (represented) using a variable number of Unicode characters. It was felt that since the one to one mapping between a Unicode character and a glyph does not apply, a rendering engine would have to be used which maps the Unicode characters to the glyphs of SOME font, without specifying the range of Unicode values for the font glyphs. The way out of this situation was to suggest a new font concept called the Open type font which would incorporate features to map one or more Unicode characters to one or more glyphs in an appropriate Unicode range. This Open type font would permit a large number of glyphs, several hundreds perhaps, enough to generate all the required ligatures through positioning glyphs with respect to one another. With this, the required ligatures would be obtained by selecting the glyphs appropriate to a syllable and shaping the display by positioning the glyphs in precisely defined locations. The need for zero width glyphs does not arise, for the font rendering program would get positioning information from the glyph to be displayed which will now identify the component glyphs to be pieced together. The Open type font allows a string of Unicode characters to be mapped into a single glyph, thus permitting the generation of the shape of the syllable from a variable length string.
By precisely locating the glyphs in relation to one another graphically, the need for multiple zero width glyphs for the same ligature (as in True type fonts) is eliminated. It is said that such precise positioning allows superior quality typography as well. It is a different matter however, if the basic glyphs themselves are not aesthetically pleasing as is the case with the Microsoft Mangal font!

An Open type Unicode font not only allows more than 256 glyphs but also builds into it the positioning information when multiple glyphs are overlaid. Essentially this is the same concept as that of a composite glyph in a conventional True type font. The composite glyph also has the advantage that we can specify it with just one code. However, when mapping characters in the text, a True type font will permit only one glyph to be mapped to one character. This is the distinct advantage of the Open type font where a string of Unicode values can map to a single glyph. When a font rendering program is called to display a composite glyph, it would dynamically build the glyph from the component glyphs by positioning them properly. If one uses zero width glyphs in a font, the same final result can be obtained but only by specifying a code for each glyph. If we examine the syllable shown earlier, an open type font could indeed include a glyph that is a combination of the first two ("sht" and "ra") and be mapped into the syllable "sh, t, ra". In reality, many glyphs in the Microsoft Mangal font are composite glyphs (almost 500 of them) and the recommendation from Microsoft experts emphasizes the use of composite glyphs for as many glyphs as possible which directly relate to a syllable. The Uniscribe module, which constitutes the shaping engine for Unicode in Microsoft applications, will identify that "sh", "t" and "ra" would come out as a single shape by applying the rule that when the consonant "ra" comes as the last consonant in a syllable, it would be written using a ligature which can occur either as an attachment to the vertical stroke of the preceding consonant (as in "p, ra") or as an individual ligature below it, if the preceding consonant does not have a vertical stroke. It turns out that Microsoft displays the syllable in the illustration above not as a single ligature for "sh" and "t" but through a half form for "sh" and a ligature for "ra" under the consonant "t". It is now reasonably clear to us that a lot of rules are hard coded into Uniscribe. Some of the rules will depend on the availability of specific shapes (glyphs) in the font under use. Since the form of the syllable is hard coded into Uniscribe, the user or the developer cannot provide alternate forms for a syllable even if this form can be pieced together from other available glyphs in the font. Often a form where a conjunct can be shown without a halanth in any of the consonants is preferred by people. This is certainly not possible with Uniscribe as of today (March 2003). Tomorrow, if we do agree to build a new glyph into the Mangal font, Uniscribe will have to be rewritten! Of course Microsoft does not insist on the developer using Uniscribe. The onus is then on the designer to shape the syllables in the application itself, something that can lead to a lot of additional work. Uniscribe also works on the principle of internally defined rules which specify which form of a consonant applies in a given context. Thus "ra" occurring as the first consonant of a syllable is treated differently from a "ra" that occurs in the middle or at the end. Towards this, Uniscribe also reorders the input string to handle cases where the first consonant is graphically positioned at the end, as in the case where the "reph" form applies. In Marathi, it is not always the case that the reph form is used each time "ra" occurs as the first consonant.
So these rules, which are language dependent, have to be handled by Uniscribe only when the language associated with the script is also specified as a parameter. It is not possible to dynamically introduce a language that uses Devanagari but has rules different from Sanskrit or Hindi!

Glyph codes are required to be Unicode values

Writing applications which can transfer information between themselves through copy/paste greatly benefit from scripts which map one Unicode character into one font glyph. In this case the code of the displayed character is identical with that of the character in storage. One can readily identify the internally stored text merely by looking at the displayed string. We have seen that this cannot be the case with respect to Indian languages, for several Unicode characters in sequence constitute a syllable and hence a shape. The computer system (basically the OS) must use only a Unicode font to render the text since everything is Unicode based. The large set of Unicode values required in a font for an Indian language (Tamil may do with a small set) cannot be accommodated in any other Unicode range unless that range has no specific Unicode assignments. Taking note of this, developers have struck a compromise by designing Unicode (Open type too) fonts having glyph codes in a region designated as the "Private Use Area" by the Unicode consortium, where one has the freedom to locate the characters of one's own scripts. This in essence allows the characters of any new language to be assigned Unicode values in a totally free manner without prejudice to or interfering with the codes otherwise legally assigned to several other languages in the Unicode standard. Thus, Unicode text in Indian languages will be represented through the standard Unicode assignments for the different Indian languages but all corresponding fonts will locate their glyphs in the Private Use Area. One can readily see that this offers no loss of flexibility in processing a syllable, for what is needed is the identification of a glyph that has a valid Unicode value assigned to it. In a document displayed using such a font, going from the displayed code to the internal code is still a reality so long as we retain the stored text internally in some buffer and backtrack from the displayed codes simply by repeatedly generating temporary display codes and matching them against the actually displayed ones. So copy and paste operations will be possible. In a one code one glyph case, the need for this internally stored text does not arise because the internally stored text from which the display was generated will be identical to the displayed codes themselves.

When we use the Private Use Area, we may have no way of finding out what language text is being displayed unless we access the Unicode values of the internally stored text. Multilingual applications will have quite some work to do in relating the display to a language if the text displayed uses fonts in the Private Use Area but the actual code values are different. Thus all applications dealing with Unicode in Indian languages MUST always retain a buffer in which the Unicode string that has given rise to the current display is kept. Worse still, as editing operations are performed on displayed text, pointers linking the graphical positioning of the glyphs with the internally stored text string must be maintained. This is a very complex issue and we know that Microsoft applications themselves have not handled this with care, as will be seen below. It is now apparent that the application has a lot of responsibility in actually positioning the syllables on the screen when Unicode strings have to be displayed. Errors in computing the widths of displayed glyphs can lead to a lot of confusion during the backtracking process. Errors of this type can cause unpleasant gaps in the displayed text and we know that this situation does exist even with Microsoft software! Seen below is a screen shot of three Microsoft applications handling the same text. These are Wordpad, Word and Excel, all running under WinXP. The text was generated by typing into Wordpad and then copied and pasted into the other two. The identical looking strings in the Wordpad display are not really identical in their internally stored form but differ due to the incorporation of zero width joiners. It is however clear that all the strings refer to the same syllable. The test as to whether the applications actually perform syllable level processing is also apparent from the illustration.
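The bookkeeping described above amounts to maintaining, for every displayed cluster, a pointer back to the span of stored Unicode characters that produced it. A minimal sketch follows; it reuses the syllables() segmenter sketched earlier and leaves the glyph side of the map unfilled, since that would come from the shaping step.

    def cluster_map(text):
        # one entry per displayed syllable: (start, end) indices into the
        # backing Unicode string; glyph indices would be filled in after shaping
        clusters, pos = [], 0
        for syl in syllables(text):
            clusters.append({"chars": (pos, pos + len(syl)), "glyphs": None})
            pos += len(syl)
        return clusters

    text = "\u0905\u0928\u094d\u0928"            # अन्न
    for c in cluster_map(text):
        start, end = c["chars"]
        print(text[start:end], c["chars"])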

Examine how Word displays the strings. The wavy red line put in by Word (a spelling error being pointed out) tells us what Word thinks is actually the width of the displayed string! The situation with Excel is no less amusing: it does not seem to use Uniscribe at all but goes by the one code, one glyph maxim, ignoring the zero width joiners altogether. More interesting to observe is what happens when you try a string match for the word. Wordpad would match only one string while Word matches five and misses out the one where gaps are seen in the word. You can verify all this for yourself if you have Windows XP running on your computer. Just download the Unicode text file corresponding to the displayed text which we have made available for you. You can open the file in Wordpad or Word directly but must do a copy and paste into Excel. At this point one might point to the inconsistencies in text processing with Unicode. Text processing at the syllable level cannot be solved by providing modules which identify syllable boundaries alone and display the text. The need to check the linguistic validity of a text string that is perfectly valid as a Unicode string is really the crux of the problem. The multibyte nature of the syllable, coupled with the need to filter out codes which do not carry linguistic information but only help in rendering the syllable, will require a lot of comparisons with each Unicode character and severely affect performance, besides complicating the algorithms themselves.

All this goes to show that it is very difficult to write applications based on Unicode rendering. Applications which go only one way, i.e., from Unicode text to display, are perhaps the only ones which may work, but this would restrict the applications to mere data entry and display. Even here an application must know how the shaping engine (Uniscribe or equivalent) renders the text, to present the display appropriate to the user's needs. For instance, the onus is on the application to format the text graphically by ascertaining the character widths. Worse still, an application may actually be required to know when rendering information has to be inserted into a string through zero width joiners, non joiners and such. A major constraint which most applications will face is in permitting multilingual data entry. It will be very difficult to build applications that allow data entry in different scripts within the same interface unless the application handles the keyboard itself. The moment you rely on the support given by the OS, you will invariably be forced to use alternate keyboards. As indicated elsewhere in this essay, it is not possible to type in punctuation marks in Tamil using the Microsoft Tamil keyboard and one will have to switch keyboards. While one can certainly argue that this is consistent with the basic concept of Unicode where punctuation marks are assigned codes in a different region, the need to switch keyboards can be frustrating. It is never a good policy to require applications to handle text formatting by themselves. At least a meaningful API should be available which can take a Unicode string and render it on the display in a predictable fashion. This is very difficult to manage unless we have a one code one glyph situation. Perhaps a one code many glyphs situation is also not difficult to deal with, since the one code can really be that of a syllable. Unfortunately, Unicode has not taken this route. In Microsoft's implementation of Unicode support for Indian languages, it appears that the calculation of widths of displayed glyphs has some error. This is particularly so with zero width glyphs. It is clear that the responsibility for the correct display rests with the application and not the shaping engine. Shown below are screen shots of the same text in different applications: Word, Wordpad and Netscape. One wonders how this has come about! Zero width glyphs from standard fonts (in this case a True type font from IIT Madras) are rendered correctly under Word but gaps are seen in Wordpad. Wordpad correctly interprets widths of characters in the Latha font, which is Microsoft's own font, but Word seems to suffer, especially with zero width space characters. If you are intrigued about the clear text typed in Windows 2000 (Devanagari and Tamil text), just look at the simple multilingual text editor developed at IIT Madras.

The adequacy of True type fonts

Dealing with applications supporting user interfaces in Indian languages is entirely feasible with Unicode and True type fonts. It will be necessary to place many glyphs side by side to display a syllable but this can be managed with appropriate zero width glyphs. The application must now parse the input string to identify syllables. A significant amount of simplification can be effected if we agree to restrict syllable formation to a limited set of say about six hundred syllables (which, by the way, will cover most of our requirements in respect of our languages). The mapping from a syllable to its glyphs may be accomplished through simple table lookup as opposed to complex rules built into Uniscribe. The multilingual software from IIT Madras has established that this approach is not only viable but very simple to implement. Syllable formation is effected at the input stage itself during data entry and each syllable is stored internally as a fixed size code (two bytes). It is relatively easy to write parsing applications which can handle dynamically entered strings. The Acharya web site hosts a demo page where the viewer can verify that a sequence of consonants and vowels can be input to generate the syllables dynamically and display them in any script. Syllables may also be standardized by collectively taking all the basic sounds from each language and working with a superset of vowels and consonants.

The text rendering process can be simplified considerably if we agree to deal with a finite set of syllables as opposed to allowing arbitrarily long ones. Over the years one has seen that almost all the text ever prepared in India includes just about 500-800 syllables, depending on the language, which have to be shown with special ligatures. It is therefore sufficient if this set is catered to. Restricting the set of syllables gives us the flexibility to use tables to map the syllables to glyphs. Table lookup can also be effected dynamically, giving us the additional flexibility to use alternate forms of display for syllables.

If we carefully design our True type fonts, we can create a multilingual font supporting all the important scripts (nine of them) and place the glyphs in the region E000-E9FF, where each script will have close to 250 glyphs. We can include many common glyphs in this font, including punctuation marks, special symbols and such, which we could not manage in a regular True type font for want of glyphs. A comparable Open type font would require at least 650 glyphs per script and we can see that it will be difficult to manage such a huge font, let alone design one. True type fonts also have other advantages. The rendering process is not tied to the availability of a specific font so long as the glyphs are present at the expected locations. We can prepare text and get it rendered in any font of our choice where the glyphs occupy the specified locations. With Open type fonts, unless Unicode input conforms to the assigned code values and not the glyph codes, the characters will not be rendered right. If we create text in a Microsoft application that allows us to type in Unicode values in the private use area (E000-F7FF), we will not be able to view the text with the Mangal font even though it has glyphs in this range. There will be greater flexibility if an application can correctly identify the glyph codes and use any True type font that can render the glyphs right. This is how we currently display text in many Win9x applications where we generate ASCII text but view the same with a Devanagari or Tamil font. While it is true that a shaping engine is always required to render Unicode in Indian languages, the shaping engine should permit flexibility for us to use any compatible font. It does not appear that this is possible as of now since there is only one Open type font available for Devanagari and Uniscribe is tied to this.

One can summarize the observations as follows. What Microsoft (perhaps other developers as well) has done is to demonstrate that text in Indian languages can be typed into any application. While it may appear that this is all one would require to run the application with Indian language support, the truth is that none of the applications can correctly interpret the entered text to effect further processing. In other words localization, the ability to support a truly interactive user interface where user commands are correctly and consistently interpreted across all applications, is something that has not been viewed seriously. When this does happen, we would not be surprised if the application is just monolingual and script specific.
The use of Unicode (in respect of Indian languages) to truly bring in localization does not seem to be offering much promise. While one cannot deny that someone can actually accomplish this in spite of the problems of multibyte codes, it is becoming clear to many that developers will find it easier to provide script and problem specific solutions by handling the script related issues themselves, for there is no doubt that they can handle the linguistic aspects with confidence.

Unicode can be supported


(Recommendations for developers)

During the past decade (1991-2002), the Systems Development Laboratory, IIT Madras has gained much insight in respect of computing with Indian languages. In fact, one of the applications developed in the lab relates to direct handling of multibyte codes during data entry in an interactive application. The multilingual editor has a special feature which permits users to enter data in ITRANS. In effect this is equivalent to typing in the vowels and consonants but storing them in terms of syllables. The same approach can work with Unicode as well and a number of applications can indeed be developed on this basis. Developers can provide support for Unicode in all their applications both under Linux and Windows, if they agree to honour a few restrictions and also allow for two new Unicode values to be introduced as explained below. Introduce an additional code for the halanth which can be interpreted as a null vowel. So a consonant followed by the null vowel can also be treated as a proper syllable. The current halanth code can be retained to indicate syllable formation with succeeding consonants. During data entry one would use the null vowel to explicitly form generic consonants. In effect, this new code would do the same job as that of a zero width non joiner but has the advantage that it can be interpreted as a syllable without having to worry about the next character.

Agree to restrict syllable formation to a limited set and avoid arbitrarily long syllables. Provide a standardized API through which developers can collect keystrokes, so that syllables are formed in a consistent fashion.

Get rid of the codes for the Matras. A regular vowel can be used in place of the Matra, since this does not in any way affect syllable formation. When a pure vowel is required inside a word, a new Unicode value for what we may call a null consonant can be typed in before the vowel. The null consonant simply indicates that there is no consonant in the syllable, only a vowel, which should be displayed in its pure form. The null consonant is the second new Unicode character we need to introduce.

Use only TrueType fonts, but make sure that the font rendering program correctly handles zero width glyphs (this was the case with Win95/98/Me). Very high quality typography is indeed possible with TrueType fonts. Text prepared with fonts such as Sanskrit 1.2 or Sanskrit98 and printed under Win9X looks much better than what one can get with the Microsoft OpenType font Mangal. These fonts are so well designed that they cater to a very large set of ligatures, including Vedic accents, consonants with dots below and the very interesting bowed representation of a soft "ra" in Marathi. If we agree to use a TrueType font, we can place the glyphs in the E000 region and include as many as 250 glyphs for a script to take care of intricate ligatures as well; a sketch of such a layout is given below. (Several years ago a special Metafont designed for Devanagari actually supported the generation of more than a thousand conjuncts as well as Vedic symbols with just about 240 glyphs!)
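The following is a minimal sketch of such a private use area layout. The script order, the block size of 256 code points per script and the idea of reserving the last block (E900-E9FF) for common punctuation and symbols are assumptions made for this illustration; only the overall region E000-E9FF and the figure of roughly 250 glyphs per script come from the discussion above.

# Hypothetical layout: nine scripts, one 256-slot block each, starting at U+E000,
# with the final block kept for glyphs shared by all scripts (an assumption).
SCRIPTS = ["Devanagari", "Bengali", "Gurmukhi", "Gujarati", "Oriya",
           "Tamil", "Telugu", "Kannada", "Malayalam"]
PUA_BASE = 0xE000
BLOCK_SIZE = 0x100                                    # 256 slots per script, ~250 used
COMMON_BASE = PUA_BASE + len(SCRIPTS) * BLOCK_SIZE    # U+E900 onwards for shared glyphs

def glyph_code(script, glyph_index):
    """Return the private use area code point for a glyph of the given script."""
    if not 0 <= glyph_index < BLOCK_SIZE:
        raise ValueError("glyph index outside the script's block")
    return PUA_BASE + SCRIPTS.index(script) * BLOCK_SIZE + glyph_index

print(hex(glyph_code("Tamil", 0)))    # 0xe500 under this hypothetical layout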


Application Development under Linux


(Unicode Support) The Multilingual System Development effort at SDL emphasizes the need to work at the syllable level when it comes to processing text in Indian languages. Unicode is one approach to representing syllables; there are other approaches, such as ISCII, which have been in use for many years. If we look at the transliteration based representation of text in our scripts, we see that the letters of the Roman script are used to represent the syllables, again in a multibyte manner. As can be inferred from the discussions in earlier sections of this monograph, there is little difference between Unicode and a transliteration scheme. Programs such as ITRANS, RIT and many other transliteration schemes have successfully dealt with representing text in Indian languages. While it is true that many of these programs do not provide interactive interfaces, adding the required support is relatively straightforward.

If this were really so, why has no one implemented it before for Unicode? There are some good answers here. Unicode provides guidelines for implementing the shaping engine, though not in explicit terms, and people believed (and still believe) that these guidelines are sacred and that developers should strictly adhere to them. There is no reason to follow the guidelines if whatever we do in practice satisfies the essential requirement of rendering the syllables, the special symbols and the punctuation. Unicode also assumes that there can be no restrictions on syllable formation, so that any syllable must be permitted no matter what consonants are present. Arbitrary syllables make little sense, and experience has shown that in practice one encounters only a limited set, albeit numbering a few hundred. The writing system does indeed provide for arbitrary ones by merely decomposing them into generic consonants, except for the last; hence if we can handle these, we can take care of most text processing. The key to doing this is to stop arbitrary syllable formation at the input stage itself, i.e., during data entry. Tools such as Lex and Yacc can be used to great advantage to parse the input string (i.e., the keystrokes) and generate tokens that map directly to the specified set of syllables. The ITRANS package already has a complete definition file for many scripts and can identify most of the syllables correctly.

Once the syllable is identified, it can be rendered by merely looking up a table. The table will have at most a thousand entries (typically about six hundred), each corresponding to a base syllable. Syllables which involve different vowels with the same base can be rendered by using matras, and only the exceptions need be remembered. The main advantage gained here is that our syllables will not contain the modifier characters in the input string, paving the way for much better linguistic text processing. Should one so decide, each syllable may indeed be mapped to a unique integer based on the scheme suggested by IIT Madras. This gives us two different internal representations, one in terms of Unicode and the other in terms of fixed length syllable codes.

A syllable can be rendered in one of many different scripts (Unicode die-hards won't ever buy this) simply through table lookup. Virtually any font can be used which has the minimal set of glyphs required to render text in the specified script. Here we deviate from the convention that the font used should conform to the encoding used in the text. This is justified on the grounds that, for Indian scripts, it is well nigh impossible to force an encoding standard for text in which a one-code one-glyph mapping applies. What we are doing here is essentially implementing the rules of the writing systems by identifying the syllables at the input stage itself and completing the rendering process by simple table lookup. If we change the font, we simply change the table; for the same font, we can use different tables at different times to get different representations of the same syllable. In effect, we will have our own Uniscribe which can be dynamically configured to work with a script and any appropriate font. There will be very few restrictions on the font itself, except that zero width glyphs will have to be correctly rendered; fortunately, X11 under Linux does a good job of this. Introducing a new script for a language simply involves the use of an appropriate font and a table mapping the syllables to the glyphs.

Most of our multilingual requirements, such as transliteration across scripts, uniform data entry for all the languages and, most importantly, a uniform approach to linguistic processing in all the scripts, can be comfortably met if we take this approach. Cut and paste across applications will require that we maintain a backing string and map the blocked text to portions in this string; GTK allows us to do this effectively. The multilingual editor for Linux from IIT Madras allows you to change the script on screen and allows effortless cut/copy and paste without disturbing the stored representation of the syllables. An input module may be provided to developers, essentially a character input facility along the lines of getchr(). This module will be called by an application to input text and will return syllables in multibyte form or, if necessary, in a fixed width form. The syllables will be easy to work with from a linguistic angle, since no modifier codes will be present. A reasonable amount of equivalence in code values across scripts becomes possible, and transliteration may be more easily accomplished. Applications need not switch keyboards, since the syllables will be common to all the languages; only the font and the associated table will differ. OpenType fonts can be avoided.
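A minimal sketch of the table lookup rendering described above follows. All glyph values and table entries are invented for illustration; in practice each table would map the few hundred base syllables of a script to glyph positions of a particular TrueType font. The same syllable stream is rendered in two scripts simply by switching tables.

# A toy table lookup: map identified syllables to glyph code points of a
# hypothetical TrueType font whose glyphs sit in the private use area.
# All glyph values and table entries below are invented for illustration.
DEVANAGARI_TABLE = {("k", "a"): [0xE021], ("k", "i"): [0xE045, 0xE021]}
TAMIL_TABLE      = {("k", "a"): [0xE521], ("k", "i"): [0xE521, 0xE548]}

def render(syllables, table):
    """Return a glyph string for a list of syllables by plain table lookup."""
    glyphs = []
    for syllable in syllables:
        glyphs.extend(table[tuple(syllable)])
    return "".join(chr(code) for code in glyphs)

text = [["k", "a"], ["k", "i"]]
# The same syllable stream rendered in two scripts simply by switching tables.
in_devanagari = render(text, DEVANAGARI_TABLE)
in_tamil      = render(text, TAMIL_TABLE)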
Unicode fonts with glyphs in the range E000-E9FF may still be used for rendering text. We will need at most about 240 glyphs for each script to get a very satisfactory (and reasonably complete) set of ligatures displayed, so we can actually have one single Unicode font catering to all nine scripts in this range. In fact, one can go back from the glyph codes to the syllables by parsing the glyph string with a parser, which may be easily written using Lex and Yacc, much the same way as recommended earlier.
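A rough sketch of going back from glyph codes to syllables is given below. It uses a longest-match lookup over an inverted table rather than a Lex/Yacc generated parser, and it assumes tables of the hypothetical kind sketched above.

# Going back from glyph codes to syllables by longest-match lookup over an
# inverted table (a stand-in for the Lex/Yacc parser suggested in the text).
def invert(table):
    return {tuple(glyphs): syllable for syllable, glyphs in table.items()}

def glyphs_to_syllables(glyph_codes, inverse, longest=4):
    syllables, i = [], 0
    while i < len(glyph_codes):
        for n in range(longest, 0, -1):          # try the longest glyph run first
            run = tuple(glyph_codes[i:i + n])
            if run in inverse:
                syllables.append(inverse[run])
                i += n
                break
        else:
            raise ValueError(f"unrecognised glyph at position {i}")
    return syllables

# Usage with the hypothetical table above:
# glyphs_to_syllables([ord(c) for c in in_devanagari], invert(DEVANAGARI_TABLE))
# -> [("k", "a"), ("k", "i")]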

Examples of Unicode text rendering with different applications


Shown below are several examples of the same text rendered by different applications running under Windows and Linux. These examples vindicate the stand that rendering Unicode text in Indian languages can become application dependent. Even different applications running under the same operating system do not have a standard API they can fall back on to achieve some uniformity!

Interested viewers may want to check how the text gets rendered on their own systems which support Unicode rendering. The file aditya.txt is available for download.

The screen shot shows the text as manually typed into WordPad on a Windows XP system. WordPad seems to be quite faithful in implementing the Unicode rendering rules. The first string has no modifier characters and is perhaps the easiest to render. The remaining eleven have been composed using one or more of the zero width modifiers. WordPad under Win2000 may render the strings differently, somewhat along the lines shown in the next image.

This is the rendering of the text under MS Word 2000 on an XP system. Several interesting observations can be made. 1. The modifier strings seem to have been interpreted correctly only in some cases. 2. Gaps appear between glyphs. 3. Word has calculated the span of each string incorrectly, as can be seen from the red wavy underline in each string; the span has no relationship to the actual width of the text, even for the properly specified string, viz. the first. 4. Modifiers occurring in the last syllable certainly cause problems, with huge gaps.

This rendering is under Microsoft Word from Office XP on a WinXP system. It is clear that Microsoft has taken care of some of the problems seen with Word 2000, at least in respect of the widths of the characters, but again, the modifier characters cause problems.

This is the rendering under Internet Explorer 6 on a Win2000 system. Again, one sees much confusion in rendering the codes with modifier characters. Apparently, text sent from a web server is handled differently by browsers. Noticeable are the characters which could not be rendered.

This is the rendering under Mozilla 1.0 on a Debian Linux system. Even the very first string has run into the problem of an incorrectly placed Matra! The last syllable is rendered differently and uses the halanth in place of the half letter! Surprisingly, the widths seem to have been computed correctly. On Linux systems, Matra placement is far from satisfactory even when the Matra is located on the required side of the consonant.

This is the rendering under Opera on a Debian system. Notice how the modifier codes have been faithfully interpreted but rendered using a different scheme! Opera seems to provide a clear clue to the presence of modifier characters. Unfortunately, this does not help: the Matra for "ih" is still on the wrong side!

The screen image on the left shows how Opera under WinXP renders the text; it is still at variance with the rendering of the same text by Opera under Linux! Also notice how the zero width joiner before "yA" gets rendered with a new shape!

Unicode rendering examples.


The following examples illustrate the variations seen in different applications while rendering Unicode strings in Indian languages. This page was set up in August 2007, after refinements to Unicode rendering were put into effect in recent versions of operating systems and applications. This site also includes pages written earlier highlighting the differences encountered in different applications. A short explanation appears at the right of each screen shot, to identify the nature of the problem in rendering, be it a problem with the basic encoding itself or an idiosyncrasy of the application.

Illustrations:
Devanagari rendering under Ubuntu and Windows Vista
Tamil rendering under Ubuntu and Windows Vista
Data entry problems
Application dependent rendering
Problems with arbitrarily long syllables

Devanagari (Ubuntu)

Shown at the left is the page from Google returning results for a search of pages which refer to the Acharya site. The page, displayed by Firefox under Ubuntu, shows displaced matras as well as linearized rendering of syllables. The display under Firefox on a Vista system is proper.

Tamil (Ubuntu)

Shown at the left is a page from Google which includes Unicode text in Tamil. The rendering under Ubuntu is totally inappropriate: the medial vowel shapes are in the wrong place and, worse still, the rendering of syllables with the vowel "u" is completely wrong. Tamil has the advantage of a simpler script where syllable formation is relatively easy; unfortunately, the application uses inappropriate algorithms to render syllables. The rendering under Windows is correct, though one must keep in mind the fact that Unicode for Tamil does not address all the requirements.

Data Entry related issues

The display at the left is the Wordpad screen under Windows Vista.

The problems of Unicode data entry are highlighted in the display. It has been possible to create identical displays for two different strings. This example shows that preparing a text string for a query can be an extremely difficult task. Zero width non joiners can bring in confusion when a syllable is linearized by the user. It turns out that Wordpad allows the entry of a zero width non joiner but Notepad does not permit the same. The problem here is that one is trying to create a syllable in two different ways, one with a single code and the other with two codes, resulting in ambiguity. The nukta character is not a linguistic entity, and the Unicode assignment for it is as inappropriate as the assignment of medial vowel forms. The linguistic structure demands that we assign codes so that syllables are written unambiguously and can be identified without confusion.
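A concrete, verifiable instance of this ambiguity involves the nukta: the Devanagari letter qa can be encoded either as the single code point U+0958 or as ka (U+0915) followed by the nukta (U+093C). The two strings normally display identically, yet compare as different unless the application normalizes them, which is exactly the difficulty with query strings. A minimal check in Python:

import unicodedata

single   = "\u0958"           # DEVANAGARI LETTER QA
two_part = "\u0915\u093C"     # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA

print(single == two_part)                                  # False: raw strings differ
print(unicodedata.normalize("NFD", single) == two_part)    # True: canonically equivalent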

Application dependent rendering

The fact that Unicode rendering is necessarily application dependent is illustrated in the two screen shots at the left. Wordpad and the word processor under Microsoft Works are taken as the applications; both run under Windows Vista. The rendering under Wordpad is correct, while the other shows a totally incorrect display. The assignment of Unicode values to Indian language letters is such that syllables have multibyte representations. To accommodate different renderings of the same syllable, Unicode allows the use of special characters known as modifiers, and these modifiers are not handled properly by different applications. Also, the algorithm for rendering cannot ever be standardized, due to the variations permitted in syllable representation.

This is the reason why rendering is basically a responsibility of the application. Note how the medial vowel shapes are incorrectly placed as well as incorrectly rendered.

Arbitrarily long syllables

Unicode rendering often goes by the assumption that it should be possible to handle arbitrarily long syllables. It is easy to force the application into forming syllables from valid Unicode values which have only symbolic value and no strict linguistic value; the codes for the medial vowels and the nukta are a few examples. These codes can confuse the application when arbitrarily long syllables are attempted. The display at the left is the consequence of entering a series of halanth codes (almost 400 in this case), at which point the data entry state machine starts to misbehave. A copy of the file is available for you to verify this. Please note the differences in rendering between Wordpad and Notepad. One has to accept the fact that the rendering issue cannot be totally divorced from the application: when the application decides how a specific case will be handled, uniformity across applications is lost.
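Input of this kind is easy to reproduce for testing. The snippet below generates a string of the same sort (it is not the downloadable file itself); the enclosing KA consonants are an arbitrary choice, and only the long run of halanth (virama, U+094D) codes matters.

# Generate a test file with a long run of halanth (virama) codes, of the kind
# described above; the enclosing KA consonants are an arbitrary choice.
pathological = "\u0915" + "\u094D" * 400 + "\u0915"
with open("halanth_test.txt", "w", encoding="utf-8") as f:
    f.write(pathological)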

While one can dismiss this as a pathological example, one should remember that Unicode allows a user to compose a syllable with special modifier codes. Hence it may be virtually impossible to discern the internal representation of a displayed string, information that is essential when typing in query strings for searches.
