Character Encoding Standard for Indian Scripts - A Report

 

Shashank BHATT

Member Technical Staff

Center for Development Of Advanced Computing, Pune, India.

sbhatt@cdac.ernet.in

 

ABSTRACT

This paper describes the current implementation of ISCII (Indian Script Code for Information Interchange), the national character encoding standard for Brahmi-based Indian scripts, as detailed in the IS 13194:1991 standard document. It also discusses the implementation of Indian scripts in the Unicode Standard and includes a brief note on a common keyboard overlay for these scripts.

 

 

INTRODUCTION

There are 18 Scheduled languages in India as of today. These are, in approximate order of usage: Hindi, Bengali, Telugu, Marathi, Tamil, Urdu, Gujarati, Kannada, Malayalam, Oriya, Punjabi, Assamese, Kashmiri, Sindhi, Konkani, Nepali, Manipuri and Sanskrit.

 

Urdu, Kashmiri and Sindhi are primarily written in Perso-Arabic scripts. As these scripts have a different alphabet, a different character encoding standard is envisaged for them.

 

The Brahmi-based scripts, which are the subject of this discussion, can be divided into the Northern and Southern scripts. The Northern scripts are Devanagari, Gujarati, Punjabi, Assamese, Bengali and Oriya. The Southern scripts are Telugu, Tamil, Malayalam and Kannada.

 

Hindi, the official language of India, is written in the Devanagari script, as are Marathi, Konkani, Nepali and Sanskrit. Manipuri is written extensively in the Bengali script; it is also written in the Meitei script.

 

HISTORY

Since the 1970s, different committees of the Department of Official Languages and the Department of Electronics (DOE) have been evolving character encodings and keyboard overlays that would cater to all the Indian scripts. In July 1983, the DOE announced the ISCII-83 code, which complied with the ISO 8-bit recommendations ("Report of the sub-committee on Standardization of Indian Scripts and their Codes for Information Processing", DOE, July 1983). The report also recommended a common phonographic keyboard layout.

 

A keyboard standard for Indian scripts was brought out by the DOE in 1986 (Report of the committee for "Standardization of Keyboard Layout for Indian Script Based Computers" in Electronics-Information & Planning, Vol. 14, No.1, October 1986).

 

The DOE revised the ISCII code in 1988 (Report of the sub-committee on "Standardization of Indian Script Codes for Information Interchange", DOE, August 1988). In November 1991 this was incorporated in the Indian Standard IS 13194:1991 and remains the prevalent standard today.

Indian Script Codes for Information Interchange - ISCII

As is evident from the introduction, it was imperative to arrive at a unified character encoding scheme and a common keyboard layout for the Indian scripts in order to facilitate implementations on computers. This is made possible by their common origin in the ancient Brahmi script and by the phonetic nature of the alphabet. The advantages of this philosophy are many. Any software that accepts ISCII codes can be used in any Indian script, enhancing its commercial viability. In part, this is also the rationale behind the Unicode standard. Immediate transliteration between different Indian scripts becomes possible. Simultaneous availability of multiple Indian languages in the computer medium would accelerate the process of development and communication.

 

The ISCII code retains the standard ASCII codes while using the upper half of the 8-bit code range for Indian scripts. This makes it feasible to use Indian scripts alongside English on existing computers and software in an 8-bit environment.
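
As a minimal sketch of this coexistence, a byte stream can be split into English and Indian script runs by inspecting each code value alone. The function name `split_runs` and the run labels are illustrative, not part of the standard; the threshold of 160 (hex A0) is taken from the code table given later in this report, and the treatment of codes 128 to 159 as "other" is an assumption of the sketch.

```python
def split_runs(data: bytes):
    """Split an 8-bit byte string into (kind, bytes) runs.

    kind is "ascii" for codes below 128 and "indic" for codes from 160 up.
    Codes 128-159 are not assigned in the table of this report and are
    labelled "other" here (an assumption of this sketch).
    """
    def kind_of(code: int) -> str:
        if code < 0x80:
            return "ascii"
        if code >= 0xA0:
            return "indic"
        return "other"

    runs = []
    for code in data:
        kind = kind_of(code)
        if runs and runs[-1][0] == kind:
            # Extend the current run
            runs[-1] = (kind, runs[-1][1] + bytes([code]))
        else:
            # Start a new run
            runs.append((kind, bytes([code])))
    return runs


# "Hi " followed by the codes for ka, ma, la (hex B3, CC, D1)
sample = b"Hi " + bytes([0xB3, 0xCC, 0xD1])
print(split_runs(sample))
```

Because the two ranges do not overlap, existing 8-bit software that passes bytes through unchanged needs no modification to carry such mixed text.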

 

The ISCII code table is a superset of all the characters required in the 10 Brahmi-based Indian scripts. These scripts share a large number of structural features between them as a consequence of their common Brahmi origin. The ISCII code contains only the basic alphabet required by the Indian scripts. All the composite characters are formed through combinations of these basic characters. The alphabet in each script may vary but they all share a common phonetic structure. The differences between scripts are primarily in their written forms. ISCII encoding is completely delinked from the physical glyphs used for display making it possible for a script to be displayed in a variety of styles depending on the conjunct (ligature) repertoire available in the glyph set.

 

A description of the structure of ISCII is given below. For convenience the Indian script characters are depicted in Devanagari and in transliterated diacritic Roman script.

 

The Consonants

Indian script consonants have an implicit vowel अ /a/ included in them. They have been categorized according to their phonetic properties. There are 5 Vargs (Groups) and non-Varg consonants. Each Varg contains 5 consonants, the last of which is a nasal one. The first four consonants of each Varg constitute the Primary and Secondary pairs. The second consonant of each pair is the aspirated counterpart (has an additional "h" sound) of the first one.

 

Varg 1      क         ख         ग         घ         ङ
            ka        kha       ga        gha       ṅa

Varg 2      च         छ         ज         झ         ञ
            ca        cha       ja        jha       ña

Varg 3      ट         ठ         ड         ढ         ण
            ṭa        ṭha       ḍa        ḍha       ṇa

Varg 4      त         थ         द         ध         न
            ta        tha       da        dha       na

Varg 5      प         फ         ब         भ         म
            pa        pha       ba        bha       ma

Non-Varg    य         र         ल         व         श         ष         स         ह
            ya        ra        la        va        śa        ṣa        sa        ha

 

Apart from these consonants, there are some other consonants used in some specific Indian scripts:

ऩ (ṉa)   Used in Tamil.

य़ (ẏa)   Used in Oriya, Bengali and Assamese.

ऱ (ṟa)   An extra trilled "ra" used in Tamil, Telugu and Malayalam.

ळ (ḷa)   Used in Tamil, Telugu, Kannada, Malayalam, Oriya, Gujarati and Marathi.

ऴ (ḻa)   Used in Tamil and Malayalam.

 

Vowels and Vowel Signs: Matras

There are separate symbols for all the vowels in Indian scripts which are pronounced independently (either at the beginning of a word, or after a vowel sound). The consonants in the Indian scripts themselves have an implicit vowel अ /a/. To indicate a vowel sound other than the implicit one, a vowel-sign (Matra) is attached to the consonant. Thus there are equivalent Matras for all the vowels, except the अ vowel.

 

Roman           ā         i         ī         u         ū         ṛ         e
Vowel           आ         इ         ई         उ         ऊ         ऋ         ऎ
Matra           ◌ा        ◌ि        ◌ी        ◌ु        ◌ू        ◌ृ        ◌ॆ
Matra on क      का        कि        की        कु        कू        कृ        कॆ

Roman           ē         ai        ê         o         ō         au        ô
Vowel           ए         ऐ         ऍ         ऒ         ओ         औ         ऑ
Matra           ◌े        ◌ै        ◌ॅ        ◌ॊ        ◌ो        ◌ौ        ◌ॉ
Matra on क      के        कै        कॅ        कॊ        को        कौ        कॉ

 

Vowel Modifiers

Anuswar

Anuswar indicates a nasal consonant sound. When an Anuswar comes before a consonant belonging to any of the 5 Vargs, it represents the nasal consonant belonging to that Varg. Before a non-Varg consonant, however, the Anuswar represents a different nasal sound.
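
The Varg rule can be sketched as a lookup on the ISCII code of the consonant that follows the Anuswar. The code ranges below are read off the code table in this report; the names `VARGS` and `anuswar_nasal` are hypothetical, and returning `None` for non-Varg consonants is a choice of this sketch, since the text does not say which nasal sound applies there.

```python
# (first code, last code) of each Varg in the ISCII table.
# The last code of each range is that Varg's nasal consonant.
VARGS = [
    (0xB3, 0xB7),  # Varg 1: ka .. nasal of the ka group
    (0xB8, 0xBC),  # Varg 2: ca .. nasal of the ca group
    (0xBD, 0xC1),  # Varg 3: retroflex ta .. retroflex na
    (0xC2, 0xC6),  # Varg 4: ta .. na
    (0xC8, 0xCC),  # Varg 5: pa .. ma
]


def anuswar_nasal(next_consonant: int):
    """Return the code of the nasal an Anuswar stands for before
    `next_consonant`, or None if that consonant is not in any Varg."""
    for first, last in VARGS:
        if first <= next_consonant <= last:
            return last
    return None


print(hex(anuswar_nasal(0xB5)))  # before ga: nasal of Varg 1 -> 0xb7
print(anuswar_nasal(0xD7))       # before sa (non-Varg) -> None
```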

Nasalization Sign: Chandrabindu

The Chandrabindu (◌ँ) denotes nasalization of the preceding vowel. In the Devanagari script it is often replaced by the Anuswar, as the latter is more convenient to write.

Visarg (◌ः)

Comes after a vowel sound, and represents a sound similar to "h".

 

Vowel Omission Sign: Halant

In Indian scripts, consonants are assumed to have an implicit vowel अ /a/ within them unless an explicit Matra (vowel-sign) is attached. Thus a special sign, the Halant (◌्), is needed to indicate that the consonant does not have the implicit अ vowel in it.

In Northern languages, the Halant at the end of a word generally gets dropped, though the ending still gets pronounced without a vowel. This doesn't happen in Southern languages and Sanskrit, where a Halant is always used to indicate a vowel-less ending.

 

Conjuncts

Indian scripts contain numerous conjuncts, which essentially are clusters of up to four consonants without the intervening implicit vowels. The shape of these conjuncts can differ from those of the constituent consonants. These conjuncts are formed in the ISCII code by putting the Halant (◌्) character between the constituent consonants.

 

Example: क्षत्रिय   =   क ◌् ष त ◌् र ◌ि य

         क्रम       =   क ◌् र म

 

Diacritic Mark: Nukta

The Nukta is used for the ड़ and ढ़ characters in some Northern scripts. It is also used for deriving 5 other consonants in the Devanagari and Punjabi scripts, required for Urdu.

                                          ड                फ

                                          ड़                फ़

 

Punctuation

All punctuation marks used in Indian scripts are borrowed from English, except for the full-stop, instead of which a Viram (।) is used in the Northern scripts. The Viram is, however, being increasingly replaced by the full-stop. A double Viram (॥) is also used in Sanskrit texts to indicate a verse ending.

 

Numerals

In all the Indian scripts the international numerals are being used increasingly. From the software viewpoint, usage of the same numerals as given in the ASCII set allows proper handling of numerals by existing software. For display rendition purposes however, it may be sometimes desirable to have separate Indian script numerals which are given in the ISCII table.

The ATR mechanism also allows display rendition of the ASCII numerals in an Indian script form. The ISCII numerals should be used only when it is not possible to use the ATR mechanism for selecting numerals in an Indian script.

 

Attribute Code (ATR)

The Attribute code, followed by a displayable ASCII character, defines a font attribute applicable to the characters that follow. The mechanism is meant for use in media where no alternative font selection mechanism is available.

 

Extension Code (EXT)

The Extension code, followed by an ISCII character, defines a new character which can combine with the previous ISCII character. This provision has been primarily made for supplementing Vedic signs along with the Devanagari text.

 

 

 

 

ISCII CODE TABLE

The ISCII code standard specifies a 7-bit code table which can be used in a 7-bit or 8-bit ISO-compatible environment. It allows the English and Indian script alphabets to be used simultaneously.

 

7-Bit Code Table of the Indian Script Alphabet

 

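(Columns give the high hex digit and its decimal base; rows give the low hex digit and its decimal value. Each cell shows the Devanagari character and its Roman transliteration; "-" marks an unassigned position.)

            Hex   A           B           C           D           E              F
            Dec   160         176         192         208         224            240
Hex   Dec
0     0           -           ओ ō         ढ ḍha       ऱ ṟa        ◌ॆ e           EXT
1     1           ◌ँ m̐        औ au        ण ṇa        ल la        ◌े ē           ० 0
2     2           ◌ं ṃ        ऑ ô         त ta        ळ ḷa        ◌ै ai          १ 1
3     3           ◌ः ḥ        क ka        थ tha       ऴ ḻa        ◌ॅ ê           २ 2
4     4           अ a         ख kha       द da        व va        ◌ॊ o           ३ 3
5     5           आ ā         ग ga        ध dha       श śa        ◌ो ō           ४ 4
6     6           इ i         घ gha       न na        ष ṣa        ◌ौ au          ५ 5
7     7           ई ī         ङ ṅa        ऩ ṉa        स sa        ◌ॉ ô           ६ 6
8     8           उ u         च ca        प pa        ह ha        ◌् (Halant)    ७ 7
9     9           ऊ ū         छ cha       फ pha       INV         ◌़ (Nukta)     ८ 8
A     10          ऋ ṛ         ज ja        ब ba        ◌ा ā        । .            ९ 9
B     11          ऎ e         झ jha       भ bha       ◌ि i        -              -
C     12          ए ē         ञ ña        म ma        ◌ी ī        -              -
D     13          ऐ ai        ट ṭa        य ya        ◌ु u        -              -
E     14          ऍ ê         ठ ṭha       य़ ẏa        ◌ू ū        -              -
F     15          ऒ o         ड ḍa        र ra        ◌ृ ṛ        ATR            -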

 

PROPERTIES

Phonetic Sequence

The ISCII characters within a word are kept in the same order as they would be pronounced. Example:

                   राष्ट्रीय    = र ◌ा ष ◌् ट ◌् र ◌ी य

                   हिन्दी     = ह ◌ि न ◌् द ◌ी

As shown in the latter example, the display order may be different from the phonetic order. Having a spelling according to the phonetic order allows a name to be typed in the same way, regardless of the script it has to be displayed in.
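
To illustrate the difference between the two orders, the sketch below moves the short-i Matra, which is stored after its consonant cluster but drawn to its left, in front of that cluster. This is a simplified illustration under stated assumptions, not the display composition routine of any ISCII implementation: the names `visual_order` and `is_consonant` are hypothetical, the consonant range is read off the code table in this report, and only this one Matra is handled.

```python
I_MATRA = 0xDB
HALANT = 0xE8


def is_consonant(code: int) -> bool:
    # Consonants occupy hex B3 .. D8 in the code table of this report.
    return 0xB3 <= code <= 0xD8


def visual_order(codes):
    """Return the codes in left-to-right display order, with each
    short-i Matra placed before the consonant cluster it follows."""
    out = []
    for code in codes:
        if code == I_MATRA and out and is_consonant(out[-1]):
            # Walk back over a cluster of the form C (Halant C)*
            start = len(out) - 1
            while (start >= 2 and out[start - 1] == HALANT
                   and is_consonant(out[start - 2])):
                start -= 2
            out.insert(start, code)
        else:
            out.append(code)
    return out


# Hindi in phonetic order: ha, i-matra, na, Halant, da, ii-matra
hindi = [0xD8, 0xDB, 0xC6, 0xE8, 0xC4, 0xDC]
print([hex(c) for c in visual_order(hindi)])
```

The stored sequence is never altered; the reordering exists only on the display side, which is what keeps the typed spelling identical across scripts.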

 

Direct Sorting

Since there are variations in the ordering of a few consonants between different Indian scripts, it is not possible to achieve perfect sorting in all Indian scripts. Special routines would be required for culturally sensitive requirements. For most purposes, however, the direct sorting achieved through the ISCII code should be sufficient.
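
A small sketch of direct sorting: because the codes follow the order of the alphabet, comparing the raw byte strings is enough. The three sample words and their Roman labels are illustrative; the code values are taken from the code table in this report.

```python
# Roman label -> ISCII byte string (labels are for readability only)
words = {
    "gaja": bytes([0xB5, 0xBA]),          # ga ja
    "kamala": bytes([0xB3, 0xCC, 0xD1]),  # ka ma la
    "khaga": bytes([0xB4, 0xB5]),         # kha ga
}

# Sort the labels by the byte value of their ISCII spelling.
ordered = sorted(words, key=words.get)
print(ordered)  # ['kamala', 'khaga', 'gaja']
```

Note that the result follows the Indian alphabet (ka, kha, ga), not the Roman order of the labels.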

 

Unique Spellings

By using only the basic characters in ISCII, there is only one unique way of typing a word. The spelling of a word is now the phonetic order of the constituent basic characters. This provides a unique spelling for each word, which is not affected by the display rendition. Unique spellings are essential for making spelling checkers and dictionaries. They are also essential to facilitate finding of words in a word-processor, or for information retrieval from a data-base.

 

Display Independence

A word in an Indian script can be displayed in a variety of styles depending on the conjunct repertoire used. ISCII codes however allow a complete delinking of the codes from the displayed fonts.

An ISCII syllable can be displayed using a combination of basic shapes. Different implementations can choose different techniques for combining these basic shapes. The same text can thus be seen in different font styles by using a different font composition routine.

 

Transliteration

The ISCII codes are rendered on the display device according to the display composition methodology of the selected script. Transliteration to another script can thus be obtained by merely redisplaying the same text in a different script.

Since the display rendering process can be very flexible, it is possible to transliterate the Indian scripts to the Roman script, using diacritic marks. Similarly, it is possible to transliterate them to other scripts such as Perso-Arabic.

Transliteration involves a mere change of script, in such a manner that pronunciation is not affected. This is not the same as "translation", where the language itself changes.
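
A minimal sketch of transliteration to Roman, assuming a small hand-picked subset of the ISCII table: each consonant contributes its Roman stem, followed by the implicit "a" unless a Matra or a Halant comes next. The names `CONSONANTS`, `MATRAS` and `to_roman` are hypothetical, and the Roman spellings chosen here are one possible convention rather than the one prescribed by the standard.

```python
HALANT = 0xE8

# Partial tables, for illustration only.
CONSONANTS = {0xB3: "k", 0xB5: "g", 0xC4: "d", 0xC6: "n",
              0xCC: "m", 0xCF: "r", 0xD1: "l", 0xD8: "h"}
MATRAS = {0xDA: "\u0101", 0xDB: "i", 0xDC: "\u012b",
          0xDD: "u", 0xDE: "\u016b"}  # long a, i, long i, u, long u


def to_roman(codes):
    """Transliterate a sequence of ISCII codes to Roman letters."""
    out = []
    pending_a = False  # True while the last consonant keeps its implicit /a/
    for code in codes:
        if code in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[code])
            pending_a = True
        elif code in MATRAS:
            out.append(MATRAS[code])  # Matra replaces the implicit vowel
            pending_a = False
        elif code == HALANT:
            pending_a = False         # Halant removes the implicit vowel
        else:
            raise ValueError(f"code {code:#x} is not in this partial table")
    if pending_a:
        out.append("a")
    return "".join(out)


print(to_roman([0xB3, 0xCC, 0xD1]))                    # kamala
print(to_roman([0xD8, 0xDB, 0xC6, 0xE8, 0xC4, 0xDC]))  # hindī
```

The sketch always writes the word-final implicit vowel; as noted in the Halant section, Northern languages often drop it in speech, so a practical routine would need a language-specific rule there.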

 

The INSCRIPT Keyboard Overlay

The Inscript overlay contains characters required for all the Indian scripts, as defined by the ISCII character set. The Indian script alphabet has a logical structure, derived from the phonetic properties. The Inscript overlay mirrors this logical structure. The overlay has also been optimized from phonetic considerations. It is divided into two parts: the vowel pad on the left hand side, and the consonant pad on the right hand side.

Due to the phonetic/alphabetic nature of the keyboard, a person who knows typing in one Indian script can type in any other Indian script. The logical structure allows ease in learning and speed in touch typing. The keyboard remains optimal both from touch-typing and sight-typing points of view, in all Indian scripts.

 

 

 


UNICODE and ISCII

The Unicode standard for Indian scripts is based on the ISCII-1988 revision. In November 1991, at the time the Unicode Standard, Version 1.0 was published, the Bureau of Indian Standards published the current ISCII as the Indian Standard IS 13194:1991. The Unicode standard remains a superset of the ISCII-1991 character encoding, and texts encoded with ISCII-1991 may be automatically converted to Unicode code values and back to their original encoding without loss of information. The following Indian scripts are supported in Unicode: Devanagari, Bengali, Gurmukhi (Punjabi), Gujarati, Oriya, Tamil, Telugu, Kannada and Malayalam.
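
The round trip can be sketched for Devanagari with a plain lookup table. This is an illustration under stated assumptions rather than a complete converter: it covers only the codes that map one-to-one onto a single Unicode code point, and it leaves out INV, ATR, EXT and any multi-code sequences, which a real converter would have to handle. The function and table names are hypothetical.

```python
ISCII_TO_UNICODE = {}


def _assign(first_iscii, code_points):
    """Map consecutive ISCII codes to the given Unicode code points."""
    for offset, cp in enumerate(code_points):
        ISCII_TO_UNICODE[first_iscii + offset] = chr(cp)


_assign(0xA1, [0x0901, 0x0902, 0x0903])          # chandrabindu, anuswar, visarg
_assign(0xA4, [0x0905, 0x0906, 0x0907, 0x0908,   # independent vowels A4..B2
               0x0909, 0x090A, 0x090B, 0x090E,
               0x090F, 0x0910, 0x090D, 0x0912,
               0x0913, 0x0914, 0x0911])
_assign(0xB3, range(0x0915, 0x0930))             # consonants ka .. ya
_assign(0xCE, [0x095F])                          # ya with nukta
_assign(0xCF, range(0x0930, 0x093A))             # consonants ra .. ha
_assign(0xDA, range(0x093E, 0x0944))             # matras aa .. vocalic r
_assign(0xE0, [0x0946, 0x0947, 0x0948, 0x0945,   # e and o series matras
               0x094A, 0x094B, 0x094C, 0x0949])
_assign(0xE8, [0x094D, 0x093C, 0x0964])          # halant, nukta, viram
_assign(0xF1, range(0x0966, 0x0970))             # digits 0 .. 9

UNICODE_TO_ISCII = {u: i for i, u in ISCII_TO_UNICODE.items()}


def iscii_to_unicode(data: bytes) -> str:
    # ASCII codes pass through; a KeyError signals an unsupported code.
    return "".join(chr(b) if b < 0x80 else ISCII_TO_UNICODE[b] for b in data)


def unicode_to_iscii(text: str) -> bytes:
    return bytes(ord(ch) if ord(ch) < 0x80 else UNICODE_TO_ISCII[ch]
                 for ch in text)


hindi = bytes([0xD8, 0xDB, 0xC6, 0xE8, 0xC4, 0xDC])
text = iscii_to_unicode(hindi)
print(text, unicode_to_iscii(text) == hindi)
```

Note that the Unicode Devanagari block is not a constant offset from the ISCII positions, which is why the sketch uses a table instead of arithmetic.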

 

CONCLUSIONS

The ISCII character encoding standard in its present form has been under implementation in the field for the last 10 years, since 1988. It has proven to be robust, with a wide range of applications existing on diverse computing platforms. Its use has been made mandatory in many projects of state and national significance: the country's electoral data in Indian scripts is maintained in ISCII. Its natural assimilation within the Unicode 2.0 standard confirms that this standard was developed with great foresight.

 

There has been, in recent times, debate on the perceived shortcomings of ISCII and on the scope for improvement. In the interest of all concerned, it would be prudent to work with the current standard rather than attempt to change it, given the instability and uncertainty inherent in that process.

 

 

References

1. Bureau of Indian Standards (1991), Indian Script Code for Information Interchange IS 13194:1991. New Delhi, India.

2. Unicode Consortium (1996), The Unicode Standard. Version 2.0, Addison Wesley 1991-6. ISBN 0-201-48345-9

3. Hall, Patrick and Clews, John (1998) Customisable Internationalised Software: A win-win strategy for the South Asian software industry: Proceedings of the SAARC Conference on Extending the use of Multilingual and Multimedia Information and Technology, Pune, India, September 1998.