Character Encoding Standard for Indian
Scripts - A Report
Shashank BHATT
Member Technical Staff
Center for Development Of Advanced Computing, Pune, India.
sbhatt@cdac.ernet.in
ABSTRACT
This paper
describes the current implementation of ISCII (Indian Script Code for
Information Interchange), the national character encoding standard for
Brahmi-based Indian scripts, as detailed in the IS 13194:1991 standard
document. It also discusses the implementation of Indian scripts in the
Unicode Standard and includes a brief note on a common keyboard
overlay for these scripts.
INTRODUCTION
There are 18
Scheduled languages in India as of today. These are in approximate order of
usage: Hindi, Bengali, Telugu, Marathi, Tamil, Urdu, Gujarati, Kannada,
Malayalam, Oriya, Punjabi, Assamese, Kashmiri, Sindhi, Konkani, Nepali,
Manipuri and Sanskrit.
Urdu, Kashmiri
and Sindhi are primarily written in Perso-Arabic scripts. As these scripts have
a different alphabet, a different character encoding standard is envisaged for
them.
The Brahmi-based
scripts, which are the subject of this discussion, can be divided into the
Northern and Southern scripts. The Northern scripts are Devanagari, Gujarati,
Punjabi, Assamese, Bengali and Oriya. The Southern scripts are Telugu, Tamil,
Malayalam and Kannada.
The official
language of India, Hindi, is written in the Devanagari script, as are Marathi,
Konkani, Nepali and Sanskrit. Manipuri is written extensively in the Bengali
script; it is also written in the Meitei script.
HISTORY
Since the 1970s,
different committees of the Department of Official Languages and the Department
of Electronics (DOE) have been evolving different character encodings and
keyboard overlays which would cater to all the Indian scripts. In July 1983,
the DOE announced the ISCII-83 code which complied with the ISO 8-bit
recommendations ("Report of the sub-committee on Standardization of Indian
Scripts and their Codes for Information Processing", DOE, July 1983). The
report also recommended a common phonographic keyboard layout.
A keyboard
standard for Indian scripts was brought out by the DOE in 1986 (Report of the
committee for "Standardization of Keyboard Layout for Indian Script Based
Computers" in Electronics-Information & Planning, Vol. 14, No.1,
October 1986).
The DOE
revised the ISCII code in 1988 (Report of the sub-committee on
"Standardization of Indian Script Codes for Information Interchange",
DOE, August 1988). In November 1991 this was incorporated in the Indian Standard
IS 13194:1991 and remains the prevalent standard today.
Indian
Script Codes for Information Interchange - ISCII
As is evident
from the introduction, it was imperative to arrive at a unified character
encoding scheme and a common keyboard layout for the Indian scripts in order to
facilitate implementations on computers. This is made possible by their common
origin from the ancient Brahmi script and by the phonetic nature of the
alphabet. The advantages of this philosophy are many. Any software that accepts
ISCII codes can be used in any Indian script, which enhances its commercial
viability. In part, this is also the rationale behind the Unicode standard.
Immediate transliteration between different Indian scripts becomes possible.
Simultaneous availability of multiple Indian languages in the computer medium
would accelerate the process of development and communication.
The ISCII code
retains the standard ASCII code in the lower half of the 8-bit code space and
uses the upper half (codes above 127) for Indian scripts. This makes it feasible
to use Indian scripts alongside English on existing computers and software in
an 8-bit environment.
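The coexistence of ASCII and ISCII in a single 8-bit stream can be illustrated with a minimal Python sketch. It assumes only what is stated above, namely that bytes below 0x80 are ASCII and all other bytes belong to the Indian script half of the code space; the function name `split_runs` and the sample data are hypothetical.

```python
from itertools import groupby

def split_runs(data: bytes):
    """Split an 8-bit byte string into ('ascii' | 'iscii', bytes) runs.

    Assumption: bytes below 0x80 are plain ASCII; every other byte is
    treated as part of the upper (Indian script) half of the code space.
    """
    def kind(b):
        return "ascii" if b < 0x80 else "iscii"
    return [(k, bytes(group)) for k, group in groupby(data, key=kind)]

# An English label followed by six ISCII bytes (codes from the ISCII table).
mixed = b"Hindi: " + bytes([0xD8, 0xDB, 0xC6, 0xE8, 0xC4, 0xDC])
runs = split_runs(mixed)
```

Existing ASCII-only software sees the first run unchanged, which is the compatibility property described above.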
The ISCII code
table is a superset of all the characters required in the 10 Brahmi-based
Indian scripts. These scripts share a large number of structural features
between them as a consequence of their common Brahmi origin. The ISCII code
contains only the basic alphabet required by the Indian scripts. All the
composite characters are formed through combinations of these basic characters.
The alphabet in each script may vary but they all share a common phonetic
structure. The differences between scripts are primarily in their written
forms. ISCII encoding is completely delinked from the physical glyphs used for
display, making it possible for a script to be displayed in a variety of styles
depending on the conjunct (ligature) repertoire available in the glyph set.
A description of
the structure of ISCII is given below. For convenience the Indian script
characters are depicted in Devanagari and in transliterated diacritic Roman
script.
The Consonants
Indian script
consonants have an implicit अ /a/ vowel included in them.
They have been categorized according to their phonetic properties. There are 5
Vargs (Groups) and non-Varg consonants. Each Varg contains 5 consonants, the
last of which is a nasal one. The first four consonants of each Varg
constitute the Primary and Secondary pairs. The second consonant of each pair is
the aspirated counterpart (has an additional "h" sound) of the first
one.
Varg 1     क    ख    ग    घ    ङ
           ka   kha  ga   gha  ṅa
Varg 2     च    छ    ज    झ    ञ
           ca   cha  ja   jha  ña
Varg 3     ट    ठ    ड    ढ    ण
           ṭa   ṭha  ḍa   ḍha  ṇa
Varg 4     त    थ    द    ध    न
           ta   tha  da   dha  na
Varg 5     प    फ    ब    भ    म
           pa   pha  ba   bha  ma
Non-Varg   य    र    ल    व    श    ष    स    ह
           ya   ra   la   va   śa   ṣa   sa   ha
Apart from these consonants, there are some other consonants used in
some specific Indian scripts:
ऩ (ṉa) Used in Tamil.
य़ (ẏa) Used in Oriya, Bengali and Assamese.
ऱ (ṟa) An extra trilled "ra" used in Tamil, Telugu and Malayalam.
ळ (ḷa) Used in Tamil, Telugu, Kannada, Malayalam, Oriya, Gujarati and Marathi.
ऴ (ḻa) Used in Tamil and Malayalam.
Vowels
and Vowel Signs: Matras
There are
separate symbols for all the vowels in Indian scripts which are pronounced
independently (either at the beginning of a word, or after a vowel sound). The
consonants in the Indian scripts themselves have an implicit vowel अ /a/. To indicate
a vowel sound other than the implicit one, a vowel-sign (Matra) is attached to
the consonant. Thus there are equivalent Matras for all the vowels, excepting
the अ vowel.
Roman        ā    i    ī    u    ū    ṛ    e
Vowel        आ    इ    ई    उ    ऊ    ऋ    ऎ
Matra        ◌ा   ◌ि   ◌ी   ◌ु   ◌ू   ◌ृ   ◌ॆ
Matra on क   का   कि   की   कु   कू   कृ   कॆ

Roman        ē    ai   ê    o    ō    au   ô
Vowel        ए    ऐ    ऍ    ऒ    ओ    औ    ऑ
Matra        ◌े   ◌ै   ◌ॅ   ◌ॊ   ◌ो   ◌ौ   ◌ॉ
Matra on क   के   कै   कॅ   कॊ   को   कौ   कॉ
Vowel Modifiers
Anuswar ◌ं
Anuswar indicates a nasal consonant sound. When an Anuswar comes before
a consonant belonging to any of the 5 Vargs, it represents the nasal
consonant belonging to that Varg. Before a non-Varg consonant, however, the
Anuswar represents a different nasal sound.
Nasalization Sign:
Chandrabindu ◌ँ
The ◌ँ denotes nasalization of the preceding vowel.
In Devanagari script it often gets substituted with Anuswar, as the latter is
more convenient for writing.
Visarg ◌ः
Comes after a vowel sound, and represents a sound similar to
"h".
Vowel Omission Sign: Halant ◌्
In Indian scripts
consonants are assumed to have an implicit vowel अ /a/ within them
unless an explicit Matra (vowel-sign) is attached. Thus a special sign, Halant (◌्), is needed for indicating
that the consonant does not have the implicit अ vowel in it.
In Northern
languages, the Halant at the end of a word generally gets dropped, though the
ending still gets pronounced without a vowel. This doesn't happen in Southern
languages and Sanskrit, where a Halant is always used to indicate a vowel-less
ending.
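The interplay of implicit vowel, Matra and Halant can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it works on Unicode Devanagari text rather than ISCII bytes, the tables `CONSONANTS` and `MATRAS` are hypothetical and cover only the handful of characters used in the examples, and the word-final vowel dropping of Northern languages described above is not modelled.

```python
# Hypothetical, deliberately tiny tables: only the characters used below.
CONSONANTS = {"\u0915": "k", "\u092e": "m", "\u0932": "l", "\u0930": "r"}
MATRAS = {"\u093e": "ā", "\u093f": "i", "\u0940": "ī"}
HALANT = "\u094d"

def romanize(word: str) -> str:
    """Spell out a word, making the implicit /a/ of each consonant explicit."""
    out = []
    for i, ch in enumerate(word):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else ""
            # The implicit /a/ is kept unless a Matra or a Halant follows.
            if nxt not in MATRAS and nxt != HALANT:
                out.append("a")
        elif ch in MATRAS:
            out.append(MATRAS[ch])
        # The Halant itself produces no output: it only suppresses the /a/.
    return "".join(out)

plain = romanize("\u0915\u092e\u0932")           # ka ma la
with_halant = romanize("\u0915\u094d\u0930\u092e")  # ka, Halant, ra, ma
with_matras = romanize("\u0915\u093e\u0932\u0940")  # ka, aa-sign, la, ii-sign
```

The second example shows the Halant removing the vowel of the first consonant, so that the two consonants are read as a cluster.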
Conjuncts
Indian scripts
contain numerous conjuncts, which essentially are clusters of up to four
consonants without the intervening implicit vowels. The shape of these
conjuncts can differ from those of the constituent consonants. These conjuncts
are formed in the ISCII code by putting the Halant (◌्) character
between the constituent consonants.
Example: क्षत्रिय = क ◌् ष त ◌् र ◌ि य
         क्रम = क ◌् र म
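As a minimal sketch of this rule, the following Python fragment builds the ISCII byte sequence of a conjunct by placing the Halant code between consonant codes. The code values are taken from the ISCII code table in this paper; the function name `make_conjunct` is hypothetical.

```python
HALANT = 0xE8  # Halant position in the ISCII code table

def make_conjunct(consonants):
    """Join ISCII consonant codes with Halant to encode a conjunct."""
    out = bytearray()
    for i, code in enumerate(consonants):
        if i:                 # a Halant goes between consonants, not before the first
            out.append(HALANT)
        out.append(code)
    return bytes(out)

KA, SSA, RA = 0xB3, 0xD6, 0xCF  # ka, retroflex sha, ra

ksha = make_conjunct([KA, SSA])  # ka + Halant + ssa
kra = make_conjunct([KA, RA])    # ka + Halant + ra
```

How the resulting cluster is drawn (as a ligature or with a visible Halant) is left to the display routine and does not change the stored codes.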
Diacritic
Mark: Nukta ◌़
The Nukta is used
for the ड़ and ढ़ characters in some
Northern scripts. It is also used for deriving 5 other consonants in the
Devanagari and Punjabi scripts, required for Urdu.
क    ख    ग    ज    ड    ढ    फ
क़    ख़    ग़    ज़    ड़    ढ़    फ़
Punctuation
All punctuation
marks used in Indian scripts are borrowed from English, except for the
full-stop, instead of which a Viram (।) is used in the
Northern scripts. The Viram is, however, being increasingly substituted by a
full-stop. A double Viram (॥) is also used in Sanskrit texts
for indicating a verse ending.
Numerals
In all the Indian
scripts the international numerals are being used increasingly. From the
software viewpoint, usage of the same numerals as given in the ASCII set allows
proper handling of numerals by existing software. For display rendition
purposes however, it may be sometimes desirable to have separate Indian script
numerals which are given in the ISCII table.
The ATR mechanism
also allows display rendition of the ASCII numerals in an Indian script form.
The ISCII numerals should be used only when it is not possible to use the ATR
mechanism for selecting numerals in an Indian script.
Attribute Code (ATR)
The Attribute
code, followed by a displayable ASCII character, defines a font attribute
applicable to the characters that follow. The mechanism is meant for use in
media where no alternative font selection mechanism is available.
Extension Code (EXT)
The Extension
code, followed by an ISCII character, defines a new character which can combine
with the previous ISCII character. This provision has been primarily made for
supplementing Vedic signs along with the Devanagari text.
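Because ATR and EXT each consume the byte that follows them, software reading ISCII has to treat them as two-byte sequences. The Python sketch below shows one way to tokenize a byte stream accordingly. It is an assumption-laden illustration: the positions 0xEF and 0xF0 come from the code table in this paper, the function name `tokenize` is hypothetical, and the meaning of individual attribute or extension values is deliberately not interpreted.

```python
ATR, EXT = 0xEF, 0xF0  # positions of ATR and EXT in the ISCII code table

def tokenize(data: bytes):
    """Yield ('ATR', byte), ('EXT', byte) or ('CHAR', byte) tokens.

    ATR and EXT are paired with the byte that follows them. A trailing
    ATR or EXT with nothing after it is reported as a plain character
    instead of raising an error (a choice made for this sketch only).
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b in (ATR, EXT) and i + 1 < len(data):
            yield ("ATR" if b == ATR else "EXT", data[i + 1])
            i += 2
        else:
            yield ("CHAR", b)
            i += 1

# ATR + an ASCII selector, one consonant, then EXT + an ISCII character.
tokens = list(tokenize(bytes([0xEF, 0x42, 0xB3, 0xF0, 0xB5])))
```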
ISCII CODE TABLE
The ISCII code
standard specifies a 7-bit code table which can be used in a 7-bit or 8-bit
ISO-compatible environment. It allows English and Indian script alphabets to be
used simultaneously.
7-Bit Code Table of the Indian Script Alphabet

    | Hex | A                | B      | C      | D     | E           | F
Hex | Dec | 160              | 176    | 192    | 208   | 224         | 240
0   | 0   |                  | ओ ō    | ढ ḍha  | ऱ ṟa  | ◌ॆ e        | EXT
1   | 1   | ◌ँ (Chandrabindu) | औ au   | ण ṇa   | ल la  | ◌े ē        | ० 0
2   | 2   | ◌ं (Anuswar)      | ऑ ô    | त ta   | ळ ḷa  | ◌ै ai       | १ 1
3   | 3   | ◌ः (Visarg)       | क ka   | थ tha  | ऴ ḻa  | ◌ॅ ê        | २ 2
4   | 4   | अ a              | ख kha  | द da   | व va  | ◌ॊ o        | ३ 3
5   | 5   | आ ā              | ग ga   | ध dha  | श śa  | ◌ो ō        | ४ 4
6   | 6   | इ i              | घ gha  | न na   | ष ṣa  | ◌ौ au       | ५ 5
7   | 7   | ई ī              | ङ ṅa   | ऩ ṉa   | स sa  | ◌ॉ ô        | ६ 6
8   | 8   | उ u              | च ca   | प pa   | ह ha  | ◌् (Halant)  | ७ 7
9   | 9   | ऊ ū              | छ cha  | फ pha  | INV   | ◌़ (Nukta)   | ८ 8
A   | 10  | ऋ ṛ              | ज ja   | ब ba   | ◌ा ā  | । .         | ९ 9
B   | 11  | ऎ e              | झ jha  | भ bha  | ◌ि i  |             |
C   | 12  | ए ē              | ञ ña   | म ma   | ◌ी ī  |             |
D   | 13  | ऐ ai             | ट ṭa   | य ya   | ◌ु u  |             |
E   | 14  | ऍ ê              | ठ ṭha  | य़ ẏa   | ◌ू ū  |             |
F   | 15  | ऒ o              | ड ḍa   | र ra   | ◌ृ ṛ  | ATR         |
PROPERTIES
Phonetic Sequence
The ISCII
characters, within a word, are kept in the same order as they would get
pronounced. Example:
राष्ट्रीय = र ◌ा ष ◌् ट ◌् र ◌ी य
हिन्दी = ह ◌ि न ◌् द ◌ी
As shown in the
latter example, the display order may be different from the phonetic order.
Having a spelling according to the phonetic order allows a name to be typed in
the same way, regardless of the script it has to be displayed in.
Direct Sorting
Since there are
variations in ordering of a few consonants between different Indian scripts, it
is not possible to achieve perfect sorting in all Indian scripts. Special
routines would be required for culturally sensitive requirements. For most
purposes, however, the direct sorting achieved through the ISCII code should be
sufficient.
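Direct sorting here means nothing more than comparing the stored code values. A minimal Python sketch, using three hypothetical sample words spelled with codes from the ISCII table, is:

```python
# ISCII byte strings for three sample words (codes from the ISCII table).
kamal = bytes([0xB3, 0xCC, 0xD1])  # ka ma la
khaga = bytes([0xB4, 0xB5])        # kha ga
gaja = bytes([0xB5, 0xBA])         # ga ja

words = [gaja, kamal, khaga]
ordered = sorted(words)  # plain byte-wise comparison, no collation table
```

Because the consonants occupy the table in alphabetical order, the byte-wise result follows the ka, kha, ga sequence of the alphabet. Script-specific exceptions of the kind mentioned above would still need a dedicated collation routine.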
Unique Spellings
By using only the
basic characters in ISCII, there is only one unique way of typing a word. The
spelling of a word is now the phonetic order of the constituent basic
characters. This provides a unique spelling for each word, which is not
affected by the display rendition. Unique spellings are essential for making
spelling checkers and dictionaries. They are also essential to facilitate
finding of words in a word-processor, or for information retrieval from a
data-base.
Display Independence
A word in an
Indian script can be displayed in a variety of styles depending on the conjunct
repertoire used. ISCII codes however allow a complete delinking of the codes
from the displayed fonts.
An ISCII syllable
can be displayed using a combination of basic shapes. Different implementations
can choose different techniques for combining these basic shapes. The same
text can thus be seen in different font styles by using a different font
composition routine.
Transliteration
The ISCII codes
are rendered on the display device according to the display composition
methodology of the selected script. Transliteration to another script can thus
be obtained by merely redisplaying the same text in a different script.
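The same idea can be demonstrated with Unicode text, where the Indian script blocks follow a parallel layout, so that shifting a code point by the distance between two block starts lands on the corresponding letter of the other script. Note that this swaps in Unicode blocks for the ISCII redisplay mechanism described above; the function name `transliterate` is hypothetical, and the sketch ignores the fact that not every position is assigned in every script.

```python
# Start of each Indian script block in Unicode.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}

def transliterate(text: str, src: str, dst: str) -> str:
    """Shift characters of the source block to the same offset in dst.

    Caveat: characters with no counterpart in the target script (and
    shared signs such as the Viram) would need per-script exceptions,
    which this sketch does not provide.
    """
    lo = BLOCK_START[src]
    shift = BLOCK_START[dst] - lo
    return "".join(
        chr(ord(ch) + shift) if lo <= ord(ch) < lo + 0x80 else ch
        for ch in text
    )

word = "\u0915\u092e\u0932"  # ka ma la in Devanagari
in_gujarati = transliterate(word, "devanagari", "gujarati")
in_telugu = transliterate(word, "devanagari", "telugu")
```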
Since the display
rendering process can be very flexible, it is possible to transliterate the
Indian scripts to the Roman script, using diacritic marks. Similarly, it is
possible to transliterate them to other scripts such as Perso-Arabic.
Transliteration
involves a mere change of the script, in a manner that pronunciation is not
affected. This is not the same as "translation", where the language
itself changes.
The INSCRIPT Keyboard
Overlay
The Inscript
overlay contains characters required for all the Indian scripts, as defined by
the ISCII character set. The Indian script alphabet has a logical structure,
derived from the phonetic properties. The Inscript overlay mirrors this logical
structure. The overlay has also been optimized from phonetic considerations. It
is divided into two parts: the vowel pad on the left hand side, and the
consonant pad on the right hand side.
Due to the
phonetic/alphabetic nature of the keyboard, a person who knows typing in one
Indian script can type in any other Indian script. The logical structure allows
ease in learning and speed in touch typing. The keyboard remains optimal both
from touch-typing and sight-typing points of view, in all Indian scripts.
UNICODE and ISCII
The Unicode
standard for Indian scripts is based on the ISCII-1988 revision. In November
1991, at the time the Unicode Standard, Version 1.0 was published, the Bureau
of Indian Standards published the current ISCII in the Indian Standard
IS 13194:1991. The Unicode standard remains a superset of the ISCII-1991
character encoding, and texts encoded with ISCII-1991 may be automatically
converted to Unicode code values and back to their original encoding without
loss of information. The following Indian scripts are supported in Unicode:
Devanagari, Bengali, Gurmukhi (Punjabi), Gujarati, Oriya, Tamil, Telugu,
Kannada and Malayalam.
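A round trip of this kind can be sketched in Python for the Devanagari case. The mapping below pairs the ISCII positions of the code table in this paper with Unicode Devanagari code points; it is a simplified illustration, not a complete converter. In particular, INV, ATR and EXT are left unmapped, multi-byte conventions of the standard are not handled, and the function names are hypothetical.

```python
def _build_iscii_to_unicode():
    """ISCII byte -> Devanagari code point, one-to-one positions only."""
    table = {}

    def span(first_iscii, first_cp, count):
        for i in range(count):
            table[first_iscii + i] = first_cp + i

    span(0xA1, 0x0901, 3)    # chandrabindu, anuswar, visarg
    span(0xA4, 0x0905, 7)    # vowels a .. vocalic r
    span(0xB3, 0x0915, 27)   # consonants ka .. ya
    span(0xCF, 0x0930, 10)   # consonants ra .. ha
    span(0xDA, 0x093E, 6)    # matras aa .. vocalic r
    span(0xF1, 0x0966, 10)   # digits 0 .. 9
    # Positions whose order differs between the two tables.
    table.update({
        0xAB: 0x090E, 0xAC: 0x090F, 0xAD: 0x0910, 0xAE: 0x090D,
        0xAF: 0x0912, 0xB0: 0x0913, 0xB1: 0x0914, 0xB2: 0x0911,
        0xCE: 0x095F,
        0xE0: 0x0946, 0xE1: 0x0947, 0xE2: 0x0948, 0xE3: 0x0945,
        0xE4: 0x094A, 0xE5: 0x094B, 0xE6: 0x094C, 0xE7: 0x0949,
        0xE8: 0x094D, 0xE9: 0x093C, 0xEA: 0x0964,
    })
    return table

ISCII_TO_UNICODE = _build_iscii_to_unicode()
UNICODE_TO_ISCII = {cp: b for b, cp in ISCII_TO_UNICODE.items()}

def iscii_to_unicode(data: bytes) -> str:
    # ASCII passes through; unmapped bytes (INV, ATR, EXT) raise KeyError.
    return "".join(
        chr(b) if b < 0x80 else chr(ISCII_TO_UNICODE[b]) for b in data
    )

def unicode_to_iscii(text: str) -> bytes:
    return bytes(
        ord(ch) if ord(ch) < 0x80 else UNICODE_TO_ISCII[ord(ch)] for ch in text
    )

hindi = bytes([0xD8, 0xDB, 0xC6, 0xE8, 0xC4, 0xDC])  # ha i na Halant da ii
as_text = iscii_to_unicode(hindi)
round_trip = unicode_to_iscii(as_text)
```

Since both encodings store characters in phonetic order, the conversion in this sketch is a per-character lookup with no reordering.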
CONCLUSIONS
The ISCII
character encoding standard in its present form has been under implementation
in the field for the last 10 years, since 1988. It has proven to be robust,
with a wide range of applications existing under diverse computing platforms.
Its usage has been made mandatory in many projects of state and national
significance: the country's electoral data in Indian scripts is maintained in
ISCII. Its natural assimilation
within the Unicode 2.0 standard confirms that this standard was developed with
great foresight.
There has been, in
recent times, debate on the perceived shortcomings of ISCII and on the
scope for improvement. In the interest of all concerned, it would be prudent to
work with the current standard rather than attempt to change it, given the
instability and uncertainty inherent in that process.
References
1. Bureau of Indian Standards (1991), Indian Script Code for Information Interchange IS 13194:1991. New Delhi, India.
2. Unicode Consortium (1996), The Unicode Standard. Version 2.0, Addison Wesley 1991-6. ISBN 0-201-48345-9
3. Hall, Patrick and Clews, John (1998) Customisable Internationalised Software: A win-win strategy for the South Asian software industry: Proceedings of the SAARC Conference on Extending the use of Multilingual and Multimedia Information and Technology, Pune, India, September 1998.