Text Processing of National Languages

( India )

Chandrashekhar Raje
Member Technical Staff
Centre for Development of Advanced Computing








Preface

The demand for computer systems capable of providing input/output facilities in Indian scripts started in the 1970s. The need for this facility was felt when computers became popular for office automation. As a result, various organizations made considerable R&D efforts to provide it. Their approaches were mostly ad hoc and tied to a specific script. Simultaneous use of ASCII was not possible, and a common scheme for all the scripts of India was not considered. It was the DOE that took the initiative in 1983 to standardize the Indian code chart. The resulting ISSCII-83 code complied with the ISO 8-bit code recommendations (Report of the sub-committee on standardization of Indian scripts and their codes for Information processing, DOE, July 1983). While retaining the ASCII character set in the lower half, it provided the Indian script character set in the upper 96 characters. It also included a recommendation for a phonographic-based keyboard layout for all Indian scripts.

A common platform for all Indian scripts was proposed by the DOE in 1986 (Report of the committee for "Standardization of Keyboard Layout for Indian Scripts Based Computers" in Electronics-Information & Planning, Vol. 14, No. 1, Oct 1986). This report recommended the 8-bit ISCII code. Its revision in 1988 made the code chart more compact.


History of Indian Scripts

There are 15 officially recognized Indian scripts. They are broadly divided into two categories: Brahmi scripts and Perso-Arabic scripts. The Brahmi scripts consist of Devanagari, Punjabi, Gujarati, Oriya, Bengali, Assamese, Telugu, Kannada, Malayalam and Tamil. The Perso-Arabic scripts include Urdu, Sindhi and Kashmiri. The Devanagari script is used by the Hindi, Marathi and Sanskrit languages. The languages within a family have distinctive characteristics: they share a common phonetic structure, which makes a common character set possible. Within the same family, the north Indian scripts such as Hindi, Marathi, Punjabi, Gujarati, Oriya, Bengali and Assamese have common features, while the southern scripts such as Tamil, Telugu, Kannada and Malayalam have common features of their own. This clear division of characteristics has simplified their use on computers.

All the scripts mentioned above are written in a nonlinear fashion. Unlike English, the width of the characters differs even within the same script. The division between consonants and vowels applies to all Indian scripts. The vowels attached to a consonant do not run in a single (horizontal) direction; they can be placed either on top of or below the consonant. This makes the scripts more complicated to represent on computers.
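The nonlinear layout can be illustrated with a small sketch. It assumes Unicode Devanagari code points as a stand-in encoding (the text itself does not name one), and the table `MATRA_POSITION` and the function `visual_order` are hypothetical names that cover only a handful of vowel signs.

```python
# Where a few Devanagari vowel signs sit relative to their consonant.
# Illustrative subset only; using Unicode code points is an assumption
# of this sketch, not something taken from the text.
MATRA_POSITION = {
    "\u093e": "right",   # vowel sign AA
    "\u093f": "left",    # vowel sign I
    "\u0940": "right",   # vowel sign II
    "\u0941": "bottom",  # vowel sign U
    "\u0942": "bottom",  # vowel sign UU
    "\u0947": "top",     # vowel sign E
}


def visual_order(consonant: str, matra: str) -> list[tuple[str, str]]:
    """Return (glyph, slot) pairs in left-to-right display order.

    The stored order is always consonant first, vowel sign second;
    only a left-side sign is displayed before the consonant.
    """
    position = MATRA_POSITION[matra]
    base = (consonant, "base")
    sign = (matra, position)
    return [sign, base] if position == "left" else [base, sign]


KA = "\u0915"
for matra in ("\u093f", "\u0941", "\u0947"):
    print(visual_order(KA, matra))
```

For Ka with the sign I, the sign is displayed first although it is stored second; the top and bottom signs share the consonant's horizontal position, which is why a simple left-to-right character cell model is not enough.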


TEXT PROCESSING

To use a language in any application, its characteristics must be known. Once they are known, applications can handle the languages in a uniform manner. Indian scripts have a distinctive structure and much in common with one another. They follow almost the same rules; only the way of representing them differs.


Structure of Indian Scripts

Since all these scripts have the same origin, they share a common phonetic structure. The alphabet and the graphical shapes may vary slightly. All of them have basic consonants and vowels, and their phonetic representation is the same. This characteristic makes a transliteration facility between any two Indian scripts possible. In the same way, the scripts can be represented in Roman with the help of diacritic marks. Typically the alphabet is divided into the following categories.


The Consonants

All Indian scripts use five groups of consonants, called vargas. The vowel sound 'a' is included in the consonants themselves. Each varga has five consonants: two primary/secondary pairs and a nasal. The second consonant in each pair is derived from the first by adding an 'h' sound, and has a separate graphical representation.

consonants                nasal   example
Ka    Kha    Ga    Gha    Na      gangA
Ca    Cha    Ja    Jha    N-a     manc
T.a   T.ha   D.a   D.ha   N.      ghantA
Ta    Tha    Da    Dha    N       sant
Pa    Pha    Ba    Bha    M       stambha

Other consonants that do not belong to any varga are:

Ya Ra La Va 'Sa S.a Sa Ha

Invisible consonants, such as Ra (halant) and (halant) Ra, are formed differently.
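The primary/secondary pairing inside each varga can be sketched as a small table plus one derivation rule. The names `VARGA_PRIMARIES`, `secondary` and `varga_row` are hypothetical, and the retroflex row is spelled here with a dot (`T.a`, `D.a`) by analogy with `N.` and `S.a`; that spelling is an assumption of the sketch.

```python
# Primary (unaspirated) consonants of each varga, in Roman notation.
VARGA_PRIMARIES = [
    ("Ka", "Ga"),
    ("Ca", "Ja"),
    ("T.a", "D.a"),  # assumed spelling for the retroflex row
    ("Ta", "Da"),
    ("Pa", "Ba"),
]


def secondary(primary: str) -> str:
    """Derive the secondary consonant by adding the 'h' sound: Ka -> Kha."""
    return primary[:-1] + "h" + primary[-1]


def varga_row(primaries: tuple[str, str]) -> list[str]:
    """Expand one varga into its two primary/secondary pairs."""
    row = []
    for p in primaries:
        row.extend([p, secondary(p)])
    return row


for primaries in VARGA_PRIMARIES:
    print(" ".join(varga_row(primaries)))
```

The rule only captures the phonetic relationship; as the text notes, each secondary consonant still has its own graphical shape, so a font cannot derive it the same way.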


Vowels

All the vowels are represented by separate symbols. A vowel is attached to a consonant either before or after it. Each of these vowels is pronounced separately. Typical vowels are:

vowel :  A    i    Ee    u    U    ru    Ee
usage :  Ka   Ki   Kee   Ku   KU   Kru   KEe

vowel :  e    E    a    o    O    au    ao
usage :  Ke   KE   Ka   Ko   KO   Kau   Kao


anuswar

Used with a nasal, as shown in the consonant examples, e.g. rambhA


chandrabindu

e.g. PAnch


Visarga

Puts an 'h' sound between two consonants, e.g. du : kha


Halant

While forming conjuncts, halant triggers the use of broken consonants. When two or more consonants are combined, the shape of the conjunct varies. Halant is also often required to indicate a vowel-less ending,

e.g. Ramnathan


conjunct

A complex form made from consonants and vowels.

nishkriya = Na Ee Sha (halant) Ka (halant) Ra i Ya

Halant is also used to make the soft halant and the explicit halant.

Ka (halant) Ta = Kta, while Ka (halant) (halant) Ta = Kat
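The way halant joins consonants into a conjunct can be sketched as a grouping pass over the token sequence. The function `split_clusters`, the `VOWELS` set and the token spellings are hypothetical, and treating a doubled halant as the explicit halant that stops the join is this sketch's reading of the example above.

```python
HALANT = "(halant)"
# Vowel tokens in the document's Roman notation (illustrative subset).
VOWELS = {"A", "i", "Ee", "u", "U", "e", "E", "o", "O", "au"}


def split_clusters(tokens: list[str]) -> list[list[str]]:
    """Group tokens into clusters that would be shaped as one unit.

    A consonant after a single halant joins the current conjunct;
    a vowel attaches to the cluster it follows; a second consecutive
    halant is taken as explicit and ends the join.
    """
    clusters: list[list[str]] = []
    join_next = False  # True right after a single halant
    for tok in tokens:
        attaches = tok == HALANT or tok in VOWELS or join_next
        if attaches and clusters:
            clusters[-1].append(tok)
        else:
            clusters.append([tok])  # a new consonant starts a new cluster
        # A halant toggles the join; any other token clears it.
        join_next = (not join_next) if tok == HALANT else False
    return clusters


word = ["Na", "Ee", "Sha", HALANT, "Ka", HALANT, "Ra", "i", "Ya"]
for cluster in split_clusters(word):
    print(" ".join(cluster))
```

Each resulting cluster is the unit for which software has to pick a shape, which is why the sequence can be stored and transmitted as plain codes and composed only at the display end.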


Nukta

Nukta is used to derive some of the characters used in Hindi, Punjabi and Urdu, e.g. d. or k.
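The derivation of a character from a base letter plus nukta can be observed with Python's standard `unicodedata` module. Unicode is used here only as a convenient stand-in encoding and is an assumption of the sketch; `split_nukta` is a hypothetical helper name.

```python
import unicodedata

NUKTA = "\u093c"  # DEVANAGARI SIGN NUKTA


def split_nukta(ch: str) -> tuple[str, bool]:
    """Return (base letter, has_nukta) for a single Devanagari letter."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], NUKTA in decomposed[1:]


print(split_nukta("\u0958"))  # letter QA, derived from KA
print(split_nukta("\u095c"))  # letter DDDHA, derived from DDA
print(split_nukta("\u0915"))  # plain KA, no nukta
```

Storing the base letter and the nukta as two codes keeps the character set small, at the cost of one extra composition step when the text is displayed.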


Punctuations and numerals

All punctuation marks and numerals are common to English and the Indian scripts. On computers they are entered using the English symbols.


Vedic characters

Apart from Hindi and Marathi, Devanagari is also used for Sanskrit, which uses Vedic symbols. These symbols are provided for by an extended character set.


Standardization

Applications of the scripts on computers have increased rapidly. These applications are of various kinds, including data processing, desktop publishing and telegraphic applications. If each application uses the scripts in its own way, the chances of interfacing between any two fields become slim. Such interfacing requires an enormous amount of effort to suit the requirements of both applications. At the same time, similar applications might use the script in different ways, which again limits its use. All this per-application effort and limited use can be avoided by standardizing the script code. The code can be designed by taking into account the characteristics of the languages and uniform rules. The code charts for Indian languages for computer applications have been standardized, which has made implementations easier (refer to the code charts ISCII, PC-ISCII and PC-ISCII-7).
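One practical consequence of a shared 8-bit code is that any application can separate English and Indian-script text by looking at byte values alone. The sketch below assumes that the "upper 96 characters" mentioned in the preface occupy 0xA0 to 0xFF, as in other ISO 8-bit codes; the byte values in the example are placeholders, not entries from the actual code chart, and `kind` and `split_runs` are hypothetical names.

```python
from itertools import groupby

SCRIPT_START = 0xA0  # assumed start of the upper 96 positions (0xA0-0xFF)


def kind(byte: int) -> str:
    """Classify one byte of an 8-bit stream."""
    if byte < 0x80:
        return "ascii"
    if byte >= SCRIPT_START:
        return "script"
    return "other"  # 0x80-0x9F: outside both sets in this sketch


def split_runs(data: bytes) -> list[tuple[str, bytes]]:
    """Split a mixed stream into consecutive runs of the same kind."""
    return [(k, bytes(group)) for k, group in groupby(data, key=kind)]


# Placeholder bytes: two upper-half codes between ASCII text.
sample = b"ID " + bytes([0xB3, 0xCC]) + b" 42"
for run_kind, run in split_runs(sample):
    print(run_kind, run)
```

Because the split needs no knowledge of which script is in use, the same routine would serve data processing, DTP and telegraphic software alike, which is the interfacing benefit the standard is meant to deliver.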


Text processing on typewriter and computers

Text processing of Indian scripts differs between the mechanical typewriter and the computer. The mechanical typewriter has some limitations compared to computers. Because of the complex nature of Indian scripts, forming conjuncts is extremely difficult on typewriters. The approach used in typewriters is best suited to graphical representation. For example, 'Pha' is formed by using 'Pa' and the remaining graphical part of 'Pha'. This is not a very user-friendly approach, and it does not suit all Indian languages. On a computer, text processing of any kind is possible with the help of software and hardware. In particular, the shapes of the characters vary when conjuncts are formed, and these various shapes can be provided on computers by software. The output of the typewriter, on the other hand, is not up to the mark. Simultaneous use of English along with an Indian script is often required; this is made possible by standardizing the codes.

Indian scripts have tremendous applications in day-to-day life. These applications include word processing, databases, DTP, teleprinting, speech, OCR, etc. Once the characteristics of the scripts are known, they can be used in any of these applications. The standard code chart is designed for dedicated applications as well as for use alongside English. In any of these uses, once the sequence of characters that forms words or conjuncts is known, it can be sent to or received from the device, and the exact word is formed by software. In this way device communication proceeds normally; only the sending and receiving devices interpret the sequence.


Application areas

Special word processing software for Indian scripts would make everyday work easier.

DTP applications mostly use typefaces or a graphical approach. Developing special software to convert these codes to typefaces could bring about a revolution in script technology. Font design for this application can itself be based on this approach.

Indian scripts can be used for video applications such as subtitling and captioning. Once the DTP application exists, porting it to video media is a simple task. Quality output is obtained on video media as well.

Applications such as teleprinters and telex can use various Indian scripts. Recognition of the language and formation of words can be handled by software.

Speech recognition for Indian scripts is another useful application. It requires a huge database of words. This technology can be used for telephones, text-to-speech, talking aids for the handicapped and teaching programs.

Translation between Indian scripts is also possible with the help of a database. Name translation work is already in progress and is showing positive results.
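The transliteration facility described under Structure of Indian Scripts follows from the scripts sharing one character layout. A minimal sketch of the idea uses the Unicode Indic blocks, which place corresponding letters of each script at the same offset within a 128-position block. Unicode is an assumption of this sketch rather than the code chart the text refers to, and `BLOCK_START` and `transliterate` are hypothetical names.

```python
# Start of each script's Unicode block; the blocks share one layout.
BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
}
BLOCK_SIZE = 0x80


def transliterate(text: str, source: str, target: str) -> str:
    """Move every character of the source script to the same slot in the target."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            out.append(chr(cp - src + dst))  # same offset, different block
        else:
            out.append(ch)  # ASCII, digits, punctuation stay as they are
    return "".join(out)


# KA, MA, LA in Devanagari, moved to the Bengali block.
print(transliterate("\u0915\u092e\u0932", "devanagari", "bengali"))
```

Not every position is filled in every script, so a real facility would need per-script fallback rules for letters the target script lacks; names and other words that need more than a letter-for-letter mapping are where a database comes in.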

Mission-oriented software, such as for land records projects, the judiciary, banking, air/train/bus reservations and health programs, can also be developed.

Most of the applications mentioned above have been developed or are being developed at C-DAC (Centre for Development of Advanced Computing), Pune.


Reference: BIS document for 'Indian Script Code for Information Interchange', IS 13194:1991.