Dr. Khaver ZIA


FAST Institute of Computer Science, Lahore, Pakistan

E-mail: kzia@fast.edu.pk



This paper reports on the efforts made to date to standardize computerized processing in Urdu. These efforts cover several aspects of the language, the most prominent being the keyboard, the font, and the character set.


character codes, code table, multilingual computing, standardization, Urdu





Urdu is the national language of Pakistan. It is used in education, literature, office and court business, media, and in religious institutions. It holds in itself a repository of the cultural, religious and social heritage of the country. The language is versatile and has the potential to expand and grow to fulfill the needs of modern times.

Significant work has been done on linguistic aspects of the language such as orthography and lexicography. However, with the advent of computers, it is a natural desire of its adherents to harness the power of the computer to increase their productivity and efficiency in using the language.

As with other languages, the need for standardization in Urdu was strongly felt with the introduction of mechanized composition. The typewriter marks the first stage of this process of mechanization. It was followed by other machines such as the teleprinter. With the advent of the computer, however, a new dimension was added to the process of standardization. Efforts were made to formulate standards for Urdu similar to those developed for other languages. This paper surveys these efforts. The survey discusses the standardization of the alphabet, the keyboard, and the character set.


The characteristics of the Urdu language have been described in great detail [1] [3]. Here it is pertinent to mention that Urdu has traditionally been written in the Nastaleeq script. Although the script employs the basic letters of the language, the rendering of these letters in a word is extremely complex. The reason for this complexity is that Urdu text has traditionally been composed through calligraphy, a medium whose precepts are based on the aesthetic sense of the calligrapher rather than on any formula. So great is the variation in calligraphy that it is often difficult to recognize the individual letters of a word. This is because, in their calligraphed form, the individual letters partially or completely fuse into each other, thereby losing their identity. A degree of fusion is purposely introduced to make the resulting fused glyph visually appealing.

Another characteristic of Urdu is the existence of diacritics. Diacritics, although sparingly used, help in the proper pronunciation of a word. They appear above or below a character to define a vowel or emphasize a particular sound. They are essential for the removal of ambiguities, for natural language processing, and for speech synthesis.


A major achievement of the Urdu Language Authority is the finalization of the character set of Urdu. Differences in this regard have been resolved. Further, the sorting order of the basic alphabetical characters, with and without diacritical marks, has been specified. This was a prerequisite to the formulation of the code table. The standardized character set is given in Appendix 1.


Urdu is written in the Nastaleeq script, just as English is written in Roman and Hindi is written in Devanagari. However, no standard font exists. In the past each calligrapher wrote the language according to his own style, and that style came to be known by his name. In 1980 Ahmed Mirza Jamil calligraphed 18,000 ligatures of Urdu, which he named Noori Nastaleeq. These were digitized and used by computerized composing machinery. No other font of Urdu has been digitized for computerized processing.


Although efforts to develop a standard keyboard date back to the beginning of this century, complete standardization has not been achieved to date. The reasons are, firstly, the lack of a proper enforcing authority and, secondly, a lack of appreciation of the technical issues.

The first Urdu typewriter was available as early as 1911. Pakistan became independent in 1947, and in the same year Urdu was declared the national language of the country. Thereafter a number of individuals and organizations proposed their own versions of the keyboard layout. Based on these proposed designs, a number of companies produced typewriters and started marketing them. However, every company's typewriter had a different keyboard layout. The differences were mostly in the order of the keys and the number of characters present. Soon the situation was so chaotic that it became necessary to specify the make of the typewriter when advertising for an Urdu typist. The government stepped in to rectify the situation. The Central Language Board established by the government proposed a standard typewriter keyboard in 1963. This contained keys for other local languages, namely Sindhi and Pushto. In 1974 the keyboard was modified, and the characters pertaining to other languages were replaced with the needed arithmetic operators and numerals. These keyboards were designed on scientific grounds and benefited from frequency tables and bifurcation techniques (balancing the load on the typist's fingers).

In 1980, the Muqtadra Qaumi Zaban (Urdu Language Authority) came out with a new keyboard layout for typewriters. It comprises 46 keys to type 71 Urdu consonants, vowels, diacritics, and punctuation marks, and 21 key symbols for arithmetic calculations and digits. Most of the characters have more than one shape, and the keyboard outputs text in the Naskh (i.e., Arabised) script. Although it was specified that the keyboard layout would also be used for computerized processing, the capabilities that a computer could provide were not taken into account in the design of the keyboard. These capabilities were the intelligence of the computer to select the shape of a character appropriate to its context, and its ability to store the character sets of multiple languages in memory. A layout of this keyboard is shown in Appendix 2 *.

In the same year, work was done to develop a keyboard for a bilingual (i.e., Urdu-English) teleprinter. This keyboard layout was more appropriate for use in a computer-based terminal and was adopted, with modifications, by word processor designers. A version of this keyboard recommended by the Urdu Language Authority is shown in Appendix 3 *. However, it has still not been standardized. The existence of two keyboard layouts, one for the typewriter and a second for the computer, is unfortunate, and this situation needs to be rectified.

Unfortunately, no character table exists for Urdu corresponding to the ASCII table for English. This situation is extremely detrimental to computerized processing in Urdu.

Every developer has devised his own character set for Urdu according to his convenience. The result is that data cannot be transported from one application to another. This also discourages developers who want to undertake allied work such as writing spell checkers and generating fonts.

A major effort was undertaken in the middle of 1998 to rectify the above situation. A committee comprising linguists and computer scientists was constituted to formulate a standard code table for Urdu. The committee held a number of sessions over a period of six months and came up with recommendations for a standard code table. The issues involved in devising a standard table have been discussed elsewhere [2]. Here it is pertinent to enumerate the salient points of the proposed code table.

The divisions of the code table (shown in Appendices 4A and 4B) are as follows:

    1. Alphabet (43 Nos. Codes 80 to 122)
    2. Numerals (10 Nos. Codes 48 to 57)
    3. Special Characters (32 Nos. Codes 32 to 47, 58 to 65 and 192 to 199)
    4. Diacritics (18 Nos. Codes 66 to 79 and 123 to 126)
    5. Religious and Linguistic Symbols (16 Nos. Codes 160 to 175)
    6. Control Characters (64 Nos. Codes 0 to 31 and 128 to 159)
    7. Language Toggle (1 No. Code 254)
    8. Vendor Area (31 Nos. Codes 209 to 239)
    9. Expansion Area (39 Nos. Codes 176 to 191, 200 to 208 and 240 to 253)
    10. Not Used (2 Nos. Codes 127 and 255)
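
The divisions listed above can be checked mechanically. The following is a minimal sketch in Python, not part of the committee's proposal: it transcribes the code ranges from the list and confirms that they cover all 256 codes of an 8-bit table exactly once. The names `DIVISIONS`, `build_lookup` and `division_sizes` are illustrative assumptions.

```python
# Divisions of the proposed code table, transcribed from the list above.
# Each division maps to one or more inclusive (first, last) code ranges.
DIVISIONS = {
    "Alphabet": [(80, 122)],
    "Numerals": [(48, 57)],
    "Special Characters": [(32, 47), (58, 65), (192, 199)],
    "Diacritics": [(66, 79), (123, 126)],
    "Religious and Linguistic Symbols": [(160, 175)],
    "Control Characters": [(0, 31), (128, 159)],
    "Language Toggle": [(254, 254)],
    "Vendor Area": [(209, 239)],
    "Expansion Area": [(176, 191), (200, 208), (240, 253)],
    "Not Used": [(127, 127), (255, 255)],
}


def build_lookup(divisions):
    """Return a 256-entry list mapping every code to its division name.

    Raises ValueError if a code is assigned twice or left unassigned.
    """
    lookup = [None] * 256
    for name, ranges in divisions.items():
        for first, last in ranges:
            for code in range(first, last + 1):
                if lookup[code] is not None:
                    raise ValueError(f"code {code} assigned twice")
                lookup[code] = name
    if None in lookup:
        raise ValueError(f"code {lookup.index(None)} is unassigned")
    return lookup


def division_sizes(divisions):
    """Count the codes in each division."""
    return {
        name: sum(last - first + 1 for first, last in ranges)
        for name, ranges in divisions.items()
    }


LOOKUP = build_lookup(DIVISIONS)   # succeeds only if the ranges partition 0..255
SIZES = division_sizes(DIVISIONS)  # e.g. SIZES["Alphabet"] is 43
```

A lookup of this kind also lets an application classify an incoming code, for example `LOOKUP[254]` yields the language toggle division.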

The following features of the code table are worth mentioning:

    1. It has been designed to fulfill the linguistic requirements and peculiarities of the Urdu language.
    2. It implements the sorting sequence specified by the Urdu Language Authority.
    3. It supports diacritics.
    4. It supports the calligraphic traditions of the language.
    5. It is compatible with Unix and Windows platforms.
    6. It facilitates application development and information exchange.
    7. It supports future enhancements.
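
Feature 2 has a practical consequence: if the alphabet codes are assigned in the specified sorting sequence, encoded words can be ordered by comparing code values directly, without a separate collation table. The sketch below illustrates this under stated assumptions. The code ranges come from the divisions of the table, but the treatment of diacritics (ignored for the primary comparison, then used only to break ties) is a hypothetical rule chosen for illustration and is not the rule specified by the Urdu Language Authority. The sample byte strings are likewise invented.

```python
# Diacritic codes taken from the divisions of the code table:
# 66 to 79 and 123 to 126.
DIACRITICS = set(range(66, 80)) | set(range(123, 127))


def sort_key(encoded: bytes):
    """Build a two-level sort key for a word encoded in the code table.

    Level 1: the word with diacritic codes removed, compared by code value.
    Level 2: the full encoded word, used only to break ties.
    (The two-level scheme is an assumption made for this sketch.)
    """
    base = bytes(code for code in encoded if code not in DIACRITICS)
    return (base, encoded)


# Hypothetical encoded words: 80, 81, 82 are alphabet codes, 70 is a diacritic.
words = [
    bytes([82, 80]),
    bytes([80, 70, 81]),
    bytes([80, 81]),
]

ordered = sorted(words, key=sort_key)
```

Because the primary key is just the sequence of letter codes, the ordering falls out of ordinary byte comparison, which is what makes a code table laid out in sorting sequence convenient for application developers.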

Although efforts have been made towards the standardization of processing in Urdu, a lot remains to be done. The government, the computer industry, the printing industry, and all other concerned organizations need to make concerted efforts to complete the process of standardization as early as possible.


