Text Processing in Malay

(7th AFSIT, Tokyo, Japan, September 27, 1994)

by
Muhammad Mun'im Ahmad Zabidi
Faculty of Electrical Engineering
Universiti Teknologi Malaysia
(UTM)
54100 Kuala Lumpur
Malaysia

Haliza Ibrahim
Engineering Standards and Certification Unit
Standards and Industrial Research Institute of Malaysia
(SIRIM)
40700 Shah Alam
Malaysia

* For details, please contact

Introduction

Malay (or Bahasa Melayu) is the national language of Malaysia. It is very closely related to Bahasa Indonesia, the official language of Indonesia. Both languages are native to the Malays indigenous to the Malay archipelago, which covers Malaysia, Singapore, Brunei, Southern Thailand and Indonesia. Malay is special because it can be written in two scripts: Jawi (traditional, Arabic-based) and Rumi (Roman). The evolution of the scripts is historical.

Before the arrival of religions in the Malay archipelago, the Malays were animists and did not have any writing. Hinduism arrived first and has influenced Malay culture up to this day. In Java, the first writing scripts began to appear in a form based on Sanskrit called Kawi. Arab traders to the area introduced Islam and, as most Malays converted to Islam, an Arabic-based script called Jawi evolved. Some new characters not present in Arabic were added, making the Jawi alphabet a superset of the Arabic alphabet.

The colonization of the Malay Peninsula by the British and of Indonesia by the Dutch led to the invention of Romanized Malay by the colonizers (*1). The influence of the colonialists' languages was reflected in the way some words were spelled. The discrepancy in spelling ended in 1973, when a joint Malaysian and Indonesian committee standardized the spellings of both national languages. Some examples are shown in Table 1.

____________________________________________________________________________
 English    Malay        Indonesian   Standardized
            (pre-1973)   (pre-1973)   (post-1973)    Example
____________________________________________________________________________
 ch         ch           tj           c              chuba/tjuba/cuba (try)
 sh         sh           sj           sy             shahid/sjahid/syahid (martyr)
 kh         kh           ch           kh             akhir/achir/akhir (end)
 y          y            j            y              sayang/sajang/sayang (love)
 oo         u            oe           -              Timur/Timoer (East)
 ny(?)      ny           nj           ny             ganyang/ganjang (obliterate)
 j          j            dj           j              Jakarta/Djakarta
____________________________________________________________________________

Table 1: Examples of Malay and Indonesian spellings of similar consonants and
         vowels. The list is not exhaustive.

In Malaysia today, official documents are written in Rumi or English. Almost all Malay literary works are written in Rumi. For this purpose, any English word processing software is sufficient unless Malay-specific components such as spelling checkers and thesauri are required. This probably explains the low standard of the Malay documents produced: Malay-specific word processors are not popular, so spell-checking is simply not performed. Jawi is used in Islamic religious documents, where the shared writing direction makes mixing Malay and Arabic very convenient. This does not mean that all religious texts are written in Jawi. Not many Muslims are Jawi-literate, and thus many religious texts are nowadays written in Rumi, with quotations from the Quran written in Arabic. A few Jawi-capable word processors that can process Arabic and Rumi are available, but again they are not as popular as English-version products.

Presently, there is a resurgence of interest in Jawi in Malaysia. There have been requests to SIRIM to come up with a national Jawi character set standard. This led to the introduction of a draft Malaysian standard for data interchange purposes. There are also requirements to convert old Rumi documents to Jawi. Both of these have led to new developments in Malay text processing.

The Jawi Alphabet

In its present form, the Jawi alphabet has 35 characters. Twenty-nine are adopted from Arabic and six were invented by the Malays. Each letter may have up to four forms: alone, and at the beginning, middle and end of a word. These are shown in Table 2.

Table 2: Jawi letters. The shading highlights letters invented by the Malays.

The switching of characters in the collating sequence means there are two conflicting options: upward compatibility with the ISO 8859 Arabic character set, at the cost of extra processing during collation, or no compatibility but a straightforward collating sequence. The designers of the Jawi character set decided on the former option, since this is the method used to expand the Arabic character set in UCS (ISO 10646) and is also the method used by the Europeans when basing their character sets on ASCII. Unfortunately, the extra characters present in Jawi are not in UCS.
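The "extra processing during collation" implied by the first option can be pictured as a sort key that ranks each character by its position in the alphabet instead of by its code. The following is a minimal sketch under stated assumptions, not the draft standard's method: the three code points are present-day Unicode values used only as stand-ins for code positions that are not listed here, the ordering of ca between jim and ha is assumed for this fragment, and `JAWI_ORDER` and `jawi_sort_key` are hypothetical names.

```python
# Illustrative only: these code points are stand-ins, not the draft standard's values.
JIM = "\u062c"  # letter inherited from the Arabic set
HA = "\u062d"   # letter inherited from the Arabic set
CA = "\u0686"   # Jawi addition, coded after the inherited letters

# Assumed alphabetical order for this fragment: ca sorts between jim and ha.
JAWI_ORDER = [JIM, CA, HA]
RANK = {ch: i for i, ch in enumerate(JAWI_ORDER)}


def jawi_sort_key(word):
    """Rank each character by alphabet position; unknown characters sort last, by code."""
    return [(RANK.get(ch, len(RANK)), ord(ch)) for ch in word]


words = [HA, CA, JIM]
by_code = sorted(words)                         # raw code order: ca lands at the end
by_alphabet = sorted(words, key=jawi_sort_key)  # alphabet order: ca follows jim
```

Sorting by raw code keeps the added letter after every inherited letter, whereas the keyed sort restores the alphabetical position at the price of one table lookup per character.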

Inputting Jawi Text

No Jawi system software exists. There are several approaches to having Jawi on an English-language computer. Three methods can be identified: using regular application software, using software that can handle mixed left-to-right and right-to-left scripts, and using Rumi-to-Jawi conversion.

The most commonly used method is to use standard application software, install non-standard Jawi fonts and switch fonts when Jawi is required. The advantage of this method is its low cost. Character selection is by a non-standard keyboard arrangement. Experienced users can memorize the keyboard layout and will not require character templates. Text entry is left-to-right, as opposed to the right-to-left direction used in Jawi. The user must also identify the proper form (one out of four) of every character entered. Needless to say, this approach requires a new keyboard layout whenever incompatible font sets are used. Figure 1 shows the entry of Jawi using the English version of Microsoft Word on a Macintosh. The work being done is entering data into the Quran Information System (QIS) being developed at UTM. The font used is customized and does not follow any established standard. This particular font is necessary because it includes special marks found only in the Quran.

Figure 1: Inputting Jawi text using English version word processor and custom fonts.

The second method is to use software that is capable of handling right-to-left text entry in Arabic. This allows text entry in the same direction as the thought process. Another advantage is that the user types a single key for any character and the software finds the correct form of the character. Both of these features substantially ease the mental burden on the user. For Jawi, the packages normally used are English-based, with English menus, but they allow mixed languages, primarily with support for Arabic. When a Latin font is used, new characters are inserted to the left of the cursor. The application software normally allows the keyboard layout to be displayed on a part of the screen. This is beneficial to new users. Depending on the package, custom fonts are probably used in either the Windows or the Macintosh environment. However, the Macintosh computer has WorldScript support, which allows one-byte or two-byte characters to be processed using standard fonts.
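How software might find the correct form from single keystrokes can be sketched as follows. This is a simplified illustration, not the behaviour of any particular package: it assumes the form depends only on whether a letter connects to its neighbours, it uses the six basic Arabic letters that never connect to the letter after them (a real Jawi system would extend this set), and `choose_forms` is a hypothetical name. It returns form names, leaving glyph lookup to the font.

```python
# Letters that never connect to the letter after them (standard Arabic-script behaviour).
NON_JOINING_AFTER = {
    "\u0627",  # alif
    "\u062f",  # dal
    "\u0630",  # dhal
    "\u0631",  # ra
    "\u0632",  # zay
    "\u0648",  # wau
}


def choose_forms(word):
    """Return one form name per letter of `word`, with letters in typing order."""
    forms = []
    for i, ch in enumerate(word):
        # Joined to the previous letter only if that letter connects forward.
        joins_prev = i > 0 and word[i - 1] not in NON_JOINING_AFTER
        # Joined to the next letter only if one exists and this letter connects forward.
        joins_next = i < len(word) - 1 and ch not in NON_JOINING_AFTER
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms
```

For the three-letter sequence ba, alif, ba (`"\u0628\u0627\u0628"`), the sketch selects the initial, final and isolated forms in turn, because alif does not connect to the letter that follows it.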

When no Jawi font is provided, a user types the Arabic letter most closely resembling the required Jawi character and adds the dots after printing. For example, to enter the letter چ (cha) when the letter itself is not present in the software, one would enter the letter ح (ha), and three dots would then be added for each occurrence of the letter. When a Jawi font is provided, it is a modification of an Arabic font also provided with the software. Usually, the number of Jawi fonts is very small compared to the number of Arabic fonts provided. This method is the one most commonly used by those working with Jawi extensively. Examples of such software are WinText and AlKaatib.

A third Jawi input method being explored by some researchers is the automatic conversion of Rumi text into Jawi. Aside from making it possible to convert old Rumi texts into Jawi, it also enables people who are not up to speed in Jawi writing to create documents in Jawi (*2). This method is not without its flaws and is discussed in the next section.

Rumi to Jawi Conversion

In converting a word from Rumi to Jawi, the word is first broken into phonemes, which are converted to the corresponding phonemes in Jawi and then reassembled as a Jawi word. Figure 2 shows the conversion of left-to-right phonemes in Rumi into equivalent right-to-left phonemes in Jawi.

Figure 2: Phoneme-by-phoneme conversion from Rumi to Jawi.

Table 3 shows some possible mappings from Rumi to Jawi. A Rumi character can map onto several Jawi characters, and several Rumi characters can map onto the same Jawi character.

Table 3: Mapping of Rumi characters to Jawi.
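The one-to-many and many-to-one nature of the mapping can be illustrated with a small lookup table. This is a sketch only: the entries below are an assumed fragment chosen to show both cases and are not a reproduction of Table 3, and `RUMI_TO_JAWI` and `candidate_spellings` are hypothetical names.

```python
from itertools import product

KAF, QAF = "\u0643", "\u0642"
TA, TAH = "\u062a", "\u0637"
WAU = "\u0648"

# Hypothetical fragment of a mapping table, not the paper's Table 3.
RUMI_TO_JAWI = {
    "k": [KAF, QAF],  # one Rumi letter, two possible Jawi letters
    "t": [TA, TAH],
    "o": [WAU],       # two Rumi letters sharing one Jawi letter
    "u": [WAU],
}


def candidate_spellings(rumi_units):
    """List every Jawi string the table allows for a sequence of Rumi units."""
    options = [RUMI_TO_JAWI[unit] for unit in rumi_units]
    return ["".join(choice) for choice in product(*options)]
```

Every ambiguous unit multiplies the number of candidates, so `candidate_spellings(["k", "o", "t"])` yields four strings. A converter that always takes the first candidate is wrong whenever another candidate is the accepted spelling.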

The multiplicity of possible mappings can cause errors in a trivial converter. For example, one source of mapping error is the ambiguity of the letter 'e'. The letter 'e' has three sounds, but for mapping purposes it is classified into two types: e pepet (as in saber) and e taling (as in error). The e pepet does not result in any Jawi character if it appears in the middle of a word, but e taling is converted into the letter ي (ya). See Figure 3 for an example. At the beginning of a word, e pepet is converted into the letter ا (alif), but e taling becomes اي (alif and ya). The best way to identify the correct 'e' is probably dictionary lookup, although this approach is arguably inelegant.

Figure 3: The same word can be spelled differently in Jawi.
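The rule for 'e' combined with dictionary lookup can be sketched as a pair of tables. This is a minimal illustration under stated assumptions: the rule table restates the text (e pepet is unwritten mid-word and becomes alif word-initially; e taling becomes ya mid-word and alif plus ya word-initially), while the four-word lexicon and the names `E_RULES`, `E_KIND_LEXICON` and `jawi_for_first_e` are hypothetical. Each sample word contains exactly one 'e', and word-final 'e' is not covered.

```python
ALIF = "\u0627"
YA = "\u064a"

# (kind of 'e', position in word) -> Jawi letters written for it.
E_RULES = {
    ("pepet", "initial"): ALIF,
    ("pepet", "medial"): "",  # not written at all
    ("taling", "initial"): ALIF + YA,
    ("taling", "medial"): YA,
}

# Hypothetical lexicon recording which 'e' each word uses (illustrative entries).
E_KIND_LEXICON = {
    "emas": "pepet",
    "esok": "taling",
    "besar": "pepet",
    "meja": "taling",
}


def jawi_for_first_e(word):
    """Return the Jawi letters written for the 'e' in `word`."""
    kind = E_KIND_LEXICON[word]  # the dictionary lookup resolves the ambiguity
    position = "initial" if word.startswith("e") else "medial"
    return E_RULES[(kind, position)]
```

The lexicon carries the whole burden here: the spelling of the Rumi word alone does not reveal which kind of 'e' is meant.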

The algorithm to break up a word into phonemes must take into account the possible meanings of a word. For the word raji (divorce) to be properly converted, it should be broken into the phonemes raj-i, but it could theoretically be broken into ra-ji, which yields a meaningless spelling (see Figure 4).

Figure 4: An error in phoneme identification results in incomprehensible output.

The two-letter Rumi combinations 'ng', 'ny', 'gh', 'kh' and 'sy' can each map to an individual Jawi character. Again, incorrect phoneme identification produces incomprehensible results. For example, the word 'busy' in Malay is actually one phoneme, pronounced as bush, whereas at a glance it looks like the English word busy.
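Handling the two-letter combinations can be sketched as a greedy left-to-right split that keeps each listed pair together. This is an assumed approach for illustration, not the authors' algorithm: `split_units` is a hypothetical name, and `keep_apart` is a hypothetical hook for positions where the two letters belong to different phonemes, information that would have to come from a dictionary.

```python
DIGRAPHS = ("ng", "ny", "gh", "kh", "sy")


def split_units(word, keep_apart=()):
    """Split `word` into Rumi units, keeping two-letter combinations together
    unless the index where the pair starts is listed in `keep_apart`."""
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS and i not in keep_apart:
            units.append(pair)  # the pair maps to a single Jawi character
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units
```

For example, `split_units("nyanyi")` gives `['ny', 'a', 'ny', 'i']`, while `split_units("ang", keep_apart={1})` leaves 'n' and 'g' as separate units.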

The Malay language borrows words heavily from Arabic and English. Most of the time, when a word is identified as Arabic-based, it must be spelled exactly as in Arabic. This can be difficult because Arabic has different spelling rules from Malay. When the Arabic word is transliterated into Rumi and then converted back to Jawi, the resulting Jawi word could look different from the Arabic original. A conversion algorithm must not miss any combination and should perform context analysis to find the correct Jawi word. For example, the word riba has two meanings, each with its own Jawi spelling. Derived from the Arabic, riba means bribe and must be spelled as in Arabic. As a Malay word, riba also means lap (of a person) and is spelled differently (see Figure 5).

Figure 5: A Rumi word can result in different Jawi words depending on context.
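One very simple form of context analysis is to pick the sense whose cue words overlap most with the surrounding text. The sketch below illustrates that idea only; it is not the method of any converter described here. The cue words, the two Jawi spellings and the names `SENSES` and `pick_spelling` are all assumptions made for illustration and are not taken from Figure 5.

```python
# Hypothetical sense inventory. Cue words and spellings are illustrative assumptions.
SENSES = {
    "riba": [
        {
            "gloss": "Arabic loanword",
            "cues": {"bank", "hutang", "haram"},
            "jawi": "\u0631\u0628\u0627",
        },
        {
            "gloss": "lap (of a person)",
            "cues": {"duduk", "ibu", "atas"},
            "jawi": "\u0631\u064a\u0628\u0627",
        },
    ]
}


def pick_spelling(word, context_words):
    """Choose the Jawi spelling of the sense sharing the most cue words with the context."""
    context = set(context_words)
    best = max(SENSES[word], key=lambda sense: len(sense["cues"] & context))
    return best["jawi"]
```

When no cue word matches, `max` falls back to the first listed sense, so the order of the senses acts as a default.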

Even though some words adopted from Arabic should follow Arabic spelling rules, other words may undergo transformation as they go from Arabic to Malay. Not surprisingly, a word of Arabic origin may look different when reconverted to Jawi. A common cause is the transformation of the letters ك (kaf) and ق (qaf). These letters most closely match 'k' and 'q' respectively, but both often appear only as 'k' in Malay. Similarly, the letters ت (ta) and ط (ta) are both changed to the letter 't' in Rumi. When converted to Jawi, 't' only appears as ت. See Figure 6 for an example. Here two characters are changed in going from Arabic to Rumi to Jawi. Note also that the letter ا (alif) is removed in the double conversion.

Figure 6: A word originally in Arabic may be written differently in Jawi.

English words that become part of Malay can also be written in Jawi. During adoption, most English words are transliterated by sound into Rumi. From Rumi to Jawi, it is a matter of finding the appropriate phonemes to produce the same sound. Figure 7 shows the English word physical, borrowed into Rumi and then converted into the equivalent sound in Jawi.

Figure 7: Writing a word derived from English in Jawi.

Summary

Text processing in Malay does not have any special requirements as long as the Rumi version is used. This is the case for most documents produced today. For processing Malay documents written in Jawi, several challenges exist. In using standard applications, custom fonts have to be created and data entry operators specially trained. Applications that can handle Jawi must also handle Rumi text, as there is no place for Jawi-only text. Rumi-to-Jawi conversion in text processing applications presents its own challenges.

References

  [1] Daftar Ejaan Rumi-Jawi, Dewan Bahasa dan Pustaka, Kuala Lumpur, 1988.
  [2] ISO Standard 8859, International Standards Organization, 1993.
  [3] ISO Standard 10646, International Standards Organization, 1993.
  [4] Draft Malaysian Standard for Jawi Character Set, Standards and Industrial Research Institute of Malaysia, 1994.
  [5] Muda bin Ariffin, Penterjemah Rumi ke Jawi, BSc Thesis, 1988.