Special Considerations in
Thai-Language Database Construction

Technical Report

THAWEESAK Koanantakool

Member, Technical Committee for Information Technology Standardization
Thai Industrial Standards Institute (TISI), Ministry of Industry, Thailand;
Associate Director, Information Processing Institute for Education and Development, Thammasat University

This paper was delivered at the 4th AFSIT, October 24, 1990, Tokyo, Japan.


1 INTRODUCTION

This paper presents salient features of the Thai language relevant to the construction of databases that handle the Thai script. Existing standards in Thailand, such as those for character coding, keyboard layout and character naming, are presented. Special requirements for Thai-language processing are discussed. A number of advanced systems are presented: electronic dictionary, thesaurus, phonetic matching, OCR, machine translation and text database systems. The outlook for these research projects seems promising.

2 STORAGE OF THAI LANGUAGE DATA

The Thai script is essentially a character-oriented system. A Thai word is written as a "string" of alphabetic symbols. There are 87 symbols defined in the coded-character standard TIS 620-2529 (1986); see Reference 1. The symbols are classified into five groups, namely alphabets (46 symbols), vowels (18 symbols), tonal marks (3 symbols), numerics (10 symbols) and other symbols (8). The small number of symbols in the Thai character set makes single-byte coding ideal for storing both Thai and English data on mass storage media and for data communications. The TIS 620-2529 standard permits two coding schemes: one in ASCII style and another for EBCDIC compatibility. Database storage of Thai-language information thus presents no technical problems as long as the system supports 8-bit codes and there is no conflict between the defined characters and the file system. Figure 1 illustrates the TIS 620-2529 definition of Thai characters in ASCII-extended format. The character names and the coded values are those submitted for the ISO-DP/10646 standard by ++. None of the Thai databases so far uses multi-byte coding for Thai characters.
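
The single-byte storage described above can be illustrated with a short sketch. It assumes that Python's built-in "tis_620" codec is an acceptable stand-in for the ASCII-style table of the standard; the sample string and the helper name is_thai_byte are illustrative only and do not come from the standard.

```python
# Sketch: mixed Thai/English text stored at one byte per character.
# Assumption: Python's "tis_620" codec stands in for the ASCII-style table.
text = "Thai: ไทย"

encoded = text.encode("tis_620")

# ASCII characters keep their 7-bit values; Thai characters use the upper half.
byte_values = list(encoded)


def is_thai_byte(value: int) -> bool:
    """True if the byte value lies in one of the two assigned Thai ranges."""
    return 0xA1 <= value <= 0xDA or 0xDF <= value <= 0xFB


thai_count = sum(is_thai_byte(v) for v in byte_values)
print(len(text), len(encoded), thai_count)
```

Because every character occupies exactly one byte, the length of the stored record equals the number of characters typed, which is what lets an 8-bit-clean file system or DBMS hold Thai data without modification.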

3 THAI DISPLAY AND PRINTER

Handling the visual representation of the Thai language is not as simple as the storage aspect. The main reason is that Thai writing rules require symbols to be placed at four possible levels on the written line. In a nutshell, the main alphabets are placed on the base line just as in English, while some smaller characters in the vowel and tonal-mark groups must be placed to the left of, above or below the main alphabet. If a tonal mark follows a vowel and the vowel is already placed above the main alphabet, then the tonal mark must be placed above the vowel. Figure 2 illustrates this nature of Thai writing, taking the display and dot-matrix printing problems as the case. It is typical to use a 3- or 4-pass print method to complete the printing of a Thai line.
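
The placement rule above can be sketched as a small layout routine. This is a minimal illustration, not a description of any actual display adapter or printer driver: the names Column, layout and print_passes are hypothetical, the character sets cover only the common upper vowels, lower vowels and four tone marks, and the input is assumed to be in Unicode rather than TIS 620 bytes.

```python
from dataclasses import dataclass

# Assumed (partial) character classes, written as Unicode escapes.
UPPER = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e47")  # signs drawn above the base
LOWER = set("\u0e38\u0e39")                          # sara u, sara uu
TONES = set("\u0e48\u0e49\u0e4a\u0e4b")              # the four tone marks


@dataclass
class Column:
    """One display column with its four vertical levels."""
    middle: str = " "
    lower: str = " "
    high: str = " "
    super_high: str = " "


def layout(text):
    """Assign each character to a level of the current or a new column."""
    columns = []
    for ch in text:
        if ch in UPPER and columns:
            columns[-1].high = ch
        elif ch in LOWER and columns:
            columns[-1].lower = ch
        elif ch in TONES and columns:
            col = columns[-1]
            if col.high == " ":
                col.high = ch        # no upper vowel: tone mark sits at "high"
            else:
                col.super_high = ch  # upper vowel present: tone mark goes above it
        else:
            columns.append(Column(middle=ch))  # base-line character, new column
    return columns


def print_passes(columns):
    """One string per level, top to bottom, as a multi-pass printer might emit."""
    return [
        "".join(c.super_high for c in columns),
        "".join(c.high for c in columns),
        "".join(c.middle for c in columns),
        "".join(c.lower for c in columns),
    ]
```

A word typed as consonant, upper vowel, tone mark occupies a single column with three of its four levels filled, while a leading vowel simply takes a base-line column of its own.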

On CRT screens, many display schemes have been developed. In the early days, four standard ASCII lines were combined to show a proper Thai line. At present, special display adapters are available for showing a full 25 lines of Thai text on a standard CRT such as that of the IBM PC. New graphic capabilities such as VGA also permit high-density display of Thai text by software without requiring any hardware modifications.

It is easy to see that one column of Thai text basically needs four bytes to describe: one each for the super-high, high, middle and lower positions. In practical terms, nobody can afford such a luxurious amount of video RAM, and many display coding schemes have been developed so that only two bytes are required to describe the display at all four levels. The display aspect ratio of a Thai column is the same as for English, not square like the Kanji characters. Most display systems thus require doubling of the VRAM size for displaying a Thai line, while the level or zone switching is handled by hardware. A typical minimum of 20 CRT scan lines is required to show an acceptable Thai display line.

To summarise the basic rules of Thai display, the following statements are given here to highlight our considerations in constructing a Thai display:


Figure 1 Thai Standard Character Code, TIS 620-2529 (1986)


Figure 2 Thai display and printing convention

4 THAI DATA ENTRY

The industrial standard keyboard layout, TIS 820-2531 (1988), has been defined (see Figure 3 and Reference 2). The layout typically coexists with the standard QWERTY keyboard for dual-language entry. Text is entered in the same fashion as on a mechanical typewriter. Typing speed can be as high as 70-80 words per minute.


Figure 3 Thai Keyboard Layout, TIS 820-2531 (1988)

Data entry using the mechanical typewriter paradigm in some cases violates the standard linguistic rules and causes processing overheads in database searching. These cases are:

(1) Vowels which must be placed at the left of the main consonant must be typed first, e.g. is traditionally entered and stored as and but is linguistically handled as is positioned in the dictionary under the group . The sorting algorithm must reposition the vowel behind the consonant before sorting or word comparison. The repositioning is reversed once sorting is done. Thai vowels which fall into this group are เ, แ, โ, ใ and ไ. The consonant part of a Thai word may consist of one or more letters.

(2) One exceptional vowel ( , ), which has only one keystroke but shows two associated symbols on the display, must be entered after the tonal mark, if any. In all other cases, tonal marks must follow the associated vowel. In fact, since both upper vowels and tonal marks are dead keys (keys which do not advance the carriage or the cursor), one may enter them in any sequence on the typewriter, but only the vowel-tonal mark sequence is allowed in computers.

Several Thai systems implement keyboard syntax checking in software to help screen out database inconsistencies which may cause searching problems.
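
A keyboard syntax check of the kind mentioned above might look like the following sketch. It enforces only the ordering rule from case (2), namely that a tonal mark may not be typed before an upper or lower vowel. The function name check_sequence and the character sets are assumptions made for illustration; real systems are likely to check more rules than this.

```python
# Assumed character classes (Unicode escapes); partial, for illustration only.
UPPER_LOWER_VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39")
TONES = set("\u0e48\u0e49\u0e4a\u0e4b")


def check_sequence(text):
    """Return the indexes of tonal marks typed before an upper or lower vowel.

    The exceptional vowel of case (2) is not in UPPER_LOWER_VOWELS, so a
    tonal mark followed by it is accepted.
    """
    errors = []
    for i in range(len(text) - 1):
        if text[i] in TONES and text[i + 1] in UPPER_LOWER_VOWELS:
            errors.append(i)
    return errors


# Consonant, upper vowel, tonal mark: accepted.
print(check_sequence("\u0e17\u0e35\u0e48"))
# Consonant, tonal mark, upper vowel: index of the misplaced tonal mark reported.
print(check_sequence("\u0e17\u0e48\u0e35"))
```

Rejecting the second form at entry time means that two visually identical words cannot end up stored as different byte strings, which is the kind of inconsistency that breaks exact-match searching.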

5 THAI PROCESSING ALGORITHMS

Most computer applications used in Thailand must at least be able to accept Thai characters. Keyboard data entry is usually done at the terminal OS level. Database processing needs at least one change: the sorting algorithm. In the past, additional programs were used for Thai sorting, but the trend is towards internal modification of the database engine with cooperation from DBMS manufacturers. With internal modification, only the string-compare primitive operation needs to be changed, since it is the basis for all sorting and indexing mechanisms.
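
A modified string-compare primitive can be sketched as follows. The sketch moves each leading vowel behind the single consonant that follows it and then compares code points. It is a simplification under stated assumptions: consonant clusters and the secondary role of tonal marks are ignored, the input is Unicode text, and the names reposition and thai_compare are hypothetical rather than taken from any DBMS.

```python
from functools import cmp_to_key

# Vowels that are typed before, but sorted after, the consonant.
LEADING_VOWELS = set("เแโใไ")


def reposition(word):
    """Swap each leading vowel with the character that follows it."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


def thai_compare(a, b):
    """Compare primitive: negative, zero or positive, like C's strcmp."""
    ka, kb = reposition(a), reposition(b)
    return (ka > kb) - (ka < kb)


words = ["ไก่", "ขา", "เก", "กา"]
print(sorted(words))                                # plain code-point order
print(sorted(words, key=cmp_to_key(thai_compare)))  # vowel repositioned first
```

Because the repositioned form is used only inside the comparison, the stored strings stay in typed order and no separate reversal step is needed in this sketch.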

When full-text handling is involved, as is the case for word processing and text databases, more Thai algorithms are necessary. These algorithms are:

This may sound strange, but it is interesting to know that we do not put any space between words! The problem resembles word extraction in a continuous speech recognition system. A few research projects are being carried out at various universities to solve these problems. At present, most PC-based word processors, DBMSs and spreadsheets can perform Thai text manipulation to a satisfactory level. However, some research projects are more ambitious and futuristic, and many advanced systems have been crafted. These include electronic dictionaries, thesaurus tools and electronic thesauruses, phonetic matching and database searching methods, Thai optical character recognition, machine translation and text database systems. Some details of these are described below.
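
The missing-space problem described above can be illustrated with a greedy longest-match segmenter driven by a word list. This is a common baseline technique, shown here as an assumption; it is not claimed to be the method used in the research projects mentioned. The tiny DICTIONARY and the function name segment are hypothetical.

```python
# A toy word list; a real system would load a full dictionary.
DICTIONARY = {"ฉัน", "ไป", "โรง", "เรียน", "โรงเรียน"}


def segment(text, dictionary):
    """Split text into words by always taking the longest dictionary match.

    Characters that start no known word are emitted one at a time.
    """
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            # No dictionary word starts here: emit a single character.
            words.append(text[i])
            i += 1
    return words


print(segment("ฉันไปโรงเรียน", DICTIONARY))
```

The greedy choice prefers the compound over its two shorter components, which is often but not always the intended reading; that ambiguity is one reason word separation remains a research topic.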

6 ADVANCES IN THAI-LANGUAGE PROCESSING BY COMPUTERS

6.1 Electronic Dictionary System
Full dictionary systems have been developed by Kasetsart University. One is based on the Royal Academy Thai Dictionary BE2525 and another is based on Thai-English and English-Thai dictionaries. The systems were reported in Reference 3. Applications of these systems include machine translation, Thai algorithm testing, development of a Thai thesaurus, and spelling checking.

6.2 Thesaurus Tools
Several researchers have been constructing thesauruses for quite some time. Tools are now being developed at Thammasat University to help thesaurus makers. Field-specific thesauruses are also required by the Ministry of Industry, the Ministry of Commerce and the Ministry of Science, Technology and Energy to help access text database systems.

6.3 Phonetic Matching and Data Entry
Soundex algorithms for Thai have been available for almost 10 years and are used in many data processing centers, including the computerized directory access system (CDAS) of the Telephone Organization of Thailand. Further research is being carried out at Thammasat University to make the existing algorithms more efficient and applicable to high-speed data entry. It is expected that the results may help find new data entry methods and improve the input rate about threefold. They may also help executives use computers more directly, because the difficulties of Thai typing could well be eliminated. By searching through dictionaries in real time, it is also expected that spelling errors will be reduced.
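
The general idea of phonetic matching can be shown with a deliberately small sketch. It is not the Thai Soundex algorithm used in CDAS or elsewhere, whose details are not given here. The consonant classes below are a partial, assumed grouping of letters that share an initial sound, and the names SOUND_CLASSES and phonetic_key are hypothetical.

```python
# Assumed, partial grouping: each key is the representative of its class.
SOUND_CLASSES = {
    "ข": "ขฃคฅฆ",
    "ส": "ซศษส",
    "ท": "ฐฒถทธ",
    "พ": "ผพภ",
    "น": "ณน",
    "ล": "ลฬ",
    "ย": "ญย",
}
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")

# Map every member of a class to its representative letter.
CANONICAL = {ch: rep for rep, members in SOUND_CLASSES.items() for ch in members}


def phonetic_key(word):
    """Replace consonants by their class representative and drop tonal marks."""
    return "".join(CANONICAL.get(ch, ch) for ch in word if ch not in TONE_MARKS)


def sounds_alike(a, b):
    """True if two spellings collapse to the same key under this sketch."""
    return phonetic_key(a) == phonetic_key(b)


print(sounds_alike("ศักดิ์", "สักดิ์"))
```

Indexing a name file by such a key lets a directory operator retrieve candidate entries without knowing which of several same-sounding consonants the original spelling used.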

6.4 Thai OCR
King Mongkut's Institute of Technology at Lad Krabang (KMITL) has been active in this field. Researchers there recently demonstrated a microcomputer package which recognises typewritten characters with satisfactory accuracy. More work is required to make the system more flexible and capable. The demand for Thai OCR is well recognised by the Science and Technology Development Board (STDB) of Thailand.

6.5 Machine Translation
About eight years ago, Thailand cooperated with France (University of Grenoble) in developing a workable ADRIANE-based machine translation system. The project lasted for about three years, and the results were not satisfactory. More recently, the National Electronics and Computer Technology Center (NECTEC) and CICC jointly set up another machine translation project involving over twenty researchers from five universities. The system strategy is indirect transfer using an interlingua as the intermediate language. Several supporting subsystems have been developed to help construct the Thai grammatical rules, input/output subsystems, and dictionaries. A collection of recent reports is published by NECTEC (Reference 3).

6.6 Text Database System
Many text databases are already established in Thailand (see the companion paper in this document, "Major Databases in Thailand" by T. Koanantakool). Most of them use CDS/ISIS as the DBMS. Other systems are STAIRS/VS (Ministry of Justice) and Unidas (Ministry of Foreign Affairs). All text DBMSs have limitations in handling Thai words due to the characteristics of the Thai language described above. The development of a Thai text DBMS was started at Thammasat University in 1988 to handle the monitoring and evaluation of the Sixth National Economic and Social Development Plan for the country. The system is relatively simple and is limited to hierarchically structured documents such as the Sixth Plan. Simplicity is the most crucial feature for the success of the project, since the system is to be used by general users, not computer experts. It was developed in a short period of only six months.

A more sophisticated and comprehensive text database system has been planned at Thammasat University. The project started in August 1990 and covers several processing strategies, including an intelligent word editor (for 100% word separation of Thai texts), synonym search via a thesaurus, phonetic search via a new phonetic matching scheme, and high-speed data entry methods.

7 CONCLUDING REMARKS

Thailand has always been independent throughout her history. Thus the Thai language remains unique and cannot be automatically handled by computers originating from the West. Thailand develops her own technology to adapt database products to handle the information stored in computer databases. National standards exist to help minimize differences in the techniques used for data entry and for storage as coded characters on the media. Text processing and data retrieval strategies are subjects of research projects in the country. Although the outlook is good, more time is required to prove the practicality of new information systems with full and proper handling of the language. The technical know-how is applicable to other Indo-China languages such as Burmese, Lao and Khmer, and to Sinhala (used in Sri Lanka).

8 REFERENCES

  1. Thai Industrial Standards Inst., TIS 620-2529 (1986), "Standard for Thai Character Code for Computer", ISBN 974-8113-58-2, Ministry of Industry, Bangkok, Thailand.
  2. Thai Industrial Standards Inst., TIS 820-2531 (1988), "Standard for Layout of Thai Character Keys on Computer Keyboards", ISBN 974-8126-00-5, Ministry of Industry, Bangkok, Thailand.
  3. National Electronics and Computer Technology Center, "Research and Development Projects in the Fiscal Year 2532", Conference Proceedings (2 volumes), 15-16 August, 1990, Ministry of Science, Technology and Energy, Bangkok, Thailand.
  4. Koanantakool, T., and Koraksawet, S., "Common Specifications for Thai-Language Applications Development: Thai API Project", Research Report submitted for Ministry of Science, Technology and Energy, 16 August 1990.