This paper was delivered at the 4th AFSIT, October 24, 1990, Tokyo, Japan.
This paper presents salient features of the Thai Language relevant to the construction of databases which handle the Thai script. Existing standards in Thailand, such as the character coding, keyboard layout and character naming, are presented. Special requirements for Thai Language processing are discussed. A number of state-of-the-art advanced systems are presented: electronic dictionary, thesaurus, phonetic matching, OCR, machine translation and text database system. The outlook for these research projects seems promising.
2 STORAGE OF THAI LANGUAGE DATA
The Thai script is essentially a character-oriented system. A Thai word is written as a "string" of alphabetic symbols. There are 87 symbols defined in the coded-character standard TIS 620-2529 (1986), see Reference 1. The symbols are classified into five groups, namely alphabets (46 symbols), vowels (18 symbols), tonal marks (3 symbols), numerics (10 symbols) and other symbols (8). The small number of symbols in the Thai character set makes it ideal to use single-byte coding to store both Thai and English data on mass storage media and for data communications. The TIS 620-2529 standard permits the use of two coding schemes: one for ASCII style and another for EBCDIC compatibility. Database storage of Thai language information by computers thus presents no technical problems as long as the system supports the use of 8-bit code and there is no conflict between the defined characters and the file system. Figure 1 illustrates the TIS 620-2529 definition of Thai characters in ASCII-extended format. The character names and the coded values are those submitted for ISO-DP/10646 standard by ++. No Thai database so far uses multi-byte coding for Thai characters.
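The single-byte storage described above can be illustrated with a minimal sketch. It assumes that Python's built-in "tis_620" codec is an acceptable stand-in for the ASCII-style TIS 620-2529 table; the function names to_tis620 and from_tis620 are hypothetical and are not part of any standard.

```python
# Sketch only: mixed Thai/English text stored as one byte per character.
# Assumption: Python's built-in "tis_620" codec stands in for the
# ASCII-style coding scheme of TIS 620-2529.

def to_tis620(text: str) -> bytes:
    """Encode mixed Thai/English text using one byte per character."""
    return text.encode("tis_620")


def from_tis620(data: bytes) -> str:
    """Decode bytes that were stored with the single-byte Thai code."""
    return data.decode("tis_620")


# "Thai: " followed by three Thai characters (written as escapes so the
# example does not depend on the encoding of this file).
sample = "Thai: \u0e44\u0e17\u0e22"
stored = to_tis620(sample)

# English characters keep their 7-bit ASCII values, while every Thai
# character lands in the upper half of the 8-bit range.
ascii_part = stored[:6]
thai_part = stored[6:]
```

The stored form has exactly as many bytes as the text has characters, which is why an 8-bit-clean file system or DBMS can hold Thai and English data side by side without multi-byte handling.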
3 THAI DISPLAY AND PRINTER
Handling the visual representation of the Thai Language is not as simple as the storage aspect. The main reason is that Thai writing rules require symbols to be placed at four possible levels on the written line. In a nutshell, the main alphabets are placed on the base line just like English, while some smaller characters in the vowel and tonal mark groups must be placed to the left of, above or below the main alphabet. If a tonal mark follows a vowel and the vowel is already placed above the main alphabet, then the tonal mark must be placed above the vowel. Figure 2 illustrates this nature of Thai writing, taking the display and dot-matrix printing problems as the case. It is typical to use a 3- or 4-pass print method to complete the printing of a Thai line.
On CRT screens, many display schemes have been developed. In the early days, four standard ASCII lines were combined to show a proper Thai line. At present, special display adapters have been developed for showing a full 25 lines of Thai text on a standard CRT such as that of the IBM-PC. New graphic capabilities such as VGA also permit the high-density display of Thai texts by software without requiring any hardware modifications.
It is easily understood that one column of Thai text basically needs four bytes to describe: one each for the super-high, high, middle and lower positions. In practical terms, nobody can afford such a luxurious video RAM size, and many display coding schemes have been developed so that only two bytes are required to describe the display at all four levels. The display aspect ratio of a Thai column is the same as English, not square like the Kanji characters. Most display systems thus require doubling of the VRAM size for displaying a Thai line, while the level or zone switching is handled by hardware. A typical minimum of 20 CRT scan lines is required to show an acceptable Thai display line.
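The four-level column model can be sketched in a few lines of code. This is an illustration under stated assumptions, not one of the display coding schemes referred to above: the character sets use Unicode code points rather than TIS 620-2529 values, they cover only the common upper vowels, lower vowels and tonal marks, and the name to_columns is hypothetical.

```python
# Sketch only: split a Thai string into display columns, each with the
# four levels named in the text (super-high, high, middle, lower).
# Assumption: the sets below are Unicode code points chosen for this
# example; they are not taken from the paper or from TIS 620-2529.

UPPER_VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37")
LOWER_VOWELS = set("\u0e38\u0e39")
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")


def to_columns(text):
    """Return one dict per display column, keyed by level name."""
    columns = []
    for ch in text:
        is_mark = ch in UPPER_VOWELS or ch in LOWER_VOWELS or ch in TONE_MARKS
        if not is_mark:
            # Anything else occupies the base line and opens a new column.
            columns.append(
                {"super_high": None, "high": None, "middle": ch, "lower": None}
            )
            continue
        if not columns:
            raise ValueError("mark entered before any base character")
        col = columns[-1]
        if ch in LOWER_VOWELS:
            col["lower"] = ch
        elif ch in UPPER_VOWELS:
            col["high"] = ch
        elif col["high"] is not None:
            # Tonal mark after an upper vowel: it goes above the vowel.
            col["super_high"] = ch
        else:
            # Tonal mark with no upper vowel: it sits directly above
            # the main alphabet.
            col["high"] = ch
    return columns


# One consonant carrying an upper vowel and a tonal mark: a single
# column with three of its four levels occupied.
stacked = to_columns("\u0e17\u0e35\u0e48")
```

A column held this way needs four slots, which corresponds to the four bytes per column mentioned above; packing the slots into two bytes is the kind of saving the practical display schemes aim for.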
To summarise the basic rules of Thai display, the following statements are given here to highlight our considerations in constructing a Thai display:


The industrial standard keyboard layout, TIS 820-2531 (1988), has been defined (see Figure 3 and Reference 2). The layout typically coexists with the standard QWERTY keyboard for dual-language entry. Texts are entered in the same fashion as on a mechanical typewriter. Typing speed can be as high as 70-80 words per minute.

(1) Vowels which must be placed to the left of the main consonant must be typed first. A word written this way is traditionally entered and stored with the vowel first, but it is linguistically handled with the consonant first and is positioned in the dictionary under the group of that consonant. The sorting algorithm must therefore reposition the vowel behind the consonant before sorting or word comparison. The repositioning is reversed when sorting is done. Thai vowels which fall into this group are เ, แ, โ, ใ and ไ. Thai consonants in each word may consist of one or more alphabets.
(2) One exceptional vowel, which has only one keystroke but shows two associated symbols on the display, must be entered after the tonal mark, if any. In all other cases, tonal marks must follow the associated vowel. In fact, since both upper vowels and tonal marks are dead keys (keys which do not advance the carriage or the cursor), one may enter them in any sequence on the typewriter, but only the vowel-tonal mark sequence is allowed in computers.
Several Thai systems implement keyboard syntax checking in software to help screen out database inconsistencies which may cause searching problems.
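A keyboard syntax check of this kind might look like the following sketch. It enforces only the vowel-tonal mark sequence rule described above; the character sets are Unicode code points assumed for this example, the function name check_sequence is hypothetical, and real systems are likely to check more rules than this.

```python
# Sketch only: reject input in which an upper/lower vowel or a tonal
# mark is entered out of sequence.
# Assumption: the sets and the consonant range are Unicode code points
# chosen for this example, not values taken from the paper.

DEAD_KEY_VOWELS = set("\u0e31\u0e34\u0e35\u0e36\u0e37\u0e38\u0e39")
TONE_MARKS = set("\u0e48\u0e49\u0e4a\u0e4b")


def is_consonant(ch):
    """True for a Thai consonant (empty string is never a consonant)."""
    return "\u0e01" <= ch <= "\u0e2e"


def check_sequence(text):
    """Return the index of the first out-of-sequence character, or -1."""
    prev = ""
    for i, ch in enumerate(text):
        if ch in DEAD_KEY_VOWELS:
            # An upper or lower vowel must directly follow a consonant,
            # so it can never be entered after a tonal mark.
            if not is_consonant(prev):
                return i
        elif ch in TONE_MARKS:
            # A tonal mark follows the consonant or its vowel.
            if not (is_consonant(prev) or prev in DEAD_KEY_VOWELS):
                return i
        prev = ch
    return -1


good = check_sequence("\u0e17\u0e35\u0e48")  # consonant, vowel, tonal mark
bad = check_sequence("\u0e17\u0e48\u0e35")   # consonant, tonal mark, vowel
```

Rejecting the second form at entry time means that two visually identical words cannot be stored as different byte strings, which is the kind of inconsistency that would otherwise defeat an exact-match search.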
5 THAI PROCESSING ALGORITHMS
Most computer applications used in Thailand must at least be able to accept Thai characters. Keyboard data entry is usually done at the terminal OS level. Database processing needs at least one change: the sorting algorithm. In the past, additional programs were used for Thai sorting, but the trend is towards internal modification of the database engine with cooperation from DBMS manufacturers. With internal modification, only the string-compare primitive needs to be changed, since it is the basis for all sorting and indexing mechanisms.
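A Thai-aware string-compare primitive can be sketched as follows. The sketch assumes Unicode code points for the five vowels that are written to the left of the consonant, swaps each such vowel with the single character that follows it, and then compares by code point. It ignores consonant clusters and any special weighting of tonal marks, so it is a simplification rather than a complete Thai collation; the names sort_key and thai_compare are hypothetical.

```python
import functools

# Sketch only: compare Thai strings after moving each leading vowel
# behind the consonant that follows it.
# Assumption: the five code points below are the Unicode values of the
# vowels written to the left of the consonant.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")


def sort_key(word):
    """Return a copy of word with each leading vowel moved one place right."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LEADING_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


def thai_compare(a, b):
    """Three-way compare on the repositioned forms: -1, 0 or 1."""
    ka, kb = sort_key(a), sort_key(b)
    return (ka > kb) - (ka < kb)


words = ["\u0e44\u0e17\u0e22", "\u0e19\u0e01", "\u0e17\u0e32"]
ordered = sorted(words, key=functools.cmp_to_key(thai_compare))
```

Because the repositioning happens only inside the comparison, the stored strings are never altered and no reverse step is needed afterwards. In this example the word that begins with a leading vowel is ordered under its consonant, between the other two words, instead of at the end of the list as a plain code-point sort would place it.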
When full-text handling is involved, as is the case for word processing and text databases, more Thai algorithms are necessary. These algorithms are:
6 ADVANCES IN THAI-LANGUAGE PROCESSING BY COMPUTERS
6.1 Electronic Dictionary System
Full dictionary systems have been developed by Kasetsart University.
One is based on the Royal Academy Thai Dictionary BE2525 and another
is based on Thai-English and English-Thai dictionaries. The systems
were reported in Reference 3. Applications of these systems are:
machine translation, Thai algorithm testing, development of Thai
Thesaurus, and spelling checker.
6.2 Thesaurus Tools
Several researchers have been constructing thesauruses for quite
some time. Tools are now being developed at Thammasat University to
help thesaurus makers. Field-specific thesauruses are also required
by the Ministry of Industry, the Ministry of Commerce and the Ministry
of Science, Technology and Energy to help access text database systems.
6.3 Phonetic Matching and Data Entry
Soundex algorithms for Thai have been available for almost 10 years,
and are being used in many data processing centers including the
computerized directory access system (CDAS) by Telephone Organization
of Thailand. Some further research programs are being carried out at
Thammasat University to make the existing algorithms more efficient
and applicable to high-speed data entry. It is expected that the
results may help find a new data entry method and improve the input
rate about threefold. It may also help executives use computers more
directly, because the difficulties in Thai typing could well be
eliminated. By searching through dictionaries in real time, it is
also expected that spelling errors will be reduced.
6.4 Thai OCR
The King Mongkut's Institute of Technology at Lad Krabang (KMITL) has
been active in this field. The researcher recently demonstrated a
microcomputer package which recognises typewritten characters with
satisfactory accuracy. More work is required to make the system more
flexible and capable. The demand for Thai OCR is well recognised by
the Science and Technology Development Board (STDB) of Thailand.
6.5 Machine Translation
About eight years ago Thailand cooperated with France (University
of Grenoble) in developing a workable ARIANE-based machine
translation system. The project lasted for about three years, and the
results were not satisfactory. More recently, the National
Electronics and Computer Technology Center (NECTEC) and CICC jointly
set up another machine translation project with over twenty
researchers from five universities involved. The system strategy is
indirect transfer using interlingua as an intermediate language.
Several supporting subsystems have been developed to help construct
the Thai grammatical rules, input/output subsystems, and dictionaries.
A collection of recent reports is published by NECTEC (Reference 3).
6.6 Text Database System
Many text databases are already established in Thailand (see the
companion paper in this document, "Major Databases in Thailand" by T.
Koanantakool). Most of them use CDS/ISIS as the DBMS. Other systems
are STAIRS/VS (Ministry of Justice) and Unidas (Ministry of Foreign
Affairs). All text DBMSs have limitations in handling Thai words due
to the characteristics of Thai Language as described above. The
development of a Thai Text DBMS was started at Thammasat University in
1988 to handle the monitoring and evaluation of the Sixth National
Economic and Social Development Plan for the country. The system is
relatively simple and is limited to hierarchically structured
documents such as the Sixth Plan. Simplicity is the most crucial
feature for the success of the project, since the system is to be used
by general users, not computer experts. It was developed in a short
period of only six months.
A more sophisticated and comprehensive text database system has been planned at Thammasat University. The project started in August 1990 and covers several processing strategies, including an intelligent word editor (for 100% word separation of Thai texts), synonym search via a thesaurus, phonetic search via a new phonetic matching scheme, and high-speed data entry methods.
7 CONCLUDING REMARKS
Thailand has always been independent in her history. Thus the Thai Language remains unique and cannot be automatically handled by computers originating from the West. Thailand develops her own technology to adapt database products to handle the information stored in computer databases. National standards exist to help minimize differences in the techniques used for data entry and for storage as coded characters on the media. Processing of texts and data retrieval strategies are subjects of research projects in the country. Although the outlook is good, some more time is required to prove the practicality of new information systems with full and proper handling of the language. The technical know-how is applicable to other Indo-China languages such as Burmese, Lao, Khmer and Sinhala (used in Sri Lanka).
8 REFERENCES: