COMPUTERIZATION OF NATIONAL AND ETHNIC SCRIPTS

Country Report - Vietnam

 

Ngo Trung VIET

Vice Chairman of IT Standard Sub-Committee (ITSC)

STEERING COMMITTEE FOR THE NATIONAL PROGRAM ON IT

Email: vietnt@itnet.gov.vn

 

James DO

Vice President,

Vietnamese Nom Preservation Foundation

www.NomFoundation.org

nomfdn@mail.com


1. Scripts in Vietnam

Vietnam is a country with many ethnic groups, and its culture has mixed with many other cultures in the region throughout its history. In fact, the cultures of many neighboring countries still exist inside Vietnam as ethnic cultures, apart from those which have already merged into Vietnamese culture. From this point of view, the Chinese, Thai, Khmer, and Champa cultures clearly manifest themselves, particularly through their scripts; there are further ethnic groups besides these, but they are small in number and in general do not have their own scripts. The 1999 population is about 76 million people, of whom 13 percent belong to ethnic minorities, including three million Thai speakers, one million Khmer, one million Hoa (Chinese), and 200 000 Cham. The main scripts are: Quoc ngu and Chu Nom for the Vietnamese, Tai (for the Thai), Khmer, Cham, and Chinese; the remaining ethnic groups transcribe their own languages with Quoc ngu.

Quoc ngu is a Latin-based script with additional tone marks, developed in the XVI century and used as the national script since 1920. Before 1920, Chu Han and Chu Nom -- both using Han ideographs -- were the national scripts. Chu Han was used from the first century to the X century, after which Chu Nom was developed to better represent spoken Vietnamese.

The Tai script, according to some sources, first appeared around the X century, and belongs to the Pali family. Used mainly in northwest Vietnam, and influenced in modern times by the Quoc ngu script, the Tai script has been transformed, and differs slightly from the Thai script currently used in Thailand.

The Khmer script is used in the Khmer community of southern Vietnam, and is the same as the script in Cambodia.

Chinese characters are used by the Hoa community, in southern Vietnam.

Cham script is used by the Cham community in central and southern Vietnam.

The scripts in Vietnam today belong to many families of scripts in the world: ideographs, Latin, Pali, …

2. Policies on computerization of national scripts

The right of each ethnic group to their own script as well as their culture is enshrined in the Constitution. Quoc ngu is the national script, and is taught to all children. But ethnic children have the right to learn their own script at primary school, in addition to the national script. For ethnic groups without their own script, their writing system is created based on Quoc ngu. If an ethnic group has many script dialects, efforts have been made to unify them.

With the wide availability of microcomputers, the introduction of national and ethnic scripts into computers became a hot topic. Much effort has been devoted to computerizing some of the main scripts in Vietnam. The National Informatics Programs during the 1980s played an important role in promoting and encouraging practical activities in the computerization of Quoc ngu -- principally by IT companies, universities and research institutes.

The computerization of national and ethnic scripts became a policy of IT development when the two organizations on IT standardization were set up: the Technical Committee on Information Technology (TCVN/JTC1) in 1993 and Information Technology Standards Subcommittee (ITSC) in 1995. Some key policies for computerization of national and ethnic scripts are:

 

3. Current status of computerization of scripts

With policies on computerization of scripts, much effort has been devoted in the past several years to develop code tables and the necessary software for the main scripts in Vietnam. The promotion of this software is an important and difficult training issue, especially for the ethnic people. Parallel to the encoding issues, work on storing ancient documents in some old scripts is needed as soon as possible to preserve endangered scripts and cultures.

3.1 Quoc ngu script

Processing of Vietnamese Quoc ngu in computers has been of the greatest interest to IT specialists in Vietnam since 1980, when microcomputers became available. From 1985 to 1991, with the increasing popularity of microcomputers, many research groups invested in developing Quoc ngu character code sets for computers. Because of the inherent limitation of the 8-bit code space, there was and continues to be intense debate as to which encoding method should be used for Quoc ngu: precomposed characters or combining characters.

The Unicode standard, with the combining method as a key element in its design, presaged the first national standard on IT -- TCVN 5712:1993 for an 8-bit character code set -- which was a compromise between the elegance of the combining method and the practice of precomposed characters, made necessary by the limitations of the existing system and application software. TCVN 5712:1993 actually consists of two tables: VN1 with both combining and precomposed characters (see Table 1), and VN2 with only combining characters.
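
The difference between the two encoding methods can be illustrated with present-day Unicode normalization. The sketch below is not taken from TCVN 5712:1993; it assumes Python's standard unicodedata module and uses the Unicode normalization forms NFC (precomposed) and NFD (fully decomposed) only as an analogy for the precomposed and combining methods. The helper name describe is hypothetical.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The Quoc ngu letter "e with circumflex and dot below",
# typed as a base letter plus two combining marks.
typed = "e\u0302\u0323"

# Precomposed method: the letter and its marks become one code point.
precomposed = unicodedata.normalize("NFC", typed)

# Combining method: the base letter is followed by separate mark characters.
combining = unicodedata.normalize("NFD", precomposed)

print(describe(precomposed))
print(describe(combining))
```

Both strings display as the same letter; only the stored sequence of code points differs, which is why software has to agree on one method or convert between them.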

In cooperation with Microsoft, IBM, and Canadian experts, ITSC proposed an 8-bit Vietnamese Quoc ngu code set using only the combining method, with strict adherence to ISO 8859 encoding methodology. This allows Vietnamese data to co-exist with the Western European scripts in an 8-bit environment. With some minor differences, this code set has been implemented in Windows Vietnamese by Microsoft (Code page cp1258), and in AIX by IBM (Code page cp1129).
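
As a rough illustration of what the combining method means in practice, the following Python sketch encodes a word into code page cp1258 using Python's built-in cp1258 codec. It assumes that the five tone marks are stored as combining characters after a base letter that already carries any circumflex, breve, or horn; the function names split_tones, encode_cp1258, and decode_cp1258 are hypothetical helpers, not part of any standard or product.

```python
import unicodedata

# The five Quoc ngu tone marks as Unicode combining characters:
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}

def split_tones(text):
    """Rewrite text so each tone mark follows its base letter as a combining character."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if c not in TONE_MARKS)
        tones = "".join(c for c in decomposed if c in TONE_MARKS)
        # Recompose circumflex, breve, or horn into the base letter; keep tones separate.
        out.append(unicodedata.normalize("NFC", base) + tones)
    return "".join(out)

def encode_cp1258(text):
    """Encode text to cp1258 bytes after separating the tone marks."""
    return split_tones(text).encode("cp1258")

def decode_cp1258(data):
    """Decode cp1258 bytes and recompose to precomposed Unicode characters."""
    return unicodedata.normalize("NFC", data.decode("cp1258"))

word = "Vi\u1ec7t"  # "Viet" with circumflex and dot below on the e
encoded = encode_cp1258(word)
print(len(word), len(encoded))
print(decode_cp1258(encoded) == word)
```

The encoded form is one byte longer than the number of letters because the tone mark occupies its own byte; encoding the precomposed letter directly fails, since the code page has no single position for it.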

In 1995, the Steering Committee for the National Program on IT (SCNPIT) funded the development of system add-in software to promote the use of TCVN 5712:1993 in administrative bodies. This software -- ABC, consisting of a keyboard driver and an extensive set of Vietnamese fonts -- is now widely used throughout the country as well as abroad, particularly in Web sites.

The crucial difficulty for 8-bit character code sets is that the scripts cannot co-exist. This problem disappears with 16-bit character code sets.
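
The co-existence problem can be seen by decoding one byte under several 8-bit code pages. The following Python sketch uses three code pages shipped with Python's standard codecs (cp1252 for Western European, cp1258 for Vietnamese, and cp874 for the Thai script of Thailand); these were chosen only for illustration and do not cover the ethnic scripts discussed in this report. The function name interpretations is hypothetical.

```python
# One byte value, interpreted under three different 8-bit code pages.
BYTE = b"\xf2"
CODE_PAGES = ["cp1252", "cp1258", "cp874"]  # Western European, Vietnamese, Thai

def interpretations(data, code_pages):
    """Map each code page name to the text it yields for the same bytes."""
    return {cp: data.decode(cp) for cp in code_pages}

for cp, text in interpretations(BYTE, CODE_PAGES).items():
    print(cp, [f"U+{ord(ch):04X}" for ch in text])
```

Because the byte itself carries no indication of its code page, a document mixing scripts needs out-of-band information about which table applies to which stretch of text.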

In the 16-bit environment, all Vietnamese Quoc ngu characters are encoded in Unicode and ISO/IEC 10646, but they are scattered across several pages (see Table 2 for the mapping between TCVN 5712 VN1 and Unicode). Quoc ngu, Chu Han, Chu Nom, Tai, Khmer, Cham, and other scripts can co-exist without confusion or code conflict.
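
The scattering of Quoc ngu letters can be checked against the Unicode block ranges. The Python sketch below groups a few sample letters by block; the block boundaries are those of the Unicode standard, while the sample string and the function names block_of and group_by_block are illustrative assumptions. It does not reproduce the TCVN 5712 VN1 mapping of Table 2.

```python
from collections import defaultdict

# Unicode blocks that contain Latin letters used by Quoc ngu.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0180, 0x024F, "Latin Extended-B"),
    (0x1E00, 0x1EFF, "Latin Extended Additional"),
]

def block_of(ch):
    """Return the name of the listed block containing ch, or 'other'."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"

def group_by_block(text):
    """Group the characters of text by Unicode block, preserving order."""
    groups = defaultdict(list)
    for ch in text:
        groups[block_of(ch)].append(ch)
    return dict(groups)

# a, a-grave, a-breve, d-stroke, o-horn, u-horn, e-circumflex-dot-below
sample = "a\u00e0\u0103\u0111\u01a1\u01b0\u1ec7"
for block, chars in group_by_block(sample).items():
    print(block, [f"U+{ord(c):04X}" for c in chars])
```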

The availability of system and application software that fully support Unicode -- such as Windows 2000, Office 2000, Internet Explorer, Netscape Navigator, ... -- presents the opportunity for Vietnam to move beyond the limitations of 8-bit character code sets and fonts, into an environment which supports multiple scripts in an international standard scheme, backed by modern technology and robust industrial sales and support.

3.2 Chu Nom ideographs

When voting in favor of ISO/IEC 10646 in 1992, Vietnam expressed its intention to include Chu Nom ideographs in the international standard -- a proposal made in 1992. Two national standards on Chu Nom were developed: TCVN 5773:1993 and TCVN 6056:1995. Subsequently, in cooperation with Adobe, some errors were detected in these standards, and a request for correction was made in 1998.

Vietnam actively participates in the Ideographic Rapporteur Group (IRG), formed in 1993, to concentrate on the ideograph (CJKV) repertoire for East Asian countries. In 1994, IRG accepted Vietnam’s initial request to include Chu Nom into the repertoire.

In 1995, Vietnam’s request to include more than 4000 Nom characters in the V-column of ISO/IEC 10646 was approved by SC2/WG2.

In 1998, Vietnam requested to include more than 4000 proper Nom characters into the repertoire and ISO/IEC 10646, Part 2 -- bringing the number of Nom characters to 8210.

In 1999, Vietnam requested to include 1090 Nom characters, already found in the Kangxi Dictionary, into the repertoire. To date, the total number of Nom characters included and coded in ISO/IEC 10646 is 9300.

The great problem for Chu Nom has been software. Consequently, the availability of system and application software that support Unicode brings the processing of Chu Nom in computers much, much closer to reality. The emphasis can now shift toward developing databases of old texts and documents in Chu Nom and Quoc ngu, making them available to everybody on the Internet.

 

3.3 Tai script

The Tai script in Vietnam has existed in many dialects, making it difficult to popularize throughout the community. In the 1950s, publishing documents and books in so many dialects was impossible with primitive printing techniques. Consequently, pressure within the community to unify the many dialects was very strong. In 1958 and 1962, many seminars were held to devise a new unified Tai script; as a result, several new components were introduced as tone marks and new vowels to simplify the writing.

In 1962, the Ministry of Education issued a directive approving the new unified Tai script as the official script for the Thai ethnic group for use in primary schools. However, because of the war, no initiative was undertaken to put this new script into widespread use. Now, the demand for a Tai script for education and economic development has intensified within the Thai community. Today, the computerization of the Tai script has become more feasible.

In 1999, an 8-bit character code set for the Tai script was developed (see Table 3), and a proposal to adopt it as a national standard has been submitted to the national standards body. Concurrently, a proposal was submitted to the ISO/IEC SC2/WG2 to include the Tai script in ISO/IEC 10646. Because of the similarities between writing systems for the Thai script, there is a tendency toward unification in order to facilitate communications between these different writing systems. Such unification may occur not at the character level but at the phonetic level, so that mapping tables can be established and a transcribing mechanism can be found.

Fonts and keyboard drivers for the Tai script have already been developed to print textbooks for children. Some ancient documents of the Thai ethnic groups can now be computerized.

 

3.4 Cham script

Because there are many more university researchers working on the Cham script than on Tai, the computerization of the Cham script began earlier. In 1994, fonts and keyboard drivers for Cham were developed at the Center of Eastern Studies, University of HoChiMinh City, where a Cham - Viet dictionary was published using this software.

In 1994, the first effort to create an 8-bit national character encoding standard for the Cham script was made; however, no further progress has been achieved due to a lack of the necessary expertise. To overcome this difficulty, efforts are being made at the international level to draw the attention of experts worldwide. In 1994, a proposal to include the Cham script in ISO/IEC 10646 was also submitted to the ISO/IEC SC2/WG2. Discussions on the Cham script between international experts have ensued, and some progress has since been made (see Table 4 for the latest proposal on the Cham script).

 

3.5 Khmer script

Although the Khmer script is still used in the Khmer community in southern Vietnam, the lack of experts precludes any effort regarding this script. However, since the script is the same as that in Cambodia, software already available can readily be used.

3.6 Chinese characters

Chinese characters are taught in primary schools, and used in the Hoa community, especially in the south and Hochiminh City. Popular software packages imported from China and Taiwan preclude the need to develop software locally.

4. Some issues

From our experience in the computerization of national and ethnic scripts, the main difficulties are the lack of specialists in both scripts and computers, and the lack of an organization to support and fund the necessary research. These two difficulties are almost impossible to overcome at the national level, especially for developing countries. When computerization affects the many people who use the official script, and the market is correspondingly larger, it is easier to realize. But for ethnic groups with a small population, the perceived market is even smaller than the market for the official script, which itself attracts few IT companies. Only concerted and persistent action from governments and other R&D organizations can help in this endeavor. Frameworks for these activities, such as the MLIT Symposiums, are very helpful in encouraging governments in the region to take the first and important step.

The second step of this computerization is the development of software supporting the scripts already encoded in computers: fonts and keyboard drivers. Many local IT groups strive to solve these issues but, in general, their efforts are ad hoc, not in concert with worldwide industrial development, and therefore cannot benefit from modern technology and international standards. Long-term success for the computerization of national and ethnic scripts requires that: (1) the scripts are accepted into international standards, and (2) the scripts are implemented, by means of the international standards, by the main IT companies in the widely accepted I18N and L10N framework. Developing countries need all the necessary support in these two parallel sets of activities.

The third step is to educate and train end users to use the script in computers. This step is much more difficult for ethnic groups living in distant regions with inadequate facilities. A system of training and education centers at many levels -- international, national, regional -- is needed to bring modern technology to rural areas.

Technology can help save and preserve endangered languages and scripts. Beyond this, once scripts are computerized, education and economic development give the ethnic people the opportunity to overcome poverty and backwardness, and to adapt to modern life. Without such effort to rapidly apply IT, many languages with small numbers of users will inevitably disappear forever.

In this quest for the preservation of ethnic cultures and languages, international cooperation is a critical factor.

 

Hanoi, 15 August 1999




Table 1. National Standard TCVN 5712:1993 VN1
Table 2. Mapping TCVN 5712:1993 (VN1) to ISO 10646/Unicode
Table 3. Code Table for Viet Tai script
Table 4. Code Table for Cham script