Country Report

Current status of Multilingual Information Technology in China

Dr. Li Li
China Standardization and Information Classifying & Coding Institute
bjeatrc@public3.bta.net.cn


Abstract
The paper introduces the languages used in the mainland of China and the status of the spread of computers and the Internet, and lists the relevant national standards on character sets. It also introduces the development of terminological data banks, machine translation and UNL technology, together with their products related to multilingual information technology. Finally, the author considers the need for multilingual information technology in China to be enormous, chiefly the need for system software and translation software that support multiple languages.

Contents
1 Languages and their users in China

2 Rapid growth of China Computer market

3 Internet in China

4 Character sets and their application system

5 Terminology and terminological database (or term bank)

6 Machine translation

7 UNL

8 Summary

9 Acknowledgment


1. Languages and their users in China
China is the world's most populous country, with 1.3 billion people, and is a multinational and multilingual country. Today China has 56 nationalities, including the Han, Hui, Uighur, Manchu, Mongolian, Korean, etc. The Han have their own language, Chinese, and a few nationalities, such as the Hui and the Manchu, use the Han language as their own. Over 30 nationalities (about 20 million people) also have their own written languages; among these, Mongolian, Tibetan, Uighur, Kazak and Korean have the most users. In addition, with the deepening of international exchange and cooperation and the rapid development of information technology, English has become the first foreign language in China, and German, Japanese, French and Russian are the other widely used foreign languages.


2. Rapid growth of China Computer market
With the spread of information technology and the steady decline in computer prices, the number of computers in the mainland of China is growing rapidly: from 500 thousand sets at the end of 1990 to 5.5 million in 1996 and 13 million in 1997. It is estimated that by the year 2000 the number will reach 20-30 million sets, with a penetration rate of 15%.

The main growth point for computer sales will be household computers. At present there are 1.6 million household computers, and their penetration rate in cities is about 2.5%. Big cities account for 85% of the total number of household computers, and medium and small cities for 14.5%. Among cities, Shenzhen has the highest penetration rate, at 19.8%; Guangzhou, Beijing and Shanghai have reached 10.4%, 8.7% and 6.6% respectively. Among the computers sold, IBM ranks first, accounting for 7.5%, followed by "Compaq", "Lianxiang" and "Greatwall" in that order. Demand for household computers is growing steadily, and it is estimated that in 1998 it will increase by 7% compared with the previous year.

According to statistical data, most users of household computers are well educated (having at least graduated from secondary school), because knowledge of a foreign language is needed to use computers. If this problem were solved, the number of computer users in the mainland of China would increase greatly. Thus the development of multilingual information technology would greatly stimulate China's computer market.


3. Internet in China
With the spread of computers, the Internet in China is developing rapidly: the number of Internet users increased from 200-300 to 620 thousand by the end of 1997. By the end of June 1998 it had reached 1.175 million, including 325 thousand users with direct connections and 850 thousand dial-up users. Of these users, 25.3% are in Beijing, 11.5% in Guangdong province and 7.8% in Shanghai. In the mainland of China there are 540 thousand networked computers, including 82 thousand with direct connections and 460 thousand with dial-up connections, accounting for 4.1% of all PCs owned. 4.5% of network users consider that they obtain less information from the Internet because satisfactory multilingual translation tools are lacking.



Fig. 1 Internet users (million) in China
Fig. 2 Internet users (million) in China

There are many well-known ISPs in the mainland of China, including CERNET, CSTNet, ChinaNet, ChinaGBN, CEInet, STI-net, CMINET, CENPOK, Beijing-online, Shanghai-hotline, Oriental Net scenery, etc.


4. Character sets and their application system
4.1 China National standards on character sets

Based on GB1988 (ISO 646) and GB2311 (ISO 2022), the mainland of China has developed six coded Chinese character sets for information interchange (Fig. 3). The GB2312 "Primary set" has 7445 graphic characters, comprising 6763 Chinese characters and 682 other graphic characters; its counterpart in the original complex forms is GB/T12345, which adds 103 more Chinese characters. GB/T7589 has 7237 Chinese characters, and its counterpart is GB/T13131. GB/T7590 has 7039 Chinese characters, and its counterpart is GB/T13132. In total, the coded Chinese character sets of the mainland of China now cover 21039 simplified Chinese characters and, correspondingly, 21142 Chinese characters in their original complex forms.

Number     Name                                                                                  Date  Chinese characters
GB2312     Code of Chinese graphic character set for information interchange - Primary set       1980  6763
GB/T7589   Code of Chinese ideogram set for information interchange - The 2nd supplementary set  1987  7237
GB/T7590   Code of Chinese ideogram set for information interchange - The 4th supplementary set  1987  7039
GB/T12345  Code of Chinese ideogram set for information interchange - Supplementary set          1990  6866
GB/T13131  Code of Chinese ideogram set for information interchange - The 3rd supplementary set  1991  7235
GB/T13132  Code of Chinese ideogram set for information interchange - The 5th supplementary set  1991  7039

Fig. 3 List of China's national standards on Chinese character sets
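
The counts quoted above for the simplified-character sets can be cross-checked with a few lines of Python. This is only an arithmetic sketch: the figures are those given in the text and in Fig. 3, and the dictionary and variable names are illustrative assumptions.

```python
# Chinese-character counts of the simplified-form sets, as given in Fig. 3.
simplified_sets = {
    "GB2312": 6763,
    "GB/T7589": 7237,
    "GB/T7590": 7039,
}

# GB2312 also contains 682 graphic characters that are not Chinese characters.
gb2312_total = simplified_sets["GB2312"] + 682

# GB/T12345 is described as GB2312's Chinese characters plus 103 more.
gb12345_chinese = simplified_sets["GB2312"] + 103

# Total number of simplified Chinese characters across the three sets.
simplified_total = sum(simplified_sets.values())

print(gb2312_total)      # 7445
print(gb12345_chinese)   # 6866
print(simplified_total)  # 21039
```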

GB13000.1, "Information technology - Universal multiple-octet coded character set (UCS) - Part 1: Architecture and basic multilingual plane" (ISO/IEC 10646-1), was prepared in 1993 and contains 20902 Chinese characters.

In addition, China's national standards on coded character sets for the Mongolian, Uighur, Korean and Yi languages had also been developed by 1993 (Fig. 4).

Number   Name                                                                                           Date
GB8045   Mongolian 7-bit and 8-bit coded graphic character sets for information processing interchange  1987
GB12050  Information processing - Uighur coded graphic character sets for information interchange       1989
GB12052  Korean character coded character sets for information interchange                              1989
GB13134  Yi coded character set for information interchange                                             1991

Fig. 4 List of China's national standards on character sets


4.2 Application system
Based on the GB2312 series of standards, a succession of Chinese-language platforms and operating systems has been deployed in the mainland of China, from CC-DOS and UCDOS to Chinese-star and Chinese Windows 3.x. These were followed by Chinese character systems based on ISO 10646, such as "Julong" (1993, Shanghai), "AW97" (1997, Zhangjiajiang) and Windows 95.

Built on support for multiple internal codes, real-time switching between languages and dynamic translation, the "RichWin" system provides an automatic text converter that converts text among the internal codes it supports (e.g. GB, BIG5, HZ, ISO2022), and it can read a text that mixes several of these internal codes.
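
The idea of converting text between internal codes can be sketched in Python, whose standard library includes codecs named `gb2312`, `big5` and `hz`. This is not how RichWin itself is implemented, which the source does not describe; the function name `convert` and the sample text are assumptions made for illustration. Re-encoding in this way only works for characters present in both character sets; mapping simplified forms to complex forms would need a separate character table.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode text from one internal code to another via Unicode."""
    return data.decode(src).encode(dst)


text = "中文"  # both characters exist in GB2312 and in Big5

gb_bytes = text.encode("gb2312")
big5_bytes = convert(gb_bytes, "gb2312", "big5")
hz_bytes = convert(gb_bytes, "gb2312", "hz")

print(gb_bytes.hex())                      # d6d0cec4
print(big5_bytes.decode("big5") == text)   # True
print(hz_bytes.decode("hz") == text)       # True
```

The same bytes mean different things under different internal codes, which is why a converter has to know (or be told) the source code before re-encoding.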

Xinjiang University has developed a system capable of processing five languages together (Uighur, Kazak, Kirgiz, Chinese and English).


5. Terminology and terminological database (or term bank)
Terminology is one of the foundations of multilingual information interchange, since the same concept is expressed by different terms in different languages. Many institutions are engaged in terminology work, mainly the China National Technical Committee for Terminology Standardization (CNTCTS), the China National Committee for Terms in Sciences and Technologies (CNCTS), the State Language Commission of China (SLC), the Encyclopedia of China Publishing House (ECPH), and others.

At present there are about 800 national standards dedicated to terminology, and 130 thousand terms are included in the National Standards of China.

Research on establishing terminological data banks/databases in China has received attention since the 1980s. Beginning with the establishment of 8 national standards concerning terminological data banks, about 100 terminological data banks/databases have been set up so far. Some of the larger ones are described below.

China Mechanical and Electrical Engineering Terminological Database
It records more than 44000 terms from three fields: mechanical engineering, electrical engineering and instruments. Every Chinese term has its English, Russian, German, Japanese and French equivalents attached.

China Encyclopedia Terminological Database (CETDB)
It is a collection of about 200 thousand entries, including terms from such fields as astronomy, mechanics, traditional Chinese medical science, press and publication, chemistry, civil air defence, etc.

China Standardization Term Bank
It is currently being established. Nearly 130 thousand terms from all of China's National Standards will be recorded in this term bank. Each Chinese term has its English equivalent, and Japanese, Russian, German and French equivalents will be added progressively.
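
A multilingual term record of the kind described above, a Chinese term with equivalents in other languages attached, might be modelled as follows. This is a minimal sketch: the class `TermEntry`, its fields and the sample entry are hypothetical and are not taken from any of the term banks named here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TermEntry:
    """One Chinese term with its equivalents in other languages."""
    chinese: str
    domain: str
    # Maps a language code (e.g. "en") to the equivalent term.
    equivalents: dict = field(default_factory=dict)

    def equivalent(self, lang: str) -> Optional[str]:
        """Return the equivalent in the given language, or None if not yet recorded."""
        return self.equivalents.get(lang)


# Illustrative entry only (hypothetical data).
entry = TermEntry(
    chinese="计算机",
    domain="computer",
    equivalents={"en": "computer", "fr": "ordinateur", "de": "Rechner"},
)

print(entry.equivalent("en"))  # computer
print(entry.equivalent("ru"))  # None
```

Keeping the equivalents in a mapping lets further languages be added progressively without changing the record layout.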

For historical reasons, a number of terms are still not unified, particularly in the fields of new sciences and technologies. Therefore, the harmonization and unification of Chinese terminology, as well as the establishment of the relevant terminological data banks, must be included in the research work on multilingual technology in the mainland of China and the Taiwan, Hongkong and Macao regions. Terminologists and relevant institutions from the mainland of China and the Taiwan, Hongkong and Macao regions have now started this work, including work on the principles and methods of harmonization and unification.


6. Machine translation
The study of machine translation in China started in the 1950s with a Russian-Chinese machine translation experiment, and it has developed rapidly in recent years. The institutions now engaged in its research and development are mainly Beijing University, Qinghua University, the State Language Commission, Harbin Polytechnical University and the Chinese Academy of Sciences, as well as several companies, such as China Computer Software and Technical Service Co., Tianjin Datong Tongyi Software Research Institute, Beijing Tingyu Co. and others. Of these, Beijing University and the Chinese Academy of Sciences enrol postgraduate students, and Hebei Teacher College has set up a speciality in machine translation. In 1997, at a conference held in Xi'an, it was planned to establish a Machine Translation Committee under the China Translator's Association; this will promote the research and development of machine translation in China.

Machine translation software products in China's market have grown from 1-2 varieties to dozens of varieties today, accounting for 1% of China's software products, and they are among the best-selling products. There are two kinds of translation software products. The first kind is dictionaries, which are used for quickly looking up words and phrases; the second kind is translation systems, which are used for translating scientific and technological literature. At present the products of the first kind are widely used, and their users account for 80-90% of PC users; 95% of these products translate from English into Chinese or vice versa, and there are only a few dictionaries for other foreign languages. The main products include "Hsvision", "Beyond", "Kingsoft-Xdict", "NCCEdict", "HuiFeng Audio Dictionary" and "School". Most of the products offer the function "click the mouse and the translation appears at once, with the pronunciation spoken".

The products of the second kind, however, represent the direction in which machine translation is developing. Their main users are government offices, large and medium-sized enterprises, colleges and universities, and scientific research institutions, and they will eventually spread to a broad base of private users, so their development has a great future. The main products include "Star of translation", "Tongyi", "Gaoli", "Jieti Yiwang", "Jinyida", "Yidali", "Jishi Hanhua", "Huajian", "Yiwang", "C&T", etc. All of them have large-scale term banks; e.g. "Star of translation" has 15 specialized termbases, covering fields including computers, communication, chemical engineering, petroleum geophysical prospecting, thermal power generation, economy and irrigation works. Today, products of this kind can automatically translate whole phrases, whole paragraphs and full texts with an accuracy of 70-85%, but the final translated text must be post-edited manually. Most machine translation products concentrate on translating from English into Chinese and vice versa; a few handle Japanese and Russian, and other languages are covered least.

Machine translation technology in the mainland of China is now converging with other related technologies. One direction is its combination with the recognition of printed and handwritten characters, language input and output technology, and full-text proof-reading technology. The main research institutions include the Chinese Academy of Sciences, Qinghua University, Zhejiang University, Beijing Information Engineering College, and others. Among the products, the Huajian company of the Chinese Academy of Sciences has developed the "Huajian Electrical Translator", which integrates handwriting input, full-text translation and language output. Another direction is its combination with the network browser, e.g. "Net Reader 2.0", which translates at a speed of 500-1000 English words per second; its translation quality basically meets users' needs for browsing information, and it allows direct searching in Internet search engines using Chinese keywords.

Research on technology for automatically evaluating the quality of machine-translated text has been carried out for over 12 years and has produced the "Outline of testing English-Chinese machine translated text quality" and the "Automatic evaluating system of the English-Chinese machine translated text quality" (MTE).


7. UNL (Universal Networking Language)
The UNL project is a large transnational, cross-language project implemented by the UN University. It designs an intermediate language, UNL, and develops for each language a pair of software tools, a "converter" and a "counter-converter": text in a given language is turned into UNL by the converter and then into another language by the counter-converter, and in this way conversion among languages is realized. The first phase of the project involves 13 countries, including China, France and Japan, and the conversion software for 13 languages is to be completed within three years. The project started in China in 1996, and the Micro-electrical Development Center is responsible for implementing it.


8. Summary
The Chinese government pays great attention to the development of information technology and the information industry. In 1998, as part of the reform of administrative organs, the former Ministry of Electronic Industry and Ministry of Posts & Telecommunications were merged into the newly formed Ministry of Information Industry, and the research and development of information technology is a key research direction of the state. Many of the items in the National Ninth Five-Year Plan and the "863 plan" concern multilingual information technology.

Although the numbers of computers and network users are very large, the penetration rates of computers and networks are still low. As these rates rise, the demand for multilingual information technology will continue to expand, so multilingual information technology and its products will form a very large market in China.

Today, the following problems concerning China's multilingual information technology must be solved:
(1) To develop system software that processes multiple languages simultaneously;
(2) To establish authoritative multilingual (not only Chinese and English) terminological data banks/databases;
(3) To raise the quality of Chinese-English and English-Chinese machine translation software;
(4) To develop mutual-translation software for Chinese and other languages.


9. Acknowledgment
I wish to express my thanks to CICC for its financial support of my attendance at this conference, and also to Mr. Fang Qing, Vice-director of CSICCI, for approving this paper.

