YAO Shiquan & ZHOU Bingyang
China State Bureau of Technical Supervision (CSBTS)
presented by BAI Yang
Engineer, China Standardization and Information Classification & Coding Institute, CSBTS
This paper was delivered at the 2nd AFSIT, March 10, 1989, Tokyo, Japan.
Over the past ten years, encoding techniques for Chinese character entry have made great progress as Chinese information processing technology has developed rapidly alongside the broad application of computers. Encoding for Chinese character entry is a frontier science that cuts across many disciplines, and it is also a matter of systems engineering. Scientific and technical workers in China have carried out, at varying levels, a great deal of study in both basic theory and engineering practice. Examples include surveys and statistics of Chinese character frequency; the decomposition of Chinese characters into roots and the associated statistics; word segmentation and word frequency statistics; the choice of character elements and their frequency statistics; double-phonetic schemes and their frequency statistics; the Chinese character exchange code; and the development of standards such as those for the dot matrix of Chinese character forms and their properties. These developments in fundamental theory and standardization provide a reliable theoretical and technical foundation for designing and standardizing Chinese character entry encodings, and they have driven the rapid, flourishing development of Chinese character entry encoding technology.
According to approximate statistics, over 500 versions (encoding schemes) have been put forward, and over 50 of them have been implemented on computers. Some versions have complete software and hardware support, have reached a high level, and have achieved good economic returns in application. Given the variety of versions and the differences in their quality, however, most users were at a loss when faced with so many versions and made their choice only on the basis of commercial advertisements; some even chose and used inappropriate versions. Manufacturers cannot move to batch production. All of this has caused the government to invest repeatedly and to waste funds, manpower and material resources.
Evaluating, testing and selecting versions, and guiding Chinese character encoding toward healthy development, has therefore become an important problem that urgently needs to be solved.
The keyboard entry of Chinese characters and the development of its evaluation technology are briefly described below.
1. Outline of Chinese Character Entry Methods on the Keyboard
Encoded keyboard entry of Chinese characters is an entry method widely used around the world. Common small keyboards are mainly used, and special small keyboards designed for Chinese character entry are also used to some extent. In encoded entry, each Chinese character is assigned a code according to certain rules, and the user enters that code instead of entering the complete character directly. The main kinds of code are as follows:
1.1. Phonetic Code
Phonetic code is an encoding based on the pronunciation of the Chinese character. There are over 400 syllables in standard Chinese pronunciation, and most syllables can take up to four tones. The number of toned syllables is over 1280. Encoding is based on the Chinese phonetic alphabet (the national phonetic alphabet is used in Hong Kong, Taiwan and other overseas areas). Phonetic encoding comprises complete phonetic encoding and double phonetic encoding. For the syllable "zhong", for example, complete phonetic encoding requires the five letters z-h-o-n-g to be input, whereas double phonetic encoding requires only two keystrokes: the consonant zh and the vowel ong, each of which is assigned to a particular key. The outstanding advantage of phonetic codes is that they are easy to master. Chinese people usually start learning the phonetic alphabet in primary school, or even in kindergarten, and have completely mastered it by the second grade of primary school, so they can input Chinese characters with a phonetic code without any additional study. The problem that remains to be solved is the elimination of repeat codes, which arise because different characters share the same pronunciation. This is handled by displaying the candidate characters on screen and having the user select the intended one. It may be said that most untrained users use the phonetic entry method.
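The difference in keystrokes between the two phonetic schemes can be sketched as follows. This is only an illustration: the key assignments in INITIAL_KEYS and FINAL_KEYS are hypothetical and do not come from any particular version, and a syllable without an initial consonant is simply given a single key here.

```python
# Initials of the Chinese phonetic alphabet; two-letter initials come first
# so that "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

# Hypothetical key assignments (assumptions for illustration only).
INITIAL_KEYS = {"zh": "v", "ch": "i", "sh": "u"}
FINAL_KEYS = {"ong": "s", "an": "j"}


def split_syllable(syllable):
    """Split a toneless syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # syllable with no initial consonant, e.g. "an"


def full_phonetic_keys(syllable):
    """Complete phonetic entry: one keystroke per letter."""
    return list(syllable)


def double_phonetic_keys(syllable, initial_keys, final_keys):
    """Double phonetic entry: one key for the initial, one for the final."""
    initial, final = split_syllable(syllable)
    keys = []
    if initial:
        # Single-letter initials are assumed to sit on their own letter key.
        keys.append(initial_keys.get(initial, initial))
    keys.append(final_keys[final])
    return keys


print(full_phonetic_keys("zhong"))                              # 5 keystrokes
print(double_phonetic_keys("zhong", INITIAL_KEYS, FINAL_KEYS))  # 2 keystrokes
```

Either scheme still yields the same set of homophones for a syllable, so the on-screen selection step described above is needed in both cases.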
1.2. Picture Code
Picture code is an encoding based on the graphic form of the Chinese character. The basic structural unit into which the form of a character is divided under a given set of encoding rules is called a word root. The definition and the number of word roots depend on the encoding version: the number ranges from a few or a few dozen at the least to several hundred or over a thousand at the most. For example, the stroke form encoding (version A) developed by Professor Li Jin-Kai, which is widely used on multi-language computers and for which he was awarded a unique patent in China, divides the stroke forms into eight word roots: transverse stroke (1), upright stroke (2), left-falling stroke (3), point (4), turning stroke (5), curved stroke (6), cross stroke (7) and square stroke (0). The rules for taking codes are: top before bottom; left before right; at most 3 codes for each natural component; at most 6 codes for a complete character; and a distinguishing code for characters that would otherwise share the same code. For example, the code 310 is taken for the character meaning "right", and 414013 for the character meaning "finish". As a further example, in A-versions such as POPULAR CODE, MONEY CODE, FIRST THREE LAST ONE, CK CODE, QUADRUPLE ANGLE CODE WITH LEVELS, COMPONENT FORM CODE and 4 5-3 CODE, which were the first to be evaluated and tested nationwide, the component of the Chinese character is taken as the word root, and the code-taking principles are that larger elements take priority and that either the first three and the last one, or the first two and the last one, are taken. In the popular code, for example, some characters can be entered by inputting only two elements.
Because the Chinese character is an ideograph based on graphic form, entry versions designed around root properties are visually more intuitive, and even characters that users cannot pronounce can be input easily. Picture codes produce fewer repeat codes than phonetic codes. Therefore, although a picture code is more difficult to learn than a phonetic code, once users have learned it, its entry speed is faster than that of a phonetic code.
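The stroke form encoding described above can be sketched in a few lines. The digit assigned to each stroke form and the limits of 3 codes per component and 6 codes per character are taken from the text. Everything else is an assumption made for illustration: the function name, the idea of cutting off after the first three strokes of a component, and the stroke decompositions passed in, which were chosen only because they reproduce the codes 310 and 414013 quoted in the text.

```python
# Digit assigned to each of the eight stroke forms (from the text).
STROKE_CODES = {
    "transverse": "1", "upright": "2", "left-falling": "3", "point": "4",
    "turning": "5", "curved": "6", "cross": "7", "square": "0",
}

MAX_CODES_PER_COMPONENT = 3  # at most 3 codes for each natural component
MAX_CODES_PER_CHARACTER = 6  # at most 6 codes for a complete character


def stroke_form_code(components):
    """Build a code from components, each a list of stroke-form names.

    The strokes are assumed to be listed already in code-taking order
    (top before bottom, left before right). Truncating to the FIRST
    strokes of each component is an assumption of this sketch.
    """
    digits = []
    for strokes in components:
        for stroke in strokes[:MAX_CODES_PER_COMPONENT]:
            digits.append(STROKE_CODES[stroke])
    return "".join(digits[:MAX_CODES_PER_CHARACTER])


# Hypothetical decomposition that reproduces the code 310 from the text.
right = [["left-falling", "transverse"], ["square"]]
print(stroke_form_code(right))  # 310

# Hypothetical decomposition that reproduces the code 414013 from the text.
finish = [["point", "transverse", "point", "left-falling"],
          ["square", "transverse", "left-falling", "curved"]]
print(stroke_form_code(finish))  # 414013
```

The sketch leaves out the last rule, the distinguishing code for characters that would otherwise collide, because the text does not say how that code is chosen.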
1.3. Phonetic-Picture Code
The phonetic-picture code is a Chinese character encoding that combines the root property of a character with its phonetic property. In some versions "sound" leads and "picture" follows; in others "picture" leads and "sound" follows, so that the phonetic property comes after the root property. For example, a Grade 2 character is decomposed by root property into three component characters, and the first letters g, f and j of the Chinese phonetic alphabet spellings of these three components are then taken as its code. The multi-functional code with 50 word elements, the JDL entry method without interval, the string-pearl entry method for Chinese characters and the up-and-coming youngster method are all phonetic-picture codes.
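The "picture leads, sound follows" variant can be sketched as below: a character is first decomposed into components, and the first letter of each component's phonetic spelling is then taken. The function name is hypothetical, and the three spellings are placeholders that are NOT taken from the paper; they were chosen only so that their first letters are g, f and j, as in the example above.

```python
def phonetic_picture_code(component_spellings):
    """Join the first letter of each component's phonetic spelling."""
    return "".join(spelling[0] for spelling in component_spellings)


# Placeholder spellings (assumption): any three components whose
# phonetic spellings begin with g, f and j would give the same code.
components = ["gong", "fu", "jin"]
print(phonetic_picture_code(components))  # gfj
```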
The study of Chinese character entry encoding, which draws on broad public participation, has now reached a high level in China. Most practical versions have developed from simple Chinese character encodings into complete Chinese character entry systems. The use of hybrid character-and-term encoding has shortened the code length for dynamic entry. Many versions are equipped with Chinese character cards and glossary cards, make full use of the intelligence of microcomputers, and have developed supporting software techniques. Many versions have also used computer-aided optimum design, so that the time needed to optimize a version has been greatly shortened.
2. Evaluation and Testing of Chinese Character Entry Encoding Methods
How the many encoding methods for Chinese characters can be objectively evaluated through testing, both theoretically and practically, so as to select the better ones, eliminate the worse ones and progressively move toward standardization, has become a problem of common interest in the development of Chinese character information processing technology.
At the end of the 1970s, the academic community working on Chinese character encoding put forward proposals for scientific evaluation and testing, together with some qualitative analysis methods. As early as the end of 1980, the China Xin-Hua Agency Technological Institute carried out the first evaluation and test, covering four well-known versions. In 1983, the China Taiwan Council for the Promotion of Information & Communication Industry systematically evaluated and tested ten versions on seven items, including the Cang-Jie version (Cang-Jie is said to have created Chinese characters), and published the results.
In the summer of 1984, the Chinese Character Specialization Editorial Board of the China Chinese Information Institute sponsored an evaluation and test of five versions at Shanghai Jiao-Tong University. It was the first test in China to use the computer as the main means of testing.
In May 1985, the Electrical Computer Development Office of the State Council, the State Scientific and Technological Commission and the State Standardization Bureau jointly issued a notice organizing evaluation and test activities and founded the National Chinese Character Entry Version Evaluation and Test Office to lead this work. The First National Chinese-Character Entry Version Evaluation and Test was performed according to "the rules for evaluation and test of Chinese-character entry version on keyboard" (draft). The rules were developed progressively by Shanghai Jiao-Tong University, the Sort Encoding Institute of the State Standardization Bureau and others. They were based on the scientific theory of evaluation and testing, summed up the experience of all past evaluations and tests in our country (including Taiwan province), and were discussed repeatedly by experts in information theory, mathematics, engineering psychology, linguistics and philology, computer science and other fields.
Versions from the whole country were statistically tested and qualitatively evaluated by experts from May 1985 to February 1986, and large-scale dynamic tests were carried out from March to May 1986. Each version was allotted eight operators at random. Every operator had passed an examination and had reached secondary school education level, but had no computer knowledge and had received no training in typing Chinese or English. As for the difficulty of the samples, they included various materials with 5,000,000 words, selected step by step from the alternate samples for testing. The operators were trained and taught by the writers of the alternate samples. The accumulated teaching time did not exceed 20 hours. The operators for each version were tested in turn for 30 minutes each day, and the test period lasted 38 days. Under these rigorous tests, the average speed of version A reached 43.16 words/minute, and the highest speed was 66.23 words/minute. The shortest period of time was 38 hours. According to tests on ordinary samples, the average speed of certain versions selected from version A can reach 130-205 words/minute.
Eleven versions of version A, selected from the 34 versions collected, were tested. They included the popular code, the quadruple angles with levels, the component form code, the fifty word element code, the stroke picture code, the Chinese tone number code, the 453 combination code and the JDL no interval code. Four versions were not ranked: the accompanying test version; the version left unscored because its operator took too much sick leave; the integer-word keyboard version; and the version disqualified because its operator broke the rules. The evaluation and test has greatly promoted the development of keyboard entry versions for Chinese characters toward a higher level.
Judged by a comprehensive analysis of the evaluation and test technology, three problems now need to be solved: making the evaluation and test rules highly scientific and systematic; choosing a reasonable number of operators to take part in the evaluation and test of each version; and ensuring that the evaluation and test results of the participating versions are fully comparable. These problems are being studied theoretically and solved comprehensively through numerous practical national evaluations and tests.
In order to advance the development of keyboard code entry techniques for Chinese characters and to make them commercial and practical, the government has organized a special scientific and technological team. Aiming at middle-school students in schools under the nine-year education system, the team is studying a group of popular Chinese character keyboard entry versions, which include phonetic versions, picture versions and combinations of both. These versions, together with other individual versions, will take part in the technological evaluation and test sponsored by the government and will strive to be chosen and popularized.