ISO/IEC 10646-1 in Japan


Technical Report

Koji SHIBANO
Associate Professor, Tokyo International University

This paper was delivered at the 1st AFSIT, September 14, 1987, Tokyo, Japan.


In this paper, we discuss the ISO/IEC 10646-1 in Japan, starting from a new positioning of the 10646 in Japan. We then explain the grounds for this judgment from the viewpoint of the problems of the internationalization of software, coded character sets, and the CJK unified ideograph. We also discuss the direction of the future development of the 10646, especially for the Asian characters, including analyses of the problems of the 10646.

1 Countermeasure by Japan to the 10646

Strenuous efforts have been made in JIS-related work toward preparing the draft JIS standard corresponding to ISO/IEC 10646, Universal Multiple-Octet Coded Character Set. Japan has so far taken a negative attitude toward the 10646. However, such an attitude is not desirable when taking into account Japan's efforts toward internationalization in other fields and the consistency with the domestic character code problems.

From now on, Japan should actively promote the dissemination of the 10646 and a gradual migration from the existing JIS codes to the JIS version of the 10646. At the same time, full efforts should be made to expand and improve the 10646 through ISO activities.

2 Positioning of 10646

2.1 Internationalization of the software and the countermeasures of Japan (SC22 Guideline)
Until very recently, it has been impossible in the international standards to appropriately handle the languages of the DBCS (double byte character set) countries, including the Japanese language. The first commitment of the ISO to national character set support in the international standards of information processing, such as programming languages, was the ISO/IEC 9075 database language SQL, when the draft international standard (DIS) was submitted in 1985.[1]

In order to improve this situation, Japan established the Technical Committee of the Japanese Language Functions in 1987, asked the ISO to appropriately handle the Japanese characters and the characters of the extensive non-Latin alphabet zone (especially the DBCS countries) in the international standards, and proposed specifications in the respective technical fields.

The SC22, which is in charge of programming language standards, adopted the guideline for supporting national character sets in 1987 in response to the Japanese proposal, and asked for appropriate handling of 2-byte characters in all programming languages.[2]

As an essential item, this guideline requested that both 2-byte and 1-byte characters be usable in data types, literals, and comments.[2]

In addition, it was also requested that 2-byte characters be allowed in identifiers. The 10646 is not a complete solution for such efforts, but it is expected to make a large contribution. In the future, Japan should act consistently with its own efforts for the internationalization of software and should take up the promotion of the 10646.

2.2 ISO character code
The ISO has developed its codes successively since the recommendation ISO R 646 [3] was published in 1967, regarding this ISO 646 edition as a Bible. Below, the history of the development of the major ISO codes and the concepts behind the respective developments are analyzed, and the path leading to the 10646 is described.

2.2.1 Principles of ISO 646[3]
The ISO 646 has been positioned as the basis of the ISO character codes. This standard unified the character code standards in the fields of both information processing and communication, and it aimed at handling characters appropriately by expanding the 5-bit code of the CCITT Alphabet #2, which had been in use, to a 6-bit or 7-bit code with a more extensive code space.

The design of the ISO 646 includes the following principles.

The first characteristic is that it aimed at a character code for the newborn fields of information processing and information exchange, and at unification with the character codes in the communication field. The second characteristic is that the coding was applied to characters as graphics, neither to the meaning of the characters nor to their pronunciation.

Though the code space was expanded through the development from the 5-bit code to the 7-bit code, there were still many restrictions. In order to carry out processing and exchange in the insufficient code space, the characters other than the common 82 characters among the 94 characters allocated to the graphic characters were allocated to national characters or to characters for each application. It was generally decided to develop and use a plurality of versions on the basis of the ISO 646. At the same time, composite characters based on the BACKSPACE sequence were introduced, since the above-mentioned 12 characters were found insufficient to express the accented characters of the European languages.
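The arithmetic above (94 graphic positions, 82 common characters, 12 positions left to national or application use) can be checked with a short sketch. The list of 12 positions below is an assumption taken from the usual description of the ISO 646, not from this paper, and the names are illustrative only.

```python
# Graphic positions of a 7-bit code: 0x21..0x7E (SPACE and DELETE are excluded).
GRAPHIC_POSITIONS = range(0x21, 0x7F)

# Assumption: the positions usually described as open to national or
# application use in the ISO 646. This list does not come from the paper.
NATIONAL_USE = [0x23, 0x24, 0x40, 0x5B, 0x5C, 0x5D,
                0x5E, 0x60, 0x7B, 0x7C, 0x7D, 0x7E]

# Everything else is common to all versions.
common = [p for p in GRAPHIC_POSITIONS if p not in NATIONAL_USE]

print(len(GRAPHIC_POSITIONS), len(NATIONAL_USE), len(common))

# How one version (ASCII) happens to fill the 12 open positions.
print("".join(chr(p) for p in NATIONAL_USE))
```

Each national version fills the same 12 positions differently, which is why text exchanged between versions needs the code extension discussed in the next section.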

Even in the USA, whose language does not need the characters with accent symbols, it was reported by ANSI in this period that at least 3 editions were necessary.[4]

2.2.2 ISO 2022[5]
Using a plurality of versions of the ISO 646 together requires a means of code extension. The ISO 2022 specifies this code extension.

The basic features of the ISO 2022 were the single shift, which calls the 12 national characters that differ in each version of the ISO 646 on a one-character basis, and the locking shift, which changes the whole version.

It is necessary to use the ISO 2022 to process a large character set together with the ISO 646. In this sense the ISO 2022 is essential for processing the character codes of DBCS countries when software of the 1-byte zone is utilized.

However, the ISO 2022 is not the standard to ensure the expansion of the character code but the standard to enable expansion thereof. In fact the adaptability of the ISO 2022 specifies the information interchange on the basis of the common escape sequence, which means that the processing of the 2-byte characters is actually rejected.
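As a small illustration of interchange based on escape sequences, the sketch below uses Python's `iso2022_jp` codec, which is one ISO 2022 based encoding and is used here only as a convenient stand-in. It shows designation by escape sequence, not the single shift or locking shift functions described above, and the sample string is chosen for illustration.

```python
ESC = b"\x1b"

# A string mixing 1-byte characters and 2-byte Kanji.
text = "ABC漢字"
encoded = text.encode("iso2022_jp")
print(encoded)

# Each switch between the 1-byte set and the 2-byte set is announced by an
# escape sequence. Splitting at ESC exposes them: every piece after the first
# starts with the two bytes that complete an escape sequence.
pieces = encoded.split(ESC)
designations = [ESC + piece[:2] for piece in pieces[1:]]
print(designations)

# The receiver must interpret the same escape sequences to recover the text.
decoded = encoded.decode("iso2022_jp")
```

A receiver that implements only the 1-byte part of the system sees the bytes between the two escape sequences as ordinary 1-byte characters, which is the practical difficulty this section describes.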

A large amount of labor is actually consumed to enable the use of the characters of each country, including those of the DBCS countries, in software prepared in the USA and Europe. Software prepared in DBCS countries is used only in DBCS countries, resulting in a de facto non-tariff barrier.

The ISO 2022 itself is quite a complicated system, and the implementation of all its features is not assumed. It is recommended to limit the system to minimum use, and only in the most-needed cases.

2.2.3 8859[6]
For the character code system consisting of the ISO 646 and the ISO 2022, the following problems gradually became clear in Europe. The following items are from the brief History of the ISO 8859 attached to the documents for the DIS voting of the ISO 8859.

In order to resolve these problems, the ISO 8859 was designed on the basis of the following design philosophy. The following items are also from the brief History of the ISO 8859 attached in the documents for the DIS voting of the ISO 8859.

However, this character code has also gradually shown its limits. In particular, the ISO 2022, which had fallen into disuse, was needed again when considering the use of character codes within the EC, which aims at an integrated Europe. Selecting a group of languages that can be covered by one code, so that code expansion is not required, has become quite difficult.

2.2.4 Toward the 10646[7]
In conclusion, the 8-bit code was found insufficient in every aspect in Europe. In the USA, where the 7-bit ASCII code has been used, the code space has been running short with either the 7-bit code or the 8-bit code, owing to the expansion of the application field of the computer from conventional numerical computation and data processing to document processing and DTP.

In time, the ISO 10646 was developed in a similar design philosophy to the ISO 8859 as previously mentioned.

2.3 Problems with the present JIS code
(Note: the word "Kanji" is defined as the Chinese character used in daily life in Japan.)
In Japan, the Kanji Code Technical Committee was established within the Standard Committee of the Information Processing Society in 1969, and studies on the establishment of a Kanji code standard were started.[4] As a result of these studies, the JIS C 6226-1978 Kanji code for information exchange [8] was established in 1978. This standard was amended in 1983 and the second version of the standard [9] was established. Thereafter, the name of the standard was altered to JIS X0208, and in 1990 the third version of the standard [10] and the JIS X0212-1990 Supplementary Kanji Code [11] were established.

Thus, the JIS code has been successively amended, but it still contains some problems.

First of all, what does the first Kanji code, JIS C6226-1978, mean? This code system includes the Greek characters and the Russian characters, which are not used in the normal script of the Japanese language, and includes every European character if composite characters are included. Consideration of this point leads to the conclusion that the design philosophy of the JIS C6226 is not the "expression of the normal Japanese language" described in its scope, and it should be understood that the JIS C6226 aimed at a self-contained universal code similar to the 10646.

Unless understood in this manner, this code may be contradictory to the principles of the ISO 646 that one graphic character gives only one code point and that the code position of the common 82 characters in the ISO 646 should not be changed.

In reality, however, it was decided that the JIS C6220-1969 [12], which is a version of the ISO 646 including the Katakana, would be used concurrently. This causes many problems, including duplicate coding.

An environment in which these two codes must constantly be used together makes the use of the ISO 2022 essential, and the gap between the functions of the ISO 2022 effective at that time and its actual use has caused confusion in the code representation. As a result, there still exist the shift JIS, the EUC, and the expanded JIS code as code representations supported by a plurality of companies, as well as code systems specific to individual companies, used in processing and in exchange. Ironically, code conversion is required within Japan even for plug-compatible mainframes, where replacing only the hardware is possible worldwide. Application software in the UNIX environment needs to support at least the three common code representations mentioned above. In the personal computer environment, most personal computers require code conversion in order to display and print characters.
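The coexistence of several code representations for one character set can be illustrated with a short sketch. It assumes that Python's `iso2022_jp`, `euc_jp`, and `shift_jis` codecs are acceptable stand-ins for the JIS, EUC, and shift JIS representations named above. The helper names `represent` and `convert` are hypothetical.

```python
# Assumption: Python codec names stand in for the three representations.
CODECS = {
    "JIS": "iso2022_jp",
    "EUC": "euc_jp",
    "shift JIS": "shift_jis",
}

def represent(text):
    """Return the byte sequence of `text` in each representation."""
    return {name: text.encode(codec) for name, codec in CODECS.items()}

def convert(data, source, target):
    """Convert bytes between two representations by way of a common form."""
    return data.decode(CODECS[source]).encode(CODECS[target])

# One Kanji, one code point in the JIS code table, three byte sequences.
reps = represent("漢")
for name, data in reps.items():
    print(f"{name:10s} {data.hex(' ')}")

# The code conversion that application software has to carry out.
sjis_from_euc = convert(reps["EUC"], "EUC", "shift JIS")
```

The character set is the same in all three cases. Only the representation differs, and every program that exchanges text has to know which one it is receiving.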

That means, there actually exists no standard on the code representation in Japan.

Though the JIS C6226-1978 clearly states that the details of the use of the graphic characters and the design of the character shapes are not specified, the "difference by the simplification of the shape" is specified to be disregarded at the end of the principles of unification, which are essentially required in selecting the Kanji (the character shapes to be regarded as identical in the JIS interpretation, e.g., in interpretation table 6 of the current JIS X0208-1990). The interpretation of these principles and an incorrect understanding of the principle of the graphic character in the ISO codes were combined with the way of thinking set out at the beginning of the common use Kanji Table (Joyo Kanji Hyo), thereby causing the problems of the amendments in 1983.

In addition, the JIS C6226-1978 provided no reserved area in connection with the free space. As a result, the vendors freely added graphic characters in this space, and the addition of characters in the subsequent revisions was not accepted by some of the Japanese vendors.

In the revision in 1983, 22 pairs of characters were exchanged between the first level and the second level, and the unification principle of "simplification of the shape" was applied to the entire system, leading to changes in the character shapes of 294 characters. From the viewpoint of compatibility, characters should never be exchanged in a character code standard. There were many discussions on applying the principle of simplification of Kanji in the daily use Kanji Table (Toyo Kanji Hyo) and the common use Kanji Table (Joyo Kanji Hyo) to Kanji other than those listed in the Tables, but no domestic consensus had been reached on this problem, and a change of character shapes was not something for JIS to implement on its own.

The addition of the supplementary Kanji in 1990 was followed by few makers.

As a result, the efforts made by JIS over a long time could not establish an effective standard, and instead promoted confusion.

The only solution for such situations is believed to be promotion of the positive movement toward the 10646. However, it is thought that the abolition of the existing standard shall not be made in the standard of the code system, even when the movement toward a new standard is promoted. It should be noted that the code table itself shall not be changed in the future even for the existing standard, but it is considered necessary to revise the JIS X0201 and the JIS X0208 following the clarification of the contents of the specifications in the ISO.

3 CJK Unified Ideograph

The SC2 Japan Domestic Committee has opposed the CJK unified ideograph for several reasons; however, it is now considering supporting it. The grounds for this change are that reasons in favor of the CJK unified ideograph can be found, while valid reasons against it cannot. A variety of opposing opinions exist in Japan, but all of them are incorrect.

The problems of the CJK unified ideograph are discussed below from several viewpoints.

3.1 Kanji Cultural Area
It is common knowledge that a Kanji cultural area existed, and Japan has profited from the existence of this Kanji cultural area in its history. In fact, Kanji sentences were used in official documents until the end of the Edo era, and knowledge of the Chinese classics was an important factor for educated people in the pre-war period. It is also a clear fact that Kanji sentences are still taught at the present time. The word that expresses these aspects is "同文" (the same script).

The common use Kanji Table (Joyo Kanji Hyo) which provides the reference of the Kanji use which is the most authorized in the present Japanese language is based on the Kang Xi Zidian in China. [14] The characters changed from the original listed in the Kang Xi Zidian were 131 among 1850 characters in the daily use Kanji Table (Toyo Kanji Hyo) and 355 among 1945 characters in the common use Kanji Table (Joyo Kanji Hyo).

The Shin Jigen and the Dai-Kanwa Jiten, referred to in the specification work of the JIS standards up to this time, also regard the Kang Xi Zidian as the most authoritative.

In 1975, the "Delegation for examination of the Chinese character revolution" was dispatched by the Agency for Cultural Affairs to cope with the situation of character communications at the grass roots level and to study the unification of the simplified characters between China and Japan. [13] The results were reported at the 95th General Assembly of the Japanese Language Council approximately as follows.

"It is impossible for all characters to coincide between Japan and China in the direction of their simplification. However, it is necessary for the scholars of Japan and China to communicate with each other to exchange knowledge and experience concerning problems such as the indexing of characters and the character code in electronic communications."

The communication called for in this report took shape as the work of the CJK-JRG, and the result after 20 years is the 10646.

3.2 Kanji Code of Japan, China and Korea
The character codes of Japan, China, and Korea up to this date were designed on completely different principles. That is, coding on three different aspects of the Kanji characters, i.e., shape, pronunciation, and meaning, is carried out in the respective countries.

The Kanji code in Japan includes the variant characters like the common use Kanji Table and the unification of the character shape when available, and one code point was given to one character shape. That means Japan gave the code points to the shape.

In China the code point was given to the correct form of the character, and the two character patterns of the simplified character and the complex-formed character were given one code point. That means the Kanji code in China gave the code point to the meaning, or the character concept.

In Korea, the code point was given to the pronunciation of the Kanji character and a plurality of code points were allocated to one character when it has a plurality of pronunciations for one character. That means Korea gave the code point to the pronunciation.

Thus, the design philosophies of the respective countries are completely different and this is the problem. The ISO does not provide the character code to characters other than the graphic characters. In consideration of this matter, the design of the character code in China and Korea and the JIS code were incorrect.
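The effect of giving code points to pronunciations can still be observed in data derived from the Korean standard. The sketch below rests on an assumption that is not stated in this paper: that the range U+F900 to U+FA0B of the Unicode character database bundled with Python holds the duplicate code points carried over from the Korean Kanji code, and that each is mapped back to a single shape by canonical normalization. The variable names are illustrative.

```python
import unicodedata
from collections import defaultdict

# Assumption: U+F900..U+FA0B are duplicates taken over from the Korean code,
# one code point per pronunciation of the same character shape.
groups = defaultdict(list)
for cp in range(0xF900, 0xFA0C):
    duplicate = chr(cp)
    unified = unicodedata.normalize("NFC", duplicate)
    if unified != duplicate:
        groups[unified].append(duplicate)

# Shapes that received more than one code point in the source standard.
duplicated = {u: d for u, d in groups.items() if len(d) > 1}

for unified, dups in sorted(duplicated.items()):
    codes = " ".join(f"U+{ord(d):04X}" for d in dups)
    print(f"U+{ord(unified):04X} {unified}: {codes}")
```

A code that gives code points to shapes needs only the single unified character in each printed row. The remaining code points exist only so that conversion to and from the pronunciation-based code loses nothing.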

Further, as for the arrangement, the characters are arranged in the order of pronunciation in the respective national languages in the first level of Japan and China and in the Kanji code of Korea. In Japan, it has been pointed out that the collating order of this arrangement means nothing once the second level is included, and indices in the order of radicals and strokes on the basis of the Koki Jiten are attached as an annex to the standards.

In conclusion, only the arrangement in accordance with the Koki Jiten is available as an arrangement widely agreeable in Japan, China, and Korea.

In practical use, kana-Kanji conversion is now a prerequisite, unlike the time when Kanji typewriters were used, and there is no occasion to use the arrangement table of the character code directly. That means there is no ground for criticizing the 10646 with respect to the arrangement, and criticism on this point amounts to criticism of the indices of the JIS Kanji code themselves.

Some opinions point out problems in the future maintenance of the unified Kanji. Such problems, however, are the essential expenses of internationalization that the USA and the European countries have paid up to this date, and they are now to be borne by Japan, China, and Korea.

3.3 Size of the character set and its meaning

There are some passionate discussions claiming that the character set for the character code can be designed by allocating code points to the characters listed in the common use Kanji Table, the Kang Xi Zidian, the Dai-Kanwa Jiten, or the like. However, these discussions themselves have been, and will be, incorrect. Every character table, whatever it may be, has its own application field, and the incorrectness is attributable to the fact that this application field is not necessarily the same as the application field of the Kanji code standard.

For example, proper nouns are explicitly specified as an exception to the application of the common use Kanji Table. If code points were given only to the common use Kanji Table, the names of major prefectures of Japan, such as "埼玉" (Saitama), "岡山" (Okayama), and "大阪" (Osaka), could not be expressed in this system.

The main application field of the Kanwa Jiten is reading classical Chinese literature such as the Shisho Gokyo (a quite indifferent attitude has traditionally been taken toward including the Japanese-oriented characters). Considering the use in the data processing field, the Japanese-oriented characters used in family names and first names should be energetically collected.

Also in consideration of the use in the recent extensive application fields, the collection of the Kanji characters will be required in the fields such as Japanese literature, Japanese history, Oriental studies, and Buddhism studies.

The size of the character set which is included in the character code is also meaningful.

The character set of the 2000-character level corresponds to the common use Kanji Table in Japan, or to the level of the grand table of the simplified characters in China. If a unified Kanji table of this level is prepared, the unification of the commonly used character shapes in the common use Kanji Table in Japan with the simplified characters in China will be necessary. That means unification on this level is a matter of national language policy.
In addition, as discussed previously, even the names of the prefectures can not be expressed satisfactorily in Japan with this level of characters.

The expression of fundamental daily documents is possible with the 3000-character level, which corresponds to the first level of the JIS Kanji Code. Few troubles are experienced in daily use with 6000 characters, which is the level of the Kanji code for daily use. The Kanji characters listed in Japanese language dictionaries are on this level.

The level of the Kanji characters that are constantly available at general printing companies and newspaper publishing companies is the 10000-character level. The number of Kanji characters collected in the popular Kanji character dictionaries (Kanwa Jiten) corresponds to this level. What corresponds to the 21000 characters, the level to be collected in the 10646, is the number of Kanji characters in a Kanji character dictionary that boasts the largest number of Kanji characters in a single volume in Japan, and that required more than 10000 characters to be newly prepared for printing.
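The size levels discussed here can be set against some concrete counts. In the sketch below only the figure of 1945 characters for the common use Kanji Table comes from this paper. The counts for the two levels of the JIS X0208 and the code range used for the unified ideographs of the 10646 are assumptions based on commonly cited figures, and the names are illustrative.

```python
# (description, number of Kanji characters)
CHARACTER_SETS = [
    ("common use Kanji Table (Joyo Kanji Hyo)", 1945),       # from the paper
    ("JIS X0208, first level", 2965),                         # assumption
    ("JIS X0208, first and second levels", 2965 + 3390),      # assumption
    ("10646 unified ideographs, 4E00 to 9FA5 (hex)",          # assumption
     0x9FA5 - 0x4E00 + 1),
]

def level(count):
    """Round a character count to the nearest thousand."""
    return round(count, -3)

levels = {name: level(count) for name, count in CHARACTER_SETS}

for name, count in CHARACTER_SETS:
    print(f"{count:6d}  about the {levels[name]}-character level  {name}")
```

Rounded to the nearest thousand, these counts land on the 2000, 3000, 6000, and 21000 levels used in the discussion, which suggests how the levels relate to existing tables and standards.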

And 50000 is the number of Kanji characters in the Kang Xi Zidian and the Dai-Kanwa Jiten, which are dictionaries of the largest scale.

It can be said that the development of the unified ideograph of the 10646 is the trial to prepare the Kanji character dictionary in the character shape level of the modern computer time. The characters on the 2000-character level are naturally included in the large levels, but the discussions on this level are completely incorrect and make no sense in 21000 character level.

Furthermore, one Kanji character is polysemous, and discussions based on the mistaken assumption that a character has only the one meaning that is representative in each country are quite incorrect. Among such incorrect examples are the discussions on the difference in the meaning of "湯", "hot water" in Japan and "soup" in China, or on the difference between "机" and "機". Those who like these discussions are recommended to consult the Kanji character dictionary at hand.

4 Design of the character set of the 10646 and coding of the Asian characters

Several problems are included in the design of the character set of the 10646 from the viewpoint of the coding of the Asian characters. These problems are divided into the problems of the composing characters of South Asia and the problems of the unified ideograph of East Asia, and we discuss each of them in turn. The purpose of the analyses is not to criticize the character codes of the respective countries but to study the future expansion of the 10646, its improvement, and its direction.

4.1 Composing character and composed character
The characters of South Asia have been treated as composing characters in the existing standards of the respective countries, which are the respective versions of the 646, and in the ISO 10646. The composing characters in the ISO 646 were composed by using the BACKSPACE or the CARRIAGE RETURN. In the 646, however, there were no additional rules on this composition, while in the 10646 detailed rules are specified for the composing characters of South Asia.

The first concern is whether or not the respective countries in South Asia have any problems with these sophisticated rules on composition.

The second concern is the composition itself. As discussed in 2, the European character codes give code points to the composed characters in the information processing field, and the composition itself tends to be prohibited, as specified in the 8859. (However, in the 10646, the European composing characters are also introduced. This was a decision made for the unification of the character code with the TC46 (the library system), and the decision itself seems to have been necessary.)

The Korean language seems to be in a transition period from this composing character to the composed character. That is, both the Korean characters to be used in the composition and the composed Korean characters are included in the 10646.

In the first place, the use of the composing characters for the Latin characters was attributable to the shortage of the code space. It is more convenient to give code points to the composed characters in every aspect of processing, display, and interchange.
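The difference between composing and composed characters can be sketched with the normalization functions of Python's `unicodedata` module. The use of the Unicode normalization forms as a stand-in for the composition rules of the 10646 is an assumption made for illustration, and the sample characters are chosen here, not taken from the paper.

```python
import unicodedata

def compose(text):
    """Replace base-plus-composing sequences by composed characters."""
    return unicodedata.normalize("NFC", text)

def decompose(text):
    """Replace composed characters by base-plus-composing sequences."""
    return unicodedata.normalize("NFD", text)

# Latin: the letter e followed by a composing acute accent.
latin_sequence = "e\u0301"
latin_composed = compose(latin_sequence)

# Korean: one composed syllable and the three characters it is built from.
hangul_composed = "\ud55c"
hangul_sequence = decompose(hangul_composed)

for label, value in [("Latin sequence", latin_sequence),
                     ("Latin composed", latin_composed),
                     ("Korean sequence", hangul_sequence),
                     ("Korean composed", hangul_composed)]:
    codes = " ".join(f"{ord(ch):04X}" for ch in value)
    print(f"{label:16s} length {len(value)}  {codes}")
```

With composed characters, one displayed character is one code point, so counting, comparison, and truncation work on code points directly. With composing characters, every such operation first has to find the boundaries of each sequence.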