Contents of /manuals/trunk/intro-i18n/intro-i18n.sgml



Revision 6998
Tue Dec 8 17:13:57 2009 UTC by osamu
File MIME type: text/x-sgml
File size: 155187 byte(s)
update URL and e-mail of Tomohiro KUBOTA
1 <!doctype debiandoc public "-//DebianDoc//DTD DebianDoc//EN"
2 [
3 <!entity % languages system "languages.ents"> %languages;
4 <!entity % examples system "examples.ents"> %examples;
5 ]>
6 <debiandoc>
7 <book>
8
9
10 <titlepag>
11 <title>Introduction to i18n</title>
12 <author>
13 <name>Tomohiro KUBOTA</name>
14 <email>debian at tmail dot plala dot or dot jp (retired DD)</email>
15 </author>
16 <version><date></version>
17 <abstract>
18 This document describes basic concepts for i18n
19 (internationalization), how to write internationalized
20 software, and how to modify and internationalize existing software.
21 Handling of characters is discussed in detail.
22 There are a few case studies in which the author internationalized
23 software such as TWM.
24 </abstract>
25 <copyright>
26 <copyrightsummary>
27 Copyright &copy; 1999-2001 Tomohiro KUBOTA.
28 Chapters and sections whose original author is not KUBOTA are
29 copyright by their authors. Their names are written
30 at the top of the chapter or the section.
31 </copyrightsummary>
32 <p>
33 This manual is free software; you may redistribute it and/or modify it
34 under the terms of the GNU General Public License as published by the
35 Free Software Foundation; either version 2, or (at your option) any
36 later version.
37 </p>
38 <p>
39 This is distributed in the hope that it will be useful, but
40 <em>without any warranty</em>; without even the implied warranty of
41 merchantability or fitness for a particular purpose. See the GNU
42 General Public License for more details.
43 </p>
44 <p>
45 A copy of the GNU General Public License is available as
46 <tt>/usr/share/common-licenses/GPL</tt> in the Debian GNU/Linux
47 distribution or on the World Wide Web at
48 <url id="http://www.gnu.org/copyleft/gpl.html">.
49 You can also obtain it by writing to the Free
50 Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
51 02111-1307, USA.
52 </p>
53 </copyright>
54 </titlepag>
55
56 <toc detail="sect1">
57
58 <chapt id="scope"><heading>About This Document</heading>
59
60 <sect id="scope2"><heading>Scope</heading>
61
62 <P>
63 This document describes the basic ideas of I18N; it is written for
64 programmers and package maintainers of Debian GNU/Linux and
65 other UNIX-like platforms.
66 The aim of this document is to offer an introduction to
67 the basic concepts, character codes, and points where care
68 should be taken when one writes I18N-ed software or
69 an I18N patch for existing software. There is much
70 know-how, and there are many case studies, on the internationalization of
71 software. This document also tries to introduce the
72 current state and existing problems for each language and country.
73 </P>
74
75 <P>
76 Minimum requirements - for example,
77 that characters should be displayed with fonts of the
78 proper charset (users of the software must be
79 able to at least guess what is written),
80 that it must be possible to input characters from the keyboard, and
81 that software must not destroy characters -
82 are stressed in the document. I am trying to
83 describe a HOWTO to satisfy these requirements.
84 </P>
85
86 <P>
87 This document is strongly related to programming
88 languages such as C and standardized I18N methods such as
89 using locales and <prgn>gettext</prgn>.
90 </P>
91
92 <sect id="newversion"><heading>New Versions of This Document</heading>
93
94 <P>
95 The current version of this document is available
96 at
97 <url id="http://www.debian.org/doc/ddp"
98 name="DDP (Debian Documentation Project)"> page.
99 </P>
100
101 <p>
102 Note that the author rewrote this document in November 2000.
103 </p>
104
105 <p>
106 Since then, Debian has had several releases, and its packages now support I18N better
107 thanks to their support for UTF-8.
108 This document does not cover these new developments but is kept here
109 because it helps in understanding fundamental I18N issues.
110 </p>
111
112 <sect id="feedback"><heading>Feedback and Contributions</heading>
113
114 <P>
115 This document needs contributions, especially for the
116 chapter on each language (<ref id="languages">)
117 and a chapter on instances of I18N (<ref id="examples">).
118 These chapters consist of contributions.
119 </P>
120
121 <P>
122 Otherwise, this will be a document only on Japanization,
123 because the original author Tomohiro KUBOTA
124 (<email>kubota@debian.org</email>, a retired DD; this is no longer a
125 working e-mail address)
126 speaks Japanese and lives in Japan.
127 </P>
128
129 <P>
130 <ref id="spanish"> is written by
131 Eusebio C Rufian-Zilbermann <email>eusebio@acm.org</email>.
132 </P>
133
134 <P>
135 Discussions are held at <tt>debian-devel@lists.debian.org</tt> and
136 <tt>debian-i18n@lists.debian.org</tt> mailing lists.
137 Please contact <tt>debian-doc@lists.debian.org</tt> if you wish to
138 update this document.
139 </P>
140
141 <chapt id="intro"><heading>Introduction</heading>
142
143 <sect id="intro-concepts"><heading>General Concepts</heading>
144
145 <P>
146 Debian includes many pieces of software. Though many of them
147 have the ability to process, input, and output text data, some
148 of these programs assume text is written in English (ASCII).
149 For people who use non-English languages, these programs are
150 barely usable. Moreover, though many programs can handle
151 not only ASCII but also ISO-8859-1, some of them
152 cannot handle multibyte characters for CJK (Chinese, Japanese,
153 and Korean) languages, nor combining characters for Thai.
154 </P>
155
156 <P>
157 So far, people who use non-English languages have given up
158 using their native languages and have accepted computers as they were.
159 However, we should now abandon this mistaken idea.
160 It is absurd that a person who
161 wants to use a computer has to learn English in advance.
162 </P>
163
164 <P>
165 I18N is needed in the following places.
166 <list>
167 <item>Displaying characters for the users' native languages.
168 <item>Inputting characters for the users' native languages.
169 <item>Handling files written in popular encodings
170 <footnote>
171 There are a few terms related to character code,
172 such as character set, character code, charset,
173 encoding, codeset, and so on. These words are explained
174 later.
175 </footnote>
176 that are used for the users' native languages.
177 <item>Using characters from the users' native languages for file names
178 and other items.
179 <item>Printing out characters from the users' native languages.
180 <item>Displaying messages by the program in the users' native languages.
181 <item>Formatting input and output of numbers, dates, money, etc., in a way that
182 obeys customs of the users' native cultures.
183 <item>Classifying and sorting characters, in a way that obeys customs
184 of the users' native cultures.
185 <item>Using typesetting and hyphenation rules appropriate for the users' native
186 languages.
187 </list>
188 This document puts emphasis on the first three items. This is because
189 these three items are the basis for the other items. Another
190 reason is that you cannot use software lacking the first
191 three items at all, while you can use software lacking the other items,
192 albeit inconveniently. This document will also mention translation of
193 messages (item 6), which is often called 'I18N'. Note that
194 the author regards using the term 'I18N' for translation
195 and <prgn>gettext</prgn>ization as completely wrong. The reason
196 should be clear from the fact that the author did not include
197 translation and <prgn>gettext</prgn>ization in the important first
198 three items.
199 </P>
200
201 <P>
202 Imagine a word processor which can display error
203 and help messages in your native language but cannot process
204 your native language. You will easily understand that the word
205 processor is not usable. On the other hand, a word processor which
206 can process your native language, but only displays error and help messages
207 in English, is usable, though it is not convenient.
208 Before we think of developing convenient software, we have to
209 think of developing usable software.
210 </P>
211
212 <P>
213 The following terminology is widely used.
214 <list>
215 <item>I18N (internationalization) means modification of software
216 or related technologies so that the software can potentially
217 handle multiple languages, customs, and so on in the world.
218 <item>L10N (localization) means implementation of a specific language
219 for already internationalized software.
220 </list>
221 However, this terminology is valid only for one specific model
222 out of a few models which we should consider for I18N.
223 Now I will introduce a few models other than this I18N-L10N model.
224 <taglist>
225 <tag>a. <strong>L10N</strong> (localization) model</tag>
226 <item><p>
227 This model is to support two languages or character codes,
228 English (ASCII) and another specific one. Examples of
229 software developed using this model are
230 Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual
231 Emacs), a text editor which can input and output Japanese text files,
232 and Hanterm, an X terminal emulator which can display and input
233 Korean characters via a few Korean encodings.
234 Since each programmer has his or her own mother tongue,
235 there are numerous L10N patches and L10N programs
236 written to satisfy the programmers' own needs.
237 </p></item>
238 <tag>b. <strong>I18N</strong> (internationalization) model</tag>
239 <item><p>
240 This model is to support many languages but only two
241 of them, English (ASCII) and another one, at the same time.
242 One has to specify the other language, usually with the <tt>LANG</tt>
243 environment variable.
244 The above I18N-L10N model can be regarded as a part of
245 this I18N model.
246 <prgn>gettext</prgn>ization is categorized into the I18N model.
247 </p></item>
248 <tag>c. <strong>M17N</strong> (multilingualization) model</tag>
249 <item><p>
250 This model is to support many languages at the same time.
251 For example, Mule (MULtilingual Enhancement to GNU Emacs)
252 can handle a text file which contains multiple languages -
253 for example, a paper on differences between Korean and Chinese
254 whose main text is written in Finnish. GNU Emacs 20 and
255 XEmacs now include Mule.
256 Note that the M17N model can only be applied in character-related
257 instances. For example, it is nonsense to display a message
258 like 'file not found' in many languages at the same time.
259 Unicode and UTF-8 are technologies which can be used for
260 this model.
261 <footnote>
262 I recommend not implementing Unicode and UTF-8 directly.
263 Instead, use locale technology and your software will
264 support not only UTF-8 but also many other encodings
265 in the world. If you implement UTF-8 directly,
266 your software can handle UTF-8 only. Such software
267 is not convenient.
268 </footnote>
269 </p></item>
270 </taglist>
271 </P>
272
273 <P>
274 Generally speaking, the M17N model is the best and the second-best is
275 the I18N model. The L10N model is the worst and you should not use it
276 except for a few fields where the I18N and M17N models are very difficult,
277 like DTP and X terminal emulators.
278 In other words, it is better for text-processing software to handle
279 many languages at the same time than to handle two (English and another language).
280 </P>
281
282 <P>
283 Now let me classify approaches for support of non-English languages
284 from another viewpoint.
285 <taglist>
286 <tag>A. Implementation <em>without</em> knowledge of each language</tag>
287 <item><p>
288 This approach is done by utilizing standardized methods supplied
289 by the kernel or libraries. The most important one is
290 <strong>locale</strong> technology which includes
291 <strong>locale category</strong>, conversion between
292 <strong>multibyte</strong> and <strong>wide
293 characters</strong> (<tt>wchar_t</tt>), and so on.
294 Another important technology is <prgn>gettext</prgn>.
295 The advantages of this approach are (1) that when the kernel or
296 libraries are upgraded, the software will automatically
297 support new additional languages, (2) that programmers need
298 not know each language, and (3) that a user can switch the behavior
299 of software with a common method, such as the <tt>LANG</tt> variable.
300 The disadvantage is that there are categories or fields where
301 a standardized method is not available. For example, there
302 are no standardized methods for text typesetting rules such
303 as line-breaking and hyphenation.
304 </p></item>
305 <tag>B. Implementation using knowledge of each language</tag>
306 <item><p>
307 This approach is to directly implement information about
308 each language based on the knowledge of programmers and
309 contributors. L10N almost always uses this approach.
310 The advantage of this approach is that a detailed and strict
311 implementation is possible beyond the field where
312 standardized methods are available, such as auto-detection
313 of encodings of text files to be read. Language-specific
314 problems can be perfectly solved (of course, this depends on
315 the skill of the programmer). The disadvantages are
316 (1) that the number of supported languages is restricted
317 by the skill or the interest of the programmers or the
318 contributors, (2) that labor which should be united and
319 concentrated to upgrade the kernel or libraries is dispersed
320 into many programs, that is, the wheel is re-invented,
321 and (3) that a user has to learn how to configure each program,
322 such as the <tt>LESSCHARSET</tt> variable, the <tt>.emacs</tt> file,
323 and other methods.
324 This approach can cause problems: for example, GNU roff
325 (before version 1.16) assumes that <tt>0xad</tt> is a hyphen
326 character, which is valid only for ISO-8859-1.
327 However, a majestic M17N program such as Mule can be
328 built using this approach.
329 </p></item>
330 </taglist>
331 </P>
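<P>
As a minimal sketch of approach A (not taken from any particular program),
the following C code counts the characters in a multibyte string by letting
the C library interpret it in the encoding of the current <tt>LC_CTYPE</tt>
locale. The function names <tt>init_locale</tt> and <tt>count_characters</tt>
are hypothetical; <tt>setlocale()</tt> and <tt>mbsrtowcs()</tt> are standard
C library functions. Note that the code contains no knowledge of any
particular encoding.
</P>

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Adopt the locale selected by the user's environment (LANG, LC_*).
 * Returns 1 on success, 0 if the requested locale is not available. */
int init_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Count the characters (not the bytes) in the multibyte string `mbs`,
 * interpreted in the encoding of the current LC_CTYPE locale.
 * Returns (size_t)-1 if the string is not valid in that encoding. */
size_t count_characters(const char *mbs)
{
    mbstate_t state;
    const char *src = mbs;

    /* Start from the initial conversion (shift) state. */
    memset(&state, 0, sizeof state);

    /* With a null destination, mbsrtowcs() stores nothing and only
     * returns the number of wide characters the string converts to. */
    return mbsrtowcs(NULL, &src, 0, &state);
}
```

<P>
A program would typically call <tt>init_locale()</tt> once at startup, so
that the same binary counts characters correctly for whichever encoding the
user's locale names, provided the C library supports that locale.
</P>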
332
333 <P>
334 Using this classification, let me consider the L10N, I18N, and M17N models
335 from the programmer's point of view.
336 </P>
337
338 <P>
339 The L10N model can be realized using only the programmer's own knowledge of his or her
340 language (i.e. approach B). Since the motivation of L10N is
341 usually to satisfy the programmer's own need, extensibility to
342 other languages is often ignored.
343 Though L10N-ed software is primarily useful for people who
344 speak the same language as the programmer, it is sometimes
345 useful for other people whose coding system is similar to
346 the programmer's. For example, software which
347 doesn't recognize EUC-JP but doesn't break EUC-JP will not
348 break EUC-KR either.
349 </P>
350
351 <P>
352 The main part of the I18N model is, in the case of a C program, achieved using
353 standardized locale technology and <prgn>gettext</prgn>.
354 A locale approach is classified as I18N because functions
355 related to locale change their behavior according to the current locales
356 for six categories which are set by <tt>setlocale()</tt>.
357 Namely, approach A is emphasized for I18N. For fields where
358 standardized methods are not available, however, approach B
359 cannot be avoided. Even in such a case, the developers should
360 take care that support for new languages can be easily added
361 later, even by other developers.
362 </P>
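<P>
The six categories mentioned above can be sketched in C as follows. This is
only an illustration; the table <tt>locale_categories</tt> and the function
names are hypothetical. <tt>LC_MESSAGES</tt> is defined by POSIX rather than
by ISO C, so a POSIX system is assumed. <tt>decimal_point()</tt> shows one
value that changes with the <tt>LC_NUMERIC</tt> category.
</P>

```c
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* The six locale categories that setlocale() can set individually.
 * LC_MESSAGES comes from POSIX, not ISO C (assumption: POSIX system). */
const struct {
    int id;
    const char *name;
} locale_categories[] = {
    { LC_CTYPE,    "LC_CTYPE"    },  /* character classification, encoding */
    { LC_COLLATE,  "LC_COLLATE"  },  /* sorting order */
    { LC_MESSAGES, "LC_MESSAGES" },  /* language of messages */
    { LC_MONETARY, "LC_MONETARY" },  /* currency formatting */
    { LC_NUMERIC,  "LC_NUMERIC"  },  /* number formatting */
    { LC_TIME,     "LC_TIME"     },  /* date and time formatting */
};

#define N_LOCALE_CATEGORIES \
    (sizeof locale_categories / sizeof locale_categories[0])

/* Return the name of the locale in effect for one category.
 * A NULL second argument queries the setting without changing it. */
const char *current_locale_for(int category)
{
    return setlocale(category, NULL);
}

/* Print one "CATEGORY=locale" line per category. */
void print_locale_categories(FILE *out)
{
    size_t i;

    for (i = 0; i < N_LOCALE_CATEGORIES; i++)
        fprintf(out, "%s=%s\n", locale_categories[i].name,
                current_locale_for(locale_categories[i].id));
}

/* The decimal point of the current LC_NUMERIC locale: "." in the
 * "C" locale, possibly "," in others. */
const char *decimal_point(void)
{
    return localeconv()->decimal_point;
}
```

<P>
After <tt>setlocale(LC_ALL, "")</tt>, calling
<tt>print_locale_categories(stdout)</tt> shows which locale the user's
environment selected for each category.
</P>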
363
364 <P>
365 The M17N model can be achieved using international encodings such
366 as ISO 2022 and Unicode. Though you can hard-code these encodings
367 in your software (i.e. approach B), I recommend using standardized
368 locale technology. However, using international encodings
369 is not sufficient to achieve the M17N model. You will have to prepare
370 a mechanism to switch <strong>input methods</strong>. You will also want
371 to prepare an encoding-guessing mechanism for input files,
372 such as <prgn>jless</prgn> and <prgn>emacs</prgn> have.
373 Mule is the best software which has achieved M17N (though it does not
374 use locale technology).
375 </P>
376
377 <sect id="intro-organization"><heading>Organization</heading>
378
379 <P>
380 Let's preview the contents of each chapter in this document.
381 </P>
382
383 <P>
384 As I wrote, this document will put stress on correct handling of
385 characters and character codes for users' native
386 languages. To achieve this purpose, I will start the real contents
387 of this document by discussing basic important concepts on
388 characters in <ref id="coding">. Since this chapter introduces
389 many terms, all of you will need to read this chapter.
390 The next chapter, <ref id="codes">, introduces many national
391 and international standards of <em>coded character sets</em>
392 and <em>encodings</em>. I think most of you can do without
393 reading this chapter, since <em>LOCALE</em> technology will
394 enable us to develop international software without knowledge
395 of these character sets and encodings. However, knowing
396 about these standards will help you to
397 understand the merit and necessity of LOCALE technology.
398 </P>
399
400 <P>
401 The following chapter, <ref id="languages">,
402 describes detailed information for
403 each language. This information will help people who develop
404 high-quality text processing software such as DTP and Web browsers.
405 </P>
406
407 <P>
408 The chapter <ref id="locale"> describes the most important
409 concept for I18N. Not only concepts but also many important
410 C functions are introduced in this chapter.
411 </P>
412
413 <P>
414 The next few chapters, <ref id="output">, <ref id="input">,
415 <ref id="internal">, and <ref id="internet">, cover important
416 and frequent applications of LOCALE technology.
417 You can find solutions for typical I18N problems in these
418 chapters.
419 </P>
420
421 <P>
422 You may need to develop software using some special libraries
423 or languages other than C/C++. The chapters <ref id="library">
424 and <ref id="otherlanguage"> are written for such purposes.
425 </P>
426
427 <P>
428 The next chapter, <ref id="examples">, is a collection of case studies.
429 Both generic and special technologies will be discussed.
430 You can also contribute by writing a section for this chapter.
431 </P>
432
433 <P>
434 You may want to study more;
435 the last chapter, <ref id="reference">, is supplied for this purpose.
436 Some of the references listed in the chapter are very important.
437 </P>
438
439
440 <chapt id="coding"><heading>Important Concepts for Character Coding Systems</heading>
441
442 <P>
443 A character coding system is one of the fundamental elements of
444 software and information processing.
445 Without proper handling of character codes, your software is
446 far from being internationalized.
447 Thus the author begins this document with a discussion of character
448 codes.
449 </P>
450
451 <P>
452 In this chapter, basic concepts such as <em>coded character set</em>
453 and <em>encoding</em> are introduced. These terms will be needed
454 to read this document and other documents on internationalization
455 and character codes including Unicode.
456 </P>
457
458
459 <sect id="coding-general-term"><heading>Basic Terminology</heading>
460
461 <P>
462 I begin this chapter by defining a few very important words.
463 </P>
464
465 <P>
466 As many people point out, there is confusion about terminology, since
467 words are used in various different ways. The author does not
468 want to add new terminology to a confusing ocean of
469 various terminologies. Therefore, the terminology of
470 <url id="http://www.faqs.org/rfcs/rfc2130.html" name="RFC 2130">
471 will be
472 adopted in this document, with the one exception of the term 'character
473 set'.
474 </P>
475
476 <P>
477 <taglist>
478 <tag><strong>Character</strong>
479 <item><p>
480 A character is an individual unit of which sentences and text
481 consist. A character is an abstract notion.
482 </p></item>
483 <tag><strong>Glyph</strong>
484 <item><p>
485 A glyph is a specific instance of a character. <em>Character</em>
486 and <em>glyph</em> form a pair of terms. Sometimes a character
487 has multiple glyphs (for example, '$' may have one or two vertical
488 bars; Arabic characters have four glyphs for each character;
489 some CJK ideograms have many glyphs). Sometimes two or more
490 characters construct one glyph (for example, the ligature 'fi').
491 In most cases, text data, which are intended to contain not
492 visual information but abstract ideas, don't have to carry
493 information on glyphs, since the difference between glyphs does
494 not affect the meaning of the text. However, the distinction
495 between different glyphs for a single CJK ideogram may
496 sometimes be important for proper nouns such as names of
497 persons and places. So far, there is no standardized method
498 for plain text to carry information on glyphs. This
499 means that plain text cannot be used in some special fields
500 such as citizen registration systems, serious DTP such as
501 newspaper systems, and so on.
502 </p></item>
503 <tag><strong>Encoding</strong>
504 <item><p>
505 An encoding is a rule by which characters and texts are
506 expressed as combinations of bits or bytes in order to
507 treat characters in computers. The terms <em>character
508 coding system</em>, <em>character code</em>, <em>charset</em>,
509 and so on are used to express the same meaning.
510 Basically, an <em>encoding</em> takes care of
511 <em>characters</em>, not <em>glyphs</em>.
512 There are many official and de-facto standards of encodings
513 such as ASCII, ISO 8859-{1,2,...,15},
514 ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2},
515 EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620,
516 VISCII, VSCII, so-called 'CodePages', UTF-7, UTF-8, UTF-16LE,
517 UTF-16BE, KOI8-R, and so on.
518 To construct an encoding, we have to consider the
519 following concepts. (Encoding = one or more
520 CCS + one CES).
521 </p></item>
522 <tag><strong>Character Set</strong>
523 <item><p>
524 A character set is a set of characters. This determines
525 the range of characters which the encoding can handle.
526 In contrast to a <em>coded character set</em>, this is often
527 called a <em>non-coded character set</em>.
528 </p></item>
529 <tag><strong>Coded Character Set (CCS)</strong>
530 <item><p>
531 Coded character set (CCS) is a term defined in
532 <url id="http://www.faqs.org/rfcs/rfc2130.html" name="RFC 2130">
533 and means a character set in which every character
534 is assigned a unique number by some method. There are many national
535 and international standards for CCS.
536 Many national standards for CCS adopt
537 a way of coding that obeys international
538 standards such as ISO 646 or ISO 2022. ASCII, BS 4730,
539 JISX 0201 Roman, and so on are examples of ISO-646 variants. All
540 ISO-646 variants, ISO 8859-*, JISX 0208, JISX 0212, KSX 1001,
541 GB 2312, CNS 11643, CCCII, TIS 620, TCVN 5712, and so on are
542 examples of ISO 2022-compliant CCS. VISCII and Big5 are
543 examples of non-ISO 2022-compliant
544 CCS. UCS-2 and UCS-4 (ISO 10646) are also examples of CCS.
545 </p></item>
546 <tag><strong>Character Encoding Scheme (CES)</strong>
547 <item><p>
548 Character Encoding Scheme is also a term defined in
549 <url id="http://www.faqs.org/rfcs/rfc2130.html" name="RFC 2130">;
550 it refers to methods of constructing an encoding using one or
551 more CCS. This is important when two or more CCS are used
552 to construct an encoding.
553 ISO 2022 is a method to construct an encoding from
554 one or more ISO 2022-compliant CCS. ISO 2022 is a very
555 complex system, and subsets of ISO 2022 are usually used,
556 such as EUC-JP (ASCII and JISX 0208), ISO-2022-KR (ASCII
557 and KSX 1001), and so on. CES is not important for
558 encodings with only one 8bit CCS.
559 The UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can be
560 regarded as CES whose CCS is Unicode or ISO 10646.
561 </p></item>
562 </taglist>
563 </P>
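<P>
The relation "encoding = CCS + CES" can be illustrated with a short C sketch.
One code point of a single CCS (here UCS, for example U+3042 HIRAGANA LETTER A)
is turned into different byte sequences by different CES (UTF-8, UTF-16BE,
UTF-16LE). The function names below are hypothetical and the code is meant
only to illustrate the terminology; as noted elsewhere in this document,
applications are better served by locale technology than by hard-coded
encoders.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS code point as UTF-8.
 * Returns the number of bytes written (1 to 4), or 0 if `cp` is a
 * surrogate or lies beyond U+10FFFF. */
size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Store one 16-bit code unit in the requested byte order. */
static void put16(uint16_t unit, int big_endian, unsigned char *out)
{
    out[0] = (unsigned char)(big_endian ? unit >> 8 : unit & 0xFF);
    out[1] = (unsigned char)(big_endian ? unit & 0xFF : unit >> 8);
}

/* Encode one UCS code point as UTF-16BE (big_endian != 0) or
 * UTF-16LE (big_endian == 0).
 * Returns the number of bytes written (2 or 4), or 0 if invalid. */
size_t encode_utf16(uint32_t cp, int big_endian, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        put16((uint16_t)cp, big_endian, out);
        return 2;
    }
    cp -= 0x10000;                          /* split into a surrogate pair */
    put16((uint16_t)(0xD800 | (cp >> 10)), big_endian, out);
    put16((uint16_t)(0xDC00 | (cp & 0x3FF)), big_endian, out + 2);
    return 4;
}
```

<P>
For U+3042, <tt>encode_utf8()</tt> yields the three bytes
<tt>0xE3 0x81 0x82</tt>, while <tt>encode_utf16()</tt> yields
<tt>0x30 0x42</tt> (big endian) or <tt>0x42 0x30</tt> (little endian):
the same CCS element, three different encodings.
</P>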
564
565 <P>
566 Some other terms related to character codes are also commonly used.
567 </P>
568
569 <P>
570 <strong>Character code</strong> is a widely-used term meaning
571 <em>encoding</em>. It is a primitive and crude term for
572 the way a computer handles characters by assigning numbers.
573 For example, <em>character code</em> can refer to an <em>encoding</em>
574 and can also refer to a <em>coded character set</em>. Thus this term can
575 be used only when both of them can be regarded as belonging to
576 the same category. This term should be avoided in serious
577 discussions. This document will not use this term hereafter.
578 </P>
579
580 <P>
581 <strong>Codeset</strong> is a term for an <em>encoding</em>
582 or <em>character encoding scheme</em>.
583 <footnote>
584 This document used the word <em>codeset</em> before November 2000
585 to refer to an <em>encoding</em>. I changed the terminology since
586 I could not find the word <em>codeset</em> in documents written
587 in English (I adopted this word from a book in Japanese).
588 <em>encoding</em> seems more popular.
589 </footnote>
590 </P>
591
592 <P>
593 <strong>charset</strong> is also a commonly used term.
594 This word is used very widely, for example, in MIME (like
595 <tt>Content-Type: text/plain; charset=iso-8859-1</tt>),
596 in XLFD (X Logical Font Description) font names
597 (CharSetRegistry and CharSetEncoding fields), and so on.
598 Note that <em>charset</em> in MIME is <em>encoding</em>,
599 while <em>charset</em> in an XLFD font name is <em>coded character
600 set</em>. This is very confusing. In this document,
601 <em>charset</em> and <em>character set</em> are used in
602 the XLFD meaning, since I think <em>character set</em> should
603 mean a set of characters, not an encoding.
604 </P>
605
606 <P>
607 Ken Lunde's "CJKV Information Processing" uses the term
608 <strong>encoding method</strong>. He says that
609 ISO-2022, EUC, Big5, and Shift-JIS are examples of
610 <em>encoding methods</em>. It seems that his <em>encoding
611 method</em> is <em>CES</em> in this document. However,
612 we should notice that Big5 and Shift-JIS are encodings
613 while ISO-2022 and EUC are not.
614 <footnote>
615 During I18N programming, we will frequently encounter EUC-JP
616 or EUC-KR, while we will rarely encounter EUC. I think it is
617 not appropriate to stress EUC, a class of encodings, over
618 EUC-JP, EUC-KR, and so on, concrete encodings. It is just like
619 regarding ISO 8859 as a concrete encoding, though ISO 8859 is
620 a class of encodings of ISO 8859-{1,2,...,15}.
621 </footnote>
622 </P>
623
624 <P>
625 <url id="http://www.unicode.org/unicode/reports/tr17/"
626 name="Character Encoding Model, Unicode Technical Report #17">
627 (hereafter, <em>"the Report"</em>) suggests a five-level model.
628 <list>
629 <item>ACR: abstract character repertoire
630 <item>CCS: Coded Character Set
631 <item>CEF: Character Encoding Form
632 <item>CES: Character Encoding Scheme
633 <item>TES: Transfer Encoding Syntax
634 </list>
635 </P>
636
637 <P>
638 <strong>TES</strong> is also suggested in
639 <url id="http://www.faqs.org/rfcs/rfc2130.html" name="RFC 2130">.
640 Some examples of
641 TES are: <em>base64</em>, <em>uuencode</em>, <em>BinHex</em>,
642 <em>quoted-printable</em>, <em>gzip</em>, and so on.
643 TES means a transform of encoded data which may (or may not) include
644 textual data. Thus, TES is not a part of character encoding.
645 However, TES is important in Internet data exchange.
646 </P>
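<P>
To illustrate why a TES is not part of the character encoding, here is a
minimal C sketch of a <em>base64</em> encoder. It operates on arbitrary
bytes and neither knows nor cares whether those bytes are text in some
encoding. The function name <tt>base64_encode</tt> and its interface are
hypothetical; the caller is assumed to supply an output buffer of at least
<tt>4 * ((len + 2) / 3) + 1</tt> bytes.
</P>

```c
#include <stddef.h>

/* Encode `len` bytes from `in` as base64 into `out`, adding '='
 * padding and a terminating NUL.  Returns the number of characters
 * written, not counting the NUL.
 * Assumption: `out` has room for 4 * ((len + 2) / 3) + 1 bytes. */
size_t base64_encode(const unsigned char *in, size_t len, char *out)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789+/";
    size_t i, o = 0;

    for (i = 0; i < len; i += 3) {
        size_t left = len - i;              /* input bytes still unread */

        /* Pack up to three input bytes into one 24-bit group. */
        unsigned long group = (unsigned long)in[i] << 16;
        if (left > 1)
            group |= (unsigned long)in[i + 1] << 8;
        if (left > 2)
            group |= (unsigned long)in[i + 2];

        /* Emit four 6-bit digits, padding where input was missing. */
        out[o++] = alphabet[(group >> 18) & 0x3F];
        out[o++] = alphabet[(group >> 12) & 0x3F];
        out[o++] = left > 1 ? alphabet[(group >> 6) & 0x3F] : '=';
        out[o++] = left > 2 ? alphabet[group & 0x3F] : '=';
    }
    out[o] = '\0';
    return o;
}
```

<P>
The same function transforms the ASCII bytes of <tt>Man</tt> into
<tt>TWFu</tt> and the UTF-8 bytes <tt>0xE3 0x81 0x82</tt> into
<tt>44GC</tt>; the receiver must learn the character encoding by other
means, such as a MIME <em>charset</em> parameter.
</P>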
647
648 <P>
649 When using a computer, we rarely have occasion to deal with
650 <strong>ACR</strong>.
651 Though it is true that CJK people have their national standards of
652 ACR (for example, standards for ideograms which can be used for
653 personal names) and some of us may need to handle these ACR with
654 computers (for example, citizen registration systems), this is too
655 heavy a theme for this document. This is because there are no
656 standardized or encouraged methods to handle these ACR. You may
657 have to build the whole system for such purposes. Good luck!
658 </P>
659
660 <P>
661 <strong>CCS</strong> in <em>"the Report"</em> is the same as what I described
662 in this document.
663 It has concrete examples: ASCII, ISO 8859-{1,2,...,15}, JISX 0201,
664 JISX 0208, JISX 0212, KSX 1001, KSX 1002, GB 2312, Big5,
665 CNS 11643, TIS 620, VISCII, TCVN 5712, UCS2, UCS4, and so on.
666 Some of them are national standards, some are international
667 standards, and others are de-facto standards.
668 </P>
669
670 <P>
671 <strong>CEF</strong> and <strong>CES</strong> in <em>"the Report"</em>
672 correspond to <strong>CES</strong> in this document.
673 This document will not distinguish these two, since I think this
674 causes no inconvenience. An encoding with a significant CEF doesn't
675 have a significant CES (in <em>"the Report"</em> meaning), and
676 vice versa. Then why should we have to distinguish these two?
677 The only exception is the UTF-16 series. In the UTF-16 series,
678 UTF-16 is a CEF and UTF-16BE is a CES. This is the only case where
679 we need a distinction between CEF and CES.
680 </P>
681
682 <P>
683 Now, <strong>CES</strong> is a concrete concept with concrete
684 examples: ASCII, ISO 8859-{1,2,...,15}, EUC-JP, EUC-KR, ISO 2022-JP,
685 ISO 2022-JP-1, ISO 2022-JP-2, ISO 2022-CN, ISO 2022-CN-EXT,
686 ISO 2022-KR, ISO 2022, VISCII, UTF-7, UTF-8, UTF-16LE, UTF-16BE,
687 and so on. Now they are encodings themselves.
688 </P>
689
690 <P>
691 The most important concept in this section is the distinction between
692 <em>coded character set</em> and <em>encoding</em>. <em>Coded
693 character set</em> is a component of <em>encoding</em>. Text data
694 are described in <em>encoding</em>, not <em>coded character set</em>.
695 </P>
696
697
698 <sect id="stateful"><heading>Stateless and Stateful</heading>
699
700 <P>
701 To construct an encoding with two or more CCS, the CES has to supply
702 a method to avoid collisions between these CCS.
703 There are two ways to do that. One is to make all characters
704 in all the CCS have unique code points. The other is to
705 allow characters from different CCS to have the same
706 code point and to have a code such as an escape sequence to switch the
707 <strong>SHIFT STATE</strong>, that is, to select one character set.
708 </P>
709
710 <P>
711 An encoding with shift states is called <strong>STATEFUL</strong> and
712 one without shift states is called <strong>STATELESS</strong>.
713 </P>
714
715 <P>
716 Examples of stateful encodings are: ISO 2022-JP, ISO 2022-KR,
717 ISO 2022-INT-1, ISO 2022-INT-2, and so on.
718 </P>
719
720 <P>
721 For example, in ISO 2022-JP, the two bytes <tt>0x24 0x2c</tt> may mean
722 the Japanese Hiragana character 'GA' or the two ASCII characters
723 '$' and ',' according to the shift state.
724 </P>
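<P>
The effect of the shift state can be sketched in C as follows. This
simplified scanner counts the characters in an ISO 2022-JP byte sequence
and recognizes only four escape sequences: <tt>ESC ( B</tt> (ASCII),
<tt>ESC ( J</tt> (JISX 0201 Roman), <tt>ESC $ @</tt> and <tt>ESC $ B</tt>
(JISX 0208). The function name is hypothetical and no validation of the
two-byte codes is attempted.
</P>

```c
#include <stddef.h>

/* Count the characters in an ISO 2022-JP byte sequence.
 * Escape sequences only change the shift state and are not counted.
 * Simplified sketch: unknown escape sequences are counted byte by
 * byte, and two-byte codes are not checked against JISX 0208. */
size_t iso2022jp_count_chars(const unsigned char *s, size_t len)
{
    int two_byte = 0;       /* shift state: 0 = one-byte set (initial),
                             *              1 = JISX 0208 (two bytes) */
    size_t count = 0;
    size_t i = 0;

    while (i < len) {
        if (s[i] == 0x1b && len - i >= 3) {
            if (s[i + 1] == '$' && (s[i + 2] == '@' || s[i + 2] == 'B')) {
                two_byte = 1;               /* switch to JISX 0208 */
                i += 3;
                continue;
            }
            if (s[i + 1] == '(' && (s[i + 2] == 'B' || s[i + 2] == 'J')) {
                two_byte = 0;               /* switch to a one-byte set */
                i += 3;
                continue;
            }
        }
        if (two_byte && s[i] >= 0x21 && s[i] <= 0x7e) {
            if (len - i < 2)
                break;                      /* truncated two-byte character */
            i += 2;
        } else {
            i += 1;                         /* one-byte or control character */
        }
        count++;
    }
    return count;
}
```

<P>
Given only the bytes <tt>0x24 0x2c</tt>, the function counts two
characters ('$' and ','); given <tt>ESC $ B 0x24 0x2c ESC ( B</tt>, it
counts one (Hiragana 'GA'), although the sequence is eight bytes long.
</P>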
725
726 <sect id="multibyte"><heading>Multibyte encodings</heading>
727
728 <P>
729 Encodings are classified into multibyte ones and the others,
730 according to the relationship between the number of characters and the number of
731 bytes in the encoding.
732 </P>
733
734 <P>
735 In a non-multibyte encoding, one character is always expressed
736 by one byte. On the other hand, one character may be expressed in
737 one or more bytes in a multibyte encoding. Note that the number of bytes per character
738 is not fixed even in a single encoding.
739 </P>
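<P>
As an illustration of a variable number of bytes per character within one
stateless encoding, the following C sketch steps through an EUC-JP byte
sequence, where a character takes one byte (ASCII), two bytes (JISX 0208,
or <tt>0x8e</tt> followed by a half-width Katakana byte), or three bytes
(<tt>0x8f</tt> followed by a JISX 0212 code). The function names are
hypothetical, and only lead bytes are checked. This hard-codes knowledge
of one encoding (approach B) purely for explanation; portable software
would rather use locale functions such as <tt>mbrlen()</tt>.
</P>

```c
#include <stddef.h>

/* Length in bytes of the EUC-JP character starting with `lead`,
 * or 0 if `lead` cannot start a character. */
size_t eucjp_char_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                   /* ASCII */
    if (lead == 0x8e)
        return 2;                   /* SS2 + half-width Katakana */
    if (lead == 0x8f)
        return 3;                   /* SS3 + two bytes of JISX 0212 */
    if (lead >= 0xa1 && lead <= 0xfe)
        return 2;                   /* two bytes of JISX 0208 */
    return 0;
}

/* Count the characters in an EUC-JP byte sequence.
 * Returns (size_t)-1 on an invalid lead byte or a truncated character.
 * Simplified sketch: trailing bytes are not validated. */
size_t eucjp_count_chars(const unsigned char *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    while (i < len) {
        size_t n = eucjp_char_len(s[i]);

        if (n == 0 || n > len - i)
            return (size_t)-1;
        i += n;
        count++;
    }
    return count;
}
```

<P>
For example, the eight bytes
<tt>0x61 0xa4 0xac 0x8e 0xb1 0x8f 0xb0 0xa1</tt> hold four characters of
one, two, two, and three bytes; <tt>0xa4 0xac</tt> is Hiragana 'GA', the
same JISX 0208 code <tt>0x24 0x2c</tt> with the high bit set on each byte.
</P>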
740
741 <P>
742 Examples of multibyte encodings are: EUC-JP, EUC-KR, ISO 2022-JP,
743 Shift-JIS, Big5, UHC, UTF-8, and so on. Note that all of UTF-* are
744 multibyte.
745 </P>
746
747 <P>
748 Examples of non-multibyte encodings are: ISO 8859-1, ISO 8859-2,
749 TIS 620, VISCII, and so on.
750 </P>
751
752 <P>
753 Note that even in non-multibyte encoding, number of characters
754 and number of bytes may differ if the encoding is stateful.
755 </P>
756
757 <P>
758 Ken Lunde's "CJKV Information Processing"
759 <footnote>
760 ISBN 1-56592-224-7, O'Reilly, 1999
761 </footnote>
762 classifies encoding methods
763 into the following three categories:
764 <list>
765 <item>modal
766 <item>non-modal
767 <item>fixed-length
768 </list>
769 <em>Modal</em> corresponds to <em>stateful</em> in this document.
The other two are <em>stateless</em>, where <em>non-modal</em> is
<em>multibyte</em> and <em>fixed-length</em> is
<em>non-multibyte</em>. However, I think <em>stateful</em> -
<em>stateless</em> and <em>multibyte</em> - <em>non-multibyte</em>
are independent concepts.
775 <footnote>
though there are no existing encodings which are stateful and
non-multibyte.
778 </footnote>
779 </P>
780
781 <sect id="number"><heading>Number of Bytes, Number of Characters, and Number of Columns</heading>
782
783 <P>
One ASCII character is always expressed by one byte
and occupies one column on the console or in X terminal emulators
(with a fixed-width font for X).
One must not make such an assumption in I18N programming
and has to clearly distinguish the number of bytes, characters,
and columns.
790 </P>
791
792 <P>
As for the relationship between characters and bytes,
in multibyte encodings, two or more bytes may be needed
to express one character. In stateful encodings, escape
sequences occupy bytes but do not correspond to any character.
797 </P>
798
799 <P>
The number of columns is not defined in any standard. However,
CJK ideograms, Japanese Hiragana and Katakana,
and Korean Hangul usually occupy two columns on the console or in X terminal emulators.
Note that 'Full-width forms' in the UCS-2 and UCS-4 coded character sets
occupy two columns and 'Half-width forms' occupy one column.
Combining characters used for Thai and so on can be regarded as
zero-column characters. Though there are no standards, you can
use <tt>wcwidth()</tt> and <tt>wcswidth()</tt> to obtain the number of columns.
See <ref id="output-console-column"> for details.
809 </P>
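<P>
The difference between the three counts can be made concrete with a small
experiment. The following Python sketch is an illustration only: it does not
call the C functions <tt>wcwidth()</tt> or <tt>wcswidth()</tt>, but
approximates the column count from the Unicode properties exposed by the
standard <tt>unicodedata</tt> module. The function names <tt>columns</tt>
and <tt>measure</tt> are hypothetical, and control characters are not
treated specially.
</P>

```python
import unicodedata


def columns(text: str) -> int:
    """Approximate the number of terminal columns occupied by text."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue                      # combining mark: zero columns
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2                    # wide or full-width: two columns
        else:
            total += 1                    # everything else: one column
    return total


def measure(text: str, encoding: str):
    """Return (bytes, characters, columns) for text in the given encoding."""
    return len(text.encode(encoding)), len(text), columns(text)


sample = "a\u3042"        # 'a' followed by Hiragana A
thai = "\u0e01\u0e31"     # Thai KO KAI followed by combining MAI HAN-AKAT

for enc in ("euc_jp", "utf-8", "iso2022_jp"):
    print(enc, measure(sample, enc))
print("utf-8", measure(thai, "utf-8"))
```

<P>
The character and column counts stay the same across encodings while the
byte count changes; in the stateful ISO 2022-JP the escape sequences add
bytes that belong to no character.
</P>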
810
811
812
813
814
815
816 <chapt id="codes"><heading>Coded Character Sets And Encodings in the World</heading>
817
818 <P>
819 Here major coded character sets and encodings are introduced.
820 Note that you don't have to know the detail of these
821 character codes if you use LOCALE and <tt>wchar_t</tt> technology.
822 </P>
823
824 <P>
However, this knowledge will help you to understand why the number
of bytes, characters, and columns should be counted separately,
why <tt>strchr()</tt> and so on should not be used, why you should
use LOCALE and <tt>wchar_t</tt> technology instead of hard-coded
processing of existing character codes, and so on.
830 </P>
831
832 <P>
This variety of character sets and encodings will tell you about the
struggles of people around the world to handle their own languages on
computers. In particular, CJK people had to work out various
technologies to use a large number of characters within ASCII-based computer
systems.
838 </P>
839
840 <P>
If you are planning to develop text-processing software
beyond the fields which the LOCALE technology covers, you will
have to understand the following descriptions very well.
These fields include automatic detection of the encoding
used for an input file (most Japanese-capable text viewers
such as <prgn>jless</prgn> and <prgn>lv</prgn> have this mechanism)
and so on.
848 </P>
849
850
851 <sect id="ascii"><heading>ASCII and ISO 646</heading>
852
853 <P>
854 <strong>ASCII</strong> is a CCS and also an encoding at the same time.
855 ASCII is 7bit and contains 94 printable characters which are
856 encoded in the region of <tt>0x21</tt>-<tt>0x7e</tt>.
857 </P>
858
859 <P>
<strong>ISO 646</strong> is the international standard of ASCII.
The following 12 characters,
862 <list>
863 <item>0x23 (number),
864 <item>0x24 (dollar),
865 <item>0x40 (at),
866 <item>0x5b (left square bracket),
867 <item>0x5c (backslash),
868 <item>0x5d (right square bracket),
869 <item>0x5e (caret),
870 <item>0x60 (backquote),
871 <item>0x7b (left curly brace),
872 <item>0x7c (vertical line),
873 <item>0x7d (right curly brace), and
874 <item>0x7e (tilde)
875 </list>
are called <strong>IRV</strong> (International Reference Version)
and the other 82 (94 - 12 = 82) characters are called
<strong>BCT</strong> (Basic Code Table).
Characters at IRV can differ between countries.
Here are a few examples of versions of ISO 646.
<list>
<item>UK version (BS 4730): 0x23 is pound currency mark, and so on.
<item>US version (ASCII)
884 <item>Japanese version (JISX 0201 Roman): 0x5c is yen currency mark, and
885 so on.
886 <item>Italian version (UNI 0204-70): 0x7b is 'a' with grave accent, and
887 so on.
888 <item>French version (NF Z 62-010): 0x7b is 'e' with acute accent, and
889 so on.
890 </list>
891 </P>
892
893 <P>
894 As far as I know, all encodings (besides EBCDIC) in the world
895 are compatible with ISO 646.
896 </P>
897
898 <P>
899 Characters in 0x00 - 0x1f, 0x20, and 0x7f are control characters.
900 </P>
901
902 <P>
Nowadays usage of encodings incompatible with ASCII is not
encouraged and thus ISO 646-* (other than the US version) should not
be used. One of the reasons is that when a string is converted into
Unicode, the converter doesn't know whether IRVs should be converted into
characters with the same shapes or characters with the same codes.
Another reason is that source code
is written in ASCII. Source code must be readable anywhere.
910 </P>
911
912
913 <sect id="iso8859"><heading>ISO 8859</heading>
914
915 <P>
916 <strong>ISO 8859</strong> is both a series of CCS and a series of
917 encodings. It is an expansion of ASCII using all 8 bits.
918 Additional 96 printable characters encoded in 0xa0 - 0xff are
919 available besides 94 ASCII printable characters.
920 </P>
921
922 <P>
There were 10 variants of ISO 8859 in 1997; the list below also includes the parts added later.
924 <taglist>
925 <tag>ISO-8859-1 Latin alphabet No.1 (1987)</tag>
926 <item>characters for western European languages
927 <tag>ISO-8859-2 Latin alphabet No.2 (1987)</tag>
928 <item>characters for central European languages
929 <tag>ISO-8859-3 Latin alphabet No.3 (1988)</tag>
930 <tag>ISO-8859-4 Latin alphabet No.4 (1988)</tag>
931 <item>characters for northern European languages
932 <tag>ISO-8859-5 Latin/Cyrillic alphabet (1988)</tag>
933 <tag>ISO-8859-6 Latin/Arabic alphabet (1987)</tag>
934 <tag>ISO-8859-7 Latin/Greek alphabet (1987)</tag>
935 <tag>ISO-8859-8 Latin/Hebrew alphabet (1988)</tag>
936 <tag>ISO-8859-9 Latin alphabet No.5 (1989)</tag>
937 <item>same as ISO-8859-1 except for Turkish instead of Icelandic
938 <tag>ISO-8859-10 Latin alphabet No.6 (1993)</tag>
939 <item>Adds Inuit (Greenlandic) and Sami (Lappish) letters to ISO-8859-4
940 <tag>ISO-8859-11 Latin/Thai alphabet (2001)</tag>
941 <item>same as TIS-620 Thai national standard
942 <tag>ISO-8859-13 Latin alphabet No.7 (1998)</tag>
943 <tag>ISO-8859-14 Latin alphabet No.8 (Celtic) (1998)</tag>
944 <tag>ISO-8859-15 Latin alphabet No.9 (1999)</tag>
945 <tag>ISO-8859-16 Latin alphabet No.10 (2001)</tag>
946 <item>&nbsp;</item>
947 </taglist>
948 </P>
949
950 <P>
951 A detailed explanation is found at
952 <url id="http://park.kiev.ua/mutliling/ml-docs/iso-8859.html">.
953 </P>
954
955
956 <sect id="iso-2022"><heading>ISO 2022</heading>
957
958 <P>
Using ASCII and ISO 646, we can use 94 characters at most.
Using ISO 8859, the number increases to 190 (= 94 + 96).
However, we may want to use many more characters.
Or, we may want to use several of these character sets, not just one.
One of the answers is ISO 2022.
964 </P>
965
966 <P>
<strong>ISO 2022</strong> is an international standard of CES.
ISO 2022 determines a few requirements for a CCS to be a member
of ISO 2022-based encodings. It also defines very
extensive (and complex) rules to combine these CCS into one
encoding. Many encodings such as EUC-*, ISO 2022-*,
compound text,
973 <footnote>
974 Compound text is a standard for text exchange between X clients.
975 </footnote>
and so on can be regarded as subsets of ISO 2022.
ISO 2022 is so complex that you may not be able to understand all of it.
That is OK; what is important here is the concept of ISO 2022 of
building an encoding by switching between various (ISO 2022-compliant)
coded character sets.
981 </P>
982
983 <P>
984 The sixth edition of ECMA-35 is fully identical with
985 ISO 2022:1994 and you can find the official document
986 at <url id="http://www.ecma.ch/ecma1/stand/ECMA-035.HTM">.
987 </P>
988
989 <P>
ISO 2022 has two versions, 7bit and 8bit. The 8bit version
is explained first. The 7bit version is a subset
of the 8bit version.
993 </P>
994
995 <P>
996 The 8bit code space is divided into four regions,
997 <list>
998 <item>0x00 - 0x1f: C0 (Control Characters 0),
999 <item>0x20 - 0x7f: GL (Graphic Characters Left),
1000 <item>0x80 - 0x9f: C1 (Control Characters 1), and
1001 <item>0xa0 - 0xff: GR (Graphic Characters Right).
1002 </list>
1003 </P>
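<P>
The four regions can be expressed as a simple range check. The following
Python sketch is only an illustration of the table above; the function name
<tt>code_region</tt> is hypothetical.
</P>

```python
def code_region(byte: int) -> str:
    """Return the name of the ISO 2022 8bit region a byte value falls into."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte <= 0x1F:
        return "C0"   # control characters 0
    if byte <= 0x7F:
        return "GL"   # graphic characters, left half
    if byte <= 0x9F:
        return "C1"   # control characters 1
    return "GR"       # graphic characters, right half


# 'a', the two bytes 0xa4 0xa2, and ESC
sample = b"a\xa4\xa2\x1b"
print([code_region(b) for b in sample])
```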
1004
1005 <P>
GL and GR are the spaces where (printable) character sets are mapped.
1007 </P>
1008
1009 <P>
1010 Next, all character sets, for example, ASCII, ISO 646-UK,
1011 and JIS X 0208, are classified into following four categories,
1012 <list>
1013 <item>(1) character set with 1-byte 94-character,
1014 <item>(2) character set with 1-byte 96-character,
1015 <item>(3) character set with multibyte 94-character, and
1016 <item>(4) character set with multibyte 96-character.
1017 </list>
1018 </P>
1019
1020 <P>
1021 Characters in character sets with 94-character are mapped
1022 into 0x21 - 0x7e. Characters in 96-character set are
1023 mapped into 0x20 - 0x7f.
1024 </P>
1025
1026 <P>
For example, ASCII, ISO 646-UK, and JISX 0201 Katakana
are classified into (1), JISX 0208 Japanese Kanji,
KSX 1001 Korean, and GB 2312-80 Chinese are classified into (3),
and ISO 8859-* are classified into (2).
1031 </P>
1032
1033 <P>
1034 The mechanism to map these character sets into GL and GR is
1035 a bit complex. There are four buffers, G0, G1, G2, and G3.
1036 A character set is <strong>designated</strong> into one of these buffers
1037 and then a buffer is <strong>invoked</strong> into GL or GR.
1038 </P>
1039
1040 <P>
1041 Control sequences to 'designate' a character set into a
1042 buffer are determined as below.
1043 </P>
1044
1045 <P>
1046 <list>
1047 <item>A sequence to designate a character set with 1-byte 94-character
1048 <list>
1049 <item>into G0 set is: ESC 0x28 F,
1050 <item>into G1 set is: ESC 0x29 F,
1051 <item>into G2 set is: ESC 0x2a F, and
1052 <item>into G3 set is: ESC 0x2b F.
1053 </list>
1054 <item>A sequence to designate a character set with 1-byte 96-character
1055 <list>
1056 <item>into G1 set is: ESC 0x2d F,
1057 <item>into G2 set is: ESC 0x2e F, and
1058 <item>into G3 set is: ESC 0x2f F.
1059 </list>
1060 <item>A sequence to designate a character set with multibyte 94-character
1061 <list>
1062 <item>into G0 set is: ESC 0x24 0x28 F
1063 (exception: 'ESC 0x24 F' for F = 0x40, 0x41, 0x42.),
1064 <item>into G1 set is: ESC 0x24 0x29 F,
1065 <item>into G2 set is: ESC 0x24 0x2a F, and
1066 <item>into G3 set is: ESC 0x24 0x2b F.
1067 </list>
1068 <item>A sequence to designate a character set with multibyte 96-character
1069 <list>
1070 <item>into G1 set is: ESC 0x24 0x2d F,
1071 <item>into G2 set is: ESC 0x24 0x2e F, and
1072 <item>into G3 set is: ESC 0x24 0x2f F.
1073 </list>
1074 </list>
1075 where 'F' is determined for each character set:
1076 <list>
1077 <item>character set with 1-byte 94-character
1078 <list>
1079 <item>F=0x40 for ISO 646 IRV: 1983
1080 <item>F=0x41 for BS 4730 (UK)
1081 <item>F=0x42 for ANSI X3.4-1968 (ASCII)
1082 <item>F=0x43 for NATS Primary Set for Finland and Sweden
1083 <item>F=0x49 for JIS X 0201 Katakana
1084 <item>F=0x4a for JIS X 0201 Roman (Latin)
1085 <item>and more
1086 </list>
1087 <item>character set with 1-byte 96-character
1088 <list>
1089 <item>F=0x41 for ISO 8859-1 Latin-1
1090 <item>F=0x42 for ISO 8859-2 Latin-2
1091 <item>F=0x43 for ISO 8859-3 Latin-3
1092 <item>F=0x44 for ISO 8859-4 Latin-4
1093 <item>F=0x46 for ISO 8859-7 Latin/Greek
1094 <item>F=0x47 for ISO 8859-6 Latin/Arabic
1095 <item>F=0x48 for ISO 8859-8 Latin/Hebrew
1096 <item>F=0x4c for ISO 8859-5 Latin/Cyrillic
1097 <item>and more
1098 </list>
1099 <item>character set with multibyte 94-character
1100 <list>
1101 <item>F=0x40 for JISX 0208-1978 Japanese
1102 <item>F=0x41 for GB 2312-80 Chinese
1103 <item>F=0x42 for JISX 0208-1983 Japanese
1104 <item>F=0x43 for KSC 5601 Korean
1105 <item>F=0x44 for JISX 0212-1990 Japanese
1106 <item>F=0x45 for CCITT Extended GB (ISO-IR-165)
<item>F=0x47 for CNS 11643-1992 Set 1 (Taiwan)
1108 <item>F=0x48 for CNS 11643-1992 Set 2 (Taiwan)
1109 <item>F=0x49 for CNS 11643-1992 Set 3 (Taiwan)
1110 <item>F=0x4a for CNS 11643-1992 Set 4 (Taiwan)
1111 <item>F=0x4b for CNS 11643-1992 Set 5 (Taiwan)
1112 <item>F=0x4c for CNS 11643-1992 Set 6 (Taiwan)
1113 <item>F=0x4d for CNS 11643-1992 Set 7 (Taiwan)
1114 <item>and more
1115 </list>
1116 </list>
The complete list of these coded character sets is found at
1118 <url id="http://www.itscj.ipsj.or.jp/ISO-IR/"
1119 name="International Register of Coded Character Sets">.
1120 </P>
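<P>
The designation rules listed above can be summarized as a small function
that builds the escape sequence from the target buffer, the kind of
character set, and the final byte F. This Python sketch is an illustration
under the rules stated in this section only; the function name
<tt>designate</tt> and its parameters are hypothetical.
</P>

```python
ESC = 0x1B

# Intermediate bytes that select the target buffer.
INTERMEDIATE_94 = {"G0": 0x28, "G1": 0x29, "G2": 0x2A, "G3": 0x2B}
INTERMEDIATE_96 = {"G1": 0x2D, "G2": 0x2E, "G3": 0x2F}


def designate(buffer: str, size: int, multibyte: bool, final: int) -> bytes:
    """Build the sequence that designates a character set into a buffer.

    size is 94 or 96, final is the byte F assigned to the character set.
    """
    if size == 94:
        table = INTERMEDIATE_94
    elif size == 96:
        table = INTERMEDIATE_96
    else:
        raise ValueError("size must be 94 or 96")
    if buffer not in table:
        raise ValueError("cannot designate this kind of set into " + buffer)
    if multibyte:
        # Exception in the list above: ESC 0x24 F for F = 0x40, 0x41, 0x42 into G0.
        if buffer == "G0" and size == 94 and final in (0x40, 0x41, 0x42):
            return bytes([ESC, 0x24, final])
        return bytes([ESC, 0x24, table[buffer], final])
    return bytes([ESC, table[buffer], final])


print(designate("G0", 94, False, 0x42))  # ASCII into G0
print(designate("G0", 94, True, 0x42))   # JISX 0208-1983 into G0
print(designate("G1", 94, True, 0x43))   # KSC 5601 into G1
```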
1121
1122 <P>
Control codes to 'invoke' one of G{0123} into GL or GR
are determined as below.
<list>
<item>A control code to invoke G0 into GL is: SI (Shift In), also called LS0 (Locking Shift 0)
<item>A control code to invoke G1 into GL is: SO (Shift Out), also called LS1 (Locking Shift 1)
1128 <item>A control code to invoke G2 into GL is: LS2 (Locking Shift 2)
1129 <item>A control code to invoke G3 into GL is: LS3 (Locking Shift 3)
1130 <item>A control code to invoke one character
1131 in G2 into GL is: SS2 (Single Shift 2)
1132 <item>A control code to invoke one character
1133 in G3 into GL is: SS3 (Single Shift 3)
1134 <item>A control code to invoke G1 into GR is: LS1R (Locking Shift 1 Right)
1135 <item>A control code to invoke G2 into GR is: LS2R (Locking Shift 2 Right)
1136 <item>A control code to invoke G3 into GR is: LS3R (Locking Shift 3 Right)
1137 </list>
1138 <footnote>
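In ECMA-35, SI is 0x0f and SO is 0x0e; LS2 is ESC 0x6e and LS3 is ESC 0x6f;
SS2 is 0x8e and SS3 is 0x8f (ESC 0x4e and ESC 0x4f in the 7bit form);
LS1R is ESC 0x7e, LS2R is ESC 0x7d, and LS3R is ESC 0x7c.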
1140 </footnote>
1141 </P>
1142
1143 <P>
1144 Note that a code in a character set invoked into GR is
1145 or-ed with 0x80.
1146 </P>
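<P>
As a small illustration of this rule, the following Python sketch maps a
code from its GL form to its GR form by setting the high bit of every byte.
The function name <tt>gl_to_gr</tt> is hypothetical; the comparison with
EUC-JP at the end relies on Python's built-in <tt>euc_jp</tt> codec.
</P>

```python
def gl_to_gr(code: bytes) -> bytes:
    """Map a code from GL (0x20 - 0x7f) to GR (0xa0 - 0xff) by OR-ing 0x80."""
    for b in code:
        if not 0x20 <= b <= 0x7F:
            raise ValueError("byte is not in the GL region")
    return bytes(b | 0x80 for b in code)


hiragana_a_gl = b"\x24\x22"            # JISX 0208 Hiragana A as seen in GL
hiragana_a_gr = gl_to_gr(hiragana_a_gl)
print(hiragana_a_gr.hex())

# EUC-JP keeps JISX 0208 in GR, so the result is the EUC-JP byte sequence.
print(hiragana_a_gr == "\u3042".encode("euc_jp"))
```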
1147
1148 <P>
ISO 2022 also determines <strong>announcer</strong> codes. For example,
'ESC 0x20 0x41' means 'Only the G0 buffer is used. G0 is already
invoked into GL'. This simplifies the coding system. Even this
announcer can be omitted if the people who exchange data agree.
1153 </P>
1154
1155 <P>
1156 7bit version of ISO 2022 is a subset of 8bit version. It does not
1157 use C1 and GR.
1158 </P>
1159
1160 <P>
1161 Explanation on C0 and C1 is omitted here.
1162 </P>
1163
1164
1165
1166 <sect1 id="euc"><heading>EUC (Extended Unix Code)</heading>
1167
1168 <P>
<strong>EUC</strong> is a CES which is a subset of the 8bit version
of ISO 2022 except for the usage of the SS2 and SS3 codes. Though these
codes are used to invoke G2 and G3 into GL in ISO 2022, in EUC they
invoke G2 and G3 into GR.
1173 <strong>EUC-JP</strong>, <strong>EUC-KR</strong>, <strong>EUC-CN</strong>,
1174 and <strong>EUC-TW</strong> are widely used encodings
1175 which use EUC as CES.
1176 </P>
1177
1178 <P>
1179 EUC is stateless.
1180 </P>
1181
1182 <P>
1183 EUC can contain 4 CCS by using G0, G1, G2, and G3.
1184 Though there is no requirement that ASCII is designated to G0,
1185 I don't know any EUC codeset in which ASCII is not designated to G0.
1186 </P>
1187
1188 <P>
For EUC with G0-ASCII, all codes other than ASCII are encoded
in 0x80 - 0xff and this is upward compatible with ASCII.
1191 </P>
1192
1193 <P>
1194 Expressions for characters in G0, G1, G2, and G3 character sets
1195 are described below in binary:
1196 <list>
1197 <item>G0: 0???????
1198 <item>G1: 1??????? [1??????? [...]]
1199 <item>G2: SS2 1??????? [1??????? [...]]
1200 <item>G3: SS3 1??????? [1??????? [...]]
1201 </list>
1202 where SS2 is 0x8e and SS3 is 0x8f.
1203 </P>
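<P>
Because EUC is stateless, a byte stream can be split into characters by
looking only at the leading byte of each character. The following Python
sketch does this for EUC-JP as an example. It assumes the usual EUC-JP
layout (two bytes per G1 character, one byte after SS2, two bytes after
SS3); the function name <tt>split_euc_jp</tt> is hypothetical.
</P>

```python
SS2, SS3 = 0x8E, 0x8F

# Bytes per character in each buffer for EUC-JP, not counting SS2 or SS3.
EUC_JP_WIDTH = {"G1": 2, "G2": 1, "G3": 2}


def split_euc_jp(data: bytes):
    """Return a list of (buffer, raw_bytes) pairs, one pair per character."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            buffer, start, width = "G0", i, 1
        elif lead == SS2:
            buffer, start, width = "G2", i + 1, EUC_JP_WIDTH["G2"]
        elif lead == SS3:
            buffer, start, width = "G3", i + 1, EUC_JP_WIDTH["G3"]
        else:
            buffer, start, width = "G1", i, EUC_JP_WIDTH["G1"]
        end = start + width
        body = data[start:end]
        # Every byte of a G1, G2, or G3 character must lie in GR.
        if len(body) != width or (buffer != "G0" and min(body) < 0xA0):
            raise ValueError("malformed EUC-JP sequence at offset %d" % i)
        out.append((buffer, data[i:end]))
        i = end
    return out


# 'a', a JISX 0208 character, a JISX 0201 Kana, and a JISX 0212 character
sample = b"a\xa4\xa2\x8e\xb1\x8f\xb0\xa1"
for buffer, raw in split_euc_jp(sample):
    print(buffer, raw.hex())
```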
1204
1205
1206
1207 <sect1 id="iso2022set"><heading>ISO 2022-compliant Character Sets</heading>
1208
1209 <P>
1210 There are many national and international standards of coded
1211 character sets (CCS). Some of them are ISO 2022-compliant
1212 and can be used in ISO 2022 encoding.
1213 </P>
1214
1215 <P>
ISO 2022-compliant CCS are classified into one of the following:
1217 <list>
1218 <item>94 characters
1219 <item>96 characters
1220 <item>94x94x94x... characters
1221 </list>
1222 </P>
1223
1224 <P>
1225 The most famous 94 character set is US-ASCII. Also, all
1226 ISO 646 variants are ISO 2022-compliant 94 character sets.
1227 </P>
1228
1229 <P>
1230 All ISO 8859-* character sets are ISO 2022-compliant
1231 96 character sets.
1232 </P>
1233
1234 <P>
1235 There are many 94x94 character sets. All of them are related to
1236 CJK ideograms.
1237 <taglist>
1238 <tag><strong>JISX 0208</strong> (aka JIS C 6226)
1239 <item><p>National standard of Japan. 1978 version contains 6802 characters
1240 including Kanji (ideogram), Hiragana, Katakana, Latin, Greek,
1241 Cyrillic, numeric, and other symbols. The current (1997) version
1242 contains 7102 characters.</p>
1243 <tag><strong>JISX 0212</strong>
<item><p>National standard of Japan. 6067 characters (almost all of them
are Kanji). This character set is intended to be used in
addition to JISX 0208.</p>
1247 <tag><strong>JISX 0213</strong>
<item><p>Japanese national standard. Released in 2000.
This includes JISX 0208 characters and thousands of additional
characters. Thus, this is intended to be an extension
and a replacement of JISX 0208.
This has two 94x94 character sets; one of them includes JISX 0208
plus about 2000 characters and the other includes about
2400 characters.
Strictly speaking, JISX 0213 is not a simple
superset of JISX 0208 because a few tens of Kanji variants
which are unified and share the same code points in JISX 0208
are dis-unified and have separate code points in JISX 0213.
It shares many characters with JISX 0212.</p>
1260 <tag><strong>KSX 1001</strong> (aka KSC 5601)
<item><p>National standard of South Korea. 8224 characters including
2350 Hangul, Hanja (ideogram), Hiragana, Katakana, Latin,
Greek, Cyrillic, and other symbols. Hanja are ordered by
reading and Hanja with multiple readings are coded multiple times.</p>
1265 <tag><strong>KSX 1002</strong>
1266 <item><p>National standard of South Korea. 7659 characters including
1267 Hangul and Hanja. Intended to be used in addition to KSX 1001.</p>
1268 <tag><strong>KPS 9566</strong>
1269 <item><p>National standard of North Korea. Similar to KSX 1001.</p>
1270 <tag><strong>GB 2312</strong>
1271 <item><p>National standard of China. 7445 characters including
1272 6763 Hanzi (ideogram), Latin, Greek, Cyrillic, Hiragana,
1273 Katakana, and other symbols.</p>
1274 <tag><strong>GB 7589</strong> (aka GB2)
1275 <item><p>National standard of China. 7237 Hanzi. Intended to be
1276 used in addition to GB 2312.</p>
1277 <tag><strong>GB 7590</strong> (aka GB4)
1278 <item><p>National standard of China. 7039 Hanzi. Intended to be
1279 used in addition to GB 2312 and GB 7589.</p>
1280 <tag><strong>GB 12345</strong> (aka GB/T 12345, GB1 or GBF)
<item><p>National standard of China. 7583 characters. Traditional
character version which corresponds to the GB 2312 simplified
characters.
<tag><strong>GB 13131</strong> (aka GB3)
<item><p>National standard of China. Traditional
character version which corresponds to the GB 7589 simplified
characters.
<tag><strong>GB 13132</strong> (aka GB5)
<item><p>National standard of China. Traditional
character version which corresponds to the GB 7590 simplified
characters.
1292 <tag><strong>CNS 11643</strong>
<item><p>National standard of Taiwan. Has 7 planes. Planes 1 and
2 include all characters included in Big5. Plane 1 includes
6085 characters including Hanzi (ideogram), Latin, Greek,
and other symbols. Plane 2 includes 7650. The number of characters
for plane 3 is 6148, plane 4 is 7298, plane 5 is 8603,
plane 6 is 6388, and plane 7 is 6539.
1299 </taglist>
1300 </P>
1301
1302 <P>
There is a 94x94x94 character set, <strong>CCCII</strong>.
This is a national standard of Taiwan. It now includes 73400
characters. (The number is increasing.)
1306 </P>
1307
1308 <P>
1309 Non-ISO 2022-compliant character sets are introduced later in
1310 <ref id="othercodes">.
1311 </P>
1312
1313 <sect1 id="iso2022enc"><heading>ISO 2022-compliant Encodings</heading>
1314
1315 <p>
1316 There are many ISO 2022-compliant encodings which are subsets
1317 of ISO 2022.
1318 </p>
1319
1320 <P>
1321 <taglist>
1322 <tag><strong>Compound Text</strong>
1323 <item><p>
This is used by X clients to communicate with each other,
for example, for copy-paste.
1326 </P>
1327 <tag><strong>EUC-JP</strong>
<item><p>An EUC encoding with ASCII, JISX 0208, JISX 0201 Kana,
and JISX 0212 coded character sets. There are many systems
which do not support JISX 0201 Kana and JISX 0212.
Widely used in Japan for POSIX systems.
1332 </p>
1333 <tag><strong>EUC-KR</strong>
1334 <item><p>An EUC encoding with ASCII and KSX 1001.
1335 </p>
1336 <tag><strong>CN-GB</strong> (aka EUC-CN)
<item><p>An EUC encoding with ASCII and GB 2312.
The most popular encoding in P. R. China. This encoding
is sometimes referred to simply as 'GB'.
1340 </p>
1341 <tag><strong>EUC-TW</strong>
<item><p>An extended EUC encoding with ASCII, CNS 11643 plane 1,
and the other (2-7) planes of CNS 11643.
1344 </p>
1345 <tag><strong>ISO 2022-JP</strong>
<item><p>Described in
1347 <url id="http://www.faqs.org/rfcs/rfc1468.html" name="RFC 1468">.
1348 </p>
1349 <P>***** Not written yet *****</P>
1350 <tag><strong>ISO 2022-JP-1</strong> (upward compatible to ISO 2022-JP)
1351 <item><p>Described in
1352 <url id="http://www.faqs.org/rfcs/rfc2237.html" name="RFC 2237">.
1353 </p>
1354 <P>***** Not written yet *****</P>
1355 <tag><strong>ISO 2022-JP-2</strong> (upward compatible to ISO 2022-JP-1)
1356 <item><p>Described in
1357 <url id="http://www.faqs.org/rfcs/rfc1554.html" name="RFC 1554">.
1358 </p>
1359 <P>***** Not written yet *****</P>
1360 <tag><strong>ISO 2022-KR</strong>
1361 <item><p>aka Wansung. Described in
1362 <url id="http://www.faqs.org/rfcs/rfc1557.html" name="RFC 1557">.
1363 </p>
1364 <P>***** Not written yet *****</P>
1365 <tag><strong>ISO 2022-CN</strong>
<item><p>Described in
1367 <url id="http://www.faqs.org/rfcs/rfc1922.html" name="RFC 1922">.
1368 </p>
1369 <P>***** Not written yet *****</P>
<tag><strong>ISO 2022-CN-EXT</strong> (upward compatible to ISO 2022-CN)
1371 <item><p>
1372 </p>
1373 </taglist>
1374 </P>
1375
1376 <P>
1377 Non-ISO 2022-compliant encodings are introduced later in
1378 <ref id="othercodes">.
1379 </P>
1380
1381 <sect id="unicodes"><heading>ISO 10646 and Unicode</heading>
1382
1383 <P>
ISO 10646 and Unicode are another standard, intended to make it easier to
develop international software. The special features
of this new standard are:
1387 <list>
1388 <item>A united single CCS which intends to include all characters
1389 in the world. (ISO 2022 consists of multiple CCS.)
1390 <item>The character set intends to cover all conventional
1391 (or <em>legacy</em>) CCS in the world.
1392 <footnote>
This is obviously not true for CNS 11643 because
CNS 11643 contains 48711 characters while Unicode 3.0.1
contains 49194 characters, only 483 more than CNS 11643.
1396 </footnote>
1397 <item>Compatibility with ASCII and ISO 8859-1 is considered.
1398 <item>Chinese, Japanese, and Korean ideograms are united.
1399 This comes from a limitation of Unicode.
1400 This is not a merit.
1401 </list>
1402 </P>
1403
1404 <P>
1405 ISO 10646 is an official international standard. Unicode is
1406 developed by
1407 <url id="http://www.unicode.org" name="Unicode Consortium">.
These two are almost identical. Indeed, these two are exactly
identical at code points which are available in both standards.
Unicode is updated from time to time and the newest version is 3.0.1.
1411 </P>
1412
1413 <sect1 id="unicodes-ccs"><heading>UCS as a Coded Character Set</heading>
1414
1415 <P>
1416 ISO 10646 defines two CCS (coded character sets), <strong>UCS-2</strong>
1417 and <strong>UCS-4</strong>. UCS-2 is a subset of UCS-4.
1418 </P>
1419
1420 <P>
UCS-4 is a 31bit CCS. These 31 bits are divided into 7, 8, 8, and 8 bits
and each part has a special name.
1423 <list>
1424 <item>The top 7 bits are called <strong>Group</strong>.
1425 <item>Next 8 bits are called <strong>Plane</strong>.
1426 <item>Next 8 bits are <strong>Row</strong>.
1427 <item>The smallest 8 bits are <strong>Cell</strong>.
1428 </list>
The first plane (Group = 0, Plane = 0) is called <strong>BMP</strong>
(Basic Multilingual Plane) and UCS-2 is the same as the BMP.
Thus, UCS-2 is a 16bit CCS.
1432 </P>
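<P>
The division into Group, Plane, Row, and Cell is plain bit slicing. The
following Python sketch illustrates it; the function names
<tt>split_ucs4</tt> and <tt>in_bmp</tt> are hypothetical.
</P>

```python
def split_ucs4(code_point: int):
    """Split a 31bit UCS-4 code point into (group, plane, row, cell)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("not a 31bit UCS-4 code point")
    group = (code_point >> 24) & 0x7F   # top 7 bits
    plane = (code_point >> 16) & 0xFF   # next 8 bits
    row = (code_point >> 8) & 0xFF      # next 8 bits
    cell = code_point & 0xFF            # lowest 8 bits
    return group, plane, row, cell


def in_bmp(code_point: int) -> bool:
    """True if the code point lies in the BMP (Group = 0, Plane = 0)."""
    group, plane, _, _ = split_ucs4(code_point)
    return group == 0 and plane == 0


print(split_ucs4(0x3042))
print(in_bmp(0xFFFF), in_bmp(0x10000))
```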
1433
1434 <P>
1435 Code points in UCS are often expressed as <strong>u+<tt>????</tt></strong>,
1436 where <tt>????</tt> is hexadecimal expression of the code point.
1437 </P>
1438
1439 <P>
Characters in range of u+0021 - u+007e are the same as ASCII and
characters in range of u+00a0 - u+00ff are the same as ISO 8859-1.
Thus it is very easy to convert between ASCII or ISO 8859-1 and UCS.
1443 </P>
1444
1445 <P>
1446 Unicode (version 3.0.1) uses a 20bit subset of UCS-4 as a CCS.
1447 <footnote>
Exactly speaking, u+000000 - u+10ffff, that is, the BMP plus the 20bit
space reachable with surrogate pairs.
1449 </footnote>
1450 </P>
1451
1452 <P>
The unique feature of these CCS compared with other CCS is their
<em>open repertoire</em>. They keep developing even after
they are released. Characters will be added in the future.
However, already coded characters will not be changed.
Unicode version 3.0.1 includes 49194 distinct coded characters.
1458 </P>
1459
1460 <sect1 id="unicode-ces"><heading>UTF as Character Encoding Schemes</heading>
1461
1462 <P>
1463 A few CES are used to construct encodings which use UCS as
1464 a CCS. They are <strong>UTF-7</strong>, <strong>UTF-8</strong>,
1465 <strong>UTF-16</strong>, <strong>UTF-16LE</strong>, and
1466 <strong>UTF-16BE</strong>. UTF means Unicode (or UCS)
1467 Transformation Format.
1468 Since these CES always take UCS as the only CCS, they are also
1469 names for encodings.
1470 <footnote>
Compare UTF and EUC. There are a few variants of EUC whose CCS
are different (EUC-JP, EUC-KR, and so on). This is why we cannot
call EUC an encoding. In other words, the name 'EUC' alone
does not specify an encoding. On the other hand, 'UTF-8'
is the name for a specific concrete encoding.
1476 </footnote>
1477 </P>
1478
1479 <sect2 id="unicode-utf8"><heading>UTF-8</heading>
1480
1481 <P>
UTF-8 is an encoding whose CCS is UCS-4. UTF-8
is designed to be upward-compatible with ASCII.
UTF-8 is multibyte and the number of bytes needed to express
one character is from 1 to 6.
1486 </P>
1487
1488 <P>
1489 Conversion from UCS-4 to UTF-8 is performed using a
1490 simple conversion rule.
1491 <example>
UCS-4 (binary)                       UTF-8 (binary)
00000000 00000000 00000000 0???????  0???????
00000000 00000000 00000??? ????????  110????? 10??????
00000000 00000000 ???????? ????????  1110???? 10?????? 10??????
00000000 000????? ???????? ????????  11110??? 10?????? 10?????? 10??????
000000?? ???????? ???????? ????????  111110?? 10?????? 10?????? 10?????? 10??????
0??????? ???????? ???????? ????????  1111110? 10?????? 10?????? 10?????? 10?????? 10??????
1499 </example>
Note that the shortest form must be used, though a longer representation
could also express smaller UCS values.
1502 </P>
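<P>
The conversion table can be turned into code almost line by line. The
following Python sketch implements the 1 to 6 byte form described here
(Python's own <tt>utf-8</tt> codec stops at u+10ffff, so it is used only
for comparison). The function name <tt>ucs4_to_utf8</tt> is hypothetical.
</P>

```python
# (exclusive upper limit, number of bytes, fixed bits of the leading byte)
UTF8_FORMS = (
    (0x800, 2, 0xC0),
    (0x10000, 3, 0xE0),
    (0x200000, 4, 0xF0),
    (0x4000000, 5, 0xF8),
    (0x80000000, 6, 0xFC),
)


def ucs4_to_utf8(code_point: int) -> bytes:
    """Encode one UCS-4 code point using the shortest UTF-8 form."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("not a 31bit UCS-4 code point")
    if code_point < 0x80:
        return bytes([code_point])          # ASCII is kept as is
    for limit, length, prefix in UTF8_FORMS:
        if code_point < limit:
            break                            # shortest form that fits
    tail = []
    value = code_point
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))  # 10?????? continuation byte
        value >>= 6
    return bytes([prefix | value] + tail[::-1])


for cp in (0x41, 0xE9, 0x3042, 0x1F600):
    print(hex(cp), ucs4_to_utf8(cp).hex())
```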
1503
1504 <P>
UTF-8 seems to be one of the major candidates for standard codesets
in the future. For example, the Linux console and xterm support UTF-8.
The Debian package <package>locales</package> (version 2.1.97-1)
contains the <tt>ko_KR.UTF-8</tt> locale. I think the number of UTF-8
locales will increase.
1510 </P>
1511
1512 <sect2 id="unicode-utf16"><heading>UTF-16</heading>
1513
1514 <P>
1515 UTF-16 is an encoding whose CCS is 20bit Unicode.
1516 </P>
1517
1518 <P>
Characters in BMP are expressed using the 16bit value of
the code point in the Unicode CCS. There are two ways to express
a 16bit value in an 8bit stream. Some of you may have heard the word
<em>endian</em>. <em>Big endian</em> means an arrangement
of the octets of a multi-octet datum
from the most significant octet to the least significant one.
<em>Little endian</em> is the opposite. For example, the
16bit value <tt>0x1234</tt> is expressed as
<tt>0x12 0x34</tt> in
big endian and <tt>0x34 0x12</tt> in little endian.
1529 </P>
1530
1531 <P>
UTF-16 supports both endians. Thus, the Unicode character
<tt>u+1234</tt> can be expressed either as <tt>0x12 0x34</tt>
or as <tt>0x34 0x12</tt>. In exchange, UTF-16 texts
have to have a <strong>BOM (Byte Order Mark)</strong> at their
beginning. The Unicode character <tt>u+feff</tt> zero width no-break
space is called BOM when it is used to indicate the byte order
or endian of texts. The mechanism is easy: in big endian,
<tt>u+feff</tt> will be <tt>0xfe 0xff</tt> while it will be
<tt>0xff 0xfe</tt> in little endian. Thus you can tell
the endian of the text by reading the first two bytes.
1542 <footnote>
I heard that BOM is merely a suggestion by a vendor.
1544 Read <url id="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
1545 name="Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux">
1546 for detail.
1547 </footnote>
1548 </P>
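<P>
Reading the first two bytes is straightforward to implement. The following
Python sketch is a minimal illustration; it uses the BOM constants from
the standard <tt>codecs</tt> module, while the function names
<tt>detect_utf16_endian</tt> and <tt>decode_utf16_with_bom</tt> are
hypothetical. Text without a BOM is rejected here rather than guessed.
</P>

```python
import codecs


def detect_utf16_endian(data: bytes):
    """Return 'big', 'little', or None according to the leading BOM."""
    if data[:2] == codecs.BOM_UTF16_BE:   # 0xfe 0xff
        return "big"
    if data[:2] == codecs.BOM_UTF16_LE:   # 0xff 0xfe
        return "little"
    return None


def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 text, using the BOM to choose the byte order."""
    endian = detect_utf16_endian(data)
    if endian is None:
        raise ValueError("no BOM found")
    codec = "utf-16-be" if endian == "big" else "utf-16-le"
    return data[2:].decode(codec)         # skip the two BOM bytes


big = b"\xfe\xff\x12\x34"
little = b"\xff\xfe\x34\x12"
print(detect_utf16_endian(big), detect_utf16_endian(little))
print(decode_utf16_with_bom(big) == decode_utf16_with_bom(little))
```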
1549
1550 <P>
Characters not included in BMP are expressed using a <strong>surrogate
pair</strong>. Code points of <tt>u+d800</tt> - <tt>u+dfff</tt>
are reserved for this purpose. At first, 0x10000 is subtracted from the
code point, and the resulting 20 bits
are divided into two sets of 10 bits. The more significant 10 bits
are mapped to the 10bit space of <tt>u+d800</tt> - <tt>u+dbff</tt>.
The less significant 10 bits are mapped to the 10bit space of <tt>u+dc00</tt> -
<tt>u+dfff</tt>. Thus UTF-16 can express 20bit Unicode characters.
1558 </P>
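<P>
The mapping can be written down in a few lines. The following Python
sketch is an illustration with hypothetical function names. Note that
UTF-16 subtracts 0x10000 from the code point before splitting, so that
u+10000 - u+10ffff becomes a 20bit value.
</P>

```python
def to_surrogate_pair(code_point: int):
    """Return the (high, low) surrogate pair for a code point outside the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in u+10000 - u+10ffff")
    offset = code_point - 0x10000           # 20bit value
    high = 0xD800 | (offset >> 10)          # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)         # lower 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Inverse of to_surrogate_pair()."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```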
1559
1560 <sect2 id="unicode-utf16bele"><heading>UTF-16BE and UTF-16LE</heading>
1561
1562 <P>
1563 UTF-16BE and UTF-16LE are variants of UTF-16 which are limited to
1564 big and little endians, respectively.
1565 </P>
1566
1567
1568 <sect2 id="unicode-utf7"><heading>UTF-7</heading>
1569
1570 <P>
UTF-7 is designed so that Unicode can be communicated over a
7bit communication path.
1573 </P>
1574
1575 <P>***** Not written yet *****</P>
1576
1577 <sect2 id="unicode-ucs"><heading>UCS-2 and UCS-4 as encodings</heading>
1578
1579 <P>
Though I introduced UCS-2 and UCS-4 as CCS, they can also be encodings.
1581 </P>
1582
1583 <P>
In UCS-2 encoding, each UCS-2 character is expressed in two bytes.
In UCS-4 encoding, each UCS-4 character is expressed in four bytes.
1586 </P>
1587
1588 <sect1 id="unicode-problem"><heading>Problems on Unicode</heading>
1589
1590 <P>
No standard is free from politics and compromise.
Though the concept of a united single CCS for all characters in the
world is very nice, Unicode had to consider compatibility
with preceding international and local standards. Moreover,
unlike the ideal concept, Unicode people considered efficiency
too much. IMHO, the surrogate pair is a mess caused by the lack of
16bit code space. I will introduce a few problems on Unicode.
1598 </P>
1599
1600 <sect2 id="unihan"><heading>Han Unification</heading>
1601
1602 <P>
1603 This is the point on which Unicode is criticized most strongly
1604 by many Japanese people.
1605 </P>
1606
1607 <P>
1608 The region of 0x4e00 - 0x9fff in UCS-2 is used for East Asian
1609 ideographs (Japanese Kanji, Chinese Hanzi, and Korean Hanja).
1610 There are similar characters
1611 in these four character sets. (There are two sets of Chinese characters:
1612 simplified Chinese used in P. R. China and traditional Chinese used in
1613 Taiwan.) To reduce the number of ideographs to be encoded
1614 (the region for these characters can contain only 20992 characters,
1615 while the Taiwanese CNS 11643 standard alone contains 48711 characters),
1616 these similar characters are treated as the same character.
1617 This is Han Unification.
1618 </P>
1619
1620 <P>
1621 However, these characters are not exactly the same. If the fonts for
1622 these characters are based on the Chinese forms, Japanese people will
1623 regard them as wrong characters, though they may be able to read them.
1624 Unicode people think these unified characters are the same character
1625 with different glyphs.
1626 </P>
1627
1628 <P>
1629 An example of Han Unification is available at
1630 <url id="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9AA8" name="U+9AA8">.
1631 This is a Kanji character for 'bone'.
1632 <url id="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=8FCE" name="U+8FCE">
1633 is another example, a Kanji character for 'welcome'. The part
1634 from the left side to the bottom side is the 'run' radical. The 'run' radical
1635 is used in many Kanji and all of them have the same problem.
1636 <url id="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=76F4" name="U+76F4">
1637 is yet another example, a Kanji character for 'straight'.
1638 I, a native Japanese speaker, cannot recognize the Chinese version
1639 at all.
1640 </P>
1641
1642 <P>
1643 Unicode font vendors will have difficulty choosing glyphs for these characters:
1644 the simplified Chinese form, the traditional Chinese one, the Japanese one, or
1645 the Korean one. One method is to supply four fonts: a simplified Chinese
1646 version, a traditional Chinese version, a Japanese version, and a Korean version.
1647 A commercial OS vendor can release localized versions of its OS ---
1648 for example, the Japanese version of MS Windows can include a Japanese version
1649 of a Unicode font (this is exactly what they are doing). However, what
1650 should XFree86 or Debian do? I don't know...
1651 <footnote>
1652 XFree86 4.0 includes Japanese and Korean versions of ISO 10646-1 fonts.
1653 </footnote>
1654 <footnote>
1655 I heard that Chinese and Korean people don't mind the glyph of these
1656 characters. If this is always true, Japanese glyphs should be the
1657 default glyphs for these problematic characters for international
1658 systems such as Debian.
1659 </footnote>
1660 </P>
1661
1662 <sect2 id="crossmap"><heading>Cross Mapping Tables</heading>
1663
1664 <P>
1665 Unicode intends to be a superset of all major encodings in the world,
1666 such as ISO-8859-*, EUC-*, KOI8-*, and so on. The aim of this is to
1667 keep round-trip compatibility and to enable smooth migration from
1668 other encodings to Unicode.
1669 </P>
1670
1671 <P>
1672 Only providing a superset is not sufficient. Reliable cross mapping
1673 tables between Unicode and other encodings are needed. They are
1674 provided by
1675 <url id="http://www.unicode.org/Public/MAPPINGS/" name="Unicode
1676 Consortium">.
1677 </P>
1678
1679 <P>
1680 However, tables for East Asian encodings are not provided.
1681 They were provided but now are
1682 <url id="http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/"
1683 name="obsolete">.
1684 </P>
1685
1686 <P>
1687 You may want to use these mapping tables even though they are
1688 obsolete, because there are no other mapping tables available.
1689 However, you will find a severe problem with these tables.
1690 There are multiple different mapping tables for
1691 Japanese encodings which include the JIS X 0208 character set.
1692 Thus, the same character in JIS X 0208 will be mapped to
1693 different Unicode characters depending on the mapping table.
1694 For example, Microsoft and Sun use different tables, with the
1695 result that Java on MS Windows sometimes breaks Japanese characters.
1696 </P>
1697
1698 <P>
1699 Though we Open Source people should respect interoperability,
1700 we cannot achieve sufficient interoperability because of this
1701 problem. All we can achieve is interoperability between
1702 Open Source software.
1703 </P>
1704
1705 <P>
1706 GNU libc uses <url id="http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT" name="JIS/JIS0208.TXT"> with a small modification.
1707 The modification is that
1708 <list>
1709 <item>original JIS0208.TXT:
1710 0x815F 0x2140 0x005C # REVERSE SOLIDUS
1711 <item>modified:
1712 0x815F 0x2140 0xFF3C # FULLWIDTH REVERSE SOLIDUS
1713 </list>
1714 The reason for this modification is that the JIS X 0208 character set
1715 is almost always used in combination with ASCII, in the form of
1716 EUC-JP and so on. ASCII 0x5c, not JIS X 0208 0x2140, should
1717 be mapped to U+005C.
1718 This modified table is found at <tt>/usr/share/i18n/charmaps/EUC-JP.gz</tt>
1719 in a Debian system. Of course this mapping table is NOT
1720 an authorized or reliable one.
1721 </P>
1722
1723 <P>
1724 I hope the Unicode Consortium will release an authorized, reliable, unique
1725 mapping table between Unicode and JIS X 0208.
1726 You can read <url id="http://www.debian.or.jp/~kubota/unicode-symbols.html"
1727 name="the detail of this problem">.
1728 </P>
1729
1730 <sect2 id="combining"><heading>Combining Characters</heading>
1731
1732 <P>
1733 Unicode has a way to synthesize an accented character by combining
1734 an accent symbol and a base character. For example, combining 'a' and
1735 '~' makes 'a' with tilde. More than one accent symbol can be added to
1736 a base character.
1737 </P>
1738
1739 <P>
1740 Languages such as Thai need combining characters. Combining characters
1741 are the only method to express characters in these languages.
1742 </P>
1743
1744 <P>
1745 However, a few problems arise.
1746 <taglist>
1747 <tag>Duplicate Encoding</tag>
1748 <item>
1749 There are multiple ways to express the same character.
1750 For example, u with umlaut can be expressed as <tt>u+00fc</tt>
1751 and also as <tt>u+0075</tt> + <tt>u+0308</tt>.
1752 How can we implement 'grep' and so on?
1753 <tag>Open Repertoire</tag>
1754 <item>
1755 The number of expressible characters grows without limit.
1756 Non-existing characters can be expressed.
1757 </taglist>
1758 </P>
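<P>
The duplicate encoding problem can be seen with a plain byte
comparison. The sketch below writes u with umlaut in UTF-8 in both
forms; the helper <tt>same_bytes()</tt> is a made-up name which stands
for what a naive grep-like matcher would do.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "u with umlaut" in UTF-8, written in two equivalent ways. */
static const char precomposed[] = "\xc3\xbc";   /* u+00fc          */
static const char decomposed[]  = "u\xcc\x88";  /* u+0075 + u+0308 */

/* Byte-wise equality, as a naive matcher would test it. */
int same_bytes(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

<P>
<tt>same_bytes(precomposed, decomposed)</tt> is false although both
strings would be displayed as the same character. A matcher has to
bring both strings into one common form before comparison.
</P>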
1759
1760
1761 <sect2 id="surrogate"><heading>Surrogate Pair</heading>
1762
1763 <P>
1764 The first version of Unicode had only a 16bit code space,
1765 though 16 bits are obviously insufficient to contain all
1766 characters in the world.
1767 <footnote>
1768 There are a few projects such as
1769 <url id="http://www.mojikyo.gr.jp/" name="Mojikyo">
1770 (about 90000 characters),
1771 <url id="http://www.tron.org/index-e.html" name="TRON project">
1772 (about 130000 characters),
1773 and so on to develop a CCS which contains
1774 sufficient characters for professional usage in CJK world.
1775 </footnote>
1776 Thus the surrogate pair was introduced in Unicode 2.0 to expand the
1777 number of characters while keeping compatibility with the former
1778 16bit Unicode.
1779 </P>
1780
1781 <P>
1782 However, the surrogate pair breaks the principle that all characters
1783 are expressed with the same width of bits. This makes Unicode
1784 programming more difficult.
1785 </P>
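<P>
As an illustration of the added difficulty, the number of characters
in a UTF-16 string is no longer the number of 16bit units. The
following sketch (the function name <tt>utf16_count_chars()</tt> is
hypothetical) has to look at each unit to find pairs. Counting an
unpaired surrogate as one character is an assumption of this sketch.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count characters in a UTF-16 string of n code units.  A high
 * surrogate followed by a low surrogate counts as one character;
 * unpaired surrogates are counted individually in this sketch. */
size_t utf16_count_chars(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                /* skip the low half of the pair */
        count++;
    }
    return count;
}
```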
1786
1787 <P>
1788 Fortunately, Debian and other UNIX-like systems will use UTF-8
1789 (not UTF-16) as the usual encoding for UCS. Thus, we don't need
1790 to handle UTF-16 and surrogate pairs very often.
1791 </P>
1792
1793 <sect2 id="646problem"><heading>ISO 646-* Problem</heading>
1794
1795 <P>
1796 You will need a codeset converter between your local encodings
1797 (for example, ISO 8859-* or ISO 2022-*) and Unicode.
1798 For example, Shift-JIS encoding
1799 <footnote>
1800 The standard encoding for Macintosh and MS Windows.
1801 </footnote>
1802 is based on
1803 JISX 0201 Roman (the Japanese version of ISO 646), not ASCII,
1804 which encodes the yen currency mark at <tt>0x5c</tt>,
1805 where backslash is encoded in ASCII.
1806 </P>
1807
1808 <P>
1809 Then into which Unicode character should your converter convert <tt>0x5c</tt>
1810 in Shift-JIS: <tt>u+005c</tt> (backslash) or <tt>u+00a5</tt>
1811 (yen currency mark)?
1812 You may say yen currency mark is the right solution.
1813 However, backslash (and thus the yen mark) is widely used as an
1814 escape character. For example, 'new line' is expressed as
1815 'backslash - <tt>n</tt>' in a C string literal and Japanese people use
1816 'yen currency mark - <tt>n</tt>'. You may say that program sources
1817 must be written in ASCII and that the mistake is trying
1818 to convert program source. However, there are many
1819 source files and other texts written in Shift-JIS encoding.
1820 </P>
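<P>
Since neither answer is right for every text, a converter may leave
the decision to the caller. The following is a minimal sketch of that
idea, not the behavior of any existing converter; the enum
<tt>sjis_5c_policy</tt> and the function
<tt>sjis_single_to_ucs()</tt> are hypothetical names. Only bytes in
<tt>0x00</tt> - <tt>0x7f</tt> are handled, and <tt>0x7e</tt>, where
JISX 0201 Roman has an overline instead of a tilde, is left alone.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical policy switch for the ambiguous byte 0x5c. */
enum sjis_5c_policy { SJIS_5C_AS_BACKSLASH, SJIS_5C_AS_YEN };

/* Convert a single-byte Shift-JIS character in 0x00 - 0x7f to UCS. */
uint32_t sjis_single_to_ucs(unsigned char byte, enum sjis_5c_policy policy)
{
    assert(byte <= 0x7F);
    if (byte == 0x5C && policy == SJIS_5C_AS_YEN)
        return 0x00A5;      /* yen currency mark */
    return byte;            /* same value as in ASCII */
}
```

<P>
A tool which converts program sources would choose
<tt>SJIS_5C_AS_BACKSLASH</tt>, while a tool which converts price
lists may prefer <tt>SJIS_5C_AS_YEN</tt>.
</P>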
1821
1822 <P>
1823 Windows now supports Unicode, and the glyph
1824 at <tt>u+005c</tt> in the Japanese version of Windows is the yen currency mark.
1825 As you know, backslash (yen currency mark in Japan) is vitally
1826 important for Windows, because it is used to separate directory names.
1827 Fortunately, EUC-JP, which is widely used for UNIX in Japan,
1828 includes ASCII, not the Japanese version of ISO 646. So this
1829 is not a problem there, because it is clear that <tt>0x5c</tt> is backslash.
1830 </P>
1831
1832 <P>
1833 Thus no local codeset should use a character set incompatible
1834 with ASCII, such as ISO 646-*.
1835 </P>
1836
1837 <P>
1838 <url id="http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html"
1839 name="Problems and Solutions for Unicode and User/Vendor Defined
1840 Characters"> discusses this problem.
1841 </P>
1842
1843 <sect id="othercodes"><heading>Other Character Sets and Encodings</heading>
1844
1845 <P>
1846 Besides the ISO 2022-compliant coded character sets and encodings
1847 described in <ref id="iso2022set"> and <ref id="iso2022enc">,
1848 there are many popular encodings which cannot be classified
1849 under an international standard (i.e., neither ISO 2022-compliant
1850 nor Unicode). Internationalized software should
1851 support these encodings (again, you don't need to be aware of
1852 encodings if you use LOCALE and <tt>wchar_t</tt> technology).
1853 Some organizations are developing systems which go farther than
1854 the limitations of the current international standards, though these
1855 systems may not be widely used so far.
1856 </P>
1857
1858 <sect1 id="othercodes-big5"><heading>Big5</heading>
1859
1860 <P>
1861 <strong>Big5</strong> is a de-facto standard encoding for
1862 Taiwan (1984) and is upward-compatible with ASCII.
1863 It is also a CCS.
1864 </P>
1865
1866 <P>
1867 In Big5, <tt>0x21</tt> - <tt>0x7e</tt> are ASCII characters.
1868 A byte in <tt>0xa1</tt> - <tt>0xfe</tt> makes a pair with the following byte
1869 (<tt>0x40</tt> - <tt>0x7e</tt> or <tt>0xa1</tt> - <tt>0xfe</tt>)
1870 and expresses an ideogram and so on (13461 characters).
1871 </P>
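<P>
The byte ranges above translate directly into a test for a two-byte
Big5 character. This is a sketch which follows only the ranges given
in this section; <tt>big5_is_two_byte()</tt> is a hypothetical name,
and vendor extensions of Big5 are not considered.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if p[0] and p[1] lie in the two-byte Big5 ranges
 * given in the text, 0 otherwise. */
int big5_is_two_byte(const unsigned char *p)
{
    assert(p != NULL);
    if (p[0] < 0xA1 || p[0] > 0xFE)
        return 0;                       /* not a leading byte */
    return (p[1] >= 0x40 && p[1] <= 0x7E) ||
           (p[1] >= 0xA1 && p[1] <= 0xFE);
}
```

<P>
Note that the second byte may be in the ASCII range, so that a search
for an ASCII character such as <tt>0x5c</tt> must not look at single
bytes without tracking character boundaries.
</P>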
1872
1873 <P>
1874 Though Taiwan has a newer ISO 2022-compliant standard, CNS 11643,
1875 Big5 seems to be more popular than CNS 11643.
1876 (CNS 11643 is a CCS and there are a few ISO 2022-derived
1877 encodings which include CNS 11643.)
1878 </P>
1879
1880 <sect1 id="othercodes-uhc"><heading>UHC</heading>
1881
1882 <P>
1883 <strong>UHC</strong> is an encoding which is upward-compatible
1884 with <strong>EUC-KR</strong>. Two-byte characters (the first byte:
1885 <tt>0x81</tt> - <tt>0xfe</tt>; the second byte:
1886 <tt>0x41</tt> - <tt>0x5a</tt>, <tt>0x61</tt> - <tt>0x7a</tt>, and
1887 <tt>0x81</tt> - <tt>0xfe</tt>) include KSX 1001 and other Hangul so
1888 that UHC can
1889 express all 11172 Hangul.
1890 </P>
1891
1892 <sect1 id="othercodes-johab"><heading>Johab</heading>
1893
1894 <P>
1895 <strong>Johab</strong> is an encoding whose character set is identical
1896 to that of <strong>UHC</strong>, i.e., ASCII, KSX 1001, and all other Hangul
1897 characters.
1898 Johab means combination in Korean. In Johab, the code point of a Hangul
1899 can be calculated from the combination of its Hangul parts (Jamo).
1900 </P>
1901
1902 <sect1 id="othercodes-hz"><heading>HZ, aka HZ-GB-2312</heading>
1903
1904 <p>
1905 <strong>HZ</strong> is an encoding described in
1906 <url id="http://www.faqs.org/rfcs/rfc1842.html" name="RFC 1842">.
1907 The CCS (coded character sets) of HZ are ASCII and GB2312. This is a 7bit
1908 encoding.
1909 </p>
1910
1911 <p>
1912 Note that HZ is <em>not</em> upward-compatible with ASCII,
1913 since '<tt>~{</tt>' means GB2312 mode, '<tt>~}</tt>' means
1914 ASCII mode, and '<tt>~~</tt>' means ASCII '~'.
1915 </p>
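<P>
The mode switching can be followed with a small state machine.
The sketch below handles only the three escape sequences mentioned
above and no other detail of the HZ specification. The function name
<tt>hz_scan()</tt> is hypothetical. It copies the ASCII-mode text and
counts the two-byte GB2312 characters.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Walk HZ text, copy the ASCII-mode characters to ascii_out and
 * return the number of two-byte GB2312 characters seen.  Only
 * "~{", "~}" and "~~" are interpreted. */
size_t hz_scan(const char *in, char *ascii_out, size_t out_size)
{
    assert(in != NULL && ascii_out != NULL && out_size > 0);
    size_t gb_chars = 0, o = 0;
    int gb_mode = 0;

    while (*in != '\0') {
        if (!gb_mode && in[0] == '~' && in[1] == '{') {
            gb_mode = 1;                /* enter GB2312 mode */
            in += 2;
        } else if (gb_mode && in[0] == '~' && in[1] == '}') {
            gb_mode = 0;                /* back to ASCII mode */
            in += 2;
        } else if (gb_mode) {
            if (in[1] == '\0')
                break;                  /* truncated pair */
            gb_chars++;
            in += 2;
        } else {
            if (in[0] == '~' && in[1] == '~')
                in++;                   /* "~~" is one '~' */
            if (o + 1 < out_size)
                ascii_out[o++] = *in;
            in++;
        }
    }
    ascii_out[o] = '\0';
    return gb_chars;
}
```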
1916
1917 <sect1 id="othercodes-gbk"><heading>GBK</heading>
1918
1919 <p>
1920 <strong>GBK</strong> is an encoding which is upward-compatible
1921 with CN-GB. GBK covers ASCII, GB2312, other Unicode 1.0 ideograms,
1922 and a bit more. The range of two-byte characters in GBK is:
1923 <tt>0x81</tt> - <tt>0xfe</tt> for the first byte and
1924 <tt>0x40</tt> - <tt>0x7e</tt> and <tt>0x80</tt> - <tt>0xfe</tt>
1925 for the second byte. 21886 code points out of 23940 in two-byte
1926 region are defined.
1927 </p>
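<p>
The figure of 23940 code points follows from the ranges: 126 values
for the first byte times 190 values for the second byte. The sketch
below checks this by brute force. The function names are made up
for this example.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if p[0] and p[1] lie in the two-byte GBK ranges. */
int gbk_is_two_byte(const unsigned char *p)
{
    assert(p != NULL);
    if (p[0] < 0x81 || p[0] > 0xFE)
        return 0;
    return (p[1] >= 0x40 && p[1] <= 0x7E) ||
           (p[1] >= 0x80 && p[1] <= 0xFE);
}

/* Count the code points of the two-byte region by brute force. */
unsigned gbk_two_byte_slots(void)
{
    unsigned n = 0;
    unsigned char pair[2];
    for (unsigned b1 = 0; b1 <= 0xFF; b1++) {
        for (unsigned b2 = 0; b2 <= 0xFF; b2++) {
            pair[0] = (unsigned char)b1;
            pair[1] = (unsigned char)b2;
            n += (unsigned)gbk_is_two_byte(pair);
        }
    }
    return n;
}
```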
1928
1929 <p>
1930 GBK is one of the popular encodings in P. R. China.
1931 </p>
1932
1933 <sect1 id="othercodes-gb18030"><heading>GB18030</heading>
1934
1935 <p>
1936 <strong>GB 18030</strong> is an encoding which is upward-compatible
1937 with GBK and CN-GB. It is a recent national standard (released on
1938 17 March 2000) of China. It adds four-byte characters to GBK.
1939 Its range is:
1940 <tt>0x81</tt> - <tt>0xfe</tt> for the first byte,
1941 <tt>0x30</tt> - <tt>0x39</tt> for the second byte,
1942 <tt>0x81</tt> - <tt>0xfe</tt> for the third byte, and
1943 <tt>0x30</tt> - <tt>0x39</tt> for the fourth byte.
1944 </p>
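<p>
A four-byte character can be told from a two-byte GBK character by
its second byte, because <tt>0x30</tt> - <tt>0x39</tt> is outside the
range of the second byte in GBK. A minimal sketch of the range check,
with the hypothetical name <tt>gb18030_is_four_byte()</tt>:
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if p[0..3] lie in the four-byte GB 18030 ranges.
 * Evaluation stops at the first byte out of range, so a shorter
 * NUL-terminated string is never read past its end. */
int gb18030_is_four_byte(const unsigned char *p)
{
    assert(p != NULL);
    return p[0] >= 0x81 && p[0] <= 0xFE &&
           p[1] >= 0x30 && p[1] <= 0x39 &&
           p[2] >= 0x81 && p[2] <= 0xFE &&
           p[3] >= 0x30 && p[3] <= 0x39;
}
```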
1945
1946 <p>
1947 It includes all characters of Unicode 3.0's Unihan Extension A.
1948 Moreover, GB 18030 supplies code space for all used and
1949 unused code points of Unicode's plane 0 (BMP) and 16 additional
1950 planes.
1951 </p>
1952
1953 <p>
1954 <url id="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf"
1955 name="A detailed explanation on GB18030"> is available.
1956 </p>
1957
1958 <sect1 id="othercodes-gccs"><heading>GCCS</heading>
1959
1960 <p>
1961 <strong>GCCS</strong> is a standard of coded character set
1962 by Hong Kong (HKSAR: Hong Kong Special Administrative Region).
1963 It includes 3049 characters. It is an abbreviation of Government Common
1964 Character Set. It is defined as an <em>additional character set
1965 for Big5</em>. Characters in GCCS are coded in User-Defined Area
1966 (just like Private Use Area for UCS) in Big5.
1967 </p>
1968
1969 <sect1 id="othercodes-hkscs"><heading>HKSCS</heading>
1970
1971 <p>
1972 <strong>HKSCS</strong> is an expansion and amendment of GCCS.
1973 It includes 4702 characters. It means Hong Kong Supplementary
1974 Character Set.
1975 </p>
1976
1977 <p>
1978 In addition to a usage in User-Defined Area in Big5,
1979 HKSCS defines a usage in Private Use Area in Unicode.
1980 </p>
1981
1982 <sect1 id="othercodes-shiftjis"><heading>Shift-JIS</heading>
1983
1984 <p>
1985 <strong>Shift-JIS</strong> is one of the popular encodings in Japan.
1986 Its CCS are JISX 0201 Roman, JISX 0201 Kana, and JISX 0208.
1987 </p>
1988
1989 <p>
1990 JISX 0201 Roman is Japanese version of ISO 646. It defines
1991 yen currency mark for <tt>0x5c</tt>, where ASCII has backslash.
1992 <tt>0xa1</tt> - <tt>0xdf</tt> is one-byte character and is
1993 JISX 0201 Kana. Two-byte character (the first byte:
1994 <tt>0x81</tt> - <tt>0x9f</tt> and <tt>0xe0</tt> - <tt>0xef</tt>;
1995 the second byte: <tt>0x40</tt> - <tt>0x7e</tt> and <tt>0x80</tt> -
1996 <tt>0xfc</tt>) is JISX 0208.
1997 </p>
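<p>
With these ranges, the length of a character can be decided from its
first byte. The following sketch follows only the ranges given in
this section; <tt>sjis_char_len()</tt> is a hypothetical name.
</p>

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the Shift-JIS character starting at p, or 0
 * if the bytes do not fit the ranges described in the text. */
int sjis_char_len(const unsigned char *p)
{
    assert(p != NULL);
    unsigned char b1 = p[0];
    if (b1 <= 0x7F)
        return 1;                           /* JISX 0201 Roman */
    if (b1 >= 0xA1 && b1 <= 0xDF)
        return 1;                           /* JISX 0201 Kana  */
    if ((b1 >= 0x81 && b1 <= 0x9F) || (b1 >= 0xE0 && b1 <= 0xEF)) {
        unsigned char b2 = p[1];
        if ((b2 >= 0x40 && b2 <= 0x7E) || (b2 >= 0x80 && b2 <= 0xFC))
            return 2;                       /* JISX 0208 */
    }
    return 0;
}
```

<p>
As in Big5, the second byte may be <tt>0x5c</tt>, so that a program
which treats every <tt>0x5c</tt> as a backslash breaks such characters.
</p>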
1998
1999 <p>
2000 The Japanese versions of MS DOS, MS Windows, and Macintosh use this encoding,
2001 though it is not often used in POSIX systems.
2002 </p>
2003
2004
2005 <sect1 id="othercodes-viscii"><heading>VISCII</heading>
2006
2007 <P>
2008 The Vietnamese language uses 186 characters (Latin letters with accents)
2009 and other symbols.
2010 This is a bit more than an ISO 8859-like encoding can hold.
2011 </P>
2012
2013 <P>
2014 <strong>VISCII</strong> is a standard for Vietnamese.
2015 It is upward-compatible with ASCII. It is 8bit and stateless,
2016 like ISO 8859 series. However, it uses code points of
2017 not only <tt>0x21</tt> - <tt>0x7e</tt> and <tt>0xa0</tt> -
2018 <tt>0xff</tt> but also <tt>0x02</tt>, <tt>0x05</tt>, <tt>0x06</tt>,
2019 <tt>0x14</tt>, <tt>0x19</tt>, <tt>0x1e</tt>, and <tt>0x80</tt> -
2020 <tt>0x9f</tt>. This makes VISCII non-ISO 2022-compliant.
2021 </P>
2022
2023 <P>
2024 Vietnam has a new, ISO 2022-compliant character set
2025 <strong>TCVN 5712 VN2</strong> (aka <strong>VSCII</strong>).
2026 In TCVN 5712 VN2, accented characters are expressed as
2027 combined characters. Note that some accented characters
2028 have their own code points.
2029 </P>
2030
2031 <sect1 id="othercodes-tron"><heading>TRON</heading>
2032
2033 <P>
2034 <url id="http://www.tron.org/index-e.html" name="TRON">
2035 is a project to develop a new operating system,
2036 started in 1984 as a collaboration of industry and academia
2037 in Japan.
2038 </P>
2039
2040 <P>
2041 The most widespread member of the TRON operating system family
2042 is ITRON, a real-time OS for embedded systems.
2043 However, our interest is not in ITRON now.
2044 TRON defines its own encoding.
2045 </P>
2046
2047 <P>
2048 TRON's encoding is stateful. Each state is assigned
2049 to each language. It has already defined about 130000 characters
2050 (January 2000).
2051 </P>
2052
2053 <sect1 id="othercodes-mojikyo"><heading>Mojikyo</heading>
2054
2055 <P>
2056 <url id="http://www.mojikyo.gr.jp/" name="Mojikyo">
2057 is a project to develop an environment by which a user
2058 can use many characters in the world. Mojikyo
2059 project has released an application software for
2060 MS Windows to display and input about 90000 characters.
2061 You can download the software and TrueType, TeX, and
2062 CID fonts, though they are not DFSG-free.
2063 </P>
2064
2065
2066
2067 <chapt id="languages"><heading>Characters in Each Country</heading>
2068
2069 <P>
2070 This chapter describes specific information for each language.
2071 If you are developing serious DTP software or planning to support
2072 detailed I18N, this chapter may help you.
2073 Contributions from people speaking each language are welcome.
2074 If you are to write a section on your language, please include
2075 these points:
2076 <enumlist>
2077 <item>kinds and number of characters used in the language,
2078 <item>explanation on coded character set(s) which is (are) standardized,
2079 <item>explanation on encoding(s) which is (are) standardized,
2080 <item>usage and popularity for each encoding,
2081 <item>de-facto standard, if any, on how many columns characters occupy,
2082 <item>writing direction and combined characters,
2083 <item>how to layout characters (word wrapping and so on),
2084 <item>widely used value for <tt>LANG</tt> environmental variable,
2085 <item>the way to input characters from keyboard and whether
2086 you want to input yes/no (and so on) in your language
2087 or in English,
2088 <item>a set of information needed for beautiful displaying, for example,
2089 where to break a line, hyphenation, word wrapping, and so on, and
2090 <item>other topics.
2091 </enumlist>
2092 </P>
2093
2094
2095 <P>
2096 Writers whose languages are written in a different direction
2097 from European languages or need combined characters
2098 (I heard that they are used in Thai) are encouraged to explain
2099 how to treat such languages.
2100 </P>
2101
2102
2103
2104 &japanese-japan;
2105 &spanish;
2106 &cyrillic;
2107
2108
2109
2110
2111
2112 <chapt id="locale"><heading>LOCALE technology</heading>
2113
2114 <P>
2115 <strong>LOCALE</strong> is a basic concept introduced
2116 into <strong>ISO C</strong> (ISO/IEC 9899:1990). The
2117 standard was extended in 1995 (ISO 9899:1990 Amendment 1:1995).
2118 In the LOCALE model, the behavior of some C functions depends
2119 on the LOCALE environment. The LOCALE environment is divided
2120 into a few categories and each of these categories can
2121 be set independently using <tt>setlocale()</tt>.
2122 </P>
2123
2124 <P>
2125 <strong>POSIX</strong> also defines some standards around
2126 i18n. Most of the POSIX and ISO C standards are included in the
2127 <strong>XPG4</strong> (X/Open Portability Guide) standard and
2128 all of them are included in the XPG5 standard. Note that
2129 <strong>XPG5</strong> is included in UNIX specifications version 2.
2130 Thus support of XPG5 is mandatory to obtain the Unix brand. In other words,
2131 every operating system carrying that brand supports XPG5.
2132 </P>
2133
2134 <P>
2135 The merits of using locale technology over hard-coding of Unicode
2136 are:
2137 <list>
2138 <item>The software can be written in an encoding-independent way.
2139 This means that this software can support all encodings
2140 which the OS supports, including 7bit, 8bit, multibyte,
2141 stateful, and stateless encodings such as ASCII, ISO 8859-*,
2142 EUC-*, ISO 2022-*, Big5, VISCII, TIS 620, UTF-*, and so on.
2143 <item>The software will provide a common unified method to
2144 configure locale and encoding. This benefits users.
2145 Otherwise, users will have to remember the method to enable
2146 UTF-8 mode for each software. Some software needs a <tt>-u8</tt>
2147 switch, some needs an X resource setting, some needs a
2148 <tt>.foobarrc</tt> file, some needs a special environmental
2149 variable, and some uses UTF-8 by default. It is nonsense!
2150 <item>The advancement of the OS means the advancement of the
2151 software. Thus, you can use a new locale without recompiling
2152 your software.
2153 </list>
2154 You can read the
2155 <url id="http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT"
2156 name="Unicode support in the Solaris Operating Environment"> whitepaper
2157 and understand the merit of this model.
2158 <url id="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html"
2159 name="Bruno Haible's Unicode HOWTO">
2160 also recommends this model.
2161 </P>
2162
2163 <sect id="localecategory"><heading>Locale Categories and <tt>setlocale()</tt></heading>
2164
2165 <P>
2166 In the LOCALE model, the behavior of some C functions depends
2167 on the LOCALE environment. The LOCALE environment is divided
2168 into six categories and each of these categories can
2169 be set independently using <tt>setlocale()</tt>.
2170 </P>
2171
2172 <P>
2173 The following are the six categories:
2174 <taglist>
2175 <tag><strong>LC_CTYPE</strong>
2176 <item>
2177 <p>
2178 Category related to encodings.
2179 Characters which are encoded in the LC_CTYPE-dependent encoding
2180 are called <strong>multibyte characters</strong>.
2181 Note that a multibyte character doesn't need to occupy multiple bytes.
2182 </p>
2183 <p>
2184 LC_CTYPE-dependent functions are: character testing functions
2185 such as <tt>islower()</tt> and so on, multibyte character
2186 functions such as <tt>mblen()</tt> and so on, multibyte
2187 string functions such as <tt>mbstowcs()</tt> and so on,
2188 and so on.
2189 </p>
2190 </item>
2191 <tag><strong>LC_COLLATE</strong>
2192 <item>
2193 <p>
2194 Category related to sorting.
2195 <tt>strcoll()</tt> and so on are LC_COLLATE-dependent.
2196 </p>
2197 </item>
2198 <tag><strong>LC_MESSAGES</strong>
2199 <item>
2200 <p>
2201 Category related to the language for messages the software
2202 outputs. This category is used for <prgn>gettext</prgn>.
2203 </p>
2204 <tag><strong>LC_MONETARY</strong>
2205 <item>
2206 <p>
2207 Category related to the format for showing monetary numbers,
2208 for example, currency mark, comma or period, columns,
2209 and so on.
2210 In ISO C, <tt>localeconv()</tt> is the only function which is
2211 LC_MONETARY-dependent.
2212 </p>
2213 </item>
2214 <tag><strong>LC_NUMERIC</strong>
2215 <item>
2216 <p>
2217 Category related to the format for showing general numbers,
2218 for example, the character for the decimal point.
2219 </p>
2220 <p>
2221 Formatted I/O functions such as <tt>printf()</tt>,
2222 string conversion functions such as <tt>atof()</tt>,
2223 and so on are LC_NUMERIC-dependent.
2224 </p>
2225 </item>
2226 <tag><strong>LC_TIME</strong>
2227 <item>
2228 <p>
2229 Category related to the format for showing time and date,
2230 such as names of months and days of the week, the order of day,
2231 month, and year, and so on.
2232 </p>
2233 <p>
2234 <tt>strftime()</tt> and so on are LC_TIME-dependent.
2235 </p>
2236 </item>
2237 </taglist>
2238 </p>
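<p>
As a small illustration of a category-dependent function, the
following sketch reads the decimal point of the current LC_NUMERIC
locale with <tt>localeconv()</tt> and formats a number with
<tt>snprintf()</tt>. The function names
<tt>current_decimal_point()</tt> and <tt>format_one_decimal()</tt>
are made up for this example.
</p>

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdio.h>

/* Return the decimal-point string of the current LC_NUMERIC locale. */
const char *current_decimal_point(void)
{
    const struct lconv *lc = localeconv();
    assert(lc != NULL);
    return lc->decimal_point;
}

/* Format x with one digit after the decimal point; the character
 * used for the decimal point follows LC_NUMERIC. */
int format_one_decimal(double x, char *buf, size_t size)
{
    assert(buf != NULL);
    return snprintf(buf, size, "%.1f", x);
}
```

<p>
In the <tt>"C"</tt> locale the decimal point is a period. In a locale
such as <tt>de_DE</tt>, if it is installed, a comma would be expected
instead.
</p>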
2239
2240 <p>
2241 <tt>setlocale()</tt> is a function to set LOCALE.
2242 Usage is char *<tt>setlocale(</tt>int <em>category</em>, const char
2243 *<em>locale</em><tt>);</tt>. The header file <tt>locale.h</tt>
2244 is needed for the prototype declaration and the definitions of
2245 the macros for category names. For example,
2246 <tt>setlocale(LC_TIME, "de_DE");</tt>.
2247 </p>
2248
2249 <p>
2250 For <em>category</em>, the following macros can be used:
2251 LC_CTYPE, LC_COLLATE, LC_MESSAGES (POSIX), LC_MONETARY, LC_NUMERIC, LC_TIME, and
2252 LC_ALL. For <em>locale</em>, a specific locale name, <tt>NULL</tt>,
2253 or <tt>""</tt> can be specified.
2254 </p>
2255
2256 <p>
2257 Giving <tt>NULL</tt> for <em>locale</em> will return the
2258 current value of the specified locale category. Otherwise,
2259 <tt>setlocale()</tt> returns the newly set locale name,
2260 or <tt>NULL</tt> for error.
2261 </p>
2262
2263 <p>
2264 Given <tt>""</tt> for <em>locale</em>, <tt>setlocale()</tt>
2265 will determine the locale name in the following manner:
2266 <list>
2267 <item>First, consult the <tt>LC_ALL</tt> environmental variable.
2268 <item>If <tt>LC_ALL</tt> is not available, consult the environmental
2269 variable with the same name as the locale category.
2270 For example, <tt>LC_COLLATE</tt>.
2271 <item>If neither is available, consult the <tt>LANG</tt>
2272 environmental variable.
2273 </list>
2274 This is why a user is expected to set the <tt>LANG</tt> variable.
2275 In other words, all a user has to do is to set the <tt>LANG</tt>
2276 variable so that all locale-compliant software works in the
2277 desired way.
2278 </p>
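<p>
The lookup order can be written down as follows. This is a sketch of
the rule described above, not the source code of any C library. The
function names are hypothetical, and two details are assumptions of
this sketch: an empty variable is treated like an unset one, and
<tt>"C"</tt> is returned when no variable is set.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Treat NULL and "" alike (an assumption of this sketch). */
static int is_set(const char *v)
{
    return v != NULL && v[0] != '\0';
}

/* Pick the locale name from the three candidate values in the
 * order LC_ALL, LC_xxx, LANG.  Falls back to "C". */
const char *pick_locale_name(const char *lc_all,
                             const char *lc_category,
                             const char *lang)
{
    if (is_set(lc_all))      return lc_all;
    if (is_set(lc_category)) return lc_category;
    if (is_set(lang))        return lang;
    return "C";
}

/* Same rule, reading the values from the environment.
 * category_name is a name such as "LC_COLLATE". */
const char *locale_name_from_env(const char *category_name)
{
    assert(category_name != NULL);
    return pick_locale_name(getenv("LC_ALL"),
                            getenv(category_name),
                            getenv("LANG"));
}
```

<p>
For example, with <tt>LANG=ja_JP.eucJP</tt> and
<tt>LC_COLLATE=C</tt> set and <tt>LC_ALL</tt> unset,
<tt>locale_name_from_env("LC_COLLATE")</tt> gives <tt>"C"</tt> while
<tt>locale_name_from_env("LC_CTYPE")</tt> gives
<tt>"ja_JP.eucJP"</tt>.
</p>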
2279
2280 <p>
2281 Thus, I strongly recommend calling <tt>setlocale(LC_ALL, "");</tt>
2282 at the beginning of your software, if the software is to be
2283 internationalized.
2284 </p>
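<p>
A minimal sketch of such an initialization is shown here. The wrapper
name <tt>init_locale()</tt> is hypothetical. Falling back to
<tt>"C"</tt> when the user's locale is not installed is a choice made
for this example. Software may prefer to print a warning instead.
</p>

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Adopt the user's locale.  If the environment names a locale which
 * is not installed, fall back to "C".  Returns the name in effect. */
const char *init_locale(void)
{
    const char *name = setlocale(LC_ALL, "");
    if (name == NULL) {
        name = setlocale(LC_ALL, "C");  /* "C" is always available */
        assert(name != NULL);
    }
    return name;
}
```

<p>
<tt>init_locale()</tt> would be called once, at the top of
<tt>main()</tt>, before any locale-dependent function is used.
</p>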
2285
2286 <sect id="localename"><heading>Locale Names</heading>
2287
2288 <P>
2289 We can specify locale names for these six locale categories.
2290 Then, which name should we specify?
2291 </P>
2292
2293 <P>
2294 The syntax to build a locale name is determined as follows:
2295 <example>
2296 language[_territory][.codeset][@modifier]
2297 </example>
2298 where <em>language</em> is two lowercase letters described
2299 in ISO639, such as <tt>en</tt> for English, <tt>eo</tt> for
2300 Esperanto, and <tt>zh</tt> for Chinese, and <em>territory</em>
2301 is two uppercase letters described in ISO3166, such as
2302 <tt>GB</tt> for United Kingdom, <tt>KR</tt> for Republic of
2303 Korea (South Korea), and <tt>CN</tt> for China. There is no standard
2304 for <em>codeset</em> and <em>modifier</em>. GNU libc uses
2305 <tt>ISO-8859-1</tt>, <tt>ISO-8859-13</tt>, <tt>eucJP</tt>,
2306 <tt>SJIS</tt>, <tt>UTF8</tt>, and so on for <em>codeset</em>,
2307 and <tt>euro</tt> for <em>modifier</em>.
2308 </P>
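<P>
Application software usually does not need to take a locale name
apart, but the syntax is easy to follow in code. The sketch below
splits a name into its four parts. The type
<tt>struct locale_parts</tt> and the function
<tt>parse_locale_name()</tt> are hypothetical, and the buffer sizes
are arbitrary. Parts which do not fit are truncated.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct locale_parts {           /* hypothetical helper type */
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[16];
};

/* Copy at most size-1 bytes of [begin, end) into dst, terminate it. */
static void copy_span(char *dst, size_t size,
                      const char *begin, const char *end)
{
    size_t n = (size_t)(end - begin);
    if (n >= size)
        n = size - 1;
    memcpy(dst, begin, n);
    dst[n] = '\0';
}

/* Split language[_territory][.codeset][@modifier].
 * Missing parts become empty strings. */
void parse_locale_name(const char *name, struct locale_parts *out)
{
    assert(name != NULL && out != NULL);
    const char *end = name + strlen(name);
    const char *at  = strchr(name, '@');
    const char *mod_start = at ? at + 1 : end;
    const char *body_end  = at ? at : end;      /* without @modifier */

    const char *dot = memchr(name, '.', (size_t)(body_end - name));
    const char *cs_start = dot ? dot + 1 : body_end;
    const char *lt_end   = dot ? dot : body_end; /* language[_territory] */

    const char *us = memchr(name, '_', (size_t)(lt_end - name));
    const char *terr_start = us ? us + 1 : lt_end;
    const char *lang_end   = us ? us : lt_end;

    copy_span(out->language,  sizeof out->language,  name, lang_end);
    copy_span(out->territory, sizeof out->territory, terr_start, lt_end);
    copy_span(out->codeset,   sizeof out->codeset,   cs_start, body_end);
    copy_span(out->modifier,  sizeof out->modifier,  mod_start, end);
}
```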
2309
2310 <P>
2311 However, which locale names are valid depends on the system.
2312 In other words, you have to install the <em>locale database</em> for
2313 each locale you want to use. Type <tt>locale -a</tt> to display all
2314 supported locale names on the system.
2315 </P>
2316
2317 <p>
2318 Note that the locale names <tt>"C"</tt> and <tt>"POSIX"</tt> are
2319 defined as the names for the default behavior. For example,
2320 when your software needs to parse the output of <tt>date(1)</tt>,
2321 you'd better run <tt>date(1)</tt> with <tt>LC_ALL=C</tt> in its environment,
2322 because <tt>setlocale()</tt> in your process does not affect a child process.
2323 </p>
2324
2325 <sect id="wchar"><heading>Multibyte Characters and Wide Characters</heading>
2326
2327 <p>
2328 Now we will concentrate on LC_CTYPE, which is the most important
2329 category among the six locale categories.
2330 </p>
2331
2332 <p>
2333 Many encodings such as ASCII, ISO 8859-*, KOI8-R, EUC-*,
2334 ISO 2022-*, TIS 620, UTF-8, and so on are widely used in the world.
2335 It is inefficient and a cause of bugs, if not impossible, for
2336 every piece of software to implement all these encodings.
2337 Fortunately, we can use LOCALE technology to solve this problem.
2338 <footnote>
2339 Usage of UCS-4 is the second best solution for this problem.
2340 Sometimes LOCALE technology cannot be used and UCS-4 is the
2341 best. I will discuss this solution later.
2342 </footnote>
2343 </p>
2344
2345 <p>
2346 <strong>Multibyte characters</strong> is a term for characters
2347 encoded in the locale-specific encoding. It is nothing special.
2348 It is merely a name for our everyday encodings. In an ISO 8859-1 locale,
2349 ISO 8859-1 is the multibyte character encoding. In an EUC-JP locale, EUC-JP
2350 is the multibyte character encoding. In a UTF-8 locale, UTF-8 is the multibyte character encoding.
2351 In short, the multibyte character is defined by the <tt>LC_CTYPE</tt> locale
2352 category.
2353 Multibyte characters are used when your software inputs
2354 or outputs text data from/to anywhere outside your software,
2355 for example, standard input/output, display, keyboard, file,
2356 and so on, as you do every day.
2357 <footnote>
2358 There are a few exceptions. Compound text should be used for
2359 communication between X clients. UTF-8 would be the standard
2360 for file names in Linux.
2361 </footnote>
2362 </p>
2363
2364 <p>
2365 You can handle multibyte characters using the ordinary <tt>char</tt>
2366 or <tt>unsigned char</tt> types and the ordinary character- and
2367 string-oriented functions. It is just like what you used to do for
2368 ASCII and 8bit encodings.
2369 </p>
2370
2371 <p>
2372 Then why do we call it by the special term <em>multibyte character</em>?
2373 The answer is that ISO C specifies a set of functions which can handle
2374 multibyte characters properly. On the other hand, it is obvious that
2375 usual C functions such as <tt>strlen()</tt>, which counts bytes and not
2376 characters, cannot handle multibyte characters properly.
2377 </p>
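<p>
The difference can be illustrated with <tt>mblen()</tt>, one of the
ISO C functions for multibyte characters. The sketch below counts
characters, not bytes, under the current LC_CTYPE locale. The function
name <tt>mb_count_chars()</tt> is made up for this example.
</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of characters in a multibyte string under the current
 * LC_CTYPE locale, or (size_t)-1 if the string is not valid in
 * that encoding. */
size_t mb_count_chars(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    (void)mblen(NULL, 0);               /* reset the shift state */
    while (*s != '\0') {
        int n = mblen(s, MB_CUR_MAX);
        if (n < 0)
            return (size_t)-1;          /* invalid sequence */
        s += n;
        count++;
    }
    return count;
}
```

<p>
For an ASCII string the result is the same as that of
<tt>strlen()</tt>. In a UTF-8 locale, a string consisting of the two
bytes <tt>0xc3</tt> <tt>0xbc</tt> (u with umlaut) should be counted as
one character, while <tt>strlen()</tt> returns 2.
</p>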
2378
2379 <p>
2380 Then what are these functions which can handle multibyte characters
2381 properly? Please wait a minute.
2382 A multibyte character encoding may be stateful or stateless and multibyte or
2383 non-multibyte, since the term covers all encodings ever used and to
2384 be used on the earth. Thus it is not convenient for internal processing.
2385 It needs a complex algorithm even for, for example, extracting a
2386 character from a string, concatenating and splitting strings,
2387 or counting the number of characters in a string.
2388 Thus, <strong>wide characters</strong> should be used for internal
2389 processing. The main part of the C functions which can handle
2390 multibyte characters are functions for conversion between
2391 multibyte characters and wide characters.
2392 These functions are introduced later. Note that you may
2393 be able to do without these functions, since ISO C supplies
2394 I/O functions with conversion.
2395 </p>
2396
2397 <p>
2398 A wide character is defined in ISO C so
2399 <list>
2400 <item>that all characters are expressed in a fixed width of bits.
2401 <item>that it is stateless, i.e., it doesn't have shift states.
2402 </list>
2403 </p>
2404
2405 <p>
2406 There are two types for wide characters: <tt>wchar_t</tt> and
2407 <tt>wint_t</tt>. <tt>wchar_t</tt> is a type which can contain
2408 one wide character, just as the <tt>char</tt> type can
2409 contain one character. <tt>wint_t</tt> can contain one wide
2410 character or <tt>WEOF</tt>, the counterpart of <tt>EOF</tt>.
2411 </p>
2412
2413 <p>
2414 A string of wide characters is represented by an array of <tt>wchar_t</tt>,
2415 just as a string of characters is represented by an array
2416 of <tt>char</tt>.
2417 </p>
2418
2419 <p>
2420 There are functions for <tt>wchar_t</tt> which substitute for the functions
2421 for <tt>char</tt>.
2422 <list>
2423 <item><tt>strcat()</tt>, <tt>strncat()</tt> -&gt;
2424 <tt>wcscat()</tt>, <tt>wcsncat()</tt>
2425 <item><tt>strcpy()</tt>, <tt>strncpy()</tt> -&gt;
2426 <tt>wcscpy()</tt>, <tt>wcsncpy()</tt>
2427 <item><tt>strcmp()</tt>, <tt>strncmp()</tt> -&gt;
2428 <tt>wcscmp()</tt>, <tt>wcsncmp()</tt>
2429 <item><tt>strcasecmp()</tt>, <tt>strncasecmp()</tt> -&gt;
2430 <tt>wcscasecmp()</tt>, <tt>wcsncasecmp()</tt>
2431 <item><tt>strcoll()</tt>, <tt>strxfrm()</tt> -&gt;
2432 <tt>wcscoll()</tt>, <tt>wcsxfrm()</tt>
2433 <item><tt>strchr()</tt>, <tt>strrchr()</tt> -&gt;
2434 <tt>wcschr()</tt>, <tt>wcsrchr()</tt>
2435 <item><tt>strstr()</tt>, <tt>strpbrk()</tt> -&gt;
2436 <tt>wcsstr()</tt>, <tt>wcspbrk()</tt>
2437 <item><tt>strtok()</tt>, <tt>strspn()</tt>, <tt>strcspn()</tt> -&gt;
2438 <tt>wcstok()</tt>, <tt>wcsspn()</tt>, <tt>wcscspn()</tt>
2439 <item><tt>strtol()</tt>, <tt>strtoul()</tt>, <tt>strtod()</tt> -&gt;
2440 <tt>wcstol()</tt>, <tt>wcstoul()</tt>, <tt>wcstod()</tt>
2441 <item><tt>strftime()</tt> -&gt;
2442 <tt>wcsftime()</tt>
2443 <item><tt>strlen()</tt> -&gt;
2444 <tt>wcslen()</tt>
2445 <item><tt>toupper()</tt>, <tt>tolower()</tt> -&gt;
2446 <tt>towupper()</tt>, <tt>towlower()</tt>
2447 <item><tt>isalnum()</tt>, <tt>isalpha()</tt>, <tt>isblank()</tt>,
2448 <tt>iscntrl()</tt>, <tt>isdigit()</tt>, <tt>isgraph()</tt>,
2449 <tt>islower()</tt>, <tt>isprint()</tt>, <tt>ispunct()</tt>,
2450 <tt>isspace()</tt>, <tt>isupper()</tt>, <tt>isxdigit()</tt> -&gt;
2451 <tt>iswalnum()</tt>, <tt>iswalpha()</tt>, <tt>iswblank()</tt>,
2452 <tt>iswcntrl()</tt>, <tt>iswdigit()</tt>, <tt>iswgraph()</tt>,
2453 <tt>iswlower()</tt>, <tt>iswprint()</tt>, <tt>iswpunct()</tt>,
2454 <tt>iswspace()</tt>, <tt>iswupper()</tt>, <tt>iswxdigit()</tt>
2455 (<tt>isascii()</tt> doesn't have its wide character version).
2456 <item><tt>memset()</tt>, <tt>memcpy()</tt>, <tt>memcmp()</tt>,
2457 <tt>memmove()</tt>, <tt>memchr()</tt> -&gt;
2458 <tt>wmemset()</tt>, <tt>wmemcpy()</tt>, <tt>wmemcmp()</tt>,
2459 <tt>wmemmove()</tt>, <tt>wmemchr()</tt>
2460 </list>
2461 There are additional functions for <tt>wchar_t</tt>.
2462 <list>
2463 <item><tt>wcwidth()</tt>, <tt>wcswidth()</tt>
2464 <item><tt>wctrans()</tt>, <tt>towctrans()</tt>
2465 </list>
2466 </p>
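<p>
As a small illustration of how the <tt>wcs*()</tt> and <tt>tow*()</tt>
functions mirror their <tt>char</tt> counterparts, here is a minimal
sketch. The function name <tt>upcase_wide()</tt> is hypothetical.
</p>

```c
#include <wchar.h>
#include <wctype.h>

/* Convert the wide string s to upper case in place and return its
 * length in wide characters.  The conversion follows the current
 * LC_CTYPE locale, just as toupper() does for char. */
size_t upcase_wide(wchar_t *s)
{
    size_t i, len = wcslen(s);

    for (i = 0; i < len; i++)
        s[i] = (wchar_t)towupper((wint_t)s[i]);
    return len;
}
```

<p>
Because every element of the array is one complete character, the loop
cannot cut a character in half, which a byte-wise loop over a
multibyte string could do.
</p>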
2467
2468 <p>
2469 You cannot assume anything about the concrete values of <tt>wchar_t</tt>,
2470 except that <tt>0x21</tt> - <tt>0x7e</tt> are identical to ASCII.
2471 <footnote>
2472 Some of you may know that GNU libc uses UCS-4 for the internal representation
2473 of <tt>wchar_t</tt>. However, you should not rely on this knowledge.
2474 It may differ on other systems.
2475 </footnote>
2476 You may feel this limitation is too strong. If you cannot work
2477 under this limitation, you can use UCS-4 as the internal encoding.
2478 In such a case, you can write your software so that it emulates
2479 the locale-sensible behavior using <tt>setlocale()</tt>,
2480 <tt>nl_langinfo(CODESET)</tt>, and <tt>iconv()</tt>. Consult
2481 the section of <ref id="iconv">. Note that it is generally
2482 easier to use wide characters than to implement UCS-4 or UTF-8 handling yourself.
2483 </p>
2484
2485 <p>
2486 You can write a wide character in the source code as <tt>L'a'</tt>
2487 and a wide string as <tt>L"string"</tt>. Since the encoding
2488 of the source code is ASCII, you can only write ASCII
2489 characters. If you'd like to use other characters, you should
2490 use <prgn>gettext</prgn>.
2491 </p>
2492
2493 <p>
2494 There are two ways to use wide characters:
2495 <list>
2496 <item>I/O is described using multibyte characters. Input data
2497 are converted into wide characters immediately after reading
2498 and data for output are converted from wide characters to
2499 multibyte characters immediately before writing. Conversion
2500 can be achieved using functions such as <tt>mbstowcs()</tt>,
2501 <tt>mbsrtowcs()</tt>, <tt>wcstombs()</tt>, <tt>wcsrtombs()</tt>,
2502 <tt>mblen()</tt>, <tt>mbrlen()</tt>, <tt>mbsinit()</tt>,
2503 and so on.
2504 Please consult the manual pages for these functions.
2505 <item>Wide characters are directly used for I/O, using
2506 wide character functions such as <tt>getwchar()</tt>,
2507 <tt>fgetwc()</tt>, <tt>getwc()</tt>,
2508 <tt>ungetwc()</tt>, <tt>fgetws()</tt>, <tt>putwchar()</tt>,
2509 <tt>fputwc()</tt>, <tt>putwc()</tt>, and <tt>fputws()</tt>,
2510 formatted I/O functions for wide characters such as
2511 <tt>fwscanf()</tt>, <tt>wscanf()</tt>, <tt>swscanf()</tt>,
2512 <tt>fwprintf()</tt>, <tt>wprintf()</tt>, <tt>swprintf()</tt>,
2513 <tt>vfwprintf()</tt>, <tt>vwprintf()</tt>, and
2514 <tt>vswprintf()</tt>, and the wide character conversion specifiers
2515 <tt>%lc</tt>, <tt>%C</tt>, <tt>%ls</tt>, and <tt>%S</tt>
2516 for conventional formatted I/O functions.
2517 By using this approach, you don't need to handle
2518 multibyte characters at all.
2519 Please consult the manual pages for these functions.
2520 </list>
2521 Though the latter functions are also specified in ISO C,
2522 they have only become available in GNU libc since version 2.2.
2523 (Of course all UNIX operating systems have all functions described
2524 here.)
2525 </p>
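<p>
The second approach, direct wide character I/O, can be sketched as
follows. The function name <tt>copy_wide()</tt> is hypothetical; the
sketch only assumes the ISO C functions <tt>fgetwc()</tt> and
<tt>fputwc()</tt> named above.
</p>

```c
#include <stdio.h>
#include <wchar.h>

/* Copy wide characters from in to out until fgetwc() reports WEOF
 * (end of file or a read error).  Return the number of characters
 * copied, or -1 if writing fails.  The variable is a wint_t so that
 * WEOF can be told apart from every valid wide character. */
long copy_wide(FILE *in, FILE *out)
{
    long n = 0;
    wint_t wc;

    while ((wc = fgetwc(in)) != WEOF) {
        if (fputwc((wchar_t)wc, out) == WEOF)
            return -1;
        n++;
    }
    return n;
}
```

<p>
The library converts between the multibyte encoding of the file and
<tt>wchar_t</tt> according to the LC_CTYPE locale, so the program
should call <tt>setlocale()</tt> before the first I/O operation.
</p>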
2526
2527 <p>
2528 Note that very simple software such as <tt>echo</tt> doesn't
2529 have to care about multibyte characters and wide characters.
2530 Such software can input and output multibyte characters as they are.
2531 Of course you may modify such software to use wide characters.
2532 It may be good practice in wide character programming.
2533 Example fragments of source code will be discussed in
2534 <ref id="internal">.
2535 </p>
2536
2537 <p>
2538 There is an explanation of multibyte and wide characters also
2539 in Ken Lunde's "CJKV Information Processing" (p25). However,
2540 the explanation is entirely wrong.
2541 </p>
2542
2543 <sect id="locale_unicode"><heading>Unicode and LOCALE technology</heading>
2544
2545 <p>
2546 UTF-8 is considered the encoding of the future, and
2547 much software is coming to support UTF-8. Though some
2548 of this software implements UTF-8 directly, I recommend
2549 that you use LOCALE technology to support UTF-8.
2550 </p>
2551
2552 <p>
2553 How can this be achieved? It is easy! If you are a developer
2554 of software and your software has already been written using LOCALE
2555 technology, you don't have to do anything!
2556 </p>
2557
2558 <p>
2559 Using LOCALE technology benefits not only developers but also users.
2560 All a user has to do is set the locale environment properly.
2561 Otherwise, a user has to remember the method to use UTF-8 mode
2562 for each software. Some software needs a <tt>-u8</tt> switch,
2563 some needs an X resource setting, some needs a <tt>.foobarrc</tt>
2564 file, some needs a special environmental variable, and
2565 some uses UTF-8 by default. It is nonsense!
2566 </p>
2567
2568 <p>
2569 Solaris has already been developed using this model.
2570 Please consult the
2571 <url id="http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT"
2572 name="Unicode support in the Solaris Operating Environment"> whitepaper.
2573 </p>
2574
2575 <p>
2576 However, it is likely that some upstream developers of
2577 software for which you maintain a Debian package refuse
2578 to use <tt>wchar_t</tt> for some reason, for example, that
2579 they are not familiar with LOCALE programming, that they think
2580 it is troublesome, that they are not keen on I18N, that it is much
2581 easier to modify the software to support UTF-8 than to modify it
2582 to use <tt>wchar_t</tt>, that the software must work even under
2583 a non-internationalized OS such as MS-DOS, and so on.
2584 Some developers may think that support of UTF-8 is sufficient
2585 for I18N.
2586 <footnote>
2587 In such a case, do they think of abolishing support of 7bit or
2588 8bit non-multibyte encodings? If not, it may be unfair that
2589 speakers of 8bit languages can use both UTF-8 and conventional (local)
2590 encodings while speakers of languages which need multibyte encodings, combining
2591 characters, and so on cannot use their popular local encodings.
2592 I think such software cannot be called "internationalized".
2593 </footnote>
2594 Even in such cases, you can rewrite such software so that it
2595 checks <tt>LC_*</tt> and <tt>LANG</tt> environmental variables
2596 to emulate the behavior of <tt>setlocale(LC_ALL, "");</tt>.
2597 You can also rewrite the software to call <tt>setlocale()</tt>,
2598 <tt>nl_langinfo()</tt>, and <tt>iconv()</tt> so that the software
2599 supports all encodings which the OS supports, as discussed later.
2600 Consult
2601 <url id="http://ffii.org/archive/mails/groff/2000/Oct/0056.html"
2602 name="the discussion in the Groff mailing list on the support of
2603 UTF-8 and locale-specific encodings">, mainly held by Werner
2604 LEMBERG, an experienced developer of GNU roff, and Tomohiro KUBOTA,
2605 the author of this document.
2606 </p>
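<p>
Emulating the effect of <tt>setlocale(LC_ALL, "")</tt> on LC_CTYPE by
checking environmental variables could look roughly like the sketch
below. Both function names are hypothetical, and the sketch assumes the
common <tt>language_TERRITORY.codeset@modifier</tt> form of locale
names; it does not resolve locale aliases, which real systems may use.
</p>

```c
#include <stdlib.h>
#include <string.h>

/* Return the locale name that applies to LC_CTYPE, following the usual
 * precedence LC_ALL, then LC_CTYPE, then LANG.  Unset or empty
 * variables are skipped; "C" is returned if none is set. */
const char *ctype_locale_name(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *v = getenv(vars[i]);
        if (v != NULL && v[0] != '\0')
            return v;
    }
    return "C";
}

/* Copy the codeset part of a locale name such as "ja_JP.eucJP" or
 * "de_DE.UTF-8@euro" into buf.  Return 1 on success, or 0 if the name
 * has no codeset part or buf is too small. */
int locale_codeset(const char *name, char *buf, size_t size)
{
    const char *dot = strchr(name, '.');
    size_t len;

    if (dot == NULL)
        return 0;
    dot++;                        /* skip the '.' itself */
    len = strcspn(dot, "@");      /* stop before an optional modifier */
    if (len == 0 || len >= size)
        return 0;
    memcpy(buf, dot, len);
    buf[len] = '\0';
    return 1;
}
```

<p>
Software that can call <tt>setlocale()</tt> and
<tt>nl_langinfo(CODESET)</tt> should prefer them, since the C library
knows the encoding even when the locale name carries no codeset part.
</p>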
2607
2608
2609
2610 <sect id="iconv"><heading><tt>nl_langinfo()</tt> and <tt>iconv()</tt></heading>
2611
2612 <p>
2613 Though ISO C defines extensive LOCALE-related functions,
2614 you may want more extensive support. You may also want
2615 conversion between different encodings.
2616 There are C functions which can be used for such purposes.
2617 </p>
2618
2619 <p>
2620 char *<tt>nl_langinfo(</tt>nl_item <em>item</em><tt>)</tt> is
2621 an XPG5 function to get LOCALE-related information. You can
2622 get the following information using the following macros
2623 for <em>item</em> defined in the <tt>langinfo.h</tt> header file:
2624 <list>
2625 <item>names for days in week
2626 (<tt>DAY_1</tt> (Sunday), <tt>DAY_2</tt>, <tt>DAY_3</tt>,
2627 <tt>DAY_4</tt>, <tt>DAY_5</tt>, <tt>DAY_6</tt>, and <tt>DAY_7</tt>)
2628 <item>abbreviated names for days in week
2629 (<tt>ABDAY_1</tt> (Sun), <tt>ABDAY_2</tt>, <tt>ABDAY_3</tt>,
2630 <tt>ABDAY_4</tt>, <tt>ABDAY_5</tt>, <tt>ABDAY_6</tt>, and
2631 <tt>ABDAY_7</tt>)
2632 <item>names for months in year
2633 (<tt>MON_1</tt> (January), <tt>MON_2</tt>, <tt>MON_3</tt>,
2634 <tt>MON_4</tt>, <tt>MON_5</tt>, <tt>MON_6</tt>, <tt>MON_7</tt>,
2635 <tt>MON_8</tt>, <tt>MON_9</tt>, <tt>MON_10</tt>, <tt>MON_11</tt>,
2636 and <tt>MON_12</tt>)
2637 <item>abbreviated names for months in year
2638 (<tt>ABMON_1</tt> (Jan), <tt>ABMON_2</tt>, <tt>ABMON_3</tt>,
2639 <tt>ABMON_4</tt>, <tt>ABMON_5</tt>, <tt>ABMON_6</tt>,
2640 <tt>ABMON_7</tt>, <tt>ABMON_8</tt>, <tt>ABMON_9</tt>,
2641 <tt>ABMON_10</tt>, <tt>ABMON_11</tt>, and <tt>ABMON_12</tt>)
2642 <item>name for AM (<tt>AM_STR</tt>)
2643 <item>name for PM (<tt>PM_STR</tt>)
2644 <item>name of era (<tt>ERA</tt>)
2645 <item>format of date and time (<tt>D_T_FMT</tt>)
2646 <item>format of date and time (era-based) (<tt>ERA_D_T_FMT</tt>)
2647 <item>format of date (<tt>D_FMT</tt>)
2648 <item>format of date (era-based) (<tt>ERA_D_FMT</tt>)
2649 <item>format of time (24-hour format) (<tt>T_FMT</tt>)
2650 <item>format of time (am/pm format) (<tt>T_FMT_AMPM</tt>)
2651 <item>format of time (era-based) (<tt>ERA_T_FMT</tt>)
2652 <item>radix (<tt>RADIXCHAR</tt>)
2653 <item>thousands separator (<tt>THOUSEP</tt>)
2654 <item>alternative characters for numerics (<tt>ALT_DIGITS</tt>)
2655 <item>affirmative word (<tt>YESSTR</tt>)
2656 <item>affirmative response (<tt>YESEXPR</tt>)
2657 <item>negative word (<tt>NOSTR</tt>)
2658 <item>negative response (<tt>NOEXPR</tt>)
2659 <item>encoding (<tt>CODESET</tt>)
2660 </list>
2661 For example, you can get names for months and use them for
2662 your original output algorithm. <tt>YESEXPR</tt> and
2663 <tt>NOEXPR</tt> are convenient for software expecting a Y/N
2664 answer from users.
2665 </p>
2666
2667 <p>
2668 <tt>iconv_open()</tt>, <tt>iconv()</tt>, and <tt>iconv_close()</tt>
2669 are functions to perform conversion between encodings.
2670 Please consult manpages for them.
2671 </p>
2672
2673 <p>
2674 By combining <tt>nl_langinfo()</tt> and <tt>iconv()</tt>,
2675 you can easily turn Unicode-enabled software into locale-sensible,
2676 truly internationalized software.
2677 </p>
2678
2679 <p>
2680 First, add a line of <tt>setlocale(LC_ALL, "");</tt> at the
2681 beginning of the software. If it returns non-NULL, enable the UTF-8 mode
2682 of the software.
2683 <example>
2684 int conversion = FALSE;
2685 char *locale = setlocale(LC_ALL, "");
2686 :
2687 :
2688 (original code to determine UTF-8 mode or not)
2689 :
2690 :
2691 if (locale != NULL &amp;&amp; utf8_mode == FALSE) {
2692 utf8_mode = TRUE;
2693 conversion = TRUE;
2694 }
2695 </example>
2696 Then modify the input routine as follows:
2697 <example>
2698 #define INTERNALCODE "UTF-8"
2699 if (conversion == TRUE) {
2700 char *fromcode = nl_langinfo(CODESET);
2701 iconv_t conv = iconv_open(INTERNALCODE, fromcode);
2702 (reading and conversion...)
2703 iconv_close(conv);
2704 } else {
2705 (original reading routine)
2706 }
2707 </example>
2708 Finally, modify the output routine as follows:
2709 <example>
2710 if (conversion == TRUE) {
2711 char *tocode = nl_langinfo(CODESET);
2712 iconv_t conv = iconv_open(tocode, INTERNALCODE);
2713 (conversion and writing...)
2714 iconv_close(conv);
2715 } else {
2716 (original writing routine)
2717 }
2718 </example>
2719 Note that the whole input should be read at once, since
2720 otherwise you may split a multibyte character.
2721 You can consult the <tt>iconv_prog.c</tt> file
2722 in the distribution of GNU libc for usage of <tt>iconv()</tt>.
2723 </p>
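<p>
The "reading and conversion" placeholder in the fragments above could
be filled in along the following lines. This is only a sketch for a
buffer that is converted in one call: the function name
<tt>convert_buffer()</tt> is hypothetical, and a real program would
also have to handle <tt>E2BIG</tt> (output buffer full) and
<tt>EINVAL</tt> (input ends in the middle of a character) by looping.
</p>

```c
#include <iconv.h>
#include <stddef.h>

/* Convert inlen bytes at in from the encoding fromcode to the encoding
 * tocode, writing at most outsize bytes to out.  Return the number of
 * bytes written, or (size_t)-1 if the conversion is unavailable or
 * fails. */
size_t convert_buffer(const char *tocode, const char *fromcode,
                      const char *in, size_t inlen,
                      char *out, size_t outsize)
{
    iconv_t cd = iconv_open(tocode, fromcode);
    char *inp = (char *)in;    /* iconv() takes a non-const char ** */
    char *outp = out;
    size_t outleft = outsize;
    size_t result;

    if (cd == (iconv_t)-1)
        return (size_t)-1;

    result = iconv(cd, &inp, &inlen, &outp, &outleft);
    /* For stateful target encodings, emit the final shift sequence. */
    if (result != (size_t)-1)
        result = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);

    if (result == (size_t)-1)
        return (size_t)-1;
    return outsize - outleft;
}
```

<p>
With the names used above, the input routine would call something like
<tt>convert_buffer(INTERNALCODE, nl_langinfo(CODESET), ...)</tt> and
the output routine would swap the first two arguments.
</p>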
2724
2725 <p>
2726 Though <tt>nl_langinfo()</tt> is a standard function of XPG5
2727 and GNU libc supports it, it is not very portable. Moreover,
2728 there is no standard for the encoding names used by
2729 <tt>nl_langinfo()</tt> and <tt>iconv_open()</tt>.
2730 If this is a problem, you can use Bruno Haible's
2731 <url id="http://www.gnu.org/software/libiconv/"
2732 name="libiconv">. It has <tt>iconv()</tt>, <tt>iconv_open()</tt>,
2733 and <tt>iconv_close()</tt>. In addition, it has <tt>locale_charset()</tt>,
2734 a replacement for <tt>nl_langinfo(CODESET)</tt>.
2735 </p>
2736
2737
2738 <sect id="locale-limit"><heading>Limit of Locale technology</heading>
2739
2740 <P>
2741 The locale model has a limit. That is, it cannot handle two locales at
2742 the same time. In particular, it cannot handle the relationship between two
2743 locales at all.
2744 </P>
2745
2746 <P>
2747 For example, EUC-JP, ISO 2022-JP, and Shift-JIS are popular encodings
2748 in Japan. EUC-JP is the de-facto standard for UNIX systems,
2749 ISO 2022-JP is the standard for the Internet, and Shift-JIS is the
2750 encoding for Windows and Macintosh. Thus, Japanese people have to
2751 handle texts in these encodings. Text viewers such as <tt>jless</tt>
2752 and <tt>lv</tt> and editors such as <tt>emacs</tt> can automatically
2753 detect the encoding of the text to be read. You cannot write such software
2754 using Locale technology.
2755 </P>
2756
2757
2758
2759 <chapt id="output"><heading>Output to Display</heading>
2760
2761 <P>
2762 Here 'Output to Display' does not mean translation of messages using
2763 <prgn>gettext</prgn>.
2764 I am concerned with whether characters are correctly displayed so that
2765 we can read them. For example, install the <package>libcanna1g</package>
2766 package and display
2767 <tt>/usr/doc/libcanna1g/README.jp.gz</tt> on the console or <prgn>xterm</prgn>
2768 (of course after
2769 ungzipping). This text file is written in Japanese but even Japanese
2770 people cannot read such a row of strange characters. Which would you
2771 prefer if you were a Japanese speaker, an English message which can be read
2772 with a dictionary or such a row of strange characters which is
2773 a result of <prgn>gettext</prgn>ization?
2774 <footnote>
2775 (Yes, there <em>are</em> ways to display Japanese characters
2776 correctly -- <prgn>kon</prgn> (in <package>kon2</package> package)
2777 for console and <prgn>kterm</prgn> for X, and Japanese people are
2778 happy with <prgn>gettext</prgn>ized Japanese messages.)
2779 </footnote>
2780 </P>
2781
2782 <P>
2783 Problems in displaying non-English (non-ASCII) characters
2784 are discussed below.
2785 </P>
2786
2787
2788
2789 <sect id="output-console"><heading>Console Software</heading>
2790
2791 <P>
2792 In this section, problems in displaying characters on the
2793 <strong>console</strong> are discussed.
2794 <footnote>
2795 This section does not cover problems in developing a console;
2796 it covers problems in developing software which runs
2797 on a console.
2798 </footnote>
2799 Here, console includes a bare <strong>Linux console</strong> (both the
2800 framebuffer and conventional versions), special consoles such as
2801 <strong>kon2</strong>, <strong>jfbterm</strong>, <strong>chdrv</strong>,
2802 and so on implemented by special software, and X terminal emulators
2803 such as <strong>xterm</strong>, <strong>kterm</strong>,
2804 <strong>hanterm</strong>, <strong>xiterm</strong>, <strong>rxvt</strong>,
2805 <strong>xvt</strong>, <strong>gnome-terminal</strong>,
2806 <strong>wterm</strong>, <strong>aterm</strong>, <strong>eterm</strong>,
2807 and so on. Remote environments via telnet and secure shell such as
2808 <strong>NCSA telnet</strong> for Macintosh and <strong>Tera Term</strong>
2809 for Windows are also regarded as consoles.
2810 </P>
2811
2812 <P>
2813 The features of a console are:
2814 <list>
2815 <item>All that software has to do is to send text in the correct encoding
2816 to standard output. Software on a console doesn't need to
2817 care about fonts and so on.
2818 <item>Fonts with fixed sizes are used. The unit of the width
2819 of the font is called 'column'. 'Doublewidth' fonts, i.e.,
2820 fonts whose width is 2 columns, are used for CJK ideograms,
2821 Japanese Hiragana and Katakana, Korean Hangul, and related
2822 symbols. Combining characters used for Thai and so on can be
2823 regarded as 'zero'-column characters.
2824 </list>
2825 </P>
2826
2827 <sect1 id="output-console-code"><heading>Encoding</heading>
2828
2829 <P>
2830 Software running on the console is not responsible for displaying.
2831 The console itself is responsible. There are consoles
2832 which can display encodings other than ASCII, such as
2833 <taglist>
2834 <tag>kon in kon2 package
2835 <item>EUC-JP, Shift-JIS, and ISO-2022-JP
2836 <tag>jfbterm
2837 <item>EUC-JP, ISO 2022-JP, and ISO 2022 (including any 94, 96,
2838 and 94x94 coded character sets whose fonts are available)
2839 <tag>kterm
2840 <item>EUC-JP, Shift-JIS, ISO 2022-JP, and ISO 2022 (including
2841 ISO8859-{1,2,3,4,5,6,7,8,9}, JISX 0201, JISX 0208, JISX 0212,
2842 GB 2312, and KSC 5601)
2843 <tag>krxvt in rxvt-ml package
2844 <item>EUC-JP
2845 <tag>crxvt-gb in rxvt-ml package
2846 <item>CN-GB
2847 <tag>crxvt-big5 in rxvt-ml package
2848 <item>Big5
2849 <tag>cxtermb5 in cxterm-big5 package
2850 <item>Big5
2851 <tag>xcinterm-big5 in xcin package
2852 <item>Big5
2853 <tag>xcinterm-gb in xcin package
2854 <item>CN-GB
2855 <tag>xcinterm-gbk in xcin package
2856 <item>GBK
2857 <tag>xcinterm-big5hkscs in xcin package
2858 <item>Big5 with HKSCS
2859 <tag>hanterm
2860 <item>EUC-KR, Johab, and ISO 2022-KR
2861 <tag>xiterm and txiterm in xiterm+thai package
2862 <item>TIS 620
2863 <tag>xterm
2864 <item>UTF-8
2865 </taglist>
2866 However, there is no way for software on a console to know which
2867 encoding is available. I think it is the responsibility of
2868 the user to properly set the LC_CTYPE locale (i.e. the LC_ALL, LC_CTYPE, or LANG
2869 environmental variable). Provided the LC_CTYPE locale is set properly,
2870 software can use it to know which encoding is supported
2871 by the console.
2872 </P>
2873
2874 <P>
2875 Concerning messages translated by <prgn>gettext</prgn>,
2876 the software does not need to do anything. It works well if the
2877 user properly sets the LC_CTYPE and LC_MESSAGES locales.
2878 </P>
2879
2880 <P>
2881 If you are handling a string in a non-ASCII encoding (using
2882 multibyte characters, UTF-8 directly, and so on), you will have
2883 to care about points which you don't have to care about if you are
2884 using ASCII.
2885 <list>
2886 <item>8-bit cleanness. I think everyone understands this.
2887 <item>Continuity of multibyte characters. In multibyte encodings
2888 such as EUC-JP and UTF-8, one character may consist
2889 of two or more bytes. These bytes should be output
2890 contiguously. Insertion of additional codes between the
2891 continuing bytes can break the character. I have seen
2892 software which outputs a location control code every time
2893 it outputs one byte. It breaks multibyte characters.
2894 </list>
2895 </P>
2896
2897 <sect1 id="output-console-column"><heading>Number of Columns</heading>
2898
2899 <P>
2900 Internationalized console software cannot assume that a character
2901 always occupies one column. You can get the number of columns of a
2902 character or a string using <tt>wcwidth()</tt> and
2903 <tt>wcswidth()</tt>. Note that you have to use
2904 <tt>wchar_t</tt>-style programming since these functions take
2905 a <tt>wchar_t</tt> parameter.
2906 </P>
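<P>
A minimal sketch of measuring a wide string in columns is shown below.
The wrapper name <tt>display_columns()</tt> is hypothetical, and the
<tt>_XOPEN_SOURCE</tt> definition is an assumption about how the
headers expose <tt>wcswidth()</tt>, which is an X/Open function and
not part of ISO C.
</P>

```c
#define _XOPEN_SOURCE 700   /* ask the headers to declare wcswidth() */
#include <wchar.h>

/* Return the number of columns needed to display the wide string s in
 * the current LC_CTYPE locale, or -1 if s contains a character that
 * has no defined width (for example a control character). */
int display_columns(const wchar_t *s)
{
    return wcswidth(s, wcslen(s));
}
```

<P>
Software that aligns text in tables or truncates lines to the screen
width should count columns in this way rather than bytes or characters.
</P>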
2907
2908 <P>
2909 Additional care has to be taken not to destroy multicolumn
2910 characters. For example, imagine your software displayed a
2911 double-column character at (row, column) = (1, 1). What will occur
2912 when your software then displays a single-column character at (row, column)
2913 = (1, 2) or at (1, 1)? Does the single-column character erase
2914 half of the double-column character? Nobody knows the answer.
2915 It depends on the implementation of the console. All I can
2916 tell you is that your software should avoid such cases.
2917 </P>
2918
2919 <P>
2920 If your software inputs a string from the keyboard, you will have to
2921 take more care. The numbers of characters, bytes, and columns
2922 all differ. For example, in UTF-8 encoding, one character of
2923 'a' with acute accent occupies two bytes and one column. One
2924 CJK ideograph occupies three bytes and two columns.
2925 For example, if the user types 'Backspace', how many backspace
2926 codes (0x08) should the software output? How many bytes should
2927 the software erase from the internal buffer?
2928 Don't be nervous; you can use <tt>wchar_t</tt>, which assures that
2929 one character always occupies one <tt>wchar_t</tt>, and you can
2930 use <tt>wcwidth()</tt> to know the number of columns.
2931 Note that control codes such as 'backspace' (0x08) and so on are
2932 always column-oriented. Backspace moves back 'one' column even if the
2933 character at the position is a doublewidth character.
2934 </P>
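<P>
The Backspace handling described above might be sketched like this.
The function name and the buffer layout (an array of <tt>wchar_t</tt>
plus a length) are hypothetical. The sketch also ignores combining
characters, which have zero width and would normally be erased together
with their base character.
</P>

```c
#define _XOPEN_SOURCE 700   /* ask the headers to declare wcwidth() */
#include <stddef.h>
#include <wchar.h>

/* Remove the last character from an edit buffer that holds *len wide
 * characters followed by a terminating L'\0'.  Return the number of
 * screen columns the caller has to erase, or 0 if the buffer was
 * empty or the character occupied no column. */
int erase_last_character(wchar_t *buf, size_t *len)
{
    int width;

    if (*len == 0)
        return 0;
    (*len)--;
    width = wcwidth(buf[*len]);   /* columns used by the removed character */
    buf[*len] = L'\0';
    return width > 0 ? width : 0;
}
```

<P>
The caller would then send a backspace, a space, and another backspace
once for each returned column, since each backspace code moves the
cursor by exactly one column.
</P>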
2935
2936
2937 <sect id="output-x"><heading>X Clients</heading>
2938
2939 <P>
2940 The way to develop X clients can differ drastically depending on
2941 the toolkit to be used. First, Xlib-style programming is
2942 discussed since Xlib is the foundation of all other toolkits.
2943 Then a few toolkits are discussed.
2944 </P>
2945
2946 <sect1 id="output-x-xlib"><heading>Xlib programming</heading>
2947
2948 <P>
2949 X itself is already internationalized. X11R5 introduced
2950 the idea of a 'fontset' for internationalized text output.
2951 Thus all that X clients have to do is to use the 'fontset'-related
2952 functions.
2953 </P>
2954
2955 <P>
2956 The most important part of internationalizing the display
2957 of X clients is the use of the internationalized
2958 <strong>XFontSet</strong>-related functions introduced in
2959 X11R5 instead of the conventional <strong>XFontStruct</strong>-related
2960 functions.
2961 </P>
2962
2963 <P>
2964 The main feature of XFontSet is that it can handle multiple fonts
2965 at the same time. This is related to the distinction between
2966 coded character set (CCS) and character encoding scheme (CES)
2967 which I described in the section of <ref id="coding-general-term">.
2968 Some encodings in the world use multiple coded character
2969 sets at the same time. This is the reason we have to handle
2970 multiple X fonts at the same time.
2971 <footnote>
2972 Though UTF-8 is an encoding with single CCS, the current
2973 version of XFree86 (4.0.1) needs multiple fonts to handle UTF-8.
2974 </footnote>
2975 </P>
2976
2977 <P>
2978 Another significant feature of XFontSet is that it is
2979 locale (LC_CTYPE)-sensible. This means that you have to
2980 call <tt>setlocale()</tt> before you use XFontSet-related
2981 functions. Moreover, you have to specify the string you want
2982 to draw as a multibyte string or a wide character string.
2983 </P>
2984
2985 <P>
2986 In the conventional <tt>XFontStruct</tt> model, an X client
2987 opens a font using <tt>XLoadQueryFont()</tt>, draws a string
2988 using <tt>XDrawString()</tt>, and closes the font using
2989 <tt>XFreeFont()</tt>. On the other hand, in the internationalized
2990 <tt>XFontSet</tt> model, an X client opens a fontset using
2991 <tt>XCreateFontSet()</tt>, draws a string using <tt>XmbDrawString()</tt>,
2992 and closes the fontset using <tt>XFreeFontSet()</tt>.
2993 The following is a concise list of substitutions.
2994 <list>
2995 <item><tt>XFontStruct</tt> -&gt; <tt>XFontSet</tt>
2996 <item><tt>XLoadQueryFont()</tt> -&gt; <tt>XCreateFontSet()</tt>
2997 <item>both <tt>XDrawString()</tt> and <tt>XDrawString16()</tt>
2998 -&gt; either <tt>XmbDrawString()</tt> or <tt>XwcDrawString()</tt>
2999 <item>both <tt>XDrawImageString()</tt> and <tt>XDrawImageString16()</tt>
3000 -&gt; either <tt>XmbDrawImageString()</tt> or
3001 <tt>XwcDrawImageString()</tt>
3002 </list>
3003 Note that <tt>XFontStruct</tt> is usually used as a pointer, while
3004 <tt>XFontSet</tt> itself is a pointer.
3005 </P>
3006
3007 <P>
3008 Some people (ISO-8859-1-language speakers) may think that
3009 <tt>XFontSet</tt>-related functions are not 8-bit clean.
3010 This is wrong. <tt>XFontSet</tt>-related
3011 functions work according to the <tt>LC_CTYPE</tt> locale. The default
3012 LC_CTYPE locale uses ASCII. Thus, if a user sets none of the <tt>LANG</tt>,
3013 <tt>LC_CTYPE</tt>, and <tt>LC_ALL</tt> environmental variables,
3014 <tt>XFontSet</tt>-related functions will use ASCII, i.e., not 8-bit
3015 clean. The user has to set <tt>LANG</tt>, <tt>LC_CTYPE</tt>, or
3016 <tt>LC_ALL</tt> environmental variable properly (for example,
3017 <tt>LANG=en_US</tt>).
3018 </P>
3019
3020 <P>
3021 The upstream developers of X clients sometimes hate to force
3022 users to set such environmental variables.
3023 <footnote>
3024 IMHO, all users will have to set LANG properly when UTF-8
3025 becomes popular.
3026 </footnote>
3027 In such a case,
3028 the X clients should have two ways to output text, i.e.,
3029 the conventional <tt>XFontStruct</tt>-related way and
3030 the internationalized <tt>XFontSet</tt>-related way.
3031 If <tt>setlocale()</tt> returns <tt>NULL</tt>, <tt>"C"</tt>,
3032 or <tt>"POSIX"</tt>, use
3033 the <tt>XFontStruct</tt> way. Otherwise use the <tt>XFontSet</tt> way.
3034 The author implemented this algorithm in a few window managers
3035 such as TWM (version 4.0.1d), Blackbox (0.60.1), IceWM (1.0.0),
3036 sawmill (0.28), and so on.
3037 </P>
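<P>
The decision rule above can be sketched without any Xlib calls. The
function names are hypothetical, and the sketch queries only the
LC_CTYPE category, on the assumption that this is the category the
<tt>XFontSet</tt> functions depend on.
</P>

```c
#include <locale.h>
#include <string.h>

/* Given the value returned by setlocale(), return 1 if the XFontSet
 * way should be used and 0 if the client should fall back to the
 * conventional XFontStruct way. */
int locale_wants_fontset(const char *locale)
{
    if (locale == NULL)
        return 0;                 /* the locale could not be set */
    if (strcmp(locale, "C") == 0 || strcmp(locale, "POSIX") == 0)
        return 0;                 /* no locale requested by the user */
    return 1;
}

/* Set the LC_CTYPE locale from the environment and apply the rule. */
int should_use_fontset(void)
{
    return locale_wants_fontset(setlocale(LC_CTYPE, ""));
}
```

<P>
A client would call <tt>should_use_fontset()</tt> once at startup and
select its drawing routines accordingly.
</P>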
3038
3039 <P>
3040 Window managers need more modifications related to inter-client
3041 communication. This topic will be described later.
3042 </P>
3043
3044 <sect1 id="output-x-aw"><heading>Athena widgets</heading>
3045
3046 <P>
3047 Athena widget is already internationalized.
3048 </P>
3049
3050 <P>***** Not written yet *****</P>
3051
3052 <sect1 id="output-x-gtk"><heading>Gtk and Gnome</heading>
3053
3054 <P>
3055 Gtk is already internationalized.
3056 </P>
3057
3058 <P>***** Not written yet *****</P>
3059
3060 <sect1 id="output-x-qt"><heading>Qt and KDE</heading>
3061
3062 <P>
3063 Though an internationalized version of Qt was available for a long
3064 time, it could not become the official version of Qt. The license
3065 of Qt in those days prohibited distribution of an internationalized
3066 version of Qt. However, Troll Tech at last changed their mind
3067 and Qt's license and now the official version of Qt is
3068 internationalized.
3069 </P>
3070
3071 <P>***** Not written yet *****</P>
3072
3073 <chapt id="input"><heading>Input from Keyboard</heading>
3074
3075 <P>
3076 It is obvious that a text editor needs the ability to input text
3077 from the keyboard; otherwise the text editor is entirely useless.
3078 Similarly, an internationalized text editor needs the ability to input
3079 characters used for various languages. Other software such
3080 as shells, libraries such as readline, environments such as
3081 consoles and X terminal emulators, script languages such as perl,
3082 tcl/tk, python, and ruby, and application software such as
3083 word processors, drawing and painting programs, file managers such as
3084 Midnight Commander, web browsers, mailers, and so on
3085 also need the ability to input internationalized text. Otherwise
3086 this software is entirely useless.
3087 </P>
3088
3089 <P>
3090 There are various languages in the world. Thus, proper input methods
3091 vary from language to language.
3092 <list>
3093 <item>Some languages such as English don't need any special input
3094 method. All characters for the language can be input
3095 by a single key on a keyboard. The keymap is all that a user
3096 has to care about.
3097 <item>Some other languages such as German need a simple extension.
3098 For example, u with umlaut can be input with two strokes
3099 of ':' and 'u'. A way to switch between the ordinary input mode (key
3100 strokes of ':' and 'u' input ':' and 'u') and the extension
3101 input mode (key strokes of ':' and 'u' produce u with umlaut)
3102 has to be supplied. Almost all languages in the world can be
3103 input with this method.
3104 <item>Other languages such as Chinese and Japanese need a complicated
3105 input method, since they use thousands of characters.
3106 Since developing a clever input method is a very difficult and
3107 challenging problem, a few companies are developing Japanese
3108 input methods. Typical Japanese input methods are shipped
3109 with tens of megabytes of conversion dictionary.
3110 It is often very troublesome to set up an input method for
3111 these languages.
3112 <footnote>
3113 This is a field where proprietary systems such as MS Windows
3114 and Macintosh are much easier than free systems such as
3115 Debian and FreeBSD.
3116 </footnote>
3117 You also need practice to use
3118 these input methods.
3119 </list>
3120 Different technologies are used for these languages.
3121 The aim of this chapter is to introduce technologies for them.
3122 </P>
3123
3124
3125 <sect id="input-console"><heading>Non-X Software</heading>
3126
3127 <P>
3128 Ideally, it is the responsibility of consoles and X terminal emulators
3129 to supply an input method. This situation is already achieved for
3130 simple languages which don't need complicated input methods.
3131 Thus, non-X software doesn't need to care about input methods.
3132 </P>
3133
3134 <P>
3135 There are a few Debian packages for consoles and X terminal
3136 emulators which supply input methods for particular languages.
3137 <taglist>
3138 <tag><strong>xiterm</strong> in xiterm+thai package
3139 <item>Thai characters
3140 <tag><strong>hanterm</strong>
3141 <item>Korean Hangul
3142 <tag><strong>cxtermb5</strong> in cxterm-big5 package
3143 <item>Big5 traditional Chinese ideograms
3144 <tag><strong>cce</strong>
3145 <item>CN-GB simplified Chinese ideograms
3146 </taglist>
3147 In addition, there are a few programs which supply input methods for
3148 existing console environments.
3149 <taglist>
3150 <tag><strong>skkfep</strong>
3151 <item>Japanese (needs SKK as a conversion engine)
3152 <tag><strong>uum</strong>
3153 <item>Japanese (needs Wnn as a conversion engine; not
3154 available as a Debian package)
3155 <tag><strong>canuum</strong>
3156 <item>Japanese (needs Canna as a conversion engine; not
3157 available as a Debian package)
3158 </taglist>
3159 However, since input methods for complex languages have not been
3160 available historically, a few non-X programs have been developed
3161 with built-in input methods.
3162 <taglist>
3163 <tag><strong>jvim-canna</strong>
3164 <item>A text editor which can input Japanese (needs Canna
3165 as a conversion engine.)
3166 <tag><strong>jed-canna</strong>
3167 <item>A text editor which can input Japanese (needs Canna
3168 as a conversion engine.)
3169 <tag><strong>nvi-m17n-canna</strong>
3170 <item>A text editor which can input Japanese (needs Canna
3171 as a conversion engine.)
3172 </taglist>
3173 </P>
3174
3175 <P>
You have to take care of the differences between the numbers of
<em>characters</em>, <em>columns</em>, and <em>bytes</em>.
For example, you can see immediately that <prgn>bash</prgn>
cannot handle UTF-8 input properly when you invoke <prgn>bash</prgn>
in a UTF-8 xterm and press the BackSpace key. This is because
<prgn>readline</prgn> always erases one column on the screen
and one byte in the internal buffer for each stroke of the BackSpace
key. To solve this problem, <strong>wide characters</strong>
should be used for internal processing. One stroke of the BackSpace key
should erase <tt>wcwidth()</tt> columns on the screen and
one <tt>wchar_t</tt> unit in the internal buffer.
3187 </P>
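<P>
The BackSpace handling described above can be sketched as follows.
This is only an illustration and not the actual <prgn>readline</prgn>
code: the function name <tt>erase_last_char()</tt> and the flat
<tt>wchar_t</tt> line buffer are assumptions made for this example.
The function removes one <tt>wchar_t</tt> unit from the buffer and
reports how many screen columns the caller should erase.
</P>

```c
#define _XOPEN_SOURCE 700   /* for wcwidth() */
#include <stddef.h>
#include <wchar.h>

/*
 * Remove the last wide character from buf (current length *len) and
 * return the number of screen columns the caller should erase.
 * Returns 0 if the buffer is empty or the character is non-printable.
 */
int erase_last_char(wchar_t *buf, size_t *len)
{
    int cols;

    if (*len == 0)
        return 0;

    cols = wcwidth(buf[*len - 1]);   /* -1 for non-printable characters */
    buf[--*len] = L'\0';             /* one wchar_t unit, not one byte */
    return cols > 0 ? cols : 0;
}
```

<P>
The caller is expected to have called <tt>setlocale()</tt> beforehand,
because the result of <tt>wcwidth()</tt> depends on the locale.
</P>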
3188
3189
3190 <sect id="input-x"><heading>X Softwares</heading>
3191
3192 <P>
X11R5 was the first internationalized version of the X Window System.
However, X11R5 supplied two sample implementations of international
text input, <strong>Xsi</strong> and <strong>Ximp</strong>.
The existence of two different protocols was an annoying situation.
X11R6 therefore adopted <strong>XIM</strong>, a new protocol
for internationalized text input, as the standard. Internationalized
X software should support text input using XIM.
3200 </P>
3201
3202 <P>
These protocols are designed on a <em>client-server</em> model.
The client calls the server when necessary, and the server
converts key strokes into internationalized text.
3206 </P>
3207
3208 <P>
<strong>Kinput</strong> and <strong>kinput2</strong>
are protocols for Japanese text input that existed before X11R5.
Some programs, such as <prgn>kterm</prgn>, support the
kinput2 protocol. <prgn>kinput2</prgn> is also the server software.
Since the current version of <prgn>kinput2</prgn> supports the XIM protocol,
you don't need to support the kinput protocol.
3215 </P>
3216
3217 <sect1 id="input-x-devel"><heading>Developing XIM clients</heading>
3218
3219 <P>***** Not written yet *****</P>
3220
3221 <P>
Developing an XIM client is a bit complicated. You can study the
source code of <prgn>rxvt</prgn> and <prgn>xedit</prgn> as
examples.
3225 </P>
3226
3227 <P>
3228 <url id="http://www.ainet.or.jp/~inoue/im/index-e.html"
3229 name="Programming for Japanse characters input"> is a
3230 good introduction to XIM programming.
3231 </P>
3232
3233 <sect1 id="input-x-examples"><heading>Examples of XIM softwares</heading>
3234
3235 <P>
The following are examples of software that can work as XIM clients.
<list>
<item>X terminal emulators such as <prgn>krxvt</prgn> and <prgn>kterm</prgn>.
<item>Text editors such as <prgn>xedit</prgn> and <prgn>gedit</prgn>.
<item>The web browser <prgn>mozilla</prgn>.
</list>
The following are examples of software that can work as XIM servers.
3245 <list>
<item><prgn>kinput2</prgn> and <prgn>skkinput</prgn> for Japanese.
3247 </list>
3248 </P>
3249
3250 <sect1 id="input-x-setup"><heading>Using XIM softwares</heading>
3251
3252 <P>
Here I will explain how to use XIM input on a Debian system.
This will help developers and package maintainers who want to
test the XIM support of their software. Debian Woody or a later
release is assumed.
3257 </P>
3258
3259 <P>
First, the locale database has to be prepared. Uncomment the
<tt>ja_JP.EUC-JP EUC-JP</tt>, <tt>ko_KR.EUC-KR EUC-KR</tt>,
<tt>zh_CN.GB2312</tt>, and <tt>zh_TW BIG5</tt> lines in
<tt>/etc/locale.gen</tt> and invoke <prgn>/usr/sbin/locale-gen</prgn>.
This will prepare the locale database under <tt>/usr/lib/locale/</tt>.
On systems other than Debian Woody or later, follow the procedure
appropriate to that system to prepare the locale database.
3267 </P>
3268
3269 <P>
3270 Basic Chinese, Japanese, and Korean X fonts are included in
3271 <package>xfonts-base</package> package for Debian Woody and later.
3272 </P>
3273
3274 <P>
An XIM server must be installed. For <strong>Japanese</strong>,
the <package>kinput2</package> and <package>skkinput</package> packages
are available. <package>kinput2</package> supports the Japanese input
3278 engines of <strong>Canna</strong> and <strong>FreeWnn</strong> and
3279 <package>skkinput</package> supports <strong>SKK</strong>.
3280 For <strong>Korean</strong>, <package>ami</package> is available.
3281 For <strong>traditional Chinese</strong> and <strong>simplified
3282 Chinese</strong>, <package>xcin</package> is available.
3283 </P>
3284
3285 <P>
Of course you also need an XIM client. <prgn>xedit</prgn>
in the <package>xbase-clients</package> package is an example of
an XIM client.
3289 </P>
3290
3291 <P>
Then log in as a non-root user. The environment variables
<tt>LC_ALL</tt> (or <tt>LANG</tt>) and <tt>XMODIFIERS</tt>
must be set as follows.
3295 <list>
3296 <item>for <strong>Japanese</strong>/<strong>kinput2</strong>:
3297 <tt>LC_ALL=ja_JP.eucJP</tt> and <tt>XMODIFIERS=@im=kinput2</tt>
3298 <item>for <strong>Korean</strong>/<strong>ami</strong>:
3299 <tt>LC_ALL=ko_KR.eucKR</tt> and <tt>XMODIFIERS=@im=Ami</tt>
3300 <item>for <strong>traditional Chinese</strong>/<strong>xcin</strong>:
3301 <tt>LC_ALL=zh_TW.Big5</tt> and <tt>XMODIFIERS=@im=xcin</tt>
3302 <item>for <strong>simplified Chinese</strong>/<strong>xcin</strong>:
3303 <tt>LC_ALL=zh_CN.GB2312</tt> and <tt>XMODIFIERS=@im=xcin-zh_CN.GB2312</tt>
3304 </list>
3305 </P>
3306
3307 <P>
Then invoke the XIM server. Simply run it in the background
(with &amp;). <strong>kinput2</strong> and <strong>ami</strong>
don't open a new window, while <strong>xcin</strong> opens a new
window and outputs some messages.
3312 </P>
3313
3314 <P>
Then invoke the XIM client and focus on an input area of the program.
Press Shift-Space or Control-Space and type something. Did some strange
characters appear? This document is too brief to explain
how to input valid CJK characters and sentences with these XIM
servers. Please consult the documentation of each XIM server.
3320 </P>
3321
3322 <sect id="input-emacs"><heading>Emacsen</heading>
3323
3324 <P>
<strong>GNU Emacs</strong> and <strong>XEmacs</strong> use
an entirely different model for international input.
3327 </P>
3328
3329 <P>
They supply their own input methods for various languages
and use these instead of relying on the console or XIM.
An input method can be selected with the
<tt>M-x set-input-method</tt> command. The selected input
method can be switched on and off with the <tt>M-x toggle-input-method</tt>
command.
3336 </P>
3337
3338 <P>
3339 GNU Emacs supplies input methods for
3340 British, Catalan,
3341 Chinese (array30, 4corner, b5-quick, cns-quick, cns-tsangchi,
3342 ctlau, ctlaub, ecdict, etzy, punct, punct-b5, py, py-b5,
3343 py-punct, py-punct-b5, qj, qj-b5, sw, tonepy, ziranma, zozy),
3344 Czech, Danish, Devanagari, Esperanto,
3345 Ethiopic, Finnish, French, German, Greek, Hebrew, Icelandic,
3346 IPA, Irish, Italian, Japanese (egg-wnn, skk),
3347 Korean (hangul, hangul3, hanja, hanja3),
3348 Lao, Norwegian, Portuguese, Romanian, Scandinavian,
3349 Slovak, Spanish, Swedish, Thai, Tibetan, Turkish, Vietnamese,
3350 Latin-{1,2,3,4,5},
Cyrillic (beylorussian, jcuken, jis-russian, macedonian,
serbian, translit, translit-bulgarian, ukrainian, yawerty),
3353 and so on.
3354 </P>
3355
3366 <chapt id="internal"><heading>Internal Processing and File I/O</heading>
3367
3368 <P>
There are many text-processing programs, such as
3370 <prgn>grep</prgn>,
3371 <prgn>groff</prgn>,
3372 <prgn>head</prgn>,
3373 <prgn>sort</prgn>,
3374 <prgn>wc</prgn>,
3375 <prgn>uniq</prgn>,
3376 <prgn>nl</prgn>,
3377 <prgn>expand</prgn>,
3378 and so on.
There are also many scripting languages which are often used for
text processing, such as
3381 <prgn>sed</prgn>,
3382 <prgn>awk</prgn>,
3383 <prgn>perl</prgn>,
3384 <prgn>python</prgn>,
3385 <prgn>ruby</prgn>,
3386 and so on.
These programs need to be internationalized.
3388 </P>
3389
3390 <P>
From a user's point of view, a program can use any internal encoding
as long as I/O is done correctly, because the user cannot see
which internal encoding the program uses.
3394 </P>
3395
3396 <P>
There are two candidates for the internal encoding. One is
<strong>wide character</strong> and the other is <strong>UCS-4</strong>.
You can also use a Mule-type encoding, in which a unit consists of a pair
of numbers: one expressing the CCS and one expressing the character.
3401 </P>
3402
3403 <P>
I recommend using <em>wide characters</em>, for the reasons I already
explained in <ref id="locale">: wide characters can be
encoding-independent, can support various encodings in the
world including UTF-8, give users a common, unified way
to choose encodings, and so on.
3409 </P>
3410
3411 <P>
Here are a few examples of handling <tt>wchar_t</tt>.
3413 </P>
3414
3415
3416 <sect id="internal-stream"><heading>Stream I/O of Characters</heading>
3417
3418 <P>
The following program is a small example of stream I/O of wide characters.
<example>
#include &lt;stdio.h&gt;
#include &lt;wchar.h&gt;
#include &lt;locale.h&gt;

int main(void)
{
    wint_t c;

    /* Select the locale (and so the encoding) from the environment. */
    setlocale(LC_ALL, "");
    while (1) {
        c = getwchar();
        if (c == WEOF) break;
        putwchar(c);
    }
    return 0;
}
</example>
I think you can easily imagine a corresponding version using <tt>char</tt>.
Since this program does not do any character manipulation, you could use
ordinary <tt>char</tt> for it.
3439 </P>
3440
3441 <P>
A few points are worth noting. First, never forget to call
<tt>setlocale()</tt>. Second, <tt>putwchar()</tt>,
<tt>getwchar()</tt>, and <tt>WEOF</tt> are the replacements for
<tt>putchar()</tt>, <tt>getchar()</tt>, and <tt>EOF</tt>, respectively.
Finally, use <tt>wint_t</tt> instead of <tt>int</tt> for the return value of <tt>getwchar()</tt>.
3447 </P>
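<P>
One more point may be worth illustrating: <tt>getwchar()</tt> returns
<tt>WEOF</tt> not only at the end of input but also when the input
contains a byte sequence that is invalid in the current encoding.
The following sketch uses <tt>fgetwc()</tt> and <tt>fputwc()</tt>,
the variants that take a stream argument, and tells the two cases
apart with <tt>ferror()</tt>. The function name
<tt>copy_wide_stream()</tt> is made up for this example.
</P>

```c
#include <stdio.h>
#include <wchar.h>

/*
 * Copy wide characters from in to out until the end of input.
 * Returns the number of characters copied, or -1 if a read or write
 * error (including an invalid multibyte sequence) occurred.
 */
long copy_wide_stream(FILE *in, FILE *out)
{
    wint_t c;
    long count = 0;

    while ((c = fgetwc(in)) != WEOF) {
        if (fputwc((wchar_t)c, out) == WEOF)
            return -1;
        count++;
    }
    /* WEOF means either end of file or an error; ferror() tells which. */
    return ferror(in) ? -1 : count;
}
```

<P>
After <tt>setlocale(LC_ALL, "")</tt>, calling
<tt>copy_wide_stream(stdin, stdout)</tt> behaves like the program above,
except that a conversion error can be reported to the user.
</P>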
3448
3449
3450 <sect id="internal-wc"><heading>Character Classification</heading>
3451
3452 <P>
Here is an example of character classification using <tt>wchar_t</tt>.
First, this is a non-internationalized version.
<example>
/*
 * wc.c
 *
 * Word Counter
 *
 */

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;ctype.h&gt;

int main(int argc, char **argv)
{
    int n, d=0, c=0, w=0, l=0;

    while ((n=getchar()) != EOF) {
        c++;
        if (isdigit(n)) d++;
        /* Each whitespace character is counted as the end of a word. */
        if (strchr(" \t\n", n)) w++;
        if (n == '\n') l++;
    }

    printf("%d characters, %d digits, %d words, and %d lines\n",
           c, d, w, l);
    return 0;
}
</example>
3481 Here is the internationalized version.
<example>
/*
 * wc-i.c
 *
 * Word Counter (internationalized version)
 *
 */

#include &lt;stdio.h&gt;
#include &lt;wchar.h&gt;
#include &lt;wctype.h&gt;
#include &lt;locale.h&gt;

int main(int argc, char **argv)
{
    int d=0, c=0, w=0, l=0;
    wint_t n;

    setlocale(LC_ALL, "");

    /* getwchar() returns WEOF, not EOF, at the end of input. */
    while ((n=getwchar()) != WEOF) {
        c++;
        if (iswdigit(n)) d++;
        if (wcschr(L" \t\n", (wchar_t)n)) w++;
        if (n == L'\n') l++;
    }

    printf("%d characters, %d digits, %d words, and %d lines\n",
           c, d, w, l);
    return 0;
}
</example>
3512 </P>
3513
3514 <P>
This example shows that <tt>iswdigit()</tt> is used instead of
<tt>isdigit()</tt> and <tt>wcschr()</tt> instead of <tt>strchr()</tt>.
In addition, the <tt>L"string"</tt> and <tt>L'char'</tt> notations
are used for wide character strings and wide characters.
3518 </P>
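<P>
The word counter above simply counts three fixed whitespace characters.
As a variation, the following sketch lets the locale decide what
whitespace is by using <tt>iswspace()</tt>, and counts runs of
non-whitespace characters as words. The function name
<tt>count_words()</tt> is an assumption of this example and the
counting rule is deliberately simple.
</P>

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Count the words in a wide string.  A word is a maximal run of
 * characters for which iswspace() returns false.
 */
size_t count_words(const wchar_t *s)
{
    size_t words = 0;
    int in_word = 0;

    for (; *s != L'\0'; s++) {
        if (iswspace((wint_t)*s)) {
            in_word = 0;            /* a separator ends the current word */
        } else if (!in_word) {
            in_word = 1;            /* first character of a new word */
            words++;
        }
    }
    return words;
}
```

<P>
Because <tt>iswspace()</tt> consults the current locale, whitespace
characters beyond ASCII may also be recognized as separators,
depending on the locale definition.
</P>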
3519
3520 <sect id="internal-length"><heading>Length of String</heading>
3521
3522 <P>
The following is a sample program to obtain the length of the
input string. Note that the number of bytes and the number of characters
are not distinguished.
<example>
/* length.c
 *
 * a sample program to obtain the length of the input string
 * NOT INTERNATIONALIZED
 */

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

int main(int argc, char **argv)
{
    int len;

    if (argc &lt; 2) {
        printf("Usage: %s [string]\n", argv[0]);
        return 0;
    }

    printf("Your string is: \"%s\".\n", argv[1]);

    len = strlen(argv[1]);
    printf("Length of your string is: %d bytes.\n", len);
    printf("Length of your string is: %d characters.\n", len);
    printf("Width of your string is: %d columns.\n", len);
    return 0;
}
</example>
3554 </P>
3555
3556 <P>
The following is an internationalized version of the program
using wide characters.
<example>
/* length-i.c
 *
 * a sample program to obtain the length of the input string
 * INTERNATIONALIZED
 */

#define _XOPEN_SOURCE 600   /* for wcswidth() */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;string.h&gt;
#include &lt;wchar.h&gt;
#include &lt;locale.h&gt;

int main(int argc, char **argv)
{
    size_t len, n;
    wchar_t *wp;

    /* All software using locale should write this line */
    setlocale(LC_ALL, "");

    if (argc &lt; 2) {
        printf("Usage: %s [string]\n", argv[0]);
        return 0;
    }

    printf("Your string is: \"%s\".\n", argv[1]);

    /* The concept of 'byte' is universal. */
    len = strlen(argv[1]);
    printf("Length of your string is: %d bytes.\n", (int)len);

    /* To obtain number of characters, it is the easiest way */
    /* to convert the string into wide string.  The number of */
    /* characters is equal to the number of wide characters. */
    /* It does not exceed the number of bytes. */
    n = strlen(argv[1]) + 1;   /* wide characters, including L'\0' */
    wp = (wchar_t *)malloc(n * sizeof(wchar_t));
    if (wp == NULL) return 1;
    len = mbstowcs(wp, argv[1], n);
    if (len == (size_t)-1) {
        /* invalid multibyte sequence for the current locale */
        fprintf(stderr, "Cannot convert your string.\n");
        free(wp);
        return 1;
    }
    printf("Length of your string is: %d characters.\n", (int)len);

    printf("Width of your string is: %d columns.\n", wcswidth(wp, len));

    free(wp);
    return 0;
}
</example>
3603 </P>
3604
3605 <P>
This program can count multibyte characters correctly.
Of course the user has to set the <tt>LANG</tt> variable properly.
3608 </P>
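<P>
A typical use of these three different lengths is cutting a string so
that it fits into a fixed number of screen columns. The following
sketch walks through the multibyte string with <tt>mbrtowc()</tt>,
adds up <tt>wcwidth()</tt> for each character, and returns the number
of <em>bytes</em> that may be output. The function name
<tt>bytes_fitting_columns()</tt> is an assumption of this example.
</P>

```c
#define _XOPEN_SOURCE 700   /* for wcwidth() */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Return the number of bytes of the longest prefix of the multibyte
 * string s that fits within max_cols screen columns.
 * Scanning stops early at an invalid or non-printable character.
 */
size_t bytes_fitting_columns(const char *s, int max_cols)
{
    mbstate_t st;
    size_t used = 0, left = strlen(s);
    int cols = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s + used, left, &st);
        int w;

        if (n == (size_t)-1 || n == (size_t)-2)
            break;                      /* invalid or incomplete sequence */
        w = wcwidth(wc);
        if (w < 0 || cols + w > max_cols)
            break;                      /* non-printable, or does not fit */
        cols += w;
        used += n;
        left -= n;
    }
    return used;
}
```

<P>
The returned value can be passed to <tt>fwrite()</tt> or used as the
precision of a <tt>"%.*s"</tt> directive, so that a multibyte character
is never cut in the middle.
</P>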
3609
3610 <P>
3611 For example, on UTF-8 xterm...
3612 <example>
3613 $ export LANG=ko_KR.UTF-8
3614 $ ./length-i (a Hangul character)
3615 Your string is: "(the character)"
3616 Length of your string is: 3 bytes.
3617 Length of your string is: 1 characters.
3618 Width of your string is: 2 columns.
3619 </example>
3620 </P>
3621
3622
3623
3624 <sect id="internal-extract"><heading>Extraction of Characters</heading>
3625
3626 <P>
3627 The following program extracts all characters contained in the given
3628 string.
<example>
/* extract.c
 *
 * a sample program to extract each character contained in the string
 * not internationalized
 */

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

int main(int argc, char **argv)
{
    char *p;
    int c;

    if (argc &lt; 2) {
        printf("Usage: %s [string]\n", argv[0]);
        return 0;
    }

    printf("Your string is: \"%s\".\n", argv[1]);

    c = 0;
    for (p=argv[1] ; *p ; p++) {
        printf("Character #%d is \"%c\".\n", ++c, *p);
    }
    return 0;
}
</example>
Using wide characters, the program can be rewritten as follows.
<example>
/* extract-i.c
 *
 * a sample program to extract each character contained in the string
 * INTERNATIONALIZED
 */

#include &lt;stdio.h&gt;
#include &lt;string.h&gt;
#include &lt;locale.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;limits.h&gt;
#include &lt;wchar.h&gt;

int main(int argc, char **argv)
{
    wchar_t *wp;
    /* MB_LEN_MAX is a constant; MB_CUR_MAX depends on the locale, */
    /* which is not set yet at this point. */
    char p[MB_LEN_MAX+1];
    size_t c, n, len;

    /* Don't forget. */
    setlocale(LC_ALL, "");

    if (argc &lt; 2) {
        printf("Usage: %s [string]\n", argv[0]);
        return 0;
    }

    printf("Your string is: \"%s\".\n", argv[1]);

    /* To obtain each character of the string, it is easy to convert */
    /* the string into wide string and re-convert each of the wide */
    /* string into multibyte characters. */
    n = strlen(argv[1]) + 1;   /* wide characters, including L'\0' */
    wp = (wchar_t *)malloc(n * sizeof(wchar_t));
    if (wp == NULL) return 1;
    len = mbstowcs(wp, argv[1], n);
    if (len == (size_t)-1) {   /* invalid multibyte sequence */
        free(wp);
        return 1;
    }
    for (c=0; c&lt;len; c++) {
        /* re-convert from wide character to multibyte character */
        int x;
        x = wctomb(p, wp[c]);
        if (x &lt; 0) continue;   /* cannot be converted; skip it */
        /* One multibyte character may be two or more bytes. */
        /* Thus "%s" is used instead of "%c". */
        p[x]=0;
        printf("Character #%d is \"%s\" (%d byte(s)) \n", (int)c+1, p, x);
    }

    free(wp);
    return 0;
}
</example>
3706 </P>
3707
3708 <P>
Note that this program doesn't work well if the multibyte encoding
is stateful.
3711 </P>
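<P>
One possible way to lessen this problem is not to convert the wide
characters back at all, but to ask <tt>mbrtowc()</tt> how many bytes
of the original string each character occupies, keeping the conversion
state in an <tt>mbstate_t</tt> object. The following is only a sketch
under that assumption; the function name <tt>split_chars()</tt> is
made up for this example, and I have not tested it with every stateful
encoding.
</P>

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Find the byte length of each character of the multibyte string s.
 * lens[i] receives the number of bytes consumed for character i (for
 * at most max characters).  Returns the number of characters found,
 * or (size_t)-1 if s contains an invalid or incomplete sequence.
 */
size_t split_chars(const char *s, size_t *lens, size_t max)
{
    mbstate_t st;
    size_t left = strlen(s), count = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (left > 0) {
        /* A null first argument asks only for the length. */
        size_t n = mbrtowc(NULL, s, left, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (count < max)
            lens[count] = n;
        count++;
        s += n;
        left -= n;
    }
    return count;
}
```

<P>
The caller can then output the original bytes of each character with
<tt>fwrite()</tt>, so nothing has to be re-encoded.
</P>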
3712
<chapt id="internet"><heading>The Internet</heading>
3722
3723 <P>
The Internet is a worldwide network of computers.
Thus the text data exchanged via the Internet must be
internationalized.
3727 </P>
3728
3729 <P>
The concept of internationalization did not exist
at the dawn of the Internet, since it was developed in the US.
Internationalized protocols were therefore designed later to be
upward-compatible with the existing protocols.
3734 </P>
3735
3736 <P>
One of the key technologies for the internationalization
of Internet data exchange is <strong>MIME</strong>.
3739 </P>
3740
3741 <sect id="mailnews"><heading>Mail/News</heading>
3742
3743 <P>
Internet mail uses the SMTP
(<url id="http://www.faqs.org/rfcs/rfc821.html" name="RFC 821">)
and ESMTP
(<url id="http://www.faqs.org/rfcs/rfc1869.html" name="RFC 1869">)
protocols. SMTP is a 7-bit protocol, while ESMTP allows 8-bit transfer.
3749 </P>
3750
3751 <P>
3752 Original SMTP can only send ASCII characters. Thus
3753 non-ASCII characters (ISO 8859-*, Asian characters, and so on)
3754 have to be converted into ASCII characters.
3755 </P>
3756
3757 <P>
3758 MIME
3759 (<url id="http://www.faqs.org/rfcs/rfc2045.html" name="RFC 2045">,
3760 <url id="http://www.faqs.org/rfcs/rfc2046.html" name="2046">,
3761 <url id="http://www.faqs.org/rfcs/rfc2047.html" name="2047">,
3762 <url id="http://www.faqs.org/rfcs/rfc2048.html" name="2048">, and
3763 <url id="http://www.faqs.org/rfcs/rfc2049.html" name="2049">)
3764 deals with this problem.
3765 </P>
3766
3767 <P>
First,
<url id="http://www.faqs.org/rfcs/rfc2045.html" name="RFC 2045">
defines three new headers.
3771 <list>
3772 <item>MIME-Version:
3773 <item>Content-Type:
3774 <item>Content-Transfer-Encoding:
3775 </list>
3776 Now <tt>MIME-Version</tt> is 1.0 and thus all MIME mails have
3777 a header like this:
3778 <example>
3779 MIME-Version: 1.0
3780 </example>
3781 <tt>Content-Type</tt> describes the type of content.
For example, a typical mail with Japanese text has a header like this:
3783 <example>
3784 Content-Type: text/plain; charset="iso-2022-jp"
3785 </example>
3786 Available types are described in
3787 <url id="http://www.faqs.org/rfcs/rfc2046.html" name="RFC 2046">.
<tt>Content-Transfer-Encoding</tt> describes how the contents
are encoded for transfer. Available values are <tt>BINARY</tt>,
<tt>7bit</tt>, <tt>8bit</tt>, <tt>BASE64</tt>, and <tt>QUOTED-PRINTABLE</tt>.
Since SMTP cannot handle 8-bit data, <tt>8bit</tt> and <tt>BINARY</tt>
cannot be used with it. ESMTP can use them.
Base64 and quoted-printable are ways to convert 8-bit data into 7-bit data,
and 8-bit data has to be converted using either of them to be sent by SMTP.
3795 </P>
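<P>
To show how 8-bit data becomes 7-bit data, here is a minimal sketch of
a base64 encoder. Every three input bytes (24 bits) are written as
four characters of six bits each, and <tt>=</tt> is used as padding
when fewer than three bytes remain. The function name
<tt>base64_encode()</tt> is an assumption of this example, and the
sketch does not insert the line breaks that a real mail body needs.
</P>

```c
#include <stddef.h>

static const char b64_table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/*
 * Encode len bytes from src as base64 into dst and terminate dst
 * with a NUL.  dst must have room for 4 * ((len + 2) / 3) + 1 bytes.
 * Returns the number of characters written (excluding the NUL).
 * No line breaks are inserted.
 */
size_t base64_encode(const unsigned char *src, size_t len, char *dst)
{
    size_t i, o = 0;

    for (i = 0; i + 2 < len; i += 3) {          /* full 3-byte groups */
        dst[o++] = b64_table[src[i] >> 2];
        dst[o++] = b64_table[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
        dst[o++] = b64_table[((src[i + 1] & 0x0f) << 2) | (src[i + 2] >> 6)];
        dst[o++] = b64_table[src[i + 2] & 0x3f];
    }
    if (i < len) {                              /* 1 or 2 bytes remain */
        dst[o++] = b64_table[src[i] >> 2];
        if (i + 1 < len) {
            dst[o++] = b64_table[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
            dst[o++] = b64_table[(src[i + 1] & 0x0f) << 2];
        } else {
            dst[o++] = b64_table[(src[i] & 0x03) << 4];
            dst[o++] = '=';
        }
        dst[o++] = '=';
    }
    dst[o] = '\0';
    return o;
}
```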
3796
3797 <P>
<url id="http://www.faqs.org/rfcs/rfc2046.html" name="RFC 2046">
describes the media type and subtype for the
<tt>Content-Type</tt> header. Available types include
<tt>text</tt>, <tt>image</tt>, <tt>audio</tt>, <tt>video</tt>,
and <tt>application</tt>. Now we are interested in <tt>text</tt>
because we are discussing i18n.
Subtypes of <tt>text</tt> are <tt>plain</tt>, <tt>enriched</tt>,
<tt>html</tt>, and so on. A <tt>charset</tt> parameter can also be
added to specify the encoding.
<tt>US-ASCII</tt>, <tt>ISO-8859-1</tt>,
<tt>ISO-8859-2</tt>, ..., <tt>ISO-8859-10</tt> are defined by
<url id="http://www.faqs.org/rfcs/rfc2046.html" name="RFC 2046">
for <tt>charset</tt>. This list can be extended by writing
a new RFC.
3812 <list>
3813 <item><url id="http://www.faqs.org/rfcs/rfc1468.html" name="RFC 1468">
3814 <tt>ISO-2022-JP</tt>
3815 <item><url id="http://www.faqs.org/rfcs/rfc1554.html" name="RFC 1554">
3816 <tt>ISO-2022-JP-2</tt>
3817 <item><url id="http://www.faqs.org/rfcs/rfc1557.html" name="RFC 1557">
3818 <tt>ISO-2022-KR</tt>
3819 <item><url id="http://www.faqs.org/rfcs/rfc1922.html" name="RFC 1922">
3820 <tt>ISO-2022-CN</tt>
3821 <item><url id="http://www.faqs.org/rfcs/rfc1922.html" name="RFC 1922">
3822 <tt>ISO-2022-CN-EXT</tt>
3823 <item><url id="http://www.faqs.org/rfcs/rfc1842.html" name="RFC 1842">
3824 <tt>HZ-GB-2312</tt>
3825 <item><url id="http://www.faqs.org/rfcs/rfc1641.html" name="RFC 1641">
3826 <tt>UNICODE-1-1</tt>
3827 <item><url id="http://www.faqs.org/rfcs/rfc1642.html" name="RFC 1642">
3828 <tt>UNICODE-1-1-UTF-7</tt>
3829 <item><url id="http://www.faqs.org/rfcs/rfc1815.html" name="RFC 1815">
3830 <tt>ISO-10646-1</tt>
3831 </list>
3832 </P>
3833
3834 <P>
<url id="http://www.faqs.org/rfcs/rfc2045.html" name="RFC 2045">
and
<url id="http://www.faqs.org/rfcs/rfc2046.html" name="RFC 2046">
determine the way to write non-ASCII characters
in the main text of mail. On the other hand,
<url id="http://www.faqs.org/rfcs/rfc2047.html" name="RFC 2047"> describes
'encoded words', which are the way to write non-ASCII characters in the header.
An encoded word looks like this:
<tt>=?</tt><var>encoding</var><tt>?</tt><var>conversion algorithm</var><tt>?</tt><var>data</var><tt>?=</tt>,
where <var>encoding</var> is selected from the list of <tt>charset</tt> values
of the <tt>Content-Type</tt> header, <var>conversion algorithm</var> is <tt>Q</tt>
or <tt>q</tt> for an encoding similar to quoted-printable or <tt>B</tt> or <tt>b</tt> for
base64, and <var>data</var> is the encoded data. A whole encoded word,
including the delimiters, must not be longer than 75 characters. If the text does not fit,
it must be divided into multiple encoded words.
For example,
3850 For example,
3851 <example>
3852 Subject: =?ISO-2022-JP?B?GyRCNEE7eiROJTUlViU4JSclLyVIGyhC?=
3853 </example>
reads 'a subject written in Kanji' in Japanese (ISO-2022-JP,
encoded by base64). Of course, a human cannot read it directly.
3856 </P>
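<P>
As an illustration of the encoded-word format, here is a sketch that
builds one encoded word with the <tt>Q</tt> algorithm. It is
deliberately conservative: only ASCII letters and digits are copied
as they are, a space is written as <tt>_</tt>, and every other byte is
written as <tt>=</tt> followed by two hexadecimal digits. The function
name <tt>q_encode_word()</tt> is an assumption of this example, and
the sketch refuses the input instead of splitting it when the
75-character limit would be exceeded.
</P>

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Build one RFC 2047 encoded-word with the "Q" encoding.
 * Returns 0 on success, or -1 if the result would not fit into out
 * or might exceed the 75-character limit of one encoded-word (the
 * length check inside the loop is conservative).
 */
int q_encode_word(const char *charset, const unsigned char *data,
                  size_t len, char *out, size_t outsize)
{
    char buf[76];
    size_t i, o;
    int n;

    n = snprintf(buf, sizeof buf, "=?%s?Q?", charset);
    if (n < 0 || (size_t)n >= sizeof buf)
        return -1;
    o = (size_t)n;

    for (i = 0; i < len; i++) {
        unsigned char ch = data[i];

        if (o + 3 + 2 > 75)             /* worst case "=XX" plus "?=" */
            return -1;
        if (ch < 0x80 && isalnum(ch))
            buf[o++] = (char)ch;
        else if (ch == ' ')
            buf[o++] = '_';
        else
            o += (size_t)sprintf(buf + o, "=%02X", (unsigned)ch);
    }
    if (o + 2 > 75 || o + 3 > outsize)
        return -1;
    memcpy(buf + o, "?=", 3);           /* copies the NUL as well */
    memcpy(out, buf, o + 3);
    return 0;
}
```

<P>
The caller has to pass data which is already in the encoding named by
<tt>charset</tt>. This function does not convert between encodings.
</P>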
3857
3858
3859 <sect id="www"><heading>WWW</heading>
3860
3861 <P>
The WWW is a system in which HTML documents (mainly, along with files in other formats)
are transferred using the HTTP protocol.
3864 </P>
3865
3866 <P>
The HTTP protocol is defined by
<url id="http://www.faqs.org/rfcs/rfc2068.html" name="RFC 2068">.
HTTP uses headers as mail does, and the <tt>Content-Type</tt> header
is used to describe the type of the contents.
Though a <tt>charset</tt> parameter can be given in the
header, it is rarely used.
3873 </P>
3874
3875 <P>
<url id="http://www.faqs.org/rfcs/rfc1866.html" name="RFC 1866">
states that the default encoding for HTML is
ISO-8859-1. However, many web pages are written in,
for example, Japanese and Korean using (of course) encodings
different from ISO-8859-1.
Sometimes the HTML document contains a declaration such as:
<example>
&lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp"&gt;
</example>
which declares that the page is written in ISO-2022-JP.
However, there are many pages without any declaration of encoding.
3887 </P>
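<P>
A program which receives such a <tt>Content-Type</tt> value, from a
mail header, an HTTP header, or a <tt>META</tt> element, has to pick
out the <tt>charset</tt> parameter. The following is a rough sketch;
the function name <tt>parse_charset()</tt> is an assumption of this
example, and a real parser would have to follow the header syntax
more strictly (for instance, this one would also accept a parameter
whose name merely ends with <tt>charset</tt>).
</P>

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Extract the charset parameter from a Content-Type value such as
 *   text/html; charset="iso-2022-jp"
 * The parameter name is matched case-insensitively and optional
 * double quotes are removed.  Returns 0 on success, or -1 if there
 * is no charset parameter or it does not fit into out.
 */
int parse_charset(const char *value, char *out, size_t outsize)
{
    static const char key[] = "charset=";
    const size_t keylen = sizeof key - 1;
    const char *p;
    size_t i, n = 0;

    if (outsize == 0)
        return -1;

    for (p = value; *p != '\0'; p++) {
        for (i = 0; i < keylen; i++)
            if (tolower((unsigned char)p[i]) != key[i])
                break;                  /* also stops at the final NUL */
        if (i == keylen)
            break;                      /* found "charset=" at p */
    }
    if (*p == '\0')
        return -1;

    p += keylen;
    if (*p == '"')
        p++;
    while (*p != '\0' && *p != '"' && *p != ';' &&
           !isspace((unsigned char)*p)) {
        if (n + 1 >= outsize)
            return -1;
        out[n++] = *p++;
    }
    out[n] = '\0';
    return n > 0 ? 0 : -1;
}
```

<P>
When the function returns -1, the program has to fall back on a
default or guess the encoding from the data.
</P>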
3888
3889 <P>
Web browsers have to deal with this situation.
Ideally, web browsers should be able to handle every
encoding registered for MIME.
However, many web browsers can only handle ASCII
or ISO-8859-1. Such web browsers are of no use at all
to people who need other encodings.
3896 </P>
3897
3898 <P>
A URL should be written in ASCII characters,
though non-ASCII characters can be expressed
using a <tt>%</tt><var>nn</var> sequence, where <var>nn</var>
is a hexadecimal value. This is because there is
no way to specify the encoding. Western European people
would treat it as ISO-8859-1, while Japanese people
would treat it as EUC-JP or Shift_JIS.
3906 </P>
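<P>
The <tt>%</tt><var>nn</var> notation works on bytes, not on characters,
which is why the encoding matters. The following sketch escapes a byte
string; the function name <tt>percent_encode()</tt> and the choice of
characters left unescaped (ASCII letters, digits, and <tt>-._~</tt>,
the set called 'unreserved' in RFC 3986) are assumptions of this
example.
</P>

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Percent-encode len bytes from src into dst.
 * ASCII letters, digits and "-._~" are copied; every other byte is
 * written as %XX.  Returns 0 on success or -1 if dst is too small.
 */
int percent_encode(const unsigned char *src, size_t len,
                   char *dst, size_t dstsize)
{
    size_t i, o = 0;

    for (i = 0; i < len; i++) {
        unsigned char ch = src[i];
        int safe = ch < 0x80 &&
                   (isalnum(ch) || ch == '-' || ch == '.' ||
                    ch == '_' || ch == '~');
        size_t need = safe ? 1 : 3;

        if (o + need + 1 > dstsize)     /* keep room for the NUL */
            return -1;
        if (safe)
            dst[o++] = (char)ch;
        else
            o += (size_t)sprintf(dst + o, "%%%02X", (unsigned)ch);
    }
    if (o + 1 > dstsize)
        return -1;
    dst[o] = '\0';
    return 0;
}
```

<P>
For example, the Hiragana letter A is the bytes 0xE3 0x81 0x82 in UTF-8
but 0xA4 0xA2 in EUC-JP, so the same character gives
<tt>%E3%81%82</tt> or <tt>%A4%A2</tt> depending on the encoding
the sender used, and the receiver cannot tell which one was meant.
</P>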
3907
3924 <chapt id="library"><heading>Libraries and Components</heading>
3925
3926
3927 <P>
We sometimes use libraries and components which are not
very popular. We may have to pay special attention to the
internationalization of these libraries and components.
</P>

<P>
On the other hand, we can use libraries and components
to improve internationalization. This chapter
introduces such libraries and components.
3937 </P>
3938
3939 <sect id="gettext"><heading>Gettext and Translation</heading>
3940
3941 <P>
GNU gettext is a tool for internationalizing the messages that a program outputs,
according to the <tt>LC_MESSAGES</tt> locale category.
A <prgn>gettext</prgn>ized program contains messages written in
various languages (depending on the available translators), and
a user can choose among them using environment variables.
GNU gettext is a part of the Debian system.
3948 </P>
3949
3950 <P>
3951 Install <package>gettext</package> package and read info pages for details.
3952 </P>
3953
3954 <P>
Don't use non-ASCII characters in '<tt>msgid</tt>'.
Be careful, because it is tempting to use ISO-8859-1 characters.
For example, '&copy;' (copyright mark; you may not be able to
read the copyright mark NOW in THIS document) is a non-ASCII character
(0xa9 in ISO-8859-1).
If <tt>msgid</tt> contains such characters, translators may find it difficult to edit catalog files
because of the conflict between the encodings of <tt>msgid</tt> and
<tt>msgstr</tt>.
3963 </P>
3964
3965 <P>
Be sure the message can be displayed in the assumed environment.
In other words, you have to read the chapter 'Output to Display'
in this document and internationalize the output mechanism
of your software before <prgn>gettext</prgn>izing it.
<em>EVEN FOR NON-ENGLISH-SPEAKING PEOPLE, ENGLISH MESSAGES ARE PREFERABLE
TO MEANINGLESS BROKEN MESSAGES.</em>
3972 </P>
3973
3974 <P>
The 2nd (3rd, ...) byte of a multibyte character, or
any byte of a non-ASCII character in a stateful encoding,
can be 0x5c (the same as backslash in ASCII) or 0x22
(the same as double quote in ASCII).
These bytes have to be properly escaped because
the present version of GNU gettext doesn't take into account the
'charset' subitem of the '<tt>Content-Type</tt>' item for '<tt>msgstr</tt>'.
3982 </P>
3983
3984 <P>
A <prgn>gettext</prgn>ed message must not be used in multiple contexts,
because a word may have different meanings in different contexts.
For example, in English a verb expresses an order or a command if it appears
at the beginning of a sentence. However, different languages
have different grammars. If a verb is <prgn>gettext</prgn>ed and used
both in an ordinary sentence and in an imperative sentence,
one cannot translate it.
3992 </P>
3993
3994
3995 <P>
If a sentence is <prgn>gettext</prgn>ized, never divide the sentence.
If a sentence is divided in the original source code,
join the pieces so that a single string contains the full
sentence.
This is because the order of words in a sentence
differs among languages.
For example, a routine
<example>
printf("There ");
switch(num_of_files) {
case 0:
    printf("are no files ");
    break;
case 1:
    printf("is 1 file ");
    break;
default:
    printf("are %d files ", num_of_files);
    break;
}
printf("in %s directory.\n", dir_name);
</example>
has to be rewritten like this:
<example>
switch(num_of_files) {
case 0:
    printf("There are no files in %s directory.\n", dir_name);
    break;
case 1:
    printf("There is 1 file in %s directory.\n", dir_name);
    break;
default:
    printf("There are %d files in %s directory.\n", num_of_files, dir_name);
    break;
}
</example>
before it is <prgn>gettext</prgn>ized.
4033 </P>
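<P>
Even with full sentences, the distinction between '1 file' and
'3 files' is an English rule. Other languages have different numbers
of plural forms. GNU gettext supplies <tt>ngettext()</tt> for this
purpose. It chooses a plural form according to the message catalog of
the current locale. The following sketch assumes a
<tt>libintl.h</tt> which declares <tt>ngettext()</tt> (GNU libc does).
The function name <tt>format_file_count()</tt> is made up for this
example.
</P>

```c
#include <libintl.h>
#include <stdio.h>

/*
 * Format a file-count message into buf, letting ngettext() choose the
 * plural form.  Without a message catalog, ngettext() returns the
 * first English string when the number is 1 and the second otherwise.
 */
int format_file_count(char *buf, size_t size,
                      unsigned long num_of_files, const char *dir_name)
{
    const char *fmt = ngettext("There is %lu file in %s directory.",
                               "There are %lu files in %s directory.",
                               num_of_files);

    return snprintf(buf, size, fmt, num_of_files, dir_name);
}
```

<P>
On systems where gettext is not a part of the C library, the program
may have to be linked with <tt>-lintl</tt>.
</P>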
4034
4035 <P>
Software with <prgn>gettext</prgn>ed messages should not depend on
the length of the messages. The messages may get longer
in a different language.
4039 </P>
4040
4041 <P>
When two or more '%' directives for formatted output functions
such as <tt>printf()</tt> appear in a message,
the translation may need to change the order of these '%' directives.
In such a case, the translator can specify
the order.
See the section 'Special Comments preceding Keywords'
in the info page of <prgn>gettext</prgn> for details.
4049 </P>
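<P>
The mechanism behind this is the positional notation of
<tt>printf()</tt> directives defined by POSIX:
<tt>%1$d</tt> means 'the first argument, formatted as <tt>%d</tt>',
<tt>%2$s</tt> means 'the second argument, formatted as <tt>%s</tt>'.
The following sketch shows that the same arguments can be output
in either order only by changing the format string, which is what a
translator does in <tt>msgstr</tt>. The function name
<tt>format_reordered()</tt> is an assumption of this example.
</P>

```c
#include <stdio.h>

/*
 * Format a message whose argument order is chosen by the format string.
 * fmt must use positional directives: %1$d is num_of_files and
 * %2$s is dir_name, in whatever order the translation needs.
 */
int format_reordered(char *buf, size_t size, const char *fmt,
                     int num_of_files, const char *dir_name)
{
    return snprintf(buf, size, fmt, num_of_files, dir_name);
}
```

<P>
Positional and ordinary directives must not be mixed within one
format string.
</P>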
4050
4051 <P>
There are now projects to translate the messages of various programs,
for example the
<url id="http://www.iro.umontreal.ca/~pinard/po/HTML/"
name="Translation Project">.
4056 </P>
4057
4058
4059
4060 <sect1 id="gettextize"><heading>Gettext-ization of A Software</heading>
4061
4062 <P>
First, the program has to contain the following lines.
<example>
int main(int argc, char **argv)
{
    ...
    setlocale (LC_ALL, "");   /* This is not for gettext but
                                 all i18n software should have
                                 this line. */
    bindtextdomain (PACKAGE, LOCALEDIR);
    textdomain (PACKAGE);
    ...
}
</example>
where <var>PACKAGE</var> is the name of the catalog file and
<var>LOCALEDIR</var> is <tt>"/usr/share/locale"</tt> for Debian.
<var>PACKAGE</var> and <var>LOCALEDIR</var> should be defined
in a header file or <tt>Makefile</tt>.
4080 </P>
4081
4082 <P>
4083 It is convenient to prepare the following header file.
4084 <example>
4085 #include &lt;libintl.h&gt;
4086 #define _(String) gettext((String))
4087 </example>
4088 and messages in source files should be written as
4089 <tt>_("message")</tt>, instead of <tt>"message"</tt>.
4090 </P>
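<P>
The pieces above can be put together as in the following sketch.
The values of <var>PACKAGE</var> and <var>LOCALEDIR</var>, the
<tt>weekday_names</tt> table, and the function names are all
hypothetical. The sketch also uses the <tt>N_()</tt> macro, a common
convention for strings in static initializers, where
<tt>gettext()</tt> cannot be called: <tt>N_()</tt> only marks the
string for extraction, and the translation is looked up later with
<tt>_()</tt>.
</P>

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical values; a real package defines these in its build. */
#define PACKAGE   "myprog"
#define LOCALEDIR "/usr/share/locale"

#define _(String)  gettext(String)
#define N_(String) (String)     /* mark only; translate at use */

/* Static initializers cannot call gettext(), so only mark them. */
static const char *const weekday_names[] = {
    N_("Sunday"), N_("Monday"), N_("Tuesday")
};

/* Call this once at the beginning of main(). */
void init_i18n(void)
{
    setlocale(LC_ALL, "");
    bindtextdomain(PACKAGE, LOCALEDIR);
    textdomain(PACKAGE);
}

/* Translate at the point of use, after the locale has been set up. */
const char *weekday_name(int i)
{
    return _(weekday_names[i]);
}
```

<P>
If no catalog for <var>PACKAGE</var> is installed, <tt>gettext()</tt>
returns its argument unchanged, so the program still outputs the
original messages.
</P>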
4091
4092 <P>
4093 Next, catalog files have to be prepared.
4094 </P>
4095
4096 <P>
First, a template for the catalog file is prepared
using <prgn>xgettext</prgn>.
By default, the template is written to a file named <tt>messages.po</tt>.
4101 <footnote>
4102 I HAVE TO WRITE EXPLANATION.
4103 </footnote>
4104 </P>
4105
4106
4107
4108 <sect1 id="gettext-translate"><heading>Translation</heading>
4109
4110 <P>
Though <prgn>gettext</prgn>ization of a program is a one-time
task, translation is ongoing work, because you have to
translate new (or modified) messages when (or before) a new
version of the program is released.
4115 </P>
4116
4117
4118 <sect id="readline"><heading>Readline Library</heading>
4119
4120 <P>***** Not written yet *****</P>
4121
4122 <P>
The readline library needs to be internationalized.
4124 </P>
4125
4126 <sect id="ncurses"><heading>Ncurses Library</heading>
4127
4128 <P>***** Not written yet *****</P>
4129
4130 <P>
<strong>Ncurses</strong> is a free implementation of the curses library.
Though this library is now maintained by the Free Software Foundation,
it is not covered by the GNU General Public License.
</P>

<P>
The ncurses library needs to be internationalized.
4138 </P>
4139
4147 <chapt id="otherlanguage"><heading>Softwares Written in Other than C/C++</heading>
4148
4149 <P>
4150 Though C and C++ were, are, and will be the main languages for
4151 software development on UNIX-like platforms, other languages,
4152 especially scripting languages, are often used.
4153 </P>
4154
4155 <P>
4156 Generally, languages other than C/C++ have less support for I18N
4157 than C/C++. However, nowadays languages other than C/C++ are
4158 coming to support locales and Unicode.
4159 </P>
4160
4161 <sect id="fortran"><heading>Fortran</heading>
4162
4163 <P>***** Not written yet *****</P>
4164
4165 <sect id="pascal"><heading>Pascal</heading>
4166
4167 <P>***** Not written yet *****</P>
4168
4169 <sect id="perl"><heading>Perl</heading>
4170
4171 <P>
4172 Perl is one of the most important languages. Indeed,
4173 the Debian system marks Perl as essential.
4174 </P>
4175
4176 <P>
4177 Perl 5.6 can handle UTF-8 characters. Declaring
4178 <tt>use utf8;</tt> enables this. For example,
4179 <tt>length()</tt> then returns the number of characters,
4180 not the number of bytes.
4181 </P>
4182
4183 <P>
4184 However, it does not work well for me... why?
4185 </P>
4186
4187 <P>***** Not written yet *****</P>
4188
4189 <sect id="python"><heading>Python</heading>
4190
4191 <P>***** Not written yet *****</P>
4192
4193 <sect id="ruby"><heading>Ruby</heading>
4194
4195 <P>***** Not written yet *****</P>
4196
4197 <sect id="tcltk"><heading>Tcl/Tk</heading>
4198
4199 <P>***** Not written yet *****</P>
4200
4201 <P>
4202 Tcl/Tk is already internationalized. It is locale-sensitive.
4203 It automatically uses a proper font for various characters.
4204 Though it uses UTF-8 as its internal encoding, users of Tcl/Tk
4205 don't have to be aware of it. This is because Tcl/Tk converts
4206 encodings.
4207 </P>
4208
4209 <sect id="java"><heading>Java</heading>
4210
4211 <p>
4212 Full internationalization follows naturally from
4213 Java's "Write Once, Run Anywhere" principle.
4214 To achieve this, Java uses Unicode as the internal code
4215 for <tt>char</tt> and <tt>String</tt>. It is important
4216 that Unicode is the <em>internal</em> code. Java obeys
4217 the current locale, and the encoding is automatically
4218 converted for I/O. Thus, <em>users</em> of applications written
4219 in Java don't need to be aware of Unicode.
4220 </p>
4221
4222 <p>
4223 Then how about <em>developers</em>? They also don't need
4224 to be aware of the internal encoding. Character processing
4225 such as counting the number of characters in a string works well.
4226 Moreover, you don't have to worry about display/input.
4227 </p>
4228
4229 <p>
4230 However, you may want to handle a specific encoding for,
4231 for example, MIME encoding/decoding. For such purposes,
4232 I/O can be done by specifying an external encoding.
4233 Check the <tt>InputStreamReader</tt> and <tt>OutputStreamWriter</tt>
4234 classes. You can also convert between the internal encoding
4235 and a specified encoding with
4236 <tt>String.getBytes(</tt><em>encoding</em><tt>)</tt> and
4237 <tt>String(byte []</tt> <em>bytes</em><tt>, </tt><em>encoding</em><tt>)</tt>.
4238 </p>
4239
4240
4241
4242
4243 <sect id="shellscript"><heading>Shell Script</heading>
4244
4245 <P>***** Not written yet *****</P>
4246
4247 <sect id="lisp"><heading>Lisp</heading>
4248
4249 <P>***** Not written yet *****</P>
4250
4251
4252
4253
4254
4255
4256
4257
4258
4259
4260
4261 <chapt id="examples"><heading>Examples of I18N</heading>
4262
4263 <P>
4264 Programmers who have internationalized software, have
4265 written an L10N patch, and so on are encouraged to contribute
4266 to this chapter.
4267 </P>
4268
4269
4270
4271 &twm;
4272 &minicom;
4273 &user-ja;
4274 &fontset;
4275
4276
4277
4278
4279
4280
4281
4282
4283 <chapt id="reference"><heading>References</heading>
4284
4285 <P>
4286 General
4287 <list>
4288 <item>
4289 <url id="http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT"
4290 name="Unicode support in the Solaris Operating Environment">
4291 shows what is needed for software developers to support UTF-8.
4292 <item>
4293 <url id="http://www.unix-systems.org/version2/whatsnew/login_mse.html"
4294 name="The Open Group's summary of ISO C Amendment 1">
4295 is a detailed explanation of locale and wide character technologies.
4296 <item>
4297 <url id="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
4298 name="Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux">
4299 is a detailed explanation of UTF-8 and Unicode.
4300 <item>
4301 <url id="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html"
4302 name="Bruno Haible's Unicode HOWTO">
4303 <item>
4304 Tomohiro KUBOTA (original author of this Introduction to I18N),
4305 <url id="http://www8.plala.or.jp/tkubota1/mojibake/"
4306 name="What is MOJIBAKE"> shows what occurs when character handling
4307 is improper. Mojibake is a Japanese word which almost all computer
4308 users (not only Linux/BSD/Unix but also Windows/Macintosh) know.
4309 <item>
4310 Ken Lunde, "CJKV Information Processing", ISBN 1-56592-224-7,
4311 O'Reilly, 1999
4312 <item>
4313 Mikiko NISHIKIMI, Naoto TAKAHASHI, Satoru TOMURA, Ken'ichi HANDA,
4314 Seiji KUWARI, Shin'ichi MUKAIGAWA, and Tomoko YOSHIDA,
4315 "<url id="http://web.kyoto-inet.or.jp/people/tomoko-y/biwa/multi/"
4316 name="MARUCHIRINGARU KANKYOU NO JITSUGEN - X Window/Wnn/Mule/WWW BURAUZA
4317 DENO TAKOKUGO KANKYO">" or "Realization of Multilingual Environment
4318 - Multilingual Environment in X Window/Wnn/Mule/WWW Browser"
4319 (in Japanese), ISBN4-88735-020-1, TOPPAN, 1996
4320 <item>
4321 Yoshihiro KIYOKANE and Youichi SUEHIRO,
4322 "<url id="http://www.geocities.co.jp/SiliconValley-PaloAlto/8090/"
4323 name="KOKUSAIKA PUROGURAMINGU - I18N HANDOBUKKU">" or "Internationalization
4324 Programming - I18N Handbook" (in Japanese), ISBN4-320-02904-6,
4325 KYORITSU, 1998
4326 <item>
4327 Syuuji SADO and Tomoko YOSHIDA,
4328 "<url id="http://web.kyoto-inet.or.jp/people/tomoko-y/japanese/index.html"
4329 name="Linux/FreeBSD NIHONGO KANKYOU NO KOUCHIKU TO KATSUYOU">" or
4330 "Construction and Utilization of Linux/FreeBSD Japanese Environment"
4331 (in Japanese), ISBN4-7973-0480-4, SOFTBANK, 1997
4332 <item>
4333 Kouichi YASUOKA and Motoko YASUOKA
4334 "<url id="http://www.dendai.ac.jp/press/book_da/ISBN4-501-53060-X.html"
4335 name="MOJI KOODO NO SEKAI">" or "The World of Character Codes" (in Japanese),
4336 ISBN4-501-53060-X, Tokyo Denki University Press Center, 1999
4337 </list>
4338 </P>
4339
4340 <P>
4341 Characters (general)
4342 <list>
4343 <item>
4344 <url id="http://www.kudpc.kyoto-u.ac.jp/~yasuoka/CJK.html"
4345 name="Character Tables">
4346 Graphic images for various character sets in the world.
4347 <item>
4348 <url id="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf"
4349 name="Ken Lunde's CJK info">
4350 information on CJK (Chinese, Japanese, and Korean) character
4351 set standards, written by the writer of "CJKV Information Processing"
4352 published by O'Reilly.
4353 <item>
4354 <url id="http://www.isi.edu/in-notes/iana/assignments/character-sets"
4355 name="IANA character set registry">
4356 Note that both coded character sets (for example, KS_C_5601-1987,
4357 MIBenum: 36) and encodings (for example, ISO-2022-KR, MIBenum: 37)
4358 are registered. How confusing!
4359 <item>
4360 <url id="http://www.itscj.ipsj.or.jp/ISO-IR/"
4361 name="International Register of Coded Character Sets">
4362 A complete list of registered CCS, with ISO 2022 escape sequences.
4363 PDF files for these CCS are also available.
4364 </list>
4365 Characters (ISO 8859)
4366 <list>
4367 <item>
4368 <url id="http://czyborra.com/charsets/iso8859.html"
4369 name="ISO 8859 Alphabet Soup">
4370 </list>
4371 Characters (ISO 2022)
4372 <list>
4373 <item>
4374 <url id="http://www.ecma.ch/ecma1/stand/ECMA-035.HTM">
4375 </list>
4376 Characters (ISO 10646 and Unicode)
4377 <list>
4378 <item><url id="http://www.unicode.org/" name="Unicode Consortium">
4379 <item>
4380 <url id="http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html"
4381 name="Problems and Solutions for Unicode and User/Vendor Defined
4382 Characters">
4383 </list>
4384 </P>
4385
4386 <P>
4387 Software
4388 <list>
4389 <item>
4390 <url id="http://www.wg.omron.co.jp/~shin/Arena-CJK-doc/"
4391 name="Arena-i18n">
4392 Multilingual web browser.
4393 <item>
4394 <url id="http://www.mozilla.org/" name="Mozilla">
4395 is also a multilingual web browser.
4396 <item>
4397 <url id="http://www.m17n.org/mule/" name="Mule">
4398 Multilingual editor whose function is included in GNU Emacs 20
4399 and XEmacs 20.
4400 Mule is the most advanced m17n software to my knowledge.
4401 <item>
4402 <url id="http://www3.justnet.ne.jp/~nmasu/linux/jfbterm/indexn.html"
4403 name="JFBTERM"> (in Japanese) is a multilingual terminal for
4404 Linux framebuffer console. Supported encodings are ISO 2022, EUC-JP,
4405 CN-GB, and EUC-KR. Supported CCS are ISO 8859-{1,2,3,4,5,6,7,8,9,10},
4406 JISX 0201, JISX 0208, GB 2312, and KSX 1001.
4407 <item>
4408 <url id="http://www.gnu.org/directory/UNICON.html"
4409 name="UNICON Project"> aims to implement display and input of
4410 CJK (Chinese/Japanese/Korean) characters on the Linux
4411 framebuffer.
4412 <item>
4413 <url id="http://programmer.lib.sjtu.edu.cn/cce/cce.html"
4414 name="CCE - Chinese Console Environment"> enables CN-GB Chinese
4415 to be displayed on Linux and FreeBSD console. It also supplies
4416 input methods for Chinese.
4417 <item>
4418 <url id="http://dickey.his.com/xterm/"
4419 name="Xterm"> is a part of the XFree86 distribution. It can display
4420 UTF-8 encoding including doublewidth characters and combining
4421 characters.
4422 <item>
4423 <url id="http://www.rxvt.org/"
4424 name="Rxvt"> can display multibyte encodings such as EUC-JP,
4425 Shift-JIS, CN-GB, and Big-5.
4426 <item>
4427 <url id="http://www.gnu.org/software/libiconv/"
4428 name="libiconv"> provides
4429 <tt>iconv()</tt> implementation for systems which don't have one.
4430 It supports various encodings like ASCII, ISO 8859-*, KOI8-*,
4431 EUC-*, ISO 2022-*, Big5, Shift-JIS, TIS 620, UTF-*, UCS-*,
4432 CP*, Mac*, and so on. This library also has <tt>locale_charset()</tt>,
4433 a replacement for <tt>nl_langinfo(CODESET)</tt>.
4434 <item>
4435 <url id="http://clisp.cons.org/~haible/packages-libutf8.html"
4436 name="libutf8 - a Unicode/UTF-8 locale plugin"> provides
4437 UTF-8 locale support for systems which don't have UTF-8 locales.
4438 <item>
4439 <url id="http://www.pango.org/" name="Pango"> is a project to
4440 develop a portable high-quality text rendering engine.
4441 </list>
4442 </P>
4443
4444 <P>
4445 Projects and Organizations
4446 <list>
4447 <item>
4448 <url id="http://www.li18nux.org/"
4449 name="Linux Internationalization Initiative">, or Li18nux,
4450 focuses on the i18n of a core set of APIs and components of Linux
4451 distributions. The results will be proposed to LSB.
4452 <item>
4453 <url id="http://www.li18nux.org/li18nux2k/"
4454 name="LI18NUX 2000 Globalization Specification"> is the first
4455 fruit of Li18nux.
4458 <item>
4459 <url id="http://citrus.bsdclub.org/"
4460 name="Citrus Project"> is a project to implement
4461 locale/iconv for BSD series OSes so that these OSes conform to
4462 ISO C / SUSv2.
4463 <item>
4464 <url id="http://www.iro.umontreal.ca/~pinard/po/HTML/"
4465 name="Translation Project">
4466 <item>
4467 <url id="http://www.mojikyo.org/" name="Mojikyo">
4468 <item>
4469 <url id="http://www.tron.org/index-e.html" name="TRON project">
4470 </list>
4471 </P>
4472
4473
4474
4475
4476
4477 </book>
4478 </debiandoc>
