Contents of /manuals/trunk/intro-i18n/intro-i18n.sgml

Revision 857
Mon Oct 11 03:53:06 1999 UTC (13 years, 7 months ago) by kubota
File MIME type: text/x-sgml
File size: 59628 byte(s)
Added alternative definition of l10n (introduction).
Introduced nkf and tcs to convert character sets (japanese-japan).
A little change in section of gettext.
A note is added in section of locale.
A chapter on Spanish is added.
More detailed explanation on Japanese codesets (japanese-japan).
Added id for all sections.
How to setup Japanese-inputing environment (japanese-japan).
1 <!doctype debiandoc public "-//DebianDoc//DTD DebianDoc//EN"
2 [
3 <!entity % languages system "languages.ents"> %languages;
4 <!entity % examples system "examples.ents"> %examples;
5 ]>
6 <debiandoc>
7 <book>
8
9
10 <titlepag>
11 <title>Introduction to i18n</title>
12 <author>
13 <name>Tomohiro KUBOTA</name>
14 <email>kubota@debian.or.jp</email>
15 </author>
16 <version><date></version>
17 <abstract>
18 This document is an introduction to i18n (internationalization)
19 for programmers and package maintainers.
20 </abstract>
21 <copyright>
22 <copyrightsummary>
23 Copyright &copy; 1999 Tomohiro KUBOTA.
24 For chapters and sections whose original author is not KUBOTA,
25 the authors of them have copyright. Their names are written
26 at the top of the chapter or the section.
27 </copyrightsummary>
28 <p>
29 This manual is free software; you may redistribute it and/or modify it
30 under the terms of the GNU General Public License as published by the
31 Free Software Foundation; either version 2, or (at your option) any
32 later version.
33 </p>
34 <p>
35 This is distributed in the hope that it will be useful, but
36 <em>without any warranty</em>; without even the implied warranty of
37 merchantability or fitness for a particular purpose. See the GNU
38 General Public License for more details.
39 </p>
40 <p>
41 A copy of the GNU General Public License is available as
42 <tt>/usr/share/common-licenses/GPL</tt> in the Debian GNU/Linux
43 distribution or on the World Wide Web at
44 <url id="http://www.gnu.org/copyleft/gpl.html" name="&urlname">.
45 You can also obtain it by writing to the Free
46 Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
47 02111-1307, USA.
48 </p>
49 </copyright>
50 </titlepag>
51
52 <toc detail="sect1">
53
54 <chapt id="scope"><heading>About This Document</heading>
55
56 <sect><heading>Scope</heading>
57
58 <P>
59 This document describes the basic ideas of I18N for
60 programmers and package maintainers of Debian GNU/Linux.
61 Its aim is to offer an introduction to the basic concepts,
62 to character codes, and to the points that need care
63 when one writes I18N-ed software or an I18N patch
64 for existing software. This document
65 also tries to describe the current state and existing
66 problems for each language and country.
67 </P>
68
69 <P>
70 This document does not describe the details of programming,
71 except for the last chapter where instances of I18N are
72 collected.
73 </P>
74
75 <P>
76 Minimum requirements are stressed in this document, for example,
77 that characters should be displayed with a proper font (at least users
78 of the software must be able to guess what is written),
79 that it must be possible to input characters from the keyboard, and
80 that software must not destroy characters.
81 I am trying to describe how to satisfy
82 these requirements.
83 </P>
84
85 <P>
86 Though this document is strongly related to programming
87 languages such as C and standardized I18N methods such as
88 <prgn>gettext</prgn> and LOCALE, this document does not supply a
89 detailed explanation of them.
90 </P>
91
92 <sect id="newversion"><heading>New Versions of This Document</heading>
93
94 <P>
95 The current version of this document is available
96 at
97 <url id="http://www.debian.org/~elphick/ddp/"
98 name="DDP (Debian Documentation Project)"> page.
99 </P>
100
101 <sect id="feedback"><heading>Feedback and Contributions</heading>
102
103 <P>
104 This document needs contributions, especially for the
105 chapter on each language (<ref id="languages">)
106 and the chapter on instances of I18N (<ref id="examples">).
107 These chapters consist of contributions.
108 </P>
109
110 <P>
111 Otherwise, this will be a mere document only on Japanization,
112 because the original author Tomohiro KUBOTA
113 (<email>kubota@debian.or.jp</email>)
114 speaks Japanese and lives in Japan.
115 </P>
116
117 <P>
118 <ref id="spanish"> is written by
119 Eusebio C Rufian-Zilbermann <email>eusebio@acm.org</email>.
120 </P>
121
122 <P>
123 Discussions are held on the <tt>debian-devel@lists.debian.org</tt> mailing list.
124 (Might <tt>debian-doc</tt> or <tt>debian-i18n</tt> be more suitable?)
125 </P>
126
127 <chapt id="intro"><heading>Introduction</heading>
128
129 <P>
130 A Debian system includes much software. Though many programs
131 can process, output, and input text data, some
132 of them assume that text is written in English (ASCII).
133 For people who use non-English languages these programs are
134 hardly usable.
135 </P>
136
137 <P>
138 So far, people who use non-English languages have given up
139 and accepted computers as they are. However, we should discard
140 this wrong idea now. It is nonsense that a person who
141 wants to use a computer has to learn English in advance.
142 </P>
143
144 <P>
145 There are a few approaches by which software can handle
146 non-English languages. What we need to do first is to understand
147 the differences between these approaches and to choose one
148 approach for each case.
149 </P>
150
151 <P>
152 <taglist>
153 <tag>a. <strong>L10N</strong> (localization)</tag>
154 <item><p>
155 This approach is to support two languages or character sets,
156 English (ASCII) and one other specific one. An example is
157 Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual Emacs).
158 Since every programmer has his/her own mother tongue,
159 there are numerous L10N patches and L10N programs,
160 each written to satisfy its author's own needs.
161 </p></item>
162 <tag>b. <strong>I18N</strong> (internationalization)</tag>
163 <item><p>
164 This approach is to support many languages but only two
165 of them, English (ASCII) and one other, at the same time.
166 One has to specify the other language by the <tt>LANG</tt>
167 environment variable or a similar mechanism.
168 LOCALE in C and <prgn>gettext</prgn> are categorized as I18N.
169 </p></item>
170 <tag>c. <strong>M17N</strong> (multilingualization)</tag>
171 <item><p>
172 This approach is to support many languages at the same time.
173 For example, Mule (MULtilingual Enhancement to GNU Emacs)
174 can handle a text file which contains multiple languages,
175 for example, a paper on the differences between Korean and Chinese
176 whose main text is written in Finnish. GNU Emacs 20 and
177 XEmacs now include Mule.
178 </p></item>
179 </taglist>
180 </P>
181
182 <P>
183 Generally speaking, the I18N approach is better than L10N, and M17N is better than I18N.
184 In other words, text-processing software which can handle
185 many languages at the same time is 'better' than software which can handle
186 only two (English and one other) languages.
187 </P>
188
189 <P>
190 Sometimes 'localization' means preparing language (or culture)-specific
191 data for already i18n-ed software. For example, translation of
192 <prgn>gettext</prgn>ed messages is 'localization' in this sense.
193 Almost all commercial software seems to adopt this approach:
194 first the original (or US) version is released, and after a certain
195 period localized versions are released.
196 However, in this document, such 'localization' is included in
197 'internationalization' because the two share technical
198 topics and because such 'localization' has the same limitation
199 as 'internationalization', as described below.
200 <footnote>
201 Another reason is that such 'localization' only yields many localized
202 versions, instead of a single internationalized version. This
203 means labor is dispersed. Since our labor is limited, it is more
204 efficient to concentrate on a single properly internationalized
205 version, whose users in a specific language only have to set the
206 <tt>LANG</tt> environment variable to make the software work properly.
207 Also, an 'internationalized distribution' is our goal.
208 </footnote>
209 Instead, 'localization' in this document means what I described above,
210 because there is much localized software which adopts 'localization'
211 in my sense, and Debian developers and package maintainers are
212 interested in unifying these localized programs.
213 </P>
214
215 <P>
216 Now let me classify the approaches to supporting non-English languages
217 from another viewpoint.
218 </P>
219
220 <P>
221 <taglist>
222 <tag>A. Implementation <em>without</em> Knowledge on Each Language</tag>
223 <item><p>
224 This approach is possible by utilizing standardized methods supplied
225 by the kernel or libraries, such as LOCALE, <tt>wchar_t</tt>, and
226 <prgn>gettext</prgn>.
227 The advantages of this approach are (1) that when the kernel or
228 libraries are upgraded the software may support additional languages
229 and (2) that programmers need not know each language.
230 The disadvantage is that there are categories or fields where
231 a standardized method is not available. So far, standardized
232 methods such as LOCALE and <prgn>gettext</prgn> are available in the field of I18N,
233 and no standards are established for the M17N approach.
234 Furthermore, there is no standard for the number of columns a
235 character occupies, nor for methods of inputting non-English
236 languages on the console (that is, an interface to an input library).
237 </p></item>
238
239 <tag>B. Implementation Using Knowledge on Each Language</tag>
240 <item><p>
241 This approach is to directly implement information about
242 each language based on the knowledge of programmers and
243 contributors. L10N almost always uses this approach.
244 The advantage of this approach is that a detailed and strict
245 implementation is possible beyond the fields where
246 standardized methods are available. Language-specific
247 problems can be perfectly solved (of course it depends on
248 the skill of the programmer). The disadvantages are
249 (1) that the number of supported languages is restricted
250 by the skill or the interest of the programmers or the
251 contributors and (2) that labor which should be united and
252 concentrated on upgrading the kernel or libraries is dispersed
253 into many programs, that is, reinventing the wheel.
254 However, a majestic M17N program such as Mule can be
255 built by strongly pursuing this approach.
256 </p></item>
257 </taglist>
258 </P>
259
260 <P>
261 Using this classification, let me consider L10N, I18N and M17N
262 from the programmer's point of view.
263 </P>
264
265 <P>
266 L10N can be realized using only the programmer's own knowledge of his/her
267 language. For example, all you have to do is implement
268 your knowledge of the SHIFT-JIS coding system. Since the motivation
269 of L10N is usually to satisfy the programmer's own need, extensibility
270 to a third language is often ignored. Thus approach B, not A, is
271 taken.
272 Though L10N-ed software is basically useful for people who
273 speak the same language as the programmer, it is sometimes
274 useful for other people whose coding system is similar to
275 the programmer's. For example, software which
276 doesn't recognize EUC-JP but doesn't break EUC-JP does not
277 break EUC-KR either.
278 </P>
279
280 <P>
281 The main part of I18N is, in the case of a C program, achieved using
282 standardized methods such as LOCALE, <tt>wchar_t</tt>,
283 and <prgn>gettext</prgn>.
284 The LOCALE approach is classified as I18N because functions
285 related to LOCALE change their behavior according to a parameter
286 to <tt>setlocale()</tt> or environment variables such as <tt>LANG</tt>.
287 Namely, approach A is emphasized for I18N. For fields where
288 standardized methods are not available, however, approach B
289 cannot be avoided. Even in such a case, the interface and
290 the support for each language should be designed to be separate
291 so that support for new languages can be added easily.
292 </P>
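<P>
As a rough illustration of approach A, the following Python sketch
uses Python's wrappers around the same two mechanisms, the locale and
<prgn>gettext</prgn>. The program names no language itself. The
message domain <tt>intro-i18n-demo</tt> and the catalog directory are
placeholders chosen for this example only, so the lookup is expected
to fall back to the untranslated message.
</P>

```python
import gettext
import locale

# Approach A: the program asks the standard library to pick up the
# user's settings instead of hard-coding knowledge of any language.
try:
    # "" means "use LANG / LC_* from the environment".
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    # The requested locale is not installed; fall back to the "C" locale.
    locale.setlocale(locale.LC_ALL, "C")

# Placeholder domain and directory (assumptions of this sketch).
# With fallback=True a missing catalog yields the original message.
catalog = gettext.translation("intro-i18n-demo",
                              localedir="/usr/share/locale",
                              fallback=True)
_ = catalog.gettext

greeting = _("Hello, world")
print(greeting)
```

<P>
Adding a language then means installing a message catalog, not
changing the program.
</P>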
293
294 <P>
295 Unfortunately there are no standardized methods for M17N so far.
296 Exceptions are the ISO-2022-INT-* and UNICODE codesets, which can
297 express many languages at the same time. However, ISO-2022-INT-*
298 is stateful and thus implementation may be difficult, while
299 UNICODE lacks compatibility with East Asian standards
300 and has many variants itself (UCS-* and UTF-*), though
301 they can easily be converted into one another. Of course M17N-ed
302 software cannot be written with an M17N-ed codeset alone.
303 Thus approach B cannot be avoided for M17N so far.
304 Efforts for standardization in various fields for M17N should
305 be made. Mule is the only software which has achieved M17N.
306 </P>
307
308 <P>
309 This document is focused on I18N. Note that I18N-ed software
310 cannot process a text file which contains three or more languages,
311 for example, Finnish, Chinese, and Korean (a paper written in
312 Finnish, on comparison of Chinese and Korean). M17N is needed
313 for such a case. Don't forget that the true goal is M17N and
314 I18N is a compromise.
315 </P>
316
317 <P>
318 For people using non-Latin letters, I18N does not include
319 messages written in their languages nor file names written
320 in their languages. Yes, it is true that these should be achieved.
321 However, considering our current state, we can say these
322 requirements are too much of a luxury. What we truly need is,
323 for example, that characters in our languages are displayed
324 using a correct font without destroying the screen, that
325 a way to input our characters is supplied, and
326 that our languages can be input correctly. It would be
327 fine if text-processing software such as <prgn>perl</prgn>
328 and <prgn>grep</prgn> processed our languages correctly.
329 </P>
330
331 <P>
332 Given the circumstances in which we stand, the author
333 concentrates on the problems whose solution is truly needed rather
334 than on right and ideal I18N/M17N.
335 In other words, the focus of this document is on the way
336 characters should be displayed, input, and processed without
337 destroying them, not on the time-display format, currency symbol,
338 and so on.
339 </P>
340
341 <chapt id="coding"><heading>Character Coding Systems</heading>
342
343 <P>
344 Here major character sets and codesets are introduced.
345 The last section of this chapter contains information
346 on each language. Contributions for this section for
347 many languages are especially welcome, though contributions
348 for the whole text are of course welcome.
349 </P>
350
351 <sect id="coding-general"><heading>General Discussions</heading>
352
353 <sect1 id="codeset"><heading>Character / Character Set / Coded Character Set (or Codeset)</heading>
354
355 <P>
356 A <strong>CHARACTER CODE</strong> is a set of bit combinations used to
357 represent characters in computers. To define a character code,
358 one first has to decide the set of characters to be encoded.
359 </P>
360
361 <P>
362 This set of characters is called a <strong>CHARACTER SET</strong>.
363 There are many standards for character sets in the world.
364 For example, JIS X 0208 contains the main characters used in Japanese.
365 Usually, a character set is not only a collection of characters;
366 each character in the set also has its own number.
367 Usually the numbering is done so that the set
368 is consistent with international standards.
369 For example, many 7bit local character sets are identical to
370 one of the ISO 646-* codesets. Many other local character sets are
371 related to ISO 8859 or ISO 2022.
372 </P>
373
374 <P>
375 Then one selects a character set or multiple character sets and
376 assigns codes to the characters included in the character set(s).
377 This way of assigning codes is called <strong>ENCODING</strong>.
378 The set of encoded characters is called a <strong>CODED CHARACTER SET</strong>
379 or <strong>CODESET</strong>.
380 For example, the ISO-2022-JP <strong>codeset</strong> contains
381 the <strong>character set</strong>s ASCII, JIS X 0201 Roman,
382 and JIS X 0208 Kanji.
383 Encoding for a codeset including multiple character sets
384 is usually done in two stages, first within each character
385 set and then for the combination of character sets.
386 </P>
387
388 <P>
389 For a codeset including only one character set, we don't
390 have to distinguish 'character set' and 'codeset'.
391 For example, ASCII is a character set and a codeset at the
392 same time.
393 </P>
394
395 <sect1 id="stateful"><heading>Stateless and Stateful</heading>
396
397 <P>
398 For a codeset including multiple character sets, one has
399 to decide how to combine these character sets when encoding.
400 There are two ways to do that. One is to give all characters
401 in all the character sets unique codes. The other is to
402 allow characters from different character sets to have the same
403 code and to use a code such as an escape sequence to switch the
404 <strong>SHIFT STATE</strong>, that is, to select one character set.
405 </P>
406
407 <P>
408 A codeset with shift state is called <strong>STATEFUL</strong> and
409 one without shift state is called <strong>STATELESS</strong>.
410 </P>
411
412 <P>
413 Generally, stateful codesets can contain more characters than
414 stateless ones. However, implementing a stateful codeset
415 is much more difficult than implementing a stateless one.
416 </P>
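<P>
The difference can be sketched with Python's bundled codecs, taking
ISO-2022-JP as a stateful codeset and EUC-JP as a stateless one. The
choice of these two codesets and of the sample letter is only an
illustration.
</P>

```python
# The same two bytes mean different things depending on the shift state.
raw = b'$"'

# In the initial state (ASCII selected) they are '$' and '"'.
as_ascii = raw.decode("iso2022_jp")

# After the escape sequence ESC $ B (JIS X 0208 selected) the same two
# bytes are one Hiragana letter; ESC ( B switches back to ASCII.
ESC = b"\x1b"
as_kanji = (ESC + b"$B" + raw + ESC + b"(B").decode("iso2022_jp")

# A stateless codeset such as EUC-JP gives every character its own code,
# so no escape sequence is needed.
as_euc = "\u3042".encode("euc_jp")

print(repr(as_ascii), repr(as_kanji), as_euc.hex())
```

<P>
A program that cuts a stateful string in the middle loses the shift
state, which is one reason why such codesets are harder to implement.
</P>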
417
418 <sect1 id="number"><heading>Number of Bytes, Number of Characters, and Number of Columns</heading>
419
420 <P>
421 One ASCII character is always expressed by one byte
422 and occupies one column on the console or in a fixed-width font for X.
423 One must not make such an assumption in I18N programming
424 and has to clearly distinguish the numbers of bytes, characters,
425 and columns.
426 </P>
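<P>
A small Python sketch of the three counts follows. The column
estimate is a heuristic assumed for this example: it relies on the
East Asian Width property from the Unicode character database and
ignores combining characters.
</P>

```python
import unicodedata

def columns(text):
    """Estimate terminal columns: wide and fullwidth characters take two."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in text)

text = "abc\u65e5\u672c\u8a9e"      # "abc" followed by three Kanji
n_chars = len(text)
n_bytes_euc = len(text.encode("euc_jp"))
n_bytes_utf8 = len(text.encode("utf-8"))
n_cols = columns(text)
print(n_chars, n_bytes_euc, n_bytes_utf8, n_cols)
```

<P>
The number of bytes depends on the codeset, while the numbers of
characters and columns do not.
</P>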
427
428 <sect id="standards"><heading>Standards for Character Codes</heading>
429
430 <sect1 id="ascii"><heading>ASCII and ISO 646</heading>
431
432 <P>
433 <strong>ASCII</strong> is a character set and also a codeset at the same time.
434 ASCII is 7bit and contains 94 printable characters which are
435 encoded in the region of 0x21-0x7e.
436 </P>
437
438 <P>
439 <strong>ISO 646</strong> is the international counterpart of ASCII. The following
440 12 characters of
441 <list>
442 <item>0x23 (number),
443 <item>0x24 (dollar),
444 <item>0x40 (at),
445 <item>0x5b (left square bracket),
446 <item>0x5c (backslash),
447 <item>0x5d (right square bracket),
448 <item>0x5e (caret),
449 <item>0x60 (backquote),
450 <item>0x7b (left curly brace),
451 <item>0x7c (vertical line),
452 <item>0x7d (right curly brace), and
453 <item>0x7e (tilde)
454 </list>
455 are called <strong>IRV</strong> (International Reference Version)
456 and the other 82 (94 - 12 = 82) characters are called
457 <strong>BCT</strong> (Basic Code Table).
458 Characters at IRV positions can differ between countries.
459 For example, the UK version of ISO 646 has the pound currency
460 symbol at 0x23 and the Japanese version has the yen currency
461 symbol at 0x5c. The US version of ISO 646 is the same as ASCII.
462 </P>
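<P>
The effect of national variants can be sketched as follows. The table
and the function name are hypothetical and list only the two
replacements mentioned above, not the complete national versions.
</P>

```python
# Hypothetical table: only the positions named in the text are listed.
NATIONAL_VARIANTS = {
    "US": {},                      # identical to ASCII
    "UK": {0x23: "\u00a3"},        # pound sign replaces '#'
    "JP": {0x5c: "\u00a5"},        # yen sign replaces backslash
}

def decode_iso646(data, variant):
    """Decode 7bit bytes, applying the overrides of one national variant."""
    overrides = NATIONAL_VARIANTS[variant]
    return "".join(overrides.get(b, chr(b)) for b in data)

path = b"C:\\dir #1"
us_text = decode_iso646(path, "US")
uk_text = decode_iso646(path, "UK")
jp_text = decode_iso646(path, "JP")
print(us_text, uk_text, jp_text)
```

<P>
The bytes are identical in all three cases; only their interpretation
changes.
</P>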
463
464 <P>
465 As far as I know, all codesets in the world contain
466 the ISO 646 character set.
467 </P>
468
469 <P>
470 Codes 0x00 - 0x1f and 0x7f are control characters; 0x20 is the space character.
471 </P>
472
473 <P>
474 Nowadays the use of codesets incompatible with ASCII is discouraged
475 and thus ISO 646-* should not be used. One reason is that
476 when a string is converted into Unicode, the converter doesn't
477 know whether IRVs should be converted into characters with the same shapes
478 or characters with the same codes. Another reason is that source code
479 is written in ASCII. Source code must be readable anywhere.
480 </P>
481
482
483
484 <sect1 id="iso8859"><heading>ISO 8859</heading>
485
486 <P>
487 <strong>ISO 8859</strong> is an extension of ASCII using all 8 bits.
488 An additional 96 printable characters encoded in 0xa0 - 0xff are
489 available besides the 94 ASCII printable characters.
490 </P>
491
492 <P>
493 There are 10 variants of ISO 8859 (in 1997).
494 <taglist>
495 <tag>ISO-8859-1 Latin alphabet No.1 (1987)</tag>
496 <item>characters for western European languages
497 <tag>ISO-8859-2 Latin alphabet No.2 (1987)</tag>
498 <item>characters for central European languages
499 <tag>ISO-8859-3 Latin alphabet No.3 (1988)</tag>
500 <tag>ISO-8859-4 Latin alphabet No.4 (1988)</tag>
501 <item>characters for northern European languages
502 <tag>ISO-8859-5 Latin/Cyrillic alphabet (1988)</tag>
503 <tag>ISO-8859-6 Latin/Arabic alphabet (1987)</tag>
504 <tag>ISO-8859-7 Latin/Greek alphabet (1987)</tag>
505 <tag>ISO-8859-8 Latin/Hebrew alphabet (1988)</tag>
506 <tag>ISO-8859-9 Latin alphabet No.5 (1989)</tag>
507 <item>same as ISO-8859-1 except for Turkish instead of Icelandic
508 <tag>ISO-8859-10 Latin alphabet No.6 (1993)</tag>
509 <item>Adds Inuit (Greenlandic) and Sami (Lappish) letters to ISO-8859-4
510 </taglist>
511 </P>
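<P>
Because every variant reuses the region 0xa0 - 0xff, a byte has no
meaning until the variant is known. The following Python sketch uses
the codecs bundled with Python; the sample byte is an arbitrary choice.
</P>

```python
# One byte, three different characters.
byte = b"\xe9"
latin1 = byte.decode("iso8859_1")     # Latin small letter e with acute
cyrillic = byte.decode("iso8859_5")   # a Cyrillic letter
greek = byte.decode("iso8859_7")      # a Greek letter
print(latin1, cyrillic, greek)

# Positions where ISO-8859-9 differs from ISO-8859-1.
differing = [b for b in range(0xa0, 0x100)
             if bytes([b]).decode("iso8859_1") != bytes([b]).decode("iso8859_9")]
print([hex(b) for b in differing])
```

<P>
The second part lists the few positions where ISO-8859-9 replaces
Icelandic letters with Turkish ones.
</P>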
512
513 <P>
514 A detailed explanation is found at
515 <url id="http://park.kiev.ua/mutliling/ml-docs/iso-8859.html" name="&urlname">.
516 </P>
517
518
519 <sect1 id="iso-2022"><heading>ISO 2022</heading>
520
521 <P>
522 <strong>ISO 2022</strong> is a very powerful codeset in which multiple
523 character sets, both 1byte and multibyte, can be
524 expressed at the same time. It is stateful.
525 There are many codesets which are subsets of ISO 2022, for example,
526 ISO-2022-JP, EUC, and compound-text. ISO-2022-*
527 is widely used for mail/news. EUC has several variants,
528 for example, EUC-JP and EUC-KR, and is widely used on
529 UNIX(-like) systems. Compound-text is the standard
530 codeset for the X Window System.
531 </P>
532
533 <P>
534 The sixth edition of ECMA-35 is fully identical with
535 ISO 2022:1994 and you can find the official document
536 at <url id="http://www.ecma.ch/stand/ECMA-035.HTM" name="&urlname">.
537 </P>
538
539 <P>
540 ISO 2022 has two versions, 7bit and 8bit. The
541 8bit version is explained first. The 7bit version is a subset
542 of the 8bit version.
543 </P>
544
545 <P>
546 The 8bit code space is divided into four regions,
547 <list>
548 <item>0x00 - 0x1f: C0 (Control Characters 0),
549 <item>0x20 - 0x7f: GL (Graphic Characters Left),
550 <item>0x80 - 0x9f: C1 (Control Characters 1), and
551 <item>0xa0 - 0xff: GR (Graphic Characters Right).
552 </list>
553 </P>
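<P>
The division can be written down directly. The function name below is
made up for this sketch.
</P>

```python
def region(byte):
    """Name the ISO 2022 8bit code-space region a byte value falls in."""
    if not 0 <= byte <= 0xff:
        raise ValueError("not a byte value")
    if byte <= 0x1f:
        return "C0"
    if byte <= 0x7f:
        return "GL"
    if byte <= 0x9f:
        return "C1"
    return "GR"

# ESC, 'A', 0x8e and 0xa4 fall into the four regions in turn.
regions = [region(b) for b in b"\x1bA\x8e\xa4"]
print(regions)
```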
554
555 <P>
556 GL and GR are the spaces into which (printable) character sets are mapped.
557 </P>
558
559 <P>
560 Next, all character sets, for example, ASCII, ISO 646-UK,
561 and JIS X 0208, are classified into following four categories,
562 <list>
563 <item>(1) character set with 1-byte 94-character,
564 <item>(2) character set with 1-byte 96-character,
565 <item>(3) character set with multibyte 94-character, and
566 <item>(4) character set with multibyte 96-character.
567 </list>
568 </P>
569
570 <P>
571 Characters in 94-character sets are mapped
572 into 0x21 - 0x7e. Characters in 96-character sets are
573 mapped into 0x20 - 0x7f.
574 </P>
575
576 <P>
577 For example, ASCII, ISO 646-UK, and JIS X 0201 Katakana
578 are classified into (1), JIS X 0208 Japanese Kanji,
579 KS C 5601 Korean, GB 2312-80 Chinese are classified into (3),
580 and ISO 8859-* are classified to (2).
581 </P>
582
583 <P>
584 The mechanism to map these character sets into GL and GR is
585 a bit complex. There are four buffers, G0, G1, G2, and G3.
586 A character set is <strong>designated</strong> into one of these buffers
587 and then a buffer is <strong>invoked</strong> into GL or GR.
588 </P>
589
590 <P>
591 Control sequences to 'designate' a character set into a
592 buffer are determined as below.
593 </P>
594
595 <P>
596 <list>
597 <item>A sequence to designate a character set with 1-byte 94-character
598 <list>
599 <item>into G0 set is: ESC 0x28 F,
600 <item>into G1 set is: ESC 0x29 F,
601 <item>into G2 set is: ESC 0x2a F, and
602 <item>into G3 set is: ESC 0x2b F.
603 </list>
604 <item>A sequence to designate a character set with 1-byte 96-character
605 <list>
606 <item>into G1 set is: ESC 0x2d F,
607 <item>into G2 set is: ESC 0x2e F, and
608 <item>into G3 set is: ESC 0x2f F.
609 </list>
610 <item>A sequence to designate a character set with multibyte 94-character
611 <list>
612 <item>into G0 set is: ESC 0x24 0x28 F
613 (exception: 'ESC 0x24 F' for F = 0x40, 0x41, 0x42.),
614 <item>into G1 set is: ESC 0x24 0x29 F,
615 <item>into G2 set is: ESC 0x24 0x2a F, and
616 <item>into G3 set is: ESC 0x24 0x2b F.
617 </list>
618 <item>A sequence to designate a character set with multibyte 96-character
619 <list>
620 <item>into G1 set is: ESC 0x24 0x2d F,
621 <item>into G2 set is: ESC 0x24 0x2e F, and
622 <item>into G3 set is: ESC 0x24 0x2f F.
623 </list>
624 </list>
625 where 'F' is determined for each character set:
626 <list>
627 <item>character set with 1-byte 94-character
628 <list>
629 <item>F=0x40 for ISO 646 IRV: 1983
630 <item>F=0x41 for BS 4730 (UK)
631 <item>F=0x42 for ANSI X3.4-1968 (ASCII)
632 <item>F=0x43 for NATS Primary Set for Finland and Sweden
633 <item>F=0x49 for JIS X 0201 Katakana
634 <item>F=0x4a for JIS X 0201 Roman (Latin)
635 <item>and more
636 </list>
637 <item>character set with 1-byte 96-character
638 <list>
639 <item>F=0x41 for ISO 8859-1 Latin-1
640 <item>F=0x42 for ISO 8859-2 Latin-2
641 <item>F=0x43 for ISO 8859-3 Latin-3
642 <item>F=0x44 for ISO 8859-4 Latin-4
643 <item>F=0x46 for ISO 8859-7 Latin/Greek
644 <item>F=0x47 for ISO 8859-6 Latin/Arabic
645 <item>F=0x48 for ISO 8859-8 Latin/Hebrew
646 <item>F=0x4c for ISO 8859-5 Latin/Cyrillic
647 <item>and more
648 </list>
649 <item>character set with multibyte 94-character
650 <list>
651 <item>F=0x40 for JIS X 0208-1978 Japanese
652 <item>F=0x41 for GB 2312-80 Chinese
653 <item>F=0x42 for JIS X 0208-1983 Japanese
654 <item>F=0x43 for KS C 5601 Korean
655 <item>F=0x44 for JIS X 0212-1990 Japanese
656 <item>F=0x45 for CCITT Extended GB (ISO-IR-165)
657 <item>F=0x47 for CNS 11643-1992 Set 1 (Taiwan)
658 <item>F=0x48 for CNS 11643-1992 Set 2 (Taiwan)
659 <item>F=0x49 for CNS 11643-1992 Set 3 (Taiwan)
660 <item>F=0x4a for CNS 11643-1992 Set 4 (Taiwan)
661 <item>F=0x4b for CNS 11643-1992 Set 5 (Taiwan)
662 <item>F=0x4c for CNS 11643-1992 Set 6 (Taiwan)
663 <item>F=0x4d for CNS 11643-1992 Set 7 (Taiwan)
664 <item>and more
665 </list>
666 </list>
667 <footnote>
668 WHERE CAN I FIND THE COMPLETE AND AUTHORITATIVE TABLE OF THIS?
669 </footnote>
670 </P>
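<P>
The rules above are regular enough to be generated by a short
function. This Python sketch is only an illustration; the function
name and its parameters are invented here, and the intermediate bytes
are copied from the lists above.
</P>

```python
ESC = 0x1b
# Intermediate bytes per buffer number, taken from the lists above.
INTERMEDIATE = {
    94: {0: 0x28, 1: 0x29, 2: 0x2a, 3: 0x2b},
    96: {1: 0x2d, 2: 0x2e, 3: 0x2f},
}

def designate(buffer, final, size=94, multibyte=False):
    """Build the sequence designating a character set into G0..G3."""
    if size not in INTERMEDIATE:
        raise ValueError("size must be 94 or 96")
    table = INTERMEDIATE[size]
    if buffer not in table:
        raise ValueError("this kind of set cannot go into G%d" % buffer)
    seq = [ESC]
    if multibyte:
        seq.append(0x24)
        # Exception: the intermediate byte is dropped for these three sets.
        if size == 94 and buffer == 0 and final in (0x40, 0x41, 0x42):
            return bytes(seq + [final])
    seq.append(table[buffer])
    seq.append(final)
    return bytes(seq)

ascii_g0 = designate(0, 0x42)                    # ASCII into G0
jisx0208_g0 = designate(0, 0x42, multibyte=True) # JIS X 0208-1983 into G0
ksc5601_g1 = designate(1, 0x43, multibyte=True)  # KS C 5601 into G1
latin1_g2 = designate(2, 0x41, size=96)          # ISO 8859-1 into G2
print(ascii_g0, jisx0208_g0, ksc5601_g1, latin1_g2)
```

<P>
The first two results are the escape sequences that appear in
ISO-2022-JP text.
</P>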
671
672 <P>
673 Control codes to 'invoke' one of G{0123} into GL or GR
674 are defined as below.
675 <list>
676 <item>A control code to invoke G0 into GL is: (L)SI ((Locking) Shift In)
677 <item>A control code to invoke G1 into GL is: (L)SO ((Locking) Shift Out)
678 <item>A control code to invoke G2 into GL is: LS2 (Locking Shift 2)
679 <item>A control code to invoke G3 into GL is: LS3 (Locking Shift 3)
680 <item>A control code to invoke one character
681 in G2 into GL is: SS2 (Single Shift 2)
682 <item>A control code to invoke one character
683 in G3 into GL is: SS3 (Single Shift 3)
684 <item>A control code to invoke G1 into GR is: LS1R (Locking Shift 1 Right)
685 <item>A control code to invoke G2 into GR is: LS2R (Locking Shift 2 Right)
686 <item>A control code to invoke G3 into GR is: LS3R (Locking Shift 3 Right)
687 </list>
688 <footnote>
689 WHAT IS THE VALUE OF THESE CONTROL CODES?
690 </footnote>
691 </P>
692
693 <P>
694 Note that a character code in a character set invoked into GR is
695 or-ed with 0x80.
696 </P>
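<P>
A minimal sketch of this rule follows. The helper name is
hypothetical, and the comparison with EUC-JP assumes the codec bundled
with Python.
</P>

```python
def invoke_into_gr(code):
    """Set the high bit of each byte of a GL-form character code."""
    return bytes(b | 0x80 for b in code)

# The Hiragana letter U+3042 is 0x24 0x22 in JIS X 0208.
gl_form = b"\x24\x22"
gr_form = invoke_into_gr(gl_form)
print(gr_form.hex())

# EUC-JP keeps JIS X 0208 in GR, so its codec gives the same bytes.
same_as_euc = gr_form == "\u3042".encode("euc_jp")
print(same_as_euc)
```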
697
698 <P>
699 ISO 2022 also defines <strong>announcer</strong> codes. For example,
700 'ESC 0x20 0x41' means 'Only the G0 buffer is used. G0 is already
701 invoked into GL'. This simplifies the coding system. Even this
702 announcer can be omitted if the people who exchange data agree.
703 </P>
704
705 <P>
706 7bit version of ISO 2022 is a subset of 8bit version. It does not
707 use C1 and GR.
708 </P>
709
710 <P>
711 Explanation on C0 and C1 is omitted here.
712 </P>
713
714
715 <sect2 id="compound"><heading>Compound Text</heading>
716
717 <P>
718 <strong>Compound Text</strong> is a subset of ISO 2022
719 which X clients use to communicate with each other,
720 for example, for copy-paste.
721 </P>
722
723 <P>
724 Compound Text is stateful.
725 <footnote>
726 I HAVE TO WRITE EXPLANATION.
727 </footnote>
728 </P>
729
730
731
732 <sect2 id="euc"><heading>EUC (Extended Unix Code)</heading>
733
734 <P>
735 <strong>EUC</strong> is a subset of the 8bit version of ISO 2022 except for the
736 usage of the SS2 and SS3 codes. Though these codes
737 invoke G2 and G3 into GL in ISO 2022, they invoke them
738 into GR in EUC.
739 This is not a specific codeset but a way to generate a new codeset,
740 for example, EUC-Japanese and EUC-Korean.
741 </P>
742
743 <P>
744 EUC is stateless.
745 </P>
746
747 <P>
748 EUC can contain 4 character sets by using G0, G1, G2, and G3
749 with specific character sets designated.
750 Though there is no requirement that ASCII is designated to G0,
751 I don't know any EUC codeset in which ASCII is not designated to G0.
752 </P>
753
754 <P>
755 For EUC with G0-ASCII, all characters other than ASCII are encoded
756 in 0x80 - 0xff, so the codeset is upward compatible with ASCII.
757 </P>
758
759 <P>
760 Expressions for characters in G0, G1, G2, and G3 character sets
761 are described below in binary:
762 <list>
763 <item>G0: 0???????
764 <item>G1: 1??????? [1??????? [...]]
765 <item>G2: SS2 1??????? [1??????? [...]]
766 <item>G3: SS3 1??????? [1??????? [...]]
767 </list>
768 </P>
769
770 <P>
771 where SS2 is 0x8e and SS3 is 0x8f.
772 </P>
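<P>
Since EUC is stateless, a byte stream can be split into characters by
looking only at the first byte of each character. The following
sketch assumes the EUC-JP layout (G1 is two bytes, G2 is one byte
after SS2, G3 is two bytes after SS3) and does no validation of
malformed input. The function name is invented for this example.
</P>

```python
SS2, SS3 = 0x8e, 0x8f

def split_euc_jp(data):
    """Split EUC-JP bytes into (buffer name, bytes) pairs.

    Assumes well-formed input in the EUC-JP layout described above.
    """
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            name, size = "G0", 1
        elif b == SS2:
            name, size = "G2", 2    # SS2 plus one byte
        elif b == SS3:
            name, size = "G3", 3    # SS3 plus two bytes
        else:
            name, size = "G1", 2
        out.append((name, data[i:i + size]))
        i += size
    return out

# One character each from ASCII, JIS X 0208, JIS X 0201 Katakana,
# and JIS X 0212.
sample = "A\u3042\uff71\u4e02".encode("euc_jp")
pieces = split_euc_jp(sample)
for name, chunk in pieces:
    print(name, chunk.hex())
```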
773
774
775
776 <sect1 id="unicodes"><heading>ISO/IEC 10646 (UCS-4, UCS-2), UNICODE, UTF-8, UTF-16</heading>
777
778 <P>
779 These codesets are intended to express all characters in the
780 world in a united character set.
781 </P>
782
783 <P>
784 In this document UCS-4 and UCS-2 are regarded as character sets
785 and also codesets. The others are codesets using UCS-4 or its
786 subset as a character set.
787 </P>
788
789
790 <sect2 id="unicode"><heading>Unicode 2.1</heading>
791
792 <P>
793 <strong>Unicode</strong> is a codeset which is designed to be able to express
794 all characters in the world, like ISO 2022.
795 </P>
796
797 <P>
798 Unicode is a stateless codeset, unlike ISO 2022.
799 Since all characters are 16 bits long (or a multiple of 16 bits
800 for combining characters and surrogate pairs), Unicode is not
801 upward-compatible with ASCII, though the characters at 0x0021 - 0x007e
802 of Unicode are the same as 0x21 - 0x7e of ASCII.
803 </P>
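<P>
The incompatibility is visible at the byte level. In this sketch
Python's <tt>utf-16-be</tt> codec stands in for the plain 16bit form,
which is an assumption that holds for characters without surrogate
pairs.
</P>

```python
text = "Az"
as_ascii = text.encode("ascii")
as_16bit = text.encode("utf-16-be")   # 16 bits per character, big-endian

print(as_ascii.hex(), as_16bit.hex())

# The 16bit form embeds zero bytes, which byte-oriented C string
# functions would treat as terminators.
has_zero_byte = b"\x00" in as_16bit
print(has_zero_byte)
```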
804
805 <P>
806 Unicode as a codeset includes one character set,
807 a subset (plane 0 - 16) of UCS-4. UCS-4 is explained later.
808 </P>
809
810 <P>
811 Unicode (without surrogate pairs) is the same as UCS-2 (explained later).
812 </P>
813
814 <P>
815 Unicode has three remarkable features: Han Unification,
816 Combining Characters, and Surrogate Pairs.
817 </P>
818
819
820 <sect3 id="unihan"><heading>Han Unification</heading>
821
822 <P>
823 This is the point on which Unicode is criticized most strongly
824 among Japanese (and also among Korean and Chinese, I suppose) people.
825 </P>
826
827 <P>
828 A region of 0x4e00 - 0x9fff in UCS-2 is used for Japanese Kanji,
829 Chinese Hanzi, and Korean Hanja. There are similar characters
830 in these four character sets. (There are two sets of Chinese characters,
831 simplified Chinese used in P. R. China and traditional Chinese used in
832 Taiwan). To reduce the number of these ideograms to be encoded
833 (the region for these characters can contain only 20992 characters),
834 these similar characters are assumed to be the same.
835 This is Han Unification.
836 </P>
837
838 <P>
839 However, these characters are not exactly the same. If fonts for
840 these characters are made from Chinese ones, Japanese people will
841 regard them as wrong characters, though they will be able to read them.
842 </P>
843
844 <P>
845 An example of Han Unification is available at
846 <url id="http://charts.unicode.org/unihan/unihan.acgi$0x9AA8" name="&urlname">.
847 This is a Kanji character for 'bone'.
848 <url id="http://charts.unicode.org/unihan/unihan.acgi$0x8FCE" name="&urlname">
849 is another example, the Kanji character for 'welcome'.
850 </P>
851
852
853
854 <sect3 id="combining"><heading>Combining Characters</heading>
855
856 <P>
857 Unicode has a way to synthesize an accented character by combining
858 an accent symbol with a base character. For example, combining 'a' and
859 '~' makes 'a' with tilde. Two or more accent symbols can be added to
860 a base character.
861 </P>
862
863 <P>
864 This facility is convenient for expressing Arabic and Thai characters.
865 However, a few problems arise.
866 </P>
867
868 <P>
869 <taglist>
870 <tag>Duplicate Encoding</tag>
871 <item>
872 There are multiple ways to express the same character.
873 <tag>Open Repertoire</tag>
874 <item>
875 The number of expressible characters grows without limit.
876 Even characters that do not exist can be expressed.
877 </taglist>
878 </P>
879
880 <P>
881 Moreover, this threatens the principle that all characters
882 are expressed in a constant bit length.
883 </P>
884
885
886 <sect3 id="surrogate"><heading>Surrogate Pair aka UTF-16</heading>
887
888 <P>
889 Though Unicode aimed to express all the characters of the world
890 in a constant 16 bits, 65536 code values are clearly insufficient
891 for that purpose.
892 </P>
893
894 <P>
895 Surrogate pairs were introduced in Unicode 2.0 to expand the
896 number of characters by expressing one character as a special
897 sequence of two consecutive 16-bit codes.
898 </P>
899
900 <P>
901 The region 0xd800 - 0xdfff is reserved for surrogate pairs.
902 The first 16-bit code must be in the region 0xd800 - 0xdbff.
903 The second 16-bit code must be in the region 0xdc00 - 0xdfff.
904 Since each region holds 1024 values, surrogate pairs can
905 express 1048576 (1024 * 1024) characters.
906 </P>
907
908 <P>
909 Planes 1 - 16 of Group 0 of UCS-4 are mapped to these areas.
910 UCS-4 will be explained later.
911 </P>
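<P>
The arithmetic behind surrogate pairs can be sketched in C as follows.
This is only an illustration under the ranges given above; the function
names <tt>ucs4_to_surrogates</tt> and <tt>surrogates_to_ucs4</tt> are
made up for this example and do not belong to any library.
</P>

```c
#include <stdint.h>

/* Split a UCS-4 value in planes 1 - 16 (0x10000 - 0x10FFFF) into a
   surrogate pair.  Returns 0 on success, -1 if cp is outside that range. */
int ucs4_to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;
    cp -= 0x10000;                               /* 20-bit offset          */
    *high = (uint16_t)(0xD800 + (cp >> 10));     /* upper 10 bits          */
    *low  = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* lower 10 bits          */
    return 0;
}

/* Combine a surrogate pair back into a UCS-4 value.
   Returns 0xFFFFFFFF if the two codes are not a valid pair. */
uint32_t surrogates_to_ucs4(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF)
        return 0xFFFFFFFFu;
    return 0x10000u + (((uint32_t)(high - 0xD800) << 10)
                       | (uint32_t)(low - 0xDC00));
}
```

<P>
For example, the UCS-4 value 0x1D11E splits into 0xD834 followed by
0xDD1E, and the two functions are inverses of each other on the valid range.
</P>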
912
913
914 <sect3 id="646problem"><heading>ISO 646-* Problem</heading>
915
916 <P>
917 You will need a codeset converter between your local codeset
918 (for example, ISO 8859-* or ISO 2022-*) and Unicode.
919 If you are Japanese, you may use the Japanese version
920 of ISO 646, which encodes the yen currency mark at 0x5c, where backslash
921 is encoded in ASCII.
922 </P>
923
924 <P>
925 Then which should your converter convert 0x5c in your local codeset
926 into in Unicode: 0x005c (backslash) or the yen currency mark?
927 You may say that the yen currency mark is the right answer.
928 However, backslash (and hence the yen mark) is widely used as an
929 escape character. For example, 'new line' is expressed as
930 'backslash - n' in a C string literal, and Japanese people write
931 'yen currency mark - n'. You may say that program sources
932 must be written in ASCII and that the mistake was trying
933 to convert program source at all. Then what about your own
934 configuration files for various software?
935 </P>
936
937 <P>
938 For example, the Shift-JIS codeset, which is the standard codeset
939 for Windows/Macintosh in Japan, includes the Japanese version of
940 ISO 646. The 'right' way is to convert 0x5c into the yen currency mark
941 in Unicode. Windows now supports Unicode, and the glyph
942 at 0x005c is the yen currency mark. As you know, backslash
943 (yen currency mark in Japan) is vitally important for Windows,
944 because it is used to separate directory names.
945 Fortunately, EUC-JP, which is widely used on UNIX in Japan,
946 includes ASCII, not the Japanese version of ISO 646. So there
947 is no problem, because 0x5c is clearly backslash.
948 </P>
949
950 <P>
951 Thus local codesets should not use character sets that are incompatible
952 with ASCII, such as ISO 646-*.
953 </P>
954
955
956
957 <sect3 id="consistency"><heading>Consistency with Local Character Sets</heading>
958
959 <P>
960 Local character sets can be newly defined or made obsolete.
961 I don't know whether Unicode can adapt itself to such cases.
962 This is a REAL fear. Now (1999) a new character set (JIS X 0208
963 3rd and 4th level) is being discussed in Japan. This character set
964 may make an older character set (JIS X 0212) obsolete.
965 </P>
966
967 <P>
968 There is one more problem. JIS X 0208, the main character set
969 for the Japanese language, has many special symbols, such as
970 circle, star, parentheses, and so on. The correspondence
971 between these characters and Unicode is not standardized.
972 Thus the conversion tables differ from vendor to vendor.
973 I guess this problem is not peculiar to JIS X 0208.
974 </P>
975
976
977
978 <sect2 id="ucs"><heading>ISO 10646, UCS-2, and UCS-4</heading>
979
980 <P>
981 ISO 10646 defines two character sets, <strong>UCS-2</strong>
982 and <strong>UCS-4</strong>.
983 UCS-2 is a subset of UCS-4.
984 </P>
985
986 <P>
987 UCS-4 is a 32-bit character set. The 4 bytes of the 32-bit expression
988 of UCS-4 are called <strong>Group</strong>, <strong>Plane</strong>,
989 <strong>Row</strong>, and <strong>Cell</strong>, respectively.
990 The first plane (Group = 0, Plane = 0) is called the <strong>BMP</strong>
991 (Basic Multilingual Plane), and UCS-2 is the same as the BMP.
992 </P>
993
994 <P>
995 Neither UCS-2 nor UCS-4 is upward-compatible with ASCII,
996 though the characters at 0x0021 - 0x007e in UCS-2
997 (and 0x00000021 - 0x0000007e in UCS-4) are the same as those at
998 0x21 - 0x7e in ASCII.
999 </P>
1000
1001 <P>
1002 Though UCS-2 and UCS-4 are explained as character sets,
1003 they are also codesets.
1004 </P>
1005
1006 <P>
1007 When a string expressed in
1008 UCS-2 (or UCS-4) is stored in a file, there are two possible byte orders,
1009 big endian and little endian. To make clear which byte order is
1010 used, a magic character is added at the top of the string.
1011 The character is 'zero width no-break space', whose
1012 code is 0xfeff in UCS-2 and 0x0000feff in UCS-4.
1013 </P>
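<P>
A reader of such a file might use the magic character as sketched below
for UCS-2. The names <tt>ucs2_byte_order</tt> and <tt>ucs2_read</tt> are
invented for this illustration. The check works because the
byte-swapped value 0xfffe is not assigned to any character.
</P>

```c
#include <stddef.h>

enum byte_order { ORDER_UNKNOWN, ORDER_BIG, ORDER_LITTLE };

/* Look at the first two bytes of a UCS-2 stream and report which byte
   order the leading 0xfeff was written in. */
enum byte_order ucs2_byte_order(const unsigned char *buf, size_t len)
{
    if (len < 2)
        return ORDER_UNKNOWN;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return ORDER_BIG;       /* bytes appear as FE FF */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return ORDER_LITTLE;    /* bytes appear as FF FE */
    return ORDER_UNKNOWN;       /* no magic character present */
}

/* Read one 16-bit code from two bytes in the given byte order.
   ORDER_UNKNOWN is treated as big endian here, which is only an
   assumption of this sketch. */
unsigned int ucs2_read(const unsigned char *p, enum byte_order order)
{
    if (order == ORDER_LITTLE)
        return (unsigned int)p[0] | ((unsigned int)p[1] << 8);
    return ((unsigned int)p[0] << 8) | (unsigned int)p[1];
}
```

<P>
After the byte order is known, the caller skips the first two bytes and
reads the rest of the string with <tt>ucs2_read</tt>.
</P>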
1014
1015
1016
1017 <sect2 id="utf-8"><heading>UTF-8</heading>
1018
1019 <P>
1020 In spite of the name, <strong>UTF-8</strong> is not similar to UTF-16 at all.
1021 </P>
1022
1023 <P>
1024 UTF-8 is a codeset which uses UCS-4 as its character set and is
1025 upward-compatible with ASCII.
1026 Conversion from UCS-4 to UTF-8 is performed using a
1027 simple conversion rule.
1028 <example>
1029 UCS-4 (binary)                         UTF-8 (binary)
1030 00000000 00000000 00000000 0???????    0???????
1031 00000000 00000000 00000??? ????????    110????? 10??????
1032 00000000 00000000 ???????? ????????    1110???? 10?????? 10??????
1033 00000000 000????? ???????? ????????    11110??? 10?????? 10?????? 10??????
1034 000000?? ???????? ???????? ????????    111110?? 10?????? 10?????? 10?????? 10??????
1035 0??????? ???????? ???????? ????????    1111110? 10?????? 10?????? 10?????? 10?????? 10??????
1036 </example>
1037 </P>
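<P>
The conversion rule in the table can be written as a short C function.
The following is a sketch, not a library routine; the name
<tt>ucs4_to_utf8</tt> is invented here. It follows the six-row table
above literally, so it produces sequences of up to 6 bytes. Note that
later revisions of the UTF-8 specification limit sequences to 4 bytes.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one UCS-4 value as UTF-8 following the table above.
   Returns the number of bytes written to out (1 to 6),
   or 0 if cp does not fit in 31 bits. */
size_t ucs4_to_utf8(uint32_t cp, unsigned char out[6])
{
    size_t n, i;
    unsigned char lead;

    if (cp < 0x80) {                    /* 7 bits: identical to ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {            /* 11 bits */
        n = 2; lead = 0xC0;
    } else if (cp < 0x10000) {          /* 16 bits */
        n = 3; lead = 0xE0;
    } else if (cp < 0x200000) {         /* 21 bits */
        n = 4; lead = 0xF0;
    } else if (cp < 0x4000000) {        /* 26 bits */
        n = 5; lead = 0xF8;
    } else if (cp <= 0x7FFFFFFF) {      /* 31 bits */
        n = 6; lead = 0xFC;
    } else {
        return 0;
    }

    /* Fill the trailing bytes from the end: each carries 6 bits
       under the prefix 10. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* The remaining high bits go into the first byte. */
    out[0] = (unsigned char)(lead | cp);
    return n;
}
```

<P>
For example, 0x3042 (a Hiragana character) becomes the three bytes
0xE3 0x81 0x82, while any value below 0x80 is output unchanged,
which is why UTF-8 is upward-compatible with ASCII.
</P>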
1038
1039
1040
1041 <sect2 id="utf-2000"><heading>UTF-2000</heading>
1042
1043 <P>
1044 I have heard that there is a new code called UTF-2000, but I know nothing
1045 about it other than the name.
1046 <footnote>
1047 I HAVE TO WRITE EXPLANATION.
1048 </footnote>
1049 </P>
1050
1051
1052
1053
1054 <chapt id="languages"><heading>Characters in Each Country</heading>
1055
1056 <P>
1057 This chapter describes specific information for each language.
1058 Contributions from people speaking each language are welcome.
1059 If you are going to write a section on your language, please include
1060 these points:
1061 <enumlist>
1062 <item>kinds and number of characters used in the language,
1063 <item>explanation on character set(s) which is (are) standardized,
1064 <item>explanation on codeset(s) which is (are) standardized,
1065 <item>usage and popularity for each codeset,
1066 <item>de-facto standard, if any, on how many columns characters occupy,
1067 <item>writing direction and combined characters,
1068 <item>how to layout characters (word wrapping and so on),
1069 <item>widely used value for the <tt>LANG</tt> environment variable,
1070 <item>the way to input characters from keyboard and whether
1071 you want to input yes/no (and so on) in your language
1072 or in English,
1073 <item>a set of information needed for beautiful displaying, for example,
1074 where to break a line, hyphenation, word wrapping, and so on, and
1075 <item>other topics.
1076 </enumlist>
1077 </P>
1078
1079
1080 <P>
1081 Writers whose languages are written in a different direction
1082 from European languages or need combining characters
1083 (I have heard that these are used in Thai) are encouraged to explain
1084 how to treat such languages.
1085 </P>
1086
1087
1088
1089 &japanese-japan;
1090 &spanish;
1091
1092
1093
1094
1095
1096
1097
1098 <chapt id="output"><heading>Output to Display</heading>
1099
1100 <P>
1101 Here 'Output to Display' does not mean I18N of messages using
1102 <prgn>gettext</prgn>.
1103 I am concerned with whether characters are output correctly so that
1104 we can read them. For example, install the <package>libcanna1g</package>
1105 package and display
1106 <tt>/usr/doc/libcanna1g/README.jp.gz</tt> on console or <prgn>xterm</prgn>
1107 (of course after
1108 ungzipping). This text file is written in Japanese, but even Japanese
1109 people cannot read such a row of strange characters. Which would you
1110 prefer if you were a Japanese speaker: an English message which can be read
1111 with a dictionary, or a row of strange characters which is
1112 the result of <prgn>gettext</prgn>ization?
1113 (Yes, there <em>is</em> a way to display
1114 Japanese characters correctly -- <prgn>kon</prgn> (in <package>kon2</package>
1115 package) for console and <prgn>kterm</prgn> for X, and
1116 Japanese people are happy with <prgn>gettext</prgn>ized Japanese messages.)
1117 </P>
1118
1119 <P>
1120 Problems in displaying non-English characters are discussed below.
1121 Since the mother tongue of the author is Japanese, the content may
1122 be biased toward Japanese.
1123 </P>
1124
1125
1126
1127 <sect id="output-console"><heading>Console Software</heading>
1128
1129 <P>
1130 Software running on the console is not responsible for displaying characters.
1131 The console itself is responsible. There are terminal emulators
1132 which can display non-English languages, such as <prgn>kterm</prgn> (Japanese),
1133 <prgn>krxvt</prgn>, <prgn>grxvt</prgn>, and <prgn>crxvt</prgn>
1134 (Japanese, Greek, and Chinese, included
1135 in <package>rxvt-ml</package> package), <prgn>cxterm</prgn>
1136 (Chinese, Korean, and Japanese, non-free),
1137 and so on, and software with which non-English characters can be
1138 displayed on the console, such as <package>kon2</package> (Japanese).
1139 </P>
1140
1141 <P>
1142 All that software running on a console (including a terminal emulator and so on)
1143 has to do is output the correct codes to the console.
1144 </P>
1145
1146 <P>
1147 First, it is important not to destroy string data.
1148 Sometimes this requires nothing more than making the software 8bit-clean.
1149 '8bit-clean' means that the software does not destroy the
1150 most significant bit (MSB) of the data it handles.
1151 </P>
1152
1153 <P>
1154 Next, be careful with software which sends control codes, such
1155 as cursor positions, every time it outputs 1 byte. Such codes destroy
1156 the continuity of a multibyte character.
1157 </P>
1158
1159 <P>
1160 Also be careful about the destruction of multicolumn characters.
1161 For example, when a string exceeds the width of the console,
1162 the string is divided at the end of the line. Terminal emulators
1163 should have a facility to avoid this 'excess of line width' type of
1164 character destruction, but so far no terminal emulator
1165 has such a facility. (There is only one exception --- the shell mode of Emacs.
1166 Unfortunately, however, the shell mode of Emacs is a dumb terminal and
1167 much software cannot be run on it.) Thus each piece of software on the
1168 console should be careful.
1169 </P>
1170
1171 <P>
1172 There is another way in which multicolumn characters get destroyed.
1173 When a message is written over another string, a part
1174 of a character belonging to the previous string can be
1175 left not overwritten. This may be more troublesome than many
1176 people would think, because a multicolumn character can be
1177 written at any column, not only at multiples of the
1178 width of the character.
1179 </P>
1180
1181 <P>
1182 Such destruction of the continuity of multibyte characters may
1183 cause the destruction of the whole line following
1184 the character. Whether this occurs depends on the internal
1185 implementation of the console program. It can occur if the
1186 terminal emulator does not treat columns, bytes, and characters
1187 as properly separate things. The shell mode of Emacs is the only example
1188 that does so, but there is no chance of overwriting characters in
1189 the shell mode of Emacs, because it is a dumb terminal.
1190 </P>
1191
1192 <P>
1193 There is no standard for the number of columns a character occupies.
1194 This can be a large problem for software using <tt>ncurses</tt>.
1195 There is no 'right' way to solve this. Each piece of software has to
1196 have the information for each character set. Consult section 2.6
1197 for each language. Take care to distinguish between the number
1198 of columns, bytes, and characters. For a subset of EUC-JP
1199 (ASCII alphabets and JIS X 0208 kanji), the numbers of bytes and columns
1200 are equal (a 1-byte character occupies 1 column and a 2-byte character
1201 occupies 2 columns).
1202 </P>
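<P>
The distinction between bytes, characters, and columns can be
illustrated for EUC-JP with the sketch below. The structure
<tt>text_size</tt> and the function <tt>eucjp_measure</tt> are
hypothetical names. The column widths are the usual convention, not a
standard: 1 column for ASCII and half-width katakana, 2 columns for
JIS X 0208 and JIS X 0212 characters. Control characters are counted as
1 column and trailing bytes are not validated, to keep the sketch short.
</P>

```c
#include <stddef.h>

struct text_size {
    size_t bytes;     /* length in bytes                 */
    size_t chars;     /* number of characters            */
    size_t columns;   /* width on the screen, in columns */
};

/* Measure an EUC-JP string of len bytes.
   Returns 0 on success, -1 on an invalid or truncated sequence. */
int eucjp_measure(const unsigned char *s, size_t len, struct text_size *out)
{
    size_t i = 0;

    out->bytes = out->chars = out->columns = 0;
    while (i < len) {
        size_t nbytes, ncols;
        unsigned char c = s[i];

        if (c < 0x80) {                       /* ASCII                  */
            nbytes = 1; ncols = 1;
        } else if (c == 0x8E) {               /* half-width katakana    */
            nbytes = 2; ncols = 1;
        } else if (c == 0x8F) {               /* JIS X 0212             */
            nbytes = 3; ncols = 2;
        } else if (c >= 0xA1 && c <= 0xFE) {  /* JIS X 0208             */
            nbytes = 2; ncols = 2;
        } else {
            return -1;
        }
        if (i + nbytes > len)                 /* character is cut short */
            return -1;
        i += nbytes;
        out->chars++;
        out->columns += ncols;
    }
    out->bytes = i;
    return 0;
}
```

<P>
A string consisting of one ASCII letter, one JIS X 0208 character, and
one half-width katakana gives 5 bytes, 3 characters, and 4 columns,
so none of the three numbers can be substituted for another in general.
</P>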
1203
1204 <P>
1205 Another important point is that the string has to be converted
1206 into a codeset which the console can understand. So far there
1207 are no consoles which understand Unicode.
1208 </P>
1209
1210
1211
1212 <sect id="output-x"><heading>X Clients</heading>
1213
1214 <P>
1215 X itself is already internationalized. Thus many languages can
1216 be displayed if fonts are properly prepared. It is the users'
1217 responsibility to prepare fonts, and all that software has
1218 to do is to select fonts carefully.
1219 </P>
1220
1221 <P>
1222 Though codesets other than ASCII often contain multiple character sets,
1223 fonts for X are prepared for each character set. So a set of fonts
1224 covering the set of character sets should be used instead of a single font.
1225 </P>
1226
1227 <P>
1228 For example, C programs using Xlib should use the series of functions
1229 related to the XFontSet structure instead of the functions for the XFontStruct
1230 structure.
1231 <example>
1232 Font              | FontSet
1233 ==================+====================
1234 XFontStruct       | XFontSet
1235 ------------------+--------------------
1236 XLoadFont()       | XCreateFontSet()
1237 ------------------+--------------------
1238 XUnloadFont()     | XFreeFontSet()
1239 ------------------+--------------------
1240 XQueryFont()      | XFontsOfFontSet()
1241 ------------------+--------------------
1242 XDrawString()     | XmbDrawString()
1243 XDrawString16()   | XwcDrawString()
1244 ------------------+--------------------
1245 XDrawText()       | XmbDrawText()
1246 XDrawText16()     | XwcDrawText()
1247 ------------------+--------------------
1248 </example>
1249 </P>
1250
1251 <P>
1252 If software uses the left-hand functions, it has to be rewritten
1253 using the corresponding right-hand functions in the table. Note that
1254 this table is not complete; it is only an example. Since these
1255 right-hand functions use wide characters and multibyte characters
1256 in C, setlocale() has to be called in advance.
1257 </P>
1258
1259 <P>
1260 The same problem exists for software using toolkits such as
1261 Athena, GTK+, Qt, and so on.
1262 </P>
1263
1264
1265
1266
1267
1268
1269
1270 <chapt id="input"><heading>Input from Keyboard</heading>
1271
1272 <P>
1273 I18N of display is a prerequisite for I18N of input from the keyboard.
1274 I18N of input is not necessary just for answering Yes/No. Most
1275 Japanese-speaking people find it too troublesome, just to
1276 answer Y/N, to invoke the input method, input the alphabetical
1277 representation of Japanese, and convert it to Japanese characters.
1278 This is probably true for Korean and Chinese as well. On the other hand,
1279 software such as text editors, word processors, terminal emulators,
1280 and shells should have I18N-ed input support.
1281 </P>
1282
1283
1284
1285 <sect id="input-console"><heading>Console Software</heading>
1286
1287 <sect1 id="input-console-console"><heading>Invoked in the Console and Kon</heading>
1288
1289 <P>
1290 Canna and Wnn are client/server type Japanese input methods.
1291 Wnn has variants for Korean and Chinese.
1292 They have their own protocols, and there is no standard.
1293 There is software which adds a Japanese input facility
1294 to the console by connecting the console to these input methods,
1295 but these programs (canuum for Canna and uum for Wnn) are
1296 not Debianized yet. There are a few programs which can talk
1297 the Canna or Wnn protocol directly, for example, nvi-m17n-canna.
1298 In a Debian system, these programs 'depend' on the libcanna or wnn
1299 packages.
1300 </P>
1301
1302 <P>
1303 GNU Emacs offers methods for inputting many languages,
1304 such as Japanese, Chinese, Korean, Latin-{12345}, Russian,
1305 Greek, Hebrew, Thai, Vietnamese, Czech, and so on,
1306 in the console environment. XEmacs also offers a similar
1307 mechanism, but the set of supported languages is different.
1308 We would be very happy if the input facility of (X)Emacs
1309 became a library that other software could use. The author
1310 doesn't know whether this can be achieved.
1311 </P>
1312
1313 <P>
1314 Once an input method is supplied,
1315 the input codes must be treated correctly.
1316 That is, the software must be aware of the number
1317 of bytes, characters, and columns.
1318 For example, you have to know how many bytes should be
1319 deleted and how many '^H' codes should be sent to the console
1320 when the 'BS' key is pressed.
1321 </P>
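<P>
For an EUC-JP input line, the handling of the 'BS' key might look like
the following sketch. All names are invented for this example, and the
column widths are the usual convention rather than a standard. The line
is scanned from its start, because in EUC-JP both bytes of a 2-byte
character have the MSB set and a backward scan cannot reliably find
where the last character begins.
</P>

```c
#include <stddef.h>

/* Byte length of an EUC-JP character, judged from its first byte. */
static size_t eucjp_char_bytes(unsigned char lead)
{
    if (lead == 0x8F)
        return 3;                  /* JIS X 0212                      */
    if (lead == 0x8E || (lead >= 0xA1 && lead <= 0xFE))
        return 2;                  /* half-width katakana, JIS X 0208 */
    return 1;                      /* ASCII (or a stray byte)         */
}

/* Assumed column width of an EUC-JP character, from its first byte. */
static size_t eucjp_char_columns(unsigned char lead)
{
    if (lead == 0x8F || (lead >= 0xA1 && lead <= 0xFE))
        return 2;
    return 1;
}

/* Remove the last character from line[0 .. *len - 1] by shortening *len.
   Returns the number of columns to erase on the screen,
   or 0 if the line is already empty. */
size_t eucjp_backspace(const unsigned char *line, size_t *len)
{
    size_t pos = 0, last = 0;

    if (*len == 0)
        return 0;
    while (pos < *len) {           /* walk forward character by character */
        last = pos;
        pos += eucjp_char_bytes(line[pos]);
    }
    *len = last;                   /* drop every byte of the last character */
    return eucjp_char_columns(line[last]);
}
```

<P>
The caller would then send '^H', a space, and '^H' again once for each
returned column, so that a 2-column character is erased with a single
press of the key.
</P>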
1322
1323
1324 <sect1 id="input-console-x"><heading>Invoked in an X Terminal Emulator</heading>
1325
1326 <P>
1327 X has a standard for inputting various languages: XIM.
1328 Kinput2 is software which connects Canna and/or Wnn to the XIM protocol.
1329 Moreover, terminal emulators such as kterm and krxvt have a
1330 facility to connect to XIM. So a way to input various languages
1331 is supplied.
1332 </P>
1333
1334 <P>
1335 All that software running on a terminal emulator has to do is
1336 to accept the input properly.
1337 </P>
1338
1339 <P>
1340 First, the software needs to be made 8bit-clean.
1341 Important software such as <prgn>bash</prgn> and <prgn>tcsh</prgn>
1342 is already 8bit-clean
1343 and accepts non-ASCII characters.
1344 </P>
1345
1346 <P>
1347 However, since this software isn't aware of multibyte
1348 characters, editing the input line is a bit hard. For example,
1349 we have to press the Backspace key twice to erase a 2-byte character.
1350 If we make a mistake, the input string will be broken.
1351 For stateful codesets, or for characters whose byte and column counts differ,
1352 editing would be much more difficult.
1353 (Fortunately, most Japanese and Korean characters are expressed in
1354 2 bytes and occupy 2 columns. That is, the numbers of bytes and columns
1355 are identical.)
1356 Thus software should be aware of multibyte codesets.
1357 </P>
1358
1359
1360
1361
1362
1363 <sect id="input-x"><heading>X Clients</heading>
1364
1365 <P>
1366 All you need is:
1367 <list>
1368 <item>To accept input from XIM. 'Over-the-spot' conversion is desirable but
1369 not essential.
1370 <item>To accept 'paste' using Compound Text.
1371 </list>
1372 </P>
1373
1374
1375
1376
1377
1378
1379
1380 <chapt id="internal"><heading>Internal Processing and File I/O</heading>
1381
1382 <P>
1383 From a user's point of view, software can use any internal codeset
1384 as long as I/O is done correctly, because the user cannot tell
1385 which kind of internal code is used in the software.
1386 </P>
1387
1388 <P>
1389 From a programmer's point of view, he/she
1390 <list>
1391 <item>can count the number of <em>characters</em> (not <em>bytes</em> or
1392 <em>columns</em>) correctly,
1393 <item>cannot accidentally split a multibyte character, and
1394 <item>doesn't have to take care of shift state
1395 </list>
1396 without knowledge of specific codesets, by using
1397 wide characters in C, some kind of Unicode, and so on.
1398 </P>
1399
1400 <P>
1401 Since you may not assume anything about the
1402 implementation of wide characters (the values of <tt>wchar_t</tt>),
1403 you cannot do anything beyond what the library provides,
1404 for example, obtaining the number of columns a character occupies.
1405 </P>
1406
1407
1408
1409
1410
1411
1412
1413
1414 <chapt id="other"><heading>Other Special Topics</heading>
1415
1416 <sect id="locale"><heading>Locale in C</heading>
1417
1418 <P>
1419 Locale is the main facility for I18N in the C language.
1420 The easiest way to use locale is to call <tt>setlocale(LC_ALL, "")</tt>.
1421 </P>
1422
1423 <P>
1424 In the locale model, software changes its behavior
1425 according to its language environment. The environment can be
1426 set independently for the six categories
1427 LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC,
1428 and LC_TIME.
1429 For example, messages can be written in the proper language
1430 and the proper format for date/time expressions can be used, if
1431 properly implemented.
1432 </P>
1433
1434 <P>
1435 If <tt>setlocale(LC_ALL, "")</tt> is called at the start of the
1436 software, the environment is chosen by the environment variables
1437 whose names are the same as the names of the categories.
1438 If the LC_ALL variable is defined, LC_ALL takes precedence over these
1439 variables. If neither LC_ALL nor the variable for a category is defined, the LANG variable is used.
1440 If LANG is not defined either, the 'C' locale, which means default behavior,
1441 is used.
1442 </P>
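<P>
The order of precedence described above can be summarized in code.
This is only a model of the lookup, not how any C library is
implemented; <tt>pick_locale</tt> and <tt>locale_from_environment</tt>
are hypothetical names. The sketch assumes that a variable set to an
empty string counts as undefined.
</P>

```c
#include <stddef.h>
#include <stdlib.h>

/* A variable counts as defined only if it exists and is not empty. */
static int is_defined(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Choose the locale name for one category from the three candidate
   values, in the order LC_ALL, the category variable, LANG, "C". */
const char *pick_locale(const char *lc_all, const char *lc_category,
                        const char *lang)
{
    if (is_defined(lc_all))
        return lc_all;
    if (is_defined(lc_category))
        return lc_category;
    if (is_defined(lang))
        return lang;
    return "C";
}

/* Apply the rule to the real environment, e.g. for "LC_MESSAGES". */
const char *locale_from_environment(const char *category_variable)
{
    return pick_locale(getenv("LC_ALL"), getenv(category_variable),
                       getenv("LANG"));
}
```

<P>
For example, with <tt>LANG=ja_JP.ujis</tt> and <tt>LC_MESSAGES=C</tt>,
messages come out in the default language while the other categories
follow the Japanese locale.
</P>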
1443
1444 <P>
1445 Though the valid values for these environment variables (locale names)
1446 depend on the kind and set-up of the OS, the format of locale names
1447 is usually like <tt>ja_JP.ujis</tt>, where the two lowercase characters
1448 mean the language (<tt>ja</tt> = Japanese), the two capital characters
1449 mean the country (<tt>JP</tt> = Japan), and the characters after the dot mean
1450 the codeset (<tt>ujis</tt> = EUC-JP). Type <tt>locale -a</tt> to display
1451 all valid locale names.
1452 </P>
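<P>
Splitting a locale name of this usual form into its parts might be
sketched as follows. The structure <tt>locale_parts</tt> and the
function <tt>parse_locale_name</tt> are invented for illustration, and
the sketch handles only the <tt>language_COUNTRY.codeset</tt> form
described above, where the country and codeset parts may be absent.
</P>

```c
#include <stddef.h>
#include <string.h>

struct locale_parts {
    char language[16];   /* e.g. "ja"                     */
    char country[16];    /* e.g. "JP", or "" if absent    */
    char codeset[32];    /* e.g. "ujis", or "" if absent  */
};

/* Split a name such as "ja_JP.ujis".
   Returns 0 on success, -1 if a part is empty or too long. */
int parse_locale_name(const char *name, struct locale_parts *p)
{
    size_t n;

    p->language[0] = p->country[0] = p->codeset[0] = '\0';

    n = strcspn(name, "_.");               /* language ends at '_' or '.' */
    if (n == 0 || n >= sizeof p->language)
        return -1;
    memcpy(p->language, name, n);
    p->language[n] = '\0';
    name += n;

    if (*name == '_') {                    /* optional country part */
        name++;
        n = strcspn(name, ".");
        if (n == 0 || n >= sizeof p->country)
            return -1;
        memcpy(p->country, name, n);
        p->country[n] = '\0';
        name += n;
    }

    if (*name == '.') {                    /* optional codeset part */
        name++;
        n = strlen(name);
        if (n == 0 || n >= sizeof p->codeset)
            return -1;
        memcpy(p->codeset, name, n + 1);   /* include the final NUL */
    }
    return 0;
}
```

<P>
Software should normally leave the interpretation of locale names to
<tt>setlocale()</tt>; a parser like this is useful only when, for
example, the codeset part is needed to choose a code converter.
</P>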
1453
1454 <P>
1455 Note that m17n is not achieved by the locale model at all,
1456 because a user can choose only one language for each category.
1457 Sometimes the locale model is insufficient even for i18n.
1458 For example, there are many languages for which multiple codesets
1459 are in use at the same time, and sometimes code conversion or
1460 codeset detection is needed.
1461 </P>
1462
1463 <sect id="wchar"><heading>Multibyte and Wide characters in C</heading>
1464
1465 <P>
1466 A multibyte character is a character which is expressed by
1467 one or more bytes. Multibyte characters correspond to
1468 the 'real' codeset used for input/output. The representation
1469 of multibyte characters depends on <tt>LC_CTYPE</tt>.
1470 </P>
1471
1472 <P>
1473 Since multibyte characters can be stateful (that is, can have
1474 shift state) and the number of bytes per character does not
1475 have to be constant, an implementation using multibyte characters
1476 can be difficult. Wide characters can be used instead.
1477 </P>
1478
1479 <P>
1480 Wide characters are stateless, and every wide character has the
1481 same size. Functions for conversion between multibyte characters
1482 and wide characters (and between strings of multibyte characters and
1483 strings of wide characters) are supplied by the library.
1484 A wide character is expressed using the <tt>wchar_t</tt> type.
1485 A string of wide characters is expressed
1486 as an array of <tt>wchar_t</tt>, just as a string of ASCII characters is expressed
1487 as an array of <tt>char</tt>.
1488 </P>
1489
1490 <P>
1491 Thus it is convenient to read multibyte characters from a stream,
1492 convert them into wide characters, process them, convert them back into
1493 multibyte characters, and write them to a stream. <tt>wchar_t</tt> is
1494 used as an internal code.
1495 </P>
1496
1497 <P>
1498 Functions for conversion between multibyte and wide characters/strings
1499 are shown below:
1500 <list>
1501 <item><tt>mbtowc()</tt> and <tt>mbrtowc()</tt> to convert
1502 from a multibyte to a wide character.
1503 <item><tt>mblen()</tt> and <tt>mbrlen()</tt> to obtain the number
1504 of bytes in a multibyte character.
1505 <item><tt>mbstowcs()</tt> and <tt>mbsrtowcs()</tt> to convert from a
1506 multibyte to a wide character string.
1507 <item><tt>wctomb()</tt> and <tt>wcrtomb()</tt> to convert from a wide
1508 to a multibyte character.
1509 <item><tt>wcstombs()</tt> and <tt>wcsrtombs()</tt> to convert from a
1510 wide to a multibyte character string.
1511 <item><tt>mbsinit()</tt> to check whether the shift state is the initial state.
1512 <item><tt>btowc()</tt> and <tt>wctob()</tt> to convert between single-byte and
1513 wide characters.
1514 </list>
1515 </P>
1516
1517 <P>
1518 The '<tt>r</tt>' versions of these functions (for example, <tt>mbrtowc</tt>)
1519 take an additional parameter, a pointer to an <tt>mbstate_t</tt>
1520 variable which holds the shift state. Since the non-'<tt>r</tt>'
1521 versions of these functions keep the shift state in an internal
1522 (static) variable, they can handle only one string at a time.
1523 </P>
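<P>
As an illustration of the '<tt>r</tt>' functions, the following sketch
counts the characters (not the bytes) of a multibyte string with
<tt>mbrtowc()</tt>. The function name <tt>count_characters</tt> is made
up for this example. The result depends on the current
<tt>LC_CTYPE</tt>, so a real program is assumed to have called
<tt>setlocale(LC_ALL, "")</tt> beforehand.
</P>

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in the multibyte string s, interpreted in the
   codeset of the current LC_CTYPE locale.
   Returns (size_t)-1 if s contains an invalid or incomplete sequence. */
size_t count_characters(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);      /* start in the initial shift state */
    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;            /* invalid or incomplete character  */
        if (n == 0)
            break;                        /* reached a null character         */
        s += n;                           /* n bytes made up one character    */
        remaining -= n;
        count++;
    }
    return count;
}
```

<P>
Because the shift state lives in the caller's own <tt>mbstate_t</tt>
variable, two strings can be scanned in an interleaved way without
disturbing each other, which the non-'<tt>r</tt>' functions cannot do.
</P>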
1524
1525 <P>
1526 See manpages of these functions for further information.
1527 </P>
1528
1529 <P>
1530 The implementation of <tt>wchar_t</tt> is not determined by any
1531 standard, though UCS-4 is used in glibc. You must not
1532 assume any particular implementation of <tt>wchar_t</tt>.
1533 </P>
1534
1535 <P>
1536 Though the usual functions such as <tt>printf()</tt> can be used for
1537 input/output of multibyte characters, one has to take care of the escape
1538 character '<tt>%</tt>' used in formatted input/output functions, because
1539 a part of a multibyte character can have the same value as the ASCII
1540 code of '<tt>%</tt>'.
1541 </P>
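<P>
One simple way to stay safe is never to pass message data as the format
string itself. The sketch below uses a katakana character in
ISO-2022-JP, whose first byte is 0x25, the ASCII code of
'<tt>%</tt>'. The names <tt>katakana_a</tt> and
<tt>copy_message</tt> are invented for this example.
</P>

```c
#include <stdio.h>

/* Katakana "A" in ISO-2022-JP: ESC $ B, then the two bytes 0x25 0x22
   (which look like '%' and '"' in ASCII), then ESC ( B. */
const char katakana_a[] = "\x1b$B%\"\x1b(B";

/* Copy a message into out without interpreting its bytes.
   The message is passed as an argument of "%s", never as the format,
   so a 0x25 byte inside it is not taken as a directive.
   Returns what snprintf() returns: the length of the message. */
int copy_message(char *out, size_t size, const char *message)
{
    return snprintf(out, size, "%s", message);
}
```

<P>
Calling <tt>printf(message)</tt> directly with such data would treat the
0x25 byte as the start of a directive, with unpredictable results.
</P>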
1542
1543
1544 <sect id="gettext"><heading>Gettext</heading>
1545
1546 <P>
1547 Gettext is a tool to internationalize the messages software outputs
1548 according to the locale setting of <tt>LC_MESSAGES</tt>.
1549 <prgn>gettext</prgn>ized software contains messages written in
1550 various languages (depending on the available translators), and
1551 a user can choose among them using environment variables.
1552 GNU gettext is a part of the Debian system.
1553 </P>
1554
1555 <P>
1556 Install the <package>gettext</package> package and read the info pages for details.
1557 </P>
1558
1559 <P>
1560 Don't use non-ASCII characters in '<tt>msgid</tt>'.
1561 Be careful, because you may tend to use ISO-8859-1 characters.
1562 For example, '&copy;' (copyright mark; you may not be able to
1563 read the copyright mark NOW in THIS document) is a non-ASCII character
1564 (0xa9 in ISO-8859-1).
1565 Otherwise, translators may find it difficult to edit catalog files
1566 because of the conflict between the codesets of <tt>msgid</tt> and
1567 <tt>msgstr</tt>.
1568 </P>
1569
1570 <P>
1571 Be sure the message can be displayed in the assumed environment.
1572 In other words, you have to read the chapter 'Output to Display'
1573 in this document and internationalize the output mechanism
1574 of your software before <prgn>gettext</prgn>ization.
1575 <em>EVEN NON-ENGLISH-SPEAKING PEOPLE PREFER ENGLISH MESSAGES
1576 TO MEANINGLESS BROKEN MESSAGES.</em>
1577 </P>
1578
1579 <P>
1580 The 2nd (3rd, ...) byte of a multibyte character, or
1581 any byte of a non-ASCII character in a stateful codeset,
1582 can be 0x5c (the same as backslash in ASCII) or 0x22
1583 (the same as double quote in ASCII).
1584 These bytes have to be properly escaped, because the
1585 present version of GNU gettext ignores the
1586 'charset' subitem of the '<tt>Content-Type</tt>' item for '<tt>msgstr</tt>'.
1587 </P>
1588
1589 <P>
1590 A <prgn>gettext</prgn>ed message must not be used in multiple contexts,
1591 because a word may have different meanings in different contexts.
1592 For example, in English a verb expresses an order or a command if it appears
1593 at the top of the sentence. However, different languages
1594 have different grammars. If a verb is <prgn>gettext</prgn>ed and is used
1595 both in an ordinary sentence and in an imperative sentence,
1596 one cannot translate it.
1597 </P>
1598
1599
1600 <P>
1601 If a sentence is <prgn>gettext</prgn>ed, never divide the sentence.
1602 If a sentence is divided in the original source code,
1603 join the parts so that a single string contains the full
1604 sentence.
1605 This is because the order of words in a sentence
1606 differs among languages.
1607 </P>
1608
1609 <P>
1610 Software with <prgn>gettext</prgn>ed messages should not depend on
1611 the length of the messages. The messages may get longer
1612 in a different language.
1613 </P>
1614
1615 <P>
1616 When two or more '%' directives for formatted output functions
1617 such as <tt>printf()</tt> appear in a message,
1618 the order of these '%' directives may be changed by
1619 translation. In such a case, the translator can specify
1620 the order.
1621 See the section 'Special Comments preceding Keywords'
1622 in the info page of <prgn>gettext</prgn> for details.
1623 </P>
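<P>
One common way for a translator to specify the order is the numbered
directive form <tt>%1$s</tt>, <tt>%2$s</tt>, which is a POSIX extension
to <tt>printf()</tt> and is supported by glibc. The sketch below assumes
such support. The message, its "translation", and the function name
<tt>format_copy</tt> are all invented for illustration; in a real
program the format would come from <tt>gettext()</tt>.
</P>

```c
#include <stdio.h>

/* A hypothetical original message and a hypothetical translation in
   which the two arguments have to appear in the opposite order. */
const char msgid_copy[]  = "copying %s to %s";
const char msgstr_copy[] = "%2$s <- %1$s";

/* Format the message with the given format string.  The arguments are
   always passed in the order of the original message; a numbered
   directive picks the argument it needs. */
int format_copy(char *out, size_t size, const char *format,
                const char *from, const char *to)
{
    return snprintf(out, size, format, from, to);
}
```

<P>
With the arguments "a" and "b", the original format gives
"copying a to b" and the reordered one gives "b <- a", without any
change to the calling code.
</P>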
1624
1625 <P>
1626 There are now projects to translate the messages of various software,
1627 for example,
1628 <url id="http://www.iro.umontreal.ca/~pinard/po/HTML/"
1629 name="Translation Project">.
1630 </P>
1631
1632
1633
1634 <sect1 id="gettextize"><heading>Gettext-izing Software</heading>
1635
1636 <P>
1637 First, the software has to have the following lines.
1638 <example>
1639 #include &lt;locale.h&gt;     /* setlocale() */
1640 #include &lt;libintl.h&gt;    /* bindtextdomain(), textdomain() */
1641 int main(int argc, char **argv)
1642 {
1643   ...
1644   setlocale (LC_ALL, "");  /* not only for gettext: all i18n software needs this */
1645   bindtextdomain (PACKAGE, LOCALEDIR);
1646   textdomain (PACKAGE);
1647   ...
1648 }
1649 </example>
1650 where <var>PACKAGE</var> is the name of the catalog file and
1651 <var>LOCALEDIR</var> is <tt>"/usr/share/locale"</tt> for Debian.
1652 <var>PACKAGE</var> and <var>LOCALEDIR</var> should be defined
1653 in a header file or <tt>Makefile</tt>.
1654 </P>
1655
1656 <P>
1657 It is convenient to prepare the following header file.
1658 <example>
1659 #include &lt;libintl.h&gt;
1660 #define _(String) gettext((String))
1661 </example>
1662 and messages in source files should be written as
1663 <tt>_("message")</tt>, instead of <tt>"message"</tt>.
1664 </P>
1665
1666 <P>
1667 Next, catalog files have to be prepared.
1668 </P>
1669
1670 <P>
1671 First, a template for the catalog file is prepared
1672 using <prgn>xgettext</prgn>.
1673 By default, a template file <tt>messages.po</tt> is
1674 created.
1675 <footnote>
1676 I HAVE TO WRITE EXPLANATION.
1677 </footnote>
1678 </P>
1679
1680
1681
1682 <sect1 id="gettext-translate"><heading>Translation</heading>
1683
1684 <P>
1685 Though <prgn>gettext</prgn>ization of software is a one-time
1686 task, translation is continuing work, because you have to
1687 translate new (or modified) messages when (or before) a new
1688 version of the software is released.
1689 </P>
1690
1691 <sect id="mailnews"><heading>Mail/News</heading>
1692
1693 <P>
1694 Headers and main texts of mail and news messages have
1695 to be expressed in 7 bits. Headers and main texts have
1696 different standards for using non-ASCII codesets.
1697 (ESMTP, the extension of SMTP, can handle 8-bit messages.)
1698 </P>
1699
1700 <P>
1701 The codeset of the main text is specified in the
1702 '<tt>charset</tt>' subitem of the
1703 '<tt>Content-Type</tt>' header item.
1704 The whole list of parameters which can be written
1705 is found at ***.
1706 <footnote>
1707 I HAVE TO FIND THIS LIST (RFC?)
1708 </footnote>
1709 </P>
1710
1711 <P>
1712 'B' encoding and 'Q' encoding are used to put non-ASCII codesets
1713 in the headers. These 'B' and 'Q' encodings are not codesets themselves.
1714 They are ways to express non-ASCII strings using ASCII characters.
1715 <footnote>
1716 I HAVE TO WRITE EXPLANATION
1717 </footnote>
1718 </P>
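<P>
The following sketch shows one way to produce such header words,
following the encoded-word form <tt>=?charset?encoding?text?=</tt> of
RFC 2047. 'B' is base64 of the raw bytes; 'Q' keeps ASCII letters and
digits, writes a space as <tt>_</tt>, and writes every other byte as
<tt>=XX</tt>. The function names <tt>encode_b</tt> and
<tt>encode_q</tt> are hypothetical, and the sketch assumes the input
is already in the named codeset. It does not split long text into
several words, although RFC 2047 limits each encoded word to 75
characters.
</P>

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* 'B' encoding: wrap base64 of src as =?charset?B?...?= in out.
 * Returns 1 on success, 0 if out is too small. */
int encode_b(const char *charset, const unsigned char *src, size_t len,
             char *out, size_t outsize)
{
    /* 8 = "=?" + "?B?" + "?=" + terminating NUL */
    size_t need = strlen(charset) + 4 * ((len + 2) / 3) + 8;
    if (outsize < need)
        return 0;
    char *p = out + sprintf(out, "=?%s?B?", charset);
    for (size_t i = 0; i < len; i += 3) {
        /* Pack up to three bytes into 24 bits, emit four 6-bit digits. */
        unsigned long v = (unsigned long)src[i] << 16;
        if (i + 1 < len) v |= (unsigned long)src[i + 1] << 8;
        if (i + 2 < len) v |= src[i + 2];
        *p++ = B64[(v >> 18) & 63];
        *p++ = B64[(v >> 12) & 63];
        *p++ = (i + 1 < len) ? B64[(v >> 6) & 63] : '=';
        *p++ = (i + 2 < len) ? B64[v & 63] : '=';
    }
    strcpy(p, "?=");
    return 1;
}

/* 'Q' encoding: wrap src as =?charset?Q?...?= in out.
 * Encodes more bytes than strictly required, which is always valid.
 * Returns 1 on success, 0 if out is too small. */
int encode_q(const char *charset, const unsigned char *src, size_t len,
             char *out, size_t outsize)
{
    /* Worst case: every byte becomes the three characters =XX. */
    size_t need = strlen(charset) + 3 * len + 8;
    if (outsize < need)
        return 0;
    char *p = out + sprintf(out, "=?%s?Q?", charset);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c == ' ')
            *p++ = '_';
        else if (c < 0x80 && isalnum(c))
            *p++ = (char)c;
        else
            p += sprintf(p, "=%02X", c);
    }
    strcpy(p, "?=");
    return 1;
}
```

<P>
'Q' stays fairly readable for text that is mostly ASCII, while 'B' is
usually more compact for text that is mostly non-ASCII, such as
Japanese.
</P>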
1719
1720
1721
1722
1723
1724
1725
1726 <chapt id="examples"><heading>Examples of I18N</heading>
1727
1728 <P>
1729 Programmers who have internationalized software,
1730 written L10N patches, and so on are encouraged to contribute
1731 to this chapter.
1732 </P>
1733
1734
1735
1736 &minicom;
1737 &user-ja;
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747 <chapt id="reference"><heading>References</heading>
1748
1749 <P>
1750 General
1751 <list>
1752 <item>
1753 <url id="http://i44www.info.uni-karlsruhe.de/~drepper/conf96/paper.html"
1754 name="i18n in GNU Project">
1755 <item>
1756 <url id="http://cns-web.bu.edu/pub/djohnson/web_files/i18n/i18n.html"
1757 name="Concept of C/UNIX i18n">
1758 </list>
1759 </P>
1760
1761 <P>
1762 Characters (general)
1763 <list>
1764 <item>
1765 <url id="http://www.kudpc.kyoto-u.ac.jp/~yasuoka/CJK.html"
1766 name="&urlname">
1767 </list>
1768 </P>
1769
1770 <P>
1771 Characters (ISO 8859)
1772 <list>
1773 <item>
1774 <url id="http://czyborra.com/charsets/iso8859.html" name="&urlname">
1775 <item>
1776 <url id="http://park.kiev.ua/multiling/ml-docs/iso-8859.html"
1777 name="&urlname">
1778 <item>
1779 <url id="http://www.terena.nl/projects/multiling/ml-docs/iso-8859.html"
1780 name="&urlname">
1781 </list>
1782 </P>
1783
1784 <P>
1785 Characters (ISO 2022)
1786 <list>
1787 <item>
1788 <url id="http://www.ewos.be/tg-cs/gconcept.htm" name="&urlname">
1789 <item>
1790 <url id="http://www.ecma.ch/stand/ECMA-035.HTM" name="&urlname">
1791 </list>
1792 </P>
1793
1794 <P>
1795 Characters (Unicode)
1796 <list>
1797 <item><url id="http://www.unicode.org/" name="&urlname">
1798 </list>
1799 </P>
1800
1801 <P>
1802 Example of i18n
1803 <list>
1804 <item>
1805 <url id="http://www.wg.omron.co.jp/~shin/Arena-CJK-doc/"
1806 name="Arena-i18n">
1807 Multilingual web browser.
1808 <item>
1809 <url id="http://www.m17n.org/mule/" name="Mule">
1810 Multilingual editor whose functions are included in GNU Emacs 20
1811 and XEmacs 20.
1812 To my knowledge, Mule is the most advanced m17n software.
1813 </list>
1814 </P>
1815
1816 <P>
1817 Projects
1818 <list>
1819 <item>
1820 <url id="http://www.li18nux.org/"
1821 name="Linux Internationalization Initiative">, or Li18nux,
1822 focuses on the i18n of a core set of APIs and components of Linux
1823 distributions. The results will be proposed to LSB.
1824 <item>
1825 <url id="http://www.iro.umontreal.ca/~pinard/po/HTML/"
1826 name="Translation Project">
1827 </list>
1828 </P>
1829
1830
1831
1832
1833
1834 </book>
1835 </debiandoc>
