<!doctype debiandoc public "-//DebianDoc//DTD DebianDoc//EN"
[
<!entity % languages system "languages.ents"> %languages;
<!entity % examples system "examples.ents"> %examples;
]>
| 6 |
<debiandoc>
|
| 7 |
<book>
|
| 8 |
|
| 9 |
|
| 10 |
<titlepag>
|
| 11 |
<title>Introduction to i18n</title>
|
| 12 |
<author>
|
| 13 |
<name>Tomohiro KUBOTA</name>
|
| 14 |
<email>kubota@debian.or.jp</email>
|
| 15 |
</author>
|
| 16 |
<version><date></version>
|
| 17 |
<abstract>
|
| 18 |
This document is an introduction to i18n (internationalization)
for programmers and package maintainers.
|
| 20 |
</abstract>
|
| 21 |
<copyright>
|
| 22 |
<copyrightsummary>
|
| 23 |
Copyright © 1999 Tomohiro KUBOTA.
|
| 24 |
For chapters and sections whose original author is not KUBOTA,
copyright belongs to their respective authors. Their names are
given at the top of the chapter or the section.
|
| 27 |
</copyrightsummary>
|
| 28 |
<p>
|
| 29 |
This manual is free software; you may redistribute it and/or modify it
|
| 30 |
under the terms of the GNU General Public License as published by the
|
| 31 |
Free Software Foundation; either version 2, or (at your option) any
|
| 32 |
later version.
|
| 33 |
</p>
|
| 34 |
<p>
|
| 35 |
This is distributed in the hope that it will be useful, but
|
| 36 |
<em>without any warranty</em>; without even the implied warranty of
|
| 37 |
merchantability or fitness for a particular purpose. See the GNU
|
| 38 |
General Public License for more details.
|
| 39 |
</p>
|
| 40 |
<p>
|
| 41 |
A copy of the GNU General Public License is available as
|
| 42 |
<tt>/usr/share/common-licenses/GPL</tt> in the Debian GNU/Linux
|
| 43 |
distribution or on the World Wide Web at
|
| 44 |
<url id="http://www.gnu.org/copyleft/gpl.html" name="&urlname">.
|
| 45 |
You can also obtain it by writing to the Free
|
| 46 |
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
|
| 47 |
02111-1307, USA.
|
| 48 |
</p>
|
| 49 |
</copyright>
|
| 50 |
</titlepag>
|
| 51 |
|
| 52 |
<toc detail="sect1">
|
| 53 |
|
| 54 |
<chapt id="scope"><heading>About This Document</heading>
|
| 55 |
|
| 56 |
<sect><heading>Scope</heading>
|
| 57 |
|
| 58 |
<P>
|
| 59 |
This document describes the basic ideas of I18N for
programmers and package maintainers of Debian GNU/Linux.
Its aim is to offer an introduction to the basic concepts,
to character codes, and to the points that need care
when one writes I18N-ed software or an I18N patch for
existing software. This document also tries to describe
the current state and the existing problems for each
language and country.
|
| 67 |
</P>
|
| 68 |
|
| 69 |
<P>
|
| 70 |
This document does not describe the details of programming,
|
| 71 |
except for the last chapter where instances of I18N are
|
| 72 |
collected.
|
| 73 |
</P>
|
| 74 |
|
| 75 |
<P>
|
| 76 |
This document stresses a few minimum requirements:
that characters are displayed with a proper font (at the very least,
users of the software must be able to guess what is written),
that characters can be input from the keyboard, and
that software must not destroy characters.
I am trying to write a HOWTO for satisfying these requirements.
|
| 83 |
</P>
|
| 84 |
|
| 85 |
<P>
|
| 86 |
Though this document is strongly related to programming
|
| 87 |
languages such as C and standardized I18N methods such as
|
| 88 |
<prgn>gettext</prgn> and LOCALE, this document does not supply a
|
| 89 |
detailed explanation of them.
|
| 90 |
</P>
|
| 91 |
|
| 92 |
<sect id="newversion"><heading>New Versions of This Document</heading>
|
| 93 |
|
| 94 |
<P>
|
| 95 |
The current version of this document is available
on the
<url id="http://www.debian.org/~elphick/ddp/"
name="DDP (Debian Documentation Project)"> page.
|
| 99 |
</P>
|
| 100 |
|
| 101 |
<sect id="feedback"><heading>Feedback and Contributions</heading>
|
| 102 |
|
| 103 |
<P>
|
| 104 |
This document needs contributions, especially for the
chapter on each language (<ref id="languages">)
and the chapter on instances of I18N (<ref id="examples">).
These chapters consist of contributions.
|
| 108 |
</P>
|
| 109 |
|
| 110 |
<P>
|
| 111 |
Otherwise, this will be merely a document on Japanization,
because the original author, Tomohiro KUBOTA
(<email>kubota@debian.or.jp</email>),
speaks Japanese and lives in Japan.
|
| 115 |
</P>
|
| 116 |
|
| 117 |
<P>
|
| 118 |
<ref id="spanish"> is written by
|
| 119 |
Eusebio C Rufian-Zilbermann <email>eusebio@acm.org</email>.
|
| 120 |
</P>
|
| 121 |
|
| 122 |
<P>
|
| 123 |
Discussions are held on the <tt>debian-devel@lists.debian.org</tt> mailing list.
(Might <tt>debian-doc</tt> or <tt>debian-i18n</tt> be more suitable?)
|
| 125 |
</P>
|
| 126 |
|
| 127 |
<chapt id="intro"><heading>Introduction</heading>
|
| 128 |
|
| 129 |
<P>
|
| 130 |
The Debian system includes a great deal of software. Though much of it
is able to process, output, and input text data, some of
these programs assume that text is written in English (ASCII).
For people who use non-English languages, these programs are
barely usable.
|
| 135 |
</P>
|
| 136 |
|
| 137 |
<P>
|
| 138 |
So far, people who use non-English languages have given up
and accepted computers as they are. However, we should now
abandon this wrong idea. It is absurd that a person who
wants to use a computer should have to learn English in advance.
|
| 142 |
</P>
|
| 143 |
|
| 144 |
<P>
|
| 145 |
There are a few approaches by which software can handle
non-English languages. The first thing we need to do is to understand
the differences between these approaches and to choose one
approach for each case.
|
| 149 |
</P>
|
| 150 |
|
| 151 |
<P>
|
| 152 |
<taglist>
|
| 153 |
<tag>a. <strong>L10N</strong> (localization)</tag>
|
| 154 |
<item><p>
|
| 155 |
This approach supports two languages or character sets:
English (ASCII) and one other specific one. An example is
Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual Emacs).
Since every programmer has his or her own mother tongue,
numerous L10N patches and L10N programs have been
written to satisfy the programmer's own needs.
|
| 161 |
</p></item>
|
| 162 |
<tag>b. <strong>I18N</strong> (internationalization)</tag>
|
| 163 |
<item><p>
|
| 164 |
This approach supports many languages, but only two
of them, English (ASCII) and one other, at the same time.
One has to specify the 'other' language with the <tt>LANG</tt>
environment variable or by similar means.
LOCALE in C and <prgn>gettext</prgn> are categorized as I18N.
|
| 169 |
</p></item>
|
| 170 |
<tag>c. <strong>M17N</strong> (multilingualization)</tag>
|
| 171 |
<item><p>
|
| 172 |
This approach supports many languages at the same time.
For example, Mule (MULtilingual Enhancement to GNU Emacs)
can handle a text file which contains multiple languages,
such as a paper on the differences between Korean and Chinese
whose main text is written in Finnish. GNU Emacs 20 and
XEmacs now include Mule.
|
| 178 |
</p></item>
|
| 179 |
</taglist>
|
| 180 |
</P>
|
| 181 |
|
| 182 |
<P>
|
| 183 |
Generally speaking, the I18N approach is better than L10N, and M17N is
better than I18N. In other words, text-processing software which can
handle many languages at the same time is 'better' than software which
can handle only two (English and one other).
|
| 187 |
</P>
|
| 188 |
|
| 189 |
<P>
|
| 190 |
Sometimes 'localization' means preparing language-specific (or
culture-specific) data for already I18N-ed software. For example,
translating <prgn>gettext</prgn>ed messages is 'localization' in this sense.
Almost all commercial software seems to adopt this approach: first
the original (or US) version is released, and localized versions
follow after a certain period.
In this document, however, such 'localization' is included in
'internationalization', because the two share common technical
topics and because such 'localization' has the same limitation
as 'internationalization', as described below.
<footnote>
Another reason is that such 'localization' only produces many localized
versions instead of a single internationalized version. This
means that labor is dispersed. Since our labor is limited, it is more
efficient to concentrate on a single, properly internationalized
version, whose users in a specific language only have to set the
<tt>LANG</tt> environment variable to make the software work properly.
Also, an 'internationalized distribution' is our goal.
</footnote>
Instead, 'localization' in this document means what I described above,
because there is much localized software which adopts 'localization'
in my sense, and Debian developers and package maintainers are
interested in unifying this localized software.
|
| 213 |
</P>
|
| 214 |
|
| 215 |
<P>
|
| 216 |
Now let me classify the approaches to supporting non-English languages
from another viewpoint.
|
| 218 |
</P>
|
| 219 |
|
| 220 |
<P>
|
| 221 |
<taglist>
|
| 222 |
<tag>A. Implementation <em>without</em> Knowledge on Each Language</tag>
|
| 223 |
<item><p>
|
| 224 |
This approach is made possible by using standardized methods supplied
by the kernel or libraries, such as LOCALE, <tt>wchar_t</tt>, and
<prgn>gettext</prgn>.
The advantages of this approach are (1) that the software may support
additional languages when the kernel or libraries are upgraded
and (2) that programmers need not know each language.
The disadvantage is that there are categories or fields where
no standardized method is available. So far, standardized
methods such as LOCALE and <prgn>gettext</prgn> are available in the
field of I18N, but no standards are established for the M17N approach.
Furthermore, there is no standard for the number of columns a
character occupies, nor for methods of inputting non-English
languages on the console (that is, an interface to an input library).
|
| 237 |
</p></item>
|
| 238 |
|
| 239 |
<tag>B. Implementation Using Knowledge on Each Language</tag>
|
| 240 |
<item><p>
|
| 241 |
This approach directly implements information about
each language, based on the knowledge of programmers and
contributors. L10N almost always uses this approach.
The advantage of this approach is that a detailed and strict
implementation is possible even beyond the fields where
standardized methods are available. Language-specific
problems can be solved perfectly (depending, of course, on
the skill of the programmer). The disadvantages are
(1) that the number of supported languages is restricted
by the skill or the interest of the programmers or the
contributors and (2) that labor which should be united and
concentrated on upgrading the kernel or libraries is dispersed
across many programs, that is, the wheel is reinvented.
However, a majestic M17N program such as Mule can be
built by pushing this approach hard.
|
| 256 |
</p></item>
|
| 257 |
</taglist>
|
| 258 |
</P>
|
| 259 |
|
| 260 |
<P>
|
| 261 |
Using this classification, let me consider L10N, I18N, and M17N
from the programmer's point of view.
|
| 263 |
</P>
|
| 264 |
|
| 265 |
<P>
|
| 266 |
L10N can be achieved using only the programmer's own knowledge of his
or her language. For example, all you have to do is implement
your knowledge of the SHIFT-JIS coding system. Since the motivation
for L10N is usually to satisfy the programmer's own needs, extensibility
to a third language is often ignored. Thus approach B, not A, is
taken.
Though L10N-ed software is basically useful for people who
speak the same language as the programmer, it is sometimes
also useful for other people whose coding system is similar to
the programmer's. For example, software which
doesn't recognize EUC-JP but doesn't break EUC-JP will not
break EUC-KR either.
|
| 278 |
</P>
|
| 279 |
|
| 280 |
<P>
|
| 281 |
In the case of C programs, the main part of I18N is achieved using
standardized methods such as LOCALE, <tt>wchar_t</tt>,
and <prgn>gettext</prgn>.
The LOCALE approach is classified as I18N because functions
related to LOCALE change their behavior according to a parameter
passed to <tt>setlocale()</tt> or to environment variables such as
<tt>LANG</tt>.
That is, approach A is emphasized for I18N. In fields where
standardized methods are not available, however, approach B
cannot be avoided. Even in such a case, the interface and
the support for each language should be designed separately
so that support for new languages can be added easily.
|
| 292 |
</P>
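<P>
As a rough illustration of approach A, the following sketch uses
Python's standard <tt>gettext</tt> module in place of the C interface
discussed above. The domain name <tt>hypothetical-app</tt> and the catalog
directory are assumptions made up for this example; the program itself
contains no knowledge of any particular language.
</P>

```python
import gettext


def load_translator(domain, localedir, languages=None):
    """Return a message lookup function for the requested languages.

    When languages is None, the gettext module consults the LANGUAGE,
    LC_ALL, LC_MESSAGES, and LANG environment variables in that order.
    With fallback=True, a missing catalog yields the untranslated text.
    """
    catalog = gettext.translation(
        domain, localedir=localedir, languages=languages, fallback=True
    )
    return catalog.gettext


# Hypothetical domain and directory: no catalog exists there, so the
# original English message is returned unchanged.
_ = load_translator("hypothetical-app", "/nonexistent/locale", languages=["ja"])
message = _("File not found")
```

<P>
Support for a new language is then added by installing a message
catalog, without touching the program.
</P>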
|
| 293 |
|
| 294 |
<P>
|
| 295 |
Unfortunately, there are no standardized methods for M17N so far.
The exceptions are the ISO-2022-INT-* and UNICODE codesets, which can
express many languages at the same time. However, ISO-2022-INT-*
is stateful, so implementation may be difficult;
UNICODE lacks compatibility with East Asian standards,
and UNICODE itself has many variants (UCS-* and UTF-*), though
they can easily be converted into one another. Of course, M17N-ed
software cannot be written with an M17N-ed codeset alone.
Thus approach B cannot be avoided for M17N so far.
Efforts should be made toward standardization in the various
fields of M17N. Mule is the only software which has achieved M17N.
|
| 306 |
</P>
|
| 307 |
|
| 308 |
<P>
|
| 309 |
This document focuses on I18N. Note that I18N-ed software
cannot process a text file which contains three or more languages,
for example, Finnish, Chinese, and Korean (a paper written in
Finnish comparing Chinese and Korean). M17N is needed
for such a case. Don't forget that the true goal is M17N and
that I18N is a compromise.
|
| 315 |
</P>
|
| 316 |
|
| 317 |
<P>
|
| 318 |
For people using non-Latin letters, I18N is not about
messages written in their languages or file names written
in their languages. Yes, it is true that these should be achieved.
However, considering our current state, we can say that such
requirements are too much of a luxury. What we truly need is,
for example, that characters in our languages are displayed
with the correct font without corrupting the screen, that
a way to input our characters is supplied, and
that our languages can be input correctly. It would be
good if text-processing software such as <prgn>perl</prgn>
and <prgn>grep</prgn> processed our languages correctly.
|
| 329 |
</P>
|
| 330 |
|
| 331 |
<P>
|
| 332 |
Given the circumstances in which we find ourselves, the author
concentrates on the problems that truly need solving rather
than on correct and ideal I18N/M17N.
In other words, the focus of this document is on how
characters should be displayed, input, and processed without
destroying them, not on time display formats, currency symbols,
and so on.
|
| 339 |
</P>
|
| 340 |
|
| 341 |
<chapt id="coding"><heading>Character Coding Systems</heading>
|
| 342 |
|
| 343 |
<P>
|
| 344 |
Here the major character sets and codesets are introduced.
The last section of this chapter contains information
on each language. Contributions to that section for
many languages are especially welcome, though contributions
to the whole text are of course welcome too.
|
| 349 |
</P>
|
| 350 |
|
| 351 |
<sect id="coding-general"><heading>General Discussions</heading>
|
| 352 |
|
| 353 |
<sect1 id="codeset"><heading>Character / Character Set / Coded Character Set (or Codeset)</heading>
|
| 354 |
|
| 355 |
<P>
|
| 356 |
A <strong>CHARACTER CODE</strong> is a set of bit combinations used to
handle characters in computers. To define a character code,
one first needs to determine the set of characters to be encoded.
|
| 359 |
</P>
|
| 360 |
|
| 361 |
<P>
|
| 362 |
This set of characters is called a <strong>CHARACTER SET</strong>.
There are many character set standards in the world.
For example, JIS X 0208 contains the main characters used in Japanese.
Usually, a character set is not only a collection of characters;
each character in the set also has its own number.
Usually the numbering is done so that the set
is consistent with international standards.
For example, many 7bit local character sets are identical to
one of the ISO 646-* codesets. Many other local character sets are
related to ISO 8859 or ISO 2022.
|
| 372 |
</P>
|
| 373 |
|
| 374 |
<P>
|
| 375 |
Then one selects a character set or multiple character sets and
assigns codes to the characters included in the character set(s).
This way of assigning codes is called <strong>ENCODING</strong>.
The set of encoded characters is called a <strong>CODED CHARACTER SET</strong>
or <strong>CODESET</strong>.
For example, the ISO-2022-JP <strong>codeset</strong> contains the
<strong>character set</strong>s ASCII, JIS X 0201 Roman,
and JIS X 0208 Kanji.
Encoding for a codeset that includes multiple character sets
is usually done in two stages: first within each character
set and then for the combination of character sets.
|
| 386 |
</P>
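<P>
The two stages can be sketched in Python as follows. The function
names are made up for this illustration. Stage 1 turns a (row, cell)
position in JIS X 0208 into a two-byte code; stage 2 embeds that code
in a codeset, here ISO-2022-JP and EUC-JP.
</P>

```python
def jis_x_0208_code(row, cell):
    """Stage 1: map a (row, cell) position, each 1..94, to two 7-bit bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])


def to_iso_2022_jp(code):
    """Stage 2 (stateful): switch to JIS X 0208-1983, then back to ASCII."""
    return b"\x1b$B" + code + b"\x1b(B"


def to_euc_jp(code):
    """Stage 2 (stateless): set the high bit so codes never collide with ASCII."""
    return bytes(b | 0x80 for b in code)


# Row 4, cell 2 of JIS X 0208 is HIRAGANA LETTER A (U+3042).
hiragana_a = jis_x_0208_code(4, 2)
```

<P>
The same character set number thus ends up as different byte sequences
depending on the codeset, which is why the two terms must be kept apart.
</P>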
|
| 387 |
|
| 388 |
<P>
|
| 389 |
For a codeset including only one character set, we don't
|
| 390 |
have to distinguish 'character set' and 'codeset'.
|
| 391 |
For example, ASCII is a character set and a codeset at the
|
| 392 |
same time.
|
| 393 |
</P>
|
| 394 |
|
| 395 |
<sect1 id="stateful"><heading>Stateless and Stateful</heading>
|
| 396 |
|
| 397 |
<P>
|
| 398 |
For a codeset including multiple character sets, one needs
to determine how to combine these character sets when encoding.
There are two ways to do that. One is to give every character
in all the character sets a unique code. The other is to
allow characters from different character sets to share the same
code and to provide a code, such as an escape sequence, that switches
the <strong>SHIFT STATE</strong>, that is, selects one character set.
|
| 405 |
</P>
|
| 406 |
|
| 407 |
<P>
|
| 408 |
A codeset with shift state is called <strong>STATEFUL</strong> and
|
| 409 |
one without shift state is called <strong>STATELESS</strong>.
|
| 410 |
</P>
|
| 411 |
|
| 412 |
<P>
|
| 413 |
Generally, stateful codesets can contain more characters than
stateless ones. However, implementing a stateful codeset
is much more difficult than implementing a stateless one.
|
| 416 |
</P>
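<P>
The difference can be observed with the codecs shipped with Python,
used here only as a convenient test bench. In the stateful ISO-2022-JP
codeset the same two bytes mean different things depending on the
preceding escape sequence, whereas in the stateless EUC-JP codeset
a byte sequence has one meaning regardless of what came before.
</P>

```python
payload = b'$"'

# In the initial (ASCII) shift state the bytes are a dollar sign and a quote.
in_ascii_state = payload.decode("iso2022_jp")

# After ESC $ B the same bytes select a JIS X 0208 character instead.
in_kanji_state = (b"\x1b$B" + payload + b"\x1b(B").decode("iso2022_jp")

# EUC-JP needs no shift state: these two bytes always mean the same character.
stateless = b"\xa4\xa2".decode("euc_jp")
```

<P>
A program which cuts an ISO-2022-JP string at an arbitrary byte offset
therefore has to track the shift state, which is part of what makes
stateful codesets harder to implement.
</P>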
|
| 417 |
|
| 418 |
<sect1 id="number"><heading>Number of Bytes, Number of Characters, and Number of Columns</heading>
|
| 419 |
|
| 420 |
<P>
|
| 421 |
One ASCII character is always expressed by one byte
and occupies one column on the console or in a fixed-width font for X.
One must not make such an assumption in I18N programming
and has to distinguish clearly between the number of bytes,
characters, and columns.
|
| 426 |
</P>
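<P>
A minimal sketch of the three counts in Python follows. The helper
<tt>display_columns</tt> is a made-up name, and its rule (two columns for
characters that Unicode classifies as wide or fullwidth, one otherwise)
is only one common convention, not a standard; it ignores combining
and control characters.
</P>

```python
import unicodedata


def display_columns(text):
    """Approximate the number of terminal columns a string occupies."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


text = "A\u3042"                                  # LATIN A + HIRAGANA A
n_chars = len(text)                               # number of characters
n_bytes_euc = len(text.encode("euc_jp"))          # bytes in EUC-JP
n_bytes_iso2022 = len(text.encode("iso2022_jp"))  # bytes incl. escape sequences
n_columns = display_columns(text)                 # columns on a terminal
```

<P>
For this two-character string the three counts all differ, and the
byte count itself depends on the codeset.
</P>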
|
| 427 |
|
| 428 |
<sect id="standards"><heading>Standards for Character Codes</heading>
|
| 429 |
|
| 430 |
<sect1 id="ascii"><heading>ASCII and ISO 646</heading>
|
| 431 |
|
| 432 |
<P>
|
| 433 |
<strong>ASCII</strong> is a character set and also a codeset at the same time.
|
| 434 |
ASCII is 7bit and contains 94 printable characters which are
|
| 435 |
encoded in the region of 0x21-0x7e.
|
| 436 |
</P>
|
| 437 |
|
| 438 |
<P>
|
| 439 |
<strong>ISO 646</strong> is the international standard corresponding
to ASCII. The following 12 characters:
|
| 441 |
<list>
|
| 442 |
<item>0x23 (number),
|
| 443 |
<item>0x24 (dollar),
|
| 444 |
<item>0x40 (at),
|
| 445 |
<item>0x5b (left square bracket),
|
| 446 |
<item>0x5c (backslash),
|
| 447 |
<item>0x5d (right square bracket),
|
| 448 |
<item>0x5e (caret),
|
| 449 |
<item>0x60 (backquote),
|
| 450 |
<item>0x7b (left curly brace),
|
| 451 |
<item>0x7c (vertical line),
|
| 452 |
<item>0x7d (right curly brace), and
|
| 453 |
<item>0x7e (tilde)
|
| 454 |
</list>
|
| 455 |
are called <strong>IRV</strong> (International Reference Version)
and the other 82 (94 - 12 = 82) characters are called
<strong>BCT</strong> (Basic Code Table).
Characters at the IRV positions can differ between countries.
For example, the UK version of ISO 646 has the pound currency
symbol at 0x23 and the Japanese version has the yen currency
symbol at 0x5c. The US version of ISO 646 is the same as ASCII.
|
| 462 |
</P>
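<P>
The effect of these national positions can be sketched as follows.
The table is deliberately partial: it contains only the two differences
mentioned above, while real national versions may differ from ASCII at
other positions too. The variant keys and the function name are made up
for this example.
</P>

```python
# Overrides relative to ASCII; only the positions named in the text above.
ISO646_OVERRIDES = {
    "us": {},
    "gb": {0x23: "\u00a3"},  # POUND SIGN replaces the number sign
    "jp": {0x5C: "\u00a5"},  # YEN SIGN replaces the backslash
}


def decode_iso646(data, variant):
    """Decode 7-bit bytes under a (partial) national version of ISO 646."""
    overrides = ISO646_OVERRIDES[variant]
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError("ISO 646 is a 7-bit code")
        chars.append(overrides.get(byte, chr(byte)))
    return "".join(chars)


raw = b"C:\\dir #1"
as_us = decode_iso646(raw, "us")
as_jp = decode_iso646(raw, "jp")
as_gb = decode_iso646(raw, "gb")
```

<P>
The same bytes are shown with a backslash, a yen sign, or a pound sign
depending on the version assumed by the reader.
</P>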
|
| 463 |
|
| 464 |
<P>
|
| 465 |
As far as I know, all codesets in the world contain
the ISO 646 character set.
|
| 467 |
</P>
|
| 468 |
|
| 469 |
<P>
|
| 470 |
Characters in 0x00 - 0x1f and 0x7f are control characters; 0x20 is the space character.
|
| 471 |
</P>
|
| 472 |
|
| 473 |
<P>
|
| 474 |
Nowadays, the use of codesets incompatible with ASCII is discouraged,
and thus ISO 646-* should not be used. One reason is that
when a string is converted into Unicode, the converter doesn't
know whether IRV characters should be converted into characters with
the same shapes or characters with the same codes. Another reason is
that source code is written in ASCII, and source code must be
readable anywhere.
|
| 480 |
</P>
|
| 481 |
|
| 482 |
|
| 483 |
|
| 484 |
<sect1 id="iso8859"><heading>ISO 8859</heading>
|
| 485 |
|
| 486 |
<P>
|
| 487 |
<strong>ISO 8859</strong> is an extension of ASCII using all 8 bits.
An additional 96 printable characters encoded in 0xa0 - 0xff are
available besides the 94 ASCII printable characters.
|
| 490 |
</P>
|
| 491 |
|
| 492 |
<P>
|
| 493 |
There are 10 variants of ISO 8859 (as of 1997).
|
| 494 |
<taglist>
|
| 495 |
<tag>ISO-8859-1 Latin alphabet No.1 (1987)</tag>
|
| 496 |
<item>characters for western European languages
|
| 497 |
<tag>ISO-8859-2 Latin alphabet No.2 (1987)</tag>
|
| 498 |
<item>characters for central European languages
|
| 499 |
<tag>ISO-8859-3 Latin alphabet No.3 (1988)</tag>
|
| 500 |
<tag>ISO-8859-4 Latin alphabet No.4 (1988)</tag>
|
| 501 |
<item>characters for northern European languages
|
| 502 |
<tag>ISO-8859-5 Latin/Cyrillic alphabet (1988)</tag>
|
| 503 |
<tag>ISO-8859-6 Latin/Arabic alphabet (1987)</tag>
|
| 504 |
<tag>ISO-8859-7 Latin/Greek alphabet (1987)</tag>
|
| 505 |
<tag>ISO-8859-8 Latin/Hebrew alphabet (1988)</tag>
|
| 506 |
<tag>ISO-8859-9 Latin alphabet No.5 (1989)</tag>
|
| 507 |
<item>same as ISO-8859-1 except for Turkish instead of Icelandic
|
| 508 |
<tag>ISO-8859-10 Latin alphabet No.6 (1993)</tag>
|
| 509 |
<item>Adds Inuit (Greenlandic) and Sami (Lappish) letters to ISO-8859-4
|
| 510 |
</taglist>
|
| 511 |
</P>
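<P>
Because all variants share the range 0xa0 - 0xff, the meaning of a byte
in that range depends on which variant is assumed. The following sketch
uses the ISO 8859 codecs shipped with Python to decode the same byte
under three variants.
</P>

```python
raw = bytes([0xE4])

decoded = {
    name: raw.decode(name)
    for name in ("iso8859_1", "iso8859_5", "iso8859_7")
}

# The ASCII half is common to every variant.
ascii_half = {
    name: b"abc".decode(name)
    for name in ("iso8859_1", "iso8859_5", "iso8859_7")
}
```

<P>
Here one byte yields a Latin, a Cyrillic, and a Greek letter respectively,
so software must know which variant a text uses.
</P>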
|
| 512 |
|
| 513 |
<P>
|
| 514 |
A detailed explanation is found at
|
| 515 |
<url id="http://park.kiev.ua/mutliling/ml-docs/iso-8859.html" name="&urlname">.
|
| 516 |
</P>
|
| 517 |
|
| 518 |
|
| 519 |
<sect1 id="iso-2022"><heading>ISO 2022</heading>
|
| 520 |
|
| 521 |
<P>
|
| 522 |
<strong>ISO 2022</strong> is a very powerful codeset in which multiple
character sets, both 1-byte and multibyte, can be
expressed at the same time. It is stateful.
There are many codesets that are subsets of ISO 2022, for example,
ISO-2022-JP, EUC, and compound-text. ISO-2022-*
is widely used for mail/news. EUC has several variants,
for example, EUC-JP and EUC-KR, and is widely used on
UNIX(-like) systems. Compound-text is the standard
codeset for the X Window System.
|
| 531 |
</P>
|
| 532 |
|
| 533 |
<P>
|
| 534 |
The sixth edition of ECMA-35 is fully identical with
|
| 535 |
ISO 2022:1994 and you can find the official document
|
| 536 |
at <url id="http://www.ecma.ch/stand/ECMA-035.HTM" name="&urlname">.
|
| 537 |
</P>
|
| 538 |
|
| 539 |
<P>
|
| 540 |
ISO 2022 has two versions, 7bit and 8bit. The 8bit
version is explained first. The 7bit version is a subset
of the 8bit version.
|
| 543 |
</P>
|
| 544 |
|
| 545 |
<P>
|
| 546 |
The 8bit code space is divided into four regions,
|
| 547 |
<list>
|
| 548 |
<item>0x00 - 0x1f: C0 (Control Characters 0),
|
| 549 |
<item>0x20 - 0x7f: GL (Graphic Characters Left),
|
| 550 |
<item>0x80 - 0x9f: C1 (Control Characters 1), and
|
| 551 |
<item>0xa0 - 0xff: GR (Graphic Characters Right).
|
| 552 |
</list>
|
| 553 |
</P>
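<P>
The four regions listed above can be written down as a small
classifier. The function name is made up for this sketch.
</P>

```python
def region(byte):
    """Return the ISO 2022 8bit region (C0, GL, C1, or GR) of a byte value."""
    if not 0x00 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte <= 0x1F:
        return "C0"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "C1"
    return "GR"


# ESC is a C0 control, "A" lies in GL, 0x8e in C1, and 0xa4 in GR.
examples = [region(b) for b in (0x1B, 0x41, 0x8E, 0xA4)]
```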
|
| 554 |
|
| 555 |
<P>
|
| 556 |
GL and GR are the spaces where (printable) character sets are mapped.
|
| 557 |
</P>
|
| 558 |
|
| 559 |
<P>
|
| 560 |
Next, all character sets, for example, ASCII, ISO 646-UK,
and JIS X 0208, are classified into the following four categories,
|
| 562 |
<list>
|
| 563 |
<item>(1) character set with 1-byte 94-character,
|
| 564 |
<item>(2) character set with 1-byte 96-character,
|
| 565 |
<item>(3) character set with multibyte 94-character, and
|
| 566 |
<item>(4) character set with multibyte 96-character.
|
| 567 |
</list>
|
| 568 |
</P>
|
| 569 |
|
| 570 |
<P>
|
| 571 |
Characters in a 94-character set are mapped
into 0x21 - 0x7e. Characters in a 96-character set are
mapped into 0x20 - 0x7f.
|
| 574 |
</P>
|
| 575 |
|
| 576 |
<P>
|
| 577 |
For example, ASCII, ISO 646-UK, and JIS X 0201 Katakana
are classified as (1); JIS X 0208 Japanese Kanji,
KS C 5601 Korean, and GB 2312-80 Chinese are classified as (3);
and ISO 8859-* are classified as (2).
|
| 581 |
</P>
|
| 582 |
|
| 583 |
<P>
|
| 584 |
The mechanism to map these character sets into GL and GR is
|
| 585 |
a bit complex. There are four buffers, G0, G1, G2, and G3.
|
| 586 |
A character set is <strong>designated</strong> into one of these buffers
|
| 587 |
and then a buffer is <strong>invoked</strong> into GL or GR.
|
| 588 |
</P>
|
| 589 |
|
| 590 |
<P>
|
| 591 |
Control sequences to 'designate' a character set into a
|
| 592 |
buffer are determined as below.
|
| 593 |
</P>
|
| 594 |
|
| 595 |
<P>
|
| 596 |
<list>
|
| 597 |
<item>A sequence to designate a character set with 1-byte 94-character
|
| 598 |
<list>
|
| 599 |
<item>into G0 set is: ESC 0x28 F,
|
| 600 |
<item>into G1 set is: ESC 0x29 F,
|
| 601 |
<item>into G2 set is: ESC 0x2a F, and
|
| 602 |
<item>into G3 set is: ESC 0x2b F.
|
| 603 |
</list>
|
| 604 |
<item>A sequence to designate a character set with 1-byte 96-character
|
| 605 |
<list>
|
| 606 |
<item>into G1 set is: ESC 0x2d F,
|
| 607 |
<item>into G2 set is: ESC 0x2e F, and
|
| 608 |
<item>into G3 set is: ESC 0x2f F.
|
| 609 |
</list>
|
| 610 |
<item>A sequence to designate a character set with multibyte 94-character
|
| 611 |
<list>
|
| 612 |
<item>into G0 set is: ESC 0x24 0x28 F
|
| 613 |
(exception: 'ESC 0x24 F' for F = 0x40, 0x41, 0x42.),
|
| 614 |
<item>into G1 set is: ESC 0x24 0x29 F,
|
| 615 |
<item>into G2 set is: ESC 0x24 0x2a F, and
|
| 616 |
<item>into G3 set is: ESC 0x24 0x2b F.
|
| 617 |
</list>
|
| 618 |
<item>A sequence to designate a character set with multibyte 96-character
|
| 619 |
<list>
|
| 620 |
<item>into G1 set is: ESC 0x24 0x2d F,
|
| 621 |
<item>into G2 set is: ESC 0x24 0x2e F, and
|
| 622 |
<item>into G3 set is: ESC 0x24 0x2f F.
|
| 623 |
</list>
|
| 624 |
</list>
|
| 625 |
where 'F' is determined for each character set:
|
| 626 |
<list>
|
| 627 |
<item>character set with 1-byte 94-character
|
| 628 |
<list>
|
| 629 |
<item>F=0x40 for ISO 646 IRV: 1983
|
| 630 |
<item>F=0x41 for BS 4730 (UK)
|
| 631 |
<item>F=0x42 for ANSI X3.4-1968 (ASCII)
|
| 632 |
<item>F=0x43 for NATS Primary Set for Finland and Sweden
|
| 633 |
<item>F=0x49 for JIS X 0201 Katakana
|
| 634 |
<item>F=0x4a for JIS X 0201 Roman (Latin)
|
| 635 |
<item>and more
|
| 636 |
</list>
|
| 637 |
<item>character set with 1-byte 96-character
|
| 638 |
<list>
|
| 639 |
<item>F=0x41 for ISO 8859-1 Latin-1
|
| 640 |
<item>F=0x42 for ISO 8859-2 Latin-2
|
| 641 |
<item>F=0x43 for ISO 8859-3 Latin-3
|
| 642 |
<item>F=0x44 for ISO 8859-4 Latin-4
|
| 643 |
<item>F=0x46 for ISO 8859-7 Latin/Greek
|
| 644 |
<item>F=0x47 for ISO 8859-6 Latin/Arabic
|
| 645 |
<item>F=0x48 for ISO 8859-8 Latin/Hebrew
|
| 646 |
<item>F=0x4c for ISO 8859-5 Latin/Cyrillic
|
| 647 |
<item>and more
|
| 648 |
</list>
|
| 649 |
<item>character set with multibyte 94-character
|
| 650 |
<list>
|
| 651 |
<item>F=0x40 for JIS X 0208-1978 Japanese
|
| 652 |
<item>F=0x41 for GB 2312-80 Chinese
|
| 653 |
<item>F=0x42 for JIS X 0208-1983 Japanese
|
| 654 |
<item>F=0x43 for KS C 5601 Korean
|
| 655 |
<item>F=0x44 for JIS X 0212-1990 Japanese
|
| 656 |
<item>F=0x45 for CCITT Extended GB (ISO-IR-165)
|
| 657 |
<item>F=0x47 for CNS 11643-1992 Set 1 (Taiwan)
|
| 658 |
<item>F=0x48 for CNS 11643-1992 Set 2 (Taiwan)
|
| 659 |
<item>F=0x49 for CNS 11643-1992 Set 3 (Taiwan)
|
| 660 |
<item>F=0x4a for CNS 11643-1992 Set 4 (Taiwan)
|
| 661 |
<item>F=0x4b for CNS 11643-1992 Set 5 (Taiwan)
|
| 662 |
<item>F=0x4c for CNS 11643-1992 Set 6 (Taiwan)
|
| 663 |
<item>F=0x4d for CNS 11643-1992 Set 7 (Taiwan)
|
| 664 |
<item>and more
|
| 665 |
</list>
|
| 666 |
</list>
|
| 667 |
<footnote>
|
| 668 |
WHERE CAN I FIND THE COMPLETE AND AUTHORITATIVE TABLE OF THIS?
|
| 669 |
</footnote>
|
| 670 |
</P>
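<P>
The designation rules above are regular enough to be generated by a
small function. This is only a sketch built from the lists in this
section; the function and table names are made up for the example.
</P>

```python
ESC = 0x1B

# Intermediate byte for each (set size, buffer) pair.
# 96-character sets cannot be designated into G0.
INTERMEDIATE = {
    (94, "G0"): 0x28, (94, "G1"): 0x29, (94, "G2"): 0x2A, (94, "G3"): 0x2B,
    (96, "G1"): 0x2D, (96, "G2"): 0x2E, (96, "G3"): 0x2F,
}


def designate(size, multibyte, buffer, final):
    """Build the escape sequence designating a character set into a buffer."""
    try:
        intermediate = INTERMEDIATE[(size, buffer)]
    except KeyError:
        raise ValueError(f"cannot designate a {size}-set into {buffer}") from None
    if not multibyte:
        return bytes([ESC, intermediate, final])
    # Historical exception: the short form for F = 0x40, 0x41, 0x42 into G0.
    if size == 94 and buffer == "G0" and final in (0x40, 0x41, 0x42):
        return bytes([ESC, 0x24, final])
    return bytes([ESC, 0x24, intermediate, final])


ascii_to_g0 = designate(94, False, "G0", 0x42)      # ESC ( B
jisx0208_to_g0 = designate(94, True, "G0", 0x42)    # ESC $ B
ksc5601_to_g1 = designate(94, True, "G1", 0x43)     # ESC $ ) C
latin1_to_g2 = designate(96, False, "G2", 0x41)     # ESC . A
```

<P>
The first two sequences are the ones that appear in ISO-2022-JP text,
which can be checked against the output of an existing encoder.
</P>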
|
| 671 |
|
| 672 |
<P>
|
| 673 |
Control codes to 'invoke' one of G{0123} into GL or GR
are defined as follows.
|
| 675 |
<list>
|
| 676 |
<item>A control code to invoke G0 into GL is: SI (Shift In), called LS0 (Locking Shift 0) in the 8bit version
<item>A control code to invoke G1 into GL is: SO (Shift Out), called LS1 (Locking Shift 1) in the 8bit version
|
| 678 |
<item>A control code to invoke G2 into GL is: LS2 (Locking Shift 2)
|
| 679 |
<item>A control code to invoke G3 into GL is: LS3 (Locking Shift 3)
|
| 680 |
<item>A control code to invoke one character
|
| 681 |
in G2 into GL is: SS2 (Single Shift 2)
|
| 682 |
<item>A control code to invoke one character
|
| 683 |
in G3 into GL is: SS3 (Single Shift 3)
|
| 684 |
<item>A control code to invoke G1 into GR is: LS1R (Locking Shift 1 Right)
|
| 685 |
<item>A control code to invoke G2 into GR is: LS2R (Locking Shift 2 Right)
|
| 686 |
<item>A control code to invoke G3 into GR is: LS3R (Locking Shift 3 Right)
|
| 687 |
</list>
|
| 688 |
<footnote>
|
| 689 |
WHAT IS THE VALUE OF THESE CONTROL CODES?
|
| 690 |
</footnote>
|
| 691 |
</P>
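<P>
The invocation codes can be collected in a table together with a
function that tracks which buffer is currently invoked into GL and GR.
The byte values are given as I understand them from ECMA-35 and should
be checked against the standard before being relied upon; SS2 and SS3
are shown in their 8bit form. The names <tt>INVOCATIONS</tt> and
<tt>invoke</tt> are made up for this sketch.
</P>

```python
# name: (byte sequence, buffer, target region, kind of shift)
INVOCATIONS = {
    "SI":   (b"\x0f",  "G0", "GL", "locking"),  # also called LS0
    "SO":   (b"\x0e",  "G1", "GL", "locking"),  # also called LS1
    "LS2":  (b"\x1bn", "G2", "GL", "locking"),
    "LS3":  (b"\x1bo", "G3", "GL", "locking"),
    "SS2":  (b"\x8e",  "G2", "GL", "single"),
    "SS3":  (b"\x8f",  "G3", "GL", "single"),
    "LS1R": (b"\x1b~", "G1", "GR", "locking"),
    "LS2R": (b"\x1b}", "G2", "GR", "locking"),
    "LS3R": (b"\x1b|", "G3", "GR", "locking"),
}


def invoke(state, name):
    """Return the new {"GL": ..., "GR": ...} state after an invocation.

    A single shift affects only the next character, so the lasting
    state is returned unchanged for SS2 and SS3.
    """
    _, buffer, target, kind = INVOCATIONS[name]
    new_state = dict(state)
    if kind == "locking":
        new_state[target] = buffer
    return new_state


initial = {"GL": "G0", "GR": "G1"}
after_so = invoke(initial, "SO")
after_si = invoke(after_so, "SI")
after_ls2r = invoke(initial, "LS2R")
after_ss2 = invoke(initial, "SS2")
```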
|
| 692 |
|
| 693 |
<P>
|
| 694 |
Note that a character code in a character set invoked into GR is
|
| 695 |
or-ed with 0x80.
|
| 696 |
</P>
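<P>
In code, moving a character from its 7bit position into GR is a single
bit operation. The sketch below uses a made-up helper name and checks the
result against the ISO 8859-1 codec shipped with Python: position 0x64
of the right-hand part of ISO 8859-1 appears as byte 0xe4 in GR.
</P>

```python
def to_gr(code):
    """Move a 7-bit graphic code (0x20..0x7f) into GR by setting the high bit."""
    if not 0x20 <= code <= 0x7F:
        raise ValueError("not a graphic code")
    return code | 0x80


gr_byte = to_gr(0x64)
decoded = bytes([gr_byte]).decode("iso8859_1")
```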
|
| 697 |
|
| 698 |
<P>
|
| 699 |
ISO 2022 also defines <strong>announcer</strong> codes. For example,
'ESC 0x20 0x41' means 'Only the G0 buffer is used. G0 is already
invoked into GL'. This simplifies the coding system. Even this
announcer can be omitted if the people who exchange data agree.
|
| 703 |
</P>
|
| 704 |
|
| 705 |
<P>
|
| 706 |
7bit version of ISO 2022 is a subset of 8bit version. It does not
|
| 707 |
use C1 and GR.
|
| 708 |
</P>
|
| 709 |
|
| 710 |
<P>
|
| 711 |
Explanation on C0 and C1 is omitted here.
|
| 712 |
</P>
|
| 713 |
|
| 714 |
|
| 715 |
<sect2 id="compound"><heading>Compound Text</heading>
|
| 716 |
|
| 717 |
<P>
|
| 718 |
<strong>Compound Text</strong> is a subset of ISO 2022
which is used by X clients to communicate with each other,
for example, for copy-paste.
|
| 721 |
</P>
|
| 722 |
|
| 723 |
<P>
|
| 724 |
Compound Text is stateful.
|
| 725 |
<footnote>
|
| 726 |
I HAVE TO WRITE EXPLANATION.
|
| 727 |
</footnote>
|
| 728 |
</P>
|
| 729 |
|
| 730 |
|
| 731 |
|
| 732 |
<sect2 id="euc"><heading>EUC (Extended Unix Code)</heading>
|
| 733 |
|
| 734 |
<P>
|
| 735 |
<strong>EUC</strong> is a subset of the 8bit version of ISO 2022 except for the
usage of the SS2 and SS3 codes. Though these codes are used
to invoke G2 and G3 into GL in ISO 2022, they invoke them
into GR in EUC.
EUC is not a specific codeset but a way to build codesets,
for example, EUC-Japanese and EUC-Korean.
|
| 741 |
</P>
|
| 742 |
|
| 743 |
<P>
|
| 744 |
EUC is stateless.
|
| 745 |
</P>
|
| 746 |
|
| 747 |
<P>
|
| 748 |
EUC can contain up to four character sets by designating a specific
character set to each of G0, G1, G2, and G3.
Though there is no requirement that ASCII be designated to G0,
I do not know of any EUC codeset in which ASCII is not designated to G0.
|
| 752 |
</P>
|
| 753 |
|
| 754 |
<P>
|
| 755 |
For EUC with ASCII in G0, all characters other than ASCII are encoded
with bytes in 0x80 - 0xff, so the codeset is upward-compatible with ASCII.
|
| 757 |
</P>
|
| 758 |
|
| 759 |
<P>
|
| 760 |
The byte sequences for characters in the G0, G1, G2, and G3 character sets
are shown below in binary:
|
| 762 |
<list>
|
| 763 |
<item>G0: 0???????
|
| 764 |
<item>G1: 1??????? [1??????? [...]]
|
| 765 |
<item>G2: SS2 1??????? [1??????? [...]]
|
| 766 |
<item>G3: SS3 1??????? [1??????? [...]]
|
| 767 |
</list>
|
| 768 |
</P>
|
| 769 |
|
| 770 |
<P>
|
| 771 |
where SS2 is 0x8e and SS3 is 0x8f.
|
| 772 |
</P>
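<P>
Because EUC is stateless, the first byte of a character is enough to tell
which of G0 - G3 it belongs to. A minimal sketch follows; the function
name <tt>euc_codeset_of()</tt> is hypothetical, and the treatment of
0x80 - 0x9f other than SS2 and SS3 as 'not a character start' is my
assumption (these bytes are C1 control codes).
</P>

```c
#include <assert.h>

#define EUC_SS2 0x8e
#define EUC_SS3 0x8f

/*
 * Given the first byte of an EUC character, return which of G0 - G3
 * the character belongs to, or -1 if the byte is a C1 control code
 * other than SS2/SS3 and so cannot start a graphic character.
 */
static int euc_codeset_of(unsigned char first_byte)
{
    if (first_byte < 0x80)
        return 0;                 /* G0: 0??????? */
    if (first_byte == EUC_SS2)
        return 2;                 /* G2: SS2 1??????? ... */
    if (first_byte == EUC_SS3)
        return 3;                 /* G3: SS3 1??????? ... */
    if (first_byte >= 0xa0)
        return 1;                 /* G1: 1??????? ... */
    return -1;                    /* other C1 control code */
}
```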
|
| 773 |
|
| 774 |
|
| 775 |
|
| 776 |
<sect1 id="unicodes"><heading>ISO/IEC 10646 (UCS-4, UCS-2), UNICODE, UTF-8, UTF-16</heading>
|
| 777 |
|
| 778 |
<P>
|
| 779 |
These codesets are intended to express all characters in the
world in a single unified character set.
|
| 781 |
</P>
|
| 782 |
|
| 783 |
<P>
|
| 784 |
In this document UCS-4 and UCS-2 are regarded as character sets
|
| 785 |
and also codesets. The others are codesets using UCS-4 or its
|
| 786 |
subset as a character set.
|
| 787 |
</P>
|
| 788 |
|
| 789 |
|
| 790 |
<sect2 id="unicode"><heading>Unicode 2.1</heading>
|
| 791 |
|
| 792 |
<P>
|
| 793 |
<strong>Unicode</strong> is a codeset which is designed to be able to express
|
| 794 |
all characters in the world, like ISO 2022.
|
| 795 |
</P>
|
| 796 |
|
| 797 |
<P>
|
| 798 |
Unlike ISO 2022, Unicode is a stateless codeset.
Since every character is 16 bits long (or a multiple of 16 bits
for combining characters and surrogate pairs), Unicode is not
upward-compatible with ASCII, though the characters at 0x0021 - 0x007e
of Unicode are the same as those at 0x21 - 0x7e of ASCII.
|
| 803 |
</P>
|
| 804 |
|
| 805 |
<P>
|
| 806 |
Unicode as a codeset includes one character set,
|
| 807 |
a subset (plane 0 - 16) of UCS-4. UCS-4 is explained later.
|
| 808 |
</P>
|
| 809 |
|
| 810 |
<P>
|
| 811 |
Unicode without surrogate pairs is the same as UCS-2 (explained later).
|
| 812 |
</P>
|
| 813 |
|
| 814 |
<P>
|
| 815 |
Unicode has three notable features: Han Unification,
Combining Characters, and Surrogate Pairs.
|
| 817 |
</P>
|
| 818 |
|
| 819 |
|
| 820 |
<sect3 id="unihan"><heading>Han Unification</heading>
|
| 821 |
|
| 822 |
<P>
|
| 823 |
This is the point on which Unicode is criticized most strongly
|
| 824 |
among Japanese (and also among Korean and Chinese, I suppose) people.
|
| 825 |
</P>
|
| 826 |
|
| 827 |
<P>
|
| 828 |
The region 0x4e00 - 0x9fff of UCS-2 is used for Japanese Kanji,
Chinese Hanzi, and Korean Hanja. These four character sets contain
many similar characters. (There are two sets of Chinese characters:
simplified Chinese used in P. R. China and traditional Chinese used in
Taiwan.) To reduce the number of ideograms to be encoded
(the region can hold only 20992 characters),
such similar characters are treated as the same character.
This is Han Unification.
|
| 836 |
</P>
|
| 837 |
|
| 838 |
<P>
|
| 839 |
However, these characters are not exactly the same. If a font for
these characters is made from the Chinese shapes, Japanese people will
regard them as wrong characters, though they will still be able to read them.
|
| 842 |
</P>
|
| 843 |
|
| 844 |
<P>
|
| 845 |
An example of Han Unification is available at
|
| 846 |
<url id="http://charts.unicode.org/unihan/unihan.acgi$0x9AA8" name="&urlname">.
|
| 847 |
This is a Kanji character for 'bone'.
|
| 848 |
<url id="http://charts.unicode.org/unihan/unihan.acgi$0x8FCE" name="&urlname">
|
| 849 |
is another example, a Kanji character for 'welcome'.
|
| 850 |
</P>
|
| 851 |
|
| 852 |
|
| 853 |
|
| 854 |
<sect3 id="combining"><heading>Combining Characters</heading>
|
| 855 |
|
| 856 |
<P>
|
| 857 |
Unicode has a way to synthesize an accented character by combining
an accent symbol and a base character. For example, combining 'a' and
'~' makes 'a' with tilde. More than one accent symbol can be added to
a base character.
|
| 861 |
</P>
|
| 862 |
|
| 863 |
<P>
|
| 864 |
This facility is convenient for expressing Arabic and Thai characters.
However, a few problems arise.
|
| 866 |
</P>
|
| 867 |
|
| 868 |
<P>
|
| 869 |
<taglist>
|
| 870 |
<tag>Duplicate Encoding</tag>
|
| 871 |
<item>
|
| 872 |
There are multiple ways to express the same character.
|
| 873 |
<tag>Open Repertoire</tag>
|
| 874 |
<item>
|
| 875 |
The number of expressible characters grows without limit.
Characters which do not exist can be expressed.
|
| 877 |
</taglist>
|
| 878 |
</P>
|
| 879 |
|
| 880 |
<P>
|
| 881 |
Moreover, this threatens the principle that every character
is expressed with a constant bit length.
|
| 883 |
</P>
|
| 884 |
|
| 885 |
|
| 886 |
<sect3 id="surrogate"><heading>Surrogate Pair aka UTF-16</heading>
|
| 887 |
|
| 888 |
<P>
|
| 889 |
Though Unicode aimed to express all characters in the world
in a constant 16 bits, 65536 code values are clearly insufficient to
express all characters in the world.
|
| 892 |
</P>
|
| 893 |
|
| 894 |
<P>
|
| 895 |
Surrogate pairs were introduced in Unicode 2.0 to expand the
number of characters by expressing one character as a special
sequence of two consecutive 16bit codes.
|
| 898 |
</P>
|
| 899 |
|
| 900 |
<P>
|
| 901 |
0xd800 - 0xdfff is the region reserved for surrogate pairs.
The first 16bit code must be in the region 0xd800 - 0xdbff.
The second 16bit code must be in the region 0xdc00 - 0xdfff.
Since each region has 1024 values, surrogate pairs can
express 1048576 (1024 * 1024) characters.
|
| 906 |
</P>
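<P>
The arithmetic can be sketched in C as below. The function names are
hypothetical. The sketch relies on the UTF-16 rule that 0x10000 is
subtracted from the code first, so that the remaining 20 bits are split
into two 10bit halves.
</P>

```c
#include <assert.h>
#include <stdint.h>

/*
 * Split a code in planes 1 - 16 (0x10000 - 0x10ffff) into a surrogate
 * pair.  *high receives 0xd800 - 0xdbff, *low receives 0xdc00 - 0xdfff.
 */
static void to_surrogates(uint32_t code, uint16_t *high, uint16_t *low)
{
    uint32_t offset;

    assert(code >= 0x10000 && code <= 0x10ffff);
    offset = code - 0x10000;                        /* 20 significant bits */
    *high = (uint16_t)(0xd800 + (offset >> 10));    /* upper 10 bits */
    *low  = (uint16_t)(0xdc00 + (offset & 0x3ff));  /* lower 10 bits */
}

/* Combine a valid surrogate pair back into a single code. */
static uint32_t from_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xd800 && high <= 0xdbff);
    assert(low >= 0xdc00 && low <= 0xdfff);
    return 0x10000 + (((uint32_t)(high - 0xd800) << 10)
                      | (uint32_t)(low - 0xdc00));
}
```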
|
| 907 |
|
| 908 |
<P>
|
| 909 |
Planes 1 - 16 of Group 0 of UCS-4 are mapped onto these surrogate pairs.
UCS-4 will be explained later.
|
| 911 |
</P>
|
| 912 |
|
| 913 |
|
| 914 |
<sect3 id="646problem"><heading>ISO 646-* Problem</heading>
|
| 915 |
|
| 916 |
<P>
|
| 917 |
You will need a codeset converter between your local codeset
(for example, ISO 8859-* or ISO 2022-*) and Unicode.
If you are Japanese, you may use the Japanese version
of ISO 646, which encodes the yen currency sign at 0x5c, where the
backslash is encoded in ASCII.
|
| 922 |
</P>
|
| 923 |
|
| 924 |
<P>
|
| 925 |
Then into which Unicode character should your converter convert 0x5c of
your local codeset: 0x005c (backslash) or the yen currency sign (0x00a5)?
You may say the yen currency sign is the right answer.
However, the backslash (and hence the yen sign) is widely used as an
escape character. For example, 'new line' is expressed as
'backslash - n' in a C string literal, and Japanese people write it as
'yen currency sign - n'. You may say that program sources
must be written in ASCII and that the mistake was
trying to convert a program source. Then what about your own
configuration files for various software?
|
| 935 |
</P>
|
| 936 |
|
| 937 |
<P>
|
| 938 |
For example, the Shift-JIS codeset, which is the standard codeset
for Windows/Macintosh in Japan, includes the Japanese version of
ISO 646. The 'right' way is to convert 0x5c into the yen currency sign
in Unicode. However, Windows now supports Unicode and its font
shows the yen currency sign at 0x005c. As you know, the backslash
(the yen currency sign in Japan) is vitally important for Windows,
because it is used to separate directory names.
Fortunately, EUC-JP, which is widely used on UNIX in Japan,
includes ASCII, not the Japanese version of ISO 646. So there
is no problem there, because it is clear that 0x5c is the backslash.
|
| 948 |
</P>
|
| 949 |
|
| 950 |
<P>
|
| 951 |
Thus no local codeset should use a character set incompatible
with ASCII, such as ISO 646-*.
|
| 953 |
</P>
|
| 954 |
|
| 955 |
|
| 956 |
|
| 957 |
<sect3 id="consistency"><heading>Consistency with Local Character Sets</heading>
|
| 958 |
|
| 959 |
<P>
|
| 960 |
Local character sets can be newly defined or become obsolete.
I do not know whether Unicode can adapt itself to such cases.
This is a REAL fear. Now (1999) a new character set (JIS X 0208
3rd and 4th level) is being discussed in Japan. This character set
may make an older character set (JIS X 0212) obsolete.
|
| 965 |
</P>
|
| 966 |
|
| 967 |
<P>
|
| 968 |
There is one more problem. JIS X 0208, the main character set
for the Japanese language, has many special symbols, such as
circles, stars, parentheses, and so on. The correspondence
between these characters and Unicode is not standardized.
Thus conversion tables differ from vendor to vendor.
I guess this problem is not peculiar to JIS X 0208.
|
| 974 |
</P>
|
| 975 |
|
| 976 |
|
| 977 |
|
| 978 |
<sect2 id="ucs"><heading>ISO 10646, UCS-2, and UCS-4</heading>
|
| 979 |
|
| 980 |
<P>
|
| 981 |
ISO 10646 defines two character sets, <strong>UCS-2</strong>
and <strong>UCS-4</strong>.
UCS-2 is a subset of UCS-4.
|
| 984 |
</P>
|
| 985 |
|
| 986 |
<P>
|
| 987 |
UCS-4 is a 32bit character set. The four bytes of the 32bit representation
of UCS-4 are called <strong>Group</strong>, <strong>Plane</strong>,
<strong>Row</strong>, and <strong>Cell</strong>, respectively.
The first plane (Group = 0, Plane = 0) is called the <strong>BMP</strong>
(Basic Multilingual Plane) and UCS-2 is the same as the BMP.
|
| 992 |
</P>
|
| 993 |
|
| 994 |
<P>
|
| 995 |
Neither UCS-2 nor UCS-4 is upward-compatible with ASCII,
though the characters at 0x0021 - 0x007e in UCS-2
(and 0x00000021 - 0x0000007e in UCS-4) are the same as those at
0x21 - 0x7e in ASCII.
|
| 999 |
</P>
|
| 1000 |
|
| 1001 |
<P>
|
| 1002 |
Though UCS-2 and UCS-4 are explained as character sets,
|
| 1003 |
they are also codesets.
|
| 1004 |
</P>
|
| 1005 |
|
| 1006 |
<P>
|
| 1007 |
When a string expressed in
UCS-2 (or UCS-4) is stored in a file, there are two possible byte orders,
big endian and little endian. To show which byte order is
used, a magic character is added at the top of the string.
The character is 'zero width no-break space', whose
code is 0xfeff in UCS-2 and 0x0000feff in UCS-4.
|
| 1013 |
</P>
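<P>
A reader of such a file could check the first two bytes as in the sketch
below (UCS-2 only; the names are hypothetical). It works because the
byte-swapped value 0xfffe is not assigned to any character.
</P>

```c
#include <assert.h>
#include <stddef.h>

enum ucs2_byte_order { UCS2_UNKNOWN, UCS2_BIG_ENDIAN, UCS2_LITTLE_ENDIAN };

/*
 * Inspect the first two bytes of a UCS-2 stream.  0xfeff stored big
 * endian appears as fe ff; stored little endian it appears as ff fe.
 */
static enum ucs2_byte_order detect_ucs2_order(const unsigned char *buf,
                                              size_t len)
{
    assert(buf != NULL || len == 0);
    if (len < 2)
        return UCS2_UNKNOWN;
    if (buf[0] == 0xfe && buf[1] == 0xff)
        return UCS2_BIG_ENDIAN;
    if (buf[0] == 0xff && buf[1] == 0xfe)
        return UCS2_LITTLE_ENDIAN;
    return UCS2_UNKNOWN;   /* no mark: the caller has to be told */
}
```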
|
| 1014 |
|
| 1015 |
|
| 1016 |
|
| 1017 |
<sect2 id="utf-8"><heading>UTF-8</heading>
|
| 1018 |
|
| 1019 |
<P>
|
| 1020 |
In spite of the name, <strong>UTF-8</strong> is not similar to UTF-16 at all.
|
| 1021 |
</P>
|
| 1022 |
|
| 1023 |
<P>
|
| 1024 |
UTF-8 is a codeset which includes UCS-4 as a character set and is
upward-compatible with ASCII.
Conversion from UCS-4 to UTF-8 is performed using a
simple conversion rule.
|
| 1028 |
<example>
|
| 1029 |
UCS-4 (binary)                        UTF-8 (binary)
00000000 00000000 00000000 0???????   0???????
00000000 00000000 00000??? ????????   110????? 10??????
00000000 00000000 ???????? ????????   1110???? 10?????? 10??????
00000000 000????? ???????? ????????   11110??? 10?????? 10?????? 10??????
000000?? ???????? ???????? ????????   111110?? 10?????? 10?????? 10?????? 10??????
0??????? ???????? ???????? ????????   1111110? 10?????? 10?????? 10?????? 10?????? 10??????
|
| 1036 |
</example>
|
| 1037 |
</P>
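<P>
The conversion rule might be implemented as in the following sketch.
The function name <tt>ucs4_to_utf8()</tt> is hypothetical. It follows
the six-row rule as given here; note that the current definition of
UTF-8 limits encoded values to 0x10ffff, so only the first four forms
are in use today.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one UCS-4 value (at most 31 bits) as UTF-8.
 * Writes 1 - 6 bytes to out and returns how many were written.
 */
static size_t ucs4_to_utf8(uint32_t ucs4, unsigned char out[6])
{
    /* Marker bits of the first byte for each length (index = length). */
    static const unsigned char lead[7] =
        { 0, 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
    size_t n, i;

    assert(ucs4 <= 0x7fffffff);
    if (ucs4 < 0x80)            n = 1;   /*  7 bits */
    else if (ucs4 < 0x800)      n = 2;   /* 11 bits */
    else if (ucs4 < 0x10000)    n = 3;   /* 16 bits */
    else if (ucs4 < 0x200000)   n = 4;   /* 21 bits */
    else if (ucs4 < 0x4000000)  n = 5;   /* 26 bits */
    else                        n = 6;   /* 31 bits */

    /* Fill the trailing bytes from the right, 6 bits each: 10?????? */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (ucs4 & 0x3f));
        ucs4 >>= 6;
    }
    out[0] = (unsigned char)(lead[n] | ucs4);   /* remaining high bits */
    return n;
}
```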
|
| 1038 |
|
| 1039 |
|
| 1040 |
|
| 1041 |
<sect2 id="utf-2000"><heading>UTF-2000</heading>
|
| 1042 |
|
| 1043 |
<P>
|
| 1044 |
I have heard that there is a new code called UTF-2000, but I know nothing
about it except its name.
|
| 1046 |
<footnote>
|
| 1047 |
I HAVE TO WRITE EXPLANATION.
|
| 1048 |
</footnote>
|
| 1049 |
</P>
|
| 1050 |
|
| 1051 |
|
| 1052 |
|
| 1053 |
|
| 1054 |
<chapt id="languages"><heading>Characters in Each Country</heading>
|
| 1055 |
|
| 1056 |
<P>
|
| 1057 |
This chapter describes information specific to each language.
Contributions from people speaking each language are welcome.
If you are going to write a section on your language, please include
these points:
|
| 1061 |
<enumlist>
|
| 1062 |
<item>kinds and number of characters used in the language,
|
| 1063 |
<item>explanation on character set(s) which is (are) standardized,
|
| 1064 |
<item>explanation on codeset(s) which is (are) standardized,
|
| 1065 |
<item>usage and popularity for each codeset,
|
| 1066 |
<item>de-facto standard, if any, on how many columns characters occupy,
|
| 1067 |
<item>writing direction and combined characters,
|
| 1068 |
<item>how to layout characters (word wrapping and so on),
|
| 1069 |
<item>widely used values for the <tt>LANG</tt> environment variable,
|
| 1070 |
<item>the way to input characters from keyboard and whether
|
| 1071 |
you want to input yes/no (and so on) in your language
|
| 1072 |
or in English,
|
| 1073 |
<item>a set of information needed for beautiful displaying, for example,
|
| 1074 |
where to break a line, hyphenation, word wrapping, and so on, and
|
| 1075 |
<item>other topics.
|
| 1076 |
</enumlist>
|
| 1077 |
</P>
|
| 1078 |
|
| 1079 |
|
| 1080 |
<P>
|
| 1081 |
Writers whose languages are written in a different direction
from European languages or need combined characters
(I have heard that these are used in Thai) are encouraged to explain
how to treat such languages.
|
| 1085 |
</P>
|
| 1086 |
|
| 1087 |
|
| 1088 |
|
| 1089 |
&japanese-japan;
|
| 1090 |
&spanish;
|
| 1091 |
|
| 1092 |
|
| 1093 |
|
| 1094 |
|
| 1095 |
|
| 1096 |
|
| 1097 |
|
| 1098 |
<chapt id="output"><heading>Output to Display</heading>
|
| 1099 |
|
| 1100 |
<P>
|
| 1101 |
Here 'Output to Display' does not mean I18N of messages using
<prgn>gettext</prgn>.
I am concerned here with whether characters are output correctly so that
we can read them. For example, install the <package>libcanna1g</package>
package and display
<tt>/usr/doc/libcanna1g/README.jp.gz</tt> on the console or in <prgn>xterm</prgn>
(after ungzipping it, of course). This text file is written in Japanese, but even Japanese
people cannot read the resulting row of strange characters. Which would you
prefer if you were a Japanese speaker: an English message which can be read
with a dictionary, or a row of strange characters which is
the result of <prgn>gettext</prgn>ization?
(Yes, there <em>is</em> a way to display
Japanese characters correctly: <prgn>kon</prgn> (in the <package>kon2</package>
package) for the console and <prgn>kterm</prgn> for X, and
Japanese people are happy with <prgn>gettext</prgn>ized Japanese messages.)
|
| 1117 |
</P>
|
| 1118 |
|
| 1119 |
<P>
|
| 1120 |
Problems on displaying non-English characters are discussed below.
|
| 1121 |
Since the mother tongue of the author is Japanese, the content may
|
| 1122 |
be biased to Japanese.
|
| 1123 |
</P>
|
| 1124 |
|
| 1125 |
|
| 1126 |
|
| 1127 |
<sect id="output-console"><heading>Console Software</heading>
|
| 1128 |
|
| 1129 |
<P>
|
| 1130 |
Software running on the console is not responsible for displaying characters.
The console itself is responsible. There are terminal emulators
which can display non-English languages, such as <prgn>kterm</prgn> (Japanese),
<prgn>krxvt</prgn>, <prgn>grxvt</prgn>, and <prgn>crxvt</prgn>
(Japanese, Greek, and Chinese, included
in the <package>rxvt-ml</package> package), and <prgn>cxterm</prgn>
(Chinese, Korean, and Japanese, non-free).
There is also software with which non-English characters can be
displayed on the console, such as <package>kon2</package> (Japanese).
|
| 1139 |
</P>
|
| 1140 |
|
| 1141 |
<P>
|
| 1142 |
All that software running on a console (including a terminal emulator and so on)
has to do is to output correct codes to the console.
|
| 1144 |
</P>
|
| 1145 |
|
| 1146 |
<P>
|
| 1147 |
First, it is important not to destroy string data.
Sometimes this can be achieved simply by making the software 8bit-clean.
'8bit-clean' means that the software does not destroy the
most significant bit (MSB) of the data it handles.
|
| 1151 |
</P>
|
| 1152 |
|
| 1153 |
<P>
|
| 1154 |
Next, be careful with software which sends control codes, such
as cursor positioning, every time it outputs one byte. Such codes break
the continuity of a multibyte character.
|
| 1157 |
</P>
|
| 1158 |
|
| 1159 |
<P>
|
| 1160 |
Also be careful not to destroy multicolumn characters.
For example, when a string exceeds the width of the console,
it is divided at the end of the line. Terminal emulators
should have a facility to avoid this 'excess of line width' type of
character destruction, but so far no terminal emulator
has one. (There is only one exception: the shell mode of Emacs.
Unfortunately, the shell mode of Emacs is a dumb terminal and
much software cannot run on it.) Thus each piece of software running on the
console should take care of this itself.
|
| 1169 |
</P>
|
| 1170 |
|
| 1171 |
<P>
|
| 1172 |
There is another way in which multicolumn characters can be destroyed.
When a message is written over another string, part
of a character belonging to the previous string can be
left without being overwritten. This may be more troublesome than many
people would think, because a multicolumn character can be
written at any column, not only at multiples of the
width of the character.
|
| 1179 |
</P>
|
| 1180 |
|
| 1181 |
<P>
|
| 1182 |
Such destruction of the continuity of multibyte characters may
cause the destruction of the whole line following
the character. Whether this can occur depends on the internal
implementation of the console program. It can occur if the
terminal emulator does not keep columns, bytes, and characters
properly separate. The shell mode of Emacs is the only example
that does so, but there is no chance to overwrite characters in
the shell mode of Emacs, because it is a dumb terminal.
|
| 1190 |
</P>
|
| 1191 |
|
| 1192 |
<P>
|
| 1193 |
There is no standard for the number of columns a character occupies.
This can be a large problem for software using <tt>ncurses</tt>.
There is no 'right' way to solve this. Each piece of software has to
hold this information for each character set. Consult section 2.6
for each language. Take care to distinguish between the numbers
of columns, bytes, and characters. For a subset of EUC-JP
(ASCII alphabets and JIS X 0208 kanji), the numbers of bytes and columns
are equal (a 1-byte character occupies 1 column and a 2-byte character
occupies 2 columns).
|
| 1202 |
</P>
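<P>
For this EUC-JP subset, a program could find a safe place to break a line
with a function like the following sketch. The name
<tt>eucjp_fit_columns()</tt> is hypothetical, and the code assumes that
the input contains only ASCII and JIS X 0208 characters, so that any byte
with the MSB set starts a 2-byte, 2-column character.
</P>

```c
#include <assert.h>
#include <stddef.h>

/*
 * For a string in the EUC-JP subset (ASCII and JIS X 0208 only, so
 * bytes == columns), return how many bytes fit in `width` columns
 * without cutting a 2-byte character in half.
 */
static size_t eucjp_fit_columns(const unsigned char *s, size_t len,
                                size_t width)
{
    size_t pos = 0;

    assert(s != NULL || len == 0);
    while (pos < len) {
        /* Byte length of this character, which is also its column count. */
        size_t char_len = (s[pos] & 0x80) ? 2 : 1;

        if (pos + char_len > len || pos + char_len > width)
            break;      /* truncated input, or the character does not fit */
        pos += char_len;
    }
    return pos;
}
```

<P>
The caller would output the returned number of bytes, start a new line,
and continue with the rest of the string.
</P>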
|
| 1203 |
|
| 1204 |
<P>
|
| 1205 |
Another important point is that the string has to be converted
|
| 1206 |
into a codeset which the console can understand. So far there
|
| 1207 |
are no consoles which understand Unicode.
|
| 1208 |
</P>
|
| 1209 |
|
| 1210 |
|
| 1211 |
|
| 1212 |
<sect id="output-x"><heading>X Clients</heading>
|
| 1213 |
|
| 1214 |
<P>
|
| 1215 |
X itself is already internationalized. Thus many languages can
be displayed if fonts are properly prepared. It is the user's
responsibility to prepare fonts, and all that software has
to do is to select fonts carefully.
|
| 1219 |
</P>
|
| 1220 |
|
| 1221 |
<P>
|
| 1222 |
Though codesets other than ASCII often contain multiple character sets,
fonts for X are prepared for each character set. So a fontset, that is, a set
of fonts covering the set of character sets, should be used instead of a single font.
|
| 1225 |
</P>
|
| 1226 |
|
| 1227 |
<P>
|
| 1228 |
For example, C programs using Xlib should use series of functions
|
| 1229 |
related to XFontSet structure instead of functions for XFontStruct
|
| 1230 |
structure.
|
| 1231 |
<example>
|
| 1232 |
Font              | FontSet
==================+====================
XFontStruct       | XFontSet
------------------+--------------------
XLoadFont()       | XCreateFontSet()
------------------+--------------------
XUnloadFont()     | XFreeFontSet()
------------------+--------------------
XQueryFont()      | XFontsOfFontSet()
------------------+--------------------
XDrawString()     | XmbDrawString()
XDrawString16()   | XwcDrawString()
------------------+--------------------
XDrawText()       | XmbDrawText()
XDrawText16()     | XwcDrawText()
------------------+--------------------
|
| 1248 |
</example>
|
| 1249 |
</P>
|
| 1250 |
|
| 1251 |
<P>
|
| 1252 |
If software uses the left-hand functions, it has to be rewritten
using the corresponding right-hand functions in the table. Note that
this table is not complete but only an example. Since the
right-hand functions use the wide characters and multibyte characters
of C, setlocale() has to be called in advance.
|
| 1257 |
</P>
|
| 1258 |
|
| 1259 |
<P>
|
| 1260 |
The same problem exists for software using toolkits such as
Athena, GTK+, Qt, and so on.
|
| 1262 |
</P>
|
| 1263 |
|
| 1264 |
|
| 1265 |
|
| 1266 |
|
| 1267 |
|
| 1268 |
|
| 1269 |
|
| 1270 |
<chapt id="input"><heading>Input from Keyboard</heading>
|
| 1271 |
|
| 1272 |
<P>
|
| 1273 |
I18N of display is a prerequisite for I18N of input from the keyboard.
I18N of input is not necessary merely for answering Yes/No. Most
Japanese-speaking people find it too troublesome, just to
answer Y/N, to invoke the input method, input the alphabetical
representation of Japanese, and convert it to Japanese characters.
This would also be true for Korean and Chinese. On the other hand,
software such as text editors, word processors, terminal emulators,
and shells should have I18N-ed input support.
|
| 1281 |
</P>
|
| 1282 |
|
| 1283 |
|
| 1284 |
|
| 1285 |
<sect id="input-console"><heading>Console Software</heading>
|
| 1286 |
|
| 1287 |
<sect1 id="input-console-console"><heading>Invoked in the Console and Kon</heading>
|
| 1288 |
|
| 1289 |
<P>
|
| 1290 |
Canna and Wnn are client/server type Japanese input methods.
Wnn has variants for Korean and Chinese.
They have their own protocols and there is no standard.
There is software which adds a facility for inputting Japanese
to the console by connecting the console and these input methods,
but these programs (canuum for Canna and uum for Wnn) are
not Debianized yet. There are a few programs which can talk
the Canna or Wnn protocol directly, for example, nvi-m17n-canna.
In a Debian system, these programs 'depend' on the libcanna or wnn
packages.
|
| 1300 |
</P>
|
| 1301 |
|
| 1302 |
<P>
|
| 1303 |
GNU Emacs offers methods for inputting many languages
such as Japanese, Chinese, Korean, Latin-{12345}, Russian,
Greek, Hebrew, Thai, Vietnamese, Czech, and so on
in the console environment. XEmacs also offers a similar
mechanism but its set of supported languages is different.
We would be very happy if the input facility of (X)Emacs
became a library which other software could use. The author
does not know whether this can be achieved.
|
| 1311 |
</P>
|
| 1312 |
|
| 1313 |
<P>
|
| 1314 |
Once an input method is supplied,
the input codes must be treated correctly.
That is, the software must be aware of the numbers
of bytes, characters, and columns.
For example, you have to know how many bytes should be
deleted and how many '^H' codes should be sent to the console
when the 'BS' key is pressed.
|
| 1321 |
</P>
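<P>
As an illustration for EUC-JP, the following sketch reports how many bytes
and how many columns the last character of a line occupies. The names are
hypothetical. The column counts are my assumption of the usual convention:
1 column for ASCII and for SS2 + half-width katakana, 2 columns for
JIS X 0208 and for SS3 + JIS X 0212.
</P>

```c
#include <assert.h>
#include <stddef.h>

struct last_char {
    size_t bytes;    /* bytes to remove from the line buffer */
    int columns;     /* number of '^H' codes to send to the console */
};

/*
 * Scan an EUC-JP line from its start and describe its last character.
 * Scanning forward is used because a trailing byte alone does not
 * tell where the character began.
 */
static struct last_char eucjp_last_char(const unsigned char *line, size_t len)
{
    struct last_char last = { 0, 0 };
    size_t pos = 0;

    assert(line != NULL || len == 0);
    while (pos < len) {
        unsigned char c = line[pos];

        if (c < 0x80)       { last.bytes = 1; last.columns = 1; } /* ASCII */
        else if (c == 0x8e) { last.bytes = 2; last.columns = 1; } /* SS2 */
        else if (c == 0x8f) { last.bytes = 3; last.columns = 2; } /* SS3 */
        else                { last.bytes = 2; last.columns = 2; } /* G1 */

        if (pos + last.bytes > len)     /* truncated final character */
            last.bytes = len - pos;
        pos += last.bytes;
    }
    return last;
}
```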
|
| 1322 |
|
| 1323 |
|
| 1324 |
<sect1 id="input-console-x"><heading>Invoked in an X Terminal Emulator</heading>
|
| 1325 |
|
| 1326 |
<P>
|
| 1327 |
X has a standard for inputting various languages, namely XIM.
Kinput2 is software which connects Canna and/or Wnn to the XIM protocol.
Moreover, terminal emulators such as kterm and krxvt have a
facility to connect to XIM. So a way to input various languages
is available.
|
| 1332 |
</P>
|
| 1333 |
|
| 1334 |
<P>
|
| 1335 |
All that software running on a terminal emulator has to do is
to accept the input properly.
|
| 1337 |
</P>
|
| 1338 |
|
| 1339 |
<P>
|
| 1340 |
First, the software has to be made 8bit-clean.
Important programs such as <prgn>bash</prgn> and <prgn>tcsh</prgn>
are already 8bit-clean
and they accept non-ASCII characters.
|
| 1344 |
</P>
|
| 1345 |
|
| 1346 |
<P>
|
| 1347 |
However, since these programs are not aware of multibyte
characters, editing the input line is a bit hard. For example,
we have to press the Backspace key twice to erase a 2byte character.
If we make a mistake, the input string will be broken.
For stateful codesets, or for characters whose numbers of bytes and columns differ,
editing would be much more difficult.
(Fortunately, most Japanese and Korean characters are expressed in
2 bytes and occupy 2 columns. That is, the numbers of bytes and columns
are identical.)
Thus such software should be aware of multibyte codesets.
|
| 1357 |
</P>
|
| 1358 |
|
| 1359 |
|
| 1360 |
|
| 1361 |
|
| 1362 |
|
| 1363 |
<sect id="input-x"><heading>X Clients</heading>
|
| 1364 |
|
| 1365 |
<P>
|
| 1366 |
All you need to do is:
<list>
<item>Accept input from XIM. 'Over-the-spot' conversion is desirable but
      not essential.
<item>Accept 'paste' using Compound Text.
|
| 1371 |
</list>
|
| 1372 |
</P>
|
| 1373 |
|
| 1374 |
|
| 1375 |
|
| 1376 |
|
| 1377 |
|
| 1378 |
|
| 1379 |
|
| 1380 |
<chapt id="internal"><heading>Internal Processing and File I/O</heading>
|
| 1381 |
|
| 1382 |
<P>
|
| 1383 |
From a user's point of view, software can use any internal codeset
as long as I/O is done correctly, because a user cannot tell
which kind of internal code the software uses.
|
| 1386 |
</P>
|
| 1387 |
|
| 1388 |
<P>
|
| 1389 |
From a programmer's point of view, by using
wide characters in C, some form of Unicode, and so on, he/she
<list>
<item>can count the number of <em>characters</em> (not <em>bytes</em> or
      <em>columns</em>) correctly,
<item>cannot split a multibyte character by mistake, and
<item>does not have to care about shift state,
</list>
without any knowledge of specific codesets.
|
| 1398 |
</P>
|
| 1399 |
|
| 1400 |
<P>
|
| 1401 |
Since you may not assume anything about the
implementation of wide characters (the values of <tt>wchar_t</tt>),
you cannot do anything beyond what the library provides,
for example, obtaining the number of columns a character occupies.
|
| 1405 |
</P>
|
| 1406 |
|
| 1407 |
|
| 1408 |
|
| 1409 |
|
| 1410 |
|
| 1411 |
|
| 1412 |
|
| 1413 |
|
| 1414 |
<chapt id="other"><heading>Other Special Topics</heading>
|
| 1415 |
|
| 1416 |
<sect id="locale"><heading>Locale in C</heading>
|
| 1417 |
|
| 1418 |
<P>
|
| 1419 |
Locale is the main facility for I18N in the C language.
The easiest way to use locale is to call <tt>setlocale(LC_ALL, "")</tt>.
|
| 1421 |
</P>
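<P>
A program would typically make this call once at start-up and check the
result, roughly as in this sketch. The wrapper name
<tt>use_user_locale()</tt> and the warning text are made up for
illustration.
</P>

```c
#include <locale.h>
#include <stdio.h>

/*
 * Adopt the user's locale for all categories.  Returns the name of the
 * locale now in effect, or NULL if the environment names a locale that
 * is not available (the program then keeps its previous locale, which
 * is "C" at start-up).
 */
static const char *use_user_locale(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        fprintf(stderr, "warning: unsupported locale, keeping \"C\"\n");
    return name;
}
```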
|
| 1422 |
|
| 1423 |
<P>
|
| 1424 |
The locale model is that software changes its behavior
according to its language environment. The environment can be
set independently for six categories:
LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC,
and LC_TIME.
For example, messages can be written in the proper language
and the proper format for date/time expressions is used, if the
software is properly implemented.
|
| 1432 |
</P>
|
| 1433 |
|
| 1434 |
<P>
|
| 1435 |
If <tt>setlocale(LC_ALL, "")</tt> is called at the start of the
program, the environment is chosen by environment variables
whose names are the same as the names of the categories.
If the LC_ALL variable is defined, it takes precedence over these
variables. If neither LC_ALL nor the category variable is defined, the LANG variable is used.
If LANG is not defined either, the 'C' locale, which means default behavior,
is used.
|
| 1442 |
</P>
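<P>
This order of precedence can be written down as a small C function. This is
only a sketch of the rule, not how any C library implements it; the
function name is hypothetical, and treating an empty value like an
undefined variable is my understanding of the POSIX behavior.
</P>

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Return the locale name that would be chosen for one category, given
 * the name of that category's variable ("LC_TIME", "LC_CTYPE", ...).
 */
static const char *effective_locale_name(const char *category_var)
{
    const char *candidates[3];
    int i;

    assert(category_var != NULL);
    candidates[0] = "LC_ALL";       /* overrides everything */
    candidates[1] = category_var;   /* per-category setting */
    candidates[2] = "LANG";         /* fallback for all categories */

    for (i = 0; i < 3; i++) {
        const char *value = getenv(candidates[i]);

        if (value != NULL && value[0] != '\0')   /* empty counts as unset */
            return value;
    }
    return "C";                     /* default behavior */
}
```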
|
| 1443 |
|
| 1444 |
<P>
|
| 1445 |
Though valid values for these environment variables (locale names)
depend on the kind and set-up of the OS, the format of locale names
is usually like <tt>ja_JP.ujis</tt>, where the two lowercase characters
mean the language (<tt>ja</tt> = Japanese), the two capital characters
mean the country (<tt>JP</tt> = Japan), and the characters after the dot mean
the codeset (<tt>ujis</tt> = EUC-JP). Type <tt>locale -a</tt> to display
all valid locale names.
|
| 1452 |
</P>
|
| 1453 |
|
| 1454 |
<P>
|
| 1455 |
Note that m17n is not achieved by the locale model at all,
because a user can choose only one language per category.
Sometimes the locale model is insufficient even for i18n.
For example, there are many languages for which multiple codesets
are used at the same time, so code conversion or
code detection is sometimes needed.
|
| 1461 |
</P>
|
| 1462 |
|
| 1463 |
<sect id="wchar"><heading>Multibyte and Wide characters in C</heading>
|
| 1464 |
|
| 1465 |
<P>
|
| 1466 |
A multibyte character is a character which is expressed by
one or more bytes. Multibyte characters correspond to
the 'real' codeset used for input/output. The representation
of a multibyte character depends on <tt>LC_CTYPE</tt>.
|
| 1470 |
</P>
|
| 1471 |
|
| 1472 |
<P>
|
| 1473 |
Since multibyte characters can be stateful (that is, can have
shift state) and the number of bytes per character does not
have to be constant, an implementation using multibyte characters
can be difficult. Wide characters can be used instead.
|
| 1477 |
</P>
|
| 1478 |
|
| 1479 |
<P>
|
| 1480 |
Wide characters are stateless and every wide character has the
same size. Functions for conversion between multibyte characters
and wide characters (and between strings of multibyte characters and
strings of wide characters) are supplied by the library.
A wide character is expressed using the <tt>wchar_t</tt> type.
A string of wide characters is expressed
as an array of <tt>wchar_t</tt>, just as a string of ASCII characters is expressed
as an array of <tt>char</tt>.
|
| 1488 |
</P>
|
| 1489 |
|
| 1490 |
<P>
|
| 1491 |
Thus it is convenient to input multibyte characters from a stream,
convert them into wide characters, process them, convert them back into
multibyte characters, and output them to a stream. <tt>wchar_t</tt> is
used as an internal code.
|
| 1495 |
</P>
|
| 1496 |
|
| 1497 |
<P>
|
| 1498 |
Functions for conversion between multibyte and wide characters/strings
|
| 1499 |
are shown below:
|
| 1500 |
<list>
|
| 1501 |
<item><tt>mbtowc()</tt> and <tt>mbrtowc()</tt> to convert
|
| 1502 |
from multibyte to wide character.
|
| 1503 |
<item><tt>mblen()</tt>, <tt>mbrlen()</tt> to obtain the number
of bytes of a multibyte character.
|
| 1505 |
<item><tt>mbstowcs()</tt>, <tt>mbsrtowcs()</tt> to convert from
|
| 1506 |
multibyte to wide character string.
|
| 1507 |
<item><tt>wctomb()</tt>, <tt>wcrtomb()</tt> to convert from wide
|
| 1508 |
to multibyte character.
|
| 1509 |
<item><tt>wcstombs()</tt>, <tt>wcsrtombs()</tt> to convert from
|
| 1510 |
wide to multibyte character string.
|
| 1511 |
<item><tt>mbsinit()</tt> to check shift status.
|
| 1512 |
<item><tt>btowc()</tt> and <tt>wctob()</tt> to convert between
single-byte and wide characters.
|
| 1514 |
</list>
|
| 1515 |
</P>
|
| 1516 |
|
| 1517 |
<P>
|
| 1518 |
The '<tt>r</tt>' (restartable) versions of these functions (for example,
<tt>mbrtowc</tt>) take an additional parameter, a pointer to an
<tt>mbstate_t</tt> variable which holds the shift state. Since the
non-'<tt>r</tt>' versions of these functions keep the shift state in
an internal (static) variable, they can process only one sequence of
strings at a time.
|
| 1523 |
</P>
|
| 1524 |
|
| 1525 |
<P>
|
| 1526 |
See the manual pages of these functions for further information.
|
| 1527 |
</P>
|
| 1528 |
|
| 1529 |
<P>
|
| 1530 |
The encoding of <tt>wchar_t</tt> is not specified by any
standard, though glibc uses UCS-4. You must not
assume any particular encoding for <tt>wchar_t</tt>.
|
| 1533 |
</P>
|
| 1534 |
|
| 1535 |
<P>
|
| 1536 |
Though ordinary functions such as <tt>printf()</tt> can be used to
input and output multibyte characters, one has to take care with the escape
character '<tt>%</tt>' used in formatted input/output functions, because
a byte within a multibyte character can have the same value as the ASCII
code of '<tt>%</tt>'.
|
| 1541 |
</P>
|
| 1542 |
|
| 1543 |
|
| 1544 |
<sect id="gettext"><heading>Gettext</heading>
|
| 1545 |
|
| 1546 |
<P>
|
| 1547 |
Gettext is a tool for internationalizing the messages that software outputs,
according to the <tt>LC_MESSAGES</tt> locale category.
<prgn>gettext</prgn>ized software contains messages written in
various languages (depending on the available translators), and
the user can choose among them using environment variables.
GNU gettext is a part of the Debian system.
|
| 1553 |
</P>
|
| 1554 |
|
| 1555 |
<P>
|
| 1556 |
Install the <package>gettext</package> package and read its info pages for details.
|
| 1557 |
</P>
|
| 1558 |
|
| 1559 |
<P>
|
| 1560 |
Don't use non-ASCII characters in '<tt>msgid</tt>'.
Be careful, because it is tempting to use ISO-8859-1 characters.
For example, '©' (the copyright mark; you may not be able to
read the copyright mark NOW in THIS document) is a non-ASCII character
(0xa9 in ISO-8859-1).
If <tt>msgid</tt> contains such characters, translators may find it
difficult to edit the catalog files because the codesets of
<tt>msgid</tt> and <tt>msgstr</tt> conflict.
|
| 1568 |
</P>
|
| 1569 |
|
| 1570 |
<P>
|
| 1571 |
Be sure that the messages can be displayed in the assumed environment.
In other words, you have to read the chapter 'Output to Display'
in this document and internationalize the output mechanism
of your software prior to <prgn>gettext</prgn>ization.
<em>ENGLISH MESSAGES ARE PREFERABLE, EVEN FOR NON-ENGLISH-SPEAKING PEOPLE,
TO MEANINGLESS BROKEN MESSAGES.</em>
|
| 1577 |
</P>
|
| 1578 |
|
| 1579 |
<P>
|
| 1580 |
The 2nd (3rd, ...) byte of a multibyte character, or
any byte of a non-ASCII character in a stateful codeset,
can be 0x5c (the same as backslash in ASCII) or 0x22
(the same as double quote in ASCII).
Such bytes have to be properly escaped, because the
present version of GNU gettext ignores the
'charset' subitem of the '<tt>Content-Type</tt>' item when reading '<tt>msgstr</tt>'.
|
| 1587 |
</P>
|
| 1588 |
|
| 1589 |
<P>
|
| 1590 |
A <prgn>gettext</prgn>ed message must not be used in multiple contexts,
because a word may have different meanings in different contexts.
For example, in English a verb at the beginning of a sentence
expresses an order or a command. However, different languages
have different grammars. If a verb is <prgn>gettext</prgn>ed and is used
both in an ordinary sentence and in an imperative sentence,
it cannot be translated correctly.
|
| 1597 |
</P>
|
| 1598 |
|
| 1599 |
|
| 1600 |
<P>
|
| 1601 |
If a sentence is <prgn>gettext</prgn>ed, never divide the sentence.
If a sentence is divided in the original source code,
join the pieces so that a single string contains the full
sentence.
This is because the order of words in a sentence
differs among languages.
|
| 1607 |
</P>
|
| 1608 |
|
| 1609 |
<P>
|
| 1610 |
Software with <prgn>gettext</prgn>ed messages should not depend on
the length of the messages, because a message may become longer
in another language.
|
| 1613 |
</P>
|
| 1614 |
|
| 1615 |
<P>
|
| 1616 |
When two or more '%' directives for formatted output functions
such as <tt>printf()</tt> appear in a message,
a translation may need to change the order of these
directives. In such a case, the translator can specify
the order.
See the section 'Special Comments preceding Keywords'
in the info pages of <prgn>gettext</prgn> for details.
|
| 1623 |
</P>
|
| 1624 |
|
| 1625 |
<P>
|
| 1626 |
Projects now exist to translate the messages of various software.
For example,
|
| 1628 |
<url id="http://www.iro.umontreal.ca/~pinard/po/HTML/"
|
| 1629 |
name="Translation Project">.
|
| 1630 |
</P>
|
| 1631 |
|
| 1632 |
|
| 1633 |
|
| 1634 |
<sect1 id="gettextize"><heading>Gettext-izing software</heading>
|
| 1635 |
|
| 1636 |
<P>
|
| 1637 |
First, the software has to contain the following lines.
|
| 1638 |
<example>
#include <locale.h>    /* setlocale() */
#include <libintl.h>   /* bindtextdomain(), textdomain() */

int main(int argc, char **argv)
{
  ...
  setlocale (LC_ALL, "");   /* This is not for gettext, but all
                               i18n software should have this line. */
  bindtextdomain (PACKAGE, LOCALEDIR);
  textdomain (PACKAGE);
  ...
}
</example>
|
| 1650 |
where <var>PACKAGE</var> is the name of the message catalog and
<var>LOCALEDIR</var> is <tt>"/usr/share/locale"</tt> for Debian.
<var>PACKAGE</var> and <var>LOCALEDIR</var> should be defined
in a header file or in the <tt>Makefile</tt>.
|
| 1654 |
</P>
|
| 1655 |
|
| 1656 |
<P>
|
| 1657 |
It is convenient to prepare the following header file.
|
| 1658 |
<example>
#include <libintl.h>
#define _(String) gettext (String)
</example>
|
| 1662 |
and messages in source files should be written as
|
| 1663 |
<tt>_("message")</tt>, instead of <tt>"message"</tt>.
|
| 1664 |
</P>
|
| 1665 |
|
| 1666 |
<P>
|
| 1667 |
Next, catalog files have to be prepared.
|
| 1668 |
</P>
|
| 1669 |
|
| 1670 |
<P>
|
| 1671 |
First, a template for the catalog file is generated
using <prgn>xgettext</prgn>.
By default, the template is written to a file named
<tt>messages.po</tt>.
|
| 1675 |
<footnote>
|
| 1676 |
I HAVE TO WRITE EXPLANATION.
|
| 1677 |
</footnote>
|
| 1678 |
</P>
|
| 1679 |
|
| 1680 |
|
| 1681 |
|
| 1682 |
<sect1 id="gettext-translate"><heading>Translation</heading>
|
| 1683 |
|
| 1684 |
<P>
|
| 1685 |
Though <prgn>gettext</prgn>ization of software is a one-time
task, translation is continuing work, because new (or modified)
messages have to be translated when (or before) a new
version of the software is released.
|
| 1689 |
</P>
|
| 1690 |
|
| 1691 |
<sect id="mailnews"><heading>Mail/News</heading>
|
| 1692 |
|
| 1693 |
<P>
|
| 1694 |
The headers and the main text of mail and news messages have
to be expressed in 7 bits. Different standards govern the use of
non-ASCII codesets in the headers and in the main text.
(ESMTP, an extension of SMTP, can handle 8-bit messages.)
|
| 1698 |
</P>
|
| 1699 |
|
| 1700 |
<P>
|
| 1701 |
The codeset of the main text is specified in the
'<tt>charset</tt>' subitem of the
'<tt>Content-Type</tt>' header item.
The complete list of parameters which can be written
is found at ***.
|
| 1706 |
<footnote>
|
| 1707 |
I HAVE TO FIND THIS LIST (RFC?)
|
| 1708 |
</footnote>
|
| 1709 |
</P>
|
| 1710 |
|
| 1711 |
<P>
|
| 1712 |
'B' encoding and 'Q' encoding are used to put non-ASCII text
into the headers. These 'B' and 'Q' encodings are not codesets themselves;
they are ways to express non-ASCII strings using only ASCII characters.
|
| 1715 |
<footnote>
|
| 1716 |
I HAVE TO WRITE EXPLANATION
|
| 1717 |
</footnote>
|
| 1718 |
</P>
|
| 1719 |
|
| 1720 |
|
| 1721 |
|
| 1722 |
|
| 1723 |
|
| 1724 |
|
| 1725 |
|
| 1726 |
<chapt id="examples"><heading>Examples of I18N</heading>
|
| 1727 |
|
| 1728 |
<P>
|
| 1729 |
Programmers who have internationalized software, written
an L10N patch, and so on are encouraged to contribute
to this chapter.
|
| 1732 |
</P>
|
| 1733 |
|
| 1734 |
|
| 1735 |
|
| 1736 |
&minicom;
|
| 1737 |
&user-ja;
|
| 1738 |
|
| 1739 |
|
| 1740 |
|
| 1741 |
|
| 1742 |
|
| 1743 |
|
| 1744 |
|
| 1745 |
|
| 1746 |
|
| 1747 |
<chapt id="reference"><heading>References</heading>
|
| 1748 |
|
| 1749 |
<P>
|
| 1750 |
General
|
| 1751 |
<list>
|
| 1752 |
<item>
|
| 1753 |
<url id="http://i44www.info.uni-karlsruhe.de/~drepper/conf96/paper.html"
|
| 1754 |
name="i18n in GNU Project">
|
| 1755 |
<item>
|
| 1756 |
<url id="http://cns-web.bu.edu/pub/djohnson/web_files/i18n/i18n.html"
|
| 1757 |
name="Concept of C/UNIX i18n">
|
| 1758 |
</list>
|
| 1759 |
</P>
|
| 1760 |
|
| 1761 |
<P>
|
| 1762 |
Characters (general)
|
| 1763 |
<list>
|
| 1764 |
<item>
|
| 1765 |
<url id="http://www.kudpc.kyoto-u.ac.jp/~yasuoka/CJK.html"
|
| 1766 |
name="&urlname">
|
| 1767 |
</list>
|
| 1768 |
</P>
|
| 1769 |
|
| 1770 |
<P>
|
| 1771 |
Characters (ISO 8859)
|
| 1772 |
<list>
|
| 1773 |
<item>
|
| 1774 |
<url id="http://czyborra.com/charsets/iso8859.html" name="&urlname">
|
| 1775 |
<item>
|
| 1776 |
<url id="http://park.kiev.ua/multiling/ml-docs/iso-8859.html"
|
| 1777 |
name="&urlname">
|
| 1778 |
<item>
|
| 1779 |
<url id="http://www.terena.nl/projects/multiling/ml-docs/iso-8859.html"
|
| 1780 |
name="&urlname">
|
| 1781 |
</list>
|
| 1782 |
</P>
|
| 1783 |
|
| 1784 |
<P>
|
| 1785 |
Characters (ISO 2022)
|
| 1786 |
<list>
|
| 1787 |
<item>
|
| 1788 |
<url id="http://www.ewos.be/tg-cs/gconcept.htm" name="&urlname">
|
| 1789 |
<item>
|
| 1790 |
<url id="http://www.ecma.ch/stand/ECMA-035.HTM" name="&urlname">
|
| 1791 |
</list>
|
| 1792 |
</P>
|
| 1793 |
|
| 1794 |
<P>
|
| 1795 |
Characters (Unicode)
|
| 1796 |
<list>
|
| 1797 |
<item><url id="http://www.unicode.org/" name="&urlname">
|
| 1798 |
</list>
|
| 1799 |
</P>
|
| 1800 |
|
| 1801 |
<P>
|
| 1802 |
Example of i18n
|
| 1803 |
<list>
|
| 1804 |
<item>
|
| 1805 |
<url id="http://www.wg.omron.co.jp/~shin/Arena-CJK-doc/"
|
| 1806 |
name="Arena-i18n">
|
| 1807 |
Multilingual web browser.
|
| 1808 |
<item>
|
| 1809 |
<url id="http://www.m17n.org/mule/" name="Mule">
|
| 1810 |
Multilingual editor whose function is included in GNU Emacs 20
|
| 1811 |
and XEmacs 20.
|
| 1812 |
Mule is the most advanced m17n software to my knowledge.
|
| 1813 |
</list>
|
| 1814 |
</P>
|
| 1815 |
|
| 1816 |
<P>
|
| 1817 |
Projects
|
| 1818 |
<list>
|
| 1819 |
<item>
|
| 1820 |
<url id="http://www.li18nux.org/"
|
| 1821 |
name="Linux Internationalization Initiative">, or Li18nux,
|
| 1822 |
focuses on the i18n of a core set of APIs and components of Linux
|
| 1823 |
distributions. The results will be proposed to LSB.
|
| 1824 |
<item>
|
| 1825 |
<url id="http://www.iro.umontreal.ca/~pinard/po/HTML/"
|
| 1826 |
name="Translation Project">
|
| 1827 |
</list>
|
| 1828 |
</P>
|
| 1829 |
|
| 1830 |
|
| 1831 |
|
| 1832 |
|
| 1833 |
|
| 1834 |
</book>
|
| 1835 |
</debiandoc>
|