| 1 |
<!doctype debiandoc public "-//DebianDoc//DTD DebianDoc//EN"
|
| 2 |
[
|
| 3 |
<!entity % languages system "languages.ents"> %languages;
|
| 4 |
<!entity % examples system "examples.ents"> %examples;
|
| 5 |
]>
|
| 6 |
<debiandoc>
|
| 7 |
<book>
|
| 8 |
|
| 9 |
|
| 10 |
<titlepag>
|
| 11 |
<title>Introduction to i18n</title>
|
| 12 |
<author>
|
| 13 |
<name>Tomohiro KUBOTA</name>
|
| 14 |
<email>debian at tmail dot plala dot or dot jp (retired DD)</email>
|
| 15 |
</author>
|
| 16 |
<version><date></version>
|
| 17 |
<abstract>
|
| 18 |
This document describes basic concepts of i18n
(internationalization), how to write internationalized
software, and how to modify existing software to internationalize it.
Handling of characters is discussed in detail.
There are a few case studies in which the author internationalized
software such as TWM.
|
| 24 |
</abstract>
|
| 25 |
<copyright>
|
| 26 |
<copyrightsummary>
|
| 27 |
Copyright © 1999-2001 Tomohiro KUBOTA.
|
| 28 |
Chapters and sections whose original author is not KUBOTA are
|
| 29 |
copyright by their authors. Their names are written
|
| 30 |
at the top of the chapter or the section.
|
| 31 |
</copyrightsummary>
|
| 32 |
<p>
|
| 33 |
This manual is free software; you may redistribute it and/or modify it
|
| 34 |
under the terms of the GNU General Public License as published by the
|
| 35 |
Free Software Foundation; either version 2, or (at your option) any
|
| 36 |
later version.
|
| 37 |
</p>
|
| 38 |
<p>
|
| 39 |
This is distributed in the hope that it will be useful, but
|
| 40 |
<em>without any warranty</em>; without even the implied warranty of
|
| 41 |
merchantability or fitness for a particular purpose. See the GNU
|
| 42 |
General Public License for more details.
|
| 43 |
</p>
|
| 44 |
<p>
|
| 45 |
A copy of the GNU General Public License is available as
|
| 46 |
<tt>/usr/share/common-licenses/GPL</tt> in the Debian GNU/Linux
|
| 47 |
distribution or on the World Wide Web at
|
| 48 |
<url id="http://www.gnu.org/copyleft/gpl.html">.
|
| 49 |
You can also obtain it by writing to the Free
|
| 50 |
Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
|
| 51 |
02111-1307, USA.
|
| 52 |
</p>
|
| 53 |
</copyright>
|
| 54 |
</titlepag>
|
| 55 |
|
| 56 |
<toc detail="sect1">
|
| 57 |
|
| 58 |
<chapt id="scope"><heading>About This Document</heading>
|
| 59 |
|
| 60 |
<sect id="scope2"><heading>Scope</heading>
|
| 61 |
|
| 62 |
<P>
|
| 63 |
This document describes the basic ideas of I18N; it is written for
programmers and package maintainers of Debian GNU/Linux and
other UNIX-like platforms.
The aim of this document is to offer an introduction to
the basic concepts, character codes, and points where care
should be taken when one writes I18N-ed software or
an I18N patch for existing software. A great deal of
know-how and many case studies exist on the internationalization of
software. This document also tries to introduce the
current state and existing problems for each language and country.
|
| 73 |
</P>
|
| 74 |
|
| 75 |
<P>
|
| 76 |
This document stresses minimum requirements: for example,
that characters should be displayed with fonts of the
proper charset (users of the software must be
able to at least guess what is written),
that it must be possible to input characters from the keyboard, and
that software must not destroy characters.
I am trying to
describe how to satisfy these requirements.
|
| 84 |
</P>
|
| 85 |
|
| 86 |
<P>
|
| 87 |
This document is strongly related to programming
|
| 88 |
languages such as C and standardized I18N methods such as
|
| 89 |
using locales and <prgn>gettext</prgn>.
|
| 90 |
</P>
|
| 91 |
|
| 92 |
<sect id="newversion"><heading>New Versions of This Document</heading>
|
| 93 |
|
| 94 |
<P>
|
| 95 |
The current version of this document is available
on the
<url id="http://www.debian.org/doc/ddp"
name="DDP (Debian Documentation Project)"> page.
|
| 99 |
</P>
|
| 100 |
|
| 101 |
<p>
|
| 102 |
Note that the author rewrote this document in November 2000.
|
| 103 |
</p>
|
| 104 |
|
| 105 |
<p>
|
| 106 |
Since then, Debian has had several releases, and its packages now support I18N better
thanks to their support for UTF-8.
This document does not cover these new developments but is kept here
because it helps in understanding fundamental I18N issues.
|
| 110 |
</p>
|
| 111 |
|
| 112 |
<sect id="feedback"><heading>Feedback and Contributions</heading>
|
| 113 |
|
| 114 |
<P>
|
| 115 |
This document needs contributions, especially for the
chapter on each language (<ref id="languages">)
and the chapter on instances of I18N (<ref id="examples">).
These chapters consist of contributions.
|
| 119 |
</P>
|
| 120 |
|
| 121 |
<P>
|
| 122 |
Otherwise, this will be a document only on Japanization,
|
| 123 |
because the original author Tomohiro KUBOTA
|
| 124 |
(<email>kubota@debian.org</email>; a retired DD, and this is no longer a
working e-mail address)
speaks Japanese and lives in Japan.
|
| 127 |
</P>
|
| 128 |
|
| 129 |
<P>
|
| 130 |
<ref id="spanish"> is written by
|
| 131 |
Eusebio C Rufian-Zilbermann <email>eusebio@acm.org</email>.
|
| 132 |
</P>
|
| 133 |
|
| 134 |
<P>
|
| 135 |
Discussions are held on the <tt>debian-devel@lists.debian.org</tt> and
<tt>debian-i18n@lists.debian.org</tt> mailing lists.
|
| 137 |
Please contact <tt>debian-doc@lists.debian.org</tt> if you wish to
|
| 138 |
update this document.
|
| 139 |
</P>
|
| 140 |
|
| 141 |
<chapt id="intro"><heading>Introduction</heading>
|
| 142 |
|
| 143 |
<sect id="intro-concepts"><heading>General Concepts</heading>
|
| 144 |
|
| 145 |
<P>
|
| 146 |
Debian includes many pieces of software. Though many of them
|
| 147 |
have the ability to process, input, and output text data, some
|
| 148 |
of these programs assume text is written in English (ASCII).
|
| 149 |
For people who use non-English languages, these programs are
|
| 150 |
barely usable. Moreover, though many programs can handle
not only ASCII but also ISO-8859-1, some of them
cannot handle multibyte characters for CJK (Chinese, Japanese,
and Korean) languages, nor combining characters for Thai.
|
| 154 |
</P>
|
| 155 |
|
| 156 |
<P>
|
| 157 |
So far, people who use non-English languages have given up
|
| 158 |
using their native languages and have accepted computers as they were.
|
| 159 |
However, we should now abandon this mistaken attitude.
|
| 160 |
It is absurd that a person who
|
| 161 |
wants to use a computer has to learn English in advance.
|
| 162 |
</P>
|
| 163 |
|
| 164 |
<P>
|
| 165 |
I18N is needed in the following places.
|
| 166 |
<list>
|
| 167 |
<item>Displaying characters for the users' native languages.
|
| 168 |
<item>Inputting characters for the users' native languages.
|
| 169 |
<item>Handling files written in popular encodings
|
| 170 |
<footnote>
|
| 171 |
There are a few terms related to character code,
|
| 172 |
such as character set, character code, charset,
|
| 173 |
encoding, codeset, and so on. These words are explained
|
| 174 |
later.
|
| 175 |
</footnote>
|
| 176 |
that are used for the users' native languages.
|
| 177 |
<item>Using characters from the users' native languages for file names
|
| 178 |
and other items.
|
| 179 |
<item>Printing out characters from the users' native languages.
|
| 180 |
<item>Displaying messages by the program in the users' native languages.
|
| 181 |
<item>Formatting input and output of numbers, dates, money, etc., in a way that
|
| 182 |
obeys customs of the users' native cultures.
|
| 183 |
<item>Classifying and sorting characters, in a way that obeys the customs
|
| 184 |
of the users' native cultures.
|
| 185 |
<item>Using typesetting and hyphenation rules appropriate for the users' native
|
| 186 |
languages.
|
| 187 |
</list>
|
| 188 |
This document puts emphasis on the first three items, because
they are the basis for the other items. Another
reason is that you cannot use software lacking the first
three items at all, while you can use software lacking the other items,
albeit inconveniently. This document will also mention translation of
messages (item 6), which is often called 'I18N'. Note that
the author regards using the term 'I18N' to mean translation
and <prgn>gettext</prgn>ization as completely wrong. The reason
is reflected in the fact that the author did not include
translation and <prgn>gettext</prgn>ization in the important first
three items.
|
| 199 |
</P>
|
| 200 |
|
| 201 |
<P>
|
| 202 |
Imagine a word processor which can display error
and help messages in your native language but cannot process
text in your native language. You will easily understand that such a word
processor is not usable. On the other hand, a word processor which
can process your native language, but only displays error and help messages
in English, is usable, though it is not convenient.
Before we think of developing convenient software, we have to
think of developing usable software.
|
| 210 |
</P>
|
| 211 |
|
| 212 |
<P>
|
| 213 |
The following terminology is widely used.
|
| 214 |
<list>
|
| 215 |
<item>I18N (internationalization) means modification of software
or related technologies so that the software can potentially
handle multiple languages, customs, and so on in the world.
<item>L10N (localization) means implementation of support for a specific language
in already internationalized software.
|
| 220 |
</list>
|
| 221 |
However, this terminology is valid only for one specific model
|
| 222 |
out of a few models which we should consider for I18N.
|
| 223 |
Now I will introduce a few models other than this I18N-L10N model.
|
| 224 |
<taglist>
|
| 225 |
<tag>a. <strong>L10N</strong> (localization) model</tag>
|
| 226 |
<item><p>
|
| 227 |
This model supports two languages or character codes:
English (ASCII) and one other specific language. Examples of
software developed using this model are
the Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual
Emacs) text editor, which can input and output Japanese text files,
and the Hanterm X terminal emulator, which can display and input
Korean characters via a few Korean encodings.
Since each programmer has his or her own mother tongue,
there are numerous L10N patches and L10N programs,
each written to satisfy its author's own needs.
|
| 237 |
</p></item>
|
| 238 |
<tag>b. <strong>I18N</strong> (internationalization) model</tag>
|
| 239 |
<item><p>
|
| 240 |
This model supports many languages, but only two
of them, English (ASCII) and one other, at the same time.
One has to specify the other language, usually with the <tt>LANG</tt>
environment variable.
The above I18N-L10N model can be regarded as a part of
this I18N model.
<prgn>gettext</prgn>ization falls under the I18N model.
|
| 247 |
</p></item>
|
| 248 |
<tag>c. <strong>M17N</strong> (multilingualization) model</tag>
|
| 249 |
<item><p>
|
| 250 |
This model is to support many languages at the same time.
|
| 251 |
For example, Mule (MULtilingual Enhancement to GNU Emacs)
|
| 252 |
can handle a text file which contains multiple languages -
|
| 253 |
for example, a paper on differences between Korean and Chinese
|
| 254 |
whose main text is written in Finnish. GNU Emacs 20 and
|
| 255 |
XEmacs now include Mule.
|
| 256 |
Note that the M17N model can only be applied in character-related
|
| 257 |
instances. For example, it is nonsense to display a message
|
| 258 |
like 'file not found' in many languages at the same time.
|
| 259 |
Unicode and UTF-8 are technologies which can be used for
|
| 260 |
this model.
|
| 261 |
<footnote>
|
| 262 |
I recommend not implementing Unicode and UTF-8 directly.
Instead, use locale technology, and your software will
support not only UTF-8 but also many other encodings
in the world. If you implement UTF-8 directly,
your software can handle UTF-8 only. Such software
is not convenient.
|
| 268 |
</footnote>
|
| 269 |
</p></item>
|
| 270 |
</taglist>
|
| 271 |
</P>
|
| 272 |
|
| 273 |
<P>
|
| 274 |
Generally speaking, the M17N model is the best and the second-best is
|
| 275 |
the I18N model. The L10N model is the worst and you should not use it
|
| 276 |
except for a few fields where the I18N and M17N models are very difficult,
such as DTP and X terminal emulators.
In other words, it is better for text-processing software to handle
many languages at the same time than to handle only two (English and another language).
|
| 280 |
</P>
|
| 281 |
|
| 282 |
<P>
|
| 283 |
Now let me classify approaches for support of non-English languages
|
| 284 |
from another viewpoint.
|
| 285 |
<taglist>
|
| 286 |
<tag>A. Implementation <em>without</em> knowledge of each language</tag>
|
| 287 |
<item><p>
|
| 288 |
This approach is done by utilizing standardized methods supplied
|
| 289 |
by the kernel or libraries. The most important one is
|
| 290 |
<strong>locale</strong> technology which includes
|
| 291 |
<strong>locale category</strong>, conversion between
|
| 292 |
<strong>multibyte</strong> and <strong>wide
|
| 293 |
characters</strong> (<tt>wchar_t</tt>), and so on.
|
| 294 |
Another important technology is <prgn>gettext</prgn>.
|
| 295 |
The advantages of this approach are (1) that when the kernel or
|
| 296 |
libraries are upgraded, the software will automatically
|
| 297 |
support new additional languages, (2) that programmers need
|
| 298 |
not know each language, and (3) that a user can switch the behavior
of software with a common method, such as the <tt>LANG</tt> variable.
|
| 300 |
The disadvantage is that there are categories or fields where
|
| 301 |
a standardized method is not available. For example, there
|
| 302 |
are no standardized methods for text typesetting rules such
|
| 303 |
as line-breaking and hyphenation.
|
| 304 |
</p></item>
|
| 305 |
<tag>B. Implementation using knowledge of each language</tag>
|
| 306 |
<item><p>
|
| 307 |
This approach is to directly implement information about
|
| 308 |
each language based on the knowledge of programmers and
|
| 309 |
contributors. L10N almost always uses this approach.
|
| 310 |
The advantage of this approach is that a detailed and strict
|
| 311 |
implementation is possible beyond the field where
|
| 312 |
standardized methods are available, such as auto-detection
|
| 313 |
of encodings of text files to be read. Language-specific
problems can be solved perfectly (of course, this depends on
the skill of the programmer). The disadvantages are
(1) that the number of supported languages is restricted
by the skill or the interest of the programmers or the
contributors, (2) that labor which should be united and
concentrated on upgrading the kernel or libraries is dispersed
across many programs, that is, the wheel is re-invented,
and (3) that a user has to learn how to configure each program,
for example through the <tt>LESSCHARSET</tt> variable, the <tt>.emacs</tt> file,
and other methods.
This approach can cause problems: for example, GNU roff
(before version 1.16) assumes that <tt>0xad</tt> is a hyphen
character, which is valid only for ISO-8859-1.
However, impressive M17N software such as Mule can be
built using this approach.
|
| 329 |
</p></item>
|
| 330 |
</taglist>
|
| 331 |
</P>
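<P>
As a rough illustration of approach A, the following C sketch counts the
characters (not bytes) in a multibyte string by converting it to wide
characters one at a time. The function name <tt>count_characters</tt> is
made up for this example; <tt>mbrtowc()</tt> is the standard C conversion
function. The code contains no knowledge of any particular encoding and
relies entirely on the current <tt>LC_CTYPE</tt> locale.
</P>

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in a multibyte string,
 * using whatever encoding the current LC_CTYPE locale specifies.
 * Returns (size_t)-1 if the string is not valid in that encoding. */
size_t count_characters(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* start in the initial shift state */
    while (remaining > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;          /* invalid or truncated sequence */
        if (n == 0)
            break;                      /* NUL character; not expected here */
        s += n;                         /* skip the bytes of one character */
        remaining -= n;
        count++;
    }
    return count;
}
```

<P>
A program would normally call <tt>setlocale(LC_CTYPE, "")</tt> once at
startup; without that call the default "C" locale is in effect and only
ASCII text is guaranteed to be accepted.
</P>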
|
| 332 |
|
| 333 |
<P>
|
| 334 |
Using this classification, let me consider the L10N, I18N, and M17N models
|
| 335 |
from the programmer's point of view.
|
| 336 |
</P>
|
| 337 |
|
| 338 |
<P>
|
| 339 |
The L10N model can be realized only by using the programmer's own knowledge of his or her
language (i.e. approach B). Since the motivation of L10N is
usually to satisfy the programmer's own need, extensibility to
other languages is often ignored.
Though L10N-ed software is primarily useful for people who
speak the same language as the programmer, it is sometimes
useful for other people whose coding system is similar to
the programmer's. For example, software which
doesn't recognize EUC-JP but doesn't break EUC-JP will not
break EUC-KR either.
|
| 349 |
</P>
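<P>
What "doesn't recognize but doesn't break" means can be sketched in C as
follows. The hypothetical function <tt>ascii_upcase</tt> changes only the
bytes that are ASCII lowercase letters and passes every other byte through
untouched. Since every byte of a multibyte character in EUC-JP or EUC-KR
has its high bit set, such characters survive unchanged. The sketch
assumes an ASCII-based execution character set.
</P>

```c
#include <stddef.h>

/* Hypothetical helper: convert ASCII lowercase letters to uppercase in place.
 * Bytes outside 'a'..'z' are never modified, so bytes with the high bit set
 * (as used by EUC-JP, EUC-KR, ISO 8859-*, and so on) are preserved as-is.
 * Assumes an ASCII-based execution character set. */
void ascii_upcase(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (buf[i] >= 'a' && buf[i] <= 'z')
            buf[i] = (unsigned char)(buf[i] - 'a' + 'A');
    }
}
```

<P>
Code that instead masks bytes with <tt>0x7f</tt>, or passes a plain
<tt>char</tt> holding a negative value to the <tt>ctype.h</tt> functions,
is the kind that breaks such text.
</P>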
|
| 350 |
|
| 351 |
<P>
|
| 352 |
The main part of the I18N model is, in the case of a C program, achieved using
|
| 353 |
standardized locale technology and <prgn>gettext</prgn>.
|
| 354 |
The locale approach is classified as I18N because functions
related to locale change their behavior according to the current locales
for six categories, which are set by <tt>setlocale()</tt>.
Namely, approach A is emphasized for I18N. For fields where
standardized methods are not available, however, approach B
cannot be avoided. Even in such a case, the developers should
take care that support for new languages can be easily added
later, even by other developers.
|
| 362 |
</P>
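<P>
A minimal sketch of how a C program might adopt the user's locale is
given below. The function names <tt>init_locale</tt> and
<tt>current_codeset</tt> are invented for this example;
<tt>setlocale()</tt> is standard C, while <tt>nl_langinfo(CODESET)</tt>
is a POSIX interface and may be missing on non-POSIX systems.
</P>

```c
#include <locale.h>
#include <langinfo.h>
#include <stddef.h>

/* Hypothetical helper: adopt the locale requested by the environment
 * (LANG, LC_ALL, and so on) for all categories at once. If that locale
 * is not available on this system, fall back to the "C" locale.
 * Returns the name of the locale now in effect. */
const char *init_locale(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}

/* Hypothetical helper: name of the encoding used by the current
 * LC_CTYPE locale, as reported by the POSIX nl_langinfo() interface. */
const char *current_codeset(void)
{
    return nl_langinfo(CODESET);
}
```

<P>
After <tt>init_locale()</tt> has been called, locale-sensitive library
functions follow the user's settings without the program knowing which
language or encoding is in use.
</P>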
|
| 363 |
|
| 364 |
<P>
|
| 365 |
The M17N model can be achieved using international encodings such
|
| 366 |
as ISO 2022 and Unicode. Though you can hard-code these encodings
in your software (i.e. approach B), I recommend using standardized
locale technology. However, using international encodings
is not sufficient to achieve the M17N model. You will have to prepare
a mechanism to switch <strong>input methods</strong>. You will also want
to prepare an encoding-guessing mechanism for input files,
such as <prgn>jless</prgn> and <prgn>emacs</prgn> have.
Mule is the best example of software which has achieved M17N (though it does not
use locale technology).
|
| 375 |
</P>
|
| 376 |
|
| 377 |
<sect id="intro-organization"><heading>Organization</heading>
|
| 378 |
|
| 379 |
<P>
|
| 380 |
Let's preview the contents of each chapter in this document.
|
| 381 |
</P>
|
| 382 |
|
| 383 |
<P>
|
| 384 |
As I wrote, this document puts stress on correct handling of
characters and character codes for users' native
languages. To achieve this purpose, I will start the real contents
of this document by discussing basic, important concepts of
characters in <ref id="coding">. Since this chapter introduces
many terms, all of you will need to read it.
The next chapter, <ref id="codes">, introduces many national
and international standards of <em>coded character sets</em>
and <em>encodings</em>. I think most of you can do without
reading this chapter, since <em>LOCALE</em> technology
enables us to develop international software without knowledge
of these character sets and encodings. However, knowing
about these standards will help you to
understand the merit and necessity of LOCALE technology.
|
| 398 |
</P>
|
| 399 |
|
| 400 |
<P>
|
| 401 |
The following chapter, <ref id="languages">,
gives detailed information on
each language. This information will help people who develop
high-quality text-processing software such as DTP software and Web browsers.
|
| 405 |
</P>
|
| 406 |
|
| 407 |
<P>
|
| 408 |
The chapter <ref id="locale"> describes the most important
concept for I18N. Not only concepts but also many important
C functions are introduced in this chapter.
|
| 411 |
</P>
|
| 412 |
|
| 413 |
<P>
|
| 414 |
The next few chapters, <ref id="output">, <ref id="input">,
<ref id="internal">, and <ref id="internet">, cover important
and frequent applications of LOCALE technology.
You can find solutions for typical I18N problems in these
chapters.
|
| 419 |
</P>
|
| 420 |
|
| 421 |
<P>
|
| 422 |
You may need to develop software using some special libraries
or languages other than C/C++. The chapters <ref id="library">
and <ref id="otherlanguage"> are written for such purposes.
|
| 425 |
</P>
|
| 426 |
|
| 427 |
<P>
|
| 428 |
The next chapter, <ref id="examples">, is a collection of case studies.
Both generic and special technologies will be discussed.
You can also contribute by writing a section for this chapter.
|
| 431 |
</P>
|
| 432 |
|
| 433 |
<P>
|
| 434 |
You may want to study more.
The last chapter, <ref id="reference">, is supplied for this purpose.
Some of the references listed in the chapter are very important.
|
| 437 |
</P>
|
| 438 |
|
| 439 |
|
| 440 |
<chapt id="coding"><heading>Important Concepts for Character Coding Systems</heading>
|
| 441 |
|
| 442 |
<P>
|
| 443 |
The character coding system is one of the fundamental elements of
software and information processing.
Without proper handling of character codes, your software is
far from being internationalized.
Thus the author begins this document with a discussion of character
codes.
|
| 449 |
</P>
|
| 450 |
|
| 451 |
<P>
|
| 452 |
In this chapter, basic concepts such as <em>coded character set</em>
|
| 453 |
and <em>encoding</em> are introduced. These terms will be needed
|
| 454 |
to read this document and other documents on internationalization
|
| 455 |
and character codes including Unicode.
|
| 456 |
</P>
|
| 457 |
|
| 458 |
|
| 459 |
<sect id="coding-general-term"><heading>Basic Terminology</heading>
|
| 460 |
|
| 461 |
<P>
|
| 462 |
I begin this chapter by defining a few very important words.
|
| 463 |
</P>
|
| 464 |
|
| 465 |
<P>
|
| 466 |
As many people point out, there is confusion about terminology, since
words are used in various different ways. The author does not
want to add new terminology to an already confusing ocean of
terms. Therefore, the terminology of
<url id="http://www.faqs.org/rfcs/rfc2130.html" name="RFC 2130">
is
adopted in this document, with the one exception of the term 'character
set'.
|
| 474 |
</P>
|
| 475 |
|
| 476 |
<P>
|
| 477 |
<taglist>
|
| 478 |
<tag><strong>Character</strong>
|
| 479 |
<item><p>
|
| 480 |
A character is an individual unit of which sentences and text
consist. A character is an abstract notion.
|
| 482 |
</p></item>
|
| 483 |
<tag><strong>Glyph</strong>
|
| 484 |
<item><p>
|
| 485 |
A glyph is a specific instance of a character. <em>Character</em>
and <em>glyph</em> form a pair of terms. Sometimes a character
has multiple glyphs (for example, '$' may have one or two vertical
bars; Arabic characters have up to four glyphs each;
some CJK ideograms have many glyphs). Sometimes two or more
characters form one glyph (for example, the ligature 'fi').
In almost all cases, text data, which is intended to carry not
visual information but abstract ideas, does not need to carry
information on glyphs, since the difference between glyphs does
not affect the meaning of the text. However, the distinction
between different glyphs for a single CJK ideogram is
sometimes important for proper nouns such as names of
persons and places. There is no standardized method
for plain text to carry information on glyphs so far. This
means plain text cannot be used in some special fields
such as citizen registration systems, serious DTP such as
newspaper systems, and so on.
|
| 502 |
</p></item>
|
| 503 |
<tag><strong>Encoding</strong>
|
| 504 |
<item><p>
|
| 505 |
An encoding is a rule by which characters and text are
expressed as combinations of bits or bytes so that
computers can handle them. The terms <em>character
coding system</em>, <em>character code</em>, <em>charset</em>,
and so on are used to express the same meaning.
|
| 510 |
Basically, <em>encoding</em> takes care of
|
| 511 |
<em>characters</em>, not <em>glyphs</em>.
|
| 512 |
There are many official and de-facto standards of encodings
|
| 513 |
such as ASCII, ISO 8859-{1,2,...,15},
|
| 514 |
ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2},
|
| 515 |
EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620,
|
| 516 |
VISCII, VSCII, so-called 'CodePages', UTF-7, UTF-8, UTF-16LE,
|
| 517 |
UTF-16BE, KOI8-R, and so on.
|
| 518 |
To construct an encoding, we have to consider the
|
| 519 |
following concepts. (Encoding = one or more
|
| 520 |
CCS + one CES).
|
| 521 |
</p></item>
|
| 522 |
<tag><strong>Character Set</strong>
|
| 523 |
<item><p>
|
| 524 |
A character set is a set of characters. It determines
the range of characters which an encoding can handle.
In contrast to a <em>coded character set</em>, this is often
called a <em>non-coded character set</em>.
|
| 528 |
</p></item>
|
| 529 |
<tag><strong>Coded Character Set (CCS)</strong>
|
| 530 |
<item><p>
|
| 531 |
Coded character set (CCS) is a term defined in
<url id="http://www.faqs.org/rfcs/rfc2130.html" name="RFC 2130">
and means a character set in which every character
is assigned a unique number by some method. There are many national
and international standards for CCS.
Many national standards for CCS adopt
a way of coding that conforms to international
standards such as ISO 646 or ISO 2022. ASCII, BS 4730,
|
| 539 |
JISX 0201 Roman, and so on are examples of ISO-646 variants. All
|
| 540 |
ISO-646 variants, ISO 8859-*, JISX 0208, JISX 0212, KSX 1001,
|
| 541 |
GB 2312, CNS 11643, CCCII, TIS 620, TCVN 5712, and so on are
|
| 542 |
examples of ISO 2022-compliant CCS. VISCII and Big5 are
|
| 543 |
examples of non-ISO 2022-compliant
|
| 544 |
CCS. UCS-2 and UCS-4 (ISO 10646) are also examples of CCS.
|
| 545 |
</p></item>
|
| 546 |
<tag><strong>Character Encoding Scheme (CES)</strong>
|
| 547 |
<item><p>
|
| 548 |
Character Encoding Scheme (CES) is also a term defined in
<url id="http://www.faqs.org/rfcs/rfc2130.html" name="RFC 2130">.
It refers to a method of constructing an encoding from one or
more CCS. This is important when two or more CCS are used
to construct an encoding.
ISO 2022 is a method to construct an encoding from
one or more ISO 2022-compliant CCS. ISO 2022 is a very
complex system, and subsets of ISO 2022 are usually used,
such as EUC-JP (ASCII and JISX 0208), ISO-2022-KR (ASCII
and KSX 1001), and so on. CES is not important for
encodings with only one 8-bit CCS.
|
| 559 |
UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can be
|
| 560 |
regarded as CES whose CCS is Unicode or ISO 10646.
|
| 561 |
</p></item>
|
| 562 |
</taglist>
|
| 563 |
</P>
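<P>
The relation between a CCS and a CES can be made concrete with UTF-8.
In the C sketch below, the input is the number which the CCS (Unicode or
ISO 10646) assigns to a character, and the output is the byte sequence
which the CES (UTF-8) produces for that number. The function name
<tt>utf8_encode</tt> is invented for this example, and the sketch assumes
the modern definition of UTF-8, which is limited to U+10FFFF and at most
four bytes. As recommended elsewhere in this document, real programs
should usually prefer locale technology over hand-written conversion.
</P>

```c
#include <stddef.h>

/* Hypothetical helper: encode one CCS code point as UTF-8 bytes.
 * 'out' must have room for 4 bytes.
 * Returns the number of bytes written (1 to 4), or 0 if the value is a
 * surrogate (U+D800..U+DFFF) or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* 7 bits: same byte as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* up to 11 bits: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                     /* up to 16 bits: 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* up to 21 bits: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

<P>
For instance, HIRAGANA LETTER A has the number 0x3042 in the CCS, and
this function turns it into the three bytes 0xE3 0x81 0x82.
</P>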
|
| 564 |
|
| 565 |
<P>
|
| 566 |
Some other words are commonly used in relation to character codes.
|
| 567 |
</P>
|
| 568 |
|
| 569 |
<P>
|
| 570 |
<strong>Character code</strong> is a widely used term meaning
<em>encoding</em>. It is a primitive and crude term for
the way a computer handles characters by assigning numbers to them.
For example, <em>character code</em> can refer to an <em>encoding</em>
and can also refer to a <em>coded character set</em>. Thus this term can
be used only when the two can be regarded as belonging to
the same category. It should be avoided in serious
discussions. This document will not use this term hereafter.
|
| 578 |
</P>
|
| 579 |
|
| 580 |
<P>
|
| 581 |
<strong>Codeset</strong> is a term used for <em>encoding</em>
or <em>character encoding scheme</em>.
|
| 583 |
<footnote>
|
| 584 |
This document used the word <em>codeset</em> before November 2000
to mean <em>encoding</em>. I changed the terminology since
I could not find the word <em>codeset</em> in documents written
in English (I had adopted this word from a book in Japanese).
<em>Encoding</em> seems more popular.
|
| 589 |
</footnote>
|
| 590 |
</P>
|
| 591 |
|
| 592 |
<P>
|
| 593 |
<strong>charset</strong> is also a widely used word.
It appears, for example, in MIME (as in
<tt>Content-Type: text/plain; charset=iso-8859-1</tt>),
in XLFD (X Logical Font Description) font names
(the CharSetRegistry and CharSetEncoding fields), and so on.
Note that <em>charset</em> in MIME means <em>encoding</em>,
while <em>charset</em> in an XLFD font name means <em>coded character
set</em>. This is very confusing. In this document,
<em>charset</em> and <em>character set</em> are used in
the XLFD sense, since I think <em>character set</em> should
mean a set of characters, not an encoding.
|
| 604 |
</P>
|
| 605 |
|
| 606 |
<P>
|
| 607 |
Ken Lunde's "CJKV Information Processing" uses the term
<strong>encoding method</strong>. He says that
|
| 609 |
ISO-2022, EUC, Big5, and Shift-JIS are examples of
|
| 610 |
<em>encoding methods</em>. It seems that his <em>encoding
|
| 611 |
method</em> is <em>CES</em> in this document. However,
|
| 612 |
we should notice that Big5 and Shift-JIS are encodings
|
| 613 |
while ISO-2022 and EUC are not.
|
| 614 |
<footnote>
|
| 615 |
During I18N programming, we will frequently meet EUC-JP
or EUC-KR, while we will rarely meet EUC itself. I think it is
not appropriate to stress EUC, a class of encodings, over
EUC-JP, EUC-KR, and so on, which are concrete encodings. It is just like
regarding ISO 8859 as a concrete encoding, though ISO 8859 is
a class of encodings, ISO 8859-{1,2,...,15}.
|
| 621 |
</footnote>
|
| 622 |
</P>
|
| 623 |
|
| 624 |
<P>
|
| 625 |
<url id="http://www.unicode.org/unicode/reports/tr17/"
|
| 626 |
name="Character Encoding Model, Unicode Technical Report #17">
|
| 627 |
(hereafter, <em>"the Report"</em>) suggests a five-level model.
|
| 628 |
<list>
|
| 629 |
<item>ACR: abstract character repertoire
|
| 630 |
<item>CCS: Coded Character Set
|
| 631 |
<item>CEF: Character Encoding Form
|
| 632 |
<item>CES: Character Encoding Scheme
|
| 633 |
<item>TES: Transfer Encoding Syntax
|
| 634 |
</list>
|
| 635 |
</P>
|
| 636 |
|
| 637 |
<P>
|
| 638 |
<strong>TES</strong> is also suggested in
|
| 639 |
<url id="http://www.faqs.org/rfcs/rfc2130.html" name="RFC 2130">.
|
| 640 |
Some examples of
|
| 641 |
TES are: <em>base64</em>, <em>uuencode</em>, <em>BinHex</em>,
|
| 642 |
<em>quoted-printable</em>, <em>gzip</em>, and so on.
|
| 643 |
TES means a transformation of encoded data which may (or may not) include
textual data. Thus, TES is not a part of character encoding.
However, TES is important in data exchange over the Internet.
|
| 646 |
</P>
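<P>
That a TES works on bytes rather than on characters can be seen in the
following C sketch of base64, one of the TES listed above. The function
name <tt>base64_encode</tt> is invented for this example. It accepts any
byte sequence, whether it is EUC-JP text, UTF-8 text, or not text at all,
and knows nothing about the character encoding involved.
</P>

```c
#include <stddef.h>

/* Hypothetical helper: encode 'len' bytes from 'src' as base64 into 'dst'
 * and terminate the result with NUL.
 * 'dst' must have room for 4 * ((len + 2) / 3) + 1 bytes.
 * Returns the number of characters written, not counting the NUL. */
size_t base64_encode(const unsigned char *src, size_t len, char *dst)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t i, o = 0;

    /* Each full group of 3 input bytes becomes 4 output characters. */
    for (i = 0; i + 2 < len; i += 3) {
        unsigned long v = ((unsigned long)src[i] << 16)
                        | ((unsigned long)src[i + 1] << 8)
                        | (unsigned long)src[i + 2];
        dst[o++] = alphabet[(v >> 18) & 0x3F];
        dst[o++] = alphabet[(v >> 12) & 0x3F];
        dst[o++] = alphabet[(v >> 6) & 0x3F];
        dst[o++] = alphabet[v & 0x3F];
    }

    /* One or two leftover bytes are padded with '='. */
    if (i < len) {
        int two = (i + 1 < len);
        unsigned long v = (unsigned long)src[i] << 16;

        if (two)
            v |= (unsigned long)src[i + 1] << 8;
        dst[o++] = alphabet[(v >> 18) & 0x3F];
        dst[o++] = alphabet[(v >> 12) & 0x3F];
        dst[o++] = two ? alphabet[(v >> 6) & 0x3F] : '=';
        dst[o++] = '=';
    }
    dst[o] = '\0';
    return o;
}
```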
|
| 647 |
|
| 648 |
<P>
|
| 649 |
When using a computer, we rarely have occasion to deal with
<strong>ACR</strong>.
Though it is true that CJK people have their national standards of
ACR (for example, standards for ideograms which can be used for
personal names) and some of us may need to handle these ACR with
computers (for example, in a citizen registration system), this is too
heavy a theme for this document, because there are no
standardized or recommended methods to handle these ACR. You may
have to build the whole system for such purposes. Good luck!
|
| 658 |
</P>
|
| 659 |
|
| 660 |
<P>
|
| 661 |
<strong>CCS</strong> in <em>"the Report"</em> is the same as what I described
|
| 662 |
in this document.
|
| 663 |
It has concrete examples: ASCII, ISO 8859-{1,2,...,15}, JISX 0201,
|
| 664 |
JISX 0208, JISX 0212, KSX 1001, KSX 1002, GB 2312, Big5,
|
| 665 |
CNS 11643, TIS 620, VISCII, TCVN 5712, UCS2, UCS4, and so on.
|
| 666 |
Some of them are national standards, some are international
|
| 667 |
standards, and others are de-facto standards.
|
| 668 |
</P>
|
| 669 |
|
| 670 |
<P>
|
| 671 |
<strong>CEF</strong> and <strong>CES</strong> in <em>"the Report"</em>
|
| 672 |
correspond to <strong>CES</strong> in this document.
|
| 673 |
This document will not distinguish these two, since I think there
|
| 674 |
is no inconvenience in doing so. An encoding with a significant CEF doesn't
|
| 675 |
have a significant CES (in <em>"the Report"</em> meaning), and
|
| 676 |
vice versa. Then why should we have to distinguish these two?
|
| 677 |
The only exception is the UTF-16 series. In the UTF-16 series,
|
| 678 |
UTF-16 is a CEF and UTF-16BE is a CES. This is the only case where
|
| 679 |
we need the distinction between CEF and CES.
|
| 680 |
</P>
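<P>
The CEF/CES distinction for UTF-16 can be illustrated with a minimal
sketch. The function name <tt>utf16_unit_to_bytes</tt> is hypothetical
and chosen only for this illustration. It takes one 16bit code unit
(the CEF level) and writes it as two bytes in either byte order
(the CES level).
</P>

```c
#include <stdint.h>

/* Serialize one 16-bit UTF-16 code unit (CEF level) into two bytes
 * (CES level).  A non-zero big_endian gives the UTF-16BE byte order,
 * zero gives the UTF-16LE byte order. */
void utf16_unit_to_bytes(uint16_t unit, int big_endian, unsigned char out[2])
{
    unsigned char hi = (unsigned char)(unit >> 8);
    unsigned char lo = (unsigned char)(unit & 0xff);

    out[0] = big_endian ? hi : lo;
    out[1] = big_endian ? lo : hi;
}
```

<P>
For example, the code unit 0x3042 becomes the bytes 0x30 0x42 in
UTF-16BE and 0x42 0x30 in UTF-16LE, although the CEF value is the same.
</P>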
|
| 681 |
|
| 682 |
<P>
|
| 683 |
Now, <strong>CES</strong> is a concrete concept with concrete
|
| 684 |
examples: ASCII, ISO 8859-{1,2,...,15}, EUC-JP, EUC-KR, ISO 2022-JP,
|
| 685 |
ISO 2022-JP-1, ISO 2022-JP-2, ISO 2022-CN, ISO 2022-CN-EXT,
|
| 686 |
ISO 2022-KR, ISO 2022, VISCII, UTF-7, UTF-8, UTF-16LE, UTF-16BE,
|
| 687 |
and so on. Now they are encodings themselves.
|
| 688 |
</P>
|
| 689 |
|
| 690 |
<P>
|
| 691 |
The most important concept in this section is the distinction between
|
| 692 |
<em>coded character set</em> and <em>encoding</em>. <em>Coded
|
| 693 |
character set</em> is a component of <em>encoding</em>. Text data
|
| 694 |
are described in <em>encoding</em>, not <em>coded character set</em>.
|
| 695 |
</P>
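<P>
As a rough illustration of this distinction, the sketch below takes one
code point of the JISX 0208 coded character set and expresses it in two
different encodings, EUC-JP and Shift-JIS. The function names are
hypothetical. The Shift-JIS arithmetic is the commonly used conversion
formula and is stated here as an assumption, since Shift-JIS itself is
not explained in this section.
</P>

```c
/* One JISX 0208 code point is given as two bytes, each in 0x21-0x7e. */

/* EUC-JP: the same two bytes with the high bit set. */
void jisx0208_to_eucjp(unsigned char j1, unsigned char j2,
                       unsigned char out[2])
{
    out[0] = (unsigned char)(j1 | 0x80);
    out[1] = (unsigned char)(j2 | 0x80);
}

/* Shift-JIS: two rows of JISX 0208 are packed under one lead byte. */
void jisx0208_to_sjis(unsigned char j1, unsigned char j2,
                      unsigned char out[2])
{
    unsigned char s1 = (unsigned char)(((j1 - 0x21) >> 1) + 0x81);

    if (s1 > 0x9f)
        s1 += 0x40;        /* skip 0xa0-0xdf, used for 1-byte Katakana */
    if (j1 & 1)            /* odd row: lower half of the trail bytes */
        out[1] = (unsigned char)(j2 + (j2 < 0x60 ? 0x1f : 0x20));
    else                   /* even row: upper half of the trail bytes */
        out[1] = (unsigned char)(j2 + 0x7e);
    out[0] = s1;
}
```

<P>
The JISX 0208 code point 0x24 0x2c (Hiragana 'GA') becomes
0xa4 0xac in EUC-JP and 0x82 0xaa in Shift-JIS. The coded character
set is the same; only the encoding differs.
</P>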
|
| 696 |
|
| 697 |
|
| 698 |
<sect id="stateful"><heading>Stateless and Stateful</heading>
|
| 699 |
|
| 700 |
<P>
|
| 701 |
To construct an encoding with two or more CCS, CES has to supply
|
| 702 |
a method to avoid collision between these CCS.
|
| 703 |
There are two ways to do that. One is to make all characters
|
| 704 |
in all the CCS have unique code points. The other is to
|
| 705 |
allow characters from different CCS to have the same
|
| 706 |
code point and to have a code such as an escape sequence to switch the
|
| 707 |
<strong>SHIFT STATE</strong>, that is, to select one character set.
|
| 708 |
</P>
|
| 709 |
|
| 710 |
<P>
|
| 711 |
An encoding with shift states is called <strong>STATEFUL</strong> and
|
| 712 |
one without shift states is called <strong>STATELESS</strong>.
|
| 713 |
</P>
|
| 714 |
|
| 715 |
<P>
|
| 716 |
Examples of stateful encodings are: ISO 2022-JP, ISO 2022-KR,
|
| 717 |
ISO 2022-INT-1, ISO 2022-INT-2, and so on.
|
| 718 |
</P>
|
| 719 |
|
| 720 |
<P>
|
| 721 |
For example, in ISO 2022-JP, two bytes of <tt>0x24 0x2c</tt> may mean
|
| 722 |
a Japanese Hiragana character 'GA' or the two ASCII characters
|
| 723 |
'$' and ',' according to the shift state.
|
| 724 |
</P>
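<P>
The effect of the shift state can be sketched as follows. This is a
simplified illustration, not a complete ISO 2022-JP decoder: the
hypothetical function <tt>iso2022jp_count_chars</tt> only knows the two
escape sequences ESC 0x28 0x42 (switch to ASCII) and ESC 0x24 0x42
(switch to JISX 0208-1983), and it assumes well-formed input.
</P>

```c
#include <stddef.h>

/* Count characters in an ISO 2022-JP byte string.
 * Only two designations are handled in this sketch:
 *   ESC 0x28 0x42 -> ASCII          (one byte per character)
 *   ESC 0x24 0x42 -> JISX 0208-1983 (two bytes per character)
 * Escape sequences change the shift state but are not characters.
 * Any other escape sequence is not recognized and is miscounted. */
size_t iso2022jp_count_chars(const unsigned char *s, size_t len)
{
    int two_byte = 0;          /* shift state; text starts in ASCII */
    size_t i = 0, count = 0;

    while (i < len) {
        if (s[i] == 0x1b && i + 2 < len) {
            if (s[i + 1] == 0x28 && s[i + 2] == 0x42) {
                two_byte = 0;
                i += 3;
                continue;
            }
            if (s[i + 1] == 0x24 && s[i + 2] == 0x42) {
                two_byte = 1;
                i += 3;
                continue;
            }
        }
        i += (two_byte && i + 1 < len) ? 2 : 1;
        count++;
    }
    return count;
}
```

<P>
With this function, the two bytes <tt>0x24 0x2c</tt> alone are counted
as two characters, while the same two bytes placed between
ESC 0x24 0x42 and ESC 0x28 0x42 are counted as one character.
</P>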
|
| 725 |
|
| 726 |
<sect id="multibyte"><heading>Multibyte encodings</heading>
|
| 727 |
|
| 728 |
<P>
|
| 729 |
Encodings are classified into multibyte ones and the others,
|
| 730 |
according to the relationship between number of characters and number of
|
| 731 |
bytes in the encoding.
|
| 732 |
</P>
|
| 733 |
|
| 734 |
<P>
|
| 735 |
In non-multibyte encoding, one character is always expressed
|
| 736 |
by one byte. On the other hand, one character may be expressed in
|
| 737 |
one or more bytes in multibyte encoding. Note that the number
|
| 738 |
is not fixed even in a single encoding.
|
| 739 |
</P>
|
| 740 |
|
| 741 |
<P>
|
| 742 |
Examples of multibyte encodings are: EUC-JP, EUC-KR, ISO 2022-JP,
|
| 743 |
Shift-JIS, Big5, UHC, UTF-8, and so on. Note that all of UTF-* are
|
| 744 |
multibyte.
|
| 745 |
</P>
|
| 746 |
|
| 747 |
<P>
|
| 748 |
Examples of non-multibyte encodings are: ISO 8859-1, ISO 8859-2,
|
| 749 |
TIS 620, VISCII, and so on.
|
| 750 |
</P>
|
| 751 |
|
| 752 |
<P>
|
| 753 |
Note that even in non-multibyte encoding, number of characters
|
| 754 |
and number of bytes may differ if the encoding is stateful.
|
| 755 |
</P>
|
| 756 |
|
| 757 |
<P>
|
| 758 |
Ken Lunde's "CJKV Information Processing"
|
| 759 |
<footnote>
|
| 760 |
ISBN 1-56592-224-7, O'Reilly, 1999
|
| 761 |
</footnote>
|
| 762 |
classifies encoding methods
|
| 763 |
into the following three categories:
|
| 764 |
<list>
|
| 765 |
<item>modal
|
| 766 |
<item>non-modal
|
| 767 |
<item>fixed-length
|
| 768 |
</list>
|
| 769 |
<em>Modal</em> corresponds to <em>stateful</em> in this document.
|
| 770 |
The other two are <em>stateless</em>, where <em>non-modal</em> is
|
| 771 |
<em>multibyte</em> and <em>fixed-length</em> is
|
| 772 |
<em>non-multibyte</em>. However, I think <em>stateful</em> -
|
| 773 |
<em>stateless</em> and <em>multibyte</em> - <em>non-multibyte</em>
|
| 774 |
are independent concepts.
|
| 775 |
<footnote>
|
| 776 |
though there are no existing encodings which are stateful and
|
| 777 |
non-multibyte.
|
| 778 |
</footnote>
|
| 779 |
</P>
|
| 780 |
|
| 781 |
<sect id="number"><heading>Number of Bytes, Number of Characters, and Number of Columns</heading>
|
| 782 |
|
| 783 |
<P>
|
| 784 |
One ASCII character is always expressed by one byte
|
| 785 |
and occupies one column on console or X terminal emulators
|
| 786 |
(fixed font for X).
|
| 787 |
One must not make such an assumption for I18N programming
|
| 788 |
and has to clearly distinguish the numbers of bytes, characters,
|
| 789 |
and columns.
|
| 790 |
</P>
|
| 791 |
|
| 792 |
<P>
|
| 793 |
As for the relationship between characters and bytes,
|
| 794 |
in multibyte encodings, two or more bytes may be needed
|
| 795 |
to express one character. In stateful encodings, escape
|
| 796 |
sequences are not related to any characters.
|
| 797 |
</P>
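<P>
The difference between the number of bytes and the number of characters
can be sketched with UTF-8, one of the multibyte encodings mentioned
above. The hypothetical function <tt>utf8_count_chars</tt> below relies
on the fact that every byte of a UTF-8 character except the first has
the bit pattern 10xxxxxx. It assumes valid UTF-8 input and counts code
points, not columns.
</P>

```c
#include <stddef.h>

/* Count the characters (code points) in a valid UTF-8 byte string by
 * counting every byte that is not a continuation byte (10xxxxxx). */
size_t utf8_count_chars(const unsigned char *s, size_t nbytes)
{
    size_t i, count = 0;

    for (i = 0; i < nbytes; i++)
        if ((s[i] & 0xc0) != 0x80)
            count++;
    return count;
}
```

<P>
For the six bytes 0x61 0xe3 0x81 0x82 0xc3 0xa9, this returns three,
whereas <tt>strlen()</tt> on the same data would report six.
</P>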
|
| 798 |
|
| 799 |
<P>
|
| 800 |
The number of columns is not defined in any standard. However,
|
| 801 |
it is usual that CJK ideograms, Japanese Hiragana and Katakana,
|
| 802 |
and Korean Hangul occupy two columns in console or X terminal emulators.
|
| 803 |
Note that 'Full-width forms' in UCS-2 and UCS-4 coded character set
|
| 804 |
will occupy two columns and 'Half-width forms' will occupy one column.
|
| 805 |
Combining characters used for Thai and so on can be regarded as
|
| 806 |
zero-column characters. Though there are no standards, you can
|
| 807 |
use <tt>wcwidth()</tt> and <tt>wcswidth()</tt> for this purpose.
|
| 808 |
See <ref id="output-console-column"> for details.
|
| 809 |
</P>
|
| 810 |
|
| 811 |
|
| 812 |
|
| 813 |
|
| 814 |
|
| 815 |
|
| 816 |
<chapt id="codes"><heading>Coded Character Sets And Encodings in the World</heading>
|
| 817 |
|
| 818 |
<P>
|
| 819 |
Here major coded character sets and encodings are introduced.
|
| 820 |
Note that you don't have to know the detail of these
|
| 821 |
character codes if you use LOCALE and <tt>wchar_t</tt> technology.
|
| 822 |
</P>
|
| 823 |
|
| 824 |
<P>
|
| 825 |
However, this knowledge will help you to understand why the numbers
|
| 826 |
of bytes, characters, and columns should be counted separately,
|
| 827 |
why <tt>strchr()</tt> and so on should not be used, why you should
|
| 828 |
use LOCALE and <tt>wchar_t</tt> technology instead of hard-coded
|
| 829 |
processing of existing character codes, and so on.
|
| 830 |
</P>
|
| 831 |
|
| 832 |
<P>
|
| 833 |
These varieties of character sets and encodings will tell you about
|
| 834 |
struggles of people in the world to handle their own languages by
|
| 835 |
computers. Especially, CJK people could not help working out various
|
| 836 |
technologies to use plenty of characters within ASCII-based computer
|
| 837 |
systems.
|
| 838 |
</P>
|
| 839 |
|
| 840 |
<P>
|
| 841 |
If you are planning to develop text-processing software
|
| 842 |
beyond the fields which the LOCALE technology covers, you will
|
| 843 |
have to understand the following descriptions very well.
|
| 844 |
These fields include automatic detection of encodings
|
| 845 |
used for the input file (Most of Japanese-capable text viewers
|
| 846 |
such as <prgn>jless</prgn> and <prgn>lv</prgn> have this mechanism)
|
| 847 |
and so on.
|
| 848 |
</P>
|
| 849 |
|
| 850 |
|
| 851 |
<sect id="ascii"><heading>ASCII and ISO 646</heading>
|
| 852 |
|
| 853 |
<P>
|
| 854 |
<strong>ASCII</strong> is a CCS and also an encoding at the same time.
|
| 855 |
ASCII is 7bit and contains 94 printable characters which are
|
| 856 |
encoded in the region of <tt>0x21</tt>-<tt>0x7e</tt>.
|
| 857 |
</P>
|
| 858 |
|
| 859 |
<P>
|
| 860 |
<strong>ISO 646</strong> is the international standard of ASCII.
|
| 861 |
The following 12 characters
|
| 862 |
<list>
|
| 863 |
<item>0x23 (number),
|
| 864 |
<item>0x24 (dollar),
|
| 865 |
<item>0x40 (at),
|
| 866 |
<item>0x5b (left square bracket),
|
| 867 |
<item>0x5c (backslash),
|
| 868 |
<item>0x5d (right square bracket),
|
| 869 |
<item>0x5e (caret),
|
| 870 |
<item>0x60 (backquote),
|
| 871 |
<item>0x7b (left curly brace),
|
| 872 |
<item>0x7c (vertical line),
|
| 873 |
<item>0x7d (right curly brace), and
|
| 874 |
<item>0x7e (tilde)
|
| 875 |
</list>
|
| 876 |
are called <strong>IRV</strong> (International Reference Version)
|
| 877 |
and other 82 (94 - 12 = 82) characters are called
|
| 878 |
<strong>BCT</strong> (Basic Code Table).
|
| 879 |
Characters at IRV can be different between countries.
|
| 880 |
Here are a few examples of versions of ISO 646.
|
| 881 |
<list>
|
| 882 |
<item>UK version (BS 4730): 0x23 is pound currency mark, and so on.
<item>US version (ASCII)
|
| 884 |
<item>Japanese version (JISX 0201 Roman): 0x5c is yen currency mark, and
|
| 885 |
so on.
|
| 886 |
<item>Italian version (UNI 0204-70): 0x7b is 'a' with grave accent, and
|
| 887 |
so on.
|
| 888 |
<item>French version (NF Z 62-010): 0x7b is 'e' with acute accent, and
|
| 889 |
so on.
|
| 890 |
</list>
|
| 891 |
</P>
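<P>
The consequence of these national versions is that the same byte means
different characters. The following sketch maps a byte to a UCS code
point under a given version of ISO 646. It is deliberately incomplete:
the names are hypothetical, and only one replaced position per version
is handled, while real versions replace several more positions.
</P>

```c
enum iso646_variant { ISO646_US, ISO646_GB, ISO646_JP, ISO646_IT, ISO646_FR };

/* Map one 7-bit byte to a UCS code point under a given ISO 646 version.
 * Only one replaced position per version is handled in this sketch;
 * every other byte is passed through unchanged, as in ASCII. */
unsigned long iso646_to_ucs(enum iso646_variant v, unsigned char byte)
{
    switch (v) {
    case ISO646_GB:
        if (byte == 0x23) return 0x00a3;   /* pound currency mark */
        break;
    case ISO646_JP:
        if (byte == 0x5c) return 0x00a5;   /* yen currency mark */
        break;
    case ISO646_IT:
        if (byte == 0x7b) return 0x00e0;   /* 'a' with grave accent */
        break;
    case ISO646_FR:
        if (byte == 0x7b) return 0x00e9;   /* 'e' with acute accent */
        break;
    case ISO646_US:
        break;
    }
    return byte;
}
```

<P>
A converter needs to be told which version the input uses; the byte
0x5c alone does not say whether a backslash or a yen mark was meant.
</P>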
|
| 892 |
|
| 893 |
<P>
|
| 894 |
As far as I know, all encodings (besides EBCDIC) in the world
|
| 895 |
are compatible with ISO 646.
|
| 896 |
</P>
|
| 897 |
|
| 898 |
<P>
|
| 899 |
Characters in 0x00 - 0x1f and 0x7f are control characters, and 0x20 is the space character.
|
| 900 |
</P>
|
| 901 |
|
| 902 |
<P>
|
| 903 |
Nowadays usage of encodings incompatible with ASCII is not
|
| 904 |
encouraged and thus ISO 646-* (other than US version) should not
|
| 905 |
be used. One of the reasons is that when a string is converted into
|
| 906 |
Unicode, the converter doesn't know whether IRVs are converted into
|
| 907 |
characters with same shapes or characters with same codes.
|
| 908 |
Another reason is that source code
is written in ASCII. Source code must be readable anywhere.
|
| 910 |
</P>
|
| 911 |
|
| 912 |
|
| 913 |
<sect id="iso8859"><heading>ISO 8859</heading>
|
| 914 |
|
| 915 |
<P>
|
| 916 |
<strong>ISO 8859</strong> is both a series of CCS and a series of
|
| 917 |
encodings. It is an expansion of ASCII using all 8 bits.
|
| 918 |
Additional 96 printable characters encoded in 0xa0 - 0xff are
|
| 919 |
available besides 94 ASCII printable characters.
|
| 920 |
</P>
|
| 921 |
|
| 922 |
<P>
|
| 923 |
There are 15 variants of ISO 8859 (as of 2001).
|
| 924 |
<taglist>
|
| 925 |
<tag>ISO-8859-1 Latin alphabet No.1 (1987)</tag>
|
| 926 |
<item>characters for western European languages
|
| 927 |
<tag>ISO-8859-2 Latin alphabet No.2 (1987)</tag>
|
| 928 |
<item>characters for central European languages
|
| 929 |
<tag>ISO-8859-3 Latin alphabet No.3 (1988)</tag>
|
| 930 |
<tag>ISO-8859-4 Latin alphabet No.4 (1988)</tag>
|
| 931 |
<item>characters for northern European languages
|
| 932 |
<tag>ISO-8859-5 Latin/Cyrillic alphabet (1988)</tag>
|
| 933 |
<tag>ISO-8859-6 Latin/Arabic alphabet (1987)</tag>
|
| 934 |
<tag>ISO-8859-7 Latin/Greek alphabet (1987)</tag>
|
| 935 |
<tag>ISO-8859-8 Latin/Hebrew alphabet (1988)</tag>
|
| 936 |
<tag>ISO-8859-9 Latin alphabet No.5 (1989)</tag>
|
| 937 |
<item>same as ISO-8859-1 except for Turkish instead of Icelandic
|
| 938 |
<tag>ISO-8859-10 Latin alphabet No.6 (1993)</tag>
|
| 939 |
<item>Adds Inuit (Greenlandic) and Sami (Lappish) letters to ISO-8859-4
|
| 940 |
<tag>ISO-8859-11 Latin/Thai alphabet (2001)</tag>
|
| 941 |
<item>same as TIS-620 Thai national standard
|
| 942 |
<tag>ISO-8859-13 Latin alphabet No.7 (1998)</tag>
|
| 943 |
<tag>ISO-8859-14 Latin alphabet No.8 (Celtic) (1998)</tag>
|
| 944 |
<tag>ISO-8859-15 Latin alphabet No.9 (1999)</tag>
|
| 945 |
<tag>ISO-8859-16 Latin alphabet No.10 (2001)</tag>
|
| 946 |
<item> </item>
|
| 947 |
</taglist>
|
| 948 |
</P>
|
| 949 |
|
| 950 |
<P>
|
| 951 |
A detailed explanation is found at
|
| 952 |
<url id="http://park.kiev.ua/mutliling/ml-docs/iso-8859.html">.
|
| 953 |
</P>
|
| 954 |
|
| 955 |
|
| 956 |
<sect id="iso-2022"><heading>ISO 2022</heading>
|
| 957 |
|
| 958 |
<P>
|
| 959 |
Using ASCII and ISO 646, we can use 94 characters at most.
|
| 960 |
Using ISO 8859, the number increases to 190 (= 94 + 96).
|
| 961 |
However, we may want to use many more characters.
|
| 962 |
Or, we may want to use some, not one, of these character sets.
|
| 963 |
One answer is ISO 2022.
|
| 964 |
</P>
|
| 965 |
|
| 966 |
<P>
|
| 967 |
<strong>ISO 2022</strong> is an international standard of CES.
|
| 968 |
ISO 2022 determines a few requirements for a CCS to be a member
|
| 969 |
of ISO 2022-based encodings. It also defines very
|
| 970 |
extensive (and complex) rules to combine these CCS into one
|
| 971 |
encoding. Many encodings such as EUC-*, ISO 2022-*,
|
| 972 |
compound text,
|
| 973 |
<footnote>
|
| 974 |
Compound text is a standard for text exchange between X clients.
|
| 975 |
</footnote>
|
| 976 |
and so on can be regarded as subsets of ISO 2022.
|
| 977 |
ISO 2022 is so complex that you may not be able to understand it fully.
|
| 978 |
That is OK; what is important here is the ISO 2022 concept of
|
| 979 |
building an encoding by switching various (ISO 2022-compliant)
|
| 980 |
coded character sets.
|
| 981 |
</P>
|
| 982 |
|
| 983 |
<P>
|
| 984 |
The sixth edition of ECMA-35 is fully identical with
|
| 985 |
ISO 2022:1994 and you can find the official document
|
| 986 |
at <url id="http://www.ecma.ch/ecma1/stand/ECMA-035.HTM">.
|
| 987 |
</P>
|
| 988 |
|
| 989 |
<P>
|
| 990 |
ISO 2022 has two versions, 7bit and 8bit. First, the
|
| 991 |
8bit version is explained. 7bit version is a subset
|
| 992 |
of 8bit version.
|
| 993 |
</P>
|
| 994 |
|
| 995 |
<P>
|
| 996 |
The 8bit code space is divided into four regions,
|
| 997 |
<list>
|
| 998 |
<item>0x00 - 0x1f: C0 (Control Characters 0),
|
| 999 |
<item>0x20 - 0x7f: GL (Graphic Characters Left),
|
| 1000 |
<item>0x80 - 0x9f: C1 (Control Characters 1), and
|
| 1001 |
<item>0xa0 - 0xff: GR (Graphic Characters Right).
|
| 1002 |
</list>
|
| 1003 |
</P>
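<P>
The division of the 8bit code space can be written down directly.
The enum and function names in this small sketch are hypothetical.
</P>

```c
enum iso2022_region { REGION_C0, REGION_GL, REGION_C1, REGION_GR };

/* Classify a byte into one of the four regions of the 8bit code space. */
enum iso2022_region iso2022_region_of(unsigned char byte)
{
    if (byte <= 0x1f) return REGION_C0;   /* 0x00 - 0x1f */
    if (byte <= 0x7f) return REGION_GL;   /* 0x20 - 0x7f */
    if (byte <= 0x9f) return REGION_C1;   /* 0x80 - 0x9f */
    return REGION_GR;                     /* 0xa0 - 0xff */
}
```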
|
| 1004 |
|
| 1005 |
<P>
|
| 1006 |
GL and GR are the spaces where (printable) character sets are mapped.
|
| 1007 |
</P>
|
| 1008 |
|
| 1009 |
<P>
|
| 1010 |
Next, all character sets, for example, ASCII, ISO 646-UK,
|
| 1011 |
and JIS X 0208, are classified into the following four categories,
|
| 1012 |
<list>
|
| 1013 |
<item>(1) character set with 1-byte 94-character,
|
| 1014 |
<item>(2) character set with 1-byte 96-character,
|
| 1015 |
<item>(3) character set with multibyte 94-character, and
|
| 1016 |
<item>(4) character set with multibyte 96-character.
|
| 1017 |
</list>
|
| 1018 |
</P>
|
| 1019 |
|
| 1020 |
<P>
|
| 1021 |
Characters in character sets with 94-character are mapped
|
| 1022 |
into 0x21 - 0x7e. Characters in 96-character set are
|
| 1023 |
mapped into 0x20 - 0x7f.
|
| 1024 |
</P>
|
| 1025 |
|
| 1026 |
<P>
|
| 1027 |
For example, ASCII, ISO 646-UK, and JISX 0201 Katakana
|
| 1028 |
are classified into (1), JISX 0208 Japanese Kanji,
|
| 1029 |
KSX 1001 Korean, GB 2312-80 Chinese are classified into (3),
|
| 1030 |
and ISO 8859-* are classified into (2).
|
| 1031 |
</P>
|
| 1032 |
|
| 1033 |
<P>
|
| 1034 |
The mechanism to map these character sets into GL and GR is
|
| 1035 |
a bit complex. There are four buffers, G0, G1, G2, and G3.
|
| 1036 |
A character set is <strong>designated</strong> into one of these buffers
|
| 1037 |
and then a buffer is <strong>invoked</strong> into GL or GR.
|
| 1038 |
</P>
|
| 1039 |
|
| 1040 |
<P>
|
| 1041 |
Control sequences to 'designate' a character set into a
|
| 1042 |
buffer are determined as below.
|
| 1043 |
</P>
|
| 1044 |
|
| 1045 |
<P>
|
| 1046 |
<list>
|
| 1047 |
<item>A sequence to designate a character set with 1-byte 94-character
|
| 1048 |
<list>
|
| 1049 |
<item>into G0 set is: ESC 0x28 F,
|
| 1050 |
<item>into G1 set is: ESC 0x29 F,
|
| 1051 |
<item>into G2 set is: ESC 0x2a F, and
|
| 1052 |
<item>into G3 set is: ESC 0x2b F.
|
| 1053 |
</list>
|
| 1054 |
<item>A sequence to designate a character set with 1-byte 96-character
|
| 1055 |
<list>
|
| 1056 |
<item>into G1 set is: ESC 0x2d F,
|
| 1057 |
<item>into G2 set is: ESC 0x2e F, and
|
| 1058 |
<item>into G3 set is: ESC 0x2f F.
|
| 1059 |
</list>
|
| 1060 |
<item>A sequence to designate a character set with multibyte 94-character
|
| 1061 |
<list>
|
| 1062 |
<item>into G0 set is: ESC 0x24 0x28 F
|
| 1063 |
(exception: 'ESC 0x24 F' for F = 0x40, 0x41, 0x42.),
|
| 1064 |
<item>into G1 set is: ESC 0x24 0x29 F,
|
| 1065 |
<item>into G2 set is: ESC 0x24 0x2a F, and
|
| 1066 |
<item>into G3 set is: ESC 0x24 0x2b F.
|
| 1067 |
</list>
|
| 1068 |
<item>A sequence to designate a character set with multibyte 96-character
|
| 1069 |
<list>
|
| 1070 |
<item>into G1 set is: ESC 0x24 0x2d F,
|
| 1071 |
<item>into G2 set is: ESC 0x24 0x2e F, and
|
| 1072 |
<item>into G3 set is: ESC 0x24 0x2f F.
|
| 1073 |
</list>
|
| 1074 |
</list>
|
| 1075 |
where 'F' is determined for each character set:
|
| 1076 |
<list>
|
| 1077 |
<item>character set with 1-byte 94-character
|
| 1078 |
<list>
|
| 1079 |
<item>F=0x40 for ISO 646 IRV: 1983
|
| 1080 |
<item>F=0x41 for BS 4730 (UK)
|
| 1081 |
<item>F=0x42 for ANSI X3.4-1968 (ASCII)
|
| 1082 |
<item>F=0x43 for NATS Primary Set for Finland and Sweden
|
| 1083 |
<item>F=0x49 for JIS X 0201 Katakana
|
| 1084 |
<item>F=0x4a for JIS X 0201 Roman (Latin)
|
| 1085 |
<item>and more
|
| 1086 |
</list>
|
| 1087 |
<item>character set with 1-byte 96-character
|
| 1088 |
<list>
|
| 1089 |
<item>F=0x41 for ISO 8859-1 Latin-1
|
| 1090 |
<item>F=0x42 for ISO 8859-2 Latin-2
|
| 1091 |
<item>F=0x43 for ISO 8859-3 Latin-3
|
| 1092 |
<item>F=0x44 for ISO 8859-4 Latin-4
|
| 1093 |
<item>F=0x46 for ISO 8859-7 Latin/Greek
|
| 1094 |
<item>F=0x47 for ISO 8859-6 Latin/Arabic
|
| 1095 |
<item>F=0x48 for ISO 8859-8 Latin/Hebrew
|
| 1096 |
<item>F=0x4c for ISO 8859-5 Latin/Cyrillic
|
| 1097 |
<item>and more
|
| 1098 |
</list>
|
| 1099 |
<item>character set with multibyte 94-character
|
| 1100 |
<list>
|
| 1101 |
<item>F=0x40 for JISX 0208-1978 Japanese
|
| 1102 |
<item>F=0x41 for GB 2312-80 Chinese
|
| 1103 |
<item>F=0x42 for JISX 0208-1983 Japanese
|
| 1104 |
<item>F=0x43 for KSC 5601 Korean
|
| 1105 |
<item>F=0x44 for JISX 0212-1990 Japanese
|
| 1106 |
<item>F=0x45 for CCITT Extended GB (ISO-IR-165)
|
| 1107 |
<item>F=0x47 for CNS 11643-1992 Set 1 (Taiwan)
|
| 1108 |
<item>F=0x48 for CNS 11643-1992 Set 2 (Taiwan)
|
| 1109 |
<item>F=0x49 for CNS 11643-1992 Set 3 (Taiwan)
|
| 1110 |
<item>F=0x4a for CNS 11643-1992 Set 4 (Taiwan)
|
| 1111 |
<item>F=0x4b for CNS 11643-1992 Set 5 (Taiwan)
|
| 1112 |
<item>F=0x4c for CNS 11643-1992 Set 6 (Taiwan)
|
| 1113 |
<item>F=0x4d for CNS 11643-1992 Set 7 (Taiwan)
|
| 1114 |
<item>and more
|
| 1115 |
</list>
|
| 1116 |
</list>
|
| 1117 |
The complete list of these coded character sets is found at
|
| 1118 |
<url id="http://www.itscj.ipsj.or.jp/ISO-IR/"
|
| 1119 |
name="International Register of Coded Character Sets">.
|
| 1120 |
</P>
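<P>
The designation rules above are regular enough to be generated by a
small function. The following is a sketch under the assumption that
only the sequences listed above are needed; the function name
<tt>iso2022_designate</tt> and its parameters are hypothetical.
</P>

```c
#include <stddef.h>

/* Build the escape sequence that designates a character set into G0-G3.
 *   multibyte: 0 for a 1-byte set, non-zero for a multibyte set
 *   set96:     0 for a 94-character set, non-zero for a 96-character set
 *   g:         target buffer, 0 to 3
 *   f:         final byte 'F' registered for the character set
 * Writes at most 4 bytes to out and returns the length, or 0 if the
 * combination is not allowed (a 96-character set cannot go into G0). */
size_t iso2022_designate(int multibyte, int set96, int g, unsigned char f,
                         unsigned char out[4])
{
    size_t n = 0;

    if (g < 0 || g > 3 || (set96 && g == 0))
        return 0;
    out[n++] = 0x1b;                          /* ESC */
    if (multibyte)
        out[n++] = 0x24;
    /* Exception: multibyte 94-sets with F = 0x40, 0x41, 0x42 are
     * designated into G0 by 'ESC 0x24 F' without the 0x28 byte. */
    if (!(multibyte && !set96 && g == 0 && f >= 0x40 && f <= 0x42))
        out[n++] = (unsigned char)((set96 ? 0x2c : 0x28) + g);
    out[n++] = f;
    return n;
}
```

<P>
For example, designating ASCII (F=0x42) into G0 gives ESC 0x28 0x42,
designating JISX 0208-1983 (F=0x42) into G0 gives ESC 0x24 0x42, and
designating KSC 5601 (F=0x43) into G1 gives ESC 0x24 0x29 0x43.
</P>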
|
| 1121 |
|
| 1122 |
<P>
|
| 1123 |
Control codes to 'invoke' one of G{0123} into GL or GR
|
| 1124 |
are determined as below.
|
| 1125 |
<list>
|
| 1126 |
<item>A control code to invoke G0 into GL is: SI (Shift In), also called LS0 (Locking Shift 0)
<item>A control code to invoke G1 into GL is: SO (Shift Out), also called LS1 (Locking Shift 1)
|
| 1128 |
<item>A control code to invoke G2 into GL is: LS2 (Locking Shift 2)
|
| 1129 |
<item>A control code to invoke G3 into GL is: LS3 (Locking Shift 3)
|
| 1130 |
<item>A control code to invoke one character
|
| 1131 |
in G2 into GL is: SS2 (Single Shift 2)
|
| 1132 |
<item>A control code to invoke one character
|
| 1133 |
in G3 into GL is: SS3 (Single Shift 3)
|
| 1134 |
<item>A control code to invoke G1 into GR is: LS1R (Locking Shift 1 Right)
|
| 1135 |
<item>A control code to invoke G2 into GR is: LS2R (Locking Shift 2 Right)
|
| 1136 |
<item>A control code to invoke G3 into GR is: LS3R (Locking Shift 3 Right)
|
| 1137 |
</list>
|
| 1138 |
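<footnote>
The values of these control codes are: SI (LS0) is 0x0f, SO (LS1) is 0x0e,
LS2 is ESC 0x6e, LS3 is ESC 0x6f, SS2 is 0x8e (ESC 0x4e in 7bit version),
SS3 is 0x8f (ESC 0x4f in 7bit version), LS1R is ESC 0x7e,
LS2R is ESC 0x7d, and LS3R is ESC 0x7c.
</footnote>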
|
| 1141 |
</P>
|
| 1142 |
|
| 1143 |
<P>
|
| 1144 |
Note that a code in a character set invoked into GR is
|
| 1145 |
or-ed with 0x80.
|
| 1146 |
</P>
|
| 1147 |
|
| 1148 |
<P>
|
| 1149 |
ISO 2022 also determines <strong>announcer</strong> code. For example,
|
| 1150 |
'ESC 0x20 0x41' means 'Only G0 buffer is used. G0 is already
|
| 1151 |
invoked into GL'. This simplifies the coding system. Even this
|
| 1152 |
announcer can be omitted if people who exchange data agree.
|
| 1153 |
</P>
|
| 1154 |
|
| 1155 |
<P>
|
| 1156 |
The 7bit version of ISO 2022 is a subset of the 8bit version. It does not
|
| 1157 |
use C1 and GR.
|
| 1158 |
</P>
|
| 1159 |
|
| 1160 |
<P>
|
| 1161 |
Explanation of C0 and C1 is omitted here.
|
| 1162 |
</P>
|
| 1163 |
|
| 1164 |
|
| 1165 |
|
| 1166 |
<sect1 id="euc"><heading>EUC (Extended Unix Code)</heading>
|
| 1167 |
|
| 1168 |
<P>
|
| 1169 |
<strong>EUC</strong> is a CES which is a subset of 8bit version
|
| 1170 |
of ISO 2022 except for the usage of SS2 and SS3 code. Though these
|
| 1171 |
codes are used to invoke G2 and G3 into GL in ISO 2022, they are
|
| 1172 |
invoked into GR in EUC.
|
| 1173 |
<strong>EUC-JP</strong>, <strong>EUC-KR</strong>, <strong>EUC-CN</strong>,
|
| 1174 |
and <strong>EUC-TW</strong> are widely used encodings
|
| 1175 |
which use EUC as CES.
|
| 1176 |
</P>
|
| 1177 |
|
| 1178 |
<P>
|
| 1179 |
EUC is stateless.
|
| 1180 |
</P>
|
| 1181 |
|
| 1182 |
<P>
|
| 1183 |
EUC can contain 4 CCS by using G0, G1, G2, and G3.
|
| 1184 |
Though there is no requirement that ASCII is designated to G0,
|
| 1185 |
I don't know of any EUC codeset in which ASCII is not designated to G0.
|
| 1186 |
</P>
|
| 1187 |
|
| 1188 |
<P>
|
| 1189 |
For EUC with G0-ASCII, all codes other than ASCII are encoded
|
| 1190 |
in 0x80 - 0xff and this is upward compatible with ASCII.
|
| 1191 |
</P>
|
| 1192 |
|
| 1193 |
<P>
|
| 1194 |
Expressions for characters in G0, G1, G2, and G3 character sets
|
| 1195 |
are described below in binary:
|
| 1196 |
<list>
|
| 1197 |
<item>G0: 0???????
|
| 1198 |
<item>G1: 1??????? [1??????? [...]]
|
| 1199 |
<item>G2: SS2 1??????? [1??????? [...]]
|
| 1200 |
<item>G3: SS3 1??????? [1??????? [...]]
|
| 1201 |
</list>
|
| 1202 |
where SS2 is 0x8e and SS3 is 0x8f.
|
| 1203 |
</P>
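<P>
Because EUC is stateless, the first byte of a character is enough to
tell which of G0 - G3 it belongs to. The sketch below applies the
patterns above to EUC-JP, assuming the usual assignment of ASCII to G0,
JISX 0208 to G1 (two bytes), JISX 0201 Kana to G2 (SS2 plus one byte),
and JISX 0212 to G3 (SS3 plus two bytes). The function name is
hypothetical.
</P>

```c
#include <stddef.h>

#define EUC_SS2 0x8e
#define EUC_SS3 0x8f

/* Return the byte length of the EUC-JP character starting at s[0] and
 * store in *gset which of G0-G3 it belongs to.  Returns 0 when the
 * bytes do not form a complete, well-formed character. */
size_t eucjp_char_len(const unsigned char *s, size_t avail, int *gset)
{
    size_t need, i;

    if (avail == 0)
        return 0;
    if (s[0] < 0x80) {                          /* G0: 0???????      */
        *gset = 0; need = 1;
    } else if (s[0] == EUC_SS2) {               /* G2: SS2 + 1 byte  */
        *gset = 2; need = 2;
    } else if (s[0] == EUC_SS3) {               /* G3: SS3 + 2 bytes */
        *gset = 3; need = 3;
    } else if (s[0] >= 0xa1 && s[0] <= 0xfe) {  /* G1: 2 bytes       */
        *gset = 1; need = 2;
    } else {
        return 0;
    }
    if (avail < need)
        return 0;
    /* Every following byte must lie in 0xa1-0xfe, that is, a code of a
     * 94-character set or-ed with 0x80. */
    for (i = 1; i < need; i++)
        if (s[i] < 0xa1 || s[i] > 0xfe)
            return 0;
    return need;
}
```

<P>
Calling this function repeatedly and advancing by the returned length
walks through an EUC-JP string one character at a time without keeping
any shift state.
</P>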
|
| 1204 |
|
| 1205 |
|
| 1206 |
|
| 1207 |
<sect1 id="iso2022set"><heading>ISO 2022-compliant Character Sets</heading>
|
| 1208 |
|
| 1209 |
<P>
|
| 1210 |
There are many national and international standards of coded
|
| 1211 |
character sets (CCS). Some of them are ISO 2022-compliant
|
| 1212 |
and can be used in ISO 2022 encoding.
|
| 1213 |
</P>
|
| 1214 |
|
| 1215 |
<P>
|
| 1216 |
ISO 2022-compliant CCS are classified into one of the following:
|
| 1217 |
<list>
|
| 1218 |
<item>94 characters
|
| 1219 |
<item>96 characters
|
| 1220 |
<item>94x94x94x... characters
|
| 1221 |
</list>
|
| 1222 |
</P>
|
| 1223 |
|
| 1224 |
<P>
|
| 1225 |
The most famous 94 character set is US-ASCII. Also, all
|
| 1226 |
ISO 646 variants are ISO 2022-compliant 94 character sets.
|
| 1227 |
</P>
|
| 1228 |
|
| 1229 |
<P>
|
| 1230 |
All ISO 8859-* character sets are ISO 2022-compliant
|
| 1231 |
96 character sets.
|
| 1232 |
</P>
|
| 1233 |
|
| 1234 |
<P>
|
| 1235 |
There are many 94x94 character sets. All of them are related to
|
| 1236 |
CJK ideograms.
|
| 1237 |
<taglist>
|
| 1238 |
<tag><strong>JISX 0208</strong> (aka JIS C 6226)
|
| 1239 |
<item><p>National standard of Japan. 1978 version contains 6802 characters
|
| 1240 |
including Kanji (ideogram), Hiragana, Katakana, Latin, Greek,
|
| 1241 |
Cyrillic, numeric, and other symbols. The current (1997) version
|
| 1242 |
contains 6879 characters.</p>
|
| 1243 |
<tag><strong>JISX 0212</strong>
|
| 1244 |
<item><p>National standard of Japan. 6067 characters (most of them
|
| 1245 |
are Kanji). This character set is intended to be used in
|
| 1246 |
addition to JISX 0208.</p>
|
| 1247 |
<tag><strong>JISX 0213</strong>
|
| 1248 |
<item><p>Japanese national standard. Released in 2000.
|
| 1249 |
This includes JISX 0208 characters and additional thousands
|
| 1250 |
of characters. Thus, this is intended to be an extension
|
| 1251 |
and a replacement of JISX 0208.
|
| 1252 |
This has two 94x94 character sets; one of them includes JISX 0208
plus about 2000 characters and the other includes about
|
| 1254 |
2400 characters.
|
| 1255 |
Strictly speaking, JISX 0213 is not a simple
|
| 1256 |
superset of JISX 0208 because a few tens of Kanji variants
|
| 1257 |
which are unified and share the same code points in JISX 0208
|
| 1258 |
are dis-unified and have separate code points in JISX 0213.
|
| 1259 |
JISX 0213 shares many characters with JISX 0212.</p>
|
| 1260 |
<tag><strong>KSX 1001</strong> (aka KSC 5601)
|
| 1261 |
<item><p>National standard of South Korea. 8224 characters including
|
| 1262 |
2350 Hangul, Hanja (ideogram), Hiragana, Katakana, Latin,
|
| 1263 |
Greek, Cyrillic, and other symbols. Hanja are ordered by
|
| 1264 |
reading and Hanja with multiple readings are coded multiple times.</p>
|
| 1265 |
<tag><strong>KSX 1002</strong>
|
| 1266 |
<item><p>National standard of South Korea. 7659 characters including
|
| 1267 |
Hangul and Hanja. Intended to be used in addition to KSX 1001.</p>
|
| 1268 |
<tag><strong>KPS 9566</strong>
|
| 1269 |
<item><p>National standard of North Korea. Similar to KSX 1001.</p>
|
| 1270 |
<tag><strong>GB 2312</strong>
|
| 1271 |
<item><p>National standard of China. 7445 characters including
|
| 1272 |
6763 Hanzi (ideogram), Latin, Greek, Cyrillic, Hiragana,
|
| 1273 |
Katakana, and other symbols.</p>
|
| 1274 |
<tag><strong>GB 7589</strong> (aka GB2)
|
| 1275 |
<item><p>National standard of China. 7237 Hanzi. Intended to be
|
| 1276 |
used in addition to GB 2312.</p>
|
| 1277 |
<tag><strong>GB 7590</strong> (aka GB4)
|
| 1278 |
<item><p>National standard of China. 7039 Hanzi. Intended to be
|
| 1279 |
used in addition to GB 2312 and GB 7589.</p>
|
| 1280 |
<tag><strong>GB 12345</strong> (aka GB/T 12345, GB1 or GBF)
|
| 1281 |
<item><p>National standard of China. 7583 characters. Traditional
|
| 1282 |
character version which corresponds to GB 2312 simplified
characters.</p>
|
| 1284 |
<tag><strong>GB 13131</strong> (aka GB3)
|
| 1285 |
<item><p>National standard of China. Traditional
|
| 1286 |
character version which corresponds to GB 7589 simplified
characters.</p>
|
| 1288 |
<tag><strong>GB 13132</strong> (aka GB5)
|
| 1289 |
<item><p>National standard of China. Traditional
|
| 1290 |
character version which corresponds to GB 7590 simplified
characters.</p>
|
| 1292 |
<tag><strong>CNS 11643</strong>
|
| 1293 |
<item><p>National standard of Taiwan. Has 7 planes. Planes 1 and
2 include all characters included in Big5. Plane 1 includes
6085 characters including Hanzi (ideogram), Latin, Greek,
and other symbols. Plane 2 includes 7650. The number of characters
for plane 3 is 6184, plane 4 is 7298, plane 5 is 8603,
plane 6 is 6388, and plane 7 is 6539.</p>
|
| 1299 |
</taglist>
|
| 1300 |
</P>
|
| 1301 |
|
| 1302 |
<P>
|
| 1303 |
There is a 94x94x94 character set. This is <strong>CCCII</strong>.
|
| 1304 |
This is a national standard of Taiwan. Now 73400 characters are
|
| 1305 |
included. (The number is increasing.)
|
| 1306 |
</P>
|
| 1307 |
|
| 1308 |
<P>
|
| 1309 |
Non-ISO 2022-compliant character sets are introduced later in
|
| 1310 |
<ref id="othercodes">.
|
| 1311 |
</P>
|
| 1312 |
|
| 1313 |
<sect1 id="iso2022enc"><heading>ISO 2022-compliant Encodings</heading>
|
| 1314 |
|
| 1315 |
<p>
|
| 1316 |
There are many ISO 2022-compliant encodings which are subsets
|
| 1317 |
of ISO 2022.
|
| 1318 |
</p>
|
| 1319 |
|
| 1320 |
<P>
|
| 1321 |
<taglist>
|
| 1322 |
<tag><strong>Compound Text</strong>
|
| 1323 |
<item><p>
|
| 1324 |
This is used for X clients to communicate with each other,
|
| 1325 |
for example, copy-paste.
|
| 1326 |
</P>
|
| 1327 |
<tag><strong>EUC-JP</strong>
|
| 1328 |
<item><p>An EUC encoding with ASCII, JISX 0208, JISX 0201 Kana,
|
| 1329 |
and JISX 0212 coded character sets. There are many systems
|
| 1330 |
which do not support JISX 0201 Kana and JISX 0212.
|
| 1331 |
Widely used in Japan for POSIX systems.
|
| 1332 |
</p>
|
| 1333 |
<tag><strong>EUC-KR</strong>
|
| 1334 |
<item><p>An EUC encoding with ASCII and KSX 1001.
|
| 1335 |
</p>
|
| 1336 |
<tag><strong>CN-GB</strong> (aka EUC-CN)
|
| 1337 |
<item><p>An EUC encoding with ASCII and GB 2312.
|
| 1338 |
The most popular encoding in P. R. China. This encoding
|
| 1339 |
is sometimes referred to simply as 'GB'.
|
| 1340 |
</p>
|
| 1341 |
<tag><strong>EUC-TW</strong>
|
| 1342 |
<item><p>An extended EUC encoding with ASCII, CNS 11643 plane 1,
and other (2-7) planes of CNS 11643.
|
| 1344 |
</p>
|
| 1345 |
<tag><strong>ISO 2022-JP</strong>
|
| 1346 |
<item><p>Described in
|
| 1347 |
<url id="http://www.faqs.org/rfcs/rfc1468.html" name="RFC 1468">.
|
| 1348 |
</p>
|
| 1349 |
<P>***** Not written yet *****</P>
|
| 1350 |
<tag><strong>ISO 2022-JP-1</strong> (upward compatible to ISO 2022-JP)
|
| 1351 |
<item><p>Described in
|
| 1352 |
<url id="http://www.faqs.org/rfcs/rfc2237.html" name="RFC 2237">.
|
| 1353 |
</p>
|
| 1354 |
<P>***** Not written yet *****</P>
|
| 1355 |
<tag><strong>ISO 2022-JP-2</strong> (upward compatible to ISO 2022-JP-1)
|
| 1356 |
<item><p>Described in
|
| 1357 |
<url id="http://www.faqs.org/rfcs/rfc1554.html" name="RFC 1554">.
|
| 1358 |
</p>
|
| 1359 |
<P>***** Not written yet *****</P>
|
| 1360 |
<tag><strong>ISO 2022-KR</strong>
|
| 1361 |
<item><p>aka Wansung. Described in
|
| 1362 |
<url id="http://www.faqs.org/rfcs/rfc1557.html" name="RFC 1557">.
|
| 1363 |
</p>
|
| 1364 |
<P>***** Not written yet *****</P>
|
| 1365 |
<tag><strong>ISO 2022-CN</strong>
|
| 1366 |
<item><p>Described in
|
| 1367 |
<url id="http://www.faqs.org/rfcs/rfc1922.html" name="RFC 1922">.
|
| 1368 |
</p>
|
| 1369 |
<P>***** Not written yet *****</P>
|
| 1370 |
<tag><strong>ISO 2022-CN-EXT</strong> (upward compatible to ISO 2022-CN)
|
| 1371 |
<item><p>
|
| 1372 |
</p>
|
| 1373 |
</taglist>
|
| 1374 |
</P>
|
| 1375 |
|
| 1376 |
<P>
|
| 1377 |
Non-ISO 2022-compliant encodings are introduced later in
|
| 1378 |
<ref id="othercodes">.
|
| 1379 |
</P>
|
| 1380 |
|
| 1381 |
<sect id="unicodes"><heading>ISO 10646 and Unicode</heading>
|
| 1382 |
|
| 1383 |
<P>
|
| 1384 |
ISO 10646 and Unicode are another standard intended to let us
develop international software easily. The special features
|
| 1386 |
of this new standard are:
|
| 1387 |
<list>
|
| 1388 |
<item>A united single CCS which intends to include all characters
|
| 1389 |
in the world. (ISO 2022 consists of multiple CCS.)
|
| 1390 |
<item>The character set intends to cover all conventional
|
| 1391 |
(or <em>legacy</em>) CCS in the world.
|
| 1392 |
<footnote>
|
| 1393 |
This is obviously not true for CNS 11643 because
|
| 1394 |
CNS 11643 contains 48711 characters while Unicode 3.0.1
|
| 1395 |
contains 49194 characters, only 483 more than CNS 11643.
|
| 1396 |
</footnote>
|
| 1397 |
<item>Compatibility with ASCII and ISO 8859-1 is considered.
|
| 1398 |
<item>Chinese, Japanese, and Korean ideograms are unified.
This comes from a limitation of Unicode.
This is not a merit.
|
| 1401 |
</list>
|
| 1402 |
</P>
|
| 1403 |
|
| 1404 |
<P>
|
| 1405 |
ISO 10646 is an official international standard. Unicode is
|
| 1406 |
developed by
|
| 1407 |
<url id="http://www.unicode.org" name="Unicode Consortium">.
|
| 1408 |
These two are almost identical. Indeed, they are exactly
identical at the code points which are available in both standards.
Unicode is updated from time to time; the newest version at the time of writing is 3.0.1.
|
| 1411 |
</P>
|
| 1412 |
|
| 1413 |
<sect1 id="unicodes-ccs"><heading>UCS as a Coded Character Set</heading>
|
| 1414 |
|
| 1415 |
<P>
|
| 1416 |
ISO 10646 defines two CCS (coded character sets), <strong>UCS-2</strong>
|
| 1417 |
and <strong>UCS-4</strong>. UCS-2 is a subset of UCS-4.
|
| 1418 |
</P>
|
| 1419 |
|
| 1420 |
<P>
|
| 1421 |
UCS-4 is a 31bit CCS. These 31 bits are divided into 7, 8, 8, and 8 bits,
and each part has its own name.
|
| 1423 |
<list>
|
| 1424 |
<item>The top 7 bits are called <strong>Group</strong>.
|
| 1425 |
<item>The next 8 bits are called <strong>Plane</strong>.
<item>The next 8 bits are called <strong>Row</strong>.
<item>The lowest 8 bits are called <strong>Cell</strong>.
|
| 1428 |
</list>
|
| 1429 |
The first plane (Group = 0, Plane = 0) is called <strong>BMP</strong>
|
| 1430 |
(Basic Multilingual Plane), and UCS-2 is the same as the BMP.
|
| 1431 |
Thus, UCS-2 is a 16bit CCS.
|
| 1432 |
</P>
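<P>
The division of a UCS-4 code point into these four parts can be sketched
with a few shifts and masks. This is only an illustration in Python;
the function names <tt>split_ucs4</tt> and <tt>in_bmp</tt> are
hypothetical and are not taken from any standard or library.
</P>

```python
def split_ucs4(code_point):
    """Split a 31bit UCS-4 code point into (group, plane, row, cell)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 code points are 31bit values")
    group = (code_point >> 24) & 0x7F   # top 7 bits
    plane = (code_point >> 16) & 0xFF   # next 8 bits
    row = (code_point >> 8) & 0xFF      # next 8 bits
    cell = code_point & 0xFF            # lowest 8 bits
    return group, plane, row, cell


def in_bmp(code_point):
    """True if the code point lies in the BMP (Group = 0, Plane = 0)."""
    group, plane, _row, _cell = split_ucs4(code_point)
    return group == 0 and plane == 0


print(split_ucs4(0x4E00))
print(in_bmp(0x4E00), in_bmp(0x10000))
```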
|
| 1433 |
|
| 1434 |
<P>
|
| 1435 |
Code points in UCS are often expressed as <strong>u+<tt>????</tt></strong>,
|
| 1436 |
where <tt>????</tt> is the hexadecimal representation of the code point.
|
| 1437 |
</P>
|
| 1438 |
|
| 1439 |
<P>
|
| 1440 |
Characters in the range u+0021 - u+007e are the same as ASCII, and
characters in the range u+00a0 - u+00ff are the same as ISO 8859-1.
|
| 1442 |
Thus it is very easy to convert between ASCII or ISO 8859-1 and UCS.
|
| 1443 |
</P>
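<P>
Because of this layout, converting ISO 8859-1 text to UCS needs no
table: each byte value already is the code point. The following Python
lines are a small check of that claim; <tt>latin1_to_ucs</tt> is a
hypothetical helper name used only for this illustration.
</P>

```python
def latin1_to_ucs(data):
    """Map ISO 8859-1 bytes to UCS code points; the values are identical."""
    return [byte for byte in data]


sample = bytes([0x41, 0x7E, 0xA0, 0xE9, 0xFF])
code_points = latin1_to_ucs(sample)

# Python's own ISO 8859-1 decoder yields the same code points.
assert code_points == [ord(ch) for ch in sample.decode("latin-1")]
print(["u+%04x" % cp for cp in code_points])
```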
|
| 1444 |
|
| 1445 |
<P>
|
| 1446 |
Unicode (version 3.0.1) uses a 20bit subset of UCS-4 as a CCS.
|
| 1447 |
<footnote>
|
| 1448 |
Strictly speaking, u+000000 - u+10ffff.
|
| 1449 |
</footnote>
|
| 1450 |
</P>
|
| 1451 |
|
| 1452 |
<P>
|
| 1453 |
The unique feature of these CCS compared with other CCS is their
<em>open repertoire</em>. They continue to be developed even after
they are released. Characters will be added in the future.
However, characters which are already coded will not be changed.
|
| 1457 |
Unicode version 3.0.1 includes 49194 distinct coded characters.
|
| 1458 |
</P>
|
| 1459 |
|
| 1460 |
<sect1 id="unicode-ces"><heading>UTF as Character Encoding Schemes</heading>
|
| 1461 |
|
| 1462 |
<P>
|
| 1463 |
A few CES are used to construct encodings which use UCS as
|
| 1464 |
a CCS. They are <strong>UTF-7</strong>, <strong>UTF-8</strong>,
|
| 1465 |
<strong>UTF-16</strong>, <strong>UTF-16LE</strong>, and
|
| 1466 |
<strong>UTF-16BE</strong>. UTF means Unicode (or UCS)
|
| 1467 |
Transformation Format.
|
| 1468 |
Since these CES always take UCS as the only CCS, they are also
|
| 1469 |
names for encodings.
|
| 1470 |
<footnote>
|
| 1471 |
Compare UTF and EUC. There are a few variants of EUC whose CCS
|
| 1472 |
are different (EUC-JP, EUC-KR, and so on). This is why we cannot
call EUC an encoding. In other words, the name 'EUC' alone
does not specify an encoding. On the other hand, 'UTF-8'
is the name of a specific concrete encoding.
|
| 1476 |
</footnote>
|
| 1477 |
</P>
|
| 1478 |
|
| 1479 |
<sect2 id="unicode-utf8"><heading>UTF-8</heading>
|
| 1480 |
|
| 1481 |
<P>
|
| 1482 |
UTF-8 is an encoding whose CCS is UCS-4. UTF-8
|
| 1483 |
is designed to be upward-compatible with ASCII.
UTF-8 is a multibyte encoding, and the number of bytes needed to express
one character ranges from 1 to 6.
|
| 1486 |
</P>
|
| 1487 |
|
| 1488 |
<P>
|
| 1489 |
Conversion from UCS-4 to UTF-8 is performed using a
|
| 1490 |
simple conversion rule.
|
| 1491 |
<example>
|
| 1492 |
UCS-4 (binary)                       UTF-8 (binary)
00000000 00000000 00000000 0???????  0???????
00000000 00000000 00000??? ????????  110????? 10??????
00000000 00000000 ???????? ????????  1110???? 10?????? 10??????
00000000 000????? ???????? ????????  11110??? 10?????? 10?????? 10??????
000000?? ???????? ???????? ????????  111110?? 10?????? 10?????? 10?????? 10??????
0??????? ???????? ???????? ????????  1111110? 10?????? 10?????? 10?????? 10?????? 10??????
|
| 1499 |
</example>
|
| 1500 |
Note that the shortest representation must be used, even though a longer
representation could also express smaller UCS values.
|
| 1502 |
</P>
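<P>
The conversion rule in the table can be written down almost literally.
The following Python function is a minimal sketch, not a reference
implementation; the name <tt>ucs4_to_utf8</tt> is hypothetical. It
follows the 1 to 6 byte scheme shown above. Note that the later UTF-8
definition in RFC 3629 stops at u+10ffff and therefore at four bytes,
so Python's built-in codec can only be used for comparison up to
that point.
</P>

```python
def ucs4_to_utf8(code_point):
    """Encode one UCS-4 code point with the 1 to 6 byte rule in the table."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 code points are 31bit values")
    if code_point < 0x80:
        return bytes([code_point])              # 0???????
    # (exclusive upper limit, total bytes, marker bits of the first byte)
    forms = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, marker in forms:
        if code_point < limit:                  # the shortest form wins
            break
    out = bytearray(length)
    # Fill the trailing bytes from the right, six bits at a time.
    for index in range(length - 1, 0, -1):
        out[index] = 0x80 | (code_point & 0x3F)  # 10??????
        code_point >>= 6
    out[0] = marker | code_point                 # remaining high bits
    return bytes(out)


for cp in (0x41, 0xE9, 0x20AC, 0x10000):
    # Within the range Python supports, both encoders must agree.
    assert ucs4_to_utf8(cp) == chr(cp).encode("utf-8")
    print("u+%04x -> %s" % (cp, ucs4_to_utf8(cp).hex()))
```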
|
| 1503 |
|
| 1504 |
<P>
|
| 1505 |
UTF-8 seems to be one of the major candidates for standard codesets
|
| 1506 |
in the future. For example, the Linux console and xterm support UTF-8.
The Debian <package>locales</package> package (version 2.1.97-1)
contains the <tt>ko_KR.UTF-8</tt> locale. I think the number of UTF-8
locales will increase.
|
| 1510 |
</P>
|
| 1511 |
|
| 1512 |
<sect2 id="unicode-utf16"><heading>UTF-16</heading>
|
| 1513 |
|
| 1514 |
<P>
|
| 1515 |
UTF-16 is an encoding whose CCS is 20bit Unicode.
|
| 1516 |
</P>
|
| 1517 |
|
| 1518 |
<P>
|
| 1519 |
Characters in the BMP are expressed using the 16bit value of
their code point in the Unicode CCS. There are two ways to express
a 16bit value in an 8bit stream. Some of you may have heard the word
<em>endian</em>. <em>Big endian</em> means that the octets of a
multi-octet datum are arranged
from the most significant octet to the least significant one.
<em>Little endian</em> is the opposite. For example, the
16bit value <tt>0x1234</tt> is expressed as
<tt>0x12 0x34</tt> in
big endian and <tt>0x34 0x12</tt> in little endian.
|
| 1529 |
</P>
|
| 1530 |
|
| 1531 |
<P>
|
| 1532 |
UTF-16 supports both byte orders. Thus, the Unicode character
<tt>u+1234</tt> can be expressed either as <tt>0x12 0x34</tt>
or as <tt>0x34 0x12</tt>. In exchange, UTF-16 text
has to have a <strong>BOM (Byte Order Mark)</strong> at its
beginning. The Unicode character <tt>u+feff</tt> zero width no-break
space is called the BOM when it is used to indicate the byte order
(endian) of a text. The mechanism is simple: in big endian,
<tt>u+feff</tt> will be <tt>0xfe 0xff</tt>, while it will be
<tt>0xff 0xfe</tt> in little endian. Thus you can tell
the endian of the text by reading the first two bytes.
|
| 1542 |
<footnote>
|
| 1543 |
I heard that the BOM is merely a suggestion by a vendor.
Read <url id="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
name="Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux">
for details.
|
| 1547 |
</footnote>
|
| 1548 |
</P>
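<P>
Reading the first two bytes, as described above, could look like the
following Python sketch. The function name <tt>utf16_byte_order</tt>
is hypothetical; <tt>codecs.BOM_UTF16_BE</tt> and
<tt>codecs.BOM_UTF16_LE</tt> are constants from Python's standard
library.
</P>

```python
import codecs


def utf16_byte_order(data):
    """Return 'big', 'little', or None by looking at the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):    # 0xfe 0xff
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):    # 0xff 0xfe
        return "little"
    return None                                 # no BOM present


# The same character u+1234 in both byte orders, each preceded by a BOM.
big = codecs.BOM_UTF16_BE + "\u1234".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "\u1234".encode("utf-16-le")
print(utf16_byte_order(big), big.hex())
print(utf16_byte_order(little), little.hex())
```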
|
| 1549 |
|
| 1550 |
<P>
|
| 1551 |
Characters not included in the BMP are expressed using a <strong>surrogate
pair</strong>. Code points <tt>u+d800</tt> - <tt>u+dfff</tt>
are reserved for this purpose. First, 0x10000 is subtracted from the
Unicode code point, which leaves a 20bit value, and these 20 bits
are divided into two sets of 10 bits. The upper 10 bits
are mapped to the 10bit space of <tt>u+d800</tt> - <tt>u+dbff</tt>.
The lower 10 bits are mapped to the 10bit space of <tt>u+dc00</tt> -
<tt>u+dfff</tt>. Thus UTF-16 can express every Unicode character up to
<tt>u+10ffff</tt>.
|
| 1558 |
</P>
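<P>
The surrogate pair arithmetic can be sketched in a few lines of Python.
The function names are hypothetical. Note that 0x10000 is subtracted
from the code point before it is split; this is what leaves exactly
20 bits for the range u+10000 - u+10ffff.
</P>

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) surrogate pair for a code point beyond the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only u+10000 - u+10ffff need a surrogate pair")
    offset = code_point - 0x10000          # 20bit value
    high = 0xD800 | (offset >> 10)         # upper 10 bits -> u+d800 - u+dbff
    low = 0xDC00 | (offset & 0x3FF)        # lower 10 bits -> u+dc00 - u+dfff
    return high, low


def from_surrogate_pair(high, low):
    """Inverse of to_surrogate_pair()."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)
print("u+%04x u+%04x" % (high, low))
# Python's UTF-16 encoder produces the same two 16bit units.
assert chr(0x1D11E).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```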
|
| 1559 |
|
| 1560 |
<sect2 id="unicode-utf16bele"><heading>UTF-16BE and UTF-16LE</heading>
|
| 1561 |
|
| 1562 |
<P>
|
| 1563 |
UTF-16BE and UTF-16LE are variants of UTF-16 which are limited to
|
| 1564 |
big and little endians, respectively.
|
| 1565 |
</P>
|
| 1566 |
|
| 1567 |
|
| 1568 |
<sect2 id="unicode-utf7"><heading>UTF-7</heading>
|
| 1569 |
|
| 1570 |
<P>
|
| 1571 |
UTF-7 is designed so that Unicode can be transmitted over
a 7bit communication path.
|
| 1573 |
</P>
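<P>
The 7bit property can at least be observed with an existing
implementation. The following lines use the <tt>utf-7</tt> codec from
Python's standard library and only check that every encoded byte is
below 0x80 and that the text survives a round trip; they do not
explain the encoding rules themselves.
</P>

```python
text = "Unicode \u65e5\u672c\u8a9e"      # ASCII followed by three Kanji

encoded = text.encode("utf-7")

# Every byte fits into 7 bits, so the text can pass a 7bit path.
assert all(byte < 0x80 for byte in encoded)
# Decoding restores the original text.
assert encoded.decode("utf-7") == text
print(encoded)
```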
|
| 1574 |
|
| 1575 |
<P>***** Not written yet *****</P>
|
| 1576 |
|
| 1577 |
<sect2 id="unicode-ucs"><heading>UCS-2 and UCS-4 as encodings</heading>
|
| 1578 |
|
| 1579 |
<P>
|
| 1580 |
Though I introduced UCS-2 and UCS-4 as CCS, they can also be used as encodings.
|
| 1581 |
</P>
|
| 1582 |
|
| 1583 |
<P>
|
| 1584 |
In the UCS-2 encoding, each UCS-2 character is expressed in two bytes.
In the UCS-4 encoding, each UCS-4 character is expressed in four bytes.
|
| 1586 |
</P>
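<P>
As a minimal sketch of these fixed-width encodings, the following
Python functions write each code point as a two-byte or four-byte
integer. The function names are hypothetical, and big endian byte
order is an assumption made only for this example; the text above
does not fix a byte order.
</P>

```python
import struct


def encode_ucs2_be(text):
    """Encode BMP-only text as two bytes per character (big endian)."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            raise ValueError("u+%x is outside UCS-2 (the BMP)" % code_point)
        out += struct.pack(">H", code_point)
    return bytes(out)


def encode_ucs4_be(text):
    """Encode text as four bytes per character (big endian)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


print(encode_ucs2_be("A\u3042").hex())
print(encode_ucs4_be("A\u3042").hex())
```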
|
| 1587 |
|
| 1588 |
<sect1 id="unicode-problem"><heading>Problems on Unicode</heading>
|
| 1589 |
|
| 1590 |
<P>
|
| 1591 |
No standard is free from politics and compromise.
Though the concept of a single unified CCS for all characters in the
world is very nice, Unicode had to consider compatibility
with preceding international and local standards. Moreover,
contrary to the ideal concept, Unicode people gave too much weight
to efficiency. IMHO, the surrogate pair is a mess caused by the lack of
16bit code space. I will introduce a few problems of Unicode.
|
| 1598 |
</P>
|
| 1599 |
|
| 1600 |
<sect2 id="unihan"><heading>Han Unification</heading>
|
| 1601 |
|
| 1602 |
<P>
|
| 1603 |
This is the point on which Unicode is criticized most strongly
|
| 1604 |
among many Japanese people.
|
| 1605 |
</P>
|
| 1606 |
|
| 1607 |
<P>
|
| 1608 |
The region 0x4e00 - 0x9fff in UCS-2 is used for East Asian
ideographs (Japanese Kanji, Chinese Hanzi, and Korean Hanja).
There are similar characters
in these four character sets. (There are two sets of Chinese characters:
simplified Chinese used in P. R. China and traditional Chinese used in
Taiwan.) To reduce the number of ideograms to be encoded
(the region for these characters can contain only 20992 characters,
while Taiwan's CNS 11643 standard alone contains 48711 characters),
these similar characters are treated as the same character.
This is Han Unification.
|
| 1618 |
</P>
|
| 1619 |
|
| 1620 |
<P>
|
| 1621 |
However, these characters are not exactly the same. If the fonts for
these characters are based on the Chinese forms, Japanese people will
regard them as wrong characters, though they may be able to read them.
Unicode people think these unified characters are the same character
with different glyphs.
|
| 1626 |
</P>
|
| 1627 |
|
| 1628 |
<P>
|
| 1629 |
An example of Han Unification is available at
|
| 1630 |
<url id="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=9AA8" name="U+9AA8">.
|
| 1631 |
This is a Kanji character for 'bone'.
|
| 1632 |
<url id="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=8FCE" name="U+8FCE">
|
| 1633 |
is another example, a Kanji character for 'welcome'. The part
|
| 1634 |
from left side to bottom side is 'run' radical. 'Run' radical
|
| 1635 |
is used for many Kanjis and all of them have the same problem.
|
| 1636 |
<url id="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=76F4" name="U+76F4">
|
| 1637 |
is another example, a Kanji character for 'straight'.
I, a native Japanese speaker, cannot recognize the Chinese version
at all.
|
| 1640 |
</P>
|
| 1641 |
|
| 1642 |
<P>
|
| 1643 |
Unicode font vendors will have difficulty choosing which glyphs to supply
for these characters: simplified Chinese, traditional Chinese, Japanese, or
Korean. One method is to supply four fonts: a simplified Chinese
version, a traditional Chinese version, a Japanese version, and a Korean version.
Commercial OS vendors can release localized versions of their OS ---
for example, the Japanese version of MS Windows can include a Japanese version
of the Unicode font (this is exactly what they are doing). However, what
should XFree86 or Debian do? I don't know...
|
| 1651 |
<footnote>
|
| 1652 |
XFree86 4.0 includes Japanese and Korean versions of ISO 10646-1 fonts.
|
| 1653 |
</footnote>
|
| 1654 |
<footnote>
|
| 1655 |
I heard that Chinese and Korean people don't mind the glyph of these
|
| 1656 |
characters. If this is always true, Japanese glyphs should be the
|
| 1657 |
default glyphs for these problematic characters for international
|
| 1658 |
systems such as Debian.
|
| 1659 |
</footnote>
|
| 1660 |
</P>
|
| 1661 |
|
| 1662 |
<sect2 id="crossmap"><heading>Cross Mapping Tables</heading>
|
| 1663 |
|
| 1664 |
<P>
|
| 1665 |
Unicode intends to be a superset of all major encodings in the world,
|
| 1666 |
such as ISO-8859-*, EUC-*, KOI8-*, and so on. The aim of this is to
|
| 1667 |
keep round-trip compatibility and to enable smooth migration from
|
| 1668 |
other encodings to Unicode.
|
| 1669 |
</P>
|
| 1670 |
|
| 1671 |
<P>
|
| 1672 |
Only providing a superset is not sufficient. Reliable cross mapping
|
| 1673 |
tables between Unicode and other encodings are needed. They are
|
| 1674 |
provided by
|
| 1675 |
<url id="http://www.unicode.org/Public/MAPPINGS/" name="Unicode
|
| 1676 |
Consortium">.
|
| 1677 |
</P>
|
| 1678 |
|
| 1679 |
<P>
|
| 1680 |
However, tables for East Asian encodings are not provided.
|
| 1681 |
They were provided but now are
|
| 1682 |
<url id="http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/"
|
| 1683 |
name="obsolete">.
|
| 1684 |
</P>
|
| 1685 |
|
| 1686 |
<P>
|
| 1687 |
You may want to use these mapping tables even though they are
|
| 1688 |
obsolete, because there are no other mapping tables available.
|
| 1689 |
However, you will find a severe problem with these tables.
There are multiple different mapping tables for
Japanese encodings which include the JIS X 0208 character set.
Thus, the same character in JIS X 0208 will be mapped to
different Unicode characters depending on the mapping table.
For example, Microsoft and Sun use different tables, with the
result that Java on MS Windows sometimes breaks Japanese characters.
|
| 1696 |
</P>
|
| 1697 |
|
| 1698 |
<P>
|
| 1699 |
Though we Open Source people should respect interoperability,
we cannot achieve sufficient interoperability because of this
problem. All we can achieve is interoperability between
Open Source software.
|
| 1703 |
</P>
|
| 1704 |
|
| 1705 |
<P>
|
| 1706 |
GNU libc uses <url id="http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0208.TXT" name="JIS/JIS0208.TXT"> with a small modification.
|
| 1707 |
The modification is that
|
| 1708 |
<list>
|
| 1709 |
<item>original JIS0208.TXT:
|
| 1710 |
0x815F 0x2140 0x005C # REVERSE SOLIDUS
|
| 1711 |
<item>modified:
|
| 1712 |
0x815F 0x2140 0xFF3C # FULLWIDTH REVERSE SOLIDUS
|
| 1713 |
</list>
|
| 1714 |
The reason for this modification is that the JIS X 0208 character set
is almost always used in combination with ASCII, in the form of
EUC-JP and so on. ASCII 0x5c, not JIS X 0208 0x2140, should
be mapped to U+005C.
This modified table is found at <tt>/usr/share/i18n/charmaps/EUC-JP.gz</tt>
on a Debian system. Of course, this mapping table is neither
authorized nor reliable.
|
| 1721 |
</P>
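<P>
To see how a single table line changes the result of a conversion,
the two lines quoted above can be parsed with a few lines of Python.
This is a sketch under the assumption that each data line has three
hexadecimal columns (Shift-JIS, JIS X 0208, Unicode) followed by a
comment, as in the quoted lines; the function names are hypothetical.
</P>

```python
def parse_mapping_line(line):
    """Return the hexadecimal columns of one table line, or None for comments."""
    fields = line.split("#", 1)[0].split()
    if not fields:
        return None
    return tuple(int(field, 16) for field in fields)


def jis_to_unicode(lines):
    """Build a JIS X 0208 -> Unicode dictionary from mapping table lines."""
    table = {}
    for line in lines:
        parsed = parse_mapping_line(line)
        if parsed is not None:
            _sjis, jis, ucs = parsed
            table[jis] = ucs
    return table


original = "0x815F\t0x2140\t0x005C\t# REVERSE SOLIDUS"
modified = "0x815F\t0x2140\t0xFF3C\t# FULLWIDTH REVERSE SOLIDUS"

# The same JIS X 0208 code point ends up at two different Unicode code points.
print("u+%04x" % jis_to_unicode([original])[0x2140])
print("u+%04x" % jis_to_unicode([modified])[0x2140])
```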
|
| 1722 |
|
| 1723 |
<P>
|
| 1724 |
I hope the Unicode Consortium will release an authorized, reliable, and unique
mapping table between Unicode and JIS X 0208.
|
| 1726 |
You can read <url id="http://www.debian.or.jp/~kubota/unicode-symbols.html"
|
| 1727 |
name="the detail of this problem">.
|
| 1728 |
</P>
|
| 1729 |
|
| 1730 |
<sect2 id="combining"><heading>Combining Characters</heading>
|
| 1731 |
|
| 1732 |
<P>
|
| 1733 |
Unicode has a way to synthesize an accented character by combining
an accent symbol with a base character. For example, combining 'a' and
'~' makes 'a' with tilde. More than one accent symbol can be added to
a base character.
|
| 1737 |
</P>
|
| 1738 |
|
| 1739 |
<P>
|
| 1740 |
Languages such as Thai need combining characters. Combining characters
|
| 1741 |
are the only method to express characters in these languages.
|
| 1742 |
</P>
|
| 1743 |
|
| 1744 |
<P>
|
| 1745 |
However, a few problems arise.
|
| 1746 |
<taglist>
|
| 1747 |
<tag>Duplicate Encoding</tag>
|
| 1748 |
<item>
|
| 1749 |
There are multiple ways to express the same character.
|
| 1750 |
For example, u with umlaut can be expressed as <tt>u+00fc</tt>
|
| 1751 |
and also as <tt>u+0075</tt> + <tt>u+0308</tt>.
|
| 1752 |
How can we implement 'grep' and so on?
|
| 1753 |
<tag>Open Repertoire</tag>
|
| 1754 |
<item>
|
| 1755 |
The number of expressible characters grows without limit.
Characters that do not exist can be expressed.
|
| 1757 |
</taglist>
|
| 1758 |
</P>
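<P>
One common way to deal with duplicate encoding, which is not described
in the text above, is to convert both strings to the same Unicode
normalization form before comparing them. The following Python sketch
uses <tt>unicodedata.normalize()</tt> from the standard library;
<tt>normalized_contains</tt> is a hypothetical helper, and a real
'grep' would of course need much more than this.
</P>

```python
import unicodedata

precomposed = "\u00fc"       # u with umlaut as one code point
combined = "u\u0308"         # 'u' followed by the combining diaeresis

# As plain code point sequences the two spellings differ ...
assert precomposed != combined
# ... but normalization maps one form to the other.
assert unicodedata.normalize("NFC", combined) == precomposed
assert unicodedata.normalize("NFD", precomposed) == combined


def normalized_contains(haystack, needle):
    """Substring search that ignores the precomposed/combined difference."""
    nfc_haystack = unicodedata.normalize("NFC", haystack)
    nfc_needle = unicodedata.normalize("NFC", needle)
    return nfc_needle in nfc_haystack


print(normalized_contains("Mu\u0308nchen", "M\u00fcnchen"))
```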
|
| 1759 |
|
| 1760 |
|
| 1761 |
<sect2 id="surrogate"><heading>Surrogate Pair</heading>
|
| 1762 |
|
| 1763 |
<P>
|
| 1764 |
The first version of Unicode had only 16bit code space,
|
| 1765 |
though 16bit is obviously insufficient to contain all
|
| 1766 |
characters in the world.
|
| 1767 |
<footnote>
|
| 1768 |
There are a few projects such as
|
| 1769 |
<url id="http://www.mojikyo.gr.jp/" name="Mojikyo">
|
| 1770 |
(about 90000 characters),
|
| 1771 |
<url id="http://www.tron.org/index-e.html" name="TRON project">
|
| 1772 |
(about 130000 characters),
|
| 1773 |
and so on to develop a CCS which contains
|
| 1774 |
sufficient characters for professional usage in CJK world.
|
| 1775 |
</footnote>
|
| 1776 |
Thus surrogate pairs were introduced in Unicode 2.0 to expand the
number of characters while keeping compatibility with the former
16bit Unicode.
|
| 1779 |
</P>
|
| 1780 |
|
| 1781 |
<P>
|
| 1782 |
However, surrogate pairs break the principle that all characters
are expressed with the same number of bits. This makes Unicode
programming more difficult.
|
| 1785 |
</P>
|
| 1786 |
|
| 1787 |
<P>
|
| 1788 |
Fortunately, Debian and other UNIX-like systems will use UTF-8
|
| 1789 |
(not UTF-16) as a usual encoding for UCS. Thus, we don't need
|
| 1790 |
to handle UTF-16 and surrogate pairs very often.
|
| 1791 |
</P>
|
| 1792 |
|
| 1793 |
<sect2 id="646problem"><heading>ISO 646-* Problem</heading>
|
| 1794 |
|
| 1795 |
<P>
|
| 1796 |
You will need a codeset converter between your local encodings
|
| 1797 |
(for example, ISO 8859-* or ISO 2022-*) and Unicode.
|
| 1798 |
For example, Shift-JIS encoding
|
| 1799 |
<footnote>
|
| 1800 |
The standard encoding for Macintosh and MS Windows.
|
| 1801 |
</footnote>
|
| 1802 |
includes
JISX 0201 Roman (the Japanese version of ISO 646), not ASCII,
and JISX 0201 Roman encodes the yen currency mark at <tt>0x5c</tt>,
where ASCII encodes the backslash.
|
| 1806 |
</P>
|
| 1807 |
|
| 1808 |
<P>
|
| 1809 |
Then which Unicode character should your converter convert <tt>0x5c</tt>
in Shift-JIS into: <tt>u+005c</tt> (backslash) or <tt>u+00a5</tt>
(yen currency mark)?
You may say the yen currency mark is the right solution.
However, the backslash (and hence the yen mark) is widely used as an
escape character. For example, 'new line' is expressed as
'backslash - <tt>n</tt>' in a C string literal, and Japanese people write
'yen currency mark - <tt>n</tt>'. You may say that program sources
must be written in ASCII and that the mistake is trying
to convert program sources at all. However, there is a lot of
source code and so on written in the Shift-JIS encoding.
|
| 1820 |
</P>
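<P>
The dilemma can be made concrete with a toy converter in which the
choice is an explicit parameter. This Python sketch is hypothetical:
the function name and the <tt>yen_as_backslash</tt> switch are
invented for illustration and do not describe any existing converter.
</P>

```python
BACKSLASH = 0x005C
YEN_SIGN = 0x00A5


def convert_jisx0201_roman(byte, yen_as_backslash=True):
    """Convert one JISX 0201 Roman byte (0x00 - 0x7f) to a UCS code point."""
    if not 0 <= byte <= 0x7F:
        raise ValueError("not a JISX 0201 Roman byte")
    if byte == 0x5C:
        # The policy decision discussed in the text above.
        return BACKSLASH if yen_as_backslash else YEN_SIGN
    # All other bytes are treated like ASCII here. (0x7e, the overline in
    # JISX 0201 Roman, raises the same question and is left as-is for brevity.)
    return byte


source = b'printf("a\\n");'      # a C fragment as it might be stored in Shift-JIS
as_program = "".join(chr(convert_jisx0201_roman(b)) for b in source)
as_text = "".join(chr(convert_jisx0201_roman(b, False)) for b in source)
print(as_program)    # the escape sequence stays intact
print(as_text)       # the escape character has become a yen sign
```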
|
| 1821 |
|
| 1822 |
<P>
|
| 1823 |
Windows now supports Unicode, and the glyph
at <tt>u+005c</tt> in the Japanese version of Windows is the yen currency mark.
As you know, the backslash (yen currency mark in Japan) is vitally
important for Windows, because it is used to separate directory names.
Fortunately, EUC-JP, which is widely used for UNIX in Japan,
includes ASCII, not the Japanese version of ISO 646. So this
is not a problem there, because it is clear that <tt>0x5c</tt> is the backslash.
|
| 1830 |
</P>
|
| 1831 |
|
| 1832 |
<P>
|
| 1833 |
Thus no local codeset should use character sets incompatible
with ASCII, such as ISO 646-*.
|
| 1835 |
</P>
|
| 1836 |
|
| 1837 |
<P>
|
| 1838 |
<url id="http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html"
|
| 1839 |
name="Problems and Solutions for Unicode and User/Vendor Defined
|
| 1840 |
Characters"> discusses this problem.
|
| 1841 |
</P>
|
| 1842 |
|
| 1843 |
<sect id="othercodes"><heading>Other Character Sets and Encodings</heading>
|
| 1844 |
|
| 1845 |
<P>
|
| 1846 |
Besides ISO 2022-compliant coded character sets and encodings
|
| 1847 |
described in <ref id="iso2022set"> and <ref id="iso2022enc">,
|
| 1848 |
there are many popular encodings which cannot be classified
|
| 1849 |
into an international standard (i.e., not ISO 2022-compliant
|
| 1850 |
nor Unicode). Internationalized software should
support these encodings (again, you don't need to be aware of
encodings if you use LOCALE and <tt>wchar_t</tt> technology).
Some organizations are developing systems which go farther than
the limitations of the current international standards, though these
systems may not be widely used so far.
|
| 1856 |
</P>
|
| 1857 |
|
| 1858 |
<sect1 id="othercodes-big5"><heading>Big5</heading>
|
| 1859 |
|
| 1860 |
<P>
|
| 1861 |
<strong>Big5</strong> is a de-facto standard encoding for
|
| 1862 |
Taiwan (1984) and is upward-compatible with ASCII.
|
| 1863 |
It is also a CCS.
|
| 1864 |
</P>
|
| 1865 |
|
| 1866 |
<P>
|
| 1867 |
In Big5, <tt>0x21</tt> - <tt>0x7e</tt> are ASCII characters.
A byte in <tt>0xa1</tt> - <tt>0xfe</tt> makes a pair with the following byte
(<tt>0x40</tt> - <tt>0x7e</tt> or <tt>0xa1</tt> - <tt>0xfe</tt>),
and the pair expresses an ideogram or other character (13461 characters).
</P>
|
| 1872 |
|
| 1873 |
<P>
|
| 1874 |
Though Taiwan has a new ISO 2022-compliant standard, CNS 11643,
|
| 1875 |
Big5 seems to be more popular than CNS 11643.
|
| 1876 |
(CNS 11643 is a CCS and there are a few ISO 2022-derived
|
| 1877 |
encodings which include CNS 11643.)
|
| 1878 |
</P>
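<P>
The byte ranges of Big5 given above are enough to split a byte string
into characters. The following Python sketch does so; the function
names are hypothetical. Note that the second byte of a pair may fall
into the ASCII range (<tt>0x40</tt> - <tt>0x7e</tt>), so a program
cannot simply search a Big5 string for an ASCII byte.
</P>

```python
def is_big5_lead(byte):
    return 0xA1 <= byte <= 0xFE


def is_big5_trail(byte):
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def split_big5(data):
    """Split Big5 bytes into one-byte and two-byte characters."""
    chars = []
    pos = 0
    while pos < len(data):
        if is_big5_lead(data[pos]):
            if pos + 1 >= len(data) or not is_big5_trail(data[pos + 1]):
                raise ValueError("broken two-byte character at offset %d" % pos)
            chars.append(data[pos:pos + 2])
            pos += 2
        else:
            chars.append(data[pos:pos + 1])
            pos += 1
    return chars


# 0xa4 0x40 is one character although 0x40 alone would be ASCII '@'.
print(split_big5(b"A\xa4\xa4\xa4\x40B"))
```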
|
| 1879 |
|
| 1880 |
<sect1 id="othercodes-uhc"><heading>UHC</heading>
|
| 1881 |
|
| 1882 |
<P>
|
| 1883 |
<strong>UHC</strong> is an encoding which is upward-compatible
|
| 1884 |
with <strong>EUC-KR</strong>. Two-byte characters (the first byte:
|
| 1885 |
<tt>0x81</tt> - <tt>0xfe</tt>; the second byte:
|
| 1886 |
<tt>0x41</tt> - <tt>0x5a</tt>, <tt>0x61</tt> - <tt>0x7a</tt>, and
|
| 1887 |
<tt>0x81</tt> - <tt>0xfe</tt>) include KSX 1001 and other Hangul so
|
| 1888 |
that UHC can
|
| 1889 |
express all 11172 Hangul.
|
| 1890 |
</P>
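<P>
The claim that UHC covers all 11172 Hangul can be checked against an
existing implementation. The following Python sketch assumes that the
<tt>cp949</tt> codec of Python's standard library (for which
<tt>uhc</tt> is a documented alias) implements UHC; the helper names
are hypothetical.
</P>

```python
def is_uhc_lead(byte):
    return 0x81 <= byte <= 0xFE


def is_uhc_trail(byte):
    return (0x41 <= byte <= 0x5A or 0x61 <= byte <= 0x7A
            or 0x81 <= byte <= 0xFE)


# All precomposed Hangul syllables of UCS: u+ac00 - u+d7a3.
syllables = [chr(cp) for cp in range(0xAC00, 0xD7A4)]
print(len(syllables))

for syllable in syllables:
    encoded = syllable.encode("cp949")
    # Every syllable must be a two-byte character within the ranges above.
    assert len(encoded) == 2
    assert is_uhc_lead(encoded[0]) and is_uhc_trail(encoded[1])
```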
|
| 1891 |
|
| 1892 |
<sect1 id="othercodes-johab"><heading>Johab</heading>
|
| 1893 |
|
| 1894 |
<P>
|
| 1895 |
<strong>Johab</strong> is an encoding whose character set is identical
|
| 1896 |
with <strong>UHC</strong>, i.e., ASCII, KSX 1001, and all other Hangul
characters.
Johab means combination in Korean. In Johab, the code point of a Hangul
can be calculated from the combination of its Hangul parts (Jamo).
|
| 1900 |
</P>
|
| 1901 |
|
| 1902 |
<sect1 id="othercodes-hz"><heading>HZ, aka HZ-GB-2312</heading>
|
| 1903 |
|
| 1904 |
<p>
|
| 1905 |
<strong>HZ</strong> is an encoding described in
|
| 1906 |
<url id="http://www.faqs.org/rfcs/rfc1842.html" name="RFC 1842">.
|
| 1907 |
The CCS (coded character sets) of HZ are ASCII and GB2312. It is a 7bit
encoding.
|
| 1909 |
</p>
|
| 1910 |
|
| 1911 |
<p>
|
| 1912 |
Note that HZ is <em>not</em> upward-compatible with ASCII,
|
| 1913 |
since '<tt>~{</tt>' means GB2312 mode, '<tt>~}</tt>' means
|
| 1914 |
ASCII mode, and '<tt>~~</tt>' means ASCII '~'.
|
| 1915 |
</p>
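<p>
The three escape sequences are simple enough to decode by hand. The
following Python function is a minimal sketch with a hypothetical
name; it ignores the line continuation rule of RFC 1842 and relies on
the <tt>gb2312</tt> codec of Python's standard library to look up the
two-byte characters after their high bits are set.
</p>

```python
def hz_decode(data):
    """Decode HZ bytes to text, handling only '~{', '~}' and '~~'."""
    out = []
    gb_mode = False
    pos = 0
    while pos < len(data):
        pair = data[pos:pos + 2]
        if pair == b"~{":                      # switch to GB2312 mode
            gb_mode = True
            pos += 2
        elif pair == b"~}":                    # switch back to ASCII mode
            gb_mode = False
            pos += 2
        elif not gb_mode and pair == b"~~":    # a literal '~'
            out.append("~")
            pos += 2
        elif gb_mode:
            if len(pair) < 2:
                raise ValueError("truncated GB2312 character")
            # HZ carries GB2312 in 7 bits; set the high bits to get EUC form.
            out.append(bytes(b | 0x80 for b in pair).decode("gb2312"))
            pos += 2
        else:
            out.append(chr(data[pos]))
            pos += 1
    return "".join(out)


gb = "\u4e2d\u6587".encode("gb2312")
hz = b"a~~b ~{" + bytes(b & 0x7F for b in gb) + b"~} ok"
print(hz, hz_decode(hz))
```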
|
| 1916 |
|
| 1917 |
<sect1 id="othercodes-gbk"><heading>GBK</heading>
|
| 1918 |
|
| 1919 |
<p>
|
| 1920 |
<strong>GBK</strong> is an encoding which is upward-compatible
|
| 1921 |
to CN-GB. GBK covers ASCII, GB2312, other Unicode 1.0 ideograms,
|
| 1922 |
and a bit more. The range of two-byte characters in GBK is:
|
| 1923 |
<tt>0x81</tt> - <tt>0xfe</tt> for the first byte and
|
| 1924 |
<tt>0x40</tt> - <tt>0x7e</tt> and <tt>0x80</tt> - <tt>0xfe</tt>
|
| 1925 |
for the second byte. 21886 code points out of 23940 in two-byte
|
| 1926 |
region are defined.
|
| 1927 |
</p>
|
| 1928 |
|
| 1929 |
<p>
|
| 1930 |
GBK is one of the popular encodings in P. R. China.
|
| 1931 |
</p>
|
| 1932 |
|
| 1933 |
<sect1 id="othercodes-gb18030"><heading>GB18030</heading>
|
| 1934 |
|
| 1935 |
<p>
|
| 1936 |
<strong>GB 18030</strong> is an encoding which is upward-compatible
|
| 1937 |
to GBK and CN-GB. It is a recent national standard (released on
|
| 1938 |
17 March 2000) of China. It adds four-byte characters to GBK.
|
| 1939 |
Its range is:
|
| 1940 |
<tt>0x81</tt> - <tt>0xfe</tt> for the first byte,
|
| 1941 |
<tt>0x30</tt> - <tt>0x39</tt> for the second byte,
|
| 1942 |
<tt>0x81</tt> - <tt>0xfe</tt> for the third byte, and
|
| 1943 |
<tt>0x30</tt> - <tt>0x39</tt> for the fourth byte.
|
| 1944 |
</p>
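<p>
Since the second byte of a four-byte character
(<tt>0x30</tt> - <tt>0x39</tt>) never occurs as the second byte of a
GBK two-byte character, the length of a character can be decided by
looking at its first two bytes. The following Python function is a
sketch of this idea with a hypothetical name; the two-byte ranges are
the ones given for GBK above.
</p>

```python
def gb18030_sequence_length(data, pos=0):
    """Return 1, 2, or 4: the length of the character starting at data[pos]."""
    first = data[pos]
    if first < 0x80:
        return 1                                  # ASCII
    if not 0x81 <= first <= 0xFE:
        raise ValueError("invalid first byte")
    second = data[pos + 1]
    if 0x30 <= second <= 0x39:                    # four-byte form
        third, fourth = data[pos + 2], data[pos + 3]
        if 0x81 <= third <= 0xFE and 0x30 <= fourth <= 0x39:
            return 4
        raise ValueError("invalid four-byte sequence")
    if 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFE:
        return 2                                  # two-byte form inherited from GBK
    raise ValueError("invalid second byte")


for ch in ("A", "\u4e2d", "\U00010000"):
    encoded = ch.encode("gb18030")
    # The length derived from the byte ranges matches the real encoder.
    assert gb18030_sequence_length(encoded) == len(encoded)
    print("u+%04x" % ord(ch), encoded.hex())
```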
|
| 1945 |
|
| 1946 |
<p>
|
| 1947 |
It includes all characters of Unicode 3.0's Unihan Extension A.
|
| 1948 |
Moreover, GB 18030 supplies code space for all used and
|
| 1949 |
unused code points of Unicode's plane 0 (BMP) and 16 additional
|
| 1950 |
planes.
|
| 1951 |
</p>
|
| 1952 |
|
| 1953 |
<p>
|
| 1954 |
<url id="ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf"
|
| 1955 |
name="A detailed explanation on GB18030"> is available.
|
| 1956 |
</p>
|
| 1957 |
|
| 1958 |
<sect1 id="othercodes-gccs"><heading>GCCS</heading>
|
| 1959 |
|
| 1960 |
<p>
|
| 1961 |
<strong>GCCS</strong> is a standard of coded character set
|
| 1962 |
by Hong Kong (HKSAR: Hong Kong Special Administrative Region).
|
| 1963 |
It includes 3049 characters. It is an abbreviation of Government Common
|
| 1964 |
Character Set. It is defined as an <em>additional character set
|
| 1965 |
for Big5</em>. Characters in GCCS are coded in User-Defined Area
|
| 1966 |
(just like Private Use Area for UCS) in Big5.
|
| 1967 |
</p>
|
| 1968 |
|
| 1969 |
<sect1 id="othercodes-hkscs"><heading>HKSCS</heading>
|
| 1970 |
|
| 1971 |
<p>
|
| 1972 |
<strong>HKSCS</strong> is an expansion and amendment of GCCS.
|
| 1973 |
It includes 4702 characters. It means Hong Kong Supplementary
|
| 1974 |
Character Set.
|
| 1975 |
</p>
|
| 1976 |
|
| 1977 |
<p>
|
| 1978 |
In addition to a usage in User-Defined Area in Big5,
|
| 1979 |
HKSCS defines a usage in Private Use Area in Unicode.
|
| 1980 |
</p>
|
| 1981 |
|
| 1982 |
<sect1 id="othercodes-shiftjis"><heading>Shift-JIS</heading>
|
| 1983 |
|
| 1984 |
<p>
|
| 1985 |
<strong>Shift-JIS</strong> is one of the popular encodings in Japan.
|
| 1986 |
Its CCS are JISX 0201 Roman, JISX 0201 Kana, and JISX 0208.
|
| 1987 |
</p>
|
| 1988 |
|
| 1989 |
<p>
|
| 1990 |
JISX 0201 Roman is the Japanese version of ISO 646. It defines
the yen currency mark for <tt>0x5c</tt>, where ASCII has the backslash.
<tt>0xa1</tt> - <tt>0xdf</tt> are one-byte characters and belong to
JISX 0201 Kana. Two-byte characters (the first byte:
<tt>0x81</tt> - <tt>0x9f</tt> and <tt>0xe0</tt> - <tt>0xef</tt>;
the second byte: <tt>0x40</tt> - <tt>0x7e</tt> and <tt>0x80</tt> -
<tt>0xfc</tt>) belong to JISX 0208.
|
| 1997 |
</p>
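<p>
These ranges are sufficient to split Shift-JIS bytes into characters,
as the following Python sketch with hypothetical function names
shows. Note that the second byte of a two-byte character can be
<tt>0x5c</tt>, so software which scans bytes for a backslash (or yen
mark) without knowing about Shift-JIS will be confused.
</p>

```python
def sjis_char_length(first):
    """Return 2 for a JISX 0208 first byte, otherwise 1."""
    if 0x81 <= first <= 0x9F or 0xE0 <= first <= 0xEF:
        return 2
    return 1        # JISX 0201 Roman or JISX 0201 Kana (0xa1 - 0xdf)


def split_sjis(data):
    """Split Shift-JIS bytes into one-byte and two-byte characters."""
    chars = []
    pos = 0
    while pos < len(data):
        length = sjis_char_length(data[pos])
        char = data[pos:pos + length]
        if length == 2:
            second_ok = len(char) == 2 and (
                0x40 <= char[1] <= 0x7E or 0x80 <= char[1] <= 0xFC)
            if not second_ok:
                raise ValueError("broken two-byte character at offset %d" % pos)
        chars.append(char)
        pos += length
    return chars


# 0x83 0x5c is a single two-byte character whose second byte is 0x5c.
print(split_sjis(b"\x83\x5cA"))
print(split_sjis("A\uff71\u65e5".encode("shift_jis")))
```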
|
| 1998 |
|
| 1999 |
<p>
|
| 2000 |
The Japanese versions of MS DOS, MS Windows, and Macintosh use this encoding,
though it is not often used on POSIX systems.
|
| 2002 |
</p>
|
| 2003 |
|
| 2004 |
|
| 2005 |
<sect1 id="othercodes-viscii"><heading>VISCII</heading>
|
| 2006 |
|
| 2007 |
<P>
|
| 2008 |
The Vietnamese language uses 186 characters (Latin letters with accents)
and other symbols.
This is a bit more than an ISO 8859-like encoding can hold.
|
| 2011 |
</P>
|
| 2012 |
|
| 2013 |
<P>
|
| 2014 |
<strong>VISCII</strong> is a standard for Vietnamese.
|
| 2015 |
It is upward-compatible with ASCII. It is 8bit and stateless,
|
| 2016 |
like ISO 8859 series. However, it uses code points of
|
| 2017 |
not only <tt>0x21</tt> - <tt>0x7e</tt> and <tt>0xa0</tt> -
|
| 2018 |
<tt>0xff</tt> but also <tt>0x02</tt>, <tt>0x05</tt>, <tt>0x06</tt>,
|
| 2019 |
<tt>0x14</tt>, <tt>0x19</tt>, <tt>0x1e</tt>, and <tt>0x80</tt> -
|
| 2020 |
<tt>0x9f</tt>. This makes VISCII non-compliant with ISO 2022.
|
| 2021 |
</P>
|
| 2022 |
|
| 2023 |
<P>
|
| 2024 |
Vietnam has a new, ISO 2022-compliant character set
|
| 2025 |
<strong>TCVN 5712 VN2</strong> (aka <strong>VSCII</strong>).
|
| 2026 |
In TCVN 5712 VN2, accented characters are expressed as a
|
| 2027 |
combined character. Note that some of accented characters
|
| 2028 |
have their own code points.
|
| 2029 |
</P>
|
| 2030 |
|
| 2031 |
<sect1 id="othercodes-tron"><heading>TRON</heading>
|
| 2032 |
|
| 2033 |
<P>
|
| 2034 |
<url id="http://www.tron.org/index-e.html" name="TRON">
|
| 2035 |
is a project to develop a new operating system,
|
| 2036 |
founded as a collaboration of industries and academics
|
| 2037 |
in Japan since 1984.
|
| 2038 |
</P>
|
| 2039 |
|
| 2040 |
<P>
|
| 2041 |
The most widespread member of the TRON operating system family
is ITRON, a real-time OS for embedded systems.
However, our interest is not in ITRON now.
TRON defines its own TRON encoding.
|
| 2045 |
</P>
|
| 2046 |
|
| 2047 |
<P>
|
| 2048 |
TRON's encoding is stateful. One state is assigned
to each language. About 130000 characters had already been defined
as of January 2000.
|
| 2051 |
</P>
|
| 2052 |
|
| 2053 |
<sect1 id="othercodes-mojikyo"><heading>Mojikyo</heading>
|
| 2054 |
|
| 2055 |
<P>
|
| 2056 |
<url id="http://www.mojikyo.gr.jp/" name="Mojikyo">
|
| 2057 |
is a project to develop an environment by which a user
|
| 2058 |
can use many characters in the world. Mojikyo
|
| 2059 |
project has released an application software for
|
| 2060 |
MS Windows to display and input about 90000 characters.
|
| 2061 |
You can download the software and TrueType, TeX, and
|
| 2062 |
CID fonts, though they are not DFSG-free.
|
| 2063 |
</P>
|
| 2064 |
|
| 2065 |
|
| 2066 |
|
| 2067 |
<chapt id="languages"><heading>Characters in Each Country</heading>
|
| 2068 |
|
| 2069 |
<P>
|
| 2070 |
This chapter describes information specific to each language.
If you are developing serious DTP software or planning to support
detailed I18N, this chapter may help you.
|
| 2073 |
Contributions from people speaking each language are welcome.
|
| 2074 |
If you are to write a section on your language, please include
|
| 2075 |
these points:
|
| 2076 |
<enumlist>
|
| 2077 |
<item>kinds and number of characters used in the language,
|
| 2078 |
<item>explanation on coded character set(s) which is (are) standardized,
|
| 2079 |
<item>explanation on encoding(s) which is (are) standardized,
|
| 2080 |
<item>usage and popularity for each encoding,
|
| 2081 |
<item>de-facto standard, if any, on how many columns characters occupy,
|
| 2082 |
<item>writing direction and combined characters,
|
| 2083 |
<item>how to layout characters (word wrapping and so on),
|
| 2084 |
<item>widely used value for <tt>LANG</tt> environmental variable,
|
| 2085 |
<item>the way to input characters from keyboard and whether
|
| 2086 |
you want to input yes/no (and so on) in your language
|
| 2087 |
or in English,
|
| 2088 |
<item>a set of information needed for beautiful displaying, for example,
|
| 2089 |
where to break a line, hyphenation, word wrapping, and so on, and
|
| 2090 |
<item>other topics.
|
| 2091 |
</enumlist>
|
| 2092 |
</P>
|
| 2093 |
|
| 2094 |
|
| 2095 |
<P>
|
| 2096 |
Writers whose languages are written in a different direction
from European languages or need combining characters
(I heard that they are used in Thai) are encouraged to explain
how to treat such languages.
|
| 2100 |
</P>
&japanese-japan;
|
| 2105 |
&spanish;
|
| 2106 |
&cyrillic;
<chapt id="locale"><heading>LOCALE technology</heading>
|
| 2113 |
|
| 2114 |
<P>
|
| 2115 |
<strong>LOCALE</strong> is a basic concept introduced
into <strong>ISO C</strong> (ISO/IEC 9899:1990). The
standard was extended in 1995 (ISO/IEC 9899:1990 Amendment 1:1995).
In the LOCALE model, the behavior of some C functions depends
on the LOCALE environment. The LOCALE environment is divided
into a few categories, and each of these categories can
be set independently using <tt>setlocale()</tt>.
|
| 2122 |
</P>
|
| 2123 |
|
| 2124 |
<P>
|
| 2125 |
<strong>POSIX</strong> also defines some standards related to
i18n. Almost all of the POSIX and ISO C standards are included in the
<strong>XPG4</strong> (X/Open Portability Guide) standard, and
all of them are included in the XPG5 standard. Note that
<strong>XPG5</strong> is included in the Single UNIX Specification,
Version 2. Thus support for XPG5 is mandatory to obtain the UNIX brand
for that version. In other words, every operating system certified
against that specification supports XPG5.
|
| 2132 |
</P>
|
| 2133 |
|
| 2134 |
<P>
|
| 2135 |
The merits of using locale technology over hard-coding Unicode
are:
<list>
<item>The software can be written in an encoding-independent way.
  This means that the software can support every encoding
  that the OS supports, including 7bit, 8bit, multibyte,
  stateful, and stateless encodings such as ASCII, ISO 8859-*,
  EUC-*, ISO 2022-*, Big5, VISCII, TIS 620, UTF-*, and so on.
<item>Users get a single, unified method to
  configure locale and encoding.
  Otherwise, users have to remember how to enable
  UTF-8 mode for each program. Some programs need a <tt>-u8</tt>
  switch, others need an X resource setting, others need a
  <tt>.foobarrc</tt> file, others need a special environment
  variable, and others use UTF-8 by default. It is nonsense!
<item>Advances in the OS become advances in the
  software. Thus, you can use a new locale without recompiling
  your software.
</list>
You can read the
<url id="http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT"
name="Unicode support in the Solaris Operating Environment"> whitepaper
to understand the merits of this model.
<url id="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html"
name="Bruno Haible's Unicode HOWTO">
also recommends this model.
|
| 2161 |
</P>
|
| 2162 |
|
| 2163 |
<sect id="localecategory"><heading>Locale Categories and <tt>setlocale()</tt></heading>
|
| 2164 |
|
| 2165 |
<P>
|
| 2166 |
In the LOCALE model, the behavior of some C functions depends
on the LOCALE environment. The LOCALE environment is divided
into six categories (five defined by ISO C, plus
<tt>LC_MESSAGES</tt> from POSIX), and each of these categories can
be set independently using <tt>setlocale()</tt>.
|
| 2170 |
</P>
|
| 2171 |
|
| 2172 |
<P>
|
| 2173 |
The six categories are as follows:
<taglist>
<tag><strong>LC_CTYPE</strong>
<item>
  <p>
  Category related to encodings.
  Characters encoded in the LC_CTYPE-dependent encoding
  are called <strong>multibyte characters</strong>.
  Note that a multibyte character need not occupy more than one byte.
  </p>
  <p>
  LC_CTYPE-dependent functions include character testing functions
  such as <tt>islower()</tt>, multibyte character
  functions such as <tt>mblen()</tt>, and multibyte
  string functions such as <tt>mbstowcs()</tt>.
  </p>
</item>
<tag><strong>LC_COLLATE</strong>
<item>
  <p>
  Category related to sorting.
  <tt>strcoll()</tt> and related functions are LC_COLLATE-dependent.
  </p>
</item>
<tag><strong>LC_MESSAGES</strong>
<item>
  <p>
  Category related to the language of the messages the software
  outputs. This category is used by <prgn>gettext</prgn>.
  </p>
</item>
<tag><strong>LC_MONETARY</strong>
<item>
  <p>
  Category related to the format of monetary amounts,
  for example, the currency symbol, comma or period, grouping of digits,
  and so on.
  In ISO C, <tt>localeconv()</tt> is the only function which is
  LC_MONETARY-dependent.
  </p>
</item>
<tag><strong>LC_NUMERIC</strong>
<item>
  <p>
  Category related to the format of general numbers,
  for example, the character used for the decimal point.
  </p>
  <p>
  Formatted I/O functions such as <tt>printf()</tt>,
  string conversion functions such as <tt>atof()</tt>,
  and so on are LC_NUMERIC-dependent.
  </p>
</item>
<tag><strong>LC_TIME</strong>
<item>
  <p>
  Category related to the format of time and date,
  such as the names of months and days of the week, the order of day,
  month, and year, and so on.
  </p>
  <p>
  <tt>strftime()</tt> and related functions are LC_TIME-dependent.
  </p>
</item>
</taglist>
|
| 2238 |
</p>
|
| 2239 |
|
| 2240 |
<p>
|
| 2241 |
<tt>setlocale()</tt> is the function that sets the LOCALE.
Its prototype is char *<tt>setlocale(</tt>int <em>category</em>, const char
*<em>locale</em><tt>);</tt>. The header file <tt>locale.h</tt>
is needed for the prototype declaration and for the definitions of
the macros for category names. For example,
<tt>setlocale(LC_TIME, "de_DE");</tt>.
|
| 2247 |
</p>
|
| 2248 |
|
| 2249 |
<p>
|
| 2250 |
For <em>category</em>, the following macros can be used:
LC_CTYPE, LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, LC_TIME, and
LC_ALL. For <em>locale</em>, a specific locale name, <tt>NULL</tt>,
or <tt>""</tt> can be specified.
|
| 2254 |
</p>
|
| 2255 |
|
| 2256 |
<p>
|
| 2257 |
Passing <tt>NULL</tt> for <em>locale</em> returns the
current setting of the specified locale category without changing it. Otherwise,
<tt>setlocale()</tt> returns the newly set locale name,
or <tt>NULL</tt> on error.
|
| 2261 |
</p>
|
| 2262 |
|
| 2263 |
<p>
|
| 2264 |
Given <tt>""</tt> for <em>locale</em>, <tt>setlocale()</tt>
determines the locale name in the following manner:
<list>
<item>First, consult the <tt>LC_ALL</tt> environment variable.
<item>If <tt>LC_ALL</tt> is not set, consult the environment
  variable with the same name as the locale category,
  for example, <tt>LC_COLLATE</tt>.
<item>If neither is set, consult the <tt>LANG</tt>
  environment variable.
</list>
This is why a user is expected to set the <tt>LANG</tt> variable.
In other words, all a user has to do is set the <tt>LANG</tt>
variable, and all locale-aware software will behave in the
desired way.
|
| 2278 |
</p>
|
| 2279 |
|
| 2280 |
<p>
|
| 2281 |
Thus, I strongly recommend calling <tt>setlocale(LC_ALL, "");</tt>
at the beginning of your program if it is meant to be
internationalized.
|
| 2284 |
</p>
|
| 2285 |
|
| 2286 |
<sect id="localename"><heading>Locale Names</heading>
|
| 2287 |
|
| 2288 |
<P>
|
| 2289 |
Each of these six locale categories takes a locale name.
Then, which names can we specify?
|
| 2291 |
</P>
|
| 2292 |
|
| 2293 |
<P>
|
| 2294 |
The syntax of a locale name is as follows:
<example>
language[_territory][.codeset][@modifier]
</example>
where <em>language</em> is a two-letter lowercase code defined
in ISO 639, such as <tt>en</tt> for English, <tt>eo</tt> for
Esperanto, and <tt>zh</tt> for Chinese, and <em>territory</em>
is a two-letter uppercase code defined in ISO 3166, such as
<tt>GB</tt> for the United Kingdom, <tt>KR</tt> for the Republic of
Korea (South Korea), and <tt>CN</tt> for China. There is no standard
for <em>codeset</em> and <em>modifier</em>. GNU libc uses
<tt>ISO-8859-1</tt>, <tt>ISO-8859-13</tt>, <tt>eucJP</tt>,
<tt>SJIS</tt>, <tt>UTF8</tt>, and so on for <em>codeset</em>,
and <tt>euro</tt> for <em>modifier</em>.
|
| 2308 |
</P>
|
| 2309 |
|
| 2310 |
<P>
|
| 2311 |
However, which locale names are valid depends on the system.
In other words, you have to install the <em>locale database</em> for
each locale you want to use. Type <tt>locale -a</tt> to display all
locale names supported on the system.
|
| 2315 |
</P>
|
| 2316 |
|
| 2317 |
<p>
|
| 2318 |
Note that the locale names <tt>"C"</tt> and <tt>"POSIX"</tt> are
reserved for the default behavior. For example,
when your software needs to parse the output of <tt>date(1)</tt>,
you had better run <tt>date(1)</tt> in the <tt>"C"</tt> locale.
Since <tt>date(1)</tt> is a separate process that reads its own
environment, set <tt>LC_ALL=C</tt> in its environment;
<tt>setlocale(LC_TIME, "C");</tt> affects only functions such as
<tt>strftime()</tt> called within your own process.
|
| 2323 |
</p>
|
| 2324 |
|
| 2325 |
<sect id="wchar"><heading>Multibyte Characters and Wide Characters</heading>
|
| 2326 |
|
| 2327 |
<p>
|
| 2328 |
Now we will concentrate on LC_CTYPE, which is the most important
of the six locale categories.
|
| 2330 |
</p>
|
| 2331 |
|
| 2332 |
<p>
|
| 2333 |
Many encodings such as ASCII, ISO 8859-*, KOI8-R, EUC-*,
ISO 2022-*, TIS 620, UTF-8, and so on are widely used in the world.
Even if it is not impossible, it is inefficient and a source of bugs for
every program to implement all these encodings by itself.
Fortunately, we can use LOCALE technology to solve this problem.
<footnote>
Using UCS-4 is the second-best solution to this problem.
Sometimes LOCALE technology cannot be used and UCS-4 is the
best. I will discuss this solution later.
</footnote>
|
| 2343 |
</p>
|
| 2344 |
|
| 2345 |
<p>
|
| 2346 |
<strong>Multibyte character</strong> is the term for a character
encoded in the locale-specific encoding. It is nothing special;
it is merely a name for the encodings we use every day. In an ISO 8859-1 locale,
ISO 8859-1 is the multibyte encoding. In an EUC-JP locale, EUC-JP
is the multibyte encoding. In a UTF-8 locale, UTF-8 is the multibyte encoding.
In short, the multibyte encoding is defined by the <tt>LC_CTYPE</tt> locale
category.
Multibyte characters are used whenever your software exchanges
text data with the world outside the program,
for example, standard input/output, display, keyboard, files,
and so on, as you do every day.
<footnote>
There are a few exceptions. Compound text should be used for
communication between X clients. UTF-8 would be the standard
for file names in Linux.
</footnote>
|
| 2362 |
</p>
|
| 2363 |
|
| 2364 |
<p>
|
| 2365 |
You can handle multibyte characters using the ordinary <tt>char</tt>
or <tt>unsigned char</tt> types and the ordinary character- and
string-oriented functions, just as you have always done for
ASCII and 8bit encodings.
|
| 2369 |
</p>
|
| 2370 |
|
| 2371 |
<p>
|
| 2372 |
Then why do we give it the special name <em>multibyte character</em>?
The answer is that ISO C specifies a set of functions which can handle
multibyte characters properly. By contrast,
ordinary C functions such as <tt>strlen()</tt> cannot:
<tt>strlen()</tt> counts bytes, not characters.
|
| 2377 |
</p>
|
| 2378 |
|
| 2379 |
<p>
|
| 2380 |
Then what are these functions which can handle multibyte characters
properly? Before answering, note that
a multibyte encoding may be stateful or stateless, and may use one or
several bytes per character, since the term covers every encoding that has been
or will be used on the earth. Thus it is not convenient for internal processing.
Even simple operations such as extracting a character
from a string, concatenating and splitting strings,
or counting the characters in a string need complex algorithms.
Thus, <strong>wide characters</strong> should be used for internal
processing. The main part of the C functions which can handle
multibyte characters are the functions for conversion between
multibyte characters and wide characters.
These functions are introduced later. Note that you may
be able to do without them, since ISO C supplies
I/O functions that perform the conversion.
|
| 2395 |
</p>
|
| 2396 |
|
| 2397 |
<p>
|
| 2398 |
ISO C defines wide characters such that
<list>
<item>all characters are expressed in a fixed number of bits, and
<item>the encoding is stateless, i.e., it has no shift states.
</list>
|
| 2403 |
</p>
|
| 2404 |
|
| 2405 |
<p>
|
| 2406 |
There are two types for wide characters: <tt>wchar_t</tt> and
<tt>wint_t</tt>. <tt>wchar_t</tt> is a type which can hold
one wide character, just as the <tt>char</tt> type can
hold one character. <tt>wint_t</tt> can hold one wide
character or <tt>WEOF</tt>, the wide counterpart of <tt>EOF</tt>.
|
| 2411 |
</p>
|
| 2412 |
|
| 2413 |
<p>
|
| 2414 |
A string of wide characters is represented as an array of <tt>wchar_t</tt>,
just as a string of characters is represented as an array
of <tt>char</tt>.
|
| 2417 |
</p>
|
| 2418 |
|
| 2419 |
<p>
|
| 2420 |
There are functions for <tt>wchar_t</tt> that correspond to the functions
for <tt>char</tt>.
|
| 2422 |
<list>
|
| 2423 |
<item><tt>strcat()</tt>, <tt>strncat()</tt> ->
|
| 2424 |
<tt>wcscat()</tt>, <tt>wcsncat()</tt>
|
| 2425 |
<item><tt>strcpy()</tt>, <tt>strncpy()</tt> ->
|
| 2426 |
<tt>wcscpy()</tt>, <tt>wcsncpy()</tt>
|
| 2427 |
<item><tt>strcmp()</tt>, <tt>strncmp()</tt> ->
|
| 2428 |
<tt>wcscmp()</tt>, <tt>wcsncmp()</tt>
|
| 2429 |
<item><tt>strcasecmp()</tt>, <tt>strncasecmp()</tt> ->
|
| 2430 |
<tt>wcscasecmp()</tt>, <tt>wcsncasecmp()</tt>
|
| 2431 |
<item><tt>strcoll()</tt>, <tt>strxfrm()</tt> ->
|
| 2432 |
<tt>wcscoll()</tt>, <tt>wcsxfrm()</tt>
|
| 2433 |
<item><tt>strchr()</tt>, <tt>strrchr()</tt> ->
|
| 2434 |
<tt>wcschr()</tt>, <tt>wcsrchr()</tt>
|
| 2435 |
<item><tt>strstr()</tt>, <tt>strpbrk()</tt> ->
|
| 2436 |
<tt>wcsstr()</tt>, <tt>wcspbrk()</tt>
|
| 2437 |
<item><tt>strtok()</tt>, <tt>strspn()</tt>, <tt>strcspn()</tt> ->
|
| 2438 |
<tt>wcstok()</tt>, <tt>wcsspn()</tt>, <tt>wcscspn()</tt>
|
| 2439 |
<item><tt>strtol()</tt>, <tt>strtoul()</tt>, <tt>strtod()</tt> ->
|
| 2440 |
<tt>wcstol()</tt>, <tt>wcstoul()</tt>, <tt>wcstod()</tt>
|
| 2441 |
<item><tt>strftime()</tt> ->
|
| 2442 |
<tt>wcsftime()</tt>
|
| 2443 |
<item><tt>strlen()</tt> ->
|
| 2444 |
<tt>wcslen()</tt>
|
| 2445 |
<item><tt>toupper()</tt>, <tt>tolower()</tt> ->
|
| 2446 |
<tt>towupper()</tt>, <tt>towlower()</tt>
|
| 2447 |
<item><tt>isalnum()</tt>, <tt>isalpha()</tt>, <tt>isblank()</tt>,
|
| 2448 |
<tt>iscntrl()</tt>, <tt>isdigit()</tt>, <tt>isgraph()</tt>,
|
| 2449 |
<tt>islower()</tt>, <tt>isprint()</tt>, <tt>ispunct()</tt>,
|
| 2450 |
<tt>isspace()</tt>, <tt>isupper()</tt>, <tt>isxdigit()</tt> ->
|
| 2451 |
<tt>iswalnum()</tt>, <tt>iswalpha()</tt>, <tt>iswblank()</tt>,
|
| 2452 |
<tt>iswcntrl()</tt>, <tt>iswdigit()</tt>, <tt>iswgraph()</tt>,
|
| 2453 |
<tt>iswlower()</tt>, <tt>iswprint()</tt>, <tt>iswpunct()</tt>,
|
| 2454 |
<tt>iswspace()</tt>, <tt>iswupper()</tt>, <tt>iswxdigit()</tt>
|
| 2455 |
(<tt>isascii()</tt> doesn't have its wide character version).
|
| 2456 |
<item><tt>memset()</tt>, <tt>memcpy()</tt>,
  <tt>memmove()</tt>, <tt>memchr()</tt> ->
  <tt>wmemset()</tt>, <tt>wmemcpy()</tt>,
  <tt>wmemmove()</tt>, <tt>wmemchr()</tt>
|
| 2460 |
</list>
|
| 2461 |
There are additional functions for <tt>wchar_t</tt>.
|
| 2462 |
<list>
|
| 2463 |
<item><tt>wcwidth()</tt>, <tt>wcswidth()</tt>
|
| 2464 |
<item><tt>wctrans()</tt>, <tt>towctrans()</tt>
|
| 2465 |
</list>
|
| 2466 |
</p>
|
| 2467 |
|
| 2468 |
<p>
|
| 2469 |
You cannot assume anything about the concrete values of <tt>wchar_t</tt>,
except that <tt>0x21</tt> - <tt>0x7e</tt> are identical to ASCII.
<footnote>
Some of you may know that GNU libc uses UCS-4 for the internal representation
of <tt>wchar_t</tt>. However, you should not rely on this knowledge.
It may differ on other systems.
</footnote>
You may feel this limitation is too strong. If you cannot work
within it, you can use UCS-4 as the internal encoding.
In that case, you can still make your software
locale-sensitive by using <tt>setlocale()</tt>,
<tt>nl_langinfo(CODESET)</tt>, and <tt>iconv()</tt>. Consult
the section <ref id="iconv">. Note that it is generally
easier to use wide characters than to implement UCS-4 or UTF-8 yourself.
|
| 2483 |
</p>
|
| 2484 |
|
| 2485 |
<p>
|
| 2486 |
You can write a wide character in the source code as <tt>L'a'</tt>
and a wide string as <tt>L"string"</tt>. Since only ASCII can be
relied on as the encoding of the source code, you should write only ASCII
characters there. If you'd like to use other characters, you should
use <prgn>gettext</prgn>.
|
| 2491 |
</p>
|
| 2492 |
|
| 2493 |
<p>
|
| 2494 |
There are two ways to use wide characters:
<list>
<item>I/O is done in multibyte characters. Input data
  are converted into wide characters immediately after reading,
  and output data are converted from wide characters to
  multibyte characters immediately before writing. The conversion
  can be done with functions such as <tt>mbstowcs()</tt>,
  <tt>mbsrtowcs()</tt>, <tt>wcstombs()</tt>, <tt>wcsrtombs()</tt>,
  <tt>mblen()</tt>, <tt>mbrlen()</tt>, <tt>mbsinit()</tt>,
  and so on.
  Please consult the manual pages for these functions.
<item>Wide characters are used directly for I/O, using
  wide character functions such as <tt>getwchar()</tt>,
  <tt>fgetwc()</tt>, <tt>getwc()</tt>,
  <tt>ungetwc()</tt>, <tt>fgetws()</tt>, <tt>putwchar()</tt>,
  <tt>fputwc()</tt>, <tt>putwc()</tt>, and <tt>fputws()</tt>,
  formatted I/O functions for wide characters such as
  <tt>fwscanf()</tt>, <tt>wscanf()</tt>, <tt>swscanf()</tt>,
  <tt>fwprintf()</tt>, <tt>wprintf()</tt>, <tt>swprintf()</tt>,
  <tt>vfwprintf()</tt>, <tt>vwprintf()</tt>, and
  <tt>vswprintf()</tt>, and the wide character conversion specifiers
  <tt>%lc</tt>, <tt>%C</tt>, <tt>%ls</tt>, and <tt>%S</tt>
  of the conventional formatted I/O functions.
  With this approach, you don't need to handle
  multibyte characters at all.
  Please consult the manual pages for these functions.
</list>
Though the latter functions are also defined in ISO C,
they became available only with GNU libc 2.2.
(UNIX operating systems that conform to XPG5 should have all the
functions described here.)
|
| 2525 |
</p>
|
| 2526 |
|
| 2527 |
<p>
|
| 2528 |
Note that very simple programs such as <tt>echo</tt> do not
have to care about multibyte and wide characters.
Such programs can input and output multibyte characters as is.
Of course you may modify such programs to use wide characters;
it may be good practice in wide character programming.
Example fragments of source code are discussed in
<ref id="internal">.
|
| 2535 |
</p>
|
| 2536 |
|
| 2537 |
<p>
|
| 2538 |
There is an explanation of multibyte and wide characters also
|
| 2539 |
in Ken Lunde's "CJKV Information Processing" (p25). However,
|
| 2540 |
the explanation is entirely wrong.
|
| 2541 |
</p>
|
| 2542 |
|
| 2543 |
<sect id="locale_unicode"><heading>Unicode and LOCALE technology</heading>
|
| 2544 |
|
| 2545 |
<p>
|
| 2546 |
UTF-8 is regarded as the encoding of the future, and
more and more software supports UTF-8. Though some
programs implement UTF-8 directly, I recommend
that you use LOCALE technology to support UTF-8.
|
| 2550 |
</p>
|
| 2551 |
|
| 2552 |
<p>
|
| 2553 |
How can this be achieved? It is easy! If you are a developer
and your software is already written using LOCALE
technology, you don't have to do anything!
|
| 2556 |
</p>
|
| 2557 |
|
| 2558 |
<p>
|
| 2559 |
Using LOCALE technology benefits not only developers but also users.
All a user has to do is set the locale environment properly.
Otherwise, a user has to remember how to enable UTF-8 mode
for each program. Some programs need a <tt>-u8</tt> switch,
others need an X resource setting, others need a <tt>.foobarrc</tt>
file, others need a special environment variable, and
others use UTF-8 by default. It is nonsense!
|
| 2566 |
</p>
|
| 2567 |
|
| 2568 |
<p>
|
| 2569 |
Solaris has already been developed using this model.
Please consult the
<url id="http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT"
name="Unicode support in the Solaris Operating Environment"> whitepaper.
|
| 2573 |
</p>
|
| 2574 |
|
| 2575 |
<p>
|
| 2576 |
However, it is likely that some upstream developers of
software for which you maintain a Debian package will refuse
to use <tt>wchar_t</tt> for some reason, for example, that
they are not familiar with LOCALE programming, that they think
it is troublesome, that they are not keen on I18N, that it is much
easier to modify the software to support UTF-8 than to modify it
to use <tt>wchar_t</tt>, that the software must work even under
a non-internationalized OS such as MS-DOS, and so on.
Some developers may think that support for UTF-8 is sufficient
for I18N.
<footnote>
In such a case, do they intend to abolish support for 7bit or
8bit non-multibyte encodings? If not, it is unfair that
speakers of 8bit languages can use both UTF-8 and conventional (local)
encodings while speakers of languages that need multibyte encodings, combining
characters, and so on cannot use their popular local encodings.
I think such software cannot be called "internationalized".
</footnote>
Even in such cases, you can rewrite the software so that it
checks the <tt>LC_*</tt> and <tt>LANG</tt> environment variables
to emulate the behavior of <tt>setlocale(LC_ALL, "");</tt>.
You can also rewrite the software to call <tt>setlocale()</tt>,
<tt>nl_langinfo()</tt>, and <tt>iconv()</tt> so that it
supports every encoding that the OS supports, as discussed later.
Consult
<url id="http://ffii.org/archive/mails/groff/2000/Oct/0056.html"
name="the discussion in the Groff mailing list on the support of
UTF-8 and locale-specific encodings">, held mainly between Werner
LEMBERG, an experienced developer of GNU roff, and Tomohiro KUBOTA,
the author of this document.
|
| 2606 |
</p>
<sect id="iconv"><heading><tt>nl_langinfo()</tt> and <tt>iconv()</tt></heading>
|
| 2611 |
|
| 2612 |
<p>
|
| 2613 |
Though ISO C defines extensive LOCALE-related functions,
|
| 2614 |
you may want more extensive support. You may also want
|
| 2615 |
conversion between different encodings.
|
| 2616 |
There are C functions which can be used for such purposes.
|
| 2617 |
</p>
|
| 2618 |
|
| 2619 |
<p>
|
| 2620 |
char *<tt>nl_langinfo(</tt>nl_item <em>item</em><tt>)</tt> is
an XPG5 function that returns LOCALE-related information. You can
get the following information using the following macros
for <em>item</em>, defined in the <tt>langinfo.h</tt> header file:
|
| 2624 |
<list>
|
| 2625 |
<item>names for days in week
|
| 2626 |
(<tt>DAY_1</tt> (Sunday), <tt>DAY_2</tt>, <tt>DAY_3</tt>,
|
| 2627 |
<tt>DAY_4</tt>, <tt>DAY_5</tt>, <tt>DAY_6</tt>, and <tt>DAY_7</tt>)
|
| 2628 |
<item>abbreviated names for days in week
|
| 2629 |
(<tt>ABDAY_1</tt> (Sun), <tt>ABDAY_2</tt>, <tt>ABDAY_3</tt>,
|
| 2630 |
<tt>ABDAY_4</tt>, <tt>ABDAY_5</tt>, <tt>ABDAY_6</tt>, and
|
| 2631 |
<tt>ABDAY_7</tt>)
|
| 2632 |
<item>names for months in year
|
| 2633 |
(<tt>MON_1</tt> (January), <tt>MON_2</tt>, <tt>MON_3</tt>,
|
| 2634 |
<tt>MON_4</tt>, <tt>MON_5</tt>, <tt>MON_6</tt>, <tt>MON_7</tt>,
|
| 2635 |
<tt>MON_8</tt>, <tt>MON_9</tt>, <tt>MON_10</tt>, <tt>MON_11</tt>,
|
| 2636 |
and <tt>MON_12</tt>)
|
| 2637 |
<item>abbreviated names for months in year
|
| 2638 |
  (<tt>ABMON_1</tt> (Jan), <tt>ABMON_2</tt>, <tt>ABMON_3</tt>,
|
| 2639 |
<tt>ABMON_4</tt>, <tt>ABMON_5</tt>, <tt>ABMON_6</tt>,
|
| 2640 |
<tt>ABMON_7</tt>, <tt>ABMON_8</tt>, <tt>ABMON_9</tt>,
|
| 2641 |
<tt>ABMON_10</tt>, <tt>ABMON_11</tt>, and <tt>ABMON_12</tt>)
|
| 2642 |
<item>name for AM (<tt>AM_STR</tt>)
|
| 2643 |
<item>name for PM (<tt>PM_STR</tt>)
|
| 2644 |
<item>name of era (<tt>ERA</tt>)
|
| 2645 |
<item>format of date and time (<tt>D_T_FMT</tt>)
|
| 2646 |
<item>format of date and time (era-based) (<tt>ERA_D_T_FMT</tt>)
|
| 2647 |
<item>format of date (<tt>D_FMT</tt>)
|
| 2648 |
<item>format of date (era-based) (<tt>ERA_D_FMT</tt>)
|
| 2649 |
<item>format of time (24-hour format) (<tt>T_FMT</tt>)
|
| 2650 |
<item>format of time (am/pm format) (<tt>T_FMT_AMPM</tt>)
|
| 2651 |
<item>format of time (era-based) (<tt>ERA_T_FMT</tt>)
|
| 2652 |
<item>radix (<tt>RADIXCHAR</tt>)
|
| 2653 |
<item>thousands separator (<tt>THOUSEP</tt>)
|
| 2654 |
<item>alternative characters for numerics (<tt>ALT_DIGITS</tt>)
|
| 2655 |
<item>affirmative word (<tt>YESSTR</tt>)
|
| 2656 |
<item>affirmative response (<tt>YESEXPR</tt>)
|
| 2657 |
<item>negative word (<tt>NOSTR</tt>)
|
| 2658 |
<item>negative response (<tt>NOEXPR</tt>)
|
| 2659 |
<item>encoding (<tt>CODESET</tt>)
|
| 2660 |
</list>
|
| 2661 |
For example, you can get the names of the months and use them in
your own output routine. <tt>YESEXPR</tt> and
<tt>NOEXPR</tt> are convenient for software that expects a Y/N
answer from users.
|
| 2665 |
</p>
|
| 2666 |
|
| 2667 |
<p>
|
| 2668 |
<tt>iconv_open()</tt>, <tt>iconv()</tt>, and <tt>iconv_close()</tt>
|
| 2669 |
are functions to perform conversion between encodings.
|
| 2670 |
Please consult manpages for them.
|
| 2671 |
</p>
|
| 2672 |
|
| 2673 |
<p>
|
| 2674 |
By combining <tt>nl_langinfo()</tt> and <tt>iconv()</tt>,
you can easily turn Unicode-enabled software into locale-sensitive,
truly internationalized software.
|
| 2677 |
</p>
|
| 2678 |
|
| 2679 |
<p>
|
| 2680 |
First, add a line of <tt>setlocale(LC_ALL, "");</tt> at the
beginning of the software. If it returns non-NULL, enable the UTF-8 mode
of the software.
<example>
    int conversion = FALSE;
    char *locale = setlocale(LC_ALL, "");
        :
        :
    (original code to determine UTF-8 mode or not)
        :
        :
    if (locale != NULL && utf8_mode == FALSE) {
        utf8_mode = TRUE;
        conversion = TRUE;
    }
</example>
Then modify the input routine as follows:
<example>
    #define INTERNALCODE "UTF-8"
    if (conversion == TRUE) {
        char *fromcode = nl_langinfo(CODESET);
        iconv_t conv = iconv_open(INTERNALCODE, fromcode);
        (reading and conversion...)
        iconv_close(conv);
    } else {
        (original reading routine)
    }
</example>
Finally, modify the output routine as follows:
<example>
    if (conversion == TRUE) {
        char *tocode = nl_langinfo(CODESET);
        iconv_t conv = iconv_open(tocode, INTERNALCODE);
        (conversion and writing...)
        iconv_close(conv);
    } else {
        (original writing routine)
    }
</example>
Note that the whole input should be converted at once, since
otherwise a multibyte character may be split between two reads.
You can consult the <tt>iconv_prog.c</tt> file
in the GNU libc distribution for the usage of <tt>iconv()</tt>.
|
| 2723 |
</p>
|
| 2724 |
|
| 2725 |
<p>
|
| 2726 |
Though <tt>nl_langinfo()</tt> is a standard XPG5 function
and GNU libc supports it, it is not very portable. Moreover,
there is no standard for the encoding names used by
<tt>nl_langinfo()</tt> and <tt>iconv_open()</tt>.
If this is a problem, you can use Bruno Haible's
<url id="http://www.gnu.org/software/libiconv/"
name="libiconv">. It has <tt>iconv()</tt>, <tt>iconv_open()</tt>,
and <tt>iconv_close()</tt>. In addition, it has <tt>locale_charset()</tt>,
a replacement for <tt>nl_langinfo(CODESET)</tt>.
|
| 2735 |
</p>
|
| 2736 |
|
| 2737 |
|
| 2738 |
<sect id="locale-limit"><heading>Limit of Locale technology</heading>
|
| 2739 |
|
| 2740 |
<P>
|
| 2741 |
The locale model has a limitation: it cannot handle two locales at
the same time. In particular, it cannot handle the relationship between two
locales at all.
|
| 2744 |
</P>
|
| 2745 |
|
| 2746 |
<P>
|
| 2747 |
For example, EUC-JP, ISO 2022-JP, and Shift-JIS are popular encodings
in Japan. EUC-JP is the de facto standard on UNIX systems,
ISO 2022-JP is the standard on the Internet, and Shift-JIS is the
encoding used on Windows and Macintosh. Thus, Japanese people have to
handle texts in all of these encodings. Text viewers such as <tt>jless</tt>
and <tt>lv</tt> and editors such as <tt>emacs</tt> can automatically
detect the encoding of the text they read. You cannot write such software
using Locale technology alone.
|
| 2755 |
</P>
<chapt id="output"><heading>Output to Display</heading>
|
| 2760 |
|
| 2761 |
<P>
|
| 2762 |
Here 'Output to Display' does not mean translation of messages using
<prgn>gettext</prgn>.
I am concerned with whether characters are displayed correctly so that
we can read them. For example, install the <package>libcanna1g</package>
package and display
<tt>/usr/doc/libcanna1g/README.jp.gz</tt> on the console or in <prgn>xterm</prgn>
(after ungzipping it, of course). This text file is written in Japanese, but even Japanese
people cannot read such a row of strange characters. Which would you
prefer if you were a Japanese speaker: an English message which can be read
with a dictionary, or a row of strange characters which is
the result of <prgn>gettext</prgn>ization?
<footnote>
(Yes, there <em>are</em> ways to display Japanese characters
correctly -- <prgn>kon</prgn> (in the <package>kon2</package> package)
for the console and <prgn>kterm</prgn> for X -- and Japanese people are
happy with <prgn>gettext</prgn>ized Japanese messages.)
</footnote>
|
| 2780 |
</P>
|
| 2781 |
|
| 2782 |
<P>
|
| 2783 |
Problems in displaying non-English (non-ASCII) characters
|
| 2784 |
are discussed below.
|
| 2785 |
</P>
|
| 2786 |
|
| 2787 |
|
| 2788 |
|
| 2789 |
<sect id="output-console"><heading>Console Software</heading>
|
| 2790 |
|
| 2791 |
<P>
|
| 2792 |
This section discusses problems in displaying characters on the
<strong>console</strong>.
<footnote>
This section does not cover the development of consoles themselves;
it covers the development of software that runs on a console.
</footnote>
Here, 'console' includes the bare <strong>Linux console</strong> (both the
framebuffer and the conventional versions), special consoles such as
<strong>kon2</strong>, <strong>jfbterm</strong>, <strong>chdrv</strong>,
and so on, which are implemented by special software, and X terminal emulators
|
| 2803 |
such as <strong>xterm</strong>, <strong>kterm</strong>,
|
| 2804 |
<strong>hanterm</strong>, <strong>xiterm</strong>, <strong>rxvt</strong>,
|
| 2805 |
<strong>xvt</strong>, <strong>gnome-terminal</strong>,
|
| 2806 |
<strong>wterm</strong>, <strong>aterm</strong>, <strong>eterm</strong>,
|
| 2807 |
and so on. Remote environments via telnet and secure shell such as
|
| 2808 |
<strong>NCSA telnet</strong> for Macintosh and <strong>Tera Term</strong>
|
| 2809 |
for Windows are also regarded as consoles.
|
| 2810 |
</P>
|
| 2811 |
|
| 2812 |
<P>
|
| 2813 |
The console has the following characteristics:
<list>
<item>All that software has to do is send correctly encoded text
      to standard output. Software running on a console does not need to
      care about fonts and so on.
<item>Fixed-width fonts are used. The unit of font width
      is called a 'column'. 'Doublewidth' fonts, i.e.,
      fonts whose width is 2 columns, are used for CJK ideograms,
      Japanese Hiragana and Katakana, Korean Hangul, and related
      symbols. Combining characters, used for Thai and so on, can be
      regarded as 'zero'-column characters.
|
| 2824 |
</list>
|
| 2825 |
</P>
|
| 2826 |
|
| 2827 |
<sect1 id="output-console-code"><heading>Encoding</heading>
|
| 2828 |
|
| 2829 |
<P>
|
| 2830 |
Software running on the console is not responsible for displaying characters;
the console itself is. There are consoles
that can display encodings other than ASCII, such as
|
| 2833 |
<taglist>
|
| 2834 |
<tag>kon in kon2 package
|
| 2835 |
  <item>EUC-JP, Shift-JIS, and ISO 2022-JP
|
| 2836 |
<tag>jfbterm
|
| 2837 |
<item>EUC-JP, ISO 2022-JP, and ISO 2022 (including any 94, 96,
|
| 2838 |
and 94x94 coded character sets whose fonts are available)
|
| 2839 |
<tag>kterm
|
| 2840 |
<item>EUC-JP, Shift-JIS, ISO 2022-JP, and ISO 2022 (including
|
| 2841 |
ISO8859-{1,2,3,4,5,6,7,8,9}, JISX 0201, JISX 0208, JISX 0212,
|
| 2842 |
GB 2312, and KSC 5601)
|
| 2843 |
<tag>krxvt in rxvt-ml package
|
| 2844 |
<item>EUC-JP
|
| 2845 |
<tag>crxvt-gb in rxvt-ml package
|
| 2846 |
<item>CN-GB
|
| 2847 |
<tag>crxvt-big5 in rxvt-ml package
|
| 2848 |
<item>Big5
|
| 2849 |
<tag>cxtermb5 in cxterm-big5 package
|
| 2850 |
<item>Big5
|
| 2851 |
<tag>xcinterm-big5 in xcin package
|
| 2852 |
<item>Big5
|
| 2853 |
<tag>xcinterm-gb in xcin package
|
| 2854 |
<item>CN-GB
|
| 2855 |
<tag>xcinterm-gbk in xcin package
|
| 2856 |
<item>GBK
|
| 2857 |
<tag>xcinterm-big5hkscs in xcin package
|
| 2858 |
<item>Big5 with HKSCS
|
| 2859 |
<tag>hanterm
|
| 2860 |
<item>EUC-KR, Johab, and ISO 2022-KR
|
| 2861 |
<tag>xiterm and txiterm in xiterm+thai package
|
| 2862 |
<item>TIS 620
|
| 2863 |
<tag>xterm
|
| 2864 |
<item>UTF-8
|
| 2865 |
</taglist>
|
| 2866 |
However, there is no way for software running on a console to find out which
encoding the console supports. I think it is the user's responsibility
to set the LC_CTYPE locale properly (i.e., the LC_ALL, LC_CTYPE, or LANG
environment variable). Provided that the LC_CTYPE locale is set properly,
software can use it to determine which encoding is supported
by the console.
|
| 2872 |
</P>
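<P>
As a minimal sketch of this idea, the following C fragment asks the C
library which encoding the current LC_CTYPE locale implies. It assumes a
POSIX system that provides <tt>nl_langinfo()</tt> in <tt>langinfo.h</tt>.
The helper names <tt>locale_codeset()</tt> and <tt>report_codeset()</tt>
are made up for this illustration, and the spelling of the returned name
(for example <tt>UTF-8</tt> or <tt>EUC-JP</tt>) differs between systems.
</P>

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Return the name of the encoding implied by the current LC_CTYPE
 * locale.  The exact spelling of the name is system-dependent. */
const char *locale_codeset(void)
{
    return nl_langinfo(CODESET);
}

/* Typical use at program start-up.  report_codeset() is a hypothetical
 * helper, not part of any library. */
void report_codeset(void)
{
    /* Adopt the user's LC_ALL / LC_CTYPE / LANG settings. */
    if (setlocale(LC_CTYPE, "") == NULL)
        fprintf(stderr, "warning: unsupported locale, staying in \"C\"\n");
    printf("the console is assumed to use %s\n", locale_codeset());
}
```

<P>
The result is only as reliable as the user's locale settings: if the
locale says EUC-JP but the console actually displays UTF-8, the software
has no way to notice.
</P>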
|
| 2873 |
|
| 2874 |
<P>
|
| 2875 |
As for messages translated by <prgn>gettext</prgn>,
the software does not need to do anything special. They are displayed
correctly if the user has set the LC_CTYPE and LC_MESSAGES locales properly.
|
| 2878 |
</P>
|
| 2879 |
|
| 2880 |
<P>
|
| 2881 |
If you are handling strings in a non-ASCII encoding (using
multibyte characters, UTF-8 directly, and so on), you will have
to take care of some points that you can ignore when you are
using ASCII.
<list>
<item>8-bit cleanness. I think everyone understands this.
<item>Continuity of multibyte characters. In multibyte encodings
      such as EUC-JP and UTF-8, one character may consist
      of two or more bytes. These bytes should be output
      contiguously. Inserting additional codes between the
      bytes of a character can break the character. I have seen
      software that outputs a cursor location control code every time
      it outputs one byte. This breaks multibyte characters.
|
| 2894 |
</list>
|
| 2895 |
</P>
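<P>
The continuity rule can be sketched as follows. This is only an
illustration, assuming that <tt>setlocale()</tt> has already been called
and that the encoding is one the C library understands. The function name
<tt>put_whole_chars()</tt> is hypothetical. It uses the standard
<tt>mbrlen()</tt> function to find character boundaries, so that anything
the program wants to interleave (such as cursor control codes) can be
placed between characters and never inside one.
</P>

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Write s to fp one whole multibyte character at a time.
 * Returns the number of characters written, or -1 if s contains an
 * invalid or truncated multibyte sequence. */
long put_whole_chars(FILE *fp, const char *s)
{
    mbstate_t st;
    size_t left = strlen(s);
    long count = 0;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (left > 0) {
        size_t n = mbrlen(s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                  /* invalid or incomplete sequence */
        if (n == 0)
            break;                      /* NUL character: end of string */
        fwrite(s, 1, n, fp);            /* all bytes of one character */
        /* Control codes may safely be written here, between characters. */
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```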
|
| 2896 |
|
| 2897 |
<sect1 id="output-console-column"><heading>Number of Columns</heading>
|
| 2898 |
|
| 2899 |
<P>
|
| 2900 |
Internationalized console software cannot assume that a character
always occupies one column. You can get the number of columns occupied by a
character or by a string using <tt>wcwidth()</tt> and
<tt>wcswidth()</tt>, respectively. Note that you have to use
<tt>wchar_t</tt>-style programming, since these functions take
<tt>wchar_t</tt> arguments.
|
| 2906 |
</P>
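<P>
For illustration, here is a small sketch that computes the display width
of a multibyte string. It assumes an X/Open (POSIX) system; on some systems
<tt>wcwidth()</tt> and <tt>wcswidth()</tt> are only declared when
<tt>_XOPEN_SOURCE</tt> is defined before the headers are included, which is
why the sketch defines it. The function name <tt>display_columns()</tt> is
made up for this example.
</P>

```c
#define _XOPEN_SOURCE 700   /* assumed necessary for wcswidth() */
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Return the number of columns needed to display the multibyte string
 * mbs in the current locale, or -1 if it contains an invalid sequence
 * or a non-printable character. */
int display_columns(const char *mbs)
{
    size_t max = strlen(mbs) + 1;   /* never more wide chars than bytes */
    wchar_t *ws = malloc(max * sizeof *ws);
    size_t n;
    int cols;

    if (ws == NULL)
        return -1;
    n = mbstowcs(ws, mbs, max);     /* multibyte -> wide characters */
    if (n == (size_t)-1) {
        free(ws);
        return -1;
    }
    cols = wcswidth(ws, n);         /* -1 if any character is non-printable */
    free(ws);
    return cols;
}
```

<P>
In a UTF-8 or EUC-JP locale, a string of CJK ideographs would normally
yield twice as many columns as characters, but the exact values come from
the locale data of the system.
</P>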
|
| 2907 |
|
| 2908 |
<P>
|
| 2909 |
Additional care has to be taken not to destroy multicolumn
characters. For example, imagine that your software has displayed a
double-column character at (row, column) = (1, 1). What will happen
when your software then displays a single-column character at (row, column)
= (1, 2) or at (1, 1)? Will the single-column character erase
half of the double-column character? Nobody knows the answer;
it depends on the implementation of the console. All I can
say is that your software should avoid such cases.
|
| 2917 |
</P>
|
| 2918 |
|
| 2919 |
<P>
|
| 2920 |
If your software reads a string from the keyboard, you will have to
take even more care, because the numbers of characters, bytes, and columns
all differ. For example, in the UTF-8 encoding, the character
'a' with acute accent occupies two bytes and one column, and a
CJK ideograph occupies three bytes and two columns.
If the user types 'Backspace', how many backspace
codes (0x08) should the software output? How many bytes should
the software erase from its internal buffer?
Don't worry; you can use <tt>wchar_t</tt>, which guarantees that
one character always occupies one <tt>wchar_t</tt>, and you can
use <tt>wcwidth()</tt> to find the number of columns.
Note that control codes such as 'backspace' (0x08) are
always column-oriented. A backspace moves back 'one' column even if the
character at that position is a doublewidth character.
|
| 2934 |
</P>
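<P>
The backspace handling described above might look like the following
sketch. It assumes that the line is kept in a <tt>wchar_t</tt> buffer, that
the cursor is at the end of the line, and that the line has not wrapped.
The function name <tt>erase_last_char()</tt> is hypothetical, and
<tt>_XOPEN_SOURCE</tt> is defined on the assumption that the system needs
it to declare <tt>wcwidth()</tt>.
</P>

```c
#define _XOPEN_SOURCE 700   /* assumed necessary for wcwidth() */
#include <stdio.h>
#include <wchar.h>

/* Remove the last character from the wide-character line buffer
 * buf[0 .. *len-1] and erase it from the screen.
 * Returns the number of columns erased (0 if the buffer was empty). */
int erase_last_char(wchar_t *buf, size_t *len, FILE *out)
{
    int cols, i;

    if (*len == 0)
        return 0;
    cols = wcwidth(buf[*len - 1]);
    if (cols < 0)
        cols = 0;               /* non-printable: nothing was displayed */
    buf[--*len] = L'\0';        /* one character is one wchar_t */

    /* Backspace is column-oriented: step back once per column,
     * overwrite the columns with spaces, then step back again. */
    for (i = 0; i < cols; i++) fputc('\b', out);
    for (i = 0; i < cols; i++) fputc(' ', out);
    for (i = 0; i < cols; i++) fputc('\b', out);
    return cols;
}
```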
|
| 2935 |
|
| 2936 |
|
| 2937 |
<sect id="output-x"><heading>X Clients</heading>
|
| 2938 |
|
| 2939 |
<P>
|
| 2940 |
The way to develop X clients can differ drastically depending on
the toolkit used. First, Xlib-style programming is
discussed, since Xlib is the foundation for all other toolkits.
Then a few toolkits are discussed.
|
| 2944 |
</P>
|
| 2945 |
|
| 2946 |
<sect1 id="output-x-xlib"><heading>Xlib programming</heading>
|
| 2947 |
|
| 2948 |
<P>
|
| 2949 |
X itself is already internationalized. X11R5 introduced
the idea of a 'fontset' for internationalized text output.
Thus, all that X clients have to do is use the 'fontset'-related
functions.
|
| 2953 |
</P>
|
| 2954 |
|
| 2955 |
<P>
|
| 2956 |
The most important part of internationalizing the display code
of X clients is to use the internationalized
<strong>XFontSet</strong>-related functions, introduced in
X11R5, instead of the conventional <strong>XFontStruct</strong>-related
functions.
|
| 2961 |
</P>
|
| 2962 |
|
| 2963 |
<P>
|
| 2964 |
The main feature of XFontSet is that it can handle multiple fonts
at the same time. This is related to the distinction between a
coded character set (CCS) and a character encoding scheme (CES),
which is explained in <ref id="coding-general-term">.
Some encodings use multiple coded character
sets at the same time. This is why we have to handle
multiple X fonts at the same time.
|
| 2971 |
<footnote>
|
| 2972 |
Though UTF-8 is an encoding with single CCS, the current
|
| 2973 |
version of XFree86 (4.0.1) needs multiple fonts to handle UTF-8.
|
| 2974 |
</footnote>
|
| 2975 |
</P>
|
| 2976 |
|
| 2977 |
<P>
|
| 2978 |
Another significant feature of XFontSet is that it is
locale (LC_CTYPE)-sensitive. This means that you have to
call <tt>setlocale()</tt> before you use XFontSet-related
functions. Moreover, you have to pass the string you want
to draw as a multibyte string or a wide character string.
|
| 2983 |
</P>
|
| 2984 |
|
| 2985 |
<P>
|
| 2986 |
In the conventional <tt>XFontStruct</tt> model, an X client
opens a font using <tt>XLoadQueryFont()</tt>, draws a string
using <tt>XDrawString()</tt>, and closes the font using
<tt>XFreeFont()</tt>. In the internationalized
<tt>XFontSet</tt> model, on the other hand, an X client opens a font set using
<tt>XCreateFontSet()</tt>, draws a string using <tt>XmbDrawString()</tt>,
and closes the font set using <tt>XFreeFontSet()</tt>.
The following is a concise list of substitutions.
|
| 2994 |
<list>
|
| 2995 |
<item><tt>XFontStruct</tt> -> <tt>XFontSet</tt>
|
| 2996 |
<item><tt>XLoadQueryFont()</tt> -> <tt>XCreateFontSet()</tt>
|
| 2997 |
<item>both <tt>XDrawString()</tt> and <tt>XDrawString16()</tt>
     -> either <tt>XmbDrawString()</tt> or <tt>XwcDrawString()</tt>
<item>both <tt>XDrawImageString()</tt> and <tt>XDrawImageString16()</tt>
     -> either <tt>XmbDrawImageString()</tt> or
     <tt>XwcDrawImageString()</tt>
<item><tt>XFreeFont()</tt> -> <tt>XFreeFontSet()</tt>
|
| 3002 |
</list>
|
| 3003 |
Note that <tt>XFontStruct</tt> is usually used through a pointer, while
<tt>XFontSet</tt> is itself a pointer type.
|
| 3005 |
</P>
|
| 3006 |
|
| 3007 |
<P>
|
| 3008 |
Some people (speakers of ISO-8859-1 languages) may think that
<tt>XFontSet</tt>-related functions are not 8-bit clean.
This is wrong. <tt>XFontSet</tt>-related
functions work according to the <tt>LC_CTYPE</tt> locale. The default
LC_CTYPE locale uses ASCII. Thus, if a user sets none of the <tt>LANG</tt>,
<tt>LC_CTYPE</tt>, and <tt>LC_ALL</tt> environment variables,
<tt>XFontSet</tt>-related functions will use ASCII, which is why they
appear not to be 8-bit clean. The user has to set the <tt>LANG</tt>,
<tt>LC_CTYPE</tt>, or <tt>LC_ALL</tt> environment variable properly (for example,
<tt>LANG=en_US</tt>).
|
| 3018 |
</P>
|
| 3019 |
|
| 3020 |
<P>
|
| 3021 |
The upstream developers of X clients are sometimes reluctant to force
users to set such environment variables.
<footnote>
IMHO, all users will have to set LANG properly when UTF-8
becomes popular.
</footnote>
In such a case,
the X client should have two ways to output text, i.e., the
conventional <tt>XFontStruct</tt>-related way and the
internationalized <tt>XFontSet</tt>-related way.
If <tt>setlocale()</tt> returns <tt>NULL</tt>, <tt>"C"</tt>,
or <tt>"POSIX"</tt>, use the
<tt>XFontStruct</tt> way. Otherwise, use the <tt>XFontSet</tt> way.
The author has implemented this algorithm in a few window managers,
such as TWM (version 4.0.1d), Blackbox (0.60.1), IceWM (1.0.0),
sawmill (0.28), and so on.
|
| 3037 |
</P>
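<P>
The decision rule can be written down as a few lines of C. This is only
a sketch of the rule as stated above, not the code of any of the window
managers mentioned. The function name <tt>should_use_fontset()</tt> is
hypothetical, and querying the <tt>LC_CTYPE</tt> category (rather than
<tt>LC_ALL</tt>) is an assumption made here so that the returned name is a
single locale name.
</P>

```c
#include <locale.h>
#include <string.h>

/* Decide from the value returned by setlocale() whether the
 * XFontSet-based drawing path should be used.
 * Returns 1 for the XFontSet way, 0 for the XFontStruct way. */
int should_use_fontset(const char *locale_name)
{
    if (locale_name == NULL)            /* locale not supported */
        return 0;
    if (strcmp(locale_name, "C") == 0 ||
        strcmp(locale_name, "POSIX") == 0)
        return 0;                       /* user did not choose a locale */
    return 1;
}

/* Example call at start-up (assumption: LC_CTYPE is the category used):
 *
 *     int use_fontset = should_use_fontset(setlocale(LC_CTYPE, ""));
 */
```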
|
| 3038 |
|
| 3039 |
<P>
|
| 3040 |
Window managers need more modifications related to inter-client
|
| 3041 |
communication. This topic will be described later.
|
| 3042 |
</P>
|
| 3043 |
|
| 3044 |
<sect1 id="output-x-aw"><heading>Athena widgets</heading>
|
| 3045 |
|
| 3046 |
<P>
|
| 3047 |
The Athena widget set is already internationalized.
|
| 3048 |
</P>
|
| 3049 |
|
| 3050 |
<P>***** Not written yet *****</P>
|
| 3051 |
|
| 3052 |
<sect1 id="output-x-gtk"><heading>Gtk and Gnome</heading>
|
| 3053 |
|
| 3054 |
<P>
|
| 3055 |
Gtk is already internationalized.
|
| 3056 |
</P>
|
| 3057 |
|
| 3058 |
<P>***** Not written yet *****</P>
|
| 3059 |
|
| 3060 |
<sect1 id="output-x-qt"><heading>Qt and KDE</heading>
|
| 3061 |
|
| 3062 |
<P>
|
| 3063 |
Though an internationalized version of Qt had been available for a long
time, it could not become the official version of Qt. The license
of Qt in those days prohibited distribution of the internationalized
version. However, Troll Tech eventually changed their mind
and Qt's license, and now the official version of Qt is
internationalized.
|
| 3069 |
</P>
|
| 3070 |
|
| 3071 |
<P>***** Not written yet *****</P>
|
| 3072 |
|
| 3073 |
<chapt id="input"><heading>Input from Keyboard</heading>
|
| 3074 |
|
| 3075 |
<P>
|
| 3076 |
It is obvious that a text editor needs the ability to accept text
from the keyboard; otherwise the text editor is entirely useless.
Similarly, an internationalized text editor needs the ability to accept
characters used in various languages. Other software also needs the
ability to accept internationalized text: shells, libraries such as
readline, environments such as consoles and X terminal emulators,
script languages such as perl, tcl/tk, python, and ruby, and applications
such as word processors, drawing and painting programs, file managers such as
Midnight Commander, web browsers, mailers, and so on. Otherwise,
such software is entirely useless.
|
| 3087 |
</P>
|
| 3088 |
|
| 3089 |
<P>
|
| 3090 |
There are many languages in the world, and the appropriate input method
varies from language to language.
<list>
<item>Some languages, such as English, do not need any special input
      method. Every character of the language can be entered
      with a single key on the keyboard. The keymap is all that a user
      has to care about.
<item>Some other languages, such as German, need a simple extension.
      For example, u with umlaut can be entered with the two keystrokes
      ':' and 'u'. A way to switch between the ordinary input mode (the
      keystrokes ':' and 'u' produce ':' and 'u') and the extended
      input mode (the keystrokes ':' and 'u' produce u with umlaut)
      has to be supplied. Most languages in the world can be
      entered with this method.
<item>Other languages, such as Chinese and Japanese, need a complicated
      input method, since they use thousands of characters.
      Since developing a clever input method is a very difficult and
      challenging problem, a few companies are developing Japanese
      input methods. Typical Japanese input methods are shipped
      with tens of megabytes of conversion dictionaries.
      It is often very troublesome to set up an input method for
      these languages.
      <footnote>
      This is a field where proprietary systems such as MS Windows
      and Macintosh are much easier than free systems such as
      Debian and FreeBSD.
      </footnote>
      You also need practice to use
      these input methods.
|
| 3119 |
</list>
|
| 3120 |
Different technologies are used for these languages.
|
| 3121 |
The aim of this chapter is to introduce technologies for them.
|
| 3122 |
</P>
|
| 3123 |
|
| 3124 |
|
| 3125 |
<sect id="input-console"><heading>Non-X Software</heading>
|
| 3126 |
|
| 3127 |
<P>
|
| 3128 |
Ideally, it is the responsibility of consoles and X terminal emulators
to supply an input method. This is already the case for
simple languages that do not need complicated input methods.
Thus, non-X software does not need to care about input methods.
|
| 3132 |
</P>
|
| 3133 |
|
| 3134 |
<P>
|
| 3135 |
There are a few Debian packages for consoles and X terminal
|
| 3136 |
emulators which supply input methods for particular languages.
|
| 3137 |
<taglist>
|
| 3138 |
<tag><strong>xiterm</strong> in xiterm+thai package
|
| 3139 |
<item>Thai characters
|
| 3140 |
<tag><strong>hanterm</strong>
|
| 3141 |
<item>Korean Hangul
|
| 3142 |
<tag><strong>cxtermb5</strong> in cxterm-big5 package
|
| 3143 |
<item>Big5 traditional Chinese ideograms
|
| 3144 |
<tag><strong>cce</strong>
|
| 3145 |
<item>CN-GB simplified Chinese ideograms
|
| 3146 |
</taglist>
|
| 3147 |
In addition, there are a few programs that supply input methods for
an existing console environment.
|
| 3149 |
<taglist>
|
| 3150 |
<tag><strong>skkfep</strong>
|
| 3151 |
<item>Japanese (needs SKK as a conversion engine)
|
| 3152 |
<tag><strong>uum</strong>
|
| 3153 |
<item>Japanese (needs Wnn as a conversion engine; not
|
| 3154 |
      available as a Debian package)
|
| 3155 |
<tag><strong>canuum</strong>
|
| 3156 |
<item>Japanese (needs Canna as a conversion engine; not
|
| 3157 |
      available as a Debian package)
|
| 3158 |
</taglist>
|
| 3159 |
However, since input methods for complex languages were historically
not available, a few non-X programs have been developed
with their own input methods.
|
| 3162 |
<taglist>
|
| 3163 |
<tag><strong>jvim-canna</strong>
|
| 3164 |
<item>A text editor which can input Japanese (needs Canna
|
| 3165 |
as a conversion engine.)
|
| 3166 |
<tag><strong>jed-canna</strong>
|
| 3167 |
<item>A text editor which can input Japanese (needs Canna
|
| 3168 |
as a conversion engine.)
|
| 3169 |
<tag><strong>nvi-m17n-canna</strong>
|
| 3170 |
<item>A text editor which can input Japanese (needs Canna
|
| 3171 |
as a conversion engine.)
|
| 3172 |
</taglist>
|
| 3173 |
</P>
|
| 3174 |
|
| 3175 |
<P>
|
| 3176 |
You have to take care of the differences between the numbers of
<em>characters</em>, <em>columns</em>, and <em>bytes</em>.
For example, you can see immediately that <prgn>bash</prgn>
cannot handle UTF-8 input properly when you invoke <prgn>bash</prgn>
in a UTF-8 xterm and press the BackSpace key. This is because
<prgn>readline</prgn> always erases one column on the screen
and one byte in the internal buffer for each stroke of the 'BackSpace'
key. To solve this problem, <strong>wide characters</strong>
should be used for internal processing. One stroke of the 'BackSpace' key
should erase <tt>wcwidth()</tt> columns on the screen and
one <tt>wchar_t</tt> unit in the internal buffer.
|
| 3187 |
</P>
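<P>
To keep the internal buffer in wide characters, the bytes arriving from
the terminal have to be assembled into <tt>wchar_t</tt> units first. The
following is a minimal sketch of one way to do that with the standard
<tt>mbrtowc()</tt> function, feeding it one byte at a time. The names
<tt>struct key_decoder</tt>, <tt>key_decoder_init()</tt>, and
<tt>key_decoder_feed()</tt> are hypothetical, and the sketch assumes that
<tt>setlocale()</tt> has been called so that the encoding of the terminal
matches the locale.
</P>

```c
#include <string.h>
#include <wchar.h>

/* Conversion state carried between input bytes. */
struct key_decoder {
    mbstate_t state;
};

void key_decoder_init(struct key_decoder *d)
{
    memset(&d->state, 0, sizeof d->state);   /* initial shift state */
}

/* Feed one input byte.
 * Returns  1 and stores a character in *out when a character is complete,
 *          0 when more bytes are needed,
 *         -1 on an invalid sequence (the decoder is then reset). */
int key_decoder_feed(struct key_decoder *d, unsigned char byte, wchar_t *out)
{
    char c = (char)byte;
    size_t n = mbrtowc(out, &c, 1, &d->state);

    if (n == (size_t)-2)
        return 0;                   /* incomplete: wait for the next byte */
    if (n == (size_t)-1) {
        key_decoder_init(d);        /* invalid: drop the partial character */
        return -1;
    }
    return 1;                       /* complete character (possibly NUL) */
}
```

<P>
Each completed <tt>wchar_t</tt> can then be appended to the line buffer,
so that deleting one character always means deleting exactly one array
element.
</P>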
|
| 3188 |
|
| 3189 |
|
| 3190 |
<sect id="input-x"><heading>X Software</heading>
|
| 3191 |
|
| 3192 |
<P>
|
| 3193 |
X11R5 was the first internationalized version of the X Window System.
However, X11R5 supplied two sample implementations of international
text input: <strong>Xsi</strong> and <strong>Ximp</strong>.
The existence of two different protocols was an annoying situation.
X11R6 then adopted <strong>XIM</strong>, a new protocol
for internationalized text input, as the standard. Internationalized
X software should support text input using XIM.
|
| 3200 |
</P>
|
| 3201 |
|
| 3202 |
<P>
|
| 3203 |
They are designed around a <em>client-server</em> model.
The client calls the server when necessary. The server
converts keystrokes into internationalized text.
|
| 3206 |
</P>
|
| 3207 |
|
| 3208 |
<P>
|
| 3209 |
<strong>Kinput</strong> and <strong>kinput2</strong>
are protocols for Japanese text input that existed before X11R5.
Some software, such as <prgn>kterm</prgn>, supports the
kinput2 protocol. <prgn>kinput2</prgn> is the server software.
Since the current version of <prgn>kinput2</prgn> supports the XIM protocol,
you don't need to support the kinput protocol.
|
| 3215 |
</P>
|
| 3216 |
|
| 3217 |
<sect1 id="input-x-devel"><heading>Developing XIM clients</heading>
|
| 3218 |
|
| 3219 |
<P>***** Not written yet *****</P>
|
| 3220 |
|
| 3221 |
<P>
|
| 3222 |
Developing an XIM client is a bit complicated. You can study the
source code of <prgn>rxvt</prgn> and <prgn>xedit</prgn> to
learn how it is done.
|
| 3225 |
</P>
|
| 3226 |
|
| 3227 |
<P>
|
| 3228 |
<url id="http://www.ainet.or.jp/~inoue/im/index-e.html"
|
| 3229 |
name="Programming for Japanese characters input"> is a
|
| 3230 |
good introduction to XIM programming.
|
| 3231 |
</P>
|
| 3232 |
|
| 3233 |
<sect1 id="input-x-examples"><heading>Examples of XIM software</heading>
|
| 3234 |
|
| 3235 |
<P>
|
| 3236 |
The following are examples of software that can work as XIM clients.
<list>
<item>X terminal emulators such as <prgn>krxvt</prgn>, <prgn>kterm</prgn>,
     and so on.
<item>Text editors such as <prgn>xedit</prgn>, <prgn>gedit</prgn>, and
     so on.
<item>The web browser <prgn>mozilla</prgn>.
</list>
The following are examples of software that can work as XIM servers.
|
| 3245 |
<list>
|
| 3246 |
<item><prgn>kinput2</prgn> and <prgn>skkinput</prgn> for Japanese.
|
| 3247 |
</list>
|
| 3248 |
</P>
|
| 3249 |
|
| 3250 |
<sect1 id="input-x-setup"><heading>Using XIM software</heading>
|
| 3251 |
|
| 3252 |
<P>
|
| 3253 |
Here I will explain how to use XIM input on a Debian system.
This will help developers and package maintainers who want to
test the XIM support of their software. Debian Woody or a later
release is assumed.
|
| 3257 |
</P>
|
| 3258 |
|
| 3259 |
<P>
|
| 3260 |
First, the locale database has to be prepared. Uncomment the
<tt>ja_JP.EUC-JP EUC-JP</tt>, <tt>ko_KR.EUC-KR EUC-KR</tt>,
<tt>zh_CN.GB2312</tt>, and <tt>zh_TW BIG5</tt> lines in
<tt>/etc/locale.gen</tt> and invoke <prgn>/usr/sbin/locale-gen</prgn>.
This will prepare the locale database under <tt>/usr/share/locale/</tt>.
On systems other than Debian Woody or later, please follow the appropriate
procedure for those systems to prepare the locale database.
|
| 3267 |
</P>
|
| 3268 |
|
| 3269 |
<P>
|
| 3270 |
Basic Chinese, Japanese, and Korean X fonts are included in
|
| 3271 |
the <package>xfonts-base</package> package for Debian Woody and later.
|
| 3272 |
</P>
|
| 3273 |
|
| 3274 |
<P>
|
| 3275 |
An XIM server must be installed. For <strong>Japanese</strong>,
the <package>kinput2</package> and <package>skkinput</package> packages
are available. <package>kinput2</package> supports the Japanese input
engines <strong>Canna</strong> and <strong>FreeWnn</strong>, and
<package>skkinput</package> supports <strong>SKK</strong>.
|
| 3280 |
For <strong>Korean</strong>, <package>ami</package> is available.
|
| 3281 |
For <strong>traditional Chinese</strong> and <strong>simplified
|
| 3282 |
Chinese</strong>, <package>xcin</package> is available.
|
| 3283 |
</P>
|
| 3284 |
|
| 3285 |
<P>
|
| 3286 |
Of course, you also need XIM client software. <prgn>xedit</prgn>
in the <package>xbase-clients</package> package is an example of
an XIM client.
|
| 3289 |
</P>
|
| 3290 |
|
| 3291 |
<P>
|
| 3292 |
Then log in as a non-root user. The environment variables
<tt>LC_ALL</tt> (or <tt>LANG</tt>) and <tt>XMODIFIERS</tt>
must be set as follows.
|
| 3295 |
<list>
|
| 3296 |
<item>for <strong>Japanese</strong>/<strong>kinput2</strong>:
|
| 3297 |
<tt>LC_ALL=ja_JP.eucJP</tt> and <tt>XMODIFIERS=@im=kinput2</tt>
|
| 3298 |
<item>for <strong>Korean</strong>/<strong>ami</strong>:
|
| 3299 |
<tt>LC_ALL=ko_KR.eucKR</tt> and <tt>XMODIFIERS=@im=Ami</tt>
|
| 3300 |
<item>for <strong>traditional Chinese</strong>/<strong>xcin</strong>:
|
| 3301 |
<tt>LC_ALL=zh_TW.Big5</tt> and <tt>XMODIFIERS=@im=xcin</tt>
|
| 3302 |
<item>for <strong>simplified Chinese</strong>/<strong>xcin</strong>:
|
| 3303 |
<tt>LC_ALL=zh_CN.GB2312</tt> and <tt>XMODIFIERS=@im=xcin-zh_CN.GB2312</tt>
|
| 3304 |
</list>
|
| 3305 |
</P>
|
| 3306 |
|
| 3307 |
<P>
|
| 3308 |
Then start the XIM server. Simply run it in the background
(with &). <strong>kinput2</strong> and <strong>ami</strong>
do not open a new window, while <strong>xcin</strong> opens a new
window and outputs some messages.
|
| 3312 |
</P>
|
| 3313 |
|
| 3314 |
<P>
|
| 3315 |
Then start the XIM client. Move the focus to an input area of the software.
Hit Shift-Space or Control-Space and type something. Did some strange
characters appear? This document is too brief to explain
how to input valid CJK characters and sentences with these XIM
servers. Please consult the documentation of the XIM servers.
|
| 3320 |
</P>
|
| 3321 |
|
| 3322 |
<sect id="input-emacs"><heading>Emacsen</heading>
|
| 3323 |
|
| 3324 |
<P>
|
| 3325 |
<strong>GNU Emacs</strong> and <strong>XEmacs</strong> use
an entirely different model for international input.
|
| 3327 |
</P>
|
| 3328 |
|
| 3329 |
<P>
|
| 3330 |
They supply their own input methods for various languages.
Instead of relying on the console or XIM, they use these input
methods. An input method can be selected with the
<tt>M-x set-input-method</tt> command. The selected input
method can be switched on and off with the <tt>M-x toggle-input-method</tt>
command.
|
| 3336 |
</P>
|
| 3337 |
|
| 3338 |
<P>
|
| 3339 |
GNU Emacs supplies input methods for
|
| 3340 |
British, Catalan,
|
| 3341 |
Chinese (array30, 4corner, b5-quick, cns-quick, cns-tsangchi,
|
| 3342 |
ctlau, ctlaub, ecdict, etzy, punct, punct-b5, py, py-b5,
|
| 3343 |
py-punct, py-punct-b5, qj, qj-b5, sw, tonepy, ziranma, zozy),
|
| 3344 |
Czech, Danish, Devanagari, Esperanto,
|
| 3345 |
Ethiopic, Finnish, French, German, Greek, Hebrew, Icelandic,
|
| 3346 |
IPA, Irish, Italian, Japanese (egg-wnn, skk),
|
| 3347 |
Korean (hangul, hangul3, hanja, hanja3),
|
| 3348 |
Lao, Norwegian, Portuguese, Romanian, Scandinavian,
|
| 3349 |
Slovak, Spanish, Swedish, Thai, Tibetan, Turkish, Vietnamese,
|
| 3350 |
Latin-{1,2,3,4,5},
|
| 3351 |
Cyrillic (beylorussian, jcuken, jis-russian, macedonian,
|
| 3352 |
serbian, translit, translit-bulgarian, ukrainian, yawerty),
|
| 3353 |
and so on.
|
| 3354 |
</P>
|
| 3355 |
|
| 3356 |
|
| 3357 |
|
| 3358 |
|
| 3359 |
|
| 3360 |
|
| 3361 |
|
| 3362 |
|
| 3363 |
|
| 3364 |
|
| 3365 |
|
| 3366 |
<chapt id="internal"><heading>Internal Processing and File I/O</heading>
|
| 3367 |
|
| 3368 |
<P>
|
| 3369 |
There are many text-processing programs, such as
|
| 3370 |
<prgn>grep</prgn>,
|
| 3371 |
<prgn>groff</prgn>,
|
| 3372 |
<prgn>head</prgn>,
|
| 3373 |
<prgn>sort</prgn>,
|
| 3374 |
<prgn>wc</prgn>,
|
| 3375 |
<prgn>uniq</prgn>,
|
| 3376 |
<prgn>nl</prgn>,
|
| 3377 |
<prgn>expand</prgn>,
|
| 3378 |
and so on.
|
| 3379 |
There are also many script languages which are often used for
|
| 3380 |
text processing, such as
|
| 3381 |
<prgn>sed</prgn>,
|
| 3382 |
<prgn>awk</prgn>,
|
| 3383 |
<prgn>perl</prgn>,
|
| 3384 |
<prgn>python</prgn>,
|
| 3385 |
<prgn>ruby</prgn>,
|
| 3386 |
and so on.
|
| 3387 |
All of this software needs to be internationalized.
|
| 3388 |
</P>
|
| 3389 |
|
| 3390 |
<P>
|
| 3391 |
From a user's point of view, software can use any internal encoding
as long as I/O is done correctly, because a user cannot see
which internal encoding the software uses.
|
| 3394 |
</P>
|
| 3395 |
|
| 3396 |
<P>
|
| 3397 |
There are two candidates for the internal encoding. One is
<strong>wide character</strong> and the other is <strong>UCS-4</strong>.
You can also use a Mule-type encoding, in which a unit consists of a pair
of numbers: one identifying the CCS and one identifying the character.
|
| 3401 |
</P>
|
| 3402 |
|
| 3403 |
<P>
|
| 3404 |
I recommend using <em>wide characters</em>, for the reasons I already
explained in <ref id="locale">: wide characters are
encoding-independent and can support various encodings,
including UTF-8, they give users a common, unified way
to choose encodings, and so on.
|
| 3409 |
</P>
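<P>
In practice this means converting to wide characters when text enters the
program and converting back when it leaves. The following sketch shows one
possible pair of helpers built on the standard <tt>mbstowcs()</tt> and
<tt>wcstombs()</tt> functions. The helper names are made up for this
illustration, <tt>setlocale()</tt> is assumed to have been called, and the
output buffer size assumes a stateless encoding (no shift sequences).
</P>

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Convert a multibyte string (in the encoding of the current locale)
 * to a newly allocated wide string.  Returns NULL on an invalid
 * sequence or allocation failure; the caller frees the result. */
wchar_t *wide_from_multibyte(const char *mbs)
{
    size_t max = strlen(mbs) + 1;   /* never more wide chars than bytes */
    wchar_t *ws = malloc(max * sizeof *ws);

    if (ws == NULL)
        return NULL;
    if (mbstowcs(ws, mbs, max) == (size_t)-1) {
        free(ws);
        return NULL;
    }
    return ws;
}

/* Convert a wide string back to a newly allocated multibyte string for
 * output.  Returns NULL if a character cannot be represented in the
 * current locale.  Assumes a stateless encoding. */
char *multibyte_from_wide(const wchar_t *ws)
{
    size_t max = wcslen(ws) * MB_CUR_MAX + 1;
    char *mbs = malloc(max);

    if (mbs == NULL)
        return NULL;
    if (wcstombs(mbs, ws, max) == (size_t)-1) {
        free(mbs);
        return NULL;
    }
    return mbs;
}
```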
|
| 3410 |
|
| 3411 |
<P>
|
| 3412 |
A few examples of handling <tt>wchar_t</tt> are shown below.
|
| 3413 |
</P>
|
| 3414 |
|
| 3415 |
|
| 3416 |
<sect id="internal-stream"><heading>Stream I/O of Characters</heading>
|
| 3417 |
|
| 3418 |
<P>
|
| 3419 |
The following program is a small example of stream I/O of wide characters.
|
| 3420 |
<example>
|
| 3421 |
#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main(void)
{
  wint_t c;

  /* Select the locale named by the environment (LANG, LC_ALL, ...). */
  setlocale(LC_ALL, "");
  while (1) {
    c = getwchar();
    if (c == WEOF) break;
    putwchar(c);
  }
  return 0;
}
|
| 3435 |
</example>
|
| 3436 |
I think you can easily imagine a corresponding version using <tt>char</tt>.
Since this program does not do any character manipulation, you could use
ordinary <tt>char</tt> for it.
|
| 3439 |
</P>
|
| 3440 |
|
| 3441 |
<P>
|
| 3442 |
There are a few points to note. First, never forget to call
<tt>setlocale()</tt>. Second, <tt>putwchar()</tt>,
<tt>getwchar()</tt>, and <tt>WEOF</tt> are the replacements for
<tt>putchar()</tt>, <tt>getchar()</tt>, and <tt>EOF</tt>, respectively.
Third, use <tt>wint_t</tt> instead of <tt>int</tt> for the return value of <tt>getwchar()</tt>.
|
| 3447 |
</P>
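<P>
The same replacements exist for line-oriented I/O: <tt>fgetws()</tt> and
<tt>fputws()</tt> take the place of <tt>fgets()</tt> and <tt>fputs()</tt>.
The following is a small sketch; the function name
<tt>copy_wide_lines()</tt> is made up for this illustration. One point to
keep in mind is that a stream becomes either byte-oriented or wide-oriented
with the first I/O operation on it, so byte and wide functions should not
be mixed on the same stream.
</P>

```c
#include <stdio.h>
#include <wchar.h>

/* Copy in to out line by line using wide-character I/O and return the
 * number of wide characters copied.  Lines longer than the buffer are
 * simply handled in several pieces. */
long copy_wide_lines(FILE *in, FILE *out)
{
    wchar_t line[256];
    long total = 0;

    while (fgetws(line, (int)(sizeof line / sizeof line[0]), in) != NULL) {
        if (fputws(line, out) == EOF)
            break;                      /* write or encoding error */
        total += (long)wcslen(line);
    }
    return total;
}
```

<P>
A filter would call <tt>setlocale(LC_ALL, "")</tt> first and then
<tt>copy_wide_lines(stdin, stdout)</tt>.
</P>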
|
| 3448 |
|
| 3449 |
|
| 3450 |
<sect id="internal-wc"><heading>Character Classification</heading>
|
| 3451 |
|
| 3452 |
<P>
|
| 3453 |
Here is an example of character classification using <tt>wchar_t</tt>.
First, this is a non-internationalized version.
|
| 3455 |
<example>
|
| 3456 |
/*
 * wc.c
 *
 * Word Counter
 *
 */

#include <stdio.h>
#include <string.h>
#include <ctype.h>

int main(int argc, char **argv)
{
  int n, d=0, c=0, w=0, l=0;

  while ((n=getchar()) != EOF) {
    c++;
    if (isdigit(n)) d++;
    /* Each separator character is counted as one word boundary. */
    if (n != '\0' && strchr(" \t\n", n)) w++;
    if (n == '\n') l++;
  }

  printf("%d characters, %d digits, %d words, and %d lines\n",
         c, d, w, l);
  return 0;
}
|
| 3480 |
</example>
|
| 3481 |
Here is the internationalized version.
|
| 3482 |
<example>
|
| 3483 |
/*
|
| 3484 |
* wc-i.c
|
| 3485 |
*
|
| 3486 |
* Word Counter (internationalized version)
|
| 3487 |
*
|
| 3488 |
*/
|
| 3489 |
|
| 3490 |
#include <stdio.h>
|
| 3491 |
#include <string.h>
|
| 3492 |
#include <locale.h>
|
| 3493 |
|
| 3494 |
int main(int argc, char **argv)
|
| 3495 |
{
|
| 3496 |
int p=0, d=0, c=0, w=0, l=0;
|
| 3497 |
wint_t n;
|
| 3498 |
|
| 3499 |
setlocale(LC_ALL, "");
|
| 3500 |
|
| 3501 |
while ((n=getwchar()) != WEOF) {
|
| 3502 |
c++;
|
| 3503 |
if (iswdigit(n)) d++;
|
| 3504 |
if (wcschr(L" \t\n", n)) w++;
|
| 3505 |
if (n == L'\n') l++;
|
| 3506 |
}
|
| 3507 |
|
| 3508 |
printf("%d characters, %d digits, %d words, and %d lines\n",
|
| 3509 |
c, d, w, l);
|
| 3510 |
}
|
| 3511 |
</example>
|
| 3512 |
</P>
|
| 3513 |
|
| 3514 |
<P>
|
| 3515 |
This example shows that <tt>iswdigit()</tt> is used instead of
<tt>isdigit()</tt>. In addition, the <tt>L"string"</tt> and <tt>L'char'</tt>
notations are used for wide character strings and wide characters.
|
| 3518 |
</P>
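<P>
The programs above count separator characters rather than words.
The following is a minimal sketch of how the wide character
classification functions could drive real word counting.
<tt>count_words()</tt> is a hypothetical helper written for this
illustration; it is not part of <tt>wc-i.c</tt>, and it assumes that
<tt>iswspace()</tt> describes word separators well enough for the
current locale.
</P>

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Count the words in a wide string.  A word starts wherever a
 * non-whitespace character follows whitespace or the start of the
 * string.  iswspace() consults the current locale (LC_CTYPE). */
size_t count_words(const wchar_t *s)
{
	size_t words = 0;
	int in_word = 0;

	for (; *s != L'\0'; s++) {
		if (iswspace((wint_t)*s)) {
			in_word = 0;          /* separator: leave the word */
		} else if (!in_word) {
			in_word = 1;          /* first character of a new word */
			words++;
		}
	}
	return words;
}
```

<P>
A caller would convert its input to a wide string first (for example
with <tt>mbstowcs()</tt>) and call <tt>setlocale()</tt> before that.
Languages written without spaces between words need a different
approach.
</P>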
|
| 3519 |
|
| 3520 |
<sect id="internal-length"><heading>Length of String</heading>
|
| 3521 |
|
| 3522 |
<P>
|
| 3523 |
The following is a sample program to obtain the length of the
input string. Note that the number of bytes, the number of characters,
and the display width are not distinguished.
|
| 3526 |
<example>
|
| 3527 |
/* length.c
|
| 3528 |
*
|
| 3529 |
* a sample program to obtain the length of the input string
|
| 3530 |
* NOT INTERNATIONALIZED
|
| 3531 |
*/
|
| 3532 |
|
| 3533 |
#include <stdio.h>
|
| 3534 |
#include <string.h>
|
| 3535 |
|
| 3536 |
int main(int argc, char **argv)
|
| 3537 |
{
|
| 3538 |
int len;
|
| 3539 |
|
| 3540 |
if (argc < 2) {
|
| 3541 |
printf("Usage: %s [string]\n", argv[0]);
|
| 3542 |
return 0;
|
| 3543 |
}
|
| 3544 |
|
| 3545 |
printf("Your string is: \"%s\".\n", argv[1]);
|
| 3546 |
|
| 3547 |
len = strlen(argv[1]);
|
| 3548 |
printf("Length of your string is: %d bytes.\n", len);
|
| 3549 |
printf("Length of your string is: %d characters.\n", len);
|
| 3550 |
printf("Width of your string is: %d columns.\n", len);
|
| 3551 |
return 0;
|
| 3552 |
}
|
| 3553 |
</example>
|
| 3554 |
</P>
|
| 3555 |
|
| 3556 |
<P>
|
| 3557 |
The following is an internationalized version of the program
using wide characters.
|
| 3559 |
<example>
|
| 3560 |
/* length-i.c
|
| 3561 |
*
|
| 3562 |
* a sample program to obtain the length of the input string
|
| 3563 |
* INTERNATIONALIZED
|
| 3564 |
*/
|
| 3565 |
|
| 3566 |
#define _XOPEN_SOURCE 600  /* glibc declares wcswidth() only with this */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>
|
| 3569 |
|
| 3570 |
int main(int argc, char **argv)
|
| 3571 |
{
|
| 3572 |
int len, n;
|
| 3573 |
wchar_t *wp;
|
| 3574 |
|
| 3575 |
/* Every program that uses locales needs this call. */
|
| 3576 |
setlocale(LC_ALL, "");
|
| 3577 |
|
| 3578 |
if (argc < 2) {
|
| 3579 |
printf("Usage: %s [string]\n", argv[0]);
|
| 3580 |
return 0;
|
| 3581 |
}
|
| 3582 |
|
| 3583 |
printf("Your string is: \"%s\".\n", argv[1]);
|
| 3584 |
|
| 3585 |
/* The concept of 'byte' is universal. */
|
| 3586 |
len = strlen(argv[1]);
|
| 3587 |
printf("Length of your string is: %d bytes.\n", len);
|
| 3588 |
|
| 3589 |
/* The easiest way to count characters is to convert the string */
/* into a wide string: one wide character per character. */
/* The count never exceeds the number of bytes, so strlen() + 1 */
/* wide characters (including the terminator) always suffice. */
n = strlen(argv[1]) + 1;
wp = (wchar_t *)malloc(n * sizeof(wchar_t));
if (wp == NULL) return 1;
len = mbstowcs(wp, argv[1], n);
if (len < 0) return 1; /* invalid multibyte sequence */
|
| 3596 |
printf("Length of your string is: %d characters.\n", len);
|
| 3597 |
|
| 3598 |
printf("Width of your string is: %d columns.\n", wcswidth(wp, len));
|
| 3599 |
|
| 3600 |
return 0;
|
| 3601 |
}
|
| 3602 |
</example>
|
| 3603 |
</P>
|
| 3604 |
|
| 3605 |
<P>
|
| 3606 |
This program counts multibyte characters correctly,
provided that the user has set the <tt>LANG</tt> variable properly.
|
| 3608 |
</P>
|
| 3609 |
|
| 3610 |
<P>
|
| 3611 |
For example, on a UTF-8 xterm:
|
| 3612 |
<example>
|
| 3613 |
$ export LANG=ko_KR.UTF-8
|
| 3614 |
$ ./length-i (a Hangul character)
|
| 3615 |
Your string is: "(the character)"
|
| 3616 |
Length of your string is: 3 bytes.
|
| 3617 |
Length of your string is: 1 characters.
|
| 3618 |
Width of your string is: 2 columns.
|
| 3619 |
</example>
|
| 3620 |
</P>
|
| 3621 |
|
| 3622 |
|
| 3623 |
|
| 3624 |
<sect id="internal-extract"><heading>Extraction of Characters</heading>
|
| 3625 |
|
| 3626 |
<P>
|
| 3627 |
The following program extracts all characters contained in the given
|
| 3628 |
string.
|
| 3629 |
<example>
|
| 3630 |
/* extract.c
|
| 3631 |
*
|
| 3632 |
* a sample program to extract each character contained in the string
|
| 3633 |
* not internationalized
|
| 3634 |
*/
|
| 3635 |
|
| 3636 |
#include <stdio.h>
|
| 3637 |
#include <string.h>
|
| 3638 |
|
| 3639 |
int main(int argc, char **argv)
|
| 3640 |
{
|
| 3641 |
char *p;
|
| 3642 |
int c;
|
| 3643 |
|
| 3644 |
if (argc < 2) {
|
| 3645 |
printf("Usage: %s [string]\n", argv[0]);
|
| 3646 |
return 0;
|
| 3647 |
}
|
| 3648 |
|
| 3649 |
printf("Your string is: \"%s\".\n", argv[1]);
|
| 3650 |
|
| 3651 |
c = 0;
|
| 3652 |
for (p=argv[1] ; *p ; p++) {
|
| 3653 |
printf("Character #%d is \"%c\".\n", ++c, *p);
|
| 3654 |
}
|
| 3655 |
return 0;
|
| 3656 |
}
|
| 3657 |
</example>
|
| 3658 |
Using wide characters, the program can be rewritten as follows.
|
| 3659 |
<example>
|
| 3660 |
/* extract-i.c
|
| 3661 |
*
|
| 3662 |
* a sample program to extract each character contained in the string
|
| 3663 |
* INTERNATIONALIZED
|
| 3664 |
*/
|
| 3665 |
|
| 3666 |
#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <stdlib.h>
#include <limits.h>

int main(int argc, char **argv)
{
wchar_t *wp;
/* MB_CUR_MAX depends on the locale, which is not set yet here, */
/* so the buffer is sized with the compile-time maximum MB_LEN_MAX. */
char p[MB_LEN_MAX+1];
int c, n, len;
|
| 3676 |
|
| 3677 |
/* Don't forget. */
|
| 3678 |
setlocale(LC_ALL, "");
|
| 3679 |
|
| 3680 |
if (argc < 2) {
|
| 3681 |
printf("Usage: %s [string]\n", argv[0]);
|
| 3682 |
return 0;
|
| 3683 |
}
|
| 3684 |
|
| 3685 |
printf("Your string is: \"%s\".\n", argv[1]);
|
| 3686 |
|
| 3687 |
/* An easy way to obtain each character is to convert the string */
/* into a wide string and then convert each wide character back */
/* into a multibyte character. */
n = strlen(argv[1]) + 1;
wp = (wchar_t *)malloc(n * sizeof(wchar_t));
if (wp == NULL) return 1;
len = mbstowcs(wp, argv[1], n);
for (c=0; c<len; c++) {
/* convert the wide character back to a multibyte character */
int x;
x = wctomb(p, wp[c]);
if (x < 0) continue; /* not representable in this locale */
/* One multibyte character may be two or more bytes. */
/* Thus "%s" is used instead of "%c". */
p[x]=0;
printf("Character #%d is \"%s\" (%d byte(s)) \n", c+1, p, x);
|
| 3701 |
}
|
| 3702 |
|
| 3703 |
return 0;
|
| 3704 |
}
|
| 3705 |
</example>
|
| 3706 |
</P>
|
| 3707 |
|
| 3708 |
<P>
|
| 3709 |
Note that this program doesn't work well if the multibyte encoding
is stateful, because each character is printed on its own, apart from
the shift state that gives its bytes their meaning.
|
| 3711 |
</P>
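<P>
For comparison, the following sketch walks a multibyte string with the
restartable function <tt>mbrtowc()</tt>, which keeps the conversion
state in an explicit <tt>mbstate_t</tt> object and needs no
intermediate wide string. <tt>count_mb_chars()</tt> is a hypothetical
helper for this illustration. Whether a particular stateful encoding
is handled depends on the C library and locale in use, so this is not
presented as a fix for the problem above.
</P>

```c
#include <string.h>
#include <wchar.h>

/* Count the characters of a multibyte string in the current locale.
 * Returns the number of characters, or -1 if the string contains an
 * invalid or incomplete multibyte sequence. */
long count_mb_chars(const char *s)
{
	mbstate_t state;
	size_t left = strlen(s);
	long count = 0;

	memset(&state, 0, sizeof state);   /* initial conversion state */
	while (left > 0) {
		wchar_t wc;
		size_t used = mbrtowc(&wc, s, left, &state);

		if (used == (size_t)-1 || used == (size_t)-2)
			return -1;             /* invalid or incomplete sequence */
		if (used == 0)
			break;                 /* null character reached */
		s += used;                 /* skip the bytes just consumed */
		left -= used;
		count++;
	}
	return count;
}
```

<P>
As with the other examples, <tt>setlocale(LC_ALL, "")</tt> has to be
called before this function gives locale-dependent results.
</P>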
<chapt id="internet"><heading>The Internet</heading>
|
| 3722 |
|
| 3723 |
<P>
|
| 3724 |
The Internet is a world-wide network of computers.
Thus the text data exchanged via the Internet must be
internationalized.
|
| 3727 |
</P>
|
| 3728 |
|
| 3729 |
<P>
|
| 3730 |
The concept of internationalization did not exist
at the dawn of the Internet, since it was developed in the US.
Later protocols were designed to be
upward-compatible with the existing ones.
|
| 3734 |
</P>
|
| 3735 |
|
| 3736 |
<P>
|
| 3737 |
One of the key technologies for the internationalization
of Internet data exchange is <strong>MIME</strong>.
|
| 3739 |
</P>
|
| 3740 |
|
| 3741 |
<sect id="mailnews"><heading>Mail/News</heading>
|
| 3742 |
|
| 3743 |
<P>
|
| 3744 |
Internet mail uses SMTP
|
| 3745 |
(<url id="http://www.faqs.org/rfcs/rfc821.html" name="RFC 821">)
|
| 3746 |
and ESMTP
|
| 3747 |
(<url id="http://www.faqs.org/rfcs/rfc1869.html" name="RFC 1869">)
|
| 3748 |
protocols. SMTP is a 7bit protocol, while ESMTP can carry 8bit data
when the server supports the 8BITMIME extension.
|
| 3749 |
</P>
|
| 3750 |
|
| 3751 |
<P>
|
| 3752 |
Original SMTP can only send ASCII characters. Thus
|
| 3753 |
non-ASCII characters (ISO 8859-*, Asian characters, and so on)
|
| 3754 |
have to be converted into ASCII characters.
|
| 3755 |
</P>
|
| 3756 |
|
| 3757 |
<P>
|
| 3758 |
MIME
|
| 3759 |
(<url id="http://www.faqs.org/rfcs/rfc2045.html" name="RFC 2045">,
|
| 3760 |
<url id="http://www.faqs.org/rfcs/rfc2046.html" name="2046">,
|
| 3761 |
<url id="http://www.faqs.org/rfcs/rfc2047.html" name="2047">,
|
| 3762 |
<url id="http://www.faqs.org/rfcs/rfc2048.html" name="2048">, and
|
| 3763 |
<url id="http://www.faqs.org/rfcs/rfc2049.html" name="2049">)
|
| 3764 |
deals with this problem.
|
| 3765 |
</P>
|
| 3766 |
|
| 3767 |
<P>
|
| 3768 |
First,
<url id="http://www.faqs.org/rfcs/rfc2045.html" name="RFC 2045">
defines new headers, three of which matter here.
|
| 3771 |
<list>
|
| 3772 |
<item>MIME-Version:
|
| 3773 |
<item>Content-Type:
|
| 3774 |
<item>Content-Transfer-Encoding:
|
| 3775 |
</list>
|
| 3776 |
The only <tt>MIME-Version</tt> currently defined is 1.0, so every MIME mail has
a header like this:
|
| 3778 |
<example>
|
| 3779 |
MIME-Version: 1.0
|
| 3780 |
</example>
|
| 3781 |
<tt>Content-Type</tt> describes the type of content.
|
| 3782 |
For example, an ordinary mail with Japanese text has a header like this:
|
| 3783 |
<example>
|
| 3784 |
Content-Type: text/plain; charset="iso-2022-jp"
|
| 3785 |
</example>
|
| 3786 |
Available types are described in
|
| 3787 |
<url id="http://www.faqs.org/rfcs/rfc2046.html" name="RFC 2046">.
|
| 3788 |
<tt>Content-Transfer-Encoding</tt> describes how the content is
encoded for transfer. Available values are <tt>BINARY</tt>,
<tt>7bit</tt>, <tt>8bit</tt>, <tt>BASE64</tt>, and <tt>QUOTED-PRINTABLE</tt>.
Since SMTP cannot handle 8bit data, <tt>8bit</tt> and <tt>BINARY</tt>
cannot be used with it; ESMTP can use them.
Base64 and quoted-printable are ways to convert 8bit data into 7bit data,
and 8bit data has to be converted with one of them before it is sent by SMTP.
|
| 3795 |
</P>
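<P>
To illustrate how quoted-printable turns 8bit data into 7bit data,
here is a simplified sketch. <tt>qp_encode()</tt> is a hypothetical
helper, not a complete implementation of the RFC 2045 rules: it
performs only the basic <tt>=</tt><var>XX</var> substitution and
leaves out soft line breaks and the special treatment of line endings
and trailing white space.
</P>

```c
#include <stddef.h>

/* Encode len bytes from in[] as quoted-printable text in out[].
 * out[] must have room for 3 * len + 1 bytes.  Returns the length
 * of the encoded text.
 * Simplified: no soft line breaks, and no special handling of line
 * endings or trailing white space. */
size_t qp_encode(const unsigned char *in, size_t len, char *out)
{
	static const char hex[] = "0123456789ABCDEF";
	size_t i, o = 0;

	for (i = 0; i < len; i++) {
		unsigned char b = in[i];

		if ((b >= 33 && b <= 126 && b != '=') || b == ' ') {
			out[o++] = (char)b;       /* printable ASCII stays as it is */
		} else {
			out[o++] = '=';           /* any other byte becomes =XX */
			out[o++] = hex[b >> 4];
			out[o++] = hex[b & 0x0F];
		}
	}
	out[o] = '\0';
	return o;
}
```

<P>
For instance, the ISO-8859-1 bytes of "caf" followed by 0xE9 come out
as <tt>caf=E9</tt>, which contains only 7bit characters.
</P>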
|
| 3796 |
|
| 3797 |
<P>
|
| 3798 |
<url id="http://www.faqs.org/rfcs/rfc2046.html" name="RFC 2046">
|
| 3799 |
describes the media types and subtypes for the
<tt>Content-Type</tt> header. The discrete types are
<tt>text</tt>, <tt>image</tt>, <tt>audio</tt>, <tt>video</tt>,
and <tt>application</tt>; <tt>multipart</tt> and <tt>message</tt>
are composite types. Here we are interested in <tt>text</tt>
because we are discussing i18n.
Subtypes of <tt>text</tt> include <tt>plain</tt>, <tt>enriched</tt>,
<tt>html</tt>, and so on. A <tt>charset</tt> parameter can also be
added to specify the encoding.
|
| 3807 |
<tt>US-ASCII</tt>, <tt>ISO-8859-1</tt>,
|
| 3808 |
<tt>ISO-8859-2</tt>, ..., <tt>ISO-8859-10</tt> are defined by
|
| 3809 |
<url id="http://www.faqs.org/rfcs/rfc2046.html" name="RFC 2046">
|
| 3810 |
for <tt>charset</tt>. Other values are defined by
separate RFCs:
|
| 3812 |
<list>
|
| 3813 |
<item><url id="http://www.faqs.org/rfcs/rfc1468.html" name="RFC 1468">
|
| 3814 |
<tt>ISO-2022-JP</tt>
|
| 3815 |
<item><url id="http://www.faqs.org/rfcs/rfc1554.html" name="RFC 1554">
|
| 3816 |
<tt>ISO-2022-JP-2</tt>
|
| 3817 |
<item><url id="http://www.faqs.org/rfcs/rfc1557.html" name="RFC 1557">
|
| 3818 |
<tt>ISO-2022-KR</tt>
|
| 3819 |
<item><url id="http://www.faqs.org/rfcs/rfc1922.html" name="RFC 1922">
|
| 3820 |
<tt>ISO-2022-CN</tt>
|
| 3821 |
<item><url id="http://www.faqs.org/rfcs/rfc1922.html" name="RFC 1922">
|
| 3822 |
<tt>ISO-2022-CN-EXT</tt>
|
| 3823 |
<item><url id="http://www.faqs.org/rfcs/rfc1842.html" name="RFC 1842">
|
| 3824 |
<tt>HZ-GB-2312</tt>
|
| 3825 |
<item><url id="http://www.faqs.org/rfcs/rfc1641.html" name="RFC 1641">
|
| 3826 |
<tt>UNICODE-1-1</tt>
|
| 3827 |
<item><url id="http://www.faqs.org/rfcs/rfc1642.html" name="RFC 1642">
|
| 3828 |
<tt>UNICODE-1-1-UTF-7</tt>
|
| 3829 |
<item><url id="http://www.faqs.org/rfcs/rfc1815.html" name="RFC 1815">
|
| 3830 |
<tt>ISO-10646-1</tt>
|
| 3831 |
</list>
|
| 3832 |
</P>
|
| 3833 |
|
| 3834 |
<P>
|
| 3835 |
<url id="http://www.faqs.org/rfcs/rfc2045.html" name="RFC 2045"> and
|
| 3837 |
<url id="http://www.faqs.org/rfcs/rfc2046.html" name="RFC 2046">
|
| 3838 |
determine the way to write non-ASCII characters
|
| 3839 |
in the main text of mail. On the other hand,
|
| 3840 |
<url id="http://www.faqs.org/rfcs/rfc2047.html" name="RFC 2047"> describes
'encoded words', which are the way to write non-ASCII characters in the header.
An encoded word looks like this:
<tt>=?</tt><var>encoding</var><tt>?</tt><var>conversion algorithm</var><tt>?</tt><var>data</var><tt>?=</tt>,
where <var>encoding</var> is selected from the list of <tt>charset</tt> values
of the <tt>Content-Type</tt> header, <var>conversion algorithm</var> is <tt>Q</tt>
or <tt>q</tt> for a variant of quoted-printable or <tt>B</tt> or <tt>b</tt> for
base64, and <var>data</var> is the encoded data. An encoded word,
including its delimiters, must not be longer than 75 characters;
longer text must be divided into multiple encoded words.
|
| 3850 |
For example,
|
| 3851 |
<example>
|
| 3852 |
Subject: =?ISO-2022-JP?B?GyRCNEE7eiROJTUlViU4JSclLyVIGyhC?=
|
| 3853 |
</example>
|
| 3854 |
reads 'a subject written in Kanji' in Japanese (ISO-2022-JP,
encoded by base64). Of course, a human cannot read the encoded form directly.
|
| 3856 |
</P>
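<P>
The <var>data</var> part of a <tt>B</tt>-encoded word can be turned
back into bytes with an ordinary base64 decoder. The following is a
minimal sketch; <tt>b64_value()</tt> and <tt>b64_decode()</tt> are
hypothetical helper names, and splitting the encoded word at its
<tt>?</tt> delimiters as well as converting the resulting bytes from
the named charset are left out.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if the
 * character does not belong to the base64 alphabet. */
static int b64_value(char ch)
{
	static const char alphabet[] =
	    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
	const char *p = (ch != '\0') ? strchr(alphabet, ch) : NULL;

	return p ? (int)(p - alphabet) : -1;
}

/* Decode len characters of base64 text into out[], which must have
 * room for len / 4 * 3 bytes.  Returns the number of decoded bytes,
 * or -1 if the input is malformed. */
long b64_decode(const char *in, size_t len, unsigned char *out)
{
	size_t i;
	long o = 0;

	if (len % 4 != 0)
		return -1;
	for (i = 0; i < len; i += 4) {
		int v[4], k, pad = 0;

		for (k = 0; k < 4; k++) {
			if (in[i + k] == '=' && i + 4 == len && k >= 2) {
				v[k] = 0;     /* padding, only at the very end */
				pad++;
			} else if (pad > 0 || (v[k] = b64_value(in[i + k])) < 0) {
				return -1;    /* data after padding, or bad character */
			}
		}
		/* four 6-bit values make up to three bytes */
		out[o++] = (unsigned char)((v[0] << 2) | (v[1] >> 4));
		if (pad < 2)
			out[o++] = (unsigned char)(((v[1] & 0x0F) << 4) | (v[2] >> 2));
		if (pad < 1)
			out[o++] = (unsigned char)(((v[2] & 0x03) << 6) | v[3]);
	}
	return o;
}
```

<P>
Applied to the data part of the subject above, this yields 24 bytes
that begin with ESC <tt>$ B</tt> and end with ESC <tt>( B</tt>, the
ISO-2022-JP escape sequences that switch to JIS X 0208 and back to
ASCII.
</P>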
|
| 3857 |
|
| 3858 |
|
| 3859 |
<sect id="www"><heading>WWW</heading>
|
| 3860 |
|
| 3861 |
<P>
|
| 3862 |
WWW is a system in which documents (mainly HTML, but also files in other formats)
are transferred using the HTTP protocol.
|
| 3864 |
</P>
|
| 3865 |
|
| 3866 |
<P>
|
| 3867 |
The HTTP protocol is defined by
<url id="http://www.faqs.org/rfcs/rfc2068.html" name="RFC 2068">.
Like mail, HTTP uses headers, and the <tt>Content-Type</tt> header
describes the type of the contents.
Though a <tt>charset</tt> parameter can be given in the
header, it is rarely used.
|
| 3873 |
</P>
|
| 3874 |
|
| 3875 |
<P>
|
| 3876 |
<url id="http://www.faqs.org/rfcs/rfc1866.html" name="RFC 1866">
|
| 3877 |
states that the default encoding for HTML is
ISO-8859-1. However, many web pages are written in,
for example, Japanese or Korean, using (of course) encodings
other than ISO-8859-1.
Sometimes the HTML document contains:
|
| 3882 |
<example>
|
| 3883 |
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp">
|
| 3884 |
</example>
|
| 3885 |
which declares that the page is written in ISO-2022-JP.
However, there are many pages without any declaration of encoding.
|
| 3887 |
</P>
|
| 3888 |
|
| 3889 |
<P>
|
| 3890 |
Web browsers have to deal with such circumstances.
Ideally, a web browser should be able to handle every
encoding registered for MIME.
However, many web browsers can only deal with ASCII
or ISO-8859-1. Such web browsers are of no use at all
to people whose languages need other encodings.
|
| 3896 |
</P>
|
| 3897 |
|
| 3898 |
<P>
|
| 3899 |
A URL should be written in ASCII characters,
though non-ASCII bytes can be expressed
using a <tt>%</tt><var>nn</var> sequence where <var>nn</var>
is the hexadecimal value of the byte. The restriction exists because there is
no way to specify the encoding. Western European people
would treat the bytes as ISO-8859-1, while Japanese people
would treat them as EUC-JP or SHIFT-JIS.
|
| 3906 |
</P>
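<P>
The <tt>%</tt><var>nn</var> notation can be sketched as follows.
<tt>url_escape()</tt> is a hypothetical helper, and the set of
characters it leaves unescaped (ASCII letters, digits, and
<tt>-._~</tt>) is an assumption made for this illustration; real
software has to follow the URL specification it targets.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Write a percent-encoded copy of s into out[], which must have room
 * for 3 * strlen(s) + 1 bytes.  ASCII letters, digits and "-._~" are
 * copied as they are; every other byte becomes %XX. */
void url_escape(const char *s, char *out)
{
	static const char hex[] = "0123456789ABCDEF";
	static const char safe[] =
	    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";

	for (; *s != '\0'; s++) {
		unsigned char b = (unsigned char)*s;

		if (strchr(safe, b) != NULL) {
			*out++ = (char)b;         /* safe character: copy */
		} else {
			*out++ = '%';             /* other byte: two hex digits */
			*out++ = hex[b >> 4];
			*out++ = hex[b & 0x0F];
		}
	}
	*out = '\0';
}
```

<P>
The function escapes bytes, not characters, so the result depends on
the encoding of the input. The same Kanji gives
<tt>%E6%97%A5</tt> when the input is UTF-8 and <tt>%C6%FC</tt> when it
is EUC-JP, and nothing in the URL tells the receiver which one was meant.
</P>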
<chapt id="library"><heading>Libraries and Components</heading>
|
| 3925 |
|
| 3926 |
|
| 3927 |
<P>
|
| 3928 |
We sometimes use libraries and components which are not
very popular. We may have to pay special attention to the
internationalization of these libraries and components.
|
| 3931 |
</P>
|
| 3932 |
|
| 3933 |
<P>
|
| 3934 |
On the other hand, we can use libraries and components
to improve internationalization. This chapter
introduces such libraries and components.
|
| 3937 |
</P>
|
| 3938 |
|
| 3939 |
<sect id="gettext"><heading>Gettext and Translation</heading>
|
| 3940 |
|
| 3941 |
<P>
|
| 3942 |
GNU gettext is a tool to internationalize the messages a program outputs,
selected according to the <tt>LC_MESSAGES</tt> locale category.
A <prgn>gettext</prgn>ized program ships with messages written in
various languages (depending on the available translators), and
a user chooses among them using environment variables.
GNU gettext is part of the Debian system.
|
| 3948 |
</P>
|
| 3949 |
|
| 3950 |
<P>
|
| 3951 |
Install the <package>gettext</package> package and read its info pages for details.
|
| 3952 |
</P>
|
| 3953 |
|
| 3954 |
<P>
|
| 3955 |
Don't use non-ASCII characters in '<tt>msgid</tt>'.
Be careful, because it is tempting to use ISO-8859-1 characters.
For example, '©' (the copyright sign; you may not be able to
read it NOW in THIS document) is a non-ASCII character
(0xa9 in ISO-8859-1).
Otherwise, translators may find it difficult to edit catalog files
because the encodings of <tt>msgid</tt> and
<tt>msgstr</tt> conflict.
|
| 3963 |
</P>
|
| 3964 |
|
| 3965 |
<P>
|
| 3966 |
Make sure the message can be displayed in the assumed environment.
In other words, you have to read the 'Output to Display' chapter
of this document and internationalize the output mechanism
of your software prior to <prgn>gettext</prgn>ization.
|
| 3970 |
<em>ENGLISH MESSAGES ARE PREFERRED EVEN FOR NON-ENGLISH-SPEAKING PEOPLE,
|
| 3971 |
THAN MEANINGLESS BROKEN MESSAGES.</em>
|
| 3972 |
</P>
|
| 3973 |
|
| 3974 |
<P>
|
| 3975 |
The 2nd (3rd, ...) byte of a multibyte character, or
any byte of a non-ASCII character in a stateful encoding,
can be 0x5c (the same as backslash in ASCII) or 0x22
(the same as double quote in ASCII).
Such bytes have to be properly escaped because
the present version of GNU gettext ignores the
'charset' subitem of the '<tt>Content-Type</tt>' item for '<tt>msgstr</tt>'.
|
| 3982 |
</P>
|
| 3983 |
|
| 3984 |
<P>
|
| 3985 |
A <prgn>gettext</prgn>ed message must not be used in multiple contexts,
because a word may have different meanings in different contexts.
For example, in English a verb at the beginning of a sentence
expresses an order or a command. However, different languages
have different grammar. If a verb is <prgn>gettext</prgn>ed and used
both in an ordinary sentence and in an imperative sentence,
it cannot be translated properly.
|
| 3992 |
</P>
|
| 3993 |
|
| 3994 |
|
| 3995 |
<P>
|
| 3996 |
If a sentence is <prgn>gettext</prgn>ed, never divide the sentence.
If a sentence is divided in the original source code,
join the pieces so that a single string contains the full
sentence.
This is because the order of words in a sentence
differs among languages.
|
| 4002 |
For example, a routine
|
| 4003 |
<example>
|
| 4004 |
printf("There ");
|
| 4005 |
switch(num_of_files) {
|
| 4006 |
case 0:
|
| 4007 |
printf("are no files ");
|
| 4008 |
break;
|
| 4009 |
case 1:
|
| 4010 |
printf("is 1 file ");
|
| 4011 |
break;
|
| 4012 |
default:
|
| 4013 |
printf("are %d files ", num_of_files);
|
| 4014 |
break;
|
| 4015 |
}
|
| 4016 |
printf("in %s directory.\n", dir_name);
|
| 4017 |
</example>
|
| 4018 |
has to be rewritten like this:
|
| 4019 |
<example>
|
| 4020 |
switch(num_of_files) {
case 0:
	printf("There are no files in %s directory.\n", dir_name);
	break;
case 1:
	printf("There is 1 file in %s directory.\n", dir_name);
	break;
default:
	printf("There are %d files in %s directory.\n", num_of_files, dir_name);
	break;
}
|
| 4031 |
</example>
|
| 4032 |
before it is <prgn>gettext</prgn>ized.
|
| 4033 |
</P>
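<P>
Choosing between singular and plural forms, as in the routine above,
also differs among languages. GNU gettext offers <tt>ngettext()</tt>
for this purpose. The following is a minimal sketch, assuming a
gettext implementation that provides <tt>ngettext()</tt>;
<tt>format_file_count()</tt> is a hypothetical helper, and the
"no files" case of the original routine would still need a message of
its own.
</P>

```c
#include <libintl.h>
#include <stdio.h>

/* Build one complete, translatable sentence about n files in dir.
 * ngettext() returns the translation matching n, or one of the two
 * English strings when no translation is available.
 * Returns what snprintf() returns. */
int format_file_count(char *buf, size_t size, unsigned long n,
                      const char *dir)
{
	const char *fmt = ngettext("There is %lu file in %s directory.",
	                           "There are %lu files in %s directory.",
	                           n);

	return snprintf(buf, size, fmt, n, dir);
}
```

<P>
Both strings are full sentences, so a translator can reorder the words
freely and can supply as many plural forms as the language needs.
</P>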
|
| 4034 |
|
| 4035 |
<P>
|
| 4036 |
Software with <prgn>gettext</prgn>ed messages should not depend on
the length of the messages. The messages may get longer
in other languages.
|
| 4039 |
</P>
|
| 4040 |
|
| 4041 |
<P>
|
| 4042 |
When two or more '%' directives for formatted output functions
such as <tt>printf()</tt> appear in a message,
the translation may need them in a different order.
In such a case, the translator can specify
the order.
See the section 'Special Comments preceding Keywords'
in the info page of <prgn>gettext</prgn> for details.
|
| 4049 |
</P>
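<P>
As an illustration of how the order can be specified, the sketch
below relies on the numbered argument notation
(<tt>%1$d</tt>, <tt>%2$s</tt>) that POSIX-style <tt>printf()</tt>
implementations accept. <tt>format_message()</tt> is a hypothetical
helper, and the reordered format stands in for what a translator
might write in <tt>msgstr</tt>.
</P>

```c
#include <stdio.h>

/* Format a message about count files in dir with the given format.
 * The arguments are always passed in the order (count, dir); a format
 * may pick them in another order with the "%n$" notation. */
int format_message(char *buf, size_t size, const char *fmt,
                   int count, const char *dir)
{
	return snprintf(buf, size, fmt, count, dir);
}
```

<P>
With the format <tt>"%d files in %s"</tt> and the arguments 3 and
<tt>"/tmp"</tt> the result is <tt>3 files in /tmp</tt>; with
<tt>"%2$s: %1$d files"</tt> the same call gives
<tt>/tmp: 3 files</tt>, without any change to the source code.
</P>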
|
| 4050 |
|
| 4051 |
<P>
|
| 4052 |
There are now projects to translate the messages of various software.
For example,
|
| 4054 |
<url id="http://www.iro.umontreal.ca/~pinard/po/HTML/"
|
| 4055 |
name="Translation Project">.
|
| 4056 |
</P>
|
| 4057 |
|
| 4058 |
|
| 4059 |
|
| 4060 |
<sect1 id="gettextize"><heading>Gettext-ization of A Software</heading>
|
| 4061 |
|
| 4062 |
<P>
|
| 4063 |
First, the software needs the following lines.
|
| 4064 |
<example>
|
| 4065 |
int main(int argc, char **argv)
|
| 4066 |
{
|
| 4067 |
...
|
| 4068 |
setlocale (LC_ALL, ""); /* This is not for gettext but
|
| 4069 |
all i18n software should have
|
| 4070 |
this line. */
|
| 4071 |
bindtextdomain (PACKAGE, LOCALEDIR);
|
| 4072 |
textdomain (PACKAGE);
|
| 4073 |
...
|
| 4074 |
}
|
| 4075 |
</example>
|
| 4076 |
where <var>PACKAGE</var> is the name of the catalog file and
|
| 4077 |
<var>LOCALEDIR</var> is <tt>"/usr/share/locale"</tt> for Debian.
|
| 4078 |
<var>PACKAGE</var> and <var>LOCALEDIR</var> should be defined
|
| 4079 |
in a header file or <tt>Makefile</tt>.
|
| 4080 |
</P>
|
| 4081 |
|
| 4082 |
<P>
|
| 4083 |
It is convenient to prepare the following header file.
|
| 4084 |
<example>
|
| 4085 |
#include <libintl.h>
|
| 4086 |
#define _(String) gettext((String))
|
| 4087 |
</example>
|
| 4088 |
and messages in source files should be written as
|
| 4089 |
<tt>_("message")</tt>, instead of <tt>"message"</tt>.
|
| 4090 |
</P>
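<P>
Putting these pieces together, a minimal sketch could look like the
following. The catalog name <tt>hello-i18n-demo</tt>, the fallback
definitions of <var>PACKAGE</var> and <var>LOCALEDIR</var>, and the
helper functions <tt>init_i18n()</tt> and <tt>greeting()</tt> are
assumptions made for this illustration; a real package would take
the two macros from its build system.
</P>

```c
#include <libintl.h>
#include <locale.h>

#define _(String) gettext((String))

/* Placeholder values for this sketch only. */
#ifndef PACKAGE
#define PACKAGE "hello-i18n-demo"
#endif
#ifndef LOCALEDIR
#define LOCALEDIR "/usr/share/locale"
#endif

/* Call once at program start, before any message is looked up. */
void init_i18n(void)
{
	setlocale(LC_ALL, "");               /* follow the user's locale */
	bindtextdomain(PACKAGE, LOCALEDIR);  /* where the catalogs live */
	textdomain(PACKAGE);                 /* default catalog for gettext() */
}

/* Returns the translated greeting, or the English text itself when
 * no catalog for the current locale is installed. */
const char *greeting(void)
{
	return _("Hello, world!");
}
```

<P>
Because <tt>gettext()</tt> falls back to the <tt>msgid</tt> when no
translation is found, the program keeps working in English before any
catalog exists. On systems where gettext is not part of the C library,
linking with <tt>-lintl</tt> may be necessary.
</P>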
|
| 4091 |
|
| 4092 |
<P>
|
| 4093 |
Next, catalog files have to be prepared.
|
| 4094 |
</P>
|
| 4095 |
|
| 4096 |
<P>
|
| 4097 |
First, a template for the catalog file is generated
with <prgn>xgettext</prgn>.
By default, the template is written to <tt>messages.po</tt>.
|
| 4101 |
<footnote>
|
| 4102 |
I HAVE TO WRITE EXPLANATION.
|
| 4103 |
</footnote>
|
| 4104 |
</P>
|
| 4105 |
|
| 4106 |
|
| 4107 |
|
| 4108 |
<sect1 id="gettext-translate"><heading>Translation</heading>
|
| 4109 |
|
| 4110 |
<P>
|
| 4111 |
Though <prgn>gettext</prgn>ization of a program is a one-time
task, translation is continuing work because you have to
translate new (or modified) messages when (or before) a new
version of the software is released.
|
| 4115 |
</P>
|
| 4116 |
|
| 4117 |
|
| 4118 |
<sect id="readline"><heading>Readline Library</heading>
|
| 4119 |
|
| 4120 |
<P>***** Not written yet *****</P>
|
| 4121 |
|
| 4122 |
<P>
|
| 4123 |
The Readline library needs to be internationalized.
|
| 4124 |
</P>
|
| 4125 |
|
| 4126 |
<sect id="ncurses"><heading>Ncurses Library</heading>
|
| 4127 |
|
| 4128 |
<P>***** Not written yet *****</P>
|
| 4129 |
|
| 4130 |
<P>
|
| 4131 |
<strong>Ncurses</strong> is a free implementation of the curses library.
Though this library is now maintained by the Free Software Foundation,
it is not covered by the GNU General Public License.
|
| 4134 |
</P>
|
| 4135 |
|
| 4136 |
<P>
|
| 4137 |
The Ncurses library needs to be internationalized.
|
| 4138 |
</P>
<chapt id="otherlanguage"><heading>Software Written in Languages Other than C/C++</heading>
|
| 4148 |
|
| 4149 |
<P>
|
| 4150 |
Though C and C++ were, are, and will be the main languages for
software development on UNIX-like platforms, other languages,
especially scripting languages, are often used.
|
| 4153 |
</P>
|
| 4154 |
|
| 4155 |
<P>
|
| 4156 |
Generally, languages other than C/C++ have less support for I18N
than C/C++. However, nowadays languages other than C/C++ are
coming to support locales and Unicode.
|
| 4159 |
</P>
|
| 4160 |
|
| 4161 |
<sect id="fortran"><heading>Fortran</heading>
|
| 4162 |
|
| 4163 |
<P>***** Not written yet *****</P>
|
| 4164 |
|
| 4165 |
<sect id="pascal"><heading>Pascal</heading>
|
| 4166 |
|
| 4167 |
<P>***** Not written yet *****</P>
|
| 4168 |
|
| 4169 |
<sect id="perl"><heading>Perl</heading>
|
| 4170 |
|
| 4171 |
<P>
|
| 4172 |
Perl is one of the most important languages. Indeed,
the Debian system defines Perl as essential.
|
| 4174 |
</P>
|
| 4175 |
|
| 4176 |
<P>
|
| 4177 |
Perl 5.6 can handle UTF-8 characters. Declaration of
|
| 4178 |
<tt>use utf8;</tt> will enable it. For example,
|
| 4179 |
<tt>length()</tt> will return the number of characters,
|
| 4180 |
not the number of bytes.
|
| 4181 |
</P>
|
| 4182 |
|
| 4183 |
<P>
|
| 4184 |
However, it does not work well for me... why?
|
| 4185 |
</P>
|
| 4186 |
|
| 4187 |
<P>***** Not written yet *****</P>
|
| 4188 |
|
| 4189 |
<sect id="python"><heading>Python</heading>
|
| 4190 |
|
| 4191 |
<P>***** Not written yet *****</P>
|
| 4192 |
|
| 4193 |
<sect id="ruby"><heading>Ruby</heading>
|
| 4194 |
|
| 4195 |
<P>***** Not written yet *****</P>
|
| 4196 |
|
| 4197 |
<sect id="tcltk"><heading>Tcl/Tk</heading>
|
| 4198 |
|
| 4199 |
<P>***** Not written yet *****</P>
|
| 4200 |
|
| 4201 |
<P>
|
| 4202 |
Tcl/Tk is already internationalized. It is locale-sensitive
and automatically uses a proper font for each character.
Though it uses UTF-8 as its internal encoding, users of Tcl/Tk
don't have to be aware of it, because Tcl/Tk converts
encodings automatically.
|
| 4207 |
</P>
|
| 4208 |
|
| 4209 |
<sect id="java"><heading>Java</heading>
|
| 4210 |
|
| 4211 |
<p>
|
| 4212 |
Full internationalization follows naturally from
Java's "Write Once, Run Anywhere" principle.
To achieve this, Java uses Unicode as the internal code
for <tt>char</tt> and <tt>String</tt>. It is important
that Unicode is the <em>internal</em> code. Java obeys
the current locale, and the encoding is automatically
converted for I/O. Thus, <em>users</em> of applications written
in Java don't need to be aware of Unicode.
|
| 4220 |
</p>
|
| 4221 |
|
| 4222 |
<p>
|
| 4223 |
Then how about <em>developers</em>? They also don't need
to be aware of the internal encoding. Character processing
such as counting the number of characters in a string works well.
Moreover, you don't have to worry about display and input.
|
| 4227 |
</p>
|
| 4228 |
|
| 4229 |
<p>
|
| 4230 |
However, you may want to handle specified encodings for,
|
| 4231 |
for example, MIME encoding/decoding. For such purposes,
|
| 4232 |
I/O can be done by specifying external encoding.
|
| 4233 |
Check the <tt>InputStreamReader</tt> and <tt>OutputStreamWriter</tt>
|
| 4234 |
classes. You can also convert between the internal encoding
|
| 4235 |
and specified encodings by
|
| 4236 |
<tt>String.getBytes(</tt><em>encoding</em><tt>)</tt> and
|
| 4237 |
<tt>String(byte []</tt> <em>bytes</em><tt>, </tt><em>encoding</em><tt>)</tt>.
|
| 4238 |
</p>
|
| 4239 |
|
| 4240 |
|
| 4241 |
|
| 4242 |
|
| 4243 |
<sect id="shellscript"><heading>Shell Script</heading>
|
| 4244 |
|
| 4245 |
<P>***** Not written yet *****</P>
|
| 4246 |
|
| 4247 |
<sect id="lisp"><heading>Lisp</heading>
|
| 4248 |
|
| 4249 |
<P>***** Not written yet *****</P>
<chapt id="examples"><heading>Examples of I18N</heading>
|
| 4262 |
|
| 4263 |
<P>
|
| 4264 |
Programmers who have internationalized software, written
an L10N patch, and so on are encouraged to contribute
to this chapter.
|
| 4267 |
</P>
|
| 4268 |
|
| 4269 |
|
| 4270 |
|
| 4271 |
&twm;
|
| 4272 |
&minicom;
|
| 4273 |
&user-ja;
|
| 4274 |
&fontset;
<chapt id="reference"><heading>References</heading>
|
| 4284 |
|
| 4285 |
<P>
|
| 4286 |
General
|
| 4287 |
<list>
|
| 4288 |
<item>
|
| 4289 |
<url id="http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT"
|
| 4290 |
name="Unicode support in the Solaris Operating Environment">
|
| 4291 |
shows what is needed for software developers to support UTF-8.
|
| 4292 |
<item>
|
| 4293 |
<url id="http://www.unix-systems.org/version2/whatsnew/login_mse.html"
|
| 4294 |
name="The Open Group's summary of ISO C Amendment 1">
|
| 4295 |
is a detailed explanation on locale and wide character technologies.
|
| 4296 |
<item>
|
| 4297 |
<url id="http://www.cl.cam.ac.uk/~mgk25/unicode.html"
|
| 4298 |
name="Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux">
|
| 4299 |
is a detailed explanation on UTF-8 and Unicode.
|
| 4300 |
<item>
|
| 4301 |
<url id="ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html"
|
| 4302 |
name="Bruno Haible's Unicode HOWTO">
|
| 4303 |
<item>
|
| 4304 |
Tomohiro KUBOTA (original author of this Introduction to I18N),
|
| 4305 |
<url id="http://www8.plala.or.jp/tkubota1/mojibake/"
|
| 4306 |
name="What is MOJIBAKE"> shows what occurs when character handling
|
| 4307 |
is improper. Mojibake is a Japanese word which almost all computer
|
| 4308 |
users (not only Linux/BSD/Unix but also Windows/Macintosh) know.
|
| 4309 |
<item>
|
| 4310 |
Ken Lunde, "CJKV Information Processing", ISBN 1-56592-224-7,
|
| 4311 |
O'Reilly, 1999
|
| 4312 |
<item>
|
| 4313 |
Mikiko NISHIKIMI, Naoto TAKAHASHI, Satoru TOMURA, Ken'ichi HANDA,
|
| 4314 |
Seiji KUWARI, Shin'ichi MUKAIGAWA, and Tomoko YOSHIDA,
|
| 4315 |
"<url id="http://web.kyoto-inet.or.jp/people/tomoko-y/biwa/multi/"
|
| 4316 |
name="MARUCHIRINGARU KANKYOU NO JITSUGEN - X Window/Wnn/Mule/WWW BURAUZA
|
| 4317 |
DENO TAKOKUGO KANKYO">" or "Realization of Multilingual Environment
|
| 4318 |
- Multilingual Environment in X Window/Wnn/Mule/WWW Browser"
|
| 4319 |
(in Japanese), ISBN4-88735-020-1, TOPPAN, 1996
|
| 4320 |
<item>
|
| 4321 |
Yoshihiro KIYOKANE and Youichi SUEHIRO,
|
| 4322 |
"<url id="http://www.geocities.co.jp/SiliconValley-PaloAlto/8090/"
|
| 4323 |
name="KOKUSAIKA PUROGURAMINGU - I18N HANDOBUKKU">" or "Internationalization
|
| 4324 |
Programming - I18N Handbook" (in Japanese), ISBN4-320-02904-6,
|
| 4325 |
KYORITSU, 1998
|
| 4326 |
<item>
|
| 4327 |
Syuuji SADO and Tomoko YOSHIDA,
|
| 4328 |
"<url id="http://web.kyoto-inet.or.jp/people/tomoko-y/japanese/index.html"
|
| 4329 |
name="Linux/FreeBSD NIHONGO KANKYOU NO KOUCHIKU TO KATSUYOU">" or
|
| 4330 |
"Construction and Utilization of Linux/FreeBSD Japanese Environment"
|
| 4331 |
(in Japanese), ISBN4-7973-0480-4, SOFTBANK, 1997
|
| 4332 |
<item>
|
| 4333 |
Kouichi YASUOKA and Motoko YASUOKA
|
| 4334 |
"<url id="http://www.dendai.ac.jp/press/book_da/ISBN4-501-53060-X.html"
|
| 4335 |
name="MOJI KOODO NO SEKAI">" or "The World of Character Codes" (in Japanese),
|
| 4336 |
ISBN4-501-53060-X, Tokyo Denki University Press Center, 1999
|
| 4337 |
</list>
|
| 4338 |
</P>
|
| 4339 |
|
| 4340 |
<P>
Characters (general)
<list>
<item>
<url id="http://www.kudpc.kyoto-u.ac.jp/~yasuoka/CJK.html"
name="Character Tables">
Graphic images for various character sets in the world.
<item>
<url id="ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf"
name="Ken Lunde's CJK info">
Information on CJK (Chinese, Japanese, and Korean) character
set standards, written by the author of "CJKV Information Processing"
published by O'Reilly.
<item>
<url id="http://www.isi.edu/in-notes/iana/assignments/character-sets"
name="IANA character set registry">
Note that both coded character sets (for example, KS_C_5601-1987,
MIBenum: 36) and encodings (for example, ISO-2022-KR, MIBenum: 37)
are registered. How confusing!
<item>
<url id="http://www.itscj.ipsj.or.jp/ISO-IR/"
name="International Register of Coded Character Sets">
A complete list of registered CCS, with ISO 2022 escape sequences.
PDF files for these CCS are also available.
</list>
Characters (ISO 8859)
<list>
<item>
<url id="http://czyborra.com/charsets/iso8859.html"
name="ISO 8859 Alphabet Soup">
</list>
Characters (ISO 2022)
<list>
<item>
<url id="http://www.ecma.ch/ecma1/stand/ECMA-035.HTM">
</list>
Characters (ISO 10646 and Unicode)
<list>
<item><url id="http://www.unicode.org/" name="Unicode Consortium">
<item>
<url id="http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html"
name="Problems and Solutions for Unicode and User/Vendor Defined
Characters">
</list>
</P>

<P>
Software
<list>
<item>
<url id="http://www.wg.omron.co.jp/~shin/Arena-CJK-doc/"
name="Arena-i18n">
Multilingual web browser.
<item>
<url id="http://www.mozilla.org/" name="Mozilla">
is also a multilingual web browser.
<item>
<url id="http://www.m17n.org/mule/" name="Mule">
Multilingual editor whose functionality is included in GNU Emacs 20
and XEmacs 20.
To my knowledge, Mule is the most advanced m17n software.
<item>
<url id="http://www3.justnet.ne.jp/~nmasu/linux/jfbterm/indexn.html"
name="JFBTERM"> (in Japanese) is a multilingual terminal for the
Linux framebuffer console. Supported encodings are ISO 2022, EUC-JP,
CN-GB, and EUC-KR. Supported CCS are ISO 8859-{1,2,3,4,5,6,7,8,9,10},
JIS X 0201, JIS X 0208, GB 2312, and KS X 1001.
<item>
<url id="http://www.gnu.org/directory/UNICON.html"
name="UNICON Project"> intends to implement display and input of
CJK (Chinese/Japanese/Korean) characters on the Linux framebuffer.
<item>
<url id="http://programmer.lib.sjtu.edu.cn/cce/cce.html"
name="CCE - Chinese Console Environment"> enables CN-GB Chinese
to be displayed on the Linux and FreeBSD consoles. It also supplies
input methods for Chinese.
<item>
<url id="http://dickey.his.com/xterm/"
name="Xterm"> is a part of the XFree86 distribution. It can display
UTF-8 text, including double-width characters and combining
characters.
<item>
<url id="http://www.rxvt.org/"
name="Rxvt"> can display multibyte encodings such as EUC-JP,
Shift-JIS, CN-GB, and Big5.
<item>
<url id="http://www.gnu.org/software/libiconv/"
name="libiconv"> provides an
<tt>iconv()</tt> implementation for systems which don't have one.
It supports various encodings, including ASCII, ISO 8859-*, KOI8-*,
EUC-*, ISO 2022-*, Big5, Shift-JIS, TIS 620, UTF-*, UCS-*,
CP*, and Mac*. This library also has <tt>locale_charset()</tt>,
a replacement for <tt>nl_langinfo(CODESET)</tt>.
<item>
<url id="http://clisp.cons.org/~haible/packages-libutf8.html"
name="libutf8 - a Unicode/UTF-8 locale plugin"> provides
UTF-8 locale support for systems which don't have UTF-8 locales.
<item>
<url id="http://www.pango.org/" name="Pango"> is a project to
develop a portable high-quality text rendering engine.
</list>
</P>

<P>
Projects and Organizations
<list>
<item>
<url id="http://www.li18nux.org/"
name="Linux Internationalization Initiative">, or Li18nux,
focuses on the i18n of a core set of APIs and components of Linux
distributions. The results will be proposed to LSB.
<item>
<url id="http://www.li18nux.org/li18nux2k/"
name="LI18NUX 2000 Globalization Specification"> is the first
fruit of Li18nux.
<item>
<url id="http://citrus.bsdclub.org/"
name="Citrus Project"> is a project to implement
locale/iconv for BSD-family operating systems so that they conform to
ISO C / SUSv2.
<item>
<url id="http://www.iro.umontreal.ca/~pinard/po/HTML/"
name="Translation Project">
<item>
<url id="http://www.mojikyo.org/" name="Mojikyo">
<item>
<url id="http://www.tron.org/index-e.html" name="TRON project">
</list>
</P>
</book>
</debiandoc>