| 72 |
|
|
| 73 |
<P> |
<P> |
| 74 |
Minimum requirements, for example, |
Minimum requirements, for example, |
| 75 |
that characters should be displayed proper font (at least users |
that characters should be displayed with fonts with |
| 76 |
of the software must be able to guess what is written), |
proper charset (at least users of the software must be |
| 77 |
|
able to guess what is written), |
| 78 |
that characters must be inputed from keyboard, and |
that characters must be input from the keyboard, and |
| 79 |
that softwares must not destroy characters, |
that software must not destroy characters, |
| 80 |
are stressed in the document and I am trying to |
are stressed in the document and I am trying to |
| 154 |
<list> |
<list> |
| 155 |
<item>Display characters for users' native languages. |
<item>Display characters for users' native languages. |
| 156 |
<item>Input characters for users' native languages. |
<item>Input characters for users' native languages. |
| 157 |
<item>Handle files written in popular character codes |
<item>Handle files written in popular encodings |
| 158 |
<footnote> |
<footnote> |
| 159 |
There are a few terms related to character code, |
There are a few terms related to character code, |
| 160 |
such as character set, character code, charset, |
such as character set, character code, charset, |
| 219 |
Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual |
Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual |
| 220 |
Emacs) text editor which can input/output Japanese text file, |
Emacs) text editor which can input/output Japanese text file, |
| 221 |
Hanterm X terminal emulator which can display and input |
Hanterm X terminal emulator which can display and input |
| 222 |
Korean characters via a few Korean character codes. |
Korean characters via a few Korean encodings. |
| 223 |
Since a programmer has his/her own mother tongue, |
Since a programmer has his/her own mother tongue, |
| 224 |
there are numerous L10N patches and L10N softwares |
there are numerous L10N patches and pieces of L10N software |
| 225 |
written to satisfy his/her own need. |
written to satisfy his/her own need. |
| 250 |
<footnote> |
<footnote> |
| 251 |
I recommend not to implement Unicode and UTF-8 directly. |
I recommend against implementing Unicode and UTF-8 directly. |
| 252 |
Instead, use LOCALE technology and your software will |
Instead, use LOCALE technology and your software will |
| 253 |
support not only UTF-8 but also many character codes |
support not only UTF-8 but also many encodings |
| 254 |
in the world. |
in the world. |
| 255 |
</footnote> |
</footnote> |
| 256 |
</p></item> |
</p></item> |
| 298 |
The advantage of this approach is that detailed and strict |
The advantage of this approach is that detailed and strict |
| 299 |
implementation is possible beyond the field where |
implementation is possible beyond the field where |
| 300 |
standardized methods are available, such as auto-detection |
standardized methods are available, such as auto-detection |
| 301 |
of character codes of text files to be read. Language-specific |
of encodings of text files to be read. Language-specific |
| 302 |
problems can be perfectly solved (of course it depends on |
problems can be perfectly solved (of course it depends on |
| 303 |
the skill of the programmer). The disadvantages are |
the skill of the programmer). The disadvantages are |
| 304 |
(1) that the number of supported languages is restricted |
(1) that the number of supported languages is restricted |
| 350 |
</P> |
</P> |
| 351 |
|
|
| 352 |
<P> |
<P> |
| 353 |
M17N-model can be achieved using international character codes such |
M17N-model can be achieved using international encodings such |
| 354 |
as ISO-2022 and Unicode. Though you can hard-code these character codes |
as ISO 2022 and Unicode. Though you can hard-code these encodings |
| 355 |
for your software (i.e. approach B), I recommend to use standardized |
for your software (i.e. approach B), I recommend using standardized |
| 356 |
LOCALE technology. However, using international character codes |
LOCALE technology. However, using international encodings |
| 357 |
is not sufficient to achieve M17N-model. You will have to prepare |
is not sufficient to achieve M17N-model. You will have to prepare |
| 358 |
a mechanism to switch <strong>input methods</strong>. You will also want |
a mechanism to switch <strong>input methods</strong>. You will also want |
| 359 |
to prepare a character code-guessing mechanism for input files. |
to prepare an encoding-guessing mechanism for input files, |
| 360 |
|
such as <prgn>jless</prgn> and <prgn>emacs</prgn> have. |
| 361 |
Mule is the only software which achieved M17N (though it does not |
Mule is the only software which achieved M17N (though it does not |
| 362 |
use LOCALE technology). |
use LOCALE technology). |
| 363 |
</P> |
</P> |
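The encoding-guessing mechanism mentioned above can be sketched by trial decoding. This is only a minimal illustration, not how <prgn>jless</prgn> or <prgn>emacs</prgn> actually work: the helper name <tt>guess_encoding</tt>, the candidate list, and its order are assumptions made for this example, and the sketch relies on the codecs that ship with Python.

```python
# Hypothetical helper: guess the encoding of a byte string by trial decoding.
# The candidate list and its order are assumptions chosen for Japanese text.
CANDIDATES = ("ascii", "utf-8", "euc_jp", "shift_jis")


def guess_encoding(data):
    names = CANDIDATES
    # ESC (0x1b) introduces ISO 2022 designation sequences, so try the
    # stateful 7bit encoding first when an ESC byte is present.
    if b"\x1b" in data:
        names = ("iso2022_jp",) + CANDIDATES
    for name in names:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None  # none of the candidates could decode the data


sample = "\u3042"  # HIRAGANA LETTER A
for enc in ("iso2022_jp", "euc_jp", "shift_jis", "utf-8"):
    print(enc, "->", guess_encoding(sample.encode(enc)))
```

Trial decoding is a heuristic: short inputs can be valid in several encodings at once, so the order of the candidates decides the answer in ambiguous cases.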
| 372 |
I have already wrote that this document will put stress on |
I have already written that this document will put stress on |
| 373 |
correct handling of characters and character codes for users' native |
correct handling of characters and character codes for users' native |
| 374 |
languages. To achieve this purpose, I will discuss on popular |
languages. To achieve this purpose, I will discuss popular |
| 375 |
character codes in the world at the first chapter of |
character sets and encodings in the world at the first chapter of |
| 376 |
<ref id="coding">. You will not need the detailed |
<ref id="coding">. You will not need the detailed |
| 377 |
knowledges for these character codes if you will use LOCALE technology. |
knowledge of these character codes if you use LOCALE technology. |
| 378 |
The aim of this chapter is only for showing the concepts used in these |
The aim of this chapter is only for showing the concepts used in these |
| 426 |
</P> |
</P> |
| 427 |
|
|
| 428 |
<P> |
<P> |
| 429 |
Here major character codes are introduced. |
Here major character sets and encodings are introduced. |
| 430 |
Note that you don't have to know the detail of these |
Note that you don't have to know the detail of these |
| 431 |
character codes if you use LOCALE and <tt>wchar_t</tt> technology. |
character codes if you use LOCALE and <tt>wchar_t</tt> technology. |
| 432 |
However, these knowledge will help you to understand why number |
However, this knowledge will help you to understand why number |
| 440 |
If you are planning to develop a text-processing software |
If you are planning to develop a text-processing software |
| 441 |
beyond the fields which the LOCALE technology covers, you will |
beyond the fields which the LOCALE technology covers, you will |
| 442 |
have to understand the following descriptions very well. |
have to understand the following descriptions very well. |
| 443 |
These fields include automatic detection of character code |
These fields include automatic detection of encodings |
| 444 |
used for the input file (Most of Japanese-capable text viewers |
used for the input file (Most of Japanese-capable text viewers |
| 445 |
such as <prgn>jless</prgn> and <prgn>lv</prgn> have this mechanism) |
such as <prgn>jless</prgn> and <prgn>lv</prgn> have this mechanism) |
| 446 |
and so on. |
and so on. |
| 482 |
such as citizen registration system, serious DTP such as |
such as citizen registration system, serious DTP such as |
| 483 |
newspaper system, and so on. |
newspaper system, and so on. |
| 484 |
</p></item> |
</p></item> |
| 485 |
<tag><strong>Character Code</strong> |
<tag><strong>Encoding</strong> |
| 486 |
<item><p> |
<item><p> |
| 487 |
Character code is a rule where characters and texts are |
Encoding is a rule where characters and texts are |
| 488 |
expressed in combinations of bits or bytes in order to |
expressed in combinations of bits or bytes in order to |
| 489 |
treat characters in computers. Words of <em>character |
treat characters in computers. Words of <em>character |
| 490 |
coding system</em>, <em>charset</em>, and so on are used |
coding system</em>, <em>character code</em>, <em>charset</em>, |
| 491 |
to express the same meaning. Basically, character code |
and so on are used to express the same meaning. |
| 492 |
takes care of <em>characters</em>, not <em>glyphs</em>. |
Basically, <em>encoding</em> takes care of |
| 493 |
There are many official and de-facto standards of character |
<em>characters</em>, not <em>glyphs</em>. |
| 494 |
codes such as ASCII, ISO 8859-{1,2,...,15}, |
There are many official and de-facto standards of encodings |
| 495 |
|
such as ASCII, ISO 8859-{1,2,...,15}, |
| 496 |
ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2}, |
ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2}, |
| 497 |
EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620, |
EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620, |
| 498 |
VISCII, VSCII, so-called 'CodePages', UTF-7, UTF-8, UTF-16LE, |
VISCII, VSCII, so-called 'CodePages', UTF-7, UTF-8, UTF-16LE, |
| 499 |
UTF-16BE, KOI8-R, and so on so on. |
UTF-16BE, KOI8-R, and so on. |
| 500 |
To construct a character code, we have to consider the |
To construct an encoding, we have to consider the |
| 501 |
following concepts. (Character code = one or more |
following concepts. (Encoding = one or more |
| 502 |
CCS + one CES). |
CCS + one CES). |
| 503 |
</p></item> |
</p></item> |
| 504 |
<tag><strong>Character Set</strong> |
<tag><strong>Character Set</strong> |
| 505 |
<item><p> |
<item><p> |
| 506 |
Character set is a set of characters. This determines |
Character set is a set of characters. This determines |
| 507 |
a range of characters where the character code can handle. |
the range of characters that the encoding can handle. |
| 508 |
|
In contrast to <em>coded character set</em>, this is often |
| 509 |
|
called a <em>non-coded character set</em>. |
| 510 |
</p></item> |
</p></item> |
| 511 |
<tag><strong>Coded Character Set (CCS)</strong> |
<tag><strong>Coded Character Set (CCS)</strong> |
| 512 |
<item><p> |
<item><p> |
| 527 |
<tag><strong>Character Encoding Scheme (CES)</strong> |
<tag><strong>Character Encoding Scheme (CES)</strong> |
| 528 |
<item><p> |
<item><p> |
| 529 |
Character Encoding Scheme is also a word defined in RFC 2050 |
Character Encoding Scheme is also a word defined in RFC 2130 |
| 530 |
to call methods to construct a character code using one or |
to call methods to construct an encoding using one or |
| 531 |
more CCS. This is important when two or more CCS are used |
more CCS. This is important when two or more CCS are used |
| 532 |
to construct a character code. |
to construct an encoding. |
| 533 |
ISO 2022 is a method to construct a character code from |
ISO 2022 is a method to construct an encoding from |
| 534 |
one or more ISO 2022-compliant CCS. ISO 2022 is very |
one or more ISO 2022-compliant CCS. ISO 2022 is a very |
| 535 |
complex system and subsets of ISO 2022 are usually used |
complex system and subsets of ISO 2022 are usually used |
| 536 |
such as EUC-JP (ASCII and JISX 0208), ISO-2022-KR (ASCII |
such as EUC-JP (ASCII and JISX 0208), ISO-2022-KR (ASCII |
| 537 |
and KSX 1001), and so on. CES is not important for |
and KSX 1001), and so on. CES is not important for |
| 538 |
character codes with only one CCS. |
encodings with only one 8bit CCS. |
| 539 |
UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can be |
UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can be |
| 540 |
regarded as CES whose CCS is Unicode or ISO 10646. |
regarded as CES whose CCS is Unicode or ISO 10646. |
| 541 |
</p></item> |
</p></item> |
| 542 |
</taglist> |
</taglist> |
| 543 |
</P> |
</P> |
| 544 |
|
|
| 545 |
|
<P> |
| 546 |
|
A few other terms related to character codes are also in common use. |
| 547 |
|
</P> |
| 548 |
|
|
| 549 |
|
<P> |
| 550 |
|
<strong>Character code</strong> is a widely-used word to mean |
| 551 |
|
<em>encoding</em>. This is a primitive and crude word to call |
| 552 |
|
the way a computer handles characters by assigning numbers. |
| 553 |
|
For example, <em>character code</em> can call <em>encoding</em> |
| 554 |
|
and can call <em>coded character set</em>. Thus this word can |
| 555 |
|
be used only in the case when both of them can be regarded as being in |
| 556 |
|
the same category. This word should be avoided in serious |
| 557 |
|
discussions. This document will not use this word hereafter. |
| 558 |
|
</P> |
| 559 |
|
|
| 560 |
|
<P> |
| 561 |
|
<strong>Codeset</strong> is a word to call <em>encoding</em> |
| 562 |
|
or <em>character encoding scheme</em>. |
| 563 |
|
<footnote> |
| 564 |
|
This document used a word <em>codeset</em> before November 2000 |
| 565 |
|
to call <em>encoding</em>. I changed terminology since |
| 566 |
|
<em>encoding</em> seems more popular. |
| 567 |
|
</footnote> |
| 568 |
|
</P> |
| 569 |
|
|
| 570 |
|
<P> |
| 571 |
|
<strong>charset</strong> is also a well-used word. |
| 572 |
|
This word is used very widely, for example, in MIME (like |
| 573 |
|
<tt>Content-Type: text/plain; charset=iso-8859-1</tt>), |
| 574 |
|
in XLFD (X Logical Font Description) font name |
| 575 |
|
(CharSetRegistry and CharSetEncoding fields), and so on. |
| 576 |
|
Note that <em>charset</em> in MIME is <em>encoding</em>, |
| 577 |
|
while <em>charset</em> in XLFD font name is <em>coded character |
| 578 |
|
set</em>. This is very confusing. |
| 579 |
|
</P> |
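That the MIME <em>charset</em> parameter names an <em>encoding</em> can be seen from how a program uses it: the value is handed directly to a decoder. The following is a small sketch using Python's standard <tt>email</tt> package; the header and the body bytes are made up for illustration.

```python
from email import message_from_string

# A made-up header, used only for illustration.
msg = message_from_string("Content-Type: text/plain; charset=iso-8859-1\n\n")

# The MIME "charset" parameter, lowercased by the email package.
charset = msg.get_content_charset()

# The value is usable directly as the name of an encoding:
# 0xe9 is LATIN SMALL LETTER E WITH ACUTE in ISO 8859-1.
body = b"caf\xe9"
text = body.decode(charset)
print(charset, repr(text))
```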
| 580 |
|
|
| 581 |
|
<P> |
| 582 |
|
Ken Lunde's "CJKV Information Processing" uses a word |
| 583 |
|
<strong>encoding method</strong>. He says that |
| 584 |
|
ISO-2022, EUC, Big5, and Shift-JIS are examples of |
| 585 |
|
<em>encoding methods</em>. It seems that his <em>encoding |
| 586 |
|
method</em> is <em>CES</em> in this document. However, |
| 587 |
|
we should notice that Big5 and Shift-JIS are encodings |
| 588 |
|
while ISO-2022 and EUC are not. |
| 589 |
|
<footnote> |
| 590 |
|
During I18N programming, we will frequently meet with EUC-JP |
| 591 |
|
or EUC-KR, while we will rarely meet with EUC. I think it is |
| 592 |
|
not appropriate to stress EUC, a class of encodings, over |
| 593 |
|
EUC-JP, EUC-KR, and so on, concrete encodings. |
| 594 |
|
</footnote> |
| 595 |
|
</P> |
| 596 |
|
|
| 597 |
|
<P> |
| 598 |
|
<url id="http://www.unicode.org/unicode/reports/tr17/" |
| 599 |
|
name="Character Encoding Model, Unicode Technical Report #17"> |
| 600 |
|
(hereafter, <em>"the Report"</em>) suggests a five-level model. |
| 601 |
|
<list> |
| 602 |
|
<item>ACR: abstract character repertoire |
| 603 |
|
<item>CCS: Coded Character Set |
| 604 |
|
<item>CEF: Character Encoding Form |
| 605 |
|
<item>CES: Character Encoding Scheme |
| 606 |
|
<item>TES: Transfer Encoding Syntax |
| 607 |
|
</list> |
| 608 |
|
</P> |
| 609 |
|
|
| 610 |
|
<P> |
| 611 |
|
<strong>TES</strong> is also suggested in RFC 2130. Some examples of |
| 612 |
|
TES are: <em>base64</em>, <em>uuencode</em>, <em>BinHex</em>, |
| 613 |
|
<em>quoted-printable</em>, <em>gzip</em>, and so on. |
| 614 |
|
TES means a transform of encoded data which may (or may not) include |
| 615 |
|
textual data. Thus, TES is not a part of character encoding. |
| 616 |
|
However, TES is important in the Internet data exchange. |
| 617 |
|
</P> |
| 618 |
|
|
| 619 |
|
<P> |
| 620 |
|
When using a computer, we rarely have a chance to face with |
| 621 |
|
<strong>ACR</strong>. |
| 622 |
|
Though it is true that CJK people have their national standard of |
| 623 |
|
ACR (for example, standard for ideograms which can be used for |
| 624 |
|
personal names) and some of us may need to handle these ACR with |
| 625 |
|
computers (for example, citizen registration system), this is too |
| 626 |
|
heavy a theme for this document. This is because there are no |
| 627 |
|
standardized or encouraged methods to handle these ACR. You may |
| 628 |
|
have to build the whole system for such purposes. Good luck! |
| 629 |
|
</P> |
| 630 |
|
|
| 631 |
|
<P> |
| 632 |
|
<strong>CCS</strong> in <em>"the Report"</em> is the same as what I wrote |
| 633 |
|
in this document. |
| 634 |
|
It has concrete examples: ASCII, ISO 8859-{1,2,...,15}, JISX 0201, |
| 635 |
|
JISX 0208, JISX 0212, KSX 1001, KSX 1002, GB 2312, Big5, |
| 636 |
|
CNS 11643, TIS 620, VISCII, TCVN 5712, UCS2, UCS4, and so on. |
| 637 |
|
Some of them are national standards, some are international |
| 638 |
|
standards, and others are de-facto standards. |
| 639 |
|
</P> |
| 640 |
|
|
| 641 |
|
<P> |
| 642 |
|
<strong>CEF</strong> and <strong>CES</strong> in <em>"the Report"</em> |
| 643 |
|
correspond to <strong>CES</strong> in this document. |
| 644 |
|
This document will not distinguish these two, since I think there |
| 645 |
|
is no inconvenience. An encoding with a significant CEF doesn't |
| 646 |
|
have a significant CES (in <em>"the Report"</em> meaning), and |
| 647 |
|
vice versa. Then why should we have to distinguish these two? |
| 648 |
|
The only exception is UTF-16 series. In UTF-16 series, |
| 649 |
|
UTF-16 is a CEF and UTF-16BE is a CES. This is the only case where |
| 650 |
|
both of these two levels are needed. |
| 651 |
|
</P> |
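The UTF-16 case can be made concrete with a character outside the 16bit range. In the sketch below (an illustration only; the variable names are arbitrary), the pair of 16bit code units is the CEF level, and the two byte orders are the CES level.

```python
import struct

ch = "\U00010400"  # DESERET CAPITAL LETTER LONG I, outside the 16bit range

# CES level: the same code units serialized in two different byte orders.
be = ch.encode("utf-16-be")
le = ch.encode("utf-16-le")

# CEF level: the sequence of 16bit code units (a surrogate pair here).
units_from_be = struct.unpack(">2H", be)
units_from_le = struct.unpack("<2H", le)

print([hex(u) for u in units_from_be])  # the code units
print(be.hex(), le.hex())               # the two byte serializations
```

Both byte strings decode to the same two code units; only the serialization differs.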
| 652 |
|
|
| 653 |
|
<P> |
| 654 |
|
Now, <strong>CES</strong> is a concrete concept with concrete |
| 655 |
|
examples: ASCII, ISO 8859-{1,2,...,15}, EUC-JP, EUC-KR, ISO 2022-JP, |
| 656 |
|
ISO 2022-JP-1, ISO 2022-JP-2, ISO 2022-CN, ISO 2022-CN-EXT, |
| 657 |
|
ISO 2022-KR, ISO 2022, VISCII, UTF-7, UTF-8, UTF-16LE, UTF-16BE, |
| 658 |
|
and so on. Now they are encodings themselves. |
| 659 |
|
</P> |
| 660 |
|
|
| 661 |
|
<P> |
| 662 |
|
The most important concept in this section is distinction between |
| 663 |
|
<em>coded character set</em> and <em>encoding</em>. <em>coded |
| 664 |
|
character set</em> is a component of <em>encoding</em>. Text data |
| 665 |
|
are described in <em>encoding</em>, not <em>coded character set</em>. |
| 666 |
|
</P> |
| 667 |
|
|
| 668 |
|
|
| 669 |
<sect1 id="stateful"><heading>Stateless and Stateful</heading> |
<sect1 id="stateful"><heading>Stateless and Stateful</heading> |
| 670 |
|
|
| 671 |
<P> |
<P> |
| 672 |
To construct a character code with two or more CCS, |
To construct an encoding with two or more CCS, CES has to supply |
| 673 |
CES has to supply a method to avoid collision between these CCS. |
a method to avoid collision between these CCS. |
| 674 |
There are two ways to do that. One is to make all characters |
There are two ways to do that. One is to make all characters |
| 675 |
in the all CCS have unique code points. The other is to |
in the all CCS have unique code points. The other is to |
| 676 |
allow characters from different CCS to have the same |
allow characters from different CCS to have the same |
| 679 |
</P> |
</P> |
| 680 |
|
|
| 681 |
<P> |
<P> |
| 682 |
A character code with shift state is called <strong>STATEFUL</strong> and |
An encoding with shift state is called <strong>STATEFUL</strong> and |
| 683 |
one without shift state is called <strong>STATELESS</strong>. |
one without shift state is called <strong>STATELESS</strong>. |
| 684 |
</P> |
</P> |
| 685 |
|
|
| 686 |
<P> |
<P> |
| 687 |
Examples of stateful character codes are: ISO 2022-*, |
Examples of stateful encodings are: ISO 2022-JP, ISO 2022-KR, |
| 688 |
|
ISO 2022-INT-1, ISO 2022-INT-2, and so on. |
| 689 |
|
</P> |
| 690 |
|
|
| 691 |
<P> |
<P> |
| 692 |
For example, in ISO 2022-JP, two bytes of <tt>0x24 0x2c</tt> may mean |
For example, in ISO 2022-JP, two bytes of <tt>0x24 0x2c</tt> may mean |
| 694 |
'$' and ',' according to the shift state. |
'$' and ',' according to the shift state. |
| 695 |
</P> |
</P> |
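The effect of the shift state can be checked with a short sketch. It assumes Python's built-in <tt>iso2022_jp</tt> codec; the variable names are arbitrary. The same two bytes are decoded once in the initial (ASCII) state and once after the escape sequence that designates JISX 0208.

```python
ESC_TO_JISX0208 = b"\x1b$B"  # ESC $ B designates JISX 0208
ESC_TO_ASCII = b"\x1b(B"     # ESC ( B designates ASCII again

payload = bytes([0x24, 0x2C])

# Initial state is ASCII: the bytes are two separate characters.
as_ascii = payload.decode("iso2022_jp")

# After the escape sequence the same bytes form one JISX 0208 character.
as_kanji = (ESC_TO_JISX0208 + payload + ESC_TO_ASCII).decode("iso2022_jp")

print(repr(as_ascii), repr(as_kanji))
```

In the first case the result is the two characters '$' and ','; in the second it is the single Hiragana character U+304C.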
| 696 |
|
|
| 697 |
<sect1 id="multibyte"><heading>Multibyte character code</heading> |
<sect1 id="multibyte"><heading>Multibyte encodings</heading> |
| 698 |
|
|
| 699 |
<P> |
<P> |
| 700 |
Character codes are classified into multibyte ones and the others, |
Encodings are classified into multibyte ones and the others, |
| 701 |
according to the relationship between number of characters and number of |
according to the relationship between number of characters and number of |
| 702 |
bytes in the character code. |
bytes in the encoding. |
| 703 |
</P> |
</P> |
| 704 |
|
|
| 705 |
<P> |
<P> |
| 706 |
In non-multibyte character code, one character is always expressed |
In non-multibyte encoding, one character is always expressed |
| 707 |
by one byte. On the other hand, one character may expressed in |
by one byte. On the other hand, one character may be expressed in |
| 708 |
one or more bytes in multibyte character code. Note that the number |
one or more bytes in multibyte encoding. Note that the number |
| 709 |
is not fixed even in a single character code. |
is not fixed even in a single encoding. |
| 710 |
</P> |
</P> |
| 711 |
|
|
| 712 |
<P> |
<P> |
| 713 |
Examples of multibyte character codes are: EUC-*, ISO 2022-*, |
Examples of multibyte encodings are: EUC-JP, EUC-KR, ISO 2022-JP, |
| 714 |
Shift-JIS, Big5, UTF-*, and so on. Note that all of UTF-* are |
Shift-JIS, Big5, UHC, UTF-8, and so on. Note that all of UTF-* are |
| 715 |
multibyte. |
multibyte. |
| 716 |
</P> |
</P> |
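As a small illustration of the point that the number of bytes per character is not fixed within a single multibyte encoding, the sketch below encodes a two-character string with a few of Python's built-in codecs. The helper name <tt>bytes_per_char</tt> is made up for this example.

```python
def bytes_per_char(text, encoding):
    """Return the number of bytes each character needs in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]


text = "a\u3042"  # LATIN SMALL LETTER A followed by HIRAGANA LETTER A

for enc in ("euc_jp", "shift_jis", "utf-8"):
    print(enc, bytes_per_char(text, enc), len(text.encode(enc)))

# A non-multibyte encoding always uses one byte per character.
print("iso-8859-1", bytes_per_char("a\u00e9", "iso-8859-1"))
```

The per-character helper is only meaningful for stateless encodings; for a stateful one, encoding each character separately would also count the escape sequences.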
| 717 |
|
|
| 718 |
<P> |
<P> |
| 719 |
Examples of non-multibyte character codes are: ISO 8859-*, |
Examples of non-multibyte encodings are: ISO 8859-1, ISO 8859-2, |
| 720 |
TIS 620, VISCII, and so on. |
TIS 620, VISCII, and so on. |
| 721 |
</P> |
</P> |
| 722 |
|
|
| 723 |
<P> |
<P> |
| 724 |
Note that even in non-multibyte character code, number of characters |
Note that even in non-multibyte encoding, number of characters |
| 725 |
and number of bytes may differ if the character code is stateful. |
and number of bytes may differ if the encoding is stateful. |
| 726 |
|
</P> |
| 727 |
|
|
| 728 |
|
<P> |
| 729 |
|
Ken Lunde's "CJKV Information Processing" |
| 730 |
|
<footnote> |
| 731 |
|
ISBN 1-56592-224-7, O'Reilly, 1999 |
| 732 |
|
</footnote> |
| 733 |
|
classifies encoding methods |
| 734 |
|
into the following three categories: |
| 735 |
|
<list> |
| 736 |
|
<item>modal |
| 737 |
|
<item>non-modal |
| 738 |
|
<item>fixed-length |
| 739 |
|
</list> |
| 740 |
|
<em>Modal</em> corresponds to <em>stateful</em> in this document. |
| 741 |
|
The other two are <em>stateless</em>, where <em>non-modal</em> is |
| 742 |
|
<em>multibyte</em> and <em>fixed-length</em> is |
| 743 |
|
<em>non-multibyte</em>. However, I think <em>stateful</em> - |
| 744 |
|
<em>stateless</em> and <em>multibyte</em> - <em>non-multibyte</em> |
| 745 |
|
are independent concepts. |
| 746 |
|
<footnote> |
| 747 |
|
though there are no existing encodings which are stateful and |
| 748 |
|
non-multibyte. |
| 749 |
|
</footnote> |
| 750 |
</P> |
</P> |
| 751 |
|
|
| 752 |
<sect1 id="number"><heading>Number of Bytes, Number of Characters, and Number of Columns</heading> |
<sect1 id="number"><heading>Number of Bytes, Number of Characters, and Number of Columns</heading> |
| 762 |
|
|
| 763 |
<P> |
<P> |
| 764 |
Speaking of relationship between characters and bytes, |
Speaking of relationship between characters and bytes, |
| 765 |
in multibyte character codes, two or more bytes may be needed |
in multibyte encodings, two or more bytes may be needed |
| 766 |
to express one character. In stateful character codes, escape |
to express one character. In stateful encodings, escape |
| 767 |
sequences are not related to any characters. |
sequences are not related to any characters. |
| 768 |
</P> |
</P> |
| 769 |
|
|
| 774 |
Note that 'Full-width forms' in UCS-2 and UCS-4 coded character set |
Note that 'Full-width forms' in UCS-2 and UCS-4 coded character set |
| 775 |
will occupy two columns and 'Half-width forms' will occupy one column. |
will occupy two columns and 'Half-width forms' will occupy one column. |
| 776 |
Combining characters used for Thai and so on can be regarded as |
Combining characters used for Thai and so on can be regarded as |
| 777 |
zero-column characters. |
zero-column characters. Though there are no standards, you can |
| 778 |
|
use <tt>wcwidth()</tt> and <tt>wcswidth()</tt> for this purpose. |
| 779 |
|
See <ref id="output-console-column"> for detail. |
| 780 |
</P> |
</P> |
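The column rules described above can be approximated in a few lines. This is a simplified sketch, not the C function <tt>wcwidth()</tt> itself: it uses Python's <tt>unicodedata</tt> module, the function names <tt>char_columns</tt> and <tt>string_columns</tt> are made up, and control characters are not treated specially.

```python
import unicodedata


def char_columns(ch):
    """Approximate number of terminal columns occupied by one character."""
    # Non-spacing and enclosing marks (combining characters) take no column.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    # East Asian Wide and Fullwidth characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def string_columns(s):
    """Approximate number of columns occupied by a whole string."""
    return sum(char_columns(ch) for ch in s)


print(string_columns("abc"))           # three ASCII letters
print(string_columns("\u3042"))        # HIRAGANA LETTER A
print(string_columns("\uff21"))        # FULLWIDTH LATIN CAPITAL LETTER A
print(string_columns("\uff71"))        # HALFWIDTH KATAKANA LETTER A
print(string_columns("\u0e01\u0e31"))  # Thai consonant plus combining vowel
```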
| 781 |
|
|
| 782 |
<sect id="standards"><heading>Standards for Character Codes</heading> |
<sect id="standards"><heading>Standards for Character Sets and Encodings</heading> |
| 783 |
|
|
| 784 |
<sect1 id="ascii"><heading>ASCII and ISO 646</heading> |
<sect1 id="ascii"><heading>ASCII and ISO 646</heading> |
| 785 |
|
|
| 786 |
<P> |
<P> |
| 787 |
<strong>ASCII</strong> is a CCS and also a character code at the same time. |
<strong>ASCII</strong> is a CCS and also an encoding at the same time. |
| 788 |
ASCII is 7bit and contains 94 printable characters which are |
ASCII is 7bit and contains 94 printable characters which are |
| 789 |
encoded in the region of <tt>0x21</tt>-<tt>0x7e</tt>. |
encoded in the region of <tt>0x21</tt>-<tt>0x7e</tt>. |
| 790 |
</P> |
</P> |
| 824 |
</P> |
</P> |
| 825 |
|
|
| 826 |
<P> |
<P> |
| 827 |
As far as I know, all character codes (besides EBCDIC) in the world |
As far as I know, all encodings (besides EBCDIC) in the world |
| 828 |
are compatible with ISO 646. |
are compatible with ISO 646. |
| 829 |
</P> |
</P> |
| 830 |
|
|
| 833 |
</P> |
</P> |
| 834 |
|
|
| 835 |
<P> |
<P> |
| 836 |
Nowadays usage of character codes incompatible with ASCII is not |
Nowadays usage of encodings incompatible with ASCII is not |
| 837 |
encouraged and thus ISO 646-* (other than US version) should not |
encouraged and thus ISO 646-* (other than US version) should not |
| 838 |
be used. One of the reason is that when a string is converted into |
be used. One of the reason is that when a string is converted into |
| 839 |
Unicode, the converter doesn't know whether IRVs are converted into |
Unicode, the converter doesn't know whether IRVs are converted into |
| 847 |
|
|
| 848 |
<P> |
<P> |
| 849 |
<strong>ISO 8859</strong> is both a series of CCS and a series of |
<strong>ISO 8859</strong> is both a series of CCS and a series of |
| 850 |
character codes. It is an expansion of ASCII using all 8 bits. |
encodings. It is an expansion of ASCII using all 8 bits. |
| 851 |
Additional 96 printable characters encoded in 0xa0 - 0xff are |
Additional 96 printable characters encoded in 0xa0 - 0xff are |
| 852 |
available besides 94 ASCII printable characters. |
available besides 94 ASCII printable characters. |
| 853 |
</P> |
</P> |
| 884 |
<P> |
<P> |
| 885 |
<strong>ISO 2022</strong> is an international standard of CES. |
<strong>ISO 2022</strong> is an international standard of CES. |
| 886 |
ISO 2022 determines a few requirement for CCS to be a member |
ISO 2022 specifies a few requirements for CCS to be a member |
| 887 |
of ISO 2022-based character codes. It also defines a very |
of ISO 2022-based encodings. It also defines very |
| 888 |
extensive (and complex) rules to combine these CCS into one |
extensive (and complex) rules to combine these CCS into one |
| 889 |
character code. Many character codes such as EUC-*, ISO 2022-*, |
encoding. Many encodings such as EUC-*, ISO 2022-*, |
| 890 |
compound text, |
compound text, |
| 891 |
<footnote> |
<footnote> |
| 892 |
Compound text is a standard for text exchange between X clients. |
Compound text is a standard for text exchange between X clients. |
| 938 |
</P> |
</P> |
| 939 |
|
|
| 940 |
<P> |
<P> |
| 941 |
For example, ASCII, ISO 646-UK, and JIS X 0201 Katakana |
For example, ASCII, ISO 646-UK, and JISX 0201 Katakana |
| 942 |
are classified into (1), JIS X 0208 Japanese Kanji, |
are classified into (1), JISX 0208 Japanese Kanji, |
| 943 |
KS C 5601 Korean, GB 2312-80 Chinese are classified into (3), |
KSX 1001 Korean, GB 2312-80 Chinese are classified into (3), |
| 944 |
and ISO 8859-* are classified to (2). |
and ISO 8859-* are classified to (2). |
| 945 |
</P> |
</P> |
| 946 |
|
|
| 1012 |
</list> |
</list> |
| 1013 |
<item>character set with multibyte 94-character |
<item>character set with multibyte 94-character |
| 1014 |
<list> |
<list> |
| 1015 |
<item>F=0x40 for JIS X 0208-1978 Japanese |
<item>F=0x40 for JISX 0208-1978 Japanese |
| 1016 |
<item>F=0x41 for GB 2312-80 Chinese |
<item>F=0x41 for GB 2312-80 Chinese |
| 1017 |
<item>F=0x42 for JIS X 0208-1983 Japanese |
<item>F=0x42 for JISX 0208-1983 Japanese |
| 1018 |
<item>F=0x43 for KS C 5601 Korean |
<item>F=0x43 for KSC 5601 Korean |
| 1019 |
<item>F=0x44 for JIS X 0212-1990 Japanese |
<item>F=0x44 for JISX 0212-1990 Japanese |
| 1020 |
<item>F=0x45 for CCITT Extended GB (ISO-IR-165) |
<item>F=0x45 for CCITT Extended GB (ISO-IR-165) |
| 1021 |
<item>F=0x46 for CNS 11643-1992 Set 1 (Taiwan) |
<item>F=0x46 for CNS 11643-1992 Set 1 (Taiwan) |
| 1022 |
<item>F=0x48 for CNS 11643-1992 Set 2 (Taiwan) |
<item>F=0x48 for CNS 11643-1992 Set 2 (Taiwan) |
| 1057 |
</P> |
</P> |
| 1058 |
|
|
| 1059 |
<P> |
<P> |
| 1060 |
Note that a character code in a character set invoked into GR is |
Note that a code in a character set invoked into GR is |
| 1061 |
or-ed with 0x80. |
or-ed with 0x80. |
| 1062 |
</P> |
</P> |
| 1063 |
|
|
| 1103 |
codes are used to invoke G2 and G3 into GL in ISO 2022, they are |
codes are used to invoke G2 and G3 into GL in ISO 2022, they are |
| 1104 |
invoked into GR in EUC. |
invoked into GR in EUC. |
| 1105 |
<strong>EUC-JP</strong>, <strong>EUC-KR</strong>, <strong>EUC-CN</strong>, |
<strong>EUC-JP</strong>, <strong>EUC-KR</strong>, <strong>EUC-CN</strong>, |
| 1106 |
and <strong>EUC-TW</strong> are widely used character codes |
and <strong>EUC-TW</strong> are widely used encodings |
| 1107 |
which use EUC as CES. |
which use EUC as CES. |
| 1108 |
</P> |
</P> |
| 1109 |
|
|
| 1208 |
<sect2 id="unicode-ces"><heading>UTF as CES</heading> |
<sect2 id="unicode-ces"><heading>UTF as CES</heading> |
| 1209 |
|
|
| 1210 |
<P> |
<P> |
| 1211 |
A few CES are used to construct character codes which use UCS as |
A few CES are used to construct encodings which use UCS as |
| 1212 |
a CCS. They are <strong>UTF-7</strong>, <strong>UTF-8</strong>, |
a CCS. They are <strong>UTF-7</strong>, <strong>UTF-8</strong>, |
| 1213 |
<strong>UTF-16</strong>, <strong>UTF-16LE</strong>, and |
<strong>UTF-16</strong>, <strong>UTF-16LE</strong>, and |
| 1214 |
<strong>UTF-16BE</strong>. UTF means Unicode (or UCS) |
<strong>UTF-16BE</strong>. UTF means Unicode (or UCS) |
| 1215 |
Transformation Format. |
Transformation Format. |
| 1216 |
Since these CES always take UCS as the only CCS, they are also |
Since these CES always take UCS as the only CCS, they are also |
| 1217 |
names for character codes. |
names for encodings. |
| 1218 |
<footnote> |
<footnote> |
| 1219 |
Compare UTF and EUC. There are a few variants of EUC whose CCS |
Compare UTF and EUC. There are a few variants of EUC whose CCS |
| 1220 |
are different (EUC-JP, EUC-KR, and so on). This is why we cannot |
are different (EUC-JP, EUC-KR, and so on). This is why we cannot |
| 1221 |
call EUC as a character code. In other words, calling of 'EUC' |
call EUC as an encoding. In other words, calling of 'EUC' |
| 1222 |
cannot specify a character code. On the other hands, 'UTF-8' |
cannot specify an encoding. On the other hand, 'UTF-8' |
| 1223 |
is the name for a specific concrete character code. |
is the name for a specific concrete encoding. |
| 1224 |
</footnote> |
</footnote> |
| 1225 |
</P> |
</P> |
| 1226 |
|
|
| 1227 |
<sect3 id="unicode-utf8"><heading>UTF-8</heading> |
<sect3 id="unicode-utf8"><heading>UTF-8</heading> |
| 1228 |
|
|
| 1229 |
<P> |
<P> |
| 1230 |
UTF-8 is a character code whose CCS is UCS-4. UTF-8 |
UTF-8 is an encoding whose CCS is UCS-4. UTF-8 |
| 1231 |
is designed to be upward-compatible to ASCII. |
is designed to be upward-compatible to ASCII. |
| 1232 |
UTF-8 is multibyte and number of bytes needed to express |
UTF-8 is multibyte and number of bytes needed to express |
| 1233 |
one character is from 1 to 6. |
one character is from 1 to 6. |
| 1258 |
<sect3 id="unicode-utf16"><heading>UTF-16</heading> |
<sect3 id="unicode-utf16"><heading>UTF-16</heading> |
| 1259 |
|
|
| 1260 |
<P> |
<P> |
| 1261 |
UTF-16 is a character code whose CCS is 20bit Unicode. |
UTF-16 is an encoding whose CCS is 20bit Unicode. |
| 1262 |
</P> |
</P> |
| 1263 |
|
|
| 1264 |
<P> |
<P> |
| 1437 |
<sect3 id="646problem"><heading>ISO 646-* Problem</heading> |
<sect3 id="646problem"><heading>ISO 646-* Problem</heading> |
| 1438 |
|
|
| 1439 |
<P> |
<P> |
| 1440 |
You will need a codeset converter between your local character codes |
You will need a codeset converter between your local encodings |
| 1441 |
(for example, ISO 8859-* or ISO 2022-*) and Unicode. |
(for example, ISO 8859-* or ISO 2022-*) and Unicode. |
| 1442 |
For example, Shift-JIS character code |
For example, Shift-JIS encoding |
| 1443 |
<footnote> |
<footnote> |
| 1444 |
The standard character code for Macintosh and MS Windows. |
The standard encoding for Macintosh and MS Windows. |
| 1445 |
</footnote> |
</footnote> |
| 1446 |
consists of |
consists of |
| 1447 |
JISX 0201 Roman (Japanese version of ISO 646), not ASCII, |
JISX 0201 Roman (Japanese version of ISO 646), not ASCII, |
| 1460 |
'yen currency mark - <tt>n</tt>'. You may say that program sources |
'yen currency mark - <tt>n</tt>'. You may say that program sources |
| 1461 |
must be written in ASCII and the mistake is that you |
must be written in ASCII and the mistake is that you |
| 1462 |
tried to convert program source. However, there are many |
tried to convert program source. However, there are many |
| 1463 |
source codes and so on written in Shift-JIS character code. |
source codes and so on written in Shift-JIS encoding. |
| 1464 |
</P> |
</P> |
| 1465 |
|
|
| 1466 |
<P> |
<P> |
| 1479 |
</P> |
</P> |
| 1480 |
|
|
| 1481 |
|
|
| 1482 |
<sect1 id="othercodes"><heading>Other Character Codes</heading> |
<sect1 id="othercodes"><heading>Other Character Sets and Encodings</heading> |
| 1483 |
|
|
| 1484 |
<P> |
<P> |
| 1485 |
There are a few popular character codes which cannot be classified |
There are a few popular encodings which cannot be classified |
| 1486 |
into an international standard. Internationalized softwares should |
into an international standard. Internationalized softwares should |
| 1487 |
support these character codes (again, you don't need to be aware of |
support these encodings (again, you don't need to be aware of |
| 1488 |
character codes if you use LOCALE and <tt>wchar_t</tt> technology). |
encodings if you use LOCALE and <tt>wchar_t</tt> technology). |
| 1489 |
Some organizations are developing systems which go beyond the |
Some organizations are developing systems which go beyond the |
| 1490 |
limitations of the current international standards, though these |
limitations of the current international standards, though these |
| 1491 |
systems may not be widely used yet. |
systems may not be widely used yet. |
| 1494 |
<sect2 id="othercodes-big5"><heading>Big5</heading> |
<sect2 id="othercodes-big5"><heading>Big5</heading> |
| 1495 |
|
|
| 1496 |
<P> |
<P> |
| 1497 |
<strong>Big5</strong> is a de-facto standard character code for |
<strong>Big5</strong> is a de-facto standard encoding for |
| 1498 |
Taiwan (1984). It is also a CCS which is upward-compatible with ASCII. |
Taiwan (1984) and is upward-compatible with ASCII. |
| 1499 |
|
It is also a CCS. |
| 1500 |
</P> |
</P> |
| 1501 |
|
|
| 1502 |
<P> |
<P> |
| 1510 |
Though Taiwan has ISO 2022-compliant new standard CNS 11643, |
Though Taiwan has ISO 2022-compliant new standard CNS 11643, |
| 1511 |
Big5 seems to be more popular than CNS 11643. |
Big5 seems to be more popular than CNS 11643. |
| 1512 |
(CNS 11643 is a CCS and there are a few ISO 2022-derived |
(CNS 11643 is a CCS and there are a few ISO 2022-derived |
| 1513 |
character codes which include CNS 11643.) |
encodings which include CNS 11643.) |
| 1514 |
</P> |
</P> |
| 1515 |
|
|
| 1516 |
<sect2 id="othercodes-viscii"><heading>VISCII</heading> |
<sect2 id="othercodes-viscii"><heading>VISCII</heading> |
| 1517 |
|
|
| 1518 |
<P> |
<P> |
| 1519 |
Vietnamese language uses 186 characters (Latin alphabets with accents). |
Vietnamese language uses 186 characters (Latin alphabets with accents) |
| 1520 |
It is a bit more than the limit of ISO 8859-like character code. |
and other symbols. |
| 1521 |
|
It is a bit more than the limit of ISO 8859-like encoding. |
| 1522 |
</P> |
</P> |
| 1523 |
|
|
| 1524 |
<P> |
<P> |
| 1532 |
</P> |
</P> |
| 1533 |
|
|
| 1534 |
<P> |
<P> |
| 1535 |
Vietnam has a new, ISO 2022-compliant character code |
Vietnam has a new, ISO 2022-compliant character set |
| 1536 |
<strong>TCVN 5712</strong> (aka <strong>VSCII</strong>). |
<strong>TCVN 5712</strong> (aka <strong>VSCII</strong>). |
| 1537 |
In TCVN 5712, accented characters are expressed as a |
In TCVN 5712, accented characters are expressed as a |
| 1538 |
combined character. Note that a part of accented characters |
combined character. Note that some of accented characters |
| 1539 |
have their own code points. |
have their own code points. |
| 1540 |
</P> |
</P> |
| 1541 |
|
|
| 1542 |
<sect2 id="othercodes-tron"><heading>TRON</heading> |
<sect2 id="othercodes-tron"><heading>TRON</heading> |
| 1543 |
|
|
| 1544 |
<P> |
<P> |
| 1545 |
url id="http://www.tron.org/index-e.html" name="TRON project"> |
<url id="http://www.tron.org/index-e.html" name="TRON"> |
| 1546 |
is a project to develop a new operating system, |
is a project to develop a new operating system, |
| 1547 |
founded as a collaboration of industries and academics |
founded as a collaboration of industries and academics |
| 1548 |
in Japan since 1984. |
in Japan since 1984. |
| 1551 |
<P> |
<P> |
| 1552 |
The most widely used member of the TRON operating system family |
The most widely used member of the TRON operating system family |
| 1553 |
is ITRON, a real-time OS for embedded systems. |
is ITRON, a real-time OS for embedded systems. |
| 1554 |
However, our interest is not on the ITRON now. |
However, our interest is not on ITRON now. |
| 1555 |
TRON determines a TRON character code. |
TRON determines a TRON encoding. |
| 1556 |
</P> |
</P> |
| 1557 |
|
|
| 1558 |
<P> |
<P> |
| 1559 |
TRON's character code is stateful. A state is assigned |
TRON's encoding is stateful. A state is assigned |
| 1560 |
to each language. It has already defined about 130000 characters |
to each language. It has already defined about 130000 characters |
| 1561 |
(January 2000). |
(January 2000). |
| 1562 |
</P> |
</P> |
| 1587 |
<enumlist> |
<enumlist> |
| 1588 |
<item>kinds and number of characters used in the language, |
<item>kinds and number of characters used in the language, |
| 1589 |
<item>explanation on coded character set(s) which is (are) standardized, |
<item>explanation on coded character set(s) which is (are) standardized, |
| 1590 |
<item>explanation on character code(s) which is (are) standardized, |
<item>explanation on encoding(s) which is (are) standardized, |
| 1591 |
<item>usage and popularity for each character code, |
<item>usage and popularity for each encoding, |
| 1592 |
<item>de-facto standard, if any, on how many columns characters occupy, |
<item>de-facto standard, if any, on how many columns characters occupy, |
| 1593 |
<item>writing direction and combined characters, |
<item>writing direction and combined characters, |
| 1594 |
<item>how to layout characters (word wrapping and so on), |
<item>how to layout characters (word wrapping and so on), |
| 1625 |
<P> |
<P> |
| 1626 |
<strong>LOCALE</strong> is a basic concept introduced |
<strong>LOCALE</strong> is a basic concept introduced |
| 1627 |
into <strong>ISO C</strong> (ISO/IEC 9899:1990). The |
into <strong>ISO C</strong> (ISO/IEC 9899:1990). The |
| 1628 |
standard is expanded in 1995 (ISO 9899:1990 Ammendment 1:1995). |
standard is expanded in 1995 (ISO 9899:1990 Amendment 1:1995). |
| 1629 |
In LOCALE model, the behaviors of some C functions are dependent |
In LOCALE model, the behaviors of some C functions are dependent |
| 1630 |
on LOCALE environment. LOCALE environment is divided |
on LOCALE environment. LOCALE environment is divided |
| 1631 |
into a few categories and each of these categories can |
into a few categories and each of these categories can |
| 1642 |
all versions of Unix operating systems support XPG5. |
all versions of Unix operating systems support XPG5. |
| 1643 |
</P> |
</P> |
| 1644 |
|
|
| 1645 |
<sect id="localecategory">Locale Categories and Locale Names</heading> |
<sect id="localecategory">Locale Categories and <tt>setlocale()</tt></heading> |
| 1646 |
|
|
| 1647 |
<P> |
<P> |
| 1648 |
In LOCALE model, the behaviors of some C functions are dependent |
In LOCALE model, the behaviors of some C functions are dependent |
| 1657 |
<tag><strong>LC_CTYPE</strong> |
<tag><strong>LC_CTYPE</strong> |
| 1658 |
<item> |
<item> |
| 1659 |
<p> |
<p> |
| 1660 |
Category related to character code. |
Category related to encodings. |
| 1661 |
Characters which are encoded by LC_CTYPE-depndent character |
Characters which are encoded by LC_CTYPE-dependent encoding |
| 1662 |
code are called <strong>multibyte characters</strong>. |
are called <strong>multibyte characters</strong>. |
| 1663 |
Note that multibyte character doesn't need to be multibyte. |
Note that multibyte character doesn't need to be multibyte. |
| 1664 |
</p> |
</p> |
| 1665 |
<p> |
<p> |
| 1747 |
will determine the locale name in the following manner: |
will determine the locale name in the following manner: |
| 1748 |
<list> |
<list> |
| 1749 |
<item>At first, consult <tt>LC_ALL</tt> environmental variable. |
<item>At first, consult <tt>LC_ALL</tt> environmental variable. |
| 1750 |
<item>Then, consult environmental variable same as the |
<item>If <tt>LC_ALL</tt> is not available, consult environmental |
| 1751 |
name of the locale category. For example, <tt>LC_COLLATE</tt>. |
variable same as the name of the locale category. |
| 1752 |
<item>At last, consult <tt>LANG</tt> environmental variable. |
For example, <tt>LC_COLLATE</tt>. |
| 1753 |
|
<item>If none of them are available, consult <tt>LANG</tt> |
| 1754 |
|
environmental variable. |
| 1755 |
</list> |
</list> |
| 1756 |
This is why a user is expected to set <tt>LANG</tt> variable. |
This is why a user is expected to set <tt>LANG</tt> variable. |
| 1757 |
In other words, all a user has to do is to set <tt>LANG</tt> |
In other words, all a user has to do is to set <tt>LANG</tt> |
| 1765 |
international. |
international. |
| 1766 |
</p> |
</p> |
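The lookup order described in this section (first <tt>LC_ALL</tt>, then the variable named after the category, then <tt>LANG</tt>) can be sketched in C roughly as follows. This is only an illustration: <tt>effective_locale_name()</tt> and <tt>nonempty_env()</tt> are hypothetical helpers, not library functions, and the sketch assumes that an empty variable is treated as unset and that <tt>"C"</tt> is the fallback when nothing is set.

```c
#include <stdlib.h>

/* Hypothetical helper: value of an environment variable, or NULL when the
 * variable is unset or empty (assumption: empty counts as unset). */
static const char *nonempty_env(const char *name)
{
    const char *value = getenv(name);
    return (value != NULL && value[0] != '\0') ? value : NULL;
}

/* Hypothetical helper: the locale name one would expect
 * setlocale(category, "") to choose, where category_var is the name of
 * the category's environment variable, for example "LC_COLLATE". */
const char *effective_locale_name(const char *category_var)
{
    const char *name;

    if ((name = nonempty_env("LC_ALL")) != NULL)       /* 1. LC_ALL wins */
        return name;
    if ((name = nonempty_env(category_var)) != NULL)   /* 2. category variable */
        return name;
    if ((name = nonempty_env("LANG")) != NULL)         /* 3. LANG */
        return name;
    return "C";                                        /* nothing set: default */
}
```

In real software you would normally just call <tt>setlocale(LC_ALL, "")</tt>; a sketch like this is only useful when the software has to emulate that behavior itself.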
| 1767 |
|
|
| 1768 |
|
<sect id="localename">Locale Names</heading> |
| 1769 |
|
|
| 1770 |
|
<P> |
| 1771 |
|
We can specify locale names for these six locale categories. |
| 1772 |
|
Then, which name should we specify? |
| 1773 |
|
</P> |
| 1774 |
|
|
| 1775 |
|
<P> |
| 1776 |
|
The syntax to build a locale name is determined as follows: |
| 1777 |
|
<example> |
| 1778 |
|
language[_territory][.codeset][@modifier] |
| 1779 |
|
</example> |
| 1780 |
|
where <em>language</em> is two lowercase letters described |
| 1781 |
|
in ISO 639, such as <tt>en</tt> for English, <tt>eo</tt> for |
| 1782 |
|
Esperanto, and <tt>zh</tt> for Chinese, <em>territory</em> |
| 1783 |
|
is two uppercase letters described in ISO 3166, such as |
| 1784 |
|
<tt>GB</tt> for United Kingdom, <tt>KR</tt> for Republic of |
| 1785 |
|
Korea (South Korea), <tt>CN</tt> for China. There is no standard |
| 1786 |
|
for <em>codeset</em> and <em>modifier</em>. GNU libc uses |
| 1787 |
|
<tt>ISO-8859-1</tt>, <tt>ISO-8859-13</tt>, <tt>eucJP</tt>, |
| 1788 |
|
<tt>SJIS</tt>, <tt>UTF8</tt>, and so on for <em>codeset</em>, |
| 1789 |
|
and <tt>euro</tt> for <em>modifier</em>. |
| 1790 |
|
</P> |
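The syntax <tt>language[_territory][.codeset][@modifier]</tt> can be taken apart with a few string operations. The following is a minimal sketch; <tt>struct locale_parts</tt>, <tt>copy_part()</tt>, and <tt>parse_locale_name()</tt> are names invented for this illustration, and the buffer sizes are arbitrary assumptions.

```c
#include <string.h>

/* Hypothetical container for the four parts of a locale name. */
struct locale_parts {
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[32];
};

/* Copy len bytes from src into dst (truncating if needed) and terminate. */
static void copy_part(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split "language[_territory][.codeset][@modifier]" into its parts.
 * Optional parts that are missing are left as empty strings. */
void parse_locale_name(const char *name, struct locale_parts *out)
{
    size_t lang_len = strcspn(name, "_.@");
    const char *p = name + lang_len;
    size_t n;

    memset(out, 0, sizeof *out);
    copy_part(out->language, sizeof out->language, name, lang_len);

    if (*p == '_') {                      /* optional territory */
        p++;
        n = strcspn(p, ".@");
        copy_part(out->territory, sizeof out->territory, p, n);
        p += n;
    }
    if (*p == '.') {                      /* optional codeset */
        p++;
        n = strcspn(p, "@");
        copy_part(out->codeset, sizeof out->codeset, p, n);
        p += n;
    }
    if (*p == '@')                        /* optional modifier */
        copy_part(out->modifier, sizeof out->modifier, p + 1, strlen(p + 1));
}
```

For example, <tt>ja_JP.eucJP</tt> splits into language <tt>ja</tt>, territory <tt>JP</tt>, and codeset <tt>eucJP</tt>. Note that this only parses the name; whether the locale actually exists still depends on the system.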
| 1791 |
|
|
| 1792 |
|
<P> |
| 1793 |
|
However, which locale names are valid depends on the system. |
| 1794 |
|
In other words, you have to install <em>locale database</em> for |
| 1795 |
|
locale you want to use. Type <tt>locale -a</tt> to display all |
| 1796 |
|
supported locale names on the system. |
| 1797 |
|
</P> |
| 1798 |
|
|
| 1799 |
<p> |
<p> |
| 1800 |
Note that locale names of <tt>"C"</tt> and <tt>"POSIX"</tt> are |
Note that locale names of <tt>"C"</tt> and <tt>"POSIX"</tt> are |
| 1801 |
determined for the names for default behavior. For example, |
determined for the names for default behavior. For example, |
| 1807 |
<sect id="wchar">Multibyte Characters and Wide Characters</heading> |
<sect id="wchar">Multibyte Characters and Wide Characters</heading> |
| 1808 |
|
|
| 1809 |
<p> |
<p> |
| 1810 |
Now we will concentrate on LC_CTYPE category. |
Now we will concentrate on LC_CTYPE, which is the most important |
| 1811 |
|
category in six locale categories. |
| 1812 |
</p> |
</p> |
| 1813 |
|
|
| 1814 |
<p> |
<p> |
| 1815 |
Many character codes such as ASCII, ISO 8859-*, KOI8-R, EUC-*, |
Many encodings such as ASCII, ISO 8859-*, KOI8-R, EUC-*, |
| 1816 |
ISO 2022-*, TIS 620, UTF-8, and so on are used widely in the world. |
ISO 2022-*, TIS 620, UTF-8, and so on are used widely in the world. |
| 1817 |
It is inefficient and a cause of bugs, even if not impossible, for |
It is inefficient and a cause of bugs, even if not impossible, for |
| 1818 |
every software to implement all these character codes. |
every software to implement all these encodings. |
| 1819 |
Fortunetely, we can use LOCALE technology to solve this problem. |
Fortunately, we can use LOCALE technology to solve this problem. |
| 1820 |
<footnote> |
<footnote> |
| 1821 |
Usage of UCS-4 is the second best solution fot this problem. |
Usage of UCS-4 is the second best solution for this problem. |
| 1822 |
Sometimes LOCALE technology cannot be used and UCS-4 is the |
Sometimes LOCALE technology cannot be used and UCS-4 is the |
| 1823 |
best. I will discuss this solution later. |
best. I will discuss this solution later. |
| 1824 |
</footnote> |
</footnote> |
| 1826 |
|
|
| 1827 |
<p> |
<p> |
| 1828 |
<strong>Multibyte characters</strong> is a term to call characters |
<strong>Multibyte characters</strong> is a term to call characters |
| 1829 |
encoded in locale-specific character code. Thus, the behaviors of |
encoded in locale-specific encoding. In ISO 8859-1 locale, |
| 1830 |
C functions which handle multibyte characters depend on |
ISO 8859-1 is multibyte character. In EUC-JP locale, EUC-JP |
| 1831 |
<tt>LC_CTYPE</tt> locale category. |
is multibyte character. In UTF-8 locale, UTF-8 is multibyte character. |
| 1832 |
|
In short, multibyte character is defined by <tt>LC_CTYPE</tt> locale |
| 1833 |
|
category. |
| 1834 |
Multibyte characters should be used when your software inputs |
Multibyte characters should be used when your software inputs |
| 1835 |
or outputs text data from/to everywhere out of your software, |
or outputs text data from/to everywhere out of your software, |
| 1836 |
for example, standard input/output, display, keyboard, file, |
for example, standard input/output, display, keyboard, file, |
| 1843 |
</p> |
</p> |
| 1844 |
|
|
| 1845 |
<p> |
<p> |
| 1846 |
|
You can handle multibyte characters using ordinary <tt>char</tt> |
| 1847 |
|
or <tt>unsigned char</tt> types and ordinary character- and |
| 1848 |
|
string-oriented functions, just like you used to do for |
| 1849 |
|
ASCII and 8bit encodings. |
| 1850 |
|
Moreover, the ISO C standard specifies C functions which are sensitive |
| 1851 |
|
to <tt>LC_CTYPE</tt> locale category and thus these functions can |
| 1852 |
|
handle multibyte characters. |
| 1853 |
|
</p> |
| 1854 |
|
|
| 1855 |
|
<p> |
| 1856 |
Multibyte character may be stateful or stateless and multibyte or |
Multibyte character may be stateful or stateless and multibyte or |
| 1857 |
non-multibyte. Thus it is not convenient for internal processing. |
non-multibyte. Thus it is not convenient for internal processing. |
| 1858 |
It needs complex algorithm even for, for example, character |
It needs complex algorithm even for, for example, character |
| 1871 |
</p> |
</p> |
| 1872 |
|
|
| 1873 |
<p> |
<p> |
| 1874 |
A string of wide characters is achived by an array of <tt>wchar_t</tt>, |
A string of wide characters is achieved by an array of <tt>wchar_t</tt>, |
| 1875 |
just like a string of characters is achieved by an array |
just like a string of characters is achieved by an array |
| 1876 |
of <tt>char</tt>. |
of <tt>char</tt>. |
| 1877 |
</p> |
</p> |
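As a rough sketch of how a multibyte string becomes an array of <tt>wchar_t</tt>, the following function converts a string in the encoding of the current <tt>LC_CTYPE</tt> locale into a newly allocated wide string, using the ISO C function <tt>mbsrtowcs()</tt>. The name <tt>to_wide()</tt> is hypothetical, and the sketch assumes the program has already called <tt>setlocale(LC_CTYPE, "")</tt> (or another <tt>setlocale()</tt> call) as appropriate.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string to a malloc()ed wide
 * string.  Returns NULL on an invalid multibyte sequence or when memory
 * runs out; the caller frees the result. */
wchar_t *to_wide(const char *mbs)
{
    mbstate_t state;
    const char *src = mbs;
    wchar_t *ws;
    size_t n;

    memset(&state, 0, sizeof state);        /* start in the initial shift state */
    n = mbsrtowcs(NULL, &src, 0, &state);   /* first pass: count wide characters */
    if (n == (size_t)-1)
        return NULL;                        /* invalid multibyte sequence */

    ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL)
        return NULL;

    src = mbs;
    memset(&state, 0, sizeof state);
    mbsrtowcs(ws, &src, n + 1, &state);     /* second pass: convert, including L'\0' */
    return ws;
}
```

After the conversion, each element of the array is exactly one character, so operations such as counting characters or indexing become simple array operations.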
| 1928 |
<p> |
<p> |
| 1929 |
You cannot assume anything on the concrete value of <tt>wchar_t</tt>, |
You cannot assume anything on the concrete value of <tt>wchar_t</tt>, |
| 1930 |
besides <tt>0x21</tt> - <tt>0x7e</tt> are identical to ASCII. |
besides <tt>0x21</tt> - <tt>0x7e</tt> are identical to ASCII. |
| 1931 |
|
<footnote> |
| 1932 |
|
Some of you may know GNU libc uses UCS-4 for the internal representation |
| 1933 |
|
of <tt>wchar_t</tt>. However, you should not rely on this knowledge. |
| 1934 |
|
It may differ in other systems. |
| 1935 |
|
</footnote> |
| 1936 |
You may feel this limitation is too strong. If you cannot do |
You may feel this limitation is too strong. If you cannot do |
| 1937 |
under this limitation, you can use UCS-4 as the internal character |
under this limitation, you can use UCS-4 as the internal encoding. |
| 1938 |
code. In such a case, you can write your software emulating |
In such a case, you can write your software emulating |
| 1939 |
the locale-sensible behavior using <tt>setlocale()</tt>, |
the locale-sensible behavior using <tt>setlocale()</tt>, |
| 1940 |
<tt>nl_langinfo(CODESET)</tt>, and <tt>iconv()</tt>. Consult |
<tt>nl_langinfo(CODESET)</tt>, and <tt>iconv()</tt>. Consult |
| 1941 |
the section of <ref id="iconv">. |
the section of <ref id="iconv">. |
| 1943 |
|
|
| 1944 |
<p> |
<p> |
| 1945 |
You can write wide character in the source code as <tt>L'a'</tt> |
You can write wide character in the source code as <tt>L'a'</tt> |
| 1946 |
and wide string as <tt>L"string"</tt>. Since the character |
and wide string as <tt>L"string"</tt>. Since the encoding |
| 1947 |
code for the source code is ASCII, you can only write ASCII |
for the source code is ASCII, you can only write ASCII |
| 1948 |
characters. If you'd like to use other characters, you should |
characters. If you'd like to use other characters, you should |
| 1949 |
use <prgn>gettext</prgn>. |
use <prgn>gettext</prgn>. |
| 1950 |
</p> |
</p> |
| 1991 |
<sect id="locale_unicode">Unicode and LOCALE technology</heading> |
<sect id="locale_unicode">Unicode and LOCALE technology</heading> |
| 1992 |
|
|
| 1993 |
<p> |
<p> |
| 1994 |
UTF-8 is considered as the future character code and |
UTF-8 is considered as the future encoding and |
| 1995 |
many softwares are coming to support UTF-8. Though some |
many softwares are coming to support UTF-8. Though some |
| 1996 |
of these softwares implement UTF-8 directly, I recommend |
of these softwares implement UTF-8 directly, I recommend |
| 1997 |
you to use LOCALE technology to support UTF-8. |
you to use LOCALE technology to support UTF-8. |
| 2033 |
for I18N. |
for I18N. |
| 2034 |
<footnote> |
<footnote> |
| 2035 |
In such a case, do they think of abolishing support of 7bit or |
In such a case, do they think of abolishing support of 7bit or |
| 2036 |
8bit non-multibyte character codes? If no, it may be unfair that |
8bit non-multibyte encodings? If no, it may be unfair that |
| 2037 |
8bit language speakers can use both UTF-8 and conventional (local) |
8bit language speakers can use both UTF-8 and conventional (local) |
| 2038 |
character codes while speakers of multibyte languages, combining |
encodings while speakers of multibyte languages, combining |
| 2039 |
characters, and so on cannot use their popular locale character |
characters, and so on cannot use their popular locale encodings. |
| 2040 |
codes. I think such a software cannot be called "internationalized". |
I think such a software cannot be called "internationalized". |
| 2041 |
</footnote> |
</footnote> |
| 2042 |
Even in such cases, you can rewrite such a software so that it |
Even in such cases, you can rewrite such a software so that it |
| 2043 |
checks <tt>LC_*</tt> and <tt>LANG</tt> environmental variables |
checks <tt>LC_*</tt> and <tt>LANG</tt> environmental variables |
| 2044 |
to emulate the behavior of <tt>setlocale(LC_ALL, "");</tt>. |
to emulate the behavior of <tt>setlocale(LC_ALL, "");</tt>. |
| 2045 |
You can also rewrite the software to call <tt>setlocale()</tt>, |
You can also rewrite the software to call <tt>setlocale()</tt>, |
| 2046 |
<tt>nl_langinfo()</tt>, and <tt>iconv()</tt> so that the software |
<tt>nl_langinfo()</tt>, and <tt>iconv()</tt> so that the software |
| 2047 |
supports all character codes which the OS supports, as discussed later. |
supports all encodings which the OS supports, as discussed later. |
| 2048 |
Consult |
Consult |
| 2049 |
<url id="http://ffii.org/archive/mails/groff/2000/Oct/0056.html" |
<url id="http://ffii.org/archive/mails/groff/2000/Oct/0056.html" |
| 2050 |
name="the discussion in the Groff mailing list on the support of |
name="the discussion in the Groff mailing list on the support of |
| 2051 |
UTF-8 and locale-specific character codes">, mainly held by Werner |
UTF-8 and locale-specific encodings">, mainly held by Werner |
| 2052 |
LEMBERG, an experienced developer of GNU roff, and Tomohiro KUBOTA, |
LEMBERG, an experienced developer of GNU roff, and Tomohiro KUBOTA, |
| 2053 |
the author of this document. |
the author of this document. |
| 2054 |
</p> |
</p> |
| 2055 |
|
|
| 2056 |
|
|
| 2057 |
|
|
| 2058 |
<sect id="iconv">nl_langinfo() and iconv()</heading> |
<sect id="iconv"><heading><tt>nl_langinfo()</tt> and <tt>iconv()</tt></heading> |
| 2059 |
|
|
| 2060 |
<p> |
<p> |
| 2061 |
Though ISO C defines extensive LOCALE-related functions, |
Though ISO C defines extensive LOCALE-related functions, |
| 2062 |
you may want more extensive support. You may also want |
you may want more extensive support. You may also want |
| 2063 |
conversion between different character codes. |
conversion between different encodings. |
| 2064 |
There are C functions which can be used for such purposes. |
There are C functions which can be used for such purposes. |
| 2065 |
</p> |
</p> |
| 2066 |
|
|
| 2098 |
<item>format of time (am/pm format) (<tt>T_FMT_AMPM</tt>) |
<item>format of time (am/pm format) (<tt>T_FMT_AMPM</tt>) |
| 2099 |
<item>format of time (era-based) (<tt>ERA_T_FMT</tt>) |
<item>format of time (era-based) (<tt>ERA_T_FMT</tt>) |
| 2100 |
<item>radix (<tt>RADIXCHAR</tt>) |
<item>radix (<tt>RADIXCHAR</tt>) |
| 2101 |
<item>thousands separater (<tt>THOUSEP</tt>) |
<item>thousands separator (<tt>THOUSEP</tt>) |
| 2102 |
<item>alternative characters for numerics (<tt>ALT_DIGITS</tt>) |
<item>alternative characters for numerics (<tt>ALT_DIGITS</tt>) |
| 2103 |
<item>affirmative word (<tt>YESSTR</tt>) |
<item>affirmative word (<tt>YESSTR</tt>) |
| 2104 |
<item>affirmative response (<tt>YESEXPR</tt>) |
<item>affirmative response (<tt>YESEXPR</tt>) |
| 2105 |
<item>negative word (<tt>NOSTR</tt>) |
<item>negative word (<tt>NOSTR</tt>) |
| 2106 |
<item>negative response (<tt>NOEXPR</tt>) |
<item>negative response (<tt>NOEXPR</tt>) |
| 2107 |
<item>character code (<tt>CODESET</tt>) |
<item>encoding (<tt>CODESET</tt>) |
| 2108 |
</list> |
</list> |
| 2109 |
For example, you can get names for months and use them for |
For example, you can get names for months and use them for |
| 2110 |
your original output algorithm. <tt>YESEXPR</tt> and |
your original output algorithm. <tt>YESEXPR</tt> and |
| 2114 |
|
|
| 2115 |
<p> |
<p> |
| 2116 |
<tt>iconv_open()</tt>, <tt>iconv()</tt>, and <tt>iconv_close()</tt> |
<tt>iconv_open()</tt>, <tt>iconv()</tt>, and <tt>iconv_close()</tt> |
| 2117 |
are functions to perform conversion between character codes. |
are functions to perform conversion between encodings. |
| 2118 |
Please consult manpages for them. |
Please consult manpages for them. |
| 2119 |
</p> |
</p> |
| 2120 |
|
|
| 2121 |
<p> |
<p> |
| 2122 |
Combining <tt>nl_langinfo()</tt> and <tt>iconv()</tt>, |
Combining <tt>nl_langinfo()</tt> and <tt>iconv()</tt>, |
| 2123 |
you can easily modify Unicode-enabled software into locale-sensible |
you can easily modify Unicode-enabled software into locale-sensible |
| 2124 |
truely internationalized software. |
truly internationalized software. |
| 2125 |
</p> |
</p> |
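One possible way to combine <tt>nl_langinfo(CODESET)</tt> and <tt>iconv()</tt> is sketched below: software that handles UTF-8 internally converts its output to whatever encoding the current locale uses. The function names <tt>convert_encoding()</tt> and <tt>utf8_to_locale()</tt> are invented for this example, the whole input is converted in one call for simplicity, and which encoding names <tt>iconv_open()</tt> accepts depends on the system.

```c
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes at in from encoding `from` to
 * encoding `to`.  Returns the number of bytes written to out, or
 * (size_t)-1 on failure (unknown encoding, invalid input, or an output
 * buffer that is too small). */
size_t convert_encoding(const char *from, const char *to,
                        const char *in, size_t inlen,
                        char *out, size_t outsize)
{
    iconv_t cd = iconv_open(to, from);   /* note the order: to, then from */
    char *inp = (char *)in;              /* iconv() takes char **, not const char ** */
    char *outp = out;
    size_t outleft = outsize;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    rc = iconv(cd, &inp, &inlen, &outp, &outleft);
    if (rc != (size_t)-1)                /* flush state of stateful encodings */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outsize - outleft;
}

/* Hypothetical helper: convert UTF-8 text to the encoding of the current
 * LC_CTYPE locale.  Assumes setlocale() has already been called. */
size_t utf8_to_locale(const char *in, size_t inlen, char *out, size_t outsize)
{
    return convert_encoding("UTF-8", nl_langinfo(CODESET),
                            in, inlen, out, outsize);
}
```

A production version would convert in a loop and handle <tt>E2BIG</tt> and <tt>EILSEQ</tt> separately; please consult the manpages for the details.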
| 2126 |
|
|
| 2127 |
<p> |
<p> |
| 2171 |
</p> |
</p> |
| 2172 |
|
|
| 2173 |
|
|
| 2174 |
|
<sect id="locale-limit"><heading>Limit of Locale technology</heading> |
| 2175 |
|
|
| 2176 |
|
<P> |
| 2177 |
|
Locale model has a limit. That is, it cannot handle two locales at |
| 2178 |
|
the same time. In particular, it cannot handle the relationship between two |
| 2179 |
|
locales at all. |
| 2180 |
|
</P> |
| 2181 |
|
|
| 2182 |
|
<P> |
| 2183 |
|
For example, EUC-JP, ISO 2022-JP, and Shift-JIS are popular encodings |
| 2184 |
|
in Japan. EUC-JP is the de-facto standard for UNIX systems, |
| 2185 |
|
ISO 2022-JP is the standard for Internet, and Shift-JIS is the |
| 2186 |
|
encoding for Windows and Macintosh. Thus, Japanese people have to |
| 2187 |
|
handle texts with these encodings. Text viewers such as <tt>jless</tt> |
| 2188 |
|
and <tt>lv</tt> and editors such as <tt>emacs</tt> can automatically |
| 2189 |
|
understand the encoding to be read. You cannot write such a software |
| 2190 |
|
using Locale technology. |
| 2191 |
|
</P> |
| 2192 |
|
|
| 2193 |
|
|
| 2194 |
|
|
| 2195 |
<chapt id="output"><heading>Output to Display</heading> |
<chapt id="output"><heading>Output to Display</heading> |
| 2248 |
<P> |
<P> |
| 2249 |
The feature of console is that: |
The feature of console is that: |
| 2250 |
<list> |
<list> |
| 2251 |
<item>All what a software has to do is to send a correct character |
<item>All what a software has to do is to send a correct encoding |
| 2252 |
code to standard output. Softwares on console don't need to |
to standard output. Softwares on console don't need to |
| 2253 |
care about fonts and so on. |
care about fonts and so on. |
| 2254 |
<item>Fonts with fixed sizes are used. The unit of the width |
<item>Fonts with fixed sizes are used. The unit of the width |
| 2255 |
of the font is called 'column'. 'Doublewidth' fonts, i.e., |
of the font is called 'column'. 'Doublewidth' fonts, i.e., |
| 2260 |
</list> |
</list> |
| 2261 |
</P> |
</P> |
| 2262 |
|
|
| 2263 |
<sect1 id="output-console-code"><heading>Character Code</heading> |
<sect1 id="output-console-code"><heading>Encoding</heading> |
| 2264 |
|
|
| 2265 |
<P> |
<P> |
| 2266 |
Softwares running on the console are not responsible for displaying. |
Softwares running on the console are not responsible for displaying. |
| 2267 |
The console itself is responsible. There are consoles |
The console itself is responsible. There are consoles |
| 2268 |
which can display character codes other than ASCII such as |
which can display encodings other than ASCII such as |
| 2269 |
<taglist> |
<taglist> |
| 2270 |
<tag>kon2 |
<tag>kon2 |
| 2271 |
<item>EUC-JP, Shift-JIS, and ISO-2022-JP |
<item>EUC-JP, Shift-JIS, and ISO-2022-JP |
| 2272 |
<tag>jfbterm |
<tag>jfbterm |
| 2273 |
<item>EUC-JP, ISO-2022-jp, and ISO-2022 (including any 94, 96, |
<item>EUC-JP, ISO 2022-JP, and ISO 2022 (including any 94, 96, |
| 2274 |
and 94x94 character sets whose fonts are available) |
and 94x94 coded character sets whose fonts are available) |
| 2275 |
<tag>kterm |
<tag>kterm |
| 2276 |
<item>EUC-JP, Shift-JIS, ISO-2022-JP, and ISO-2022 (including |
<item>EUC-JP, Shift-JIS, ISO 2022-JP, and ISO 2022 (including |
| 2277 |
ISO8859-{1,2,3,4,5,6,7,8,9}, JISX0201, JISX0208, JISX0212, |
ISO8859-{1,2,3,4,5,6,7,8,9}, JISX 0201, JISX 0208, JISX 0212, |
| 2278 |
GB2312, and KSC5601) |
GB 2312, and KSC 5601) |
| 2279 |
<tag>krxvt |
<tag>krxvt |
| 2280 |
<item>EUC-JP |
<item>EUC-JP |
| 2281 |
<tag>crxvt-gb |
<tag>crxvt-gb |
| 2283 |
<tag>crxvt-big5 |
<tag>crxvt-big5 |
| 2284 |
<item>Big5 |
<item>Big5 |
| 2285 |
<tag>hanterm |
<tag>hanterm |
| 2286 |
<item>EUC-KR, Johab, and ISO-2022-KR |
<item>EUC-KR, Johab, and ISO 2022-KR |
| 2287 |
<tag>xiterm+thai |
<tag>xiterm+thai |
| 2288 |
<item>TIS620 |
<item>TIS 620 |
| 2289 |
<tag>xterm |
<tag>xterm |
| 2290 |
<item>UTF-8 |
<item>UTF-8 |
| 2291 |
</taglist> |
</taglist> |
| 2292 |
However, there is no way for software on a console to know which |
However, there is no way for software on a console to know which |
| 2293 |
character code is available. I think it is a responsibility for |
encoding is available. I think it is a responsibility for |
| 2294 |
a user to properly set LC_CTYPE locale (i.e. LC_ALL, LC_CTYPE, or LANG |
a user to properly set LC_CTYPE locale (i.e. LC_ALL, LC_CTYPE, or LANG |
| 2295 |
environmental variable). Provided LC_CTYPE locale is set properly, |
environmental variable). Provided LC_CTYPE locale is set properly, |
| 2296 |
a software can use it to know which character code to be supported |
a software can use it to know which encoding to be supported |
| 2297 |
by the console. |
by the console. |
| 2298 |
</P> |
</P> |
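Provided the user has set the locale properly, a console program can ask which encoding it is expected to output. A minimal sketch follows; <tt>console_encoding()</tt> is a hypothetical name, and the returned string (for example <tt>EUC-JP</tt> or <tt>UTF-8</tt>) is spelled differently on different systems.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: name of the encoding selected by the user's
 * LC_ALL, LC_CTYPE, or LANG setting. */
const char *console_encoding(void)
{
    /* Adopt the user's settings.  If this fails, the program simply
     * stays in the default "C" locale. */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

The program still cannot verify that the console really displays this encoding; it can only trust the locale setting.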
| 2299 |
|
|
| 2310 |
using ASCII. |
using ASCII. |
| 2311 |
<list> |
<list> |
| 2312 |
<item>8-bit cleanness. I think everyone understands this. |
<item>8-bit cleanness. I think everyone understands this. |
| 2313 |
<item>Continuity of multibyte characters. In multibyte character |
<item>Continuity of multibyte characters. In multibyte encodings |
| 2314 |
codes such as EUC-JP and UTF-8, one character may consist |
such as EUC-JP and UTF-8, one character may consist |
| 2315 |
of two or more bytes. These bytes should be output |
of two or more bytes. These bytes should be output |
| 2316 |
contiguously. Insertion of additional codes between the |
contiguously. Insertion of additional codes between the |
| 2317 |
continuing bytes can break the character. I have seen a |
continuing bytes can break the character. I have seen a |
| 2345 |
<P> |
<P> |
| 2346 |
If your software inputs a string from keyboard, you will have to |
If your software inputs a string from keyboard, you will have to |
| 2347 |
take more care. The numbers of characters, bytes, and columns all |
take more care. The numbers of characters, bytes, and columns all |
| 2348 |
differ. For example, in UTF-8 character code, one character of |
differ. For example, in UTF-8 encoding, one character of |
| 2349 |
'a' with acute accent occupies two bytes and one column. One |
'a' with acute accent occupies two bytes and one column. One |
| 2350 |
character of CJK-ideograph occupies three bytes and two columns. |
character of CJK-ideograph occupies three bytes and two columns. |
| 2351 |
For example, if the user types 'Backspace', how many backspace |
For example, if the user types 'Backspace', how many backspace |
| 2382 |
at the same time. This is related to the distinction between |
at the same time. This is related to the distinction between |
| 2383 |
coded character set (CCS) and character encoding scheme (CES) |
coded character set (CCS) and character encoding scheme (CES) |
| 2384 |
which I wrote at the section of <ref id="coding-general-term">. |
which I wrote at the section of <ref id="coding-general-term">. |
| 2385 |
Some character codes in the world use multiple coded character |
Some encodings in the world use multiple coded character |
| 2386 |
sets at the same time. This is the reason we have to handle |
sets at the same time. This is the reason we have to handle |
| 2387 |
multiple X fonts at the same time. |
multiple X fonts at the same time. |
| 2388 |
<footnote> |
<footnote> |
| 2389 |
Though UTF-8 is a character code with single CCS, the current |
Though UTF-8 is an encoding with single CCS, the current |
| 2390 |
version of XFree86 (4.0.1) needs multiple fonts to handle UTF-8. |
version of XFree86 (4.0.1) needs multiple fonts to handle UTF-8. |
| 2391 |
</footnote> |
</footnote> |
| 2392 |
</P> |
</P> |
| 2396 |
locale (LC_CTYPE)-sensible. This means that you have to |
locale (LC_CTYPE)-sensible. This means that you have to |
| 2397 |
call <tt>setlocale()</tt> before you use XFontSet-related |
call <tt>setlocale()</tt> before you use XFontSet-related |
| 2398 |
functions. And more, you have to specify the string you want |
functions. And more, you have to specify the string you want |
| 2399 |
to draw as a mulbibyte character or a wide character. |
to draw as a multibyte character or a wide character. |
| 2400 |
</P> |
</P> |
| 2401 |
|
|
| 2402 |
<P> |
<P> |
| 2438 |
The upstream developers of X clients sometimes hate to enforce |
The upstream developers of X clients sometimes hate to enforce |
| 2439 |
users to set such environmental variables. |
users to set such environmental variables. |
| 2440 |
<footnote> |
<footnote> |
| 2441 |
IMO, all users will have to set LANG properly when UTF-8 will |
IMHO, all users will have to set LANG properly when UTF-8 will |
| 2442 |
become popular. |
become popular. |
| 2443 |
</footnote> |
</footnote> |
| 2444 |
In such a case, |
In such a case, |
| 2448 |
If <tt>setlocale()</tt> returns <tt>NULL</tt>, <tt>"C"</tt>, |
If <tt>setlocale()</tt> returns <tt>NULL</tt>, <tt>"C"</tt>, |
| 2449 |
or <tt>"POSIX"</tt>, use |
or <tt>"POSIX"</tt>, use |
| 2450 |
<tt>XFontStruct</tt> way. Otherwise use <tt>XFontSet</tt> way. |
<tt>XFontStruct</tt> way. Otherwise use <tt>XFontSet</tt> way. |
| 2451 |
The author implemented this algoritym to a few window managers |
The author implemented this algorithm to a few window managers |
| 2452 |
such as TWM (version 4.0.1d), Blackbox (0.60.1), IceWM (1.0.0), |
such as TWM (version 4.0.1d), Blackbox (0.60.1), IceWM (1.0.0), |
| 2453 |
sawmill (0.28), and so on. |
sawmill (0.28), and so on. |
| 2454 |
</P> |
</P> |
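<P>
The fallback rule described above can be sketched in C as follows.
This is only a minimal sketch, not the code used in the window managers
mentioned; the function names <tt>should_use_fontset</tt> and
<tt>fontset_wanted_by_environment</tt> are hypothetical and the actual
X font loading is left out.
</P>

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: decide from the value returned by setlocale().
 * Returns 1 for the XFontSet way, 0 for the XFontStruct way. */
int should_use_fontset(const char *locale_name)
{
    if (locale_name == NULL)                  /* setlocale() failed */
        return 0;
    if (strcmp(locale_name, "C") == 0 ||
        strcmp(locale_name, "POSIX") == 0)    /* no locale was requested */
        return 0;
    return 1;
}

/* Hypothetical helper: ask the C library which LC_CTYPE locale the
 * environment selects and apply the rule above. */
int fontset_wanted_by_environment(void)
{
    return should_use_fontset(setlocale(LC_CTYPE, ""));
}
```

<P>
A client would call <tt>fontset_wanted_by_environment()</tt> once at
start-up and select its drawing code according to the result.
</P>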
| 2480 |
|
|
| 2481 |
|
|
| 2482 |
|
|
| 2483 |
|
|
| 2484 |
|
|
| 2485 |
|
|
| 2486 |
|
|
| 2487 |
|
|
| 2488 |
|
|
| 2489 |
|
|
| 2490 |
|
|
| 2491 |
|
|
| 2492 |
|
|
| 2493 |
|
|
| 2494 |
|
|
| 2495 |
<chapt id="input"><heading>Input from Keyboard</heading> |
<chapt id="input"><heading>Input from Keyboard</heading> |
| 2496 |
|
|
| 2497 |
<P> |
<P> |
| 2609 |
<chapt id="internal"><heading>Internal Processing and File I/O</heading> |
<chapt id="internal"><heading>Internal Processing and File I/O</heading> |
| 2610 |
|
|
| 2611 |
<P> |
<P> |
| 2612 |
From a user's point of view, a software can use any internal character |
From a user's point of view, a software can use any internal encodings |
| 2613 |
codes if I/O is done correctly. It is because a user cannot be aware of |
if I/O is done correctly. It is because a user cannot be aware of |
| 2614 |
which kind of internal code is used in the software. |
which kind of internal code is used in the software. |
| 2615 |
</P> |
</P> |
| 2616 |
|
|
| 2624 |
approach are: |
approach are: |
| 2625 |
<list> |
<list> |
| 2626 |
<item>The programmer don't need to know the detail of |
<item>The programmer don't need to know the detail of |
| 2627 |
international and local character codes. |
international and local encodings. |
| 2628 |
</list> |
</list> |
| 2629 |
However, there are a few disadvantages: |
However, there are a few disadvantages: |
| 2630 |
<list> |
<list> |
| 2825 |
|
|
| 2826 |
<chapt id="other"><heading>Other Special Topics</heading> |
<chapt id="other"><heading>Other Special Topics</heading> |
| 2827 |
|
|
|
<sect id="locale-"><heading>Locale in C</heading> |
|
|
|
|
|
<P> |
|
|
Locale is the main facility for I18N in the C language.
|
|
</P> |
|
|
|
|
|
<P> |
|
|
The locale model is that software changes its behavior
according to its language environment. The environment can be
set independently for six categories:
LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC,
and LC_TIME.
The C library supplies a set of functions which change their
behavior according to one of the six locale categories.
To internationalize software, use these functions.
Don't forget to call the <tt>setlocale</tt> function first;
otherwise these functions will not change their behavior.
|
|
</P> |
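<P>
As an illustration of this point, the following sketch selects the locale
and then reads one locale-dependent value. The helper names
<tt>init_locale_from_environment</tt> and <tt>current_decimal_point</tt>
are hypothetical; <tt>setlocale()</tt> and <tt>localeconv()</tt> are
standard C.
</P>

```c
#include <locale.h>

/* Hypothetical helper: select the locale named by the environment.
 * Returns 1 on success, 0 if the C library rejected the request
 * (the program then stays in its previous locale). */
int init_locale_from_environment(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Returns the decimal point string of the current LC_NUMERIC locale.
 * Without a prior setlocale() call the program runs in the "C" locale
 * and this is always ".". */
const char *current_decimal_point(void)
{
    return localeconv()->decimal_point;
}
```

<P>
A program would call <tt>init_locale_from_environment()</tt> at the top
of <tt>main()</tt>, before any other locale-dependent function.
</P>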
|
|
|
|
|
<P> |
|
|
If <tt>setlocale(LC_ALL, "")</tt> is called at the start of the
software, the environment is chosen by environment variables
whose names are the same as the names of the categories.
If the LC_ALL variable is defined, it takes precedence over these
variables. If neither LC_ALL nor the category variable is defined,
the LANG variable is used.
If LANG is not defined either, the 'C' locale, which means default behavior,
is used.
|
|
</P> |
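<P>
The precedence described above can be written down as a small function.
This is only a sketch of the rule, not what the C library actually runs;
the name <tt>resolve_locale_name</tt> is hypothetical, and treating an
empty value as undefined follows the POSIX convention.
</P>

```c
#include <stddef.h>

/* A variable counts as defined only if it exists and is not empty. */
static int is_set(const char *value)
{
    return value != NULL && value[0] != '\0';
}

/* Hypothetical helper: returns the locale name that applies to one
 * category, given the values of LC_ALL, of the category variable
 * (for example LC_CTYPE) and of LANG. NULL means "not in environment". */
const char *resolve_locale_name(const char *lc_all,
                                const char *lc_category,
                                const char *lang)
{
    if (is_set(lc_all))
        return lc_all;          /* LC_ALL overrides everything */
    if (is_set(lc_category))
        return lc_category;     /* then the category variable */
    if (is_set(lang))
        return lang;            /* then LANG */
    return "C";                 /* default behavior */
}
```

<P>
For the LC_CTYPE category the call would be
<tt>resolve_locale_name(getenv("LC_ALL"), getenv("LC_CTYPE"), getenv("LANG"))</tt>.
</P>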
|
|
|
|
|
<P> |
|
|
Though valid values for these environment variables (locale names)
depend on the kind and set-up of the OS, the format of locale names
is usually like <tt>ja_JP.eucJP</tt>, where the two lowercase letters
mean the language (<tt>ja</tt> = Japanese), the two capital letters
mean the country (<tt>JP</tt> = Japan), and the characters after the dot mean
the character code (<tt>eucJP</tt> = EUC-JP). Type <tt>locale -a</tt> to
display all valid locale names. However, it is the user's responsibility to
set a proper value for the LANG variable, and developers don't need to
be aware of its value.
|
|
</P> |
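<P>
Only to make the naming format concrete, here is a sketch which splits a
locale name into its three parts. The type <tt>struct locale_parts</tt>
and the function <tt>split_locale_name</tt> are hypothetical; the sketch
assumes the usual <tt>language_COUNTRY.codeset</tt> layout and ignores
a trailing <tt>@modifier</tt>. As noted above, ordinary software should
not need to do this.
</P>

```c
#include <stddef.h>
#include <string.h>

struct locale_parts {
    char language[16];
    char country[16];
    char codeset[32];
};

/* Copies at most dst_size - 1 bytes and always terminates dst. */
static void copy_part(char *dst, size_t dst_size, const char *src, size_t len)
{
    if (len >= dst_size)
        len = dst_size - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Hypothetical helper: splits "ll_CC.codeset"; missing parts become "". */
void split_locale_name(const char *name, struct locale_parts *out)
{
    size_t total    = strcspn(name, "@");    /* length without "@modifier" */
    size_t main_len = strcspn(name, ".@");   /* length of "ll_CC" */
    size_t lang_len = strcspn(name, "_.@");  /* length of "ll" */

    copy_part(out->language, sizeof out->language, name, lang_len);

    if (lang_len < main_len)                 /* an underscore was found */
        copy_part(out->country, sizeof out->country,
                  name + lang_len + 1, main_len - lang_len - 1);
    else
        out->country[0] = '\0';

    if (main_len < total)                    /* a dot was found */
        copy_part(out->codeset, sizeof out->codeset,
                  name + main_len + 1, total - main_len - 1);
    else
        out->codeset[0] = '\0';
}
```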
|
|
|
|
|
<P> |
|
|
Many people tend to think that I18N means <tt>gettext</tt>ization
and translation of messages. However, that is merely one category
(LC_MESSAGES) out of six. The most important category
is LC_CTYPE.
|
|
</P> |
|
|
|
|
|
<P> |
|
|
Note, however, that M17N is not achieved by the locale mechanism,
especially LC_CTYPE. You have to use international
character codes such as ISO 2022 and Unicode instead of the LC_CTYPE mechanism
to write M17N-ed software. Moreover, the LC_CTYPE mechanism is
sometimes insufficient even for I18N. For example,
<package>jless</package> is a text file viewer which can
automatically distinguish three Japanese character codes and convert
them into the desired character code. You cannot write such software using
the LC_CTYPE mechanism.
|
|
</P> |
|
|
|
|
|
<sect id="wchar-"><heading>Multibyte and Wide characters in C</heading> |
|
|
|
|
|
<P> |
|
|
The standard C library supplies functions to handle multibyte and
wide characters. These functions are sensitive to the LC_CTYPE
locale category.
|
|
</P> |
|
|
|
|
|
<P> |
|
|
Multibyte character is the character code which is used for <em>real</em>
input/output. In other words, <em>the character code you usually use</em>
is the multibyte character, whatever language you speak.
If you use ISO-8859-1, it is your multibyte character.
If you use EUC-KR, it is your multibyte character.
Despite the name, a multibyte character may or may not be
expressed in multiple bytes.
|
|
</P> |
|
|
|
|
|
<P> |
|
|
Since multibyte characters can be stateful (that is, can have
shift states) and the number of bytes per character does not
have to be constant, implementation using multibyte characters
can be difficult. For example, it may be difficult even to count
the number of characters. This is why wide characters are used.
|
|
</P> |
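<P>
To show why even counting is not trivial, the following sketch counts
the characters of a multibyte string by walking through it with
<tt>mbrtowc()</tt>. The function name <tt>count_mb_chars</tt> is
hypothetical, and the result depends on the LC_CTYPE locale which is
in effect when it is called.
</P>

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns the number of characters in the
 * multibyte string s, or (size_t)-1 if s contains an invalid or
 * truncated sequence for the current LC_CTYPE locale. */
size_t count_mb_chars(const char *s)
{
    mbstate_t state;
    size_t left = strlen(s);    /* bytes not yet examined */
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* start in the initial shift state */
    while (left > 0) {
        /* A NULL first argument means: parse one character, store nothing. */
        size_t n = mbrtowc(NULL, s, left, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (n == 0)                     /* a null character: stop */
            break;
        s += n;
        left -= n;
        count++;
    }
    return count;
}
```

<P>
In a UTF-8 locale, <tt>strlen()</tt> reports two bytes for a string
which holds only 'a' with acute accent, while this function counts
one character.
</P>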
|
|
|
|
|
<P> |
|
|
Wide character is a character code supplied by the standard C library
for easy handling of international strings.
Wide characters are stateless and every wide character has
the same size. Functions for conversion between multibyte characters
and wide characters (and between strings of multibyte characters and
strings of wide characters) are supplied by the library.
A wide character is expressed using the <tt>wchar_t</tt> type.
A string of wide characters is expressed
as an array of <tt>wchar_t</tt>, just as a string of ASCII characters is expressed
as an array of <tt>char</tt>.
|
|
</P> |
|
|
|
|
|
<P> |
|
|
Thus it is convenient to input multibyte characters from a stream,
convert them into wide characters, process them, convert them back into
multibyte characters, and output them to a stream. <tt>wchar_t</tt> is
used as an internal code.
|
|
</P> |
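<P>
The flow described above might look like the following sketch, which
converts a multibyte string into wide characters, converts each
character to upper case as a stand-in for the processing step, and
converts the result back. The function name <tt>mb_toupper_copy</tt> is
hypothetical; all library calls are standard C.
</P>

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns a newly allocated upper-case copy of
 * the multibyte string in, or NULL on error. The caller frees it. */
char *mb_toupper_copy(const char *in)
{
    mbstate_t st;
    const char *src = in;
    const wchar_t *wsrc;
    wchar_t *wide;
    char *out;
    size_t nwide, nbytes, i;

    /* 1. multibyte -> wide: a NULL destination only counts. */
    memset(&st, 0, sizeof st);
    nwide = mbsrtowcs(NULL, &src, 0, &st);
    if (nwide == (size_t)-1)
        return NULL;                        /* invalid input */
    wide = malloc((nwide + 1) * sizeof *wide);
    if (wide == NULL)
        return NULL;
    memset(&st, 0, sizeof st);
    src = in;
    mbsrtowcs(wide, &src, nwide + 1, &st);

    /* 2. process: every element of wide[] is one whole character. */
    for (i = 0; i < nwide; i++)
        wide[i] = (wchar_t)towupper((wint_t)wide[i]);

    /* 3. wide -> multibyte, again counting first. */
    memset(&st, 0, sizeof st);
    wsrc = wide;
    nbytes = wcsrtombs(NULL, &wsrc, 0, &st);
    if (nbytes == (size_t)-1) {
        free(wide);
        return NULL;
    }
    out = malloc(nbytes + 1);
    if (out != NULL) {
        memset(&st, 0, sizeof st);
        wsrc = wide;
        wcsrtombs(out, &wsrc, nbytes + 1, &st);
    }
    free(wide);
    return out;
}
```

<P>
The processing step never sees a partial character, which is the
point of using <tt>wchar_t</tt> as the internal code.
</P>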
|
|
|
|
|
<P> |
|
|
Functions for conversion between multibyte and wide characters/strings |
|
|
are shown below: |
|
|
<list> |
|
|
<item><tt>mbtowc()</tt> and <tt>mbrtowc()</tt> to convert
from multibyte to wide character.
<item><tt>mblen()</tt> and <tt>mbrlen()</tt> to obtain the number
of bytes which make up a multibyte character.
<item><tt>mbstowcs()</tt> and <tt>mbsrtowcs()</tt> to convert from
multibyte to wide character string.
<item><tt>wctomb()</tt> and <tt>wcrtomb()</tt> to convert from wide
to multibyte character.
<item><tt>wcstombs()</tt> and <tt>wcsrtombs()</tt> to convert from
wide to multibyte character string.
<item><tt>mbsinit()</tt> to check shift status.
<item><tt>btowc()</tt> and <tt>wctob()</tt> to convert between
single-byte and wide characters.
|
|
</list> |
|
|
</P> |
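<P>
As a small example of the single-character functions in this list,
the sketch below uses <tt>wcrtomb()</tt> to find out how many bytes
the current locale needs for one wide character. The function name
<tt>mb_length_of</tt> is hypothetical.
</P>

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns the number of bytes which the current
 * LC_CTYPE locale needs to express wc, starting from the initial
 * shift state, or 0 if wc cannot be expressed in this locale. */
size_t mb_length_of(wchar_t wc)
{
    char buf[MB_LEN_MAX];   /* large enough for any single character */
    mbstate_t st;
    size_t n;

    memset(&st, 0, sizeof st);
    n = wcrtomb(buf, wc, &st);
    return n == (size_t)-1 ? 0 : n;
}
```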
|
|
|
|
|
<P> |
|
|
The '<tt>r</tt>' versions of these functions (for example, <tt>mbrtowc</tt>)
have an additional parameter, a pointer to an <tt>mbstate_t</tt>
variable which contains the shift status. Since the non-'<tt>r</tt>'
versions of these functions keep the shift status in an internal
(static) variable, they can treat only one sequence of strings at a time.
|
|
</P> |
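<P>
The benefit of the '<tt>r</tt>' versions can be seen when two strings
are read alternately. In the following sketch each reader carries its
own <tt>mbstate_t</tt>, so the shift status of one string cannot
disturb the other. The type <tt>struct mb_reader</tt> and its two
functions are hypothetical.
</P>

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* One reader per string; the shift status lives inside the reader. */
struct mb_reader {
    const char *pos;    /* next byte to read */
    size_t left;        /* bytes remaining */
    mbstate_t state;    /* shift status of this string only */
};

void mb_reader_init(struct mb_reader *r, const char *s)
{
    r->pos = s;
    r->left = strlen(s);
    memset(&r->state, 0, sizeof r->state);
}

/* Stores the next character in *out and returns 1.
 * Returns 0 at the end of the string or on an invalid sequence. */
int mb_reader_next(struct mb_reader *r, wchar_t *out)
{
    size_t n;

    if (r->left == 0)
        return 0;
    n = mbrtowc(out, r->pos, r->left, &r->state);
    if (n == (size_t)-1 || n == (size_t)-2 || n == 0)
        return 0;
    r->pos += n;
    r->left -= n;
    return 1;
}
```

<P>
With <tt>mbtowc()</tt> the same interleaving would mix the shift
status of both strings in the single internal variable.
</P>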
|
|
|
|
|
<P> |
|
|
See manpages of these functions for further information. |
|
|
</P> |
|
|
|
|
|
<P> |
|
|
The representation of <tt>wchar_t</tt> is not specified by any
standard, though glibc uses UCS-4. You must not
assume a particular representation of <tt>wchar_t</tt>.
|
|
</P> |
|
|
|
|
|
<P> |
|
|
Though usual functions such as <tt>printf()</tt> can be used for multibyte
characters for input/output, one has to take care of the escape
character '<tt>%</tt>' used in formatted input/output functions, because
a part of a multibyte character can have the same value as the ASCII
code of '<tt>%</tt>'.
|
|
</P> |
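<P>
One common way to stay safe, sketched below, is never to pass text as
the format argument itself but always as data for a fixed
<tt>%s</tt> format, or to use a function which does not interpret
formats at all. The function names <tt>print_text</tt> and
<tt>format_text</tt> are hypothetical.
</P>

```c
#include <stddef.h>
#include <stdio.h>

/* Writes text to fp without interpreting any byte of it. */
int print_text(FILE *fp, const char *text)
{
    return fputs(text, fp);
}

/* Copies text into buf through a fixed format string, so a 0x25
 * byte inside text is treated as data, not as a conversion. */
int format_text(char *buf, size_t size, const char *text)
{
    return snprintf(buf, size, "%s", text);
}
```

<P>
For example, the three katakana characters of the Japanese word for
"test" are written in ISO-2022-JP as the bytes <tt>%F%9%H</tt> between
the escape sequences. Passing such a string directly as a format
to <tt>printf()</tt> would make the library try to interpret those
bytes as conversions.
</P>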
|
| 2828 |
|
|
| 2829 |
|
|
| 2830 |
<sect id="gettext"><heading>Gettext</heading> |
<sect id="gettext"><heading>Gettext</heading> |
| 2849 |
read the copyright mark NOW in THIS document) is non-ASCII character |
read the copyright mark NOW in THIS document) is non-ASCII character |
| 2850 |
(0xa9 in ISO-8859-1). |
(0xa9 in ISO-8859-1). |
| 2851 |
Otherwise, translators may feel difficulty to edit catalog files |
Otherwise, translators may feel difficulty to edit catalog files |
| 2852 |
because of conflict between character codes for <tt>msgid</tt> and in |
because of conflict between encodings for <tt>msgid</tt> and in |
| 2853 |
<tt>msgstr</tt>. |
<tt>msgstr</tt>. |
| 2854 |
</P> |
</P> |
| 2855 |
|
|
| 2864 |
|
|
| 2865 |
<P> |
<P> |
| 2866 |
The 2nd (3rd, ...) byte of multibyte characters or |
The 2nd (3rd, ...) byte of multibyte characters or |
| 2867 |
all bytes of non-ASCII characters in stateful character codes |
all bytes of non-ASCII characters in stateful encodings |
| 2868 |
can be 0x5c (same to backslash in ASCII) or 0x22 |
can be 0x5c (same to backslash in ASCII) or 0x22 |
| 2869 |
(same to double quote in ASCII). |
(same to double quote in ASCII). |
| 2870 |
These characters have to properly escaped because |
These characters have to properly escaped because |
| 3057 |
because we are discussing about i18n. |
because we are discussing about i18n. |
| 3058 |
Sub types for <tt>text</tt> are <tt>plain</tt>, <tt>enriched</tt>, |
Sub types for <tt>text</tt> are <tt>plain</tt>, <tt>enriched</tt>, |
| 3059 |
<tt>html</tt>, and so on. <tt>charset</tt> parameter can also be |
<tt>html</tt>, and so on. <tt>charset</tt> parameter can also be |
| 3060 |
added to specify character codes. |
added to specify encodings. |
| 3061 |
<tt>US-ASCII</tt>, <tt>ISO-8859-1</tt>, |
<tt>US-ASCII</tt>, <tt>ISO-8859-1</tt>, |
| 3062 |
<tt>ISO-8859-2</tt>, ..., <tt>ISO-8859-10</tt> are defined by |
<tt>ISO-8859-2</tt>, ..., <tt>ISO-8859-10</tt> are defined by |
| 3063 |
RFC 2046 for <tt>charset</tt>. This list can be added by writing |
RFC 2046 for <tt>charset</tt>. This list can be added by writing |
| 3080 |
in the main text of mail. On the other hand, RFC 2047 describes |
in the main text of mail. On the other hand, RFC 2047 describes |
| 3081 |
'encoded words' which is the way to write non-ASCII characters in the header. |
'encoded words' which is the way to write non-ASCII characters in the header. |
| 3082 |
It is like that: |
It is like that: |
| 3083 |
<tt>=?</tt><var>character code</var><tt>?</tt><var>conversion algorithm</var><tt>?</tt><var>data</var><tt>?=</tt>, |
<tt>=?</tt><var>encoding</var><tt>?</tt><var>conversion algorithm</var><tt>?</tt><var>data</var><tt>?=</tt>, |
| 3084 |
where <var>character code</var> is selected from the list of <tt>charset</tt> |
where <var>encoding</var> is selected from the list of <tt>charset</tt> |
| 3085 |
of <tt>Content-Type</tt> header, <var>algorithm</var> is <tt>Q</tt> |
of <tt>Content-Type</tt> header, <var>algorithm</var> is <tt>Q</tt> |
| 3086 |
or <tt>q</tt> for quoted-printable or <tt>B</tt> or <tt>b</tt> for |
or <tt>q</tt> for quoted-printable or <tt>B</tt> or <tt>b</tt> for |
| 3087 |
base64, and <var>data</var> is encoded data whose length is less than |
base64, and <var>data</var> is encoded data whose length is less than |
| 3112 |
</P> |
</P> |
| 3113 |
|
|
| 3114 |
<P> |
<P> |
| 3115 |
RFC 1866 describes that the default character code for HTML is |
RFC 1866 describes that the default encoding for HTML is |
| 3116 |
ISO-8859-1. However, many web pages are written in, |
ISO-8859-1. However, many web pages are written in, |
| 3117 |
for example, Japanese and Korean using (of course) character codes |
for example, Japanese and Korean using (of course) encodings |
| 3118 |
different from ISO-8859-1. |
different from ISO-8859-1. |
| 3119 |
Sometimes the HTML document describes: |
Sometimes the HTML document describes: |
| 3120 |
<example> |
<example> |
| 3121 |
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp"> |
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022-jp"> |
| 3122 |
</example> |
</example> |
| 3123 |
which declares that the page is written in ISO-2022-JP. |
which declares that the page is written in ISO-2022-JP. |
| 3124 |
However, there are many pages without any declaration of character code. |
However, there are many pages without any declaration of encoding. |
| 3125 |
</P> |
</P> |
| 3126 |
|
|
| 3127 |
<P> |
<P> |
| 3128 |
Web browsers have to deal with such a circumstance. |
Web browsers have to deal with such a circumstance. |
| 3129 |
Of course web browsers have to be able to deal with every |
Of course web browsers have to be able to deal with every |
| 3130 |
character codes in the world which is listed in MIME. |
encodings in the world which is listed in MIME. |
| 3131 |
However, many web browsers can only deal with ASCII |
However, many web browsers can only deal with ASCII |
| 3132 |
or ISO-8859-1. Such web browsers are useless at all |
or ISO-8859-1. Such web browsers are useless at all |
| 3133 |
for non-ASCII or non-ISO-8859-1 people. |
for non-ASCII or non-ISO-8859-1 people. |
| 3137 |
URL should be written in ASCII character, |
URL should be written in ASCII character, |
| 3138 |
though non-ASCII characters can be expressed |
though non-ASCII characters can be expressed |
| 3139 |
using <tt>%</tt><var>nn</var> sequence where <var>nn</var> |
using <tt>%</tt><var>nn</var> sequence where <var>nn</var> |
| 3140 |
is hexadegimal value. This is because there is |
is hexadecimal value. This is because there is |
| 3141 |
no way to specify character code. Western European people |
no way to specify encoding. Western European people |
| 3142 |
would treat it as ISO-8859-1, while Japanese people |
would treat it as ISO-8859-1, while Japanese people |
| 3143 |
would treat it as EUC-JP or SHIFT-JIS. |
would treat it as EUC-JP or SHIFT-JIS. |
| 3144 |
</P> |
</P> |
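<P>
The ambiguity can be made visible with a small sketch of the
<tt>%</tt><var>nn</var> notation. The function name
<tt>percent_encode</tt> is hypothetical, and the set of bytes which is
left unescaped (ASCII letters, digits, and <tt>- . _ ~</tt>) is an
assumption of this sketch, not something stated in this document.
</P>

```c
#include <stddef.h>

/* Hypothetical helper: writes the %nn form of in[0..in_len-1] into
 * out as a null-terminated string. Returns the length written, or
 * (size_t)-1 if out is too small. Assumes an ASCII-based system. */
size_t percent_encode(const unsigned char *in, size_t in_len,
                      char *out, size_t out_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i, o = 0;

    if (out_size == 0)
        return (size_t)-1;
    for (i = 0; i < in_len; i++) {
        unsigned char c = in[i];
        int keep = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                   (c >= '0' && c <= '9') ||
                   c == '-' || c == '.' || c == '_' || c == '~';
        size_t need = keep ? 1 : 3;

        if (o + need >= out_size)   /* keep room for the final '\0' */
            return (size_t)-1;
        if (keep) {
            out[o++] = (char)c;
        } else {
            out[o++] = '%';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        }
    }
    out[o] = '\0';
    return o;
}
```

<P>
The single character 'e' with acute accent is the byte 0xE9 in
ISO-8859-1 and becomes <tt>%E9</tt>, while in UTF-8 it is the bytes
0xC3 0xA9 and becomes <tt>%C3%A9</tt>. Nothing in the URL itself
tells the receiver which of the two was meant.
</P>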