Diff of /manuals/trunk/intro-i18n/intro-i18n.sgml

revision 1040 by kubota, Wed Nov 15 12:07:38 2000 UTC revision 1041 by kubota, Thu Nov 16 15:37:41 2000 UTC
# Line 72  real state and existing problems for eac Line 72  real state and existing problems for eac
72    
73  <P>  <P>
74  Minimum requirements, for example,  Minimum requirements, for example,
75  that characters should be displayed proper font (at least users  that characters should be displayed with fonts with
76  of the software must be able to guess what is written),  proper charset (at least users of the software must be
77    able to guess what is written),
78  that characters must be input from the keyboard, and  that characters must be input from the keyboard, and
79  that software must not destroy characters,  that software must not destroy characters,
80  are stressed in the document and I am trying to  are stressed in the document and I am trying to
# Line 153  I18N is needed for the following places. Line 154  I18N is needed for the following places.
154  <list>  <list>
155   <item>Display characters for users' native languages.   <item>Display characters for users' native languages.
156   <item>Input characters for users' native languages.   <item>Input characters for users' native languages.
157   <item>Handle files written in popular character codes   <item>Handle files written in popular encodings
158         <footnote>         <footnote>
159          There are a few terms related to character code,          There are a few terms related to character code,
160          such as character set, character code, charset,          such as character set, character code, charset,
# Line 218  Now I will introduce a few models other Line 219  Now I will introduce a few models other
219          Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual          Nemacs (Nihongo Emacs, an ancestor of MULE, MULtilingual
220          Emacs) text editor which can input/output Japanese text file,          Emacs) text editor which can input/output Japanese text file,
221          Hanterm X terminal emulator which can display and input          Hanterm X terminal emulator which can display and input
222          Korean characters via a few Korean character codes.          Korean characters via a few Korean encodings.
223          Since a programmer has his/her own mother tongue,          Since a programmer has his/her own mother tongue,
224          there are numerous L10N patches and L10N softwares          there are numerous L10N patches and L10N softwares
225          written to satisfy his/her own need.          written to satisfy his/her own need.
# Line 249  Now I will introduce a few models other Line 250  Now I will introduce a few models other
250          <footnote>          <footnote>
251            I recommend not to implement Unicode and UTF-8 directly.            I recommend not to implement Unicode and UTF-8 directly.
252            Instead, use LOCALE technology and your software will            Instead, use LOCALE technology and your software will
253            support not only UTF-8 but also many character codes            support not only UTF-8 but also many encodings
254            in the world.            in the world.
255          </footnote>          </footnote>
256     </p></item>     </p></item>
# Line 297  from an another viewpoint. Line 298  from an another viewpoint.
298          The advantage of this approach is that detailed and strict          The advantage of this approach is that detailed and strict
299          implementation is possible beyond the field where          implementation is possible beyond the field where
300          standardized methods are available, such as auto-detection          standardized methods are available, such as auto-detection
301          of character codes of text files to be read.  Language-specific          of encodings of text files to be read.  Language-specific
302          problems can be perfectly solved (of course it depends on          problems can be perfectly solved (of course it depends on
303          the skill of the programmer).  The disadvantages are          the skill of the programmer).  The disadvantages are
304          (1) that the number of supported languages is restricted          (1) that the number of supported languages is restricted
# Line 349  later even by other developers. Line 350  later even by other developers.
350  </P>  </P>
351    
352  <P>  <P>
353  M17N-model can be achieved using international character codes such  M17N-model can be achieved using international encodings such
354  as ISO-2022 and Unicode.  Though you can hard-code these character codes  as ISO 2022 and Unicode.  Though you can hard-code these encodings
355  for your software (i.e. approach B), I recommend to use standardized  for your software (i.e. approach B), I recommend to use standardized
356  LOCALE technology.  However, using international character codes  LOCALE technology.  However, using international encodings
357  is not sufficient to achieve M17N-model.  You will have to prepare  is not sufficient to achieve M17N-model.  You will have to prepare
358  a mechanism to switch <strong>input methods</strong>.  You will also want  a mechanism to switch <strong>input methods</strong>.  You will also want
359  to prepare a character code-guessing mechanism for input files.  to prepare an encoding-guessing mechanism for input files,
360    such as <prgn>jless</prgn> and <prgn>emacs</prgn> have.
361  Mule is the only software which achieved M17N (though it does not  Mule is the only software which achieved M17N (though it does not
362  use LOCALE technology).  use LOCALE technology).
363  </P>  </P>
# Line 370  Let's preview the contents of each chapt Line 372  Let's preview the contents of each chapt
372  I have already written that this document will put stress on  I have already written that this document will put stress on
373  correct handling of characters and character codes for users' native  correct handling of characters and character codes for users' native
374  languages.  To achieve this purpose, I will discuss on popular  languages.  To achieve this purpose, I will discuss on popular
375  character codes in the world at the first chapter of  character sets and encodings in the world at the first chapter of
376  <ref id="coding">.  You will not need the detailed  <ref id="coding">.  You will not need the detailed
377  knowledge of these character codes if you use LOCALE technology.  knowledge of these character codes if you use LOCALE technology.
378  The aim of this chapter is only for showing the concepts used in these  The aim of this chapter is only for showing the concepts used in these
# Line 424  codes. Line 426  codes.
426  </P>  </P>
427    
428  <P>  <P>
429  Here major character codes are introduced.  Here major character sets and encodings are introduced.
430  Note that you don't have to know the detail of these  Note that you don't have to know the detail of these
431  character codes if you use LOCALE and <tt>wchar_t</tt> technology.  character codes if you use LOCALE and <tt>wchar_t</tt> technology.
432  However, this knowledge will help you to understand why number  However, this knowledge will help you to understand why number
# Line 438  processing of existing character codes, Line 440  processing of existing character codes,
440  If you are planning to develop a text-processing software  If you are planning to develop a text-processing software
441  beyond the fields which the LOCALE technology covers, you will  beyond the fields which the LOCALE technology covers, you will
442  have to understand the following descriptions very well.  have to understand the following descriptions very well.
443  These fields include automatic detection of character code  These fields include automatic detection of encodings
444  used for the input file (Most of Japanese-capable text viewers  used for the input file (Most of Japanese-capable text viewers
445  such as <prgn>jless</prgn> and <prgn>lv</prgn> have this mechanism)  such as <prgn>jless</prgn> and <prgn>lv</prgn> have this mechanism)
446  and so on.  and so on.
# Line 480  At first I begin this chapter by definin Line 482  At first I begin this chapter by definin
482            such as citizen registration system, serious DTP such as            such as citizen registration system, serious DTP such as
483            newspaper system, and so on.            newspaper system, and so on.
484      </p></item>      </p></item>
485    <tag><strong>Character Code</strong>    <tag><strong>Encoding</strong>
486      <item><p>      <item><p>
487            Character code is a rule where characters and texts are            Encoding is a rule where characters and texts are
488            expressed in combinations of bits or bytes in order to            expressed in combinations of bits or bytes in order to
489            treat characters in computers.  Words of <em>character            treat characters in computers.  Words of <em>character
490            coding system</em>, <em>charset</em>, and so on are used            coding system</em>, <em>character code</em>, <em>charset</em>,
491            to express the same meaning.  Basically, character code            and so on are used to express the same meaning.
492            takes care of <em>characters</em>, not <em>glyphs</em>.            Basically, <em>encoding</em> takes care of
493            There are many official and de-facto standards of character            <em>characters</em>, not <em>glyphs</em>.
494            codes such as ASCII, ISO 8859-{1,2,...,15},            There are many official and de-facto standards of encodings
495              such as ASCII, ISO 8859-{1,2,...,15},
496            ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2},            ISO 2022-{JP, JP-1, JP-2, KR, CN, CN-EXT, INT-1, INT-2},
497            EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620,            EUC-{JP, KR, CN, TW}, Johab, UHC, Shift-JIS, Big5, TIS 620,
498            VISCII, VSCII, so-called 'CodePages', UTF-7, UTF-8, UTF-16LE,            VISCII, VSCII, so-called 'CodePages', UTF-7, UTF-8, UTF-16LE,
499            UTF-16BE, KOI8-R, and so on so on.            UTF-16BE, KOI8-R, and so on so on.
500            To construct a character code, we have to consider the            To construct an encoding, we have to consider the
501            following concepts.  (Character code = one or more            following concepts.  (Encoding = one or more
502            CCS + one CES).            CCS + one CES).
503      </p></item>      </p></item>
504    <tag><strong>Character Set</strong>    <tag><strong>Character Set</strong>
505      <item><p>      <item><p>
506            Character set is a set of characters.  This determines            Character set is a set of characters.  This determines
507            a range of characters where the character code can handle.            a range of characters where the encoding can handle.
508              In contrast to <em>coded character set</em>, this is often
509              called a <em>non-coded character set</em>.
510      </p></item>      </p></item>
511    <tag><strong>Coded Character Set (CCS)</strong>    <tag><strong>Coded Character Set (CCS)</strong>
512      <item><p>      <item><p>
# Line 522  At first I begin this chapter by definin Line 527  At first I begin this chapter by definin
527    <tag><strong>Character Encoding Scheme (CES)</strong>    <tag><strong>Character Encoding Scheme (CES)</strong>
528      <item><p>      <item><p>
529            Character Encoding Scheme is also a word defined in RFC 2130            Character Encoding Scheme is also a word defined in RFC 2130
530            to call methods to construct a character code using one or            to call methods to construct an encoding using one or
531            more CCS.  This is important when two or more CCS are used            more CCS.  This is important when two or more CCS are used
532            to construct a character code.            to construct an encoding.
533            ISO 2022 is a method to construct a character code from            ISO 2022 is a method to construct an encoding from
534            one or more ISO 2022-compliant CCS.  ISO 2022 is very            one or more ISO 2022-compliant CCS.  ISO 2022 is very
535            complex system and subsets of ISO 2022 are usually used            complex system and subsets of ISO 2022 are usually used
536            such as EUC-JP (ASCII and JISX 0208), ISO-2022-KR (ASCII            such as EUC-JP (ASCII and JISX 0208), ISO-2022-KR (ASCII
537            and KSX 1001), and so on.  CES is not important for            and KSX 1001), and so on.  CES is not important for
538            character codes with only one CCS.            encodings with only one 8bit CCS.
539            UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can be            UTF series (UTF-8, UTF-16LE, UTF-16BE, and so on) can be
540            regarded as CES whose CCS is Unicode or ISO 10646.            regarded as CES whose CCS is Unicode or ISO 10646.
541      </p></item>      </p></item>
542  </taglist>  </taglist>
543  </P>  </P>
544    
545    <P>
546    Some other terms related to character codes are also in common use.
547    </P>
548    
549    <P>
550    <strong>Character code</strong> is a widely used term for
551    <em>encoding</em>.  It is a primitive and crude term for
552    the way a computer handles characters by assigning numbers to them.
553    For example, <em>character code</em> can mean <em>encoding</em>
554    and can also mean <em>coded character set</em>.  Thus the term can
555    be used only when both of them can be regarded as belonging to
556    the same category.  It should be avoided in serious
557    discussions.  This document will not use it hereafter.
558    </P>
559    
560    <P>
561    <strong>Codeset</strong> is a term for <em>encoding</em>
562    or <em>character encoding scheme</em>.
563    <footnote>
564     This document used the term <em>codeset</em> before November 2000
565     to mean <em>encoding</em>.  I changed the terminology because
566     <em>encoding</em> seems more popular.
567    </footnote>
568    </P>
569    
570    <P>
571    <strong>charset</strong> is also a widely used term.
572    It appears, for example, in MIME (as in
573    <tt>Content-Type: text/plain; charset=iso-8859-1</tt>),
574    in XLFD (X Logical Font Description) font names
575    (the CharSetRegistry and CharSetEncoding fields), and so on.
576    Note that <em>charset</em> in MIME means <em>encoding</em>,
577    while <em>charset</em> in an XLFD font name means <em>coded character
578    set</em>.  This is very confusing.
579    </P>
580    
581    <P>
582    Ken Lunde's "CJKV Information Processing" uses the term
583    <strong>encoding method</strong>.  He says that
584    ISO-2022, EUC, Big5, and Shift-JIS are examples of
585    <em>encoding methods</em>.  It seems that his <em>encoding
586    method</em> is <em>CES</em> in this document.  However,
587    we should notice that Big5 and Shift-JIS are encodings
588    while ISO-2022 and EUC are not.
589    <footnote>
590    During I18N programming, we will frequently meet with EUC-JP
591    or EUC-KR, while we will rarely meet with EUC.  I think it is
592    not appropriate to stress EUC, a class of encodings, over
593    EUC-JP, EUC-KR, and so on, concrete encodings.
594    </footnote>
595    </P>
596    
597    <P>
598    <url id="http://www.unicode.org/unicode/reports/tr17/"
599    name="Character Encoding Model, Unicode Technical Report #17">
600    (hereafter, <em>"the Report"</em>) suggests a five-level model.
601    <list>
602      <item>ACR: abstract character repertoire
603      <item>CCS: Coded Character Set
604      <item>CEF: Character Encoding Form
605      <item>CES: Character Encoding Scheme
606      <item>TES: Transfer Encoding Syntax
607    </list>
608    </P>
609    
610    <P>
611    <strong>TES</strong> is also suggested in RFC 2130.  Some examples of
612    TES are: <em>base64</em>, <em>uuencode</em>, <em>BinHex</em>,
613    <em>quoted-printable</em>, <em>gzip</em>, and so on.
614    TES means a transform of encoded data which may (or may not) include
615    textual data.  Thus, TES is not a part of character encoding.
616    However, TES is important in the Internet data exchange.
617    </P>
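The separation between character encoding and TES can be sketched in Python as follows. This is only an illustration under my own assumptions: the helper names `to_tes` and `from_tes` are hypothetical, and base64 is picked as one TES from the list above.

```python
import base64

def to_tes(text, encoding):
    """Encode text with a character encoding, then apply base64 as a TES."""
    raw = text.encode(encoding)      # character encoding step (CCS + CES)
    return base64.b64encode(raw)     # transfer encoding step (TES)

def from_tes(data, encoding):
    """Undo the TES first, then decode the bytes with the character encoding."""
    raw = base64.b64decode(data)
    return raw.decode(encoding)

# The same character gives different TES output for different encodings,
# because the TES only sees bytes and knows nothing about characters.
latin1 = to_tes("\u00e9", "iso-8859-1")
utf8 = to_tes("\u00e9", "utf-8")
print(latin1, utf8)
```

The TES step never looks at characters, which is why it is not a part of the encoding itself.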
618    
619    <P>
620    When using a computer, we rarely have a chance to deal with
621    <strong>ACR</strong>.
622    Though it is true that CJK people have their national standards of
623    ACR (for example, standards for ideograms which can be used for
624    personal names) and some of us may need to handle these ACR with
625    computers (for example, in a citizen registration system), this is too
626    heavy a theme for this document.  This is because there are no
627    standardized or encouraged methods to handle these ACR.  You may
628    have to build the whole system for such purposes.  Good luck!
629    </P>
630    
631    <P>
632    <strong>CCS</strong> in <em>"the Report"</em> is the same as what I
633    described in this document.
634    It has concrete examples: ASCII, ISO 8859-{1,2,...,15}, JISX 0201,
635    JISX 0208, JISX 0212, KSX 1001, KSX 1002, GB 2312, Big5,
636    CNS 11643, TIS 620, VISCII, TCVN 5712, UCS2, UCS4, and so on.
637    Some of them are national standards, some are international
638    standards, and others are de-facto standards.
639    </P>
640    
641    <P>
642    <strong>CEF</strong> and <strong>CES</strong> in <em>"the Report"</em>
643    correspond to <strong>CES</strong> in this document.
644    This document will not distinguish these two, since I think this
645    causes no inconvenience.  An encoding with a significant CEF doesn't
646    have a significant CES (in the sense of <em>"the Report"</em>), and
647    vice versa.  Then why should we have to distinguish these two?
648    The only exception is the UTF-16 series.  In the UTF-16 series,
649    UTF-16 is a CEF and UTF-16BE is a CES.  This is the only case where
650    both of these two levels are needed.
651    </P>
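The UTF-16 case can be observed with Python's standard codecs. This is a minimal sketch, assuming the codec names `utf-16-be`, `utf-16-le`, and `utf-16` as Python provides them. The 16-bit unit 0x0041 is fixed at the CEF level, and only the byte order (the CES level) differs.

```python
text = "A"                         # U+0041, one 16-bit unit in UTF-16

be = text.encode("utf-16-be")      # big-endian byte order, no byte order mark
le = text.encode("utf-16-le")      # little-endian byte order, no byte order mark
bom = text.encode("utf-16")        # byte order mark followed by native byte order

print(be, le, bom)
```

The three results carry the same 16-bit value and differ only in how it is serialized into bytes.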
652    
653    <P>
654    Now, <strong>CES</strong> is a concrete concept with concrete
655    examples: ASCII, ISO 8859-{1,2,...,15}, EUC-JP, EUC-KR, ISO 2022-JP,
656    ISO 2022-JP-1, ISO 2022-JP-2, ISO 2022-CN, ISO 2022-CN-EXT,
657    ISO 2022-KR, ISO 2022, VISCII, UTF-7, UTF-8, UTF-16LE, UTF-16BE,
658    and so on.  Now they are encodings themselves.
659    </P>
660    
661    <P>
662    The most important concept in this section is the distinction between
663    <em>coded character set</em> and <em>encoding</em>.  <em>coded
664    character set</em> is a component of <em>encoding</em>.  Text data
665    are described in <em>encoding</em>, not <em>coded character set</em>.
666    </P>
667    
668    
669  <sect1 id="stateful"><heading>Stateless and Stateful</heading>  <sect1 id="stateful"><heading>Stateless and Stateful</heading>
670    
671  <P>  <P>
672  To construct a character code with two or more CCS,  To construct an encoding with two or more CCS, CES has to supply
673  CES has to supply a method to avoid collision between these CCS.  a method to avoid collision between these CCS.
674  There are two ways to do that.  One is to make all characters  There are two ways to do that.  One is to make all characters
675  in the all CCS have unique code points.  The other is to  in the all CCS have unique code points.  The other is to
676  allow characters from different CCS to have the same  allow characters from different CCS to have the same
# Line 551  code point and to have a code such as es Line 679  code point and to have a code such as es
679  </P>  </P>
680    
681  <P>  <P>
682  A character code with shift state is called <strong>STATEFUL</strong> and  An encoding with shift state is called <strong>STATEFUL</strong> and
683  one without shift state is called <strong>STATELESS</strong>.  one without shift state is called <strong>STATELESS</strong>.
684  </P>  </P>
685    
686  <P>  <P>
687  Examples of stateful character codes are: ISO 2022-*,  Examples of stateful encodings are: ISO 2022-JP, ISO 2022-KR,
688    ISO 2022-INT-1, ISO 2022-INT-2, and so on.
689    </P>
690    
691  <P>  <P>
692  For example, in ISO 2022-JP, two bytes of <tt>0x24 0x2c</tt> may mean  For example, in ISO 2022-JP, two bytes of <tt>0x24 0x2c</tt> may mean
# Line 564  a Japanese Hiragana character 'GA' or tw Line 694  a Japanese Hiragana character 'GA' or tw
694  '$' and ',' according to the shift state.  '$' and ',' according to the shift state.
695  </P>  </P>
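The ISO 2022-JP example above can be checked with a short Python sketch. It assumes Python's `iso2022_jp` codec. The escape sequences `ESC $ B` (switch to JISX 0208) and `ESC ( B` (switch back to ASCII) are written out by hand here.

```python
two_bytes = b"\x24\x2c"

# In the initial (ASCII) shift state the two bytes are '$' and ','.
ascii_state = two_bytes.decode("iso2022_jp")

# After ESC $ B the same two bytes are one JISX 0208 character.
kanji_state = (b"\x1b$B" + two_bytes + b"\x1b(B").decode("iso2022_jp")

print(ascii_state, kanji_state)
```

The same byte pair yields two characters in one state and a single Hiragana character in the other.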
696    
697  <sect1 id="multibyte"><heading>Multibyte character code</heading>  <sect1 id="multibyte"><heading>Multibyte encodings</heading>
698    
699  <P>  <P>
700  Character codes are classified into multibyte ones and the others,  Encodings are classified into multibyte ones and the others,
701  according to the relationship between number of characters and number of  according to the relationship between number of characters and number of
702  bytes in the character code.  bytes in the encoding.
703  </P>  </P>
704    
705  <P>  <P>
706  In non-multibyte character code, one character is always expressed  In non-multibyte encoding, one character is always expressed
707  by one byte.  On the other hand, one character may be expressed in  by one byte.  On the other hand, one character may be expressed in
708  one or more bytes in multibyte character code.  Note that the number  one or more bytes in multibyte encoding.  Note that the number
709  is not fixed even in a single character code.  is not fixed even in a single encoding.
710  </P>  </P>
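A minimal Python sketch of this point follows. The helper name `byte_lengths` is hypothetical and the codec names are Python's own. It lists how many bytes each character of a string needs in a given encoding.

```python
def byte_lengths(text, encoding):
    """Return the number of bytes each character needs in the encoding."""
    return [len(ch.encode(encoding)) for ch in text]

sample = "a\u3042"                              # 'a' and Hiragana 'A'

euc = byte_lengths(sample, "euc_jp")            # multibyte: lengths differ
utf8 = byte_lengths(sample, "utf-8")            # multibyte: lengths differ
latin = byte_lengths("a\u00e9", "iso-8859-1")   # non-multibyte: always one

print(euc, utf8, latin)
```

Within one multibyte encoding the length varies from character to character, so counting bytes does not count characters.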
711    
712  <P>  <P>
713  Examples of multibyte character codes are: EUC-*, ISO 2022-*,  Examples of multibyte encodings are: EUC-JP, EUC-KR, ISO 2022-JP,
714  Shift-JIS, Big5, UTF-*, and so on.  Note that all of UTF-* are  Shift-JIS, Big5, UHC, UTF-8, and so on.  Note that all of UTF-* are
715  multibyte.  multibyte.
716  </P>  </P>
717    
718  <P>  <P>
719  Examples of non-multibyte character codes are: ISO 8859-*,  Examples of non-multibyte encodings are: ISO 8859-1, ISO 8859-2,
720  TIS 620, VISCII, and so on.  TIS 620, VISCII, and so on.
721  </P>  </P>
722    
723  <P>  <P>
724  Note that even in non-multibyte character code, number of characters  Note that even in non-multibyte encoding, number of characters
725  and number of bytes may differ if the character code is stateful.  and number of bytes may differ if the encoding is stateful.
726    </P>
727    
728    <P>
729    Ken Lunde's "CJKV Information Processing"
730    <footnote>
731    ISBN 1-56592-224-7, O'Reilly, 1999
732    </footnote>
733    classifies encoding methods
734    into the following three categories:
735    <list>
736      <item>modal
737      <item>non-modal
738      <item>fixed-length
739    </list>
740    <em>Modal</em> corresponds to <em>stateful</em> in this document.
741    Other two are <em>stateless</em>, where <em>non-modal</em> is
742    <em>multibyte</em> and <em>fixed-length</em> is
743    <em>non-multibyte</em>.  However, I think <em>stateful</em> -
744    <em>stateless</em> and <em>multibyte</em> - <em>non-multibyte</em>
745    are independent concepts.
746    <footnote>
747    though there are no existing encodings which are stateful and
748    non-multibyte.
749    </footnote>
750  </P>  </P>
751    
752  <sect1 id="number"><heading>Number of Bytes, Number of Characters, and Number of Columns</heading>  <sect1 id="number"><heading>Number of Bytes, Number of Characters, and Number of Columns</heading>
# Line 608  and columns. Line 762  and columns.
762    
763  <P>  <P>
764  Speaking of relationship between characters and bytes,  Speaking of relationship between characters and bytes,
765  in multibyte character codes, two or more bytes may be needed  in multibyte encodings, two or more bytes may be needed
766  to express one character.  In stateful character codes, escape  to express one character.  In stateful encodings, escape
767  sequences are not related to any characters.  sequences are not related to any characters.
768  </P>  </P>
769    
# Line 620  and Korean Hangul occupy two columns in Line 774  and Korean Hangul occupy two columns in
774  Note that 'Full-width forms' in UCS-2 and UCS-4 coded character set  Note that 'Full-width forms' in UCS-2 and UCS-4 coded character set
775  will occupy two columns and 'Half-width forms' will occupy one column.  will occupy two columns and 'Half-width forms' will occupy one column.
776  Combining characters used for Thai and so on can be regarded as  Combining characters used for Thai and so on can be regarded as
777  zero-column characters.  zero-column characters.  Though there are no standards, you can
778    use <tt>wcwidth()</tt> and <tt>wcswidth()</tt> for this purpose.
779    See <ref id="output-console-column"> for details.
780  </P>  </P>
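The column rules described above can be approximated in Python with the `unicodedata` module. This is a rough heuristic of my own, not an implementation of `wcwidth()`: the function name `columns` is hypothetical, control characters are not handled, and real terminals may disagree for some characters.

```python
import unicodedata

def columns(text):
    """Estimate the number of terminal columns needed to display text."""
    total = 0
    for ch in text:
        # Combining marks (such as Thai vowel signs) take no column.
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue
        # Wide and full-width characters take two columns, others one.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2
        else:
            total += 1
    return total

print(columns("abc"), columns("\u3042"), columns("\u0e01\u0e31"))
```

A C program would normally call `wcwidth()` or `wcswidth()` under the user's locale instead of using a table like this.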
781    
782  <sect id="standards"><heading>Standards for Character Codes</heading>  <sect id="standards"><heading>Standards for Character Sets and Encodings</heading>
783    
784  <sect1 id="ascii"><heading>ASCII and ISO 646</heading>  <sect1 id="ascii"><heading>ASCII and ISO 646</heading>
785    
786  <P>  <P>
787  <strong>ASCII</strong> is a CCS and also a character code at the same time.  <strong>ASCII</strong> is a CCS and also an encoding at the same time.
788  ASCII is 7bit and contains 94 printable characters which are  ASCII is 7bit and contains 94 printable characters which are
789  encoded in the region of <tt>0x21</tt>-<tt>0x7e</tt>.  encoded in the region of <tt>0x21</tt>-<tt>0x7e</tt>.
790  </P>  </P>
# Line 668  Here is a few examples of versions of IS Line 824  Here is a few examples of versions of IS
824  </P>  </P>
825    
826  <P>  <P>
827  As far as I know, all character codes (besides EBCDIC) in the world  As far as I know, all encodings (besides EBCDIC) in the world
828  are compatible with ISO 646.  are compatible with ISO 646.
829  </P>  </P>
830    
# Line 677  Characters in 0x00 - 0x1f, 0x20, and 0x7 Line 833  Characters in 0x00 - 0x1f, 0x20, and 0x7
833  </P>  </P>
834    
835  <P>  <P>
836  Nowadays usage of character codes incompatible with ASCII is not  Nowadays usage of encodings incompatible with ASCII is not
837  encouraged and thus ISO 646-* (other than US version) should not  encouraged and thus ISO 646-* (other than US version) should not
838  be used.  One of the reason is that when a string is converted into  be used.  One of the reason is that when a string is converted into
839  Unicode, the converter doesn't know whether IRVs are converted into  Unicode, the converter doesn't know whether IRVs are converted into
# Line 691  are written in ASCII.  Source code must Line 847  are written in ASCII.  Source code must
847    
848  <P>  <P>
849  <strong>ISO 8859</strong> is both a series of CCS and a series of  <strong>ISO 8859</strong> is both a series of CCS and a series of
850  character codes.  It is an expansion of ASCII using all 8 bits.  encodings.  It is an expansion of ASCII using all 8 bits.
851  Additional 96 printable characters encoded in 0xa0 - 0xff are  Additional 96 printable characters encoded in 0xa0 - 0xff are
852  available besides 94 ASCII printable characters.  available besides 94 ASCII printable characters.
853  </P>  </P>
# Line 728  A detailed explanation is found at Line 884  A detailed explanation is found at
884  <P>  <P>
885  <strong>ISO 2022</strong> is an international standard of CES.  <strong>ISO 2022</strong> is an international standard of CES.
886  ISO 2022 determines a few requirement for CCS to be a member  ISO 2022 determines a few requirement for CCS to be a member
887  of ISO 2022-based character codes.  It also defines a very  of ISO 2022-based encodings.  It also defines a very
888  extensive (and complex) rules to combine these CCS into one  extensive (and complex) rules to combine these CCS into one
889  character code.  Many character codes such as EUC-*, ISO 2022-*,  encoding.  Many encodings such as EUC-*, ISO 2022-*,
890  compound text,  compound text,
891  <footnote>  <footnote>
892   Compound text is a standard for text exchange between X clients.   Compound text is a standard for text exchange between X clients.
# Line 782  mapped into 0x20 - 0x7f. Line 938  mapped into 0x20 - 0x7f.
938  </P>  </P>
939    
940  <P>  <P>
941  For example, ASCII, ISO 646-UK, and JIS X 0201 Katakana  For example, ASCII, ISO 646-UK, and JISX 0201 Katakana
942  are classified into (1), JIS X 0208 Japanese Kanji,  are classified into (1), JISX 0208 Japanese Kanji,
943  KS C 5601 Korean, GB 2312-80 Chinese are classified into (3),  KSX 1001 Korean, GB 2312-80 Chinese are classified into (3),
944  and ISO 8859-* are classified to (2).  and ISO 8859-* are classified to (2).
945  </P>  </P>
946    
# Line 856  where 'F' is determined for each charact Line 1012  where 'F' is determined for each charact
1012      </list>      </list>
1013   <item>character set with multibyte 94-character   <item>character set with multibyte 94-character
1014      <list>      <list>
1015       <item>F=0x40 for JIS X 0208-1978 Japanese       <item>F=0x40 for JISX 0208-1978 Japanese
1016       <item>F=0x41 for GB 2312-80 Chinese       <item>F=0x41 for GB 2312-80 Chinese
1017       <item>F=0x42 for JIS X 0208-1983 Japanese       <item>F=0x42 for JISX 0208-1983 Japanese
1018       <item>F=0x43 for KS C 5601 Korean       <item>F=0x43 for KSC 5601 Korean
1019       <item>F=0x44 for JIS X 0212-1990 Japanese       <item>F=0x44 for JISX 0212-1990 Japanese
1020       <item>F=0x45 for CCITT Extended GB (ISO-IR-165)       <item>F=0x45 for CCITT Extended GB (ISO-IR-165)
1021       <item>F=0x46 for CNS 11643-1992 Set 1 (Taiwan)       <item>F=0x46 for CNS 11643-1992 Set 1 (Taiwan)
1022       <item>F=0x48 for CNS 11643-1992 Set 2 (Taiwan)       <item>F=0x48 for CNS 11643-1992 Set 2 (Taiwan)
# Line 901  WHAT IS THE VALUE OF THESE CONTROL CODES Line 1057  WHAT IS THE VALUE OF THESE CONTROL CODES
1057  </P>  </P>
1058    
1059  <P>  <P>
1060  Note that a character code in a character set invoked into GR is  Note that a code in a character set invoked into GR is
1061  or-ed with 0x80.  or-ed with 0x80.
1062  </P>  </P>
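The effect of invoking a set into GR can be sketched in Python. The helper name `to_gr` is hypothetical, and the example assumes EUC-JP, where JISX 0208 is invoked into GR, together with Python's `euc_jp` codec.

```python
def to_gr(code):
    """Set the high bit of every byte, as for a set invoked into GR."""
    return bytes(b | 0x80 for b in code)

jis = b"\x24\x2c"                 # JISX 0208 code for Hiragana 'GA'
euc = to_gr(jis)                  # the same code as it appears in GR

print(euc.hex(), euc.decode("euc_jp"))
```

Clearing the high bit again recovers the original 7bit code, which is how an EUC decoder can find the JISX 0208 code point.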
1063    
# Line 947  of ISO 2022 except for the usage of SS2 Line 1103  of ISO 2022 except for the usage of SS2
1103  codes are used to invoke G2 and G3 into GL in ISO 2022, they are  codes are used to invoke G2 and G3 into GL in ISO 2022, they are
1104  invoked into GR in EUC.  invoked into GR in EUC.
1105  <strong>EUC-JP</strong>, <strong>EUC-KR</strong>, <strong>EUC-CN</strong>,  <strong>EUC-JP</strong>, <strong>EUC-KR</strong>, <strong>EUC-CN</strong>,
1106  and <strong>EUC-TW</strong> are widely used character codes  and <strong>EUC-TW</strong> are widely used encodings
1107  which use EUC as CES.  which use EUC as CES.
1108  </P>  </P>
1109    
# Line 1052  It includes 49194 distinct coded charact Line 1208  It includes 49194 distinct coded charact
1208  <sect2 id="unicode-ces"><heading>UTF as CES</heading>  <sect2 id="unicode-ces"><heading>UTF as CES</heading>
1209    
1210  <P>  <P>
1211  A few CES are used to construct character codes which use UCS as  A few CES are used to construct encodings which use UCS as
1212  a CCS.  They are <strong>UTF-7</strong>, <strong>UTF-8</strong>,  a CCS.  They are <strong>UTF-7</strong>, <strong>UTF-8</strong>,
1213  <strong>UTF-16</strong>, <strong>UTF-16LE</strong>, and  <strong>UTF-16</strong>, <strong>UTF-16LE</strong>, and
1214  <strong>UTF-16BE</strong>.  UTF means Unicode (or UCS)  <strong>UTF-16BE</strong>.  UTF means Unicode (or UCS)
1215  Transformation Format.  Transformation Format.
1216  Since these CES always take UCS as the only CCS, they are also  Since these CES always take UCS as the only CCS, they are also
1217  names for character codes.  names for encodings.
1218  <footnote>  <footnote>
1219   Compare UTF and EUC.  There are a few variants of EUC whose CCS   Compare UTF and EUC.  There are a few variants of EUC whose CCS
1220   are different (EUC-JP, EUC-KR, and so on).  This is why we cannot   are different (EUC-JP, EUC-KR, and so on).  This is why we cannot
1221   call EUC a character code.  In other words, the name 'EUC' alone   call EUC an encoding.  In other words, the name 'EUC' alone
1222   cannot specify a character code.  On the other hand, 'UTF-8'   cannot specify an encoding.  On the other hand, 'UTF-8'
1223   is the name for a specific concrete character code.   is the name for a specific concrete encoding.
1224  </footnote>  </footnote>
1225  </P>  </P>
1226    
1227  <sect3 id="unicode-utf8"><heading>UTF-8</heading>  <sect3 id="unicode-utf8"><heading>UTF-8</heading>
1228    
1229  <P>  <P>
1230  UTF-8 is a character code whose CCS is UCS-4.  UTF-8  UTF-8 is an encoding whose CCS is UCS-4.  UTF-8
1231  is designed to be upward-compatible to ASCII.  is designed to be upward-compatible to ASCII.
1232  UTF-8 is multibyte and number of bytes needed to express  UTF-8 is multibyte and number of bytes needed to express
1233  one character is from 1 to 6.  one character is from 1 to 6.
# Line 1102  locale will increase. Line 1258  locale will increase.
1258  <sect3 id="unicode-utf16"><heading>UTF-16</heading>  <sect3 id="unicode-utf16"><heading>UTF-16</heading>
1259    
1260  <P>  <P>
1261  UTF-16 is a character code whose CCS is 20bit Unicode.  UTF-16 is an encoding whose CCS is 20bit Unicode.
1262  </P>  </P>
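As a rough sketch of where the 20 bits come from: UTF-16 expresses a character outside the first 65536 code points as a pair of 16-bit units (a surrogate pair), and each unit of the pair carries 10 bits. The function name `utf16_encode` below is hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point into UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if the value
 * cannot be expressed in UTF-16. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                       /* reserved for surrogates */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* one unit is enough */
        return 1;
    }
    if (cp > 0x10FFFF) {
        return 0;                       /* beyond the 20bit range */
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* upper 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* lower 10 bits */
    return 2;
}
```

UTF-16BE and UTF-16LE differ only in the byte order used when each 16-bit unit is written out.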
1263    
1264  <P>  <P>
# Line 1281  programming more difficult. Line 1437  programming more difficult.
1437  <sect3 id="646problem"><heading>ISO 646-* Problem</heading>  <sect3 id="646problem"><heading>ISO 646-* Problem</heading>
1438    
1439  <P>  <P>
1440  You will need a codeset converter between your local character codes  You will need a codeset converter between your local encodings
1441  (for example, ISO 8859-* or ISO 2022-*) and Unicode.  (for example, ISO 8859-* or ISO 2022-*) and Unicode.
1442  For example, Shift-JIS character code  For example, Shift-JIS encoding
1443  <footnote>  <footnote>
1444    The standard character code for Macintosh and MS Windows.    The standard encoding for Macintosh and MS Windows.
1445  </footnote>  </footnote>
1446  consists of  consists of
1447  JISX 0201 Roman (Japanese version of ISO 646), not ASCII,  JISX 0201 Roman (Japanese version of ISO 646), not ASCII,
# Line 1304  escape character. For example, 'new line Line 1460  escape character. For example, 'new line
1460  'yen currency mark - <tt>n</tt>'.  You may say that program sources  'yen currency mark - <tt>n</tt>'.  You may say that program sources
1461  must be written in ASCII and the mistake is that you  must be written in ASCII and the mistake is that you
1462  tried to convert program source.  However, there are many  tried to convert program source.  However, there are many
1463  source codes and so on written in Shift-JIS character code.  source codes and so on written in Shift-JIS encoding.
1464  </P>  </P>
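The problem can be seen in a strict converter for the single-byte part of Shift-JIS. The following is a sketch under the assumption that the converter follows JIS X 0201 Roman exactly; the function name `jisx0201_roman_to_ucs` is hypothetical. JIS X 0201 Roman differs from ASCII at two code points, 0x5c and 0x7e.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert one 7-bit JIS X 0201 Roman byte to UCS.
 * Only 0x5c and 0x7e differ from ASCII. */
static uint32_t jisx0201_roman_to_ucs(unsigned char b)
{
    assert(b < 0x80);
    switch (b) {
    case 0x5C:
        return 0x00A5;  /* YEN SIGN, not REVERSE SOLIDUS */
    case 0x7E:
        return 0x203E;  /* OVERLINE, not TILDE */
    default:
        return b;       /* identical to ASCII */
    }
}
```

With such a converter, the two bytes 0x5c 0x6e in a C source file come out as U+00A5 followed by 'n', which is no longer an escape sequence for a C compiler reading the Unicode text.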
1465    
1466  <P>  <P>
# Line 1323  to ASCII, such as ISO 646-*. Line 1479  to ASCII, such as ISO 646-*.
1479  </P>  </P>
1480    
1481    
1482  <sect1 id="othercodes"><heading>Other Character Codes</heading>  <sect1 id="othercodes"><heading>Other Character Sets and Encodings</heading>
1483    
1484  <P>  <P>
1485  There are a few popular character codes which cannot be classified  There are a few popular encodings which cannot be classified
1486  into an international standard.  Internationalized softwares should  into an international standard.  Internationalized softwares should
1487  support these character codes (again, you don't need to be aware of  support these encodings (again, you don't need to be aware of
1488  character codes if you use LOCALE and <tt>wchar_t</tt> technology).  encodings if you use LOCALE and <tt>wchar_t</tt> technology).
1489  Some organizations are developing systems which go farther than  Some organizations are developing systems which go farther than
1490  limitations of the current international standards, though these  limitations of the current international standards, though these
1491  systems may be not diffused very much so far.  systems may be not diffused very much so far.
# Line 1338  systems may be not diffused very much so Line 1494  systems may be not diffused very much so
1494  <sect2 id="othercodes-big5"><heading>Big5</heading>  <sect2 id="othercodes-big5"><heading>Big5</heading>
1495    
1496  <P>  <P>
1497  <strong>Big5</strong> is a de-facto standard character code for  <strong>Big5</strong> is a de-facto standard encoding for
1498  Taiwan (1984).  It is also a CCS which is upward-compatible with ASCII.  Taiwan (1984) and is upward-compatible with ASCII.
1499    It is also a CCS.
1500  </P>  </P>
1501    
1502  <P>  <P>
# Line 1353  and means an ideogram and so on (13461 c Line 1510  and means an ideogram and so on (13461 c
1510  Though Taiwan has ISO 2022-compliant new standard CNS 11643,  Though Taiwan has ISO 2022-compliant new standard CNS 11643,
1511  Big5 seems to be more popular than CNS 11643.  Big5 seems to be more popular than CNS 11643.
1512  (CNS 11643 is a CCS and there are a few ISO 2022-derived  (CNS 11643 is a CCS and there are a few ISO 2022-derived
1513  character codes which include CNS 11643.)  encodings which include CNS 11643.)
1514  </P>  </P>
1515    
1516  <sect2 id="othercodes-viscii"><heading>VISCII</heading>  <sect2 id="othercodes-viscii"><heading>VISCII</heading>
1517    
1518  <P>  <P>
1519  Vietnamese language uses 186 characters (Latin alphabets with accents).  Vietnamese language uses 186 characters (Latin alphabets with accents)
1520  It is a bit more than the limit of ISO 8859-like character code.  and other symbols.
1521    It is a bit more than the limit of ISO 8859-like encoding.
1522  </P>  </P>
1523    
1524  <P>  <P>
# Line 1374  not only <tt>0x21</tt> - <tt>0x7e</tt> a Line 1532  not only <tt>0x21</tt> - <tt>0x7e</tt> a
1532  </P>  </P>
1533    
1534  <P>  <P>
1535  Vietnam has a new, ISO 2022-compliant character code  Vietnam has a new, ISO 2022-compliant character set
1536  <strong>TCVN 5712</strong> (aka <strong>VSCII</strong>).  <strong>TCVN 5712</strong> (aka <strong>VSCII</strong>).
1537  In TCVN 5712, accented characters are expressed as a  In TCVN 5712, accented characters are expressed as a
1538  combined character.  Note that a part of accented characters  combined character.  Note that some of accented characters
1539  have their own code points.  have their own code points.
1540  </P>  </P>
1541    
1542  <sect2 id="othercodes-tron"><heading>TRON</heading>  <sect2 id="othercodes-tron"><heading>TRON</heading>
1543    
1544  <P>  <P>
1545  url id="http://www.tron.org/index-e.html" name="TRON project">  <url id="http://www.tron.org/index-e.html" name="TRON">
1546  is a project to develop a new operating system,  is a project to develop a new operating system,
1547  founded as a collaboration of industries and academics  founded as a collaboration of industries and academics
1548  in Japan since 1984.  in Japan since 1984.
# Line 1393  in Japan since 1984. Line 1551  in Japan since 1984.
1551  <P>  <P>
1552  The most diffused version of TRON operating system families  The most diffused version of TRON operating system families
1553  is ITRON, a real-time OS for embedded systems.  is ITRON, a real-time OS for embedded systems.
1554  However, our interest is not on the ITRON now.  However, our interest is not on ITRON now.
1555  TRON determines a TRON character code.  TRON determines a TRON encoding.
1556  </P>  </P>
1557    
1558  <P>  <P>
1559  TRON's character code is stateful.  Each state is assigned  TRON's encoding is stateful.  Each state is assigned
1560  to each language.  It has already defined about 130000 characters  to each language.  It has already defined about 130000 characters
1561  (January 2000).  (January 2000).
1562  </P>  </P>
# Line 1429  these points: Line 1587  these points:
1587  <enumlist>  <enumlist>
1588    <item>kinds and number of characters used in the language,    <item>kinds and number of characters used in the language,
1589    <item>explanation on coded character set(s) which is (are) standardized,    <item>explanation on coded character set(s) which is (are) standardized,
1590    <item>explanation on character code(s) which is (are) standardized,    <item>explanation on encoding(s) which is (are) standardized,
1591    <item>usage and popularity for each character code,    <item>usage and popularity for each encoding,
1592    <item>de-facto standard, if any, on how many columns characters occupy,    <item>de-facto standard, if any, on how many columns characters occupy,
1593    <item>writing direction and combined characters,    <item>writing direction and combined characters,
1594    <item>how to layout characters (word wrapping and so on),    <item>how to layout characters (word wrapping and so on),
# Line 1467  how to treat such languages. Line 1625  how to treat such languages.
1625  <P>  <P>
1626  <strong>LOCALE</strong> is a basic concept introduced  <strong>LOCALE</strong> is a basic concept introduced
1627  into <strong>ISO C</strong> (ISO/IEC 9899:1990).  The  into <strong>ISO C</strong> (ISO/IEC 9899:1990).  The
1628  standard is expanded in 1995 (ISO 9899:1990 Ammendment 1:1995).  standard is expanded in 1995 (ISO 9899:1990 Amendment 1:1995).
1629  In LOCALE model, the behaviors of some C functions are dependent  In LOCALE model, the behaviors of some C functions are dependent
1630  on LOCALE environment.  LOCALE environment is divided  on LOCALE environment.  LOCALE environment is divided
1631  into a few categories and each of these categories can  into a few categories and each of these categories can
# Line 1484  XPG5 is mandatory to obtain Unix brand. Line 1642  XPG5 is mandatory to obtain Unix brand.
1642  all versions of Unix operating systems support XPG5.  all versions of Unix operating systems support XPG5.
1643  </P>  </P>
1644    
1645  <sect id="localecategory">Locale Categories and Locale Names</heading>  <sect id="localecategory">Locale Categories and <tt>setlocale()</tt></heading>
1646    
1647  <P>  <P>
1648  In LOCALE model, the behaviors of some C functions are dependent  In LOCALE model, the behaviors of some C functions are dependent
# Line 1499  The followings are the six categories: Line 1657  The followings are the six categories:
1657    <tag><strong>LC_CTYPE</strong>    <tag><strong>LC_CTYPE</strong>
1658         <item>         <item>
1659         <p>         <p>
1660         Category related to character code.         Category related to encodings.
1661         Characters which are encoded by LC_CTYPE-depndent character         Characters which are encoded by LC_CTYPE-dependent encoding
1662         code are called <strong>multibyte characters</strong>.         are called <strong>multibyte characters</strong>.
1663         Note that a multibyte character need not occupy more than one byte.         Note that a multibyte character need not occupy more than one byte.
1664         </p>         </p>
1665         <p>         <p>
# Line 1589  Given <tt>""</tt> for <em>locale</em>, < Line 1747  Given <tt>""</tt> for <em>locale</em>, <
1747  will determine the locale name in the following manner:  will determine the locale name in the following manner:
1748  <list>  <list>
1749    <item>At first, consult <tt>LC_ALL</tt> environmental variable.    <item>At first, consult <tt>LC_ALL</tt> environmental variable.
1750    <item>Then, consult environmental variable same as the    <item>If <tt>LC_ALL</tt> is not available, consult environmental
1751          name of the locale category.  For example, <tt>LC_COLLATE</tt>.          variable same as the name of the locale category.
1752    <item>At last, consult <tt>LANG</tt> environmental variable.          For example, <tt>LC_COLLATE</tt>.
1753      <item>If none of them are available, consult <tt>LANG</tt>
1754            environmental variable.
1755  </list>  </list>
1756  This is why a user is expected to set <tt>LANG</tt> variable.  This is why a user is expected to set <tt>LANG</tt> variable.
1757  In other words, all what a user has to do is to set <tt>LANG</tt>  In other words, all what a user has to do is to set <tt>LANG</tt>
# Line 1605  at the first of your softwares, if the s Line 1765  at the first of your softwares, if the s
1765  international.  international.
1766  </p>  </p>
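The lookup order described above (LC_ALL first, then the variable named after the category, then LANG) can be sketched as follows. This is only an illustration of the order, not the code of any C library. The function names are hypothetical, and the fallback to "C" when nothing is set, as well as treating an empty value like an unset one, are assumptions of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the first candidate that is set and
 * non-empty, in the order LC_ALL, category variable, LANG. */
static const char *pick_locale(const char *lc_all,
                               const char *lc_category,
                               const char *lang)
{
    if (lc_all != NULL && lc_all[0] != '\0') {
        return lc_all;
    }
    if (lc_category != NULL && lc_category[0] != '\0') {
        return lc_category;
    }
    if (lang != NULL && lang[0] != '\0') {
        return lang;
    }
    return "C";  /* assumption: default behavior when nothing is set */
}

/* Hypothetical helper: resolve the locale name for one category,
 * for example locale_from_env("LC_COLLATE"). */
static const char *locale_from_env(const char *category_name)
{
    assert(category_name != NULL);
    return pick_locale(getenv("LC_ALL"), getenv(category_name),
                       getenv("LANG"));
}
```

In ordinary software you do not need this; calling <tt>setlocale(LC_ALL, "")</tt> does the lookup for you. The sketch is useful only when you have to emulate the behavior.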
1767    
1768    <sect id="localename">Locale Names</heading>
1769    
1770    <P>
1771    We can specify locale names for these six locale categories.
1772    Then, which name should we specify?
1773    </P>
1774    
1775    <P>
1776    The syntax to build a locale name is determined as follows:
1777    <example>
1778      language[_territory][.codeset][@modifier]
1779    </example>
1780    where <em>language</em> is two lowercase letters defined
1781    in ISO 639, such as <tt>en</tt> for English, <tt>eo</tt> for
1782    Esperanto, and <tt>zh</tt> for Chinese, and <em>territory</em>
1783    is two uppercase letters defined in ISO 3166, such as
1784    <tt>GB</tt> for United Kingdom, <tt>KR</tt> for Republic of
1785    Korea (South Korea), and <tt>CN</tt> for China.  There is no standard
1786    for <em>codeset</em> and <em>modifier</em>.  GNU libc uses
1787    <tt>ISO-8859-1</tt>, <tt>ISO-8859-13</tt>, <tt>eucJP</tt>,
1788    <tt>SJIS</tt>, <tt>UTF8</tt>, and so on for <em>codeset</em>,
1789    and <tt>euro</tt> for <em>modifier</em>.
1790    </P>
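The syntax above can be split into its parts with a few lines of C. This is a minimal sketch; the structure `locale_parts`, the function names, and the buffer sizes are all hypothetical, and the code does not check whether the locale actually exists on the system.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for language[_territory][.codeset][@modifier]. */
struct locale_parts {
    char language[16];
    char territory[16];
    char codeset[32];
    char modifier[32];
};

/* Copy len bytes (truncated to fit) from src into dst and terminate. */
static void copy_part(char *dst, size_t dstsize, const char *src, size_t len)
{
    if (len >= dstsize) {
        len = dstsize - 1;
    }
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split a locale name from the right: modifier, codeset, territory. */
static void parse_locale_name(const char *name, struct locale_parts *out)
{
    assert(name != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    const char *end = name + strlen(name);

    const char *at = strchr(name, '@');
    if (at != NULL) {
        copy_part(out->modifier, sizeof out->modifier,
                  at + 1, (size_t)(end - (at + 1)));
        end = at;
    }

    const char *dot = memchr(name, '.', (size_t)(end - name));
    if (dot != NULL) {
        copy_part(out->codeset, sizeof out->codeset,
                  dot + 1, (size_t)(end - (dot + 1)));
        end = dot;
    }

    const char *us = memchr(name, '_', (size_t)(end - name));
    if (us != NULL) {
        copy_part(out->territory, sizeof out->territory,
                  us + 1, (size_t)(end - (us + 1)));
        end = us;
    }

    copy_part(out->language, sizeof out->language,
              name, (size_t)(end - name));
}
```

Parts which are absent from the name are left as empty strings.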
1791    
1792    <P>
1793    However, which locale names are valid depends on the system.
1794    In other words, you have to install the <em>locale database</em> for
1795    each locale you want to use.  Type <tt>locale -a</tt> to display all
1796    supported locale names on the system.
1797    </P>
1798    
1799  <p>  <p>
1800  Note that locale names of <tt>"C"</tt> and <tt>"POSIX"</tt> are  Note that locale names of <tt>"C"</tt> and <tt>"POSIX"</tt> are
1801  determined for the names for default behavior.  For example,  determined for the names for default behavior.  For example,
# Line 1616  invocation of <tt>date(1)</tt>. Line 1807  invocation of <tt>date(1)</tt>.
1807  <sect id="wchar">Multibyte Characters and Wide Characters</heading>  <sect id="wchar">Multibyte Characters and Wide Characters</heading>
1808    
1809  <p>  <p>
1810  Now we will concentrate on LC_CTYPE category.  Now we will concentrate on LC_CTYPE, which is the most important
1811    category among the six locale categories.
1812  </p>  </p>
1813    
1814  <p>  <p>
1815  Many character codes such as ASCII, ISO 8859-*, KOI8-R, EUC-*,  Many encodings such as ASCII, ISO 8859-*, KOI8-R, EUC-*,
1816  ISO 2022-*, TIS 620, UTF-8, and so on are used widely in the world.  ISO 2022-*, TIS 620, UTF-8, and so on are used widely in the world.
1817  It is inefficient and a cause of bugs, even if not impossible, for  It is inefficient and a cause of bugs, even if not impossible, for
1818  every software to implement all these character codes.  every software to implement all these encodings.
1819  Fortunetely, we can use LOCALE technology to solve this problem.  Fortunately, we can use LOCALE technology to solve this problem.
1820  <footnote>  <footnote>
1821    Usage of UCS-4 is the second best solution fot this problem.    Usage of UCS-4 is the second best solution for this problem.
1822    Sometimes LOCALE technology cannot be used and UCS-4 is the    Sometimes LOCALE technology cannot be used and UCS-4 is the
1823    best.  I will discuss this solution later.    best.  I will discuss this solution later.
1824  </footnote>  </footnote>
# Line 1634  Fortunetely, we can use LOCALE technolog Line 1826  Fortunetely, we can use LOCALE technolog
1826    
1827  <p>  <p>
1828  <strong>Multibyte characters</strong> is a term to call characters  <strong>Multibyte characters</strong> is a term to call characters
1829  encoded in locale-specific character code.  Thus, the behaviors of  encoded in locale-specific encoding.  In ISO 8859-1 locale,
1830  C functions which handle multibyte characters depend on  ISO 8859-1 is multibyte character.  In EUC-JP locale, EUC-JP
1831  <tt>LC_CTYPE</tt> locale category.  is multibyte character.  In UTF-8 locale, UTF-8 is multibyte character.
1832    In short, multibyte character is defined by <tt>LC_CTYPE</tt> locale
1833    category.
1834  Multibyte characters should be used when your software inputs  Multibyte characters should be used when your software inputs
1835  or outputs text data from/to everywhere out of your software,  or outputs text data from/to everywhere out of your software,
1836  for example, standard input/output, display, keyboard, file,  for example, standard input/output, display, keyboard, file,
# Line 1649  and so on. Line 1843  and so on.
1843  </p>  </p>
1844    
1845  <p>  <p>
1846    You can handle multibyte characters using ordinary <tt>char</tt>
1847    or <tt>unsigned char</tt> types and ordinary character- and
1848    string-oriented functions, just like you used to do for
1849    ASCII and 8bit encodings.
1850    Moreover, the ISO C standard specifies which C functions are sensitive
1851    to the <tt>LC_CTYPE</tt> locale category, so these functions can
1852    handle multibyte characters.
1853    </p>
1854    
1855    <p>
1856  Multibyte character may be stateful or stateless and multibyte or  Multibyte character may be stateful or stateless and multibyte or
1857  non-multibyte.  Thus it is not convenient for internal processing.  non-multibyte.  Thus it is not convenient for internal processing.
1858  It needs complex algorithm even for, for example, character  It needs complex algorithm even for, for example, character
# Line 1667  character and <tt>WEOF</tt>, an substitu Line 1871  character and <tt>WEOF</tt>, an substitu
1871  </p>  </p>
1872    
1873  <p>  <p>
1874  A string of wide characters is achived by an array of <tt>wchar_t</tt>,  A string of wide characters is achieved by an array of <tt>wchar_t</tt>,
1875  just like a string of characters is achieved by an array  just like a string of characters is achieved by an array
1876  of <tt>char</tt>.  of <tt>char</tt>.
1877  </p>  </p>
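A typical use of such an array is to convert a multibyte string received from outside into a wide string for internal processing. The following is a minimal sketch using <tt>mbsrtowcs()</tt> from ISO C Amendment 1; the function name `to_wide` is hypothetical. The conversion follows the current <tt>LC_CTYPE</tt> locale, so real software would call <tt>setlocale(LC_ALL, "")</tt> beforehand.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a multibyte string (in the encoding of
 * the current LC_CTYPE locale) into a newly allocated wide string.
 * Returns NULL on an invalid sequence or on allocation failure.
 * The caller must free() the result. */
static wchar_t *to_wide(const char *mbs)
{
    assert(mbs != NULL);

    mbstate_t st;
    const char *p = mbs;

    /* First pass: count wide characters without storing them. */
    memset(&st, 0, sizeof st);
    size_t n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1) {
        return NULL;
    }

    wchar_t *ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL) {
        return NULL;
    }

    /* Second pass: convert, including the terminating null. */
    p = mbs;
    memset(&st, 0, sizeof st);
    mbsrtowcs(ws, &p, n + 1, &st);
    return ws;
}
```

After the conversion, one element of the array is always one character, so functions such as <tt>wcslen()</tt> count characters, not bytes.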
# Line 1724  There are additional functions for <tt>w Line 1928  There are additional functions for <tt>w
1928  <p>  <p>
1929  You cannot assume anything on the concrete value of <tt>wchar_t</tt>,  You cannot assume anything on the concrete value of <tt>wchar_t</tt>,
1930  besides <tt>0x21</tt> - <tt>0x7e</tt> are identical to ASCII.  besides <tt>0x21</tt> - <tt>0x7e</tt> are identical to ASCII.
1931    <footnote>
1932     Some of you may know GNU libc uses UCS-4 for the internal representation
1933     of <tt>wchar_t</tt>.  However, you should not rely on this knowledge.
1934     It may differ on other systems.
1935    </footnote>
1936  You may feel this limitation is too strong.  If you cannot do  You may feel this limitation is too strong.  If you cannot do
1937  under this limitation, you can use UCS-4 as the internal character  under this limitation, you can use UCS-4 as the internal encoding.
1938  code.  In such a case, you can write your software emulating  In such a case, you can write your software emulating
1939  the locale-sensible behavior using <tt>setlocale()</tt>,  the locale-sensible behavior using <tt>setlocale()</tt>,
1940  <tt>nl_langinfo(CODESET)</tt>, and <tt>iconv()</tt>.  Consult  <tt>nl_langinfo(CODESET)</tt>, and <tt>iconv()</tt>.  Consult
1941  the section of <ref id="iconv">.  the section of <ref id="iconv">.
# Line 1734  the section of <ref id="iconv">. Line 1943  the section of <ref id="iconv">.
1943    
1944  <p>  <p>
1945  You can write wide character in the source code as <tt>L'a'</tt>  You can write wide character in the source code as <tt>L'a'</tt>
1946  and wide string as <tt>L"string"</tt>.  Since the character  and wide string as <tt>L"string"</tt>.  Since the encoding
1947  code for the source code is ASCII, you can only write ASCII  for the source code is ASCII, you can only write ASCII
1948  characters.  If you'd like to use other characters, you should  characters.  If you'd like to use other characters, you should
1949  use <prgn>gettext</prgn>.  use <prgn>gettext</prgn>.
1950  </p>  </p>
# Line 1782  Such software can input and output multi Line 1991  Such software can input and output multi
1991  <sect id="locale_unicode">Unicode and LOCALE technology</heading>  <sect id="locale_unicode">Unicode and LOCALE technology</heading>
1992    
1993  <p>  <p>
1994  UTF-8 is considered as the future character code and  UTF-8 is considered as the future encoding and
1995  many softwares are coming to support UTF-8.  Though some  many softwares are coming to support UTF-8.  Though some
1996  of these softwares implement UTF-8 directly, I recommend  of these softwares implement UTF-8 directly, I recommend
1997  you to use LOCALE technology to support UTF-8.  you to use LOCALE technology to support UTF-8.
# Line 1824  Some developers may think that support o Line 2033  Some developers may think that support o
2033  for I18N.  for I18N.
2034  <footnote>  <footnote>
2035   In such a case, do they think of abolishing support of 7bit or   In such a case, do they think of abolishing support of 7bit or
2036   8bit non-multibyte character codes?  If no, it may be unfair that   8bit non-multibyte encodings?  If no, it may be unfair that
2037   8bit language speakers can use both UTF-8 and conventional (local)   8bit language speakers can use both UTF-8 and conventional (local)
2038   character codes while speakers of multibyte languages, combining   encodings while speakers of multibyte languages, combining
2039   characters, and so on cannot use their popular locale character   characters, and so on cannot use their popular locale encodings.
2040   codes.  I think such a software cannot be called "internationalized".   I think such a software cannot be called "internationalized".
2041  </footnote>  </footnote>
2042  Even in such cases, you can rewrite such a software so that it  Even in such cases, you can rewrite such a software so that it
2043  checks <tt>LC_*</tt> and <tt>LANG</tt> environmental variables  checks <tt>LC_*</tt> and <tt>LANG</tt> environmental variables
2044  to emulate the behavior of <tt>setlocale(LC_ALL, "");</tt>.  to emulate the behavior of <tt>setlocale(LC_ALL, "");</tt>.
2045  You can also rewrite the software to call <tt>setlocale()</tt>,  You can also rewrite the software to call <tt>setlocale()</tt>,
2046  <tt>nl_langinfo()</tt>, and <tt>iconv()</tt> so that the software  <tt>nl_langinfo()</tt>, and <tt>iconv()</tt> so that the software
2047  supports all character codes which the OS supports, as discussed later.  supports all encodings which the OS supports, as discussed later.
2048  Consult  Consult
2049  <url id="http://ffii.org/archive/mails/groff/2000/Oct/0056.html"  <url id="http://ffii.org/archive/mails/groff/2000/Oct/0056.html"
2050  name="the discussion in the Groff mailing list on the support of  name="the discussion in the Groff mailing list on the support of
2051  UTF-8 and locale-specific character codes">, mainly held by Werner  UTF-8 and locale-specific encodings">, mainly held by Werner
2052  LEMBERG, an experienced developer of GNU roff, and Tomohiro KUBOTA,  LEMBERG, an experienced developer of GNU roff, and Tomohiro KUBOTA,
2053  the author of this document.  the author of this document.
2054  </p>  </p>
2055    
2056    
2057    
2058  <sect id="iconv">nl_langinfo() and iconv()</heading>  <sect id="iconv"><heading><tt>nl_langinfo()</tt> and <tt>iconv()</tt></heading>
2059    
2060  <p>  <p>
2061  Though ISO C defines extensive LOCALE-related functions,  Though ISO C defines extensive LOCALE-related functions,
2062  you may want more extensive support.  You may also want  you may want more extensive support.  You may also want
2063  conversion between different character codes.  conversion between different encodings.
2064  There are C functions which can be used for such purposes.  There are C functions which can be used for such purposes.
2065  </p>  </p>
2066    
# Line 1889  for <em>item</em> defined in <tt>langinf Line 2098  for <em>item</em> defined in <tt>langinf
2098    <item>format of time (am/pm format) (<tt>T_FMT_AMPM</tt>)    <item>format of time (am/pm format) (<tt>T_FMT_AMPM</tt>)
2099    <item>format of time (era-based) (<tt>ERA_T_FMT</tt>)    <item>format of time (era-based) (<tt>ERA_T_FMT</tt>)
2100    <item>radix (<tt>RADIXCHAR</tt>)    <item>radix (<tt>RADIXCHAR</tt>)
2101    <item>thousands separater (<tt>THOUSEP</tt>)    <item>thousands separator (<tt>THOUSEP</tt>)
2102    <item>alternative characters for numerics (<tt>ALT_DIGITS</tt>)    <item>alternative characters for numerics (<tt>ALT_DIGITS</tt>)
2103    <item>affirmative word (<tt>YESSTR</tt>)    <item>affirmative word (<tt>YESSTR</tt>)
2104    <item>affirmative response (<tt>YESEXPR</tt>)    <item>affirmative response (<tt>YESEXPR</tt>)
2105    <item>negative word (<tt>NOSTR</tt>)    <item>negative word (<tt>NOSTR</tt>)
2106    <item>negative response (<tt>NOEXPR</tt>)    <item>negative response (<tt>NOEXPR</tt>)
2107    <item>character code (<tt>CODESET</tt>)    <item>encoding (<tt>CODESET</tt>)
2108  </list>  </list>
2109  For example, you can get names for months and use them for  For example, you can get names for months and use them for
2110  your original output algorithm.  <tt>YESEXPR</tt> and  your original output algorithm.  <tt>YESEXPR</tt> and
# Line 1905  answer from users. Line 2114  answer from users.
2114    
2115  <p>  <p>
2116  <tt>iconv_open()</tt>, <tt>iconv</tt>, and <tt>iconv_close()</tt>  <tt>iconv_open()</tt>, <tt>iconv</tt>, and <tt>iconv_close()</tt>
2117  are functions to perform conversion between character codes.  are functions to perform conversion between encodings.
2118  Please consult manpages for them.  Please consult manpages for them.
2119  </p>  </p>
2120    
2121  <p>  <p>
2122  Combining <tt>nl_langinfo()</tt> and <tt>iconv()</tt>,  Combining <tt>nl_langinfo()</tt> and <tt>iconv()</tt>,
2123  you can easily modify Unicode-enabled software into locale-sensible  you can easily modify Unicode-enabled software into locale-sensible
2124  truely internationalized software.  truly internationalized software.
2125  </p>  </p>
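The combination can be sketched as follows. This is a minimal illustration, not a complete implementation: the function names `convert` and `utf8_to_locale` are hypothetical, the whole input is converted in one call, and errors are reported only as a failure value. It assumes that the encoding name returned by <tt>nl_langinfo(CODESET)</tt> is accepted by <tt>iconv_open()</tt> on your system, and that the caller has already called <tt>setlocale(LC_ALL, "")</tt>.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert the null-terminated string `in` from
 * encoding `from` to encoding `to`, writing into `out`.
 * Returns the number of bytes written (without the terminating null),
 * or (size_t)-1 on failure. */
static size_t convert(const char *from, const char *to,
                      const char *in, char *out, size_t outsize)
{
    assert(from != NULL && to != NULL && in != NULL && out != NULL);
    assert(outsize > 0);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) {
        return (size_t)-1;              /* conversion not supported */
    }

    char *inp = (char *)in;             /* iconv() takes char ** */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;       /* keep room for the null */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1) {
        /* Flush any pending shift sequence of a stateful encoding. */
        rc = iconv(cd, NULL, NULL, &outp, &outleft);
    }
    iconv_close(cd);

    if (rc == (size_t)-1) {
        return (size_t)-1;
    }
    *outp = '\0';
    return (size_t)(outp - out);
}

/* Hypothetical helper: convert UTF-8 text into the encoding of the
 * current LC_CTYPE locale. */
static size_t utf8_to_locale(const char *in, char *out, size_t outsize)
{
    return convert("UTF-8", nl_langinfo(CODESET), in, out, outsize);
}
```

A Unicode-enabled software which keeps UTF-8 or UCS-4 internally would call something like `utf8_to_locale()` just before output, and the reverse conversion just after input.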
2126    
2127  <p>  <p>
# Line 1962  in the distribution of GNU libc for usag Line 2171  in the distribution of GNU libc for usag
2171  </p>  </p>
2172    
2173    
2174    <sect id="locale-limit"><heading>Limit of Locale technology</heading>
2175    
2176    <P>
2177    The locale model has a limitation: it cannot handle two locales at
2178    the same time.  In particular, it cannot handle relationships between two
2179    locales at all.
2180    </P>
2181    
2182    <P>
2183    For example, EUC-JP, ISO 2022-JP, and Shift-JIS are popular encodings
2184    in Japan.  EUC-JP is the de-facto standard for UNIX systems,
2185    ISO 2022-JP is the standard for Internet, and Shift-JIS is the
2186    encoding for Windows and Macintosh.  Thus, Japanese people have to
2187    handle texts with these encodings.  Text viewers such as <tt>jless</tt>
2188    and <tt>lv</tt> and editors such as <tt>emacs</tt> can automatically
2189    detect the encoding of the text being read.  You cannot write such software
2190    using Locale technology.
2191    </P>
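To illustrate why such software has to look at the data itself instead of the locale, here is a very rough sketch of a guess based on the bytes of the text. It is not how <tt>jless</tt>, <tt>lv</tt>, or <tt>emacs</tt> actually work; the names are hypothetical and the heuristic is an assumption of this sketch. It recognizes only the escape sequences ESC $ B and ESC $ @ which designate JIS X 0208 in ISO 2022-JP, and does not try to tell EUC-JP from Shift-JIS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of the guess. */
enum guess {
    GUESS_ASCII,      /* only 7-bit bytes, no known escape sequence */
    GUESS_ISO2022JP,  /* found ESC $ B or ESC $ @ */
    GUESS_8BIT        /* found a byte >= 0x80 (EUC-JP, Shift-JIS, ...) */
};

/* Hypothetical helper: guess the encoding from the first telltale byte. */
static enum guess guess_encoding(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x1B && i + 2 < len && buf[i + 1] == '$'
            && (buf[i + 2] == 'B' || buf[i + 2] == '@')) {
            return GUESS_ISO2022JP;
        }
        if (buf[i] >= 0x80) {
            return GUESS_8BIT;
        }
    }
    return GUESS_ASCII;
}
```

Nothing in this function depends on <tt>LC_CTYPE</tt>, which is the point: the decision is made per text, while a locale is fixed per process.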
2192    
2193    
2194    
2195  <chapt id="output"><heading>Output to Display</heading>  <chapt id="output"><heading>Output to Display</heading>
# Line 2020  for Windows are also regarded as console Line 2248  for Windows are also regarded as console
2248  <P>  <P>
2249  The feature of console is that:  The feature of console is that:
2250  <list>  <list>
2251    <item>All what a software has to do is to send a correct character    <item>All what a software has to do is to send a correct encoding
2252          code to standard output.  Softwares on console don't need to          to standard output.  Softwares on console don't need to
2253          care about fonts and so on.          care about fonts and so on.
2254    <item>Fonts with fixed sizes are used.  The unit of the width    <item>Fonts with fixed sizes are used.  The unit of the width
2255          of the font is called 'column'.  'Doublewidth' fonts, i.e.,          of the font is called 'column'.  'Doublewidth' fonts, i.e.,
# Line 2032  The feature of console is that: Line 2260  The feature of console is that:
2260  </list>  </list>
2261  </P>  </P>
2262    
2263  <sect1 id="output-console-code"><heading>Character Code</heading>  <sect1 id="output-console-code"><heading>Encoding</heading>
2264    
2265  <P>  <P>
2266  Softwares running on the console are not responsible for displaying.  Softwares running on the console are not responsible for displaying.
2267  The console itself is responsible.  There are consoles  The console itself is responsible.  There are consoles
2268  which can display character codes other than ASCII such as  which can display encodings other than ASCII such as
2269  <taglist>  <taglist>
2270   <tag>kon2   <tag>kon2
2271        <item>EUC-JP, Shift-JIS, and ISO-2022-JP        <item>EUC-JP, Shift-JIS, and ISO-2022-JP
2272   <tag>jfbterm   <tag>jfbterm
2273        <item>EUC-JP, ISO-2022-jp, and ISO-2022 (including any 94, 96,        <item>EUC-JP, ISO 2022-JP, and ISO 2022 (including any 94, 96,
2274              and 94x94 character sets whose fonts are available)              and 94x94 coded character sets whose fonts are available)
2275   <tag>kterm   <tag>kterm
2276        <item>EUC-JP, Shift-JIS, ISO-2022-JP, and ISO-2022 (including        <item>EUC-JP, Shift-JIS, ISO 2022-JP, and ISO 2022 (including
2277              ISO8859-{1,2,3,4,5,6,7,8,9}, JISX0201, JISX0208, JISX0212,              ISO8859-{1,2,3,4,5,6,7,8,9}, JISX 0201, JISX 0208, JISX 0212,
2278              GB2312, and KSC5601)              GB 2312, and KSC 5601)
2279   <tag>krxvt   <tag>krxvt
2280        <item>EUC-JP        <item>EUC-JP
2281   <tag>crxvt-gb   <tag>crxvt-gb
# Line 2055  which can display character codes other Line 2283  which can display character codes other
2283   <tag>crxvt-big5   <tag>crxvt-big5
2284        <item>Big5        <item>Big5
2285   <tag>hanterm   <tag>hanterm
2286        <item>EUC-KR, Johab, and ISO-2022-KR        <item>EUC-KR, Johab, and ISO 2022-KR
2287   <tag>xiterm+thai   <tag>xiterm+thai
2288        <item>TIS620        <item>TIS 620
2289   <tag>xterm   <tag>xterm
2290        <item>UTF-8        <item>UTF-8
2291  </taglist>  </taglist>
2292  However, there is no way for software on a console to know which  However, there is no way for software on a console to know which
2293  character code is available.  I think it is a responsibility for  encoding is available.  I think it is a responsibility for
2294  a user to properly set LC_CTYPE locale (i.e. LC_ALL, LC_CTYPE, or LANG  a user to properly set LC_CTYPE locale (i.e. LC_ALL, LC_CTYPE, or LANG
2295  environmental variable).  Provided LC_CTYPE locale is set properly,  environmental variable).  Provided LC_CTYPE locale is set properly,
2296  a software can use it to know which character code to be supported  a software can use it to know which encoding to be supported
2297  by the console.  by the console.
2298  </P>  </P>
2299    
# Line 2082  to care about points which you don't hav Line 2310  to care about points which you don't hav
2310  using ASCII.  using ASCII.
2311  <list>  <list>
2312    <item>8-bit cleanness.  I think everyone understand this.    <item>8-bit cleanness.  I think everyone understand this.
2313    <item>Continuity of multibyte characters.  In multibyte character    <item>Continuity of multibyte characters.  In multibyte encodings
2314          codes such as EUC-JP and UTF-8, one character may consist          such as EUC-JP and UTF-8, one character may consist
2315          from more than two bytes.  These bytes should be outputed          from more than two bytes.  These bytes should be outputed
2316          continued.  Insertion of additional codes between the          continued.  Insertion of additional codes between the
2317          continuing bytes can break the character.  I have seen a          continuing bytes can break the character.  I have seen a
# Line 2117  tell is that your software should avoid Line 2345  tell is that your software should avoid
2345  <P>  <P>
2346  If your software inputs a string from keyboard,  you will have to  If your software inputs a string from keyboard,  you will have to
2347  take more cares.  All of numbers of characters, bytes, and columns  take more cares.  All of numbers of characters, bytes, and columns
2348  differ.  For example, in UTF-8 character code, one character of  differ.  For example, in UTF-8 encoding, one character of
2349  'a' with acute accent occupies two bytes and one column.  One  'a' with acute accent occupies two bytes and one column.  One
2350  character of CJK-ideograph occupies three bytes and two columns.  character of CJK-ideograph occupies three bytes and two columns.
2351  For example, if the user types 'Backspace', how many backspace  For example, if the user types 'Backspace', how many backspace
# Line 2154  The main feature of XFontSet is that it Line 2382  The main feature of XFontSet is that it
2382  at the same time.  This is related to the distinction between  at the same time.  This is related to the distinction between
2383  coded character set (CCS) and character encoding scheme (CES)  coded character set (CCS) and character encoding scheme (CES)
2384  which I wrote at the section of <ref id="coding-general-term">.  which I wrote at the section of <ref id="coding-general-term">.
2385  Some character codes in the world use multiple coded character  Some encodings in the world use multiple coded character
2386  sets at the same time.  This is the reason we have to handle  sets at the same time.  This is the reason we have to handle
2387  multiple X fonts at the same time.  multiple X fonts at the same time.
2388  <footnote>  <footnote>
2389  Though UTF-8 is a character code with single CCS, the current  Though UTF-8 is an encoding with single CCS, the current
2390  version of XFree86 (4.0.1) needs multiple fonts to handle UTF-8.  version of XFree86 (4.0.1) needs multiple fonts to handle UTF-8.
2391  </footnote>  </footnote>
2392  </P>  </P>
# Line 2168  Another significant feature of XFontSet Line 2396  Another significant feature of XFontSet
2396  locale (LC_CTYPE)-sensible.  This means that you have to  locale (LC_CTYPE)-sensible.  This means that you have to
2397  call <tt>setlocale()</tt> before you use XFontSet-related  call <tt>setlocale()</tt> before you use XFontSet-related
2398  functions.  And more, you have to specify the string you want  functions.  And more, you have to specify the string you want
2399  to draw as a mulbibyte character or a wide character.  to draw as a multibyte character or a wide character.
2400  </P>  </P>
2401    
2402  <P>  <P>
# Line 2210  clean.  The user has to set <tt>LANG</tt Line 2438  clean.  The user has to set <tt>LANG</tt
2438  The upstream developers of X clients sometimes hate to enforce  The upstream developers of X clients sometimes hate to enforce
2439  users to set such environmental variables.  users to set such environmental variables.
2440  <footnote>  <footnote>
2441   IMO, all users will have to set LANG properly when UTF-8 will   IMHO, all users will have to set LANG properly when UTF-8 will
2442   become popular.   become popular.
2443  </footnote>  </footnote>
2444  In such a case,  In such a case,
# Line 2220  The X clients should have two ways to ou Line 2448  The X clients should have two ways to ou
2448  If <tt>setlocale()</tt> returns <tt>NULL</tt>, <tt>"C"</tt>,  If <tt>setlocale()</tt> returns <tt>NULL</tt>, <tt>"C"</tt>,
2449  or <tt>"POSIX"</tt>, use  or <tt>"POSIX"</tt>, use
2450  <tt>XFontStruct</tt> way.  Otherwise use <tt>XFontSet</tt> way.  <tt>XFontStruct</tt> way.  Otherwise use <tt>XFontSet</tt> way.
2451  The author implemented this algoritym to a few window managers  The author implemented this algorithm to a few window managers
2452  such as TWM (version 4.0.1d), Blackbox (0.60.1), IceWM (1.0.0),  such as TWM (version 4.0.1d), Blackbox (0.60.1), IceWM (1.0.0),
2453  sawmill (0.28), and so on.  sawmill (0.28), and so on.
2454  </P>  </P>
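<P>
The decision rule above can be sketched as follows.  The enum and
function names are hypothetical and the drawing code itself is omitted.
Querying the LC_CTYPE category is an assumption of this sketch, chosen
because XFontSet depends on LC_CTYPE.
</P>

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names used only in this sketch. */
enum draw_method { USE_XFONTSTRUCT, USE_XFONTSET };

/* Pure decision rule: takes the value returned by setlocale(). */
enum draw_method choose_draw_method(const char *locale_name)
{
    if (locale_name == NULL ||
        strcmp(locale_name, "C") == 0 ||
        strcmp(locale_name, "POSIX") == 0)
        return USE_XFONTSTRUCT;   /* locale unset or unusable */
    return USE_XFONTSET;          /* a real locale was selected */
}

/* Convenience wrapper which consults the user's environment. */
enum draw_method choose_draw_method_from_env(void)
{
    return choose_draw_method(setlocale(LC_CTYPE, ""));
}
```

<P>
Keeping the rule separate from the <tt>setlocale()</tt> call makes it
easy to check without changing the process locale.
</P>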
# Line 2252  communication.  This topic will be descr Line 2480  communication.  This topic will be descr
2480    
2481    
2482    
2483    
2484    
2485    
2486    
2487    
2488    
2489    
2490    
2491    
2492    
2493    
2494    
2495  <chapt id="input"><heading>Input from Keyboard</heading>  <chapt id="input"><heading>Input from Keyboard</heading>
2496    
2497  <P>  <P>
# Line 2369  All you need is that: Line 2609  All you need is that:
2609  <chapt id="internal"><heading>Internal Processing and File I/O</heading>  <chapt id="internal"><heading>Internal Processing and File I/O</heading>
2610    
2611  <P>  <P>
2612  From a user's point of view, a software can use any internal character  From a user's point of view, a software can use any internal encodings
2613  codes if I/O is done correctly.  It is because a user cannot be aware of  if I/O is done correctly.  It is because a user cannot be aware of
2614  which kind of internal code is used in the software.  which kind of internal code is used in the software.
2615  </P>  </P>
2616    
# Line 2384  the string is to be converted into multi Line 2624  the string is to be converted into multi
2624  approach are:  approach are:
2625  <list>  <list>
2626    <item>The programmer don't need to know the detail of    <item>The programmer don't need to know the detail of
2627          international and local character codes.          international and local encodings.
2628  </list>  </list>
2629  However, there are a few disadvantages:  However, there are a few disadvantages:
2630  <list>  <list>
# Line 2585  is stateful. Line 2825  is stateful.
2825    
2826  <chapt id="other"><heading>Other Special Topics</heading>  <chapt id="other"><heading>Other Special Topics</heading>
2827    
 <sect id="locale-"><heading>Locale in C</heading>  
   
 <P>  
 Locale is the main faculty for I18N of C language.  
 </P>  
   
 <P>  
 Locale model is that a software changes its behavior  
 according to its language environment.  The environment can be  
 set independently for six categories of  
 LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC,  
 and LC_TIME.  
 C library supplies a set of functions which changes their  
 behaviors according to one of the six locale categories.  
 To internationalize a software, use these functions.  
 Don't forget to call <tt>setlocale</tt> function at first  
 or these functions would not change their behavior.  
 </P>  
   
 <P>  
 If <tt>setlocale(LC_ALL, "")</tt> is called at the start of the
 software, the environment is chosen by environmental variables
 whose names are the same as the names of the categories.
 If the LC_ALL variable is defined, it takes precedence over these
 variables.  If neither is defined, the LANG variable is adopted.
 If LANG is also not defined, the 'C' locale, which means default behavior,
 is used.
 </P>  
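<P>
The precedence described above can be illustrated with a small function.
This is not how the C library is implemented, only a sketch of the lookup
order.  The name <tt>effective_locale_name</tt> is hypothetical, and
treating an empty variable as undefined is an assumption which follows
the POSIX description.
</P>

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: returns the locale name which would be tried for
 * one category.  category_var is the name of the category variable,
 * for example "LC_CTYPE". */
const char *effective_locale_name(const char *category_var)
{
    const char *value;

    value = getenv("LC_ALL");            /* 1. LC_ALL overrides everything */
    if (value != NULL && *value != '\0')
        return value;

    value = getenv(category_var);        /* 2. the category's own variable */
    if (value != NULL && *value != '\0')
        return value;

    value = getenv("LANG");              /* 3. LANG as the fallback */
    if (value != NULL && *value != '\0')
        return value;

    return "C";                          /* 4. default behavior */
}
```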
   
 <P>  
 Though valid values for these environmental variables (locale names)  
 depend on the kind and set-up of the OS, the format of locale names  
 is usually like <tt>ja_JP.eucJP</tt>, where two lowercase characters  
 mean language (<tt>ja</tt> = Japanese), two capital characters  
 mean country (<tt>JP</tt> = Japan), and characters after dot mean  
 character code (<tt>eucJP</tt> = EUC-JP).  Type <tt>locale -a</tt> to  
 display all valid locale names.  However, it is users' responsibility to  
 set proper value to LANG variable and the developers don't need to  
 be aware of the value of the LANG variable.  
 </P>  
   
 <P>  
 Many people tend to think that I18N means <tt>gettext</tt>ization  
 and translation of messages.  However, it is mere one category  
 (LC_MESSAGES) out of six categories.  The most important category  
 is LC_CTYPE.  
 </P>  
   
 <P>  
 However, note that M17N is not achieved by the locale mechanism,
 especially LC_CTYPE.  You have to use international
 character codes such as ISO 2022 and Unicode instead of the LC_CTYPE mechanism
 to write M17N-ed software.  Moreover, the LC_CTYPE mechanism is
 sometimes insufficient even for I18N.  For example,
 <package>jless</package> is a text file viewer which can
 automatically distinguish three Japanese character codes and convert
 them into the desired character code.  You cannot write such software using
 the LC_CTYPE mechanism.
 </P>  
   
 <sect id="wchar-"><heading>Multibyte and Wide characters in C</heading>  
   
 <P>  
 Standard C library supplies functions to handle multibyte and  
 wide characters.  These functions are sensible to LC_CTYPE  
 locale category.  
 </P>  
   
 <P>  
 Multibyte character is a character code which is used for <em>real</em>  
 input/output.  In other words, <em>the character code you usually use</em>  
 is the multibyte character whatever language you speak.  
 If you use ISO-8859-1, it is your multibyte character.  
 If you use EUC-KR, it is your multibyte character.  
 Despite the name, multibyte character may or may not be  
 expressed in multiple bytes.  
 </P>  
   
 <P>  
 Since a multibyte character can be stateful (that is, can have
 shift status) and the number of bytes per character does not
 have to be constant, an implementation using multibyte characters
 can be difficult.  For example, it may be difficult even to count
 the number of characters.  Thus wide characters can be used instead.
 </P>  
   
 <P>  
 Wide character is a character code supplied by the standard C library
 for easy handling of international strings.
 Wide characters are stateless and every wide character has the
 same size.  Functions for conversion between multibyte characters
 and wide characters (and between strings of multibyte characters and
 strings of wide characters) are supplied by the library.
 A wide character is expressed using the <tt>wchar_t</tt> type.
 A string of wide characters is expressed
 as an array of <tt>wchar_t</tt>, just as a string of ASCII characters is expressed
 as an array of <tt>char</tt>.
 </P>  
   
 <P>  
 Thus it is convenient to input multibyte characters from a stream,  
 convert them into wide characters, process, convert back into  
 multibyte characters, and output them to a stream.  <tt>wchar_t</tt> is  
 used as an internal code.  
 </P>  
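<P>
A minimal sketch of this round trip follows.  It converts a multibyte
string to wide characters, processes them (here, upper-casing), and
converts back.  The function name <tt>upcase_multibyte</tt> is
hypothetical.  Calling <tt>mbstowcs()</tt> with a null destination to
obtain the length is POSIX behavior, so I assume a POSIX system.  The
caller is expected to have called <tt>setlocale()</tt> beforehand.
</P>

```c
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: upper-cases the multibyte string `in` into `out`
 * (capacity out_size bytes) using wchar_t as the internal code.
 * Returns 0 on success and -1 on invalid input or insufficient space. */
int upcase_multibyte(const char *in, char *out, size_t out_size)
{
    size_t n = mbstowcs(NULL, in, 0);   /* wide characters needed */
    wchar_t *wide;
    size_t i, written;

    if (n == (size_t)-1)
        return -1;                      /* invalid multibyte sequence */
    wide = malloc((n + 1) * sizeof *wide);
    if (wide == NULL)
        return -1;
    mbstowcs(wide, in, n + 1);          /* multibyte -> wide */

    for (i = 0; i < n; i++)             /* process one character per element */
        wide[i] = (wchar_t)towupper((wint_t)wide[i]);

    written = wcstombs(out, wide, out_size);   /* wide -> multibyte */
    free(wide);
    if (written == (size_t)-1 || written >= out_size)
        return -1;                      /* unconvertible or truncated */
    return 0;
}
```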
   
 <P>  
 Functions for conversion between multibyte and wide characters/strings  
 are shown below:  
 <list>  
  <item><tt>mbtowc()</tt> and <tt>mbrtowc()</tt> to convert  
        from multibyte to wide character.  
  <item><tt>mblen()</tt>, <tt>mbrlen()</tt> to obtain the number
        of bytes which compose the next multibyte character.
  <item><tt>mbstowcs()</tt>, <tt>mbsrtowcs()</tt> to convert from  
        multibyte to wide character string.  
  <item><tt>wctomb()</tt>, <tt>wcrtomb()</tt> to convert from wide  
        to multibyte character.  
  <item><tt>wcstombs()</tt>, <tt>wcsrtombs()</tt> to convert from  
        wide to multibyte character string.  
  <item><tt>mbsinit()</tt> to check shift status.  
  <item><tt>btowc()</tt> and <tt>wctob()</tt> to convert between
        single-byte and wide characters.
 </list>  
 </P>  
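<P>
As an illustration of these functions, the following sketch counts the
characters (not bytes) of a multibyte string with <tt>mbrtowc()</tt>.
The function name <tt>count_characters</tt> is hypothetical.  The result
depends on the current LC_CTYPE locale, so the caller is assumed to have
called <tt>setlocale()</tt>.
</P>

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: counts the characters of the multibyte string s
 * under the current LC_CTYPE locale.  Returns (size_t)-1 when s contains
 * an invalid or incomplete multibyte sequence. */
size_t count_characters(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    size_t count = 0;

    memset(&state, 0, sizeof state);    /* start in the initial shift state */
    while (remaining > 0) {
        /* A null first argument means the wide character is not stored. */
        size_t n = mbrtowc(NULL, s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (n == 0)                     /* null character reached */
            break;
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```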
   
 <P>  
 The '<tt>r</tt>' versions of these functions (for example, <tt>mbrtowc</tt>)
 take an additional parameter, a pointer to an <tt>mbstate_t</tt>
 variable which contains the shift status.  Since the non-'<tt>r</tt>'
 versions of these functions keep the shift status in an internal
 (static) variable, they can treat only one string at a time.
 </P>  
   
 <P>  
 See manpages of these functions for further information.  
 </P>  
   
 <P>  
 The implementation of wchar_t is not determined by any  
 standards, though UCS-4 is used for glibc.  You must not  
 assume the implementation of <tt>wchar_t</tt>.  
 </P>  
   
 <P>  
 Though usual functions such as <tt>printf()</tt> can be used for multibyte
 characters for input/output, one has to take care of the escape
 character '<tt>%</tt>' used in formatted input/output functions, because
 a part of a multibyte character can have the same value as the ASCII
 code of '<tt>%</tt>'.
 </P>  
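<P>
One simple way to avoid this problem is never to pass text as the format
string.  The following sketch, with the hypothetical name
<tt>put_text</tt>, outputs a string as data so that no byte in it is
interpreted as '<tt>%</tt>'.
</P>

```c
#include <stdio.h>

/* Hypothetical helper: writes text verbatim to out.
 * Returns 0 on success and -1 on a write error. */
int put_text(FILE *out, const char *text)
{
    /* Not fprintf(out, text): a 0x25 byte inside a multibyte character
     * would be taken as the start of a conversion specification. */
    return fputs(text, out) == EOF ? -1 : 0;
}
```

<P>
<tt>printf("%s", text)</tt> has the same effect, because the text is then
an argument and not the format.
</P>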
2828    
2829    
2830  <sect id="gettext"><heading>Gettext</heading>  <sect id="gettext"><heading>Gettext</heading>
# Line 2760  For example, '&copy;' (copyright mark; y Line 2849  For example, '&copy;' (copyright mark; y
2849  read the copyright mark NOW in THIS document) is non-ASCII character  read the copyright mark NOW in THIS document) is non-ASCII character
2850  (0xa9 in ISO-8859-1).  (0xa9 in ISO-8859-1).
2851  Otherwise, translators may feel difficulty to edit catalog files  Otherwise, translators may feel difficulty to edit catalog files
2852  because of conflict between character codes for <tt>msgid</tt> and in  because of conflict between encodings for <tt>msgid</tt> and in
2853  <tt>msgstr</tt>.  <tt>msgstr</tt>.
2854  </P>  </P>
2855    
# Line 2775  THAN MEANINGLESS BROKEN MESSAGES.</em> Line 2864  THAN MEANINGLESS BROKEN MESSAGES.</em>
2864    
2865  <P>  <P>
2866  The 2nd (3rd, ...) byte of multibyte characters or  The 2nd (3rd, ...) byte of multibyte characters or
2867  all bytes of non-ASCII characters in stateful character codes  all bytes of non-ASCII characters in stateful encodings
2868  can be 0x5c (same to backslash in ASCII) or 0x22  can be 0x5c (same to backslash in ASCII) or 0x22
2869  (same to double quote in ASCII).  (same to double quote in ASCII).
2870  These characters have to properly escaped because  These characters have to properly escaped because
# Line 2968  and <tt>application</tt>.  Now we are in Line 3057  and <tt>application</tt>.  Now we are in
3057  because we are discussing about i18n.  because we are discussing about i18n.
3058  Sub types for <tt>text</tt> are <tt>plain</tt>, <tt>enriched</tt>,  Sub types for <tt>text</tt> are <tt>plain</tt>, <tt>enriched</tt>,
3059  <tt>html</tt>, and so on.  <tt>charset</tt> parameter can also be  <tt>html</tt>, and so on.  <tt>charset</tt> parameter can also be
3060  added to specify character codes.  added to specify encodings.
3061  <tt>US-ASCII</tt>, <tt>ISO-8859-1</tt>,  <tt>US-ASCII</tt>, <tt>ISO-8859-1</tt>,
3062  <tt>ISO-8859-2</tt>, ..., <tt>ISO-8859-10</tt> are defined by  <tt>ISO-8859-2</tt>, ..., <tt>ISO-8859-10</tt> are defined by
3063  RFC 2046 for <tt>charset</tt>.  This list can be added by writing  RFC 2046 for <tt>charset</tt>.  This list can be added by writing
# Line 2991  RFC 2045 and 2046 determine the way to w Line 3080  RFC 2045 and 2046 determine the way to w
3080  in the main text of mail.  On the other hand, RFC 2047 describes  in the main text of mail.  On the other hand, RFC 2047 describes
3081  'encoded words' which is the way to write non-ASCII characters in the header.  'encoded words' which is the way to write non-ASCII characters in the header.
3082  It is like that:  It is like that:
3083  <tt>=?</tt><var>character code</var><tt>?</tt><var>conversion algorithm</var><tt>?</tt><var>data</var><tt>?=</tt>,  <tt>=?</tt><var>encoding</var><tt>?</tt><var>conversion algorithm</var><tt>?</tt><var>data</var><tt>?=</tt>,
3084  where <var>character code</var> is selected from the list of <tt>charset</tt>  where <var>encoding</var> is selected from the list of <tt>charset</tt>
3085  of <tt>Content-Type</tt> header, <var>algorithm</var> is <tt>Q</tt>  of <tt>Content-Type</tt> header, <var>algorithm</var> is <tt>Q</tt>
3086  or <tt>q</tt> for quoted-printable or <tt>B</tt> or <tt>b</tt> for  or <tt>q</tt> for quoted-printable or <tt>B</tt> or <tt>b</tt> for
3087  base64, and <var>data</var> is encoded data whose length is less than  base64, and <var>data</var> is encoded data whose length is less than
# Line 3023  header, it is rarely used. Line 3112  header, it is rarely used.
3112  </P>  </P>
3113    
3114  <P>  <P>
3115  RFC 1866 describes that the default character code for HTML is  RFC 1866 describes that the default encoding for HTML is
3116  ISO-8859-1.  However, many web pages are written in,  ISO-8859-1.  However, many web pages are written in,
3117  for example, Japanese and Korean using (of course) character codes  for example, Japanese and Korean using (of course) encodings
3118  different from ISO-8859-1.  different from ISO-8859-1.
3119  Sometimes the HTML document describes:  Sometimes the HTML document describes:
3120  <example>  <example>
3121  &lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022=jp"&gt;  &lt;META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-2022=jp"&gt;
3122  </example>  </example>
3123  which declares that the page is written in ISO-2022-JP.  which declares that the page is written in ISO-2022-JP.
3124  However, there many pages without any declaration of character code.  However, there many pages without any declaration of encoding.
3125  </P>  </P>
3126    
3127  <P>  <P>
3128  Web browsers have to deal with such a circumstance.  Web browsers have to deal with such a circumstance.
3129  Of course web browsers have to be able to deal with every  Of course web browsers have to be able to deal with every
3130  character codes in the world which is listed in MIME.  encodings in the world which is listed in MIME.
3131  However, many web browsers can only deal with ASCII  However, many web browsers can only deal with ASCII
3132  or ISO-8859-1.  Such web browsers are useless at all  or ISO-8859-1.  Such web browsers are useless at all
3133  for non-ASCII or non-ISO-8859-1 people.  for non-ASCII or non-ISO-8859-1 people.
# Line 3048  for non-ASCII or non-ISO-8859-1 people. Line 3137  for non-ASCII or non-ISO-8859-1 people.
3137  URL should be written in ASCII character,  URL should be written in ASCII character,
3138  though non-ASCII characters can be expressed  though non-ASCII characters can be expressed
3139  using <tt>%</tt><var>nn</var> sequence where <var>nn</var>  using <tt>%</tt><var>nn</var> sequence where <var>nn</var>
3140  is hexadegimal value.  This is because there are  is hexadecimal value.  This is because there are
3141  no way to specify character code. Wester-European people  no way to specify encoding. Wester-European people
3142  would treat it as ISO-8859-1, while Japanese people  would treat it as ISO-8859-1, while Japanese people
3143  would treat it as EUC-JP or SHIFT-JIS.  would treat it as EUC-JP or SHIFT-JIS.
3144  </P>  </P>
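<P>
The <tt>%</tt><var>nn</var> notation can be sketched as follows.  The
function name <tt>percent_encode</tt> is hypothetical.  Which bytes are
copied unchanged (ASCII letters, digits, and <tt>-._~</tt>) and the use
of uppercase hexadecimal digits are assumptions of this sketch.  Note
that the function works on bytes and knows nothing about the encoding,
which is exactly the ambiguity described above.
</P>

```c
#include <stddef.h>

/* Hypothetical helper: percent-encodes the bytes of src into dst
 * (capacity dst_size bytes including the terminating NUL).
 * Returns 0 on success and -1 when dst is too small. */
int percent_encode(const char *src, char *dst, size_t dst_size)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t o = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        int keep = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                   (c >= '0' && c <= '9') ||
                   c == '-' || c == '.' || c == '_' || c == '~';
        size_t need = keep ? 1 : 3;

        if (o + need >= dst_size)       /* keep room for the NUL */
            return -1;
        if (keep) {
            dst[o++] = (char)c;
        } else {
            dst[o++] = '%';
            dst[o++] = hex[c >> 4];     /* high four bits */
            dst[o++] = hex[c & 0x0F];   /* low four bits */
        }
    }
    if (o >= dst_size)
        return -1;
    dst[o] = '\0';
    return 0;
}
```

<P>
For example, the two bytes 0xC3 0xA1 become <tt>%C3%A1</tt> whatever
character they were meant to express.
</P>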
