<sect id="japanese"><heading>Japanese language / used in Japan</heading> |
| 3 |
kubota |
845 |
|
| 4 |
|
|
|
| 5 |
|
|
<P> |
This section was written by
Tomohiro KUBOTA <email>kubota@debian.org</email> (no longer reachable).
| 8 |
kubota |
845 |
</P> |
| 9 |
|
|
|
| 10 |
|
|
<P> |
Japanese is the only official language used in Japan.
The people of the Okinawa islands and the Ainu ethnic group in the
Hokkaido region have their own languages, but these are spoken by
few people and have no script of their own.
| 15 |
|
|
</P> |
| 16 |
|
|
|
| 17 |
|
|
<P> |
Japan is the only region where the Japanese language is widely used.
| 19 |
|
|
</P> |
| 20 |
|
|
|
| 21 |
|
|
|
| 22 |
kubota |
857 |
<sect1 id="japanese-character"><heading>Characters used in Japanese</heading> |
| 23 |
kubota |
845 |
|
| 24 |
|
|
<P> |
Three kinds of characters are used in Japanese:
Hiragana, Katakana, and Kanji.
Arabic numerals (the same as in European languages) are
widely used in Japanese, although Kanji numerals also exist.
Although the Latin alphabet is not part of the Japanese script,
it is widely used for proper nouns such as company names.
| 31 |
|
|
</P> |
| 32 |
|
|
|
| 33 |
|
|
<P> |
Hiragana and Katakana are phonograms derived from Kanji.
Hiragana and Katakana characters correspond one-to-one to
each other, like the upper and lower cases of the Latin alphabet.
However, <tt>toupper()</tt> and <tt>tolower()</tt> should not
convert between Hiragana and Katakana.
Hiragana contains about 100 characters, and so does Katakana.
(FYI: about 50 regular characters, 20 characters with the voiced
consonant symbol, 5 characters with the semi-voiced consonant symbol,
and 9 small characters.)
| 43 |
|
|
</P> |
| 44 |
|
|
|
| 45 |
|
|
<P> |
Kanji are ideograms imported from China roughly one to two thousand
years ago.
Nobody knows the total number of Kanji; almost all adult Japanese
people know several thousand Kanji characters.
Although Kanji originate from Chinese characters, their shapes have
changed from the original ancient Chinese forms.
Almost every Kanji has several readings, depending on the
word in which it appears.
| 54 |
|
|
</P> |
| 55 |
|
|
|
| 56 |
|
|
|
| 57 |
kubota |
857 |
<sect1 id="japanese-sets"><heading>Character Sets</heading> |
| 58 |
kubota |
845 |
|
| 59 |
|
|
<P> |
JIS (Japanese Industrial Standards) are the national standards that define
the coded character sets (CCS) and encodings used in Japan.
The major coded character sets in Japan are:
| 63 |
kubota |
845 |
<list> |
<item>JIS X 0201-1976 Roman characters (almost the same as ASCII, but 0x5c
is the Yen sign instead of backslash and 0x7e is an overline instead of tilde),
<item>JIS X 0201-1976 Kana (about 60 KATAKANA characters),
<item>JIS X 0208-1997 1st and 2nd levels (about 7000 characters
including symbols, numeric characters, Latin, Cyrillic and
Greek alphabets, Japanese HIRAGANA, KATAKANA, and KANJI),
<item>JIS X 0212 (about 6000 characters including KANJI, which are not
included in JIS X 0208), and
<item>JIS X 0213:2000 (aka JIS 3rd and 4th levels).
| 73 |
kubota |
845 |
</list> |
| 74 |
|
|
</P> |
| 75 |
|
|
|
| 76 |
|
|
<P> |
<strong>JIS X 0201 Roman</strong> is the Japanese version of ISO 646.
Though JIS X 0201 is included in the SHIFT-JIS encoding (explained later) and
is widely used on Windows/Macintosh, its use is discouraged on UNIX.
| 80 |
kubota |
845 |
</P> |
| 81 |
|
|
|
| 82 |
|
|
<P> |
<strong>JIS X 0201 Kana</strong> defines about 60 KATAKANA characters.
It was widely used by old 8-bit computers.
Indeed, the SHIFT-JIS encoding was designed to be upward-compatible
with the 8-bit encoding of JIS X 0201 Roman and JIS X 0201 Kana.
Note that this CCS is not included in the ISO-2022-JP encoding, which is
used for e-mail and so on.
| 89 |
kubota |
845 |
</P> |
| 90 |
|
|
|
| 91 |
|
|
<P> |
<strong>JIS X 0212</strong> is not widely used, probably because it cannot be
included in SHIFT-JIS, the standard encoding for the Japanese versions
of Windows and Macintosh. Moreover, this CCS may become obsolete
once JIS X 0213 becomes popular, since JIS X 0213 contains many
characters which are included in JIS X 0212.
However, the advantage of JIS X 0212 over JIS X 0213 is that
all characters in JIS X 0212 are included in the current
Unicode (version 3.0.1) while not all characters in JIS X 0213
are.
| 101 |
|
|
</P> |
| 102 |
|
|
|
| 103 |
|
|
<P> |
<strong>JIS X 0208</strong> (aka JIS C 6226) is the main standard
for Japanese characters.
Strictly speaking, it was originally defined in 1978 and
revised in 1983, 1990, and 1997.
Though the 1997 version has 77 more characters than the original 1978 version
and the shapes of more than 200 characters have changed,
most software does not have to care about the differences between them.
However, note that the ISO-2022-JP encoding (explained below)
contains both JIS X 0208-1978 and JIS X 0208-1983.
The 1978 version is called 'old JIS' and the later versions are called 'new JIS'.
Characters in JIS X 0208 are divided into two levels, 1st and 2nd.
Old 8-bit computers rarely implemented the 2nd level.
| 116 |
kubota |
845 |
</P> |
| 117 |
|
|
|
| 118 |
|
|
<P> |
Use of the numeric characters and Latin letters in JIS X 0208 is
discouraged because these characters are also included in ASCII
and JIS X 0201 Roman, one of which is included in every encoding.
When converted into Unicode, these characters are mapped to the
'fullwidth forms'.
| 124 |
kubota |
845 |
</P> |
| 125 |
|
|
|
| 126 |
|
|
<P> |
All of these coded character sets (except for JIS X 0213) are
included in Unicode 3.0.1. Some of the JIS X 0213 characters are not
included in Unicode 3.0.1.
| 130 |
kubota |
845 |
</P> |
| 131 |
|
|
|
| 132 |
|
|
<P> |
There are a few different tables for converting the non-letter
characters of JIS X 0208 to and from Unicode. This is a problem because
it can break 'round-trip compatibility'.
<url id="http://www.opengroup.or.jp/jvc/cde/ucs-conv-e.html"
name="Problems and Solutions for Unicode and User/Vendor Defined Characters">
discusses this problem in detail.
| 139 |
kubota |
845 |
</P> |
| 140 |
|
|
|
| 141 |
|
|
|
| 142 |
kubota |
1054 |
<sect1 id="japanese-encodings"><heading>Encodings</heading> |
| 143 |
kubota |
845 |
|
| 144 |
|
|
<P> |
| 145 |
kubota |
1054 |
There are three popular encodings widely used in Japan. |
| 146 |
kubota |
845 |
<list> |
| 147 |
|
|
<item>ISO-2022-JP (aka JIS code or JUNET code) |
| 148 |
|
|
<list> |
| 149 |
|
|
<item>stateful |
| 150 |
|
|
<item>subset of 7bit version of ISO-2022, where ASCII, |
| 151 |
kubota |
1054 |
JIS X 0201-1976 Roman, JIS X 0208-1978, |
| 152 |
|
|
and JIS X 0208-1983 are supported. |
| 153 |
kubota |
845 |
<item>7bit, which means the most significant bit (MSB) of each |
| 154 |
kubota |
1054 |
byte is always zero. |
| 155 |
kubota |
845 |
<item>used for e-mail and net-news and preferred for HTML. |
      <item>Defined in RFC 1468.
| 157 |
kubota |
845 |
</list> |
| 158 |
|
|
<item>EUC-JP (Japanese version of Extended UNIX Code) |
| 159 |
|
|
<list> |
| 160 |
|
|
<item>stateless |
      <item>an implementation of EUC where G0, G1, G2, and G3 are
            ASCII, JIS X 0208, JIS X 0201 Kana, and JIS X 0212
            respectively. Many implementations cannot
            use JIS X 0201 Kana and JIS X 0212.
      <item>8bit
      <item>preferred encoding for UNIX. For example, almost all Japanese
            message catalogs for gettext are written in EUC-JP.
      <item>Bytes of Japanese characters are in the range
            <tt>0xa1</tt> - <tt>0xfe</tt> (plus the single-shift bytes
            <tt>0x8e</tt> and <tt>0x8f</tt>, which introduce G2 and G3 characters).
            This is important
            for programmers because they do not need to worry about
            fake '\' or '/' bytes (which can be treated in a special way in
            various contexts) inside Japanese characters.
| 173 |
|
|
</list> |
| 174 |
|
|
<item>SHIFT-JIS (aka Microsoft Kanji Code) |
| 175 |
|
|
<list> |
| 176 |
|
|
<item>stateless |
| 177 |
|
|
<item>NOT subset of ISO-2022 |
| 178 |
|
|
<item>8bit |
| 179 |
kubota |
1054 |
<item>JIS X 0201 Roman, JIS X 0201 Kana, and JIS X 0208 |
| 180 |
|
|
can be expressed, but JIS X 0212 cannot. |
      <item>The standard encoding for Windows/Macintosh. This makes
            SHIFT-JIS the most popular encoding in Japan. Though MS
            is thinking about a transition to UNICODE, it is doubtful
            that it can be done successfully.
| 185 |
|
|
</list> |
| 186 |
|
|
</list> |
| 187 |
|
|
</P> |
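<P>
The point about fake '\' bytes can be checked with a short Python sketch.
It assumes Python's built-in <tt>shift_jis</tt> and <tt>euc_jp</tt> codecs;
the helper name <tt>has_ascii_byte</tt> is hypothetical and exists only for
this illustration. Katakana SO is a well-known character whose second
SHIFT-JIS byte equals the code of backslash.
</P>

```python
# Katakana "SO" (U+30BD): a well-known character whose SHIFT-JIS
# second byte is 0x5c, the same value as the ASCII backslash.
text = "ソ"

sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")


def has_ascii_byte(data: bytes) -> bool:
    """Return True if any byte of `data` lies in the 7-bit range (hypothetical helper)."""
    return any(b < 0x80 for b in data)


print(sjis.hex(), has_ascii_byte(sjis))  # second byte collides with '\'
print(euc.hex(), has_ascii_byte(euc))    # every byte has its MSB set
```

<P>
Code that scans SHIFT-JIS text byte by byte for path separators or escape
characters therefore has to track double-byte boundaries, while the same
scan over EUC-JP text does not.
</P>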
| 188 |
|
|
|
| 189 |
|
|
<P> |
| 190 |
kubota |
1054 |
<strong>ISO-2022-JP</strong> is a subset of 7bit version of ISO 2022, |
| 191 |
|
|
where only G0 is used and G0 is assumed to be invoked into GL. |
| 192 |
|
|
Character sets included in ISO-2022-JP are: |
| 193 |
kubota |
857 |
<list> |
| 194 |
|
|
<item>ASCII (ESC 0x28 0x42), |
| 195 |
|
|
<item>JIS X 0201-1976 Roman (ESC 0x28 0x4a), |
| 196 |
|
|
<item>JIS X 0208-1978 (old JIS) (ESC 0x24 0x40), and |
| 197 |
|
|
<item>JIS X 0208-1983 (new JIS) (ESC 0x24 0x42). |
| 198 |
|
|
</list> |
Note that JIS X 0208-1978 and JIS X 0208-1983 are almost identical
and ASCII and JIS X 0201-1976 Roman are also almost identical.
A line (the stream of bytes between 'newline' control codes) must
start and end in the ASCII state.
See <ref id="iso-2022"> for details.
| 204 |
kubota |
845 |
</P> |
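<P>
As a rough illustration of the stateful design, the following Python sketch
splits an ISO-2022-JP byte stream into runs by the escape sequences listed
above. The function name <tt>split_by_charset</tt> is hypothetical, and the
sketch assumes well-formed input that starts in the ASCII state; it is not a
complete decoder.
</P>

```python
# Escape sequences of ISO-2022-JP, as listed in the text above.
ESCAPES = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201-1976 Roman",
    b"\x1b$@": "JIS X 0208-1978",
    b"\x1b$B": "JIS X 0208-1983",
}


def split_by_charset(data: bytes):
    """Split a byte stream into (character set, raw bytes) runs (hypothetical helper)."""
    state = "ASCII"          # a line must start in the ASCII state
    runs = []
    buf = bytearray()
    i = 0
    while i < len(data):
        esc = data[i:i + 3]
        if esc in ESCAPES:
            if buf:
                runs.append((state, bytes(buf)))
                buf = bytearray()
            state = ESCAPES[esc]
            i += 3
        else:
            buf.append(data[i])
            i += 1
    if buf:
        runs.append((state, bytes(buf)))
    return runs


encoded = "Aあ".encode("iso2022_jp")
print(encoded)
print(split_by_charset(encoded))
```

<P>
Every byte in the result has its most significant bit cleared, so the meaning
of a byte depends only on the escape sequence that preceded it.
</P>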
| 205 |
|
|
|
| 206 |
|
|
<P> |
<strong>ISO-2022-JP-2</strong> (RFC 1554) is a subset of the 7-bit version
of ISO 2022 and a superset of ISO-2022-JP. The difference between ISO-2022-JP
and ISO-2022-JP-2 is that ISO-2022-JP-2 supports more coded character sets
than ISO-2022-JP. Character sets included in ISO-2022-JP-2 are:
| 211 |
kubota |
857 |
<list> |
| 212 |
|
|
<item>ASCII (ESC 0x28 0x42) |
| 213 |
|
|
<item>JIS X 0201-1976 Roman (ESC 0x28 0x4a), |
| 214 |
|
|
<item>JIS X 0208-1978 (old JIS) (ESC 0x24 0x40), |
| 215 |
|
|
<item>JIS X 0208-1983 (new JIS) (ESC 0x24 0x42), |
| 216 |
|
|
<item>GB2312-80 (simplified Chinese) (ESC 0x24 0x41), |
| 217 |
|
|
<item>KS C 5601 (Korean) (ESC 0x24 0x28 0x43), |
| 218 |
|
|
<item>JIS X 0212-1990 (ESC 0x24 0x28 0x44), |
| 219 |
|
|
<item>ISO 8859-1 (Latin-1) (ESC 0x2e 0x41), and |
| 220 |
|
|
<item>ISO 8859-7 (Greek) (ESC 0x2e 0x46). |
| 221 |
|
|
</list> |
| 222 |
kubota |
1054 |
Though JIS X 0212-1990 may sometimes be used, ISO-2022-JP-2 |
| 223 |
kubota |
857 |
is rarely used. |
| 224 |
kubota |
845 |
</P> |
| 225 |
|
|
|
| 226 |
|
|
<P> |
| 227 |
kubota |
1054 |
<strong>ISO-2022-INT-1</strong> is a superset of ISO-2022-JP-2 which has |
| 228 |
kubota |
857 |
CNS 11643-1986-1 and CNS 11643-1986-2 (traditional Chinese). |
| 229 |
|
|
</P> |
| 230 |
|
|
|
| 231 |
|
|
<P> |
| 232 |
kubota |
1054 |
<strong>EUC-JP</strong> is a version of EUC, where |
| 233 |
|
|
G0 is ASCII, G1 is JIS X 0208, G2 is JIS X 0201 Kana, and |
| 234 |
|
|
G3 is JIS X 0212. G2 and G3 are sometimes not implemented. |
| 235 |
|
|
This is the most popular encoding for Linux/Unix. |
| 236 |
kubota |
857 |
See <ref id="euc"> for detail. |
| 237 |
kubota |
845 |
</P> |
| 238 |
|
|
|
| 239 |
|
|
<P> |
<strong>SHIFT-JIS</strong> is designed to be a superset of
the encodings for old 8-bit computers, which include JIS X 0201
Roman and JIS X 0201 Kana. <tt>0x20</tt> - <tt>0x7f</tt>
is JIS X 0201 Roman and <tt>0xa1</tt> - <tt>0xdf</tt> is
JIS X 0201 Kana. <tt>0x81</tt> - <tt>0x9f</tt> and <tt>0xe0</tt>
- <tt>0xef</tt> are the first bytes of double-byte characters
(vendor extensions also use first bytes up to <tt>0xfc</tt>).
The second byte is <tt>0x40</tt> - <tt>0x7e</tt> or
<tt>0x80</tt> - <tt>0xfc</tt>. This code space is used for JIS X 0208.
| 248 |
kubota |
857 |
</P> |
| 249 |
|
|
|
| 250 |
|
|
<P> |
UNICODE is not popular in Japan at all, probably because
conversion from these codes into Unicode is a bit difficult.
However, MS Windows uses Unicode in limited areas, for example
as the internal code for file names.
I expect that more and more software will come to support
Unicode in the future.
| 257 |
kubota |
845 |
</P> |
| 258 |
|
|
|
| 259 |
kubota |
857 |
<P> |
You can convert files between these encodings using the
<package>nkf</package> or <package>kcc</package> package.
With the options <tt>-j</tt>, <tt>-s</tt>, and <tt>-e</tt>,
<prgn>nkf</prgn> converts a file into ISO-2022-JP (aka JIS),
SHIFT-JIS (aka MS-KANJI), and EUC-JP, respectively. Note that
the difference between JIS X 0201 Roman and ASCII is ignored.
Though <prgn>nkf</prgn> can guess the encoding of
the input file, you can also specify it with a command option.
This is because there is no algorithm that can completely distinguish
EUC-JP from SHIFT-JIS, though <prgn>nkf</prgn> usually guesses
correctly. <prgn>tcs</prgn> can also convert these encodings,
though it does not guess the input encoding.
Conversion between these encodings can be done with a simple
algorithm since all of them are based on the same character sets.
Conversion between these encodings and Unicode, in contrast, needs a table.
| 275 |
kubota |
857 |
</P> |
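<P>
The simple algorithm mentioned above can be sketched in Python as follows.
This is a minimal sketch, not how <prgn>nkf</prgn> is implemented: it handles
only ASCII and JIS X 0208 (no JIS X 0201 Kana, no JIS X 0212), and the
function names are hypothetical. A JIS X 0208 character is a pair of 7-bit
bytes; EUC-JP sets the MSB of both, and SHIFT-JIS packs two rows into one
first byte.
</P>

```python
def jis_to_euc(j1: int, j2: int) -> bytes:
    """JIS X 0208 code (two 7-bit bytes) to EUC-JP: set the MSB of each byte."""
    return bytes([j1 | 0x80, j2 | 0x80])


def jis_to_sjis(j1: int, j2: int) -> bytes:
    """JIS X 0208 code (two 7-bit bytes) to SHIFT-JIS by arithmetic only."""
    if j1 % 2 == 1:
        s2 = j2 + 0x1F
        if s2 >= 0x7F:       # 0x7f is skipped in the second byte
            s2 += 1
    else:
        s2 = j2 + 0x7E
    s1 = (j1 + 1) // 2 + 0x70
    if s1 > 0x9F:            # jump over the JIS X 0201 Kana area
        s1 += 0x40
    return bytes([s1, s2])


def euc_to_sjis(data: bytes) -> bytes:
    """Convert EUC-JP bytes to SHIFT-JIS (ASCII and JIS X 0208 only)."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(b)
            i += 1
        elif 0xA1 <= b <= 0xFE and i + 1 < len(data):
            out += jis_to_sjis(b & 0x7F, data[i + 1] & 0x7F)
            i += 2
        else:
            raise ValueError("unsupported byte 0x%02x" % b)
    return bytes(out)


print(euc_to_sjis("日本語".encode("euc_jp")).hex())
```

<P>
No lookup table is involved, which is why conversion among ISO-2022-JP,
EUC-JP, and SHIFT-JIS is cheap compared with conversion to Unicode.
</P>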
| 276 |
kubota |
845 |
|
| 277 |
|
|
|
| 278 |
kubota |
1054 |
<sect1 id="japanese-how"><heading>How These Encodings Are Used --- Information for Programmers</heading> |
| 279 |
kubota |
857 |
|
| 280 |
kubota |
845 |
<P> |
Since EUC-JP is widely used on UNIX,
EUC-JP should be supported. Exceptions are shown below.
Of course, hard-coding knowledge of EUC-JP is
discouraged. If you can implement your software without such knowledge,
for example by using <tt>wchar_t</tt>, you should do so.
| 286 |
|
|
<list> |
<item>The body of mail and news messages must be written in ISO-2022-JP.
<item>The de-facto standard for ICQ is SHIFT-JIS.
<item>WWW browsers must recognize all of these encodings.
<item>Software which communicates with Windows/Macintosh should use
SHIFT-JIS.
<item>SHIFT-JIS is widely used for BBS. (BBS is a service like Compuserve).
<item>File names on Joliet-format CD-ROMs used for Windows are written
in Unicode.
| 295 |
|
|
</list> |
| 296 |
|
|
</P> |
| 297 |
|
|
|
| 298 |
|
|
|
| 299 |
|
|
|
| 300 |
kubota |
857 |
<sect1 id="japanese-columns"><heading>Columns</heading> |
| 301 |
kubota |
845 |
|
| 302 |
|
|
<P> |
| 303 |
|
|
In consoles which are able to display Japanese characters |
| 304 |
kubota |
1054 |
(kon, jfbterm, kterm, krxvt, and so on), characters in JIS X 0201 |
| 305 |
|
|
(Roman and Kana) occupy 1 column |
| 306 |
|
|
and characters in JIS X 0208, JIS X 0212, and JIS X 0213 occupy 2 columns. |
| 307 |
kubota |
845 |
</P> |
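<P>
A program that aligns text in such a console needs to count columns rather
than characters. The sketch below approximates the rule above with the
Unicode East Asian Width property from Python's <tt>unicodedata</tt> module.
This is an assumption for illustration: the rule in the text is stated in
terms of JIS character sets, not Unicode properties, and the function name
<tt>columns</tt> is hypothetical. Control and combining characters are not
handled.
</P>

```python
import unicodedata


def columns(text: str) -> int:
    """Count display columns: 2 for wide/fullwidth characters, 1 otherwise."""
    total = 0
    for ch in text:
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        total += 2 if wide else 1
    return total


print(columns("abc"))     # Latin letters
print(columns("ｱ"))       # halfwidth Katakana (JIS X 0201 Kana)
print(columns("日本語"))  # Kanji (JIS X 0208)
```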
| 308 |
|
|
|
| 309 |
|
|
|
| 310 |
|
|
|
| 311 |
kubota |
857 |
<sect1 id="japanese-direction"><heading>Writing Direction and Combined Characters</heading> |
| 312 |
kubota |
845 |
|
| 313 |
|
|
<P> |
Japanese can be written vertically. A line goes
downward and the lines proceed from right to left. This direction
is the traditional style. For example, most Japanese books, magazines
and newspapers, except those in the field of natural science
(or ones containing many Latin words or equations), are written
vertically. Thus a word processor is strongly recommended
to support this direction. DTP systems which don't support this
direction are almost useless.
| 322 |
kubota |
845 |
</P> |
| 323 |
|
|
|
| 324 |
|
|
<P> |
Japanese can also be written in the same direction as Latin
languages. Japanese books and magazines on science and technology
are written in this direction. For most ordinary software it is enough
to support this direction only.
| 329 |
|
|
</P> |
| 330 |
|
|
|
| 331 |
|
|
<P> |
A few Japanese characters need different glyphs for the vertical
direction. These are the ones you would expect: parentheses and the
'long syllable' symbol, whose shape is like a dash in English or
the mathematical 'minus' sign. The symbols equivalent to the
period and comma also have different styles for the horizontal and vertical
directions.
| 338 |
|
|
</P> |
| 339 |
|
|
|
| 340 |
|
|
<P> |
In Japan, Arabic numerals are widely used, as in European
languages, though we also have Kanji (ideogram) numerals.
Latin characters
can also appear in Japanese texts. If a run of 1 - 3 (or 4) Arabic
or Latin characters appears in vertical Japanese text, these characters
can be squeezed into one column. If more characters appear (large numbers
or long words), the paper is rotated 90 degrees anticlockwise and
the characters are written in the European way. Sometimes Latin characters
which appear in vertical text are written in the same way as Japanese
characters, i.e., vertically. This is not a strong
custom. Arabic and Latin characters can always be written in either
the normal or the rotated way in vertical text.
| 353 |
|
|
<footnote> |
| 354 |
kubota |
1054 |
I HAVE TO SHOW EXAMPLE USING GRAPHICS. |
| 355 |
kubota |
845 |
</footnote> |
| 356 |
kubota |
1054 |
DTP system should support all of them. |
| 357 |
kubota |
845 |
</P> |
| 358 |
|
|
|
| 359 |
|
|
<P> |
A version of Japanized TeX (developed by ASCII, a publishing company
in Japan) supports the vertical direction. It can also
handle a page containing both vertical and horizontal text.
| 363 |
|
|
</P> |
| 364 |
|
|
|
| 365 |
|
|
|
| 366 |
kubota |
857 |
<sect1 id="japanese-layout"><heading>Layout of Characters</heading> |
| 367 |
kubota |
845 |
|
| 368 |
|
|
<P> |
In Japanese, words are not separated by spaces and,
unlike in European languages, a line can be broken anywhere,
with a few exceptions. Thus hyphenation is not needed for Japanese.
| 372 |
|
|
</P> |
| 373 |
|
|
|
| 374 |
|
|
<P> |
Characters like open parentheses cannot come at the end
of a line. Characters like close parentheses and
sentence-separating marks such as the period and comma cannot
come at the beginning of a line. This rule and its processing are
called 'kinsoku' in Japanese.
| 380 |
kubota |
845 |
</P> |
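<P>
A minimal sketch of kinsoku processing in Python follows. It assumes a fixed
number of characters per line and resolves a violation by moving the break
point backwards, which is only one of several possible strategies. The
character sets <tt>NO_LINE_START</tt> and <tt>NO_LINE_END</tt> and the
function name <tt>break_lines</tt> are hypothetical and far from complete.
</P>

```python
# Hypothetical, incomplete sets of characters covered by kinsoku.
NO_LINE_START = set("、。，．）」』】")   # must not begin a line
NO_LINE_END = set("（「『【")            # must not end a line


def break_lines(text: str, width: int):
    """Break `text` into lines of at most `width` characters, honoring kinsoku."""
    lines = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        # Move the break point backwards while it violates kinsoku,
        # but always keep at least one character on the line.
        while (end < len(text) and end > start + 1
               and (text[end] in NO_LINE_START or text[end - 1] in NO_LINE_END)):
            end -= 1
        lines.append(text[start:end])
        start = end
    return lines


for line in break_lines("今日は「晴れ」です。", 4):
    print(line)
```

<P>
With a width of 4, a naive break would leave the opening bracket at the end
of the first line; the sketch moves it to the start of the second line
instead.
</P>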
| 381 |
|
|
|
| 382 |
|
|
<P> |
In European languages, a line break is equivalent to a space.
In Japanese, a line break should be ignored.
For example, when rendering an HTML file, a line-breaking character
in the HTML source should not be converted into whitespace.
| 387 |
kubota |
845 |
</P> |
| 388 |
|
|
|
| 389 |
|
|
|
| 390 |
|
|
|
| 391 |
|
|
|
| 392 |
kubota |
857 |
<sect1 id="japanese-lang"><heading>LANG variable</heading> |
| 393 |
kubota |
845 |
|
| 394 |
|
|
<P> |
Different values of <tt>LANG</tt> are used for different encodings.
| 396 |
kubota |
845 |
</P> |
| 397 |
|
|
|
| 398 |
|
|
<P> |
The following values are used for EUC-JP.
<list>
<item><tt>LANG=ja_JP.eucJP</tt> (the most common on Linux and *BSD)
<item><tt>LANG=ja_JP.ujis</tt> (used to be the most common on Linux)
<item><tt>LANG=ja_JP</tt> (because EUC-JP is the de-facto standard for UNIX;
not recommended)
<item><tt>LANG=ja</tt> (because EUC-JP is the de-facto standard for UNIX;
not recommended)
</list>
| 408 |
|
|
</P> |
| 409 |
|
|
|
| 410 |
|
|
<P> |
| 411 |
|
|
<tt>LANG=ja_JP.jis</tt> is used for ISO-2022-JP (aka JIS code or JUNET code). |
| 412 |
|
|
</P> |
| 413 |
|
|
|
| 414 |
|
|
<P> |
| 415 |
|
|
<tt>LANG=ja_JP.sjis</tt> is used for SHIFT-JIS (aka Microsoft Kanji Code). |
| 416 |
|
|
</P> |
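<P>
The mapping described above can be written down as a small Python lookup.
This is only a sketch of the conventions listed in this section; the names
<tt>CODESET_TO_ENCODING</tt> and <tt>encoding_for_lang</tt> are hypothetical,
and real systems should query the locale through the C library rather than
parse <tt>LANG</tt> by hand.
</P>

```python
# Codeset suffixes named in this section, compared case-insensitively.
CODESET_TO_ENCODING = {
    "eucjp": "EUC-JP",
    "ujis": "EUC-JP",
    "jis": "ISO-2022-JP",
    "sjis": "SHIFT-JIS",
}


def encoding_for_lang(lang: str) -> str:
    """Return the encoding implied by a Japanese LANG value (hypothetical helper)."""
    locale_name, _, codeset = lang.partition(".")
    if locale_name not in ("ja", "ja_JP"):
        raise ValueError("not a Japanese locale: %r" % lang)
    if not codeset:
        # A bare "ja" or "ja_JP" is treated as EUC-JP, the UNIX de-facto standard.
        return "EUC-JP"
    try:
        return CODESET_TO_ENCODING[codeset.lower()]
    except KeyError:
        raise ValueError("unknown codeset: %r" % codeset)


for value in ("ja_JP.eucJP", "ja_JP.ujis", "ja", "ja_JP.jis", "ja_JP.sjis"):
    print(value, "->", encoding_for_lang(value))
```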
| 417 |
|
|
|
| 418 |
kubota |
857 |
<P> |
Setting LANG is not sufficient for a Japanese user who has just installed
Linux to get a minimal Japanese environment. There are several
books on setting up a Japanese environment on Linux/BSD, and
Linux magazines often have feature articles on how to set one up.
Nowadays, Japanized Linux distributions,
which are tuned so that much of the basic software can display and
accept Japanese input, are popular.
Debian GNU/Linux has the <package>user-ja</package> (for potato) and
<package>language-env</package> (for woody and later versions)
packages to set up a basic Japanese environment.
| 429 |
kubota |
857 |
</P> |
| 430 |
kubota |
845 |
|
| 431 |
|
|
|
| 432 |
kubota |
857 |
<sect1 id="japanese-input"><heading>Input from Keyboard</heading> |
| 433 |
kubota |
845 |
|
| 434 |
|
|
<P> |
Since Japanese characters cannot be input directly from a keyboard,
software is needed to convert ASCII characters into Japanese.
<prgn>WNN</prgn>, <prgn>Canna</prgn>, and <prgn>SKK</prgn> are popular
free programs for inputting Japanese. Though
<prgn>T-Code</prgn> is also available, it is difficult to use.
Since these adopt a server/client model and implement their own protocols,
we cannot input Japanese with <package>wnn</package>,
<package>canna</package>, or <package>skk</package>
(and the packages they depend on) alone.
| 444 |
kubota |
845 |
</P> |
| 445 |
|
|
|
| 446 |
|
|
<P> |
In the X Window System environment, the <package>kinput2-*</package> and
<package>skkinput</package> packages
connect these protocols to XIM, which is the standard input
protocol for X. Kinput2 also has its own protocol, and
<package>kterm</package> and others can act as clients of the kinput2 protocol.
The kinput2 protocol was developed before international standards such
as XIM (or Ximp or Xsi) became available.
| 454 |
kubota |
857 |
</P> |
| 455 |
|
|
|
| 456 |
|
|
<P> |
On the console there is no standard, and each program has to support
the wnn and/or canna protocol itself. Examples are <package>jvim-canna</package>,
<package>xemacs21-mule-canna</package>, and emacs20 with
<package>emacs-dl-canna</package> or <package>emacs-dl-wnn</package>.
Thus the way to operate differs from program to program.
<package>skkfep</package> provides a general way to input Japanese
on the console.
| 464 |
kubota |
857 |
</P> |
| 465 |
|
|
|
| 466 |
|
|
<P> |
The way to input Japanese is explained below.
| 468 |
|
|
</P> |
| 469 |
|
|
|
| 470 |
|
|
<P> |
Since almost every Hiragana and Katakana character represents a pair of
a consonant and a vowel, we can input one Hiragana or
one Katakana character with two Latin letters. A few Hiragana and Katakana
characters need one or three letters.
| 475 |
|
|
</P> |
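<P>
The first step of input, from Latin letters to Hiragana, can be sketched in
Python with a greedy longest-match lookup. The table below is a tiny
hypothetical excerpt chosen for the example; real input software uses a
complete table and handles many more cases.
</P>

```python
# Tiny, hypothetical excerpt of a romaji table: one, two, and three letters.
ROMAJI_TO_HIRAGANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お", "n": "ん",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "na": "な", "ni": "に", "ji": "じ",
    "kyo": "きょ", "shi": "し",
}


def to_hiragana(romaji: str) -> str:
    """Convert Latin letters to Hiragana by greedy longest match."""
    out = []
    i = 0
    while i < len(romaji):
        for length in (3, 2, 1):
            chunk = romaji[i:i + length]
            if chunk in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError("cannot convert %r" % romaji[i:])
    return "".join(out)


print(to_hiragana("kanji"))
print(to_hiragana("kyo"))
```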
| 476 |
|
|
|
| 477 |
|
|
<P> |
Kanji are obtained by conversion from Hiragana.
Many Japanese words are written with two or more Kanji,
and almost all recent conversion software can convert such words in one step.
(Old software can convert only one Kanji at a time, which requires
patience.)
Software with a good grammar/context analyzer and a large dictionary
can convert longer phrases or even a whole sentence at a time.
However, we usually have to select one Kanji or word from the
candidates the software shows, because Japanese has
many homophones. For example, 61 Kanji read 'KAN'
and 6 words read 'KOUKOU' are registered in
the dictionary of <tt>canna</tt>.
(Today, 2 Oct 1999, I saw a TV advertisement for a Japanese word processor
which claims that the software can correctly convert an input into
'a cafe which opened today', not 'a cafe which rotated today'.
Though the Japanese word 'KAITEN' means both 'open (a shop)' and 'rotate',
the software knows it is more usual for a cafe to open than to rotate.)
| 495 |
|
|
</P> |
| 496 |
|
|
|
| 497 |
|
|
<P> |
The conversion from Hiragana to Kanji needs a large dictionary which
contains the Kanji spellings and readings of the major Japanese words and
their conjugations or inflections. Thus proprietary software tends to
convert more effectively. Such software usually has dictionaries larger than
a few megabytes. Some recent proprietary programs
even analyze the topic or meaning of the input Hiragana sentence
and choose the most appropriate homophone, though they often choose
wrong ones.
| 506 |
|
|
</P> |
| 507 |
|
|
|
| 508 |
|
|
<P> |
Nowadays several proprietary conversion programs for Linux, such as ATOK, WNN6,
and VJE, are sold in Japan.
| 511 |
|
|
</P> |
| 512 |
|
|
|
| 513 |
|
|
<P> |
Since inputting Japanese characters is complex and hard work for users,
we don't want to input Y (for YES) or N (for NO) in Japanese.
We prefer learning such basic English words to inputting Japanese
words by invoking the conversion software, typing the Latin alphabetic
expression of the Japanese, converting it into Hiragana, converting
that into Kanji, choosing the correct Kanji, confirming the
choice, and closing the conversion software each time we need to
input yes or no or similar words.
| 522 |
|
|
</P> |
| 523 |
|
|
|
| 524 |
|
|
|
| 525 |
kubota |
857 |
<sect1 id="japanese-more"><heading>More Detailed Discussions</heading> |
| 526 |
kubota |
845 |
|
| 527 |
kubota |
857 |
<sect2 id="japanese-width"><heading>Width of Characters</heading> |
| 528 |
kubota |
845 |
|
| 529 |
|
|
<P> |
Unlike in European languages, Japanese characters should be
written in a fixed width. Exceptions arise when two symbols
such as parentheses, periods and commas follow each other. Kerning
should be done in such cases if the software is a word processor.
A text editor need not do so.
| 535 |
|
|
</P> |
| 536 |
|
|
|
| 537 |
kubota |
864 |
<sect2 id="japanese-ruby"><heading>Ruby</heading> |
| 538 |
kubota |
845 |
|
| 539 |
|
|
<P> |
<strong>Ruby</strong> are small (usually 1/2 in length and 1/4 in
area, or a bit smaller)
characters written above (in horizontal direction) or at the right side
(in vertical direction) of the main text. They are usually used to show
the reading of a difficult Kanji.
| 545 |
|
|
</P> |
| 546 |
|
|
|
| 547 |
|
|
<P> |
Japanized TeX can typeset ruby by using an extra macro. Word processors should
have a ruby facility.
| 550 |
|
|
</P> |
| 551 |
|
|
|
| 552 |
|
|
|
| 553 |
kubota |
857 |
<sect2 id="japanese-case"><heading>Upper And Lower Cases</heading> |
| 554 |
kubota |
845 |
|
| 555 |
|
|
<P> |
Japanese characters do not have upper and lower cases, although
there are two sets of phonograms, Hiragana and Katakana.
| 558 |
|
|
</P> |
| 559 |
|
|
|
| 560 |
|
|
<P> |
| 561 |
kubota |
1054 |
Thus <tt>tolower()</tt> and <tt>toupper()</tt> should not convert |
| 562 |
|
|
between Hiragana and Katakana. |
| 563 |
kubota |
845 |
</P> |
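<P>
The behaviour recommended above can be observed in Python, whose string case
methods follow Unicode case mappings and leave kana unchanged. When a program
really needs to convert between the two sets, it should do so explicitly. The
sketch below relies on the fact that, in Unicode, each Hiragana in the range
U+3041 - U+3096 has its Katakana counterpart 0x60 code points higher; the
function name <tt>hiragana_to_katakana</tt> is hypothetical.
</P>

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096
KANA_OFFSET = 0x60  # distance from a Hiragana to its Katakana in Unicode


def hiragana_to_katakana(text: str) -> str:
    """Explicitly convert Hiragana to Katakana; other characters are kept."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )


print("ひらがな".upper())                 # case conversion leaves kana alone
print(hiragana_to_katakana("ひらがな"))   # explicit conversion
```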
| 564 |
|
|
|
| 565 |
|
|
<P> |
Hiragana is used for ordinary text. Katakana is used mainly to
express foreign or imported words, for example, KONPYU-TA
for computer, MAIKUROSOFUTO for Microsoft, and AINSYUTAIN for Einstein.
| 569 |
|
|
</P> |
| 570 |
|
|
|
| 571 |
|
|
|
| 572 |
kubota |
857 |
<sect2 id="japanese-sort"><heading>Sorting</heading> |
| 573 |
kubota |
845 |
|
| 574 |
|
|
<P> |
Phonograms (Hiragana and Katakana) have a sorting order.
The order is the same as that defined in JIS X 0208, with a few exceptions.
| 577 |
|
|
</P> |
| 578 |
|
|
|
| 579 |
|
|
<P> |
Sorting ideograms (Kanji) is difficult. They should be sorted
by their readings, but almost all Kanji have several readings depending
on the context. So if you want to sort Japanese text, you will need
a dictionary of all Japanese Kanji words. Moreover, a few
Japanese words written in Kanji have different readings for
exactly the same sequence of Kanji; this occurs especially in personal
names.
So address book databases usually have two 'name' columns,
one for the Kanji expression and the other for Hiragana.
| 589 |
|
|
</P> |
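<P>
The two-column approach can be sketched in Python as follows. The entries
and function names are hypothetical, and sorting the readings by code point
is only an approximation of the proper kana order, which, as noted above,
has a few exceptions.
</P>

```python
# Hypothetical address book entries: (name in Kanji, reading in Hiragana).
entries = [
    ("久保田", "くぼた"),
    ("青木", "あおき"),
    ("田中", "たなか"),
]


def sort_by_reading(rows):
    """Sort by the Hiragana column (code point order approximates kana order)."""
    return sorted(rows, key=lambda row: row[1])


def sort_by_kanji(rows):
    """Sort by the Kanji column; the result follows code values, not readings."""
    return sorted(rows, key=lambda row: row[0])


by_reading = [name for name, _ in sort_by_reading(entries)]
by_kanji = [name for name, _ in sort_by_kanji(entries)]
print(by_reading)
print(by_kanji)
```

<P>
Sorting on the Kanji column alone gives an order that a Japanese reader would
not expect, which is why the separate reading column is kept.
</P>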
| 590 |
|
|
|
| 591 |
|
|
<P> |
I know of no software, free or proprietary, which can sort
Japanese words perfectly.
| 594 |
|
|
</P> |
| 595 |
|
|
|
| 596 |
|
|
|
| 597 |
kubota |
864 |
<sect2 id="japanese-romaji"><heading> Ro-ma ji (Alphabetic expression of Japanese)</heading> |
| 598 |
kubota |
845 |
|
| 599 |
|
|
<P> |
We have a phonetic alphabetic expression of Japanese, Ro-ma ji.
It has an almost one-to-one correspondence to the Japanese phonograms.
It can be used to display Japanese text on the Linux console and
so on. Since Japanese has many homophones, this expression
can be hard to read.
| 605 |
|
|
</P> |
| 606 |
|
|
|
| 607 |
|
|
<P> |
| 608 |
|
|
There are several variants of Ro-ma ji. |
| 609 |
|
|
</P> |
| 610 |
|
|
|
| 611 |
|
|
<P> |
The first distinguishing point is the handling of long syllables.
For example, a long 'E' syllable is expressed as:
| 614 |
|
|
<list> |
| 615 |
|
|
<item>'E' with caret, |
| 616 |
|
|
<item>'E' with upper bar, |
| 617 |
|
|
<item>only 'E' in which long syllable mark is ignored, |
| 618 |
|
|
<item>'EE', |
| 619 |
|
|
<item>and so on. |
| 620 |
|
|
</list> |
| 621 |
|
|
</P> |
| 622 |
|
|
|
| 623 |
|
|
<P> |
The second distinguishing point is some special pairs
of consonant and vowel.
For example, the Hiragana character for the combination of 'T' and 'I' is
pronounced like 'CHI'. Such pairs are:
| 628 |
|
|
<list> |
| 629 |
|
|
<item>TI or CHI, as described above, |
| 630 |
|
|
<item>TU or TSU, |
| 631 |
|
|
<item>SI or SHI, |
| 632 |
|
|
<item>HU or FU, |
| 633 |
|
|
<item>WO or O, |
| 634 |
|
|
<item>TYA or CHA, and |
| 635 |
|
|
<item>N or M. |
| 636 |
|
|
</list> |
| 637 |
|
|
</P> |
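<P>
Software that emits Ro-ma ji has to pick one spelling for each of these
pairs. The Python sketch below stores both spellings from the list above and
selects one by index (0 for the first spelling listed, 1 for the second).
The table and the function name <tt>romanize</tt> are hypothetical, only the
listed pairs are covered, and the N or M case is omitted because it depends
on the following sound.
</P>

```python
# Both spellings for the special pairs listed above: (first, second).
VARIANTS = {
    "ち": ("ti", "chi"),
    "つ": ("tu", "tsu"),
    "し": ("si", "shi"),
    "ふ": ("hu", "fu"),
    "を": ("wo", "o"),
    "ちゃ": ("tya", "cha"),
}


def romanize(kana: str, variant: int) -> str:
    """Spell `kana` in Latin letters using the chosen variant (0 or 1)."""
    out = []
    i = 0
    while i < len(kana):
        for length in (2, 1):          # try two-character kana first
            chunk = kana[i:i + length]
            if chunk in VARIANTS:
                out.append(VARIANTS[chunk][variant])
                i += len(chunk)
                break
        else:
            raise ValueError("no spelling known for %r" % kana[i])
    return "".join(out)


print(romanize("つち", 0), romanize("つち", 1))
print(romanize("ちゃ", 0), romanize("ちゃ", 1))
```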
| 638 |
|
|
|