.TH UTF 7
.SH NAME
UTF, Unicode, ASCII, rune \- character set and format
.SH DESCRIPTION
The Plan 9 character set and representation are
based on the Unicode Standard and on the ISO multibyte
.SM UTF-8
encoding (Universal Character
Set Transformation Format, 8 bits wide).
The Unicode Standard represents its characters in 16
bits;
.SM UTF-8
represents such
values in an 8-bit byte stream.
Throughout this manual,
.SM UTF-8
is shortened to
.SM UTF.
.PP
In Plan 9, a
.I rune
is a 16-bit quantity representing a Unicode character.
Internally, programs may store characters as runes.
However, any external manifestation of textual information,
in files or at the interface between programs, uses a
machine-independent, byte-stream encoding called
.SM UTF.
.PP
.SM UTF
is designed so that characters in the 7-bit
.SM ASCII
set (values hexadecimal 00 to 7F)
appear only as themselves
in the encoding.
Runes with values above 7F appear as sequences of two or more
bytes with values only from 80 to FF.
.PP
The
.SM UTF
encoding of the Unicode Standard is backward compatible with
.SM ASCII\c
:
programs presented only with
.SM ASCII
work on Plan 9
even if not written to deal with
.SM UTF,
as do
programs that deal with uninterpreted byte streams.
However, programs that perform semantic processing on
.SM ASCII
graphic
characters must convert from
.SM UTF
to runes
in order to work properly with non-\c
.SM ASCII
input.
See
.MR rune (3) .
.PP
Letting numbers be binary,
a rune x is converted to a multibyte
.SM UTF
sequence
as follows:
.PP
01.   x in [00000000.0bbbbbbb] → 0bbbbbbb
.br
10.   x in [00000bbb.bbbbbbbb] → 110bbbbb, 10bbbbbb
.br
11.   x in [bbbbbbbb.bbbbbbbb] → 1110bbbb, 10bbbbbb, 10bbbbbb
.br
.PP
Conversion 01 provides a one-byte sequence that spans the
.SM ASCII
character set in a compatible way.
Conversions 10 and 11 represent higher-valued characters
as sequences of two or three bytes with the high bit set.
Plan 9 does not support the 4, 5, and 6 byte sequences proposed by X-Open.
When there are multiple ways to encode a value, for example rune 0,
the shortest encoding is used.
.PP
In the inverse mapping,
any sequence except those described above
is incorrect and is converted to rune hexadecimal 0080.
.SH "SEE ALSO"
.MR ascii (1) ,
.MR tcs (1) ,
.MR rune (3) ,
.IR "The Unicode Standard" .
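.SH EXAMPLES
The three conversions and the inverse mapping described above
can be sketched in C roughly as follows.
This is an illustration only, not the library code documented in
.MR rune (3) .
The names
.BR Rune16 ,
.BR RuneBad ,
.BR utfencode ,
and
.B utfdecode
are hypothetical, and consuming a single byte after an
incorrect sequence is an assumption that this page does not specify.
.PP
.EX
```c
typedef unsigned short Rune16;	/* hypothetical name for a 16-bit rune */

enum { RuneBad = 0x0080 };	/* rune the text assigns to incorrect sequences */

/*
 * Encode r into buf, which must hold at least 3 bytes.
 * Returns the number of bytes written (1, 2, or 3).
 * The shortest conversion that fits is always chosen.
 */
int
utfencode(Rune16 r, unsigned char *buf)
{
	if(r <= 0x7F){			/* conversion 01: 0bbbbbbb */
		buf[0] = (unsigned char)r;
		return 1;
	}
	if(r <= 0x7FF){			/* conversion 10: 110bbbbb 10bbbbbb */
		buf[0] = (unsigned char)(0xC0 | (r >> 6));
		buf[1] = (unsigned char)(0x80 | (r & 0x3F));
		return 2;
	}
	/* conversion 11: 1110bbbb 10bbbbbb 10bbbbbb */
	buf[0] = (unsigned char)(0xE0 | (r >> 12));
	buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
	buf[2] = (unsigned char)(0x80 | (r & 0x3F));
	return 3;
}

/*
 * Decode one rune from the n bytes at buf into *r.
 * Returns the number of bytes consumed (0 only when n < 1).
 * Any incorrect sequence, including one that is not the shortest
 * encoding of its value, yields RuneBad and consumes one byte
 * (the one-byte step is an assumption of this sketch).
 */
int
utfdecode(const unsigned char *buf, int n, Rune16 *r)
{
	unsigned int c, c1, c2, v;

	*r = RuneBad;
	if(n < 1)
		return 0;
	c = buf[0];
	if(c < 0x80){			/* inverse of conversion 01 */
		*r = (Rune16)c;
		return 1;
	}
	if(n < 2 || (buf[1] & 0xC0) != 0x80)
		return 1;		/* missing continuation byte */
	c1 = buf[1] & 0x3F;
	if((c & 0xE0) == 0xC0){		/* inverse of conversion 10 */
		v = ((c & 0x1F) << 6) | c1;
		if(v <= 0x7F)
			return 1;	/* not the shortest encoding */
		*r = (Rune16)v;
		return 2;
	}
	if((c & 0xF0) == 0xE0){		/* inverse of conversion 11 */
		if(n < 3 || (buf[2] & 0xC0) != 0x80)
			return 1;
		c2 = buf[2] & 0x3F;
		v = ((c & 0x0F) << 12) | (c1 << 6) | c2;
		if(v <= 0x7FF)
			return 1;	/* not the shortest encoding */
		*r = (Rune16)v;
		return 3;
	}
	return 1;			/* stray continuation or unsupported lead byte */
}
```
.EE
.PP
With this sketch, rune hexadecimal 0041 encodes as the single byte 41,
rune 00E9 as the two bytes C3 A9,
and rune 20AC as the three bytes E2 82 AC.
Decoding the two bytes C0 80, a longer-than-necessary encoding of rune 0,
yields rune 0080.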