Talk:UTF-8
Slightly ambiguous text
"UTF-16 requires surrogates; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8." Does this mean that the offset is subtracted in UTF-8, or that it is subtracted in UTF-16 and not in UTF-8? —The preceding unsigned comment was added by Plugwash (talk • contribs) 2004-12-04 22:38:47 (UTC).
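(A minimal Python 3 sketch, added here for illustration and not part of the original article text, showing that the 0x10000 offset belongs to UTF-16 surrogate construction only; UTF-8 packs the code point's bits directly, with no offset. The example character U+1D11E is my own choice.)

```python
# U+1D11E (MUSICAL SYMBOL G CLEF), a supplementary-plane character.
cp = 0x1D11E

# UTF-16: subtract 0x10000, then split the remaining 20 bits into a surrogate pair.
v = cp - 0x10000
high = 0xD800 | (v >> 10)     # 0xD834
low  = 0xDC00 | (v & 0x3FF)   # 0xDD1E
assert chr(cp).encode('utf-16-be') == bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])

# UTF-8: no offset; the code point's own bits are distributed over four bytes.
utf8 = bytes([0xF0 | (cp >> 18),
              0x80 | ((cp >> 12) & 0x3F),
              0x80 | ((cp >> 6) & 0x3F),
              0x80 | (cp & 0x3F)])
assert chr(cp).encode('utf-8') == utf8    # f0 9d 84 9e
```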
Reason for RFC 3629 restricting sequence length
I've tried to find a good reason why RFC3629 restricts the allowed encoding area from 0x0-0x7FFFFFFF down to 0x0-0x10FFFF, but haven't found anything. It would be nice to have an explanation about why this was done. Was it only to be UTF-16 compatible? -- sunny256 2004-12-06 04:07Z
- I haven't seen any specific reference to why in the specifications, but I think what you mention is the most likely reason. I do wonder if this restriction will last, though, or if they will have to find a way to add more code points at a later date (they originally thought that 65535 code points would be plenty ;) ). Plugwash 10:31, 6 Dec 2004 (UTC)
- The reason is indeed UTF-16 compatibility. Originally Unicode was meant to be fixed width. After the Unicode/ISO10646 harmonization, Unicode was meant to encode the basic multilingual plane only, and was equivalent to ISO's UCS-2. The ISO10646 repertoire was larger from the beginning, but no way of encoding it in 16 bit units was provided. After UTF-16 was defined and standardized, future character assignments were restricted to the first 16+ planes of the 31-bit ISO repertoire, which are reachable with UTF-16. Consequently security reasons prompted the abandonment of non-minimum length UTF-8 encodings, and around the same time valid codes were restricted to at most four bytes, instead of the original six capable of encoding the whole theoretical ISO repertoire. Decoy 14:05, 11 Apr 2005 (UTC)
- The precise references for the above are [1] and [2]. The latter indicates that WG2 on the ISO side has formally recognized the restriction. Furthermore, [3] and [4] indicate that the US has formally requested WG2 to commit to never allocating anything beyond the UTF-16 barrier in ISO10646, and that the restriction will indeed be included in a future addendum to the ISO standard. Decoy 16:51, 19 Apr 2005 (UTC)
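(An editorial arithmetic sketch in Python 3, not from the discussion above, spelling out why UTF-16 reachability stops at U+10FFFF: 1024 high surrogates times 1024 low surrogates cover exactly 16 planes above the BMP.)

```python
high_surrogates = 0xDBFF - 0xD800 + 1               # 1024
low_surrogates  = 0xDFFF - 0xDC00 + 1               # 1024
supplementary   = high_surrogates * low_surrogates  # 0x100000 code points

# Surrogate pairs start counting at U+10000, so the last reachable code point is:
assert 0x10000 + supplementary - 1 == 0x10FFFF
```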
Invalid UTF-8 binary streams
In the description, the table indicates that 1100000x (C0, C1) are invalid binary strings because the value (xyz) is less than 127. Is this really correct? Don't the two byte character points start at 0x80? Also, while I think it's useful to point out the pre-RFC-3629 validation, combining them all in the same table may be confusing. 1111110x was valid in 2002 but invalid in 2004. If my understanding is correct (quite possibly not), the table might be more easily understood as:
| Codes (binary) | Notes |
|---|---|
| 0xxxxxxx | valid |
| 10xxxxxx | valid |
| 110xxxxx | valid |
| 1110xxxx | valid |
| 1111010x | valid |
| 1111011x | invalid |
| 11111xxx | invalid |
Perhaps the previous method presented the hex values more clearly (personally, I find the binary form essential to understanding UTF-8). Alexgenaud 2007-03-06
- Yes, C0 and C1 are invalid because they would be the lead bytes of an overlong form of a character from 0 to 63 or 64 to 127 respectively. As for the other codes that are now forbidden, it is important to remember this is essentially a change from being reserved for future expansion to being forbidden; that is, these values should not have been in use anyway. Plugwash 16:40, 6 March 2007 (UTC)
- Thanks. RFC 3629 doesn't define 'overlong'. The single byte contributes 7 bits to the code point while two bytes contribute 11. If the leading 4 bits of the 11 are zero then we are left with 7 bits. The stream octets are still unique, but the code point would be redundant (though the RFC implies this will naively or magically become +0000, for reasons beyond me). I suppose there must be other similar issues: 11100000 1000000x 10xxxxxx, 11100000 100xxxxx 10xxxxxx, and perhaps three others in the four-octet range. Alexgenaud 2007-03-06 (octets separated by Plugwash)
- Hope you don't mind me spacing out the octets in your last post. There are invalid combinations in the higher ranges but no more byte values that are always invalid. RFC 3629 doesn't define overlong, but it does use the term in one place and talks about decoding of overlong forms by naive decoders. It does state clearly that they are invalid, though it does not use the term when doing so ("It is important to note that the rows of the table are mutually exclusive, i.e., there is only one valid way to encode a given character."). Plugwash 13:24, 8 March 2007 (UTC)
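(A small Python 3 demonstration of the overlong point discussed above, my own addition: C0 80 would be a two-byte encoding of U+0000 and is rejected by a strict decoder, while C2 80 is the shortest legal two-byte sequence.)

```python
# C0 80 is an "overlong" two-byte form of U+0000; strict UTF-8 decoders reject it.
try:
    b'\xc0\x80'.decode('utf-8')
except UnicodeDecodeError as e:
    print(e)                                   # 'utf-8' codec can't decode byte 0xc0 ...

# The shortest valid two-byte sequence encodes U+0080.
assert '\u0080'.encode('utf-8') == b'\xc2\x80'
assert b'\xc2\x80'.decode('utf-8') == '\u0080'
```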
Decimal vs. hex
The article currently uses a mixture of decimal and hex with no clear labeling as to which is which. Unicode specs seem to prefer hex. HTML entities and the Windows entry method default to decimal. Ideas? Plugwash 02:19, 26 Jan 2005 (UTC)
- Unicode characters are usually given in hexadecimal (e.g., U+03BC). But numeric character references in HTML are better in decimal (e.g., &#956; rather than &#x3BC;), because this works on a wider range of browsers. So I think we should stick to hexadecimal except for HTML numeric character references. --Zundark 09:15, 26 Jan 2005 (UTC)
- I agree, and would add that when referring to UTF-8 sequence values / bit patterns / fixed-bit-width numeric values, it is more common to use hexadecimal values, either preceded by "0x" (C notation) or followed by " hex" or " (hex)". I prefer the C notation. I doubt there are very many people in the audience for this article who are unfamiliar with it. For example, bit pattern 11111111 is written as 0xFF. -- mjb 18:25, 26 Jan 2005 (UTC)
- Could someone add a paragraph to the main text about decimal or hex numeric character references to explain why they are necessary? —The preceding unsigned comment was added by 203.218.79.170 (talk • contribs) 2005-03-25 01:19:52 (UTC).
Appropriate image?
Is Image:Lack of support for Korean script.png an appropriate image for the "invalid input" section? I am wondering if those squares with numbers are specifically UTF-8? --Oldak Quill 06:26, 23 July 2005 (UTC)
- No, the hexadecimal numbers in the boxes are Unicode code points for which the computer has no glyph to display. They don't indicate the encoding of the page. The encoding could be UTF-16, UTF-8, or a legacy encoding. Indefatigable 14:47, 23 July 2005 (UTC)
Comment about UTF-8
The article reads: "It uses groups of bytes to represent the Unicode standard for the alphabets of many of the world's languages. UTF-8 is especially useful for transmission over 8-bit Electronic Mail systems."
"UTF-8 was invented by Ken Thompson on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. The day after, Pike and Thompson implemented it and updated their Plan 9 operating system to use it throughout."
RANTMODE: ON
I would like to add that UTF-8 is causing havoc and confusion to computer users that are used to the international characters in at least the ISO-8859-1 charset. Getting national chars through is a hit or miss thing these days. Some people can read UTF-8, but not write it. Some people write it but can't read it, some people can do neither, and some people do both.
UTF-8, in my opinion, sucks. I've spent months finally figuring out how to get rid of the abomination. I'm finally back to ISO-8859-1 on all my own systems. I *hate* modern OS distributions defaulting to UTF-8, and I've got a dream about meeting the bastard that invented the abomination to tell him what I really think of it.
- In my opinion UTF-8 is a wonderful encoding and I hope that it will be widely accepted as soon as possible. For example, the Czech language may use the CP852 encoding, the ISO 8859-2 encoding, the Kamenicky encoding, the Windows-1250 encoding, theoretically even KOI-8CS. Every stupid producer of a character LCD display creates his own "encoding". It doesn't work in DVD players, it creates problems on web pages, in computers, you cannot even imagine how much it sucks. UTF-8 could become THE STANDARD we are looking for, because it is definitely the best standardized Unicode encoding and it is widely accepted (even Notepad in WinXP recognizes it, so we can say that it has become a de-facto standard). What really makes one angry is the "Java UTF-?", the "MAC UTF-?", the byte-order-dependent UTF-16 "encoding" and so on. Everybody tries to make his own "proprietary extensions", and the main reason is to keep incompatibility and to kill the interoperability of software, because some people think it to be "good for business". So PLEASE, ACCEPT UTF-8, it really is the best character encoding nowadays. (2006-08-17 20:38)
RANTMODE: OFF (unsigned comment by 80.111.231.66)
- Working with systems that use different encodings and provide no good way to specify which one is in use is going to be painful, PERIOD. The old system of different charsets for different languages can be just as horrible. Plugwash 16:37, 13 August 2005 (UTC)
- The point is that most people don't need to work with systems with different language encodings. While almost everyone is affected by the conflicts between UTF-8 and their local charset. When you're not used to that kind of bullshit, it gets hugely annoying. Personally I've just spent the day de-utf8ying my systems. (80.111.231.66)
- Point taken, but you should know that we feel the same way when dealing with your "code pages" and other limited charsets. Some time in the future, we might say the same thing about UTF-8, though at least they've left us two extra bytes to add characters to. Rōnin 22:48, 15 October 2005 (UTC)
- Wikipedia is utf-8. Anyone who thinks that Wikipedia would have been as easy to manage without one unified character set is smoking something. 'nuff said. --Alvestrand 00:15, 18 August 2006 (UTC)
The comparison table
Is the table wrong for the row showing "000800–00FFFF"? I don't know much about UTF-8 but to me it should read "1110xxxx 10xxxxxx 10xxxxxx" not "1110xxxx xxxxxx xxxxxx" --220.240.5.113 04:30, 14 September 2005 (UTC)
- Fixed. It got lost in some recent table reformatting. Bovlb 05:44, 14 September 2005 (UTC)
BOM
The following was recently added to the article; I reverted and was re-reverted ;).
- UTF-16 needs a Byte Order Mark (U+FEFF) in the beginning of the stream to identify the byte order. This is not necessary in UTF-8, as the sequences always start with the most significant byte on every platform.
- UTF-16 only needs a BOM if you don't know the endianness ahead of time because the protocol was badly designed not to specify it, but such apps that don't have an out-of-band way of specifying the charset probably aren't going to be very suitable for use with UTF-16 anyway. Apart from plain text files I don't see many situations where a BOM would be needed, and it should be noted that with plain UTF-8 text at least MS uses a BOM anyway. In summary, while I can see where you are coming from, I'm not sure how relevant this is given what's already been said. Plugwash 22:39, 6 October 2005 (UTC)
- It was recently pointed out to me that using a UTF-8 BOM is the only way to force web browsers to interpret a document as UTF-8 when they've been set to ignore declared encodings. I haven't confirmed this, myself, but it sounds plausible. — mjb 00:15, 19 October 2005 (UTC)
I know the basic cycle of when web browsers came to recognize the UTF-8 BOM, but have not heard of it being forced outside of Windows. True that all major web browsers now accept UTF-8 with or without a BOM. There were times when sending UTF-8 with a BOM would stop HTML web browsers. Now Web3D X3D browsers are arriving with that same failure. All VRML and X3D Classic VRML encodings have always specified UTF-8 with # as the first character for delivery and storage. However, the appearance of the apparently 'optional' BOM has caught a few of the old-school UTF-8 applications out, and they make the UTF-8 BOM illegal by requiring the first character to be #. Are there other examples where the apparently 'optional' UTF-8 BOM has generated problems, and solutions that can be documented here? Thanks and Best Regards, Joe Joedwill 08:04, 12 November 2007 (UTC)
Well it also plays havoc with Unix expecting shell scripts to start with #!. —Preceding unsigned comment added by Spitzak (talk • contribs) 20:36, 12 March 2008 (UTC)
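(An illustrative Python 3 sketch, not part of the comments above, of why a leading UTF-8 'BOM' breaks tools that key on the literal first bytes of a file, such as the Unix #! check or a VRML/X3D parser expecting # first. The file contents are hypothetical.)

```python
BOM = b'\xef\xbb\xbf'                       # UTF-8 encoding of U+FEFF
script = BOM + b'#!/bin/sh\necho hello\n'   # hypothetical file contents

print(script.startswith(b'#!'))             # False: the BOM hides the shebang
print(script[len(BOM):].startswith(b'#!'))  # True once the BOM is stripped
```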
Double UTF-8
I recently ran into a problem with a double-UTF8 encoded document. Doing a search on the internet for double UTF-8 I see this isn't an uncommon problem. If a BOM exists in double-UTF8 encoded text, it appears as "C3 AF C2 BB C2 BF". If there were some mention of this problem and how to cope with it here, I could have saved some headaches trying to figure out what was going on. I think this problem should be addressed somehow in the article. Boardhead 16:46, 11 April 2007 (UTC)
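(A Python 3 sketch reproducing the byte pattern quoted above; it assumes the intermediate misinterpretation was ISO-8859-1, which is the usual culprit in such double-encoding.)

```python
bom_utf8 = '\ufeff'.encode('utf-8')                  # ef bb bf
# Read those bytes back as ISO-8859-1 and encode the result as UTF-8 a second time.
double = bom_utf8.decode('iso-8859-1').encode('utf-8')
print(' '.join('%02X' % b for b in double))          # C3 AF C2 BB C2 BF
```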
xyz
I see that there are now not only x'es, but also y's and z's in the table. While it was difficult to understand before, it's now starting to be more confusing than informative. Could we possibly revert to only using x? Rōnin 19:38, 28 November 2005 (UTC)
I can see xyz in the table, but the description in the table reads like "three x, eight x" for the "00000zzz zxxxxxxx" UTF-16 entry, and in other places of the table there is just x, no y nor z. —The preceding unsigned comment was added by 81.210.107.159 (talk • contribs) 2006-04-18 21:46:20 (UTC).
- I for one am more confused after reading the table than before reading it. I used to think I had at least a vague idea of how it worked, but I'm back to square zero now. If things could be rephrased it might be easier to understand. -- magetoo 19:40, 21 May 2006 (UTC)
I replaced the z's in the first three rows of the table with x's, since the "z>0000" note was redundant with the hexadecimal ranges column. I couldn't decipher the fourth row of the table, so I left it alone. IMHO the table would make a lot more sense if the UTF-16 column was removed. -- surturz 06:15, 14 June 2006 (UTC)
I've removed the UTF-16 column and y's and z's. It now uses only x (and 0) for a bit marker. I think it is much clearer now, I hope everyone agrees. Surturz 06:43, 14 June 2006 (UTC)
Someone has put x's y's and z's back in the table, but in a much better manner. It looks good now. --Surturz 21:35, 7 October 2006 (UTC)
[edit] "This is an example of the NFD variant of text normalization"
did you actually compare the tables to check this was the case before adding the comment? —Preceding unsigned comment added by Plugwash (talk • contribs)
- no - the week before, I talked to the president of the Unicode Consortium, Mark Davis, and asked him whether there were any products that used NFD in practice. He said that MacOS X was the chief example of use of NFD in real products. --Alvestrand 21:08, 19 April 2006 (UTC)
Ideograms vs. logograms
I replaced the three occurrences of 'ideogram' with their correct name, 'logogram', since clearly (and in one case definitely) CJK characters were meant, which are logographic, not ideographic. JAL - 13:48, 2006-06-06 (CET)
External links
The second UTF-8 test page is more comprehensive than the first, which is now unmaintained. Should we just remove the first?
Conversion
Is there any conversion tool from and to UTF-8 (e.g. U+91D1 <-> UTF-8 E98791)? Doing the binary acrobatics in my head takes too long. dab (ᛏ) 12:36, 24 July 2006 (UTC)
- I find the bin-to-hex and hex-to-bin much easier, but that might be because I worked with chips; a matter of practice. Maybe a small paper showing at least the first 16 hex digits and their binary equivalents, or a text file with a link on the desktop or a quicklink, is another alternative. But a tool for that should perform better. ~Tarikash 05:41, 25 July 2006 (UTC).
um, I am familiar with hex<->bin, but honestly, how long does it take you to figure out 91D1->E98791? I suppose with some practice, it can be done in 20 seconds? 10 seconds if you're good? Now imagine you need to do dozens of these conversions? Will you do them all in your head, with a multiple-GHz machine sitting right in front of you? I'm all for mental agility, but that sounds silly. dab (ᛏ) 18:09, 29 July 2006 (UTC)
- Considering that UTF-8 was designed to be trivial for computers to encode and decode, it really shouldn't take any programmer who understands bit-ops more than a few minutes to code something like this from scratch. Even less to code it around an existing conversion library they were familiar with.
- Why exactly are you doing dozens of these conversions in the first place? At those kinds of numbers I'd be thinking of automating the whole task, not using an app I had to manually paste numbers into. Plugwash 22:44, 29 July 2006 (UTC)
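(For the record, a conversion helper of the kind Plugwash describes takes only a few lines; this Python 3 sketch, with hypothetical function names, is my own and not anything proposed in the thread.)

```python
def cp_to_utf8_hex(cp: int) -> str:
    """0x91D1 -> 'E98791'"""
    return chr(cp).encode('utf-8').hex().upper()

def utf8_hex_to_cp(h: str) -> str:
    """'E98791' -> 'U+91D1'"""
    return 'U+%04X' % ord(bytes.fromhex(h).decode('utf-8'))

assert cp_to_utf8_hex(0x91D1) == 'E98791'
assert utf8_hex_to_cp('E98791') == 'U+91D1'
```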
Clarify intro?
I had to read all the way down into the table in order to be sure we were talking about a binary format rather than a text format here. Is there a way of making the first two sentences clearer? By "binary format" I mean that characters apparently use the full range of ASCII characters, rather than just the 36 alphanumeric ones. Stevage 06:34, 16 August 2006 (UTC)
- odd definition of text format you have there (btw ascii has 96 printable characters). Yes it does use byte values that aren't valid ascii to represent non ascii characters just like virtually every text encoding used today (cp850,windows-1252,ISO-8859-1 etc) does. Also note that UTF-8 is not a file format in its own right, you can have a dos style UTF-8 text file or a unix style UTF-8 text file or a file that uses UTF-8 for strings but a non text based structure. Plugwash 17:02, 16 August 2006 (UTC)
- Ok, having read character encoding I understand better. Not sure how the intro to this article could be improved. Stevage 19:59, 16 August 2006 (UTC)
- Yeah, the bit "the initial encoding of byte codes and character assignments for UTF-8 is coincident with ASCII (requiring little or no change for software that handles ASCII but preserves other values)." is either confusing, not English, or not correct, I'm not sure which. Is this better? : "a byte in the ASCII range (i.e. 0-127, AKA 0x00 to 0x7F) represents itself in UTF-8. (This partial backwards compatibility means that little or no change is required for software that handles ASCII but preserves other values to also support UTF-8). UTF-8 strings." It needs to be explained how it differs from punycode. --Elvey 16:36, 29 November 2006 (UTC)
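(A tiny Python 3 illustration of the ASCII-compatibility point under discussion, added editorially: bytes 0x00-0x7F stand for themselves in UTF-8, and non-ASCII characters only ever use bytes 0x80 and above.)

```python
s = 'plain ASCII text'
assert s.encode('ascii') == s.encode('utf-8')        # identical byte sequences

assert all(b < 0x80 for b in s.encode('utf-8'))      # ASCII text stays 7-bit
assert all(b >= 0x80 for b in 'é'.encode('utf-8'))   # non-ASCII uses only high bytes (c3 a9)
```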
Need reference for email client support
Of all the major email clients, only Eudora does not support UTF-8.
This needs a reference. Which are the major email clients?! 199.172.169.7 18:33, 17 January 2007 (UTC)
- I've removed it, since it's gone unchecked for about three months. -- Mikeblas 14:30, 22 April 2007 (UTC)
Stroke vs. caps
Hi, I am trying to translate this page in Italian for it.wikipedia, and was wondering if someone could elaborate a bit about the part on stroke versus capital letters. --Agnul 21:42, 2 Dec 2004 (UTC)
- (moved new subject to the bottom, per convention)
- Since the word "stroke" doesn't appear in the article, I'm at a loss to identify what you mean. Could you clarify? --Alvestrand 22:54, 21 January 2007 (UTC)
Recent change in Advantages and Disadvantages section
I'm pretty new to editing in Wikipedia, so I wanted to consult with some more experienced Wikipedians before messing with someone else's recent change.
An anonymous user just added this text under the Disadvantages compared to UTF-16 section:
For most Wester European languages, the strictly 8 bit character codes from the ISO_8859 standard work fine with current applications, which can be broken when using multibyte codes.
I see at least three problems with this change. First, there is at least one typo. Second, this is miscategorized under the UTF-16 comparison section, since it is comparing UTF-8 to legacy encodings. Thirdly, this sounds like an unsupported assertion coming from someone who doesn't like UTF-8. I didn't want to fix the first two problems without addressing the third.
Is there any way this addition can be salvaged by moving it under legacy encodings and editing to be more appropriate? Perhaps to talk about support for legacy encodings vs. UTF-8? I'm not even sure that the assertion in this item is true. In my personal experience, UTF-8 support in current operating systems, applications, and network protocols seems to be very good, and applications that don't support UTF-8 seem to be more of the exception than the common case. —The preceding unsigned comment was added by Scott Roy Atwood (talk • contribs) 21:21, 6 February 2007 (UTC).
Asian characters in UTF-8 versus other encodings
The article states that "As a result, text in Chinese, Japanese or Hindi takes up more space when represented in UTF-8." This is not strictly true of Japanese text: the Japanese often use the character encoding ISO-2022-JP to communicate on the Internet, which is a 7-bit encoding in which characters take up more space than in UTF-8. For the Shift-JIS encoding used in Windows, however, it does hold true. Rōnin 01:44, 9 February 2007 (UTC)
- The relationship between the size of UTF-8 encoded text and ISO-2022-JP encoded text is somewhat more complicated than for some of the other Asian encodings, but it is still generally true that UTF-8 encoding is larger. It is true that it is limited to 7-bit ASCII characters, but it uses a three byte escape sequence to begin and end a run of Japanese characters, and within the escaped text, each Japanese character takes two bytes. So that means ISO-2022-JP text will take fewer bytes than UTF-8 as long as the run of text is longer than six Japanese characters. Scott Roy Atwood 19:08, 9 February 2007 (UTC)
- Ah... Thanks so much for the explanation! Rōnin 22:05, 14 February 2007 (UTC)
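(A quick check of that arithmetic with Python 3's standard 'iso2022_jp' codec; the eight-character sample string is my own.)

```python
text = '日本語のテキスト'              # eight Japanese characters
utf8 = text.encode('utf-8')           # 3 bytes per character
jis  = text.encode('iso2022_jp')      # 2 bytes per character plus two 3-byte escapes

print(len(utf8), len(jis))            # 24 22: ISO-2022-JP is smaller once a run exceeds ~6 characters
```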
Java modified UTF-8
"Because modified UTF-8 is not UTF-8, one needs to be very careful to avoid mislabelling data in modified UTF-8 as UTF-8 when interchanging information over the Internet."
It's usually safe to treat modified UTF-8 simply as UTF-8, so this "very careful to avoid mislabelling" is exaggerating.
- Not really. The treatment of code point 0 and of code points beyond 65535 is not compatible. MegaHasher 04:23, 9 July 2007 (UTC)
[edit] "Legacy" vs "Other" encodings
De Boyne and Imroy - can you please give your arguments here rather than flipping the article back and forth between "legacy" and "other" encodings? I have an opinion, but want to hear both your arguments for why you think it's obvious enough to make these changes without discussion. --Alvestrand 17:33, 16 October 2007 (UTC)
- At the risk of sounding juvenile: de Boyne started it. It doesn't appear to be a technical matter at all, but a grammatical one. He seems to have a pet peeve about the (mis)use of the word 'legacy' and has even written a page about it[5] on his ISP's web host. I've pointed out to him that 'legacy' is being used as an adjective in "legacy encoding", even showing that the OED says it's OK. I thought my arguments had settled the matter last month, but then he came back yesterday and restarted his efforts to 'fix' this article. He's now claiming original research, saying we made up the term or something...
- I've been trying to argue the matter with him on his talk page and am currently waiting on a reply. --Imroy 19:08, 16 October 2007 (UTC)
- I agree with the above. What I don't understand is where JdBP got the idea that "legacy" is pejorative in the first place. "Legacy" in computer science has always been a neutral descriptive term like "naïve" in mathematics and philosophy (naïve set theory, etc.). It's true that many non-mathematicians misinterpret this use of "naïve" as an insult, but the solution is to educate those people, not to switch to a less precise term.
- The other problem is JdBP's antisocial behavior. He obviously should have started a discussion here after the first revert; instead he just kept making the edit over and over. If I'd been in Imroy's shoes I would have reverted those edits on general principle regardless of the content. -- BenRG 02:03, 17 October 2007 (UTC)
- This other user is beside the point. The question here is whether the article should be using the term "legacy encoding" to mean "non-Unicode encoding". It's not that the term "legacy" is pejorative, it's just that it expresses a point of view. As such, it's inappropriate for a supposedly neutral encyclopedia article. --Zundark 10:15, 17 October 2007 (UTC)
- The common non-Unicode encodings were introduced by IBM, MS and ISO. AFAICT all of those have moved on to Unicode now and are keeping their old encodings around only for historical/compatibility reasons. Do you have any evidence to the contrary? Plugwash 11:09, 19 October 2007 (UTC)
- I don't have any meaningful evidence either for or against your hypothesis. But I don't think your question is very relevant anyway, as there are significant character encodings that are not the product of any of the organizations you mention. Do you have any evidence that a character encoding as recent as KZ-1048 is regarded by its creators as a legacy encoding? Would it be NPOV for us to call it such? --Zundark 13:34, 19 October 2007 (UTC)
Better rationale for the introductory comment on higher plane characters being unusual
In the lede the encoded length of higher-plane characters is excused by noting that they are rare. This is an excellent rationale, but the proper context of those sorts of reasons is not spelled out: it is a standard result in text compression that coding rare symbols with longer encodings and the more common ones with shorter ones causes, on average, the length of the entire text to shrink. Because of that I think a link to at least one relevant article on statistical data compression should be included in the introduction, presuming the above-mentioned rationale remains.
That might also help us tighten the (already quite bloated) introduction a bit, and make it more encyclopedic. It might also be that such reasons should be factored out from here and into the main article on Unicode/ISO 10646, because they tend to impact all encodings more or less equally. After all, they are always applied in the context of the design of the parent coded character set, and not of the encodings.
Decoy (talk) 22:39, 20 December 2007 (UTC)
Code range values in table within Description section
Why list two ranges "000800–00D7FF" and "00E000–00FFFD" for the 3-byte code?
Why not "000800–00FFFF"?
The range "00E000–00FFFD" followed by the range "010000–10FFFF" leaves a gap -- it doesn't account for 00FFFE and 00FFFF. --Ac44ck (talk) 21:41, 2 January 2008 (UTC)
- The note in the middle of the table for the two-byte code says: "byte values C2–DF and 80–BF". Wouldn't the first byte be in the range "C0–DF"? --Ac44ck (talk) 23:36, 2 January 2008 (UTC)
- C0 and C1 are invalid in UTF-8 because they would always code for overlong sequences. Plugwash (talk) 00:54, 3 January 2008 (UTC)
- Aha. The UTF-8 code splits to occupy two bytes as the number of the character being encoded increments from 0x7F to 0x80. Bit 6 of the one-byte form moves to become bit 0 in the most-significant byte of the two-byte form — so the MSB can't be C0. And then incrementing the value represented by the distributed bits (which are all set) forces a "carry" to set bit 1 and clear bit 0 of the MSB — and the MSB can't be C1. Thus, "c2 80" is the first valid two-byte form:
- Unicode → UTF-8
- U+007F → 7f
- U+0080 → c2 80
- Thanks. —Ac44ck (talk) 02:43, 3 January 2008 (UTC)
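(To make the sequence-length boundaries concrete, a short Python 3 loop, added editorially, printing the UTF-8 bytes just below and just above each boundary.)

```python
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    b = chr(cp).encode('utf-8')
    print('U+%04X -> %s' % (cp, ' '.join('%02x' % x for x in b)))
# U+007F -> 7f
# U+0080 -> c2 80
# U+07FF -> df bf
# U+0800 -> e0 a0 80
# U+FFFF -> ef bf bf
# U+10000 -> f0 90 80 80
# U+10FFFF -> f4 8f bf bf
```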
Consolidated and extended the input range for three-byte codes
The top end of the "code range" for three-byte UTF-8 was listed as FFFD – omitting FFFE and FFFF.
The character code FFFF has a three-byte UTF-8 code: EF BF BF.
Perhaps FFFE was omitted because it can be a Byte Order Mark (BOM). Some web pages note that the BOM is not a valid UTF-8 code (no value encodes to "FF FE"). Nor is FFFF a valid UTF-8 code. But these facts are unrelated to which values may be encoded.
I see no reason to exclude FFFE or FFFF from the code range, nor to break the range 0000-FFFF into two sub-ranges. -Ac44ck (talk) 18:44, 4 January 2008 (UTC)
- The BOM is U+FEFF, not U+FFFE. Noncharacters like U+FFFE do have UTF-8 encodings, according to Table 3-7 of Unicode 5.0, and they should be included in the range. Surrogates do not, and they should be excluded from the range. ED A0 80 is not a valid UTF-8 byte sequence; in no sense whatsoever does it decode to U+D800. -- BenRG (talk) 22:10, 4 January 2008 (UTC)
- This seems like overreaching to me:
- Page 41 of 64 in pdf file; Labeled as page 103 in the visible text:
- Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed.
- Unlike a leading byte of C0 for a two-byte code, there is nothing inherent in the UTF-8 structure which prevents (or creates any ambiguity when) encoding "D800" as "ED A0 80".
- A screwdriver isn't designed to be used as a chisel. That doesn't stop me from chiseling with a screwdriver sometimes. Likewise, data-encoding mechanisms may be used for other than their intended purposes.
- http://en.wikipedia.org/wiki/UTF-8
- Java ... modified UTF-8 for ... embedding constants in class files.
- Might one of those constants be in the range D800..DFFF?
- The article also notes:
- data compression is not one of Unicode's aims
- But one might want to use UTF-8 for that purpose. A value in the disallowed range might be part of the data being compressed.
- The meaning of data in the range D800..DFFF is restricted only by a Unicode convention, but there seems to be no functional problem with encoding any value in that range as UTF-8.
- What is allowed by Unicode and what is doable in UTF-8 seem like separate issues to me.
- I added a note to the table to assist others who may be as surprised by the exclusion from the range and unfamiliar with this restriction by Unicode as I was.
- http://en.wikipedia.org/wiki/Byte_Order_Mark
- Representations of byte order marks by encoding
- UTF-16 Little Endian FF FE
- -Ac44ck (talk) 05:13, 5 January 2008 (UTC)
- UTF-8 (as defined in Unicode 5.0 and RFC 2279) has no encodings for surrogates. It's impossible by definition to encode a surrogate in UTF-8. I'm not trying to be dogmatic, I just want the information in the article to be correct. What you're talking about is a variable length integer coding scheme that's used by CESU-8, UTF-8, and some other serialization formats, some of which are not even character encodings. That scheme is not called UTF-8 as far as any standards body is concerned, but some people do call it UTF-8, and maybe the article should mention that.
- The byte sequence FF FE and the code point U+FFFE are different things. The BOM is U+FEFF, and it has many encodings, one of which is the byte sequence FF FE. -- BenRG (talk) 01:37, 6 January 2008 (UTC)
- I hear you. It is important to make it clear that the range D800..DFFF is disallowed by Unicode. A parser which is built to decode a UTF-8 stream to Unicode code points is broken if it doesn't complain about receiving data which resolves to a surrogate.
- But the Unicode standard isn't a natural law. It is not "impossible" for values in the D800..DFFF to be transformed by the UTF-8 algorithm. The table at http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt (linked in the article) suggests that the range D800..DFFF wasn't excluded from the original implementations of UTF-8. But the range D800..DFFF is prohibited by Unicode these days.
- This was also helpful: "The byte sequence FF FE and the code point U+FFFE are different things." I wasn't making the distinction. Whether a value is transmitted as little- or big-endian doesn't affect the magnitude of that value; and a "code point" is a value, not merely a byte sequence.
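(A Python 3 sketch of both positions in this exchange, added for illustration: the strict UTF-8 codec refuses to produce or accept ED A0 80, while the non-standard 'surrogatepass' handler shows the three-byte pattern the generic bit-packing algorithm would emit.)

```python
# Strict UTF-8 (RFC 3629 / Unicode 5.0): surrogate code points are ill-formed.
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError as e:
    print('encoder rejects U+D800:', e.reason)

try:
    b'\xed\xa0\x80'.decode('utf-8')
except UnicodeDecodeError as e:
    print('decoder rejects ED A0 80:', e.reason)

# What the raw bit-packing would produce (not valid UTF-8; CESU-8-style behaviour).
print('\ud800'.encode('utf-8', 'surrogatepass').hex())   # eda080
```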
Why length matters
I reverted a change that seemed to argue that the only place size really matters is in database design. I don't agree - I think text storage and forms handling are cases where size (and in particular knowing size ahead of time) also matter. I kind of agree that size doesn't matter that *much* any more, but didn't see a reason to agree with the database focus. Feel free to argue with me. --Alvestrand (talk) 10:08, 21 January 2008 (UTC)
Stop spamming
I just removed a comment made by a certain software vendor regarding a certain file format that happens to have (broken) support for UTF-8. It is thus completely irrelevant. 90.163.53.5 (talk) 19:32, 19 February 2008 (UTC)
- Ok, that paragraph in the 'history' section was misplaced and perhaps not really relevant to UTF-8. But how was it "spamming", and why do you say it was a "comment made by [a] certain Software vendor"? I don't follow. --Imroy (talk) 09:12, 20 February 2008 (UTC)
Maybe the paragraph about how APIs work is unnecessary
There is a flurry of editing about how Win32/Windows/WinNT/etc. work.
I am not clear about the distinctions between the various Windows-sounding names. The extent to which I need to be concerned about the distinctions between them may suggest a flaw (and/or a lack of _user_-focus) in the development philosophy being inflicted upon the industry.
How a vendor persists in creating a moving target is secondary to the issue _that_ they persist in creating a moving target.
The article is about UTF-8.
A vendor whose operating system:
- 1. Requires gigabytes of memory as overhead, but
- 2. Is not designed to run 16-bit applications
may have a questionable commitment to "maintaining compatibility with existing programs" anyway. If a change in how they implement an operating system creates duplicate work for them, maybe it isn't all bad for them to feel a little of the pain that they inflict upon others.
It seems to me that expending a great deal of effort on how various products-with-Windows-in-the-name do or don't handle UTF-8 would be more appropriate in an article about Windows than in this article about UTF-8. -Ac44ck (talk) 03:30, 1 March 2008 (UTC)
- I was just trying to make the example more accurate. Win32 (which I probably should have linked) is the name of the Windows API that has duplicated narrow-character and wide-character versions of everything. -- BenRG (talk) 20:51, 1 March 2008 (UTC)
- I disagree with Ac44ck's assertion. As far as I can tell, the discussion about the Windows API occupies a small amount of the article's content, and implementation has been an important factor in the development and deployment of UTF-8.—Kbolino (talk) 04:26, 2 March 2008 (UTC)
- My contention is that an article about UTF-8 is not the place to have a pissing contest about how one interprets "Windows", "Win32", "WinNT", ad nauseam.
- It is one thing to say "such and such was the case in version 4 of UTF-8, but it was changed to behave thus and so in version 5 because ..."
- But going into detail about "how Windows works today" is the stuff of a magazine article — not an encyclopedia. It's a good bet that the next "new and improved" version of "Windows" (which despite the common name doesn't mean _one_ thing — it is a moving target) will create incompatibilities with what "Windows" means to all users (of any version) anyway. Going into detail about "how Windows worked yesterday" is the stuff of a magazine back-issue.
- The whole notion of keeping "current" with "change for the sake of change" (or perhaps someone else's profit as opposed to the user's benefit) has become tiresome.
- An article about UTF-8 is on my watch list. I don't have an article about Windows on my watch list. Having my attention drawn to a pissing contest about what label applies to a moving target became irritating to me.
- I don't know whether I can avoid the "let's do it this way today" nonsense by moving to Linux, but I hope to. Then all the controversy about what a particular version of somehow-called "Windows" does with various flavors of UTF would be of even less interest to me in an article on UTF-8. —Ac44ck (talk) 07:34, 2 March 2008 (UTC)

