Talk:UTF-16/UCS-2
[edit] UTF-16 and UCS-2 as one topic
The history of this page makes it look like there was never anything but a redirect to UCS here (at UTF-16), but I am fairly certain there was a separate entry for UTF-16 in the recent past.
I do not like the idea of redirecting to UCS. While UCS should mention and perhaps summarize what encodings it defines, I strongly feel that the widely-used UTF-8, UTF-16, and UTF-32 encodings should have their own entries, since they are not exclusively tied to the UCS (as UCS-2 and UCS-4 are) and since they require much prose to accurately explain. Therefore, I have replaced the UCS redirect with a full entry for UTF-16. --mjb 16:53, 13 October 2002
- I no longer feel so strongly about having both encoding forms discussed in the same article. That's fine. However, I do have a problem with saying that they are just alternative names for the same encoding form. You can't change the definition of UCS-2 and UTF-16 in this way just because the names are often conflated; the formats are defined in standards, and there is a notable difference between them. Both points (that they're slightly different, and that UCS-2 is often mislabeled UTF-16 and vice-versa) should be mentioned. I've edited the article accordingly today. — mjb 23:37, 13 October 2005 (UTC)
- I would also like to see UCS-2 more clearly separated from UTF-16 - they are quite different, and it's important to make it clear that UCS-2 is limited to just the 16-bit codepoint space defined in Unicode 1.x. This will become increasingly important with the adoption of GB18030 for use in mainland China, which requires characters defined in Unicode 3.0 that are beyond the 16-bit BMP space. — Richard Donkin 09:07, 18 November 2005 (UTC)
[edit] UTF-16LE BOMs Away!
Concerning the text explaining UTF-16LE and UTF-16BE, would it not be better if, instead of saying,
- A BOM at the beginning of UTF-16LE or UTF-16BE encoded data is not considered to be a BOM; it is part of the text itself.
we say something like,
- No BOM is required at the beginning of UTF-16LE or UTF-16BE encoded data and, if present, it would not be understood as such, but instead be mistaken as part of the text itself.
--Chris 17:27, 12 January 2006 (UTC)
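- Either way, the observable behaviour being described is the same: a decoder that is told the data is UTF-16LE or UTF-16BE hands a leading U+FEFF back to the application as ordinary text rather than stripping it. Here is a minimal Java sketch illustrating that (the class and variable names are purely illustrative):
 import java.nio.charset.StandardCharsets;

 public class Utf16LeLeadingFeff {
     public static void main(String[] args) {
         // Bytes FF FE 41 00, declared as UTF-16LE, so no BOM handling applies
         byte[] data = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00 };
         String s = new String(data, StandardCharsets.UTF_16LE);
         System.out.println(s.length());                    // 2 -- the U+FEFF stays in the text
         System.out.printf("U+%04X%n", (int) s.charAt(0));  // U+FEFF
         System.out.println(s.charAt(1));                   // A
     }
 }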
[edit] Clean-up?
Compare the clean, snappy introductory paragraph of the UTF-8 article to the confusing ramble that starts this one. I want to know the defining characteristics of UTF-16, and I don't want to know (at this stage) what other specifications might or might not be confused with it. Could someone who understands this topic consider doing a major re-write?
A good start would be to have one article for UTF-16 and another article for UCS-2. The UTF-16 article could mention UCS-2 as an obsolete first attempt and the UCS-2 article could say that it is obsolete and is replaced by UTF-16. --137.108.145.11 17:02, 19 June 2006 (UTC)
- I rewrote the introduction to hopefully make things clearer; please correct if you find technical errors. The rest of the article also needs some cleanup, which I may attempt. I disagree that UTF-16 and UCS-2 should be separate articles, as they are technically so similar. Dmeranda 15:12, 18 October 2006 (UTC)
- Agreed; despite the different names, they are essentially different versions of the same thing:
- UCS-2 --> 16 bit unicode format for unicode versions <= 3.0
- UTF-16 --> 16 bit unicode format for unicode versions >= 3.1
- Plugwash 20:13, 18 October 2006 (UTC)
[edit] Surrogate Pair Example Wrong?
The example: 119070 (hex 1D11E) / musical G clef / D834 DD1E: the surrogate pair should be D874 DD1E for 1D11E. Can somebody verify that and change the example? —Preceding unsigned comment added by 85.216.46.173 (talk)
- No, the surrogate pair in the article is correct (and btw this incorrect correction has come up many times in the article's history before)
- 0x1D11E-0x10000=0x0D11E
- 0x0D11E=0b00001101000100011110
- split the 20 bit number in half
- 0b0000110100 =0x0034
- 0b0100011110 =0x011E
- add the surrogate bases
- 0x0034+0xD800=0xD834
- 0x011E+0xDC00=0xDD1E
- -- Plugwash 18:39, 8 November 2006 (UTC)
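- For anyone who wants to check the arithmetic above mechanically, here is a minimal Java sketch of the same calculation (the class and variable names are purely illustrative); it prints D834 DD1E both ways:
 public class SurrogateEncode {
     public static void main(String[] args) {
         int codePoint = 0x1D11E;            // U+1D11E, musical G clef
         int v = codePoint - 0x10000;        // 0x0D11E, a 20-bit value
         int high = 0xD800 + (v >>> 10);     // top 10 bits    -> 0xD834
         int low  = 0xDC00 + (v & 0x3FF);    // bottom 10 bits -> 0xDD1E
         System.out.printf("%04X %04X%n", high, low);
         // Cross-check against the standard library:
         char[] units = Character.toChars(codePoint);
         System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);
     }
 }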
[edit] Decoding example
Could there be an example for decoding the surrogate pairs, similar in format to the encoding example procedure? Neilmsheldon 15:29, 27 December 2006 (UTC)
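- Until someone adds one to the article, here is a rough sketch of the reverse calculation in Java, following the encoding steps above in reverse (the class and variable names are purely illustrative); both lines print U+1D11E:
 public class SurrogateDecode {
     public static void main(String[] args) {
         char high = 0xD834;   // lead (high) surrogate
         char low  = 0xDD1E;   // trail (low) surrogate
         int codePoint = 0x10000
                 + ((high - 0xD800) << 10)   // restore the top 10 bits
                 + (low - 0xDC00);           // restore the bottom 10 bits
         System.out.printf("U+%04X%n", codePoint);
         // Cross-check against the standard library:
         System.out.printf("U+%04X%n", Character.toCodePoint(high, low));
     }
 }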
[edit] Java not UTF-16?
After reading through the documentation for the Java Virtual Machine (JVM) (see Java 5 JVM[1] section 4.5.7), it seems to me that Java does not use UTF-16 as claimed. Instead it uses a modified form of UTF-8, but one that still uses surrogate pairs for supplementary code points (each surrogate being UTF-8 encoded, though); so it's a weird nonstandard UTF-8/UTF-16 mishmash. This is for the JVM, i.e. the byte code. I don't know if Java (the language) exposes something more UTF-16-like than the underlying bytecode, but it seems clear that the bytecode does not use UTF-16. Can somebody more Java-experienced than me please verify this and correct the article if necessary? - Dmeranda 06:00, 23 October 2007 (UTC)
- The serialisation format and the bytecode format do indeed store strings in "modified UTF-8", but the strings stored in memory and manipulated by the application are UTF-16. Plugwash 09:25, 23 October 2007 (UTC)
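- A quick way to see the in-memory behaviour described above (a sketch, assuming a Java 5 or later runtime; the class name is illustrative): a String holding U+1D11E reports a length of 2 because length() counts UTF-16 code units, while the code-point methods see one character.
 public class JavaStringsAreUtf16 {
     public static void main(String[] args) {
         String clef = "\uD834\uDD1E";  // U+1D11E written as its surrogate pair
         System.out.println(clef.length());                          // 2 UTF-16 code units
         System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
         System.out.printf("U+%04X%n", clef.codePointAt(0));         // U+1D11E
     }
 }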
[edit] Windows: UCS-2 vs UTF-16
UTF-16 is the native internal representation of text in the Microsoft Windows NT/2000/XP/CE operating systems.
Older Windows NT systems (prior to Windows 2000) only support UCS-2.
That sounds like a contradiction. Besides, this blog indicates UTF-16 wasn't really supported by Windows until XP: [2]
--Kokoro9 (talk) 12:44, 30 January 2008 (UTC)
- I think surrogate support could be enabled in 2K, but I'm not positive about that. Also, IIRC even XP doesn't have surrogate support enabled by default. As with Java, Windows uses 16-bit Unicode quantities, but whether surrogates are supported depends on the version and the settings. The question is how best to express that succinctly in the introduction. Plugwash (talk) 13:05, 30 January 2008 (UTC)
- I've found this:
Note: Windows 2000 introduced support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters. Also, supplementary characters are not supported in Windows 95/98/Me.
Source
If you are developing a font or IME provider, note that pre-Windows XP operating systems disable supplementary character support by default. Windows XP and later systems enable supplementary characters by default.
- That seems to indicate Windows 2000 supports UTF-16 at some level. On the other hand, I think NT should be removed from the list of UTF-16-supporting OSs. --Kokoro9 (talk) 17:38, 30 January 2008 (UTC)