Talk:Percent-encoding

From Wikipedia, the free encyclopedia

Interpretation of percent-encoded octets

If I have an encoded sequence %7e%7e, how does one know whether it means ~~ or a Unicode character with hex value 7e7e? Xah Lee 03:42, September 10, 2005 (UTC)

That's like asking what the letter "I" represents. It could represent the letter "I", the byte 0x49, the personal pronoun denoting individuality, the Roman numeral one…

At the lexical level, if it is in a URI, then %7E%7E and ~~ mean the same thing: two instances of the tilde character (U+007E) and, simultaneously, two instances of the byte 0x7E. These characters and values, however, may be representing almost anything, depending on where in the URI they appear, and why. Maybe each tilde represents byte value 7E, and maybe the pair together is significant. Maybe not. So, you don't know, really. In order to figure it out, you need to know more about the context. How and why was the sequence produced? Is it HTML form data? Was the producer of this sequence following the guidelines of some URI scheme? — mjb 04:56, 10 September 2005 (UTC)

In URLs, UTF-8 is used to encode non-ASCII characters. The binary representation of %7e%7e would be 01111110 01111110. Both bytes start with a zero, which indicates that an ordinary ASCII character is encoded in each. If they were part of a multi-byte Unicode character in UTF-8, each byte would have to start with a 1. Since that is not the case, the sequence is interpreted as two single-byte characters.

Or, in short: %7e%7e is not a valid 2-byte UTF-8 encoding. -- JonnyJD 23:45, 24 May 2007 (UTC)

I thought the answer is a simple "because there is a % in between the two 7e's" because that is the difference between %7e%7e and %7e7e. Daveoh 11:08, 29 July 2007 (UTC)

%7e7e is not a complete encoding. Only the %7e would be decoded; the second 7e would be ordinary text. So %7e7e == ~7e

Even in UTF-8, the bytes are percent-encoded one by one. -- JonnyJD 11:54, 30 July 2007 (UTC)
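For anyone who wants to check the two cases above, here is a minimal sketch using Python's standard urllib.parse module. It only shows what a generic decoder does with the two strings; the variable names are made up for illustration, and it says nothing about what the decoded bytes mean in a particular URI scheme.

```python
from urllib.parse import unquote, unquote_to_bytes

# Each "%XX" triplet decodes to exactly one byte; anything else is literal text.
two_tildes = unquote("%7e%7e")       # two triplets -> bytes 0x7E 0x7E
tilde_then_text = unquote("%7e7e")   # one triplet, then the literal characters "7e"

# Look at the raw bytes before any character decoding is applied.
raw = unquote_to_bytes("%7e%7e")

# In UTF-8, a byte whose high bit is 0 is a complete single-byte (ASCII) character,
# so 0x7E 0x7E can only be two separate characters, never one two-byte character.
all_single_byte = all(b & 0x80 == 0 for b in raw)

print(two_tildes, tilde_then_text, raw, all_single_byte)
```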

URIs are not generally UTF-8 encoded. To quote the article text: "The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values." IMHO this is wrong. RFC 3986 mandates this only for the host part. The encoding of a URI is generally transparent: each application that generates a URI can interpret its own URIs correctly (that is, apply encoding and decoding consistently). The only interesting point is the behaviour of externally generated URIs, e.g. from browser forms and external applications (e.g. the Google parameter ie=UTF-8). Behaviour may differ among browsers. --Jenswilke (talk) 09:31, 28 February 2008 (UTC)

From RFC 3986 sec. 2: When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
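The three examples in that RFC passage can be reproduced with a short sketch, assuming Python's urllib.parse.quote, which encodes to UTF-8 by default. The names examples and encoded are only illustrative.

```python
from urllib.parse import quote

# The three characters from the RFC 3986 example:
# LATIN CAPITAL LETTER A, LATIN CAPITAL LETTER A WITH GRAVE, KATAKANA LETTER A.
examples = ["A", "\u00C0", "\u30A2"]

# safe="" means no character outside the unreserved set is left unencoded;
# quote() converts to UTF-8 bytes first, then percent-encodes each byte.
encoded = [quote(ch, safe="") for ch in examples]

for ch, enc in zip(examples, encoded):
    print(f"U+{ord(ch):04X} -> {enc}")
```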

The paragraph you're taking issue with is paraphrasing that in almost exactly the same terms, so I don't see what the problem is.

The only thing that might need emphasis is that it's new URI schemes that need to do this; the ones that are most ubiquitous (http, mailto, and the application/x-www-form-urlencoded MIME type as defined by XHTML1/HTML4 and lower) aren't affected and thus use arbitrary encodings that vary from app to app. —mjb (talk) 19:27, 28 February 2008 (UTC)

Added a reference table

Hey all, I just came to this article looking for how to encode a % sign in a URL string... noticed that the article couldn't tell me, just how to find out (which is arguably more encyclopedic), so I went along and added a table. It possibly needs a bit of rewording if it's decided to keep the table; if you don't like it, feel free to remove it =) Themania 15:12, 1 March 2007 (UTC)
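For the specific question of the % sign, a quick way to check is a sketch like the following, assuming Python's urllib.parse. The sample string is made up for illustration.

```python
from urllib.parse import quote, unquote

# "%" introduces a percent-encoded octet, so a literal percent sign
# must itself be written as "%25" (0x25 is the ASCII code for "%").
percent_encoded = quote("%", safe="")
sample_encoded = quote("100% sure", safe="")

# Decoding reverses the process.
round_trip = unquote(sample_encoded)

print(percent_encoded, sample_encoded, round_trip)
```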

Whitespace

Shouldn't the space character, encoded as %20, be covered as well? It already appears in encoded form in several examples in RFC 3986; just read all the paragraphs containing "%20".

I also think the line "No other characters are allowed in an URI." deserves a citation. Daveoh 12:05, 29 July 2007 (UTC)

I agree, I just came to this page to check the value and was surprised it wasn't here. I've added it to the tables. Not sure if there is a better way of representing a blank character?! --Bleveret (talk) 13:17, 22 January 2008 (UTC)

Reverted. Space is not a "reserved" character. As the article and specs explain, a reserved character has a special purpose (usually as some kind of component delimiter) in a URI, so when it appears literally in a URI, it must be used for that special purpose. If it's not used for that purpose, then it has to be represented by a percent-encoded octet (8-bit byte value).

Space is also not an "unreserved" character. That is, it's not one of the very small number of characters that can simply appear in a URI literally. Unreserved characters can optionally be represented by percent-encoded octets.

Not being in either of those special sets, space is one of the very large range of characters that must always be percent-encoded. There's no need to single it out; there are over 100,000 other characters that must also be percent-encoded for the same reason.

As for the "no other characters" ... that's covered in the spec, in several places where it's said that a URI must match the syntax rules. I don't mind adding specific citations, but it's not really up for debate, is it? —mjb (talk) 21:23, 22 January 2008 (UTC)
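The three categories described in the reply above can be sketched as a small classifier. The character sets below are my reading of the unreserved and reserved sets in RFC 3986; the function name classify and the category labels are made up for illustration.

```python
import string

# Unreserved set per RFC 3986: letters, digits, and "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Reserved set per RFC 3986: gen-delims plus sub-delims.
RESERVED = set(":/?#[]@" + "!$&'()*+,;=")

def classify(ch: str) -> str:
    """Return which of the three categories a single character falls into."""
    if ch in UNRESERVED:
        return "unreserved"   # may appear literally; encoding is optional
    if ch in RESERVED:
        return "reserved"     # literal only when used for its special purpose
    return "other"            # must always be percent-encoded

for ch in ["~", "@", " ", "\u30A2"]:
    print(repr(ch), classify(ch))
```

Note that "%" also lands in "other" here, which is consistent with a literal percent sign having to be written as %25.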

OK, fair enough; however, it still seems odd to me that information regarding %20 (which is probably the most commonly percent-encoded character) barely gets a mention. Would it be OK if I added a section with a helpful reference table of commonly used characters? --Bleveret (talk) 11:12, 8 February 2008 (UTC)

Maybe. It's original research, maybe best sought elsewhere and linked to, but as long as it's correct I'm not going to argue about it. My fear is that people will keep adding to it, or that they will assume that because the reserved character "@", for example, is listed in the table, it always has to be percent-encoded as "%40". So you have to make it clear that that's not the case. And it has to be clear that your table assumes an ASCII-based encoding for the non-'reserved', non-'unreserved' characters.

See, in theory, characters that aren't in the reserved or unreserved sets get percent-encoded after being converted to bytes according to any encoding (i.e., whichever one you need to use; the spec doesn't mandate one, although, going forward from Jan 2005, new specs are supposed to stick to UTF-8). In practice, the encoding is almost always a superset of ASCII, like UTF-8 or ISO-8859-1, but not UTF-16LE or UTF-16BE or UTF-32. So, space almost always becomes %20, and the other ASCII-range characters (U+0000 to U+007F) likewise become one percent-encoded byte per character. Meanwhile, characters from U+0080 up to U+10FFFF (the upper limit of Unicode) are percent-encoded as one or more bytes per character, depending on which encoding was used as the basis.
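To illustrate how the result depends on the underlying encoding, here is a minimal sketch assuming Python's urllib.parse.quote and its encoding parameter. The variable names are illustrative only.

```python
from urllib.parse import quote

# A space is one byte (0x20) in any ASCII superset, so it becomes "%20".
space = quote(" ", safe="")

# U+00E9 (e with acute) is one byte in ISO-8859-1 but two bytes in UTF-8.
e_latin1 = quote("\u00E9", safe="", encoding="latin-1")
e_utf8 = quote("\u00E9", safe="", encoding="utf-8")

# U+30A2 (KATAKANA LETTER A) takes three bytes in UTF-8
# and cannot be represented in ISO-8859-1 at all.
katakana_utf8 = quote("\u30A2", safe="", encoding="utf-8")

print(space, e_latin1, e_utf8, katakana_utf8)
```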

Also, since percent-encoding is used in different contexts by different applications, there are differences of opinion over whether and how certain characters are percent-encoded. The table would need to make it clear that it's normal for "+" to be used instead of "%20" in application/x-www-form-urlencoded data, for example, although you may see different behaviors for this in different browsers. mjb (talk) 23:17, 8 February 2008 (UTC)
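The "+" versus "%20" difference can be seen in a short sketch, assuming Python's urllib.parse, where quote follows the generic style and quote_plus and urlencode follow the form-data style. The sample strings are made up for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

text = "a b+c"

# Generic percent-encoding: space -> "%20", literal plus -> "%2B".
generic = quote(text, safe="")

# Form-data style: space -> "+", so a literal plus must still become "%2B".
form = quote_plus(text)

# urlencode() builds a query string using the form-data style.
query = urlencode({"q": "a b"})

# Decoding has to match the style that was used to encode.
plain_decode = unquote("a+b")       # "+" is left alone
form_decode = unquote_plus("a+b")   # "+" is turned back into a space

print(generic, form, query, plain_decode, form_decode)
```

Decoding form data with a generic decoder leaves stray plus signs in the text, which is one common source of the differing behaviors mentioned above.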
