Tech-Tip: Simple URL Encoder

Before a client sends a request to a web server, it must encode the URL so that the URL is well formed. This encoding is necessary because some characters have a reserved meaning within a URL and must be escaped when they appear as data.

This technical tip describes a simple URL encoding helper method that is useful for MIDlets that communicate with web servers or use web services. Because this tech-tip is targeted at Java ME clients, it covers only how to encode a URL (decoding a URL is not presented). The tech-tip first presents some related definitions taken from RFC 2396 – Uniform Resource Identifiers (URI): Generic Syntax, Section 2, followed by the class UrlUtils.

For a helpful URL character encoding chart see i-Technica’s URLEncode Code Chart.

Encoding a URL

From RFC 2396 – Uniform Resource Identifiers (URI): Generic Syntax, Section 2

2.2. Reserved Characters

Many URI include components consisting of, or delimited by, certain special characters. These characters are called “reserved”, since their usage within the URI component is limited to their reserved purpose. If the data for a URI component would conflict with the reserved purpose, then the conflicting data must be escaped before forming the URI.

reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","

The “reserved” syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax; they are used as delimiters of the components described in Section 3.

Characters in the “reserved” set are not reserved in all contexts. The set of characters actually reserved within any given URI component is defined by that component. In general, a character is reserved if the semantics of the URI changes if the character is replaced with its escaped US-ASCII encoding (note that US-ASCII is a subset of UTF-8, so every US-ASCII character has the same single-octet encoding in both).
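The reserved set above is small enough to test with a simple lookup. The following is a minimal sketch, not part of the RFC or of the original tip; the class name ReservedChars and the method isReserved are hypothetical names chosen for illustration.

```java
// Hypothetical helper for illustration; the names are assumptions.
final class ReservedChars {
    // The "reserved" set from RFC 2396, section 2.2.
    private static final String RESERVED = ";/?:@&=+$,";

    private ReservedChars() {}

    /** Returns true if c belongs to the RFC 2396 "reserved" set. */
    static boolean isReserved(char c) {
        return RESERVED.indexOf(c) >= 0;
    }
}
```

For example, ReservedChars.isReserved('&') returns true, while ReservedChars.isReserved('-') returns false.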

2.3. Unreserved Characters

Data characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include upper and lower case letters, decimal digits, and a limited set of punctuation marks and symbols.

 
unreserved  = alphanum | mark

where mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

Unreserved characters can be escaped without changing the semantics of the URI, but this should not be done unless the URI is being used in a context that does not allow the unescaped character to appear.
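A check for the unreserved set can be sketched the same way. This is an illustrative sketch only; the class name UnreservedChars and the method isUnreserved are hypothetical, and the code uses only calls that are expected to be available on CLDC as well as on Java SE.

```java
// Hypothetical helper for illustration; the names are assumptions.
final class UnreservedChars {
    // The "mark" set from RFC 2396, section 2.3.
    private static final String MARK = "-_.!~*'()";

    private UnreservedChars() {}

    /** Returns true if c is alphanum or mark, i.e. "unreserved" per RFC 2396. */
    static boolean isUnreserved(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || MARK.indexOf(c) >= 0;
    }
}
```

Characters for which isUnreserved returns true can be copied into a URI component unchanged; everything else is a candidate for escaping.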

2.4. Escape Sequences

Data must be escaped if it does not have a representation using an unreserved character; this includes data that does not correspond to a printable character of the US-ASCII coded character set, or that corresponds to any US-ASCII character that is disallowed, as explained below.

2.4.1. Escaped Encoding

An escaped octet is encoded as a character triplet, consisting of the percent character “%” followed by the two hexadecimal digits representing the octet code. For example, “%20” is the escaped encoding for the US-ASCII space character.
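The triplet described above can be produced with two table lookups, one per hexadecimal digit. The following is a minimal sketch; the class name EscapedOctet and the method escape are hypothetical. It emits upper-case hexadecimal digits, which is a choice made here (the grammar accepts both cases).

```java
// Hypothetical helper for illustration; the names are assumptions.
final class EscapedOctet {
    private static final String HEX = "0123456789ABCDEF";

    private EscapedOctet() {}

    /** Returns the "%" hex hex triplet for the low 8 bits of octet. */
    static String escape(int octet) {
        StringBuffer sb = new StringBuffer(3);
        sb.append('%');
        sb.append(HEX.charAt((octet >> 4) & 0x0F)); // high nibble
        sb.append(HEX.charAt(octet & 0x0F));        // low nibble
        return sb.toString();
    }
}
```

With this sketch, EscapedOctet.escape(0x20) returns "%20", matching the space example above.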

escaped = "%" hex hex

where hex = digit | "A" | "B" | "C" | "D" | "E" | "F" | 
                          "a" | "b" | "c" | "d" | "e" | "f"

2.4.2. When to Escape and Unescape

A URI is always in an “escaped” form, since escaping or unescaping a completed URI might change its semantics. Normally, the only time escape encodings can safely be made is when the URI is being created from its component parts; each component may have its own set of characters that are reserved, so only the mechanism responsible for generating or interpreting that component can determine whether or not escaping a character will change its semantics. Likewise, a URI must be separated into its components before the escaped characters within those components can be safely decoded.

In some cases, data that could be represented by an unreserved character may appear escaped; for example, some of the unreserved “mark” characters are automatically escaped by some systems. If the given URI scheme defines a canonicalization algorithm, then unreserved characters may be unescaped according to that algorithm. For example, “%7e” is sometimes used instead of “~” in an http URL path, but the two are equivalent for an HTTP URL.

Because the percent “%” character always has the reserved purpose of being the escape indicator, it must be escaped as “%25” in order to be used as data within a URI. Implementers should be careful not to escape or unescape the same string more than once, since unescaping an already unescaped string might lead to misinterpreting a percent data character as another escaped character, or vice versa in the case of escaping an already escaped string.
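The warning about escaping the same string twice is easy to reproduce. The sketch below escapes only the percent character, which is enough to show the effect; the class name PercentEscapeDemo and the method escapePercent are hypothetical names used for illustration.

```java
// Hypothetical demo for illustration; the names are assumptions.
final class PercentEscapeDemo {
    private PercentEscapeDemo() {}

    /** Replaces every '%' in data with its escaped form "%25". */
    static String escapePercent(String data) {
        StringBuffer out = new StringBuffer(data.length() + 8);
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == '%') {
                out.append("%25");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Escaping "100%" once gives "100%25". Escaping that result again gives "100%2525", which a server would unescape to "100%25" rather than the intended "100%".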

2.4.3. Excluded US-ASCII Characters

Although they are disallowed within the URI syntax, we include here a description of those US-ASCII characters that have been excluded and the reasons for their exclusion.

The control characters in the US-ASCII coded character set are not used within a URI, both because they are non-printable and because they are likely to be misinterpreted by some control mechanisms.

control = <US-ASCII coded characters 00-1F and 7F hexadecimal>

The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.

space = <US-ASCII coded character 20 hexadecimal>

The angle-bracket “<” and “>” and double-quote (") characters are excluded because they are often used as the delimiters around URI in text documents and protocol fields. The character “#” is excluded because it is used to delimit a URI from a fragment identifier in URI references (Section 4). The percent character “%” is excluded because it is used for the encoding of escaped characters.

delims = "<" | ">" | "#" | "%" | <">

Other characters are excluded because gateways and other transport agents are known to sometimes modify such characters, or they are used as delimiters.

unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

Data corresponding to excluded characters must be escaped in order to be properly represented within a URI.
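The excluded classes listed in this section (control, space, delims and unwise) can be collected into one predicate. This is a sketch under the assumption that the input is a US-ASCII character; the class name ExcludedChars and the method isExcluded are hypothetical.

```java
// Hypothetical helper for illustration; the names are assumptions.
final class ExcludedChars {
    private static final String DELIMS = "<>#%\"";
    private static final String UNWISE = "{}|\\^[]`";

    private ExcludedChars() {}

    /**
     * Returns true if c is an excluded US-ASCII character per RFC 2396,
     * section 2.4.3. Characters above 0x7F are outside the scope of this
     * check and return false here, although they must be escaped as well.
     */
    static boolean isExcluded(char c) {
        return c <= 0x1F || c == 0x7F   // control
            || c == 0x20                // space
            || DELIMS.indexOf(c) >= 0   // delims
            || UNWISE.indexOf(c) >= 0;  // unwise
    }
}
```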

The Code – UrlUtils.java
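What follows is a minimal sketch of an encoder built from the definitions above, not necessarily the original listing. It assumes the string is converted to UTF-8 octets before escaping, leaves unreserved characters as they are, and escapes every other octet, which makes it suitable for a single component value such as a query parameter rather than for a complete URL. The class name UrlUtils comes from this tip; the method name encode and the private helper are assumptions.

```java
import java.io.UnsupportedEncodingException;

// Sketch only: the method name and UTF-8 choice are assumptions.
final class UrlUtils {
    // The "mark" set from RFC 2396, section 2.3.
    private static final String MARK = "-_.!~*'()";
    private static final String HEX = "0123456789ABCDEF";

    private UrlUtils() {}

    /**
     * Escapes every octet of s (taken as UTF-8) that is not an
     * unreserved character. Returns null if s is null.
     */
    static String encode(String s) {
        if (s == null) {
            return null;
        }
        byte[] octets;
        try {
            octets = s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException("UTF-8 is not supported");
        }
        StringBuffer out = new StringBuffer(octets.length * 3);
        for (int i = 0; i < octets.length; i++) {
            int b = octets[i] & 0xFF; // treat the byte as unsigned
            if (isUnreserved(b)) {
                out.append((char) b);
            } else {
                out.append('%');
                out.append(HEX.charAt(b >> 4));
                out.append(HEX.charAt(b & 0x0F));
            }
        }
        return out.toString();
    }

    // alphanum | mark, per RFC 2396 section 2.3.
    private static boolean isUnreserved(int b) {
        return (b >= 'a' && b <= 'z')
            || (b >= 'A' && b <= 'Z')
            || (b >= '0' && b <= '9')
            || MARK.indexOf(b) >= 0;
    }
}
```

A MIDlet might use the sketch while assembling a URL from its parts, encoding each value once and only once (the host below is a placeholder): "http://example.com/search?q=" + UrlUtils.encode("java me & midp") produces "http://example.com/search?q=java%20me%20%26%20midp".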

ceo
