Monday, 11 January 2016

If you enjoy reading RFCs, or you've been overly strict validating URLs in the past, you might know that domain names support unicode. Except they don't. Not really. In fine internet tradition, it's all a bit of a hack.

The Domain Name System (DNS) is restricted in practice to a subset of ASCII. Any internationalized domain names (IDNs) must be transcribed to ASCII before being stored in DNS. This method of representing unicode in plain ASCII is called Punycode. Your browser will use this encoding to convert "bücher.ch" to it's ASCII representation ("xn–bcher-kva.ch") when you make a request.

Punycode tries to be human-readable and easy to implement, but these goals must yield to the reality of space constraints. Domain labels are limited to only 63 characters (RFC 1034), so the ratio of original string length to encoded ASCII length must be small. The algorithm takes an interesting approach to this problem, and while I wouldn't call it elegant, it makes a useful case study in space efficiency.

The encoding takes advantage of the fact that unicode characters are grouped into related blocks. Since an IDN will often include characters from only one other script system, the extended code points are usually close together.

To encode an extended string, the algorithm first copies all the basic ASCII characters from the IDN (eg, "bücher" => "bcher"). It then iterates through the extended characters, checking each position to see if it should be inserted. Each time, a counter is incremented. When an insertion is made, the value of the counter is appended to the output (eg, "bcher-kva", where "kva" is the counter value). Then, the counter is reset to zero and the iteration continues until all the extended characters have been inserted.

The resulting counter values define the number of iterations the decoder will have to go through to insert the next extended character. These deltas will usually start fairly large (to reach the appropriate code block) then be relatively small for subsequent characters. By using a variable-width integer encoding, Punycode requires fewer characters for small deltas than it does for large ones. For example, to encode "büücher" (with two "üü") we'd only require one more iteration of the decoder, which translates to just one extra character in the output: "bcher-kvaa" (where the deltas are now "kva" and "a", or 745 and 0 respectively).

The code is a little fiddly in practice, so if you'd like a clear description of the algorithm ignore my ramblings above and take a look at RFC 3492.

Why am I writing about Punycode? Because I've been working on a scheme implementation of course! The API looks something like this…

(use punycode)

(punycode-encode "Bücher")
;; => "Bcher-kva"

(punycode-decode "Bcher-kva")
;; => "Bücher"

(domain->ascii "www.bücher.de")
;; => "www.xn--bcher-kva.de"

(domain->unicode "www.xn--bcher-kva.de")
;; => "www.bücher.de"

Hopefully this is useful to people who need to display internationalized domains, or accept them as input. For myself, part of the appeal with CHICKEN has been exploring things I take for granted in more popular languages. I wouldn't normally find an excuse to read RFC 3492.