Two Amazing Things: Thing #1
June 30th, 2009 by Radar
Today two amazing things happened and I would like to share them with you in a two-part series (woah, look at me going all high-tech on my reader). Here’s the first:
Hampton Catlin
Today Hampton Catlin was talking with Dr Nic about Ruby 1.9 issues he was having with his wikimedia-mobile project. Specifically, he was getting the error “incompatible character encodings: ASCII-8BIT and UTF-8”. This is a guy I admire and look up to, and I think he’s “the shit”. He came to my company looking for help and it was my (and Bo’s) task to help him figure out what was going on. Honoured.
The search box on his site was a bit wrong for fanciful languages like that German:

and some pages threw some more interesting errors:

I’d seen this error before in my Ruby 1.9 testing, but that was so long ago that I had forgotten what context or even if I fixed it. Probably not.
I remembered someone linking to this post by Dave Thomas a while ago but had forgotten the link. Thanks to Google I was able to enter “Ruby 1.9 encodings” and it knew exactly what I was after. I followed the “instructions” and put # encoding: utf-8 at the top of the merb executable and the buffer.rb file in HAML (which, it turns out, had no bearing on the final result). No luck. Then Hampton mentioned he had passed -KU to the ruby interpreter, which randomly fixed/broke random things. So I tried that, and got a couple of degrees of success.
I opened up irb1.9 -KU (yeah, I’m so cool I have two versions of Ruby installed, at the same time). I knew of the encoding method you can call on a string to find out which encoding it is tagged with, so I tried something simple: "Ryan".encoding, which gave UTF-8. Then I tried the German text and wasn’t surprised when that also returned UTF-8. So what’s going on?
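If you want to repeat that check outside of irb, here is a minimal sketch. The `encoding_name` helper and the sample word are mine, purely for illustration, and the results assume the file's source encoding is UTF-8 (which the magic comment requests).

```ruby
# encoding: utf-8

# Hypothetical helper: report the encoding Ruby has tagged a string with.
def encoding_name(str)
  str.encoding.name
end

puts encoding_name("Ryan")    # UTF-8, given a UTF-8 source encoding
puts encoding_name("Straße")  # UTF-8, given a UTF-8 source encoding

# The tag says how to interpret the bytes; ß takes two bytes in UTF-8.
puts "Straße".length          # 6 characters
puts "Straße".bytesize        # 7 bytes
```

Both literals come out as UTF-8 because they are written in the source file. That says nothing about strings that arrive from elsewhere, such as request parameters.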
Well, turns out that even though we specified # encoding: utf-8 in the merb executable and even in a meta tag in the HTML, the HAML that was getting sent to the parser was being sent in ASCII-8BIT! Around this point Bo came in and we discovered the lovely force_encoding method for strings in order to… well, I’m sure you can figure it out.
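Here is a rough sketch of how that mismatch blows up. This is not the Haml code, just two stand-in strings I made up: a UTF-8 template fragment and a parameter whose bytes have been tagged ASCII-8BIT. `try_concat` is a hypothetical helper that hands back the exception instead of raising it.

```ruby
# encoding: utf-8

# Concatenate two strings, returning the exception instead of raising it
# so both outcomes can be printed side by side.
def try_concat(left, right)
  left + right
rescue Encoding::CompatibilityError => e
  e
end

# Hypothetical stand-ins for the template text and the request parameter.
template  = "Ergebnisse für: "
raw_param = "München".b        # same bytes, but tagged ASCII-8BIT (binary)

# ASCII-only strings are compatible whatever their tags, so this works:
p try_concat("Results for: ", "Ryan".b)    # "Results for: Ryan"

# Both sides contain non-ASCII bytes and the tags differ, so this fails:
p try_concat(template, raw_param).class    # Encoding::CompatibilityError
```

Ruby only complains when both strings contain non-ASCII bytes and their encodings differ, which would explain why the English pages were fine and the German ones were not.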
This is the misbehaving line in haml 2.0.9 and to fix it we just do result.force_encoding("UTF-8") and that forces whatever’s being appended to the buffer to always be UTF-8!
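A minimal sketch of the idea, using a plain string as a stand-in for the Haml buffer. `push_utf8` and the sample strings are hypothetical, not Haml's actual code.

```ruby
# encoding: utf-8

# Hypothetical stand-in for the buffer append: re-tag the chunk as UTF-8
# before appending. force_encoding changes only the encoding label, never
# the bytes, so this is only correct if the bytes really are UTF-8.
def push_utf8(buffer, chunk)
  buffer << chunk.dup.force_encoding("UTF-8")
end

buffer    = "Ergebnisse für: ".dup
raw_param = "München".b        # UTF-8 bytes mislabelled as ASCII-8BIT

push_utf8(buffer, raw_param)
puts buffer                    # Ergebnisse für: München
puts buffer.encoding           # UTF-8
puts buffer.valid_encoding?    # true
```

The sketch works on a copy (`dup`) so the caller's string keeps its original tag. The fix described above calls force_encoding on the result directly.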
Hampton was happy, we were happy, and karma rewarded me with a delicious steak sandwich + icecream with banana slices with maple syrup on top.

June 30th, 2009 at 10:42 pm
Except that forcing a string into UTF-8 isn’t necessarily the correct thing to do.
What if the user’s browser has submitted iso-8859-1, Shift-JIS or Big-5 data?
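To make that concern concrete, here is a small sketch with made-up data: the bytes a browser might send for “München” in ISO-8859-1. `to_utf8` is a hypothetical helper, and it assumes you already know the real source encoding.

```ruby
# encoding: utf-8

# "München" in ISO-8859-1: ü is the single byte 0xFC.
latin1_bytes = "M\xFCnchen".b

# Re-tagging keeps the bytes, and 0xFC followed by "n" is not valid UTF-8.
forced = latin1_bytes.dup.force_encoding("UTF-8")
puts forced.valid_encoding?      # false

# Transcoding: declare the real source encoding, then convert the bytes.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

converted = to_utf8(latin1_bytes, "ISO-8859-1")
puts converted == "München"      # true
puts converted.bytesize          # 8
```

force_encoding is the right tool when the bytes are already UTF-8 and only the label is wrong. encode is the one that actually rewrites bytes from one encoding into another.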
July 1st, 2009 at 7:39 am
James,
This is true. How would you best handle it? HAML is appending two strings together, one is UTF-8 and one is not. If they differ, an exception gets raised…
Also, in this context, the string is coming from the HAML views. It was outputting the user’s search query onto the page, sure, but whatever passed it to us (in this case it was Merb’s params, I suppose) was obviously butchering the encoding to make the string ASCII-8BIT in the first place…
July 1st, 2009 at 10:31 am
You’re right – if it’s your own view you can be sure the content is UTF-8 and can safely force the encoding yourself.
I read the post as proposing a patch for HAML that does the force_encoding() call. That would be dangerous since it’s possible developers will want to generate non UTF-8 HAML files.
July 1st, 2009 at 12:19 pm
I can see how it was interpreted that way, my bad.
July 1st, 2009 at 5:52 pm
I’ve just pushed (what should be) a fix for this issue that will get released soon in Haml 2.2. Haml now has an :encoding option that specifies which encoding to use for the template; this defaults to UTF-8. It won’t automatically re-encode input – that’s Merb’s job – but as long as the input matches the template, these problems shouldn’t arise.
July 3rd, 2009 at 12:19 pm
Fantastic thanks Nathan!