Standup 01/16/2009: onReady() for AJAX, Web Sprites & Detecting UTF-8

Interesting Things

  • Web-based sprite generator - here

It also keeps the generated sprite really small, which is great if you care about page load times. A Ruby+ImageMagick sprite generator might also be a good thing to build.

  • Cool way of detecting if a file is UTF-8 encoded using Ruby+Iconv - here

Ask for Help

"Is there an onReady() for AJAX events?"

Detecting invalid encoding in CSV uploads

We ran into an odd bug using FasterCSV to import some data. We were requiring the CSV files to be UTF-8 encoded, but some users tried to upload files in other encodings. FasterCSV choked on bytes that weren't valid UTF-8, truncating the data to the end of the line and leaving fields blank. We didn't want to ask the user to select an encoding, because they'd probably get it wrong anyway, so we decided to reject any files with characters that would cause problems. The trick, then, is how to detect that.

First, the tests. We want to detect if an input string contains valid characters in UTF-8 encoding. And we need to deal with both strings and IO (File or StringIO) objects (more on that in a bit).

describe "::encoding_is_utf8? checks strings and IOs" do
  before do
    @utf8 = "This is a test with ç international characters"
  end

  it "returns true when all characters are valid" do
    Importer.encoding_is_utf8?(@utf8).should be_true
    Importer.encoding_is_utf8?(StringIO.new(@utf8)).should be_true
  end

  it "returns false when any characters are invalid" do
    bogus = Iconv.conv('ISO-8859-1', 'UTF-8', @utf8)
    Importer.encoding_is_utf8?(bogus).should be_false
    Importer.encoding_is_utf8?(StringIO.new(bogus)).should be_false
  end
end

Here's the implementation:

class Importer
  # Accepts a String or an IO (File, StringIO); true only if every line is valid UTF-8.
  def self.encoding_is_utf8?(file_or_string)
    file_or_string = [file_or_string] if file_or_string.is_a?(String)
    is_utf8 = file_or_string.all? { |line| Iconv.conv('UTF-8//IGNORE', 'UTF-8', line) == line }
    file_or_string.rewind if file_or_string.respond_to?(:rewind)
    is_utf8
  end
end

So the meat of the check is that we are using the Iconv library to detect bad characters. We convert from an assumed UTF-8 to UTF-8, ignoring any characters that can't be represented in UTF-8. If the output and input aren't identical, that means there were bogus characters and the uploaded file should be rejected.

The #rewind is needed to reset the read position in the file so FasterCSV can start over from the beginning. Specs for that aren't included here.
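
On Ruby 1.9 and later, strings carry their own encoding, so the same check can probably be sketched without Iconv. This is a minimal sketch, not what we shipped: Utf8Check is a hypothetical name, and it assumes a Ruby where String#valid_encoding? is available.

```ruby
require 'stringio'

# Hypothetical stand-in for Importer.encoding_is_utf8? on Rubies without Iconv.
module Utf8Check
  # Accepts a String or an IO (File, StringIO).
  def self.valid?(file_or_string)
    lines = file_or_string.is_a?(String) ? [file_or_string] : file_or_string.each_line
    # Reinterpret each line's bytes as UTF-8 and ask Ruby whether they are well formed.
    lines.all? { |line| line.dup.force_encoding(Encoding::UTF_8).valid_encoding? }
  ensure
    # Reset the read position so the CSV parser can start from the beginning.
    file_or_string.rewind if file_or_string.respond_to?(:rewind)
  end
end
```

As with the Iconv version, the check stops at the first bad line and the IO is rewound either way.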

Then in our controller, we ensure the CSV doesn't have any bad characters before we give it to FasterCSV. We extracted that check into its own method, shown here:

def require_utf8!(csv_content)
  unless Importer.encoding_is_utf8?(csv_content)
    raise "Import file must be UTF-8 only. You can paste non-UTF-8 CSV directly into the CSV Text field for automatic conversion."
  end
end

As you can read in the exception message (which ends up in the flash), the user can work around the encoding issue by pasting the CSV into a textarea input in the browser, which automatically transcodes the data into UTF-8. Aren't browsers awesome? The other option would be to transcode the CSV file, but the textarea is easier if the files aren't gigundous. Anyway, since we can input CSV as either a file or a textarea string, that's why #encoding_is_utf8? needs to check both files and strings.
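
For the transcoding option, here is a rough sketch of what it might look like on Ruby 1.9+ with String#encode. The helper name transcode_to_utf8 is hypothetical, and it assumes we already know the source encoding (ISO-8859-1 by default here), which is exactly the guess we didn't want to push onto users.

```ruby
# Hypothetical helper: reinterpret raw upload bytes in a known source
# encoding and convert them to UTF-8. The default source encoding is an
# assumption for illustration, not something we can detect reliably.
def transcode_to_utf8(bytes, source_encoding = Encoding::ISO_8859_1)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end
```

If the guess is wrong the result is still valid UTF-8, just with the wrong characters in it, which is why rejecting the file felt safer to us.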

This approach and implementation seem fine to me. I get the feeling there might be a much simpler way, though. Anyone got a better idea?

Standup 01/13/2009: daemons, encoding

Ask for Help

"Daemon best practices in Ruby?"

  • We haven't tried DaemonKit
  • SimpleDaemon is what we currently use, which we suspect of interfering with monit (had problems with multiple instances starting, and process not starting upon reboot).
  • A couple of people suggested looking at Daemonize
  • Always monitor daemons with sanity checks (e.g. memory usage); use Monit or God
  • Roll your own?

"cut doesn't handle strange characters in large (5GB) text file, are there other unix commands for text file manipulation that are utf-8 compliant?"

  • Try awk/sed maybe
  • Try using od/hexdump to figure out what the weird characters are
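
If od/hexdump output is too noisy for a file that size, something along these lines might help narrow it down. It is only a sketch: invalid_utf8_lines is a hypothetical helper, and it assumes Ruby 2.4+ (for String#scrub and String#unpack1).

```ruby
require 'stringio'

# Hypothetical helper: returns [line_number, text] pairs for every line that
# contains bytes that are not valid UTF-8, with each bad byte run shown as <hex>.
def invalid_utf8_lines(io)
  results = []
  io.each_line.with_index(1) do |line, number|
    line = line.dup.force_encoding(Encoding::UTF_8)
    next if line.valid_encoding?
    marked = line.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }
    results << [number, marked.chomp]
  end
  results
end
```

It reads one line at a time, so it should cope with large files as long as the lines themselves are a reasonable length.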

UPDATE 01/14/2009: Chad's corrections

Unicode Transliteration to Ascii

Matthew O'Connor and I recently worked on a project that sent SMS messages to mobile customers. Unfortunately the SMS aggregator we used on the project rejected messages with non-ASCII characters.

One approach we considered was to strip our messages of any characters that were not ASCII and send them as is. After looking through some of the rejected messages, we realized most of the problems occurred with Unicode punctuation. Instead of simply deleting the characters, we tried transliterating them to their ASCII equivalents.
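
To show why plain stripping wasn't good enough, here is a minimal sketch of that approach using String#encode on Ruby 1.9+. strip_non_ascii is a hypothetical name, not code from the project.

```ruby
# Hypothetical helper: drop every character that has no US-ASCII
# representation instead of transliterating it.
def strip_non_ascii(utf8_text)
  utf8_text.to_s.encode(Encoding::US_ASCII, invalid: :replace, undef: :replace, replace: "")
end
```

With this, curly quotes simply vanish, so a quoted word loses its quotes entirely rather than getting plain ASCII ones.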

Our first approach used Iconv:

require 'iconv'

module SmsEncoder
  def self.convert(utf8_text)
    text = Iconv.iconv("US-ASCII//TRANSLIT", "UTF-8", utf8_text).first
    text.gsub(/`/, "'")
  rescue Iconv::Failure
    "" # assumed fallback: send nothing when the text cannot be converted
  end
end

For some reason the backtick ` also caused problems so we converted that after using Iconv.

This approach worked perfectly on OS X, but as soon as we moved to the Linux servers the libiconv behavior changed: most untranslatable characters became question marks instead of empty strings.

Instead of wrestling with libiconv, we looked for a solution written entirely in Ruby. We found unidecode, which got us most of the way there. Unidecode did a little more than we wanted, though: it transliterated Chinese and Japanese characters to their approximate sounds, e.g. 今年1月 becomes Jin Nian 1Yue.

We decided to transliterate only extended Latin characters, punctuation and money symbols.
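
That rule boils down to a code point range check. Here is a minimal standalone sketch of the idea, independent of unidecode. The constant and method names are hypothetical, and the ranges are the standard Unicode block boundaries (Latin-1 Supplement from its first printable character).

```ruby
# Hypothetical ranges, named after the Unicode blocks they cover.
TRANSLITERABLE_RANGES = [
  0x00A0..0x00FF, # Latin-1 Supplement (printable part)
  0x0100..0x024F, # Latin Extended-A and Latin Extended-B
  0x2000..0x206F, # General Punctuation
  0x20A0..0x20CF  # Currency Symbols
].freeze

# True when the first character of the string falls in one of the ranges
# we are willing to transliterate; everything else would be dropped.
def transliterable?(character)
  codepoint = character.unpack("U").first
  TRANSLITERABLE_RANGES.any? { |range| range.cover?(codepoint) }
end
```

Printable ASCII never reaches this check because it is passed through untouched, so it isn't listed in the sketch.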

Here is the final code with the unidecode monkey patch:

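require 'set'
require 'unidecode'

module SmsEncoder
  def self.convert(utf8_text)
    Unidecoder.decode(utf8_text.to_s).gsub("[?]", "").gsub(/`/, "'").strip
  end
end

module Unidecoder
  class << self
    def decode(string)
      string.gsub(/[^\x20-\x7e]/u) do |character|
        codepoint = character.unpack("U").first
        # Characters outside the ranges yield nil, which gsub turns into "".
        if should_transliterate?(codepoint)
          CODEPOINTS[code_group(character)][grouped_point(character)] rescue ""
        end
      end
    end

    # Unicode blocks we transliterate (c.f. the Unicode block chart).
    # The lower bounds are assumed from that chart, printable characters only.
    CODE_POINT_RANGES = {
      :basic_latin => Set.new(32 .. 126),
      :latin1_supplement => Set.new(160 .. 255),
      :latin1_extended_a => Set.new(256 .. 383),
      :latin1_extended_b => Set.new(384 .. 591),
      :general_punctuation => Set.new(8192 .. 8303),
      :currency_symbols => Set.new(8352 .. 8399),
    }

    def should_transliterate?(codepoint)
      # Set#+ is union; Enumerable#sum here is the ActiveSupport one.
      @all_ranges ||= CODE_POINT_RANGES.values.sum
      @all_ranges.include? codepoint
    end
  end
end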

and tests:

class SmsEncoderTest < Test::Unit::TestCase
  def test_transliteration_of_blank
    assert_equal "", SmsEncoder.convert(nil)
    assert_equal "", SmsEncoder.convert("")
  end

  def test_transliteration_of_whitespace
    assert_equal "", SmsEncoder.convert(" \t\n")
  end

  def test_transliteration_of_text_surrounded_by_space
    assert_equal "abc", SmsEncoder.convert("  abc  ")
  end

  def test_transliteration_of_ascii
    orig_text = "!\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
    conv_text = SmsEncoder.convert(orig_text)
    assert_equal orig_text.gsub(/`/, "'"), conv_text
  end

  def test_transliteration_of_unicode_punctuation
    utf8_text = "“foo” ‹foo› ‘foo’ ,foo, –foo— {foo} (foo) `foo`"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "\"foo\" <foo> 'foo' ,foo, foo-- {foo} (foo) 'foo'", ascii_text
  end

  def test_transliteration_of_common_latin1_characters
    utf8_text = "ñ ò ^ ¡ ¿ Æ æ ß Ç §"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "n o ^ ! ? AE ae ss C SS", ascii_text
  end

  def test_transliteration_of_money_characters
    utf8_text = "€ £ $ ¥"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "EU PS $ Y=", ascii_text
  end

  def test_untransliterable_characters
    utf8_text = "ɏ \x1f \x01 \x00 Ʌ \x7f"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "", ascii_text
  end

  def test_transliteration_of_chinese_characters
    utf8_text = "ウェブ全体から検索"
    ascii_text = SmsEncoder.convert(utf8_text)
    assert_equal "", ascii_text
  end
end