
Detecting invalid encoding in CSV uploads

We ran into an odd bug using FasterCSV to import some data. We were requiring the CSV files to be UTF-8 encoded, but some users tried to upload files in other encodings. FasterCSV choked on characters that weren’t valid UTF-8, truncating the data to the end of the line and leaving fields blank. We didn’t want to ask the user to select an encoding, because they’d probably get it wrong anyway, so we decided to reject any files with characters that would cause problems. The trick, then, is how to detect that.

First, the tests. We want to detect whether an input contains only valid UTF-8 characters, and we need to handle both strings and IO (File or StringIO) objects (more on that in a bit).

describe "::encoding_is_utf8? checks strings and IOs" do
  before do
    @utf8 = "This is a test with ç international characters"

  it "returns true when all characters are valid" do
    Importer.encoding_is_utf8?(@utf8).should be_true
    Importer.encoding_is_utf8?( be_true

  it "returns false when any characters are invalid" do
    bogus = Iconv.conv('ISO-8859-1', 'UTF-8', @utf8)
    Importer.encoding_is_utf8?(bogus).should be_false
    Importer.encoding_is_utf8?( be_false

Here’s the implementation:

require 'iconv'
class Importer
  def self.encoding_is_utf8?(file_or_string)
    file_or_string = [file_or_string] if file_or_string.is_a?(String)
    is_utf8 = file_or_string.all? { |line| Iconv.conv('UTF-8//IGNORE', 'UTF-8', line) == line }
    file_or_string.rewind if file_or_string.respond_to?(:rewind)
    is_utf8
  end
end

So the meat of the check is using the Iconv library to detect bad characters. We convert from an assumed UTF-8 to UTF-8, ignoring any characters that can’t be represented in UTF-8. If the output and input aren’t identical, that means there were bogus characters and the uploaded file should be rejected.
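
For illustration, here is a rough irb-style sketch of that round trip (Ruby 1.8 with the iconv standard library; the sample bytes are made up, and some iconv builds raise Iconv::IllegalSequence instead of silently dropping a bad byte):

require 'iconv'

valid   = "caf\xc3\xa9"   # "café" as UTF-8: 0xC3 0xA9 is a valid multibyte sequence
invalid = "caf\xe9 ole"   # "café ole" saved as Latin-1: a lone 0xE9 byte is not valid UTF-8

Iconv.conv('UTF-8//IGNORE', 'UTF-8', valid) == valid
#=> true, the string survives the round trip unchanged

Iconv.conv('UTF-8//IGNORE', 'UTF-8', invalid) == invalid
#=> false, //IGNORE drops the 0xE9 byte, so output and input differ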

The #rewind is needed to reset the read position in the file so FasterCSV can start over from the beginning. Specs for that aren’t included here.
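
For a rough idea of what such a spec might look like (this is a sketch, not the original project’s spec; StringIO stands in for an uploaded file):

it "rewinds an IO so it can be read again afterwards" do
  io = StringIO.new(@utf8)
  Importer.encoding_is_utf8?(io)
  io.pos.should == 0
end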

Then in our controller, we ensure the CSV doesn’t have any bad characters before we give it to FasterCSV. We extracted that check into its own method, shown here:

def require_utf8!(csv_content)
  unless Importer.encoding_is_utf8?(csv_content)
    raise "Import file must be UTF-8 only. You can paste non-UTF-8 CSV directly into the CSV Text field for automatic conversion."
  end
end

As you can read in the exception message (which ends up in the flash), the user can work around the encoding issue by pasting the CSV into a textarea input in the browser, which automatically transcodes the data into UTF-8. Aren’t browsers awesome? The other option would be to transcode the CSV file, but the textarea is easier if the files aren’t gigundous. Since we can input CSV as either a file or a textarea string, #encoding_is_utf8? needs to check both files and strings.
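
For context, here is a minimal sketch of how the controller side might fit together; the action, parameter names, and route helper are hypothetical, not the app’s actual code:

class ImportsController < ApplicationController
  def create
    # CSV can arrive as an uploaded file or as text pasted into the "CSV Text" textarea
    csv_content = params[:csv_text].blank? ? params[:csv_file] : params[:csv_text]
    require_utf8!(csv_content)  # raises if the content isn't valid UTF-8
    # ... hand csv_content to FasterCSV / the importer here ...
    redirect_to imports_path
  end
end

The raised message then flows through the app’s existing error reporting, which is what puts it in the flash.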

This approach and implementation seem fine to me. I get the feeling there might be a much simpler way, though. Anyone got a better idea?

  1. Andy says:

    This seems like something that should be handled entirely in the Importer, which can just raise the error from there. I would probably do something like:

    class Importer
      class NotUtf8 < StandardError; end

      def import(file_or_string)
        raise NotUtf8 unless self.class.encoding_is_utf8?(file_or_string)
        # the rest of the method
      end
    end

    Then in the controller you can rescue the NotUtf8 exception and display whichever message you like (see the sketch after this thread). This seems cleaner than having the controller explicitly check for utf8-ness.

  2. Josh Susser says:

    @Andy: I like that, but we had to fit the code into an existing error reporting setup and refactoring things around that much would have been more work than we wanted to do for that story.

  3. Bruno Lopes says:

    Heh, I think I’m the one who reported that particular bug on getsatisfaction. I’m glad to hear you’re fixing it.

    On the solution, is there any way the user could just paste the data directly from Excel, instead of having to save it as a CSV, open the file, and copy/paste it into the browser?
    It may require another importer, as it wouldn’t exactly be CSV, but it could end up saving a lot of time for the user (in this case, me ;) )

  4. Matthew says:

    In the past I’ve used something similar to the following for detecting if a string is valid UTF-8:

    def is_valid_utf8?(text)
      !!text.unpack('U*')  # unpack('U*') raises on malformed UTF-8 in Ruby 1.8
    rescue nil
      false
    end

    In an ideal world you wouldn’t have to read the entire file twice on success. If you’re opening the file you could use a validating proxy object that dies if it reads a non-utf8 line.

  5. Matthew says:

    oops, that should be ArgumentError, not nil

  6. Julik says:

    Much simpler (in Rails context anyway):

    "\337".is_utf8? #=> false

  7. Moses Hohman says:

    At CDD we also needed to detect the charset of CSVs and other file types, so we wrapped the libmagic library (available on OS X via MacPorts, and on Linux). That software goes by two equally confusing names: the libmagic library, which people tend to think is related to ImageMagick, and the “file” utility, which people tend not to realize refers to a utility actually named “file”. Libmagic is pretty good at detecting UTF-8 vs. ASCII vs. ISO-8859-1 vs. “unknown” (which we always call windows-1252, because it always is), and it can handle large files quickly. Occasionally we find pathological cases leading to false detection of US-ASCII, and we have a workaround for that in the Ruby wrapper.

    This is something we’d definitely open source if people thought it would be useful.

    Nice idea to use the browser to convert the character set. Our CSVs are often too large for that, but it makes me realize that’s perhaps why DabbleDB does things that way.
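
Following up on the suggestion in comment 1, here is a minimal sketch of what rescuing the importer’s exception in the controller might look like (Rails 2.x-style; the controller, action, and route helper names are hypothetical):

class ImportsController < ApplicationController
  rescue_from Importer::NotUtf8 do |exception|
    flash[:error] = "Import file must be UTF-8 only."
    render :action => "new"
  end

  def create
    Importer.new.import(params[:csv_text].blank? ? params[:csv_file] : params[:csv_text])
    redirect_to imports_path
  end
end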
