Marshal.dump vs YAML::dump

We're working on a project with a very large dataset: more than 2 million items. This dataset changes frequently, and the changes need to be transported to their respective servers, ready to be served out to clients.
We decided to use a queuing architecture to distribute data. Objects are serialized and pushed to a queue. The large size of the dataset requires us to optimize as much as possible. There are only so many hours in a day and there is a lot of data to transport.
A question was raised in standup: which serialization method is faster, YAML::dump or Marshal.dump? It seemed sensible to write a quick script and work out which would suit our particular situation.
The objects we are serializing are simple hashes. I thought I’d write something that was representative of our situation in order to present a nice clear decision.
Here’s some code:

require 'yaml'
# A small hash, representative of the objects we push onto the queue.
obj = {:a => "hello", :b => "goodbye", :c => "new string", :d => {:da => 1, :db => 2}, :e => 1}
# Round-trip the hash through YAML 10,000 times.
start = Time.now
10_000.times do
  ser_obj = YAML::dump(obj)
  new_obj = YAML::load(ser_obj)
end
puts "YAML::dump time"
puts Time.now - start
# Same round trip with Marshal.
start = Time.now
10_000.times do
  ser_obj = Marshal.dump(obj)
  new_obj = Marshal.load(ser_obj)
end
puts "Marshal.dump time"
puts Time.now - start

I think we all knew how the results would look. It was nice to see that for our particular case there was a clear winner.

YAML::dump time
5.397909
Marshal.dump time
0.280292

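Seems fairly cut and dried to me: for this hash, the Marshal round trip was roughly 19 times faster than the YAML one (5.40 seconds versus 0.28 seconds).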
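If you want to rerun the comparison yourself, here's a minimal sketch of the same round trip timed with a monotonic clock rather than Time.now. The `round_trip_seconds` helper and the `BENCH_OBJ` constant are names made up for this example, and the numbers you get will depend on your machine and Ruby version.

```ruby
require 'yaml'

BENCH_OBJ = {:a => "hello", :b => "goodbye", :c => "new string", :d => {:da => 1, :db => 2}, :e => 1}

# Hypothetical helper: time `iterations` round trips of `obj`.
# The block receives the object and performs one dump/load cycle.
def round_trip_seconds(obj, iterations = 10_000)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield obj }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

yaml_time    = round_trip_seconds(BENCH_OBJ) { |o| YAML.load(YAML.dump(o)) }
marshal_time = round_trip_seconds(BENCH_OBJ) { |o| Marshal.load(Marshal.dump(o)) }

puts "YAML round trip:    #{yaml_time}"
puts "Marshal round trip: #{marshal_time}"
```

Passing the serializer as a block keeps the timing loop identical for both formats, so the only thing that varies is the dump/load call.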
I personally prefer YAML for test result comparison. Maybe we’ll put something in our spec_helper to use YAML for testing and Marshal for production.
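That YAML-for-tests, Marshal-for-production idea could look something like the sketch below. The `QueueSerializer` module and its `wire_format` setting are hypothetical names invented for illustration; this isn't code from our project.

```ruby
require 'yaml'

# Hypothetical wrapper: a single place to choose the wire format.
module QueueSerializer
  class << self
    # :marshal (the default, fast) or :yaml (human-readable, handy in specs).
    attr_writer :wire_format

    def wire_format
      @wire_format ||= :marshal
    end

    def dump(obj)
      wire_format == :yaml ? YAML.dump(obj) : Marshal.dump(obj)
    end

    def load(data)
      wire_format == :yaml ? YAML.load(data) : Marshal.load(data)
    end
  end
end

# In spec_helper.rb one might then write:
#   QueueSerializer.wire_format = :yaml
```

The trade-off is that the specs then exercise a different serializer than production does, so anything that only Marshal rejects would not show up in the test run.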

Comments
  1. Dan DeLeo says:

    I posed this question on the AMQP list, and ezmobius wrote that Marshal is the fastest, JSON is a close second, and YAML is way behind. I decided to just trust it ;) Anyway, is there some reason you can’t use JSON? If it’s fast enough to get the job done, seems like it would provide good readability and speed.

    Also, I personally haven’t seen a case where it didn’t work, but the Rdoc for Marshal or the Pickaxe (don’t remember which) warns that Marshal may change formats between VMs, i.e. Ruby 2.0 could potentially be unable to load Marshaled Ruby 1.9 or 1.8 objects. Seems to work fine with 1.8 and 1.9 though. Flag days are no fun, so it’s worth considering.
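Both points can be poked at with a short sketch. It assumes the json library is available, and the helper names `json_round_trip` and `marshal_format_version` are made up for this example. Note that JSON has no symbol type, so the post's symbol keys come back as strings unless you ask for symbols, and that Marshal writes its format version into the first two bytes of every dump.

```ruby
require 'json'

JSON_SAMPLE = {:a => "hello", :d => {:da => 1, :db => 2}, :e => 1}

# Hypothetical helper: dump to JSON and parse it back.
# With symbolize = true, keys are converted back to symbols.
def json_round_trip(obj, symbolize = false)
  JSON.parse(JSON.generate(obj), :symbolize_names => symbolize)
end

# Hypothetical helper: the [major, minor] format version of a Marshal dump.
def marshal_format_version(data)
  data.bytes.first(2)
end

p json_round_trip(JSON_SAMPLE)        # keys are strings
p json_round_trip(JSON_SAMPLE, true)  # keys are symbols again

# The version this Ruby writes, and the version found in an actual dump.
p [Marshal::MAJOR_VERSION, Marshal::MINOR_VERSION]
p marshal_format_version(Marshal.dump(JSON_SAMPLE))
```

Logging those two bytes on the consumer side would be one cheap way to notice a format change before it turns into a flag day.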

  2. Matthew O'Connor says:

    FWIW, if you do any weird eigenclass stuff to your hashes then Marshal won’t work:

    >> hash = {:foo => 1, :bar => 2}
    => {:bar=>2, :foo=>1}
    >> class << hash
    >> def has_foo_key?
    >> has_key?(:foo)
    >> end
    >> end
    => nil
    >> hash.has_foo_key?
    => true
    >> Marshal.dump(hash)
    TypeError: singleton can't be dumped
    from (irb):8:in `dump'
    from (irb):8

  3. Steve Conover says:

    Stick that in a module and include it instead…
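One reading of that suggestion, sketched below: put the method in a named module and extend the hash with it, rather than defining the method directly on the hash's singleton class. The module name `HasFooKey` is hypothetical. Marshal records the name of the extending module, so the same module has to be defined wherever the data is loaded.

```ruby
# Hypothetical module name, chosen for illustration.
module HasFooKey
  def has_foo_key?
    has_key?(:foo)
  end
end

hash = {:foo => 1, :bar => 2}
hash.extend(HasFooKey)  # a named module instead of ad hoc singleton methods

# The dump no longer raises, and the loaded copy is extended with the module too.
copy = Marshal.load(Marshal.dump(hash))
puts copy.has_foo_key?
```

Including the module into Hash itself would also avoid the error, at the cost of adding the method to every hash in the process.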

  4. For data portability, you might consider checking out the alternate YAML implementation ZAML. A naive benchmark pegs it at 1600% faster than YAML, but feel free to check it out for yourself.

    http://gnomecoder.wordpress.com/2008/09/27/yaml-dump-1600-percent-faster/
