Standup 05/20/2009: "Merging Large Data Sets – after_commit"

Ask for Help

“What are some good ways to merge large business location data sets?”

There was a bunch of input, including the following:

  • Create a score for how close each pair of candidate matches is.
  • Good admin merge tools are worth the effort to create.
  • Normalize the data before the merge (e.g., pass the addresses through the USPS API to turn [Av Ave Avenue] => Ave).
  • Humans do this best; outsource it or use Mechanical Turk.
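The scoring suggestion above can be sketched in a few lines of Ruby. This is a hypothetical illustration (the suffix table and function names are not from the discussion): normalize common street-suffix variants, then score each candidate pair so the closest matches surface first for human review.

```ruby
# Illustrative suffix normalization table -- a real one would cover the
# full USPS suffix list.
SUFFIXES = { "av" => "ave", "avenue" => "ave", "street" => "st" }.freeze

def normalize(address)
  address.downcase.split.map do |word|
    word = word.chomp(".")
    SUFFIXES.fetch(word, word)
  end.join(" ")
end

# Fraction of words shared between the two normalized addresses (0.0..1.0).
def match_score(a, b)
  a_words = normalize(a).split
  b_words = normalize(b).split
  (a_words & b_words).size.to_f / [a_words.size, b_words.size].max
end

match_score("123 Main Avenue", "123 Main Ave")  # => 1.0
match_score("123 Main Avenue", "456 Oak St")    # => 0.0
```

Pairs scoring near 1.0 can be merged automatically; a middle band is where the admin merge tools or Mechanical Turk come in.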

Interesting Things

  • The after_commit plugin allows you to hook events to run after the transaction commits. This is really useful when kicking off threads that expect to have access to the data in the database. Note: using after_save can cause a race condition if the other thread attempts to access the data before the original thread has had a chance to commit the transaction.
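The ordering difference can be simulated in plain Ruby (no Rails here; the event names are illustrative, not ActiveRecord API). With after_save the hook fires while the transaction is still open, so a spawned worker can query before COMMIT; with after_commit it fires only after.

```ruby
# Simulated transaction returning the order in which events occur.
def run_transaction(hook:)
  events = []
  events << :row_written            # INSERT/UPDATE inside the transaction
  events << :worker_kicked_off if hook == :after_save
  events << :commit                 # transaction commits here
  events << :worker_kicked_off if hook == :after_commit
  events
end

after_save_order   = run_transaction(hook: :after_save)
after_commit_order = run_transaction(hook: :after_commit)
# With after_save, the worker can run before :commit and miss the data;
# with after_commit, it is guaranteed to see the committed row.
```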

  • If you are storing a marshaled object in the database, make that field a blob type: it is smaller to store, and if you leave it as a text or varchar you can corrupt the binary data you are storing there. If you don’t have a choice about field types, you should at least Base64-encode the marshaled data before storing it.
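A minimal sketch of the fallback path, using only the Ruby standard library (the record contents are illustrative): Base64-encoding the marshaled bytes makes them safe to round-trip through a text/varchar column.

```ruby
require "base64"

record  = { name: "Joe's Coffee", address: "123 Main Ave" }
binary  = Marshal.dump(record)            # raw bytes; safe only in a blob
encoded = Base64.strict_encode64(binary)  # ASCII-safe for text columns
decoded = Marshal.load(Base64.strict_decode64(encoded))
decoded == record  # => true
```

The encoding costs roughly a third more storage, which is one reason a proper blob column is the better choice when you control the schema.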


RE: “NewRelic Side Effects?” from 05/19/2009 Standup

  • It seems that NewRelic was not the cause of the problem, but exacerbated it by holding the transaction open long enough to create a race condition that still shows up when the system is put under enough load. To fix the problem we moved the trigger that launches the background process from an after_save to an after_commit (via the after_commit plugin). We also re-added NewRelic.
