LABS
Screen Scrape No More…Seriously!

This week I had the pleasure of attending Dapper Camp, put on by the folks at Dapper for their user community. Mitch Kapor kicked it off with a talk on disruptive technology, openness, and innovation. We then got to hear both from the Dapper team about new and upcoming features, and from folks like Aaron Fulkerson of MindTouch about using Dapper to repurpose data. All around, it was pretty interesting.

What I love about Dapper is that it helps solve one of the big issues I see our clients have: data. We can build just about anything, but if an application needs some specific data (and many do), products must be launched with sub-par (but available) data, or worse, launches can end up being delayed. In many cases, we can end up spending a large amount of time (aka money) getting/munging data rather than developing features. Note: I also think the ease of pushing data out of apps via instant gadgets makes Dapper very interesting, but that’s a whole separate post.

The “Dapp Factory” is a Rhino-based server application and a web front-end that deals with just about any site by proxying your requests and modeling the DOM on the proxy, then recording your actions for later replay. But their secret sauce is a super-cool algorithm that figures out the structure of pages in such a way that your API can withstand changes to the target site, making your feed resilient to all but massive site overhauls. You then simply consume an XML or JSON feed, or use a simple API to dynamically construct parameterized feeds.
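To make the "consume a feed" step concrete, here is a minimal Ruby sketch of building a parameterized feed URL and reading a JSON result. Everything specific in it is an assumption for illustration: the base URL, the `dapp` and `format` query parameters, the `results` key, and the sample fares are made up, and this is not Dapper's documented API. The feed body is an inline string so the example runs without a network call.

```ruby
require "json"
require "uri"

# Build a parameterized feed URL. The base URL and the "dapp"/"format"
# parameter names are hypothetical placeholders, not Dapper's real endpoint.
def feed_url(dapp_name, params = {}, base = "http://example.com/dapp-feed")
  query = URI.encode_www_form({ "dapp" => dapp_name, "format" => "json" }.merge(params))
  "#{base}?#{query}"
end

# Parse a JSON feed body into an array of plain hashes.
# The top-level "results" key is an assumed feed shape.
def parse_feed(body)
  JSON.parse(body).fetch("results", [])
end

# Made-up sample body standing in for a real HTTP response.
sample_body = '{"results":[{"airline":"Example Air","price":199},' \
              '{"airline":"Sample Jet","price":149}]}'

url      = feed_url("ExampleFares", "airport" => "SFO")
fares    = parse_feed(sample_body)
cheapest = fares.min_by { |fare| fare["price"] }

puts url
puts cheapest["airline"]
```

In a real app the sample string would be replaced by the body of an HTTP GET against the URL, and the parsing would follow whatever shape the actual feed has.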

There are other companies trying to make data less painful. Metaweb, for example, provides an incredibly fast graph engine and relational schemas (think RDF) that make real-time use interesting. But if the data you need isn’t in Freebase (a likelihood until they get larger), or your data is continually being updated, you will still be stuck scraping and relating the data, and that’s generally where most of the work is.

Take as a small example Dav’s awesome Vacation Planner. The concept is simple, the feature set is small, but getting the data is a pain (see article). Some sites don’t have APIs, and those that do provide unstandardized, sometimes buggy APIs that are often missing the data you need.

I could imagine writing a dapp parser akin to ActiveResource pretty easily (I hear a Ruby SDK came out of DapperCamp, but I can’t find it). With a little more work, it would probably be easy to add cache_fu support, and Ruby modules that could be mixed into models (for asynchronous data gathering) and controllers (for serve now vs. polling) to easily support Dav’s polling mechanism.
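As a rough idea of what the core of such a parser could look like, here is a small Ruby sketch of an ActiveResource-style base class. It is purely hypothetical: `ActiveDappSketch`, its `dapp`/`parameter`/`find` methods, and the injectable `fetcher` are names invented for this example, the in-memory hash is only a stand-in for a real cache such as cache_fu, and the fetcher returns canned rows instead of calling Dapper.

```ruby
# Hypothetical sketch only; no "ActiveDapp" library exists.
module ActiveDappSketch
  class MissingParameter < ArgumentError; end

  class Model
    class << self
      # Fetcher is injectable: a callable taking (url, params) and
      # returning an array of row hashes.
      attr_writer :fetcher

      # Declare (with an argument) or read (without) the backing dapp URL.
      def dapp(url = nil)
        @dapp_url = url if url
        @dapp_url
      end

      # Declare a search parameter, optionally :required.
      def parameter(name, options = {})
        parameters[name] = options
      end

      def parameters
        @parameters ||= {}
      end

      def fetcher
        @fetcher || raise(NotImplementedError, "assign a fetcher first")
      end

      # Naive in-memory cache keyed by the parameter hash.
      def cache
        @cache ||= {}
      end

      # Check required parameters, then fetch rows (once per parameter set).
      def find(params = {})
        missing = parameters.select { |name, opts| opts[:required] && !params.key?(name) }.keys
        raise MissingParameter, "missing: #{missing.join(', ')}" unless missing.empty?
        rows = cache[params] ||= fetcher.call(dapp, params)
        rows.map { |row| new(row) }
      end
    end

    attr_reader :attributes

    def initialize(attributes)
      @attributes = attributes
    end

    # Expose feed fields as reader methods, e.g. fare.price.
    def method_missing(name, *args)
      return attributes[name] if args.empty? && attributes.key?(name)
      super
    end

    def respond_to_missing?(name, include_private = false)
      attributes.key?(name) || super
    end
  end
end

class ExampleFare < ActiveDappSketch::Model
  dapp "url/to/example/dapp"
  parameter :airport, :required => true
end

# Canned fetcher standing in for an HTTP call; counts how often it runs.
calls = 0
ExampleFare.fetcher = lambda do |_url, _params|
  calls += 1
  [{ :airline => "Example Air", :price => 199 },
   { :airline => "Sample Jet", :price => 149 }]
end

fares = ExampleFare.find(:airport => "SFO")
ExampleFare.find(:airport => "SFO") # second call is served from the cache
puts fares.min_by(&:price).airline
puts calls
```

Keeping the fetcher injectable is a deliberate choice in this sketch: the same model code could be backed by a synchronous HTTP call, a background poller, or a test stub. Declarations are stored per class here, so sharing parameters across subclasses would need extra work.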

This would leave Dav with a pretty simple model (data) code, and the luxury of focussing on whether to add Wikipedia integration for population figures or the Big Mac Index, rather than tweaking his Mechanize XPaths all weekend. I vote for the Big Mac Index.

So, the next time someone suggests you screen scrape to get that data you need, tell them to give Dapper a shot. And if anyone wants to write ‘ActiveDapp’ let me know. It could be really fun.

Note: In the spirit of full disclosure, Jon Aizen (Dapper CTO) is a friend of mine and they gave me a free t-shirt…and a sandwich (thanks).

Comments
  1. Parker Thompson says:

    Here’s what I imagine an implementation of Dav’s vacation finder using the (not-yet-existent) “ActiveDapp” library might look like:

    # base class for results from
    # various travel sites
    class Fare < ActiveDapp::Model
      # don't even search without these
      parameter :airport, :required => true
      parameter :city, :required => true
    end

    class ExciteFare < Fare
      dapp "url/to/excite/dapp"
    end

    class FareCompareFare < Fare
      dapp "url/to/farecompare/dapp"
    end

    # model representing dapp results of VRBO
    class Rental < ActiveDapp::Model
      dapp "url/to/VRBO/dapp"

      # map fields from one dapp to another's search parameters
      has_many :fares, :join => [:airport => :renter_location]
    end

    # model that has knowledge of dapps, but isn't one
    class Vacation < ActiveDapp::Model
      property :airport, :required => true

      has_many :rentals, :params => [:renter_location => :airport]

      has_many :fares, :through => :rentals do
        # cheapest fare among the associated results
        def lowest
          proxy.sort { |a, b| a.price <=> b.price }.first
        end
      end

      # total cost: the rental plus its cheapest fare
      def price
        self.rental.price + self.rental.fares.lowest.price
      end
    end

    @vacations = Vacation.find(:airport => 'SFO', :sort_by => :price)
    @vacations.each do |v|
      puts v.city
      puts v.price
      puts v.rental.start_date
      puts v.rental.end_date
      puts v.fare.airline
      puts v.fare.price
    end

  2. Viktor says:

    It is a better way of handling data, especially for me, as I deal with several layers for clients. Good job.

  3. Peyton says:

    This is a good system for handling data. Users, specifically companies that mostly handle large volumes of data, will not find it complicated to operate.
