PIVOTAL LABS
Recent Pivotal Tracker outages

We’ve had a number of brief outages and/or periods of degraded performance in the last few weeks. I’d like to shed some light on what caused these incidents and what we’re doing to prevent them in the future.

As you may know, one of Pivotal Tracker’s core features is that your view of your project is always up to date, with no need to refresh the browser page. If one member of the project pushes a start button on a story, for example, everyone else sees the change immediately. This is an important aspect of keeping the entire team focused and on the same page.

Under the covers, the way this is accomplished is via polling – the browser sends a request every few seconds, basically asking if there is something new. Given the large number of users out there, this translates to approximately 1000 requests per second.

Most of these requests don’t end up hitting any of our application servers; they go straight to a very fast in-memory cache (in the form of multiple memcached processes). Only requests that involve a “stale” response (meaning there are some changes to return to the client) make their way to an application server. These represent a very small fraction of all requests.
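To make the cache-first flow concrete, here is a minimal sketch in Python. It is not Tracker's actual code: the class, the method names, and the idea of keeping one version number per project are all assumptions for illustration, and a plain dictionary stands in for memcached.

```python
from dataclasses import dataclass, field


@dataclass
class PollingFrontEnd:
    """Answers polls from a cache and only falls through to the app when stale."""

    # Stand-in for memcached (assumption): project id -> latest version number.
    cache: dict = field(default_factory=dict)
    # Counts how many polls needed an application server.
    app_server_hits: int = 0

    def record_change(self, project_id: str) -> None:
        # Called when someone changes a story; bumps the cached version.
        self.cache[project_id] = self.cache.get(project_id, 0) + 1

    def poll(self, project_id: str, client_version: int) -> dict:
        current = self.cache.get(project_id, 0)
        if client_version == current:
            # Fast path: nothing new, so the cache alone answers the poll.
            return {"stale": False, "version": current}
        # Slow path: the client is behind, so an app server builds the update.
        self.app_server_hits += 1
        changes = self.load_changes(client_version, current)
        return {"stale": True, "version": current, "changes": changes}

    def load_changes(self, since: int, current: int) -> list:
        # Hypothetical placeholder for the real application-server query.
        return [f"change {v}" for v in range(since + 1, current + 1)]


front_end = PollingFrontEnd()
front_end.poll("project-1", 0)        # served from the cache
front_end.record_change("project-1")  # a teammate starts a story
front_end.poll("project-1", 0)        # stale, so this one reaches the app
```

In this sketch only the third call increments `app_server_hits`, which mirrors the point above: the app servers see only the polls that have something to return.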

This architecture works well, but the in-memory cache is a critical component, and if it goes down or has any problems, the 1000 requests per second end up hitting the app servers, which are not designed to handle that kind of load. The requests end up backing up, and it takes a few minutes for the system to recover even if the caches are brought back up quickly.
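A rough back-of-the-envelope model shows why recovery lags behind the cache coming back. Only the 1000 requests per second figure comes from this post; the app-server capacity, outage length, and normal app load below are hypothetical numbers chosen for illustration, and the model ignores timeouts and client retries.

```python
def backlog_after_outage(arrival_rate, app_capacity, outage_seconds):
    """Requests queued while every poll hits the app servers."""
    overflow_per_second = max(0.0, arrival_rate - app_capacity)
    return overflow_per_second * outage_seconds


def recovery_seconds(backlog, app_capacity, normal_app_load):
    """Time to drain the backlog once the cache absorbs polls again."""
    spare = app_capacity - normal_app_load
    if spare <= 0:
        raise ValueError("no spare capacity to drain the backlog")
    return backlog / spare


# 1000 req/s is from the post; 200, 30 and 50 are assumed values.
backlog = backlog_after_outage(arrival_rate=1000, app_capacity=200, outage_seconds=30)
drain_time = recovery_seconds(backlog, app_capacity=200, normal_app_load=50)
```

With these assumed numbers, a 30-second cache outage leaves 24,000 queued requests and takes about 160 seconds to drain, which is consistent with a recovery measured in minutes rather than seconds.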

Some of the recent outages involved the cache processes hitting a few different configuration-specified limits (related to connections and the virtualization layer). We also saw a similar issue with our load balancers, which route all of the traffic to the right places in the cluster.

In all cases, the problem was identified and resolved quickly, and Tracker was brought back to normal.

To reduce the likelihood of similar issues in the future, we’ve added more monitoring, and we’re making some changes to the environment, including additional layers of redundancy for the cache, and moving the cache processes from virtual hosts to dedicated bare metal machines. We’re also considering similar changes to other parts of the cluster, but taking it one step at a time to avoid introducing too many changes all at once.

We’re also considering moving away from the polling architecture, which requires a continuous high traffic rate, to a push approach, via HTML5 WebSockets. This would reduce the number of requests dramatically, but the HTML5 WebSockets protocol is still being finalized, and only some browsers support it natively (Chrome 4 and Safari 5 currently). One option that we’re thinking about is a hybrid approach – WebSockets push for browsers that support it, falling back to polling when push is not supported.
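The hybrid idea can be sketched as a simple per-client decision. This is only an illustration in Python, not a browser implementation: the names `Transport`, `choose_transport`, and `remaining_poll_rate` are hypothetical, and the share of push-capable browsers is an assumed input rather than a measured figure.

```python
from enum import Enum


class Transport(Enum):
    WEBSOCKET = "websocket"
    POLLING = "polling"


def choose_transport(client_supports_websocket: bool) -> Transport:
    """Use push when the browser supports it, otherwise keep polling."""
    if client_supports_websocket:
        return Transport.WEBSOCKET
    return Transport.POLLING


def remaining_poll_rate(current_poll_rate: float, push_fraction: float) -> float:
    """Polling traffic left if push_fraction of clients switch to push.

    Assumes every client polls at the same rate, which is a simplification.
    """
    if not 0.0 <= push_fraction <= 1.0:
        raise ValueError("push_fraction must be between 0 and 1")
    return current_poll_rate * (1.0 - push_fraction)


# 1000 req/s is from the post; the 25% push share is an assumption.
rate = remaining_poll_rate(1000, 0.25)
```

The reduction in polling traffic scales directly with the share of clients that can use push, so the benefit of a hybrid approach grows as browser support for WebSockets spreads.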

We apologize if you were inconvenienced by any of these brief outages – we certainly understand what it means to lose access to Tracker, even momentarily.

Comments
  1. James Fairbairn says:

    Another compatibility option (as used by http://pusherapp.com/) is falling back to a Flash sockets implementation of the WebSockets interface on the client side. See https://github.com/gimite/web-socket-js if you haven’t already. :-)

    Hope this helps….

  2. Dan Podsedly says:

    Thanks, James. We’ve looked at a few Flash-based fall-back options, including web-socket-js. We’ll probably want to avoid a Flash dependency, though, so the hybrid push/poll model might make more sense for us.

  3. Another option, if you haven’t noticed it, is Socket.IO (http://socket.io/). It will do the magic for you: use WebSockets if available, then use Flash if needed (which gives you real sockets, btw), and finally fall back to long-polling and other wonky techniques for the last few percent of users that don’t have proper sockets.

    It means you’ll need a node.js backend though, but that would just replace whatever code now polls memcached for cached data, and I’d guess it would perform equally or better.

  4. Otto Hilska says:

    At Flowdock we’ve implemented all this fall-back logic by ourselves, because we started a couple of years ago.

    As Jordi suggested, today it would probably make sense to use Socket.IO. It has server-side bindings for other platforms as well (like an experimental Rack/Ruby version), but node.js is great for building a simple PubSub server.

  5. Given that a Flash fall-back can work transparently, I’d say that’s the best option. Web Sockets and sockets via Flash really do provide better performance. If users aren’t using Flash, other XHR or polling solutions can simply be tried.

    Socket.IO is definitely a good option as Otto says, but will require a new setup/maintenance if you’re not already running node.js.

    At Beaconpush, we offer push messaging via Web Sockets like many others. But we also provide fallbacks not only to Flash but to normal XHR long-poll (even cross-domain, given that you are using a fairly modern browser).

    Even if you are looking at rolling your own solution, have a look at Socket.IO and our service. If nothing else, it might provide some inspiration. Also, if you’re interested or just want to talk push solutions, don’t hesitate to drop me an e-mail. Love to talk about this stuff and maybe help improve Pivotal Tracker (we’re avid users at ESN.me).


    Carl

  6. Dan Podsedly says:

    Thanks, everyone. We’re starting to look more closely at Socket.IO, and might ping some of you for specific feedback/suggestions. The big questions are how well it scales to, say, tens of thousands of connections, its general stability, and potential issues with proxy servers and firewalls.
