From the editor: At the beginning of 2014, Travis CI, a hosted continuous integration service, began delivering over 74,000 builds per day for customers, quite a milestone to be congratulated on! In this guest post, one of Travis CI’s core team engineers, Mathias Meyer, recalls how the service has evolved from a few hundred builds per day. He gives insight into the system’s architecture, key scaling challenges along the way, the use of RabbitMQ, and what the team has learned about scaling distributed systems:
The idea for Travis CI was simple. We wanted to offer a free continuous integration platform for open source projects, where they could run their tests to make sure any new patches are integrated successfully.
While the architecture was very simple at first, we wanted to offer one key feature in particular. When a build is running, we wanted you to be able to see the build log tailing live. Of course, this is easy when tests run on a basic continuous integration system on a single box. As soon as you run multiple tests concurrently on different machines, you’re faced with a challenge.
How can we reliably stream these messages, first to the systems processing them and then directly to the user’s browser?
Very early on, we set up RabbitMQ to be our mediator.
Beyond RabbitMQ, Travis CI was built and is still running on a Ruby stack. This includes a mix of both JRuby and C-based Ruby implementations running the site. Travis CI is very easy to integrate because of the variety of client libraries supporting the different Ruby platforms.
Travis CI’s Architecture
Travis CI started out with a few simple components.
One component was responsible for serving our web interface and the API, while also accepting commit notifications from GitHub to turn them into executable builds.
Another component was responsible for scheduling these builds to run, collecting their build results, and streaming the build logs to the browser.
While very simple at first, this setup turned into a scaling challenge for us.
Challenge and Solution #1: Scaling Build Logs
Initially, we only ran a few hundred builds per day. In 2012, this grew to several thousand builds per day. At the beginning of 2014, we ran 74,000 builds per day. This volume has posed some challenges for us, and the most interesting challenge is around build logs.
Build logs on Travis CI are split into small chunks of up to 10 kB each. These chunks are tailed from the live running builds and forwarded to RabbitMQ. When you only run a few hundred builds, you have only a few dozen messages per minute. When you run tens of thousands of builds, you have hundreds of messages per second. Admittedly, this isn’t the scale of real-time trading systems, where tens of thousands of messages are pushed per second. However, it is a challenge for a system that started off as a hobby project.
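As a rough illustration of the chunking described above, the sketch below splits raw log output into pieces of at most 10 kB. The function name `split_log` and the reading of 10 kB as 10 × 1024 bytes are assumptions made for this example, not details of the Travis CI implementation.

```python
def split_log(data: bytes, max_chunk_bytes: int = 10 * 1024) -> list[bytes]:
    """Split raw log output into chunks of at most max_chunk_bytes each.

    Hypothetical helper: the 10 * 1024 default is an assumed reading of "10 kB".
    """
    if max_chunk_bytes <= 0:
        raise ValueError("max_chunk_bytes must be positive")
    # Slice the buffer at fixed offsets; only the last chunk may be shorter.
    return [
        data[offset:offset + max_chunk_bytes]
        for offset in range(0, len(data), max_chunk_bytes)
    ]
```

Each chunk would then be published as one message, so the message rate grows with both the number of concurrent builds and the amount of output each build produces.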
Initially, pushing these messages relied heavily on the ordering that RabbitMQ (by way of AMQP) preserves in the stream of messages. We used to have only one consumer pulling messages of build log chunks off the queue, updating the log in our database, and forwarding each chunk to the browser for soft real-time build tailing. While this was quite handy, a single consumer was barely able to reliably keep up with dozens of messages per second, let alone a hundred.
At first, we were able to utilize a simple means of partitioning to scale up the processing. However, this was only a slight improvement. Even with that in place, the processing was inherently unscalable, and, when the process died, messages piled up quickly. When you push 100 messages per second, and the processor stops for 10 minutes, the system piles up 60,000 messages. In addition to scaling out the live processing, we also had to improve reliability and improve our recovery from failures.
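The backlog arithmetic above is simple enough to write down. The sketch below reproduces the 60,000 figure and, as an added assumption, estimates how long a backlog takes to clear when consumers are faster than producers. Both function names are hypothetical, and the 500 messages per second used in the usage note is the consumer throughput quoted later in this post.

```python
def backlog_after_outage(rate_per_second: float, outage_seconds: float) -> float:
    """Messages that pile up while nothing is consuming."""
    return rate_per_second * outage_seconds


def seconds_to_drain(backlog: float, consume_rate: float, produce_rate: float) -> float:
    """Time to clear a backlog while new messages keep arriving.

    Assumes constant rates; returns infinity if consumers cannot outpace producers.
    """
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return float("inf")
    return backlog / surplus
```

With 100 messages per second and a 10 minute outage, `backlog_after_outage(100, 600)` gives 60,000. Under the constant-rate assumption, consumers handling 500 messages per second while 100 per second keep arriving would need `seconds_to_drain(60000, 500, 100)`, or 150 seconds, to catch up.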
To reliably scale up the process, we had to stop relying on the ordering imposed by AMQP. So, we turned to ideas that are now more than 35 years old. Leslie Lamport’s paper, Time, Clocks, and the Ordering of Events in Distributed Systems, published in 1978, served as an inspiration for the changes we made. We stopped relying on AMQP for the ordering of events, and instead, we made the order a property of each event.
We used to push just the log chunks to RabbitMQ with the combination of a build job identifier and the message body. Now, we also include a clock. We could’ve used a timestamp, but we decided to simply use an incrementing counter to identify the ordering of the log chunks.
With this change, the consumer on our single queue didn’t have to care about the ordering imposed by AMQP. Instead of updating a single log, it stored every single chunk in the database, together with the clock. Every process accessing chunks was able to restore the ordering based on the clock, simply by sorting them by the incrementing numbers. This allowed us not only to scale out log processing but also to make it more reliable. When you only have one consumer, and it dies, there’s nothing else processing the logs, and queues back up quickly.
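A minimal sketch of this idea, assuming an in-memory store in place of the database: every chunk carries its job identifier and an incrementing counter, so arrival order no longer matters. The names `LogChunk`, `ChunkProducer`, and `ChunkStore` are hypothetical and are not taken from the Travis CI code base.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class LogChunk:
    job_id: int
    clock: int  # incrementing counter, the "clock" that defines ordering
    body: str


class ChunkProducer:
    """Attaches a per-job incrementing counter to every chunk it emits."""

    def __init__(self, job_id: int) -> None:
        self.job_id = job_id
        self._clock = count(1)

    def emit(self, body: str) -> LogChunk:
        return LogChunk(self.job_id, next(self._clock), body)


class ChunkStore:
    """Stand-in for the database table that holds individual chunks."""

    def __init__(self) -> None:
        self._chunks: dict[int, list[LogChunk]] = defaultdict(list)

    def save(self, chunk: LogChunk) -> None:
        # Chunks may be saved in any order, by any consumer.
        self._chunks[chunk.job_id].append(chunk)

    def assemble(self, job_id: int) -> str:
        # Ordering is restored from the clock, not from arrival order.
        ordered = sorted(self._chunks.get(job_id, []), key=lambda c: c.clock)
        return "".join(chunk.body for chunk in ordered)
```

If a producer emits three chunks and they are saved as third, first, second, `assemble` still returns the log in its original order, because sorting by `clock` does not depend on which consumer stored which chunk or when.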
Now, we have two processes consuming the log chunks. Together, they are able to process hundreds of messages per second, with just ten threads each. There is still room to tune them for higher throughput, but, even when we had logs piling up, just the two of them could plow through 500 messages per second. We can also add more in the future if we need to. For a small platform like ours, that’s a significant improvement and leaves us with the means to scale out for the foreseeable future.
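To show why several consumers can safely work on the same stream once each chunk carries its own clock, here is a small sketch with a pool of worker threads. It is an in-process stand-in only: a `queue.Queue` plays the role of the RabbitMQ queue, a list plays the role of the database, and the names `run_consumers` and `assemble` are hypothetical.

```python
import queue
import threading


def run_consumers(messages, worker_count=10):
    """Drain (job_id, clock, body) messages with worker_count threads."""
    inbox = queue.Queue()
    for message in messages:
        inbox.put(message)

    stored = []              # stand-in for the chunks table
    lock = threading.Lock()  # protects the shared list

    def worker():
        while True:
            try:
                job_id, clock, body = inbox.get_nowait()
            except queue.Empty:
                return  # nothing left to process
            with lock:
                # Rows land in whatever order the threads happen to run.
                stored.append((job_id, clock, body))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stored


def assemble(stored, job_id):
    """Rebuild one job's log by sorting its rows by clock."""
    rows = sorted((clock, body) for job, clock, body in stored if job == job_id)
    return "".join(body for _, body in rows)
```

The order in which rows reach `stored` depends on thread scheduling, yet `assemble` gives the same result on every run. That property is what makes adding more workers, or more processes, a safe way to increase throughput.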
Challenge and Solution #2: RabbitMQ as a Multi-Tenant System
Our scaling woes didn’t just affect our own processes.
Travis CI relies heavily on third-party infrastructure. This allows us to focus on shipping new features and platform improvements that make our users happy. Working with third-party infrastructure also has challenges. For example, we’ve been using a hosted RabbitMQ setup for more than two years now.
RabbitMQ has some unique properties for handling overly ambitious message producers in the system. When one or more processes on one virtual host produce more messages than the system can handle, RabbitMQ can block or limit other producers and consumers. Much to our frustration, this affected us a few times. For RabbitMQ, the function is a natural protection. A system like this must handle backpressure and make sure that the system isn’t flooded or potentially killed by too many messages passing through. However, this created a problem for our user experience. For the messages we produce and consume, we rely on processing messages almost immediately. If our processes are blocked, our customers aren’t able to see their build logs, and their builds aren’t scheduled.
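The effect of backpressure on a producer can be pictured with a bounded buffer. The sketch below is only an in-process analogy built on Python’s `queue.Queue`; it does not reproduce RabbitMQ’s actual flow control mechanism, and the function name `publish_with_backpressure` is hypothetical.

```python
import queue


def publish_with_backpressure(buffer, messages, timeout=0.01):
    """Try to publish each message into a bounded buffer.

    Returns (accepted, blocked). A message counts as blocked when the buffer
    stays full for the whole timeout, which is what a producer experiences
    when nothing drains the queue fast enough.
    """
    accepted, blocked = [], []
    for message in messages:
        try:
            buffer.put(message, timeout=timeout)
            accepted.append(message)
        except queue.Full:
            blocked.append(message)
    return accepted, blocked
```

With a buffer created as `queue.Queue(maxsize=3)` and no consumer draining it, only the first three of five messages are accepted. In a shared setup, the full buffer could just as well be caused by another tenant’s traffic, which is why a blocked producer can be a symptom of someone else’s load.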
We eventually realized that this was our fault. We hadn’t looked into these details soon enough to figure out ways of handling them. In addition, we also had to make sure we weren’t affected by other, possibly misbehaving tenants in the system. We had to make sure we always had the capacity we needed available without worrying about others. So, we had to make sure we were the only tenant in the system.
To solve the issue, we started using a dedicated RabbitMQ cluster, allocated just for us. We haven’t had any problems since. With the help of our friends at CloudAMQP, we deployed a dedicated cluster late in 2012. The cluster was up for almost 300 days before it had to be restarted to update the SSL certificates.
Lessons Learned and Results
Our take-aways? Don’t blame others for your own planning mistakes, and, when there’s infrastructure that’s in your application’s critical path, use a dedicated setup.
RabbitMQ has been a core part of our infrastructure almost since the very beginning of Travis CI. We’ve had our challenges with scale, but simplifying and reducing the scope for some parts of our infrastructure has helped. For example, reducing the implicit complexity of message ordering alone has helped us scale out our processing of messages considerably. With these architectural improvements in place, RabbitMQ helps us scale really well. Today, both Travis CI platforms, for open source and for private repositories, run on dedicated RabbitMQ clusters with high availability by default, courtesy of CloudAMQP. We also have custom monitoring in place to pull metrics out of RabbitMQ via its API. It’s been humming along nicely for more than a year!
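As one possible shape for such monitoring, the sketch below assumes the HTTP API of the RabbitMQ management plugin, whose `/api/queues` endpoint returns a JSON list of queues with fields such as `name` and `messages`. The post does not say which API or metrics are used, so the base URL, the credentials, the threshold, and both function names are assumptions for illustration.

```python
import base64
import json
import urllib.request


def fetch_queues(base_url, user, password):
    """Fetch queue statistics from the management plugin's HTTP API.

    base_url is assumed to look like "http://localhost:15672".
    """
    request = urllib.request.Request(f"{base_url}/api/queues")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def backed_up_queues(queues, threshold):
    """Return {queue name: message count} for queues above the threshold."""
    return {
        q["name"]: q.get("messages", 0)
        for q in queues
        if q.get("messages", 0) > threshold
    }
```

A monitoring job could call `fetch_queues` periodically and alert when `backed_up_queues` returns anything, which would catch a stalled log consumer before the backlog grows large.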
About the Author: Mathias Meyer is purveyor of fine bacon, coffee and infrastructure at Travis CI.