I just had a discussion with a co-Pivot about the resentment that many teams develop about Continuous Integration – especially when the release process requires a green tag from CI, and a broken build is standing in the way.
As anyone who has worked with me will attest, I’m hardcore on CI and consider any team which leaves a build red for longer than a workday to be sorely lacking in discipline.
OK, OK, there are always extenuating circumstances, but I still believe that most resentment of CI stems from underlying antipatterns and smells, rather than problems with CI itself. For example:
- “The Customer Has To See This Feature RIGHT NOW”: Frequent releases are a great thing, but if you cannot wait for a green build to deploy, you have some deeper problems. Often, this is because a team doesn’t manage customer expectations well. The customer should understand that CI is a critical part of the Agile process which ensures that only reliable, quality releases get pushed to staging or production. Any problem which is preventing a green build should be fixed before the release is deployed. If the customer is not willing to allow you that time and flexibility, perhaps they are too addicted to new features, and the entire team needs to have a heart-to-heart about Code Debt in the next retrospective.
- “It Works For Me, But Fails On CI: The important question is which environment is more like production – your development environment or CI? If you are developing on Windows or Mac, and your production box is some other flavor of Xnix, then your CI box should be as close a possible to production. Ideally, you should be able to log on to the CI instance and debug the failing test there. Usually, your CI box is not configured correctly. If it is hard to keep your CI environment in sync with production, then perhaps you should look into automation (because you KNOW you or your sysadmin will probably forget to do the same thing when you push to production, right?). If the problem is that your development environment is not the same as production, and it is a legitimate problem, then CI just saved you some stress on the next deploy.
- “Intermittent” Failures: Same deal as the prior point. CI runs your tests much more than you do. For web apps, it hopefully runs them in more browsers than you do. In my experience, many “intermittent” bugs are real bugs which are just very hard to isolate. It could be an AJAX bug that only happens when the site is run remotely, not via localhost. It could be a performance problem which only shows up on a slower system, not your fastest-on-the-market dev box. It could be a dependency on an external resource that happens to be unavailable sometimes, such as a web service, remote storage, etc. Again, just being aware of these issues puts you ahead of the game. For browser bugs, dig in and find out WHY it is failing intermittently. It may be a real bug. For intermittent outages of external resources, you may just have to live with it, but you don’t have to live with the intermittent failures in CI. Mock out the resource or disable the tests in the CI environment. Yes, this is OK, especially if you leave them enabled in your development environment. Another option is to automatically repeat these tests a few times with a delay, and only fail the entire build if they fail repeatedly. Big services like Amazon or Google might drop a request occasionally, but still respond to a subsequent request.
- Slow Test Suites: This is an insidious problem, because once your suite is slow, it is often a monumental effort to make it fast again. It is much better to be proactive, and monitor any slow-running tests like a hawk, relentlessly mocking out slow resources or replacing broad functional tests with faster, more targeted unit tests. You can also always split your tests into different suites, running your fastest tests continuously, and the entire slow deploy-test suite only nightly or periodically. As long as your customer isn’t addicted to immediate features, it should be fine to only deploy from nightly builds.
- The Failing Test That “Doesn’t Matter”: This is my pet peeve. Whenever I break CI, I fix it ASAP. If I ignore a “minor” broken test, the next thing I check in may be a major FUBAR which gets past my local tests for some reason (see prior points). Some who know me might even say it is LIKELY to be a major FUBAR. The point is, I don’t trust myself or my local box, I trust CI. Now, if ANOTHER developer breaks the build, and tries to tells me they are not going to fix it because it is a “minor” problem, that really chaps my hide. They are ripping huge holes in my nice safety net, forcing me to expend much more time and attention on the tests that I run on my local environment, and causing me more stress and work in general. Stop making excuses, and fix the damn build NOW, or comment out the failing test.
Now, I’m sure that all of the above points can be debated or shown to be inapplicable in a specific situation. Plus, if you are dealing with imperfect CI and development tools (which is always the case), you will have some degree of pain which is directly attributable to CI. It would be great to hear about some of these situations in the comments.
Bottom Line: Integration is always one of the most painful parts of software development. Doing integration with high quality and low risk is even harder. Most developers who have been on a non-Agile project of any significant size have experienced days-long integration hell and ulcer-inducing all-night production deployments. Continuous Integration doesn’t make that pain and stress go away, but it does break it down into small, bite-sized pieces that can be easily handled on a daily basis. All for the low, low cost of being proactive and disciplined, which makes you a better developer anyway.