Introducing Zuul for improved CI/CD
A quick history of how and why Zuul is replacing Jenkins in CI testing in the OpenStack community.
Authored by Jeremy Stanley, February 7, 2020
(This article originally ran on opensource.com and is reprinted here with permission of the author under the Creative Commons Attribution-Share Alike 4.0 International License.)
Jenkins is a marvelous piece of software. As an execution and automation engine, it's one of the best you're going to find. Jenkins serves as a key component in countless continuous integration (CI) systems, and this is a testament to the value of what its community has built over the years. But that's what it is: a component. Jenkins is not a CI system itself; it just runs things for you. It does that really well and has a variety of built-ins and a vibrant ecosystem of plugins to help you tell it what to run, when, and where.
CI is, at the most fundamental level, about integrating the work of multiple software development streams into a coherent whole with as much frequency and as little friction as possible. Jenkins, on its own, doesn't know about your source code or how to merge it together, nor does it know how to give constructive feedback to you and your colleagues. You can, of course, glue it together with other software that can perform these activities, and this is how many CI systems incorporate Jenkins.
It's what we did for OpenStack, too, at least at first.
If it's not tested, it's broken
In 2010, an open source community of projects called OpenStack was forming. Some of the developers brought in to assist with the collaboration infrastructure also worked on a free database project called Drizzle, and a key philosophy within that community was the idea "if it's not tested, it's broken." So OpenStack, on day one, required all proposed changes to its software to be reviewed and tested for regressions before they could be approved to merge into the trunk of any of its source code repositories. To do this, Hudson (which later forked to form the Jenkins project) was configured to run tests exercising every change.
A plugin was installed to interface with the Gerrit code review system, automatically triggering jobs when new changes were proposed and reporting back with review comments indicating whether they succeeded or failed. This may sound rudimentary by today's standards, but at the time, it was a revolutionary advancement for an open source collaboration. No developer on OpenStack was special in the eyes of CI, and everyone's changes had to pass this growing battery of tests before they could merge, a concept the project called "project gating."
There was, however, an emerging flaw with this gating idea: To guarantee two unrelated changes didn't alter a piece of software in functionally incompatible ways, they had to be tested one at a time in sequence before they could merge. OpenStack was complicated to install and test, even back then, and quickly grew in popularity. The rising volume of developer contributions coupled with increasing test coverage meant that, during busy periods, there was simply not enough time to test every change that passed review. Some longer-running jobs took nearly an hour to complete, so the upper bound for what could get through the gate was roughly two dozen changes in a day. The resulting merge backlog showed a new solution was required.
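
The "two dozen changes in a day" ceiling can be checked with a back-of-the-envelope calculation. The sketch below assumes a strictly serial gate in which each change occupies the gate for as long as its slowest job runs; the function name and the 60-minute figure are illustrative assumptions, not measurements from OpenStack's systems.

```python
def max_serial_merges_per_day(longest_job_minutes: float) -> int:
    """Upper bound on changes a strictly serial gate can merge in 24 hours.

    Assumes one change is tested at a time and that each change holds the
    gate for the duration of its longest-running job.
    """
    minutes_per_day = 24 * 60
    return int(minutes_per_day // longest_job_minutes)


# With jobs taking about an hour, at most 24 changes fit in a day.
print(max_serial_merges_per_day(60))  # 24
```

Any shorter job time raises the bound only linearly, which is why a serial gate could not keep pace with a growing contributor base.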
Enter Zuul
During an OpenStack CI meeting in May 2012, one of the CI team members, James Blair, announced that he'd "been working on speculative execution of Jenkins jobs." Speculative execution is an optimization most commonly found in the pipelines of modern microprocessors. By analogy with processor hardware, the theory was that by optimistically predicting positive gating results for recently approved changes that had not yet completed their tests, subsequently approved changes could be tested concurrently and then conditionally merged as long as their predecessors also passed tests and merged. James said he had a name for this intelligent scheduler: Zuul.
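
The idea can be illustrated with a small simulation. This is a simplified sketch rather than Zuul's actual scheduler: it assumes tests run in discrete rounds, and the names `speculative_gate` and `passes` are hypothetical. Every queued change is tested in the same round on the assumption that all changes ahead of it will merge; when one fails, it is dropped and only the changes behind it are retested.

```python
from typing import Callable, List, Sequence, Tuple


def speculative_gate(
    queue: Sequence[str],
    passes: Callable[[str, List[str]], bool],
) -> Tuple[List[str], int]:
    """Simulate speculative gating; return (merged changes, test rounds used).

    `passes(change, ahead)` is a hypothetical stand-in for running the test
    suite on `change` applied on top of every change in `ahead`.
    """
    merged: List[str] = []
    pending = list(queue)
    rounds = 0
    while pending:
        rounds += 1
        # Test all pending changes concurrently, each one optimistically
        # assuming that everything ahead of it in the queue will merge.
        results = [
            passes(change, merged + pending[:i])
            for i, change in enumerate(pending)
        ]
        if all(results):
            merged.extend(pending)
            break
        first_fail = results.index(False)
        # Changes ahead of the first failure were tested on assumptions that
        # held, so they merge. The failing change is rejected, and everything
        # behind it must be retested without it.
        merged.extend(pending[:first_fail])
        pending = pending[first_fail + 1:]
    return merged, rounds


# Four approved changes, where B breaks the tests: two rounds instead of four.
print(speculative_gate(["A", "B", "C", "D"], lambda change, ahead: change != "B"))
# (['A', 'C', 'D'], 2)
```

When every prediction holds, the whole queue merges after a single round; each failure costs one extra round for the changes behind it, which is still far cheaper than testing one change at a time.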
Around the same time, challenges from trying to perform better revision control for Jenkins' XML job configuration led to the creation of the human-readable YAML-based Jenkins Job Builder templating engine. Limited success with the JClouds plugin for Jenkins and cumbersome attempts to use jobs for refreshing cloud images of single-use Jenkins slaves ended with the creation of the Nodepool service. Limited log-storage capabilities resulted in the team adding separate external solutions for organizing, serving, and indexing job logs and assuming maintainership of an abandoned secure copy protocol (SCP) plugin replacing the less-secure FTP option that Jenkins provided out of the box. The OpenStack infrastructure team was slowly building a fleet of services and utilities around Jenkins but began to bump up against a performance limitation.
Multiplying Jenkins
By mid-2013, Nodepool was constantly recycling as many as 100 virtual machines registered with Jenkins as slaves, but this was no longer enough to keep up with the growing workload. Thread contention for global locks in Jenkins thwarted all attempts to push past this threshold, no matter how much processor power and memory were thrown at the master server. The project had offers to donate additional capacity for Jenkins slaves to help relieve the frequent job backlog, but this would require an additional Jenkins master. The efficient division of work between multiple masters needed a new channel of communication for dispatch and coordination of jobs. Zuul's maintainers identified the Gearman job server protocol as an ideal fit, so they outfitted Zuul with a new geard service and extended Jenkins with a custom Gearman client plugin.
Now that jobs were spread across a growing assembly of Jenkins masters, there was no longer any single dashboard with a complete view of job activity and results. In order to facilitate this new multi-master world, Zuul grew its own status API and WebUI, as well as a feature to emit metrics through the StatsD protocol. Over the next few years, Zuul steadily subsumed more of the CI features its users relied on, while Jenkins' place in the system waned accordingly, and it was becoming a liability. OpenStack made an early choice to standardize on the Python programming language; this was reflected in Zuul's development, yet Jenkins and its plugins were implemented in Java. Zuul's configuration was maintained in the same YAML serialization format that OpenStack used to template its own Jenkins jobs, while Jenkins kept everything in baroque XML. These differences complicated ongoing maintenance and led to an unnecessarily steep learning curve for new administrators from related communities that had started trying to run Zuuls.
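
For readers unfamiliar with StatsD, its wire format is a short plain-text line, `name:value|type`, usually sent as a UDP datagram. The sketch below shows that general format only; the metric name, the helper names, and the host and port defaults are illustrative assumptions and do not reflect the metrics Zuul actually emits.

```python
import socket


def statsd_line(name: str, value: float, metric_type: str) -> bytes:
    """Format one StatsD datagram: counter ("c"), gauge ("g"), or timer ("ms")."""
    if metric_type not in {"c", "g", "ms"}:
        raise ValueError(f"unsupported StatsD metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(line: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a datagram to a StatsD daemon (address is an assumption)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line, (host, port))


# A hypothetical counter increment for a successful build.
print(statsd_line("ci.example.builds.success", 1, "c"))
# b'ci.example.builds.success:1|c'
```

Because the sender never waits for a reply, emitting metrics this way adds almost no overhead to the scheduler, which suits a system replacing a dashboard that had been spread across several masters.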
The time was right for another revolution.
The rise of Ansible
In early 2016, Zuul's maintainers embarked on an ambitious year-long overhaul of their growing fleet of services with the goal of eliminating Jenkins from the overall system design. By this time, Jenkins was serving only as a conduit for running jobs consisting mostly of shell scripts on slave nodes over SSH, providing real-time streaming of job output and copying resulting artifacts to longer-term storage. Ansible was found to be a great fit for that first need; purpose-built to run commands remotely over SSH, it was written in Python, just like Zuul, and also used YAML to define its tasks. It even had built-in modules for features the team had previously implemented as bespoke Jenkins plugins. Ansible provided true multi-node support right out of the box, so the same playbooks could be used for both simulating and performing complex production deployments. An ever-expanding ecosystem of third-party modules filled in any gaps, in much the same way as the Jenkins community's plugins had before.
A new Zuul executor service filled the prior role of the Jenkins master: it acted on pending requests in the scheduler's geard, dispatched them via Ansible to ephemeral servers managed by Nodepool, then collected results and artifacts for publication. It also exposed in-progress build output over the classic RFC 742 Name/Finger protocol, streamed in real time from an extension of Ansible's command output module. Once it was no longer necessary to limit jobs to what Jenkins' parser could comprehend, Zuul was free to grow new features like distributed in-repository job definitions, shareable between projects with inheritance and secure handling of secrets, as well as the ability to test-drive proposed changes for the jobs themselves. Jenkins served its purpose admirably, but at least for Zuul, its usefulness was finally at an end.
Testing the future
Zuul's community likes to say that it "tests the future" through its novel application of speculative execution. Gone are the harrowing days of wondering whether the improvement you want to make to an existing job will render it non-functional once it's applied in production. Overloaded review teams for a massive central job repository are a thing of the past. Jobs are treated as a part of the software and shipped right alongside the rest of the source code, taking advantage of Zuul's other features like cross-repository dependencies so that your change to part of a job in one project can be exercised with a proposed job change in another project. Zuul will even comment on your job changes, highlighting specific lines with syntax problems as if it were another code reviewer giving you advice.
These were features Zuul could only dream of before, because they required freedom from Jenkins so that it could take job parsing into its own hands. This is the future of CI, and Zuul's users are living it.
In early 2019, the OpenStack Foundation recognized Zuul as an independent, openly governed project with its own identity and flourishing community. If you're into open source CI, consider taking a look. Development on the next evolution of Zuul is always underway, and you're welcome to help. Find out more on Zuul's website.