Into The Evergreen

Or: Early Advances In Rebuilding From Scratch

I started The Greenfield Guild to bring cloud native and open source best practices to a wider audience. One of the first cloud native practices that I’m focusing on is one I call evergreen: completely rebuilding and updating all of your architecture on a regular basis. It’s one I learned nearly a decade ago, at a startup I got involved with through the local Austin tech scene.

The startup had built an open source toolkit which automated big data deployments in the cloud, back when most people had never heard of big data or the cloud. With it we powered internet-scale data APIs, and later direct provisioning of cloud-based big data architecture for enterprise clients as the market and tools matured.

They did this because their founders had seen a gap in data tool adoption. Hadoop was well known in academic physics circles, but not much beyond that. Social analytics was a hot new space for a marketing industry hungry for user feedback, but everyone was developing their own solutions. Large enterprises weren’t well prepared for the complexity of managing data solutions in the cloud, especially as the cutting edge evolved. And at the time there weren’t good toolkits for deploying all of these technologies into the cloud in a standard way.

The solution they built was a cluster orchestration toolkit, written in Ruby and integrated tightly with Chef and AWS. It was driven by a domain-specific language which could express higher-level goals like “deploy Elasticsearch to five m3.large nodes”. It sat in a similar space to CloudFormation, but its deeper integration with Chef meant it provided a full-stack cloud deployment in a single toolkit. (This also vaguely parallels the declarative architectural approach that Kubernetes uses, although it was far less refined and used different abstractions.)
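To make that concrete, here is a minimal, purely illustrative sketch of how a declarative goal like that can be captured in a Ruby DSL. The ClusterSpec class and facet method below are my own hypothetical stand-ins, not the actual tool’s syntax.

    # Hypothetical sketch (not the actual tool): how a Ruby DSL can capture a
    # higher-level goal like "deploy Elasticsearch to five m3.large nodes"
    # as data that an orchestrator could then act on.
    class ClusterSpec
      attr_reader :name, :facets

      def initialize(name, &block)
        @name   = name
        @facets = []
        instance_eval(&block)   # evaluate the block in this spec's context
      end

      # Each facet declares one component of the cluster, plus how many
      # nodes of which instance type it should run on.
      def facet(component, instances:, instance_type:)
        @facets << { component: component,
                     instances: instances,
                     instance_type: instance_type }
      end
    end

    spec = ClusterSpec.new('analytics') do
      facet 'elasticsearch', instances: 5, instance_type: 'm3.large'
    end
    # spec.facets now describes the desired state; the real toolkit translated
    # declarations in this spirit into the Chef runs and AWS API calls needed
    # to realize them.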

When I joined the project, the core of this was already in place, written by physics majors who’d started it to capitalize on their Hadoop experience. One of my first contributions was open sourcing the tool, which provided good value to the young Chef community and brought us broader adoption and bugfixes in return.

This allowed us to accomplish things that were cutting edge for 2011. Most organizations go their whole lifetime without proving they can rebuild critical architecture if it fails. (Sometimes that lifetime ends in a spectacular demonstration that they cannot.) With AWS, new instances could be created quickly, which made building replacements for existing architecture similarly easy. Clustered data solutions replicated their data between instances, keeping it intact through the piecemeal replacement of individual nodes. Layers above that were even simpler, following patterns that resembled what would later be called microservices.

Within months of joining the organization, I stabilized and standardized their deployments, completely rebuilding every piece of their big data architecture without any major service disruptions. We continued this practice of regular maintenance rebuilds throughout my tenure. This not only proved the stability and robustness of the tooling, but produced a knock-on effect: the mean time to repair (MTTR) of almost any node problem now had an upper bound of the replacement time for that node. In the months that followed, I led the development of the tool through two major revisions, each further expanding the flexibility of our deployments.

In addition to our own big data infrastructure, we were able to reliably deploy clusters with a variety of tools — including data stores like CouchDB, Elasticsearch, Hadoop, Kafka, MongoDB, Redis, RabbitMQ, MySQL, and PostgreSQL — for ourselves and customers in hours, not days, weeks or months. This gave early big data adopters advantages in development velocity and ease of experimentation that were not easily available elsewhere, nor available for so many tools in a single platform. We also had a robust framework for “amenities”, allowing us to broadly automate integration with secondary tools like metrics and logging.

It wasn’t all roses, to be sure. The DSL depended on a library which did some things Ruby allows but which violated encapsulation and confused code paths, and I dreaded whenever debugging led me into that part of the codebase. Our tight integration with AWS, Chef, and Ubuntu presented another series of hurdles: attempts to migrate to CentOS and OpenStack met with only limited commercial success, and our eventual acquisition by a Puppet shop spelled the end of the tool. Also, there was no automated gating for data store rebuilds: the operator had to personally watch for replication to report all clear before moving on to replacing the next node.
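That missing gate is easy to sketch in hindsight. Below is a minimal illustration, assuming an Elasticsearch cluster and polling its standard _cluster/health endpoint; the node-replacement steps themselves are left as hypothetical placeholders, not anything from the original toolkit.

    # Minimal sketch of the automated gate we lacked: poll Elasticsearch's
    # _cluster/health endpoint and only proceed once the cluster reports
    # green, i.e. all shards are fully replicated again.
    require 'net/http'
    require 'json'

    def wait_for_green(host, timeout: 1800, interval: 15)
      deadline = Time.now + timeout
      uri = URI("http://#{host}:9200/_cluster/health")
      loop do
        status = JSON.parse(Net::HTTP.get(uri))['status'] rescue nil
        return true if status == 'green'
        raise 'cluster did not recover in time' if Time.now > deadline
        sleep interval
      end
    end

    # 'nodes_to_replace' and 'replace_node!' are hypothetical placeholders
    # for the real orchestration steps (terminate, reprovision, rejoin):
    #
    # nodes_to_replace.each do |node|
    #   replace_node!(node)
    #   wait_for_green('localhost')
    # end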

Despite those drawbacks, I built valuable experience that has informed my career and work ever since. After working with such highly automated deployments, the toil of manual installation and upgrades is no longer a thing to endure, but an enemy to subdue and eliminate. The library and tight-integration problems taught me to be picky about which dependencies my future projects accept. The eventual end of support for a tool I’d invested so much in was understandable but disappointing, and taught me to be discerning about the longer-term viability of open source projects. The regular rebuilds of the entire infrastructure easily kept our architecture clean and up to date, a practice I would later start calling evergreen development.

Stay tuned for more on evergreen development.