TL;DR: Our most recent evaluation of Kubernetes underscored its suitability for scaling Coinbase in the long run. Previously, a migration to Kubernetes raised concerns because of the operational burden of running and securing the control plane in-house. We’ve now concluded that managed Kubernetes offerings reduce this operational burden without compromising the security of our stack.
By Clare Curtis, Coinbase Staff Software Engineer
Almost two years ago we released a blog post detailing why Kubernetes is not part of our technical stack. At the time, migrating to Kubernetes would have created a whole new set of problems that outweighed any near-term benefits. However, as these technologies have matured, our newly formed Compute Team devised a strategy for leveraging Kubernetes in a way that would deliver a more flexible and scalable version of our current system.
Coinbase has grown considerably since we first considered migrating to Kubernetes. With growth of this kind, it is important to prioritize scalability concerns. As we continue to scale, one of the main areas in need of future-proofing is Coinbase’s compute platform. In mid-2020, our largest service was configured to run on a relatively small number of hosts, whereas today it runs on 10x that number.
In this same period, we quadrupled the size of our engineering team, causing a substantial increase in the number of deployments, each needing completely new hosts. The rise in the number of deployments has raised concerns over future scalability, as we’re already running into technical limitations of existing APIs and resources. Recurring issues with getting enough capacity, and having it delivered in a reasonable timeframe, caused an increase in failed deployments and required our largest services to dramatically slow down their release process.
While these issues are solvable, we decided to take this opportunity to evaluate whether it made sense to continue investing in a homegrown system or to consider an open source alternative that would be much more scalable in the long term.
In our evaluation of Kubernetes, we found that one of the biggest advantages of a migration is that it decouples host provisioning from service deployment, moving the burden of managing host acquisition from individual teams to the broader Infrastructure team. This empowers the Infrastructure team to take a holistic approach to host management. In addition, capacity constraints are less likely to affect deployments, and we reduce the amount of cloud-provider-specific knowledge that individual engineers need to maintain.
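To make the decoupling concrete, here is a minimal toy sketch in Python. It is not Coinbase's tooling and not the actual Kubernetes scheduler: the `Node` class, the `place` and `deploy` functions, the first-fit rule, and the CPU numbers are all hypothetical, chosen only to show a deployment landing on hosts that already exist in a shared pool instead of requesting new ones.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A host in a shared pool managed by an infrastructure team (toy model)."""
    name: str
    cpu_capacity: int
    cpu_used: int = 0
    pods: list = field(default_factory=list)

    def free_cpu(self) -> int:
        return self.cpu_capacity - self.cpu_used


def place(pool, pod_name, cpu_request):
    """First-fit placement: use the first node with enough free CPU.

    Returns the chosen node name, or None if no node in the pool has room.
    """
    for node in pool:
        if node.free_cpu() >= cpu_request:
            node.cpu_used += cpu_request
            node.pods.append(pod_name)
            return node.name
    return None


def deploy(pool, service, replicas, cpu_request):
    """Deploy a service onto existing hosts; no new hosts are acquired here."""
    return [place(pool, f"{service}-{i}", cpu_request) for i in range(replicas)]


# The pool is provisioned ahead of time, independently of any deployment.
pool = [Node("node-a", cpu_capacity=8), Node("node-b", cpu_capacity=8)]
placements = deploy(pool, "api", replicas=3, cpu_request=3)
print(placements)  # ['node-a', 'node-a', 'node-b']
```

In this model, growing the pool is a separate action from deploying a service, which is the property the paragraph above describes. A real scheduler weighs many more signals than free CPU.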
The Kubernetes community has created a wealth of knowledge and tooling that we can use to provide better support to teams and quickly enable new features. Additionally, because Kubernetes is extensible, there’s still the option to build tooling internally and open source it for use across the wider community.
Security is incredibly important at Coinbase, and securing Kubernetes clusters is a non-trivial endeavor. Transitioning from highly isolated, single-tenant compute to a system that promotes multi-tenancy requires deliberate security design and consideration. Because we have high-security workloads for which we must guarantee isolation, we must run separate clusters and build automated tooling that handles all cluster operations. Giving humans access to operate high-security infrastructure isn’t allowed.
Managed Kubernetes offerings, such as AWS EKS, take on the responsibility of operating, maintaining, and securing the control plane, reducing the operational burden of running many clusters. Reducing our operational burden and security responsibility allows us to focus on building the orchestration and automation required to support many clusters across a large engineering team. EKS has matured significantly over the past few years and has shown that it provides stable, operational Kubernetes while also integrating with features that are commonly used in EC2, such as the ability to attach security groups to pods and IAM roles to service accounts. These integrations reduce the risk and cost associated with migration, as they allow us to migrate without changing the identity or access patterns of our existing platform.
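As an illustration of the IAM-roles-to-service-accounts integration mentioned above, the following Python sketch builds a Kubernetes ServiceAccount manifest carrying the `eks.amazonaws.com/role-arn` annotation that EKS uses for this feature. The service name, namespace, account ID, and role name are placeholders invented for the example, and the helper function is hypothetical; this is not a description of Coinbase's actual configuration.

```python
import json


def service_account_manifest(name, namespace, role_arn):
    """Build a ServiceAccount manifest annotated with an IAM role ARN.

    The annotation key eks.amazonaws.com/role-arn is the one EKS reads for
    IAM roles for service accounts; every other value here is a placeholder.
    """
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


manifest = service_account_manifest(
    name="payments-api",  # hypothetical service name
    namespace="payments",  # hypothetical namespace
    role_arn="arn:aws:iam::111122223333:role/payments-api",  # placeholder ARN
)

# JSON is accepted wherever a YAML manifest is, so this output can be applied as-is.
print(json.dumps(manifest, indent=2))
```

Pods that run under such a service account can assume the referenced role, provided the cluster has an IAM OIDC provider configured and the role's trust policy permits it. Because identity stays expressed as an IAM role, the access patterns built around EC2 can carry over largely unchanged.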
While the migration to Kubernetes raised concerns in the past, we’ve now concluded that managed Kubernetes offerings, such as AWS EKS, can reduce the operational burden without compromising security. Ultimately, we realized there’s a clear ceiling to the ability of our homegrown system to scale, and while there’s a large setup and migration cost associated with a move to Kubernetes, we’re confident that it will be more flexible and scalable than our existing system.