GoDaddy and Amazon EKS

Imagine nearly 200 engineering teams, many of which are looking for a way to run container workloads in order to reduce operational complexity, manage orchestration, and scale horizontally on the fly. What happens when they don’t have a common solution at hand? By nature, engineers will seek out a solution, evaluate it, and then begin solving problems. There are a number of viable, useful, and technically sound container runtime solutions out there. If these teams are operating independently, not every team will choose the same one. The downside is that each of these solutions can be complex to operate and each has its own best practices. As teams grow, shrink, and shift, it becomes incrementally more difficult and expensive to keep operating many different solutions over time, and harder to combine efforts from various teams on particular projects or to fold projects together under a common team. The risks of siloing knowledge of best practices and accumulating technical debt around bespoke, single-case solutions are high.

A predictable and potentially viable way to solve this would be to attempt to break out our operation of Kubernetes, the engine we chose, into a platform team and offer it to other teams “as a service.” This solves the problem of the diversity of solutions in play and focuses the technical expertise of operating a complex environment on one team. However, many problems remain.

Running an “as a service” container solution requires a solid definition of the platform’s responsibility versus the product teams’ responsibility. In an environment such as Kubernetes, it can be challenging to debug and optimize an application. This is because the problem space is split between the Kubernetes control plane (the master nodes and their core services, such as kube-controller-manager) and the worker nodes where the application runs. So, for example, if a particular high-volume service is experiencing slow load times or intermittent interruptions, the solution might be to tune kube-controller-manager’s qps and burst settings or kube-apiserver’s max-requests-inflight limit. But the problem could also be in the application’s configuration or its pod runtime settings. It can be very difficult to divide up the responsibilities for the cluster between a platform team and the various application teams who are using it, and even harder to do so in a way that scales across an organization or across the company.
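
To make the split concrete, here is a hypothetical sketch of the tuning knobs mentioned above. The flag names are real Kubernetes component flags, but the values are illustrative only and would be added to the components’ existing invocations or manifests:

# Illustrative values only; appropriate settings depend on cluster size and workload.

# kube-controller-manager: client-side rate limits toward the API server
--kube-api-qps=50
--kube-api-burst=100

# kube-apiserver: ceiling on concurrent non-mutating requests in flight
--max-requests-inflight=800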

Operating such a service also requires an integrated authentication and authorization mechanism so that access to resources is appropriately gated. It requires an implementation of some kind of charge-back model so that the cost of the resources is shared among the teams using the solution. It requires the members of that platform team to be able to handle the issues generated by potentially hundreds of teams of engineers, to keep on top of the needs of those teams, to keep the compute infrastructure supporting those teams appropriately scaled, to make capital expenditures when the resource pool is deemed too thin, and more. A fully funded platform effort could address all of these issues. But what if operating “as a service” infrastructures that span from the resource and virtualization layer through the container runtime layer and all the way up to end-user applications and services isn’t the main distinguishing feature of your business? What if you need your hundreds of teams of engineers to be focused on the tooling and customer-facing products that differentiate you from your competitors? In short, if your company’s primary mission isn’t running these on-premises “as a service” solutions, it’s difficult to justify the significant person-time and investment needed to keep them running successfully for a large engineering organization.

Enter Amazon Web Services (AWS) and Amazon Elastic Container Service for Kubernetes (EKS). As a foundation, AWS offers us the tools needed to track our expenditures, team by team and organization by organization, without having to implement our own model for managing expense. They provide us a clearly articulated Shared Responsibility Model that offloads many layers of operational responsibility to their own scaled-out support teams across all regions and services. In addition, they let us scale out our resources as needed, under an operational expense model rather than a capital expenditure model, which simplifies our process and lessens our risk of spending money where we don’t need to.

Kubernetes is a flexible, sophisticated tool for running workloads in a common way, regardless of whether they are running in our on-premises infrastructure or on AWS. This gives us a simplified operational model for software deployment, runtime management, and monitoring, and at the same time simplifies migration from our existing infrastructure to AWS. EKS in particular offers an enormous benefit to GoDaddy. Our engineers use Kubernetes in incredibly diverse ways. Broadly, our usage divides into four primary use cases:

  1. Batch: event- or schedule-driven, finite-duration jobs. An example is a security scan, which usually involves standard container definitions and minimal operational complexity
  2. Small services: one-to-ten pod deployments with one container each, normally a traditional LAMP-style web service
  3. Big services: hundreds of pods with multiple containers, usually representing an entire microservices architecture tied back to a GoDaddy product line, such as a commerce system
  4. Massive, end-user containerized services: thousands to hundreds of thousands of pods per cluster, on clusters with thousands of nodes and multiple containers per pod, in a sophisticated end-user architecture

As an example of the last, Managed WordPress 2.0 (MWP2) was developed on Kubernetes to produce managed, highly available WordPress sites that offer our users fantastic performance, scalability, and flexibility. We solved problems with traditional shared-hosting implementations of WordPress by using containers and taking advantage of the overlay filesystem, giving site owners the flexibility to be on the versions of PHP and WordPress they want to be on. We used Kubernetes’ horizontal scaling capabilities and made good use of caching to keep the performance of the WordPress system as high as possible.

But this is only one of many ways we are using this powerful container runtime and orchestration platform. We are also running sophisticated CI/CD pipelines, automating security scans, deploying and managing both internally and externally facing proxies, and operating and scaling core customer services this way. Our Presence and Commerce systems, aftermarket DNS sales, and core data application infrastructure are tested, built, deployed, and scaled using Kubernetes.

Our four primary use cases for container workloads place very different configuration, management, and scaling requirements on our Kubernetes clusters. EKS takes the operational complexity out of managing these clusters. First, it manages the scaling and coordination of the control plane’s core infrastructure, eliminating the need for GoDaddy to administer Kubernetes master nodes. Second, because of the deep integration of EKS with other AWS services, GoDaddy can reap major benefits from Elastic Load Balancers, auto-scaling, AWS CloudTrail logging, AWS CloudWatch monitoring, and event-driven programming using AWS Lambda. Third, we can manage the Kubernetes worker infrastructure according to GoDaddy’s best practices through AWS CloudFormation, which enables GoDaddy to define and deploy infrastructure as code, used in conjunction with AWS Service Catalog to provide governance and best practices.

Getting started with EKS is simple. Before you begin, you should have kubectl installed, along with the Heptio Authenticator, which allows IAM authentication to your Kubernetes cluster. In addition, you should have the AWS CLI installed and set up with credentials to access your account. Full instructions for setting up these prerequisites can be found in the Getting Started guide.
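
As a quick, optional sanity check of those prerequisites from a terminal (the authenticator binary name below assumes the heptio-authenticator-aws build referenced in the guide of this era):

# Confirm the tools are installed and that AWS credentials are configured
kubectl version --client
aws --version
aws sts get-caller-identity
heptio-authenticator-aws help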

There are two basic tasks involved in getting your AWS Kubernetes cluster up and running:

  1. Get your cluster up: this is the control plane, including the Kubernetes masters and services, and
  2. Get your worker nodes up: this includes the EC2 instances you’ll be running your pods on

To get your cluster going, you need nothing more than a VPC, an EKS role to associate with the cluster, subnets selected from the VPC for your workers, and security groups for them to run under. Using the AWS CLI:

aws eks create-cluster \
  --name <name of your cluster> \
  --role-arn <arn of the EKS role you created> \
  --resources-vpc-config subnetIds=<subnet id 1>,<subnet id 2>,securityGroupIds=<security group id 1>,<security group id 2>

Another way to get these networking pieces in place is to use the AWS canonical CloudFormation sample template and create a stack with it:

aws cloudformation create-stack \
  --stack-name <stack name> \
  --template-url https://amazon-eks.s3-us-west-2.amazonaws.com/1.10.3/2018-06-05/amazon-eks-vpc-sample.yaml

This option creates a new VPC with new subnets and security groups, which you can then point the create-cluster command above at. Note that the pricing of the EKS control plane is $0.20/hour, so if you’re experimenting on a budget, be sure to keep track of how long your cluster is running.
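
Either way, the control plane takes several minutes to come up. As a quick check before moving on (and a reminder for when you’re done experimenting), you can poll the cluster’s status and delete the cluster when you no longer need it:

# Wait for this to report "ACTIVE" before configuring kubectl or adding workers
aws eks describe-cluster --name <name of your cluster> --query cluster.status

# When you are finished experimenting, stop the hourly control plane charge
aws eks delete-cluster --name <name of your cluster>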

The second set of things you’ll need for your cluster is the worker nodes. Here, the current recommendation is to use another canonical AWS-developed CloudFormation stack definition to get going quickly. (One note on the command below: when a parameter value itself contains commas, such as a list of subnets or security groups, the AWS CLI’s shorthand --parameters syntax can misparse it; escaping the embedded commas or passing the parameters as a JSON file avoids this.)

aws cloudformation create-stack \
  --stack-name <stack name> \
  --template-url https://amazon-eks.s3-us-west-2.amazonaws.com/1.10.3/2018-06-05/amazon-eks-nodegroup.yaml \
  --parameters \
    ParameterKey=ClusterName,ParameterValue=<the name you gave your cluster> \
    ParameterKey=ClusterControlPlaneSecurityGroup,ParameterValue=<security group id 1>,<security group id 2> \
    ParameterKey=NodeGroupName,ParameterValue=<a name for your node group and the autoscaling group it is a part of> \
    ParameterKey=NodeAutoScalingGroupMinSize,ParameterValue=<autoscaling group minimum number of nodes> \
    ParameterKey=NodeAutoScalingGroupMaxSize,ParameterValue=<autoscaling group maximum number of nodes> \
    ParameterKey=NodeInstanceType,ParameterValue=<the instance type for your EC2 instances in the group> \
    ParameterKey=VpcId,ParameterValue=<the VPC id of your EKS cluster> \
    ParameterKey=Subnets,ParameterValue=<subnet id 1>,<subnet id 2> \
    ParameterKey=KeyName,ParameterValue=<the EC2 keypair for access via SSH to the EC2 nodes in your group> \
    ParameterKey=NodeImageId,ParameterValue=<the AMI ID for your EKS-optimized worker nodes>

The AMIs, which are based on Amazon Linux 2, are listed by region and can be found in the documentation for launching workers. After your cluster and workers are created and launched, you must ensure your workers can join the cluster by applying a ConfigMap that will configure the cluster appropriately. You can find complete instructions in the Getting Started guide.
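
The Getting Started guide is the authoritative reference for that step, but as a rough sketch: the ConfigMap in question is the aws-auth ConfigMap, which maps the worker nodes’ IAM instance role (exposed by the node group stack as the NodeInstanceRole output) into the cluster’s node groups so the workers can register. Assuming kubectl is already configured to talk to your new cluster, the flow looks roughly like this:

# Wait for the node group stack, then read the instance role ARN it created
aws cloudformation wait stack-create-complete --stack-name <stack name>
aws cloudformation describe-stacks --stack-name <stack name> \
  --query "Stacks[0].Outputs[?OutputKey=='NodeInstanceRole'].OutputValue" --output text

# Apply the aws-auth ConfigMap (structure follows the Getting Started guide),
# substituting the role ARN retrieved above
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: <NodeInstanceRole ARN>
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
EOF

# The worker nodes should appear and become Ready within a few minutes
kubectl get nodes --watch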

That’s all there is to it. Within a few minutes, you can have a Kubernetes cluster integrated into your AWS account and ready for your workloads.

Our engineering teams embrace the DevOps model, where they own the process of developing, operating and monitoring the infrastructure for their products in the same way that they develop their customer-facing applications. Using AWS Service Catalog as a product portfolio manager, GoDaddy can define standard settings for products with AWS CloudFormation definitions, enabling teams to iterate according to their own Service definitions within the boundaries that we determine to be both secure and performant. This approach lets GoDaddy apply centralized governance and help teams across the company, while still maintaining a path for radiating best practices and new knowledge out to the company.

Because GoDaddy hosts millions of domain names, websites and web services, we need an environment that scales to our needs while maintaining operational efficiency and minimizing complexity. EKS offers us an industry standard container runtime and orchestration engine that enables a clear path for migrating workloads from our on-premises infrastructure to AWS. It helps us simplify how we engineer and lets us focus on our ability to offer differentiated and delightful experiences for our customers, who look to GoDaddy to provide them the platform for creating and managing their independent ventures.

