In early 2018, GoDaddy began a cloud migration journey to move the company from on-prem data centers to the cloud, using AWS as a service provider. The transition to AWS was an important one for the company. It enabled GoDaddy teams to:
The journey started from scratch, beginning with a complete migration and enablement strategy and ending with its delivery. Today, the Cloud Platform organization at GoDaddy manages over 2,000 AWS accounts across multiple AWS Organizations for more than 300 projects across the company.
GoDaddy uses AWS Organizations to manage its multi-account setup. This provides benefits such as streamlined access controls, aggregated billing, simplified access to cross-account resources, and isolated workload environments that limit the blast radius of any service event.
While the number of teams and products migrated into AWS grew steadily, there was an equally increasing demand for improved observability into the footprint of GoDaddy’s cloud deployments. In the event of an AWS service event or outage, how would GoDaddy products be affected? How could GoDaddy create an observability platform that enables multiple stakeholders in the company to quickly identify a point of contact within impacted teams? How could the operations center quickly scope potential impact to its customers and services by filtering criteria like region and AWS services in use?
Over the past year, GoDaddy built and released the Global Tech Registry (GTR), an internal observability platform to provide insights into their AWS Cloud deployments. The objectives of the GTR are to:
The GTR is a metadata service that receives events about AWS accounts, infrastructure resources, and service endpoint information from different sources. The metadata is processed and made accessible through an API and dashboards in Kibana. Stitching metadata together enables faster response times and enhanced insights into service events as they take place, and the potential impact to its products. How does it achieve this? Let’s take a closer look at the different pieces of metadata collected by the GTR platform.
The Public Cloud Portal, a developer gateway to a number of engineering services and capabilities at GoDaddy, was created out of the need for a common process to onboard teams and issue AWS accounts. It enables an automated process that vends a standardized GoDaddy Trusted Landing Zone (TLZ) AWS account for a team, with various security and architecture best practices built in. As part of that automated account generation flow, GTR integrates with the underlying APIs by subscribing to Amazon Simple Notification Service (SNS) topics and receiving an SNS message with account metadata whenever an AWS account is created or updated. An AWS Lambda function then processes the SNS messages to collect core GoDaddy team and AWS account metadata. The account metadata collected in this phase are AWS account IDs, account environment (e.g., Dev, Prod), account regions, VPCs, and network configurations. The team metadata collected in this phase are team name, budget ID, budget owner, and team contact information like on-call group and email distribution list.
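The SNS-to-Lambda step described above could look roughly like the following. This is a minimal sketch, not GTR's actual code: the `Records[].Sns.Message` envelope is the standard shape Lambda receives from SNS, but the payload layout (`account`, `team`, and the keys under them) is a hypothetical stand-in, since the post does not publish the real message schema.

```python
import json


def extract_account_records(event):
    """Turn an SNS-triggered Lambda event into a list of flat account records."""
    records = []
    for record in event.get("Records", []):
        # Lambda delivers each SNS notification under record["Sns"]; the
        # "Message" field is a string, here assumed to contain JSON.
        message = json.loads(record["Sns"]["Message"])
        account = message.get("account", {})  # hypothetical payload shape
        team = message.get("team", {})        # hypothetical payload shape
        records.append({
            "account_id": account.get("id"),
            "environment": account.get("environment"),
            "regions": account.get("regions", []),
            "team_name": team.get("name"),
            "budget_id": team.get("budget_id"),
            "on_call_group": team.get("on_call_group"),
            "email_list": team.get("email_list"),
        })
    return records


def handler(event, context):
    """Lambda entry point: parse the batch and report how many records it held."""
    records = extract_account_records(event)
    # A real function would persist the records here; storage is out of scope.
    return {"processed": len(records)}
```

Keeping the parsing in a plain function, separate from the handler, makes it easy to test against sample messages without deploying anything.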
AWS Config collects snapshots of all AWS resources deployed in GoDaddy’s TLZs across all AWS Organizations. Each time an AWS resource is created, updated, or deleted inside a GoDaddy TLZ account, AWS Config delivers the updated configuration state through a Config delivery channel to an SNS topic inside the account, to which GTR subscribes. GTR then processes these Config event notifications in real time using an event-based architecture. In addition, AWS Config creates a snapshot of all resources in the account for each region on a regular cadence, and the snapshot is replicated to an S3 bucket in a central logging account for ingestion.
The AWS Config events and snapshots are sent to an Amazon Simple Queue Service (SQS) queue. From there, AWS Lambda functions pick up each event, process the individual resources, and store them in a DynamoDB database that can be queried and used for reporting dashboards and API access.
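As an illustration of the queue-to-database step, the sketch below parses AWS Config change notifications from an SQS batch and maps each one to a flat record. The `messageType` and `configurationItem` fields follow the notification format AWS Config publishes; the `pk`/`sk` key layout and the injected `put_item` callable are assumptions made for this example, not GTR's schema, and snapshot processing is left out.

```python
import json


def parse_config_notification(body):
    """Return the configurationItem dict from an SQS message body, or None."""
    payload = json.loads(body)
    # When SNS fans out to SQS without raw message delivery, the Config
    # notification is a JSON string nested under the envelope's "Message" key.
    if payload.get("Type") == "Notification" and "Message" in payload:
        payload = json.loads(payload["Message"])
    if payload.get("messageType") != "ConfigurationItemChangeNotification":
        return None  # snapshot-delivery and other notices are skipped here
    return payload.get("configurationItem")


def to_item(ci):
    """Map a configuration item to a flat record (hypothetical key schema)."""
    return {
        "pk": f'{ci["awsAccountId"]}#{ci["awsRegion"]}',
        "sk": f'{ci["resourceType"]}#{ci["resourceId"]}',
        "resource_type": ci["resourceType"],
        "status": ci.get("configurationItemStatus"),
        "captured_at": ci.get("configurationItemCaptureTime"),
    }


def handler(event, context, put_item=print):
    """Process one SQS batch; put_item is injected so storage stays swappable."""
    stored = 0
    for record in event.get("Records", []):
        ci = parse_config_notification(record["body"])
        if ci is None:
            continue
        put_item(to_item(ci))
        stored += 1
    return {"stored": stored}
```

In a deployment, `put_item` could wrap a boto3 DynamoDB table, for example `lambda item: table.put_item(Item=item)`. Injecting it keeps the parsing logic testable without AWS credentials.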
Some AWS resources need to be processed outside of the AWS Config pipeline because AWS Config doesn’t support them yet. For example, some Amazon Route 53 resource types are discovered directly from the AWS accounts using Lambda functions triggered by Amazon EventBridge (formerly CloudWatch Events) and the Python boto3 library.
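A discovery function of this kind might resemble the sketch below, which lists Route 53 hosted zones with boto3's `list_hosted_zones` paginator. Which Route 53 resource types GTR actually collects is not stated in the post, so hosted zones are an assumed example. The handler also runs with the function's own credentials; any cross-account role assumption is omitted.

```python
def list_hosted_zones(route53_client):
    """Collect hosted-zone metadata from a Route 53 client, following pagination."""
    zones = []
    paginator = route53_client.get_paginator("list_hosted_zones")
    for page in paginator.paginate():
        for zone in page.get("HostedZones", []):
            zones.append({
                "zone_id": zone["Id"].split("/")[-1],  # strip "/hostedzone/" prefix
                "name": zone["Name"],
                "private": zone.get("Config", {}).get("PrivateZone", False),
                "record_count": zone.get("ResourceRecordSetCount"),
            })
    return zones


def handler(event, context):
    """Scheduled Lambda entry point (for example, an EventBridge rule)."""
    import boto3  # imported lazily so the helper above can be tested without AWS
    return {"zones": list_hosted_zones(boto3.client("route53"))}
```

Passing the client in as an argument lets the same helper run against a stub in tests and against a real client in Lambda.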
By invoking the AWS Health API, AWS customers can retrieve relevant Service Health Events that impact AWS Services in specific regions, like the AWS global service event in December that impacted multiple services in the us-east-1 region.
GTR regularly polls the AWS Health API across all of GoDaddy’s accounts using AWS Organizations. This allows GTR to receive and process all Health Events relevant to GoDaddy. When GTR receives a Health Event for an issue with an AWS service (e.g., EC2) in a specific region (e.g., us-east-1), it quickly identifies the GoDaddy services, resources, accounts, and teams impacted by the Health Event. It summarizes the Health Event for the Global Operations Center (GOC) and can send an automated message to the relevant contact points. This gives the GOC an improved interface during an AWS incident because it provides a full picture of the affected services, accounts, and team contact points.
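The matching step can be pictured as a join between a Health event and the inventory built earlier. The sketch below assumes the event has already been fetched (for instance with boto3's `describe_events_for_organization`) and carries `service` and `region` fields. The inventory and account record shapes are hypothetical. Comparing the Health service code with the middle segment of the Config resource type is a deliberate simplification that will not hold for every service.

```python
def config_service_code(resource_type):
    """Extract the service segment of a Config type such as AWS::EC2::Instance."""
    parts = resource_type.split("::")
    return parts[1].upper() if len(parts) == 3 else ""


def summarize_impact(health_event, resources, accounts):
    """Join one Health event against inventory and account/team metadata."""
    service = health_event["service"].upper()
    region = health_event["region"]
    impacted = [
        r for r in resources
        if r["region"] == region
        and config_service_code(r["resource_type"]) == service
    ]
    # Group impacted resource IDs by the account that owns them.
    by_account = {}
    for r in impacted:
        by_account.setdefault(r["account_id"], []).append(r["resource_id"])
    return {
        "event_arn": health_event.get("arn"),
        "service": service,
        "region": region,
        "accounts": [
            {
                "account_id": account_id,
                "team": accounts.get(account_id, {}).get("team_name"),
                "on_call_group": accounts.get(account_id, {}).get("on_call_group"),
                "resource_ids": sorted(ids),
            }
            for account_id, ids in sorted(by_account.items())
        ],
    }
```

The returned summary lists each affected account with its team and on-call group, which is the information an operations center would need to contact the right people.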
The diagram below shows a high-level overview of the GTR architecture.
Internal GoDaddy teams can access the registry’s data through the GTR APIs or through Kibana dashboards that help quickly visualize various data points. Access to the API is managed per endpoint. Kibana Spaces provides granular access to the relevant Kibana dashboards, visualizations, and Elastic indices, narrowing the user experience to the datasets relevant to each persona. All access is handled programmatically as part of a workflow in GitHub and deployed through standard CI/CD pipelines.
So far, GoDaddy has delivered several benefits to multiple personas using the GTR:
The GoDaddy GTR team is working on the next phase of GTR, which will include service dependency mapping in addition to accounts, resources, and health metadata. GoDaddy uses Elastic APM in all applications for telemetry and tracing. By combining trace data from Elastic APM with AWS resource metadata from all the AWS accounts, GTR will be able to automatically discover the services running across AWS accounts and their dependencies. This will give an even more detailed picture of the impact on GoDaddy in the event of outages. Stay tuned for a second installment of this blog post about this feature in the near future!
Cover Photo Attribution: Photo by unsplash: https://unsplash.com/photos/Q1p7bh3SHj8