Four tips for developing sound experimentation practices

Introduction

Experimentation is a critical way for product teams to rapidly gather implicit feedback from users on new features and product changes. Product teams at GoDaddy are increasingly using scientific methods and tools to inform the decisions we make about our products, and the goal of the GoDaddy experimentation team is to help them get the best ROI from these tools and methods. Online experimentation poses many interesting challenges, and previous posts have described some of the technical solutions our team is developing. But experimentation tools can only be as effective as the way product teams choose to wield them. In this post we’ll address how to supplement tooling improvements by adopting and promoting healthy experimentation practices, using lessons we’ve learned while scaling up experimentation at GoDaddy.

Tip #1: avoid underpants-gnome thinking

Teams that are new to experimentation may occasionally fall prey to what I like to call underpants-gnome thinking. For those unfamiliar, in one episode of South Park a group of gnomes concocts a business plan that hinges critically and questionably on stealing underpants:

Phase 1: Collect underpants. Phase 2: ? Phase 3: Profit.

This type of thinking surfaces in hypotheses that skip directly from a proposed feature change to a postulated business outcome without the connective tissue: clear, explicit reasoning about why the feature is expected to impact the users and user behaviors that drive the business outcome. It is this connective tissue that defines exactly what can and can’t be learned from the experiment, and an absence of careful thinking in this space can lead to experiments with less valuable learnings and lower success rates.

At its extreme end, this may include “throwing spaghetti against the wall”: trying to push metrics with changes that aren’t clearly connected to a user problem or insight (e.g., changing CTA text from “OK” to “Got it”). Far more common, though, is the case where the experimenter has an implicit “Phase 2” in mind that seems obvious enough to everyone involved that it isn’t stated or explored in depth. We’ve found that even – and especially! – in these cases, refocusing attention from business outcomes to user impact facilitates healthy and productive discussions during experiment design, review, and analysis.

For example, a recent experiment moved an element on the page to a more prominent position above the fold. Implicit in this decision was an assumption that the element would be more visually salient to users in the new position, whereas presumably in its previous position some people had not noticed it. However, the new design also gave it a dark color that caused it to recede in the page relative to other elements. Would this work against its visual salience, and thus its success? Of course we can’t know before testing, but asking the question can prompt teams to refine their treatments or better anticipate likely alternative outcomes or follow-up tests that they hadn’t considered before. It is much better to initiate this discussion proactively, before the experiment launches, than to try to back into it retrospectively and realize that you didn’t get the information you wanted out of the experiment.

This type of discussion is especially likely to be skipped when an experiment follows a common, recognizable pattern. For example, growth experiments sometimes attempt to reduce friction by removing some number of steps on the way to a goal. This strategy is so common that its rationale is rarely questioned, but it’s never a bad idea to drill deeper into the source of the user friction. Is it because the user is on a mobile device and it’s inconvenient to fill out a field? Is it because the number of choices is overwhelming, or the choices themselves are described confusingly? Depending on the source of the friction, skipping the step altogether may or may not be the first solution you choose to test. It may even backfire if the skipped step collected critical information needed to provide a seamless experience after goal completion. At a minimum, the experiment should include metrics to evaluate this possible outcome.

The humans participating in an experiment can and should be the focus for many of the decisions made about its design. To encourage this type of thinking, we use a hypothesis template that requires experimenters to specify not only which variable is hypothesized to impact which metric, but also all of the reasoning in between:

By [making some change] we expect [some outcome] because [reasons why it impacts users].

We also focus experiment metrics heavily around users, rather than products, sessions, or orders. This is not only a statistical best practice for the methods we use, but we find it also helps to draw more attention to the question mark in Phase 2. By directly measuring the user behavior the experiment is intending to change, we can begin to understand how downstream business metrics are tied to user behaviors, and how feature changes impact those behaviors. Of course, there may be situations that warrant other types of metrics, but as a general practice, framing success around users can make the logic of the experiment and its results easier to reason about and learn from.
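
As a concrete illustration, here is a minimal sketch of how that template could be captured as a structured record; the Hypothesis type and its field names are hypothetical, not part of our actual tooling:

```typescript
// Hypothetical structure mirroring the hypothesis template; field names are
// illustrative only, not part of any real GoDaddy experimentation tooling.
interface Hypothesis {
  change: string;           // "By [making some change]..."
  expectedOutcome: string;  // "...we expect [some outcome]..."
  userImpact: string;       // "...because [reasons why it impacts users]."
  successMetrics: string[]; // user-level metrics that directly measure the behavior
}

const example: Hypothesis = {
  change: "Move the help widget above the fold",
  expectedOutcome: "More users complete checkout without contacting support",
  userImpact:
    "The widget is more visually salient, so users who get stuck notice it and self-serve",
  successMetrics: ["users who open the help widget", "users who complete checkout"],
};
```

Writing the reasoning down in this form makes the implicit “Phase 2” explicit, and it makes it obvious when a success metric measures a business outcome rather than the user behavior the change is supposed to affect.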

Tip #2: be diligent about controls

Before we talk about how controls can go wrong, let’s first cover the basics on why and how we use controls to establish causality. Take a simple, hypothetical statement that seems intuitively plausible: taking vitamins reduces blood pressure. Assuming you had a bunch of blood pressure measurements for vitamin-takers and non-vitamin-takers, could you accept this statement if you found lower average blood pressure among vitamin takers? Unless you sampled the data in an experiment using a proper control, the answer is no. What we’ve described above is a relationship between vitamin consumption and health, but not necessarily a causal one. The intuition is simple: chances are high that people who buy vitamins are already different from people who do not. Among many other things, they may differ in their diet, age, or income, and each of those things seems likely to impact blood pressure in some way. Given that, how can you ever prove the health outcomes are caused by vitamins, as opposed to the many other ways the two groups differ? In this scenario, you can’t: vitamin consumption is “confounded” with many other things, resulting in an apples-to-oranges comparison. Having a proper control group – where the types of people in the two groups are as similar as possible in all of these ways – would allow you to isolate the causal impact of vitamin consumption.
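
To make the intuition concrete, here is a small, entirely synthetic simulation (illustrative only, not real data or our analysis code) in which vitamins have zero true effect, yet the naive comparison still shows vitamin takers with lower blood pressure, because health-conscious people in this toy population are both more likely to take vitamins and more likely to have lower blood pressure:

```typescript
// Toy simulation of confounding: vitamins have zero true effect, but
// health-conscious people are both more likely to take vitamins and
// more likely to have lower blood pressure (e.g., via diet).
function simulatePerson() {
  const healthConscious = Math.random() < 0.5;
  const takesVitamins = Math.random() < (healthConscious ? 0.8 : 0.2);
  // True blood pressure depends on health-consciousness, NOT on vitamins.
  const bloodPressure = 130 - (healthConscious ? 10 : 0) + (Math.random() - 0.5) * 20;
  return { takesVitamins, bloodPressure };
}

const people = Array.from({ length: 100_000 }, simulatePerson);
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

const takers = people.filter(p => p.takesVitamins).map(p => p.bloodPressure);
const nonTakers = people.filter(p => !p.takesVitamins).map(p => p.bloodPressure);

// The naive comparison shows takers with lower blood pressure (about -6 here),
// even though vitamins do nothing: the difference is entirely due to the confound.
console.log(mean(takers) - mean(nonTakers));
```

Randomly assigning vitamins – the analogue of proper experiment bucketing – would balance health-consciousness across the two groups, and the observed difference would shrink toward zero.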

In practice, the word “control” sometimes morphs to take on a new, subtly different meaning: the feature(s) that users currently see. It’s important to understand that the word “control” describes a group of people, not the state of an application. More specifically, it describes the subset of people who see the current feature but are otherwise identical to the treatment group. This may seem like a matter of semantics, but the proper definition critically excludes all of the people who visited before the experiment was turned on, or after it was turned off, or who didn’t meet the eligibility criteria, regardless of what they saw in the product. All of those types of visitors differ systematically from the treatment group. Pre-experiment visitors, for example, will have had more time to visit, make purchases, or engage with the product, so including them in the control is going to give you an apples-to-oranges comparison.

As development teams ramp up on experimentation, this concept is critical to internalize. Consider that some percentage of the time your experimentation platform may experience downtime; how should you handle those errors? Perhaps you want to default to showing these users the current version of the product. Hopefully by now you know not to log those sessions into the control, though; otherwise you’ll systematically allocate buggy experiences to one side of the test.
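
Here is a minimal sketch of that idea, assuming a hypothetical experimentation client and logging helper (neither reflects a real platform’s API): when the assignment call fails, fall back to the current experience, but record the visitor as unassigned rather than labeling them part of the control.

```typescript
// Illustrative types; a real experimentation SDK would supply its own.
interface ExperimentClient {
  getBucket(experiment: string, userId: string): Promise<"control" | "treatment">;
}
declare function logEvent(name: string, payload: Record<string, unknown>): void;

type Bucket = "control" | "treatment" | "unassigned";

async function assignBucket(userId: string, client: ExperimentClient): Promise<Bucket> {
  try {
    return await client.getBucket("new-checkout-flow", userId);
  } catch (err) {
    // Platform is down or errored: show the current experience, but do NOT
    // log this visitor as "control". They were never randomized, and lumping
    // them in would push error-prone sessions into one side of the test.
    logEvent("experiment.assignment_failed", { userId, experiment: "new-checkout-flow" });
    return "unassigned";
  }
}
```

Downstream analysis can then exclude the “unassigned” visitors entirely, keeping the comparison between the randomized groups clean.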

Moral of the story: don’t be fooled into thinking that experiment bucketing is going to align neatly with your logic for who sees what. These things not only can but will diverge. Collapsing this critical distinction into a single “control” label when you configure or implement an experiment can be tempting, as it makes the code seem cleaner – but it can make your experimental data dirty, and possibly uninterpretable.

Tip #3: empower engineers to participate in experiment design

Most engineers who have worked with a designer to implement a UI change are familiar with the feedback loop that develops in a productive partnership between UX/UI design and engineering. Mocks may be ambiguous in some way and require follow-up discussion or clarification, or technical considerations may render certain design choices suboptimal. In such cases you need an open line of communication, or at least enough knowledge of the designer’s intentions and the context for the task to fill in the gaps yourself.

The same basic principle applies in the context of experimentation, where instead of (or in addition to) visual mocks you may be dealing with the experiment design: a plan that describes how and why data will be collected for an experiment. An experiment design typically includes the metrics that will be measured, the way traffic will be allocated across conditions, and a description of what changes will be tested and why. While engineers may be tempted to shy away from the planning of a new experiment, experiments are consistently better when engineers lean into the process instead. Armed with the “why” behind the experiment, engineers can provide valuable feedback about how various implementation choices could impact the experiment design, or point out corner cases that the proposed design doesn’t account for.

Importantly, engineers are the best, and sometimes the only, people in a position to identify cases where implementation choices asymmetrically impact one condition, causing the apples-to-oranges comparison discussed above. As an example, we once ran an experiment that compared different ordering algorithms. The design dictated that in one condition we would order the elements on a page using business logic, and in another condition we would order them randomly. However, we implemented the randomization in such a way that each request might return a different order. Users noticed the page elements changing on refresh and behaved in atypical ways, such as reloading the page over and over to see what new elements would appear. This in turn caused metrics to look strange: people were exposed to more choices across the board by virtue of the page refreshes, and thus were more likely to find at least one they wanted to click on, making both total impressions and unique click-throughs look higher in the random condition. Our implementation choice had confounded the experiment: we changed not only the order of the elements but also the rate at which they updated, and the latter turned out to have a robust impact on user behavior. This rendered our results uninterpretable, and we had to start over.
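
One way to avoid that particular confound, sketched below under the assumption that a stable user identifier is available at render time, is to derive the “random” order deterministically from the user, so a given user sees a stable order across refreshes while the order still varies between users:

```typescript
// Deterministic "random" ordering: seed a shuffle with a hash of the user ID
// so the order is stable for a given user across refreshes, but still varies
// between users. Sketch only; a real implementation might use a stronger
// hash and PRNG.
function hashString(s: string): number {
  let h = 2166136261; // FNV-1a
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

function seededShuffle<T>(items: T[], seed: number): T[] {
  const result = [...items];
  let state = seed || 1;
  const next = () => {
    // Simple 32-bit xorshift PRNG; good enough for illustration.
    state ^= state << 13; state ^= state >>> 17; state ^= state << 5;
    return (state >>> 0) / 2 ** 32;
  };
  for (let i = result.length - 1; i > 0; i--) {
    const j = Math.floor(next() * (i + 1));
    [result[i], result[j]] = [result[j], result[i]];
  }
  return result;
}

// Same user + same experiment => same order on every request.
const pageElements = ["domains", "hosting", "email", "website-builder"];
const userId = "user-123";
const order = seededShuffle(pageElements, hashString(`${userId}:random-order-exp`));
```

This isn’t what we did at the time; it’s just one illustration of how an implementation detail (per-request versus per-user randomization) is itself part of the experiment design, and exactly the kind of thing engineers are best placed to flag during review.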

When moving to experiment-driven development, code changes need to support not only the feature specifications but also the experiment design, to ensure that the hypothesis about the new feature can be measured effectively. For that to happen, engineers need to feel some measure of ownership over the experiment design. In service of that, our team has adopted a process that exposes new designs to cross-functional peer review, so that important questions about telemetry, metrics, and implementation can be surfaced and settled before experiments move into development. Because experiment design is not a linear process – it often involves multiple iterations of refinement – it can be helpful to define an end state for peer review; we’ve found that sign-off against a readiness checklist such as this one can be very effective for deciding when a design is “ready”.

Tip #4: mind your instrumentation

Okay, it’s true: the way you log data usually doesn’t impact the customer experience as immediately or as obviously as, say, page load times do. While development teams generally build robust monitoring, automated testing, and QA processes to find and fix problems that are visible to the customer, we’ve found it less common for teams to dedicate the same level of effort to making sure the data logged about what users are seeing and doing is squeaky clean.

Data bugs may include duplicate impressions fired when a component updates, leaky click data caused by a page redirecting before an event can be sent, the right event logged with the wrong metadata (e.g., user or session identifiers, or information about user state), or other issues that make the data a distorted or incomplete representation of what actually happened in the application. Even data bugs that cause critical data to be missing or wrong in a tiny percentage of cases can be harmful to an experiment, and devastating when they asymmetrically impact one side of it – which can easily happen, since experimental code is usually newer and thus more vulnerable to bugs of any sort. And unfortunately, most of these bugs cannot be fixed in analysis. In the best case, your instrumentation will have added so much noise to the results that it drowns out any actual effect. In the worst case, it can distort the results and lead to the wrong conclusion.
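
Two common defenses against the first two kinds of bugs, sketched below under the assumption of a browser environment and a hypothetical /events endpoint: fire each impression at most once per element, and send click events in a way that survives the page navigating away.

```typescript
// Sketch: guard against duplicate impressions when a component re-renders,
// and against dropped click events when a click immediately triggers a
// redirect. The /events endpoint and payload shape are hypothetical.

const firedImpressions = new Set<string>();

function logImpressionOnce(elementId: string, experiment: string): void {
  const key = `${experiment}:${elementId}`;
  if (firedImpressions.has(key)) return; // component re-rendered; don't double-count
  firedImpressions.add(key);
  navigator.sendBeacon("/events", JSON.stringify({ type: "impression", elementId, experiment }));
}

function logClick(elementId: string, experiment: string): void {
  // sendBeacon queues the event with the browser so it isn't lost when the
  // page unloads right after the click; fetch with keepalive works similarly.
  navigator.sendBeacon("/events", JSON.stringify({ type: "click", elementId, experiment }));
}
```

Neither guard is a substitute for validating the logged data itself, but they remove two of the most common sources of asymmetric noise.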

Once we started being more proactive about looking at and continually questioning the quality of our user data, not only were our experiments cleaner and easier to interpret, but we also gained a new mechanism for finding genuine product bugs that eluded existing testing processes. On one occasion, funny-looking data for a particular metric caused us to launch an investigation into our logging. Instead of finding buggy logs, we found a race condition that was causing a dramatically worse user experience at a critical point in the purchase flow, and which surfaced only for users with slower internet connections (and was thus not visible on our internal network). So while diligence about data quality is a requirement for experimentation, it also has the added benefit of helping identify blind spots in your testing. To reap these benefits, teams need to be inquisitive and skeptical, with a finger constantly on the pulse of the data, looking for possible anomalies.

Wrap-up

In order to scale experimentation, product teams need powerful tools to help them experiment rapidly; but just as important, they need to be armed with the processes and knowledge to help them experiment effectively. In this post, we’ve discussed how and why to integrate thoughtful experiment design and data collection practices into the development cycle, all the way from experiment ideation to implementation and logging. While no amount of experiment design can guarantee a winning treatment, small changes to existing processes can greatly increase the chance that teams successfully detect a winning treatment, and increase the insight they can extract from a losing one.

