Book Summary: Trustworthy Online Controlled Experiments by Kohavi, Tang, Xu

Rating: 8.0/10.

A book about A/B testing best practices in a practical product setting, with many specific recommendations and examples from the authors’ industry experience. The animal on the cover is a hippo, standing for HiPPO (highest paid person’s opinion), a common but ineffective way of making decisions in organizations that don’t run A/B tests. One revelation is that once an organization adopts A/B testing, it quickly realizes that senior people’s opinions are often wrong.

Some barriers to running online experiments: you need enough users to test on, the infrastructure to run experiments easily, and an overall evaluation criterion (OEC), ie, a metric that is a good proxy for meaningful business outcomes while being measurable quickly. Many improvements are accidental or unexpected: eg, the benefit of changing the colors of links on the Bing search page was discovered by accident, while many core product features turned out, after testing, to be less impactful than expected. Generally, it is difficult to predict experiment results.

The steps to run an experiment are assigning users randomly to groups, defining the test metric, and then doing hypothesis testing, which includes computing the p-value and the statistical power (to know what sample size is needed). When interpreting results, consider both statistical and practical significance, implementation complexity, and sanity metrics that confirm the experiment ran properly.
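
As a rough illustration of the sample size step, here is a minimal sketch of a power calculation for comparing two means. It assumes a two-sided z-test with equal group sizes and a known standard deviation; the function name and the default values for `alpha` and `power` are my own choices, not something prescribed by the book.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(sigma: float, delta: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per variant to detect a difference `delta`.

    Assumes a two-sided, two-sample z-test with equal group sizes and a
    metric whose standard deviation `sigma` is the same in both groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical metric: standard deviation 1.0, smallest effect worth detecting 0.1.
n = sample_size_per_group(sigma=1.0, delta=0.1)
```

With the defaults this is close to the well-known rule of thumb of 16σ²/δ² users per group. Note how halving the detectable effect roughly quadruples the required sample size.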

Some common mistakes when doing A/B tests include running multiple experiments and looking for the lowest p-value, or checking results multiple times while the test is running, both of which are forms of p-hacking. Selection bias occurs when the way users end up in groups is not random, including survivorship bias, where the population is already filtered before data collection. Control and treatment groups can also differ due to effects other than what’s intended, such as novelty effects from showing a page that’s different from usual, or implementation differences that favor either the control or the treatment group. It’s important to be skeptical when results look too good to be true.

Once an organization has committed to doing A/B tests at all, the next step is setting up an experiment infrastructure. You can build a custom framework or integrate an existing one, and there are some subtleties in assigning variants to users: you need to decide when in the lifecycle to do the assignment and how to propagate it across backend services, and then you need the logic to branch based on which variant is assigned to a user ID. When multiple experiments run simultaneously, some combinations of variants may be invalid, in which case a nested design is necessary. Finally, you need a UI to manage experiments: start and stop them, review the data, etc.
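
One common way to implement the assignment step is to hash the user ID together with an experiment name, so the same user always lands in the same variant without storing any state. The sketch below is only an assumption of how this could look; the function name, the 1000-bucket granularity, and the use of SHA-256 are illustrative choices, not the book’s implementation.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministically map a user to a variant of one experiment.

    Hashing the experiment name together with the user ID means the same
    user can fall into different buckets in different experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 1000  # bucket in 0..999
    return variants[bucket * len(variants) // 1000]    # equal-sized slices


# Branching on the assigned variant (hypothetical experiment name).
if assign_variant("user-42", "new-checkout") == "treatment":
    layout = "new"
else:
    layout = "old"
```

Because the function is pure, any backend service can recompute the variant from the user ID instead of having it passed along, although passing it explicitly keeps logging consistent.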

One case study is on the effects of website latency and how much latency hurts your user metrics: eg, Bing ran a test that added artificial latency of 100 and 250ms to measure the effects on users. There are some subtleties in measuring latency: on a web page, you should measure the moment when the page perceptually looks like it has finished loading, which is not the same as any single server-side or client-side metric.

Next, when designing a metric, you must make sure it is aligned with the goals and strategy of the organization and is not gameable in unintended ways. You also need guardrail metrics, like page size or latency, to prevent optimizing one thing at the expense of something else. It’s impossible to capture everything in one metric, but having too many metrics is overwhelming, so it’s best to define a small number of OECs (overall evaluation criteria) that each combine several related metrics: eg, for an email campaign, the expected revenue balanced against the loss from users unsubscribing from the email. However, Goodhart’s Law states that any key metric that becomes a target tends to lose its meaning over time.

Institutional memory is useful for learning how to run better experiments, what are good metrics, etc. Scientific ethics can be applied to experiment design, evaluating the risk and potential harm to users in experiments, possible deceptive practices, and privacy concerns when data is expected to be private but can be de-anonymized.

Other sources of user data include mining logs for analysis, surveys, focus groups, and human data annotation, which help if A/B testing experiments are not possible. When randomized experiments are not possible (due to ethical or practical reasons), you often need to use observational data to infer causal effects: use an interrupted time series where a condition was changed for everyone at some point, control for confounds and match by propensity score, or find “natural experiments” where the assignment to a group is close to random. Still, there are pitfalls like the effect of outliers or confounds that you didn’t consider, and many observational studies have failed to replicate in a controlled and randomized setting.

The next part is on systems engineering for running experiments. When the experiment is client-side (ie, an app distributed through an app store), users may update to the latest version at different times, and analytics reporting may be delayed. Thus, each variant of the experiment needs to be shipped to the client and triggered with a flag, and analytics of which variant was run need to be sent back to the server. Also, various guardrail metrics need to be monitored, like battery and network usage, users blocking notifications, and uninstalls.

Randomization can be done at a fine-grained level like a page or session, but you need to take care to avoid inconsistent experiences. So it is common to randomize at the user level instead and then limit the effect of heavy users on the total metric. Never randomize by IP address, which is often shared or unstable. Ramping an experiment should start with internal or beta users since they’re more tolerant of bugs, and then gradually increase exposure; sometimes you leave a holdout set of users on the status quo for some time to observe long-term metrics, and then switch to 100% and remove the previous code path. When scaling up experimentation across an organization, you need to agree on common definitions and metrics and have an experimentation framework that can track metrics across experiments while being accessible.

The last section is on various more advanced topics, starting with basic statistics relevant to experiments, like the t-test and the p-value, which quantifies how likely an observation at least as extreme would be if the null hypothesis were true. Power analysis relates the sample size to the probabilities of type 1 errors (false positives) and type 2 errors (false negatives). For multiple hypothesis testing, corrections tend to either overcorrect or be too complex, so the authors recommend a few tiers of p-value thresholds based on how expected the effect is when defining the experiment.
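
To make the p-value concrete, here is a small sketch of a two-sample test on a per-user metric. It computes the Welch t-statistic but, as a simplifying assumption, converts it to a p-value with the normal distribution, which is only reasonable for large samples like those typical in online experiments. The function name and the example data are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_test(control: list, treatment: list) -> tuple:
    """Return (difference in means, test statistic, two-sided p-value).

    Uses the Welch statistic with a normal approximation to the
    t-distribution, so it should only be used with large samples.
    """
    diff = mean(treatment) - mean(control)
    # Standard error of the difference, allowing unequal variances.
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    stat = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return diff, stat, p_value


# Hypothetical per-user metric values for each group.
control = [i % 10 for i in range(1000)]
treatment = [x + 1 for x in control]
diff, stat, p_value = two_sample_test(control, treatment)
```

A small p-value says the observed difference would be unlikely if there were truly no effect; it does not by itself say the difference is large enough to matter in practice.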

Variance estimation is an important step in analysis since it’s the basis for p-values and confidence intervals, and it can often be done incorrectly, eg in the cases of ratio quantities or non-normal data with outliers, and there are various fixes for each case. A general although expensive method is the bootstrap method, which can estimate the variance of any setup through simulation.
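
A minimal sketch of the bootstrap for a ratio metric is shown below. It assumes the data is a list of per-user `(clicks, views)` pairs and resamples whole users, since the user is assumed to be the randomization unit here; the function names and the number of resamples are illustrative.

```python
import random
from statistics import variance


def click_through_rate(users: list) -> float:
    """Ratio metric: total clicks divided by total page views."""
    clicks = sum(c for c, _ in users)
    views = sum(v for _, v in users)
    return clicks / views


def bootstrap_variance(users: list, statistic,
                       n_resamples: int = 1000, seed: int = 0) -> float:
    """Estimate the variance of `statistic` by resampling users with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(users)
    estimates = []
    for _ in range(n_resamples):
        # Draw n users with replacement, keeping each user's rows together.
        resample = [users[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    return variance(estimates)


# Hypothetical data: (clicks, views) per user.
users = [(i % 3, 5 + i % 7) for i in range(200)]
ctr_variance = bootstrap_variance(users, click_through_rate, n_resamples=200)
```

The cost is that the statistic is recomputed once per resample, which is why this approach becomes expensive on large experiments.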

An A/A test assigns users to two groups that both receive the same experience, in order to verify the validity of the experimentation framework. Across many such tests, the distribution of p-values should be uniform; otherwise, there is a problem with the variance estimation, eg some independence assumption is violated somewhere. Triggering means increasing the statistical power by restricting the analysis to the users who are actually impacted by the experiment, like users who reach the page being tested; or, if deploying two models, including only the users whose predictions would have been different under the other model. Another useful guardrail check is sample ratio mismatch (where the sizes of the treatment and control groups don’t match the expected ratio), which points to an error in the experimentation framework, randomization happening downstream of the treatment logic, etc.
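
The sample ratio mismatch check can be sketched as a chi-squared goodness-of-fit test on the two group sizes. The version below is a minimal illustration using only the standard library; the function name and the example counts are hypothetical.

```python
from math import erfc, sqrt


def srm_p_value(control_n: int, treatment_n: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-squared test that the split matches the design.

    With two groups there is one degree of freedom, for which the
    chi-squared tail probability equals erfc(sqrt(chi2 / 2)).
    """
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (treatment_n - expected_treatment) ** 2 / expected_treatment)
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for an experiment designed as a 50/50 split.
p = srm_p_value(control_n=50_000, treatment_n=51_000)
```

A very small p-value means the observed split is unlikely under the designed ratio, so the experiment’s other results should not be trusted until the cause is found.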

Leakage between variants is possible when testing features involving social networks: the treatment may change how people interact with each other, thus affecting the control group, or the two groups might be competing for a shared resource pool. This can be mitigated by randomizing on geography, time, or network clusters, but each has different trade-offs. Finally, many experiments may have long-term effects that aren’t apparent when looking at short-term metrics, and you will need to keep the experiment running for longer or track user metrics after the experiment has concluded.
