How Data Scientists Broke A/B Testing (And How We Can Fix It) - posit::conf(2023)
Summary
TL;DR: Carl Vogel, a principal data scientist at Babylist, discusses challenges with A/B testing in organizations. He highlights the common misalignment between statistical methods and business decision-making. Vogel shares two alternative approaches: non-inferiority testing, which checks that a new feature is not worse by a chosen margin, and value-of-information experiments, which weigh the cost of running a longer test against the value of the extra data. These methods help decision-makers make more informed choices about launching features. Vogel emphasizes the need for data scientists to rethink their tools to align better with real-world decisions, focusing on risk, cost, and time.
Takeaways
- 🧑💻 **Launch on Neutral Concept**: Stakeholders often decide to launch a feature despite it showing no statistically significant results, relying on a 'launch on neutral' approach if the trend is positive.
- 🔢 **Non-Inferiority Test Design**: Instead of testing if a new version is better, non-inferiority tests check if it's not worse by a predefined margin, aligning better with stakeholders' risk tolerance.
- 📉 **Risk of Small Losses**: When repeatedly using non-inferiority testing, small losses from each test can add up, requiring organizations to have an aggregate risk budget to manage potential losses over time.
- 📅 **Time vs Data Trade-off**: One core problem in AB testing is the impatience to run fully powered tests due to opportunity costs and roadmap delays, leading to the misuse of statistical tools.
- 💸 **Value of Information**: The marginal value of additional data diminishes as more accumulates, but understanding the cost-benefit trade-off between test duration and data value helps optimize test length.
- 📊 **Sample Size Struggles**: Decision-makers often struggle to provide accurate estimates for the effect size of a feature, leading to underpowered tests and incorrect use of AB testing tools.
- 💡 **Cost of Delay**: Long tests delay product roadmaps and hold up the deployment of dependent features, making it crucial to balance between waiting for data and acting quickly.
- 💬 **Rethinking Tools**: Data scientists should provide tools that align with how decision-makers think about risk, cost, and time, rather than just focusing on statistical significance.
- 💻 **Practical Solutions for AB Testing**: Introducing concepts like non-inferiority testing and value of information can lead to more meaningful conversations between data scientists and stakeholders about risk and decision-making.
- 🚀 **Evolving the Role of Data Science**: As data science tools become standardized and automated, the real value lies in addressing decision-making misalignments and creating tailored, quantitative methods for organizations.
Q & A
What is the main topic Carl Vogel discusses in this presentation?
-Carl Vogel discusses the challenges and nuances of AB testing, particularly focusing on improving decision-making processes in organizations when interpreting AB test results.
What does 'launch on neutral' or 'launch on flat' mean?
-'Launch on neutral' or 'launch on flat' refers to a situation where a product manager decides to launch a feature despite the AB test results being statistically insignificant, but with a positive (though non-significant) conversion lift.
Why do product managers often want to shorten the duration of an AB test?
-Product managers often want to shorten the duration of AB tests because of cost, time constraints, and opportunity costs associated with delaying feature rollouts. They may prioritize moving quickly over statistically significant results.
What is non-inferiority testing and how is it useful in AB tests?
-Non-inferiority testing is an approach where instead of testing if a new version of a feature is better, the goal is to test if it’s not worse by a certain margin. This allows for more meaningful conversations about acceptable risks and timelines in AB testing.
How does non-inferiority testing differ from traditional AB testing?
-In traditional AB testing, the goal is often to see if a new feature performs better than the current version. Non-inferiority testing shifts this to check if the new feature is 'not worse' by a specific, acceptable margin, making it easier to reason about risk and decision-making.
Why is sample size important in AB testing, and why is it hard to define?
-Sample size is important because it determines the test’s power to detect meaningful effects. However, it's often hard to define because product managers struggle to specify an expected effect size, especially when any positive lift is considered valuable.
What is 'value of information' in the context of AB testing?
-The 'value of information' refers to the concept that additional data gathered during an AB test has value, as it reduces uncertainty about a feature’s impact. The goal is to balance this value against the cost of continuing to collect data.
How does the value of information help in deciding when to stop an AB test?
-The value of information helps decision-makers know when to stop an AB test by comparing the cost of gathering more data with the expected value of that data. If the cost exceeds the value, the test should stop.
What are some key reasons stakeholders misuse AB testing tools?
-Stakeholders often misuse AB testing tools because they prioritize quick decisions, opportunity costs, and product roadmaps over statistical significance. Their decision-making frameworks often don't align with traditional statistical measures.
How can data scientists improve conversations with decision-makers about AB testing?
-Data scientists can improve conversations with decision-makers by shifting the focus from abstract statistical concepts like error rates to more practical business metrics, such as cost, benefit, risk, and time, which resonate better with decision-makers.
What should be considered when budgeting for inferiority margin risk across multiple tests?
-When using non-inferiority testing, it's important to establish an overall risk budget for the organization. This involves determining how much loss (inferiority margin) is acceptable across a sequence of tests to avoid a cumulative negative impact.
Outlines
🎤 Introduction and Setting the Scene
The speaker, Carl Vogel, introduces himself as a principal data scientist at Babylist and sets the context for his talk on A/B testing. He begins by telling a story where a product manager asks him about running an A/B test for a significant feature. The product manager only has budget and timeline for a two-week test, despite Carl’s recommendation of six weeks. Despite the insufficient time frame, a small conversion lift is observed, but it’s not statistically significant. The product manager decides to launch the feature anyway based on a 'launch on neutral' mindset, prompting Carl to reflect on how stakeholders make decisions in the context of A/B testing and the challenges data scientists face in aligning with those decisions.
🛠 Reframing AB Testing with New Tools
Carl argues that when stakeholders misuse AB testing tools, instead of simply correcting their mistakes, data scientists should consider offering them different tools. He introduces two approaches that help improve communication with decision-makers: non-inferiority test designs and value of information experimental designs. Non-inferiority tests focus on whether the new version of a feature is not worse by a specific margin, encouraging a more meaningful conversation about acceptable risks and possible outcomes. This shifts the focus from detecting a significant effect to assessing manageable risks.
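The shifted-hypothesis idea can be sketched in a few lines. This is a hypothetical illustration, not the speaker's actual tooling; the function name, the unpooled standard error, and every number below are assumptions:

```python
# Hypothetical sketch of a non-inferiority decision rule for an A/B
# test on conversion rates. H0: lift <= -margin ("the new version is
# worse by more than the margin"); rejecting H0 supports launching.
import math

def noninferior(conv_a, n_a, conv_b, n_b, margin, alpha=0.05):
    """One-sided z-test of H0: p_b - p_a <= -margin."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference in proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_b - p_a + margin) / se                # shift the null by the margin
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal p-value
    return z, p_value, p_value < alpha

# Treatment converts slightly worse (4.9% vs 5.0%), but we agreed to
# tolerate up to a 0.5-point absolute drop:
z, p, launch_ok = noninferior(1000, 20_000, 980, 20_000, margin=0.005)
```

Here the treatment is slightly worse than control, yet the test can still support launching because the observed drop sits comfortably inside the agreed margin, which is exactly the "not worse by more than delta" conversation described above.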
🔄 Understanding the Value of Information in AB Testing
Carl discusses another approach: value of information (VoI) experimental design. He explains that decision-makers often rush AB tests due to the costs of running tests and opportunity costs of delayed feature releases. VoI quantifies the trade-off between the cost of running a longer test and the value of the additional data it provides. This approach allows decision-makers to frame the problem in terms of cost-benefit analysis, making it easier for them to understand when to stop or continue collecting data based on potential gains in knowledge and reduced decision-making risk.
💡 Computing the Value of Additional Data
Carl delves into how to estimate the value of additional data using the concept of Expected Value of Sample Information (EVSI). He explains that by simulating different conversion lift scenarios and running potential experiments, one can estimate how much new data will change the decision. If new data significantly improves decision accuracy, it's worth continuing the test. This approach reframes AB testing from simply determining whether variant B is better than A to asking if more data will improve the decision. Once the additional data’s value no longer outweighs its cost, the test can be stopped, allowing the team to launch the best variant observed.
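The simulation loop described above might be sketched like this; the normal prior, the conjugate update, and every dollar figure are invented for illustration, not taken from the talk:

```python
# Hypothetical Monte Carlo sketch of the Expected Value of Sample
# Information (EVSI): draw lifts from a prior, simulate the extra
# data, update, and value the decisions the data would lead to.
import math
import random

random.seed(7)

PRIOR_SD = 0.01            # prior on absolute conversion lift: Normal(0, 0.01)
BASE_RATE = 0.05           # baseline conversion rate (assumed)
N_PER_ARM = 50_000         # users per arm from two more weeks of traffic
VALUE_PER_UNIT_LIFT = 1e7  # dollars per unit of absolute lift (assumed)
SIMS = 20_000

# Sampling SD of the observed lift (difference of two proportions).
obs_sd = math.sqrt(2 * BASE_RATE * (1 - BASE_RATE) / N_PER_ARM)

def posterior_mean(observed_lift):
    # Normal-normal update with known sampling variance; prior mean is 0.
    shrink = obs_sd**2 / (obs_sd**2 + PRIOR_SD**2)
    return (1 - shrink) * observed_lift

payoff = 0.0
for _ in range(SIMS):
    true_lift = random.gauss(0.0, PRIOR_SD)     # a world drawn from the prior
    observed = random.gauss(true_lift, obs_sd)  # the experiment we'd see in it
    if posterior_mean(observed) > 0:            # launch iff the posterior says up
        payoff += true_lift * VALUE_PER_UNIT_LIFT
evsi = payoff / SIMS  # prior mean is 0, so the no-data decision is worth $0
```

The stopping rule then falls out directly: keep the test running while `evsi` exceeds the dollar-denominated cost of the extra two weeks, and stop and launch the best observed variant otherwise.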
🎯 Rethinking AB Testing Tools and Methods
In this concluding section, Carl emphasizes that there is no one-size-fits-all solution to AB testing. He explains that the real challenge for data scientists is to align their tools with how decision-makers think about risk, cost, and value. Instead of correcting symptoms of tool misuse, such as 'launch on neutral' or running tests until significance is reached, data scientists should focus on solving the underlying problem: the need for speed and cost efficiency. He encourages the audience to rethink the tools they provide, aligning them with the core concerns of decision-makers to support better, more informed decisions.
Keywords
💡AB Testing
💡Statistical Significance
💡Non-Inferiority Test
💡Conversion Lift
💡Sample Size
💡Power Calculations
💡Type I Error
💡Risk Budget
💡Opportunity Cost
💡Value of Information
Highlights
Introduction to speaker Carl Vogel, Principal Data Scientist at Babylist, discussing challenges and methods for A/B testing.
Story about a product manager requesting a test duration that contradicts data scientist's recommendation, illustrating real-world challenges in experimental design.
Explanation of common issues with test duration and sample size, highlighting the discrepancy between data-driven recommendations and stakeholder constraints.
Discussion on the concept of 'launch on neutral,' where stakeholders proceed with feature deployment despite lack of statistical significance, prioritizing business strategy over strict data rules.
Introduction of non-inferiority test designs, which focus on ensuring a new feature is not worse than the current one by a specified margin, as a more practical approach in AB testing.
Highlight of how non-inferiority testing can lead to more productive conversations by aligning tests with stakeholders' risk tolerance and strategic goals.
Presentation of the concept of 'Value of Information' (VOI) experimental designs, which help determine optimal test duration by balancing data value against the cost of time.
Explanation of how VOI designs can quantify the value of additional data and inform decisions on whether to continue or stop an experiment based on cost-benefit analysis.
Discussion on the challenges of applying traditional statistical significance testing in AB testing scenarios where business decisions must consider costs and time constraints.
Proposal that data scientists should focus on providing tools that better align with business decision-making, such as translating statistical outputs into financial terms.
Emphasis on the importance of adapting analytical tools to fit the decision-making frameworks of stakeholders, rather than strictly enforcing traditional statistical methodologies.
Argument that understanding and addressing the root causes of 'launch on neutral' behaviors can lead to more effective test designs and stakeholder engagement.
Recommendation to implement risk budgeting in non-inferiority testing to manage cumulative risks across multiple experiments over time.
Highlight of how sequential testing methods and risk budgets can provide a more nuanced approach to managing long-term experimental risks.
Conclusion emphasizing the role of data scientists in bridging the gap between technical rigor and practical business needs, and the continued relevance of their work in evolving business contexts.
Transcripts
Hi.
Sounds like my mic is working. Okay. Great.
So yeah. My name's Carl Vogel.
I'm a principal data scientist at a company called Babylist.
I am here to nominally talk about A/B testing.
This is, as you probably know by now,
how you ask questions.
I'll try and leave a minute at the end. Otherwise,
feel free to accost me in the hallway.
But I'd like to start with a story.
And in this story, you are a data scientist, which hopefully
not super hard to imagine.
And you work at a company that like sells goods or services online,
and one day a product manager comes up to you asking about a
test they want to run for some feature they want to launch.
And this is like a substantial feature.
We're not like moving a button on the page somewhere.
We're not like changing some copy.
Designers were involved. Engineers were involved. You know,
this is part of like a broader user experience strategy we're
trying to do. And, you get like that perennial question,
that perennial data question of like how much data do I need for this?
Or, in this case, how long should I be running this experiment?
And now you are a competent and a diligent data scientist.
Hopefully, it's also not hard to imagine.
And you ask some thoughtful questions about, well,
what are you trying to measure and what does success and
failure look like for this feature? And, you know,
importantly, like,
how big of a conversion lift are we looking to get here to
make this worthwhile?
And the product manager sort of struggles with these questions a little bit,
but you get answers and you run off and do the thing you were
trained to do and you do like some sample size and some power
calculations and you figure out like an expected test length
and you get back to them and you say, great,
you're gonna need six weeks to test this feature.
And they go, that's great. Thank you for this.
We have budget and appetite for two weeks.
And so you like take a deep breath and, you wish them well,
and you warn them that they may not be able to detect the kind
of effects interested in in two weeks.
And they seem oddly okay with that warning. They're like, okay, cool.
And two weeks pass and you go take a look at the data and
like lo and behold.
There's a conversion lift in this new feature.
And if it were real,
it would mean a meaningful amount of money for the company.
But given the test length and sample size,
it's not statistically significant,
and you go back and you report this to them,
and they respond to you with three words.
And now you have maybe studied and applied experimental
methods for a long time, for many years,
and you have never heard these three words before.
But they're called launch on neutral,
or maybe they said launch on flat. But either way,
it basically means, Hey, I heard you.
It's not statistically significant, but it's positive.
And I'm gonna launch the feature anyway.
And you respond, you know, you have a very, like,
logical response to this. Right?
But you are introspective and you are curious and when you are done, like,
raging about the lack of respect for type one error
rates or whatever. Right?
You start asking stakeholders like, hey, what's the deal?
Like, why do we do this?
Why do we like run an underpowered test?
And then like launch on an insignificant result?
If you're like me, you might get answers kinda like this. Right?
You're gonna hear about like some like faith in a broader
strategy or like wanting to launch something and learn and
iterate on it and all this stuff.
And this starts to make you think about A/B testing
in your organization a little bit differently, as
something that has attributes that make it distinct
from just the raw application of
null hypothesis significance testing. Right?
And some of these attributes are that, you know,
the features we test are not just coming off of a conveyor belt,
randomly drawn out of a population of ideas that
might make us or might lose us money, who knows. Right?
They are carefully planned. They are road-mapped.
They have a lot of path dependencies between each other.
You know,
what we do in the next feature depends on like what we see in
this one and all this sort of stuff. Right? Secondly,
like we always struggle to have conversations about, like,
sample size and test design with folks because you need
some effect size input into it,
and they always struggle to tell you that.
And that is because by the time we have gotten to the test,
we have sunk all the cost of deploying the feature.
It is basically pushing a button at that point.
So any lift is good. Any lift is cool, let's do that. Right?
And we're like talking to them in these terms of like type one
and type two error rates.
And it just doesn't correspond to how these
decision makers are thinking about the risk in this decision.
They just kinda like wanna make more money than they lose on
average and like never lose too much money at a time.
And it's really hard to, like, map this to, like,
the false positive and false negative rates conditioned on,
like, a null hypothesis. Right?
They are asking us this question, essentially. Right?
How do I make a good decision about the effect size I see?
And we are handing them some tools that go well here are
some statistical guarantees on an inference you might wanna
make. And so there's a little bit of a mismatch,
and they end up misusing the tools or ignoring the tools.
And this is kind of what this talk is about is when we see this happening,
the instinct is kind of like correct their use of the tool.
I want to argue in the AB testing context,
we maybe want to think about handing them slightly different tools.
And so for the rest of my time, I am gonna talk about
two approaches to thinking about A/B tests that
have helped me have
more productive conversations with decision makers.
So the first approach is non-inferiority test designs,
which are not new and not esoteric.
But I think they're slept on in the A/B testing context
a little more than they ought to be.
You'll notice the picture here is a guardrail, and that is
the metaphor to keep in mind.
The main idea of this approach is that instead of testing
whether the new version of the site is better than the current one,
we're just gonna test whether it's not worse by some margin.
And that margin is the delta in the red box there,
and we call that the inferiority margin.
What does this buy us?
Well,
when you have a conversation about what these margins ought to
be, you are forcing conversations
about the sort of things that motivate these
launch-on-neutral type positions. You know,
well, how much do you want to risk to launch this thing?
How much do you believe in it?
How quickly are you gonna iterate on it after
it's launched? Right?
And stakeholders can kind of start to give you like
meaningful answers to these questions instead of like
coming up with like a fake effect size that they wanna like find.
And you can start to power against this like, well,
any positive effect is good type of scenario.
And you see, that's kind of what this graph does here,
and you can start to have a conversation about, well, look,
if you run a test for three weeks and you want good power
against any feature that isn't losing us money,
then you may have to accept some small risk that it
actually will lose us, like, say, like, you know,
one and a half percent conversion drop, right?
And maybe that's an acceptable risk,
and maybe it's not an acceptable risk,
but it's an assessment of risk that usually they can reason
about a little bit better.
And you end up with a more productive conversation.
So that's non-inferiority testing.
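As a rough illustration of the power conversation above, a back-of-the-envelope sample-size calculation against an inferiority margin might look like this. It uses the normal approximation and assumes the true lift is zero; the 5% baseline and 0.5-point margin are made-up numbers, not figures from the talk:

```python
# Hypothetical sketch: users per arm needed to have good power
# against "this feature isn't losing us money" when the null is
# "it's worse by more than the margin".
import math
from statistics import NormalDist

def n_per_arm(base_rate, margin, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha), z(power)
    # Variance of the estimated lift contributed per matched pair of users.
    var = 2 * base_rate * (1 - base_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * var / margin ** 2)

# 5% baseline, tolerate at most a 0.5-point absolute drop:
n = n_per_arm(0.05, 0.005)   # roughly 23,500 users per arm
```

Shrinking the margin you're willing to tolerate grows the required sample quadratically, which is exactly the risk-versus-test-length trade-off the guardrail conversation surfaces.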
But there's another method that I like that I wanna talk to you
about. And this one,
directly sort of attacks the core problem,
the core question that we have in these test design conversations.
And that is,
what's the hurry? Right? Like,
why do we not have the patience to run an adequately powered A/B
test in this organization? Right? What's going on?
Why are we in such a rush?
And we sort of know the answer, right?
Like running a test a long time is costly.
There's an opportunity cost of time. This gets back to
the nature of
the road mapping and the path dependence amongst features.
If we're waiting a long time for a test of feature one,
that is gonna hold up feature two,
which depends on the launch or the non-launch of feature one
and what happens. Right? So whole roadmaps get held up.
There's the opportunity cost of sampling and randomization
in tests, right? By construction of an A/B test,
a bunch of users are not getting the best version of
your site, right?
If you've ever worked with bandits or seen bandits,
you've seen them trying to approach this problem.
And then lastly,
there's just the day-to-day maintenance cost of tests,
right? Having a bunch of tests running on the site at once is, you know,
engineering effort and code complexity and data
storage and whatever. Right? Usually that's small,
but it's there.
So the question is, right?
If we know about all these costs, and we know they affect
how decision makers run tests,
why aren't we incorporating them into test designs? Right?
And this is where value of information experimental
designs can help.
So we know the time we spend running a test longer is costly.
We know the extra data we get from running a test longer is valuable.
If we can quantify the cost and we can quantify the value,
that should be telling us how long we should be running a
test. If the value of more data exceeds the cost of more data,
you should keep getting data.
And if the cost of more data exceeds the value of more data,
then you should stop getting data.
Right.
This picture here. Right? The longer our tests run,
the more data we get, and the more valuable that data is;
additional data is less valuable when you have a lot
than when you have a little, right,
and costs increase as you wait to get that data.
If it's more valuable to get the data than it is costly,
you should get the data. Right?
How do we think about the value of data, though? Right?
Like, what is that? Well, before we run an experiment,
before we have any data, right?
We know very little about what the conversion lift of a new
feature might be. It could be very negative.
It could be very positive.
If we make a decision based on our best information now, right?
We could end up launching an awful feature or failing to
launch a really, really good one. Right?
And then as we collect data, right?
We have a better idea of what that conversion lift might be,
the range of values it might take narrows.
We may make an incorrect guess now,
but our guess is likely to be wrong by less.
And it turns out that you can put a value on being probably
less wrong. And again,
if that value exceeds the cost of the time it takes to get
that data, you should be getting it.
So how do we actually compute that value,
the value of being potentially less wrong?
It turns out it has a name.
It's called the expected value of sample information,
and I'm gonna show you a really simplified way of how you might estimate it.
So we start with a prior over what we think the conversion
lift might be. Relatively wide range. We don't know very much.
It could be pretty negative. It could be pretty positive.
We are going to draw a bunch of values out of that prior,
a bunch of potential lifts.
For each of those lifts that we draw,
we're gonna simulate an experiment, right?
And let's say we're interested in like, hey,
what if I want to get two more weeks of data? Right?
So I can simulate a two week experiment,
control and treatment with that lift that I drew out of the prior.
That data and that prior, right, generate a posterior
with a new opinion about what those lifts might be.
Each of those posteriors may not
change my mind at all, based on what I was gonna do under the prior,
or they may change my mind a lot.
If an experiment is likely to generate data that
changes my mind by a lot,
that was a valuable experiment to run. Right?
If it never had any hope of changing my mind,
there was no point in doing it. Right?
And so we run all these simulations. We get all the
posteriors, and we average those out.
That's an estimate of the expected value of getting this extra data.
Even better,
this is like an inherently kind of sequential process. Right?
It's just posterior updating, so you can do it over and over
again after you get some data, right?
You're really just asking what's the value of some more
data. And this changes the core decision in an AB test from
is B better than A as though that's like a really hard
problem to figure out, right,
to should I stop getting data or should I keep getting data? Right?
This is a good fit for A/B tests because we don't have to
recruit subjects.
We just have to wait.
You know, and then once more data isn't worth it,
you just launch the best observed variant.
The inference problem, the statistical significance
problem, is irrelevant at that point.
This is the best information we have,
and it's not worth getting more. So there you are.
And it turns out, like,
I find this a really compelling way to think about AB tests
with decision makers.
It directly gets at the core concepts that they think about
when they wanna make a decision, right?
Cost, benefit, time, risk. Everything's in dollars.
The outputs are in dollars. Right? They're not like, you
know, error rates. Right?
And it's
more complicated than traditional testing,
but it's tractable for like a pretty broad range of the kinds of AB tests
I've run in my experience.
There are, you know,
open research questions on it;
it's still an active area of research. But
I've built whole analytics engines on it with R and Shiny
and worked with product managers on it who have found
it gels really well with how they make decisions. It kind of
liberates them into being able to go, oh,
I can figure out how a test should work with dollar outputs.
Right?
So those are, those are the two methods.
And this is,
this is the part of the talk where I'll reveal I've, like,
failed to pay off on the clickbait title per se.
But hopefully there are, like, some useful lessons.
So the first one is I'm not trying to sell you on these
specific two methods.
I don't think there's a one-size-fits-all approach to
A/B testing. In your organization,
you're going to make decisions differently.
You're gonna need to figure out what kind of measurements you
need to make to like support those decisions.
These have pros, these have cons. There's no silver bullet.
But when you observe stakeholders misusing the tools
that you have provided them to do analysis, right,
it should really cause you to rethink: oh,
what is this tool I've handed them?
And does it align with how they make decisions? Right?
Does it align with their concerns about risk and cost
and time and value and all that important stuff?
Am I giving them outputs that map to how they think about the problem?
And when I do that,
when I go back and try and rethink the tools that I'm handing them, right,
I really want to get at:
am I solving the core problem, or am I just solving the
symptoms of the misuse that I'm observing? Right?
Launching on neutral, running a test until significance,
all this stuff is kind of a symptom of the problem that
the A/B test frameworks we often work with
don't deal with the cost of time. Right?
And there are lots of advanced techniques out there,
like covariate adjustments and sequential p-values and all
this stuff, that will help a test go
faster. Right? And they're great, and you use them when you can.
But they don't answer the question of like,
why does this test need to go so fast?
And so they're really just kind of treating the symptom of
impatience, right.
And this isn't just about AB testing, right?
Data scientists sometimes love a tool and apply
it not super discriminately to problems.
And so we end up with lots of places where the tools
we hand stakeholders aren't exactly the perfect fit
for how they think about the problem.
A/B testing is a really interesting case because it's
a domain where, you know,
it feels like this is a solved statistical problem.
This should be really straightforward, and then you go
try and use it in practice, and it
gets messy really fast.
But this is, I think,
the cool stuff that we get to do.
This is like a vaguely weird time for data scientists.
It feels like a lot of the problems we used to work on are
getting like automated or like outsourced or standardized
whatever. Right?
But these kinds of misalignments between decision
making in an organization and the data science tools used to
support those decisions happen like all over the place and all
the time in organizations and,
identifying those and addressing them by, you know,
going back to the first principles problem and really
translating that decision making problem into
quantitative methods and quantifying the core concepts
in that decision making problem is where we can, like,
add value, right?
And I don't want you to let, like, SaaS vendors and, like,
ChatGPT like convince you that these are all solved problems
and there's nothing left to do.
I think there's a lot of things like this to do out there
still. And I think that's what we're here for. Right?
And that's all I have. Thanks for coming, everybody.
I hope you enjoy the rest of the conference.
Alright. Fantastic.
So
let's see if I can say this correctly.
Is there a risk of compounding poorly tested changes into real
deterioration of the project? So I've got a yes,
but can you talk about that a little bit?
Yeah. This is asking about the non-inferiority stuff. Right?
If you're willing to accept a tiny loss on each test, right,
those
start to add up. Yes, that absolutely can happen.
The way I think about this is having kind of an aggregate
inferiority margin budget over a bunch of tests and going, like, well,
you can put a margin on this one and this one and this one,
but, you know,
this is the total loss that we can accept over
a long sequence of tests, or for a year, over some unit of
time. And so you have to
budget out that risk. You should
have a risk budget for all these decisions, right?
You don't want to think about them in isolation, right?
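That aggregate budget idea can be sketched with simple bookkeeping; working in basis points keeps the arithmetic exact, and all the numbers here are invented for illustration:

```python
# Hypothetical sketch of an aggregate "inferiority margin budget":
# cap the worst-case cumulative conversion loss across a year of
# non-inferiority tests using a simple additive approximation.

ANNUAL_LOSS_BUDGET_BPS = 300  # tolerate at most a 3-point (300 bps) worst-case drop/year
planned_margins_bps = [50, 50, 100, 50]  # per-test inferiority margins already committed

def budget_left(budget_bps, margins_bps):
    """Worst case: every launched variant is worse by exactly its margin."""
    return budget_bps - sum(margins_bps)

left = budget_left(ANNUAL_LOSS_BUDGET_BPS, planned_margins_bps)
# left is the remaining room, in bps, for further margins this year
```

This additive worst case is conservative: it assumes every launched variant is inferior by exactly its margin, which is precisely the total-loss exposure the risk budget is meant to cap, rather than treating each test's risk in isolation.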
Okay. Fantastic. Let's thank Carl again.