How Data Scientists Broke A/B Testing (And How We Can Fix It) - posit::conf(2023)
Summary
TL;DR: Carl Vogel, a principal data scientist at Babylist, discusses challenges with A/B testing in organizations. He highlights the common misalignment between statistical methods and business decision-making. Vogel shares two alternative approaches: non-inferiority testing, which checks that a new feature is not worse by a chosen margin, and value-of-information experiments, which weigh the cost of running a longer test against the value of the extra data. These methods help decision-makers make more informed choices about launching features. Vogel emphasizes the need for data scientists to rethink their tools to align better with real-world decisions, focusing on risk, cost, and time.
Takeaways
- 🧑💻 **Launch on Neutral Concept**: Stakeholders often decide to launch a feature despite it showing no statistically significant results, relying on a 'launch on neutral' approach if the trend is positive.
- 🔢 **Non-Inferiority Test Design**: Instead of testing if a new version is better, non-inferiority tests check if it's not worse by a predefined margin, aligning better with stakeholders' risk tolerance.
- 📉 **Risk of Small Losses**: When repeatedly using non-inferiority testing, small losses from each test can add up, requiring organizations to have an aggregate risk budget to manage potential losses over time.
- 📅 **Time vs Data Trade-off**: One core problem in AB testing is the impatience to run fully powered tests due to opportunity costs and roadmap delays, leading to the misuse of statistical tools.
- 💸 **Value of Information**: The marginal value of additional data diminishes as more accumulates, but understanding the cost-benefit trade-off between test duration and data value helps optimize test length.
- 📊 **Sample Size Struggles**: Decision-makers often struggle to provide accurate estimates for the effect size of a feature, leading to underpowered tests and incorrect use of AB testing tools.
- 💡 **Cost of Delay**: Long tests delay product roadmaps and hold up the deployment of dependent features, making it crucial to balance between waiting for data and acting quickly.
- 💬 **Rethinking Tools**: Data scientists should provide tools that align with how decision-makers think about risk, cost, and time, rather than just focusing on statistical significance.
- 💻 **Practical Solutions for AB Testing**: Introducing concepts like non-inferiority testing and value of information can lead to more meaningful conversations between data scientists and stakeholders about risk and decision-making.
- 🚀 **Evolving the Role of Data Science**: As data science tools become standardized and automated, the real value lies in addressing decision-making misalignments and creating tailored, quantitative methods for organizations.
Q & A
What is the main topic Carl Vogel discusses in this presentation?
-Carl Vogel discusses the challenges and nuances of AB testing, particularly focusing on improving decision-making processes in organizations when interpreting AB test results.
What does 'launch on neutral' or 'launch on flat' mean?
-'Launch on neutral' or 'launch on flat' refers to a situation where a product manager decides to launch a feature despite the AB test results being statistically insignificant, but with a positive (though non-significant) conversion lift.
Why do product managers often want to shorten the duration of an AB test?
-Product managers often want to shorten the duration of AB tests because of cost, time constraints, and opportunity costs associated with delaying feature rollouts. They may prioritize moving quickly over statistically significant results.
What is non-inferiority testing and how is it useful in AB tests?
-Non-inferiority testing is an approach where instead of testing if a new version of a feature is better, the goal is to test if it’s not worse by a certain margin. This allows for more meaningful conversations about acceptable risks and timelines in AB testing.
How does non-inferiority testing differ from traditional AB testing?
-In traditional AB testing, the goal is often to see if a new feature performs better than the current version. Non-inferiority testing shifts this to check if the new feature is 'not worse' by a specific, acceptable margin, making it easier to reason about risk and decision-making.
Why is sample size important in AB testing, and why is it hard to define?
-Sample size is important because it determines the test’s power to detect meaningful effects. However, it's often hard to define because product managers struggle to specify an expected effect size, especially when any positive lift is considered valuable.
What is 'value of information' in the context of AB testing?
-The 'value of information' refers to the concept that additional data gathered during an AB test has value, as it reduces uncertainty about a feature’s impact. The goal is to balance this value against the cost of continuing to collect data.
How does the value of information help in deciding when to stop an AB test?
-The value of information helps decision-makers know when to stop an AB test by comparing the cost of gathering more data with the expected value of that data. If the cost exceeds the value, the test should stop.
What are some key reasons stakeholders misuse AB testing tools?
-Stakeholders often misuse AB testing tools because they prioritize quick decisions, opportunity costs, and product roadmaps over statistical significance. Their decision-making frameworks often don't align with traditional statistical measures.
How can data scientists improve conversations with decision-makers about AB testing?
-Data scientists can improve conversations with decision-makers by shifting the focus from abstract statistical concepts like error rates to more practical business metrics, such as cost, benefit, risk, and time, which resonate better with decision-makers.
What should be considered when budgeting for inferiority margin risk across multiple tests?
-When using non-inferiority testing, it's important to establish an overall risk budget for the organization. This involves determining how much loss (inferiority margin) is acceptable across a sequence of tests to avoid a cumulative negative impact.
Outlines
🎤 Introduction and Setting the Scene
The speaker, Carl Vogel, introduces himself as a principal data scientist at Babylist and sets the context for his talk on A/B testing. He begins by telling a story where a product manager asks him about running an A/B test for a significant feature. The product manager only has budget and timeline for a two-week test, despite Carl’s recommendation of six weeks. Despite the insufficient time frame, a small conversion lift is observed, but it’s not statistically significant. The product manager decides to launch the feature anyway based on a 'launch on neutral' mindset, prompting Carl to reflect on how stakeholders make decisions in the context of A/B testing and the challenges data scientists face in aligning with those decisions.
🛠 Reframing AB Testing with New Tools
Carl argues that when stakeholders misuse AB testing tools, instead of simply correcting their mistakes, data scientists should consider offering them different tools. He introduces two approaches that help improve communication with decision-makers: non-inferiority test designs and value of information experimental designs. Non-inferiority tests focus on whether the new version of a feature is not worse by a specific margin, encouraging a more meaningful conversation about acceptable risks and possible outcomes. This shifts the focus from detecting a significant effect to assessing manageable risks.
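The shifted-hypothesis idea can be sketched in a few lines. This is a hypothetical illustration, not the speaker's actual tooling; the function name, the unpooled standard error, and every number below are assumptions:

```python
# Hypothetical sketch of a non-inferiority decision rule for an A/B
# test on conversion rates. H0: lift <= -margin ("the new version is
# worse by more than the margin"); rejecting H0 supports launching.
import math

def noninferior(conv_a, n_a, conv_b, n_b, margin, alpha=0.05):
    """One-sided z-test of H0: p_b - p_a <= -margin."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference in proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (p_b - p_a + margin) / se                # shift the null by the margin
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal p-value
    return z, p_value, p_value < alpha

# Treatment converts slightly worse (4.9% vs 5.0%), but we agreed to
# tolerate up to a 0.5-point absolute drop:
z, p, launch_ok = noninferior(1000, 20_000, 980, 20_000, margin=0.005)
```

Here the treatment is slightly worse than control, yet the test can still support launching because the observed drop sits comfortably inside the agreed margin, which is exactly the "not worse by more than delta" conversation described above.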
🔄 Understanding the Value of Information in AB Testing
Carl discusses another approach: value of information (VoI) experimental design. He explains that decision-makers often rush AB tests due to the costs of running tests and opportunity costs of delayed feature releases. VoI quantifies the trade-off between the cost of running a longer test and the value of the additional data it provides. This approach allows decision-makers to frame the problem in terms of cost-benefit analysis, making it easier for them to understand when to stop or continue collecting data based on potential gains in knowledge and reduced decision-making risk.
💡 Computing the Value of Additional Data
Carl delves into how to estimate the value of additional data using the concept of Expected Value of Sample Information (EVSI). He explains that by simulating different conversion lift scenarios and running potential experiments, one can estimate how much new data will change the decision. If new data significantly improves decision accuracy, it's worth continuing the test. This approach reframes AB testing from simply determining whether variant B is better than A to asking if more data will improve the decision. Once the additional data’s value no longer outweighs its cost, the test can be stopped, allowing the team to launch the best variant observed.
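The simulation loop described above might be sketched like this; the normal prior, the conjugate update, and every dollar figure are invented for illustration, not taken from the talk:

```python
# Hypothetical Monte Carlo sketch of the Expected Value of Sample
# Information (EVSI): draw lifts from a prior, simulate the extra
# data, update, and value the decisions the data would lead to.
import math
import random

random.seed(7)

PRIOR_SD = 0.01            # prior on absolute conversion lift: Normal(0, 0.01)
BASE_RATE = 0.05           # baseline conversion rate (assumed)
N_PER_ARM = 50_000         # users per arm from two more weeks of traffic
VALUE_PER_UNIT_LIFT = 1e7  # dollars per unit of absolute lift (assumed)
SIMS = 20_000

# Sampling SD of the observed lift (difference of two proportions).
obs_sd = math.sqrt(2 * BASE_RATE * (1 - BASE_RATE) / N_PER_ARM)

def posterior_mean(observed_lift):
    # Normal-normal update with known sampling variance; prior mean is 0.
    shrink = obs_sd**2 / (obs_sd**2 + PRIOR_SD**2)
    return (1 - shrink) * observed_lift

payoff = 0.0
for _ in range(SIMS):
    true_lift = random.gauss(0.0, PRIOR_SD)     # a world drawn from the prior
    observed = random.gauss(true_lift, obs_sd)  # the experiment we'd see in it
    if posterior_mean(observed) > 0:            # launch iff the posterior says up
        payoff += true_lift * VALUE_PER_UNIT_LIFT
evsi = payoff / SIMS  # prior mean is 0, so the no-data decision is worth $0
```

The stopping rule then falls out directly: keep the test running while `evsi` exceeds the dollar-denominated cost of the extra two weeks, and stop and launch the best observed variant otherwise.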
🎯 Rethinking AB Testing Tools and Methods
In this concluding section, Carl emphasizes that there is no one-size-fits-all solution to AB testing. He explains that the real challenge for data scientists is to align their tools with how decision-makers think about risk, cost, and value. Instead of correcting symptoms of tool misuse, such as 'launch on neutral' or running tests until significance is reached, data scientists should focus on solving the underlying problem: the need for speed and cost efficiency. He encourages the audience to rethink the tools they provide, aligning them with the core concerns of decision-makers to support better, more informed decisions.
Keywords
💡AB Testing
💡Statistical Significance
💡Non-Inferiority Test
💡Conversion Lift
💡Sample Size
💡Power Calculations
💡Type I Error
💡Risk Budget
💡Opportunity Cost
💡Value of Information
Highlights
Introduction to speaker Carl Vogel, Principal Data Scientist at Babylist, discussing challenges and methods for A/B testing.
Story about a product manager requesting a test duration that contradicts data scientist's recommendation, illustrating real-world challenges in experimental design.
Explanation of common issues with test duration and sample size, highlighting the discrepancy between data-driven recommendations and stakeholder constraints.
Discussion on the concept of 'launch on neutral,' where stakeholders proceed with feature deployment despite lack of statistical significance, prioritizing business strategy over strict data rules.
Introduction of non-inferiority test designs, which focus on ensuring a new feature is not worse than the current one by a specified margin, as a more practical approach in AB testing.
Highlight of how non-inferiority testing can lead to more productive conversations by aligning tests with stakeholders' risk tolerance and strategic goals.
Presentation of the concept of 'Value of Information' (VOI) experimental designs, which help determine optimal test duration by balancing data value against the cost of time.
Explanation of how VOI designs can quantify the value of additional data and inform decisions on whether to continue or stop an experiment based on cost-benefit analysis.
Discussion on the challenges of applying traditional statistical significance testing in AB testing scenarios where business decisions must consider costs and time constraints.
Proposal that data scientists should focus on providing tools that better align with business decision-making, such as translating statistical outputs into financial terms.
Emphasis on the importance of adapting analytical tools to fit the decision-making frameworks of stakeholders, rather than strictly enforcing traditional statistical methodologies.
Argument that understanding and addressing the root causes of 'launch on neutral' behaviors can lead to more effective test designs and stakeholder engagement.
Recommendation to implement risk budgeting in non-inferiority testing to manage cumulative risks across multiple experiments over time.
Highlight of how sequential testing methods and risk budgets can provide a more nuanced approach to managing long-term experimental risks.
Conclusion emphasizing the role of data scientists in bridging the gap between technical rigor and practical business needs, and the continued relevance of their work in evolving business contexts.
Transcripts
Hi.
Sounds like my mic is working. Okay. Great.
So yeah. My name's Carl Vogel.
I'm a principal data scientist at a company called Babylist.
I am here to nominally talk about A/B testing.
This is, as you probably know by now,
how you ask questions.
I'll try and leave a minute at the end. Otherwise,
feel free to accost me in the hallway.
But I'd like to start with a story.
And in this story, you are a data scientist, which hopefully
not super hard to imagine.
And you work at a company that like sells goods or services online,
and one day a product manager comes up to you asking about a
test they want to run for some feature they want to launch.
And this is like a substantial feature.
We're not like moving a button on the page somewhere.
We're not like changing some copy.
Designers were involved. Engineers were involved. You know,
this is part of like a broader user experience strategy we're
trying to do. And, you get like that perennial question,
that perennial data question of like how much data do I need for this?
Or, in this case, how long should I be running this experiment?
And now you are a competent and a diligent data scientist.
Hopefully, it's also not hard to imagine.
And you ask some thoughtful questions about, well,
what are you trying to measure and what does success and
failure look like for this feature? And, you know,
importantly, like,
how big of a conversion lift are we looking to get here to
make this worthwhile?
And the product manager sort of struggles with these questions a little bit,
but you get answers and you run off and do the thing you were
trained to do and you do like some sample size and some power
calculations and you figure out like an expected test length
and you get back to them and you say, great,
you're gonna need six weeks to test this feature.
And they go, that's great. Thank you for this.
We have budget and appetite for two weeks.
And so you like take a deep breath and, you wish them well,
and you warn them that they may not be able to detect the kind
of effects interested in in two weeks.
And they seem oddly okay with that warning. They're like, okay, cool.
And two weeks pass and you go take a look at the data and
like lo and behold.
There's a conversion lift in this new feature.
And if it were real,
it would mean a meaningful amount of money for the company.
But given the test length and sample size,
it's not statistically significant,
and you go back and you report this to them,
and they respond to you with three words.
And now you have maybe studied and applied experimental
methods for a long time, for many years,
and you have never heard these three words before.
But they're called launch on neutral,
or maybe they said launch on flat. But either way,
it basically means, Hey, I heard you.
It's not statistically significant, but it's positive.
And I'm gonna launch the feature anyway.
And you respond, you know, you have a very, like,
logical response to this. Right?
But you are introspective and you are curious and when you are done, like,
raging about the lack of respect for type one error
rates or whatever. Right?
You start asking stakeholders like, hey, what's the deal?
Like, why do we do this?
Why do we like run an underpowered test?
And then like launch on an insignificant result?
If you're like me, you might get answers kinda like this. Right?
You're gonna hear about like some like faith in a broader
strategy or like wanting to launch something and learn and
iterate on it and all this stuff.
And this starts to make you think about A/B testing
in your organization a little bit differently, as
something that has attributes that make it distinct
from just the raw application of
null hypothesis significance testing. Right?
And some of these attributes are that, you know,
the features we test are not just coming off of a conveyor belt,
randomly drawn out of a population of ideas that
might make us or might lose us money, who knows. Right?
They are carefully planned. They are road-mapped.
They have a lot of path dependencies between each other.
You know,
what we do in the next feature depends on like what we see in
this one and all this sort of stuff. Right? Secondly,
like we always struggle to have conversations about, like,
sample size and test design with folks because you need
some effect size input into it,
and they always struggle to tell you that.
And that is because by the time we have gotten to the test,
we have sunk all the cost of deploying the feature.
It is basically pushing a button at that point.
So any lift is good. Any lift is cool, let's do that. Right?
And we're like talking to them in these terms of like type one
and type two error rates.
And it just doesn't correspond to how these
decision makers are thinking about the risk in this decision.
They just kinda like wanna make more money than they lose on
average and like never lose too much money at a time.
And it's really hard to, like, map this to, like,
the false positive and false negative rates conditioned on,
like, a null hypothesis. Right?
They are asking us this question, essentially. Right?
How do I make a good decision about the effect size I see?
And we are handing them some tools that go well here are
some statistical guarantees on an inference you might wanna
make. And so there's a little bit of a mismatch,
and they end up misusing the tools or ignoring the tools.
And this is kind of what this talk is about is when we see this happening,
the instinct is kind of like correct their use of the tool.
I want to argue in the AB testing context,
we maybe want to think about handing them slightly different tools.
And so for the rest of my time, I am gonna talk about
two approaches to thinking about A/B tests that
have helped me have
more productive conversations with decision makers.
So the first approach is non-inferiority test designs,
which are not new and not esoteric.
But I think they're slept on in the A/B testing context
a little more than they ought to be.
You'll notice the picture here is a guardrail, and that is
the metaphor to keep in mind.
The main idea of this approach is that instead of testing
whether the new version of the site is better than the current one,
we're just gonna test whether it's not worse by some margin.
And that margin is the delta in the red box there,
and we call that the inferiority margin.
What does this buy us?
Well,
when you have a conversation about what these margins ought to
be, you are forcing conversations
about the sort of things that motivate these
launch-on-neutral type positions. You know,
well, how much do you want to risk to launch this thing?
How much do you believe in it?
How quickly are you gonna iterate on it after
it's launched? Right?
And stakeholders can kind of start to give you like
meaningful answers to these questions instead of like
coming up with like a fake effect size that they wanna like find.
And you can start to power against this like, well,
any positive effect is good type of scenario.
And you see, that's kind of what this graph does here,
and you can start to have a conversation about, well, look,
if you run a test for three weeks and you want good power
against any feature that isn't losing us money,
then you may have to accept some small risk that it
actually will lose us, like, say, like, you know,
one and a half percent conversion drop, right?
And maybe that's an acceptable risk,
and maybe it's not an acceptable risk,
but it's an assessment of risk that usually they can reason
about a little bit better.
And you end up with a more productive conversation.
So that's non-inferiority testing.
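As a rough illustration of the power conversation above, a back-of-the-envelope sample-size calculation against an inferiority margin might look like this. It uses the normal approximation and assumes the true lift is zero; the 5% baseline and 0.5-point margin are made-up numbers, not figures from the talk:

```python
# Hypothetical sketch: users per arm needed to have good power
# against "this feature isn't losing us money" when the null is
# "it's worse by more than the margin".
import math
from statistics import NormalDist

def n_per_arm(base_rate, margin, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf
    z_alpha, z_beta = z(1 - alpha), z(power)
    # Variance of the estimated lift contributed per matched pair of users.
    var = 2 * base_rate * (1 - base_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * var / margin ** 2)

# 5% baseline, tolerate at most a 0.5-point absolute drop:
n = n_per_arm(0.05, 0.005)   # roughly 23,500 users per arm
```

Shrinking the margin you're willing to tolerate grows the required sample quadratically, which is exactly the risk-versus-test-length trade-off the guardrail conversation surfaces.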
But there's another method that I like that I wanna talk to you
about. And this one,
directly sort of attacks the core problem,
the core question that we have in these test design conversations.
And that is,
what's the hurry? Right? Like,
why do we not have the patience to run an adequately powered A/B
test in this organization? Right? What's going on?
Why are we in such a rush?
And we sort of know the answer, right?
Like running a test a long time is costly.
There's an opportunity cost of time. This gets back to
the nature of
the road mapping and the path dependence amongst features.
If we're waiting a long time for a test of feature one,
that is gonna hold up feature two,
which depends on the launch or the non-launch of feature one
and what happens. Right? So whole roadmaps get held up.
There's the opportunity cost of sampling and randomization
in tests, right? By construction of an A/B test,
a bunch of users are not getting the best version of
your site, right?
If you've ever worked with bandits or seen bandits,
you've seen them trying to approach this problem.
And then lastly,
there's just the day-to-day maintenance cost of tests,
right? Having a bunch of tests running on the site at once is, you know,
engineering effort and code complexity and data
storage and whatever. Right? Usually that's small,
but it's there.
So the question is, right?
If we know about all these costs, and we know they affect
how decision makers run tests,
why aren't we incorporating them into test designs? Right?
And this is where value of information experimental
designs can help.
So we know the time we spend running a test longer is costly.
We know the extra data we get from running a test longer is valuable.
If we can quantify the cost and we can quantify the value,
that should be telling us how long we should be running a
test. If the value of more data exceeds the cost of more data,
you should keep getting data.
And if the cost of more data exceeds the value of more data,
then you should stop getting data.
Right.
This picture here. Right? The longer our tests run,
the more data we get, and the more valuable that data is;
additional data is less valuable when you have a lot
than when you have a little, right,
and costs increase as you wait to get that data.
If it's more valuable to get the data than it is costly,
you should get the data. Right?
How do we think about the value of data, though? Right?
Like, what is that? Well, before we run an experiment,
before we have any data, right?
We know very little about what the conversion lift of a new
feature might be. It could be very negative.
It could be very positive.
If we make a decision based on our best information now, right?
We could end up launching an awful feature or failing to
launch a really, really good one. Right?
And then as we collect data, right?
We have a better idea of what that conversion lift might be,
the range of values it might take narrows.
We may make an incorrect guess now,
but our guess is likely to be wrong by less.
And it turns out that you can put a value on being probably
less wrong. And again,
if that value exceeds the cost of the time it takes to get
that data, you should be getting it.
So how do we actually compute that value,
the value of being potentially less wrong?
It turns out it has a name.
It's called the expected value of sample information,
and I'm gonna show you a really simplified way of how you might estimate it.
So we start with a prior over what we think the conversion
lift might be. Relatively wide range. We don't know very much.
It could be pretty negative. It could be pretty positive.
We are going to draw a bunch of values out of that prior,
a bunch of potential lifts.
For each of those lifts that we draw,
we're gonna simulate an experiment, right?
And let's say we're interested in like, hey,
what if I want to get two more weeks of data? Right?
So I can simulate a two week experiment,
control and treatment with that lift that I drew out of the prior.
That data and that prior, right, generate a posterior
with a new opinion about what those lifts might be.
Each of those posteriors may not
change my mind at all, based on what I was gonna do under the prior,
or they may change my mind a lot.
If an experiment is likely to generate data that
changes my mind by a lot,
that was a valuable experiment to run. Right?
If it never had any hope of changing my mind,
there was no point in doing it. Right?
And so we run all these simulations. We get all the
posteriors, and we average those out.
That's an estimate of the expected value of getting this extra data.
Even better,
this is like an inherently kind of sequential process. Right?
It's just posterior updating, so you can do it over and over
again after you get some data, right?
You're really just asking what's the value of some more
data. And this changes the core decision in an AB test from
is B better than A as though that's like a really hard
problem to figure out, right,
to should I stop getting data or should I keep getting data? Right?
This is a good fit for A/B tests because we don't have to
recruit subjects.
We just have to wait.
You know, and then once more data isn't worth it,
you just launch the best observed variant.
The inference problem, the statistical significance
problem, is irrelevant at that point.
This is the best information we have,
and it's not worth getting more. So there you are.
And it turns out, like,
I find this a really compelling way to think about AB tests
with decision makers.
It directly gets at the core concepts that they think about
when they wanna make a decision, right?
Cost, benefit, time, risk. Everything's in dollars.
The outputs are in dollars. Right? They're not like, you
know, error rates. Right?
And it's
more complicated than traditional testing,
but it's tractable for like a pretty broad range of the kinds of AB tests
I've run in my experience.
There are, you know,
open research questions on it;
it's still an active area of research. But
I've built whole analytics engines on it with R and Shiny
and worked with product managers on it who have found
it gels really well with how they make decisions. It kind of
liberates them into being able to go, oh,
I can figure out how a test should work with dollar outputs.
Right?
So those are, those are the two methods.
And this is,
this is the part of the talk where I'll reveal I've, like,
failed to pay off on the clickbait title per se.
But hopefully there are, like, some useful lessons.
So the first one is I'm not trying to sell you on these
specific two methods.
I don't think there's a one-size-fits-all approach to
A/B testing. In your organization,
you're going to make decisions differently.
You're gonna need to figure out what kind of measurements you
need to make to like support those decisions.
These have pros, these have cons. There's no silver bullet.
But when you observe stakeholders misusing the tools
that you have provided them to do analysis, right,
it should really cause you to rethink: oh,
what is this tool I've handed them?
And does it align with how they make decisions? Right?
Does it align with their concerns about risk and cost
and time and value and all that important stuff?
Am I giving them outputs that map to how they think about the problem?
And when I do that,
when I go back and try and rethink the tools that I'm handing them, right,
I really want to get at:
am I solving the core problem, or am I just solving the
symptoms of the misuse that I'm observing? Right?
Launching on neutral, running a test until significance,
all this stuff is kind of a symptom of the problem that
the A/B test frameworks we often work with
don't deal with the cost of time. Right?
And there are lots of advanced techniques out there,
like covariate adjustments and sequential p-values and all
this stuff, that will help a test go
faster. Right? And they're great, and you use them when you can.
But they don't answer the question of like,
why does this test need to go so fast?
And so they're really just kind of treating the symptom of
impatience, right.
And this isn't just about AB testing, right?
Data scientists sometimes love a tool and apply
it not super discriminately to problems.
And so we end up with lots of places where the tools
we hand stakeholders aren't exactly the perfect fit
for how they think about the problem.
A/B testing is a really interesting case because it's
a domain where, you know,
it feels like this is a solved statistical problem.
This should be really straightforward, and then you go
try and use it in practice, and it
gets messy really fast.
But this is, I think,
the cool stuff that we get to do.
This is like a vaguely weird time for data scientists.
It feels like a lot of the problems we used to work on are
getting like automated or like outsourced or standardized
whatever. Right?
But these kinds of misalignments between decision
making in an organization and the data science tools used to
support those decisions happen like all over the place and all
the time in organizations and,
identifying those and addressing them by, you know,
going back to the first principles problem and really
translating that decision making problem into
quantitative methods and quantifying the core concepts
in that decision making problem is where we can, like,
add value, right?
And I don't want you to let, like, SaaS vendors and, like,
ChatGPT like convince you that these are all solved problems
and there's nothing left to do.
I think there's a lot of things like this to do out there
still. And I think that's what we're here for. Right?
And that's all I have. Thanks for coming, everybody.
I hope you enjoy the rest of the conference.
Alright. Fantastic.
So
let's see if I can say this correctly.
Is there a risk of compounding poorly tested changes into real
deterioration of the project? So I've got a yes,
but can you talk about that a little bit?
Yeah. This is asking about the non-inferiority stuff. Right?
If you're willing to accept a tiny loss on each test, right,
those
start to add up. Yes, that absolutely can happen.
The way I think about this is having kind of an aggregate
inferiority margin budget over a bunch of tests and going, like, well,
you can put a margin on this one and this one and this one,
but, you know,
this is the total loss that we can accept over
a long sequence of tests, or for a year, over some unit of
time. And so you have to
budget out that risk. You should
have a risk budget for all these decisions, right?
You don't want to think about them in isolation, right?
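That aggregate budget idea can be sketched with simple bookkeeping; working in basis points keeps the arithmetic exact, and all the numbers here are invented for illustration:

```python
# Hypothetical sketch of an aggregate "inferiority margin budget":
# cap the worst-case cumulative conversion loss across a year of
# non-inferiority tests using a simple additive approximation.

ANNUAL_LOSS_BUDGET_BPS = 300  # tolerate at most a 3-point (300 bps) worst-case drop/year
planned_margins_bps = [50, 50, 100, 50]  # per-test inferiority margins already committed

def budget_left(budget_bps, margins_bps):
    """Worst case: every launched variant is worse by exactly its margin."""
    return budget_bps - sum(margins_bps)

left = budget_left(ANNUAL_LOSS_BUDGET_BPS, planned_margins_bps)
# left is the remaining room, in bps, for further margins this year
```

This additive worst case is conservative: it assumes every launched variant is inferior by exactly its margin, which is precisely the total-loss exposure the risk budget is meant to cap, rather than treating each test's risk in isolation.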
Okay. Fantastic. Let's thank Carl again.