How Data Scientists Broke A/B Testing (And How We Can Fix It) - posit::conf(2023)

Posit PBC
15 Dec 2023 · 18:18

Summary

TLDR: Carl Vogel, a principal data scientist at Babylist, discusses challenges with AB testing in organizations. He highlights the common misalignment between statistical methods and business decision-making. Vogel shares two alternative approaches: non-inferiority testing, which checks that a new feature is not worse than the current one by a predefined margin, and value of information experiments, which weigh the cost of running a test longer against the value of the extra data. These methods help decision-makers make more informed choices about launching features. Vogel emphasizes that data scientists need to rethink their tools so they align better with real-world decisions, focusing on risk, cost, and time.

Takeaways

  • 🧑‍💻 **Launch on Neutral Concept**: Stakeholders often decide to launch a feature despite it showing no statistically significant results, relying on a 'launch on neutral' approach if the trend is positive.
  • 🔢 **Non-Inferiority Test Design**: Instead of testing if a new version is better, non-inferiority tests check if it's not worse by a predefined margin, aligning better with stakeholders' risk tolerance.
  • 📉 **Risk of Small Losses**: When repeatedly using non-inferiority testing, small losses from each test can add up, requiring organizations to have an aggregate risk budget to manage potential losses over time.
  • 📅 **Time vs Data Trade-off**: One core problem in AB testing is the impatience to run fully powered tests due to opportunity costs and roadmap delays, leading to the misuse of statistical tools.
  • 💸 **Value of Information**: The marginal value of additional data shrinks as more accumulates, while the cost of waiting grows; weighing this trade-off determines how long a test is worth running.
  • 📊 **Sample Size Struggles**: Decision-makers often struggle to provide accurate estimates for the effect size of a feature, leading to underpowered tests and incorrect use of AB testing tools.
  • 💡 **Cost of Delay**: Long tests delay product roadmaps and hold up the deployment of dependent features, making it crucial to balance between waiting for data and acting quickly.
  • 💬 **Rethinking Tools**: Data scientists should provide tools that align with how decision-makers think about risk, cost, and time, rather than just focusing on statistical significance.
  • 💻 **Practical Solutions for AB Testing**: Introducing concepts like non-inferiority testing and value of information can lead to more meaningful conversations between data scientists and stakeholders about risk and decision-making.
  • 🚀 **Evolving the Role of Data Science**: As data science tools become standardized and automated, the real value lies in addressing decision-making misalignments and creating tailored, quantitative methods for organizations.

Q & A

  • What is the main topic Carl Vogel discusses in this presentation?

    -Carl Vogel discusses the challenges and nuances of AB testing, particularly focusing on improving decision-making processes in organizations when interpreting AB test results.

  • What does 'launch on neutral' or 'launch on flat' mean?

    -'Launch on neutral' or 'launch on flat' refers to a situation where a product manager decides to launch a feature even though the AB test result is not statistically significant, as long as the observed conversion lift is positive.

  • Why do product managers often want to shorten the duration of an AB test?

    -Product managers often want to shorten the duration of AB tests because of cost, time constraints, and opportunity costs associated with delaying feature rollouts. They may prioritize moving quickly over statistically significant results.

  • What is non-inferiority testing and how is it useful in AB tests?

    -Non-inferiority testing is an approach where instead of testing if a new version of a feature is better, the goal is to test if it’s not worse by a certain margin. This allows for more meaningful conversations about acceptable risks and timelines in AB testing.

  • How does non-inferiority testing differ from traditional AB testing?

    -In traditional AB testing, the goal is often to see if a new feature performs better than the current version. Non-inferiority testing shifts this to check if the new feature is 'not worse' by a specific, acceptable margin, making it easier to reason about risk and decision-making.

  • Why is sample size important in AB testing, and why is it hard to define?

    -Sample size is important because it determines the test’s power to detect meaningful effects. However, it's often hard to define because product managers struggle to specify an expected effect size, especially when any positive lift is considered valuable.

  • What is 'value of information' in the context of AB testing?

    -The 'value of information' refers to the concept that additional data gathered during an AB test has value, as it reduces uncertainty about a feature’s impact. The goal is to balance this value against the cost of continuing to collect data.

  • How does the value of information help in deciding when to stop an AB test?

    -The value of information helps decision-makers know when to stop an AB test by comparing the cost of gathering more data with the expected value of that data. If the cost exceeds the value, the test should stop.

  • What are some key reasons stakeholders misuse AB testing tools?

    -Stakeholders often misuse AB testing tools because they prioritize quick decisions, opportunity costs, and product roadmaps over statistical significance. Their decision-making frameworks often don't align with traditional statistical measures.

  • How can data scientists improve conversations with decision-makers about AB testing?

    -Data scientists can improve conversations with decision-makers by shifting the focus from abstract statistical concepts like error rates to more practical business metrics, such as cost, benefit, risk, and time, which resonate better with decision-makers.

  • What should be considered when budgeting for inferiority margin risk across multiple tests?

    -When using non-inferiority testing, it's important to establish an overall risk budget for the organization. This involves determining how much loss (inferiority margin) is acceptable across a sequence of tests to avoid a cumulative negative impact.

Outlines

00:00

🎤 Introduction and Setting the Scene

The speaker, Carl Vogel, introduces himself as a principal data scientist at Babylist and sets the context for his talk on AB testing. He begins with a story in which a product manager asks him about running an AB test for a substantial feature. The product manager has budget and appetite for only a two-week test, despite Carl's recommendation of six weeks. At the end of the insufficient two weeks, a small conversion lift is observed, but it is not statistically significant. The product manager decides to launch the feature anyway, citing a 'launch on neutral' mindset, prompting Carl to reflect on how stakeholders actually make decisions around AB tests and the challenges data scientists face in aligning with those decisions.

05:00

🛠 Reframing AB Testing with New Tools

Carl argues that when stakeholders misuse AB testing tools, instead of simply correcting their mistakes, data scientists should consider offering them different tools. He introduces two approaches that help improve communication with decision-makers: non-inferiority test designs and value of information experimental designs. Non-inferiority tests focus on whether the new version of a feature is not worse by a specific margin, encouraging a more meaningful conversation about acceptable risks and possible outcomes. This shifts the focus from detecting a significant effect to assessing manageable risks.
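To make the 'not worse by a margin' idea concrete, here is a minimal Python sketch of a one-sided non-inferiority check on conversion rates. It is only the textbook normal-approximation version, not the design Vogel describes; the function name, counts, and the 0.5-point margin are hypothetical.

```python
# Minimal non-inferiority check on conversion rates (illustrative numbers only).
# H0: lift <= -margin  vs.  H1: lift > -margin.
import numpy as np
from scipy.stats import norm

def noninferiority_test(conv_a, n_a, conv_b, n_b, margin, alpha=0.05):
    """One-sided z-test that variant B is not worse than A by more than
    `margin` (absolute difference in conversion rate)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Unpooled standard error of the difference in proportions.
    se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = (diff + margin) / se      # how far the observed lift sits above -margin
    p_value = norm.sf(z)          # P(Z >= z) under H0
    return diff, p_value, p_value < alpha

# Hypothetical two-week test: a small, noisy lift for variant B.
diff, p, ok = noninferiority_test(conv_a=1180, n_a=20_000,
                                  conv_b=1215, n_b=20_000, margin=0.005)
print(f"lift: {diff:+.4f}, p-value vs. -0.5pt margin: {p:.3f}, non-inferior: {ok}")
```

Note that the same observed lift could easily fail a conventional superiority test yet still clear the 'not worse than half a point' bar, which is exactly the conversation about acceptable risk that the talk wants to force.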

10:02

🔄 Understanding the Value of Information in AB Testing

Carl discusses another approach: value of information (VoI) experimental design. He explains that decision-makers often rush AB tests due to the costs of running tests and opportunity costs of delayed feature releases. VoI quantifies the trade-off between the cost of running a longer test and the value of the additional data it provides. This approach allows decision-makers to frame the problem in terms of cost-benefit analysis, making it easier for them to understand when to stop or continue collecting data based on potential gains in knowledge and reduced decision-making risk.
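As a toy illustration of that framing, the stopping rule reduces to a comparison in dollars. The figures below are invented; estimating the value of the information is the harder part (see the EVSI sketch after the next outline entry).

```python
# Toy stopping rule with made-up dollar figures: keep collecting data only while
# the expected value of the next batch of information exceeds its cost.
cost_per_week = 15_000        # hypothetical: roadmap delay + holdout traffic + test maintenance
evsi_two_more_weeks = 22_000  # hypothetical: estimated by simulation (see EVSI sketch below)

if evsi_two_more_weeks > 2 * cost_per_week:
    print("Keep the test running: the extra data is worth more than the delay.")
else:
    print("Stop the test and launch the best observed variant.")
```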

15:07

💡 Computing the Value of Additional Data

Carl delves into how to estimate the value of additional data using the concept of Expected Value of Sample Information (EVSI). He explains that by simulating different conversion lift scenarios and running potential experiments, one can estimate how much new data will change the decision. If new data significantly improves decision accuracy, it's worth continuing the test. This approach reframes AB testing from simply determining whether variant B is better than A to asking if more data will improve the decision. Once the additional data’s value no longer outweighs its cost, the test can be stopped, allowing the team to launch the best variant observed.
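A heavily simplified Monte Carlo sketch of that procedure is below. It assumes a normal prior on the lift, a normal approximation for the observed lift, and a hypothetical dollars-per-lift figure; the traffic numbers and names are made up, and this is not the R and Shiny engine Vogel mentions building.

```python
# Monte Carlo sketch of the Expected Value of Sample Information (EVSI) for
# "two more weeks of data". All inputs are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

p_base = 0.06                  # control conversion rate
mu0, tau0 = 0.0, 0.01          # prior on the lift: N(0, 0.01^2), i.e. "could go either way"
n_per_arm = 40_000             # visitors per arm over two more weeks
dollars_per_lift = 2_000_000   # $ value of one unit of absolute lift (hypothetical)

def posterior_mean_of_lift(lift_hat, se):
    """Normal-normal update of the prior on the lift given an observed estimate."""
    prec0, prec_data = 1 / tau0**2, 1 / se**2
    return (mu0 * prec0 + lift_hat * prec_data) / (prec0 + prec_data)

# Value of deciding right now with only the prior: launch iff E[lift] > 0.
value_now = max(0.0, mu0) * dollars_per_lift

# Simulate many possible worlds: draw a true lift from the prior, simulate two
# weeks of data, update beliefs, and record the value of the post-data decision.
sims = 10_000
post_values = np.empty(sims)
for i in range(sims):
    true_lift = rng.normal(mu0, tau0)
    conv_a = rng.binomial(n_per_arm, p_base)
    conv_b = rng.binomial(n_per_arm, np.clip(p_base + true_lift, 0.0, 1.0))
    lift_hat = conv_b / n_per_arm - conv_a / n_per_arm
    se = np.sqrt(2 * p_base * (1 - p_base) / n_per_arm)  # rough sampling error of lift_hat
    mu_post = posterior_mean_of_lift(lift_hat, se)
    post_values[i] = max(0.0, mu_post) * dollars_per_lift

evsi = post_values.mean() - value_now
print(f"EVSI of two more weeks ≈ ${evsi:,.0f}")
```

If that EVSI exceeds the dollar cost of waiting two more weeks, keep collecting data; otherwise stop and launch the best observed variant, which is the reframing the talk proposes.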

🎯 Rethinking AB Testing Tools and Methods

In this concluding section, Carl emphasizes that there is no one-size-fits-all solution to AB testing. He explains that the real challenge for data scientists is to align their tools with how decision-makers think about risk, cost, and value. Instead of correcting symptoms of tool misuse, such as 'launch on neutral' or running tests until significance is reached, data scientists should focus on solving the underlying problem: the need for speed and cost efficiency. He encourages the audience to rethink the tools they provide, aligning them with the core concerns of decision-makers to support better, more informed decisions.

Keywords

💡AB Testing

AB testing is a method of comparing two versions of a webpage or feature to see which one performs better in terms of specific metrics like conversion rates. In the video, Carl Vogel discusses how product teams often implement this method to test new features but sometimes misuse it by making decisions on underpowered or inconclusive data.

💡Statistical Significance

Statistical significance refers to the likelihood that a result from data analysis is not due to chance. In the video, Carl talks about how product managers often launch features even when results are not statistically significant, which he describes as 'launching on neutral.' This practice can lead to poor decision-making and misinterpretation of test results.

💡Non-Inferiority Test

Non-inferiority tests are used to determine if a new feature or version is not worse than the current one by a certain margin. Carl argues that this method is underused in AB testing but can be effective in forcing conversations about acceptable risks and decision-making, particularly when a small decline in performance may be tolerable for potential long-term gains.

💡Conversion Lift

Conversion lift refers to the increase in the rate of a desired action, like sales or sign-ups, after implementing a change. Carl mentions conversion lift when describing how product managers seek any positive effect, even if not statistically significant, and sometimes use this metric to justify launching features prematurely.

💡Sample Size

Sample size is the number of observations or users involved in an experiment. A key point in Carl’s talk is that determining the right sample size is critical for AB tests to produce reliable results. Often, teams don't allocate enough time or users to gather meaningful data, leading to unreliable conclusions.

💡Power Calculations

Power calculations are used to determine the likelihood that a test will detect a true effect if one exists. Carl mentions these calculations in the context of figuring out how long an AB test should run based on the expected conversion lift. Failing to perform proper power calculations can result in underpowered tests, which are less likely to detect meaningful effects.
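For reference, a minimal sketch of the arithmetic behind a statement like 'you're gonna need six weeks': the standard normal-approximation sample-size formula for comparing two proportions. The baseline rate, target lift, and weekly traffic below are hypothetical.

```python
# Rough sample-size / duration arithmetic for an AB test on conversion rates.
from scipy.stats import norm

def n_per_arm(p_base, lift, alpha=0.05, power=0.8):
    """Visitors needed in each arm to detect an absolute conversion lift."""
    p_new = p_base + lift
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return (z_a + z_b) ** 2 * var / lift ** 2

n = n_per_arm(p_base=0.06, lift=0.003)    # detect a 0.3-point lift off a 6% baseline
weekly_visitors_per_arm = 20_000          # hypothetical traffic per arm per week
print(f"~{n:,.0f} visitors per arm → ~{n / weekly_visitors_per_arm:.1f} weeks")
```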

💡Type I Error

A Type I error occurs when a test detects an effect that is not really there, a false positive. Carl discusses how ignoring statistical significance can increase the risk of Type I errors, leading to the premature launch of ineffective or even harmful features.

💡Risk Budget

Risk budget refers to the acceptable level of risk a company is willing to take on over time or across multiple tests. Carl suggests that organizations should set a ‘risk budget’ when using non-inferiority tests to ensure they don’t accept too many small losses that could compound into larger problems.
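One simple, purely illustrative way to operationalize such a budget is to cap the worst-case cumulative loss for a period and split it across the planned tests; the talk does not prescribe a specific allocation rule.

```python
# Hypothetical risk budget: cap the worst-case cumulative conversion loss for the
# year and split it evenly across the planned non-inferiority tests.
annual_loss_budget = 0.03   # accept at most a 3-point cumulative drop across all launches
planned_tests = 10

per_test_margin = annual_loss_budget / planned_tests
print(f"Non-inferiority margin per test: {per_test_margin:.3%}")  # 0.300%
```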

💡Opportunity Cost

Opportunity cost is the potential benefit that is lost when choosing one alternative over another. In the context of AB testing, Carl explains that there are opportunity costs to running long tests, such as delaying other features that depend on the test's outcome, which is why decision-makers sometimes rush tests.

💡Value of Information

The value of information quantifies the benefit of collecting more data to improve decision-making. Carl introduces this concept as a method for determining whether the cost of running a test longer is worth the potential insights. It shifts the focus from 'Is B better than A?' to 'Is more data worth the cost?'

Highlights

Introduction to speaker Carl Vogel, Principal Data Scientist at Babylist, discussing challenges and methods for AB testing.

Story about a product manager requesting a test duration that contradicts the data scientist's recommendation, illustrating real-world challenges in experimental design.

Explanation of common issues with test duration and sample size, highlighting the discrepancy between data-driven recommendations and stakeholder constraints.

Discussion on the concept of 'launch on neutral,' where stakeholders proceed with feature deployment despite lack of statistical significance, prioritizing business strategy over strict data rules.

Introduction of non-inferiority test designs, which focus on ensuring a new feature is not worse than the current one by a specified margin, as a more practical approach in AB testing.

Highlight of how non-inferiority testing can lead to more productive conversations by aligning tests with stakeholders' risk tolerance and strategic goals.

Presentation of the concept of 'Value of Information' (VOI) experimental designs, which help determine optimal test duration by balancing data value against the cost of time.

Explanation of how VOI designs can quantify the value of additional data and inform decisions on whether to continue or stop an experiment based on cost-benefit analysis.

Discussion on the challenges of applying traditional statistical significance testing in AB testing scenarios where business decisions must consider costs and time constraints.

Proposal that data scientists should focus on providing tools that better align with business decision-making, such as translating statistical outputs into financial terms.

Emphasis on the importance of adapting analytical tools to fit the decision-making frameworks of stakeholders, rather than strictly enforcing traditional statistical methodologies.

Argument that understanding and addressing the root causes of 'launch on neutral' behaviors can lead to more effective test designs and stakeholder engagement.

Recommendation to implement risk budgeting in non-inferiority testing to manage cumulative risks across multiple experiments over time.

Highlight of how sequential testing methods and risk budgets can provide a more nuanced approach to managing long-term experimental risks.

Conclusion emphasizing the role of data scientists in bridging the gap between technical rigor and practical business needs, and the continued relevance of their work in evolving business contexts.

Transcripts

play00:07

Hi.

play00:09

Sounds like my mic is working. Okay. Great.

play00:11

So yeah. My name's Carl Vogel.

play00:13

I'm a principal data scientist at a company called Babylist.

play00:17

I am here to nominally talk about AB testing.

play00:22

This is the, you probably know by now,

play00:25

but this is how you ask questions.

play00:26

I'll try and leave a minute at the end. Otherwise,

play00:28

feel free to accost me in the hallway.

play00:32

But I'd like to start with a story.

play00:37

And in this story, you are a data scientist, which hopefully

play00:41

is not super hard to imagine.

play00:43

And you work at a company that like sells goods or services online,

play00:48

and one day a product manager comes up to you asking about a

play00:51

test they want to run for some feature they want to launch.

play00:54

And this is, this is like a substantial feature.

play00:56

It's not we're not like moving a button on the page somewhere.

play00:59

We're not like changing some copy.

play01:00

Designers were involved. Engineers were involved. You know,

play01:03

this is part of like a broader user experience strategy we're

play01:06

trying to do. And, you get like that perennial question,

play01:09

that perennial data question of like how much data do I need for this?

play01:12

Or, in this case, how long should I be running this experiment?

play01:17

And now you are a competent and a diligent data scientist.

play01:21

Hopefully, it's also not hard to imagine.

play01:24

And you ask some thoughtful questions about, well,

play01:26

what are you trying to measure and what does success and

play01:28

failure look like for this feature? And, you know,

play01:30

importantly, like,

play01:31

how big of a conversion lift are we looking to get here to

play01:34

make this worthwhile?

play01:35

And the product manager sort of struggles with these questions a little bit,

play01:38

but you get answers and you run off and do the thing you were

play01:41

trained to do and you do like some sample size and some power

play01:43

calculations and you figure out like an expected test length

play01:46

and you get back to them and you say, great. Well,

play01:48

you're gonna need six weeks to test this feature.

play01:51

And they go, that's great. Thank you for this.

play01:54

We have budget and appetite for two weeks.

play01:58

And so you like take a deep breath and, you wish them well,

play02:01

and you warn them that they may not be able to detect the kind

play02:04

of effects they're interested in in two weeks.

play02:06

And they seem oddly okay with that warning. They're like, okay, cool.

play02:10

And two weeks pass and you go take a look at the data and

play02:13

like lo and behold.

play02:14

There's a conversion lift in this new feature.

play02:16

And if it were real,

play02:17

it would mean a meaningful amount of money for the company.

play02:21

But given the test length and sample size,

play02:23

it's not statistically significant,

play02:24

and you go back and you report this to them,

play02:26

and they respond to you with three words.

play02:29

And now you have maybe studied and applied experimental

play02:32

methods for a long time, for many years,

play02:35

and you have never heard these three words before.

play02:38

But they're called launch on neutral,

play02:40

or maybe they said launch on flat. But either way,

play02:42

it basically means, Hey, I heard you.

play02:43

It's not statistically significant, but it's positive.

play02:46

And I'm gonna launch the feature anyway.

play02:49

And you respond, you know, you have a very, like,

play02:51

logical response to this. Right?

play02:56

But you are introspective and you are curious and when you are done, like,

play03:00

raging about the lack of respect for type one error

play03:02

rates or whatever. Right?

play03:04

You start asking stakeholders like, hey, what's the deal?

play03:06

Like, why do we do this?

play03:07

Why do we like run an underpowered test?

play03:09

And then like launch on an insignificant result?

play03:13

If you're like me, you might get answers kinda like this. Right?

play03:16

You're gonna hear about like some like faith in a broader

play03:18

strategy or like wanting to launch something and learn and

play03:20

iterate on it and all this stuff.

play03:22

And this all starts to make you think about like AB testing

play03:26

in your organization like a little bit differently, as

play03:28

something that has like attributes that make it distinct

play03:31

from just like the raw application of, like,

play03:32

null hypothesis significance testing. Right?

play03:36

And some of these attributes are that, you know,

play03:38

the features we test are not just like coming off of a conveyor belt,

play03:42

like randomly drawn out of a population of ideas that, like,

play03:44

might make us money, might lose us money, like who knows. Right?

play03:47

They are carefully planned. They are roadmapped.

play03:49

They have a lot of path dependencies between each other.

play03:53

You know,

play03:55

what we do in the next feature depends on like what we see in

play03:57

this one and all this sort of stuff. Right? Secondly,

play04:00

like we always struggle to have conversations about, like,

play04:03

sample size and test design with folks because you need

play04:05

some effect size input into it,

play04:07

and they always struggle to tell you that.

play04:08

And that is because by the time we have gotten to the test,

play04:11

we have sunk all the cost of deploying the feature.

play04:13

It is basically pushing a button at that point.

play04:15

So any lift is good. Any lift is cool, let's do that. Right?

play04:21

And we're like talking to them in these terms of like type one

play04:24

and type two error rates.

play04:25

And it just like doesn't correspond to how like these

play04:28

decision makers are thinking about the risk in this decision.

play04:31

They just kinda like wanna make more money than they lose on

play04:33

average and like never lose too much money at a time.

play04:36

And it's really hard to, like, map this to, like,

play04:39

the false positive and false negative rates conditioned on,

play04:41

like, a null hypothesis. Right?

play04:45

They are asking us this question, essentially. Right?

play04:49

How do I make a good decision about the effect size I see?

play04:52

And we are handing them some tools that go well here are

play04:54

some statistical guarantees on an inference you might wanna

play04:56

make. And so there's a little bit of a mismatch,

play05:00

and they end up misusing the tools or ignoring the tools.

play05:04

And this is kind of what this talk is about is when we see this happening,

play05:08

the instinct is kind of like to correct their use of the tool.

play05:12

I want to argue in the AB testing context,

play05:13

we maybe want to think about handing them slightly different tools.

play05:17

And so for the rest of my time, I am gonna talk about

play05:21

two approaches to thinking about AB tests that

play05:24

have helped me have, like,

play05:25

more productive conversations with decision makers.

play05:31

So the first approach is non-inferiority test designs,

play05:34

which are not new and not esoteric.

play05:37

But I think they're slept on in the AB testing context,

play05:41

a little more than they ought to be.

play05:43

You'll notice the picture here is a guardrail, and that is

play05:46

the metaphor to keep in mind.

play05:49

The main idea of this approach is that instead of testing

play05:52

whether the new version of the site is better than the current one,

play05:55

we're just gonna test whether it's not worse by some margin.

play05:59

And that margin is the delta in the red box there,

play06:01

and we call that the inferiority margin.

play06:05

What does this buy us?

play06:07

Well,

play06:07

when you have a conversation about what these margins ought to

play06:10

be, you are forcing conversations

play06:12

about the sort of things that motivate these, like,

play06:14

launch on neutral type positions. You know,

play06:17

well, how much do you want to risk to launch this thing?

play06:19

How much do you believe in it?

play06:21

How quickly are you gonna iterate on it after

play06:23

it's launched. Right?

play06:24

And stakeholders can kind of start to give you like

play06:26

meaningful answers to these questions instead of like

play06:28

coming up with like a fake effect size that they wanna like find.

play06:32

And you can start to power against this like, well,

play06:35

any positive effect is good type of scenario.

play06:39

And you see that. That's kind of what this graph does here,

play06:41

and you can start to have a conversation about, well, look,

play06:44

if you run a test for three weeks and you want good power

play06:47

against any feature that like isn't losing us money.

play06:51

Then you may have to accept some small risk that it

play06:53

actually will lose us, like, say, like, you know,

play06:56

one and a half percent conversion drop, right?

play06:58

And maybe that's an acceptable risk,

play07:00

and maybe it's not an acceptable risk,

play07:01

but it's an assessment of risk that usually they can reason

play07:04

about a little bit better.

play07:05

And you end up with a more productive conversation.

play07:08

So that's non-inferiority testing.

play07:12

But there's another method that I like that I wanna talk to you

play07:15

about. And this one,

play07:17

directly sort of attacks the core problem,

play07:20

the core question that we have in these test design conversations.

play07:24

And that is,

play07:26

what's the hurry? Right? Like,

play07:27

why do we not have the patience to run an adequately powered AB

play07:31

test in this organization. Right? What's going on?

play07:33

Why are we in such a rush?

play07:35

And we sort of know the answer, right?

play07:37

Like running a test a long time is costly.

play07:40

There's an opportunity cost of time. This gets back to that,

play07:43

like, the nature of, like,

play07:44

the road mapping and the path dependencies amongst features.

play07:47

If we're waiting a long time for a test of feature one,

play07:50

that is gonna hold up feature two,

play07:52

which depends on the launch or the non launch of feature one

play07:54

and what happens. Right? So whole roadmaps get held up.

play07:58

There's the opportunity cost of like sampling and randomization

play08:01

in tests, right? By construction of an AB test,

play08:03

a bunch of users are not getting the best version of

play08:05

your site, right?

play08:07

If you've ever worked with bandits or seen bandits,

play08:09

you've seen them trying to approach this problem.

play08:12

And then lastly,

play08:13

there's just like the day to day maintenance cost of tests,

play08:15

right, having a bunch of tests running on the site at once is, you know,

play08:18

engineering effort and like code complexity and data

play08:20

storage and whatever. Right? Usually, that's small,

play08:22

but it's there.

play08:26

So the question is, right?

play08:27

If we know about all these costs and we know they affect

play08:29

how decision makers run tests.

play08:31

Why aren't we incorporating them into test designs? Right?

play08:36

And this is where value of information experimental

play08:39

designs can help.

play08:42

So we know the time we spend running a test longer is costly.

play08:46

We know the extra data we get from running a test longer is valuable.

play08:49

If we can quantify the cost and we can quantify the value,

play08:52

that should be telling us how long we should be running a

play08:54

test. If the value of more data exceeds the cost of more data,

play08:58

you should keep getting data.

play08:59

And if the cost of more data exceeds the value of more data,

play09:02

then you should stop getting data.

play09:04

Right.

play09:08

This picture here. Right? The longer our tests run,

play09:11

the more data we get, the more valuable that data is,

play09:14

additional data is less valuable when you have a lot

play09:17

than when you have a little, right,

play09:19

costs increase as you wait to get that data.

play09:22

If it's more valuable to get the data than it is costly,

play09:24

you should get the data. Right?

play09:29

How do we think about the value of data though, right?

play09:31

Like what is that? Well, before we run an experiment,

play09:35

before we have any data, right?

play09:37

We know very little about what the conversion lift of a new

play09:39

feature might be. It could be very negative.

play09:41

It could be very positive.

play09:43

If we make a decision based on our best information now, right?

play09:46

We could end up launching an awful feature or failing to

play09:50

launch a really, really good one. Right?

play09:52

And then as we collect data, right?

play09:54

We have a better idea of what that conversion lift might be,

play09:57

the range of values it might take narrows.

play10:00

We may make it incorrect guess now,

play10:02

but our guess is likely to be wrong by less.

play10:06

And it turns out that you can put a value on being probably

play10:10

less wrong. And again,

play10:13

if that value exceeds the cost of the time it takes to get

play10:16

that data, you should be getting it.

play10:19

So how do we actually say like compute that value,

play10:23

that value of being potentially less wrong.

play10:26

It turns out it has a name.

play10:27

It's called the expected value of sample information,

play10:29

and I'm gonna show you a really simplified way of how you might estimate it.

play10:35

So we start with a prior over what we think the conversion

play10:37

lift might be. Relatively wide range. We don't know very much.

play10:40

It could be pretty negative. It could be pretty positive.

play10:44

We are going to draw a bunch of values out of that prior a

play10:47

bunch of potential lifts out of that prior.

play10:51

For each of those lifts that we draw,

play10:55

we're gonna simulate an experiment, right?

play10:56

And let's say we're interested in like, hey,

play10:58

what if I want to get two more weeks of data? Right?

play11:00

So I can simulate a two week experiment,

play11:04

control and treatment with that lift that I drew out of the prior.

play11:07

That data and that prior, right, generate a posterior

play11:11

with a new opinion about what those lifts might be.

play11:16

Each of those posteriors may change my mind, or may not

play11:18

change my mind at all based on what I was gonna do under the prior.

play11:22

They may change my mind a lot based on what I was gonna do under the prior.

play11:26

If an experiment is likely to generate data that

play11:30

changes my mind by a lot,

play11:32

that was a valuable experiment to run. Right?

play11:34

If it never had any hope of changing my mind,

play11:37

there was no point in doing it. Right?

play11:39

And so we run all these simulations. We get all these

play11:41

priors. We get all the posteriors.

play11:43

We average those out.

play11:45

That's an estimate of this expected value of getting this extra data.

play11:51

Even better,

play11:52

this is like an inherently kind of sequential process. Right?

play11:56

It's just posterior updating so you can do it over and over

play11:58

again. After you get some data right?

play11:59

You're really just asking what's the value of some more

play12:01

data. And this changes the core decision in an AB test from

play12:05

is B better than A as though that's like a really hard

play12:08

problem to figure out, right,

play12:09

to should I stop getting data or should I keep getting data? Right?

play12:13

It's a good fit for AB tests because we don't have to like

play12:16

recruit subjects for an AB test.

play12:17

We just have to wait.

play12:20

You know, and then once more data isn't worth it,

play12:23

you just launch the best observed variant.

play12:25

The inference problem the statistical significance

play12:27

problem is irrelevant at that point.

play12:30

This is the best information we have,

play12:32

and it's not worth getting more. So there you are.

play12:37

And it turns out, like,

play12:38

I find this a really compelling way to think about AB tests

play12:40

with decision makers.

play12:43

It directly gets at the core concepts that they think about

play12:46

when they wanna make a decision, right?

play12:49

Cost, benefit, time, risk. Everything's in dollars.

play12:53

The outputs are in dollars. Right? They're not like, you

play12:56

know, error rates. Right?

play12:59

And it's

play13:01

more complicated than traditional testing,

play13:03

but it's tractable for like a pretty broad range of the kinds of AB tests

play13:08

I've run in my experience.

play13:11

I've built, you know,

play13:14

there are like open research questions on it.

play13:16

It's like an active area of research still, but it's,

play13:18

I've built whole analytics engines on it with R and Shiny

play13:20

and worked with product managers on it who have found

play13:23

it, you know,

play13:25

it gels really well with how they make decisions and kind of like

play13:27

liberates them into being able to like, oh,

play13:29

I can figure out how a test should work with like dollar outputs.

play13:33

Right?

play13:37

So those are, those are the two methods.

play13:40

And this is,

play13:41

this is the part of the talk where I'll reveal I've, like,

play13:43

failed to pay off on the clickbait title per se.

play13:46

But hopefully there are, like, some useful lessons.

play13:50

So the first one is I'm not trying to sell you on these

play13:53

specific two methods.

play13:55

I don't think there's like a one size fits all approach to

play13:57

AB testing, in your organization.

play14:00

You're going to make decisions differently.

play14:01

You're gonna need to figure out what kind of measurements you

play14:03

need to make to like support those decisions.

play14:06

These have pros, these have cons. There's no silver bullet.

play14:11

But when you observe stakeholders misusing the tools

play14:14

that you have provided them to do analysis, right?

play14:18

It should really cause you to rethink, oh,

play14:20

what is this tool I've handed to them?

play14:22

And does it align with how they make decisions? Right?

play14:25

Does it align with their concerns about risk and cost

play14:28

and time and value and all that important stuff.

play14:31

Am I giving them outputs that map to how they think about the problem?

play14:38

And when I do that,

play14:39

when I go back and I just try and rethink this, right,

play14:41

about the tools that I'm handing them,

play14:44

you really want to get at,

play14:45

am I solving the core problem or am I just solving the

play14:47

symptoms of the misuse that I'm observing, right?

play14:50

Launching on neutral, running a test until significant,

play14:54

like all this stuff are kind of a symptom of the problem that

play14:57

the AB test frameworks we often work with

play14:57

don't deal with the cost of time. Right?

play15:07

And there are like lots of advanced techniques out there,

play15:09

like covariate adjustments and sequential p-values and all

play15:11

this stuff that's out there that like will help a test go

play15:13

faster. Right? And they're great and you use them when you can.

play15:18

But they don't answer the question of like,

play15:20

why does this test need to go so fast?

play15:22

And so they're really just kind of treating the symptom of

play15:25

impatience, right.

play15:30

And this isn't just about AB testing, right?

play15:32

Data scientists sometimes like love a tool and like apply

play15:37

it not super discriminately to problems.

play15:39

And so we end up with like lots of places where like the tools

play15:41

that we hand stakeholders aren't exactly like the perfect fit

play15:44

for how they think about it.

play15:46

AB testing is a really interesting case because it's

play15:48

like a domain where you know,

play15:50

it feels like this is a solved statistical problem.

play15:52

This should be really straightforward and then you go

play15:54

try and use it in practice and it's like,

play15:56

gets messy really fast.

play16:02

But this is, I think,

play16:04

the cool stuff that we get to do.

play16:06

This is like a vaguely weird time for data scientists.

play16:09

It feels like a lot of the problems we used to work on are

play16:11

getting like automated or like outsourced or standardized

play16:16

whatever. Right?

play16:18

But these kinds of misalignments between decision

play16:20

making in an organization and the data science tools used to

play16:23

support those decisions happen like all over the place and all

play16:26

the time in organizations and,

play16:29

identifying those and addressing them by, you know,

play16:32

going back to the first principles problem and really

play16:35

translating that decision making problem into

play16:37

quantitative methods and quantifying the core concepts

play16:40

in that decision making problem is where we can, like,

play16:43

add value, right?

play16:46

And I don't want you to let, like, SaaS vendors and, like,

play16:48

ChatGPT like convince you that these are all solved problems

play16:51

and there's nothing left to do.

play16:52

I think there's a lot of things like this to do out there

play16:54

still. And I think that's what we're here for. Right?

play17:01

And that's all I have. Thanks for coming, everybody.

play17:04

I hope you enjoy the rest of the conference.

play17:13

Alright. Fantastic.

play17:15

So

play17:15

let's see if I can say this correctly.

play17:18

Is there a risk of compounding poorly tested changes into real

play17:23

deterioration of the project? So I've got a, I've got a yes,

play17:26

but can you talk about that a little bit?

play17:28

Yeah. This is asking about the non-inferiority stuff. Right?

play17:30

If you're willing to, like, accept a tiny loss on each test, right?

play17:33

Those

play17:34

start to add up. Yes, that absolutely can happen.

play17:39

The way I think about this is having kind of an aggregate

play17:43

inferiority margin budget over a bunch of tests and going like, well,

play17:48

you can put a margin on this one and this one and this one,

play17:50

but can't like, you know,

play17:51

this is like the total loss that we can kind of accept over

play17:54

like a long sequence of tests or for a year over some unit of

play17:57

time. And so you have to, like,

play17:58

sort of budget out that risk. You

play18:01

should have a risk budget for all these decisions, right?

play18:03

You don't want to think about them in isolation, right?

play18:07

Okay. Fantastic. Let's thank Carl again.


Related Tags
AB Testing, Data Science, Decision Making, Experimentation, Statistics, Risk Analysis, Conversion Rates, Non-inferiority, Testing Strategies, Value Assessment