Regression and Matching | Causal Inference in Data Science Part 1

Emma Ding
5 Jan 2022 | 23:31

Summary

TL;DR: In this video, the host interviews Yuan, a cognitive scientist, about causal inference in data science. They discuss why data scientists should learn causal inference, its importance beyond A/B testing, and how it can be applied to solve real-world problems. The conversation focuses on two key methods: regression and matching. They explore the challenges of using observational data, the concept of confounders, and the pitfalls of regression. The video also introduces propensity score matching as a solution to the limitations of traditional matching methods, with practical examples to illustrate its application.

Takeaways

  • 🔍 Causal inference is crucial in data science for understanding the 'why' behind observed effects, not just the 'what'.
  • 📈 Traditional A/B testing has limitations, such as in social media and marketplaces, where interference between treatment and control groups can skew results.
  • 🧐 Observational data can be used for causal inference but is prone to selection bias if not properly accounted for.
  • 🤔 Understanding confounding variables is key; these are factors that affect both the treatment and the outcome, potentially leading to incorrect conclusions if not controlled.
  • ⚖️ Regression analysis can control for confounders by including them in a model, isolating the effect of the treatment on the outcome.
  • ❗ Pitfalls of regression include overlooking important variables or incorrectly controlling for mediators and colliders, which can lead to spurious correlations.
  • 🔄 Matching methods, such as propensity score matching, can handle arbitrary functional forms of confounder influence and are an alternative to regression.
  • 📊 Propensity score matching involves predicting the probability of treatment and matching users on this score to compare outcomes like engagement or conversion rates.
  • 🛠 Careful model selection and algorithm design are essential for effective propensity score matching to ensure valid causal inferences.
  • 📚 Further methods like difference-in-differences and synthetic control will be discussed in upcoming videos, promising stronger causal claims through different approaches.

Q & A

  • What is causal inference and why is it important in data science?

    -Causal inference is a method used to determine the cause-and-effect relationship between variables. It's important in data science because it allows us to make predictions and decisions that are based on understanding why something happens, rather than just observing that it does.

  • Why might A/B testing not be effective in certain scenarios?

    -A/B testing might not be effective when there is interference between the treatment and control groups, such as in social media platforms or marketplaces where users share a common pool of resources, leading to changes in supply and demand that can affect the outcome.

  • What is the role of confounders in causal inference?

    -Confounders are variables that affect both the treatment and the outcome, potentially leading to biased conclusions if not accounted for. They need to be controlled for to ensure that any observed effects can be attributed to the treatment rather than other factors.

  • How does selection bias impact causal inference from observational data?

    -Selection bias occurs when the groups being compared (e.g., exposed vs. unexposed to a treatment) differ systematically in ways other than the treatment itself. This can lead to incorrect conclusions about the effect of the treatment, as the differences observed might be due to these other factors rather than the treatment.

  • What is the regression method in causal inference and how does it work?

    -The regression method involves fitting a statistical model that includes the treatment variable and the confounders. It aims to estimate the effect of the treatment on the outcome while holding the confounders constant, thus isolating the treatment's impact.
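
As a concrete illustration, here is a minimal sketch of this approach in Python with statsmodels, on simulated data. The column names (`exposed`, `age`, `past_engagement`, `time_spent`) are hypothetical stand-ins for the treatment, confounders, and outcome described above, not variables from the video.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated observational data: `exposed` is the treatment, `age` and
# `past_engagement` are confounders that drive both exposure and the
# outcome `time_spent`. All names and numbers are illustrative.
rng = np.random.default_rng(0)
n = 1_000
age = rng.integers(18, 65, n)
past_engagement = rng.normal(50, 10, n)
p_exposed = 1 / (1 + np.exp(-(past_engagement - 50) / 10))
exposed = rng.binomial(1, p_exposed)
time_spent = (30 + 0.1 * age + 0.5 * past_engagement - 2.0 * exposed
              + rng.normal(0, 5, n))
df = pd.DataFrame({"age": age, "past_engagement": past_engagement,
                   "exposed": exposed, "time_spent": time_spent})

# Fit outcome ~ treatment + confounders. The partial slope on `exposed`
# estimates the treatment effect with the confounders held constant.
model = smf.ols("time_spent ~ exposed + age + past_engagement", data=df).fit()
print(model.params["exposed"])  # close to the simulated true effect, -2.0
```

A naive comparison of group means would be biased here, because `past_engagement` raises both exposure and time spent; the regression recovers the negative effect.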

  • What are the potential pitfalls when using regression for causal inference?

    -Pitfalls include not controlling for all relevant confounders or controlling for the wrong variables. Additionally, assuming the wrong form of relationships (e.g., linear when they are not) or controlling for mediators or colliders can lead to incorrect conclusions.
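
To see why controlling for a collider manufactures a spurious correlation, here is a small simulation (all variables hypothetical): two independent causes both raise a common effect, and selecting on that effect makes the causes look correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
popularity = rng.normal(size=n)  # number of friends (standardized)
insomnia = rng.normal(size=n)    # insomnia severity (standardized)
# Late-night time spent is a collider: a common effect of both causes.
late_night = popularity + insomnia + rng.normal(size=n)

# Unconditionally, the two causes are independent (correlation ~ 0).
print(np.corrcoef(popularity, insomnia)[0, 1])

# "Controlling for" the collider by selecting heavy late-night users
# induces a spurious negative correlation between the causes.
heavy = late_night > 1.5
print(np.corrcoef(popularity[heavy], insomnia[heavy])[0, 1])
```

The same conditioning logic explains the mediator pitfall: adjusting for a variable that lies on the causal path removes part of the genuine effect.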

  • What is matching in causal inference, and how does it differ from regression?

    -Matching involves finding pairs of treated and untreated units that are similar on key characteristics, to compare their outcomes and attribute differences to the treatment. Unlike regression, which statistically controls for confounders, matching creates comparable groups based on observed characteristics.
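
A minimal sketch of exact matching with pandas, using hypothetical columns `age`, `country`, and `degree` as the matching characteristics (the user IDs echo the toy example in the transcript, but the values are made up):

```python
import pandas as pd

# Hypothetical observational data: match treated and untreated users
# that agree exactly on every confounder column.
df = pd.DataFrame({
    "user_id": [3, 6, 9, 15, 21],
    "treated": [1, 1, 0, 0, 0],
    "age":     [25, 34, 25, 34, 52],
    "country": ["US", "DE", "US", "DE", "US"],
    "degree":  ["BS", "MS", "BS", "MS", "PhD"],
    "engagement": [41.0, 38.5, 45.2, 40.1, 30.3],
})

keys = ["age", "country", "degree"]
pairs = df[df.treated == 1].merge(
    df[df.treated == 0], on=keys, suffixes=("_t", "_c"))

# The average outcome difference across matched pairs estimates the
# treatment effect, if matched units differ only in treatment status.
print(pairs[["user_id_t", "user_id_c"]])
print((pairs.engagement_t - pairs.engagement_c).mean())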

  • How does propensity score matching work and why is it useful?

    -Propensity score matching involves predicting the probability of receiving the treatment based on observed characteristics and then matching individuals based on these scores. It's useful because it can control for a multitude of confounding factors by reducing them to a single score, simplifying the matching process.
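
Here is a minimal propensity-score-matching sketch with scikit-learn on simulated data (illustrative, not from the video): a logistic model predicts the propensity, and each treated unit is matched to the untreated unit with the nearest score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
n = 2_000
X = rng.normal(size=(n, 3))                      # observed characteristics
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))  # treatment depends on X
y = X @ np.array([1.0, 0.5, -0.5]) + 2.0 * t + rng.normal(size=n)  # true effect = 2

# Step 1: model the propensity to receive treatment given characteristics.
propensity = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# Step 2: match each treated unit to the nearest-propensity control unit.
treated = propensity[t == 1].reshape(-1, 1)
control = propensity[t == 0].reshape(-1, 1)
nn = NearestNeighbors(n_neighbors=1).fit(control)
_, idx = nn.kneighbors(treated)

# Step 3: compare outcomes of matched pairs to estimate the effect.
effect = (y[t == 1] - y[t == 0][idx.ravel()]).mean()
print(effect)  # should land near 2.0
```

Nearest-neighbor matching is the simplest choice here; in practice one would typically also add a caliper (a maximum allowed score distance) and check covariate balance after matching.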

  • What are the challenges associated with propensity score matching?

    -Challenges include the need for an accurate model to predict propensity scores and the requirement for an efficient algorithm to find good matches. If either of these is not done well, the resulting matched groups may not be comparable, leading to invalid conclusions.

  • Can you provide an example use case for propensity score matching?

    -A use case could be a data scientist at HelloFresh wanting to determine if users who click on ads are more likely to purchase due to the ad. Propensity score matching could be used to control for users' interests in cooking and compare the conversion rates of clickers and non-clickers.
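
A sketch of the two-stage propensity this use case calls for, on simulated data with hypothetical variables: model the probability of being shown the ad, model the probability of clicking given it was shown, and combine them into a single score for matching clickers to non-clickers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5_000
X = rng.normal(size=(n, 2))  # e.g., cooking interest, activity (hypothetical)
shown = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))            # delivery depends on X
clicked = shown * rng.binomial(1, 1 / (1 + np.exp(-X[:, 1])))  # can only click if shown

# Stage 1: probability that the ad is shown, given user characteristics.
p_shown = LogisticRegression().fit(X, shown).predict_proba(X)[:, 1]

# Stage 2: probability of clicking, fitted only on users who saw the ad.
p_click = (LogisticRegression()
           .fit(X[shown == 1], clicked[shown == 1])
           .predict_proba(X)[:, 1])

# Combined score: shown AND clicked. Match clickers to non-clickers on
# this score (e.g., with the nearest-neighbor step sketched earlier),
# then compare conversion rates between the matched groups.
propensity = p_shown * p_click
```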

  • What other causal inference methods will be discussed in future videos?

    -Future videos will cover methods such as difference-in-differences and synthetic control, which are additional techniques for making causal inferences from observational data.

Outlines

00:00

🧐 Introduction to Causal Inference in Data Science

The video begins with a warm welcome and an introduction to the topic of causal inference in data science. The host explains that causal inference is crucial when A/B testing is not feasible and invites a cognitive scientist, Yuan, to share insights. Yuan discusses the importance of causal inference, emphasizing the need to understand 'why' behind product failures to prevent recurrence. The conversation highlights the limitations of A/B testing in certain scenarios, such as social media platforms and marketplaces, where interference between treatment and control groups can skew results. The host and Yuan agree on the necessity of alternative techniques for causal inference when traditional methods fall short.

05:02

🔍 Understanding Selection Bias and Confounders

Yuan explains the concept of selection bias and confounders, using the example of Facebook's new AI system for detecting harmful content. The discussion revolves around the challenge of measuring the impact of harmful content on user engagement due to personalized news feeds. Yuan clarifies that confounders are variables that affect both the treatment and the outcome, and their influence must be controlled to make accurate causal inferences. The conversation also touches on the difference between causal effects and mere observations, highlighting the importance of controlling for confounders to draw valid conclusions from observational data.

10:02

📊 Regression Method for Causal Inference

The conversation shifts to the regression method, a statistical approach to control for confounders. Yuan elaborates on how regression works by including treatment variables and confounders in a model to extract the partial slope of the treatment, which indicates the effect of the treatment on the outcome while holding other variables constant. The discussion points out potential pitfalls in using regression, such as not controlling for all relevant variables or controlling for the wrong ones. The importance of understanding the functional form of connections and the causal structure of variables is emphasized, with directed acyclic graphs (DAGs) introduced as a tool to represent these relationships.

15:03

🀝 Matching Method for Causal Inference

Yuan introduces the matching method as an alternative to regression, particularly useful for dealing with non-linear influences from confounders. The method involves finding untreated units that closely match treated units based on certain characteristics. The discussion addresses the challenges of matching, such as the curse of dimensionality and the difficulty of finding exact matches, especially with smaller datasets. To overcome these challenges, propensity score matching is introduced, which uses a model to predict the probability of treatment and then matches users based on this propensity score, simplifying the matching process and allowing for more efficient comparison of outcomes.

20:04

🛠️ Practical Use Cases and Summary of Causal Inference Methods

The final part of the conversation provides a practical use case for propensity score matching, illustrating how it can be used to evaluate the effectiveness of an ad campaign by HelloFresh. The example demonstrates how matching can help control for selection bias and provide a clearer picture of the ad's impact on user behavior. The host and Yuan summarize the key points of the video, reiterating the importance of controlling for confounders and the limitations of statistical control. They also hint at future discussions on other causal inference methods like difference-in-differences and synthetic control, promising more insights in upcoming videos.

Keywords

💡Causal Inference

Causal inference is a method used in data science to determine cause-and-effect relationships between variables. It's essential for understanding why certain outcomes occur, which can inform decision-making and strategy. In the video, causal inference is discussed as a go-to method when A/B testing is not feasible, helping to uncover reasons behind phenomena like user churn or the impact of harmful content on engagement.

💡A/B Testing

A/B testing is a statistical experiment that compares two versions of a variable to determine which one performs better. It's a common method for finding causal relationships but has limitations, as highlighted in the video. For instance, it may not work well in social media platforms or marketplaces where the treatment group can influence the control group, thus interfering with the ability to draw causal conclusions.

💡Confounders

Confounders are variables that can affect both the treatment and the outcome, potentially distorting the relationship between them. In the context of the video, the example of personalized news feeds is used to illustrate how confounders like age, location, and user interaction history can affect both the exposure to harmful content and user engagement, complicating causal inferences.

💡Selection Bias

Selection bias refers to a bias that results from the method of collecting a sample that is not representative of the population. In the video, it's discussed as a significant issue in observational data, where the difference between exposed and unexposed groups can lead to misleading conclusions about causality. The video emphasizes the importance of controlling for confounders to mitigate selection bias.

💡Regression

Regression is a statistical method used to determine the relationship between a dependent variable and one or more independent variables. In the video, regression is presented as a method for causal inference where confounders are statistically controlled. By including treatment and confounder variables in a regression model, one can estimate the effect of the treatment on the outcome while holding confounders constant.

💡Matching

Matching is a technique used in causal inference to compare similar units that have and have not been exposed to a treatment. The video explains matching as a way to find untreated units that closely resemble treated units based on certain characteristics, allowing for a more accurate assessment of the treatment's effect by controlling for confounders.

💡Propensity Score Matching

Propensity score matching is a method that uses a model to predict the probability of receiving a treatment and then matches individuals based on this score. The video discusses this method as a solution to the challenges of traditional matching, particularly in handling multiple confounders and large datasets. It allows for more efficient matching and comparison of treatment effects.

💡Counterfactuals

Counterfactuals are hypothetical scenarios that explore what would have happened if a different action had been taken. In the video, counterfactuals are mentioned as a basis for stronger causal claims in causal inference. They help in understanding the effect of interventions by comparing actual outcomes with what would have happened in their absence.

💡Natural Experiments

Natural experiments are situations where a change in a policy or condition creates a quasi-experimental setup that can be used for causal inference. The video suggests that natural experiments can provide stronger causal evidence compared to purely observational data, as they often involve random or quasi-random assignment of treatments.

💡Harmful Content

Harmful content refers to any material that can negatively impact users, such as hate speech, harassment, or misinformation. In the video, the impact of harmful content on user engagement is used as a case study to demonstrate how causal inference methods like regression and matching can be applied to real-world problems in data science.

Highlights

Causal inference is crucial in data science for understanding 'why' behind observed effects, not just 'what'.

Causal inference becomes essential when A/B testing is not feasible due to interference between treatment and control groups.

Yuan, a cognitive scientist, shares insights from data science interviews at DoorDash, Quora, and Meta.

Two primary causal inference methods discussed are regression and matching.

Regression is used to control for confounding variables and isolate the effect of a treatment.

Matching methods are an alternative when dealing with non-linear relationships or complex data sets.

Selection bias can lead to invalid conclusions if not properly accounted for in observational data.

Confounders are variables that affect both the treatment and outcome, requiring careful control.

Propensity score matching is introduced as a technique to deal with high-dimensional data and find comparable groups.

The importance of understanding the causal structure of variables is emphasized to avoid common pitfalls in regression analysis.

Mistakes in regression analysis, such as controlling for mediators or colliders, can lead to spurious correlations.

The limitations of regression in handling non-linear relationships and the benefits of matching methods are discussed.

Practical examples, such as measuring the impact of harmful content on user engagement, are used to illustrate causal inference methods.

The video concludes with a teaser for upcoming videos on other causal inference methods like difference-in-differences and synthetic control.

Yuan's blog post on the applications of causal inference in data science is recommended for further reading.

The video emphasizes the importance of causal inference in making data-driven decisions in various industries.

Transcripts

[00:00] Emma: Hey guys, welcome back to the Data Interview Pro channel. In this video and the next few videos, we'll be talking about causal inference in data science. Causal inference has many applications in data science, and it becomes the go-to method when A/B testing doesn't work under certain conditions. To give you guys an intuitive way to understand the topic, I have invited my good friend Yuan to share their knowledge in this field. Yuan is a cognitive scientist who recently went through a few data science interviews and has successfully landed offers from DoorDash, Quora, and Meta. In this video we'll focus on two commonly used causal inference methods, regression and matching, and we will go over how to apply each method to solve actual problems faced by tech companies. Be sure to watch the entire video so you get not only the fundamental knowledge but also the applications. Let's get started.

[00:55] Emma: Hey Yuan, thank you for joining me today. I'm so excited about today's topic, causal inference. Before we dive into specific methods, I think it's helpful that we start with the motivation for learning causal inference. So why should data scientists learn about causal inference?

[01:13] Yuan: Yeah, great question. I can think of two good reasons. The first is something I heard from a data scientist at Lyft, Sean Taylor: data scientists should do more than just say what killed a product; we need to be able to know why, in order to prevent it from happening again. So if you're working for Netflix and notice that five percent of the users churned last month, you need to use some causal inference to know the reason, so that you won't have such a high churn rate in the future. The other reason is that A/B testing is usually the gold standard for finding out causal relationships, but it does not always work.

[01:59] Emma: So can you give us some examples where A/B testing is not working?

[02:03] Yuan: Yeah, I actually learned a lot of examples from your past videos, so I think people should really check those out. One typical example is social media or communication software: it gets really weird if treatment users have a new feature, like reactions, and then they interact with control users who don't have that feature. Another typical example is marketplaces like Uber, DoorDash, or Airbnb. Users in each given market share the same pool of drivers, Dashers, and listings, so the treatment can affect the control users by changing the supply and demand in the market. Because of such interference between the treatment and the control groups, it's hard to make causal claims about the treatment effect. For those reasons, we need alternative techniques to make causal inferences.

[03:04] Emma: Yeah, makes sense. So if A/B testing can be tricky under certain conditions, or for those business models, can we use observational data to draw conclusions? You know, data that are not generated from A/B tests?

[03:20] Yuan: You can do that, but then you may suffer from selection bias, which may lead to invalid conclusions. Let me tell you about a recent example that I saw in the news. Just this month, Facebook launched a new AI system that can quickly learn to detect harmful content, including hate speech, harassment, misinformation, you name it. Quickly detecting harmful content is important for Facebook, or Meta, because this kind of content may negatively impact user engagement. So if you were a data scientist working for Meta, how would you go about measuring the impact of harmful content on users?

[04:12] Emma: Yeah, very good question. I guess a simple way could be: we identify those users who were exposed to negative content, and those who were not, and maybe we compare them in terms of the engagement metrics we care about, such as time spent, content consumption or creation, or user retention.

[04:40] Yuan: Okay, that is a good start, but here is a problem. As we all know, the news feed is personalized for each user based on so-called ranking signals, like your age, the country you live in, the education you received, and how you interacted with past content recommended to you. So users who are exposed to more harmful content may differ from users who are exposed to less harmful content in terms of those aspects. And because of differences in those aspects, the more exposed users may have different engagement levels compared to the less exposed users anyway. So it would be wrong for us to conclude that it is exposure to harmful content that affected engagement. To summarize: variables like the ones I just mentioned can affect both the treatment and the outcome. Those variables are called confounders, and they are the ones that we should watch out for. Does that make sense?

[05:56] Emma: Yeah, that makes sense. So what is the selection bias you mentioned earlier?

[06:01] Yuan: Yeah, let me explain. As mentioned, what we can directly observe from the data is the difference between exposed users and unexposed users in terms of the engagement metrics that you talked about. But as we discussed, these are not the same people, so they may differ in a lot more ways than just whether or not they were exposed to harmful content. Ideally, we would need to look at the same users and imagine, as if in a parallel universe, what their engagement metrics would have been had they not been exposed to harmful content. What makes the causal effect different from the observation is the selection bias. In this case, users who are exposed to more harmful content may be more engaged anyway; then we have a positive selection bias, and we may falsely conclude that harmful content increases engagement. But it may be the other way around: people exposed to more harmful content may be less engaged anyway, and in that case we may reach another false conclusion, that harmful content hurts engagement. Because of the existence of selection bias, it is hard to make valid causal inferences from purely observational data.

[07:45] Emma: So both conclusions would be misleading, right? Given that selection bias exists, how do we make causal inferences then?

[07:54] Yuan: We can statistically control for the confounders that I just described, by holding them at constant values and seeing whether the treatment still affects the outcome or not. This can be done in two ways: regression and matching.

[08:15] Emma: All right, so can you elaborate on the regression method?

[08:19] Yuan: Yeah, for sure. To use regression, we put the treatment variable that we care about, in this case exposure to harmful content, together with the confounders that we just mentioned into a model. Then we fit this model to data and extract the partial slope of the treatment. Basically, the slope tells us, when all the other variables are held at constant values, how much being exposed to harmful content affects engagement, which is the outcome that we care about. We can visualize this using a type of plot called a partial regression plot.

[09:05] Emma: So imagine that the confounders are held at constant values. How do we find those constant values?

[09:15] Yuan: The short answer is that it doesn't matter too much in practice. When fitting the models, we don't manually assign constant values to the confounders; the estimator takes account of their influences. And when we interpret the models, it also doesn't matter: we can plug in any values for the variables that we control for, because everything cancels out when we subtract the equation for the untreated users from the equation for the treated users.
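
To make the cancellation concrete, here is a minimal linear sketch with a single confounder; it is illustrative, not necessarily the exact equation shown in the video. With outcome $Y$, binary treatment $T$, and confounder $C$:

$$Y = \beta_0 + \beta_1 T + \beta_2 C + \varepsilon$$

$$\mathbb{E}[Y \mid T=1, C=c] - \mathbb{E}[Y \mid T=0, C=c] = (\beta_0 + \beta_1 + \beta_2 c) - (\beta_0 + \beta_2 c) = \beta_1$$

Whatever constant $c$ we plug in, the $\beta_2 c$ terms cancel, leaving the partial slope $\beta_1$ as the treatment effect.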

[09:58] Emma: Are there any other things we need to pay attention to, or any pitfalls, when we use regression?

[10:05] Yuan: Yeah, there are loads of those, which I think boil down to two things. One mistake is not controlling for all the things that we should control for, and the other is controlling for the wrong variables. To avoid the first mistake, you need to know the domain really well, to understand what factors are at play in that domain. As for the second, we should pay attention to how variables are connected, and that has two aspects. The first aspect is the so-called functional form of the connections. A quick example: if you use linear regression to control for your confounders, it can only control for confounders that have linear relationships with the outcome. If your confounders have other types of relationships, or even unknown relationships, with the outcome, then linear regression does not do the job. The other aspect has to do with the causal structure of how the variables are connected, and we can use directed acyclic graphs, or DAGs, to represent causal relationships. It's a complicated topic, but today all you need to know is that in a DAG, nodes represent variables and edges show their connections. Here is the DAG that I drew for the harmful-content case.

[11:50] Yuan: Here is one type of mistake. It's usually the case that the more friends you have, the more posts get shared with you by your friends, and the more likely it is that you get exposed to some kind of harmful content. In this case, the number of posts shared with you is a mediator through which the size of your social network impacts your exposure to harmful content. So what happens when we control for mediators? For example, we could look only at users who get a ton of shared posts. Then we might conclude that users with many friends have just as high a risk of being exposed to harmful content as users with few friends. If we control for mediators, we lose genuine relationships in the data.

[12:57] Yuan: And here is another type of mistake you can make, which is controlling for a type of variable called a collider. Colliders are basically common effects of other variables. Let me give you a concrete example. Say you look at users who are super engaged late at night. Chances are, many of those users have a lot of friends, so they have a lot of people to talk to into the night. And others may have insomnia, so they cannot fall asleep. In this case, time spent late at night is a collider of the number of friends you have and whether or not you have insomnia. If we control for this collider, by only looking at users who spend a lot of time on Facebook late at night, we may conclude that popularity has something to do with insomnia. But that would be a spurious correlation that does not actually exist. In reality, it is hard to know what variables are out there that we should control for, let alone how they are connected, so using regression blindly can be really dangerous.

[14:27] Emma: Yeah, that makes sense. Thank you so much for the explanation of the pitfalls. You also mentioned another method, matching. So given that we have regression, why do we need to know matching, and how do we do it?

[14:44] Yuan: Yeah, I kind of answered the "why" question when I talked about regression: linear regression can only control for linear influences from the confounders, but matching can deal with any type of functional form. The "how" question is really tricky. A quick answer is that when using matching, for each treated unit, like an exposed user, we find one or several untreated units, in this case unexposed users, that match it on the characteristics we care about, like the ones we mentioned before. Then, if we dare to make the assumption that those users differ only in whether they got the treatment, and are not different otherwise, we can attribute the difference in their outcomes, in this case engagement, just to the treatment and not to other things. That is the main idea of matching.

[16:04] Emma: Yeah, the idea sounds reasonable to me, but I guess the question is, how do we actually match the untreated units and the treated units?

[16:15] Yuan: Yeah, that is the hard question. The most direct way is exact matching. Here is a toy dataset that I created for the Meta example. We have a group of exposed users and another group of unexposed users, and we have certain characteristics that we want to match them on. We can find exposed users who have exactly the same age, live in the same country, and received the same highest degree as those unexposed users. In this case, user 3 can be matched to user 9, because they have the same values for the variables that we want to control for, and so can user 6, who is matched to user 15. This method can be quite painful. First of all, it suffers from the so-called curse of dimensionality, because we need to go through the matching process as many times as there are variables that we want to control for. Second, unless we have a really big dataset, which I think is no problem for Meta but may be a problem for smaller companies with smaller datasets, it can be hard to find enough exact matches that we can use for comparison later.

[17:51] Emma: So it sounds like for each data point we have, we need to find its pair, and that's why we will have less data that can be used to draw conclusions. Is there any method we could use to address this kind of problem?

[18:08] Yuan: Yeah, we can use propensity score matching. Basically, the idea is that instead of matching users directly on those characteristics, we build a model that takes those characteristics and predicts the probability that a user might be exposed to harmful content. This predicted probability is called a propensity score. So instead of going through the matching process many, many times, we just match unexposed users to exposed users based on a single propensity score. After users are matched, we can compare their engagement levels to see if there's any difference that we can attribute to the treatment. This method solves the two problems that I just mentioned, but it is also very challenging. First of all, we need a good model that can accurately predict the propensity to receive the treatment, in this case the probability of being exposed to harmful content. Second, we need a good algorithm that can quickly find similar users in terms of their propensity scores. Both are really demanding, and if either goes wrong, we cannot find comparable groups on which to base valid conclusions.

[19:53] Emma: Yeah, thank you for explaining that. Given that both the prediction and the matching parts are pretty challenging, what are the use cases of this particular method? Can you give us some examples?

[20:05] Yuan: Let's say you're a data scientist working at HelloFresh, you put out an ad campaign for the food that you're selling, and you want to know whether people who click on the ads are more likely to buy food from you because of the ad or not. Because of the selection bias that we mentioned, it can be misleading just to take the people who click on the ad and the people who don't and compare their conversion rates, because, just like the news feed, ads are also personalized. Users who are shown this ad may have a higher interest in cooking anyway, so they may be more likely to buy food from you even without clicking on the ad. A better way to answer that question would be propensity score matching. First, we use characteristics of the users to predict; in this case the prediction is more complex, because we need to predict how likely it is that a user is shown this ad, and then, after being shown this ad, how likely it is that they're going to click on it. After generating this propensity score for your users, you can match the clickers and non-clickers on this propensity score and compare their conversion rates. That's just one more example, but there are many, many use cases that you can imagine. So despite the difficulty, it is still a very popular method for causal inference.

[21:53] Emma: Yeah, thank you for the example. Could you summarize the two methods we have learned in this video, to help our audience learn them better?

[22:06] Yuan: Yeah, sure. Basically, we learned that when the data is purely observational, it is dangerous to draw causal conclusions, because of the selection bias that we mentioned. To solve that problem, we can use regression or matching to control for a type of variable called a confounder, which can affect both the treatment and the outcome. After the confounders are controlled for, we can be more confident about whether or not the treatment has a true effect on the outcome. But overall, because of the various problems that we just mentioned, statistical control only allows us to make pretty weak causal inferences. Methods based on counterfactuals and natural experiments can help us make stronger causal claims, and we will talk about those methods later.

[23:09] Emma: Sounds good, thank you so much. All right guys, thanks for watching. If you want to learn more about applications of causal inference in data science, Yuan has a great blog post about it, so feel free to check it out. In the next few videos, Yuan and I will talk about other commonly used causal inference methods in data science, such as difference-in-differences and synthetic control. Stay tuned, and we will see you soon.


Related Tags
Causal Inference · Data Science · Regression Analysis · Matching Method · A/B Testing · Selection Bias · Confounders · Propensity Score · Harmful Content · Engagement Metrics