Truth in Data Science | Jaya Tripathi | TEDxYouth@BHS

TEDx Talks
15 Jun 2018 | 15:31

Summary

TLDR: Jaya Tripathi, a principal scientist at MITRE, discusses her role as a data scientist, emphasizing the interdisciplinary nature of the field. She highlights the importance of domain expertise, statistical methods, and machine learning in extracting knowledge from data. Tripathi shares her experience with hypothesis testing in medical research, where she set out to predict prescription fraud by analyzing data from Prescription Drug Monitoring Programs. She also explores data through visualization techniques like histograms, clustering, and geospatial analysis, leading to significant findings such as a potential fraud case and a heat-map hotspot flagged two months before an HIV outbreak in Scott County, Indiana. She concludes by advocating for passion in research, the importance of asking questions, and the scientific method in data analysis.

Takeaways

  • 🔬 Jaya Tripathi is a principal scientist at MITRE, focusing on data science; her work is funded by MITRE's internal innovation program.
  • 📊 Data science involves interdisciplinary approaches, including domain expertise, math, statistics, visualization, and graph analytics.
  • 🔍 The concept of absolute truth does not exist in data science, especially in medical research, due to the variability of observable properties.
  • ⚖️ Hypothesis testing is a key method in scientific research, where one begins with a hypothesis, makes assumptions, and tests them through models and data analysis.
  • 💊 Tripathi's research involved predicting prescription fraud by analyzing data elements like travel distance and drug combinations.
  • 📈 The CRISP-DM process was used for data mining, emphasizing the importance of understanding and preparing data before modeling.
  • 📊 Visualization techniques, such as histograms and clustering, were used to explore data and identify patterns that could inform machine learning models.
  • 🗺 Geospatial analysis, including heat maps, was used to identify hotspots for drug use, which led to significant findings and interventions.
  • 🔎 The importance of not just accepting the obvious explanation in data analysis was highlighted, encouraging deeper exploration to uncover underlying issues.
  • 📚 Tripathi suggests further reading on Simpson's paradox, the UC Berkeley gender bias study, and the "man in the cave" allegory from Plato's Republic to understand data science concepts better.

Q & A

  • What is Jaya Tripathi's profession and where does she work?

    -Jaya Tripathi is a Principal Scientist at the MITRE Corporation in Bedford, Massachusetts.

  • What is the purpose of the innovation program at the MITRE Corporation?

    -The innovation program at the MITRE Corporation is an internal research program that fosters creativity. It funded Tripathi's work over the past seven or eight years.

  • What does Jaya Tripathi consider as the 'sexiest job' of the 21st century?

    -Jaya Tripathi refers to a quote from the Harvard Business Review, which states that a data scientist has the 'sexiest job' of the 21st century.

  • What interdisciplinary techniques does a data scientist like Jaya Tripathi use?

    -Data scientists use a combination of domain expertise, math, statistics, visualization techniques, and graph analytics, among other scientific methods and algorithms, to extract knowledge from data.

  • Why does Jaya Tripathi believe the concept of absolute truth does not exist in data science, particularly in medical research?

    -Jaya Tripathi explains that in medical research, data is based on observable properties that can vary over time or across different data sets, and the concept of absolute truth can introduce biases, which contradicts the scientific method.

  • What is the significance of hypothesis testing in Jaya Tripathi's research?

    -Hypothesis testing is a method of scientific research where Jaya Tripathi starts with a question, makes assumptions, and creates tests to validate those assumptions. It helps in building models and determining the confidence of results, leading to the acceptance or rejection of initial assumptions.

  • Can you explain the CRISP-DM process Jaya Tripathi mentioned for data mining?

    -The CRISP-DM process is a cross-industry standard for data mining that Jaya Tripathi follows. It involves understanding the data, preparing it by removing duplicates and correcting errors, modeling, and finally evaluating and validating the results.

  • What was Jaya Tripathi's hypothesis regarding prescription fraud by physicians?

    -Jaya Tripathi hypothesized that she could predict prescription fraud by physicians based on data elements such as the distance traveled by the person, certain combinations of drugs prescribed, and other related factors.

  • How did Jaya Tripathi use geospatial analysis in her research?

    -Jaya Tripathi used geospatial analysis, specifically heat maps, to visualize data on drug distribution and usage across different regions. This helped in identifying hotspots for further investigation, such as the HIV epidemic in Scott County, Indiana.

  • What is Jaya Tripathi's advice for conducting good research?

    -Jaya Tripathi suggests that one must be passionate about the research topic, make observations, ask interesting questions, formulate hypotheses, gather the correct data, test predictions, and generalize findings with other datasets.

Outlines

00:00

🔬 Introduction to Data Science and Research Methodology

Jaya Tripathi, a principal scientist at MITRE, introduces herself and her work in the field of data science. She notes that MITRE operates several federally funded research and development centers (FFRDCs) and that her work is funded by its internal innovation program. Tripathi emphasizes the importance of data in forming opinions and decisions, quoting Sherlock Holmes and highlighting the 'sexiest job' of the 21st century as described by the Harvard Business Review. She outlines the interdisciplinary nature of data science, which includes domain expertise, mathematics, statistics, and various analytical techniques. Tripathi also touches on the concept of truth in data science, contrasting it with absolute truths found in mathematics and religious texts, and explains how evidence-based research forms the basis of medical research.

05:02

📊 Data Mining and Hypothesis Testing in Medical Research

The speaker delves into the process of data mining using the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology. She discusses the importance of understanding and preparing data, including data quality assessment and handling duplicates or missing values. Tripathi then focuses on the modeling phase, where she employs predictive machine learning methods such as support vector machines, neural networks, and random forests. She emphasizes the iterative nature of this process, adjusting features and models based on outcomes. The speaker shares her experience with hypothesis testing, using her own research on prescription fraud as an example. She explains how she formulated a hypothesis, gathered data from sources like the Prescription Drug Monitoring Program (PDMP), and tested her assumptions to predict fraudulent prescriptions by physicians.
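
The data preparation step described above (removing duplicates and records with missing values) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the speaker's pipeline: the field names (`rx_id`, `drug`, `miles`) and the sample records are hypothetical, and dropping incomplete rows is only one of several ways to handle missing values.

```python
def clean_records(records, required_fields):
    """Drop rows with missing required fields, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Treat None or an empty string as a missing value.
        if any(rec.get(field) in (None, "") for field in required_fields):
            continue
        # Two rows are duplicates if all required fields match.
        key = tuple(rec.get(field) for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


# Hypothetical prescription records, for illustration only.
raw = [
    {"rx_id": "A1", "drug": "oxycodone", "miles": 12},
    {"rx_id": "A1", "drug": "oxycodone", "miles": 12},   # duplicate
    {"rx_id": "B7", "drug": None, "miles": 140},         # missing drug
    {"rx_id": "C3", "drug": "hydrocodone", "miles": 3},
]

cleaned = clean_records(raw, required_fields=("rx_id", "drug", "miles"))
print(len(raw), "raw rows ->", len(cleaned), "clean rows")
```

In practice a data quality assessment would also look at value ranges and spurious entries before any rows are discarded.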

10:03

🔎 Exploratory Data Analysis and Visualization Techniques

Tripathi presents her approach to exploratory data analysis, beginning with basic visualizations like histograms to understand the distribution of data. She discusses gender and age differences in ADHD drug prescriptions, noting the higher prevalence in male children and the potential for deeper investigation beyond surface-level explanations. The speaker also explores clustering as an unsupervised learning technique to identify patterns in data without prior assumptions. She uses this method to predict 'bad doctors' based on overlapping prescriptions and highlights the importance of not stopping at the first plausible explanation. Tripathi further discusses the use of geospatial analysis and heat maps to identify hotspots of drug activity, which led to the discovery of an HIV epidemic in Scott County, Indiana. She concludes by emphasizing the value of these techniques in guiding resource allocation and educational outreach.
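
The clustering idea mentioned above (letting the data separate itself into two groups without labels) can be sketched with a tiny one-dimensional 2-means routine. This is an assumption-laden illustration: the talk does not say which clustering algorithm was used, and the feature here (days of overlapping prescriptions per patient) and its values are hypothetical.

```python
def two_means_1d(values, iterations=20):
    """Split 1-D values into two clusters with a basic k-means loop (k=2)."""
    centers = [min(values), max(values)]  # start from the two extremes
    for _ in range(iterations):
        groups = ([], [])
        for v in values:
            # Assign each value to the nearer center.
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[idx].append(v)
        # Recompute each center as the mean of its group (keep it if empty).
        new_centers = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


# Hypothetical "days of overlapping prescriptions" per patient.
overlap_days = [0, 2, 1, 3, 0, 45, 60, 52, 1, 58]

centers, groups = two_means_1d(overlap_days)
print("cluster centers:", centers)
print("low-overlap group:", groups[0])
print("high-overlap group:", groups[1])
```

The routine does not say which group is "good" or "bad"; a clean split only suggests the feature may be a useful input to a later supervised model.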

15:04

🎓 Conclusion and Recommendations for Research

In her concluding remarks, Tripathi stresses the importance of passion in research and the process of making observations, asking questions, and formulating hypotheses. She advises on the necessity of gathering appropriate data to test predictions and the iterative process of accepting or rejecting hypotheses. The speaker also suggests generalizing findings across different datasets and offers further reading, including Simpson's paradox, the UC Berkeley gender bias study, and the "man in the cave" allegory from Plato's Republic, to deepen understanding of data science concepts and their implications.
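
Simpson's paradox, one of the suggested readings, is easy to see with a small worked example. The numbers below are hypothetical and chosen only to show the effect; they are not the real UC Berkeley admissions figures.

```python
# Hypothetical (admitted, applied) counts per department and group.
data = {
    "dept_A": {"men": (80, 100), "women": (18, 20)},
    "dept_B": {"men": (4, 20), "women": (30, 100)},
}


def overall(group):
    """Admission rate for a group after pooling all departments."""
    admitted = sum(dept[group][0] for dept in data.values())
    applied = sum(dept[group][1] for dept in data.values())
    return admitted / applied


for name, groups in data.items():
    rates = {g: round(a / n, 2) for g, (a, n) in groups.items()}
    print(name, rates)

print("overall", {g: round(overall(g), 2) for g in ("men", "women")})
```

Within each department the women's rate is higher, yet the pooled rate is higher for men, because most women applied to the department with the lower admission rate. This is the same lesson as the talk's advice not to stop at the first plausible explanation.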

Keywords

💡Principal Scientist

A Principal Scientist is a senior research professional who leads scientific projects and teams, often with a significant degree of autonomy and responsibility for the direction and success of research initiatives. In the context of the video, Jaya Tripathi, the speaker, holds this title at the MITRE Corporation, indicating her significant role in driving scientific research and innovation within the organization.

💡Federally Funded Research and Development Centers (FFRDCs)

FFRDCs are unique independent organizations that assist the U.S. government with scientific research and analysis, technical expertise, and scientific knowledge. They are funded by the government but operate autonomously, allowing for objective research. In the video, the speaker's organization operates several FFRDCs, emphasizing its commitment to advancing research in the public interest.

💡Innovation Program

An Innovation Program typically refers to an initiative designed to foster creativity and new ideas within an organization. It often provides resources and support for research and development. In the video, the speaker's work has been funded by such a program, highlighting the importance of internal support for driving innovation and research in data science.

💡Data Scientist

A Data Scientist is a professional who analyzes and interprets complex digital data, such as those used in scientific research, to help businesses make decisions. They are skilled in statistics, data analysis, and often programming. The speaker identifies as a Data Scientist, emphasizing the video's focus on the role of data in scientific research and its potential for uncovering insights.

💡Hypothesis Testing

Hypothesis testing is a scientific method used to make decisions about claims based on data. It involves formulating a hypothesis, gathering data, and then testing the hypothesis to determine its validity. The speaker discusses hypothesis testing as a key method in her research, showing how it's used to explore and validate assumptions about data.
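
The question "how confident are you that the result is not due to chance" can be made concrete with a permutation test. This is a minimal sketch of one common technique, not the method used in the talk (which is not specified). The two samples are hypothetical travel distances in miles.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(
        sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    )
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels and recompute the difference of means.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical miles traveled to fill a prescription.
flagged = [120, 95, 140, 110, 130, 105]
others = [8, 12, 5, 20, 15, 9]

p_value = permutation_test(flagged, others)
print("estimated p-value:", p_value)
```

A small p-value means a difference this large rarely appears when the group labels are assigned at random, which is evidence against the "no difference" assumption rather than proof of fraud.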

💡Evidence-Based

Evidence-based practices refer to actions or decisions made based on the best available evidence, often from data or research. In the context of the video, the speaker mentions that medical research is evidence-based, emphasizing the importance of data in understanding and addressing medical phenomena.

💡Machine Learning

Machine Learning is a subset of artificial intelligence that provides systems the ability to learn and improve from experience without being explicitly programmed. The speaker uses machine learning techniques, such as support vector machines and random forests, to predict and analyze data, showcasing the application of advanced analytics in data science.

💡Predictive Analytics

Predictive analytics is the use of data, statistical algorithms, and machine learning techniques to identify the likelihood of future outcomes based on historical data. The speaker discusses using predictive analytics to forecast prescription fraud by physicians, demonstrating how predictive models can be used to analyze and prevent fraudulent activities.

💡Geospatial Analysis

Geospatial analysis involves the examination of data in relation to its geographical context. In the video, the speaker uses geospatial analysis to create heat maps that visualize data distribution across geographic areas, such as the prevalence of certain drugs. This technique helps identify patterns and anomalies that could inform policy decisions or resource allocation.
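
The talk stresses that heat maps must be normalized by population, otherwise the largest city always looks like the hotspot. The sketch below shows that one step only (the talk also normalizes by livable land area, which is omitted here). Area names, counts, and populations are hypothetical.

```python
def rate_per_1000(counts, populations):
    """Prescriptions per 1,000 residents for each area with known population."""
    return {
        area: 1000 * counts[area] / populations[area]
        for area in counts
        if populations.get(area, 0) > 0
    }


# Hypothetical prescription counts and populations per area.
counts = {"urban_core": 5000, "rural_a": 900, "rural_b": 60}
populations = {"urban_core": 800_000, "rural_a": 24_000, "rural_b": 12_000}

rates = rate_per_1000(counts, populations)
top_raw = max(counts, key=counts.get)
top_rate = max(rates, key=rates.get)

print("highest raw count:", top_raw)
print("highest per-capita rate:", top_rate, round(rates[top_rate], 2))
```

With raw counts the urban area dominates; after normalization a small rural area stands out, which is the kind of signal a heat map is meant to surface.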

💡Data Visualization

Data visualization is the graphical representation of information and data. It helps in understanding data by providing a visual context, making it easier to interpret. The speaker uses various types of visualizations, such as histograms and heat maps, to explore and communicate findings from her data analysis, highlighting the importance of visual tools in data science.
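
An age and gender histogram like the one described in the talk can be computed before any plotting library is involved. This is a rough sketch with hypothetical `(gender, age)` records; the bin width of 10 years is an arbitrary choice for illustration.

```python
from collections import Counter


def age_histogram(records, bin_width=10):
    """Count records per age bin, separately for each gender."""
    hist = {}
    for gender, age in records:
        bin_start = (age // bin_width) * bin_width
        hist.setdefault(gender, Counter())[bin_start] += 1
    return hist


# Hypothetical (gender, age) pairs for filled prescriptions.
records = [
    ("M", 6), ("M", 8), ("M", 9), ("M", 34),
    ("F", 7), ("F", 36), ("F", 41),
]

hist = age_histogram(records)
for gender in sorted(hist):
    for bin_start in sorted(hist[gender]):
        bar = "#" * hist[gender][bin_start]
        print(f"{gender} {bin_start:>2}-{bin_start + 9:<2} {bar}")
```

Comparing the two text histograms side by side shows at a glance which age bins differ between groups, which is the prompt for further investigation.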

Highlights

Jaya Tripathi, a principal scientist at MITRE, discusses her work funded by the innovation program.

Emphasizes the importance of data in decision-making, quoting 'without data, you're just another person with an opinion'.

Cites Sherlock Holmes on the mistake of theorizing before having data.

Data scientists are described as having the 'sexiest job of the 21st century' according to Harvard Business Review.

Interdisciplinary approach of a data scientist includes domain expertise, math, statistics, and visualization techniques.

Discusses the concept of truth in data science, contrasting it with absolute truth in mathematics.

Explains the scientific method's role in data science, particularly in hypothesis testing.

Shares a personal project predicting prescription fraud by analyzing data elements like travel distance and drug combinations.

Describes the CRISP-DM process for data mining, emphasizing the importance of understanding and preparing data.

Details the use of predictive machine learning methods like support vector machines and random forests in modeling.

Highlights the iterative process of model building, feature selection, and validation in data science.

Exploration of data through visualization techniques like histograms reveals insights into prescription patterns.

Uncovers a potential case of prescription fraud by further investigating data that initially seemed explained.

Discusses the significance of not just looking for the obvious explanation and delving deeper into data.

Uses clustering to identify patterns in data, suggesting potential inputs for machine learning models.

Presents a hypothesis on the correlation between certain drug combinations and cash payments, validated through data visualization.

Demonstrates the use of geospatial heat maps to identify potential hotspots for further investigation, leading to significant findings.

Concludes with advice on the importance of passion in research and the process of making observations, formulating hypotheses, and generalizing findings.

Transcripts

play00:02

[Music]

play00:09

okay good so I'm Jaya Tripathi and I am

play00:14

a principal scientist at the MITRE

play00:17

corporation here in Bedford

play00:18

Massachusetts next please

play00:21

just a line about MITRE so MITRE works

play00:25

in the public interest and we operate

play00:28

several FFRDCs federally funded research

play00:32

and development centers so one of the

play00:35

programs we have is an internal research

play00:38

program that fosters creativity it's

play00:40

called the innovation program and the

play00:44

work that I've been doing for the last

play00:45

seven or eight years was funded by this

play00:47

research program the innovation program

play00:50

next please just a couple of quotes on

play00:53

data because that's what I do I'm a data

play00:55

scientist then I really like one without

play00:58

data you're just another person with an

play01:01

opinion I don't see the slides here

play01:04

that's okay you're just another person

play01:09

with an opinion there's another one that

play01:10

I like next please from so this is from

play01:17

Sherlock Holmes a study in scarlet from

play01:19

Sir Arthur Conan Doyle's it is a mistake

play01:22

to theorize before one has data next

play01:27

okay so Harvard says I have the sexiest

play01:30

job this is a quote from the Harvard

play01:32

Business Review a data scientist the

play01:34

sexiest job of the 21st century so

play01:36

that's me

play01:38

next so what does a data scientist do

play01:43

what does someone like me do first it's an

play01:45

interdisciplinary approach so we use a

play01:48

lot of the techniques some of which are

play01:50

you must know your domain you must have

play01:53

domain expertise you use math statistics

play01:57

visualization techniques graph analytics

play02:00

and so on so essentially we use all

play02:03

these scientific methods algorithms

play02:06

subject matter expertise to try and

play02:09

extract knowledge from data and explain

play02:12

the phenomenon that you're trying to

play02:15

address next please

play02:18

so you are familiar with truth in an

play02:21

informal way in in terms of ethics

play02:24

philosophy religion you've read about it

play02:26

in the Bible and in math we there is a

play02:29

concept of absolute truth we use proofs

play02:32

like reductio ad absurdum as an

play02:35

example which is proof by contradiction

play02:37

to establish the absolute truth however

play02:40

this notion of absolute truth does not

play02:42

exist in data science particularly when

play02:45

applied to medical research why is that

play02:47

why because in medical research it's

play02:51

based on observable properties of the

play02:53

phenomenon something you measure you

play02:55

must have heard the term evidence-based

play02:56

and what you measure and what you

play02:58

observe today may be different in a

play03:01

different data set or at a different

play03:02

point in time also if you subscribe to

play03:07

the notion of absolute truth right it

play03:09

violates the scientific method because

play03:12

you're bringing in preconceived notions

play03:13

and biases lastly the truth in

play03:17

scientific method is relative to the

play03:19

context or the model and there may be

play03:21

different pathways or different models

play03:23

in which to approach your problem next

play03:27

please

play03:27

so one method of scientific research

play03:31

that I want to talk about is hypothesis

play03:33

testing and there's a quote there from

play03:35

Edward Teller that I like so fact is a

play03:37

simple statement that everyone believes

play03:39

right it is innocent until

play03:42

found guilty with the hypothesis it's

play03:45

guilty until found effective so you

play03:47

begin with the hypothesis right you ask

play03:51

yourself an interesting question you

play03:52

begin with the hypothesis you make some

play03:54

assumptions and then you create tests to

play03:57

test those assumptions and you build

play03:59

models and then how confident are you

play04:01

that the result is not due to chance and

play04:03

once you're confident you stick with

play04:05

your hypothesis otherwise you revert to

play04:08

the alternate hypothesis and you accept

play04:11

or reject the initial assumption in

play04:12

favor of the alternate position and then

play04:14

you go on and try and expand your

play04:16

hypotheses to other data sets to

play04:18

generalize so here was i have done

play04:22

several projects one of which was my

play04:25

hypothesis was that I can predict a

play04:28

particular kind of fraud

play04:29

committed by prescribers or physicians a

play04:32

particular kind of prescription fraud

play04:34

solely by looking at data sets that

play04:37

have certain data elements data elements

play04:39

have to do with the distance the person

play04:40

traveled certain combinations of drugs

play04:44

and so on so that was my hypothesis and

play04:45

the first thing I did was to go and seek

play04:48

data that would be suitable to test my

play04:50

hypothesis one of which was the PDMP

play04:52

prescription drug monitoring programs

play04:54

all the states have them they are

play04:57

prescriptions reported by pharmacies

play05:01

once a controlled substance is filled so

play05:04

an oxycodone or hydrocodone a benzo and

play05:06

benzodiazepine and so on next please so

play05:11

here is once I have my hypothesis I seek

play05:14

the appropriate data and this is an

play05:16

industry standard for data mining it's

play05:18

called the CRISP cross industry standard

play05:21

process for data mining and the first

play05:23

step is understanding your data well you

play05:25

really need to spend a lot of time

play05:27

understanding the data well do the data

play05:30

quality assessment and then the

play05:34

next stage is preparation removing

play05:36

duplicates and spurious data or missing

play05:40

values and so on the part that I would say I

play05:43

spent the most time on is on modeling so

play05:46

there's several different techniques

play05:47

there but I used some predictive

play05:51

machine learning methods examples of

play05:54

which are support vector machines you

play05:56

all must have heard of artificial neural

play05:58

networks random forests and so on and

play06:01

that's a cyclical process so you try

play06:04

certain essentially you're trying to

play06:06

predict some things in this case I'm

play06:08

trying to predict who the bad doctors

play06:10

are and so you try different

play06:13

features to feed into your model and

play06:16

then if that doesn't work out you come

play06:17

back and change your features and so on

play06:19

so it's a cyclical process and

play06:21

in the end is evaluation and validation

play06:24

next slide please

play06:26

okay so I'm just going to show the next

play06:28

three slides and then I'll end after

play06:30

that I'm going to show you some examples

play06:31

of exploration so before I get into

play06:36

the modeling and the machine learning I

play06:38

did some exploration on the data and

play06:40

visualization histogram you all are

play06:42

familiar with histograms two basic

play06:44

concepts on this chart here I have male

play06:46

children on the upper graph and females

play06:48

on the lower one and I'm just doing an

play06:50

age gender histogram on the

play06:52

prescriptions that were filled remember

play06:53

these are all controlled substance

play06:55

prescriptions what do you see what it

play06:58

since the audience is small enough it's

play06:59

okay for you to shout out what do you

play07:01

see that's different in the two graphs

play07:05

correct so on the on the right side

play07:09

sorry your name is yeah and she pointed

play07:13

out the difference mainly is in children

play07:17

under the age of 10 years right and so

play07:20

it's worth exploring and a histogram

play07:22

is something that's very simple to do

play07:23

and you should always do your basic univariate and

play07:25

bivariate statistics in the beginning so

play07:28

I look at it and then I see that most of

play07:30

the drugs almost all the drugs were ADHD

play07:32

drugs Ritalin and Adderall and you look

play07:35

up literature and you research it and

play07:37

you find that there is a two point

play07:41

something times greater prevalence of

play07:44

that diagnosis in male children than

play07:46

female children so at this point a lot

play07:49

of data scientists would have said well

play07:50

that explains it you know there's a

play07:52

that's a roughly two point five times

play07:54

over there's a 2.5 times greater

play07:56

prevalence of ADHD and that explains it

play07:59

but something in me wanted to delve

play08:00

further so I look further and and I

play08:03

think this this is the point that I'm

play08:04

trying to make is don't just look for

play08:09

the obvious explanation it's worth it to

play08:12

spend a little bit more time and delve

play08:13

further so I looked further and I found

play08:15

actually most of these prescriptions

play08:18

were filled by a pharmacy in a different

play08:21

state more than 100 miles away and there

play08:23

were several pharmacies much closer and

play08:27

without revealing much more let's just

play08:32

say that this led to a case next slide

play08:37

please

play08:38

so here I am before I'm getting into

play08:41

complex machine learning models or doing

play08:43

any

play08:43

artificial intelligence applications this is me

play08:45

just trying to get a feel for the data

play08:47

so essentially after I did the data

play08:49

quality assessment you look for distributions

play08:51

you look to see if your data is Gaussian

play08:52

because certain models require certain

play08:55

distributions and in this case I'm just

play08:58

hypothesizing and and doing certain

play09:01

visualizations so the the upper-right

play09:04

one I am doing clustering

play09:06

so what clustering is it's a method of

play09:08

unsupervised machine learning where what

play09:10

is the data saying about itself you have

play09:12

no bias no prior knowledge of the data

play09:14

can if I'm trying to make two classes

play09:18

for example it's called a binary

play09:19

classifier two classes bad guys good guys

play09:21

for example that's as simple as

play09:23

classifier can the data distribute

play09:26

itself into two groups and I'm not

play09:28

saying which one's bad which one's good

play09:29

but it's just the data so here I am

play09:31

trying to classify the people based on

play09:36

who did multiple drugs at the same time

play09:39

so not one drug in January and for 30

play09:42

days then another one in August but

play09:44

overlaps and overlaps across different

play09:47

pharmacological groups and do you guys

play09:49

see that neat clustering into two

play09:52

groups that's promising

play09:54

and so that tells me this is an input

play09:59

that I should use in my input vector

play10:01

when I do machine learning so this is

play10:03

Annie tree unique uniquely separable

play10:05

data here this is another hypothesis I

play10:08

have so in this case I'm looking for

play10:11

these dots are all prescribers or

play10:13

doctors red and blue the red ones are

play10:16

the ones that prescribe the certain

play10:18

combination of drugs for which there is

play10:20

no medical legitimacy by that I mean

play10:23

there are certain combinations of drugs

play10:24

which no doctor would ever give why

play10:26

because it leads to 12 or more times

play10:29

greater chance of respiratory depression

play10:31

coma or even death and so I have

play10:34

hypothesized that most of those guys who

play10:37

did this particular combination of drugs

play10:40

there were cash payments involved

play10:42

that's a hypothesis right there just

play10:44

guess sometimes the hypothesis is based on

play10:46

some some sort of expert knowledge and

play10:49

sometimes it is

play10:50

we're trying up so I wanted to show you

play10:53

here so the blue dots and and the red do

play10:57

you see so what's my hypothesis correct

play11:01

or wrong what do you see what do you see

play11:03

from that graph okay

play11:07

so I'll give you the answer it's you

play11:09

could more or less say that it's kind of

play11:12

a linear along the 45-degree line right

play11:14

and so here I have Medicare payments and

play11:16

you have cash so my hypothesis was that the

play11:18

people who patients who went to doctors

play11:22

who gave the sort of combination that's

play11:23

a no-no had some sort of cash payments

play11:27

mixed in to go under the radar and I

play11:30

would say that's not entirely true

play11:33

hypothesis right because it's it's along

play11:36

the 45-degree line however there is

play11:39

something happening because it's below

play11:42

this this vertical line here most of the

play11:46

people who did that also had a lot of

play11:49

prescriptions there's a logarithmic

play11:50

scale so had a lot of prescriptions

play11:52

these are examples of visualizing the data

play11:54

next please

play11:56

so here's a technique called geospatial

play12:00

or heat maps and hotspots you must have

play12:01

seen this all the time this is the map

play12:03

of Indiana because that's where I got

play12:04

the data and this is normalized by

play12:07

livable land area and by population

play12:10

because if you don't normalize by

play12:11

population Indianapolis in the center

play12:14

will always turn out bright red because

play12:15

there's lots of people there so you have

play12:17

to normalize by population and ZCTA

play12:19

for the most part is the same as the US

play12:21

postal zip code postal service construct

play12:25

is more or less the same thing zip code

play12:27

tabulation area ZCTA but I take ZCTA

play12:30

because it gives me access to other

play12:31

demographic information so I can tie in

play12:33

the zip codes or the ZCTAs with income

play12:38

overdose deaths education and so on so

play12:41

I can do that and so here this is a heat

play12:44

map on a particular drug called Opana so

play12:46

I did heat maps on many different drugs

play12:47

the top 10 drugs

play12:49

then on polypharmacy combinations of

play12:51

drugs and also overlaid with other

play12:54

datasets from CDC and other places so

play12:57

what sticks out here the the reds

play13:01

the red areas Hartford City and Scott

play13:03

Scott Scottsburg now I I don't really

play13:07

live in Indiana so I gave it to the

play13:09

Health and Human Services and the health

play13:10

commissioner from Indiana and asked him

play13:13

to look into it and we don't have any

play13:14

plausible explanation for why why this

play13:16

is red these are you know largely rural

play13:18

areas and no one knows what's going on

play13:21

so guess what two months after I gave

play13:23

the state of Indiana this chart there

play13:26

was a huge HIV epidemic that broke out

play13:28

in Scott County

play13:30

a relatively large percent of

play13:32

population had HIV this was in 2014 and the

play13:36

CDC and the federal government had to

play13:38

intervene and give aid and so on and at

play13:42

the conclusion of the investigation it

play13:44

turned out that the HIV epidemic could

play13:46

be attributed to people sharing needles

play13:50

for Opana so here's a technique

play13:53

that's validated kind of this is my kind

play13:57

of truth it's validated because there's

play14:00

a heat map that said you know

play14:04

explore this region further and States

play14:08

typically have limited resources so you

play14:10

if you not in this particular example

play14:12

but other examples for example if you

play14:14

want to see where law enforcement query

play14:16

should be police could decide where to

play14:19

place the drug diversion officers and

play14:21

along the heat maps you could focus your

play14:23

education outreach programs and then in

play14:26

the red spot areas and so on next please

play14:31

so in conclusion first we you can't do

play14:37

really good research unless you really

play14:38

passionate about the topic so pick a

play14:40

topic that that really speaks to you

play14:43

that you're passionate about make

play14:44

observations ask interesting questions

play14:47

formulate hypotheses and develop the

play14:50

predictions that you can test gather the

play14:52

correct data test your predictions

play14:54

accept or reject your hypotheses and if

play14:57

accepted then see if you can generalize

play14:59

it and validate with other datasets for

play15:02

further reading

play15:03

may I suggest reading about Simpson's

play15:05

paradox the UC Berkeley gender bias study

play15:09

how many of you are familiar with it

play15:12

good and then the man in the cave

play15:15

allegory from Plato's Republic thank you

play15:18

for having me

play15:24

you


Related Tags
Data Science, Medical Fraud, Research Methods, Analytics, Machine Learning, Healthcare Data, Predictive Modeling, Ethics in Data, Data Visualization, Scientific Method