Truth in Data Science | Jaya Tripathi | TEDxYouth@BHS
Summary
TLDR: Jaya Tripathi, a principal scientist at MITRE, discusses her role as a data scientist, emphasizing the interdisciplinary nature of the field. She highlights the importance of domain expertise, statistical methods, and machine learning in extracting knowledge from data. Tripathi describes hypothesis testing through her own research, in which she set out to predict prescription fraud using data from state Prescription Drug Monitoring Programs. She also explores data through visualization techniques like histograms, clustering, and geospatial analysis, which led to findings such as a potential fraud case and a heat-map hotspot in Scott County, Indiana, flagged two months before an HIV outbreak there. She concludes by advocating for passion in research, the importance of asking questions, and the scientific method in data analysis.
Takeaways
- 🔬 Jaya Tripathi is a principal scientist at MITRE, where her data science research is funded by the internal innovation program.
- 📊 Data science involves interdisciplinary approaches, including domain expertise, math, statistics, visualization, and graph analytics.
- 🔍 The concept of absolute truth does not exist in data science, especially in medical research, due to the variability of observable properties.
- ⚖️ Hypothesis testing is a key method in scientific research, where one begins with a hypothesis, makes assumptions, and tests them through models and data analysis.
- 💊 Tripathi's research involved predicting prescription fraud by analyzing data elements like travel distance and drug combinations.
- 📈 The CRISP-DM process was used for data mining, emphasizing the importance of understanding and preparing data before modeling.
- 📊 Visualization techniques, such as histograms and clustering, were used to explore data and identify patterns that could inform machine learning models.
- 🗺 Geospatial analysis, including heat maps, was used to identify hotspots for drug use, which led to significant findings and interventions.
- 🔎 The importance of not just accepting the obvious explanation in data analysis was highlighted, encouraging deeper exploration to uncover underlying issues.
- 📚 Tripathi suggests further reading on Simpson's paradox, the UC Berkeley gender bias study, and the man in the cave allegory from Plato's Republic to understand data science concepts better.
Q & A
What is Jaya Tripathi's profession and where does she work?
-Jaya Tripathi is a Principal Scientist at the MITRE Corporation in Bedford, Massachusetts.
What is the purpose of the innovation program at the MITRE Corporation?
-The innovation program at MITRE is an internal research program that fosters creativity. It funded the work Tripathi has done over the last seven or eight years.
What does Jaya Tripathi consider as the 'sexiest job' of the 21st century?
-Jaya Tripathi refers to a quote from the Harvard Business Review, which states that a data scientist has the 'sexiest job' of the 21st century.
What interdisciplinary techniques does a data scientist like Jaya Tripathi use?
-Data scientists use a combination of domain expertise, math, statistics, visualization techniques, and graph analytics, among other scientific methods and algorithms, to extract knowledge from data.
Why does Jaya Tripathi believe the concept of absolute truth does not exist in data science, particularly in medical research?
-Jaya Tripathi explains that in medical research, data is based on observable properties that can vary over time or across different data sets, and the concept of absolute truth can introduce biases, which contradicts the scientific method.
What is the significance of hypothesis testing in Jaya Tripathi's research?
-Hypothesis testing is a method of scientific research where Jaya Tripathi starts with a question, makes assumptions, and creates tests to validate those assumptions. It helps in building models and determining the confidence of results, leading to the acceptance or rejection of initial assumptions.
Can you explain the CRISP-DM process Jaya Tripathi mentioned for data mining?
-The CRISP-DM process is a cross-industry standard for data mining that Jaya Tripathi follows. It involves understanding the data, preparing it by removing duplicates and correcting errors, modeling, and finally evaluating and validating the results.
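The preparation step described above (removing duplicates and rows with missing values) can be sketched as follows. This is a minimal illustration only: the field names and sample rows are hypothetical, since the talk does not describe the actual PDMP record layout.

```python
# Hypothetical record layout; the real PDMP schema is not given in the talk.
REQUIRED_FIELDS = ("patient_id", "prescriber_id", "drug", "fill_date")


def prepare(records):
    """Drop rows with missing required values, then drop exact duplicates."""
    seen = set()
    cleaned = []
    for row in records:
        # Skip rows where any required field is absent or empty.
        if any(row.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue
        key = tuple(row[field] for field in REQUIRED_FIELDS)
        if key in seen:
            continue  # exact duplicate of a row already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Invented sample rows: one duplicate and one row with a missing drug name.
raw = [
    {"patient_id": "p1", "prescriber_id": "d1", "drug": "oxycodone", "fill_date": "2014-01-05"},
    {"patient_id": "p1", "prescriber_id": "d1", "drug": "oxycodone", "fill_date": "2014-01-05"},
    {"patient_id": "p2", "prescriber_id": "d1", "drug": "", "fill_date": "2014-01-07"},
    {"patient_id": "p3", "prescriber_id": "d2", "drug": "hydrocodone", "fill_date": "2014-02-11"},
]
clean = prepare(raw)
print(len(clean))  # 2
```

Dropping incomplete rows is only one option; depending on the analysis, missing values might instead be imputed or flagged.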
What was Jaya Tripathi's hypothesis regarding prescription fraud by physicians?
-Jaya Tripathi hypothesized that she could predict prescription fraud by physicians based on data elements such as the distance traveled by the person, certain combinations of drugs prescribed, and other related factors.
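One way a "distance traveled" data element might be derived is a great-circle distance between a patient's location and the pharmacy that filled the prescription. The sketch below uses the haversine formula; the function names, the 100-mile threshold as a default, and the approximate city coordinates are assumptions for illustration, not details from the talk.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def is_far_fill(patient, pharmacy, threshold_miles=100.0):
    """True if the pharmacy is farther from the patient than the threshold."""
    return haversine_miles(*patient, *pharmacy) > threshold_miles


# Approximate city-center coordinates, used only as an example.
indianapolis = (39.7684, -86.1581)
chicago = (41.8781, -87.6298)
nearby = (39.80, -86.16)

print(is_far_fill(indianapolis, chicago))  # True
print(is_far_fill(indianapolis, nearby))   # False
```

A straight-line distance understates road travel, so a threshold like this is best treated as a coarse screening feature.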
How did Jaya Tripathi use geospatial analysis in her research?
-Jaya Tripathi used geospatial analysis, specifically heat maps, to visualize data on drug distribution and usage across different regions. This helped in identifying hotspots for further investigation, such as the HIV epidemic in Scott County, Indiana.
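The talk stresses that heat-map values have to be normalized by population, otherwise the most populous area always appears hottest. A minimal sketch of that normalization follows; the area labels and all counts are invented, and the talk's additional normalization by livable land area is left out for brevity.

```python
def rate_per_1000(counts, population):
    """Prescriptions per 1,000 residents for each area with a known population."""
    rates = {}
    for area, n in counts.items():
        pop = population.get(area, 0)
        if pop > 0:  # skip areas with no known residents
            rates[area] = 1000.0 * n / pop
    return rates


# Invented numbers: a large city and a small rural area.
counts = {"Z-urban": 5000, "Z-rural": 600}
population = {"Z-urban": 500000, "Z-rural": 6000}

rates = rate_per_1000(counts, population)
hottest_raw = max(counts, key=counts.get)
hottest_normalized = max(rates, key=rates.get)
print(hottest_raw, hottest_normalized)  # Z-urban Z-rural
```

On raw counts the urban area dominates; per capita, the rural area stands out, which is the kind of signal a normalized heat map is meant to surface.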
What is Jaya Tripathi's advice for conducting good research?
-Jaya Tripathi suggests that one must be passionate about the research topic, make observations, ask interesting questions, formulate hypotheses, gather the correct data, test predictions, and generalize findings with other datasets.
Outlines
🔬 Introduction to Data Science and Research Methodology
Jaya Tripathi, a principal scientist at MITRE, introduces herself and her work in the field of data science. She notes that MITRE works in the public interest and operates several federally funded research and development centers (FFRDCs), and that her research is funded by its internal innovation program. Tripathi emphasizes the importance of data in forming opinions and decisions, quoting Sherlock Holmes and highlighting the 'sexiest job' of the 21st century as described by the Harvard Business Review. She outlines the interdisciplinary nature of data science, which includes domain expertise, mathematics, statistics, and various analytical techniques. Tripathi also touches on the concept of truth in data science, contrasting it with absolute truths found in mathematics and religious texts, and explains how evidence-based research forms the basis of medical research.
📊 Data Mining and Hypothesis Testing in Medical Research
The speaker delves into the process of data mining using the CRISP-DM (Cross Industry Standard Process for Data Mining) methodology. She discusses the importance of understanding and preparing data, including data quality assessment and handling duplicates or missing values. Tripathi then focuses on the modeling phase, where she employs predictive machine learning methods such as support vector machines, neural networks, and random forests. She emphasizes the iterative nature of this process, adjusting features and models based on outcomes. The speaker shares her experience with hypothesis testing, using her own research on prescription fraud as an example. She explains how she formulated a hypothesis, gathered data from sources like the Prescription Drug Monitoring Program (PDMP), and tested her assumptions to predict fraudulent prescriptions by physicians.
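The question at the heart of hypothesis testing, "how confident are you that the result is not due to chance", can be illustrated with an exact permutation test. This is one possible test among many, chosen here for illustration; the talk does not say which test was used, and the two groups of values below are invented.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means.

    Returns the fraction of all relabelings of the pooled data whose absolute
    mean difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_diff(sum_a):
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    extreme = 0
    splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        splits += 1
        sum_a = sum(pooled[i] for i in idx)
        # Small tolerance guards against floating-point rounding.
        if abs(mean_diff(sum_a)) >= observed - 1e-12:
            extreme += 1
    return extreme / splits


# Invented feature values for two hypothetical groups of prescribers.
flagged = [10, 11, 12, 13, 14]
unflagged = [1, 2, 3, 4, 5]
p = permutation_p_value(flagged, unflagged)
print(p < 0.05)  # True
```

A small p-value says the observed gap would rarely arise from random labeling alone. Enumerating every split is only practical for small samples; larger ones usually rely on random shuffles.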
🔎 Exploratory Data Analysis and Visualization Techniques
Tripathi presents her approach to exploratory data analysis, beginning with basic visualizations like histograms to understand the distribution of data. She discusses gender and age differences in ADHD drug prescriptions, noting the higher prevalence in male children and the value of investigating beyond surface-level explanations; looking further, she found that most of these prescriptions were filled at an out-of-state pharmacy more than 100 miles away, which led to a case. The speaker also explores clustering as an unsupervised learning technique to identify patterns in data without prior assumptions. She applies it to people with overlapping prescriptions across pharmacological groups; the data separates into two groups, which suggests the overlap measure is a useful input for a model that predicts 'bad doctors'. Tripathi further discusses geospatial analysis and heat maps for identifying hotspots of drug activity: a heat map for the drug Opana flagged Scott County, Indiana, two months before an HIV outbreak there that was attributed to needle sharing. She concludes by emphasizing the value of these techniques in guiding resource allocation and educational outreach.
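The idea of letting the data "distribute itself into two groups" can be sketched with a tiny one-dimensional k-means with k = 2. The talk does not name the clustering algorithm, so k-means is an assumption here, and the overlap counts are invented.

```python
def two_means(values, iterations=20):
    """Split one-dimensional values into two clusters with basic k-means."""
    centers = [float(min(values)), float(max(values))]  # start at the extremes
    labels = []
    for _ in range(iterations):
        # Assign each value to the nearer center.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Move each center to the mean of its members.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return centers, labels


# Invented counts of overlapping prescriptions per person.
overlaps = [0, 1, 0, 2, 1, 9, 11, 10, 12]
centers, labels = two_means(overlaps)
print(centers)  # [0.8, 10.5]
print(labels)   # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

The algorithm only reports that two groups exist; as in the talk, it does not say which group is "good" or "bad". A clean split like this suggests the feature is worth feeding into a supervised model.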
🎓 Conclusion and Recommendations for Research
In her concluding remarks, Tripathi stresses the importance of passion in research and the process of making observations, asking questions, and formulating hypotheses. She advises on the necessity of gathering appropriate data to test predictions and the iterative process of accepting or rejecting hypotheses. The speaker also suggests generalizing findings across different datasets and offers further reading, including Simpson's paradox, the UC Berkeley gender bias study, and the man in the cave allegory from Plato's Republic, to deepen understanding of data science concepts and their implications.
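For readers unfamiliar with Simpson's paradox, the toy example below shows how a trend that holds inside every subgroup can reverse in the aggregate. The admission numbers are invented for illustration and are not the figures from the UC Berkeley study.

```python
# Invented (admitted, applied) counts per department and group.
data = {
    "A": {"men": (80, 100), "women": (9, 10)},
    "B": {"men": (2, 10), "women": (30, 100)},
}


def dept_rate(dept, group):
    """Admission rate for one group within one department."""
    admitted, applied = data[dept][group]
    return admitted / applied


def overall_rate(group):
    """Admission rate for one group pooled over all departments."""
    admitted = sum(d[group][0] for d in data.values())
    applied = sum(d[group][1] for d in data.values())
    return admitted / applied


for dept in data:
    print(dept, dept_rate(dept, "men"), dept_rate(dept, "women"))
print("overall", round(overall_rate("men"), 3), round(overall_rate("women"), 3))
```

In this toy data women have the higher admission rate in each department, yet the lower rate overall, because most women applied to the department that admits few applicants. It is another reason not to stop at the first plausible explanation.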
Keywords
💡Principal Scientist
💡Federally Funded Research and Development Centers (FFRDCs)
💡Innovation Program
💡Data Scientist
💡Hypothesis Testing
💡Evidence-Based
💡Machine Learning
💡Predictive Analytics
💡Geospatial Analysis
💡Data Visualization
Highlights
Jaya Tripathi, a principal scientist at MITRE, discusses her work funded by the innovation program.
Emphasizes the importance of data in decision-making, quoting 'without data, you're just another person with an opinion'.
Cites Sherlock Holmes on the mistake of theorizing before having data.
Data scientists are described as having the 'sexiest job of the 21st century' according to Harvard Business Review.
Interdisciplinary approach of a data scientist includes domain expertise, math, statistics, and visualization techniques.
Discusses the concept of truth in data science, contrasting it with absolute truth in mathematics.
Explains the scientific method's role in data science, particularly in hypothesis testing.
Shares a personal project predicting prescription fraud by analyzing data elements like travel distance and drug combinations.
Describes the CRISP-DM process for data mining, emphasizing the importance of understanding and preparing data.
Details the use of predictive machine learning methods like support vector machines and random forests in modeling.
Highlights the iterative process of model building, feature selection, and validation in data science.
Exploration of data through visualization techniques like histograms reveals insights into prescription patterns.
Uncovers a potential case of prescription fraud by further investigating data that initially seemed explained.
Discusses the significance of not just looking for the obvious explanation and delving deeper into data.
Uses clustering to identify patterns in data, suggesting potential inputs for machine learning models.
Presents a hypothesis on the correlation between certain drug combinations and cash payments, tests it through data visualization, and finds it not entirely true.
Demonstrates the use of geospatial heat maps to identify potential hotspots for further investigation, leading to significant findings.
Concludes with advice on the importance of passion in research and the process of making observations, formulating hypotheses, and generalizing findings.
Transcripts
[Music]
okay good so I'm Jaya Tripathi and I am
a principal scientist at the MITRE
corporation here in Bedford
Massachusetts next please
just a line about MITRE so MITRE works
in the public interest and we operate
several FFRDCs federally funded research
and development centers so one of the
programs we have is an internal research
program that fosters creativity it's
called the innovation program and the
work that I've been doing for the last
seven or eight years was funded by this
research program the innovation program
next please just a couple of quotes on
data because that's what I do I'm a data
scientist then I really like one without
data you're just another person with an
opinion I don't see the slides here
that's okay you're just another person
with an opinion there's another one that
I like next please from so this is from
Sherlock Holmes A Study in Scarlet from
Sir Arthur Conan Doyle it is a mistake
to theorize before one has data next
okay so Harvard says I have the sexiest
job this is a quote from the Harvard
Business Review a data scientist the
sexiest job of the 21st century so
that's me
next so what does a data scientist do
what does someone like me do first it's an
interdisciplinary approach so we use a
lot of the techniques some of which are
you must know your domain you must have
domain expertise you use math statistics
visualization techniques graph analytics
and so on so essentially we use all
these scientific methods algorithms
subject matter expertise to try and
extract knowledge from data and explain
the phenomenon that you're trying to
address next please
so you are familiar with truth in an
informal way in terms of ethics
philosophy religion you've read about it
in the Bible and in math there is a
concept of absolute truth we use proofs
like reductio ad absurdum for
example which is proof by contradiction
to establish the absolute truth however
this notion of absolute truth does not
exist in data science particularly when
applied to medical research why is that
why because in medical research it's
based on observable properties of the
phenomenon something you measure you
must have heard the term evidence-based
and what you measure and what you
observe today may be different in a
different data set or at a different
point in time also if you subscribe to
the notion of absolute truth right it
violates the scientific method because
you're bringing in preconceived notions
and biases lastly the truth in the
scientific method is relative to the
context or the model and there may be
different pathways or different models
in which to approach your problem next
please
so one method of scientific research
that I want to talk about is hypothesis
testing and there's a quote there from
Edward Teller that I like so a fact is a
simple statement that everyone believes
right it is innocent until
found guilty a hypothesis it's
guilty until found effective so you
begin with the hypothesis right you ask
yourself an interesting question you
begin with the hypothesis you make some
assumptions and then you create tests to
test those assumptions and you build
models and then how confident are you
that the result is not due to chance and
once you're confident you stick with
your hypothesis otherwise you revert to
the alternate hypothesis and you accept
or reject the initial assumption in
favor of the alternate position and then
you go on and try and expand your
hypotheses to other data sets to
generalize so here I have done
several projects one of which was my
hypothesis was that I can predict a
particular kind of fraud
committed by prescribers or physicians a
particular kind of prescription fraud
solely by looking at data sets that
have certain data elements data elements
have to do with the distance the person
traveled certain combinations of drugs
and so on so that was my hypothesis and
the first thing I did was to go and seek
data that would be suitable to test my
hypothesis one of which was the PDMP
prescription drug monitoring programs
all the states have them they are
prescriptions reported by pharmacies
once a controlled substance is filled so
an oxycodone or hydrocodone a benzo and
benzodiazepine and so on next please so
here is once I have my hypothesis I seek
the appropriate data and this is an
industry standard for data mining it's
called CRISP-DM the cross industry standard
process for data mining and the first
step is understanding your data well you
really need to spend a lot of time
understanding the data well do the data
quality assessment and then the
next stage is preparation removing
duplicates and spurious data or missing
values and so on the part that I would say I
spent the most time on is on modeling so
there's several different techniques
there but I used predictive
machine learning methods examples of
which are support vector machines you
all must have heard of artificial neural
networks random forests and so on and
that's a cyclical process so you try
certain essentially you're trying to
predict some things in this case I'm
trying to predict who the bad doctors
are and so you try different
features to feed into your model and
then if that doesn't work out you come
back and change your features and so on
so it's a cyclical process and
in the end is evaluation and validation
next slide please
okay so I'm just going to show the next
three slides and then I'll end after
that I'm going to show you some examples
of exploration so before I get into
the modeling and the machine learning I
did some exploration on the data and
visualization histogram you all are
familiar with histograms two basic
concepts on this chart here I have male
children on the upper graph and females
on the lower one and I'm just doing an
age gender histogram on the
prescriptions that were filled remember
these are all controlled substance
prescriptions what do you see
since the audience is small enough it's
okay for you to shout out what do you
see that's different in the two graphs
correct so on the right side
sorry your name is yeah and she pointed
out the difference mainly is in children
under the age of 10 years right and so
it's worth exploring and a histogram
is something that's very simple to do
and you should always do your basic univariate and
bivariate statistics in the beginning so
I look at it and then I see that most of
the drugs almost all the drugs were ADHD
drugs Ritalin and Adderall and you look
up literature and you research it and
you find that there is a two point
something times greater prevalence of
that diagnosis in male children than
female children so at this point a lot
of data scientists would have said well
that explains it you know there's a
that's a roughly two point five times
over there's a 2.5 times greater
prevalence of ADHD and that explains it
but something in me wanted to delve
further so I look further and I
think this is the point that I'm
trying to make is don't just look for
the obvious explanation it's worth it to
spend a little bit more time and delve
further so I looked further and found
actually most of these prescriptions
were filled by a pharmacy in a different
state more than 100 miles away and there
were several pharmacies much closer and
without revealing much more let's just
say that this led to a case next slide
please
so here I am before I'm getting into
complex machine learning models or doing
any
artificial intelligence applications this is me
just trying to get a feel for the data
so essentially after you do the data
quality assessment you look for distributions
you look to see if your data is Gaussian
because certain models require certain
distributions and in this case I'm just
hypothesizing and doing certain
visualizations so in the upper-right
one I am doing clustering
so what clustering is it's a method of
unsupervised machine learning where what
is the data saying about itself you have
no bias no prior knowledge of the data
if I'm trying to make two classes
for example it's called a binary
classifier two classes bad guys good guys
for example that's as simple a
classifier as it gets can the data distribute
itself into two groups and I'm not
saying which one's bad which one's good
but it's just the data so here I am
trying to classify the people based on
who did multiple drugs at the same time
so not one drug in January and for 30
days then another one in August but
overlaps and overlaps across different
pharmacological groups and do you guys
see that neat clustering into two
groups that's promising
and so that tells me this is an input
that I should use in my input vector
when I do machine learning so this is
uniquely separable
data here this is another hypothesis I
have so in this case I'm looking for
these dots are all prescribers or
doctors red and blue the red ones are
the ones that prescribe the certain
combination of drugs for which there is
no medical legitimacy by that I mean
there are certain combinations of drugs
which no doctor would ever give why
because it leads to 12 or more times
greater chance of respiratory depression
coma or even death and so I
hypothesized that for most of those guys who
did this particular combination of drugs
there were cash payments involved
that's a hypothesis right it's just a
guess sometimes the hypothesis is based on
some sort of expert knowledge and
sometimes it is
worth trying out so I wanted to show you
here so the blue dots and the red do
you see so was my hypothesis correct
or wrong what do you see what do you see
from that graph okay
so I'll give you the answer it's you
could more or less say that it's kind of
linear along the 45-degree line right
and so here I have Medicare payments and
you have cash so my hypothesis was the
people who patients who went to doctors
who gave the sort of combination that's
a no-no had some sort of cash payments
mixed in to go under the radar and I
would say that's not an entirely true
hypothesis right because it's along
the 45-degree line however there is
something happening because it's below
this vertical line here most of the
people who did that also had a lot of
prescriptions this is a logarithmic
scale so had a lot of prescriptions
these are examples of visualizing the data
next please
so here's a technique called geospatial
or heat maps and hotspots you must have
seen this all the time this is the map
of Indiana because that's where I got
the data and this is normalized by
livable land area and by population
because if you don't normalize by
population Indianapolis in the center
will always turn out bright red because
there's lots of people there so you have
to normalize by population and ZCTA
for the most part is the same as the US
postal zip code the postal service construct
it's more or less the same thing zip code
tabulation area ZCTA but I take ZCTA
because it gives me access to other
demographic information so I can tie in
the zip codes or the ZCTAs with income
overdose deaths education and so on so
I can do that and so here this is a heat
map on a particular drug called Opana so
I did heat maps on many different drugs
the top 10 drugs
and on polypharmacy combinations of
drugs and also overlaid with other
datasets from CDC and other places so
what sticks out here the
red areas Hartford City and
Scottsburg I don't really
live in Indiana so I gave it to the
Health and Human Services and the health
commissioner from Indiana and asked him
to look into it and they didn't have any
plausible explanation for why this
is red these are you know largely rural
areas and no one knew what was going on
so guess what two months after I gave
the state of Indiana this chart there
was a huge HIV epidemic that broke out
in Scott County
a relatively large percent of the
population had HIV this was in 2014 and the
CDC and the federal government had to
intervene and give aid and so on and at
the conclusion of the investigation it
turned out that the HIV epidemic could
be attributed to people sharing needles
for Opana so here's a technique
that's validated kind of this is my kind
of truth it's validated because there's
a heat map that said you know
explore this region further and States
typically have limited resources so
not in this particular example
but in other examples for example if you
want to see where law enforcement
should be police could decide where to
place the drug diversion officers and
along the heat maps you could focus your
education outreach programs
in the red spot areas and so on next please
so in conclusion first you can't do
really good research unless you're really
passionate about the topic so pick a
topic that really speaks to you
that you're passionate about make
observations ask interesting questions
formulate hypotheses and develop
predictions that you can test gather the
correct data test your predictions
accept or reject your hypotheses and if
accepted see if you can generalize
it and validate with other datasets for
further reading
may I suggest reading about Simpson's
paradox the UC Berkeley gender bias study
how many of you are familiar with it
good and then the man in the cave
allegory from Plato's Republic thank you
for having me