How I Would Learn Data Science in 2022
Summary
TLDR: The video script provides a practical guide to learning data science in 2022, emphasizing a breadth-first approach centered around project-based learning. It outlines essential topics including coding, statistics, data visualization, exploratory data analysis (EDA), machine learning, data scraping, APIs, databases, and deployment. The speaker recommends starting with Python due to its simplicity and rich data science libraries. The script also advises learning SQL for working with databases and highlights the importance of domain knowledge and communication skills for a data scientist. It suggests using interactive platforms like freeCodeCamp and resources like Kaggle for practical learning. Finally, it notes that repetitive data science tasks are increasingly automated, which makes it more important to understand how algorithms work and how to apply them in specific contexts.
Takeaways
- **Practical Guide Focus**: The video emphasizes a practical approach to learning data science, focusing on effective learning methods and persistence.
- **Interdisciplinary Nature**: Data science involves coding, math, statistics, and business acumen, necessitating a breadth-first learning approach.
- **Project-Based Learning**: A project-based learning approach is recommended for its effectiveness in encoding information deeply and retaining knowledge.
- **Python for Coding**: Python is suggested as the starting language for coding due to its simplicity, great documentation, and data science libraries.
- **Statistics Fundamentals**: Basic statistical knowledge is crucial, including mean, median, mode, standard deviation, and distributions.
- **Data Visualization**: Learning a visualization library like seaborn is important for graphically representing data insights.
- **Exploratory Data Analysis (EDA)**: EDA is introduced as a method to explore and familiarize oneself with data sets, looking for trends and patterns.
- **Learning Timeline**: The video provides a suggested timeline for learning each topic, emphasizing the importance of starting with the basics and progressing to projects.
- **Machine Learning Algorithms**: Understanding common machine learning algorithms is key, with an intuitive grasp being more important initially than deep mathematical understanding.
- **Data Scraping and APIs**: As one progresses, learning to scrape data and work with APIs becomes essential for obtaining and manipulating data sets.
- **Domain Knowledge**: With automation on the rise, domain knowledge and the ability to communicate the impact of data science work become increasingly important.
Q & A
What is the main focus of the video regarding learning data science?
- The main focus of the video is to provide a practical guide on how to effectively learn data science, emphasizing a breadth-first approach centered around project-based learning.
Why is project-based learning recommended for learning data science?
- Project-based learning is recommended because it allows learners to apply theoretical knowledge in practice, which helps in deeper encoding of information into the brain and better retention of knowledge.
What is the recommended first step in learning data science according to the video?
- The recommended first step is learning coding, specifically starting with Python, as it is a general-purpose language with great libraries for data science.
Why is a breadth-first approach preferred over a depth-first approach when learning data science?
- A breadth-first approach is preferred because it helps learners avoid getting overwhelmed by the depth of each subject, allows them to start implementing what they learn sooner, and keeps the learning process engaging.
What are some of the key topics to cover when learning data science?
- Key topics include programming, statistics, data visualization, exploratory data analysis (EDA), machine learning, data scraping, APIs, databases, and deployment, as well as specific niches like NLP and computer vision.
What is the significance of understanding the theory behind machine learning algorithms?
- Understanding the theory behind machine learning algorithms is important for applying them effectively to specific use cases and ensuring they function properly in a given context.
Why is domain knowledge considered crucial for a data scientist?
- Domain knowledge is crucial because it helps a data scientist understand the business context, communicate the value of their work, and ensure that their analyses and models provide real impact and are used by the organization.
What is the recommended timeline for learning the basics of coding in the context of data science?
- The recommended timeline for learning the basics of coding is one to two weeks at four hours per day.
How does the video suggest approaching the learning of statistics for data science?
- The video suggests brushing up on statistics with a focus on high school to first-year university stats, such as mean, median, mode, standard deviation, distributions, central limit theorem, and confidence intervals.
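As a rough illustration of the statistics listed in this answer, here is a minimal sketch using only Python's standard library. The sample values are made up for illustration, and the 1.96 multiplier assumes a normal approximation; neither comes from the video.

```python
import math
import statistics

# Made-up sample of ages, for illustration only (not real data).
ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted sample
mode_age = statistics.mode(ages)      # most frequent value
std_age = statistics.stdev(ages)      # sample standard deviation (divides by n - 1)

# Approximate 95% confidence interval for the mean, using the normal
# critical value 1.96. With only ten points a t-value would be more careful.
margin = 1.96 * std_age / math.sqrt(len(ages))
ci_low, ci_high = mean_age - margin, mean_age + margin

print(f"mean={mean_age:.1f} median={median_age} mode={mode_age} std={std_age:.1f}")
print(f"approx. 95% CI for the mean: ({ci_low:.1f}, {ci_high:.1f})")
```

Running the same calculations on a real data set is one way to follow the video's advice to implement the stats in code once the basics are in place.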
What is the role of accountability in the learning process as discussed in the video?
- Accountability is built into the learning process to maximize the chances of not giving up, especially for those who may not have the strongest willpower and tend to give up easily.
How does the video suggest one should engage with existing projects to enhance their learning?
- The video suggests taking someone else's project and working through it, understanding each line of code and the rationale behind it, rather than just copying code, to gain a practical understanding of how to approach a project.
Outlines
Mastering Data Science in 2022: A Practical Guide
The video provides a practical guide to learning data science in 2022, emphasizing effective learning strategies over the sheer volume of information. It covers the interdisciplinary nature of data science, involving coding, math, statistics, and business acumen. The presenter outlines a step-by-step approach, starting with a breadth-first, project-based learning method. This approach encourages learners to understand the minimum required theory before diving into practical projects. The video also discusses the importance of not giving up and the role of accountability in learning.
Coding and Project-Based Learning: Starting Strong
The presenter suggests starting with coding, finding it more motivating to see immediate results. Python is recommended as the language of choice due to its simplicity and rich data science libraries. The basics of coding, including variables, functions, loops, and conditionals, are covered, along with the importance of learning data science modules like pandas and numpy. The video also touches on statistics, visualization, and the first project milestone, which involves exploratory data analysis (EDA). Timelines for learning these topics are provided, with an emphasis on the breadth-first approach and the integration of theory and practice.
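To make the coding basics and the first EDA milestone more concrete, here is a minimal sketch that combines a function, a loop, and a conditional with a small pandas DataFrame. The rows are invented in the spirit of the Titanic data set mentioned in the transcript, and the helper name `survival_rate` and the 0.5 threshold are hypothetical choices for illustration.

```python
import pandas as pd

# Toy rows in the spirit of the Titanic data set; all values are made up.
df = pd.DataFrame(
    {
        "sex": ["male", "female", "female", "male", "male", "female"],
        "pclass": [3, 1, 3, 1, 3, 2],
        "age": [22.0, 38.0, 26.0, 54.0, None, 14.0],
        "survived": [0, 1, 1, 0, 0, 1],
    }
)


def survival_rate(frame, column):
    """Return the mean of `survived` for each distinct value of `column`."""
    return frame.groupby(column)["survived"].mean()


# First look at the data: summary statistics and missing values per column.
print(df.describe())
print(df.isna().sum())

# Loop over candidate columns and flag large gaps in survival rate.
for column in ["sex", "pclass"]:
    rates = survival_rate(df, column)
    print(rates)
    if rates.max() - rates.min() > 0.5:  # arbitrary threshold for this sketch
        print(f"'{column}' looks strongly related to survival in this toy sample")
```

On a real data set the same pattern applies: load the data, inspect it, then group and compare before drawing any conclusions.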
Diving Deeper: Statistics, Visualization, and Machine Learning
The video moves on to more advanced topics, beginning with statistics, where a foundational understanding is crucial. It then transitions into data visualization, recommending the Seaborn library for its intuitive interface and aesthetic appeal. The presenter highlights the importance of EDA for understanding data trends and patterns. Following this, the focus shifts to machine learning, introducing common algorithms and the importance of understanding their workings intuitively. The video also discusses the process of learning through other people's projects on platforms like Kaggle and emphasizes the practical aspects of machine learning, such as data preprocessing and model optimization.
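The transcript describes linear regression intuitively as drawing a straight line that minimizes the distance between the data points and the line. One common way to make that precise is ordinary least squares, which minimizes the sum of squared vertical distances. The sketch below is a plain-Python illustration under that assumption; the data points and the function names `fit_line` and `predict` are made up for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie roughly on a straight line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

slope, intercept = fit_line(xs, ys)


def predict(x):
    """The fitted line is the model: it maps a new x to a predicted y."""
    return slope * x + intercept


print(f"slope={slope:.2f} intercept={intercept:.2f}")
print(f"prediction for x=6: {predict(6.0):.2f}")
```

In practice a library would do the fitting, but working through the arithmetic once supports the intuitive understanding the video recommends.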
Scraping, APIs, and Databases: Expanding Data Science Skills
The presenter covers data scraping and APIs, which are essential for sourcing data when pre-built datasets are unavailable. Beautiful Soup is recommended for web scraping, and the importance of learning SQL for database manipulation is emphasized. The video outlines the types of databases, including relational, NoSQL, and cloud databases. It also provides a timeline for learning these skills and suggests practical exercises like importing datasets into a personal database. The presenter encourages the use of interactive learning platforms and emphasizes the importance of project-based learning.
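The suggested exercise of importing a data set into a personal database can be sketched with Python's built-in `sqlite3` module. This is only one possible setup: the video does not name a database engine, and the table name, columns, and CSV rows below are hypothetical.

```python
import csv
import io
import sqlite3

# Stand-in for a downloaded CSV file; the rows are made up.
csv_text = """name,pclass,survived
Alice,1,1
Bob,3,0
Carol,2,1
Dave,3,0
"""

# Create an in-memory relational database and a table inside it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE passengers (name TEXT, pclass INTEGER, survived INTEGER)"
)

# Import the CSV rows into the table, converting numeric columns.
rows = [
    (r["name"], int(r["pclass"]), int(r["survived"]))
    for r in csv.DictReader(io.StringIO(csv_text))
]
conn.executemany("INSERT INTO passengers VALUES (?, ?, ?)", rows)
conn.commit()

# Manipulate the data with SQL: count and survival rate per class.
query = """
SELECT pclass, COUNT(*) AS n, AVG(survived) AS survival_rate
FROM passengers
GROUP BY pclass
ORDER BY pclass
"""
result = conn.execute(query).fetchall()
for pclass, n, rate in result:
    print(pclass, n, rate)

conn.close()
```

The same three steps (create a database, create tables, query the data) carry over to other relational databases, with only the connection details changing.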
Deployment, Niche Topics, and the Future of Data Science
The video discusses deployment, which involves putting machine learning models into a live environment, and explores niche areas such as natural language processing (NLP) and computer vision. The presenter provides resources for learning SQL, machine learning algorithms, and databases. They also stress the importance of understanding the automated aspects of data science and the growing significance of domain knowledge. The video concludes with the importance of communication and presenting findings in a business context, as well as the role of data scientists in ensuring their work provides value to the company.
Keywords
Data Science
Project-Based Learning
Python
Pandas and NumPy
Statistics
Data Visualization
Exploratory Data Analysis (EDA)
Machine Learning
Data Scraping and APIs
Databases
Deployment
Domain Knowledge
Highlights
The video provides a practical guide on learning data science effectively, emphasizing not just what to learn but how to learn it.
Data science is described as an interdisciplinary field involving coding, math, statistics, and business acumen.
A breadth-first approach centered around project-based learning is recommended for learning data science.
The importance of understanding the difference between theory and practice in technical subjects is highlighted.
Python is suggested as the starting language for coding due to its simplicity and rich data science libraries.
Pandas and NumPy are identified as key modules for data manipulation in data science.
Statistics knowledge is crucial for understanding data sets, with a focus on concepts from high school to first-year university stats.
Visualization is emphasized as an important aspect of data science, with Seaborn recommended as an intuitive module.
Exploratory Data Analysis (EDA) is introduced as a foundational project type for beginners to familiarize themselves with data sets.
A timeline is provided for learning the basics: roughly one to two weeks each for coding and statistics at four hours per day, and a few hours to a day for visualization.
The video encourages learning from other data scientists and building upon existing projects to enhance understanding.
Machine learning is discussed, with an emphasis on understanding common algorithms and their applications.
Data scraping and APIs are highlighted as essential skills when moving beyond pre-built data sets.
SQL is identified as a vital language to learn for database manipulation and is often a job requirement for data roles.
Deployment and niche areas like NLP and computer vision are considered advanced topics in data science.
Recommended resources for learning include interactive platforms, online courses, and practical project-based learning.
As data science tasks become automated, the importance of domain knowledge and effective communication of findings increases.
The presenter shares personal strategies for staying accountable and motivated during the learning process.
The landscape of data science is evolving, with a growing emphasis on domain expertise and the ability to apply AI tools effectively.
Transcripts
welcome back to another recall by data
iq video
in this video i'm going to walk you
through how i would learn data science
in 2022. you've probably already seen a
couple other videos on this topic before
but what i'm going to be focusing on
here is a very practical guide because
from my experience the hardest part
about learning data science isn't that you
can't figure out what to learn but
rather how to learn effectively and kind
of like how to not give up essentially
because data science is hard it's an
interdisciplinary field that involves
coding math and stats and business slash
product sense first i'm going to outline
the topics to cover and my step-by-step
approach
kind of like framework for how to learn
then i'll go through the approximate
timeline for each topic some recommended
resources and finally ending with where
i see data science heading and how you
should adjust your learning plan to suit
it so do stick to the end of the video
because data science is a rapidly
changing field and i think it's
important to understand the landscape if
you're genuinely interested in getting
into the field throughout the video i'll
also be pointing out where i very
intentionally built in accountability
which is basically how to maximize your
chances to not give up because at least
for me i don't have the strongest
willpower and i tend to give up easily
so if you can kind of relate to this
maybe these tips and checkpoints will
also help you as well okay topics to
cover there's programming stats data
visualization exploratory data analysis
or eda machine learning data scraping
apis databases deployment and specific
niches like nlp and computer vision
don't worry i'm just listing these here
now but we're going to go through each
of these later in the video and talk
about why each topic is relevant and why
i recommend learning them in this
specific order but first i want to share
with you what is called meta learning
or how to learn the general approach
i recommend is what is called a
breadth-first approach centered around
project-based learning basically what i
mean by this is say we take the topics i
listed before right breadth-first
approach means that you should cover
just enough for the minimum amount of
theory for each topic before doing a
project surrounding it then you can
learn more about the topics and do a
more complex project and you do this
over and over each time learning more
about each topic and expanding your
skills this is called a breadth-first
approach to learning and as opposed to a
depth first approach where you would
attempt to learn every single thing
about a topic and then move on to the
next topic and then again try to learn
every single thing about that and after
you learn each topic thoroughly then you
would try to do the project the reason
why i recommend this breadth-first
approach centered around project based
learning is for three major reasons the
first reason is that technical subjects
like coding math and stats etc are
really different in theory and in
practice if you've tried to learn how to code
before you may have experienced
something like you do a course on coding
right and you're like okay makes sense i
know how to code now and then when you
actually sit down to code something
yourself you're kind of like uh where do
i even start the reason for this is
because implementation is really a
separate beast and the whole point of
learning to code and data science is
that you can actually implement and do
cool projects right so you do want to
know how to implement the second reason
is that if you try to deeply learn each
subject in turn you will be there
learning until the end of days each
subject of coding stats and machine
learning is huge and you can really go
down the rabbit hole and find yourself
super overwhelmed not knowing what is
actually relevant and important and at
some point you're probably going to give
up before you've even started to use
these things that you learned trust me i
know this from experience and finally
another plug for project-based learning
studies have actually shown that project
based learning is the best form of
learning because by doing things and
figuring things out yourself you're
actually more deeply encoding that
information into your brain and more
likely to retain the information as
opposed to just like kind of passively
absorbing information if you're just
watching someone else code for example
so yes breadth first approach centered
around project based learning hopefully
i have convinced you alright let's now
go through the topics in my opinion you
should start with coding first and the
reason why i recommend coding first is
because it's a lot more motivating at
least for me to be able to see the
results of things that i do as opposed
to starting with more theoretical topics
like math and stats which of course is
extremely important and you'll certainly
get to them later but i find these
topics more abstract and less engaging
aka easier to get bored and give up for
choice of language i would recommend
starting with python the reason why i
recommend starting with python is
because it's a general purpose language
that is super simple to understand has
great documentation and also has great
libraries for data science including
machine learning so what to learn for
coding you should know the basics
including how to declare variables
functions loops and if statements then
you should get familiar with two
specific data science modules pandas and
numpy pandas is built on top of numpy
and is like the data science module
where you can manipulate your data sets
and feed them into other more
specialized libraries for data
visualization and machine learning for
example after you learn the basics of
coding next i recommend learning or
brushing up on your stats and i'm not
talking about like crazy stuff here like
we're talking about high school to first
year university stats mean median mode
standard deviation distributions central
limit theorem confidence intervals
things like that this comes in really
handy when you're understanding the
nature of your data set now what's
really cool is that because you know how
to code now you can actually implement
the stats on your data sets which again
i think is a lot more fun because you
can see the things that you do next up
is visualizations there's a lot of
different visualization modules out
there but honestly if you learn one of
them the rest are kind of just
variations with different
functionalities i personally like
seaborn because it's really intuitive
to use and the graphs are automatically
really pretty as well at this point you
should know the basics of coding stats
and visualizations and you're ready now
for your first project which is some
exploratory data analysis or eda eda is
just a fancy way of saying exploring
your data set and familiarizing yourself
with it by seeing if there's any trends
patterns correlations between variables
etc with the basics in coding stats and
visualizations you're now well equipped
to do eda by taking a data set playing
around with it a bit and doing some
stats like finding the mean distribution
of variables and making some
visualizations okay let's talk about
timelines to get to this point in terms
of timeline i would say coding should
take you about one or two weeks at four
hours per day stats should take you
again one to two weeks maybe a little
more maybe a little longer depending on
how much stats that you remember and
visualization should take you only about
one to two hours to a day to get a hang
of now you might be thinking this is
probably a lot shorter than i thought and
that's okay because remember breadth
first approach centered around
project-based learning you don't have to
know everything just enough the basics
that you can start doing a project which
will help you learn even faster so what
exactly should you do for your first
project well let me let you in on a
secret so much of data science is
learning from other data scientists and
working on top of what others have built
i find that the best projects to start
with when you're new in the field is to
take someone else's project and work
through it for example you can start
with the famous titanic data set on
kaggle and pick one of the highly rated
notebooks then if you're feeling daring
you can add something onto it and take
it a step further word of warning here
is of course don't just go and copy code
right like that clearly will not help
you learn but if you understand what
each line of code is doing and the
rationale behind it you'll gain an
understanding on how to approach a
project then next time when you're doing
another project you will know how to
approach it honestly even now when i
want to learn something that i'm not
super familiar with i find the fastest
way to learn is to start by doing a
project that someone else has done and
then applying it to my own project later
so by now after working through a kaggle
notebook or two you'll probably notice
that for many kaggle notebooks after
some initial exploration of the data
they start jumping into machine learning
for example some exploratory data
analysis may show that the likelihood of
survival when you're male is far lower
than if you're female and also your
class has to do with survival then the
question becomes can you predict
survival and the answer is yes with
machine learning so now it's time to
learn about machine learning there's
around 10 to 15 common machine learning
algorithms and there's a lot of ways of
classifying them one example is dividing
them into supervised learning
unsupervised learning and reinforcement
learning i recommend intuitively
understanding how the algorithms work
without worrying too much about the
exact math behind it for example linear
regression is the simplest machine
learning model and intuitively how it
works is that it tries to draw a
straight line that minimizes the
distance between each data point and
that line and the model is the line you
drew that can predict for example the
probability of survival on the titanic
given an age the good news is that most
machine learning algorithms are actually
quite intuitive and not super difficult
to understand to learn the basics of the
common machine learning algorithms i
would say it should take you about like
three to four weeks
again assuming four hours per day
definitely feel free to go deeper into
the math if you are interested however
depending on your math proficiency you
may need to refresh your calculus and go
deeper into statistics okay cool now you
can continue working through the
notebook of someone else's project and
trying out the different machine
learning algorithms it's also super
useful here to understand the notebook
author's reason for the data
pre-processing that's being done the
reason why certain machine learning
algorithms are chosen and their pros and
cons as well as how to optimize the
models these are super practical things
that are extremely important to doing
machine learning so be sure to really
understand the reasoning behind choices
that are being made now we'll cover
things up to machine learning and next
up is data scraping slash apis this
comes into play when you graduate out of
using pre-built data sets especially if
you want to do your own project it's
actually really rare that you'll find
kind of like just a nice data set laid
out for you already the more likely
situation you find yourself in is having
to scrape the data yourself from
websites or using apis which stand for
application programming interface for
scraping data a module i would recommend
checking out is beautiful soup very
useful and quite cute and whimsical too
if i do say so myself it shouldn't take
you more than a couple days to a week to
have a good grasp for apis or
application programming interfaces
they are software built by other people
that you can use to get access to data
amongst other functions but what is
relevant here is that you can get data
using apis to learn how to use an api it
may take you some time to understand how
to use it because it involves
understanding how to use other people's
software and this really has to do with
how well documented the api is reading
documentation is in itself a skill in
both understanding how to read
documentation as well as like developing
the patience to read documentation again
remember the approach that i guess i've
already beaten into you at this point
breadth-first approach project-based
learning learn the minimum and do the
project next up databases for databases
what to learn here is understanding the
different types of databases like
relational databases nosql databases
cloud databases etc a language that you
may especially want to pick up here is
sql it's a much easier language to learn
compared to python and shouldn't take
you more than a week or two to learn it
well pro tip here is if you're
interested in getting a job as a data
scientist data analyst or data engineer
almost all companies will ask you sql
questions as part of the interview
process in my opinion the minimum here
to learn is relational databases and the
language behind them which is sql
especially if you're primarily learning
data science to get a job timeline here
is two weeks for the basics for database
projects i recommend downloading some
data sets like from kaggle for example
and then importing that data into your
own database this teaches you how to
create a database create tables inside
the database and manipulate the data
okay we're almost done so for the next
two topics deployment and specific
niches i consider these more advanced
topics deployment comes into play when
you want to take the machine learning
model you develop and put into a live
environment instead of just having it in
a notebook that you have you can deploy
the model across different code
environments and also integrate them
into other software then if you're
interested in a specific field of data
science you can also explore niches like
natural language processing or nlp which
has to do with developing algorithms
that understand human languages known as
natural languages it's really a very
cool interdisciplinary field there's
also niches like computer vision that
has applications in self-driving cars
for example it's kind of hard for me to
give you a timeline on these niches
because theoretically you can easily do
a project in nlp for example with the
skills you learned so far by employing
modules that other people have developed
which abstract away a lot of the
underlying concepts and this will take
you like a few hours to a few days to
learn but if you're interested in these
niche topics i would also assume you
would want to understand more of the
theory behind it and i mean there are
people who have phds in the field so in
terms of timeline really depends on how
far you want to go now let's talk about
some recommended resources i personally
prefer interactive interfaces to learn
coding like free code camp for example
because you can see what it is that you
were coding for basic statistics and
theory and math behind machine learning
algorithms the top resources i would
recommend are statquest by josh starmer
and dataiku's own guides both of which
are free for projects to follow i
already mentioned it before but kaggle
is great because there's notebooks where
you can see how people approach projects
from different perspectives a great free
resource to learn sql is mode which is
what i personally use to learn sql from
scratch and pass my own data science
interview to learn more about databases
in general there's also great moocs
available finally for deployment and
more niche topics i would personally go
with highly rated courses from moocs and
again rely heavily on working on my own
projects because at this point you
should already be quite proficient in
the basics so it's more about building
on top of them and doing specific
projects that interest you honestly
there are so many amazing free and low
cost options for learning data science
out there and i just listed a few that
i personally used and liked my
preference is to choose resources that
are interactive and already have
projects built into them i get it though
if you prefer learning from online video
courses or books for example and that's
totally fine my only recommendation is
that you should also intentionally work
through projects so you can learn to
implement and in summary if you want
like the most simplistic guide possible
for how to choose a good resource and if
you're willing to spend a little money
you just absolutely cannot go wrong with
choosing a highly rated course on the
topic on a mooc platform there are many
many courses to cover each of these
topics that we discussed now finally
let's talk a little bit about how the
landscape of data science is progressing
as the data science field develops and
becomes more mature a lot of repetitive
tasks in data science like data cleaning
pre-processing exploratory data analysis
machine learning and even deployment are
becoming automated in fact dataiku does
just this dataiku is a platform for
everyday ai that systemizes the use of
data for business results by using
dataiku you're able to create share and
reuse applications that leverage data
and machine learning to extend and
automate decision making dataiku also
allows you to scale ai safely and
effectively and deliver advanced
analytics using the latest techniques at
big data scale dataiku is really
powerful and you should check out more
about the platform if you're interested
link in the descriptions below but wait
a second
if you've been paying attention
you are probably thinking right now why
should i learn all the things we just
talked about earlier if it's becoming
automated there's actually still very
good reason to do so first it's still
important for you to understand how
things work so you can understand how to
apply analyses and algorithms to
specific use cases and learn how to best
leverage these tools available because
after all they still are tools even if
they're automating things and we need to
make sure that they're doing what
they're supposed to be doing another
implication of so much of the data
science and machine learning pipelines
being automated is that it's become more
and more important for a data scientist
to have domain knowledge which by the
way is the third pillar of data science
that we haven't really discussed until
now domain knowledge or business slash product
sense this is just as important as the
coding and the stats coding and stats
and the machine learning algorithms and
all the other technical stuff is only as
valuable as how much value it can
provide to the company so even if you
make the fanciest and best algorithm
ever honestly nobody actually would care
if it doesn't provide value to the
company so it's very much the data
scientist's job to understand the
business reason for doing an analysis
or building a model to make sure that
what they're doing also has real impact
in the organization it is also a data
scientist's job to communicate the value
of what they're doing and make sure what
they do is actually going to be used for
those of you who are not in industry you
might think that this is kind of weird
right it's like of course it has so much
value and impact but believe me in
practice it's actually really crucial
and it's not just a given because if
decision makers don't understand why
your analysis or your model is useful
then they don't want to use it right and
even if you make the best model ever and
it has a lot of impact your effort would
be for nothing if it's not being used so
to summarize this section since many
repetitive data science tasks are being
automated it's important to one
understand how the algorithms work and
make sure that they're functioning
properly in your given context and two
focus on gaining domain knowledge and
learn how to communicate and present
your findings and the impact of your
work in the business context alright
that's all i have for you today i've
linked all the resources i've talked
about in the description below
do also share your thoughts on this
guide on how to learn data science i
will see you guys in the next video