Introduction to Data Science - Fundamental Concepts

myAcademic-Scholartica
26 Apr 2018 11:22

Summary

TLDR: Siobhan Kadar introduces the 2017 Data Science Bootcamp, outlining the interdisciplinary nature of data science and its importance in extracting insights from structured and unstructured data. She discusses the skills required to be a data scientist, including computer science, statistics, and domain expertise, and emphasizes the growing demand for such professionals. The presentation covers the role of data scientists in making sense of data, the significance of data analysis in various industries, and concludes with recommended books for further learning.

Takeaways

  • 📊 Data Science is an interdisciplinary field focused on using scientific methods, processes, and systems to extract knowledge or insights from structured or unstructured data.
  • 🌟 The demand for data scientists is high across industries due to the vast amounts of data available and the need to extract value from it.
  • 👨‍💻 Hal Varian, Chief Economist at Google, emphasizes the importance of the ability to understand, process, extract value from, visualize, and communicate data.
  • 🏆 Data science jobs topped a Glassdoor survey for best work-life balance, and the Harvard Business Review calls data scientist the 'sexiest job of the 21st century'.
  • 🧠 A data scientist must have a unique blend of skills, including more computer science knowledge than a statistician and more statistics than a computer scientist.
  • 📈 The role of a data scientist involves cleaning, processing, analyzing data, and drawing inferences to make sense of complex datasets.
  • 💡 Data scientists are equipped with knowledge in statistics, machine learning, linear algebra, programming, mathematics, data visualization, and domain expertise.
  • 🌐 The availability of inexpensive computing power and cloud services like AWS, Google Cloud Platform, and Microsoft Azure has facilitated data analysis on a large scale.
  • 🔍 Data scientists use advanced machine learning and programming to uncover deeper insights and make future predictions, going beyond the capabilities of traditional data analysts.
  • 📚 The speaker recommends three books for further reading: 'Data Science for Business', 'The Art of Data Science', and 'The Elements of Statistical Learning'.

Q & A

  • What is the definition of data science according to the speaker?

    -Data science is an interdisciplinary field about scientific methods, processes, and systems to extract knowledge or insights from data in various forms, either structured or unstructured.

  • Why is data science considered a hot field and important for the future?

    -Data science is considered a hot field due to the availability of huge amounts of data and the need for professionals who can understand, process, extract value from, visualize, and communicate data across industries. Hal Varian, Chief Economist of Google, emphasized the importance of these skills for the next decades.

  • What does the speaker suggest as the key skills for a data scientist?

    -A data scientist should know more computer science than a statistician and more statistics than a computer scientist. They must be proficient in statistics, machine learning, linear algebra, programming, mathematics, data visualization, and possess domain expertise.

  • Why is the demand for data scientists increasing?

    -The demand for data scientists is increasing because of the massive amounts of data being collected across industries, the decreasing cost of computing power, and the need for advanced analytics to make future predictions and informed business decisions.

  • How does the speaker describe the role of a data scientist compared to a traditional data analyst?

    -A data scientist uses advanced knowledge of machine learning, programming, and engineering to manipulate data and uncover deeper insights, making future predictions, while a traditional data analyst is bound by SQL queries and analytic packages to extract information from historical data.

  • What are the steps in data analysis according to the speaker?

    -The steps in data analysis are stating the right question, exploratory data analysis, building a model, interpreting the results, and communicating the findings.

  • What is the significance of Google Trends mentioned in the script?

    -Google Trends is significant as it provides an indication of the growing interest in data science over time by showing search trends and the popularity of data science-related queries.

  • What does the speaker mean by 'data-fication'?

    -Data-fication refers to the process of turning previously invisible processes into data, such as quantifying preferences based on likes on Facebook or evaluating the significance of web pages using Google's PageRank algorithm.

  • What are the types of questions that can be asked during data analysis according to the script?

    -The types of questions that can be asked during data analysis include descriptive, exploratory, predictive, causal, and mechanistic.

  • What are the three books recommended by the speaker for learning data science?

    -The three books recommended are 'Data Science for Business' for a general audience, 'The Art of Data Science' for a more technical understanding, and 'The Elements of Statistical Learning' which is a technical book on statistical machine learning.

Outlines

00:00

📊 Introduction to Data Science

Siobhan Kadar introduces the 2017 Data Science Bootcamp, outlining the agenda, which includes an introduction to data science, the role of a data scientist, and the steps in data analysis. She defines data science as an interdisciplinary field focused on extracting knowledge from structured or unstructured data using scientific methods. Highlighting the importance of data science, she quotes Hal Varian, Google's Chief Economist, on the value of being able to process, visualize, and communicate data. The demand for data scientists is underscored by industry and government needs: data science jobs topped a Glassdoor survey for best work-life balance, and the Harvard Business Review dubbed data scientist the 'sexiest job of the 21st century'. The talk also touches on the practical aspects of data science, such as cleaning and processing data to draw meaningful inferences.

05:02

🧠 The Skills and Necessity of Data Scientists

This section delves into the skills required to be a data scientist, which include a blend of computer science, statistics, and domain expertise. It emphasizes the importance of understanding and manipulating data to uncover insights and make predictions. The speaker discusses the exponential growth of data collection across various industries and the affordability of computing power, facilitated by cloud services. The role of a data scientist is contrasted with that of a traditional data analyst, highlighting the data scientist's ability to use advanced techniques like machine learning for future predictions. The concept of datafication, turning non-quantitative information into data, is introduced with examples like Facebook Likes and Google's PageRank algorithm. The iterative process of data analysis is outlined, from stating the right question to communicating results, with an emphasis on the importance of asking the right questions and the role of hypothesis testing in data analysis.
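The PageRank example mentioned above, which weights web pages based on links, can be illustrated with a short sketch. This is a simplified textbook power-iteration version, not Google's production algorithm; the four-page link graph, the damping value of 0.85, and the function name `pagerank` are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is linked to by A, B, and D; nothing links to D.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, page C ends up with the highest score because the most pages link to it, while D, which no page links to, keeps only the baseline share.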

10:03

📚 Conclusion and Recommended Readings

In the concluding part of the presentation, Siobhan Kadar summarizes the significance of data science as an interdisciplinary field with broad applications in various industries. She stresses the role of data scientists in making business decisions through pattern discovery and future predictions. The presentation ends with a recommendation of three books for further reading: 'Data Science for Business' for a general audience, 'The Art of Data Science' for a more technical perspective, and 'The Elements of Statistical Learning' for in-depth statistical machine learning knowledge. The speaker also acknowledges the contributions of professors from Israel in preparing the presentation.

Keywords

💡Data Science

Data Science is an interdisciplinary field that uses scientific methods, processes, and systems to extract knowledge or insights from data in various forms, including structured and unstructured data. In the video, Siobhan Kadar introduces data science as a field that is crucial for extracting value from the massive amounts of data available today. It is highlighted as a 'hot field' with significant job opportunities, and the data scientist role is described as the 'sexiest job of the 21st century' by the Harvard Business Review.

💡Data Scientist

A Data Scientist is a professional who applies their knowledge of statistics, machine learning, and programming to analyze and interpret complex data sets. In the script, Hal Varian's quote emphasizes the importance of a data scientist's ability to understand, process, extract value from, visualize, and communicate data. The role is depicted as essential for making sense of data and for predicting future trends, which is critical in various industries.

💡Structured and Unstructured Data

Structured data refers to information that is organized in a specific format, making it easily searchable and analyzable. Unstructured data, on the other hand, is data that does not follow a specific format and can include text, images, and videos. The script mentions that data scientists work with both types of data to extract insights, indicating the broad scope of data they handle.

💡Machine Learning

Machine Learning is a subset of artificial intelligence that enables systems to learn from data, identify patterns, and make decisions with minimal human intervention. In the video, machine learning is mentioned as a key component of a data scientist's skill set, which they use to uncover deeper insights and make predictions from data.

💡Data Cleaning

Data Cleaning is the process of identifying and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. In the script, it is mentioned as a crucial step in the data scientist's workflow, where raw data is cleaned and processed before analysis to ensure the accuracy of the insights drawn.
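As a rough illustration of the cleaning step described above, the sketch below drops corrupt or incomplete records, normalizes a text field, and removes duplicates. The record layout (`name`, `age`), the sample rows, and the function name `clean_records` are hypothetical and do not come from the talk.

```python
def clean_records(records):
    """Drop incomplete or corrupt rows, normalize names, and remove duplicates."""
    cleaned = []
    seen = set()
    for row in records:
        # Normalize the name: treat None as empty, trim spaces, fix capitalization.
        name = (row.get("name") or "").strip().title()
        try:
            age = int(row.get("age"))
        except (TypeError, ValueError):
            continue  # age is missing or not a number
        if not name or age < 0:
            continue  # name is missing or age is impossible
        key = (name, age)
        if key in seen:
            continue  # duplicate once normalized
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


# Made-up raw records with typical problems.
raw = [
    {"name": " alice ", "age": "34"},
    {"name": "Alice", "age": 34},    # duplicate of the first row after normalization
    {"name": "bob", "age": "n/a"},   # corrupt age
    {"name": None, "age": 51},       # missing name
    {"name": "carol", "age": "29"},
]
cleaned = clean_records(raw)
print(cleaned)
```

Only the Alice and Carol records survive here; whether a bad row should be dropped or repaired depends on the dataset and the question being asked.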

💡Data Visualization

Data Visualization involves the use of graphical representations to display data in a way that is easy to understand and interpret. The script emphasizes the importance of data visualization in communicating the insights and findings from data analysis to stakeholders in a clear and compelling manner.

💡Domain Expertise

Domain Expertise refers to the specialized knowledge and skills in a particular area or industry. In the context of the video, it is highlighted that a data scientist must have domain expertise in addition to technical skills, as this enables them to understand the context of the data and apply their findings effectively within specific industries.

💡Cloud Services

Cloud Services refer to the provision of computing services, including servers, storage, databases, networking, software, analytics, and intelligence, over the Internet (the cloud). The script mentions cloud services like AWS, Google Cloud Platform, and Microsoft Azure as examples of how data scientists can access powerful computing resources to analyze large data sets without the need for significant upfront investment in infrastructure.

💡Data Analysis

Data Analysis is the process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. The script outlines the steps in data analysis, emphasizing the iterative nature of the process and the importance of asking the right questions, exploring data, building models, and communicating results.
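The steps named above (state the question, explore, build a model, interpret, communicate) can be walked through on a toy problem. The sketch below fits a straight line by ordinary least squares, which is one possible model choice and not necessarily the one used in the bootcamp; the numbers are made up for illustration and are not real measurements.

```python
from statistics import mean

# Step 1: state the question (predictive): does x help predict y?
# Made-up values in arbitrary units, used only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Step 2: exploratory data analysis: look at simple summaries first.
summary = {"mean_x": mean(xs), "mean_y": mean(ys), "min_y": min(ys), "max_y": max(ys)}


def fit_line(xs, ys):
    """Step 3: build a model, here a least-squares line y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# Step 4: interpret: the slope is the expected change in y per unit of x.
prediction = slope * 6.0 + intercept

# Step 5: communicate the result in plain language.
print(f"Summary: {summary}")
print(f"Each extra unit of x adds about {slope:.2f} to y; predicted y at x=6 is {prediction:.2f}")
```

If the fit does not match expectations, the process loops back to an earlier step, which is the iterative character the speaker stresses.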

💡Hypothesis Testing

Hypothesis Testing is a statistical method used to make decisions about a population parameter using data from a sample. In the video, hypothesis testing is mentioned as a part of the data analysis process where data scientists generate and test hypotheses to understand relationships and make predictions.
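To make the idea concrete, here is a minimal sketch of one kind of hypothesis test, a two-sided permutation test on the difference in group means. It loosely follows the speaker's basketball example (taller players being more successful), but the scores are invented, and the choice of a permutation test, the function name, and its parameters are assumptions, not the method presented in the bootcamp.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Return the observed mean difference and a two-sided permutation p-value."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Invented points-per-game figures for two hypothetical groups of players.
taller = [18, 22, 25, 19, 24, 21]
shorter = [12, 15, 11, 14, 16, 13]
observed, p_value = permutation_test(taller, shorter)
print(f"observed difference: {observed}, p-value: {p_value:.4f}")
```

A small p-value means a difference this large would rarely arise if height had no relation to scoring; it does not by itself show that height causes success, which is the separate causal type of question discussed in the talk.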

Highlights

Introduction to data science as an interdisciplinary field focused on extracting knowledge from data.

Definition of data science as using scientific methods to process data and gain insights.

The importance of data science in various industries due to the availability of large data sets.

Quote from Hal Varian emphasizing the value of data understanding and processing in the future.

Data science being recognized as a field with excellent work-life balance and as the 'sexiest job of the 21st century'.

The demand for data scientists in new and emerging jobs across both government and industry sectors.

The process of making sense of data through cleaning, processing, analyzing, and drawing inferences.

Skills required to be a data scientist, including computer science, statistics, and domain expertise.

The necessity of data scientists knowing more computer science than statisticians and more statistics than computer scientists.

The role of data scientists in making future predictions using advanced statistics and complex data modeling.

The decrease in cost and increase in computing power, making data analysis more accessible.

The significance of Google Trends in indicating the growing interest in data science over the past 5 years.

The role of a data scientist versus a data analyst, with a focus on future predictions and deeper insights.

Datafication as the process of turning non-quantitative information into data for analysis.

Steps in data analysis, including stating the question, exploratory data analysis, building a model, and communicating results.

The iterative nature of data analysis and the importance of setting the right questions.

Recommendation of three books for further understanding of data science: 'Data Science for Business', 'The Art of Data Science', and 'The Elements of Statistical Learning'.

Conclusion on the interdisciplinary nature of data science and its importance in various industries and business decisions.

Transcripts

play00:00

good morning everybody I'm Siobhan Kadar

play00:02

I welcome you for the 2017 data science

play00:08

bootcamp here is the agenda first I will

play00:12

give you an introduction to data science

play00:13

and then I will talk about how to be a

play00:16

data scientist and then why now and then

play00:22

role of a data scientist and then steps

play00:25

in data analysis and finally I will

play00:28

conclude the presentation so in a simple

play00:31

sentence a data science is an

play00:33

interdisciplinary field about scientific

play00:36

methods processes and systems to extract

play00:40

knowledge or insights from data in

play00:42

various forms either structured or

play00:44

unstructured this is a very simple

play00:46

definition you can start with this and

play00:48

with the availability of huge amounts of

play00:51

data and software and technologies

play00:54

almost every industry today needs a data

play00:58

scientist as there are lots of

play00:59

interesting use cases so this is the

play01:04

picture of Hal Varian he is the chief

play01:06

economist of Google and what he said is

play01:09

this the ability to take data to be able to

play01:12

understand it to process it to extract

play01:17

value from it to visualize it to

play01:21

communicate it that's going to be a

play01:23

hugely important skill in the next

play01:25

decades and as I mentioned data science

play01:30

is a very hot field and this has been

play01:32

noted at this article

play01:34

data science jobs top Glassdoor survey

play01:36

for best work-life balance and the

play01:39

Harvard Business Review calls it the

play01:41

sexiest job of 21st century also both

play01:45

government and industry have indicated

play01:47

that there is a dire need for data

play01:49

scientist for new and emerging jobs now

play01:53

let's look at this picture you know if

play01:56

you look at this picture you won't make

play01:58

any sense right I mean you just some

play02:00

numbers or whatever it doesn't make any

play02:02

sense so what a data scientist will do

play02:05

is the data scientist will take this

play02:08

data it will be cleaned it will be

play02:12

processed

play02:13

will be analyzed and then it will draw

play02:16

inference that's the whole idea so the

play02:18

art of making sense of data that's the

play02:20

whole idea of a role of a data scientist

play02:22

so what it takes to be a data scientist

play02:25

so in a nutshell this picture gives you

play02:27

a very good idea about what it takes to

play02:29

be a data scientist collect and clean

play02:31

data explore and find trends build

play02:35

models and algorithms design experiments

play02:40

communicate results and design data

play02:42

products this is typically in a nutshell

play02:45

what a data scientist will know now so

play02:48

what do you need to know now before I go

play02:52

to this slide there is a very

play02:53

interesting comment about data scientist

play02:56

I don't remember who said that but a

play02:58

data scientist must know more computer

play03:00

science than a statistician and more

play03:03

statistics than a computer scientist so

play03:06

is it a very challenging yes definitely

play03:09

it's really hard core data scientists

play03:11

must know all those things so that means

play03:13

statistics machine learning linear

play03:16

algebra programming mathematics

play03:19

including discrete mathematics data

play03:21

visualization and also a domain

play03:24

expertise so these are all the things

play03:25

you know necessary to become a very good

play03:27

data scientist and so why now we collect

play03:32

more data than ever I mean all of you

play03:34

agree with me that if you look at any

play03:36

industry they're collecting huge amounts

play03:39

of data whether it is any business a

play03:42

large scale computer networks

play03:44

pharmaceutical industry gene and

play03:46

genomics life sciences social media

play03:49

semiconductors you know you name it you

play03:52

know sensor network smart cities

play03:55

everyone every industry is collecting

play03:57

massive amounts of data and the other

play04:00

thing is the good news is it's

play04:02

inexpensive and available computing

play04:04

power so we have lots of computing power

play04:06

if you look at Google or Amazon or

play04:09

Microsoft they offer cloud services

play04:12

right AWS and Google cloud platform

play04:15

Microsoft Azure these are the typical

play04:18

cloud service vendors who provide all

play04:20

these services and it is also if you

play04:24

look at Google Trends how many of you

play04:25

are

play04:26

with Google Trends it's pretty pretty

play04:29

useful because Google Trends actually

play04:31

let me see if I I think I can show you

play04:34

this you know you can search on Google

play04:37

Trends and you can see if you look at

play04:39

this slide you will see that a the the

play04:43

horizontal axis actually shows the time

play04:47

line and the vertical axis gives you a

play04:49

you know people what are they are

play04:51

searching for so in a scale of 0 to 100

play04:53

in 100 being the maximum you can see

play04:57

like if you look at data science for

play04:58

example and if you can see past 12

play05:01

months in we can also look at past 5

play05:04

years

play05:05

so this gives you an indication of what

play05:07

people are looking for in the areas of

play05:09

data science and how much interest you

play05:11

know over the period of time last 5

play05:13

years the interest is growing right so

play05:16

this is also a very good indicator of

play05:20

interest in this area and then if you

play05:23

look at the sequencing of the human

play05:25

genome you know if you look at this this

play05:29

data is from National Human Genome

play05:31

Research Institute you will notice that

play05:33

the cost is also decreasing and most of

play05:37

you are familiar with Moore's law which

play05:40

says that every two years the computing

play05:42

power the number of transistors in a

play05:45

dense integrated circuit right is

play05:47

doubling every two years so that's why

play05:49

the computing is becoming cheaper and

play05:53

cheaper over the years and you don't

play05:58

need to invest a huge amount in IT

play06:00

infrastructure today you can rent any of

play06:04

Google's or Amazon's you know data

play06:07

centers the servers and now I'm going to

play06:09

talk about the role of a data scientist

play06:12

you know typically a data analyst or an

play06:16

architect can extract information from

play06:19

large sets of data yet they are bound by

play06:22

the SQL queries and analytic packages

play06:25

used to slice these data sets so

play06:26

typically if you look at historical data

play06:29

they will be stored in a data warehouse

play06:30

and then you run some SQL queries you

play06:35

extract the information you generate

play06:37

reports and that's how a data

play06:39

analyst worked you know but a data

play06:41

scientist on the other hand uses

play06:44

advanced knowledge of machine learning

play06:45

and programming engineering and data

play06:48

scientists can manipulate data at their

play06:51

own will uncovering deeper insight so a

play06:53

data scientist will actually can make

play06:56

some predictions based on the data while

play07:00

your typical data analyst look to the

play07:02

past and what's happened a data

play07:05

scientist must go beyond this and look

play07:07

to the future okay so through

play07:10

application of advanced statistics and

play07:12

complex data modeling

play07:14

they must uncover patterns and make

play07:17

future predictions so that's the role of

play07:19

a data scientist and it's becoming an

play07:21

increasingly important for every

play07:23

organization to make some you know

play07:26

future predictions and even later on you

play07:28

will see how they can be used to make

play07:30

decisions now let's talk about data

play07:33

fication taking a process that was

play07:36

previous previously invisible into data

play07:38

for example if you look at Facebook

play07:40

Likes you know so we want to quantify

play07:42

that and how do I quantify the likes you

play07:45

know so you know measure preference

play07:47

preferences based on likes and if you

play07:50

look at Google you know every page you

play07:53

associate a weight with that page like

play07:55

Google's PageRank algorithm when you

play07:57

evaluate the significance of webpages

play07:59

based on links so now I will talk about

play08:04

steps in data analysis the first thing

play08:08

is stating the right question the second

play08:10

one is exploratory data analysis the

play08:13

third one is building a model then the

play08:16

fourth step is interpret and then

play08:18

communicate the results so setting that

play08:21

so for each of these steps we will go

play08:23

through the phases like you know setting

play08:26

expectations then then we will do we

play08:30

will collect the data and finally match

play08:32

expectations with data and this is an

play08:34

iterative process you know when you do

play08:37

data analysis you do go through all

play08:40

these steps and in the first iteration

play08:42

you may not get good results so you

play08:44

keep on iterating and unless you get do

play08:48

several times you won't be able to get

play08:49

good results in data analysis so that's

play08:53

why it is an

play08:53

iterative process so first step is

play08:57

stating the right question like what is

play08:59

the population of California that's the

play09:02

descriptive type exploratory generate

play09:04

hypothesis from data for example you can

play09:06

make an hypothesis that the height of a

play09:09

player basketball player is related to

play09:11

the success of you know you can make an

play09:13

hypothesis if he is six feet and above

play09:15

then he would be most likely very

play09:16

successful like this this kind of

play09:17

hypothesis you can make and then you can

play09:21

make some prove a hypothesis based on

play09:23

the data so we will discuss more about

play09:25

hypothesis testing late later today and

play09:27

then predictive what data A predicts B

play09:30

like if you have high levels of co2 in a

play09:34

particular region what is the effect of

play09:36

related to the temporizing temperature

play09:39

or global warming this kind of things

play09:41

questions you can ask and will changing

play09:43

A also change B that is like causal and

play09:46

then how does A affect B that's

play09:49

mechanistic now we will go through all

play09:51

these examples later today when Suman

play09:53

will be talking in details about how to

play09:55

run experiments with this kind of

play09:57

hypothesis and finally I'm going to

play10:03

conclude this presentation by saying

play10:05

that data science is an

play10:06

interdisciplinary subject that has great

play10:09

applications in various industry and

play10:11

businesses through application of

play10:14

advanced statistics and computer data

play10:16

modelling data scientists discover

play10:19

patterns and make future predictions

play10:23

data scientists are becoming

play10:25

increasingly important in making

play10:26

business decisions and finally data

play10:29

science is an important field with lots

play10:31

of career opportunities okay and finally

play10:37

these are the three books that I think

play10:41

will be very useful particularly the

play10:43

first one data science for business I

play10:45

find this very useful it's written for a

play10:48

general audience so anybody can read

play10:50

this book the second one is a little

play10:53

more technical the art of data science

play10:56

by Peng and Matsui and the third

play10:59

one is the elements of statistical

play11:01

learning this is a very technical book

play11:02

and it's hard to read but it is current

play11:05

it is a basically statistical machine

play11:07

learning

play11:07

things like that so these are the three

play11:09

books that are very useful in this

play11:12

context and and professor it is Sauron

play11:16

from Israel and they also helped me

play11:19

in preparing these slides so I would like

play11:21

to thank them

Related Tags
Data Science, Bootcamp 2017, Career Opportunities, Data Analysis, Machine Learning, Statistics, Big Data, Predictive Modeling, Industry Trends, Tech Innovation