Intro to Data Science: Historical Context

Steve Brunton
5 Jun 2019
08:06

Summary

TLDR: This lecture explores the concept of data science, emphasizing its long-standing roots in human history. It distinguishes between data-driven science and the emerging field of data science, which involves handling, cleaning, storing, visualizing, and modeling data. The talk uses the historical example of Tycho Brahe's meticulous planetary observations, crucial for Kepler's laws and Newton's theories, to illustrate data science's impact. It also highlights the importance of moving from descriptive models like Kepler's to generalizable theories like Newton's, a goal for modern data scientists and machine learning practitioners.

Takeaways

  • 🔬 Data science is not a new concept; humans have been collecting and modeling data for centuries.
  • 📚 The term 'data science' can mean different things to different people, often referring to data-intensive science, engineering, or data-driven inquiry.
  • 🌌 Astronomy is highlighted as an example of a data-intensive science, where data collection and analysis have been pivotal in understanding planetary motion.
  • 📈 Tycho Brahe's meticulous data collection on planetary and star movements was instrumental in Kepler's discovery of elliptical orbits.
  • 🔍 Brahe's dedication to rigorous data collection and storage laid the groundwork for future scientific advancements.
  • 🐃 Fun fact: Tycho Brahe was an intriguing character with a pet moose that enjoyed beer, reflecting his unique personality.
  • 📚 Kepler's laws describe the elliptical motion of planets, while Newton's laws explain why planets move in these orbits, demonstrating the progression from description to cause.
  • 🚀 Newton's generalized laws allowed for practical applications like the Apollo program, showing the importance of generalization in scientific theories.
  • 🤖 Modern machine learning algorithms often describe the world as observed (like Kepler), but the goal should be to create models that generalize like Newton's did.
  • 📈 The 'fourth paradigm' of data-intensive scientific discovery complements traditional methods like theory, experiments, and simulations, rather than replacing them.
  • 📘 For those interested in the technical aspects of data science, the book 'Data-Driven Science and Engineering' and the associated website offer in-depth lectures and resources.

Q & A

  • What is the main focus of the lecture series on data science?

    -The lecture series focuses on providing an introductory overview of data science, explaining what it is, how it can be used, and its various aspects.

  • Why is it emphasized that data science is not a new concept?

    -It is emphasized because humans have been collecting and modeling data for centuries, and the concept of data science has evolved over time rather than being a completely new invention.

  • What are the different interpretations of the term 'data science' mentioned in the script?

    -The different interpretations include data-intensive science, data-intensive engineering, and data-driven inquiry, all of which involve using data to drive scientific investigation and discovery.

  • What is an example of a data-intensive science field mentioned in the script?

    -Astronomy is given as an example of a data-intensive science field, where the collection and analysis of data about celestial bodies have been crucial for scientific advancements.

  • Who is Tycho Brahe and why was he significant in the history of data science?

    -Tycho Brahe was a Danish astronomer known for his meticulous data collection on the motion of planets and stars, which was instrumental in Kepler's discovery of planetary motion laws.

  • What inconsistency did Tycho Brahe notice between the models of his time and his observations?

    -Tycho Brahe noticed that the models of planetary motion of his time did not agree with what he observed; in particular, a predicted conjunction of planets did not match the models to his satisfaction. This led him to collect rigorous and systematic data.

  • What is the significance of Kepler's laws of planetary motion in the context of data science?

    -Kepler's laws describe the elliptical orbits of planets, which were derived from the data collected by Tycho Brahe. This demonstrates the power of data in shaping scientific understanding and theories.

  • How did Isaac Newton's work build upon the foundation laid by Tycho Brahe and Kepler?

    -Newton explained why planets move in elliptical orbits by formulating the universal law of gravitation, which generalized the principles behind planetary motion and enabled further scientific and technological advancements.

  • What is the difference between Kepler's and Newton's approaches to modeling the world, as discussed in the script?

    -Kepler built a model based on observed data describing how the solar system works, while Newton generalized these observations into a physical principle that could predict and explain a wider range of phenomena.

  • What is the 'fourth paradigm' referred to in the script, and how does it relate to data science?

    -The 'fourth paradigm' refers to data-intensive scientific discovery, which complements traditional methods like theory, experimentation, and computation by leveraging massive amounts of data for scientific insights.

  • What resource is recommended for those interested in the mathematical aspects of data science?

    -The book 'Data-Driven Science and Engineering', co-authored by the speaker and Nathan Kutz, is recommended, along with their website databookuw.com, which contains lectures and videos on various topics.

Outlines

00:00

📚 Introduction to Data Science and Its Historical Roots

This paragraph introduces the concept of data science, emphasizing that it is not a new field but rather an evolution of human practices that date back centuries. The speaker clarifies that data science can mean different things to different people, such as data-intensive science, data-intensive engineering, or data-driven inquiry. The paragraph uses the example of astronomy to illustrate data science in action, highlighting Tycho Brahe's meticulous data collection on planetary motion, which was instrumental in Kepler's discovery of elliptical orbits. The historical narrative also touches on the significance of data in scientific discovery and the transition from observational data to the formulation of universal laws, as demonstrated by Newton's work on gravitational forces.

05:01

🚀 From Observational Models to Generalized Theories in Data Science

The second paragraph delves into the distinction between descriptive models like Kepler's elliptical orbits and the generalized theories that enable practical applications, such as Newton's laws of motion. It discusses the importance of moving from data description to data generalization, which is essential for advancements like the Apollo moon landings. The speaker also references 'The Fourth Paradigm: Data-Intensive Scientific Discovery,' a book that outlines the progression of scientific methods, from theoretical analysis to data-driven inquiry. The paragraph concludes with a recommendation for those interested in the technical aspects of data science to explore a book co-authored by the speaker and Nathan Kutz, which covers the mathematical foundations of data science algorithms, and mentions a website where lectures on various topics are available.

Keywords

💡Data Science

Data Science is an interdisciplinary field that uses scientific methods, processes, and algorithms to extract knowledge and insights from data. In the video, it is emphasized that data science is not a new concept but has been practiced for centuries in various forms. The script discusses how data science is about handling data through collection, cleaning, storage, visualization, and modeling, which is crucial for understanding the world and making informed decisions.

💡Data-Intensive Science

Data-Intensive Science refers to scientific disciplines that rely heavily on large volumes of data for analysis and discovery. The video uses astronomy as an example, where the collection of planetary motion data by Tycho Brahe was critical for Kepler's discovery of elliptical orbits. This concept is integral to the theme of the video, illustrating how data-driven approaches can lead to significant scientific breakthroughs.

💡Data-Driven Inquiry

Data-Driven Inquiry is the process of asking and answering questions based on data analysis. The video script mentions this concept as a motivation for the field of data science, emphasizing the importance of using data to inform scientific hypotheses and theories. It is a key aspect of modern scientific research, where conclusions are drawn from the analysis of collected data.

💡Astronomy

Astronomy is the scientific study of celestial objects, space, and the physical universe as a whole. In the context of the video, astronomy serves as an example of a data-intensive science, where the meticulous data collection by Tycho Brahe on planetary motion was instrumental in advancing the field through Kepler's laws and Newton's theories.

💡Tycho Brahe

Tycho Brahe was a Danish astronomer known for his accurate and comprehensive astronomical and planetary observations. The script highlights his dedication to collecting rigorous and systematic data, which was crucial for Kepler's laws of planetary motion. Brahe's work exemplifies the importance of data collection in the scientific process.

💡Johannes Kepler

Johannes Kepler was a German astronomer and mathematician whose laws of planetary motion described the orbits of planets as ellipses. The video script uses Kepler's work to illustrate how data analysis can lead to fundamental scientific laws, which in turn can be built upon by subsequent researchers, such as Newton, to develop more generalized theories.

💡Isaac Newton

Isaac Newton was an English mathematician, physicist, and astronomer known for his laws of motion and universal gravitation. The video emphasizes Newton's quote about the 'preponderance of the evidence' supporting his theories, highlighting the role of data in validating scientific hypotheses. Newton's work represents a shift from describing the world to generalizing principles that can predict and explain phenomena.

💡Data Collection

Data Collection is the process of gathering and measuring data from various sources. In the video, it is highlighted as a foundational aspect of data science, with Tycho Brahe's systematic data collection on planetary motion being a key example. Effective data collection is essential for accurate analysis and informed decision-making in any scientific discipline.

💡Modeling

Modeling in the context of data science refers to the creation of mathematical or computational representations to understand and make predictions about data. The video discusses how data science involves not only collecting and analyzing data but also building models to represent and predict phenomena, as seen in Kepler's elliptical orbits and Newton's laws of motion.

💡Generalization

Generalization in data science is the ability of a model to make predictions or explain phenomena beyond the specific data it was trained on. The video contrasts Kepler's descriptive model of planetary motion with Newton's generalizable laws of motion, emphasizing the importance of creating models that can be applied broadly and predict new outcomes.
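
The Kepler-versus-Newton contrast can be sketched numerically. The following is a minimal Python illustration and is not something shown in the video: it integrates Newton's law of gravitation for one body around a fixed central mass, in illustrative units where GM = 1. The function names `simulate_orbit` and `specific_energy`, the step size, and the starting speeds are all assumptions chosen for the example. A purely descriptive fit to the observed orbit could only reproduce that orbit; the force law also says what happens when the body is pushed off it.

```python
import math


def specific_energy(speed, radius=1.0):
    """Kinetic plus potential energy per unit mass, with GM = 1 (assumed units)."""
    return 0.5 * speed ** 2 - 1.0 / radius


def simulate_orbit(speed, dt=0.001, t_end=50.0):
    """Integrate a body around a fixed central mass at the origin (GM = 1).

    The body starts at (1, 0) with velocity (0, speed). Returns the list of
    distances from the origin at every time step. Uses velocity Verlet, a
    standard integrator picked here for illustration only.
    """
    x, y = 1.0, 0.0
    vx, vy = 0.0, speed

    def accel(px, py):
        # Inverse-square attraction toward the origin.
        r3 = (px * px + py * py) ** 1.5
        return -px / r3, -py / r3

    ax, ay = accel(x, y)
    radii = [math.hypot(x, y)]
    for _ in range(round(t_end / dt)):
        # Half-step velocity update, full-step position update.
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        x += dt * vx
        y += dt * vy
        # Recompute the force at the new position, then finish the velocity step.
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        radii.append(math.hypot(x, y))
    return radii


# The "observed" case: speed 1.0 gives a circular orbit of radius 1.
observed = simulate_orbit(1.0)
print("observed orbit radius range:", min(observed), max(observed))

# The "pushed" case: speed 1.5 has positive energy, so the body escapes.
pushed = simulate_orbit(1.5)
print("energy after push:", specific_energy(1.5))
print("final distance after push:", pushed[-1])
```

With the starting speed of 1.0 the distance stays close to 1 for the whole run, which is the behavior a Kepler-style description captures. With the speed raised to 1.5 the energy is positive and the body leaves for good, an outcome that only the underlying law can predict.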

💡The Fourth Paradigm

The Fourth Paradigm refers to the era of data-intensive scientific discovery, as described in the book 'The Fourth Paradigm: Data-Intensive Scientific Discovery' mentioned in the video. It represents a shift towards using large datasets to complement traditional scientific methods like theory, experimentation, and computation. The video script positions data science as an integral part of this paradigm, highlighting its role in advancing scientific knowledge.

Highlights

Data science is not a new concept; humans have been practicing it for centuries.

Data science can mean different things to different people, such as data-intensive science, data-intensive engineering, or data-driven inquiry.

Astronomy is an excellent example of a data-intensive science, with historical roots in the work of Tycho Brahe, Johannes Kepler, and Isaac Newton.

Tycho Brahe's meticulous data collection on planetary motion was critical for Kepler's discovery of elliptical orbits.

Brahe's dedication to rigorous data collection led to a systematic format that was crucial for scientific advancement.

Kepler's laws of planetary motion were a result of analyzing Brahe's data, illustrating the importance of data in scientific discovery.

Isaac Newton's work built upon Kepler's model, providing a generalized physical principle that explained why planets move in ellipses.

Newton's famous quote emphasizes the significance of data in supporting scientific hypotheses and theories.

The difference between Kepler's descriptive model and Newton's generalized theory is a key concept for modern data scientists and machine learning practitioners.

Data science as a field is about handling data through collection, cleaning, storage, visualization, and modeling.

The fourth paradigm of data-intensive scientific discovery complements traditional methods like theory, experimentation, and computation.

Data science does not replace existing scientific methods but rather integrates and enhances them with data-driven insights.

The book 'The Fourth Paradigm: Data-Intensive Scientific Discovery' discusses the progression and impact of data-driven science.

Astronomy's historical context provides a clear example of how data science has been integral to scientific progress.

The story of Tycho Brahe's dedication to data collection and its impact on Kepler and Newton's work highlights the value of rigorous data in science.

The transition from Kepler's descriptive model to Newton's generalized theory represents a goal for modern machine learning algorithms to achieve broader applicability.

The book 'Data-Driven Science and Engineering' by the speaker and Nathan Kutz provides a deeper dive into the mathematical foundations of data science.

The speaker's website offers lectures and resources for those interested in the technical aspects of data science, machine learning, and their mathematical underpinnings.

Transcripts

play00:01

welcome back so we're talking about data

play00:05

science this is an intro overview

play00:07

lecture series on kind of what is data

play00:09

science how can you use it what are the

play00:11

aspects and one thing I think is just

play00:15

really important to emphasize is the

play00:17

data science is not new we've been doing

play00:21

data science as humans for hundreds

play00:23

thousands of years collecting data

play00:25

modeling the world through that data and

play00:28

I think data science as a terminology

play00:31

means different things to different

play00:33

people so there's what I like to think

play00:36

of as data intensive science data

play00:39

intensive engineering or data-driven

play00:42

inquiry and that's science that you do

play00:46

based on data ok like if I want to solve

play00:50

so astronomy is a great example of

play00:52

something that is data intensive science

play00:53

I think of the phrase data science this

play00:58

is an emerging scientific discipline

play01:00

which is motivated by data intensive

play01:02

science but it's really the science of

play01:05

how do you handle data collect clean

play01:07

store visualize and model with data so

play01:10

it's a little confusing you have data

play01:12

driven science and that motivates this

play01:15

whole new field of science and

play01:17

engineering called data science and I'm

play01:19

going to use them interchangeably but

play01:21

that I just want to kind of deconflict

play01:22

those two terms early on and astronomy

play01:26

is a great example I want to walk you

play01:28

through just this very interesting

play01:30

history example that I loved about kind

play01:33

of Tycho Brahe and Kepler and Newton to

play01:36

give some idea of what data science

play01:38

looks like in a historical context so

play01:42

this is Tycho Brahe great Danish

play01:45

astronomer who collected the rich data

play01:50

set of the motion of planets and stars

play01:52

that was critical in Kepler's discovery

play01:58

of his his ellipses and planetary motion

play02:01

so to some extent Tycho Brahe was

play02:06

noticed inconsistencies between the

play02:09

models of the time kind of the the old

play02:12

law

play02:13

of how the planets would move and he

play02:16

noticed inconsistencies with what he

play02:19

observed so you know he there was this

play02:21

predicted conjunction of planets and it

play02:23

didn't agree with the models to his

play02:25

satisfaction and so he realized this I

play02:27

think was as a teenager that he needed

play02:30

to collect rigorous clean data to store

play02:34

it in a systematic format and to to make

play02:37

a science out of the data collection of

play02:40

planets and stars and he dedicated his

play02:43

life to this he had an island between

play02:46

Copenhagen and Sweden I don't know if

play02:49

you can see it here but this is his

play02:51

science island of Hven where he

play02:54

collected all of this rich data and he

play02:57

guarded this data so this was his life's

play02:59

work and he knew how much value and it

play03:02

turns out Kepler didn't even really have

play03:04

full access to the data until Tycho Brahe

play03:07

passed away and so so both of them knew

play03:10

the value of the data and kind of moving

play03:13

moving the theory of planetary motion

play03:16

forward and this was a critical piece in

play03:19

Kepler's famous law of the elliptic

play03:23

planets elliptic motion of planets fun

play03:27

fact about Tycho very interesting

play03:28

character I encourage you to read more

play03:29

about him he lost the tip of his nose in

play03:33

a duel when he was a young man arguing

play03:35

about who was a better mathematician on

play03:37

his Science Island he had a pet moose

play03:40

which was apparently very fond of beer

play03:43

and would entertain his guests by

play03:46

drinking a tremendous amount of beer so

play03:48

Tycho Brahe is a really interesting guy

play03:50

you can only imagine what his

play03:52

personality would be like he had to you

play03:56

know he made his life's work of very

play03:58

very very careful observations which

play04:00

changed the world forever through

play04:04

through those who came after and I think

play04:07

this also laid the foundation so this

play04:09

this data intensive inquiry laid the

play04:12

foundation for what Newton would go on

play04:14

to do so

play04:15

Kepler described these elliptic motion

play04:18

of the planets and Newton explained why

play04:22

the planets move in these ellipses and

play04:24

actually I think a great

play04:27

quote by Isaac Newton when he

play04:29

was explaining one of his theories he

play04:31

said that it was because of a

play04:33

preponderance of the evidence and that's

play04:35

another way of saying the data supported

play04:37

his hypothesis or his theory and

play04:41

something else I think is really

play04:43

fascinating that we should think about

play04:44

as data scientists and modelers and

play04:46

machine learning people today and this

play04:50

is something I talk a lot about with my

play04:51

colleague Nathan Kutz is this idea of

play04:55

the difference between Kepler and Newton

play04:57

so Kepler built a model of how things

play05:00

work the way they work on these

play05:02

elliptical planets this is kind of I

play05:04

think of an attractor of how how the

play05:08

world and how the the solar system works

play05:10

in these elliptical orbits that theory

play05:14

was useful but it wouldn't have allowed

play05:17

us to to develop the Apollo program and

play05:21

and put people on the moon okay and so

play05:25

what Newton did was somehow a

play05:27

generalization he distilled the abstract

play05:30

physical principle that gave rise to

play05:32

elliptic orbits but in a way that you

play05:35

could tell you what would happen if you

play05:36

left your elliptical orbit so what would

play05:38

happen if you left or pushed on the

play05:40

system out of the way that it always

play05:42

behaves and we've always observed it and

play05:44

his theory truly generalized F equals MA

play05:47

generalized in a way that allowed us to

play05:50

land people on the moon which is which

play05:52

is really a huge achievement and so we

play05:55

talk about this a lot a lot of machine

play05:58

learning algorithms today most of them I

play06:00

would say do what Kepler did they

play06:03

describe the world as we observe it as

play06:06

the data describes it and it takes this

play06:09

epiphany this great leap to get a model

play06:12

that truly generalizes like what Newton

play06:15

did and so we should be aspiring to make

play06:17

our algorithms go

play06:18

you know from Kepler to Newton and that

play06:20

that's a worthwhile goal it's also very

play06:22

very challenging okay so data science

play06:25

has been around for a long time there's

play06:27

a really interesting modern book called

play06:29

the fourth paradigm data-intensive

play06:31

scientific discovery which basically

play06:34

shows or describes this progression from

play06:37

kind of theory and analytic

play06:38

mathematics to experiments collecting data

play06:42

from you know running experiments to

play06:45

test hypotheses to simulations and

play06:48

numerics and computations kind of the the

play06:50

digital you know silicon age and now

play06:53

this fourth paradigm of data-driven

play06:56

inquiry and scientific discovery really

play06:58

interesting and you know how this

play07:01

complements this doesn't this doesn't

play07:02

displace theory or numerics or

play07:05

experiments it complements these

play07:07

generate massive amounts of data and we

play07:09

need a science that ties these together

play07:12

okay just like simulations didn't

play07:14

displace experiments they complement

play07:17

each other okay so that's just a very

play07:20

high-level overview I will point out for

play07:22

those of you who are kind of more

play07:24

interested in the nuts and bolts of

play07:26

machine learning and modeling and kind

play07:29

of the linear algebra and optimization

play07:31

underlying these data science algorithms

play07:33

I'll recommend a book that my colleague

play07:36

Nathan Kutz and I just wrote data-driven

play07:38

science and engineering in Cambridge and

play07:41

we have a website databookuw.com

play07:44

where we filmed up all of our lectures

play07:45

for all of the chapters and sections so

play07:49

for example you can go to our website

play07:50

and find you know different topics

play07:52

you're interested in and see our YouTube

play07:54

videos so if you're interested hopefully

play07:57

that's a resource to kind of get into

play07:59

the more nitty gritty mathematical

play08:02

aspects okay thank you

Related Tags
Data Science, Historical Context, Tycho Brahe, Kepler, Newton, Planetary Motion, Data Collection, Modeling, Scientific Inquiry, Data-Driven, Machine Learning