Mitä data-analytiikka on? (What is data analytics?)

HAMK Mustiala
17 Mar 2022, 12:59

Summary

TLDR: In this video, Olli Koskela from HAMK University of Applied Sciences explores the concept of data analytics, emphasizing its importance in understanding complex phenomena through detailed analysis. He illustrates the process with historical and contemporary examples, such as World War II aircraft armor optimization and sales statistics analysis. Koskela also discusses the role of data analytics in various fields like bio- and circular economy, welfare, education, industry, and transport, highlighting the significance of effective data communication for decision-making.

Takeaways

  • 📊 Data analytics involves examining the components and structures of data in detail to build understanding.
  • 🔍 The term 'data' can be thought of as a 'collection of measuring points', which forms a dataset.
  • 🛡 Analyzing data often involves combining statistical methods and machine learning to uncover insights.
  • 🔧 Data analysis can be performed with simple tools like crayons and paper, not just digital methods.
  • ✈️ A historical example from World War II demonstrates how data analytics can optimize outcomes, like aircraft armor placement.
  • 📉 Visualizations such as scatter plots and histograms are crucial for understanding data distributions and patterns.
  • 📈 The median and average are basic statistical measures that can provide insights into data, but may not tell the whole story.
  • 📊 Histograms group data into ranges to show the frequency of data points within those ranges, which can reveal trends.
  • 🌡️ The Bioeconomy 4.0 project uses IoT sensors and data analytics to correlate environmental conditions with milk yield in cows.
  • 🐄 Machine vision and image classification can analyze animal behavior, like cow queuing for milking, to improve farm operations.
  • 🌡️ Monitoring environmental conditions like temperature and humidity can help prevent spoilage in silage, ensuring feed quality.
  • 🌐 The HAMK Smart Research Unit uses data analytics to develop solutions across various sectors including bio- and circular economy.

Q & A

  • What does Olli Koskela believe data analytics to be?

    -Olli Koskela believes data analytics to be the elucidation of the internal structure and properties of a set of measurement points, using appropriate tools such as statistical methods and machine learning.

  • What is the significance of the term 'data' in the context of data analytics?

    -In the context of data analytics, 'data' refers to a collection of measuring points, often derived from physical experiments, and is used to analyze and understand patterns or trends.

  • How does Olli Koskela describe his background in relation to data analytics?

    -Olli Koskela has a master’s degree in applied mathematics with a major in analysis, which involves using analytical tools and methods to understand problems and review them according to agreed and proven rules.

  • What is the Abraham Wald Statistical Research Group data set mentioned in the script?

    -The data set from Abraham Wald’s Statistical Research Group in World War II contains counts of bullet holes in aircraft and was used to determine where to place armor for optimal protection.

  • Why is it important to consider the data from planes that did not return in the aircraft armor optimization problem?

    -It is important to consider the data from planes that did not return because they likely had more bullet holes in areas that were critical for survival, thus providing crucial information for where to place armor.

  • How did Olli Koskela become a data analyst?

    -Olli Koskela became a data analyst by volunteering to develop fundraising and analyze sales statistics for products sold by children and young people during a campaign.

  • What was the initial assumption in the sales statistics analysis that Olli Koskela worked on?

    -The initial assumption was that Group A sold better than Group B, and the task was to find out why.

  • What did Olli Koskela discover about the sales data when he looked beyond the percentages?

    -When looking beyond the percentages, Olli Koskela found that Group A sellers sold an average of 13.3 products and Group B sellers an average of 12.6 products, a gap of less than one product per seller, which raises the question of whether the averages differ in any meaningful way.

  • What is the significance of the histogram in the sales data analysis?

    -The histogram in the sales data analysis is significant because it groups measurements into ranges, showing the number of sellers who sold the same number of products, which helps to identify patterns and trends in sales volumes.

  • What insights did Olli Koskela gain from analyzing the sales data more deeply?

    -Olli Koskela gained insights such as the potential of sellers who had not sold any products, the fact that sellers rarely returned products, and the possibility of increasing sales by raising prize limits.

  • How does the HAMK Smart Research Unit utilize data analytics?

    -The HAMK Smart Research Unit utilizes data analytics to develop solutions in various fields such as bio- and circular economy, welfare, education, industry, and transport.

  • What is the purpose of the Bioeconomy 4.0 project mentioned in the script?

    -The purpose of the Bioeconomy 4.0 project is to measure environmental conditions in a cowhouse and correlate them with the milk yield of cows using temperature and humidity data.

Outlines

00:00

📊 Introduction to Data Analytics

Olli Koskela from HAMK University introduces data analytics, drawing from his background in applied mathematics and analysis. He explains the term 'data analytics' as the detailed examination of the components and structures of a phenomenon. Data is defined as a collection of measurement points, and data analysis involves using tools to understand the internal structure and properties of these points. Modern tools include statistical methods and machine learning, but the core analytical thinking applies even to manual data processing. An example from Abraham Wald's WWII data set illustrates optimizing aircraft armor placement based on bullet hole distribution, highlighting the importance of considering missing data (planes that didn't return).

05:03

📈 Sales Data Analysis and Insights

The script discusses Olli's experience as a data analyst, starting with a fundraising campaign where he analyzed sales statistics to understand why Group A outperformed Group B in sales. Initial findings showed Group A sold an average of 13.3 products compared to Group B's 12.6. Further analysis using histograms and medians revealed little difference between the groups. However, deeper insights were gained by considering the number of non-selling participants and the low product return rate, suggesting potential for increased sales with slightly higher prize limits. This part also touches on the importance of understanding the measurement techniques and the data collected, such as using radar or pressure sensors for traffic monitoring, and electrical conductivity for milk quality assessment.

10:03

🐄 Applications of Data Analytics in Bio- and Circular Economy

Olli presents various projects from the HAMK Smart Research Unit that apply data analytics to bio- and circular economy, welfare, education, industry, and transport. One project involves measuring environmental conditions in a cowhouse to correlate them with milk yield using the Temperature-Humidity Index (THI). Another uses machine vision to analyze cow behavior and cricket rearing, while a third monitors silage temperature to prevent spoilage. The Biocycle and JÄRKI projects use simulations for waste management decision-making. The narrative emphasizes the importance of communicating scientific findings effectively to stakeholders, ensuring that data analysis supports informed decision-making.

Keywords

💡Data Analytics

Data analytics is the process of examining, cleaning, transforming, and modeling data to extract useful information, draw conclusions, and support decision-making. In the video, it is described as looking at the components and structures of data in detail, using tools like statistical methods and machine learning. The video uses the example of analyzing bullet hole data from WWII aircraft to optimize armor placement, illustrating how data analytics can be applied to solve real-world problems.

💡Applied Mathematics

Applied mathematics involves using mathematical techniques to solve practical problems in various fields. The speaker, Olli Koskela, has a master's degree in applied mathematics, which he uses as a foundation for his work in data analytics. The video suggests that the analytical tools and methods from applied mathematics are crucial for understanding and interpreting data.

💡Data

Data refers to a collection of facts, typically in a form that can be analyzed or used to represent information. In the video, data is likened to a 'collection of measuring points,' especially those gathered from physical experiments. The script uses the WWII aircraft bullet hole data set as an example of a data set used for analysis.

💡Statistical Methods

Statistical methods are mathematical techniques used to analyze and interpret data. The video mentions that statistical methods are combined with machine learning in the modern toolkit of data analytics. An example from the script is the use of statistical analysis to determine where to place armor on aircraft based on bullet hole data.

💡Machine Learning

Machine learning is a subset of artificial intelligence that provides systems the ability to learn from data, identify patterns, and make decisions with minimal human intervention. The video suggests that machine learning is now commonly combined with statistical methods as part of the data analytics process.

💡Visualization

Visualization in data analytics refers to the graphical representation of information and data. The video discusses the importance of visualizing data to build understanding and make the data analysis process more comprehensible. Examples include scatter plots and histograms used to represent sales data and understand sales patterns.

💡Normalization

Normalization is the process of adjusting values measured on different scales to a notionally equal scale. In the video, the WWII aircraft data is normalized by showing the number of bullet holes per area, allowing for comparison across machines of different sizes.

💡Correlation

Correlation is a statistical term used to describe a relationship between two variables. The video discusses how easily measurable quantities, like electrical conductivity, may correlate with variables of interest that are not directly measurable. Another example is the correlation between the temperature-humidity-index (THI) and milk yield in cows.

💡IoT (Internet of Things)

IoT refers to the network of physical devices, vehicles, home appliances, and other items embedded with electronics, software, sensors, and connectivity that enables these objects to connect, collect, and exchange data. The video mentions IoT in the context of the Bioeconomy 4.0 project, where sensors connected to LoRaWAN are used to measure environmental conditions in a cowhouse.

💡Machine Vision

Machine vision is a technology that enables machines to see and process visual information. In the video, machine vision is used to analyze video footage of cows' behavior around a milking robot and to monitor cricket rearing, providing insights into animal behavior and welfare.

💡Data Communication

Data communication is the exchange of information between different parties in a way that is meaningful and actionable. The video emphasizes the importance of a data analyst's role in communicating findings to stakeholders. Effective communication ensures that the insights derived from data analytics are understood and can be used to inform decision-making.

Highlights

Data analytics involves analyzing the components and structures of data in detail.

Data analytics combines statistical methods and machine learning to understand data sets.

Data, derived from the Latin word 'datum', refers to a collection of measurement points.

Abraham Wald’s data set from WWII is used to optimize aircraft armor placement.

Data normalization allows for comparison across different machine sizes.

The importance of understanding data collection methods and the potential biases they introduce.

The story of how Olli Koskela became a data analyst through volunteering and analyzing sales statistics.

The challenge of interpreting data when comparing groups of different sizes.

The use of percentages to express differences between groups, but the need for deeper analysis.

The insight that the number of sellers who did not sell any products could be significant.

The recommendation to increase prize limits to incentivize sales, based on data analysis.

The discovery that sellers rarely returned products, suggesting high sales efficiency.

The realization that the system only allowed products to be registered once, affecting sales data interpretation.

The use of indirect measurements, such as radar signal reflection, to understand traffic volumes.

The application of data analytics in assessing milk quality through electrical conductivity.

The Bioeconomy 4.0 project's use of IoT and LoRa technology to measure environmental conditions affecting milk yield.

The preprocessing and standardization of data to facilitate meaningful analysis.

The correlation between the temperature-humidity-index (THI) and milk yield in cows.

The use of machine vision to analyze cow behavior and milking queue dynamics.

The application of video-based machine vision in cricket rearing to estimate size and number.

The Good for Livestock project's focus on monitoring silage temperature to prevent spoilage.

The Biocycle and JÄRKI projects' use of computer-assisted simulations for waste stream management.

The importance of communicating scientific findings in a way that supports decision-making.

The Field Observatory's role in measuring and communicating the effects of conservation agriculture.

Transcripts

play00:00

Hi! I am Olli Koskela from HAMK University of Applied Sciences and from HAMK Smart Research Unit.

play00:04

In this video, I’ll explain what I think data analytics is,

play00:08

and with a few examples, I’ll look at visualizing data and building understanding.

play00:13

Let’s first look at what the word pair “data analytics” actually means.

play00:18

I myself have a master’s degree in applied mathematics and my major was analysis.

play00:23

The basic computational methods familiar from school are

play00:25

in fact, based on analytical tools such as limits and continuity,

play00:30

but more generally analytics means

play00:32

looking at the components and structures of a thing or phenomenon in detail.

play00:37

The mathematical analysis focuses in particular on the fact

play00:40

that the problem is first precisely defined and the review is carried out only according to agreed and proven rules.

play00:47

Data is a plural form of the Latin word datum, meaning something given.

play00:52

Personally, I like to translate the word “data” as a “collection of measuring points”,

play00:56

which describes the so-called data set formed especially by physical experiments.

play01:01

Thus, data analysis is the elucidation of the internal structure and properties of a set of measurement points

play01:07

using appropriate tools.

play01:09

In particular, different statistical methods and machine learning are combined with this set of tools nowadays,

play01:15

but the same analytical thinking models also apply

play01:17

when the data set can be processed with, for example, crayons and a piece of paper.

play01:21

Let's take such an example from Jordan Ellenberg's excellent book

play01:26

"How not to be wrong - The power of Mathematical thinking".

play01:29

Here is a data set from Abraham Wald’s Statistical Research Group from World War II.

play01:35

In this collection of measuring points, the bullet holes in the aircraft have been counted, and the task is to optimize the installation of aircraft armor.

play01:41

Because armor makes the planes heavier, you should install only as much of it as is needed.

play01:49

In addition, note that the data here is so-called normalized, i.e. the number of holes per unit area is shown.

play01:56

This makes the measurements of machines of different sizes comparable.
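
As a minimal sketch of this normalization, assuming made-up hole counts and surface areas (the numbers and the helper name `holes_per_area` are illustrative and are not the original data set), the hole density per unit area could be computed roughly like this:

```python
# Hypothetical hole counts and surface areas per aircraft section.
# These numbers are invented for illustration; they are not Wald's data.
sections = {
    "engine": {"holes": 20, "area": 20.0},
    "fuselage": {"holes": 90, "area": 50.0},
    "fuel system": {"holes": 15, "area": 10.0},
    "rest of the plane": {"holes": 144, "area": 80.0},
}


def holes_per_area(holes: int, area: float) -> float:
    """Normalize a raw hole count by the surface area it was counted on."""
    if area <= 0:
        raise ValueError("area must be positive")
    return holes / area


# Densities are comparable across sections (and planes) of different sizes.
density = {
    name: holes_per_area(s["holes"], s["area"]) for name, s in sections.items()
}

for name, value in density.items():
    print(f"{name}: {value:.2f} holes per unit area")
```

Dividing by area is what lets a large section with many holes be compared fairly with a small section with few holes.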

play02:00

Let's open Wikipedia first and look at the content of the problem.

play02:05

Pictured is a P-51D propeller aircraft named Miss Helen.

play02:08

The parts in the measuring set are thus the engine, the body, the fuel system and all the other parts categorized into one.

play02:15

The conundrum here is to understand the measurement setup: the bullet holes were counted from the planes that flew back to the base.

play02:23

It therefore lacks data on planes that did not fly back.

play02:27

Therefore, the group suggested that the armor should be installed where there were the fewest bullet holes.
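
The reasoning behind this suggestion can be illustrated with a small simulation. This is only a sketch under invented assumptions: the section names follow the talk, but the loss probabilities, the number of hits per plane, and the function name `simulate` are hypothetical.

```python
import random

SECTIONS = ["engine", "fuselage", "fuel system", "rest of the plane"]

# Assumed probability that a single hit in a section brings the plane down.
# These values are invented for the illustration.
LOSS_PROBABILITY = {
    "engine": 0.6,
    "fuselage": 0.1,
    "fuel system": 0.3,
    "rest of the plane": 0.1,
}


def simulate(n_planes: int, seed: int = 0):
    """Return hole counts per section for all planes and for returning planes."""
    rng = random.Random(seed)
    all_holes = {s: 0 for s in SECTIONS}
    returned_holes = {s: 0 for s in SECTIONS}
    for _ in range(n_planes):
        # Each plane takes 1 to 5 hits, spread evenly over the sections.
        hits = [rng.choice(SECTIONS) for _ in range(rng.randint(1, 5))]
        # The plane returns only if it survives every single hit.
        survived = all(rng.random() > LOSS_PROBABILITY[s] for s in hits)
        for s in hits:
            all_holes[s] += 1
            if survived:
                returned_holes[s] += 1
    return all_holes, returned_holes


all_holes, returned_holes = simulate(10_000)
for s in SECTIONS:
    print(f"{s}: {all_holes[s]} holes in total, {returned_holes[s]} seen at the base")
```

In this toy setup the hits are spread evenly, yet the planes that return show the fewest holes in the most vulnerable section, because planes hit there rarely make it back to be counted.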

play02:33

How did I become a data analyst?

play02:35

The story is an excellent example of how important

play02:39

one’s hobbies and passion for finding things out are today, and

play02:42

how they may be useful in the future.

play02:46

I ended up volunteering to develop fundraising

play02:49

and analyze sales statistics for products sold by children and young people during a campaign.

play02:54

The starting assumption here was that Group A sells better than Group B, and we wanted to find out why.

play03:00

We will not editorialise here now on the demographic meaning of the groups,

play03:07

but let's look at what I got out of the data by thinking about it and how it can be communicated in different ways.

play03:14

A straightforward way to answer the question would be to say that “Group A sold on average 5% better than Group B.”

play03:22

Percentages are a good way to express a wide range of things,

play03:25

as they are also standardized information, i.e. they allow different types of things to be assessed side by side.

play03:32

However, don’t pay an analyst who offers no results beyond this!

play03:36

Looking at the percentages does not really provide any information about the phenomenon or its magnitude.

play03:42

In this particular case, Group A sellers sold an average of 13.3 products and Group B sellers an average of 12.6 products.

play03:51

Since these are whole products counted in units,

play03:54

is there in fact any difference between these averages?
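
To make the contrast between the percentage and the underlying magnitude concrete, here is a small sketch using the two averages quoted above. The variable names are my own; only the two averages come from the talk.

```python
mean_a = 13.3  # average products sold per seller in Group A (from the talk)
mean_b = 12.6  # average products sold per seller in Group B (from the talk)

# Absolute difference: how many more products an average Group A seller sold.
absolute_difference = mean_a - mean_b

# Relative difference: the same gap expressed as a percentage of Group B.
relative_difference = absolute_difference / mean_b * 100

print(f"Absolute difference: {absolute_difference:.1f} products per seller")
print(f"Relative difference: {relative_difference:.1f} %")
```

The relative difference comes out at roughly 5 to 6 %, in line with the rounded figure quoted earlier, while the absolute difference is less than one product per seller.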

play03:57

Next, let’s look at the raw data in groups.

play04:02

Here, all the data is shown as a scatter plot, i.e. the horizontal axis shows the different sellers and the vertical axis the quantities of products they sold.

play04:12

The first observation from this is that the groups are very different in size.

play04:15

There are many more sellers in Group B. It is very difficult to deduce much else, i.e. some key figures are needed.

play04:22

The average has already been presented; a second comparable key figure is the median.

play04:26

There is not much difference between them either.

play04:29

The averages were very close to each other and there is little difference in the medians.
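
A minimal sketch of computing these two key figures with Python's standard library follows. The sales lists are invented for illustration and are not the campaign data.

```python
from statistics import mean, median

# Hypothetical sales counts per seller; invented for illustration only.
group_a = [3, 8, 10, 10, 12, 18, 32]
group_b = [2, 5, 9, 10, 10, 11, 13, 18, 19, 30]

# Collect group size, mean and median for each group.
summary = {
    name: {"n": len(sales), "mean": mean(sales), "median": median(sales)}
    for name, sales in (("A", group_a), ("B", group_b))
}

for name, stats in summary.items():
    print(
        f"Group {name}: n={stats['n']}, "
        f"mean={stats['mean']:.1f}, median={stats['median']}"
    )
```

The median is less sensitive than the mean to a few sellers with very high sales, which is why it is worth reporting both.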

play04:34

Let’s go deeper into the data structures and break down sales volumes.

play04:39

A histogram is a way of grouping measurements so that the number of measurement readings in each given range is counted.

play04:46

In other words, the number of sellers selling the same number of products is shown here, according to the intervals shown on the horizontal axis.

play04:53

The boundaries on the horizontal axis are not entirely random; I have brought information from outside the data set into the analysis.

play05:03

The limits have been chosen on the basis of the sales volumes eligible for the prize.

play05:07

Each seller received a small reward for their activity after 10, 18, 30, 50 and 100 products sold.
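
A histogram with these prize limits as bin boundaries could be sketched as follows. The prize limits are the ones listed above; the sales list, the bin labels, and the function name `prize_histogram` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Prize limits mentioned in the talk, used as bin boundaries.
PRIZE_LIMITS = [10, 18, 30, 50, 100]
BIN_LABELS = ["<10", "10-17", "18-29", "30-49", "50-99", "100+"]


def prize_histogram(sales):
    """Count sellers per bin, where a new bin starts at each prize limit."""
    counts = Counter(BIN_LABELS[bisect_right(PRIZE_LIMITS, s)] for s in sales)
    # Return every bin in order, including empty ones.
    return {label: counts.get(label, 0) for label in BIN_LABELS}


# Hypothetical sales counts per seller; invented for illustration only.
sales = [3, 9, 10, 10, 12, 17, 18, 25, 30, 55, 100]
histogram = prize_histogram(sales)

for label, count in histogram.items():
    print(f"{label:>6}: {'#' * count}")
```

Choosing the bin edges from the prize rules, rather than evenly spaced edges, makes it visible whether sellers cluster just above a prize limit.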

play05:15

In fact, this image is misleading: the scaling of the side-by-side graphs is completely different.

play05:21

Let's fix it so that both have the same scale.

play05:25

The groups under review are still remarkably similar in profile, only with more sellers in Group B.

play05:27

From the point of view of sales promotion, I came to the conclusion that this division between the groups is irrelevant.

play05:39

However, my real insight was to find more data for this data set. To think about the phenomenon a little more broadly.

play05:46

Elsewhere, the number of sellers who had not sold any products was found.

play05:53

I think these sellers have significant potential in this case,

play05:57

because a significant number of children and young people in this campaign had not sold any products.

play06:03

Looking at the numbers, I also had another insight.

play06:05

Based on the data, it appeared that the sellers returned few products, i.e. they sold all the products that they had picked up for sale.

play06:14

Based on this, I would recommend a small increase in

play06:17

prize limits, which would have increased sales as sellers would have taken 1-2 more products.

play06:24

However, it later became clear to me that the system only allowed products to be registered once.

play06:29

As a result, the number of products sold and the number taken for sale appeared to be exactly the same.

play06:34

In other words, many of the sellers’ leaders kept their own records and only eventually recorded the quantities sold.

play06:41

This finding undermined my recommendation.

play06:44

It also often happens that what you want to develop is not directly measurable.

play06:50

The picture shows two screens following the traffic volumes of the pedestrian and cycle route in the city of Hämeenlinna.

play06:55

Both show the number of pedestrians and cyclists,

play06:57

but in reality the measurements are the times of flight of a radar signal or the pressure variations on the road surface,

play07:02

from which the type of road user is identified by different methods. These are therefore physical measurements of

play07:07

pressure or of the reflection of the radar signal, from which it is only later determined whether there has been a cyclist or a pedestrian.

play07:19

Similarly, the quality of milk is assessed, inter alia, on the basis of electrical conductivity when physical measurements are made,

play07:25

although hardly any of us need electrical conductivity in our morning coffee.

play07:31

However, it has been found that electrical conductivity correlates with other useful variables and is easily measurable.

play07:38

The HAMK Smart Research Unit develops solutions utilizing data analytics and digitalisation in the fields of bio- and circular economy, welfare, education, industry and transport.

play07:49

Next, I will present our bio- and circular economy projects.

play07:52

Based on previous research, it is known that an index can be derived from temperature and humidity

play07:58

that correlates with the milk yield of cows.

play08:00

The index is called the temperature-humidity-index (THI).
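
The talk does not state which THI formula was used, and several variants exist in the literature. As a sketch only, the following assumes one widely cited form, THI = (1.8 T + 32) - (0.55 - 0.0055 RH)(1.8 T - 26), with T in degrees Celsius and RH as relative humidity in percent. The function name is hypothetical.

```python
def temperature_humidity_index(temp_c: float, rel_humidity_pct: float) -> float:
    """Compute a temperature-humidity index (THI).

    Assumption: this uses one commonly cited THI formula; the project in the
    talk may have used a different variant.
    """
    if not 0.0 <= rel_humidity_pct <= 100.0:
        raise ValueError("relative humidity must be between 0 and 100 percent")
    temp_f = 1.8 * temp_c + 32  # temperature converted to degrees Fahrenheit
    return temp_f - (0.55 - 0.0055 * rel_humidity_pct) * (1.8 * temp_c - 26)


# Example: a warm and fairly humid day in the cowhouse (invented values).
print(f"THI = {temperature_humidity_index(25.0, 50.0):.1f}")
```

With this form, higher humidity at the same warm temperature gives a higher index, which is the point of combining the two measurements into one number.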

play08:06

In the Bioeconomy 4.0 project, we measured environmental conditions at various points in HAMK's cowhouse in Mustiala with such sensors connected to LoRaWAN.

play08:15

The LoRa technology has been developed for long-term measurement and allows for relatively free placement of the sensors as well as a long battery life.

play08:22

Digita produces the data collection for this IoT platform, and I have a collection of measurements downloaded from there.

play08:28

This is compared to the average daily milk yield of the herd measured by the milking robot.

play08:33

For comparison, the measurement data must be preprocessed and standardized to daily data.

play08:38

We have filtered temperatures and relative humidity for this use.

play08:43

We want to compare the conditions with the average daily milk yield of cows.

play08:48

Conditions are measured every 15 minutes and must first be preprocessed, checked for erroneous measurements, and aggregated to daily data.

play08:59

In this case, two temperature sensors malfunctioned and sent -1000 °C readings for a few measurement periods.

play09:06

Let's remove these points from our dataset.
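
Removing such points can be sketched as a simple plausibility filter. The readings, the timestamps, and the accepted temperature range below are assumptions made for illustration; only the -1000 °C error value comes from the talk.

```python
# Hypothetical raw readings as (timestamp text, temperature in degrees Celsius).
readings = [
    ("2021-06-01 00:00", 14.2),
    ("2021-06-01 00:15", -1000.0),  # error value sent by a broken sensor
    ("2021-06-01 00:30", 14.0),
    ("2021-06-01 00:45", -1000.0),
    ("2021-06-01 01:00", 13.9),
]

# Assumed plausibility range for a cowhouse; adjust to the sensor's specification.
MIN_TEMP_C = -50.0
MAX_TEMP_C = 60.0


def is_plausible(temp_c: float) -> bool:
    """Return True if the temperature lies inside the accepted range."""
    return MIN_TEMP_C <= temp_c <= MAX_TEMP_C


# Keep only the readings that pass the plausibility check.
clean = [(ts, temp) for ts, temp in readings if is_plausible(temp)]

print(f"Kept {len(clean)} of {len(readings)} readings")
```

Filtering on a range rather than on the exact value -1000 also catches other physically impossible readings.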

play09:08

Here I now present the temperatures and relative humidity as graphs.

play09:12

An ordinary time series graph is best suited to represent such a collection of more than 11,000 measurement points.

play09:23

For programmatic processing, text timestamps are converted to timestamp objects.

play09:28

Utilizing these, the daily average THI can be calculated.

play09:31

These daily averages are used in the THI calculation.
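
The timestamp conversion and daily averaging could look roughly like the following sketch. The timestamp text format, the sample readings, and the function name `daily_means` are assumptions, not the project's actual pipeline.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical readings: (timestamp text, temperature in C, relative humidity in %).
readings = [
    ("2021-06-01 00:00", 14.0, 80.0),
    ("2021-06-01 12:00", 22.0, 60.0),
    ("2021-06-02 00:00", 15.0, 85.0),
    ("2021-06-02 12:00", 25.0, 55.0),
]

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"  # assumed text format of the timestamps


def daily_means(rows):
    """Group readings by calendar day and average temperature and humidity."""
    by_day = defaultdict(list)
    for text, temp_c, humidity in rows:
        # Convert the text timestamp to a datetime object, then keep the date.
        day = datetime.strptime(text, TIMESTAMP_FORMAT).date()
        by_day[day].append((temp_c, humidity))
    return {
        day: (mean(t for t, _ in values), mean(h for _, h in values))
        for day, values in sorted(by_day.items())
    }


daily = daily_means(readings)
for day, (temp_c, humidity) in daily.items():
    print(f"{day}: {temp_c:.1f} C, {humidity:.1f} % relative humidity")
```

Once the readings are on a daily grid, they can be lined up with the daily milk yield reported by the milking robot.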

play09:35

By plotting the THIs calculated on the basis of the measurements of different sensors and the outdoor weather conditions

play09:42

with the milk yield,

play09:44

it is noticed that the THI and the average milk yield tend to change at the same time.

play09:51

However, a comparison of the THI does not yet tell us anything about the cause-and-effect relationships of the phenomenon.
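
One common way to quantify how closely two daily series move together is the Pearson correlation coefficient. The sketch below implements it by hand on invented numbers; the talk does not say which measure, if any, was used, and the data here is not from the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical daily values; invented for illustration only.
daily_thi = [58.0, 61.0, 65.0, 70.0, 72.0]
daily_milk_kg = [31.0, 30.5, 30.0, 28.5, 28.0]

r = pearson(daily_thi, daily_milk_kg)
print(f"Pearson r = {r:.2f}")
```

A coefficient close to -1 or 1 only says that the series move together in this sample; as noted above, it says nothing about cause and effect.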

play09:56

To understand the barn life, we conducted a three-month machine vision test on queuing behavior for a milking robot.

play10:03

From the video footage, 1.7 million images have been classified so that each image was tagged with one relevant category.

play10:12

These categories included interactions from different directions, queuing congestion and normalcy, drinking from the drinking pool, and curiosity.

play10:23

In addition, disturbed images and those showing people were excluded from the classification.

play10:36

A deeper examination of the material is still in progress at the moment,

play10:39

but even here it can already be seen that the cows mostly queue for milking very calmly.

play10:45

In addition, this data could already be used, for example, to measure the duration of drinking moments,

play10:48

and it was found that cows mostly drink for less than one minute at a time.
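
Measuring such durations from per-image labels amounts to finding runs of consecutive images with the same category. The sketch below assumes one classified image per second and uses invented labels; the actual frame rate and category names of the project are not given in the talk.

```python
from itertools import groupby

FRAMES_PER_SECOND = 1  # assumed sampling rate of the classified images


def event_durations(labels, target, fps=FRAMES_PER_SECOND):
    """Return the duration in seconds of each uninterrupted run of `target`."""
    return [
        sum(1 for _ in run) / fps
        for label, run in groupby(labels)
        if label == target
    ]


# Hypothetical sequence of per-image categories; invented for illustration.
labels = (
    ["normal"] * 3
    + ["drinking"] * 40
    + ["normal"] * 5
    + ["drinking"] * 25
    + ["queue"] * 2
)

durations = event_durations(labels, "drinking")
print(f"Drinking events (seconds): {durations}")
print(f"All under one minute: {all(d < 60 for d in durations)}")
```

In practice a few misclassified images inside a run would split one event into two, so some smoothing of the labels would probably be needed first.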

play10:54

Video-based machine vision can also be used to analyze cricket rearing.

play10:57

In the HämInCent project, the feeding area of crickets in the open area of the breeding box was photographed with two infrared cameras.

play11:05

Moving crickets are identified from images taken every second, and the size and number of crickets can be estimated from the detections.

play11:12

The Good for Livestock project investigated the monitoring of silage temperature to prevent feed spoilage.

play11:20

By simulating the spoilage process, the effect of the placement of the measuring sticks on the measurement accuracy could be analyzed.

play11:26

The biological spoilage process causes a local heat rise at the point

play11:31

where the spoilage is taking place and monitoring the temperatures provides certainty about the quality of the feed.

play11:36

The Biocycle and JÄRKI projects implement computer-assisted simulations to support decision-making

play11:41

and improve the efficiency of waste stream management by examining the locations and other characteristics of the facilities.

play11:47

"Science is not finished until it's communicated."

play11:50

Science does not end with the result being on the researcher’s table but must be communicated in a meaningful way to the world.

play11:58

The Field Observatory is the result of the joint work of the Finnish Meteorological Institute, the Finnish Environment Institute, the Baltic Sea Action Group and HAMK,

play12:05

in which the effects of conservation agriculture and other measures to promote carbon sequestration on farms are measured and presented in an easily understandable form to various stakeholders.

play12:12

A big part of a data analyst's job is communication.

play12:16

It is a discussion with various stakeholders.

play12:18

The analyst should understand the problem or phenomenon for which information is needed.

play12:22

The measurement technique used to monitor this phenomenon must be understood.

play12:26

One has to understand the connection between these. And when an internal structure

play12:33

that is relevant to the future is found in a measured collection of data points, the analyst must be able to communicate it in a meaningful way that supports decision-making.

play12:40

Thank you for joining us on this journey to data analytics.


Related Tags
Data Analytics, Applied Math, Visualization, Machine Learning, Statistical Methods, Data Interpretation, Sales Analysis, IoT Solutions, Bioeconomy, Research Insights