What is data analytics? (Mitä data-analytiikka on?)
Summary
TLDR: In this video, Olli Koskela from HAMK University of Applied Sciences explores the concept of data analytics, emphasizing its importance in understanding complex phenomena through detailed analysis. He illustrates the process with historical and contemporary examples, such as World War II aircraft armor optimization and sales statistics analysis. Koskela also discusses the role of data analytics in various fields like bio- and circular economy, welfare, education, industry, and transport, highlighting the significance of effective data communication for decision-making.
Takeaways
- 📊 Data analytics involves examining the components and structures of data in detail to build understanding.
- 🔍 The term 'data' can be thought of as a 'collection of measuring points', which forms a dataset.
- 🛡 Analyzing data often involves combining statistical methods and machine learning to uncover insights.
- 🔧 Data analysis can be performed with simple tools like crayons and paper, not just digital methods.
- ✈️ A historical example from World War II demonstrates how data analytics can optimize outcomes, like aircraft armor placement.
- 📉 Visualizations such as scatter plots and histograms are crucial for understanding data distributions and patterns.
- 📈 The median and average are basic statistical measures that can provide insights into data, but may not tell the whole story.
- 📊 Histograms group data into ranges to show the frequency of data points within those ranges, which can reveal trends.
- 🌡️ The Bioeconomy 4.0 project uses IoT sensors and data analytics to correlate environmental conditions with milk yield in cows.
- 🐄 Machine vision and image classification can analyze animal behavior, like cow queuing for milking, to improve farm operations.
- 🌡️ Monitoring environmental conditions like temperature and humidity can help prevent spoilage in silage, ensuring feed quality.
- 🌐 The HAMK Smart Research Unit uses data analytics to develop solutions across various sectors including bio- and circular economy.
Q & A
What does Olli Koskela believe data analytics to be?
-Olli Koskela believes data analytics to be the elucidation of the internal structure and properties of a set of measurement points, using appropriate tools such as statistical methods and machine learning.
What is the significance of the term 'data' in the context of data analytics?
-In the context of data analytics, 'data' refers to a collection of measuring points, often derived from physical experiments, and is used to analyze and understand patterns or trends.
How does Olli Koskela describe his background in relation to data analytics?
-Olli Koskela has a master’s degree in applied mathematics with a major in analysis, which involves using analytical tools and methods to understand problems and review them according to agreed and proven rules.
What is the Abraham Wald data set mentioned in the script?
-The data set comes from Abraham Wald's work in the Statistical Research Group during World War II. It records bullet holes in aircraft and was used to determine where to place armor for optimal protection.
Why is it important to consider the data from planes that did not return in the aircraft armor optimization problem?
-It is important because the data set only covers planes that returned to base. Planes hit in critical areas were less likely to make it back, so the areas with the fewest recorded bullet holes are where armor matters most.
How did Olli Koskela become a data analyst?
-Olli Koskela became a data analyst by volunteering to develop fundraising and analyze sales statistics for products sold by children and young people during a campaign.
What was the initial assumption in the sales statistics analysis that Olli Koskela worked on?
-The initial assumption was that Group A sold better than Group B, and the task was to find out why.
What did Olli Koskela discover about the sales data when he looked beyond the percentages?
-When looking beyond the percentages, Olli Koskela found that Group A sellers sold an average of 13.3 products and Group B sellers an average of 12.6 products, a gap so small that he questioned whether it was a real difference at all.
What is the significance of the histogram in the sales data analysis?
-The histogram in the sales data analysis is significant because it groups measurements into ranges, showing the number of sellers who sold the same number of products, which helps to identify patterns and trends in sales volumes.
What insights did Olli Koskela gain from analyzing the sales data more deeply?
-Olli Koskela gained insights such as the potential of sellers who had not sold any products, the observation that sellers rarely returned products, and the possibility of increasing sales by raising prize limits. He later found that the way sales were registered undermined that last recommendation.
How does the HAMK Smart Research Unit utilize data analytics?
-The HAMK Smart Research Unit utilizes data analytics to develop solutions in various fields such as bio- and circular economy, welfare, education, industry, and transport.
What is the purpose of the Bioeconomy 4.0 project mentioned in the script?
-The purpose of the Bioeconomy 4.0 project is to measure environmental conditions in a cowhouse and correlate them with the milk yield of cows using temperature and humidity data.
Outlines
📊 Introduction to Data Analytics
Olli Koskela from HAMK University introduces data analytics, drawing from his background in applied mathematics and analysis. He explains the term 'data analytics' as the detailed examination of the components and structures of a phenomenon. Data is defined as a collection of measurement points, and data analysis involves using tools to understand the internal structure and properties of these points. Modern tools include statistical methods and machine learning, but the core analytical thinking applies even to manual data processing. An example from Abraham Wald's WWII data set illustrates optimizing aircraft armor placement based on bullet hole distribution, highlighting the importance of considering missing data (planes that didn't return).
📈 Sales Data Analysis and Insights
The script discusses Olli's experience as a data analyst, starting with a fundraising campaign where he analyzed sales statistics to understand why Group A outperformed Group B in sales. Initial findings showed Group A sold an average of 13.3 products compared to Group B's 12.6. Further analysis using histograms and medians revealed little difference between the groups. However, deeper insights were gained by considering the number of non-selling participants and the low product return rate, suggesting potential for increased sales with slightly higher prize limits. This part also touches on the importance of understanding the measurement techniques and the data collected, such as using radar or pressure sensors for traffic monitoring, and electrical conductivity for milk quality assessment.
🐄 Applications of Data Analytics in Bio- and Circular Economy
Olli presents various projects from the HAMK Smart Research Unit that apply data analytics to bio- and circular economy, welfare, education, industry, and transport. One project involves measuring environmental conditions in a cowhouse to correlate them with milk yield using the Temperature-Humidity Index (THI). Another uses machine vision to analyze cow behavior and cricket rearing, while a third monitors silage temperature to prevent spoilage. The Biocycle and JÄRKI projects use simulations for waste management decision-making. The narrative emphasizes the importance of communicating scientific findings effectively to stakeholders, ensuring that data analysis supports informed decision-making.
Keywords
💡Data Analytics
💡Applied Mathematics
💡Data
💡Statistical Methods
💡Machine Learning
💡Visualization
💡Normalization
💡Correlation
💡IoT (Internet of Things)
💡Machine Vision
💡Data Communication
Highlights
Data analytics involves analyzing the components and structures of data in detail.
Data analytics combines statistical methods and machine learning to understand data sets.
Data, derived from the Latin word 'datum', refers to a collection of measurement points.
Abraham Wald’s data set from WWII is used to optimize aircraft armor placement.
Data normalization allows for comparison across different machine sizes.
The importance of understanding data collection methods and the potential biases they introduce.
The story of how Olli Koskela became a data analyst through volunteering and analyzing sales statistics.
The challenge of interpreting data when comparing groups of different sizes.
The use of percentages to express differences between groups, but the need for deeper analysis.
The insight that the number of sellers who did not sell any products could be significant.
The recommendation to increase prize limits to incentivize sales, based on data analysis.
The discovery that sellers rarely returned products, suggesting high sales efficiency.
The realization that the system only allowed products to be registered once, affecting sales data interpretation.
The use of indirect measurements, such as radar signal reflection, to understand traffic volumes.
The application of data analytics in assessing milk quality through electrical conductivity.
The Bioeconomy 4.0 project's use of IoT and LoRa technology to measure environmental conditions affecting milk yield.
The preprocessing and standardization of data to facilitate meaningful analysis.
The correlation between the temperature-humidity-index (THI) and milk yield in cows.
The use of machine vision to analyze cow behavior and milking queue dynamics.
The application of video-based machine vision in cricket rearing to estimate size and number.
The Good for Livestock project's focus on monitoring silage temperature to prevent spoilage.
The Biocycle and JÄRKI projects' use of computer-assisted simulations for waste stream management.
The importance of communicating scientific findings in a way that supports decision-making.
The Field Observatory's role in measuring and communicating the effects of conservation agriculture.
Transcripts
Hi! I am Olli Koskela from HAMK University of Applied Sciences and from HAMK Smart Research Unit.
In this video, I’ll explain what I think data analytics is,
and with a few examples, I’ll look at visualizing data and building understanding.
Let’s first look at what the word pair “data analytics” actually means.
I myself have a master’s degree in applied mathematics and my major was analysis.
The basic computational methods familiar from school are
in fact based on analytical tools such as limits and continuity,
but more generally analytics means
looking at the components and structures of a thing or phenomenon in detail.
Mathematical analysis places particular emphasis on
first defining the problem precisely and then examining it only according to agreed and proven rules.
Data is a plural form of the Latin word datum, meaning something given.
Personally, I like to translate the word data as a “collection of measuring points”,
which describes the data set produced especially by physical experiments.
Thus, data analysis is the elucidation of the internal structure and properties of a set of measurement points
using appropriate tools.
In particular, various statistical methods and machine learning are nowadays part of this set of tools,
but the same analytical ways of thinking also apply
when the data set can be processed with, for example, crayons and a piece of paper.
Let's take such an example from Jordan Ellenberg's excellent book
"How Not to Be Wrong: The Power of Mathematical Thinking".
Here is the data set of Abraham Wald and the Statistical Research Group from World War II.
In this collection of measuring points, the bullet holes in aircraft have been counted, and the task is to optimize the installation of aircraft armor.
Because armor makes aircraft heavier, only the necessary amount should be installed.
In addition, note that the data here is normalized, i.e. the number of holes per unit area is shown.
This makes the measurements of aircraft of different sizes comparable.
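The normalization described above can be sketched in a few lines. This is only an illustration: the hole counts and surface areas below are invented placeholders, not values from Wald's data set, and the helper name `holes_per_area` is hypothetical.

```python
# Hypothetical hole counts and surface areas (arbitrary area units),
# invented for illustration only. They are NOT the historical figures.
sections = {
    "engine": {"holes": 20, "area": 10.0},
    "fuselage": {"holes": 90, "area": 30.0},
    "fuel system": {"holes": 15, "area": 6.0},
    "rest of the plane": {"holes": 100, "area": 40.0},
}


def holes_per_area(holes, area):
    """Normalize a raw hole count by the surface area of the section."""
    return holes / area


density = {
    name: holes_per_area(s["holes"], s["area"]) for name, s in sections.items()
}

# List the sections from the lowest to the highest hole density.
for name, value in sorted(density.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f} holes per unit area")
```

With raw counts the largest section would dominate simply because it is large. Dividing by area puts all sections on the same footing.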
Let's open Wikipedia first and look at the content of the problem.
Pictured is a P-51D propeller aircraft named Miss Helen.
The parts in the data set are thus the engine, the fuselage, the fuel system and all the other parts grouped into one category.
The key here is to understand the measurement setup: the holes were counted on the planes that flew back to base.
The data therefore lacks the planes that did not fly back.
Therefore, the group suggested that the armor should be installed where there were the fewest bullet holes, because planes hit in those places were the ones that did not return.
How did I become a data analyst?
The story is an excellent example of how important
today's hobbies and a passion for finding things out are, and
how they may be useful in the future.
I ended up volunteering to develop fundraising
and analyze sales statistics for products sold by children and young people during a campaign.
The starting assumption here was that Group A sells better than Group B, and we wanted to find out why.
We will not go into the demographic meaning of the groups here,
but let's look at what I got out of the data by thinking about it and how it can be communicated in different ways.
A straightforward way to answer the question would be to say that “Group A sold on average 5% better than Group B.”
Percentages are a good way to express a wide range of things,
as they are also standardized information, i.e. they allow different types of things to be assessed side by side.
However, don’t pay an analyst who offers no results beyond this!
A percentage alone does not really provide any information about the phenomenon or its magnitude.
In this particular case, Group A sellers sold an average of 13.3 products and Group B sellers an average of 12.6 products.
Since products are sold in whole units,
is there in fact any difference between these averages?
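As a small worked example, the two averages quoted above can be turned into both an absolute and a relative difference. The variable names are illustrative. Only the two averages come from the talk.

```python
# Averages quoted in the talk.
mean_a = 13.3  # products per seller, Group A
mean_b = 12.6  # products per seller, Group B

absolute_diff = mean_a - mean_b
relative_diff_pct = absolute_diff / mean_b * 100

print(f"Absolute difference: {absolute_diff:.1f} products per seller")
print(f"Relative difference: {relative_diff_pct:.1f} %")
```

The relative figure comes out at roughly the 5% mentioned earlier, while the absolute figure is less than one product per seller, which is why the percentage alone says little about the magnitude.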
Next, let’s look at the raw data in groups.
Here, all the data is shown as a scatter plot: the horizontal axis shows the individual sellers and the vertical axis the number of products each of them sold.
The first observation is that the groups are very different in size.
There are many more sellers in Group B. It is very difficult to deduce much else, i.e. some key figures are needed.
The average has already been presented; another comparable key figure is the median.
The averages were very close to each other, and there is little difference in the medians either.
Let’s go deeper into the data structures and break down sales volumes.
A histogram is a way of grouping measurements so that the number of readings falling in each range is counted.
In other words, it shows here how many sellers sold a number of products within each of the intervals marked on the horizontal axis.
The boundaries on the horizontal axis are not arbitrary: I have brought information from outside the data set into the analysis.
The limits have been chosen on the basis of the sales volumes eligible for a prize.
Each seller received a small reward for their activity after 10, 18, 30, 50 and 100 products sold.
In fact, this image is misleading: the scales of the side-by-side graphs are completely different.
Let's fix it so that both have the same scale.
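The binning described above, with bin edges at the prize limits and a shared scale for both groups, could be sketched as follows. The sales figures are invented for illustration, and the function names are hypothetical. Only the limits 10, 18, 30, 50 and 100 come from the talk.

```python
from bisect import bisect_right

# Prize limits mentioned in the talk, used as histogram bin edges.
PRIZE_LIMITS = [10, 18, 30, 50, 100]


def histogram_by_prize_limits(sales, limits=PRIZE_LIMITS):
    """Count sellers per range; a seller exactly at a limit has reached it."""
    counts = [0] * (len(limits) + 1)
    for sold in sales:
        counts[bisect_right(limits, sold)] += 1
    return counts


def bin_labels(limits=PRIZE_LIMITS):
    """Human-readable labels for the ranges between consecutive limits."""
    labels = [f"<{limits[0]}"]
    labels += [f"{low}-{high - 1}" for low, high in zip(limits, limits[1:])]
    labels.append(f">={limits[-1]}")
    return labels


# Invented per-seller sales figures, for illustration only.
group_a = [3, 10, 12, 18, 25, 30, 55]
group_b = [1, 5, 9, 10, 11, 17, 18, 19, 31, 100]

counts_a = histogram_by_prize_limits(group_a)
counts_b = histogram_by_prize_limits(group_b)

# Use one common scale so the two histograms can be compared fairly.
y_max = max(counts_a + counts_b)

for label, a, b in zip(bin_labels(), counts_a, counts_b):
    print(f"{label:>6} | A {'#' * a:<{y_max}} | B {'#' * b:<{y_max}}")
```

Printing both groups against the same `y_max` is the text equivalent of giving the two plots the same vertical axis.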
The groups under review are still remarkably similar in profile; there are simply more sellers in Group B.
From the point of view of sales promotion, I came to the conclusion that this division between the groups is irrelevant.
However, my real insight was to find more data for this data set, to think about the phenomenon a little more broadly.
The number of sellers who had not sold any products was found in another source.
I think these sellers have significant potential in this case,
because a significant number of children and young people in this campaign had not sold any products.
Looking at the numbers, I also had another insight.
Based on the data, it appeared that the sellers returned very few products, i.e. they sold nearly all the products they had picked up for sale.
Based on this, I would have recommended a small increase in
the prize limits, which would have increased sales as sellers would have taken 1-2 more products.
However, it later became clear to me that the system only allowed products to be registered once.
As a result, the number of products sold and the number taken for sale appeared to be exactly the same.
In other words, many of the sellers' leaders kept their own records and only entered the final quantities sold.
This finding undermined my recommendation.
It also often happens that what you want to develop is not directly measurable.
The picture shows two displays following the traffic volumes on a pedestrian and cycle route in the city of Hämeenlinna.
Both show the number of pedestrians and cyclists,
but in reality the measurements are the times of flight of a radar signal or the pressure variations of the road surface,
from which the type of road user is identified by different methods. These are therefore physical measurements of
pressure or of the reflection of the radar signal, from which it is only later determined whether a cyclist or a pedestrian has passed.
Similarly, when the quality of milk is assessed with physical measurements, electrical conductivity is one of the quantities used,
although hardly any of us need electrical conductivity in our morning coffee.
However, it has been found that electrical conductivity correlates with other useful variables and is easily measurable.
The HAMK Smart Research Unit develops solutions utilizing data analytics and digitalisation in the fields of bio- and circular economy, welfare, education, industry and transport.
Next, I will present our bio- and circular economy projects.
Based on previous research, it is known that an index can be derived from temperature and humidity
that correlates with the milk yield of cows.
The index is called the temperature-humidity-index (THI).
In the Bioeconomy 4.0 project, we measured environmental conditions at various points in HAMK's cowhouse in Mustiala with sensors connected to a LoRaWAN network.
LoRa technology has been developed for long-term measurement and allows relatively free placement of the sensors as well as a long battery life.
Digita provides the data collection for this IoT platform, and I have a collection of measurements downloaded from there.
This is compared to the average daily milk yield of the herd measured by the milking robot.
For the comparison, we have picked out the temperature and relative humidity measurements.
Conditions are measured every 15 minutes, so the data must first be preprocessed: checked for erroneous measurements and aggregated to daily data.
In this case, two of the temperature sensors malfunctioned and sent readings of -1000 °C for a few measurement periods.
Let's remove these points from our dataset.
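Removing such faulty readings might look like the sketch below. The plausibility bounds, the sample rows and the function name are assumptions made for illustration. The talk only states that readings of -1000 °C were removed.

```python
# Assumed plausibility bounds for air temperature in a cowhouse (degrees C).
# These limits are an illustrative choice, not values given in the talk.
PLAUSIBLE_MIN_C = -50.0
PLAUSIBLE_MAX_C = 60.0


def drop_implausible(readings, low=PLAUSIBLE_MIN_C, high=PLAUSIBLE_MAX_C):
    """Keep only (timestamp, temperature) pairs within the plausible range."""
    return [(ts, temp) for ts, temp in readings if low <= temp <= high]


# Invented sample rows, including one faulty reading like those described.
readings = [
    ("2021-06-01T00:00:00", 17.8),
    ("2021-06-01T00:15:00", -1000.0),
    ("2021-06-01T00:30:00", 17.6),
]

cleaned = drop_implausible(readings)
print(f"Kept {len(cleaned)} of {len(readings)} readings")
```

A range check is used here instead of matching the exact value -1000, so other physically impossible readings would be caught as well.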
Here I now present the temperatures and relative humidity as graphs.
An ordinary time series graph is best suited to represent such a collection of more than 11,000 measurement points.
For programmatic processing, text timestamps are converted to timestamp objects.
Using these, daily averages of the measurements can be calculated.
These daily averages are used in the THI calculation.
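The steps just described (parsing timestamps, averaging per day and computing the THI) could be sketched as follows. The talk does not say which THI formula was used. The one below is a commonly cited formulation, often attributed to NRC (1971), and is an assumption here, as are the sample rows and function names.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def thi(temp_c, rh_pct):
    """Temperature-humidity index from air temperature (C) and RH (%).

    Assumed formula (commonly attributed to NRC, 1971); the formula
    actually used in the project is not stated in the talk.
    """
    return (1.8 * temp_c + 32) - (0.55 - 0.0055 * rh_pct) * (1.8 * temp_c - 26)


def daily_thi(rows):
    """Group (timestamp text, temp C, RH %) rows by day and compute daily THI."""
    by_day = defaultdict(list)
    for ts_text, temp_c, rh_pct in rows:
        # Convert the text timestamp to a datetime object, then keep the date.
        day = datetime.fromisoformat(ts_text).date()
        by_day[day].append((temp_c, rh_pct))

    result = {}
    for day, values in sorted(by_day.items()):
        avg_temp = mean(t for t, _ in values)
        avg_rh = mean(rh for _, rh in values)
        result[day] = thi(avg_temp, avg_rh)
    return result


# Invented sample rows, for illustration only.
rows = [
    ("2021-06-01T00:00:00", 19.0, 50.0),
    ("2021-06-01T00:15:00", 21.0, 50.0),
    ("2021-06-02T00:00:00", 25.0, 100.0),
]

for day, value in daily_thi(rows).items():
    print(day, round(value, 2))
```

Here the THI is computed from the daily averages of temperature and humidity. Averaging per-measurement THI values instead would be another reasonable design and can give slightly different numbers.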
By plotting the THIs calculated on the basis of the measurements of different sensors and the outdoor weather conditions
with the milk yield,
it can be seen that the THI and the average milk yield tend to change at the same time.
However, a comparison of the THI does not yet tell us anything about the cause-and-effect relationships of the phenomenon.
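One way to put a number on how closely two daily series move together is the Pearson correlation coefficient. The sketch below is an illustration only: the talk does not say which measure was used, and the two series are invented, so neither the value nor the sign is a finding of the project.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Invented daily values, for illustration only.
daily_thi_values = [60.0, 62.0, 65.0, 70.0, 72.0]
daily_milk_yield = [32.0, 31.5, 31.0, 29.5, 29.0]

r = pearson(daily_thi_values, daily_milk_yield)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to -1 or 1 only says that the series vary together. As noted above, it does not establish cause and effect.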
To understand life in the barn, we conducted a three-month machine vision test on queuing behavior at a milking robot.
From the video footage, 1.7 million images were classified so that each image was assigned to one category.
These categories included interactions from different directions, congested and normal queuing, drinking from the water trough, and curiosity.
In addition, images with disturbances and those showing people were excluded from the classification.
A deeper examination of the material is still in progress,
but it can already be seen that the cows mostly queue for milking very calmly.
In addition, this data could already be used, for example, to measure the duration of drinking bouts,
and it was found that cows mostly drink for less than one minute at a time.
Video-based machine vision can also be used to analyze cricket rearing.
In the HämInCent project, the feeding area of crickets in the open area of the breeding box was photographed with two infrared cameras.
Moving crickets are identified from images taken every second, and the size and number of crickets can be estimated from the detections.
The Good for Livestock project investigated the monitoring of silage temperature to prevent feed spoilage.
By simulating the spoilage process, the effect of the placement of the measuring sticks on the measurement accuracy could be analyzed.
The biological spoilage process causes a local heat rise at the point
where the spoilage is taking place and monitoring the temperatures provides certainty about the quality of the feed.
The Biocycle and JÄRKI projects implement computer-assisted simulations to support decision-making
and improve the efficiency of waste stream management by examining the locations and other characteristics of the facilities.
"Science is not finished until it's communicated."
Science does not end with the result being on the researcher’s table but must be communicated in a meaningful way to the world.
The Field Observatory is the result of the joint work of the Finnish Meteorological Institute, the Finnish Environment Institute, the Baltic Sea Action Group and HAMK,
in which the effects of conservation agriculture and other measures to promote carbon sequestration on farms are measured and made easy to understand for various stakeholders.
A big part of a data analyst's job is communication.
It is a discussion with various stakeholders.
The analyst should understand the problem or phenomenon for which information is needed.
The measurement technique used to monitor this phenomenon must be understood.
One has to understand the connection between these. And when an internal structure
that is relevant to the future is found in a measured collection of data points, the analyst must be able to communicate it in a meaningful way that supports decision-making.
Thank you for joining us on this journey to data analytics.