Forecasting and big data: Interview with Prof. Rob Hyndman
Summary
TL;DR: Professor Rob Hyndman from Monash University discusses big data in time series forecasting, emphasizing the importance of analyzing multiple time series rather than very long individual ones. He highlights the shift from manual to automated forecasting as data volume increases and recommends software like R's 'forecast' package and Tableau for automation. Hyndman warns of the risks of overfitting with complex models and advocates for simplicity and testing methods on holdout sets to ensure accuracy.
Takeaways
- Big data in time series forecasting refers to a large collection of time series, each of which may not be particularly long but which collectively form a vast dataset.
- Examples of big data in time series include daily sales data for multiple products across various stores and countries, or security streaming data from hundreds of sensors.
- When forecasting a few time series, it's feasible to manually tweak models for individual series, but automation becomes crucial when dealing with many series.
- Automated forecasting algorithms are essential for handling large numbers of time series, as manual analysis is impractical.
- Software like R's 'forecast' package, Forecast Pro, and Tableau offer automated forecasting solutions, some with algorithms written by the interviewee.
- Thrive Technologies has the fastest automatic forecasting algorithm Hyndman has seen.
- The benefits of automation include time and cost savings, while the danger lies in potentially poor performance on particular series, since no automated algorithm works well for every series.
- A recommended strategy is to let the automatic algorithm handle the bulk of the forecasting while analysts focus on series that are not forecast well.
- In forecasting competitions, simple methods often outperform complex ones, as large, complicated models can overfit the data, especially when time series are not very long.
- It's important to test different forecasting methods on holdout sets to determine what works best for the specific type of data being analyzed.
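The holdout-testing advice in the last takeaway can be sketched in code. The following is an illustrative Python example (not Hyndman's own code, and not the R 'forecast' package): it splits a series into a training set and a holdout set, then scores two deliberately simple methods by mean absolute error.

```python
def naive_forecast(train, h):
    """Repeat the last observed value h steps ahead."""
    return [train[-1]] * h

def mean_forecast(train, h):
    """Forecast the historical mean h steps ahead."""
    m = sum(train) / len(train)
    return [m] * h

def mae(actual, predicted):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def evaluate(series, holdout=4):
    """Hold out the last `holdout` points and score each method on them."""
    train, test = series[:-holdout], series[-holdout:]
    methods = {"naive": naive_forecast, "mean": mean_forecast}
    return {name: mae(test, f(train, len(test))) for name, f in methods.items()}

# On a trending series, the naive method beats the historical mean,
# which is exactly the kind of fact a holdout test reveals.
series = [10, 12, 13, 15, 16, 18, 19, 21]
print(evaluate(series))
```

Whichever method scores best on the holdout set is the one to carry forward, which is the practice Hyndman says surprisingly few people follow.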
Q & A
What is the definition of big data in the context of time series forecasting according to Rob Hyndman?
-Big data in time series forecasting refers to a large collection of time series, where each individual series may not be particularly long but the volume of series is substantial. Examples include daily sales data for multiple products in various stores and countries or security streaming data from hundreds of sensors.
How does Rob Hyndman describe the difference between handling a few time series versus many?
-With a few time series, one can manually analyze and tweak forecasting methods for each series to account for peculiarities. However, with many series, manual analysis becomes impractical, necessitating automated algorithms to generate forecasts efficiently.
What software does Rob Hyndman mention for automatic forecasting of many time series?
-Rob Hyndman mentions several software options for automatic forecasting, including the R package 'forecast', Forecast Pro for Windows, Tableau, and an algorithm by Thrive Technologies known for its speed.
What are the benefits of using automated forecasting algorithms according to Rob Hyndman?
-Automated forecasting algorithms save time and money by quickly generating forecasts for large numbers of time series without the need for manual intervention.
What are the potential dangers of relying on automated forecasting algorithms as highlighted by Rob Hyndman?
-The danger lies in the fact that no automatic algorithm will work well for every time series. There may be edge cases where the algorithm performs poorly, and improvements in one area might inadvertently degrade performance in others.
What strategy does Rob Hyndman suggest for dealing with time series that are not forecasted well by automated algorithms?
-Rob Hyndman suggests identifying the poorly forecasted series and focusing analyst time on these cases, while allowing the automatic algorithm to handle the majority of the series where it performs adequately.
What insights did Rob Hyndman gain from forecasting competitions involving hundreds or thousands of time series?
-From forecasting competitions, Rob Hyndman learned that simple models often outperform complex ones due to the limited length of individual time series, and that methods like exponential smoothing tend to do well, especially with data showing trends and seasonality.
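To make the point about simple methods concrete, here is a minimal Python sketch of simple exponential smoothing, the most basic member of the family the answer above mentions. This is a from-scratch illustration, not the ETS algorithm from the R 'forecast' package; the smoothing parameter `alpha` is chosen arbitrarily for demonstration.

```python
def simple_exp_smoothing(series, alpha=0.3):
    """Simple exponential smoothing: the level is updated as a weighted
    average of the newest observation and the previous level.
    Returns the final level, which serves as the one-step-ahead forecast."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

demand = [100, 102, 101, 105, 107, 106, 110]
forecast = simple_exp_smoothing(demand, alpha=0.5)
```

With only one parameter to estimate, a method like this is hard to overfit on a short series, which is why such methods hold up well in competitions.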
Why do large, complicated models not always perform well in time series forecasting competitions?
-Large, complicated models are prone to overfitting when applied to individual time series that are not long enough to support the complexity of such models, which is often the case in forecasting competitions.
What is the importance of testing forecasting methods on holdout sets according to Rob Hyndman?
-Testing forecasting methods on holdout sets is crucial for evaluating their effectiveness and identifying which methods work well with the specific data at hand. This practice helps in selecting the most appropriate forecasting approach.
What is the significance of the ETS algorithm mentioned by Rob Hyndman?
-The ETS (Error, Trend, Seasonal) algorithm is significant because it automates the process of fitting exponential smoothing models to time series data, making it easier to forecast without manual intervention.
How does Rob Hyndman's involvement in developing the algorithm for Tableau reflect his expertise in forecasting?
-Rob Hyndman's involvement in developing Tableau's forecasting algorithm demonstrates his expertise in creating efficient and effective automated forecasting solutions, contributing to the accessibility of advanced forecasting methods in widely used software.
Outlines
Big Data in Time Series Forecasting
Professor Rob Hyndman from Monash University discusses the concept of big data in time series forecasting. He explains that big data in this context does not refer to the length of individual time series, but rather to the volume of multiple time series. Hyndman provides examples such as analyzing daily sales data for various products across different stores and countries, and security streaming data from hundreds of sensors. He emphasizes that big data in time series usually means dealing with a large number of series, each potentially short in duration. Hyndman also touches on the challenges of forecasting with few versus many series, highlighting the need for automation when dealing with a large number of time series.
Automated Forecasting Software and Strategies
In the second paragraph, Hyndman addresses the question of software for forecasting many time series and the importance of automation. He mentions R's 'forecast' package, which includes algorithms like ETS for automatic exponential smoothing and auto.arima for automatic ARIMA modeling. Other software like Forecast Pro and Tableau are also highlighted for their automatic forecasting capabilities. Hyndman discusses the benefits of automation, such as time and cost savings, and the dangers, including the risk of overfitting and the challenge of finding an algorithm that works well for all series. He suggests a strategy of allowing automation to handle the majority of series while analysts focus on those that are not forecast well. Hyndman also shares insights from forecasting competitions, advocating for simple yet sophisticated methods and the importance of testing different approaches on holdout sets.
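The triage strategy described above (automate the bulk, route the failures to analysts) can be sketched as follows. This Python example is a hypothetical illustration: the series names, the naive baseline, and the error threshold are all invented for demonstration, not taken from Hyndman's work.

```python
def naive_forecast(train, h):
    """Repeat the last observed value h steps ahead (stand-in for any
    automatic forecasting method)."""
    return [train[-1]] * h

def mae(actual, predicted):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def flag_for_review(collection, holdout=3, threshold=5.0):
    """Score every series on a holdout set; return the names of series
    whose error exceeds the threshold, so an analyst can inspect them."""
    flagged = []
    for name, series in collection.items():
        train, test = series[:-holdout], series[-holdout:]
        if mae(test, naive_forecast(train, len(test))) > threshold:
            flagged.append(name)
    return flagged

sales = {
    "sku_a": [20, 21, 22, 23, 24, 25],  # smooth trend: automation suffices
    "sku_b": [20, 5, 40, 2, 60, 1],     # erratic: worth an analyst's time
}
print(flag_for_review(sales))
```

The automatic method handles every series, but expensive analyst time is spent only on the flagged ones, mirroring the workflow Hyndman recommends to his clients.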
Keywords
- Time Series Forecasting
- Big Data
- Exponential Smoothing
- ARIMA Modeling
- Automation
- R (Statistical Software)
- Forecast Pro
- Tableau
- Thrive Technologies
- Overfitting
- Forecasting Competitions
Highlights
Introduction to Rob Hyndman, a professor of statistics at Monash University, specializing in time series forecasting research.
Definition of big data in time series forecasting: Big data involves multiple time series rather than one very long time series.
Example of big data in time series forecasting: Sales data for multiple products across various locations and time frames.
Example of big data in time series: Security streaming data with sensors sending multiple signals per second, leading to large data sets over time.
Difference in forecasting few versus many time series: Manual tuning is feasible for a few series, while automation is essential for many.
The importance of automated algorithms for forecasting when dealing with large numbers of time series.
Available software for automated forecasting: R (with the 'forecast' package), Forecast Pro, Tableau, and Thrive Technologies.
Discussion of specific algorithms for automated forecasting, such as automatic exponential smoothing (ETS) and auto.arima in R.
Benefits of automation: Saves time and money by generating forecasts quickly and efficiently.
Challenges of automation: No algorithm works well for every time series, and adjustments can sometimes worsen performance for certain series.
Recommended strategy: Use automated algorithms for the majority of time series and manually check series where automation fails.
Insights from forecasting competitions: Simple methods, like exponential smoothing, often perform better than complex models like neural networks.
Explanation for why complex models, like neural networks, may not perform well in time series competitions due to overfitting.
The importance of using well-tested methods on similar types of data and validating approaches through training and test sets.
Conclusion: Emphasizes the value of simple, well-tested models and the importance of continuous testing and validation in forecasting.
Transcripts
Hi, my name is Rob Hyndman.
I'm a professor of statistics at Monash University
and I do work in time series forecasting research.
I'm here to answer a few questions from Galit Shmueli.
So the first one is, what is the meaning of big data in time
series forecasting and can you give a few examples
from projects you've worked on?
So time series, as you probably know,
are data that are collected sequentially over time.
So it's very unlikely that one single time series
will be particularly long.
If you think big data means data that
can't fit onto a single ordinary sized computer,
then no time series is that long.
So if you want to think about big data in the context
of forecasting and time series you
need to think about lots of time series.
So, for example, you might be looking at sales of a company.
And you have daily sales, data going back a few years.
And you might be having lots of different products
for that company.
And you're looking at sales for each product, in each store,
in every country of the world.
And so that way you end up with a large collection of data
because there's lots and lots of time series.
Or another situation I've dealt with big data in a time series
context was when we were looking at a security streaming data.
So that's a company that was monitoring
the security around a building.
And they had a fence on which was mounted
several hundred sensors that were detecting movement
in the vicinity of the building.
Each of those sensors was sending
a signal, or several signals every second.
And the data was continually streaming all the time.
So in only an hour or so, you end up with gigabytes of data.
And over a long period of time, you have a seriously large data
set.
So big data in time series generally means
lots and lots of time series, each one of which
may not be particularly long.
The second question was, how does the forecasting process
differ for few vs. many series?
When you have a few series, you can
afford to spend some time looking at the results
and maybe tweaking the method that you are using
for each individual series.
You might account for some peculiarities and features
of each series and end up with a forecasting
model that's tuned to the individual series
that you've got.
Once you get above a handful of time series,
it's just no longer possible to spend
the time looking at each individual time series
separately.
So you need some kind of automatic algorithm
which will generate forecasts for you.
So with lots of series when you do forecasting,
automation is crucial because you just cannot do it in any
manual sense.
Third question is, what software can
be used for forecasting many series
and can they be used as part of an automated solution?
So yes, there's quite a lot of software
out there now that does automatic forecasting.
R, the statistical software platform,
is one where there's a package called forecast,
which is my own package.
And I've written several algorithms
in the package that will do automatic forecasting.
So the best known of those algorithms
are the ETS algorithm, which does
automatic exponential smoothing.
And then there's an auto.arima algorithm, which
does automatic ARIMA modeling.
But the forecast package for R is not
the only software around that can do automatic forecasting.
One that's been around for several decades, which
is excellent, is Forecast Pro, which is available for Windows.
And that's used by lots of companies
for doing their automatic forecasting
and it has quite good integration
with other software systems.
More recently, Tableau has produced
some automatic forecasting within their software.
I actually wrote the algorithm for Tableau.
There's a company called Thrive Technologies
that I've worked with, which has a very good forecasting
algorithm to automatically forecast very, very quickly.
It's the fastest automatic forecasting algorithm
I've ever seen.
What are the benefits and dangers of automation?
That was question number four from Galit.
Well, the obvious benefits are that it saves a lot of time
and it saves a lot of money.
You can just give your time series to a computer
and it will give you back some forecasts.
The danger, of course, is that no automatic algorithm will
work well for every series.
I spent a lot of time looking for edge cases
where my algorithms don't work and trying
to find ways to improve the algorithms to cope
with more types of data.
But it's a never ending quest.
And sometimes I modify an algorithm
so that it does better for some time series
only to find that it's actually made
things worse on other series.
So there's always going to be particular time series where
your automatic algorithm, whichever one you use,
where the automatic algorithm does not do so well.
A good strategy that I encourage my clients to do
is to try to identify the series that
are not being forecast well and just look at those ones.
And let the automatic algorithm do the bulk of the series.
And then you can concentrate on spending
your analyst time, which is expensive,
spending that time on the cases where the automation is not
working so well.
And Galit's last question concerned
forecasting competitions.
She says, in several forecasting contests in which you
were involved, participants were tasked
with forecasting hundreds or thousands of series.
What effective approaches and conclusions
emerged from these contests?
Well, there's a few things that have come out
of those sort of competitions.
The first is, keep it simple, or at least
keep it sophisticatedly simple.
Large, complicated models do not always
work well on time series data because the individual time
series are not necessarily very long.
And if you have a large complicated model,
you tend to overfit.
So, for example, time series competitions
you generally find that neural networks don't do very well.
Because they're designed for very large collections of data
and the individual time series are often not long enough
to fit a good neural net.
On the other hand, some quite simple methods,
like exponential smoothing, tends
to do pretty well in forecasting competitions, especially when
those competitions involve data that
show trend and seasonality.
I guess the other thing I would say
is that, in competitions and elsewhere
is to use methods that have been well tested
on similar types of data.
For example, you can split each of the time series
into a training set and the test set.
Apply your methods to forecast the test
set and see what works well.
And then you know what you could use going forward.
It's amazing how many people don't do that.
Just testing a range of different approaches
out on test sets, on holdout sets, to check what works well
and what doesn't work well.
OK, so there's the five questions from Galit.
It was fun to be able to answer them.
And I hope that's been helpful.
Thank you.