Forecasting and big data: Interview with Prof. Rob Hyndman

Galit Shmueli
30 Nov 2016 (07:26)

Summary

TL;DR: Professor Rob Hyndman of Monash University discusses big data in time series forecasting, explaining that it usually means many time series rather than very long individual ones. He describes why forecasting must shift from manual to automated as the number of series grows, and names software such as R's 'forecast' package, Forecast Pro, and Tableau. Hyndman cautions that large, complicated models tend to overfit short series, and advocates simple methods and testing on holdout sets to check accuracy.

Takeaways

  • Big data in time series forecasting refers to a large collection of time series, each of which may not be particularly long but which collectively form a vast dataset.
  • Examples of big data in time series include daily sales data for multiple products across various stores and countries, or security streaming data from hundreds of sensors.
  • When forecasting a few time series, it's feasible to manually tweak models for individual series, but automation becomes crucial when dealing with many series.
  • Automated forecasting algorithms are essential for handling large numbers of time series, as manual analysis is impractical.
  • Software like R's 'forecast' package, Forecast Pro, and Tableau offer automated forecasting solutions, some with algorithms written by the interviewee.
  • Thrive Technologies has an automatic forecasting algorithm that Hyndman calls the fastest he has ever seen.
  • The benefits of automation in forecasting are time and cost savings; the danger is that no automatic algorithm works well for every series.
  • A recommended strategy is to let the automatic algorithm handle the bulk of the forecasting while analysts focus on series that are not forecast well.
  • In forecasting competitions, simple methods often outperform complex ones, as large, complicated models can overfit the data, especially when time series are not very long.
  • It's important to test different forecasting methods on holdout sets to determine what works best for the specific type of data being analyzed.

Q & A

  • What is the definition of big data in the context of time series forecasting according to Rob Hyndman?

    -Big data in time series forecasting refers to a large collection of time series, where each individual series may not be particularly long but the volume of series is substantial. Examples include daily sales data for multiple products in various stores and countries or security streaming data from hundreds of sensors.

  • How does Rob Hyndman describe the difference between handling a few time series versus many?

    -With a few time series, one can manually analyze and tweak forecasting methods for each series to account for peculiarities. However, with many series, manual analysis becomes impractical, necessitating automated algorithms to generate forecasts efficiently.

  • What software does Rob Hyndman mention for automatic forecasting of many time series?

    -Rob Hyndman mentions several software options for automatic forecasting, including the R package 'forecast', Forecast Pro for Windows, Tableau, and an algorithm by Thrive Technologies known for its speed.

  • What are the benefits of using automated forecasting algorithms according to Rob Hyndman?

    -Automated forecasting algorithms save time and money by quickly generating forecasts for large numbers of time series without the need for manual intervention.

  • What are the potential dangers of relying on automated forecasting algorithms as highlighted by Rob Hyndman?

    -The danger lies in the fact that no automatic algorithm will work well for every time series. There may be edge cases where the algorithm performs poorly, and improvements in one area might inadvertently degrade performance in others.

  • What strategy does Rob Hyndman suggest for dealing with time series that are not forecasted well by automated algorithms?

    -Rob Hyndman suggests identifying the poorly forecasted series and focusing analyst time on these cases, while allowing the automatic algorithm to handle the majority of the series where it performs adequately.

  • What insights did Rob Hyndman gain from forecasting competitions involving hundreds or thousands of time series?

    -From forecasting competitions, Rob Hyndman learned that simple models often outperform complex ones due to the limited length of individual time series, and that methods like exponential smoothing tend to do well, especially with data showing trends and seasonality.

  • Why do large, complicated models not always perform well in time series forecasting competitions?

    -Large, complicated models are prone to overfitting when applied to individual time series that are not long enough to support the complexity of such models, which is often the case in forecasting competitions.

  • What is the importance of testing forecasting methods on holdout sets according to Rob Hyndman?

    -Testing forecasting methods on holdout sets is crucial for evaluating their effectiveness and identifying which methods work well with the specific data at hand. This practice helps in selecting the most appropriate forecasting approach.

  • What is the significance of the ETS algorithm mentioned by Rob Hyndman?

    -The ETS algorithm, which Hyndman describes as automatic exponential smoothing, is significant because it automates the process of fitting exponential smoothing models to time series data, making it easier to forecast without manual intervention.

  • How does Rob Hyndman's involvement in developing the algorithm for Tableau reflect his expertise in forecasting?

    -Rob Hyndman's involvement in developing Tableau's forecasting algorithm demonstrates his expertise in creating efficient and effective automated forecasting solutions, contributing to the accessibility of advanced forecasting methods in widely used software.

Outlines

00:00

Big Data in Time Series Forecasting

Professor Rob Hyndman from Monash University discusses the concept of big data in time series forecasting. He explains that big data in this context does not refer to the length of individual time series, but rather to the volume of multiple time series. Hyndman provides examples such as analyzing daily sales data for various products across different stores and countries, and security streaming data from hundreds of sensors. He emphasizes that big data in time series usually means dealing with a large number of series, each potentially short in duration. Hyndman also touches on the challenges of forecasting with few versus many series, highlighting the need for automation when dealing with a large number of time series.

05:01

Automated Forecasting Software and Strategies

Hyndman then addresses software for forecasting many time series and the importance of automation. He mentions R's 'forecast' package, which includes the ETS algorithm for automatic exponential smoothing and auto.arima for automatic ARIMA modeling. Forecast Pro, Tableau, and Thrive Technologies are also highlighted for their automatic forecasting capabilities. Hyndman discusses the benefits of automation, such as time and cost savings, and its main danger: no single algorithm works well for every series. He suggests letting automation handle the majority of series while analysts focus on those that are not forecast well. Hyndman also shares insights from forecasting competitions, advocating simple (or "sophisticatedly simple") methods, warning that large models tend to overfit short series, and stressing the importance of testing different approaches on holdout sets.

Keywords

Time Series Forecasting

Time series forecasting is a statistical technique that deals with the prediction of future data points based on previously observed data points. In the context of the video, Rob Hyndman discusses how big data influences time series forecasting, emphasizing the importance of handling multiple time series data sets, such as daily sales data across various products and locations. The video highlights the challenges and methods associated with forecasting in such scenarios.

Big Data

Big data refers to extremely large and complex data sets that traditional data processing applications are inadequate to handle. In the video, Hyndman clarifies that in the context of time series forecasting, 'big data' typically means having a large number of time series rather than extremely long individual time series. Examples given include daily sales data for multiple products in various stores worldwide and security streaming data from hundreds of sensors.

Exponential Smoothing

Exponential smoothing is a forecasting method for univariate time series. It is a type of weighted moving average in which the weights decrease exponentially as observations get older. In the video, Hyndman mentions that simple methods like exponential smoothing often perform well in forecasting competitions, especially for data that exhibit trend and seasonality.
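
As a rough illustration of the weighting idea, here is a minimal sketch of simple exponential smoothing in plain Python. It covers only the level-only variant (no trend or seasonality), the function name and the sample figures are made up for this example, and it is not the automatic algorithm from the 'forecast' package.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast for `series`.

    The level is updated as level = alpha * y + (1 - alpha) * level,
    so an observation k steps back carries weight alpha * (1 - alpha) ** k.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]  # assumption: initialise the level with the first value
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


# Hypothetical daily sales figures, for illustration only.
sales = [10.0, 12.0, 11.0, 13.0]
print(simple_exponential_smoothing(sales, alpha=0.5))  # 12.0
```

A larger `alpha` makes the forecast react faster to recent values; with `alpha=1` it simply repeats the last observation.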

ARIMA Modeling

ARIMA, which stands for AutoRegressive Integrated Moving Average, is a statistical model used for time series forecasting. It combines autoregression (AR) with moving average (MA) and often includes differencing (I) to make the series stationary. Hyndman discusses the 'auto.arima' algorithm in the R package 'forecast', which automates the selection of ARIMA model parameters for time series data.
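
The differencing step can be sketched on its own. The snippet below is an illustrative plain-Python helper pair, not part of auto.arima: `difference` and `integrate` are hypothetical names, and the sample numbers are invented. It shows how differencing turns a steadily trending series into a constant one, and how forecasts made on the differenced scale are converted back.

```python
from itertools import accumulate


def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def integrate(last_observation, differenced_forecasts):
    """Undo lag-1 differencing by cumulatively adding forecast changes
    to the last observed value."""
    totals = accumulate(differenced_forecasts, initial=last_observation)
    return list(totals)[1:]  # drop the starting value itself


trend = [3, 5, 7, 9, 11]
print(difference(trend))      # [2, 2, 2, 2]
print(integrate(11, [2, 2]))  # [13, 15]
```

Passing a larger `lag` (for example 12 for monthly data) gives a seasonal difference instead of a first difference.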

Automation

Automation in the context of the video refers to the use of algorithms and software to automatically generate forecasts without manual intervention. Hyndman stresses the importance of automation when dealing with many time series, as it is not feasible to manually adjust forecasts for each series. He also cautions that no automatic algorithm works well for every series, so some series will always be forecast poorly and need an analyst's attention.
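
One way to picture this workflow is the sketch below: forecast every series automatically, score each one on a holdout, and return only the poorly forecast ones for manual review. Everything here is an assumption made for illustration: the naive method stands in for a real automatic algorithm, and the series names, the scaled-error measure, and the 0.25 threshold are hypothetical.

```python
def naive_forecast(train, horizon):
    """Stand-in automatic method: repeat the last observed value."""
    return [train[-1]] * horizon


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def flag_for_review(series_by_name, horizon, threshold):
    """Return names of series whose holdout error, scaled by the mean
    absolute training value, exceeds `threshold`."""
    flagged = []
    for name, values in series_by_name.items():
        train, test = values[:-horizon], values[-horizon:]
        error = mean_absolute_error(test, naive_forecast(train, horizon))
        scale = sum(abs(v) for v in train) / len(train)
        if scale == 0 or error / scale > threshold:
            flagged.append(name)
    return flagged


# Hypothetical series: one stable, one with abrupt level shifts.
series_by_name = {
    "store_a_product_1": [20, 21, 19, 20, 21, 20, 20, 21],
    "store_b_product_1": [5, 6, 5, 30, 32, 31, 60, 58],
}
print(flag_for_review(series_by_name, horizon=2, threshold=0.25))
# ['store_b_product_1']
```

Only the flagged series would then go to an analyst, while the automatic forecasts are kept for the rest.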

R (Statistical Software)

R is a programming language and software environment commonly used for statistical computing, data analysis, and graphical representation. In the video, Hyndman mentions R as a platform that hosts the 'forecast' package, which contains algorithms for automatic forecasting, including the ETS and auto.arima algorithms.

Forecast Pro

Forecast Pro is a commercial software package used for forecasting and is mentioned by Hyndman as a tool that has been around for several decades and is used by many companies for automatic forecasting. It is noted for its good integration with other software systems.

Tableau

Tableau is software used for data visualization and business intelligence. Hyndman mentions that Tableau has recently incorporated automatic forecasting into its software, and that he wrote the algorithm. This highlights the growing integration of forecasting tools into data visualization platforms.

Thrive Technologies

Thrive Technologies is a company that Hyndman has worked with, known for its fast automatic forecasting algorithm. The video mentions this company as an example of how forecasting algorithms are being developed and optimized for speed and efficiency.

Overfitting

Overfitting occurs when a model learns the detail and noise in the training data to an extent that it negatively impacts the model's performance on new data. Hyndman warns against using overly complex models for time series forecasting due to the risk of overfitting, especially when the individual time series are not very long.

Forecasting Competitions

Forecasting competitions are events where participants are tasked with forecasting time series data, often with the goal of comparing the accuracy and effectiveness of different forecasting methods. Hyndman discusses insights gained from such competitions, such as the value of simplicity in forecasting models and the importance of testing methods on holdout sets.
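
The holdout test described above can be sketched in a few lines. In this illustrative example, the last observations of a series are held back, several candidate methods forecast them, and the methods are ranked by mean absolute error. The three methods are textbook baselines chosen for brevity, not the methods used in any particular competition, and the data are invented.

```python
def mean_forecast(train, horizon):
    """Forecast the training mean for every future step."""
    return [sum(train) / len(train)] * horizon


def naive_forecast(train, horizon):
    """Forecast the last observed value for every future step."""
    return [train[-1]] * horizon


def drift_forecast(train, horizon):
    """Extend the straight line through the first and last observations."""
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + slope * h for h in range(1, horizon + 1)]


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rank_methods(series, horizon, methods):
    """Score each method on the last `horizon` points; best method first."""
    train, test = series[:-horizon], series[-horizon:]
    scores = {name: mae(test, f(train, horizon)) for name, f in methods.items()}
    return sorted(scores.items(), key=lambda item: item[1])


series = [10, 12, 14, 16, 18, 20, 22, 24]  # hypothetical trending series
methods = {"mean": mean_forecast, "naive": naive_forecast, "drift": drift_forecast}
for name, score in rank_methods(series, horizon=2, methods=methods):
    print(name, score)
```

On this steadily rising series the drift method ranks first; on other data the ordering may differ, which is the reason for running the test on data similar to what will be forecast.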

Highlights

Introduction to Rob Hyndman, a professor of statistics at Monash University, specializing in time series forecasting research.

Definition of big data in time series forecasting: Big data involves multiple time series rather than one very long time series.

Example of big data in time series forecasting: Sales data for multiple products across various locations and time frames.

Example of big data in time series: Security streaming data with sensors sending multiple signals per second, leading to large data sets over time.

Difference in forecasting few versus many time series: Manual tuning is feasible for a few series, while automation is essential for many.

The importance of automated algorithms for forecasting when dealing with large numbers of time series.

Available software for automated forecasting: R (with the 'forecast' package), Forecast Pro, Tableau, and Thrive Technologies.

Discussion of specific algorithms for automated forecasting, such as automatic exponential smoothing (ETS) and auto.arima in R.

Benefits of automation: Saves time and money by generating forecasts quickly and efficiently.

Challenges of automation: No algorithm works well for every time series, and adjustments can sometimes worsen performance for certain series.

Recommended strategy: Use automated algorithms for the majority of time series and manually check series where automation fails.

Insights from forecasting competitions: Simple methods, like exponential smoothing, often perform better than complex models like neural networks.

Explanation for why complex models, like neural networks, may not perform well in time series competitions due to overfitting.

The importance of using well-tested methods on similar types of data and validating approaches through training and test sets.

Conclusion: Emphasizes the value of simple, well-tested models and the importance of continuous testing and validation in forecasting.

Transcripts

00:12

Hi, my name is Rob Hyndman. I'm a professor of statistics at Monash University and I do work in time series forecasting research. I'm here to answer a few questions from Galit Shmueli.

00:25

So the first one is, what is the meaning of big data in time series forecasting and can you give a few examples from projects you've worked on?

00:33

So time series, as you probably know, are data that are collected sequentially over time. So it's very unlikely that one single time series will be particularly long. If you think big data means data that can't fit onto a single ordinary-sized computer, then no time series is that long. So if you want to think about big data in the context of forecasting and time series you need to think about lots of time series.

01:01

So, for example, you might be looking at sales of a company. And you have daily sales data going back a few years. And you might be having lots of different products for that company. And you're looking at sales for each product, in each store, in every country of the world. And so that way you end up with a large collection of data because there's lots and lots of time series.

01:24

Or another situation I've dealt with big data in a time series context was when we were looking at security streaming data. So that's a company that was monitoring the security around a building. And they had a fence on which was mounted several hundred sensors that were detecting movement in the vicinity of the building. Each of those sensors was sending a signal, or several signals, every second. And the disk was continually streaming all the time. So in only an hour or so, you end up with gigabytes of data. And over a long period of time, you have a seriously large data set.

02:07

So big data in time series generally means lots and lots of time series, each one of which may not be particularly long.

02:15

The second question was, how does the forecasting process differ for few vs. many series?

02:20

When you have a few series, you can afford to spend some time looking at the results and maybe tweaking the method that you are using for each individual series. You might account for some peculiarities and features of each series and end up with a forecasting model that's tuned to the individual series that you've got.

02:38

Once you get above a handful of time series, it's just no longer possible to spend the time looking at each individual time series separately. So you need some kind of automatic algorithm which will generate forecasts for you. So with lots of series when you do forecasting, automation is crucial because you just cannot do it in any manual sense.

02:59

Third question is, what software can be used for forecasting many series and can they be used as part of an automated solution?

03:06

So yes, there's quite a lot of software out there now that does automatic forecasting. R, the statistical software platform, is one where there's a package called forecast, which is my own package. And I've written several algorithms in the package that will do automatic forecasting. So the best known of those algorithms are the ETS algorithm, which does automatic exponential smoothing. And then there's an auto.arima algorithm, which does automatic ARIMA modeling.

03:37

But the forecast package for R is not the only software around that can do automatic forecasting. One that's been around for several decades, which is excellent, is Forecast Pro, which is available for Windows. And that's used by lots of companies for doing their automatic forecasting and it has quite good integration with other software systems.

03:59

More recently, Tableau has produced some automatic forecasting within their software. I actually wrote the algorithm for Tableau. There's a company called Thrive Technologies that I've worked with, which has a very good forecasting algorithm to automatically forecast very, very quickly. It's the fastest automatic forecasting algorithm I've ever seen.

04:23

What are the benefits and dangers of automation? That was question number four from Galit.

04:28

Well, the obvious benefits are that it saves a lot of time and it saves a lot of money. You can just give your time series to a computer and it will give you back some forecasts.

04:37

The danger, of course, is that no automatic algorithm will work well for every series. I spent a lot of time looking for edge cases where my algorithms don't work and trying to find ways to improve the algorithms to cope with more types of data. But it's a never-ending quest. And sometimes I modify an algorithm so that it does better for some time series only to find that it's actually made things worse on other series.

05:01

So there's always going to be particular time series where your automatic algorithm, whichever one you use, does not do so well. A good strategy that I encourage my clients to do is to try to identify the series that are not being forecast well and just look at those ones. And let the automatic algorithm do the bulk of the series. And then you can concentrate on spending your analyst time, which is expensive, on the cases where the automation is not working so well.

05:30

And Galit's last question concerned forecasting competitions. She says, in several forecasting contests in which you were involved, participants were tasked with forecasting hundreds or thousands of series. What effective approaches and conclusions emerged from these contests?

05:46

Well, there's a few things that have come out of those sort of competitions. The first is, keep it simple, or at least keep it sophisticatedly simple. Large, complicated models do not always work well on time series data because the individual time series are not necessarily very long. And if you have a large complicated model, you tend to overfit.

06:05

So, for example, in time series competitions you generally find that neural networks don't do very well, because they're designed for very large collections of data and the individual time series are often not long enough to fit a good neural net.

06:24

On the other hand, some quite simple methods, like exponential smoothing, tend to do pretty well in forecasting competitions, especially when those competitions involve data that show trend and seasonality.

06:36

I guess the other thing I would say, in competitions and elsewhere, is to use methods that have been well tested on similar types of data. For example, you can split each of the time series into a training set and a test set. Apply your methods to forecast the test set and see what works well. And then you know what you could use going forward.

06:58

It's amazing how many people don't do that. Just testing a range of different approaches out on test sets, on holdout sets, to check what works well and what doesn't work well.

07:12

OK, so there's the five questions from Galit. It was fun to be able to answer them. And I hope that's been helpful. Thank you.

Related Tags
Time Series, Forecasting, Big Data, Automation, Competitions, R Software, ARIMA, Exponential Smoothing, Data Analysis, Predictive Modeling