Transformers for Time Series: Is the New State of the Art (SOA) Approaching? - Ezequiel Lanza, Intel

The Linux Foundation
25 May 2023 · 54:22

Summary

TL;DR: Ezequiel Lanza, an AI open source evangelist at Intel, presents his research on applying Transformers to time series analysis. He discusses the architecture of Transformers, their adaptation from language translation to various applications, and their potential in time series forecasting. Lanza compares the performance of Transformers with traditional models like LSTMs, highlighting the benefits and challenges of using Transformers for long-term predictions. He also emphasizes the importance of community involvement in advancing the state of the art for time series analysis with Transformers.

Takeaways

  • The presentation discusses the application of Transformers to time series analysis, a concept initially developed for natural language processing (NLP).
  • The speaker, Ezequiel Lanza, shares his thesis work and personal experience with using Transformers for time series data.
  • The agenda includes a brief explanation of Transformers, their architecture, and how they can be adapted for time series analysis.
  • Two main architectures for time series are highlighted: Informer and Spacetimeformer, chosen for their open-source availability and practical use cases.
  • The speaker emphasizes the importance of understanding the input representation, the embeddings, and the adaptation of the encoder and decoder in Transformers for time series.
  • The presentation compares the performance of Transformers with traditional time series models like ARIMA, autoregressive models, and LSTMs.
  • The potential of Transformers to capture both short-term and long-term dependencies in time series data is discussed, with a focus on their efficiency and accuracy.
  • The computational complexity of the attention mechanism in Transformers is addressed, along with modifications that improve efficiency for time series analysis.
  • The speaker's use case involves predicting latency in a microservices architecture, demonstrating the practical application of Transformers in a real-world scenario.
  • The presentation concludes with a call for community involvement in advancing the state of the art for Transformers in time series analysis and the importance of testing and optimizing models for specific use cases.
  • The speaker recommends frameworks like TSI for time series analysis, which can simplify the process of implementing and testing Transformer models.

Q & A

  • What is the main focus of Ezequiel Lanza's presentation?

    -The main focus of Ezequiel Lanza's presentation is to share his research on using Transformers for time series analysis, discussing his experiences, challenges, and the potential usefulness of this approach.

  • What are the two main architectures for Transformers in time series that Lanza discusses?

    -The two main architectures for Transformers in time series that Lanza discusses are Informer and Spacetimeformer.

  • How does the Transformer architecture adapt to different tasks like translation and image generation?

    -The Transformer architecture adapts to different tasks by modifying the original model and combining it with other neural networks such as CNNs, as seen in GPT for language generation and Stable Diffusion for image generation.

  • What is the significance of the self-attention mechanism in Transformers?

    -The self-attention mechanism in Transformers allows the model to focus on the most relevant parts of the input data, which is crucial for understanding the relationships between different elements in the data, such as words in a sentence or data points in a time series.

  • What are the challenges of applying Transformers to time series data?

    -The challenges include the need for careful input representation, the computational complexity of the attention mechanism, and the necessity of capturing both short-term and long-term dependencies in the data.

  • How does the Informer architecture address the computational complexity of the attention mechanism?

    -The Informer architecture addresses the computational complexity by using a probability-based attention mechanism, which reduces the amount of calculations required, making it more efficient for handling large time series data.

  • What are the advantages of using Transformers for time series forecasting compared to traditional methods like ARIMA or LSTM?

    -Transformers for time series forecasting can capture complex, non-linear relationships and long-term dependencies more effectively than traditional methods like ARIMA or LSTM, which may struggle with non-linear data and have limitations in handling long sequences.

  • What is the role of position encoding in Transformers for time series data?

    -Position encoding in Transformers for time series data is crucial for providing the model with information about the order and position of data points, which is essential for capturing the temporal dependencies in time series.

  • How does the Spacetimeformer architecture represent time series data differently from Informer?

    -The Spacetimeformer architecture represents time series data by focusing on the relationships between features and timestamps, allowing the model to pay attention to both time and features simultaneously, which can be more effective for certain types of time series analysis.

  • What are the key takeaways from Lanza's experience with implementing Transformers for time series in a microservices architecture?

    -Lanza's experience highlights the potential of Transformers for time series forecasting, especially for long-term predictions, but also emphasizes the need for community involvement, continuous research, and the importance of testing and optimizing the models for specific use cases.

Outlines

00:00

Introduction to Transformers for Time Series

Ezequiel Lanza introduces the concept of using Transformers for time series analysis, sharing his thesis work and research on adapting the architecture for time series data. He outlines the agenda, which includes a brief explanation of Transformers, their architecture, and how they can be applied to time series. Lanza also mentions the importance of understanding the limitations and potential of Transformers in this context.

05:02

Understanding Transformers and Time Series

The speaker delves into the specifics of how Transformers can be used for time series data, emphasizing the importance of input representation, embeddings, and the adaptation of the encoder and decoder with self-attention mechanisms. He explains the concept of positional encoding and how it helps the model understand the order of data points in a time series, which is crucial for accurate predictions.

10:02

Self-Attention and Multi-Head Attention

The paragraph discusses the self-attention mechanism in Transformers, which allows the model to focus on relevant parts of the input data. The speaker explains how multi-head attention enables the model to capture both short-term and long-term dependencies within the data. He also touches on the computational complexity of attention layers and how it can be optimized for time series applications.
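
As a rough illustration of the multi-head idea described above (not code from the talk), the sketch below runs PyTorch's built-in multi-head attention over an embedded window of time steps; the batch size, window length, model width, and head count are made-up values.

```python
import torch
import torch.nn as nn

batch, window, d_model, n_heads = 8, 96, 64, 4   # hypothetical sizes

x = torch.randn(batch, window, d_model)          # embedded time-series window
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

# Self-attention: queries, keys and values all come from the same window,
# so every time step can attend to every other time step.
out, weights = mha(x, x, x)
print(out.shape)      # torch.Size([8, 96, 64]) - one contextualized vector per step
print(weights.shape)  # torch.Size([8, 96, 96]) - attention scores, averaged over heads
```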

15:03

Time Series and Sequence Modeling

The speaker compares time series data to language data, highlighting the differences in how they are processed by models like Transformers. He explains that while language models can handle variable word order, time series models must maintain strict order due to the sequential nature of time data. The paragraph also discusses traditional time series approaches like ARIMA and their limitations, leading to the exploration of neural networks for capturing non-linear dependencies.
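
For context, a minimal ARIMA baseline of the kind the talk contrasts with looks roughly like this; the synthetic series and the (p, d, q) order are illustrative, not values from the presentation.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Toy series: a sine wave plus noise, standing in for a real time series.
series = np.sin(np.linspace(0, 20, 200)) + np.random.normal(0, 0.1, 200)

model = ARIMA(series, order=(2, 0, 1))   # AR(2), no differencing, MA(1)
fitted = model.fit()
forecast = fitted.forecast(steps=24)     # predict 24 future points
print(forecast[:5])
```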

20:03

RNNs, LSTMs, and Sequence-to-Sequence Models

The speaker explores the use of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks in time series analysis. He explains how RNNs can memorize the importance of previous data points but struggle with long-term dependencies due to vanishing gradients. LSTMs, with their ability to remember long sequences, are presented as a solution to this problem. The speaker also mentions sequence-to-sequence models as a way to improve LSTM performance.
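
A minimal LSTM forecaster of the kind the talk later compares against could be sketched as follows; layer sizes and window lengths are assumptions for illustration, not the speaker's configuration.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, n_features=1, hidden=32, horizon=36):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):            # x: (batch, past_steps, n_features)
        _, (h_n, _) = self.lstm(x)   # final hidden state summarizes the window
        return self.head(h_n[-1])    # (batch, horizon) predicted future values

model = LSTMForecaster()
past = torch.randn(8, 100, 1)        # 100 past points per sample
print(model(past).shape)             # torch.Size([8, 36])
```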

25:05

Optimizing Transformers for Time Series

The speaker discusses the challenges of using Transformers for time series, particularly the quadratic time complexity of the attention layer. He references a 2020 survey paper that suggests modifications to the Transformer architecture for time series applications, focusing on network modifications and positional encoding. The paragraph also introduces Informer and Spacetimeformer as open-source architectures designed for time series forecasting.

30:08

Positional Encoding and Attention Module

The speaker explains how positional encoding and modifications to the attention module can improve Transformer performance for time series. He mentions the use of learnable embeddings and the inclusion of timestamp information to help the model understand the order of data points. The paragraph also discusses the use of sparse attention mechanisms to reduce computational complexity and improve efficiency.
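
The sketch below is a deliberately simplified illustration of the idea behind probability-based sparse attention (keep full attention only for the queries whose score distribution is most "peaked" and give the rest a cheap default); it is not Informer's actual implementation, which also samples keys when scoring queries.

```python
import torch

def sparse_attention_sketch(q, k, v, u):
    # q, k, v: (seq_len, d); u: how many queries keep full attention.
    d = q.shape[-1]
    scores = q @ k.T / d ** 0.5                        # (seq, seq)
    # "Peakedness" of each query's score distribution: max minus mean.
    m = scores.max(dim=-1).values - scores.mean(dim=-1)
    top = m.topk(u).indices                            # the u most informative queries
    out = v.mean(dim=0).expand(q.shape[0], -1).clone() # lazy default for the others
    out[top] = torch.softmax(scores[top], dim=-1) @ v  # full attention for top-u only
    return out

q = k = v = torch.randn(96, 64)
print(sparse_attention_sketch(q, k, v, u=16).shape)    # torch.Size([96, 64])
```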

35:10

Benchmarks and Performance

The speaker presents benchmarks that compare the performance of Informer and Spacetimeformer with LSTMs. He highlights the advantages of Transformers in capturing long-term dependencies and their performance in forecasting future data points. The paragraph also discusses the trade-off between the accuracy of predictions and the computational time required, emphasizing the need for optimization and community involvement in advancing Transformer models for time series.

40:11

Use Case: Microservices Architecture Latency Prediction

The speaker shares a personal use case where he applied Transformer models to predict latency in a microservices architecture. He discusses the data preparation process, including the selection of relevant features and the use of past latency data points for prediction. The paragraph also touches on the importance of processing time and the impact of attention layer optimizations on model performance.
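
A minimal sketch of the data preparation step described here, turning a measured latency series into (past window → future window) training pairs; the talk mentions using 100 past points, while the horizon and the series below are placeholders.

```python
import numpy as np

def make_windows(series, past=100, horizon=36):
    X, y = [], []
    for i in range(len(series) - past - horizon + 1):
        X.append(series[i : i + past])                    # model input
        y.append(series[i + past : i + past + horizon])   # values to predict
    return np.stack(X), np.stack(y)

latency = np.random.rand(5000)        # placeholder for real latency measurements
X, y = make_windows(latency)
print(X.shape, y.shape)               # (4865, 100) (4865, 36)
```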

45:12

Conclusions and Future Work

In conclusion, the speaker suggests that while Transformers show promise for time series analysis, there is still much work to be done in terms of optimization and data representation. He emphasizes the importance of community involvement and the need to test and adapt Transformer models for specific use cases. The speaker also encourages the audience to explore open-source projects and frameworks for implementing Transformers in their time series analysis.

50:14

Sequence Modeling in Time Series

The speaker addresses the question of how sequence modeling applies to time series data, clarifying the difference between traditional sequence-to-sequence models like LSTMs and the approach taken by Transformers. He explains that Transformers process all data points simultaneously, allowing for a holistic understanding of the data sequence, which is different from the step-by-step processing of LSTMs.


Keywords

Transformers

Transformers are a type of deep learning architecture initially designed for natural language processing tasks. In the context of this video, they are being discussed for their application in time series analysis. The architecture is known for its attention mechanism, which allows it to weigh the importance of different parts of the input data. The video explores how this can be adapted for time series data, which is a sequence of data points collected or recorded at regular time intervals.

Time Series

A time series is a sequence of data points indexed in time order. It is used to analyze trends, seasonality, and other patterns in data over time. In the video, the speaker discusses the challenges of applying Transformer models to time series data, which requires capturing dependencies between data points in a sequence.

Attention Mechanism

The attention mechanism is a feature of the Transformer architecture that allows the model to focus on certain parts of the input data while ignoring others. It is particularly useful for handling sequences where the relationship between elements is not uniform. In the video, the speaker explains how the attention mechanism can be utilized to detect both short-term and long-term dependencies in time series data.
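
A minimal sketch of the scaled dot-product attention at the core of this mechanism; note that every step is compared against every other step, which is where the quadratic cost in sequence length discussed in the video comes from.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)       # how much each step attends to every other step
    return weights @ v, weights

q = k = v = torch.randn(2, 96, 64)
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)   # torch.Size([2, 96, 64]) torch.Size([2, 96, 96])
```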

Position Encoding

Position encoding is a technique used in Transformer models to incorporate the order of elements in a sequence. Since Transformers do not inherently consider the order of data, position encoding is added to the input to help the model understand the sequence's structure. In the context of time series, this is crucial for maintaining the temporal order of data points.
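
For reference, a sketch of the sinusoidal position encoding from the original "Attention Is All You Need" paper, which is the starting point the time series variants adapt; the shapes are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    i = np.arange(d_model)[None, :]                         # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                   # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])                   # odd dimensions
    return pe

pe = positional_encoding(seq_len=96, d_model=64)
print(pe.shape)   # (96, 64) - added element-wise to the value embeddings
```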

Informer

Informer is a variant of the Transformer architecture specifically designed for time series forecasting. It includes modifications to the attention mechanism and position encoding to better handle the temporal nature of time series data. The speaker discusses Informer as one of the main architectures used in their research for time series analysis.
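
A hedged sketch of the kind of input embedding Informer-style models use, combining a value projection, a positional encoding, and learned embeddings for calendar features such as month, weekday and hour; the class and layer names below are illustrative, not Informer's actual code.

```python
import torch
import torch.nn as nn

class TimeSeriesEmbedding(nn.Module):
    def __init__(self, n_features, d_model):
        super().__init__()
        self.value = nn.Linear(n_features, d_model)   # project the raw values
        self.month = nn.Embedding(13, d_model)        # calendar features as
        self.weekday = nn.Embedding(7, d_model)       # learned embeddings
        self.hour = nn.Embedding(24, d_model)

    def forward(self, x, month, weekday, hour, pos_enc):
        # x: (batch, seq, n_features); month/weekday/hour: (batch, seq) integer tensors
        return (self.value(x) + self.month(month) + self.weekday(weekday)
                + self.hour(hour) + pos_enc)

emb = TimeSeriesEmbedding(n_features=5, d_model=64)
x = torch.randn(2, 96, 5)
month = torch.randint(1, 13, (2, 96))
weekday = torch.randint(0, 7, (2, 96))
hour = torch.randint(0, 24, (2, 96))
pos = torch.zeros(2, 96, 64)
print(emb(x, month, weekday, hour, pos).shape)   # torch.Size([2, 96, 64])
```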

Spacetimeformer

Spacetimeformer is another architecture derived from Transformers, tailored for time series analysis. It focuses on capturing relationships between features and their temporal positions. The speaker compares Spacetimeformer with Informer, discussing their different approaches to representing data and their performance in time series forecasting.
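
A hedged sketch of the input representation described here: instead of one token per time step that packs all features together, build one token per (time step, feature) pair so attention can relate both axes. Only the reshaping step is shown, not the full model.

```python
import torch

batch, seq_len, n_features = 2, 96, 5
x = torch.randn(batch, seq_len, n_features)

# One scalar value per token, tagged with the feature and time step it came from.
values = x.reshape(batch, seq_len * n_features, 1)            # (2, 480, 1)
feature_id = torch.arange(n_features).repeat(seq_len)         # 0,1,2,3,4,0,1,...
time_id = torch.arange(seq_len).repeat_interleave(n_features) # 0,0,0,0,0,1,1,...

print(values.shape, feature_id.shape, time_id.shape)
# Each of the 480 tokens then gets value + feature embedding + time embedding,
# and attention runs jointly over all of them.
```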

Sequence to Sequence

Sequence to sequence refers to a type of model that takes one sequence as input and produces another sequence as output. This is common in tasks like machine translation. In the context of time series, sequence to sequence models can be used to predict future data points based on past sequences.
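
A minimal sketch of an LSTM-based sequence-to-sequence forecaster of the kind the talk mentions: an encoder compresses the past window, and a decoder unrolls step by step to emit the future window. All sizes are illustrative.

```python
import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    def __init__(self, n_features=1, hidden=32, horizon=36):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.decoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, past):                  # past: (batch, steps, n_features)
        _, state = self.encoder(past)         # summary of the input window
        step = past[:, -1:, :]                # start from the last observation
        preds = []
        for _ in range(self.horizon):
            dec_out, state = self.decoder(step, state)
            step = self.out(dec_out)          # next predicted value
            preds.append(step)
        return torch.cat(preds, dim=1)        # (batch, horizon, n_features)

model = Seq2SeqForecaster()
print(model(torch.randn(8, 100, 1)).shape)    # torch.Size([8, 36, 1])
```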

Latency

Latency in the context of the video refers to the delay or time it takes for a service or system to respond to a request. The speaker uses latency as an example of a time series data point that can be predicted using Transformer-based models.

Optimization

Optimization in machine learning refers to the process of improving a model's performance by adjusting its parameters or architecture. The speaker discusses the need for optimization in Transformer models for time series data, particularly in the attention layer to reduce computational complexity.

Community

The community in the context of the video refers to the collective of researchers, developers, and practitioners who contribute to the advancement of a particular field, such as time series analysis using Transformers. The speaker emphasizes the importance of community involvement in driving innovation and sharing knowledge.

Highlights

Ezequiel Lanza discusses his research on using Transformers for time series analysis.

The talk is based on Lanza's thesis work, sharing his experiences, frustrations, and insights into the usefulness of Transformers for time series.

Transformers were initially designed for language translation but have since been adapted for various applications, including image generation and text-to-image synthesis.

The challenge with time series is determining the usefulness of Transformers, as they have not been extensively applied in this domain.

Lanza introduces the concept of Transformers 101, explaining the architecture and its key components relevant to time series.

The importance of input representation, embeddings, and the adaptation of the encoder and decoder for time series is emphasized.

The self-attention mechanism in Transformers is explained, highlighting its role in capturing relationships between elements in a sequence.

Multi-headed attention allows the model to focus on different parts of the sequence simultaneously, capturing both short-term and long-term dependencies.

The limitations of traditional time series models like ARIMA and RNNs are discussed, including ARIMA's inability to capture non-linear dependencies and RNNs' difficulty with long-term relationships.

Lanza presents two main architectures for time series: Informer and Spacetimeformer, chosen for their open-source availability and practicality.

Informer and Spacetimeformer have been optimized to reduce computational complexity, making them more suitable for time series analysis.

The use of position encoding and attention module modifications in Transformers for time series is highlighted as crucial for capturing temporal relationships.

Lanza shares his personal use case of predicting microservices latency, demonstrating the application of Transformers in a real-world scenario.

The importance of community involvement and collaboration in advancing the state of the art for Transformers in time series is emphasized.

Lanza concludes that while Transformers show promise for time series, more work is needed in optimization and data representation to make them more effective.

The presentation encourages the audience to explore and test Transformers for their specific time series problems, as the best model can vary depending on the use case.

Lanza suggests the use of frameworks and APIs for time series analysis, such as TSI, to simplify the process of implementing and testing Transformer models.

Transcripts

play00:00

well thanks for coming for today's

play00:02

Transformers for time series my name is

play00:05

Ezequiel Lanza I am AI open source

play00:07

evangelist working for for Intel and

play00:11

basically what I would like to explain

play00:12

in this I would like to share with you

play00:14

in this talk it's my thesis work I've

play00:19

been doing some research about

play00:21

Transformers and how you can use

play00:23

Transformers for Time series

play00:25

the main idea is to share my my

play00:28

experience or what I did or my

play00:30

frustrations and what I think that can

play00:33

be used or what cannot be useful

play00:35

first of all I would like to this is the

play00:38

agenda I'd like to talk about a light

play00:41

explanation about Transformers 101 in

play00:43

case you are not aware about

play00:45

Transformers

play00:46

how is the architecture and I would like

play00:48

to I will not go in the details of

play00:50

course but just I would like to explain

play00:52

which are the most important parts of

play00:54

the Transformers that can be useful for

play00:56

Time series

play00:58

after that I will explain something

play01:00

about okay we have the Transformers how

play01:02

we can use it for Time series

play01:04

to main after that I will explain two

play01:07

main architectures that is Informer and

play01:10

space-time former we have tons of

play01:13

architecture there I decided to use

play01:15

these two main architectures for a

play01:18

reason so I will I will share later

play01:22

the use case why how I tested this

play01:25

implementation in a real use case

play01:28

and some conclusions so I hope you can

play01:31

enjoy it and it will be personal so any

play01:35

questions that you can have you can ask

play01:37

for the microphone and you can ask it

play01:39

so Transformers

play01:41

[Music]

play01:42

um

play01:43

as you may probably know everything

play01:45

started in 2017 with the paper attention

play01:48

is all you need and we have the vanilla

play01:51

Transformer at the beginning that it's

play01:52

basically an architecture to translate

play01:55

from English to French I think so it's

play01:58

basically the idea is to have to have a

play02:01

phrase in one language and had a

play02:03

translation to other language

play02:06

but it was back in 2017 what we have

play02:09

today is that

play02:12

to adapt or to use the same architecture

play02:14

that was really useful for that

play02:16

implementation

play02:18

we we had some transformations of the

play02:21

original Transformer right so just to

play02:23

say an example we have for instance GPT

play02:26

as you may probably heard is an

play02:29

adaptation from open AI that they build

play02:31

they did some modifications to the

play02:34

Transformer they have the gpt-1 the gpt2

play02:37

GPT-3 3.5 and 4

play02:40

and also for DALL·E for instance is the

play02:44

value of Stable Diffusion here on the

play02:46

bottom uh it's if you like to write a

play02:49

text and you would like to get an image

play02:50

you may use a Transformer power to

play02:53

Transformer adapted with other parts

play02:55

neural networks or cnns and so on to get

play02:59

the same result so

play03:01

but what we don't have yet is we we have

play03:04

nothing related with time series yet so

play03:07

it's okay for language it's okay for

play03:10

images or images plus languages but for

play03:13

time series we are in a moment and we

play03:16

don't know if it's so useful

play03:18

it can be useful cannot be useful but

play03:20

it's not clear what is the state of the

play03:22

art we do have a lot of research that

play03:25

you can find but it's not clear today

play03:28

actually

play03:30

just to explain

play03:32

just to give an example for instance

play03:34

this is how a stable diffusion it's

play03:37

built you have

play03:39

a text that you would like to get an

play03:41

image at the end of the of the inference

play03:45

we have the Transformer which is the the

play03:47

first part is using his transfer it's

play03:50

adapting the text to a text embeddings

play03:53

to a part of our representation and this

play03:56

representation goes to another Network

play03:59

that is the unit it's basically a CNN

play04:02

architecture right but what I wanted to

play04:04

Showcase here without going in detail is

play04:06

that they are they pick the Transformers

play04:09

architecture and they created something

play04:12

with images and the same concept is the

play04:16

concept that we may find with the

play04:18

different Scopes could be time series

play04:20

could be computer vision or it could be

play04:22

in any other case

play04:24

so it was so impressive the performance

play04:27

that the Transformers give is that most

play04:30

people will like to use Transformers and

play04:32

this is and these are at least what they

play04:35

what they've seen is that

play04:36

they would like to use the Transformers

play04:38

even if it's not the best option which

play04:41

it could be the best option but you have

play04:44

a lot of constraints that you need to

play04:45

keep in mind when you are using some

play04:47

kind of of Transformers

play04:52

we

play04:54

there are multiple parts

play04:56

of course I I won't go in in all the

play04:58

details but for time series we should be

play05:01

aware of three main parts the first part

play05:04

it's how you represent your input

play05:07

how you use the embeddings or how you do

play05:09

your embeddings

play05:10

and how you use or how you adapt the

play05:13

encoder and the encoder and the decoder

play05:16

multi-head self-attention okay this is

play05:18

something that I will explain later so

play05:21

it's basically focusing these framing

play05:23

parts of course you have a lot of

play05:24

different parts of the architecture that

play05:26

is you have a feed forward layer and so

play05:28

on but it's not in the in the main

play05:31

I mean it's it's not relevant in the

play05:33

topics over time series right and just

play05:37

to explain

play05:39

I will start explaining about how the

play05:42

vanilla Transformer was created just to

play05:45

give you an idea okay if this is how it

play05:47

works for text or for language how we

play05:50

can adapt that for for time series right

play05:54

so let's suppose that we have a phrase

play05:55

that is I love dogs we have three three

play05:58

words and we need to represent that in a

play06:02

different way because if you put the

play06:03

data or if you put the letters in the

play06:06

model the model is not able to

play06:08

understand letters right you need to

play06:10

convert

play06:12

the words in numbers or vectors or a

play06:15

representation that a machine can

play06:17

understand right

play06:19

so how can you do that for text which is

play06:22

pretty good we have another algorithm

play06:24

that is word2vec that it basically

play06:27

does is

play06:28

we we can represent the data between I

play06:31

don't know one and five thousand or one

play06:34

and two thousand which could be all the

play06:36

letters all the phrases that you can

play06:37

find in English for instance but it will

play06:41

be it wouldn't be enough because we need

play06:43

to represent

play06:45

the information that we have in the text

play06:47

with some meaning we need to provide a

play06:49

meaning to the model so they can

play06:51

understand so this is why we use word2vec

play06:53

what word2vec does is

play06:56

that

play06:57

it gives you a representation for

play06:59

instance if you would like to represent

play07:00

the word King and the word Queen

play07:04

and you would like to measure the

play07:06

distance between King and Queen it will

play07:08

be the same distance between men and

play07:10

women because queen of course I mean

play07:13

it's obvious but it's not just a number

play07:16

between one and five thousand one and so

play07:19

twenty thousand or whatever it's

play07:21

bringing information to the model and

play07:24

this is what word2vec does it's a

play07:26

model that's already trained and you can

play07:28

download and you can use it in every use

play07:30

case for instance you know in your in

play07:32

your case and what I did is they picked

play07:35

this word2vec and they use it to

play07:37

represent the input

play07:40

and this is what they did and they

play07:42

represented this is just random numbers

play07:45

right it's not the real numbers but they

play07:48

they decide or they picked a dimension

play07:51

let's say 512 and they have for each

play07:54

word you have a vector of 512 of the

play07:57

dimension is 512 right and they have for

play08:01

I for love for dogs or whatever the

play08:03

phrase is

play08:05

but there is another additional problem

play08:07

is that

play08:08

we can have the the order that the words

play08:11

has in a phrase It's also pretty

play08:14

important so we need to embed

play08:16

what we are doing here is we are

play08:18

creating embeddings which is our

play08:19

representation of the letters but we

play08:23

also need to provide more context more

play08:26

information now we provide with the

play08:29

words back we are providing just the

play08:32

meaning between each word

play08:34

it's King similar to Queen and so on but

play08:38

we need to be aware of the position of

play08:40

each word in the phrase so

play08:43

they decided a model they decided on a

play08:45

function it doesn't matter the details

play08:47

of the function is a mathematical

play08:48

function what it basically does is okay

play08:50

when you do that you are also adding to

play08:54

the same embedding the same

play08:55

representation

play08:57

you are adding information about the

play08:59

position so it's not the same to use the

play09:02

word dog or another word in different

play09:06

parts of the of the phrase right so

play09:09

doing that you are representing that

play09:11

this is the first part so you probably

play09:13

May imagine that for time series okay I

play09:16

should do something similar for time

play09:17

series of course it's not a word so back

play09:19

because it's numbers but it could be

play09:22

probably important in the future right

play09:24

when I will be explaining Time series

play09:28

and that's it and we have for each

play09:30

phrase for each word we have a vector

play09:34

with information

play09:38

and now we have the encoder and the

play09:39

decoder part and without going in the

play09:42

mathematical details basically what a

play09:45

self-attention does is when you're

play09:46

training a model

play09:48

you need to attract you need to detect

play09:52

the relationship between one word and

play09:55

the other words so in that case I would

play09:58

like to have this embedding with

play09:59

information for instance and I would

play10:01

like to get information to say okay how

play10:03

is children related with playing or how

play10:06

is children related with Park

play10:08

so the representation that I will get

play10:10

and this is what they call attention

play10:11

because what they want is with this

play10:14

layer is let the model just be focused

play10:18

on the parts that are relevant on the

play10:20

phrase because the phrase could be 2000

play10:23

words or 200 words

play10:25

and you need to let the model know okay

play10:28

pay attention to this particular part

play10:31

instead of paying attention to the other

play10:32

part right

play10:35

how do you do that

play10:37

and of course with a mathematical

play10:39

operation but I just wanted to explain

play10:42

this part

play10:43

which is also important for the for

play10:46

explanation is that they have free

play10:48

weights it could be

play10:50

the queries the quiz the keys and the

play10:54

values and what I do is with this

play10:56

initial embedding that we calculated

play10:57

that we have the information of the word

play10:59

and the position

play11:01

we multiply it by by the query oh sorry

play11:06

by the query by the values but the most

play11:08

important part is that you multiply the

play11:11

value by all the others volume so if you

play11:14

have a full phrase

play11:16

a forward phrase

play11:18

when you are calculated embedding for

play11:20

the the essential layer for the first

play11:23

word we are multiplying the first word

play11:25

by the second the first word by the

play11:28

third by the fourth and so on and we are

play11:30

doing the same with second word then

play11:32

multiplying the second word with the

play11:34

first one with the second one and so on

play11:36

uh you get in score this score is weight

play11:39

and it's again put it in one embedded

play11:43

representation right

play11:46

um and this is how the self-attention

play11:48

works in a very high level right but

play11:51

it's important to explain here that

play11:54

if you have a very long phrase you can

play11:57

imagine that the time the time that it

play11:59

takes to calculate the attention could

play12:02

be pretty high I mean it could take a

play12:04

lot of time to calculate this this

play12:07

Matrix is calculations

play12:11

if we do that just once what we can do

play12:13

is

play12:15

with just one head what they used to

play12:17

call the head or the multi-headed

play12:18

tension with one head will allows you

play12:21

with allow the phrase to just be

play12:24

focusing in one part of the phrase

play12:26

but probably for instance like in the

play12:28

case the word the animal didn't cross

play12:31

the street because it was too tired and

play12:35

if I want to calculate the attention for

play12:36

it probably it it's related not only

play12:40

with

play12:41

tire but it's also related with animal

play12:43

so we need to be able to capture these

play12:46

two relationships between each and

play12:48

animal and tire because the model needs

play12:51

to have more information to make a

play12:54

decision in the future right so this is

play12:56

why instead of using

play12:58

the same thing that I that I showed in

play12:59

the in the past in one head you use

play13:02

multi-head attentions it could be seven

play13:04

six eight I mean it's a it's a parameter

play13:07

that they decided when they are building

play13:09

the the architecture but the main idea

play13:12

is I would like to to capture for

play13:14

instance short-term dependencies but I

play13:17

will also took up to long-term

play13:18

dependencies in the in the phrase

play13:22

and it's the same architecture once you

play13:23

have the the encoder the decoder works

play13:26

in the same way

play13:27

so once you have all the vector

play13:30

representations or the embeddings for

play13:31

each work

play13:32

after attention and after the embedded

play13:35

position and so on you will put all the

play13:38

words in a matrix and this will be your

play13:40

Matrix for the encoder

play13:42

once you have that information you do

play13:44

the same with the decoder with the

play13:46

difference is that the encoder was

play13:48

trained with input data which is in this

play13:51

case English language and the decoder

play13:55

was trained with the

play13:57

with French with French language so it

play14:01

basically does the same thing but

play14:02

instead of paying attention on one

play14:04

language he's paying attention on the

play14:05

other language

play14:06

but it's using the input that it's

play14:09

already converted for attention and so

play14:12

on

play14:13

as an input for the decoder this part

play14:17

could be a bit complicated tool to

play14:19

understand but this is what it it works

play14:22

you do the same process the attention

play14:24

process the multi-attention here had

play14:26

attention layers

play14:27

either for encoder for decoder so it's

play14:31

exactly the same how it works at the end

play14:33

is

play14:35

every time it makes predictions for

play14:37

instance in that case the input will be

play14:39

this phrasing in French which has I have

play14:42

no idea about French but this is the

play14:44

phrase and once it is encoded and when

play14:48

embedded and so on this input will go to

play14:51

the decoder and the decoder will try to

play14:55

predict word by word in English which is

play14:58

the targeted language in that case so we

play15:00

predict for instance I

play15:02

the second step on the second round will

play15:05

be okay the input for the decoder will

play15:06

be the word I and all the embeddings

play15:08

that I have from the encoder and he

play15:11

predicts the second one and

play15:14

he keeps doing the same until he gets

play15:16

the end of sequence word

play15:18

when it gets the end of sequence the

play15:20

simulation is done right so

play15:23

just to give you a really high level

play15:25

idea about how it works every time

play15:28

everything if we we can explain very

play15:31

detailed but it would take a lot of time

play15:32

to to explain it but just to give you an

play15:35

idea what are important parts of the

play15:36

Transformers the initial Transformer

play15:39

right

play15:40

how it can be used for Time series

play15:44

firstly we need to talk about what is a

play15:47

Time series we need to Define time

play15:48

series right just in case if you're not

play15:50

aware

play15:52

um time series is the point that I have

play15:54

in the time zero has a dependency from

play15:58

the previous points could be five points

play16:00

could be six points ten points or

play16:01

whatever but this relationship is really

play16:04

important to capture because if you if

play16:06

you if you can capture the relationship

play16:09

that you can get between the previous

play16:10

points and the point that you would like

play16:12

to predict in the future or in the

play16:15

present and

play16:17

could be pretty complicated so

play16:19

this is the main difference within

play16:21

within time series and other kind of

play16:24

problems because we can think that it's

play16:27

it's like a language so you say

play16:30

if you have a phrase you can say okay

play16:32

the phrase at the end of the phrase has

play16:34

a dependency on the on the words for

play16:36

example for example I was

play16:40

I don't know I was I don't know for

play16:42

instance the word no has a dependency

play16:44

for bi or I or even worse

play16:49

but in language and this is why it's a

play16:52

bit complicated to use it with

play16:53

Transformers

play16:55

it doesn't really matter if the order is

play16:58

extremely the same order because in text

play17:01

for instance if one word is in a

play17:03

different position the model should be

play17:05

able to get the message or to get the

play17:07

idea of what you are writing

play17:09

so and this is what the Transformer

play17:11

doesn't it does it pretty well but with

play17:14

time series I need to keep

play17:16

I need to be strict with the order of

play17:18

the time series

play17:20

and so we can imagine so we have a model

play17:23

that is Transformer that works pretty

play17:24

well with sequences or what pretty well

play17:27

with phrase so we can imagine that it

play17:28

can be good with time series with kind

play17:32

of similar or we can think that as a

play17:34

similar

play17:36

how are we solving those problems now

play17:38

and we have analytics approach or

play17:41

classic approach that is arima or Auto

play17:45

regressive models that I'm not saying

play17:48

that they are not useful but for some

play17:50

problems that happens

play17:52

in the past or mostly

play17:55

you need to have a deep understanding of

play17:58

the time series you need to understand

play17:59

the trend you need to understand the

play18:02

seasonality and the residual values and in

play18:06

base of that once you do this analysis

play18:08

of your series you build the arima model

play18:11

right and even if you can of course it

play18:15

would depend if your series is not

play18:17

seasonal or if you can't attack the

play18:19

season it could be pretty complicated to

play18:22

write an arima model

play18:24

and the main problem is that even if you

play18:26

can do it

play18:28

it's only able to detect linear

play18:30

dependencies some cases if you have a

play18:33

multi amount of feature problem

play18:35

multivariate problem

play18:37

the relationship between the features is

play18:39

not linear

play18:41

so if you have a problem with a linear

play18:43

dependencies between the features

play18:46

pretty good you can use it but most of

play18:49

the times the relationship between the

play18:50

features is not linear

play18:54

so 20 years ago or 15 years ago with the

play18:57

neural networks explosion they said okay

play19:00

let's try to use feed forward layers feed

play19:02

forward networks which is probably the

play19:04

same so

play19:06

you need to build a model from scratch

play19:08

and you will say okay I will this

play19:10

network for instance has a strong

play19:13

dependency from the six

play19:15

yes the six previous data points so

play19:19

every time I need to predict the next

play19:20

data point I will need just to see the

play19:24

previous six data points the six data

play19:26

points

play19:27

so I built a model in that way and this

play19:29

model is able to get non-linear

play19:32

dependencies

play19:34

but they do have the same problem but I

play19:36

do have the same problem that is just

play19:38

focus on this part so I'm hand crafting

play19:41

the solution for this particular problem

play19:44

and probably if it gets bigger it could

play19:49

get a problem to use different forward

play19:51

layers because the weights and so on so

play19:54

I don't want to know in details but and

play19:56

they say okay we have the recurrent the

play19:58

RNN the recurring neural networks which

play20:01

is

play20:03

you feed for instance the time series

play20:05

with 10 points or seven points or the

play20:08

points that you think that are useful

play20:09

and the neural network the RNN it's able

play20:13

to memorize the importance of the

play20:16

previous data points so which is pretty

play20:19

different than the feed forward when we

play20:20

are just

play20:21

calculating the weight in that case we

play20:24

have cells that they they are able to

play20:27

memorize which part of the series is

play20:29

important

play20:31

but we still have a problem with that is

play20:33

that if the series is pretty long

play20:35

this weight will be going lower and

play20:38

lower and lower and lower so if we have

play20:41

a series with

play20:42

which we need to pay attention for the

play20:44

previous

play20:45

20 data points

play20:47

they will probably won't be able to

play20:48

detect very well the last 20 data points

play20:53

because there is a

play20:55

vanishing gradient the

play20:59

gradient is going it's

play21:02

vanishing so the model's not able to detect

play21:06

pretty well the longer dependencies so

play21:09

it's good if your series is just it

play21:12

depends on the true

play21:14

earlier data points but if we we are

play21:18

talking about okay I would like to

play21:19

capture long dependencies this is not a

play21:22

good option

play21:23

and

play21:24

LSTM that LSTM is pretty used

play21:28

today even with some use cases I found

play21:30

that it's pretty useful to use lstm it's

play21:33

pretty easy to to build them it's pretty

play21:36

easy to use it

play21:37

and which is the same concept as RNN but

play21:41

the cells they have memories and you can

play21:45

cut down and you can shut down each cell

play21:48

so it allows you to remind

play21:52

long sequences because you are not using

play21:55

the same gradient every time that you

play21:57

are updating your your weights as it

play22:00

happens with the RNN so lstm for those

play22:03

cases

play22:04

could be pretty useful and

play22:06

even today if you use those cases in

play22:09

some cases lstm could be a very good

play22:12

competitor Transformers in but I don't

play22:16

want to go in at the end of the

play22:17

conversation right and this is another

play22:19

option that instead of just feeding your

play22:23

network

play22:24

you can use the same thing lstm through

play22:27

sequence to sequence which is I would

play22:29

like to represent my data

play22:31

I would like to fit my data in an

play22:34

encoder I would like to get the features

play22:36

of the input this feature will be the

play22:38

input for the decoder as the Transformer

play22:41

explanation that I did before and once

play22:43

you have the decoder

play22:45

um

play22:46

trained you will get the output so it's

play22:49

a way to try to improve the lstm

play22:52

performance and sometimes it works

play22:55

sometimes it doesn't but it depends on

play22:58

the use case

play23:01

so now we can imagine that the

play23:02

Transformers can be useful because the

play23:04

Transformers can detect short-term

play23:07

dependencies as I said with the

play23:08

multi-head attentions you can have a

play23:10

head attention that is for only focusing

play23:13

on one part of the of the text

play23:16

and the other one will be able to detect

play23:18

long-term dependencies

play23:20

and just to give you an idea and how how

play23:24

well it works with long dependencies

play23:27

you probably used ChatGPT of course and

play23:32

the phrases that ChatGPT creates it's

play23:35

using GPT-n underneath of course they are

play23:38

pretty long I mean they are not just

play23:40

phrases of five words

play23:43

they have found hundreds of words and

play23:46

they have a meaning

play23:48

so this is what this is why we can

play23:50

imagine that hey if we we can probably

play23:53

predict 1000 points data points in the

play23:56

future

play23:57

this is why we think but probably yes

play23:59

probably not

play24:00

but we can imagine that can be useful

play24:04

there is a big issue that is as I said

play24:07

at the beginning is that

play24:09

the layer the attention layer it

play24:12

multiplies each timestamp or each input

play24:15

token or each word if we would like to

play24:18

go back to the NLP case and it

play24:20

multiplies against all the others

play24:23

words in the phrase

play24:25

so as long as we

play24:27

keep the same amount we can say okay the

play24:30

time is okay but if I

play24:33

keep increasing the input data size

play24:36

the quadratic error or the quadratic

play24:38

time that it would take to process this

play24:40

it's exponentially growing

play24:42

so we can have a problem here is it can

play24:46

be more a computational problem

play24:48

or a problem that can give you or can

play24:51

take a lot of time to process the the

play24:53

answer

play24:54

it's not a huge problem because it it

play24:56

works but as you may see the quadratic

play25:00

error grows exponentially right

play25:05

what is the work that have been done

play25:07

with Transformers them

play25:08

there is a survey there is a paper from

play25:10

2020 name free

play25:13

and they did a survey and they say okay

play25:15

these are the Transformers

play25:17

and if you like to use it for time

play25:19

series you need to be aware basically in

play25:22

two main things

play25:24

we need to do Network modifications in

play25:26

the architecture and we need to be

play25:28

focusing on the positioning coding it

play25:30

means how we represent the data in the

play25:34

input and the attention module which is

play25:37

basically to reduce this complexity or

play25:39

this time of complexity to to make it

play25:42

work

play25:43

and of course is the it depends on the

play25:46

applications could be for for for

play25:48

forecasting anomaly detection or

play25:50

classification but the main modification

play25:53

that you will find for for transformance

play25:55

are position encoding and attention

play25:57

module

play26:01

you can try to use I did some

play26:03

experiments but you can try to use the

play26:05

vanilla

play26:07

encoding which is the explanation that I

play26:10

said at the beginning

play26:12

you can use probably word2vec with

play26:14

phrases or Time2Vec it's

play26:16

something similar to word2vec and

play26:18

you can use the same thing but the

play26:19

problem is

play26:21

that it's not able to detect

play26:24

or to fully explode the import the

play26:26

importance of the features because as I

play26:29

said it's an architecture that is

play26:31

designed to

play26:33

find relationship but not strictly

play26:35

strictly attached to the order so you

play26:38

need to find something that will help

play26:41

you on that part

play26:42

in 201 they started a lot of research

play26:45

and they showed that okay if this is

play26:48

embedding of this representation that we

play26:50

are using if we do it the learnable

play26:52

during the time

play26:54

it could help and the other part most in

play26:57

the future they say okay let's try to

play26:59

embed it let's try to to make embeddings

play27:02

with timestamp

play27:04

so we like to put the timestamp inside

play27:07

the embedding and this is how you get

play27:10

Informer Autoformer FEDformer lots

play27:14

of Transformers that they basically do

play27:16

similar things but they start playing

play27:19

with okay how we can change the

play27:22

embeddings in the input

play27:24

and it changed a lot to be honest oh

play27:27

sorry there is more um they are focusing

play27:31

on the input as I said but they are also

play27:34

doing some pruning on the attention

play27:37

layer or they are using something called

play27:38

ProbSparse which is a pro

play27:42

probability of attention so it doesn't

play27:46

matter it's it's a very

play27:48

it's not Advanced but they are doing

play27:49

instead of doing other calculations they

play27:52

are doing using just the likelihood

play27:54

between the most important

play27:57

and they do that this is a very good way

play27:59

to reduce the amount of time that it

play28:01

takes to calculate the attention layers

play28:05

Informer does that so

play28:08

I I decided to use Informer and

play28:12

Spacetimeformer for two main reasons and the

play28:16

first one is because is the informal

play28:19

informal and space and former they are

play28:20

the only

play28:22

architectures that are open source I

play28:24

mean you have a GitHub you can go there

play28:25

you have examples because otherwise you

play28:28

need to go in the details and you need

play28:30

to understand a lot of how the attention

play28:32

work

play28:33

I mean you need to go in the details if

play28:35

you like to use those those

play28:37

implementations Informer has a pretty

play28:39

good GitHub that you have samples you

play28:42

have how you can feel with your data and

play28:45

so on it's pretty easy and

play28:48

Spacetimeformer they also do the

play28:50

same Spacetimeformer is from

play28:53

Stanford university so it's

play28:56

pretty

play28:57

friendly to use

play28:59

how Informer works and what I said is

play29:03

okay we like to also add information

play29:07

about the week about the month about the

play29:11

holidays for instance if you like to

play29:13

capture I suppose that we have an year

play29:15

of information

play29:16

we can feed the model with this year of

play29:18

information but in the middle we have

play29:21

holidays we have months probably if your

play29:25

series is

play29:26

have a season for instance that in the

play29:29

summer is different has a different

play29:31

Behavior compared with with winter or

play29:34

with the fall you need to also give this

play29:36

information to to the model to try to

play29:38

force the model to say okay it fits

play29:41

summer

play29:42

use the weights in a different way when

play29:45

you are representing the input data and

play29:48

into the in the concept it's pretty

play29:50

similar because what you have at the end

play29:52

is an is a projection that has all this

play29:55

information embedded the the global the

play29:58

the global what we see weeks months and

play30:01

so on and the position embeddings what

play30:04

is a way to give the model or to force

play30:07

the model to understand okay this

play30:10

timestamp is the position one this time

play30:13

comes the position two three four four

play30:16

and so on

play30:18

they did the same as I said with the

play30:21

attention module they modified the the

play30:23

ProbSparse which is instead of doing the

play30:26

calculations when one against all the

play30:29

other ones he just does does a

play30:33

minor calculations based on likelihood

play30:37

it works it has a formula that it's

play30:40

pretty nice to read it but what I do is

play30:42

that thing

play30:45

the results that they get is

play30:48

it's better of course they compare with

play30:51

lstm

play30:53

and

play30:55

what it what I see really interesting is

play30:57

that

play30:58

this is a there are some benchmarks

play31:01

series

play31:02

that when you are running someone they

play31:05

are doing something with

play31:06

with time series you have benchmarks to

play31:08

use

play31:09

and they use it to predict 24 data

play31:14

points 48 168 up to 7 712 data points in

play31:20

the future so you are seeing

play31:24

360. I didn't put that but they assume

play31:27

that they are using an ear which is 360

play31:31

data points and they would like to

play31:33

predict two years in advance

play31:37

they show that it's better but what what

play31:40

is really interesting if you compare

play31:42

with lstms for instance you can see that

play31:45

if we see the long term capture the MSE

play31:48

error is 1 to 50 and within STM is 196.

play31:54

the one 960. and this distance is even

play31:59

closer if we use just 24 data points

play32:03

so the conclusion that we can see from

play32:05

here is that

play32:07

probably better if you use it for long

play32:10

term series

play32:13

um

play32:13

it's not so crazy I mean it's not so

play32:16

excellent that you can say okay with

play32:19

here we have a an MSE of two and here we

play32:22

have an MSE with Informer 0.5 0.4 I

play32:26

mean it's better it's not something

play32:28

awesome but it's it's much better

play32:30

compared with the short ready

play32:33

traditional lstms

play32:35

Spacetimeformer they try to do the

play32:39

same but the representation again is

play32:41

different

play32:42

and they say Okay instead of using this

play32:45

position encoding using wigs days months

play32:48

and so on they try to represent in a

play32:51

different way so every position every

play32:53

time stamp they say these are the

play32:56

features that are present in this

play32:58

timestamp

play32:59

same for the second timestamp the first

play33:02

timestamp so they say they represent by

play33:04

features and they put all the features

play33:06

all together and this is the main Vector

play33:09

that they put okay okay for Time Zero

play33:11

position zero for feature zero this is

play33:13

time from 0 to 10 or to whatever for

play33:16

feature two these are all the points

play33:19

that we have for Time Zero to so on

play33:21

you may imagine that it's a different

play33:23

representation

play33:25

but to explain it in a very easy way

play33:28

what they do is they representative they

play33:30

are doing is you can gather you can have

play33:33

a temporal attention if you represent

play33:35

all your inputs in one point or in one

play33:38

vector

play33:39

you are just getting the representation

play33:41

between timestamps

play33:43

but you would like to find relationship

play33:45

between timestamp but between features

play33:49

so if you embed everything in just one

play33:52

token you will lose the relationship

play33:55

between the features

play33:58

so what I say is with Informer you you

play34:01

may have this temporal attention

play34:03

it could be useful could not be useful

play34:05

but they say you may not be capturing

play34:09

all the relationship between the data

play34:12

what you can do is okay we can we can

play34:14

force a graph we can write a graph if I

play34:17

know the relationship between the

play34:18

features I can manually add a graph

play34:21

between the features

play34:23

but the problem here is is the same you

play34:25

are handcrafting something that may

play34:27

change in the in during the time it

play34:31

could not be the same relationship

play34:32

between the feature so you probably need

play34:34

to change it

play34:36

what I do is okay let's try to make it

play34:39

more open and if you see for instance

play34:42

the the lines in blue are the attention

play34:46

that the model is capturing with this

play34:48

with doing this thing with the space

play34:50

space-time former you are allowing the

play34:53

model to pay attention to features and

play34:56

time not just features or just or just

play34:59

time

play35:01

and and it works pretty well to be

play35:04

honest and but I think that the

play35:07

the thing or the concept like okay let's

play35:10

try just to or let's try to

play35:13

model it or to modulate how the the

play35:16

functions or the features Works between

play35:19

them could be pretty awesome and the

play35:23

architecture is almost the same they are

play35:25

working in optimizations in the

play35:27

attention layer they are doing some

play35:29

modification they are probably adding a

play35:32

CNN in in the middle but I mean it's not

play35:35

the main point the main point is how we

play35:37

can represent the data so the model can

play35:39

understand so we need to force that

play35:41

information

play35:44
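To make the difference in representation concrete, here is a minimal PyTorch sketch of the two input schemes: an Informer-style embedding that produces one token per timestamp, and a Spacetimeformer-style flattening that produces one token per (timestamp, feature) pair so that attention can relate features as well as time steps. The layer choices and dimensions are illustrative assumptions, not the papers' exact code.

```python
import torch
import torch.nn as nn

T, F, d_model = 36, 4, 64                  # timestamps, features, embedding size
x = torch.randn(1, T, F)                   # one series: (batch, time, features)

# Informer-style: one token per timestamp -> attention only relates time steps
temporal_tokens = nn.Linear(F, d_model)(x)              # (1, T, d_model)

# Spacetimeformer-style: one token per (timestamp, feature) pair
value_emb   = nn.Linear(1, d_model)        # embeds the scalar value itself
time_emb    = nn.Embedding(T, d_model)     # which timestamp the value belongs to
feature_emb = nn.Embedding(F, d_model)     # which variable the value belongs to

vals  = x.reshape(1, T * F, 1)                          # flatten time x features
t_idx = torch.arange(T).repeat_interleave(F)            # 0,0,0,0,1,1,1,1,...
f_idx = torch.arange(F).repeat(T)                       # 0,1,2,3,0,1,2,3,...

spacetime_tokens = value_emb(vals) + time_emb(t_idx) + feature_emb(f_idx)
print(temporal_tokens.shape)    # torch.Size([1, 36, 64])
print(spacetime_tokens.shape)   # torch.Size([1, 144, 64])  -> T*F tokens
```

The flattened sequence is F times longer, which is one reason these models also have to optimize the attention layer itself.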

These are the benchmarks I showed: better results, of course, but the picture is similar to the previous one. Once we predict further in advance, the error gets bigger, and the LSTM error gets even bigger in some cases and not so much bigger in others, so the gap between them widens a bit. Still, I would not call it excellent: we have 21.35 versus 22.11, and I think that is partly because the MSE is not normalized. 21.35 compared with 22.11 is better, again, but it is not a huge difference.

And this is what I did, or what I tried to do, in my use case. My use case was a microservices architecture where I wanted to predict the latency between the front-end service and the users. The information I have is the latency of each particular microservice, that is, the time each service takes to respond to or process a request, and I put them all together. My target is to predict the latency from the front-end service to the user. I could have selected any other service, but I chose the front end because I wanted to see the impact on the user at the end.

Just to explain what we are doing: this shows one feature, of course, but we take 100 data points from the past, the grey line, and we predict the dotted blue line. We will do a short prediction and we will also do a long prediction, because I would like to see how it works for both of them.
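One way to build those short-horizon and long-horizon experiments is to slide a window over the series, as sketched below in NumPy. The 360/36 and 360/120 context/horizon lengths match the experiments described here, while the array shapes and the choice of column 0 as the front-end latency are illustrative assumptions.

```python
import numpy as np

def make_windows(series: np.ndarray, context: int, horizon: int):
    """Slice a multivariate series into (past window, future target) pairs."""
    X, y = [], []
    for start in range(len(series) - context - horizon + 1):
        X.append(series[start : start + context])            # model input
        # target: next `horizon` values of column 0 (assumed front-end latency)
        y.append(series[start + context : start + context + horizon, 0])
    return np.stack(X), np.stack(y)

series = np.random.rand(6000, 250)             # ~250 latency/throughput features
X_short, y_short = make_windows(series, context=360, horizon=36)    # short-term
X_long,  y_long  = make_windows(series, context=360, horizon=120)   # long-term
print(X_short.shape, y_short.shape)            # (5605, 360, 250) (5605, 36)
```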

Once I use 360 data points, which is the second experiment, and I want to predict the next 36, which is a pretty short horizon, I get better results with Informer, 0.06, and again the LSTMs are not that bad. But what caught my attention is the time it takes to process a batch. Even if the Transformer is better, in some use cases you simply do not have that time to wait. It is not a lot of time, but it takes more time than the LSTM.

What directly affects this is the optimization they do in the attention layer. If they did not do that, this time could be double or triple, so they are getting there.
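To give a sense of why that attention optimization matters so much for batch time, here is a back-of-the-envelope comparison of the standard quadratic attention cost against an O(L log L) scheme such as Informer's ProbSparse attention. These are rough operation counts from textbook complexity formulas, not measurements from the talk.

```python
import math

d = 64                                        # per-head dimension (illustrative)
for L in (96, 360, 720, 1440):                # encoder lengths
    full   = L * L * d                        # vanilla attention: O(L^2 * d)
    sparse = L * math.ceil(math.log2(L)) * d  # ProbSparse-style: ~O(L * log L * d)
    print(f"L={L:5d}  full~{full:>12,}  sparse~{sparse:>10,}  ratio~{full / sparse:.0f}x")
```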

And even with the optimizations that most of these papers apply, the processing time is still much higher compared with the LSTMs; an LSTM is a pretty simple model if you want to build a network. For 120 steps ahead it is pretty similar: the same pattern, much better results for Informer, worse for the LSTMs, but the time it takes keeps growing, and again the Informer takes more time to produce a prediction than the LSTM.

This chart is from the Spacetimeformer paper, and I wanted to show it because they ran the same measurement with LSTM, with LogTrans, which is also a Transformer, with Reformer, and so on. You can see that as the encoder length grows, the time it takes gets higher and higher, and the model that always takes the least time is the LSTM, because it is simpler and easier to use. But the Informer is not that bad; it takes a bit more time, but it is not that bad.

So, conclusions.

Transformers seem to be a good solution, but I think there is a lot of work that still needs to be done, basically in optimizations and in thinking of different ways to represent the data.

Even if you pick one of these models, it is probably not the best one for you, because it depends. This is not like language, where with GPT and similar models, once you train a language model it is able to understand language; of course you fine-tune it for your particular use case, but it understands language. That does not happen with time series, because all time series are different; we are never working with the same series. So for me, using one of these architectures is one more option to try, to see whether it can be useful for my case, for my problem. It could work or it could not; you never know, you need to test it.

And the third conclusion is, of course, to get involved. What is important with Transformers and time series is that it is the community that drives the state of the art. Again, it is not the same as GPT, where you have one API and OpenAI or Google train the huge models. Here it is the community that is researching, testing, and sharing information, because you do not have a model that you can simply download and use for time series. You need to train the model with your data and see whether it works or not, so it is really up to each use case, each problem. So if you are using Informer, Spacetimeformer, LogTrans, or any other Transformer, try to be involved, try to collaborate and report back.

And we have cool projects out there, like tsai. If you want to work with time series, you probably need a framework to test things, because it is not easy to build your own LSTM, your own Transformer, your own Spacetimeformer, and so on. Frameworks that come with easy-to-use APIs really help. So if you are working with time series, I recommend these projects; tsai is a very good one, and this is the GitHub.

And that's it. Thank you so much for your time. I hope you have enjoyed it, and for any questions you have, we have the microphone. Thank you for your time.

Audience: Thank you for the presentation. Two questions. This Spacetimeformer, it is open source as well, right?

Lanza: Yeah.

Audience: Because you mentioned Informer as the open-source one.

Lanza: All of them are.

Audience: Okay.

Audience: In your use case example, for the data parameters, you basically took the latency and the timestamp, and that's it? Not the number of users, the type of the architecture, nothing? Just those two data points?

Lanza: The reality is... sorry, that was the question, right?

Audience: Yeah.

Lanza: In fact I have the P95, P99, and P98 values for each feature. I had 60 features, and I concatenated the P95, P99, and P98 series. I also have the throughput, which is another 60 data points, so I ended up with a matrix of about 250 features, if I am not wrong, by 5,000 or 6,000 timestamps.

Audience: Is it possible to add more dimensions to that data and make it much bigger, say around a million?

Lanza: Yes, exactly. And once you add that data, the matrix calculation in the attention layer gets bigger and bigger and bigger, so it takes time to train those models. This is a pretty simple use case, I mean, it is just latency.

Audience: And did you run benchmarks against it? Once you predict the future, you take the actual results. How close were they? What was the variance between the actual and the predicted values?

Lanza: The actual versus the predicted, yeah. I had an initial dataset of 10,000 points. I separated 2,000 and did not show them to the model, so I trained the models with the 8,000, and once a model was trained I tested its performance on those 2,000, with both the 36-step and the 120-step horizons.

Audience: Nice. We were talking yesterday in the monitoring session, I don't know if you were there, about exactly this: how AI might help people identify impact. Because if you can use this and say, look, based on the trends the impact is going to be higher and the acceleration is going to be higher... it is like when you drive a car with a trailer and it starts shaking: sometimes you go straight and sometimes you fly off the highway. This kind of thing could predict the impact.

Lanza: Yeah, because you are modeling something. I don't know what features you could get there, but I can think of speed, vibrations, movements, and so on. Yes, it is pretty interesting, but again, you could do the same with LSTMs.

Audience: Thanks for your talk. Do you know if anyone is working on putting confidence intervals on the model predictions, so you could see that your uncertainty grows over time as you predict further into the future, and then decide at what point your prediction is just not worth using?

Lanza: No, I am not aware of that, but I think you can apply that idea when you implement these use cases. The solution I tried actually has a model selector at the beginning. The model selector will detect, for instance, that you do not have enough data to train a Transformer, so you will probably use a regression or an LSTM. Once you accumulate more timestamps you can try to train the other models, and at some point in the process you may say, okay, now I have enough data to train a Transformer; but you may also find that it is not the best architecture for the solution. So when you implement this, you need to keep sensing the performance of the LSTMs, the Transformers, or even a regression, which could be something pretty simple.
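A minimal sketch of that model-selector idea, with made-up thresholds and names; the point is only the shape of the logic: fall back to simpler models while history is short, and let measured validation error override the default once you have evidence.

```python
from typing import Optional


def select_model(n_timestamps: int, val_scores: Optional[dict] = None) -> str:
    """Pick a forecaster family based on available history and validation error.

    val_scores: optional {model_name: validation MSE} measured on recent data;
    when present it overrides the size-based default.
    """
    if val_scores:                                   # we already have evidence
        return min(val_scores, key=val_scores.get)   # lowest validation error wins
    if n_timestamps < 500:                           # thresholds are illustrative
        return "regression"
    if n_timestamps < 5000:
        return "lstm"
    return "transformer"


print(select_model(300))                                          # 'regression'
print(select_model(8000))                                         # 'transformer'
print(select_model(8000, {"lstm": 21.35, "transformer": 22.11}))  # 'lstm'
```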

Lanza: You see those trade-offs when you implement these algorithms, because in the research projects, what you find with Informer and the others is: here is the MSE, and this is how you measure how good it is, and they all use the same benchmarks, so it is always the same comparison. But once you want to implement this in a real-world scenario, you need to monitor the performance, you need to monitor the MSE, and even with a good MSE you may not be able to detect peaks, because MSE is the mean squared error. It is a mean: if the prediction is good most of the time but you miss one peak for one second, the MSE will stay roughly the same. So if you only look at the MSE you might say everything is fine while you are actually missing the peaks, so you need mechanisms to detect those peaks and see how it is really working.

Audience: Yes.
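A tiny illustration of that MSE blind spot: two synthetic forecasts end up with the same MSE, but only a peak-sensitive number (here, the maximum absolute error, chosen purely as an example) reveals that one of them misses the latency spike entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
actual = np.full(100, 10.0)
actual[40] = 80.0                                    # one latency spike

miss_spike  = np.full(100, 10.0)                     # perfect except for the spike
track_spike = actual + rng.choice([-7.0, 7.0], 100)  # noisy, but follows the spike

for name, pred in [("misses spike", miss_spike), ("tracks spike", track_spike)]:
    mse     = np.mean((actual - pred) ** 2)
    max_err = np.max(np.abs(actual - pred))
    print(f"{name:13s}  MSE={mse:5.1f}  max|error|={max_err:5.1f}")
# Both MSEs come out at 49.0, but only max|error| exposes the missed spike.
```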

Lanza: Thank you, thank you for your question.

Audience: I really enjoyed this presentation, and I wanted to ask how the time series, the sequence-like data, is set up. When I think about training a Transformer, I think of something like the translation task: you have a dataset of not-necessarily-related sentences, where, say, a five-word sentence in the input language translates to a four-word sequence in the output language. So I would like to know how you create, or how you use, a sequence-to-sequence model on time series data, where you just have one continuous sequence of data points.

Lanza: Good question. It is a kind of sequence-to-sequence, but the difference is that when you do sequence-to-sequence with an LSTM, internally you give the model, say, 10 data points, but it processes one data point at a time, one after another. With Transformers we feed the 10 data points at the same moment, and the model works on a representation of those 10 data points, so you are not going step by step. You put the whole sequence in at once, the model calculates the attention, the relationships between the points, and that information goes to the decoder. Decoders can use it in different ways, but in Spacetimeformer, for instance, they do not produce the output one step at a time; they do the same thing as the encoder and process everything at once. So if you want to predict the next 10 timestamps, you get the 10 timestamps at once. It is sequence-to-sequence, but it is not the same concept as you would think of for LSTMs.
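A schematic sketch of that difference, shapes only and not either paper's actual code: the LSTM consumes the context step by step and is typically rolled out one prediction at a time, while the Transformer-style model attends over the whole context in one pass and can emit every horizon step at once.

```python
import torch
import torch.nn as nn

context, horizon, n_feat, hidden = 10, 4, 3, 32
x = torch.randn(1, context, n_feat)               # the past 10 timestamps

# --- LSTM: iterates over the context, then predicts one step at a time ---
lstm = nn.LSTM(n_feat, hidden, batch_first=True)
head = nn.Linear(hidden, n_feat)
out, state = lstm(x)                              # internally walks the 10 steps
step = head(out[:, -1:])                          # first predicted timestamp
preds = [step]
for _ in range(horizon - 1):                      # feed each prediction back in
    out, state = lstm(step, state)
    step = head(out)
    preds.append(step)
lstm_forecast = torch.cat(preds, dim=1)           # (1, 4, n_feat), built sequentially

# --- Transformer-style: whole context in, whole horizon out, in one pass ---
proj = nn.Linear(n_feat, hidden)
layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
readout = nn.Linear(context * hidden, horizon * n_feat)

z = encoder(proj(x))                              # attention over all 10 steps at once
tf_forecast = readout(z.flatten(1)).view(1, horizon, n_feat)  # all 4 steps at once
print(lstm_forecast.shape, tf_forecast.shape)     # both torch.Size([1, 4, 3])
```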

Audience: Is the length like a hyperparameter in that case?

Lanza: No. The important parameters in these Transformers are the query, key, and value matrices. For each attention head you have this set of matrices: three matrices for one head, three matrices for another head, and so on. What you get out is a representation; at the end you have an output which, if you look at it, probably has nothing to do with the input, because it is an embedding. It is how the model thinks the input relates to what is important, through the attention. Once you have that sequence, it is fed as the input to the decoder.

So you can think of it as sequence-to-sequence in the sense that you feed an input, you get a representation, an extraction of features, and that is what is used to decode your output. The concept is similar, but how it works is quite different in that part. You are not feeding it step by step; you are feeding everything together. With an LSTM, when you do sequence-to-sequence, if you have 10 cells, the first data point goes to the first cell, the second goes to the second cell, then to the third cell, the fourth cell, and so on, and the LSTM decides per cell whether it matters or whether to shut that cell down. Here, you feed everything together and the calculations are done at once; all of these multi-heads are calculated at once. So conceptually it is sequence-to-sequence, but it has a different behavior.
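For reference, here is a bare-bones version of the per-head query/key/value computation being described, with every head evaluated together in one batched matrix operation; the dimensions and the single-projection-per-Q/K/V layout are illustrative simplifications.

```python
import math
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 64, 4, 10
d_head = d_model // n_heads
x = torch.randn(1, seq_len, d_model)                  # embedded input tokens

# Query, key, and value projections (all heads stored in one matrix each)
W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))

def split_heads(t):                                   # (B, L, d_model) -> (B, H, L, d_head)
    return t.view(1, seq_len, n_heads, d_head).transpose(1, 2)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_head)  # (B, H, L, L) similarities
weights = scores.softmax(dim=-1)                      # attention, all heads at once
out = (weights @ V).transpose(1, 2).reshape(1, seq_len, d_model)
print(out.shape)   # torch.Size([1, 10, 64]) -- a new representation of the 10 inputs
```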

Audience: It's interesting.

Lanza: Yeah, it is pretty interesting, and it is pretty challenging to understand the details.

Audience: It is. Thank you so much.

Lanza: Thank you for your question. All right, thank you all for your time. Thank you, I appreciate it.
