Transformers for Time Series: Is the New State of the Art (SOA) Approaching? - Ezequiel Lanza, Intel
Summary
TL;DR: Ezequiel Lanza, an AI open source evangelist at Intel, presents his research on applying Transformers to time series analysis. He discusses the architecture of Transformers, their adaptation from language translation to various applications, and their potential in time series forecasting. Lanza compares the performance of Transformers with traditional models like LSTMs, highlighting the benefits and challenges of using Transformers for long-term predictions. He also emphasizes the importance of community involvement in advancing the state of the art for time series analysis with Transformers.
Takeaways
- 📈 The presentation discusses the application of Transformers in time series analysis, a concept initially developed for natural language processing (NLP).
- 🔍 The speaker, Ezequiel Lanza, shares his thesis work and personal experience with using Transformers for time series data.
- 📋 The agenda includes a brief explanation of Transformers, their architecture, and how they can be adapted for time series analysis.
- 🤖 Two main architectures for time series are highlighted: Informer and Spacetimeformer, chosen for their open-source availability and practical use cases.
- 🔧 The speaker emphasizes the importance of understanding the input representation, embeddings, and the adaptation of the encoder and decoder in Transformers for time series.
- 📊 The presentation compares the performance of Transformers with traditional time series models like ARIMA, Auto-Regressive models, and LSTMs.
- 🚀 The potential of Transformers to capture both short-term and long-term dependencies in time series data is discussed, with a focus on their efficiency and accuracy.
- ⏱️ The computational complexity of the attention mechanism in Transformers is addressed, along with modifications to improve efficiency for time series analysis.
- 🔍 The speaker's use case involves predicting latency in a microservices architecture, demonstrating the practical application of Transformers in a real-world scenario.
- 📝 The presentation concludes with a call for community involvement in advancing the state of the art for Transformers in time series analysis and the importance of testing and optimizing models for specific use cases.
- 🔗 The speaker recommends the use of frameworks like TSI for time series analysis, which can simplify the process of implementing and testing Transformer models.
Q & A
What is the main focus of Ezequiel Lanza's presentation?
-The main focus of Ezequiel Lanza's presentation is to share his research on using Transformers for time series analysis, discussing his experiences, challenges, and the potential usefulness of this approach.
What are the two main architectures for Transformers in time series that Lanza discusses?
-The two main architectures for Transformers in time series that Lanza discusses are Informer and Spacetimeformer.
How does the Transformer architecture adapt to different tasks like translation and image generation?
-The Transformer architecture adapts to different tasks by modifying the original model and combining it with other neural networks such as CNNs, as seen in GPT for text generation and Stable Diffusion for image generation.
What is the significance of the self-attention mechanism in Transformers?
-The self-attention mechanism in Transformers allows the model to focus on the most relevant parts of the input data, which is crucial for understanding the relationships between different elements in the data, such as words in a sentence or data points in a time series.
What are the challenges of applying Transformers to time series data?
-The challenges of applying Transformers to time series data include the need for careful input representation, the computational complexity of the attention mechanism, and the necessity of capturing both short-term and long-term dependencies in the data.
How does the Informer architecture address the computational complexity of the attention mechanism?
-The Informer architecture addresses the computational complexity by using a probability-based attention mechanism, which reduces the amount of calculations required, making it more efficient for handling large time series data.
What are the advantages of using Transformers for time series forecasting compared to traditional methods like ARIMA or LSTM?
-Transformers for time series forecasting can capture complex, non-linear relationships and long-term dependencies more effectively than traditional methods like ARIMA or LSTM, which may struggle with non-linear data and have limitations in handling long sequences.
What is the role of position encoding in Transformers for time series data?
-Position encoding in Transformers for time series data is crucial for providing the model with information about the order and position of data points, which is essential for capturing the temporal dependencies in time series.
How does the Spacetimeformer architecture represent time series data differently from Informer?
-The Spacetimeformer architecture represents time series data by focusing on the relationships between features and timestamps, allowing the model to attend to both time and features simultaneously, which can be more effective for certain types of time series analysis.
What are the key takeaways from Lanza's experience with implementing Transformers for time series in a microservices architecture?
-Lanza's experience highlights the potential of Transformers for time series forecasting, especially for long-term predictions, but also emphasizes the need for community involvement, continuous research, and the importance of testing and optimizing the models for specific use cases.
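The linearity limitation of ARIMA-style models discussed in the answers above can be seen in a tiny example: an autoregression fitted by least squares can recover a linear rule exactly, but it can only ever express the next value as a weighted sum of past values. The series and coefficients below are made up for illustration.

```python
# Toy sketch: least-squares fit of a linear AR(2) model on synthetic data.
import numpy as np

rng = np.random.default_rng(42)
series = [1.0, 0.5]
for _ in range(400):  # y_t = 0.6*y_{t-1} - 0.2*y_{t-2} + small noise
    series.append(0.6 * series[-1] - 0.2 * series[-2] + 0.01 * rng.normal())

y = np.array(series[2:])
X = np.column_stack([series[1:-1], series[:-2]])   # lag-1 and lag-2 columns
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
# coef lands close to the true (0.6, -0.2); a nonlinear dependency such as
# y_t = y_{t-1}**2 has no such linear representation, which is where the
# talk argues neural models (and Transformers) come in.
```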
Outlines
📝 Introduction to Transformers for Time Series
Ezequiel Lanza introduces the concept of using Transformers for time series analysis, sharing his thesis work and research on adapting the architecture for time series data. He outlines the agenda, which includes a brief explanation of Transformers, their architecture, and how they can be applied to time series. Lanza also mentions the importance of understanding the limitations and potential of Transformers in this context.
🔍 Understanding Transformers and Time Series
The speaker delves into the specifics of how Transformers can be used for time series data, emphasizing the importance of input representation, embeddings, and the adaptation of the encoder and decoder with self-attention mechanisms. He explains the concept of positional encoding and how it helps the model understand the order of data points in a time series, which is crucial for accurate predictions.
🤖 Self-Attention and Multi-Head Attention
The paragraph discusses the self-attention mechanism in Transformers, which allows the model to focus on relevant parts of the input data. The speaker explains how multi-head attention enables the model to capture both short-term and long-term dependencies within the data. He also touches on the computational complexity of attention layers and how it can be optimized for time series applications.
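A minimal single-head version of the scaled dot-product self-attention described above might look as follows. Shapes and weights are random placeholders; a real model learns the projection matrices, and multi-head attention simply runs several such heads in parallel and concatenates their outputs.

```python
# Single-head scaled dot-product self-attention in NumPy: every token is
# scored against every other token, which is also the source of the O(L^2) cost.
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (seq_len, d_model); returns attended output and the weight matrix."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: rows sum to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4
x = rng.normal(size=(seq_len, d_model))
wq, wk, wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = self_attention(x, wq, wk, wv)
```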
🕒 Time Series and Sequence Modeling
The speaker compares time series data to language data, highlighting the differences in how they are processed by models like Transformers. He explains that while language models can handle variable word order, time series models must maintain strict order due to the sequential nature of time data. The paragraph also discusses traditional time series approaches like ARIMA and their limitations, leading to the exploration of neural networks for capturing non-linear dependencies.
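The strict-order framing described above is usually implemented by slicing the series into fixed windows of past points, roughly as below; the toy sine series and the window length of 6 (echoing the talk's "six previous data points" example) are illustrative.

```python
# Framing a time series as supervised learning: each example is a window of
# past points (X) and the next point as the target (y).
import numpy as np

def make_windows(series, window=6):
    """Return (X, y): X[i] holds `window` past points, y[i] the next point."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y

series = np.sin(np.linspace(0, 10, 100))   # toy stand-in for real data
X, y = make_windows(series, window=6)
```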
🧠 RNNs, LSTMs, and Sequence-to-Sequence Models
The speaker explores the use of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks in time series analysis. He explains how RNNs can memorize the importance of previous data points but struggle with long-term dependencies due to vanishing gradients. LSTMs, with their ability to remember long sequences, are presented as a solution to this problem. The speaker also mentions sequence-to-sequence models as a way to improve LSTM performance.
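The vanishing-gradient effect mentioned above can be illustrated numerically: backpropagating through a plain RNN multiplies the gradient by roughly the same recurrent weight at every step, so long-range signals decay geometrically. The weight value below is an arbitrary illustration, not a measured quantity.

```python
# Numeric sketch of the vanishing gradient in a plain RNN.
recurrent_weight = 0.8     # |w| < 1, a typical stable recurrence
gradient = 1.0
history = []
for step in range(20):     # 20 time steps back, as in the talk's example
    gradient *= recurrent_weight
    history.append(gradient)

# After 20 steps the gradient has shrunk by roughly 99%: the model can barely
# "feel" what happened 20 points ago. LSTM gates avoid this repeated scaling.
```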
🚀 Optimizing Transformers for Time Series
The speaker discusses the challenges of using Transformers for time series, particularly the quadratic time complexity of the attention layer. He references a 2020 survey paper that suggests modifications to the Transformer architecture for time series applications, focusing on network modifications and positional encoding. The paragraph also introduces Informer and Spacetimeformer as open-source architectures designed for time series forecasting.
🌐 Positional Encoding and Attention Module
The speaker explains how positional encoding and modifications to the attention module can improve Transformer performance for time series. He mentions the use of learnable embeddings and the inclusion of timestamp information to help the model understand the order of data points. The paragraph also discusses the use of sparse attention mechanisms to reduce computational complexity and improve efficiency.
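A rough sketch of the sparse-attention idea above is: keep full attention only for the few most informative queries and give the rest a cheap default. Note this is a toy approximation inspired by Informer's ProbSparse measure, not the exact published algorithm.

```python
# Toy top-u sparse attention: only u "active" queries get full attention,
# the remaining queries fall back to the mean of V.
import numpy as np

def sparse_attention(q, k, v, u):
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Informer ranks queries by how far their score distribution is from
    # uniform; max-minus-mean is a crude proxy for that sparsity measure.
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(sparsity)[-u:]                    # u most informative queries
    out = np.tile(v.mean(axis=0), (q.shape[0], 1))     # lazy queries get mean(V)
    w = np.exp(scores[top] - scores[top].max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    out[top] = w @ v                                   # active queries: full attention
    return out

rng = np.random.default_rng(1)
L, d = 16, 4
q, k, v = (rng.normal(size=(L, d)) for _ in range(3))
out = sparse_attention(q, k, v, u=4)
```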
📊 Benchmarks and Performance
The speaker presents benchmarks that compare the performance of Informer and Spacetimeformer with LSTMs. He highlights the advantages of Transformers in capturing long-term dependencies and their performance in forecasting future data points. The paragraph also discusses the trade-off between the accuracy of predictions and the computational time required, emphasizing the need for optimization and community involvement in advancing Transformer models for time series.
🔮 Use Case: Microservices Architecture Latency Prediction
The speaker shares a personal use case where he applied Transformer models to predict latency in a microservices architecture. He discusses the data preparation process, including the selection of relevant features and the use of past latency data points for prediction. The paragraph also touches on the importance of processing time and the impact of attention layer optimizations on model performance.
🤔 Conclusions and Future Work
In conclusion, the speaker suggests that while Transformers show promise for time series analysis, there is still much work to be done in terms of optimization and data representation. He emphasizes the importance of community involvement and the need to test and adapt Transformer models for specific use cases. The speaker also encourages the audience to explore open-source projects and frameworks for implementing Transformers in their time series analysis.
📈 Sequence Modeling in Time Series
The speaker addresses the question of how sequence modeling applies to time series data, clarifying the difference between traditional sequence-to-sequence models like LSTMs and the approach taken by Transformers. He explains that Transformers process all data points simultaneously, allowing for a holistic understanding of the data sequence, which is different from the step-by-step processing of LSTMs.
Mindmap
Keywords
💡Transformers
💡Time Series
💡Attention Mechanism
💡Position Encoding
💡Informer
💡Spacetimeformer
💡Sequence to Sequence
💡Latency
💡Optimization
💡Community
Highlights
Ezequiel Lanza discusses his research on using Transformers for time series analysis.
The talk is based on Lanza's thesis work, sharing his experiences, frustrations, and insights into the usefulness of Transformers for time series.
Transformers were initially designed for language translation but have since been adapted for various applications, including image generation and text-to-image synthesis.
The challenge with time series is determining the usefulness of Transformers, as they have not been extensively applied in this domain.
Lanza introduces the concept of Transformers 101, explaining the architecture and its key components relevant to time series.
The importance of input representation, embeddings, and the adaptation of the encoder and decoder for time series is emphasized.
The self-attention mechanism in Transformers is explained, highlighting its role in capturing relationships between elements in a sequence.
Multi-headed attention allows the model to focus on different parts of the sequence simultaneously, capturing both short-term and long-term dependencies.
The limitations of traditional time series models like ARIMA and RNNs are discussed, including their inability to capture non-linear dependencies and long-term relationships.
Lanza presents two main architectures for time series: Informer and Spacetimeformer, chosen for their open-source availability and practicality.
Informer and Spacetimeformer have been optimized to reduce computational complexity, making them more suitable for time series analysis.
The use of position encoding and attention module modifications in Transformers for time series is highlighted as crucial for capturing temporal relationships.
Lanza shares his personal use case of predicting microservices latency, demonstrating the application of Transformers in a real-world scenario.
The importance of community involvement and collaboration in advancing the state of the art for Transformers in time series is emphasized.
Lanza concludes that while Transformers show promise for time series, more work is needed in optimization and data representation to make them more effective.
The presentation encourages the audience to explore and test Transformers for their specific time series problems, as the best model can vary depending on the use case.
Lanza suggests the use of frameworks and APIs for time series analysis, such as TSI, to simplify the process of implementing and testing Transformer models.
Transcripts
well thanks for coming for today's
Transformers for time series my name is
Ezequiel Lanza I am an AI open source
evangelist working for Intel and
basically what I would like to explain
in this I would like to share with you
in this talk it's my thesis work I've
been doing some research about
Transformers and how you can use
Transformers for Time series
the main idea is to share my my
experience or what I did or my
frustrations and what I think that can
be used or what cannot be useful
first of all I would like to this is the
agenda I'd like to talk about a light
explanation about Transformers 101 in
case you are not aware about
Transformers
how is the architecture and I would like
to I will not go in the details of
course but just I would like to explain
which are the most important parts of
the Transformers that can be useful for
Time series
after that I will explain something
about okay we have the Transformers how
we can use it for Time series
after that I will explain two
main architectures that is Informer and
Spacetimeformer we have tons of
architecture there I decided to use
these two main architectures for a
reason so I will I will share later
the use case why how I tested this
implementation in a real use case
and some conclusions so I hope you can
enjoy it and it will be personal so any
questions that you can have you can ask
for the microphone and you can ask it
so Transformers
as you may probably know everything
started in 2017 with the paper attention
is all you need and we have the vanilla
Transformer at the beginning that it's
basically an architecture to translate
from English to French I think so it's
basically the idea is to have to have a
phrase in one language and had a
translation to other language
but it was back in 2017 what we have
today is that
to adapt or to use the same architecture
that was really useful for that
implementation
we we had some transformations of the
original Transformer right so just to
say an example we have for instance GPT
as you may probably have heard is an
adaptation from OpenAI they did some
modifications to the Transformer and
they have GPT-1 GPT-2 GPT-3 3.5 and 4
and also DALL-E for instance and
Stable Diffusion here on the
bottom uh it's if you like to write a
text and you would like to get an image
you may use a Transformer or a
Transformer adapted with other
neural networks or CNNs and so on to get
the same result so
but what we don't have yet is we we have
nothing related with time series yet so
it's okay for language it's okay for
images or images plus languages but for
time series we are in a moment and we
don't know if it's so useful
it can be useful cannot be useful but
it's not clear what is the state of the
art we do have a lot of research that
you can find but it's not clear today
actually
just to explain
just to give an example for instance
this is how a stable diffusion it's
built you have
a text that you would like to get an
image at the end of the of the inference
we have the Transformer which is the
first part it's using a Transformer
adapting the text to a text embedding
to a vector representation and this
representation goes to another network
that is the U-Net it's basically a CNN
architecture right but what I wanted to
Showcase here without going in detail is
that they are they pick the Transformers
architecture and they created something
with images and the same concept is the
concept that we may find with the
different Scopes could be time series
could be computer vision or it could be
in any other case
so it was so impressive the performance
that the Transformers give is that most
people will like to use Transformers and
and this is at least what
I've seen is that
they would like to use the Transformers
even if it's not the best option which
it could be the best option but you have
a lot of constraints that you need to
keep in mind when you are using some
kind of of Transformers
there are multiple parts
of course I I won't go in in all the
details but for time series we should be
aware of three main parts the first part
it's how you represent your input
how you use the embeddings or how you do
your embeddings
and how you use or how you adapt the
encoder and the decoder with
multi-head self-attention okay this is
something that I will explain later so
it's basically focusing on these three main
parts of course you have a lot of
different parts of the architecture that
is you have a feed-forward layer and so
on but it's not the main
I mean it's not relevant to the
topic of time series right and just
to explain
I will start explaining about how the
vanilla Transformer was created just to
give you an idea okay if this is how it
works for text or for language how we
can adapt that for for time series right
so let's suppose that we have a phrase
that is I love dogs we have three three
words and we need to represent that in a
different way because if you put the
data or if you put the letters in the
model the model is not able to
understand letters right you need to
convert
the words in numbers or vectors or a
representation that a machine can
understand right
so how can you do that for text which is
pretty good we have another algorithm
that is Word2Vec what it basically
does is
we we can represent the data between I
don't know one and five thousand or one
and two thousand which could be all the
letters all the phrases that you can
find in English for instance but it will
be it wouldn't be enough because we need
to represent
the information that we have in the text
with some meaning we need to provide a
meaning to the model so they can
understand so this is why we use
Word2Vec what Word2Vec does is
that
it gives you a representation for
instance if you would like to represent
the word King and the word Queen
and you would like to measure the
distance between King and Queen it will
be the same distance between men and
women because queen of course I mean
it's obvious but it's not just a number
between one and five thousand or one and
twenty thousand or whatever it's
bringing information to the model and
this is what Word2Vec does it's a
model that's already trained and you can
download and you can use it in every use
case for instance you know in your in
your case and what they did is they picked
this Word2Vec and they used it to
represent the input
and this is what they did and they
represented this is just random numbers
right it's not the real numbers but they
they decide or they picked a dimension
let's say 512 and they have for each
word you have a vector of 512 of the
dimension is 512 right and they have for
I for love for dogs or whatever the
phrase is
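The King/Queen distance idea above can be sketched with made-up 3-dimensional vectors; a real Word2Vec model learns vectors of a few hundred dimensions from text, and these numbers are invented purely for illustration.

```python
# Toy embeddings: vector arithmetic reflects meaning, so the offset between
# king and queen matches the offset between man and woman.
import numpy as np

embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.8, 0.3]),
    "woman": np.array([0.1, 0.2, 0.3]),
}

# king - queen == man - woman: the same "gender direction" in vector space.
diff_royal = embeddings["king"] - embeddings["queen"]
diff_plain = embeddings["man"] - embeddings["woman"]
```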
but there is another additional problem
is that
we can have the order that the words
have in a phrase which is also pretty
important so we need to embed
what we are doing here is we are
creating embeddings which is our
representation of the letters but we
also need to provide more context more
information now with
Word2Vec we are providing just the
meaning of each word
is King similar to Queen and so on but
we need to be aware of the position of
each word in the phrase so
they decided on a
function it doesn't matter the details
of the function is a mathematical
function what it basically does is okay
when you do that you are also adding to
the same embedding the same
representation
you are adding information about the
position so it's not the same to use the
word dog or another word in different
parts of the of the phrase right so
doing that you are representing that
this is the first part so you probably
May imagine that for time series okay I
should do something similar for time
series of course it's not Word2Vec
because it's numbers but it could be
probably important in the future right
when I will be explaining Time series
and that's it and we have for each
phrase for each word we have a vector
with information
and now we have the encoder and the
decoder part and without going in the
mathematical details basically what a
self-attention does is when you're
training a model
you need to detect
the relationship between one word and
the other words so in that case I would
like to have this embedding with
information for instance and I would
like to get information to say okay how
is children related with playing or how
is children related with Park
so the representation that I will get
and this is what they call attention
because what they want is with this
layer is let the model just be focused
on the parts that are relevant on the
phrase because the phrase could be 2000
words or 200 words
and you need to let the model know okay
pay attention to this particular part
instead of paying attention to the other
part right
how do you do that
and of course with a mathematical
operation but I just wanted to explain
this part
which is also important for the
explanation is that they have three
weights which are
the queries the keys and the
values and what they do is with this
initial embedding that we calculated
that we have the information of the word
and the position
we multiply it by the query
by the key and by the value but the most
important part is that you multiply each
word by all the other words so if you
have a four-word phrase
when you are calculating the embedding in
the attention layer for the first
word we are multiplying the first word
by the second the first word by the
third by the fourth and so on and we are
doing the same with second word then
multiplying the second word with the
first one with the third one and so on
you get a score this score is a weight
and it's again put into one embedded
representation right
um and this is how the self-attention
works in a very high level right but
it's important to explain here that
if you have a very long phrase you can
imagine that the time the time that it
takes to calculate the attention could
be pretty high I mean it could take a
lot of time to calculate this this
Matrix is calculations
if we do that just once what we can do
is
with just one head what they used to
call the head in multi-head
attention one head will allow
the model to just be
focusing on one part of the phrase
but probably for instance like in the
case of the phrase the animal didn't cross
the street because it was too tired and
if I want to calculate the attention for
it probably it it's related not only
with
tired but it's also related with animal
so we need to be able to capture these
two relationships between it and
animal and tired because the model needs
to have more information to make a
decision in the future right so this is
why instead of using
the same thing that I showed in
the past with one head you use
multi-head attentions it could be seven
six eight I mean it's a parameter
that they decided when they were building
the architecture but the main idea
is I would like to capture for
instance short-term dependencies but I
will also capture long-term
dependencies in the phrase
and it's the same architecture once you
have the the encoder the decoder works
in the same way
so once you have all the vector
representations or the embeddings for
each word
after attention and after the embedded
position and so on you will put all the
words in a matrix and this will be your
Matrix for the encoder
once you have that information you do
the same with the decoder with the
difference is that the encoder was
trained with input data which is in this
case English language and the decoder
was trained with the
French language so it
basically does the same thing but
instead of paying attention on one
language he's paying attention on the
other language
but it's using the input that it's
already converted for attention and so
on
as an input for the decoder this part
could be a bit complicated to
understand but this is how it works
you do the same process the attention
process the multi-head
attention layers
either for encoder for decoder so it's
exactly the same how it works at the end
is
every time it makes predictions for
instance in that case the input will be
this phrase in French I have
no idea about French but this is the
phrase and once it is encoded and
embedded and so on this input will go to
the decoder and the decoder will try to
predict word by word in English which is
the targeted language in that case so we
predict for instance I
the second step or the second round will
be okay the input for the decoder will
be the word I and all the embeddings
that I have from the encoder and it
predicts the second one and
it keeps doing the same until it gets
the end of sequence token
when it gets the end of sequence the
translation is done right so
just to give you a really high level
idea about how it works
everything could be explained in more
detail but it would take a lot of time
to explain it but just to give you an
idea what are important parts of the
Transformers the initial Transformer
right
how it can be used for Time series
firstly we need to talk about what is a
Time series we need to Define time
series right just in case if you're not
aware
a time series means the point that I have
at time zero has a dependency on
the previous points could be five points
could be six points ten points or
whatever but this relationship is really
important to capture because if you if
you if you can capture the relationship
that you can get between the previous
points and the point that you would like
to predict in the future or in the
present and
could be pretty complicated so
this is the main difference between
time series and other kinds of
problems because we can think that it's
like a language so you say
if you have a phrase you can say okay
the phrase at the end of the phrase has
a dependency on the previous words for
example take I don't know for
instance the word know has a dependency
on I or on even earlier words
but in language and this is why it's a
bit complicated to use it with
Transformers
it doesn't really matter if the order is
extremely the same order because in text
for instance if one word is in a
different position the model should be
able to get the message or to get the
idea of what you are writing
so and this is something the Transformer
does pretty well but with
time series I need to keep
I need to be strict with the order of
the time series
and so we can imagine so we have a model
that is Transformer that works pretty
well with sequences or pretty well
with phrases so we can imagine that it
can be good with time series which is
kind of similar or we can think of it as
similar
how are we solving those problems now
and we have an analytical approach or a
classic approach that is ARIMA or auto
regressive models and I'm not saying
that they are not useful but for some
problems what happens
mostly is
you need to have a deep understanding of
the time series you need to understand
the trend you need to understand
seasonality and residual values and
based on that once you do this analysis
of your series you build the arima model
right and even if you can of course it
would depend if your series is not
seasonal or if you can't detect the
seasonality it could be pretty complicated to
write an arima model
and the main problem is that even if you
can do it
it's only able to detect linear
dependencies in some cases if you have a
multi-feature problem a
multivariate problem
the relationship between the features is
not linear
so if you have a problem with a linear
dependencies between the features
pretty good you can use it but most of
the times the relationship between the
features is not linear
so 20 years ago or 15 years ago with the
neural networks explosion they said okay
let's try to use feed-forward layers
feed-forward networks which is basically the
same so
you need to build a model from scratch
and you will say okay this
network for instance has a strong
dependency on the six
previous data points so
every time I need to predict the next
data point I will just need to see the
previous six data
points
so I built a model in that way and this
model is able to get non-linear
dependencies
but it has the same problem that is it's just
focused on this part so I'm hand crafting
the solution for this particular problem
and probably if it gets bigger it could
become a problem to use feed-forward
layers because of the weights and so on so
I don't want to go into details and
they say okay we have the recurrent the
RNN the recurrent neural networks which
is
you feed for instance the time series
with 10 points or seven points or the
points that you think that are useful
and the neural network the RNN it's able
to memorize the importance of the
previous data points so which is pretty
different than the feed-forward where we
are just
calculating the weights in that case we
have cells that are able to
memorize which part of the series is
important
but we still have a problem with that is
that if the series is pretty long
this weight will be going lower and
lower and lower and lower so if we have
a series with
which we need to pay attention for the
previous
20 data points
it probably won't be able to
detect very well the last 20 data points
because there is a
vanishing gradient problem
the gradient is
vanishing so the model is not able to detect
pretty well the longer dependencies so
it's good if your series just
depends on a few
earlier data points but if we are
talking about okay I would like to
capture long dependencies this is not a
good option
and
LSTMs an LSTM is pretty much used
today even in some use cases I found
that it's pretty useful to use LSTMs it's
pretty easy to build them it's pretty
easy to use it
and which is the same concept as RNN but
the cells have memories and you can
open and you can shut down each cell
so it allows you to remember
long sequences because you are not using
the same gradient every time that you
are updating your your weights as it
happens with the RNN so lstm for those
cases
could be pretty useful and
even today if you use those cases in
some cases lstm could be a very good
competitor Transformers in but I don't
want to go in at the end of the
conversation right and this is another
option that instead of just feeding your
network
you can use the same thing lstm through
sequence to sequence which is I would
like to represent my data
I would like to fit my data in an
encoder I would like to get the features
of the input this feature will be the
input for the decoder as the Transformer
explanation that I did before and once
you have the decoder
um
trained you will get the output so it's
a way to try to improve the lstm
performance and sometimes it works
sometimes it doesn't but it depends on
the use case
So now we can imagine that Transformers could be useful, because Transformers can detect short-term dependencies: as I said, with multi-head attention you can have one attention head that focuses on only one part of the text, while another head is able to detect long-term dependencies. Just to give you an idea of how well this works with long dependencies: you have probably used ChatGPT, and the phrases ChatGPT creates, using GPT-n under the hood, are pretty long. They are not just five-word phrases; they run to hundreds of words, and they have meaning. So this is why we can imagine: hey, maybe we can predict 1,000 data points into the future. Probably yes, probably no, but we can imagine it could be useful.
There is a big issue, as I said at the beginning: the attention layer multiplies each timestamp, each input token, or, going back to the NLP case, each word, against all the other words in the phrase. As long as we keep the input the same size, the time is fine, but if I keep increasing the input size, the time it takes to process it grows quadratically. So we can have a problem here: it can be a computational problem, or a problem where it simply takes a long time to produce the answer. It's not a huge problem, because it works, but as you can see, that quadratic cost grows fast.
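A small sketch of why this cost is quadratic (illustrative only, not any specific paper's implementation): plain self-attention builds an n×n score matrix, so doubling the sequence length quadruples the number of entries:

```python
import numpy as np

def attention_scores(x):
    """Plain dot-product self-attention scores: one row and one column
    per token, hence an n x n matrix."""
    d = x.shape[-1]
    return (x @ x.T) / np.sqrt(d)

short = np.random.randn(64, 8)    # 64 timestamps, 8-dim embeddings
long_ = np.random.randn(128, 8)   # 128 timestamps

print(attention_scores(short).shape)  # (64, 64)
print(attention_scores(long_).shape)  # (128, 128): 2x the input, 4x the entries
```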
So what work has been done with Transformers here? There is a survey paper, from around 2020, and they say that if you'd like to use Transformers for time series, you basically need to be aware of two main things: you need network modifications in the architecture, and you need to focus on the positional encoding, which means how you represent the data at the input, and on the attention module, which is basically about reducing this time complexity to make it work. Of course it depends on the application, which could be forecasting, anomaly detection, or classification, but the main modifications you will find in time-series Transformers are in the positional encoding and the attention module.
You can try, and I did some experiments with this, using the vanilla encoding, which is the one I explained at the beginning. You could use something like word2vec for phrases, or Time2Vec, which is a similar idea for timestamps. But the problem is that this is not able to fully exploit the importance of the features, because, as I said, the architecture is designed to find relationships but is not strictly attached to the order, so you need to find something that helps it on that part.
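Time2Vec, mentioned above, is simple to sketch: it maps a scalar time τ to one linear component plus sinusoidal components, where the frequencies and phases would normally be learned. Here they are fixed, so this is only an illustration of the shape of the idea, not a trained embedding:

```python
import numpy as np

def time2vec(tau, omega, phi):
    """Time2Vec: the first component is linear in time, the rest are sinusoidal.
    omega and phi would normally be learned; here they are fixed for illustration."""
    v = omega * tau + phi
    return np.concatenate([v[:1], np.sin(v[1:])])

omega = np.array([1.0, 0.5, 2.0, 4.0])  # one linear + three periodic frequencies
phi = np.zeros(4)
emb = time2vec(3.0, omega, phi)
print(emb.shape)  # (4,)
```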
Around 2021 a lot of research started, and they showed that if we make this embedding, this representation, learnable over time, it helps. And going further, they said, okay, let's build embeddings with the timestamp, let's put the timestamp inside the embedding. This is how you get Informer, Autoformer, FEDformer, lots of Transformers that basically do similar things but start playing with how to change the embeddings at the input.
And it changed a lot, to be honest. Oh, sorry, there is more: they focus on the input, as I said, but they also do some pruning on the attention layer, using something called ProbSparse, which is a probability-based attention. It's not that advanced an idea: instead of doing all the calculations, they use the likelihood to keep just the most important queries, and that is a very good way to reduce the time it takes to compute the attention layers. Informer does that.
I decided to use Informer and Spacetimeformer for two main reasons. The first is that Informer and Spacetimeformer are the only architectures that are open source; I mean, there's a GitHub, you can go there, you have examples. Otherwise you need to dig into the details and understand a lot about how the attention works if you want to use these implementations. Informer has a pretty good GitHub with samples showing how to feed in your own data and so on; it's pretty easy. And Spacetimeformer, from Stanford University, does the same, so it's pretty friendly to use.
How does Informer work? What they said is: we would also like to add information about the week, the month, the holidays. For instance, suppose we have a year of data. We can feed the model that year of data, but in the middle we have holidays, we have months; if your series has seasonality, for instance the summer behaves differently from the winter or the fall, you need to give that information to the model too, to force the model to say, okay, if it's summer, use the weights in a different way when representing the input data. Conceptually it's very similar: what you have at the end is a projection that has all this information embedded, the global time features, weeks, months and so on, plus the positional embeddings, which are a way to force the model to understand: this timestamp is position one, this timestamp is position two, three, four, and so on.
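The calendar-feature side of this can be sketched as follows. This is a hedged illustration, not Informer's actual embedding code (which projects such fields through learned embedding layers), and the holiday set is made up:

```python
from datetime import date

# Illustrative holiday set; a real pipeline would use a proper calendar.
HOLIDAYS = {date(2023, 1, 1), date(2023, 12, 25)}

def time_features(d):
    """Expand a date into simple global time features: month, day of week,
    week of year, and whether it falls on a holiday."""
    return {
        "month": d.month,
        "weekday": d.weekday(),          # 0 = Monday
        "week": d.isocalendar()[1],
        "is_holiday": d in HOLIDAYS,
    }

print(time_features(date(2023, 12, 25)))
```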
They did the same, as I said, with the attention module: they modified it with ProbSparse, which means that instead of computing each query against all the others, it does a smaller calculation based on likelihood. It works; it has a formula that is quite nice to read.
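Very roughly, that idea can be sketched like this. This is a simplified stand-in, not Informer's actual ProbSparse implementation (which estimates the sparsity measurement from sampled keys): score each query by how "peaked" its raw attention scores are, then keep full attention rows only for the top-u queries:

```python
import numpy as np

def sparse_attention_queries(Q, K, u):
    """Keep only the u 'most informative' queries, measured (crudely) by
    max score minus mean score, a stand-in for Informer's sparsity measure."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (n, n) raw scores
    measure = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(measure)[-u:]                  # indices of the top-u queries
    return top, scores[top]                         # attention rows only for those

Q = np.random.randn(32, 8)
K = np.random.randn(32, 8)
top, rows = sparse_attention_queries(Q, K, u=8)
print(rows.shape)  # (8, 32): 8 query rows instead of 32
```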
The results they get are better, of course; they compare against LSTMs. What I found really interesting is that there are benchmark series: when someone is doing something with time series, there are standard benchmarks to use. They use them to predict 24 data points, then 48, 168, and up to 720 data points into the future. I didn't put it on the slide, but they assume a year is 360 data points, so they are trying to predict up to two years in advance. They show that it's better, but what is really interesting, if you compare with LSTMs for instance, is that on the long-term horizon the MSE is about 1.5, while with the LSTM it's about 1.96, and the gap is even smaller if we use just 24 data points.
So the conclusion we can draw from this is that it's probably better if you use it for long-term series. It's not spectacular; I mean, it's not as if the LSTM gives an MSE of 2 and Informer gives 0.5 or 0.4. It's better, not something awesome, but it is clearly better than the traditional LSTMs.
Spacetimeformer tries to do the same, but again the representation is different. They say: instead of using this positional encoding with weeks, days, months and so on, let's represent it another way. For every timestamp, they list the features present at that timestamp; the same for the second timestamp, and so on. They represent by feature and put all the features together into one long vector: for time zero, feature zero, these are the values; for feature one, these; and so on for every timestamp. You can imagine it's a different representation.

To explain it in a very simple way: if you represent all your inputs for one timestamp in a single token, you get temporal attention, the relationships between timestamps. But you would also like to find relationships not only between timestamps but between features, and if you embed everything into just one token, you lose the relationships between the features.
So what they say is: with Informer you may get this temporal attention, which could be useful or not, but you may not be capturing all the relationships in the data. One thing you could do is force a graph: if I know the relationships between the features, I can manually add a graph between them. But the problem there is the same: you are handcrafting something that may change over time; the relationships between the features may not stay the same, so you would probably need to change it. What they do instead is make it more open. If you look at the figure, the blue lines are the attention the model is capturing; by doing this, Spacetimeformer allows the model to pay attention to features and time together, not just features or just time.
And it works pretty well, to be honest. I think the concept, letting the model itself work out how the features relate to each other, could be pretty powerful. The architecture is almost the same; they work on optimizations in the attention layer, they make some modifications, they add a CNN in the middle, but that's not the main point. The main point is how we can represent the data so the model can understand it; we need to feed it that information.
These are the benchmarks I showed, and of course it does better, but with something similar to the previous case: the further ahead we predict, the bigger the error gets, and the LSTM error gets even bigger in some cases and not much bigger in others, so the gap between them grows a bit. But again, it's not spectacular: we have something like 21.35 versus 22.11 (I think the values are large because the MSE is not normalized). It's better, again, but the difference is not huge.
And this is what I did, or tried to do, in my use case: a microservices architecture where I wanted to predict the latency between the front-end service and the users. The information I have is the latency of every individual microservice, meaning the time each service takes to process a request, and I put them all together. My target is to predict the latency from the front-end service to the user. I could have selected any other one, but I chose the front end because I wanted to see the impact on the user at the end.

Just to explain what we are doing, and this is just one feature of course: we take 100 data points from the past, the green line, and we predict the dotted blue line. We do a short prediction and we also do a long prediction; I wanted to see how it works with both of them.
Once I use 360 data points, which is the second case, and predict the next 36, which is pretty short, I get better results with Informer, around 0.06, and again the LSTMs are not so bad. But what caught my attention is the time it takes to process a batch: even if the Transformer is better, in some use cases you don't have that time to wait. Again, it's not a lot of time, but it takes more time than the LSTM. What is directly affecting this is the optimization they do in the attention layer; if they didn't do that, the time could be double or triple. So they are getting there, along with all the research people are doing, but it's still much higher compared with LSTMs, and the LSTM is a pretty simple model if you want to build a network. For 120 data points it's pretty similar: the results are much better for Informer and worse for the LSTMs, but the time keeps growing; again, Informer takes more time compared with the LSTM.
This is from the Spacetimeformer paper, which I wanted to show because they did the same measurement with LSTM, LogTrans, Reformer and so on. You can see that as the encoder length grows, the time it takes gets higher and higher, and the model that always takes the least time is the LSTM, because it's simpler and easier to use. But Informer is not so bad; it takes a bit more time, but it's not so bad.
So, conclusions. Transformers seem to be a good solution, but I think a lot of work still needs to be done, basically on optimizations and on thinking of different ways to represent the data. And even if you find that this model works, it's probably not the best one for you, because it depends. This is not like language, where, as we see with GPT, once you train a language model it's able to understand language; of course you fine-tune it for your particular use case, but it understands language. That doesn't happen with time series, because all time series are different; we don't have "the same" time series when we work with time series. So for me, using one of these architectures is one more option to test, to see whether the architecture can be useful for my case, for my problem. It could work, it could not; you never know, you need to test it.

And the third point is, of course, to get involved. What matters with Transformers and time series is that it's the community that drives the state of the art. Again, it's not the same as GPT, where one organization, OpenAI or Google, trains the huge models; here it's the community that is researching, testing, and sharing information, because you don't have a model you can simply download and use for time series. You need to train the model on your data and see whether it works, so it's really up to each use case, each problem. So if you are using Informer, Spacetimeformer, LogTrans, or any other Transformer, try to be involved, try to collaborate, or to report back.
And there are cool projects, like tsai: if you want to work with time series, you probably need a framework to test things, because it's not easy to build your own LSTM, your own Transformer, your own Spacetimeformer and so on. Frameworks with APIs that are easy to use really help. So if you are working with time series, these are very good projects; tsai is one, and this is the GitHub. And that's it, thank you so much for your time. I hope you have enjoyed it, and for any questions we have the microphone. Thank you for your time.
Q: Thank you for the presentation. Two questions: Spacetimeformer is open source as well, right?

A: Yeah.

Q: Because you mentioned Informer is the open-source one, but all of them are? Okay. And in your use case example, the data parameters: you basically took the latency and the timestamp, that's it? Not the number of users, the type of the architecture, nothing, just those two data points?

A: Yeah. The reality is... sorry, that was the question, right? Yeah.
In fact I have the P95, P98, and P99 values for each feature. I had 60 features, and I concatenated the P95, P98, and P99 series, so I got a dataset of around 200 feature series, and I also had the throughput, which is another 60 series. So I had a matrix of about 250 features, if I'm not wrong, by 5,000 or 6,000 timestamps.
Q: Is it possible to add more dimensions to that data, to make it much larger?

A: Yes, exactly, and once you add that data, the matrix calculation in the attention layer gets bigger and bigger, so it takes time to train those models. It's a pretty simple use case, I mean, it's just latency.

Q: And did you run the benchmarks: once you predicted the future, you took the actual results; how close were they, what was the variance between actual and predicted?

A: Yes, actual versus predicted. I had an initial dataset of 10,000 points; I set aside 2,000 that I never showed to the model, trained with the other 8,000, and once the model was trained I tested the performance on those 2,000, for both the 36-step and the 120-step horizons.
Q: Nice. We were talking about this yesterday in the monitoring session, I don't know if you were there; this is exactly what they were talking about: how AI might help people identify impact. If you can use this and say, look, based on trends the impact is going to be higher and the acceleration is going to be higher. It's like when you drive a car with a trailer and it starts shaking: sometimes you go straight, and sometimes you fly off the highway. So this kind of thing can predict the impact.

A: Yeah, because you are modeling something. I don't know what features you could get there, but I can think of speed, vibrations, movements and so on. It's pretty interesting, but again, you could do the same with LSTMs.
Q: Thanks for your talk. Do you know if anyone is doing work on putting confidence intervals on the model predictions, so you could see your uncertainty growing over time as you predict into the future, and then decide at what point your prediction is just not worth using?

A: Not that I'm aware of, but I think you can use that idea when you are implementing these use cases.
Because the solution I tried actually has a model selector at the beginning. The model selector will detect, for instance, that you don't have enough data to train the Transformer, so you would probably use a regression or an LSTM; once you accumulate more timestamps, you can try to train the other models. And in the middle of the process you might say, okay, I have enough data to train a Transformer, but it's still not the best architecture for the solution. So when you implement this, you need to keep sensing the performance across, I don't know, LSTMs, Transformers, or even a regression, which could be something pretty simple.
You see these things when you implement these algorithms, because in the research projects, Informer and so on, what you find is: here is the MSE, this is how you measure how good it is, and they all use the same benchmarks, so it's basically always the same comparison. But when you want to use this in a real-world scenario, you need to monitor the performance, you need to monitor the MSE, and even with a good MSE you may not be able to detect peaks, because MSE is mean squared error, so it's a mean. If the prediction is close most of the time but you have one peak for one second, the MSE stays roughly the same. So if you only look at the MSE, you would probably say everything is fine while actually missing the peaks, so you need mechanisms to detect those peaks and see how it's really working.
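The point about MSE hiding peaks is easy to demonstrate with made-up numbers: one large spike in a long window barely moves the MSE, while the maximum error exposes it immediately:

```python
import numpy as np

actual = np.zeros(1000)
pred = np.zeros(1000)
pred_with_spike = pred.copy()
pred_with_spike[500] = 10.0          # one big miss at a single timestamp

mse_clean = np.mean((actual - pred) ** 2)
mse_spike = np.mean((actual - pred_with_spike) ** 2)
max_err = np.abs(actual - pred_with_spike).max()

print(mse_clean, mse_spike, max_err)  # 0.0 0.1 10.0
```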
Yes. Thank you for your question.

Q: I really enjoyed this presentation, and I wanted to ask how time-series, sequence-to-sequence-like data is set up. When I think about training a Transformer, I think of something like the translation task: a dataset of not necessarily related sentences, say a five-word sentence in the input language that translates to a corresponding sequence in the output language. So I'd like to know how you create, or use, a sequence-to-sequence model on time-series data, where you just have one continuous sequence of data points.
A: Good question. It is a kind of sequence-to-sequence, but the difference is that when you do sequence-to-sequence with an LSTM, you give the model, say, ten data points, but internally it processes one data point at a time. With Transformers we feed the ten data points at the same moment and get a representation of those ten data points, so you are not going element by element: you put the whole sequence in at once, the model computes the attention, the relationships between the points, and that information goes to the decoder. Decoders can use it in different ways, but in Spacetimeformer, for instance, they don't produce the output one step at a time; they do the same as the encoder and produce everything at once. So if you want to predict the next ten timestamps, you get the ten timestamps at once. It is sequence-to-sequence, but not in the same sense as you might think of it for LSTMs.
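That contrast can be sketched with toy stand-ins (neither function is a real model; the "model" here is just a moving average, purely to show the control flow): an LSTM-style decoder predicts one step and feeds it back, while a Spacetimeformer-style decoder emits the whole horizon in one call:

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(size=20)
H = 10  # forecast horizon

def step_by_step(history, horizon):
    """Autoregressive, LSTM-style in spirit: predict one step, append it,
    and repeat until the horizon is filled."""
    hist = list(history)
    out = []
    for _ in range(horizon):
        nxt = np.mean(hist[-5:])       # stand-in for a one-step model
        out.append(nxt)
        hist.append(nxt)
    return np.array(out)

def all_at_once(history, horizon):
    """Non-autoregressive, Spacetimeformer-style in spirit: one call
    emits every future step at the same time."""
    return np.full(horizon, np.mean(history[-5:]))  # stand-in for a multi-step head

print(step_by_step(series, H).shape, all_at_once(series, H).shape)  # (10,) (10,)
```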
Q: Is the length a hyperparameter in that case?

A: No; what you have in these Transformers are the three matrices, the queries, keys, and values. For each attention head you have these matrices: three matrices for one head, three for the next, and so on. Those are the learned parameters, and what you get at the end is a representation: an output that, if you look at it, has nothing to do with the input, because it's an embedding, it's how the model encodes which parts of the input are important, according to the attention.
That sequence is then fed as the input to the decoder. So you can think of it as sequence-to-sequence in the sense that you feed an input, you get a representation, an extraction of features, and that is what the decoder uses to produce your output. The concept is similar, but how it works is quite different: you are not feeding step by step, you feed everything together. With an LSTM doing sequence-to-sequence, if you have ten cells, the first input goes to the first cell, the second to the second cell, then the third, the fourth, and so on, and the LSTM decides, okay, this cell doesn't matter, so I'll shut it down. Here you feed everything together and the calculations happen at once; all the multi-head attentions are computed at once. So yes, in concept it is sequence-to-sequence, but it has a different behavior.

Q: Interesting.

A: Yeah, it's pretty interesting, and pretty challenging to understand the details. Thank you so much for the question.

All right, thank you for your time. Thank you, appreciate it.