Lecture 5.4 Advanced ERP Topics

The Qualcomm Institute
1 Aug 2013 · 09:50

Summary

TL;DR: The video script delves into extensions of basic linear discriminant analysis (LDA) for machine learning, highlighting that it yields the same result on raw data and on linearly transformed data, such as principal or independent components, as long as no information is lost. It discusses the use of ICA source time courses, which make classifiers interpretable and allow brain-informed constraints. It also touches on alternative linear features such as wavelet decomposition for pattern matching, and on sparsity in feature selection. The discussion extends to non-linear features, emphasizing the need for proper spatial filtering before non-linear feature extraction. The script concludes with considerations of class imbalance and the use of the area under the ROC curve as a performance measure, suggesting direct optimization of classifiers under sophisticated cost functions.

Takeaways

  • 📚 The script discusses extensions and tweaks to basic machine learning techniques to enhance their performance.
  • 🔍 Linear Discriminant Analysis (LDA) provides the same result whether applied to raw data or its linear transformations, as long as no information is lost.
  • 🌐 For non-linear features such as oscillations, the choice of representation matters: applying a method to ICA components versus raw channels can yield different results, unlike in the purely linear case.
  • 🔄 It emphasizes that, when applied to ICA components, LDA assigns weights to components rather than channels, allowing for more interpretable classifiers.
  • 🧠 The script suggests using ICA for relating components to source space in the brain, which can inform constraints on feature selection.
  • 📊 The speaker mentions the use of linear features like averages and wavelet decompositions for pattern matching in EEG data.
  • 🔑 The concept of sparsity is introduced, where only a small number of coefficients contain the relevant information, which can be leveraged for classifier design.
  • 💡 Non-linear features are also discussed, with a focus on the importance of applying non-linear transformations after linear spatial filtering.
  • 📉 The script addresses the challenges of dealing with non-linear features in EEG due to issues like volume conduction.
  • ⚖ The importance of considering class imbalance and cost-sensitive classification is highlighted, with suggestions to use measures like the area under the ROC curve.
  • 🛠️ The speaker recommends using classifiers that can be directly optimized for performance metrics such as the area under the curve, mentioning the support vector framework as an example.

Q & A

  • What is the basic technique discussed in the script that is easy to comprehend but can be extended with tweaks and twists?

    -The basic technique discussed is Linear Discriminant Analysis (LDA), which is a simple yet effective method for classification that can be enhanced with various modifications to improve its performance.

  • Why does applying LDA to linearly transformed data yield the same result as applying it to the original data?

    -Applying LDA to linearly transformed data gives the same result because LDA is a linear method. It would simply learn different weights to achieve the same performance, as long as no information is lost during the transformation.

  • What is the difference between applying LDA to linear versus non-linear features?

    -For linear features, LDA is unaffected by invertible transformations of the data, such as rescaling or swapping channels. For non-linear features, such as oscillatory power, this invariance no longer holds: applying the same method to ICA components versus raw channels can yield different results, so appropriate spatial filtering or feature extraction matters.

  • How does applying LDA to ICA components change the meaning of the learned weights?

    -When the data are first decomposed with ICA, LDA assigns a weight to every component rather than to every channel. This makes the classification more interpretable, as each component can be associated with a specific signal source, such as blinks or muscle activity.

  • What is the advantage of using ICA in conjunction with LDA for EEG signal classification?

    -ICA can provide source time courses that relate to the brain's activity, which can be used to inform constraints for LDA, such as excluding components from non-relevant brain regions. This enhances interpretability and allows for more targeted feature selection.

  • Why might one choose to use wavelet decomposition instead of averages for feature extraction in EEG signals?

    -Wavelet decomposition can be more suitable when the underlying ERP (Event-Related Potential) has specific forms like ripples. It allows for pattern matching with the time course of the signal, which can capture more nuanced features than simple averages.

  • How does dimensionality reduction relate to the use of wavelet features in EEG signal classification?

    -Dimensionality reduction is important when using wavelet features because using all of them would essentially be the same as using the raw data. By selecting a small number of wavelet coefficients that contain the relevant information, the model can become more efficient and focused on the signal of interest.

  • What is the significance of sparsity in the context of feature selection for EEG signal classification?

    -Sparsity refers to the idea of using only a small number of non-zero coefficients in the model. This can improve the classifier's performance by focusing on the most informative features and reducing the impact of noise or irrelevant information.

  • Why is it important to consider the order of operations when extracting non-linear features from EEG data?

    -The correct order is to first apply a spatial filter to isolate the source signal, then perform non-linear feature extraction, and finally apply a classifier. This sequence ensures that the non-linear properties of the source signal are captured accurately, rather than being distorted by channel-level transformations.

  • What is the area under the curve (AUC) and how is it used to evaluate classifier performance in imbalanced datasets?

    -The AUC represents the area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate for different threshold settings. It provides a measure of a classifier's ability to distinguish between classes, especially in situations with imbalanced datasets or when different types of errors have different costs.

  • How can classifiers be optimized for imbalanced classes or when certain errors are more costly?

    -Classifiers can be optimized by incorporating cost-sensitive learning, where the misclassification costs are taken into account during training. Additionally, using performance metrics like AUC or F-scores, which are more informative than simple misclassification rates, can help in such scenarios.

Outlines

00:00

🧠 Extensions and Considerations in Machine Learning

This paragraph delves into the nuances of Linear Discriminant Analysis (LDA) and its behaviour under linear and non-linear transformations of the data. It highlights that LDA yields the same result on raw and on linearly transformed data, since it simply learns different weights. The discussion extends to Independent Component Analysis (ICA): applied to ICA components, LDA assigns weights to components rather than channels, allowing the relevance of individual signal sources to be interpreted and brain-informed constraints to be introduced. The paragraph also touches on alternative linear features like wavelet decomposition for pattern matching and on the concept of sparsity in classifiers, emphasizing dimensionality reduction and feature selection.

05:01

📊 Addressing Non-linearity and Class Imbalance in EEG Analysis

The second paragraph addresses the complexities of feature extraction in EEG analysis, particularly the distinction between non-linear properties of source time courses versus channels. It emphasizes the correct sequence of spatial filtering, non-linear feature extraction, and classification. The paragraph also discusses the challenges of dealing with arbitrary non-linear features in EEG and mentions the use of ICA for spatial filtering before non-linear feature extraction. Furthermore, it explores the importance of optimizing classifiers under different criteria to account for class imbalance and the cost of different types of errors, introducing the concept of the area under the ROC curve as a performance metric. The paragraph concludes with a mention of classifiers that can be directly optimized for sophisticated cost functions, such as those within the support vector framework.

Keywords

💡LDA

Linear Discriminant Analysis (LDA) is a statistical technique used for dimensionality reduction and classification. In the context of the video, LDA is discussed as a method that yields the same result on raw and on linearly transformed data (for example rescaled or re-mixed channels), because, being linear, it simply learns different weights to compensate for the transformation.
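
As a quick illustration of this invariance, here is a minimal sketch (not from the lecture; synthetic data and scikit-learn's LinearDiscriminantAnalysis) showing that LDA makes the same decisions on raw features and on an invertibly mixed version of them:

```python
# Minimal sketch (synthetic data): LDA trained on raw features vs. on an
# invertibly mixed version of them makes the same decisions, because it just
# learns different weights that undo the mixing.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                    # 200 trials, 8 "channels"
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)    # synthetic labels

A = rng.normal(size=(8, 8))                      # invertible mixing (rescale/swap/mix)
X_mixed = X @ A                                  # linearly transformed data

pred_raw = LinearDiscriminantAnalysis().fit(X, y).predict(X)
pred_mix = LinearDiscriminantAnalysis().fit(X_mixed, y).predict(X_mixed)
print(np.mean(pred_raw == pred_mix))             # ~1.0: identical decisions
```

If the transformation discarded information (for example a rank-deficient projection), this equivalence would no longer hold.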

💡Principal Components

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into a set of orthogonal (uncorrelated) variables called principal components. The script mentions PCA as an example of a linear transformation where applying LDA would yield similar results as long as information is not lost, emphasizing the linear nature of the process.

💡Independent Components

Independent Component Analysis (ICA) is a computational technique for separating a multivariate signal into independent, non-Gaussian components. The video script discusses how LDA assigns weights to every component when applied to ICA, allowing for the interpretation of the relevance of different signal components, such as distinguishing between blinks and muscle activity.
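
To make the weight-per-component idea concrete, here is a hedged sketch (synthetic epochs; the post-stimulus window is a hypothetical choice) that fits FastICA without using the labels and then lets LDA assign one weight to each independent component:

```python
# Hedged sketch (synthetic epochs; the window 40:70 is a hypothetical choice):
# unmix the channels with ICA, average each source in a post-stimulus window,
# and let LDA assign one weight per independent component.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n_trials, n_channels, n_samples = 120, 16, 100
epochs = rng.normal(size=(n_trials, n_channels, n_samples))  # stand-in EEG epochs
labels = rng.integers(0, 2, size=n_trials)                   # stand-in class labels

# Learn the unmixing from the data alone, irrespective of the class labels
ica = FastICA(n_components=n_channels, random_state=0)
ica.fit(epochs.transpose(0, 2, 1).reshape(-1, n_channels))   # (samples, channels)

# Project every epoch into component space: trials x components x samples
sources = np.einsum('kc,ncs->nks', ica.components_, epochs)
features = sources[:, :, 40:70].mean(axis=2)                 # one feature per component

clf = LinearDiscriminantAnalysis().fit(features, labels)
print(clf.coef_.ravel())   # one interpretable weight per independent component
```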

💡Dipole Fitting

Dipole fitting is a method used in electrophysiology to estimate the location and orientation of electrical sources in the brain. The script refers to dipole fitting in the context of localizing an independent component to the motor cortex, which aids in the interpretation of classifiers by associating them with specific brain regions.

💡Classifier

In machine learning, a classifier is an algorithm that separates data into different categories. The video discusses how classifiers can be interpreted more easily when using ICA components and how they can incorporate constraints informed by the components' locations in the brain, relating to the theme of enhancing classification through spatial information.

💡ERP

Event-Related Potentials (ERP) are measured brain responses that are directly related to specific sensory, cognitive, or motor events. The script mentions ERP in the context of using linear combinations or wavelet decomposition to match patterns in ERP forms, illustrating the application of different features for classification.
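
The "inner product with a time course" idea can be sketched as a matched-filter style feature; the sampling rate and template shape below are purely illustrative, not taken from the lecture:

```python
# Minimal sketch (toy numbers): a matched-filter style linear feature, i.e. the
# inner product of each single-trial time course with a template waveform.
# The sampling rate and the Gaussian "P300-like" template are illustrative.
import numpy as np

fs = 100                                                 # assumed sampling rate (Hz)
t = np.arange(100) / fs                                  # one-second epoch
template = np.exp(-((t - 0.3) ** 2) / (2 * 0.05 ** 2))   # bump centred at 300 ms

trials = np.random.default_rng(2).normal(size=(50, 100)) # 50 trials x 100 samples
scores = trials @ template                               # one linear feature per trial
print(scores.shape)                                      # (50,)
```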

💡Wavelet Decomposition

Wavelet decomposition is a mathematical method that represents a time-series signal as a sum of wavelets, allowing for the analysis of frequency content at different times. The video script uses wavelet decomposition as an example of a linear transform for feature extraction in EEG data, emphasizing its role in dimensionality reduction and pattern matching.
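
A hedged sketch of this idea (assuming the PyWavelets package; synthetic single-channel trials and an arbitrary coefficient-selection rule) might look like:

```python
# Hedged sketch (assumes the PyWavelets package; synthetic single-channel
# trials): expand each trial into wavelet coefficients (a linear transform) and
# keep only a handful of them as features instead of the whole raw epoch.
import numpy as np
import pywt

rng = np.random.default_rng(3)
trials = rng.normal(size=(80, 128))     # 80 trials x 128 samples (one channel/source)
labels = rng.integers(0, 2, size=80)

# Linear transform: discrete wavelet decomposition of each trial
coeffs = np.array([np.concatenate(pywt.wavedec(x, 'db4', level=4)) for x in trials])

# Crude illustrative selection: keep the 10 coefficients whose class means differ most
diff = np.abs(coeffs[labels == 1].mean(axis=0) - coeffs[labels == 0].mean(axis=0))
features = coeffs[:, np.argsort(diff)[-10:]]   # low-dimensional wavelet features
print(features.shape)                          # (80, 10)
```

In practice the coefficient selection, or a sparse classifier as described below, would be fit inside cross-validation rather than on the whole data set.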

💡Sparsity

Sparsity in machine learning refers to the property of a model where only a small number of its parameters are non-zero. The script discusses sparsity in the context of classifiers using only a small number of wavelet coefficients that contain the relevant information, indicating a method to enhance model efficiency and interpretability.
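
One standard way to obtain such a sparse solution is an L1-penalised linear model; the following minimal sketch (synthetic data, scikit-learn's LogisticRegression) is one possible instantiation, not the specific method used in the lecture:

```python
# Minimal sketch (synthetic data): an L1-penalised linear classifier that ends
# up with only a few non-zero coefficients, i.e. a sparse solution over a large
# set of candidate features such as wavelet coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 200))          # 100 trials, 200 candidate coefficients
w_true = np.zeros(200)
w_true[:5] = 2.0                         # only five of them actually matter
y = (X @ w_true + rng.normal(size=100) > 0).astype(int)

clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
print(int(np.sum(clf.coef_ != 0)), "non-zero coefficients out of", clf.coef_.size)
```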

💡Non-linear Features

Non-linear features are characteristics of data that do not follow a linear relationship. The video script contrasts linear features with non-linear ones, noting the importance of applying non-linear transformations after spatial filtering to capture source time course properties, which is crucial for accurate classification in EEG data.
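
The linear, then non-linear, then linear ordering described here can be sketched roughly as follows (synthetic data; ICA as the label-free spatial filter and log-variance as the non-linear source feature are illustrative choices):

```python
# Hedged sketch of the linear -> non-linear -> linear ordering: a spatial filter
# learned without labels (here ICA), a non-linear feature of each source time
# course (log-variance as a band-power proxy), then a linear classifier.
# All data and parameter choices are synthetic/illustrative.
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(5)
epochs = rng.normal(size=(100, 16, 200))    # trials x channels x samples
labels = rng.integers(0, 2, size=100)

# 1) Linear spatial filtering, independent of the class labels
ica = FastICA(n_components=16, random_state=0)
ica.fit(epochs.transpose(0, 2, 1).reshape(-1, 16))
sources = np.einsum('kc,ncs->nks', ica.components_, epochs)

# 2) Non-linear feature of each source time course
features = np.log(sources.var(axis=2) + 1e-12)

# 3) Linear classification on those features
clf = LinearDiscriminantAnalysis().fit(features, labels)
print(clf.score(features, labels))
```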

💡Volume Conduction

Volume conduction refers to the spread of electrical currents through a conductive medium, in EEG the brain, skull, and scalp. The script mentions that volume conduction is less of a problem in fMRI than in EEG, highlighting a challenge specific to EEG signal processing: activity from a single source spreads across many channels.

💡Area Under the Curve (AUC)

The Area Under the Curve (AUC), specifically the Receiver Operating Characteristic (ROC) curve, is a performance measurement for classification problems at various threshold settings. The script discusses AUC as a way to deal with class imbalance and to measure the performance of classifiers, emphasizing its usefulness in situations where the cost of different types of errors varies.
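
A minimal sketch of computing the ROC curve and its area from continuous classifier scores (synthetic labels and scores, scikit-learn metrics):

```python
# Minimal sketch (synthetic labels and scores): sweeping the decision threshold
# of a continuous classifier output traces the ROC curve; its area summarises
# performance independently of any single threshold or of the class balance.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(6)
y_true = rng.integers(0, 2, size=1000)
scores = y_true + rng.normal(scale=1.0, size=1000)   # noisy continuous outputs

fpr, tpr, thresholds = roc_curve(y_true, scores)     # one (FPR, TPR) point per threshold
print("AUC:", roc_auc_score(y_true, scores))         # 1.0 is ideal, ~0.5 is chance
```

For the cost-sensitive setting mentioned in the script, many implementations (for example scikit-learn's SVC or LogisticRegression) accept a class_weight argument, although directly optimizing AUC or F-scores typically requires more specialised formulations such as structured SVMs.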

Highlights

Discussion of extensions to the basic technique of Linear Discriminant Analysis (LDA) for enhancing its performance.

LDA's invariant performance when applied to linearly transformed data, such as rescaled channels.

The difference in applying LDA to non-linear features like oscillations compared to linear ones.

LDA assigns a weight to every component rather than every channel when the data are first decomposed with Independent Component Analysis (ICA).

The interpretability of classifiers when using ICA, allowing for reasoning about the relevance of different signal components.

Introduction of extra constraints informed by the brain's source space in ICA.

The tradeoff between using simple averages and more complex linear features like wavelet decomposition for ERP forms.

The concept of using a small number of wavelet features for dimensionality reduction in EEG signal analysis.

The importance of sparsity in classifiers, where only a small number of coefficients contain relevant information.

The distinction between linear and non-linear features, with a focus on the source time course for non-linear features.

Applying non-linear transformations to channels before classification puts the non-linearity too early in the EEG analysis pipeline.

The three-stage process of linear filtering, non-linear feature extraction, and then classification for optimal results.

The difficulty of dealing with arbitrary non-linear features in EEG due to volume conduction problems.

The special condition allowing non-linear feature application after learning spatial filters with ICA.

The importance of optimizing classifiers under different criteria for imbalanced classes or costly errors.

The use of the Area Under the Curve (AUC) as a measure for dealing with imbalanced classes in classification.

The utility of AUC for classifiers that produce continuous scalar outputs, allowing for tunable thresholds.

The existence of classifiers that can be directly optimized under the AUC criterion or other advanced scoring functions.

The mention of the support vector framework as an example of a method with advanced ways to learn classifiers under sophisticated cost functions.

Transcripts

00:00

In the next part we'll go a little bit beyond that: we'll discuss a few extensions of this basic technique, which is very easy to comprehend but has tweaks and twists that one can apply to get even more out of it, and we'll also cover a few more interesting machine learning aspects.

00:24

The first little consideration is this: LDA gives you basically the same effective result on linearly transformed data as on the original data. If you had swapped two channels, for example, or rescaled them, you would have gotten the same performance, because LDA would just have learned different weights. So if you apply it to principal components of the data, or independent components, or whatever, you still get basically the same result, as long as you don't throw information away. That's just a feature of this whole process being linear. If you're looking at something non-linear, like oscillations, that absolutely doesn't hold anymore; suddenly you truly do get a difference when you apply a method to, say, ICA components versus raw channels. It's just that in this case we're very lucky that we can basically solve it all the way through with one method, optimally.

01:25

However, if you have decompositions like independent component analysis, which gives you source time courses, then essentially LDA assigns a weight not to every channel but to every component. Suddenly you can say that this component of the signal, like the blinks, is this relevant, and that component, like a muscle, is that relevant. So you can reason and say things such as: I am not using muscular activation, so it doesn't depend on artifacts; or: I'm primarily using an independent component that, with dipole fitting, I managed to localize in the motor cortex. You can suddenly interpret these classifiers relatively easily, and furthermore you can introduce extra constraints that are informed by where these components sit in the brain. Again, ICA is a very strong method to give you something that relates to source space; I said a few things about that in the last lecture. You can say: I don't want to use components that come from outside occipital cortex, because I'm looking at a visual process here and I don't think the signal I want to use originates anywhere else. So these things suddenly become possible; those are the various trade-offs.

02:44

The other consideration is that there are other linear features you can use instead of averages. Averages are very simple to describe, calculate, and think about, but they are not necessarily particularly well tuned to what you really want. If you know that your underlying ERP comes in certain kinds of forms, like these ripples, then you could use linear combinations that basically take an inner product with that time course, to do pattern matching with it. That is an example of a wavelet decomposition: it's a linear transform which allows you to use wavelet features, so you can pick a small number of wavelet features as the linear features of your ERP and classify in terms of those. That only makes sense if you throw many of them away to reduce dimensionality; if you use all of them, it's again linear and you could just as well have used the raw chunk of data and thrown a classifier at it. So it's a way of doing dimensionality reduction. What these features also happen to do is that usually only a small number of the coefficients actually contain the information you're looking for, and so you can say things such as: my classifier should use only a small number of non-zero coefficients. There are various ways to learn these kinds of classifiers; we'll say a few things about that. It's this sparsity notion. With these features you can start making use of these kinds of assumptions; it doesn't apply, say, to channels, because the signal is not sparse in a channel, it projects everywhere.

04:35

There is also, of course, the whole area of non-linear features; that is the general class of features, and linear is a special case. The problem is that what you actually want are non-linear features of the source time course, because you think there's a source process that does something and maybe has a non-linear property, as opposed to non-linear features of the channels. So if you do a non-linear transform somewhere in your feature extraction on channels and then apply a classifier, you've basically done it the wrong way: you've applied the non-linear part too early. The proper way would have been to first design a spatial filter that gets you to the source, which is a big linear part, then do your non-linear feature extraction, whatever that might be, and then maybe apply a classifier to that. That's a three-stage thing: linear, non-linear, linear, and there are only a small number of methods that properly learn all these parameters in a way where you can say it's going to be the optimal solution; in many cases it's patchwork. So in practice it's very hard to deal properly with arbitrary non-linear features in EEG. It's different in, say, fMRI, where you don't have as much of a volume conduction problem, but in EEG it's tricky. There are special classes of problems where you can do it, though, and one condition that allows it, for example, is to learn the spatial filters irrespective of the class labels, using things like ICA, and then compute non-linear features on your independent components and then use a classifier. But I'm going to talk about these things later.

06:36

There's a more important aspect, and that is when you want to detect a signal in noise, for example whether the person saw a target image as opposed to not having seen it. These are cases where the class ratio differs, say where the prior probability that they saw nothing is much higher than the probability that they saw something, so there's an imbalance; or cases where certain kinds of errors of the BCI are much more costly than others, such as a false positive versus a false negative. Then you may want to use different criteria to optimize these things, for example incorporating the costs, and you also want to use different measures to estimate what the performance of your system is. Misclassification rate doesn't get you very far: for example, if 90% of your trials are one class and you always predict that class, you are 90% correct on average.

07:43

So one pretty nice, general-purpose way of dealing with imbalanced classes is the area under the curve, or the area under the receiver operating characteristic curve. If your model has a tunable threshold where it says class A versus class B, then for different values of this threshold you can go all the way from a 0% false positive rate to 100%, and for any given false positive rate you have an associated true positive rate. In the ideal case, for 0% false positives you have 100% true positives, so you're always getting it right and never getting it wrong, and the area under the curve is one. If you're at random chance, it's basically 0.5: for every bit of false positive rate you take on, you gain an equal bit of true positive rate.

08:48

That's a measure that can be applied post hoc to any classifier that happens to produce continuous scalar outputs, like a regression technique that, say, predicts the probability that you saw a plane in a satellite image or something like that.

09:06

I should say one more thing, although I think I've already said this: there are classifiers which can be directly optimized under the area-under-the-curve criterion, and classifiers that can optimize various other scores, F-scores and so on. For example, the support vector framework has several rather advanced ways to learn classifiers under these sophisticated cost functions. So if you're in such a situation, that would be a place to look. And that ends this module.
