Introduction to Lexical Analyzer

Neso Academy
8 Apr 2022 · 14:58

Summary

TL;DR: This session provides an in-depth look at the lexical analyzer's working principle, explaining how it processes high-level language code to produce tokens. It covers the types of tokens, the scanning and analyzing phases, and the use of finite state machines (FSM) for pattern recognition. The video also discusses handling comments, whitespace, and macro expansion, emphasizing the analyzer's role in creating symbol table entries and its interaction with the syntax analyzer.

Takeaways

  • 📘 The lexical analyzer is a crucial part of a compiler that transforms high-level language code into tokens.
  • 🔍 It uses regular expressions and type 3 grammar to recognize tokens, which are the basic elements of the source code.
  • 📚 Tokens can be identifiers, operators, constants, keywords, literals, punctuators, and special characters.
  • 🔧 The lexical analyzer performs two main functions: scanning to remove non-token elements and analyzing to produce tokens.
  • 🌐 It uses a finite state machine (FSM) to recognize different types of tokens such as keywords, identifiers, and integers.
  • 🔄 The combined FSM is first built as a non-deterministic finite automaton (NFA), which is then converted to a deterministic finite automaton (DFA) for implementation.
  • 🔑 The DFA recognizes tokens by moving through states based on the input characters until it reaches a final state.
  • 🚮 The lexical analyzer also removes comments and white spaces which are considered non-token elements.
  • 🔄 It supports macro expansion by replacing macro identifiers with their defined values throughout the code.
  • 🔗 The lexical analyzer maintains a symbol table for identifiers and communicates with the syntax analyzer to provide tokens for parsing.

Q & A

  • What is the primary function of a lexical analyzer?

    -The primary function of a lexical analyzer is to scan the source code of a high-level programming language and convert it into a sequence of tokens.

  • In which phase of the compiler does the lexical analyzer play its role?

    -The lexical analyzer operates in the lexical analysis phase of a compiler, where it processes expressions such as arithmetic expressions and produces a stream of tokens.

  • How does the lexical analyzer use regular expressions?

    -The lexical analyzer recognizes tokens by matching them against regular expressions, that is, patterns described by a regular (type 3) grammar.

  • What are the types of tokens that a lexical analyzer can produce?

    -Tokens produced by a lexical analyzer can be identifiers, operators, constants, keywords, literals, punctuators, and special characters.
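
    For illustration only (the video defines no code), the seven categories could be modeled as a C enumeration; the type and constant names below are assumptions:

        /* Hypothetical token categories, one per class listed above. */
        typedef enum {
            TOK_IDENTIFIER,
            TOK_OPERATOR,
            TOK_CONSTANT,
            TOK_KEYWORD,
            TOK_LITERAL,
            TOK_PUNCTUATOR,
            TOK_SPECIAL_CHAR
        } TokenType;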

  • What is the role of a finite state machine in lexical analysis?

    -A finite state machine is used by the lexical analyzer to recognize tokens by transitioning between states based on the input characters.
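
    A minimal C sketch of such a machine, hard-coding the three-state automaton for the keyword if that the video describes (the state names and function are illustrative assumptions):

        #include <stdbool.h>
        #include <stddef.h>

        /* The tiny FSM for the keyword "if":
           A (initial) --i--> B --f--> C (final). */
        typedef enum { STATE_A, STATE_B, STATE_C, STATE_DEAD } State;

        bool recognizes_if(const char *lexeme) {
            State s = STATE_A;
            for (size_t i = 0; lexeme[i] != '\0'; i++) {
                if (s == STATE_A && lexeme[i] == 'i')
                    s = STATE_B;
                else if (s == STATE_B && lexeme[i] == 'f')
                    s = STATE_C;
                else
                    s = STATE_DEAD;              /* no valid transition */
            }
            return s == STATE_C;  /* accept only if we stop in the final state */
        }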

  • How does the lexical analyzer handle comments in the source code?

    -The lexical analyzer detects comments and eliminates them as non-token elements, replacing them with whitespace or simply ignoring them.
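
    A hedged sketch of that elimination pass in C, assuming the source is available as a character buffer (the function name and buffer handling are assumptions, not the video's code):

        #include <stddef.h>

        /* Copy src into dst, replacing each comment with a single
           blank: a double slash is ignored up to the end of the
           line, and a slash-star up to the next star-slash. */
        void strip_comments(const char *src, char *dst) {
            size_t i = 0, j = 0;
            while (src[i] != '\0') {
                if (src[i] == '/' && src[i + 1] == '/') {        /* single-line */
                    while (src[i] != '\0' && src[i] != '\n')
                        i++;
                    dst[j++] = ' ';
                } else if (src[i] == '/' && src[i + 1] == '*') { /* multi-line */
                    i += 2;
                    while (src[i] != '\0' &&
                           !(src[i] == '*' && src[i + 1] == '/'))
                        i++;
                    if (src[i] != '\0')
                        i += 2;                                  /* skip the closer */
                    dst[j++] = ' ';
                } else {
                    dst[j++] = src[i++];
                }
            }
            dst[j] = '\0';
        }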

  • What is the difference between scanning and analyzing phases in a lexical analyzer?

    -In the scanning phase, the lexical analyzer eliminates non-token elements like comments and whitespace. In the analyzing phase, it identifies tokens using the finite state machine.

  • How does the lexical analyzer deal with white spaces?

    -The lexical analyzer recognizes various types of white spaces such as space, horizontal tab, new line, vertical tab, form feed, and carriage return, and treats them as non-token elements.
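
    For reference, those six characters can be tested with a small helper like this C sketch (the function name is an assumption; the set mirrors the standard isspace characters):

        #include <stdbool.h>

        /* Space, horizontal tab, new line, vertical tab,
           form feed, and carriage return. */
        bool is_whitespace(char c) {
            return c == ' '  || c == '\t' || c == '\n' ||
                   c == '\v' || c == '\f' || c == '\r';
        }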

  • What is the significance of the symbol table in the context of a lexical analyzer?

    -The lexical analyzer creates entries for identifiers in the symbol table, which is crucial for later stages of compilation like semantic analysis.
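
    A minimal sketch of such an entry and its insertion in C, assuming a fixed-size table (all names and fields here are illustrative; real compilers use richer structures):

        #include <string.h>

        #define MAX_SYMBOLS 256

        /* One entry per identifier encountered in the source code. */
        typedef struct {
            char name[64];   /* identifier lexeme            */
            int  line;       /* line where it first appeared */
        } Symbol;

        static Symbol table[MAX_SYMBOLS];
        static int    count = 0;

        /* Enter an identifier if it is not already present;
           return its index, or -1 if the table is full. */
        int symtab_insert(const char *name, int line) {
            for (int i = 0; i < count; i++)
                if (strcmp(table[i].name, name) == 0)
                    return i;                    /* already entered */
            if (count >= MAX_SYMBOLS)
                return -1;
            strncpy(table[count].name, name, sizeof table[count].name - 1);
            table[count].name[sizeof table[count].name - 1] = '\0';
            table[count].line = line;
            return count++;
        }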

  • How does the lexical analyzer communicate with the syntax analyzer?

    -The lexical analyzer and the syntax analyzer frequently communicate, with the lexical analyzer providing tokens to the syntax analyzer upon request during parsing.
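
    That request/response cycle is commonly coded as a get-next-token loop; the C sketch below assumes a Token type and an end-of-input marker, neither of which is specified in the video:

        /* Hypothetical interface: the syntax analyzer pulls tokens
           from the lexical analyzer one at a time. */
        enum { TOK_EOF = 0 };                    /* assumed end marker */
        typedef struct { int type; char text[64]; } Token;

        /* Toy stand-in for the real lexical analyzer, so the
           sketch is self-contained. */
        static Token get_next_token(void) {
            Token t = { TOK_EOF, "" };
            return t;
        }

        /* The parser's side of the conversation. */
        void parse(void) {
            Token t = get_next_token();          /* first token */
            while (t.type != TOK_EOF) {
                /* ... grammar rules consume t here ... */
                t = get_next_token();            /* ask for the next one */
            }
        }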

  • What additional functionality does the lexical analyzer provide besides tokenization?

    -Apart from tokenization, the lexical analyzer also performs macro expansion, replacing macro identifiers with their defined values in the source code.
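
    A sketch of that substitution for simple object-like macros, assuming a small fixed macro table (illustrative only; a real preprocessor handles far more cases):

        #include <stdio.h>
        #include <string.h>

        /* A tiny macro table, as if filled from "#define" lines. */
        typedef struct { const char *name; const char *value; } Macro;

        static const Macro macros[] = {
            { "MAX", "100" },   /* hypothetical: #define MAX 100 */
        };

        /* If an identifier names a macro, return its value;
           otherwise return the identifier unchanged. */
        const char *expand(const char *ident) {
            for (size_t i = 0; i < sizeof macros / sizeof macros[0]; i++)
                if (strcmp(macros[i].name, ident) == 0)
                    return macros[i].value;
            return ident;
        }

        int main(void) {
            printf("%s\n", expand("MAX"));   /* prints 100 */
            return 0;
        }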

Outlines

00:00

📘 Introduction to Lexical Analyzer

This paragraph introduces the lexical analyzer, explaining its role in the compiler process. It discusses how the analyzer scans high-level language code line by line, converting it into tokens. Tokens are categorized into identifiers, operators, constants, keywords, literals, punctuators, and special characters. The paragraph also explains the two primary functions of the lexical analyzer: scanning and analyzing. Scanning removes non-token elements like comments and white spaces, while analyzing uses finite state machines to recognize tokens.

05:02

🔍 Combining Finite State Machines

This section delves into the construction of a single finite state machine (FSM) that combines multiple FSMs to recognize different types of tokens in the C language. It explains how a new start state with epsilon transitions yields an epsilon NFA, a non-deterministic finite automaton that can recognize keywords, identifiers, and integers. The process of converting the NFA to a deterministic finite automaton (DFA) is outlined, highlighting how the DFA recognizes tokens by moving through states based on input characters. The DFA's ability to recognize identifiers and integers is demonstrated through specific transitions and final states.
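
Such a combined DFA can be hand-coded as a switch over states. The sketch below follows the machine described in the video (start state S, states 1 through 5; finals 1 and 3 for identifiers, 2 for the keyword if, 4 for integers), though the C function and token names are assumptions:

    #include <ctype.h>
    #include <stddef.h>

    typedef enum { T_KEYWORD_IF, T_IDENTIFIER, T_INTEGER, T_ERROR } Kind;

    /* States of the combined DFA: S is the start state. */
    typedef enum { S, S1, S2, S3, S4, S5, DEAD } St;

    static int ident_char(char c) {
        return isalnum((unsigned char)c) || c == '_';
    }

    Kind classify(const char *lex) {
        St st = S;
        for (size_t i = 0; lex[i] != '\0'; i++) {
            char c = lex[i];
            switch (st) {
            case S:                               /* start state */
                if (c == 'i')                                   st = S1;
                else if (isalpha((unsigned char)c) || c == '_') st = S3;
                else if (isdigit((unsigned char)c))             st = S4;
                else if (c == '+' || c == '-')                  st = S5;
                else                                            st = DEAD;
                break;
            case S1:                              /* saw "i" */
                st = (c == 'f') ? S2 : (ident_char(c) ? S3 : DEAD);
                break;
            case S2:                              /* saw "if"; anything more
                                                     makes it an identifier */
            case S3:
                st = ident_char(c) ? S3 : DEAD;
                break;
            case S4:                              /* digits */
            case S5:                              /* sign, then digits */
                st = isdigit((unsigned char)c) ? S4 : DEAD;
                break;
            default:
                st = DEAD;
            }
        }
        if (st == S2)             return T_KEYWORD_IF;   /* final state 2  */
        if (st == S1 || st == S3) return T_IDENTIFIER;   /* finals 1 and 3 */
        if (st == S4)             return T_INTEGER;      /* final state 4  */
        return T_ERROR;
    }

With this sketch, classify("if") yields T_KEYWORD_IF, classify("i") and classify("iffy") yield T_IDENTIFIER, and classify("+42") yields T_INTEGER, matching the walkthrough in the transcript.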

10:04

🛠 DFA Construction and Lexical Analyzer Functions

The final paragraph covers the construction of DFAs, emphasizing the need for understanding non-deterministic finite automata (NFA) and their conversion to deterministic finite automata (DFA). It also mentions the minimization of DFA to MDFA. The paragraph discusses the lexical analyzer's role in removing comments and white spaces and its interaction with the syntax analyzer. It explains how the lexical analyzer helps in macro expansion and symbol table creation, highlighting the continuous communication between the lexical analyzer and the syntax analyzer.

Keywords

💡Lexical Analyzer

The lexical analyzer is the first phase of a compiler that processes high-level source code. It scans the input code line by line, converting it into tokens while eliminating non-token elements like comments and white spaces. The lexical analyzer is essential to the working of a compiler and is closely tied to the theme of the video, which explains its role in analyzing and tokenizing code.

💡Tokens

Tokens are the output produced by the lexical analyzer after processing the source code. These tokens represent various elements of the program, such as identifiers, operators, constants, and keywords. The video discusses how the lexical analyzer groups inputs into tokens, which are crucial for further stages of code compilation.

💡Finite State Machine (FSM)

A finite state machine is a computational model used by the lexical analyzer to recognize patterns like keywords and identifiers in the source code. The video explains how FSMs operate, with examples of recognizing tokens like 'if' and integers through transitions between states. FSMs are a key concept for understanding how lexical analyzers function.

💡Deterministic Finite Automata (DFA)

A deterministic finite automaton (DFA) is a type of FSM that can be implemented in lexical analyzers to recognize patterns like keywords, identifiers, and integers. The video shows how different machines are constructed for various token types and how they are combined into a single DFA to process a programming language. The DFA ensures the lexical analyzer can recognize tokens deterministically.

💡Non-deterministic Finite Automata (NFA)

A non-deterministic finite automaton (NFA) is another type of FSM mentioned in the video, used conceptually in lexical analysis. NFAs may include epsilon transitions, which allow state changes without consuming input. While NFAs cannot be implemented directly, the video explains how they are converted into DFAs for practical use in token recognition.

💡Lexeme

A lexeme is a sequence of characters in the source code that matches a pattern defined by a token. The lexical analyzer takes lexemes as input and processes them into tokens. The video highlights how lexemes like keywords, identifiers, and numbers are converted into their corresponding tokens during the scanning phase.

💡Regular Expressions

Regular expressions define the patterns that the lexical analyzer uses to recognize tokens. The video mentions that the lexical analyzer uses type 3 grammar or regular grammar to identify various tokens such as keywords and identifiers. Regular expressions form the backbone of the lexical analyzer’s pattern matching capabilities.
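
For instance, the token patterns discussed in the video are conventionally written as regular expressions like the following (standard notation; the keyword list is abbreviated):

    identifier : [A-Za-z_][A-Za-z0-9_]*
    integer    : [+-]?[0-9]+
    keyword    : if | int | return | ...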

💡Symbol Table

A symbol table is a data structure used by the lexical analyzer to store information about identifiers found in the source code, such as variable names. The video briefly touches on the fact that as the lexical analyzer processes code, it makes entries in the symbol table, which is then used by later stages of compilation like syntax analysis.

💡Macro Expansion

Macro expansion is the process of replacing a macro in the code with its defined value. The video mentions how the lexical analyzer assists in macro expansion, which occurs when the lexical analyzer substitutes macros defined in the code with their corresponding values. This is a key part of preprocessing before formal token analysis begins.

💡Epsilon Transition

Epsilon transitions are special transitions in NFAs that allow the automaton to move between states without consuming any input symbols. In the video, epsilon transitions are used to explain how an NFA can move between states for recognizing different tokens. They play a key role in the transition from NFA to DFA.

Highlights

Introduction to the lexical analyzer and its working principle

Outcome of the session: understanding the lexical analyzer in detail

The lexical analyzer produces a stream of tokens from arithmetic expressions

Regular expressions are used for recognizing tokens

Type 3 grammar is used for recognizing tokens

Features of the lexical analyzer: scanning high-level language code and producing tokens

Types of tokens: identifiers, operators, constants, keywords, literals, punctuators, and special characters

Two functions of the lexical analyzer: scanning and analyzing

Lexical analyzer uses finite state machines (FSM) for recognizing tokens

Example of recognizing the keyword 'if' using FSM

Finite state machine for identifiers, allowing letters, underscores, and digits

Finite state machine for integers, including sign and digit recognition

Combining multiple FSMs into a single FSM for the lexical analyzer

Introduction to non-deterministic finite automata (NFA) and its role in lexical analysis

Conversion of NFA to deterministic finite automata (DFA)

Minimization of DFA to minimized DFA (MDFA)

Elimination of comments and white spaces by the lexical analyzer

Handling of different types of white spaces by the lexical analyzer

Macro expansion facilitated by the lexical analyzer

Creation of symbol table entries for identifiers by the lexical analyzer

Communication between the lexical analyzer and the syntax analyzer

Summary of the lexical analyzer's working principle

Transcripts

00:06

Hello everyone, and welcome back. In this session we are going to be properly introduced to the lexical analyzer, so let's get to learning. Coming to the outcome of this session: today we will learn the working principle of the lexical analyzer in detail.

00:26

Now, in the session on the different phases of a compiler, we observed that when we gave an arithmetic expression to the lexical analysis phase, the lexical analyzer produced a stream of tokens. We also came to know that the lexical analyzer uses regular expressions for recognizing the tokens; basically, it uses the regular grammar, or type 3 grammar, for recognizing the tokens.

00:52

So, if we are to note down the features of the lexical analyzer successively: it scans the pure high-level language code line by line, and while doing so it takes lexemes as inputs and produces tokens as outputs.

01:09

Now, coming to tokens, there are several types. A token may be an identifier; an operator; a constant, which is basically a fixed value that we assign to a variable in the source code itself; or a keyword, of which every programming language has several (if we consider the C programming language, if, int, return, etc. are its keywords). Tokens can also be literals, that is, string literals. The punctuators like commas, semicolons, parentheses, braces, brackets, etc. are also tokens. Finally, the special characters like ampersand, underscore, etc. are tokens as well. So tokens can be broadly classified into these seven categories.

02:01

Now, in a lexical analyzer two functions take place: first the scanning, and then the analyzing. Lexemes are given to the scanning phase, which in turn eliminates the non-token elements such as comments, consecutive white spaces, etc. Thereafter, in the analyzing phase, the actual magic happens, and we get the tokens at the end of it. Let me illustrate how this analyzing phase works.

02:31

Let's consider a few tokens of the C language. Now, if is a keyword of C, and that's why it is a token. For recognizing if, the lexical analyzer will require a finite state machine, or FSM. So from the initial state A, seeing i, we will move to the next state B; then from B, seeing f, we will end up in the final state C.

03:00

Considering identifiers, we have already briefly observed the finite state machine for them. So from the initial state, say D, seeing a symbol, be that any small letter in the range a to z, any capital letter in the same range, or an underscore, we will end up in the final state E, because we know that a single letter can also be an identifier; for most programmers, the variable that runs any loop by incrementing itself is i. Additionally, an identifier can have any number of letters, underscores, or digits; nonetheless, they must follow either a letter or an underscore.

03:46

Now, if we talk about integers: for an integer, starting from the initial state, say F, if we see any sign, be that a positive or a negative sign, we will move to the next state G, and from there, seeing any digit from the range 0 to 9, we will end up in the final state H. Now, this much is sufficient for single-digit integers; however, for two or more digits we will have an epsilon transition from the final state H back to G. That is, from H, without seeing anything, we can move to G; using this portion, we can have any number of digits. By the way, integers can also have no sign at the beginning, because for positive integers the positive sign is implied, so we need not mention it. Therefore, to facilitate that property, we will have an epsilon transition from the initial state F to the state G.

04:53

Now, these are individual finite state machines with their own initial states. In the analyzing phase of the lexical analyzer, we cannot have separate finite state machines; it will need a single finite state machine combining all the separate ones. Assuming our C language has only these three finite state machines and can only recognize the keyword if, identifiers, and integers, let me show you how these will be combined into a single finite state machine. What we will do is introduce a new initial state, say S, and have epsilon transitions from it to all these different initial states. Now, this one is an NFA, or non-deterministic finite automaton; to be precise, it actually is an epsilon NFA, which is an NFA that contains epsilon moves. Moreover, the NFA is conceptual; that is, we cannot implement it. For implementation, we will have to derive the equivalent deterministic finite automaton, or DFA.

06:07

Now, this one here is the equivalent DFA. It will recognize the keyword if, the identifiers, and also the integers. Let me show you how. Starting from the initial state S, seeing the letter small i, we will move to state 1, and from there, seeing f, we will move to state 2. Since that is a final state, if we stop here, it will mean that this DFA has recognized the input lexeme as the keyword token if. Now, if the lexeme has got something more than if (suppose it is iffy), in that case we won't stop at state 2: with the first f after if, we will move to state 3, and then for the last y we will use the self-loop and remain in that state only. Now, the question is why 1 is also a final state. If you remember, earlier in this session I told you that the identifier with which most programmers run loops is i. So if we see i alone, without any following f, then it is not a keyword; rather, it is an identifier. So if the DFA stops on reaching state 1, it recognizes the lexeme as an identifier. Now observe the transition from 1 to 3: here we have intentionally chosen the range of small letters a to e or g to z. So after i, if we have any small letter except f, any capital letter, an underscore, or any digit, we will move to state 3. Now observe this other transition: its ranges of small letters are a to h or j to z, that is, any small letter except i. So from the initial state S, seeing any small letter except i, any capital letter, or an underscore, we can move to the final state 3. So if the machine stops either in 1, or in 3 through any of these transitions, it will recognize the lexeme as an identifier.

08:33

Now coming to the lower portion of the DFA: starting from the initial state, seeing a digit from the range 0 to 9, we can end up at the final state 4, which will recognize any single-digit positive integer. Thereafter, the self-loop on state 4 will help the DFA recognize any other positive integer with two or more digits. Now, if in the lexeme the sign has explicitly been specified, then starting from the initial state, seeing any sign, either positive or negative, the DFA will move to the next state, state 5, and from there, seeing a single digit, it will move to the final state 4. Then again, the self-loop of 4 will help the DFA recognize signed integers with two or more digits. So basically, starting from the initial state, if the DFA ends up at the final state 4, it will recognize the token as an integer. In a nutshell: the final state 2 recognizes the keyword if, the final states 1 and 3 recognize identifiers, and 4 recognizes integers. And DFAs like this are implemented within the analyzing phase of the lexical analyzer.

10:00

So we can state that the lexical analyzer takes lexemes as input and produces tokens, and while doing so it makes use of the DFA for pattern matching. Now, to be honest, I have illustrated the working principle of the DFA, but I haven't shown how we can construct DFAs. For that, the learners need to have the concepts of NFA, that is, non-deterministic finite automata; thereafter, the procedure of conversion from NFA to the deterministic finite automata, or DFA; and finally, the knowledge of minimization of DFA to the minimized DFA, or MDFA, is also required. These can all be learned from our beautifully presented Theory of Computation course, so it's my strong recommendation to all of you to go to that playlist and kindly learn them from there.

10:58

Now, if you remember, we were noting down the features of the lexical analyzer, right? Let's continue that. Apart from these two, the lexical analyzer also removes comments and white spaces from the pure high-level language code. In C, two types of comments are there: single-line comments and multi-line comments. While scanning, the lexical analyzer detects comments: if it encounters a double slash, it ignores the rest until a new line, and if it encounters a slash-star, everything afterwards will be ignored until a star-slash has been scanned. Using these patterns, the lexical analyzer classifies the comments as non-token elements and eliminates them. Consider the statement: here we have the keyword int, which specifies the data type, and then we have a comment in between ne and so. After going through the scanning phase of the lexical analyzer, the comment will be removed; however, a blank white space will be placed in place of the comment. Now, during semantic analysis an error will be generated: there, so will be reported as an undeclared variable. However, for the lexical analyzer, ne and so will simply be two different tokens.
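
(To illustrate the statement just described; the exact code is not shown in the captions, so this reconstruction is an assumption:)

    /* Before scanning: a comment sits between "ne" and "so". */
    int ne/*neso*/so;

    /* After the scanning phase: the comment is replaced by one
       blank, so the lexical analyzer sees two tokens, ne and so. */
    int ne so;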

12:29

Now, coming to white spaces, these are of several types: the space, which is inserted in the source code using the space bar key; the horizontal tab, which can be inserted using the Tab key; the new line; the vertical tab, which is six times the new line; thereafter the form feed, which is a page-breaking ASCII character; and finally the carriage return, that is, the space created by the Enter key of the keyboard. All these white spaces are non-token elements, which are recognized and eliminated by the scanning phase of the lexical analyzer.

13:11

Now, along with these, the lexical analyzer also helps in macro expansion in the pure high-level language code. Basically, it allocates the value specified by the macro in the #define preprocessor directive to all the instances where the macro has been used in the code.

13:32

Now, we already know that after the pure high-level language code has been given to the lexical analyzer, it creates the entries for the identifiers in the symbol table. Along with that, it produces tokens and hands them over to the syntax analyzer. Now, this doesn't happen all at once: after one token is passed to the syntax analyzer, the syntax analyzer performs parsing, and during that it asks the lexical analyzer for the next token; in response, the lexical analyzer provides the next token to the syntax analyzer. So the lexical analyzer and the syntax analyzer frequently communicate with one another. So this is how the lexical analyzer works.

14:21

So, in today's session we learned about the working principle of the lexical analyzer in detail. All right people, that will be all for this session. I hope the working principle of the lexical analyzer is clear to you now. In the next session, we will observe the tokenization of the lexical analyzer. So I hope to see you in the next one. Thank you all for watching.
