Introduction to Lexical Analyzer
Summary
TL;DR: This session provides an in-depth look at the lexical analyzer's working principle, explaining how it processes high-level language code to produce tokens. It covers the types of tokens, the scanning and analyzing phases, and the use of finite state machines (FSMs) for pattern recognition. The video also discusses handling comments, whitespace, and macro expansion, emphasizing the analyzer's role in creating symbol table entries and its interaction with the syntax analyzer.
Takeaways
- 📘 The lexical analyzer is a crucial part of a compiler that transforms high-level language code into tokens.
- 🔍 It uses regular expressions and type 3 grammar to recognize tokens, which are the basic elements of the source code.
- 📚 Tokens can be identifiers, operators, constants, keywords, literals, punctuators, and special characters.
- 🔧 The lexical analyzer performs two main functions: scanning to remove non-token elements and analyzing to produce tokens.
- 🌐 It operates on a finite state machine (FSM) to recognize different types of tokens such as keywords, identifiers, and integers.
- 🔄 The FSM can be a non-deterministic finite automaton (NFA), which is then converted to a deterministic finite automaton (DFA) for implementation.
- 🔑 The DFA recognizes tokens by moving through states based on the input characters until it reaches a final state.
- 🚮 The lexical analyzer also removes comments and white spaces which are considered non-token elements.
- 🔄 It supports macro expansion by replacing macro identifiers with their defined values throughout the code.
- 🔗 The lexical analyzer maintains a symbol table for identifiers and communicates with the syntax analyzer to provide tokens for parsing.
Q & A
What is the primary function of a lexical analyzer?
-The primary function of a lexical analyzer is to scan the source code of a high-level programming language and convert it into a sequence of tokens.
What are the different phases of a compiler where the lexical analyzer plays a role?
-The lexical analyzer plays a role in the lexical analysis phase of a compiler, where it processes the arithmetic expression and produces a stream of tokens.
How does the lexical analyzer use regular expressions?
-The lexical analyzer uses regular expressions to recognize tokens by employing regular grammar or type 3 grammar.
What are the types of tokens that a lexical analyzer can produce?
-Tokens produced by a lexical analyzer can be identifiers, operators, constants, keywords, literals, punctuators, and special characters.
What is the role of a finite state machine in lexical analysis?
-A finite state machine is used by the lexical analyzer to recognize tokens by transitioning between states based on the input characters.
How does the lexical analyzer handle comments in the source code?
-The lexical analyzer detects comments and eliminates them as non-token elements, replacing them with whitespace or simply ignoring them.
What is the difference between scanning and analyzing phases in a lexical analyzer?
-In the scanning phase, the lexical analyzer eliminates non-token elements like comments and whitespace. In the analyzing phase, it identifies tokens using the finite state machine.
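The two phases can be sketched as a minimal Python example (the regex patterns and function names here are illustrative assumptions, not from the video):

```python
import re

def scan(source):
    """Scanning phase: replace comments with a space, collapse whitespace."""
    no_block = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # /* ... */
    no_line = re.sub(r"//[^\n]*", " ", no_block)                   # // ...
    return re.sub(r"\s+", " ", no_line).strip()

def analyze(scanned):
    """Analyzing phase: split the cleaned text into lexemes (very simplified)."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", scanned)

code = "int x = 42; /* answer */ // trailing"
print(analyze(scan(code)))   # ['int', 'x', '=', '42', ';']
```

A real analyzer classifies each lexeme with a DFA rather than a regex split, but the two-stage pipeline is the same.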
How does the lexical analyzer deal with white spaces?
-The lexical analyzer recognizes various types of white spaces such as space, horizontal tab, new line, vertical tab, form feed, and carriage return, and treats them as non-token elements.
What is the significance of the symbol table in the context of a lexical analyzer?
-The lexical analyzer creates entries for identifiers in the symbol table, which is crucial for later stages of compilation like semantic analysis.
How does the lexical analyzer communicate with the syntax analyzer?
-The lexical analyzer and the syntax analyzer frequently communicate, with the lexical analyzer providing tokens to the syntax analyzer upon request during parsing.
What additional functionality does the lexical analyzer provide besides tokenization?
-Apart from tokenization, the lexical analyzer also performs macro expansion, replacing macro identifiers with their defined values in the source code.
Outlines
📘 Introduction to Lexical Analyzer
This paragraph introduces the lexical analyzer, explaining its role in the compiler process. It discusses how the analyzer scans high-level language code line by line, converting it into tokens. Tokens are categorized into identifiers, operators, constants, keywords, literals, punctuators, and special characters. The paragraph also explains the two primary functions of the lexical analyzer: scanning and analyzing. Scanning removes non-token elements like comments and white spaces, while analyzing uses finite state machines to recognize tokens.
🔍 Combining Finite State Machines
This section delves into the construction of a single finite state machine (FSM) that combines multiple FSMs to recognize different types of tokens in the C language. It explains how an epsilon NFA is used to create a non-deterministic finite automaton that can recognize keywords, identifiers, and integers. The process of converting the NFA to a deterministic finite automaton (DFA) is outlined, highlighting how the DFA recognizes tokens by moving through states based on input characters. The DFA's ability to recognize identifiers and integers is demonstrated through specific transitions and final states.
🛠 DFA Construction and Lexical Analyzer Functions
The final paragraph covers the construction of DFAs, emphasizing the need for understanding non-deterministic finite automata (NFA) and their conversion to deterministic finite automata (DFA). It also mentions the minimization of DFA to MDFA. The paragraph discusses the lexical analyzer's role in removing comments and white spaces and its interaction with the syntax analyzer. It explains how the lexical analyzer helps in macro expansion and symbol table creation, highlighting the continuous communication between the lexical analyzer and the syntax analyzer.
Mindmap
Keywords
💡Lexical Analyzer
💡Tokens
💡Finite State Machine (FSM)
💡Deterministic Finite Automata (DFA)
💡Non-deterministic Finite Automata (NFA)
💡Lexeme
💡Regular Expressions
💡Symbol Table
💡Macro Expansion
💡Epsilon Transition
Highlights
Introduction to the lexical analyzer and its working principle
Outcome of the session: understanding the lexical analyzer in detail
The lexical analyzer produces a stream of tokens from arithmetic expressions
Regular expressions are used for recognizing tokens
Type 3 grammar is used for recognizing tokens
Features of the lexical analyzer: scanning high-level language code and producing tokens
Types of tokens: identifiers, operators, constants, keywords, literals, punctuators, and special characters
Two functions of the lexical analyzer: scanning and analyzing
Lexical analyzer uses finite state machines (FSM) for recognizing tokens
Example of recognizing the keyword 'if' using FSM
Finite state machine for identifiers, allowing letters, underscores, and digits
Finite state machine for integers, including sign and digit recognition
Combining multiple FSMs into a single FSM for the lexical analyzer
Introduction to non-deterministic finite automata (NFA) and its role in lexical analysis
Conversion of NFA to deterministic finite automata (DFA)
Minimization of DFA to minimized DFA (MDFA)
Elimination of comments and white spaces by the lexical analyzer
Handling of different types of white spaces by the lexical analyzer
Macro expansion facilitated by the lexical analyzer
Creation of symbol table entries for identifiers by the lexical analyzer
Communication between the lexical analyzer and the syntax analyzer
Summary of the lexical analyzer's working principle
Transcripts
hello everyone and welcome back
in this session we are going to be
properly introduced to the lexical
analyzer so let's get to learning
coming to the outcome of this session
today we will learn the working
principle of the lexical analyzer in detail
now in the session different phases of
compiler we observed
when we gave this arithmetic expression
to the lexical analysis phase the
lexical analyzer produced a stream of
tokens
and we also came to know that the
lexical analyzer uses regular
expressions for recognizing the tokens
basically it uses the regular grammar or
type 3 grammar for recognizing the
tokens
so if we are to note down the features
of lexical analyzer successively
well it scans the pure high level
language code line by line and while
doing so it takes the lexemes as
inputs and as outputs it produces the
tokens
now coming to tokens there are several
types of it
it may be an identifier
an operator
constants
basically these are the fixed values
which we assign to the variables in the
source code itself
then there are keywords there are various
keywords of the programming language if
we consider the c programming language
if int return etc are the keywords of that
tokens can also be literals that is
string literals
the punctuators like comma semicolons
parenthesis braces brackets etc are also
tokens
finally the special characters like
ampersand underscore etc are tokens as
well
so tokens can be broadly classified into
these seven categories
now in a lexical analyzer two functions
take place
first the scanning and then analyzing
lexemes are given to the scanning phase
which in turn eliminates the non-token
elements such as comments consecutive
white spaces etc
thereafter in the analyzing phase the
actual magic happens and we get the
tokens at the end of that
let me illustrate how this analyzing
phase works
let's consider a few tokens of the c
language
now if is a keyword of c and that's why
it is a token
now for recognizing if
the lexical analyzer will require a
finite state machine or fsm
so from the initial state a seeing i we
will move to the next state b
then from b
seeing f we will end up in the final
state c
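The three-state machine just described (a on 'i' to b, b on 'f' to final state c) can be sketched as a transition table; the state names follow the video, the rest is a hedged sketch:

```python
# Transition table for the keyword "if": a --i--> b --f--> c (final)
transitions = {("a", "i"): "b", ("b", "f"): "c"}
final_states = {"c"}

def accepts(lexeme, start="a"):
    """Walk the machine character by character; accept only in a final state."""
    state = start
    for ch in lexeme:
        state = transitions.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in final_states

print(accepts("if"))   # True
print(accepts("ifx"))  # False
print(accepts("i"))    # False (stops in b, which is not final)
```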
considering identifiers
well we already have briefly observed
the finite state machine for that
previously
so from the initial state say d
seeing a symbol may that be any small
letter in the range of a to z
or any capital one in the same range or
an underscore
we will end up in the final state e
because we know that a single letter can
also be an identifier like for most
programmers the variable that runs any
loop by incrementing itself is i
additionally an identifier can have any
number of letters underscores or digits
nonetheless they must follow either a
letter or an underscore
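The identifier machine just described (a letter or underscore first, then any number of letters, underscores, or digits) reduces to a simple regular expression; this is a sketch, not the exact machine from the video:

```python
import re

# First character: letter or underscore; then letters, digits, underscores.
identifier = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")

print(bool(identifier.match("i")))       # True  (single-letter identifier)
print(bool(identifier.match("_count")))  # True
print(bool(identifier.match("2nd")))     # False (must not start with a digit)
```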
now if we talk about integers for an
integer starting from the initial state
say f
if we see any sign be that a positive or
negative sign
we will move to the next state g and
from there seeing any digit from the
range 0 to 9 we will end up in the final
state h
now this much is sufficient for single
digit integers
however for two or more digits we will
have an epsilon transition from the
final state h to g
that is from h without seeing anything
we can move to g
now using this portion
we can have any number of digits
by the way integers can also not have
any sign at the beginning
because for positive integers the
positive sign is implied
so we need not mention that
therefore to facilitate that property we
will have an epsilon transition from the
initial state f to this state g
now these are individual finite state
machines with their own initial states
in the analyzing phase of the lexical
analyzer we cannot have separate finite
state machines it will need a single
finite state machine combining all the
separate finite state machines
assuming our c language has only these
three finite state machines and can only
recognize the keyword if
identifiers and integers
let me show you how these will be
combined as a single finite state
machine
so what we will do
we will introduce a new initial state
say s
and we will have epsilon transitions
from that to all these different states
now this one is an nfa or
non-deterministic finite automata
to be precise it actually is an epsilon
nfa
epsilon nfa is the nfa which contains
epsilon moves
moreover nfa is conceptual that is we
cannot implement it
for implementation we will have to
derive the equivalent deterministic
finite automata or the dfa
now this one here
this is the equivalent dfa
it will recognize the keyword if the
identifiers also the integers
let me show you how starting from the
initial state s
seeing the letter small i
we will move to the state one
and from there
seeing f we will move to the state two
and since it is a final state
if we stop here it will mean that this
dfa has recognized the input lexeme as
the keyword token if
now if the lexeme has got something more
than if suppose it is iffy
in that case we won't stop at the state
2
with the first f after if
we will move to the state 3
and then for the last y
we will use this self loop and remain in
this state only
now the question is why one is also a
final state
if you remember earlier in this session
i told you that the identifier using
which most programmers run loops is i
so if we see i only which doesn't have
any following f
then it is not a keyword
rather it is an identifier
so for the dfa if it stops reaching the
state 1 it recognizes that as an
identifier
now observe the transition from 1 to 3
here we have intentionally chosen the
range of small letters a to e or g to z
so after i except small f if we have any
other small letter or any capital letter
or an underscore or any digit
we will move to the state 3
now observe this transition here
if you observe the ranges of small
letters a to h or j to z
that is any small letter except i so
from the initial state s seeing any
small letter except i or any capital
letter or an underscore
we can move to this final state 3.
so if the machine either stops in one or
in three through any of these
transitions
it will recognize the lexeme as an
identifier
now coming to the lower portion of the
dfa starting from the initial state
seeing a digit from the range 0 to 9
we can end up at the final state 4 which
will recognize any single digit positive
integer
thereafter this self loop on this state
4 it will help the dfa to recognize any
other positive integer with two or more
digits
now if in the lexeme the sign has
explicitly been specified for those
starting from the initial state
seeing any sign be that either positive
or negative the dfa will move to the
next state that is this state 5
and from this seeing a single digit it
will move to the final state 4 then
again the self-loop of 4 will help the
dfa to recognize the signed integers
with 2 or more digits
so basically starting from the initial
state if the dfa ends up at the final
state 4 it will recognize the token as
an integer
in a nutshell the final state 2
recognizes the keyword if
final states 1 and 3 recognize
identifiers
and 4 recognizes integers
and dfas like this are implemented
within the analyzing phase of the
lexical analyzer
so we can state the lexical analyzer
takes lexemes as input and produces
tokens and while doing so it makes use
of the dfa for pattern matching
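The combined DFA walked through above can be sketched as a table-driven classifier; the states s, 1, 2, 3, 4, 5 are named as in the video, while the character-class encoding is my own assumption:

```python
def classify(lexeme):
    """Run the combined DFA: distinguish the keyword 'if', identifiers, integers."""
    state = "s"
    for ch in lexeme:
        if state == "s":
            if ch == "i":                    state = "1"
            elif ch.isalpha() or ch == "_":  state = "3"
            elif ch.isdigit():               state = "4"
            elif ch in "+-":                 state = "5"
            else:                            return "error"
        elif state == "1":   # saw 'i'; 'f' completes the keyword, else identifier
            state = "2" if ch == "f" else ("3" if ch.isalnum() or ch == "_" else "error")
        elif state in ("2", "3"):  # anything more after 'if' makes it an identifier
            state = "3" if ch.isalnum() or ch == "_" else "error"
        elif state in ("4", "5"):  # digits after an optional sign
            state = "4" if ch.isdigit() else "error"
        if state == "error":
            return "error"
    # Final states: 2 -> keyword, 1 and 3 -> identifier, 4 -> integer.
    return {"2": "keyword", "1": "identifier", "3": "identifier", "4": "integer"}.get(state, "error")

print(classify("if"))    # keyword
print(classify("iffy"))  # identifier
print(classify("i"))     # identifier
print(classify("-42"))   # integer
```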
now to be honest i have illustrated the
working principle of the dfa
but i haven't shown how we can construct
dfas
so for that the learners need to have
the concepts of nfa that is
non-deterministic finite automata
thereafter the procedure of conversion
from nfa to the deterministic finite
automata or dfa
and finally the knowledge of
minimization of dfa to the minimized dfa
or mdfa is also required
these all can be learned from our
beautifully presented theory of
computation course so it's my strong
recommendation to all of you that please
go to that playlist and kindly learn
these all from there
now if you remember we were noting down
the features of the lexical analyzer
right
let's continue that
now apart from these two the lexical
analyzer also removes comments and white
spaces from the pure high level language
code
in c two types of comments are there
single line comments and multi-line
comments
while scanning the lexical analyzer
detects comments
if it encounters double slash it ignores
the rest until a new line and if it
encounters slash star
everything afterwards will be ignored
until a star slash has been scanned
using these patterns the lexical
analyzer classifies the comments as
non-token elements and eliminates those
consider the statement
here we have the keyword int which
specifies the data type
then we have this comment in between
ne and so
after going through the scanning phase
of the lexical analyzer the comment will
be removed however a blank white space
will be placed in place of the comment
now during semantic analysis an error
will be generated
therefore so will be reported as an
undeclared variable
however for lexical analyzer ne and so
will be two different tokens
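The comment-to-blank substitution just described can be sketched like this; the single space it leaves behind is exactly what makes ne and so come out as two separate tokens:

```python
import re

def strip_comments(line):
    """Replace /* ... */ and // ... comments with a single blank, as the scanner does."""
    line = re.sub(r"/\*.*?\*/", " ", line, flags=re.DOTALL)
    return re.sub(r"//.*", " ", line)

stmt = "int ne/*comment*/so;"
print(strip_comments(stmt))  # int ne so;  -> 'ne' and 'so' are now two tokens
```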
now coming to white spaces these are of
several types
the space which is inserted in the
source code using the space bar key
then the horizontal tab which can be
inserted using the tab key
then the new line
now vertical tab is six times the new
line
thereafter the form feed
which is a page breaking ascii character
and finally the carriage return that is
a space created by the enter key of the
keyboard
all these white spaces are non-token
elements which are recognized and
eliminated by the scanning phase of the
lexical analyzer
now along with these the lexical
analyzer also helps in macro expansion
in the pure high-level language code
basically it allocates the value
specified by the macro in the #define
preprocessor directive to all
the instances where the macro's been used
in the code
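A sketch of that substitution for object-like #define macros (function-like macros and rescanning are ignored here; this is an illustration, not a real preprocessor):

```python
import re

def expand_macros(source):
    """Collect #define NAME VALUE lines, then replace each macro name in the code."""
    macros = dict(re.findall(r"#define\s+(\w+)\s+(\S+)", source))
    body = re.sub(r"#define[^\n]*\n", "", source)       # drop the directive lines
    for name, value in macros.items():
        body = re.sub(rf"\b{name}\b", value, body)      # whole-word replacement only
    return body

src = "#define MAX 100\nint a[MAX];\n"
print(expand_macros(src))  # int a[100];
```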
now we already know
after the pure high level language
code's been given to the lexical analyzer
it creates the entries for the
identifiers in the symbol table
along with that it produces tokens and
hands those over to the syntax analyzer
now this doesn't happen at once
after one token is passed to the syntax
analyzer the syntax analyzer performs
parsing and during that it asks the
lexical analyzer for the next token and
in response it provides the next
token to the syntax analyzer so the
lexical analyzer and the syntax analyzer
frequently communicate with one another
so this is how the lexical analyzer
works
so in today's session we learned about
the working principle of the lexical
analyzer in detail
all right people that will be all for
this session i hope the working
principle of lexical analyzer is clear
to you now in the next session we will
observe the tokenization of lexical
analyzer
so i hope to see you in the next one
thank you all for watching