程序代做CS代考 compiler algorithm Compiler Design Week 3 – cscodehelp代写

Compiler Design Week 3

Announcement
• Part 1 of the Project has been released
• Deadline – 15-Aug-2021 23:59
Tasks:
• PartA – Writing codes in CD21 (5%)
• PartB – Scanner for CD21 (15%)
2
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au

Detailed content
Weekly program
 Week  Week
 Week
1 – Introduction to Compiler Design 2 – CD Programming Language
3 – Lexical Analysis
3
 Week
 Week
 Week
 Week
 Week
 Week
 Week
 Week 11 – Code Optimisation  Week 12 – Revision
4 – Syntax Analysis
5 – Top Down Parsing II
6 – Symbol Table and Error Recovery 7 – Bottom-Up Parsing
8 – Stack Machine (SM) Architecture 9 – Semantic Analysis
10 – Code Generation
 Week 13 – Extra revision (if needed)
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au

Week 02 Lecture Outline
Lexical Analysis
 How to recognize Tokens?
 State Transition Diagrams
 Programming STDs
 Recognizing different types of tokens  CD21 Scanner
 Token Structure
 Symbol Table structure
 RE to DFA directly
 nullable, firstpos, lastpos, followpos  RE to DFA directly – algorithm
4
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au

How do we Recognize Tokens?
• Lexical Items: can be described by a Regular Expressions (and therefore Regular Grammar) and so can be recognised with …
From COMP2270:
• We can convert RE to NFA (NDFSM) [Thompson’s construction]
• Convert NFA’s to DFA (DFSM) [subset construction]
• Minimize the DFA [state minimization]
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
5

How do we Recognize Tokens?
• State Transition Diagrams
– The actions necessary to recognise items
– Annotate the STD to indicate the actions that must be performed when an item is recognised.
• Eg. A simple language which only has the lexical items:
, comment — to end of line, the delimiters : and ; and the
operators + and :=
– What happens when we hit an “=“ character?
– Ans: It Depends. If previous char was : then we have :=, else
we have an illegal character.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
6

Drawing the State Transition Diagram
• The behaviour of the Scanner depends on what state it is in:
GC
Comment
newline
“-”
“-”
Is it a Comment?
“;”
S “:”
Assign?
03/08/2021
“=”
Scanning Indeterminant Text
Ident
letter, digit
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
7

Annotating the State Transition Diagram
• An unlabelled edge is a default transition and
– Generally does not consume the current input character.
– Can only have 1 per state.
– Whenever we cannot match the next character with any labelled edge then we take the default transition.
• If the default transition needs to consume input (eg Comment), then it must be explicitly labelled GC (get character).
• If a node has no default transition and the current character doesn’t match any edge then it is treated as illegal and consumed.
• These diagrams can easily be written in a language like Pascal or C by using lots of Go To statements.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
8

Programming a State Transition Diagram
type States = (……);
state:States
while (…) do // eg “while not eof do”
– –
03/08/2021
Note that these programs can find an item but aren’t useful until they can be made to do something with that item.
If we have an output routine to o/p found items, we can show this on the STD, i.e. the action to be performed. E.g. out(ident).
case state of
S : c = getNextChar();
comment?: comment : assign? : ident :
if (c==ws) { state = S; }
else if (c==‘:‘) state = assign;
else if (c==‘-‘) state = comment?;
……
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
9

The OUT Action

These actions can easily be incorporated into the STD:
Note that the output of a token is always associated with a transition back to the initial state: S
GC
Comment
newline
S “:”
Is it a Comment?
“-”
“;” out(semi) out(ident)
out(colon)
03/08/2021
Assign?
“=” out(asgn)
Scanning Indeterminant Text
Ident
letter, digit
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
Not every transition back to S is a token, however: e.g. Comments
10

Token Numbers and Lexemes (1)
• We can now output the classes of the items found, but we need to extend this to output token numbers in a computer language.
• For most lexemes (keywords, operators, etc.) the classification of the lexemes as a particular token number is sufficient for the later phases of the compiler.
• But some lexemes such as identifiers and literals, we not only need to record the classification (the token number) but also the actual lexeme (String).
• If the scanner is to distinguish between different identifiers (or literals), then it must build up a buffer of “the string of chars so far in this item” so that this can become part of the token object produced by the scanner.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
11

CLEAR, CAT and LOOKUP
As an identifier is recognised an (initially empty) buffer of the characters in the identifier name must be built.
The token returned can even be a tuple containing the token type and a Symbol Table position for the identifier itself
S
Ident
letter, Digit CAT
CAT Buffer:
m
y
V
a
r
?
A flat buffer that simply takes the next char – cleared on starting new identifier
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
12

Token Numbers and Lexemes (2)
• So the token needs to carry the actual lexeme along with it, as well as the identifier classification, and so the token must become (at least) a tuple .
• There will probably be other token classifications for which the actual lexeme is also needed,
– e.g. integer literal values, real literal values, and string constants.
• For all other tokens the lexeme field can be NULL as the token number carries all the information necessary for later phases.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
13

Keywords (1)
• An extended simple language which only has the lexical items: , comment — to end of line, the delimiters : and ; and the
operators + and := and reserved keywords if, else, for
S
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
Ident
digit
letter,
14

Keywords (2)
• We can also incorporate keywords easily by:
– using the ident state of the STD to find them.
– having the keywords already set in the S/T or a lookup table.
– having a special field in this table which gives the token no.
• Ident now becomes a procedure and the out(ident) becomes out(TOKENNUMBER(lexeme),StrVal) with StrVal either being the current string buffer representing the lexeme itself for identifiers and literals, or NULL otherwise.
• Something to do – start thinking about S/T routines. Hash table? How to lookup and insert? What information is to be stored?
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
15

Keywords (3)
• It is also possible to simply do an exhaustive list check for keywords once they have been recognized by the “identifier” section of the scanner.
• This makes it a little easier to check for the keywords being case in- sensitive.
e.g. String lower = stringvalue.toLower(); if(lower.equals(“tcd21”)) return Token.TCD21;
if(lower.equals(“if”)) return Token.TIFTH; etc …
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
16

Program Listing
• Or more particularly, output in general
• It is a good idea to think about this at the beginning of the compiler as decisions made now could have design ramifications in later phases
– The scanner is the only part of the compiler that will know about comments and so is probably the best place to produce the program listing –
– But, how about reporting errors (especially in later phases)
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
17

Program Listing (2)
• It is now rare to actually have part of the compiler produce a copy of the source input, because most software development is now done in IDEs with the source code displayed in its own window.
• An IDE will also have another window for compiler error messages which will each be linked to its pertinent section of the source code window.
• However we do not have an IDE for development of CD21 programs and so we have to produce “batch oriented” output that will serve a similar purpose.
• A program listing with line numbers and interleaved error messages – or a separate listing of errors with line numbers to cross-reference to a position in the source code is what will be needed for our project.
• Producing the final executable will be done similarly – via a special file which is then used as input into the loader of the stack machine which will execute the compiled CD21 programs.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
18

Advanced Scanning
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
19

Advanced Scanning
• It is possible for a STD to have more than 1 start state.
• This is used within a scanner to take care of special cases:
– When multiple tokens are revealed before one of the tokens can be returned
– The first token is returned and the scanner left in a state that indicates a new starting point for STD processing for the next time it is called.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
20

Advanced Scanning (cont)
• E.g. Floating point numbers in the format 123.456 which have definitions that “overlap” with other tokens.
– If . is an operator (the dot operator) and 123. is not a valid floating point number, then text of the form 123.abc will discover and consume the decimal point before it realises that 123 is an integer.
– Some might want to back up the input but this is not necessary.
• The 123 integer token can be returned and the scanner left in a state that indicates that the next token (the dot operator) has already been found, this can be returned and the scanner reverts to its normal start state whereupon it will discover that abc is an identifier.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
21

A Scanner for CD21
• The input file could be read and processed character by character, or read line by line and then processed character by character within each line.
• Identifiers can be easily found.
– Keywords can be discovered as if they are identifiers, and then tested or found in the symbol table.
• Integers are simply a sequence of digits.
• Finding a floating point literal can be achieved by treating the definition
as if it were .
– The same code that recognizes an integer literal will be used to find
the first half of a floating point literal.
• One and two character operators are easily found.
• Comments are easily ignored.
• End-Of-Input?: Only the source file recognizes end-of-file, and so the
scanner can have a special token that represents end-of-input, e.g.
tokNo = TEOF 03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
22

A Scanner for CD21 (2)
• So, what are the difficult bits?
• The only really challenging parts of the scanning process are when things go a awry with the input (lexical errors).
– What happens when you reach eof when you didn’t expect to, e.g. in the middle of a string, or in the middle of a token?
– What happens when you find a ! that is not followed by = ?
– How do you handle (process) lexical errors?
– How do you report errors?
– What sort of information needs to be held for each token?
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
23

What constitutes a Token?
• What the Parser will need
– A number that gives the token type for the parser
– A lexeme string for identifiers, numeric literals, and string constants.
– We will see later how the context of the token can be used to
update a Symbol Table according to this lexeme.
• In order to check that things are working properly, you might also
add:
– Line number where the token was found?
– Column number where token was found?
• So each token would become an object which is (at least) a 4- tuple if this information were to be included:
– (enum tokenNo, String lexeme, int lineNo, int colNo)
– this will be as much information as the scanner is able to
obtain about any identifier or literal.
• For all other tokens, the token number will be all that is needed,
so all that is actually required is (tokenNo, null, lineNo, colNo).
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
24

What else might the Token need?
• Nothing – if we were only worried about scanning, but scanning will simply supply a stream of tokens as input to the Parser.
• So we need to look ahead to make sure our design contains whatever might be needed in future phases.
• Any extra information will ONLY be required for identifiers and constant literals.
• Every time the constant literal (1 for example) is used, it will be a different token, but it will be the SAME literal value for later phases.
• Every time the identifier X is used it may refer to the same identifier, or it may refer to a different identifier because of a different context (a different scope within the program).
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
25

Getting ready for the Symbol Table
• This will be where the symbol table comes into play:
• The token is made “Symbol Table ready” by adding another attribute
(using StRec as a Symbol Table Record)
– (enum tokenNo, String lexeme, int lineNo, int colNo, StRec st)
– this final attribute is set to null initially.
• Equivalently, we can be prepared to add this attribute to our Token class as we move into later phases.
• This will allow the parser to take context into account when deciding which record in the Symbol Table, each token will refer to.
• Once the later phases have the symbol table position of an identifier, each Token object will have completed its task.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
26

Any special things about CD21 scanning?
• Not really – CD21 has a very standard lexical structure that fits well into the usual means of lexical analysis
• The task of the scanner for CD21 will be to
– return next valid token.
• So a call to the scanner,
– will process the input from the current input position using a DFA
– reporting any lexical errors that are found and
– skipping over comments, until a valid token is found – and
– then returning that token to the caller.
• The caller – will eventually be the Parser which will use the Scanner as its input routine.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
27

A General Approach to Scanner Design
• The special things to remember and a major guide to scanner design ……
 You should be able to write the scanner without reference to the syntax of the language, all you should need is the lexical definition of the language
 The semantics of the scanner are: Return Next Valid Token
 If you are referring to the syntax specifications you are probably (very possibly?) doing something that belongs in the parser.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
28

Basic Structure of a Compiler
Source Code
Tokens
Syntax Object Tree Code
Lexical Analyser
Syntax Analyser
Code Gen.
Symbol Table
This is a nice diagram, but it is a little too simple for what is now needed.
 We need to look at the overall architecture of our compiler now rather than later, because we need to avoid mistakes that require re-design.
 So whether the scanner actually provides lexemes or proper tokens is something that we need to look at closely.
 CD21 has variable scoping for procedures and so it is the parser which will know the correct context of an identifier for it to go into the Symbol Table.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
29

The Architecture of a CD21 Compiler
Source Code
Tokens
Identifier and
literal tokens become Symbol Table records
Syntax Object Tree Code
Scanner
Parser
Code Gen.
Symbol Table
 The Scanner passes tokens to the Parser, but for those tokens which have valid lexeme strings associated with them (ids and literals), the lexeme string accompanies the token.
 The Parser can then lookup and/or insert these values into the Symbol Table, so that the Symbol Table record now replaces the Token- Lexeme as the pertinent description and information holder of that identifier or literal.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
30

Symbol Table Structure (A)
• Maybe we don’t have to worry so much about multiple lookup, and should just concentrate on efficient insertion.
• What should be stored?
– A string to match the string buffer
– A token type to return
– A type (keyword, etc.)
• Eventually we will have a number
Name String
Token Type
Type
eg keyword, int, array
Attributes, eg. Length, (poss. added later)

of other attributes:
Other attributes, eg. Array length Extra things to be added later?


• The main point at this stage is to be able to insert and lookup an identifier based on its lexeme string.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
31

Symbol Table Structure (B)
• A hash table – provides constant time lookup and insert.
• Just hashing to an array of lists of records should be enough.
• Linear searching of the list won’t cost much as we are only writing small programs and so hash synonyms won’t be much of a problem.
• We don’t have to worry about having the best hash table but the hash table should work.
• What routines are needed?
– Lookup and Insert?
– Or should we just have a single LookUpAndInsert routine?
• Either way the routine(s) should return a pointer (or reference) to the S/T record.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
32

Symbol Table Structure (C)
• In order to be ready for the later phases, there are some things that we can do now that may make life easier later on:
– Identifiers are straight forward, but later we may have to deal with visibility issues, how might that impinge on what we need to do for the scanner?
– Some architectures actually set up read-only-memory for some Integer Literals and all Floating Point Literals, so if we store these into the symbol table (or a separate symbol table?), it may make them easier to deal with.
– String Literals might also need to be stored in a way that allows easy use of them later (e.g. so they can be stored as values for string variables, or passed to a print routine).
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
33

The Architecture of a CD21 Compiler
Source Code
Scanner Tokens
Syntax Object Tree Code
Symbol Table
Parser
Code Gen.
• The Scanner classifies the lexemes and provides them to the Parser
• The Parser inserts them into the Symbol Table and updating identifier
information in the Symbol Table (taking into account the language
scoping rules),
• The Code Generator can retrieve the required identifier information for
translation at a later stage.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
34

Architecture: Relationship between the Scanner, the Parser and the Symbol Table
Tokens are used in Parsing and identifier and literal information is stored into the Symbol Table.
Source Code
Tokens as Classified Lexemes
Parser
Scanner
GetToken
Parser I/P Routine
The Parser has a special input routine
 accepts the classified lexemes
 turns them into complete tokens
 i.e. symbols that are recorded in the
symbol table according to scoping rules.
Two actions are possible:
Symbol Table
• If this symbol is being referenced for the first time, then it is inserted into the S/T and the token is then updated to refer to that new symbol in the S/T.
• If the symbol represented by the token has already been entered into the symbol table then this current token can be updated to refer to the already entered symbol.
The parser can take context (scoping, etc) into account in doing this.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
35

Architecture: Relationship between the Scanner, Parser, Listing & Error Reporting
Source Code
Listing
Classified Lexemes
(Tokens)
Error Messages
Semantic Error Messages – – later
Scanner
Output Controller
Symbol Table
• Both the Scanner and the Parser (and later the Semantic Analyser) will need to produce a program listing and report errors.
• Only the Scanner can produce the listing (otherwise comments are lost)
• BUT All 3 (Scanner, Parser, and Semantic Analysis) must be able to produce error messages in a coherent and cooperative way.
• A separate Output Controller object can coordinate the output (the feedback) to
the CD21 programmer. 03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
Parser
GetToken
Parser I/P Routine
36

Project Part 1 – The Scanner
Source Code
Listing
Part 1 Results
Tokens
Error Messages
“Replace” the parser with a simple driver program that allows the Scanner to be written as
an input object for the complete
compiler.
Scanner
Parser
GetToken
Parser I/P Routine
Output Controller
Symbol Table
Get ahead by implementing
a Symbol Table for future phases.
Repeat
t = sc.getToken()
printToken (t)
Until token no == TEOF
A special output routine for the part 1 output of tokens
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
37

From Regular Expression to DFA Directly
• Need to look on the details of the
– Thompson’s construction algorithm (RE to NFA)
– Subset construction algorithm (NFA to DFA)
– What roles played by various states?
• A state of an NFA is important if it has a non- out-transition,
– for a given state s, if move({s},a)   for some a then s is an
important state
• In Thompson’s construction algorithm each important state corresponds to a particular operand in the regular expression
• The subset construction algorithm uses only the important states when it determines
-closure(move(T,a))
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
38

NFA for (a|b)*abb
• Important states are numbered and other states are represented by
letters
 1aC
start
AB E3a4b5b6 
2bD 
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
39

From RE to DFA Directly – Method
• The constructed NFA’s accepting state is not an important state.
1. Augment the regular expression r with a special end symbol # to make
accepting states important: the new expression is r#
2. Construct a syntax tree for r#
3. Traverse the tree to construct functions nullable, firstpos, lastpos, and followpos
4. Use the algorithm for constructing DFA from those functions
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
40

Syntax Tree of (a|b)*abb#
• Leaves are labelled by  or by an alphabet symbol
• To each leaf not labeled by , an integer (position number) is attached
– Position of the leaf/symbol
– A symbol can have multiple positions
closure
alternation
5
concatenation
#
6
b
*
|
ab
a
3
b
4
position
number (for leafs )
03/08/2021
12
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
41

NFA for (a|b)*abb#
• Important states are numbered and other states are represented by
letters
• Positions in the syntax tree (last slide) corresponds to important states
of the NFA
AB E3a4b5b6#F 
 1aC
start
2bD 
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
42

Annotating the Syntax Tree
• nullable(n): the subtree at node n generates languages
including the empty string
• firstpos(n): set of positions that can match the first symbol of a
string generated by the subtree at node n
• lastpos(n): the set of positions that can match the last symbol of
a string generated be the subtree at node n
• followpos(i): the set of positions that can follow position i in the
tree
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
43

Annotating the Tree: firstpos, lastpos
Node n
nullable(n)
firstpos(n)
lastpos(n)
Leaf 
true


Leaf i
false
{i}
{i}
|
/
c1 c2
nullable(c1) or nullable(c2)
firstpos(c1)  firstpos(c2)
lastpos(c1)  lastpos(c2)

/
c1 c2
nullable(c1) and nullable(c2)
if nullable(c1) then firstpos(c1)  firstpos(c2) else firstpos(c1)
if nullable(c2) then lastpos(c1)  lastpos(c2) else lastpos(c2)
*
|
c1
true
firstpos(c1)
lastpos(c1)
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
44

Annotated Syntax Tree of (a|b)*abb# {1, 2, 3} {6}
{1, 2, 3} {5}
{6} #{6} 6
nullable
{1, 2, 3} {4}
{5} b{5} 5
{1, 2, 3} {1,2} *{1,2}
{4} b{4} 4
3
{3} {3}a{3}
firstpos lastpos
{1, 2} | {1, 2}
{1} a{1} {2} b{2}
12
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
45

Calculating followpos
Only two ways that a position of a regular expression can follow another
• If n is a cat-node with left child c1 and right child c2 , then for every position i in lastpos(c1) , all positions in firstpos(c2 ) are in followpos( i).
• If n is a star-node, and i is a position in lastpos(n), then all positions in firstpos(n) are in followpos(i).
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
46

Calculating followpos
for each node n in the tree do
if n is a cat-node with left child c1 and right child c2 then
for each i in lastpos(c1) do
followpos(i) := followpos(i)  firstpos(c2) end do
else if n is a star-node
for each i in lastpos(n) do
end if end do
followpos(i) := followpos(i)  firstpos(n) end do
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
47

From Regular Expression to DFA Directly:
Example
{1, 2, 3} {6}
{1, 2, 3} {1, 2, 3} {4}
{3}
{3} a{3}
3
{5}
{5} b {5}
{6} # {6} 6
Node
followpos
1
{1, 2, 3}
2
{1, 2, 3}
3
{4}
4
{5}
5
{6}
6

nullable
5 4
{1, 2, 3} {1, 2} * {1, 2}
{4} b {4}
firstpos lastpos
{1,2} | {1,2}
{1} a{1} {2} b{2}
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
48

From Regular Expression to DFA Directly:
Example
{1, 2, 3} {6}
{1, 2, 3} {1, 2, 3} {4}
{3}
{3} a{3}
3
{5}
{5} b {5}
{6} # {6} 6
Node
followpos
1
{1, 2, 3}
2
{1, 2, 3}
3
{4}
4
{5}
5
{6}
6

nullable
5 4
{1, 2, 3} {1, 2} * {1, 2}
{4} b {4}
firstpos lastpos
{1,2} | {1,2}
{1} a{1} {2} b{2}
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
49

From Regular Expression to DFA Directly: Example
Node
followpos
1
{1, 2, 3}
2
{1, 2, 3}
3
{4}
4
{5}
5
{6}
6

1
2
a
3456
1
b 3a4b5b6
1 . Make all positions in firstpos(root) be initial states,
2. Label each arc from i to j by the symbol at position i,
3. The position of endmarker # be the only accepting state.
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
a a
b b
2
50

From RE to DFA Directly: Algorithm
s0 := firstpos(root) where root is the root of the syntax tree Dstates := {s0} and is unmarked
while there is an unmarked state T in Dstates do
mark T
for each input symbol a   do
let U be the set of positions that are in followpos(p)
for some position p in T,
such that the symbol at position p is a if U is not empty and not in Dstates then
add U as an unmarked state to Dstates end if
Dtran[T,a] := U end do
end do
States with the position of endmarker # become accepting states
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
51

From Regular Expression to DFA Directly: Example
Node (symbol)
followpos
1 (a)
{1, 2, 3}
2 (b)
{1, 2, 3}
3 (a)
{4}
4 (b)
{5}
5 (b)
{6}
6 (#)

1
3456
2
start
1,2,3
bb
a
a 1,2, b 1,2, b 1,2,
3,4
3,5 3,6
a a
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
52

References
• Compilers: Principles, Techniques, and Tools (2nd Edition)
• By A.V. Aho, Lam, R. Sethi, . Ullman
• Chapter 3
03/08/2021
COMP3290 – Semester 2 – 2021 | www.newcastle.edu.au
53

Leave a Reply

Your email address will not be published. Required fields are marked *