The notes in this collection are designed to supplement the material presented in lectures. They give more details on several topics covered in the course. In doing this they use illustrations drawn from the Compiler Writer's Toolbox, a library of Pascal modules for the implementation of interactive compilers under MS-DOS. The toolbox is available from Paul Cockshott's web pages.
It should be noted that whilst the source code is given in Pascal, the essential principles of the algorithms are not altered by shifting to Java, C or some other language.
Some detailed algorithms are provided with close documentation, in particular the algorithms for implementing lexical analysers. These are provided as background material only. It should not be necessary for you to implement a lexical analyser in your exercises, as a ready written lexical analyser class is provided. However, the interface of the lexical analyser described in these notes is very similar to that of the Basic lexical analyser that you will be using in your course.
What makes a language a language rather than an arbitrary sequence of symbols is its grammar.
A grammar specifies the order in which the symbols of a language may be combined to make up legitimate statements in the language. Human languages have rather relaxed informal grammars that we pick up as children. Computer languages are sometimes called formal languages because they obey an explicitly specified grammar.
When people came to design computer languages around the end of the 1950s they had to devise methods of formally specifying what the grammar of these new languages was to be. By coincidence the linguist Chomsky had been investigating the possibility of formally specifying natural languages, and had published an influential paper in which he classified all possible grammars into 4 classes. These classes of grammars are now referred to as Chomsky class 0, class 1, class 2 and class 3 grammars. It turns out that Chomsky class 2 and class 3 grammars are the most suitable for describing programming languages. To understand what these different classes of grammars are we need to go into a little formal notation.
The syntax or grammar of a language can be thought of as being made up of a 4 tuple (T,N,S,P) where:
T stands for what are called the terminal symbols of the language. In a human language these terminal symbols are the words or lexicon of the language. In a computer language they are things like identifiers, reserved words and punctuation symbols.
N stands for what are called the non-terminal symbols of the language. In a human language a non-terminal would be grammatical constructs like a sentence, a noun clause or a phrase. A computer language is likely to have a large number of non-terminals with names like clause, statement, expression.
S is the start symbol of the grammar. It is one of the non terminals. Its meaning will become clear shortly.
P is a set of productions or rewrite rules. These tell you how to expand a non-terminal in terms of other terminals and non-terminals.
This sounds a bit dry, but it will be clearer if we give an example. Suppose we wish to define a grammar that describes the `speech' of a traffic light. A traffic light has a very limited vocabulary. It can say red or amber or green or red-and-amber. These are the terminal symbols of its language.
T = { red, green, amber, red-and-amber }
At any moment in time the traffic light is in a current state and after some interval it goes into a new state that becomes its current state. Each state is described by one of the colours of T . This can be expressed as a set of non-terminal symbols which we will call:
N = { going-red, going-green, going-amber, going-red-and-amber }
We will assume that when the power is applied for the first time the light enters state going-red. Thus
S = going-red
A traffic light has to go through a fixed sequence of colours. These are the syntax of the traffic light language. Which sequence it goes through is defined by the productions of the traffic light language. If the light is in going-red then it must output a red and go into going-red-and-amber. We can write this down as:
going-red → red going-red-and-amber
This is an individual production in the traffic light language. The whole set of productions is given by:
P = { going-red → red going-red-and-amber
going-green → green going-amber
going-red-and-amber → red-and-amber going-green
going-amber → amber going-red
}
This combination of (T,N,S,P) ensures that the only sequences of colours allowed by the traffic light are things like:
red red-and-amber green amber red going-red-and-amber
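The productions translate directly into a state machine. The following is a minimal Pascal sketch, not taken from the Toolbox, that generates a legal colour sequence starting from the start symbol; the program and type names are invented for the illustration. Each case arm corresponds to one production: output the terminal and move to the next non-terminal.

program trafficlight;
type lightstate = (goingred, goingredandamber, goinggreen, goingamber);
var state: lightstate;
    i: integer;
begin
  state := goingred;                 { S = going-red }
  for i := 1 to 8 do                 { emit eight colours }
    case state of
      goingred:         begin write('red ');           state := goingredandamber; end;
      goingredandamber: begin write('red-and-amber '); state := goinggreen;       end;
      goinggreen:       begin write('green ');         state := goingamber;       end;
      goingamber:       begin write('amber ');         state := goingred;         end;
    end;
  writeln;
end.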
It turns out that traffic lights speak the simplest of the Chomsky classes of language, which perversely enough is class 3.
To distinguish between these classes of grammars the following notation will be used:
bold letters : a b c ... represent non-terminals
italic letters : a b c ... represent terminals
Class 3 languages like that of the traffic light have all of their productions of the form:
a → ab
or
a → c
The traffic light obviously only has the first type of production or it would stop at some point. These simple languages occur widely in nature. Look at the pattern of leaves round the stem of a plant. They will often alternate left and right, or form a spiral, in a way that can be described by a class 3 grammar. In the example in the figure we can describe the plant shape by the grammar:
N = { lstem, rstem }
S = lstem
P = { rstem → flower
lstem → left rstem
rstem → right lstem
}
Class 3 grammars are also sometimes described as regular grammars, and the patterns they describe as regular expressions. It turns out that the reserved words of most computer languages can be described by class 3 grammars. We will go into this in more detail in later chapters.
Class 2 grammars, also called context free grammars, have productions of the form
a → b
where a is a non-terminal symbol and b is some combination of terminals and non terminals. We could describe the `if' expression in an algol like language as:
if-expression → if expression then expression else expression
where italics are non-terminals and bold letters are terminals. Most of the syntax of algol like programming languages can be captured using class 2 grammars. How these are used makes up the main topic of a later chapter.
Class 1 grammars, also called context sensitive grammars, have productions of the form
abc → axc
where a and c are strings of terminals and non-terminals, b is a single non-terminal, and x is a non-empty string of terminals and non-terminals.
The reason why these are called context sensitive is that the production of x from b can only occur in the context of abc. In the context free languages a non terminal can be expanded out irrespective of the context. Although the bulk of a programming language's syntax can be described in a context free fashion, some parts are context sensitive. Consider the line:
x:=9
This will only be valid if at some point previously there has been a line declaring x. The name of the variable must have been introduced earlier and it must have been specified that it was an integer or real variable. The context sensitive part of the language is dealt with by the type checking system. In untyped languages like Basic the context sensitive parts are minimal. In more advanced languages they are crucial. Class 0 grammars, the most powerful class, are not needed for translating programming languages and we will ignore them.
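As an illustration of the kind of context sensitive check involved, here is a minimal sketch, not Toolbox code, of a table that records declarations so that an assignment such as x:=9 can be rejected unless x was declared earlier with a numeric type. The names declare, lookup and symtype are invented for the example.

type symtype = (intsym, realsym, undeclared);
var
  names : array[1..100] of string;    { identifiers declared so far }
  kinds : array[1..100] of symtype;   { their declared types }
  count : integer;

procedure declare(n: string; t: symtype);
begin
  count := count + 1;
  names[count] := n;
  kinds[count] := t;
end;

function lookup(n: string): symtype;
var i: integer;
begin
  lookup := undeclared;
  for i := 1 to count do
    if names[i] = n then lookup := kinds[i];
end;

{ the statement  x := 9  passes the check only if lookup('x') is intsym or realsym }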
A programming language can be translated by using a hierarchy of grammars.
At the lowest level we use class 3 grammars to recognise the identifiers and reserved words of the language. Above that we use class 2 grammars to analyze the context free parts of the language. Finally we use type checkers to verify that the context sensitive rules of the language are being obeyed.
Program Module   | Grammar
type checking    | class 1
syntax analysis  | class 2
lexical analysis | class 3
The structure of the compiler reflects this structure of the language: to each of the layers of grammar there corresponds a module of the compiler. It also turns out that in our strategy for writing the compiler we can take advantage of a relationship which exists between classes of grammars and types of computing machines.
The idea of store and stored state will be familiar to all programmers, but a stored program computer need not in principle be anything like the Von Neumann machines that we normally call computers. There are in principle much more general purpose designs. At the most general level a digital computer capable of performing computation over time must contain a set of storage cells each capable of holding a bit. The computer is capable of existing in a number of states characterised by the values in its storage cells. If we consider these we can see that the number of states that the computer can occupy will be 2^s where s is the number of storage cells in the machine.
Computation proceeds by the computer going from one state to the next as shown by figure .
Clearly the number of states that a computer can go through in the course of a computation will be at most 2^s. The larger the number of storage cells in the machine the longer or more complex the sequence of states that it can go through. This relationship is familiar to us all in the way more complex programs demand more store.
To actually perform computation it is necessary to be able to modify the sequence of states that the computer goes through on the basis of input signals. To produce any useful effect the computer must generate one or more output signals, to indicate the result of the computation. Reduced to its most simple a computer must be capable of responding to a sequence of inputs and generating appropriate outputs.
Consider a machine that has to recognise a 3 digit sequence and then respond with a yes or no according to whether or not the sequence was correct. An example might be a digital door lock as shown in the diagram below. This requires the sequence 469 to be keyed in to open the lock. This sequence of numbers can be described by a class 3 grammar (T,N,S,P) where
T = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 }
N = { s, t, u }
S = s
P = {
s → 4t
t → 6u
u → 9
}
We define a set of numbered states corresponding to the non terminals of the grammar such that s = 1, t = 2 etc. The machine starts in state 1 and undergoes transitions to successive states on the basis of the keys that are pressed. If any incorrect key is pressed the state reverts to the start state. This is shown in figure . The collection of numbered circles with arrows connecting them is called a state transition diagram.
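The state transition diagram can be turned directly into a program. Here is a minimal sketch, not taken from the Toolbox, of the door lock as a finite state machine; states 1 to 3 correspond to the non-terminals s, t and u, and state 4 is the hit state.

program doorlock;
var state: integer;
    key:   char;
begin
  state := 1;                              { the start state s }
  while state < 4 do begin
    read(key);                             { next key pressed }
    case state of
      1: if key = '4' then state := 2 else state := 1;
      2: if key = '6' then state := 3 else state := 1;
      3: if key = '9' then state := 4 else state := 1;
    end;
  end;
  writeln('open');                         { the hit state has been reached }
end.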
Given that we wish to be able to represent state, it still does not follow that we will end up with a random access store. There are other possibilities, which have been tried in the past, and which still have certain limited applications. If each bit represents a state we could easily construct the door lock by stringing 4 cells together in sequence and having them activated sequentially by the AND of the signal from a button and the previous state.
This is simple to implement and some digital logic used to be built this way, but it makes poor use of the state bits as we only get s rather than 2^s states from an s bit store.
An improved arrangement is to gather all of the bits in the computer into one word s bits long. This is then treated as a binary number and the computer program can be thought of as a mapping of the form:
program: (int x input) → (int x output)
Successive application of the program function to the state word and the input generates a new state and an output. This is in theoretical terms the ideal way to construct a computer. For large s the number of possible states becomes astronomical. A computer with a 64 bit status word could have a state to represent every centimeter of the distance between here and the nearest star. This sort of computer is a generalised finite state automaton, organised as shown in figure .
This architecture can go from any state to any other in a single step - from here to Alpha Centauri without stopping on the way. Each bit in the current state can be taken into account in determining the next state. All values of the input bits can be taken into account likewise. A class 3 grammar can be handled by a finite state machine. Finite state machines are widely used in computer hardware in the form of PLA's or programmable logic arrays. These are basic components of microprocessors that are used to decode machine instructions. The instruction decode unit of a microprocessor has to parse machine code. Machine code has a class 3 grammar so a finite state machine is enough.
Although this sort of machine is very fast, we have practical difficulties in scaling it up. The problem is the rate at which hardware complexity increases with the size of the computer. The number of interconnection wires required to allow each state bit to affect the next state of every other bit goes up as the square of the number of bits, and the number of logic cells (ANDs , ORs) to do this goes up quadratically in the number of state bits. This is illustrated in figure .
If instead of connecting all of the state cells to one another, we organise the state cells into subgroups of bits termed words and lead these into a common logic block, then we can diminish the number of wires considerably.
If we divide our s state bits into w = s/b words each of b bits, and wire them up in a grid, we need only (w+b) wires to join them to the common logic block. What we then have is the random access memory computer. Like a generalised finite state automaton it runs in a cycle reading the present state and modifying the state vector as a result of what it has read but it is less powerful than an FSA in that at most w bits of the state can be taken into account each cycle and at most w bits of the state altered in each cycle.
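For example, suppose s = 1024 state bits are organised as w = 32 words of b = 32 bits each. The grid then needs only w + b = 64 wires into the common logic block, whereas allowing every bit to influence every other bit directly would require on the order of s^2, roughly a million, connections.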
The paradox is that although the FSA is the fastest type of computing machine, used in CPUs where speed matters, it is linguistically the least competent. Suppose that I have a class 2 grammar (T,N,S,P)
where
T = { ), (, 1, 2, 3 }
N = {s, t, u }
S = s
P = { s → ( t )
t → 1u2
u → t
u → s
u → 3
}
This can generate sequences like
(132) or (11322) or (1(132)2)
You will find that you cannot draw a state transition diagram that is capable of handling this syntax. In fact it cannot be handled by a finite state machine. The machine would have to remember how many left brackets and how many 1s it had encountered and in what order they had come, so that it could match them up with right brackets and 2s. Since the sequences defined by the grammar can be of arbitrary length, no finite memory could hold the information.
To handle a class 2 grammar like this you need to have an infinite stack memory. As each left bracket or one is encountered, a token representing it is pushed onto the stack. When parsing, the computer looks at the top of the stack and at the next character to decide what state to go into.
Of course in practice any stack that we build will be of finite depth. This means that looked at another way a stack machine is still a finite state automaton.
There will be sequences of symbols that are just too long to parse. For practical purposes we are willing to accept that some programs are too big to compile. But we can write our compiler as if it was going to run on a computer with an infinite stack. This technique allows us to write a program that only needs to have a small number of rules in it. The complexity of the parser is then limited by the size of the grammar itself rather than by the size of the programs it will have to compile.
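To see how a small set of rules suffices, here is a minimal sketch, not Toolbox code, of a recursive descent recogniser for the bracket grammar given above; the procedure call stack plays the part of the pushdown store, and the names are invented for the example.

program bracketparse;
var ch: char;
    ok: boolean;

procedure expect(c: char);
begin
  if ch = c then begin
    if not eof then read(ch) else ch := ' ';
  end else ok := false;
end;

procedure parse_t; forward;

procedure parse_u;                       { u -> t | s | 3 }
begin
  if ch = '1' then parse_t               { u -> t }
  else if ch = '(' then begin            { u -> s and s -> ( t ) }
    expect('('); parse_t; expect(')');
  end
  else expect('3');                      { u -> 3 }
end;

procedure parse_t;                       { t -> 1 u 2 }
begin
  expect('1'); parse_u; expect('2');
end;

begin
  ok := true;
  read(ch);
  expect('('); parse_t; expect(')');     { s -> ( t ) }
  if ok then writeln('accepted') else writeln('rejected');
end.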
When we take into account context sensitive information, we will need the full facilities of a random access memory in which we can build up information about what identifiers and types have been declared. Broadly speaking, the lexical analysis part of compiling will be handled by algorithms that mimic a finite state machine. The syntax analysis will be handled using a stack, and the type checking will use a random access heap.
expression → ( expression )
expression → expression + expression
expression → number
number → digit number
number → digit
digit → 0
digit → 1
A grammar like this, with its nested brackets, again cannot be parsed by a finite state machine.
Lexical analysis is the process of recognizing the elementary words or lexemes of a programming language. Suppose we wish to recognise the reserved words begin or while or end.
A file which contained one or more of these words could be produced by a grammar of the form:
.*(begin | while | end).*
where
. means any character
* means an arbitrary number of repetitions
| means alternation
This type of grammar, which contains no recursive definitions and corresponds to the class 3 grammars described in the previous chapter, is also termed a regular grammar. It is known that to each regular grammar there corresponds a finite state machine that can act as a recognizer for that grammar. The grammar above would produce the set of strings recognised by the finite state machine shown in Figure . In this diagram each state of the machine has a number and is drawn as a circle. There are lines (or arcs as they are called in graph theory) going between the states. The arcs are labeled with letters. If we start out in state 0 and get a b then we go to state 1.
The letters e g i will then take us through states 2 to 4. The final n puts us into the hit state indicating that we have found one of the words. If at any state we get the wrong letter the machine goes back to state 0.
This class of finite state machines can be represented in computer memory in a number of different ways. One way to do it would be with the Pascal record types shown in the listing below. To use this data structure you would need a pointer to the present state; then for each character in the file you were searching you would run down the list of transitions to see if any applied. If one did you would take the transition, otherwise you would jump back to the start state.
type
  transition = record
    letter            : char;
    nextstate         : ^state;
    other_transitions : ^transition;
  end;
  state = record
    hit         : boolean;
    transitions : ^transition;
  end;
A simpler data structure to use would be a two dimensional array indexed by the current state and the incoming character. The recognition can then be performed by a very simple fast algorithm, given a transition table indexed on the parse states and the incoming characters, and a function next(F) → C that returns the next character in a file.
This algorithm provides the bare bones of a lexical analyzer. We can envisage coding it up in Pascal and getting something like the listing below.
We assume that the finite state automaton algorithm will be applied to a program held in a buffer. A function scan can be applied to this buffer and will return whether it found a lexeme. In the var parameter found it returns the second last state. In this simple algorithm we will assume that this state is sufficient to distinguish the lexemes that are being looked for.
var buf: buffer;

function scan(var found: state; var hitpos: integer): boolean;
label 1, 2;
var i: integer;
    S: state;
begin
  S := startstate;                 { startstate assumed declared elsewhere as the initial state }
  for i := hitpos to maxbuf do begin
    S := table[S, buf[i]];         { take the transition for this character }
    if S = hitstate then goto 1;   { reached the hit state: a lexeme has been found }
    found := S;                    { remember the second last state }
  end;
  scan := false;                   { ran off the end of the buffer without a hit }
  goto 2;
1:
  scan := true; hitpos := i;       { record where the hit occurred }
2:
end;
In practice the task of a lexical analyzer is more complicated than this.
The lexemes of the language are not just made up of reserved words. There are variables and strings, comments and numbers to contend with. Although the reserved words make up a small finite set, all possible variables of a language make up a very large set indeed. For languages which allow arbitrary length variables, the set is infinite. There can be no question of setting up a transition table that would be capable of recognizing all the possible identifiers, all the possible distinct numbers and all the possible distinct strings.
It is better to handle the process in two stages. First, a simple finite automaton tells us 'we have a number' or 'we have an identifier'. Second, other automata, invoked by the first, distinguish between the different numbers, different identifiers, strings and so on. This is the strategy used in the Compiler Writer's Toolbox. Lexical analysis is split into two levels. First level lexical analysis splits the source program into broad categories of lexemes like identifiers, numbers and strings. Second level lexical analysis then distinguishes between different identifiers etc.
Picture Omitted
Consider the problem of recognizing Pascal identifiers and numbers. We might define them as follows:
number → digit digit*
identifier → letter alphanumeric*
alphanumeric → letter
alphanumeric → digit
digit → [0-9]
letter → [a-zA-Z]
By [0-9] we mean any one of the set of characters from 0 to 9. This is termed a character class. Using these character classes we can define state machines that will recognise either an identifier or a number.
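A recogniser for these two rules can be written very compactly using Pascal sets. The following sketch classifies a complete string as a number, an identifier, or neither; the type lexkind and function classify are invented for the example and are not part of the Toolbox.

type lexkind = (numberlex, identifierlex, otherlex);

function classify(s: string): lexkind;
var i: integer;
begin
  if length(s) = 0 then classify := otherlex
  else if s[1] in ['0'..'9'] then begin
    classify := numberlex;                       { number -> digit digit* }
    for i := 2 to length(s) do
      if not (s[i] in ['0'..'9']) then classify := otherlex;
  end
  else if s[1] in ['a'..'z','A'..'Z'] then begin
    classify := identifierlex;                   { identifier -> letter alphanumeric* }
    for i := 2 to length(s) do
      if not (s[i] in ['a'..'z','A'..'Z','0'..'9']) then classify := otherlex;
  end
  else classify := otherlex;
end;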
The graph in figure 2.4 has far fewer states than the one in figure 2.1, although that one could only recognise 4 reserved words and this one will recognise all numbers or identifiers. It is an indication of the simplification that can arise from just dealing with classes of characters and classes of lexemes. Note that the reserved words recognised by the first state machine begin, while, end will all be categorized as identifiers by the state machine in figure 5.3. This does not matter. We can distinguish between reserved words and other identifiers with second level lexical analysis.
The Compiler Writer's Toolbox is supposed to be easily configurable to handle different languages. An attempt has been made to isolate language dependent parts and make them readily alterable. The first level lexical analyzer has been made very general. It splits the input into numbers, strings, comments and lexemes. Lexemes are interpreted very liberally.
Anything which is not a string, number or comment is a lexeme. This means that things like the brackets [ ] { } ( ) or the operators + - = are treated as lexemes, as well as the more obvious alphanumeric strings. Although this may seem strange, you should realise that some languages allow sequences of operator symbols to be strung together, as in: <=, ++, +=, %=. The first level analyzer sees all of these as lexemes whose meaning will be sorted out later. All it is concerned with is deciding where a lexeme starts and finishes. Consider this example:
x1:=ab<=c;
Where do the individual lexemes start and finish here?
If we write it down with spaces between the lexemes it is obvious:
x1 := ab <= c ;
but the lexical analyzer cannot rely upon the programmer putting it down in this way. Without knowing what all the operators in the language are, it needs to be able to break the string up. This is where character classes come in handy. Using a recognizer like that in figure 5.3 you can pick out the identifiers, but that still leaves a variety of possible partitionings.
x1 : = ab < = c ;
x1 := ab < = c ;
x1 : = ab <= c ;
x1 := ab <= c ;
We can pick out the last one as the right one, because our experience of programming languages leads us to treat := and <= as single lexemes. We know that a string of operator symbols going together should be interpreted as a single lexeme.
We can add to our grammar the rules:
operator → opsym opsym*
opsym → [ + * - % : = / ]
Exactly which symbols go to make up the class of operator symbols will vary from language to language, but most languages do have a recognizable class of characters used in operators.
There is another class of symbols which have to be treated distinctly: brackets. We want (( to be two lexemes not one composite lexeme. We have certain broad classes of characters that are likely to be used in many languages : digits, letters, operators, brackets, spaces. The exact definition of these will vary from language to language. In some languages the underbar symbol ` _ ' can be part of an identifier, in others it is an operator.
The first level lexical analysis is performed by the unit FSM.PAS. The finite state machine is accessed via the function newlexeme, which returns an fsmstate. Each time this function is called the finite state machine processes a new lexeme and terminates in a state which indicates what class of lexeme has been found. The finite state machine program is encoded as a transition table. This is a two dimensional array indexed on the current state and the current character. In association with the transition table is another array: the action table. This yields one of (last, skip, include) and is accessed in step with the transition table. It provides instructions as to how to build up the lexeme. The finite state machine controls the actions of two pointers into a text buffer. One points at the start of a lexeme; the other, called finish, points at the current character. As each character is processed the finish pointer is advanced.
#1Finite state transducer module FSM
end;
function push_buffer:boolean;
`sec `Push buffer`;
function pop_buffer:boolean;
`sec `Pop buffer`;
function newlexeme(var B:textbuffer):fsmstate;
`sec `Transition Table` ;
label 1,99;
var
S:fsmstate;
C:charclass;
A:action;
T:textbuffer absolute the_buffer;
I:integer;
begin
1:
t.start:=t.finish;
if listprog then
{ put the condition outside the loop to prevent things
being slowed down too much }
`sec`Recognise and print`
else
`sec`Recognise` ;
99:
newlexeme:=S;
end;
var i:integer;
begin
NEWSTATE:=startstate;
listprog:=false;
include_sp := 0;
end.
^@ | whitespace, | ^A | whitespace, | ^B | whitespace, | ^C | whitespace, |
^D | whitespace, | ^E | whitespace, | ^F | whitespace, | ^G | whitespace, |
^H | whitespace, | ^I | whitespace, | ^J | whitespace, | ^K | whitespace, |
^L | whitespace, | ^M | separator,  | ^N | whitespace, | ^O | whitespace, |
^P | whitespace, | ^Q | whitespace, | ^R | whitespace, | ^S | whitespace, |
^T | whitespace, | ^U | whitespace, | ^V | whitespace, | ^W | whitespace, |
^X | whitespace, | ^Y | whitespace, | ^Z | whitespace, | ^[ | whitespace, |
^\ | whitespace, | ^] | whitespace, | ^^ | whitespace, | ^_ | whitespace, |
space | whitespace, | ! | shriek,   | " | dquote,      | # | whitespace,  |
$ | operator,   | % | operator,     | & | operator,    | ' | quote,       |
( | bracket,    | ) | bracket,      | * | bracket,     | + | operator,    |
, | bracket,    | - | operator,     | . | digits,      | / | operator,    |
0 | digits,     | 1 | digits,       | 2 | digits,      | 3 | digits,      |
4 | digits,     | 5 | digits,       | 6 | digits,      | 7 | digits,      |
8 | digits,     | 9 | digits,       | : | operator,    | ; | separator,   |
< | operator,   | = | operator,     | > | operator,    | ? | operator,    |
@ | operator,   | A | alpha,        | B | alpha,       | C | alpha,       |
D | alpha,      | E | exponent,     | F | alpha,       | G | alpha,       |
H | alpha,      | I | alpha,        | J | alpha,       | K | alpha,       |
L | alpha,      | M | alpha,        | N | alpha,       | O | alpha,       |
P | alpha,      | Q | alpha,        | R | alpha,       | S | alpha,       |
T | alpha,      | U | alpha,        | V | alpha,       | W | alpha,       |
X | alpha,      | Y | alpha,        | Z | alpha,       | [ | bracket,     |
\ | whitespace, | ] | bracket,      | ^ | operator,    | _ | operator,    |
` | operator,   | a | alpha,        | b | alpha,       | c | alpha,       |
d | alpha,      | e | exponent,     | f | alpha,       | g | alpha,       |
h | alpha,      | i | alpha,        | j | alpha,       | k | alpha,       |
l | alpha,      | m | alpha,        | n | alpha,       | o | alpha,       |
p | alpha,      | q | alpha,        | r | alpha,       | s | alpha,       |
t | alpha,      | u | alpha,        | v | alpha,       | w | alpha,       |
x | alpha,      | y | alpha,        | z | alpha,       | { | bracket,     |
| | operator,   | } | bracket,      | ~ | operator,    | del | whitespace |
The FSM uses two enumerated types for its operation. Fsmstate lists the states that the machine can be in, whilst charclass specifies the classes into which the character set can be mapped. Between them, these types will act as indices into the state transition table for the low level Finite State Machine.
fsmstate = (startstate, opstate, idstate, numstate,
            expstate, commentstate, stringstate, escapestate,
            lastquotestate, sepstate, brackstate);

charclass = (operator, bracket, alpha, digits, exponent, dquote,
             quote, shriek, separator, whitespace);

const classtab: array[0..127] of charclass = ( see table 2.1 );
The Compiler Writer's Toolbox encodes the classes to which individual characters belong in the file CLASSTAB.CMP. This includes a definition of the type charclass, and an array classtab which maps from characters to this type. The definitions of this type and the array are given in table 2.1.
The file CLASSTAB.CMP is produced automatically by a program called MckCtab .
This is an example of the use of a program generator: a program which produces another program. These are very useful tools when writing a compiler. In MckCtab the classes can be specified as sets of char, and the program then generates the source declaration for an appropriate character class table.
#1mckctab
{ -----------------------------------------------------------------
  Program  : Mkctab
  Used in  : Compiler Writer's Toolbox
  Author   : WP Cockshott
  Date     : 3 Oct 1986
  Version  : 1
  Function : to build a class table include file for the
             lexical analyser
  Copyright (C) WP Cockshott & P Balch
}
`sec `Define Character Sets`
var c: char;
begin
  `sec `Output Classes and Actions`
  writeln('const classtab:array[0..127] of charclass=(');
  for c := chr(0) to chr(127) do begin
    `sec `Assign Characters to Classes`
  end;
  writeln('); {end of classtab}');
end.
The program outputs the type definitions that are to be used in the finite state machine module of the compiler. There are two types representing the state of the finite state machine and the character classes that are jointly used to index the state transition table.
An example of the output generated from listing can be seen in section .
The program writes out the contents of a constant array that maps characters to their classes. These are written 4 entries to a line separated by commas. A somewhat prettified version of the output from this section was shown in section .
Each entry is made up of a comment followed by the value of the entry.
#1assign characters to classes
if (ord(c) mod 4) = 0 then writeln;
`section `Make Comment`
`section `Output Class`
The comments for printable characters are the characters themselves. If they are non-printable (DEL or a control character) they are output in mnemonic form.
For a control code the mnemonic is of the form ^ followed by the printable character whose code is 64 greater than the control code.
Thus the bell character (07) will appear in the comment as ^ G .
The character '}' has to be handled as a special case so as not to prematurely end the comment.
if c < chr(32) then write('{^', chr(64+ord(c)), '} ')
else if c = '}' then write('{closing bracket} ')
else if c = chr(127) then write('{del} ')
else write('{', c, '} ');
#1make comment
Certain characters have to be treated as special cases.
The letter E can occur either as part of an identifier or as the label for the exponent part of a floating point number. For this reason it is convenient to assign it to a special class EXPONENT. The newline character (ASCII 13) is assigned to the class of separators; this is done explicitly because it cannot be given as a Pascal character literal in the definition of the set SEPSY.
Quotes and double quotes are single member character classes and so are more efficiently dealt with by simple equality tests here.
#1Output class
if c in ['E','e'] then write('exponent,') else
if c in alpha then write('alpha,') else
if c in digits then write('digits,') else
if c in bracket then write('bracket,') else
if c = '!' then write ('shriek,') else
if (c=chr(13)) or (c in sepsy) then write('separator,') else
if c = '''' then write('quote,') else
if c = '"' then write('dquote,') else
begin
write ('whitespace');
if ord(c)<>127 then write(',') else writeln;
end;
Exercises
1. Define the character classes that would be appropriate for first level lexical analysis of Pascal.
2. Draw a state transition diagram based upon the listing of file fsm.pas that describes the behavior of the first level lexical analyser.
3. Devise a regular grammar to describe Pascal floating point numbers.
4. Modify the files Mckctab.pas and fsm.pas to construct a first level
The characters are divided into several subsets depending upon where they occur in the source language. All are declared using the structured constant facility of Turbo Pascal.
#1define charactersets
sepsy:set of char = [';'];
`sec `Opsys` ;
`sec `Brackets`;
`sec`Alphanumerics`
These characters can be put together to form an operator. That is to say, if we have the characters `+' and `=' occurring next to one another they are to be treated as the composite operator `+='. An operator in the source language must be made up entirely of these operator characters.
#1Opsys
These characters include the obvious bracket characters like `[' or `]' and also, less obviously `,' and `*'. These are thrown in because all of the characters in the bracket set stand as single lexemes. They do not form part of a larger lexeme.
A `*' put next to another `*' must not be read as `**' nor must a pair of `('s be read as `(('.
#1Brackets
Alphanumeric characters can occur in two contexts: in numbers, and in identifiers. They are split into subsets according to whether they are starting characters of identifiers or of numbers. Note that the character `.' can occur in either an identifier or a (real) number.
#1Alphanumerics
const
  alpha      : set of char = ['a'..'z','A'..'Z'];
  digits     : set of char = ['0'..'9','.'];
  alphanum   : set of char = ['a'..'z','A'..'Z','.','0'..'9','#'];
  spacechars : set of char = [' '];
This table, which is stored as a two dimensional array, encodes the behaviour of the low level finite state machine for PS-algol lexical analysis. The behaviour of the machine is indicated in the state transition diagram shown in fig .
const transtable: array [fsmstate, charclass] of fsmstate = ( see the table below );

state          | operator     | bracket      | alpha        | digits       | exponent     | "              | '            | !            | ;           | whitespace
startstate     | opstate      | brackstate   | idstate      | numstate     | idstate      | stringstate    | startstate   | commentstate | sepstate    | startstate
opstate        | opstate      | brackstate   | idstate      | numstate     | idstate      | stringstate    | startstate   | commentstate | sepstate    | startstate
idstate        | opstate      | brackstate   | idstate      | idstate      | idstate      | stringstate    | startstate   | commentstate | sepstate    | startstate
numstate       | opstate      | brackstate   | idstate      | numstate     | expstate     | stringstate    | startstate   | commentstate | sepstate    | startstate
expstate       | numstate     | brackstate   | idstate      | numstate     | idstate      | stringstate    | startstate   | commentstate | sepstate    | startstate
commentstate   | commentstate | commentstate | commentstate | commentstate | commentstate | commentstate   | commentstate | commentstate | sepstate    | commentstate
stringstate    | stringstate  | stringstate  | stringstate  | stringstate  | stringstate  | lastquotestate | escapestate  | stringstate  | stringstate | stringstate
escapestate    | stringstate  | stringstate  | stringstate  | stringstate  | stringstate  | stringstate    | stringstate  | stringstate  | stringstate | stringstate
lastquotestate | opstate      | brackstate   | idstate      | numstate     | idstate      | stringstate    | startstate   | commentstate | sepstate    | startstate
sepstate       | opstate      | brackstate   | idstate      | numstate     | idstate      | stringstate    | startstate   | commentstate | sepstate    | startstate
brackstate     | opstate      | brackstate   | idstate      | numstate     | idstate      | stringstate    | startstate   | commentstate | sepstate    | startstate
type action = (last, skip, include);

const emit: array [fsmstate, charclass] of action = ( see the table below );

state          | operator | bracket | alpha   | digits  | exponent | "       | '       | !       | ;       | whitespace
startstate     | skip     | skip    | skip    | skip    | skip     | skip    | skip    | skip    | skip    | skip
opstate        | include  | last    | last    | last    | last     | last    | last    | last    | last    | last
idstate        | last     | last    | include | include | include  | last    | last    | last    | last    | last
numstate       | last     | last    | last    | include | include  | last    | last    | last    | last    | last
expstate       | include  | last    | last    | include | last     | last    | last    | last    | last    | last
commentstate   | skip     | skip    | skip    | skip    | skip     | skip    | skip    | skip    | last    | skip
stringstate    | include  | include | include | include | include  | include | include | include | include | include
escapestate    | include  | include | include | include | include  | include | include | include | include | include
lastquotestate | last     | last    | last    | last    | last     | last    | last    | last    | last    | last
sepstate       | last     | last    | last    | last    | last     | last    | last    | last    | last    | last
brackstate     | last     | last    | last    | last    | last     | last    | last    | last    | last    | last
This is the basic loop that recognises a lexeme. Between invocations the state of the FSM is stored in NEWSTATE.
#1recognise and print
repeat
S:=NEWSTATE;
`sec `Get the next character`
write(listfile,chr(I));
if I = 10 then begin
t.linenumber:=t.linenumber+1;
end;
`sec `Compute new state`
if A = skip then t.start:=t.finish;   { mark start of lexeme }
`sec `Handle end of buffer`
until (A = last)
This is the recognition loop when printing is disabled. Two variants of the loop are provided so that the test for whether printing is enabled does not have to be made inside the inner loop.
#1recognise
repeat
S:=NEWSTATE;
`sec `Get the next character`
if I = 10 then begin
t.linenumber:=t.linenumber+1;
end;
`sec `Compute new state`
if A = skip then t.start:=t.finish;   { mark start of lexeme }
`sec `Handle end of buffer`
until (A = last)
These are the functions and variables exported from the finite state machine module to the second level lexical analyser. If Turbo Pascal allowed a more modular structure one would like the declaration of the type of the Textbuffer to be hidden. This cannot be done under Turbo Pascal V4.0.
#1interface to FSM
This uses the class table to determine the class of the input character and then, using 2D array indexing, determines the new state and the action to perform given the current state.
#1compute new state
C:=classtab[I and 127];
NEWSTATE:= transtable[S,C];
A:=emit[S,C];
Check if the end of the lexeme goes past the end of the buffer. If it does then terminate the lexeme and pop the file buffer.
#1Handle end of buffer
Increment the finish pointer for the lexeme and then fetch the character at this position in the text buffer as an integer.
#1get next character
Program text is represented to the lexical analyser by the type TEXTBUFFER. This is a record which has the form shown in figure , and is declared in listing .
textbuffer:
thetext    | pCharSeq | pointer to an array of char
start      | word     | first character of lexeme
finish     | word     | character after the lexeme
textlen    | word     | number of chars in the buffer
linenumber | integer  | the line number we are at
type textbuffer = record
thetext: pCharSeq;
start,finish,textlen: word;
linenumber :integer;
{point at the start and finish of lexeme and give length of buffer }
end;
When a program is to be recompiled for whatever reason, there needs to be a facility for the compiler to move its compilation point back to the start of the buffer. This sequence reinitialises the low level finite state machine. The constant textbuflo is declared in the editor package.
The variable include_sp is used to keep track of the level of nesting of `include' files in the compilation process.
#1rewind
begin
NEWSTATE:=startstate;
the_buffer.start:=textbuflo;
the_buffer.finish:=textbuflo;
include_sp:=0;
end
This operation pushes the current text buffer onto a stack and advances the include_sp . This will be done whenever an include file is encountered and our position in a previous buffer has to be saved.
#1push buffer
begin
if include_sp<includedepth then begin
push_buffer:=true;
include_sp:=include_sp+1;
buffstack[include_sp]:=the_buffer;
end else push_buffer:=false;
end
This is the obverse operation to pushing a buffer. It is performed whenever an include file ends. The space occupied by the buffer can then be freed and the finite state machine switched back to obtaining its input from the old buffer.
#1pop buffer
begin
if include_sp>0 then begin
pop_buffer:=true;
with the_buffer do freemem(thetext,textlen);
the_buffer:=buffstack[include_sp];
include_sp:=include_sp-1;
end else pop_buffer:=false;
end
The first level of lexical analysis picks out the start and finish of lexemes. At this point the lexemes are sequences of characters. This is not a convenient form to deal with computationally. A common operation that has to be performed in the higher levels of a compiler is to compare two lexemes to see if they match. String comparison is slow. A representation more suited to fast manipulation is needed. Computers carry out atomic operations on words of binary data. The fastest type of comparison is between two binary words. This can generally be done in one or two instructions. For speed, the representation of lexemes should be changed from strings into words. Can this be done?
Can strings of perhaps a dozen characters be represented in a 32 bit or 16 bit word?
It all depends upon how many of them there are. An 8 character string contains 64 bits. It is only possible to represent it in a 16 bit word if the string actually contains no more than 16 bits of information. With a 64 bit binary number 2^64 different values can be represented. There are 2^64 possible 8 character strings. If longer strings are considered the number of possible strings rises exponentially in the length of the string. A 16 bit word can only encode 2^16 values. It seems impossible to cram into this all the possible strings.
In principle a program could be written that would use the billions of billions of different identifiers that are theoretically permitted in programming languages. In practice, in our finite world, such a program will never occur. All we have to worry about is how many distinct identifiers will actually occur in a program. This number is unlikely to rise above a few hundred, so a 16 bit word can easily range over them. One simply has to assign ascending integers to the lexemes in the order in which they are encountered. The second level lexical analyzer can be seen as a black box that takes in strings, numbers them and then outputs the numbers.
strings → 2nd level lexical analysis → numbers
In functional notation we can characterize it as a function
identify: string → number
such that
identify(A) = identify(B) iff A = B
In words this means that the mapping from strings into numbers preserves string equality. If we knew beforehand what strings were to be encountered this would be trivial. We need only construct a fixed table with two columns, strings in one and numbers in the other. When we needed to perform a conversion we would scan the first column for a match and output the corresponding number.

For the reserved words of a language this is feasible and it is indeed the technique used by some compilers. A fixed table is set up containing the reserved words, which is searched for a match each time a lexeme is produced. If you do this, however, you end up with the reserved words of the language embedded in your compiler. This may be a disadvantage when you have to convert the compiler to handle a new language. Furthermore, a compiler will always have to handle additional symbols over and above the reserved words. It has to handle the user defined names of variables. These cannot be known beforehand, so a compiler is forced to have an extendible data structure - often called the symbol table - to deal with them.

An alternative approach is to deal with all lexemes other than literal values (integers, reals etc) in a homogeneous way. The lexical analyzer of the Compiler Toolbox contains no built in knowledge of what identifiers are used in the language. It works entirely with dynamic data structures. The simplest way to build a mapping from strings to numbers would be something like the algorithm shown in program SIMTRANS.
This algorithm is simple and reliable but suffers from two serious disadvantages. The first is excessive space consumption. The second is slowness.
In order to be able to handle the longest identifier that you are going to encounter the constant idmaxlen is going to have to be set to something like 40 or 50. The type idrange may have to go up to 1000 to handle the number of identifiers that might occur in a big program. Already you have reserved 40 to 50 kilobytes of store. A compiler will need several other tables. Is it wise to use up so much space that you may not need? Most identifiers will be closer to 8 characters than 40 characters in length. Perhaps 80 per cent of the table will be wasted. This kind of pressure tempts compiler writers to restrict the length of identifiers that the compiler will accept. When computers had less memory, the temptation to only allow short identifiers was severe. Fortran had only 6 characters in identifiers. Early versions of C ignored all but the first 8 characters of a name. This kind of thing is no longer acceptable. Some sort of dynamic data structure should be chosen for an identifier table that does not restrict how long the identifiers can be.
The simple table used in SIMTRANS will be slow to search. On average you will have to scan through half the table to find an identifier. This can mean hundreds of string comparisons for each symbol processed. In order to prevent identifier translation becoming a bottleneck it should be made as fast as possible.
#1program SIMTRANS
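The listing of SIMTRANS is not reproduced here; the following is a minimal sketch of the kind of program intended, using the identifier names (idmaxlen, idrange, idtab, pushid, identify and so on) referred to below. It is an illustration written for these notes, not the Toolbox source.

const
  idmaxlen = 40;                            { longest identifier accepted }
  idmax    = 1000;                          { most identifiers in one program }
type
  idrange  = 1..idmax;
  idstring = string[idmaxlen];
var
  idtab  : array[idrange] of idstring;      { identifiers seen so far }
  idpntr : integer;                         { number of entries in use }

procedure pushid(id: idstring);             { append a new identifier to the table }
begin
  idpntr := idpntr + 1;
  idtab[idpntr] := id;
end;

function identify(id: idstring): integer;   { map a string to its number }
var i: integer;
    found: boolean;
begin
  found := false;
  for i := 1 to idpntr do                   { linear search of the whole table }
    if idtab[i] = id then begin
      identify := i;
      found := true;
    end;
  if not found then begin                   { a new identifier: give it the next number }
    pushid(id);
    identify := idpntr;
  end;
end;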
There are a variety of dynamic data structures that could be used based upon trees or hashing. What the Compiler Toolbox uses is a Trie. This is a special sort of tree that makes use of the observation that many of the names used in a program will be similar to one another. Consider the user defined identifiers in what follows.
idrange
idmax
idstring
idmaxlen
idpntr
idtab
pushid
id
identify
i
found
There is a great deal of commonality here. A trie takes advantage of the fact that many identifiers share common prefixes with other identifiers. We can reorganise the identifiers to show this in figure .
founD
ID entifY
   maXleN
   pntR
   rangE
   strinG
   taB
pushiD
In this arrangement the prefixes are only written down once. The last letter of an identifier is written in capitals. Although this arrangement is more compact, it contains all of the names. A trie achieves this arrangement in computer memory using a linked list.
A possible record format for a trie is shown in listing .
The type lexnode points at a delabrandis record. This has a character in it and two further lexnodes. The follower pointer corresponds to moving horizontally in figure 2.6, the alternates pointer corresponds to moving vertically. The number final will be non-zero if this is the last letter in a word, corresponding to the use of capitals in figure 2.6. The identifiers are further indexed on their first character using the type lexarray.
type
lexnode = ^delabrandis;
lexarray = array[minchar..maxchar] of lexnode;
delabrandis = record
final:integer;
pref:char;
follower,alternates:lexnode;
end;
The dynamic trie structure places no limits on the number of characters that can be in an identifier. Despite this, no unnecessary space is allocated for short identifiers. Very short, one or two letter identifiers are likely to fall entirely within other longer ones, and effectively use no space. Search is fast. Using the index, the algorithm restricts its search space to just those identifiers that start with the right letter. With each further letter matched, the search space is restricted further.
The trie is manipulated by the function insert_token which updates the trie as a side effect of searching it for an identifier. Because the trie is a recursive datatype the actual insertion is done by a recursive procedure ins . The function maintains the alternates in alphabetical order. The presence of an order reduces the number of unnecessary comparisons that have to be made if you are inserting a new string.
Note how access to the trie is accelerated by using an array n indexed on characters. This means that there is essentially a sub trie for each possible first letter of the identifiers. This produces a cheap speedup in the search performance. If this were not the case the first letter would have to be matched using a linear search.
#1Insert token
function insert_token(var B:textbuffer; var n:lexarray):lextoken;`sec `Create a new node`
{ inserts the string into the tree }
{$S-}
var p : lexnode;
charno : integer;
c : char;
hit ,inserted :boolean;
procedure newnode(var next:lexnode;c:char);
procedure ins(var n:lexnode;charno: word );`sec `Recursive Insert`
var t:lexnode;
c:char;
begin
{$r-}
ins( n[B.thetext^[B.start]], B.start);
end;
A new node is created on the heap. Its character field is initialised to the current character. Its final field is set to 0 to indicate that this is not the last character on the list. It has no follower or alternates as yet.
begin
new(next);
with next^ do begin
pref:=c; final:=0;
follower:=nil;
alternates:=nil;
end;
end;
The recursive insert procedure has to deal with 3 alternatives.
begin
  c:=b.thetext^[charno];
  if charno < B.finish then with B do
    if n=nil then
      `sec `Add another letter`
    else with n^ do
      if c = pref then begin
        if charno = finish-1 then
          `sec `Last letter of word`
        else ins(follower,charno+1);
      end
      else if c < pref then ins(alternates,charno)
      else
        `sec `Char less than prefix`
end;
A new node is attached to the currently nil node pointer. The insert procedure is invoked recursively to append any further characters in the word to the trie.
begin
newnode(t,c);
n:=t;
ins(n,charno)
end
We are on the last character of the word. Either the word has been encountered before or it has not.
If it has been met previously then a token will have been stored for the word in the final field of the node. Alternatively, the word is new and the final field indicates this by holding 0. It is then replaced by the value of lasttoken , which is itself then incremented. After this the value in final must be the appropriate code for the word and can be returned from insert_token .
begin
{ a hit }
hit:=true;
if final = 0 then
{ first time we have encountered this word}
begin
final:=lasttoken;
lasttoken:=succ(lasttoken);
end;
insert_token:=final;
end
A copy of the pointer to the current node is made in t . A new node is then attached to the current node pointer. The old node is then made the first alternative for the new node. Then the insert operation is invoked recursively.
begin
t:=n;
newnode(n,c);
n^.alternates:=t;
ins(n,charno);
end
The identifier table is initially loaded with a list of all of the reserved symbols of the language. This includes not only the reserved words, but also the operators and brackets. At startup the procedure init_lexanal is invoked to read these in from a file: lexemes.def. Because the insertion procedure assigns ascending integers to the identifiers read in, the lexemes in the file will be given internal integer representations in the order in which they occur in the file. Within this file they are organised as a list of symbols, one per line.
then
to
TRACEON
TRACEOFF
true
upb
vector
while
write
{
}
~
~=
\
..INT.LIT
..REAL.LIT
..STRING.LIT
..IDENTIFIER
THEN_SY,
TO_SY,
TRACEON_SY,
TRACEOFF_SY,
TRUE_SY,
UPB_SY,
VECTOR_SY,
WHILE_SY,
WRITE_SY,
CUR_SY,
LEY_SY,
TILDE_SY,
NEQ_SY,
SLASH_SY,
INT_LIT,
REAL_LIT,
STRING_LIT,
IDENTIFIER,
The entries in the lexeme definition file are put in one to one correspondence with an enumerated data type: lexeme. This type provides the interface between the lexical analyzer and the syntax analyzer. The context free grammar of the language will be defined in terms of these lexeme values.
The lexical analyzer communicates with the syntax analyzer via a small group of procedures and variables. The most important of these is the procedure next_symbol. Each time it is called it processes one symbol. The finite state machine in the level one lexical analyzer is invoked to delimit the symbol. The terminating state of the machine determines whether the symbol was a name, an operator, a number or a string.
If a number is found then a numeric conversion function is invoked to convert the decimal representation of the number into a binary format. In Turbo Pascal this is easily achieved using the val function. This takes a string and returns an integer or real equivalent. If your compiler is in another dialect of Pascal, it may be necessary to write your own numeric conversion functions.
Next_symbol leaves the results of lexical analysis in a set of global variables:
symbol :lextoken;
lexsymbol :lexeme;
the_string :stringv;
the_real :real;
the_integer:integer;
Lexsymbol will hold the lexeme that has been matched. If it was an integer, real or string then it will take on the values INT_LIT,REAL_LIT or STRING_LIT and the corresponding value of the literal will be stored in the_integer,the_real or the_string . If the lexeme is a reserved word then the corresponding enumerated type value is stored in lexsymbol. If the lexeme is a user defined identifier, then the value of lexsymbol will be IDENTIFIER and the number associated with that identifier will be returned in symbol as a lextoken.
The writer of a syntax analyzer can choose to call next_symbol and perform tests on these global variables. In the process of analysis certain actions have to be carried out repeatedly. A common sequence is to test to see if the current lexeme has a particular value, and, if it has to call next_symbol to see what follows. This combination of actions is bundled up in :
function have( t: lexeme) : boolean;
If the lexical analyzer has t then it says yes and grabs the next lexeme.
{ ------------------------------------------------------------------- }
{ HAVE }
{ conditionally matches a token }
{ ------------------------------------------------------------------- }
function have( t: lexeme ) : boolean;
begin
if t = lexsymbol then
begin next_symbol(the_buffer);
have:=true
end
else have:=false;
end;
The stricter version of have is mustbe. This is called when the syntax stipulates the presence of a particular symbol. If the symbol is not found then an error is reported. The error message specifies what symbol was found and what was expected. This can generate such messages as `begin found instead of then'.
Note that as presently configured, mustbe will skip over newlines until it comes to a symbol. This is appropriate for most modern languages, but would not be appropriate for parsing assembly language for instance.
{ ------------------------------------------------------------------- }
{ MUSTBE }
{ compulsorily matches a token }
{ ------------------------------------------------------------------- }
procedure mustbe(t : lexeme );
begin
if not have(t) then begin
if have(newline_sy) then mustbe(t) else syntax(t);
end;
end;
{ ------------------------------------------------------------------- }
{ SYNTAX }
{ report error and stop }
{ ------------------------------------------------------------------- }
procedure syntax( t : lexeme);
var m :stringv;
begin
m:= currentid +' found instead of '+ psym(ord(t));
ReportError(m);
end;
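As an illustration of how have and mustbe are typically used, a clause of a recursive descent syntax analyser might look like the following sketch. The lexeme names IF_SY and ELSE_SY are invented for the example (only THEN_SY appears in the lexeme file listed earlier), and expression stands for whatever procedure parses an expression.

procedure if_expression;
begin
  mustbe(IF_SY);              { the clause must start with 'if' }
  expression;
  mustbe(THEN_SY);            { 'then' is compulsory: an error is reported if it is absent }
  expression;
  if have(ELSE_SY) then       { the 'else' part is optional }
    expression;
end;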
The code generator may need to know the current identifier's printable form in order to be able to plant information for a linker. It is thus convenient to have a function that will return this. This avoids the code generator having to know anything about the data format of the text buffer.
{ -------------------------------------------------------------- }
{ CURRENTID }
{ returns the identifier as a string }
{ -------------------------------------------------------------- }
function currentid :stringv;
var n:stringv;
i,p:integer;
begin
with the_buffer do begin
n:='';
for i:=start+1 to finish do begin
n:=n+thetext^[i];
end;
currentid:=n;
end;
end;
For error reporting purposes it is often convenient to be able to convert from lexemes back into printable strings.
The technique that you choose for this will depend upon how frequently you want to make the conversions each way. If conversions back into strings are very rare (just when the compiler stops with an error), then it is not necessary for the conversion process to be very fast. In this case one could perform a depth first traversal of the trie looking for a node marked with the current lexeme. When one is found, one knows that this node represents the last character of the identifier. By tracing one's path back through the trie the original identifier can be extracted.
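A sketch of such a traversal is given below. It is not Toolbox code: the function name findname is invented, and the token is treated as a plain integer matching the final field of the delabrandis record.

function findname(n: lexnode; t: integer; sofar: string; var name: string): boolean;
begin
  findname := false;
  while n <> nil do begin
    if n^.final = t then begin                       { last letter of the wanted word }
      name := sofar + n^.pref;
      findname := true;
      exit;
    end;
    if findname(n^.follower, t, sofar + n^.pref, name) then begin
      findname := true;                              { found further along this prefix }
      exit;
    end;
    n := n^.alternates;                              { otherwise try the next alternative letter }
  end;
end;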
If speed is more important then it may be worth explicitly storing each identifier in RAM. This of course brings with it all the problems of space overheads mentioned earlier. If you use a 2 dimensional character array, you have to agree upon the maximum length of identifier that you can store.
An alternative storage mechanism, shown in figure , uses an array starts, indexed on the lexemes, to find the starting positions of identifiers in a one dimensional character array: pool. The lexeme l will then occupy positions starts[l] up to the character before starts[l+1].
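A sketch of this arrangement, with invented names and size limits, might look like this:

const
  poolmax  = 10000;                           { hypothetical size of the character pool }
  tokenmax = 1000;                            { hypothetical maximum number of lexemes }
var
  pool   : array[1..poolmax] of char;         { all identifier characters stored end to end }
  starts : array[0..tokenmax] of integer;     { starts[l] = first character of lexeme l }

function lexstring(l: integer): string;
var i: integer;
    s: string;
begin
  s := '';
  for i := starts[l] to starts[l+1] - 1 do    { lexeme l runs up to the start of lexeme l+1 }
    s := s + pool[i];
  lexstring := s;
end;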
The lowest level interface to a lexical analyser is provided by a procedure that gets a symbol. We call this procedure NEXTSYMBOL . It reads in a lexeme and stores the token in the variable symbol .
If the symbol turns out to have been a literal value, for instance a number or a string then the actual value of the literal must be stored for the subsequent stages of the compiler. The value can be stored in one of several global variables:
the_integer, the_real, the_string.
In order to do this some processing will be necessary to convert between source character strings and numbers or strings.
{ ------------------------------------------------------------------ }
{ NEXTSYMBOL }
{ ------------------------------------------------------------------ }
procedure next_symbol;
var S:fsmstate;
function numconv:lexeme;
var n:stringv;
i,p:integer;
isint:boolean;
begin
{ the body of numconv is given below }
end;
procedure printsymbol;
var i:integer;
c: char;
begin
with the_buffer do
if lexsymbol <> newline_sy then begin
for i:=start to finish -1 do write(thetext^[i]);
write(' ');
end else writeln;
end;
function stringconv:lexeme;
var n:stringv;
i,p:integer;
escape:boolean;
c:char;
procedure append(c:char);
begin if length(n)<MAXSTRING then n:=n+c; end;
begin
{ the body of stringconv is given below }
end;
begin
coerce.dummy:=0;
compilercursor:=the_buffer.start;
S:=newlexeme(the_buffer);
with coerce do
if s in [opstate,brackstate,idstate] then
l1:=insert_token(the_buffer,predefined)
else if s in[numstate,expstate] then l2:=numconv
else if s in [stringstate,lastquotestate] then l2:=stringconv
else symbol:=coerce.l1;
if symbol >maxpredefined then lexsymbol:=identifier
else lexsymbol:=coerce.l2;
end;
The string is searched to see if it contains any non-digit characters.
If it does then the flag isint will be cleared. On the basis of this flag the Turbo Pascal standard procedure val is called to convert the string either into an integer or a real. Then the appropriate lexeme is returned from numconv.
isint:=true;
with the_buffer do begin
n:='';
for i:=start to finish-1 do begin
n:=n+thetext^[i];
isint:=isint and (thetext^[i] in ['0'..'9']);
end;
if isint then begin val(n,the_integer,p); numconv:=INT_LIT end
else begin val(n,the_real,p); numconv:=REAL_LIT end;
end;
Some characters that one may want to place in a string have no printable representation. Most of these are format characters like carriage return, tab, or backspace. Some computer languages provide special means of including these into a string by prefixing a printable character with an escape character. When so prefixed, the printable character means something else.
In C the escape prefix is \ and in S-algol it is ' .
The conversion from the printable to the internal form of a string can be performed by a simple finite state machine that has two states:
begin
escape:=false;
with the_buffer do begin
n:='';
for i:=start+1 to finish -2 do begin
c:=thetext^[i];
if not escape then begin
escape:=classtab[ord(c) and 127]=quote;
if not escape then append(c);
end
else begin
escape:=false;
{ the case statement for escape codes, shown below, is inserted here }
end;
end;
the_string:=n;
stringconv:=string_lit;
end;
end;
Most languages provide a mechanism for embedding control characters in strings, typically by providing a prefix character and then some code that follows. This example obviously applies to the language S-algol, but other languages would require similar code. S-algol has the following mapping from escape sequences to non-printable characters:
'n'  LineFeed       = 10
't'  Tab            =  9
'o'  CarriageReturn = 13
'b'  BackSpace      =  8
'p'  VerticalTab    = 11
Anything else preceded by a ' stands for itself. Hence '' stands for ' and '" stands for " .
case c of
'n' : append(chr(NEWLINE));
't' : append(chr(TAB));
'o' : append(chr(CR));
'b' : append(chr(BS));
'p' : append(chr(VTAB));
else append(c);
end ;
We wish to have an enumerated type lexeme for the reserved words of the language. Subsequent identifiers are assigned ordinal values as their lexical tokens, starting where the predefined lexemes finish. What we obtain from our identifier conversion routine is a lextoken which is an ordinal type. We have to convert this to a lexeme. There is no built-in way of converting an ordinal to an enumerated type in standard pascal, so we cheat by using a variant record. We assign the lextoken to field l1 and then read it back as a lexeme from field l2 .
var coerce:record
case boolean of
true:(l1:lextoken);
false:(l2:lexeme;dummy:byte);
end;
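As a sketch of how the trick is then used, a hypothetical helper (not in the toolbox, which does the two assignments in line) could read:
{ convert an ordinal token value into a lexeme via the coerce variant record }
function token_to_lexeme(t : lextoken) : lexeme;
begin
coerce.l1 := t; { write the token through the lextoken field }
token_to_lexeme := coerce.l2; { read the same storage back as a lexeme }
end;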
If an error occurs during the execution of an S-algol program, the line number on which this occurred is passed back to the compiler. The program is then searched using the lexical analyser to find the appropriate position in the text where the error happened. The lexical analyser knows it has been called to find an error if the variable stopline is non zero.
if stopline >0 then
if coerce.l2=newline_sy then
begin
if the_line > stopline then
begin
if errorfree then
error('Run time Error');
end
end;
1. Write numeric conversion functions to convert from the decimal representation of integers and reals to binary.
2. Modify these to handle numbers in any base from 2 to 16. Use the notation defined by the grammar:
anybase -> base # number
base -> decimalnumber
number -> anydigit anydigit*
decimalnumber -> decimaldigit decimaldigit*
anydigit -> [0-9A-F]
decimaldigit -> [0-9]
An example in this notation might be
16#01A
3. Alter the string conversion procedure to handle pascal strings. These are delimited by single quotes.
4. Modify the data type lexeme to correspond to the lexemes needed for pascal. Create a new version of the file lexemes.def called lexemes.pas that contains the pascal lexicon. Test the ability of your reconfigured lexical analyzer to accept pascal input. This may require you to write a driver program that will print the output of the lexical analyzer.
5. How would you make the lexical analyzer insensitive to the differences between upper and lower case keywords and identifiers whilst retaining the cases of letters in strings?
The syntax analysis method used in this course is called recursive descent. It is the standard method of syntax analysis used in block structured languages.
The idea of recursion or self reference is central to this compiling technique. Consider the class 2 grammar described in chapter 1.
G = (T, N, S, P)
where
T = { ), (, 1, 2, 3 }
N = { s, t, u }
S = s
P =
s -> ( t )
t -> 1 u 2
u -> t
u -> s
u -> 3
The non-terminals of this grammar refer to one another: s refers to t which refers to u which in turn refers back to both of them. This grammar is recursive or self-referential. Recursion is a property of those grammars that allow terms to be nested within one another to an arbitrary depth. Because the conventions governing the way that mathematical formulae are written down allow the use of nested brackets, most computer languages too allow this sort of nesting: if only when handling mathematical expressions. In consequence the grammars used in computer languages tend to be recursive. Most high level languages allow recursion in several parts of their grammar. Recursion in a grammar is handled by recursion in the syntax analyzer. The analyzer is a collection of mutually recursive procedures. Associated with each non-terminal of the grammar is a procedure whose job it is to recognise that non-terminal. This is best understood by example.
To parse G we would set up 3 procedures, one to handle each of s, t, u . We know from the definition of G that an s is a t with brackets round it.
A procedure that would accept an s would be:
accept a '('
accept a t
accept a ')'
Similarly, a procedure that would accept a t would be:
accept a '1'
accept a u
accept a '2'
A procedure that would accept a u must choose between the alternatives:
If the nextsymbol = '(' then accept an s
else
if the nextsymbol = '1' then accept a t
else
accept a '3'
Written in Pascal using the mustbe procedure, these become:
procedure U;
begin
if symbol=bra then S else
if symbol=one then T else
mustbe(three)
end;
procedure S;
begin
mustbe(bra);T;mustbe(ket)
end;
procedure T;
begin
mustbe(one);U;mustbe(two)
end;
It is clear that you will not get very far with producing your parser unless you have a clearly specified grammar for your language. There are a number of formalisms for putting a grammar down on paper. Some of these, like the one we have used up to now depend upon the use of special type faces. We have been showing non-terminals in italics and the terminals in bold. Although this reads well enough on the printed page, and can be handled by most word processors, variations in typeface are difficult to express in the standard ASCII codes used on Western computers.
It is convenient at times to allow parsers to be built automatically by other programs which have been given a specification of a grammar. The automatic construction of a compiler in this way is termed compiler compilation, and the programs which do it compiler compilers. These compiler compilers need a clear machine readable specification of the grammar, without italics or boldening. A formalism that meets these criteria is BNF or Backus Naur Form, which was developed to describe Algol 60 by Backus and Naur in 1960.
In BNF non-terminals are shown between angle brackets '<' '>' thus: <expression> means the non terminal whose name is expression. Symbols written down outside the brackets stand for terminals. The way that a non terminal expands out is shown by ::= which plays the same role as -> in our earlier notation. Thus the productions
<multop>::=<star>|div|/|rem
<star>::=*
define the multiplying operators, with the alternatives separated by the | sign.
In S-algol the if clause has two variants depending upon whether else is to be used. The recognizer that handles both forms is shown in listing 3.2:
procedure if_clause ;
label 1,2,3,4;
begin
next_symbol;
1: clause;
if have(do_sy) then begin clause end else
2: begin
mustbe(then_sy);
3: clause;
mustbe(else_sy);
4: clause;
end;
end;
The syntax handled by listing 3.2 is:
if <clause> do <clause>
if <clause> then <clause> else <clause>
Consider the parse of the statement:
if b>0 then "positive" else "negative"
^
The up arrow indicates where the current symbol is. The procedure if_clause will only be called if some other procedure has performed a look ahead and decided that the next symbol is an 'if'. Since the 'if' has already been recognised, it is discarded by if_clause when it calls next_symbol. This gets us to label 1. The state of the parse is now:
if b>0 then "positive" else "negative"
   ^
The syntax stipulates that an 'if' must be followed by a clause so the procedure that parses clauses is called. Next a choice has to be made. If the program that is being read in is valid then the next symbol must either be 'do' or it must be 'then'. Which of these it is can be tested using the have call. This tests the current symbol against its parameter. If they match, then the current symbol is 'eaten' and a true result returned. Otherwise false is returned. Since we have a 'then' not a 'do' false will be returned and we go to label 2. The state of the parse is still:
if b>0 then "positive" else "negative"
       ^
The grammar now offers no alternative. We must have a 'then'. Mustbe is called to make sure that we do and eats up the 'then'. The program has now moved to label 3. The state of the parse is now:
if b>0 then "positive" else "negative"
            ^
There are no further alternatives. We must have a clause followed by an 'else'. This brings us to label 4, with the parse in state:
if b>0 then "positive" else "negative"
                             ^
One more call on clause and the parse is finished. The other form of conditional clause, the case statement, provides some new problems. Its syntax in S-algol is roughly:
case <clause> of
<clauselist> : <clause>
...
default : <clause>
where a <clauselist> is one or more clauses separated by commas. The recognizer for case clauses is:
procedure case_clause ;
begin
next_symbol;
clause(t1);
mustbe(of_sy);
while errorfree and not have(default_sy) do begin
repeat clause (t)
until not have(comma_sy);
mustbe(colon_sy);
clause(t);
end;
mustbe(colon_sy); clause(t);
end;
The procedure case_clause uses two loops. The while loop recognizes each of the actions of the case clause and can be terminated either by an error in the parsing of lower level code or by the occurrence of the symbol ' default'. Within this we have a repeat loop that recognizes the list of guards in front of each of the actions.
The other types of clauses in the language: for_clause, while_clause etc, can be parsed by applying the same principles exhibited in the case_clause and the if_clause. A more complicated problem is raised in the parsing of expressions. In most programming languages there is the notion of operator priority. If we consider the expression:
2+3*3
the answer should be 11 rather than 15, since the multiplication operator is taken to be of higher priority than addition. What 'of higher priority' means is that we can rewrite the expression as:
2+(3*3)
without changing its meaning. High priority operators implicitly cause their arguments to be bracketed. They allow the programmer to leave out the brackets.
In S-algol the priorities of the operators are shown in table 7.1.
High    :=
        * div rem / ++
        + -
        is isnt < > >= <= ~= =
        and
        or
Low
The operator priorities are encoded in the syntax for expressions. Syntactically expressions are defined to be made up of 8 classes of sub expressions generated by 8 non-terminals called: expression, exp1, exp2, ..., exp7. The production rules for the outermost levels take the form:
<expression> ::= <exp1> [or <exp1>]*
<exp1> ::= <exp2> [and <exp2>]*
Consider the expression a and b or c and d. Taken on its own it could be bracketed in any of the following ways:
(a and( b or (c and d)))
((a and (b or c)) and d)
(a and ((b or c) and d))
((a and b) or (c and d))
The grammar permits only the last of these, since the expression must be derived as:
(<exp1> or <exp1>)
((<exp2> and <exp2>) or ( <exp2> and <exp2>))
(( a and b) or ( c and d))
<exp7> ::= <name >|
<stdproc>|
<literal>|
(<clause>)|
<cur> <sequence> <ley>|
begin <sequence> end |
@<clause> of <type1><bra><clauselist><ket>|
vector<bounds> of <clause>
<clauselist> ::=<clause>[,<clause>]*
<subscription> ::=(<clauselist>)[<subscription>]*|
(<clause><bar><clause>)
<exp6> ::=<exp7>[<subscription>]
<exp5> ::=<exp6>[:=<exp6>]
<exp4> ::=[<addop>]<exp5>[<multop><exp5>]*
<multop> ::=<star>|div|rem|/|++
<star> ::=*
<bar> ::=|
<addop> ::= +|-
<exp3> ::= <exp4>[<addop><exp4>]*
<exp2> ::= [~] <exp3> [<relop> <exp3>]
<relop> ::= is|isnt|<|>|>=|<=|~=|=
The parsing of expressions provides a particularly direct example of how to translate a BNF into a set of recursive procedures. We take as an example the procedure that recognizes an exp4.
{ -------------------------------------------------------------- }
{ EXP4
this parses the syntax rules
<exp4>::=[<addop>]<exp5>[<multop><exp5>]*
<multop>::=<star>|div|rem|/|++
<addop>::= +|-
}
{ -------------------------------------------------------------- }
procedure exp4 (var t:typerec);
var continue:boolean; sym:lexeme;
t1:typerec;
adop:(plus,minus,noadop);
begin
debug(' exp4 ');
adop:=noadop;
if have(plus_sy) then adop:=plus else
if have(minus_sy) then adop:=minus;
exp5(t);
sym:=lexsymbol; continue:=true;
repeat
sym:=lexsymbol;
case lexsymbol of
star_sy : begin
next_symbol;exp5(t1);
end ;
slash_sy : begin
next_symbol;
exp5(t1);
end ;
div_sy,rem_sy : begin
next_symbol;
exp5(t1);
end ;
dplus_sy : begin
next_symbol;
exp5(t1);
end ;
else continue:=false;
end;
until not continue ;
end;
The important technique to notice in Listing 3.3 is the way a case statement is used to select between which of the operators has been found. This is used for clarity and speed. We could quite readily have used a list of if .. then... else... statements directed by calls to have but this would have been slower and a little less clear. As a general rule, if you are writing a parsing procedure for a rule that has only a couple of alternatives, then you should use calls on have along with an if .. then... else... statement. When you have many alternatives use a case statement. The techniques described in this chapter are sufficient to parse the context free portion of the language. There are some portions of the language for which they are inadequate. Consider for example:
x(1)
This can either mean subscript a vector called 'x' and extract its first element, or it could mean call a procedure called 'x' with parameter one. What it means can only be determined once we know what type of identifier 'x' is. This is context sensitive. Before we can deal with it we must look at how a compiler stores information about types.
A type is a set of values.
Reals are a set of numbers. Integers are a subset of the reals. Booleans are the set {true, false}. Characters are the set {A, B, C, ..., a, b, c, ..., +, -, *, ...}.
These sets are what are often termed base types in a computer language; they correspond to sets of values that can be represented in machine words of various sorts. Most modern computers provide support for floating point numbers, integer numbers, binary values and characters. For this reason, these types are given priority in computer languages. They are taken to be predefined. For the machine supported base types there is a one to one mapping between the type and a data format:
Type Format
boolean bit
character byte
integer machine word
real IEEE 64bit floating point number
Some compilers may not use this particular representation of types. It may be convenient to represent reals, booleans and integers all in 32 bits. Whatever the particular convention followed, the idea remains the same. For each type there is a determinate data format.
In addition to the base types, it is usually possible to define new types in a program. These types are constructed using type composition operations. Such operations have a variety of syntactic forms in different languages. When building a general purpose set of compilation tools it is advisable to look at type composition in a language independent fashion. If we do that we can capture the generality of types rather than just the particular types that one finds in a given language.
We are all familiar with the Cartesian co-ordinate system in geometry. This uses ordered pairs of real numbers, conventionally designated [x, y], to represent points on a two dimensional surface. Triples [x, y, z] represent points in 3 space. Quadruples would represent points in 4 space etc. In computer science the notion of Cartesian composition is generalised to the notion of a tuple, as in: 4-tuple, 5-tuple. An n-tuple is an ordered list of n components. The components may all have the same type, or they may be different. In the case of Cartesian co-ordinates all components are real numbers. In polar co-ordinates, one is an angle the other a real number. Examples of tuples in computer languages are records, structures and procedure parameter lists.
A union type is one that may take on several forms. Let A and B be types representing the sets of values a1, a2, a3,.... and b1, b2, b3,....
Then the type C=union(A,B) is the type formed by the union of the sets of values
a1, a2, a3,.... b1, b2, b3,....
Given a type T which represents the set of values t1, t2, t3,... then the type powerset of T is the set of all subsets of T. Powerset types are rather misleadingly termed set types in programming languages like Pascal. Powersets are difficult to implement on a computer, so some compilers restrict themselves to implementing powersets of the integers.
Given an ordered type then a subrange type can be defined on it. Given the integers, 1..2 is a subrange of them. Given the characters, 'A'..'C' is a subrange of them.
Given the types T and U then the type map(T->U) is the powerset of the ordered pairs [t,u] where t is an element of T and u is an element of U. An individual value of the type map(T->U) is one such set of ordered pairs.
Maps in the general sense are relations. These are implemented in database programming languages.
Where the mapping is such that for each value of t there is only one pair [t, u] and thus a unique value of u, we have a function from T to U. Functions are widely implemented in programming languages. The two main sorts of functions are algorithmic functions and storage functions. Algorithmic functions are implemented by passing parameters to subroutines that return a result. Storage functions are things like arrays where the result is computed by indexing an area of memory.
Some languages distinguish between the type of a value and the type of a store location holding that value. A location that can be updated is a variable. The type of a variable that can hold an integer is distinguished from the type of an integer itself.
These concepts of maps, sets, cartesian composition, constancy and subranging are sufficient to represent most of the concrete types that you will come across in compilation. We need some way to represent them in a compiler. In the compiler toolbox they are represented using pascal records whose types are declared in idtypes.pas. A functional interface is provided to the data structure with functions implementing the type construction operations described in this chapter. Operations are provided on types to determine whether they are equal or not.
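The records in idtypes.pas are not reproduced here, but a representation along the following lines conveys the idea; every name in this sketch is an illustrative assumption rather than the toolbox's own declaration.
type
typeclass = (base_t, tuple_t, union_t, powerset_t, subrange_t, map_t);
tref = ^typenode;
typenode = record
case tclass : typeclass of
base_t : (size : integer); { size of the machine representation }
tuple_t,
union_t : (left, right : tref); { binary composition of two types }
powerset_t : (elem : tref); { powerset of an element type }
subrange_t : (base : tref; lo, hi : integer);
map_t : (from_t, to_t : tref) { map(T -> U) }
end;
{ constructor for the Cartesian composition of two types }
function mk_tuple(a, b : tref) : tref;
var t : tref;
begin
new(t);
t^.tclass := tuple_t;
t^.left := a;
t^.right := b;
mk_tuple := t;
end;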
Basic type checking functions in a compiler
{ ------------------------------------------------------------------ }
{ EQ  - name equality }
{ EQ2 - representation equality }
{ ------------------------------------------------------------------ }
function eq(var t1:typerec; var t2:typerec):boolean;
function eq2(var t1:typerec; var t2:typerec):boolean;
{ ------------------------------------------------------------------ }
{ MATCH }
{ enforce name equality of types }
{ ------------------------------------------------------------------ }
procedure match(var t1:typerec;var t2:typerec);
var em:textline;
begin
if not eq(t1,t2) then begin
em:= ptype(t1)+' not compatible with ' +ptype(t2);
error(em );
end;
end;
{ ------------------------------------------------------------------ }
{ COERCE - verify representational equality }
{ ------------------------------------------------------------------ }
procedure coerce(var t1:typerec;var t2:typerec);
{ ------------------------------------------------------------------ }
{ BALANCE }
{ ------------------------------------------------------------------ }
procedure balance(var t1:typerec; var t2:typerec);
The equivalence of types can be tested using eq and eq2 which return boolean results. In contexts where only a certain type is permitted, for instance with procedure parameters, the operations match, coerce, and balance can be used. These force type equivalence and generate error messages if the types are not equivalent. Match is used for name equivalence and coerce for representational equivalence. Balance is a special form of coercion used with numbers. In the arithmetic expression
1 + 2.9
we have a combination of an integer and a real number. Balance is called in arithmetic expressions to ensure that the two operands are of the same type. If they are not, it tries to convert them to the same type by planting instructions to convert numbers between representations.
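A sketch of the sort of thing balance does is shown below; int_type, real_type and the two gen_float routines are stand-ins for whatever the real code generator provides, so this is an illustration of the idea rather than the toolbox code.
{ make the two operands of an arithmetic operator agree in type }
procedure balance(var t1 : typerec; var t2 : typerec);
begin
if eq2(t1, t2) then exit; { already the same representation }
if eq2(t1, int_type) and eq2(t2, real_type) then
begin
gen_float_second; { plant code to float the operand below the top of stack }
t1 := real_type;
end
else if eq2(t1, real_type) and eq2(t2, int_type) then
begin
gen_float_top; { plant code to float the top of stack }
t2 := real_type;
end
else
coerce(t1, t2); { otherwise fall back on representational equality }
end;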
The type checking procedures can be called by the syntax analyser for two purposes:
a) to disambiguate a construct,
b) to validate a construct.
Disambiguation is needed if the same syntax can imply that several different sequences of machine code instructions have to be generated: the construct x(1) discussed above is an example of this.
The lexical analyser will have converted all identifiers into an internal representation in the compiler as numbers. The identifier management software must associate type information with these numbers. In the toolbox each identifier has an idrec associated with it. These are declared in the file idtypes.pas. The record gathers together information about the identifier as shown in table .
field       | type             | meaning |
            |                  | is global, local, a parameter, a static or a typename |
identifier  | lextoken         | holds the numeric form of the identifier output by the lexical analyser |
offset      | integer          | specifies the address of the variable relative to some base |
block_level | byte             | for block structured languages it specifies how far in the variable is in terms of blocks |
lex_level   | byte             | specifies how far in the variable is in terms of procedures |
typeinfo    | typeref          | a reference to the type of the identifier |
zone        | (variable,field) | says if the identifier is a field of a record |
A number of different things can be given names in a program: types, variables, fields of records. The variables may be classified according to where in the program they are declared. Global variables are those that are declared at the outermost level of the program and are accessible to all procedures. Local variables are accessible to the procedures in which they are declared. Parameters are an intermediate category, accessible from two places. Suppose I call a procedure:
procedure p (int a,b,c)
begin
write (a+c)/b
end
p(1,2,3)
In a block structured language procedures and blocks may be nested within one another.
procedure p (int a,b,c)
begin
procedure m
begin
let a = b+j
write (a+c)/b ! point 1
begin
let b = j
write (a+c)/b ! point 2
end
write (a+c)/b ! point 3
end
write (a+c)/b ! point 0
m()
end
p(1,2,3)
The four write statements at points 0 to 3 produce the output:
2.0 4.0 2.66666 4.0
These differences arise from the fact that there are 2 versions of both a and b in the program. At point 0 all that can be seen are the parameters a, b, c. At point 1 the parameter a is hidden by the local variable a, and similarly with b at point 2. Finally at point 3 the local b has vanished. The identifier manager handles this behaviour by using a compile time stack of pointers to identifier records. This is shown in figure .
Whenever a new variable is declared, the function newname is called to allocate an identifier record for the new id. This record is pushed onto the identifier stack. When an identifier is encountered subsequently the function lookup is called. This takes as its parameter the lexical symbol for the identifier and scans the identifier stack from top to bottom to find the first identifier record with that symbol. The effect is to find the most recently declared identifier of that name. Whenever the syntax analyser encounters the start of a new scope, either a procedure or a block, it calls the procedure enterscope to record the value of the identifier stack pointer when the scope was entered. On exit from the procedure or block exitblock or exit_proc is called to restore the identifier stack to the state that it was in when the scope was entered.
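The shape of this machinery can be sketched as follows; the array bound, the idref pointer type and the procedure names are simplifications of what the toolbox actually declares.
const MAXIDS = 500; { assumed limit on simultaneously visible identifiers }
type idref = ^idrec;
var
idstack : array[1..MAXIDS] of idref; { compile time stack of identifier records }
idtop : integer; { top of the identifier stack }
{ declare a new identifier: allocate a record and push it }
function newname(id : lextoken) : idref;
var r : idref;
begin
new(r);
r^.identifier := id;
idtop := idtop + 1;
idstack[idtop] := r;
newname := r;
end;
{ find the most recent declaration of id by scanning from the top down }
function lookup(id : lextoken) : idref;
var i : integer;
begin
lookup := nil;
for i := idtop downto 1 do
if idstack[i]^.identifier = id then
begin lookup := idstack[i]; exit; end;
end;
{ note the stack level on entry to a scope, and cut back to it on exit }
procedure enterscope(var mark : integer);
begin mark := idtop end;
procedure exitscope(mark : integer);
begin idtop := mark end;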
Up to now we have described a parser that is capable of checking if a program is valid in terms of the syntax and type rules of the language. It does not produce any output other than error messages for incorrect programs.
We will now look at how to modify the parser to generate an equivalent machine code program. The technique used in the Toolbox is to decorate the parser procedures with calls to the code generator. This can be illustrated by looking at a couple of simple examples.
Listing 9.1
{ -------------------------------------------------------------- }
{ WHILE CLAUSE }
{ recognises: while <bool> do <void> }
{ -------------------------------------------------------------- }
procedure while_clause;
var t:typerec;
begin
mustbe(while_sy); clause (t);
if have (do_sy) then
begin clause(t); match(t,VOID); end;
end;
Listing 9.2
{ -------------------------------------------------------------- }
{ WHILE CLAUSE }
{ recognises: while <bool> do <void> }
{ -------------------------------------------------------------- }
procedure while_clause;
var t:typerec;
l1,l2,l3:labl;
begin
l1:=newlab; l3:=newlab;l2:=newlab;
plant(l1);
mustbe(while_sy); clause (t); condify(t); jumpt(l3);jumpop(l2);
plant(l3);
if have (do_sy) then
begin clause(t); match(t,VOID); end;
bjump(l1);plant(l2);
end;
Listing 9.1 shows the original form of the parsing procedure for while clauses. It simply checks the grammar of the clause. Listing 9.2 shows how this is modified to handle code generation. It has been augmented with calls to newlab, plant, condify, jumpt, jumpop and bjump. As can be deduced from their names these procedures are responsible for generating jump instructions and handling labels. Suppose that we have the while statement:
while C1 do C2
with the Ci standing for clauses. The effect of the decorated parsing procedure is to generate machine code that looks like listing 9.3.
l1:
C1 code
condify code
jumpt l3
jump l2
l3: C2 code
jump l1
l2:
Listing 9.3
The code shown in listing 9.3 is not for any one particular type of CPU. It is an abstract machine code. It abstracts from the details of particular machines. The syntax analyser assumes it is producing instructions for this abstract machine. The abstract machine is a general purpose computer whose instruction set includes all of the operations necessary to implement the semantics of the language that is being translated. On some computers the operations of the abstract machine can be implemented with single instructions. In others, several real machine instructions may be needed to achieve the same effect as the abstract machine instructions. What is shown in listing 9.3 is a fairly simple set of abstract machine instructions that are likely to be available to most machines. A full listing of the instructions executed by the abstract machine is given in
Appendix G, but we will give a brief outline of the machine here. The machine is assumed to have four registers:
PC Program Counter points at current instruction.
GP Globals Pointer, points at the start of the global variables
FP Frame pointer, points at the local variables of a procedure
SP Stack Pointer points at the top of the stack.
There are three areas of store:
CS The Code Store holds instructions
STACK This holds variables and temporary results
HEAP This holds objects like arrays, strings or structures.
All instructions are defined in terms of the effect that they produce on the registers and the stores.
The S abstract machine is a stack machine. That is to say arithmetic instructions operate on the top two words on stack. Consider the following expression:
2+4
This works by placing two words on the stack and then adding them. The abstract machine instructions that do this would be:
llint(2)
llint(4)
add
This form of arithmetic in which the operator comes after its operands is termed reverse polish notation. It is a particularly easy notation to compile into. The general rule for generating code for any binary expression
e1 op e2
becomes :
generate code for e1
generate code for e2
generate code for op
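In a recursive descent parser this rule falls out naturally: each expression procedure emits the code for its left operand, then for every operator it emits the code for the right operand and only then the operator itself. A sketch for one additive level, in which gen_add and gen_sub stand for assumed code generator calls:
{ <exp3> ::= <exp4> [<addop> <exp4>]*  with postfix code emission }
procedure exp3;
var continue : boolean;
begin
exp4; { code for the left operand comes out first }
continue := true;
while continue do
if have(plus_sy) then
begin exp4; gen_add end { right operand, then the operator }
else if have(minus_sy) then
begin exp4; gen_sub end
else continue := false;
end;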
Reverse polish notation combined with a recursive descent compiler will automatically generate the right code for expressions with operators of mixed priorities. The expression:
4+2*3
should yield 10. Given the syntax:
<exp3>::=<exp4>[<addop><exp4>]*
<exp4>::=<exp5>[<multop><exp5>]*
<exp5>::=<int-literal>
we obtain the parse shown below. The middle column gives the abstract machine instruction generated at each step and the right hand column the run time stack that results:
exp3                                       ...
exp4 addop exp4                            ...
exp5 addop exp4                            ...
4 addop exp4                 llint(4)      ... 4
4 addop exp5 multop exp5                   ... 4
4 addop 2 multop exp5        llint(2)      ... 4 2
4 addop 2 multop 3           llint(3)      ... 4 2 3
4 addop 2 * 3                mult          ... 4 6
4 + 2 * 3                    add           ... 10
Translated into x86 code this becomes:
push 4
push 2
push 3
pop ecx
pop eax
imul ecx
push eax
pop ecx
pop eax
add eax,ecx
push eax
The generation of arithmetic instructions is fairly straightforward since computers always have a set of arithmetic machine codes. Handling boolean operations is more problematic. Consider the operation < which takes two numbers and returns a truth value. In a high level language like S-algol or Pascal truth values have the type boolean, and are represented in memory by a word which contains some non zero value for true and zero for false. Some modern CPUs like the AMD 29000 have opcodes that directly compute this operation, but older ones like the 80x86 series do not. Instead they have comparison instructions which compare two values and set some CPU flags on the result. In particular the sign and carry flags are set according to the result of comparison. The 80x86 series then provide jump instructions that will conditionally jump on the flags: JL for Jump Less than, JG for Jump Greater than etc.
Suppose we have the source code:
if a < b do X
Compiled into x86 code this becomes:
push a
push b
pop ecx
pop eax
cmp eax,ecx
jl label1
jump label2
label1: ... code for X
label2:
If instead the comparison is used to produce a boolean value that is stored, as in an assignment like p := a < b, extra code must be planted to place a 1 or a 0 on the stack:
push a
push b
pop ecx
pop eax
cmp eax,ecx
; code to generate a boolean on the stack
jl label1
push 0 ;*
jump label2
label1:push 1 ;*
; code to perform the assignment
label2:pop p
If clauses provide an illustration of how conditionals are handled. The syntax analysis procedure for an if clause is given in listing . Note how the procedure condify is called to ensure that the condition codes have been set. This is necessary to deal with examples like:
if a do write "z > y"
if x<y do write "x < y "
In the second example the comparison x<y will already have set the condition codes, and the call on clause would have set the variable t to condition. Finding that the condition codes were already set, the condify procedure would do nothing. In the first example the condition is just a boolean variable, so condify must plant code to test its value and set the condition codes before the conditional jump is planted.
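A sketch of the idea behind condify is given below; condition_type, bool_type and gen_test_top_of_stack are assumed names standing for whatever the toolbox actually uses.
{ ensure that the condition codes reflect the boolean value just compiled }
procedure condify(var t : typerec);
begin
if eq(t, condition_type) then exit; { a comparison has already set the flags }
match(t, bool_type); { otherwise the clause must have been a boolean }
gen_test_top_of_stack; { plant code to pop it and compare it with zero }
t := condition_type;
end;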
{ -------------------------------------------------------------- }
{ IF_CLAUSE
this parses the rule
<ifclause> ::= if <clause> do <clause> |
if <clause> then <clause> else <clause>
}
{ -------------------------------------------------------------- }
procedure if_clause ;
var t,t1:typerec;l,l1,l3:labl;
begin
l1:=newlab; l:=newlab;l3:=newlab;
next_symbol;
clause(t); condify(t);
jumpt(l);jumpop(l3);plant(l);
if have(do_sy) then begin clause(t);plant(l3); match(t,VOID) end
else
begin
mustbe(then_sy);
clause(t1);jumpop(l1);decsp(t1);
mustbe(else_sy);
plant(l3);
clause(t);balance(t,t1); plant(l1); release_label(l1);
end;
release_label(l3);
end;
Listing 9.7
For loops in programming languages come in two main forms. The simplest is the Pascal variant where you write something like for i:=x to y do ... . Within the body of the loop, i will take on all the values in the range x to y in turn. Other languages, including S-algol, allow a more general variant of the for loop: for i=x to y by z do ... . In this case z provides the step by which i is to be incremented. It is necessary to evaluate the expressions for the start, finish and the step at the top of the loop. Consider the following discriminating example:
let x:=1
let y:=3
for i=x to y do begin
write i
y:=y+i
end
write y
--> 1 2 3 9
The meaning of this loop can be expressed by the equivalent while loop:
let x:=1
let y:=3
let induction:=x
let finish=y
while induction <= finish do begin
let i=induction
write i
y:=y+i
induction:=induction+1
end
write y
The important points to note are:
1) The induction variable and the end point are computed before the loop starts
2) i is declared afresh as a cint (a constant integer) each time round the loop
The semantics of the generalised for loop with a variable step size are more complex.
Consider:
let y:=3
for i=x to y by z do begin
write i
y:=y+i
end
write y
The equivalent code is:
let induction:=x
let finish=y
let step=z
while if step>0 then induction <= finish else induction>= finish do begin
let i=induction
write i
y:=y+i
induction:=induction+step
end
write y
procedure for_clause;
var t:typerec;l1,l2:labl;id:lextoken; os, n:namedesc;
complex:boolean;
begin
enterscope(os);
l1:=newlab;l2:=newlab;
next_symbol;
id:=symbol;
mustbe(identifier);
mustbe(eq_sy);
clause(t);match(t,int_type);
n:=newid(id,cint_type);
mustbe(to_sy);
clause(t);match(t,int_type);
if have(by_sy) then begin complex:=true; clause(t);match(t,int_type);end
else complex:=false;
mustbe(do_sy); if not complex then forprepop;
plant(l1);fortestop(complex,l2);
clause(t);match(t,VOID);
forstepop(complex,l1);plant(l2);
exitblock(os,VOID);
end;
This looks at the for loop to see if the loop is a simple one or a complex one. It calls the code generator to output a forprep sequence if it is a simple loop. The code generator routines fortestop and forstepop are then called with a parameter to indicate if the loop is complex or simple. We can see the code generated for the simple loop:
Code generated for a for loop
push 1 ; start value (assuming for i = 1 to 10)
push 10 ; end value
; forprep sequence
pop ecx
pop eax
push eax
sub ecx,eax
add ecx,2 ; precompute the number of times round loop
;minfortest sequence
l1: loop m1 ; this is a machine code instruction which
; tests the CX register
; if non zero it goes to m1 and decrements CX
pop eax
jmp l2
m1: push ecx ; CX held induction variable
;--------------------- Main body of loop goes here
; minforstep sequence
pop ecx ; induction variable back in CX register
pop eax ; increment i
inc eax
push eax
jmp l1 ; go back to the top of the loop
l2:
Note that this takes advantage of special instructions included in the x86 instruction set to handle simple loops. The loop instruction expects the counter register ECX to hold the number of times it is to go round a loop.
We precompute this and load it into ECX before the loop starts. During the body of the loop, ECX is pushed onto the stack to prevent it being corrupted by a nested loop.
An abstract machine specifies a set of stores and a set of operations on these stores. These stores can have a number of possible types. One class of store is predesignated variables capable of holding an individual word of data. We generally call these registers. In an actual hardware machine the registers will often be implemented by using particularly fast memory chips, or in a microprocessor, by using on chip memory cells. From the standpoint of abstract machine design this is not important, since an abstract machine is concerned only with the functional specification of a computer. The speed of access to different parts of the store is an implementation optimization.
Some abstract machines support a random access memory. The PS-algol machine does not. Instead it uses forms of structured memory: stack and heap. On a given implementation these may actually be implemented in a common random access store, but this is not necessary. Indeed it might be advantageous from a performance point of view to implement the heaps and stacks as physically distinct memories.
The areas of memory defined by the PS-algol abstract machine are the registers, the code store, the stack, and the heap.
PS-algol, like all Algols is a recursive language. It is recursive in two senses. It is defined by a recursive grammar and it allows the recursive calling of procedures. This imposes special constraints on the store of the language that are best satisfied by a stack structured memory. Consider the fragment of code in algorithm .
begin
let y:=x*readi
! position 1
begin
let a = x
x:=y; y:=a
! position 2
end
begin
let i:=9+x
if i>y do y:=x
! position 3
end
end
Code generated for the nested scopes in algorithm 5.5.1
2 global(0) ! x -> top of stack
3 readi ! readi -> top of stack
4 times ! let y:= x*readi
! position 1
5 global(0) ! let a=x
6 global(1) ! y -> top of stack
7 globalassign(0) ! x:=y
9 global(2) ! a-> top of stack
10 globalassign(1) ! y:=a
! position 2
11 retract(1) ! get rid of a
12 ll.int(9) ! 9 -> top of stack
13 global(0) ! x-> top of stack
14 plus ! let i:=9+x
15 global(2) ! i->top of stack
16 global(1) ! y->top of stack
17 le.i ! i<y -> top of stack
18 jumpf(23) ! if top of stack
! false goto 23
19 global(1) ! y -> top of stack
20 globalassign(0) ! x:=y
! position 3
21 retract(1) ! get rid of i
22 retract(2) ! get rid of x and y
Variables are accessed by specifying their address relative to the current base of the stack. The variable x is accessed using the operator global(0) since it is at the base of the stack, y is addressed as global(1) as it is at position 1 on the stack etc. It is worth noting that the combination of the PS-algol initializing assignment statement
let <variable>:= <expression>
with the stack allocation discipline means that many of the store instructions that would be required in a conventional machine architecture are dispensed with. The initial value is simply calculated and then left on the stack. The compiler then just remembers where on the stack it was left.
If the variable was declared at the outer most level the address associated with the variable is given relative to the GP or global pointer register1. If a variable is declared in a procedure, then its address is specified relative to the FP or frame pointer register2. When generating code, variables are consistently dealt with in terms of their addresses relative to some base register.
The most complicated use of the stack in an Algol-based language is the way in which it is used to implement procedure calls. We will start by looking at how to implement procedure calls in languages like C that do not allow nested procedures.
Global variables are accessed by offset from some global base register.
Local variables are accessed by an offset from the frame pointer. Suppose we have the following C procedure:
void swap(int *a, int *b)
{ int temp; temp = *a; *a = *b; *b = temp; }
49 copy(fp,sp) ! fp <- sp
50 retract(-1) ! reserve space for temp
51 local.i(-3) ! push a, tos<- [FP-3]
52 deref ! change to *a tos<-[tos]
53 localass(1) ! store in temp [FP+1]<-tos
54 local.i(-3) ! a to top of stack
55 local.i(-2) ! b to top of stack
56 deref ! change to *b
57 store ! store in *a [tos]<-tos
58 locali(-2) ! push b
59 local(1) ! push temp
60 store ! *b <-temp
61 retract(1) ! get rid of temp
62 pop(fp)
63 return
| work space |
+----------------+
| temp |
+----------------+
+-| old FP | <-----------FP
| +----------------+
| | return address |
| +----------------+
| | b |
| +----------------+
| | a |
V +----------------+
The meaning of the code is made clearer by figure 5.1. The local variables are accessed by a positive offset from the FP and the parameters by a negative one. A procedure call to swap might go as follows:
101 local.addr(5) ! tos <- &y means tos <- FP+5
102 call(49) ! call swap
103 retract(2) ! get rid of parameters
In the abstract machine examples given above it is assumed that the stack grows upwards from low addresses to high addresses. This is true on some hardware but not on all. On intel machines like the 8086, the stack grows downwards from high addresses to low addresses. The actual machine code generated by the toolbox must take into account which direction the stack grows in. On a machine with a downward growing stack the addresses of parameters will be a positive offset from the FP and the addresses of local variables a negative offset from the FP.
Algol like languages allow procedures to be nested within one another. The problem is to devise a calling mechanism that will:
a) Enable procedures to have space for local variables
b) Allow these to be called recursively
c) Allow procedures to access variables that are in a surrounding scope
We will consider the example shown in listing . Here we have 3 procedures A, B and C with procedures B and C within procedure A. B must have access to the variables of A, to its own variables and to the global variables. Access to the globals is no problem, since the global pointer register can be used for this. Access to the locals can be handled as shown in the previous example. The difficulty comes with access to intermediate level variables, those of A. In our example X is a global, and M and P are intermediate level variables with respect to B. B is called by C. How is B to access M?
procedure A(int P)
begin
let M=P+X ! an intermediate variable
procedure B
begin
write M
end
procedure C
begin
B; ! call B
end
if P>0 then A(P-1) else C
end
The technique used to handle this problem is called a display. In an algol like language variables can be accessed using a 2 component addressing technique. Each variable is identified by the combination of what is called its lexical level with an offset. Variables at the global level are said to be at lexical level 0. Variables in global procedures like A are said to be at lexical level 1. Variables in a nested procedure like B would be at lexical level 2 etc. If we write down the addresses of the variables in the example in this way we get the following:
X 0,1
P 1,-2
M 1,3
A variable at lexical level ll with offset offset is then found at the address:
display[ll]+offset
Whenever a procedure is entered a display is created pointing at the enclosing lexical levels. This is done using the abstract machine instruction prologop.
Consider the point at which A has called C and C has in turn called B. The stack then contains the following frames:
address | contents | comment |
135 | 132 | display 2 |
134 | 122 | display 1 |
133 | 100 | display 0 |
132 | 127 | dynamic link |
131 | return address for b | |
130 | 127 | display 2 |
129 | 122 | display 1 |
128 | 100 | display 0 |
127 | 122 | dynamic link |
126 | return address for c | |
125 | M | |
124 | 122 | display 1 |
123 | 100 | display 0 |
122 | 100 | dynamic link |
121 | return address for a | |
120 | P |
For the global context we have:
address | contents | comment |
101 | X | |
100 | 100 | global base |
The semantics of the prolog operation are given below.
FP->S[++SP]; SP->FP;
S[S[FP]:S[FP]+ll-1]->S[SP+1:SP+ll]; SP+ll->SP;
FP->S[++SP]
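Read as ordinary code over an array S with registers SP and FP held in variables, the prolog semantics transliterate as follows; this is a sketch of the specification above, not the x86 code the toolbox emits.
const STACKMAX = 10000;
var
S : array[0..STACKMAX] of integer; { the abstract machine stack }
SP, FP : integer; { stack and frame pointers }
procedure prologop(ll : integer);
var i : integer;
begin
SP := SP + 1; S[SP] := FP; { FP -> S[++SP] : push the dynamic link }
FP := SP; { SP -> FP : the new frame starts here }
for i := 0 to ll - 1 do { S[S[FP]..S[FP]+ll-1] -> S[SP+1..SP+ll] }
S[SP + 1 + i] := S[S[FP] + i]; { copy the caller's display }
SP := SP + ll; { SP + ll -> SP }
SP := SP + 1; S[SP] := FP; { FP -> S[++SP] : the new frame becomes display[ll] }
end;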
The display of A is the array [ 100, 122], where 100 is the base address for globals, and 122 is the address of the current frame.
The display of C is [100, 122, 127]. That is a copy of the display of A followed by the address of the frame for C = 127.
The display of B is [100, 122, 132]. It shares elements 0 and 1 with the display of C, since both of these procedures are nested at the same lexical level. On the other hand it has a new element 2, since the new lexical level 2 is the frame of B. This dynamically constructed display gives access to all of the variables in scope from B. Thus
X = (0,1) = display[0]+1 = 100 +1 = 101
P = (1,-2) = display[1]-2 = 122 -2 = 120
M = (1,3) = display[1]+3 = 122 +3 = 125
The abstract machine instructions that access intermediate variables take two parameters lexical level and offset. The compiler translates these into a series of instructions that access the display and find the variable. For instance on an x86 the abstract instruction to assign an integer to a variable could be translated into:
assi macro
; Assign an integer
; invoke with first parameter lexical level
; second parameter the offset of the variable within frame
mov esi,[ebp+#1] ; ebp points at the display
;esi<-display[lexlevel]
pop dword[esi+#2] ; esi now points at the desired frame
; esi[offset]<- tos
#em
The corresponding epilogue operation, executed when the procedure returns, restores the previous state and discards the parameters:
FP->SP; S[SP--]->FP; S[SP--]->PC; SP-Discard ->SP
For the PS-algol compiler the code generator is a collection of procedures in the module SAGEN.PAS.
These are organised one per abstract machine instruction. Each procedure has the name of an abstract machine instruction and when called places this abstract machine instruction into the compiled binary program.
It actually implements the abstract machine instructions by calling the macro assembler module ASSEMBLE.PAS, to output one or more 8086 instructions.
We will examine how the assembler works in the next chapter. We will now examine how the abstract machine instructions are implemented using the physical resources made available to us by the x86.
The abstract machine has a small collection of registers that have to be implemented on the physical register set of the intel x86 series machines. A description of the Intel processor architecture is not provided here. Those who are unfamiliar with it are advised to consult a reference book3.
On the x86 the following conventions are used for register allocation in the compiler toolbox. The frame pointer is implemented using the intel EBP register. The global pointer is the intel EBX register. Since the stack grows downwards variables are accessed with negative offsets from these registers.
The display mechanism is directly supported in the intel hardware for processor models iAPX 186 and upwards and on the NEC V series processors. On these machines there is a single instruction ENTER that implements prologop.
The assignment of abstract machine registers to physical registers is summarised below:
PC PC
GP EBX
FP EBP
SP ESP
Note that this contrasts with the exercise for PLDI3 which uses the floating point stack for all arithmetic.
The code generator maintains an internal variable called stack_ptr which is used to keep track of the current displacement between the SP and FP registers. The procedures which output abstract machine instructions should increment or decrement stack_ptr to mimic the effects that will be produced at run time on the real stack. To help in doing this a collection of utility routines are provided to increment or decrement the stack by the space that would be taken up by a value of a given type. The procedure incsp should be called when a value is pushed onto the stack and decsp when a value is popped from the stack. These procedures use information about the sizes of types that are expressed in strides. Strides are the smallest amount by which the stack can be adjusted. On 80386 machines and above strides are 4 bytes long. It is important when making any alterations to the code generator to ensure that these procedures are called whenever an opcode is produced that will affect the run time stack.
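A sketch of the bookkeeping follows; whether stack_ptr counts bytes or strides, and how the size of a type is obtained, are assumptions made for the illustration.
const STRIDE = 4; { bytes per stride on the 80386 and above }
var stack_ptr : integer; { compile time model of the SP to FP displacement }
{ illustrative: the number of strides a value of type t occupies }
function strides_of(var t : typerec) : integer;
begin
strides_of := 1 { in this sketch every value takes a single stride }
end;
{ call whenever the generated code will push a value of type t }
procedure incsp(var t : typerec);
begin
stack_ptr := stack_ptr + strides_of(t) * STRIDE
end;
{ call whenever the generated code will pop a value of type t }
procedure decsp(var t : typerec);
begin
stack_ptr := stack_ptr - strides_of(t) * STRIDE
end;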
The assembler phase of a compiler is responsible for generating the binary machine code that will be executed by the cpu. As such it has to know about the details of an individual machine code. A conventional assembler is a stand alone program that takes in a file of ascii text with an assembly language source program in it and outputs either a linkable object file or an executable binary file. Many compilation systems use a stand alone assembler as the last phase of the compilation process. This has several advantages:
1. The compiler writers need not bother themselves with the binary formats of the machine instructions, all they have to know about is how to generate the assembler mnemonics, which is usually much easier.
2. Several compilers can share the same assembler which encourages software reuse.
On Unix this use of a stand alone assembler for the back end of a compiler seems to be standard practice. This is partly because an assembler is provided with the C compiler on every Unix system. Since all Unix systems used for software development are bound to have a C compiler, writers of other compilers are free to use the same back end. On MSDOS, assemblers are not provided as standard features of the operating system. Since each compiler writer then has to provide their own version of the assembler, they have to consider whether to make the assembler a stand alone program or a bound in module of their compiler.
The disadvantage of using a stand alone program is obviously that it has to communicate with the compiler via intermediate text files. The output of these, followed by their input and reanalysis by the assembler will be time consuming. If you have to provide your own assembler you might as well produce a fast one. That is the strategy adopted in the Toolbox. The assembler in the toolbox is a pascal module assemble.pas that is bound into the main compiler program. We can see the relationship between the two in figure . In the left hand example you see the arrangement used in the toolbox. The compiler is a single program containing several modules, of which only the assembler and the front end are shown. These communicate via shared buffers. In the right hand example the compiler and the assembler are stand alone programs which communicate via intermediate files.
The two main tasks that a conventional assembler has to deal with are
a) Converting mnemonics into binary code.
b) Keeping track of the addresses associated with labels.
Both of these become considerably easier when the assembler is a program module and communicates with other modules using internal buffers. It is no longer necessary for the mnemonics to be human readable. An assembly language instruction would normally look something like:
operator parm1 parm2
ENTER 2, 20
There would be an opcode, followed by one or more operands. Each of these fields would be encoded as a sequence of ascii characters. Allowing for newline characters and spaces, the whole line might take 15 to 20 bytes of file space. When we want to assemble code directly in memory, we do not want to waste this much space. Instead of holding the assembly language source as lines of text, we can hold it as records. These can be much more compact. Listing shows the format of the pseudo instruction record used in the toolbox to represent a line of assembler source. Like the textual source line it contains three fields: an operator and two parameters. The whole thing takes up only 5 bytes, 2 bytes for each integer and one for the operator.
record
operator:opcode;
parm2:integer;
parm1:integer;
end;
The operator belongs to the enumerated datatype opcode declared in the opcodes.pas module. A partial listing of the type opcode is shown below.
jl,
jle,
jg,
jge,
je,
jnz,
jump,
jumpt,
jumptt,
cjump,
fortest,
forstep,
outbyte,
shrink,
enterframe,
exitframe,
The assembler strategy is for the code generator phase of the compiler to deposit a sequence of these pseudo instructions into an array. At the end of the code generation phase the assembler is then called to convert these into binary code which can be output to a file.
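A sketch of how this might look from the code generator's side; the array bound and the use of a static rather than heap allocated array are simplifications, and only the record layout follows the listing above.
const CODEMAX = 10000; { assumed capacity of the pseudo instruction array }
type
pseudo = record
operator : opcode;
parm1, parm2 : integer;
end;
var
pseudocode : array[0..CODEMAX] of pseudo;
pc : integer; { index of the last instruction deposited }
{ deposit one pseudo instruction: the assem interface procedure }
procedure assem(op : opcode; p1, p2 : integer);
begin
pc := pc + 1;
pseudocode[pc].operator := op;
pseudocode[pc].parm1 := p1;
pseudocode[pc].parm2 := p2;
end;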
The central task of the assembler is to expand out the opcode mnemonics into the series of bytes that implement the operations on an 80x86 processor. It does this using the three arrays: codelen, codeoffset, codelib. The first two of these are mappings from opcodes to integers. Given an opcode mnemonic, codelen will tell you how many bytes there are in the equivalent machine code. Codelib is a big array of bytes that contains all of the binary opcodes. To obtain the sequence of bytes equivalent to a particular opcode the process shown in figure is followed. The codeoffset array is indexed to find the start of the code sequence in the codelib array, and the codelen array is used to determine how many bytes to copy from the codelib to the binary file.
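The expansion step can be pictured as the following sketch, in which emit_byte stands for an assumed routine that appends one byte to the outgoing binary and the three arrays are as described above.
{ copy the binary template for one opcode into the output stream }
procedure expand(op : opcode);
var i : integer;
begin
for i := 0 to codelen[op] - 1 do
emit_byte(codelib[codeoffset[op] + i]);
end;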
The other task that the assembler has to achieve is to plant the parameter fields of the instructions into the outgoing code stream. If we consider the instruction set of the 80x86 we find that some of the opcodes take 2 parameters, some take 1 and others take none. As against this the format of the pseudo instructions always contains two parameters whether they are needed or not. The assembler has to find some way of determining how many parameters an instruction will really need. This information is encoded in an array codeparams, shown in listing 10.3. This specifies the different types of parameters that instructions can have, as follows:
nonadic
The instruction has no parameters.
monadic
The instruction has a single parameter, this will be the 16 bit number held in param 1.
dyadic
The instruction has 2 parameter field each of 16 bits, to be obtained from param1 and param 2 of the pseudo instruction.
byteadic
The instruction has a single 8 bit parameter obtained from param1 of the pseudo instruction
relative
The instruction has a single parameter which is a 16 bit relative offset from the current value of the program counter.
byterel
The instruction has a single parameter which is an 8 bit relative offset from the current value of the program counter.
abslabel
The instruction contains a single parameter which is a 16 bit absolute address from the start of the code segment.
type optype = (nonadic,monadic,dyadic,stringadic,byteadic,relative,byterel,
abslabel);
const codeparams:array[opcode]of optype =(
{jl}byterel,
{jle}byterel,
{jg}byterel,
{jge}byterel,
{je}byterel,
{jnz}byterel,
{jump}relative,
...
{fortest}relative,
...
{outbyte}nonadic,
...
{prolog}byteadic,
Several of the addressing modes require the assembler to generate addresses to words in the code segment. These are usually the destinations of jump instructions. The address formats required in the final code will either be relative to the start of the code segment or relative to the program counter. In the 'assembly source', which in this case is a sequence of pseudo instructions, the destination address will be indicated by a label. To understand how this works it is helpful at first to think of what the assembler would have to do if it were taking its input from a conventional source file.
Consider the following example
mov ecx,10
l1: call proc1
pop eax
dec ecx
jnz l1
...
proc1: ......
.....
As it scans the source the assembler must build up a table giving the address associated with each label:
label | address |
l1 | 100 |
proc1 | 124 |
labn | 230 |
procedure fixlabels;
var i:word;
begin
machinepc:=start;
for i:=0 to pc do
with pseudocode^[i] do begin
if operator=plant_label then labels[parm1]:=machinepc;
machinepc:=machinepc+codelen[operator];
end
end;
With the RAM resident assembler we are using, the labels are simply numbers that are used to index the label table. No lexical representation of the labels is needed. A special pseudo instruction called PLANT is used to plant a label in the code. The procedure FIXLABELS shown above calculates what address is associated with each label.
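On the second pass a PC relative operand can then be filled in from the label table; the sketch below assumes the pseudo record of the earlier sketch, the labels array built by FIXLABELS, and a routine emit_word that appends a 16 bit value to the binary.
{ emit the operand of a PC relative jump whose destination is label parm1 }
procedure emit_relative(var p : pseudo; machinepc : word);
var offset : integer;
begin
{ the offset is measured from the address of the following instruction }
offset := labels[p.parm1] - (machinepc + codelen[p.operator]);
emit_word(offset);
end;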
The interface to the assembler is provided by 3 procedures :
procedure assem(op:opcode;p1:integer;p2 : integer);
procedure initassem;
1 Typically the DS register on an intel machine.
2 This would typically be the EBP register on an intel machine.
3 A particularly clear explanation of the original 8086 is given in chapter 5 of 'Osborne 16 bit Microprocessor Handbook' by Adam Osborne from McGraw-Hill. Alternatively one can consult the processor manuals published by Intel, AMD or NEC for their CPU chips. It should be borne in mind that the register naming conventions used in NEC literature differ slightly from that used by Intel and AMD. In what follows, the Intel names will be used. For more recent machines the manuals can be downloaded from http://developer.intel.com/design/pentium4/manuals/.