YACC vs. Scheme

Robert Rice

CS 441: Project 1

History of YACC

YACC (Yet Another Compiler Compiler) was developed at Bell Laboratories in the early 1970s, when universities and other research institutes were creating many other compiler generators. There is some irony in the fact that this sarcastically named tool has stood the test of time while others from the same era are now all but nonexistent. YACC was developed by Stephen C. Johnson using a portable dialect of C. YACC has been a standard UNIX utility since Seventh Edition UNIX. System V and older versions of BSD use the original AT&T version, while the newest versions of BSD use flex and Berkeley YACC.

Overview of YACC

YACC provides a general tool for describing the input to a computer program. YACC is a parser generator. It accepts word items called tokens and, given a list of rules describing how these items form larger entities, deduces which items or combinations of items satisfy a rule. In other words, it produces a parser for a given grammar. This can also be thought of as grammatical analysis. Once a rule is satisfied, a programmer's code is applied to the result. Use of YACC is described below.

 

YACC requires the use of a lexical analyzer in order to work, because syntax analysis is only one part of compilation. The other part, lexical analysis, or scanning of the input looking for certain kinds of tokens, is usually handled by the lexical analyzer, lex. This program, by Mike Lesk and Eric Schmidt, was also created for UNIX. Lex uses many of the same conventions as YACC.

YACC and Lex working together facilitate the writing of programs that transform structured input. This includes an enormous range of applications, anything from a simple text search program that looks for patterns in its input file to a compiler that transforms a source program into optimized object code.

A C compiler needs to find the expressions, statements, declarations, blocks, and procedures in the program. This task is known as parsing, and the list of rules that define the relationships the program understands is a grammar. YACC takes a concise description of a grammar and produces a C routine that can parse that grammar. The YACC parser automatically detects whenever a sequence of input tokens matches one of the rules in the grammar and also detects a syntax error whenever its input does not match any of the rules.

A YACC source file is separated into three distinct sections: the definition section, the rules section, and the user subroutine section. Lines consisting of a pair of percent signs separate these sections from each other. Whereas the third section is optional, the first and second sections are required, though either may be empty. The definition section can contain declarations of %union, %start, %token, %type, %left, %right and %nonassoc, as well as a literal block. The rules section consists of the grammar rules for the compiler and actions for these rules containing C code. The user subroutines section generally contains routines called from the actions of the previous section. These subroutines are copied verbatim to the C source file.
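For illustration, a minimal sketch of this three-part layout (the token name NUMBER and the single rule are invented for this example):

%{
/* literal block: C declarations copied into the generated parser */
%}
%token NUMBER                 /* definition section */
%%
expr : expr '+' expr          /* rules section */
     | NUMBER
     ;
%%
/* user subroutine section: helper C code copied verbatim */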

Handling of Data Objects

YACC takes a grammar specified by the programmer and writes a parser that recognizes valid "sentences" in that grammar. For this reason, these grammars must be defined in a standard format. In this case, that format is Backus-Naur Form or BNF. Tokens, or terminal symbols, are data objects provided to YACC by the lexical analyzer. Non-terminal symbols are data objects that appear on the left-hand side of some rule. Both of these objects are defined together as symbols, but a symbol can also be a literal quoted character. Literal tokens are contained within single quotes. The token number of a literal token is the numeric value of the character in the local character set, usually ASCII. Internally, each symbol value is declared as a C union that includes all of the possible data types. Symbolic tokens are usually assigned values by YACC, giving them higher numbers than any possible character code to avoid any conflict with literal tokens. The programmer can assign token numbers by using the following format in the definition section:

%token token_name integer

Symbols are strings of letters, digits, periods, and underscores that do not start with a digit. Apart from the symbol error, which is used for error recovery, YACC attaches no a priori meaning to any symbol. Every symbol in a YACC parser has a value, which gives additional information about a particular instance of the symbol. For instance, if a symbol represents a number, the value would be the particular number. If it represents a literal text string, the value would be a pointer to a copy of the string. If it represents a variable in a program, the value would be a pointer to a symbol table entry describing the variable. By default, YACC makes all values of integer type. Non-terminal symbols can be given any value, as determined by code in the parser.

A YACC grammar consists of a set of rules, each of which starts with a non-terminal symbol followed by a colon. These two items together are called the left-hand side. Then follows a possibly empty list of symbols, literal tokens, and actions. This grouping is called the right-hand side. An optional semicolon completes each rule. If consecutive rules share the same left-hand side, each right-hand side can be listed, delimited by vertical bars, following a single instance of the left-hand side.
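For example, a rule with three alternative right-hand sides might be written as follows (the symbol names here are hypothetical):

statement : if_statement
          | while_statement
          | assignment
          ;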

In the definition section of the source code, the programmer can define the data type of a token. This is done by using the %union and %type declarations. %union identifies all of the possible C types that a symbol value can have. A %union declaration is written as follows:

%union {
    …field declarations…
}

In the absence of this declaration, YACC defaults to using integer for all of the symbol values. The types declared in %union are associated with particular symbols using the %type declaration. This declaration has the following form:

%type <type> name, name,…

Each name is the name of a non-terminal symbol.
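As a sketch of how these declarations fit together (the field names ival and dval and the symbols INTEGER and expr are hypothetical), a definition section might contain:

%union {
    int    ival;   /* integer-valued symbols */
    double dval;   /* floating-point-valued symbols */
}
%token <ival> INTEGER
%type  <dval> expr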

Handling of the Sequence Control in YACC

Precedence is one of the ways that YACC implements sequence control. Precedence controls which mathematical operators are evaluated in which order in a mathematical expression. Associativity is also significant when considering the proper evaluation of a mathematical expression. Associativity controls the grouping of operators at the same precedence level. Operators can group to either the left or the right or not at all.

Two ways exist to specify precedence and associativity in a grammar, implicitly and explicitly. In order to specify them implicitly, the grammar needs to be written using separate non-terminal symbols for each precedence level. To specify them explicitly, first lines must be added to the definition section of the YACC source file. For example, say we added the following lines to the code:

%left '+' '-'
%left '*' '/'

The level of precedence is set by the order of the lines, beginning with the lowest precedence first. The %left declaration tells the compiler that these operators are left associative. There are two other possible declarations used to establish precedence and associativity: %right makes an operator right associative, while %nonassoc makes the operator non-associative.
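Putting these declarations to work, an otherwise ambiguous expression grammar can be disambiguated explicitly. In the sketch below (with a hypothetical NUMBER token), '*' and '/' bind more tightly than '+' and '-' because their %left line comes later:

%left '+' '-'
%left '*' '/'
%%
expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | NUMBER
     ;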

Sequence control is also implemented when ambiguity and conflicts arise. For example, there may be two possible parses for a single input string. Conversely, the grammar may be unambiguous, but the parsing technique that YACC uses is not powerful enough to parse the grammar. There are two types of conflicts that can occur when YACC tries to create a parser: "shift/reduce" and "reduce/reduce". A shift/reduce conflict occurs when there are two possible parses for an input string, and one of the parses completes a rule (the reduce option) and one doesn’t (the shift option). YACC always chooses the shift option unless the user puts in operator declarations. Whenever there is a shift/reduce conflict, YACC compares the precedence of the token that might be shifted to that of the rule that might be reduced. It shifts if the token’s precedence is higher or reduces if the rule’s precedence is higher. If both have the same precedence, YACC checks the associativity. If they are left associative it reduces, if they are right associative it shifts, and if they are non-associative an error is generated. A reduce/reduce conflict occurs when the same token could complete two different rules. More often than not, these types of conflicts represent mistakes in the grammar.

Recursive rules parse lists of items of indefinite length. These rules are defined in terms of themselves. Any recursive rule must have at least one non-recursive alternative; otherwise there is no way to terminate the string that it matches. There is a distinction between left recursion and right recursion. Left recursion places the recursive reference at the left end of the right-hand side of the rule, while right recursion places it at the right end of the right-hand side. Left recursion is handled more efficiently by YACC. The reason for this is discussed in the section "Handling of the Subprograms and Storage Management in YACC".
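The two styles can be sketched as follows, using a hypothetical item symbol. The left-recursive form lets the parser reduce each item as soon as it is read, keeping the stack shallow, while the right-recursive form must stack every item before any reduction happens:

list  : item             /* left recursion: preferred in YACC */
      | list item
      ;

rlist : item             /* right recursion: stack grows with the input */
      | item rlist
      ;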

Sequence control is also established when variant and multiple grammars are used. These are used when you want to have parsers for two partially or entirely different grammars in the same program. One way of achieving this is by combining the parsers. This is done by adding a start rule that depends on the first token read. Then, the grammar parsed is chosen by the token. Code also needs to be added to the lexer that returns the appropriate special token the first time the parser requests a token from the lexer. The advantage to using this method is that the program is smaller than it would be if multiple parsers were used, since only one copy of the parsing code is needed. Also, if the grammars being parsed are related, some parts of the grammar may be shared. However, one parser cannot usually be called while the other is active, and different symbols need to be used in the two grammars except where rules are deliberately shared.

Another method of establishing variant and multiple grammars is by using multiple parsers, or including two complete parsers in a single program. However, because both parsers put parse tables and parse stacks in global variables with the same names, a long list of multiply defined symbols is generated. The trick is to change the names YACC uses for its functions and variables. This is done by using the command line switch -p, which changes the prefix used on the names in the parser generated by YACC.

Handling of the Subprograms and Storage Management in YACC

The parser produced by YACC handles storage management using a stack. The parser consists of a finite state machine, wherein each state of the machine is given a small integer label, and a stack. The parser is also responsible for reading and remembering the lookahead token, the next input token provided by the lexical analyzer. The top of the stack always contains the current state of the compilation.

A move of the parser is done as follows:

1. The parser decides whether it needs a lookahead token to decide what action should be done based on its current state. If a token is needed, but the parser does not have one, it calls yylex to obtain the next token.

2. The parser decides on its next action using the current state and the lookahead token if needed, and then carries it out. This may result in states being pushed onto or popped off the stack and in the lookahead token either being processed or left alone.

The machine has only four actions available to it: shift, reduce, accept, and error. The parser takes the shift action more often than any other action. There is always a lookahead token being referenced when a shift action is taken. For example, in state 56 there may be an action:

IF shift 34

which says, while in state 56, if the lookahead token is IF, the current state is pushed down on the stack, and state 34 becomes the current state and is pushed on the top of the stack. The lookahead token is then cleared.

The reduce action prevents the storage stack from growing without bound. Reduce actions are appropriate when the parser has seen the right-hand side of a grammar rule and is prepared to announce that it has seen an instance of the rule, thus replacing the right-hand side by the left-hand side. It may or may not be necessary to consult the lookahead token to decide whether to reduce, but usually it is not; in fact, the default action (represented by ".") is often a reduce action. Reduce actions are associated with individual grammar rules. Suppose the rule being reduced is

A: x y z ;

The reduce action depends on the left-hand symbol (A in this case) and the number of symbols on the right-hand side (three in this case). To reduce, the parser first pops the top three states off the stack. Generally, the number of states popped is the number of symbols on the right-hand side of the rule: these are the states that were put on the stack while recognizing x, y, and z, and they no longer serve any useful purpose. After popping these three states, the next state on the stack is the state the parser was in before beginning to process the rule. Using this uncovered state and the symbol on the left side of the rule, a "shift"-like action is performed on A. A new state for the parser is obtained, pushed onto the stack, and parsing continues. There are, however, significant differences between the processing of the left-hand symbol and an ordinary shift of a token, so this action is actually called a "goto" action. A shift clears the lookahead token, while a goto does not affect the token at all. The uncovered state may contain an entry such as "A goto 20", causing state 20 to be pushed onto the stack and become the current state. In effect, the reduce action "turns back the clock" in the parse, popping states off the stack to return to the state where the right-hand side of the rule was first seen. The parser then behaves as if it had seen the left side at that time. If the right-hand side of the rule is empty, no states are popped off the stack: the uncovered state is in fact the current state.

User-supplied actions and values are also affected by the reduce action. When a rule is reduced, the code supplied with the rule is executed before the stack is adjusted. In addition to the stack holding the states, there is another stack that holds the values returned from the lexical analyzer and the actions. When a shift takes place, the external variable yylval is copied onto this value stack. It is not until after the parser returns from the user code that the reduction is carried out. When the goto action is done, the external variable yyval is pushed onto the value stack. The pseudo-variables $1, $2, etc., referred to in the rule, reference items on the value stack. The accept and error actions do not participate in storage management.

 

Handling of Abstraction and Encapsulation in YACC

YACC handles abstraction and encapsulation in only one way. As mentioned previously, the rules section of a YACC source file contains actions that define what is to occur when a legal collection of symbols is found. Each of these definitions consists of a left-hand side and a right-hand side. The left-hand side is always a single non-terminal symbol, while the right-hand side can contain a series of tokens. Although it is illegal to have a token on the left-hand side of a rule, it is perfectly acceptable to have a non-terminal on the right-hand side. This results in a form of abstraction. Once a rule is defined, it does not need to be completely rewritten to be used on the right-hand side of another rule. The programmer can simply use the non-terminal in its place in the right-hand side. That rule, when referenced from the right-hand side of the encapsulating rule, can then be evaluated as if the entire right-hand side had been put there. This is also true of using a non-terminal in its own definition. This results in a recursive definition, which works very well for definitions incorporating lists of indefinite length.

 

 

Handling of Data Objects in Scheme

Scheme has support for several kinds of objects, including numbers, characters, strings, symbols, and lists or vectors of objects. Scheme also handles a full set of numeric data types, including complex, real, and arbitrary-precision rational numbers. This allows the creation in Scheme of many numerical applications that would otherwise be coded in lower-level languages. The following paragraphs describe the data objects in Scheme.

The pair or cons cell is the most fundamental of Scheme's structured object types. Pairs are commonly used to build lists, but they may also be used to construct binary trees. Each pair in the tree structure represents an internal node of the binary tree. Proper lists are visually represented as sequences of objects separated by whitespace and enclosed in parentheses.
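A few interactions illustrate pairs and lists (these are ordinary standard-Scheme operations):

(cons 1 2)       => (1 . 2)    ; a single pair
(list 1 2 3)     => (1 2 3)    ; a proper list built from pairs
(car '(1 2 3))   => 1
(cdr '(1 2 3))   => (2 3)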

Numbers can be represented by a wide range of types: integers, rational numbers, real numbers, or complex numbers. This classification emulates that of mathematics, where all integers are rational, all rational numbers are real, and all real numbers are complex.

Characters are atomic objects representing letters, digits, special symbols, and certain nongraphic control characters such as space and newline. Characters are written with a #\ prefix. The characters newline and space may be written in this manner as well, but they are more readably written as #\newline and #\space.

Strings are sequences of characters and are typically used as messages or character buffers. Scheme provides operations for creating strings, extracting characters from strings, obtaining substrings, concatenating strings, and altering the contents of strings. A string is written as a sequence of characters enclosed in double quotes. The backslash is an escape character; it allows a backslash or a double quote to appear within a string when it precedes either of these characters. Each character in a string is indexed by an exact nonnegative integer, beginning with 0 for the first character.

Vectors are more convenient and efficient than lists for some applications. Whereas accessing an arbitrary element in a list requires a linear traversal of the list up to the selected element, arbitrary vector elements are accessed in constant time. The length of a vector in Scheme is the number of elements it contains. Vectors are indexed by exact non-negative integers, and the index of the first element of any vector is 0.

A vector is written as a sequence of objects separated by whitespace, enclosed in parentheses and preceded by the prefix #, as in #(1 2 3).

Symbols are used as symbolic names in Scheme programs. Symbols can be quickly compared for equivalence, so their most appropriate use is as identifiers in the representation of programs. Symbols are written without double quotes or other bracketing characters. Parentheses, double quotes, spaces, and most other characters with a special meaning to the Scheme reader are not allowed within the representation of a symbol.

Handling of Sequence Control in Scheme

The most basic Scheme sequence control structure is a procedure. The syntax is as follows:

( procedure expression …)

Any structure in Scheme without a syntax keyword in the first position is a procedure application. The order in which the procedure and argument expressions are evaluated is unspecified; it may be left to right, right to left, or some arbitrary order. However, the evaluation is guaranteed to be sequential: each expression is evaluated completely before the next evaluation is started. The example below illustrates a procedure application.

( + 3 4) => 7

Sequencing

To sequence assignments and other operations, Scheme uses the begin statement.
Syntax:

( begin expression1 expression2 … )

The expressions are evaluated in sequence from left to right, and the value of the last expression is returned. This expression type is used to sequence side effects such as input and output.
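For example, a begin expression that performs output before returning a value might be sketched as:

(begin
  (display "computing...")
  (newline)
  (+ 1 2))        => 3    ; side effects happen first; value of last expression returned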

Conditionals

Scheme has several conditional forms: if, not, and, or, cond, and case. However, only if, cond, and case are used for sequence handling.

If statement

Syntax: ( if test consequent alternative)
Syntax: ( if test consequent)

Test, consequent, and alternative are expressions. Test is a condition to evaluate; if it evaluates to true, consequent is evaluated and its value returned, and otherwise alternative (if present) is evaluated and its value returned.
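For instance:

(if (> 3 2) 'yes 'no)   => yes
(if (> 2 3) 'yes 'no)   => no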

Cond statement

Syntax: ( cond (test1 expression1 …) (test2 expression2 …) … )
Syntax: ( cond (test1 expression1 …) … (else expression …) )

The tests are evaluated in sequence until one evaluates to true; the corresponding expressions are then evaluated in order, and the value of the last one is returned. If no test succeeds and an else clause is present, its expressions are evaluated instead.

Case statement

Syntax: ( case expression ((datum …) expression1 …) … )

The case statement evaluates expression and compares its value against the datums of each clause; the expressions of the matching clause are evaluated, and the value of the last is returned. A case statement with a clause for each condition of interest can act like a series of if statements.
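The standard examples from the Scheme report illustrate both forms:

(cond ((> 3 3) 'greater)
      ((< 3 3) 'less)
      (else 'equal))           => equal

(case (* 2 3)
  ((2 3 5 7) 'prime)
  ((1 4 6 8 9) 'composite))    => composite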

Recursion and Iteration

Syntax: ( let name bindings body)

This form of let, called a named let, is a general-purpose iteration and recursion construct. Bindings should have the form

	((variable1 init1) ...),

where each init is an expression, and body should be a sequence of one or more expressions. This statement returns the value of the final expression. This statement can also be written as

Syntax: (letrec bindings body)

This statement too returns the value of the final expression. The difference between let and letrec is that letrec allows the definition of mutually recursive procedures.
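A sketch of a named let that collects the numbers 0 through 4 into a list; loop, i, and result are names invented for this example:

(let loop ((i 0) (result '()))
  (if (= i 5)
      (reverse result)
      (loop (+ i 1) (cons i result))))   => (0 1 2 3 4)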

Syntax: (do ((variable1 init1 step1) …)
            (test result …)
          expression …)

This is an iteration construct. On each step, the test expression test is evaluated. If the value of test is true, iteration ceases, the result expressions are evaluated in sequence and the value of the last expression is returned.

If the value of test is false, the expressions within the loop are evaluated in sequence and iteration continues.
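The vector-building example from the Scheme report shows the full form; the loop fills each slot of a five-element vector with its own index and returns the vector when the test succeeds:

(do ((vec (make-vector 5))
     (i 0 (+ i 1)))
    ((= i 5) vec)
  (vector-set! vec i i))    => #(0 1 2 3 4)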

Syntax: ( for-each proc list1 list2 … )

The for-each statement applies proc to the elements of the lists in order from left to right, for the sake of its side effects.
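For example:

(for-each display '(1 2 3))   ; prints 123, left to right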

Handling of subprograms and storage management in Scheme

Scheme’s lambda syntactic form allows programmers to define subprograms. Subprograms in Scheme are block-structured and the identifiers involved in each subprogram may be bound either at the top level of the source file or locally. The block structure, along with lexical scoping helps to create programs that are modular, easy to read, easy to maintain, and reliable. In Scheme, a subprogram definition can appear within another block or subprogram, but the nested subprogram may still be called at any time after that, even if the enclosing block has completed its execution. To support lexical scoping, a subprogram carries its lexical context, or environment, along with its code. Furthermore, Scheme subprograms are not necessarily always named. Instead, subprograms are first-class data objects like strings or numbers, and variables are bound to subprograms in the same way they are bound to other objects.

Like procedures in many other languages, Scheme procedures may be recursive. Tail recursion is a special case of recursion, fully supported by Scheme, that provides iteration. A tail call occurs when one subprogram directly returns the result of calling another subprogram. Tail recursion occurs when a subprogram tail calls itself. As a result, Scheme programmers need only be familiar with simple procedure calls and recursion, and need not worry about the usual varieties of loop structures.
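A tail-recursive factorial sketch: the recursive call to fact is the last thing the procedure does, so a Scheme implementation can run it in constant stack space, exactly like a loop (factorial, fact, and acc are names invented for this example):

(define (factorial n)
  (let fact ((n n) (acc 1))
    (if (= n 0)
        acc
        (fact (- n 1) (* n acc)))))

(factorial 5)   => 120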

Continuations support the definition of arbitrary control structures. A continuation is a procedure that represents the remainder of a program at a given point in the program. When a continuation is called, the program immediately continues from that point. A continuation may be obtained at any time during the execution of a program. As with other subprograms, a continuation is a first-class object and may be called at any time after its creation.
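A small sketch with call-with-current-continuation: calling the captured continuation k abandons the rest of the enclosing computation and returns its argument immediately, so the surrounding addition is never completed:

(call-with-current-continuation
  (lambda (k)
    (+ 1 (k 10) 100)))    => 10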

Scheme does not have explicit storage management. Objects are dynamically allocated in a heap, where they are kept until no longer needed and then automatically deallocated. A Scheme program never explicitly deallocates Scheme objects such as pairs, strings, and procedures. Instead, the storage management system uses garbage collection to reclaim storage automatically. Once the garbage collector can prove that an object is no longer referenced, it deallocates the object, and the storage the object used can be reused later. The garbage collector runs periodically as the program runs to prevent the system storage from being exhausted. Although garbage collection is performed automatically, the programmer can also initiate a collection at a particular time, in some implementations by calling a procedure such as collect.

Weak pairs and guardians in Scheme also help with storage management. Weak pairs allow programs to maintain weak pointers to objects. Although a weak pointer remains valid while the object it references is accessible in the system, it does not prevent the object from being reclaimed by the storage management system. Guardians allow programmers to protect objects from deallocation by the garbage collector. Weak pairs and guardians allow programs to keep information about objects in separate data structures without having to worry about the possibility that maintaining this information will cause the objects to remain in the system indefinitely.

Handling of abstraction and encapsulation in Scheme

The define keyword is used so that objects can be given names. As an example,

(define a 10)
a
(* a 2)

The Scheme interpreter responds by outputting a, 10, and 20. The first expression tells the Scheme interpreter that henceforth a will be another name for the number 10. In evaluating the second expression, the interpreter returns the object that a names, 10. In the final expression, since the first operand, a, evaluates to 10 and the second operand is 2, the value of the expression is 20.

Using define to Create New Procedures

(define (double number)
(* 2 number))

The example above is a procedure named double. It is called on one argument, a number, and returns the value of that number multiplied by two.
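Once defined, double can be applied like any built-in procedure, and even composed with itself:

(double 5)            => 10
(double (double 5))   => 20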

Comparison


Even though both Scheme and YACC have a declarative flavor, comparing them to each other is still not an easy task. Though they are based on related concepts, their implementations differ because of the type of functions they support. Scheme is made to evaluate mathematical functions. YACC, however, is made to evaluate grammars.

Handling of Data Objects

Because of the different types of functions each language is made to support, the types of data objects they are made to handle are different. Scheme needs to support the use of a wide range of numerical types in order to accomplish its goal. Because it is mathematical functions that are being evaluated, the data types supported can be explicitly defined. YACC handles basic data objects in a much more general manner than Scheme. YACC needs to define objects only to support the evaluation of a grammar, a set of rules for a language. The functions in YACC are only concerned with the groupings and orderings of elements in the grammar. Therefore, the data types in YACC are not really of significant importance in evaluating the expressions in the source file. By default, the symbols in YACC are given an integer value, but this is only really used to facilitate identification of each individual symbol.

Handling of Sequence Control

Sequence control is also supported more fully in Scheme than in YACC. Scheme supports a wide range of control structures, comparable to an imperative language like C in its variety. It supports iteration, conditional execution of statements, and even recursion. YACC supports precedence of operators and controls program flow through shift and reduce procedures, but most of this is invisible to the programmer. Actions like shift and reduce happen ‘under the hood’ with the program stack in order to guide evaluation of the grammar dependent on the tokens read up to a point. I believe this is also a result of the types of programs each language is meant to support. YACC must only concern itself with the ordering and grouping of elements in the grammar and so has little need for a wide variety of control structures. Scheme, on the other hand, because it is dealing with mathematical evaluation, requires the use of such a selection of sequence control processes. Mathematical evaluation is inherently repetitive. The support of such control structures helps the programmer alleviate some of the monotony by allowing the language to do the repetitive work.

 

Handling of subprograms and storage management

The main difference between storage management in Scheme and YACC is that Scheme allocates all of its data objects onto a heap while YACC uses a stack for its data storage. Using the stack for YACC storage management makes sense. The stack consists of states of the parser at a given time. These states need to be sorted through from time to time when a particular one is required dependent on the lookahead token. However, even when one state is required, the rest of the states must remain in the order they were previously. A stack is a perfect mechanism for maintaining this restriction. Scheme, on the other hand, doesn’t require such constraints on its data. It is not built on a finite state machine and so does not have states to keep track of. It simply stores all of its data in a heap and accesses data items when necessary.

Handling of abstraction and encapsulation

Scheme’s handling of abstraction begins with the use of the define keyword, which not only assigns variable names to elementary data types, but can also assign variable names to functions written by the programmer. YACC’s use of abstraction is less explicitly defined. YACC’s representation of its relationships is written in a form that more closely represents the definition of a function than Scheme’s representation, which more closely resembles function definitions in an imperative language. YACC’s handling of abstraction resembles mathematical function composition more closely than Scheme’s. Function composition is basically the use of one function to replace the variable name to be evaluated in another function. It makes more sense to use this type of syntax in a language whose main concern is grouping and order of language elements. A ‘function’ in this case is a particular ordering and collection of elements. Once a certain ordering is defined, it is easier to use a reference to that ordering than it is to use the ordering itself over and over again. The abstraction in a language like Scheme is better represented in the imperative language style because of the usage of variables. In evaluating mathematical functions, variables to a function can change and thus need to be passed to the function before its evaluation. Scheme’s style facilitates this.


References

Levine, John R. lex & yacc. O’Reilly and Associates, Inc. 1992.

Johnson, Stephen C. YACC: Yet Another Compiler Compiler. AT&T Bell Laboratories.

http://www.cs.uaf.edu/~cs631/yacc-docs.txt

Darwin, Ian. A History of UNIX Before Berkeley: UNIX Evolution, 1975-1984. 1984.

http://www.daemonnews.org/199903/history.html

Sebesta, Robert W., Concepts of Programming Languages.

Scheme: The Scheme Programming Language. 1999

http://www.engin.umd.umich.edu/CIS/course.des/cis400/scheme/scheme.html

Kelsey, Richard. Scheme: Revised Report on the Algorithmic Language Scheme. 2001.

http://www.swiss.ai.mit.edu/~jaffer/r5rs_toc.html

Stone, John D. Fundamentals of Computer Science I.

http://www.math.grin.edu/courses/Scheme/
