\documentclass{book}
\usepackage{alltt}
\title{A practical approach to Programming Language Semantics}
\begin{document}

The purpose of this book is to study programming language semantics
in a way that highlights the practical aspects of this field.
The main objective of programming language semantics is to understand
what programs really mean.  The practical outcome of this understanding
is that programming errors can be avoided.  In this book, we will actually
insist on the following aspects.

\begin{itemize}
  \item
    We will show that there is a continuous progression between
    conventional programs and semantic descriptions.  We will start
    by describing tools concerning  programming languages written
    as simple programs, and we will show how the same tools are
    related to more theoretical descriptions of programming languages.
  \item Most of the questions we will study have a practical motivation.  In
    general, this practical motivation will be to ensure that our tools
    do what we mean.
  \item Some of our work consists in mathematical proofs about programming
    language descriptions.  All these proofs be will implemented as formal
    proofs that can be verified using a mathematical proof tool.
  \item We will only describe toy languages, so that the whole content
    of the book can be used as an introduction to programming language
    semantics.
\end{itemize}

We hope that this practical approach, starting with programs in
conventional programming languages to describe a toy programming
language will provide a gentle introduction to the more mathematically
designed formalisms that are customarily used to describe programming
languages and to the practice of proof-verification tools.
Our attachment to the formal verification of semantics
proofs may seem debatable: proof-checkers are still difficult to
use and the constraints imposed by proof-checking add to the
constraints imposed by the formal description of languages in ways
that may discourage students who are already intimidated by the
theoretical content of the course.  We believe that this extra cost is
compensated in the following points:
\begin{itemize}
  \item A formal description of the concepts being used actually makes it
    possible to reduce their ``theoretical load'': the fact that some
    notations are just traditions hiding simple predicates is made explicit
    in a formal description.
  \item Formal descriptions are unambiguous.  If some explanation is not
    well understood, the formal description can be consulted to disambiguate
    the English-language explanations.  Moreover, the dependency between
    theoretical concepts and practical considerations is more explicit
    in formal documents.
  \item Mechanical proofs and formal descriptions provide a playground for
    students to build up their own experience in extending a language,
    defining a new tool, or     proving extra properties of an existing
    language.
  \item Modern proof-checkers also provide ways to extract executable programs
    from formal descriptions and proofs.  This characteristic will make it
    possible to re-inforce the idea that the formal semantics of programming
    languages are directly related to implementing tools to
    program and avoid programming errors.
\end{itemize}

The book contain the description of three toy languages: a simple
imperative programming language, a simple theoretical language inspired
from the study of logic, and simple functional programming language; and three
formal description styles: natural semantics, denotational semantics, and
axiomatic semantics.  The practical tools that we will obtain
from the formal descriptions will fall in the following categories:
\begin{itemize}
  \item Interpreters, that is, tools that execute programs,
  \item Compilers, that is, tools that translate programs in other
    languages,
  \item Static analyzers, that is tools that help make sure that programs
    do not contains certain categories of errors.
\end{itemize}
The purpose of mechanical verification of programming language semantics
will be to ensure that these tools respect the intending meaning of programs.

\chapter{A simple imperative programming language}
\section{An informal description}
We want to study a small programming language whose only instructions
are sequences of assignments to memory locations, possibly repeated inside
loops that are terminated when a test is satisfied.
The values stored in each location will be integers, and for now, we will
assume that the only operations available are additions and the only
tests are strict comparison.  Here are a few examples of such instructions.

\begin{itemize}
  \item A program to compute a multiplication,
\begin{verbatim}
z:=0; while x > z do t := y + t; z:=z+1 done
\end{verbatim}
If {\tt x} contains a positive value, a machine executing this program
should terminate in a state where {\tt t} contains the product of the values
in {\tt y} and {\tt x}.
\item A program to compute a square root,
\begin{verbatim}
y:=0; s := 1;
 while x > s do
  s:=0; y:=y+1; z:=0;
  while y > z do s := y+s; z:= z+1 done
done
\end{verbatim}
If {\tt x} contains a positive value, a machine executing this program
should terminate in a state where {\tt y} contains the greatest integer
whose square is smaller than the value in {\tt x}.
\end{itemize}

These instructions are incomplete in the sense that they do not
indicate what are all the initial values of the memory locations they
refer to.  To make complete programs we will choose to associate
an instruction with a declaration of variables.  Programs will thus
look as follows:
\begin{verbatim}
Variables 
  x=10 y=0 s=1
in
while s < x do
  s := 0; y:=y+1; z:=0;
  while z < y do s := y+s; z:= z+1 done
done
\end{verbatim}

Our programming language does not contain a conditional statement.
This statement can be simulated with the help
of while loops and describing the same language with an extra 
conditional statement is not more difficult, just longer.

\section{An interpreter, written in C}
\subsection{Representing programs}
In our interpreter, we want to have an abstract view of programs:
we don't need to know whether programs were written all in one line or
whether a lot of space was used to separate the location name from the
assignment sign in an assignment instruction.  Actually, we don't even
need to know whether assignments are written with {\tt :=} (as in Pascal)
or {\tt =} (as in C).  The tradition is simply to describe programs
as term structures, where each term has a label (to indicate what kind
of term it is) and a collection of subterms.  For our simple programming
language the various kinds of terms available are as follows:
\begin{itemize}
  \item {\tt VARIABLE\_LIST}, this term is used to hold all the variable
   declarations at the beginning of a program.  Terms of this kind contain
   a first subterm that describes the declaration of a single variable and
   a second subterm that describes the rest of the declarations.  A variable
   list is thus expected to be a nested collection of terms of kind
   {\tt VARIABLE\_LIST},
  \item {\tt VARIABLE\_VALUE}, this term is used to hold the pair of a variable
   and a value in a single variable declaration.  Terms of this kind have
   two subterms where the first one should be a variable and the second
   should be a numeral.
  \item {\tt NUMERAL}, this term is used to hold a number when it appears
   in an assignment, a comparison or an addition.  Terms of this kind
   contain an integer value,
  \item {\tt VARIABLE}, this term is used to hold a memory location name.
   Terms of this kind contain a string,
  \item {\tt ADDITION}, this term is used to hold an addition.  Terms of
   this kind have two subterms, which can themselves be other additions,
   variables, or numerals,
  \item {\tt GREATER}, this term is used to hold a comparison.  Terms
   of this kind have two subterms, which can be additions, variables, or
   numerals,
  \item {\tt ASSIGNMENT}, this term is used to hold the instruction
   assigning a value to a variable.  Terms of this kind contain two
   subterms, where the first one is the variable being assigned to, and
   the second is the expression that needs to be evaluated in order to
   produce the value that is given to the variable,
  \item {\tt SEQUENCE}, this term is used to hold two instructions that
   should be executed one after the other.  Terms of this kind contain
   two subterms, which can themselves be sequences, while terms, 
   assignments, or skip statements,
  \item {\tt WHILE}, this term is used to hold a while loop instruction.
   Terms of this kind contain two subterms, where the first one is
   a comparison (a term of the kind {\tt GREATER}) and the second one
   is itself a while instruction, a sequence, an assignment, or a skip
   statement,
  \item {\tt SKIP}, this term is used to hold an instruction that does
   nothing.  Terms of this kind have no subterms.
\end{itemize}

To keep track of these various kinds we define a collection of constant
macros in our C program:

\begin{verbatim}
#define VARIABLE_LIST 1
...
#define WHILE 9
#define SKIP 10
\end{verbatim}

We then define a data-structure that corresponds to these terms.  In
this definition, we take into account the fact that a term contains
either two subterms, a numeral, or a string of characters.

\begin{verbatim}
 typedef struct term {
  short select;
  union {
    struct term **children;
    int int_val;
    char *string_val;
  } term_body;
} term;
\end{verbatim}

We then define a collection of functions that construct terms:
the function {\tt numeral} takes an integer as argument and
it constructs a term whose {\tt select} field is {\tt NUMERAL} and whose
value is the given integer.  Similarly, a function {\tt sequence}
takes two term arguments and construct a new term that represents the sequence
of these two arguments.  In fact, all constructors rely on the following
{\tt make\_term} function.
\begin{verbatim}
term *make_term(int tag) {
  term *ptr;
  if(!(ptr = (term *)malloc(sizeof(term)))) {
    panic("allocation failed in make_term");
  }
  ptr->select = tag;
  return ptr;
}
\end{verbatim}
The {\tt panic} procedure used in this function simply reports the error
message in terminates the interpreter.  The binary constructors rely on
the following {\tt make\_binary\_term} function:
\begin{verbatim}
term *make_binary_term(int tag, term *i1, term *i2) {
  term *ptr = make_term(tag);
  if(!(ptr->term_body.children 
           = (term **)malloc(sizeof(term*)*2))) {
    panic("could not allocate memory in make_binary_term");
  }
  ptr->term_body.children[0] = i1;
  ptr->term_body.children[1] = i2;
  return ptr;
}
\end{verbatim}
For instance, the {\tt variable} and {\tt addition} functions
are defined as follows:
\begin{verbatim}
term *variable(char *s) {
  term *ptr = make_term(VARIABLE);
  ptr->term_body.string_val = s;
  return ptr;
}

term *addition(term *i1, term *i2) {
  make_binary_term(ADDITION, i1, i2);
}
\end{verbatim}

Once we have defined all the constructors, we can combine them to build
large terms.  For instance, the multiplication program is built by
the following C expression:

\begin{verbatim}
sequence(assignment(variable("z"),numeral(0)),
   term_while(greater(variable("x"),variable("z")),
      sequence(
       assignment(variable("t"),
                  addition(variable("y"),variable("t"))),
       assignment(variable("z"),
                  addition(variable("z"),numeral(1))))))
\end{verbatim}
\subsection{Navigating in programs}
When we have a term representing a program fragment, we access its
sub-expressions using a function {\tt child} that has the following
definition, such that executing
\begin{verbatim}
  child(assignment(variable("z"),numeral(1)),2)
\end{verbatim}
should return the same term as executing
\begin{verbatim}
  numeral(1)
\end{verbatim}
The function {\tt child} is defined as follows:
\begin{verbatim}
term *child(term *t, int rank) {
  return (t->term_body.children[rank-1]);
}
\end{verbatim}

We also define two functions {\tt variable\_name} and
{\tt numeral\_value} to map terms of kind {\tt VARIABLE} and
{\tt NUMERAL} to the values of type {\tt string} and {\tt int},
respectively, that they contain:
\begin{verbatim}
char *variable_name(term *t) {
  return(t->term_body.string_val);
}

int numeral_value(term *t) {
  return(t->term_body.int_val);
}
\end{verbatim}
\subsection{Execution environments}
To simulate the execution of programs, we simply want to view their
effect as modifying a collection of memory locations.  We don't need
to know how these locations are laid out in memory, we simply need
to record the fact that some variable is associated to some integer
value.  To keep this record, we construct a simply linked list using terms
of kind {\tt variable\_list} and {\tt variable\_value}.  We
use the same kind of binary terms as for the program structure to
represent these lists.  To represent the end of such a list, we use the
conventional NULL pointer.

With these conventions, we can build the representation of a state where
the location {\tt x} contains 0 and the location {\tt y} contains 2 as
follows:
\begin{verbatim}
variable_list(variable_value(variable("x"),numeral(0)),
  variable_list(variable_value(variable("y"),numeral(2)),NULL))
\end{verbatim}

Given an arbitrary variable, we often need to compute the value
associated to this variable in a given environment.  The following
C function performs this task by navigating the list structure and
stopping as soon as it finds a pair of a variable and a value that
corresponds to the variable we are looking for and returns the
corresponding integer.
\begin{verbatim}
int lookup(term *env, char *var) {
  while(NULL != env) {
    term *pair = child(env, 1);
    if(strcmp(variable_name(child(pair, 1)), var)==0) {
      return numeral_value(child(pair, 2));
    } else {
      env = child(env, 2);
    }
  }
  fprintf(stderr, "%s : ", var);
  panic("no such variable in environment for lookup");
}
\end{verbatim}
In this function, we use the function {\tt strcmp} from the
standard {\tt string} library of the C language.  This function
is used to compare two strings, it returns 0 if, and only if, the two
strings are the same.

Similarly, we will also need to construct a new environment that
only differs from a previous one in the fact the value of a single
variable has changed.  We want this function to map the environment
\begin{verbatim}
variable_list(variable_value(variable("x"),numeral(0)),
  variable_list(variable_value(variable("y"),numeral(2)),NULL)),
\end{verbatim}
the variable {\tt y} and the value {\tt 4} to the environment
\begin{verbatim}
variable_list(variable_value(variable("x"),numeral(0)),
  variable_list(variable_value(variable("y"),numeral(4)),NULL)),
\end{verbatim}


Such a mapping is described with the following function:
\begin{verbatim}
term *update(term *env, char *var_name, term *val) {
  if(NULL != env) {
    term *var = child(child(env, 1), 1);
    if(strcmp(variable_name(var), var_name) == 0) {
      return(variable_list(variable_value(var, val),
                           child(env, 2)));
    } else {
      return(variable_list(child(env, 1), 
			   update(child(env, 2), var_name, val)));
    }
  } else {
    fprintf(stderr, "%s : ", var_name);
    panic("no such variable in environment for update");
  }
}
\end{verbatim}
\subsection{Evaluating expressions}
Consider the following program:
\begin{verbatim}
x := 1;
y := 2;
x := (x + y) + 1;
y := (x + y) + 1
\end{verbatim}
The first time the expression {\tt x + y} is executed, on the third line,
the result should be be 4.  The second time this expression is executed,
on the fourth line, the result should be 7.  The evaluation of expressions
needs to take into account the value of variables.  This is the reason
why we  have an environment data structure, to keep track of the
value of all the variables used in the program.

In our language there are three kinds of expressions.  For variables,
we need to look in the environment to know the value that is currently
recorded.  We can use the {\tt lookup} function that was defined in the
previous section.

When the expression we want to evaluate is numeral, the value is
simply the value of this numeral.  On the other hand, when the
expression is an addition, we need first to evaluate the two
components of the expression.  This returns integer values that we can
add to obtain the value of the whole addition expression.  Thus, the
function that evaluates the expressions is a recursive function.
This is easily described with the following function:
\begin{verbatim}
int evaluate(term *env, term *exp) {
  int n1, n2;
  switch(exp->select) {
  case VARIABLE: return lookup(env, variable_name(exp));
  case NUMERAL: return numeral_value(exp);
  case ADDITION:
    return (evaluate(env, child(exp, 1)) +
            evaluate(env, child(exp, 2)));
  }
}
\end{verbatim}
We also need to evaluate comparison expressions and return a boolean value.
The function that implements this simply needs to evaluate the two
expressions that need to be compared in a given environment and then
compare the integer values that are received.  Here is this function's
code:
\begin{verbatim}
int evaluate_bool(term *env, term *exp) {
  int n1 = evaluate(env, child(exp, 1));
  int n2 = evaluate(env, child(exp, 2));
  return (n1 > n2);
}
\end{verbatim}

\subsection{Executing instructions}
We now have enough functions to describe the behavior of the three instructions
in our language.  We want to define a function {\tt execute} that takes
two arguments, an environment and  an instruction.  For instance, when
applied to the environment
\begin{verbatim}
variable_list(variable_value(variable("x"),numeral(0)),
  variable_list(variable_value(variable("y"),numeral(2)),NULL)),
and the instruction
\begin{verbatim}
  assignment(variable("x",addition(variable("y"),numeral(2)))
\end{verbatim}
we want this function {\tt execute} to return the following environment
\begin{verbatim}
variable_list(variable_value(variable("x"),numeral(0)),
  variable_list(variable_value(variable("y"),numeral(4)),NULL)).
\end{verbatim}
\begin{enumerate}
\item To execute an assignment instruction in a given environment,
we simply need to evaluate the expression that occurs as this assignment's
right-hand side (using the {\tt evaluate} function) and then to
construct a new environment where only the assigned variable is changed
(using the {\tt update} function),
\item To execute a sequence of two instructions, we need to execute the
first instruction and collect the resulting environment.  The second
instruction must then be executed, but in that intermediate environment.
\item To execute a while loop, we need to evaluate the condition as
a boolean expression.  If the value of this expression is false, then
we can return the input environment: the loop terminates without modifying
its execution environment.  If the value of the expression is true, then
the body of the while loop needs to be executed, the resulting environment
must be collected, and the while loop must be executed again, starting
from that environment. 
\end{enumerate}
These behaviors can be expressed very neatly by a simple composition of
the functions, including the recursive calls of the execution function
that is being defined:
\begin{verbatim}
term *execute(term *env, term *inst) {
  switch(inst->select) {
  case ASSIGNMENT:
    return update(env, 
                  variable_name(child(inst, 1)),
                  numeral(evaluate(env,child(inst, 2))));
  case SEQUENCE:
    return execute(execute(env,child(inst, 1)),child(inst, 2));
  case WHILE:
    if(evaluate_bool(env,child(inst, 1))) {
      return execute(execute(env,child(inst, 2)), inst);
    } else {
      return env;
    }
  }
}
\end{verbatim}
\subsection{Practical aspects: adding a parser}
It is not practical to test our interpreter if we have to build C
expressions by combining the term constructors in a C program, recompile
this C program and observe the result with the navigating functions.
A practical alternative is to define a parser to read in programs written
in a readable syntax and combine this parser with a the {\tt execute} 
function.  We use the Bison tool to add a simple parser to this interpreter.
For instance, the parsing rules for instructions have the following
shape
\begin{verbatim}
nst: identifier T_ASSIGN exp {$$ = assignment($1,$3); }
|  inst T_SCOLUMN inst {$$ = sequence($1,$3); }
|  T_WHILE b_exp T_DO inst T_DONE {$$ = term_while($2,$4); }
|  T_SKIP { $$ = skip(); }
|  T_OPEN_B exp T_CLOSE_B { $$ = $2; }
\end{verbatim}
For most parsing rules, the associated semantic action is simply a
term constructor, except for rules that are concerned with parentheses or
brackets, like the last parsing rule above.

The semantic rule for the program construct triggers
the execution of the program with the initial state given by the
list of variables that was parsed (this list is reversed, because
our parser stores the variable declarations in the wrong order).
\begin{verbatim}
program : T_VARIABLES environment T_IN inst T_END
{ print_env(execute(revert_env($2), $4)); exit(0); }
;
\end{verbatim}
\section{The interpreter in OCaml}
In the OCaml language, defining the terms and the constructors is
directly done with a type definition:
\begin{verbatim}
type exp = 
  Addition of exp*exp | Numeral of int | Variable of string;;
type b_exp = Greater of exp*exp;;
type inst = 
  Assignment of string * exp | Sequence of inst * inst
| While of b_exp * inst | Skip;;
\end{verbatim}
We simply use the pre-defined type of lists to represent environments,
in fact we use lists of pairs, where the first component is a string
and the second argument is a integer.
\begin{verbatim}
type environment = (string * int) list;;
\end{verbatim}
With this data-structure, the functions {\tt lookup} and {\tt update}
are defined as follows:
\begin{verbatim}
let rec lookup (env:environment) (var:string) =
  match env with
      [] -> failwith(var ^ 
             " : no such variable in environment for lookup")
    | (a,v)::tl -> if a = var then v else lookup tl var;;

let rec update (env:environment) (var:string) (n:int) =
  match env with
      [] -> failwith(var ^
             " : no such variable in environment for update")
    | ((a,v) as pair)::tl ->
        if a = var then 
          (a,n)::tl 
        else
          pair::(update tl var n);;
\end{verbatim}
The programming style of OCaml relies on pattern matching construct,
with specific syntax to describe lists.  Thus the expression
\begin{alltt}
  match env with [] -> \(e\sub{1}\) | a::tl -> \(e\sub{2}\)
\end{alltt}
means ``if the input data is an empty list, then return the
value of \(e_1\), otherwise, let {\tt a} be the first element of
the list and {\tt tl} be the second element, compute the value
of \(e_2\), which may depend on the value of {\tt a} and {\tt tl},
and return this value.  In the functions, we know that the elements
of the lists being processed are pairs and the pattern-matching
construct also makes it possible to access the components of
the first pair element concisely.

The pattern-matching construct makes it possible to write the
functions more concisely than in the C programming language.  This
conciseness also helps in the description of the {\tt evaluate},
{\tt b\_evaluate}, and {\tt execute} functions.
\begin{verbatim}
let rec evaluate (env:environment)(e:exp) =
  match e with
      Numeral n -> n
    | Variable s -> lookup env s
    | Addition(e1,e2) -> evaluate env e1 + evaluate env e2;;

let evaluate_bool (env:environment)(e:b_exp) =
  match e with
      Greater(e1,e2) -> evaluate env e1 > evaluate env e2;;

let rec execute (env:environment)(i:inst) =
  match i with
      Skip -> env
    | Assignment(s,e) -> update env s (evaluate env e)
    | Sequence(i1,i2) -> execute (execute env i1) i2
    | While(b,i) as it ->
        if (evaluate_bool env b) then
            execute (execute env i) it
        else
            env;;
\end{verbatim}
The OCaml tools suite also provide means to define parsers.  We
used the {\tt ocamlyacc} tool which is close to {\tt bison}.  In all,
we are able to define the same interpreter more concisely with the
OCaml language.

\section{An interpreter written in Prolog}
The Prolog programming style relies on the possibility to define
predicates with multiple arguments.  Very often, these predicates
can be used to describe functions, when the last argument is actually
viewed as the result of processing the previous argument.

Predicates are usually described in a style that is closed to OCaml's
pattern-matching.  Thus, we will consider a predicate {\tt lookup}
with three arguments, so that {\tt lookup(E,X,V)} can be read as
``the value of {\tt X} in the environment {\tt E} is {\tt V}.
This predicated can actually be defined with the following two clauses.
\begin{verbatim}
lookup([(X,V)|_], X, V) :- ! .
lookup([(Y,_)|Tl], X, V) :- lookup(Tl, X, V).
\end{verbatim}
Prolog's treatement of definitions is that when a question is asked, the
defining clauses are tried to solve the question.  The bang ponctuation
means that no other clause should be tested if this one succeeded.  In
effect, this ensures that the second clause is applied only when {\tt X}
and {\tt Y} have different values.  The {\em anonymous} variable ``{\tt \_}''
is used to indicate that the value found in the question where this
variable lies is not important.  The two clauses defining {\tt lookup}
can be read as follows: ``in any environment that is a list with
a first pair ({\tt X}, {\tt V}), the value of the variable {\tt X} is {\tt V}.
When the previous rule does not apply (when {\tt X} is different from {\tt Y},
the value of {\tt Y} in an environment that starts with a pair
of {\tt Y} and any value and whose tail is an environment {\tt Tl},
the value of the variable {\tt X} is the same as the value it would
have in the environment {\tt Tl}.

Thus, in Prolog, the notation {\tt :-} can be read as an {\em if}
construct.

We can define the operation of updating environments similarly:
\begin{verbatim}
update([(X,_)|Tl], X, V1, [(X,V1)|Tl]) :- !.
update([(Y,V)|Tl], X, V1, [(Y,V)|Tl1]) :- 
  update(Tl, X, V1, Tl1).
\end{verbatim}

\end{document}