\documentclass[namedreferences]{kluwer} \usepackage{latexsym} %\usepackage{amsmath} pb latex \usepackage{amssymb} \usepackage{amsfonts} \usepackage{mymacros} \usepackage{infrules} \usepackage{url} \usepackage{mymacros} \usepackage[usenames]{color} \input{macros} \begin{document} \begin{article} \begin{opening} \title{Formal verification of C front-end} \subtitle{Yet another formal verification in Coq of a C front-end} \runningtitle{C front-end} \author{Sandrine \surname{Blazy} \email{{S}andrine.Blazy@ensiie.fr}} \institute{ENSIIE and INRIA} \author{Xavier \surname{Leroy} \email{{X}avier.Leroy@inria.fr}} \runningauthor{S.Blazy, X.Leroy} \institute{INRIA} \maketitle \begin{abstract} This paper presents \end{abstract} \keywords{formal semantics, formal proof, C language, C compiler} \end{opening} \section{Introduction} \section{A formal semantics for \Clight{}} \subsection{CIL} \subsection{Abstract syntax} The abstract syntax of \Clight{} is given in figures~\ref{fig:syntax} and~\ref{fig:syntax2}. In the Coq formalization, this abstract syntax is presented as inductive data types, therefore achieving a deep embedding of \Clight{} into Coq. At the level of types, \Clight{} features all the integral types of C, along with array, pointer (including function pointers), function types, {\tt struct} and {\tt union} types. Bit-fields in {\tt struct} or {\tt union} are ignored. The integral types fully specify the bit size of integers and floats, unlike the half-specified C types {\tt int}, {\tt long}, etc. \sbcm{TO DO: representation of struct and union} Within expressions, all C operators are supported. Side-effects stem from function calls. All expressions and their sub-expressions are annotated by their static types. We write $a^\tau$ for the expression $a$ carrying type $\tau$. These types are necessary to determine the semantics of type-dependent operators such as the overloaded arithmetic operators. The following expressions may be in left-value position: $\id, *a, a_1[a_2]$ and $a \dot \id$. \input syntax.tex At the level of statements, all structured control statements of C (conditional, loops, simple {\tt switch}, {\tt break}, {\tt continue} and {\tt return}) are supported, but not unstructured statements (unstructured {\tt switch} and {\tt goto}). Contrary to C, the default case is mandatory in a \Clight{} {\tt switch} statement and it is the last branch of the {\tt switch}. An assignment is treated as a statement (instead of an expression). This transformation from expression into statements is performed by CIL. Two kinds of variables are allowed: global variables and local {\tt auto} variables declared at the beginning of a function. Block-scoped local variables and {\tt static} variables are omitted because they are emulated by pulling their declarations to function scope or global scope, respectively. Consequently, there is no block statement in \Clight{}. A function $f$ defined in \Clight{} consists of a header (\textit{i.e.} {\tt fn\_params(f)} the parameters of $f$ and {\tt fn\_return(f)} the return type of $f$), a list of local variables ({\tt fn\_vars(f)}) and a body ({\tt fn\_body(f)}) that is a statement. The functions composing the program are either internal functions, defined within \Clight{}, or external functions (a.k.a. system calls). An external function consists of the internal identifier and the signature of the function. A program $p$ consists of a list of global variable declarations ({\tt prog\_vars(p)}), a list of function definitions ({\tt prog\_funct(p)}), and an identifier naming the entry point of the program ({\tt prog\_main(p)}). In a global variable declaration, the list of initial values represents all values that need to be stored (\textit{e.g.} the values of the cells of an array). \input syntax2.tex \subsection{Static semantics} The type system is very coarse: we check only the typing properties that matter for the translation to be correct. Essentially, we need to check that the types attached to variable references match the declaration types for those variables. \subsection{Dynamic semantics} The dynamic semantics of \Clight{} is specified using natural semantics, also known as big-step operational semantics. These natural semantics capture the final result of program execution, as well as traces of calls to external functions that emit an event when applied. The big-step semantics described in~\cite{blazy06:fm} did not observe traces. This addition of traces leads to a significantly stronger observational equivalence between source and machine code. Execution traces represent the interactions between the program and the rest of the world (\textit{i.e.} the input-output activities of the program); they are defined in figure~\ref{fig:trace}. A trace is a finite list of input-output events. Each call to an external function produces an event. An event consists of an identifier and the actual values of the parameters of the called function. Usually, the identifier represents the called function, but more generally it could also represent any observable identifier. Starting from an initial empty event (called $\E0$), events are propagated and concatenated during program execution. \input trace.tex While the semantics of C is not deterministic (the evaluation order for expressions is not completely specified and compilers are free to choose between several orders), the semantics of \Clight{} is completely deterministic and imposes a left-to-right evaluation order, consistent with the order implemented by our compiler. This choice simplifies greatly the semantics compared with, for instance, Norrish's semantics for C \cite{Norrish:phd}, which captures (at great expense) the non-determinism allowed by the ISO C specification. Our semantics can therefore be viewed as a refinement of (a subset of) the ISO C semantics, or that of Norrish. The semantics is defined by 7 judgements (relations): $$\begin{array}{rclr} G, E & |- & a, M \evallvalue \loc, \tr, M' & \mbox{(expressions in l-value position)} \\ G, E & |- & a , M \evalexpr v, \tr, M' & \mbox{(expressions in r-value position)} \\ G, E & |- & \seq a, M \evalexpr \seq v, \tr, M' & \mbox{(list of expressions)}\\ G, E & |- & s, M \evalstmt \out, \tr, M' & \mbox{(statements)} \\ G, E & |- & \ls, M \evalstmt \out, \tr, M' & \mbox{(labeled statements of a switch)} \\ G & |- & f(\seq v), M \evalfunc v, \tr, M' & \mbox{(function invocations)} \\ & \vdash & p \evalprog v, \tr & \mbox{(programs)} \end{array}$$ Each judgement relates a syntactic element (expression, statement, etc.) and an initial memory state to the result of executing this syntactic element, as well as the final memory state at the end of execution and also the trace of input-output events (system calls). The various kinds of results, as well as the evaluation environments, are defined in figure~\ref{fig:values}. \input values.tex For instance, executing an expression $a$ in l-value position results in a memory location $\loc$ (a memory block reference and an offset within that block), while executing an expression $a$ in r-value position results in the value $v$ of the expression. Values range over 32-bit integers, 64-bit floats, memory locations (pointers), and an undefined value that represents for instance the value of uninitialized variables. The result associated with the execution of a statement $s$ is an ``outcome'' indicating how the execution terminated: either normally by running to completion or prematurely via a {\tt break}, {\tt continue} or {\tt return} statement. The invocation of a function $f$ yields a value $v$, and so does the execution of a program. Two evaluation environments, defined in figure~\ref{fig:values}, appear as parameters to the judgements. The local environment $E$ maps local variables to references of memory blocks containing the values of these variables. These blocks are allocated at function entry and freed at function return. The global environment $G$ maps global variables and function names to memory references. It also maps some references (those corresponding to function pointers) to function definitions. The memory model of the semantics is described in~\cite{icfem:sb}. A memory $M$ maps block references to block contents. Each block represents the result of a C \texttt{malloc}, or a stack frame, a global static variable, or a function code-pointer. A block content consists of the dimensions of the block (low and high bounds) plus a mapping from byte offsets to byte-sized memory cells (\textit{i.e.} values). \sbcm{TO DO: a detailler (faire une section a part?): il faut expliquer loadval et storeval qui sont utilisees dans la semantique.} In the Coq specification, the 7 judgements of the dynamic semantics are encoded as mutually-inductive predicates. Each defining case of each predicate corresponds exactly to an inference rule in the conventional, on-paper presentation of natural semantics. We have one inference rule for each kind of expression and statement described in figure~\ref{fig:syntax}. We show most of the inference rules in figures~\ref{fig:dynsem1} to~\ref{fig:dynsem4}. \input dynsem1.tex The first five rules of figure~\ref{fig:dynsem1} illustrate the evaluation of an expression in l-value position. A variable $\id$ evaluates to the location $(E(\id), 0)$. If an expression $a$ evaluates to a pointer value $\loc$, then the location of the dereferencing expression $(\hbox{{\tt *}}a)^\tau$ is $\loc$. Rule~\ref{rule:3} shows the evaluation of an array cell. The location of an array cell $\unannot{a_1}{\tau_1} [\unannot{a_2}{\tau_2}]$ is computed from the location of the array $\unannot{a_1}{\tau_1}$ ({\it i.e.} the location of the first cell) and the value of the index $\unannot{a_2}{\tau_2}$ thanks to the function {\tt add}. {\tt add} is a partial function: it can be undefined if the types and the shapes of argument values are incompatible (e.g. an integer addition of two pointer values). In the Coq specification, it is a total function returning optional values: either {\tt None} in case of failure, or ${\tt Some}(v)$, abbreviated as $\some v$, in case of success. %More precisely, the computation succeeds only if $v_1$ and %$v_2$ are either 2 integers, or 2 floats or a pointer value and an integer. %The corresponding types $\tau_1$ and $\tau_2$ must also be %compatible with these values. For instance, if $\tau_1$ is a {\tt Tarray}, %then $\tau_2$ must be a {\tt Tint}. The trace resulting from the evaluation of $\unannot{a_1}{\tau_1} [\unannot{a_2}{\tau_2}]$ is the concatenation of the traces resulting from the evaluation of $\unannot{a_1}{\tau_1}$ and $\unannot{a_2}{\tau_2}$. In a similar way, the location of an access to a {\tt struct} field $\unannot{a}{\tau} \dot f$ is computed from the location of the {\tt struct} $\unannot{a}{\tau}$ and an offset with that {\tt struct} (see rule~\ref{rule:4}). \sbcm{un peu rapide, cf. repr ?} The location of an access to a {\tt union} field is the location of the {\tt union} (see rule~\ref{rule:5}). The last four rule of figure~\ref{fig:dynsem1} illustrate the evaluation of an expression in r-value position. Rule~\ref{rule:6} rule shows the evaluation of a l-value expression in an r-value context. The expression is evaluated to its location $l$, with final memory state $M'$. The value at location $l$ in $M'$ is fetched using the {\tt loadval} function (see \sbcm{(faire une section a part?)} %section~\ref{sec:mem} ) and returned. If the location of an expression $a$ is $\loc$, then the value of the expression $(\hbox{{\tt \&}}a)^\tau$ is also $\loc$ (rule~\ref{rule:7}). Rule~\ref{rule:8} evaluates an application of a binary operator $\op$ to expressions $\unannot{a_1}{\tau_1}$ and $\unannot{a_2}{\tau_2}$. Both sub-expressions are evaluated in sequence, and their values are combined with the {\tt eval{\char95}binary{\char95}operation} function, which takes as additional arguments the types $\tau_1$ and $\tau_2$ of the arguments, in order to resolve overloaded and type-dependent operators. Rule~\ref{rule:9} evaluates a cast expression $(\tau)\unannot{a}{\tau_a}$. The expression $\unannot{a}{\tau_a}$ is evaluated, and its value is cast from its natural type $\tau_a$ to the expected type $\tau$ using the partial function {\tt cast}. This function performs appropriate conversions, truncations and sign-extensions over integers and floats, and may fail for undefined casts. The result $v$ of the cast is then the final result. The rules in figure~\ref{fig:dynsem2} are examples of statements execution. The execution of a {\tt skip} statement yields the {\tt Out{\char95}normal} outcome and the empty trace. Similarly, the execution of a {\tt break} (resp. {\tt continue}) statement yields the {\tt Out{\char95}break} (resp. {\tt Out{\char95}continue}) outcome and the empty trace. Rule~\ref{rule:12} evaluates an assignment statement. An assignment statement $a_1^\tau \hbox{{\tt \ =\ }} a_2^{\tau'}$ evaluates the l-value $a_1$ to a location $\loc$, then the r-value $a_2$ to a value $v$. As implicit casts are inserted by CIL, there is no need to cast this $v$. $v$ is stored in memory at location $\loc$, resulting in the final memory state $M_3$. The execution of a sequence of two statements starts with the execution of the first statement, thus yielding an outcome that determines if the second statement must be executed or not (rules~\ref{rule:15}~and~\ref{rule:16}). Rules~\ref{rule:17}--\ref{rule:19} describe the execution of a {\tt while} loop. Once the condition of a while loop is evaluated to a value $v$, the execution of the loop terminates normally if $v$ is false. If $v$ is true, then the loop body is executed, thus yielding an outcome $\out$. If $\out$ is {\tt Out{\char95}break}, the loop terminates normally. If $\out$ is {\tt Out{\char95}normal} or {\tt Out{\char95}continue}, the whole loop is reexecuted in the memory state modified by the execution of the body. Finally, rules~\ref{rule:20}--\ref{rule:21} describe the execution of a {\tt return} statement. \input dynsem2.tex The rules of figure~\ref{fig:dynsem3} define the execution of a call statement. When the expression ${a_{fun}}^{\tau}$ is evaluated to a pointer value that is the location (in the global environment $G$) of a function $f$, and when $f$ has the same signature as ${a_{fun}}^{\tau}$, the control flow is passed to the called function (rule~\ref{rule:25}). The execution of a \Clight{} function allocates the memory required for storing the formal parameters and the local variables of the function (rule~\ref{rule:26}). The allocation initializes local variables to the ${\tt Vundef}$ undefined value. Formal parameters are initialized to their corresponding actual parameters. The body of $f$ is then executed, thus yielding an outcome. The return value of the function is computed from this outcome and from the return type of $f$. The memory allocated for executing $f$ is freed before returning to the caller. A call $<\id \, \sigma>$ to an external function triggers a new event that is the trace resulting from that call. \input dynsem3.tex The rules of figure~\ref{fig:dynsem4} define the execution of a ${\tt switch}(a) \ls$ statement. If the expression of the {\tt switch} evaluates to $n$, then some statements of the {\tt switch} are selected and executed (rule~\ref{rule:28}), thus yielding an outcome $\out$. If $\out$ is {\tt Out{\char95}break}, then the {\tt switch} terminates normally (\textit{i.e.} the {\tt outcome{\char95}switch} function updates $\out$ into {\tt Out{\char95}normal} and leave other outcomes unchanged). The first selected statements that are executed are those of either the branch labeled by $n$ or the default branch (rules~\ref{rule:30} and \ref{rule:29}), thus yielding an outcome that determines if the following statements must be executed or not (rules~\ref{rule:30} and \ref{rule:31}). If needed, the execution falls through to the next statements $\ls$. \input dynsem4.tex \section{Translation from \Clight{} to Csharpminor} \subsection{Overview of Csharpminor} The Csharpminor language is the first intermediate language of the compiler. It is a low-level imperative language, structured like our subset of C into expressions, statements and functions. As in \Clight{}, Csharpminor local variables reside in memory, and their address can be taken. Csharpminor is entirely processor-neutral. In particular, Csharpminor uses a standard set of operations. \subsection{Overview of the translation} \section{Proof of correctness} The theorem guarantees that if the source program terminates and does not go wrong, the compiled code terminates and does not go wrong, performs exactly the same system calls, and returns the same exit code as the source program. Currently, it says nothing about source programs that do not terminate (work in progress). \section{Translation from \Clight{} to Csharpminor} \subsection{Overview of Csharpminor} The Cminor language is the second intermediate language of the compiler and the target language of our front-end compiler. We now give a short overview of Cminor; see \cite{xl:popl} for a more detailed description, and \cite{xl:compcert-backend} for a complete formal specification. Unlike in Csharpminor, Cminor local variables do not reside in memory, and their address can not be taken. \subsection{Overview of the translation} \section{Quantitative results} \section{Conclusion} \bibliographystyle{unsrt} \bibliography{blazy} \end{article} \end{document}