summaryrefslogtreecommitdiff
path: root/papers/cfrontend_new/paper.tex
blob: 4293554fac1872fca79cd43319a678d6f3e4b7f1 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
\documentclass[namedreferences]{kluwer}

\usepackage{latexsym}
%\usepackage{amsmath} pb latex
\usepackage{amssymb}
\usepackage{amsfonts}
\usepackage{mymacros}
\usepackage{infrules}
\usepackage{url}
\usepackage{mymacros}
\usepackage[usenames]{color}

\input{macros}

\begin{document}
\begin{article}

\begin{opening}
\title{Formal verification of C front-end}
\subtitle{Yet another formal verification in Coq of a C front-end}
\runningtitle{C front-end}

\author{Sandrine \surname{Blazy} \email{{S}andrine.Blazy@ensiie.fr}}
\institute{ENSIIE and INRIA}

\author{Xavier \surname{Leroy} \email{{X}avier.Leroy@inria.fr}}
\runningauthor{S.Blazy, X.Leroy}
\institute{INRIA}

\maketitle

\begin{abstract}
This paper presents
\end{abstract}

\keywords{formal semantics, formal proof, C language, C compiler}
\end{opening}

\section{Introduction}

\section{A formal semantics for \Clight{}}
\subsection{CIL}

\subsection{Abstract syntax}

The abstract syntax of \Clight{} is given in figures~\ref{fig:syntax}
and~\ref{fig:syntax2}.
In the Coq formalization, this abstract syntax is presented as
inductive data types, therefore achieving a deep embedding of \Clight{}
into Coq.

At the level of types, \Clight{} features all the integral types of C,
along with array, pointer (including function pointers), function 
types, {\tt struct} and {\tt union} types. Bit-fields in {\tt struct} 
or {\tt union} are ignored.
The integral types fully specify the bit size of integers and floats, 
unlike the half-specified C types {\tt int}, {\tt long}, etc.  

\sbcm{TO DO: representation of struct and union}

Within expressions, all C operators are supported. 
Side-effects stem from function calls. 
All expressions and their sub-expressions are
annotated by their static types.  We write $a^\tau$ for the expression
$a$ carrying type $\tau$.  These types are necessary to determine the
semantics of type-dependent operators such as the overloaded
arithmetic operators. The following expressions may be in left-value 
position: $\id, *a, a_1[a_2]$ and $a \dot \id$. 

\input syntax.tex

At the level of statements, all structured control statements of C
(conditional, loops, simple {\tt switch}, {\tt break}, {\tt continue} and 
{\tt return}) are supported, but not unstructured statements 
(unstructured {\tt switch} and {\tt goto}).  
Contrary to C, the default case is mandatory in a \Clight{} {\tt switch} statement
and it is the last branch of the {\tt switch}.
An assignment is treated as a statement (instead of an expression).
This transformation from expression into statements is performed by CIL.
Two kinds of variables are allowed: global variables and local {\tt auto}
variables declared at the beginning of a function.  Block-scoped local
variables and {\tt static} variables are omitted because they are emulated by
pulling their declarations to function scope or global scope,
respectively.  Consequently, there is no block statement in \Clight{}.

A function $f$ defined in \Clight{} consists of a header (\textit{i.e.}
{\tt fn\_params(f)} the parameters of $f$ and {\tt fn\_return(f)} 
the return type of $f$), a list of local variables ({\tt fn\_vars(f)}) 
and a body ({\tt fn\_body(f)}) that is a statement. The functions composing 
the program are either internal functions, defined within \Clight{}, or 
external functions (a.k.a. system calls). An external function consists 
of the internal identifier and the signature of the function. 

A program $p$ consists of a list of global variable declarations
({\tt prog\_vars(p)}), a list of function definitions ({\tt prog\_funct(p)}), 
and an identifier naming the entry point of the program ({\tt prog\_main(p)}).
In a global variable declaration, the list of initial values
represents all values that need to be stored (\textit{e.g.} the
values of the cells of an array).

\input syntax2.tex

\subsection{Static semantics}
 The type system is very coarse: we check only the typing properties
  that matter for the translation to be correct.  Essentially,
  we need to check that the types attached to variable references
  match the declaration types for those variables.

\subsection{Dynamic semantics}

The dynamic semantics of \Clight{} is specified using natural semantics, 
also known as big-step operational semantics. 
 These natural semantics capture the final result of program execution, as 
well as traces of calls to external functions that emit an event when applied.
The big-step semantics described in~\cite{blazy06:fm} did not observe traces.
This addition of traces leads to a significantly
stronger observational equivalence between source and machine code.

Execution traces represent the interactions between the program and
the rest of the world (\textit{i.e.} the input-output activities of 
the program); they are defined in figure~\ref{fig:trace}.
A trace is a finite list of input-output events.
Each call to an external function produces an event. 
An event consists of an identifier and the actual values of the parameters of
the called function. Usually, the identifier represents the called function,
but more generally it could also represent any observable identifier.
Starting from an initial empty event (called $\E0$), events are propagated and 
concatenated during program execution.

\input trace.tex

While the semantics of
C is not deterministic (the evaluation order for expressions is not
completely specified and compilers are free to choose between several
orders), the semantics of \Clight{} is completely deterministic and
imposes a left-to-right evaluation order, consistent with the order
implemented by our compiler.  This choice simplifies greatly the
semantics compared with, for instance, Norrish's semantics for C
\cite{Norrish:phd}, which captures (at great expense) the
non-determinism allowed by the ISO C specification.  Our semantics can
therefore be viewed as a refinement of (a subset of) the ISO C
semantics, or that of Norrish.

The semantics is defined by 7 judgements (relations):
$$\begin{array}{rclr}
G, E & |- & a, M \evallvalue \loc, \tr, M' & \mbox{(expressions in l-value position)} \\
G, E & |- & a , M \evalexpr v, \tr, M' & \mbox{(expressions in r-value position)} \\
G, E & |- & \seq a, M \evalexpr \seq v, \tr, M' & \mbox{(list of expressions)}\\
G, E & |- & s, M \evalstmt \out, \tr, M' & \mbox{(statements)} \\
G, E & |- & \ls, M \evalstmt \out, \tr, M' & \mbox{(labeled statements of a switch)} \\
G    & |- & f(\seq v), M \evalfunc v, \tr, M' & \mbox{(function invocations)} \\
     & \vdash & p \evalprog v, \tr &  \mbox{(programs)}
\end{array}$$
Each judgement relates a syntactic element (expression, statement,
etc.) and an initial memory state to the result of executing this
syntactic element, as well as the final memory state at the end of
execution and also the trace of input-output events (system calls). 
 The various kinds of results, as well as the evaluation
environments, are defined in figure~\ref{fig:values}.

\input values.tex

For instance, executing an expression $a$ in l-value position results in a
memory location $\loc$ (a memory block reference and an offset within that
block), while executing an expression $a$ in r-value position results in
the value $v$ of the expression.  Values range over 32-bit integers,
64-bit floats, memory locations (pointers), and an undefined value
that represents for instance the value of uninitialized variables.
The result associated with the execution of a statement $s$ is an
``outcome'' indicating how the execution terminated: either normally
by running to completion or prematurely via a {\tt break}, {\tt continue} or
{\tt return} statement.
The invocation of a function $f$ yields a value $v$, and so does the execution 
of a program.

Two evaluation environments, defined in figure~\ref{fig:values},
appear as parameters to the judgements.  The local environment $E$ maps
local variables to references of memory blocks containing the values
of these variables.  These blocks are allocated at function entry and
freed at function return.  The global environment $G$ maps global
variables and function names to memory references.  It also maps
some references (those corresponding to function pointers) to function
definitions.

The memory model of the semantics is described in~\cite{icfem:sb}.
A memory $M$ maps block references to block contents.
Each block represents the result of a C \texttt{malloc},
or a stack frame, a global static variable, or a function
code-pointer. 
A block content consists of the dimensions of the block (low and high bounds) 
plus a mapping from byte offsets to byte-sized memory cells (\textit{i.e.} 
values).

\sbcm{TO DO: a detailler (faire une section a part?): il faut expliquer
loadval et storeval qui sont utilisees dans la semantique.}

In the Coq specification, the 7 judgements of the dynamic semantics
are encoded as mutually-inductive predicates. Each defining case of
each predicate corresponds exactly to an inference rule in the
conventional, on-paper presentation of natural semantics.  We have one
inference rule for each kind of expression and statement described in
figure~\ref{fig:syntax}. 
 We show most of the inference rules in figures~\ref{fig:dynsem1} 
to~\ref{fig:dynsem4}.  

\input dynsem1.tex

The first five rules of figure~\ref{fig:dynsem1} illustrate the 
evaluation of an expression in l-value position.
A variable $\id$ evaluates to the location $(E(\id), 0)$.
If an expression $a$ evaluates to a pointer value $\loc$, then
the location of the dereferencing expression $(\hbox{{\tt *}}a)^\tau$ is $\loc$.

Rule~\ref{rule:3} shows the evaluation of an array cell.
The location of an array cell $\unannot{a_1}{\tau_1} [\unannot{a_2}{\tau_2}]$
is computed from the location of the array $\unannot{a_1}{\tau_1}$
({\it i.e.} the location of the first cell) and the value of the index
$\unannot{a_2}{\tau_2}$ thanks to the function {\tt add}. 
{\tt add} is a partial function:
it can be undefined if the types and the shapes of argument values are
incompatible (e.g. an integer addition of two pointer values).
In the Coq specification, it is a total 
function returning optional values: either {\tt None} in case of failure, or 
${\tt Some}(v)$, abbreviated as $\some v$, in case of success.
%More precisely, the computation succeeds only if $v_1$ and
%$v_2$ are either 2 integers, or 2 floats or a pointer value and an integer.
%The corresponding types $\tau_1$ and $\tau_2$ must also be 
%compatible with these values. For instance, if $\tau_1$ is a {\tt Tarray},
%then $\tau_2$ must be a {\tt Tint}.
The trace resulting from the evaluation of 
$\unannot{a_1}{\tau_1} [\unannot{a_2}{\tau_2}]$ is the concatenation of the 
traces resulting from the evaluation of $\unannot{a_1}{\tau_1}$ and 
$\unannot{a_2}{\tau_2}$.


In a similar way, the location of an access to a {\tt struct} field 
$\unannot{a}{\tau} \dot f$ is computed from the location of the {\tt struct} 
$\unannot{a}{\tau}$ and an offset with that {\tt struct} (see 
rule~\ref{rule:4}).
\sbcm{un peu rapide, cf. repr ?}
The location of an access to a {\tt union} field is the location of
the {\tt union} (see rule~\ref{rule:5}).

The last four rule of figure~\ref{fig:dynsem1} illustrate the 
evaluation of an expression in r-value position.
Rule~\ref{rule:6} rule shows the evaluation of a l-value expression in an
r-value context.  The expression is evaluated to its location $l$,
with final memory state $M'$.  The value at location $l$ in $M'$ is
fetched using the {\tt loadval} function (see 
\sbcm{(faire une section a part?)}
%section~\ref{sec:mem}
) and returned.

If the location of an expression $a$ is $\loc$, then
the value of the expression $(\hbox{{\tt \&}}a)^\tau$ is also $\loc$
(rule~\ref{rule:7}).
Rule~\ref{rule:8} evaluates an application of a binary operator $\op$ to
expressions $\unannot{a_1}{\tau_1}$ and $\unannot{a_2}{\tau_2}$.  
Both sub-expressions are evaluated in
sequence, and their values are combined with the
{\tt eval{\char95}binary{\char95}operation} function, which takes as additional 
arguments the types $\tau_1$ and $\tau_2$ of the arguments, in order to resolve
overloaded and type-dependent operators.  

Rule~\ref{rule:9} evaluates a cast expression $(\tau)\unannot{a}{\tau_a}$.
The expression $\unannot{a}{\tau_a}$ is evaluated, and its
value is cast from its natural type $\tau_a$ to the expected type
$\tau$ using the partial function {\tt cast}.  This function
performs appropriate conversions, truncations and sign-extensions over
integers and floats, and may fail for undefined casts. The result $v$
of the cast is then the final result.

The rules in figure~\ref{fig:dynsem2} are
examples of statements execution.
The execution of a {\tt skip} statement yields the {\tt Out{\char95}normal}
outcome and the empty trace.
Similarly, the execution of a {\tt break} (resp. {\tt continue}) statement 
yields the {\tt Out{\char95}break} (resp. {\tt Out{\char95}continue})
outcome and the empty trace.
 
Rule~\ref{rule:12} evaluates an assignment statement.
An assignment statement $a_1^\tau \hbox{{\tt \ =\ }} a_2^{\tau'}$ evaluates
the l-value $a_1$ to a location $\loc$, then the r-value $a_2$ to a value $v$.
As implicit casts are inserted by CIL, there is no need to cast this $v$. 
$v$ is stored in memory at location $\loc$, resulting in the final memory
 state $M_3$.

The execution of a sequence of two statements starts with the
execution of the first statement, thus yielding an outcome that
determines if the second statement must be executed or not
(rules~\ref{rule:15}~and~\ref{rule:16}).
Rules~\ref{rule:17}--\ref{rule:19} describe the execution of a
{\tt while} loop.  Once the condition of a while loop is evaluated to a
value $v$, the execution of the loop terminates normally if $v$ is
false.  If $v$ is true, then the loop body is executed, thus yielding
an outcome $\out$.  If $\out$ is {\tt Out{\char95}break}, the loop terminates
normally. If $\out$ is {\tt Out{\char95}normal} or {\tt Out{\char95}continue}, 
the whole loop is reexecuted in the memory state modified by the execution of the
body.
Finally, rules~\ref{rule:20}--\ref{rule:21} describe the execution of a 
{\tt return} statement.

\input dynsem2.tex

The rules of figure~\ref{fig:dynsem3} define the execution of a call statement.
When the expression ${a_{fun}}^{\tau}$ is evaluated to a pointer value that is the
location (in the global environment $G$) of a function $f$, and when $f$ has the
same signature as ${a_{fun}}^{\tau}$, the control flow is passed to the called
function (rule~\ref{rule:25}).

The execution of a \Clight{} function allocates the memory required 
for storing the formal parameters and the local variables of the function 
(rule~\ref{rule:26}). The allocation initializes local variables to the 
${\tt Vundef}$ undefined value. Formal parameters are initialized to their 
corresponding actual parameters. The body of $f$ is then executed, thus yielding 
an outcome. The return value of the function is computed from this outcome and 
from the return type of $f$. The memory allocated for executing $f$ is freed
before returning to the caller.

A call $<\id \, \sigma>$ to an external function triggers a new event
that is the trace resulting from that call.

\input dynsem3.tex

The rules of figure~\ref{fig:dynsem4} define the execution of a 
${\tt switch}(a) \ls$ statement. If the expression of the {\tt switch} evaluates
to $n$, then some statements of the {\tt switch} are selected and executed 
(rule~\ref{rule:28}), thus yielding an outcome $\out$. If $\out$ is 
{\tt Out{\char95}break}, then the {\tt switch} terminates normally 
(\textit{i.e.} the {\tt outcome{\char95}switch} function updates $\out$ into 
{\tt Out{\char95}normal} and leave other outcomes unchanged).

The first selected statements that are executed are those of either the branch 
labeled by $n$ or the default branch (rules~\ref{rule:30} and \ref{rule:29}), 
thus yielding an outcome that determines if the following statements must be 
executed or not (rules~\ref{rule:30} and \ref{rule:31}). If needed, the execution falls through to the next statements $\ls$.

\input dynsem4.tex

\section{Translation from \Clight{} to Csharpminor}
\subsection{Overview of Csharpminor}
The Csharpminor language is the first intermediate language of the compiler.
It is a low-level imperative language, structured like our subset of C into 
expressions, statements and functions.

As in \Clight{}, Csharpminor local variables reside in memory, and their address 
can be taken. Csharpminor is entirely processor-neutral. In particular, 
Csharpminor uses a standard set of operations.

\subsection{Overview of the translation}

\section{Proof of correctness}

The theorem guarantees that if the source program terminates and does
not go wrong, the compiled code terminates and does not go wrong,
performs exactly the same system calls, and returns the same exit code
as the source program.
Currently, it says nothing about source programs that do not terminate (work
in progress).

\section{Translation from \Clight{} to Csharpminor}
\subsection{Overview of Csharpminor}
The Cminor language is the second intermediate language of the compiler
and the target language of our front-end compiler.
 We now give a short overview of Cminor; see \cite{xl:popl} for a more detailed
description, and \cite{xl:compcert-backend} for a complete formal
specification.

Unlike in Csharpminor, Cminor local variables do not reside in memory, 
and their address can not be taken.

\subsection{Overview of the translation}

\section{Quantitative results}

\section{Conclusion}

\bibliographystyle{unsrt}
\bibliography{blazy}

\end{article}
\end{document}