1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=ANSI_X3.4-1968">
<META name="GENERATOR" content="hevea 1.08">
<base target="main">
<script language="JavaScript">
<!-- Begin
function loadTop(url) {
parent.location.href= url;
}
// -->
</script>
<LINK rel="stylesheet" type="text/css" href="cil.css">
<TITLE>
Compiling C to CIL
</TITLE>
</HEAD>
<BODY >
<A HREF="cil003.html"><IMG SRC ="previous_motif.gif" ALT="Previous"></A>
<A HREF="ciltoc.html"><IMG SRC ="contents_motif.gif" ALT="Up"></A>
<A HREF="cilly.html"><IMG SRC ="next_motif.gif" ALT="Next"></A>
<HR>
<H2 CLASS="section"><A NAME="htoc4">4</A> Compiling C to CIL</H2><A NAME="sec-cabs2cil"></A>
In this section we try to describe a few of the many transformations that are
applied to a C program to convert it to CIL. The module that implements this
conversion is about 5000 lines of OCaml code. In contrast a simple program
transformation that instruments all functions to keep a shadow stack of the
true return address (thus preventing stack smashing) is only 70 lines of code.
This example shows that the analysis is so much simpler because it has to
handle only a few simple C constructs and also because it can leverage on CIL
infrastructure such as visitors and pretty-printers.<BR>
<BR>
In no particular order these are a few of the most significant ways in which
C programs are compiled into CIL:
<OL CLASS="enumerate" type=1><LI CLASS="li-enumerate">
CIL will eliminate all declarations for unused entities. This means that
just because your hello world program includes <TT>stdio.h</TT> it does not mean
that your analysis has to handle all the ugly stuff from <TT>stdio.h</TT>.<BR>
<BR>
<LI CLASS="li-enumerate">Type specifiers are interpreted and normalized:
<PRE CLASS="verbatim"><FONT COLOR=blue>
int long signed x;
signed long extern x;
long static int long y;
// Some code that uses these declaration, so that CIL does not remove them
int main() { return x + y; }
</FONT></PRE>
See the <A HREF="examples/ex1.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Anonymous structure and union declarations are given a name.
<PRE CLASS="verbatim"><FONT COLOR=blue>
struct { int x; } s;
</FONT></PRE>
See the <A HREF="examples/ex2.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Nested structure tag definitions are pulled apart. This means that all
structure tag definitions can be found by a simple scan of the globals.
<PRE CLASS="verbatim"><FONT COLOR=blue>
struct foo {
struct bar {
union baz {
int x1;
double x2;
} u1;
int y;
} s1;
int z;
} f;
</FONT></PRE>
See the <A HREF="examples/ex3.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">All structure, union, enumeration definitions and the type definitions
from inners scopes are moved to global scope (with appropriate renaming). This
facilitates moving around of the references to these entities.
<PRE CLASS="verbatim"><FONT COLOR=blue>
int main() {
struct foo {
int x; } foo;
{
struct foo {
double d;
};
return foo.x;
}
}
</FONT></PRE>
See the <A HREF="examples/ex4.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Prototypes are added for those functions that are called before being
defined. Furthermore, if a prototype exists but does not specify the type of
parameters that is fixed. But CIL will not be able to add prototypes for those
functions that are neither declared nor defined (but are used!).
<PRE CLASS="verbatim"><FONT COLOR=blue>
int f(); // Prototype without arguments
int f(double x) {
return g(x);
}
int g(double x) {
return x;
}
</FONT></PRE>
See the <A HREF="examples/ex5.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Array lengths are computed based on the initializers or by constant
folding.
<PRE CLASS="verbatim"><FONT COLOR=blue>
int a1[] = {1,2,3};
int a2[sizeof(int) >= 4 ? 8 : 16];
</FONT></PRE>
See the <A HREF="examples/ex6.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Enumeration tags are computed using constant folding:
<PRE CLASS="verbatim"><FONT COLOR=blue>
int main() {
enum {
FIVE = 5,
SIX, SEVEN,
FOUR = FIVE - 1,
EIGHT = sizeof(double)
} x = FIVE;
return x;
}
</FONT></PRE>
See the <A HREF="examples/ex7.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Initializers are normalized to include specific initialization for the
missing elements:
<PRE CLASS="verbatim"><FONT COLOR=blue>
int a1[5] = {1,2,3};
struct foo { int x, y; } s1 = { 4 };
</FONT></PRE>
See the <A HREF="examples/ex8.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Initializer designators are interpreted and eliminated. Subobjects are
properly marked with braces. CIL implements
the whole ISO C99 specification for initializer (neither GCC nor MSVC do) and
a few GCC extensions.
<PRE CLASS="verbatim"><FONT COLOR=blue>
struct foo {
int x, y;
int a[5];
struct inner {
int z;
} inner;
} s = { 0, .inner.z = 3, .a[1 ... 2] = 5, 4, y : 8 };
</FONT></PRE>
See the <A HREF="examples/ex9.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">String initializers for arrays of characters are processed
<PRE CLASS="verbatim"><FONT COLOR=blue>
char foo[] = "foo plus bar";
</FONT></PRE>
See the <A HREF="examples/ex10.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">String constants are concatenated
<PRE CLASS="verbatim"><FONT COLOR=blue>
char *foo = "foo " " plus " " bar ";
</FONT></PRE>
See the <A HREF="examples/ex11.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Initializers for local variables are turned into assignments. This is in
order to separate completely the declarative part of a function body from the
statements. This has the unfortunate effect that we have to drop the <TT>const</TT>
qualifier from local variables !
<PRE CLASS="verbatim"><FONT COLOR=blue>
int x = 5;
struct foo { int f1, f2; } a [] = {1, 2, 3, 4, 5 };
</FONT></PRE>
See the <A HREF="examples/ex12.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Local variables in inner scopes are pulled to function scope (with
appropriate renaming). Local scopes thus disappear. This makes it easy to find
and operate on all local variables in a function.
<PRE CLASS="verbatim"><FONT COLOR=blue>
int x = 5;
int main() {
int x = 6;
{
int x = 7;
return x;
}
return x;
}
</FONT></PRE>
See the <A HREF="examples/ex13.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Global declarations in local scopes are moved to global scope:
<PRE CLASS="verbatim"><FONT COLOR=blue>
int x = 5;
int main() {
int x = 6;
{
static int x = 7;
return x;
}
return x;
}
</FONT></PRE>
See the <A HREF="examples/ex14.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">Return statements are added for functions that are missing them. If the
return type is not a base type then a <TT>return</TT> without a value is added.
The guaranteed presence of return statements makes it easy to implement a
transformation that inserts some code to be executed immediately before
returning from a function.
<PRE CLASS="verbatim"><FONT COLOR=blue>
int foo() {
int x = 5;
}
</FONT></PRE>
See the <A HREF="examples/ex15.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">One of the most significant transformations is that expressions that
contain side-effects are separated into statements.
<PRE CLASS="verbatim"><FONT COLOR=blue>
int x, f(int);
return (x ++ + f(x));
</FONT></PRE>
See the <A HREF="examples/ex16.txt">CIL output</A> for this
code fragment<BR>
<BR>
Internally, the <TT>x ++</TT> statement is turned into an assignment which the
pretty-printer prints like the original. CIL has only three forms of basic
statements: assignments, function calls and inline assembly.<BR>
<BR>
<LI CLASS="li-enumerate">Shortcut evaluation of boolean expressions and the <TT>?:</TT> operator are
compiled into explicit conditionals:
<PRE CLASS="verbatim"><FONT COLOR=blue>
int x;
int y = x ? 2 : 4;
int z = x || y;
// Here we duplicate the return statement
if(x && y) { return 0; } else { return 1; }
// To avoid excessive duplication, CIL uses goto's for
// statement that have more than 5 instructions
if(x && y || z) { x ++; y ++; z ++; x ++; y ++; return z; }
</FONT></PRE>
See the <A HREF="examples/ex17.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">GCC's conditional expression with missing operands are also compiled
into conditionals:
<PRE CLASS="verbatim"><FONT COLOR=blue>
int f();;
return f() ? : 4;
</FONT></PRE>
See the <A HREF="examples/ex18.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">All forms of loops (<TT>while</TT>, <TT>for</TT> and <TT>do</TT>) are compiled
internally as a single <TT>while(1)</TT> looping construct with explicit <TT>break</TT>
statement for termination. For simple <TT>while</TT> loops the pretty printer is
able to print back the original:
<PRE CLASS="verbatim"><FONT COLOR=blue>
int x, y;
for(int i = 0; i<5; i++) {
if(i == 5) continue;
if(i == 4) break;
i += 2;
}
while(x < 5) {
if(x == 3) continue;
x ++;
}
</FONT></PRE>
See the <A HREF="examples/ex19.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">GCC's block expressions are compiled away. (That's right there is an
infinite loop in this code.)
<PRE CLASS="verbatim"><FONT COLOR=blue>
int x = 5, y = x;
int z = ({ x++; L: y -= x; y;});
return ({ goto L; 0; });
</FONT></PRE>
See the <A HREF="examples/ex20.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">CIL contains support for both MSVC and GCC inline assembly (both in one
internal construct)<BR>
<BR>
<LI CLASS="li-enumerate">CIL compiles away the GCC extension that allows many kinds of constructs
to be used as lvalues:
<PRE CLASS="verbatim"><FONT COLOR=blue>
int x, y, z;
return &(x ? y : z) - & (x ++, x);
</FONT></PRE>
See the <A HREF="examples/ex21.txt">CIL output</A> for this
code fragment<BR>
<BR>
<LI CLASS="li-enumerate">All types are computed and explicit casts are inserted for all
promotions and conversions that a compiler must insert:<BR>
<BR>
<LI CLASS="li-enumerate">CIL will turn old-style function definition (without prototype) into
new-style definitions. This will make the compiler less forgiving when
checking function calls, and will catch for example cases when a function is
called with too few arguments. This happens in old-style code for the purpose
of implementing variable argument functions.<BR>
<BR>
<LI CLASS="li-enumerate">Since CIL sees the source after preprocessing the code after CIL does
not contain the comments and the preprocessing directives.<BR>
<BR>
<LI CLASS="li-enumerate">CIL will remove from the source file those type declarations, local
variables and inline functions that are not used in the file. This means that
your analysis does not have to see all the ugly stuff that comes from the
header files:
<PRE CLASS="verbatim"><FONT COLOR=blue>
#include <stdio.h>
typedef int unused_type;
static char unused_static (void) { return 0; }
int main() {
int unused_local;
printf("Hello world\n"); // Only printf will be kept from stdio.h
}
</FONT></PRE>
See the <A HREF="examples/ex22.txt">CIL output</A> for this
code fragment</OL>
<HR>
<A HREF="cil003.html"><IMG SRC ="previous_motif.gif" ALT="Previous"></A>
<A HREF="ciltoc.html"><IMG SRC ="contents_motif.gif" ALT="Up"></A>
<A HREF="cilly.html"><IMG SRC ="next_motif.gif" ALT="Next"></A>
</BODY>
</HTML>
|