This document is intended as a backup to the lectures on data layout for C-style data types. We first consider the layout of primitive and complex data types under alignment constraints such as are found in most modern processors (such as MIPS). This leads us to reconsider our rules for data declarations, this time adding multi-dimensional arrays and the calculation of offsets, alignments, and sizes of various data types. Finally we show how to generate code for references to complex data types, based on a MIPS-style base-plus-offset notation for our IC code.
Primitive types, with their sizes in bytes, are as follows:
Each data type (equivalently, each type expression) thus has an alignment, and it also has a length, which is just the number of bytes it takes up in memory. For primitive types, the alignment and length are the same, but for complex data types, this may not be the case. For example, the following structure has an alignment of 4, but a length of 5:
struct {
int A;
char B;
}
The main problem is thus that when laying out data, we have to sometimes skip ahead (wasting space)
because the next data item to be assigned a starting location does not align at the next byte
address available. For example the declarations
int A;
char B;
int C
double D;
if we lay them out in order (the usual practice), would have to be assigned addresses as follows:
Byte ---- 00 A 01 02 03 04 B 05 * 06 * 07 * 08 C 09 10 11 12 * 13 * 14 * 15 * 16 D 17 18 19 20 21 22 23 24 (next available location)The length of the frame is 24 bytes, of which 7 are wasted padding (denoted with "*"), required so that the next data item is aligned; such padding is not always avoidable, but note that we could avoid padding within the frame itself if we laid out the data items in decreasing order of alignment:
Byte ---- 00 D 01 02 03 04 05 06 07 08 A 09 10 11 12 C 13 14 15 16 B 17 (next)The length of this frame is 17 bytes because it wastes no space. For simplicity in these notes, we shall use the naive method of laying data out in the order of the declarations (interestingly, most compilers do the same).
The previous discussion shows how a sequence of data items, whether in file scope, or within a structure, can be laid out in a frame. When composing data types, we must simply make sure that alignment is observed at every point. For example, the array
struct {
int A;
char B;
} C[2];
would have to be laid out as follows, so that the integer field is aligned:
Byte ---- 00 C[0].A 01 02 03 04 C[0].B 05 xxxxxx 06 xxxxxx 07 xxxxxx 08 C[1].A 09 10 11 12 C[1].B 13 (next)
A simple strategy is thus to calculate for each data type the alignment and length, and to use this information recursively in determining the offset of each component from the beginning of the data region or frame for that object. This information can be stored in the top cell of the type expression. This simple recursive strategy is not usually optimal. For example,
struct {
struct {
int A;
char B;
} C;
struct {
char D;
int E;
} F;
} G;
would be recursively processed as follows:
Byte ---- 00 G.C.A | 01 | Frame for structure C; alignment = 4 and length = 5 02 | 03 | 04 G.C.B | 05 * | 06 * | 07 * | 08 G.F.D | Frame for structure F; alignment = 4 and length = 8 09 * | 10 * | 11 * | 12 G.F.E | 13 | 14 | 15 | 16 (next)
That is, the frames for each structure are calculated separately, and then they are put together without regard to their internal structure. Clearly, it would be possible to lay out this data type optimally as:
Byte ---- 00 G.C.A | 01 | Frame for structure C; alignment = 4 and length = 5 02 | 03 | 04 G.C.B | 05 G.F.D | 06 * | Frame for structure F; alignment = 4 and length = 8 07 * | 08 G.F.E | 09 | 10 | 11 | 12 (next)
align(L,A) = ceiling( L / A ) * A
and we shall in general only store the alignment and length, and calculate the size as needed.
For any data type T, we shall refer to these components as length(T), alignment(T), and sizeof(T), assuming
that the last can be calculated by need from the first two.
For primitive data types, length = alignment = size. For complex data types, we must calculate the offsets of component types. We first consider arrays, and then structures.
An array type array(T,size) is size objects of type T laid out contiguously. As discussed above, the sizeof(T) gives the proper alignment when this is done. Then we have the following calculation of the attributes of an array containing size components of base type T:
For structs, we have to (recursively) consider the layout of data items in the ST attached to the structure. Each ST will have a length and alignment attribute, and each entry in a symbol table with have an offset, which is the starting address of that item from the beginning of the frame for that structure.
The calculation of these attributes is quite simple. When a symbol table is created, the length attribute is initialized to 0, and the alignment is initialized to 1. Then, when adding a new entry E to the symbol table, we calculate the attributes as follows:
alignment = max( alignment, alignment(E) )
offset(E) = align( length, alignment(E) )
length = offset(E) + length(E)
The layout of a frame for an entire ST is done the same way, and corresponds
to the layout method described at the beginning of this section.
Prog --> main ( ) "{" Dummy Ds Ss "}"
// Enter types into symbol table
Dummy --> epsilon PushST( ); // Create new ST
Ds --> D ";" Ds | D ";"
Ss --> S ";" Ss | S ";"
D --> T L
T --> int tempType = Int
T --> float tempType = Float
// Note timing of next rule: Dummy creates ST
// before declarations are encountered, then
// type is built from that ST at end of rule
T --> struct "{" Dummy Ds "}"
tempType = struct( Pop() );
alignment(tempType) = alignment of ST in struct
length(tempType) = length in ST of struct
L --> N "," L if(!AddEntry( N.lexval, N.type ))
Error("Multiply-defined identifier");
L --> N if(!AddEntry( N.lexval, N.type ))
Error("Multiply-defined identifier");
N --> * N
N.type = ptr( N1.type ); // alignment = length = 4
N.lexval = N1.lexval;
N --> id A
N.type = A.type
N.lexval = id.lexval;
A --> epsilon
A.type = tempType;
A --> [ intnum ] A
A.type = array( A1.type, intnum.lexval );
alignment(A.type) = alignment(A1.type);
length(A.type) = sizeof(A1.type) * (atoi(intnum.lexval)-1)
+ length(A1.type);
Note that our recursive construction of array types results in what is called Row Major Order, that is, when looking at the individual components in order, the indices vary like an odometer, and (the most common case) a two-dimensional array A[Rows][Columns] has Rows as the Major unit in the Ordering.
For example, a declaration "int C[2][3]" would have the following layout according to our method presented in above:
Byte ---- 00 C[0][0] 01 02 03 04 C[0][1] 05 06 07 08 C[0][2] 09 10 11 12 C[1][0] 13 14 15 16 C[1][1] 17 18 19 20 C[1][2] 21 22 23 24 (next)
Thus, we will change our IC language so that all variable references have the form the form A[I] or A[10]. These references can occur anywhere a variable can occur, e.g.,
add A[4], A[t29], C[0]
sub D[0], B[0], t19
and so on. When the offset is zero, it would be permissible as a convenience
for the reader to just write, e.g.,
add A[4], A[t29], C
sub D, B[0], t19
Temporaries are just placeholders for values, and it is not necessary to use
base plus offset addressing for them.
In the rules for expressions, we now have to account for references to variables, which may involve array or structure dereferencing operations. We will use the following attributes:
E --> R
E.type = R.type;
E.place = R.place + "[" + R.offsetPlace + "]"
// + is string concat here
R --> id
R.type = Lookup(id.lexval);
R.place = id.lexval;
R.offsetPlace = NewTemp();
AQ("move", R.offsetPlace, "-", "0");
R --> R "." id
R.place = R1.place;
R.offsetPlace = R1.offsetPlace;
R.type = Lookup type of id.lexval in ST associated with R1.type;
temp = Lookup offset of id.lexval in ST associated with R1.type;
AQ("add", R.offsetPlace, R.offsetPlace, temp);
R --> R "[" E "]"
R.place = R1.place;
R.offsetPlace = R1.offsetPlace;
R.type = base type stored in array type R1.type;
tempPlace = NewTemp();
AQ("mult", tempPlace, E.place, size(R.type));
AQ("add", R.offsetPlace, R.offsetPlace, tempPlace);
We have left out everything but the essentials here, and you will need to add
such things as type checking (e.g., is R1.type a structure type in the third
rule?), flow of control checks (is E in flow of control mode in the fourth rule
or in value mode?), and so on.
There are several ways to improve this translation scheme, and you should think about this for your project. The main problem is that every variable has a variable offset; when the offset is a constant, e.g., 1, it still produces
move t34, -, 1
move A[t19], -, B[t34]
instead of
move A[t19], -, B[1]
Similarly, a series of structure references will result in explicit addition of
constants; e.g.,
move t34, -, 0
move t34, -, 4
add t34, -, 4
add t34, -, 8
move A[t19], -, B[t34]
This could be preprocessed by the compiler to get:
move A[t19], -, B[16]
We will leave it to the reader to figure out a good way to optimize these (and
other) points.