The incorporation of SIMD technology onto modern general purpose microprocessors is on a more modest scale than were the pioneering efforts. For reasons of economy the SIMD engine has to be included on the same die as the rest of the CPU. This immediately constrains the degree of parallelism that can be obtained. The constraint does not arise from the difficulties of incorporating large numbers of simple processing units. With contemporary feature sizes, one could fit more than a thousand 1-bit processors on a die. Instead the degree of parallelism is constrained by the width of the CPU to memory data path.
The SIMD model provides for all data-processors to simultaneously transfer words of data between internal registers and corresponding locations in their memory banks. Thus with n data-processors each using w-bit words, one needs a path to memory of nw bits. If a CPU chip has a 64-bit memory bus then it could support 64 one-bit SIMD data-processors, or eight 8-bit data-processors, or two 32-bit data-processors, etc.
For bulk data operations, such as those involved in image processing, the relevant memory bus is the off-chip bus. For algorithms that can exploit some degree of data locality, the relevant bus is the one linking the CPU to the on-chip cache, and the degree of parallelism possible is constrained by the width of the cache lines used.
Whilst memory access paths constrain the degree of parallelism possible, the large number of logic gates available on modern dies allows the complexity of the individual data-processors to be raised. Instead of performing simple 1-bit arithmetic, they can do arithmetic on multi-bit integers and floating point numbers.
As a combined result of these altered constraints we find that SIMD instructions for multi-media applications have parallelism levels of between 32 bits (Texas C62xx) and 128 bits (Intel PIII), and the supported data types range from 8-bit integers to 32-bit floating point numbers.
The MMX architectural extensions were introduced by Intel in late models of the Pentium processor. They have subsequently been incorporated in the PII and PIII processors from Intel and in compatible chips produced by AMD, Cyrix and others. They can now be considered part of the baseline architecture of any contemporary PC.
The data registers available for computational purposes on processors incorporating the MMX architecture are shown in Figure 1.1. The original IA32 architecture had eight general purpose registers and an eight-deep stack of floating point registers. When designing the multi-media extensions to the instruction set, Intel wanted to ensure that no new state bits were added to the processor model. Adding new state bits would have made CPUs with the extensions incompatible with existing operating systems, as these would not have saved the additional state on a task switch. Instead, Intel added eight new virtual 64-bit registers which are aliased onto the existing floating point stack. These new multimedia registers, mm0..mm7, use state bits already allocated to the Floating Point Unit (FPU), and are thus saved when an operating system saves the state of the FPU.
The MMX instructions share addressing mode formats with the instructions used for the general purpose registers. The 3-bit register identification fields inherited from the previous instructions are now used to index the eight multi-media registers rather than the eight general purpose registers. The existing addressing modes for memory operands are also carried over, allowing the full gamut of base and index address modes to be applied to the loading and storing of MMX operands.
The MMX registers support four data formats as shown in Figure 1.2. A register can hold a single 64-bit QWORD, a pair of 32-bit DWORDs, four 16-bit WORDs or eight BYTEs. Within these formats the data-types shown in Table 1.1 are supported.
The saturated data-types require special comment. They are designed to handle a circumstance that arises frequently in image processing when using pixels represented as integers: that of constraining the result of some arithmetic operation to be within the meaningful bounds of the integer representation.
Format | signed | unsigned | signed saturated | unsigned saturated |
BYTE | yes | yes | yes | yes |
WORD | yes | yes | yes | yes |
DWORD | yes | yes | no | no |
Suppose we are adding two images represented as arrays of bytes in the range 0..255, with 0 representing black and 255 white. It is possible that the results may be greater than 255. For example 200+175 = 375, but in 8-bit binary

11001000 + 10101111 = 1 01110111

Dropping the leading 1 we get 01110111 = 119, which is dimmer than either of the original pixels. The only sensible answer in this case would have been 255, representing white.
To avoid such errors, image processing code using 8 bit values has to put in tests to check if values are going out of range, and force all out of range values to the appropriate extremes of the ranges. This inevitably slows down the computation of inner loops. Besides introducing additional instructions, the tests involve conditional branches and pipeline stalls.
The MMX seeks to obviate this by providing packed saturated data types with appropriate arithmetic operations over them. These use hardware to ensure that the numbers remain in range.
The combined effect of the use of packed data and saturated types can be to produce a significant increase in code density and performance.
main()
{  unsigned char v1[LEN],v2[LEN],v3[LEN];
   int i,j,t;
   for (j=0;j<LEN;j++){
       t=v2[j]+v1[j];
       v3[j]=(unsigned char)(t>255?255:t);
   }
}

ASSEMBLER

        xor       edx, edx                       ; 9.8
                                                 ; LOE edx ebx ebp esi edi esp dl dh bl bh
$B1$3:                                           ; Preds $B1$5 $B1$2
        mov       eax, edx                       ; 10.9
        lea       ecx, DWORD PTR [esp]           ; 10.6
        movzx     ecx, BYTE PTR [eax+ecx]        ; 10.6
        mov       DWORD PTR [esp+19200], edi     ;
        lea       edi, DWORD PTR [esp+6400]      ; 10.12
        movzx     edi, BYTE PTR [eax+edi]        ; 10.12
        add       ecx, edi                       ; 10.12
        cmp       ecx, 255                       ; 11.26
        mov       edi, DWORD PTR [esp+19200]     ;
        jle       $B1$5                          ; Prob 16% ; 11.26
                                                 ; LOE eax edx ecx ebx ebp esi edi esp al ah dl dh cl ch bl bh
$B1$4:                                           ; Preds $B1$3
        mov       ecx, 255                       ; 11.26
                                                 ; LOE eax edx ecx ebx ebp esi edi esp al ah dl dh cl ch bl bh
$B1$5:                                           ; Preds $B1$3 $B1$4
        inc       edx                            ; 9.18
        cmp       edx, 6400                      ; 9.3
        mov       DWORD PTR [esp+19200], edi     ;
        lea       edi, DWORD PTR [esp+12800]     ; 11.4
        mov       BYTE PTR [eax+edi], cl         ; 11.4
        mov       edi, DWORD PTR [esp+19200]     ;
        jl        $B1$3                          ; Prob 80% ; 9.3
Consider the C code in figure 1.3, which adds two images pointed to by p1 and p2, storing the result in the image pointed to by p3. The code includes a check to prevent overflow. Compiled by the Visual C++ compiler, the resulting assembler code has 18 instructions in the inner loop. The potential acceleration due to the MMX can be seen by comparing it with the following hand coded assembler inner loop:
l1:   movq    mm0,[esi+ebp-LEN]     ; load 8 bytes
      paddusb mm0,[esi+ebp-2*LEN]   ; packed unsigned add bytes
      movq    [esi+ebp-3*LEN],mm0   ; store 8 byte result
      add     esi,8                 ; inc dest pntr
      loop    l1                    ; repeat for the rest of the array

The example assumes that p1,p2,p3 are stored in registers esi,edx,edi for the duration of the loop. Only 5 instructions are used in the whole loop, compared to 18 for the compiled C code. Furthermore, the MMX code processes 8 times as much data per iteration, thus requiring only 0.625 instructions per byte processed. The compiled code thus executes 29 times as many instructions to perform the same task. Whilst some of this can be put down to the superiority of hand assembled versus automatically compiled code, the combination of the SIMD model and the saturated arithmetic is obviously a major factor.
The basic IA32 architecture already provides support for 32-bit and 64-bit IEEE floating point instructions using the FPU stack. However, 64-bit floating point numbers are poor candidates for parallelism in view of the datapath limitations described in section 1.1.
AMD provided a straightforward extension of the MMX whereby an additional data-type, the pair of 32-bit floats shown in figure 1.4, could be operated on. Type conversion operations are provided to convert between pairs of 32-bit integers and 32-bit floats.
The range of operators supported includes instructions for the rapid computation of reciprocals and square roots - relevant to the computation of Euclidean norms in 3D-space.
Another significant extension with 3DNow!, copied in the Streaming SIMD extensions, is the ability to prefetch data into the cache prior to its use. This is potentially useful in any loop operating on an array of data. For instance the loop in the previous section could be accelerated by inserting the marked prefetch instructions.
l1:   movq      mm0,[esi]   ; load 8 bytes
      add       esi,8       ; inc src pntr
      prefetch  [esi]       ; ensure that the next 8 bytes worth are in the cache
      paddusb   mm0,[edx]   ; packed unsigned add bytes
      add       edx,8       ; inc src pntr
      prefetch  [edx]       ; preload 8 bytes from other array
      movq      [edi],mm0   ; store 8 byte result
      add       edi,8       ; inc dest pntr
      prefetchw [edi]       ; set up the cache ready for writing 8 bytes of data

The instruction count rises, but despite this performance goes up, since loads into the cache are initiated 6 instructions prior to the data being needed. This allows the loading of the cache to be overlapped with useful instructions rather than forcing calculations to stall whilst the load takes place.
The most significant difference is to the model of machine state. Whilst the original MMX instructions and 3DNOW! add no new state to the machine architecture, Streaming SIMD introduces additional registers: eight new 128-bit registers (XMM0..7). The addition of new state means that operating systems have to be modified to ensure that the xmm registers are saved during context switches. Intel provided a driver to do this for Microsoft Windows NT 4.0; Windows 98 and subsequent Windows releases have this support built in.
The default format for the XMM registers is a 4-tuple of 32 bit floating point numbers. Instructions are provided to perform parallel addition, multiplication, subtraction and division on these 4-tuples. Two very useful extensions to this format are provided.
Packed addition
ADDPS xmm0,xmm1
xmm0 | 1.2 | 1.3 | 1.4 | 1.5 |
xmm1 | 2.0 | 4.0 | 6.0 | 8.0 | +
xmm0 | 3.2 | 5.3 | 7.4 | 9.5 |

Scalar addition
ADDSS xmm0,xmm1
xmm0 | 1.2 | 1.3 | 1.4 | 1.5 |
xmm1 | 2.0 | 4.0 | 6.0 | 8.0 | +
xmm0 | 1.2 | 1.3 | 1.4 | 9.5 |
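These operations are also exposed through C intrinsics in Intel's xmmintrin.h header. As a rough sketch (the intrinsic names _mm_set_ps, _mm_add_ps, _mm_add_ss and _mm_storeu_ps are the standard ones; the surrounding program is purely illustrative), the two additions above can be reproduced as follows:

    #include <stdio.h>
    #include <xmmintrin.h>

    int main(void)
    {
        /* _mm_set_ps takes its arguments from the high element down to element 0,
           so element 0 of x0 is 1.5 and element 0 of x1 is 8.0, matching the
           rightmost column of the diagrams above.                               */
        __m128 x0 = _mm_set_ps(1.2f, 1.3f, 1.4f, 1.5f);
        __m128 x1 = _mm_set_ps(2.0f, 4.0f, 6.0f, 8.0f);

        float packed[4], scalar[4];
        _mm_storeu_ps(packed, _mm_add_ps(x0, x1)); /* ADDPS: all four elements added */
        _mm_storeu_ps(scalar, _mm_add_ss(x0, x1)); /* ADDSS: element 0 added, the
                                                      other three copied from x0    */
        printf("packed %g %g %g %g\n", packed[3], packed[2], packed[1], packed[0]);
        printf("scalar %g %g %g %g\n", scalar[3], scalar[2], scalar[1], scalar[0]);
        return 0;   /* prints  3.2 5.3 7.4 9.5  and  1.2 1.3 1.4 9.5 */
    }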
A PREFETCH instruction is provided to preload data into the cache. This is more sophisticated than the equivalent 3DNOW! instruction described above. The AMD instruction applies to all cache levels. The Intel variant allows the programmer to specify which levels of cache are to be preloaded.
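Through the same header the Intel prefetch is available as the _mm_prefetch intrinsic, whose second argument selects the cache level (_MM_HINT_T0, _MM_HINT_T1, _MM_HINT_T2 or _MM_HINT_NTA). The sketch below is only illustrative; the prefetch distance of 256 bytes and the choice of hint are assumptions, not tuned values.

    #include <stddef.h>
    #include <xmmintrin.h>

    /* Sum a byte array, preloading data a few cache lines ahead of its use. */
    unsigned long sum_bytes(const unsigned char *p, size_t n)
    {
        unsigned long total = 0;
        for (size_t i = 0; i < n; i++) {
            if ((i & 63) == 0)   /* issue one prefetch per 64-byte cache line */
                _mm_prefetch((const char *)p + i + 256, _MM_HINT_T0);
            total += p[i];
        }
        return total;
    }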
Whereas all previous IA32 load and store instructions operated equally well on aligned and unaligned data, the Streaming SIMD extensions introduce special load and store instructions to operate on aligned 128-bit words. General purpose load and store instructions capable of handling unaligned data are retained.
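In C the two forms appear as the _mm_load_ps/_mm_store_ps (aligned) and _mm_loadu_ps/_mm_storeu_ps (unaligned) intrinsics. The sketch below uses the GCC-style aligned attribute to force 16-byte alignment; on the Intel and Microsoft compilers __declspec(align(16)) plays the same role.

    #include <xmmintrin.h>

    float demo(void)
    {
        /* buf starts on a 16-byte boundary, so buf may be loaded with the aligned
           instruction, whereas buf+1 is deliberately misaligned and must use the
           unaligned form.                                                        */
        __attribute__((aligned(16))) float buf[8] = {0,1,2,3,4,5,6,7};
        __m128 a = _mm_load_ps(buf);        /* aligned load: faults if misaligned */
        __m128 u = _mm_loadu_ps(buf + 1);   /* unaligned load: any address works  */
        float out[4];
        _mm_storeu_ps(out, _mm_add_ps(a, u));
        return out[0];                      /* 0.0 + 1.0 */
    }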
There is little exploitation of these extensions on specific architectures because of the lack of appropriate compiler support. For example, Intel supply a C compiler that has low level extensions allowing the extended instructions to be used. Intel terms these extensions 'assembler intrinsics'. Syntactically these look like C functions, but they are translated one for one into equivalent assembler instructions. The use of assembler intrinsics simplifies the process of developing MMX code, in that programmers use a single tool, the C compiler, and do not need to concern themselves with low level linkage issues. However, the other disadvantages of assembler coding remain. The Intel C compiler also comes with a set of C++ classes that correspond to the fundamental types supported by the MMX and SIMD instruction sets. The SIMD classes do a good job of presenting the underlying capabilities of the architecture within the context of the C language, and the code produced is efficient. However, whilst the C++ code has a higher level of expression than assembler intrinsics, it is not portable to other processors. Similarly, SUN's Hotspot Java Optimiser is targeted at the old Intel 486 instructions rather than the newer MMX instructions available since the Pentium.
There are many disadvantages to these approaches. First of all, programmers must have deep knowledge both of low level architectural behaviour and of architecture specific compiler behaviour to integrate assembly language with high level code. Secondly, effective use of libraries depends on there being a close correspondence between the intended semantics of the application program and the semantics of the library routines. Finally, use of architecture specific libraries inhibits program portability across operating systems and CPU platforms.
There has been sustained research within the parallel programming community into the exploitation of SIMD parallelism on multi-processor architectures. Most work in this field has been driven by the needs of high-performance scientific processing, from finite element analysis to meteorology. In particular, there has been considerable interest in exploiting data parallelism in FORTRAN array processing, culminating in High Performance Fortran and F[12]. Typically this involves two approaches. First of all, operators may be overloaded to allow array-valued expressions, similar to APL. Secondly, loops may be analysed to establish where it is possible to unroll loop bodies for parallel evaluation. Compilers embodying these techniques tend to be architecture specific to maximise performance and they have been aimed primarily at specialised super-computer architectures, even though contemporary general purpose microprocessors provide similar features, albeit on a far smaller scale.
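The second approach can be pictured by the following purely illustrative transformation: the compiler rewrites a scalar loop so that each iteration handles a block of elements equal to the machine's vector width, and the fixed-length inner block is then mapped onto a single SIMD instruction such as PADDB.

    /* Scalar form of the loop as written by the programmer. */
    void add_scalar(unsigned char *a, unsigned char *b, unsigned char *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* The same loop after unrolling by a vector width of 8 bytes; the inner
       loop corresponds to one 64-bit SIMD addition, the final loop handles
       any left-over elements in scalar code.                                */
    void add_unrolled(unsigned char *a, unsigned char *b, unsigned char *c, int n)
    {
        int i;
        for (i = 0; i + 8 <= n; i += 8)
            for (int k = 0; k < 8; k++)
                c[i + k] = a[i + k] + b[i + k];
        for (; i < n; i++)
            c[i] = a[i] + b[i];
    }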
There has been recent interest in the application of vectorisation techniques to instruction level parallelism. Thus, Cheong and Lam [4] discuss the use of the Stanford University SUIF parallelising compiler to exploit the SUN VIS extensions for the UltraSparc from C programs. They report speedups of around 4 on byte integer parallel addition. Krall and Lelait's compiler [10] also exploits the VIS extensions on the Sun UltraSPARC processor from C using the CoSy compiler framework. They compare classic vectorisation techniques to unrolling, concluding that both are equally effective, and report speedups of 2.7 to 4.7. Sreraman and Govindarajan [14] exploit Intel MMX parallelism from C with SUIF, using a variety of vectorisation techniques to generate inline assembly language, achieving speedups from 2 to 6.5. All of these groups target specific architectures. Finally, Leupers [11] has reported a C compiler that uses vectorising optimisation techniques for compiling code for the multimedia instruction sets of some signal processors, but this is not generalised to the types of processors used in desktop computers.
In this chapter I look in detail at possible techniques for programming the MMX hardware. In order to illustrate the techniques I will use a sample program that adds two arrays of bytes, storing the result in a third. The arrays will be 6400 bytes long, and the operation will be repeated 100,000 times.
Timings for the program are given for a 233MHz Sony Vaio. The test program therefore requires 640,000,000 add operations.
The program is used purely for illustrative purposes; it is not realistic as a benchmark of general performance on the MMX.
Microsoft's MASM assembler supports the extended instruction set, as does the free assembler NASM. The latter has the advantage of running on both Linux and Windows, and provides support for MMX, 3DNOW! and SIMD instructions.
In the absence of better tools, direct coding of inner loops as assembler routines is the obvious course to pursue where run-time performance requirements demand it. The disadvantages of using assembler are well known, and they militate against its widespread use. The fact that the AMD extensions can only be programmed in assembler may have been a factor limiting their practical use.
SECTION .text
global _main
LEN     equ 6400
_main:  enter LEN*3,0
        mov   ebx,100000           ; perform test 100000 times for timing
l0:     mov   esi,0                ; set esi registers to index the elements
        mov   ecx,LEN/8            ; set up the count byte
l1:     movq  mm0,[esi+ebp-LEN]    ; load 8 bytes
        paddb mm0,[esi+ebp-2*LEN]  ; packed unsigned add bytes
        movq  [esi+ebp-3*LEN],mm0  ; store 8 byte result
        add   esi,8                ; inc dest pntr
        loop  l1                   ; repeat for the rest of the array
        dec   ebx
        jnz   l0
        mov   eax,0
        leave
        ret
#define LEN 6400
#define CNT 100000
main()
{   unsigned char v1[LEN],v2[LEN],v3[LEN];
    int i,j,t;
    for(i=0;i<CNT;i++)
        for (j=0;j<LEN;j++)
            v3[j]=v2[j]+v1[j];
}
The assembler version of the example program is shown in figure 2.1. It runs in 4.01 seconds on the test machine, a 233MHz Pentium II, a throughput of 160 million byte arithmetic operations per second.
The C version is shown in figure 2.2. Compiled with the Intel C compiler, it runs in 82 seconds on the test machine, a performance of around 8 million arithmetic operations per second. Thus the assembler version using MMX is about 20 times as fast as the C version.
An example is the intrinsic

__m128 _mm_add_ps(__m128 a, __m128 b)

which adds the four SP FP values of a and b.
Syntactically these look like C functions but they are translated one for one into equivalent assembler instructions. The use of assembler intrinsics simplifies the process of developing MMX code, in that programmers use a single tool - the C compiler, and do not need to concern themselves with low level linkage issues. However, the other disadvantages of assembler coding remain.
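For instance, a loop adding two arrays of single precision floats four elements at a time could be written with this intrinsic roughly as follows (a sketch which assumes the arrays are 16-byte aligned and that LEN is a multiple of four):

    #include <xmmintrin.h>

    #define LEN 6400

    /* v3 := v1 + v2; each call to _mm_add_ps compiles to one ADDPS instruction
       operating on four floats at a time.                                      */
    void vadd(const float *v1, const float *v2, float *v3)
    {
        for (int j = 0; j < LEN; j += 4) {
            __m128 a = _mm_load_ps(&v1[j]);
            __m128 b = _mm_load_ps(&v2[j]);
            _mm_store_ps(&v3[j], _mm_add_ps(a, b));
        }
    }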
#define LEN 800
#define CNT 100000
#include "ivec.h"
main()
{   Iu8vec8 v1[LEN],v2[LEN],v3[LEN];
    int i,j,t;
    for(i=0;i<CNT;i++)
        for (j=0;j<LEN;j++)
            v3[j]=v2[j]+v1[j];
}
Figure 2.3 shows the example program implemented in C++ using the Intel SIMD class Iu8vec8. The SIMD classes do a good job of presenting the underlying capabilities of the architecture within the context of the C language. The code produced is also efficient; the example program in C++ runs in 4.56 seconds on the test machine, a performance of 140 million byte operations per second. However, it has to be borne in mind that the C++ code is not portable to other processors. The compiler always generates MMX or SIMD instructions for the classes; if run on a 486 processor these would be illegal. The C++ code built around these classes, whilst it has a higher level of expression than assembler intrinsics, is no more portable.
program vecadd;
type byte=0..255;
var  v1,v2,v3:array[0..6399] of byte;
     i:integer;
begin
     for i:= 1 to 100000 do
         v3:=v1 + v2;
end.
Version | Time in seconds |
Assembler using MMX | 4.95 |
C++ using MMX classes | 5.00 |
High Performance Pascal to MMX instructionset | 9.88 |
High Performance Pascal to 486 instructionset | 75.41 |
C | 87.88 |
Thus if x, y are arrays of integers, k = x+y is the array of integers where k_i = x_i + y_i.
The basic concept is simple. There are complications to do with the semantics of operations between arrays of different lengths and different dimensions, but Iverson provides a consistent treatment of these.
The most recent languages to be built round this model are J, an interpretive language[9], and F[12] a modernised Fortran. In principle though any language with array types can be extended in a similar way. Unlike the SIMD classes provided by the Intel C++ compiler, the APL type extensions are machine independent. They can be implemented using scalar instructions or using the SIMD model. The only difference is speed.
I have chosen this model as the basis for my experiments in compiling high level language code to SIMD instruction sets. My objectives in these experiments were to:
For the initial experiment I have chosen the base 486 and the Pentium with MMX as the two instruction sets to be targeted. An extended version of Pascal with operators overloaded in the APL style is used as the source language. This is provisionally called High Performance Pascal.
The structure of the experimental system is shown in figure 2.5. It is complex. Consider first the path followed from a source file foo.pas, shown in the upper right of the diagram, to an executable file foo.exe, in the lower center of the diagram. The phases that it goes through are
The key components on this path are Hppc.class and Pentium.class. Hppc is a simple, hand-written, recursive descent parser; there is nothing particularly noteworthy about it. Pentium.class has a much more complicated genesis. It is produced from a file Pentium.ilc, written in a notation called ILCG, which gives a semantic description of the Pentium's instruction set. This is processed by the application ILCG, which builds the source file Pentium.java.
The application ILCG is built out of two main components: a collection of classes that comprise a parser for the language ILCG, and a class which walks parse trees for ILCG programs and outputs code generators. The parser classes are automatically generated from a definition of the grammar of ILCG using the parser generator Sable[15]. The walker WalkGen walks a parse tree for an ILCG file and builds a walker for semantic trees.
In what follows I provide a description of the notation ILCG, a description of the semantic trees taken as input to the code generators and then describe the strategy used in the generated code generators to translate these trees to assembler code.
Its purpose is to act as an input to two types of programs:
It is itself intended to be machine independent in that it abstracts from the instruction model of the machine on which it is being executed. However, it will assume certain things about the machine. It will assume that certain basic types are supported and that the machine is addressed at the byte level.
We further assume that all type conversions, dereferences etc have to be made absolutely explicit. Since the notation is intended primarily to be machine generated we are not particularly concerned with human readability, and thus use a prefix notation.
In what follows I will designate terminals of the language in bold, thus octet, and nonterminals in sloping font, thus word8.
When data is being explicitly operated on without regard to its type, we have terminals which stand for these formats: octet, halfword, word, doubleword, quadword.
We thus allow the following integer types as terminals in the language: int8, uint8, int16, uint16, int32, uint32, int64, uint64 to stand for signed and unsigned integers of the appropriate lengths.
The integers are logically grouped into signed and unsigned. As non terminal types they are represented as byte, short, integer, long and ubyte, ushort, uinteger, ulong.
Floating point numbers are assumed to be either 32 bit or 64 bit, and are given the nonterminal symbols float and double respectively. If we wish to specify a particular representation of floats or doubles we can use the terminals ieee32, ieee64.
The allowed conversions are summarised in the tables 3.1 and 3.2.
The syntax for the type conversions is C style, so we have for example (ieee64) int32 to represent a conversion of a 32-bit integer to a 64-bit real.
INPUT | OUTPUT |
byte | integer |
ubyte | integer |
short | integer |
ushort | integer |
uinteger | double |
integer | double |
float | double |
long | double |
INPUT | OUTPUT |
integer | byte |
integer | ubyte |
integer | short |
integer | ushort |
double | uinteger |
double | integer |
double | float |
double | long |
The syntax is prefix with bracketing. Thus the infix operation 3 + 5 ÷ 7 would be represented as +(3, div(5, 7)).
(ref int32)mem(100)
(ref octet)mem( 299) :=(int8) 99
The first argument is always a reference and the second argument a value of the appropriate format.
If the left hand side is a format the right hand side must be a value of the appropriate size. If the left hand side is an explicit type rather than a format, the right hand side must have the same type.
(octet)^((ref octet)mem(99))
The argument to the load operator must be a reference.
register word EBX assembles['ebx'] ;
would declare EBX to be of type ref word.
alias register octet AL = EAX(0:7) assembles['al'];
alias register octet BL = EBX(0:7) assembles['bl'];
to indicate that BL occupies the bottom 8 bits of register EBX. In this notation bit zero is taken to be the least significant bit of a value. There are assumed to be two pregiven registers FP, GP that are used by compilers to point to areas of memory. These can be aliased to a particular real register.
register word EBP assembles['ebp'] ;
alias register word FP = EBP(0:31) assembles ['ebp'];
Additional registers may be reserved, indicating that the code generator must not use them to hold temporary values:
reserved register word ESP assembles['esp'];
pattern reg means [ EBP| EBX|ESI|EDI|ECX |EAX|EDX|ESP] ;
pattern breg means[ AL|AH|BL|BH|CL|CH|DL|DH];
All registers in a register set should be of the same length.
register vector(8)doubleword MM assembles['MM'i] ;
This declares the symbol MM to stand for the entire MMX register set. It implicitly defines how the register names are to be printed in the assembly language by defining an indexing variable i that is used in the assembly language definition.
We also need a syntax for explicitly identifying individual registers in the set. This is done by using the dyadic subscript operator:
subscript(MM,2)
which would be of type ref doubleword.
The ilcg syntax allows register stacks to be declared:
register stack (8)ieee64 FP assembles[ ' '] ;
Two access operations are supported on stacks:
PUSH is a void dyadic operator taking a stack of type ref t as first argument and a value of type t as the second argument. Thus we might have:
PUSH(FP,mem(20))
POP is a monadic operator returning t on stacks of type t. So we might have
mem(20):=POP(FP)
An instruction format is an abstraction over a class of concrete instructions. It abstracts over particular operations and types thereof whilst specifying how arguments can be combined.
instruction pattern
RR( operator op, anyreg r1, anyreg r2, int t)
means[r1:=(t) op( ^((ref t) r1), ^((ref t) r2))]
assembles[op ' ' r1 ',' r2];
In the above example, we specify a register to register instruction format that uses the first register as both a source and a destination, whilst the second register is only a source. The result is returned in register r1.
We might however wish to have a more powerful abstraction, which is capable of taking more abstract specifications for its arguments. For example, many machines allow arguments to instructions to be addressing modes that can be either registers or memory references. For us to be able to specify this in an instruction format we need to be able to provide grammar non-terminals as arguments to the instruction formats.
For example we might want to be able to say
instruction pattern
RRM(operator op, reg r1, maddrmode rm, int t)
means [r1:=(t) op( ^((ref t)r1), ^((ref t) rm))]
assembles[op ' ' r1 ',' rm ] ;
This implies that maddrmode and reg must be non-terminals. Since the non-terminals required by different machines will vary, there must be a means of declaring such non-terminals in ILCG.
An example would be:
pattern regindirf(reg r)
means[^(r)] assembles[ r ];
pattern baseplusoffsetf(reg r, signed s)
means[+( ^(r), const s)] assembles[ r '+' s ];
pattern addrform means[baseplusoffsetf| regindirf];
pattern maddrmode(addrform f)
means[mem(f) ] assembles[ '[' f ']' ];
This gives us a way of including non terminals as parameters to patterns.
The implementing classes for Node are: Assign, Block, Cast, CharLit, Dyad, For, Format, Goto, If, Integer, Label, Location, Memref, Monad, Op, Real, Register, RegisterStack, Statement and Type.
These trees can either be directly constructed by a Java program, or can be loaded from a .tree file, which is basically a postfix flattened binary encoding of the tree. Such .tree files can be generated by front ends written in languages other than Java. Each class implementing the interface Node must have a method
toBinary(java.io.DataOutputStream out)
which saves a node in postfix binary notation on the output stream. A utility class ilcg.tree.Loader can be used to load postfix binary trees stored in files into memory.
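As a rough illustration of what a postfix flattened encoding looks like (the tag values and node layout below are hypothetical, and the sketch is in C rather than the Java of the ilcg.tree classes, which define the real format), a dyadic expression tree can be written out children first like this:

    #include <stdio.h>

    enum { TAG_INT = 1, TAG_DYAD = 2 };   /* hypothetical tag values */

    typedef struct Node {
        int tag;
        int value;                        /* used when tag == TAG_INT  */
        int op;                           /* used when tag == TAG_DYAD */
        struct Node *left, *right;
    } Node;

    /* Postfix flattening: operands are emitted before the operator that uses
       them, so a reader can rebuild the tree with a simple stack.            */
    static void toBinary(const Node *n, FILE *out)
    {
        if (n->tag == TAG_DYAD) {
            toBinary(n->left, out);
            toBinary(n->right, out);
            fputc(TAG_DYAD, out);
            fputc(n->op, out);
        } else {
            fputc(TAG_INT, out);
            fwrite(&n->value, sizeof n->value, 1, out);
        }
    }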
The base class ilcg.tree.Walker provides strategic methods to carry out
Finally, the machine generated walker classes can be further specialised by hand written classes which override the behaviour of the standard tree walker. Typically these specialisations will be to handle things like procedure call mechanisms which may be specific to a given processor and a given language.
The statement is passed to the instruction matcher, which fails, as there is no built in instruction capable of adding operands of type uint8 vector(6400).
java ilcg.ILCG src.ilc dest.java classname
It parses the .ilc file and for each pattern p in the file generates a recognising method. These methods take a single parameter n of type ilcg.tree.Node and return true if n matches p.
instruction pattern RMR(operator op, addrmode rm, immedreg r1, int t)
means[ (ref t) rm := op( ^((ref t) rm), (t) r1)]
assembles[op ' ' t ' ' rm ',' r1];
/** recognises
    ( ref t ) rm := op ( ^ ( ( ref t ) rm ) , ( t ) r1 ) */
public boolean pAAssign1126(Node n){
    dlog("( ref t ) rm := op ( ^ ( ( ref t ) rm ) , ( t ) r1 ) ",n);
    if(!(n instanceof Assign))return false;
    Assign a=(Assign)n;
    enterRhs();
    boolean rhsok= pADyadicValue1132(a.src);
    leaveRhs();
    if(!rhsok)return false;
    enterLhs();
    boolean lhsok=pARefcastRefval1127(a.dest);
    leaveLhs();
    return lhsok;
}
Figures 4.3 and 4.4 illustrate an instruction pattern and the recogniser method that is generated for it. It can be seen that the recogniser for the assignment statement invokes subsidiary recognisers for the left and right hand side of the assignment.
Where a pattern is defined as a set of alternatives, the alternatives are tried in the order in which they are listed in the .ilc file. The matching functions for alternative patterns are implemented as memo functions. That is to say, each such function has a hash table in which it stores the trees that it has successfully matched and which alternative they matched. This hash table is consulted whenever a new tree is encountered.
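The effect of this memoisation can be sketched in a few lines. The sketch below is in C rather than the generated Java, and the hashing scheme and names are illustrative only: the point is simply that a tree seen before is not re-matched against the alternatives.

    #include <stddef.h>

    typedef struct Node Node;             /* opaque semantic tree node */

    #define MEMO_SLOTS 1024

    /* One memo entry per slot: a tree already examined and the index of the
       alternative that matched it, or -1 if none matched.                    */
    static struct { const Node *tree; int alternative; } memo[MEMO_SLOTS];

    int matchAlternatives(const Node *n, int nAlternatives,
                          int (*try_alt)(int alt, const Node *tree))
    {
        size_t slot = ((size_t)n >> 3) % MEMO_SLOTS;  /* hash on the node address */
        if (memo[slot].tree == n)                     /* seen this tree already   */
            return memo[slot].alternative;
        int result = -1;
        for (int alt = 0; alt < nAlternatives; alt++) /* alternatives tried in    */
            if (try_alt(alt, n)) { result = alt; break; } /* their listed order   */
        memo[slot].tree = n;                          /* record the outcome       */
        memo[slot].alternative = result;
        return result;
    }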
/*
*/ Package ilcg; /*
*/ Helpers letter = [['A'..'Z']+['a'..'z']]; digit = ['0'..'9']; alphanum = [letter+['0'..'9']]; cr = 13; lf = 10; tab = 9; digit_sequence = digit+; fractional_constant = digit_sequence? '.' digit_sequence | digit_sequence '.'; sign = '+' | '-'; exponent_part = ('e' | 'E') sign? digit_sequence; floating_suffix = 'f' | 'F' | 'l' | 'L'; eol = cr lf | cr | lf; // This takes care of different platforms not_cr_lf = [[32..127] - [cr + lf]]; exponent = ('e'|'E'); quote = '''; all =[0..127]; schar = [all-''']; not_star = [all - '*']; not_star_slash = [not_star - '/']; /*
*/ Tokens floating_constant = fractional_constant exponent_part? floating_suffix? | digit_sequence exponent_part floating_suffix?; /*terminals specifying data formats
*/ void ='void'; octet = 'octet'; int8 = 'int8'; uint8 = 'uint8'; halfword = 'halfword'; int16 = 'int16' ; uint16 = 'uint16' ; word = 'word'; int32 = 'int32' ; uint32 = 'uint32' ; ieee32 = 'ieee32'; doubleword = 'doubleword'; int64 = 'int64' ; uint64 = 'uint64'; ieee64 = 'ieee64'; quadword = 'quadword'; /*terminals describing reserved words
*/ function= 'function'; flag = 'flag'; location = 'loc'; procedure='instruction'; returns ='returns'; label = 'label'; goto='goto'; for ='for'; to='to'; step='step'; do ='do'; ref='ref'; const='const'; reg= 'register'; operation = 'operation'; alias = 'alias'; instruction = 'instruction'; address = 'address'; vector ='vector'; stack = 'stack'; sideeffect='sideeffect'; if ='if'; reserved='reserved'; precondition ='precondition'; instructionset='instructionset'; /*terminals for describing new patterns
*/ pattern = 'pattern'; means = 'means'; assembles = 'assembles'; /*terminals specifying operators
*/ colon = ':'; semicolon= ';'; comma = ','; dot = '.' ; bra ='('; ket =')'; plus = '+'; satplus = '+:'; satminus = '-:'; map='->'; equals = '='; le = '<='; ge='>='; ne='<>'; lt='<'; gt='>'; minus = '-'; times = '*'; exponentiate = '**'; divide = 'div'; and = 'AND'; or = 'OR' ; xor = 'XOR'; not = 'NOT'; sin='SIN'; cos='COS'; abs='ABS'; tan='TAN'; remainder = 'MOD'; store = ':='; deref = '^'; push ='PUSH'; pop ='POP'; call='APPLY'; full='FULL'; empty='EMPTY'; subscript='SUBSCRIPT'; intlit = digit+; vbar = '|'; sket=']'; sbra='['; end='end'; typetoken='type'; mem='mem'; string = quote schar+ quote; /*
identifiers come after reserved words in the grammar
*/ identifier = letter alphanum*; blank = (' '|cr|lf|tab)+; comment = '/*' not_star* '*'+ (not_star_slash not_star* '*'+)* '/'; Ignored Tokens blank,comment; /*
*/ Productions program = statementlist instructionlist; instructionlist =instructionset sbra alternatives sket; /*non terminals specifying data formats
*/ format = {octet} octet| {halfword} halfword | {word} word | {doubleword} doubleword | {quadword} quadword|{tformat}tformat; /*non terminals corresponding to type descriptions
*/ reference = ref type ; array = vector bra number ket; aggregate={stack} stack bra number ket |{vector}array |{non}; type= {format} format | {typeid} typeid|{array}type array| {cartesian}sbra type cartesian* sket| {map}bra [arg]:type map [result]:type ket; cartesian = comma type; tformat = {signed} signed|{unsigned}unsigned|{ieee32}ieee32| {ieee63}ieee64; signed = int32 | {int8} int8 | {int16} int16 | {int64} int64; unsigned = uint32 | {uint8} uint8 | {uint16} uint16 | {uint64} uint64; /*non terminals corresponding to typed values
*/ value = {refval}refval |{rhs}rhs|{void}void|{cartval}cartval| {dyadic} dyadic bra [left]:value comma [right]:value ket| {monadic}monadic bra value ket; /*value corresponding to a cartesian product type e.g. record initialisers
*/ cartval=sbra value carttail* sket; carttail = comma value; /*conditions used in defining control structures
*/ condition={dyadic} dyadic bra[left]:condition comma[right]:condition ket| {monadic}monadic bra condition ket | {id}identifier| {number}number; rhs= {number}number| {cast}bra type ket value|{const}const identifier | {deref}deref bra refval ket; refval = loc| {refcast} bra reference ket loc; loc = {id}identifier|{memory}mem bra value ket ; /*predeclaredregister = {fp}fp|{gp}gp;*/ number = {reallit} optionalsign reallit| {integer} optionalsign intlit; optionalsign = |{plus}plus|{minus}minus; reallit= floating_constant; /*operators
*/ dyadic = {plus} plus| {minus} minus | {identifier} identifier|{exp}exponentiate| {times} times | {divide} divide| {lt}lt|{gt}gt|{call}call| {le}le|{ge}ge|{eq}equals|{ne}ne|{push}push|{subscript}subscript| {satplus}satplus|{satminus}satminus| {remainder} remainder|{or}or|{and}and|{xor}xor; monadic ={not}not|{full}full|{empty}empty|{pop}pop| {sin}sin|{cos}cos|{tan}tan|{abs}abs ; /*register declaration
*/ registerdecl = reservation reg aggregate format identifier assembles sbra string sket ; reservation = {reserved}reserved|{unreserved}; aliasdecl = alias reg aggregate format [child]:identifier equals [parent]:identifier bra [lowbit]:intlit colon [highbit]:intlit ket assembles sbra string sket; opdecl = operation identifier means operator assembles sbra string sket; operator = {plus}plus|{minus}minus|{times}times|{lt}lt|{gt}gt| {le}le|{ge}ge|{eq}equals|{ne}ne|{divide} divide| {remainder}remainder|{or}or|{and}and|{xor}xor; /*pattern declarations
*/ assign = refval store value ; meaning = {value}value|{assign}assign| {goto}goto value| {if}if bra value ket meaning| {for} for refval store [start]:value to [stop]:value step [increment]:value do meaning| {loc}location value; patterndecl = pattern identifier paramlist means sbra meaning sket assemblesto sideeffects precond |{alternatives} pattern identifier means sbra alternatives sket; paramlist = bra param paramtail* ket|{nullparam}bra ket; param = typeid identifier| {typeparam} typetoken identifier| {label}label identifier; typeid = identifier; paramtail = comma param; alternatives = type alts*; alts = vbar type; precond = precondition sbra condition sket|{unconditional}; asideeffect= sideeffect returnval; sideeffects = asideeffect*; assemblesto=assembles sbra assemblypattern sket; assemblypattern = assemblertoken*; assemblertoken = {string} string | {identifier} identifier; returnval = returns identifier; /*statements
*/ statement = {aliasdecl} aliasdecl| {registerdecl} registerdecl | {addressmode} address patterndecl| {instructionformat}procedure patterndecl| {opdecl}opdecl| {flag} flag identifier equals intlit| {typerename}typetoken format equals identifier| {patterndecl} patterndecl; statementlist = statement semicolon statements*; statements = statement semicolon; //
Basic ia32 processor description in ILCG, copyright (c) Paul Cockshott, University of Glasgow, Feb 2000
*/ type word=DWORD; type uint32=DWORD; type int32=DWORD; type ieee64=QWORD; type octet=BYTE; type int16=WORD; type int8=BYTE; type octet=BYTE; type uint8=BYTE; type halfword=WORD; /*
*/ flag realLitSupported = 0; /*
*/ register word EAX assembles['eax'] ; register word ECX assembles['ecx'] ; register word EBX assembles['ebx'] ; reserved register word EBP assembles['ebp'] ; alias register word FP = EBP(0:31) assembles ['ebp']; reserved register word ESP assembles['esp']; alias register word SP = ESP(0:31) assembles ['esp']; register word ESI assembles['esi'] ; register word EDI assembles['edi'] ; register word EDX assembles['edx']; alias register octet AL = EAX(0:7) assembles['al']; alias register octet BL = EBX(0:7) assembles['bl']; alias register octet CL = ECX(0:7) assembles['cl']; alias register octet DL = EDX(0:7) assembles['dl']; alias register octet AH = EAX(8:15) assembles['ah']; alias register octet BH = EBX(8:15) assembles['bh']; alias register octet CH = ECX(8:15) assembles['ch']; alias register octet DH = EDX(8:15) assembles['dh']; pattern reg means [ EBP| EBX|ESI|EDI|ECX |EAX|EDX|ESP] ; pattern areg(reg r) means[(ref int32)r] assembles[ r ]; pattern breg means[ AL|AH|BL|BH|CL|CH|DL|DH]; pattern abreg(breg b) means[(ref int8)b] assembles[b]; pattern anyreg means[ breg|reg]; pattern acc means[EAX]; pattern creg means [ECX]; pattern sreg means [ESI]; pattern dreg means [EDI]; pattern modreg means [EDX]; pattern bacc means [AL]; /*
*/ operation add means + assembles [ 'add']; /* */operation and means AND assembles[ 'and']; operation or means OR assembles['or']; operation xor means XOR assembles['xor'];/* */ operation sub means - assembles [ 'sub']; operation mul means * assembles ['mul']; operation imul means * assembles ['imul ']; operation lt means < assembles ['l']; operation gt means > assembles ['g']; operation eq means = assembles ['z']; operation le means <= assembles ['le']; operation ge means >= assembles ['ge']; operation ne means <> assembles ['nz']; pattern condition means[ne|ge|le|eq|gt|lt]; pattern operator means[add | sub|imul|and|or|xor]; pattern commuteop means[add|and|or|xor]; /*
*/ pattern unsigned means[uint32|uint8|uint16]; pattern signed means[int32 | int8 | int16 ]; pattern int means[signed|unsigned]; pattern double means[ieee64] ; pattern float means[ieee32]; pattern real means [ieee64|float]; pattern dataformat means[octet|word]; pattern longint means [int32|uint32]; pattern two(type t) means[2] assembles['2']; pattern four(type t) means[4] assembles['4']; pattern eight(type t) means[8] assembles['8']; pattern scale means[two|four|eight]; /**
Define the address forms used in lea instructions these differ from the address forms used in other instructions as the semantics includes no memory reference. Also of course register and immediate modes are not present.
*/ pattern regindirf(reg r) means[^(r) ] assembles[ r ]; pattern baseplusoffsetf(reg r, signed s) means[+( ^(r) ,const s)] assembles[ r '+' s ]; pattern scaledIndexPlusOffsetf(reg r1, scale s, signed offs) means[+(*(^(r1),s),const offs)] assembles[r1 '*' s '+' offs]; address pattern basePlusScaledIndexf(reg r1,reg r2,scale s) means[+(^(r1),*(^(r2),s))] assembles[ r1 '+' r2 '*' s ]; address pattern basePlusScaledIndexPlusOffsetf(reg r1,reg r2,scale s,signed off) means[+(+(^(r1),const off),*(^(r2),s))] assembles[ r1 '+' r2 '*' s '+'off ]; address pattern basePlusIndexPlusOffsetf(reg r1,reg r2,signed off) means[+(+(^(r2),const off),^(r1))] assembles[ r1 '+' r2 ' +'off ]; address pattern basePlusIndexf(reg r1,reg r2) means [+(^(r1),^(r2))] assembles[ r1 '+' r2 ]; pattern directf(unsigned s) means[const s] assembles[ s ]; pattern immediate(signed s) means[const s] assembles[ s ]; pattern labelf(label l) means [l] assembles[l]; /*
*/ pattern eaform means[directf | labelf| scaledIndexPlusOffsetf| basePlusScaledIndexPlusOffsetf| basePlusIndexPlusOffsetf| basePlusScaledIndexf| baseplusoffsetf | basePlusIndexf]; /*
*/ pattern addrform means[eaform| regindirf]; /**define the address patterns used in other instructions
*/ pattern maddrmode(addrform f) means[mem(f) ] assembles[ '[' f ']' ]; pattern addrmode means[maddrmode|anyreg]; pattern regderef(anyreg r) means[^(r)] assembles[r]; pattern derefmaddrmode (maddrmode a,type t) means[(t)^(a)] assembles [t ' ' a]; pattern immedmem means[immediate|derefmaddrmode]; pattern immedaddrmode means[immediate|derefmaddrmode|regderef]; pattern immedreg means[immediate|regderef]; /*
*/ instruction pattern STORE(addrmode rm, immedreg r1, type t) means[ (ref t) rm:=(t)r1 ] assembles['mov 't rm ',' r1]; instruction pattern NULMOV(reg r3) means[r3:=^(r3)] assembles[';nulmov ' r3 r3]; instruction pattern INC(addrmode rm,type t) means[(ref t)rm:= + (^((ref t)rm),1)] assembles['inc ' t ' ' rm]; instruction pattern DEC(addrmode rm,type t) means[(ref t)rm:= + (^((ref t)rm),-1)] assembles['dec ' t ' ' rm]; instruction pattern LOAD(immedmem rm, anyreg r1, type t) means[ r1:= ( t)rm ] assembles['mov ' r1 ',' ' ' rm]; instruction pattern MOVE(immedaddrmode rm, anyreg r1, type t) means[ r1:= ( t)rm ] assembles['mov ' r1 ',' ' ' rm]; instruction pattern MOVZXB(reg r1, addrmode rm) means[ r1:=(int32)^((ref octet)rm)] assembles['movzx ' r1 ', BYTE 'rm]; instruction pattern MOVZXW(reg r1, addrmode rm) means[ r1:=(int32)^((ref halfword)rm)] assembles['movzx ' r1 ', WORD 'rm]; instruction pattern LEA(reg r1, eaform ea) means [r1:=ea] assembles ['lea ' r1 ',[' ea ']' ]; instruction pattern Negate(reg r1,type t) means[r1:= -((t)0,^((ref t)r1))] assembles ['neg ' r1]; instruction pattern NEG(reg r1) means[r1:= *(^(r1),-1)] assembles['neg 'r1]; instruction pattern RMR( operator op,addrmode rm,immedreg r1,int t) means[ (ref t) rm :=op( ^((ref t) rm),( t) r1)] assembles[op ' ' t ' ' rm ',' r1]; /* commuting version of the rmr form */ instruction pattern ComRMR( commuteop op,addrmode rm,immedreg r1,int t) means[ (ref t) rm :=op( (t)r1,^((ref t) rm))] assembles[op ' ' t ' ' rm ',' r1]; instruction pattern RRM(operator op, anyreg r1, immedaddrmode rm, int t) means [r1:=(t) op( ^((ref t)r1),(t) rm)] assembles[op ' ' r1 ',' rm ] ; instruction pattern PLANT(label l) means[l] assembles[l ':']; instruction pattern PLANTRCONST( double r) means[loc r] assembles[ 'SECTION .data\n dq ' r '\n SECTION .text']; instruction pattern GOTO(label l) means[goto l] assembles['jmp ' l]; /*
*/ pattern dacc means[DH|DL]; instruction pattern ADDUSB(bacc r1,dacc r2 ) means[ r1 := +:(^(r1),^(r2))] assembles[' push ebx\n push ecx\n mov bl,al\n mov cl,' r2 '\n xor ch,ch\n xor bh,bh\n add bx,cx\n cmp cx,255' '\n jl $+5\n mov cl,255\n nop\n mov al,cl\n pop ecx\n pop ebx']; instruction pattern IDIV(reg r1,reg r2) means[r1:= div(^(r1),^( r2)) ] assembles['push edx\nxchg eax,' r1 '\ncdq\ndiv ' r2 '\npop edx\nxchg eax,' r1]; instruction pattern MOVSD(eaform i,creg cx, sreg si, dreg di,type t) means[for (ref t)mem(i) :=0 to ^(cx) step 1 do mem(+(^(di),*(^(mem(i)),4))):= (word)^(mem(+(^(si),*(^(mem(i)),4))))] assembles['cld \n mov ' t'['i'],ecx\n rep\n movsd']; instruction pattern MOVSB(eaform i,creg cx, sreg si, dreg di,type t) means[for (ref t)mem(i):=0 to ^(cx) step 1 do mem(+(^(di),^(mem(i)))):=(octet)^(mem(+(^(si),^(mem(i)))))] assembles['cld \n mov ' t'['i'],ecx\n rep\n movsb']; instruction pattern SETCC(addrmode d,reg r1,immedaddrmode r2,condition c) means[(ref octet)d:=(octet) c(^(r1),r2)] assembles['cmp ' r1 ',' r2 '\n set' c ' BYTE ' d]; instruction pattern IFGOTO(label l,reg r1,immedaddrmode r2,condition c) means[if(c(^(r1),^(r2)))goto l] assembles['cmp ' r1 ',' r2 '\n j' c ' NEAR ' l]; instruction pattern IFLITGOTO(label l,addrform a,signed r2,condition c,type t) means[if(c( ^((ref t) mem(a)),const r2))goto l] assembles['cmp ' t '[' a ']' ',' r2 '\n j' c ' NEAR ' l]; instruction pattern CALL(label l) means[APPLY( l,void)] assembles['call ' l]; instruction pattern IMOD( modreg r1,reg r2) means[r1:= MOD(^(r1),^( r2)) ] assembles['push eax\nmov eax,' r1 '\ncdq\ndiv ' r2 '\npop eax']; /**/
/*
*/ register stack(8)ieee64 ST assembles[ 'ST']; operation fdiv means div assembles['div']; pattern foperator means[add | sub|mul|fdiv]; instruction pattern FSTR(maddrmode a,real t) means[(ref t) a:= (t)POP(ST)] assembles['fstp 't a]; pattern fint means[int16|int32|int64]; instruction pattern FIOP(maddrmode a,fint t,foperator op) means[PUSH(ST,op(POP(ST),(ieee64)^((ref t)a)))] assembles['fi'op' 't a]; instruction pattern FIDIVR(maddrmode a,fint t) means[PUSH(ST,div((ieee64)^((ref t)a),POP(ST)))] assembles['fidivr 't a]; instruction pattern FLDONE() means[PUSH(ST,1.0)] assembles['fld1']; instruction pattern FLDZ() means[PUSH(ST,0.0)] assembles['fldz']; instruction pattern FLDPI() means[PUSH(ST,3.14159265358979)] assembles['fldpi']; instruction pattern FILD(maddrmode a,fint t) means[PUSH(ST,(ieee64)^((ref t)a))] assembles['fild 't a]; instruction pattern FISTP(maddrmode a,fint t) means[(ref t)a:=(t)POP(ST)] assembles ['fistp 't a]; instruction pattern FLD(maddrmode a, real t) means[PUSH(ST,(ieee64)^((ref t) a))] assembles['fld ' t a]; instruction pattern FNEG(real t) means[PUSH(ST, *((t)POP(ST),-1.0))] assembles['fchs ']; instruction pattern FOP(maddrmode a,foperator op) means[PUSH(ST,op(POP(ST),^((ref ieee64)a) )) ] assembles['f'op ' qword ' a]; instruction pattern ReversePolishFOP(foperator op) means[PUSH(ST,op(POP(ST),POP(ST)))] assembles['f'op'p' ' st1']; instruction pattern FRNDINT() means[PUSH(ST,(ieee64)(int64)POP(ST))] assembles['frndint']; instruction pattern FCOS(real t) means[PUSH(ST,COS((t)POP(ST)))] assembles['fcos']; instruction pattern FSIN(real t) means[PUSH(ST,SIN((t)POP(ST)))] assembles['fsin']; instruction pattern FTAN(real t) means[PUSH(ST,TAN((t)POP(ST)))] assembles['fptan']; instruction pattern FABS(real t) means[PUSH(ST,ABS((t)POP(ST)))] assembles['fabs']; instruction pattern FXM(real t) means[PUSH(ST,-((t)**(2,(t)POP(ST)),1))] assembles['f2xm1']; instruction pattern FSCALE() means[PUSH(ST,*(POP(ST),**(2,(int32)POP(ST))))] assembles['fscale']; /**/
/*
*/ register doubleword MM0 assembles[ 'MM0']; register doubleword MM1 assembles[ 'MM1']; register doubleword MM2 assembles[ 'MM2']; register doubleword MM3 assembles[ 'MM3']; register doubleword MM4 assembles[ 'MM4']; register doubleword MM5 assembles[ 'MM5']; register doubleword MM6 assembles[ 'MM6']; register doubleword MM7 assembles[ 'MM7']; pattern mreg means [MM1|MM3|MM4|MM5|MM7|MM6|MM2|MM0]; /*
*/ pattern mrmaddrmode means[maddrmode|mreg]; /*
*/ pattern byteint means[int8|uint8]; pattern wordint means[int16|uint16]; instruction pattern PMULLW(mreg m, mrmaddrmode ma) means[(ref int16 vector(4))m := (int16 vector(4))*((int16 vector(4))^(m),(int16 vector(4))^(ma))] assembles['pmullw ' m ',' ma]; instruction pattern PADDD(mreg m, mrmaddrmode ma) means[(ref int32 vector(2))m := (int32 vector(2))+((int32 vector(2))^(m),(int32 vector(2))^(ma))] assembles ['paddd 'm ',' ma]; instruction pattern PADDW(mreg m, mrmaddrmode ma) means[(ref int16 vector(4))m := (int16 vector(4))+((int16 vector(4))^(m),(int16 vector(4))^(ma))] assembles ['paddw 'm ',' ma]; instruction pattern PADDB(mreg m, mrmaddrmode ma) means[(ref int8 vector(8))m := (int8 vector(8))+((int8 vector(8))^(m),(int8 vector(8))^(ma))] assembles ['paddb 'm ',' ma]; instruction pattern PADDUB(mreg m, mrmaddrmode ma,byteint t) means[(ref uint8 vector(8))m := (uint8 vector(8))+((uint8 vector(8))^(m),(uint8 vector(8))^(ma))] assembles ['paddb 'm ',' ma]; instruction pattern PADDUSB(mreg m, mrmaddrmode ma) means[(ref int8 vector(8))m := (int8 vector(8))+:((int8 vector(8))^(m),(int8 vector(8))^(ma))] assembles ['paddusb 'm ',' ma]; instruction pattern PSUBUSB(mreg m, mrmaddrmode ma) means[(ref int8 vector(8))m := (int8 vector(8))-:((int8 vector(8))^(m),(int8 vector(8))^(ma))] assembles ['psubusb 'm ',' ma]; instruction pattern PADDUSW(mreg m, mrmaddrmode ma) means[(ref int16 vector(4))m := (int16 vector(4))+:((int16 vector(4))^(m),(int16 vector(4))^(ma))] assembles ['paddusw 'm ',' ma]; instruction pattern PSUBUSW(mreg m, mrmaddrmode ma) means[(ref int16 vector(4))m := (int16 vector(4))-:((int16 vector(4))^(m),(int16 vector(4))^(ma))] assembles ['psubusw 'm ',' ma]; instruction pattern PSUBD(mreg m, mrmaddrmode ma) means[(ref int32 vector(2))m := (int32 vector(2))-((int32 vector(2))^(m),(int32 vector(2))^(ma))] assembles ['psubd 'm ',' ma]; instruction pattern PSUBW(mreg m, mrmaddrmode ma) means[(ref int16 vector(4))m := (int16 vector(4))-((int16 vector(4))^(m),(int16 vector(4))^(ma))] assembles ['psubw 'm ',' ma]; instruction pattern PSUBB(mreg m, mrmaddrmode ma) means[(ref int8 vector(8))m := (int8 vector(8))-((int8 vector(8))^(m),(int8 vector(8))^(ma))] assembles ['psubb 'm ',' ma]; instruction pattern PSUBUB(mreg m, mrmaddrmode ma,byteint t) means[(ref uint8 vector(8))m := (uint8 vector(8))-((uint8 vector(8))^(m),(uint8 vector(8))^(ma))] assembles ['psubb 'm ',' ma]; instruction pattern PAND(mreg m, mrmaddrmode ma) means[m := AND(^(m),^(ma))] assembles ['pand 'm ',' ma]; instruction pattern POR(mreg m, mrmaddrmode ma) means[m := OR(^(m),^(ma))] assembles ['por 'm ',' ma]; instruction pattern MOVDS(addrmode rm, mreg m) means[(ref int32)rm:= (int32)^(m)] assembles['movd 'rm ','m]; instruction pattern MOVDL(addrmode rm, mreg m) means[m := (int64)^((ref int32)rm)] assembles['movd 'm ','rm]; instruction pattern MOVQS(maddrmode rm, mreg m,type t) means[(ref t)rm:= (t)^(m)] assembles['movq 'rm ','m]; instruction pattern MOVQL(addrmode rm, mreg m) means[m := (doubleword)^(rm)] assembles['movq ' m ',' rm]; instruction pattern CMPXCHG(acc a,reg src,addrmode dest) means[if(=(^(a),^(dest)))dest:=^(src)] assembles['cmpxchg dword' dest ',' src ]; /* \begin{verbatim}*/ instructionset [CMPXCHG|MOVSB|MOVSD|INC|DEC|NEG|Negate| LEA|LOAD|MOVZXB|MOVZXW|RRM|ADDUSB| SETCC|IDIV|IMOD|RMR|ComRMR|STORE|MOVE| PLANT|IFLITGOTO |IFGOTO|GOTO|CALL| PLANTRCONST|PADDD|PADDW|PADDB|PSUBD|PSUBW| PSUBB|PSUBUB|PADDUSB|PADDUB| PSUBUSB|PADDUSW|PSUBUSW|POR|PAND|PMULLW |MOVDS|MOVDL|MOVQS|MOVQL ] /**/