Considerations of Datatype Support in the Phoenix Hardware
[smh, wkf, jod -- 4 Oct 88]

= = = = = = = = = =

These are sketches for alternative datatype checking and dispatching
schemes for the Phoenix.  The purpose is to sketch what compiled and
runtime support code would look like so we can focus on the hardware
requirements of alternative schemes and on their various
speed/space/cache tradeoffs.  The items here are not all internally
consistent, not all completely worked out, and certainly not all
compatible.

= = = = =  Wish List  = = = = =

First, a wish list, which is an extraction of features used by various
of the datatyping schemes below.  This list really doesn't belong here,
because none of the wish items is useful or essential to all of the
schemes.  However, it is nice to have these hardware features collected
in one place.

(1) The call hardware return destination should support an open reg.
    This would allow some slight code-generation efficiency in some
    cases.  It probably isn't ever essential, however, if it would
    impose unreasonable hardware cost.

(2) Some sort of computed dispatch on datatype.  Dispatch on datatype
    is done all over the place in the runtime system, and it would be
    really neat to have some instructions that magically did the right
    thing.  There are several forms it could reasonably take, perhaps
    depending on other choices that are made.  [These should be spelled
    out here.]

(3) Reinstate tail calling using the frame-dirty bit.

(4) Additional functional sources.  More integers?  Anything else?

(5) A datatype-skip check which skips if the left and right datatypes
    are the same, regardless of what they are.

= = = = = = = = = =

Scheme: Datatype Comparison as an Alternative to Datatype Traps

This scheme flushes the 3-bit instruction datatype trap code (along
with the two bits of left and right tagness) and instead requires seven
bits in the instruction.  The right source would always be checked
against a 6-bit literal datatype, except for one "reserved" datatype
code (either #x00 or #x3f ??) which would imply no datatype checking.
The seventh bit would control whether the datatype check should also
apply to the left source.  Thus, there would be three possibilities:

(1) No datatype check.
(2) Check that the right source is equal to the given datatype.
(3) Check that the right and left sources are both equal to the given
    datatype.

Let us define the comparison to be either "successful" or
"unsuccessful."  Furthermore, "unsuccessful" in the fixnum case is
defined to include overflow, just as in the datatype trap scheme.  [We
still need to work out whether we actually need both fixnum overflow
and no-overflow cases.]

If an instruction's datatype comparison is successful, the PC skips the
next instruction.  This would seem to require some sort of "NEXT-PC+2"
path in the hardware.  A serious question is whether the pipeline
architecture permits the choice between NEXT-PC+1 and NEXT-PC+2 to be
made late enough.

The pseudo-Lisp fragment (SETQ A3 (+ A5 A9)) would generate the
following two-instruction sequence in pseudo assembly code:

        (ADD32 (A3 <- A5 A9) DT-FIXNUM DT-BOTH)
        (OPEN-CALL 2OP+-A5 (O0 <- A9))

Remember, if the 32-bit fixnum add succeeds -- i.e., the datatypes are
correct and no overflow occurs -- the second instruction is skipped
because the next-pc source is PC+2.  2OP+-A5 is one of a family of
(approximately) 17 entry points to 2OP+.
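[For concreteness, here is a minimal Common Lisp model of the intended
control flow of this two-instruction sequence.  FIXNUM-32-P and the
GENERIC-ADD argument are illustrative names only; they stand in for the
hardware DT-FIXNUM/overflow check and for the 2OP+ runtime routine, and
are not part of the proposed hardware.]

        (defun fixnum-32-p (x)
          (typep x '(signed-byte 32)))

        (defun add-with-dt-skip (left right generic-add)
          ;; Models (ADD32 (A3 <- A5 A9) DT-FIXNUM DT-BOTH) followed by
          ;; the OPEN-CALL: if both sources are 32-bit fixnums and the
          ;; sum does not overflow, the fallback call is "skipped"
          ;; (NEXT-PC+2); otherwise we fall into the generic call.
          (if (and (fixnum-32-p left) (fixnum-32-p right))
              (let ((sum (+ left right)))
                (if (fixnum-32-p sum)
                    sum                                ; successful: skip
                    (funcall generic-add left right))) ; overflow
              (funcall generic-add left right)))       ; wrong datatype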
Note that the single OPEN-CALL instruction can move either the left or
right source into a known place, i.e., O0, but without executing
another instruction it is impossible also to collect the other
arbitrary argument.  Therefore, we adopt the convention that the left
source to a generic call will always be moved to O0, and the right
source will be encoded by the particular entry point to the generic
routine.  There are at least 17 possible right sources: the sixteen
active-frame registers, plus the MD.  [There may be others.  In
particular, the sixteen open-frame registers, or some subset of them,
might be useful.  In any case, each entry point only costs one
instruction word.]

        ;; These depend on the call-hardware artifact that after the call the
        ;; called routine sees the caller's active frame as its open frame.
2OP+-MD         (JUMP 2OP+ (A0 <- MD))
2OP+-A15        (JUMP 2OP+ (A0 <- O15))
2OP+-A14        (JUMP 2OP+ (A0 <- O14))
2OP+-A13        (JUMP 2OP+ (A0 <- O13))
   ...             ...
2OP+-A6         (JUMP 2OP+ (A0 <- O6))
2OP+-A5         (JUMP 2OP+ (A0 <- O5))
   ...             ...
2OP+-A0         (A0 <- O0)
2OP+

[Question: Do unconditional jumps impose an extra cycle wait?]

The actual body of 2OP+ could be done in several ways.  The goal is to
make the important cases that are supported by hardware (i.e. floats,
especially immediate floats) run really fast.  This first possibility
for doing so depends in a curious way on a hardware feature that might
or might not be implementable.  As before, successful datatype
comparisons cause a PC skip:

                ;; At this point the two arguments are in A0 and A1.  They may be
                ;; any datatypes whatever, even two fixnums if called for overflow.
2OP+            (NOP <- A0 A1 DT-BOTH DT-SINGLE-FLOAT)
                (JUMP 2OP+-NOT-1FLOAT (NOP <- A0 A1) DT-DOUBLE-FLOAT)
                ... do single float addition ...
                (RETURN <- whatever)
2OP+-NOT-1FLOAT (JUMP 2OP+-NOT-2FLOAT (NOP <- A0 A1) DT-DOUBLE-FLOAT)
                ... do double float addition ...
                (RETURN <- whatever)
2OP+-NOT-2FLOAT [etc., but how to do mixed cases??]

[This sketch of the branch tree needs to be completed so we can tell
how long it takes for the bad cases.]

An alternative to using a complicated branch tree to handle all the
mixed cases would be to implement a hardware type dispatch.  This
potentially trades off execution cycles [how many] for
infrequently-executed instructions.  Postulate a next PC source called
NEXT-PC+1+RGHT-DT, which does just what it says.  Then the code for
2OP+ would be clearer and faster:

2OP+            (NOP <- A0 NEXT-PC+1+RGHT-DT)
                (JUMP 2OP-ILLEGAL-ARG)          ; 0 = nil
                (JUMP 2OP+FIXNUM)               ; 1 = fixnum
                (JUMP 2OP-ILLEGAL-ARG)          ; 2 = cons
                (JUMP 2OP-ILLEGAL-ARG)          ; 3 = symbol
                (JUMP 2OP+BIGNUM)               ; 4 = bignum
                (JUMP 2OP+SHORTF)               ; 5 = short-float
                (JUMP 2OP+SINGLEF)              ; 6 = single-float
                (JUMP 2OP+DOUBLEF)              ; 7 = double-float
                ...
                (JUMP 2OP-ILLEGAL-ARG)          ; 27 = select-method
                (JUMP 2OP-ILLEGAL-ARG)          ; 28 = unused dt ???
                (JUMP 2OP-ILLEGAL-ARG)          ; 29 = unused dt ???
                (JUMP 2OP-ILLEGAL-ARG)          ; 30 = unused dt ???
                (JUMP 2OP-ILLEGAL-ARG)          ; 31 = unused dt ???

2OP+FIXNUM      (NOP <- A1 NEXT-PC+1+RGHT-DT)
                (JUMP 2OP-ILLEGAL-ARG)          ; 0 = nil
                (JUMP 2OP+FIXNUM+FIXNUM)        ; 1 = fixnum
                (JUMP 2OP-ILLEGAL-ARG)          ; 2 = cons
                (JUMP 2OP-ILLEGAL-ARG)          ; 3 = symbol
                (JUMP 2OP+FIXNUM+BIGNUM)        ; 4 = bignum
                (JUMP 2OP+FIXNUM+SHORTF)        ; 5 = short-float
                (JUMP 2OP+FIXNUM+SINGLEF)       ; 6 = single-float
                (JUMP 2OP+FIXNUM+DOUBLEF)       ; 7 = double-float
                ...
                (JUMP 2OP-ILLEGAL-ARG)          ; 27 = select-method
                (JUMP 2OP-ILLEGAL-ARG)          ; 28 = unused dt ???
                (JUMP 2OP-ILLEGAL-ARG)          ; 29 = unused dt ???
                (JUMP 2OP-ILLEGAL-ARG)          ; 30 = unused dt ???
                (JUMP 2OP-ILLEGAL-ARG)          ; 31 = unused dt ???

2OP+BIGNUM      ...
                ;; ditto
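[A software rendering of the NEXT-PC+1+RGHT-DT dispatch may make the
tables above easier to read.  The Common Lisp below is only a model:
the vector representation and the INSTALL-2OP+/DISPATCH-2OP+ names are
assumptions of this sketch, not part of the hardware proposal, and the
real second-level tables would of course be jump instructions rather
than vectors of functions.]

        (defun 2op-illegal-arg (a0 a1)
          (error "2OP+ called with unsupported datatypes: ~S and ~S" a0 a1))

        ;; First-level table, indexed by the datatype code of A0.  Each entry
        ;; is either a handler of last resort or a second 32-entry table
        ;; indexed by the datatype code of A1, mirroring the two jump tables.
        (defparameter *2op+-level-1*
          (make-array 32 :initial-element #'2op-illegal-arg))

        (defun install-2op+ (a0-code a1-code handler)
          (let ((level-2 (aref *2op+-level-1* a0-code)))
            (unless (vectorp level-2)
              (setf level-2 (make-array 32 :initial-element #'2op-illegal-arg)
                    (aref *2op+-level-1* a0-code) level-2))
            (setf (aref level-2 a1-code) handler)))

        (defun dispatch-2op+ (a0-code a1-code a0 a1)
          ;; NEXT-PC+1+RGHT-DT on A0's type code, then on A1's.
          (let ((entry (aref *2op+-level-1* a0-code)))
            (if (vectorp entry)
                (funcall (aref entry a1-code) a0 a1)
                (funcall entry a0 a1))))

For example, (INSTALL-2OP+ 1 1 #'+) followed by (DISPATCH-2OP+ 1 1 3 4)
returns 7, taking the fixnum/fixnum path.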
It would be nice if the first instruction of each of these inner
dispatches could be merged into the above 32 JUMP instructions.  This
would require the next PC computation to add the type code to the
immediate JUMP address rather than to PC+1.  The code would then look
like this:

2OP+            (NOP <- A0 NEXT-PC+1+RGHT-DT)
                (JUMP 2OP-ILLEGAL-ARG (NOP <- A1) NEXT-PC-JMP+RGHT-DT)  ; 0
                (JUMP 2OP+FIXNUM (NOP <- A1) NEXT-PC-JMP+RGHT-DT)       ; 1
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 2
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 3
                (JUMP 2OP+BIGNUM (NOP <- A1) NEXT-PC-JMP+RGHT-DT)       ; 4
                (JUMP 2OP+SHORTF (NOP <- A1) NEXT-PC-JMP+RGHT-DT)       ; 5
                (JUMP 2OP+SINGLEF (NOP <- A1) NEXT-PC-JMP+RGHT-DT)      ; 6
                (JUMP 2OP+DOUBLEF (NOP <- A1) NEXT-PC-JMP+RGHT-DT)      ; 7
                ...
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 27
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 28
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 29
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 30
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 31

2OP+FIXNUM      (JUMP 2OP-ILLEGAL-ARG)          ; 0 = nil
                (JUMP 2OP+FIXNUM+FIXNUM)        ; 1 = fixnum
                (JUMP 2OP-ILLEGAL-ARG)          ; 2 = cons
                (JUMP 2OP-ILLEGAL-ARG)          ; 3 = symbol
                (JUMP 2OP+FIXNUM+BIGNUM)        ; 4 = bignum
                (JUMP 2OP+FIXNUM+SHORTF)        ; 5 = short-float
                (JUMP 2OP+FIXNUM+SINGLEF)       ; 6 = single-float
                (JUMP 2OP+FIXNUM+DOUBLEF)       ; 7 = double-float
                ...
                (JUMP 2OP-ILLEGAL-ARG)          ; 27 = select-method
                (JUMP 2OP-ILLEGAL-ARG)          ; 28 = unused dt ???
                (JUMP 2OP-ILLEGAL-ARG)          ; 29 = unused dt ???
                (JUMP 2OP-ILLEGAL-ARG)          ; 30 = unused dt ???
                (JUMP 2OP-ILLEGAL-ARG)          ; 31 = unused dt ???

For cycle counting purposes, here is a sketch for one of the
interesting (i.e. fast) cases, that of adding two SHORT-FLOATs:

2OP+SHORTF-SHORTF (FLOAT

[Note: The ability to skip on DT comparison provides a very efficient
implementation of the primitive datatype predicates, e.g. CONSP.]

[Note: It is unclear whether the dt dispatch should be defined only
over the 32 5-bit user-visible datatypes, or over the full 64 6-bit
datatypes.  The cost of the latter is making all these tables
everywhere in the system twice as big, which would only lower page (not
cache) efficiency.  However, this feature could reasonably be used by
the compiler to compile TYPECASE efficiently, in which case user code
could contain lots of these tables, and therefore it would be nicer if
they were small.]

[Still need some cycle counts for these variations.]

... Lots more to fill in here ...

= = = = = = = = = =

Scheme: Code Modification to Cache Datatype

This scheme exploits certain statistical properties of the datatypes
seen by a particular instance of a generic operation in Lisp code.  For
instance, consider a particular instance of the binary arithmetic
addition operator compiled inside a certain Lisp function.  Most often
the datatypes of the two arguments on a particular invocation will be
the same as on the last invocation.  This statistical correlation is
more likely to hold in code that is executed with high relative
frequency (e.g. code inside a loop) than in code executed relatively
infrequently.  Fortunately, it is mostly code executed with high
frequency that dominates run time and therefore matters most.

The basic idea is to make compiled code for generic operations
self-modifying such that it is always optimized for the case that the
argument datatypes are the same as for the previous invocation.  There
are a number of different ways this could be worked out, making various
demands on the hardware and having various cost/benefit effects on
speed and code density.  The variations are explored below.

[The statistical behavior of argument datatypes is part of Lisp
folklore, and additionally can be inferred from hand analysis of, for
example, benchmark code.  However, we have never undertaken any
systematic gathering of statistics.  Without doing so we cannot
develop, much less prove, hard measurements of the anticipated benefit
of this scheme.  While processors and compilers should not be designed
entirely intuitively, intuition may be the best we can do in the short
term.]
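[As a preview of those variations, here is a Common Lisp analogue of
the basic mechanism: an inline cache in which a one-element list stands
in for the patchable address field of the CALL instruction.  The
handler names anticipate the first variation sketched below; everything
else about the representation is an assumption of this model, not of
the hardware.]

        (defun 2op+-short-float (site a b)
          ;; Specialized handler: re-check the cached assumption; on a
          ;; miss, fall back to the generic routine, which re-patches
          ;; the call site.
          (if (and (typep a 'short-float) (typep b 'short-float))
              (+ a b)
              (2op+-type-failure site a b)))

        (defun 2op+-fixnum (site a b)
          (if (and (typep a 'fixnum) (typep b 'fixnum))
              (+ a b)
              (2op+-type-failure site a b)))

        (defun 2op+-type-failure (site a b)
          ;; Full dispatch: choose the handler for the datatypes
          ;; actually seen, patch the call site so that the next
          ;; invocation goes there directly, then do the operation.
          (let ((handler
                  (cond ((and (typep a 'short-float) (typep b 'short-float))
                         #'2op+-short-float)
                        ((and (typep a 'fixnum) (typep b 'fixnum))
                         #'2op+-fixnum)
                        (t (error "unsupported argument datatypes: ~S ~S" a b)))))
            (setf (car site) handler)
            (funcall handler site a b)))

A compiled reference to (+ A5 A9) then behaves like
(FUNCALL (CAR SITE) SITE A5 A9), where SITE was initialized to
(LIST #'2OP+-TYPE-FAILURE).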
For purposes of comparison, the current Falcon scheme would be to
compile at least addition, subtraction, and the simple bitwise booleans
as inline fixnum operations.  If the argument types are not both
fixnum, or if the result overflows a fixnum, a datatype trap is always
taken.  This is very fast for the fixnum case, since it requires only a
single cycle, and it achieves the highest possible code density, i.e.
only a single instruction word per operation.  However, requiring a
trap for all other datatype cases unconditionally imposes a severe
overhead.  In particular, entry to and exit from the trap handler costs
at *least* [handwaving!!!] 20 cycles, even if trap-handler dispatching
can be done with great efficiency.  Remember that the trap handler may
have to do significant work to extract the data arguments.  This
immediately swamps the time to do hardware-supported operations such as
short- and single-float arithmetic, which a floating coprocessor can do
in perhaps 4 [??] cycles.  Since the trap overhead makes the
coprocessor speed nearly irrelevant, it is clear that trap handling is
the bottleneck limiting machine speed for non-fixnum arithmetic.

=== The following not yet comprehensible ...

The first variation is simplest.  All generic operations would be
encoded as a single function call, with the identity of the called
routine changing.  The called routine would be responsible for checking
that the datatypes are indeed valid and, if they are not, for calling
some generic routine which changes the address in the original call
instruction in addition to doing the operation.  Here's a sketch, again
for (SETQ A3 (+ A5 A9)), assuming the previous execution had been
passed two short-float data:

        (OPEN O0 <- A5)
        (CALL 2OP+-SHORT-FLOAT A3 (O1 <- A9))

The called code would look something like this:

2OP+-SHORT-FLOAT (NOP <- A0 A1 DT-BOTH DT-SHORT-FLOAT)
                 (JUMP 2OP+-TYPE-FAILURE)
                 ... do short float addition ...
                 (RETURN <- whatever)

[Need cycle count for the success case.]

        ;; This is the common code used by all the 2OP+ handlers when the
        ;; arg datatypes are not the ones expected, i.e., the same as
        ;; appeared in the previous call.
2OP+-TYPE-FAILURE (NOP <- A0 NEXT-PC+1+RGHT-DT)
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 0
                (JUMP 2OP+FIXNUM (NOP <- A1) NEXT-PC-JMP+RGHT-DT)       ; 1
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 2
                (JUMP 2OP-ILLEGAL-ARG)                                  ; 3
                (JUMP 2OP+BIGNUM (NOP <- A1) NEXT-PC-JMP+RGHT-DT)       ; 4
                (JUMP 2OP+SHORTF (NOP <- A1) NEXT-PC-JMP+RGHT-DT)       ; 5
                (JUMP 2OP+SINGLEF (NOP <- A1) NEXT-PC-JMP+RGHT-DT)      ; 6
                (JUMP 2OP+DOUBLEF (NOP <- A1) NEXT-PC-JMP+RGHT-DT)      ; 7
                ...

2OP+SHORTF      (JUMP 2OP-ILLEGAL-ARG)          ; 0 = nil
                (JUMP 2OP+SHORTF+FIXNUM)        ; 1 = fixnum
                (JUMP 2OP-ILLEGAL-ARG)          ; 2 = cons
                (JUMP 2OP-ILLEGAL-ARG)          ; 3 = symbol
                (JUMP 2OP+SHORTF+BIGNUM)        ; 4 = bignum
                (JUMP 2OP+SHORTF+SHORTF)        ; 5 = short-float
                (JUMP 2OP+SHORTF+SINGLEF)       ; 6 = single-float
                (JUMP 2OP+SHORTF+DOUBLEF)       ; 7 = double-float
                ...

#|
Combining this with the skip-if-dt-ok idea from above, it is possible
to remove an instruction:  [or maybe it isn't]

        (OPEN O0 <- A5)
        (CALL 2OP+-SHORT-FLOAT NEXT-PC+1+RGHT-DT [HUH??]
|#
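[To put some rough numbers on the trap-overhead argument above, the
comparison reduces to a one-line expected-cost formula.  The figures
below simply reuse the hand-waved numbers from the text (about 20
cycles of trap overhead and about 4 cycles for a coprocessor float
add); the 95% hit rate and the cached-call cycle counts are pure
guesses pending real statistics.]

        (defun expected-cycles (hit-fraction hit-cycles miss-cycles)
          ;; Average cost per operation, given the fraction of
          ;; invocations that take the fast path.
          (+ (* hit-fraction hit-cycles)
             (* (- 1 hit-fraction) miss-cycles)))

        ;; Falcon trap scheme on all-short-float data: every add misses
        ;; the inline fixnum path and pays the trap, roughly 20 + 4
        ;; cycles each time.
        (expected-cycles 0.0 1 24)              ; => 24.0

        ;; Cached-call scheme on the same data, assuming 95% of
        ;; invocations see the same datatypes as the previous one, a
        ;; hypothetical 6-cycle hit path, and a 30-cycle re-patching
        ;; miss path.
        (expected-cycles 0.95 6 30)             ; => 7.2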