;-*- Mode:Text -*-

Set associative means cache is addressed by part of access address.  There
is one comparator per set, and probably one chip-depth of memory.  Each set
is addressed by the access address; the tag field in the addressed location
is compared with the rest of the address bits, and if the tag is valid and
matches, the cache data field is returned for a "hit".  Only one of the
sets can possibly contain a valid match, so multiple sets work without
arbitration.  The "hit" out from each is or'd to give the "ready" response
to the instruction stream.

For a "miss", the instruction stream is frozen while the cache is filled.

1. Fill whole block of 2^N words, then restart original access.  Simplest
   logic; slowest access to desired word, fastest access to adjacent words.

2. Read desired word; return "ready" as data is present, then fill in rest
   of block.  Processor is not delayed for rest of block if words are read
   in ascending order, but timing is complicated in ensuring that "ready"
   can be returned for a just-completing cache-fill cycle even if cache
   does not contain the data yet.

Optimization: start cache-fill at desired word, stop at block-size
boundary, and have separate valid bit for each word in block.  Is fastest
but hairiest.  Suggests that a cross with (1) is better: fill in whole
block, but return "ready" as desired word is written.

Note that RAM with common I/O causes "hit" hardware to compute a valid
"hit" response on data being written into the cache; suggests that fast
ready response is cheap.  Following words continue to be filled in; cache
is busy being filled as next processor fetch appears, but the same
incoming-hit response works for this too.  Is easier if cache is synced
with machine cycle.

Tentative machine cycle is 70ns.  256K DRAM access is 120, cycle 230,
nybble-burst 60.  1M is 100, 200, 50.  If synced with machine cycle,
access is 140, cycle 280, nybble 70.  Easier said than done; cache cycle
is skewed from machine clock.
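The lookup path described above can be sketched in software.  This is a
behavioral sketch only, not the hardware; the set count and index width
are assumptions for illustration, and the names (CacheSet, lookup) are
made up:

```python
# Behavioral sketch of set-associative lookup: each set is one RAM
# indexed by the low address bits, holding (valid, tag, data).  One
# comparator per set; the per-set hit signals are OR'd into "ready".
# NUM_SETS and INDEX_BITS are illustrative assumptions.

NUM_SETS = 2        # two-way set associative (assumption)
INDEX_BITS = 10     # 1K lines per set (assumption)

class CacheSet:
    def __init__(self):
        self.lines = [(False, 0, None)] * (1 << INDEX_BITS)

    def probe(self, addr):
        index = addr & ((1 << INDEX_BITS) - 1)
        tag = addr >> INDEX_BITS
        valid, stored_tag, data = self.lines[index]
        hit = valid and stored_tag == tag   # one comparator per set
        return hit, data

    def fill(self, addr, data):
        index = addr & ((1 << INDEX_BITS) - 1)
        self.lines[index] = (True, addr >> INDEX_BITS, data)

def lookup(sets, addr):
    # At most one set can hold a valid match, so the hit signals need
    # no arbitration: OR them together for the "ready" response.
    results = [s.probe(addr) for s in sets]
    ready = any(hit for hit, _ in results)
    data = next((d for hit, d in results if hit), None)
    return ready, data
```

The OR in lookup() is the software analogue of wire-OR'ing the per-set
hit outputs.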
Synced time for 4-word load is 140 + 3*70 = 350, not counting precharge
of 140 (total 490).
Max time for 4-word load is 120 + 3*60 = 300, precharge 110 (total 410).
  (350 / 420)
For 1M is 100 + 3*50 = 250, precharge 100 (total 350).
Overall synced 1M is 280; with precharge 350 (420?).
  (280 / 350; probably 280 / 420)
128-bit parallel 1M (1M x 16 bytes = 16MB) cache fill in 100, with
precharge 200.  (140 / 210)

Sequential cache-miss-and-fill: first takes 140; rest take 70, precharge
70.  If processor wants successive words as they are filled in, it gets
them with no delay at all.  But if the block size is one word, the cache
becomes fragmented and the advantage of the burst mode is lost.  So it is
better to stick with the 4-word blocks and freeze the processor to load
all four.  Hmmm.  Easiest thing to do is freeze the machine with the
current address asserted; then no cache addr mux is required.  Can do
50ns this way.

Note that all times include the 70ns that is allowed anyway.

(1) Freeze processor until whole block is filled in.  Saves cache addr
    mux time.

    Cache miss:
       70   normal cache cycle; detect miss
      140   first word access time
    3 * 70  load rest of 4-word burst
       70   repeat normal cache cycle to get desired data
      ===
      490   access time for miss; DRAM is always idle.

(2) Release processor as soon as desired word is filled in.  Requires
    cache addr mux to finish cache-fill if next addr from processor is
    not the expected addr; requires addr comparator to freeze processor
    if addr mux inputs don't match.  Max benefit ...

    Cache miss from idle DRAM:
       70   normal cache cycle on first word; detect miss
      140   first word access time
    N * 70  load rest of 4-word burst, but release processor as soon as
            desired word is read.
      ===
    210 - 420  access time for miss, from idle DRAM.

    For in-line code, miss is on first word = 210ns.  For jump, avg. is
    315ns.
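The miss-time ranges for the two schemes can be checked mechanically.  A
small sketch using the numbers from these notes (70ns machine cycle,
140ns synced first-word access, 4-word blocks); the function names are
made up:

```python
CYCLE = 70          # machine cycle, ns
FIRST_WORD = 140    # synced DRAM first-word access, ns
BLOCK = 4           # words per cache block

def miss_time_freeze():
    # Scheme (1): detect miss, first access, rest of burst, replay hit.
    return CYCLE + FIRST_WORD + (BLOCK - 1) * CYCLE + CYCLE

def miss_time_release(n):
    # Scheme (2), idle DRAM: release as soon as word n (0-based within
    # the burst) is filled in.
    return CYCLE + FIRST_WORD + n * CYCLE

print(miss_time_freeze())                                   # 490
print(miss_time_release(0), miss_time_release(BLOCK - 1))   # 210 420
# Jump average, desired word equally likely anywhere in the block:
print(sum(miss_time_release(n) for n in range(BLOCK)) / BLOCK)  # 315.0
```

This reproduces the 490ns fixed cost of (1), the 210-420ns range of (2),
and the 315ns jump average quoted above.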
    Cache miss on word that is being filled in:
       70   access time for next word of burst (free)
      ===

    Cache miss from busy DRAM:
    N * 70  load rest of 4-word burst
      140   precharge, overlaps 70ns cache cycle to detect miss
      140   access first word
    N * 70  load rest of 4-word burst, up to desired word
      ===
    280 - 700

    Random jump: average, 490ns.  In-line miss: best case, 280ns.

Method (1) has fixed miss time of 490ns for jumps and in-line misses.
Method (2) has miss time of 490ns for jumps, 210-280 for in-line misses.
Most misses are (?) for jumps, so the in-line code advantage of (2) is
probably not worth the extra hardware.

;;;;;;;;;;;;;;;;

5/4 current plan:

Load Icache in 32-bit accesses:
       70   detect cache miss
      140   first word access
    7 * 70  rest of 8 32-bit words
       70   cache hit
      ===
      770ns cache miss

    Averaged over 4 instructions: 192ns / instruction (instead of 70).

Load Icache in 64-bit accesses:
       70   detect cache miss
      140   first word access
    3 * 70  rest of 4 64-bit words
       70   cache hit
      ===
      490ns cache miss

    Averaged over 4 instructions: 122ns / instruction.

;;;;;;;;;;;;;;;;

Two ways of doing Icache:

15ns tag ram / 25ns data ram: 55ns total
    1K x 10 x 2 sep I/O 15ns cache tag ram
    4K x 64 x 2 common I/O 25ns cache data ram
    Saves 16 '374s plus pins on cache data ram;
    requires hairy Cypress ram chips.

25ns tag ram / 25ns data ram: 50ns total
    1K x 10 x 2 sep I/O 25ns cache tag ram
    4K x 64 x 2 sep I/O 25ns cache data ram
    Requires 16 '374s plus I/O pins on cache data;
    still requires 25ns sep I/O chips.

;;;;;;;;;;;;;;;;

Other Icache issues:

Parity: compute parity on write, check parity on read.  Timing doesn't
matter -- signal error during next instruction.

Run from ROM: anything better than 299's?  Shift 64 bits into parallel
outputs, then clock into IR.

Accessed bit per page; spy path.

Do separate input/output 299-style shift registers exist?

;;;;;;;;;;;;;;;;

Disabling cache?  Run just from IR, or require at least cache data to
work?  IR ...

;;;;;;;;;;;;;;;;

Cache block size of 4 words ok?
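The 32-bit versus 64-bit fill comparison in the 5/4 plan is a
straightforward tally; a sketch with the same numbers (the per-instruction
figures in the notes are the totals divided by 4 and truncated -- 770/4
is really 192.5):

```python
CYCLE = 70          # machine cycle, ns
FIRST_WORD = 140    # first word access, ns

def icache_miss(accesses_per_block):
    # detect miss + first access + rest of burst + replayed cache hit
    return CYCLE + FIRST_WORD + (accesses_per_block - 1) * CYCLE + CYCLE

fill_32 = icache_miss(8)   # eight 32-bit accesses per block
fill_64 = icache_miss(4)   # four 64-bit accesses per block
print(fill_32, fill_32 // 4)   # 770 192  (192.5ns, truncated)
print(fill_64, fill_64 // 4)   # 490 122  (122.5ns, truncated)
```

Halving the access count by doubling the width saves 4 * 70 = 280ns per
miss, which is where the 770 vs 490 difference comes from.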
Probably good for Icache; maybe 2 for Dcache?  Cons = 2 words, even
aligned symbols = 5 words ...

;;;;;;;;;;;;;;;;

Data cache: complicated by cache-before-maps.

Simplest: can't write valid cache entry unless data memory is also
written, so don't WRITE directly into cache on a write.  DO invalidate
the cache entry.  Which cache entry?  Detect a hit, and if HIT and
WRITE, rewrite cache to invalidate.

Better: only invalidate cache on write if the maps show that the real
word can't be written.  Assumes that rest of cache block is already
correct.  Good, because it prevents continuous faults on blocks that are
read and written a lot.

;;;;;;;;;;;;;;;;
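The two Dcache write policies can be sketched side by side.  Behavioral
sketch only; the names are made up, and the map check is modeled as a
single writable bit passed in by the caller:

```python
# "Simplest": any write that hits the cache invalidates the entry, since
# the cache (being before the maps) can't know the write will reach
# memory.  "Better": if the maps say the word is writable, update the
# cached copy in place and keep it valid; invalidate only when the maps
# forbid the write.

class DCacheLine:
    def __init__(self):
        self.valid = False
        self.tag = None
        self.data = None

def write_simplest(line, tag, word):
    if line.valid and line.tag == tag:      # HIT and WRITE
        line.valid = False                  # rewrite cache to invalidate

def write_better(line, tag, word, map_writable):
    if line.valid and line.tag == tag:
        if map_writable:
            line.data = word                # keep entry valid
        else:
            line.valid = False              # fault path: invalidate
```

The second policy is what prevents the continuous faults on blocks that
are read and written a lot: a writable block never loses its cache entry.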