The Prefetch Queue


The 80386 is documented as having a 16-byte prefetch queue. At one time, it did, but due to a bug in the pipelining architecture, Intel had to abandon the 16-byte queue, and only use a 12-byte queue. The change occurred (I believe) between the D0, and D1 step of the '386. The '386SX wasn't affected by the bug, and therefore hasn't changed.

Determining the size of the prefetch queue is no trivial task. The theory is simple, but the reality is different. Theoretically, one could write an algorithm which employs self-modifying code to determine the depth of the queue. The algorithm would execute code that writes exactly "n" bytes beyond the beginning of the prefetch queue. Once the modified code gets executed, then the prefetch queue size has been determined. This is not as easy in practice as it is in principal. There are many aspects of the internal architecture that make the task very difficult:

  1. The Bus Interface Unit (BIU) only prefetches code when it is has no other bus cycles pending. Memory and I/O cycles take a higher priority than prefetches.
  2. The Decode Unit (DU) can hold up to three fully decoded instructions. It decodes them at the rate of one byte per CPU clock.
  3. The Prefetch Unit (PU) and the DU must be fully loaded and no bus cycles pending. The self-modifying instruction should be at the head or the middle of the DU.
  4. The modifying instruction should lock the BUS to inhibit any BUS activity (if some should occur).
  5. Synchronizing the starting of the algorithm with refresh.

What we need to meet the criteria of step 3 is a method to fill the PU and DU and ensure that no BUS cycles are pending. To do that, we need an instruction that takes longer to execute than a prefetch. This will guarantee that any prefetch will be absorbed in the execution time of the instruction. Next, that instruction must not perform any bus cycles. If it performs any bus cycles, then they take priority over prefetches, and we can't guarantee the state of the PU or DU. BSF or BSR are ideal instructions for this purpose. Both instructions take longer to execute than a prefetch, and neither requires any BUS activity. To satisfy step 4, OR, XOR are probably best. Any Read/Modify/Write instruction implicitly locks the BUS during execution. This will guarantee that one last fetch doesn't come through while the code is being modified. Finally, synchronizing the beginning of the algorithm with refresh will guarantee that the first few prefetches won't be interrupted, thereby throwing the algorithm "out of sync."

Making any single algorithm work in all cases and CPU speeds is very difficult. In a period of six months, I have spent over 100 hours (in attempts at) perfecting a general-purpose algorithm. Just when I thought I understood it, I found a different revision of the CPU at a speed I hadn't tried, that gave me trouble. The culmination of this effort has led to my current understanding of the CPU prefetch and decode units: Both the prefetch unit, and decode unit have a "low-water" point. Only when each unit gets to this point is a refill triggered. The file PREFCHSZ.ASM has an algorithm that works on '386's with CPU ID of 303, 305, and 308. The algorithm assumes that the decode unit requests instructions from the prefetch unit ONLY after it has exhausted two slots; the prefetch unit ONLY requests more data after it has exhausted all but four bytes. Therefore the assumptions are that the decode unit has a 1-instruction "low-water" point, and the prefetch unit has a 4-byte "low-water" point.

View and download source code:
ftp://ftp.x86.org/pub/x86/source/prefetch/prefchsz.asm
ftp://ftp.x86.org/pub/x86/dloads/PREFCHSZ.ZIP


Back to secrets and bugs