Yes, tech talk for once! Been a while since I posted a true tech article here at SysOpt, but I'll give this a go with a forum post from my experience. First read this:

http://www.sysopt.com/articles/p4/index.html

The reason for poor software performance in some applications stems from a completely revamped programming model compared to any x86 platform available before the P4's introduction. The issue is not, as many think, the P4's branch prediction, as the P4 features one of the most efficient branch prediction routines currently available, especially when combined with its large branch target buffer (BTB). Actually, did ya' know the Athlon has a poor branch prediction routine? Even the P3 offers higher efficiency. This is one of the major reasons for the next-gen Athlon Palomino introduction, though only the table sizes are increased with this revision.

Let's take on some other common misconceptions as well:

Before blaming the P4's pipeline length for its branch prediction penalties, did ya' know that the Athlon offers only a relatively simple G-share routine? This means it is limited to only two branch predictions per 16-byte code window. Increasing the branch target buffer without any modification to the branch predictor itself would prove rather useless for the most part. What is the point of adding more buffering space when the Athlon can only carry out two standard branch predictions per cycle? Hopefully AMD will choose to implement a better branch routine to match the larger BTB in the later-generation Thoroughbred or H-series.
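For readers unfamiliar with G-share, here is a minimal sketch in Python of how such a predictor works. The table size, history length, and the branch address used are illustrative assumptions on my part, not the Athlon's actual parameters: the idea is simply that a global branch-history register is XORed with the branch address to index a table of 2-bit saturating counters.

```python
# Minimal G-share branch predictor sketch. Table size and history
# length are illustrative only, not the Athlon's actual parameters.

class GsharePredictor:
    def __init__(self, index_bits=10, history_bits=10):
        self.index_bits = index_bits
        self.history = 0                          # global history register
        self.history_mask = (1 << history_bits) - 1
        # 2-bit saturating counters, initialized to "weakly taken"
        self.table = [2] * (1 << index_bits)

    def _index(self, pc):
        # XOR the branch address with global history (the "share" in G-share)
        return (pc ^ self.history) & ((1 << self.index_bits) - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # shift the actual outcome into the global history register
        self.history = ((self.history << 1) | int(taken)) & self.history_mask

# A loop branch (taken 9 times, then not taken once) is learned quickly,
# because each history pattern maps to its own counter:
p = GsharePredictor()
correct = 0
for trial in range(100):
    for i in range(10):
        taken = i < 9
        correct += (p.predict(0x400) == taken)
        p.update(0x400, taken)
print(correct)  # large majority of the 1000 predictions are correct
```

Note the predictor itself is cheap; the limitation the text describes is not the algorithm's accuracy on a single branch, but how few branches per fetch window it can service at once.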

For further evidence, one must examine currently available programs. Most programs consist of 13-14% branch instructions, with several operations often requiring at least three branch predictions for maximum ILP and IPC. By performing some quick calculations, it can be estimated that ~10% of all branch prediction tasks are not even assigned a predictor slot by the scheduling engine! This leads to a backlog, which creates an effective stall of the Athlon processor. Simply increasing the branch target buffer does not offset the problems associated with the G-share branch prediction algorithm.
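Here is my own back-of-envelope reconstruction of that estimate; the branch density comes from the text, but the instructions-per-window figure and the independence assumption are mine. This crude model lands around 5%, in the same ballpark as the ~10% above; the gap comes from the simplifying assumptions (real branches cluster more than an independent-trials model allows).

```python
# Back-of-envelope estimate: how often does a 16-byte fetch window contain
# more branches than a 2-branch-per-window predictor can handle?
# branch_density comes from the text; insns_per_window and the assumption
# that branches occur independently are my own simplifications.
from math import comb

branch_density = 0.14   # ~13-14% of instructions are branches (per the text)
insns_per_window = 6    # assume ~6 x86 instructions per 16-byte window

def binom(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Expected branches per window, and expected "excess" branches
# (the third, fourth, ... branch in a window, which get no predictor slot).
expected = insns_per_window * branch_density
excess = sum((k - 2) * binom(k, insns_per_window, branch_density)
             for k in range(3, insns_per_window + 1))

print(f"expected branches per window: {expected:.2f}")
print(f"fraction of branches left unpredicted: {excess / expected:.1%}")
```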

The above scenario does not even account for a cache miss associated with a branch prediction. The Athlon offers instruction re-ordering for L1 cache misses, but lacks any form of L2 re-sequencing capability for ops in flight during a clock cycle. For loads and stores, the K7 has two queues: a 12-entry LS1 and a 32-entry LS2. Load/store ops are dispatched to LS1 and removed in order, two at a time, once their addresses are available (stores and load misses are moved to LS2). So if the oldest op in LS1 does not have its address available, LS1 stalls. This in-order nature of LS1 allows load-load reordering only for L1 cache misses. By comparison, the Pentium 4 offers full re-ordering for all L1 and L2 cache misses associated with all operations, except IN, OUT, and serializing instructions. Now consider that the P4's 64-bit inclusive cache architecture offers nearly 150% the bandwidth for L1 cache operations and upwards of 400% the bandwidth for L2 cache operations (SSE2, 128-bit data path) as compared to the Athlon Thunderbird's exclusive cache design.
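The LS1 behavior described above can be sketched as a toy simulation. The queue size and the two-per-cycle removal rule come from the text; the op names and everything else are simplified stand-ins. The point: because ops leave LS1 strictly in order, one op at the head without a ready address blocks everything behind it, even ops that are ready.

```python
# Toy model of the K7's in-order LS1 load/store queue (12 entries per the
# text; only 3 shown here). Ops retire from the head up to two at a time
# once their addresses are ready; a head op without a ready address
# stalls the whole queue.
from collections import deque

def drain_cycle(ls1):
    """Remove up to two ops from the head of LS1 this cycle.

    Stops as soon as the head op's address is not yet available
    (in-order removal). Returns the ops removed."""
    removed = []
    while ls1 and len(removed) < 2 and ls1[0]["addr_ready"]:
        removed.append(ls1.popleft())
    return removed

ls1 = deque([
    {"op": "load A", "addr_ready": True},
    {"op": "load B", "addr_ready": False},  # address still being computed
    {"op": "load C", "addr_ready": True},   # ready, but stuck behind B
])

print(drain_cycle(ls1))  # only "load A" leaves this cycle; B blocks C
```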

Not only does an Athlon stall during an L2 cache miss, it tends to stall quite often. The Athlon's decode unit can issue 2.8 instructions per clock cycle, but the Athlon's L2 cache requires 11 cycles to return data. Branch-predictive operations often work with L2 data due to large storage arrays. In order to maintain a standard IPC rate of 1.5, the Athlon Thunderbird must find at least 11 x 1.5 ≈ 17 instructions of independent work per L2 cache miss. The Athlon's integer instruction buffer can only hold 18 instructions, thus the processor will effectively stall during an L2 cache miss. Factor in the moderate potential for a branch-predictive operation actually setting up the condition for this L2 miss, then one can easily see the Athlon will effectively stall, and must flush/reset the execution pipe to continue normal operation.
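The arithmetic behind that claim, spelled out with the numbers from the text:

```python
# Numbers from the text: at a sustained IPC of 1.5 and an 11-cycle L2
# latency, the core needs ~17 instructions of independent work buffered to
# hide one L2 access, but the integer buffer holds only 18 entries.
ipc = 1.5
l2_latency_cycles = 11
buffer_entries = 18

work_needed = ipc * l2_latency_cycles   # instructions to hide the latency
print(work_needed)                      # 16.5, i.e. ~17 instructions

# One L2 miss consumes nearly the entire 18-entry buffer's worth of work;
# a second outstanding miss (or a mispredicted branch) leaves no slack.
print(buffer_entries - work_needed)     # only 1.5 instructions of slack
```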

The Athlon’s short pipeline length does help, but it is not a solution to the concern I’ve noted above. For once, a longer pipeline could prove more effective. For instance, the P4 can buffer up to 126 micro-ops in flight during a single clock cycle due to a split buffering architecture using multiple schedulers and buffering regions along the 28-stage (20 after the trace cache) pipeline. The Athlon only has the capability for 18 in-flight operations. Compare the pipeline lengths to the in-flight op counts, and one can easily see the P4 offers a higher degree of instruction efficiency per clock cycle.
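That comparison in numbers: the P4 figures (126 in flight, 20 stages after the trace cache) come from the text, while the Athlon's 10-stage integer pipeline length is my assumption for illustration.

```python
# In-flight window vs. pipeline length. P4 figures are from the text;
# the Athlon's 10-stage integer pipeline length is an assumption.
p4_in_flight, p4_stages = 126, 20
athlon_in_flight, athlon_stages = 18, 10

print(p4_in_flight / p4_stages)          # ~6.3 ops buffered per stage
print(athlon_in_flight / athlon_stages)  # ~1.8 ops buffered per stage
```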

Now examine how this relates to the P4’s branch predictor. The Athlon Thunderbird will offer superior branch prediction only in situations where highly dependent code operating on data sets under 512 KB is involved. The Pentium 4 offers a more linear performance approach due to higher efficiency (for more proof, examine how the ILP and IPC characteristics of the P4 compare to its latency and bandwidth). With large data arrays, the P4 will still offer the same linear performance curve, while the Athlon will continue to degrade at an exponential rate. To examine this in action, try running the industry-accepted Queen’s Loop routine (written in standard x86 assembly, of course). This test creates a small 32x32 array for data manipulation while loading an intensive sequence of branch-predictive tasks. For comparison, the Athlon TB outscores a similarly clocked P4 by less than 0.2 seconds! Assuming we converted the code to C++, then utilized Intel’s C++ compiler, the P4 would actually outscore the Athlon by a large margin. This is due to compiler feedback-directed optimizations for the HWNT and HST hint prefix instructions. The addition of these two factors can increase branch prediction efficiency by an estimated 25% or more, thus propelling the Pentium 4 past the Athlon in all branch-predictive scenarios.
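I don't have the exact Queen's Loop source handy, so the sketch below is my own Python reconstruction of a workload with the same character, not the benchmark itself: a small array plus an intensive sequence of data-dependent branches, which is exactly what stresses a branch predictor.

```python
# Not the actual "Queen's Loop" benchmark -- just a sketch of a similarly
# branch-heavy workload: N-queens backtracking over a small array, where
# nearly every inner operation is a data-dependent branch.
def n_queens(n):
    """Count the solutions to the n-queens puzzle by backtracking search."""
    cols = [0] * n          # cols[row] = column of the queen in that row
    count = 0

    def safe(row, col):
        # Check the new queen against every queen already placed:
        # same column, or same diagonal (equal row/column distance).
        for r in range(row):
            if cols[r] == col or abs(cols[r] - col) == row - r:
                return False
        return True

    def place(row):
        nonlocal count
        if row == n:
            count += 1
            return
        for col in range(n):
            if safe(row, col):      # data-dependent branch on array contents
                cols[row] = col
                place(row + 1)

    place(0)
    return count

print(n_queens(8))  # 92 solutions for the classic 8x8 board
```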

We all know the Pentium 4 offers superior integer performance, but I can go on to produce evidence that the P4 offers superior floating point (x87 and SIMD) as compared to the Athlon, assuming anyone wants to read through the boring technical aspects. At any rate, the above revised SPEC results speak for themselves. The problem with the P4 is the lack of software optimization for the architecture. Most current software does not include any real P4-specific coding outside of the occasional SSE2 instruction. Worse yet, a large portion of today’s software is improperly coded and compiled relative to the programming specifications defined within the P4 programmer's guide, thus leading to even lower performance. However, wait until the next generation of vectorizing compilers with dynamic re-coding capabilities arrives! Factor in the expanding acceptance of the Intel P4 IA-32 programmer’s guide, and the P4 should be an exciting platform in the near future, especially considering it is only at the early stages of its overall product lifecycle as compared to the AMD Athlon.

Hope this helps,
Robert Richmond

[This message has been edited by RobRich (edited 08-29-2001).]