Yes, tech talk for once! Been a while since I posted a true tech article here at SysOpt, but I'll give this a go with a forum post from my experience. First read this:
The reason for poor software performance in some applications stems from a completely revamped programming structure as compared to any x86 platform available before the P4's introduction. The issue is not, as many think, branch prediction; the P4 features one of the most efficient branch prediction routines currently available, especially when combined with its large branch TLB table. Actually, did ya' know the Athlon has a poor branch prediction routine? Even the P3 offers higher efficiency. This is one of the major reasons for the next-gen Athlon Palomino introduction, though only the TLB size is increased with this revision.
Let's take on some other common misconceptions as well:
Before trying to dismiss the pipeline length as associated with the P4's branch prediction routine, did ya' know that the Athlon offers only a relatively simple G-share routine? This means it is limited to only two branch predictions per 16-byte code window. Increasing the branch table without any modification to the branch predictor itself would prove rather useless for the most part. What is the point of adding more buffering space when the Athlon can only carry out two standard branch predictions per cycle? Hopefully AMD will choose to implement a better branch routine to better correlate with the larger TLB table size in the later-generation Thoroughbred or H-series.
For further evidence, one must examine currently available programs. Most programs consist of roughly 13-14% branch instructions, with several of the operations often requiring at least three branch predictions for maximum ILP and IPC. By performing some quick calculations, it can be estimated that ~10% of all branch prediction tasks are not even assigned a predictor bit by the scheduling engine! This leads to a backlog, thus creating an effective stall of the Athlon processor. Simply increasing the TLB size does not offset the problems associated with the G-share branch prediction algorithm.
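To make that quick calculation concrete, here is a back-of-the-envelope sketch in Python. The exact inputs behind the ~10% figure aren't given in the post, so the window statistics below are assumed purely for illustration:

```python
# Back-of-the-envelope version of the ~10% estimate above.
# The window statistics here are assumed, not from the original post.

branch_fraction = 0.135        # ~13-14% of instructions are branches
three_branch_windows = 0.30    # assumed share of branches sitting in
                               # 16-byte windows that hold 3 branches
slots_per_window = 2           # G-share limit: 2 predictions per window

# In a 3-branch window only 2 of the 3 branches get a predictor slot,
# so 1 in 3 branches in those windows goes unpredicted.
unpredicted_share = three_branch_windows * (1 / 3)
print(f"unpredicted branch tasks: {unpredicted_share:.0%}")
```

With these assumed inputs the share of unassigned branch tasks lands right at the 10% the post estimates; different window statistics would shift it.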
The above scenario does not even account for a cache miss when associated with a branch prediction. The Athlon offers instruction re-ordering for L1 cache misses, but lacks any form of L2 re-sequencing capabilities while in flight during a clock cycle. For loads and stores, the K7 has 2 queues, a 12-entry LS1 and a 32-entry LS2. Load/store-ops are dispatched to LS1 and removed in-order two at a time when the addresses are available (stores and load misses are moved to LS2). So if the oldest op in LS1 does not have its address available, then LS1 stalls. This in-order nature of LS1 allows load-load reordering only for L1 cache misses. To compare, the Pentium 4 offers full re-ordering for all L1 and L2 cache misses associated with all operations, except IN, OUT, and serializing instructions. Now consider that the P4's 64-bit inclusive cache architecture offers nearly 150% the bandwidth for L1 cache operations and upwards of 400% the bandwidth for L2 cache operations (SSE2, 128-bit datagram) as compared to the Athlon Thunderbird's exclusive cache design.
Not only does an Athlon stall during an L2 cache miss, it tends to stall quite often. The Athlon's decode unit can issue 2.8 instructions per clock cycle, but the Athlon's L2 cache requires 11 cycles to return data. Branch predictive operations often operate with L2 data due to large storage arrays. In order to maintain a standard IPC rate of 1.5, the Athlon Thunderbird must find at least 11 x 1.5 ≈ 17 instructions per L2 cache miss. The Athlon's integer instruction buffer only allows for 18 instructions, thus the processor will effectively stall during an L2 cache miss. Factor in the moderate potential for a branch predictive operation actually setting forth the condition for this L2 miss, then one can easily see the Athlon will effectively stall, and must flush/reset the execution pipe to continue normal operation.
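The stall arithmetic above fits in a few lines of Python. This is a deliberately simple model; the latency, IPC, and buffer figures are taken from the post as-is:

```python
import math

l2_latency_cycles = 11    # cycles for the Athlon's L2 to return data
target_ipc = 1.5          # IPC the core is trying to sustain
buffer_entries = 18       # integer instruction buffer capacity

# Independent instructions the core must find to hide the miss latency:
needed = math.ceil(l2_latency_cycles * target_ipc)   # 16.5 -> 17
slack = buffer_entries - needed                      # 1 entry spare
print(f"need {needed} in-flight instructions, buffer holds {buffer_entries}")
```

With a single entry of slack, one extra dependent op or a second outstanding miss exhausts the buffer, which is the point being made: the buffer is sized right at the edge of the L2 latency.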
The Athlon's short pipeline length does help, but it is not a solution to the concern I've noted above. For once, a longer pipeline could prove more effective. For instance, the P4 can buffer up to 126 micro-ops in flight during a single clock cycle due to a split buffering architecture using multiple schedulers and buffering regions along the 28-stage (20 after the trace cache) pipeline. The Athlon only has the capability for 18 in-flight operations. Compare the pipeline lengths to the in-flight op rate, and one can easily realize the P4 offers a higher degree of instruction efficiency per clock cycle.
Now examine how this relates to the P4's branch predictor. The Athlon Thunderbird will offer superior branch prediction only in situations where highly dependent code working on a data set under 512 KB is involved. The Pentium 4 offers a more linear performance approach due to a higher efficiency capability (for more proof, examine the ILP and IPC nature of the P4 as they compare to latency and bandwidth). With large data arrays, the P4 will still offer the same linear performance curve, while the Athlon will continue to degrade at an exponential rate. To examine this in action, try running the industry-accepted Queen's Loop routine (written in standard x86 assembly, of course). This test creates a small 32x32 array for data manipulation while loading an intensive sequence of branch predictive tasks. For comparison, the Athlon TB outscores a similarly clocked P4 by less than .2 seconds! Assuming we converted the code to C++, then used Intel's C++ compiler, the P4 would actually outscore the Athlon by a large margin. This is due to compiler feedback-directed optimizations for the HWNT and HST hint prefix instructions. The addition of these two factors can increase branch prediction by up to an estimated 25%+, thus propelling the Pentium 4 past the Athlon in all branch predictive scenarios.
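The exact "Queen's Loop" assembly isn't reproduced in the post, but for anyone who wants to see the kind of branch-heavy workload being described, here is a minimal N-queens search in Python as a stand-in. It uses an 8x8 board rather than the 32x32 array mentioned above, since a full 32-queens search would take far too long; the point is only that the inner loop is dominated by data-dependent branches:

```python
import time

def queens(n, cols=0, diag1=0, diag2=0, row=0):
    """Count N-queens solutions with a branchy bitmask backtracking search."""
    if row == n:
        return 1
    total = 0
    # Bitmask of columns still free on this row, given prior placements.
    free = ~(cols | diag1 | diag2) & ((1 << n) - 1)
    while free:                     # data-dependent branches dominate here
        bit = free & -free          # lowest free column
        free -= bit
        total += queens(n, cols | bit, (diag1 | bit) << 1,
                        (diag2 | bit) >> 1, row + 1)
    return total

start = time.perf_counter()
solutions = queens(8)               # 8x8 board has 92 solutions
elapsed = time.perf_counter() - start
print(solutions, f"solutions in {elapsed:.3f}s")
```

Timing a loop like this on different CPUs gives a crude view of how well each machine's branch predictor copes with irregular control flow.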
We all know the Pentium 4 offers superior integer performance, but I can go on to produce evidence that the P4 offers superior floating point (x87 and SIMD) as compared to the Athlon, assuming anyone wants to read through the boring technical aspects. At any rate, the above revised SPEC results speak for themselves. The problem with the P4 is the lack of software optimization for the architecture. Most current software does not include any real P4-specific coding outside of the occasional SSE2 instruction. Worse yet, a large variety of today's software is improperly coded and compiled to meet the programming specifications defined within the P4 programmer's guide, thus leading to even lower performance. However, wait until the next generation of vectorized compilers with dynamic re-coding capabilities arrive! Factor in the expanding acceptance of the Intel P4 IA-32 programmer's guide, and the P4 should be an exciting platform in the near future, especially considering it is only at the early stages of its overall product lifecycle as compared to the AMD Athlon.
Hope this helps,
[This message has been edited by RobRich (edited 08-29-2001).]
Anyways, Intel will be at a minimum of 3500 MHz next year, and AMD just comes up with a whole lot of new CPUs: 1400 MHz and 1533 MHz Athlons (Palomino), the 1.1 GHz Duron, and some workstation and server CPUs (MP?) at 1.333, 1.400, and 1.533 GHz.
That was good reading rob
So nVidia's MPU isn't ....perfected yet. I wonder if it would be viable to attach a different southbridge; I mean, the thing of most interest is dual-channel DDR RAM. Even though, as you've said before, the chipset will be very expensive, the price would be offset due to the DDR RAM and Athlon itself. Is the K7 core actually fully capable of utilising the bandwidth though?
How far into the future do you think it'll actually be? The original Katmai's SSE instructions are still not entirely used.
the P4 should be an exciting platform in the near future, especially considering it is only at the early stages of its overall product lifecycle as compared to the AMD Athlon
One other thing: will the K8 Hammer series actually have any impact on the desktop, or is AMD just working on a die shrink in the form of Thoroughbred to be clock-speed competitive in the desktop arena? Seems strange that AMD has nothing to directly compete clock-speed wise. I read on Anand that Intel demonstrated a Northwood-core P4 running at 3.5GHz already. Also of interest, they showcased a 4GHz processor with dual ALUs. I like the little comment he put in though, "essentially 8GHz", yeah, just like 2 MPs on a Tyan Thunder is 2.4GHz
It seems to me that this entire post has gotten off the original subject. The original question was:
"Anyone have any news on the schedule for amd chip speeds"
I too would like to know AMD's schedule on new CPUs. Is the latest and greatest coming out soon? I think that given the previous posts that we can all agree that AMD smokes Intel's ****.
Forgot the topic?
Heck, I forgot who I was just going to call after reading the following line above:
"...the Athlon has relatively poor scheduling efficiency due to its severly limited g-share branch predicition routine combined with absolutely no capabilities for L2 cache instruction re-sequencing without completely stalling and flushing the execution pipeline. "
Took me a little while to chew and 'process' that one.
Thanks for your insight, Rob, backed up with some numbers and theory. While I am a big AMD fan, I realize that Intel still has a lot of innovation up its sleeve (sleeves that are also stuffed full of money/capital), and that in the end, innovation that is properly positioned and priced, and yes, promoted, wins the day.
The price factor will always outweigh the innovation factor. Intel could be kicking ****, but if it's for much more $$$ then it won't matter. Let's face it: there is no difference running Word on 2.0GHz vs 200MHz. Heck, I didn't notice ANY difference surfing online with my PII 233 vs my 1.2 'Bird. Only hardcore gamers and CAD design people might be interested in the absolute best.
I agree with you on that, camaro. I just upgraded from a Duron 600MHz to an Athlon 1.2GHz and you don't notice the difference in everyday computing. Though my QuakeIII FPS increased by 20 frames.
The difference is that in order to get the same performance in Word in WinXP you need a boost in the CPU. Every new OS from Micro$oft adds more bloat and slows things down. Plus, over the next year or so who knows what new software will be out to push the systems.
Though you still are right. To do everyday tasks you could still run Win98/Win2k and Word 2000 to get everything done that you need to on that 233Mhz machine. It works great for my parents.
[This message has been edited by gyoung (edited 08-29-2001).]
sorry, double post
[This message has been edited by Turnip12 (edited 08-29-2001).]
Can anyone recommend any good sites to help explain Rob's post? I'd like to get into it more, but most, OK just about all, was over my head. I'd love to understand more about hardware without shooting for an engineering degree. On the subject of "innovative" Intel, the T-bird is at its end, right? I mean, the T-birds aren't really meant to compete with the P4s, are they? Isn't that what the Hammer line is for? These are just the tweaks for the end of the T-bird line, right?
[This message has been edited by Turnip12 (edited 08-29-2001).]
I think the Athlon is meant to be the direct competitor to the P4. The Hammer series is meant to compete with Intel's Itanium.
Here's a good article: http://www.anandtech.com/cpu/showdoc.html?i=1524
Basically clock speed isn't enough anymore. You also have to look at instructions per clock. This snippet may help explain it a little better:
How much work the processor can do in a single clock cycle, measured in Instructions Per Clock (IPC), matters just as much as clock speed. At the same time, measuring a processor's performance in IPC wouldn't make much sense either, since a CPU capable of an average IPC of 5 instructions per clock yet only capable of running at 50MHz wouldn't be faster than a CPU capable of an average IPC of 1.5 instructions per clock yet capable of running at 1GHz. The combination of IPC and clock frequency determines the true performance of the CPU.
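The quoted snippet boils down to a one-line formula, performance ≈ IPC × clock. A quick sketch using the snippet's own hypothetical CPUs:

```python
# The snippet's point in code: performance ~ IPC x clock frequency.
# Both CPUs are the hypothetical examples from the quoted text.
cpu_a = {"ipc": 5.0, "clock_mhz": 50}      # high IPC, low clock
cpu_b = {"ipc": 1.5, "clock_mhz": 1000}    # low IPC, high clock

perf_a = cpu_a["ipc"] * cpu_a["clock_mhz"]   # 250 M instructions/s
perf_b = cpu_b["ipc"] * cpu_b["clock_mhz"]   # 1500 M instructions/s
print(perf_b / perf_a)   # the 1 GHz chip wins despite one-third the IPC
```

So neither number alone tells you anything; only the product does, which is exactly the Athlon-vs-P4 argument in this thread.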
Are you a psychic or had you already read this...........
arggg.. i'm tryin' to remember the website of a comparison between the K7 and the Willamette (kinda outdated, but still good info)... it was like www.chip-something.com or something like that...
it was written for the investor, so it isn't too nutso into theory, but it does go into discussions of prefetch, pipelines, and trace cache...
I recommend these in-house articles for people interested in the more technical aspects of computing:
Now to condense my earlier statements: the Athlon is the best choice for most day-to-day activities. I use one myself. It just happens that the P4 is a better solution for certain activities, such as multimedia editing or 3D design work. I am not accounting for cost, as people who generally buy P4 2.0 GHz workstations are usually not concerned over a few hundred dollars anyway. Must be nice......
Hellmund, I foresee the AMD Hammer as a potential desktop competitor in 2003 at the earliest. The ClawHammer should actually ship by mid-2002, but I doubt we will see prices below the server/workstation range until 6-12 months after release.
About the nForce, it should be possible to pair it with a VIA or AMD southbridge, assuming nVidia does not use a proprietary bus-link protocol. I believe specs for an MSI or Asus board featuring a hybrid nForce/VIA chipset were recently posted on Usenet, but I cannot confirm this rumor until I receive the information from a trusted source.
The bandwidth of the nForce with 128-bit memory operation is suspect at this time. The Athlon only features a 64-bit memory interface, thus the other half of the provided memory bandwidth will be utilized by other connected hardware components. This additional 64-bit path is great for the integrated video controller, but the impact of this extra throughput when a standard AGP card is being used is questionable.
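Rough numbers for that split. These are assumed period-typical PC2100 DDR and EV6 bus figures, not published nForce specs:

```python
# Rough bandwidth math for the dual-channel split described above.
# All figures are assumed period-typical values, not nVidia specs.
ddr_clock_mhz = 133          # PC2100 DDR base clock (266 MT/s effective)
bus_width_bits = 64          # one DDR channel

channel_bw = ddr_clock_mhz * 2 * bus_width_bits / 8 / 1000   # GB/s
nforce_bw = 2 * channel_bw   # two 64-bit channels (128-bit total)

# Athlon's EV6 front-side bus: 64 bits at 266 MT/s.
athlon_fsb_bw = 266 * 64 / 8 / 1000                          # GB/s

print(f"{nforce_bw:.1f} GB/s total vs {athlon_fsb_bw:.1f} GB/s CPU bus")
```

By these numbers the CPU bus can only consume about half the memory bandwidth, so the second channel mainly benefits the integrated video and other bus masters, as suggested above.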
I personally would like to see AMD integrate the memory controller directly into the processor core to lower latency, then possibly move the chipset architecture away from a bus to a crossbar-type design to increase bandwidth. This design has done wonders for Sun's UltraSPARC III platform, but no desktop manufacturers have moved forward with this type of technology yet.
Catch ya' later,
[This message has been edited by RobRich (edited 08-29-2001).]
I think I started a good discussion here. This has been a great education for me, and I am glad to be an AMD user. There have been some great links here, and it still shows that Intel is not as good as all these places pushing the chip claim... Bang for buck, AMD has Intel beat!
I purchased a Duron 650 for maybe 43 bucks last year, and was able to OC it to 950 with just a basic fan. I am ready to upgrade to the 1.4 T-bird for $128.00. I can't wait to see what it will do.
Thanks all for the education on the speeds. I knew this would get a lot of reaction..
Did I mention AMD is the BomB!
Just Having Fun!
Not to bring you down, JustHavingFun, but don't expect leaps and bounds going from a Duron 650 to the Athlon 1.4GHz.
I just did the Duron 600 to the 1.2GHz and you don't notice anything really in day-to-day computing. I don't notice anything even in gaming. Only if I look at a benchmark number do I see an increase, but I don't feel it when I'm playing a game.