Recently I read that the Pentium 4 lacks a so-called 'barrel shifter' and that this hurts its performance badly in some situations. Is it really THAT bad? Has anyone done any benchmarks? I'd really like to see a 'practical' difference.
Posted on 2005-07-27 22:50:21 by ti_mo_n
Rotate instructions and encryption functions are slower. I haven't done benchmarks myself, but that's the general consensus.
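To show why rotates matter for encryption code, here is a minimal C sketch of the rotate idiom that many ciphers and hashes use in their inner loops. The names rotl32 and mix_round are made up for illustration, and mix_round is not taken from any real cipher; it only shows a rotate sitting on the critical path of a round.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by n bits (n is reduced modulo 32).
   Compilers commonly turn this idiom into a single ROL instruction. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Hypothetical mixing step, for illustration only (not a real cipher):
   add, xor, then rotate, so every round waits on the rotate result. */
static inline uint32_t mix_round(uint32_t a, uint32_t b)
{
    a += b;
    return rotl32(a ^ b, 7);
}
```

Because each round depends on the previous round's rotate, any extra latency in the rotate is paid once per round rather than being hidden by other work.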
Posted on 2005-07-28 01:16:08 by f0dder
So that makes it generally worse than, or comparable to, the PIII (so that they had to increase the clock drastically)? (I never had a PIII, only a PII, a K6 and now a P4.)
Posted on 2005-07-28 15:25:43 by ti_mo_n
One of the general ideas of the P4 architecture is to accept somewhat worse instructions per clock (IPC) in exchange for a much higher clock frequency. So some instructions are considerably slower per cycle on a P4 than on a P3, but that's generally compensated for by the high clock speeds. Other instructions, if correctly organized, have *very* good throughput on P4s. Especially when you start using SSE/2/3, things become interesting.

I personally prefer somewhat lower clockspeeds but higher IPC - heat and power consumption have been pretty insane in the later-model P4s, while they don't perform much better than the AMD64s with lower clockspeed.

As for concrete examples of encryption, I have a p3 celeron 1.3 GHz running linux, a p4 celeron 1.7 and a p4 "normal" 2.53 running XP - I currently can't be arsed to write a benchmark test though. I guess Rijndael/AES encryption could be interesting, though.
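For anyone who wants to try this, a rough timing harness could look like the sketch below. It is not the AES test mentioned above, only a dependent chain of rotates and adds; the function names, the seed and the rotate count of 5 are arbitrary choices for illustration. The result is returned so the compiler cannot drop the loop entirely, but an optimizing compiler may still transform it, so check the generated code before trusting any numbers.

```c
#include <stdint.h>
#include <time.h>

/* Dependent chain: each iteration rotates left by 5 and adds the counter. */
static uint32_t rotate_workload(uint32_t seed, unsigned long iterations)
{
    uint32_t x = seed;
    for (unsigned long i = 0; i < iterations; i++) {
        x = ((x << 5) | (x >> 27)) + (uint32_t)i;
    }
    return x;
}

/* Returns elapsed CPU time in seconds; the workload result goes to *result_out. */
static double time_rotate_workload(unsigned long iterations, uint32_t *result_out)
{
    clock_t start = clock();
    uint32_t r = rotate_workload(0x12345678u, iterations);
    clock_t end = clock();
    *result_out = r;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Running the same binary with the same iteration count on each machine and dividing the time by the clock frequency would give a crude per-cycle comparison. Cache and memory speed barely matter here, because the loop touches no memory.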

Then again, there are a lot of factors in performance, like cache size and memory speed. I suspect the P3 won't hold up too well because of its PC-133 SDRAM, whereas the P4s have DDR266 and DDR333.
Posted on 2005-07-28 16:44:15 by f0dder
Oh, so it's not THAT bad, as some people curse it to be :)
Posted on 2005-07-28 16:52:06 by ti_mo_n
Well, bad enough IMO :) - I'm not too fond of the P4s after the northwood2 revision. Trading some IPC for higher clock frequency can be okay (since, apparently, it was hard to push the P3 architecture much further clock-frequency-wise), but it ran amok in the >= prescott P4s, which run too hot and require too much power.

AMD64s are better, and are pretty nifty in general; P4s still seem to beat them for heavy SSE tasks, so it seems the SSE implementation is better on the P4s. But outside of video coding and tasks like that, the AMD64 platform seems a better choice for now. I'm looking forward to seeing what Intel has up its sleeve for the next CPU, though.
Posted on 2005-07-28 19:28:53 by f0dder

rotate instructions, encryption functions are slower. Haven't done benchmarks myself, but that's the general consensus.

Also affects multiply and divide.
Randy Hyde
Posted on 2005-07-29 11:06:17 by rhyde
Thank you guys for your explanations :)

So it IS bad :P
Posted on 2005-07-29 14:27:07 by ti_mo_n
Shouldn't it also affect the indexed memory access:
mov eax,
etc etc?
The shift here is fixed only to 4 states, so a simpler, dedicated (different) type of unit is probably used, though.
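To make the "only 4 states" point concrete, here is a small C model of x86 scaled-index address arithmetic, assuming the post refers to addressing of the form [base + index*scale + disp]. The function name and parameters are made up for illustration. The scale is encoded as a 2-bit field, so the index is only ever shifted left by 0, 1, 2 or 3 bits.

```c
#include <stdint.h>

/* Effective address for the x86 form [base + index*scale + disp].
   scale_bits is the 2-bit scale field: 0..3 meaning scale 1, 2, 4, 8. */
static inline uint32_t effective_address(uint32_t base, uint32_t index,
                                         unsigned scale_bits, int32_t disp)
{
    return base + (index << (scale_bits & 3u)) + (uint32_t)disp;
}
```

A shifter that only has to handle four fixed shift counts is much simpler than a general one that handles all 32, which fits the guess that a separate, simpler unit could do this job.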
Posted on 2005-07-29 20:03:56 by Ultrano

...And if so, would it be wiser to use a scaled LEA ("lea eax, [2*eax]") instead of SHL ("shl eax, 1")?
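For reference, the candidates compute the same value. The C sketch below (function names are made up for illustration) models the result of each form on a 32-bit register; it says nothing about which is faster on a P4, which would have to be measured.

```c
#include <stdint.h>

/* Models "shl eax, 1": shift left by one bit. */
static inline uint32_t double_by_shift(uint32_t x) { return x << 1; }

/* Models "lea eax, [2*eax]": scaled index with scale 2. */
static inline uint32_t double_by_scale(uint32_t x) { return x * 2u; }

/* Models "add eax, eax": a third equivalent that needs no shifter at all. */
static inline uint32_t double_by_add(uint32_t x) { return x + x; }
```

One difference the C model does not show: SHL and ADD update the flags, while LEA leaves them untouched, so the forms are not always interchangeable in surrounding code.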
Posted on 2005-07-31 21:41:40 by ti_mo_n
No, it probably uses the same type of unit to execute that microop. And lea might be a bit slower to decode and sequence (into microops).

I don't have a P4, only a hated P3 and a beloved AthlonXP, so I can't run benchmarks, hence the frequent use of "probably" ^^"
Posted on 2005-07-31 22:17:50 by Ultrano