You know what? I misunderstood your previous post :P For some reason I thought that your Pentium III had no SSE support.

For SSE support you need a capable CPU and a capable OS. SSE is not as transparent to the OS as MMX was: the OS must set a special control bit on the CPU (CR4.OSFXSR) to confirm that the kernel is fully aware of SSE and will therefore preserve the XMM registers on context switches. I'm not sure when Windows started to support SSE, maybe at Win98 SE?
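
A rough sketch in C of the two checks described above. It assumes the caller has already obtained the EDX value from CPUID with EAX = 1 and, for the OS side, the value of CR4 (which can only be read in kernel mode). The function and macro names are made up for illustration; the bit positions are the documented ones (CPUID.1:EDX bit 25 for SSE, bit 24 for FXSAVE/FXRSTOR, CR4 bit 9 for OSFXSR).

```c
#include <stdint.h>

/* Documented bit positions; the macro names are illustrative only. */
#define CPUID1_EDX_FXSR  (1u << 24)  /* FXSAVE/FXRSTOR available */
#define CPUID1_EDX_SSE   (1u << 25)  /* SSE available */
#define CR4_OSFXSR       (1u << 9)   /* OS saves/restores XMM state */

/* CPU side: edx is the EDX value returned by CPUID with EAX = 1. */
int cpu_has_sse(uint32_t edx)
{
    return (edx & CPUID1_EDX_SSE) != 0 && (edx & CPUID1_EDX_FXSR) != 0;
}

/* OS side: cr4 is the control register value (kernel mode only). */
int os_enabled_sse(uint32_t cr4)
{
    return (cr4 & CR4_OSFXSR) != 0;
}

/* SSE instructions are safe to use only when both sides agree. */
int sse_usable(uint32_t edx, uint32_t cr4)
{
    return cpu_has_sse(edx) && os_enabled_sse(cr4);
}
```

User-mode code cannot read CR4, so in practice the OS side is usually checked through an OS service; on Windows, `IsProcessorFeaturePresent(PF_XMMI_INSTRUCTIONS_AVAILABLE)` should serve that purpose.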

As for whether every Pentium III supports SSE: a magazine from that time that I have in my hands right now confirms it, and the core name is Katmai. Wikipedia also says that Katmai comes with SSE: http://en.wikipedia.org/wiki/Pentium_III
Posted on 2007-10-30 21:53:46 by LocoDelAssembly
Yeah, to get max speed out of a 500MHz P3, use SSE to compute 4 voices at once. You'll have to group the data, though. And instead of a float4 phase (one float per sine), use a 31-bit integer format (you must clear the sign bit, and you'll also have to use the FPU for fild int1 | fmul fOneOver2_31 | fstp QuadFloat.float0).
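
A minimal C sketch of the 31-bit phase idea, assuming the phase is kept in an unsigned 32-bit word with the top bit masked off. The names `phase_advance` and `phase_to_float` are hypothetical; the second function is only the C counterpart of the fild / fmul / fstp sequence mentioned above, not a tuned implementation.

```c
#include <stdint.h>

#define PHASE_MASK 0x7FFFFFFFu  /* keep 31 bits, sign bit always clear */

/* Advance a 31-bit phase accumulator; it wraps around at 2^31. */
uint32_t phase_advance(uint32_t phase, uint32_t step)
{
    return (phase + step) & PHASE_MASK;
}

/* Scale the integer phase by 1/2^31 to get a fraction of one period.
   Note: a float has only 24 mantissa bits, so phases very close to
   2^31 may round up to exactly 1.0f. */
float phase_to_float(uint32_t phase)
{
    const float one_over_2_31 = 1.0f / 2147483648.0f;
    return (float)(int32_t)(phase & PHASE_MASK) * one_over_2_31;
}
```

Because the sign bit is clear, the value can be loaded as a signed integer (which is all the x87 fild instruction supports) without ever appearing negative.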
Posted on 2007-10-30 23:17:29 by Ultrano
P3 does not have SSE2, so there isn't any 4-way parallel 32-bit integer multiplication available.. Regular SSE is restricted to single-precision float multiplication.. He will thus need a separate float-to-integer conversion somewhere, which consumes time in addition to the swizzling, if he intends to use SSE

I suggest sticking with the general purpose integer registers since nearly every operation needed has a throughput of 1 per cycle
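
As an illustration of the integer-only route, here is a hedged C sketch that mixes several voices using a 32-bit phase accumulator, a caller-supplied sine table, and a Q15 gain. The `Voice` struct, `mix_sample`, and the table layout are assumptions made for this example and do not come from the thread.

```c
#include <stdint.h>
#include <stddef.h>

/* One voice: phase accumulator, per-sample increment, Q15 gain.
   (Hypothetical layout, chosen for illustration.) */
typedef struct {
    uint32_t phase;
    uint32_t step;
    int16_t  gain_q15;
} Voice;

/* Sum one output sample over n voices using integer math only.
   table holds 2^table_bits signed 16-bit sine samples supplied by
   the caller; table_bits must be between 1 and 31. */
int32_t mix_sample(Voice *v, size_t n, const int16_t *table,
                   unsigned table_bits)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* The top bits of the phase select the table entry. */
        uint32_t idx = v[i].phase >> (32 - table_bits);
        /* One integer multiply per voice; dividing by 2^15 undoes
           the Q15 scaling (truncates toward zero). */
        acc += ((int32_t)table[idx] * v[i].gain_q15) / 32768;
        /* Unsigned addition wraps modulo 2^32, i.e. one full period. */
        v[i].phase += v[i].step;
    }
    return acc;
}
```

Each voice keeps its own arbitrary frequency and phase, and the inner loop needs only a shift, a load, a multiply, and two adds.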
Posted on 2007-10-31 03:16:49 by Rockoon

Which Windows version are you using now? AFAIK the only significant addition the Pentium III had was SSE. I remember that many people complained about this because they said that "Pentium II with SSE" would be more appropriate, since "Pentium with MMX" wasn't called Pentium II.

IIRC it was the Pentium III that integrated the L2 cache directly on-die, whereas the PII had L2 cache on the processor packaging, but not on-die. Dunno if there were other architectural changes compared to the PII besides that.
Posted on 2007-10-31 06:27:06 by f0dder
The on-die L2 was introduced later; Katmai used the traditional cartridge, like the first Athlons and all the Pentium IIs. It also came with the magic serial number that many people called "big brother inside", a play on "Intel inside".
Posted on 2007-10-31 08:40:19 by LocoDelAssembly

The on-die L2 was introduced later; Katmai used the traditional cartridge, like the first Athlons and all the Pentium IIs. It also came with the magic serial number that many people called "big brother inside", a play on "Intel inside".

Oh, the first P3s were slot rather than socket CPUs? Didn't know that :)

The unique-ID feature appeared and disappeared again pretty quickly, heh.
Posted on 2007-10-31 10:50:16 by f0dder
I suggest sticking with the general purpose integer registers since nearly every operation needed has a throughput of 1 per cycle

Well, with SSE he can do roughly 4 operations/cycle (requires proper data structuring), and float <-> integer conversion can be done by the FPU (which doesn't collide with SSE) before and after the process. Not to mention that SSE code will perform VERY nicely when run on P4s, Athlons, or Cores. Even Katmai (the first P3 to be produced) could execute about 1 SSE instruction/cycle, scoring up to 4 FP operations/cycle. It had some limitations, so in some cases Katmai would do at most 2 FP operations/cycle, but (1) that's still better than 1, and (2) later P3s don't have such limitations. IMHO sticking with 'the old' instruction set wastes processing power.
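
To make "proper data structuring" concrete, here is a small C sketch using the SSE intrinsics from `<xmmintrin.h>`. It assumes the per-voice values are stored as separate contiguous float arrays (structure-of-arrays), so four voices can be loaded with a single instruction. The function name `scale_voices` is hypothetical, and the code falls back to a plain scalar loop when the compiler does not define `__SSE__`.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* out[i] = sample[i] * gain[i] for n voices stored as parallel arrays. */
void scale_voices(float *out, const float *sample, const float *gain,
                  size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four voices per iteration: one MULPS does four multiplies. */
    for (; i + 4 <= n; i += 4) {
        __m128 s = _mm_loadu_ps(sample + i);  /* unaligned loads, for safety */
        __m128 g = _mm_loadu_ps(gain + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(s, g));
    }
#endif
    /* Scalar loop handles the tail (or everything without SSE). */
    for (; i < n; i++)
        out[i] = sample[i] * gain[i];
}
```

The grouping only pays off if the data is already laid out this way; if each value has to be shuffled into place first, the extra instructions eat into the gain.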

http://en.wikipedia.org/wiki/Pentium_III#Pentium_III.27s_SSE_implementation
There's a minor error there: Tualatins start from 750MHz (I myself have one in one of my machines), not from 1GHz.

and
The Pentium III was the first Intel processor to break 1 GFLOPS, with a theoretical performance of 2 GFLOPS

Actually, Tualatin @ 1GHz scores ~1.4 GFLOPS.
Posted on 2007-10-31 14:33:54 by ti_mo_n
keantoken,

    I suggest you look into the incremental Taylor series.  It works well with a LUT and it converges fast.  Ratch

http://www.masm32.com/board/index.php?topic=4765.0

Posted on 2007-10-31 22:10:38 by Ratch

Well, with SSE he can do roughly 4 operations/cycle (requires proper data structuring), and float <-> integer conversion can be done by the FPU (which doesn't collide with SSE) before and after the process.


Not on the Pentium 3, which is his minimum target processor.

SSE can do at best 2 multiplications per cycle on the P3.. that's a MULPS throughput of one per 2 cycles... in theory twice as fast, right?

..but if you have followed along then you would know that proper data ordering is not possible because this isn't a SIMD problem (each sine wave has an arbitrary frequency and phase), so he will have to use register swizzles to get the proper data into the proper component of an xmm register.

If he needs 1 swizzle per multiplication, then the best he can do is 1 multiplication per cycle .. the theoretical best is then only equal to using integers .. but note that the swizzle instructions all have latencies significantly higher than the throughput of integer multiplications so it will be hard in practice to ever match the throughput of integer multiplication for this problem set.

On top of that, he will then also need a float-to-integer stage, which isn't at all free. While the FPU "can" do it, there is no reason that it "should".

I just can't see a justification for using MULPS over IMUL for this problem set on that processor.

.. Also, my advice is to stop using Wikipedia for your technical information. I suggest Agner Fog's optimization manuals, which spell out the real-world latency and throughput of these instructions quite nicely.
Posted on 2007-11-01 01:38:01 by Rockoon