With bitRAKE's encouragement, I post it now.

I'm posting only one of Level 1 routines, partly because I'm not sure how many of you would be interested in this kind of stuff :) and partly because of my background. Well, they are written in in AT&T syntax and I need time to convert them to Intel syntax.

Anyway, this function computes l_2 norm. Single precision version of this function is a matter of replacing QWORD to DWORD and adjusting offset values accordingly. Complex and double complex (or COMPLEX*16 in Fortran) versions differ only in general increment case. So, here it is. The unrolling factor was determined to hide as much latency as possible. But, some of you may find better unrolling factor for different machines. (And, that's what I hope to find out. :) ) And, if your machine is not P3, you may find a better layout with or without using prefetch.

Now, the attached one assumes that your PC is set to 3, which is Intel's default, but not honored by most x86 OSes. So, if your PC is 2 (which is highly likely if you take your OS as given) you may need to set it to 3 before calling it. If you don't want to do that, there are alternatives.

One is to scale each value with updating scale factor as you find bigger number. This is the implementation you see in
ftp://ftp.netlib.org/blas/dnrm2.f
I tried implementing it in asm, but the result was horrible. Because fdiv is a non-pipelined long-latency instruction, there is no simple way to unroll the loop and thereby hide the latency without incurring too much overhead. (At least, that was why I dropped it.)

Another alternative I tried was scan the vector for the maximum element and use the reciprocal of it as the scale factor. This was actually pretty good, considering the inefficiency of accessing the same memory twice. It was better than the previous alternative on my P3 500MHz with PC100 SDRAM. Maybe that was due to prefetch, or maybe that was because I tested them using small vectors (100 by 1), so the cache contributed to this result.

These are what I found in numerical method textbooks. There may be other alternatives. But, since we are talking about x86, why not utilize 80-bit FPU registers provided by the hardware? It will make some traditional tricks in textbooks obsolete.
Posted on 2002-06-04 15:03:45 by Starless