The dot product (also called scalar product or inner product) of two vectors of dimension n is defined as:

<x, y> = x[0]*y[0] + x[1]*y[1] + ... + x[n-1]*y[n-1]
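In plain C++, that definition is just a reference loop (a minimal sketch; `dotp_ref` is a hypothetical name, not from any library):

```cpp
#include <cstddef>

// Minimal scalar reference for the definition above.
// dotp_ref is a hypothetical name, not from the original post.
float dotp_ref(const float* x, const float* y, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];  // accumulate x[i]*y[i]
    return sum;
}
```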

The inner product is a common operation in the BLAS (Basic Linear Algebra Subprograms). BLAS is a de facto application programming interface standard for libraries that perform basic linear algebra operations such as vector and matrix multiplication. The routines were first published in 1979, are used to build larger packages, and are heavily used in high-performance computing.

Let's assume that every array element is a single-precision (float) value and that both arrays are properly aligned to 16 bytes. Every array contains 256 elements, so caching shouldn't be an issue. I need a fast implementation for that operation. Who can help?

Gunther
Posted on 2010-08-13 12:58:26 by Gunther
You didn't specify what, where, and with what. I'm assuming Windows and an x86 or x86_64 asm module. Your post looks like it has been copy-pasted.

Anyway something like this should do it:
	; use jwasm
if @WordSize ne 8
.686
.xmm
.model flat,c
endif
option casemap:none

.code

;// extern "C" float fastdotp256(float *v1,float *v2);
public _fastdotp256
_fastdotp256:
if @WordSize eq 4
mov ecx,[esp+4]  ; first argument: v1
mov edx,[esp+8]  ; second argument: v2
v1 equ <ecx>
v2 equ <edx>
else
v1 equ <rcx>
v2 equ <rdx>
endif
xorps xmm0,xmm0
i = 0
repeat 256/4
movaps xmm1,[v1+i*16]  ; load 4 floats from v1
mulps xmm1,[v2+i*16]   ; multiply by 4 floats from v2
addps xmm0,xmm1
i = i + 1
endm
movhlps xmm1,xmm0
addps xmm1,xmm0
movq xmm0,xmm1
psrlq xmm0,32
addss xmm0,xmm1
movd eax,xmm0
ret

end


you could go faster with dpps (SSE4.1)
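drizz's dpps suggestion could look roughly like this (a hedged sketch: `_mm_dp_ps` with mask 0xF1 multiplies all four lanes and horizontally sums the products into lane 0; it assumes 16-byte-aligned input, n a multiple of 4, a CPU with SSE4.1, and a GCC/Clang-style target attribute; `dotp_sse41` is a hypothetical name):

```cpp
#include <smmintrin.h>  // SSE4.1 intrinsics (dpps)
#include <cstddef>

// Hedged sketch of the dpps idea. Assumptions: 16-byte-aligned input,
// n a multiple of 4, a CPU with SSE4.1, and a GCC/Clang target attribute;
// dotp_sse41 is a hypothetical name.
__attribute__((target("sse4.1")))
float dotp_sse41(const float* x, const float* y, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        // mask 0xF1: multiply all four lanes, sum into lane 0
        __m128 d = _mm_dp_ps(_mm_load_ps(x + i), _mm_load_ps(y + i), 0xF1);
        acc = _mm_add_ss(acc, d);  // running scalar sum in lane 0
    }
    float result;
    _mm_store_ss(&result, acc);
    return result;
}
```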
Posted on 2010-08-13 19:34:19 by drizz
Hello drizz,

thank you for that fast reply.

Your post looks like it has been copy-pasted.


No, it isn't.

You didn't specify what, where and with what.


That's right. It's an application for Fractal Image Compression. The program works under 32-bit UNIX and Windows and is written in C++. After a serious analysis with code walk-throughs and profiling, I've got the following result: the bottlenecks are exactly 6 procedures. For example, with an image of 256 x 256 pixels, those functions are called over 30 million times. The next question is: what exactly is the purpose of those 6 functions? They compute several statistical parameters, like the arithmetic mean, the variance or the covariance, which are needed for the so-called linear regression. In other words: hard number crunching.

At the moment, I'm writing a test bed (C++ and assembly language) to show what I've found so far.

Anyway something like this should do it:


Good SSE2 code, your ideas are welcome.

you could go faster with dpps (SSE4.1)


Sure, but SSE4.1 isn't very widespread. Furthermore, would it work with AMD?

Gunther




Posted on 2010-08-13 21:47:22 by Gunther
Sure, but SSE4.1 isn't very widespread. Furthermore, would it work with AMD?

AMD's CPUs have their own SSE4a (IMHO rather useless, especially when compared with SSE4.1's MPSADBW and dot product).
Posted on 2010-08-14 09:02:47 by ti_mo_n

you could go faster with dpps (SSE4.1)

Sure, but SSE4.1 isn't very widespread. Furthermore, would it work with AMD?
IIRC maybe next year ;). There's always cpuid and different code paths.
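The "different code paths" idea can be sketched with modern GCC's `__builtin_cpu_supports` (an assumption: that builtin only exists in GCC 4.8+, not the gcc 3.4.5 used elsewhere in this thread; all function names here are hypothetical):

```cpp
#include <cstddef>

// Sketch of runtime dispatch via cpuid. Both paths share the scalar loop
// so the sketch runs anywhere; a real program would call a dpps-based
// routine in the SSE4.1 branch. All names are hypothetical.
static float dotp_scalar(const float* x, const float* y, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}

float dotp_dispatch(const float* x, const float* y, std::size_t n) {
    if (__builtin_cpu_supports("sse4.1"))
        return dotp_scalar(x, y, n);  // placeholder for the SSE4.1 path
    return dotp_scalar(x, y, n);      // generic fallback
}
```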


At the moment, I'm writing a test bed (C++ and assembly language) to show what I've found so far.

Posting that testbed and the C++ code for those 6 functions is the way to go. Maybe some optimizations can be done on the algorithmic/source level. Maybe some parts don't really need to be floats? What's the image format?


Anyway something like this should do it:


Good SSE2 code, your ideas are welcome.

You can't really out-optimize mul + add. Using intrinsics and/or optimization options (loop unrolling, architecture), you can produce the same code without asm, I'm sure.

Posted on 2010-08-14 09:14:07 by drizz

Posting that testbed and the C++ code for those 6 functions is the way to go. Maybe some optimizations can be done on the algorithmic/source level.


Sure. The only crucial point is the dot product. But how can I attach files to a forum message? It seems that's not enabled for my account.

Maybe some parts don't really need to be floats? What's the image format?


The Encoder needs TGA images and the Decoder's output is a TGA file, too. But that isn't the point. Internally, it's a bad idea to operate with the RGB color model; it's better to use YCbCr. Furthermore, the entire algorithm makes it necessary to shrink the original image into smaller parts, so one has to compute the arithmetic mean of several pixels; the result is a float or double. I've tested rounding to integer data types; it works, but you pay a price: the image quality isn't as good. At present, floats are the only way to go.
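The RGB-to-luma step mentioned above can be sketched with the standard ITU-R BT.601 weights (`rgb_to_y` is a hypothetical helper, not code from Gunther's program):

```cpp
// Hedged sketch of the RGB -> Y step, using the standard ITU-R BT.601
// luma weights; rgb_to_y is a hypothetical helper, not code from the
// program discussed in this thread.
float rgb_to_y(float r, float g, float b) {
    return 0.299f * r + 0.587f * g + 0.114f * b;  // weights sum to 1.0
}
```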

Using intrinsics and/or optimization options (loop unrolling, architecture), you can produce the same code without asm, I'm sure.


I'm not sure. But let's discuss that point on the basis of my test program.

Gunther


Posted on 2010-08-14 10:25:41 by Gunther
Using intrinsics and/or optimization options (loop unrolling, architecture), you can produce the same code without asm, I'm sure.


I'm not sure. But let's discuss that point on the basis of my test program.


gcc 3.4.5 mingw

(I'm not overly familiar with all of gcc's optimization options)

gcc -O3 -w -c -march=pentium4 -mfpmath=sse -msse2 -mtune=pentium4 -ffast-math -fno-inline-functions -funroll-loops -fomit-frame-pointer

float dotp256_intrin(float *v1, float *v2)
{
__m128 xmm1, *xv1=(__m128 *)v1, *xv2=(__m128 *)v2;
__m128 xmm0 = {0,0,0,0};

for (int i = 0; i < 256/4; i++)
{
xmm1 = _mm_mul_ps(xv1[i],xv2[i]);
xmm0 = _mm_add_ps(xmm0,xmm1);
}

xmm1 = _mm_movehl_ps(xmm1,xmm0);
xmm1 = _mm_add_ps(xmm1,xmm0);
xmm0 = xmm1;
xmm0 = _mm_castsi128_ps(_mm_srli_epi64(_mm_castps_si128(xmm0),32));
xmm0 = _mm_add_ss(xmm0,xmm1);

float result;
_mm_store_ss( &result, xmm0);
return result;
}


CPU Disasm
Address    Command                                                    Comments
$ ==>      push ebx
$+1        xorps xmm2,xmm2
$+4        xor eax,eax
$+6        sub esp,4
$+9        mov ebx,3F
$+E        mov ecx,
$+12       mov edx,
$+16       /movaps xmm7,
$+1A       |mulps xmm7,
$+1E       |addps xmm2,xmm7
$+21       |movaps xmm6,
$+26       |mulps xmm6,
$+2B       |movaps xmm5,
$+30       |movaps xmm4,
$+35       |mulps xmm5,
$+3A       |addps xmm2,xmm6
$+3D       |mulps xmm4,
$+42       |movaps xmm3,
$+47       |addps xmm2,xmm5
$+4A       |movaps xmm1,
$+4F       |mulps xmm3,
$+54       |addps xmm2,xmm4
$+57       |mulps xmm1,
$+5C       |movaps xmm0,
$+61       |addps xmm2,xmm3
$+64       |addps xmm2,xmm1
$+67       |mulps xmm0,
$+6C       |movaps xmm1,
$+71       |mulps xmm1,
$+76       |addps xmm2,xmm0
$+79       |sub eax,-80
$+7C       |sub ebx,8
$+7F       |addps xmm2,xmm1
$+82       \jns short 004014FE
$+84       movhlps xmm1,xmm2
$+87       addps xmm1,xmm2
$+8A       movdqa xmm2,xmm1
$+8E       psrlq xmm2,20
$+93       addss xmm2,xmm1
$+97       movss ,xmm2
$+9C       fld dword ptr
$+9F       add esp,4
$+A2       pop ebx
$+A3       retn

Posted on 2010-08-14 12:36:02 by drizz
Hello drizz,

(I'm not overly familiar with all of gcc's optimization options)


Never mind, I can figure that out. Thank you for your intrinsics code. Until now I gave intrinsics a wide berth, but it seems they're a good alternative, especially with the goal of migrating software to another operating system. Do you know any documentation for the whole bunch of gcc intrinsics?

Anyway, I've included your code in my test bed. My goal was to write several variants of the dot product, not only for the special case of a vector length of 256. The main program defines 2 floating-point vectors (aligned to 16 bytes), calls the 6 different functions 5 million times and measures the time. It's written in C++ and with gcc inline assembler. It compiles with gcc 3.4.5 and later. I made timings under WinXP and Linux (both 32-bit).

The interesting thing is that your intrinsics code always competes with a simple SSE2 implementation (4 array elements per loop cycle). But more sophisticated SSE2 implementations are a lot faster. I could publish the timings here, but that doesn't make sense without the complete source code, so that anyone interested can test it. Could I send the source code to you via mail? I'm sure there is room for improvement. The problem is that my forum account doesn't allow attachments.
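The gain of the "more sophisticated" variants comes largely from multiple independent accumulators, which break the serial dependency chain of a single running sum. A minimal scalar sketch of the idea (`dotp_2acc` is a hypothetical name; it assumes n is even):

```cpp
#include <cstddef>

// Scalar sketch of the two-accumulator idea: two independent partial sums
// break the serial dependency chain of a single running total, so the adds
// can overlap in the pipeline. dotp_2acc is a hypothetical name; assumes n
// is a multiple of 2.
float dotp_2acc(const float* x, const float* y, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f;
    for (std::size_t i = 0; i < n; i += 2) {
        s0 += x[i]     * y[i];      // accumulator 0: even elements
        s1 += x[i + 1] * y[i + 1];  // accumulator 1: odd elements
    }
    return s0 + s1;  // combine the partial sums once at the end
}
```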

Gunther
 
Posted on 2010-08-15 13:19:36 by Gunther

Never mind, I can figure that out. Thank you for your intrinsics code. Until now I gave intrinsics a wide berth, but it seems they're a good alternative, especially with the goal of migrating software to another operating system. Do you know any documentation for the whole bunch of gcc intrinsics?
There's a prototype for the intrinsic under every instruction in the Intel manuals. You can also use MSDN :). I don't think they differ between compilers (the SIMD ones).


Anyway, I've included your code in my test bed. My goal was to write several variants for the dot product, not only for that special case of a vector length of 256. The main program defines 2 floating point vectors (aligned by 16), calls the 6 different functions 5 million times and measures the time. It's written in C++ and with gcc inline assembler. It compiles with gcc 3.4.5 and better. I made timings under WinXP and Linux (both 32 bit).
It would also be interesting to see how good auto-vectorization is in the newer versions of gcc. I don't have a newer version installed here at the moment.

The problem is that my forum account doesn't allow attachments.
Doesn't allow attachments? I don't know if there are any restrictions for new members; attaching works for me. Just don't click preview when you choose a file, go directly to submit.
Attachments:
Posted on 2010-08-15 15:18:41 by drizz

I don't know if there are any restrictions for new members, attaching works for me.


Yes, there are. Once a new member reaches 5 posts, which ensures greater certainty that they are not a spam-bot, the restriction is lifted.
Posted on 2010-08-15 15:38:49 by SpooK
Hello drizz,

Yes, there are. Once a new member reaches 5 posts, which ensures greater certainty that they are not a spam-bot, the restriction is lifted.


So, you see: there are restrictions. Therefore, here is my proposal again: send me your mail address, because I don't know if and when that feature will be activated. Believe me: it isn't spam, only source code. Really!

By the way, after registration to that forum, I received a mail with that content:

Before you can login and start using the forum, your request will be reviewed and approved. When this happens, you will receive another email from this address.


I waited, and waited ... but no other mail came. So I had to send another mail after a few days, and now I'm here. Never mind, I'm patient.

It would also be interesting to see how good auto-vectorization is in the newer versions of gcc.


I'll try that under Linux; gcc 4.1 is installed there. I hope it works.

Gunther



Posted on 2010-08-15 18:56:49 by Gunther

So, you see: there are restrictions.


Look at your current post count, and my original statement.
Posted on 2010-08-15 19:26:13 by SpooK
Believe me: it isn't spam, only source code. Really!
I've sent you a PM, though it would be better if you posted your code here.

Never mind, I'm patient.
Are there impatient programmers?  :)
Posted on 2010-08-17 07:56:20 by drizz
... though it would be better if you posted your code here.


Right. I'll prepare the sources for uploading here. Please give me 1 or 2 days, because I have to convert the inline assembler part into Intel syntax for better readability. That has the drawback that the source won't compile under MacOS X, but what the heck. Maybe I'll make 2 different versions.

Gunther
Posted on 2010-08-17 09:24:22 by Gunther
You don't have to; I never said I can't read it.  ;)
Posted on 2010-08-17 10:12:49 by drizz
You don't have to; I never said I can't read it.


I know that, but what about other members? My intention is to get ideas for improving the code.

Gunther
Posted on 2010-08-17 11:14:47 by Gunther
Hello drizz and other interested members,

here are the sources for the test program. The archive contains the following files:


  • results.pdf - contains the times and a description of the test environment.

  • build_dotfloat.bat - batch file to compile the running EXE under Windows

  • dotfloat.cpp - the main procedure

  • dotfloatfa.cpp - the inline assembly functions

  • dotfloatfc.cpp - C++ functions

  • dotfloatintrin.cpp - C++ code with intrinsics, designed by drizz. Thank you for that again.

  • features.cpp - code to check the available processor features, nothing exciting.



Compile the running application with gcc; beforehand, have a look into the batch file for the different compiler switches. That'll also give hints for building the application at the command line under UNIX-based systems.

The program will run under Linux, BSD, MacOS X and Windows (all 32 bit).

The software is in an experimental state; nothing is finished or the last word. Any proposals for improvements are highly welcome, as are test results on other platforms and different processors.

Gunther





Attachments:
Posted on 2010-08-17 18:52:58 by Gunther
Hi,

First, few notes:

- your asm does a do-while loop, while the C++ code uses a for() loop
- your asm assumes array lengths that are a multiple of 4/8/16, while the C++ code only knows x4, and the FPU code x1/x2
- prebuilt binaries are needed for people who just want to post their test results (mainly Win users without gcc :))

A "trick" to remove 1 lea/add from the loop

; before the loop, subtract the pointers
sub eax,edx            ; subl %edx,%eax
lo:
...
movaps xmm0,[eax+edx]  ; movaps (%eax,%edx),%xmm0  ; the address evaluates to the correct spot
mulps xmm0,[edx]       ; mulps (%edx),%xmm0        ; the other operand stays normal
add edx,16             ; addl $16,%edx             ; by increasing one pointer you effectively increase both
...
jxx lo
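The same trick can be expressed at the C++ level, which may help readers who don't think in asm (a sketch; `dotp_onecounter` is a hypothetical name, and the cross-array pointer arithmetic is technically implementation-defined but matches what the asm does on a flat address space):

```cpp
#include <cstddef>

// C++-level illustration of the pointer-difference trick: precompute the
// byte offset x - y once, then one running index addresses both arrays
// (on x86 this maps to movaps (%eax,%edx)-style addressing with a single
// pointer update per iteration). dotp_onecounter is a hypothetical name.
float dotp_onecounter(const float* x, const float* y, std::size_t n) {
    const char* base = reinterpret_cast<const char*>(y);
    const std::ptrdiff_t diff =
        reinterpret_cast<const char*>(x) - base;  // the "sub eax,edx" step
    float sum = 0.0f;
    for (std::size_t i = 0; i < n * sizeof(float); i += sizeof(float)) {
        float xv = *reinterpret_cast<const float*>(base + diff + i);  // x element
        float yv = *reinterpret_cast<const float*>(base + i);         // y element
        sum += xv * yv;
    }
    return sum;
}
```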


These are my results:
XP SP3, AMD Opteron 148 (~2.2GHz)

Supported by Processor and installed Operating System:
------------------------------------------------------

  Pentium 4 Instruction Set,
+ FPU (floating point unit) on chip,
+ support of FXSAVE and FXRSTOR,
+ 57 MMX Instructions,
+ 70 SSE (Katmai) Instructions,
+ 144 SSE2 (Willamette) Instructions,
+ 13 SSE3 (Prescott) Instructions.


Calculating the dot product in 6 different variations.
That'll take a little while ...

Straight forward C++ implementation (FPU Code):
-----------------------------------------------

Dot Product 1 = 2867507200.00
Elapsed Time  = 18.84 Seconds

C++ implementation (FPU Code - 2 Accumulators):
-----------------------------------------------

Dot Product 2 = 2867507200.00
Elapsed Time  = 9.45 Seconds

C++ Code with intrinsics (Original Code by Drizz):
--------------------------------------------------

Dot Product 3 = 2867507200.00
Elapsed Time  = 4.77 Seconds

Simple SSE2 Code (1 Accumulator - 4 elements per cycle):
--------------------------------------------------------

Dot Product 4 = 2867507200.00
Elapsed Time  = 4.83 Seconds

Solid SSE2 Code (2 Accumulators - 8 elements per cycle):
--------------------------------------------------------

Dot Product 5 = 2867507200.00
Elapsed Time  = 3.22 Seconds

Better SSE2 Code (2 Accumulators - 16 elements per cycle):
----------------------------------------------------------

Dot Product 6 = 2867507200.00
Elapsed Time  = 2.91 Seconds


You said you have other functions too? There's little you can do with a mul+add loop.

Posted on 2010-08-17 21:42:54 by drizz
Drizz,

thank you for the posted results and your remarks.

- your asm assumes array lengths that are a multiple of 4/8/16, while the C++ code only knows x4, and the FPU code x1/x2


Right. I coded that special case to show the principle. A generic implementation should test that, of course. But as I wrote before:

The software is in experimental state; nothing is finished or the last word.


- prebuilt binaries are needed for people who just want to post their test results (mainly Win users without gcc)


Do you mean the running EXE for Windows? Under Unix (Linux, BSD and MacOS X) that isn't necessary, because a version of GCC should be available by default. Is it possible to replace the archive with that file added?

A "trick" to remove 1 lea/add from the loop


I'll think about that. But you do know that both registers point to different memory locations?

You said you have other functions too ?


Yes, but those are not the point. Simple arithmetic mean etc. But the dot product is tricky: it isn't calculated just once per call, but eight times. That has to do with the algorithm, which involves rotations and mirrorings of an image part. Therefore: every saved nanosecond counts.

Gunther






Posted on 2010-08-18 07:41:07 by Gunther
Do you mean the running EXE for Windows? With Unix (Linux, BSD and MacOS X) that isn't necessary, because a version of GCC should be available by default. Is it possible to replace the archive with that added file?
1) Yes. 2) I know. 3) Yes.


A "trick" to remove 1 lea/add from the loop

But you know that both registers point to different memory locations?
Yes. The memory address [base + index*scale + disp] is first calculated, then accessed (any combination you like, except the scale, which must be 1, 2, 4 or 8).
So:
Z = X - Y
Z + Y = X
(EDIT: forgot to write: modulo 2^32)

LESS_ATTTM  8)
float DotXMM2Acc16E_v2(float X[],float Y[],unsigned int m)
{
#define AS(...) #__VA_ARGS__ "\n\t"
__asm__ __volatile__(
AS ( .set xmm0,%%xmm0 )
AS ( .set xmm1,%%xmm1 )
AS ( .set xmm2,%%xmm2 )
AS ( .set xmm3,%%xmm3 )
AS ( .set xmm4,%%xmm4 )
AS ( .set xmm5,%%xmm5 )
AS ( .set xmm6,%%xmm6 )
AS ( .set xmm7,%%xmm7 )
AS ( .set eax,%%eax )
AS ( .set edx,%%edx )
AS ( .set ecx,%%ecx )
AS ( .set esp,%%esp )
AS ( .set sub,subl )
AS ( .set mov,movl )
AS ( .set lea,leal )

AS ( pxor xmm4,xmm4 )

// sums are in xmm4 and xmm5

AS ( mov 12(esp),ecx ) //ecx = n
AS ( mov 4(esp),eax ) //eax -> X
AS ( mov 8(esp),edx ) //edx -> Y
AS ( pxor xmm5,xmm5 )
AS ( shl $2,ecx ) //ecx = 4*n (float)
AS ( sub $4,esp ) //function result
AS ( sub edx,eax ) // <--------------------------------

// init pattern 1. iteration

AS ( movaps (eax,edx),xmm0 ) //xmm0 = X X X X <--------- etc.
AS ( movaps 16(eax,edx),xmm1 ) //xmm1 = X X X X
AS ( mulps (edx),xmm0 ) //xmm0 = X*Y X*Y X*Y X*Y
AS ( sub $64,ecx ) //we need n-1 iterations
AS ( jz 2f ) //jump: only 16 elements

// main loop pattern

AS ( .align 16 )
AS ( 1: )
AS ( movaps 32(eax,edx),xmm2) //xmm2 = X X X X
AS ( mulps 16(edx),xmm1 ) //xmm1 = X*Y X*Y X*Y X*Y
AS ( addps xmm0,xmm4 ) //sum up
AS ( movaps 48(eax,edx),xmm3) //xmm3 = X X X X
AS ( mulps 32(edx),xmm2 ) //xmm2 = X*Y X*Y X*Y X*Y
AS ( addps xmm1,xmm5 ) //sum up
AS ( movaps 64(eax,edx),xmm0) //xmm0 = X X X X
AS ( mulps 48(edx),xmm3 ) //xmm3 = X*Y X*Y X*Y X*Y
AS ( lea 64(edx),edx ) //update Y pointer
AS ( addps xmm2,xmm4 ) //sum up
AS ( movaps 16(eax,edx),xmm1) //xmm1 = X X X X
AS ( mulps (edx),xmm0 ) //xmm0 = X*Y X*Y X*Y X*Y
AS ( addps xmm3,xmm5 ) //sum up
AS ( sub $64,ecx ) //count down
AS ( jnz 1b )

// exit pattern last iteration

AS ( 2: )
AS ( movaps 32(eax,edx),xmm2 ) //xmm2 = X X X X
AS ( mulps 16(edx),xmm1 ) //xmm1 = X*Y X*Y X*Y X*Y
AS ( addps xmm0,xmm4 ) //sum up
AS ( movaps 48(eax,edx),xmm3 ) //xmm3 = X X X X
AS ( mulps 32(edx),xmm2 ) //xmm2 = X*Y X*Y X*Y X*Y
AS ( addps xmm1,xmm5 ) //sum up
AS ( mulps 48(edx),xmm3 ) //xmm3 = X*Y X*Y X*Y X*Y
AS ( addps xmm2,xmm4 ) //sum up
AS ( addps xmm3,xmm5 ) //sum up

// horizontal addition

AS ( addps xmm5,xmm4 ) //sum in xmm4
AS ( movhlps xmm4,xmm0 ) //get bits 64 - 127 from xmm4
AS ( addps xmm0,xmm4 ) //sums in 2 dwords
AS ( pshufd $1,xmm4,xmm0 ) //get bits 32 - 63 from xmm4
AS ( addss xmm0,xmm4 ) //sum in 1 dword
AS ( movss xmm4,(esp) ) //store sum
AS ( flds (esp) ) //load function result
AS ( sub $-4,esp ) //adjust stack
AS ( ret )
:
:"g"(m)
:"%ecx","%eax","%edx"
);
}
- adjusted tabs for display on the forum


removing 1 lea saves ~0.14 sec
Posted on 2010-08-19 10:23:54 by drizz