I came to the following solution;
maybe someone can find a better one?


a = X - Y, where X > Y (so Y = X - a)
X^2 - Y^2 = X^2 - (X-a)(X-a) = X^2 - X^2 + aX + aX - a^2 =
aX + aX - a^2 = a(X + X - a) = a(2X - a)
algo:

; eax=X, ecx=Y
;-------------------
sub ecx,eax          ; ecx = Y - X = -a
lea eax,[eax*2][ecx] ; eax = 2X - a
neg ecx              ; ecx = a
mul ecx              ; edx:eax = a(2X - a)
Posted on 2002-04-06 16:19:40 by The Svin
Very nice, Svin!

I don't think you can do it any better ;)



Your algebra is confusing though. This may help others understand it better:

a = x - y

x^2 - y^2 =
( x - y )( x + y ) =
( x - ( x - a ))( x + ( x - a )) =
a( 2x - a )
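
For anyone who wants to check the algebra numerically, here is a minimal C sketch. It is not from the posts above, and the function names are made up for illustration. It compares the substitution form a(2x - a) with the direct x^2 - y^2 in 64-bit arithmetic, assuming x > y and inputs small enough that nothing overflows.

```c
#include <stdint.h>

/* Direct form: x^2 - y^2 (hypothetical helper name). */
uint64_t diff_squares_direct(uint64_t x, uint64_t y)
{
    return x * x - y * y;
}

/* Substitution form: a(2x - a) with a = x - y (hypothetical helper name). */
uint64_t diff_squares_subst(uint64_t x, uint64_t y)
{
    uint64_t a = x - y;
    return a * (2 * x - a);
}

/* Returns 1 if both forms agree for every 0 <= y < x <= limit, else 0. */
int forms_agree(uint64_t limit)
{
    for (uint64_t x = 1; x <= limit; ++x)
        for (uint64_t y = 0; y < x; ++y)
            if (diff_squares_direct(x, y) != diff_squares_subst(x, y))
                return 0;
    return 1;
}
```

Calling forms_agree with a small limit is enough to see that the two forms match over that range.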
Posted on 2002-04-06 16:40:46 by iblis
This is shorter, hope I got it right:


sub eax, ecx         ; eax = x-y
lea ecx, [eax+2*ecx] ; ecx = x-y + 2y = x+y
mul ecx              ; edx:eax = (x-y)*(x+y)


Thomas
Posted on 2002-04-06 16:50:03 by Thomas
Yes, Thomas, I've come to the same solution, but shorter:


lea edx,[eax][ecx] ; edx = x+y
sub eax,ecx        ; eax = x-y
mul edx            ; edx:eax = (x-y)*(x+y)
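
To compare the three sequences posted so far, here is a rough C model of what each one does to the registers. It is only a sketch: the function names (mul32, svin1, thomas, svin2) are invented here, uint32_t stands in for a 32-bit register, and mul32 models the unsigned MUL that leaves a 64-bit result in edx:eax. One caveat that follows from the 32-bit wrap-around: the full 64-bit result equals x^2 - y^2 only while x + y still fits in 32 bits. The low 32 bits are correct either way.

```c
#include <stdint.h>

/* Models unsigned 32x32 -> 64-bit MUL; the return value plays the role of edx:eax. */
uint64_t mul32(uint32_t eax, uint32_t operand)
{
    return (uint64_t)eax * operand;
}

/* First version: sub / lea / neg / mul. Inputs: eax = X, ecx = Y. */
uint64_t svin1(uint32_t eax, uint32_t ecx)
{
    ecx = (uint32_t)(ecx - eax);        /* sub ecx,eax : Y - X = -a (wraps) */
    eax = (uint32_t)(eax * 2u + ecx);   /* lea eax,[eax*2][ecx] : 2X - a    */
    ecx = (uint32_t)(0u - ecx);         /* neg ecx : a                      */
    return mul32(eax, ecx);             /* mul ecx                          */
}

/* Thomas's version: sub / lea / mul. */
uint64_t thomas(uint32_t eax, uint32_t ecx)
{
    eax = (uint32_t)(eax - ecx);        /* sub eax,ecx : x - y              */
    ecx = (uint32_t)(eax + 2u * ecx);   /* lea ecx,[eax+2*ecx] : x + y      */
    return mul32(eax, ecx);             /* mul ecx                          */
}

/* Second version: lea / sub / mul. */
uint64_t svin2(uint32_t eax, uint32_t ecx)
{
    uint32_t edx = (uint32_t)(eax + ecx); /* lea edx,[eax][ecx] : x + y     */
    eax = (uint32_t)(eax - ecx);          /* sub eax,ecx : x - y            */
    return mul32(eax, edx);               /* mul edx                        */
}
```

For example, svin1(5, 3), thomas(5, 3) and svin2(5, 3) all return 16 in this model.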
Posted on 2002-04-06 16:57:18 by The Svin
I just found the same one, but with ebx instead of edx :)
When I test them with maverick's profile code, I get this:

Svin1: 5 cycles
Thomas: 6 cycles
Svin2: 7 cycles

However when I put a REPEAT 20 around them, I get these results:
Svin1: 139 [6.95 cycles per iteration]
Thomas: 139 [6.95 cycles per iteration]
Svin2: 117 [5.85 cycles per iteration]

strange :confused:...

Thomas
P.S. all procs were aligned to 64 bytes
Posted on 2002-04-06 17:07:10 by Thomas
Svin2 uses the factored form (x-y)(x+y) instead of his original substitution formula. There's only 1 multiplication instead of 2... maybe that's the best one possible?
Posted on 2002-04-06 18:05:54 by iblis
Your algebra is confusing though

:)
I just avoided the standard transformation of X^2 - Y^2, that's it.
Sometimes I avoid the standard (school) ways, while of course following the rules. It sometimes helps to draw attention to some non-obvious sides. For example, to see X^3 not as X*X*X but as the sum of the first X elements of a recurrent sequence that starts at X, where each next element = previous element + 2X:
3^3 = 3 + (3+6) + (9+6).
It's mostly reinventing the wheel, but sometimes it can lead to interesting notions in particular cases.
It's like the Buddha stone park, where from any side you can see just a part of the whole picture.
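
The cube example can be tried out with a few lines of C. This is only an illustrative sketch and the function name is made up: it adds up X elements of the sequence that starts at X and grows by 2X at each step, so for X = 3 it sums 3, 9 and 15.

```c
#include <stdint.h>

/* Sums x elements of the sequence x, x + 2x, x + 4x, ... (hypothetical helper name). */
uint64_t cube_by_sequence(uint64_t x)
{
    uint64_t term = x;   /* first element is x */
    uint64_t sum = 0;
    for (uint64_t i = 0; i < x; ++i) {
        sum += term;     /* accumulate the current element */
        term += 2 * x;   /* next element = previous element + 2x */
    }
    return sum;
}
```

The sum is x*x + 2x*(0 + 1 + ... + (x-1)) = x^2 + x^2*(x-1) = x^3, which is why the trick works.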
Posted on 2002-04-06 21:05:19 by The Svin

That's normal, assuming you have an Athlon, which has very complex decoders that self-adapt to the code. So 1 iteration is not enough for those further automatic CPU optimizations to kick in, while 20 are plenty.
You'll never see such behaviour on e.g. a 486 or a Pentium, but Athlons are definitely "weird".
Posted on 2002-04-07 05:00:19 by Maverick
Thanks, I thought it would be something like that.. Athlons are definitely mysterious in their optimisations :).

Thomas
Posted on 2002-04-07 05:30:19 by Thomas