I came to the following solution; maybe someone can find a better one?

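a = X - Y, where X > Y

X^2 - Y^2 = X^2 - (X-a)(X-a) = X^2 - X^2 + aX + aX - a^2 =

aX + aX - a^2 = a(X + X - a) = a(2X - a)

algo:

```
; eax = X, ecx = Y
;-------------------
sub ecx,eax             ; ecx = Y - X = -a
lea eax,[eax*2][ecx]    ; eax = 2X - a
neg ecx                 ; ecx = a
mul ecx                 ; edx:eax = a * (2X - a)
```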
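For anyone who wants to check the register steps without an assembler, here is a minimal C sketch that mirrors them with 32-bit unsigned variables. The function name `svin1` is only a label for this illustration, not something from the thread, and `mul` is modelled as a 32x32 to 64-bit multiply, which is what ends up in edx:eax.

```c
#include <stdint.h>

/* Hypothetical model of the four-instruction sequence.
   On entry eax = X and ecx = Y; the return value is the edx:eax product. */
uint64_t svin1(uint32_t eax, uint32_t ecx)
{
    ecx = ecx - eax;            /* sub ecx,eax          : ecx = Y - X = -a    */
    eax = eax * 2u + ecx;       /* lea eax,[eax*2][ecx] : eax = 2X - a        */
    ecx = 0u - ecx;             /* neg ecx              : ecx = a = X - Y     */
    return (uint64_t)eax * ecx; /* mul ecx              : edx:eax = a*(2X-a)  */
}
```

With X = 5 and Y = 3 the model gives a = 2 and 2X - a = 8, so the product is 16 = 25 - 9. The full 64-bit result is exact as long as X >= Y and X + Y fits in 32 bits.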

Very nice, Svin!

I don't think you can do it any better ;)

Your algebra is confusing though. This may help others understand it better:

a = x - y

x^2 - y^2 =

( x - y )( x + y ) =

( x - ( x - a ))( x + ( x - a )) =

a( 2x - a )

This is shorter, hope I got it right:

```
sub eax, ecx            ; eax = x - y
lea ecx, [eax+2*ecx]    ; ecx = (x - y) + 2y = x + y
mul ecx                 ; edx:eax = (x - y) * (x + y)
```

Thomas

Yes, Thomas, I've come to the same solution, but shorter:

```
lea edx,[eax][ecx]      ; edx = x + y
sub eax,ecx             ; eax = x - y
mul edx                 ; edx:eax = (x - y) * (x + y)
```
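Both three-instruction sequences compute the factored form (x - y)(x + y) directly. Below is a small C sketch of the two register orders, assuming eax = x and ecx = y on entry. The function names `thomas` and `svin2` are only labels for this illustration.

```c
#include <stdint.h>

/* Thomas's order: take the difference first, then rebuild the sum from it. */
uint64_t thomas(uint32_t eax, uint32_t ecx)
{
    eax = eax - ecx;            /* sub eax,ecx         : eax = x - y          */
    ecx = eax + 2u * ecx;       /* lea ecx,[eax+2*ecx] : ecx = x - y + 2y     */
    return (uint64_t)eax * ecx; /* mul ecx             : edx:eax = (x-y)(x+y) */
}

/* Svin's order: the sum goes to a third register, so the lea and the sub
   do not depend on each other. */
uint64_t svin2(uint32_t eax, uint32_t ecx)
{
    uint32_t edx = eax + ecx;   /* lea edx,[eax][ecx]  : edx = x + y          */
    eax = eax - ecx;            /* sub eax,ecx         : eax = x - y          */
    return (uint64_t)eax * edx; /* mul edx             : edx:eax = (x-y)(x+y) */
}
```

In the first version the lea needs the result of the sub, while in the second the two are independent. Using edx as scratch costs nothing extra, since mul overwrites edx with the high half of the product anyway.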

I just found the same one, but with ebx instead of edx :)

When I test them with maverick's profile code, I get this:

Svin1: 5 cycles

Thomas: 6 cycles

Svin2: 7 cycles

However when I put a REPEAT 20 around them, I get these results:

Svin1: 139 [6.95 cycles per iteration]

Thomas: 139 [6.95 cycles per iteration]

Svin2: 117 [5.85 cycles per iteration]

strange :confused:...

Thomas

P.S. all procs were aligned to 64 bytes

Svin2 uses the factored form (x-y)(x+y) directly instead of his original substitution formula. Either way there's only 1 multiplication instead of the 2 in a naive x*x - y*y... maybe that's the best one possible?

"Your algebra is confusing though"

:)

I just avoided the standard transformation of X^2-Y^2, that's it.

Sometimes I avoid the standard (school) ways, while of course following the rules. It sometimes helps to draw attention to some non-obvious sides. For example, to see X^3 not as X*X*X but as the sum of the first X terms of a recurrent sequence that starts at X, where each next term is the previous one plus 2X:

3^3 = 3+(3+6)+(9+6).

It's mostly reinventing the wheel, but sometimes it can lead to interesting notions in particular cases.

It's like the Buddha stone park, where from any side you can see just a part of the whole picture.
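The X^3 example can be checked numerically. This is a minimal sketch of one reading of the post: add up X terms, starting at X, with each term 2X larger than the previous one. The function name `cube_by_sum` is invented for the illustration.

```c
#include <stdint.h>

/* Sum of X terms of the sequence X, X + 2X, X + 4X, ...
   Under the reading above this should equal X^3. */
uint64_t cube_by_sum(uint32_t x)
{
    uint64_t term = x;   /* first term is X itself */
    uint64_t sum = 0;
    for (uint32_t i = 0; i < x; i++) {
        sum += term;
        term += 2u * (uint64_t)x;   /* next term = previous term + 2X */
    }
    return sum;
}
```

For x = 3 the terms are 3, 9 and 15, which sum to 27. In general the sum is X*X + 2X*(0 + 1 + ... + (X-1)) = X^2 + X^2*(X-1) = X^3.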

You'll never see such behaviour on e.g. a 486 or a Pentium, but Athlons are definitely "weird".

Thanks, I thought it would be something like that.. Athlons are definitely mysterious in their optimisations :).

Thomas
