PIII ROUTINES : PMULB and PINC emulations

Hi

You probably noticed that you can MMX multiply 8 nibbles at a time, for example by 9 :
movq mm0,
movq mm1,mm0
psllq mm0,3
paddb mm0,mm1

Or a bit by 2... up to 255 :
PCMPEQB mm0,[0x0101...01]
PAND mm0,


Now, my problem :
I want to INC a qword, the faster possible. I have several 6&7 instructions solution, not satisfying at all.
The trouble is, in my program values often go beyond 2^32. For instance I will have :

11111111
11111111
11111111
11111111
11111111
00000000
11111111
00000000

Other precisions :

1/ it is better if I can keep the result to its MMX register.
(I searched add64 and so on on this forum, no real trick !?)

2/ it is a subproblem : I would like to find 1st 0 of an MMX column, i.e. if I have

ABCDEFGH
10000000 1
11100000 2
10100000 3
00000000 4 => first 0 of column A
00000000 5
10100000 6
00000000 7
10100000 8

I would like to find quickly 1st 0 (or 1) of, say, column A.
Posted on 2003-01-13 09:55:43 by valy
AA 00 CC DD 00 FF GG HH    Value

; cmp
00 FF 00 00 FF 00 00 00 Is Zero
; or
00 00 00 00 FF 00 00 FF Has been zero
; not
FF 00 FF FF 00 FF FF 00 Not has been zero
; mov
01 01 01 01 01 01 01 01 Increment Value
; and
01 00 01 01 00 01 01 00 Add mask
; add
06 05 06 06 01 06 06 04 First Zero
Posted on 2003-01-13 13:00:49 by bitRAKE
:stupid: Sorry, too much for my single neurone :confused:
Er... I presume your AA to HH mean bytes 1 to 8 of an MMX register, byte 1 being the LSB, byte 8 the MSB.
Actually in my example above, column A designed least significant bit of each byte.

Now to your code.
So 2nd line "cmp" means a PCMPEQB, up to here all right (hope so).
Next you ORe with a particular mask, dunno why, let us admit it.
Next you NOTe then INC and AND, it is OK, no comment.
Last you PADDB, byte 1 being 06... to byte 8 being 04 ?! Or is it the result ? In any case what does it stand for ?

Thx in advance.
Posted on 2003-01-14 10:17:11 by valy
Here is one way of isolating first '1' of each MMX column.

movq mm0, ;data

movq mm1,mm0
psllq mm1,8
por mm1,mm0

movq mm2,mm1
psllq mm2,16
por mm1,mm2

movq mm2,mm1
psllq mm2,32
por mm1,mm2

movq mm2,mm1
psllq mm1,8
pandn mm1,mm2

A bit too long, though. Anyone has better, or retrieve first position indexes ?!
Posted on 2003-01-15 10:26:40 by valy
Another (unchecked yet) solution to isolate first ones ('1') of a column and zeroe remaining part :

.data
mmxtri db 1,3,7,15,31,63,127,255
col0 db 1,1,1,1,1,1,1,1

1/ copy from column to 8 MMX bytes :

movq mm0, ;data

psllq mm0,7
pmovmskb eax,mm0
movd mm1,eax ;byte 1
punpcklbw mm1,mm1 ;bytes 1-2
pshufw mm1,mm1,0 ;bytes 1-8

2/ get rid of the casual '1' after 1st '0' :

pand mm1,
pcmpeqb mm1,
pand mm1,


So, if input is
ABCDEFGH
1...
1
1
1
0
1
0
1

output should be
ABCDEFGH
1
1
1
1
0
0
0
0

There is certainly better... any idea ? And if anyone can explain Bitrake's trick ?! :confused:
Thx, bye
Posted on 2003-01-16 11:01:23 by valy
I'm sorry I though you were talking about bytes of one/zero. I have to play with the code to understand how the data is organized. Is each byte a row of data in an 8x8 matrix? Where is Nexo when you need him? :tongue:

Does this help any?
movq	mm1, mm0

paddusb mm0, mm_0101010101010101
pxor mm1, mm0

; bits above first zero can be anything
; ie 00,02,04,06,08,...,FE = 01
; 00 -> 01
; 01 -> 02
; 03 -> 04
; 07 -> 08
; 0F -> 10
; 1F -> 20
; 3F -> 40
; 7F -> 80
; FF -> 00
Posted on 2003-01-16 11:48:46 by bitRAKE
No I've tried it, it does not do what I'd like. Thanks anyway, Bitrake.

Actually several tricks may help me, handling with a set of two 8-byte chunks, say only one to simplify.


1st trick : get 1st opposite bit of each MMX byte. I got it in very few instructions, thanks to a gem found on the Net.
So if I have

ABCDEFGH
1 01000110
2 00000110
3 ...
8 10100000

result in MMX is

ABCDEFGH
1 01000000
2 00000100
3 ...
8 10000000

assuming that column A is LSBit and column H is MSBit and row 1 is Least Significant BYTE of MM0


2nd trick : get 1st opposite bit of each MMX COLUMN ; posted above, 12 instructions, hoping better ?!

3rd trick : keep bits before first opposite bit, row-wise

4th trick : keep bits before first opposite bit, column-wise

5th-8th trick : same as 1-4, the opposite direction.

9th-? trick : make it work for both directions, starting from row 3 to 6. Or retrieve indexes of opposite bits.

Better if it works on 8 entities in parallel, but appreciated on a single entity, so a "PMULB MM0,255" is worth while :)
*Very* tiny LUTs admitted (actually I make a PSLLQ.../PMOVMSKB EAX/MOVZX EAX,myLUT but I have cache problems :()

Hope it is clearer ?! Anyway, will continue to use my brain.
Posted on 2003-01-17 07:53:55 by valy

I'm sorry I though you were talking about bytes of one/zero. I have to play with the code to understand how the data is organized. Is each byte a row of data in an 8x8 matrix? Where is Nexo when you need him? :tongue:

I bad understanding task with my English :) (that i really say? :))

PMULB mmx0,mmx1 should be have 2 forms PMULLB & PMULHB



; PMULLB
pxor mm7,mm7
movq mm0,[mmx0]
movq mm2,[mmx1]
movq mm1,mm0
movq mm3,mm2
punpcklbw mm0,mm7
punpcklbw mm2,mm7
punpckhbw mm1,mm7
punpckhbw mm3,mm7
pmullw mm0,mm2
pmullw mm1,mm3
pand mm0,mmxw(0FFh) ; for PMULLB
pand mm1,mmxw(0FFh) ; for PMULLB
psrlw mm0,8 ; for PMULHB
psrlw mm1,8 ; for PMULHB
packuswb mm0,mm1
movq [mmx0],mm0


About PINC. I do PADDQ mmx,mmxq(1) ;)
For PIII:


movq mm1,mm0
pcmpeqd mm0,mmxd(0,-1)
psllq mm0,32
paddd mm1,mmxd(0,1)
psrld mm0,31
paddd mm0,mm1


About subproblems. I dont known relation with current thread PMUL and PINC emulations. May be it other problem.


Does this help any?
movq	mm1, mm0

paddusb mm0, mm_0101010101010101
pxor mm1, mm0

; bits above first zero can be anything
; ie 00,02,04,06,08,...,FE = 01
; 00 -> 01
; 01 -> 02
; 03 -> 04
; 07 -> 08
; 0F -> 10
; 1F -> 20
; 3F -> 40
; 7F -> 80
; FF -> 00


Nice PINCB emulation ;)
Posted on 2003-01-18 10:57:21 by Nexo
I want to INC a qword, the faster possible. I have several 6&7 instructions solution, not satisfying at all.


Thx for your reply, Nexo ! I see PINC (Qword) most probably cannot be done under 6 instructions :( Well, I give up. And I do not mind PINCB ; maybe your PMUL(L/W)B could help, I will take a look.

Thx, bye
Posted on 2003-01-20 07:13:21 by valy
5 instructions...(one constant data)
; mm_A is incremented, mm_B is used as a temp register.

PINCQ MACRO mm_A:REQ, mm_B:REQ
movq mm_B, mm_A
pcmpeqd mm_A, mxc(0,-1)
psubd mm_B, mxc(0,-1)
psllq mm_A, 32
psubd mm_B, mm_A
ENDM
I wonder if the shift can be replaced - slow on Pentiums. If another register is used then it can be replaced with PUNPCKLDQ. Example:


; high dword can be anything*, only set once
mov mm7, mxc(-1,0)

;incq #1
movq mm1, mm0
pcmpeqd mm0, mxc(0,-1)

psubd mm1, mxc(0,-1)
punpckldq mm7, mm0 ; bottom dword always zero

; (another instruction goes here)
psubd mm1, mm7

;incq #2
movq mm1, mm0
pcmpeqd mm0, mxc(0,-1)

psubd mm1, mxc(0,-1)
punpckldq mm7, mm0 ; bottom dword always zero

; (another instruction goes here)
psubd mm1, mm7
See you can put several together... :)
Posted on 2003-01-20 11:30:29 by bitRAKE
Played a bit with it... :alright:
Posted on 2003-01-21 06:10:48 by valy

5 instructions...(one constant data)
; mm_A is incremented, mm_B is used as a temp register.

PINCQ MACRO mm_A:REQ, mm_B:REQ
movq mm_B, mm_A
pcmpeqd mm_A, mxc(0,-1)
psubd mm_B, mxc(0,-1)
psllq mm_A, 32
psubd mm_B, mm_A
ENDM


B:=A+1 ;)

PADDQ mmB,mmA,1 ?

Where "Compare Packed Unsigned Integers for Greater Than"? It is allow make PADDQ mmB,mmA,dw :(

Good night. Bye.
Posted on 2003-01-21 14:31:47 by Nexo

B:=A+1 ;)
Yeah, sorry. My comments are bad.
Read the code - ignore the english. :)
Posted on 2003-01-21 15:12:18 by bitRAKE
Umm if you want to inc a qword "fastest possible" then why bother with MMX?

add dword ptr , 1
adc dword ptr , 0
Posted on 2003-01-21 16:20:18 by iblis

Umm if you want to inc a qword "fastest possible" then why bother with MMX?

add dword ptr , 1
adc dword ptr , 0
Bingo! :grin:
Posted on 2003-01-21 16:37:41 by Maverick
valy, stated that the data was already in MMX registers - moving back and forth to MMX->GPR->MMX would be slower than MMX alone, but I think his algo might be faster if totally done in general purpose registers. MMX isn't the answer to everything especially on an Athlon. :)
Posted on 2003-01-21 17:29:32 by bitRAKE
does this work???
pcmpeqb mm1, mm1

psrlq mm1, 63
paddq mm0, mm1
inc mm0 ??? I guess this will only work on a p4. :grin:
Posted on 2003-01-21 23:15:30 by arkane
Totally agree, bitrake. The pb is, I mainly use MMX in my current routines.

I'll wait P5 of more, for GPR-MMX :rolleyes:
Posted on 2003-01-23 05:40:08 by valy