Author Topic: 24 bit multiplication (Read 19003 times)

ACagliano · « **Reply #30 on:** December 11, 2011, 01:41:47 pm »

Ok. I am particularly interested now in 2-byte multiplication and 4-byte square rooting. How would they be done?

jacobly · « **Reply #31 on:** December 11, 2011, 02:06:55 pm »

// Multiply a times b
temp = 0
repeat for each bit in a
	temp <<= 1
	if (high bit of a set)
		temp += b
	a <<= 1
return temp

If a and b are 2 bytes, temp is 4 bytes, and you loop 16 times.
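For anyone who wants to check the pseudocode before writing the assembly, here is a minimal Python sketch of the same shift-and-add loop. The function name `mul16` and the explicit masks are assumptions added for illustration; the masks just stand in for the fixed register widths.

```python
def mul16(a: int, b: int) -> int:
    """Shift-and-add multiply of two 16-bit values, most significant bit first."""
    temp = 0
    for _ in range(16):                     # one pass per bit of a
        temp = (temp << 1) & 0xFFFFFFFF     # temp <<= 1 (temp is 4 bytes)
        if a & 0x8000:                      # high bit of a set
            temp = (temp + b) & 0xFFFFFFFF  # temp += b
        a = (a << 1) & 0xFFFF               # a <<= 1 (a is 2 bytes)
    return temp
```

Calling `mul16(1234, 5678)` should give the same value as `1234 * 5678`.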

Code: [Select]

// Sqrt a
temp = high byte of a
a <<= 8
b = 0
repeat for every 2 bits in a
	test = (b << 8) + 0x40
	b <<= 1
	if (temp >= test)
		temp -= test
		set low bit of b
	temp <<= 2
	temp += high 2 bits of a
	a <<= 2
return b

If a is 4 bytes, then b is 2 bytes, temp needs 3 bytes (test is b shifted up by a whole byte), and you loop 16 times.
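A rough Python model of the square-root pseudocode, for checking results on a PC. It assumes two things the pseudocode leaves implicit: `test` means `(b << 8) + 0x40`, and `temp` is shifted left by two bits before the next pair of bits from `a` is added. The name `isqrt32` is made up for this sketch.

```python
def isqrt32(a: int) -> int:
    """Integer square root of a 32-bit value, two input bits per pass."""
    temp = a >> 24                      # temp = high byte of a
    a = (a << 8) & 0xFFFFFFFF           # a <<= 8
    b = 0
    for _ in range(16):                 # 16 pairs of bits in a 4-byte input
        test = (b << 8) + 0x40
        b <<= 1
        if temp >= test:
            temp -= test
            b |= 1                      # set low bit of b
        temp = (temp << 2) + (a >> 30)  # shift in the next 2 bits of a
        a = (a << 2) & 0xFFFFFFFF       # a <<= 2
    return b
```

The low 6 bits of `temp` hold input bits that have not been consumed yet. They do not affect the comparison because `test` is always a multiple of 0x40.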

ACagliano · « **Reply #32 on:** December 11, 2011, 02:23:11 pm »

Quote from: jacobly on December 11, 2011, 02:06:55 pm

Code: [Select]

// Multiply a times b
temp = 0
repeat for each bit in a
	temp <<= 1
	if (high bit of a set)
		temp += b
	a <<= 1
return temp

If a and b are 2 bytes, temp is 4 bytes, and you loop 16 times.

Spoiler For for code:

~~stolen~~ from Axe

Code: [Select]

p_MulFull:
; Input in hl, result in cahl
	ld	c,h
	ld	a,l
	ld	hl,0		;11
	ld	b,16		;7
__MulFullNext:
	add	hl,hl		;11
	rla			;4
	rl	c		;8
	jr	nc,__MulFullSkip	;12/7
	add	hl,de		;11
	adc	a,0		;7
	jr	nc,__MulFullSkip
	inc	c
__MulFullSkip:
	djnz	__MulFullNext
	ret
__MulFullEnd:

Code: [Select]

// Sqrt a
temp = high byte of a
a <<= 8
b = 0
repeat for every 2 bits in a
	test = (b << 8) + 0x40
	b <<= 1
	if (temp >= test)
		temp -= test
		set low bit of b
	temp <<= 2
	temp += high 2 bits of a
	a <<= 2
return b

If a is 4 bytes, then b is 2 bytes, temp needs 3 bytes (test is b shifted up by a whole byte), and you loop 16 times.

Spoiler For code:

~~stole~~ my own routine from axe (and modified it)

Code: [Select]

p_Sqrt88:
; input in hlde, result in de
	ld	b,16
	ld	a,h
	ld	c,l
	push	de	; ld ixh,d
	pop	ix	; ld ixl,e
	ld	de,0
	ld	h,d
	ld	l,e
__Sqrt88Loop:
	sub	$40
	sbc	hl,de
	jr	nc,__Sqrt88Skip
	add	a,$40
	adc	hl,de
__Sqrt88Skip:
	ccf
	rl	e
	rl	d
	add	ix,ix
	rl	c
	rla
	adc	hl,hl
	add	ix,ix
	rl	c
	rla
	adc	hl,hl
	djnz	__Sqrt88Loop
	ret
__Sqrt88End:

Kool. Thanks. But, shouldn't the first one have two inputs?

Xeda112358 · « **Reply #33 on:** December 11, 2011, 02:30:16 pm »

So with two-byte multiplication, you can take advantage of the fact that add hl,hl is the same as shifting hl left. It even gives you the carry! So in this case:

Code: [Select]

     ld hl,0
     ld a,16
MultLoop:
     add hl,hl      ;shifts hl left
     rl e \ rl d    ;shifts de left and if hl overflowed, it overflows into de
     jr nc,$+6      ;if the bit in DE is 0, skip this chunk
       add hl,bc    ;add bc to hl (think of this as the first number)
       jr nc,$+3    ;overflow into de
         inc de
     dec a
     jr nz,MultLoop
     ret

That will multiply DE times BC and return the result in DEHL. I will see if I can port a square root routine for 32-bit...

EDIT: changed inc e to inc de
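A small Python model of the loop above can help when tracing it by hand. It is only a sketch: the variables mirror the register pairs, the carry flag is an ordinary integer, and the function name `mul_de_bc` is invented here. The point it illustrates is that DE holds the remaining multiplier bits and the upper product bits at the same time.

```python
def mul_de_bc(de: int, bc: int) -> int:
    """Model of the DE * BC loop; returns the 32-bit value left in DEHL."""
    hl = 0
    for _ in range(16):
        hl <<= 1                        # add hl,hl
        carry = hl >> 16
        hl &= 0xFFFF
        de = (de << 1) | carry          # rl e \ rl d (carry from HL comes in)
        carry = de >> 16                # bit shifted out of DE
        de &= 0xFFFF
        if carry:                       # jr nc,$+6 skips this when the bit is 0
            hl += bc                    # add hl,bc
            if hl > 0xFFFF:             # jr nc,$+3
                hl &= 0xFFFF
                de = (de + 1) & 0xFFFF  # inc de
    return (de << 16) | hl
```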

jacobly · « **Reply #34 on:** December 11, 2011, 02:48:20 pm »

Quote from: ACagliano on December 11, 2011, 02:23:11 pm

Quote from: jacobly on December 11, 2011, 02:06:55 pm
Code: [Select]

// Multiply a times b
temp = 0
repeat for each bit in a
	temp <<= 1
	if (high bit of a set)
		temp += b
	a <<= 1
return temp

if a and b are 2 bytes, temp is 4 bytes, and you loop 16 times.

Spoiler For for code:

~~stolen~~ from Axe

Code: [Select]

p_MulFull:
; Input in hl and de, result in cahl
	ld	c,h
	ld	a,l
	ld	hl,0		;11
	ld	b,16		;7
__MulFullNext:
	add	hl,hl		;11
	rla			;4
	rl	c		;8
	jr	nc,__MulFullSkip	;12/7
	add	hl,de		;11
	adc	a,0		;7
	jr	nc,__MulFullSkip
	inc	c
__MulFullSkip:
	djnz	__MulFullNext
	ret
__MulFullEnd:

Kool. Thanks. But, shouldn't the first one have two inputs?

Of course, hl and de. Isn't that what I said?

FloppusMaximus · « **Reply #35 on:** December 11, 2011, 04:41:31 pm »

Quote from: jacobly on December 08, 2011, 08:35:58 pm

My first multiplication routine takes 2746 - 4570 cycles, the second takes 1680 - 2880 cycles.

Oh boy, optimization time

The best I have so far is somewhere around 1800 cycles average (I'm too lazy to work out the exact probabilities at the moment, and not counting memory delays) using a squaring table and undocumented IX instructions. Input is BDE and CHL, output is BCDEAL. This routine works by expanding the formula 2xy = x²+y²-|x-y|², summed over each of the 9 pairs of bytes in the input.

(I'm not saying this is practical - unless you really have thousands of 24-bit multiplications to perform, you don't need this kind of speed. This is just for fun.)
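To make the identity concrete, here is a simplified Python sketch of multiplying through a table of 8-bit squares. It is not a translation of the posted assembly, which appears to accumulate the square terms first and halve once at the end; this version halves each byte product separately for clarity. The names `SQR`, `mul8` and `mul24` are hypothetical.

```python
SQR = [i * i for i in range(256)]   # squares of 0..255, each fits in 16 bits

def mul8(x: int, y: int) -> int:
    """8x8 product from three lookups, using 2xy = x^2 + y^2 - |x-y|^2."""
    return (SQR[x] + SQR[y] - SQR[abs(x - y)]) >> 1

def mul24(x: int, y: int) -> int:
    """24x24 -> 48-bit product, summed over the 9 pairs of input bytes."""
    xs = [(x >> (8 * i)) & 0xFF for i in range(3)]
    ys = [(y >> (8 * j)) & 0xFF for j in range(3)]
    total = 0
    for i, xb in enumerate(xs):
        for j, yb in enumerate(ys):
            # byte i of x times byte j of y carries a weight of 256^(i+j)
            total += mul8(xb, yb) << (8 * (i + j))
    return total
```

The table costs 512 bytes (256 two-byte entries), which is the price of replacing every shift-and-add loop with lookups and subtractions.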

Code: [Select]

SUBFIRST .macro src1, src2, hdest, ldest
	exx
	ld a, src1
	sub src2
	jr nc, $ + 4
	neg
	exx
	ld l, a
	ld a, ldest
	sub (hl)
	ld ldest, a
	inc h
	ld a, hdest
	sbc a, (hl)
	ld hdest, a
  .endm

SUBNEXT .macro src1, src2, hdest, ldest
	dec h
	ex af, af'
	exx
	ld a, src1
	sub src2
	jr nc, $ + 4
	neg
	exx
	ld l, a
	ex af, af'
	ld a, ldest
	sbc a, (hl)
	ld ldest, a
	inc h
	ld a, hdest
	sbc a, (hl)
	ld hdest, a
  .endm

BDE_times_CHL_sqrdiff_v3:
	ld a, d
	exx
	ld h, high(sqrtab)
	ld l, a
	ld e, (hl)
	inc h
	ld d, (hl)		; DE = d²
	exx
	ld a, b
	exx
	ld l, a
	ld b, (hl)
	dec h
	ld c, (hl)		; BC = b²
	exx
	ld a, e
	exx
	ld l, a
	ld a, (hl)
	inc h
	ld h, (hl)
	ld l, a			; HL = e²
	call BC_DE_HL_times_10101
	push bc
	 push hl
	  push de
	   exx
	   ld a, h
	   exx
	   ld h, high(sqrtab)
	   ld l, a
	   ld e, (hl)
	   inc h
	   ld d, (hl)		; DE = h²
	   exx
	   ld a, c
	   exx
	   ld l, a
	   ld b, (hl)
	   dec h
	   ld c, (hl)		; BC = c²
	   exx
	   ld a, l
	   exx
	   ld l, a
	   ld a, (hl)
	   inc h
	   ld h, (hl)
	   ld l, a		; HL = l²
	   call BC_DE_HL_times_10101
	   pop ix
	  add ix, de
	  pop de
	 adc hl, de
	 ex de, hl
	 pop hl
	adc hl, bc
	ld b, h
	ld c, l			; BCDEIX = total
	push af

	 ld h, high(sqrtab)
	 SUBFIRST e, l, ixh, ixl
	 SUBNEXT  d, h, d, e
	 SUBNEXT  b, c, b, c
	 jp nc, BDE_times_CHL_sqrdiff_v3_nc1
	 pop af
	ccf
	push af
BDE_times_CHL_sqrdiff_v3_nc1:

	 inc b

	 dec h
	 SUBFIRST e, h, e, ixh
	 SUBNEXT  d, c, c, d
	 jr nc, BDE_times_CHL_sqrdiff_v3_nc2
	 dec b
	 jp nz, BDE_times_CHL_sqrdiff_v3_nc2
	 pop af
	ccf
	push af
BDE_times_CHL_sqrdiff_v3_nc2:

	 dec h
	 SUBFIRST d, l, e, ixh
	 SUBNEXT  b, h, c, d
	 jr nc, BDE_times_CHL_sqrdiff_v3_nc3
	 dec b
	 jp nz, BDE_times_CHL_sqrdiff_v3_nc3
	 pop af
	ccf
	push af
BDE_times_CHL_sqrdiff_v3_nc3:

	 inc c

	 dec h
	 SUBFIRST b, l, d, e
	 jr nc, BDE_times_CHL_sqrdiff_v3_nc4
	 dec c
	 jp nz, BDE_times_CHL_sqrdiff_v3_nc4
	 dec b
	 jp nz, BDE_times_CHL_sqrdiff_v3_nc4
	 pop af
	ccf
	push af
BDE_times_CHL_sqrdiff_v3_nc4:

	 dec h
	 SUBFIRST e, c, d, e
	 pop hl
	jr nc, BDE_times_CHL_sqrdiff_v3_nc5
	dec c
	jp nz, BDE_times_CHL_sqrdiff_v3_nc5
	dec b
	jp nz, BDE_times_CHL_sqrdiff_v3_nc5
	inc l
BDE_times_CHL_sqrdiff_v3_nc5:

	dec b
	dec c

	rr l
	rr b
	rr c
	rr d
	rr e
	ld a, ixl
	ld l, a
	ld a, ixh
	rra
	rr l
	ret


BC_DE_HL_times_10101:
	push bc
	 ld a, h
	 ex af, af'
	 sub a
	 ld c, a
	 ld b, l
	 add hl, bc
	 adc a, a
	 ld b, e
	 add hl, bc
	 adc a, c		; AHL = [ L+H+E L ]
	 pop bc
	push hl
	 push bc
	  ld c, a
	  ld b, 0
	  ex af, af'
	  ld h, a
	  add hl, bc		; no way this can carry (initial HL is a square)
	  ld c, a
	  ld b, e
	  sub a
	  add hl, bc
	  adc a, a		; AHL(SP+2) = [ H+E L+H L+H+E L ]
	  add hl, de
	  adc a, 0		; AHL(SP+2) = [ H+E+D L+H+E L+H+E L ]
	  pop bc
	 add hl, bc
	 adc a, 0		; AHL(SP) = [ H+E+D+B L+H+E+C L+H+E L ]
	 ld e, d
	 ld d, c
	 add hl, de
	 adc a, b
	 jr nc, BC_DE_HL_times_10101_nc1
	 inc b			; BAHL(SP) = [ B B H+E+D+C+B L+H+E+D+C L+H+E L ]
BC_DE_HL_times_10101_nc1:
	 add a, e
	 jr nc, BC_DE_HL_times_10101_nc2
	 inc b			; BAHL(SP) = [ B D+B H+E+D+C+B L+H+E+D+C L+H+E L ]
BC_DE_HL_times_10101_nc2:
	 pop de
	add a, c
	ld c, a
	ret nc
	inc b			; BCHLDE = [ B D+C+B H+E+D+C+B L+H+E+D+C L+H+E L ]
	ret

To get back to the topic somewhat, ACagliano, it sounds like you're more interested in squaring than in general multiplication. Squaring can be considerably faster, especially if you use a lookup table (e.g., my best 16-bit squaring routine is around 170 cycles, versus around 800 for general multiplication.)
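As an illustration of why squaring is cheaper, here is one possible table-based 16-bit square in Python. This is a sketch under my own assumptions, not the 170-cycle routine mentioned above; `SQR` and `square16` are made-up names. It uses (256h + l)² = 65536h² + 256(2hl) + l², with 2hl = h² + l² - |h-l|².

```python
SQR = [i * i for i in range(256)]   # squares of 0..255

def square16(x: int) -> int:
    """Square a 16-bit value with three table lookups and no multiply."""
    h, l = x >> 8, x & 0xFF
    hh, ll = SQR[h], SQR[l]
    cross2 = hh + ll - SQR[abs(h - l)]        # equals 2*h*l
    return (hh << 16) + (cross2 << 8) + ll
```

A general 16-bit multiply needs four byte products; the square needs only three lookups because the two cross terms are identical.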

cerzus69 · « **Reply #36 on:** December 12, 2011, 10:43:43 am »

Quote from: jacobly on December 07, 2011, 11:05:18 pm

I do have a 24-bit floating-point multiplication routine

Saved 2 bytes and 1149 cycles by using iy too (and in a way compatible with TIOS, imagine that)
Code: [Select]

	; hldebc = hlc * bde
	ld	(iy+asm_Flag1),b
	xor	a
	ld	ix,0
	ld	b,24
Loop:
	add	ix,ix
	rla
	rl	c
	adc	hl,hl
	jr	nc,Next
	add	ix,de
	adc	a,(iy+asm_Flag1)
	rl	c
	jr	nc,Next
	inc	hl
Next:
	djnz	Loop
	ld	e,a
	ld	d,c
	push	ix ; ld c,ixl
	pop	bc ; ld b,ixh

Jacobly, are you sure this works? I've been going through the code, and it seems to me that the second 'rl c' should instead be 'add carry flag to c'. Two 'rl c' per loop seems wrong to me. Could you explain, please? I've also tried it in wabbitemu, taking the different input into account, and it still doesn't give the right result.

ACagliano · « **Reply #37 on:** December 12, 2011, 02:35:39 pm »

Yeah, all I need is 16-bit subtraction (which 'sub' supports, I think), 16-bit squaring, 32-bit addition, then 32-bit square rooting (or will I need to go up to 40-bit?).
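On the 40-bit question, a quick width check helps. The sketch below assumes the quantity being computed is sqrt(dx² + dy²), where dx and dy are the two 16-bit differences; that reading is a guess based on the list of operations, and `bits_needed` is a hypothetical helper.

```python
def bits_needed(max_delta: int) -> int:
    """Bits required to hold dx*dx + dy*dy when both deltas are <= max_delta."""
    return (2 * max_delta * max_delta).bit_length()

print(bits_needed(0xFFFF))  # 33
print(bits_needed(46340))   # 32
```

Under that assumption, the sum fits in 32 bits only while both differences stay at or below 46340. With full 16-bit differences it needs 33 bits, so the carry out of the 32-bit addition has to be kept somewhere (a fifth byte is the simple option), and the root itself can then need 17 bits.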

Xeda112358 · « **Reply #38 on:** December 12, 2011, 02:48:12 pm »

16-bit subtraction

Code: [Select]

or a     ;to make sure the c flag is reset. Not always necessary if you know the c flag will be reset
sbc hl,bc  ;you can do sbc hl,de also.

32-bit addition (you mean two 32-bit inputs?)

Code: [Select]

;Inputs:
;     HLBC is one of the 32-bit inputs
;     DE points to the other 32-bit input in RAM
;Outputs:
;     HLBC is the 32-bit result
;     DE is incremented 3 times
;     A=H
;     c flag is set if there is an overflow
     ld a,(de) \ inc de
     add a,c \ ld c,a
     ld a,(de) \ inc de
     adc a,b \ ld b,a
     ld a,(de) \ inc de
     adc a,l \ ld l,a
     ld a,(de)
     adc a,h \ ld h,a
     ret
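The same byte-by-byte carry chain, written as a minimal Python sketch for checking values. The function name `add32` is an assumption; `mem` stands for the four bytes DE points at, stored least significant byte first as the routine reads them.

```python
def add32(hlbc: int, mem: bytes):
    """Return (32-bit sum, carry flag) of hlbc and the little-endian value in mem."""
    result, carry = 0, 0
    for i in range(4):                  # bytes C, B, L, H in that order
        s = ((hlbc >> (8 * i)) & 0xFF) + mem[i] + carry
        result |= (s & 0xFF) << (8 * i)
        carry = s >> 8                  # add/adc carry into the next byte
    return result, carry
```

For example, `add32(0xFFFFFFFF, bytes([1, 0, 0, 0]))` returns `(0, 1)`: the result wraps to zero and the carry reports the overflow.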

Squaring and square rooting... I will think on it

Also, I am working on a mini math library that will include RAM based math (so all the values will be in RAM). It seems like a few of these commands will need to rely on some memory. If they do, I suggest using the OP registers (11 bytes of RAM each).

jacobly · « **Reply #39 on:** December 12, 2011, 07:07:45 pm »

Quote from: cerzus69 on December 12, 2011, 10:43:43 am

Jacobly, are you sure this works? I've been going through the code, and it seems to me that the second 'rl c' should instead be 'add carry flag to c'. Two 'rl c' per loop seems wrong to me. Could you explain, please? I've also tried it in wabbitemu, taking the different input into account, and it still doesn't give the right result.

That's strange. My test program must not have been working right, because when I went back and changed it a bit, it suddenly started telling me that the second routine doesn't work.

Anyway, my new test program seems to agree with this change.

Code: [Select]

	; hldebc = hlc * bde
	ld	(iy+asm_Flag1),b
	xor	a
	ld	ix,0
	ld	b,24
Loop:
	add	ix,ix
	rla
	rl	c
	adc	hl,hl
	jr	nc,Next
	add	ix,de
	adc	a,(iy+asm_Flag1)
	jr	nc,Next
	inc	c
	jr	nz,Next
	inc	hl
Next:
	djnz	Loop
	ld	e,a
	ld	d,c
	push	ix ; ld c,ixl
	pop	bc ; ld b,ixh
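For anyone who wants to test the corrected loop off-calculator, here is a Python model of it. It is a sketch only: each register is an integer, the four shift instructions are folded into one 48-bit shift, and the name `mul24_fixed` is invented. It shows the carry from the 24-bit add being propagated with an increment into C and then HL.

```python
def mul24_fixed(hlc: int, bde: int) -> int:
    """Model of the corrected loop; returns the 48-bit value left in HL:C:A:IX."""
    hl, c = hlc >> 8, hlc & 0xFF
    a, ix = 0, 0
    for _ in range(24):
        # add ix,ix / rla / rl c / adc hl,hl: shift HL:C:A:IX left one bit
        v = ((hl << 32) | (c << 24) | (a << 16) | ix) << 1
        carry = v >> 48                       # multiplier bit shifted out of HL
        v &= (1 << 48) - 1
        hl, c, a, ix = v >> 32, (v >> 24) & 0xFF, (v >> 16) & 0xFF, v & 0xFFFF
        if carry:                             # jr nc,Next
            low = ((a << 16) | ix) + bde      # add ix,de / adc a,(iy+asm_Flag1)
            a, ix = (low >> 16) & 0xFF, low & 0xFFFF
            if low >> 24:                     # jr nc,Next
                c = (c + 1) & 0xFF            # inc c
                if c == 0:                    # jr nz,Next
                    hl = (hl + 1) & 0xFFFF    # inc hl
    return (hl << 32) | (c << 24) | (a << 16) | ix
```

The final `ld e,a`, `ld d,c` and `push ix` / `pop bc` only move this value into HLDEBC, so the return value here is the product itself.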

cerzus69 · « **Reply #40 on:** December 13, 2011, 11:06:38 am »

Quote from: jacobly on December 12, 2011, 07:07:45 pm

That's strange. My test program must not have been working right, because when I went back and changed it a bit, it suddenly started telling me that the second routine doesn't work.
Anyway, my new test program seems to agree with this change.
Code: [Select]

	; hldebc = hlc * bde
	ld	(iy+asm_Flag1),b
	xor	a
	ld	ix,0
	ld	b,24
Loop:
	add	ix,ix
	rla
	rl	c
	adc	hl,hl
	jr	nc,Next
	add	ix,de
	adc	a,(iy+asm_Flag1)
	jr	nc,Next
	inc	c
	jr	nz,Next
	inc	hl
Next:
	djnz	Loop
	ld	e,a
	ld	d,c
	push	ix ; ld c,ixl
	pop	bc ; ld b,ixh

Cool, thanks a lot, indeed it works now!
