Author Topic: Multiplication using LUT (Read 5881 times)

TheMachine02 · « **on:** June 11, 2014, 05:41:21 am »

So ... I trying to get the maximum speed of the multiplication, since for each vertex (3d), 8 multiplications is performed (yeah, 64 multiplication for one cube).

I currently have a unrolled multiplication (from Xeda), sightly modified :

Code: [Select]

    ld hl,0
    or a
    ret z
    rlca
    jr nc,$+6
    or    a
    sbc    hl, de
    add    hl, hl
    
    rla \ jr    c, _7Set
    rla \ jr    c, _6Set
    rla \ jr    c, _5Set
    rla \ jr    c, _4Set
    rla \ jr    c, _3Set
    rla \ jr    c, _2Set
    rla \ jr    c, _1Set
    
    ret

_7Set:
    add    hl, de
    
    add    hl, hl
    rla
    jr    nc, $+3
_6Set:
    add    hl, de
    add    hl, hl
    rla
    jr    nc, $+3
_5Set:
    add    hl, de
    add    hl, hl
    rla
    jr    nc, $+3
_4Set:
    add    hl, de
    add    hl, hl
    rla
    jr    nc, $+3
_3Set:
    add    hl, de
    add    hl, hl
    rla
    jr    nc, $+3
_2Set:
    add    hl, de

    add    hl, hl
    rla
    ret    nc
_1Set:
    add    hl, de
    ret

I was wondering if using LUT would not be faster. Of course LUT must fit the calculator memory :p

I know two method :
antilog/log
the x²/4 method.

The problem is I need a 9bit*8bit signed, (DE*A), with
A between [-64,64] and DE between [-512,512] (result in HL)
I try to extend this routine , but I haven't succed :

Code: [Select]

MulAE:
    ;A in [-64,64] , E in [-64,64]
    ; MULTABLE is a 512 bytes aligned area (starting at $xx00)
    ; CMULTABLE is copied there during init.
    ;f(A+E)-f(A-E)
    
    sub    e
    ld    h, MULTABLE/256
    ld    l, a
    add    a, e
    add    a, e
    ld    e, (hl)
    inc    h
    ld    d, (hl)
    ld    l, a
    ld    a, (hl)
    dec    h
    ld    l, (hl)
    ld    h, a
    sbc    hl, de
    ret    

CMULTABLE:
.DB 0, 0, 1, 2, 4, 6, 9, 12, 16, 20
.DB 25, 30, 36, 42, 49, 56, 64, 72, 81, 90
.DB 100, 110, 121, 132, 144, 156, 169, 182, 196, 210
.DB 225, 240, 0, 16, 33, 50, 68, 86, 105, 124
.DB 144, 164, 185, 206, 228, 250, 17, 40, 64, 88
.DB 113, 138, 164, 190, 217, 244, 16, 44, 73, 102
.DB 132, 162, 193, 224, 0, 32, 65, 98, 132, 166
.DB 201, 236, 16, 52, 89, 126, 164, 202, 241, 24
.DB 64, 104, 145, 186, 228, 14, 57, 100, 144, 188
.DB 233, 22, 68, 114, 161, 208, 0, 48, 97, 146
.DB 196, 246, 41, 92, 144, 196, 249, 46, 100, 154
.DB 209, 8, 64, 120, 177, 234, 36, 94, 153, 212
.DB 16, 76, 137, 198, 4, 66, 129, 192

.DB 0, 192, 129, 66, 4, 198, 137, 76, 16, 212
.DB 153, 94, 36, 234, 177, 120, 64, 8, 209, 154
.DB 100, 46, 249, 196, 144, 92, 41, 246, 196, 146
.DB 97, 48, 0, 208, 161, 114, 68, 22, 233, 188
.DB 144, 100, 57, 14, 228, 186, 145, 104, 64, 24
.DB 241, 202, 164, 126, 89, 52, 16, 236, 201, 166
.DB 132, 98, 65, 32, 0, 224, 193, 162, 132, 102
.DB 73, 44, 16, 244, 217, 190, 164, 138, 113, 88
.DB 64, 40, 17, 250, 228, 206, 185, 164, 144, 124
.DB 105, 86, 68, 50, 33, 16, 0, 240, 225, 210
.DB 196, 182, 169, 156, 144, 132, 121, 110, 100, 90
.DB 81, 72, 64, 56, 49, 42, 36, 30, 25, 20
.DB 16, 12, 9, 6, 4, 2, 1, 0

.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
.DB 1, 1, 1, 1, 1, 1, 2, 2, 2, 2
.DB 2, 2, 2, 2, 2, 2, 3, 3, 3, 3
.DB 3, 3, 3, 3, 4, 4, 4, 4, 4, 4
.DB 4, 4, 5, 5, 5, 5, 5, 5, 5, 6
.DB 6, 6, 6, 6, 6, 7, 7, 7, 7, 7
.DB 7, 8, 8, 8, 8, 8, 9, 9, 9, 9
.DB 9, 9, 10, 10, 10, 10, 10, 11, 11, 11
.DB 11, 12, 12, 12, 12, 12, 13, 13, 13, 13
.DB 14, 14, 14, 14, 15, 15, 15, 15

.DB 16, 15, 15, 15, 15, 14, 14, 14, 14, 13
.DB 13, 13, 13, 12, 12, 12, 12, 12, 11, 11
.DB 11, 11, 10, 10, 10, 10, 10, 9, 9, 9
.DB 9, 9, 9, 8, 8, 8, 8, 8, 7, 7
.DB 7, 7, 7, 7, 6, 6, 6, 6, 6, 6
.DB 5, 5, 5, 5, 5, 5, 5, 4, 4, 4
.DB 4, 4, 4, 4, 4, 3, 3, 3, 3, 3
.DB 3, 3, 3, 2, 2, 2, 2, 2, 2, 2
.DB 2, 2, 2, 1, 1, 1, 1, 1, 1, 1
.DB 1, 1, 1, 1, 1, 1, 1, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0

So if someone can help me ....

Runer112 · « **Reply #1 on:** June 11, 2014, 09:02:52 am »

Before I try my hand at this, is there some distribution that can be expected of the input values? Because if you really want this micro-optimized, any input tendencies should be known so they can possibly be taken advantage of.

TheMachine02 · « **Reply #2 on:** June 11, 2014, 09:09:12 am »

Well, A is always in [-64,64], with no particular value (altought an optimization for a=-64, a=0 , a=64 should be great, since it's remarquable cos/sin value).
DE is most likely in [-128,128], but something extend to [-256,256], and something (much rarer) in [-512,512]

Runer112 · « **Reply #3 on:** June 11, 2014, 04:29:09 pm »

So here's what I came up with. My experience with multiplication routines told me that a LUT in this case might be more trouble than it's worth, so I didn't try that. Instead, I mirrored a lot of the logic in Xeda's routine, but added even more case-specific optimizations; namely, handling each of A=64 and A=-64 specially to make them faster and to implement the clamping of 64*512 and -64*-512 to 32767.

It's probably pretty unfair to compare my routine to Xeda's original, because I'm guessing Xeda whipped up her routine in like 10 minutes and didn't optimize for speed that crazily, whereas I took more like 100 minutes, was optimizing for speed crazily, and had her code as a reference. But I'll compare them anyways.

A	Runer	Xeda
`A==-64`	`7*(DE<0)+133`	`248`
`-64<A<0`	`6bitcount(A)+(9(A%2))+201`	`6bitcount(A)+(9(A%2))+236`
`A==0`	`52`	`25`
`A==1`	`96`	`136`
`A==2`	`112`	`143`
`A==3`	`127`	`158`
`0<A<64`	`16lg(A)+(6bitcount(A))+(9*(A%2))+97`	`16lg(A)+(6bitcount(A))+(9*(A%2))+121`
`A==64`	`7*(DE<0)+95`	`223`

So about 25-40 cycles faster for most cases and about 100 cycles faster for A=-64 and A=64, but about 25 cycles slower for A=0. And don't forget that this routine also clamps to 32767, whereas the previous did not. So overall, I'd say the speed optimization was quite a success. And with all the speed optimization and the added clamping, I only increased the size from 65 to 89 92 bytes, so not too bad. ~~I just hope it actually works, because I haven't tested it one bit!~~ I've debugged it a fair amount now and it might fully work!

Code: [Select]

Mul_A_DE_Custom:
;
;Multiplies the signed values A and DE, where A=[-64,64] and DE=[-512,512], and 
;returns the result in HL. If A*DE==32768, then the result is clamped to 32767.
;
; I: A=[-64,64], DE=[-512,512]
;
;If A*DE!=32768 && abs(A)!=64
; O: A=A<<7, DE=DE, HL=A*DE
;FO: S=A%2, Z=!A%2, H=?, P/V=?, N=0, C=?
;CC: A<0:  6*bitcount(A)+(9*(A%2))+201
;    A==0: 52
;    A==1: 96
;    A==2: 112
;    A==3: 127
;    A>3:  16*lg(A)+(6*bitcount(A))+(9*(A%2))+97
;
;If A*DE!=32768 && abs(A)==64
; O: A=(A*DE)%256, DE=DE/4, HL=A*DE
;FO: S=S(A*DE), Z=Z(A*DE>>8), H=0, P/V=P(A*DE>>8), N=0, C=0
;CC: A==64:  7*(DE<0)+95
;    A==-64: 7*(DE<0)+133
;
;If A*DE==32768
; O: A=0, DE=256, HL=32767
;FO: S=0, Z=0, H=0, P/V=0, N=0, C=0
;CC: A==64:  77
;    A==-64: 115
;
	add	a,a
	jr	c,Mul_A_DE_Custom_ANeg
	add	a,a
	jr	c,Mul_A_DE_Custom_A64
	jr	z,Mul_A_DE_Custom_A0
	ld	h,d
	ld	l,e
	add	a,a
	jr	c,Mul_A_DE_Custom_A5High
	add	a,a
	jr	c,Mul_A_DE_Custom_A4High
	add	a,a
	jr	c,Mul_A_DE_Custom_A3High
	add	a,a
	jr	c,Mul_A_DE_Custom_A2High
	add	a,a
	ret	nc
	add	hl,hl
	ret	z
	add	hl,de
	ret

Mul_A_DE_Custom_ANeg64:
	ex	de,hl
Mul_A_DE_Custom_A64:
	sra	d
	jp	m,Mul_A_DE_Custom_A64_DENot512
	jr	nz,Mul_A_DE_Custom_A64_DE512
Mul_A_DE_Custom_A64_DENot512:
	rr	e
	rra
	sra	d
	rr	e
	rra
	ld	h,e
	ld	l,a
	ret

Mul_A_DE_Custom_A64_DE512:
	ld	hl,32767
	ret

Mul_A_DE_Custom_A0:
	ld	h,a
	ld	l,a
	ret

Mul_A_DE_Custom_ANeg:
	ld	hl,0
	or	a
	sbc	hl,de
	add	a,a
	jr	z,Mul_A_DE_Custom_ANeg64
	add	hl,hl
	add	a,a
	jr	nc,$+3
	add	hl,de
Mul_A_DE_Custom_A5High:
	add	hl,hl
	add	a,a
	jr	nc,$+3
	add	hl,de
Mul_A_DE_Custom_A4High:
	add	hl,hl
	add	a,a
	jr	nc,$+3
	add	hl,de
Mul_A_DE_Custom_A3High:
	add	hl,hl
	add	a,a
	jr	nc,$+3
	add	hl,de
Mul_A_DE_Custom_A2High:
	add	hl,hl
	add	a,a
	jr	nc,$+3
	add	hl,de
	add	hl,hl
	ret	z
	add	hl,de
	ret

[9AM EST 6/12] EDIT: Attempted to fix behavior when -64<A<0 by changing jr c,Mul_A_DE_Custom_ANeg64 to jr z,Mul_A_DE_Custom_ANeg64.

[10AM EST 6/12] EDIT 2: Attempted to fix behavior when A=-64.

[11AM EST 6/12] EDIT 3: Last fix failed. Attempted to fix behavior when A=-64 or A=64.

TheMachine02 · « **Reply #4 on:** June 12, 2014, 02:54:01 am »

So ... it does seems to work for DE>=0 and A>=0 , but always return 32767 if one of these are negative. So there definitly a problem with negative handleling.
btw thanks for your help

Runer112 · « **Reply #5 on:** June 12, 2014, 08:43:59 am »

Whoops, I think I see where I went wrong. Try changing jr c,Mul_A_DE_Custom_ANeg64 (third instruction after Mul_A_DE_Custom_ANeg) to jr z,Mul_A_DE_Custom_ANeg64 and see if that fixes it, at least partially.

TheMachine02 · « **Reply #6 on:** June 12, 2014, 10:04:25 am »

no change, it still give 32767 as result

Runer112 · « **Reply #7 on:** June 12, 2014, 10:05:17 am »

For all negative A? Or just for A=-64?

TheMachine02 · « **Reply #8 on:** June 12, 2014, 10:07:55 am »

Only for a = -64, but otherwise, result given are wrong.

(exemple, de=1, a=-1, result = 65471)

EDIT : result are correct if it's DE which is negative, (DE=-1, A=1 , return 65535).

Runer112 · « **Reply #9 on:** June 12, 2014, 10:28:23 am »

Okay, I think I've implemented a fix for the A=-64 behavior. Grab the new code (I made a number of changes, so I'd recommend just re-grabbing the whole thing) and try again?

Author Topic: Multiplication using LUT (Read 5881 times)

TheMachine02

Multiplication using LUT

Runer112

Re: Multiplication using LUT

TheMachine02

Re: Multiplication using LUT

Runer112

Re: Multiplication using LUT

TheMachine02

Re: Multiplication using LUT

Runer112

Re: Multiplication using LUT

TheMachine02

Re: Multiplication using LUT

Runer112

Re: Multiplication using LUT

TheMachine02

Re: Multiplication using LUT

Runer112

Re: Multiplication using LUT