Author Topic: Multiplication using LUT  (Read 5457 times)

0 Members and 1 Guest are viewing this topic.

Offline TheMachine02

  • LV6 Super Member (Next: 500)
  • ******
  • Posts: 452
  • Rating: +105/-0
  • me = EF99+F41A
    • View Profile
Multiplication using LUT
« on: June 11, 2014, 05:41:21 am »
So ... I trying to get the maximum speed of the multiplication, since for each vertex (3d), 8 multiplications is performed (yeah, 64 multiplication for one cube).

I currently have a unrolled multiplication (from Xeda), sightly modified :
Code: [Select]
    ld hl,0
    or a
    ret z
    rlca
    jr nc,$+6
    or    a
    sbc    hl, de
    add    hl, hl
   
    rla \ jr    c, _7Set
    rla \ jr    c, _6Set
    rla \ jr    c, _5Set
    rla \ jr    c, _4Set
    rla \ jr    c, _3Set
    rla \ jr    c, _2Set
    rla \ jr    c, _1Set
   
    ret

_7Set:
    add    hl, de
   
    add    hl, hl
    rla
    jr    nc, $+3
_6Set:
    add    hl, de
    add    hl, hl
    rla
    jr    nc, $+3
_5Set:
    add    hl, de
    add    hl, hl
    rla
    jr    nc, $+3
_4Set:
    add    hl, de
    add    hl, hl
    rla
    jr    nc, $+3
_3Set:
    add    hl, de
    add    hl, hl
    rla
    jr    nc, $+3
_2Set:
    add    hl, de

    add    hl, hl
    rla
    ret    nc
_1Set:
    add    hl, de
    ret

I was wondering if using LUT would not be faster. Of course LUT must fit the calculator memory :p

I know two method :
antilog/log
the x²/4 method.

The problem is I need a 9bit*8bit signed, (DE*A), with
A between [-64,64] and DE between [-512,512] (result in HL)
I try to extend this routine , but I haven't succed :

Code: [Select]
MulAE:
    ;A in [-64,64] , E in [-64,64]
    ; MULTABLE is a 512 bytes aligned area (starting at $xx00)
    ; CMULTABLE is copied there during init.
    ;f(A+E)-f(A-E)
   
    sub    e
    ld    h, MULTABLE/256
    ld    l, a
    add    a, e
    add    a, e
    ld    e, (hl)
    inc    h
    ld    d, (hl)
    ld    l, a
    ld    a, (hl)
    dec    h
    ld    l, (hl)
    ld    h, a
    sbc    hl, de
    ret   

CMULTABLE:
.DB 0, 0, 1, 2, 4, 6, 9, 12, 16, 20
.DB 25, 30, 36, 42, 49, 56, 64, 72, 81, 90
.DB 100, 110, 121, 132, 144, 156, 169, 182, 196, 210
.DB 225, 240, 0, 16, 33, 50, 68, 86, 105, 124
.DB 144, 164, 185, 206, 228, 250, 17, 40, 64, 88
.DB 113, 138, 164, 190, 217, 244, 16, 44, 73, 102
.DB 132, 162, 193, 224, 0, 32, 65, 98, 132, 166
.DB 201, 236, 16, 52, 89, 126, 164, 202, 241, 24
.DB 64, 104, 145, 186, 228, 14, 57, 100, 144, 188
.DB 233, 22, 68, 114, 161, 208, 0, 48, 97, 146
.DB 196, 246, 41, 92, 144, 196, 249, 46, 100, 154
.DB 209, 8, 64, 120, 177, 234, 36, 94, 153, 212
.DB 16, 76, 137, 198, 4, 66, 129, 192

.DB 0, 192, 129, 66, 4, 198, 137, 76, 16, 212
.DB 153, 94, 36, 234, 177, 120, 64, 8, 209, 154
.DB 100, 46, 249, 196, 144, 92, 41, 246, 196, 146
.DB 97, 48, 0, 208, 161, 114, 68, 22, 233, 188
.DB 144, 100, 57, 14, 228, 186, 145, 104, 64, 24
.DB 241, 202, 164, 126, 89, 52, 16, 236, 201, 166
.DB 132, 98, 65, 32, 0, 224, 193, 162, 132, 102
.DB 73, 44, 16, 244, 217, 190, 164, 138, 113, 88
.DB 64, 40, 17, 250, 228, 206, 185, 164, 144, 124
.DB 105, 86, 68, 50, 33, 16, 0, 240, 225, 210
.DB 196, 182, 169, 156, 144, 132, 121, 110, 100, 90
.DB 81, 72, 64, 56, 49, 42, 36, 30, 25, 20
.DB 16, 12, 9, 6, 4, 2, 1, 0

.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
.DB 1, 1, 1, 1, 1, 1, 2, 2, 2, 2
.DB 2, 2, 2, 2, 2, 2, 3, 3, 3, 3
.DB 3, 3, 3, 3, 4, 4, 4, 4, 4, 4
.DB 4, 4, 5, 5, 5, 5, 5, 5, 5, 6
.DB 6, 6, 6, 6, 6, 7, 7, 7, 7, 7
.DB 7, 8, 8, 8, 8, 8, 9, 9, 9, 9
.DB 9, 9, 10, 10, 10, 10, 10, 11, 11, 11
.DB 11, 12, 12, 12, 12, 12, 13, 13, 13, 13
.DB 14, 14, 14, 14, 15, 15, 15, 15

.DB 16, 15, 15, 15, 15, 14, 14, 14, 14, 13
.DB 13, 13, 13, 12, 12, 12, 12, 12, 11, 11
.DB 11, 11, 10, 10, 10, 10, 10, 9, 9, 9
.DB 9, 9, 9, 8, 8, 8, 8, 8, 7, 7
.DB 7, 7, 7, 7, 6, 6, 6, 6, 6, 6
.DB 5, 5, 5, 5, 5, 5, 5, 4, 4, 4
.DB 4, 4, 4, 4, 4, 3, 3, 3, 3, 3
.DB 3, 3, 3, 2, 2, 2, 2, 2, 2, 2
.DB 2, 2, 2, 1, 1, 1, 1, 1, 1, 1
.DB 1, 1, 1, 1, 1, 1, 1, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
.DB 0, 0, 0, 0, 0, 0, 0, 0
So if someone can help me ....
AXE/asm programmer - unleash the power of z80 //C++//C

epic 3D things http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html

Offline Runer112

  • Moderator
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Multiplication using LUT
« Reply #1 on: June 11, 2014, 09:02:52 am »
Before I try my hand at this, is there some distribution that can be expected of the input values? Because if you really want this micro-optimized, any input tendencies should be known so they can possibly be taken advantage of.

Offline TheMachine02

  • LV6 Super Member (Next: 500)
  • ******
  • Posts: 452
  • Rating: +105/-0
  • me = EF99+F41A
    • View Profile
Re: Multiplication using LUT
« Reply #2 on: June 11, 2014, 09:09:12 am »
Well, A is always in [-64,64], with no particular value (altought an optimization for a=-64, a=0 , a=64 should be great, since it's remarquable cos/sin value).
DE is most likely in [-128,128], but something extend to [-256,256], and something (much rarer) in [-512,512]
AXE/asm programmer - unleash the power of z80 //C++//C

epic 3D things http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html

Offline Runer112

  • Moderator
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Multiplication using LUT
« Reply #3 on: June 11, 2014, 04:29:09 pm »
So here's what I came up with. My experience with multiplication routines told me that a LUT in this case might be more trouble than it's worth, so I didn't try that. Instead, I mirrored a lot of the logic in Xeda's routine, but added even more case-specific optimizations; namely, handling each of A=64 and A=-64 specially to make them faster and to implement the clamping of 64*512 and -64*-512 to 32767.

It's probably pretty unfair to compare my routine to Xeda's original, because I'm guessing Xeda whipped up her routine in like 10 minutes and didn't optimize for speed that crazily, whereas I took more like 100 minutes, was optimizing for speed crazily, and had her code as a reference. But I'll compare them anyways. :P

A   Runer   Xeda   
A==-64   7*(DE<0)+133   248   
-64<A<0   6*bitcount(A)+(9*(A%2))+201   6*bitcount(A)+(9*(A%2))+236   
A==0   52   25   
A==1   96   136   
A==2   112   143   
A==3   127   158   
0<A<64   16*lg(A)+(6*bitcount(A))+(9*(A%2))+97   16*lg(A)+(6*bitcount(A))+(9*(A%2))+121   
A==64   7*(DE<0)+95   223   

So about 25-40 cycles faster for most cases and about 100 cycles faster for A=-64 and A=64, but about 25 cycles slower for A=0. And don't forget that this routine also clamps to 32767, whereas the previous did not. So overall, I'd say the speed optimization was quite a success. And with all the speed optimization and the added clamping, I only increased the size from 65 to 89 92 bytes, so not too bad. I just hope it actually works, because I haven't tested it one bit! I've debugged it a fair amount now and it might fully work!

Code: [Select]
Mul_A_DE_Custom:
;
;Multiplies the signed values A and DE, where A=[-64,64] and DE=[-512,512], and
;returns the result in HL. If A*DE==32768, then the result is clamped to 32767.
;
; I: A=[-64,64], DE=[-512,512]
;
;If A*DE!=32768 && abs(A)!=64
; O: A=A<<7, DE=DE, HL=A*DE
;FO: S=A%2, Z=!A%2, H=?, P/V=?, N=0, C=?
;CC: A<0:  6*bitcount(A)+(9*(A%2))+201
;    A==0: 52
;    A==1: 96
;    A==2: 112
;    A==3: 127
;    A>3:  16*lg(A)+(6*bitcount(A))+(9*(A%2))+97
;
;If A*DE!=32768 && abs(A)==64
; O: A=(A*DE)%256, DE=DE/4, HL=A*DE
;FO: S=S(A*DE), Z=Z(A*DE>>8), H=0, P/V=P(A*DE>>8), N=0, C=0
;CC: A==64:  7*(DE<0)+95
;    A==-64: 7*(DE<0)+133
;
;If A*DE==32768
; O: A=0, DE=256, HL=32767
;FO: S=0, Z=0, H=0, P/V=0, N=0, C=0
;CC: A==64:  77
;    A==-64: 115
;
add a,a
jr c,Mul_A_DE_Custom_ANeg
add a,a
jr c,Mul_A_DE_Custom_A64
jr z,Mul_A_DE_Custom_A0
ld h,d
ld l,e
add a,a
jr c,Mul_A_DE_Custom_A5High
add a,a
jr c,Mul_A_DE_Custom_A4High
add a,a
jr c,Mul_A_DE_Custom_A3High
add a,a
jr c,Mul_A_DE_Custom_A2High
add a,a
ret nc
add hl,hl
ret z
add hl,de
ret

Mul_A_DE_Custom_ANeg64:
ex de,hl
Mul_A_DE_Custom_A64:
sra d
jp m,Mul_A_DE_Custom_A64_DENot512
jr nz,Mul_A_DE_Custom_A64_DE512
Mul_A_DE_Custom_A64_DENot512:
rr e
rra
sra d
rr e
rra
ld h,e
ld l,a
ret

Mul_A_DE_Custom_A64_DE512:
ld hl,32767
ret

Mul_A_DE_Custom_A0:
ld h,a
ld l,a
ret

Mul_A_DE_Custom_ANeg:
ld hl,0
or a
sbc hl,de
add a,a
jr z,Mul_A_DE_Custom_ANeg64
add hl,hl
add a,a
jr nc,$+3
add hl,de
Mul_A_DE_Custom_A5High:
add hl,hl
add a,a
jr nc,$+3
add hl,de
Mul_A_DE_Custom_A4High:
add hl,hl
add a,a
jr nc,$+3
add hl,de
Mul_A_DE_Custom_A3High:
add hl,hl
add a,a
jr nc,$+3
add hl,de
Mul_A_DE_Custom_A2High:
add hl,hl
add a,a
jr nc,$+3
add hl,de
add hl,hl
ret z
add hl,de
ret



[9AM EST 6/12] EDIT: Attempted to fix behavior when -64<A<0 by changing jr c,Mul_A_DE_Custom_ANeg64 to jr z,Mul_A_DE_Custom_ANeg64.

[10AM EST 6/12] EDIT 2: Attempted to fix behavior when A=-64.

[11AM EST 6/12] EDIT 3: Last fix failed. Attempted to fix behavior when A=-64 or A=64.
« Last Edit: June 12, 2014, 10:59:39 am by Runer112 »

Offline TheMachine02

  • LV6 Super Member (Next: 500)
  • ******
  • Posts: 452
  • Rating: +105/-0
  • me = EF99+F41A
    • View Profile
Re: Multiplication using LUT
« Reply #4 on: June 12, 2014, 02:54:01 am »
So ... it does seems to work for DE>=0 and A>=0 , but always return 32767 if one of these are negative. So there definitly a problem with negative handleling.
btw thanks for your help  :P
AXE/asm programmer - unleash the power of z80 //C++//C

epic 3D things http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html

Offline Runer112

  • Moderator
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Multiplication using LUT
« Reply #5 on: June 12, 2014, 08:43:59 am »
Whoops, I think I see where I went wrong. Try changing jr c,Mul_A_DE_Custom_ANeg64 (third instruction after Mul_A_DE_Custom_ANeg) to jr z,Mul_A_DE_Custom_ANeg64 and see if that fixes it, at least partially.

Offline TheMachine02

  • LV6 Super Member (Next: 500)
  • ******
  • Posts: 452
  • Rating: +105/-0
  • me = EF99+F41A
    • View Profile
Re: Multiplication using LUT
« Reply #6 on: June 12, 2014, 10:04:25 am »
no change, it still give 32767 as result  <_<
AXE/asm programmer - unleash the power of z80 //C++//C

epic 3D things http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html

Offline Runer112

  • Moderator
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Multiplication using LUT
« Reply #7 on: June 12, 2014, 10:05:17 am »
For all negative A? Or just for A=-64?

Offline TheMachine02

  • LV6 Super Member (Next: 500)
  • ******
  • Posts: 452
  • Rating: +105/-0
  • me = EF99+F41A
    • View Profile
Re: Multiplication using LUT
« Reply #8 on: June 12, 2014, 10:07:55 am »
Only for a = -64, but otherwise, result given are wrong.

(exemple, de=1, a=-1, result = 65471)

EDIT : result are correct if it's DE which is negative, (DE=-1, A=1 , return 65535).
AXE/asm programmer - unleash the power of z80 //C++//C

epic 3D things http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html

Offline Runer112

  • Moderator
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Multiplication using LUT
« Reply #9 on: June 12, 2014, 10:28:23 am »
Okay, I think I've implemented a fix for the A=-64 behavior. Grab the new code (I made a number of changes, so I'd recommend just re-grabbing the whole thing) and try again?
« Last Edit: June 12, 2014, 10:31:39 am by Runer112 »