Author Topic: [z80] Floating Point Routines  (Read 56644 times)


Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #15 on: October 24, 2013, 08:48:06 pm »
I have updated the 24-bit floating point routines. Most of the routines have easy input/output: if there are two inputs, they are passed in BHL and CDE; if there is one input, it is in AHL or BHL. Outputs from the floating point routines are in AHL or BHL. There are also a handful of extra routines, such as:

BC_Times_DE (returns 32-bit result, worst case is less than 700 t-states)
normalise24 to renormalise 24-bit floats
SetInf to return a float as infinity (keeping the original sign)
FloatToUInt to convert a float to an unsigned integer
FloatToSInt to convert a float to a signed integer
SqrtHL_prec16 returns the square root of HL to 16 bits of accuracy.

The actual floating point routines are:

Float24Mul
Float24Div
Float24Add
Float24Sub
Float24Lg
Float24Sqrt

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #16 on: March 16, 2014, 01:40:43 pm »
A while ago, I thought of a way to possibly make division faster, but I never had the time or focus to actually write it out in assembly. I was working on it today after satisfying myself with the math involved, and it doesn't work yet (there is a bug somewhere that seems to be incrementing the wrong byte in the result on occasion), but the fix for that should not make it much slower. My goal was to get division below 45 000 t-states, and from current tests it will probably be more like 25 000 t-states. That is nearly twice as fast as the OS division and a little faster than the OS's multiplication, and this is at 19 digits of accuracy.

My worry when I thought of the algorithm was that it seemed like it would be pretty close to the speed of long division and might only be faster for very large or very small numbers. It turns out that on the z80, this method blows away long division when using a floating point storage format.

Here is the way it works. If my numbers are always normalised (meaning the upper bit is always set unless the number is 0), then I might want to compute something like the following, in hexadecimal: 8AD176/AC0980 (in base 10, 9097590/11274624). What I do is make an estimate for the first byte of the quotient. Normally on the z80, we work bit by bit because handling that many bytes at once is difficult, but my algorithm isn't quite long division. The speed-up comes from how we choose each chunk of 8 bits: I take the upper 16 bits of the numerator and the upper 8 bits of the denominator and divide those to get my estimate. The nice part is that this will only be off by as much as 1 (and if it is, it was an overestimate). Then, whatever the estimate was, multiply it by the denominator (as in base-10 long division) and subtract. So for the example:
Code: [Select]
First digit = 0 ('digits' are 8-bit ints, so on [0,255])
Now 8AD1/AC = CE, so 8AD176.00 - AC0980*0.CE    = 8AD176-8A6FA5 = 61D1
Now 61D1/AC = 91, so 61D1.0000 - AC0980*.0091   = 61D1.0-6171.6180 = 5F.9E80
Now 5F9E/AC = 8E, so 5F.9E80   - AC0980*.00008E = 5F.9E8000-5F.6D4500 = .313B
In this case, there were no overestimates. We would have known if there was one because the subtraction step would have yielded a negative output. To adjust, decrement the new digit by 1 and add AC0980 back. So the example gives 8AD176/AC0980 = 0.CE918E, or in base 10, 9097590/11274624 = .806908488274

In my routine, that multiplication step is just an 8-bit by 64-bit multiply, and I combine it with the subtraction (so it is a fused multiply-subtract routine); it only needs to be performed 8 times. Hopefully when I get out of work later, I won't be too tired to keep working on this :)
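
As a rough illustration only (not the actual routine, and with names made up for the sketch), here is a minimal Python model of the byte-at-a-time scheme described above, checked against the 8AD176/AC0980 example. The shifting remainder is a simplification of how the real code keeps its buffers.
Code: [Select]
# Minimal model of the byte-at-a-time division described above (not the z80 code).
# den is normalized (top bit of its top byte set); each quotient "digit" is a byte.
def div_bytewise(num, den, digits=3, den_bytes=3):
    q0 = 1 if num >= den else 0          # first digit: a normalized den fits at most once
    rem = num - q0 * den
    den_top = den >> (8 * (den_bytes - 1))            # upper 8 bits of the denominator
    out = []
    for _ in range(digits):
        rem <<= 8                                      # bring down the next (zero) byte
        est = (rem >> (8 * (den_bytes - 1))) // den_top  # upper 16 bits / upper 8 bits
        est = min(est, 255)
        rem -= est * den                               # fused multiply-subtract
        while rem < 0:                                 # overestimate: back off and restore
            est -= 1
            rem += den
        out.append(est)
    return q0, out

q0, ds = div_bytewise(0x8AD176, 0xAC0980)
print(q0, [hex(d) for d in ds])   # 0 ['0xce', '0x91', '0x8e'], i.e. 0.CE918E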

Offline chickendude

  • LV8 Addict (Next: 1000)
  • ********
  • Posts: 817
  • Rating: +90/-1
  • Pro-Riot Squad
    • View Profile
Re: [z80] Floating Point Routines
« Reply #17 on: March 18, 2014, 01:21:38 am »
This stuff is mostly all way beyond me, but don't think no one is reading it or that no one finds it interesting. I think these routines are really cool and I'm trying to make the effort to understand them. I hope you keep working on them!

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #18 on: March 18, 2014, 10:45:35 am »
Cool, thanks :) I finally got all of the small bugs out and it currently takes 25053 t-states to perform pi/e and 22483 t-states for e/pi. However, I have the number format slightly different from my other 80-bit floats. After experimenting earlier this year, I found it is actually a little friendlier to store the exponent+sign in little-endian mode and to have 16384 represent the exponent 0.

Floating point is basically like scientific notation. So in base 10, 314.1593=3.141593*10^2. The way it would be stored in floating point is a sign bit (positive, so usually 0), followed by the exponent (2), followed by the string "3141593". Now if we want to square this, we can multiply 3141593*3141593 and add the exponents, and XOR the signs. So you get +:4:9869606577649 and this tells us that the number is +98696.06577649. In practice, we truncate and round the digits, so to 6 digits, +:4:986961.

In binary, it is the same idea. If I keep the numbers normalized (keeping the mantissa on [1,2)), then division of 3/5, for example, is dividing +:1:11000000 / +:2:10100000. We subtract the exponents, XOR the signs, and then perform 11000000/10100000 to whatever precision we need. Integer division would give us 0, but if we go for 8 more iterations of the same division algorithm, we get 8 bits after the point. Rounding, we get +:-1:10011010, which corresponds to .10011010 = 154/256 = .6015625.
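
To make that bookkeeping concrete, here is a tiny Python sketch of the toy format used in the example above (sign, exponent, 8-bit normalized mantissa). It reproduces the 3/5 result; it is only an illustration, not the 80-bit routine, and the function name is invented for the sketch.
Code: [Select]
# Toy normalized binary float: (sign, exponent, m) with 128 <= m <= 255,
# representing (-1)^sign * (m/128) * 2^exponent.  Division only.
def fdiv(a, b):
    (sa, ea, ma), (sb, eb, mb) = a, b
    s = sa ^ sb                          # XOR the signs
    e = ea - eb                          # subtract the exponents
    m = (ma * 128 + mb // 2) // mb       # mantissa quotient, rounded to 8 bits
    if m < 128:                          # quotient was below 1: renormalize
        m = (ma * 256 + mb // 2) // mb
        e -= 1
    return s, e, m

s, e, m = fdiv((0, 1, 0b11000000), (0, 2, 0b10100000))   # 3 / 5
print(s, e, bin(m))   # 0 -1 0b10011010  ->  (154/128) * 2**-1 = 0.6015625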

Here is my code, currently. I am pretty sure it isn't nearly as optimized as it could be and it is nearly 900 bytes:
Code: [Select]
Div_Sub:
;DE/C, DE <C*256, C>127
   ld a,d
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ rla \ jr c,$+5 \ cp c \ jr c,$+4 \ sub c \ inc e
   sla e \ adc a,a \ jr c,$+5 \ ret p \ cp c \ ret c \ inc e \ ret
FloatDiv_80:
; 1 bit sign + 15 bits signed exponent (16384 is exp = 0)   (little endian)
; 64 bits mantissa, (big endian)
;Inputs:
;  HL points to dividend
;  DE points to divisor
   ex de,hl
   call LoadFPOPs
   ld hl,(fpOP1)
   ld de,(fpOP2)
   ld a,h
   xor d
   push af
   res 7,d
   res 7,h
   sbc hl,de
   ld bc,16384
   add hl,bc
   pop af
   and $80
   or h
   ld h,a
   ld (fpOP3),hl
;Now perform the division of fpOP2/fpOP1
;The algo works like this:
;  Take the first byte of fpOP2, compare against that of fpOP1
;  If it is bigger, since fpOP1 should have bit 7 set (normalized numbers),
;    it divides at most once. So the first byte is 1, subtract fpOP2-fpOP1->fpOP2
;  After this, we repeatedly divide the upper two bytes of fpOP2 by the first byte
;    of fpOP1. This estimates the next 8-bit digit of the quotient.
;  This is just a guesstimate, but each digit is an overestimate by at most 1!
;
;  Example with smaller numbers. Take 8AD176/AC0980
;   First digit = 0   ('digits' are 8-bit ints, so on [0,255])
;   Now 8AD1/AC = CE, so 8AD176.00 - AC0980*0.CE    = 8AD176-8A6FA5 = 61D1
;   Now 61D1/AC = 91, so 61D1.0000 - AC0980*.0091   = 61D1.0-6171.6180 = 5F.9E80
;   Now 5F9E/AC = 8E, so 5F.9E80   - AC0980*.00008E = 5F.9E8000-5F.6D4500 = .313B
;  In this case, there were no overestimates. We would have known if the subtraction step
;  yielded a negative output. To adjust, decrement the new digit by 1 and add AC0980 back.
;  So the example gives 8AD176/AC0980 = 0.CE918E, or in base 10, 9097590/11274624=.806908488274

;fpOP1+2 has denom
;fpOP2+2 has num
   ld de,fpOP2-2
   ld hl,fpOP2+2
   ldi \ ldi \ ldi
   ldi \ ldi \ ldi
   ldi \ ldi \ ldi
   ldi \ ldi \ ldi
denom = fpOP1+2
numer = fpOP2-2
outp  = numer-1
   ld hl,denom
   ld de,numer
   call cp_64b
   ld hl,numer-1
   ld (hl),0
   jr c,noadjust
      inc (hl)
      ex de,hl
      inc de
      ld hl,denom
      call sub_64b
      ex de,hl \ dec hl
noadjust:
   inc hl
   ld de,numer+8
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   call div_sub_1
   ld de,801Eh
   ld hl,800Bh
   ld a,(hl)
   rra
   jr nc,directcopy
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ inc de
      inc hl \ ld a,(hl) \ rra \ ld (de),a \ ret
directcopy:
   inc hl
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ld hl,(fpOP3) \ dec hl \ ld (fpOP3),hl \ ret
div_sub_1:
   ld bc,(denom)
   ld a,(hl)
   inc hl
   push hl
   ld l,(hl)
   ld h,a
   ex de,hl
   call Div_Sub
   ld c,e
   ex de,hl
   call fused_mul_sub
   ld hl,9
   add hl,de
   ex de,hl
   pop hl
   ret
fused_mul_sub:
;multiply denominator*E and subtract from numerator
   xor a
   ld hl,(denom+6) \ ld b,a \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   ld a,(de) \ sub l \ ld (de),a \ dec de
   ld a,h \ adc a,b

   ld hl,(denom+5) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom+4) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom+3) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom+2) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom+1) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ adc a,b \ dec de

   ld hl,(denom-1) \ ld l,b
   sla h \ jr nc,$+3 \ ld l,c
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add hl,hl \ jr nc,$+3 \ add hl,bc
   add a,l \ jr nc,$+3 \ inc h \ ld l,a
   ld a,(de) \ sub l \ ld (de),a \ ld a,h \ dec de
   ld l,a
   ld a,(de)
   sbc a,l
;if c flag is set, overestimate
   ld a,c \ ld (de),a
   ret nc
   ld hl,8
   add hl,de
   ex de,hl
   ld hl,denom+7
   ld a,(de) \ add a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec hl \ dec de
   ld a,(de) \ adc a,(hl) \ ld (de),a \ dec de
   ex de,hl \ dec (hl) \ ex de,hl
   ret



;num+7 - hl

sub_64b:
;(de)-(hl), big endian 64-bit.
   ld bc,7
   add hl,bc
   ex de,hl
   add hl,bc
   ex de,hl
   ld a,(de) \ sub (hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ dec de \ dec hl
   ld a,(de) \ sbc a,(hl) \ ld (de),a \ ret
cp_64b:
;compares (de) to (hl), big endian 64-bit ints
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret nz \ inc de \ inc hl
   ld a,(de) \ cp (hl) \ ret

LoadFPOPs:
;HL points to the first
;DE points to the second
   push de
   ld de,fpOP1
   xor a
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ld (de),a \ inc de
   ld (de),a \ inc de
   ld (de),a \ inc de
   ld (de),a \ inc de
   pop hl
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ldi
   ld (de),a \ inc de
   ld (de),a \ inc de
   ld (de),a \ inc de
   ld (de),a \ inc de
   ret
.echo "Size:",$-Div_Sub
It is so large because I unrolled much of it. As an example, using e and pi:
Code: [Select]
   ld hl,float_e
   ld de,float_pi
   jp FloatDiv_80
;e/pi =0.dd816a76547ca9910802972996d4e3
float_pi:
  .dw 16384+1 \ .db $c9,$0f,$da,$a2,$21,$68,$c2,$34    ;pi, not rounded up
float_e:
  .dw 16384+1 \  .db $ad,$f8,$54,$58,$a2,$bb,$4a,$9A    ;e, not rounded up

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #19 on: March 19, 2014, 07:40:17 am »
At a hefty 1631 bytes and a change of syntax, I have managed to get division under 20 000 t-states and multiplication to about 13 000 t-states.


The change of syntax is basically this: the sign is bit 15, the exponent is stored so that 16384 represents 0, and together these form the first two bytes, in little-endian order. The following 8 bytes are the mantissa and are also little-endian (it was big-endian before).


I also changed how the multiplication and division routines are done. For the division at each step, instead of doing 8 8x8->16 multiplications, it is faster to do 4 8x16->24 multiplications. For the multiplication, instead of doing 16 16x16->32 multiplications, I do 32 8x16->24 multiplications.
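
As a sanity check of that counting, here is a short Python sketch of a 64x64 multiply assembled from 8x16->24 partial products (eight 8-bit limbs times four 16-bit limbs is 32 small multiplies). The real routine's ordering and carry handling are of course different; this just shows the decomposition.
Code: [Select]
# 64x64-bit multiply built from 8x16->24 partial products: one operand split
# into eight 8-bit limbs, the other into four 16-bit limbs (8*4 = 32 multiplies).
def mul64(a, b):
    a_limbs = [(a >> (8 * i)) & 0xFF for i in range(8)]
    b_limbs = [(b >> (16 * j)) & 0xFFFF for j in range(4)]
    result = 0
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            result += (ai * bj) << (8 * i + 16 * j)   # each small product fits in 24 bits
    return result

import random
x, y = random.getrandbits(64), random.getrandbits(64)
assert mul64(x, y) == x * y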


I should also note that these timings are without code to check for overflow, zero, infinities, or NaN, but I changed the exponent storage to make those checks easier to add. Now if I can get addition and subtraction working in this new format, I will be tempted to write some really slow routines for exponentials, logs, and trig functions. Then they will be available until I have time to actually design much faster algorithms.


Also, I have been trying to think of a way to make a "geometric mean" function. Basically, the geometric mean of 'a' and 'b' is sqrt(ab). I could just multiply and then perform the square root, but for the past few months I have been having the idea of fusing the two operations into one. Since an NxN multiplication returns a result of size 2N, I can try to produce the product 2 bits at a time. A restoring square-root algorithm takes in the upper two bits of its input at each iteration, so I could feed the bits from the multiplication directly into the square root algorithm without worrying about storing a 128-bit integer, while still getting full precision to 64 bits (instead of multiplying to get a 64-bit approximation and taking the square root of that).


The geometric mean would be useful when combined with the arithmetic mean (a+b)/2, especially if I can get it streamlined and fast. There is a class of algorithms based on the Arithmetic-Geometric Mean, and these provide the asymptotically fastest known methods of computing things like the natural log, arctangent, and others.
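
For reference, the arithmetic-geometric mean iteration itself is tiny; this floating-point Python sketch only shows the iteration and its fast convergence (the log/arctan formulas built on top of it are the involved part, and are not shown here).
Code: [Select]
# Arithmetic-geometric mean: replace (a, b) by their arithmetic and geometric
# means until they agree.  Convergence is quadratic (correct digits roughly double).
from math import sqrt

def agm(a, b, tol=1e-15):
    while abs(a - b) > tol * a:
        a, b = (a + b) / 2, sqrt(a * b)
    return a

print(agm(1.0, 2.0))   # ~1.456791031...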

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #20 on: March 19, 2014, 01:11:59 pm »
I have a new version of the 80-bit floats and I made an example routine to compute the square root using Newton's method (basically .5(Ans+X/Ans), 5 times). Since there wasn't already a square root routine and this gets about 64 bits of precision, I have left it in, but it is slow.


Attached is the current source, and it is updated in the Google Docs thing. For a summary of timings:
Code: [Select]

Routine          Clock cycles   Ops per second (6 MHz)
Add/Sub            1200 cc       5000
Multiplication    13000 cc        461
Division          19000 cc        315
Sqrt             108000 cc         55

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #21 on: April 10, 2014, 10:33:46 pm »

I added in formats and full support for signed zero, signed infinity, and NaN. I also found that Newton's method may actually be my best option for the square root algorithm, but I did mix in my own algorithm, too. The one I came up with converges linearly, but each iteration is faster than an iteration of Newton's method. I used one iteration of my method, at a cost of about 2000 cycles, to give a better approximation than 2 iterations of Newton's method, so I got to remove one Newton's method iteration plus the already optimized first iteration. The result was over 20 000 clock cycles saved, putting it about 700 clock cycles slower than TI's (but if I only had to compute 47 bits of precision, it would be over 22000 t-states faster for the same accuracy :P).


Anyways, I updated the google docs thing, and I also attached the files. I reorganized things by splitting up main routines (add/sub, multiply, divide, square root) into their own files and then I just #include them.

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #22 on: April 11, 2014, 09:17:37 am »
Since the previous upload has been downloaded already, I guess I will upload this revised version. Last night in bed I was thinking about better ways to do the square root and I realized my method was not in fact as fast as it could be. See, I did the following computation in my head:


x_n is a sequence converging to sqrt(x) and is defined by:
##x_{n+1}=\frac{x_{n}+\frac{x}{x_{n}}}{2}##
(Basically, it averages an overestimate and underestimate of the square root.)


So say at iteration n, I have the error being ##k=x_{n}-\sqrt{x}##. Then the next iteration of the sequence has the error:


##x_{n+1}-\sqrt{x}=\frac{x_{n}+\frac{x}{x_{n}}}{2}-\sqrt{x}##


##=\frac{x_{n}+\frac{x}{x_{n}}-2\sqrt{x}}{2}##


##=\frac{x_{n}-\sqrt{x}+\frac{x}{x_{n}}-\sqrt{x}}{2}##


##=\frac{k+\frac{x}{x_{n}}-\sqrt{x}}{2}##


##=\frac{k+\frac{x}{\sqrt{x}+k}-\sqrt{x}}{2}##


##=\frac{k+\frac{x-x-k\sqrt{x}}{\sqrt{x}+k}}{2}##


##=\frac{k-\frac{k\sqrt{x}}{\sqrt{x}+k}}{2}##


##=\frac{k-\frac{k(x_{n}-k)}{\sqrt{x}+k}}{2}##


##=\frac{k-\frac{k x_{n}-k^{2}}{\sqrt{x}+k}}{2}##


##=\frac{k-\frac{k x_{n}-k^{2}}{x_{n}}}{2}##


##=\frac{k-k+\frac{k^{2}}{x_{n}}}{2}##

##=\frac{\frac{k^{2}}{x_{n}}}{2}##
What does this even mean? Suppose I had n bits of accuracy. Then, as long as x is on [1,4), as my values are, the next iteration has at least 2n+1 bits of accuracy (up to 2n+2). Basically, if I could get 8 bits of accuracy initially and cheaply, the next iteration would yield at least 17 bits, then 35, then 71. What I decided to do, then, was just use a simple and fast 16-bit square root routine that I wrote to get an 8-bit estimate. However, when I went to look in my folder, I found that I had already written a 32-bit square root routine that returns a 16-bit result.
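
Here is a quick numerical check of that claim in Python (a sketch with a quantized 16-bit seed, not the fixed-point code): starting from roughly 16 correct bits, two Newton iterations already pass 64 bits.
Code: [Select]
# Check the doubling of accuracy: x on [1,4), seed accurate to roughly 16 bits,
# then each Newton step roughly doubles the number of correct bits.
from decimal import Decimal, getcontext
getcontext().prec = 60

x = Decimal("3.141592653589793238462643383279")   # some value on [1,4)
true = x.sqrt()

est = Decimal(int(float(x) ** 0.5 * 2**15)) / 2**15   # crude 16-bit fixed-point seed
for i in range(3):
    bits = int(-abs(est - true).ln() / Decimal(2).ln())
    print(f"after {i} Newton iterations: about {bits} correct bits")
    est = (est + x / est) / 2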


This means that I have successfully removed 3 iterations of Newton's method since last night, bringing the total to just 2 iterations. Now it is down from 108 000 t-states to just 46 000, making it much faster than TI's.




Offline TIfanx1999

  • ಠ_ಠ ( ͡° ͜ʖ ͡°)
  • CoT Emeritus
  • LV13 Extreme Addict (Next: 9001)
  • *
  • Posts: 6173
  • Rating: +191/-9
    • View Profile
Re: [z80] Floating Point Routines
« Reply #23 on: April 11, 2014, 09:39:07 am »
I saw on IRC that you made significant speed gains and said that it might be ~2x faster than TI's. Kudos to you! :D

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #24 on: April 11, 2014, 09:47:18 am »
Well, here are the timings I got from WabbitEmu for the OS (86825 cc) and mine (46831 cc). So it isn't quite twice as fast, but almost. I am also working on a routine that should cut out another 16000 or so, which would make it almost 3 times faster. For the timings I have:
Code: [Select]

Args used:
1.570796326794897
57.29577951308232
For example, 57.29577951308232/1.570796326794897


                TI-OS      Float80    diff      ratio     analysis
add/subtract    2758       3166       +408      1.1479    Add/sub is a bit slower, possibly noticeably
multiply        35587      10851      -24736    0.3049    Multiplication is significantly faster. Noticeable.
divide          40521      18538      -21983    0.4575    Division is significantly faster. Noticeable.
square root     86825      46831      -39994    0.5394    Square roots are significantly faster. Noticeable.


notes: TI-Floats are approximately 47 bits of precision. Float80 uses 64 bits of precision (that is 14 digits versus 19)

Offline Streetwalrus

  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 3821
  • Rating: +80/-8
    • View Profile
Re: [z80] Floating Point Routines
« Reply #25 on: April 11, 2014, 10:22:06 am »
Oh wow, that's nice! I might use these in Illusiat since they are faster. It would be nice if you made an Axiom out of them.

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #26 on: April 11, 2014, 10:31:42 am »
For that, what kind of floating point precision would you need? 80-bit floats are pretty wild, but you may have seen that I have a bunch of 24-bit floats, and I can make 40-bit floats that are really fast, too (roughly cutting all of the times to a quarter). In fact, the multiplication routine uses a divide-and-conquer algorithm with an intermediate step of computing 32-bit multiplications (and a 40-bit float would have 32-bit multiplication).
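
For anyone curious what such a divide-and-conquer split can look like, here is a Python sketch of one classic scheme, Karatsuba on 32-bit halves (three 32-bit multiplies instead of four). Whether the actual routine uses this variant or the plain four-product split isn't stated here, so treat it purely as an illustration.
Code: [Select]
# Karatsuba split of a 64x64 multiply into three 32-bit multiplies (one classic
# divide-and-conquer scheme; not necessarily the one the routine above uses).
def karatsuba64(a, b):
    a_hi, a_lo = a >> 32, a & 0xFFFFFFFF
    b_hi, b_lo = b >> 32, b & 0xFFFFFFFF
    lo  = a_lo * b_lo
    hi  = a_hi * b_hi
    mid = (a_lo + a_hi) * (b_lo + b_hi) - lo - hi   # note: the sums can be 33 bits wide
    return (hi << 64) + (mid << 32) + lo

import random
x, y = random.getrandbits(64), random.getrandbits(64)
assert karatsuba64(x, y) == x * y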

Offline TIfanx1999

  • ಠ_ಠ ( ͡° ͜ʖ ͡°)
  • CoT Emeritus
  • LV13 Extreme Addict (Next: 9001)
  • *
  • Posts: 6173
  • Rating: +191/-9
    • View Profile
Re: [z80] Floating Point Routines
« Reply #27 on: April 11, 2014, 10:37:59 am »
Quote from: Xeda112358 on April 11, 2014, 09:47:18 am
Well, here are the timings I got from WabbitEmu for the OS (86825 cc) and mine (46831 cc). So it isn't quite twice as fast, but almost. I am also working on a routine that should cut out another 16000 or so, which would make it almost 3 times faster. For the timings I have:
Code: [Select]

Args used:
1.570796326794897
57.29577951308232
For example, 57.29577951308232/1.570796326794897


                TI-OS      Float80    diff      ratio     analysis
add/subtract    2758       3166       +408      1.1479    Add/sub is a bit slower, possibly noticeably
multiply        35587      10851      -24736    0.3049    Multiplication is significantly faster. Noticeable.
divide          40521      18538      -21983    0.4575    Division is significantly faster. Noticeable.
square root     86825      46831      -39994    0.5394    Square roots are significantly faster. Noticeable.


notes: TI-Floats are approximately 47 bits of precision. Float80 uses 64 bits of precision (that is 14 digits versus 19)

:o
That's awesome Xeda! ;D

Offline Streetwalrus

  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 3821
  • Rating: +80/-8
    • View Profile
Re: [z80] Floating Point Routines
« Reply #28 on: April 11, 2014, 11:36:08 am »
I don't know how much precision the original needs. I know that calculations go way high in the battle engine and that there is a 9999 damage cap if it goes beyond that. Also it's trimmed to an integer in the end.

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: [z80] Floating Point Routines
« Reply #29 on: April 12, 2014, 05:42:20 pm »
Hmm, isn't 16-bit math sufficient, then? Also, in good news, I actually shaved off close to 18000 more clock cycles from the square root routine, putting it at a little over 3 times faster than TI's. I am back to working on the exponential and logarithm routines, but they are table-based (using a single 64-element LUT with 9 bytes per element). From this I will build the Float->Str routine.