Author Topic: Speed Optimised HL_mod_10  (Read 2549 times)

0 Members and 1 Guest are viewing this topic.

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Speed Optimised HL_mod_10
« on: December 15, 2013, 01:44:11 pm »
I was trying to make an HL_mod_10 routine today employing some of the trickery I used with my Rand24 routine (24-bit pseudo-random number generator). I wanted to make it as fast as I could, so I was wondering if anybody wanted to have a fun game of optimise-the-short-assembly-routine.

My approach was to break HL into two parts as H*256+L. Mod 10, that is equivalent to:
(H*260-H*4+L) mod 10=(L-H*4) mod 10

H and L are 8-bit values, so H*4 has the potential of being a 10 bit number. If I call this X*256+Y where X is the upper 2 bits of H*4:
(L-H*4) mod 10 = (L-X*256-Y) mod 10
(L-X*256-Y) mod 10 =(L-(X*260-X*4)-Y) mod 10
(L-(X*260-X*4)-Y) mod 10=(L-Y+X*4) mod 10

Now that is an 8-bit subtraction with L and Y. If there is 8-bit 'overflow' with the subtraction, then I need to adjust it by adding a constant. Then, X*4 is at most a 4 bit number to add to that. Once that is all done, you can perform mod 10 stuff on an 8-bit number. I came up with this code:
Code: [Select]
HL_mod_10
;Input: HL
;Output: (HL_mod_10 -> A)
;197.75 avg cycles
ld a,l
ld l,h
ld h,0
add hl,hl
add hl,hl
sub l \ jr nc,$+8 \ add a,4 \ jr nc,$+4 \ add a,6
sla h \ sla h
add a,h \ jr nc,$+8 \ sub 4 \ jr nc,$+4 \ sub 6
sub 160 \ jr nc,$+4 \ add a,160
sub 80 \ jr nc,$+4 \ add a,80
sub 40 \ jr nc,$+4 \ add a,40
sub 20 \ jr nc,$+4 \ add a,20
sub 10 \ ret nc \ add a,10 \ ret

It averages a smidge under 197.75 t-states.


EDIT: If anybody wants to modify this for other languages, here is a supplemental explanation:
This is the equivalent of HL mod 10:
(L-Y+X*4) mod 10

If the input bits of HL are abcdefghijklmnop2, then:
L=ijklmnop2
Y=cdefgh002
X=000000ab2

So to put it into an 8-bit value to operate on:
ijklmnop2-cdefgh002+0000ab002

If this is >256 (it will be at most 267), then add 6 and keep the lower 8 bits.
If this is <0 (it will be at the lowest -255) then take the lower 8-bits (signed, two's complement form), add 4. If that overflows to be >256, then add another 6, keeping only the lower 8 bits.
So for example, if ijklmnop2-cdefgh002+0000ab002=-3 → FD+04→10116→01+06→07.

Now perform mod 10 on the 8-bit result. You can apply more tricks to this if you like. abcdefgh2=128a+0bcdefgh2, so if a=1, you can do 0bcdefgh2-2, for example. Or:
0000ab002+00cdefgh2
Or:
0000abc02+000defgh2

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: Speed Optimised HL_mod_10
« Reply #1 on: December 16, 2013, 02:54:10 pm »
I took another approach and now the worst case is 156 t-states:
Code: [Select]
HL_mod_10:
;Input: HL
;Output: HL_mod_10->A
;156 t-states worst case
;141 t-states best case
srl h \ ld a,l \ rra \ add a,h \ adc a,0
ld h,a \ res 4,h
rlca \ rlca \ rlca \ rlca
and 15 \ add a,h \ and 31
rr l \ rra
sub 40 \ jr nc,$+4 \ add a,40
sub 20 \ jr nc,$+4 \ add a,20
sub 10 \ ret nc \ add a,10 \ ret
The approach here basically is to perform mod 5 on the upper 15 bits of HL, then rotate the lower bit of the input back in. This is from the observation that 10=5*2, so no matter what, the lower 1 bit is going to be the same in the input and output. So then I could do
(x mod 10) = ((x mod 10)-(x mod 2)) mod 10 + (x mod 2)
= (floor(x/2) mod 5)*2 + (x mod 2)

However, from earlier observations, mod 2n+1 and mod 2n-1 are pretty easy in binary. For mod 5 you can add each 2k-bit chunk together where k>1 and perform mod 5 on that. In my code, I add the upper 7 bit of the input to there lower 8 bits, if there is overflow, just add 1 to the 8-bit result. Then I add the upper and lower nibble of that producing at most a 5-bit result. At that point, I rotate back in the low bit, making at most a 6-bit value that can be checked against 10*2{0,1,2}.