Author Topic: Speed Optimised HL_mod_10 (Read 2743 times)

Xeda112358 · « **on:** December 15, 2013, 01:44:11 pm »

I was trying to make an HL_mod_10 routine today employing some of the trickery I used with my Rand24 routine (24-bit pseudo-random number generator). I wanted to make it as fast as I could, so I was wondering if anybody wanted to have a fun game of optimise-the-short-assembly-routine.

My approach was to break HL into two parts as H*256+L. Mod 10, that is equivalent to:
(H*260-H*4+L) mod 10=(L-H*4) mod 10

H and L are 8-bit values, so H*4 has the potential of being a 10-bit number. If I call this X*256+Y, where X is the upper 2 bits of H*4 and Y is the lower 8 bits:
(L-H*4) mod 10 = (L-X*256-Y) mod 10
(L-X*256-Y) mod 10 =(L-(X*260-X*4)-Y) mod 10
(L-(X*260-X*4)-Y) mod 10=(L-Y+X*4) mod 10
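
As a quick sanity check of the derivation above, here is a minimal Python sketch that brute-forces the identity over every 16-bit input. Python and the function name `reduced` are my own choices for illustration, not part of the routine.

```python
# Brute-force check of HL mod 10 == (L - Y + X*4) mod 10.
# X and Y are the upper 2 bits and lower 8 bits of H*4, as defined above.
def reduced(hl):
    h, l = hl >> 8, hl & 0xFF
    x, y = (h * 4) >> 8, (h * 4) & 0xFF
    return l - y + x * 4

# Python's % returns a non-negative result for a positive modulus,
# so negative intermediate values are handled correctly here.
assert all(reduced(hl) % 10 == hl % 10 for hl in range(0x10000))

# Range of the intermediate value before any 8-bit wrapping is applied.
assert min(map(reduced, range(0x10000))) == -252
assert max(map(reduced, range(0x10000))) == 267
```

The last two assertions show why the 8-bit version needs a correction step: the intermediate value can fall below 0 or go above 255.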

Now that is an 8-bit subtraction with L and Y. If there is 8-bit 'overflow' (a borrow) with the subtraction, then I need to adjust it by adding a constant: the wrapped result is 256 too high, and -256 ≡ 4 (mod 10). Then, X*4 is at most a 4-bit number to add to that. Once that is all done, you can perform mod 10 stuff on an 8-bit number. I came up with this code:

Code:

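HL_mod_10:
;Input: HL
;Output: (HL_mod_10 -> A)
;197.75 avg cycles
	ld a,l
	ld l,h
	ld h,0
	add hl,hl
	add hl,hl	;HL = H*4, so now H = X and L = Y
	sub l \ jr nc,$+8 \ add a,4 \ jr nc,$+4 \ add a,6	;A = L-Y; on borrow add 4 (-256 = 4 mod 10), and 6 more if that wraps
	sla h \ sla h	;H = X*4
	add a,h \ jr nc,$+8 \ sub 4 \ jr nc,$+4 \ sub 6	;A += X*4; on carry subtract 4 (256 = -4 mod 10), and 6 more if that wraps
	sub 160 \ jr nc,$+4 \ add a,160	;conditional subtracts reduce A to 0..9
	sub 80 \ jr nc,$+4 \ add a,80
	sub 40 \ jr nc,$+4 \ add a,40
	sub 20 \ jr nc,$+4 \ add a,20
	sub 10 \ ret nc \ add a,10 \ ret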

It averages a smidge under 197.75 t-states.

EDIT: If anybody wants to modify this for other languages, here is a supplemental explanation:
This is the equivalent of HL mod 10:
(L-Y+X*4) mod 10

If the input bits of HL are abcdefghijklmnop₂, then:
L=ijklmnop₂
Y=cdefgh00₂
X=000000ab₂

So to put it into an 8-bit value to operate on:
ijklmnop₂-cdefgh00₂+0000ab00₂

If this is ≥256 (it will be at most 267), then add 6 and keep the lower 8 bits.
If this is <0 (it will be at the lowest -252), then take the lower 8 bits (two's complement form) and add 4. If that overflows to be ≥256, then add another 6, keeping only the lower 8 bits.
So for example, if ijklmnop₂-cdefgh00₂+0000ab00₂=-3 → FD+04→101₁₆→01+06→07.

Now perform mod 10 on the 8-bit result. You can apply more tricks to this if you like. abcdefgh₂=128a+0bcdefgh₂, so if a=1, you can do 0bcdefgh₂-2, for example. Or:
0000ab00₂+00cdefgh₂
Or:
0000abc0₂+000defgh₂
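
For anyone porting this, here is a rough Python sketch of the procedure described above. It is only a model under my own assumptions: the name `hl_mod_10` and the variable names are hypothetical, and it follows the two-stage order of the assembly (subtract and correct, then add and correct) rather than being a cycle-accurate translation.

```python
def hl_mod_10(hl):
    """Model of the 8-bit procedure; a stays in 0..255 after each correction."""
    h, l = hl >> 8, hl & 0xFF
    y = (h << 2) & 0xFF              # cdefgh00
    x4 = (h >> 6) << 2               # 0000ab00

    a = l - y
    if a < 0:                        # borrow: wrapped value stands for w - 256,
        a = (a & 0xFF) + 4           # and -256 == 4 (mod 10)
        if a > 0xFF:                 # the add wrapped as well: 256 == 6 (mod 10)
            a = (a & 0xFF) + 6

    a += x4
    if a > 0xFF:                     # carry: wrapped value stands for w + 256,
        a = (a & 0xFF) - 4           # and 256 == -4 (mod 10)
        if a < 0:                    # the subtract wrapped as well
            a = (a & 0xFF) - 6

    for step in (160, 80, 40, 20, 10):   # conditional-subtract chain
        if a >= step:
            a -= step
    return a


assert all(hl_mod_10(n) == n % 10 for n in range(0x10000))

# The extra 8-bit tricks, checked for every byte value abcdefgh:
for v in range(256):
    assert v % 10 == ((v & 0x7F) - 2 * (v >> 7)) % 10      # 0bcdefgh - 2 when a=1
    assert v % 10 == (((v >> 6) << 2) + (v & 0x3F)) % 10   # 0000ab00 + 00cdefgh
    assert v % 10 == (((v >> 5) << 1) + (v & 0x1F)) % 10   # 0000abc0 + 000defgh
```

The three byte-level identities work because 128 ≡ -2, 64 ≡ 4 and 32 ≡ 2 (mod 10).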

Xeda112358 · « **Reply #1 on:** December 16, 2013, 02:54:10 pm »

I took another approach and now the worst case is 156 t-states:

Code:

HL_mod_10:
;Input: HL
;Output: HL_mod_10->A
;156 t-states worst case
;141 t-states best case
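	srl h \ ld a,l \ rra \ add a,h \ adc a,0	;A = upper 7 bits + lower 8 bits of HL/2, plus 1 on overflow
	ld h,a \ res 4,h	;H = A with bit 4 cleared: its low 5 bits hold the lower nibble
	rlca \ rlca \ rlca \ rlca	;swap the nibbles of A
	and 15 \ add a,h \ and 31	;A = upper nibble + lower nibble (at most 30)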
	rr l \ rla
	sub 40 \ jr nc,$+4 \ add a,40
	sub 20 \ jr nc,$+4 \ add a,20
	sub 10 \ ret nc \ add a,10 \ ret

The approach here basically is to perform mod 5 on the upper 15 bits of HL, then rotate the lower bit of the input back in. This is from the observation that 10=5*2, so no matter what, the lower 1 bit is going to be the same in the input and output. So then I could do
(x mod 10) = ((x mod 10)-(x mod 2)) mod 10 + (x mod 2)
= (floor(x/2) mod 5)*2 + (x mod 2)
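
The identity above is easy to confirm by brute force. A minimal Python check, with `split_mod_10` being a name I made up for illustration:

```python
def split_mod_10(x):
    # (floor(x/2) mod 5)*2 + (x mod 2)
    return ((x // 2) % 5) * 2 + (x % 2)

# Holds for every 16-bit input.
assert all(split_mod_10(x) == x % 10 for x in range(0x10000))
```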

However, from earlier observations, mod 2ⁿ+1 and mod 2ⁿ-1 are pretty easy in binary. For mod 5, since 2⁴=16≡1 (mod 5), you can add each 4k-bit chunk together where k≥1 and perform mod 5 on that. In my code, I add the upper 7 bits of the halved input to its lower 8 bits; if there is overflow, just add 1 to the 8-bit result. Then I add the upper and lower nibbles of that, producing at most a 5-bit result. At that point, I rotate the low bit back in, making at most a 6-bit value that can be checked against 10*2^{2,1,0}.
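
The steps in that paragraph can be modelled in a few lines of Python. This is a sketch under my own assumptions (the name `hl_mod_10_via_mod5` and the variable names are hypothetical), meant to show the arithmetic rather than the exact register usage.

```python
def hl_mod_10_via_mod5(hl):
    """Model of the mod-5 approach: fold chunks, then rotate the low bit back in."""
    low_bit = hl & 1
    half = hl >> 1                       # upper 15 bits of the input

    s = (half >> 8) + (half & 0xFF)      # upper 7 bits + lower 8 bits (256 == 1 mod 5)
    if s > 0xFF:                         # 8-bit overflow: keep low 8 bits and add 1
        s = (s & 0xFF) + 1

    s = (s >> 4) + (s & 0x0F)            # nibble sum (16 == 1 mod 5), at most 30
    a = (s << 1) | low_bit               # low bit rotated back in, at most 61

    for step in (40, 20, 10):            # compare against 10*2^2, 10*2^1, 10*2^0
        if a >= step:
            a -= step
    return a


assert all(hl_mod_10_via_mod5(n) == n % 10 for n in range(0x10000))
```

Because the folded value never exceeds 61, three conditional subtractions are enough, compared with five in the first routine.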
