556
ASM / Fast 8-bit Squaring
« on: January 28, 2014, 09:03:43 am »
Last night I was thinking again about my Sine Approximation routine and holy crud, I must have been really tired not to recognize this:
In the approximation, I was taking the upper 8 bits, truncating propagation. This was so that I had a fixed-point routine. But if I had taken the lower 8 bits, I would have had a useful 8-bit integer routine for -x2-x.
But wait.
If I start the accumulator with x (which is actually 3 cycles faster to do), then I end up with -x2.
Well then, anybody wanna optimize this? ^^ :
It might be more easy if I explain that it is computing:
I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states.
Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,a and use the first iteration to compute c. To do this, I just shift L at the beginning, then after the "sbc a,a" I just OR that with L. If a was $FF, the result is FF (which is -1), else it is 2*L:
In the approximation, I was taking the upper 8 bits, truncating propagation. This was so that I had a fixed-point routine. But if I had taken the lower 8 bits, I would have had a useful 8-bit integer routine for -x2-x.
But wait.
If I start the accumulator with x (which is actually 3 cycles faster to do), then I end up with -x2.
Well then, anybody wanna optimize this? ^^ :
Code: [Select]
L_sqrd:
;Input: L
;Output: L*L->A
;159 t-states
;39 bytes
ld h,l
ld c,l
rr h
sbc a,a
xor l
add a,c
ld c,a
rl l
rr h
sbc a,a
xor l
and %11111000
add a,c
ld c,a
rl l
rr h
sbc a,a
xor l
and %11100000
add a,c
ld c,a
rl l
ld a,h
rrca
xor l
and 128
xor c
neg
ret
It might be more easy if I explain that it is computing:
Code: [Select]
=(hhhhhhhh^abcdefgh)+(gggggg00^bcdefg00)+(ffff0000^cdef0000)+(ee000000^de000000)
=(hhhhhhhh^abcdefgh)+(ggggg000^bcdef000)+(fff00000^cde00000)+(e0000000^d0000000)
=((hhhhhhhh^abcdefgh)+(ggggg000^bcdef000)+(fff00000^cde00000))^e0000000^d0000000
^ is XOR
+ is addition modulo 256
EDIT: Cross-posted:I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states.
Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,a and use the first iteration to compute c. To do this, I just shift L at the beginning, then after the "sbc a,a" I just OR that with L. If a was $FF, the result is FF (which is -1), else it is 2*L:
Code: [Select]
L_sqrd:
;Input: L
;Output: L*L->A
;151 t-states
;37 bytes
ld h,l
;First iteration, get the lowest 3 bits
sla l
rrh
sbc a,a
or l
;second iteration, get the next 2 bits
ld c,a
rr h
sbc a,a
xor l
and $F8
add a,c
;third iteration, get the next 2 bits
ld c,a
sla l
rr h
sbc a,a
xor l
and $E0
add a,c
;fourth iteration, get the last bit
ld c,a
ld a,l
add a,a
rrc h
xor h
and $80
xor c
neg
ret
Also, as a note, if you stop at any iteration and perform NEG, you can mask to get the lower bits of the square. So for example, the first iteration gives you -L*L mod 8->A, the second returns -L*L mod 32->A, the third gives it mod 128.