Author Topic: Fast 8-bit Squaring  (Read 2789 times)

0 Members and 1 Guest are viewing this topic.

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Fast 8-bit Squaring
« on: January 28, 2014, 09:03:43 am »
Last night I was thinking again about my Sine Approximation routine and holy crud, I must have been really tired not to recognize this:

In the approximation, I was taking the upper 8 bits, truncating propagation. This was so that I had a fixed-point routine. But if I had taken the lower 8 bits, I would have had a useful 8-bit integer routine for -x2-x.

But wait.

If I start the accumulator with x (which is actually 3 cycles faster to do), then I end up with -x2.

Well then, anybody wanna optimize this? ^^ :
Code: [Select]
L_sqrd:
;Input: L
;Output: L*L->A
;159 t-states
;39 bytes
ld h,l
ld c,l
rr h
sbc a,a
xor l
add a,c
ld c,a
rl l

rr h
sbc a,a
xor l
and %11111000
add a,c
ld c,a
rl l

rr h
sbc a,a
xor l
and %11100000
add a,c
ld c,a
rl l

ld a,h
rrca
xor l
and 128
xor c
neg
ret

It might be more easy if I explain that it is computing:
Code: [Select]
=(hhhhhhhh^abcdefgh)+(gggggg00^bcdefg00)+(ffff0000^cdef0000)+(ee000000^de000000)
=(hhhhhhhh^abcdefgh)+(ggggg000^bcdef000)+(fff00000^cde00000)+(e0000000^d0000000)
=((hhhhhhhh^abcdefgh)+(ggggg000^bcdef000)+(fff00000^cde00000))^e0000000^d0000000

^ is XOR
+ is addition modulo 256
EDIT: Cross-posted:
I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states.
Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,a and use the first iteration to compute c. To do this, I just shift L at the beginning, then after the "sbc a,a" I just OR that with L. If a was $FF, the result is FF (which is -1), else it is 2*L:
Code: [Select]
L_sqrd:
;Input: L
;Output: L*L->A
;151 t-states
;37 bytes
ld h,l
;First iteration, get the lowest 3 bits
sla l
rrh
sbc a,a
or l
;second iteration, get the next 2 bits
ld c,a
rr h
sbc a,a
xor l
and $F8
add a,c
;third iteration, get the next 2 bits
ld c,a
sla l
rr h
sbc a,a
xor l
and $E0
add a,c
;fourth iteration, get the last bit
ld c,a
ld a,l
add a,a
rrc h
xor h
and $80
xor c
neg
ret
Also, as a note, if you stop at any iteration and perform NEG, you can mask to get the lower bits of the square. So for example, the first iteration gives you -L*L mod 8->A, the second returns -L*L mod 32->A, the third gives it mod 128.

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: Fast 8-bit Squaring
« Reply #1 on: January 28, 2014, 12:29:06 pm »
I managed to optimize the first iteration. I thought to analyze it while in class, so I saved two bytes, 8 cycles.

EDIT: Here is some pseudocode for getting the full square (not just the lower bits):
Code: [Select]
A_sqrd:
;Input: A is an x-bit number (example, 8-bit, 16-bit, et-cetera)
;Output: A*A
A<<1→B ;Store A shifted once into B.
0→ACC ;Initialise the accumulator with 0
1<<(x-1)→MASK ;Set the last bit of MASK
For (x-1) Iterations:
ACC<<1→ACC ;shift ACC (same as *2 or ACC+ACC)
B<<1→B ;shift left once, keep carry
((SBC(0,0)^A)&MASK)+ACC→ACC ;SBC(0,0) is "subtract with carry" and will yield either -1 (carry flag set) or 0 (carry flag reset).
(MASK>>1)|MASK→MASK ;arithmetic right shift
Return (A<<x-A)-ACC
Implemented on the Z80, it kind of sucks taking twice as long as a traditional routine. However, it is novel, I think, taking 7 iterations:
Code: [Select]
L_sqrd:
;Input: L
;Output: L*L->HL
;It's really slow compared to a generic 8-bit multiplication
;It's also pretty big
;But it is a novel approach!
ld e,a            ;this just stays a constant for the algorithm
rlca
ld d,a            ;this will be rotated to get the bits of the input
ld l,0            ;the accumulator
ld bc,0780h      ;c is a mask
loop:
add hl,hl
rlc d
sbc a,a
xor e
and c
add a,l
ld l,a
jr nc,$+3
inc h
sra c
djnz loop
;HL is 255*input-input*input
;add input
ld c,e
add hl,bc
;HL is 256*input-input*input
;negate HL
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
;HL is input*input-256*input
;add input*256
ld e,b
add hl,de
;HL is input*input
ret