Author Topic: Fast 8-bit Squaring (Read 2800 times)

Xeda112358 · « **on:** January 28, 2014, 09:03:43 am »

Last night I was thinking again about my Sine Approximation routine and holy crud, I must have been really tired not to recognize this:

In the approximation, I was taking the upper 8 bits, truncating propagation. This was so that I had a fixed-point routine. But if I had taken the lower 8 bits, I would have had a useful 8-bit integer routine for -x²-x.

But wait.

If I start the accumulator with x (which is actually 3 cycles faster to do), then I end up with -x².

Well then, anybody wanna optimize this? ^^ :

Code: [Select]

L_sqrd:
;Input: L
;Output: L*L->A
;159 t-states
;39 bytes
	ld h,l
	ld c,l
	rr h
	sbc a,a
	xor l
	add a,c
	ld c,a
	rl l

	rr h
	sbc a,a
	xor l
	and %11111000
	add a,c
	ld c,a
	rl l

	rr h
	sbc a,a
	xor l
	and %11100000
	add a,c
	ld c,a
	rl l

	ld a,h
	rrca
	xor l	
	and 128
	xor c
	neg
	ret

It might be more easy if I explain that it is computing:

Code: [Select]

=(hhhhhhhh^abcdefgh)+(gggggg00^bcdefg00)+(ffff0000^cdef0000)+(ee000000^de000000)
=(hhhhhhhh^abcdefgh)+(ggggg000^bcdef000)+(fff00000^cde00000)+(e0000000^d0000000)
=((hhhhhhhh^abcdefgh)+(ggggg000^bcdef000)+(fff00000^cde00000))^e0000000^d0000000

^ is XOR
+ is addition modulo 256

EDIT: Cross-posted:
I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states.
Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,a and use the first iteration to compute c. To do this, I just shift L at the beginning, then after the "sbc a,a" I just OR that with L. If a was $FF, the result is FF (which is -1), else it is 2*L:

Code: [Select]

L_sqrd:
;Input: L
;Output: L*L->A
;151 t-states
;37 bytes
	ld h,l
;First iteration, get the lowest 3 bits
	sla l
	rrh
	sbc a,a
	or l
;second iteration, get the next 2 bits
	ld c,a
	rr h
	sbc a,a
	xor l
	and $F8
	add a,c
;third iteration, get the next 2 bits
	ld c,a
	sla l
	rr h
	sbc a,a
	xor l
	and $E0
	add a,c
;fourth iteration, get the last bit
	ld c,a
	ld a,l
	add a,a
	rrc h
	xor h
	and $80
	xor c
	neg
	ret

Also, as a note, if you stop at any iteration and perform NEG, you can mask to get the lower bits of the square. So for example, the first iteration gives you -L*L mod 8->A, the second returns -L*L mod 32->A, the third gives it mod 128.

Xeda112358 · « **Reply #1 on:** January 28, 2014, 12:29:06 pm »

I managed to optimize the first iteration. I thought to analyze it while in class, so I saved two bytes, 8 cycles.

EDIT: Here is some pseudocode for getting the full square (not just the lower bits):

Code: [Select]

A_sqrd:
;Input: A is an x-bit number (example, 8-bit, 16-bit, et-cetera)
;Output: A*A
	A<<1→B					;Store A shifted once into B.
	0→ACC					;Initialise the accumulator with 0
	1<<(x-1)→MASK				;Set the last bit of MASK
	For (x-1) Iterations:
		ACC<<1→ACC			;shift ACC (same as *2 or ACC+ACC)
		B<<1→B				;shift left once, keep carry
		((SBC(0,0)^A)&MASK)+ACC→ACC	;SBC(0,0) is "subtract with carry" and will yield either -1 (carry flag set) or 0 (carry flag reset).
		(MASK>>1)|MASK→MASK		;arithmetic right shift
	Return (A<<x-A)-ACC

Implemented on the Z80, it kind of sucks taking twice as long as a traditional routine. However, it is novel, I think, taking 7 iterations:

Code: [Select]

L_sqrd:
;Input: L
;Output: L*L->HL
;It's really slow compared to a generic 8-bit multiplication
;It's also pretty big
;But it is a novel approach!
	ld e,a            ;this just stays a constant for the algorithm
	rlca
	ld d,a            ;this will be rotated to get the bits of the input
	ld l,0            ;the accumulator
	ld bc,0780h      ;c is a mask
loop:
	add hl,hl
	rlc d
	sbc a,a
	xor e
	and c
	add a,l
	ld l,a
	jr nc,$+3
	inc h
	sra c
	djnz loop
;HL is 255*input-input*input
;add input
	ld c,e
	add hl,bc
;HL is 256*input-input*input
;negate HL
	xor a
	sub l
	ld l,a
	sbc a,a
	sub h
	ld h,a
;HL is input*input-256*input
;add input*256
	ld e,b
	add hl,de
;HL is input*input
	ret

Author Topic: Fast 8-bit Squaring (Read 2800 times)

Xeda112358

Fast 8-bit Squaring

Xeda112358

Re: Fast 8-bit Squaring