Author Topic: Streamlined Asm routines (Read 9421 times)

ACagliano · « **Reply #15 on:** October 11, 2011, 09:33:29 pm »

Quote from: ralphdspam on October 11, 2011, 08:59:02 pm

Quote from: Xeda112358 on October 11, 2011, 08:24:12 pm
So if you want to delay for approximately 'a' seconds you can try this:
Code: [Select]
ld b,107 halt djnz $-1 dec a jr nz,$-6 retDoes this help?
Remember to have interrupts enabled.

Thank you mucho. I don't actually disable them to begin with...lol.

NanoWar · « **Reply #16 on:** December 08, 2011, 07:45:22 am »

If you don't want to look up the instruction bytes and are too lazy to write out labels, you could also use relative labels (with Spasm):

_ ld b,107
_ halt
djnz -_
dec a
jr nz, --_
ret

Xeda112358 · « **Reply #17 on:** January 04, 2012, 02:21:31 pm »

Okay, so Runer was speculating about how to get the best speed out of a math routine, so the first challenge he gave was for 8x8 multiplication with a 16-bit output. I am not doing all that well with the challenge, but here is a variation of what I came up with that actually is a 8x16 multiplication (it requires 4 more cycles to make it 8x8).

Code: [Select]

A_Times_DE:
;Input:
;     A,DE
;Outputs:
;     A is 0
;     BC is not changed
;     DE is not changed
;     HL is the result
;     z flag is set
;     c flag is set if the input A is not 0
;Notes:
;           If A is 0, 29 cycles
;Speed: 145+6n+21b cycles
;           n=floor(log(a)/log(2))
;           b is the number of bits in the number
;           Testing over all values of A from 1 to 255:
;           313.7058824 average cycles
;           Worst: 355
;           Best  : 166 (non trivial)
;Size: 25 bytes
     ld hl,0           ;10
     or a \ ret z     ;9
     cpl \ scf        ;8
     adc a,a         ;4
     jp nc,$+7       ;10         ;45
Loop:
     add a,a          ;4
     jp c,$-1         ;10         ;14(7-n)

     add hl,de        ;11         ;11         (the rest are counted below)
     add a,a          ;4           ;4b
     ret z              ;5|11      ;5b+6
     add hl,hl         ;11         ;11b-11
     jp p,$-4         ;21|20     ;20n+b
     jp $-7

So that code is about twice as large as the following more standard routine, but it is also on average about 52 cycles faster and the worst case scenario is 35 cycles better than the worst case for the next routine. The advantage with the following code is that the speed is not all over the place:

Code: [Select]

DE_Times_A:
;Inputs:
;     DE and A are factors
;Outputs:
;     A is not changed
;     B is 0
;     C is not changed
;     DE is not changed
;     HL is the product
;Speed:
;     342+6x, x is the number of bits in A
;     Average: 366 cycles
;Size:
;     13 bytes
     ld b,8            ;7              7
     ld hl,0           ;10             10
       add hl,hl      ;11*8         88
       rlca            ;4*8           32
       jr nc,$+3     ;(12|18)*8  96+6x
         add hl,de   ;--             --
       djnz $-5      ;13*7+8      99
     ret               ;10             10

Another interesting note is that the first routine does not use a counter and so it preserves BC. For the best speed without an LUT, I still think unrolled routines are the best

Actually, another note-- the first routine doesn't really get a speed boost from unrolling

You can make a hybrid of the two codes that will cost the b register, but it will be 21 bytes and average a smidge over 320 cycles, too.

EDIT: Also, I posted here because the topic title had a nice name and I am not ready to submit it elsewhere without the scrutiny of the better asm coders, first

FloppusMaximus · « **Reply #18 on:** January 07, 2012, 03:53:28 pm »

Quote from: Xeda112358 on January 04, 2012, 02:21:31 pm

Okay, so Runer was speculating about how to get the best speed out of a math routine, so the first challenge he gave was for 8x8 multiplication with a 16-bit output. I am not doing all that well with the challenge, but here is a variation of what I came up with that actually is a 8x16 multiplication (it requires 4 more cycles to make it 8x8).

Unless I've made a mistake somewhere, this routine doesn't work as written, because at the time you're testing the sign flag, the bit you're interested in has already been shifted out. But it's an interesting idea, so here's a version that works (but could probably be optimized more):

Code: [Select]

	ld hl,0			; 10
	or a			; 4
	ret z			; 5

	scf			; 4
skip_zeroes:
	adc a,a			; 4(9-n)
	jr nc,skip_zeroes	; 12(9-n) - 5

	jp loop_add0		; 10

loop_add:
	ret z			; 5k + 6
	add hl,hl		; 11k
loop_add0:
	add hl,de		; 11(k+1)
loop_noadd:
	add a,a			; 4n
	jr c, loop_add		; 7n + 5k
	add hl,hl		; 11(n-k-1)
	jp loop_noadd		; 10(n-k-1)

If I've worked it out correctly, this has a minimum (non-trivial) running time of 192, a maximum of 437, and average of ~368.57.

DrDnar · « **Reply #19 on:** January 07, 2012, 05:15:59 pm »

For reference, here are the timings for the routines from Z80 Bits.

Rolled:

Code: [Select]

				; Ref	Worst	Best
MultHbyE:
	ld	l, 0		; 7	7	7
	ld	d, l		; 4	4	4
	sla	h		; 8	8	8
	jr	nc,$+3		; 12/7	12	7
	ld	l,e		; 4		4
	ld	b, 7		; 7	7	7
_:	add	hl,hl		; 11	11	11
	jr	nc,$+3		; 12/7	7	12
	add	hl,de		; 11	11	
	djnz	-_		; 13/8	13	13
				;	-5	-5
				;	327	284

MultAbyDE:
	ld	hl, 0		; 10	10	10
	ld	c, l		; 4	4	4
	add	a,a		; 4	4	4
	jr	nc,$+4		; 12/7	7	12
	ld	h,d		; 4	4	
	ld	l,e		; 4	4	
	ld	b, 7		; 7	7	7
_:	add	hl,hl		; 11	11	11
	rla			; 4	4	4
	jr	nc,$+4		; 12/7	7	12
	add	hl,de		; 11	11	
	adc	a,c		; 4	4	
	djnz	-_		; 13/8	13	13
				;	-5	-5
				;	385	312

Unrolled:

Code: [Select]

MultHbyE:
; L and D must already be 0
	sla	h		; 8	8	8
	jr	nc,$+3		; 12/7	12	7
	ld	l,e		; 4		4

	add	hl,hl		; 11	11	11
	jr	nc,$+3		; 12/7	7	12
	add	hl,de		; 11	77	
				; *7	*7	*7
				;	223	180

MultAbyDE:
; HL and C must already be 0
	add	a,a		; 4	4	4
	jr	nc,$+4		; 12/7	7	12
	ld	h,d		; 4	4	
	ld	l,e		; 4	4	

	add	hl,hl		; 11	11	11
	rla			; 4	4	4
	jr	nc,$+4		; 12/7	7	12
	add	hl,de		; 11	11	
	adc	a,c		; 4	4	
				; *7	*7	*7
				;	278	205

Xeda112358 · « **Reply #20 on:** January 07, 2012, 07:18:36 pm »

This was to Floppus, but I lost internet connection for a while:
Dang, I must have messed up big time somewhere. You are right, that routine really does not work at all. Now I need to figure out where I went wrong because I was sure I had a working version

I think you over counted your cycles, too o.o What I did to make things easier was to remove the code that stripped the leading zeroes and that alone has a worst case of 404 cycles and best case of 327 cycles. pretty much, it is 316+11b where b is the number of bits in A (b=0 is the trivial case and I am not counting that). Your routine is still eluding me how to best analyse it, though, but I think it is a lot faster than you thought and better than the normal routine.

Author Topic: Streamlined Asm routines (Read 9421 times)

ACagliano

Re: Streamlined Asm routines

NanoWar

Re: Streamlined Asm routines

Xeda112358

Re: Streamlined Asm routines

FloppusMaximus

Re: Streamlined Asm routines

DrDnar

Re: Streamlined Asm routines

Xeda112358

Re: Streamlined Asm routines