2641
ASM / Re: Streamlined Asm routines
« on: January 04, 2012, 02:21:31 pm »
Okay, so Runer was speculating about how to get the best speed out of a math routine, so the first challenge he gave was for 8x8 multiplication with a 16-bit output. I am not doing all that well with the challenge, but here is a variation of what I came up with that actually is a 8x16 multiplication (it requires 4 more cycles to make it 8x8).
Another interesting note is that the first routine does not use a counter and so it preserves BC. For the best speed without an LUT, I still think unrolled routines are the best
Actually, another note-- the first routine doesn't really get a speed boost from unrolling
You can make a hybrid of the two codes that will cost the b register, but it will be 21 bytes and average a smidge over 320 cycles, too.
EDIT: Also, I posted here because the topic title had a nice name and I am not ready to submit it elsewhere without the scrutiny of the better asm coders, first
Code: [Select]
A_Times_DE:
;Input:
; A,DE
;Outputs:
; A is 0
; BC is not changed
; DE is not changed
; HL is the result
; z flag is set
; c flag is set if the input A is not 0
;Notes:
; If A is 0, 29 cycles
;Speed: 145+6n+21b cycles
; n=floor(log(a)/log(2))
; b is the number of bits in the number
; Testing over all values of A from 1 to 255:
; 313.7058824 average cycles
; Worst: 355
; Best : 166 (non trivial)
;Size: 25 bytes
ld hl,0 ;10
or a \ ret z ;9
cpl \ scf ;8
adc a,a ;4
jp nc,$+7 ;10 ;45
Loop:
add a,a ;4
jp c,$-1 ;10 ;14(7-n)
add hl,de ;11 ;11 (the rest are counted below)
add a,a ;4 ;4b
ret z ;5|11 ;5b+6
add hl,hl ;11 ;11b-11
jp p,$-4 ;21|20 ;20n+b
jp $-7
So that code is about twice as large as the following more standard routine, but it is also on average about 52 cycles faster and the worst case scenario is 35 cycles better than the worst case for the next routine. The advantage with the following code is that the speed is not all over the place:Code: [Select]
DE_Times_A:
;Inputs:
; DE and A are factors
;Outputs:
; A is not changed
; B is 0
; C is not changed
; DE is not changed
; HL is the product
;Speed:
; 342+6x, x is the number of bits in A
; Average: 366 cycles
;Size:
; 13 bytes
ld b,8 ;7 7
ld hl,0 ;10 10
add hl,hl ;11*8 88
rlca ;4*8 32
jr nc,$+3 ;(12|18)*8 96+6x
add hl,de ;-- --
djnz $-5 ;13*7+8 99
ret ;10 10
Another interesting note is that the first routine does not use a counter and so it preserves BC. For the best speed without an LUT, I still think unrolled routines are the best
Actually, another note-- the first routine doesn't really get a speed boost from unrolling
You can make a hybrid of the two codes that will cost the b register, but it will be 21 bytes and average a smidge over 320 cycles, too.
EDIT: Also, I posted here because the topic title had a nice name and I am not ready to submit it elsewhere without the scrutiny of the better asm coders, first