Author Topic: Assembly Programmers - Help Axe Optimize! (Read 169057 times)

Munchor · « **Reply #135 on:** January 06, 2011, 10:11:37 am »

Quote from: happybobjr on January 06, 2011, 10:09:17 am

Quote from: Runer112 on January 05, 2011, 11:11:30 pm
By the way Quigibo, the reason I was looking at every source routine for Axe is because I'm documenting the size and (at least approximate) speed of every Axe command. If I finish it, would you want to bundle it with future Axe releases? If not I'd probably post it somewhere on the forums anyway, so people could still see it.
* happybobjr loves runner

And so does Scout.

DJ Omnimaga · « **Reply #136 on:** January 07, 2011, 12:20:48 am »

Quote from: Art_of_camelot on January 06, 2011, 08:41:27 am

Quote from: Runer112 on January 05, 2011, 06:42:20 pm
Faster buffer inversion routine. 9951 cycles saved.
It's over 9000!!!!
What?!?
9000?
Yea, I know... I had to...
But seriously dude, all those optimizations are awesome!

Lol I just actually noticed that

Runer112 · « **Reply #137 on:** January 09, 2011, 01:00:51 pm »

Oh damn, you know what? Now I remember why I had the conditional return in the middle of the sprite rotating routines, Quigibo. Without it, the routines would return vx_SptBuff+8 in hl. Oops... But instead of re-implementing the conditional return, here's the better fix:

Code: [Select]

p_RotC:
	.db __RotCEnd-1-$
	ex	de,hl
	ld	c,8
__RotCLoop1:
	ld	hl,vx_SptBuff+8
	ld	b,8
	ld	a,(de)
__RotCLoop2:
	dec	l
	rra
	rr	(hl)
	djnz	__RotCLoop2
	inc	de
	dec	c
	jr	nz,__RotCLoop1
	ret
__RotCEnd:

p_RotCC:
	.db __RotCCEnd-1-$
	ex	de,hl
	ld	c,8
__RotCCLoop1:
	ld	hl,vx_SptBuff+8
	ld	b,8
	ld	a,(de)
__RotCCLoop2:
	dec	l
	rla
	rl	(hl)
	djnz	__RotCCLoop2
	inc	de
	dec	c
	jr	nz,__RotCCLoop1
	ret
__RotCCEnd:

EDIT: And as a side note, would it be possible to reformat DS<() so that the variable is reinitialized to its maximum value at the End? That way, 3 bytes could be saved by having both the zero and not zero conditions using the same store command. For example:

Code: [Select]

	ld	hl,(var)
	dec	hl
	ld	a,h
	or	l
	jp	nz,DS_End
	;Code inside statement goes here
	ld	hl,max
DS_End:
	ld	(var),hl
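
To make the proposed control flow concrete, here is a small Python model of one pass through the reformatted DS<(). The function name `ds_step` and the `fired` flag are hypothetical and only for illustration. The point it shows is that both paths fall through to the same single store, which is where the 3 bytes (one `ld (var),hl`) would be saved.

```python
def ds_step(var, maximum):
    """Model of one DS<() pass. Returns (new value of var, whether the body ran)."""
    hl = (var - 1) & 0xFFFF       # ld hl,(var) / dec hl
    fired = hl == 0               # ld a,h / or l / jp nz,DS_End
    if fired:
        # the code inside the statement would run here
        hl = maximum              # ld hl,max, placed at the End
    return hl, fired              # DS_End: ld (var),hl (one shared store)


print(ds_step(3, 5))   # (2, False)
print(ds_step(1, 5))   # (5, True)
```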

calc84maniac · « **Reply #138 on:** January 10, 2011, 05:35:19 pm »

Quigibo, you could probably optimize const->{expr} statements to give a lot of optimization benefits:

Code: [Select]

;const->{expr}
;Evaluate expr here
ld (hl),const

;const->{expr}r
;Evaluate expr here
ld (hl),const & $FF
inc hl
ld (hl),const >> 8

;const->{expr}rr
;Evaluate expr here
ld (hl),const >> 8
inc hl
ld (hl),const & $FF

These optimizations would still be compatible with code in earlier Axe versions because HL ends up exactly as it used to.
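
For anyone who wants to check the byte order, the three cases can be modeled like this. It is only a sketch: `store_const` and its `mode` argument are hypothetical names, and `mem` is a plain bytearray standing in for calculator RAM. The return value models where HL ends up after the store.

```python
def store_const(mem, hl, const, mode=""):
    """Model of const->{expr}, const->{expr}r and const->{expr}rr.

    mode "" stores one byte, "r" stores 16 bits little-endian,
    "rr" stores 16 bits big-endian. Returns the final value of HL.
    """
    low, high = const & 0xFF, (const >> 8) & 0xFF
    if mode == "":
        mem[hl] = low             # ld (hl),const
    elif mode == "r":
        mem[hl] = low             # ld (hl),const & $FF
        hl += 1                   # inc hl
        mem[hl] = high            # ld (hl),const >> 8
    elif mode == "rr":
        mem[hl] = high            # ld (hl),const >> 8
        hl += 1                   # inc hl
        mem[hl] = low             # ld (hl),const & $FF
    else:
        raise ValueError("mode must be '', 'r' or 'rr'")
    return hl


mem = bytearray(4)
print(store_const(mem, 0, 0x1234, "r"), mem.hex())    # 1 34120000
print(store_const(mem, 2, 0x1234, "rr"), mem.hex())   # 3 34121234
```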

Edit:
These extra optimizations are also possible for storing 0:

Code: [Select]

;0->{expr}r or 0->{expr}rr
;Evaluate expr here
xor a
ld (hl),a
inc hl
ld (hl),a

Runer112 · « **Reply #139 on:** February 06, 2011, 08:45:09 pm »

It looks like yet another error has been discovered with my attempts to optimize things. The nibble retrieval routines and the nibble storage routine that I posted treat low and high nibbles in opposite ways. I'm pretty sure that the nibble retrieval routines are backwards and that the conditional jr c jumps should be changed to jr nc.

squidgetx · « **Reply #140 on:** February 06, 2011, 08:47:17 pm »

So would changing this make the new nibble routines the opposite of the ones found in 0.4.6, or the same?

Builderboy · « **Reply #141 on:** February 07, 2011, 01:11:13 am »

Nice catch, can't wait for the new version

squidgetx · « **Reply #142 on:** February 14, 2011, 07:17:34 am »

Could this possibly be auto-optimized:

pxl-Test(CONST1,CONST2)

to

{CONST2*12+(CONST1/8)+L6}^re(CONST1^8) (except of course the math would all be precalculated at parse time)? It saves more than 10 bytes and 200 cycles.
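
The arithmetic the parser would precalculate can be sketched as follows. This is a model with assumptions, not Axe's implementation: `pxl_test_const` and `ROW_BYTES` are made-up names, the 12 bytes per row comes from the `CONST2*12` term above, `buffer` is a plain bytearray standing in for L6, and the bit index is assumed to count from the leftmost (most significant) bit of the byte.

```python
ROW_BYTES = 12   # taken from the CONST2*12 term; 12 bytes = 96 pixels per row


def pxl_test_const(buffer, x, y):
    """Model of pxl-Test(x,y) for constant coordinates."""
    offset = y * ROW_BYTES + x // 8     # CONST2*12+(CONST1/8), added to L6
    bit = x % 8                         # CONST1^8 (^ is modulo in Axe)
    return (buffer[offset] >> (7 - bit)) & 1   # assumed: bit 0 is the leftmost bit


buffer = bytearray(ROW_BYTES * 64)
buffer[3 * ROW_BYTES + 1] = 0x20        # turn on pixel (10, 3)
print(pxl_test_const(buffer, 10, 3))    # 1
print(pxl_test_const(buffer, 11, 3))    # 0
```

Since `offset` and `bit` depend only on the constants, both could be folded into a single address and a single bit test at compile time.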

Runer112 · « **Reply #143 on:** February 15, 2011, 04:01:39 pm »

Some improvements to MemKit!

Next(): 2 bytes and a few cycles saved. Also, isn't the end-of-VAT check in the wrong place? I could be wrong because my VAT experience isn't too great, but because this routine checks for the end of the VAT at the start, wouldn't this command advance the VAT pointer to the end of the VAT and not recognize it as the end until the next Next()? This would cause problems with programs reading garbage VAT data for the last "entry." If I'm right about this (which may not be the case), the third block of code I posted should hopefully recognize the end of the VAT as soon as it hits it and never advance the VAT pointer to point to the end.

Code: (Original code: 26 bytes, 152/66 cycles) [Select]


 ld    hl,(axv_X1t)
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    de,-6
 add   hl,de
 ld    e,(hl)
 inc   e
 xor   a
 ld    d,a
 sbc   hl,de
 ld    (axv_X1t),hl
 ret

Code: (Optimized code: 24 bytes, 144/66 cycles) [Select]


 ld    hl,(axv_X1t)
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    de,-6
 add   hl,de
 ld    a,(hl)
 cpl
 ld    e,a
 add   hl,de
 ld    (axv_X1t),hl
 ret

Code: (Optimized (and fixed?) code: 24 bytes, 144/113 cycles) [Select]


 ld    hl,(axv_X1t)
 ld    de,-6
 add   hl,de
 ld    a,(hl)
 cpl
 ld    e,a
 add   hl,de
 ld    de,($982E)
 or    a
 sbc   hl,de
 ret   z
 add   hl,de
 ld    (axv_X1t),hl
 ret
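
The `cpl` trick in the optimized versions relies on D still holding $FF from `ld de,-6`, so that DE becomes -(n+1) once E is loaded with the complement of the name length. The Python check below is a sketch of that equivalence only; `step_original` and `step_optimized` are hypothetical names, and `hl` is the pointer after the -6 has already been added.

```python
def step_original(hl, n):
    """ld e,(hl) / inc e / xor a / ld d,a / sbc hl,de"""
    de = (n + 1) & 0xFF                 # E wraps at 8 bits, D is cleared
    return (hl - de) & 0xFFFF           # carry is clear after xor a


def step_optimized(hl, n):
    """ld a,(hl) / cpl / ld e,a / add hl,de, with D = $FF left over from ld de,-6"""
    de = 0xFF00 | (~n & 0xFF)           # equals -(n+1) as a 16-bit value
    return (hl + de) & 0xFFFF


# Both versions agree for every name length that does not overflow inc e.
for n in range(255):
    assert step_original(0x9000, n) == step_optimized(0x9000, n)
print(hex(step_optimized(0x9000, 4)))   # 0x8ffb
```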

Dim()^rr: Fixed the page offset.

Code: (Original code) [Select]


 ld    ix,(axv_X1t)
 ld    l,(ix-6)
 ld    h,0

Code: (Fixed code) [Select]


 ld    ix,(axv_X1t)
 ld    l,(ix-5)
 ld    h,0

Print(): n*16-13 cycles saved, n=name length. Assuming an average name length of 4.5 characters, 59 cycles saved.

Code: (Original code: 18 bytes, n*55+51 cycles) [Select]


 ld    ix,(axv_X1t)
 ld    b,(ix-6)
Ax6_Loop:
 ld    a,(ix-7)
 ld    (hl),a
 inc   hl
 dec   ix
 djnz  Ax6_Loop
 ld    (hl),b
 ret

Code: (Optimized code: 18 bytes, n*39+64 cycles) [Select]


 ex    de,hl
 ld    hl,(axv_X1t)
 ld    bc,-6
 add   hl,bc
 ld    b,(hl)
 ex    de,hl
Ax6_Loop:
 dec   de
 ld    a,(de)
 ld    (hl),a
 inc   hl
 djnz  Ax6_Loop
 ld    (hl),b
 ret
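
The savings figure follows directly from the two cycle formulas in the code headers. A short Python check of that arithmetic (the function names are hypothetical; the constants are the ones quoted above):

```python
def print_cycles_original(n):
    return n * 55 + 51      # original routine, n = name length


def print_cycles_optimized(n):
    return n * 39 + 64      # optimized routine


def print_cycles_saved(n):
    return print_cycles_original(n) - print_cycles_optimized(n)   # n*16-13


print(print_cycles_saved(4.5))   # 59.0
print(print_cycles_saved(1))     # 3
```

Even a one-character name comes out ahead, so the extra setup cost is recovered for every name length of 1 or more.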

Runer112 · « **Reply #144 on:** February 16, 2011, 01:47:44 pm »

Yay, double post! But it's been almost a day and I have a pretty good question/suggestion. This relates to the screen display commands. This was brought to mind when squidgetx made a post mentioning something I had discovered a while ago when documenting the speed of Axe commands. What he mentioned is that DispGraph^r actually runs faster than DispGraph. Here's a quote of my response to that:

Quote from: Runer112 on February 16, 2011, 01:21:46 pm

I see you've been reading up on my Commands documentation, eh squidgetx? Yeah, that's an interesting thing I discovered when speed testing the display commands. On calculators like mine with the old, "good" screen drivers, the screen driver delay seems to be pretty low and constant from calculator to calculator. DispGraph could run just as fast or faster than DispGraph^r on these calculators. However, due to inconsistencies with the screen drivers in newer units, the routine may run too fast for the driver on some calculators, causing display problems, so Quigibo had to add a portion of code to pause the routine until the driver says it is ready. However, this pause itself adds some overhead time, making the routine slower.

Quigibo, the DispGraph^r routine doesn't have any throttling system in place, yet no problems have been reported with it on newer calculators. Could you just remove the throttling system from the DispGraph routine and add one or two time-wasting instructions to make each loop iteration take as many cycles as each DispGraph^r loop iteration?

EDIT: Hmm I don't know if Quigibo reads this thread and would see that, so I'm probably going to post that in a major thread he reads or send him a message about that.

The second paragraph is my suggested optimization. The 3-level grayscale routine doesn't have a throttling system, yet there have been no reports of display problems from anybody. Wouldn't this suggest that all the screen drivers can handle routines that have as much delay as this? The data copying loop in the 3-level grayscale routine takes 72 cycles per byte output, so could delays simply be added to the normal screen display routine to make its loop at least 72 cycles?

Munchor · « **Reply #145 on:** February 16, 2011, 02:58:19 pm »

Quote

Print()

What function would that be Runer?

Quigibo · « **Reply #146 on:** February 16, 2011, 11:32:36 pm »

@squidgetx
I don't think pixel testing points with constant coordinates is common enough to warrant the pixel tester to treat it as a special case. 99% of the time, you're going to be using variable arguments to test pixels. If not, the code can probably be made more efficient without a pixel test in the first place.

Quote from: Runer112 on February 16, 2011, 01:47:44 pm

The second paragraph is my suggested optimization. The 3-level grayscale routine doesn't have a throttling system, yet there have been no reports of display problems from anybody. Wouldn't this suggest that all the screen drivers can handle routines that have as much delay as this? The data copying loop in the 3-level grayscale routine takes 72 cycles per byte output, so could delays simply be added to the normal screen display routine to make its loop at least 72 cycles?

Unfortunately that is not entirely true. There has actually been at least 1 report that the 3-level routine is too fast and causes flickers once in a great while on very new hardware. If there was a lower bound for clock cycles, I'm right on it. Although, I could still probably take the safety stuff off the safe copy routine, still have it faster (but not too fast) and still be smaller. I will look into that.

And I do read most of these threads, I'm just generally too busy to post, but I try to when I have small pockets of free time

Runer112 · « **Reply #147 on:** February 17, 2011, 09:47:36 pm »

Now that you have absolute jumps implemented:

Code: (Original code) [Select]


p_Exchange:
	.db 13
	pop	de
	ex	(sp),hl
	pop	bc
	ld	a,(de)
	ldi
	dec	hl
	ld	(hl),a
	inc	hl
	ld	a,b
	or	c
	jr	nz,$-8

Code: (Optimized code) [Select]


p_Exchange:
	.db 12
	pop	de
	ex	(sp),hl
	pop	bc
__ExchangeLoop:
	ld	a,(de)
	ldi
	dec	hl
	ld	(hl),a
	inc	hl
	jp	pe,__ExchangeLoop	;or is it po?
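
On the pe/po question in the comment: `ldi` leaves P/V set while BC is still nonzero after the decrement, and `dec hl`, `ld (hl),a` and `inc hl` do not touch the flags, so `jp pe` (P/V set) should be the condition that keeps looping. The Python model below is a sketch of that loop. `exchange` is a hypothetical name, and which block ends up in DE and which in HL is an assumption that does not affect the result.

```python
def exchange(mem, de, hl, bc):
    """Model of the optimized p_Exchange loop. Returns the iteration count."""
    iterations = 0
    while True:
        a = mem[de]                 # ld a,(de)
        mem[de] = mem[hl]           # ldi: (de) <- (hl), inc de, inc hl, dec bc
        de += 1
        hl += 1
        bc = (bc - 1) & 0xFFFF
        pv = bc != 0                # ldi sets P/V while bc is nonzero
        hl -= 1                     # dec hl (flags unchanged)
        mem[hl] = a                 # ld (hl),a (flags unchanged)
        hl += 1                     # inc hl (flags unchanged)
        iterations += 1
        if not pv:                  # jp pe,__ExchangeLoop loops while P/V is set
            return iterations


mem = bytearray(b"abcdWXYZ")
print(exchange(mem, 0, 4, 4), mem)   # 4 bytearray(b'WXYZabcd')
```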

Runer112 · « **Reply #148 on:** February 20, 2011, 06:47:27 pm »

I felt bad last time I optimized the constant bit-checking auto optimizations because I left about half of them out, stuck with the 8-byte plain old bit check routine. But thanks to a random revelation I had while lying in bed last night, I have come back for the forgotten ones!

Code: (Original code) [Select]

p_GetBit2:
	.db 7			;7 bytes, 49 cycles
	xor	a
	add	hl,hl
	add	hl,hl
	add	hl,hl
	ld	h,a
	rla
	ld	l,a

p_GetBit3:
	.db 8			;8 bytes, 30/29 cycles
	bit	4,h
	ld	hl,0
	jr	z,$+3
	inc	l

p_GetBit4:
	.db 8			;8 bytes, 30/29 cycles
	bit	3,h
	ld	hl,0
	jr	z,$+3
	inc	l

p_GetBit5:
	.db 8			;8 bytes, 30/29 cycles
	bit	2,h
	ld	hl,0
	jr	z,$+3
	inc	l

p_GetBit10:
	.db 7			;7 bytes, 49 cycles
	xor	a
	add	hl,hl
	add	hl,hl
	ld	h,a
	add	hl,hl
	ld	l,h
	ld	h,a

p_GetBit11:
	.db 8			;8 bytes, 30/29 cycles
	bit	4,l
	ld	hl,0
	jr	z,$+3
	inc	l

p_GetBit12:
	.db 8			;8 bytes, 30/29 cycles
	bit	3,l
	ld	hl,0
	jr	z,$+3
	inc	l

p_GetBit13:
	.db 8			;8 bytes, 30/29 cycles
	bit	2,l
	ld	hl,0
	jr	z,$+3
	inc	l

Code: (Optimized code) [Select]

p_GetBit2:
	.db 7			;7 bytes, 37 cycles
	ld	a,h
	set	5,h
	cp	h
	sbc	hl,hl
	inc	hl

p_GetBit3:
	.db 7			;7 bytes, 37 cycles
	ld	a,h
	set	4,h
	cp	h
	sbc	hl,hl
	inc	hl

p_GetBit4:
	.db 7			;7 bytes, 37 cycles
	ld	a,h
	set	3,h
	cp	h
	sbc	hl,hl
	inc	hl

p_GetBit5:
	.db 7			;7 bytes, 37 cycles
	ld	a,h
	set	2,h
	cp	h
	sbc	hl,hl
	inc	hl

p_GetBit10:
	.db 7			;7 bytes, 37 cycles
	ld	a,l
	set	5,l
	cp	l
	sbc	hl,hl
	inc	hl

p_GetBit11:
	.db 7			;7 bytes, 37 cycles
	ld	a,l
	set	4,l
	cp	l
	sbc	hl,hl
	inc	hl

p_GetBit12:
	.db 7			;7 bytes, 37 cycles
	ld	a,l
	set	3,l
	cp	l
	sbc	hl,hl
	inc	hl

p_GetBit13:
	.db 7			;7 bytes, 37 cycles
	ld	a,l
	set	2,l
	cp	l
	sbc	hl,hl
	inc	hl
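
The trick is that `set` can only leave the register unchanged (bit already set) or make it larger (bit was clear), so `cp` produces a carry exactly when the bit was clear; `sbc hl,hl` then gives 0 or -1, and `inc hl` turns that into 1 or 0. The Python model below is a sketch for checking this (`get_bit_opt` is a hypothetical name). It numbers bits from the most significant bit of HL, which is how the routines above appear to number them.

```python
def get_bit_opt(hl, n):
    """Model of the optimized p_GetBit<n>; bit 0 is the leftmost bit of HL."""
    reg = (hl >> 8) if n < 8 else (hl & 0xFF)   # H for bits 0-7, L for bits 8-15
    mask = 0x80 >> (n % 8)
    a = reg                                     # ld a,h   (or ld a,l)
    reg |= mask                                 # set b,h  (or set b,l)
    carry = 1 if a < reg else 0                 # cp h: carry only if the bit was clear
    result = (0 - carry) & 0xFFFF               # sbc hl,hl: 0 or $FFFF
    return (result + 1) & 0xFFFF                # inc hl: 1 or 0


# Compare against a plain shift-and-mask for the eight bits covered above.
for n in (2, 3, 4, 5, 10, 11, 12, 13):
    for hl in range(0x10000):
        assert get_bit_opt(hl, n) == (hl >> (15 - n)) & 1
print("all cases match")
```

Going by the byte and cycle counts in the listings, bits 2 and 10 stay at 7 bytes and drop from 49 to 37 cycles, while the other six save a byte (8 to 7) at the cost of about 7 to 8 cycles (30/29 to 37).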

DJ Omnimaga · « **Reply #149 on:** February 22, 2011, 12:15:45 am »

Nice to see new optimizations
