Assembly Programmers - Help Axe Optimize!

Runer112

Project Author
LV11 Super Veteran (Next: 3000)
Posts: 2289
Rating: +639/-31

Re: Assembly Programmers - Help Axe Optimize!

« Reply #270 on: December 12, 2011, 11:57:46 pm »

Yeah, I see no way to optimize the full 32-bit multiplication... But fixed-point multiplication, now that's an entirely different story! First, here's a totally different approach to sign handling that reduces p_88Mul to less than half of its current size!

Original routine: 38 bytes, ~1128 cycles

p_88Mul:
	.db __88MulEnd-1-$
	ld	a,h
	xor	d
	push	af
	bit	7,h
	jr	z,$+8
	xor	a
	sub	l
	ld	l,a
	sbc	a,a
	sub	h
	ld	h,a
	bit	7,d
	jr	z,$+8
	xor	a
	sub	e
	ld	e,a
	sbc	a,a
	sub	d
	ld	d,a
	call	$3F00+sub_MulFull
	ld	l,h
	ld	h,a
	pop	af
	xor	h
	ret	p
	xor	a
	sub	l
	ld	l,a
	sbc	a,a
	sub	h
	ld	h,a
	ret
__88MulEnd:

Smaller routine: 18 bytes, ~1089 cycles

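p_88Mul:
	.db __88MulEnd-1-$
	push	hl			;Save first operand
	call	$3F00+sub_MulFull	;Unsigned multiply: a = bits 16-23, h = bits 8-15
	pop	bc			;bc = first operand
	bit	7,b			;First operand negative?
	jr	z,$+3
	sub	e			;Yes: subtract second operand's low byte
	ld	l,h
	ld	h,a			;hl = bits 8-23 of the product
	bit	7,d			;Second operand negative?
	ret	z
	sub	c			;Yes: subtract first operand's low byte
	ld	h,a
	ret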
__88MulEnd:
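
For anyone who wants to check the sign-handling trick off-calculator, here is a rough Python model of the smaller routine. It assumes sub_MulFull leaves bits 16-23 of the unsigned product in a and bits 8-15 in h, which is how p_88Mul uses it above. The function names are made up for this sketch and are not part of Axe.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def mul88_small(hl, de):
    """Model of the smaller p_88Mul: unsigned multiply, then fix up the signs."""
    product = hl * de                  # unsigned 16x16 -> 32-bit product
    a = (product >> 16) & 0xFF         # bits 16-23, as left in a
    h = (product >> 8) & 0xFF          # bits 8-15, as left in h
    if hl & 0x8000:                    # bit 7,b : first operand negative
        a = (a - (de & 0xFF)) & 0xFF   # sub e
    if de & 0x8000:                    # bit 7,d : second operand negative
        a = (a - (hl & 0xFF)) & 0xFF   # sub c
    return (a << 8) | h                # hl = bits 8-23 after correction


def mul88_reference(hl, de):
    """Signed 8.8 product, rounded down (floor), as a 16-bit pattern."""
    return ((to_signed16(hl) * to_signed16(de)) >> 8) & 0xFFFF
```

The correction works because a negative operand read as unsigned is 65536 too large, so the unsigned product is too large by 65536 times the other operand, and only the low byte of that excess lands in bits 16-23.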

20 bytes saved? Not bad at all! But what if you're more interested in shaving off cycles than bytes? Don't worry, I covered that base too. Instead of using the slower p_MulFull, this final routine uses my faster p_Mul for 8 bits of the multiplication and an inlined, slightly different version of faster multiplication for the other 8 bits. End result: it's about 260 cycles faster than the smaller solution, or about 30% faster!

It's 16 bytes larger than my smaller method, but actually it would often end up resulting in smaller programs because it relies on the much more popular p_Mul instead of p_MulFull.

Faster routine: 34 bytes, ~831 cycles

p_88Mul:
	.db __88MulEnd-1-$
	push	hl
	ld	c,l
	ld	a,h
	ld	l,0
	ld b,b \ .db 8 \ call $3F00+sub_Mul
	ld	a,c
	ld	bc,8<<8+0
__88MulNext:
	add	hl,hl
	rla
	jr	nc,__88MulSkip
	add	hl,de
	adc	a,c
__88MulSkip:
	djnz	__88MulNext
	pop	bc
	bit	7,b
	jr	z,$+3
	sub	e
	ld	l,h
	ld	h,a
	bit	7,d
	ret	z
	sub	c
	ld	h,a
	ret
__88MulEnd:

« Last Edit: December 13, 2011, 12:04:38 am by Runer112 »

Quigibo

The Executioner
CoT Emeritus
LV11 Super Veteran (Next: 3000)
Posts: 2031
Rating: +1075/-24
I wish real life had a "Save" and "Load" button...

« Reply #271 on: December 13, 2011, 01:19:54 am »

Wow, thanks! However, there seems to be an issue. The 3 pictures attached are the output from the Mandelbrot Set demo program. The first is from the original routine. The second is from your new size-optimized version. As you can see, it works, but the rounding appears to be asymmetrical (which might still be okay). The last one is from your speed-optimized version. I think you have a bug somewhere...

Attachments: mbrot1.gif (original routine), mbrot2.gif (size-optimized routine), mbrot3.gif (speed-optimized routine)

Runer112

« Reply #272 on: December 13, 2011, 01:38:30 am »

I think I can explain the asymmetry of the size-optimized version. Because it adjusts signs differently, I think it now rounds down instead of towards zero like the old routine.
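
If that guess is right, the difference would look roughly like the Python sketch below. It only models the arithmetic (multiply magnitudes and restore the sign, versus flooring the signed product); the function names are illustrative and nothing here is taken from the Axe source.

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def mul88_truncating(hl, de):
    """Old-style behaviour: multiply magnitudes, then restore the sign."""
    x, y = to_signed16(hl), to_signed16(de)
    magnitude = (abs(x) * abs(y)) >> 8           # rounds toward zero
    result = -magnitude if (x < 0) != (y < 0) else magnitude
    return result & 0xFFFF


def mul88_flooring(hl, de):
    """New-style behaviour: signed product rounded down."""
    return ((to_signed16(hl) * to_signed16(de)) >> 8) & 0xFFFF
```

The two only disagree when the exact product is negative and not a multiple of 1/256. For example, $FFFF (-1/256) times $0080 (0.5) truncates to $0000 but floors to $FFFF.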

However, I have no clue what is going on with the speed-optimized routine. Can you check in the debugger that the call to sub_Mul is actually entering where it's supposed to, at __MulByte? You probably had to add the offset call macro for the call nz,__MulByte in p_Mul, and I wouldn't be surprised if its own size is messing up the offset calls.

« Last Edit: December 13, 2011, 01:40:46 am by Runer112 »

Quigibo

« Reply #273 on: December 13, 2011, 02:07:20 am »

The disassembly looks fine to me. All the jumps, calls, and everything of that nature are aligned. I tried 4 test cases with different combinations of sign values and they seemed okay. Since the generated picture is relatively close to the original, even though this is a chaotic system sensitive to errors, I would guess that only a few special cases cause it to return a wrong result.

EDIT: I made a program to run them side by side on random numbers and quit when the output is different. Here is an output that gives different results between the routines:

$FFE0 ** $F5F1 (-0.125 ** -10.059)

Results in $0143 (1.26) in size optimized.
Results in $0239 (2.22) in speed optimized.

« Last Edit: December 13, 2011, 02:31:40 am by Quigibo »

Runer112

« Reply #274 on: December 13, 2011, 03:04:07 am »

That edit was helpful; it gave me a hunch as to what the problem was, and (I think) that hunch was correct. Unfortunately, the fix for this problem will cost a byte and about 70 cycles. It will still be about 20% faster than the small routine though. And it still relies on the more common p_Mul instead of p_MulFull, so being 17 bytes larger might still be worth it.

Faster routine: 35 bytes, ~900 cycles

p_88Mul:
	.db __88MulEnd-1-$
	push	hl
	ld	c,l
	ld	a,h
	ld	l,0
	ld b,b \ .db 8 \ call $3F00+sub_Mul
	ld	b,8
__88MulNext:
	add	hl,hl
	rla
	rl	c
	jr	nc,__88MulSkip
	add	hl,de
	adc	a,0
__88MulSkip:
	djnz	__88MulNext
	pop	bc
	bit	7,b
	jr	z,$+3
	sub	e
	ld	l,h
	ld	h,a
	bit	7,d
	ret	z
	sub	c
	ld	h,a
	ret
__88MulEnd:

« Last Edit: December 13, 2011, 03:04:48 am by Runer112 »

calc84maniac

eZ80 Guru
Coder Of Tomorrow
LV11 Super Veteran (Next: 3000)
Posts: 2913
Rating: +471/-17

« Reply #275 on: December 17, 2011, 11:29:24 pm »

So... Z-Test. At a cost of 8 cycles, you can go from 17 bytes plus 3 bytes times the number of options (limited to something like 85?) to 16 bytes plus 2 bytes times the number of options (limited to amount of program space).

Here's my method:

  ld de,-range
  add hl,de
  ld de,jumptable_end
  jr c,default
  add hl,hl
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
default:
  ex de,hl
  jp (hl)
  .dw Label0
  .dw Label1
  .dw Label2
  ;.....
jumptable_end:
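
To see why the single add doubles as the range check and why the table is indexed backwards from its end, here is a small Python model of the sequence above. It is only a sketch: memory is a plain dict, the addresses are arbitrary, and the helper names are invented for illustration.

```python
def build_jump_table(table_start, targets):
    """Lay out little-endian .dw entries; returns (memory, jumptable_end)."""
    memory = {}
    address = table_start
    for target in targets:
        memory[address] = target & 0xFF
        memory[address + 1] = target >> 8
        address += 2
    return memory, address


def z_test_target(index, option_count, memory, jumptable_end):
    """Return the address the dispatch sequence would jump to."""
    total = index + ((-option_count) & 0xFFFF)   # ld de,-range / add hl,de
    carry = total > 0xFFFF                       # carry set when index >= range
    hl = total & 0xFFFF
    de = jumptable_end                           # ld de,jumptable_end
    if carry:                                    # jr c,default
        return de                                # ex de,hl / jp (hl)
    hl = (hl + hl) & 0xFFFF                      # add hl,hl
    hl = (hl + de) & 0xFFFF                      # add hl,de
    return memory[hl] | (memory[(hl + 1) & 0xFFFF] << 8)
```

Since hl holds index - range after the first add, doubling it and adding jumptable_end lands on table start + 2*index. An out-of-range index jumps to jumptable_end itself, so the default code is assumed to sit right after the table.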

Quigibo

« Reply #276 on: December 17, 2011, 11:39:31 pm »

Wow, thanks! I was considering that, but I assumed the overhead would be larger, not smaller! Thanks!

Also, I could move the labels to the data section of the code to make it even faster!

  ld de,-range
  add hl,de
  jr c,default
  add hl,hl
  ld de,jumptable_end
  add hl,de
  ld e,(hl)
  inc hl
  ld d,(hl)
  ex de,hl
  jp (hl)
default:

« Last Edit: December 17, 2011, 11:41:17 pm by Quigibo »

calc84maniac

« Reply #277 on: December 17, 2011, 11:50:41 pm »

If you wanted to save 2 cycles in the case of a jump, you could use an odd table setup with all the LSBs in a row followed by all the MSBs in a row, like so:

  ld de,-range
  add hl,de
  jr c,routine_end
  ex de,hl
  ld hl,jumptable_end
  add hl,de
  ld a,(hl)
  add hl,de
  ld l,(hl)
  ld h,a
  jp (hl)
routine_end:

I imagine that might not work well with the way pointers are handled in the compiler, though.

Edit:
And I suppose the current Z-Test is actually limited to 39 options due to the range of the JR instruction...

« Last Edit: December 17, 2011, 11:56:28 pm by calc84maniac »

Runer112

« Reply #278 on: December 19, 2011, 01:12:53 am »

First, an optimization that I can't give you code for: making *^CONST use an equivalent constant division optimization if one exists. And don't forget about the trivial cases, *^1 and *^0. Of course, these only apply if you don't change this operation to return a 32-bit result somehow. Which it really should.

Next, some silly optimizations: ^0, <<ᴇ8000, >>ᴇ7FFF should simply be 0, while ≥≥ᴇ8000 and ≤≤ᴇ7FFF should simply be 1. If you're wondering why ^0 should be 0, that's what the general modulus routine would return anyways.
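
The four signed comparisons are constant because $8000 is the smallest signed 16-bit value and $7FFF the largest. A brute-force Python check over every 16-bit input, with helper names invented for this sketch, could look like this (the ^0 case depends on the modulus routine and is not modelled):

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


MIN16 = to_signed16(0x8000)   # -32768
MAX16 = to_signed16(0x7FFF)   # 32767


def results(compare):
    """Set of results (0 or 1) a comparison produces over all 16-bit inputs."""
    return {int(compare(to_signed16(v))) for v in range(0x10000)}
```

A result set with a single element means the comparison can be folded to that constant at compile time.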

Finally, some optimizations for signed comparisons. These have been lacking general forms which take advantage of absolute jumps as well as optimized forms for constants for quite some time. Thanks to jacobly and calc84maniac for helping me come up with the first two! If either of you two are reading this, feel free to look at the other operations and try to optimize them.

p_SGT0:
	.db 8
	ld	a,h
	or	l
	jr	z,$+6
	add	hl,hl
	sbc	hl,hl
	inc	hl
p_SLE0:
	.db 9
	ld	a,h
	or	l
	jr	z,$+6
	add	hl,hl
	ccf
	sbc	hl,hl
	inc	hl
p_SLtLeXX:	
	.db 11
	ld	a,h
	add	a,$80
	ld	h,a
	ld	de,$0000		;$8000-const
	add	hl,de
	sbc	hl,hl
	inc	hl
	.db rp_Ans,6
p_SGtGeXX:
	.db 12
	ld	a,h
	add	a,$80
	ld	h,a
	xor	a
	ld	de,$0000		;$8000-const
	add	hl,de
	ld	h,a
	rla
	ld	l,a
	.db rp_Ans,6
p_SIntGt:
	.db 11
	scf
	sbc	hl,de
	add	hl,hl
	jp	pe,$+4
	ccf
	sbc	hl,hl
	inc	hl
p_SIntGe:
	.db 11
	xor	a
	sbc	hl,de
	add	hl,hl
	jp	po,$+4
	ccf
	ld	h,a
	rla
	ld	l,a
p_SIntLt:
	.db 11
	scf
	sbc	hl,de
	add	hl,hl
	jp	po,$+4
	ccf
	sbc	hl,hl
	inc	hl
p_SIntLe:
	.db 11
	xor	a
	sbc	hl,de
	add	hl,hl
	jp	pe,$+4
	ccf
	ld	h,a
	rla
	ld	l,a
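
As a cross-check of the flag tricks in the first three routines, here is a Python model of p_SGT0, p_SLE0, and p_SLtLeXX with the constant set up for a signed less-than. This is only a sketch of the register arithmetic with invented function names. It leaves out const = $8000, where the added constant would be 0 and no carry could occur (the trivial case above covers that).

```python
def to_signed16(value):
    """Interpret a 16-bit pattern as a two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def sgt0(hl):
    """p_SGT0: 1 if hl > 0 as a signed value, else 0."""
    if hl == 0:                        # ld a,h / or l / jr z : zero stays 0
        return 0
    carry = hl >> 15                   # add hl,hl : sign bit -> carry
    hl = 0xFFFF if carry else 0x0000   # sbc hl,hl
    return (hl + 1) & 0xFFFF           # inc hl


def sle0(hl):
    """p_SLE0: 1 if hl <= 0 as a signed value, else 0."""
    if hl == 0:                        # jr z lands on inc hl with hl = 0
        return 1
    carry = (hl >> 15) ^ 1             # add hl,hl / ccf
    hl = 0xFFFF if carry else 0x0000   # sbc hl,hl
    return (hl + 1) & 0xFFFF           # inc hl


def slt_const(hl, const):
    """p_SLtLeXX with de = $8000-const: 1 if hl < const (signed), else 0."""
    hl ^= 0x8000                       # ld a,h / add a,$80 / ld h,a
    de = (0x8000 - const) & 0xFFFF     # ld de,$8000-const
    carry = (hl + de) > 0xFFFF         # add hl,de
    hl = 0xFFFF if carry else 0x0000   # sbc hl,hl
    return (hl + 1) & 0xFFFF           # inc hl
```

Flipping the sign bit maps signed order onto unsigned order, so a single 16-bit add and its carry are enough to decide the comparison.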

« Last Edit: December 19, 2011, 02:52:21 pm by Runer112 »

jacobly

LV5 Advanced (Next: 300)
Posts: 205
Rating: +161/-1

« Reply #279 on: December 20, 2011, 04:47:46 am »

p_DrawOff: save 1 byte, save ~40 cycles

Original

	xor	a
	ld	e,a
	dec	a
__DrawOffShift:
	srl	c
	rr	e
	rra
	djnz	__DrawOffShift
	dec	d
	jr	z,__DrawOffSkipRight
	ld	b,a
	and	(hl)
	or	e
	ld	(hl),a
	ld	a,b
__DrawOffSkipRight:
	dec	hl
	inc	d
	jr	z,__DrawOffSkipLeft
	cpl
	and	(hl)
	or	c
	ld	(hl),a
__DrawOffSkipLeft:

Optimized

	xor	a
	ld	e,$FF
__DrawOffShift:
	srl	c
	rr	e
	rra
	djnz	__DrawOffShift
	dec	d
	jr	z,__DrawOffSkipRight
	ld	b,a
	or	(hl)
	and	e
	ld	(hl),a
	ld	a,b
__DrawOffSkipRight:
	dec	hl
	inc	d
	jr	z,__DrawOffSkipLeft
	and	(hl)
	or	c
	ld	(hl),a
__DrawOffSkipLeft:
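
The byte comes from swapping which register starts as $FF, so the mask ends up already inverted and the cpl is no longer needed. The Python sketch below models both versions for one sprite row and ignores the clipping flags in d. The function names are invented, and shift stands for the loop count in b (1 to 7).

```python
def shift24(c, e, a):
    """One pass of srl c / rr e / rra: shift the 24-bit value c:e:a right by one."""
    e_in, c = c & 1, c >> 1
    a_in, e = e & 1, (e >> 1) | (e_in << 7)
    a = (a >> 1) | (a_in << 7)
    return c, e, a


def draw_original(sprite, shift, right, left):
    c, e, a = sprite, 0x00, 0xFF          # xor a / ld e,a / dec a
    for _ in range(shift):
        c, e, a = shift24(c, e, a)
    new_right = (a & right) | e           # and (hl) / or e
    new_left = ((a ^ 0xFF) & left) | c    # cpl / and (hl) / or c
    return new_left, new_right


def draw_optimized(sprite, shift, right, left):
    c, e, a = sprite, 0xFF, 0x00          # xor a / ld e,$FF
    for _ in range(shift):
        c, e, a = shift24(c, e, a)
    new_right = (a | right) & e           # or (hl) / and e
    new_left = (a & left) | c             # and (hl) / or c
    return new_left, new_right
```

Both functions take the current right and left screen bytes and return the updated (left, right) pair.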

p_Pix: save 2 bytes, save ~6 cycles

Original

p_Pix:
	.db __PixEnd-1-$		;Draws pixel (c,l)
	ld	de,plotSScreen
	pop	af
	pop	bc
	push	af
	ld	b,0

	ld	a,l
	cp	64
	ld	a,b
	ret	nc
	ld	a,c
	cp	96
	ld	a,b
	ret	nc

	ld	h,b
	ld	a,l
	add	a,a
	add	a,l
	ld	l,a
	add	hl,hl
	add	hl,hl
	add	hl,de
	ld	a,c
	srl	c
	srl	c
	srl	c
	add	hl,bc
	and	%00000111
	ld	b,a
	ld	a,%10000000
	ret	z
___GetPixLoop:
	rrca
	djnz	___GetPixLoop
	ret
__PixEnd:

Optimized

p_Pix:
	.db __PixEnd-1-$		;Draws pixel (c,l)
	ld	de,plotSScreen
	pop	af
	pop	bc
	push	af
	ld	b,0

	ld	a,c
	cp	96
	ld	a,b
	ret	nc
	sla	l
	ret	c
	sla	l
	ret	c

	ld	h,b
	ex	de,hl
	add	hl,de
	add	hl,de
	add	hl,de
	ld	a,c
	srl	c
	srl	c
	srl	c
	add	hl,bc
	and	%00000111
	ld	b,a
	ld	a,%10000000
	ret	z
___GetPixLoop:
	rrca
	djnz	___GetPixLoop
	ret
__PixEnd:
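
The two sla l / ret c pairs do the Y bounds check and the multiply by 4 at the same time. Here is a Python model of the optimized routine's arithmetic, assuming the usual 96x64 buffer with 12 bytes per row. The function name is made up, and None stands in for the early return.

```python
def pix_model(x, y):
    """Return (buffer offset, bit mask) for pixel (x, y), or None if off-screen."""
    if x >= 96:                       # cp 96 / ret nc
        return None
    l = y & 0xFF
    for _ in range(2):                # sla l / ret c, twice
        carry = l & 0x80
        l = (l << 1) & 0xFF
        if carry:                     # bit 7 or bit 6 of y was set: y >= 64
            return None
    offset = 3 * l + (x >> 3)         # add hl,de three times (12*y), then add hl,bc
    mask = 0x80 >> (x & 7)            # ld a,%10000000 / rrca loop
    return offset, mask
```

The returned offset is relative to plotSScreen.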

p_ArcTan: save 1 byte, save ~1 cycle

Original

p_ArcTan:
	.db __ArcTanEnd-1-$
	ex	de,hl		;de = y
	pop	hl
	ex	(sp),hl		;hl = x
	push	hl
	ld	a,h		;\
	xor	d		;/ Get parity
	jp	m,__ArcTanSS-p_ArcTan-1
	add	hl,de		;\
	jr	__ArcTanDS	; |
__ArcTanSS:			; |hl = x +- y
	sbc	hl,de		; |
__ArcTanDS:			;/
	ex	de,hl		;de = x +- y
	ld	b,6		;\
__ArcTan64:			; |
	add	hl,hl		; |hl = 64y
	djnz	__ArcTan64	;/
	call	$3F00+sub_SDiv	;hl = 64y/(x +- y)
	pop	af		;\
	rla			; |Right side, fine
	ret	nc		;/
	sbc	a,a		;\
	sub	h		; |Reverse sign extend
	ld	h,a		;/
	ld	a,l		;\
	add	a,128		; |Add or sub 128
	ld	l,a		;/
	ret
__ArcTanEnd:

Optimized

p_ArcTan:
	.db __ArcTanEnd-1-$
	ex	de,hl		;de = y
	pop	hl
	ex	(sp),hl		;hl = x
	push	hl
	ld	a,h		;\
	xor	d		;/ Get parity
	jp	m,__ArcTanSS-p_ArcTan-2
	add	hl,de		;\
	ld c,c \ .db $FA	; |
	;jr	__ArcTanDS	; |
__ArcTanSS:			; |hl = x +- y
	sbc	hl,de		; |
__ArcTanDS:			;/
	ex	de,hl		;de = x +- y
	ld	b,6		;\
__ArcTan64:			; |
	add	hl,hl		; |hl = 64y
	djnz	__ArcTan64	;/
	call	$3F00+sub_SDiv	;hl = 64y/(x +- y)
	pop	af		;\
	rla			; |Right side, fine
	ret	nc		;/
	sbc	a,a		;\
	sub	h		; |Reverse sign extend
	ld	h,a		;/
	ld	a,l		;\
	add	a,128		; |Add or sub 128
	ld	l,a		;/
	ret
__ArcTanEnd:

jacobly

« Reply #280 on: December 24, 2011, 01:45:01 pm »

p_DrawOr/Xor: save 17 bytes (plus 4 every time a custom buffer is used)
aligned saves 98 cycles, unaligned saves ~173 cycles
save additional 21 cycles every time a custom buffer is used

p_DrawOr:
	.db __DrawOrEnd-1-$
	push	hl
	pop	ix			;Input ix = Sprite
	ld	hl,plotSScreen		;Input hl = Buffer
	pop	af
	pop	bc			;Input c = Sprite Y Position
	pop	de			;Input e = Sprite X Position
	push	af
	ld	b,7
	ld	a,e
	add	a,b
	cp	96+7
	ret	nc
	rrca
	rrca
	rrca
	and	$1f
	ld	d,a
	ld	a,c
	add	a,b
	jr	c,__DrawOrClipTop
	sub	64+7
	ret	nc
	cpl
	cp	b
	jr	c,__DrawOrClipBottom
	ld	a,b
	jr	__DrawOrClipBottom
__DrawOrClipTop:
	inc	ix
	inc	c
	jr	nz,__DrawOrClipTop
__DrawOrClipBottom:
	inc	a
	ld	b,0
	sla	c
	sla	c
	add	hl,bc
	add	hl,bc
	add	hl,bc
	ld	c,d
	add	hl,bc
	ld	b,a
	ld	a,e
	and	7
	jr	z,__DrawOrAligned
	ld	c,a
	ld	a,e
	cp	-7
	sbc	a,a
	ld	d,a
	and	e
	cp	96-7
	sbc	a,a
	ld	e,a
__DrawOrLoop:
	push	bc
	ld	b,c
	ld	c,(ix)
	xor	a
__DrawOrShift:
	srl	c
	rra
	djnz	__DrawOrShift
	and	e
	or	(hl)
	ld	(hl),a
	dec	hl
	ld	a,c
	and	d
	or	(hl)
	ld	(hl),a
	ld	c,13
	add	hl,bc
	inc	ix
	pop	bc
	djnz	__DrawOrLoop
	ret
__DrawOrAligned:
	ld	de,12
__DrawOrAlignedLoop:
	ld	a,(ix)
	or	(hl)
	ld	(hl),a
	inc	ix
	add	hl,de
	djnz	__DrawOrAlignedLoop
	ret
__DrawOrEnd:

p_DrawXor:
	.db __DrawXorEnd-1-$
	push	hl
	pop	ix			;Input ix = Sprite
	ld	hl,plotSScreen		;Input hl = Buffer
	pop	af
	pop	bc			;Input c = Sprite Y Position
	pop	de			;Input e = Sprite X Position
	push	af
	ld	b,7
	ld	a,e
	add	a,b
	cp	96+7
	ret	nc
	rrca
	rrca
	rrca
	and	$1f
	ld	d,a
	ld	a,c
	add	a,b
	jr	c,__DrawXorClipTop
	sub	64+7
	ret	nc
	cpl
	cp	b
	jr	c,__DrawXorClipBottom
	ld	a,b
	jr	__DrawXorClipBottom
__DrawXorClipTop:
	inc	ix
	inc	c
	jr	nz,__DrawXorClipTop
__DrawXorClipBottom:
	inc	a
	ld	b,0
	sla	c
	sla	c
	add	hl,bc
	add	hl,bc
	add	hl,bc
	ld	c,d
	add	hl,bc
	ld	b,a
	ld	a,e
	and	7
	jr	z,__DrawXorAligned
	ld	c,a
	ld	a,e
	cp	-7
	sbc	a,a
	ld	d,a
	and	e
	cp	96-7
	sbc	a,a
	ld	e,a
__DrawXorLoop:
	push	bc
	ld	b,c
	ld	c,(ix)
	xor	a
__DrawXorShift:
	srl	c
	rra
	djnz	__DrawXorShift
	and	e
	xor	(hl)
	ld	(hl),a
	dec	hl
	ld	a,c
	and	d
	xor	(hl)
	ld	(hl),a
	ld	c,13
	add	hl,bc
	inc	ix
	pop	bc
	djnz	__DrawXorLoop
	ret
__DrawXorAligned:
	ld	de,12
__DrawXorAlignedLoop:
	ld	a,(ix)
	xor	(hl)
	ld	(hl),a
	inc	ix
	add	hl,de
	djnz	__DrawXorAlignedLoop
	ret
__DrawXorEnd:
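
For readers following the unaligned path, here is a Python sketch of the two pieces it relies on: the srl c / rra loop that splits a sprite row across two screen bytes, and the cp / sbc a,a pairs that build the horizontal clipping masks in d and e. The function names are invented, and x is the 8-bit X position after the earlier range check.

```python
def split_sprite_row(row, shift):
    """srl c / rra repeated: returns (left byte in c, right byte in a)."""
    c, a = row, 0
    for _ in range(shift):
        a = (a >> 1) | ((c & 1) << 7)    # rra picks up the bit srl c shifted out
        c >>= 1                          # srl c
    return c, a


def clip_masks(x):
    """Returns (left mask d, right mask e) for an unaligned sprite at X = x."""
    d = 0xFF if x < 0xF9 else 0x00         # cp -7 / sbc a,a : left byte on screen?
    e = 0xFF if (x & d) < 89 else 0x00     # and e / cp 96-7 / sbc a,a : right byte?
    return d, e
```

A mask of $00 simply blanks the half of the sprite that would fall off the screen, so the loop body never has to branch on clipping.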

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

« Reply #281 on: December 24, 2011, 03:58:42 pm »

I finally have an optimisation that might work or be useful >.> Runer112 apparently mentioned optimising the p_FreqOut routine by replacing:

dec hl
dec bc
ld a,b
or c
jr nz,__FreqOutLoop2

with this:

cpd
jp pe,__FreqOutLoop2

The issue was that this would throw off the frequency, as it cuts out 8*HL cycles. But when I was stealing the code for my own evil intentions, I saw this optimisation, thought of that issue, and came up with this solution:


p_FreqOut:
	xor	a
__FreqOutLoop1:
	push	bc
	xor	%00000011
	ld	e,a
__FreqOutLoop2:
	ld	a,h
	or	l
	jr	z,__FreqOutDone
	cpd
	ld	a,e
	scf
	jp	pe,__FreqOutLoop2
__FreqOutDone:
	pop	bc
	out	($00),a
	ret	nc
	jr	__FreqOutLoop1
__FreqOutEnd:

With the code reordered this way, it should only cut out 8*HL/BC cycles, which is much less than 8*HL. I think Runer said that it might be up to 1% faster for higher notes and negligible for lower notes.

EDIT: Okay, found a problem: the inner loop is now actually 2 cycles slower, so that will slow the routine by 2*HL, too.

« Last Edit: December 24, 2011, 04:03:24 pm by Xeda112358 »

jacobly

« Reply #282 on: December 26, 2011, 12:17:26 pm »

p_DrawOr: 18 bytes saved
p_DrawXor: 18 bytes saved
p_DrawOff: 14 bytes saved
p_DrawMsk: 10 bytes saved
p_DrawMsk2: 11 bytes saved

p_DrawOr:
	.db __DrawOrEnd-1-$
	push	hl
	pop	ix			;Input ix = Sprite
	ld	hl,plotSScreen		;Input hl = Buffer
	pop	af
	pop	de			;Input e = Sprite Y Position
	pop	bc			;Input c = Sprite X Position
	push	af
	ld	d,7
	ld	a,e
	add	a,d
	jr	c,__DrawOrClipTop
	sub	64+7
	ret	nc
	cpl
	cp	d
	jr	c,__DrawOrClipBottom
	ld	b,d
	jr	__DrawOrNoClipV
__DrawOrClipTop:
	inc	ix
	inc	e
	jr	nz,__DrawOrClipTop
__DrawOrClipBottom:
	ld	b,a
__DrawOrNoClipV:
	ld	a,c
	add	a,d
	cp	96+7
	ret	nc
	rrca
	rrca
	rrca
	and	$1f
	sla	e
	sla	e
	add	hl,de
	add	hl,de
	add	hl,de
	ld	e,a
	inc	b
	ld	a,c
	and	d
	ld	d,-7*3
	add	hl,de
	jr	z,__DrawOrAligned
	ld	e,c
	ld	c,a
	ld	a,e
	cp	-7
	sbc	a,a
	ld	d,a
	and	e
	cp	96-7
	sbc	a,a
	ld	e,a
__DrawOrLoop:
	push	bc
	ld	b,c
	ld	c,(ix)
	xor	a
__DrawOrShift:
	srl	c
	rra
	djnz	__DrawOrShift
	and	e
	or	(hl)
	ld	(hl),a
	dec	hl
	ld	a,c
	and	d
	or	(hl)
	ld	(hl),a
	ld	c,13
	add	hl,bc
	inc	ix
	pop	bc
	djnz	__DrawOrLoop
	ret
__DrawOrAligned:
	ld	de,12
__DrawOrAlignedLoop:
	ld	a,(ix)
	or	(hl)
	ld	(hl),a
	inc	ix
	add	hl,de
	djnz	__DrawOrAlignedLoop
	ret
__DrawOrEnd:

p_DrawXor:
	.db __DrawXorEnd-1-$
	push	hl
	pop	ix			;Input ix = Sprite
	ld	hl,plotSScreen		;Input hl = Buffer
	pop	af
	pop	de			;Input e = Sprite Y Position
	pop	bc			;Input c = Sprite X Position
	push	af
	ld	d,7
	ld	a,e
	add	a,d
	jr	c,__DrawXorClipTop
	sub	64+7
	ret	nc
	cpl
	cp	d
	jr	c,__DrawXorClipBottom
	ld	b,d
	jr	__DrawXorNoClipV
__DrawXorClipTop:
	inc	ix
	inc	e
	jr	nz,__DrawXorClipTop
__DrawXorClipBottom:
	ld	b,a
__DrawXorNoClipV:
	ld	a,c
	add	a,d
	cp	96+7
	ret	nc
	rrca
	rrca
	rrca
	and	$1f
	sla	e
	sla	e
	add	hl,de
	add	hl,de
	add	hl,de
	ld	e,a
	inc	b
	ld	a,c
	and	d
	ld	d,-7*3
	add	hl,de
	jr	z,__DrawXorAligned
	ld	e,c
	ld	c,a
	ld	a,e
	cp	-7
	sbc	a,a
	ld	d,a
	and	e
	cp	96-7
	sbc	a,a
	ld	e,a
__DrawXorLoop:
	push	bc
	ld	b,c
	ld	c,(ix)
	xor	a
__DrawXorShift:
	srl	c
	rra
	djnz	__DrawXorShift
	and	e
	xor	(hl)
	ld	(hl),a
	dec	hl
	ld	a,c
	and	d
	xor	(hl)
	ld	(hl),a
	ld	c,13
	add	hl,bc
	inc	ix
	pop	bc
	djnz	__DrawXorLoop
	ret
__DrawXorAligned:
	ld	de,12
__DrawXorAlignedLoop:
	ld	a,(ix)
	xor	(hl)
	ld	(hl),a
	inc	ix
	add	hl,de
	djnz	__DrawXorAlignedLoop
	ret
__DrawXorEnd:

p_DrawOff:
	.db __DrawOffEnd-1-$
	push	hl
	pop	ix			;Input ix = Sprite
	ld	hl,plotSScreen		;Input hl = Buffer
	pop	af
	pop	de			;Input e = Sprite Y Position
	pop	bc			;Input c = Sprite X Position
	push	af
	ld	d,7
	ld	a,e
	add	a,d
	jr	c,__DrawOffClipTop
	sub	64+7
	ret	nc
	cpl
	cp	d
	jr	c,__DrawOffClipBottom
	ld	b,d
	jr	__DrawOffNoClipV
__DrawOffClipTop:
	inc	ix
	inc	e
	jr	nz,__DrawOffClipTop
__DrawOffClipBottom:
	ld	b,a
__DrawOffNoClipV:
	ld	a,c
	add	a,d
	cp	96+7
	ret	nc
	rrca
	rrca
	rrca
	and	$1f
	ld	d,0
	sla	e
	sla	e
	add	hl,de
	add	hl,de
	add	hl,de
	ld	e,a
	add	hl,de
	inc	b
	ld	a,c
	and	7
	jr	z,__DrawOffAligned
	ld	e,c
	ld	c,a
	ld	a,e
	cp	-7
	jr	nc,__DrawOffLoop
	inc	d
	cp	96-7
	jr	nc,__DrawOffLoop
	inc	d
__DrawOffLoop:
	push	bc
	ld	b,c
	ld	c,(ix+0)
	xor	a
	ld	e,$FF
__DrawOffShift:
	srl	c
	rr	e
	rra
	djnz	__DrawOffShift
	dec	d
	jr	z,__DrawOffSkipRight
	ld	b,a
	or	(hl)
	and	e
	ld	(hl),a
	ld	a,b
__DrawOffSkipRight:
	dec	hl
	inc	d
	jr	z,__DrawOffSkipLeft
	and	(hl)
	or	c
	ld	(hl),a
__DrawOffSkipLeft:
	ld	bc,13
	add	hl,bc
	inc	ix
	pop	bc
	djnz	__DrawOffLoop
	ret
__DrawOffAligned:
	ld	e,12
__DrawOffAlignedLoop:
	ld	a,(ix)
	ld	(hl),a
	inc	ix
	add	hl,de
	djnz	__DrawOffAlignedLoop
	ret
__DrawOffEnd:

p_DrawMsk:
	.db __DrawMskEnd-1-$
	ex	(sp),hl
	pop	ix			;Input hl = Sprite
	pop	de
	pop	bc
	push	hl
	ld	hl,plotSScreen
	ld	d,7
	ld	a,e
	add	a,d
	jr	c,__DrawMskClipTop
	sub	64+7
	ret	nc
	cpl
	cp	d
	jr	c,__DrawMskClipBottom
	ld	b,d
	jr	__DrawMskNoClipV
__DrawMskClipTop:
	inc	ix
	inc	e
	jr	nz,__DrawMskClipTop
__DrawMskClipBottom:
	ld	b,a
__DrawMskNoClipV:
	ld	a,c
	add	a,d
	cp	96+7
	ret	nc
	rrca
	rrca
	rrca
	and	$1f
	ld	d,0
	sla	e
	sla	e
	add	hl,de
	add	hl,de
	add	hl,de
	ld	e,a
	add	hl,de
	inc	b
	ld	a,c
	and	7
	jr	z,__DrawMskAligned
	ld	e,c
	ld	c,a
	ld	a,e
	cp	-7
	jr	nc,__DrawMskLoop
	inc	d
	cp	96-7
	jr	nc,__DrawMskLoop
	inc	d

__DrawMskLoop:
	push	bc

	push	hl

	ld	b,c
	ld	e,(ix+0)
	xor	a
	ld	h,a
	ld	c,(ix+8)
__DrawMskShift:
	srl	e
	rr	h
	srl	c
	rra
	djnz	__DrawMskShift

	ld	b,h
	pop	hl
	push	af

	dec	d
	jr	z,__DrawMskSkipRight1

	push	bc
	xor	b
	cpl
	ld	c,a

	ld	a,(hl)
	or	b
	and	c
	ld	(hl),a
	pop	bc

__DrawMskSkipRight1:
	dec	hl
	inc	d
	push	de
	jr	z,__DrawMskSkipLeft1

	ld	a,c
	xor	e
	cpl
	ld	d,a

	ld	a,(hl)
	or	e
	and	d
	ld	(hl),a

__DrawMskSkipLeft1:
	ld	de,appBackUpScreen-plotSScreen+1
	add	hl,de
	pop	de
	pop	af
	dec	d
	jr	z,__DrawMskSkipRight2

	or	b
	cpl

	and	(hl)
	or	b
	ld	(hl),a

__DrawMskSkipRight2:
	dec	hl
	inc	d
	jr	z,__DrawMskSkipLeft2

	ld	a,c
	or	e
	cpl

	and	(hl)
	or	e
	ld	(hl),a

__DrawMskSkipLeft2:
	ld	bc,plotSScreen-appBackUpScreen+13
	add	hl,bc

	inc	ix
	pop	bc
	djnz	__DrawMskLoop
	ret
__DrawMskAligned:
	push	hl
	ld	de,appBackUpScreen-plotSScreen
	add	hl,de

	ld	a,(ix+0)
	ld	d,a
	xor	(ix+8)
	cpl
	ld	e,a

	and	(hl)
	or	d
	ld	(hl),a

	pop	hl

	ld	a,(hl)
	or	d
	and	e
	ld	(hl),a

	inc	ix
	ld	de,12
	add	hl,de
	djnz	__DrawMskAligned
	ret
__DrawMskEnd:

p_DrawMsk2:
	.db __DrawMsk2End-1-$
	ex	(sp),hl
	pop	ix			;Input hl = Sprite
	pop	de
	pop	bc
	push	hl
	ld	hl,plotSScreen
	ld	d,7
	ld	a,e
	add	a,d
	jr	c,__DrawMsk2ClipTop
	sub	64+7
	ret	nc
	cpl
	cp	d
	jr	c,__DrawMsk2ClipBottom
	ld	b,d
	jr	__DrawMsk2NoClipV
__DrawMsk2ClipTop:
	inc	ix
	inc	e
	jr	nz,__DrawMsk2ClipTop
__DrawMsk2ClipBottom:
	ld	b,a
__DrawMsk2NoClipV:
	ld	a,c
	add	a,d
	cp	96+7
	ret	nc
	rrca
	rrca
	rrca
	and	$1f
	ld	d,0
	sla	e
	sla	e
	add	hl,de
	add	hl,de
	add	hl,de
	ld	e,a
	add	hl,de
	inc	b
	ld	a,c
	and	7
	jr	z,__DrawMsk2Aligned
	ld	e,c
	ld	c,a
	ld	a,e
	cp	-7
	jr	nc,__DrawMsk2Loop
	inc	d
	cp	96-7
	jr	nc,__DrawMsk2Loop
	inc	d
__DrawMsk2Loop:
	push	bc
	push	hl

	ld	b,c
	ld	e,(ix+0)
	xor	a
	ld	h,a
	ld	c,(ix+8)
__DrawMsk2Shift:
	srl	e
	rr	h
	srl	c
	rra
	djnz	__DrawMsk2Shift

	ld	b,h			;e = left spr, b = right spr, c = left msk, a = right msk
	pop	hl

	dec	d
	jr	z,__DrawMsk2SkipRight

	cpl
	and	(hl)
	xor	b
	ld	(hl),a

__DrawMsk2SkipRight:
	dec	hl
	inc	d
	jr	z,__DrawMsk2SkipLeft

	ld	a,c
	cpl
	and	(hl)
	xor	e
	ld	(hl),a

__DrawMsk2SkipLeft:
	ld	bc,13
	add	hl,bc

	inc	ix
	pop	bc
	djnz	__DrawMsk2Loop
	ret
__DrawMsk2Aligned:
	ld	e,12
__DrawMsk2AlignedLoop:
	ld	a,(ix+8)
	cpl
	and	(hl)
	xor	(ix+0)
	ld	(hl),a
	inc	ix
	add	hl,de
	djnz	__DrawMsk2AlignedLoop
	ret
__DrawMsk2End:

Runer112

« Reply #283 on: February 02, 2012, 11:56:23 pm »

Just a small optimization I see with the new Nth string command. Because you restack the return location by popping it into bc, you're already loading bc with a value that's at least $4000 for applications and at least $8000 for programs, so the ld b,h inside the loop is not necessary.

jacobly

« Reply #284 on: September 18, 2012, 03:03:40 am »

Thanks to a suggestion from calc84maniac, I have optimized the routine that is used for both *^ and ** to be 25-50% faster.

In addition, every use of *^ would be 2 bytes smaller.

p_MulFull: same size, save 300-550 cycles

Original

p_MulFull:
	.db __MulFullEnd-1-$
	ld	c,h
	ld	a,l
	ld	hl,0
	ld	b,16
__MulFullNext:
	add	hl,hl
	rla
	rl	c
	jr	nc,__MulFullSkip
	add	hl,de
	adc	a,0
	jr	nc,__MulFullSkip
	inc	c
__MulFullSkip:
	djnz	__MulFullNext
	ret
__MulFullEnd:

Optimized

p_MulFull:
	.db __MulFullEnd-1-$
	xor	a
	ld	c,h
	ld	h,a
	or	l
	ld	l,h
	call	nz,__MulFullByte-p_MulFull-1
	ld	a,c
__MulFullByte:
	ld	b,8
__MulFullNext:
	rra
	jr	nc,__MulFullSkip
	add	hl,de
__MulFullSkip:
	rr	h
	rr	l
	djnz	__MulFullNext
	ret
__MulFullEnd:

Note: Output changed: hl = bits 16-31 of the result, do rra after the routine returns to get a = bits 8-15 of the result.
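
To make the changed output concrete, here is a Python model of the optimized routine as listed above. It is a sketch of the register arithmetic only, with an invented function name, and it returns hl, a, and the carry flag as they would stand at the final ret.

```python
def mul_full_model(hl, de):
    """Model of the shift-right p_MulFull; returns (hl, a, carry) at the final ret."""
    c = hl >> 8                # ld c,h
    a = hl & 0xFF              # xor a / ld h,a / or l : a = low byte, carry = 0
    acc = 0                    # ld l,h : hl = 0
    carry = 0

    def mul_byte(a, acc, carry):
        for _ in range(8):     # ld b,8 ... djnz
            # rra: multiplier bit out to carry, previous carry in at bit 7
            a, carry = (a >> 1) | (carry << 7), a & 1
            if carry:          # jr nc skips the add when the bit is 0
                acc += de      # add hl,de
                carry = acc >> 16
                acc &= 0xFFFF
            # rr h / rr l: shift carry:hl right, low bit out to carry
            acc, carry = (acc >> 1) | (carry << 15), acc & 1
        return a, acc, carry

    if a != 0:                 # call nz,__MulFullByte
        a, acc, carry = mul_byte(a, acc, carry)
    a = c                      # ld a,c
    a, acc, carry = mul_byte(a, acc, carry)
    return acc, a, carry
```

After the return, one more rra, modelled as (a >> 1) | (carry << 7), gives bits 8-15 of the product, matching the note above.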

« Last Edit: September 18, 2012, 03:10:34 am by jacobly »
