Author Topic: Assembly Programmers - Help Axe Optimize! (Read 168489 times)

calc84maniac · « **Reply #285 on:** September 18, 2012, 09:37:08 am »

And if you ever want a signed high multiplication, I think this routine would work along with that one:

p_MulFullSigned:
	.db __MulFullSignedEnd-1-$
	push	hl
	call	$3F00+sub_MulFull
	pop	bc
	xor	a
	bit	7,b
	jr	z,$+4
	sbc	hl,de
	or	d
	ret	p
	sbc	hl,bc
	ret
__MulFullSignedEnd:

Edit: more optimized
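
For anyone who wants to sanity-check the sign correction, here is a rough Python model of the idea (not Axe source). It assumes sub_MulFull returns the upper 16 bits of the unsigned product in HL and leaves DE intact; the function names are made up for illustration.

```python
MASK = 0xFFFF

def mul_full_high(hl, de):
    """Stand-in for sub_MulFull (assumed): upper 16 bits of the unsigned product."""
    return (hl * de) >> 16

def to_signed(x):
    """Interpret a 16-bit value as two's complement."""
    return x - 0x10000 if x & 0x8000 else x

def mul_full_signed_high(hl, de):
    """Model of p_MulFullSigned: fix up the unsigned high word."""
    high = mul_full_high(hl, de)
    if hl & 0x8000:                    # bit 7,b : first operand is negative
        high = (high - de) & MASK      # sbc hl,de
    if de & 0x8000:                    # or d / ret p : second operand is negative
        high = (high - hl) & MASK      # sbc hl,bc
    return high

def reference(hl, de):
    """Upper 16 bits of the true signed 32-bit product."""
    return ((to_signed(hl) * to_signed(de)) >> 16) & MASK

if __name__ == "__main__":
    samples = [0, 1, 2, 0x7FFF, 0x8000, 0x8001, 0xFFFF, 1234, 54321]
    ok = all(mul_full_signed_high(a, b) == reference(a, b)
             for a in samples for b in samples)
    print("model agrees with reference:", ok)
```

The correction works because a negative 16-bit operand reads as its value plus 65536 when treated as unsigned, so the unsigned high word is too large by the other operand.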

squidgetx · « **Reply #286 on:** December 12, 2012, 10:22:33 am »

Optimizing constant address calls?
Anyway, 5->^oVAR : (^oVAR)() compiles to

Code: [Select]

ld hl, 5
push hl
call $9D9D

when it could just compile to

Code: [Select]

call $0005

Right now the only way to call an address that's not a label is using asm(CDXXXX), and that way makes assigning r1-r6 arguments extremely annoying (manual store)
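
As an illustration of the current workaround, here is a small Python helper (hypothetical, not part of Axe) that builds the hex string for such a call. It relies only on the Z80 encoding of CALL nn: opcode CD followed by the address low byte, then the high byte.

```python
def asm_call(address):
    """Return the hex bytes of a Z80 'call address' instruction."""
    if not 0 <= address <= 0xFFFF:
        raise ValueError("address must fit in 16 bits")
    low, high = address & 0xFF, address >> 8
    return "CD{:02X}{:02X}".format(low, high)

if __name__ == "__main__":
    print(asm_call(0x0005))  # CD0500
```

The string would go inside the asm( ) call mentioned above; loading r1-r6 beforehand still has to be done by hand.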

Xeda112358 · « **Reply #287 on:** February 15, 2013, 07:02:51 pm »

I am not sure if I had an outdated source (1.1.2), but I saw this code and a one-byte optimisation:

Code: [Select]

p_NthStr:
	.db __NthStrEnd-$+1
	pop	bc
	pop	de
	push	bc
	ex	de,hl
__NthStrLoop:
	ld	a,d
	or	e
	ret	z
	xor	a
	ld	b,h
	cpir
	dec	de
	jr	__NthStrLoop
__NthStrEnd:

It took me a second to figure out what you were doing with 'ld b,h', but when I did, I saw that you could just move it outside the loop to save 4 t-states each loop. But then I realised that BC is already large enough since it holds the return address, so you can actually just remove it altogether.

Code: [Select]

p_NthStr:
	.db __NthStrEnd-$+1
	pop	bc
	pop	de
	push	bc
	ex	de,hl
__NthStrLoop:
	ld	a,d
	or	e
	ret	z
	xor	a
	cpir
	dec	de
	jr	__NthStrLoop
__NthStrEnd:

I hope that actually works!

calc84maniac · « **Reply #288 on:** February 15, 2013, 11:18:31 pm »

I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).
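
To make the concern concrete, here is a rough Python model of the loop without the ld b,h. It assumes BC starts as the caller's return address, and that code running from an app returns to somewhere in $4000-$7FFF; the addresses and string length below are invented for the example.

```python
def cpir(mem, hl, bc, a=0):
    """Model of CPIR: scan until a byte equals A or BC counts down to zero."""
    while True:
        match = mem[hl] == a
        hl = (hl + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF
        if match or bc == 0:
            return hl, bc

def nth_str(mem, hl, n, bc):
    """Model of the p_NthStr loop where BC is never reloaded."""
    de = n
    while de != 0:                 # ld a,d / or e / ret z
        hl, bc = cpir(mem, hl, bc) # xor a / cpir
        de -= 1                    # dec de
    return hl

if __name__ == "__main__":
    mem = bytearray(0x10000)
    start, length = 0x8000, 20000  # one long string, then its terminator
    mem[start:start + length] = b"A" * length
    expected = start + length + 1  # first byte after the terminator

    from_ram = nth_str(mem, start, 1, bc=0x9E00)  # return address in RAM
    from_app = nth_str(mem, start, 1, bc=0x4100)  # return address in an app page
    print(from_ram == expected, from_app == expected)
```

With the smaller BC, CPIR runs out of count before it reaches the terminator, so the routine returns a pointer into the middle of the first string.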

Runer112 · « **Reply #289 on:** February 15, 2013, 11:29:47 pm »

Quote from: Xeda112358 on February 15, 2013, 07:02:51 pm

I am not sure if I had an outdated source (1.1.2), but I saw this code and a one-byte optimisation

You do have an outdated version of Axe; I already added that optimization in 1.2.0.

Quote from: calc84maniac on February 15, 2013, 11:18:31 pm

I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).

Pfft what are the chances of that...

Xeda112358 · « **Reply #290 on:** February 16, 2013, 07:27:26 am »

Quote from: Runer112 on February 15, 2013, 11:29:47 pm

You do have an outdated version of Axe; I already added that optimization in 1.2.0.

Darn, I actually do have 1.2.1 in a different folder, I completely forgot about that .__. I am glad that I got something right, though

Quote from: calc84maniac on February 15, 2013, 11:18:31 pm

I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).

I was worried about that, but I figured that it would be pretty rare. It would definitely be the only scenario in which it would fail, too. .__.

Deep Toaster · « **Reply #291 on:** April 06, 2013, 05:42:16 pm »

Don't know if it's been mentioned before (and maybe there's a reason it's this way), but p_SendByte starts by loading B and C individually where p_GetByte loads them together (saving a byte).

Runer112 · « **Reply #292 on:** April 06, 2013, 05:44:01 pm »

No reason whatsoever. Good catch.

Xeda112358 · « **Reply #293 on:** July 04, 2013, 08:59:58 am »

EDIT: Jacobly pointed out the case HL = 8000h, so this doesn't work

Hopefully this file has the updated SDiv routine. I have this:

Original routine

Code: [Select]

p_SDiv:
	.db __SDivEnd-1-$
	ld	a,h
	xor	d
	push	af
	xor	d
	jp	p,__SDivSkip1-p_SDiv-1
	xor	a
	sub	l
	ld	l,a
	sbc	a,a
	sub	h
	ld	h,a
__SDivSkip1:
	bit	7,d
	jr	z,__SDivSkip2
	xor	a
	sub	e
	ld	e,a
	sbc	a,a
	sub	d
	ld	d,a
__SDivSkip2:
	call	$3F00+sub_Div
x_SDivEntry:
	pop	af
	ret	p
	xor	a
	sub	l
	ld	l,a
	sbc	a,a
	sub	h
	ld	h,a
	ret
__SDivEnd:

Modified routine: same size, 1 or 6 cycles saved

Code: [Select]

p_SDiv:
	.db __SDivEnd-1-$
	ld	a,h
	xor	d
	push	af
	xor	d
	jp	p,__SDivSkip1-p_SDiv-1
	xor	a
	sub	l
	ld	l,a
	sbc	a,a
	sub	h
	ld	h,a
__SDivSkip1:
	xor	d
	jp	p,__SDivSkip2-p_SDiv-1
	xor	a
	sub	e
	ld	e,a
	sbc	a,a
	sub	d
	ld	d,a
__SDivSkip2:
	call	$3F00+sub_Div
x_SDivEntry:
	pop	af
	ret	p
	xor	a
	sub	l
	ld	l,a
	sbc	a,a
	sub	h
	ld	h,a
	ret
__SDivEnd:

My only change is the two lines after __SDivSkip1.
Same size, saving at least 1 cycle (up to 6 cycles).

EDIT: The same modification can be made to the fixed point signed division routine.
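
For the record, here is a quick Python model of why HL = 8000h breaks the modified version. It only models how each routine decides whether to negate the divisor; the function names are invented for the sketch.

```python
def neg16(x):
    return (-x) & 0xFFFF

def divisor_negated_original(hl, de):
    """bit 7,d / jr z : looks at the divisor's own sign bit."""
    return bool(de & 0x8000)

def divisor_negated_modified(hl, de):
    """xor d / jp p : relies on bit 7 of A (the dividend's high byte) being clear."""
    d = de >> 8
    a = (hl >> 8) ^ d          # ld a,h / xor d
    a ^= d                     # xor d     -> A = H again
    if a & 0x80:               # dividend negative: negate HL
        hl = neg16(hl)
        a = hl >> 8            # ld h,a leaves the new H in A
    a ^= d                     # xor d at __SDivSkip1
    return bool(a & 0x80)

def find_mismatches():
    """Dividends for which the two sign tests disagree."""
    return {hl for hl in range(0x10000)
            for de in (0x0001, 0x7FFF, 0x8000, 0xFFFF)
            if divisor_negated_original(hl, de) != divisor_negated_modified(hl, de)}

if __name__ == "__main__":
    print(sorted(hex(v) for v in find_mismatches()))
```

Negating 8000h gives 8000h again, so bit 7 of A is still set when the second xor d runs and the sign test on the divisor comes out inverted.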

Matrefeytontias · « **Reply #294 on:** July 12, 2013, 01:30:49 pm »

Seeing the discussion about Fill( in the axiom request thread, I was surprised that this wasn't implemented this way already:

Code: [Select]

; Fill(ptr, amount, byte (not word))
; hl = ptr, de = byte, bc = amount
 ld (hl),e
 dec bc
 ld a,c
 or b
 ret z ; or whatever to quit
 ld e,l
 ld d,h
 inc de
 ldir
 ret ; or whatever to quit

I don't think it's really optimized though >_>
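
A small Python model of the trick, in case it helps: the first byte is stored by hand and LDIR then copies each byte onto its successor, so the value ripples forward. The names are illustrative only, and the model follows the snippet as posted.

```python
def fill(mem, ptr, amount, byte):
    """Model of the snippet: store one byte, then let an overlapping LDIR spread it."""
    mem[ptr] = byte & 0xFF             # ld (hl),e
    bc = (amount - 1) & 0xFFFF         # dec bc
    if bc == 0:                        # ld a,c / or b / ret z
        return
    hl, de = ptr, (ptr + 1) & 0xFFFF   # ld e,l / ld d,h / inc de
    while True:                        # ldir
        mem[de] = mem[hl]
        hl = (hl + 1) & 0xFFFF
        de = (de + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF
        if bc == 0:
            break

if __name__ == "__main__":
    buf = bytearray(16)
    fill(buf, 4, 5, 0xAB)
    print(buf.hex())
```

One thing the model makes visible: an amount of 0 wraps to 65535 after the dec bc, so as written the snippet does not treat a zero-length fill as a no-op.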

jo-thijs · « **Reply #295 on:** October 31, 2013, 12:02:41 pm »

I found this in the Commands.inc file of axe1.2.2a:
p_IntNe:
   .db 8
   xor   a
   sbc   hl,de
   jr   z,$+5
   ld   hl,1

I can't find the purpose of xor a.

Runer112 · « **Reply #296 on:** October 31, 2013, 12:14:49 pm »

It seems to reset the carry flag for sbc hl,de.
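
A rough Python model of what goes wrong without it, assuming the carry flag can hold anything on entry (the helper names are made up):

```python
def sbc_hl_de(hl, de, carry):
    """Model of SBC HL,DE: returns the 16-bit result and the Z flag."""
    result = (hl - de - carry) & 0xFFFF
    return result, result == 0

def int_ne(hl, de, carry_in, clear_carry=True):
    """Model of p_IntNe; clear_carry=False drops the xor a."""
    carry = 0 if clear_carry else carry_in   # xor a
    hl, zero = sbc_hl_de(hl, de, carry)      # sbc hl,de
    if not zero:                             # jr z,$+5
        hl = 1                               # ld hl,1
    return hl

if __name__ == "__main__":
    print(int_ne(6, 5, carry_in=1))                     # 1
    print(int_ne(6, 5, carry_in=1, clear_carry=False))  # 0
```

With a stray carry, 6 and 5 subtract to zero and the routine would report them as equal.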

Xeda112358 · « **Reply #297 on:** July 27, 2015, 02:49:49 pm »

I think I finally have a major optimization after having worked on link routines for the past couple of weeks. I didn't modify the timeout or syncing code, just the core get/send stuff. I've tested it and it is reliable.

For reference, in the event that p_SendByte doesn't have to wait, the new routine is 931cc vs 1647cc. Here are my proposed routines:

p_GetByte: +0 bytes, presumably sped up by about as much as p_SendByte

Code: [Select]

p_GetByte:
	.db __GetByteEnd-$-1
	di
	ld	bc,$0803		;Bit counter in b, bit mask in c
	ld	hl,-1
	xor	a
	out	(0),a			;Make sure we are reset
	in	a,(0)
	and	c			;Check to see if sender is ready
	dec	a
	ret	nz			;If not, then go back
	inc	a
	out	(0),a			;Relay a confirmation
	ex	(sp),hl			;Wait until confirmation is read (59 T-states minimum)
	ex	(sp),hl
	ld	a,(de)			;Bit counter in b and bitmask in c
	xor	a			;Store received byte in l
	ld hl,$AA
	out	(0),a			;Reset the ports to receive data

__GetByteLoop:
    in a,(0)
    xor l
    rra
    jr c,__GetByteLoop
    in a,(0)
    rra
    rra             ;bits cycled in are masked with 0x55. Need to invert anyways, so mask at the end with 0xAA
    rr l
    djnz __GetByteLoop
    ret
__GetByteEnd:

p_SendByte: -4 bytes, -723cc

Code: [Select]

p_SendByte:
    .db __SendByteEnd-$-1
	di
	ld	bc,$5503		;Bit counter in b, bit mask in c
	ld	a,%00000010
	out	(0),a			;Indicate we are ready to send
__SendByteTimeout:
	dec	hl
	ld	a,h
	or	l
	jr	z,__SendByteDone
	in	a,(0)			;Loop is 59 T-states maximum
	and	c
	jr	nz,__SendByteTimeout	;Keep looping till we get it
	out (0),a
__SendLoop:
    rrc e
    ccf
    rla
    sla b
    ccf
    rla
    out (0),a
    ex (sp),hl
    ex (sp),hl
    nop
    jr nz,__SendLoop
;need 37cc
    xor a
    ex (sp),hl
    ex (sp),hl
__SendByteDone:
    out (0),a
    ret
__SendByteEnd:

EDIT: I looked at the timeout code for p_SendByte, and realized that my code didn't need B to be a counter but instead I was using D as a kind of counter. By using B instead of D, I could cut out the ld d,$55, saving 2 bytes and 7cc.

Xeda112358 · « **Reply #298 on:** September 21, 2019, 07:19:54 pm »

Here is an optimized p_LineShr routine. NOTE: It flips the meaning of the carry flag on output, so the line routines that use this will need to ret c instead of ret nc.

Original routine

Code: [Select]

p_LineShr:
	.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
	ld	a,l
	pop	bc
	pop	hl
	pop	de
	ex	(sp),hl
	ld	d,l
	pop	hl
	ex	(sp),hl
	push	bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
	cp	64
	ret	nc
	ld	h,a
	ld	a,d
	cp	64
	ret	nc

	ld	a,l
	cp	96
	ret	nc
	ld	a,e
	cp	96
	ret	nc

	sub	l
	jr	nc,__LineShrSkipRev
	ex	de,hl
	neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
	push	af			; Saving DX (it will be popped into HL below)
	ld	a,l			; IX+=L/8+D*12 (actually D*4+D*4+D*4)
	rra
	rra
	rra
	and	%00011111
	ld	c,a
	ld	b,0
	add	ix,bc
	ld	a,d
	add	a,a
	add	a,a
	ld	c,a
	add	ix,bc
	add	ix,bc
	add	ix,bc
	ld	a,l			; Calculating the starting pixel mask
	and	%00000111
	inc	a
	ld	b,a
	ld	a,%00000001
__LineShrMaskLoop:
	rrca
	djnz	__LineShrMaskLoop
	ld	c,a
	ld	a,h			; Calculating delta Y and negating the Y increment if necessary
	sub	d			; This is the last instruction for which we need the original data
	ld	de,12
	jr	nc,__LineShrSkipNeg
	ld	de,-12
	neg
__LineShrSkipNeg:
	pop	hl			; Recalling DX
	ld	l,a			; H=DX, L=DY
	cp	h
	jr	nc,__LineVert		; Line is more vertical than horizontal
	ld	a,h
__LineVert:
	ld	b,a			; Pixel counter
	inc	b
	cp	l
	scf				; Setting up gradient counter
	ccf
	rra
	scf
	ret				; c=1, z=vertical major
__LineShrEnd:
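
For readers following the setup code, here is a small Python model of the two values it computes: the byte offset added to IX and the starting pixel mask. It assumes the 96x64 buffer at 12 bytes per row implied by the comments; the function names are invented.

```python
def byte_offset(x, y):
    """IX += L/8 + D*12: column byte plus 12 bytes per row."""
    return (x >> 3) + 12 * y

def pixel_mask(x):
    """Model of __LineShrMaskLoop: rotate %00000001 right (x & 7) + 1 times."""
    a = 0x01
    for _ in range((x & 7) + 1):
        a = ((a >> 1) | (a << 7)) & 0xFF   # rrca
    return a

if __name__ == "__main__":
    print(byte_offset(95, 63), hex(pixel_mask(0)), hex(pixel_mask(7)))
```

The first rotation moves the set bit to bit 7, so pixel 0 of each byte is the leftmost bit and the mask is effectively $80 shifted right by x & 7.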

Optimized routine: -4 bytes, -13cc

Code: [Select]

p_LineShr:
	.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
	ld	a,l
	pop	bc
	pop	hl
	pop	de
	ex	(sp),hl
	ld	d,l
	pop	hl
	ex	(sp),hl
	push	bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
	ld	h,a
	ld	a,63
	cp	h
	ret	c
	cp	d
	ret	c

	ld	a,95
	cp	l
	ret	c
	cp	e
	ret	c
	ld	a,e

	sub	l
	jr	nc,__LineShrSkipRev
	ex	de,hl
	neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
	push	af			; Saving DX (it will be popped into HL below)
	ld	a,d
	add	a,a
	add	a,a
	ld	c,a
	ld	b,0
	add	ix,bc
	add	ix,bc
	add	ix,bc
	ld	a,l
	and	7
	ld	e,a
	xor	l
	rra
	rra
	rra
	ld	c,a
	add	ix,bc
	ld	b,e			; Pixel index (x1 & 7) saved in E above
	inc	b
	ld	a,%00000001
__LineShrMaskLoop:
	rrca
	djnz	__LineShrMaskLoop
	ld	c,a
	ld	a,h			; Calculating delta Y and negating the Y increment if necessary
	sub	d			; This is the last instruction for which we need the original data
	ld	de,12
	jr	nc,__LineShrSkipNeg
	ld	de,-12
	neg
__LineShrSkipNeg:
	pop	hl			; Recalling DX
	ld	l,a			; H=DX, L=DY
	cp	h
	jr	nc,__LineVert		; Line is more vertical than horizontal
	ld	a,h
__LineVert:
	ld	b,a			; Pixel counter
	inc	b
	cp	l
	res	0,a			; Setting up gradient counter
	rrca
	ret				; c=0, z=vertical major
__LineShrEnd:
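
A quick Python check of the three rewrites used above, as a sketch rather than a proof of the whole routine: the flipped range test, the and 7 / xor l column calculation, and the shorter gradient-counter tail. Flag handling is reduced to the carry bit and the helper names are invented.

```python
def cp_carry(a, operand):
    """Carry after CP: set when A < operand (unsigned)."""
    return a < operand

def rejects_original(v, limit):
    return not cp_carry(v, limit)        # ld a,v / cp limit / ret nc

def rejects_optimized(v, limit):
    return cp_carry(limit - 1, v)        # ld a,limit-1 / cp v / ret c

def column_optimized(x):
    low = x & 7                          # and 7 / ld e,a
    return (x ^ low) >> 3                # xor l / rra / rra / rra (carry is clear)

def tail_original(a):
    a = a >> 1                           # scf / ccf / rra : 0 shifts into bit 7
    return a, 1                          # scf : carry set on return

def tail_optimized(a):
    a &= 0xFE                            # res 0,a
    carry = a & 1                        # rrca copies bit 0 into carry and bit 7
    a = ((carry << 7) | (a >> 1)) & 0xFF
    return a, carry

if __name__ == "__main__":
    same_range = all(rejects_original(v, 64) == rejects_optimized(v, 64)
                     for v in range(256))
    same_tail = all(tail_original(a)[0] == tail_optimized(a)[0]
                    for a in range(256))
    print(same_range, same_tail, tail_optimized(0xFF))
```

The A result of the tail is identical; only the carry differs, which is why callers need ret c in place of ret nc.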

Alternatively, this version saves only 3 bytes but saves 10 more clock cycles:

Code: [Select]

p_LineShr:
	.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
	ld	a,l
	pop	bc
	pop	hl
	pop	de
	ex	(sp),hl
	ld	d,l
	pop	hl
	ex	(sp),hl
	push	bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
	ld	h,a
	ld	a,63
	cp	h
	ret	c
	cp	d
	ret	c

	ld	a,95
	cp	l
	ret	c
	cp	e
	ret	c
	ld	a,e

	sub	l
	jr	nc,__LineShrSkipRev
	ex	de,hl
	neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
	ld	e,a			; Saving DX
	ld	a,l			; IX+=L/8+D*12 (actually D*4+D*4+D*4)
	rra
	rra
	rra
	and	%00011111
	ld	c,a
	ld	b,0
	add	ix,bc
	ld	a,d
	add	a,a
	add	a,a
	ld	c,a
	add	ix,bc
	add	ix,bc
	add	ix,bc
	ld	a,l			; Calculating the starting pixel mask
	and	%00000111
	inc	a
	ld	b,a
	ld	a,%00000001
__LineShrMaskLoop:
	rrca
	djnz	__LineShrMaskLoop
	ld	c,a
	ld	a,h			; Calculating delta Y and negating the Y increment if necessary
	sub	d			; This is the last instruction for which we need the original data

	ld	h,e			; DX

	ld	de,12
	jr	nc,__LineShrSkipNeg
	ld	de,-12
	neg
__LineShrSkipNeg:
	ld	l,a			; |DY| (stored after the negation, as in the original routine)
	cp	h
	jr	nc,__LineVert		; Line is more vertical than horizontal
	ld	a,h
__LineVert:
	ld	b,a			; Pixel counter
	inc	b
	cp	l
	res	0,a			; Setting up gradient counter
	rrca
	ret				; c=0, z=vertical major
__LineShrEnd:

Xeda112358 · « **Reply #299 on:** October 20, 2019, 10:47:31 am »

p_EQ0

The current routine is 7 bytes and 36cc:

Code: [Select]

;7 bytes, 36cc
	ld	a,l
	or	h
	add	a,255
	sbc	hl,hl
	inc	hl

But we can save 8cc without sacrificing bytes:

Code: [Select]

;7 bytes, 28cc
	xor	a
	cp	h
	ld	h,a
	sbc	a,l
	sbc	a,a
	ld	l,a
	inc	l
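
Both sequences can be checked exhaustively with a rough Python model (8-bit flag behaviour reduced to the carry bit; the function names are made up):

```python
def eq0_original(hl):
    a = (hl & 0xFF) | (hl >> 8)          # ld a,l / or h
    carry = 1 if a + 255 > 0xFF else 0   # add a,255 : carry unless A was 0
    hl = (0 - carry) & 0xFFFF            # sbc hl,hl
    return (hl + 1) & 0xFFFF             # inc hl

def eq0_optimized(hl):
    h, l = hl >> 8, hl & 0xFF
    a = 0                                # xor a
    carry = 1 if a < h else 0            # cp h : borrow when H is nonzero
    h = a                                # ld h,a
    carry = 1 if a - l - carry < 0 else 0  # sbc a,l : borrow when L or carry is nonzero
    a = (0 - carry) & 0xFF               # sbc a,a
    l = (a + 1) & 0xFF                   # ld l,a / inc l
    return (h << 8) | l

if __name__ == "__main__":
    agree = all(eq0_original(v) == eq0_optimized(v) == (1 if v == 0 else 0)
                for v in range(0x10000))
    print("identical for all 65536 inputs:", agree)
```

The saving comes from avoiding sbc hl,hl (15cc) and the 16-bit inc hl (6cc) in favour of 4cc single-byte instructions.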
