Author Topic: Assembly Programmers - Help Axe Optimize!  (Read 154435 times)


Offline calc84maniac

  • eZ80 Guru
  • Coder Of Tomorrow
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2912
  • Rating: +471/-17
    • View Profile
    • TI-Boy CE
Re: Assembly Programmers - Help Axe Optimize!
« Reply #285 on: September 18, 2012, 09:37:08 am »
And if you ever want a signed high multiplication, I think this routine would work along with that one:
Code: [Select]
p_MulFullSigned:
.db __MulFullSignedEnd-1-$
push hl
call $3F00+sub_MulFull
pop bc
xor a
bit 7,b
jr z,$+4
sbc hl,de
or d
ret p
sbc hl,bc
ret
__MulFullSignedEnd:

Edit: more optimized
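
For anyone who wants to sanity-check the idea, here is a rough Python model of the correction the routine applies. It is only a sketch: it assumes sub_MulFull leaves the upper 16 bits of the unsigned product in HL and preserves DE (which is how the routine above appears to use it), and the function names are made up for illustration.

```python
def mul_full_unsigned_high(a: int, b: int) -> int:
    """Upper 16 bits of the unsigned 16x16 product (assumed result of sub_MulFull)."""
    return ((a & 0xFFFF) * (b & 0xFFFF)) >> 16


def mul_full_signed_high(hl: int, de: int) -> int:
    """Apply the same two conditional subtractions as p_MulFullSigned."""
    high = mul_full_unsigned_high(hl, de)
    if hl & 0x8000:      # bit 7,b: the saved first operand is negative
        high -= de       # sbc hl,de
    if de & 0x8000:      # or d / ret p: the second operand is negative
        high -= hl       # sbc hl,bc
    return high & 0xFFFF


def to_signed(x: int) -> int:
    """Interpret a 16-bit value as two's complement."""
    return x - 0x10000 if x & 0x8000 else x


def reference_signed_high(hl: int, de: int) -> int:
    """Upper 16 bits of the 32-bit signed product, computed directly."""
    return ((to_signed(hl) * to_signed(de)) >> 16) & 0xFFFF
```

The two subtractions work because a negative 16-bit operand read as unsigned is 65536 too large, so the unsigned high word is too large by exactly the other operand.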
« Last Edit: September 18, 2012, 09:52:36 am by calc84maniac »
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

Offline squidgetx

  • Food.
  • CoT Emeritus
  • LV10 31337 u53r (Next: 2000)
  • *
  • Posts: 1881
  • Rating: +503/-17
  • rawr.
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #286 on: December 12, 2012, 10:22:33 am »
Optimizing constant address calls?
Anyway, 5->oVAR : (oVAR)() compiles to
Code: [Select]
ld hl, 5
push hl
call $9D9D
when it could just compile to
Code: [Select]
call $0005

Right now the only way to call an address that's not a label is using asm(CDXXXX), and that way makes assigning r1-r6 arguments extremely annoying (manual store)

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #287 on: February 15, 2013, 07:02:51 pm »
I am not sure if I had an outdated source (1.1.2), but I saw this code and a one-byte optimisation:
Code: [Select]
p_NthStr:
.db __NthStrEnd-$-1
pop bc
pop de
push bc
ex de,hl
__NthStrLoop:
ld a,d
or e
ret z
xor a
ld b,h
cpir
dec de
jr __NthStrLoop
__NthStrEnd:
It took me a second to figure out what you were doing with 'ld b,h', but when I did, I saw that you could just move it outside the loop to save 4 t-states each loop. But then I realised that BC is already large enough since it holds the return address, so you can actually just remove it altogether.
Code: [Select]
p_NthStr:
.db __NthStrEnd-$-1
pop bc
pop de
push bc
ex de,hl
__NthStrLoop:
ld a,d
or e
ret z
xor a
cpir
dec de
jr __NthStrLoop
__NthStrEnd:

I hope that actually works!

Offline calc84maniac

Re: Assembly Programmers - Help Axe Optimize!
« Reply #288 on: February 15, 2013, 11:18:31 pm »
I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).
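
A rough Python model of that failure case, in case it helps. It assumes the usual cpir behaviour (compare A with (HL), increment HL, decrement BC, stop on a match or when BC reaches 0) and that BC is never reloaded inside the loop; the function name and the starting BC values are hypothetical.

```python
def nth_str(memory: bytes, start: int, n: int, bc: int) -> int:
    """Model of the shortened loop: skip n zero-terminated strings from start.

    bc stands in for whatever is left in BC on entry (the return address
    in the shortened routine) and is never reloaded between strings.
    """
    hl = start
    while n:                      # ld a,d / or e / ret z
        while True:               # xor a / cpir
            byte = memory[hl]
            hl += 1
            bc = (bc - 1) & 0xFFFF
            if byte == 0 or bc == 0:
                break
        n -= 1                    # dec de
    return hl


# One string of 0x5000 bytes, its terminator, then a second short string.
data = bytes([0x41]) * 0x5000 + b"\x00" + b"B\x00"

ok = nth_str(data, 0, 1, 0x9D95)      # BC large enough: lands after the terminator
bad = nth_str(data, 0, 1, 0x4000)     # BC runs out first: cpir stops mid-string
```

With the smaller BC the search gives up after 0x4000 bytes and the routine carries on as if it had found a terminator there.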

Offline Runer112

  • Project Author
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #289 on: February 15, 2013, 11:29:47 pm »
Quote from: Xeda112358
I am not sure if I had an outdated source (1.1.2) but I saw this code and a one-byte optimisation

You do have an outdated version of Axe; I already added that optimization in 1.2.0. :P


Quote from: calc84maniac
I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).

Pfft what are the chances of that...
« Last Edit: February 15, 2013, 11:30:22 pm by Runer112 »

Offline Xeda112358

Re: Assembly Programmers - Help Axe Optimize!
« Reply #290 on: February 16, 2013, 07:27:26 am »
Quote from: Runer112
You do have an outdated version of Axe; I already added that optimization in 1.2.0. :P

Darn, I actually do have 1.2.1 in a different folder, I completely forgot about that .__. I am glad that I got something right, though :D

Quote from: calc84maniac
I'm not so sure that would work, because there's a possible case where you could be running code from an app and finding the Nth string in a large appvar in RAM, for example (which could be more than 16KB in size).

I was worried about that, but I figured that it would be pretty rare. It would definitely be the only scenario in which it would fail, too. .__.

Offline Deep Toaster

  • So much to do, so much time, so little motivation
  • Administrator
  • LV13 Extreme Addict (Next: 9001)
  • *************
  • Posts: 8217
  • Rating: +758/-15
    • View Profile
    • ClrHome
Re: Assembly Programmers - Help Axe Optimize!
« Reply #291 on: April 06, 2013, 05:42:16 pm »
Don't know if it's been mentioned before (and maybe there's a reason it's this way), but p_SendByte starts by loading B and C individually where p_GetByte loads them together (saving a byte).




Offline Runer112

Re: Assembly Programmers - Help Axe Optimize!
« Reply #292 on: April 06, 2013, 05:44:01 pm »
No reason whatsoever. Good catch.

Offline Xeda112358

Re: Assembly Programmers - Help Axe Optimize!
« Reply #293 on: July 04, 2013, 08:59:58 am »
EDIT: Jacobly pointed out the case HL = 8000h, so this doesn't work D:

Hopefully this file has the updated SDiv routine. I have this:
Original routine
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
xor d
jp p,__SDivSkip1-p_SDiv-1
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
__SDivSkip1:
bit 7,d
jr z,__SDivSkip2
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
__SDivSkip2:
call $3F00+sub_Div
x_SDivEntry:
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:
Modified routine: same size, 1 or 6 cycles saved
Code: [Select]
p_SDiv:
.db __SDivEnd-1-$
ld a,h
xor d
push af
xor d
jp p,__SDivSkip1-p_SDiv-1
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
__SDivSkip1:
xor d
jp p,__SDivSkip2-p_SDiv-1
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a
__SDivSkip2:
call $3F00+sub_Div
x_SDivEntry:
pop af
ret p
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
__SDivEnd:
And my only change is the two lines after __SDivSkip1.
Same size, save at least 1 cycle (up to 6 cycles).

EDIT: The same modification can be made to the fixed point signed division routine.
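
A small Python model of why HL = 8000h breaks the modified check. It is only a sketch of the sign test for the divisor, assuming A holds the (possibly negated) high byte of HL when execution reaches __SDivSkip1; the function names are made up.

```python
def neg16(x: int) -> int:
    """16-bit two's complement negation, as the xor a / sub l / ... sequence does."""
    return (-x) & 0xFFFF


def original_negates_divisor(hl: int, de: int) -> bool:
    """bit 7,d / jr z: negate DE exactly when DE is negative."""
    return bool(de & 0x8000)


def modified_negates_divisor(hl: int, de: int) -> bool:
    """xor d / jp p: relies on bit 7 of A being clear at __SDivSkip1."""
    a = hl >> 8                  # A = H after "xor d / push af / xor d"
    if a & 0x80:                 # HL negative: it gets negated, A = new H
        a = neg16(hl) >> 8
    a ^= de >> 8                 # xor d
    return bool(a & 0x80)        # jp p skips the negation when bit 7 is clear


# Only bit 7 of D matters for the test, so two divisors cover both cases.
mismatches = {
    hl
    for hl in range(0x10000)
    for de in (0x0001, 0x8000)
    if original_negates_divisor(hl, de) != modified_negates_divisor(hl, de)
}
```

Negating 8000h gives 8000h again, so bit 7 of A is still set and the xor flips the decision for that one dividend.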

Offline Matrefeytontias

  • Axe roxxor (kinda)
  • LV10 31337 u53r (Next: 2000)
  • **********
  • Posts: 1982
  • Rating: +310/-12
  • Axe roxxor
    • View Profile
    • RMV Pixel Engineers
Re: Assembly Programmers - Help Axe Optimize!
« Reply #294 on: July 12, 2013, 01:30:49 pm »
Seeing the discussion about Fill( in the axiom request thread, I was surprised that this wasn't already implemented this way:

Code: [Select]
; Fill(ptr, amount, byte (not word))
; hl = ptr, de = byte, bc = amount
 ld (hl),e
 dec bc
 ld a,c
 or b
 ret z ; or whatever to quit
 ld e,l
 ld d,h
 inc de
 ldir
 ret ; or whatever to quit, same as above

I don't think it's really optimized though >_>
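
For readers less used to the ldir trick, a rough Python model of what the routine does: it stores the byte once, then lets an overlapping copy (destination one byte ahead of the source) smear it forward. This is just a sketch with a hypothetical function name, using a bytearray in place of calculator memory.

```python
def fill(memory: bytearray, ptr: int, amount: int, value: int) -> None:
    """Model of the Fill( routine: hl = ptr, e = value, bc = amount."""
    memory[ptr] = value & 0xFF        # ld (hl),e
    bc = (amount - 1) & 0xFFFF        # dec bc
    if bc == 0:                       # ld a,c / or b / ret z
        return
    hl, de = ptr, ptr + 1             # ld e,l / ld d,h / inc de
    while True:                       # ldir: copy (hl) to (de), advance, count down
        memory[de] = memory[hl]
        hl += 1
        de += 1
        bc = (bc - 1) & 0xFFFF
        if bc == 0:
            break


buf = bytearray(8)
fill(buf, 2, 4, 0xAB)

single = bytearray(4)
fill(single, 1, 1, 0x7F)
```

Note that, as written, an amount of 0 wraps to FFFFh after the dec bc, so the assembly would copy across almost the whole address space rather than fill nothing.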
« Last Edit: July 12, 2013, 01:34:19 pm by Matrefeytontias »

Offline jo-thijs

  • LV1 Newcomer (Next: 20)
  • *
  • Posts: 19
  • Rating: +1/-0
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #295 on: October 31, 2013, 12:02:41 pm »
I found this in the Commands.inc file of axe1.2.2a:
p_IntNe:
   .db 8
   xor   a
   sbc   hl,de
   jr   z,$+5
   ld   hl,1

I can't find the purpose of xor a.

Offline Runer112

Re: Assembly Programmers - Help Axe Optimize!
« Reply #296 on: October 31, 2013, 12:14:49 pm »
It resets the carry flag for sbc hl,de, it seems. :P
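
A quick Python model of why that matters, as a sketch only. sbc hl,de subtracts the carry as well, so a stale carry shifts the result by one; the function names and the clear_carry switch are made up for illustration.

```python
def sbc_hl_de(hl: int, de: int, carry: int):
    """16-bit subtract with carry; returns (result, zero flag)."""
    result = (hl - de - carry) & 0xFFFF
    return result, result == 0


def int_ne(hl: int, de: int, carry_in: int = 0, clear_carry: bool = True) -> int:
    """Model of p_IntNe: 0 if hl == de, otherwise 1."""
    carry = 0 if clear_carry else carry_in    # xor a resets the carry flag
    result, zero = sbc_hl_de(hl, de, carry)   # sbc hl,de
    return result if zero else 1              # jr z,$+5 skips ld hl,1
```

Without the xor a, a carry left over from earlier code would make 6 vs 5 look equal and 5 vs 5 look different.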

Offline Xeda112358

Re: Assembly Programmers - Help Axe Optimize!
« Reply #297 on: July 27, 2015, 02:49:49 pm »
I think I finally have a major optimization after having worked on link routines for the past couple of weeks. I didn't modify the timeout or syncing code, just the core get/send stuff. I've tested it and it is reliable.

For reference, in the event that p_SendByte doesn't have to wait, the new routine is 931cc vs 1647cc. Here are my proposed routines:

p_GetByte: +0 bytes, presumably about as much faster as p_SendByte
Code: [Select]
p_GetByte:
.db __GetByteEnd-$-1
di
ld bc,$0803 ;Bit counter in b, bit mask in c
ld hl,-1
xor a
out (0),a ;Make sure we are reset
in a,(0)
and c ;Check to see if sender is ready
dec a
ret nz ;If not, then go back
inc a
out (0),a ;Relay a confirmation
ex (sp),hl ;Wait until confirmation is read (59 T-states minimum)
ex (sp),hl
ld a,(de) ;Bit counter in b and bitmask in c
xor a ;Store received byte in l
ld hl,$AA
out (0),a ;Reset the ports to receive data

__GetByteLoop:
    in a,(0)
    xor l
    rra
    jr c,__GetByteLoop
    in a,(0)
    rra
    rra             ;bits cycled in are masked with 0x55. Need to invert anyways, so mask at the end with 0xAA
    rr l
    djnz __GetByteLoop
    ret
__GetByteEnd:
   
p_SendByte: -4 bytes, -723cc
Code: [Select]
p_SendByte:
    .db __SendByteEnd-$-1
di
ld bc,$5503 ;Bit counter in b, bit mask in c
ld a,%00000010
out (0),a ;Indicate we are ready to send
__SendByteTimeout:
dec hl
ld a,h
or l
jr z,__SendByteDone
in a,(0) ;Loop is 59 T-states maximum
and c
jr nz,__SendByteTimeout ;Keep looping till we get it
out (0),a
__SendLoop:
    rrc e
    ccf
    rla
    sla b
    ccf
    rla
    out (0),a
    ex (sp),hl
    ex (sp),hl
    nop
    jr nz,__SendLoop
;need 37cc
    xor a
    ex (sp),hl
    ex (sp),hl
__SendByteDone:
    out (0),a
    ret
__SendByteEnd:
EDIT: I looked at the timeout code for p_SendByte, and realized that my code didn't need B to be a counter but instead I was using D as a kind of counter. By using B instead of D, I could cut out the ld d,$55, saving 2 bytes and 7cc.
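
To illustrate the B = $55 counter, here is a rough Python model of the sla b in __SendLoop. It is a sketch that assumes nothing between sla b and jr nz touches the zero flag (ccf, rla, out and ex (sp),hl do not, as far as I can tell); the function names are hypothetical.

```python
def sla(value: int):
    """Shift left arithmetic on an 8-bit register; returns (result, carry out)."""
    carry = (value >> 7) & 1
    return (value << 1) & 0xFF, carry


def send_loop_carries(b: int = 0x55):
    """Carry produced by sla b on each pass until B reaches zero."""
    carries = []
    while True:
        b, carry = sla(b)
        carries.append(carry)
        if b == 0:        # jr nz,__SendLoop falls through once B is zero
            break
    return carries
```

Starting from $55, the loop runs 8 times and the carry out of sla b alternates on every pass, so B doubles as both the bit counter and a toggling bit source.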

Offline Xeda112358

Re: Assembly Programmers - Help Axe Optimize!
« Reply #298 on: September 21, 2019, 07:19:54 pm »
Here is an optimized p_LineShr routine. NOTE: It flips the meaning of the carry flag on output, so the line routines that use this will need to ret c instead of ret nc.
Original routine
Code: [Select]
p_LineShr:
.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
ld a,l
pop bc
pop hl
pop de
ex (sp),hl
ld d,l
pop hl
ex (sp),hl
push bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
cp 64
ret nc
ld h,a
ld a,d
cp 64
ret nc

ld a,l
cp 96
ret nc
ld a,e
cp 96
ret nc

sub l
jr nc,__LineShrSkipRev
ex de,hl
neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
push af ; Saving DX (it will be popped into HL below)
ld a,l ; IX+=L/8+D*12 (actually D*4+D*4+D*4)
rra
rra
rra
and %00011111
ld c,a
ld b,0
add ix,bc
ld a,d
add a,a
add a,a
ld c,a
add ix,bc
add ix,bc
add ix,bc
ld a,l ; Calculating the starting pixel mask
and %00000111
inc a
ld b,a
ld a,%00000001
__LineShrMaskLoop:
rrca
djnz __LineShrMaskLoop
ld c,a
ld a,h ; Calculating delta Y and negating the Y increment if necessary
sub d ; This is the last instruction for which we need the original data
ld de,12
jr nc,__LineShrSkipNeg
ld de,-12
neg
__LineShrSkipNeg:
pop hl ; Recalling DX
ld l,a ; H=DX, L=DY
cp h
jr nc,__LineVert ; Line is rather vertical than horizontal
ld a,h
__LineVert:
ld b,a ; Pixel counter
inc b
cp l
scf ; Setting up gradient counter
ccf
rra
scf
ret ; c=1, z=vertical major
__LineShrEnd:
   
Optimized routine: -4 bytes, -13cc
Code: [Select]
p_LineShr:
.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
ld a,l
pop bc
pop hl
pop de
ex (sp),hl
ld d,l
pop hl
ex (sp),hl
push bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
ld h,a
ld a,63
cp h
ret c
cp d
ret c

ld a,95
cp l
ret c
cp e
ret c
ld a,e

sub l
jr nc,__LineShrSkipRev
ex de,hl
neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
push af ; Saving DX (it will be popped into HL below)
ld a,d
add a,a
add a,a
ld c,a
ld b,0
add ix,bc
add ix,bc
add ix,bc
ld a,l
and 7
ld e,a
xor l
rra
rra
rra
ld c,a
add ix,bc
ld b,e ; Mask loop counter: low 3 bits of x1 (saved in E above), not the byte offset in A
inc b
ld a,%00000001
__LineShrMaskLoop:
rrca
djnz __LineShrMaskLoop
ld c,a
ld a,h ; Calculating delta Y and negating the Y increment if necessary
sub d ; This is the last instruction for which we need the original data
ld de,12
jr nc,__LineShrSkipNeg
ld de,-12
neg
__LineShrSkipNeg:
pop hl ; Recalling DX
ld l,a ; H=DX, L=DY
cp h
jr nc,__LineVert ; Line is rather vertical than horizontal
ld a,h
__LineVert:
ld b,a ; Pixel counter
inc b
cp l
res 0,a ; Setting up gradient counter
rrca
ret ; c=0, z=vertical major
__LineShrEnd:
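
A rough Python model of the buffer offset and pixel mask math in both versions, as a sketch. It assumes a 12-byte-wide buffer as in the comments above and that the mask loop counter is (x and 7)+1 as in the original routine; the function names are made up.

```python
def rrca(a: int) -> int:
    """Rotate an 8-bit value right by one, bit 0 wrapping to bit 7."""
    return ((a >> 1) | (a << 7)) & 0xFF


def pixel_mask(x: int) -> int:
    """ld a,%00000001 followed by (x and 7)+1 rotations."""
    mask = 0x01
    for _ in range((x & 7) + 1):
        mask = rrca(mask)
    return mask


def original_offset(x: int, y: int) -> int:
    column = (x >> 3) & 0x1F          # rra x3, then and %00011111 drops rotated-in bits
    row4 = (y * 4) & 0xFF             # ld a,d / add a,a / add a,a
    return column + 3 * row4          # one add for the column, three for the row


def optimized_offset(x: int, y: int) -> int:
    low = x & 7                       # and 7 / ld e,a
    column = (x ^ low) >> 3           # xor l leaves x and $F8 with carry clear,
                                      # so rra x3 acts as a plain shift
    row4 = (y * 4) & 0xFF
    return 3 * row4 + column
```

The xor trick clears the low three bits before the rotates, so nothing wraps around into the top bits and the and %00011111 is no longer needed.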



Or this version, which only saves 3 bytes but saves 10 more clock cycles:
Code: [Select]
p_LineShr:
.db __LineShrEnd-$-1
;; l=y2, ix=buff, (sp)=ret, (sp+2)=ret_2, (sp+4)=x2, (sp+6)=y1, (sp+8)=x1
ld a,l
pop bc
pop hl
pop de
ex (sp),hl
ld d,l
pop hl
ex (sp),hl
push bc

;; a=y2, d=y1, e=x2, l=x1, (sp)=ret, (sp+2)=ret_2
ld h,a
ld a,63
cp h
ret c
cp d
ret c

ld a,95
cp l
ret c
cp e
ret c
ld a,e

sub l
jr nc,__LineShrSkipRev
ex de,hl
neg

;; a=dx, d=y1, e=x2, h=y2, l=x1
__LineShrSkipRev:
ld e,a ; Saving DX
ld a,l ; IX+=L/8+D*12 (actually D*4+D*4+D*4)
rra
rra
rra
and %00011111
ld c,a
ld b,0
add ix,bc
ld a,d
add a,a
add a,a
ld c,a
add ix,bc
add ix,bc
add ix,bc
ld a,l ; Calculating the starting pixel mask
and %00000111
inc a
ld b,a
ld a,%00000001
__LineShrMaskLoop:
rrca
djnz __LineShrMaskLoop
ld c,a
ld a,h ; Calculating delta Y and negating the Y increment if necessary
sub d ; This is the last instruction for which we need the original data

ld h,e ; DX

ld de,12
jr nc,__LineShrSkipNeg
ld de,-12
neg
__LineShrSkipNeg:
ld l,a ; DY (stored after the neg so L holds |DY|, matching the original routine)
cp h
jr nc,__LineVert ; Line is rather vertical than horizontal
ld a,h
__LineVert:
ld b,a ; Pixel counter
inc b
cp l
res 0,a ; Setting up gradient counter
rrca
ret ; c=0, z=vertical major
__LineShrEnd:

Offline Xeda112358

Re: Assembly Programmers - Help Axe Optimize!
« Reply #299 on: October 20, 2019, 10:47:31 am »
p_EQ0

The current routine is 7 bytes and 36cc:
Code: [Select]
;7 bytes, 36cc
ld a,l
or h
add a,255
sbc hl,hl
inc hl

But we can save 8cc without sacrificing bytes:
Code: [Select]
;7 bytes, 28cc
xor a
cp h
ld h,a
sbc a,l
sbc a,a
ld l,a
inc l
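
For anyone who wants to double-check the two versions against each other, a rough Python model of both, tracking only the registers and the carry flag each instruction uses. This is a sketch; the function names are made up.

```python
def eq0_original(hl: int) -> int:
    """ld a,l / or h / add a,255 / sbc hl,hl / inc hl"""
    a = (hl & 0xFF) | (hl >> 8)            # ld a,l / or h
    carry = 1 if a + 255 > 0xFF else 0     # add a,255 carries unless a == 0
    hl = (-carry) & 0xFFFF                 # sbc hl,hl gives 0 or $FFFF
    return (hl + 1) & 0xFFFF               # inc hl


def eq0_optimized(hl: int) -> int:
    """xor a / cp h / ld h,a / sbc a,l / sbc a,a / ld l,a / inc l"""
    h, l = hl >> 8, hl & 0xFF
    a = 0                                  # xor a
    carry = 1 if a < h else 0              # cp h: borrow if h is nonzero
    h = a                                  # ld h,a
    diff = a - l - carry                   # sbc a,l: borrow if l or h was nonzero
    carry = 1 if diff < 0 else 0
    a = (-carry) & 0xFF                    # sbc a,a gives 0 or $FF
    l = a                                  # ld l,a
    l = (l + 1) & 0xFF                     # inc l
    return (h << 8) | l
```

Both return 1 when HL is zero and 0 otherwise for every 16-bit input.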