Author Topic: Assembly Programmers - Help Axe Optimize!  (Read 159939 times)

0 Members and 1 Guest are viewing this topic.

Offline Runer112

  • Project Author
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #150 on: March 12, 2011, 10:47:24 pm »
I'm back to take on a few more routines that I either couldn't follow or just decided not to try in my first mass optimization post.



p_Sqrt: 1 byte and 4 cycles saved. I still think it may be a good idea to replace this with a restoring square root algorithm, though, like the one I suggested a while ago here. Although maybe not that exact one, because I wrote that when I was still not too familiar with assembly and it may not be very optimized.

Code: (Original code: 14 bytes, n*37+36 cycles) [Select]
p_Sqrt:
.db __SqrtEnd-1-$
ld a,-1
ld d,a
ld e,a
__SqrtLoop:
add hl,de
inc a
dec e
dec de
jr c,__SqrtLoop
ld h,0
ld l,a
ret
__SqrtEnd:
   
   
Code: (Optimized code: 13 bytes, n*37+32 cycles) [Select]
p_Sqrt:
.db __SqrtEnd-1-$
ld de,-1&$FF
ld b,e
ld c,e
__SqrtLoop:
add hl,bc
inc e
dec c
dec bc
jr c,__SqrtLoop
ex de,hl
ret
__SqrtEnd:
   



p_Sin: 3 bytes and 8 cycles saved.

Code: (Original code: 29 bytes, too lazy to test cycles) [Select]
p_Sin:
.db __SinEnd-1-$
add a,a
rr c
ld d,a
cpl
ld e,a
xor a
ld b,8
__SinLoop:
rrc e
jr nc,__SinSkip
add a,d
__SinSkip:
rra
djnz __SinLoop
adc a,a
ld l,a
ld h,b
rl c
ret nc
cpl
inc a
ret z
ld l,a
dec h
ret
__SinEnd:
   
   
Code: (Optimized code: 26 bytes, too lazy to test-8 cycles) [Select]
p_Sin:
.db __SinEnd-1-$
ld c,a
add a,a
ld d,a
cpl
ld e,a
xor a
ld b,8
__SinLoop:
rra
rrc e
jr nc,__SinSkip
add a,d
__SinSkip:
djnz __SinLoop
ld l,a
ld h,b
or c
ret p
xor a
sub l
ret z
ld l,a
dec h
ret
__SinEnd:
   



p_Log: 1 byte saved.

Code: (Original code: 11 bytes, n*31+17 cycles) [Select]
p_Log:
.db 11
ld a,16
scf
__LogLoop:
adc hl,hl
dec a
jr nc,__LogLoop
ld l,a
ld h,0
   
   
Code: (Optimized code: 10 bytes, n*31+13 cycles) [Select]
p_Log:
.db 10
ld de,16
scf
__LogLoop:
adc hl,hl
dec e
jr nc,__LogLoop
ex de,hl
   

Before we leave this routine, though, the output for hl=0 isn't really correct; it returns 255. You could change the dec e in my suggested routine to dec de to give a slightly more accurate result of -1, but again, that's not quite correct either. The real result of log(0) should be negative infinity, which would be most properly represented by -32768. For the small cost of 3 bytes, the following routine would give you this result:

Code: (Mathematically correct code: 13 bytes, only a little bit slower cycles) [Select]
p_Log:
.db 13
ld de,16
__LogLoop:
add hl,hl
jr c,__LogLoopEnd
dec e
jr nz,__LogLoop
__LogLoopEnd:
ex de,hl
ccf
rr h
   



p_Exp: As with above suggestion, this isn't an optimization, this is a suggested improvement. With the current routine, I see two issues. Firstly, it returns 2^(input mod 256) instead of 2^(input), the latter of which is probably what you would expect of a 16-bit math function. Secondly, this routine does not do anything special for inputs with high values mod 256, which could result in it taking up to 7195 cycles. The following routine would correct both of these behaviors. Also note that it is a subroutine instead of inline code (more on turning inline code into subroutines later).

Code: (Mathematically correct code: 16 bytes, only a little bit slower cycles) [Select]
p_Exp:
.db __ExpEnd-p_Exp-1
ld b,l
ld a,l
and %11110000
or h
ld hl,0
ret nz
inc b
scf
__ExpLoop:
adc hl,hl
djnz __ExpLoop
ret
__ExpEnd:
   



__DrawMskAligned: 2 bytes, 72 cycles saved. I only tacked the aligned part of the masked sprite routine because the rest is scary.

Code: (Original code: 33 bytes, 1481 cycles) [Select]
__DrawMskAligned:
dec hl
__DrawMskAlignedLoop:
ld a,(ix+0)
xor (ix+8)
cpl

ld c,a
ld a,(hl)
or (ix+0)
and c
ld (hl),a

ld de,appBackUpScreen-plotSScreen
add hl,de

ld a,c
and (hl)
or (ix+0)
ld (hl),a

inc ix
ld de,plotSScreen-appBackUpScreen+12
add hl,de

djnz __DrawMskAlignedLoop
     
   
Code: (Optimized code: 31 bytes, 1409 cycles) [Select]
__DrawMskAligned:
dec hl
__DrawMskAlignedLoop:
push hl
ld de,appBackUpScreen-plotSScreen
add hl,de

ld a,(ix+0)
ld d,a
xor (ix+8)
cpl
ld e,a

and (hl)
or d
ld (hl),a

pop hl

ld a,(hl)
or d
and e
ld (hl),a

inc ix
ld de,12
add hl,de

djnz __DrawMskAlignedLoop
     





Finally, here are some routines that I feel would be better suited to be subroutines instead of inline code. Feel free to disagree with me on any or all of these.
  • p_DispStrApp and p_TextStrApp: Any application that displays text will most likely be doing so far more than once, making the size savings from turning these into subroutines probably quite large. Also, the OS routines used to display text are slow enough that you wouldn't notice any speed difference from the overhead of a subroutine.
  • p_Length: Not really necessary to turn into a subroutine, it just seems like the kind of function that should be.
  • p_Log and p_Exp: Like p_Mul and p_Sqrt, I think that math routines should probably be subroutines. The current routines are only 11 and 10 bytes respectively, but if you use the routines I suggested above, they would then be 13 and 16 bytes respectively, making them much more worthy of being a subroutine. The p_Exp routine I suggested actually relies on being a subroutine.
  • p_GetBit and p_GetBit16: Although listed as only being 12 bytes in Commands.inc, they have an extra 2 bytes of overhead register shifting not shown in the routines. Since these are both 14-byte math functions that could definitely be called on more than once in programs that use them, I feel that they should probably be subroutines.

And for one going in the other direction, perhaps p_OnKey doesn't need to be a subroutine? Very few programs use the on key, and I'm guessing that any program that does isn't likely to use it more than once. Or at least if you don't want to change it, can you change the .db 10 into .db __OnKeyEnd-p_OnKey-1 or .db __OnKeyEnd-1-$? I was always confused when I saw this routine about whether or not it was inserted as inline code or a subroutine.

Offline Quigibo

  • The Executioner
  • CoT Emeritus
  • LV11 Super Veteran (Next: 3000)
  • *
  • Posts: 2031
  • Rating: +1075/-24
  • I wish real life had a "Save" and "Load" button...
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #151 on: March 27, 2011, 04:18:51 pm »
Found a very small optimization that I'm surprised you didn't find  ;)

Code: (Original) [Select]
p_Min:
.db 8
pop de
or a
sbc hl,de
add hl,de
jr c,$+3
ex de,hl    
Code: (Optimized) [Select]
p_Min:
.db 8
pop de
or a
sbc hl,de
ex de,hl
jr nc,$+3
add hl,de

Code: (Original) [Select]
p_Max:
.db 8
pop de
or a
sbc hl,de
add hl,de
jr nc,$+3
ex de,hl
Code: (Optimized) [Select]
p_Max:
.db 8
pop de
or a
sbc hl,de
ex de,hl
jr c,$+3
add hl,de

It saves 7 clock cycles in the best case yet the worse case remains the same.  So that's a 3.5 clock cycle speed up in the average uniform case.
___Axe_Parser___
Today the calculator, tomorrow the world!

Offline Runer112

  • Project Author
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #152 on: March 27, 2011, 06:01:33 pm »
Quigibo, I noticed that you implemented a bunch of my optimizations/bug fixes, which is awesome. Having left a few out is fine, but I'm stuck wondering why you left out a specific few of my suggestions in particular that I thought would be useful. Did you just miss some of these, or did you purposely leave them out?
  • An optimized Exch() routine
  • Turning p_DispStrApp and p_TextStrApp into subroutines
  • Adjusted GetCalc() routines; I see you included one of the three adjusted routines (p_GetArc) probably because it's smaller than the original routine, but this routine includes the adjustment made in the other two routines I suggested and it seems incongruous for only the archived GetCalc() routine to have this fix implemented


And on an unrelated note, could you extend the DispGraph change to DispGraphClrDraw as well? And because I know a bunch of people may not like DispGraph no longer working in 15MHz mode, could you perhaps include a routine like DispGraphNormal that saves the CPU speed setting, puts the CPU in 6MHz mode, calls the normal DispGraph routine, and then restores the CPU speed? Something like this:

Code: [Select]
p_FastCopy6MHz:
.db __FastCopy6MHzEnd-1-$
in a,($02)
rla
in a,($20)
push af
xor a
out ($20),a
call sub_FastCpy
pop af
ret c
out ($20),a
ret
__FastCopy6MHzEnd:
« Last Edit: March 27, 2011, 06:21:29 pm by Runer112 »

Offline Quigibo

  • The Executioner
  • CoT Emeritus
  • LV11 Super Veteran (Next: 3000)
  • *
  • Posts: 2031
  • Rating: +1075/-24
  • I wish real life had a "Save" and "Load" button...
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #153 on: March 27, 2011, 06:33:30 pm »
Oh yeah, now I remember why I didn't want to change Dispgraph before  :banghead:

The overhead is small though relative to the size of the routine,  I think I'll just make regular DispGraph account for the speed settings and it would be roughly the same size as the original.
___Axe_Parser___
Today the calculator, tomorrow the world!

Offline Runer112

  • Project Author
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #154 on: March 27, 2011, 06:38:24 pm »
Sorry if I'm pestering you about this, but could you please respond to the first part of my last post? Even if you tell me you purposely left them out for whatever reasons that's completely fine, I just wouldn't want you to have missed suggestions that you might have actually wanted to implement.
« Last Edit: March 27, 2011, 06:39:20 pm by Runer112 »

Offline Quigibo

  • The Executioner
  • CoT Emeritus
  • LV11 Super Veteran (Next: 3000)
  • *
  • Posts: 2031
  • Rating: +1075/-24
  • I wish real life had a "Save" and "Load" button...
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #155 on: March 28, 2011, 04:30:42 am »
Oh yeah, sorry.  I was trying to release this quickly with my limited time so I didn't have time to make changes that would require me to rewrite core routines.  The exch() optimization requires this, because even though it works for Axioms, its not the same for regular routines yet but it will be eventually.  The subroutines I didn't convert yet because I kind of overlooked it, but I'll get to it next version.  As for GetCalc() I'm not sure if I'm ready to change it yet because it will create an incompatibility with current programs that still use the Ptr-2 to reference os vars.  I will probably put this on the poll.
___Axe_Parser___
Today the calculator, tomorrow the world!

Offline Runer112

  • Project Author
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #156 on: March 28, 2011, 11:41:47 am »
Alright, that makes sense. Thanks. :)

Offline calc84maniac

  • eZ80 Guru
  • Coder Of Tomorrow
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2912
  • Rating: +471/-17
    • View Profile
    • TI-Boy CE
Re: Assembly Programmers - Help Axe Optimize!
« Reply #157 on: April 06, 2011, 01:19:54 am »
Code: (Original) [Select]
p_IntGt:
.db 7
ex de,hl
xor a
sbc hl,de
ld h,a
rla
ld l,a
Code: (Optimized) [Select]
p_IntGt:
.db 6
scf
sbc hl,de
sbc hl,hl
inc hl

Also, I think optimized comparisons for comparing to immediate values would be a good idea. For example, <3 becomes ld de,-3 \ add hl,de \ sbc hl,hl \ inc hl
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

Offline Runer112

  • Project Author
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #158 on: April 06, 2011, 11:50:59 am »
For that matter, a lot of things could still use optimized immediate value versions. Pretty much every comparison could, as well as a few other operators. For instance, p_BoolOrImm, p_BoolAndImm, and p_BoolXorImm have been sitting in Commands.inc since Axe 0.4.7 and I would love to see them implemented. Constant bit setting and resetting would also be awesome. Unfortunately none of these things can be achieved with an Axiom, or I would be all over this.
« Last Edit: April 06, 2011, 11:54:47 am by Runer112 »

Offline calc84maniac

  • eZ80 Guru
  • Coder Of Tomorrow
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2912
  • Rating: +471/-17
    • View Profile
    • TI-Boy CE
Re: Assembly Programmers - Help Axe Optimize!
« Reply #159 on: April 19, 2011, 11:00:30 pm »
I made a "real" square root algorithm, meaning one that doesn't use the repeated subtraction method. Certainly not a size optimization, but it certainly makes the average execution time a lot lower.
Code: [Select]
p_Sqrt:
.db __SqrtEnd-1-$
ex de,hl
ld bc,$8000
ld h,c
ld l,c
__SqrtLoop:
srl b
rr c
add hl,bc
ex de,hl
sbc hl,de
jr nc,__SqrtNoOverflow
add hl,de
ex de,hl
or a
sbc hl,bc
;jr __SqrtOverflow ;Commented out in favor of super optimization
.db $DA ;JP C opcode to skip next 2 bytes since carry is reset here.
__SqrtNoOverflow:
ex de,hl
add hl,bc
__SqrtOverflow:
srl h
rr l
srl b
rr c
jr nc,__SqrtLoop
ret
__SqrtEnd:
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

Offline Runer112

  • Project Author
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #160 on: April 19, 2011, 11:07:34 pm »
How does it compare to this one? I suggested it a while ago but Quigibo either didn't see it or didn't seem to be interested in it. :P

Offline calc84maniac

  • eZ80 Guru
  • Coder Of Tomorrow
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2912
  • Rating: +471/-17
    • View Profile
    • TI-Boy CE
Re: Assembly Programmers - Help Axe Optimize!
« Reply #161 on: April 19, 2011, 11:11:39 pm »
How does it compare to this one? I suggested it a while ago but Quigibo either didn't see it or didn't seem to be interested in it. :P
Er... that looks impressive :P I'll have to take a closer look at how it works sometime
"Most people ask, 'What does a thing do?' Hackers ask, 'What can I make it do?'" - Pablos Holman

Offline Quigibo

  • The Executioner
  • CoT Emeritus
  • LV11 Super Veteran (Next: 3000)
  • *
  • Posts: 2031
  • Rating: +1075/-24
  • I wish real life had a "Save" and "Load" button...
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #162 on: April 20, 2011, 06:08:32 pm »
I'll think about it.  I'm not sure how often square roots are actually needed and how applicable they are to speed constraints.

Another thing, I'm adding new auto-opts for constants in the comparisons (less than, less than or equal to, greater than, and greater than or equal to).  I have found nice optimizations for powers of 2 and some low numbers, but if anyone wants to try writing some, I could definitely use them.  I probably missed some or may have sub-optimal solutions.
___Axe_Parser___
Today the calculator, tomorrow the world!

Offline Runer112

  • Project Author
  • LV11 Super Veteran (Next: 3000)
  • ***********
  • Posts: 2289
  • Rating: +639/-31
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #163 on: April 20, 2011, 06:10:56 pm »
If you post what you have so far, I can look at them and see if I can find any optimizations for them. I also might see if there are any other optimized comparisons you don't already have that I could contribute.

Offline Builderboy

  • Physics Guru
  • CoT Emeritus
  • LV13 Extreme Addict (Next: 9001)
  • *
  • Posts: 5673
  • Rating: +613/-9
  • Would you kindly?
    • View Profile
Re: Assembly Programmers - Help Axe Optimize!
« Reply #164 on: April 20, 2011, 06:38:42 pm »
What someone needs to do is take an program that can emulate the z80's basic mathematical and jumping functions in a program setting and brute force out *the* most optimized program that performs a small simple task like multiplying two numbers :P