Optimization Compilation: The Quick & Dirty Guide to Optimizing in AxeThis is intended to be a guide for every Axe programmer with general optimization
For demonstration purposes, I will always use the variable A. Of course, you can do these optimizations with any other var or memory address. Most opts are either in code boxes or on bullet points.
Part I: The BASIC StuffMany of the tricks we've learned from BASIC can also be applied to Axe. However, not all of them are applicable. Let's take a look:
Boolean Stuff: As with BASIC, we can optimize
If VAR?0
to simply
If VAR
However, be warned that this may not always work with compound statements (
If A and B). Because the logic operators are bitwise operations, the statement "2 or 0" will return 0. This means that unless you are certain that A and B are always boolean values (1 or 0), you need to do
If A?0 and (B?0) instead of
If A and BAnother random thing is to ALWAYS CLOSE YOUR QUOTES AND PARENTHESES. Unlike BASIC, the store arrow does not close them for you. Unlike BASIC, leaving them out does NOT improve the size of your program; it is unrelated to optimization but I figured I'd just stick it here since it is a good coding habit and makes code more readable.
Pre-evaluating expressions: Especially in games that heavily reference arrays throughout a section of code, it is often good both for speed and memory to pre-evaluate expressions that do not change throughout the loop. Look at this code for drawing a scrolling 16x16 tilemapper with X and Y positions in pixels:
For(A,0,11)
For(B,0,7)
If {Y/8+B*16+(X/8)+A+L1}
Pt-On(A*8-(X^8),B*8-(Y^8),{Y/8+B*32+(X/8)+A}*8+Pic0
End
End
End
There is a HUGE speed gain from simply preevaluating some of the expressions before entering the loop:
X/8->E
Y/8->F
X^8->I
Y^8->J
For(A,0,7)
For(B,0,11)
If {F+B*16+E+A+L1}->C
Pt-On(A*8-I,B*8-J,C*8+Pic0
End
End
End
Pixel TestingPixel testing can be a mean and nasty cycle stealer from many programs. But never fear, it can be optimized...a lot. Remember that we have access to the screen buffer in L6.
If you are pixel testing a constant pixel, like pxl-Test(20,20), you can more than halve the speed of this command with the following optimization:
{20*12+L6+2}^^re4
This optimization relies on the fact that the numbers can basically be pre-computed: use the following formula to derive the numbers you should use:
{Y*12+L6+(X/8)}e(X^8)
So for another example, the command pxl-Test(8,1) becomes {12+L6}e1.
The speed gain from this is so great that you can even still save (although not as much) even with a variable Y value. How you treat the constant X value remains the same as before, but simply substitute in your variable Y value in the above code. So for example, pxl-Test(31,Y) becomes {Y*12+L6+3}e7.
Part II: GeneralEquality Checks:If A=EXP
optimizes to
!If A xor EXP
Note that you should use
!If A – EXP if it is an optimized addition/subtraction (See Optimized Math)
If A=EXP and (B=EXP)
optimizes to
!If A-EXP + (B-EXP)
where + is the 16bit 'or' operator.
Now, if you are checking the same variable for more than one possible expression, then it yields a greater optimization to do this:
If A=EXP1 or (A=EXP2)
to
If inData(A,Data(EXP1,EXP2,0))
You just have to make sure that you take care of the 0 case first, since this will return a non-zero value if the variable=0
Also, as Quigibo pointed out, this only works with constant, 8bit values.
Simple Math Stuff- If you can, always put the constants last in your expressions:CONST+A to A+CONST
- Use two byte vars over one byte vars or nibbles if you can afford the free RAM; it will be slightly smaller and faster. (but only by a little)
Note: As far as I can tell the speed difference is negligible. Unless you’re raycasting or something - Avoid parentheses as much as possible when writing expressions. Use Axe's left-to-right order of operations to your advantage. This is in the documentation.
- A quick way to determine the sign of a value (better than >>0) is EXP//32768 It will return -1 if the value is negative, and 0 if the value is 0 or positive
- The new ++ -- arguments are a smaller and faster way of doing {EXP}+/-1->{EXP}. In the old form, EXP had to be evaluated twice. Now, it only needs to be evaluated once.
- One random trick is if you need to initialize a 1-byte variable to 0, the line and 0->{EXP} will be smaller than 0->{EXP}
Optimized MathSome operations are hard-coded optimized versions that don’t use the usual arithmetic operations. A complete list of them can be found in the following spoiler which I stole from Runer112’s
handy list of command speed/sizeSpoiler For Optimized Math:
;Optimized Math
;-----------------------------------------------
p_Add0: 0 bytes 0 cycles +0 Results in no compiled code
p_Add1: 1 byte 6 cycles +1 Tied for the smallest way to change a value in Axe
p_Add2: 2 bytes 12 cycles +2
p_Add3: 3 bytes 18 cycles +3
p_Add254: 3 bytes 16 cycles +254
p_Add255: 2 bytes 10 cycles +255
p_Add256: 1 byte 4 cycles +256 The absolute smallest and fastest way to change a value in Axe
p_Add257: 2 bytes 10 cycles +257
p_Add258: 3 bytes 16 cycles +258
p_Add510: 4 bytes 20 cycles +510
p_Add511: 3 bytes 14 cycles +511
p_Add512: 2 bytes 8 cycles +512
p_Add513: 3 bytes 14 cycles +513
p_Add514: 4 bytes 20 cycles +514
p_Add767: 4 bytes 18 cycles +767
p_Add768: 3 bytes 12 cycles +768
p_Add769: 4 bytes 18 cycles +769
p_Add1024: 4 bytes 16 cycles +1024
p_Sub0: 0 bytes 0 cycles -0 Results in no compiled code
p_Sub1: 1 byte 6 cycles -1 Tied for the smallest way to change a value in Axe
p_Sub2: 2 bytes 12 cycles -2
p_Sub3: 3 bytes 18 cycles -3
p_Sub254: 3 bytes 16 cycles -254
p_Sub255: 2 bytes 10 cycles -255
p_Sub256: 1 byte 4 cycles -256 Also the absolute smallest and fastest way to change a value in Axe
p_Sub257: 2 bytes 10 cycles -257
p_Sub258: 3 bytes 16 cycles -258
p_Sub510: 4 bytes 20 cycles -510
p_Sub511: 3 bytes 14 cycles -511
p_Sub512: 2 bytes 8 cycles -512
p_Sub513: 3 bytes 14 cycles -513
p_Sub514: 4 bytes 20 cycles -514
p_Sub767: 4 bytes 18 cycles -767
p_Sub768: 3 bytes 12 cycles -768
p_Sub769: 4 bytes 18 cycles -769
p_Sub1024: 4 bytes 16 cycles -1024
p_Mul0: 3 bytes 10 cycles *0 Same as loading the constant 0
p_Mul1: 0 bytes 0 cycles *1 Results in no compiled code
p_MulN1: 6 bytes 24 cycles *?1 Same as p_IntNeg
p_Mul2: 1 byte 11 cycles *2 Tied for the smallest way to change a value in Axe
p_Mul3: 4 bytes 30 cycles *3
p_Mul4: 2 bytes 22 cycles *4
p_Mul5: 5 bytes 41 cycles *5
p_Mul6: 5 bytes 41 cycles *6
p_Mul7: 6 bytes 52 cycles *7
p_Mul8: 3 bytes 33 cycles *8
p_Mul9: 6 bytes 52 cycles *9
p_Mul10: 6 bytes 52 cycles *10
p_Mul12: 6 bytes 52 cycles *12
p_Mul16: 4 bytes 44 cycles *16
p_Mul32: 5 bytes 55 cycles *32
p_Mul64: 5 bytes 144 cycles *64
p_Mul128: 5 bytes 170 cycles *128
p_Mul255: 6 bytes 31 cycles *255
p_Mul256: 3 bytes 11 cycles *256
p_Mul257: 3 bytes 12 cycles *257
p_Mul258: 4 bytes 23 cycles *258
p_Mul260: 5 bytes 34 cycles *260
p_Mul264: 6 bytes 45 cycles *264
p_Mul512: 4 bytes 22 cycles *512
p_Mul513: 6 bytes 37 cycles *513
p_Mul514: 4 bytes 23 cycles *514
p_Mul516: 5 bytes 34 cycles *516
p_Mul520: 6 bytes 45 cycles *520
p_Mul768: 6 bytes 23 cycles *768
p_Mul1024: 5 bytes 33 cycles *1024
p_Mul1028: 5 bytes 34 cycles *1028
p_Mul1032: 6 bytes 45 cycles *1032
p_Mul2048: 6 bytes 44 cycles *2048
p_Mul2056: 6 bytes 45 cycles *2056
p_Mul4096: 5 bytes 290 cycles *4096
p_Mul8192: 5 bytes 314 cycles *8192
p_Mul16384: 5 bytes 338 cycles *16384
p_Mul32768: 6 bytes 24 cycles *32768
p_Mul65535: 6 bytes 24 cycles *65535 Same as p_MulN1
p_Div0: 3 bytes 10 cycles /0 Same as loading the constant 65535
p_Div1: 0 bytes 0 cycles /1 Results in no compiled code
p_Div2: 4 bytes 16 cycles /2
p_Div10: 3 bytes ~1896 cycles /10 n*3+1878 cycles, n=number of set bits in result
p_Div128: 5 bytes 27 cycles /128
p_Div256: 3 bytes 11 cycles /256
p_Div512: 5 bytes 19 cycles /512
p_Div32768: 5 bytes 27 cycles /32768
p_SDiv0: 3 bytes 10 cycles //0 **NOT MATHEMATICALLY CORRECT** Same as loading the constant 65535
p_SDiv1: 0 bytes 0 cycles //1 Results in no compiled code
p_SDiv2: 4 bytes 16 cycles //2
p_SDiv64: 6 bytes 38 cycles //64
p_SDiv128: 4 bytes 23 cycles //128
p_SDiv256: 5 bytes 20 cycles //256
p_SDiv512: 6 bytes 38 cycles //512
p_SDiv16384: 6 bytes 38 cycles //16384
p_SDiv32768: 3 bytes 26 cycles //32768
p_Mod1: 3 bytes 10 cycles ^1 Same as loading the constant 0
p_Mod2: 5 bytes 20 cycles ^2
p_Mod4: 6 bytes 22 cycles ^4
p_Mod8: 6 bytes 22 cycles ^8
p_Mod16: 6 bytes 22 cycles ^16
p_Mod32: 6 bytes 22 cycles ^32
p_Mod64: 6 bytes 22 cycles ^64
p_Mod128: 4 bytes 15 cycles ^128
p_Mod256: 2 bytes 7 cycles ^256
p_Mod512: 4 bytes 15 cycles ^512
p_Mod1024: 4 bytes 15 cycles ^1024
p_Mod2048: 4 bytes 15 cycles ^2048
p_Mod4096: 4 bytes 15 cycles ^4096
p_Mod8192: 4 bytes 15 cycles ^8192
p_Mod16384: 4 bytes 15 cycles ^16384
p_Mod32768: 2 bytes 8 cycles ^32768
p_EQN512: 9 bytes 44 cycles =?512
p_EQN256: 8 bytes 40 cycles =?256
p_EQN2: 8 bytes 40 cycles =?2
p_EQN1: 7 bytes 36 cycles =?1
p_EQ0: 7 bytes 36 cycles =0
p_EQ1: 7 bytes ~29 cycles =1 24 cycles if true, 34 cycles if false
p_EQ2: 8 bytes ~33 cycles =2 28 cycles if true, 38 cycles if false
p_EQ256: 8 bytes 40 cycles =256
p_EQ512: 9 bytes 44 cycles =512
p_NEN512: 9 bytes ~31 cycles ??512 33 cycles if true, 28 cycles if false
p_NEN256: 8 bytes ~27 cycles ??256 29 cycles if true, 24 cycles if false
p_NEN2: 8 bytes 40 cycles ??2
p_NEN1: 7 bytes 36 cycles ??1
p_NE0: 7 bytes ~23 cycles ?0 25 cycles if true, 20 cycles if false
p_NE1: 8 bytes ~27 cycles ?1 29 cycles if true, 24 cycles if false
p_NE2: 9 bytes ~31 cycles ?2 33 cycles if true, 28 cycles if false
p_NE256: 8 bytes ~27 cycles ?1 29 cycles if true, 24 cycles if false
p_NE512: 9 bytes ~31 cycles ?2 33 cycles if true, 28 cycles if false
p_GE0: 3 bytes 10 cycles ?0 Same as loading the constant 1
p_GT65535: 3 bytes 10 cycles >65535 Same as loading the constant 0
p_LE65535: 3 bytes 10 cycles ?65535 Same as loading the constant 1
p_LT0: 3 bytes 10 cycles <0 Same as loading the constant 0
p_GE1: 7 bytes ~23 cycles ?1 25 cycles if true, 20 cycles if false
p_GT0: 7 bytes ~23 cycles >0 25 cycles if true, 20 cycles if false
p_LE0: 7 bytes 36 cycles ?0
p_LT1: 7 bytes 36 cycles <1
p_SGE0: 4 bytes 32 cycles ??0
p_SLT0: 5 bytes 27 cycles <<0
p_GetBit0: 5 bytes 27 cycles ee0
p_GetBit1: 6 bytes 38 cycles ee1
p_GetBit2: 7 bytes 37 cycles ee2
p_GetBit3: 7 bytes 37 cycles ee3
p_GetBit4: 7 bytes 37 cycles ee4
p_GetBit5: 7 bytes 37 cycles ee5
p_GetBit6: 7 bytes 26 cycles ee6
p_GetBit7: 6 bytes 22 cycles ee7
p_GetBit8: 5 bytes 27 cycles ee8 e0
p_GetBit9: 6 bytes 38 cycles ee9 e1
p_GetBit10: 7 bytes 37 cycles ee10 e2
p_GetBit11: 7 bytes 37 cycles ee11 e3
p_GetBit12: 7 bytes 37 cycles ee12 e4
p_GetBit13: 7 bytes 37 cycles ee13 e5
p_GetBit14: 7 bytes 26 cycles ee14 e6
p_GetBit15: 5 bytes 20 cycles ee15 e7
Part III: "HL is the Ans of Axe" -Runer112Like BASIC, Axe also has an Ans-counterpart, a register called HL. It is written to every time you "load" an expression or constant. This yields a multitude of
small optimizations:
-Duplicate Arguments:
Text(0,0,PTR)
optimizes to
Text(0,,PTR)
See how that works? When Axe goes to parse the second argument of Text(, it would normally load the second argument into HL and then use that as the Y position of the text. However, in the optimized piece it doesn't find anything to load for the second argument, so it just uses what was already in HL, namely, 0.
This, extended, yields further optimization opportunities:
-More Omitted Arguments: Not only does loading arguments in commands write to HL, so does loading arguments in normal math and control structure operations. This means that we can omit lots of things, saving more space and speed. For example...
If A>3: 1?B
becomes
If A>3: ?B
But wait! What if you want to use this optimization on an equality check? Normally,
If A=1:?B would work, but if we wanted to optimize the equality check as well, it doesn't work quite right.
!If A-1 returns zero if true (or rather, if false), so you might then think that it's probably best to skip this particular optimization and go
!If A-1:1?B
However, although it may seem counterintuitive, the following is actually smaller and faster:
!If A-1:+1?B
Similarly,
If EXP: -1?B:End
(with a minus sign) is preferable to
If EXP:0?B:End
0?B: 0?A
becomes
0?B?A
note: When initializing variables that are 2 or less apart, it also yields further optimization to do this:
0?A: 1?B: 3?C
to
0?A+1?B+2?C
An example of HL abuse in a subroutine:
sub(LBL,EXP).....Lbl LBL: EXP*2?{L1}
to
sub(LBL,EXP)....Lbl LBL:*2?{L1]
-Here's another abuse of HL, creating the fastest and smallest loop structure (by Runer 112)
It will execute the loop n times, with A starting at n-1 and decreasing down to 0:
n
While
-1?A
;Code
A
End
Moving around a spriteCheck this out: again courtesy of Runer112: (comments by me)
:.SMILE
:[004466000000817E]->Pic1
:DiagnosticOff
:0->X->Y
:Repeat getkey(15)
:ClrDraw
:getKey(3)-getKey(2)+X //check getkeys for X. But we're not going to store the value just yet...
:!If +1 //First, check if X is negative one; that is, we'll check if X+1=0
:+1 //if so, add one to make the value in HL 1 (!If statements return 0 if true)
:End
:-1 //subtract 1 since we added 1 earlier
:Pt-On(min(,88)?X,getKey(1)-getKey(4)+Y+(=?1)min(,56)?Y,Pic1) //now we'll use the smallest value of either HL or 88 as the X position. For Y, we'll first handle keypresses, then add 1 (the boolean value of y=-1) if Y=-1. Then we'll use the smallest value of either HL or 56 as the Y position.
:DispGraph
:End
I can't post every possible abuse of HL here: there are so many ways to use it. Study these examples to see how they work and you can apply it in your own programs.
Much of HL abuse has been obsoleted by Axe's peephole optimizer. However, you can still make use of some of these tricks. While the weird syntax is not always necessary, remember to arrange your code and operations such that the peephole optimizer can target it. I left the section somewhat intact so that you can look through it to understand the concept; the optimizer is useless if your code isn’t written intelligently.
Part IV: SubroutinesSubroutines probably are capable of saving the most space than any other type of optimization. And they are easy to use, too. There are only a couple rules of thumb to follow:
1. If you are rewriting a section of code more than once (and it is more than just one command), best put it into a subroutine.
2. If you aren't, then don't put it in a subroutine.
Routine calls are around 3-5 bytes, and an additional 3 bytes for every argument you load. Using the recursive subroutine feature that saves your arguments (sub(LBL
r....) costs you 15 bytes per argument.
Tail Call Optimization This is lesser known and lesser used, but it is still worth sticking here I guess: If you are calling subroutines from a subroutine as the last line, then you can use a process known as Tail Call Optimization to change this:
Lbl A : Stuff : sub(B) : Return
to this:
Lbl A: Stuff : Goto B
The Return is not needed because you end up "stealing" subroutine B's return instead of having to return to A and then return again to the main program.
Part V: Speed over SizeThese are all optimizations for aggressive speed gain at the expense of size.
Short Circuit Evaluation:In most cases, it yields a (rather impressive) speed gain to change
If EXP1 and EXP2
to
If EXP1 : If EXP2
Make sure to have EXP1 (the outside If block) be the expression that is less likely to be true to gain the most speed.
Factoring Constants (and powers of 2); When multiplying and dividing by multiples of powers of 2, it yields an optimization to factor the number first.
EXP*96
to
EXP*32*3
With powers of two, this is only advantageous with *64 and *128 (Check the auto-optimizations list for the list of numbers that this applies to).
Part VI: MiscellaneousAll the stuff I couldn't fit into another category...
- If you need to draw horizontal or vertical lines, use the Rect() function with a width or height of 1 instead: Line(0,,20,0) to Rect(0,,20,1) This improves both speed and size
- Use ElseIf and DS<( if you can. That's why they're there. They are surprisingly useful.
- The Text() and Bitmap() commands are slow because they are TI's routines. Avoid them if possible. (I know, it's hard not to use Text(...but do your best ).
- Note about the Fill() command: If you are initializing less than 10 bytes of data, it is better to just use ->{}r 5 times instead.
Also, a little-known fact regarding printing text at constant coordinates:
[/list].Coordinate=Y*256+X
Text(30*256+20)
Text "Stuff"
Is 7 bytes smaller than:
Text(20,30,"Stuff")
The same applies for text drawn to the home screen:
.Coordinate=X*256+Y
Output(20*256+30)
Disp "Stuff"
Is smaller than:
Output(20,30,"Stuff")
- The new DispGraphClrDraw command is the same as DispGraph:ClrDraw but twice as fast. Use it.
However, remember each new command you use in Axe creates its own subroutine in your program. Thus, (for size reasons) it’s best to avoid using Rect if you've already used Line and Pt-On instead of Pt-Change, (and preinvert the sprite).
One major optimization that usually gets ignored is recycling large axe commands. Axe is not like BASIC and so each command needs its own subroutine to add to your program. For instance, lets say you use Pt-On() and Pt-Mask() in your code. Each one has to bring its own 100+ byte subroutine into your program. But you can probably get away with having just a Pt-Mask routine, recycling it to act like Pt-On by simply adding a 2nd layer to your sprite which is only 8 bytes extra instead of 100ish. Or you could do the opposite too and only have Pt-On() and Pt-Change() to manually change both buffers at once. This generally reduces the overall size of the program by a lot when you use the routines only once or very rarely in your code. And I don't mean calling it rarely, it could be the most used routine in your code, I just mean rarely appears in your source.
...And that's all I've got. Happy coding! Let me know if you have any questions, comments, additions, or corrections and I'll be happy to accommodate you
This post will also be updated with any new major optimizations found.
Also special thanks to Quigibo and Runer112 who are the main contributors of all these optimizations
Oh and....666th POST
[/list]