I think I just missed most of those, thanks for catching them.  The p_EQNX and p_NENX were intentionally left out though, because I need to rewrite my optimizer to handle negative shorts first.  As for your new optimizations, I'm not sure if I want to add those, because it would require me to write a lot more code for the parser since I just have all the math operations and optimizations macro'd in right now.  Checking for a variable would be a little tricky in that section.  But I'll try it out later if I have time.
Alright, these aren't that urgent anyways. Just trying to squeeze every last byte and cycle out of Axe programs. ;)
Wow that took a long time. But I hope the results will be worth it.
Quigibo, get out your reading glasses. ;)

(By the way, I haven't tested these myself, but the code looks solid. If you believe that any of these would not work or have any questions, tell me.)

Smaller nibble retrieval routines. 1 byte saved for reading from RAM, 3 bytes saved for reading from ROM.

Thanks to calc84maniac for reminding me that $0000-$7FFF is read-only!

Code: (Original routine: 18 bytes, ~72 cycles)
.db __Nib1End-$-1
rr h
rr l
ld a,(hl)
jr c,__Nib1Skip
and %00001111
ld l,a
ld h,0
Code: (Optimized routine: 17 bytes, ~105 cycles)
.db __Nib1End-$-1
xor a
rr h
rr l
ld b,(hl)
jr c,__Nib1Loop
ld (hl),b
ld l,a
ld h,0

Code: (Original routine: 18 bytes, ~68 cycles)
.db __Nib2End-$-1
srl h
rr l
ld a,(hl)
jr c,__Nib2Skip
and %00001111
ld l,a
ld h,0
Code: (Optimized routine: 15 bytes, ~77 cycles)
.db __Nib2End-$-1
xor a
srl h
rr l
jr c,__Nib2Skip
ld l,a
ld h,0

Smaller and faster nibble storage routine. 1 byte and ~17 cycles saved.

Code: (Original routine: 23 bytes, ~127 cycles)
.db __NibStoEnd-$-1
pop bc
pop de
push bc
rr h
rr l
ld b,(hl)
ex de,hl ;hl = byte ;de = addr
ld a,%11110000
jr c,__NibStoSkip
add hl,hl
add hl,hl
add hl,hl
add hl,hl
and b
or l
ld (de),a
Code: (Optimized routine: 22 bytes, ~110 cycles)
.db __NibStoEnd-$-1
pop bc
pop de
push bc
rr h
rr l
jr c,__NibStoHigh
ld a,e
ld a,e

Faster buffer inversion routine. 9951 cycles saved.

Code: (Original routine: 16 bytes, 38425 cycles)
.db __InvBuffEnd-1-$
ld hl,plotSScreen
ld bc,768
__InvBuffLoop:
ld a,(hl)
cpl ;invert the byte
ld (hl),a
inc hl
dec bc
ld a,b
or c
jr nz,__InvBuffLoop
ret
__InvBuffEnd:
Code: (Optimized routine: 16 bytes, 28474 cycles)
.db __InvBuffEnd-1-$
ld hl,plotSScreen
ld bc,3 ;b = 0 (256 passes of djnz) ;c = 3
__InvBuffLoop:
ld a,(hl)
cpl ;invert the byte
ld (hl),a
inc hl
djnz __InvBuffLoop
dec c
jr nz,__InvBuffLoop
ret
__InvBuffEnd:
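
As a rough cross-check of the two cycle totals quoted for the inversion routine, here is a small Python tally. It is only a sketch under stated assumptions: it uses the standard documented Z80 timings with no wait states, and it assumes each loop body is ld a,(hl) / cpl / ld (hl),a / inc hl and that each routine ends with ret. The function and constant names are made up for illustration.

```python
# Standard Z80 T-state counts (assumed: no wait states).
LD_RR_NN = 10                        # ld hl,nn / ld bc,nn
LD_A_HL = 7                          # ld a,(hl)
CPL = 4
LD_HL_A = 7                          # ld (hl),a
INC_RR = 6                           # inc hl
DEC_RR = 6                           # dec bc
LD_R_R = 4                           # ld a,b
OR_R = 4                             # or c
DEC_R = 4                            # dec c
JR_TAKEN, JR_NOT_TAKEN = 12, 7
DJNZ_TAKEN, DJNZ_NOT_TAKEN = 13, 8
RET = 10

BUFFER_SIZE = 768
BODY = LD_A_HL + CPL + LD_HL_A + INC_RR  # invert one byte, advance hl


def original_cycles():
    """16-bit counter in bc, tested with ld a,b / or c on every byte."""
    setup = 2 * LD_RR_NN
    per_byte = BODY + DEC_RR + LD_R_R + OR_R + JR_TAKEN
    # The very last jr falls through, so it costs 7 instead of 12.
    return setup + BUFFER_SIZE * per_byte - (JR_TAKEN - JR_NOT_TAKEN) + RET


def optimized_cycles():
    """b = 0 gives 256 djnz iterations per pass; c = 3 passes cover 768 bytes."""
    setup = 2 * LD_RR_NN
    inner = 256 * (BODY + DJNZ_TAKEN) - (DJNZ_TAKEN - DJNZ_NOT_TAKEN)
    outer = 3 * (inner + DEC_R + JR_TAKEN) - (JR_TAKEN - JR_NOT_TAKEN)
    return setup + outer + RET


if __name__ == "__main__":
    print(original_cycles(), optimized_cycles(),
          original_cycles() - optimized_cycles())
```

Under those assumptions the tally reproduces both totals from the post, and the difference comes out to the 9951 cycles claimed.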

You'll laugh at this... but I managed to save 4 cycles in the unarchive and archive routines. And only if the targeted variable doesn't exist. But hey, why not take all the savings you can get.

I think this works. It relies on the page number returned in b always being 0 for a RAM page and always being in the range of $01-$7F for a flash page.

Code: (Original routine: 18 bytes, a lot of cycles)
.db __UnarchiveEnd-1-$
ld hl,0
ret c
inc b
dec b
ret z
ld hl,1
Code: (Optimized routine: 18 bytes, a lot of-4 cycles)
.db __UnarchiveEnd-1-$
ld hl,0
ret c
dec b
ret m
inc b
ld hl,1

Code: (Original routine: 18 bytes, a lot of cycles)
.db __ArchiveEnd-1-$
ld hl,0
ret c
inc b
dec b
ret nz
ld hl,1
Code: (Optimized routine: 18 bytes, a lot of-4 cycles)
.db __ArchiveEnd-1-$
ld hl,0
ret c
dec b
ret p
inc b
ld hl,1
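
To see where the dec b / ret m (or ret p) test agrees with the original inc b / dec b / ret z (or ret nz) test, here is a small Python model of the two flag checks. It is a sketch, not the Axe source: it only models the 8-bit result and the sign and zero flags of inc and dec, and the helper names are hypothetical.

```python
def dec8(value):
    """8-bit decrement: returns (result, sign_flag, zero_flag)."""
    result = (value - 1) & 0xFF
    return result, bool(result & 0x80), result == 0


def inc8(value):
    """8-bit increment: returns (result, sign_flag, zero_flag)."""
    result = (value + 1) & 0xFF
    return result, bool(result & 0x80), result == 0


def original_says_ram(page):
    # inc b then dec b leaves b unchanged and sets Z exactly when b == 0.
    b, _, _ = inc8(page)
    _, _, zero = dec8(b)
    return zero


def optimized_says_ram(page):
    # dec b alone: the sign flag is set when the result has bit 7 set.
    _, sign, _ = dec8(page)
    return sign


def pages_that_disagree():
    """Page numbers for which the two tests give different answers."""
    return [p for p in range(256)
            if original_says_ram(p) != optimized_says_ram(p)]


if __name__ == "__main__":
    bad = pages_that_disagree()
    print(f"first disagreement: ${bad[0]:02X}, last: ${bad[-1]:02X}")
```

In this model the two tests only disagree for page numbers $81-$FF, so they match on 0 and on the whole $01-$7F range the optimization relies on.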

Smaller archived variable locating. 4 bytes saved.

Code: (Original routine: 55 bytes, a lot of cycles)
.db __GetArcEnd-1-$
push de
ld hl,0
jr c,__GetArcFail
ld a,(OP1)
cp ListObj
jr z,__GetArcName
cp ProgObj
jr z,__GetArcName
cp AppvarObj
jr z,__GetArcName
cp GroupObj
jr z,__GetArcName
ld hl,14
jr __GetArcDone
ld hl,9
add hl,de
ld d,0
inc hl
inc hl
add hl,de
ex de,hl
pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
ex de,hl
Code: (Optimized routine: 51 bytes, a lot of cycles)
.db __GetArcEnd-1-$
push de
ld hl,0
jr c,__GetArcFail
and %00011111
ld d,b
ld hl,__GetArcVarTypes
ld bc,__GetArcEnd-__GetArcVarTypes
ld b,d
ld hl,14
jr nz,__GetArcDone
ld l,9
add hl,de
ld d,0
inc e
inc e
add hl,de
ex de,hl
pop hl
ld (hl),e
inc hl
ld (hl),d
inc hl
ld (hl),b
ex de,hl
.db ListObj,ProgObj,AppvarObj,GroupObj

Smaller 8-bit get bit routine. 1 byte saved.

Code: (Original routine: 13 bytes, ~110 cycles)
.db 13
ld a,e
and %00000111
inc a
ld b,a
ld a,l
add a,a
djnz __GetBitLoop
ld h,b
ld l,b
rl l
Code: (Optimized routine: 12 bytes, ~152 cycles)
.db 12
ld a,e
and %00000111
inc a
ld b,a
xor a
ld h,a
add hl,hl
djnz __GetBitLoop
ld l,h
ld h,a

As long as the low byte of vx_SptBuff is at most $F8: faster sprite flipping routines. 16 cycles saved each.

Code: (Original routine: 13 bytes, 338 cycles)
.db __FlipVEnd-1-$
ex de,hl
ld hl,vx_SptBuff+8
ld b,8
__FlipVLoop:
dec hl
ld a,(de)
ld (hl),a
inc de
djnz __FlipVLoop
ret
__FlipVEnd:
Code: (Optimized routine: 13 bytes, 322 cycles)
.db __FlipVEnd-1-$
ex de,hl
ld hl,vx_SptBuff+8
ld b,8
__FlipVLoop:
dec l
ld a,(de)
ld (hl),a
inc de
djnz __FlipVLoop
ret
__FlipVEnd:

Code: (Original routine: 21 bytes, 1907 cycles)
.db __FlipHEnd-1-$
ld de,vx_SptBuff
push de
ld b,8
ld c,(hl)
ld a,1
rr c
jr nc,__FlipHLoop2
ld (de),a
inc hl
inc de
djnz __FlipHLoop1
pop hl
Code: (Optimized routine: 21 bytes, 1891 cycles)
.db __FlipHEnd-1-$
ld de,vx_SptBuff
push de
ld b,8
ld c,(hl)
ld a,1
rr c
jr nc,__FlipHLoop2
ld (de),a
inc hl
inc e
djnz __FlipHLoop1
pop hl
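
The condition on the low byte of vx_SptBuff for the flipping routines can be explored with a small Python model that compares 8-bit and 16-bit pointer stepping. This is a sketch under assumptions: the page $8200 is an arbitrary stand-in (the real address of vx_SptBuff is not given here), the function names are hypothetical, and the two walks mirror the listings as posted (a pre-decrement walk starting at vx_SptBuff+8 for the vertical flip, a post-increment walk starting at vx_SptBuff for the horizontal flip).

```python
def walk_up(base, use_8bit):
    """Addresses written when starting at base and post-incrementing 8 times."""
    addr, written = base, []
    for _ in range(8):
        written.append(addr)
        if use_8bit:   # inc e: only the low byte changes
            addr = (addr & 0xFF00) | ((addr + 1) & 0xFF)
        else:          # inc de: full 16-bit increment
            addr = (addr + 1) & 0xFFFF
    return written


def walk_down(base, use_8bit):
    """Addresses written when starting at base+8 and pre-decrementing 8 times."""
    addr, written = (base + 8) & 0xFFFF, []
    for _ in range(8):
        if use_8bit:   # dec l: only the low byte changes
            addr = (addr & 0xFF00) | ((addr - 1) & 0xFF)
        else:          # dec hl: full 16-bit decrement
            addr = (addr - 1) & 0xFFFF
        written.append(addr)
    return written


def max_safe_low_byte(walk, page=0x8200):
    """Largest low byte for which 8-bit stepping hits the same addresses."""
    safe = [low for low in range(256)
            if walk(page | low, True) == walk(page | low, False)]
    return max(safe)


if __name__ == "__main__":
    print(f"post-increment walk: ${max_safe_low_byte(walk_up):02X}")
    print(f"pre-decrement walk:  ${max_safe_low_byte(walk_down):02X}")
```

In this model the post-increment walk is safe up to a low byte of $F8, as stated. The pre-decrement walk starting from vx_SptBuff+8 only matches up to $F7, because with a low byte of $F8 the starting pointer already sits on the next page, so the vertical flip may need one byte more headroom than the post says. That is worth checking against the real buffer address.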

Smaller and faster sprite rotating routines. 2 bytes smaller and 166 cycles faster. These also save 16 cycles from relying on the low byte of vx_SptBuff being at most $F8.

Code: (Original routine: 22 bytes, 2874 cycles)
.db __RotCEnd-1-$
ex de,hl
ld hl,vx_SptBuff
ld c,8
push hl
ld b,8
ld a,(de)
rr (hl)
inc hl
djnz __RotCLoop2
pop hl
inc de
dec c
jr nz,__RotCLoop1
Code: (Optimized routine: 20 bytes, 2708 cycles)
.db __RotCEnd-1-$
ex de,hl
ld c,8+1
ld hl,vx_SptBuff
dec c
ret z
ld b,8
ld a,(de)
rr (hl)
inc l
djnz __RotCLoop2
inc de
jr __RotCLoop1

Code: (Original routine: 22 bytes, 2874 cycles)
.db __RotCCEnd-1-$
ex de,hl
ld hl,vx_SptBuff
ld c,8
push hl
ld b,8
ld a,(de)
rl (hl)
inc hl
djnz __RotCCLoop2
pop hl
inc de
dec c
jr nz,__RotCCLoop1
Code: (Optimized routine: 20 bytes, 2708 cycles)
.db __RotCCEnd-1-$
ex de,hl
ld c,8+1
ld hl,vx_SptBuff
dec c
ret z
ld b,8
ld a,(de)
rl (hl)
inc l
djnz __RotCCLoop2
inc de
jr __RotCCLoop1

That's all I have for now. I think I got just about everything I could possibly find, but I might have some more later. And if you want all the routines in one file, I uploaded them all here.
Your nibble read routine that reads from archive will fail, because it's ROM. In that case, it might work to do something like this:
Code:
.db __Nib2End-$-1
xor a
srl h
rr l
jr c,__Nib2Skip
ld l,a
ld h,0
 O.O How do you do this? You're a madman!

So I've never really used or known what the rrd and rld instructions did.  I thought they were some of those obscure instructions like daa, which they are, but I guess there are situations where you can use them, like with daa in the hex routine.  Awesome job there!  I can't believe I missed the push pop thing in the sprite rotation ones, that was embarrassing...  I really do like the getcalc ones, but they use an inline self-reference.  I think because there's only one, I can easily replace it, but I'll have to make sure.   The inversion one is excellent as well.  I could have sworn I tried that same method before but couldn't get it the same size.

However, these are the concerns I have:  First, the sprite rotation commands, why did you move the ret to the middle of the routine?  It looks like that's just going to add more cycles since a conditional jr takes the same amount of cycles as a regular jr anyway.  Next, is it really a safe assumption that all ROM pages are between $7F and $FF for all current models and potentially future models?  And lastly, are you sure trying to modify ROM (unsuccessfully) has no potential side effects on things like flags and registers?
« Last Edit: January 06, 2011, 04:50:25 pm by Quigibo »
However, these are the concerns I have:  First, the sprite rotation commands, why did you move the ret to the middle of the routine?  It looks like that's just going to add more cycles since a conditional jr takes the same amount of cycles as a regular jr anyway.

Yeah, I'm not really sure why I did that. Feel free to initialize c to 8 and then decrement and check c at the end with a conditional jump instead.

Next, is it really a safe assumption that all ROM pages are between $7F and $FF for all current models and potentially future models?

$01 through $7F are all ROM pages, and $80-$87 are all RAM pages (at least for the calculators that have all those RAM pages), so it would make sense that $80 and up is RAM. But feel free to leave this optimization out anyways, it only saves 4 cycles part of the time.

And lastly, are you sure trying to modify ROM (unsuccessfully) has no potential side effects on things like flags and registers?

After a quick test, yes, rrd and rld still set the a register correctly even when hl points to a byte in ROM.
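
For anyone else who hasn't used rrd and rld, here is a small Python model of what they do to a and the byte at (hl), following the documented Z80 semantics. It is only a sketch: the writable parameter is a hypothetical stand-in for ROM silently ignoring the write, and flags are not modeled, so it says nothing about flag behavior on real hardware.

```python
def rrd(a, mem, writable=True):
    """Model of rrd. Returns (new_a, new_mem).

    Low nibble of (hl) goes to the low nibble of a, the old low nibble
    of a goes to the high nibble of (hl), and the old high nibble of
    (hl) moves down to its low nibble.
    """
    new_a = (a & 0xF0) | (mem & 0x0F)
    new_mem = ((a & 0x0F) << 4) | (mem >> 4)
    return new_a, (new_mem if writable else mem)


def rld(a, mem, writable=True):
    """Model of rld. Returns (new_a, new_mem).

    High nibble of (hl) goes to the low nibble of a, the old low nibble
    of (hl) moves up to its high nibble, and the old low nibble of a
    goes to the low nibble of (hl).
    """
    new_a = (a & 0xF0) | (mem >> 4)
    new_mem = ((mem << 4) & 0xF0) | (a & 0x0F)
    return new_a, (new_mem if writable else mem)


if __name__ == "__main__":
    # With a cleared first (xor a), a ends up holding exactly one nibble.
    print(rrd(0x00, 0xAB))                   # RAM: byte is rotated
    print(rrd(0x00, 0xAB, writable=False))   # ROM: byte stays put
```

In the model, a comes out the same whether or not the write to (hl) lands, which matches the test result above. It also shows why the RAM version of the nibble read saves the byte in b and stores it back afterwards: in RAM the write does land and would otherwise leave the byte rotated.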

EDIT: By the way Quigibo, the reason I was looking at every source routine for Axe is because I'm documenting the size and (at least approximate) speed of every Axe command. If I finish it, would you want to bundle it with future Axe releases? If not I'd probably post it somewhere on the forums anyway, so people could still see it.
