Author Topic: Assembly Programmers - Help Axe Optimize!  (Read 154159 times)


Offline z80man

« Reply #180 on: May 18, 2011, 04:49:41 am »
Would it be possible to have normal and full compilation modes? If a program runs at normal speed by default, the code that changes the clock back to normal at every DispGraph wouldn't be needed. This could also be used by 83+ owners, so that when they compile a program, Full commands are ignored.

List of stuff I need to do before September:
1. Finish the Emulator of the Casio Prizm (in active development)
2. Finish the SH3 asm IDE/assembler/linker program (in active development)
3. Create a partial Java virtual machine  for the Prizm (not started)
4. Create Axe for the Prizm with an Axe legacy mode (in planning phase)
5. Develop a large set of C and asm libraries for the Prizm (some progress)
6. Create an emulator of the 83+ for the Prizm (not started)
7. Create a well polished game that showcases the ability of the Casio Prizm (not started)

Offline Compynerd255

« Reply #181 on: May 18, 2011, 10:17:46 am »
Quote from: z80man on May 18, 2011, 04:49:41 am
Would it be possible to have normal and full compilation modes? If a program runs at normal speed by default, the code that changes the clock back to normal at every DispGraph wouldn't be needed. This could also be used by 83+ owners, so that when they compile a program, Full commands are ignored.
Full commands are already ignored in 83 Plus mode. In fact, if a Full command is run on an 83 Plus, HL returns zero and nothing happens. But I have seen the size of the Full command, and yes, I think it would be a good idea to have some option where Full and Normal are skipped.
The Slime: On Hold, preparing to add dynamic tiles

Axe Eitrix: DONE

Betafreak Games: Fun filled games for XBox and PC. Check it out at http://www.betafreak.com



Offline Runer112

« Reply #182 on: May 18, 2011, 11:03:11 am »
z80man, the code for saving the CPU clock speed actually serves another useful purpose. It also saves the interrupt status, because the display routines have to disable interrupts to safely run. Because of the new display safety routine, if a program were designed to run at 15MHz, all it would need is one Full at the start and one copy of this safety routine. Removing any CPU speed instructions would only save about 20 bytes, not a very large savings. But I guess I still see the merits of your suggestion for people who want to crazily super-optimize. :P

Also, why is this in the Help Axe Optimize thread? It sounds more like a feature request.
« Last Edit: May 18, 2011, 11:06:16 am by Runer112 »

Offline Runer112

« Reply #183 on: May 21, 2011, 07:15:18 pm »
I'm back, and this time with screen update routine optimizations! I've used 71 cycles as the target minimum delay between port outputs, because that's the number that you said worked for your calculator with a bad LCD driver. If you want these routines to target 72 or 73 cycles between port outputs instead, that's an easy modification for the first two routines. The grayscale routines could be harder.


EDIT: If you're going to use any of these, make sure to actually test them first.

EDIT 2: I previously didn't have an optimization for p_DispGS, but after more closely inspecting the routine, now I do!




p_FastCopy: 1 byte and 1548 cycles saved.

Code: (Original code: 46 bytes, ~59389 cycles with 3-cycle LCD port delay, excluding p_Safety)
p_FastCopy:
.db __FastCopyEnd-1-$
FastCopy:
ld hl,plotSScreen
ld a,$80
out ($10),a
ld c,-$0C
call $0000 ;Safety
push af
__FastCopyAgain:
ld b,64 ;7
ld a,c ;4
add a,$2C ;7
out ($10),a ;11
ld a,(hl) ;7 (waste)
inc de ;6 (waste)
__FastCopyLoop:
push af ;11 (waste)
pop af ;10 (waste)
ld de,12 ;10
ld a,(hl) ;7
add hl,de ;11
out ($11),a ;11
djnz __FastCopyLoop ;13/8
ld de,1-(12*64) ;10
add hl,de ;11
inc c ;4
jr nz,__FastCopyAgain ;12
__FastCopyRestore:
pop af
out ($20),a
ret c
ei
ret
__FastCopyEnd:
.db rp_Ans,__FastCopyEnd-__FastCopyAgain+3
     
   
Code: (Optimized code: 45 bytes, ~57841 cycles with 3-cycle LCD port delay, excluding p_Safety)
p_FastCopy:
.db __FastCopyEnd-1-$
ld hl,plotSScreen
ld c,-$0C
ld a,$80
out ($10),a ;??cc into
call $0000
push af
__FastCopyAgain:
push hl
ld a,c
add a,$2C
out ($10),a ;many cc into, 73cc loop
inc de ;waste
ld b,64
__FastCopyLoop:
ld a,(hl) ;waste
inc de ;waste
dec de ;waste
ld de,12
ld a,(hl)
add hl,de
out ($11),a ;71cc into, 71cc loop
djnz __FastCopyLoop
pop hl
inc hl
inc c
jr nz,__FastCopyAgain
__FastCopyRestore:
pop af
out ($20),a
ret c
ei
ret
__FastCopyEnd:
.db rp_Ans,__FastCopyEnd-p_FastCopy-11
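
The 1548-cycle figure for p_FastCopy can be sanity-checked by adding up T-states for the two loop bodies. The sketch below is only a model: the names `CYCLES`, `cost` and `loop_total` are made up for illustration, the timings are the standard documented Z80 values, and anything that is identical in both versions (setup, teardown, the LCD port delay on each `out`, and the cheaper not-taken cost of `djnz`/`jr`) is left out because it cancels in the difference.

```python
# T-state costs for the Z80 instructions used in the two p_FastCopy loops.
# djnz and jr are listed with their "branch taken" cost.
CYCLES = {
    "push af": 11, "pop af": 10, "push hl": 11, "pop hl": 10,
    "ld de,nn": 10, "ld a,(hl)": 7, "add hl,de": 11,
    "out (n),a": 11, "djnz": 13, "ld b,n": 7, "ld a,c": 4,
    "add a,n": 7, "inc de": 6, "dec de": 6, "inc hl": 6,
    "inc c": 4, "jr nz": 12,
}

def cost(instrs):
    """Sum the T-states of a list of instruction mnemonics."""
    return sum(CYCLES[i] for i in instrs)

ORIGINAL = {
    "pre": ["ld b,n", "ld a,c", "add a,n", "out (n),a", "ld a,(hl)", "inc de"],
    "inner": ["push af", "pop af", "ld de,nn", "ld a,(hl)",
              "add hl,de", "out (n),a", "djnz"],
    "post": ["ld de,nn", "add hl,de", "inc c", "jr nz"],
}
OPTIMIZED = {
    "pre": ["push hl", "ld a,c", "add a,n", "out (n),a", "inc de", "ld b,n"],
    "inner": ["ld a,(hl)", "inc de", "dec de", "ld de,nn", "ld a,(hl)",
              "add hl,de", "out (n),a", "djnz"],
    "post": ["pop hl", "inc hl", "inc c", "jr nz"],
}

COLUMNS, ROWS = 12, 64  # 12 byte-wide columns of 64 rows each

def loop_total(routine):
    """Cycles spent in the column loop, ignoring costs shared by both versions."""
    per_column = (cost(routine["pre"])
                  + ROWS * cost(routine["inner"])
                  + cost(routine["post"]))
    return COLUMNS * per_column

inner_original = cost(ORIGINAL["inner"])
inner_optimized = cost(OPTIMIZED["inner"])
saved = loop_total(ORIGINAL) - loop_total(OPTIMIZED)
print(inner_original, inner_optimized, saved)
```

Under these assumptions the inner loop drops from 73 to 71 cycles per byte, which accounts for 1536 of the saved cycles; the remaining 12 come from the per-column overhead.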
   




p_DrawAndClr: 2 bytes and 1548 cycles saved. Pretty much the same optimization as above.

Code: (Original code: 47 bytes, ~59389 cycles with 3-cycle LCD port delay, excluding p_Safety)
p_DrawAndClr:
.db __DrawAndClrEnd-1-$
ld hl,plotSScreen
ld a,$80
out ($10),a
ld c,-$0C
call $0000 ;Safety
push af
__DrawAndClrAgain:
ld b,64 ;7
ld a,c ;4
add a,$2C ;7
out ($10),a ;11
ld a,(hl) ;7 (waste)
inc de ;6 (waste)
__DrawAndClrLoop:
ld de,12 ;10
ld a,(hl) ;7
ld (hl),d ;7
ld (hl),d ;7 (waste)
ld (hl),d ;7 (waste)
add hl,de ;11
out ($11),a ;11
djnz __DrawAndClrLoop ;13/8
ld de,1-(12*64) ;10
add hl,de ;11
inc c ;4
jr nz,__DrawAndClrAgain ;12
__DrawAndClrRestore:
pop af
out ($20),a
ret c
ei
ret
__DrawAndClrEnd:
.db rp_Ans,__DrawAndClrEnd-__DrawAndClrAgain+3
     
   
Code: (Optimized code: 45 bytes, ~57841 cycles with 3-cycle LCD port delay, excluding p_Safety)
p_DrawAndClr:
.db __DrawAndClrEnd-1-$
ld hl,plotSScreen
ld c,-$0C
ld a,$80
out ($10),a ;??cc into
call $0000
push af
__DrawAndClrAgain:
push hl
ld a,c
add a,$2C
out ($10),a ;many cc into, 73cc loop
inc de ;waste
ld b,64
__DrawAndClrLoop:
inc de ;waste
dec de ;waste
ld de,12
ld a,(hl)
ld (hl),d
add hl,de
out ($11),a ;71cc into, 71cc loop
djnz __DrawAndClrLoop
pop hl
inc hl
inc c
jr nz,__DrawAndClrAgain
__DrawAndClrRestore:
pop af
out ($20),a
ret c
ei
ret
__DrawAndClrEnd:
.db rp_Ans,__DrawAndClrEnd-p_DrawAndClr-11
   




p_DispGS: ~4847 cycles faster! This is more of a bug fix than an optimization; the old routine copied 13 columns!

Code: (Original code: 66 bytes, ~63507 cycles with 3-cycle LCD port delay, excluding p_Safety)
p_DispGS:
.db __DispGSEnd-1-$
call $0000
push af
ld a,$80
out ($10),a
ld (OP2),sp
ld hl,flags+asm_Flag2
rr (hl)
sbc a,a
xor %01010101
ld (hl),a
ld c,a
ld l,appbackupscreen&$ff-1
ld sp,plotSScreen-appbackupscreen
__DispGSNext:
ld a,l ;4
ld b,64 ;7
add a,$21-(appbackupscreen&$ff);7
out ($10),a ;11 Into loop: 59 T-states
inc l ;4
ld h,appbackupscreen>>8 ;7
ld de,appbackupscreen-plotSScreen+12;11
__DispGSLoop:
ld a,(hl) ;7 Loop: 61 T-states
rrc c ;8
and c ;4
add hl,sp ;11
or (hl) ;7
out ($11),a ;11
add hl,de ;11
djnz __DispGSLoop ;13/8 Next Loop: 60 T-states
ld a,l ;4
cp 12+(appbackupscreen&$ff);7
jr nz,__DispGSNext ;12
__DispGSDone:
ld sp,(OP2)
__DispGSRestore:
pop af
out ($20),a
ret c
ei
ret
__DispGSEnd:
.db rp_Ans,__DispGSEnd-p_DispGS-2
     
   
Code: (Optimized code: 66 bytes, ~58660 cycles with 3-cycle LCD port delay, excluding p_Safety)
p_DispGS:
.db __DispGSEnd-1-$
call $0000
push af
ld a,$80
out ($10),a ;many cc into
ld (OP2),sp
ld hl,flags+asm_Flag2
rr (hl)
sbc a,a
xor %01010101
ld (hl),a
ld c,a
ld l,appbackupscreen&$ff-1
ld sp,plotSScreen-appbackupscreen
__DispGSNext:
ld a,l
ld b,64
add a,$20-(appbackupscreen&$ff-1)
out ($10),a ;113cc into, 71cc loop
inc hl
ld h,appbackupscreen>>8
ld de,appbackupscreen-plotSScreen+12
__DispGSLoop:
ld a,(hl)
rrc c
and c
add hl,sp
or (hl)
out ($11),a ;71cc into, 72cc loop
add hl,de
djnz __DispGSLoop
ld a,l
cp 12+(appbackupscreen&$ff-1)
jr nz,__DispGSNext
__DispGSDone:
ld sp,(OP2)
__DispGSRestore:
pop af
out ($20),a
ret c
ei
ret
__DispGSEnd:
.db rp_Ans,__DispGSEnd-p_DispGS-2
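
The 13-column behaviour of the old p_DispGS comes down to the loop's exit test. The sketch below replays only the outer loop counter in Python. It assumes `appbackupscreen&$ff-1` is meant as "low byte minus one", and it uses $72 as the low byte (appBackUpScreen is $9872 on the TI-83 Plus); the function name `columns_sent` is made up for illustration, and the result does not depend on the exact low byte.

```python
LOW = 0x72  # assumed low byte of appBackUpScreen ($9872)

def columns_sent(start_l, column_base, exit_value):
    """Replay the outer loop and return the column commands written to port $10."""
    l = start_l & 0xFF
    sent = []
    while True:
        sent.append((l + column_base) & 0xFF)  # ld a,l / add a,base / out ($10),a
        l = (l + 1) & 0xFF                      # inc l (inc hl in the optimized code)
        # The 64 inner iterations move HL by 64*12 = $300, so L is unchanged here.
        if l == (exit_value & 0xFF):            # cp exit_value / jr nz,__DispGSNext
            return sent

# Original: add a,$21-low ... cp 12+low
original = columns_sent(LOW - 1, 0x21 - LOW, 12 + LOW)
# Optimized: add a,$20-(low-1) ... cp 12+(low-1)
optimized = columns_sent(LOW - 1, 0x20 - (LOW - 1), 12 + (LOW - 1))

print([hex(c) for c in original])
print([hex(c) for c in optimized])
```

With these inputs the original sends column commands $20 through $2C (13 columns), while the optimized version stops after $2B (12 columns), which is consistent with the bug described above.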
   




p_Disp4Lvl: 3 bytes larger, but ~7693 cycles faster! Extra bonuses: updates in row-major order for cleaner grayscale AND works with any pair of buffers! :w00t:

Code: (Original code: 79 bytes, ~78433 cycles with 3-cycle LCD port delay, excluding p_Safety)
p_Disp4Lvl:
.db __Disp4LvlEnd-1-$
call $0000
push af
ld (OP2+2),sp
ld a,$80
out ($10),a
ld sp,appbackupscreen - plotSScreen
ld e,(plotSScreen-appbackupscreen+12)&$ff
ld c,-$0C
ex af,af'
ld a,%11011011
ld hl,flags+asm_flag2
inc (hl)
jr z,__Disp4Lvlskip
add a,a
ld b,(hl)
inc b
jr z,__Disp4Lvlskip
rlca
ld (hl),-2
__Disp4Lvlskip:
ld l,plotSScreen&$ff-1
ex af,af'
__Disp4Lvlentry:
ld a,c
add a,$2C
ld h,plotSScreen>>8
inc l
ld b,64
out ($10),a
__Disp4Lvlloop:
ld a,(hl)
add hl,sp
xor (hl)
ex af,af'
cp e
rra
ld d,a
ex af,af'
and d
xor (hl)
out ($11),a
ld d,(plotSScreen-appbackupscreen+12)>>8
add hl,de
djnz __Disp4Lvlloop
inc c
jr nz,__Disp4Lvlentry
__Disp4LvlDone:
ld sp,(OP2+2)
__Disp4LvlRestore:
pop af
out ($20),a
ret c
ei
ret
__Disp4LvlEnd:
.db rp_Ans,__Disp4LvlEnd-p_Disp4Lvl-2
     
   
Code: (Optimized code: 82 bytes, ~70740 cycles with 3-cycle LCD port delay, excluding p_Safety)
p_Disp4Lvl:
.db __Disp4LvlEnd-1-$
ld hl,appBackUpScreen
ld de,plotSScreen
call $0000
push af
push hl
ld a,$07
out ($10),a ;many cc into
ld a,%11011011
or a
ld hl,flags+asm_flag2
inc (hl)
jr z,__Disp4LvlSkip
rra
ld b,(hl)
inc b
jr z,__Disp4LvlSkip
rra
ld (hl),-2
__Disp4LvlSkip:
ex af,af'
pop hl
ld a,$80
__Disp4LvlEntry:
out ($10),a ;76+cc into, 71cc loop
push af
ex (sp),hl ;waste
ex (sp),hl ;waste
nop ;waste
ld a,$20
out ($10),a ;71cc into
ld b,12
__Disp4LvlLoop:
ex af,af'
rra
ld c,a
ex af,af'
ld a,(de)
xor (hl)
and c
xor (hl)
inc de
inc hl
out ($11),a ;71cc into, 77cc loop
djnz __Disp4LvlLoop
inc bc ;waste
ex af,af'
rra
ex af,af'
pop af
inc a
bit 6,a
jr z,__Disp4LvlEntry
__Disp4LvlDone:
ld a,$05
out ($10),a ;73cc into
pop af
out ($20),a
ret c
ei
ret
__Disp4LvlEnd:
.db rp_Ans,__Disp4LvlEnd-p_Disp4Lvl-8
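
One way to read the mask handling in the optimized p_Disp4Lvl: `rra` rotates A through the carry flag, so the mask is effectively a 9-bit word holding a repeating 3-bit pattern. The Python sketch below models just that rotation. It is an interpretation of the listing, not a statement of the author's intent, and the helper names `rra` and `masks_for_row` are made up for illustration.

```python
def rra(a, carry):
    """Z80 RRA: rotate A right through carry. Returns (new_a, new_carry)."""
    return ((carry << 7) | (a >> 1)) & 0xFF, a & 1

def masks_for_row(a, carry, bytes_per_row=12):
    """Return the 12 masks used for one row plus the state left for the next row."""
    masks = []
    for _ in range(bytes_per_row):
        a, carry = rra(a, carry)   # ex af,af' / rra / ld c,a / ex af,af'
        masks.append(a)
    a, carry = rra(a, carry)       # the extra rra after djnz
    return masks, a, carry

# ld a,%11011011 / or a  (or a clears carry)
start_a, start_carry = 0b11011011, 0
row0, next_a, next_carry = masks_for_row(start_a, start_carry)
row1, _, _ = masks_for_row(next_a, next_carry)

print([f"{m:08b}" for m in row0[:3]])
print([f"{m:08b}" for m in row1[:3]])
```

In this model the masks cycle through three values, so roughly two of every three bits come from one buffer and one from the other. Each row performs 13 rotations, which shifts the pattern by one position from row to row.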
   





Also, I'm going to bump a few old optimization suggestions. They may have been skipped because Axe couldn't support them at the time, but in case it can now or in the near future, I'll make sure they aren't forgotten. And I'll throw in a new optimization that would also require an upgraded command parser.


And as a side note, would it be possible to reformat DS<() so that the variable is reinitialized to its maximum value at the End? That way, 3 bytes could be saved by having both the zero and not zero conditions using the same store command. For example:

Code:
ld hl,(var)
dec hl
ld a,h
or l
jp nz,DS_End
;Code inside statement goes here
ld hl,max
DS_End:
ld (var),hl


Now that you have absolute jumps implemented:

Code: (Original code)

p_Exchange:
.db 13
pop de
ex (sp),hl
pop bc
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
ld a,b
or c
jr nz,$-8

   
Code: (Optimized code)

p_Exchange:
.db 12
pop de
ex (sp),hl
pop bc
__ExchangeLoop:
ld a,(de)
ldi
dec hl
ld (hl),a
inc hl
jp pe,__ExchangeLoop ;pe is right: ldi leaves P/V set while bc is nonzero
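
On the `pe` versus `po` choice: `ldi` leaves the P/V flag set while BC is still nonzero after the decrement, and the following `dec hl`, `ld (hl),a` and `inc hl` do not touch flags, so `jp pe` keeps looping until BC reaches zero. The Python model below is a sketch of that behaviour; `exchange` and its arguments are illustrative names, with `memory` standing in for RAM.

```python
def exchange(memory, hl, de, bc):
    """Model of the optimized p_Exchange loop: swap bc bytes at hl and de."""
    while True:
        a = memory[de]            # ld a,(de)
        memory[de] = memory[hl]   # ldi: (de) <- (hl), then hl++, de++, bc--
        hl += 1
        de += 1
        bc = (bc - 1) & 0xFFFF
        pv = bc != 0              # ldi leaves P/V set while bc is nonzero
        memory[hl - 1] = a        # dec hl / ld (hl),a / inc hl (flags untouched)
        if not pv:                # jp pe,__ExchangeLoop loops only while P/V is set
            return memory

buf = bytearray(b"abcXYZ")
exchange(buf, hl=0, de=3, bc=3)
print(bytes(buf))
```

As in the original `ld a,b` / `or c` version, entering with BC equal to zero would wrap around and run 65536 times.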





Code: (Original code: 27 bytes, ~220 cycles)
p_DKeyVar:
.db __DKeyVarEnd-1-$
dec l
ld a,l
rra
rra
rra
and %00000111
inc a
ld b,a
ld a,%01111111
rlca
djnz $-1
ld h,a
ld a,l
and %00000111
inc a
ld b,a
ld a,%10000000
rlca
djnz $-1
ld l,a
ret
__DKeyVarEnd:
   
   
Code: (Optimized code: 23 bytes, ~259 cycles)
p_DKeyVar:
.db __DKeyVarEnd-1-$
ld c,l
dec c
ld a,c
rra
rra
rra
call __DKeyVarMask
cpl
ld h,a
ld a,c
__DKeyVarMask:
and %00000111
inc a
ld b,a
ld a,%10000000
rlca
djnz $-1
ld l,a
ret
__DKeyVarEnd:
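
The smaller p_DKeyVar relies on the fact that rotating %01111111 gives the complement of rotating %10000000 by the same amount, so one shared mask routine plus a `cpl` can replace the two separate loops. A quick Python check of that equivalence over every 8-bit input is sketched below. The function names are made up for illustration, and the three `rra` instructions followed by `and %00000111` are modelled as "take bits 3 to 5".

```python
def rlca(value, count):
    """Rotate an 8-bit value left circularly, count times."""
    count %= 8
    return ((value << count) | (value >> (8 - count))) & 0xFF

def dkeyvar_original(l):
    l = (l - 1) & 0xFF                                 # dec l
    h = rlca(0b01111111, ((l >> 3) & 7) + 1)           # group mask
    low = rlca(0b10000000, (l & 7) + 1)                # key mask
    return h, low

def dkeyvar_optimized(l):
    c = (l - 1) & 0xFF                                 # ld c,l / dec c
    h = rlca(0b10000000, ((c >> 3) & 7) + 1) ^ 0xFF    # call __DKeyVarMask, then cpl
    low = rlca(0b10000000, (c & 7) + 1)                # fall through into __DKeyVarMask
    return h, low

same = all(dkeyvar_original(l) == dkeyvar_optimized(l) for l in range(256))
print(same, dkeyvar_optimized(1))
```

This only checks the returned H and L values; it says nothing about the cycle counts quoted in the headers.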


   
« Last Edit: May 22, 2011, 03:27:45 pm by Runer112 »

Offline Munchor

« Reply #184 on: May 22, 2011, 05:56:31 am »
So, 3 level greyscale routine optimized? Great Runer, I don't really get what you did there, but those look like lots of ASM optimizations for Axe, very nice job!

Offline Quigibo

« Reply #185 on: May 22, 2011, 05:14:53 pm »
Thanks for those! :) I was actually thinking of changing the 3 level grayscale to be row major too so I'm probably going to be using a whole new routine for that.
___Axe_Parser___
Today the calculator, tomorrow the world!

Offline Runer112

« Reply #186 on: May 22, 2011, 07:13:24 pm »
Row-major 3-level grayscale? I already made one of those, it was just slower and a bit larger than the current routine so I didn't think you'd want it. It's 4 bytes larger and about 8000 cycles slower than the column-major routine I posted above, but here it is:

Code: (70 bytes, ~66541 cycles with 3-cycle LCD port delay, excluding p_Safety)
p_DispGS:
.db __DispGSEnd-1-$
ld hl,plotSScreen
ld de,appBackUpScreen
call $0000
push af
ld a,$07
out ($10),a ;many cc into
ld a,(flags+asm_Flag2)
rra
sbc a,a
xor %01010101
ld (flags+asm_Flag2),a
ld c,a
ld a,$80
__DispGSNext:
push af
out ($10),a ;74cc into, 71cc loop
ex (sp),hl ;waste
ex (sp),hl ;waste
rrc c
ld b,12
ld a,$20
out ($10),a ;71cc into
push af ;waste
pop af ;waste
__DispGSLoop:
inc bc ;waste
dec c ;waste
ld a,(de)
and c
or (hl)
inc de
inc hl
out ($11),a ;72cc into, 71cc loop
ld a,(hl) ;waste
djnz __DispGSLoop
pop af
inc a
bit 6,a
jr z,__DispGSNext
__DispGSDone:
pop af
out ($20),a
ld a,$05
out ($10),a ;83cc into
ret c
ei
ret
__DispGSEnd:
.db rp_Ans,__DispGSEnd-p_DispGS-8
« Last Edit: May 22, 2011, 07:15:56 pm by Runer112 »

Offline Quigibo

« Reply #187 on: May 23, 2011, 06:59:41 am »
Runer, I was testing out your 4 level grayscale routine and I'm getting some really weird results. It's literally showing black and white lines across the otherwise perfect gray that shift about every second. When I add some pause between displays, it looks better, but still not as good as the column major routine. The emulator makes it look fine, so I can't upload a screenshot, but try this upload on hardware.


Offline Runer112

« Reply #188 on: May 23, 2011, 11:22:29 am »
Is 71 cycles between outputs still too fast for your calculator perhaps? I don't notice any strange black and white lines on my calculator, which has a good LCD driver. The only lines I saw were the diagonal light and dark stripes that are inherent in any unsynced grayscale routine. Can you perhaps elaborate on the problem?
« Last Edit: May 23, 2011, 12:43:22 pm by Runer112 »

Offline Quigibo

« Reply #189 on: May 24, 2011, 09:22:01 pm »
Sorry for the late response, this is how the program looks on my calculator (see attachment).  I'm pretty sure it has nothing to do with the delay being too short since it looks better when I add more pause between each DispGraph.
« Last Edit: May 24, 2011, 09:25:13 pm by Quigibo »

Offline Runer112

« Reply #190 on: May 24, 2011, 10:36:16 pm »
Hmm I see what you mean... perhaps the mask rotation/logic is wrong in my routine? I wanted to send the program you posted to wabbitemu so I could debug the mask and logic computations at each step, but wabbitemu refuses to accept your program... And I don't see anything obviously wrong with my mask or logic.
« Last Edit: May 24, 2011, 10:39:41 pm by Runer112 »

Offline Quigibo

« Reply #191 on: May 24, 2011, 10:46:57 pm »
Really?  Wabbitemu accepts it fine for me... It was just your routine replacing the current 4 level grayscale.  Random squares were drawn to each buffer and then it did a "Repeat getkey(15):DispGraphrr:End" loop.

Offline Runer112

« Reply #192 on: May 24, 2011, 11:00:47 pm »
No worries, I compiled it as an Axiom, sent that to wabbitemu, and am debugging it as we speak. I'm also asking the master of grayscale (thepenguin77) if he sees anything obviously wrong with it.

EDIT: Quigibo, try putting something like a Pause 10-12 in your loop. I think the new routine is actually going too fast and is running at 1.5x your LCD's refresh rate, near-perfectly skipping every third frame.
« Last Edit: May 24, 2011, 11:17:37 pm by Runer112 »

Offline Quigibo

« Reply #193 on: May 25, 2011, 01:18:47 am »
Yeah, like I said, if I add pause to the loop, it looks better (pause 11 is perfect gray).  But my point is that I'm not sure anymore if having the routines row major is actually an advantage because the old routine produced just as perfect a gray when it was in sync yet it didn't have graphical problems when it wasn't.  This routine seems less resilient to changes in the pause time.

Maybe my calculator is just the exception though, I don't know what the statistics are for what percentage of calculators have bad LCDs.  Regardless, I don't think any single routine can produce perfect gray across ALL calculator models.  So I think I should just forget about column/row ordering and just stick with the smallest, fastest routines.

Offline Runer112

« Reply #194 on: May 25, 2011, 11:15:39 am »
The only case when this new 4-level grayscale routine should have noticeable problems is when it's alone in a loop with no other delay, simply because it's faster than the old routine. In most real situations that it would be used in, there would probably be much larger delay between display calls, like rendering a frame, in which case you want the routine to be as fast as possible.

I would leave the 3-level grayscale routine in column-major but use the 4-level grayscale routine in row-major order, because they are both faster than their alternatives. Although this will allow for 4-level grayscale to draw from arbitrary buffers and not allow for it in 3-level grayscale, I wouldn't worry too much about the incongruity. Being able to call 4-level grayscale with arbitrary buffer arguments would be quite awesome. ;D
« Last Edit: May 25, 2011, 11:16:41 am by Runer112 »