Author Topic: ASM Optimized routines  (Read 108294 times)

0 Members and 1 Guest are viewing this topic.

Offline chickendude

  • LV8 Addict (Next: 1000)
  • ********
  • Posts: 817
  • Rating: +90/-1
  • Pro-Riot Squad
    • View Profile
Re: ASM Optimized routines
« Reply #60 on: October 01, 2012, 09:21:55 am »
Not an optimized "routine", or really even useful, but i was trying to figure out how to add new instructions (and not macros) to spasm, and finally got this:
Code: [Select]
.addinstr add ahl,de 00CE19 3 NOP 1 ;add hl,de \ adc a,0
.addinstr add ahl,cde 8919 2 NOP 1 ;add hl,de \ adc a,c
Now whenever i want to do 24-bit addition (well, kinda) i can just use the new add ahl,de and add ahl,cde instructions ;)

EDIT: oops, my bbcode leaves a lot to be desired...
« Last Edit: October 01, 2012, 09:53:19 am by chickendude »

Offline ben_g

  • Hey cool I can set a custom title now :)
  • LV9 Veteran (Next: 1337)
  • *********
  • Posts: 1002
  • Rating: +125/-4
  • Asm noob
    • View Profile
    • Our programmer's team: GameCommandoSquad
Re: ASM Optimized routines
« Reply #61 on: October 01, 2012, 03:02:45 pm »
so does it work like this?
Code: [Select]
.addinstr <instruction> <asm code for the instruction, written in hex> <amounth of bytes of asm code> NOP 1
My projects
 - The Lost Survivors (Unreal Engine) ACTIVE [GameCommandoSquad main project]
 - Oxo, with single-calc multiplayer and AI (axe) RELEASED (screenshot) (topic)
 - An android version of oxo (java)  ACTIVE
 - A 3D collision detection library (axe) RELEASED! (topic)(screenshot)(more recent screenshot)(screenshot of it being used in a tilemapper)
Spoiler For inactive:
- A first person shooter with a polygon-based 3d engine. (z80, will probably be recoded in axe using GLib) ON HOLD (screenshot)
 - A java MORPG. (pc) DEEP COMA(read more)(screenshot)
 - a minecraft game in axe DEAD (source code available)
 - a 3D racing game (axe) ON HOLD (outdated screenshot of asm version)

This signature was last updated on 20/04/2015 and may be outdated

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #62 on: October 01, 2012, 06:35:31 pm »
In most cases, yes, that is what you will want to do. If you open up the file TASM80, you will see the format of other such instructions.

Offline chickendude

  • LV8 Addict (Next: 1000)
  • ********
  • Posts: 817
  • Rating: +90/-1
  • Pro-Riot Squad
    • View Profile
Re: ASM Optimized routines
« Reply #63 on: October 01, 2012, 06:37:09 pm »
More or less, though you have to watch out for the ordering of the hex (it's backwards):
00CE19
19 = add hl,de
CE00 = adc a,0

And maybe there's a limit of four bytes/command, 'cuz nothing over four bytes seems to work. The command will take up that many bytes, but everything past four bytes seems to turn into random bytes.

You can use other values, too, like an asterisk can be used to pull in data, for example from tasmtabs.htm (NOTOUCH=NOP):
Code: [Select]
                                          EXAMPLE         EXAMPLE
INSTRUCTION DEFINITION                    SOURCE          OBJECT
-------------------------------------------------------------------
XYZ  *      FF   3  NOTOUCH 1             xyz 1234h       FF 34 12
XYZ  *      FF   2  NOTOUCH 1             xyz 1234h       FF 34
ZYX  *      FE   3  SWAP    1             zyx 1234h       FE 12 34
ZYX  *      FE   3  R2      1             zyx $+4         FE 01 00
ABC  *,*    FD   3  COMBINE 1             abc 45h,67h     FD 45 67
ABC  *,*    FD   3  CSWAP   1             abc 45h,67h     FD 67 45
ADD  A,#*   FC   2  NOTOUCH 1             add A,#'B'      FC 42
RET  ""     FB   1  NOTOUCH 1             ret             FB
LD   IX,*   21DD 4  NOTOUCH 1             ld  IX,1234h    DD 21 34 12
LD   IX,*   21DD 4  NOTOUCH 1 1 0         ld  IX,1234h    DD 21 68 24
LD   IX,*   21DD 4  NOTOUCH 1 0 1         ld  IX,1234h    DD 21 35 12
LD   IX,*   21DD 4  NOTOUCH 1 1 1         ld  IX,1234h    DD 21 69 24
LD   IX,*   21DD 4  NOTOUCH 1 8 12        ld  IX,34h      DD 21 12 34

Offline chickendude

  • LV8 Addict (Next: 1000)
  • ********
  • Posts: 817
  • Rating: +90/-1
  • Pro-Riot Squad
    • View Profile
Re: ASM Optimized routines
« Reply #64 on: February 01, 2013, 05:08:48 am »
Here's a quick filled rectangle routine i wrote the other night. It probably doesn't belong here since i don't think it's that optimized, but it should be smaller and faster than Jason Kovacs' routine posted on the previous page. As always, do whatever you want with the code and please don't give me credit! It takes the starting x/y coordinates in de, and the width and height in pixels in bc. rectangle_filled_solid will make a solid black rectangle, rectangle_filled_xor will xor the pixels within the rectangle's area.
EDIT: Added Xeda's optimisations. (see below for a slightly faster version)
Code: [Select]
#ifdef TI83P
GBUF_LSB = $40
GBUF_MSB = $93
#else
GBUF_LSB = $29
GBUF_MSB = $8E
#endif

;b = # columns
;c = # rows
;d = starting x
;e = starting y
rectangle_filled_xor:
ld a,$AE ;xor (hl)
jr rectangle_filled2
rectangle_filled_solid:
ld a,$B6 ;or (hl)
rectangle_filled2:
push de
push bc
ld (or_xor),a ;use smc for xor/solid fill
ld a,d ;starting x
and $7 ;what bit do we start on?
ex af,af'
ld a,d ;starting x
ld l,e ;ld hl,e
ld h,0 ; ..
ld d,h ;set d = 0
add hl,de ;starting y * 12
add hl,de ;x3
add hl,hl ;x6
add hl,hl ;x12
rra ;a = x coord / 8
rra ;
rra ;
and %00011111 ;starting x/8 (starting byte in gbuf)
add a,GBUF_LSB
ld e,a ;
ld d,GBUF_MSB ;
add hl,de ;hl = offset in gbuf
ex af,af' ;carry should be reset and z affected from and $7
ld e,a
ld a,%10000000
jr z,$+6
rra
dec e
jr nz,$-2
ld d,a ;starting bit to draw
rectangle_loop_y:
push bc
push hl
rectangle_loop_x:
ld e,a ;save a (overwritten with or (hl))
or_xor = $
or (hl) ;smc will modify this to or/xor
ld (hl),a
ld a,e ;recall a
rrca ;rotate a to draw the next bit
jr nc,$+3
inc hl
djnz rectangle_loop_x
pop hl ;hl = first column in gbuf row
ld c,12 ;b = 0, bc = 12
add hl,bc ;move down to next row
pop bc ;restore b (# columns)
ld a,d ;restore a (starting bit to draw)
dec c
jr nz,rectangle_loop_y
rectangle_end:
pop bc
pop de
ret

Here's a little example of how you could use the routine to make a text box:
Code: [Select]
start:
ld de,$041B
ld bc,$1103
call draw_box2
call ionFastCopy
bcall _getkey
ret
 
draw_box2:
call rectangle_filled_solid
inc d
inc e
dec b
dec b
dec c
dec c
call rectangle_filled_xor
ret

EDIT: Small optimization
« Last Edit: February 04, 2013, 11:43:07 am by chickendude »

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #65 on: February 01, 2013, 01:39:49 pm »
Awesome, I definitely like rectangel codes! I am glad you preserve the coordinates, too. Here are a few optimisations that I see without examining the code too much.
Code: [Select]
  ld hl,or_xor
  ld (hl),a   ;use smc for xor/solid fill
To save a byte and a clock cycle, this can simply be:
Code: [Select]
  ld (or_xor),a
Code: [Select]
   rra    ;a = x coord / 8
   rra    ;
   rra    ;
   and %00011111 ;starting x/8 (starting byte in gbuf)
   ld e,a   ;
   add hl,de  ;add x
   ld de,gbuf  ;
   add hl,de  ;hl = offset in gbuf
To save 7 clock cycles and 0 bytes:
Code: [Select]
   rra    ;a = x coord / 8
   rra    ;
   rra    ;
   and %00011111 ;starting x/8 (starting byte in gbuf)
   or 40h   ;gbuf=9340h, 40h = %01000000, so this won't cause problems
   ld e,a   ;
   ld d,93h
   add hl,de  ;add x


Code: [Select]
  pop hl
  pop bc    ;restore b (# columns)
  pop af
  ld de,12
  add hl,de   ;move down to next row
Since B is already 0 at the start of this code, save a byte and 3 t-states:
Code: [Select]
  pop hl
  ld c,12
  add hl,bc
  pop bc    ;restore b (# columns)
  pop af
And now you can actually modify that whole routine to be a little faster:
Code: [Select]
  ld d,a
rectangle_loop_y:
  push bc
  push hl
rectangle_loop_x:
   ld e,a
or_xor = $
   or (hl)   ;smc will modify this to or/xor
   ld (hl),a
   ld a,e
   rrca
    jr nc,$+3
    inc hl
   djnz rectangle_loop_x
  pop hl
  ld c,12
  add hl,bc
  pop bc
  ld a,d
  dec c
   jr nz,rectangle_loop_y
You get rid of a push af / pop af in the loop which takes 21 t-states and replace it with a ld a,d which is 4 t-states.

This is definitely smaller than my routine. In mine, I make a 12-byte pattern to OR or XOR onto the screen o_o
One of the tricks I use is to find the first byte non-zero byte of the pattern:
Code: [Select]
;D is the x coordinate, here
;HL points to the pattern buffer (12 bytes)
     ld a,d
     and 7
     ld a,80h
     ld b,a
     jr z,$+5
       rrca
       djnz $-1
;A is the mask if it were for a pixel
;B is 0
     add a,a
     dec a
     ld (hl),a
So for example, if D and 7 = 3, you would get %00010000. 'add a,a' turns that to %00100000 and then dec a → %00011111.
And if you are worried about ' D and 7 ' = 0, you get %10000000→%00000000→%11111111 which is correct, or if 'D and 7' returns 7, %00000001→%00000010→%00000001.

Offline chickendude

  • LV8 Addict (Next: 1000)
  • ********
  • Posts: 817
  • Rating: +90/-1
  • Pro-Riot Squad
    • View Profile
Re: ASM Optimized routines
« Reply #66 on: February 03, 2013, 04:11:16 am »
That's a cool trick with the add a,a \ dec a. My first go at it only handled multiples of 8 and looked like this:
Code: [Select]
;b = # columns
;c = # rows
;d = starting x
;e = starting y
rectangle_filled2:
ld a,d ;a = starting x coord
ld l,e ;ld hl,e
ld h,0 ; ..
ld d,h ;set d = 0
add hl,de ;starting y * 12
add hl,de ;x3
add hl,hl ;x6
add hl,hl ;x12
rra ;a = x coord / 8
rra ;
rra ;
and %00011111 ;starting x/8
ld e,a ;
add hl,de ;add x
ld de,gbuf
add hl,de ;offset in gbuf
ld a,b ;b = no columns
rra
rra
rra
and %00011111 ;no. columns / 8
ld b,a
ld a,12
sub b
ld e,a
ld d,0
rectangle_loop_y:
push bc
rectangle_loop_x:
ld (hl),$FF
inc hl
djnz rectangle_loop_x
pop bc ;restore c (# columns)
add hl,de ;move down to next row
dec c
jr nz,rectangle_loop_y
rectangle_end:
ret
Much smaller and faster, but certain cases (for example, a rectangle less than 8 pixels wide, 2 byte rectangles, etc.) seemed like they were going to bump up the size quite a bit and what I wanted was to write it more for size than speed since it's not being used in any time-critical areas of the game (mostly in menus and the battle engine). How do you handle cases where, say, a rectangle doesn't fill an entire byte?

The gbuf bit is also pretty creative (i guess you could also just add a,$40) but unfortunately wouldn't work out since it's being written for the 83 as well, though i guess a simple define would do. Yep:
Code: [Select]
#ifdef TI83P
GBUF_LSB = $40
GBUF_MSB = $93

.org progstart-2
.db $bb,$6d
#else
GBUF_LSB = $29
GBUF_MSB = $8E
.org progstart
#endif
;...
rra ;a = x coord / 8
rra ;
rra ;
and %00011111 ;starting x/8 (starting byte in gbuf)
add a,GBUF_LSB
ld e,a ;
ld d,GBUF_MSB ;
add hl,de ;hl = offset in gbuf
...works just fine for the 83/+ :)

I hope deeph doesn't mind, if you feel like checking their project out it's over at yAronet: http://www.yaronet.com/posts.php?s=153983

EDIT: Here's a version that draws vertically (here b = height and c = width) which seems to be slightly faster (and's the same size):
Code: [Select]
#ifdef TI83P
GBUF_LSB = $40
GBUF_MSB = $93
#else
GBUF_LSB = $29
GBUF_MSB = $8E
#endif

;b = # rows
;c = # columns
;d = starting x
;e = starting y
rectangle_filled_xor:
ld a,$AE ;xor (hl)
jr rectangle_filled2
rectangle_filled_solid:
ld a,$B6 ;or (hl)
rectangle_filled2:
push de
push bc
ld (or_xor),a ;use smc for xor/solid fill
ld a,d ;starting x
and $7 ;what bit do we start on?
ex af,af'
ld a,d ;starting x
ld l,e ;ld hl,e
ld h,0 ; ..
ld d,h ;set d = 0
add hl,de ;starting y * 12
add hl,de ;x3
add hl,hl ;x6
add hl,hl ;x12
rra ;a = x coord / 8
rra ;
rra ;
and %00011111 ;starting x/8 (starting byte in gbuf)
add a,GBUF_LSB
ld e,a ;
ld d,GBUF_MSB ;
add hl,de ;hl = offset in gbuf
ex af,af' ;carry should be reset and z affected from and $7
ld d,a
ld a,%10000000
jr z,$+6
rra
dec d
jr nz,$-2
ld e,12
rectangle_loop_x:
push af
push bc
push hl
ld c,a
rectangle_loop_y:
or_xor = $
or (hl) ;smc will modify this to or/xor
ld (hl),a
ld a,c
add hl,de
djnz rectangle_loop_y
pop hl
pop bc
pop af
rrca
jr nc,$+3
inc hl
dec c
jr nz,rectangle_loop_x
rectangle_end:
pop bc
pop de
ret

EDIT: Small optimization :)
« Last Edit: February 04, 2013, 11:41:59 am by chickendude »

Offline NanoWar

  • LV4 Regular (Next: 200)
  • ****
  • Posts: 140
  • Rating: +18/-6
    • View Profile
Re: ASM Optimized routines
« Reply #67 on: February 03, 2013, 12:39:46 pm »
So these work only with rectangle width W where W modulo 8 = 0 and W >= 8 ?

Any chance of a generic rectangle routine?

Offline chickendude

  • LV8 Addict (Next: 1000)
  • ********
  • Posts: 817
  • Rating: +90/-1
  • Pro-Riot Squad
    • View Profile
Re: ASM Optimized routines
« Reply #68 on: February 03, 2013, 11:51:44 pm »
They should work with any rectangle with a width/height greater than 0 (there's no error checking for invalid dimensions and no clipping). I can see what i can do for a generic rectangle routine, what i've been doing is just calling the routine twice:
Code: [Select]
start:
ld de,$0204
ld bc,$2121
call draw_box2
call ionFastCopy
bcall _getkey
ret

draw_box2:
call rectangle_filled_solid
inc d
inc e
dec b
dec b
dec c
dec c
call rectangle_filled_xor
ret

EDIT: What i meant was that the first routine i wrote only handled multiples of 8, the other routines (the one in post #64 and at the bottom of #66) can handle any rectangle with valid (non-zero) dimensions. Well, as long as they don't go off-screen!

EDIT2: Here's a simple normal rectangle routine. I just converted the other routine to not draw the inside of the rectangle, so it won't erase what's inside the rectangle. It's currently at 90 bytes and not terribly fast, slightly slower than drawing a filled rectangle. By default a rectangle needs to be at least 3x2 (that is X*Y), but if you don't mind adding 8 bytes (altogether 99 bytes) you can use it to draw horizontal and vertical lines and even plot pixels ;) You can just uncomment those lines.
Code: [Select]
;b = # rows
;c = # columns
;d = starting x
;e = starting y
rectangle:
push de
push bc
ld a,d ;starting x
and $7 ;what bit do we start on?
ex af,af'
ld a,d ;starting x
ld l,e ;ld hl,e
ld h,0 ; ..
ld d,h ;set d = 0
add hl,de ;starting y * 12
add hl,de ;x3
add hl,hl ;x6
add hl,hl ;x12
rra ;a = x coord / 8
rra ;
rra ;
and %00011111 ;starting x/8 (starting byte in gbuf)
add a,GBUF_LSB
ld e,a ;
ld d,GBUF_MSB ;
add hl,de ;hl = offset in gbuf
ex af,af' ;carry should be reset and z affected from and $7
ld e,a
ld a,%10000000
jr z,$+6
rra
dec e
jr nz,$-2
dec b ;you could adjust your input to take care of this, ie b = width-2, c = height-1 and save 3 bytes here
dec b ;we draw the ends separately
dec c ;we'll draw the last line at the end
ld d,a ;starting bit to draw
;d = starting bit
rectangle_loop_y:
push bc
push hl
call rectangle_loop_x
pop hl ;hl = first column in gbuf row
ld c,12 ;b = 0, bc = 12
add hl,b ;move down to next row
pop bc ;restore b (# columns)
xor a
; cp c ; # UNCOMMENT TO ALLOW LINES WITH A HEIGHT OF 1 PIXEL
; jr z,rectangle_end ; #
ld (ld_hl),a ;change ld (hl),a to nop
ld a,d ;restore a (starting bit to draw)
dec c
jr nz,rectangle_loop_y
ld a,$77 ;ld (hl),a
ld (ld_hl),a ;return nop to ld (hl),a
ld a,d
call rectangle_loop_x
rectangle_end:
pop bc
pop de
ret

rectangle_loop_x:
or (hl) ;first bit
ld (hl),a
; inc b ; # UNCOMMENT TO ALLOW LINES WITH A WIDTH OF 1 PIXEL
; ret z ; #
; dec b ; #
; jr z,rectangle_loop_x_end ; #
ld a,d
rectangle_loop_x_inner:
rrca ;rotate a to draw the next bit
jr nc,$+3
inc hl
ld e,a ;save a (overwritten with or (hl))
or (hl) ;smc will modify this to or/xor
ld_hl = $
ld (hl),a
ld a,e ;recall a
djnz rectangle_loop_x_inner
rectangle_loop_x_end:
rrca ;rotate a to draw the next bit
jr nc,$+3
inc hl
or (hl) ;last bit
ld (hl),a
ret
If you don't care about the coordinates, you can just remove the push/pops and get another 4 bytes. And i've got another idea which should be faster (essentially working with bytes instead of pixels), though it might take up a bit more space. If you're interested let me know and i'll try to work on it some.

As always, use and abuse without restrictions :)

EDIT3: Small optimization
« Last Edit: February 04, 2013, 11:39:56 am by chickendude »

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #69 on: February 04, 2013, 07:23:40 am »
EDIT Five years later, I found my code didn't work that well :( At the bottom of this post is a working routine, but not ideal.

Hmm, for 'rectangle_loop_x' and 'rectangle_loop_x_end' you have 'or (hl)' which I think needs to be SMC'd.

I decided to try and optimise my code a bit and I managed to optimise it for speed in most cases and size. Unfortunately, the code is still much bigger than chickendude's at 133 bytes x.x To give an indicator of speed (6MHz):

my old routine, 9x18 rectangle : 1926 times in two seconds
my new routine, 9x18 : 5668 times in two seconds
chickendude's old routine : 1287 times in two seconds
chickendude's new routine : untested .__.

So for cases where you don't need crazy speed, chickendude's is still very fast and much smaller (73 bytes versus 133).
Code: [Select]
Rectangle_or:
ld a,$B6
jr Rectangle
Rectangle_xor:
ld a,$AE
Rectangle:
;    DE = (x,y)
;    BC = (height,width)
ld (smc_logic1),a
ld (smc_logic0),a
push de
push bc
push bc
ld a,d
call ComputeByte
ld (smc_FirstByte),a
ex (sp),hl
ld a,d
neg
and 7
ld b,a
ld a,l
sub b
ex (sp),hl
ld c,a
call ComputeByte
cpl
ld (smc_LastByte),a
ld b,a
    sra c \ sra c \ sra c
    inc c
; ld a,c
; and %11111000
; rra \ rra \ rra
;   inc a
; ld c,a

ld a,d
ld d,0
ld h,d
ld l,e
add hl,hl
add hl,de
add hl,hl
add hl,hl
and %11111000
rra \ rra \ rra
add a,GBUF_LSB
ld e,a
ld d,GBUF_MSB
add hl,de
;HL points to the first byte
pop de
;D is the height
;E is the number of bytes wide
inc c
dec c
jr nz,RectOverLoop-1
ld a,(smc_FirstByte)
and b
ld c,a ;value
ld b,d ;height
ld de,12
ld a,c
smc_logic1:
or (hl)
ld (hl),a
add hl,de
djnz $-4
pop bc
pop de
ret
ld e,c
RectOverLoop:
ld b,e
ld c,12
.db 3Eh       ;start of ld a,*
smc_FirstByte:
.db 0
RectLoop:
smc_logic0:
or (hl)
ld (hl),a
inc hl
dec c
    jr z,ExitLoop
ld a,-1
djnz RectLoop
;    jp p,$+4
;    dec b
.db 3Eh       ;start of ld a,*
smc_LastByte:
.db 0
or (hl)
ld (hl),a
add hl,bc
ExitLoop:
    dec d
jr nz,RectOverLoop
pop bc
pop de
ret

ComputeByte:
and 7
ld b,a
ld a,80h
jr z,$+5
  rrca
  djnz $-1
add a,a
dec a
ret

Necro-edit:
This code is working, and it performs clipping. Using the above benchmarks, this code draws approximately 4240 of those rectangle per two seconds at 6MHz on an actual calc. The downside is the size :( The core routine is 119 bytes, and the code for XOR rectangle is an additional 61 bytes, OR rectangle is also 61 bytes, and Erase rectangle is 66 bytes. They do not preserve registers. However, they were made to run in an app, so they don't rely on SMC-- If they did use SMC, it could probably fit all of the routines in just under 200 bytes.
Code: [Select]
;;
;;rectXOR
;;rectOR
;;rectErase
;;  (B,C) = (x,y) signed
;;  (D,E) = (w,h) unsigned
;;  HL points to buf

rectXOR:
    push hl
    call rectSub
    pop ix
    ret nc
    ex de,hl
    add ix,de
    ex de,hl
    push ix
    pop hl
    dec b
    jp m,xorrect0
    inc b
xor_rect_loop:
    push bc
    push hl
    ld a,(hl) \ xor d \ ld (hl),a \ inc hl
    dec b
    jr z,$+8
    ld a,(hl) \ cpl \ ld (hl),a \ inc hl \ djnz $-4
    ld a,(hl) \ xor e \ ld (hl),a
    ld bc,12
    pop hl
    add hl,bc
    pop bc
    dec c
    jr nz,xor_rect_loop
    ret
xorrect0:
    ld a,d
    and e
    ld b,c
    ld c,a
    ld de,12
    ld a,c
    xor (hl)
    ld (hl),a
    add hl,de
    djnz $-4
    ret
rectErase:
    push hl
    call rectSub
    pop ix
    ret nc
    ex de,hl
    add ix,de
    ex de,hl
    push ix
    pop hl
    ld a,d
    cpl
    ld d,a
    ld a,e
    cpl
    ld e,a
    dec b
    jp m,eraserect0
    inc b
erase_rect_loop:
    push bc
    push hl
    ld a,(hl) \ and d \ ld (hl),a \ inc hl
    dec b
    jr z,$+7
    xor a
    ld (hl),a \ inc hl \ djnz $-2
    ld a,(hl) \ and e \ ld (hl),a
    ld bc,12
    pop hl
    add hl,bc
    pop bc
    dec c
    jr nz,erase_rect_loop
    ret
eraserect0:
    ld a,d
    xor e
    ld b,c
    ld c,a
    ld de,12
    ld a,c
    and (hl)
    ld (hl),a
    add hl,de
    djnz $-4
    ret
rectOR:
    push hl
    call rectSub
    pop ix
    ret nc
    ex de,hl
    add ix,de
    ex de,hl
    push ix
    pop hl
    dec b
    jp m,orrect0
    inc b
or_rect_loop:
    push bc
    push hl
    ld a,(hl) \ or d \ ld (hl),a \ inc hl
    dec b
    jr z,$+8
    ld c,-1
    ld (hl),c \ inc hl \ djnz $-2
    ld a,(hl) \ or e \ ld (hl),a
    ld bc,12
    pop hl
    add hl,bc
    pop bc
    dec c
    jr nz,or_rect_loop
    ret
orrect0:
    ld a,d
    and e
    ld b,c
    ld c,a
    ld de,12
    ld a,c
    or (hl)
    ld (hl),a
    add hl,de
    djnz $-4
    ret
rectsub:
;(B,C) = (x,y) signed
;(D,E) = (w,h) unsigned
;Output:
;  Start Mask  D
;  End Mask    E
;  Byte width  B
;  Height      C
;  buf offset  HL
  bit 7,b
  jr z,+_
  ;Here, b is negative, so we have to add width to x.
  ;If the result is still negative, the entire box is out of bounds, so return
  ;otherwise, set width=newvalue,b=0
  ld a,d
  add a,b
  ret nc
  ld d,a
  ld b,0
_:
  bit 7,c
  jr z,+_
  ld a,e
  add a,c
  ret nc
  ld e,a
  ld c,0
_:
;We have clipped all negative areas.
;Now we need to verify that (x,y) are on the screen.
;If they aren't, then the whole rectangle is off-screen so no need to draw.
  ld a,b
  cp 96
  ret nc
  ld a,c
  cp 64
  ret nc
;Let's also verfiy that height and width are non-zero:
  ld a,d
  or a
  ret z
  ld a,e
  or a
  ret z
;Now we need to clip the width and height to be in-bounds
  add a,c
  cp 65
  jr c,+_
  ;Here we need to set e=64-c
  ld a,64
  sub c
  ld e,a
_:
  ld a,d
  add a,b
  cp 97
  jr c,+_
  ;Here we need to set d=96-b
  ld a,96
  sub b
  ld d,a
_:
;B is starting X
;C is starting Y
;D is width
;E is height

  push bc
  ld a,b
  and 7
  ld b,a
  ld a,-1
  jr z,+_
  rra \ djnz $-1
_:
  inc a
  cpl
  ld h,a    ;start mask

  ld a,b
  add a,d
  and 7
  ld b,a
  ld a,-1
  jr z,+_
  rra \ djnz $-1
_:
  inc a
  ld l,a  ;end mask
  ex (sp),hl
  ;stack now holds DE
  ;HL is now the coordinates
  ;B=0, C=height
  ;A,BC are free to destroy
  ld a,h
  ld h,b
  add hl,hl
  add hl,bc
  add hl,hl
  add hl,hl
  ld b,a
  rrca
  rrca
  rrca
  and 31
  add a,l
  ld l,a
  jr nc,$+3
  inc h

;B is the starting x, D is width
;Only A,B,D,E are available
  ld a,b
  add a,d
  and $F8
  ld d,a

  ld a,b
  and $F8
  ld b,a
  ld a,d
  sub b
  rrca
  rrca
  rrca
  ld b,a
  ld c,e
  pop de
  scf
  ret

Offline chickendude

  • LV8 Addict (Next: 1000)
  • ********
  • Posts: 817
  • Rating: +90/-1
  • Pro-Riot Squad
    • View Profile
Re: ASM Optimized routines
« Reply #70 on: February 04, 2013, 11:37:01 am »
The other routine is just a plain rectangle (non-filled) drawing routine. I don't bother SMC'ing the or (hl), just the ld (hl),a. What it does is draw the first and last pixels of the line/border outside of the main loop then either draws (ld (hl),a) all pixels in between or skips (nop's) them all. It's a bit larger so unless the speed is that important you might be better off drawing two filled rectangles with the other routine, one OR'd and a slightly smaller one XOR'd.

Also, i just realized i don't need the "or a" after "ex af,af'" since the flags should still be preserved from the "and $7", so we can delete that.

It's interesting to see how our syntax/style differs even on bits of code that do exactly the same thing :)

And if you don't mind sacrificing just a few clocks (altogether maybe between 8-64), maybe you could try this and save 2 bytes:
Code: [Select]
ComputeByte:
and 7
ld b,a
ld a,$FF
ret z
  srl a ;or or a \ rra
  djnz $-2
ret

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #71 on: February 04, 2013, 01:03:55 pm »
It's interesting to see how our syntax/style differs even on bits of code that do exactly the same thing :)
I noticed that, too, it is kind of neat. I noticed that I mask A and then shift, whereas you shift, then mask. I've done both, but I typically do the former.
And if you don't mind sacrificing just a few clocks (altogether maybe between 8-64), maybe you could try this and save 2 bytes:
Nice, I like that! When B is not 0, that is actually anywhere from 6 cycles faster to 18 cycles slower, so I think that is a great trade-off.
By adding a byte, I can make your routine anywhere from 0 cycles to 24 cycles faster when b>0
Code: [Select]
ComputeByte:
 neg   ; or 'cpl \ inc a
 and 7
 ld b,a
 ld a,FFh
 ret z
 add a,a
 djnz $-1
 ret
This happens to be faster than my original and smaller by one byte. Still, it is larger than yours :/

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #72 on: March 07, 2013, 04:14:15 pm »
EDIT3: (continued below) Smaller, faster version below.
I wanted to optimise an old routine in Grammer to compute the GCD of two 16-bit numbers. I came up with this:
Code: [Select]
GCDDE_HL:
;Inputs:
;     HL,DE are the two values
;Outputs:
;     B is 0
;     DE is 0
;     HL is the GCD
;     C is not changed
;Destroys:
;     A
     xor a              ;AF     4
     ld b,a             ;47     4
CheckMax:               ;
     sbc hl,de          ;ED52   15n
     jr z,AdjustGCD     ;28**   12n-5
     jr nc,ParityCheck  ;30**   12n-5
    xor a              ;AF     4(n-a)
     sub l              ;95     4(n-a)
     ld l,a             ;6F     4(n-a)
     sbc a,a            ;9F     4(n-a)
     sub h              ;94     4(n-a)
     ld h,a             ;67     4(n-a)
     ex de,hl
     jp CheckMax        ;C3**** 10(n-a)
ParityCheck:            ;
     bit 0,e            ;CB**   8a
     jr nz,DE_Odd       ;20**   12a-5b
     bit 0,l            ;CB**   8b
     jr z,BothEven      ;28**   12b-5c
     rr d               ;CB**   8(n-a-b-c)
     rr e               ;CB**   8(n-a-b-c)
     jp CheckMax        ;C3**** 10(n-a-b-c)
BothEven:               ;
     inc b              ;04     4c
     rr d \ rr e        ;       16c
     rr h \ rr l        ;       16c
     jp CheckMax        ;       10c
DE_Odd:                 ;
     bit 0,l            ;       8b
     jr nz,BothOdd      ;       12b-5d
     rr h \ rr l        ;       16(n-a-b-d)
     jp CheckMax        ;       10(n-a-b-d)
BothOdd:                ;
     sbc hl,de          ;       15d
     rr h \ rr l        ;       16d
     jp CheckMax        ;       10d
AdjustGCD:              ;
     ex de,hl           ;       4
     inc b              ;       4
     dec b              ;       4
     ret z              ;       11+4(k>0)
     add hl,hl          ;       11k
     djnz $-1           ;       13k-5
     ret                ;       --
It is a lot faster than my other version which used division to compute the mod of two 16-bit numbers x.x The JP instructions can be changed to JR for better portability and to save a byte each time.

EDIT: And if I didn't make a mistake, the 8-bit version:
Code: [Select]
GCD_A_C:
;Outputs:
;    A is the GCD
;    C should be the smallest odd number that divides both inputs
;    B is 0
;Destroys:
;    D
     ld b,1
CheckMax:
     sub c
     jr z,AdjustGCD
     jr nc,ParityCheck
     neg
     ld d,a
     ld a,c
     ld c,a
     jr CheckMax
ParityCheck:
     rrc c
     jr c,c_Odd
     inc b
     rrca
     jr nc,CheckMax
     rlca
     djnz CheckMax
c_Odd:
     rlc c
     rrca
     jr nc,CheckMax
     rlca
     jr CheckMax
AdjustGCD:
     ld a,c
     dec b
     ret z
     add a,a
     djnz $-1
     ret
EDIT2: I think I computed a massive overestimate of the slowest speed of the first routine to be a little over 4000 cycles. My old routine used about 1500 cycles at the fastest for a non-trivial result. 1500 is likely close to the slowest that the new routine will run at :)
EDIT3: 14 bytes saved, runs faster?
Code: [Select]
GCDDE_HL:
;Inputs:
;     HL,DE are the two values
;Outputs:
;     B is 0
;     DE is 0
;     HL is the GCD
;     C is not changed
;     A is not changed
     ld b,1
     or a
CheckMax:               ;
     sbc hl,de          ;ED52   15n
     jr z,AdjustGCD     ;28**   12n-5
     jr nc,ParityCheck  ;30**   12n-5
     add hl,de
     or a
     ex de,hl
ParityCheck:            ;
     bit 0,e            ;CB**   8a
     jr nz,DE_Odd       ;20**   12a-5b
     bit 0,l            ;CB**   8b
     jr z,BothEven      ;28**   12b-5c
     rr d               ;CB**   8(n-a-b-c)
     rr e               ;CB**   8(n-a-b-c)
     jp CheckMax        ;C3**** 10(n-a-b-c)
BothEven:               ;
     inc b              ;04     4c
     rr d \ rr e        ;       16c
HL_Even:
     rr h \ rr l        ;       16c
     jp CheckMax        ;       10c
DE_Odd:                 ;
     bit 0,l            ;       8b
     jr z,HL_Even       ;       12b-5d
     sbc hl,de          ;       15d
     rr h \ rr l        ;       16d
     jp nz,CheckMax        ;       10d
AdjustGCD:              ;
     ex de,hl           ;       4
     dec b              ;       4
     ret z              ;       11+4(k>0)
     add hl,hl          ;       11k
     djnz $-1           ;       13k-5
     ret                ;       --

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #73 on: July 04, 2013, 08:46:10 am »
EDIT: Fixed a problem to take care of the case where HL= 8000h (thanks Jacobly!)
This routine a few pages back can be optimised:
Code: [Select]
SignedDivision:
ld a,h
xor d
push af

bit 7,h
jr z,$+8
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a

bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a

call RegularDivision

pop af
add a,a
ret nc

xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
For the sign testing, I came up with this:
Code: [Select]
SignedDivision:
ld a,h
xor d
push af

xor d
jp p,$+9
xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a

bit 7,d
jr z,$+8
xor a
sub e
ld e,a
sbc a,a
sub d
ld d,a

call RegularDivision

pop af
ret p

xor a
sub l
ld l,a
sbc a,a
sub h
ld h,a
ret
In all, it saves 1 bytes and at least 5 t-states (it will be either 5 or 10).

Offline Xeda112358

  • they/them
  • Moderator
  • LV12 Extreme Poster (Next: 5000)
  • ************
  • Posts: 4704
  • Rating: +719/-6
  • Calc-u-lator, do doo doo do do do.
    • View Profile
Re: ASM Optimized routines
« Reply #74 on: October 20, 2013, 03:53:19 pm »
Here is a sqrtL routine:
Code: [Select]
SqrtL:
;Inputs:
;     L is the value to find the square root of
;Outputs:
;      C is the result
;      B,L are 0
;     DE is not changed
;      H is how far away it is from the next smallest perfect square
;      L is 0
;      z flag set if it was a perfect square
;Destroyed:
;      A
     ld bc,400h       ; 10    10
     ld h,c           ; 4      4
sqrt8Loop:            ;
     add hl,hl        ;11     44
     add hl,hl        ;11     44
     rl c             ; 8     32
     ld a,c           ; 4     16
     rla              ; 4     16
     sub a,h          ; 4     16
     jr nc,$+5        ;12|19  48+7x
       inc c
       cpl
       ld h,a
     djnz sqrt8Loop   ;13|8   47
     ret              ;10     10
;287+7x, x is the number of bits in the result
;min: 287
;max: 315
;19 bytes
Also, in case anybody needed a small GCD (Greatest Common Divisor) routine, I have this:
Code: [Select]
GCDHL_DE:
;Outputs:
;     DE is the GCD
GCDLoop:
     or a
     sbc hl,de
     ret z
     jr nc,$-3
     add hl,de
     ex de,hl
     jp GCDLoop
(a faster one is a few posts up)
If you need a fast way to see if a 16-bit number is divisible by 3 (without actually dividing)
Code: [Select]
HL_mod_3:
;Outputs:
;     Preserves HL
;     A is the remainder
;     destroys DE,BC
;     z flag if divisible by 3, else nz
     ld bc,030Fh
     ld a,h
     add a,l
     sbc a,0   ;conditional decrement
;Now we need to add the upper and lower nibble in a
     ld d,a
     and c
     ld e,a
     ld a,d
     rlca
     rlca
     rlca
     rlca
     and c

     add a,e
     sub c
     jr nc,$+3
     add a,c
;add the lower half nibbles

     ld d,a
     sra d
     sra d
     and b
     add a,d
     sub b
     ret nc
     add a,b
     ret
;at most 132 cycles, at least 123