ASM / Non-Standard gBuf Ideas
« on: July 16, 2013, 02:02:25 pm »
I have had this idea since my first attempt at an OS, but I ran into a few problems. Basically, I wanted to store the graph buffer in columns because I thought it would be very useful for drawing tiles and updating the LCD. Then I started thinking about line drawing, circle drawing, and anything that would cross a byte boundary and I realised that some routines would take a major hit to speed.
If you have ever written your own LCD updating routine, you have probably already realised just how straightforward this makes the routine (if we had no LCD delay, it would amount to basically 12 iterations of ld b,64 \ otir). We typically don't need to optimise such an LCD update for speed because most of the time the code is waiting for the LCD to respond before moving to the next byte. However, if you are doing something like grayscale, or interleaving another routine with the LCD update (like drawing a tilemap at the same time), the simpler loop leaves you more time between writes for more complicated things, putting your 'waste cycles' to better use.
I am working on a project and I am still trying to decide if it would be beneficial to organise the screen in this manner. Here is an example of a tile drawing routine using the current buffer setup:
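Code:
drawtile:
;DE points to the sprite data
;BC = (y,x)
;Draws an 8x8 tile where X is on [0,11] and Y is on [0,7]
    ld a,b
    add a,a
    add a,b
    add a,a
    add a,a
    add a,a             ;A = 24*Y
    ld h,0
    ld b,h
    ld l,a
    add hl,hl
    add hl,hl           ;HL = 96*Y (8 pixel rows of 12 bytes each)
    add hl,bc           ;+X
    ld bc,(DrawBufPtr)
    add hl,bc           ;HL points to the first byte to draw at
    ld bc,12
    ld a,8
;The loop timings below are totals over all 8 iterations
    ex de,hl            ; 32
    ldi                 ;128, also decrements BC to 11
    ex de,hl            ; 32
    add hl,bc           ; 88, with ldi's increment this moves down one pixel row
    inc c               ; 32, restores BC to 12
    dec a               ; 32
    jr nz,$-7           ; 91
    ret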
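And here is how it looks the other way:
Code:
drawtile:
;DE points to the sprite
;BC = (y,x)
;X*64
    or a                ;clear carry so the first rra shifts in a 0
    ld a,c
    ld l,0
    rra
    rr l
    rra
    rr l
    ld h,a              ;HL = 64*X
;y*8
    ld a,b
    add a,a
    add a,a
    add a,a
    add a,l
    ld l,a              ;HL = 64*X+8*Y
    ld bc,(DrawBufPtr)
    add hl,bc
    ex de,hl            ;HL = sprite, DE = destination
    ld bc,8
    ldir                ;the 8 rows of the tile are consecutive bytes
    ret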
The former is 565 t-states and 33 bytes; the latter, as listed with the leading or a, is 285 t-states and 29 bytes. There are ways to optimise both routines for speed.
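As one illustration of such a speed optimisation (a common trick, not necessarily what the author had in mind), the ldir in the column-layout routine can be unrolled into eight ldi instructions. Each ldi takes 16 t-states, against 21 for every repeating ldir iteration, and BC no longer needs to be loaded with the count. The label drawtile_unrolled is a made-up name; DrawBufPtr and the inputs are the same as in the routine above.

```z80
drawtile_unrolled:
;DE points to the sprite
;BC = (y,x), X on [0,11], Y on [0,7]
    or a                ;clear carry for the first rra
    ld a,c
    ld l,0
    rra
    rr l
    rra
    rr l
    ld h,a              ;HL = 64*X
    ld a,b
    add a,a
    add a,a
    add a,a
    add a,l
    ld l,a              ;HL = 64*X+8*Y
    ld bc,(DrawBufPtr)
    add hl,bc
    ex de,hl            ;HL = sprite, DE = destination
    ldi \ ldi \ ldi \ ldi
    ldi \ ldi \ ldi \ ldi   ;8 consecutive bytes, no loop counter needed
    ret
```

The trade-off is size: the eight ldi instructions take 16 bytes where ld bc,8 \ ldir took 5.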
Sprite Drawing
The reason some drawing becomes so easy is that each 8-pixel-wide column of the screen is 64 consecutive bytes, which is much nicer to work with than a row of pixels being 12 bytes. We also see a huge boost in performance when moving down or up a pixel, because that only requires an increment or decrement of a pointer instead of adding 12 each time. However, we now get the same problem when moving left or right across byte boundaries, since the neighbouring byte is 64 bytes away. This means that sprite routines could take a hit, but let's see how far we can remedy this.
This will be a very simple routine to XOR an 8x8 sprite to the gbuf:
Code:
PutSprite8x8:
;Note: No clipping.
;Inputs:
;   BC = (x,y)
;   IX points to the sprite
;1871 worst-case
    ld a,b
    and $F8             ;also clears carry for the rla below
    ld h,0
    rla \ rl h
    rla \ rl h
    rla \ rl h
    ld l,a              ;HL = (x/8)*64
    ld a,b
    ld b,0
    add hl,bc           ;+y
    ld bc,9340h
    add hl,bc
;HL points to the first byte to draw at
    and 7
    jr nz,crossedbound
;x is a multiple of 8, so the sprite fits in one column of bytes
    push ix \ pop de
    ld b,8
    ld a,(de)
    xor (hl)
    ld (hl),a
    inc hl
    inc de
    djnz $-5
    ret
crossedbound:
    ld b,a
    dec a
    ld (smc_jump1),a    ;jr offset: skip (x&7)-1 of the 7 rlca instructions
    ld (smc_jump2),a
    ld a,1
    rrca
    djnz $-1
    dec a
    ld e,a
    ld c,8
;E is the mask
;IX points to the sprite
;HL points to where to draw
drawloop1:
    ld a,(ix)
    .db 18h             ;start of jr *
smc_jump1:
    .db 0
    rlca
    rlca
    rlca
    rlca
    rlca
    rlca
    rlca
    and e
    xor (hl)
    ld (hl),a
    inc ix
    inc hl
    dec c
    jr nz,drawloop1
    ld c,56
    add hl,bc           ;B = 0 here, so HL moves to the same rows of the next column
    ld a,e
    cpl
    ld e,a              ;invert the mask for the right-hand half
    ld c,8
drawloop2:
    ld a,(ix-8)         ;IX has advanced 8 bytes, so this re-reads the sprite
    .db 18h             ;start of jr *
smc_jump2:
    .db 0
    rlca
    rlca
    rlca
    rlca
    rlca
    rlca
    rlca
    and e
    xor (hl)
    ld (hl),a
    inc ix
    inc hl
    dec c
    jr nz,drawloop2
    ret
That actually turns out to be pretty fast, so if you need to draw sprites, this is still a viable buffer setup.
LCD Updating
As promised, the routine to update the LCD is fairly straightforward:
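Code:
#define lcddelay() in a,(16) \ rlca \ jr c,$-3
    ld a,5              ;set the increment mode
    out (16),a
    lcddelay()
    ld a,80h            ;set the row pointer
    out (16),a
    ld hl,9340h
    lcddelay()
    ld a,20h            ;first column
col:
    out (16),a          ;set the column
    push af             ;lcddelay() destroys A
    ld bc,4011h         ;B = 64 bytes per column, C = port 17
row:
    lcddelay()
    outi
    jr nz,row
    lcddelay()
    pop af
    inc a
    cp 2Ch              ;12 columns, 20h to 2Bh
    jr nz,col
    ret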
Note that if you are only ever doing fullscreen updates (or at least full columns) and you are always using the same increment mode, you can leave the first part of that code in a setup portion of your code:
Code:
.org 9D93h
.db $BB,$6D
Start:
    ld a,5              ;set the increment mode, only needs to be done once
    out (16),a
    lcddelay()          ;same macro as in the previous listing
    ld a,80h            ;set the row pointer, only needs to be done once, since the LCD update routine leaves it where it started.
    out (16),a
Main:
    <code>
UpdateLCD:
    ld hl,9340h
    lcddelay()          ;in case the LCD is still busy with an earlier command
    ld a,20h
col:
    out (16),a
    push af
    ld bc,4011h
row:
    lcddelay()
    outi
    jr nz,row
    lcddelay()
    pop af
    inc a
    cp 2Ch
    jr nz,col
    ret
Pixel Plotting
Code:
;GetPixelLoc
;Inputs:
;   BC = (x,y)
;   DE is the buffer on which to draw
;Outputs:
;   Returns HL pointing to the byte where the pixel gets plotted
;   Returns A as a mask
;   NC returned if out of bounds, else C if in bounds
GetPixelLoc:
    ld a,c \ cp 64 \ ret nc
    ld a,b \ cp 96 \ ret nc
    and $F8             ;also clears carry for the rla below
    ld h,0
    rla \ rl h
    rla \ rl h
    rla \ rl h
    ld l,a              ;HL = (x/8)*64
    ld a,b
    ld b,0
    add hl,bc           ;+y
    add hl,de
;HL points to the byte that holds the pixel
    and 7
    ld b,a
    ld a,1
    inc b
    rrca \ djnz $-1     ;A = 80h shifted right (x&7) times
    scf
    ret
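Because the routine returns NC for out-of-bounds coordinates, a caller should test the carry before touching (HL). A minimal sketch of three wrappers, assuming the routine above is assembled under the label GetPixelLoc (the original listing only names it in a comment); SetPixel, ErasePixel and InvertPixel are hypothetical names:

```z80
;All three take BC = (x,y) and DE = buffer, and draw nothing if out of bounds.
SetPixel:
    call GetPixelLoc
    ret nc              ;out of bounds
    or (hl)
    ld (hl),a
    ret
ErasePixel:
    call GetPixelLoc
    ret nc
    cpl                 ;every bit except the pixel
    and (hl)
    ld (hl),a
    ret
InvertPixel:
    call GetPixelLoc
    ret nc
    xor (hl)
    ld (hl),a
    ret
```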
Now, to set the pixel, use or (hl) \ ld (hl),a; to invert it, use xor (hl) in place of or (hl); and to erase it, use cpl \ and (hl) \ ld (hl),a.
Final Analysis
It turns out that most drawing is faster and that my original fears were just based on me being too accustomed to one way of doing things. Line drawing, circle drawing, and rectangle drawing are all faster (lines and circles just because it is faster to locate a pixel, rectangles because it just works fantastically). Sprites, tiles, and LCD updating work out great. However, there is one area that does in fact take a hit, and that is scrolling the screen. Shifting up and down is still relatively easy, but shifting left and right is slower and more complicated. Shifting up or down just means shifting the whole buffer by 1 byte instead of 12, which is the same speed. Here is shifting right:
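Code:
    ld hl,9340h         ;gbuf
    ld de,64
    ld c,e              ;64 pixel rows
loop:
    or a                ;shift a 0 into the leftmost pixel
    ld b,12
    rr (hl)
    push af \ add hl,de \ pop af    ;next column over, keeping the carry
    djnz $-5
    dec h \ dec h \ dec h           ;back 768 bytes to the first column
    inc l               ;next pixel row
    dec c
    jr nz,loop
    ret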
That is now half the speed of the same shift with the current gbuf setup. We can cut out 9828 t-states if interrupts are off, but that is still a huge hit to speed.
Aside from that, I like the idea of organising the buffer this way.
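Two sketches for the scrolling cases discussed above, both under stated assumptions. The first shifts the buffer up one pixel: the one-byte move does most of the work, but the bottom byte of each 64-byte column then holds the top byte of the next column, so it has to be blanked. The second is presumably what the interrupts-off saving refers to: the carry is kept in the shadow AF with ex af,af' instead of push af \ pop af, which is only safe with interrupts disabled because the TI-OS interrupt routine uses the shadow registers. The label names are hypothetical and 9340h is the gbuf as in the routines above.

```z80
;Shift the whole buffer up one pixel.
ShiftUp:
    ld hl,9341h
    ld de,9340h
    ld bc,767
    ldir                ;byte n+1 moves to byte n
    ld hl,9340h+63      ;bottom byte of the first 64-byte column
    ld de,64
    ld b,12
ClearBottom:
    ld (hl),0           ;blank the pixel row that scrolled in at the bottom
    add hl,de
    djnz ClearBottom
    ret

;Shift right one pixel, keeping the carry in AF' across add hl,de.
;Assumes interrupts were enabled on entry; drop the ei otherwise.
ShiftRightDI:
    di
    ld hl,9340h
    ld de,64
    ld c,e              ;64 pixel rows
ShiftRow:
    or a                ;shift a 0 into the leftmost pixel
    ld b,12
ShiftByte:
    rr (hl)
    ex af,af'           ;park the carry
    add hl,de           ;next column over (this changes the carry)
    ex af,af'           ;restore the carry
    djnz ShiftByte
    dec h \ dec h \ dec h
    inc l               ;next pixel row
    dec c
    jr nz,ShiftRow
    ei
    ret
```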
EDIT: Modified a few routines to be smaller, no speed change, though.
EDIT2: Added a link to the rectangle routines below.