Author Topic: optimizing asm code  (Read 4321 times)

Offline ben_g

  • Hey cool I can set a custom title now :)
  • LV9 Veteran (Next: 1337)
  • *********
  • Posts: 1002
  • Rating: +125/-4
  • Asm noob
    • View Profile
    • Our programmer's team: GameCommandoSquad
optimizing asm code
« on: April 17, 2012, 04:27:08 pm »
Hi,

I have to optimize this code, but I haven't really optimized anything before, and whatever I try, I can't get it to run faster. Can anyone give me some tips?

Also: I'm optimizing for speed, not for size. The code can be quite big.

Here's the code itself:
Code:
DrawTriangle:
  ;IN: x1,y1,u1,v1,x2,y2,u2,v2,x3,y3,u3,v3
  ;screen = 96*64

;the following code was used to add 100 to the x coordinates, to see if the sign was the problem

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;  ld de, 100
;  ld hl, (x1)
;  add hl, de
;  ld (x1), hl
;  ld hl, (x2)
;  add hl, de
;  ld (x2), hl
;  ld hl, (x3)
;  add hl, de
;  ld (x3), hl
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;---------------------------------------------------------------
; This part sorts the points so that Y1 <= Y2 <= Y3 so
; we can just draw each scanline below the last one.
;---------------------------------------------------------------

  ld hl, (x1)
  call Signed16To8
  ld h, $FF
  ld l, a
  ld (x1), hl
  ld hl, (x2)
  call Signed16To8
  ld h, $FF
  ld l, a
  ld (x2), hl
  ld hl, (x3)
  call Signed16To8
  ld h, $FF
  ld l, a
  ld (x3), hl

  ld hl, (y1)
  ld de, (y2)
  cpHLDE
  jr c, Y1SmallerThanY2

  ld hl, x1
  ld de, dx1 ;temp location
  ld bc, 6 ;size
  ldir

  ld hl, x2
  ld de, x1
  ld bc, 6
  ldir

  ld hl, dx1
  ld de, x2
  ld bc, 6
  ldir

Y1SmallerThanY2:
  ld hl, (y1)
  ld de, (y3)
  cpHLDE
  jr c, Y1SmallerThanY3

  ld hl, x1
  ld de, dx1 ;temp location
  ld bc, 6 ;size
  ldir

  ld hl, x3
  ld de, x1
  ld bc, 6
  ldir

  ld hl, dx1
  ld de, x3
  ld bc, 6
  ldir

Y1SmallerThanY3:
  ld hl, (y2)
  ld de, (y3)
  cpHLDE
  jr c, Y2SmallerThanY3

  ld hl, x2
  ld de, dx1 ;temp location
  ld bc, 6 ;size
  ldir

  ld hl, x3
  ld de, x2
  ld bc, 6
  ldir

  ld hl, dx1
  ld de, x3
  ld bc, 6
  ldir

Y2SmallerThanY3:

; +++++ End of sorting code +++++


;----------------------------------------------------------
; Here, some variables are initialized. The delta
; variables (the variables which start with a 'd')
; contain the values that need to be added to
; the variables which start with a 't'. Variables
; with a 't' and a '1' are used for the start of
; the scanline. Those with a 't' and a '2' are used
; for the end of the scanline.
;----------------------------------------------------------

  res 0, (IY) ;if this bit is 0, the routine is drawing the top half of the triangle. if it's 1, it's drawing the bottom half.
  res 1, (IY) ;This bit is used to store if the deltas for the texture coordinates inside scanlines are already calculated. They are constants, so they only need to be calculated once per half.

  ld hl, (y2)
  ld de, (y1)
  subFP ;This routine is for subtracting fixed-point values, but here it's used to subtract integer values.
  ld h, l
  ld l, 0
  push hl
  ld hl, (x2)
  ld de, (x1)
  subFP
  ld h, l
  ld l, 0
  pop de
  call DivFP
  ld (dx1), hl
  ld hl, (y3)
  ld de, (y2)
  subFP
  ld h, l
  ld l, 0
  push hl
  ld hl, (x3)
  ld de, (x2)
  subFP
  ld h, l
  ld l, 0
  pop de
  call DivFP
  ld (dx2), hl
  ld hl, (y3)
  ld de, (y1)
  subFP
  ld h, l
  ld l, 0
  push hl
  ld hl, (x3)
  ld de, (x1)
  subFP
  ld h, l
  ld l, 0
  pop de
  call DivFP
  ld (dx3), hl


  ld hl, (y2)
  ld de, (y1)
  subFP
  ld h, l
  ld l, 0
  push hl
  ld a, (u2)
  ld h, a
  ld l, 0
  ld a, (u1)
  ld d, a
  ld e, 0
  subFP
  ;ld h, l
  ;ld l, 0
  pop de
  call DivFP
  ld (du1), hl
  ld hl, (y3)
  ld de, (y2)
  subFP
  ld h, l
  ld l, 0
  push hl
  ld a, (u3)
  ld h, a
  ld l, 0
  ld a, (u2)
  ld d, a
  ld e, 0
  subFP
  ;ld h, l
  ;ld l, 0
  pop de
  call DivFP
  ld (du2), hl
  ld hl, (y3)
  ld de, (y1)
  subFP
  ld h, l
  ld l, 0
  push hl
  ld a, (u3)
  ld h, a
  ld l, 0
  ld a, (u1)
  ld d, a
  ld e, 0
  subFP
  ;ld h, l
  ;ld l, 0
  pop de
  call DivFP
  ld (du3), hl


  ld hl, (y2)
  ld de, (y1)
  subFP
  ld h, l
  ld l, 0
  push hl
  ld a, (v2)
  ld h, a
  ld l, 0
  ld a, (v1)
  ld d, a
  ld e, 0
  subFP
  ;ld h, l
  ;ld l, 0
  pop de
  call DivFP
  ld (dv1), hl
  ld hl, (y3)
  ld de, (y2)
  subFP
  ld h, l
  ld l, 0
  push hl
  ld a, (v3)
  ld h, a
  ld l, 0
  ld a, (v2)
  ld d, a
  ld e, 0
  subFP
  ;ld h, l
  ;ld l, 0
  pop de
  call DivFP
  ld (dv2), hl
  ld hl, (y3)
  ld de, (y1)
  subFP
  ld h, l
  ld l, 0
  push hl
  ld a, (v3)
  ld h, a
  ld l, 0
  ld a, (v1)
  ld d, a
  ld e, 0
  subFP
  ;ld h, l
  ;ld l, 0
  pop de
  call DivFP
  ld (dv3), hl

  ld hl, (x1)
  bit 7, h
  jr z, TPos1
  ld (tx1+1),hl \ ld a, $FF \ ld (tx1),a
  ld (tx2+1),hl \ ld a, $FF \ ld (tx2),a
  jr TEnd1
TPos1:
  ld (tx1+1),hl \ xor a \ ld (tx1),a ;store the 16bit integer at hl into 16.8 fixed point number tx1
  ld (tx2+1),hl \ xor a \ ld (tx2),a
TEnd1:
  ld hl, (y1)
  ld (_ty), hl

  ld a, (u1)
  ld h, a
  ld l, 0
  ld (tu1), hl
  ld (tu2), hl
  ld a, (v1)
  ld h, a
  ld l, 0
  ld (tv1), hl
  ld (tv2), hl

;if Y1 == Y2, then we don't need to draw the first half.
  ld hl, (y1)
  ld de, (y2)
  cpHLDE
  jp z, __TEndLoop

; +++++ End of initializing code +++++


;------------------------------------------------------------
; This is the loop in which the triangle is drawn.
; In each iteration of the loop, a single scanline is
; drawn. When this loop finishes, one half of the
; triangle is drawn.
;------------------------------------------------------------

TDrawLoop:
  ld a, (_ty)
  ld d, a
;if the Y of the scanline is negative, then go to the next one.
  bit 7, a
  jp nz, Clip
  ld a, (_ty)
;If it reaches the bottom of the screen, then stop drawing the triangle.
  cp 64
  ret nc

;Initialize variables for the scanline
  ld hl, (tu1)
  ld (tmpu), hl
  ld hl, (tv1)
  ld (tmpv), hl
  ld hl, (tu2)
  ld (temp2), hl
  ld hl, (tv2)
  ld (temp3), hl
  ld a, (tx2+1)
  ld (temp+1), a
  add a, 128
  ld b, a
  ld a, (tx1+1)
  ld (temp), a
  add a, 128
  cp b
  jr c, TOrdered
  ;jp po, TOrdered
  ld hl, (tu2)
  ld (tmpu), hl
  ld hl, (tv2)
  ld (tmpv), hl
  ld hl, (tu1)
  ld (temp2), hl
  ld hl, (tv1)
  ld (temp3), hl
  ld a, (tx2+1)
  ld (temp), a
  ld a, (tx1+1)
  ld (temp+1), a
TOrdered:
  ld l, d
  ld a, (temp)
;following line was for the test to see if the sign was the problem
  ;sub 100 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
  bit 7, a
  jr z, TGetPixel
  xor a
TGetPixel:
  call getPixel
  ld (mask), a
  ld (pointer), hl

;If the deltas for the texture coordinates inside a scanline are already
;calculated, then calculating them again is a waste of cycles.
  bit 1, (IY)
  jr nz, TPlotLoop
  ld hl, (tx1)
  ld de, (tx2)
  cpHLDE
  jr z, TPlotLoop
  ld a, (temp)
  ld h, a
  ld l, 0
  ld a, (temp+1)
  ld d, a
  ld e, 0
  subFP
  push hl
  ld hl, (tmpu)
  ld de, (temp2)
  subFP
  pop de
  call DivFP
  ld (tmpdu), hl
  ld a, (temp)
  ld h, a
  ld l, 0
  ld a, (temp+1)
  ld d, a
  ld e, 0
  subFP
  push hl
  ld hl, (tmpv)
  ld de, (temp3)
  subFP
  pop de
  call DivFP
  ld (tmpdv), hl
  set 1, (IY)


;---------------------------------------------------------------------
; In this loop, the scanline is drawn. One iteration here
; draws one pixel. When the loop ends, one scanline is drawn.
;---------------------------------------------------------------------

TPlotLoop:
;If the x coordinate of the pixel is negative, then go to the next pixel.
  ld a, (temp)
  bit 7, a
  jr nz, TNoCarry

;if the pixel goes off the right side of the screen, then go to the next scanline
  cp 96
  jp nc, Clip

;Every line with four ;'s after it is for 16x16 textures. Remove those lines and the
;textures will be 8x8.

  ld a, (tmpv+1)
  add a, a ;;;;
  ld hl, texture
  add a, l
  ld l, a

  ld a, (tmpu+1)
  bit 3, a ;;;;
  jr z, TFirstByte ;;;;
  res 3, a ;;;;
  inc hl ;;;;
TFirstByte: ;;;;
  ld b, a
  inc b
  ld a, (hl)
TshiftLoop:
  rla
  djnz   TshiftLoop
 
  ld a, (mask)
  ld hl, (pointer)
  jr c, TSetPixel

TResPixel:
  ;ld a, b
  cpl
  and (hl)
  ld (hl), a
  jr TEndPlot

TSetPixel:
  ;ld a, b
  or (hl)
  ld (hl), a

TEndPlot:
  ld hl, mask
  rrc (hl)
  jr nc, TNoCarry
  ld hl, (pointer)
  inc hl
  ld (pointer), hl
TNoCarry:

  ld hl, (tmpu)
  ld de, (tmpdu)
  add hl, de
  ld (tmpu), hl
  ld hl, (tmpv)
  ld de, (tmpdv)
  add hl, de
  ld (tmpv), hl

  ld a, (temp+1)
  ld b, a
  ld a, (temp)
  ld hl, temp
  inc (hl)
  cp b
  jp nz, TPlotLoop

; +++++ End of pixel plotting code +++++

;If it's drawing the second half, then make it recalculate the texture deltas
;for inside the scanlines. This was to solve a bug in the texture mapping.
  bit 0, (IY)
  jr nz, aaaa ;I suddenly ran out of inspiration for label names
;  res 1, (IY)
aaaa:

Clip:
  ld hl,(tx1)
  ld de, (dx1)
  ld a, d
  rla
  sbc a, a
  ld b, a
  add hl, de
  ld (tx1), hl
  ld a, (tx1+2)
  adc a, b
  ld (tx1+2), a

  ld hl,(tx2)
  ld de, (dx3)
  ld a, d
  rla
  sbc a, a
  ld b, a
  add hl, de
  ld (tx2), hl
  ld a, (tx2+2)
  adc a, b
  ld (tx2+2), a

  ld hl, (tu1)
  ld de, (du1)
  add hl, de
  ld (tu1), hl
  ld hl, (tu2)
  ld de, (du3)
  add hl, de
  ld (tu2), hl

  ld hl, (tv1)
  ld de, (dv1)
  add hl, de
  ld (tv1), hl
  ld hl, (tv2)
  ld de, (dv3)
  add hl, de
  ld (tv2), hl


  ld hl, (_ty)
  inc hl
  ld (_ty), hl
  ld de, (y2)
  cpHLDE
  jp c, TDrawLoop

;This is the end of the drawing loop
;If the second half was drawn, then stop this routine.
  bit 0, (IY)
  jr nz, _TEnd

__TEndLoop:
  ;Here, some variables are initialized for drawing the second half.

  ld hl, (y2)
  ld (_ty), hl

  ld hl, (y3)
  ld (y2), hl
  ld hl, (dx2)
  ld (dx1), hl
  ld hl, (du2)
  ld (du1), hl
  ld hl, (dv2)
  ld (dv1), hl

  ld hl, (x2)
  bit 7, h
  jr z, TPos4
  ld (tx1+1),hl \ ld a, $FF \ ld (tx1),a
  jr TEnd4
TPos4:
  ld (tx1+1),hl \ xor a \ ld (tx1),a
TEnd4:

  ld a, (u2)
  ld h, a
  ld l, 0
  ld (tu1), hl
  ld a, (v2)
  ld h, a
  ld l, 0
  ld (tv1), hl

  set 0, (IY)

  jp TDrawLoop

_TEnd:

  ret



getPixel:
   bit 7, a
   ret nz
   bit 7, l
   ret nz
   ld   h, 0
   ld   d, h
   ld   e, l
   
   add   hl, hl
   add   hl, de
   add   hl, hl
   add   hl, hl
   
   ld   e, a
   srl   e
   srl   e
   srl   e
   add   hl, de
   
   ld   de, PlotSScreen
   add   hl, de
   
   and   7
   ld   b, a
   ld   a, $80
   ret   z
   
   rrca
   djnz   $-1
   ret

Note: I'm not asking you to optimize it for me, but to help me optimize it, so I can learn from this and optimize better in the future.
My projects
 - The Lost Survivors (Unreal Engine) ACTIVE [GameCommandoSquad main project]
 - Oxo, with single-calc multiplayer and AI (axe) RELEASED (screenshot) (topic)
 - An android version of oxo (java)  ACTIVE
 - A 3D collision detection library (axe) RELEASED! (topic)(screenshot)(more recent screenshot)(screenshot of it being used in a tilemapper)
Spoiler For inactive:
- A first person shooter with a polygon-based 3d engine. (z80, will probably be recoded in axe using GLib) ON HOLD (screenshot)
 - A java MORPG. (pc) DEEP COMA(read more)(screenshot)
 - a minecraft game in axe DEAD (source code available)
 - a 3D racing game (axe) ON HOLD (outdated screenshot of asm version)

This signature was last updated on 20/04/2015 and may be outdated

Offline thepenguin77

  • z80 Assembly Master
  • LV10 31337 u53r (Next: 2000)
  • **********
  • Posts: 1594
  • Rating: +823/-5
  • The game in my avatar is bit.ly/p0zPWu
    • View Profile
Re: optimizing asm code
« Reply #1 on: April 20, 2012, 07:28:27 pm »
I'll start working on some stuff for you, but since this routine is so massive, could you like post a guide to what it does? I understand it's mostly math, but a simple explanation of what each section does could greatly improve the way people can optimize it.

Edit:
   I see the section headers. But how about an overarching explanation of it all?

I'm just going to keep adding information until I'm done or you respond. Feel free to correct anything that's wrong.

First of all, I'd like to see DivFP and subFP.

;dx1 = dx/dy of 1 to 2
;dx2 = dx/dy of 2 to 3
;dx3 = dx/dy of 1 to 3
;du1 = du/dy of 1 to 2
;du2 = du/dy of 2 to 3
;du3 = du/dy of 1 to 3
;dv1 = dv/dy of 1 to 2
;dv2 = dv/dy of 2 to 3
;dv3 = dv/dy of 1 to 3

What are u and v?

What are the bounds on the input variables?

Done for now (got stuff to do)
« Last Edit: April 20, 2012, 07:46:13 pm by thepenguin77 »
zStart v1.3.013 9-20-2013 
All of my utilities
TI-Connect Help
You can build a statue out of either 1'x1' blocks or 12'x12' blocks. The 1'x1' blocks will take a lot longer, but the final product is worth it.
       -Runer112

ben_g
Re: optimizing asm code
« Reply #2 on: April 20, 2012, 07:52:43 pm »
Pretty much all of the commented-out code is either failed tests or debug code. You can just ignore those parts.
The part after the first header sorts the points by their Y coordinate (from smallest to largest).
The second part sets up flags and variables and calculates the deltas. For example, dx is the fixed-point number that is added to X on every scanline so that the edge ends up exactly at the next point.
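As a rough illustration of that per-scanline step, here is a minimal sketch (not the code from the routine). It assumes X and its delta are both stored as signed 8.8 fixed-point, whereas the routine above keeps X in a wider 3-byte format. StepEdge, edgeX and edgeDX are made-up names for this example, and the .dw directive assumes a TASM-style assembler.

```z80
; Sketch only: step an edge's X by its per-scanline delta.
; IN:  (edgeX)  = current X, signed 8.8 fixed-point
;      (edgeDX) = delta per scanline, signed 8.8 fixed-point
; OUT: (edgeX) advanced by one scanline, A = integer part of the new X
; Destroys DE, HL
StepEdge:
  ld hl, (edgeX)    ; H = integer part, L = fractional part
  ld de, (edgeDX)   ; delta = (x2 - x1) / (y2 - y1)
  add hl, de        ; two's complement add also handles negative slopes
  ld (edgeX), hl
  ld a, h           ; integer X to use for this scanline
  ret

edgeX:  .dw $0A00   ; hypothetical start value: 10.0
edgeDX: .dw $0280   ; hypothetical delta: +2.5 per scanline
```

With these example values, an edge going from X=10 to X=20 over 4 scanlines steps through $0C80, $0F00, $1180 and $1400, so it lands exactly on 20 at the last scanline.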
The part after that calculates some values for the scanline (TDrawLoop is the loop in which scanlines are drawn; 1 iteration = 1 scanline). The first part does vertical clipping, then it checks the start and the end of the scanline and swaps them when necessary so the scanline can always be drawn from left to right. Then getPixel is called and its result is saved, so it only has to be called once per scanline. The rest of that part is texture interpolation.
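The left-to-right check can be sketched on its own like this. It is a simplified stand-in that works on two registers instead of the routine's variables, and it uses the same add-128 trick as the code above to compare signed 8-bit values. The label OrderEnds and the register assignment are assumptions made for this example.

```z80
; Sketch only: order two signed 8-bit X values.
; IN:  B = first X, C = second X (both signed)
; OUT: B = left X, C = right X
; Destroys A, D
OrderEnds:
  ld a, c
  add a, 128       ; flip the sign bit so signed order matches unsigned order
  ld d, a
  ld a, b
  add a, 128
  cp d             ; carry set if B < C as signed numbers
  ret c            ; already in left-to-right order
  ld a, b          ; otherwise swap B and C
  ld b, c
  ld c, a
  ret
```

Keeping both ends in registers avoids the memory round trips, but it only pays off if the registers are free at that point in the loop.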

TPlotLoop is the loop in which the pixels are plotted. 1 iteration = 1 pixel.
It basically just reads the texture pixel at the coordinates calculated in the previous part and plots that pixel. Then it steps the texture coordinates and adjusts the mask and pointer returned by getPixel so the next pixel can be plotted.
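A stripped-down sketch of those two steps, assuming an 8x8 1-bit texture stored as 8 bytes that do not cross a 256-byte boundary. ReadTexel, NextPixel and tex8 are hypothetical names, and values are passed in registers here instead of the memory variables the routine uses.

```z80
; Sketch only: test one pixel of an 8x8, 1-bit texture.
; IN:  B = u (column, 0-7), C = v (row, 0-7)
; OUT: carry set if the texture pixel is set
; Destroys A, B, HL
ReadTexel:
  ld hl, tex8
  ld a, c
  add a, l
  ld l, a              ; HL -> byte holding row v (no carry into H, see assumption)
  ld a, (hl)           ; bit 7 is the leftmost pixel of the row
  inc b                ; shift u+1 times so the wanted bit ends up in carry
ReadTexelShift:
  add a, a             ; shift left, old bit 7 goes to carry
  djnz ReadTexelShift  ; djnz leaves the flags untouched
  ret

; Sketch only: move to the next pixel on the right.
; IN/OUT: HL = screen byte of the current pixel, C = its single-bit mask
NextPixel:
  rrc c                ; move the mask one pixel to the right
  ret nc               ; bit did not wrap: still the same screen byte
  inc hl               ; mask wrapped from bit 0 to bit 7: next byte
  ret

tex8:
  .db $FF,$81,$81,$81,$81,$81,$81,$FF  ; example texture: a hollow box
```

The shift loop costs more the further right the texture column is, so a lookup table of bit masks indexed by u would be one alternative to consider.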

After that, the values for the next scanline are calculated to update the X coordinates and texture coordinates.
After the "End of pixel plotting code" comment, the variables are updated to draw the bottom half of the triangle. After the loop has run again, the triangle is complete and the routine returns.

The getPixel routine after that is only there as documentation, so you can see which one I'm using and what exactly it does. I doubt that it can be optimized.

If anything still isn't clear, feel free to ask.

thepenguin77
Re: optimizing asm code
« Reply #3 on: April 21, 2012, 11:21:52 pm »
Ok, here's my interpretation.

https://docs.google.com/document/d/1maLMvzk6Qlk7eQjadjPe2bXdXrGYcogIMG5yOSG8nQE/edit

I don't actually expect you to reply to the questions I asked; they are just to get you to think about what you are doing.

Edit:
  Things you should do to make it look nice:
1. Put your browser in full screen
2. View>Document View>Compact
3. View>Compact controls
« Last Edit: April 21, 2012, 11:24:04 pm by thepenguin77 »