Author Topic: nGL - a fast (enough) 3D engine for the nspire (Read 286140 times)

gameblabla · « **Reply #465 on:** May 16, 2015, 04:36:00 pm »

Quote from: Vogtinator on May 13, 2015, 08:38:25 pm

I finally got around to writing a tutorial on how to use nGL: http://github.com/Vogtinator/nGL

That would be great if you made a tutorial about texture mapping.
Also, does anyone know if nGL is faster for 2D stuff than n2DLib?
From what I've seen, it seems so.

Vogtinator · « **Reply #466 on:** May 16, 2015, 04:59:43 pm »

Quote

That would be great if you made a tutorial about texture mapping.

Will do, that's the next step

Quote

Also, does anyone know if nGL is faster for 2D stuff than n2DLib?

The 3D parts, definitely not. On desktop machines it's faster to use orthographic projection for 2D rendering because that's hardware accelerated, but that's not the case here.
There is a small 2D part in texturetools.cpp for working with TEXTURE objects: (GL_LINEAR scaled) blitting, block blitting with 50% opacity, resizing, and converting to greyscale for classic calcs, but not much more. Those parts are optimized and thus probably faster than n2DLib (never tested, but it might show if you blit an excessive amount of pixels), and they support blitting from TEXTURE to TEXTURE instead of blitting to the screen only.
If you read some older posts in this thread and in the n2DLib thread, you might notice that there was already quite a discussion about speed...

Edit: Lesson 2 - Texture mapping, is up.
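For readers wondering what "block blitting with 50% opacity" can look like on 16-bit pixels, here is a minimal sketch. It assumes RGB565 pixels and uses the common mask-and-shift averaging trick; it is not taken from texturetools.cpp, and `blend50_rgb565` and `blend50_row` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Average two RGB565 pixels (a 50% opacity blend).
 * 0xF7DE clears the lowest bit of each channel (bits 11, 5 and 0), so that
 * shifting right by one halves every channel without a bit leaking into
 * the neighbouring channel. Assumption: pixels are RGB565. */
static uint16_t blend50_rgb565(uint16_t a, uint16_t b)
{
    return (uint16_t)(((a & 0xF7DEu) >> 1) + ((b & 0xF7DEu) >> 1));
}

/* Blend a row of source pixels onto a destination row at 50% opacity. */
static void blend50_row(uint16_t *dst, const uint16_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++)
        dst[i] = blend50_rgb565(dst[i], src[i]);
}
```

With this trick, blending white (0xFFFF) with black (0x0000) gives 0x7BEF. The lowest bit of each channel is dropped before the addition, so the result can be one step darker than a rounded average.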

Matrefeytontias · « **Reply #467 on:** May 16, 2015, 06:41:58 pm »

Yeah well people wouldn't stop talking shit about how n2DLib was slower than nGL even though not a single test was ever made. I'm very glad to notice the exact same sentence again, "it's probably faster (though never tested)". You optimized it? Oh yeah cool, so did pierrotdu18, Hayleia and I. So could someone please make the damned speed test so that we finally know?

For the thousandth time, I won't be angry if nGL is faster than n2DLib. But reading sentences like "it's optimized, thus probably faster than n2DLib", which clearly assume that n2DLib is not optimized and thus totally ignore all the work that several people put into it, really upsets me.

pimathbrainiac · « **Reply #468 on:** May 17, 2015, 01:39:43 am »

Quote from: Vogtinator on May 13, 2015, 08:38:25 pm

I finally got around to writing a tutorial on how to use nGL: http://github.com/Vogtinator/nGL
Maybe we'll see some more 3D games on the nspire now!

Yay tutorials! Now to make some cool stuff with nGL!

Vogtinator · « **Reply #469 on:** May 17, 2015, 08:07:19 am »

I don't know why I even bother answering this a second time; I already wrote this some time ago.

Quote

Yeah well people wouldn't stop talking shit about how n2DLib was slower than nGL even though not a single test was ever made.

I would make a test, but n2DLib doesn't support TEXTURE-to-TEXTURE blitting, which nGL does. Although that shouldn't make a huge difference, it would be unfair.

Quote

You optimized it? Oh yeah cool, so did pierrotdu18, Hayleia and I.

Well, I'm sorry to say, but it just doesn't look like it.
The drawSprite routine, as the simplest example, makes a call to setPixel per pixel.
This is bad for four reasons:
-Function calls are slow
-Two comparisons
-Multiplication
-Variable loaded from RAM, indirectly

For easier comparison, here are the two inner loops of drawSprite and the nGL equivalent, compiled with the same flags as your example:

Code: [Select]

00000a80 <setPixel>:
     a80:       e35100ef        cmp     r1, #239        ; 0xef
     a84:       93500d05        cmpls   r0, #320        ; 0x140
     a88:       33a03d05        movcc   r3, #320        ; 0x140
     a8c:       30210193        mlacc   r1, r3, r1, r0
     a90:       359f300c        ldrcc   r3, [pc, #12]   ; aa4 <setPixel+0x24>
     a94:       31a01081        lslcc   r1, r1, #1
     a98:       35933000        ldrcc   r3, [r3]
     a9c:       318320b1        strhcc  r2, [r3, r1]
     aa0:       e12fff1e        bx      lr
     aa4:       00011078        .word   0x00011078 

In drawSprite:
     c94:       e1550008        cmp     r5, r8
     c98:       e08b3005        add     r3, fp, r5
     c9c:       aa000008        bge     cc4 <drawSprite+0x68>
     ca0:       e0da20b2        ldrh    r2, [sl], #2
     ca4:       e1d630b4        ldrh    r3, [r6, #4]
     ca8:       e1530002        cmp     r3, r2
     cac:       0a000002        beq     cbc <drawSprite+0x60>
     cb0:       e1a01004        mov     r1, r4
     cb4:       e1a00005        mov     r0, r5
     cb8:       ebffff70        bl      a80 <setPixel>
     cbc:       e2855001        add     r5, r5, #1

Code: [Select]

 5a0:   e25cc001        subs    ip, ip, #1
 5a4:   3a000005        bcc     5c0 <drawTexture(...)+0x158>
 5a8:   e0d560b2        ldrh    r6, [r5], #2
 5ac:   e1d080b6        ldrh    r8, [r0, #6]
 5b0:   e2811002        add     r1, r1, #2
 5b4:   e1580006        cmp     r8, r6
 5b8:   114160b2        strhne  r6, [r1, #-2]

As that code is run per pixel, I guess that is definitely a noticeable difference.
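To make the comparison above concrete at the C level, here is a minimal sketch of the two styles of inner loop. It is not the actual n2DLib or nGL source: `screen`, `set_pixel`, `draw_row_calls` and `draw_row_pointer` are hypothetical names, and the fixed 320x240 buffer is an assumption based on the constants visible in the disassembly.

```c
#include <assert.h>
#include <stdint.h>

#define SCREEN_W 320u
#define SCREEN_H 240u

/* Hypothetical screen buffer, standing in for the real framebuffer. */
static uint16_t screen[SCREEN_W * SCREEN_H];

/* Style A: a bounds-checked plot routine, called once per pixel.
 * Each call costs two comparisons and a multiply for the address. */
static void set_pixel(unsigned x, unsigned y, uint16_t c)
{
    if (x < SCREEN_W && y < SCREEN_H)
        screen[y * SCREEN_W + x] = c;
}

static void draw_row_calls(const uint16_t *src, unsigned x, unsigned y,
                           unsigned w, uint16_t transparent)
{
    for (unsigned i = 0; i < w; i++)
        if (src[i] != transparent)
            set_pixel(x + i, y, src[i]);
}

/* Style B: clip once per row, compute the address once, then walk pointers.
 * The loop body is just load, compare, conditional store. */
static void draw_row_pointer(const uint16_t *src, unsigned x, unsigned y,
                             unsigned w, uint16_t transparent)
{
    if (x >= SCREEN_W || y >= SCREEN_H)
        return;
    if (w > SCREEN_W - x)
        w = SCREEN_W - x;               /* clip the right edge */

    uint16_t *dst = &screen[y * SCREEN_W + x];
    for (; w > 0; w--, src++, dst++)
        if (*src != transparent)
            *dst = *src;
}
```

Both variants write the same pixels; the second only hoists the bounds checks and the address computation out of the loop. A compiler may inline a small static function like `set_pixel` on its own, so the gap in a real build depends on how the library is compiled and linked.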

Adriweb · « **Reply #470 on:** May 17, 2015, 09:13:38 am »

I guess one is optimized C, the other is optimized ASM... (or, at least, C code such that the ASM gets optimized much better in the end). So clearly on this level the assembly-optimized version can only be better.
But optimizing at this level is definitely something that far fewer people know how to do.

And I'm sure that since both things are open-source, one could help the other when needed

pimathbrainiac · « **Reply #471 on:** May 17, 2015, 12:16:42 pm »

That, plus what Adriweb said. I'm sure that if you all worked on the same project, we'd get one library even better than either of the two is now.

Hayleia · « **Reply #472 on:** May 17, 2015, 05:21:15 pm »

Quote from: Matrefeytontias on May 16, 2015, 06:41:58 pm

You optimized it? Oh yeah cool, so did pierrotdu18, Hayleia and I.

Wut? I don't know about pierrot and you, but I didn't do anything. I pretty much dropped the project when I saw that one of you was not putting brackets around one-instruction blocks, even when it's a for in a for (which leads to very ugly code in my opinion).

rwill · « **Reply #473 on:** May 18, 2015, 09:35:12 am »

Quote from: Vogtinator on May 17, 2015, 08:07:19 am

For easier comparison, here are the two inner loops of drawSprite and the nGL equivalent, compiled with the same flags as your example:
Code: [Select]
...As that code is run per pixel, I guess that is definitely a noticeable difference.

Sadly both routines appear to be quite unoptimized.

Adriweb · « **Reply #474 on:** May 18, 2015, 09:47:55 am »

Then let's just code the lib in hand-written ASM directly

Vogtinator · « **Reply #475 on:** May 18, 2015, 01:44:43 pm »

Quote

Sadly both routines appear to be quite unoptimized.

I guess you could make it faster by doing 32-bit transfers, but the shortest asm version with 16-bit transfers is the nGL version minus ldrh r8, [r0, #6], because that should happen outside of the loop, and r1 could be used as the counter instead of r12.
Basically (r0 is source, r1 is dest, r2 is end of source):

Code: [Select]

loop:
ldrh r3, [r0], #2
strh r3, [r1], #2
cmp r0, r2
bne loop

Also, 32-bit transfers would be impossible if source or dest isn't 32-bit aligned, and they aren't aligned if you have an odd X.
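The alignment constraint can be sketched in C as follows. This is only an illustration under stated assumptions: `same_word_phase` and `copy_row` are hypothetical names, the pixel pointers are assumed to be at least 2-byte aligned, and `memcpy` of four bytes is used as the aliasing-safe way to express a 32-bit transfer. Whether the compiler turns it into a single word load and store depends on the compiler and target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero when src and dst sit at the same offset within a 32-bit word.
 * Only then can both be 32-bit aligned at the same time. */
static int same_word_phase(const uint16_t *src, const uint16_t *dst)
{
    return (((uintptr_t)src ^ (uintptr_t)dst) & 2u) == 0;
}

/* Copy count 16-bit pixels, two at a time where alignment allows it. */
static void copy_row(uint16_t *dst, const uint16_t *src, size_t count)
{
    assert(dst != NULL && src != NULL);

    if (!same_word_phase(src, dst)) {
        /* One side would always be misaligned: stay with 16-bit transfers. */
        while (count > 0) {
            *dst++ = *src++;
            count--;
        }
        return;
    }

    /* Copy one leading pixel if needed so both pointers become 32-bit aligned. */
    if (count > 0 && ((uintptr_t)dst & 2u) != 0) {
        *dst++ = *src++;
        count--;
    }

    /* Main part: two pixels per transfer. */
    while (count >= 2) {
        uint32_t pair;
        memcpy(&pair, src, sizeof pair);
        memcpy(dst, &pair, sizeof pair);
        src += 2;
        dst += 2;
        count -= 2;
    }

    /* Trailing pixel, if the remaining count was odd. */
    if (count > 0)
        *dst = *src;
}
```

The fallback branch is what a sprite drawn at an odd X onto an even-aligned source would hit, which is why the 32-bit path is not always available.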

rwill · « **Reply #476 on:** May 18, 2015, 03:40:58 pm »

Shorter is not *edit* always */edit* faster. Especially given the ARM9EJ-S core.

*edit*

Given your example, which seems to copy without the transparency check, we end up with something like this (cycle estimate at end of line):

Code: [Select]

loop:
ldrh r3, [r0], #2   | 1
strh r3, [r1], #2   | 2-4 ( 2 cycles of 2 cycle interlock on r3 )
cmp r0, r2          | 5
bne loop            | 6-8

So, for example, unrolling the loop would help quite a bit.
8 Cycles for copying a single pixel without any transformation is way too much.

Vogtinator · « **Reply #477 on:** May 20, 2015, 07:44:41 am »

Quote

So, for example, unrolling the loop would help quite a bit.
8 Cycles for copying a single pixel without any transformation is way too much.

It would, but as I said, only possible if Xsrc % 2 == Xdest % 2, so not widely applicable.

Quote

8 Cycles for copying a single pixel without any transformation is way too much.

Yeah, but I guess you can't do much about it without using asm (and I aim not to, except where there are very obvious improvements).
The assembly doesn't look much different with more gcc optimizations.

Quote

Given your example, which seems to copy without the transparency check, we end up with something like this (cycle estimate at end of line):

I guess it could be improved by one cycle if the cmp is moved between the ldrh/strh?

rwill · « **Reply #478 on:** May 20, 2015, 10:08:15 am »

Ah .. might as well be more specific and direct ..

Code: [Select]

#include <stdint.h>

/* Copy an i_width x i_height block of 16-bit pixels. Strides are in pixels. */
void blit_pels( const int16_t *pp_srcB, int32_t i_src_stride, int16_t *pp_dstB, int32_t i_dst_stride, int32_t i_width, int32_t i_height )
{
	int32_t i_y;
	for( i_y = 0; i_y < i_height; i_y++ )
	{
		int16_t pp_a0, pp_a1, pp_a2, pp_a3;
		const int16_t *pp_src;
		int16_t *pp_dst;
		int32_t i_remain4, i_remain1;

		pp_src = pp_srcB;
		pp_dst = pp_dstB;

		/* Unrolled part: four loads first, then four stores, so that a
		   store does not directly follow the load it depends on. */
		i_remain4 = i_width >> 2;
		while( i_remain4 > 0 )
		{
			pp_a0 = *( pp_src++ );
			pp_a1 = *( pp_src++ );
			pp_a2 = *( pp_src++ );
			pp_a3 = *( pp_src++ );
			*( pp_dst++ ) = pp_a0;
			*( pp_dst++ ) = pp_a1;
			*( pp_dst++ ) = pp_a2;
			*( pp_dst++ ) = pp_a3;
			i_remain4--;
		}

		/* Remaining 0 to 3 pixels of the row. */
		i_remain1 = i_width & 3;
		while( i_remain1 > 0 )
		{
			pp_a0 = *( pp_src++ );
			*( pp_dst++ ) = pp_a0;
			i_remain1--;
		}
		/* Advance both base pointers to the next row. */
		pp_srcB += i_src_stride;
		pp_dstB += i_dst_stride;
	}
}

There might be some gain from increasing the unroll block to 8 pixels instead of 4, but I did not even bother to test this version.
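For illustration, an 8-pixel unroll of a single row could look like the sketch below. `copy_row_unroll8` is a hypothetical name, and nothing here has been measured on the calculator: eight temporaries plus pointers and a counter put more pressure on the available registers, so whether this gains anything over the 4-pixel version is an open question.

```c
#include <assert.h>
#include <stdint.h>

/* Copy one row of 16-bit pixels, eight per loop iteration.
 * All eight loads are issued before the first store. */
static void copy_row_unroll8(int16_t *dst, const int16_t *src, int32_t width)
{
    assert(width >= 0);

    int32_t blocks = width >> 3;   /* number of full 8-pixel blocks */
    int32_t tail = width & 7;      /* leftover 0 to 7 pixels */

    while (blocks > 0) {
        int16_t a0 = src[0], a1 = src[1], a2 = src[2], a3 = src[3];
        int16_t a4 = src[4], a5 = src[5], a6 = src[6], a7 = src[7];
        dst[0] = a0; dst[1] = a1; dst[2] = a2; dst[3] = a3;
        dst[4] = a4; dst[5] = a5; dst[6] = a6; dst[7] = a7;
        src += 8;
        dst += 8;
        blocks--;
    }

    while (tail > 0) {
        *dst++ = *src++;
        tail--;
    }
}
```

The row loop of a full blit would call this once per line and then advance both base pointers by their strides, as in the routine above.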

Vogtinator · « **Reply #479 on:** May 21, 2015, 11:02:19 am »

Quote

Quote
So, for example, unrolling the loop would help quite a bit.
8 Cycles for copying a single pixel without any transformation is way too much.
It would, but as I said, only possible if Xsrc % 2 == Xdest % 2, so not widely applicable.

Somehow I was thinking about 32-bit access here, I don't know why.
