Author Topic: [Axe] Plane deformations are fun (Read 28107 times)

Sorunome · « **Reply #45 on:** January 30, 2014, 02:40:57 pm »

So that is now 15 MHz again, right?
Or did you manage to get it to run at that speed at 6 MHz?

Matrefeytontias · « **Reply #46 on:** January 30, 2014, 02:41:26 pm »

No, it's 15 MHz

I wish it were 6 MHz, although even at that lower speed it runs acceptably fast.

Sorunome · « **Reply #47 on:** January 30, 2014, 02:42:06 pm »

Still epic, though

Runer112 · « **Reply #48 on:** February 02, 2014, 02:09:43 am »

My turn!

I was convinced this effect could be made faster, and after a lot of careful thought and crazy tricks, I managed to bump up the FPS: from 14.3 to 18.5, an improvement of about 30%! Again, it is running at 15MHz. It may look slightly different, as I had to rotate the texture up and right by one pixel per frame (rather than down and right) due to complications of the immensely aggressive optimization. I'll attach the source to this post, and here's a gif proving that it does indeed work:

Spoiler For Oh, and did I mention...:

Sorunome · « **Reply #49 on:** February 02, 2014, 03:50:56 am »

wow, just wow

You sir, are amazing!

Eiyeron · « **Reply #50 on:** February 02, 2014, 05:57:42 am »

And that, kids, is how Runer112 made my day.

Matrefeytontias · « **Reply #51 on:** February 02, 2014, 07:17:39 am »

Okay, what the actual fuck. So you're getting 18.5, then 44, then 107 FPS? Did you overclock your calc or what?

EDIT: tested, and yes. I see you actually wrote your own code, only taking my deformation functions. I can't understand shit of what you wrote, so yeah.

I actually find it a bit disheartening how, whatever code we come up with, you can make it 600% faster.

TIfanx1999 · « **Reply #52 on:** February 02, 2014, 07:41:42 am »

Quote from: Matrefeytontias on January 30, 2014, 02:39:49 pm

Bump,

So my demo made it onto Pouet! https://www.pouet.net/prod.php?which=62454

And ticalc too! http://www.ticalc.org/archives/files/fileinfo/458/45819.html

Let's spam the staff with emails asking them to feature it

Final gif:

Very nice stuff. ^^

Quote from: Runer112 on February 02, 2014, 02:09:43 am

My turn!

I was convinced this effect could be made faster, and after a lot of careful thought and crazy tricks, I managed to bump up the FPS: from 14.3 to 18.5, an improvement of about 30%! Again, it is running at 15MHz. It may look slightly different, as I had to rotate the texture up and right by one pixel per frame (rather than down and right) due to complications of the immensely aggressive optimization. I'll attach the source to this post, but be warned that there's still a bug that I haven't had the time to hunt down, which crashes the calculator fairly frequently upon exiting. But here's a gif, proving that it does indeed work:

Spoiler For Oh, and did I mention...:
... that that's just the pure Axe version?

As is always the case for assembly, and is especially the case for really specific, concise algorithms, if you know what you're doing you can get big performance gains over a compiled language. Re-coding only the rendering core in just as (if not more) aggressively optimized assembly, I registered a huge boost in FPS: 18.5 to 44, an improvement of about 140% on my pure Axe version and 200% on the existing assembly core version! The source for this will be attached as well, and luckily this one is 100% stable, so play with it all you want! Here's a gif again, showing it off, although keep in mind that it's actually rendering about twice as many frames as the gif captured:

Spoiler For Wait a second...:
... why does that gif say 6MHz and load effects so slowly? Because it is 6MHz!

The true 15MHz FPS for the version with the assembly core is a stupidly high 107! This is with no pre-rendering of frames or any such cheating, every frame is rendered pixel-by-pixel as always. So the total performance markup on the original 14.3 FPS comes to about 650%. You can try it for yourself by simply un-commenting the Full in the setup part of the assembly version source. And although it doesn't even capture a quarter of the frames rendered and the original effects are all but impossible to discern, here's a gif:

Beat that.

Why am I not surprised?

Those are some nice speed gains.

Quote from: Matrefeytontias on February 02, 2014, 07:17:39 am

Okay, what the actual fuck. So you're getting 18.5, then 44, then 107 FPS? Did you overclock your calc or what?

EDIT: tested, and yes. I see you actually wrote your own code, only taking my deformation functions. I can't understand shit of what you wrote, so yeah.

I actually find it a bit disheartening how, whatever code we come up with, you can make it 600% faster.

Didn't you know? Runer112 is some sort of wizard or something working his assembly magic.

fb39ca4 · « **Reply #53 on:** February 02, 2014, 11:02:40 am »

Quote from: Runer112 on February 02, 2014, 02:09:43 am

My turn!

[...]

Beat that.

Do you think this could be made into a mode 7 engine?

Matrefeytontias · « **Reply #54 on:** February 02, 2014, 11:03:33 am »

Nope, since it only uses 1 tile. It's really nothing more than a demo effect.

Runer112 · « **Reply #55 on:** February 02, 2014, 12:19:26 pm »

For any of those curious, perhaps I should try to give a basic explanation of ~~the mind of a madman~~ how I got the speed boosts I did, and provide a look into my mindset on optimization. I assure you it was not magic, just very careful optimization and smart LUT generation! I had a lot of fun and spent a lot of time on the code itself, so I may as well take a bit more to share with you guys.

The core (the duty of which is to render one full byte) of the pure Axe version went through a few iterations. In fact, I still left the old iterations, except the first two, in a block comment in the source, each with a comment of how fast it was.

Iteration 0: [1046 cycles] {{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}→{D}+1→D
- S reads from the 6144-byte texel-mapping LUT
- D writes to L₆
- The method is a simple repetition: shift the result byte left by one bit, read the next texel (0 or 1), and add it to the result. This is essentially the first step of the optimization process, and statistically where most of the performance boost comes from: develop a simple, fast base idea, of which all further iterations are low-level optimizations. Even if the idea of low-level optimization with cycle counting is foreign to you, this first optimization is a step that, with enough thought and practice, you can come to as well, and get a good chunk of the performance boost.
Iteration 1: [992 cycles] {{S}+Pic1}*2+{{S+1}+Pic1}*2+{{S+2}+Pic1}*2+{{S+4→S-1}+Pic1}*2+{{S}+Pic1}*2+{{S+1}+Pic1}*2+{{S+2}+Pic1}*2+{{S+4→S-1}+Pic1}→{D}+1→D
- Saved 54 cycles by reducing the number of updates of the S variable from 8 to 2 and adding in offset reads with small (≤2) offsets.
Iteration 2: [941 cycles] {{S}ᕀᴇ8000}*2+{{S+1}ᕀᴇ8000}*2+{{S+2}ᕀᴇ8000}*2+{{S+5→S-2}ᕀᴇ8000}*2+{{S-1}ᕀᴇ8000}*2+{{S}ᕀᴇ8000}*2+{{S+1}ᕀᴇ8000}*2+{{S+3→S-1}ᕀᴇ8000}→{D}+1→D
- Saved 48 cycles by moving the bit-exploded version of the sprite to ᴇ8000, so instead of having to perform a 16-bit addition for each texel read, it only performs an 8-bit or (the operator is 16-bit, but the low byte of the operand is 0 and gets optimized out since or 0 has no effect).
- Saved 3 cycles by rearranging how the S variable is updated (there's a lot of neutral shifting of cycles between texel reads here, but the underlying saving comes from changing the second +4 to +3).
Iteration 3: [~890 cycles] {{S}ᕀᴇ8000}+{{S+1}ᕀᴇ8100}+{{S+2}ᕀᴇ8200}+{{S+5→S-2}ᕀᴇ8300}+{{S-1}ᕀᴇ8400}+{{S}ᕀᴇ8500}+{{S+1}ᕀᴇ8600}+{{S+3→S-1}ᕀᴇ8700}→{D}+1→D
- Saved 77 cycles by removing the shifts altogether and simply adding texel reads. This is achieved by maintaining not one, but eight copies of the bit-exploded sprite, each storing the data bit in a different bit. This does incur eight times the cost of shifting the texture each frame, but by my math that was only an increase from about 2850 cycles to 22800 cycles per frame, which spread across 768 bytes is an additional ~26 cycles per byte. Subtracting that from the core optimization, this approach really saves about 51 cycles.
Iteration 4: [853 cycles] {{S}-256}*2+{{S+1}-256}*2+{{S+2}-256}*2+{{S+5→S-2}-256}*2+{{S-1}-256}*2+{{S}-256}*2+{{S+1}-256}*2+{{S+3→S-1}-256}→{D}+1→D
- Saved 11 cycles by moving the bit-exploded sprite to ᴇFF00, allowing for it to be accessed simply by subtracting 256 (only takes 4 cycles!) from the index, but the shifting had to be re-introduced. However, due to only having one version of the sprite, the ~26 cycles is reclaimed for a total savings of about 37 cycles.
Iteration 5 (final): [~802 cycles] {{S}-256}+{{S+1}-256}+{{S+2}-256}+{{S+5→S-2}-256}+{{S-1}-512}+{{S}-512}+{{S+1}-512}+{{S+3→S-1}-512}→{D}+1→D
- Essentially merged iterations 3 and 4, saving 77 cycles by removing the shifting and incurring ~26 cycles by having eight copies of the bit-exploded sprite, for a total savings of about 51 cycles.
- It may look like I only have two copies of the bit-exploded sprite, one at ᴇFE00 and one at ᴇFF00, but in fact I have one at every 64-byte interval in the whole 512-byte region. How can you do that with only one 6144-byte texel LUT, you might ask? The answer is quite simple, really: instead of each entry simply being a value 0-63, added to that is a variant of X^4*64 (^ being modulo in Axe), which essentially assigns each combination of the lower two bits of X to a different 64-byte bit-exploded sprite. Those four combinations, multiplied by the two 256-byte areas, give eight bit-exploded sprites.
- I didn't mention this in iteration 3, but allowing sprite data to be stored in the ᴇFE00-ᴇFFFF range is tricky, as the stack and other important OS stuff are around there. I'll spare you the details, but generally I copied whatever was there out, made sure interrupts were off so nothing would try to access it, and abused recursion to get the stack pointer outside of the region.
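
For anyone who finds the Axe one-liners hard to read, here is a rough Python model of the packing idea behind iteration 0, iteration 3, and iteration 5. It is only a sketch under assumptions, not a port: the names (`pack_shift_add`, `make_preshifted`, `pack_region`, `lut5`) are made up for the example, the texture and LUT are random data, and I am assuming that copy k holds its texel in bit 7-k and that the LUT is walked in row-major order. The real program keeps the copies at fixed addresses rather than in Python lists.

```python
import random

TEXELS = 64  # 8x8 one-bit texture, "bit-exploded" to one byte per texel

def pack_shift_add(texture, lut, start):
    """Iteration 0 style: shift the result left, then add the next texel (0 or 1)."""
    result = 0
    for i in range(8):
        result = (result * 2 + texture[lut[start + i]]) & 0xFF
    return result

def make_preshifted(texture):
    """Iteration 3 style: eight copies; copy k holds each texel already moved to bit 7-k."""
    return [[t << (7 - k) for t in texture] for k in range(8)]

def pack_preshifted(copies, lut, start):
    """Each of the eight reads hits a different copy, so plain addition is enough."""
    return sum(copies[k][lut[start + k]] for k in range(8))

def make_region(copies):
    """Iteration 5 style layout: two 256-byte areas holding four 64-byte copies each."""
    region = []
    for copy in copies:
        region.extend(copy)
    return region  # 512 bytes in total

def pack_region(region, lut5, start):
    """The LUT entry already carries (X mod 4) * 64; only the area is chosen per read."""
    total = 0
    for k in range(8):
        area = 256 * (k // 4)  # first four reads use one area, last four the other
        total += region[area + lut5[start + k]]
    return total

random.seed(1)
texture = [random.randint(0, 1) for _ in range(TEXELS)]
lut = [random.randrange(TEXELS) for _ in range(96 * 64)]  # 6144 texel indices
copies = make_preshifted(texture)
region = make_region(copies)
lut5 = [idx + (x % 4) * 64 for x, idx in enumerate(lut)]  # still fits in one byte

same = all(
    pack_shift_add(texture, lut, s)
    == pack_preshifted(copies, lut, s)
    == pack_region(region, lut5, s)
    for s in range(0, len(lut), 8)
)
print(same)
```

All three functions produce the same byte for every group of eight LUT entries, which is the point: the later iterations trade memory (more copies of the texture) for fewer operations per texel.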

Now, the real fun one, the assembly core. This may make varying degrees of sense to you, as not only is it assembly, but it's quite hacky assembly. I'll paste the source I used for reference here, and then give a quick rundown of how it actually works.

Code:

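Disp:
	ld	(spSave),sp	;save the real stack pointer
	ld	c,$20		;c = LCD column command, starting at column 0
	ld	a,$80		;LCD command: row 0
	out	($10),a
	ld	sp,(LUT)	;sp = start of the codelet pointer table
ColLoop:
	ld	de,64*256+%11111110	;d = 64 rows per column, e = fresh byte pattern
	ld	hl,ColLoop	;codelets come back here through jp (hl)
	ld	a,c
	inc	c
	ld	b,7
	djnz	$		;short delay for the LCD driver
	out	($10),a		;152cc into, 153cc loop
	cp	$2C		;carry stays set while the column is below 12
	ld	a,e		;a = %11111110, flags untouched
	ret	c		;"return" into the first codelet of this column
	ld	sp,(spSave)	;all columns done: restore the real stack pointer


;Pixel:
	add	a,a		;or adc a,a, depending on the texel
	ret	c		;byte not full yet: pop the next codelet address
	out	($11),a		;147cc into, 148cc loop
	ld	a,e		;reload %11111110
	scf			;and set the carry again
	dec	d		;one row fewer left in this column
	ret	nz		;more rows: on to the next codelet
	jp	(hl)		;column finished: back to ColLoop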

Firstly, a lot of time is saved by making the core render right to the LCD, skipping the extra ~60000 cycles of writing to a buffer and then having to read it back out later with delays injected for the slow LCD driver. But the real meat of the speed boost comes from not simply storing a 64-byte bit-exploded sprite, but storing a 576-byte array of 64 9-byte "codelets," each responsible for one pixel/texel of the source sprite (this is the code you see labelled ;Pixel:). Each byte (8 pixels) is initialized with a=%11111110 and the carry flag set, so for each pixel, you can simply perform add a,a to shift the result left by one bit and rotate in a 0 bit, or adc a,a to shift the result left by one bit and rotate in the carry bit, which will always be 1. That is, it will always be 1 until eight of these bits have been shifted in and that 0 in the lowest bit of %11111110 is finally shifted out, which allows for easy determination of when a byte is done by checking the state of the carry flag (1=not done, 0=done).
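
The sentinel trick is easier to see in a small simulation. This is a Python sketch of the accumulator and carry flag only, not of the real code; `render_byte` is a name invented for the example, and it assumes a texel value of 1 corresponds to the adc a,a codelet and 0 to the add a,a codelet.

```python
def render_byte(texels):
    """Model the Z80 accumulator: a = %11111110 with carry set, one add/adc per texel."""
    a, carry = 0b11111110, 1
    steps = 0
    for t in texels:
        # add a,a shifts in a 0; adc a,a shifts in the carry, which is still 1 here
        shifted = a + a + (carry if t else 0)
        a, carry = shifted & 0xFF, shifted >> 8
        steps += 1
        if carry == 0:  # the sentinel 0 bit just fell out: the byte is complete
            return a, steps
    raise ValueError("fewer than eight texels supplied")

print(render_byte([1, 0, 1, 1, 0, 0, 1, 0]))  # (178, 8), i.e. %10110010 after 8 steps
```

No separate bit counter is needed: the seven 1 bits ahead of the sentinel keep the carry set for the first seven shifts, and the eighth shift clears it.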

So each codelet writes its texel in only one instruction with lightning-fast speed, only 4 cycles. The real trick, then, is directing control to these 64 codelets with lightning-fast speed. And just as each codelet handled writing its texel in only one instruction, this is also achieved in only one instruction: ret c. This conditional return, which returns only if the carry flag is set, uses the carry flag effects described above to handle the done/not done determination. But how the hell is a return supposed to help, and when does the 6144-byte texel-mapping LUT get read? The answer is simple: that's what the return does! The texel-mapping LUT was instead blown up to 12288 bytes, giving each pixel 2 bytes, just like a stack entry. And instead of simply storing one of 64 texel indices, each entry stores a pointer to one of the 64 codelets. So to start rendering a frame, the stack pointer is pointed to the start of the LUT and a simple return will take you to the codelet to produce the next pixel!
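
Here is a rough Python model of that dispatch, in case the "stack pointer walks the LUT" idea is hard to picture. It is a sketch under assumptions: `CODELET_BASE` is a hypothetical address (the post does not say where the codelets live), `build_pointer_lut` and `render` are names invented for the example, and the LCD write is replaced by appending to a list.

```python
CODELET_SIZE = 9        # bytes per codelet, from the post
CODELET_BASE = 0x9000   # hypothetical address of codelet 0 (not from the post)

def build_pointer_lut(texel_lut):
    """Replace each texel index with the address of that texel's codelet."""
    return [CODELET_BASE + CODELET_SIZE * i for i in texel_lut]

def render(texture, pointer_lut, n_bytes):
    """Walk the pointer table the way ret walks the stack, collecting finished bytes."""
    sp = 0                       # stands in for the Z80 stack pointer
    out = []
    for _ in range(n_bytes):
        a, carry = 0b11111110, 1
        while carry:             # ret c: keep popping while the sentinel is still in a
            addr = pointer_lut[sp]   # pop the next "return address"
            sp += 1
            texel = (addr - CODELET_BASE) // CODELET_SIZE
            # the codelet body: adc a,a for a 1 texel, add a,a for a 0 texel
            shifted = a + a + (1 if texture[texel] else 0)
            a, carry = shifted & 0xFF, shifted >> 8
        out.append(a)            # stands in for out ($11),a
    return out

texture = [i & 1 for i in range(64)]       # texel i is 1 when i is odd
texel_lut = [1] * 8 + list(range(8))       # 16 pixels, i.e. two bytes
pointer_lut = build_pointer_lut(texel_lut)
print(render(texture, pointer_lut, 2))     # [255, 85]
print(len(build_pointer_lut([0] * 6144)) * 2)  # 12288 bytes at 2 bytes per entry
```

In this model the only per-pixel work besides the shift is fetching the next table entry, which is what the single ret c does on the real hardware.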

For a time analysis, the extremely tight loop of ad(d/c) a,a \ ret c to produce each bit gives a stupidly low 15 cycles per bit. Multiplying this by 8 for eight bits in a byte, and adding the overhead of handling each full byte, each byte takes a mere 148 cycles to render: about 5 times faster than my fastest Axe version. Taking into account the fact that the Axe version still needs to perform an LCD update and the different formats of the data that needs to be rotated each frame, the assembly version ultimately comes out to about 6x faster than the Axe version, and with only 46 bytes worth of unique assembly instructions executed.
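
The 15 and 148 figures can be re-derived from the standard Z80 instruction timings. The Python below is just that arithmetic written out, as a sketch; the constant names are mine, the timings are the documented T-state counts for each instruction, and any extra wait states the calculator hardware may add are ignored.

```python
# Standard Z80 timings in T-states for the instructions in one codelet
ADD_A_A = 4           # add a,a and adc a,a both take 4
RET_CC_TAKEN = 11     # ret c / ret nz when the condition holds
RET_CC_NOT_TAKEN = 5  # ret c when the condition fails
OUT_N_A = 11          # out ($11),a
LD_A_E = 4
SCF = 4
DEC_D = 4

per_bit = ADD_A_A + RET_CC_TAKEN
# Eight shifts per byte; the eighth ret c falls through instead of returning.
bits = 8 * ADD_A_A + 7 * RET_CC_TAKEN + RET_CC_NOT_TAKEN
# Per-byte overhead: LCD write, reload a, set carry, count the row, ret nz taken.
overhead = OUT_N_A + LD_A_E + SCF + DEC_D + RET_CC_TAKEN
per_byte = bits + overhead

AXE_PER_BYTE = 802    # final pure Axe iteration, from the post
core_ratio = round(AXE_PER_BYTE / per_byte, 1)
fps_ratio = round(107 / 18.5, 1)  # measured 15MHz FPS: assembly core vs pure Axe

print(per_bit, per_byte, core_ratio, fps_ratio)  # 15 148 5.4 5.8
```

The per-byte total matches the "148cc loop" comment in the source, and the two ratios line up with the "about 5 times" and "about 6x" figures.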

Final notes:

While writing this post, I realized the mistake I made with the pure Axe version (I allocated enough space in the high memory region for the sprite data, but not for the stack space that the rest of my program actually needs). I believe I have now fixed it and the Axe version should be stable, so I have updated that file in my original post.

Matrefeytontias, since most of this code is still yours and I would have had no clue about this general rendering method or how to make any of these effects myself, feel free to absorb my code into yours and use it as you see fit.

Also, my computer had a massive glitch in which the screen had crazy artifacts everywhere and the only audio was digital noise when I was about 95% done writing this post. I had feared that I lost my ~1.5 hours of work. Thank you Google, because when I restarted, Chrome prompted to restore my session and it had amazingly saved everything I wrote.

Matrefeytontias · « **Reply #56 on:** February 02, 2014, 12:49:45 pm »

How could you think of that with a human brain xD It's just stupid how optimized it is. Amazing as always.

Also, I'm kind of a maniac about my work, so I usually don't use anybody's code unless I'm capable of replicating it. And I'm clearly not capable of replicating that, so it'll stay yours.

Runer112 · « **Reply #57 on:** February 02, 2014, 01:04:33 pm »

At least on ticalc, you can specify co-authors of files. One of us should put them on there with the other specified as a co-author.

On sites that don't allow that explicit specification, it's up to you if you'd feel comfortable updating it with a description that credits me for the renderer.

Hmm, for some reason, the assembly version doesn't work on my actual calculator... I was really looking forward to seeing the blur that would result. Guess I'd better investigate this.

EDIT: Also, to clarify: most of the code is still really yours. My compulsive coding style drove me to perform little optimizations, move things around, and make personal stylistic changes, but most of the code still does the same thing you originally designed it to. I could probably recreate the assembly version from your original source only by importing my texel-rendering codelet generator and assembly core, with slight modifications to the big texel LUT generation and the texture rotating code.

EDIT 2: I couldn't run the assembly version because I didn't have enough RAM for the 12288-byte LUT. But I discovered another problem... it's too fast for the LCD driver.

Matrefeytontias · « **Reply #58 on:** February 02, 2014, 01:12:35 pm »

I won't update my version with your code, since it's on several sites that don't permit file editing at all (especially pouet.net). But yeah you're right, you should upload it on ticalc named "Illogical optimized" or something, with me as co-author.

DJ Omnimaga · « **Reply #59 on:** February 07, 2014, 01:43:57 pm »

Quote from: Runer112 on February 02, 2014, 02:09:43 am

My turn!

[...]

Beat that.

EDIT: Apparently, at 15MHz, the assembly version is too fast for the LCD driver on my calculator and glitches out a bit. Whoops. I should probably fix that...

You could basically port Mario Kart
