Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Messages - Runer112

Pages: 1 ... 28 29 [30] 31 32 ... 153
436
Official Contest / Re: TI-Concours 2014
« on: February 02, 2014, 03:37:35 pm »
Is a contestant able to try to qualify for more than one category? If so, and if they do succeed in qualifying for more than one category, can they also then compete in more than one category?

437
Forum Arcade Games / Re: Final Omni Arcade highscores (top 3)
« on: February 02, 2014, 03:25:32 pm »
Aww, I can't get any more trophies now... I'm forever stuck with a ton of golds and one crappy silver!

But yes, a lot of those games were of pretty questionable quality. That's how I could get the highest score in lots of games so easily, just by screwing around with a game for a bit until finding out the trick to setting a high score.

438
Axe / Re: [Axe] Plane deformations are fun
« on: February 02, 2014, 01:04:33 pm »
At least on ticalc, you can specify co-authors of files. One of us should put them on there with the other specified as a co-author. :) On sites that don't allow that explicit specification, it's up to you if you'd feel comfortable updating it with a description that credits me for the renderer.

Hmm, for some reason, the assembly version doesn't work on my actual calculator... I was really looking forward to seeing the blur that would result. Guess I'd better investigate this.

EDIT: Also, to clarify: most of the code is still really yours. My compulsive coding style drove me to perform little optimizations, move things around, and make personal stylistic changes, but most of the code still does the same thing you originally designed it to. I could probably recreate the assembly version from your original source only by importing my texel-rendering codelet generator and assembly core, with slight modifications to the big texel LUT generation and the texture rotating code.

EDIT 2: I couldn't run the assembly version because I didn't have enough RAM for the 12288-byte LUT. But I discovered another problem... it's too fast for the LCD driver. x.x

439
Axe / Re: [Axe] Plane deformations are fun
« on: February 02, 2014, 12:19:26 pm »
For those who are curious, perhaps I should try to give a basic explanation of the mind of a madman... er, how I got the speed boosts I did, and provide a look into my mindset on optimization. I assure you it was not magic, just very careful optimization and smart LUT generation! I had a lot of fun and spent a lot of time on the code itself, so I may as well take a bit more to share with you guys.



The core (the duty of which is to render one full byte) of the pure Axe version went through a few iterations. In fact, I still left the old iterations, except the first two, in a block comment in the source, each with a comment of how fast it was.

  • Iteration 0: [1046 cycles] {{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}*2+{{S++}+Pic1}→{D}+1→D
    • S reads from the 6144-byte texel-mapping LUT
    • D writes to L₆
    • The method employed is simple repetition: rotate the result byte left by one bit, read the next texel (0 or 1), and add it to the result. This is essentially the first step of the optimization process, and statistically where most of the performance boost comes from: develop a simple, fast base idea, of which all further iterations are low-level optimizations. Even if the idea of low-level optimization with cycle counting is foreign to you, this first optimization is a step that, with enough thought and practice, you can come to as well, and get a good chunk of the performance boost.
  • Iteration 1: [992 cycles] {{S}+Pic1}*2+{{S+1}+Pic1}*2+{{S+2}+Pic1}*2+{{S+4→S-1}+Pic1}*2+{{S}+Pic1}*2+{{S+1}+Pic1}*2+{{S+2}+Pic1}*2+{{S+4→S-1}+Pic1}→{D}+1→D
    • Saved 54 cycles by reducing the number of updates of the S variable from 8 to 2 and adding in offset reads with small (≤2) offsets.
  • Iteration 2: [941 cycles] {{S}ᕀᴇ8000}*2+{{S+1}ᕀᴇ8000}*2+{{S+2}ᕀᴇ8000}*2+{{S+5→S-2}ᕀᴇ8000}*2+{{S-1}ᕀᴇ8000}*2+{{S}ᕀᴇ8000}*2+{{S+1}ᕀᴇ8000}*2+{{S+3→S-1}ᕀᴇ8000}→{D}+1→D
    • Saved 48 cycles by moving the bit-exploded version of the sprite to ᴇ8000, so instead of having to perform a 16-bit addition for each texel read, it only performs an 8-bit or (the operator is 16-bit, but the low byte of the operand is 0 and gets optimized out since or 0 has no effect).
    • Saved 3 cycles by rearranging how the S variable is updated (there's a lot of neutral shifting of cycles between texel reads here, but the underlying saving comes from changing the second +4 to +3).
  • Iteration 3: [~890 cycles] {{S}ᕀᴇ8000}+{{S+1}ᕀᴇ8100}+{{S+2}ᕀᴇ8200}+{{S+5→S-2}ᕀᴇ8300}+{{S-1}ᕀᴇ8400}+{{S}ᕀᴇ8500}+{{S+1}ᕀᴇ8600}+{{S+3→S-1}ᕀᴇ8700}→{D}+1→D
    • Saved 77 cycles by removing the shifts altogether and simply adding texel reads. This is achieved by maintaining not one, but eight copies of the bit-exploded sprite, each storing the data bit in a different bit position. This does incur eight times the cost of shifting the texture each frame, but by my math that was only an increase from about 2850 cycles to about 22800 cycles per frame, which spread across 768 bytes is an additional ~26 cycles per byte. Subtracting that from the core optimization, this approach really saves about 51 cycles.
  • Iteration 4: [853 cycles] {{S}-256}*2+{{S+1}-256}*2+{{S+2}-256}*2+{{S+5→S-2}-256}*2+{{S-1}-256}*2+{{S}-256}*2+{{S+1}-256}*2+{{S+3→S-1}-256}→{D}+1→D
    • Saved 11 cycles by moving the bit-exploded sprite to ᴇFF00, allowing for it to be accessed simply by subtracting 256 (only takes 4 cycles!) from the index, but the shifting had to be re-introduced. However, due to only having one version of the sprite, the ~26 cycles is reclaimed for a total savings of about 37 cycles.
  • Iteration 5 (final): [~802 cycles] {{S}-256}+{{S+1}-256}+{{S+2}-256}+{{S+5→S-2}-256}+{{S-1}-512}+{{S}-512}+{{S+1}-512}+{{S+3→S-1}-512}→{D}+1→D
    • Essentially merged iterations 3 and 4, saving 77 cycles by removing the shifting and incurring ~26 cycles by having eight copies of the bit-exploded sprite, for a total savings of about 51 cycles.
    • It may look like I only have two copies of the bit-exploded sprite, one at ᴇFE00 and one at ᴇFF00, but in fact I have one at every 64-byte interval in the whole 512-byte region. How can you do that with only one 6144-byte texel LUT, you might ask? The answer is quite simple, really: instead of each entry simply being a value 0-63, added to that is a variant of X^4*64, which essentially assigns each combination of the lower two bits of X to a different 64-byte bit-exploded sprite. Multiply those four combinations by having two 256-byte areas for eight bit-exploded sprites.
    • I didn't mention this in iteration 3, but allowing sprite data to be stored in the ᴇFE00-ᴇFFFF range is tricky, as the stack and other important OS stuff is around there. I'll spare you the details, but generally I copied whatever was there out, made sure interrupts were off so nothing would try to access it, and abused recursion to get the stack pointer outside of the region.
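To make the arithmetic in the list above a bit more concrete, here is a small Python model (not Axe, and not taken from the attached source). It checks two things: that adding texels read from eight pre-shifted copies gives the same byte as the shift-and-add chain, and how the ~26-cycle amortized rotation cost falls out of the figures quoted. The function names and the sample texel values are made up for illustration.

```python
def pack_shift_add(texels):
    """Iterations 0-2 and 4: shift the result left, then add the next texel."""
    result = 0
    for t in texels:
        result = result * 2 + t
    return result


def pack_preshifted(texels):
    """Iterations 3 and 5: copy k already holds its bit in position 7-k,
    so the eight reads are simply added with no shifting."""
    return sum(t << (7 - k) for k, t in enumerate(texels))


sample = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical texels, leftmost pixel first
assert pack_shift_add(sample) == pack_preshifted(sample) == 0b10110010

# Amortized cost of rotating eight copies instead of one (figures from the post).
extra_per_frame = 22800 - 2850            # extra rotation cycles per frame
extra_per_byte = extra_per_frame / 768    # spread over 768 screen bytes, about 26
net_saving = 77 - round(extra_per_byte)   # about 51 cycles per byte
```

The same bookkeeping explains iteration 5: 853 - 77 + 26 lands on roughly 802 cycles.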



Now, the real fun one, the assembly core. This may make varying degrees of sense to you, as not only is it assembly, but it's quite hacky assembly. I'll paste the source I used for reference here, and then give a quick rundown of how it actually works.

Code:
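Disp:
ld (spSave),sp ;save the real stack pointer
ld c,$20 ;LCD command for the first column
ld a,$80 ;LCD command: set row 0
out ($10),a
ld sp,(LUT) ;point the stack at the texel-mapping LUT
ColLoop:
ld de,64*256+%11111110 ;d = 64 rows per column, e = seed for each byte
ld hl,ColLoop ;so jp (hl) in the codelets comes back here
ld a,c
inc c
ld b,7
djnz $ ;short busy-wait for the LCD driver
out ($10),a ;152cc into, 153cc loop
cp $2C ;carry set while columns remain
ld a,e
ret c ;"return" into the first codelet of the column
ld sp,(spSave) ;all columns done: restore the stack pointer


;Pixel:
add a,a ;or adc a,a
ret c ;carry set: byte not done, "return" into the next codelet
out ($11),a ;147cc into, 148cc loop
ld a,e ;reseed for the next byte
scf
dec d
ret nz ;more rows in this column: on to the next codelet
jp (hl) ;column done: back to ColLoop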

Firstly, a lot of time is saved by making the core render right to the LCD, skipping the extra ~60000 cycles of writing to a buffer and then having to read it back out later with delays injected for the slow LCD driver. But the real meat of the speed boost comes from not simply storing a 64-byte bit-exploded sprite, but storing a 576-byte array of 64 9-byte "codelets," each responsible for one pixel/texel of the source sprite (this is the code you see labelled ;Pixel:). Each byte (8 pixels) is initialized with a=%11111110 and the carry flag set, so for each pixel, you can simply perform add a,a to shift the result left by one bit and rotate in a 0 bit, or adc a,a to shift the result left by one bit and rotate in the carry bit, which will always be 1. That is, it will always be 1 until eight of these bits have been shifted in and that 0 in the lowest bit of %11111110 is finally shifted out, which allows for easy determination of when a byte is done by checking the state of the carry flag (1=not done, 0=done).
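The seed-and-carry trick is easier to see when stepped through by hand. Below is a minimal Python model of it, assuming nothing beyond ordinary 8-bit add/adc behaviour. shift_in and render_byte are illustrative names, not anything from the actual source.

```python
def shift_in(a, carry, texel):
    """Model `add a,a` (texel 0) or `adc a,a` (texel 1) on an 8-bit register.
    Returns the new register value and the new carry flag."""
    total = a + a + (carry if texel else 0)
    return total & 0xFF, total >> 8


def render_byte(texels):
    """Shift texels into the seeded register until the carry flag clears."""
    a, carry = 0b11111110, 1   # seed value and carry state described above
    shifts = 0
    for t in texels:
        a, carry = shift_in(a, carry, t)
        shifts += 1
        if carry == 0:         # the seed's 0 bit was just shifted out: byte done
            break
    return a, shifts


value, count = render_byte([1, 0, 1, 1, 0, 0, 1, 0])
```

In this model the carry stays set for the first seven shifts and clears on the eighth, no matter which texels are shifted in, so no separate bit counter is needed.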

So each codelet writes its texel in only one instruction with lightning-fast speed, only 4 cycles. The real trick, then, is directing control to these 64 codelets with lightning-fast speed. And just as each codelet handled writing its texel in only one instruction, this is also achieved in only one instruction: ret c. This conditional return, which returns only if the carry flag is set, uses the carry flag effects described above to handle the done/not done determination. But how the hell is a return supposed to help, and when does the 6144-byte texel-mapping LUT get read? The answer is simple: that's what the return does! The texel-mapping LUT was instead blown up to 12288 bytes, giving each pixel 2 bytes, just like a stack entry. And instead of simply storing one of 64 texel indices, each entry stores a pointer to one of the 64 codelets. So to start rendering a frame, the stack pointer is pointed to the start of the LUT and a simple return will take you to the codelet to produce the next pixel!
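For anyone who finds the "return as dispatch" idea hard to picture, here is a rough Python analogy. It is only a sketch: a list index stands in for the stack pointer, list entries stand in for the 2-byte codelet pointers, and closures stand in for the 9-byte codelets. The 4-texel sprite and the LUT contents are invented for the example.

```python
def make_codelet(texel_bit):
    """Stand-in for one codelet: shifts its texel into the accumulator."""
    def codelet(a, carry):
        total = a + a + (carry if texel_bit else 0)   # add a,a or adc a,a
        return total & 0xFF, total >> 8
    return codelet


def render(lut, codelets, byte_count):
    """Walk the LUT the way the stack pointer would, one pop per pixel."""
    out = []
    sp = 0                               # models SP pointing into the LUT
    for _ in range(byte_count):
        a, carry = 0b11111110, 1         # seed for each byte
        while carry:                     # models the conditional return
            a, carry = codelets[lut[sp]](a, carry)
            sp += 1                      # each "return" pops one LUT entry
        out.append(a)                    # models out ($11),a
    return out


sprite = [1, 0, 1, 1]                    # hypothetical 4-texel sprite
codelets = [make_codelet(bit) for bit in sprite]
lut = [0, 1, 2, 3, 3, 2, 1, 0,           # screen pixels -> texel indices
       1, 1, 1, 1, 0, 0, 0, 0]
frame = render(lut, codelets, 2)
```

The point of the analogy is that the loop never tests or decodes the LUT entry. Fetching the entry and jumping to the right codelet are the same step.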

For a time analysis, the extremely tight loop of ad(d/c) a,a \ ret c to produce each bit gives a stupidly low 15 cycles per bit. Multiplying this by 8 for eight bits in a byte, and adding the overhead of handling each full byte, each byte takes a mere 148 cycles to render; about 5 times faster than my fastest Axe version. Taking into account the fact that the Axe version still needs to perform an LCD update and the different formats of the data that needs to be rotated each frame, the assembly version ultimately comes out to about 6x faster than the Axe version, and with only 46 bytes worth of unique assembly instructions executed.
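One way to arrive at the 148-cycle figure is to add up the standard Z80 instruction timings for the codelet. The sketch below does that in Python. The constant names are mine, and the breakdown (seven taken returns, then one return that falls through) is my reading of the code, not something stated in the post.

```python
# Standard Z80 T-state counts for the instructions in the codelet.
ADD_A_A = 4          # add a,a and adc a,a
RET_CC_TAKEN = 11    # conditional ret, condition true
RET_CC_SKIPPED = 5   # conditional ret, condition false
OUT_N_A = 11         # out ($11),a
LD_A_E = 4
SCF = 4
DEC_D = 4

per_bit = ADD_A_A + RET_CC_TAKEN          # 15: shift, then jump to next codelet
last_bit = ADD_A_A + RET_CC_SKIPPED       # 9: eighth shift, ret c falls through
byte_overhead = OUT_N_A + LD_A_E + SCF + DEC_D + RET_CC_TAKEN   # 34
per_byte = 7 * per_bit + last_bit + byte_overhead               # 148

speedup_vs_axe_core = 802 / per_byte      # about 5.4
```

The total agrees with the ;147cc into, 148cc loop comment in the source, and 802 / 148 is where "about 5 times faster" comes from.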



Final notes:

While writing this post, I realized the mistake I made with the pure Axe version (I allocated enough space in the high memory region for the sprite data, but not for the stack space that the rest of my program actually needs). I believe I have now fixed it and the Axe version should be stable, so I have updated that file in my original post.

Matrefeytontias, since most of this code is still yours and I would have had no clue about this general rendering method or how to make any of these effects myself, feel free to absorb my code into yours and use it as you see fit.

Also, my computer had a massive glitch in which the screen had crazy artifacts everywhere and the only audio was digital noise when I was about 95% done writing this post. I had feared that I lost my ~1.5 hours of work. Thank you Google, because when I restarted, Chrome prompted to restore my session and it had amazingly saved everything I wrote.

440
Axe / Re: [Axe] Plane deformations are fun
« on: February 02, 2014, 02:09:43 am »
My turn!

I was convinced this effect could be made faster, and after a lot of careful thought and crazy tricks, I managed to bump up the FPS: from 14.3 to 18.5, an improvement of about 30%! Again, it is running at 15MHz. It may look slightly different, as I had to rotate the texture up and right by one pixel per frame (rather than down and right) due to complications of the immensely aggressive optimization. I'll attach the source to this post, and here's a gif proving that it does indeed work:



Spoiler For Oh, and did I mention...:
... that that's just the pure Axe version? :evillaugh:

As is always the case for assembly, and is especially the case for really specific, concise algorithms, if you know what you're doing you can get big performance gains over a compiled language. Re-coding only the rendering core in just as (if not more) aggressively optimized assembly, I registered a huge boost in FPS: 18.5 to 44, an improvement of about 140% on my pure Axe version and 200% on the existing assembly core version! The source for this will be attached as well, and luckily this one is 100% stable, so play with it all you want! Here's a gif again, showing it off, although keep in mind that it's actually rendering about twice as many frames as the gif captured:



Spoiler For Wait a second...:
... why does that gif say 6MHz and load effects so slowly? Because it is 6MHz! :evillaugh: :evillaugh: :evillaugh:

The true 15MHz FPS for the version with the assembly core is a stupidly high 107! This is with no pre-rendering of frames or any such cheating; every frame is rendered pixel-by-pixel as always. So the total performance markup on the original 14.3 FPS comes to about 650%. You can try it for yourself by simply un-commenting the Full in the setup part of the assembly version source. And although it doesn't even capture a quarter of the frames rendered and the original effects are all but impossible to discern, here's a gif:



Beat that. :P

EDIT: Apparently, at 15MHz, the assembly version is too fast for the LCD driver on my calculator and glitches out a bit. Whoops. I should probably fix that...
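As a quick sanity check of the percentages quoted in this post, here is a tiny Python sketch that recomputes them from the FPS figures given. gain_percent is just an illustrative helper.

```python
def gain_percent(old_fps, new_fps):
    """Relative improvement of new_fps over old_fps, in percent."""
    return (new_fps / old_fps - 1) * 100


axe_gain = gain_percent(14.3, 18.5)        # original -> pure Axe version
asm_6mhz_gain = gain_percent(18.5, 44)     # pure Axe -> assembly core (6MHz gif)
asm_15mhz_gain = gain_percent(14.3, 107)   # original -> assembly core at 15MHz
```

These come out near 29%, 138% and 648%, in line with the rounded "about 30%", "about 140%" and "about 650%" above.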

441
WabbitStudio Software Suite / Re: CSE support for Wabbitemu ?
« on: January 30, 2014, 11:01:23 pm »
It is coming soon, but I'm still working on resolving all the complexities of the LCD. You can test my latest build here: http://buckeyedude.zapto.org/Revsoft/Wabbitemu/Beta/Wabbitemu.exe

BuckeyeDude

Boy, I know calc84maniac, myself, and some others have awaited this for a long time now. I tried it out quickly, and it certainly seems to be in a basic functioning state. Of course a number of features haven't been made compatible yet, like taking screenshots, but I did manually take this one of calc84maniac's stupendous technological achievement, Steins;Gate, to show off that the hardware emulation is working at least fairly well:



You likely already know about most issues, but I'll mention two that I've discovered.

First, MicrOS does not function properly, which could suggest any number of bugs. DrDnar all but completely throws out the OS and reboots the calculator himself, so I'm guessing at least one of the low-level hardware aspects of its initialization procedure is causing issues. Perhaps DrDnar can be contacted and recruited to help debug its support, and/or you could try yourself with the full source included in the download.

Second, the performance! It's likely that you simply haven't gotten around to implementing the screen in a properly optimized manner, but I'm seeing generally slow emulation, and unless my eyes are deceiving me, possibly a decent amount of frame skipping on top. If you'd like any assistance in understanding the (much more complicated than the 83+/84+) driver, I might be able to answer questions, or you could probably ask in IRC or make a dedicated topic to get responses from more knowledgeable people.

But a natively running 84+CSE emulator is a huge boon, so thanks for the good work, and keep it up!

442
is a savestate option possible?

I've been occasionally pestering calc84maniac to add this for the last few years. I've even talked through most of the implementation hurdles with him on IRC, but nothing has come of it yet.

* Runer112 violently stabs calc84maniac

443
Axe / Re: Axe Q&A
« on: January 16, 2014, 09:41:20 pm »
(A) While a full-screen update on the non-color calculators takes about 1/100th of a second and almost everything that uses it relies on it being fast, a full-screen update on the color calcs takes about 1/5th of a second
Yeah but we could cheat with some frameskip and with the double-pixel trick. Moreover, some coders (like matref and I, or Builderboy and others) still have the possibility to trigger the 15MHz mode to get more speed on the CSE without losing any compatibility this time, so our programs of course will have speed issues, but also some speed gains.
Also that depends on the program you are porting, because I don't think AudaciTI will have too many issues with refreshing since nothing happens on the screen except menus :P

Unfortunately, these points are not really valid. The ~5Hz screen update rate I cited is with 15MHz mode enabled already. If you tried frameskip, considering most non-color games probably run at 20-60fps, you'd be skipping anywhere from 75% to 90% of frames, which is very poor. And finally, it doesn't really matter how few differences there are between frames, because DispGraph has always been a full-frame resend command and can't really be modified to (efficiently) perform incremental updates.
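The frameskip figures follow directly from the ~5Hz update rate. Here is a small Python sketch of the arithmetic; skipped_fraction and its lcd_fps parameter are names chosen for the example.

```python
def skipped_fraction(game_fps, lcd_fps=5):
    """Fraction of game frames never shown if the screen updates at lcd_fps."""
    return 1 - lcd_fps / game_fps


low_end = skipped_fraction(20)    # 0.75 for a 20fps game
high_end = skipped_fraction(60)   # roughly 0.92 for a 60fps game
```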

444
Axe / Re: Axe Q&A
« on: January 15, 2014, 09:03:21 pm »
I just got a stupid idea. Why not make an early version of Axe for CSE that would be exactly the same as the current Axe, but with a changed DispGraph (and DispGraphr, etc.) so that it displays a stretched L6 (and L3, etc.) on the screen?

(A) While a full-screen update on the non-color calculators takes about 1/100th of a second and almost everything that uses it relies on it being fast, a full-screen update on the color calcs takes about 1/5th of a second, (B) probably other things I can't think of right now, and (C) why not just use a non-color calculator in that case?

I do promise that I'd like to get a color version of Axe done eventually, but my time right now is superbly limited. I'm working 45 hours a week. x.x

445
Site Feedback and Questions / Re: Cookie Pickup Bug?
« on: January 12, 2014, 03:16:35 pm »
www versus no www is definitely the issue. Is this something that can be fixed site-side? Or some easy way I can fix it on my side (other than simply using the different URL, of course)?

446
Site Feedback and Questions / Cookie Pickup Bug?
« on: January 12, 2014, 10:55:58 am »
For me, the most useful page on Omnimaga is Latest Activity, so when I got my new computer, I made sure that's always the page I start at when navigating to Omnimaga. But when I started doing this, I noticed that navigating straight to that URL and not getting there by clicking the "NEW POSTS" link on another page results in my account not being logged in. This results in annoyances like the post times not being adjusted to my locale, so I have to then click the "NEW POSTS" link to reload the page and pick up my account.

Browser version: Chrome 32.0.1700.72 m. I don't notice this issue when directly navigating to any other pages on Omnimaga. This issue also doesn't seem to occur at all in my quick test with Internet Explorer 11.0.9600.16476. So the fact that it only occurs on one page suggests to me that it's a site bug, but the fact that it doesn't occur in another browser counters with the suggestion that it's a browser bug.

447
Yes, even if your calculator is "off", it still uses battery power.

Well, technically. But I think the answer he was looking for is really no, because it doesn't use any more power (or maybe a tiny bit more) than when you turn the calculator off normally in the OS.

448
TI Z80 / Re: [Axe] Free bitmapper/pixelmapper/smoothscroller for everyone
« on: January 11, 2014, 04:53:51 pm »
Hey, that looks familiar... But I'm assuming yours is some combination of faster, more featured, or easier to use than my 2-year-old bitmapper?

449
Other Calculators / Re: [AXE] Little contest :D
« on: January 02, 2014, 03:30:13 pm »
/2 is 4 bytes, so dividing by 8 with /2/2/2 would be 4+4+4=12 bytes, whereas *32/256 is 5+3=8 bytes (of course this only works if you can afford to lose the top 5 bits with *32). Dividing by 256 with /2/2/2/2/2/2/2/2 would be pretty wasteful, 4*8=32 bytes versus /256, which is only 3 bytes.
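The "lose the top 5 bits" caveat can be checked with a quick Python model of 16-bit unsigned arithmetic. This is only a sketch of the math, assuming the multiply wraps at 16 bits. It says nothing about how Axe actually compiles either form, and the function names are invented.

```python
MASK16 = 0xFFFF


def div8_chain(x):
    """/2/2/2: three unsigned halvings."""
    return ((x // 2) // 2) // 2


def div8_mul_trick(x):
    """*32/256: the multiply wraps at 16 bits, then the high byte is taken."""
    return ((x * 32) & MASK16) // 256


agree_below_2048 = all(div8_chain(x) == div8_mul_trick(x) for x in range(2048))
first_mismatch = next(x for x in range(65536)
                      if div8_chain(x) != div8_mul_trick(x))

# Size comparison using the byte counts given in the post.
chain_bytes = 4 + 4 + 4   # /2/2/2
trick_bytes = 5 + 3       # *32 then /256
```

In this model the two forms agree for every value below 2048 (11 bits) and first differ at 2048, where *32 overflows.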

The and 0 is effectively just 0, but one byte smaller because it doesn't set the high byte. I used it because the pixel operations throw out the high byte of coordinates anyways (and as Hayleia pointed out, even if the high byte was used, it would already be 0 because r1 is between 0 and 63).

450
Other Calculators / Re: [AXE] Little contest :D
« on: January 02, 2014, 01:50:22 pm »
Oops, forgot this thing was ending! Anyways I guess here's my prize code, coming in at 317 bytes in the memory management menu (309 bytes in Axe compile screen). You could argue that there is one slight glitch, in that the very first frame displayed is whatever was in the main buffer before the program started, but I see that everyone else cut that corner too so nobody got an unfair gain from it. :P

The source is in the code box below and attached.

Code:
.S

ᴇ847A→°SX
ᴇ8481→°SY

FnOff

While 1
DispGraphClrDraw
(Select((getKey(3)+SX+1-getKey(2)??96)-97,PutSprite())?+96)PutSprite()
getKey(1)-getKey(4)+SY→SY
EndIf getKey(15)

Lbl PutSprite
→SX
64
While 1
If pxl-Test(-1→r₁, and 0,[AA55AA55AA55AA55])
Pxl-On(r₁ and 7+SX,r₁*32/256+SY and 63)
End
End!If r₁
Return
