This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.
Messages - Xeda112358
Pages: 1 ... 37 38 [39] 40 41 ... 317
571
« on: January 25, 2014, 09:27:38 pm »
Thanks! Hopefully I can find more tomorrow.
This did let me remove the "ld d,a \ stuff \ ld a,d" and just put the "ld d,a" outside the loop and use "rrc d" instead of "ld a,d \ rrca". It saves 8 t-states, 0 bytes.
572
« on: January 25, 2014, 09:13:20 pm »
Haha, touché. Also, what rules are you using? Like, do you move down 1/2 the time, and move left/right 1/4 of the time each?
573
« on: January 25, 2014, 08:54:06 pm »
Hmm, you could use a matrix (it uses indices the same as pixel coords or homescreen coordinates, so "Y" then "X").
This way, the program doesn't start to slow down as more snow flakes are added. Basically, just create an 8*16 matrix (it will be huge, being the size of a 128 element list). Then you can just check if a snow flake is already in a spot as if you were reading homescreen coordinates:
[A](Y,X
This might speed things up a bit (not sure).
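To make the occupancy-grid idea concrete, here is a minimal Python sketch (the names and the True/False convention are mine, purely for illustration): one cell per home-screen coordinate, so a collision test is a single lookup instead of scanning a growing list of flakes.

```python
# Occupancy grid: one cell per home-screen coordinate (8 rows x 16
# columns), indexed (Y, X) like the matrix in the post.
ROWS, COLS = 8, 16
grid = [[0] * COLS for _ in range(ROWS)]  # 0 = empty, 1 = flake

def occupied(y, x):
    """Collision test is a single lookup, independent of flake count."""
    return grid[y][x] == 1

def place_flake(y, x):
    """Place a flake if the cell is free; report whether it landed."""
    if not occupied(y, x):
        grid[y][x] = 1
        return True
    return False
```

The win is that the cost per snowflake stays constant no matter how many flakes are already on screen.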
574
« on: January 25, 2014, 08:50:07 pm »
I made a neat little algorithm today that seems to do an okay job of approximating sine. My original intention was to create a routine for x(1-x) in a fixed-point format where x is on [0,1). However, if 'x' is in 'a', I can just do a*(-a), but NEG is 2 bytes, 8 cycles, whereas CPL is 1 byte, 4 cycles. That made me wonder if I could make a fast routine to multiply A by its complement (is it called the ones' complement?). After working in my notebook with some more abstract approaches (I used 7th-degree polynomials), I reduced it to a fairly simple solution that would be much simpler to integrate into an IC or something. However, I decided to experiment a little with the result by seeing what happens if I don't let some bits propagate into the upper 8 bits (remember, 8.8 fixed point). So basically, I computed the product of 7th-degree polynomials with binary coefficients, truncated any terms of degree 7 or lower, then converted it back to binary, and the result is a close approximation of x(1-x). This happens to be a close approximation of sine, too, so after some tweaking, I got it to have the same input/output range as Axe (I recall that Quigibo said he used x(1-x) as an approximation to sine, too). The following uses 3 iterations as opposed to 8 and is faster than the Axe version, but larger by 8 (?) bytes:

p_Cos:
    ld a,l
    add a,64
    ld l,a
p_Sin:
    ld a,l
    add a,a
    push af
    ld d,a
    ld h,a
    ld l,0
    ld bc,037Fh
__SinLoop:
    sla h
    sbc a,a
    xor d
    and c
    add a,l
    ld l,a
    rrc d
    srl c
    srl c
    djnz __SinLoop
    ld h,b
    pop af
    ret nc
    xor a
    sub l
    ret z
    ld l,a
    dec h
    ret

;This:
;  34 bytes
;  269 t-states min, else 282, else 294
;  avg. 76 t-states faster than Axe
;Axe:
;  27 bytes
;  341+b t-states min, else 354, else 366
If the bits of register 'l' are sabcdefg, this basically returns: (0aaaaaaa^0bcdefg0)+(000bbbbb^000cdefg)+(00000ccc^00000def)
^ is for XOR logic, + is regular integer addition modulo 256
(s is the sign bit.) The original algorithm was going to be for computing the natural logarithm; I came up with some new (to me) algorithms for that as well. EDIT1: Thanks to calc84maniac for pointing out the optimisation using a different order of operations with xor/and! Saved 1 byte, 12 cycles. That let me shuffle some code to save 8 more cycles. EDIT2: Modified the last note to reflect the normalized input/output to match that of Axe.
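For anyone who wants to play with the result without an emulator, here is my reading of the bit formula above transcribed into Python. This is a sketch of the formula as posted (not a disassembly of the routine), and the function name is mine:

```python
# Bit-level transcription of the posted formula. The input byte l
# has bits s a b c d e f g (s = sign, bit 7). Each term XORs a run
# of one repeated bit against a slice of the lower bits, and the
# terms are summed modulo 256. For l in 0..127 (x = l/128), the
# result approximates 512*x*(1-x), the parabola used for sine.
def fast_sin(l):
    l &= 0xFF
    s = l >> 7                                            # sign bit
    t1 = (0x7F if l & 0x40 else 0) ^ ((l & 0x3F) << 1)    # 0aaaaaaa ^ 0bcdefg0
    t2 = (0x1F if l & 0x20 else 0) ^ (l & 0x1F)           # 000bbbbb ^ 000cdefg
    t3 = (0x07 if l & 0x10 else 0) ^ ((l >> 1) & 0x07)    # 00000ccc ^ 00000def
    t = (t1 + t2 + t3) & 0xFF
    return -t if s else t
```

As a sanity check: at l=64 (the peak) this gives 127, and at l=32 it gives 95 versus 96 for the exact parabola 512*x*(1-x) at x=0.25.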
575
« on: January 25, 2014, 08:33:17 pm »
How do you detect snowflake collision?
576
« on: January 25, 2014, 08:31:20 pm »
This will be interesting. I probably won't have much time to work on the projects because I have a lot of work and school, but I might try.
577
« on: January 25, 2014, 08:29:46 pm »
Oh no, I meant how would I send the numbers to the asm routine :p sorry for not being clear.
For example, would I set A to the number to rotate?
What MGOS said is indeed the way to pass the value to the routine: just put it in HL (Axe's Ans) - that is usually the last value you worked with. Just write the expression (or the single variable) on the line before the routine. The same way (with the sto-> operator) you can retrieve the value returned by the routine.
:A
:Asm(STUFF)
:->A
578
« on: January 23, 2014, 11:30:19 am »
Okay, say your 16-bit number has bits "abcdefghijklmnop". If you want to rotate the whole thing left to "bcdefghijklmnopa", I would do this, which averages half a cycle faster for the same size:
Asm(2930012C)
In assembly:
    add hl,hl
    jr nc,$+3
    inc l
If you want to rotate the bytes separately (bcdefgha and jklmnopi) and then add them together as two 8-bit numbers, bcdefgha+jklmnopi, back into Axe's "Ans":
Asm(7C07CB05856F2600CB14)
    ld a,h
    rlca
    rlc l
    add a,l
    ld l,a
    ld h,0
    rl h
If you just want the 16-bit number bcdefghajklmnopi:
Asm(CB04CB05)
    rlc h
    rlc l
The right rotations for all of those:
;abcdefghijklmnop -> pabcdefghijklmno
Asm(7C0FCB1DCB1C)
    ld a,h
    rrca
    rr l
    rr h

;abcdefghijklmnop -> habcdefg+pijklmno
Asm(7C0FCB0D856F2600CB14)
    ld a,h
    rrca
    rrc l
    add a,l
    ld l,a
    ld h,0
    rl h

;abcdefghijklmnop -> habcdefgpijklmno
Asm(CB0CCB0D)
    rrc h
    rrc l
I hope I got these correct.
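If it helps to check the bit movement against something runnable, here is a rough Python model of the three left-rotation variants (the right rotations are symmetric). The function names are mine; the hex strings above are the actual code:

```python
# Pure-Python models of the three left-rotations described above.
MASK16, MASK8 = 0xFFFF, 0xFF

def rotl16(x):
    """abcdefghijklmnop -> bcdefghijklmnopa (full 16-bit rotate)."""
    return ((x << 1) | (x >> 15)) & MASK16

def rotl8(b):
    """8-bit rotate left, the building block for the byte-wise forms."""
    return ((b << 1) | (b >> 7)) & MASK8

def rot_bytes_then_add(x):
    """Rotate each byte, then add them: bcdefgha + jklmnopi (9-bit sum)."""
    return rotl8(x >> 8) + rotl8(x & MASK8)

def rot_bytes(x):
    """Rotate each byte in place: bcdefgha jklmnopi."""
    return (rotl8(x >> 8) << 8) | rotl8(x & MASK8)
```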
579
« on: January 20, 2014, 01:13:28 pm »
Ah, cool, you know 68k Asm? There are a few calcs with Motorola 68000 processors, like the TI-89 and 92 calcs, but there isn't as much 68k activity as there is for the TI z80 calcs and the Casio and HP calcs. I saw some of your sprite work, it is excellent! Welcome to Omni
580
« on: January 20, 2014, 10:50:36 am »
Speed has been a noticeable issue since the last time I talked to calcdude on IRC. It turns out that AllocateRAM, the fancy wrapper I made around the OS bcalls _InsertMem and _DelMem-- while fantastic-- is really slow. I expected it to be slow, but not nearly as slow as it actually was in practice. On top of that, the way I was handling things before used a bunch of push/pop operations for the various stacks, and these each required at least one call to AllocateRAM. That meant the following took over 2 million clock cycles (1/3 of a second at 6MHz): 3*3*3*3*3*3.
After seeing that, calcdude mentioned that my method sounded kind of redundant, and for the most part it was. I had scratch RAM that I was working with anyway, so just switching to that saved about 700 000 t-states. I can't remember what else I did, but I have somehow managed to reduce the time needed to about 680 000 t-states (I cut out at least half the push/pop operations). This is still terribly slow, since each push and pop seems to take over 60 000 t-states, so I will probably use completely custom InsertMem/DelMem routines. I will also likely reorganise the memory structure of the GDB var I am using. While the LangZ80 parser is active, GDB13 will be expanded to fill RAM:
- The start will hold the variable LUT for the binary search (grows upward).
- Following this will be the variable data, the second most volatile region (grows upward).
- Next will be the AddrStack, a stack of pointers that get updated with AllocateRAM (grows upward, least volatile).
- At the end will be the AnsStack, which grows downward.
This should mean that pushes and pops on the AnsStack will be as simple as updating a pointer, plus an LDIR if a push is performed. This will completely eliminate the use of AllocateRAM in that context. When adding a new variable, I can just insert it between the last variable and the AddrStack. Since the AddrStack will usually be small (<100 bytes, though I haven't had it larger than 2, yet), I will only need to move a small number of bytes to insert a new variable. The slowest operation will then be resizing a variable. For example, if there are 10000 bytes of variables that need to be shifted to add or remove bytes from a variable, that will take over 200000 t-states.
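To make the AnsStack idea concrete, here is a tiny Python model of a downward-growing stack in a flat RAM block. The sizes and names are made up; on-calc, the push would be a pointer update plus an LDIR:

```python
# Model of a stack at the top of a RAM block that grows downward.
# A push moves the pointer down and copies the data in; a pop is
# just pointer arithmetic. No memory ever needs to be inserted or
# deleted, which is the point of avoiding AllocateRAM here.
RAM_SIZE = 256
ram = bytearray(RAM_SIZE)
ans_top = RAM_SIZE                  # empty stack: pointer at the very end

def push(data):
    global ans_top
    ans_top -= len(data)
    ram[ans_top:ans_top + len(data)] = data   # the LDIR step

def pop(size):
    global ans_top
    data = bytes(ram[ans_top:ans_top + size])
    ans_top += size
    return data
```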
Finally, another speed optimisation that I made-- which seems tiny compared to the one above-- is in my binary search routine. Before, it was taking over 7100 t-states to locate the correct function, but I have now reduced that to 3100. I have added in the GETKEY routine, and I will try to add in a WHILE() routine, too. This will use the AddrStack to keep track of where the function returns and such. This means I will need to write the routines for comparisons.
581
« on: January 20, 2014, 08:09:56 am »
582
« on: January 19, 2014, 10:08:49 pm »
Well, I could definitely add a command with that functionality. I guess it would take arguments such as the variable name, an offset, the number of bytes to read, and the location to output to? It would be one of the lower-level commands, but it would be doable.

Now, here is where I am running into trouble: I realised before I added the "+" and ADD() instructions that string concatenation could reasonably exceed the 511 bytes of scratch RAM. As well, this scratch RAM (which is in fact 514 bytes) was allocated for the arbitrary-precision arithmetic. For the much higher precision stuff, I would need some other dynamic location, so I am thinking of allowing those types of intermediate calculation to take place after the user variables. On top of that, if I really want to remove this project even further from the OS, I want to take advantage of all of the available RAM pages. I am currently using 16-bit offsets to locate variables, so I am thinking that the easiest solution to working with the >16-bit range is to designate a table for every <65536 bytes. Either that, or I need to switch to 24-bit addressing, which might be the best option (but more difficult) if I plan to work with the flash, too.

My temporary fix for the first problem is a new variable type, "indirect." An indirect variable just holds the actual variable's name, so when an indirect var is parsed, the parser tries to locate the actual var (and if that one is also indirect, it keeps going until it finds the root variable). The second problem will just be a pain for later.

While I was thinking about these problems, I decided to work out one of the features that I forgot about. The first day that I wrote the parser, I added code to allow reading parts of lists, arrays, and strings. Of these, only string variables are actually supported.
This means I have a routine that can get the size in bytes of a string variable as well as locate a given element, and I have a routine that can generate a string variable and convert it to a string-- strings were trivial. On testing STRING1[3] (grab byte #3, 0-indexed, from STRING1), it crashed, but I managed to restructure it and get it working. The whole thing still has bugs, unfortunately, but it is attached in case you want to experiment a little. I have to figure out why it only parses one line now. If you want to step through with the debugger, you can set a breakpoint at $4105, which is currently where the call to the parser is made to interpret prgmTEST. EDIT: One person downloaded this, but I found that I left in a "jr $" opcode for debugging the routine that overwrites an existing variable. I reuploaded a fixed version. I usually use "jr $ " (with a space after the $) so that I can easily find it with a search function and remove it.
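For the curious, the "indirect" variable type described above boils down to chasing names until a root variable is found. A hypothetical Python model (the table contents and type tags are mine, just to show the shape of the lookup):

```python
# Each entry is (kind, value): a "data" entry is a root variable,
# an "indirect" entry names another variable. Lookup follows the
# chain to the root, with a guard in case two indirects ever point
# at each other.
variables = {
    "STR1":   ("data", b"HELLO"),
    "ALIAS":  ("indirect", "STR1"),
    "ALIAS2": ("indirect", "ALIAS"),
}

def resolve(name):
    seen = set()
    while True:
        kind, value = variables[name]
        if kind != "indirect":
            return name, value          # found the root variable
        if name in seen:
            raise ValueError("indirect loop at " + name)
        seen.add(name)
        name = value                    # follow the chain
```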
583
« on: January 19, 2014, 11:18:01 am »
What do you mean by direct writing to the gbuf? Like pixel commands? Except for the loops, the rest of it could be added now. I finally got variables working, so basically what is left is adding new commands and support for various data types.
584
« on: January 18, 2014, 09:34:37 pm »
It would only require me to change from using 1 byte for the size to 2 bytes. In fact, I made an app two (or three?) years ago that used precision that high. However, it took over an hour to perform a large multiplication. I think that was mostly due to the method I was using to multiply. The algorithm I am currently using is much faster, but I use an even faster version in my floating-point library for 80-bit floats. I could probably add another type of int and float that is basically max precision.
585
« on: January 18, 2014, 02:53:11 pm »
Also, as a note, this is how the coordinate system works at the hardware and assembly level. TI-BASIC converts the coordinates for all of its drawing commands to pixel coordinates, which takes a lot of time. This is nice for math purposes, but inefficient for graphics.
To give you an idea, you could probably draw a line in Axe a few dozen times in the time it takes TI-BASIC to draw a single Pt-On( command, since it has to convert the inputs, transform them into graph coordinates, convert those float values to integers that can actually be used directly, then map those coordinates to a pixel location and plot it. Axe only needs to do the last step (which is the fastest one).
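As a rough model of the conversion TI-BASIC has to perform for every draw command (window bounds here are the usual ZStandard values, and the 94x62 drawing range is an approximation of the graph screen; treat the exact constants as illustrative):

```python
# Window (graph) coordinates -> integer pixel coordinates.
# TI-BASIC does this float math for every point; Axe takes pixel
# coordinates directly and skips straight to plotting.
XMIN, XMAX = -10.0, 10.0
YMIN, YMAX = -10.0, 10.0

def graph_to_pixel(x, y):
    px = round((x - XMIN) / (XMAX - XMIN) * 94)   # column, 0..94
    py = round((YMAX - y) / (YMAX - YMIN) * 62)   # row, 0..62 (y inverted)
    return px, py
```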