After seven years, Grammer was revived with a new update! In fact, since then, there have been three more major updates, including today's Grammer 2.51.0.0.
After losing the source code in 2013, Grammer progress died, but in 2016 Hans Burch (ticalc) recovered the source code, presumably by disassembling the last binary and comparing it to an old source file. I didn't actually get back to working on Grammer until late 2018, but since then it has had fairly steady updates. I first documented, reorganized, and cleaned up the code, then put it up on GitHub. I started adding features and managed to rope @NonstickAtom785 into the action with bug reports, improved documentation, and recently some bug fixes.
In the past 16 months, there have been many updates and modifications to expand Grammer's abilities. With these new updates, Grammer has now graduated from a 1-page app to a 2-page app (16384 bytes vs. 32768 bytes), but the updates are (hopefully) worth it!
Here is a quick summary of added features since the 2012 update.
Grammer now has a real main menu! Instead of showing one entry at a time, you can now scroll through your programs in a proper scrollable menu.
Grammer now has support for single-precision floating point math, including the basic operations, as well as exponentials, logarithms, trig, inverse trig, and more! Keep in mind, Grammer is intended to use 16-bit integer math, so it's a bit convoluted to use floats:
Grammer has an external module system, which means that specialty routines can be added to extend the language. For example, Yeong has created a textbox module which can be useful for game dialog!
Grayscale is not backwards compatible. Grayscale is now 4-level gray, instead of 3-level. Grayscale now takes color from both buffers, so BOTH buffers must have a black pixel to appear black (before, there was a black buffer and a gray buffer, where a black pixel in the former would override the latter).
Input and Menu routines have been significantly modified, with the changes to Input being the most obvious. In place of highlighting the entered text, a blinking cursor is displayed, making it more obvious that input is expected from the user. On the programming end, you can now relocate the input buffer and set its size. So if you want the user to enter a name of up to 8 bytes, you can point the input buffer to where you want the string written and set the size to 8. The Menu( routine now has scrolling, and internally it is more memory friendly, allowing more options. An entirely new Menu(' routine lets the programmer generate menu items with a subroutine. This could be really useful for game menus where, say, an inventory menu is prone to changing.
Grammer officially has stack support!
You can now save and restore variables within a ▶Nom(...End block, which is useful for subroutines.
Pt-Off( finally supports big sprites!
More updates:
The source code has been cleaned up significantly and put on GitHub.
RecallPic now works on archived picture variables.
pxl-Test(' was removed and replaced with an equivalent rectangle method.
Rectangle clipping now works in all directions!
Line drawing now has real clipping!
The solve( command now includes port commands, as well as relocation of where the variables are stored.
L (the list L) can be used to execute specific lines of code.
Text( now supports displaying signed numbers via the Fix command, as well as floating point numbers.
▶Dec converts a string to a float, useful for user input.
Param can be used to read a list of parameters passed to a subroutine.
Param' and Param° push to and pop from a stack, respectively, where Pmt_Bgn points to the start of the stack and Pmt_End points to the end of the stack.
Circle( has 9 new methods, all with different fill options (much like the rectangle routines).
Variable-width fonts have a more sensible format, and the fonts are ported to work with DrDnar's Monochrome Font Editor. The font files are included with the downloads so you can use them as templates for your own fonts.
With @NonstickAtom785's talk of making an RPG in Grammer, I decided that I should finally add smooth-scrolling tilemap support to Grammer. Anyways, I went with a more standard format, so we need a new tilemap editor, and I like how this one has turned out:
The controls are listed in the readme, and I've included a sample set of files (spritesheet + tilemap + config). I've also included a map viewer, prgmTILEMAP:
It isn't documented, but press [*] for 15MHz mode and [/] for 6MHz mode (the default). It uses the same data as the map editor, so it loads the same tilemap and spritesheet that you were working on.
Hi folks! I've noticed that the "Z80 Optimized Routines" threads and their equivalents on various sites aren't very easy to navigate. I am starting a repository on GitHub in the hopes of addressing these three issues:
Organization! "Is this routine documented? What page is it on?
Collaboration! "Is there a better version later in the thread? On what page!? Here is yetanotherversion!"
Cleanliness! "What is this random request doing in the middle of the thread?"
If you want to help port documentation, I only ask that you cite the original author when possible, except when the original author doesn't care to be cited. If you want to add your own routines, keep it organized! And if you see an optimization, please make it!
A final note: I think it would be great to have an eZ80 and TI-BASIC repository, too, but I don't think I'm up for maintaining that!
Hey there, it's ya gender non-specific diminutive Zeda, here, and today we'll be looking at the Fisher-Yates algorithm and just how freaking efficient it can be for shuffling a list. For reference, it takes one second to shuffle a 999-element list at 6MHz, and if that ain't the way your deity intended it, I don't know what is.
The usual TI-BASIC method is super clever, but slow as heck as the lists get bigger. Plus, it uses an extra list of the same size, wasting precious RAM. So how does the Fisher-Yates algorithm work? You start at the last element. Randomly choose an element up to and including the current element and swap the two. Now move down one element and repeat (so now the last element is off limits, then the last two, et cetera). Repeat this until there is one element left.
This is easy to perform in-place, and it performs n-1 swaps, making it significantly faster than the BASIC algorithm above. In fact, let's implement it in BASIC:
dim(L1->N For(K,N,2,-1 randInt(1,K->A L1(K->B L1(A->L1(K B->L1(A End This takes approximately 37.5 seconds to sort a 999 element list. I don't even have the RAM needed to test the regular method, but extrapolating, it would take the "normal" method approximately 73 seconds for 999 elements. So basically, the Fisher-Yates algorithm is actually faster even in TI-BASIC (after about 400 elements, though). So without further ado, the assembly code!
; Put it into 15MHz mode if possible!
  in a,(2)
  add a,a
  sbc a,a
  out (20h),a

; Initialize the random seed
  ld hl,seed1
  ld b,7
  ld a,r
_:
  xor (hl)
  ld (hl),a
  inc hl
  djnz -_
  or 99
  or (hl)
  ld (hl),a

; Locate Ans, verify that it is a list or complex list
  bcall(_RclAns)
  ex de,hl
  ld c,(hl)
  inc hl
  ld b,(hl)
  inc hl
  ld (list_base),hl
  dec a
  jr z,+_
  sub 12
  ret nz
  dec a
_:
;A is 0 if a real list, -1 if complex
;HL points to the first element
;BC is the number of elements
  and $29     ;make it either NOP or ADD HL,HL
  ld (get_complex_element),a
  sub 29h
  sbc a,a     ;FF if real, 00 if complex
  cpl
  and 9
  add a,9
  ld (element_size),a

shuffle_loop:
  push bc
  push bc
  call rand
  pop bc
  ex de,hl
  call mul16
  dec bc
;swap elements DE and BC
  call get_element
  push hl
  ld d,b
  ld e,c
  call get_element
  pop de
  call swap_elements
  pop bc
  dec bc
  ld a,c
  dec a
  jr nz,shuffle_loop
  inc b
  dec b
  jr nz,shuffle_loop
  ret

swap_elements:
;HL and DE point to the elements
element_size = $+2
  ld bc,255
_:
  ld a,(de)
  ldi
  dec hl
  ld (hl),a
  inc hl
  djnz -_
  ret

get_element:
;Input:
;  DE is the element to locate
;Output:
;  HL points to the element
  ld l,e
  ld h,d
  add hl,hl
  add hl,hl
  add hl,hl
  add hl,de
get_complex_element:
  nop
list_base = $+1
  ld de,0
  add hl,de
  ret

rand:
;Tested and passes all CAcert tests
;Uses a very simple 32-bit LCG and 32-bit LFSR
;it has a period of 18,446,744,069,414,584,320
;roughly 18.4 quintillion
;LFSR taps: 0,2,6,7 = 11000101
;291cc
;Thanks to Runer112 for his help on optimizing the LCG and suggesting to try
;the much simpler LCG. On their own, the two are terrible, but together they
;are great.
  ld hl,(seed1)
  ld de,(seed1+2)
  ld b,h
  ld c,l
  add hl,hl \ rl e \ rl d
  add hl,hl \ rl e \ rl d
  inc l
  add hl,bc
  ld (seed1_0),hl
  ld hl,(seed1_1)
  adc hl,de
  ld (seed1_1),hl
  ex de,hl
;;lfsr
  ld hl,(seed2)
  ld bc,(seed2+2)
  add hl,hl \ rl c \ rl b
  ld (seed2_1),bc
  sbc a,a
  and %11000101
  xor l
  ld l,a
  ld (seed2_0),hl
  ex de,hl
  add hl,bc
  ret

mul16:
;BC*DE
  ld hl,0
  ld a,16
mul16_loop:
  add hl,hl
  rl e
  rl d
  jr nc,+_
  add hl,bc
  jr nc,+_
  inc de
_:
  dec a
  jr nz,mul16_loop
  ret

It isn't perfect, but it is pretty good and, importantly, it is fast! The biggest problem is in the random number generator, but even that is still pretty good for this application.
Hi all! At Eeems' suggestion, I am posting this work-in-progress code.
I've been working on-and-off on my tilemap engine since my Pokemon Amber attempt. The current rewrite is slower, but a bit easier to work with on my end. It also allows the surrounding map to still be animated if I need to pull up a menu or dialog! I've also added in map compression using modified code from my LZD project, as well as event tiles!
In this version, I mess around with OS RAM to make larger chunks of contiguous non-user RAM, so it might cause crashes.
As of now, all you can do is walk around the map and enter/exit map editing mode. Since maps are currently stored in the app, map editing is not permanent. If you leave the map and come back, all changes are reverted.
Press [mode] to enter map editing mode.
Use [+] and [-] to scroll through the available tiles for the map.
Use [*] to toggle walkability. You'll see corner guards if the tile can't be walked on.
Use [Enter] to set a tile.
Use [stat] to exit map editing.
When you hover over a tile that can't be walked over, a solid border will be drawn. When you walk over an event tile, a dotted border will be drawn.
If you want to *actually* edit or create maps, you'll have to do that on your computer and recompile the app. You'll need spasm (with app signing enabled!) and Python with numpy. The map structure is:
.db height,width
.db start_y,start_x ;The player will be drawn at y+5 and x+6 !
.db number_of_tiles
.dw tile_index_0
.dw tile_index_1
...

;map data
.db blah
.db blah
.db blah
...

;Event code
.db 0

The map data is actually stored transposed. So the tiles you see across the top row are actually stored in the left column of bytes! This makes it a headache to code.
You can use up to 64 different tiles in a map.
Each byte in the map has the index into the tileset stored in the lower six bits. Setting the top bit means it can't be walked on, and setting bit 6 means walking into it triggers an event.
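To make that layout concrete, here is a tiny Python sketch of how one map byte decodes (the helper and the sample value are mine, purely for illustration):

def decode_tile(b):
    # lower six bits: index into the tileset (up to 64 tiles)
    # bit 6: walking into this tile triggers an event
    # bit 7: the tile can't be walked on
    return {"tile": b & 0x3F, "event": bool(b & 0x40), "solid": bool(b & 0x80)}

print(decode_tile(0xC5))  # {'tile': 5, 'event': True, 'solid': True}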
When events are triggered, the map's event handler is parsed and executed. It is a very basic scripting language that currently handles addition, subtraction, =, <, &, warping, and conditional execution. Scripts end with a null byte (0x00) and are stored roughly in the order they'd be parsed, in RPN format, with the current exception of conditional execution (since the condition is checked before executing). 'x' and 'y' can be used to read the player's current coordinates; 'i' evaluates a condition (like an If statement) and if it is true, it executes the next command, otherwise it skips it. 'w' is used to warp and requires bytes in the form `map_id,start_y-5,start_x-6`. Finally, 8-bit integers are prefixed by a `.db 1`. For a convoluted example, here is the event handler for inside the "house":
;if (y == 3 or y == 4) and x == 7:
;    warp(0,2-5,5-6)
  .db "y",_num,3,"-",_num,2,"<x",_num,7,"=&iw",0,2-5,5-6
  .db 0
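To illustrate the evaluation model, here is a rough Python mock-up of the interpreter loop. The _num token value, the unsigned 8-bit arithmetic, and the skip handling for a false 'i' are my assumptions for this sketch, not pulled from the actual engine:

_num = 1  # assumed token value for the 8-bit literal prefix

def run_script(script, player):
    # script: list of byte values, terminated by 0x00
    stack = []
    i = 0
    while script[i] != 0:
        t = script[i]
        if t == _num:                      # literal: next byte is an 8-bit value
            i += 1
            stack.append(script[i] & 0xFF)
        elif t == ord('x'):
            stack.append(player['x'])
        elif t == ord('y'):
            stack.append(player['y'])
        elif t == ord('-'):
            b, a = stack.pop(), stack.pop()
            stack.append((a - b) & 0xFF)   # 8-bit wraparound
        elif t == ord('<'):
            b, a = stack.pop(), stack.pop()
            stack.append(1 if a < b else 0)
        elif t == ord('='):
            b, a = stack.pop(), stack.pop()
            stack.append(1 if a == b else 0)
        elif t == ord('&'):
            b, a = stack.pop(), stack.pop()
            stack.append(a & b)
        elif t == ord('i'):                # conditional: execute next command only if true
            if not stack.pop():
                i += 4 if script[i+1] == ord('w') else 1  # skip 'w' plus its 3 operand bytes
        elif t == ord('w'):                # warp: map_id, start_y-5, start_x-6
            player['warp'] = tuple(script[i+1:i+4])
            i += 3
        i += 1

Running the house script above with the player at x=7, y=4 ends with player['warp'] == (0, 253, 255), i.e. map 0 with 2-5 and 5-6 stored as unsigned bytes.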
Once you've made your map, you'll need to compress it. First you'll compile it to a binary, then you need to compress that data and convert the compressed data back to assembly. For example:
And finally, you need to add the map to the mapLUT found near the bottom of main.z80, just add the label name (in the above case, `.dw map9`) and include the compressed version of the file at the bottom.
The system is largely interrupt-based. A custom interrupt updates the current keypress, updates the tile animation, updates the LCD, and draws the tilemap. All you have to do is set the appropriate flag:
rpgflags = 33
updateLCD = 0 ;Tells the interrupts to update the LCD
drawmap   = 1 ;Tells the interrupts to redraw the map
animate   = 2 ;Tells the interrupts to animate tiles
noupdate  = 3 ;Overrides the interrupt's LCD update and tilemap redraw

When a tile's animation counter runs out, it automatically sets the drawmap flag. When a map is redrawn, it automatically sets the updateLCD flag and resets the drawmap flag (no need to keep doing it). And if the noupdate flag is set, then interrupt-based map drawing and LCD updating are disabled (but they can still be triggered manually).
There are three graphics buffers, one of which is quite non-standard. They are all vertically aligned instead of horizontally aligned (the OS' format is horizontal). That means the second byte corresponds to the 8-pixel chunk below the first byte, as opposed to adjacent to it.
gbuf : This is a 1120-byte buffer! It is 80 pixels tall and 112 pixels wide. Only the center part is shown. This is where the tilemap is drawn.
maskbuf : A 768-byte buffer. During LCD updates, the data here is ANDed on top of the gbuf data.
topbuf : A 768-byte buffer. During LCD updates, the data here is ORed on top of the previous result.
The latter two buffers are used for overworld graphics, like the player sprite, or in the future, dialog, menus, and NPCs, etc.
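In Python terms, an LCD update composites each visible byte roughly like this (a sketch that assumes all three inputs have already been cropped to the same 768 visible bytes; the function name is mine):

def composite(gbuf_visible, maskbuf, topbuf):
    # AND the mask over the tilemap data, then OR the top buffer over the result
    return bytes((g & m) | t for g, m, t in zip(gbuf_visible, maskbuf, topbuf))

The mask punches holes in the background (e.g. around the player sprite) and the top buffer then draws on top, all without ever modifying the tilemap data in gbuf.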
Hi, for a long time I've wanted to create a compiler (targeting the Z80) that was even better at optimizing. Back when I was first brainstorming a Grammer compiler, I had in mind the issues of existing Z80 C compilers and even Axe. The Z80 really is poorly suited to C, but Axe is actually pretty amazing (as many know). One thing that C, Axe, and Grammer all have in common is a standard way of passing arguments. For example, in the latter cases, the first arguments are passed in HL and BC (respectively). This doesn't always produce optimal code, but it is a lot easier to work with.
So my fantasy was writing a program on my computer that tried to find a more optimal way of passing arguments. I haven't managed it, but it is a lovely dream.
I want to put this small, undocumented set of programs out for the public. I wrote it today and there are a bunch of issues, but on some code it works really well.
irToZ80.db basically contains the conversions from a high-level token to low level assembly. Each routine has input and output registers defined, as well as size and timing to use as metrics for optimization.
I reused tokens.db from another project; all that is important is name, type, and precedence.
If you run ./compile, it passes test.txt through sy.py to convert the source code into an intermediate representation (basically RPN), then passes that file through irToZ80.py to generate Z80 assembly code.
I basically implemented what I think is Dijkstra's shortest path algorithm to find the solution that minimizes the target metric. In this case, I try to find the code that minimizes (6*size)^2+(speed)^2.
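As a sketch of that idea (not the actual repo code): treat each compiler state as a graph node and each candidate code sequence as an edge weighted by its size and cycle count, then search for the cheapest path. Note that applying the metric per edge, as below, is a simplification of scoring whole programs:

import heapq

def metric(size, cycles):
    return (6 * size) ** 2 + cycles ** 2

def cheapest_codegen(graph, start, goal):
    # graph: {state: [(next_state, size_bytes, clock_cycles), ...]}
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, state = heapq.heappop(heap)
        if state == goal:
            return cost
        if cost > best.get(state, float("inf")):
            continue  # stale heap entry
        for nxt, size, cycles in graph.get(state, []):
            c = cost + metric(size, cycles)
            if c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(heap, (c, nxt))
    return None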
In playing with ideas to improve multiplication in my float routines, I recreated an algorithm that I soon learned was a known but rarely discussed generalization of the Karatsuba algorithm!
For a long time, I thought of the Karatsuba algorithm as a special case of Toom-Cook multiplication. If you aren't familiar with that, I discuss it here (see the spoiler!). To be fair, it is a special case, but there is actually another way to arrive to the Karatsuba algorithm, and honestly it is a bit easier to implement and recognize.
To arrive at the algorithm, we start with polynomial multiplication. I'll start off with multiplying two cubic polynomials, and that should give us enough terms to get a good view of what's going on: ## \begin{array}{l} f(x)g(x) &=& (a_{0}+a_{1}x+a_{2}x^{2}+a_{3}x^{3})(b_{0}+b_{1}x+b_{2}x^{2}+b_{3}x^{3})\\ &=&a_{0}b_{0}\\ &&+x(a_{0}b_{1}+a_{1}b_{0})\\ &&+x^{2}(a_{0}b_{2}+a_{1}b_{1}+a_{2}b_{0})\\ &&+x^{3}(a_{0}b_{3}+a_{1}b_{2}+a_{2}b_{1}+a_{3}b_{0})\\ &&+x^{4}(a_{1}b_{3}+a_{2}b_{2}+a_{3}b_{1})\\ &&+x^{5}(a_{2}b_{3}+a_{3}b_{2})\\ &&+x^{6}(a_{3}b_{3})\\ \end{array} ##
There is a lot of symmetry going on there in the form of ##a_{n}b_{m}+\dots+a_{m}b_{n}##. My observation was that ##a_{n}b_{m}+a_{m}b_{n} = (a_{n}+a_{m})(b_{n}+b_{m})-a_{n}b_{n}-a_{m}b_{m}##. On the surface, this looks like we are trading two multiplications for three (more work), but we actually get to reuse some of those terms in other calculations. So let's reorganize, writing ##z_{0}=a_{0}b_{0}##, ##z_{1}=a_{1}b_{1}##, ##z_{2}=a_{2}b_{2}##, ##z_{3}=a_{3}b_{3}##, and ##z_{nm}=(a_{n}+a_{m})(b_{n}+b_{m})##: ## \begin{array}{l} f(x)g(x) &=&z_{0}\\ &&+x(z_{01}-z_{0}-z_{1})\\ &&+x^{2}(z_{02}-z_{0}-z_{2}+z_{1})\\ &&+x^{3}(z_{03}-z_{0}-z_{3}+z_{12}-z_{1}-z_{2})\\ &&+x^{4}(z_{13}-z_{1}-z_{3}+z_{2})\\ &&+x^{5}(z_{23}-z_{2}-z_{3})\\ &&+x^{6}(z_{3})\\ \end{array} ##
So with just this trick, we went from 16 multiplies to 10, at the cost of 12 extra additions/subtractions. In general, to multiply two n-th degree polynomials, we need ##\frac{(n+1)(n+2)}{2}## multiplies instead of ##(n+1)^{2}##, so almost half as many multiplies are needed. These are both ##O(n^{2})## operations, but the trick is to apply recursion. For example, if we want to multiply two degree-15 polynomials, we can break it up into 4 cubic polynomials and treat the smaller polynomials as coefficients. Then we need to multiply 10 pairs of cubic polynomials, resulting in 100 total multiplies. That's a lot, but it's less than 256 with the standard method, or even 136 with the algorithm above. If we carry this out another step, we can multiply two 63-degree polynomials with 1000 multiplies instead of 4096 or 2080. In fact, for a ##(4^{n}-1)-th## degree polynomial, we need ##10^{n}## multiplies, making this algorithm ##O(n^{log_{4}(10)})=O(n^{1.66096\dots})##. This is a non-trivial, better-than-##O(n^{2})## multiplication algorithm.
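For concreteness, here is the one-level (non-recursive) version of the trick in Python (a sketch; the names are mine). It performs exactly n diagonal products plus n(n-1)/2 cross products, i.e. n(n+1)/2 multiplies for n coefficients:

def poly_mul(a, b):
    # a, b: equal-length coefficient lists; returns the 2n-1 product coefficients
    n = len(a)
    diag = [a[i] * b[i] for i in range(n)]        # the n diagonal products a_i*b_i
    out = [0] * (2 * n - 1)
    for i in range(n):
        out[2 * i] += diag[i]
    for i in range(n):
        for j in range(i + 1, n):                 # the n(n-1)/2 cross products
            z = (a[i] + a[j]) * (b[i] + b[j])
            out[i + j] += z - diag[i] - diag[j]
    return out

assert poly_mul([1, 2], [3, 4]) == [3, 10, 8]     # (1+2x)(3+4x) = 3+10x+8x^2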
Using our generalized technique, let's multiply two quadratic polynomials, with ##z_{0}=a_{0}b_{0}##, ##z_{1}=a_{1}b_{1}##, ##z_{2}=a_{2}b_{2}##, ##z_{3}=(a_{0}+a_{1})(b_{0}+b_{1})##, ##z_{4}=(a_{0}+a_{2})(b_{0}+b_{2})##, and ##z_{5}=(a_{1}+a_{2})(b_{1}+b_{2})##: ## \begin{array}{l} f(x)g(x) &=& (a_{0}+a_{1}x+a_{2}x^{2})(b_{0}+b_{1}x+b_{2}x^{2})\\ &=&z_{0}\\ &&+x(z_{3}-z_{0}-z_{1})\\ &&+x^{2}(z_{4}-z_{0}-z_{2}+z_{1})\\ &&+x^{3}(z_{5}-z_{1}-z_{2})\\ &&+x^{4}(z_{2})\\ \end{array} ## And by applying this recursively we get an ##O(n^{log_{3}(6)})=O(n^{1.63092\dots})## algorithm.
And the Karatsuba algorithm that we've all come to know and love:
## \begin{array}{l} f(x)g(x) &=& (a_{0}+a_{1}x)(b_{0}+b_{1}x)\\ &=&z_{0}+x(z_{2}-z_{0}-z_{1})+x^{2}(z_{1})\\ \end{array} ## where ##z_{0}=a_{0}b_{0}##, ##z_{1}=a_{1}b_{1}##, and ##z_{2}=(a_{0}+a_{1})(b_{0}+b_{1})##. And by applying this recursively we get an ##O(n^{log_{2}(3)})=O(n^{1.58496\dots})## algorithm.
But is Karatsuba the best of this class of algorithm?
Yes:
If we recursively apply this to multiplying ##(k-1)##-th degree polynomials (k coefficients), we get an algorithm that is ## O(n^{log_{k}(k(k+1)/2)})=O(n^{log_{k}(k)+log_{k}(k+1)-log_{k}(2)})=O(n^{1+log_{k}(k+1)-log_{k}(2)})=O(n^{2+log_{k}(1+1/k)-log_{k}(2)}) ## and the integer ##k>1## that minimizes the exponent is ##k=2##.
A neat application of the non-recursive algorithm discussed above is multiplying a number by its complement, which is another route to how I came up with this Sine Approximation Algorithm.
So let's assume that all of our coefficients are either 0 or 1. The complement of 0 is 1, and vice versa. This is useful in programming, since inverting the bits of a number is basically doing -x-1.
Let's define some operations. Where x and y are binary (0 or 1), then
~x means the complement of x, and can be rewritten (1-x).
x|y is bitwise-OR and can be rewritten as x+y-xy.
x&y is bitwise-AND and can be rewritten as x*y or xy for brevity.
x^y is bitwise-XOR and can be rewritten as x+y-2xy.
Observe, then, that:
x~x is always 0.
x&x=x.
x~y+y~x = x-xy+y-yx = x+y-2xy = x^y.
So if we want to multiply, say, a 4-bit number by its complement, we have:
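Expanding with x = a0 + 2a1 + 4a2 + 8a3 and applying the identities above (every diagonal term x~x vanishes, and every cross pair a_i~a_j + a_j~a_i collapses to a_i^a_j):

x*~x = 2(a0^a1) + 4(a0^a2) + 8((a0^a3) + (a1^a2)) + 16(a1^a3) + 32(a2^a3)

Only the XOR cross-terms survive, each weighted by 2^(i+j).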
Here is an arbitrary precision calculator for the digits of pi. It is written in TI-BASIC, so it is slooooow, but it works!
You basically pass in how many digits you want, and it returns them in 5-digit chunks in L1. Keep in mind, if a chunk only has 4 digits, that means the top digit is 0!
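For example, stitching the chunks back together on the computer side might look like this in Python (the chunk values here are made up, just to show the zero padding):

chunks = [12345, 678, 9]                      # e.g. read out of L1
digits = "".join("%05d" % c for c in chunks)
print(digits)                                 # -> 123450067800009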
I have been working on a side project for a little while that will revolutionize the 83+/84+ monochrome series: A basic calculator program.
That's right, your TI-83+ is no longer just for games. If you want to add or subtract-- or even multiply and divide-- this program can do that for you!
Be warned, some of the math is a little buggy still. You'll need the Floatlib app located here, but otherwise just run it as an assembly program. I made a custom input routine, so it might feel a little clunky, but it should be pretty straight-forward (except [DEL] actually acts as a backspace). Exit with [ON].
This program gives you the zeros of a quadratic function! Inputs must include a decimal point so that Grammer knows it is a float. Hopefully that can be fixed in the future. Also it looks like Grammer has some float issues, so it gives weird answers when the result is complex.
TICalc.org's Program Of The Year polls are open! For the TI-83+/84+ category, it looks like a bunch of older projects split between two authors-- myself and squidgetx (ticalc profile).
Up for the vote are the following programs (in the order listed on ticalc.org).
Batlib (ticalc link) Batlib is a huge library for BASIC programmers. It has 120+ functions mostly for graphics and string and data manipulation, list and matrix manipulation, compression, and much more.
CopyProg (ticalc link) This is a small program that allows users to copy variables from RAM or Archive to another variable. CopyProg2 has a few other features as well (like line reading and reading the names of the variables on your calc). It allows you to do things like copy an archived appvar to a temp program for execution.
Embers (ticalc link) Embers is an ARPG that won Omnimaga Contest 2012. This was a really well put together game. It features good graphics, good AI, and a storyline.
FloatLib (ticalc link) Floatlib is an app that holds many single-precision float routines from addition to hyperbolic tangent. It comes with a built-in reference manual and there are some examples like computing the Mandelbrot Set.
Gravity Guy (ticalc link) Gravity Guy is "a port/variation of the popular iphone/flash game" of the same name. You basically get to flip the direction of gravity to help navigate obstacles.
LblRW (ticalc link) LblRW is a small utility for BASIC programmers that lets you read or modify data within the BASIC program, using a label offset. That's a mouthful. Basically: "Hey, I want to store player data in my RPG, let's store it after Lbl PD." Or you can store monster data for quick access, for example (no screenshot, sorry).
StickNinja (ticalc link) StickNinja was an Omnimaga Contest 2011 entry that earned 3rd place. It's basically a platformer with awesome Stick Figure and Ninja graphics. Collect coins, destroy enemies; it's got it all.
So, fun fact: floating point numbers can be changed to a variable reference instead, kind of like a shortcut on your computer, and the OS doesn't verify it.
In assembly, when you are looking for a variable, you put a name in OP1 and use FindSym or ChkFindSym. Those names are 9 bytes, which is also the size of a float (not a coincidence-- it's part of why names are limited to 8 bytes). You can actually change the contents of a Real variable to such a name, and when you try to read its value, it will instead return the value of the variable it points to. For example, change the bytes of the Real variable A from 008000000000000000 (the number 0) to 015D01000000000000 (the type byte for a list, then the two-byte name of L2, zero-padded), and when you read A, it will return the contents of L2.
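As a quick sanity check on that byte layout, in Python (the hex string is straight from the example above):

fake_float = bytes([0x01, 0x5D, 0x01] + [0x00] * 6)   # type byte, two-byte name of L2, zero padding
assert fake_float.hex().upper() == "015D01000000000000"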
Or, since lists are just stored as a bunch of Real (or Complex) numbers, you can modify elements of a list to be pointers. In this way, you could read Str1, Str2, Str3, ..., Str0 by reading an element of L1, which could be useful, occasionally.
Some things to note: You can't modify the contents of the original variable in this way. If A points to Str2, then "33"+A→A does not modify Str2. However, "33"+A will work like "33"+Str2. Storing the names of programs and appvars and whatnot isn't useful.
Attached is a program that turns L1 into a 4-element list {L2,[A],Str1,L1}. Here is the source. You need to delete L1 first; my code wasn't reliably deleting it.
Hi folks, I wanted to share a technique I came up with for evaluating sequential points of a polynomial, which is especially useful when you are drawing them. A naive approach would be evaluate the whole polynomial at each point, but the method I'll show you allows you to evaluate at a few points, and then for each subsequent point just use additions.
The basic premise is to use ##f(x)## as a starting point to get to ##f(x+1)##. For example, take ##f(x)=3+2x##. To get to ##f(x+1)##, all you have to do is add 2. Things get more complicated as you go to higher degree polynomials. For example, take ##f(x)=3+2x-3x^{2}+x^{3}##. Then to get to ##f(x+1)##, you need to add ##3x^{2}-3x##. However, you can go even further by finding how to get from ##3x^{2}-3x## to ##3(x+1)^{2}-3(x+1)##. Now comes the theory.
Take ##f_{1}(x)=f(x+1)-f(x)## and ##f_{n+1}(x)=f_{n}(x+1)-f_{n}(x)##. Then ##f(x+1)=f(x)+f_{1}(x)## and ##f_{n}(x+1)=f_{n}(x)+f_{n+1}(x)##.
Expanding ##f_{2}(x)=f(x+2)-2f(x+1)+f(x)##, we can compute: ##\begin{aligned} f_{3}(x)&=f_{2}(x+1)-f_{2}(x),\\ &=(f(x+3)-2f(x+2)+f(x+1))-(f(x+2)-2f(x+1)+f(x)),\\ &=f(x+3)-3f(x+2)+3f(x+1)-f(x). \end{aligned}## And in general, ##f_{n}(x)=\sum_{k=0}^{n}{{n\choose k}(-1)^{k}f(x+n-k)}##. If you set ##x=0##: ##f_{n}(0)=\sum_{k=0}^{n}{{n\choose k}(-1)^{k}f(n-k)}##
So now let's take, again, ##f(x)=3+2x-3x^{2}+x^{3}##. ##\begin{aligned} f(0)&=3,\\ f_{1}(0)&=f(1)-f(0)&=0,\\ f_{2}(0)&=f(2)-2f(1)+f(0)&=0,\\ f_{3}(0)&=f(3)-3f(2)+3f(1)-f(0)&=6,\\ f_{k>3}(0)&=0.\\ \end{aligned}## So what these values, ##\{3,0,0,6\}##, allow us to do is use a chain of additions to get the next value, instead of evaluating the cubic polynomial at each x: ##\begin{array}{rrrrr} [&3,&0,&0,&6]\\ [&3,&0,&6,&6]\\ [&3,&6,&12,&6]\\ [&9,&18,&18,&6]\\ [&27,&36,&24,&6]\\ [&63,&60,&30,&6]\\ [&123,&90,&36,&6]\\ [&213,&126,&42,&6]\\ \dots \end{array}## Basically, add the second element to the first, third element to the second, and fourth element to the third. The new value is the first. Repeat. So the next iteration would yield [213+126,126+42,42+6,6]=[339,168,48,6], and indeed, f(8)=339. I'd like to apologise for how garbled this is, I'm really tired.
Also, you should note that for an n-th degree polynomial, you need to compute up to ##f_{n}(x)##, but everything after that is 0.
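Here is a quick Python model of the whole technique (a sketch; the function names are mine). It builds the difference table numerically and then walks it using only additions:

def forward_differences(f, degree):
    # Build [f(0), f_1(0), ..., f_degree(0)] by repeated differencing.
    vals = [f(x) for x in range(degree + 1)]
    table = [vals[0]]
    for _ in range(degree):
        vals = [b - a for a, b in zip(vals, vals[1:])]
        table.append(vals[0])
    return table

def evaluate_sequence(table, count):
    # Yield f(0), f(1), ..., using only additions per step.
    t = list(table)
    for _ in range(count):
        yield t[0]
        for i in range(len(t) - 1):
            t[i] += t[i + 1]

f = lambda x: 3 + 2*x - 3*x*x + x**3
print(list(evaluate_sequence(forward_differences(f, 3), 9)))
# -> [3, 3, 3, 9, 27, 63, 123, 213, 339]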
A few months ago I implemented heapsort and used it to create a sorted array of VAT entries. While it is fast, instead of directly sorting the VAT, it creates an array of pointers to each entry and sorts the pointers. This means that the VAT itself remains unsorted (not necessarily an issue), and a variable amount of RAM needs to be allocated. In the example, if the user had over 383 variables in the named variables table, only the first 383 entries could be sorted (unlikely, but an issue nevertheless).
My goal here was to sort the VAT in-place, using only a fixed amount of overhead memory. After a couple of weeks of brainstorming and testing, I ended up with an adaptive insertion sort implementation and a mergesort implementation.
pTemp   = 982Eh ;bottom of named vars VAT
progPtr = 9830h ;top of named vars VAT
OP1     = 8478h

;uses 19 bytes of RAM, 4 bytes of stack space
#define tmp OP1+14
#define head tmp+1
#define head_lag tmp+3

#macro advanceVAT()
#ifdef speed
;17cc saved per iteration
;8 bytes vs a 3 byte call (but no 8-byte subroutine)
  ld bc,-6
  add hl,bc
  sbc a,a     ;HL<=FE66, so carry flag is set
  sub (hl)
  ld c,a
  add hl,bc
#else
  call advance_VAT   ;preserves DE
#endif
#endmacro
sortVAT:
#ifdef nointerrupt
  di
#endif
  ld hl,(progPtr)
isort_main:
_:
  ld (head_lag),hl
  ld d,h
  ld e,l
  advanceVAT()
  ld (head),hl
#ifdef speed
;11 bytes, 29cc or 46cc. avg=29.06640625cc
  ld a,(pTemp)
  cp l
  jr nz,$+7
  ld a,(pTemp+1)
  cp h
  ret z
#else
;adds 8 bytes, 55cc
  ld bc,(pTemp)   ;Need to verify that we haven't reached the end of the progVAT
  or a            ;
  sbc hl,bc       ;
  ret z           ;
  add hl,bc       ;
#endif
  call cmpVAT
  ld hl,(head)
  jr nc,-_
;if it makes it here, then (head) needs to be inserted into the previous part of the VAT
;We might be able to speed it up a little more if I also grab the next element:
;if (head_advance) is bigger than (head), then no need to start the search from the beginning
  ld de,tmp
#ifdef speed
  ldd
  ldd
  ldd
  ldd
  ldd
  ldd
  ld b,0
  ld c,(hl)
  lddr
  ldd
#else
  ld bc,6
  lddr
  ld c,(hl)
  inc c
  lddr
#endif
  ld hl,(progPtr)
_:
  push hl
#ifdef speed
;+5 bytes, -11cc
  ld bc,-6
  add hl,bc
  ld de,tmp-6
  call cmpVAT_stepin
#else
  ex de,hl
  ld hl,tmp
  call cmpVAT
#endif
  pop hl
  jr c,+_
  advanceVAT()
  jp -_
_:
;HL is where to insert
  ld de,(head)
  or a
  sbc hl,de
  ld b,h
  ld c,l
  ld hl,-6
  add hl,de
  ld a,l
  sub a,(hl)
  ld l,a
  jr nc,$+4
  dec h
  or a
  inc de
  ex de,hl
#ifdef speed
  call fastldir
#else
  ldir
#endif
;begin at DE, copy tmp. First need size of tmp
  ld hl,tmp-6
  ld c,(hl)
  sbc hl,bc
  ld a,c
  ldir
#ifdef speed
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  add a,7
#else
  ld c,7
  add a,c
  ldir
#endif
  ld hl,(head_lag)
  ld c,a
  ld a,l
  sub c
  ld l,a
  jp nc,isort_main
  dec h
  jp isort_main

#ifndef speed
advance_VAT:
  ld bc,-6
  add hl,bc
  sbc a,a     ;HL<=FE66, so carry flag is set
  sub (hl)
  ld c,a
  add hl,bc
  ret
#endif

cmpVAT:
;if @HL>=@DE, return nc
  ld bc,-6
  add hl,bc
  ex de,hl
  add hl,bc
cmpVAT_stepin:
  ld a,(de)
  cp (hl)
  jr nc,first_longer
;the second name is longer.
  ld c,a
_:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  jr nz,-_
  scf
  ret
first_longer:
;the first name is longer, so load c with the size of the second name
  ld c,(hl)
_:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  jr nz,-_
  ret

#ifdef speed
fastldir:
;copy BC bytes from HL to DE
;breaks even at 26 bytes
; 5% faster than LDIR with 35 bytes
;10% faster than LDIR with 48 bytes
;15% faster than LDIR with 91 bytes
;20% faster than LDIR with 635 bytes
;max is ~ 20.8% faster than LDIR
;Cost: 104+16N+10ceiling(N/16)
  push hl
;  push af
  xor a
  sub c
  and 15      ;change to n-1
  add a,a
  ld hl,ldirloop
  add a,l
  ld l,a
#if (ldirloop>>8)!=(_ldirloop_end>>8)
  jr nc,$+3   ;these aren't needed if the ldirloop doesn't cross a 256 byte boundary. Can save 12cc on the above timings and 3 bytes.
  inc h       ;
#endif
;  pop af
  ex (sp),hl
  ret
ldirloop:
;n=16, (number of LDI instructions, use qty of 4,8,16,32,64)
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
_ldirloop_end:
  ldi
  jp pe,ldirloop
  ret
#endif

#undefine tmp
#undefine head
#undefine head_lag

.echo $-sortVAT," bytes"
pTemp   = 982Eh ;bottom of named vars VAT
progPtr = 9830h ;top of named vars VAT
OP1     = 8478h

;uses 29 bytes of state:
;  4 bytes stack space
; 10 bytes variables
; 15 bytes swap
#define var_n OP1+15
#define partition_size var_n+2
#define var_k partition_size
#define head0 var_n+4
#define head1 var_n+6
#define tail var_n+8
cmpVAT:
;HL points to one entry
;DE points to the second entry
;return c if first>second
;return nc if first<=second
  ld bc,-6
  add hl,bc
  ld a,(hl)
  ex de,hl
  add hl,bc
  ld b,(hl)
  ex de,hl
;A is the size of the first name
;HL points to the byte above the first name
;B is the size of the second name
;DE points to the byte above the second name
  ld c,a
  cp b        ;carry if the second name is longer
  jr c,second_name_longer
_:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  djnz -_
  inc c
  scf
  ret
second_name_longer:
  dec hl
  dec de
  ld a,(de)
  cp (hl)
  ret nz
  dec c
  jr nz,second_name_longer
  ret

nthstr:
;Input:
;  HL points to the size byte of the name of the first element
;  BC is the number of elements
;Output:
;  c flag set if end of VAT reached, else nc
;  HL points to the element
;speed:
;  worst: 34+(95-7/256)*BC
;  best:  34+(95-14/256)*BC
  ld de,(pTemp)
_:
  or a        ;Can do some analysis to omit this code for most runs
  sbc hl,de   ;could speed this up by almost 37%
  add hl,de   ;Still, only adds .001 seconds per 100 VAT entries :P
  ret c       ;35cc
  ld a,-6
  sub (hl)
  add a,l
  ld l,a
  jr c,$+3
  dec h
  or a
  cpd
  jp pe,-_
  or a
  ret

sortVAT:
  ld hl,(progPtr)
;get the number of elements in the VAT
  ld de,-6      ;
  add hl,de     ;
  ld bc,0       ;
  call nthstr   ;
  ld (var_n),bc ;
  ld hl,1
  ld (var_k),hl
sort_main:
;until partition_size exceeds size of VAT
  ld hl,(progPtr)
sort_main_0:
;merge multiples of partition_size
  ld (head0),hl
  ld de,-6
  add hl,de
  ld bc,(var_k)
  call nthstr
  jp c,sort_main_0_end
  push hl
  ld bc,(var_k)
  call nthstr
  ld de,6
  add hl,de
  ld (tail),hl
  pop hl
  add hl,de
  ld (head1),hl

  ex de,hl      ;head1
  ld hl,(head0)
;  or a
  sbc hl,de
  ld hl,(tail)
  jr z,mergeloop_end
;  or a         ;head0>=head1, so c flag should be reset already
  sbc hl,de
  jr z,mergeloop_end-1
  ld de,(head1)
mergeloop:
  ld hl,(head0)
  call cmpVAT
  jr nc,+_
#ifdef speed
;saves 42cc
  ld hl,(head1)
  ld de,var_n-1
  ldd
  ldd
  ldd
  ldd
  ldd
  ldd
  ld a,(hl)
  ldd
  ld c,a
  add a,7
  ld b,0
#else
  ld hl,(head1)
  ld de,-6
  add hl,de
  ld a,(hl)
  ld de,var_n-1
  ld hl,(head1)
  add a,7
  ld b,0
  ld c,a
#endif
  lddr          ;HL now points to the bottom
  ld de,(head1)
  ld (head1),hl
  ld hl,(head0)
  sbc hl,de     ;number of bytes that need to be shifted up by A
  ld b,h
  ld c,l
;need to shift from DE to (head1)
  ld hl,(head1)
  ex de,hl
  inc de
  inc hl
#ifdef speed
  call fastldir
#else
  ldir
#endif
  ld hl,var_n-1
  ld de,(head0)
  ld c,a
  lddr
  ex de,hl
  jp $+9
_:
  ld a,l
  sub c
  ld l,a
  jr nc,$+3
  dec h
  ld (head0),hl
  ld de,(head1)
  or a
  sbc hl,de
  ld hl,(tail)
  jr z,mergeloop_end
;  or a         ;head0>=head1, so c flag is reset already
  sbc hl,de
  jr nz,mergeloop
  add hl,de
mergeloop_end:
  ex de,hl
;  ld de,(tail)
  ld hl,(pTemp)
;  or a
  sbc hl,de
  ex de,hl
  jp nz,sort_main_0
sort_main_0_end:
  ld hl,(var_k)
  add hl,hl
  ld (var_k),hl
  ld bc,(var_n)
  add hl,bc
  jp nc,sort_main
  ret

#ifdef speed
fastldir:
;copy BC bytes from HL to DE
;breaks even at 26 bytes
; 5% faster than LDIR with 35 bytes
;10% faster than LDIR with 48 bytes
;15% faster than LDIR with 91 bytes
;20% faster than LDIR with 635 bytes
;max is ~ 20.8% faster than LDIR
;Cost: 104+16N+10ceiling(N/16)
  push hl
  push af
  xor a
  sub c
  and 15      ;change to n-1
  add a,a
  ld hl,ldirloop
  add a,l
  ld l,a
#if (ldirloop>>8)!=(_ldirloop_end>>8)
  jr nc,$+3   ;these aren't needed if the ldirloop doesn't cross a 256 byte boundary. Can save 12cc on the above timings and 3 bytes.
  inc h       ;
#endif
  pop af
  ex (sp),hl
  ret
ldirloop:
;n=16, (number of LDI instructions, use qty of 4,8,16,32,64)
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
  ldi
_ldirloop_end:
  ldi
  jp pe,ldirloop
  ret
#endif

.echo $-$9D95," bytes"

#undefine var_n
#undefine partition_size
#undefine var_k
#undefine head0
#undefine head1
#undefine tail
I want to clarify that my mergesort method had to be in-place using O(1) space, so it devolved into an O(n^2) algorithm. By my estimates, mergesort would be faster than a non-adaptive insertion sort at 90 elements in the worst case. In practice, at 96 elements it was about 5% faster, probably because it wasn't a worst-case scenario.
Either way, my insertion sort is fast compared with other implementations like those in MirageOS or DoorsCS7. It's much faster when the VAT is mostly sorted already, like when it gets sorted and then the user adds a few new programs. In that case, their implementations still seem to be O(n^2), whereas mine is closer to O(n).
In both codes, the biggest speed factor was in shifting VAT entries. In the speed optimized versions, I take advantage of LDI over LDIR where I can, and I use my fastldir routine to eke out as much performance as possible, gaining a 10% speed boost in the overall routine in my test.