The problem: Is there an ##s\in \mathbb{C}## such that ##\zeta(s)=\zeta(2s)=0##? If so, what is ##\lim_{z\rightarrow s}{\frac{\zeta(z)^{2}}{\zeta(2z)}}##?
Okee dokee, I know this is getting dangerously close to the Riemann Hypothesis, but I swear that wasn't my original intent!
I was playing around with infinite sums and products and I wanted to figure out the sum form of ##\prod_{p}{\frac{1+p^{-s}}{1-p^{-s}}}##. I started by knowing that ##\prod_{p}{(1-p^{-s})}^{-1} = \zeta(s)##, and from previous work, ##\prod_{p}{\sum_{k=0}^{n-1}{(p^{-s})^{k}}} = \frac{\zeta(s)}{\zeta(ns)}##. From that, I knew ##\prod_{p}{(1+p^{-s})} = \frac{\zeta(s)}{\zeta(2s)}##, so I know the product converges as long as ##\zeta(2s) \neq 0## and ##\zeta(s)## is finite. Since convergence wasn't an issue (for the most part), I expanded the product some (a reversal of Euler's conversion from ##\zeta(s) = \sum_{k=1}^{\infty}{k^{-s}} = \prod_{p}{(1-p^{-s})}^{-1}##) and obtained ##\sum_{k=1}^{\infty}{\frac{2^{f(k)}}{k^{s}}}##,
where ##f(k)## is the number of distinct prime factors of k. It turns out that this has been known since 1979, and after I had access to the internet, I figured out the typical symbol used for my ##f(k)## in this context is ##\omega(k)##. So that was cool, and I could write that sum and product in terms of the zeta function as ##\frac{\zeta(s)}{\zeta(2s)}\zeta(s) = \frac{\zeta(s)^{2}}{\zeta(2s)}##, so do with that what you will (I tried to use it to find the number of prime factors of a number n, but I didn't get anywhere useful). What I decided to pursue was when this function is zero. As long as ##\zeta(s)=0## and ##\zeta(2s)\neq 0##, we know that it is 0, but I haven't yet figured out if there is ever an ##s\in \mathbb{C}## with ##\zeta(s)=\zeta(2s)=0## where both are non-trivial zeros. If there is such an s, then the Riemann Hypothesis is false. However, if such an s does not exist, this says nothing about the Riemann Hypothesis.
Spoiler For Why the existence of an s would prove RH false:
It is simple: RH says that every non-trivial zero of ##\zeta## has ##Re(s)=\frac{1}{2}## (the trivial zeros are just the negative even integers, which already give things like ##\zeta(-2)=\zeta(-4)=0##, so the interesting case is when s and 2s are both non-trivial zeros). So if ##\zeta(s)=\zeta(2s)=0## with both non-trivial, assuming RH is true, then ##Re(s)=\frac{1}{2}## and ##Re(2s)=\frac{1}{2}##; but ##Re(2s)=2Re(s)=1##, a contradiction to the Riemann Hypothesis which we just assumed was true!
As well, if there is such an s, I wanted to know if it could be one of those removable discontinuity thingies.
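As a quick numerical sanity check of that sum identity, here is a small Python sketch (illustration only; the omega() helper is just trial division written for this example, and at s=2 the target is exactly ##\frac{\zeta(2)^{2}}{\zeta(4)}=\frac{(\pi^{2}/6)^{2}}{\pi^{4}/90}=2.5##, so no zeta routine is needed):

def omega(k):
    # number of distinct prime factors of k, by trial division
    count, p = 0, 2
    while p * p <= k:
        if k % p == 0:
            count += 1
            while k % p == 0:
                k //= p
        p += 1
    return count + (1 if k > 1 else 0)

s = 2
total = sum(2 ** omega(k) / k ** s for k in range(1, 200001))
print(total)  # ~2.4999..., creeping up on zeta(2)^2/zeta(4) = 2.5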
EDIT: fixed a tag and an incorrect restatement of the Riemann Hypothesis
If you want to post math stuff, it is a fantastic idea to take advantage of the [tex] <expression> [/tex] tags. Here is an example:
##e^{\sqrt{\frac{-3x}{2}}}## versus e^(sqrt(-3x/2)).
The code used in the tags is called LaTeX and you can look up the fun stuff you can do with it via Google. However, for a bit of a quick start, I will give the following nonsense expression and its code:
The code is: [tex]\sqrt{3} \neq \frac{\pi^{2}}{6} = \sum_{k=1}^{\infty}{\frac{1}{k^{2}}} \gt \zeta \left(2+\epsilon\right), \epsilon \in \mathbb{R}, \epsilon>0[/tex]
That might look scary, so let's break it down:
\sqrt{} will put a square root symbol over everything inside the {}.
\neq is the "not equals" sign. There are also \leq, \lt, \geq, \gt.
\frac{numerator}{denominator} makes an expression in fraction form.
\pi is the pi symbol.
^{exponent} raises the expression to a power.
= is the = symbol.
\sum_{lower limit}^{upper limit}{expression} creates a sum from a lower to an upper limit (the limits are optional arguments). As a note, _{} creates a subscript and ^{} a superscript, if that helps to figure the stuff out.
\infty is the infinity symbol.
\gt is the greater-than symbol.
\zeta is the lowercase zeta symbol. LaTeX is case sensitive: Greek letters with distinct capitals use the capitalized command (like \Gamma), but uppercase zeta is just the letter Z, so there is no \Zeta.
\left( is a left parenthesis. It gets sized according to what comes between it and its matching \right).
\in makes the "element of" symbol.
\mathbb{} makes a "blackboard bold" character for certain special sets (N,Z,Q,A,R,C,H,O,S).
So for fun, here are some simpler codes and their outputs: \left( 3^{2} \right) ##\left( 3^{2} \right)##
So I've been developing a way to compute the geometric mean of two numbers (defined as ##\sqrt{xy}##) and my original algorithm had a few snazzy properties:
-It is an add-shift algorithm
-No multiplication needed and so no risk of overflow
-No explicit square root operation performed
I played with the math a bit, and if I take it out of the realm of add-shift routines and allow division and multiplication, I came up with an algorithm that is relatively easy to do by hand, converges quartically (quadruples the digits of accuracy at each step), and still has no overflow risk. In practice, on my TI-84+, I get full 14-digit precision in 2 iterations. The next iteration should theoretically get 56 digits, and each iteration requires 2 divisions, 1 inverse, and 1 multiplication (plus a shift and a few adds). Here is an example to show how it works:
Say we wish to find the geometric mean of two numbers, x=209, y=654321. First, for notation, I am just going to refer to the geometric mean as {x,y}, so {209,654321}. The first step is to find the first 'n' so that ##4^{n}x## is bigger than y, so in this case, n=6. Then ##\{x\cdot 2^{n},y\cdot 2^{-n}\}=\{x,y\}##, but now the numbers are much "closer." We get {13376,10223.765625}. Now factor out the largest power of 2 that fits in both numbers, which is 8192. We have 8192{1.6328125,1.248018265}.
All of that preemptive stuff has set up a new {x,y} where we have y<x<2y, so 1<x/y<2, or 1>y/x>.5. This will allow for some fine-tuned magic in a bit. What we should realize is that if we can get x=y somehow, then we have ##\{x,y\}=\{r,r\}=\sqrt{r^{2}}=r##. So notice that if we choose some number ##a##, we have ##\{xa,y/a\}=\sqrt{xa\cdot y/a}=\sqrt{xy}=\{x,y\}##. So what we need to do is find some ##a## such that xa=y/a, so ##xa^{2}=y##, ##a=\sqrt{y/x}##. However, we want to avoid computing square roots in our computation of square roots, so what we can do is make an estimate for a. As luck would have it, based on our restrictions above, we can make a pretty decent estimate of the square root just by averaging the quotient with 1. In this case, that is ##\frac{\frac{1.248018265}{1.6328125}+1}{2}\approx .8821682725##. So then we get 8192{1.440415382,1.414716788}. Iterating:
8192{1.427566085,1.427450431}
8192{1.427508258,1.427508256}
8192{1.427508257,1.427508257}
And for safety, we could do one more iteration and average the two numbers in case of extra hidden precision, but we get the final result 11694.147638883, which is pretty close to 11694.147638883.
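Here is the same averaging iteration as a small Python sketch (just an illustration; the names geo_mean and tol are made up for this example, and it skips the power-of-two prescaling and simply loops until the two numbers agree, so it takes more passes than the hand computation above):

def geo_mean(x, y, tol=1e-12):
    # Approximate sqrt(x*y) using only +, *, and / (no explicit square root).
    # Each pass replaces {x, y} with {x*a, y/a}, where a = (y/x + 1)/2 is the
    # averaging estimate of sqrt(y/x); the product x*y never changes.
    while abs(x - y) > tol * x:
        a = (y / x + 1) / 2
        x, y = x * a, y / a
    return (x + y) / 2          # final averaging, as in the example above

print(geo_mean(209.0, 654321.0))   # ~11694.1476388...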
But that required 5 iterations requiring 2 divisions and 1 multiplication each. This is not the promised 2 iterations! So here is where we do some magic. At the division step for y/x, we want to find the square root, right? Why not try a couple of iterations at this step? We would get {1,y/x} as the square root, our estimate is still ##\frac{\frac{y}{x}+1}{2}##, so if we apply this, we get ##\left\{\frac{\frac{y}{x}+1}{2},\frac{y}{x}\cdot\frac{2}{\frac{y}{x}+1}\right\}\approx \frac{\frac{y}{x}+1}{4}+\frac{y/x}{\frac{y}{x}+1}##. So in pseudo-code, we could do:
1+y/x→a
1-1/a+a>>2→a
{xa,y/a}→{x,y}

This is the same as performing two iterations at the cost of 3 divisions, 1 multiplication, compared to 4 divisions, 2 multiplications. So we save 1 division, 1 multiplication.
{max(Ans),min(Ans→L1
int(log(L1(1)/L1(2))/log(4
L1{2^-Ans,2^Ans→L1
2^int(log(Ans(1))/log(2→P
L1/Ans→L1
While E-10<abs(max(deltaList(L1    ;engineering E
1+L1(2)/L1(1
1-Ans^-1+.25Ans
L1{Ans,Ans^-1→L1
End
Pmean(L1
Okay, so yesterday, I was thinking about the algorithms I use for computing logs and exponentials with my floating point numbers and I realized that I had the core of a fast division algorithm right there. Then I thought, "oh wait, this is just Goldschmidt division." I was wrong though, now that I am more awake and looking at the algorithm. The biggest difference is that mine has the same time complexity as long division, so I need 64 iterations for my 64 bits of precision versus 5.
But there is a but. Those 5 iterations would require 2 full precision 64-bit multiplications, which I currently have down to about 9000 t-states each. My method would require 3*32 shifts+subtracts. On the surface, it looks like it would require 32 more, but I have tricks and magic on my side.
So how does it work? I decided to look at x/y as a vector {x,y} where I try to drive y towards 1 (I have been doing a lot of CORDIC stuff lately). The speed comes from how I drive y towards 1. In its normal state, I have mantissas as 1.xxxxx, so x and y can be treated as numbers on [1,2). Now as a quick observation, if y>=4/3, then y*3/4 is less than y, but greater than or equal to 1. However, y*3/4=y-y/4 = y-(y>>2) where ">>2" refers to a right shift of 2. Similarly, I compare to 8/7, 16/15, et cetera.
That comparison step can get annoying, especially if you are using a lookup table of 64-bit numbers. It can also get slow. What I did was look at the binary expansions of these numbers:
4/3 = 1.01010101...
8/7 = 1.001001001...
16/15 = 1.000100010001...
32/31 = 1.000010000100001...
See the pattern? Basically, I can count the leading zeroes after the decimal point and if the result is 63, I can stop (the denominator is 1); else if it is n, the denominator is guaranteed to be at least ##\frac{2^{n+2}}{2^{n+2}-1}##, so subtracting y>>(n+2) from it keeps it at or above 1 while driving it toward 1. From this, a quick pseudo-code example would be:
Label:
2+clz(y)→a
if a>m : Return
x-x>>a→x
y-y>>a→y

The code assumes clz will count the leading 0s after the first digit (which is always 1) and m is the desired accuracy. A speed up can occur by noticing that if y=1.000...001xxxxx in binary with n 0s, then y-y>>(n+1) isn't going to change the next sequence of n bits after the 1 (because you are subtracting 0s from xs). So if you have gone up to m bits, the next m bits are basically computed for free in y. This leads to a 2-part algorithm:
Label1:
2+clz(y)→a
if a>m : Goto Label2
x-x>>a→x
y-y>>a→y
Label2:
For n from m+1 to 2m-1
if bit n of y is 1: x-x>>n→x

Adding that extra bit of code gets 2m bits of precision.
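For illustration, here is a rough Python model of the basic loop (the part before the free-bits trick), with the mantissas held as integers with a fixed number of fractional bits; the name shift_divide and the 32-bit precision are just choices for this sketch, not the Z80 routine:

def shift_divide(x, y, frac_bits=32):
    # Compute x/y for mantissas in [1,2), stored as fixed-point integers with
    # frac_bits fractional bits.  Both values are repeatedly multiplied by
    # (1 - 2^-a), which is just value - (value >> a); y is driven toward 1,
    # so x ends up holding roughly x/y in the same format.
    one = 1 << frac_bits
    assert one <= x < 2 * one and one <= y < 2 * one
    while True:
        frac = y - one
        if frac == 0:
            break
        a = 2 + (frac_bits - frac.bit_length())   # 2 + count of leading zeroes
        if a > frac_bits:
            break                                 # accurate enough
        x -= x >> a
        y -= y >> a
    return x

q = shift_divide(int(1.75 * 2**32), int(1.25 * 2**32))
print(q / 2**32)   # ~1.4 (= 1.75/1.25)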
I just thought I would share. This sucks on the Z80 because there isn't a CLZ instruction (though that can be simulated at decent speed), but those arbitrary shifts are what really kill it.
I threw this together after Art_of_camelot mentioned that it could be useful. It is a lightweight font hook program that does the following:
During program execution, you can change the homescreen font.
Upon exiting, the previous font hook is restored
Fonts must be appvars without a header (so Omnicalc fonts won't work, for example), but they can be archived or in RAM. The "commands" are simple. A string in Ans contains the appvar name. Lowercase won't work very easily, as I don't include a tokens→ASCII routine, so use uppercase letters and numbers. For example:
Notes- If you don't use "+" to swap fonts, the hook won't be able to load in the default font after exiting. If you use the assembly program to allow you to use Delvar / Archive / Unarchive on a program, displaying text before setting the mode back to "program mode" will cause the font to be disabled.
SetUpFont: bcall(_RclAns) sub 4 jr z,Parse deacthook1: call Is_Hook_Active ret nz jp deacthook-SpriteFont+9872h Parse: ex de,hl ld c,(hl) inc hl ld b,(hl) inc hl ld a,(hl) cp 71h jr z,deacthook1 cp 70h;+ sign jr nz,LoadHook call Is_Hook_Active inc hl jr nz,LoadHook ld de,name+1 ldir xor a ld (de),a ret LoadHook: push hl push bc ld hl,(fontHookPtr) ld a,(fontHookPtr+2) ld (prevhook),hl ld (prevpage),a
ld hl,SpriteFont ld de,9872h;appbackupscreen ld bc,SpriteFontEnd-SpriteFont ldir pop bc pop hl ld de,name+1 ldir xor a ld (de),a
ld hl,9872h ld a,l bcall(4FE4h) ;Sets the font hook ret Is_Hook_Active: ld a,(9872h) sub 83h ret nz bit 5,(iy+35h) ret nz push hl ld hl,(fonthookPtr) ld de,9872h sbc hl,de pop hl ret nz SpriteFont: .db 83h dec a jr z,$+5 exithook: xor a inc a ret bit 1,(iy+8) jr nz,locate_replace push hl deacthook: .db 21h prevhook: .dw 0 .db 3Eh prevpage: .db 0 bcall(4FE4h) ;Sets the font hook pop hl jr exithook locate_replace:
;B is the char ld (restoreBC-SpriteFont+9872h),bc ld l,b ld h,a in a,(6) ld (restorePage-SpriteFont+9872h),a ld c,l ld b,h add hl,hl add hl,bc add hl,hl add hl,bc push hl ld hl,name rst 20h bcall(_ChkFindSym) pop hl jr c,deacthook ex de,hl inc hl inc hl add hl,de ld a,b or a jr z,ram_font+2 ld b,0 add hl,bc ld c,10 add hl,bc jp p,ram_font-SpriteFont+9872h inc a \ ld h,40h ram_font: out (6),a ld de,lFont_record ld b,7
ld a,(hl) \ ld (de),a inc e \ inc l call z,adjust djnz $-7
ld hl,lFont_record .db 3Eh restorePage: .db 0 out (6),a .db 1 restoreBC: .dw 0 xor a ret a: inc h ret po in a,(6) inc a out (6),a ld h,40h ret n: .db 15h,0,0,0,0,0,0,0,0 SpriteFontEnd:
One of the cooler ways of computing logarithms that I know of generates 1 bit per iteration using a lookup table and, in programming terms, add/shift operations. It works using some simple rules about logs, in particular: ##log(xy)=log(x)+log(y)##
So for example: ##log(x)=log(x\frac{3}{4}\frac{4}{3})=log(x\frac{3}{4})+log(\frac{4}{3})##
log(1)=0 for any base, so if we multiply x by enough stuff to get it close to 1, we end up with 0+stuff and "stuff" can be in a precomputed table. Multiplying by 3/4, for example, is easy. Division by a power of 2 can be done with bit shifts, so to translate to code, x-x>>2. If we only multiply by ##\frac{2^{n}-1}{2^{n}}## then the process is an add and shift, and we just need to add ##log(\frac{2^{n}}{2^{n}-1})## to the accumulator. Let's do some practice to find log(23):
First, get 23 into the range [1,2). If we divide by ##2^{4}##, then we get 1.4375, so:
##log(23)=4log(2)+log(1.4375)##
Since 1.4375>4/3, if we multiply by 3/4, we will get a smaller number that is still bigger than 1. So:
##4log(2)+log(1.078125)+log(4/3)##
In turn, 1.078125<8/7, so we skip that. 1.078125>16/15, and 1.078125*15/16=1.0107421875:
##4log(2)+log(1.0107421875)+log(4/3)+log(16/15)##
Continuing this process up to 1024/1023 for 3 digits of accuracy, we get ##4log(2)+log(4/3)+log(16/15)+log(128/127)+log(512/511)##.
For comparison, ##log_{10}(23)\approx1.361727836##, and ##4log_{10}(2)+log_{10}(4/3)+log_{10}(16/15)+log_{10}(128/127)+log_{10}(512/511)\approx 1.361342752##. We only wanted 3 digits of accuracy, but this is actually accurate to within 1/2596. This is fast, but can we make it faster? (hint: You know the answer to this.)
Basically, we are comparing to 4/3, 4/3 (yes, twice), 8/7, 16/15, ... but if we are computing logs in a program, especially at the assembly level, we will be working with bits. Allow me to rewrite these numbers in binary:
4/3 = 1.01010101...
8/7 = 1.001001001...
16/15 = 1.000100010001...
32/31 = 1.000010000100001...
Do you see the pattern? On top of that, we can really just count the leading zeroes after the decimal. If there are 5, we can multiply by 63/64 and at the worst case, it will be an overestimate by 1 shift (for example, 1.000001 is not bigger than 64/63, so we would actually need x*127/128 = x-x>>7 instead of x-x>>6). This lets us skip any unnecessary steps for comparing and we don't need an LUT of these values. As a new example, let's compute log(1.890625). In binary, this is log(1.111001). There are no leading zeroes, so multiply by 3/4: 1.111001-.01111001=1.01101011 [Aside: this is a case where we have to do the initial multiply by 3/4 and then do it again]
Now we jump into the main iteration. There is 1 leading 0, so multiplying by 3/4, we still get a number bigger than 1: 1.01101011-.0101101011=1.0001000001. Now we have 3 leading zeroes, but it happens to be smaller than 16/15 (notice it is "3 zeros, 1, 5 zeroes, 1"), so we multiply by 31/32:
 1.0001000001
-.000010001000001
--------------------------
 1.000001111011111
Now there are 5 leading zeroes, so we will stop here and estimate log(4/3)+log(4/3)+log(32/31)+log(64/63), which is accurate to two decimal places. I promised faster, though. That was an improvement, but not by as much as the following. Notice that multiplying by 15/16 is the same as shifting x 4 times to the right and then subtracting that from x. Since x is on [1,2), that means there are always 3 leading zeroes after the decimal followed by a 1 and everything else. We also know the first 3 digits have to be zeroes, too, so after the 1, we have 3 zeroes, so it looks like this:
 1.0001abcdefgh
-0.00010001abcdefgh

But if we subtract them, we see that we will get about 1.0000abc.... The abc are left alone if d=1 (no borrow from the subtraction); otherwise the borrow propagates. Still, if a is set, we can expect that the next iteration is going to require multiplying by 31/32, but that won't likely change b or c, so then the next iteration depends on what b and c are.
Basically, what I am saying is this: If we have the number log(1.01abcdefg), we would multiply by 3/4 and if we stop here, we could get a better approximation by adding a*log(8/7). If we stop at log(1.000001abcdefgh), though, there are 5 leading zeroes, so we can add to the end of this for a better approximation: a*log(128/127)+b*log(256/255)+c*log(512/511)+d*log(1024/1023)+e*log(2048/2047). Going back to an example in the previous section, we had x=1.000001111011111 with the approximation log(4/3)+log(4/3)+log(32/31)+log(64/63). If we want to guess and get pretty close to double the accuracy, we can add 1*log(128/127)+1*log(256/255)+1*log(512/511)+0*log(1024/1023)+1*log(2048/2047). Indeed, we get 4 digits of added accuracy (.2766723863 estimated vs. .2766053963... actual).
We can apply this recursively to double the digits of accuracy at each iteration, but this usually overestimates, especially early on. This will be easy to detect and correct when you subtract x-x>>n→x for each bit, so you can adjust there. Ideally, this is useful to try to squeeze as much extra precision as you can in the last iteration. If I wanted to write a logarithm routine to 64 bits of accuracy, I could wait until it got to 32-bits or more completed, and then I would have a pretty safe estimate for at least the next 31 bits and most likely the 64th bit and this would be without all the computation of x-x>>n.
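To pull the whole method together, here is a small Python model of the leading-zeroes version (a sketch for illustration: it keeps the accumulator as a float, uses the natural log for the table, leaves out the guessed-bits trick from this last section, and the name log_shift_add is made up for the example):

import math

def log_shift_add(x, frac_bits=32):
    # Shift-and-add logarithm: normalize x to [1,2), then drive the mantissa
    # toward 1 by multiplying by factors (1 - 2^-a), i.e. m - (m >> a), while
    # adding log(2^a/(2^a - 1)) to the accumulator.  The shift amount comes
    # from counting leading zeroes of the fractional part, with a fallback of
    # one extra shift when the first guess would push the value below 1.
    assert x > 0
    e = 0
    while x >= 2:
        x /= 2
        e += 1
    while x < 1:
        x *= 2
        e -= 1
    acc = e * math.log(2)
    one = 1 << frac_bits
    m = int(x * one)                              # fixed-point mantissa in [1,2)
    while True:
        frac = m - one
        if frac == 0:
            break
        clz = frac_bits - frac.bit_length()       # leading zeroes after the point
        a = clz + 1
        t = m - (m >> a)
        if t < one:                               # overshot (the 1.000001 case)
            a += 1
            t = m - (m >> a)
        if a > frac_bits:
            break                                 # remaining error is below our precision
        m = t
        acc += math.log((1 << a) / ((1 << a) - 1))
    return acc

print(log_shift_add(23.0), math.log(23))   # both ~3.1355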
Hi. Some of you might not know me, but I was wondering if we could have something like [math][/math] / [itex][/itex] / [latex][/latex]-like tags. They make equations look a lot nicer than "237/101-4sqrt(2)/(x+sqrt(2))".
But really, I do like to post in our Math section and some formulas start looking pretty ugly in linear format.
Last night I was thinking again about my Sine Approximation routine and holy crud, I must have been really tired not to recognize this:
In the approximation, I was taking the upper 8 bits, truncating propagation. This was so that I had a fixed-point routine. But if I had taken the lower 8 bits, I would have had a useful 8-bit integer routine for ##-x^{2}-x##.
But wait.
If I start the accumulator with x (which is actually 3 cycles faster to do), then I end up with ##-x^{2}##.
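Both claims are easy to brute-force check; here is a tiny Python snippet doing exactly that (added purely as verification, with A*cpl(A) written as x*(255-x)):

# x * (255 - x) agrees with -x^2 - x in the low 8 bits, and seeding the
# accumulator with x turns that into -x^2 (everything mod 256).
for x in range(256):
    assert (x * (255 - x)) & 0xFF == (-x * x - x) & 0xFF
    assert (x * (255 - x) + x) & 0xFF == (-x * x) & 0xFF
print("holds for all 8-bit x")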
^ is XOR, + is addition modulo 256.

EDIT: Cross-posted: I thought of a way to optimize the first iteration. Saved 2 bytes, 8 t-states. Basically, at the first iteration, C=-1 or C=2L, then I shift L up one bit for the next iteration. I got rid of the initial ld c,a and use the first iteration to compute c. To do this, I just shift L at the beginning, then after the "sbc a,a" I just OR that with L. If A was $FF, the result is $FF (which is -1), else it is 2*L:
L_sqrd:
;Input: L
;Output: L*L->A
;151 t-states
;37 bytes
 ld h,l
;First iteration, get the lowest 3 bits
 sla l
 rr h
 sbc a,a
 or l
;second iteration, get the next 2 bits
 ld c,a
 rr h
 sbc a,a
 xor l
 and $F8
 add a,c
;third iteration, get the next 2 bits
 ld c,a
 sla l
 rr h
 sbc a,a
 xor l
 and $E0
 add a,c
;fourth iteration, get the last bit
 ld c,a
 ld a,l
 add a,a
 rrc h
 xor h
 and $80
 xor c
 neg
 ret

Also, as a note, if you stop at any iteration and perform NEG, you can mask to get the lower bits of the square. So for example, the first iteration gives you -L*L mod 8->A, the second returns -L*L mod 32->A, the third gives it mod 128.
I made a neat little algorithm today that seems to do an okay job at approximating sine. My original intention was to create a routine for the following with a fixed-point format where x is on [0,1): ##x(1-x)##
However, if 'x' is in 'a', I can just do a*(-a), but NEG is 2 bytes, 8 cycles, whereas CPL is 1 byte, 4 cycles. That made me wonder if I could make a fast routine to multiply A by its complement (is it called one's complement?). After working in my notebook with some more abstract approaches (I used 7th-degree polynomials), I reduced it to a fairly simple solution that would be much simpler to integrate into an IC or something. However, I decided to experiment a little with the result by seeing what happens if I don't let some bits propagate into the upper 8 bits (remember, 8.8 fixed point). So basically, I computed the product of 7th-degree polynomials with binary coefficients, and then truncated any terms of degree 7 or lower, then I converted it back to binary and the result is a close approximation of x(1-x). This happens to be a close approximation of sine, too, so after some tweaking, I got it to have the same input/output range as Axe (I recall that Quigibo said he used x(1-x) as an approximation to sine, too).
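For a sense of how good the parabola is as a stand-in for sine, here is a tiny Python check (illustration only, not the routine itself); 4x(1-x) is the scaling that makes the peak reach 1:

import math

# How far does the parabola 4*t*(1-t) stray from sin(pi*t) on [0,1]?
worst = 0.0
for i in range(1001):
    t = i / 1000
    worst = max(worst, abs(4 * t * (1 - t) - math.sin(math.pi * t)))
print(worst)   # about 0.056, worst near t = 0.15 and t = 0.85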
The following uses 3 iterations as opposed to 8 and is faster than the Axe version, but larger by 8 (?) bytes:
^ is for XOR logic, + is regular integer addition modulo 256 (s is the sign)
The original algorithm was going to be for computing the natural logarithm; I came up with some new (to me) algorithms for that as well.
EDIT1: Thanks to calc84maniac for pointing out the optimisation using a different order of operations with xor/and ! Saved 1 byte, 12 cycles. This let me shuffle some code to save 8 more cycles. EDIT2: Modified the last note to reflect the normalized input/output to match that of Axe.
I started posting about LangZ80 in the FileSyst thread. Although I don't consider it a project-- it is more of a "proof of concept" program-- I have been making updates, and instead of derailing the FileSyst thread, I brought it here.
Basically, I want to get FileSyst to a point where it can use variables and functions and basically work as a higher-level programming language and not just an exposed filesystem. I decided to experiment by creating a parser (yet again).
It is really broken and in a messy state, especially with the variable storage. There are lots of bugs there, as well as a few other bugs (integers that are multiples of 256 don't work properly, and I believe I left a few bugs in the float multiply routine). However, when I am not introducing bugs and things are working as they should, here is what it has:
Commands are typed out, like "UPDATELCD()"
Integers are arbitrary precision, from 0 to 2040 bits, in multiples of 8. These are unsigned.
Strings are supported. Start them with either " or ' and end them with the same token. This allows the use of quotes in strings.
Floats are supported. They default to at least 128 bits of precision (38 decimal digits) and at least 32-bits of precision after the decimal point. Floats can be arbitrary sizes, though, also up to a total of 2040 bits.
I have written the float_times_float and int_times_int routines, but not int_times_float and the others.
It has an XORSPRITE() routine that was just for testing. It totally fakes the sprite rendering by not actually reading a sprite var and just assuming 8x8 sprites. Later this will change.
It has an UPDATELCD() command
It has a WAIT() command to wait for a key press
It has a few stacks to hold intermediate data. This part of the app is complete, to my knowledge. It pushes, pops, converts, updates pointers, all that good stuff.
When exiting, it returns "Ans" in the OS ans variable as a string (or it returns one of the various errors)
As such, it has an int→str routine, str→str, and most recently, about an hour ago, a float→str routine.
It parses from prgmTEST
Storing and reading variables is almost supported. It is still broken and causes crashes, though.
It is really kind of garbage at the moment, but feel free to download it to test out in WabbitEmu. It is not stable and it will likely end up crashing a few times.
If at any point it seems like the program has frozen, it might instead be in the WAIT() loop or in another loop I made. To get out, try the following first:
Press any key that isn't [ON]
Press [ON]
If you have to press [ON] to exit, it was in a loop where I haven't added in the needed code yet, so I used that loop as a filler.
I actually made the first version of this program a long time ago, before I became active in the community. I remade it a few nights ago and uploaded it to TICalc, but it basically works like this:
-Give it a list of coordinates and types
-It generates a hex opcode that draws the rectangles with the given coordinates and methods.
It takes input in the form of {x,y,width,height,type} and you can give it multiple rectangles. It does 6 rectangle types, including the typical black/white/invert, using OS bcalls. This is also a BASIC program that generates the opcodes. As well, it tries to optimise the opcode a little, so if there are 8 or more rectangles, it uses a different method of drawing them (generating an array of coordinates/types and reading them, using 29 bytes extra overhead, 4 bytes less for each rectangle).
Anyways, it was useful for me when I was working on BASIC RPGs to make an assembly program that drew a map. The download is below. With so many other tools available now, this type of program is a bit outdated.
When I was working on the Grammer particle engine, I had an idea that I put to the side. I finally got around to testing it, and it was a bit of a success, so I will share it, even though it isn't a project.
If you have seen the Grammer particle engine, it is fast, but it gets slower as you add more particles. Even with 2048 particles it still gets almost 6 FPS, but the hope was that this alternative method would have speed independent of the number of particles on screen.
The trick? Instead of adding particles by pushing them on to a stack-like array of coordinates, I would use something like a graph buffer. In grayscale, we have a front buffer and back buffer. For this, we have a "fixed buffer" and a "particle buffer." When updating the LCD, these two get ORed together to show the particles superimposed on the fixed buffer.
Disadvantage: You have to scan every pixel on the particle buffer to locate particles and move them accordingly.
Advantages:
-The particle buffer is a fixed 768 bytes and can hold up to 6144 particles (compared to 12288 bytes).
-Updating particles takes about the same time no matter how many particles there are.
-Adding particles can be done with simple drawing commands since it is just a graphics buffer in disguise.
-Deleting particles is done with the same ease. Deleting individual particles with the other method is very slow, having to search through the buffer to locate the particle with the given coordinates.
-Performing large-scale operations on a group of particles is very easy.
My goal was to implement a simple rule to move down if possible, or move left/right. This is a standard rule. However, scanning each pixel individually would have been very slow. Instead, I scanned one row of pixels at a time, applying some bit logic to see which pixels can move in a given direction. This turns out to be really effective:
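Here is the row-at-a-time bit logic as a short Python sketch (gravity only, and definitely not the actual Grammer code; the function name and list layout are just for this illustration):

def gravity_step(particles, fixed):
    # particles and fixed are lists of ints, one per row, each bit = one pixel
    # (row 0 is the top).  Every particle with an empty cell directly below it
    # falls one row; whole rows are processed with a few bitwise operations
    # instead of per-pixel checks.
    for y in range(len(particles) - 2, -1, -1):       # bottom-up so stacks cascade
        blocked = fixed[y + 1] | particles[y + 1]     # occupied cells below row y
        movers = particles[y] & ~blocked              # particles free to fall
        particles[y] ^= movers                        # leave row y...
        particles[y + 1] |= movers                    # ...and land in row y+1
    return particles
# Left/right sliding works the same way, with the blocking rows shifted by one bit.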
In fact, using a circle as the pen tool and updating the LCD cause the biggest speed bottlenecks. The number of particles on screen does not affect the speed at all since every pixel in the particle buffer is treated as a particle and has all of the transformations performed on it.
I decided to keep the pen tool as it is, but interleaving the particle updating with LCD updating gains some FPS.
Readme:
Run it with the Asm( token.
Use [ON] to exit.
Use arrows to move the cursor.
Use [2nd] to draw on the fixed buffer (makes impassable barriers).
Use [Mode] to draw particles.
Use [Del] to delete from the fixed buffer (deletes barriers).
Use [Y=]+[Up] to increase the radius of the circle.
Use [Y=]+[Down] to decrease the radius of the circle.
Multiple key presses are supported.
The F2-F5 keys shadow [up],[right],[left],[down].
15MHz is automatically set if it is supported by your calculator. 6MHz should still expect ~22FPS according to WabbitEmu.
As some have seen me talking in IRC, I have my own rendition of Fermat factoring. I compared it to what I thought was a typical approach and my method asymptotically replaces a square root operation with a subtract and an increment. In complexity terms:
Square root is on the level of multiplication and division, so O(M(n)).
Subtraction is O(n).
Increment, if my calculations are correct, is O(1) amortized (in the case of the algorithm described below: invert bit 1; if that returns 0, invert bit 2, and so on, to increment by 2).
So while Fermat factoring isn't as efficient as the general number field sieve, the method I have is more efficient than a typical approach to Fermat factoring. To describe the idea, notice that ##a^{2}-b^{2}=(a-b)(a+b)##. This means that, given our number c, if we can find a difference of two squares so that they equal c, then we have a pair of factors for c. For example, ##10^{2}-5^{2}=75##, so 75=(10-5)(10+5)=5*15. If c>1 and is a positive odd integer, it will always have a trivial solution (the difference between consecutive squares is 1,3,5,7,..., so this is "trivial"). Any other solutions are non-trivial and imply that c is composite. If only the trivial solution exists, c is prime.
Spoiler For proof:
Assume c is composite. Then c=nm where n,m are natural numbers greater than 1. If we let n be the larger of the two, then we have:
n=(a+b)
m=(a-b)
m+2b=a-b+2b=a+b
m+2b=n
n-m=2b
(n-m)/2=b
n=a+b
n=a+(n-m)/2
n-(n-m)/2=a
(n+m)/2=a
(n-m)/2=b
The trivial solution is (c+1)/2=a, (c-1)/2=b, factoring to 1*c=c. If c is composite, it has a non-trivial solution. Now the other direction: if c is prime, it does not have a non-trivial solution. Essentially, notice that if there is a non-trivial solution, where a is not (c+1)/2 and b is not (c-1)/2, then we can factor c as (a-b)(a+b) with both factors greater than 1, which is a contradiction.
Summary: If c is prime, it has only the trivial solution. If c is composite, it has non-trivial solutions.
The next few steps toward solving this formula are the typical approaches and I do the same thing. We notice that a>=sqrt(c).
Spoiler For proof:
Rewrite ##a^{2}-b^{2}=c## as ##a^{2}-c=b^{2}##. Then we see that ##a^{2}-c## is non-negative, so ##a^{2}\geq c##, ##a\geq\sqrt{c}##.
So then with our algorithm, we can initialise a=ceiling(sqrt(c)). That is the smallest that a can be, so now we enter the naive algorithm, checking if the difference ##a^{2}-c## is a perfect square:
ceiling(sqrt(c))→a
While notperfectsquare(a*a-c)
  a+1→a
EndWhile
sqrt(a*a-c)→b
Return {a-b,a+b}

However, the next step away from the naive approach is to notice that every time you add 1 to a, the difference increases by 2a+1. So you can save some computations by keeping track of the difference and updating it at each iteration:
ceiling(sqrt(c))→a
a*a-c→b2
While notperfectsquare(b2)
  b2+2a+1→b2
  a+1→a
EndWhile
sqrt(b2)→b
Return {a-b,a+b}

However, that notperfectsquare() routine will likely require computing the square root at least once (twice if done naively). My approach takes the optimisation another step further by keeping track of how far away the current ##a^{2}-c## is from a perfect square. With this, note that if ##b^{2}## is the next lowest perfect square below ##a^{2}-c##, then the one above is ##b^{2}+2b+1##, so if the difference between ##a^{2}-c## and ##b^{2}## reaches 2b+1, we need to increment b and adjust the difference:
ceiling(sqrt(c))→a
a*a-c→diff
floor(sqrt(diff))→b
diff-b*b→diff
While diff>0
  diff+a+a+1→diff
  a+1→a
  While diff>=(2b+1)
    diff-b-b-1→diff
    b+1→b
  EndWhile
EndWhile
Return {a-b,a+b}

As it is, the inner while loop converges to iterating 1 time pretty quickly (bounded by the square root of the difference between the initial guess for a and c). We can say then that the original square root required at each iteration is replaced by 1 subtract and 1 increment.
Rewriting this for some more efficiency, we can do:
ceiling(sqrt(c))→a
a*a-c→diff
floor(sqrt(diff))→b
diff-b*b→diff
2a+1→a
2b+1→b
While diff>0
  diff+a→diff
  a+2→a
  While diff>=b
    diff-b→diff
    b+2→b
  EndWhile
EndWhile
floor(a/2)→a
floor(b/2)→b
Return {a-b,a+b}

At the chip level, those +2 can also be increments requiring O(1) time instead of additions requiring O(n) time. This means that each iteration of the algorithm requires about:
- 2 add/sub operations
- 2 increments
- 3 compares (that can be changed to simply checking for overflow, so these can be trivially removed)
And the maximum number of iterations is based on how far the actual a value is from the original estimate. If we know c is an odd composite (as RSA moduli are, for example), then this is a pretty crummy upper bound of c/3-sqrt(c) in the worst case (c=3*p).
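The pseudocode above translates almost directly into other languages; here is a Python sketch of it (for illustration only; it assumes an odd c>1 and mirrors the diff-tracking loops rather than any assembly tricks):

from math import isqrt

def fermat_factor(c):
    # Fermat factoring with the running-difference trick from the pseudocode above.
    a = isqrt(c)
    if a * a < c:
        a += 1                      # a = ceiling(sqrt(c))
    diff = a * a - c
    b = isqrt(diff)
    diff -= b * b                   # diff = a^2 - c - b^2
    while diff > 0:
        diff += 2 * a + 1           # move a up by one
        a += 1
        while diff >= 2 * b + 1:    # move b up until diff drops below the next gap
            diff -= 2 * b + 1
            b += 1
    return (a - b, a + b)

print(fermat_factor(14111))         # (103, 137), matching the worked example below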
Spoiler For Example 1:
So let's try to factor 14111. First, estimate a=119. Then ##119^{2}-14111=50##, so our initial b is 7, with a difference of 1 (##50-7^{2}=1##).
a=119, b=7, diff=1. Increment a, add 2*119+1 to diff.
a=120, b=7, diff=240. 240>=7*2+1, so increment b, subtract 15 from diff.
a=120, b=8, diff=225. Increment b, subtract 17 from diff.
a=120, b=9, diff=208. Increment b, subtract 19 from diff.
a=120, b=10, diff=189. Increment b, subtract 21 from diff.
a=120, b=11, diff=168. Increment b, subtract 23 from diff.
a=120, b=12, diff=145. Increment b, subtract 25 from diff.
a=120, b=13, diff=120. Increment b, subtract 27 from diff.
a=120, b=14, diff=93. Increment b, subtract 29 from diff.
a=120, b=15, diff=64. Increment b, subtract 31 from diff.
a=120, b=16, diff=33. Increment b, subtract 33 from diff.
a=120, b=17, diff=0. diff=0, so we have found a and b: 14111=(120-17)(120+17)=103*137.
I came up with this algorithm a few years ago while working on Grammer. This was before Grammer had Line() or Circle() drawing routines, during the summer (when I didn't have internet). Grammer had a Circle() routine long before Line() because the Circle() routine was derived from my Fermat factoring routine above. The routine above can be thought of as interpolating a curve of the form y=sqrt(x^2-b) compared to y=sqrt(z-x^2) where z is the radius squared. However, the algorithm above stops when it lands exactly on a point (when its error=0) whereas circle drawing typically requires the whole curve to be drawn. I believe that my method of drawing circles and lines is like the Bresenham method, so I made a tutorial saying as much. However, I haven't validated that, so I haven't officially released it. 0,1,4,9, Fermat, Bresenham, Circles, and Lines
My circle and line drawing routines takes advantage of overflow (the c flag) to avoid making comparisons. Likewise:
If 0==c%2 Return {2,c/2}
sqrt(c)→a
If floor(a)=a Return {a,a}
1+floor(a)→a
a*a-c→diff
floor(sqrt(diff))→b
diff-b*b→diff
2a-1→a
2b-1→b
diff-b→diff
Loop0:
a+2→a
diff+a→diff
if carry_set
Loop1:
b+2→b
diff-b→diff
GOTO Loop1 IF carry_not_set
GOTO Loop0 IF zero_not_set
a>>1→a
b>>1→b
Return {a-b,a+b}

I should also note that at the assembly level, my integer square root routines typically also return the difference between the original input and the next smaller perfect square, so that also helps speed things up a little (getting rid of all of the multiplications).
The Chaos Game is not a 'game' in the standard sense. Instead, it is a mathematical 'game' like Conway's Game of Life or Langton's Ant. There are so many cool, neat things about the Chaos Game, and many uses for it, but that would take a good semester or two just to describe and employ it. Instead, I will describe the basics and the way I am using it at the moment:
Start with a polygon, like a triangle or a square, or something more inventive
Start at some point, maybe on the perimeter-- it doesn't really matter (well, some deeper investigation will say otherwise, but... )
Now select a vertex randomly and move halfway to that vertex. Repeat this step many times.
On the surface, it makes neat things happen if you use a triangle-- it actually does not fill it in. There are certain points that can never be reached inside the triangle and in fact, it will make a Sierpinski's Triangle. However, squares and such will be filled in.
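If you want to poke at it on a computer first, a bare-bones Python version of the game looks like this (just a sketch; swap random.choice for whatever RNG you want to test, and plot the points however you like):

import random

# Minimal Chaos Game on a triangle: pick a random vertex, move halfway to it,
# repeat.  Plot the visited points and the Sierpinski triangle appears.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]
x, y = 0.25, 0.25                      # starting point; the exact choice barely matters
points = []
for _ in range(100000):
    vx, vy = random.choice(vertices)
    x, y = (x + vx) / 2, (y + vy) / 2  # move halfway to the chosen vertex
    points.append((x, y))
# With a biased RNG, some regions come out noticeably denser than others.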
USES: One of the really cool effects of this deals with the very important part of the iterative step. The key word is "random." If you use an algorithm that cycles, this game will show that with strong biases for certain points. Any common runs or cycles will do this. For example, if you use a square and an RNG that returns the sequence {2,3} more frequently than other pairings with 2, the space between points 2 and 3 will be darker than any other points. Since we cannot produce real "random" numbers, we can test pseudo-random number generators to see if they produce the correct result or where biases might be. A very chaotic PRNG that has a very long period will yield good results. A PRNG that has a shorter period or common runs will produce a less amazing result.
Since I have been working on PRNGs, I have been using some simple implementations of the Chaos Game. Below is a code that allows you to add in your own 'Random' routine to test, as well as the choice of using a square or triangle. In z80 assembly, we often mask out bits to generate integers in a smaller range like [0,1,2,3] or [0,1,2,3,4,5,6,7] (powers of 2, minus 1, make easy masks). While this is fast and simple, the Chaos Game will tell you if this is actually destroying the chaotic behavior of your RNG. If this is the case, you might have better results using a range that isn't a power of 2. However, this can be more time consuming. It's your pick.
.org 9D93h .db $BB,$6D bcall(4BD0h) ;_GrBufClr bcall(486Ah) ;_GrBufCpy bcall(4570h) ;_RunIndicOff di ld a,$FD out (1),a ld hl,0 ChaosGame_SierpinskisGasket: in a,(4) ;test ON press and 8 ret z in a,(1) ;test Clear press and 64 ret z ld (coords),hl call Plot ld hl,(coords) srl h srl l push hl #IF update==256 ld a,(count) inc a ld (count),a jr nz,$+5 bcall(486Ah) ;_GrBufCpy #ENDIF call Random #IF shape==3 call L_mod_3 pop hl jr z,ChaosGame_SierpinskisGasket dec a ld a,32 jr nz,$+7 add a,l ld l,a jp ChaosGame_SierpinskisGasket add a,h ld h,a jp ChaosGame_SierpinskisGasket #ENDIF #IF shape==4 and 3 pop hl jr z,ChaosGame_SierpinskisGasket rrca ld b,a ld a,32 jr nc,$+6 add a,l ld l,a ld a,32 rrc b jp nc,ChaosGame_SierpinskisGasket add a,h ld h,a jp ChaosGame_SierpinskisGasket #ENDIF #IF rand==0 ; From CaDan, by Iambian and Geekboy Random: LFSR: ld hl,lfsrseed1 ld a,(hl) rrca ld (hl),a jp nc,+_ xor %00111000 ld (hl),a _: ld l,a ret #ENDIF #IF rand==1 ; From CaDan, by Iambian and Geekboy Random: LFSR2Byte: ld hl,(lfsrseed2) ld a,L rlca xor h rlca ; rlca ; xor h ; rlca xor h rlca adc hl,hl ld (lfsrseed2),hl ld a,h ;for combatability ret #ENDIF #IF rand==2 ; From Zeda's library Random: Rand8: ;Output: L is a pseudo-random number, DE=509, B=0, C=seed1 ;Destroys:A ;Size: 141 bytes ;Speed: fastest: 672, slowest: 796 ;Notes: ; Requires 2 seeds of 1 byte each, seed1 and seed2 ;Method ; We iterates 2 Linear Congruential Generators. 'x' is seed1, 'y' is seed2 ; (33x+17) mod 251 -> x ; (13y+83) mod 241 -> y ; Next, we output the lower 8 bits of x*y mod 509 ; This all gives a period of 60491 (251*241) which should be overkill for games and such ld hl,(seed1) ld h,0 ;multiply by 13 ld b,h ld c,l add hl,hl add hl,bc add hl,hl add hl,hl add hl,bc ;add 83 ld c,83 add hl,bc ;mod_241 ld c,15 ld a,h add a,a add a,a add a,a add a,a sub h add a,l ;139 jr nc,$+6 add a,c jr nc,$+3 add a,c add a,c jr c,$+3 sub c ;store back to seed1 ld (seed1),a ;seed2 ld hl,(seed2) ld h,b ;multiply by 33 ld b,h ld c,l add hl,hl add hl,hl add hl,hl add hl,hl add hl,hl add hl,bc ;add 17 ld c,17 add hl,bc ;--- ;make BC=seed1 ld c,a ;--- ;mod_251 ld a,h add a,a add a,a add a,h add a,l jr nc,$+8 add a,5 jr nc,$+4 add a,5 cp 251 jr c,$+4 add a,5 ;store seed2 ld (seed2),a
ld h,a ld l,b ;H_Times_C sla h \ jr nc,$+3 \ ld l,c add hl,hl \ jr nc,$+3 \ add hl,bc add hl,hl \ jr nc,$+3 \ add hl,bc add hl,hl \ jr nc,$+3 \ add hl,bc add hl,hl \ jr nc,$+3 \ add hl,bc add hl,hl \ jr nc,$+3 \ add hl,bc add hl,hl \ jr nc,$+3 \ add hl,bc add hl,hl \ jr nc,$+3 \ add hl,bc HL_mod_509: ;HL_mod_509 -> HL ld a,h ld d,b srl a rl d add a,h sub d jr nc,$+3 inc d add a,l jr nc,$+4 inc d ccf ld e,a ld hl,509 ex de,hl sbc hl,de jr c,$+6 add hl,de sbc hl,de ret nc add hl,de ret #ENDIF #IF shape==3 L_mod_3: ;Outputs: ; A is the remainder ; destroys L,BC ; z flag if divisible by 3, else nz
ld bc,0306h ld a,l res 4,l rlca rlca rlca rlca and 15 add a,l and 31 ld l,a sra l sra l and b add a,l sub c \ jr nc,$+3 \ add a,c sub b \ ret nc \ add a,b ret ;at most 123 cycles, at least 114, 30 bytes #ENDIF #IF rand==3 ; From Zeda's library Random: Rand24: ;Inputs: seed1,seed2 ;Outputs: ; HLA is the pseudo-random number ; seed1,seed2 incremented accordingly ;Destroys: BC,DE ;Notes: ; seed1*243+83 mod 65519 -> seed1 ; seed2*251+43 mod 65521 -> seed2 ; Output seed1*seed2 ld hl,(seed1) xor a ld b,h \ ld c,l ld de,83 add hl,hl \ rla ;2 add hl,bc \ adc a,d ;3 add hl,hl \ rla ;6 add hl,bc \ adc a,d ;7 add hl,hl \ rla ;14 add hl,bc \ adc a,d ;15 add hl,hl \ rla ;30 add hl,hl \ rla ;60 add hl,hl \ rla ;120 add hl,bc \ adc a,d ;121 add hl,hl \ rla ;242 add hl,bc \ adc a,d ;243 add hl,de \ adc a,d ;243x+83 ;now AHL mod 65519 ; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL ex de,hl ld l,a ld b,h ld c,l add hl,hl add hl,hl add hl,hl add hl,hl add hl,bc add hl,de ld c,17 jr nc,$+3 add hl,bc add hl,bc jr c,$+4 sbc hl,bc ld (seed1),hl ;seed1 done, now we need to do seed2 ld hl,(seed2) ; seed1*243+83 mod 65519 -> seed1 ; seed2*251+43 mod 65521 -> seed2 ; Output seed1*seed2 xor a ld b,h \ ld c,l ld de,43 add hl,hl \ rla ;2 add hl,bc \ adc a,d ;3 add hl,hl \ rla ;6 add hl,bc \ adc a,d ;7 add hl,hl \ rla ;14 add hl,bc \ adc a,d ;15 add hl,hl \ rla ;30 add hl,bc \ adc a,d ;31 add hl,hl \ rla ;62 add hl,hl \ rla ;124 add hl,bc \ adc a,d ;125 add hl,hl \ rla ;250 add hl,bc \ adc a,d ;251 add hl,de \ adc a,d ;251x+83 ;now AHL mod 65521 ; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL ex de,hl ld l,a ld b,h ld c,l add hl,hl add hl,hl add hl,hl add hl,hl sbc hl,bc add hl,de ld c,15 jr nc,$+3 add hl,bc add hl,bc jr c,$+4 sbc hl,bc ld (seed2),hl ;now seed1 and seed 2 are computed ld bc,(seed1) ex de,hl call BC_Times_DE ex de,hl ld l,b ld h,0 ld b,h ld c,l add hl,hl add hl,hl add hl,bc add hl,hl add hl,hl add hl,bc add hl,hl add hl,bc ld c,d ld d,e ld e,a ld a,c sbc hl,bc sbc a,b ret nc ld c,43 add hl,bc ret BC_Times_DE: ;BHLA is the result ld a,b or a ld hl,0 ld b,h ;1 add a,a jr nc,$+4 ld h,d ld l,e ;2 add hl,hl rla jr nc,$+4 add hl,de adc a,b ;227+10b-7p add hl,hl rla jr nc,$+4 add hl,de adc a,b
add hl,hl rla jr nc,$+4 add hl,de adc a,b
add hl,hl rla jr nc,$+4 add hl,de adc a,b
add hl,hl rla jr nc,$+4 add hl,de adc a,b
add hl,hl rla jr nc,$+4 add hl,de adc a,b
add hl,hl rla jr nc,$+4 add hl,de adc a,b
;=== ;AHL is the result of B*DE*256 push hl ld h,b ld l,b ld b,a ld a,c ld c,h ;1 add a,a jr nc,$+4 ld h,d ld l,e ;2 add hl,hl rla jr nc,$+4 add hl,de adc a,c ;227+10b-7p add hl,hl rla jr nc,$+4 add hl,de adc a,c
add hl,hl rla jr nc,$+4 add hl,de adc a,c
add hl,hl rla jr nc,$+4 add hl,de adc a,c
add hl,hl rla jr nc,$+4 add hl,de adc a,c
add hl,hl rla jr nc,$+4 add hl,de adc a,c
add hl,hl rla jr nc,$+4 add hl,de adc a,c
pop de ;Now BDE*256+AHL ld c,a ld a,l ld l,h ld h,c add hl,de ret nc inc b ;BHLA is the 32-bit result ret #ENDIF #IF rand==4 Random: ;from ionRandom ld hl,(seed1) ld a,r ld d,a ld e,(hl) add hl,de add a,l xor h ld (seed1),hl ld hl,0 ld e,a ld d,h add hl,de djnz $-1 ld l,h ret #ENDIF #IF rand==5 Random: ld hl,(seed1) xor a ld b,h \ ld c,l ld de,83 add hl,hl \ rla ;2 add hl,bc \ adc a,d ;3 add hl,hl \ rla ;6 add hl,bc \ adc a,d ;7 add hl,hl \ rla ;14 add hl,bc \ adc a,d ;15 add hl,hl \ rla ;30 add hl,hl \ rla ;60 add hl,hl \ rla ;120 add hl,bc \ adc a,d ;121 add hl,hl \ rla ;242 add hl,bc \ adc a,d ;243 add hl,de \ adc a,d ;243x+83 ;now AHL mod 65519 ; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL ex de,hl ld l,a ld b,h ld c,l add hl,hl add hl,hl add hl,hl add hl,hl add hl,bc add hl,de ld c,17 jr nc,$+3 add hl,bc add hl,bc jr c,$+4 sbc hl,bc ld (seed1),hl ret #ENDIF #IF rand==6 Random: ld hl,(seed2) ; seed2*251+43 mod 65521 -> seed2 xor a ld b,h \ ld c,l ld de,43 add hl,hl \ rla ;2 add hl,bc \ adc a,d ;3 add hl,hl \ rla ;6 add hl,bc \ adc a,d ;7 add hl,hl \ rla ;14 add hl,bc \ adc a,d ;15 add hl,hl \ rla ;30 add hl,bc \ adc a,d ;31 add hl,hl \ rla ;62 add hl,hl \ rla ;124 add hl,bc \ adc a,d ;125 add hl,hl \ rla ;250 add hl,bc \ adc a,d ;251 add hl,de \ adc a,d ;251x+83 ;now AHL mod 65521 ; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL ex de,hl ld l,a ld b,h ld c,l add hl,hl add hl,hl add hl,hl add hl,hl sbc hl,bc add hl,de ld c,15 jr nc,$+3 add hl,bc add hl,bc jr c,$+4 sbc hl,bc ld (seed2),hl ret #ENDIF Plot: ;need to plot the pixel at HL ld a,l cp 64 \ ret nc #IF update==1 add a,80h out (16),a #ENDIF ld a,h cp 96 \ ret nc ld c,l ld h,0 ld b,h add hl,hl add hl,bc add hl,hl add hl,hl ld b,a rrca \ rrca \ rrca and 0Fh #IF update==1 add a,20h out (16),a add a,20h #ELSE or 40h #ENDIF ld c,a ld a,b ld b,93h add hl,bc and 7 ld b,a inc b ld a,1 rrca djnz $-1 or (hl) ld (hl),a #IF update==1 out (17),a #ENDIF
Set shape as 3 or 4 based on the mode you want to compile.
Set rand accordingly. There are 7 included (0,1,2,3,4,5,6). See below.
Set update to 256 to update every 256 iterations using the OS routine. This is not recommended. Use 1. It is faster and you see every pixel plotted.
There are 7 included PRNGs and 2 shapes. The shapes are 3 for a triangle, 4 for a square. The PRNGS are:
0 for a 1-byte Linear Feedback Shift Register (LFSR) that I was testing. It is a very fast code and pretty chaotic, but using shape=4 shows that it is actually strongly biased for simple masking. Using shape=3 yields excellent results with some minor bias. (credit goes to Iambian and Geekboy for this)
1 for a 2-byte LFSR, also from Iambian and Geekboy
2 for a much heavier/slower 8-bit PRNG I wrote. It works well for shape=3 and shows some fairly weak bias in shape=4, so it is decent for chaotic behavior
3 for an even larger, slower routine (about half the speed of the previous), but it yields a 24-bit pseudo-random number and passes both tests very well. This routine is on par with the OS rand routine, but many times faster (probably hundreds or thousands of times faster).
4 is the ION routine
5 and 6 are the LCGs used in (3)
You can add your own routines named Random as well, but they should return the result in L. EDIT: Updated the code, added a download with a readme.
I was trying to make an HL_mod_10 routine today employing some of the trickery I used with my Rand24 routine (24-bit pseudo-random number generator). I wanted to make it as fast as I could, so I was wondering if anybody wanted to have a fun game of optimise-the-short-assembly-routine.
My approach was to break HL into two parts as H*256+L. Mod 10, that is equivalent to: (H*260-H*4+L) mod 10=(L-H*4) mod 10
H and L are 8-bit values, so H*4 has the potential of being a 10-bit number. If I call this X*256+Y where X is the upper 2 bits of H*4:
(L-H*4) mod 10 = (L-X*256-Y) mod 10
(L-X*256-Y) mod 10 = (L-(X*260-X*4)-Y) mod 10
(L-(X*260-X*4)-Y) mod 10 = (L-Y+X*4) mod 10
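That reduction is easy to verify exhaustively; here is a tiny Python check (added purely as a sanity test of the identity, not part of the routine):

# Exhaustive check of HL mod 10 == (L - Y + 4*X) mod 10, where H*4 = X*256 + Y.
for hl in range(65536):
    h, l = hl >> 8, hl & 0xFF
    x, y = (h * 4) >> 8, (h * 4) & 0xFF
    assert hl % 10 == (l - y + 4 * x) % 10
print("holds for every 16-bit HL")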
Now that is an 8-bit subtraction with L and Y. If there is 8-bit 'overflow' with the subtraction, then I need to adjust it by adding a constant. Then, X*4 is at most a 4 bit number to add to that. Once that is all done, you can perform mod 10 stuff on an 8-bit number. I came up with this code:
HL_mod_10
;Input: HL
;Output: (HL_mod_10 -> A)
;197.75 avg cycles
 ld a,l
 ld l,h
 ld h,0
 add hl,hl
 add hl,hl
 sub l \ jr nc,$+8 \ add a,4 \ jr nc,$+4 \ add a,6
 sla h \ sla h
 add a,h \ jr nc,$+8 \ sub 4 \ jr nc,$+4 \ sub 6
 sub 160 \ jr nc,$+4 \ add a,160
 sub 80 \ jr nc,$+4 \ add a,80
 sub 40 \ jr nc,$+4 \ add a,40
 sub 20 \ jr nc,$+4 \ add a,20
 sub 10 \ ret nc \ add a,10 \ ret
It averages a smidge under 197.75 t-states.
EDIT: If anybody wants to modify this for other languages, here is a supplemental explanation: This is the equivalent of HL mod 10: (L-Y+X*4) mod 10
If the input bits of HL are abcdefghijklmnop₂, then:
L=ijklmnop₂
Y=cdefgh00₂
X=000000ab₂
So to put it into an 8-bit value to operate on: ijklmnop₂-cdefgh00₂+0000ab00₂
If this is >256 (it will be at most 267), then add 6 and keep the lower 8 bits. If this is <0 (it will be at the lowest -255), then take the lower 8 bits (signed, two's complement form) and add 4. If that overflows to be >256, then add another 6, keeping only the lower 8 bits. So for example, if ijklmnop₂-cdefgh00₂+0000ab00₂=-3 → FD+04→101₁₆→01+06→07.
Now perform mod 10 on the 8-bit result. You can apply more tricks to this if you like. abcdefgh₂=128a+0bcdefgh₂, so if a=1, you can do 0bcdefgh₂-2, for example. Or: 0000ab00₂+00cdefgh₂. Or: 0000abc0₂+000defgh₂.