I'll be honest, I have no clue what a lot of that means. I've played with OOPLs here and there, but I've never managed to really "get it." However, I know for many programmers it is more natural and it would be cool to have such a language available on the (e)Z80 calculators.
I totally get being busy, but maybe you'll be lucky enough to pull a JamesV and come back in ten years with renewed energy/time/motivation.
I find that doing the sign check before and after is the best method. I did find a few optimizations and noticed that the remainder isn't properly restored to A in the case where the result is negative. Also keep in mind that the routine may fail when C>127.
; divide HL by C (HL is signed, C is not)
; output: HL = quotient, A = remainder
divHLbyCs:
    bit 7,h
    push af
    jr z,divHLbyCsStart
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
divHLbyCsStart:
    xor a
    ld b,16
divHLbyCsLoop:
    add hl,hl
    rla
    cp c
    jp c,divHLbyCsNext
    sub c
    inc l
divHLbyCsNext:
    djnz divHLbyCsLoop
    ld b,a
    pop af
    ld a,b
    ret z
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld a,c
    sub b
    ret

I changed the output remainder so that it returns the remainder modulo C, so -7/3 returns 2 instead of 1. This means that no bytes or clock cycles were saved after all.
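For what it's worth, here is a quick call sketch; the operand values are mine, and the results come from tracing the routine above, so double-check before relying on them:

    ld hl,-7        ;numerator (signed)
    ld c,3          ;denominator (unsigned, must be < 128)
    call divHLbyCs
    ;A = 2 (the remainder modulo C, as described above)
    ;HL = -2 (the quotient, truncated toward zero, per my trace)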
Since I was able to optimize my best 16-bit multiply even further today, I thought I'd share! I even think it can be further optimized. And for that matter, here is my favorite DE_Times_A
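The listing itself isn't quoted here, but for anyone following along, a common DE_Times_A looks something like the sketch below. This is the textbook bit-by-bit version, not necessarily the favorite routine being referenced:

DE_Times_A:
;Textbook version for reference -- not the optimized routine mentioned above.
;Inputs: DE and A
;Output: HL = A*DE (low 16 bits); A is restored after the 8 rotations
    ld hl,0
    ld b,8
dta_loop:
    add hl,hl       ;shift the partial product left
    rlca            ;next bit of A (MSB first) into the carry flag
    jr nc,dta_skip
    add hl,de       ;multiplier bit was 1, so add DE
dta_skip:
    djnz dta_loop
    ret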
I have one and it was my primary computer for a while (and before that it was the Pi 2, and before that it was the Pi B). I used it mostly for math-related programming and TI-programming and video games! Tilem and TiLP work well, I installed Sage for more powerful math computing, and RetroPie to play some of my old PSP and PS1 games-- the latter of which runs smoothly, the former requires some tweaking.
I also wrote the brunt of a Point-Of-Sale (POS) system from scratch that it handled just fine, but I need to modify it to be web-based now. My Pi 2 uses a 128GB USB drive for the root file system, which makes it my second-largest computer by storage (surpassed only by a 140GB Windows laptop).
EDIT 2022-06-29: The post below has a much needed improvement
I decided to compute some square roots by hand to keep my tired self awake and alert at work. I opted to use Newton's Method this time and I noticed an optimization! Instead of attaining 2N precision with a 2N-by-N->2N division, I split it up into an N-by-N multiply and a N-by-N->N division, which on the Z80 or similar processor can be nearly twice as fast.
The standard algorithm works as follows: To find the square root of c, start with a guess called x. Iterate the following: (x+c/x)/2
Each iteration doubles the digits of precision; try it on your calculator if you aren't familiar with it! To find the square root of 21, guess '5'. Type 5, hit [Enter]. Then do .5(Ans+21/Ans and hit [Enter] a few times.
The way it works is, if the current x is bigger than the square root of c, then c/x is smaller, and vice-versa. So averaging the two gets a closer estimate to the actual square root. My observation was that if I had, say, 3 digits of precision, then those digits stay the same in the next iteration. This means the first 3 digits of c/x will match those of x. In fact, I'll only need to compute c/x to 6 digits, and if I can skip the first 3, then the division is easier and faster! So, let's optimize:
.5(x+c/x)
=> .5(x+x+(c/x-x))
=> x+.5(c/x-x)
=> x+.5(c-x*x)/x
So it's probably not so clear, but essentially for 2n-digits precision, I need to square an n-digit number and divide an n-digit number by n digits, to n digits precision. This replaces a 2n/2n division.
The former is faster, so let's try an example of finding the square root of 21:

c = 21
Guess: x = 5
(c - x*x) = -4
-4/5/2 = -.4
x - .4 = 4.6
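To see the digit-doubling in action, here is one more pass of the same example (my arithmetic, not part of the original post):

c = 21, x = 4.6
(c - x*x) = 21 - 21.16 = -.16
-.16/4.6/2 ≈ -.01739
x - .01739 ≈ 4.58261, versus sqrt(21) = 4.58257..., so we went from two correct digits to roughly four.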
In practice by hand, it actually doesn't save much time: an n-by-n division takes me a little more than half the time of a 2n-by-2n one, and I do multiplication and division at roughly the same speed. So overall this gets me about 20% faster.
On the Z80, a 2N-by-N->2N division takes roughly the time of 1.6 2N-by-2N->4N multiplies. As well, a 2N-by-2N->4N multiply takes roughly 3.5 times the time of an N-by-N->2N multiply using recursive Karatsuba multiplication.
So the standard algorithm takes roughly 5.6 (that's 1.6*3.5) N-by-N->2N multiplies' worth of time, while the modified algorithm takes roughly 2.9 N-by-N->2N multiplies' worth of time.
So I have been really wanting to implement the Shunting-Yard algorithm for a few years now and I think I've got it in assembly. I keep finding bugs, but in case other people are interested, here is my code. It assumes that the operator stack and output (combined) won't exceed 768 bytes.
Things that would be useful:
Converting numbers to an intermediate format.
Handling multi-byte tokens and named vars and functions.
This way the code can be pseudo-compiled when it is converted to RPN, then an RPN calculator can parse the code faster or prep it for actual compiling.
#define bcall(x) rst 28h \ .dw x
saveSScreen = 86ECh
scrap = saveSScreen
outhead = 8000h
stackhead = 8002h

.db $BB,$6D
.org $9D95
    ld hl,test
    ld bc,test_end-test
    call shuntingyard
    call rpn
    ld hl,scrap
    bcall(450Ah)
    bcall(452Eh)
    ret

shuntingyard:
    ld de,scrap
    ld (outhead),de
    ld d,(scrap/256)+3
    ld (stackhead),de
_:
    ld a,(hl)
    call +_
    cpi
    jp pe,-_
    ld hl,scrap+768
    ld de,(stackhead)
    or a
    sbc hl,de
    ld b,h
    ld c,l
    ld hl,(outhead)
    ex de,hl
    jr z,$+3
    ldir
    dec de
    xor a
    ld (de),a
    ret
_:
    cp '.'
    jp z,num_dec
    cp 30h
    jr c,+_
    cp 3Ah
    jp c,num
_:
    cp '('
    jp nz,+_
    ex de,hl
    ld hl,(stackhead)
    dec hl
    ld (hl),','
    dec hl
    ld (hl),a
    ld (stackhead),hl
    ex de,hl
    ret
_:
    cp ')'
    jp nz,checkunops
    push hl
    push bc
    ld hl,scrap+768
    ld de,(stackhead)
    sbc hl,de
    jp z,ERR_Unmatched_lparens
    ld b,h
    ld c,l
    ex de,hl
    ld de,(outhead)
;BC is the size of the stack. Use this in case there is a missing ')' so we don't read garbage.
;Basically, search for the matching '(' while piping out the stack to the output.
outerloop:
    ld a,(hl)
    cp '('
    jr z,parens_found
    ld a,','
_:
    cp (hl)
    ldi
    jp po,ERR_Unmatched_lparens
    jr z,outerloop
    jp -_
parens_found:
    inc hl
    inc hl
    ld (outhead),de
    ld (stackhead),hl
    pop bc
    pop hl
    ret
checkunops:
checkbinops:
;;if the token is an operator, then:
;;  while there is an operator at the top of the operator stack with
;;  greater than or equal precedence and that operator is left associative:
;;    pop operators from the operator stack onto the output queue.
;;  push the read operator onto the operator stack.
    push bc
    ex de,hl
    call getprecedence
    ld a,c
    pop bc
    ex de,hl
    jp c,search_function
;now C is the precedence, with lower bit = 1 if left-associative
    push bc
    push hl
    ld de,(stackhead)
    ld hl,scrap+768
    sbc hl,de
    ld b,h
    ld c,l
    ld hl,(outhead)
    ex de,hl
    jr z,pushop
;a is the precedence against which to compare
_:
    push hl
    push bc
    push af
    ld a,(hl)
    call getprecedence
    jr c,+_
    pop hl
    ld a,h      ;incoming
    cp c
    jr nz,$+4
    rra \ nop
    pop bc
    pop hl
;======================================================
    jr nc,pushop
.echo "The following code only works until we have to add >1 byte tokens."
    ldi
    ldi
    jp pe,-_
    jp $+6
_:
    pop af
    pop bc
    pop hl
pushop:
    ld (outhead),de
    pop de
    dec hl
    ld (hl),','
    dec hl
    ld a,(de)
    ld (stackhead),hl
    ld (hl),a
    ex de,hl
    pop bc
    ret
search_function:
    jp ERR_Func_Not_Found
getprecedence:
    ld hl,binops
    ld b,(binops_end-binops)/2
_:
    cp (hl)
    inc hl
    ld c,(hl)
    ret z
    inc hl
    djnz -_
    scf
    ret
binops:
    .db 4, $01
    .db '=',$50
    .db '|',$60
    .db '&',$70
    .db '-',$81     ;right associative is odd
    .db '+',$80     ;left associative is even
    .db '/',$83     ;right associative
    .db '*',$82     ;left associative
    .db '^',$85     ;right associative
binops_end:
num:
    ld de,(outhead)
_:
    ldi
    jp po,+_
    ld a,(hl)
    cp '.'
    jr z,num_dec+4
    cp 30h
    jr c,+_
    cp 3Ah
    jr c,-_
_:
    ld a,','
    ld (de),a
    inc de
    ld (outhead),de
    dec hl
    inc bc
    ret
num_dec:
    ld de,(outhead)
_:
    ldi
    jp po,+_
    ld a,(hl)
    cp 30h
    jr c,+_
    cp 3Ah
    jr c,-_
_:
    cp '.'
    jp z,ERR_Syntax_00
    ld a,','
    ld (de),a
    inc de
    ld (outhead),de
    dec hl
    inc bc
    ret
ERR_Syntax_00:      ;Too many decimal points.
ERR_Func_Not_Found:
ERR_Unmatched_lparens:
    ret
rpn:
    ret
test:
;   .db "(3.1415926535)"
    .db "(3.142+6/2-7)*3^6*3"
test_end:
I'll be honest, my eyes glossed over. So you have a 3-D space and two buckets that move back and forth along a fixed (randomly initialized) path. A projectile is launched from a bucket, in only one direction and the goal is for it to fall into the other moving bucket. The problem then is to determine if it is possible with a given setup.
Since the projectile is launched in one fixed direction, I'm assuming directly 'up' and it falls directly back 'down', let's pretend we do that at every single point and map the result. If the path made a loop, the plot would look something like a cookie cutter. Plot the other path and find if it ever intersects. If not, return FALSE.
If it does, we need to then verify that they intersect in time. What I would do is determine how many integer time units it takes both paths to complete. If the time units of both paths are coprime, then return TRUE (with coprime periods, the two phases eventually line up at an intersection, à la the Chinese Remainder Theorem).
Perform the following loop in the event that they are not coprime:
0. Given the length of time of the first path is A, and the length of the second is B.
1. Start with a first intersection.
2. Identify the time units it takes to reach that point on each path; call them C and D.
3. If gcd(|C-D|,gcd(A,B))=1, return TRUE; else move to the next intersection and go to Step 2.
4. If all intersections are exhausted, return FALSE.
WARNING: My math skills have been deteriorating since college. I'm sure there are optimizations and identities to apply.
Hi all, I lost my previous references and example programs, and it took me all of this morning to locate this algorithm, digest it, and spew out my own version. I looked on all of my calculators and Omni first, so I'm going to post it here for when it happens again.
Anyways, this is one of my favorite algorithms for evaluating logarithms:
It achieves single-precision accuracy (at least 24.4996 bits) for inputs on [.5,2].
During range reduction, x is usually reduced to some value on [c,2c].
The best precision is found when c=.5sqrt(2) (the range is [.5sqrt(2),sqrt(2)]), achieving at least 31.5 bits.
I prefer c=2/3, since 2/3 and 4/3 are equidistant from 1; it makes it easier for me to analyze time complexity. This still offers at least 29.97 bits, which is better than single precision.
Cost is:
amean: 7 (an 'amean' costs the same as an add with binary floats)
half divide: 1
full divide: 3
multiply: 1
shift-by-2: 3
shift-by-4: 2 (slightly more efficient on the Z80 than 4 single shifts when values are in RAM)
add/sub: 5
add/sub const: 1
I derived this algorithm from this wonderful paper, which is annoyingly never at the top of a Google search; in fact, it took me a loooong time to ever stumble upon it, sadly.
This paper greatly accelerates the classic Borchardt-Gauss algorithm to be on par with the AGM algorithm. At their core, both algorithms perform an arithmetic and geometric mean, but AGM requires them to be done in parallel (emulated on single-core processors by some simple variable juggling), whereas B-G does them sequentially. As well, AGM achieves quadratic convergence or better (I've seen some exponential convergence in specific special cases), whereas classic B-G usually achieves linear convergence. Carlson's version of the B-G algorithm adds O(log(n)^2) additions and O(log(n)) space for quadratic convergence (where n is the number of desired bits of accuracy).
I like the B-G-C algorithm since I can easily obtain the inverse trig functions and inverse hyperbolic functions as well as the natural logarithm.
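For anyone curious, the unaccelerated Borchardt-Gauss iteration for the logarithm is short enough to state here (the acceleration table is in the paper; I'm recalling this from memory, so verify against Carlson before using it): start with ##a_0=\frac{x+1}{2}## and ##g_0=\sqrt{x}##, then iterate ##a_{n+1}=\frac{a_n+g_n}{2}## and ##g_{n+1}=\sqrt{a_{n+1}g_n}##. Both sequences converge linearly (the error shrinks by a factor of roughly 4 each step) to a common limit, and ##\ln x=\frac{x-1}{\lim a_n}##. Carlson's speedup is essentially Richardson extrapolation on that sequence, which is where the O(log(n)^2) extra additions and O(log(n)) space come from.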
Sorry, I'm on my phone, so I'll probably not go too in-depth on this. Bug me for details if I don't get around to it and you need them.
So:

Given ##x\in[-.5\ln2,.5\ln2]##
Let ##y=x^{2}##
Let ##a=\frac{x}{2}\frac{1+\frac{5y}{156}\left(1+\frac{3y}{550}\left(1+\frac{y}{1512}\right)\right)}{1+\frac{3y}{26}\left(1+\frac{5y}{396}\left(1+\frac{y}{450}\right)\right)}##
Then ##e^{x}\approx\frac{1+a}{1-a}##
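As a quick sanity check (my arithmetic, rounded to 7 digits): with ##x=.25##, we get ##y=.0625##; the numerator of the big fraction comes to about 1.0020039 and the denominator to about 1.0072172, so ##a\approx.1243530## (matching ##\tanh(x/2)##, as it should), and ##\frac{1+a}{1-a}\approx1.2840254##, agreeing with ##e^{.25}## to all digits shown.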
Accuracy is ~75.9666 bits. The cost is:
7 increments, 1 decrement
6 constant multiplications
6 general multiplications
2 general divisions
1 div by 2 (right shift, or decrement an exponent)
For comparison, that's comparable to 16 terms of the Taylor series, or 8 terms of the standard Padé expansion (the exponential is special in that its expansion comes out to f(x)/f(-x), so it can be done even more easily than most).
I basically carried out a Padé expansion for e^x to infinity, noticed that after the constant term all the even coefficients were zero, so I used a Padé expansion on that function to quickly find our approximation for a.
In my usage, I actually implemented a 2^x function, since I'm using binary floats with 64-bit precision. I take int(x) and save that as a final exponent for the float, then remove that value from x. By definition of int(), x is now non-negative. If x≥.5, increment the saved exponent and subtract 1 from x. Now x is on [-.5,.5]. We need to perform 2^x, but that is equivalent to e^(x*ln(2)). We've effectively applied range reduction to our original input, and now we can rely on the algorithm at the top. The integer part that we saved earlier then gets added to the exponent byte, et voilà!
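To make the range reduction concrete, here is a worked run with my own numbers: to compute 2^3.7, save int(3.7)=3 for the exponent, leaving x=.7; since .7≥.5, bump the saved exponent to 4 and subtract 1, so x=-.3. Then 2^(-.3)=e^(-.3*ln(2))≈e^(-.2079), and -.2079 lies inside [-.5ln2,.5ln2]≈[-.3466,.3466], so the approximation at the top applies; adding the saved 4 to the exponent byte finishes the job.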
I think when I calculated max error using my floats it was accurate up to the last two bits or something. Don't quote me on that as I don't have those notes on me at the moment and it was done months ago.
Hey Art! It's been the same way for me, too. Adulthood and social media... ugh. I love programming and do program on my calc most days, but it's usually just a couple of minutes here and there throughout the day; I just don't have time for projects. I couldn't tell if I wasn't seeing you around or if it was because I don't check in regularly.
I'm also guessing division would use subtraction instead of addition, truncating any remainder.
Division is typically performed exactly like 'schoolbook' long division (until you get to higher precision, then you can use some state-of-the-art algorithms and math magic).
Start with an accumulator and quotient set to 0. Shift the top bit of the numerator into the accumulator. If accumulator ≥ denominator, subtract the denominator from the accumulator and shift a 1 into the quotient; otherwise shift in a 0. Repeat from the shift step.
In code, since we are getting rid of one bit at a time in the numerator and adding 1 bit at a time to the quotient, we can actually recycle the freed-up bits in the numerator. Here is an example of HL/C where C<128; A is the accumulator and HL doubles as the quotient and numerator:
HL_div_C:
;Input:
;   HL is the numerator
;   C is the denominator, C<128
;Output:
;   A is the remainder
;   HL is the quotient
    xor a
    ld b,16
loop:
    add hl,hl   ;works like shifting HL left by 1; overflow (thus, the top bit) is left in the carry flag
    rla         ;shift A left, rotating the carry flag in as the low bit
    cp c        ;compare A to C. Basically does A-C and returns the flags; if C>A, underflow sets the carry flag
    jr c,skip   ;skip the next two bytes if the carry flag is set (so A<C)
    inc l       ;we know the low bit of HL is 0, so incrementing HL sets that bit
    sub c
skip:
    djnz loop
    ret
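If it helps, here is an example call; the operand values are mine, not from the original post:

    ld hl,1234      ;numerator
    ld c,10         ;denominator, safely < 128
    call HL_div_C
    ;returns HL = 123 (quotient) and A = 4 (remainder)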
ld b,8 - This must be telling the loop to repeat 8 times to check each bit, so, how does the loop relate to b? I'm thinking about how nested loops would work...
djnz loop - kindof like "goto" loop, with the automatic decrementing of register b?
That is correct, and that's why we initially do ld b,8. Keep in mind that djnz * and jr * are intended for small(ish) loops or redirecting to code relatively close by: measured from the start of the instruction, the target can be up to 126 bytes back or 129 bytes forward. (An offset of -1 lands on the jump's own operand byte, $FF, which executes as rst 38h, so I do not suggest it; an offset of -2 jumps to the instruction itself, an infinite loop; an offset of 0 does nothing but waste clock cycles.) Most assemblers will warn you of out-of-bounds jumps. For longer loops or jumping to code far away, use jp *. djnz * only works with register b and always decrements.
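Since you asked about nested loops: b is just a register, so the usual pattern is to save it around the inner loop. A minimal sketch (labels and counts are mine):

    ld b,8          ;outer loop counter
outer:
    push bc         ;save the outer counter
    ld b,16         ;inner loop counter
inner:
    ;...inner work goes here...
    djnz inner      ;decrement b, loop until it hits zero
    pop bc          ;restore the outer counter
    djnz outer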
rrc e - you say this rotates register e, is this the same as changing the register input from left to right?
For example, if e=01111011, then rrc e would change it to e=10111101. The bottom bit gets returned in the c flag as well as being returned to the top bit of e. I used rrc * since rotating 8 times would leave it exactly how it was input. I could have used rr * or even sra * or srl *, but I prefer to avoid destroying registers if I can.
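To trace a couple of steps of that example (my annotations):

    ;e = %01111011 to start
    rrc e           ;e = %10111101, carry flag = 1
    rrc e           ;e = %11011110, carry flag = 1
    ;...after 8 rotations, e = %01111011 again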
But the multiplication isn't convoluted! It's exactly how most of us are taught in grade school, except instead of multiplying digits 0~9, it's just multiplying by 0 or 1, which is super trivial. Like:

        01110101
       x10101101
       ---------
        01110101
       000000000
      0111010100
     01110101000
    000000000000
   0111010100000
  00000000000000
 011101010000000
Or, removing the multiplies by zero:

        01110101
       x10101101
       ---------
        01110101
      0111010100
     01110101000
   0111010100000
 011101010000000
So suppose bit 3 is set. Then you basically add your top number, shifted left three times. As an example, suppose you wanted to multiply C*E (ignoring the top 8 bits):
;C is our "top" number. ;E is our "bottom" number. ;A will be our "accumulator"
ld a,0
rrc e ;this rotates register 'e' right, putting the bottom bit as "carry" [out of the register]. jr nc,checkbit1 ;nc == not carry. If "carry" out was zero, skip this step. add a,c ;if carry out was 1, then add to our accumulator. checkbit1: sla c ;finally, shift our "top number" to the left in case we need to add this to the accumulator, too. Then [REPEAT] 7 more times. rrc e jr nc,checkbit2 add a,c checkbit2: sla c rrc e jr nc,checkbit3 add a,c checkbit3: sla c rrc e jr nc,checkbit4 add a,c checkbit4: sla c rrc e jr nc,checkbit5 add a,c checkbit5: sla c rrc e jr nc,checkbit6 add a,c checkbit6: sla c rrc e jr nc,checkbit7 add a,c checkbit7: sla c rrc e jr nc,all_done add a,c all_done: ret If you can see how that relates to the school book algorithm, then just know that the following does practically the same thing:
;C is our "top" number. ;E is our "bottom" number. ;A will be our "accumulator"
xor a ;mad hax to set A to zero. Faster, smaller. ld b,8 loop: rrc e ;this rotates register 'e' right, putting the bottom bit as "carry" [out of the register]. jr nc,no_add ;nc == not carry. If "carry" out was zero, skip this step. add a,c ;if carry out was 1, then add to our accumulator. no_add: sla c ;finally, shift our "top number" to the left in case we need to add this to the accumulator, too. djnz loop ;aaand repeat, decrementing register B until zero (this is a specialized instruction on the Z80) ret
But please don't use that in a real program! It's terribly inefficient. If you understand how the register pairing works, you can come up with a much better AND more convoluted algorithm:
Spoiler For "Step-by-Step How to derive the 'best' 8-bit Multiplication Algorithm":
Let's start by rearranging the above code. The way we do 'schoolbook' multiplication starts at the least significant digit, but we can just as easily start from the most significant digit. So let's do an example in base 10:

  377
 x613
 ----
= 1*3*377 + 10*1*377 + 100*6*377
= 1(3*377 + 10(1*377 + 10(6*377)))
If we want to convert that last line to pseudo-code:
H*E -> A
;H is the "bottom" number; we will now be checking it from the top digit down to the bottom digit.
;E is the "top" number.
;A is the accumulator.
;basic algo, after initializing A to zero:
;   multiply A by 2
;   shift H left by 1
;   if this results in a bit carried out (so a 1 carried out), then add E (the 'top' number) to A (the accumulator)
;   repeat 7 more times for all bits in H
    xor a
    ld b,8
loop:
    add a,a
    sla h       ;shifts H left by 1, bringing in a 0 for the low bit. Mathematically the same as H*2 -> H
    jr nc,skip_add
    add a,e
skip_add:
    djnz loop
    ret

That's faster, but not optimal! To get the optimal way, let's stray from optimality a little to make L our accumulator. Since we can't directly add another register to L, we'll have to juggle with register A, making it slower:
    ld l,0
    ld b,8
loop:
    sla l
    sla h
    jr nc,skip_add
    ld a,l \ add a,e \ ld l,a
skip_add:
    djnz loop
    ret

But since we know that 'sla l' will spit out a zero bit for the first 8 iterations (all of them), we can do 'sla l \ rl h', which is the same size and speed. However, that is the same as just doing "add hl,hl"! This is where you'll have to learn how register pairs work.
    ld l,0
    ld b,8
loop:
    add hl,hl
    jr nc,skip_add
    ld a,l \ add a,e \ ld l,a
skip_add:
    djnz loop
    ret

But wait, there is more! We don't have an "add l,e" instruction, but we do have an "add hl,de" instruction. If we make D==0, then we can change 'ld a,l \ add a,e \ ld l,a' to 'add hl,de'. The problem is that if the lower byte of HL (so L) overflows, the carry goes into the upper byte H, our 'bottom' number, and we can't have it changing our input value halfway through the algorithm! Thankfully, the carries never propagate into the bits of H that still hold input. This requires some tedious work to prove, but if you are cool with taking that at face value, then our last piece of the puzzle gives us:
    ld d,0
    ld l,d      ;since D=0 already, this sets L to 0, as we want. It's smaller and faster than ld l,0.
    ld b,8
loop:
    add hl,hl
    jr nc,skip_add
    add hl,de
skip_add:
    djnz loop
    ret

Even better, this actually gives us the full 16-bit result instead of just the lower 8 bits
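To round it off, a usage sketch: wrap the routine above under a label (I'll call it H_Times_E here; the name and operand values are mine, not from the post) and call it like this:

    ld h,101        ;one factor
    ld e,37         ;the other factor
    call H_Times_E  ;the final routine above
    ;HL = 101*37 = 3737, the full 16-bit product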