This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.
Topics - Xeda112358
Pages: 1 2 [3] 4 5 ... 13
31
« on: August 06, 2015, 08:22:36 am »
This is a work-in-progress, but the intent is for it to convert a token into its individual chars. It first converts the token to ASCII (this is easy; the OS provides a call), then the hard part is converting the ASCII back to individual tokens. Uppercase is easy, and I also added conversion for lowercase chars, space, and "(" so that should get most tokens. However, something like tan-1( will not properly convert the -1.
32
« on: August 05, 2015, 10:10:09 am »
There was a request on Cemetech that I took interest in, and this is my work so far. You basically pass a number into the program, and it returns the nth catalog element. So 1:Asm(prgmCATELEM would return "abs(" as an example. If you pass 0 into it, it returns how many catalog elements there are.
Before you can use prgmCATELEM for its intended purpose, you must run it once, and it will set itself up for your specific OS. Suggestion: If you are going to use prgmCATELEM in a program that you want to release into the wild, provide a fresh prgmCATELEM, not the version on your calculator. Further, you should run it once at the start of your program to install, just in case the user of your program didn't run the installation step.
There are definitely issues with how prgmCESETUP searches for the catalog table. It assumes that the catalog will not have its first 4 bytes split across two pages, and it assumes that the catalog starts with "abs(" and then " and ". It also assumes that the catalog end is on the same page and ends with "?".
33
« on: July 27, 2015, 12:03:43 pm »
I am trying to write some link port routines and I am having some issues. Every so often, data won't be correctly transmitted, and I have deduced that this only occurs when the sent byte has bit 1, 3, or 5 set; it sometimes causes the next bit up to be set (so if bit 3 is supposed to be set, it also sets bit 4). This is a chaotic occurrence.
What I do know is when the odd bits are read, bit 0 of port 0 (so the tip) reads 1. As well, the sending calc had a low battery, so I am not sure if it is an issue of bringing the lines up. I am pretty sure this is not the problem, though, as I switched out batteries.
I also tried adding code that guaranteed that the lines were reading correctly, otherwise it would rewrite the value to port 0 until it did. This still didn't work. I tried adding huge delays, it still was just as unreliable. Interrupts and link assist are disabled.
Below is my code if you want to muddle through it. It is supposed to send a value in Ans to the other calc and the receiving calc receives in Ans (modulo 100).
#define bcall(x) rst 28h \ .dw x
 .db $BB,$6D
 .org $9D95
 bcall(4AD7h)
 bcall(4AEFh)
sendb:
 di
 ld c,a
 xor a
 out (8),a
 ld e,55h
sendloop:
 rrc c
 rla
 sla e
 ccf
 rla
 out (0),a
 ld b,6      ;this is a delay, and not calculated to be precise. We probably do not even need a delay
 djnz $
 jr nz,sendloop
 ld b,10     ;delay
 djnz $
 xor a
 out (0),a
 ret
#define bcall(x) rst 28h \ .dw x
 .db $BB,$6D
 .org $9D95
 call getb
 bcall(478Ch)
 bcall(4ABFh)
 ret
getb:
 di
 xor a
 out (8),a
 ld bc,$08AA
readloop:
 in a,(0)
 xor c
 rra
 jr c,readloop
 rra         ;bits cycled in are masked with 0x55 as a side effect. Need to invert anyway, so mask at the end with 0xAA
 rr c
 djnz readloop
 ld a,c
 and 1
 rrca
 xor c
 xor $AA
 ret
34
« on: July 15, 2015, 10:29:19 am »
I was not sure where to post this, but I made a calc-to-calc serial cable!
I ended up finding two sets of headphones with the right size tips (not sure what they are called), so I cut them up with an ample length of wire. I also have a bunch of jumper wires that I got for my Raspberry Pi, so after cutting the headphone wires, I exposed the three sets of wire, used an abrasive to strip off the enameling, and spliced each wire with a female tip of a jumper wire. From there, I can connect them with male-male jumpers and it properly transmits data between my calcs!
The reason that I didn't just splice the two sets of wires together directly is because now I can plug in other items. For example, I connected male leads to an electric motor and LED-- it lets me control them through the link port now! I can also hook them up to a breadboard if I wanted to, making it possible to connect more calculators, or I can directly plug them into the GPIO board on my Raspberry Pi. A few months ago I did this and wrote a Python program and a calc program that could communicate with each other (basically just sending a string in Ans to the Pi, and have it displayed in the terminal).
Unfortunately, I lost the Python program due to a crash, but I still have the calc program backed up on-calc. My plan is to come up with some working routines from scratch, since I am still pretty terrible at figuring out the link port. It took me forever to figure out that the receiving calc can't change the lines.
35
« on: June 23, 2015, 12:30:33 pm »
I don't want to put this in the Optimized Routines thread as I haven't had it long enough for it to be ultra-optimized. This code is intended as a method of testing whether HL is divisible by DE. Its average run time for random 16-bit inputs is 163cc, and where DE<=HL (more likely in practice), it is 229cc. For code that is doing prime-checking, it will more likely take about 707.5cc (since DE will be roughly the square root of HL or less).

isDivisible:
;;Inputs: DE,HL
;;Outputs: c flag set if HL is not divisible by DE, else c flag is reset.
;;         HL is 0 if divisible.
;;1087cc worst case (HL=65535, DE=1).
;;Average for random inputs: 163cc
;;Average for DE<=HL: 229cc.
  ld a,d \ or e \ ccf \ ret z       ;remove this if DE is always guaranteed non-zero
;step 1
  ld a,e \ or l \ rra \ jr c,step2  ;\
  srl d \ rr e \ rr h \ rr l        ; |
  ld a,e \ or l \ rra \ jr nc,$-11  ; |Remove these if DE is always guaranteed odd at input.
step2:                              ; |
  ld a,e \ rra \ ccf \ ret c        ;/
;steps 3, 4, and 5
  ld a,l
  or a
loop:
  sbc hl,de \ ret c \ ret z
  rr h \ rra \ bit 0,a \ jr z,$-5
  ld l,a
  jp loop
Motivation and Development

I often find myself in a situation where I need to find the factors of a number, but I have no technology around to aid me. This means I need to use... mental arithmetic! I've been doing this for 15 years, so I have refined my mental process quite a bit. It is still a trial division algorithm, but with a very obfuscated "division" technique. We don't need to compute 1131/7 to see if it is divisible by 7; we just need to see whether 7 divides 1131, and this is what my algorithm does. Interestingly, testing divisibility at the algorithmic level is a little faster than division. Not by much, but the difference is non-negligible.

The Algorithm

The core algorithm is designed around checking whether (A mod B == 0) is true or false. We also make the assumption that B is odd and, by extension, non-zero. The case where B is non-zero and even will be discussed later. Since B is odd, 2 does not divide B. This means that if A is even: (A mod B == 0) if and only if (A/2 mod B == 0). We also know by the definition of divisibility that (A mod B) == ((A+c*B) mod B) where c is any integer. Combining all this, we have an algorithm:
1. Remove all factors of 2 from A.
2. With A now odd, do A=A-B.
3. If the result is zero, that means (A mod B == 0).
4. If the result underflows (becomes "negative", or on the Z80, sets the carry flag), it means that A was somewhere on [1,B-1], so it is not divisible by B.
5. Otherwise, continue back at 1.
Now suppose B is allowed to be non-zero and even. Then B is of the form d*2^k where d is odd. This just means there are some factors of 2 that can be removed from B until it is odd. A can only be divisible by B if it has at least as many factors of 2 as B. If we factor out common factors of 2 and find B is still even, then A is not divisible by B. Otherwise we have an odd number and only need to check the new (A mod d), for which we can use the "odd algorithm" above. So putting it all together:
1. If B==0, return FALSE.
2. Remove common factors of 2 from A and B.
3. If B is still even, return FALSE.
4. Remove all factors of 2 from A.
5. Subtract B from A (A=A-B).
6. If the result is zero, return TRUE.
7. If the result is "negative" (setting the carry flag on many processors), return FALSE.
8. Repeat at 4.
The overhead steps are (1) to (3). The iterated steps are (4) and (5). Because (5) always produces an even number, step (4) then always divides by at least one factor of 2. This means the algorithm takes at most 1+ceil(log2(A))-floor(log2(B)) iterations. For example, if A is a 37-bit number and B is a 13-bit number, this takes at most 38-13 = 25 iterations. However, in practice it is usually fewer iterations.

Example Time

Say I wanted to test if 1337 is divisible by 17. Since 17 is odd, we can proceed. 1337 is odd, so no factors of 2 to remove.
1337-17 == 1320
1320/2 == 660
660/2 == 330
330/2 == 165
165-17 == 148
148/2 == 74
74/2 == 37
37-17 == 20
20/2 == 10
10/2 == 5
5-17 == -12
So 1337 is not divisible by 17. Now test divisibility by 7:
1337 => 1330 => 665 => 658 => 329 => 322 => 161 => 154 => 77 => 70 => 35 => 28 => 14 => 7 => 0
So 1337 is divisible by 7.
The worst-case timing is 66*16+31 = 1087cc.
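For reference, the full algorithm above translates directly to Python (a sketch; the A==0 edge case is not covered in the post and is handled explicitly here):

```python
def is_divisible(a, b):
    """Divisibility test by shifting out factors of 2 and subtracting, no division."""
    if b == 0:
        return False
    if a == 0:
        return True  # edge case not covered in the post: 0 is divisible by any nonzero b
    # remove common factors of 2 from a and b
    while a % 2 == 0 and b % 2 == 0:
        a //= 2
        b //= 2
    if b % 2 == 0:
        return False  # a has fewer factors of 2 than b
    while True:
        while a % 2 == 0:  # remove all factors of 2 from a
            a //= 2
        a -= b             # a and b both odd, so the result is even (or zero/negative)
        if a == 0:
            return True
        if a < 0:
            return False
```

This mirrors the worked example: is_divisible(1337, 7) is True and is_divisible(1337, 17) is False.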
36
« on: April 20, 2015, 09:48:20 am »
EDIT 23-June-2015: Runer took a look at my LFSR code and suggested changing it to a Galois LFSR using the inverted mask, and the code is much simpler.
I have a bunch of Pseudo-Random Number Generators (PRNGs), so instead of spamming the Optimized Routines topic, I'll post here until I have it all sorted out.
So this is my most recent work, and I think it is zillions of times better than my previous work. It has a period of 4,294,901,760 (65536*65535) and takes a tiny 308cc in the worst case. It passes a chaos game test that I wrote, and it is very nicely distributed (mostly uniform, but not *exactly* so). As a comparison, I wrote a Combined LCG that had a fairly long period (on a similar scale to this), but it took ~1500cc, and yesterday I realized that it might not touch a dust of numbers on its range (this depends on the upper bound for prime gaps, since it used multiplication and a modulus [which may have remedied the multiplication]).
So the method I chose was to use a Lehmer RNG, which is a special-case LCG (Linear Congruential Generator), with a period of 65536 (75*seed mod 65537 -> seed), and a 16-bit LFSR (Linear Feedback Shift Register) with maximal period, so 65535. I then add the two. The whole process cycles every lcm(65536,65535) iterations; since they are consecutive integers, they are coprime, so the cycle is of length 65536*65535.
Now I will give all three PRNGs -- the Lehmer RNG, which is pretty good on its own, the LFSR which is very fast, and the combined PRNG:
lehmer:
;;Input:
;;  (seed) has the seed value of the RNG
;;Output:
;;  (seed) is updated, HL is the result
;;Destroys:
;;  A,DE,BC
;;Timing:
;;  if seed>0: 231cc or 232cc, condition dependent
;;  if seed=0: 91cc
;;  if smc=1: subtract 6cc
;;Size: 44 bytes
;;Notes:
;;  Uses the Lehmer RNG used by the Sinclair ZX81
;;  75x mod 65537 -> x
;;  0 is never a possibility, but 65536, the maximum value, is. I encode 65536 as 0, so the seed is only 2 bytes.
#IF smc == 0
  ld hl,(seed)
#ELSE
seed = $+1
  ld hl,0
#ENDIF
;multiply by 75
  ld c,l
  ld b,h
  xor a
  adc hl,hl \ jr z,special \ ld d,a \ rla
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla \ add hl,bc
;modulo 65537, see note below on how this works
  ld e,a
  sbc hl,de
  jr nc,$+3
  inc hl
  ld (seed),hl
  ret
special:
;In the case that HL=0, this should be interpreted as 65536 = -1 mod 65537, so return -75 mod 65537 = -74 mod 65536 in HL
  ld hl,-74
  ld (seed),hl
  ret
;mod by 2^16 + 1 (a prime)
;current form is A*2^16+HL
;need:
;  (A*2^16+HL) mod (2^16+1)
;add 0 as +1-1
;  (A*(2^16+1-1)+HL) mod (2^16+1)
;distribute
;  (A*(2^16+1)-A+HL) mod (2^16+1)
;A*(2^16+1) mod (2^16+1) = 0, so remove
;  (-A+HL) mod (2^16+1)
;Oh hey, that's easy! :P
;I use this trick everywhere, you should, too.
smc=1
LFSR16:
;;Input: (seed1) is non-zero
;;Output: (seed1) updated, HL is the result
;;Destroys: A
;;13 bytes
;;66cc, add 6cc if not using smc
#IF smc == 0
  ld hl,(seed1)
#ELSE
seed1 = $+1
  ld hl,1
#ENDIF
  add hl,hl
  sbc a,a
  and %00101101
  xor l
  ld l,a
  ld (seed1),hl
  ret
smc = 1
prng16:
;;Input:
;;  (seed1) is a non-zero 16-bit int
;;  (seed2) is a 16-bit int
;;Output:
;;  HL is the pseudo-random number
;;  DE is the output of the LFSR
;;  BC is the previous seed of the Lehmer PRNG
;;  (seed1) is the output of the LFSR
;;  (seed2) is the output of the Lehmer PRNG
;;Destroys: A
;;Notes:
;;  This uses a 16-bit Lehmer PRNG and an LFSR
;;  The period is 4,294,901,760 (65536*65535)
;;  Technically, the Lehmer PRNG here can have values from 1 to 65536. In this application, we store 65536 as 0.
;;Speed:
;;  If smc = 0, add 12cc
;;  293+a-c, where a and c are in {0,1}
;;  probability a=1 is 38/65536
;;  probability c=1 is 1/65536
;;  Average: 293+39/65536 = 293.00059509277cc
;;  Best: 293cc
;;  Worst: 294cc
;;50 bytes
#IF smc == 0
  ld hl,(seed2)
#ELSE
seed2 = $+1
  ld hl,0
#ENDIF
;multiply by 75
  ld c,l
  ld b,h
  xor a
  adc hl,hl \ jr nz,$+3 \ inc a \ rla
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla \ add hl,bc
  ld e,a
  sbc hl,de
  jr nc,$+3
  inc hl
  ld (seed2),hl
  ex de,hl
#IF smc == 0
  ld hl,(seed1)
#ELSE
seed1 = $+1
  ld hl,1
#ENDIF
  add hl,hl
  sbc a,a
  and %00101101
  xor l
  ld l,a
  ld (seed1),hl
  add hl,de
  ret
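For cross-checking the assembly, here is a small Python model of the three routines (a sketch; seeds are handled as the routines store them, with the Lehmer value 65536 encoded as 0):

```python
MASK = 0x002D  # Galois LFSR tap mask from LFSR16 (%00101101)

def lehmer_step(seed):
    """75*x mod 65537, with the value 65536 stored as 0 in 16 bits."""
    x = seed if seed != 0 else 65536
    x = (75 * x) % 65537
    return x & 0xFFFF  # 65536 -> 0

def lfsr16_step(seed):
    """Left-shifting Galois LFSR: shift, then xor the tap mask if a bit fell out."""
    seed <<= 1
    if seed > 0xFFFF:
        seed = (seed & 0xFFFF) ^ MASK
    return seed

def prng16_step(seed1, seed2):
    """Combined generator: sum of the LFSR and Lehmer streams, mod 2^16."""
    seed1 = lfsr16_step(seed1)
    seed2 = lehmer_step(seed2)
    return seed1, seed2, (seed1 + seed2) & 0xFFFF
```

Note that lehmer_step(0) == 65462 matches the routine's `ld hl,-74` special case, since -74 mod 65536 = 65462.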
37
« on: April 03, 2015, 12:40:21 pm »
A few years ago when Builderboy wrote his fire graphics tutorial, I remember thinking it was super cool, and I went on to make my own fire engines designed around it. However, one thing always peeved me-- the design was very clever and made for speed, but it seemed too artificial for my liking-- particularly in the way it masked out pixels. The idea was to kill a pixel 1/8 of the time, so a PRNG was used to select from one of the following masks:

%01111111
%10111111
%11011111
%11101111
%11110111
%11111011
%11111101
%11111110
Of course, such an AND mask will kill exactly one bit in a byte, so exactly 1/8 of the pixels in a byte. A few weeks ago, this was bugging me and preventing my sleep. I set out to devise an algorithm that would generate masks that don't necessarily have a 1/8 kill rate each, but whose kill rate converges to 1/8 over time. For example, one mask might be %11111111 (0% killed) and the next could be %10110111 (25% killed).

The Idea

Suppose you have 3 random bits and you OR them together. The only way this can result in a 0 is if every input was a 0. The probability of that happening is .5^3 = .125 = 1/8. So for our fire mask, we can generate three random bytes, OR them together, and each bit has a 1/8 chance of being reset.

Speed

The disadvantage here is speed. Builderboy's method requires one pseudo-random number; this method requires three. However, if we consider that we are almost certainly going to use a fast pseudo-random number generator, and that we will want a long(ish) period, we have room to take advantage of the PRNG and achieve almost the same speed. For example, suppose you generate a 24-bit pseudo-random number-- with this method, you can just OR the three bytes generated (12cc) versus using an LUT (Builder's method):

  ld hl,LUT
  and 7
  add a,l
  ld l,a
  jr nc,$+3
  inc hl
  ld a,(hl)   ;43cc
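The 1/8 claim is easy to sanity-check numerically. A quick Monte Carlo sketch (seed and trial count are arbitrary):

```python
import random

random.seed(12345)
TRIALS = 100000
cleared = 0
for _ in range(TRIALS):
    # OR three random bytes; a bit ends up 0 only if all three source bits were 0
    mask = random.getrandbits(8) | random.getrandbits(8) | random.getrandbits(8)
    cleared += 8 - bin(mask).count("1")  # count cleared bits in the mask
rate = cleared / (8 * TRIALS)
print(round(rate, 3))  # should hover near 1/8 = 0.125
```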
In the example code I will give, I use a 16-bit PRNG, so I generate three of these numbers (six 8-bit values) for 2 masks, making generation take 1.5 times longer than Builder's, as opposed to 3 times as long.

Considerations

In order for this to work, you need an RNG or PRNG in which every bit has a 50% chance of being set or reset. I went with a Lehmer LCG that was fast to compute, had a period of 2^16, and had this property.

Example

The following code works at 6MHz or 15MHz and the LCD will provide almost no bottleneck. Sorry, no screenie:

smc = 1   ;1 for SMC, 0 for no SMC (use 1 if code is in RAM; it's faster)
plotSScreen = 9340h
saveSScreen = 86ECh

;==============================
#IF smc = 0
seed = saveSScreen
#ENDIF
;==============================
 .db $BB,$6D
 .org $9D95
fire:
;;first, set up. We will be writing bytes to the LCD left to right, then down
  di
  ld a,7          ;LCD y-increment
  out (16),a
;;set up the keyboard port to read keys [Enter]...[Clear]
  ld a,%11111101
  out (1),a
;make the bottom row of pixels black to feed the flames
  ld hl,plotSScreen+756
  ld bc,$0CFF
  ld (hl),c \ inc hl \ djnz $-2
fireloopmain:
  ld ix,plotSScreen+12
  in a,(1)
  and %01000000
  ret z
  call LCG
  ld b,63
fireloop:
;wait for LCD delay
  in a,(16)
  rla
  jr c,$-3
  ld a,80h+63
  sub b
  out (16),a
  push bc
  call LCG
  ld a,20h
  out (16),a
  call fire_2bytes+3
  call fire_2bytes
  call fire_2bytes
  call fire_2bytes
  call fire_2bytes
  call fire_2bytes
  pop bc
  djnz fireloop
  jp fireloopmain
fire_2bytes:
  call lcg
  push hl
  call lcg
  pop de
  ld a,e
  or d
  or l
  and (ix)
  out (17),a
  ld (ix-12),a
  inc ix
  push hl
  call lcg
  pop af
  or h
  or l
  and (ix)
  out (17),a
  ld (ix-12),a
  inc ix
  ret
lcg:
;240cc or 241cc, condition dependent (-6cc using SMC)
;;uses the Lehmer RNG used by the Sinclair ZX81
#IF SMC = 1
seed = $+1
  ld hl,1
#ELSE
  ld hl,(seed)
#ENDIF
;multiply by 75
  ld c,l
  ld b,h
  xor a
  ld d,a
  adc hl,hl
  jr nz,$+9
  ld hl,-74
  ld (seed),hl
  ret
  rla
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla \ add hl,bc \ adc a,d
;mod by 2^16 + 1 (a prime)
;current form is A*2^16+HL
;need:
;  (A*2^16+HL) mod (2^16+1)
;add 0 as +1-1
;  (A*(2^16+1-1)+HL) mod (2^16+1)
;distribute
;  (A*(2^16+1)-A+HL) mod (2^16+1)
;A*(2^16+1) mod (2^16+1) = 0, so remove
;  (-A+HL) mod (2^16+1)
;Oh hey, that's easy! :P
;I use this trick everywhere, you should, too.
  ld e,a
  sbc hl,de     ;No need to reset the c flag since it is already reset
  jr nc,$+3
  inc hl
  ld (seed),hl
  ret
EDIT: .gif attached. It looks slower in the screenshot than on my actual calc, though. EDIT2: On my actual calc, it is roughly 17.2FPS.
38
« on: March 31, 2015, 02:46:38 pm »
I made this program to remove the delay in the BASIC getKey function! It's simple and ugly and I want to improve it at some point. For now, I just wanted a small code.
Basically, repeating keys like arrows and delete will repeat a lot faster, and not just on the homescreen like one of my older programs (Speedy Keys).
39
« on: March 30, 2015, 01:48:30 pm »
I'm really peeved that I cannot seem to find an algorithm for computing arcsine that works like what I have. This algorithm is based on my work from years ago to compute sine and I have no idea why I never reversed it before now. Anyways, the algorithm: ArcSine(x), ##x\in[-1,1]##
x = 2x
s = z = sign(x)
iterate n times:
    x = x*x - 2
    if x < 0:
        s = -s
    z = 2z + s
return pi*z*2^(-n-2)
This algorithm extracts one bit each iteration, so for 16 bits of accuracy, you would iterate 16 times. I think it is a pretty cute and compact algorithm, so where is it? @_@
As a note, at the endpoints {-1,1} it may not give the expected answer. In both cases, if allowed to run infinitely, it would return (in binary) ##\frac{\pi}{2}\cdot 0.1111111111111..._{2}##, which of course is equivalent to ##\frac{\pi}{2}##.
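For reference, a direct Python transcription of the pseudocode (a float sketch; sign(0) is taken as +1 here, an edge case the pseudocode leaves open):

```python
import math

def arcsine(x, n=24):
    """Bit-extraction arcsine per the algorithm above; one result bit per iteration."""
    x = 2.0 * x
    s = z = 1 if x >= 0 else -1
    for _ in range(n):
        x = x * x - 2.0   # doubles the underlying angle (2*cos(t) -> 2*cos(2t))
        if x < 0:
            s = -s
        z = 2 * z + s     # accumulate one bit of the result
    return math.pi * z * 2.0 ** (-n - 2)
```

With n=30 this agrees with math.asin to well under a microradian for inputs like 0.5 and 1.0.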
40
« on: March 13, 2015, 01:00:16 pm »
I got this email from Google: Hello,
Earlier today, Google announced we will be turning down Google Code Project Hosting. The service started in 2006 with the goal of providing a scalable and reliable way of hosting open source projects. Since that time, millions of people have contributed to open source projects hosted on the site.
But a lot has changed since 2006. In the past nine years, many other options for hosting open source projects have popped up, along with vibrant communities of developers. It’s time to recognize that Google Code’s mission to provide open source projects a home has been accomplished by others, such as GitHub and Bitbucket.
We will be shutting down Google Code over the coming months. Starting today, the site will no longer accept new projects, but will remain functionally unchanged until August 2015. After that, project data will be read-only. Early next year, the site will shut down, but project data will be available for download in an archive format.
As the owner of the following projects, you have several options for migrating your data.
scope-os
langz80

The simplest option would be to use the Google Code Exporter, a new tool that will allow you to export your projects directly to GitHub. Alternatively, we have documentation on how to migrate to other services — GitHub, Bitbucket, and SourceForge — manually.
For more information, please see the Google Open Source blog or contact [email protected].
-The Google Code team
Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043 You have received this mandatory email service announcement to update you about important changes to Google Code Project Hosting.
I found Google Code much easier to use, but I see their point about there being enough services already.
41
« on: February 02, 2015, 03:00:47 pm »
Hi all, I was writing a binary search algorithm and since I cannot test it just yet, I was hoping people might be able to spot any problems or optimizations. My intent was to optimize this for the eZ80. Feel free to offer suggestions that include instructions from the eZ80 instruction set.
First, the input is a "key" to search for. It is zero-terminated, so it may be something like:

 .db "Hello",0

The table to look for it in is comprised of indices to the actual data. So the table might look like:
table_start:
  .dw (table_end-table_start)/2-1
  .dw expr0
  .dw expr1
  .dw expr2
  ...
  .dw exprlast
.table_end:
expr0:
  .db "AAAaaa",0
  ;other data
expr1:
  .db "BBBbbb",0
  ;other data
...
The actual strings do not need to be in order, just the order in which they are indexed. The strings must be zero-terminated, too.
So, passing a pointer to the keyword in DE and the table start in HL:
BinarySearch:
;expected in Z80 mode
  ld (key_ptr),de
  ld e,(hl) \ inc hl
  ld d,(hl) \ inc hl
  ld b,h \ ld c,l
  ld (max),de
  or a \ sbc hl,hl \ ld (min),hl
.loop0:
;average max (de) and min (hl) with truncation
  add hl,de \ rr h \ rr l
  push hl
;2 bytes per index, and BC points to the base of the table
  add hl,hl \ add hl,bc
;load the pointer to the expression to compare to in HL
  ld hl,(hl) \ ld de,(key_ptr)
.loop1:
  ld a,(de) \ cp (hl) \ jr c,less
;if they are the same, make sure to test if the bytes are zero (termination)
  jr z,$+4
  jr nc,more
;if a != 0 then the string is not terminated, but we don't know which string is "bigger" so keep looping
  inc de \ inc hl \ or a \ jr nz,loop1
.match:
  pop bc \ ret
.less:
;"or a" is just to reset the c flag for future code
;if [index]<key, add 1 to the average of min and max and set that as the new min
  or a \ pop hl \ inc hl \ ld (min),hl \ ld de,(max) \ jr loop0_end
.more:
;[index]>key: set max to the average of max and min.
  pop de \ ld (max),de \ ld hl,(min)
.loop0_end:
  sbc hl,de \ add hl,de \ jr nz,loop0
  or 1
  ret
The output should have z flag if a match was found, else nz. In the case of a match being found, BC is the index number, DE points to the byte after the key, HL points to the byte after the match. If no match was found, HL and DE point to the index in front of where the match should have been and BC points to the table data start. In any case, the c flag is reset.
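To help reason about the min/max bookkeeping, here is the intended search in Python (a model, not a line-for-line translation: the asm interleaves the byte-by-byte string comparison, and `table` here stands in for the sorted index table):

```python
def binary_search(table, key):
    """Return the index of key in an ascending-sorted table, or None if absent."""
    lo, hi = 0, len(table)        # (min), (max): max starts at the entry count
    while lo != hi:
        mid = (lo + hi) // 2      # average with truncation
        if table[mid] < key:
            lo = mid + 1          # new min is average+1
        elif table[mid] > key:
            hi = mid              # new max is the average
        else:
            return mid
    return None
```

Note the loop condition matches the routine's `sbc hl,de \ add hl,de \ jr nz,loop0`: iterate until min == max.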
42
« on: January 29, 2015, 09:25:53 am »
EDIT: Routines so far: mul16 mul32 gcd16 sqrt16 sqrt24 rand24 setbuffer hl*a/255 div16 prng24 atan8 [link]
Now that the eZ80 has been confirmed as the new processor in the next line of TI calculators, I thought it would be fun to start making optimized routines in preparation! We can take advantage of the 8-bit multiplication routine for multiplication, of course. As a note, I did not find Karatsuba multiplication to be faster (it was 66 to 76 clock cycles).

mul16 optimized:
mul16:
;;Expects Z80 mode
;;Inputs: hl,bc
;;Outputs: Upper 16 bits in DE, lower 16 bits in BC
;;54 or 56 t-states
;;30 bytes
  ld d,c \ ld e,l \ mlt de \ push de    ;11
  ld d,h \ ld e,b \ mlt de              ;8
  ld a,l \ ld l,c \ ld c,a              ;3
  mlt hl \ mlt bc \ add hl,bc           ;13
  jr nc,$+3 \ inc de \ pop bc           ;6
  ld a,b \ add a,l \ ld b,a             ;3
  ld a,e \ adc a,h \ ld e,a             ;3
  ret nc \ inc d \ ret                  ;7 (+2 for carry)
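The routine builds the 32-bit product schoolbook-style from four 8-bit MLT products. As an arithmetic sanity check of that decomposition (not a cycle-accurate model):

```python
def mul16(hl, bc):
    """(h*256+l)*(b*256+c) from four 8-bit products, as the MLT-based routine does."""
    h, l = hl >> 8, hl & 0xFF
    b, c = bc >> 8, bc & 0xFF
    # partial products: c*l (low), h*b (high), h*c and b*l (middle, shifted by 8)
    return (h * b << 16) + ((h * c + b * l) << 8) + c * l
```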
Notice on the right side are the instruction timings. MLT is always 6 cycles and can be used in Z80 or ADL mode. All they did was include it in the extended instructions. In this case, I use Z80 mode to take advantage of some of the math and as a result, I also save 3 t-states over ADL mode which would require that add hl,bc to be done in z80 mode, so "add.s hl,bc" which costs 2 cycles instead of one, and the pushes/pops would require 4cc instead of 3cc each. So, who wants to have fun? EDIT: 5-Feb-15mul32 optimizedmul32: ;;Expects Z80 mode ;;Inputs: ix points to the first 32-bit number ;; ix+4 points to the next 32-bit number ;;Outputs: ix+8 points to the 64-bit result ;;Algorithm: Karatsuba ;348cc to 375cc
;compute z0 ld hl,(ix) ;5 ld bc,(ix+4) ;5 call mul16 ;59+2 ld (ix+8),bc ;5 ld (ix+10),de ;5 ;compute z2 ld hl,(ix+2) ;5 ld bc,(ix+6) ;5 call mul16 ;59+2 ld (ix+12),bc ;5 ld (ix+14),de ;5 ;compute z1, most complicated as it requires 17-bits of info for each factor ld hl,(ix+2) ;5 ld bc,(ix) ;5 add hl,bc ;1 rla ;1 ld b,h ;1 ld c,l ;1 ld hl,(ix+6) ;5 ld de,(ix+4) ;5 add hl,de ;1 rla ;1 push hl ;3 push bc ;3 push af ;3 call mul16 ;59+2 pop af ;3 and 3 ;2 ex de,hl ;1 pop de ;3 jr z,$+7 ;label0 ;3|(6+1) ;bit 0 means add [stack0] to the upper 16 bits ;bit 1 means add [stack1] to the upper 16 bits ;both mean add 2^32 jp po,$+5 \ or 4 ;-- rra \ jr nc,$+7 ;4+4 rrca \ add hl,bc \ adc a,0 \ rlca ;-- ; srl a \ pop de \ jr nc,$+5 ;8+2 add hl,bc \ adc a,0 ; ld d,b \ ld e,c ;2
;z1 = AHLDE-z0-z2 ld bc,(ix+8) \ ex de,hl \ sbc hl,bc ;8 ld bc,(ix+10) \ ex de,hl \ sbc hl,bc \ sbc a,0 ;10
ld bc,(ix+12) \ ex de,hl \ sbc hl,bc ;8 ld bc,(ix+14) \ ex de,hl \ sbc hl,bc \ sbc a,0 ;10 ex de,hl ;1
ld bc,(ix+10) \ add hl,bc \ ld (ix+10),hl \ ex de,hl ;12 ld bc,(ix+12) \ adc hl,bc \ ld (ix+12),hl \ adc a,0 ;13 ret z \ ld hl,(ix+14) \ add a,l \ ld l,a ;7+16 jr nc,$+3 \ inc h \ ld (ix+14),hl \ ret ;--
gcd16 optimized:
GCD16:
;;Expects Z80 mode
;;Inputs: HL,DE
;;Output: HL
;;  BC=0
;;  DE=0
;;  A=0
;Binary GCD algorithm
;About 432cc on average (average of 500 iterations with randomized inputs on [0,65535])
;78 bytes
  xor a
  ld b,a
  ld c,a
  sbc hl,bc
  ex de,hl
  ret z
  sbc hl,bc
  ex de,hl
  ret z

;factor out cofactor powers of two
  ld a,l \ or e \ rra \ jr nc,$+16
  srl h \ rr l
  srl d \ rr e
  inc b
  ld a,l \ or e \ rra \ jr nc,$-12
.loop:
;factor out powers of 2 from hl
  ld a,l \ rra \ ld a,h \ jr c,$+10
  rra \ rr l \ bit 0,l \ jr z,$-5 \ ld h,a
;factor out powers of 2 from de
  ld a,e \ rra \ ld a,d \ jr c,$+10
  rra \ rr e \ bit 0,e \ jr z,$-5 \ ld d,a

  xor a
  sbc hl,de
  jr z,finish
  jr nc,loop
  add hl,de
  or a
  ex de,hl
  sbc hl,de
  jr loop
.finish:
  ex de,hl
  dec b
  inc b
  ret z
  add hl,hl \ djnz $-1 \ ret
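The routine is the binary GCD; a compact Python rendering of the same steps (shared powers of two counted the way the routine counts them in B, then shifted back in at the end):

```python
def gcd16(x, y):
    """Binary GCD: factor out common 2s, strip 2s from each, subtract smaller from larger."""
    if x == 0 or y == 0:
        return x | y  # gcd(n, 0) = n; the asm short-circuits zero inputs similarly
    shift = 0
    while x % 2 == 0 and y % 2 == 0:  # common factors of two (the B counter)
        x //= 2
        y //= 2
        shift += 1
    while True:
        while x % 2 == 0:  # strip remaining 2s; they can't be common factors now
            x //= 2
        while y % 2 == 0:
            y //= 2
        if x == y:
            return x << shift  # restore the shared powers of two
        if x > y:
            x -= y
        else:
            y -= x
```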
sqrt16 optimized:
sqrt16:
;;Expects Z80 mode
;;Inputs: HL
;;Output: A
;Unrolled, speed optimized.
;At most 112 clock cycles
;111 bytes.
  xor a \ ld c,a \ ld d,a \ ld e,l \ ld l,h \ ld h,c
add hl,hl \ add hl,hl \ sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
sla c \ ld a,c \ add a,a \ add hl,hl \ add hl,hl jr nc,$+5 \ sub h \ jr $+5 \ sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
ld e,h \ ld h,l \ ld l,c \ ld a,l add hl,hl \ rl e \ rl d add hl,hl \ rl e \ rl d sbc hl,de \ rla \ ret
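sqrt16 unrolls the classic restoring square root, consuming two input bits per step. A readable Python model of the method (a sketch, not cycle-accurate):

```python
def isqrt16(n):
    """Restoring integer square root for 0 <= n < 2^16, two input bits per step."""
    root = 0  # partial root built one bit per step
    rem = 0   # running remainder
    for shift in range(14, -1, -2):              # walk the 8 bit-pairs from the top
        rem = (rem << 2) | ((n >> shift) & 3)    # bring down the next bit-pair
        trial = (root << 2) | 1                  # candidate: (2*root + 1) at this scale
        if rem >= trial:
            rem -= trial
            root = (root << 1) | 1
        else:
            root <<= 1
    return root
```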
rand24 optimized:
Rand24:
;;Expects Z80 mode
;;Inputs: seed1,seed2
;;Outputs:
;;  AHL is the pseudo-random number
;;  seed1,seed2 incremented accordingly
;;Destroys: BC,DE
;;Notes:
;;  seed1*243+83 mod 65519 -> seed1
;;  seed2*251+43 mod 65521 -> seed2
;;  Output: seed1*seed2 mod 16777259
;;215cc worst case
  ld hl,(seed1)
  ld d,h
  ld h,243
  ld e,h
  mlt hl
  mlt de
  ld a,83
  add a,l
  ld l,a
  ld a,h
  adc a,e
  ld h,a
  ld a,d
  adc a,0
;now AHL mod 65519
;Essentially, we can at least subtract A*65519=A*(65536-17), so add A*17 to HL
  ld c,a
  ld b,17
  mlt bc
  add hl,bc

  ld de,65519
  jr nc,$+5
  or a \ sbc hl,de
  or a \ sbc hl,de
  jr nc,$+3
  add hl,de
  ld (seed1),hl
;seed1 done, now we need to do seed2

  ld hl,(seed2)
  ld d,h
  ld h,251
  ld e,h
  mlt hl
  mlt de
  ld a,43
  add a,l
  ld l,a
  ld a,h
  adc a,e
  ld h,a
  ld a,d
  adc a,0

;now AHL mod 65521
;Essentially, we can at least subtract A*65521=A*(65536-15), so add A*15 to HL
  ld c,a
  ld b,15
  mlt bc
  add hl,bc

  ld de,65521
  jr nc,$+5
  or a \ sbc hl,de
  or a \ sbc hl,de
  jr nc,$+3
  add hl,de
  ld (seed2),hl

;now seed1 and seed2 are computed
;seed1*seed2 mod 16777259
  ld bc,(seed1)
  call mul16      ; -> DEBC
;16777259 = 2^24+43
;D*2^24+EBC mod 16777259 = EBC+(D*2^24+43D-43D) mod 16777259
;D*2^24+EBC mod 16777259 = EBC+(16777259D-43D) mod 16777259
;D*2^24+EBC mod 16777259 = EBC-43D mod 16777259
  ld a,e
  ld e,43
  mlt de
;ABC-de
  ld h,b \ ld l,c
  or a
  sbc hl,de
  sbc a,0
  ret nc
  ld bc,43
  add hl,bc
  ret
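The reductions lean on small congruences; quick checks of the identities the comments use:

```python
# 65536 ≡ 17 (mod 65519) and 65536 ≡ 15 (mod 65521),
# so A*2^16 + HL can be partially reduced as HL + 17*A (resp. HL + 15*A).
assert (1 << 16) % 65519 == 17
assert (1 << 16) % 65521 == 15

# 16777259 = 2^24 + 43, so 2^24 ≡ -43 and D*2^24 + EBC ≡ EBC - 43*D (mod 16777259).
assert 16777259 == (1 << 24) + 43
for d in (0, 1, 200, 255):
    for ebc in (0, 12345, 0xFFFFFF):
        assert ((d << 24) + ebc) % 16777259 == (ebc - 43 * d) % 16777259
```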
Set Buffer optimized:
setBuf:
;;Any mode
;;Inputs: A is the byte to set each byte in the buffer to
;;  HL points to the buffer
;;  BC is the buffer size, greater than 1
;;8 bytes
;;14+3*BC clock cycles
  ld (hl),a
  dec bc
  ld d,h
  ld e,l
  inc de
  ldir
  ret
EDIT 9-Feb-15

HL*A/255 (can be used for division by 3, 5, 15, 17, 51, and 85, among others)
This one performs A*HL/255. Be warned that this does not work on certain boundary values. For example, A=85 would imply division by 3, but if you input an HL divisible by 3, you get one less than the actual result. So 9*85/255 should return 3, but this routine returns 2 instead.

fastMul_Ndiv255:
;;Expects Z80 mode
;;Description: Multiplies HL by A/255
;;Inputs: A,HL
;;Outputs: HL is the result
;;  A is a copy of the upper byte of the result
;;  DE is the product of the input A,H
;;  B is a copy of E
;;  C is a copy of the upper byte of the product of inputs A,L
;;32cc
;;18 bytes
  ld d,a
  ld e,h
  ld h,d
  mlt hl
  mlt de    ;15
;DE
; DE
;  HL
;   HL
  ld c,h
  ld b,e
  ld a,d
  add hl,bc \ adc a,0
  add hl,de \ adc a,0
  ld l,h \ ld h,a
  ret       ;17
HL*A/255 "fixed"
Modifying this to correct the rounding issue is a pain in terms of speed hit and size hit:

fastMul_Ndiv255_fix:
;;Expects Z80 mode
;;Description: Multiplies HL by A/255
;;Inputs: A,HL
;;Outputs: HL is the result
;;52cc
;;26 bytes
  push af
  ld d,a
  ld e,l
  ld l,d
  mlt hl
  mlt de
;HL
; HL
;  DE
;   DE
  ld c,d
  ld b,l
  ld a,h
  add hl,de \ adc a,0
  add hl,bc \ adc a,0
  ld d,l \ ld l,h \ ld h,a
  pop af \ add a,e
  ret nc
  adc a,d
  ret nc
  inc hl
  ret

HL*A/255 Rounded
However, rounding it is at little cost and works well:

fastMul_Ndiv255_round:
;;Expects Z80 mode
;;Description: Multiplies HL by A/255
;;Inputs: A,HL
;;Outputs: HL is the result
;;37cc
;;23 bytes
  ld d,a
  ld e,l
  ld l,d
  mlt hl
  mlt de    ;15
;HL
; HL
;  DE
;   DE
  ld c,d
  ld b,l
  ld a,h
  add hl,de \ adc a,0
  add hl,bc \ adc a,0
  ld l,h \ ld h,a
  sla e \ jr nc,$+3 \ inc hl
  ret       ;22
sqrt24 optimized:
Finally, a routine that expects ADL mode instead of Z80 mode! This is an integer square root routine, but it can be used for 8.8 fixed-point numbers by copying the fixed-point number to HL as 8.16, where the lower 8 bits are zero (or if you have extended precision from a previous calculation, feel free to use that). Then the square root is 4.8.

sqrt24:
;;Expects ADL mode
;;Inputs: HL
;;Outputs: DE is the integer square root
;;  HL is the difference inputHL-DE^2
;;  c flag reset
  xor a \ ld b,l \ push bc \ ld b,a \ ld d,a \ ld c,a \ ld l,a \ ld e,a
;Iteration 1
  add hl,hl \ rl c \ add hl,hl \ rl c
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 2
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 3
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 4
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 5
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 6
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 7
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl b
  ex de,hl \ add hl,hl \ push hl \ sbc hl,bc \ jr nc,$+8
  ld a,h \ cpl \ ld b,a
  ld a,l \ cpl \ ld c,a
  pop hl
  jr nc,$+4 \ inc hl \ inc hl
  ex de,hl
;Iteration 8
  add hl,hl \ ld l,c \ ld h,b \ adc hl,hl \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 9
  pop af
  rla \ adc hl,hl \ rla \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 10
  rla \ adc hl,hl \ rla \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 11
  rla \ adc hl,hl \ rla \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 12
  rla \ adc hl,hl \ rla \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
rr d \ rr e \ ret
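The asm above is the classic restoring square root, consuming two input bits per iteration. A Python rendition of the same method (for sanity-checking on a PC) is short:

```python
from math import isqrt

def sqrt24(n):
    """Bitwise square root of a 24-bit integer: returns (root, n - root*root),
    mirroring the routine's outputs (DE, HL)."""
    root, rem = 0, 0
    for i in range(11, -1, -1):        # 12 iterations, 2 bits each
        rem = (rem << 2) | ((n >> (2 * i)) & 3)
        cand = (root << 2) | 1         # (2r+1)^2 - (2r)^2 = 4r + 1
        if rem >= cand:
            rem -= cand
            root = (root << 1) | 1
        else:
            root <<= 1
    return root, rem

for n in (0, 1, 2, 255, 65535, 123456, 0xFFFFFF):
    r, rem = sqrt24(n)
    assert r == isqrt(n) and rem == n - r * r
```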
The sqrt24 routine is huge, and I believe less than 240cc.

div16 optimized

div16:
;;Inputs: DE is the numerator, BC is the divisor
;;Outputs: DE is the result
;;         A is a copy of E
;;         HL is the remainder
;;         BC is not changed
;140 bytes
;145cc
    xor a \ sbc hl,hl
    ld a,d
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ cpl \ ld d,a
    ld a,e
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ cpl \ ld e,a
    ret
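Each unrolled line above is one step of restoring division: shift a numerator bit into the remainder, try subtracting the divisor, and add it back on borrow. In Python (a model of the method, not of the register usage):

```python
def div16(n, d):
    """16/16 restoring division: returns (quotient, remainder)."""
    assert 0 <= n < 65536 and 0 < d < 65536
    q, rem = 0, 0
    for i in range(15, -1, -1):
        rem = (rem << 1) | ((n >> i) & 1)  # rla / adc hl,hl
        if rem >= d:                       # sbc hl,bc leaves no borrow
            rem -= d
            q |= 1 << i
    return q, rem

for n, d in ((0, 1), (65535, 1), (65535, 65535), (40000, 7), (123, 456)):
    assert div16(n, d) == divmod(n, d)
```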
prng24 optimized

This is a lot faster and smaller than the rand24 routine. I'm also pretty sure this routine wins on unpredictability, too, and it has a huge cycle length.

prng24:
;;Expects ADL mode.
;;Output: HL
;;50cc
;;33 bytes
;;cycle length: 281,474,959,933,440 (about 281 trillion)
    ld de,(seed1)
    or a
    sbc hl,hl
    add hl,de
    add hl,hl
    add hl,hl
    inc l
    add hl,de
    ld (seed1),hl
    ld hl,(seed2)
    add hl,hl
    sbc a,a
    and %00011011
    xor l
    ld l,a
    ld (seed2),hl
    add hl,de
    ret
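For reference, here is a Python model of what the routine computes: seed1 is updated roughly as 5*seed1 + 1 mod 2^24 (the `inc l` only touches the low byte, which the model reproduces), and seed2 takes one step of a 24-bit Galois LFSR. This is a sketch of the logic, not a cycle-exact emulation:

```python
MASK = 0xFFFFFF  # 24-bit registers in ADL mode

def prng24(state):
    """One step of the generator; state is [seed1, seed2]. Returns a 24-bit value."""
    s1, s2 = state
    de = s1
    hl = (de * 4) & MASK                       # add hl,hl twice
    hl = (hl & 0xFFFF00) | ((hl + 1) & 0xFF)   # inc l: no carry into the upper bytes
    state[0] = (hl + de) & MASK                # seed1 update
    carry = s2 >> 23                           # bit shifted out by add hl,hl
    hl = (s2 << 1) & MASK
    lo = ((0x1B if carry else 0) ^ hl) & 0xFF  # sbc a,a / and %00011011 / xor l
    state[1] = (hl & 0xFFFF00) | lo            # Galois LFSR step on seed2
    return (state[1] + de) & MASK              # mix the two streams

state = [0x123456, 0x654321]
out = [prng24(state) for _ in range(1000)]
assert all(0 <= v <= MASK for v in out)
# the LFSR step is invertible (the tap constant is odd), so a nonzero seed2 stays nonzero
assert state[1] != 0
```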
43
« on: December 24, 2014, 02:12:45 am »
EDIT: See the necro-update below-- it provides a much more efficient algorithm, even 1 multiply better than most suggested implementations of Newton-Raphson division
In the study of numerical algorithms, quadratic convergence refers to an iterative algorithm that approximates a function, where each iteration doubles the digits of accuracy. In this post, I will present an algorithm for division that can double, triple, quadruple, etc. the number of digits of accuracy at each iteration. I will then show that the optimal algorithm is the one that offers cubic convergence (digits of accuracy tripled).

What does it mean to multiply the number of correct digits by a?

Suppose that we have a sequence ##x_{0},x_{1},x_{2},x_{3},...## and suppose that ##x_{n}\rightarrow c##. Then by the definition of convergence, ##\lim\limits_{n\rightarrow\infty}{|x_{n}-c|}=0##. You should always think of |x-y| as a "distance" between two points, and in this case, that distance is the error of our approximations, ##x_{n}##. If the number of correct digits is multiplied by a, then that means ##\log(|x_{n+1}-c|)=a\cdot\log|x_{n}-c|##, or rewritten using logarithm tricks, ##|x_{n+1}-c|=|x_{n}-c|^{a}##. This assumes that the error is initially less than 1.

The Algorithm

The algorithm is very simple, but to break it down we need to look at some "obvious" bits of math. Firstly, ##\frac{N}{D}=\frac{N\cdot c}{D\cdot c}##, and secondly, all real numbers can be written in the form ##x=y\cdot 2^{m}, y\in(.5,1]##. You may have said "duh" at the first statement, but if you need convincing of the second, it basically says that if you divide or multiply a number by 2 enough times, then it will eventually land between 1/2 and 1. Or shown by example: ##2^{11}<3767<2^{12}##, so dividing through by ##2^{12}##, we get ##.5<3767/4096<1##. All real numbers can be bounded by consecutive powers of 2, therefore they can all be written in the above form. The final piece of the puzzle is to recognize that ##\frac{N}{1}=N##. What we will do is recursively multiply D by some constant c at each iteration in order to drive it closer to 1.
First, use range-reduction techniques to get D on (.5,1] (multiply both N and D by the power of 2 that puts D in that range). Then what we want is to choose c so that ##|1-D_{n}\cdot c|= |1-D_{n}|^{a}##. If ##0\leq D_{n}\leq 1##, then we have ##1-D_{n}\cdot c = (1-D_{n})^{a}##. When a is a natural number greater than 1 (if it is 1, then there is no convergence), we have the following:

##1-D_{n}\cdot c = \sum\limits_{k=0}^{a}{{a \choose k}(-D_{n})^{k}}##
##D_{n}\cdot c = 1-\sum\limits_{k=0}^{a}{{a \choose k}(-D_{n})^{k}}##
##D_{n}\cdot c = 1-(1+\sum\limits_{k=1}^{a}{{a \choose k}(-D_{n})^{k}})##
##D_{n}\cdot c = -\sum\limits_{k=1}^{a}{{a \choose k}(-D_{n})^{k}}##
##D_{n}\cdot c = {a \choose 1}D_{n}-{a \choose 2}D_{n}^{2}+{a \choose 3}D_{n}^{3}-\cdots##
##c = {a \choose 1}-{a \choose 2}D_{n}+{a \choose 3}D_{n}^{2}-\cdots##

Using a technique similar to Horner's method, we can obtain:

##c = {a \choose 1}-D_{n}({a \choose 2}-D_{n}({a \choose 3}-D_{n}({a \choose 4}-D_{n}\cdots##

Then the selection of c requires a-2 multiplications and a-1 subtractions. If we take ##N_{n+1}=N_{n}\cdot c## and ##D_{n+1}=D_{n}\cdot c##, then the number of digits of accuracy is multiplied by a, using a total of a multiplications and a-1 subtractions.

Example

Let's say we want quadratic convergence. That is, a=2. Then ##c_{n+1}=2-D_{n}##. That's right-- all you have to do is multiply by 2 minus the denominator at each iteration, and if you started with 1 digit of accuracy, you next get 2 digits, then 4, then 8, 16, 32,....

Let's perform 49/39. Divide numerator and denominator by ##2^{6}=64## and we get .765625/.609375.
First iteration:
c=2-D=1.390625
Nc=1.064697266
Dc=.8474121094
Second iteration:
c=1.152587891
Nc=1.227157176
Dc=.9767169356
Next iterations, in the form {c,N,D}:
{1.023283064,1.255729155,.9994578989}
{1.000542101,1.256409887,.9999997061}
{1.000000294,1.256410256,.9999999999}
{1.000000294,1.256410256,1.000000000}
And in fact, 49/39 is 1.256410256 according to my calc.
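The quadratic iteration takes only a few lines of Python to check (`frexp` performs the power-of-two range reduction; the function name is mine):

```python
import math

def divide_quadratic(num, den, iters=6):
    """Approximate num/den by repeatedly multiplying both by c = 2 - D."""
    d, e = math.frexp(den)      # den = d * 2**e with d in [0.5, 1)
    n = num / 2**e
    for _ in range(iters):
        c = 2 - d
        n *= c
        d *= c
    return n

# reproduce the 49/39 example: first multiplier is 2 - 0.609375 = 1.390625
d0, e0 = math.frexp(39)
assert d0 == 0.609375 and 49 / 2**e0 == 0.765625
assert abs(divide_quadratic(49, 39) - 49 / 39) < 1e-12
```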
Can we find an optimal choice for a?

The only variable over which we have control is a-- how quickly the algorithm converges. Therefore we should find an optimal choice of a. In this case, "optimal" means getting the most accuracy for the least work (computation time). First, ##1-D_{0}<.5=2^{-1}##, so after n iterations we have at least ##a^{n}## bits of accuracy, and since ##D_{0}## may get arbitrarily close to .5, this is the best lower bound on the number of bits of accuracy achieved after n iterations for a randomly chosen D. As well, after n iterations we have used ##a\cdot n## multiplications (the subtractions are treated as trivial). Then in order to get m bits of accuracy, we need roughly ##a\log_{a}(m)## multiplications. We want to minimize this function over integers a greater than 1. To do that, we find when ##0=\frac{d}{da}(a\log_{a}(m))##:

##0=\frac{d}{da}(a\log_{a}(m))##
##0=\frac{d}{da}(a\frac{\log(m)}{\log(a)})##
##0=\frac{d}{da}(\frac{a}{\log(a)})##
##0=\frac{1}{\log(a)}-\frac{1}{\log^{2}(a)}##
Since a>1 (so ##\frac{1}{\log(a)}\neq 0##):
##0=1-\frac{1}{\log(a)}##
##1=\frac{1}{\log(a)}##
##\log(a)=1##
##a=e##

However, e is not an integer; it is bounded by 2 and 3, so check which is smaller: 2/log(2) or 3/log(3). We find that a=3 is the optimal value, achieving the most accuracy for the least work.

Example:
Our algorithm body is:
c=3-D*(3-D)
N=N*c
D=D*c
Then 49/39=.765625/.609375 and {c,N,D}=:
{1.543212891,1.181522369,.9403953552}
{1.063157358,1.256144201,.9997882418}
{1.000211803,1.256410256,1.000000000}

Conclusion

In reality, there is a specific M such that for all m>M, a=3 always requires less computational power than a=2. Before that cutoff, there are values for which they require the same amount of work, or where the cost even favors a=2. For example, 32768 bits of accuracy can be achieved with the same amount of work either way. The cutoff appears to be m=2^24, which is pretty gosh darn huge (16777216 bits of accuracy is over 5 million digits).
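Likewise, the cubic version is easy to verify, with c computed Horner-style; the last assertion checks the "least work" claim above, that a=3 beats a=2 on the a/log(a) cost factor:

```python
import math

def divide_cubic(num, den, iters=4):
    """Approximate num/den; each pass triples the bits of accuracy."""
    d, e = math.frexp(den)      # den = d * 2**e with d in [0.5, 1)
    n = num / 2**e
    for _ in range(iters):
        c = 3 - d * (3 - d)     # one multiply for c, two more for N*c and D*c
        n *= c
        d *= c
    return n

# first multiplier for 49/39 matches the worked trace: 3 - 0.609375*(3 - 0.609375)
assert abs((3 - 0.609375 * (3 - 0.609375)) - 1.543212891) < 1e-9
assert abs(divide_cubic(49, 39) - 49 / 39) < 1e-12

# cost to reach m bits scales like a*log_a(m) = (a/ln a)*ln(m); a=3 wins over a=2
assert 3 / math.log(3) < 2 / math.log(2)
```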
44
« on: December 03, 2014, 12:02:16 pm »
Overview

I was reading about the BKM algorithm and I thought it was similar to my previous attempt at making an algorithm for the complex exponential function. However, I tried implementing it and failed miserably. So I altered my previous algorithm with an idea from the BKM algorithm, and I made my own algorithm that works as quickly, uses a smaller LUT, works on a wider range of values, and has a more straightforward implementation.

Why is the complex exponential function useful?

The complex exponential function can be evaluated as ##e^{x+iy}=e^{x}\left(\cos(y)+i \sin(y)\right)##. So for people who actually need a complex exponential function, this is useful, but it can also compute the real exponential function (take y=0), or real sine and cosine simultaneously (take x=0).

Goals of the algorithm

Although I only advertise my algorithm as working on the rectangle [-1,1]+[-1,1]i, it actually works on approximately (-1.754365792,1.754365792)+(-1.213427912,1.213427912)i. Feel free to use this to your advantage. As well, since shifting and adding are easy [citation needed], this is a shift-and-add algorithm. It takes approximately n iterations for n bits of accuracy, and 8 tables of n/2 floats (versus 18 tables suggested by the BKM algorithm).

The Algorithm

We will first give simple pseudocode, then proceed to optimize it. As in my old algorithm and the BKM algorithm, we start by observing that ##e^{z-s}=e^{z}e^{-s}=e^{z}/e^{s}##. So if z-a-b-c-d-...=0, then we have ##e^{z-a-b-c-...}=e^{z}/(e^{a}e^{b}e^{c}\cdots)##, so ##1=e^{z}/(e^{a}e^{b}e^{c}\cdots)##, so ##e^{a}e^{b}e^{c}\cdots=e^{z}##. The trick is finding values for a,b,c,... so that the multiplications are "easy," as in comprised of a shift and add. Since we are working with complex numbers, take the complex number d where the real and imaginary parts are in the set {-1,0,1}. Multiplication by those values is even more trivial than a shift and add. What we will do is take ##e^{a}=1+d2^{-n}##. Multiplication by ##2^{-n}## is a right shift by n. If ##e^{a}=1+d2^{-n}##, then ##\log(e^{a})=\log(1+d2^{-n})##, so ##a=\log(1+d2^{-n})##.
There are 9 possible values for d, so you might think, "oh no Zeda, we need 9 tables of complex numbers, which is basically 18 tables of floats," but we all know that math+programming=magic [citation not needed]. So let's look at how the complex logarithm is defined, and call the real part of d "s" and the imaginary part "t", so d=s+t*i. By the way, references to "log" imply the natural logarithm in the math world, so log_e or ln(), not log_10. ##\log(x+iy) = \log\left(\sqrt{x^{2}+y^{2}}\right)+i\tan^{-1}(y/x)= \frac{1}{2}\log(x^{2}+y^{2})+i\tan^{-1}\left(\frac{y}{x}\right)## (side note: atan has infinitely many valid outputs-- just add 2*k*pi for integers k, but we are only interested in k=0, the principal Arg). So our tables would be built with ##x=1+s2^{-n}##, ##y=t2^{-n}##. But there are lots of repeated values, so we actually need far fewer tables. For our purposes, we will note that to choose s and t, we will use s=sign(iPart(Re(z)*2^(m+1))), t=sign(iPart(Im(z)*2^(m+1))), where iPart() returns the integer part of the input, and sign() returns 1 if positive, -1 if negative, and 0 if 0. Without further ado, the algorithm:
1.0→v
1.0→w
For k,0,n-1
  sign(x)→s
  sign(y)→t
  if |x|<2^(1-k)
    0→s
  if |y|<2^(1-k)
    0→t
  if s≠0 & t≠0
    v→a
  if t=0
    x+log(1+s2^-k)→x
    v+s*v>>k→v
  if s=0
    x+log(1+4^-k)→x
    y+t*atan(2^-k)→y
    v-tw>>k→v
    w+ta>>k→w
  if s=-1
    x+log((1-2^-k)^2+4^-k)→x
    y+t*atan(1/(2^k-1))→y
    v-v>>k-tw>>k→w
    w-w>>k+ta>>k→v
  if s=1
    x+log((1+2^-k)^2+4^-k)→x
    y+t*atan(1/(2^k+1))→y
    v+v>>k-tw>>k→w
    w+w>>k+ta>>k→v
The result is v+iw to n bits of accuracy. It seems complicated, but it is basically just taking care of each case for s and t. As well, we can multiply x and y by 2 each iteration, so that instead of checking the nth bit, we can check whether they are less than 1. Then we would need to multiply each table entry by ##2^{k+1}##. For further optimization, we note that to 2m bits of accuracy, when k>m-1, ##\log(1+d2^{-(k+1)})=.5\cdot\log(1+d2^{-k})## to the precision we need. So we can basically reuse the last table value, dividing by 2 each iteration, then multiplying by 2 to get our ##2^{-(k+1)}##. But wait, x/2*2=x, so we can use the last value in each table directly! To clean things up a bit, let's name the tables table0 through table7 and use an index n to address each value.

For the optimized code to 2m bits of accuracy:
1.0→v
1.0→w
For k,0,2m-1
  if k<m
    k→n
  x<<1→x
  y<<1→y
  sign(x)→s
  sign(y)→t
  if |x|<1
    0→s
  if |y|<1
    0→t
  v→a
  if s=0
    if t=-1
      x+table6[n]→x
      y-table7[n]→y
      v+w>>k→v
      w-a>>k→w
    if t=1
      x+table6[n]→x
      y+table7[n]→y
      v-w>>k→w
      w+a>>k→v
  if s=-1
    if t=-1
      x+table0[n]→x
      y-table2[n]→y
      v-v>>k+w>>k→w
      w-w>>k-a>>k→v
    if t=0
      x+table4[n]→x
      v-v>>k→w
      w-w>>k→v
    if t=1
      x+table0[n]→x
      y+table2[n]→y
      v-v>>k-w>>k→w
      w-w>>k+a>>k→v
  if s=1
    if t=-1
      x+table1[n]→x
      y-table3[n]→y
      v+v>>k+w>>k→w
      w+w>>k-a>>k→v
    if t=0
      x+table1[n]→x
      v+v>>k→w
      w+w>>k→v
    if t=1
      x+table1[n]→x
      y+table3[n]→y
      v+v>>k-w>>k→w
      w+w>>k+a>>k→v
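The table entries are just the real and imaginary parts of ##\log(1+d2^{-k})##, so both the table forms used in the pseudocode and the reuse-the-last-entry trick are easy to sanity-check numerically:

```python
import cmath, math

k = 5
for s in (-1, 0, 1):
    for t in (-1, 0, 1):
        f = 1 + complex(s, t) * 2**-k
        if f == 1:
            continue
        # real part: (1/2)*log(|f|^2), matching forms like log((1+2^-k)^2 + 4^-k)
        assert abs(cmath.log(f).real
                   - 0.5 * math.log((1 + s * 2**-k)**2 + t * t * 4**-k)) < 1e-14
        # imaginary part on the diagonals: atan(1/(2^k + s))
        if s != 0 and t != 0:
            assert abs(abs(cmath.log(f).imag) - math.atan(1 / (2**k + s))) < 1e-14

# for k past m, each table's next entry is (nearly) half the previous one
for k in (20, 24):
    assert abs(math.log(1 + 2**-(k + 1)) - 0.5 * math.log(1 + 2**-k)) < 2**-(2 * k - 1)
```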
Summary

For floating-point math with floats in base 2, arbitrary shifting is trivial (just change the exponent). So each iteration requires at most 6 trivial shifts, 6 trivial compares, and 6 non-trivial additions/subtractions. So for 2m bits of precision, the total speed cost is 12mA+epsilon, where A is the speed of an addition or subtraction, and epsilon is a small, positive number. The total space cost is 8m Floats. Compared to the BKM algorithm, the speed is the same (with just a change in epsilon), but BKM uses 18m Floats, so we save 10m Floats. We also work on a larger space than the BKM algorithm, which works on a trapezoidal region of the complex plane entirely contained in (a proper subset of) the set on which this algorithm works.

Future Plans

I plan to write this up more formally, including the proofs of why we can use m elements for 2m bits of accuracy, and why we can just look at the sign and the nth bits of the real and imaginary components of the input to choose our d. This last bit was a difficult task for the creator of the BKM algorithm, and in their paper they mentioned that the bounds were theoretical and only as tight as they could manage. In practice they found that the bounds were loose, and they suspected that they could expand them, but they couldn't prove it. I found an easy way to prove this for my algorithm, which I suspect will apply to the BKM algorithm, since at this point they are nearly identical.

Gosh I hope I don't have typos, formatting issues, or errors in the algorithms I posted. I have been typing and revising this for hours .__.
45
« on: November 20, 2014, 10:07:48 am »
Hi again! This time I was fooling around with tangent (who knows why? I don't .__.) Anyways, I was having some fun and I figured out that Horner's method can be abused to approximate tangent by logic similar to what I used to approximate arctangent: all you have to do is approximate infinitely many terms of the Maclaurin series with one simple function!
So before the fun, I'll show what Horner's method is (it's a super simple idea): If you've never seen the identity ##\sin(x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+...##, that's fine, just trust me (you'll get to it in calculus and it's crazy cool stuff). Horner's method is what a true 31337 programmer would come up with to efficiently approximate that polynomial. First, factor out the first term in the series (##x##): ##\sin(x)=x(1-\frac{x^{2}}{3!}+\frac{x^{4}}{5!}-\frac{x^{6}}{7!}+...)## Now in that inside term, factor out the ##-\frac{x^{2}}{3!}##:
##\sin(x)=x(1-\frac{x^{2}}{3!}(1-\frac{x^{2}}{4\cdot 5}+\frac{x^{4}}{4\cdot 5\cdot 6\cdot 7}-\frac{x^{6}}{4\cdot 5\cdot 6\cdot 7\cdot 8\cdot 9}+...))## Rinse, repeat:
##\sin(x)=x(1-\frac{x^{2}}{3!}(1-\frac{x^{2}}{4\cdot 5}(1-\frac{x^{2}}{6\cdot 7}(1-\frac{x^{2}}{8\cdot 9}-...## So the programmer sees, "aw, shweet, I only need to compute x2 once and reuse it a bunch of times with some constant multiplications!" So yeah, Horner's method is snazztacular, and I urge you to use it when possible.
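As a quick check of the nested sine form (truncated after the ##\frac{x^{2}}{8\cdot 9}## level, so the error on [-1,1] should be near ##x^{11}/11!##):

```python
import math

def horner_sin(x):
    """sin(x) via the nested Maclaurin form, computing x*x once and reusing it."""
    x2 = x * x
    return x * (1 - x2/6 * (1 - x2/20 * (1 - x2/42 * (1 - x2/72))))

for x in (-1.0, -0.5, 0.0, 0.3, 1.0):
    assert abs(horner_sin(x) - math.sin(x)) < 1e-6
```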
Now let's look at tan(x). The Maclaurin series expansion for tan(x) is uuuugly. Each term uses Bernoulli numbers which are recursively computed and are intricately linked with some really cool functions like the Riemann Zeta function. For comparison (thanks to Mathworld for the table):
##\cos(x)=\sum\limits_{k=0}^{\infty}{\frac{(-1)^{k}}{(2k)!}x^{2k}}## (simple, cute, compact)
##\sin(x)=\sum\limits_{k=0}^{\infty}{\frac{(-1)^{k}}{(2k+1)!}x^{2k+1}}## (simple, cute, compact)
##\tan(x)=\sum\limits_{k=0}^{\infty}{\frac{(-1)^{k}4^{k+1}\left(4^{k+1}-1\right)B_{2k+2}}{(2k+2)!}x^{2k+1}}## (wtf are math)
That's what it looks like to divide sine by cosine. Not pretty (well, okay, it is actually pretty gorgeous, but whatevs). So the first few terms of tan(x) are:
##\tan(x)=x+\frac{x^{3}}{3}+\frac{2x^{5}}{15}+\frac{17x^{7}}{315}+\frac{62x^{9}}{2835}+\frac{1382x^{11}}{155925}+...##
Nasty gorgeous, but applying Horner's Method, we end up with:
##\tan(x)=x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}(1+\frac{62x^{2}}{153}(1+...##
So seeing this, I realized, "wait a minute, those coefficients sure don't look like they are converging to zero or one, but something in between!" And that made me think about how ##\frac{1}{1-ax^{2}}=(1+ax^{2}(1+ax^{2}(1+ax^{2}(1+ax^{2}(1+...##. Now, I am a mathematician, but I am also a programmer so what I am about to say is blasphemous and I am sorry, but I decided to numerically evaluate and assume those constants converge. I ended up with approximately .405285. What I am saying is:
##\tan(x)\approx x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}(1+\frac{62x^{2}}{153}(1+.405285x^{2}(1+.405285x^{2}(1+.405285x^{2}(1+...## So then I ended up with:
##\tan(x)\approx x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}(1+\frac{62x^{2}}{153}\cdot\frac{1}{1-.405285x^{2}}))))##
##\tan(x)\approx x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}(1+\frac{62x^{2}}{153-62x^{2}}))))##
##\tan(x)\approx x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}\cdot\frac{153}{153-62x^{2}})))##
It's still a bit of work to compute, but I had fun! Now to do the mathy thing and actually prove the constants converge and maybe even find its exact value!
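A quick numerical check of the final formula, plus a look at the coefficient ratios: if the pattern holds, the ratios should approach the reciprocal of the squared radius of convergence (tan's nearest pole is at ##\pi/2##), i.e. ##4/\pi^{2}\approx 0.4052847##, which matches the .405285 observed above:

```python
import math
from fractions import Fraction

def tan_approx(x):
    """x(1 + x^2/3(1 + 2x^2/5(1 + 17x^2/42 * 153/(153 - 62x^2))))"""
    x2 = x * x
    inner = 1 + (17 * x2 / 42) * (153 / (153 - 62 * x2))
    return x * (1 + x2 / 3 * (1 + 2 * x2 / 5 * inner))

for x in (-1.0, -0.7, 0.1, 0.5, 1.0):
    assert abs(tan_approx(x) - math.tan(x)) < 1e-4

# Horner ratios from the series coefficients listed above
coefs = [Fraction(1), Fraction(1, 3), Fraction(2, 15), Fraction(17, 315),
         Fraction(62, 2835), Fraction(1382, 155925)]
ratios = [coefs[i + 1] / coefs[i] for i in range(len(coefs) - 1)]
assert abs(float(ratios[-1]) - 4 / math.pi**2) < 1e-4   # ~ .405285
```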