This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.
Topics - Xeda112358
Pages: 1 2 [3] 4 5 ... 13
31
« on: August 06, 2015, 08:22:36 am »
This is a work-in-progress, but the intent is for it to convert a token into its individual chars. It first converts the token to ASCII (this is easy; the OS provides a call), then the hard part is converting the ASCII back to individual tokens. Uppercase is easy, and I also added conversion for lowercase chars, space, and "(" so that should get most tokens. However, something like tan-1( will not properly convert the -1.
32
« on: August 05, 2015, 10:10:09 am »
There was a request on Cemetech that I took interest in, and this is my work so far. You basically pass a number into the program, and it returns the nth catalog element. So 1:Asm(prgmCATELEM would return "abs(" as an example. If you pass 0 into it, it returns how many catalog elements there are.
Before you can use prgmCATELEM for its intended purpose, you must run it once, and it will set itself up for your specific OS. Suggestion: If you are going to use prgmCATELEM in a program that you want to release into the wild, provide a fresh prgmCATELEM, not the version on your calculator. Further, you should run it once at the start of your program to install, just in case the user of your program didn't run the installation step.
There are definitely issues with how prgmCESETUP searches for the catalog table. It assumes that the catalog will not have its first 4 bytes split across two pages, and it assumes that the catalog starts with "abs(" and then " and ". It also assumes that the catalog end is on the same page and ends with "?".
33
« on: July 27, 2015, 12:03:43 pm »
I am trying to write some link port routines and I am having some issues. Every so often, data won't be correctly transmitted, and I have deduced that this only occurs when the sent byte has bit 1, 3, or 5 set; it sometimes causes the next bit up to be set (so if bit 3 is supposed to be set, it also sets bit 4). This is a chaotic occurrence.
What I do know is when the odd bits are read, bit 0 of port 0 (so the tip) reads 1. As well, the sending calc had a low battery, so I am not sure if it is an issue of bringing the lines up. I am pretty sure this is not the problem, though, as I switched out batteries.
I also tried adding code that guaranteed that the lines were reading correctly, otherwise it would rewrite the value to port 0 until it did. This still didn't work. I tried adding huge delays, it still was just as unreliable. Interrupts and link assist are disabled.
Below is my code if you want to muddle through it. It is supposed to send a value in Ans to the other calc and the receiving calc receives in Ans (modulo 100).
#define bcall(x) rst 28h \ .dw x
 .db $BB,$6D
 .org $9D95
 bcall(4AD7h)
 bcall(4AEFh)
sendb:
 di
 ld c,a
 xor a
 out (8),a
 ld e,55h
sendloop:
 rrc c
 rla
 sla e
 ccf
 rla
 out (0),a
 ld b,6      ;this is a delay, and not calculated to be precise. We probably do not even need a delay
 djnz $
 jr nz,sendloop
 ld b,10     ;delay
 djnz $
 xor a
 out (0),a
 ret
#define bcall(x) rst 28h \ .dw x
 .db $BB,$6D
 .org $9D95
 call getb
 bcall(478Ch)
 bcall(4ABFh)
 ret
getb:
 di
 xor a
 out (8),a
 ld bc,$08AA
readloop:
 in a,(0)
 xor c
 rra
 jr c,readloop
 rra         ;bits cycled in are masked with 0x55 as a side effect. Need to invert anyway, so mask at the end with 0xAA
 rr c
 djnz readloop
 ld a,c
 and 1
 rrca
 xor c
 xor $AA
 ret
34
« on: July 15, 2015, 10:29:19 am »
I was not sure where to post this, but I made a calc-to-calc serial cable!
I ended up finding two sets of headphones with the right size tips (not sure what they are called), so I cut them up with an ample length of wire. I also have a bunch of jumper wires that I got for my Raspberry Pi, so after cutting the headphone wires, I exposed the three sets of wire, used an abrasive to strip off the enameling, and spliced each wire with a female tip of a jumper wire. From there, I can connect them with male-male jumpers and it properly transmits data between my calcs!
The reason that I didn't just splice the two sets of wires together directly is because now I can plug in other items. For example, I connected male leads to an electric motor and LED-- it lets me control them through the link port now! I can also hook them up to a breadboard if I wanted to, making it possible to connect more calculators, or I can directly plug them into the GPIO board on my Raspberry Pi. A few months ago I did this and wrote a Python program and a calc program that could communicate with each other (basically just sending a string in Ans to the Pi, and have it displayed in the terminal).
Unfortunately, I lost the Python program due to a crash, but I still have the calc program backed up on-calc. My plan is to come up with some working routines from scratch, since I am still pretty terrible at figuring out the link port. It took me forever to figure out that the receiving calc can't change the lines.
35
« on: June 23, 2015, 12:30:33 pm »
I don't want to put this in the Optimized Routines thread as I haven't had it long enough for it to be ultra-optimized. This code is intended as a method of testing whether HL is divisible by DE. Its average run time for random 16-bit inputs is 163cc, and where DE<=HL (more likely in practice), it is 229cc. For code that is doing prime-checking, it will more likely take about 707.5cc (since DE will be roughly the square root of HL or less).

isDivisible:
;;Inputs: DE,HL
;;Outputs: c flag set if HL is not divisible by DE, else c flag is reset.
;;         HL is 0 if divisible.
;;1087cc worst case (HL=65535, DE=1).
;;Average for random inputs: 163cc
;;Average for DE<=HL: 229cc.
  ld a,d \ or e \ ccf \ ret z       ;remove this if DE is always guaranteed non-zero
;step 1
  ld a,e \ or l \ rra \ jr c,step2  ;\
  srl d \ rr e \ rr h \ rr l        ; |
  ld a,e \ or l \ rra \ jr nc,$-11  ; |Remove these if DE is always guaranteed odd at input.
step2:                              ; |
  ld a,e \ rra \ ccf \ ret c        ;/
;steps 3, 4, and 5
  ld a,l
  or a
loop:
  sbc hl,de \ ret c \ ret z
  rr h \ rra \ bit 0,a \ jr z,$-5
  ld l,a
  jp loop
Motivation and Development

I often find myself in a situation where I need to find the factors of a number, but I have no technology around to aid me. This means I need to use... mental arithmetic! I've been doing this for 15 years, so I have refined my mental process quite a bit. It is still a trial division algorithm, but with a very obfuscated "division" technique. We don't need to compute 1131/7 to see if it is divisible by 7; we just need to see whether 7 divides 1131, and this is what my algorithm does. Interestingly, testing divisibility at the algorithmic level is a little faster than division. Not by much, but the difference is non-negligible.

The Algorithm

The core algorithm is designed around checking whether (A mod B == 0) is true or false. We also make the assumption that B is odd and, by extension, non-zero. The case where B is non-zero and even will be discussed later. Since B is odd, 2 does not divide B. This means that if A is even: (A mod B == 0) if and only if (A/2 mod B == 0). We also know by the definition of divisibility that (A mod B) == ((A+c*B) mod B) where c is any integer. Combining all this, we have an algorithm:
1. Remove all factors of 2 from A.
2. With A now odd, do A=A-B.
3. If the result is zero, that means (A mod B == 0).
4. If the result underflows (becomes "negative", or on the Z80, sets the carry flag), it means that A was somewhere on [1,B-1], so it is not divisible by B.
5. Otherwise, continue back at 1.
Now suppose B is allowed to be non-zero and even. Then B is of the form d*2^k where d is odd. This just means there are some factors of 2 that can be removed from B until it is odd. A can only be divisible by B if it has at least as many factors of 2 as B. If we factor out common factors of 2 and find B is still even, then A is not divisible by B. Otherwise we have an odd number and only need to check the new (A mod d), for which we can use the "odd algorithm" above. So putting it all together:
1. If B==0, return FALSE.
2. Remove common factors of 2 from A and B.
3. If B is still even, return FALSE.
4. Remove all factors of 2 from A.
5. Subtract B from A (A=A-B).
6. If the result is zero, return TRUE.
7. If the result is "negative" (setting the carry flag on many processors), return FALSE.
8. Repeat at 4.
The overhead steps are (1) to (3). The iterated steps are (4) and (5). Because (5) always produces an even number, step (4) then always divides by at least one factor of 2. This means the algorithm takes at most 1+ceil(log2(A))-floor(log2(B)) iterations. For example, if A is a 37-bit number and B is a 13-bit number, this takes at most 38-13 = 25 iterations. However, in practice it is usually fewer iterations.

Example Time

Say I wanted to test if 1337 is divisible by 17. Since 17 is odd, we can proceed. 1337 is odd, so no factors of 2 to remove.
1337-17 == 1320
1320/2 == 660
660/2 == 330
330/2 == 165
165-17 == 148
148/2 == 74
74/2 == 37
37-17 == 20
20/2 == 10
10/2 == 5
5-17 == -12
So 1337 is not divisible by 17. Now test divisibility by 7:
1337 => 1330 => 665 => 658 => 329 => 322 => 161 => 154 => 77 => 70 => 35 => 28 => 14 => 7 => 0
So 1337 is divisible by 7.
The worst-case timing is 66*16+31 = 1087cc.
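For reference, the full algorithm above translates directly to Python (a sketch; the A==0 edge case is not covered in the post and is handled explicitly here):

```python
def is_divisible(a, b):
    """Divisibility test by shifting out factors of 2 and subtracting, no division."""
    if b == 0:
        return False
    if a == 0:
        return True  # edge case not covered in the post: 0 is divisible by any nonzero b
    # remove common factors of 2 from a and b
    while a % 2 == 0 and b % 2 == 0:
        a //= 2
        b //= 2
    if b % 2 == 0:
        return False  # a has fewer factors of 2 than b
    while True:
        while a % 2 == 0:  # remove all factors of 2 from a
            a //= 2
        a -= b             # a and b both odd, so the result is even (or zero/negative)
        if a == 0:
            return True
        if a < 0:
            return False
```

This mirrors the worked example: is_divisible(1337, 7) is True and is_divisible(1337, 17) is False.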
36
« on: April 20, 2015, 09:48:20 am »
EDIT 23-June-2015: Runer took a look at my LFSR code and suggested changing it to a Galois LFSR using the inverted mask, and the code is much simpler.
I have a bunch of Pseudo-Random Number Generators (PRNGs), so instead of spamming the Optimized Routines topic, I'll post here until I have it all sorted out.
So this is my most recent work, and I think it is zillions of times better than my previous work. It has a period of 4,294,901,760 (65536*65535) and takes a tiny 308cc in the worst case. It passes a chaos game test that I wrote, and it is very nicely distributed (mostly uniform, but not *exactly* so). As a comparison, I wrote a Combined LCG that had a fairly long period (on a similar scale to this), but it took ~1500cc, and yesterday I realized that it might not touch a dust of numbers on its range (this depends on the upper bound for prime gaps, since it used multiplication and a modulus [which may have remedied the multiplication]).
So the method I chose was to use a Lehmer RNG, which is a special-case LCG (Linear Congruential Generator), with a period of 65536 (75*seed mod 65537 -> seed), and a 16-bit LFSR (Linear Feedback Shift Register) with maximal period, so 65535. I then add the two. The whole process cycles every lcm(65536,65535) iterations; since they are consecutive integers, they are coprime, so the cycle is of length 65536*65535.
Now I will give all three PRNGs -- the Lehmer RNG, which is pretty good on its own, the LFSR which is very fast, and the combined PRNG:
lehmer:
;;Input:
;;  (seed) has the seed value of the RNG
;;Output:
;;  (seed) is updated, HL is the result
;;Destroys:
;;  A,DE,BC
;;Timing:
;;  if seed>0: 231cc or 232cc, condition dependent
;;  if seed=0: 91cc
;;  if smc=1: subtract 6cc
;;Size: 44 bytes
;;Notes:
;;  Uses the Lehmer RNG used by the Sinclair ZX81
;;  75x mod 65537 -> x
;;  0 is never a possibility, but 65536, the maximum value, is. I encode 65536 as 0, so the seed is only 2 bytes.
#IF smc == 0
  ld hl,(seed)
#ELSE
seed = $+1
  ld hl,0
#ENDIF
;multiply by 75
  ld c,l
  ld b,h
  xor a
  adc hl,hl \ jr z,special \ ld d,a \ rla
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla \ add hl,bc
;modulo 65537, see note below on how this works
  ld e,a
  sbc hl,de
  jr nc,$+3
  inc hl
  ld (seed),hl
  ret
special:
;In the case that HL=0, this should be interpreted as 65536 = -1 mod 65537, so return -75 mod 65537 = -74 mod 65536 in HL
  ld hl,-74
  ld (seed),hl
  ret
;mod by 2^16 + 1 (a prime)
;current form is A*2^16+HL
;need:
;  (A*2^16+HL) mod (2^16+1)
;add 0 as +1-1
;  (A*(2^16+1-1)+HL) mod (2^16+1)
;distribute
;  (A*(2^16+1)-A+HL) mod (2^16+1)
;A*(2^16+1) mod (2^16+1) = 0, so remove
;  (-A+HL) mod (2^16+1)
;Oh hey, that's easy! :P
;I use this trick everywhere, you should, too.
smc=1
LFSR16:
;;Input: (seed1) is non-zero
;;Output: (seed1) updated, HL is the result
;;Destroys: A
;;13 bytes
;;66cc, add 6cc if not using smc
#IF smc == 0
  ld hl,(seed1)
#ELSE
seed1 = $+1
  ld hl,1
#ENDIF
  add hl,hl
  sbc a,a
  and %00101101
  xor l
  ld l,a
  ld (seed1),hl
  ret
smc = 1
prng16:
;;Input:
;;  (seed1) is a non-zero 16-bit int
;;  (seed2) is a 16-bit int
;;Output:
;;  HL is the pseudo-random number
;;  DE is the output of the LFSR
;;  BC is the previous seed of the Lehmer PRNG
;;  (seed1) is the output of the LFSR
;;  (seed2) is the output of the Lehmer PRNG
;;Destroys: A
;;Notes:
;;  This uses a 16-bit Lehmer PRNG and an LFSR
;;  The period is 4,294,901,760 (65536*65535)
;;  Technically, the Lehmer PRNG here can have values from 1 to 65536. In this application, we store 65536 as 0.
;;Speed:
;;  If smc = 0, add 12cc
;;  293+a-c, where a and c are in {0,1}
;;  probability a=1 is 38/65536
;;  probability c=1 is 1/65536
;;  Average: 293+39/65536 = 293.00059509277cc
;;  Best: 293cc
;;  Worst: 294cc
;;50 bytes
#IF smc == 0
  ld hl,(seed2)
#ELSE
seed2 = $+1
  ld hl,0
#ENDIF
;multiply by 75
  ld c,l
  ld b,h
  xor a
  adc hl,hl \ jr nz,$+3 \ inc a \ rla
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla \ add hl,bc
  ld e,a
  sbc hl,de
  jr nc,$+3
  inc hl
  ld (seed2),hl
  ex de,hl
#IF smc == 0
  ld hl,(seed1)
#ELSE
seed1 = $+1
  ld hl,1
#ENDIF
  add hl,hl
  sbc a,a
  and %00101101
  xor l
  ld l,a
  ld (seed1),hl
  add hl,de
  ret
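For cross-checking the assembly, here is a small Python model of the three routines (a sketch; seeds are handled as the routines store them, with the Lehmer value 65536 encoded as 0):

```python
MASK = 0x002D  # Galois LFSR tap mask from LFSR16 (%00101101)

def lehmer_step(seed):
    """75*x mod 65537, with the value 65536 stored as 0 in 16 bits."""
    x = seed if seed != 0 else 65536
    x = (75 * x) % 65537
    return x & 0xFFFF  # 65536 -> 0

def lfsr16_step(seed):
    """Left-shifting Galois LFSR: shift, then xor the tap mask if a bit fell out."""
    seed <<= 1
    if seed > 0xFFFF:
        seed = (seed & 0xFFFF) ^ MASK
    return seed

def prng16_step(seed1, seed2):
    """Combined generator: sum of the LFSR and Lehmer streams, mod 2^16."""
    seed1 = lfsr16_step(seed1)
    seed2 = lehmer_step(seed2)
    return seed1, seed2, (seed1 + seed2) & 0xFFFF
```

Note that lehmer_step(0) == 65462 matches the routine's `ld hl,-74` special case, since -74 mod 65536 = 65462.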
37
« on: April 03, 2015, 12:40:21 pm »
A few years ago when Builderboy wrote his fire graphics tutorial, I remember thinking it was super cool, and I went on to make my own fire engines designed around it. However, one thing always peeved me-- the design was very clever and made for speed, but it seemed too artificial for my liking-- particularly in the way it masked out pixels. The idea was to kill a pixel 1/8 of the time, so a PRNG was used to select from one of the following masks:

%01111111
%10111111
%11011111
%11101111
%11110111
%11111011
%11111101
%11111110
Of course, such an AND mask will kill exactly one bit in a byte, so exactly 1/8 of the pixels in a byte. A few weeks ago, this was bugging me and preventing my sleep. I set out to devise an algorithm that would generate masks that don't necessarily have a 1/8 kill rate each, but whose kill rate converges to 1/8 over time. For example, one mask might be %11111111 (0% killed) and the next could be %10110111 (25% killed).

The Idea

Suppose you have 3 random bits and you OR them together. The only way this can result in a 0 is if every input was a 0. The probability of that happening is .5^3 = .125 = 1/8. So for our fire mask, we can generate three random bytes, OR them together, and each bit has a 1/8 chance of being reset.

Speed

The disadvantage here is speed. Builderboy's method requires one pseudo-random number; this method requires three. However, if we consider that we are almost certainly going to use a fast pseudo-random number generator, and that we will want a long(ish) period, we have room to take advantage of the PRNG and achieve almost the same speed. For example, suppose you generate a 24-bit pseudo-random number-- with this method, you can just OR the three bytes generated (12cc) versus using an LUT (Builder's method):

  ld hl,LUT
  and 7
  add a,l
  ld l,a
  jr nc,$+3
  inc hl
  ld a,(hl)   ;43cc
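The 1/8 claim is easy to sanity-check numerically. A quick Monte Carlo sketch (seed and trial count are arbitrary):

```python
import random

random.seed(12345)
TRIALS = 100000
cleared = 0
for _ in range(TRIALS):
    # OR three random bytes; a bit ends up 0 only if all three source bits were 0
    mask = random.getrandbits(8) | random.getrandbits(8) | random.getrandbits(8)
    cleared += 8 - bin(mask).count("1")  # count cleared bits in the mask
rate = cleared / (8 * TRIALS)
print(round(rate, 3))  # should hover near 1/8 = 0.125
```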
In the example code I will give, I use a 16-bit PRNG, so I generate three of these numbers (six 8-bit values) for 2 masks, making generation take 1.5 times longer than Builder's, as opposed to 3 times as long.

Considerations

In order for this to work, you need an RNG or PRNG in which every bit has a 50% chance of being set or reset. I went with a Lehmer LCG that was fast to compute, had a period of 2^16, and had this property.

Example

The following code works at 6MHz or 15MHz and the LCD will provide almost no bottleneck. Sorry, no screenie:

smc = 1   ;1 for SMC, 0 for no SMC (use 1 if code is in RAM; it's faster)
plotSScreen = 9340h
saveSScreen = 86ECh

;==============================
#IF smc = 0
seed = saveSScreen
#ENDIF
;==============================
 .db $BB,$6D
 .org $9D95
fire:
;;first, set up. We will be writing bytes to the LCD left to right, then down
  di
  ld a,7          ;LCD y-increment
  out (16),a
;;set up the keyboard port to read keys [Enter]...[Clear]
  ld a,%11111101
  out (1),a
;make the bottom row of pixels black to feed the flames
  ld hl,plotSScreen+756
  ld bc,$0CFF
  ld (hl),c \ inc hl \ djnz $-2
fireloopmain:
  ld ix,plotSScreen+12
  in a,(1)
  and %01000000
  ret z
  call LCG
  ld b,63
fireloop:
;wait for LCD delay
  in a,(16)
  rla
  jr c,$-3
  ld a,80h+63
  sub b
  out (16),a
  push bc
  call LCG
  ld a,20h
  out (16),a
  call fire_2bytes+3
  call fire_2bytes
  call fire_2bytes
  call fire_2bytes
  call fire_2bytes
  call fire_2bytes
  pop bc
  djnz fireloop
  jp fireloopmain
fire_2bytes:
  call lcg
  push hl
  call lcg
  pop de
  ld a,e
  or d
  or l
  and (ix)
  out (17),a
  ld (ix-12),a
  inc ix
  push hl
  call lcg
  pop af
  or h
  or l
  and (ix)
  out (17),a
  ld (ix-12),a
  inc ix
  ret
lcg:
;240cc or 241cc, condition dependent (-6cc using SMC)
;;uses the Lehmer RNG used by the Sinclair ZX81
#IF SMC = 1
seed = $+1
  ld hl,1
#ELSE
  ld hl,(seed)
#ENDIF
;multiply by 75
  ld c,l
  ld b,h
  xor a
  ld d,a
  adc hl,hl
  jr nz,$+9
  ld hl,-74
  ld (seed),hl
  ret
  rla
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla
  add hl,hl \ rla \ add hl,bc \ adc a,d
  add hl,hl \ rla \ add hl,bc \ adc a,d
;mod by 2^16 + 1 (a prime)
;current form is A*2^16+HL
;need:
;  (A*2^16+HL) mod (2^16+1)
;add 0 as +1-1
;  (A*(2^16+1-1)+HL) mod (2^16+1)
;distribute
;  (A*(2^16+1)-A+HL) mod (2^16+1)
;A*(2^16+1) mod (2^16+1) = 0, so remove
;  (-A+HL) mod (2^16+1)
;Oh hey, that's easy! :P
;I use this trick everywhere, you should, too.
  ld e,a
  sbc hl,de     ;No need to reset the c flag since it is already reset
  jr nc,$+3
  inc hl
  ld (seed),hl
  ret
EDIT: .gif attached. It looks slower in the screenshot than on my actual calc, though. EDIT2: On my actual calc, it is roughly 17.2FPS.
38
« on: March 31, 2015, 02:46:38 pm »
I made this program to remove the delay in the BASIC getKey function! It's simple and ugly and I want to improve it at some point. For now, I just wanted a small code.
Basically, repeating keys like arrows and delete will repeat a lot faster, and not just on the homescreen like one of my older programs (Speedy Keys).
39
« on: March 30, 2015, 01:48:30 pm »
I'm really peeved that I cannot seem to find an algorithm for computing arcsine that works like what I have. This algorithm is based on my work from years ago to compute sine and I have no idea why I never reversed it before now. Anyways, the algorithm: ArcSine(x), ##x\in[-1,1]##
x = 2x
s = z = sign(x)
iterate n times:
    x = x*x - 2
    if x < 0:
        s = -s
    z = 2z + s
return pi*z*2^(-n-2)
This algorithm extracts one bit each iteration, so for 16 bits of accuracy, you would iterate 16 times. I think it is a pretty cute and compact algorithm, so where is it? @_@
As a note, at the endpoints {-1,1} it may not give the expected answer. In both cases, if allowed to run infinitely, it would return (in binary) ##\frac{\pi}{2}\cdot 0.1111111111111..._{2}##, which of course is equivalent to ##\frac{\pi}{2}##.
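For reference, a direct Python transcription of the pseudocode (a float sketch; sign(0) is taken as +1 here, an edge case the pseudocode leaves open):

```python
import math

def arcsine(x, n=24):
    """Bit-extraction arcsine per the algorithm above; one result bit per iteration."""
    x = 2.0 * x
    s = z = 1 if x >= 0 else -1
    for _ in range(n):
        x = x * x - 2.0   # doubles the underlying angle (2*cos(t) -> 2*cos(2t))
        if x < 0:
            s = -s
        z = 2 * z + s     # accumulate one bit of the result
    return math.pi * z * 2.0 ** (-n - 2)
```

With n=30 this agrees with math.asin to well under a microradian for inputs like 0.5 and 1.0.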
40
« on: March 13, 2015, 01:00:16 pm »
I got this email from Google: Hello,
Earlier today, Google announced we will be turning down Google Code Project Hosting. The service started in 2006 with the goal of providing a scalable and reliable way of hosting open source projects. Since that time, millions of people have contributed to open source projects hosted on the site.
But a lot has changed since 2006. In the past nine years, many other options for hosting open source projects have popped up, along with vibrant communities of developers. It’s time to recognize that Google Code’s mission to provide open source projects a home has been accomplished by others, such as GitHub and Bitbucket.
We will be shutting down Google Code over the coming months. Starting today, the site will no longer accept new projects, but will remain functionally unchanged until August 2015. After that, project data will be read-only. Early next year, the site will shut down, but project data will be available for download in an archive format.
As the owner of the following projects, you have several options for migrating your data.
scope-os
langz80

The simplest option would be to use the Google Code Exporter, a new tool that will allow you to export your projects directly to GitHub. Alternatively, we have documentation on how to migrate to other services — GitHub, Bitbucket, and SourceForge — manually.
For more information, please see the Google Open Source blog or contact [email protected].
-The Google Code team
Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043 You have received this mandatory email service announcement to update you about important changes to Google Code Project Hosting.
I found Google Code much easier to use, but I see their point about there being enough services already.
41
« on: February 02, 2015, 03:00:47 pm »
Hi all, I was writing a binary search algorithm and since I cannot test it just yet, I was hoping people might be able to spot any problems or optimizations. My intent was to optimize this for the eZ80. Feel free to offer suggestions that include instructions from the eZ80 instruction set.
First, the input is a "key" to search for. It is zero-terminated, so it may be something like:

 .db "Hello",0

The table to look for it in is comprised of indices to the actual data. So the table might look like:
table_start:
  .dw (table_end-table_start)/2-1
  .dw expr0
  .dw expr1
  .dw expr2
  ...
  .dw exprlast
.table_end:
expr0:
  .db "AAAaaa",0
  ;other data
expr1:
  .db "BBBbbb",0
  ;other data
...
The actual strings do not need to be in order, just the order in which they are indexed. The strings must be zero-terminated, too.
So, passing a pointer to the keyword in DE and the table start in HL:
BinarySearch:
;expected in Z80 mode
  ld (key_ptr),de
  ld e,(hl) \ inc hl
  ld d,(hl) \ inc hl
  ld b,h \ ld c,l
  ld (max),de
  or a \ sbc hl,hl \ ld (min),hl
.loop0:
;average max (de) and min (hl) with truncation
  add hl,de \ rr h \ rr l
  push hl
;2 bytes per index, and BC points to the base of the table
  add hl,hl \ add hl,bc
;load the pointer to the expression to compare to in HL
  ld hl,(hl) \ ld de,(key_ptr)
.loop1:
  ld a,(de) \ cp (hl) \ jr c,less
;if they are the same, make sure to test if the bytes are zero (termination)
  jr z,$+4
  jr nc,more
;if a != 0 then the string is not terminated, but we don't know which string is "bigger" so keep looping
  inc de \ inc hl \ or a \ jr nz,loop1
.match:
  pop bc \ ret
.less:
;"or a" is just to reset the c flag for future code
;if [index]<key, add 1 to the average of min and max and set that as the new min
  or a \ pop hl \ inc hl \ ld (min),hl \ ld de,(max) \ jr loop0_end
.more:
;[index]>key: set max to the average of max and min.
  pop de \ ld (max),de \ ld hl,(min)
.loop0_end:
  sbc hl,de \ add hl,de \ jr nz,loop0
  or 1
  ret
The output should have z flag if a match was found, else nz. In the case of a match being found, BC is the index number, DE points to the byte after the key, HL points to the byte after the match. If no match was found, HL and DE point to the index in front of where the match should have been and BC points to the table data start. In any case, the c flag is reset.
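To help reason about the min/max bookkeeping, here is the intended search in Python (a model, not a line-for-line translation: the asm interleaves the byte-by-byte string comparison, and `table` here stands in for the sorted index table):

```python
def binary_search(table, key):
    """Return the index of key in an ascending-sorted table, or None if absent."""
    lo, hi = 0, len(table)        # (min), (max): max starts at the entry count
    while lo != hi:
        mid = (lo + hi) // 2      # average with truncation
        if table[mid] < key:
            lo = mid + 1          # new min is average+1
        elif table[mid] > key:
            hi = mid              # new max is the average
        else:
            return mid
    return None
```

Note the loop condition matches the routine's `sbc hl,de \ add hl,de \ jr nz,loop0`: iterate until min == max.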
42
« on: January 29, 2015, 09:25:53 am »
EDIT: Routines so far: mul16 mul32 gcd16 sqrt16 sqrt24 rand24 setbuffer hl*a/255 div16 prng24 atan8 [link]
Now that the eZ80 has been confirmed as the new processor in the next line of TI calculators, I thought it would be fun to start making optimized routines in preparation! We can take advantage of the 8-bit multiplication routine for multiplication, of course. As a note, I did not find Karatsuba multiplication to be faster (it was 66 to 76 clock cycles).

mul16 optimized:
mul16:
;;Expects Z80 mode
;;Inputs: hl,bc
;;Outputs: Upper 16 bits in DE, lower 16 bits in BC
;;54 or 56 t-states
;;30 bytes
  ld d,c \ ld e,l \ mlt de \ push de    ;11
  ld d,h \ ld e,b \ mlt de              ;8
  ld a,l \ ld l,c \ ld c,a              ;3
  mlt hl \ mlt bc \ add hl,bc           ;13
  jr nc,$+3 \ inc de \ pop bc           ;6
  ld a,b \ add a,l \ ld b,a             ;3
  ld a,e \ adc a,h \ ld e,a             ;3
  ret nc \ inc d \ ret                  ;7 (+2 for carry)
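The routine builds the 32-bit product schoolbook-style from four 8-bit MLT products. As an arithmetic sanity check of that decomposition (not a cycle-accurate model):

```python
def mul16(hl, bc):
    """(h*256+l)*(b*256+c) from four 8-bit products, as the MLT-based routine does."""
    h, l = hl >> 8, hl & 0xFF
    b, c = bc >> 8, bc & 0xFF
    # partial products: c*l (low), h*b (high), h*c and b*l (middle, shifted by 8)
    return (h * b << 16) + ((h * c + b * l) << 8) + c * l
```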
Notice on the right side are the instruction timings. MLT is always 6 cycles and can be used in Z80 or ADL mode. All they did was include it in the extended instructions. In this case, I use Z80 mode to take advantage of some of the math and as a result, I also save 3 t-states over ADL mode which would require that add hl,bc to be done in z80 mode, so "add.s hl,bc" which costs 2 cycles instead of one, and the pushes/pops would require 4cc instead of 3cc each. So, who wants to have fun? EDIT: 5-Feb-15mul32 optimizedmul32: ;;Expects Z80 mode ;;Inputs: ix points to the first 32-bit number ;; ix+4 points to the next 32-bit number ;;Outputs: ix+8 points to the 64-bit result ;;Algorithm: Karatsuba ;348cc to 375cc
;compute z0 ld hl,(ix) ;5 ld bc,(ix+4) ;5 call mul16 ;59+2 ld (ix+8),bc ;5 ld (ix+10),de ;5 ;compute z2 ld hl,(ix+2) ;5 ld bc,(ix+6) ;5 call mul16 ;59+2 ld (ix+12),bc ;5 ld (ix+14),de ;5 ;compute z1, most complicated as it requires 17-bits of info for each factor ld hl,(ix+2) ;5 ld bc,(ix) ;5 add hl,bc ;1 rla ;1 ld b,h ;1 ld c,l ;1 ld hl,(ix+6) ;5 ld de,(ix+4) ;5 add hl,de ;1 rla ;1 push hl ;3 push bc ;3 push af ;3 call mul16 ;59+2 pop af ;3 and 3 ;2 ex de,hl ;1 pop de ;3 jr z,$+7 ;label0 ;3|(6+1) ;bit 0 means add [stack0] to the upper 16 bits ;bit 1 means add [stack1] to the upper 16 bits ;both mean add 2^32 jp po,$+5 \ or 4 ;-- rra \ jr nc,$+7 ;4+4 rrca \ add hl,bc \ adc a,0 \ rlca ;-- ; srl a \ pop de \ jr nc,$+5 ;8+2 add hl,bc \ adc a,0 ; ld d,b \ ld e,c ;2
;z1 = AHLDE-z0-z2 ld bc,(ix+8) \ ex de,hl \ sbc hl,bc ;8 ld bc,(ix+10) \ ex de,hl \ sbc hl,bc \ sbc a,0 ;10
ld bc,(ix+12) \ ex de,hl \ sbc hl,bc ;8 ld bc,(ix+14) \ ex de,hl \ sbc hl,bc \ sbc a,0 ;10 ex de,hl ;1
ld bc,(ix+10) \ add hl,bc \ ld (ix+10),hl \ ex de,hl ;12 ld bc,(ix+12) \ adc hl,bc \ ld (ix+12),hl \ adc a,0 ;13 ret z \ ld hl,(ix+14) \ add a,l \ ld l,a ;7+16 jr nc,$+3 \ inc h \ ld (ix+14),hl \ ret ;--
gcd16 optimized:
GCD16:
;;Expects Z80 mode
;;Inputs: HL,DE
;;Output: HL
;;  BC=0
;;  DE=0
;;  A=0
;Binary GCD algorithm
;About 432cc on average (average of 500 iterations with randomized inputs on [0,65535])
;78 bytes
  xor a
  ld b,a
  ld c,a
  sbc hl,bc
  ex de,hl
  ret z
  sbc hl,bc
  ex de,hl
  ret z

;factor out cofactor powers of two
  ld a,l \ or e \ rra \ jr nc,$+16
  srl h \ rr l
  srl d \ rr e
  inc b
  ld a,l \ or e \ rra \ jr nc,$-12
.loop:
;factor out powers of 2 from hl
  ld a,l \ rra \ ld a,h \ jr c,$+10
  rra \ rr l \ bit 0,l \ jr z,$-5 \ ld h,a
;factor out powers of 2 from de
  ld a,e \ rra \ ld a,d \ jr c,$+10
  rra \ rr e \ bit 0,e \ jr z,$-5 \ ld d,a

  xor a
  sbc hl,de
  jr z,finish
  jr nc,loop
  add hl,de
  or a
  ex de,hl
  sbc hl,de
  jr loop
.finish:
  ex de,hl
  dec b
  inc b
  ret z
  add hl,hl \ djnz $-1 \ ret
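The routine is the binary GCD; a compact Python rendering of the same steps (shared powers of two counted the way the routine counts them in B, then shifted back in at the end):

```python
def gcd16(x, y):
    """Binary GCD: factor out common 2s, strip 2s from each, subtract smaller from larger."""
    if x == 0 or y == 0:
        return x | y  # gcd(n, 0) = n; the asm short-circuits zero inputs similarly
    shift = 0
    while x % 2 == 0 and y % 2 == 0:  # common factors of two (the B counter)
        x //= 2
        y //= 2
        shift += 1
    while True:
        while x % 2 == 0:  # strip remaining 2s; they can't be common factors now
            x //= 2
        while y % 2 == 0:
            y //= 2
        if x == y:
            return x << shift  # restore the shared powers of two
        if x > y:
            x -= y
        else:
            y -= x
```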
sqrt16 optimized:
sqrt16:
;;Expects Z80 mode
;;Inputs: HL
;;Output: A
;Unrolled, speed optimized.
;At most 112 clock cycles
;111 bytes.
  xor a \ ld c,a \ ld d,a \ ld e,l \ ld l,h \ ld h,c
add hl,hl \ add hl,hl \ sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
sla c \ ld a,c \ add a,a \ add hl,hl \ add hl,hl jr nc,$+5 \ sub h \ jr $+5 \ sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a
ld e,h \ ld h,l \ ld l,c \ ld a,l add hl,hl \ rl e \ rl d add hl,hl \ rl e \ rl d sbc hl,de \ rla \ ret
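sqrt16 unrolls the classic restoring square root, consuming two input bits per step. A readable Python model of the method (a sketch, not cycle-accurate):

```python
def isqrt16(n):
    """Restoring integer square root for 0 <= n < 2^16, two input bits per step."""
    root = 0  # partial root built one bit per step
    rem = 0   # running remainder
    for shift in range(14, -1, -2):              # walk the 8 bit-pairs from the top
        rem = (rem << 2) | ((n >> shift) & 3)    # bring down the next bit-pair
        trial = (root << 2) | 1                  # candidate: (2*root + 1) at this scale
        if rem >= trial:
            rem -= trial
            root = (root << 1) | 1
        else:
            root <<= 1
    return root
```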
rand24 optimized:
Rand24:
;;Expects Z80 mode
;;Inputs: seed1,seed2
;;Outputs:
;;  AHL is the pseudo-random number
;;  seed1,seed2 incremented accordingly
;;Destroys: BC,DE
;;Notes:
;;  seed1*243+83 mod 65519 -> seed1
;;  seed2*251+43 mod 65521 -> seed2
;;  Output: seed1*seed2 mod 16777259
;;215cc worst case
  ld hl,(seed1)
  ld d,h
  ld h,243
  ld e,h
  mlt hl
  mlt de
  ld a,83
  add a,l
  ld l,a
  ld a,h
  adc a,e
  ld h,a
  ld a,d
  adc a,0
;now AHL mod 65519
;Essentially, we can at least subtract A*65519=A*(65536-17), so add A*17 to HL
  ld c,a
  ld b,17
  mlt bc
  add hl,bc

  ld de,65519
  jr nc,$+5
  or a \ sbc hl,de
  or a \ sbc hl,de
  jr nc,$+3
  add hl,de
  ld (seed1),hl
;seed1 done, now we need to do seed2

  ld hl,(seed2)
  ld d,h
  ld h,251
  ld e,h
  mlt hl
  mlt de
  ld a,43
  add a,l
  ld l,a
  ld a,h
  adc a,e
  ld h,a
  ld a,d
  adc a,0

;now AHL mod 65521
;Essentially, we can at least subtract A*65521=A*(65536-15), so add A*15 to HL
  ld c,a
  ld b,15
  mlt bc
  add hl,bc

  ld de,65521
  jr nc,$+5
  or a \ sbc hl,de
  or a \ sbc hl,de
  jr nc,$+3
  add hl,de
  ld (seed2),hl

;now seed1 and seed2 are computed
;seed1*seed2 mod 16777259
  ld bc,(seed1)
  call mul16      ; -> DEBC
;16777259 = 2^24+43
;D*2^24+EBC mod 16777259 = EBC+(D*2^24+43D-43D) mod 16777259
;D*2^24+EBC mod 16777259 = EBC+(16777259D-43D) mod 16777259
;D*2^24+EBC mod 16777259 = EBC-43D mod 16777259
  ld a,e
  ld e,43
  mlt de
;ABC-de
  ld h,b \ ld l,c
  or a
  sbc hl,de
  sbc a,0
  ret nc
  ld bc,43
  add hl,bc
  ret
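The reductions lean on small congruences; quick checks of the identities the comments use:

```python
# 65536 ≡ 17 (mod 65519) and 65536 ≡ 15 (mod 65521),
# so A*2^16 + HL can be partially reduced as HL + 17*A (resp. HL + 15*A).
assert (1 << 16) % 65519 == 17
assert (1 << 16) % 65521 == 15

# 16777259 = 2^24 + 43, so 2^24 ≡ -43 and D*2^24 + EBC ≡ EBC - 43*D (mod 16777259).
assert 16777259 == (1 << 24) + 43
for d in (0, 1, 200, 255):
    for ebc in (0, 12345, 0xFFFFFF):
        assert ((d << 24) + ebc) % 16777259 == (ebc - 43 * d) % 16777259
```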
Set Buffer optimized:
setBuf:
;;Any mode
;;Inputs: A is the byte to set each byte in the buffer to
;;  HL points to the buffer
;;  BC is the buffer size, greater than 1
;;8 bytes
;;14+3*BC clock cycles
  ld (hl),a
  dec bc
  ld d,h
  ld e,l
  inc de
  ldir
  ret
EDIT 9-Feb-15

HL*A/255 (can be used for division by 3, 5, 15, 17, 51, and 85, among others)
This one performs A*HL/255. Be warned that this does not work on certain boundary values. For example, A=85 would imply division by 3, but if you input an HL divisible by 3, you get one less than the actual result. So 9*85/255 should return 3, but this routine returns 2 instead.

fastMul_Ndiv255:
;;Expects Z80 mode
;;Description: Multiplies HL by A/255
;;Inputs: A,HL
;;Outputs: HL is the result
;;  A is a copy of the upper byte of the result
;;  DE is the product of the input A,H
;;  B is a copy of E
;;  C is a copy of the upper byte of the product of inputs A,L
;;32cc
;;18 bytes
  ld d,a
  ld e,h
  ld h,d
  mlt hl
  mlt de    ;15
;DE
; DE
;  HL
;   HL
  ld c,h
  ld b,e
  ld a,d
  add hl,bc \ adc a,0
  add hl,de \ adc a,0
  ld l,h \ ld h,a
  ret       ;17
HL*A/255 "fixed"
Modifying this to correct the rounding issue is a pain in terms of speed hit and size hit:

fastMul_Ndiv255_fix:
;;Expects Z80 mode
;;Description: Multiplies HL by A/255
;;Inputs: A,HL
;;Outputs: HL is the result
;;52cc
;;26 bytes
  push af
  ld d,a
  ld e,l
  ld l,d
  mlt hl
  mlt de
;HL
; HL
;  DE
;   DE
  ld c,d
  ld b,l
  ld a,h
  add hl,de \ adc a,0
  add hl,bc \ adc a,0
  ld d,l \ ld l,h \ ld h,a
  pop af \ add a,e
  ret nc
  adc a,d
  ret nc
  inc hl
  ret

HL*A/255 Rounded
However, rounding it is at little cost and works well:

fastMul_Ndiv255_round:
;;Expects Z80 mode
;;Description: Multiplies HL by A/255
;;Inputs: A,HL
;;Outputs: HL is the result
;;37cc
;;23 bytes
  ld d,a
  ld e,l
  ld l,d
  mlt hl
  mlt de    ;15
;HL
; HL
;  DE
;   DE
  ld c,d
  ld b,l
  ld a,h
  add hl,de \ adc a,0
  add hl,bc \ adc a,0
  ld l,h \ ld h,a
  sla e \ jr nc,$+3 \ inc hl
  ret       ;22
sqrt24 optimized:
Finally, a routine that expects ADL mode instead of Z80 mode! This is an integer square root routine, but it can be used for 8.8 fixed-point numbers by copying the fixed-point number to HL as 8.16, where the lower 8 bits are zero (or if you have extended precision from a previous calculation, feel free to use that). Then the square root is 4.8.

sqrt24:
;;Expects ADL mode
;;Inputs: HL
;;Outputs: DE is the integer square root
;;  HL is the difference inputHL-DE^2
;;  c flag reset
  xor a \ ld b,l \ push bc \ ld b,a \ ld d,a \ ld c,a \ ld l,a \ ld e,a
;Iteration 1
  add hl,hl \ rl c \ add hl,hl \ rl c
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 2
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 3
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 4
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 5
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 6
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
  sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 7
  add hl,hl \ rl c \ add hl,hl \ rl c \ rl b
  ex de,hl \ add hl,hl \ push hl \ sbc hl,bc \ jr nc,$+8
  ld a,h \ cpl \ ld b,a
  ld a,l \ cpl \ ld c,a
  pop hl
  jr nc,$+4 \ inc hl \ inc hl
  ex de,hl
;Iteration 8
  add hl,hl \ ld l,c \ ld h,b \ adc hl,hl \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 9
  pop af
  rla \ adc hl,hl \ rla \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 10
  rla \ adc hl,hl \ rla \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 11
  rla \ adc hl,hl \ rla \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 12
  rla \ adc hl,hl \ rla \ adc hl,hl
  ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
  jr nc,$+6 \ sbc hl,de \ inc de \ inc de
rr d \ rr e \ ret
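The asm above is the classic restoring square root, consuming two input bits per iteration. A Python rendition of the same method (for sanity-checking on a PC) is short:

```python
from math import isqrt

def sqrt24(n):
    """Bitwise square root of a 24-bit integer: returns (root, n - root*root),
    mirroring the routine's outputs (DE, HL)."""
    root, rem = 0, 0
    for i in range(11, -1, -1):        # 12 iterations, 2 bits each
        rem = (rem << 2) | ((n >> (2 * i)) & 3)
        cand = (root << 2) | 1         # (2r+1)^2 - (2r)^2 = 4r + 1
        if rem >= cand:
            rem -= cand
            root = (root << 1) | 1
        else:
            root <<= 1
    return root, rem

for n in (0, 1, 2, 255, 65535, 123456, 0xFFFFFF):
    r, rem = sqrt24(n)
    assert r == isqrt(n) and rem == n - r * r
```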
The sqrt24 routine is huge, and I believe less than 240cc.

div16 optimized

div16:
;;Inputs: DE is the numerator, BC is the divisor
;;Outputs: DE is the result
;;         A is a copy of E
;;         HL is the remainder
;;         BC is not changed
;140 bytes
;145cc
    xor a \ sbc hl,hl
    ld a,d
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ cpl \ ld d,a
    ld a,e
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ cpl \ ld e,a
    ret
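Each unrolled line above is one step of restoring division: shift a numerator bit into the remainder, try subtracting the divisor, and add it back on borrow. In Python (a model of the method, not of the register usage):

```python
def div16(n, d):
    """16/16 restoring division: returns (quotient, remainder)."""
    assert 0 <= n < 65536 and 0 < d < 65536
    q, rem = 0, 0
    for i in range(15, -1, -1):
        rem = (rem << 1) | ((n >> i) & 1)  # rla / adc hl,hl
        if rem >= d:                       # sbc hl,bc leaves no borrow
            rem -= d
            q |= 1 << i
    return q, rem

for n, d in ((0, 1), (65535, 1), (65535, 65535), (40000, 7), (123, 456)):
    assert div16(n, d) == divmod(n, d)
```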
prng24 optimized

This is a lot faster and smaller than the rand24 routine. I'm also pretty sure this routine wins on unpredictability, too, and it has a huge cycle length.

prng24:
;;Expects ADL mode.
;;Output: HL
;;50cc
;;33 bytes
;;cycle length: 281,474,959,933,440 (about 281 trillion)
    ld de,(seed1)
    or a
    sbc hl,hl
    add hl,de
    add hl,hl
    add hl,hl
    inc l
    add hl,de
    ld (seed1),hl
    ld hl,(seed2)
    add hl,hl
    sbc a,a
    and %00011011
    xor l
    ld l,a
    ld (seed2),hl
    add hl,de
    ret
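For reference, here is a Python model of what the routine computes: seed1 is updated roughly as 5*seed1 + 1 mod 2^24 (the `inc l` only touches the low byte, which the model reproduces), and seed2 takes one step of a 24-bit Galois LFSR. This is a sketch of the logic, not a cycle-exact emulation:

```python
MASK = 0xFFFFFF  # 24-bit registers in ADL mode

def prng24(state):
    """One step of the generator; state is [seed1, seed2]. Returns a 24-bit value."""
    s1, s2 = state
    de = s1
    hl = (de * 4) & MASK                       # add hl,hl twice
    hl = (hl & 0xFFFF00) | ((hl + 1) & 0xFF)   # inc l: no carry into the upper bytes
    state[0] = (hl + de) & MASK                # seed1 update
    carry = s2 >> 23                           # bit shifted out by add hl,hl
    hl = (s2 << 1) & MASK
    lo = ((0x1B if carry else 0) ^ hl) & 0xFF  # sbc a,a / and %00011011 / xor l
    state[1] = (hl & 0xFFFF00) | lo            # Galois LFSR step on seed2
    return (state[1] + de) & MASK              # mix the two streams

state = [0x123456, 0x654321]
out = [prng24(state) for _ in range(1000)]
assert all(0 <= v <= MASK for v in out)
# the LFSR step is invertible (the tap constant is odd), so a nonzero seed2 stays nonzero
assert state[1] != 0
```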
43
« on: December 24, 2014, 02:12:45 am »
EDIT: See the necro-update below-- it provides a much more efficient algorithm, even 1 multiply better than most suggested implementations of Newton-Raphson division
In the study of numerical algorithms, quadratic convergence refers to an iterative algorithm that approximates a function, where each iteration doubles the digits of accuracy. In this post, I will present an algorithm for division that can double, triple, quadruple, etc. the number of digits of accuracy at each iteration. I will then show that the optimal algorithm is the one that offers cubic convergence (digits of accuracy tripled).

What does it mean to multiply the number of correct digits by a?

Suppose that we have a sequence ##x_{0},x_{1},x_{2},x_{3},...## and suppose that ##x_{n}\rightarrow c##. Then by the definition of convergence, ##\lim\limits_{n\rightarrow\infty}{|x_{n}-c|}=0##. You should always think of |x-y| as a "distance" between two points, and in this case, that distance is the error of our approximations, ##x_{n}##. If the number of correct digits is multiplied by a, then that means ##\log(|x_{n+1}-c|)=a\cdot\log|x_{n}-c|##, or rewritten using logarithm tricks, ##|x_{n+1}-c|=|x_{n}-c|^{a}##. This assumes that the error is initially less than 1.

The Algorithm

The algorithm is very simple, but to break it down we need to look at some "obvious" bits of math. Firstly, ##\frac{N}{D}=\frac{N\cdot c}{D\cdot c}##, and secondly, all real numbers can be written in the form ##x=y\cdot 2^{m}, y\in(.5,1]##. You may have said "duh" at the first statement, but if you need convincing of the second, it basically says that if you divide or multiply a number by 2 enough times, then it will eventually land between 1/2 and 1. Or shown by example: ##2^{11}<3767<2^{12}##, so dividing through by ##2^{12}##, we get ##.5<3767/4096<1##. All real numbers can be bounded by consecutive powers of 2, therefore they can all be written in the above form. The final piece of the puzzle is to recognize that ##\frac{N}{1}=N##. What we will do is recursively multiply D by some constant c at each iteration in order to drive it closer to 1.
First, use range-reduction techniques to get D on (.5,1] (multiply both N and D by the power of 2 that puts D in that range). Then what we want is to choose c so that ##|1-D_{n}\cdot c|= |1-D_{n}|^{a}##. If ##0\leq D_{n}\leq 1##, then we have ##1-D_{n}\cdot c = (1-D_{n})^{a}##. When a is a natural number greater than 1 (if it is 1, then there is no convergence), we have the following:

##1-D_{n}\cdot c = \sum\limits_{k=0}^{a}{{a \choose k}(-D_{n})^{k}}##
##D_{n}\cdot c = 1-\sum\limits_{k=0}^{a}{{a \choose k}(-D_{n})^{k}}##
##D_{n}\cdot c = 1-(1+\sum\limits_{k=1}^{a}{{a \choose k}(-D_{n})^{k}})##
##D_{n}\cdot c = -\sum\limits_{k=1}^{a}{{a \choose k}(-D_{n})^{k}}##
##D_{n}\cdot c = {a \choose 1}D_{n}-{a \choose 2}D_{n}^{2}+{a \choose 3}D_{n}^{3}-\cdots##
##c = {a \choose 1}-{a \choose 2}D_{n}+{a \choose 3}D_{n}^{2}-\cdots##

Using a technique similar to Horner's method, we can obtain:

##c = {a \choose 1}-D_{n}({a \choose 2}-D_{n}({a \choose 3}-D_{n}({a \choose 4}-D_{n}\cdots##

Then the selection of c requires a-2 multiplications and a-1 subtractions. If we take ##N_{n+1}=N_{n}\cdot c## and ##D_{n+1}=D_{n}\cdot c##, then the number of digits of accuracy is multiplied by a, using a total of a multiplications and a-1 subtractions.

Example

Let's say we want quadratic convergence. That is, a=2. Then ##c_{n+1}=2-D_{n}##. That's right-- all you have to do is multiply by 2 minus the denominator at each iteration, and if you started with 1 digit of accuracy, you next get 2 digits, then 4, then 8, 16, 32,....

Let's perform 49/39. Divide numerator and denominator by ##2^{6}=64## and we get .765625/.609375.
First iteration:
c=2-D=1.390625
Nc=1.064697266
Dc=.8474121094
Second iteration:
c=1.152587891
Nc=1.227157176
Dc=.9767169356
Next iterations, in the form {c,N,D}:
{1.023283064,1.255729155,.9994578989}
{1.000542101,1.256409887,.9999997061}
{1.000000294,1.256410256,.9999999999}
{1.000000294,1.256410256,1.000000000}
And in fact, 49/39 is 1.256410256 according to my calc.
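The quadratic iteration takes only a few lines of Python to check (`frexp` performs the power-of-two range reduction; the function name is mine):

```python
import math

def divide_quadratic(num, den, iters=6):
    """Approximate num/den by repeatedly multiplying both by c = 2 - D."""
    d, e = math.frexp(den)      # den = d * 2**e with d in [0.5, 1)
    n = num / 2**e
    for _ in range(iters):
        c = 2 - d
        n *= c
        d *= c
    return n

# reproduce the 49/39 example: first multiplier is 2 - 0.609375 = 1.390625
d0, e0 = math.frexp(39)
assert d0 == 0.609375 and 49 / 2**e0 == 0.765625
assert abs(divide_quadratic(49, 39) - 49 / 39) < 1e-12
```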
Can we find an optimal choice for a?

The only variable over which we have control is a-- how quickly the algorithm converges. Therefore we should find an optimal choice of a. In this case, "optimal" means getting the most accuracy for the least work (computation time). First, ##1-D_{0}<.5=2^{-1}##, so after n iterations we have at least ##a^{n}## bits of accuracy, and since ##D_{0}## may get arbitrarily close to .5, this is the best lower bound on the number of bits of accuracy achieved after n iterations for a randomly chosen D. As well, after n iterations we have used ##a\cdot n## multiplications (the subtractions are treated as trivial). Then in order to get m bits of accuracy, we need roughly ##a\log_{a}(m)## multiplications. We want to minimize this function over integers a greater than 1. To do that, we find when ##0=\frac{d}{da}(a\log_{a}(m))##:

##0=\frac{d}{da}(a\log_{a}(m))##
##0=\frac{d}{da}(a\frac{\log(m)}{\log(a)})##
##0=\frac{d}{da}(\frac{a}{\log(a)})##
##0=\frac{1}{\log(a)}-\frac{1}{\log^{2}(a)}##
Since a>1 (so ##\frac{1}{\log(a)}\neq 0##):
##0=1-\frac{1}{\log(a)}##
##1=\frac{1}{\log(a)}##
##\log(a)=1##
##a=e##

However, e is not an integer; it is bounded by 2 and 3, so check which is smaller: 2/log(2) or 3/log(3). We find that a=3 is the optimal value, achieving the most accuracy for the least work.

Example:
Our algorithm body is:
c=3-D*(3-D)
N=N*c
D=D*c
Then 49/39=.765625/.609375 and {c,N,D}=:
{1.543212891,1.181522369,.9403953552}
{1.063157358,1.256144201,.9997882418}
{1.000211803,1.256410256,1.000000000}

Conclusion

In reality, there is a specific M such that for all m>M, a=3 always requires less computational power than a=2. Before that cutoff, there are values for which they require the same amount of work, or where the cost even favors a=2. For example, 32768 bits of accuracy can be achieved with the same amount of work either way. The cutoff appears to be m=2^24, which is pretty gosh darn huge (16777216 bits of accuracy is over 5 million digits).
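Likewise, the cubic version is easy to verify, with c computed Horner-style; the last assertion checks the "least work" claim above, that a=3 beats a=2 on the a/log(a) cost factor:

```python
import math

def divide_cubic(num, den, iters=4):
    """Approximate num/den; each pass triples the bits of accuracy."""
    d, e = math.frexp(den)      # den = d * 2**e with d in [0.5, 1)
    n = num / 2**e
    for _ in range(iters):
        c = 3 - d * (3 - d)     # one multiply for c, two more for N*c and D*c
        n *= c
        d *= c
    return n

# first multiplier for 49/39 matches the worked trace: 3 - 0.609375*(3 - 0.609375)
assert abs((3 - 0.609375 * (3 - 0.609375)) - 1.543212891) < 1e-9
assert abs(divide_cubic(49, 39) - 49 / 39) < 1e-12

# cost to reach m bits scales like a*log_a(m) = (a/ln a)*ln(m); a=3 wins over a=2
assert 3 / math.log(3) < 2 / math.log(2)
```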
44
« on: December 03, 2014, 12:02:16 pm »
Overview

I was reading about the BKM algorithm and I thought it was similar to my previous attempt at making an algorithm for the complex exponential function. However, I tried implementing it and failed miserably. So I altered my previous algorithm with an idea from the BKM algorithm, and I made my own algorithm that works as quickly, uses a smaller LUT, works on a wider range of values, and has a more straightforward implementation.

Why is the complex exponential function useful?

The complex exponential function can be evaluated as ##e^{x+iy}=e^{x}\left(\cos(y)+i \sin(y)\right)##. So for people who actually need a complex exponential function, this is useful, but it can also compute the real exponential function (take y=0), or real sine and cosine simultaneously (take x=0).

Goals of the algorithm

Although I only advertise my algorithm as working on the rectangle [-1,1]+[-1,1]i, it actually works on approximately (-1.754365792,1.754365792)+(-1.213427912,1.213427912)i. Feel free to use this to your advantage. As well, since shifting and adding are easy [citation needed], this is a shift-and-add algorithm. It takes approximately n iterations for n bits of accuracy, and 8 tables of n/2 floats (versus 18 tables suggested by the BKM algorithm).

The Algorithm

We will first give simple pseudocode, then proceed to optimize it. As in my old algorithm and the BKM algorithm, we start by observing that ##e^{z-s}=e^{z}e^{-s}=e^{z}/e^{s}##. So if z-a-b-c-d-...=0, then we have ##e^{z-a-b-c-...}=e^{z}/(e^{a}e^{b}e^{c}\cdots)##, so ##1=e^{z}/(e^{a}e^{b}e^{c}\cdots)##, so ##e^{a}e^{b}e^{c}\cdots=e^{z}##. The trick is finding values for a,b,c,... so that the multiplications are "easy," as in comprised of a shift and add. Since we are working with complex numbers, take the complex number d where the real and imaginary parts are in the set {-1,0,1}. Multiplication by those values is even more trivial than a shift and add. What we will do is take ##e^{a}=1+d2^{-n}##. Multiplication by ##2^{-n}## is a right shift by n. If ##e^{a}=1+d2^{-n}##, then ##\log(e^{a})=\log(1+d2^{-n})##, so ##a=\log(1+d2^{-n})##.
There are 9 possible values for d, so you might think, "oh no Zeda, we need 9 tables of complex numbers, which is basically 18 tables of floats," but we all know that math+programming=magic [citation not needed]. So let's look at how the complex logarithm is defined, and call the real part of d "s" and the imaginary part "t", so d=s+t*i. By the way, references to "log" imply the natural logarithm in the math world, so log_e or ln(), not log_10. ##\log(x+iy) = \log\left(\sqrt{x^{2}+y^{2}}\right)+i\tan^{-1}(y/x)= \frac{1}{2}\log(x^{2}+y^{2})+i\tan^{-1}\left(\frac{y}{x}\right)## (side note: atan has infinitely many valid outputs-- just add 2*k*pi for integers k, but we are only interested in k=0, the principal Arg). So our tables would be built with ##x=1+s2^{-n}##, ##y=t2^{-n}##. But there are lots of repeated values, so we actually need far fewer tables. For our purposes, we will note that to choose s and t, we will use s=sign(iPart(Re(z)*2^(m+1))), t=sign(iPart(Im(z)*2^(m+1))), where iPart() returns the integer part of the input, and sign() returns 1 if positive, -1 if negative, and 0 if 0. Without further ado, the algorithm:
1.0→v
1.0→w
For k,0,n-1
  sign(x)→s
  sign(y)→t
  if |x|<2^(1-k)
    0→s
  if |y|<2^(1-k)
    0→t
  if s≠0 & t≠0
    v→a
  if t=0
    x+log(1+s2^-k)→x
    v+s*v>>k→v
  if s=0
    x+log(1+4^-k)→x
    y+t*atan(2^-k)→y
    v-tw>>k→v
    w+ta>>k→w
  if s=-1
    x+log((1-2^-k)^2+4^-k)→x
    y+t*atan(1/(2^k-1))→y
    v-v>>k-tw>>k→w
    w-w>>k+ta>>k→v
  if s=1
    x+log((1+2^-k)^2+4^-k)→x
    y+t*atan(1/(2^k+1))→y
    v+v>>k-tw>>k→w
    w+w>>k+ta>>k→v
The result is v+iw to n bits of accuracy. It seems complicated, but it is basically just taking care of each case for s and t. As well, we can multiply x and y by 2 each iteration, so that instead of checking the nth bit, we can check whether they are less than 1. Then we would need to multiply each table entry by ##2^{k+1}##. For further optimization, we note that to 2m bits of accuracy, when k>m-1, ##\log(1+d2^{-(k+1)})=.5\cdot\log(1+d2^{-k})## to the precision we need. So we can basically reuse the last table value, dividing by 2 each iteration, then multiplying by 2 to get our ##2^{-(k+1)}##. But wait, x/2*2=x, so we can use the last value in each table directly! To clean things up a bit, let's name the tables table0 through table7 and use an index n to address each value.

For the optimized code to 2m bits of accuracy:
1.0→v
1.0→w
For k,0,2m-1
  if k<m
    k→n
  x<<1→x
  y<<1→y
  sign(x)→s
  sign(y)→t
  if |x|<1
    0→s
  if |y|<1
    0→t
  v→a
  if s=0
    if t=-1
      x+table6[n]→x
      y-table7[n]→y
      v+w>>k→v
      w-a>>k→w
    if t=1
      x+table6[n]→x
      y+table7[n]→y
      v-w>>k→w
      w+a>>k→v
  if s=-1
    if t=-1
      x+table0[n]→x
      y-table2[n]→y
      v-v>>k+w>>k→w
      w-w>>k-a>>k→v
    if t=0
      x+table4[n]→x
      v-v>>k→w
      w-w>>k→v
    if t=1
      x+table0[n]→x
      y+table2[n]→y
      v-v>>k-w>>k→w
      w-w>>k+a>>k→v
  if s=1
    if t=-1
      x+table1[n]→x
      y-table3[n]→y
      v+v>>k+w>>k→w
      w+w>>k-a>>k→v
    if t=0
      x+table1[n]→x
      v+v>>k→w
      w+w>>k→v
    if t=1
      x+table1[n]→x
      y+table3[n]→y
      v+v>>k-w>>k→w
      w+w>>k+a>>k→v
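The table entries are just the real and imaginary parts of ##\log(1+d2^{-k})##, so both the table forms used in the pseudocode and the reuse-the-last-entry trick are easy to sanity-check numerically:

```python
import cmath, math

k = 5
for s in (-1, 0, 1):
    for t in (-1, 0, 1):
        f = 1 + complex(s, t) * 2**-k
        if f == 1:
            continue
        # real part: (1/2)*log(|f|^2), matching forms like log((1+2^-k)^2 + 4^-k)
        assert abs(cmath.log(f).real
                   - 0.5 * math.log((1 + s * 2**-k)**2 + t * t * 4**-k)) < 1e-14
        # imaginary part on the diagonals: atan(1/(2^k + s))
        if s != 0 and t != 0:
            assert abs(abs(cmath.log(f).imag) - math.atan(1 / (2**k + s))) < 1e-14

# for k past m, each table's next entry is (nearly) half the previous one
for k in (20, 24):
    assert abs(math.log(1 + 2**-(k + 1)) - 0.5 * math.log(1 + 2**-k)) < 2**-(2 * k - 1)
```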
Summary

For floating-point math with floats in base 2, arbitrary shifting is trivial (just change the exponent). So each iteration requires at most 6 trivial shifts, 6 trivial compares, and 6 non-trivial additions/subtractions. So for 2m bits of precision, the total speed cost is 12mA+epsilon, where A is the speed of an addition or subtraction, and epsilon is a small, positive number. The total space cost is 8m Floats. Compared to the BKM algorithm, the speed is the same (with just a change in epsilon), but BKM uses 18m Floats, so we save 10m Floats. We also work on a larger space than the BKM algorithm, which works on a trapezoidal region of the complex plane entirely contained in (a proper subset of) the set on which this algorithm works.

Future Plans

I plan to write this up more formally, including the proofs of why we can use m elements for 2m bits of accuracy, and why we can just look at the sign and the nth bits of the real and imaginary components of the input to choose our d. This last bit was a difficult task for the creator of the BKM algorithm, and in their paper they mentioned that the bounds were theoretical and only as tight as they could manage. In practice they found that the bounds were loose, and they suspected that they could expand them, but they couldn't prove it. I found an easy way to prove this for my algorithm, which I suspect will apply to the BKM algorithm, since at this point they are nearly identical.

Gosh I hope I don't have typos, formatting issues, or errors in the algorithms I posted. I have been typing and revising this for hours .__.
45
« on: November 20, 2014, 10:07:48 am »
Hi again! This time I was fooling around with tangent (who knows why? I don't .__.) Anyways, I was having some fun and I figured out that Horner's method can be abused to approximate tangent by logic similar to what I used to approximate arctangent: all you have to do is approximate infinitely many terms of the Maclaurin series with one simple function!
So before the fun, I'll show what Horner's method is (it's a super simple idea): If you've never seen the identity ##\sin(x)=x-\frac{x^{3}}{3!}+\frac{x^{5}}{5!}-\frac{x^{7}}{7!}+...##, that's fine, just trust me (you'll get to it in calculus and it's crazy cool stuff). Horner's method is what a true 31337 programmer would come up with to efficiently approximate that polynomial. First, factor out the first term in the series (##x##): ##\sin(x)=x(1-\frac{x^{2}}{3!}+\frac{x^{4}}{5!}-\frac{x^{6}}{7!}+...)## Now in that inside term, factor out the ##-\frac{x^{2}}{3!}##:
##\sin(x)=x(1-\frac{x^{2}}{3!}(1-\frac{x^{2}}{4\cdot 5}+\frac{x^{4}}{4\cdot 5\cdot 6\cdot 7}-\frac{x^{6}}{4\cdot 5\cdot 6\cdot 7\cdot 8\cdot 9}+...))## Rinse, repeat:
##\sin(x)=x(1-\frac{x^{2}}{3!}(1-\frac{x^{2}}{4\cdot 5}(1-\frac{x^{2}}{6\cdot 7}(1-\frac{x^{2}}{8\cdot 9}-...## So the programmer sees, "aw, shweet, I only need to compute x2 once and reuse it a bunch of times with some constant multiplications!" So yeah, Horner's method is snazztacular, and I urge you to use it when possible.
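As a quick check of the nested sine form (truncated after the ##\frac{x^{2}}{8\cdot 9}## level, so the error on [-1,1] should be near ##x^{11}/11!##):

```python
import math

def horner_sin(x):
    """sin(x) via the nested Maclaurin form, computing x*x once and reusing it."""
    x2 = x * x
    return x * (1 - x2/6 * (1 - x2/20 * (1 - x2/42 * (1 - x2/72))))

for x in (-1.0, -0.5, 0.0, 0.3, 1.0):
    assert abs(horner_sin(x) - math.sin(x)) < 1e-6
```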
Now let's look at tan(x). The Maclaurin series expansion for tan(x) is uuuugly. Each term uses Bernoulli numbers which are recursively computed and are intricately linked with some really cool functions like the Riemann Zeta function. For comparison (thanks to Mathworld for the table):
##\cos(x)=\sum\limits_{k=0}^{\infty}{\frac{(-1)^{k}}{(2k)!}x^{2k}}## (simple, cute, compact)
##\sin(x)=\sum\limits_{k=0}^{\infty}{\frac{(-1)^{k}}{(2k+1)!}x^{2k+1}}## (simple, cute, compact)
##\tan(x)=\sum\limits_{k=0}^{\infty}{\frac{(-1)^{k}4^{k+1}\left(4^{k+1}-1\right)B_{2k+2}}{(2k+2)!}x^{2k+1}}## (wtf are math)
That's what it looks like to divide sine by cosine. Not pretty (well, okay, it is actually pretty gorgeous, but whatevs). So the first few terms of tan(x) are:
##\tan(x)=x+\frac{x^{3}}{3}+\frac{2x^{5}}{15}+\frac{17x^{7}}{315}+\frac{62x^{9}}{2835}+\frac{1382x^{11}}{155925}+...##
Nasty gorgeous, but applying Horner's Method, we end up with:
##\tan(x)=x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}(1+\frac{62x^{2}}{153}(1+...##
So seeing this, I realized, "wait a minute, those coefficients sure don't look like they are converging to zero or one, but something in between!" And that made me think about how ##\frac{1}{1-ax^{2}}=(1+ax^{2}(1+ax^{2}(1+ax^{2}(1+ax^{2}(1+...##. Now, I am a mathematician, but I am also a programmer so what I am about to say is blasphemous and I am sorry, but I decided to numerically evaluate and assume those constants converge. I ended up with approximately .405285. What I am saying is:
##\tan(x)\approx x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}(1+\frac{62x^{2}}{153}(1+.405285x^{2}(1+.405285x^{2}(1+.405285x^{2}(1+...## So then I ended up with:
##\tan(x)\approx x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}(1+\frac{62x^{2}}{153}\cdot\frac{1}{1-.405285x^{2}}))))##
##\tan(x)\approx x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}(1+\frac{62x^{2}}{153-62x^{2}}))))##
##\tan(x)\approx x(1+\frac{x^{2}}{3}(1+\frac{2x^{2}}{5}(1+\frac{17x^{2}}{42}\cdot\frac{153}{153-62x^{2}})))##
It's still a bit of work to compute, but I had fun! Now to do the mathy thing and actually prove the constants converge and maybe even find its exact value!
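A quick numerical check of the final formula, plus a look at the coefficient ratios: if the pattern holds, the ratios should approach the reciprocal of the squared radius of convergence (tan's nearest pole is at ##\pi/2##), i.e. ##4/\pi^{2}\approx 0.4052847##, which matches the .405285 observed above:

```python
import math
from fractions import Fraction

def tan_approx(x):
    """x(1 + x^2/3(1 + 2x^2/5(1 + 17x^2/42 * 153/(153 - 62x^2))))"""
    x2 = x * x
    inner = 1 + (17 * x2 / 42) * (153 / (153 - 62 * x2))
    return x * (1 + x2 / 3 * (1 + 2 * x2 / 5 * inner))

for x in (-1.0, -0.7, 0.1, 0.5, 1.0):
    assert abs(tan_approx(x) - math.tan(x)) < 1e-4

# Horner ratios from the series coefficients listed above
coefs = [Fraction(1), Fraction(1, 3), Fraction(2, 15), Fraction(17, 315),
         Fraction(62, 2835), Fraction(1382, 155925)]
ratios = [coefs[i + 1] / coefs[i] for i in range(len(coefs) - 1)]
assert abs(float(ratios[-1]) - 4 / math.pi**2) < 1e-4   # ~ .405285
```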