Show Posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.


Messages - Xeda112358

Pages: 1 ... 30 31 [32] 33 34 ... 317
466
Art / Re: Mockups "please say this is going to be a game"
« on: February 09, 2015, 09:03:06 am »
@Nanowar and tr1p1ea: Wow, those are amazing o.o

467
ASM / Re: eZ80 Optimized Routines
« on: February 06, 2015, 08:35:51 am »
For the Z80, I made floating point routines for 80-bit floats (64-bit precision) and 24-bit floats (16-bit precision). I just started working on a single-precision floating point library for the eZ80 and Z80 that is conforming to IEEE-754 standards and it is going alright.

I've written a bunch of fixed-point and integer routines, too.

468
ASM / Re: eZ80 Optimized Routines
« on: February 05, 2015, 02:17:37 pm »
GCD is the Greatest Common Divisor. So GCD(15,27) is 3 since 3 is the largest number that divides both 27 and 15. It isn't used often, bt if somebody wanted to make a 16-bit rational library *cough*I should*cough* it is extremely useful.

469
Other / Re: Raspberry Pi 2
« on: February 03, 2015, 08:34:15 am »
Yessss! I am totally getting this. I'm glad I waited :P

470
ASM / Binary Search Algorithm
« on: February 02, 2015, 03:00:47 pm »
Hi all, I was writing a binary search algorithm and since I cannot test it just yet, I was hoping people might be able to spot any problems or optimizations. My intent was to optimize this for the eZ80. Feel free to offer suggestions that include instructions from the eZ80 instruction set.

First, the input is a "key" to search for. It is zero terminated, so it may be something like .db "Hello",0
The table to look for it in is comprised of indices to the actual data. So the table might look like:
Code: [Select]
table_start:
.dw (table_end-table_start)/2-1
.dw expr0
.dw expr1
.dw expr2
...
.dw exprlast
.table_end:


expr0: .db "AAAaaa",0
;other data
expr1: .db "BBBbbb",0
;other data
...
The actual strings do not need to be in order, just the order in which they are indexed. The strings must be zero-terminated, too.

So, passing a pointer to the keyword in DE and the table start in HL:
Code: [Select]
BinarySearch:
;expected in Z80 mode
    ld (key_ptr),de
    ld e,(hl) \ inc hl
    ld d,(hl) \ inc hl
    ld b,h \ ld c,l
    ld (max),de
    or a \ sbc hl,hl \ ld (min),hl
.loop0:
;average max (de) and min (hl) with truncation
    add hl,de \ rr h \ rr l
    push hl
;2 bytes per index, and BC points to the base of the table
    add hl,hl \ add hl,bc
;load the pointer to the expression to compare to in HL
    ld hl,(hl) \ ld de,(key_ptr)
.loop1:
    ld a,(de) \ cp (hl) \ jr c,less
;if they are the same, make sure to test if the bytes are zero (termination)
    jr z,$+4
    jr nc,more
;if a != 0 then the string is not terminated, but we don't know which string is "bigger" so keep looping
    inc de \ inc hl \ or a \ jr nz,loop1
.match:
    pop bc \ ret
.less:
; "or a" is just to reset the c flag for future code
;if [index]<key, add 1 to the average of min and max and set that as the new min
    or a \ pop hl \ inc hl \ ld (min),hl \ ld de,(max) \ jr loop0_end
.more:
;[index]>key set max to the average of max and min.
    pop de \ ld (max),de \ ld hl,(min)
.loop0_end:
    sbc hl,de \ add hl,de \ jr nz,loop0
    or 1
    ret
The output should have z flag if a match was found, else nz. In the case of a match being found, BC is the index number, DE points to the byte after the key, HL points to the byte after the match. If no match was found, HL and DE point to the index in front of where the match should have been and BC points to the table data start. In any case, the c flag is reset.

471
TI Z80 / Re: SPASM-ng, now with eZ80 support!
« on: January 30, 2015, 09:35:00 am »
Yay, thank you!

472
ASM / Re: eZ80 Optimized Routines
« on: January 30, 2015, 09:11:02 am »
As it is only a 16-bit by 16-bit multiplication, yes, only up to 16-bit factors :P

And the upper 16 bits of the result are in DE, lower 16 in BC.

473
TI Z80 / Re: SPASM-ng, now with eZ80 support!
« on: January 30, 2015, 09:03:38 am »
Oh, am I able to use it on my Raspberry Pi? :3

474
ASM / Re: eZ80 Optimized Routines
« on: January 30, 2015, 09:02:00 am »
No, the eZ80 has two modes, ADL mode (24-bit) and Z80 (16-bit). In Z80 mode, certain instructions are one or two cycles faster, so it is beneficial for me to use that mode since I don't use any ADL-specific instructions. As well, add hl,hl sets the c flag on 16-bit overflow in Z80 mode, which I needed, but doesn't set it in ADL mode (it has to overflow 24 bits). Since there is no easy way to access the top bits of HL, it would have to be performed in Z80 mode, making it take even more clock cycles.


The full 32-bit result is returned.

475
ASM / eZ80 Optimized Routines
« on: January 29, 2015, 09:25:53 am »
EDIT: Routines so far:
mul16
mul32
gcd16
sqrt16
sqrt24
rand24
setbuffer
hl*a/255
div16
prng24
atan8 [link]



Now that the eZ80 has been confirmed as the new processor in the next line of TI calculators, I thought it would be fun to start making optimized routines in preparation!

We can take advantage of the 8-bit multiplication routine for multiplication of course. As a note, I did not find Karatusba multiplication to be faster (it was 66 to 76 clock cycles).
mul16 optimized
Code: [Select]
mul16:
;;Expects Z80 mode
;;Inputs: hl,bc
;;Outputs: Upper 16 bits in DE, lower 16 bits in BC
;;54 or 56 t-states
;;30 bytes
    ld d,c \ ld e,l \ mlt de \ push de ;11
    ld d,h \ ld e,b \ mlt de           ;8
    ld a,l \ ld l,c \ ld c,a           ;3
    mlt hl \ mlt bc \ add hl,bc        ;13
    jr nc,$+3 \ inc de \ pop bc        ;6
    ld a,b \ add a,l \ ld b,a          ;3
    ld a,e \ adc a,h \ ld e,a          ;3
    ret nc \ inc d \ ret               ;7 (+2 for carry)

Notice on the right side are the instruction timings. MLT is always 6 cycles and can be used in Z80 or ADL mode. All they did was include it in the extended instructions. In this case, I use Z80 mode to take advantage of some of the math and as a result, I also save 3 t-states over ADL mode which would require that add hl,bc to be done in z80 mode, so "add.s hl,bc" which costs 2 cycles instead of one, and the pushes/pops would require 4cc instead of 3cc each.

So, who wants to have fun? :D

EDIT: 5-Feb-15
mul32 optimized
Code: [Select]
mul32:
;;Expects Z80 mode
;;Inputs:  ix points to the first 32-bit number
;;         ix+4 points to the next 32-bit number
;;Outputs: ix+8 points to the 64-bit result
;;Algorithm: Karatsuba
;348cc to 375cc

;compute z0
    ld hl,(ix)        ;5
    ld bc,(ix+4)      ;5
    call mul16        ;59+2
    ld (ix+8),bc      ;5
    ld (ix+10),de     ;5
;compute z2           
    ld hl,(ix+2)      ;5
    ld bc,(ix+6)      ;5
    call mul16        ;59+2
    ld (ix+12),bc     ;5
    ld (ix+14),de     ;5
;compute z1, most complicated as it requires 17-bits of info for each factor
    ld hl,(ix+2)      ;5
    ld bc,(ix)        ;5
    add hl,bc         ;1
    rla               ;1
    ld b,h            ;1
    ld c,l            ;1
                       
    ld hl,(ix+6)      ;5
    ld de,(ix+4)      ;5
    add hl,de         ;1
    rla               ;1
                       
    push hl           ;3
    push bc           ;3
    push af           ;3
    call mul16        ;59+2
    pop af            ;3
    and 3             ;2
    ex de,hl          ;1
    pop de            ;3
    jr z,$+7 ;label0  ;3|(6+1)
;bit 0 means add [stack0] to the upper 16 bits
;bit 1 means add [stack1] to the upper 16 bits
;both mean add 2^32   
    jp po,$+5 \ or 4                     ;--
    rra \ jr nc,$+7                      ;4+4
    rrca \ add hl,bc \ adc a,0 \ rlca    ;--
                                         ;
    srl a \ pop de \ jr nc,$+5           ;8+2
    add hl,bc \ adc a,0                  ;
    ld d,b \ ld e,c                      ;2

;z1 = AHLDE-z0-z2
    ld bc,(ix+8) \ ex de,hl \ sbc hl,bc              ;8
    ld bc,(ix+10) \ ex de,hl \ sbc hl,bc \ sbc a,0   ;10

    ld bc,(ix+12) \ ex de,hl \ sbc hl,bc             ;8
    ld bc,(ix+14) \ ex de,hl \ sbc hl,bc \ sbc a,0   ;10
    ex de,hl                                         ;1

    ld bc,(ix+10) \ add hl,bc \ ld (ix+10),hl \ ex de,hl  ;12
    ld bc,(ix+12) \ adc hl,bc \ ld (ix+12),hl \ adc a,0   ;13
    ret z \ ld hl,(ix+14) \ add a,l \ ld l,a              ;7+16
    jr nc,$+3 \ inc h \ ld (ix+14),hl \ ret               ;--
gcd16 optimized
Code: [Select]
GCD16:
;;Expect Z80 mode
;;Inputs: HL,DE
;;Output: HL
;;        BC=0
;;        DE=0
;;        A=0
;Binary GCD algorithm
;About 432cc on average (average of 500 iterations with randomized inputs on [0,65535]
;78 bytes
    xor a
    ld b,a
    ld c,a
    sbc hl,bc
    ex de,hl
    ret z
    sbc hl,bc
    ex de,hl
    ret z

;factor out cofactor powers of two
    ld a,l \ or e \ rra \ jr nc,$+16
    srl h \ rr l
    srl d \ rr e
    inc b
    ld a,l \ or e \ rra \ jr nc,$-12
.loop:
;factor out powers of 2 from hl
    ld a,l \ rra \ ld a,h \ jr c,$+10
    rra \ rr l \ bit 0,l \ jr z,$-5 \ ld h,a
;factor out powers of 2 from de
    ld a,e \ rra \ ld a,d \ jr c,$+10
    rra \ rr e \ bit 0,e \ jr z,$-5 \ ld d,a

    xor a
    sbc hl,de
    jr z,finish
    jr nc,loop
    add hl,de
    or a
    ex de,hl
    sbc hl,de
    jr loop
.finish:
    ex de,hl
    dec b
    inc b
    ret z
    add hl,hl \ djnz $-1 \ ret
sqrt16 optimized
Code: [Select]
sqrt16:
;;Expext Z80 mode
;;Inputs: HL
;;Output: A
;Unrolled, speed optimized.
;At most 112 clock cycles
;111 bytes.
    xor a \ ld c,a \ ld d,a \ ld e,l \ ld l,h \ ld h,c

    add hl,hl \ add hl,hl \ sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a

    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla
    sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a

    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla
    sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a

    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla
    sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a

    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla
    sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a

    add hl,hl \ add hl,hl \ rl c \ ld a,c \ rla
    sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a

    sla c \ ld a,c \ add a,a \ add hl,hl \ add hl,hl
    jr nc,$+5 \ sub h \ jr $+5 \ sub h \ jr nc,$+5 \ inc c \ cpl \ ld h,a

    ld e,h \ ld h,l \ ld l,c \ ld a,l
    add hl,hl \ rl e \ rl d
    add hl,hl \ rl e \ rl d
    sbc hl,de \ rla \ ret
rand24 optimized
Code: [Select]
Rand24:
;;Expects Z80 mode
;;Inputs: seed1,seed2
;;Outputs:
;; AHL is the pseudo-random number
;; seed1,seed2 incremented accordingly
;;Destroys: BC,DE
;;Notes:
;; seed1*243+83 mod 65519 -> seed1
;; seed2*251+43 mod 65521 -> seed2
;; Output seed1*seed2 mod 16777259
;;215cc worst case
ld hl,(seed1)
    ld d,h
    ld h,243
    ld e,h
    mlt hl
    mlt de
    ld a,83
    add a,l
    ld l,a
    ld a,h
    adc a,e
    ld h,a
    ld a,d
    adc a,0

;now AHL mod 65519
; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL
    ld c,a
    ld b,17
    mlt bc
    add hl,bc

ld de,65519
jr nc,$+5
or a \ sbc hl,de
or a \ sbc hl,de
jr nc,$+3
add hl,de
ld (seed1),hl
;seed1 done, now we need to do seed2


ld hl,(seed2)
    ld d,h
    ld h,251
    ld e,h
    mlt hl
    mlt de
    ld a,43
    add a,l
    ld l,a
    ld a,h
    adc a,e
    ld h,a
    ld a,d
    adc a,0

;now AHL mod 65521
; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL
    ld c,a
    ld b,15
    mlt bc
    add hl,bc

ld de,65521
jr nc,$+5
or a \ sbc hl,de
or a \ sbc hl,de
jr nc,$+3
add hl,de
ld (seed2),hl

;now seed1 and seed 2 are computed
;seed1*seed2 mod 16777259

ld bc,(seed1)
call mul16   ; -> DEBC
;16777259 = 2^24+43
;D2^24+EBC mod 16777259 = EBC+(D*2^24+43D-43D) mod 16777259
;D2^24+EBC mod 16777259 = EBC+(16777259D-43D) mod 16777259
;D2^24+EBC mod 16777259 = EBC-43D mod 16777259
    ld a,e
    ld e,43
    mlt de
;ABC-de
    ld h,b \ ld l,c
    or a
    sbc hl,de
    sbc a,0
    ret nc
    ld bc,43
    add hl,bc
    ret

Set Buffer optimized
Code: [Select]
setBuf:
;;Any mode
;;Inputs: A is the byte to set each byte in the buffer to
;;        HL points to the buffer
;;        BC is the buffer size, greater than 1
;;8 bytes
;;14+3*BC clock cycles
    ld (hl),a
    dec bc
    ld d,h
    ld e,l
    inc de
    ldir
    ret

EDIT 9-Feb-15
HL*A/255 (can be used for division by 3,5,15,17,51, and 85, among others)
This one performs A*HL/255. Be warned that this does not work on certain boundary values. For example, A=85 would imply division by 3, but if you input HL as divisible by 3, you get one less than the actual result. So 9*85/255 should return 3, but this routine returns 2 instead.
Code: [Select]
fastMul_Ndiv255:
;;Expects Z80 mode
;;Description: Multiplies HL by A/255
;;Inputs: A,HL
;;Outputs: HL is the result
;;         A is a copy of the upper byte of the result
;;         DE is the product of the input A,H
;;         B is a copy of E
;;         C is a copy of the upper byte of the product of inputs A,L
;;32cc
;;18 bytes
    ld d,a
    ld e,h
    ld h,d
    mlt hl
    mlt de
;15
;DE
; DE
; HL
;  HL
    ld c,h
    ld b,e
    ld a,d
    add hl,bc \ adc a,0
    add hl,de \ adc a,0
    ld l,h \ ld h,a
    ret
;17
HL*A/255 "fixed"
Modifying this to correct the rounding issue is a pain in terms of speed hit and size hit:
Code: [Select]
fastMul_Ndiv255_fix:
;;Expects Z80 mode
;;Description: Multiplies HL by A/255
;;Inputs: A,HL
;;Outputs: HL is the result
;;52cc
;;26 bytes
    push af
    ld d,a
    ld e,l
    ld l,d
    mlt hl
    mlt de
;HL
; HL
; DE
;  DE
    ld c,d
    ld b,l
    ld a,h
    add hl,de \ adc a,0
    add hl,bc \ adc a,0
    ld d,l \ ld l,h \ ld h,a
    pop af  \ add a,e
    ret nc
    adc a,d
    ret nc
    inc hl
    ret
HL*A/255 Rounded
However, rounding it is at little cost and works well:
Code: [Select]
fastMul_Ndiv255_round:
;;Expects Z80 mode
;;Description: Multiplies HL by A/255
;;Inputs: A,HL
;;Outputs: HL is the result
;;37cc
;;23 bytes
    ld d,a
    ld e,l
    ld l,d
    mlt hl
    mlt de
;15
;HL
; HL
; DE
;  DE
    ld c,d
    ld b,l
    ld a,h
    add hl,de \ adc a,0
    add hl,bc \ adc a,0
    ld l,h \ ld h,a
    sla e \ jr nc,$+3 \ inc hl
    ret
;22
sqrt24 optimized
Finally, a routine that expects ADL mode instead of Z80 mode! This is an integer square root routine, but can be used for 8.8 fixed point numbers by copying the fixed point number to HL as 8.16 where the lower 8 bits are zero (or if you had extended precision from a previous calculation, feel free to use that). then the square root is 4.8
Code: [Select]
sqrt24:
;;Expects ADL mode
;;Inputs: HL
;;Outputs: DE is the integer square root
;;         HL is the difference inputHL-DE^2
;;         c flag reset
    xor a \ ld b,l \ push bc \ ld b,a \ ld d,a \ ld c,a \ ld l,a \ ld e,a
;Iteration 1
    add hl,hl \ rl c \ add hl,hl \ rl c
    sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 2
    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
    sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 3
    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
    sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 4
    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
    sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 5
    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
    sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a
;Iteration 6
    add hl,hl \ rl c \ add hl,hl \ rl c \ rl e \ ld a,e
    sub c \ jr nc,$+6 \ inc e \ inc e \ cpl \ ld c,a

;Iteration 7
    add hl,hl \ rl c \ add hl,hl \ rl c \ rl b
    ex de,hl \ add hl,hl \ push hl \ sbc hl,bc \ jr nc,$+8
    ld a,h \ cpl \ ld b,a
    ld a,l \ cpl \ ld c,a
    pop hl
    jr nc,$+4 \ inc hl \ inc hl
    ex de,hl
;Iteration 8
    add hl,hl \ ld l,c \ ld h,b \ adc hl,hl \ adc hl,hl
    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
    jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 9
    pop af
    rla \ adc hl,hl \ rla \ adc hl,hl
    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
    jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 10
    rla \ adc hl,hl \ rla \ adc hl,hl
    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
    jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 11
    rla \ adc hl,hl \ rla \ adc hl,hl
    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
    jr nc,$+6 \ sbc hl,de \ inc de \ inc de
;Iteration 11
    rla \ adc hl,hl \ rla \ adc hl,hl
    ex de,hl \ add hl,hl \ sbc hl,de \ add hl,de \ ex de,hl
    jr nc,$+6 \ sbc hl,de \ inc de \ inc de

    rr d \ rr e \ ret
It is huge and I believe less than 240cc.
div16 optimized
Code: [Select]
div16:
;;Inputs: DE is the numerator, BC is the divisor
;;Outputs: DE is the result
;;         A is a copy of E
;;         HL is the remainder
;;         BC is not changed
;140 bytes
;145cc
    xor a \ sbc hl,hl

    ld a,d
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ cpl \ ld d,a

    ld a,e
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ adc hl,hl \ sbc hl,bc \ jr nc,$+3 \ add hl,bc
    rla \ cpl \ ld e,a
    ret
prng24 optimized
This is a lot faster and smaller than the rand24 routine. I'm also pretty sure this routine wins on unpredictability, too, and it has a huge cycle length.
Code: [Select]
prng24:
;;Expects ADL mode.
;;Output: HL
;;50cc
;;33 bytes
;;cycle length: 281,474,959,933,440 (about 2.8 trillion)
    ld de,(seed1)
    or a
    sbc hl,hl
    add hl,de
    add hl,hl
    add hl,hl
    inc l
    add hl,de
    ld (seed1),hl
    ld hl,(seed2)
    add hl,hl
    sbc a,a
    and %00011011
    xor l
    ld l,a
    ld (seed2),hl
    add hl,de
    ret
Spoiler For template for Zeda because she is lazy:
routine optimized
Code: [Select]

476
TI Z80 / Re: Figuring out what's in front of you.
« on: January 18, 2015, 04:16:51 pm »
So do you have the coordinates of the objects?

477
Math and Science / Re: Quadratic and Faster Convergence for Division
« on: December 24, 2014, 10:26:23 am »
When I first learned about Goldshmidt Division (case a=2), my first thought was to generalize it. Of course, after I did that I researched if it had already been done and it has :P I even think that it is known that a=3 is the most optimal case judging by this.

EDIT: And now I found a paper that mentioned an even more efficient way of doing it. It finds the inverse of D using one fewer mul at each iteration, then you use a final multiplication to do N*(1/D). For the curious:
Code: [Select]
r=1 (or some approximation of 1/D)
    x=1-rD
    r=r+rx(1+x(1+x(1+...
The last step is carried out finitely. The number of 'x' terms is how quickly it converges.

478
Math and Science / Quadratic and Faster Convergence for Division
« on: December 24, 2014, 02:12:45 am »
EDIT: See the necro-update below-- it provides a much more efficient algorithm, even 1 multiply better than most suggested implementations of Newton-Raphson division :P

In the study of numerical algorithms, quadratic convergence refers to an iterative algorithm that approximates a function and each iteration doubles the digits of accuracy. In this post, I will expose an algorithm for division that can double, triple, quadruple, etc. the number of digits of accuracy at each iteration. I will then show that the optimal algorithm is that which offers cubic convergence (digits of accuracy tripled).

What does it mean to multiply the number of correct digits by a?
Suppose that we have a sequence ##x_{0},x_{1},x_{2},x_{3},...## and suppose that ##x_{n}\rightarrow c## Then by the definition of convergence, ##\lim\limits_{n\rightarrow\infty}{|x_{n}-c|}=0##. You should always think of |x-y| as a "distance" between two points, and in this case, that distance is the error of our approximations, ##x_{n}##.

If the number of correct digits is multiplied by a, then that means ##\log(|x_{n+1}-c|)=a\cdot\log|x_{n}-c|##, or rewritten using logarithm tricks, ##|x_{n+1}-c|=|x_{n}-c|^{a}##. This assumes that the error is initially less than 1.

The Algorithm
The algorithm is very simple, but to break it down we need to look at "obvious" bits of math. Firstly, ##\frac{N}{D}=\frac{N\cdot c}{D\cdot c}##, and secondly, all real numbers can be written in the form of ##x=y\cdot 2^{m}, y\in(.5,1]##. You may have said "duh" at the first statement, but if you need convincing of the second, it basically says that if you divide or multiply a number by 2 enough times, then it will eventually get between 1/2 and 1. Or in another way, shown by example: 211<3767<212, so dividing it all by 212, .5<3767/4096<1. All real numbers can be bounded by consecutive power of 2, therefore they can all be written in the above form.

The final piece of the puzzle is to recognize that ##\frac{N}{1}={N}##. What we will do is recursively multiply D by some constant c at each iteration in order to drive it closer to 1. If we use range reduction techniques to get D on (.5,1] (multiply both N and D by the power of 2 that gets D in that range). Then what we want is to choose c so that ##|1-D_{n}\cdot c|= |1-D_{n}|^{a}##. If ##0\leq D_{n}\leq 1##, then we have ##1-D_{n}\cdot c = (1-D_{n})^{a}##. When a is a natural number greater than 1 (if it is 1, then there is no convergence :P), we have the following:
##1-D_{n}\cdot c = \sum\limits_{k=0}^{a}{{a \choose k}(-D_{n})^{k}}##
##D_{n}\cdot c = 1-\sum\limits_{k=0}^{a}{{a \choose k}(-D_{n})^{k}}##
##D_{n}\cdot c = 1-(1+\sum\limits_{k=1}^{a}{{a \choose k}(-D_{n})^{k}})##
##D_{n}\cdot c = -\sum\limits_{k=1}^{a}{{a \choose k}(-D_{n})^{k}}##
##D_{n}\cdot c = {a \choose 1}D_{n}-{a \choose 2}D_{n}^{2}+{a \choose 3}D_{n}^{3}+\cdots##
##c = {a \choose 1}-{a \choose 2}D_{n}+{a \choose 3}D_{n}^{2}+\cdots##
Using a technique similar to Horner's Method, we can obtain:
##c = {a \choose 1}-D_{n}({a \choose 2}-D_{n}({a \choose 3}-D_{n}({a \choose 4}-D_{n}\cdots##
Then the selection for c requires a-2 multiplications and a-1 subtractions. If we take ##N_{n+1}=N_{n}\cdot c## and ##D_{n+1}=D_{n}\cdot c##, then the number of digits of accuracy is multiplied by a using a total of a multiplications and a-1.

Example
Lets say we want quadratic convergence. That is, a=2. Then ##c_{n+1}=2-D_{n}## That's right-- all you have to do is multiply by 2 minus the denominator at each iteration and if you started with 1 digit of accuracy, you next get 2 digits, then 4, then 8, 16, 32,.... Let's perform 49/39. Divide numerator and denominator by 26=64 and we get .765625/.609375.

First iteration:
c=2-D=1.390625
Nc=1.064697266
Dc=.8474121094

Second iteration:
c=1.152587891
Nc=1.227157176
Dc=.9767169356

next iterations in the form {c,N,D}:
{1.023283064,1.255729155,.9994578989}
{1.000542101,1.256409887,.9999997061}
{1.000000294,1.256410256,.9999999999}
{1.000000294,1.256410256,1.000000000}

And in fact, 49/39 is 1.256410256 according to my calc.

Can we find an optimal choice for a?
The only variable over which we have control is a-- how quickly the algorithm converges. Therefore we should find an optimal choice of a. In this case, "optimal" means getting the most accuracy for the least work (computation time).

First, ##1-D_{0}<.5=2^{-1}## So then after n iterations, we have at least an bits of accuracy and since ##D_{0}## may get arbitrarily close to .5, this is the best lower bound on the number of bits of accuracy achieved for n iterations of a randomly chosen D. As well, at n iterations, we need to use ##a\cdot n## multiplications and the subtractions are treated as trivial. Then in order to get m bits of accuracy, we need roughly ##a\log_{a}(m)## multiplications. We want to minimize this function of a for a is an integer greater than 1. To do that, we find when ##0=\frac{d}{da}(a\log_{a}(m))##:
##0=\frac{d}{da}(a\log_{a}(m))##
##0=\frac{d}{da}(a\frac{\log(m)}{\log(a)})##
##0=\frac{d}{da}(\frac{a}{\log(a)})##
##0=\frac{1}{\log(a)}-\frac{1}{\log^{2}(a)}##
Since a>1:
##0=1-\frac{1}{\log(a)}##
##1=\frac{1}{\log(a)}##
##\log(a)=1##
##a=e##

However, e is not an integer, but it is bounded by 2 and 3, so check which is smaller: 2/log(2) or 3/log(3) and we find that a=3 is the optimal value that achieves the most accuracy for the least work.

Example:
Our algorithm body is:
Code: [Select]
    c=3-D*(3-D)
    N=N*c
    D=D*c
Then 49/39=.765625/.609375 and {c,N,D}=:
{1.543212891,1.181522369,.9403953552}
{1.063157358,1.256144201,.9997882418}
{1.000211803,1.256410256,1.000000000}


Conclusion
In reality, their is a specific M so that for all m>M, a=3 always requires less computational power than a=2. Before that cutoff, there are values for which they require the same amount of work or even the cost is in favor of a=2.  For example, 32768 bits of accuracy can be achieved with the same amount of work. The cutoff appears to be m=2^24 which is pretty gosh darn huge (16777216 bits of accuracy is over 5 million digits).

479
So for the actual, cleaned up code:

First, take the following tables:
##L1[k]:=2^{k}\log((1-2^{-k})^{2}+4^{-k})##
##L2[k]:=2^{k+1}\log(1-2^{-k})##
##L3[k]:=2^{k}\log(1+4^{-k})##
##L4[k]:=2^{k}\log((1+2^{-k})^{2}+4^{-k})##
##L5[k]:=2^{k+1}\log(1+2^{-k})##
##A1[k]:=2^{k+1}\arctan(\frac{1}{2^{k}-1})##
##A2[k]:=2^{k+1}\arctan(2^{-k})##
##A3[k]:=2^{k+1}\arctan(\frac{1}{2^{k}+1})##
Code: [Select]
1→v
0→w
    for k,0,2m-1:
        if k<m:k→n
        2x→x
        2y→y
        sign(x)→s
        sign(y)→t
        if |x|<1:0→s
        if |y|<1:0→t
        v→a
        if s=-1
            if t=-1
                x-L1[n]→x
                y+A1[n]→y
                v-v>>k+w>>k→v
                w-w>>k-a>>k→w
            if t=0
                x-L2[n]→x
                v-v>>k→v
                w-w>>k→w
            if t=1
                x-L1[n]→x
                y-A1[n]→y
                v-v>>k-w>>k→v
                w-w>>k+a>>k→w
        if s=0
            if t=-1
                x-L3[n]→x
                y+A2[n]→y
                v+w>>k→v
                w-a>>k→w
            if t=1
                x-L3[n]→x
                y-A2[n]→y
                v-w>>k→v
                w+a>>k→w
        if s=1
            if t=-1
                x-L4[n]→x
                y+A3[n]→y
                v+v>>k+w>>k→v
                w+w>>k-a>>k→w
            if t=0
                x-L5[n]→x
                v+v>>k→v
                w+w>>k→w
            if t=1
                x-L4[n]→x
                y-A3[n]→y
                v+v>>k-w>>k→v
                w+w>>k+a>>k→w
It might also be convenient to move all the v,w calculations outside of the cases, since they are all essentially the same with components multiplied by s or t which are just multiplications by {1,0,-1}.
Code: [Select]
1→v
0→w
    for k,0,2m-1:
        if k<m:k→n
        2x→x
        2y→y
        sign(x)→s
        sign(y)→t
        if |x|<1:0→s
        if |y|<1:0→t
        v→a
        v+sv>>k-tw>>k→v
        w+sw>>k+ta>>k→w
        if s=-1
            if t=-1
                x-L1[n]→x
                y+A1[n]→y
            if t=0
                x-L2[n]→x
            if t=1
                x-L1[n]→x
                y-A1[n]→y
        if s=0
            if t=-1
                x-L3[n]→x
                y+A2[n]→y
            if t=1
                x-L3[n]→x
                y-A2[n]→y
        if s=1
            if t=-1
                x-L4[n]→x
                y+A3[n]→y
            if t=0
                x-L5[n]→x
            if t=1
                x-L4[n]→x
                y-A3[n]→y
Also, in the last m iterations, can be even more optimized if you are using a binary floating point format for x and y. Convert the mantissas so that you have m*2^0 (so 1.1011*2^-5 would have a mantissa of 110110000.... but would be converted to 000011011000...). Now use these as your new, converted x and y value, treating them as ints. If the function "MSb(x)" returns the most significant bit in x (so for 64-bit numbers, bit 63):
Code: [Select]
sign(x)→b
sign(y)→c
convert(|x|)→x
convert(|y|)→y
    for k,m,2m-1:
        MSb(x)→s
        MSb(y)→t
        x<<1→x
        y<<1→y
        v→a
        v+s*b*v>>k-t*c*w>>k→v
        w+s*b*w>>k+t*c*a>>k→w
In the last part, s,t are either 0 or 1, and b,c are either -1, 0, or 1. If any of them are 0, of course the algorithm can be optimized further (just plug in 0 to see how it reduces). For example, suppose x=0 upon entering:
Code: [Select]
    for k,m,2m-1:
        MSb(y)→t
        y<<1→y
        v→a
        v-t*c*w>>k→v
        w+t*c*a>>k→w

480
News / Re: https by default
« on: December 06, 2014, 05:59:32 pm »
I don't know anything about these things, so I don't actually know the significance of this news .___.

Pages: 1 ... 30 31 [32] 33 34 ... 317