I'll be honest, I have no clue what a lot of that means. I've played with OOPLs here and there, but I've never managed to really "get it." However, I know for many programmers it is more natural and it would be cool to have such a language available on the (e)Z80 calculators.
I totally get being busy, but maybe you'll be lucky enough to pull a JamesV and come back in ten years with renewed energy/time/motivation.
I find that doing the sign check before and after is the best method. I did find a few optimizations and noticed that the remainder isn't properly restored to A in the case where the result is negative. Also keep in mind that the routine may fail when C>127.
; divide HL by C (HL is signed, C is not)
; output: HL = quotient, A = remainder
divHLbyCs:
    bit 7,h
    push af
    jr z,divHLbyCsStart
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
divHLbyCsStart:
    xor a
    ld b,16
divHLbyCsLoop:
    add hl,hl
    rla
    cp c
    jp c,divHLbyCsNext
    sub c
    inc l
divHLbyCsNext:
    djnz divHLbyCsLoop
    ld b,a
    pop af
    ld a,b
    ret z
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld a,c
    sub b
    ret

I changed the output remainder so that it returns the remainder modulo C, so -7/3 returns 2 instead of 1. This means that no bytes or clock cycles were saved after all.
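For what it's worth, here is a quick call sketch; the operand values are mine, and the results come from tracing the routine above, so double-check before relying on them:

    ld hl,-7        ;numerator (signed)
    ld c,3          ;denominator (unsigned, must be < 128)
    call divHLbyCs
    ;A = 2 (the remainder modulo C, as described above)
    ;HL = -2 (the quotient, truncated toward zero, per my trace)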
Since I was able to optimize my best 16-bit multiply even further today, I thought I'd share! I even think it can be further optimized. And for that matter, here is my favorite DE_Times_A
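The listing itself isn't quoted here, but for anyone following along, a common DE_Times_A looks something like the sketch below. This is the textbook bit-by-bit version, not necessarily the favorite routine being referenced:

DE_Times_A:
;Textbook version for reference -- not the optimized routine mentioned above.
;Inputs: DE and A
;Output: HL = A*DE (low 16 bits); A is restored after the 8 rotations
    ld hl,0
    ld b,8
dta_loop:
    add hl,hl       ;shift the partial product left
    rlca            ;next bit of A (MSB first) into the carry flag
    jr nc,dta_skip
    add hl,de       ;multiplier bit was 1, so add DE
dta_skip:
    djnz dta_loop
    ret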
I have one and it was my primary computer for a while (and before that it was the Pi 2, and before that it was the Pi B). I used it mostly for math-related programming and TI-programming and video games! Tilem and TiLP work well, I installed Sage for more powerful math computing, and RetroPie to play some of my old PSP and PS1 games-- the latter of which runs smoothly, the former requires some tweaking.
I also wrote the brunt of a Point-Of-Sale (POS) system from scratch that it handled just fine, but I need to modify it to be web-based now. My Pi 2 uses a 128GB USB drive for the root file system, which makes it my second-largest computer by storage (surpassed only by a 140GB Windows laptop).
EDIT 2022-06-29: The post below has a much needed improvement
I decided to compute some square roots by hand to keep my tired self awake and alert at work. I opted to use Newton's Method this time and I noticed an optimization! Instead of attaining 2N precision with a 2N-by-N->2N division, I split it up into an N-by-N multiply and a N-by-N->N division, which on the Z80 or similar processor can be nearly twice as fast.
The standard algorithm works as follows: To find the square root of c, start with a guess called x. Iterate the following: (x+c/x)/2
Each iteration doubles the digits of precision; try it on your calculator if you aren't familiar with it! To find the square root of 21, guess '5'. Type 5, hit [Enter]. Then do .5(Ans+21/Ans and hit [Enter] a few times.
The way it works is, if the current x is bigger than the square root of c, then c/x is smaller, and vice-versa. So averaging the two gets a closer estimate to the actual square root. My observation was that if I had, say, 3 digits of precision, then those digits stay the same in the next iteration. This means the first 3 digits of c/x will match those of x. In fact, I'll only need to compute c/x to 6 digits, and if I can skip the first 3, then the division is easier and faster! So, let's optimize:
.5(x+c/x)
=> .5(x+x+(c/x-x))
=> x+.5(c/x-x)
=> x+.5(c-x*x)/x
So it's probably not so clear, but essentially for 2n-digits precision, I need to square an n-digit number and divide an n-digit number by n digits, to n digits precision. This replaces a 2n/2n division.
The former is faster, so let's try an example of finding the square root of 21:

c = 21
Guess: x = 5
(c - x*x) = -4
-4/5/2 = -.4
x - .4 = 4.6
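To see the digit-doubling in action, here is one more pass of the same example (my arithmetic, not part of the original post):

c = 21, x = 4.6
(c - x*x) = 21 - 21.16 = -.16
-.16/4.6/2 ≈ -.01739
x - .01739 ≈ 4.58261, versus sqrt(21) = 4.58257..., so we went from two correct digits to roughly four.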
In practice by hand, it actually doesn't save much time: an n-by-n division takes me a little more than half the time of a 2n-by-2n one, and I do multiplication and division at roughly the same speed. So overall this gets me about 20% faster.
On the Z80, a 2N-by-N->2N division takes roughly the time of 1.6 2N-by-2N->4N multiplies. As well, a 2N-by-2N->4N multiply takes roughly 3.5 times the time of an N-by-N->2N multiply using recursive Karatsuba multiplication.
So the standard algorithm takes roughly 5.6 (that's 1.6*3.5) N-by-N->2N multiplies' worth of time, while the modified algorithm takes roughly 2.9 N-by-N->2N multiplies' worth of time.
So I have been really wanting to implement the Shunting-Yard algorithm for a few years now and I think I've got it in assembly. I keep finding bugs, but in case other people are interested, here is my code. It assumes that the operator stack and output (combined) won't exceed 768 bytes.
Things that would be useful:
Converting numbers to an intermediate format.
Handling multi-byte tokens and named vars and functions.
This way the code can be pseudo-compiled when it is converted to RPN, then an RPN calculator can parse the code faster or prep it for actual compiling.
#define bcall(x) rst 28h \ .dw x
saveSScreen = 86ECh
scrap = saveSScreen
outhead = 8000h
stackhead = 8002h

.db $BB,$6D
.org $9D95
    ld hl,test
    ld bc,test_end-test
    call shuntingyard
    call rpn
    ld hl,scrap
    bcall(450Ah)
    bcall(452Eh)
    ret

shuntingyard:
    ld de,scrap
    ld (outhead),de
    ld d,(scrap/256)+3
    ld (stackhead),de
_:
    ld a,(hl)
    call +_
    cpi
    jp pe,-_
    ld hl,scrap+768
    ld de,(stackhead)
    or a
    sbc hl,de
    ld b,h
    ld c,l
    ld hl,(outhead)
    ex de,hl
    jr z,$+3
    ldir
    dec de
    xor a
    ld (de),a
    ret
_:
    cp '.'
    jp z,num_dec
    cp 30h
    jr c,+_
    cp 3Ah
    jp c,num
_:
    cp '('
    jp nz,+_
    ex de,hl
    ld hl,(stackhead)
    dec hl
    ld (hl),','
    dec hl
    ld (hl),a
    ld (stackhead),hl
    ex de,hl
    ret
_:
    cp ')'
    jp nz,checkunops
    push hl
    push bc
    ld hl,scrap+768
    ld de,(stackhead)
    sbc hl,de
    jp z,ERR_Unmatched_lparens
    ld b,h
    ld c,l
    ex de,hl
    ld de,(outhead)
;BC is the size of the stack. Use this in case there is a missing ')' so we don't read garbage.
;Basically, search for the matching '(' while piping out the stack to the output.
outerloop:
    ld a,(hl)
    cp '('
    jr z,parens_found
    ld a,','
_:
    cp (hl)
    ldi
    jp po,ERR_Unmatched_lparens
    jr z,outerloop
    jp -_
parens_found:
    inc hl
    inc hl
    ld (outhead),de
    ld (stackhead),hl
    pop bc
    pop hl
    ret
checkunops:
checkbinops:
;;if the token is an operator, then:
;;  while there is an operator at the top of the operator stack with
;;  greater than or equal precedence and that operator is left associative:
;;    pop operators from the operator stack onto the output queue.
;;  push the read operator onto the operator stack.
    push bc
    ex de,hl
    call getprecedence
    ld a,c
    pop bc
    ex de,hl
    jp c,search_function
;now C is the precedence, with lower bit = 1 if left-associative
    push bc
    push hl
    ld de,(stackhead)
    ld hl,scrap+768
    sbc hl,de
    ld b,h
    ld c,l
    ld hl,(outhead)
    ex de,hl
    jr z,pushop
;a is the precedence against which to compare
_:
    push hl
    push bc
    push af
    ld a,(hl)
    call getprecedence
    jr c,+_
    pop hl
    ld a,h      ;incoming
    cp c
    jr nz,$+4
    rra \ nop
    pop bc
    pop hl
;======================================================
    jr nc,pushop
.echo "The following code only works until we have to add >1 byte tokens."
    ldi
    ldi
    jp pe,-_
    jp $+6
_:
    pop af
    pop bc
    pop hl
pushop:
    ld (outhead),de
    pop de
    dec hl
    ld (hl),','
    dec hl
    ld a,(de)
    ld (stackhead),hl
    ld (hl),a
    ex de,hl
    pop bc
    ret
search_function:
    jp ERR_Func_Not_Found
getprecedence:
    ld hl,binops
    ld b,(binops_end-binops)/2
_:
    cp (hl)
    inc hl
    ld c,(hl)
    ret z
    inc hl
    djnz -_
    scf
    ret
binops:
    .db 4, $01
    .db '=',$50
    .db '|',$60
    .db '&',$70
    .db '-',$81     ;right associative is odd
    .db '+',$80     ;left associative is even
    .db '/',$83     ;right associative
    .db '*',$82     ;left associative
    .db '^',$85     ;right associative
binops_end:
num:
    ld de,(outhead)
_:
    ldi
    jp po,+_
    ld a,(hl)
    cp '.'
    jr z,num_dec+4
    cp 30h
    jr c,+_
    cp 3Ah
    jr c,-_
_:
    ld a,','
    ld (de),a
    inc de
    ld (outhead),de
    dec hl
    inc bc
    ret
num_dec:
    ld de,(outhead)
_:
    ldi
    jp po,+_
    ld a,(hl)
    cp 30h
    jr c,+_
    cp 3Ah
    jr c,-_
_:
    cp '.'
    jp z,ERR_Syntax_00
    ld a,','
    ld (de),a
    inc de
    ld (outhead),de
    dec hl
    inc bc
    ret
ERR_Syntax_00:      ;Too many decimal points.
ERR_Func_Not_Found:
ERR_Unmatched_lparens:
    ret
rpn:
    ret
test:
;   .db "(3.1415926535)"
    .db "(3.142+6/2-7)*3^6*3"
test_end:
I'll be honest, my eyes glossed over. So you have a 3-D space and two buckets that move back and forth along a fixed (randomly initialized) path. A projectile is launched from a bucket, in only one direction and the goal is for it to fall into the other moving bucket. The problem then is to determine if it is possible with a given setup.
Since the projectile is launched in one fixed direction, I'm assuming directly 'up' and it falls directly back 'down', let's pretend we do that at every single point and map the result. If the path made a loop, the plot would look something like a cookie cutter. Plot the other path and find if it ever intersects. If not, return FALSE.
If it does, we need to then verify that they intersect in time. What I would do is determine how many integer time units it takes both paths to complete. If the time units of both paths are coprime, then return TRUE (with coprime periods, the two phases eventually line up at an intersection, à la the Chinese Remainder Theorem).
Perform the following loop in the event that they are not coprime:
0. Given the length of time of the first path is A, and the length of the second is B.
1. Start with a first intersection.
2. Identify the time units it takes to reach that point on each path; call them C and D.
3. If gcd(|C-D|,gcd(A,B))=1, return TRUE; else move to the next intersection and go to Step 2.
4. If all intersections are exhausted, return FALSE.
WARNING: My math skills have been deteriorating since college. I'm sure there are optimizations and identities to apply.
Hi all, I lost my previous references and example programs, and it took me all of this morning to locate this algorithm, digest it, and spew out my own version. I looked on all of my calculators and Omni first, so I'm going to post it here for when it happens again.
Anyways, this is one of my favorite algorithms for evaluating logarithms:
It achieves single-precision accuracy (at least 24.4996 bits) for inputs on [.5,2].
During range reduction, x is usually reduced to some value on [c,2c].
The best precision is found when c=.5sqrt(2) (the range is [.5sqrt(2),sqrt(2)]), achieving at least 31.5 bits.
I prefer c=2/3, since 2/3 and 4/3 are equidistant from 1; it makes it easier for me to analyze time complexity. This still offers at least 29.97 bits, which is better than single precision.
Cost is:
amean: 7 (an 'amean' costs the same as an add with binary floats)
half divide: 1
full divide: 3
multiply: 1
shift-by-2: 3
shift-by-4: 2 (slightly more efficient on the Z80 than 4 single shifts when values are in RAM)
add/sub: 5
add/sub const: 1
I derived this algorithm from this wonderful paper, which is annoyingly never at the top of a Google search; in fact, it took me a loooong time to ever stumble upon it, sadly.
This paper greatly accelerates the classic Borchardt-Gauss algorithm to be on par with the AGM algorithm. At their core, both algorithms perform an arithmetic and geometric mean, but AGM requires them to be done in parallel (emulated on single-core processors by some simple variable juggling), whereas B-G does them sequentially. As well, AGM achieves quadratic convergence or better (I've seen some exponential convergence in specific special cases), whereas classic B-G usually achieves linear convergence. Carlson's version of the B-G algorithm adds O(log(n)^2) additions and O(log(n)) space for quadratic convergence (where n is the number of desired bits of accuracy).
I like the B-G-C algorithm since I can easily obtain the inverse trig functions and inverse hyperbolic functions as well as the natural logarithm.
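For anyone curious, the unaccelerated Borchardt-Gauss iteration for the logarithm is short enough to state here (the acceleration table is in the paper; I'm recalling this from memory, so verify against Carlson before using it): start with ##a_0=\frac{x+1}{2}## and ##g_0=\sqrt{x}##, then iterate ##a_{n+1}=\frac{a_n+g_n}{2}## and ##g_{n+1}=\sqrt{a_{n+1}g_n}##. Both sequences converge linearly (the error shrinks by a factor of roughly 4 each step) to a common limit, and ##\ln x=\frac{x-1}{\lim a_n}##. Carlson's speedup is essentially Richardson extrapolation on that sequence, which is where the O(log(n)^2) extra additions and O(log(n)) space come from.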
Sorry, I'm on my phone, so I'll probably not go too in-depth on this. Bug me for details if I don't get around to it and you need them.
So:

Given ##x\in[-.5\ln2,.5\ln2]##
Let ##y=x^{2}##
Let ##a=\frac{x}{2}\frac{1+\frac{5y}{156}\left(1+\frac{3y}{550}\left(1+\frac{y}{1512}\right)\right)}{1+\frac{3y}{26}\left(1+\frac{5y}{396}\left(1+\frac{y}{450}\right)\right)}##
Then ##e^{x}\approx\frac{1+a}{1-a}##
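As a quick sanity check (my arithmetic, rounded to 7 digits): with ##x=.25##, we get ##y=.0625##; the numerator of the big fraction comes to about 1.0020039 and the denominator to about 1.0072172, so ##a\approx.1243530## (matching ##\tanh(x/2)##, as it should), and ##\frac{1+a}{1-a}\approx1.2840254##, agreeing with ##e^{.25}## to all digits shown.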
Accuracy is ~75.9666 bits. The cost is:
7 increments, 1 decrement
6 constant multiplications
6 general multiplications
2 general divisions
1 div by 2 (right shift, or decrement an exponent)
For comparison, that's comparable to 16 terms of the Taylor series, or 8 terms of the standard Padé expansion (the exponential is special in that its expansion comes out to f(x)/f(-x), so it can be done even more easily than most).
I basically carried out a Padé expansion for e^x to infinity, noticed that after the constant term all the even coefficients were zero, so I used a Padé expansion on that function to quickly find our approximation for a.
In my usage, I actually implemented a 2^x function, since I'm using binary floats with 64-bit precision. I take int(x) and save that as a final exponent for the float, then remove that value from x. By definition of int(), x is now non-negative. If x≥.5, increment the saved exponent and subtract 1 from x. Now x is on [-.5,.5]. We need to perform 2^x, but that is equivalent to e^(x*ln(2)). We've effectively applied range reduction to our original input, and now we can rely on the algorithm at the top. The integer part that we saved earlier then gets added to the exponent byte, et voilà!
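To make the range reduction concrete, here is a worked run with my own numbers: to compute 2^3.7, save int(3.7)=3 for the exponent, leaving x=.7; since .7≥.5, bump the saved exponent to 4 and subtract 1, so x=-.3. Then 2^(-.3)=e^(-.3*ln(2))≈e^(-.2079), and -.2079 lies inside [-.5ln2,.5ln2]≈[-.3466,.3466], so the approximation at the top applies; adding the saved 4 to the exponent byte finishes the job.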
I think when I calculated max error using my floats it was accurate up to the last two bits or something. Don't quote me on that as I don't have those notes on me at the moment and it was done months ago.
Hey Art! It's been the same way for me, too. Adulthood and social media... ugh. I love programming and do program on my calc most days, but it's usually just a couple of minutes here and there throughout the day; I just don't have time for projects. I couldn't tell if I wasn't seeing you around or if it was because I don't check in regularly.
I'm also guessing division would use subtraction instead of addition, truncating any remainder.
Division is typically performed exactly like 'schoolbook' long division (until you get to higher precision, then you can use some state-of-the-art algorithms and math magic).
Start with an accumulator and quotient set to 0. Shift the top bit of the numerator into the accumulator. If accumulator ≥ denominator, subtract the denominator from the accumulator and shift a 1 into the quotient; otherwise shift in a 0. Repeat from the shift step.
In code, since we are getting rid of one bit at a time in the numerator and adding 1 bit at a time to the quotient, we can actually recycle the freed-up bits in the numerator. Here is an example of HL/C where C<128; A is the accumulator and HL doubles as the quotient and numerator:
HL_div_C:
;Input:
;   HL is the numerator
;   C is the denominator, C<128
;Output:
;   A is the remainder
;   HL is the quotient
    xor a
    ld b,16
loop:
    add hl,hl   ;works like shifting HL left by 1; overflow (thus, the top bit) is left in the carry flag
    rla         ;shift A left, rotating the carry flag in as the low bit
    cp c        ;compare A to C. Basically does A-C and returns the flags; if C>A, underflow sets the carry flag
    jr c,skip   ;skip the next two bytes if the carry flag is set (so A<C)
    inc l       ;we know the low bit of HL is 0, so incrementing HL sets that bit
    sub c
skip:
    djnz loop
    ret
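If it helps, here is an example call; the operand values are mine, not from the original post:

    ld hl,1234      ;numerator
    ld c,10         ;denominator, safely < 128
    call HL_div_C
    ;returns HL = 123 (quotient) and A = 4 (remainder)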
ld b,8 - This must be telling the loop to repeat 8 times to check each bit, so, how does the loop relate to b? I'm thinking about how nested loops would work...
djnz loop - kindof like "goto" loop, with the automatic decrementing of register b?
That is correct, and that's why we initially do ld b,8. Keep in mind that djnz * and jr * are intended for small(ish) loops or redirecting to code relatively close by: measured from the start of the instruction, the target can be up to 126 bytes back or 129 bytes forward. (An offset of -1 lands on the jump's own operand byte, $FF, which executes as rst 38h, so I do not suggest it; an offset of -2 jumps to the instruction itself, an infinite loop; an offset of 0 does nothing but waste clock cycles.) Most assemblers will warn you of out-of-bounds jumps. For longer loops or jumping to code far away, use jp *. djnz * only works with register b and always decrements.
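Since you asked about nested loops: b is just a register, so the usual pattern is to save it around the inner loop. A minimal sketch (labels and counts are mine):

    ld b,8          ;outer loop counter
outer:
    push bc         ;save the outer counter
    ld b,16         ;inner loop counter
inner:
    ;...inner work goes here...
    djnz inner      ;decrement b, loop until it hits zero
    pop bc          ;restore the outer counter
    djnz outer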
rrc e - you say this rotates register e, is this the same as changing the register input from left to right?
For example, if e=01111011, then rrc e would change it to e=10111101. The bottom bit gets returned in the c flag as well as being returned to the top bit of e. I used rrc * since rotating 8 times would leave it exactly how it was input. I could have used rr * or even sra * or srl *, but I prefer to avoid destroying registers if I can.
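To trace a couple of steps of that example (my annotations):

    ;e = %01111011 to start
    rrc e           ;e = %10111101, carry flag = 1
    rrc e           ;e = %11011110, carry flag = 1
    ;...after 8 rotations, e = %01111011 again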
But the multiplication isn't convoluted! It's exactly how most of us are taught in grade school, except instead of multiplying digits 0~9, it's just multiplying by 0 or 1, which is super trivial. Like:

        01110101
       x10101101
       ---------
        01110101
       000000000
      0111010100
     01110101000
    000000000000
   0111010100000
  00000000000000
 011101010000000
Or, removing the multiplies by zero:

        01110101
       x10101101
       ---------
        01110101
      0111010100
     01110101000
   0111010100000
 011101010000000
So suppose bit 3 is set. Then you basically add your top number, shifted left three times. As an example, suppose you wanted to multiply C*E (ignoring the top 8 bits):
;C is our "top" number. ;E is our "bottom" number. ;A will be our "accumulator"
ld a,0
rrc e ;this rotates register 'e' right, putting the bottom bit as "carry" [out of the register]. jr nc,checkbit1 ;nc == not carry. If "carry" out was zero, skip this step. add a,c ;if carry out was 1, then add to our accumulator. checkbit1: sla c ;finally, shift our "top number" to the left in case we need to add this to the accumulator, too. Then [REPEAT] 7 more times. rrc e jr nc,checkbit2 add a,c checkbit2: sla c rrc e jr nc,checkbit3 add a,c checkbit3: sla c rrc e jr nc,checkbit4 add a,c checkbit4: sla c rrc e jr nc,checkbit5 add a,c checkbit5: sla c rrc e jr nc,checkbit6 add a,c checkbit6: sla c rrc e jr nc,checkbit7 add a,c checkbit7: sla c rrc e jr nc,all_done add a,c all_done: ret If you can see how that relates to the school book algorithm, then just know that the following does practically the same thing:
;C is our "top" number. ;E is our "bottom" number. ;A will be our "accumulator"
xor a ;mad hax to set A to zero. Faster, smaller. ld b,8 loop: rrc e ;this rotates register 'e' right, putting the bottom bit as "carry" [out of the register]. jr nc,no_add ;nc == not carry. If "carry" out was zero, skip this step. add a,c ;if carry out was 1, then add to our accumulator. no_add: sla c ;finally, shift our "top number" to the left in case we need to add this to the accumulator, too. djnz loop ;aaand repeat, decrementing register B until zero (this is a specialized instruction on the Z80) ret
But please don't use that in a real program! It's terribly inefficient. If you understand how the register pairing works, you can come up with a much better AND more convoluted algorithm:
Spoiler For "Step-by-Step How to derive the 'best' 8-bit Multiplication Algorithm":
Let's start by rearranging the above code. The way we do 'schoolbook' multiplication starts at the least significant digit, but we can just as easily start from the most significant digit. So let's do an example in base 10:

  377
 x613
 ----
= 1*3*377 + 10*1*377 + 100*6*377
= 1(3*377 + 10(1*377 + 10(6*377)))
If we want to convert that last line to pseudo-code:
H*E -> A
;H is the "bottom" number; we will now be checking it from the top digit down to the bottom digit.
;E is the "top" number.
;A is the accumulator.
;basic algo, after initializing A to zero:
;   multiply A by 2
;   shift H left by 1
;   if this results in a bit carried out (so a 1 carried out), then add E (the 'top' number) to A (the accumulator)
;   repeat 7 more times for all bits in H
    xor a
    ld b,8
loop:
    add a,a
    sla h       ;shifts H left by 1, bringing in a 0 for the low bit. Mathematically the same as H*2 -> H
    jr nc,skip_add
    add a,e
skip_add:
    djnz loop
    ret

That's faster, but not optimal! To get the optimal way, let's stray from optimality a little to make L our accumulator. Since we can't directly add another register to L, we'll have to juggle with register A, making it slower:
    ld l,0
    ld b,8
loop:
    sla l
    sla h
    jr nc,skip_add
    ld a,l \ add a,e \ ld l,a
skip_add:
    djnz loop
    ret

But since we know that 'sla l' will spit out a zero bit for the first 8 iterations (all of them), we can do 'sla l \ rl h', which is the same size and speed. However, that is the same as just doing "add hl,hl"! This is where you'll have to learn how register pairs work.
    ld l,0
    ld b,8
loop:
    add hl,hl
    jr nc,skip_add
    ld a,l \ add a,e \ ld l,a
skip_add:
    djnz loop
    ret

But wait, there is more! We don't have an "add l,e" instruction, but we do have an "add hl,de" instruction. If we make D==0, then we can change 'ld a,l \ add a,e \ ld l,a' to 'add hl,de'. The problem is that if the lower byte of HL (so L) overflows, the carry goes into the upper byte H, our 'bottom' number, and we can't have it changing our input value halfway through the algorithm! Thankfully, the carries never propagate into the bits of H that still hold input. This requires some tedious work to prove, but if you are cool with taking that at face value, then our last piece of the puzzle gives us:
    ld d,0
    ld l,d      ;since D=0 already, this sets L to 0, as we want. It's smaller and faster than ld l,0.
    ld b,8
loop:
    add hl,hl
    jr nc,skip_add
    add hl,de
skip_add:
    djnz loop
    ret

Even better, this actually gives us the full 16-bit result instead of just the lower 8 bits
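To round it off, a usage sketch: wrap the routine above under a label (I'll call it H_Times_E here; the name and operand values are mine, not from the post) and call it like this:

    ld h,101        ;one factor
    ld e,37         ;the other factor
    call H_Times_E  ;the final routine above
    ;HL = 101*37 = 3737, the full 16-bit product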