ASM Optimized routines

« previous next »

Print

Pages: 1 ... 4 5 [6] 7 8 Go Down

Author Topic: ASM Optimized routines (Read 115643 times)

0 Members and 1 Guest are viewing this topic.

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #75 on: November 13, 2013, 03:09:47 pm »

I made this rather fast 8.8 fixed point natural log (ln, log_e) routine today based on some math I have been working on. It currently averages 677 t-states, but only works on [1,2] I plan to extend this to the whole range of numbers by using a 5 element LUT and a single division making the average somewhere around 1100 t-states:

lognat:
;Input:  H.L needs to be on [1,2]
;Output: H.L if z flag is set, else if nz, no result
;avg speed: 677 t-states
	dec h
   	dec h
	jr nz,$+8
	inc l \ dec l
	ret nz
	ld l,177
	ret
	inc h
	ret nz
	ld b,h
	ld c,l
	ld e,l
	ld d,8
	add hl,hl
	add hl,hl
	add hl,de
	ex de,hl

	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ adc a,a

	ld h,a \ ld l,b
	sla h \ jr c,$+3 \ ld l,c
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	rl l
	ld a,h
	adc a,b
	ld l,a
	ld h,b
	cp a
	ret

I had fun twisting some of the logic, but there is a division and multiplication in there. It might seem backwards, but it was optimal

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #76 on: November 14, 2013, 08:17:22 pm »

This version works on (0,128) and averages less than 1250 t-states:

lognat:
;Input:  H.L needs to be on (0,128.0)
;Output: H.L if c flag set
;	 returns nc if input is negative (HL not modified)
;Error:
;	The error on the outputs is as follows:
;		20592 inputs are exact
;		12075 inputs are off by 1/256
;		100 inputs are off by 2/256
;	So all 32767 inputs are within 2/256, with average error being <1/683 which is smaller than 1/256.
;Size: 177 bytes
;Speed: average speed is less than 1250 t-states

	ld a,h \ or l \ jr nz,$+5
	ld h,80h \ ret
	dec h
	dec h
	jr nz,$+9
	inc l \ dec l
	jr nz,normalizeln-1
	ld l,177
	ret
	inc h
	jr nz,normalizeln
	ld b,h
	ld c,l
	ld e,l
	ld d,8
	add hl,hl
	add hl,hl
	add hl,de
	ex de,hl
	call HL_Div_DE
	ld h,a \ ld l,b
	sla h \ jr c,$+3 \ ld l,c
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	add hl,hl \ jr c,$+3 \ add hl,bc
	rl l
	ld a,h
	adc a,b
	ld h,b
	ld l,a
	scf
	ret
HL_Div_DE:
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ jr nc,$+3 \ add hl,de \ adc a,a
	add hl,hl \ sbc hl,de \ adc a,a \ ret

	inc h
normalizeln:
	xor a
	inc h \ ret m
	ld d,a \ ld e,a
	ld a,l
	jr z,toosmall
	inc e \ srl h \ rra \ jr nz,$-4
	rla \ rl h
	dec e
stepin:
	ld l,a
	push de
	call lognat
	pop de
	;now multiply DE by 355, then divide by 2 (rounding)
	ld b,d \ ld c,e \ ld a,d
	ex de,hl
	add hl,hl
	add hl,hl	;4
	add hl,bc	;5
	add hl,hl	;10
	add hl,bc	;11
	add hl,hl	;22
	add hl,hl
	add hl,hl
	add hl,hl
	add hl,bc
	add hl,hl
	add hl,bc
	sra h \ rr l
	adc hl,de
	scf
	ret
toosmall:
	dec d
	dec e \ add a,a \ jr nc,$-2
	inc h
	jp stepin

The worst error is 2/256, which is much better than I thought it would be

On [1,2] it is at worst 1/256 and the average case is <1/683 which is smaller than 1/256 (the smallest positive 8.8 number).

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #77 on: November 14, 2013, 11:50:30 pm »

If you are willing to use a 256-byte LUT, then lg() can be computed in an average of 340 t-states, and for an additional 325 (at most), ln() can be computed from that:
(these are fixed point 8.8 routines)

ln_88:
;Input: HL is a fixed point number
;Output: ln(H.L)->H.L
;Speed: Avg: 340+(325 worst case)
	call lg_88
	;now signed multiply HL by 355, then divide by 2 (rounding)
    ld de,0
    bit 7,h
    jr z,$+9
    dec e \ xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ld b,h
    ld c,l
    xor a
	add hl,hl
   	add hl,hl \ rla
	add hl,bc \ adc a,d
	add hl,hl \ rla
	add hl,bc \ adc a,d
	add hl,hl \ rla
	add hl,hl \ rla
	add hl,hl \ rla
	add hl,hl \ rla
	add hl,bc \ adc a,d
	add hl,hl \ rla
	add hl,bc \ adc a,d
    sra a \ rr h
    ld l,h
    ld h,a
    inc e
    ret nz
    xor a \ sub l \ ld l,a
    sbc a,a \ sub h \ ld h,a
    ret
lg_88:
;Input: HL is a fixed point number
;Output: lg(H.L)->H.L
;Speed: Avg: 340
	ld de,lgLUT
	ld b,0
	ld a,h
	or a
	ret m
	ld a,l
	jr z,$+8
	inc b \ srl h \ rra \ jr nz,$-4
	or a \ jr nz,$+6
	ld hl,8000h \ ret
	rra \ inc b \ jr nc,$-2
;A is the element to look up in the LUT
	ld l,a
    ld c,h
    dec b
	add hl,hl
	add hl,de
	ld e,(hl)
	inc hl
	ld d,(hl)
    ex de,hl
	add hl,bc
	ret
lglut:
.dw $F800
.dw $F996
.dw $FA52
.dw $FACF
.dw $FB2C
.dw $FB76
.dw $FBB3
.dw $FBE8
.dw $FC16
.dw $FC3F
.dw $FC64
.dw $FC86
.dw $FCA5
.dw $FCC1
.dw $FCDC
.dw $FCF4
.dw $FD0B
.dw $FD21
.dw $FD36
.dw $FD49
.dw $FD5C
.dw $FD6D
.dw $FD7E
.dw $FD8E
.dw $FD9D
.dw $FDAC
.dw $FDBA
.dw $FDC8
.dw $FDD5
.dw $FDE2
.dw $FDEE
.dw $FDFA
.dw $FE06
.dw $FE11
.dw $FE1C
.dw $FE26
.dw $FE31
.dw $FE3B
.dw $FE44
.dw $FE4E
.dw $FE57
.dw $FE60
.dw $FE69
.dw $FE71
.dw $FE7A
.dw $FE82
.dw $FE8A
.dw $FE92
.dw $FE9A
.dw $FEA1
.dw $FEA9
.dw $FEB0
.dw $FEB7
.dw $FEBE
.dw $FEC5
.dw $FECB
.dw $FED2
.dw $FED8
.dw $FEDF
.dw $FEE5
.dw $FEEB
.dw $FEF1
.dw $FEF7
.dw $FEFD
.dw $FF03
.dw $FF09
.dw $FF0E
.dw $FF14
.dw $FF19
.dw $FF1E
.dw $FF24
.dw $FF29
.dw $FF2E
.dw $FF33
.dw $FF38
.dw $FF3D
.dw $FF42
.dw $FF47
.dw $FF4B
.dw $FF50
.dw $FF55
.dw $FF59
.dw $FF5E
.dw $FF62
.dw $FF67
.dw $FF6B
.dw $FF6F
.dw $FF74
.dw $FF78
.dw $FF7C
.dw $FF80
.dw $FF84
.dw $FF88
.dw $FF8C
.dw $FF90
.dw $FF94
.dw $FF98
.dw $FF9B
.dw $FF9F
.dw $FFA3
.dw $FFA7
.dw $FFAA
.dw $FFAE
.dw $FFB2
.dw $FFB5
.dw $FFB9
.dw $FFBC
.dw $FFC0
.dw $FFC3
.dw $FFC6
.dw $FFCA
.dw $FFCD
.dw $FFD0
.dw $FFD4
.dw $FFD7
.dw $FFDA
.dw $FFDD
.dw $FFE0
.dw $FFE4
.dw $FFE7
.dw $FFEA
.dw $FFED
.dw $FFF0
.dw $FFF3
.dw $FFF6
.dw $FFF9
.dw $FFFC
.dw $FFFF

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #78 on: November 16, 2013, 09:51:46 am »

Here is my 8.8 fixed point, table-based arctangent routine:

arctan_88:
;Input:
;    D.E
;Output: atan(D.E)->D.E
    push de
    ld a,d
    or a
    jp p,$+5
    neg
    ld d,a
    dec a
    jr nz,checkneedinv
    inc e \ dec e \ jr nz,checkneedinv
    pop af \ rla \ ld de,201 \ ret nc \ ld de,-201 \ ret
checkneedinv:
    inc a
    call nz,DEgt1_Inv
;0.E is the value to atan
    ld hl,adjustatan
    push hl
    ld a,e
    cp 46 \ ret c
    dec a \ cp 42h \ ret c
    dec a \ cp 4Eh \ ret c
    dec a \ cp 57h \ ret c
    dec a \ cp 5Eh \ ret c
    dec a \ cp 64h \ ret c
    dec a \ cp 6Ah \ ret c
    dec a \ cp 6Fh \ ret c
    sub 6Fh \ ld e,a
    ld hl,atanlut
    add hl,de
    ld a,(hl)
    ret
adjustatan:
    ld e,a
    pop bc
    ld a,b
    or a
    jp p,$+5
    neg
    jr z,$+9
    ld hl,402
    or a
    sbc hl,de
    ex de,hl
    rl b
    ret nc
    xor a
    sub e
    ld e,a
    sbc a,a
    sub d
    ld d,a
    ret
DEgt1_Inv:
;Works if DE>1
    ld hl,256
    ld b,8
InvLoop:
    add hl,hl
    sbc hl,de
    jr nc,$+3
    add hl,de
    adc a,a
    djnz InvLoop
    cpl
    ld e,a
    ld d,b
    ret
atanlut:
.db $6F
.db $6F
.db $70
.db $71
.db $72
.db $73
.db $73
.db $74
.db $75
.db $76
.db $77
.db $77
.db $78
.db $79
.db $7A
.db $7B
.db $7B
.db $7C
.db $7D
.db $7E
.db $7F
.db $7F
.db $80
.db $81
.db $82
.db $82
.db $83
.db $84
.db $85
.db $85
.db $86
.db $87
.db $88
.db $88
.db $89
.db $8A
.db $8B
.db $8B
.db $8C
.db $8D
.db $8E
.db $8E
.db $8F
.db $90
.db $90
.db $91
.db $92
.db $93
.db $93
.db $94
.db $95
.db $95
.db $96
.db $97
.db $97
.db $98
.db $99
.db $9A
.db $9A
.db $9B
.db $9C
.db $9C
.db $9D
.db $9E
.db $9E
.db $9F
.db $A0
.db $A0
.db $A1
.db $A2
.db $A2
.db $A3
.db $A3
.db $A4
.db $A5
.db $A5
.db $A6
.db $A7
.db $A7
.db $A8
.db $A9
.db $A9
.db $AA
.db $AA
.db $AB
.db $AC
.db $AC
.db $AD
.db $AD
.db $AE
.db $AF
.db $AF
.db $B0
.db $B0
.db $B1
.db $B2
.db $B2
.db $B3
.db $B3
.db $B4
.db $B5
.db $B5
.db $B6
.db $B6
.db $B7
.db $B7
.db $B8
.db $B9
.db $B9
.db $BA
.db $BA
.db $BB
.db $BB
.db $BC
.db $BC
.db $BD
.db $BE
.db $BE
.db $BF
.db $BF
.db $C0
.db $C0
.db $C1
.db $C1
.db $C2
.db $C2
.db $C3
.db $C3
.db $C4
.db $C4
.db $C5
.db $C6
.db $C6
.db $C7
.db $C7
.db $C8
.db $C8
.db $C9

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #79 on: November 19, 2013, 12:15:26 pm »

EDIT:(27-Oct-2015) Updated at the end of the original post with a muuuch better routine.

I wrote this routine today for generating pseudorandom numbers. If I did the math correctly, it has a period of 4292870399. To think of it differently, it executes in about ~~1430~~ 1592 t-states~~, so that means at 15MHZ, it would take about 4 days, 19 hours, 49 minutes, 41 seconds to loop back to the start of the over 4 billion number sequence.~~ Or, if you generated one number per second, it would take over 136 years to finish the cycle.

Rand24:
;Inputs: seed1,seed2
;Outputs:
;	HLA is the pseudo-random number
;	seed1,seed2 incremented accordingly
;Destroys: BC,DE
;Notes:
; seed1*243+83 mod 65519 -> seed1
; seed2*251+43 mod 65521 -> seed2
; Output seed1*seed2
	ld hl,(seed1)
	xor a
	ld b,h \ ld c,l
	ld de,83
	add hl,hl \ rla		;2
	add hl,bc \ adc a,d	;3
	add hl,hl \ rla		;6
	add hl,bc \ adc a,d	;7
	add hl,hl \ rla		;14
	add hl,bc \ adc a,d	;15
	add hl,hl \ rla		;30
	add hl,hl \ rla		;60
	add hl,hl \ rla		;120
	add hl,bc \ adc a,d	;121
	add hl,hl \ rla		;242
	add hl,bc \ adc a,d	;243
	add hl,de \ adc a,d	;243x+83
;now AHL mod 65519
; Essentially, we can at least subtract A*65519=A(65536-17), so add A*17 to HL
	ex de,hl
	ld l,a
	ld b,h
	ld c,l
	add hl,hl
	add hl,hl
	add hl,hl
	add hl,hl
	add hl,bc
	add hl,de
	ld de,65519
	jr nc,$+5
	or a \ sbc hl,de
	or a \ sbc hl,de
	jr nc,$+3
	add hl,de
	ld (seed1),hl
;seed1 done, now we need to do seed2
	ld hl,(seed2)
; seed1*243+83 mod 65519 -> seed1
; seed2*251+43 mod 65521 -> seed2
; Output seed1*seed2
	xor a
	ld b,h \ ld c,l
	ld de,43
	add hl,hl \ rla		;2
	add hl,bc \ adc a,d	;3
	add hl,hl \ rla		;6
	add hl,bc \ adc a,d	;7
	add hl,hl \ rla		;14
	add hl,bc \ adc a,d	;15
	add hl,hl \ rla		;30
	add hl,bc \ adc a,d	;31
	add hl,hl \ rla		;62
	add hl,hl \ rla		;124
	add hl,bc \ adc a,d	;125
	add hl,hl \ rla		;250
	add hl,bc \ adc a,d	;251
	add hl,de \ adc a,d	;251x+83
;now AHL mod 65521
; Essentially, we can at least subtract A*65521=A(65536-15), so add A*15 to HL
	ex de,hl
	ld l,a
	ld b,h
	ld c,l
	add hl,hl
	add hl,hl
	add hl,hl
	add hl,hl
	sbc hl,bc
	add hl,de
	ld de,65521
	jr nc,$+5
	or a \ sbc hl,de
	or a \ sbc hl,de
	jr nc,$+3
	add hl,de
	ld (seed2),hl
;now seed1 and seed 2 are computed
	ld bc,(seed1)
	ex de,hl
	call BC_Times_DE
	ex de,hl
	ld l,b
	ld h,0
	ld b,h
	ld c,l
	add hl,hl
	add hl,hl
	add hl,bc
	add hl,hl
	add hl,hl
	add hl,bc
	add hl,hl
	add hl,bc
	ld c,d
	ld d,e
	ld e,a
	ld a,c
	sbc hl,bc
	sbc a,b
	ret nc
	ld c,43
	add hl,bc
	ret
;now do BC_times_DE
BC_Times_DE:
;BHLA is the result
	ld a,b
	or a
	ld hl,0
	ld b,h
;1
	add a,a
	jr nc,$+4
	ld h,d
	ld l,e
;2
	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,b
;227+10b-7p
	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,b

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,b

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,b

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,b

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,b

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,b

;===
;AHL is the result of B*DE*256
	push hl
	ld h,b
	ld l,b
	ld b,a
	ld a,c
	ld c,h
;1
	add a,a
	jr nc,$+4
	ld h,d
	ld l,e
;2
	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,c
;227+10b-7p
	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,c

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,c

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,c

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,c

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,c

	add hl,hl
	rla
	jr nc,$+4
	add hl,de
	adc a,c

	pop de
;Now BDE*256+AHL
	ld c,a
	ld a,l
	ld l,h
	ld h,c
	add hl,de
	ret nc
	inc b
;BHLA is the 32-bit result
	ret
seed1:
	.dw 0
seed2:
	.dw 0

To test its 'randomness' I only used HL and it did not exhibit a perfectly uniform random distribution (standard deviation was slightly off from the expected value). However, it seems to be on par with the routine used by the OS and I believe I am using a modified version of that.

Now I will feel confident to use this in my math libraries

EDIT: Added a mod 16777259 to the output to fix a flaw.

EDIT: The much better routine:

It passes all the tests at CAcert Labs
The cycle size is almost 4.3 million times longer .
It is 550% faster (287cc versus 1592cc)
It would take over 11,184,544 years at the calculator's max speed to reach the full cycle.

rand:
;Tested and passes all CAcert tests
;Uses a very simple 32-bit LCG and 32-bit LFSR
;it has a period of 18,446,744,069,414,584,320
;roughly 18.4 quintillion.
;291cc
seed1_0=$+1
    ld hl,12345
seed1_1=$+1
    ld de,6789
    ld b,h
    ld c,l
    add hl,hl \ rl e \ rl d
    add hl,hl \ rl e \ rl d
    inc l
    add hl,bc
    ld (seed1_0),hl
    ld hl,(seed1_1)
    adc hl,de
    ld (seed1_1),hl
    ex de,hl
seed2_0=$+1
    ld hl,9876
seed2_1=$+1
    ld bc,54321
    add hl,hl \ rl c \ rl b
    ld (seed2_1),bc
    sbc a,a
    and %11000101
    xor l
    ld l,a
    ld (seed2_0),hl
    ex de,hl
    add hl,bc
    ret

EDIT (28 August 2019): Hey, so here is a compact version from a few years ago:

;#define smc	;uncomment if you are using SMC
rand16:
;collaboration by Zeda with Runer112
;160cc or 148cc if using SMC
;26 bytes
;cycle: 4,294,901,760 (almost 4.3 billion)
#ifdef smc
seed1=$+1
	ld hl,9999
#else
    ld hl,(seed1)
#endif
    ld b,h
    ld c,l
    add hl,hl
    add hl,hl
    inc l
    add hl,bc
    ld (seed1),hl
#ifdef smc
seed2=$+1
	ld hl,9999
#else
    ld hl,(seed2)
#endif
    add hl,hl
    sbc a,a
    and %00101101
    xor l
    ld l,a
    ld (seed2),hl
    add hl,bc
    ret

And here is a different routine that is faster and might be better (it is on the tests I am using):

xsp32:
;Inputs: (seed1), (seed2), and (seed3) are 16-bit seeds. (seed1) and (seed2) can't both be 0.
;Outputs: HL is the pseudorandom number
;Destroys: A,DE,BC
;cycle: 281,474,976,645,120
;It would take about 185 years at 15MHz to repeat
;min: 258cc (236cc if using SMC)
;max: 288cc (266cc if using SMC)
;avg: 273cc (251cc if using SMC)
;63 bytes (62 bytes if using SMC)
#ifdef SMC
seed1 = $+1
  ld hl,12345
seed2 = $+1
  ld de,6789
#else
  ld hl,(seed1)
  ld de,(seed2)
#endif


;first, XOR it with itself, shifted left 23 bits
;low bit of d needs to be shifted in
  ld a,h
  rra
  ld a,l
  rra
  jr nc,+_
  rl e
  ccf
  rr e
_:
  xor d
  ld d,a

;XOR it with itself, shifted right 15 bits
  ld a,h
  rla
  ld a,e
  rla
  xor l
  ld l,a

  ld a,e
  rla
  ld a,d
  rla
  jr nc,+_
  rr e
  ccf
  rl e
_:
  xor h
  ld h,a

;XOR it with itself, shifted left 17 bits
;HL<<1
  ld (seed1),hl
  add hl,hl
  ld a,h
  xor d
  ld h,a

  ld a,l
  xor e
  ld l,a
  ld (seed2),hl
  ex de,hl

#ifdef SMC
seed3 = $+1
  ld hl,33333
#else
  ld hl,(seed3)
#endif

  inc hl
  inc h
  ld (seed3),hl
  add hl,de
  ret

It has a smaller period, but still far larger than a calc needs. It uses smaller state, and combines a 32-bit xorshift with a basic 16-bit counter (increments by 257).

Just a 32-bit xorshift routine, which is pretty decent on its own:

xs32:
;32-bit xorshift
;seed^=seed<<23
;seed^=seed>>15
;seed^=seed<<17
;min: 209cc (193cc if using SMC)
;max: 239cc (223cc if using SMC)
;avg: 224cc (208cc if using SMC)
;53 bytes (52 bytes if using SMC)
#ifdef SMC
seed1 = $+1
  ld hl,12345
seed2 = $+1
  ld de,6789
#else
  ld hl,(seed1)
  ld de,(seed2)
#endif

;first, XOR it with itself, shifted left 23 bits
;low bit of d needs to be shifted in
  ld a,h
  rra
  ld a,l
  rra
  jr nc,+_
  rl e
  ccf
  rr e
_:
  xor d
  ld d,a

;XOR it with itself, shifted right 15 bits
  ld a,h
  rla
  ld a,e
  rla
  xor l
  ld l,a

  ld a,e
  rla
  ld a,d
  rla
  jr nc,+_
  rr e
  ccf
  rl e
_:
  xor h
  ld h,a

;XOR it with itself, shifted left 17 bits
;HL<<1
  ld (seed1),hl
  add hl,hl
  ld a,h
  xor d
  ld h,a

  ld a,l
  xor e
  ld l,a
  ld (seed2),hl
  ret

« Last Edit: August 28, 2019, 05:27:53 pm by Xeda112358 »

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #80 on: November 28, 2013, 08:45:11 am »

Here is a 32-bit square root routine. I am fairly sure there are optimisations that can be made, especially to the stuff I added last night. I already found a bunch of code that I removed, about 32 bytes, saving 128 t-states.

SqrtHLDE:
;input: HLDE
;Output: BC
;310 bytes
;Average is about 1443 t-states
	push de
	xor a
	ld b,a

	ld e,l
	ld l,h
	ld h,a

	add hl,hl
	add hl,hl
	cp h
	jr nc,$+5
	dec h
	ld a,4

	add hl,hl
	add hl,hl
	ld c,a
	sub h
	jr nc,$+6
	cpl
	ld h,a
	inc c
	inc c

	ld a,c
	add hl,hl
	add hl,hl
	add a,a
	ld c,a
	sub h
	jr nc,$+6
	cpl
	ld h,a
	inc c
	inc c

	ld a,c
	add hl,hl
	add hl,hl
	add a,a
	ld c,a
	sub h
	jr nc,$+6
	cpl
	ld h,a
	inc c
	inc c

	ld a,c
	ld l,e

	add hl,hl
	add hl,hl
	add a,a
	ld c,a
	sub h
	jr nc,$+6
	cpl
	ld h,a
	inc c
	inc c

	ld a,c
	add hl,hl
	add hl,hl
	add a,a
	ld c,a
	sub h
	jr nc,$+6
	cpl
	ld h,a
	inc c
	inc c

	ld a,c
	add a,a \ ld c,a
	add hl,hl
	add hl,hl
	jr nc,$+6
	sub h \ jp $+6
	sub h
	jr nc,$+6
	inc c \ inc c
	cpl
	ld h,a


	ld a,l
	ld l,h
	add a,a
	ld h,a
	adc hl,hl
	adc hl,hl
	sll c \ rl b
	sbc hl,bc
	jr nc,$+3
	add hl,bc
	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

;iteration 9
;now I need to rotate in more bits
	pop de
	sla d \ adc hl,hl \ sla d \ adc hl,hl
	sll c \ rl b
	sbc hl,bc
	jr nc,$+3
	add hl,bc
	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

	sla d \ adc hl,hl \ sla d \ adc hl,hl
	sll c \ rl b
	sbc hl,bc
	jr nc,$+3
	add hl,bc
	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

	sla d \ adc hl,hl \ sla d \ adc hl,hl
	sll c \ rl b
	sbc hl,bc
	jr nc,$+3
	add hl,bc
	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

	sla d \ adc hl,hl \ sla d \ adc hl,hl
	sll c \ rl b
	sbc hl,bc
	jr nc,$+3
	add hl,bc
	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

	sla e \ adc hl,hl \ sla e \ adc hl,hl
	sll c \ rl b
	sbc hl,bc
	jr nc,$+3
	add hl,bc
	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

	sla e \ adc hl,hl \ sla e \ adc hl,hl
	sll c \ rl b
	sbc hl,bc
	jr nc,$+3
	add hl,bc
	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

	sll c \ rl b
	sla e \ adc hl,hl \ sla e \ adc hl,hl
	jr nc,$+9
    	or a
		sbc hl,bc
		inc c
		jp $+13
	sbc hl,bc
	jr nc,$+3
	add hl,bc
	sbc a,a \ add a,a \ inc a \ add a,c \ ld c,a

	sla e \ adc hl,hl \ jr nc,$+4
    	inc c \ ret
    sla e \ adc hl,hl \ jr nc,$+8
    	sbc hl,bc \ or a \ jp $+7
    or a \ sbc hl,bc \ ret c
    scf \ sbc hl,bc \ ret c
    inc c \ ret

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Matrefeytontias

Axe roxxor (kinda)
LV10 31337 u53r (Next: 2000)
Posts: 1982
Rating: +310/-12
Axe roxxor

Re: ASM Optimized routines

« Reply #81 on: April 29, 2014, 06:04:00 am »

If that's of any use to anyone, here is a signed AHL = AHL / DE routine - I think it's pretty optimized (70 bytes) :

sDiv2416:
 xor d
 push af
 xor d
 ld b, a
 jp p, divdPos
 xor a
 sub l
 ld l, a
 ld a, 0
 sbc a, h
 ld h, a
 ld a, 0
 sbc a, b
 ld b, a
divdPos:
 bit 7, d
 jr z, divrPos
 xor a
 sub e
 ld e, a
 ld a, 0
 sbc a, d
 ld d, a
 ld a, b
divrPos:

 push hl
 pop ix
 ld hl, 0
 ld b, 24
divLoop:
 add ix, ix
 rla
 adc hl, hl
 sbc hl, de
 jr nc, $ + 4
 add hl, de
 ; jp nc, xxxx
 ; it works because the carry can not be set and "inc ix" is 2 bytes
 .db $D2
 inc ix
 djnz divLoop

 push ix
 pop hl
 ld b, a
 pop af
 ld a, b
 ret p
 xor a
 sub l
 ld l, a
 ld a, 0
 sbc a, h
 ld h, a
 ld a, 0
 sbc a, b
 ret

« Last Edit: April 29, 2014, 06:07:13 am by Matrefeytontias »

Logged

/\ >< [- |_| _\~ [- /?

All of the TI-z80 screenshots that I will post, unless I specifically mention that it's not the case, will run at 6 MHz. Always. If you find that unbelievable, you should check the last link of my sig

Nerdiness

Projects : [Ndless3.1] nKaruga, [Axe] Worms (yes !), [Ndless 3.1] F-Zero TrackSpire (dropped), [Ndless 3.1] nSpeedX 3D (paused), [Ndless 3.1] nRayC, an easy-to-use raycasting (3D) engine in C for your Nspire ! (paused)

Proud author of : [TI-8x] Jetpack 8x+, [TI-8x+] Gravity Walls, [ASM for Axe] AxeDCS axiom for Axe programmers, [TI-8x+] WiredWorks : a SolidWorks on your Ti-83+/84+ !, [TI-8x+] Super Crate Box, [Axe] IkarugaX, an awesomely challenging danmaku/shoot-them-up for your TI-83+/84+ !, [ASM for Axe] AxeJh3D axiom, a very fast 3D ASM library for Axe coders !

If at a certain moment you think that I'm awesome, you can express it by giving me one more Internet (>^_^)>

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #82 on: March 01, 2015, 10:14:07 pm »

This routine is a fast way to copy a chunk of data. For BC>34, calling this proves to be faster than using LDIR:

fastLDIR:
;copy BC bytes from HL to DE
;Cost:
;    27cc for having to call
;    110cc for setting up the loop, worst case
;    10cc * ceiling(BC/n)        ;n=2^k for some k, see the line below "ldirloop:"
;    16cc * BC
;costs roughly 152-BC*(5-10/n) more than a simple LDIR (worst case)
;for n=4,  BC>=61 saves
;for n=8,  BC>=41 saves
;for n=16, BC>=35 saves   * default, see the "ldirloop" to change
;for n=32, BC>=33 saves
;for n=64, BC>=32 saves
    push hl
    push af
    xor a
    sub c
    and 15               ;change to n-1
    add a,a
    ld hl,ldirloop
    add a,l
    ld l,a
    jr nc,$+3  ;these aren't needed if the ldirloop doesn't cross a 256 byte boundary. Can save 12cc on the above timings and 3 bytes.
    inc h       ;
    pop af
    ex (sp),hl
    ret
ldirloop:
;n=16, (number of LDI instructions, use qty of 4,8,16,32,64)
    ldi
    ldi
    ldi
    ldi
    
    ldi
    ldi
    ldi
    ldi
    
    ldi
    ldi
    ldi
    ldi
    
    ldi
    ldi
    ldi
_ldirloop_end:
    ldi
    jp pe,ldirloop
    ret

This might be useful for things like copying code to RAM (from an App) in speed critical applications, or other data handling tasks.

EDIT 9 Oct 18: Saved 1 byte and 3cc. Originally, I had 'ld a,16 \ sub c \ and 15'. The 'ld a,16' could hold any multiple of n (16 in this example), including 0, so I just used 'xor a'. I also updated all the timing info.

« Last Edit: October 09, 2018, 07:31:40 pm by Xeda112358 »

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #83 on: March 02, 2015, 06:13:50 am »

Not sure if this is useful for any actual applications, but to get a mask for the lowest set bit in C:

    xor a
    sub c
    and c

So if c=%10110100, this would return a=%00000100 (it also always returns nc, and z only when c=0).

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #84 on: April 17, 2015, 05:39:14 pm »

This is a fast 16-bit Lehmer RNG. It is seeded, and has a period of 65536. This code

smc = 1   ;use 1 if the code is in RAM, since it is faster. If it is in an app, use 0 and define a 2-byte RAM location for the seed.

lehmer:
;;Input:
;;  (seed) has the seed value of the RNG
;;Output:
;;  (seed) is updated, HL is the result
;;Destroys:
;;  A,DE,BC
;;Timing:
;;  if seed>0     231cc or 232cc, condition dependent
;;  if seed=0     91cc
;;  if smc=1      subtract 6cc
;;Size: 44 bytes
;;Notes:
;;    Uses the Lehmer RNG used by the Sinclair ZX81
;;    75x mod 65537 -> x
#IF smc == 0
    ld hl,(seed)
#ELSE
seed = $+1
    ld hl,0
#ENDIF
;multiply by 75
    ld c,l
    ld b,h
    xor a
    adc hl,hl \ jr z,special \ ld d,a \ rla
    add hl,hl \ rla
    add hl,hl \ rla \ add hl,bc \ adc a,d
    add hl,hl \ rla
    add hl,hl \ rla \ add hl,bc \ adc a,d
    add hl,hl \ rla \ add hl,bc
;modulo 65537, see note below on how this works
    ld e,a
    sbc hl,de       ;No need to reset the c flag since it is already
    jr nc,$+3
    inc hl
    ld (seed),hl
    ret
special:
;In the case that HL=0, this should be interpreted as 65536 = -1 mod 65537, so return -75 mod 65537 = -74 mod 65536 in HL
    ld hl,-74
    ld (seed),hl
    ret
    
;mod by 2^16 + 1 (a prime)
;current form is A*2^16+HL
;need:
;  (A*2^16+HL) mod (2^16+1)
;add 0 as +1-1
;  (A*(2^16+1-1)+HL) mod (2^16+1)
;distribute
;  (A*(2^16+1)-A+HL) mod (2^16+1)
;A*(2^16+1) mod 2^16+1 = 0, so remove
;  (-A+HL) mod (2^16+1)
;Oh hey, that's easy! :P
;I use this trick everywhere, you should, too.

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #85 on: August 02, 2015, 09:06:26 am »

So earlier on IRC I posted a challenge that I was thinking about before bed: Convert an 8-bit unsigned integer to BCD as fast as you can.

I managed to get my code down to 143.5cc, then jacobly noticed an optimization on mine, and I noticed a bug that had been carried through from the beginning. After all was done, we got it down to 131cc.

The code given is a subroutine (so adding an ret to the end, +1 byte +10cc). Without further ado:

L_To_Dec:
;;Unrolled
;;Converts the 8-bit register L to binary coded decimal
;;Digits stored in LA (A has the lower 2 digits, L the upper).
;;Inputs: L is the 8-bit unsigned int to convert
;;Output: A has the lower 2 digits (in BCD form), L has the upper
;;Destroys: H,F
;;141cc 
;;27 bytes
    ld h,0
    add hl,hl
    add hl,hl
    add hl,hl
    add hl,hl
    ld a,h \ daa  \ rl l
    adc a,a \ daa \ rl l
    adc a,a \ daa \ rl l
    adc a,a \ daa \ rl l
    adc a,a \ daa \ rl l
    ret

EDIT: Found a transcription error. Had to adjust code, size, and timing by +4 bytes, +16cc

« Last Edit: August 03, 2015, 08:55:23 am by Xeda112358 »

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #86 on: August 03, 2015, 10:54:32 am »

Related to the previous post, I offer several alternatives to the bcalls _SetXXOP1 and _SetXXOP2. These routines are for converting 8-bit integers to TI floats. The advantages with my routines are that they are faster, you don't need to truncate to only the bottom 2 digits, and you can store the output to any location instead of just OP1 or OP2.

So first, if you still want the output to be the same:

setXXOP2:
    ld hl,OP2
    jr setXX
setXXOP1:
    ld hl,OP1
setXX:
;;Inputs: A is the unsigned int
;;        HL is where to write the TI float
;;Destroys:All
;;291cc+38b (or 144cc if A=0)
;;average: b=29/255
;;295.3215686cc
;;59 bytes
    ld c,0
    ld (hl),c \ inc hl
    ld (hl),81h
    inc hl \ ld (hl),c
    ld d,h \ ld e,l
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c

    or a \ ret z    ;If A is zero, exit early. +138cc
    ld l,a          ;\
    ld h,c          ; |
    add hl,hl       ; |Start converting A to BCD
    add hl,hl       ; |
    add hl,hl       ; |
    add hl,hl       ; |
    ld a,h \ daa  \ rl l    ; |Finish converting A to BCD
    adc a,a \ daa \ rl l    ; |Number is in LA
    adc a,a \ daa \ rl l    ; |
    adc a,a \ daa \ rl l    ; |
    adc a,a \ daa           ;/ +124cc
    ex de,hl
    ld (hl),a
    and $F0
    ret nz      ;+29cc
    rld         ;\ Rotate up 1 digit
    dec hl      ; |
    ld (hl),80h ; |
    ret         ; /

And if you want to get all 3 digits:

setXXXOP1:
    ld hl,OP1
setXXX:
;;Inputs: A is the unsigned int
;;        HL is where to write the TI float
;;Destroys:All
;;423cc+13a+63b (or 233cc if A=0)
;;average: a=99/255, b=29/255
;;435.2117647cc average
;;64 bytes
    ld bc,$0700
    ld (hl),c
    inc hl
    ld (hl),83h
    ld d,h
    ld e,l
    inc hl \ ld (hl),c \ djnz $-2
    or a \ ret z    ;If A is zero, exit early. +227cc
    ld l,a          ;\
    ld h,c          ; |
    add hl,hl       ; |Start converting A to BCD
    add hl,hl       ; |
    add hl,hl       ; |
    add hl,hl       ; |
    ld a,h \ daa  \ rl l    ; |Finish converting A to BCD
    adc a,a \ daa \ rl l    ; |Number is in LA
    adc a,a \ daa \ rl l    ; |
    adc a,a \ daa \ rl l    ; |
    adc a,a \ daa \ rl l    ;/ +132cc
    ex de,hl
    jr nz,$+6 \ ld e,a \ xor a \ ld (hl),81h    ;+(21+4/85)cc
    inc hl
    ld (hl),e
    inc hl
    ld (hl),a
    ld a,e
    and $F0
    ret nz      ;+48cc
    rld         ;\ Rotate up 1 digit
    dec hl      ; |
    rld         ; |
    dec hl      ; |
    dec (hl)    ; |Decrement exponent
    ret         ; /+63(29/255)cc

And if you want to do that, but maybe a little faster:

setXXX:
;;Inputs: A is the unsigned int
;;        HL is where to write the TI float
;;Destroys:All
;;334cc+13a+63b (or 144cc if A=0)
;;average: a=99/255, b=29/255
;;346.2117647cc average
;;75 bytes
    ld c,0
    ld (hl),c
    inc hl
    ld (hl),83h
    ld d,h
    ld e,l
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    inc hl \ ld (hl),c
    
    or a \ ret z    ;If A is zero, exit early. +227cc
    ld l,a          ;\
    ld h,c          ; |
    add hl,hl       ; |Start converting A to BCD
    add hl,hl       ; |
    add hl,hl       ; |
    add hl,hl       ; |
    ld a,h \ daa  \ rl l    ; |Finish converting A to BCD
    adc a,a \ daa \ rl l    ; |Number is in LA
    adc a,a \ daa \ rl l    ; |
    adc a,a \ daa \ rl l    ; |
    adc a,a \ daa \ rl l    ;/ +132cc
    ex de,hl
    jr nz,$+6 \ ld e,a \ xor a \ ld (hl),81h    ;+(21+4/85)cc
    inc hl
    ld (hl),e
    inc hl
    ld (hl),a
    ld a,e
    and $F0
    ret nz      ;+48cc
    rld         ;\ Rotate up 1 digit
    dec hl      ; |
    rld         ; |
    dec hl      ; |
    dec (hl)    ; |Decrement exponent
    ret         ; /+63(29/255)cc

And if you want to convert signed 8-bit ints:

setXXX_signed:
;;Inputs: A is the signed int
;;        HL is where to write the TI float
    ld c,0
    ld (hl),c
    add a,a
    jr c,$+6
    neg
    ld (hl),80h
    inc hl
    ld (hl),81h
    ld d,h
    ld e,l
    inc hl \ ld (hl),c \ djnz $-2
    or a \ ret z
    ld l,a          ;\
    ld h,c          ; |
    add hl,hl       ; |Start converting A to BCD
    add hl,hl       ; |
    add hl,hl       ; |
    add hl,hl       ; |
    ld a,h \ daa  \ rl l    ; |Finish converting A to BCD
    adc a,a \ daa \ rl l    ; |Number is in cA
    adc a,a \ daa \ rl l    ; |(c is carry)
    adc a,a \ daa           ;/
    ex de,hl
    jr nc,$+15
    ld (hl),82h
    inc hl
    inc hl
    ld (hl),a
    xor a
    rld
    or $10
    dec hl
    ld (hl),a
    ret
    inc hl
    ld (hl),a
    and $F0
    ret nz
    rld         ;\ Rotate up 1 digit
    dec hl      ; |
    ld (hl),80h ; |
    ret         ; /

The slowest routine here has a worst case of 499cc and slowest average case is 436cc.

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #87 on: August 04, 2015, 10:52:59 am »

I decided to see if I could make a faster routine than what Runer made that did the same stuff and I came up with the following code. Mine is 3 bytes larger and on average almost twice as fast (ranging from 120cc to 525cc, and one case of 59cc). It has output in HL and returns a DataType error for inputs that are not non-negative reals, and a Domain error if the input exceeds a 16-bit unsigned integer.

One other advantage that mine has is that you can convert a float pointed to by DE, so you aren't limited to just OP1.

ConvOP1:
;;Output: HL is the 16-bit result.
    ld de,OP1
ConvFloat:
;;Input: DE points to the float.
;;Output: HL is the 16-bit result.
;;Errors: DataType if the float is negative or complex
;;        Domain if the integer exceeds 16 bits.
;;Timings:  Assume no errors were called.
;;  Input is on:
;;  (0,1)         => 59cc                        Average=59
;;  0 or [1,10)   => 120cc or 129cc                     =124.5
;;  [10,100)      => 176cc or 177cc                     =176.5
;;  [100,1000)    => 309cc, 310cc, 318cc, or 319cc.     =314
;;  [1000,10000)  => 376cc to 378cc                     =377
;;  [10000,65536) => 514cc to 516cc, or 523cc to 525cc  =519.5
;;Average case:  496.577178955078125cc
;;vs 959.656982421875cc
;;87 bytes
 
    ld a,(de)
    or a
    jr nz,ErrDataType
    inc de
    ld hl,0
    ld a,(de)
    inc de
    sub 80h
    ret c
    jr z,final
    cp 5
    jp c,enterloop
ErrDomain:
;Throws a domain error.
    bcall(_ErrDomain)
ErrDataType:
;Throws a data type error.
    bcall(_ErrDataType)
loop:
    ld a,b
    ld b,h
    ld c,l
    add hl,hl
    add hl,bc
    add hl,hl
    add hl,hl
    add hl,hl
    add hl,bc
    add hl,hl
    add hl,hl
enterloop:
    ld b,a
    ex de,hl
    ld a,(hl) \ and $F0 \ rra \ ld c,a \ rra \ rra \ sub c \ add a,(hl)
    inc hl
    ex de,hl
    add a,l
    ld l,a
    jr nc,$+3
    inc h
    dec b
    ret z
    djnz loop
    ld b,h
    ld c,l
    xor a
;check overflow in this mul by 10!
    add hl,hl \ adc a,a
    add hl,hl \ adc a,a
    add hl,bc \ adc a,0
    add hl,hl \ adc a,a
    jr nz,ErrDomain
final:
    ld a,(de)
    rrca
    rrca
    rrca
    rrca
    and 15
    add a,l
    ld l,a
    ret nc
    inc h
    ret nz
    jr ErrDomain

If you do not want to throw errors and are comfortable returning garbage results:

ConvOP1:
;;Output: HL is the 16-bit result.
    ld de,OP1
ConvFloat:
;;Input: DE points to the float.
;;Output: HL is the 16-bit result.
;;  Input is on:
;;  (0,1)         => 57cc                        Average=57
;;  0 or [1,10)   => 117cc or 126cc                     =121.5
;;  [10,100)      => 174cc or 175cc                     =174.5
;;  [100,1000)    => 276cc, 277cc, 285cc, or 286cc.     =281
;;  [1000,10000)  => 374cc to 376cc                     =375
;;  [10000,65536) => 481cc to483cc, or 490cc to 492cc  =486.5
;;Average case:  467.88153076171875cc
 
    ld hl,0
    ld a,(de)
    or a
    ret z
    inc de
    ld a,(de)
    inc de
    sub 80h
    ret c
    jr z,final
    cp 5
    jp c,enterloop
    ret
loop:
    ld a,b
    ld b,h
    ld c,l
    add hl,hl
    add hl,bc
    add hl,hl
    add hl,hl
    add hl,hl
    add hl,bc
    add hl,hl
    add hl,hl
enterloop:
    ld b,a
    ex de,hl
    ld a,(hl) \ and $F0 \ rra \ ld c,a \ rra \ rra \ sub c \ add a,(hl)
    inc hl
    ex de,hl
    add a,l
    ld l,a
    jr nc,$+3
    inc h
    dec b
    ret z
    djnz loop
    ld b,h
    ld c,l
    add hl,hl
    add hl,hl
    add hl,bc
    add hl,hl
final:
    ld a,(de)
    rrca
    rrca
    rrca
    rrca
    and 15
    add a,l
    ld l,a
    ret nc
    inc h
    ret

EDIT: I do not have time right now, but I found some really significant speed improvements while I was at work yesterday. Gotta take a kitten to the vet, but later I'll post the update ^~^

Here, this one is a bit faster, a little larger, and has the same outputs (DE,A) as the OS routine, but with modified error handling (no negative or non-real inputs, max input is 65535 instead of 9999).

ConvOP1:
    ld hl,OP1
convFloat:
;;Inputs: HL points to the TI Float.
;;Outputs: The float is converted to a 16-bit integer held in DE, or LSB in A.
;;avg: 436.723938
;;  0-digit : 30cc              (0,1)
;;  1-digit : 92cc or 100cc     0 or [1,10)
;;  2-digit : 134cc             [10,100)
;;  3-digit : 257cc or 265cc    [100,1000)
;;  4-digit : 329cc or 330cc    [1000,10000)
;;  5-digit : 453cc or 454cc or 461cc or 462cc   [10000,65536)
;;95 bytes

    ld a,(hl)
    or a
    jr nz,err_datatype
    ld de,0
    inc hl
    ld a,(hl)
    sub 80h
    ld b,a      ;ugh, gotta keep A as the LSB, but we need to get B to be the exponent
    ld a,e      ;
    ret c
    inc hl
    ld a,(hl)
    jr z,lastdigit2
;    ld a,(hl)   ;\ I feel particularly proud of this code, so feel free to use it ^~^
    and $F0     ; |It converts a byte of BCD to an 8-bit int.
    rra         ; |The catch is, I only have A and C to use, and (hl) is the BCD
    ld c,a      ; |number.
    rra         ; |If you come up with faster, please let me know and post it in
    rra         ; |the optimized routines thread on popular TI forums.
    sub a,c     ; |Algo: Multiply upper digit by -6, add the original byte.
    add a,(hl)  ;/ Result is upper_digit*(-6+16)+lower_digit
    ld e,a
    dec b
    ret z
    dec b
    jr z,lastdigit
    inc hl
    ex de,hl
    ld a,b
    sla l
    add hl,hl
    ld b,h
    ld c,l
    add hl,hl
    add hl,bc
    add hl,hl
    add hl,hl
    add hl,hl
    add hl,bc
    ex de,hl
    ld b,a
    ld a,(hl)   ;\ I feel particularly proud of this code, so feel free to use it ^~^
    and $F0     ; |It converts a byte of BCD to an 8-bit int.
    rra         ; |The catch is, I only have A and C to use, and (hl) is the BCD
    ld c,a      ; |number.
    rra         ; |If you come up with faster, please let me know and post it in
    rra         ; |the optimized routines thread on popular TI forums.
    sub a,c     ; |Algo: Multiply upper digit by -6, add the original byte.
    add a,(hl)  ;/ Result is upper_digit*(-6+16)+lower_digit. Ha, repost.
;    inc hl      ;
    add a,e
    ld e,a
    jr nc,$+3
    inc d
    dec b
    ret z
    dec b
    jr nz,err_domain
lastdigit:
    inc hl
    ld a,(hl)
    ld h,d
    ld l,e
    add hl,hl
    add hl,hl
    add hl,de
    add hl,hl
    jr c,err_domain
    ex de,hl
lastdigit2:
    rrca
    rrca
    rrca
    rrca
    and 15
    add a,e
    ld e,a
    ret nc
    inc d
    ret nz
err_domain:
    bcall(_ErrDomain) 
err_datatype: 
    bcall(_ErrDataType)

« Last Edit: August 07, 2015, 01:17:46 pm by Xeda112358 »

Logged

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

Xeda112358

they/them
Moderator
LV12 Extreme Poster (Next: 5000)
Posts: 4705
Rating: +719/-6
Calc-u-lator, do doo doo do do do.

Re: ASM Optimized routines

« Reply #88 on: October 16, 2017, 10:19:05 pm »

Since I was able to optimize my best 16-bit multiply even further today, I thought I'd share! I even think it can be further optimized. And for that matter, here is my favorite DE_Times_A

mul16 596.34375cc, 92 bytes (incl. DE_Times_A)

mul16:
;Inputs:
;   BC,DE are unsigned integers
;Output:
;   HL:DE is the 32-bit product
;Destroys:
;   A,B,C
;min: 359cc
;max: 717cc
;avg: 596.34375cc
;92 bytes
    ld a,c
    call DE_Times_A
    push hl
    push af
    ld a,b
    call DE_Times_A+2
    pop bc
    pop de
;AHL
; BDE
    ld c,d
    add hl,bc
    adc a,0
;AHLE
    ld d,l
    ld l,h
    ld h,a
;HLDE
    ret
DE_Times_A:
;Input: DE,A
;Output: A:HL is the product, C=0, B,DE unaffected, z flag set if result is zero, c flag set if A is input as 1, else nc.
;A:128~255 219+6{0,10}+{0,19}    avg=258.5   *1/2
;A:64~127  203+5{0,10}+{0,19}    avg=237.5   *1/4
;A:32~63   187+4{0,10}+{0,19}    avg=216.5   *1/8
;A:16~31   171+3{0,10}+{0,19}    avg=195.5   *1/16
;A:8~15    155+2{0,10}+{0,19}    avg=174.5   *1/32
;A:4~7     139+{0,10}+{0,19}     avg=153.5   *1/64
;A:2~3     123+{0,19}            avg=132.5   *1/128
;A:1       107cc                 avg=107     *1/256
;A:0       119cc                 avg=119     *1/256
;overall avg: 237.671875cc
    ld c,0
    ld h,d
    ld l,e
    rla \ jr c,mul_07
    rla \ jr c,mul_06
    rla \ jr c,mul_05
    rla \ jr c,mul_04
    rla \ jr c,mul_03
    rla \ jr c,mul_02
    rla \ jr c,mul_01
    rla
    ret c
    ld h,a
    ld l,a
    ret
mul_07:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_06:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_05:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_04:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_03:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_02:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_01:
    add hl,hl \ rla \ ret nc \ add hl,de \ adc a,c
    ret

DE_Times_A, 237.671875cc, 72 bytes

DE_Times_A:
;Input: DE,A
;Output: A:HL is the product, C=0, B,DE unaffected, z flag set if result is zero, c flag set if A is input as 1, else nc.
;A:128~255 219+6{0,10}+{0,19}    avg=258.5   *1/2
;A:64~127  203+5{0,10}+{0,19}    avg=237.5   *1/4
;A:32~63   187+4{0,10}+{0,19}    avg=216.5   *1/8
;A:16~31   171+3{0,10}+{0,19}    avg=195.5   *1/16
;A:8~15    155+2{0,10}+{0,19}    avg=174.5   *1/32
;A:4~7     139+{0,10}+{0,19}     avg=153.5   *1/64
;A:2~3     123+{0,19}            avg=132.5   *1/128
;A:1       107cc                 avg=107     *1/256
;A:0       119cc                 avg=119     *1/256
;overall avg: 237.671875cc
    ld c,0
    ld h,d
    ld l,e
    rla \ jr c,mul_07
    rla \ jr c,mul_06
    rla \ jr c,mul_05
    rla \ jr c,mul_04
    rla \ jr c,mul_03
    rla \ jr c,mul_02
    rla \ jr c,mul_01
    rla
    ret c
    ld h,a
    ld l,a
    ret
mul_07:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_06:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_05:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_04:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_03:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_02:
    add hl,hl \ rla \ jr nc,$+4 \ add hl,de \ adc a,c
mul_01:
    add hl,hl \ rla \ ret nc \ add hl,de \ adc a,c
    ret

Logged

+1/-0 karm for this message

My pastebin|Pokémon Amber|Grammer Programming Language|BatLib Library|Jade Simulator|Zeda's Hex Opcodes
|FileSyst Library|CopyProg|TPROG|GroupRead|Lbl Read/Write|Z80 Floating Point Routines(z80float on GitHub)| Z80 Optimized Routines Repository

JamesV

LV5 Advanced (Next: 300)
Posts: 276
Rating: +77/-0

Re: ASM Optimized routines

« Reply #89 on: October 16, 2017, 10:50:12 pm »

This is great, Xeda!

I recently took a fairly standard Div16/8 routine and modified it slightly to allow the dividend to be signed. To do this, it checks HL (the dividend) at the start, and if it's negative then it negates it back to positive, performs the division, and then negates the result. Is this an efficient way of handling this, or would you suggest something else?

For what it's worth, it performs well enough for the task I was working on, but I'm just curious as I've not delved much into bit-level multiplication or division before

; divide HL by C (HL is signed, C is not)
; output: HL = quotient, A = remainder
divHLbyCs:
    bit     7,h
    push    af
    jr      z,divHLbyCsStart
    ld a,h \ cpl \ ld h,a
    ld a,l \ cpl \ ld l,a
    inc     hl
divHLbyCsStart:
    xor     a
    ld      b,16
divHLbyCsLoop:
    add     hl,hl
    rla
    cp      c
    jp      c,divHLbyCsNext
    sub     c
    inc     l
divHLbyCsNext:
    djnz    divHLbyCLoop
    ld      b,a
    pop     af
    ld      a,b
    ret     z
    ld a,h \ cpl \ ld h,a
    ld a,l \ cpl \ ld l,a
    inc     hl
    ret

Logged

Print

Pages: 1 ... 4 5 [6] 7 8 Go Up

« previous next »

Server load over the past 5, 10 and 15 minutes respectively: 0.40966796875, 0.4501953125, 0.318359375

Page created in 0.3 seconds with 62 queries.