This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.
Messages - fghsgh
Pages: [1]
1
« on: December 23, 2020, 02:32:23 pm »
Okay so first, Nover 3 is not compatible with the CX II (yet), so do not expect it to work at all. I only know this because I caught some nspire devs talking about it somewhere and might be misremembering but given that it's not working for you it's probably correct. It's also not too much of a priority because all nspire games until now were written to run at the CX I's 180 MHz and the CX II runs at over double that. You should definitely not need to overclock to run e.g. gpsp.
Even on newer CX I models (like my rev AB), the auto overclock doesn't work (it actually got stuck in a boot loop that required a filesystem reformat to fix). In any case what I ended up doing was to just save the highest stable overclock values and restore them manually on a reboot (these settings were 204 MHz, 204 MHz, and 68 MHz). Then just disable hibernation (regular sleep doesn't use that much battery power anyway) and you will basically never have to worry about it unless your battery runs out.
Actually, something else that is to be noted is that the default AHB frequency was waaay to high (94 MHz, with >66 MHz being considered dangerous according to Nover 3's readme). So overclocking-related things have definitely been changing before they even started working on the CX II. It is a wonder it even ran on my rev AB (in ndless's compatibility mode).
2
« on: March 23, 2019, 12:51:01 pm »
Okay, you win.
3
« on: March 20, 2019, 07:46:36 am »
Actually, HL' is 0 from the 8th iteration onward, so no need to ADD HL,HL. So I'd say: first 8 iterations ADD HL,HL \ RL E \ RL D then one EX DE,HL then 4 iterations ADD HL,HL then 4 iterations SLA H or maybe even split up the first 8 iterations into ADD HL,HL \ RL E \ RL D one EX DE,HL and SLA D \ ADC HL,HL The remainder is no more than 2sqrt(x), so in this case at most 0x1FFFE, but on the 15th iteration it is at most 0xFFFE
It must've been a bug then.
4
« on: March 19, 2019, 05:07:08 pm »
One thing though: In you routine, A is 0 until the last iteration, so you can make the last iteration a special case and speed up the inner loop.
You mean C? A can get quite big, actually. While debugging, I've seen it go up to 7. That might have been because of a bug though (which is now fixed). I'd rather go to sleep now, so it'll be for tomorrow. EDIT: I tried, and it didn't pass the first of my 1000 tests.
5
« on: March 19, 2019, 04:56:25 pm »
The routine using shadow registers is done. The previous one probably wasn't working correctly anyway. I've tested this one with 1000 different inputs, so I'm quite confident this time. This uses an odd number of EXXs, so the BC input ends up in BC'.
; input: DEHL = number to calculate sqrt of ; output: DE = sqrt ; destroys: ; BC, DE', HL' = 0 ; AHL = remainder (how much lower than DEHL the closest perfect square was) ; BC' = previous BC ; enables interrupts ; time: 3087 to 3375 (.51 ms to .56 ms at 6 MHz) ; size: 56 bytes
sqrt32: di ; 4 exx ; 4 xor a ; 4 ld bc,$1000 ; 10 ld d,a ; 4 ld e,a ; 4 ld h,a ; 4 ld l,a ; 4 sqrt32loop: exx ; 4 add hl,hl ; 11 rl e ; 8 rl d ; 8 exx ; 4 adc hl,hl ; 15 rla ; 4 exx ; 4 add hl,hl ; 11 rl e ; 8 rl d ; 8 exx ; 4 adc hl,hl ; 15 rla ; 4
; registers here: ; AHL: remainder ; DEHL': input ; CDE: output * 2 ; B: djnz
ex de,hl ; 4 add hl,hl ; 11 rl c ; 8 inc l ; 4 ex de,hl ; 4
sbc hl,de ; 15 sbc a,c ; 4 jr nc,nottoobig; 12/7 add hl,de ; 11 adc a,c ; 4 dec e ; 4 dec e ; 4 nottoobig: inc e ; 4 djnz sqrt32loop; 13/8 rr c ; 8 rr d ; 8 rr e ; 8 ei ; 4 ret ; 10
6
« on: March 18, 2019, 02:56:10 pm »
True (except not using djnz since that would use B !), but you would have to call that routine which uses a stack anyways I imagine you mean a very limited stack as opposed to no stack whatsoever?
About not using B: I thought that was evident so I left it out. I meant copy & pasting the routine into the program, where it is needed: ; code code code ; set arguments for divisionn ; insert division routine here ; code code code
... or I would have to put page 0 back in before calling. It's not that I often need division, especially not in a stackless environment, but it's handy to have either way.
7
« on: March 18, 2019, 01:50:58 pm »
Good luck! The pseudocode outlined here has yielded me better results than algorithm I learned as a kid. It's the same algorithm, just structured better for programming instead of computing by hand.
Well, I hadn't seen this, so here is a routine using the normal algorithm. I wasn't able to fit everything into the registers so it also uses one stack frame. It even uses IY, so don't run this with IM 1 enabled and restore it afterwards. It doesn't use shadow registers though, which means you can still use IM 2. This algorithm could also be better because it only uses left shifts, which are quicker for 16-bit numbers. ; calculate sqrt of 32-bit number ; input: IXIY = the number to calculate sqrt of ; output: DE = square root of IXIY, rounded down ; B = 0 ; if CHL = 0, IXIY was a perfect square ; IXIY = 0
sqrt32: ld bc,$1000 ld d,c ld e,c ld h,c ld l,c sqrt32loop: ; IXIY << 2; carry into CHL add iy,iy ; adc ix,ix doesn't exist ld a,ixl rla ld ixl,a ld a,ixh rla ld ixh,a adc hl,hl rl c ; second time now add iy,iy ld a,ixl rla ld ixl,a ld a,ixh rla ld ixh,a adc hl,hl rl c
; get DE * 4 + 1 and store into AHL push hl xor a ld h,d ld l,e add hl,hl rla add hl,hl rla inc l
; if DE*4+1<=remainder -> add 1 to answer sub c jr c,nottoobigA ; A < C jr nz,toobigA ; A > C ex de,hl ; A = C ex (sp),hl sbc hl,de jr c,toobigDE ; HL < DE add a,c ; HL >= DE ex de,hl pop hl add hl,hl inc l ex de,hl djnz sqrt32loop ret
nottoobigA: pop hl ex de,hl add hl,hl inc l ex de,hl djnz sqrt32loop ret
toobigA: add a,c pop hl ex de,hl add hl,hl ex de,hl djnz sqrt32loop ret
toobigDE: add a,c add hl,de ex de,hl pop hl add hl,hl ex de,hl djnz sqrt32loop ret
A little explanation of what the registers mean: B is a DJNZ counter for 32 / 2 = 16 (2 bits in IXIY are only one bit in DE) CHL is the remainder A(SP) is DE * 4 + 1 In between, the registers are swapped around a bit to be able to perform arithmetic on them. Time: 4213 to 5247 cycles (.70 to .87 ms at 6 MHz), which is actually not that bad (although it seems that a 16-bit sqrt is faster than a 16-bit division, so maybe this should also be possible) Size: 78 bytes I'm curious how you'll optimize this one. As for the other algorithm you sent, I will try that one later. EDIT: I'm currently remaking this one but with shadow registers instead of IXIY and it seems promising so far.
8
« on: March 17, 2019, 10:06:07 pm »
writing a general-purpose routine was a daunting idea
My trick: pen & paper & sleep I use SPASM-ng and that has been great for me.
This is not the first time someone has recommended SPASM to me. It is, however, the first time someone provided me with a Github link and I thought it was only available for Windows until now. It would also mean that I have to transform all my previous programs (or at least include files) to this syntax. As for your optimizations, I could follow along until you got those subroutines in. (EDIT: I think I got it now) And that might also be a problem if you have limited stack space or if you can't use the stack at all (in which case the routine would be inline). I tend to put RAM page 02 into bank C and completely fill it up with data, not leaving any place for a stack.
Hmm, I'll keep this in mind if I work more on these.
I don't think that's too hard in this case: just djnz 8 times I will try to make the sqrt routine asap, if I don't forget. It'll probably be for next weekend.
9
« on: March 17, 2019, 08:02:08 pm »
I wonder, if you can make my routine twice as fast and if you've written complete support for floats, why didn't you write this routine in the first place? Good job finding that BC > 8000 bug btw. Also, I was having trouble with getting INC/DEC IXL to work for some reason. Haven't had the time to investigate that though. It could be that I just typed in the wrong opcode because I have a bad assembler which doesn't support it apparently and it's too much of a hassle to get a new one on Linux. As for your optimizations, I could follow along until you got those subroutines in. (EDIT: I think I got it now) And that might also be a problem if you have limited stack space or if you can't use the stack at all (in which case the routine would be inline). I tend to put RAM page 02 into bank C and completely fill it up with data, not leaving any place for a stack. Another, easy way of doing this would be by thinking of the 32 and 16 bit numbers as 4&2-digit base-256 numbers and using a routine for smaller numbers to do the actual division. And yes, that's a good place for it. I have been working on-and-off on that page for a few months to revamp it. If it is locked due to a pending draft, let me know and I'll move the draft to another page altogether.
I think it would be better if you put it there.
10
« on: March 17, 2019, 12:35:00 pm »
a few quick optimizations I see are changing those Inc/dec ix to use ixl instead
Sure! Didn't see that one. The problem there is also that those are undocumented and don't run on some emulators. you don't need the `or a` before the sbc
Yes, that's also right. Wasn't too smart of me. But hey, at least I wrote a 32 div 16 routine, okay? Do you prefer speed optimizations or size optimizations, btw?
Usually, I prefer speed, but it shouldn't be to big either. When the program is done and it runs too slowly or is too big, I can optimize as I need. The resulting routine would look like this: div32_16: ld de,0 ; 10 ld a,32 ; 7 div32_16loop: add ix,ix ; 15 adc hl,hl ; 15 ex de,hl ; 4 adc hl,hl ; 15 sbc hl,bc ; 15 inc ixl ; 8 jr nc,cansub ; 12/7 add hl,bc ; 11 dec ixl ; 8 cansub: ex de,hl ; 4 dec a ; 4 jr nz,div32_16loop ; 12/7 ret ; 10
This takes between 3838 and 3390 T-States, which is .64 to .57 ms. I also though that maybe we could invert BC so then we can use ADD HL,BC instead of SBC, which is 4 T-States faster, but this has the problem that we would need SBC further on. This would bring down the best case scenario time, but up the worst case. About the square root routine: is it still needed? I already know how the sqrt algorithm works. I just have to implement it, but I don't have the time now. EDIT: maybe we could put this on http://z80-heaven.wikidot.com/math, where many other math routines already reside?
11
« on: March 17, 2019, 09:53:03 am »
I know this is old (and I'm sorry for sending you all a notification), but I happened to need a 32 div 16 routine today, so I made one: This divides HLIX by BC The result is stored in HLIX, the remainder in DE BC is unmodified A is 0 it doesn't use any other registers or RAM
div32_16: ld de,0 ; 10 ld a,32 ; 7 div32_16loop: add ix,ix ; 15 adc hl,hl ; 15 ex de,hl ; 4 adc hl,hl ; 15 or a ; 4 sbc hl,bc ; 15 inc ix ; 10 jr nc,cansub ; 12/7 add hl,bc ; 11 dec ix ; 10 cansub: ex de,hl ; 4 dec a ; 4 jr nz,div32_16loop ; 12/7 ret ; 10
I'm not saying this is the fastest possible, but at least it does the job. This takes 4094 T-States in the worst case, 3582 in the best case, which is between .6 and .7 ms at 6 MHz. I could add some comments to this, but it's more difficult to explain what is going on than to just program it. Also, note that this doesn't check if BC is 0 so you will get junk as an answer in that case. I might also write a 32-bit square root routine if it is still needed.
EDIT: an optimized version can be found further in this topic.
Pages: [1]
|