Alright, I have a conversion from C to asm. It is a get-pixel routine that is as fast as possible and cannot be optimized any further, unless gcc automatically handles the zero extension, which I'll have to check.
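To illustrate the zero-extension question, here is a minimal C sketch of what happens when the same 16-bit word is widened as signed versus unsigned. The helper names are hypothetical and not from the thread; the point is only that a pixel value with the top bit set, such as 0xF800, comes back negative when it is read through a signed 16-bit type.

```c
#include <stdint.h>

/* Hypothetical helpers, not from the thread. Both read the same
   16-bit word from memory and widen it to 32 bits. */

/* Read as signed: the top bit of the word is copied into the upper
   16 bits (sign extension), which is what a plain 16-bit load into
   a signed short amounts to. */
int32_t load_signed(const uint16_t *p)
{
    return *(const int16_t *)p;
}

/* Read as unsigned: the upper 16 bits are cleared (zero extension),
   which is the effect the extra EXTU.W step is after. */
uint32_t load_unsigned(const uint16_t *p)
{
    return *p;
}
```

With a word holding 0xF800, load_signed gives -2048 and load_unsigned gives 63488. Whether gcc emits the zero extension on its own presumably depends on the declared return type (short versus unsigned short), so that is worth checking in the generated code.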
Quote from: z80man on May 26, 2011, 10:36:58 am
Alright, I have a conversion from C to asm. It is a get-pixel routine that is as fast as possible and cannot be optimized any further, unless gcc automatically handles the zero extension, which I'll have to check.

Is it just me or does this routine not handle multiplying by 2 to index the array?
short GetPixel(short x, short y){
    MOV.W (width),R2    ;get screen width of 384 * 2
    MOV.L (VRAM),R3     ;get VRAM buffer location, usually 0xA8000000, but uses pre-initialized global variable just in case.
    MULU.W R2,R5        ;unsigned 16 bit multiplication
    SHLL R4             ;single left bit shift which multiplies R4 by 2. Also used to fill slot before the MAC load
    STS MACL,R2         ;load result of multiplication into R2
    ADD R3,R2           ;add VRAM base address to resulting y value
    ADD R4,R2           ;add modified x value to the already added y and VRAM base
    MOV.W @R2,R0        ;Load word from what's at R2 into R0. Sign extension
    RTS                 ;delayed branch and return. Note, R0 not touched because the result of the previous instruction is still being loaded
    EXTU.W R0           ;safe to touch R0 now. Get rid of that pesky sign extension. May remove later if gcc handles this step on its own
    align.4
width: word 384 * 2
VRAM: long VRAM_address ;global variable that was initialized earlier to correct address.
}
short GetPixel(short x, short y){
    MOV.W (width),R2    ;get screen width of 384 * 2
    MOV.L (VRAM),R0     ;get VRAM buffer location, usually 0xA8000000, but uses pre-initialized global variable just in case.
    MULU.W R2,R5        ;unsigned 16 bit multiplication
    SHLL R4             ;single left bit shift which multiplies R4 by 2. Also used to fill slot before the MAC load
    STS MACL,R2         ;load result of multiplication into R2
    ADD R4,R2           ;add modified x value to the scaled y offset
    MOV.W @(R0,R2),R0   ;Load word at R2 offset from the VRAM base into R0. Sign extension
    RTS                 ;delayed branch and return. Note, R0 not touched because the result of the previous instruction is still being loaded
    EXTU.W R0           ;safe to touch R0 now. Get rid of that pesky sign extension. May remove later if gcc handles this step on its own
    align.4
width: word 384 * 2
VRAM: long VRAM_address ;global variable that was initialized earlier to correct address.
}
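For reference, the address arithmetic in the assembly above can be written out in plain C. This is only a sketch for comparison: the function name and the choice to pass the buffer as a parameter are assumptions made here so the example is self-contained, whereas the thread keeps the VRAM address in a pre-initialized global.

```c
#include <stdint.h>

#define SCREEN_WIDTH 384u   /* pixels per row, as stated in the thread */

/* Hypothetical C equivalent of the routine. vram is passed in
   explicitly (an assumption for this sketch). */
uint16_t get_pixel(const uint16_t *vram, unsigned x, unsigned y)
{
    /* Byte offset as the assembly computes it:
       y * (384 * 2) from the multiply, plus x * 2 from the shift. */
    uint32_t byte_offset = y * (SCREEN_WIDTH * 2u) + (x << 1);

    /* Indexing a 16-bit array counts in words, so halve the byte
       offset. Returning an unsigned type avoids sign extension. */
    return vram[byte_offset / 2u];
}
```

Comparing what gcc generates for something like this against the hand-written version is one way to check whether the explicit EXTU.W is still needed.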
Nice job catching that, calc84. Sometimes I forget possible optimizations when I don't have the full instruction set in front of me.

Btw, on the instructions that you replaced, I wanted to know if you're aware of the hidden slowdown when you try to access a register right after the previous instruction loaded data from memory into it. The code is fine, but I was just wondering whether it was by luck that you had it arranged that way or whether you knew the whole time, because you seem to be very skilled with SuperH asm.
GetPixel:
    ldr r2,=VRAM            @buffer = VRAM;
    add r3,r1,r1,lsl #1     @temp = y*3;
    add r3,r0,r3,lsl #7     @temp = x+temp*128;
    add r3,r2,r3,lsl #1     @temp = buffer+temp*2;
    ldrh r0,[r3]            @return *temp;
    bx lr
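The comments in the ARM version describe building the offset from shifts and adds instead of a multiply, using 384 = 3 * 128. A small C sketch of that decomposition, next to the straightforward multiply, shows the two agree. Both function names are made up for illustration.

```c
#include <stdint.h>

/* Straightforward form: byte offset of pixel (x, y) on a
   384-pixel-wide screen with 2 bytes per pixel. */
uint32_t offset_mul(uint32_t x, uint32_t y)
{
    return (y * 384u + x) * 2u;
}

/* Shift-and-add form following the steps in the ARM comments. */
uint32_t offset_shift(uint32_t x, uint32_t y)
{
    uint32_t temp = y + (y << 1);   /* temp = y*3               */
    temp = x + (temp << 7);         /* temp = x + temp*128      */
    return temp << 1;               /* byte offset = temp*2     */
}
```

For example, pixel (1, 1) lands at byte offset (384 + 1) * 2 = 770 in both forms.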
stc sr, r8
mov.l 1f, r9
or r9, r8
ldc r8, sr
    xor r7,r7
Start:
    add r7,r4
    add #0xE0,R4
    mov #0x3A,r6
    cmp/hs r4,r6
    bf/s Start
    mov #0xFF,r7
    shll2 r4
    shll2 r4
/* Loop is unrolled here for speed, but you could easily loop back to the beginning for this after storing r4 to a safe register. */
    xor r7,r7
Start2:
    add r7,r5
    add #0xE0,R5
    mov #0x3A,r6
    cmp/hs r5,r6
    bf/s Start2
    mov #0xFF,r7
    add r5,r4
    rts
    mov r4,r1
No, it switches the register set that's currently swapped in. The processor has two register sets in privileged mode: regular and banked (which map to r8-r15). GCC doesn't use the banked registers, so they're free for the ASM programmer to mess with. I doubt the OS uses them much either, with the probable exception of error handling. Needless to say, I love abusing the banked registers.
int GetKeyNB() {
    int key;
    key = PRGM_GetKey();
    if (key == KEY_PRGM_MENU)
        GetKey(&key);
    return key;
}

void GetKeyHold(int key) {
    while(GetKeyNB() != key) { }
    while(GetKeyNB() != KEY_PRGM_NONE) { }
}

void GetKeyWaitNone() {
    while(GetKeyNB() != KEY_PRGM_NONE) { }
}

unsigned short RandomShort() {
    unsigned short retshort = 0;
    int*cur_stack = 0;
    int cur_stackv = 0;
    for(int i = 0; i < 16; i++) {
        retshort = retshort << 1;
        cur_stack = GetStackPtr();
        cur_stackv = *(RTC_GetTicks() + cur_stack);
        retshort = retshort | (cur_stackv % 0xFFFFFFFF);
    }
    return retshort;
}

unsigned int RandomInt() {
    unsigned int retint = 0;
    int*cur_stack = 0;
    int cur_stackv = 0;
    for(int i = 0; i < 32; i++) {
        retint = retint << 1;
        cur_stack = GetStackPtr();
        cur_stackv = *(RTC_GetTicks() + cur_stack);
        retint = retint | (cur_stackv % 0xFFFFFFFF);
    }
    return retint;
}

unsigned char RandomChar() {
    unsigned char retchar = 0;
    int*cur_stack = 0;
    int cur_stackv = 0;
    for(int i = 0; i < 8; i++) {
        retchar = retchar << 1;
        cur_stack = GetStackPtr();
        cur_stackv = *(RTC_GetTicks() + cur_stack);
        retchar = retchar | (cur_stackv % 0xFFFFFFFF);
    }
    return retchar;
}
Simon's random routine used a static seed that contributed to the result, which increased randomness.
unsigned short random(int extra_seed) {
    int seed = 0;
    int seed2 = 0;
    for(int i = 0; i < 32; i++ ){
        seed <<= 1;
        seed |= (RTC_GetTicks()%2);
    }
    for(int i = 0; i < 32; i++ ){
        seed2 <<= 1;
        seed2 |= (RTC_GetTicks()%16);
    }
    seed ^= seed2;
    seed = (( 0x41C64E6D*seed ) + 0x3039);
    seed2 = (( 0x1045924A*seed2 ) + 0x5023);
    extra_seed = (extra_seed)?(( 0xF201B35C*extra_seed ) + 0xD018):(extra_seed);
    seed ^= seed2;
    if(extra_seed){
        seed ^= extra_seed;
    }
    return ((seed >> 16) ^ seed) >> 16;
}
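The routine above rebuilds its seed from RTC ticks on every call. For comparison, here is a minimal sketch of the static-seed idea mentioned earlier, where the state persists between calls and each result depends on everything that came before. This is not Simon's routine, which isn't shown in this thread; the names seed_random and random_static are hypothetical, and the multiplier and increment are simply borrowed from the first step of the routine above.

```c
#include <stdint.h>

/* Generator state. Being static, it survives between calls, so every
   call advances the same sequence instead of starting over. */
static uint32_t lcg_state = 1;

/* Hypothetical helper: set the starting point, e.g. once at startup
   from a tick counter. */
void seed_random(uint32_t seed)
{
    lcg_state = seed;
}

/* One linear congruential step. Unsigned arithmetic wraps modulo
   2^32, so the multiplication cannot overflow in the signed sense. */
uint16_t random_static(void)
{
    lcg_state = lcg_state * 0x41C64E6Du + 0x3039u;
    /* The upper bits of the state are the better-mixed ones. */
    return (uint16_t)(lcg_state >> 16);
}
```

Seeding once and then calling random_static() repeatedly gives a different value on each call even if the clock has not ticked in between, which is the weak spot of reading the RTC inside a tight loop.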
for(int i = 0; i < 3600; i++) {
    (*(map+i)).natural_cell = random(random(RTC_GetTicks()))%6;
    OS_InnerWait_ms(random(0)%64);
}