Author Topic: FPU routines (Read 5793 times)

z80man · « **on:** March 24, 2011, 12:10:49 am »

I've now started work on creating an SH3E library for the SH3. In case you don't know, the SH3E is a version of the SH3 that incorporates an FPU. Later on I'll create libraries for the SH4, which incorporates double-length data along with video acceleration hardware (except when emulated it will be more like video deceleration).

For the libraries I'm making three versions of each function. Each version has three sub-sets: an inline function, a callable function, and an interrupt-triggered function. In the inline version, the registers are labeled Ra, Rb, Rc and so on; the programmer should define these registers as they see fit when using the code. FR0-FR15 are memory locations, along with the FPSCR and the FPUL.
Fast version:
The fastest possible version of each instruction. May destroy some registers during execution so care must be taken when using them. Is more of a floating point library than an fpu emulator as it ignores the system and control registers found in the emulated hardware.

Safe version:
Protects registers from being corrupted, and is only a little bit slower than the fast routines. Closely emulates the actual hardware, but still doesn't incorporate the fpu system and control registers.

True version:
Is an exact copy of the emulated hardware instruction. May be slow in some routines due to the system and control registers that must be managed. Also ensures that none of the general purpose registers are corrupted.

FABS FRn
Floating point absolute value of FRn

Code: (fast) [Select]

;FABS  FRn

;*******************************
;inline
;Ra is the register to be processed
;destroyed registers
;Ra, T bit

SHLL  Ra    ;shifts Ra 1 bit left. MSB (the sign bit) is placed into the T bit
SHLR  Ra    ;shifts Ra 1 bit right. 0 is placed into the MSB, clearing the sign


;*******************************

;callable function
;@(R15) is the data to be processed
;data returned to @(R15)
;registers destroyed 
;R1 and T bit

MOV.L @R15,R1    ;reads the single argument from the top of the stack. only one argument so no point in adjusting R15
SHLL  R1         ;shifts R1 1 bit left. MSB (the sign bit) is placed into the T bit
SHLR  R1         ;shifts R1 1 bit right. 0 is placed into the MSB, clearing the sign
RTS              ;return to code. Delayed branch
MOV.L R1,@R15    ;delay slot: writes the result back to the top of the stack

;*******************************

;interrupt
;jumped to from the interrupt handler
;@(R15) is the data to be processed
;registers destroyed 
;R1 (bank 1)


MOV.L @R15,R1    ;reads the single argument from the top of the stack. only one argument so no point in adjusting R15
SHLL  R1         ;shifts R1 1 bit left. MSB (the sign bit) is placed into the T bit
SHLR  R1         ;shifts R1 1 bit right. 0 is placed into the MSB, clearing the sign
RTE              ;SSR/SPC -> SR/PC. returns to code. Delayed branch
MOV.L R1,@R15    ;delay slot: writes the result back to the top of the stack

Code: (safe) [Select]

;FABS FRn

;*******************************
;inline
;Ra is the register to be processed
;cannot use R1 or R2 as Ra
;destroyed registers
;Ra

MOV.L R1,@-R15   ;pushes R1 onto the stack
MOVT  R1         ;moves the T bit into R1
MOV.L R1,@-R15   ;pushes the saved T bit onto the stack
SHLL  Ra         ;shifts Ra 1 bit left. MSB (the sign bit) is placed into the T bit
SHLR  Ra         ;shifts Ra 1 bit right. 0 is placed into the MSB, clearing the sign
MOV.L @R15+,R1   ;pops the saved T bit into R1
MOV.L R2,@-R15   ;pushes R2 onto the stack
MOV   #0,R2      ;moves 0 into R2
CMP/HI R2,R1     ;T = 1 if R1 > R2 (unsigned), so if the saved T bit was 1 then the T bit equals 1. Sounds silly, I know :P
MOV.L @R15+,R2   ;pops R2
MOV.L @R15+,R1   ;pops R1

;*****************************
;callable
;NOT FINISHED

;*****************************
;interrupt
;jumped to from the interrupt handler
;@(R15+8) is the FR to be processed
;registers destroyed 
;FRn


MOV.L R1,@-R15    ;pushes R1 onto the stack
MOV.L R2,@-R15    ;pushes R2 onto the stack
MOV.L @(8,R15),R2 ;reads the FR address from the stack, above the two saved registers
MOV.L @R2,R1      ;places contents of @R2 into R1
SHLL  R1          ;shifts R1 1 bit left. MSB (the sign bit) is placed into the T bit
SHLR  R1          ;shifts R1 1 bit right. 0 is placed into the MSB, clearing the sign
MOV.L R1,@R2      ;places result in FRn
MOV.L @R15+,R2    ;pops R2
RTE               ;SSR/SPC -> SR/PC. returns to code. Delayed branch
MOV.L @R15+,R1    ;delay slot: pops R1
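
For checking the routines above on a PC, here is a rough C sketch of what FABS has to do to the raw bit pattern, assuming FRn holds an IEEE 754 single-precision value with the sign in bit 31. The function names are made up for illustration and are not part of the library. The second function shows that shifting left and then right by one bit gives the same result as masking off the sign bit.

```c
#include <stdint.h>
#include <string.h>

/* Clear bit 31, assumed to be the IEEE 754 single-precision sign bit. */
static uint32_t fabs_bits(uint32_t bits)
{
    return bits & 0x7FFFFFFFu;
}

/* Same result using a logical shift left, then right, by one bit. */
static uint32_t fabs_bits_shift(uint32_t bits)
{
    return (uint32_t)(bits << 1) >> 1;
}

/* Helpers to move between a float and its raw bit pattern. */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Running a handful of values through both functions and comparing them against the assembly output is a cheap way to catch a routine that leaves the sign bit untouched.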

So this was some of the hardest stack work I've ever done, and it was only for the simplest of the floating point math operations. In fact I still haven't finished the safe version of the callable function or any of the true functions yet. May God save us all when we start working on the true version of the square root and the division functions. Just as a heads-up, here is the C code for the FPU divide instruction. Now just imagine the SH3 asm version of that while preserving all of the registers.

Code: (divide) [Select]

/* Helpers such as data_type_of(), invalid() and inf() are not defined in this post. */
void FDIV(float *FRm, float *FRn) /* FDIV FRm,FRn */
{
	clear_cause_VZ();
	if ((data_type_of(FRm) == sNaN) ||
		(data_type_of(FRn) == sNaN)) invalid(FRn);
	else if ((data_type_of(FRm) == qNaN) ||
		(data_type_of(FRn) == qNaN)) qnan(FRn);
	else switch (data_type_of(FRm)) {
	case NORM:
		switch (data_type_of(FRn)) {
		case PINF:
		case NINF: inf(FRn, sign_of(FRm)^sign_of(FRn)); break;
		default: *FRn = *FRn / *FRm; break;
		}
		break;
	case PZERO:
	case NZERO:
		switch (data_type_of(FRn)) {
		case PZERO:
		case NZERO: invalid(FRn); break;
		case PINF:
		case NINF: inf(FRn, sign_of(FRm)^sign_of(FRn)); break;
		default: dz(FRn, sign_of(FRm)^sign_of(FRn)); break;
		}
		break;
	case PINF:
	case NINF:
		switch (data_type_of(FRn)) {
		case PINF:
		case NINF: invalid(FRn); break;
		default: zero(FRn, sign_of(FRm)^sign_of(FRn)); break;
		}
		break;
	}
	pc += 2;
}
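
To give an idea of what the classification step in that listing might look like, here is a minimal C sketch working on the raw 32-bit pattern, assuming IEEE 754 single-precision layout. The names classify_bits and sign_of_bits, and the DENORM and NAN_ANY cases, are my own for illustration. The sketch deliberately does not split quiet from signaling NaNs, because which fraction bit marks which kind depends on the hardware convention and should be taken from the manual.

```c
#include <stdint.h>

/* Hypothetical classes; NORM, PZERO, NZERO, PINF, NINF mirror the listing,
   DENORM and NAN_ANY are additions for this sketch. */
typedef enum { NORM, PZERO, NZERO, PINF, NINF, DENORM, NAN_ANY } fp_class;

/* Returns 1 for a negative sign bit, 0 otherwise. */
static int sign_of_bits(uint32_t bits)
{
    return (int)(bits >> 31);
}

/* Classify a raw single-precision bit pattern by exponent and fraction. */
static fp_class classify_bits(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t fraction = bits & 0x007FFFFFu;
    int negative = sign_of_bits(bits);

    if (exponent == 0xFFu) {
        if (fraction != 0)
            return NAN_ANY;              /* qNaN/sNaN split not modeled */
        return negative ? NINF : PINF;
    }
    if (exponent == 0) {
        if (fraction != 0)
            return DENORM;               /* handling is hardware specific */
        return negative ? NZERO : PZERO;
    }
    return NORM;
}
```

With something like this in place, the sign of the result in the divide cases is just sign_of_bits(m) ^ sign_of_bits(n), the same XOR the listing uses.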

AngelFish · « **Reply #1 on:** March 24, 2011, 01:52:26 am »

Perhaps it would be a good time to mention that I already have a few of these routines in a math library I've been writing for a while, along with some other useful math functions such as Sin( and arbitrary-precision operations? I can add stack work into the SH3E FPU instructions if you want. I can also send you a copy of all of them when I'm finished, unless you want to do them yourself.

Anyway, some optimizations to your "safe" routine. Since the actual FABS routine uses register Rn, I figured that it'd be best to stick with the actual inputs and outputs, since the T bit doesn't reflect the results.

Code: [Select]


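;Destroys T bit:
SHLL  Rn           ;sign bit is shifted out into the T bit
SHLR  Rn           ;0 is shifted into the MSB

;Keeps it:
MOV.L Rm,@-R15     ;pushes Rm onto the stack
MOVT  Rm           ;saves the T bit in bit 0 of Rm
SHLL  Rn
SHLR  Rn
ROTR  Rm           ;rotates bit 0 of Rm back into the T bit
MOV.L @R15+,Rm     ;pops Rm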

If you want the T bit to reflect whether or not the number was zero like your previous routine did, then...

Code: [Select]

SHLL  Rn
SHLR  Rn
TST   Rn,Rn        ;T = 1 if the result is zero

z80man · « **Reply #2 on:** March 24, 2011, 02:49:59 am »

What exactly do you mean by FABS using register Rn? According to the documentation it uses FRn. Because no FRn exists, I substituted a memory location where it was necessary to emulate the actual SH3E instruction; otherwise I would just use a general purpose register. What I was wondering about the FPU interpreter idea you had a few days ago is whether you intended to have an SH3E or SH4 emulator run on the calc, or just to add floating point libraries. What I found most effective was to use the fast routines in an interrupt state, because that does little damage to the registers and is easy to use thanks to the RTE instruction and the second register bank. One thing I do want to optimize is determining which register is to be operated upon. So far my best choice has been to just use the stack, but I'm trying to find something faster for when the code is not inline.

Also I don't know if this already exists, but it might be a good idea to establish some "good programming suggestions" for coding on the SH3 or Prizm. Such as making all code location independent, R15 as stack pointer, R0 as an offset, maybe R7 bank 1 as an exception stack pointer, certain registers to be used in for loops, ones to be heavily worked upon by arithmetic instructions, ones to be pointers, ones to hold data, and so on.

Lastly, I was just working on this routine to load the PC into R0. Can you tell me if it works? It seems important for location independent code.

Code: [Select]

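Start:
BSR PCtoR0       ;PR = address of the instruction after the delay slot
NOP              ;delay slot
blah...

PCtoR0:
RTS              ;return to code. Delayed branch
STS PR,R0        ;delay slot: copies PR (the return address) into R0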

AngelFish · « **Reply #3 on:** March 24, 2011, 03:04:36 am »

Spelling fail on my part about the FRn thing...

It's easily fixed, though:

Code: [Select]

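MOV.L R1,@-R15     ;pushes R1
MOV.L R2,@-R15     ;pushes R2
MOV.L Rm,@-R15     ;pushes Rm
MOVT  Rm           ;saves the T bit in Rm
MOV.L @R2,R1       ;loads FRn into R1 (assumes R2 already holds the address of FRn)
SHLL  R1
SHLR  R1
MOV.L R1,@R2       ;stores the result back to FRn
ROTR  Rm           ;restores the T bit from Rm
MOV.L @R15+,Rm     ;pops Rm
MOV.L @R15+,R2     ;pops R2
MOV.L @R15+,R1     ;pops R1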

Not an optimization over your own code, if taken directly, but it very well could be if the SH3E is interpreted as it necessarily would be, since the proper FRn registers would presumably be loaded automatically in order to handle the floating point instruction. Also, the above way is slightly more readable to me.

Anyway, my hesitation with setting programming standards is that they limit people and I'd break half of them within the better part of an hour...

Also, I see no reason why that code wouldn't work. The value in PC is changed immediately so long as the loop condition is valid in order to keep the pipeline flowing.

z80man · « **Reply #4 on:** March 24, 2011, 03:52:01 am »

With the whole talk of the asm shell for the Prizm going around recently, I had an idea to incorporate multiple run time libraries at a time. In the header a program would define which libraries it wanted to use and those would be loaded into memory before execution began. Perhaps some of the libraries would be attached to the shell itself, but I also liked the feature of allowing users to create their own run time libraries too. Now you can accomplish this without setting suggested standards, but it makes things harder for the coder if each library uses different registers for different purposes. In most other asm languages there are usually some guidelines that are recommended to be followed, but they still allow flexibility.

Just as an example say there are two different math libraries you are using for floats and the purpose of the program is to find the area of a circle. The first function takes R4 (radius) as an argument and squares it. The next function takes R1 as an argument and multiplies it by pi. The issue here is that not only are your registers becoming unorganized, but you also have to use precious clock cycles transferring data from R4 to R1. But if both routines took R1 as arguments then your code would be faster.

AngelFish · « **Reply #5 on:** March 24, 2011, 04:29:58 am »

Well, that's a good point. However, instead of loading the libraries fully into memory, don't try to reinvent the hardware cache. The most data you should be loading are the library entry points. Trying to load any specific routines is likely to be a waste of time if they aren't used.
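
One way to picture "only load the entry points" is a small table of function pointers that the shell fills in, while the routines themselves stay wherever they already live. This is only a sketch: the names lib_fn, entry_points, lib_lookup and the two stand-in routines are hypothetical, and nothing like this exists in the thread's code.

```c
#include <stddef.h>
#include <stdint.h>

/* A library routine that takes and returns a raw float bit pattern. */
typedef uint32_t (*lib_fn)(uint32_t);

/* Two stand-in routines, only here so the table has something to point at. */
static uint32_t lib_fabs(uint32_t bits) { return bits & 0x7FFFFFFFu; }
static uint32_t lib_fneg(uint32_t bits) { return bits ^ 0x80000000u; }

/* Hypothetical routine identifiers a program header could request. */
enum { LIB_FABS, LIB_FNEG, LIB_COUNT };

/* Only these pointers are "loaded"; the routine bodies are not copied. */
static const lib_fn entry_points[LIB_COUNT] = { lib_fabs, lib_fneg };

/* Look up an entry point by identifier; returns NULL for unknown ids. */
static lib_fn lib_lookup(unsigned id)
{
    return id < LIB_COUNT ? entry_points[id] : NULL;
}
```

A program would call lib_lookup once per routine it needs and keep the returned pointer, so the cost of a library is one table slot per routine rather than a full copy of its code.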

JosJuice · « **Reply #6 on:** March 24, 2011, 01:56:14 pm »

Quote from: z80man on March 24, 2011, 03:52:01 am

With the whole talk of the asm shell for the Prizm going around recently, I had an idea to incorporate multiple run time libraries at a time. In the header a program would define which libraries it wanted to use and those would be loaded into memory before execution began. Perhaps some of the libraries would be attached to the shell itself, but I also liked the feature of allowing users to create their own run time libraries too.

I've been having that idea too

* JosJuice blames himself for secretly stealing others' ideas without knowing it

fb39ca4 · « **Reply #7 on:** March 24, 2011, 08:24:37 pm »

It has an FPU?
This is great! It will make writing 3d engines much faster.

AngelFish · « **Reply #8 on:** March 24, 2011, 08:27:14 pm »

Sorry, it doesn't have an FPU. We're planning on writing a low level interpreter that will perform the same functions as the FPU so that programs can be written as if the FPU existed.
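
As a very rough illustration of what such an interpreter could look like, here is a C sketch that keeps FR0-FR15, FPUL and FPSCR in memory and handles two single-operand instructions. The type and function names are invented for this example. The opcode patterns 0xFn5D (FABS FRn) and 0xFn4D (FNEG FRn) are my reading of the SH floating-point encodings and should be checked against the programming manual before relying on them. FPSCR handling is left out entirely.

```c
#include <stdint.h>

/* Hypothetical emulated FPU state kept in ordinary memory. */
typedef struct {
    uint32_t fr[16];   /* FR0-FR15 as raw single-precision bit patterns */
    uint32_t fpul;
    uint32_t fpscr;
} fpu_state;

/* Try to execute one 16-bit opcode against the emulated state.
   Returns 1 if the opcode was handled, 0 if it is not one we know. */
static int fpu_step(fpu_state *fpu, uint16_t opcode)
{
    unsigned n = (opcode >> 8) & 0xFu;     /* register field, bits 11-8 */

    if ((opcode & 0xF0FFu) == 0xF05Du) {   /* FABS FRn (assumed encoding) */
        fpu->fr[n] &= 0x7FFFFFFFu;
        return 1;
    }
    if ((opcode & 0xF0FFu) == 0xF04Du) {   /* FNEG FRn (assumed encoding) */
        fpu->fr[n] ^= 0x80000000u;
        return 1;
    }
    return 0;
}
```

An exception handler for unknown opcodes could pass the faulting instruction word to a function like this and then skip past it, which is roughly the "as if the FPU existed" behavior described above.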
