ZFinx
Specification v0.41 DEPRECATED - SEE THIS DOCUMENT INSTEAD
- 1. Spec update history
- 2. Overview
- 3. Configurations
- 4. Semantic Differences
- 5. Discovery
- 6. mstatus.fs
- 7. Register pair handling for XLEN < FLEN
- 8. x0 register target
- 9. NaN-boxing
- 10. Assembly Syntax and Code Porting
- 11. Modifications from extensions to the
ZFinx
versions - 12. Modifications from
V
toZ[FD]inxV
- 13. GDB
- 14. ABI
- 15. Floating Point Configurations To Reduce Area
- 16. Rationale, why implement
ZFinx
?
version |
change |
0.41 |
change misa.F to be hardwired to zero, note that we are waiting for a standard CSR discovery mechanism |
Every current, and future extension which uses F
registers will also have a ZFinx
version which is incompatible with the extension. The ZFinx
version reuses the integer register file (i.e. X
registers) for all scalar floating point instructions. It gives an area saving, and also faster context switching as there are fewer registers to save and restore.
ZFinx
is the version of the F
extension which uses the X
registers, hence the name F-in-x
, and it also used as the general term for the -inx
version of floating point extensions.
If any ZFinx
extension is implemented, then all floating point extensions must use their respective Zfinx
version, i.e. the F
registers must not be referred to by any extension.
The ZFinx
version of an extension additionally removes all:
-
floating point load instructions (e.g.
FLW
) -
floating point store instructions (e.g.
FSW
) -
integer to/from floating point register move instructions (e.g.
FMV.X.W
)
In all cases the integer versions of these are required instead.
The ZFinx
version also changes the assembler syntax of floating point instructions changes so that they only refer to X
registers. Therefore on an RV32F
core, this is legal syntax:
FLW fa4, 12(sp) //load floating point data
FMADD.S fa1, fa2, fa3, fa4 //floating point arithmetic for RV32F
On a RV32_ZFinx
core, this syntax must be used as the F
registers are not implemented, and FLW
is not a supported instruction:
LW a4, 12(sp) //load integer data
FMADD.S a1, a2, a3, a4 //floating point arithmetic for RV32F ZFinx
Note that only the assembler syntax differs between the two FMADD.S
instructions, the encoding is the same.
The assembler syntax changes to avoid code-porting bugs, so that the registers must be updated and not just reused from non-ZFinx
code.
If any ZFinx
version is implemented then the core is a referred to as a Zfinx
core, otherwise it’s a non-Zfinx
core.
Any RISC-V extension that uses F
registers has an equivalent ZFinx
version. The ZFinx
version
-
inherits all details from the base extension
-
remaps the
F
registers toX
registers as set out in this specification -
deletes all floating point load and stores from the extension
-
deletes
F
toX
/X
toF
moves from the extension
ZFinx
versions of extensions
Base Extension | ZFinx version | Comment |
---|---|---|
|
|
Base |
|
|
Implies |
|
|
Implies |
|
|
|
|
|
|
In Table 1 all the existing extensions require ZFinx
to be present, i.e. the modified version of the F
extension. Such a rule is set by the FP extension, and not by this specification.
The ZFinx
version may be used with I
or E
integer variants. The number of integer registers directly affects the ZFinx
version as it refers to the X
registers. This is no different to the effect on (e.g.) the M
extension on an RV32I
or RV32E
core.
ZFinx
may be used with any extensions that uses F
registers. The relative sizes of XLEN
and FLEN
affect the ZFinx
specification.
ZFinx
configurations
Architecture | ZFinx version | Comment |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Note
|
RV32D_ZFinx{_Zfh} requires register pairs so is more complex than the other cases, see Chapter 7.
|
Note
|
ZQinx is not considered in the table, or further in this specification.
|
The NaN-boxing behaviour of floating point arithmetic instructions is modified to suppress checking of sources only. Floating point results are always NaN-boxed to XLEN
bits.
NaN-boxing checking is removed as integer loads do not NaN-box their result, and so loading fewer than XLEN
bits (for example using LW
to load floating point data on an RV64 core) would otherwise require NaN-boxing in software that wastes performance and code-size.
There are no other semantic differences for floating point instruction behaviour for ZFinx
versions, but there are some differences for special cases (such as x0
handling) as listed later in this specification.
If any ZFinx
extension is specified then the compiler will have the following #define set:
__riscv_zfinx
So software can use this to choose between ZFinx
or normal versions of floating point code.
Privileged code can detect whether any ZFinx
extension is implemented by checking if:
-
mstatus.FS
is hardwired to zero, and -
misa.F
is hardwired to zero, and -
CSR indicating `Z[FDQ]inx has yet to be specified, I’m waiting for a standard approach for extensions
Non-privileged code can detect whether ZFinx
is implemented as follows:
li a0, 0 # set a0 to zero
#ifdef __riscv_zfinx
fneg.s a0, a0 # this will invert a0
#else
fneg.s fa0, fa0 # this will invert fa0
#endif
If a0
is non-zero then it’s a ZFinx
core, otherwise it’s a non-ZFinx
core. Both branches result in the same encoding, but the assembly syntax is different for each variant.
For ZFinx
cores mstatus.fs
is hardwired to zero, because all the integer registers already form part of the current context. Note however that fcsr
still eds to be saved and restored. This gives a performance advantage when saving/restoring contexts.
Floating point instructions and fcsr
accesses do not trap if mstatus.fs
=0. This is different to non-ZFinx
cores.
For RV32_ZDinx
, all D-extension instructions that are implemented will access register pairs:
-
The specified register must be even, odd registers will cause an illegal instruction exception.
-
Even registers will cause an even/odd pair to be accessed.
-
Accessing Xn will cause the {Xn+1, Xn} pair to be accessed, which is consistent for big and little endian modes. For example if n = 2:
-
X2 is the least significant half (bits [31:0])
-
X3 the most significant half (bits [63:32])
-
-
-
X0 has special handling:
-
Reading {X1, X0} will read all zeros.
-
Writing {X1, X0} will discard the entire result, it will not write to X1.
-
The register pairs are only used by the floating point arithmetic instructions. All integer loads and stores will only access XLEN
bits, not FLEN
.
Note
|
Zp64 from the P-extension will specify consistent register pair handling, but at the time of writing swaps the registers in the pair in big endian mode. |
Note
|
The decision was taken not to swap the order of registers in the pair for big endian mode to reduce read-muxing in the register file, or in the ALU. If big-endian pair swapping is required it will need to be done in software or by a future load-pair instruction. |
Note
|
Big endian mode is enabled in M-mode if mstatus.MBE =1, in S-mode if mstatus.SBE =1, or in U-mode if mstatus.UBE =1.
|
If a floating point instruction targets x0
then it will still execute, and will set any required flags in fcsr
. It will not write to a target register. This matches the standard F
extension behaviour for:
fcvt.w.s x0, f0
If the floating point source is invalid then it will set the fflags.NV
bit, regardless of whether F
or ZFinx
is implemented. The target register is not written as it is x0
.
If fcsr.RM
is in an illegal state then floating point instruction behaviour is the same whether the target register is x0
or not, i.e. targetting x0
doesn’t disable any execution side effects.
In the case of RV32_ZDinx
, register pairs are used. See above for x0
handling.
For ZFinx
cores the NaN-boxing is limited to XLEN
bits, not FLEN
bits. Therefore an FADD.S
executed on an RV64D
core will write a 64-bit value (the MSH will be all 1’s). On an RV32_ZDinx
core it will write a 32-bit register, i.e. a single X register only. This means there is semantic difference between these code sequences:
#ifdef __riscv_zfinx
fadd.s x2, x3, x4 # only write x2 (32-bits), x3 is not written
#else
fadd.s f2, f3, f4 # NaN-box 64-bit f2 register to 64-bits
#endif
NaN-box generation is supported by ZFinx
cores. NaN-box checking is not supported by scalar floating point instructions. For example for RV64F
:
#ifdef __riscv_zfinx
lw[u] x1, 0(sp) # load 32-bits into x1 and sign / zero extend upper 32-bits
fadd.s x1, x1, x1 # use x1 but do not check source is Nan-boxed, NaN-box output
#else
flw.s f1, 0(sp) # load 32-bits into f1 and NaN-box to 64-bits (set upper 32-bits to 0xFFFFFFFF)
fadd.s f1, f1, f1 # check f1 is NaN-boxed, NaN-box output
#endif
Floating point loads are not supported on ZFinx
cores so x1 is not NaN-boxed in the example above, therefore the FADD.S
instruction does not check the input for NaN-boxing.
The result of FADD.S
is NaN-boxed, that means setting the upper half of the output register to all 1’s.
The table shows the effect of writing each possible width of value to the register file for all supported combinations. Note that Verilog syntax is used in the final column.
XLEN | FP output width | Xreg writeback value | |
---|---|---|---|
functional description |
implementation |
||
64 |
16 |
NaN_box_to_XLEN(result[15:0]) |
{48{1’b1}, result[15:0]} |
32 |
16 |
NaN_box_to_XLEN(result[15:0]) |
{16{1’b1}, result[15:0]} |
64 |
32 |
NaN_box_to_XLEN(result[31:0]) |
{32{1’b1}, result[31:0]} |
32 |
32 |
NaN_box_to_XLEN(result[31:0]) |
result[31:0] |
64 |
64 |
NaN_box_to_XLEN(result[63:0]) |
result[63:0] |
Little or big endian (special handling Xreg={0, 1}) |
|||
32 |
64 |
EvenXreg: NaN_box_to_XLEN(result[31:0]) OddXreg: NaN_box_to_XLEN(result[63:32]) |
EvenXreg: result[31:0] OddXreg: result[63:32] |
Therefore, for example, if an FADD.S
instruction is issued on an RV64_ZFinx
core then the upper 32-bits will be set to one in the target integer register, or an FADD.H
(floating point add half-word) instruction will set the upper 48-bits to one.
misa.mxl
can be programmed to change the current value of XLEN
.
The combination of ZFinx
and programming misa.mxl
to reduce XLEN
from the maximum implemented value gives addition cases to consider as shown in the table.
The result from the floating point instruction is NaN-boxed to the current value of XLEN
, and then sign extended to the maximum value of XLEN
.
misa.mxl
XLEN | FP output width | Xreg writeback value | ||
---|---|---|---|---|
maximum |
misa.mxl |
functional description |
implementation |
|
128 |
64 |
16 |
SignExt_to_128(NaN_box_to_64(result[15:0])) |
{112{1’b1}, result[15:0]} |
128 |
32 |
16 |
SignExt_to_128(NaN_box_to_32(result[15:0])) |
{112{1’b1}, result[15:0]} |
64 |
32 |
16 |
SignExt_to_64(NaN_box_to_32(result[15:0])) |
{48{1’b1}, result[15:0]} |
128 |
64 |
32 |
SignExt_to_128(NaN_box_to_64(result[31:0])) |
{96{1’b1}, result[31:0]} |
128 |
32 |
32 |
SignExt_to_128(result[31:0]) |
{96{result[31]}, result[31:0]} |
64 |
32 |
32 |
SignExt_to_64(result[31:0]) |
{32{result[31]}, result[31:0]} |
128 |
64 |
64 |
SignExt_to_128(result[63:0]) |
(64{result[63]}, result[63:0]} |
Little or big endian (special handling Xreg={0, 1}) |
||||
128 |
32 |
64 |
EvenXreg: SignExt_to_128(result[31:0]) OddXreg: SignExt_to_128(result[63:32]) |
EvenXreg: {96{result[31]}, result[31:0]} OddXreg: {96{result[63]}, result[63:32]} |
64 |
32 |
64 |
EvenXreg: SignExt_to_64(result[31:0]) OddXreg: SignExt_to_64(result[63:32]) |
EvenXreg: {32{result[31]}, result[31:0]} OddXreg: {32{result[63]}, result[63:32]} |
Any references to F
registers, or removed instructions will cause assembler errors.
For example, the encoding for:
FMADD.S <1>, <2>, <3>, <4>
will disassemble and execute as:
FMADD.S f1, f2, f3, f4
on a non-ZFinx
core, or:
FMADD.S x1, x2, x3, x4
on a ZFinx
core.
We considered allowing pseudo-instructions for the deleted instructions for easier code porting. For example allowing FLW to be a pseudo-instruction for LW, but decided not to. Because the register specifiers must change to integer registers, it makes sense to also remove the use of FLW etc. In this way the user is forced to rewrite their code for a ZFinx
core, reducing the chance of undiscovered porting bugs. This only affects assembly code, high level language code is unaffected as the compiler will target the correct architecture.
All floating point loads, stores and floating point to/from integer moves are removed on ZFinx
cores. The following sections show the deleted instructions and give suggested replacements to get the same semantics.
Note
|
Where a floating point load loads fewer than XLEN bits then software NaN-boxing in software is required to get the same semantics as a non-ZFinx core. This is specified for consistency but is unlikely to be necessary. The compiler should not NaN-box in software as there is no reason to do so. Assembly writers can choose whether to NaN-box in software to give better error detection.
|
Note
|
Where a floating point move moves fewer than XLEN bits then either sign extension (if the target is an X register) or NaN-boxing (if the target is an F register) is required in software to get the same semantics.
|
The modifications to the ISA of the F
extension are shown in Table 5.
F
extension floating point load/store/move instructions
Instruction | RV32_ZFinx | RV64_ZFinx |
---|---|---|
suggested replacements |
||
FLW frd, offset(xrs1) |
LW |
LW[U] and NaN-box in software |
C.FLW frd, offset(xrs1) |
C.LW |
C.LW and NaN-box in software |
C.FLWSP frd, uimm(x2) |
C.LWSP |
C.LWSP and NaN-box in software |
FSW frd, offset(xrs1) |
SW |
SW |
C.FSW frd, offset(xrs1) |
C.SW |
C.SW |
C.FSWSP frd, uimm(x2) |
C.SWSP |
C.SWSP |
FMV.X.W xrd, frs1 |
MV |
MV and sign extend in software |
FMV.W.X frd, xrs1 |
MV |
MV and NaN-box in software |
The modifications to the ISA of the D
extension are shown in Table 6.
D
extension floating point load/store/move instructions
Instruction | RV32_ZDinx | RV64_ZDinx |
---|---|---|
suggested replacements |
||
FLD frd, offset(xrs1) |
LW,LW |
LD |
C.FLD frd, offset(xrs1) |
C.LW, C.LW |
C.LD |
C.FLDSP frd, uimm(x2) |
C.LWSP, C.LWSP |
C.LDSP and NaN-box in software |
FSD frd, offset(xrs1) |
SW,SW |
SD |
C.FSD frd, offset(xrs1) |
C.SW,C.SW |
C.SD |
C.FSDSP frd, uimm(x2) |
C.SWSP,C.SWSP |
C.SDSP |
FMV.X.D xrd, frs1 |
MV,MV |
MV |
FMV.D.X frd, xrs1 |
MV,MV |
MV |
D
floating point load/store/move instructions
Instruction | RV32_ZFinx_Zfh | RV64_ZFinx_Zfh |
---|---|---|
suggested replacements |
||
FLH frd, offset(xrs1) |
LH[U] and NaN-box in software |
|
FSH frd, offset(xrs1) |
SH |
|
FMV.X.H xrd, frs1 |
MV and sign extend in software |
|
FMV.H.X frd, xrs1 |
MV and NaN-box in software |
The B-extension is useful for sign extending and NaN-boxing.
To sign-extend using the B-extension:
FMV.X.H rd, rs1
is replaced by:
SEXT.H rd, rs1
Without the B-extension two instructions are required: shift left 16 places, then arithmetic shift right 16 places.
NaN boxing in software is more involved, as the upper part of the register must be set to 1. The B-extension is also helpful in this case.
FMV.H.X a0, a1
is replaced by:
C.ADDI a2, zero, -1
PACK a0, a1, a2
The following instructions are deleted, and the integer version is to be used instead.
Instruction | Integer version |
---|---|
vfmv.v.f |
vmv.v.x |
vfmv.f.s |
vmv.x.s |
vfmv.s.f |
vmv.s.x |
vfmerge.vfm |
vmerge.vxm |
Additionally, all instructions with funct3=OPFVF
take the scalar floating point source from either a single or pair of X
registers instead of a single F
register.
When using GDB on a ZFinx
core, GDB must report x-registers instead of f-registers when disassembling floating point opcodes. No other changes are required.
For details of the current calling conventions see:
The ABI when using ZFinx
must be one of the the standard integer calling conventions as listed below:
-
ilp32e
-
ilp32
-
lp64
Note
|
Currently the ELF header is using a temporary flag to denote ZFinx so that the disassembler knows whether to decode e.g. FADD.S x0, x1, x2 or FADD.S f0, f1, f2 |
Note
|
There is a discussion underway about whether RV32D / RV64Q would benefit from an improved ABI. See this thread: https://lists.riscv.org/g/tech-code-size/topic/zfinx_compiler_tools_status/78705569?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,78705569 and this thread: https://lists.riscv.org/g/tech-toolchain-runtime/topic/elf_file_format_and_abis/78806208?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,78806208 |
To reduce the area overhead of FPU hardware new configurations will make the F[N]MADD.*, F[N]MSUB.*
and FDIV.*, FSQRT.*`
instructions optional in hardware. This then gives the choice of implementing them in software instead by:
-
Taking an illegal instruction trap, and calling the required software routine in the trap handler. This requires that the opcodes are not reallocated and gives binary compatibility between cores with/without hardware support for
F[N]MADD.*, F[N]MSUB.*
andFDIV.*, FSQRT.*
, but is lower performance than option 2. -
Use the GCC options below so that a software library is used to execute them
This argument already exists for RISCV:
gcc -mno-fdiv
This argument exists for other architectures (e.g. MIPs) but not for RISCV, so it needs to be added:
gcc -mno-fused-madd
To achieve this we break all current and future floating point extensions into four parts: Z*base
, Z*ma
, Z*div
and Z*ldstmv
. There is an -inx
version of the first three.
Options | Meaning |
---|---|
base ISA |
|
Zfhbase |
Support half precision base instructions |
Zfbase |
Support single precision base instructions |
Zdbase |
Support double precision base instructions |
Zqbase |
Support quad precision base instructions |
base ISA-in-x |
|
Zfhbaseinx |
Support ZFinx half precision base instructions |
Zfbaseinx |
Support ZFinx single precision base instructions |
Zdbaseinx |
Support ZFinx double precision base instructions |
Zqbaseinx |
Support ZFinx quad precision base instructions |
FMA |
|
Zfhma |
Support half precision multiply-add |
Zfma |
Support single precision multiply-add |
Zdma |
Support double precision multiply-add |
Zqma |
Support quad precision multiply-add |
FMA-in-x |
|
Zfhmainx |
Support ZFinx half precision multiply-add |
Zfmainx |
Support ZFinx single precision multiply-add |
Zdmainx |
Support ZFinx double precision multiply-add |
Zqmainx |
Support ZFinx quad precision multiply-add |
FDIV |
|
Zfhdiv |
Support half precision divide/square-root |
Zfdiv |
Support single precision divide/square-root |
Zddiv |
Support double precision divide/square-root |
Zqdiv |
Support quad precision divide/square-root |
FDIV-in-x |
|
Zfhdivinx |
Support ZFinx half precision divide/square-root |
Zfdivinx |
Support ZFinx single precision divide/square-root |
Zddivinx |
Support ZFinx double precision divide/square-root |
Zqdivinx |
Support ZFinx quad precision divide/square-root |
load/store/move, incompatible with -inx options |
|
Zfhldstmv |
Support load,store and integer to/from FP move |
Zfldstmv |
Support load,store and integer to/from FP move |
Zdldstmv |
Support load,store and integer to/from FP move |
Zqldstmv |
Support load,store and integer to/from FP move |
Therefore:
-
RV32F
can be expressed asrv32_Zfbase_Zfma_Zfdiv_Zfldstmv
. -
RV32D
can be expressed asrv32_Zfbase_Zfma_Zfdiv_fldstmv_Zdbase_Zdma_Zddiv_Zdldstmv
. -
RV32_ZFinx
can be expressed asrv32_Zfbaseinx_Zfmainx_Zfdivinx
. -
RV32_ZDinx
can be expressed asrv32_Zfbaseinx_Zfmainx_Zfdivinx_Zdbaseinx_Zdmainx_Zddivinx
.
If any -inx
extension is specified, then all extensions from Table 9 must have an -inx
suffix.
The options are all additive, none of them remove or change instructions.
Small embedded cores that need to implement floating point extensions have some options:
-
Use software emulation of floating point instructions, so don’t implement a hardware FPU that gives minimum core area:
-
The floating point library can be large, and expensive in terms of ROM or flash storage, costing power and energy consumption.
-
The performance of this solution is very low.
-
-
Low core area floating point implementations:
-
Share the integer registers for floating point instructions (
ZFinx
).-
Will cause more register spills/fills than having a separate register file, but the effect of this is application dependant.
-
No need for special instructions such as load and stores to access floating point registers, and moves between integer and floating point registers.
-
-
There are still performance/area tradeoffs to make for the FPU design itself.
-
e.g. pipelined versus iterative.
-
-
Optionally remove multiply-add instructions to save area in the FPU and a register file read port.
-
Optionally remove divide/square root instructions to to save area in the FPU.
-
-
Dedicated FPU registers, and higher performance FPU implementations use the most area:
-
Separate floating point registers allow fewer register spills/fills, and can also be used for integer code to prevent spilling to memory.
-
There are the same performance/area tradeoffs for the FPU design.
-
ZFinx
is implemented to allow core area reduction as the area of the F
register file is significant, for example:
-
RV32I_ZFinx
saves 1/2 the register file state compared toRV32IF
. -
RV32E_ZFinx
saves 2/3 the register file state compared toRV32EF
.
Therefore ZFinx
should allow small embedded cores to support floating point with:
-
Minimal area increase
-
Similar context switch time as an integer only core
-
there are no
F
registers to save/restore
-
-
Reduced code size by removing the floating point library