Handle address translation for misaligned loads and stores better #467

Alasdair · 2024-05-14T21:48:54Z

Refactor the LOAD and STORE instruction so they split misaligned accesses into multiple sub-accesses and perform address translation separately. This means we should handle the case where a misaligned access straddles a page boundary in a sensible way, even if we don't yet cover the full range of possibilities allowed for any possible RISC-V implementation.

In addition tidy up the implementation in a few ways:

Very long lines on the LOAD encdec clause were fixed by adding a helper function
Add some line breaks in the code so it reads less claustrophobic
Ensure we use the same names for arguments in encdec/execute/assembly. Previously we used 'size' and 'width'. I opted for 'width' consistently

I tested this using the test in #49. Additionally I ran the tests in this repository on a tweaked version of model that would split apart even correctly aligned loads and stores using the misaligned logic to test that it was at least doing something sensible.

github-actions · 2024-05-14T22:13:48Z

Test Results

396 tests ±0 396 ✅ ±0 0s ⏱️ ±0s
4 suites ±0 0 💤 ±0
1 files ±0 0 ❌ ±0

Results for commit 91a16b4. ± Comparison against base commit 1559013.

♻️ This comment has been updated with latest results.

Timmmm · 2024-05-15T08:30:55Z

I'm not sure this is the right approach, for a few reasons.

We actually only need to split on page boundaries in order to fix the address translation issue.
It's very unlikely that a chip will actually split accesses into memory operations like this. Eventually we want it to be configurable but until then I think it should at least be a likely implementation.
There's no way you are going to want to copy & paste this throughout the float and vector load/stores!
I think having all these details in the execute function itself is probably not ideal from an including-code-in-the-spec point of view.

Did you see this commit? I think it's a lot nicer to read something that abstracts away the virtual address translation a bit:

function clause execute(LOAD(imm, rs1, rd, is_unsigned, width, aq, rl)) = {
  let offset : xlenbits = sign_extend(imm);
  let width_bytes = size_bytes(width);

  // This is checked during decoding.
  assert(width_bytes <= sizeof(xlen_bytes));

  /* Get the address, X(rs1) + offset.
     Some extensions perform additional checks on address validity. */
  match ext_data_get_addr(rs1, offset, Read(Data), width_bytes) {
    Ext_DataAddr_Error(e)  => { ext_handle_data_check_error(e); RETIRE_FAIL },
    Ext_DataAddr_OK(vaddr) => {
      if   check_misaligned(vaddr, width)
      then { handle_mem_exception(vaddr, E_Load_Addr_Align()); RETIRE_FAIL }
      else match vmem_read(Read(Data), vaddr, width_bytes, aq, rl, false) {
        match value {
          Ok(result)    => { X(rd) = extend_value(is_unsigned, result); RETIRE_SUCCESS },
          Err(vaddr, e) => { handle_mem_exception(vaddr, e); RETIRE_FAIL }
        }
    }
  }
}

I couldn't quite do it as cleanly for stores because of the mem_write_ea() thing unfortunately. If we didn't have to worry about that then it could be as simple as that LOAD, just with vmem_write(..., X(rs1), ...).

Alasdair · 2024-05-15T11:29:47Z

We actually only need to split on page boundaries in order to fix the address translation issue.

I think the question is: should be the semantics of this be the same as splitting misaligned accesses into separate operations? If the observable semantics is the same then the model could choose to do things in a less efficient way.

I considered trying to only split only on page boundaries, but then I thought that there are probably a bunch of other cases where misaligned access straddle other things like PMP regions, and just splitting into separate operations might be cleaner.

Alasdair · 2024-05-15T11:45:37Z

On the other points regarding abstracting this detail behind a helper function, I agree. I'll take a closer look at that commit.

We've thought a bit more in general about misaligned and virtual memory for ARM, and I think the semantics there is that you split into byte sized accesses (unless you have some feature flags etc... it always gets more complicated).

Timmmm · 2024-05-15T11:59:35Z

If the observable semantics is the same then the model could choose to do things in a less efficient way.

That's true.

I think the semantics there is that you split into byte sized accesses.

That seems sensible! I think the PMP requirement that all bytes of a memory access be in one PMP region screws this up for RISC-V.

Alasdair · 2024-05-15T12:36:57Z

The interesting case would be: Misaligned access that straddles a page boundary, where each page is in a different PMP region. What is the envelope of allowed behaviour?

Timmmm · 2024-05-15T12:46:22Z

I made a diagram for these cases. In that case you can either split into separate memory operations in which case everything will be fine (first example in the image), or you are technically allowed to keep it as one discontinuous (!) memory operation in which case it will fail (second example).

I can't imagine any real systems that would do the latter but Andrew Waterman said it is allowed. The thing that makes it a single memory operation is that it cannot be observed to be partially complete.

Also if you're not being pedantic a device could just have the second case fail too, because nothing can observe that you actually did one memory operation if you claim you did two. I don't think the difference is observable.

Kind of feels like the PMP spec is just a bit broken tbh. I wonder if they even thought about this stuff.

Alasdair · 2024-05-15T13:00:27Z

Thanks, that's very useful. I think as a first go it would be reasonable to support the 0, 2, and 3 cases and not support 1 for the time being.

PeterSewell · 2024-05-15T13:38:58Z

As Alasdair says, we've looked a lot at this for Arm and IBM Power. Some interesting examples are in http://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf in Section 2. Real implementations do a lot of observable splitting on finer granularity than pages, and also on finer granularity than cache lines. Some of this ends up in the concurrency model and some in the Sail ISA definition. For the latter, for RISC-V, as the community is especially concerned to be able to configure the model to match specific hardware implementations, I'd suggest having configurable options for the granularity of splitting (single byte or larger aligned units), and for the ordering between accesses - architecturally, these are probably fine-grained and de-ordered. Some implementations will be nondeterministic, but some will have a deterministic order, and being able to configure the model that will make things easier to test. Within the Sail, there should be some abstraction: a function that does a memory access, that splits into multiple calls of another function that does single-copy-atomic memory accesses (probably with a translation for each). In fact it's more complex than that, as the entire translations of multiple s-c-a accesses from a single instruction should typically be architecturally de-ordered. That needs some lightweight intra-instruction within Sail, which it doesn't yet support (no ISA description language does, AIUI) - but I think RISC-V can get a long way before we need to handle that. I say all this without knowing much about RISC-V PMA... Peter

…

On Wed, 15 May 2024 at 14:01, Alasdair Armstrong ***@***.***> wrote: Thanks, that's very useful. I think as a first go it would be reasonable to support the 0, 2, and 3 cases and not support 1 for the time being. — Reply to this email directly, view it on GitHub <#467 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABFMZZXLPJPNGENSLG4ZIVDZCNMAFAVCNFSM6AAAAABHW7D3ESVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJSGQ2TMMBXGM> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

Alasdair · 2024-05-15T15:39:25Z

Ok, I refactored the pull request so there are separate vmem_read and vmem_write_from_register functions. I also added options to the C simulator that change the order in which address translations occur for misaligned accesses and control whether misaligned accesses are split into the largest possible aligned size, or always into bytes.

allenjbaum · 2024-05-15T19:13:24Z

This is my truth table of what I *think* is supposed to happen for misaligned loads. Note that the RSIC-V spec does not specify the order of misaligned accesses if they are not naturally aligned, and you can get two different results if you go low to high address, vs high to low address. So the parameters are: - do you support misaligned at all (and at which granularity - didn't include that in the truth table - do you split up the accesses, and if so, do you first access the lower address(es) or the higher address(es). - Does the misaligned exception have higher or lower priority than illegal access/page fault exception HW misalign hipri-misalign low-first error(s) CAUSE TVAL mem 0 1 x x misalign low unchg 0 0 1 errH,OK misalign low unchg 0 0 1 OK, errL errL low unchg 0 0 1 errH,errL errL low unchg 0 0 0 errH,OK errH hi unchg 0 0 0 OK, errL misalign hi unchg 0 0 0 errH,errL errH hi unchg 1 x x errH,OK errH hi lo 1 x x OK, errL errL low hi 1 x 1 errH,errL errL low unchg 1 x 0 errH,errL errH hi unchg And: this does assume that the parameters are static and don't change during execution (so even with out of order, the assumption is that split accesses will (appear to) go in order with each other.

…

On Wed, May 15, 2024 at 8:39 AM Alasdair Armstrong ***@***.***> wrote: Ok, I refactored the pull request so there are separate vmem_read and vmem_write_from_register functions. I also added options to the C simulator that change the order in which address translations occur for misaligned accesses and control whether misaligned accesses are split into the largest possible aligned size, or always into bytes. — Reply to this email directly, view it on GitHub <#467 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHPXVJXJA3ZLESQRKEIIW3DZCN6UNAVCNFSM6AAAAABHW7D3ESVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJSHA4DKMBRGM> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

jrtc27 · 2024-05-15T19:34:00Z

The privileged spec is very clear that alignment faults take precedence over access faults.

Alasdair · 2024-05-15T19:53:42Z

do you support misaligned at all (and at which granularity - didn't
include that in the truth table

do you split up the accesses, and if so, do you first access the lower
address(es) or the higher address(es).

Ok good to know, those are essentially the two command line flags I have implemented.

allenjbaum · 2024-05-15T20:26:23Z

Not exactly: The RISC-V Instruction Set Manual: Volume II: Privileged Architecture Table 15. Synchronous exception priority in decreasing priority order. Priority Exc.Code Description Highest 3 Instruction address breakpoint 12, 1 During instruction address translation: First encountered page fault or access fault 1 With physical address for instruction: Instruction access fault 2 0 8,9,11 3 3 Illegal instruction Instruction address misaligned Environment call Environment break Load/store/AMO address breakpoint 4,6 *Optionally*: Load/store/AMO address misaligned 13, 15, 5, 7 During address translation for an explicit memory access: First encountered page fault or access fault 5,7 With physical address for an explicit memory access: Load/store/AMO access fault Lowest 4,6 If not higher priority: Load/store/AMO address misaligned

…

On Wed, May 15, 2024 at 12:34 PM Jessica Clarke ***@***.***> wrote: The privileged spec is very clear that alignment faults take precedence over access faults. — Reply to this email directly, view it on GitHub <#467 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHPXVJWOVH7E5IGBK3ULTY3ZCO2D3AVCNFSM6AAAAABHW7D3ESVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJTGMYTQMZWGA> . You are receiving this because you commented.Message ID: ***@***.***>

jrtc27 · 2024-05-15T20:30:37Z

Not exactly: The RISC-V Instruction Set Manual: Volume II: Privileged Architecture Table 15. Synchronous exception priority in decreasing priority order. Priority Exc.Code Description Highest 3 Instruction address breakpoint 12, 1 During instruction address translation: First encountered page fault or access fault 1 With physical address for instruction: Instruction access fault 2 0 8,9,11 3 3 Illegal instruction Instruction address misaligned Environment call Environment break Load/store/AMO address breakpoint 4,6 Optionally: Load/store/AMO address misaligned 13, 15, 5, 7 During address translation for an explicit memory access: First encountered page fault or access fault 5,7 With physical address for an explicit memory access: Load/store/AMO access fault Lowest 4,6 If not higher priority: Load/store/AMO address misaligned
…
On Wed, May 15, 2024 at 12:34 PM Jessica Clarke @.> wrote: The privileged spec is very clear that alignment faults take precedence over access faults. — Reply to this email directly, view it on GitHub <#467 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHPXVJWOVH7E5IGBK3ULTY3ZCO2D3AVCNFSM6AAAAABHW7D3ESVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJTGMYTQMZWGA . You are receiving this because you commented.Message ID: @.>

That was exactly the table I was looking at, but I missed the final row and interpreted the higher-priority "Optionally" as meaning "if you don't support misaligned accesses"... unhelpful wording.

allenjbaum · 2024-05-15T20:31:25Z

Jessica is correct about a nuance I missed: the truth table works only for explicit data accesses, but not for instruction accesses (or implicit accesses related to instruction accesses) On Wed, May 15, 2024 at 1:26 PM Allen Baum ***@***.***> wrote:

…

Not exactly: The RISC-V Instruction Set Manual: Volume II: Privileged Architecture Table 15. Synchronous exception priority in decreasing priority order. Priority Exc.Code Description Highest 3 Instruction address breakpoint 12, 1 During instruction address translation: First encountered page fault or access fault 1 With physical address for instruction: Instruction access fault 2 0 8,9,11 3 3 Illegal instruction Instruction address misaligned Environment call Environment break Load/store/AMO address breakpoint 4,6 *Optionally*: Load/store/AMO address misaligned 13, 15, 5, 7 During address translation for an explicit memory access: First encountered page fault or access fault 5,7 With physical address for an explicit memory access: Load/store/AMO access fault Lowest 4,6 If not higher priority: Load/store/AMO address misaligned On Wed, May 15, 2024 at 12:34 PM Jessica Clarke ***@***.***> wrote: > The privileged spec is very clear that alignment faults take precedence > over access faults. > > — > Reply to this email directly, view it on GitHub > <#467 (comment)>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AHPXVJWOVH7E5IGBK3ULTY3ZCO2D3AVCNFSM6AAAAABHW7D3ESVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMJTGMYTQMZWGA> > . > You are receiving this because you commented.Message ID: > ***@***.***> >

Alasdair · 2024-07-29T15:14:44Z

Should be rebased onto the latest master now.

Alasdair · 2024-08-05T14:53:17Z

If we want I could update this PR to handle the Zama16b extension/option we discussed today as the splitting function I add here would be the ideal place to add this option.

billmcspadden-riscv · 2024-08-05T15:34:02Z

Yes, please do. Thanks. Bill Mc.

…

On Mon, Aug 5, 2024 at 9:53 AM Alasdair Armstrong ***@***.***> wrote: If we want I could update this PR to handle the Zama16b extension/option we discussed today as the splitting function I add here would be the ideal place to add this option. — Reply to this email directly, view it on GitHub <#467 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AXROLODNMEOI6GI2FPZ6QYTZP6GXRAVCNFSM6AAAAABHW7D3ESVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENRZGI3TCNJXGE> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

-- Bill McSpadden Formal Verification Engineer RISC-V International mobile: 503-807-9309 Join me at RISC-V Summit North America <http://www.riscvsummit.com/> in October!

Timmmm

Hi, I finally got around to reading this (sorry!). Some thoughts:

It also needs to handle float loads/stores and a gazillion vector loads/stores. I don't know how you do that nicely without first class functions though. vmem_write_from_float_reg, vmem_write_from_vector_reg, etc...
This means the misaligned accesses will always be split into aligned accesses, but that isn't necessary for the Sail emulator (since the underlying memory natively supports misaligned accesses) and it happens to not match our hardware (which also natively supports misaligned accesses). The only situations where we do need to split is at page boundaries.

This doesn't matter too much but if you support fine grained (i.e. low G) PMPs and PMAs then it might. Also it makes tracing more difficult because it's physical accesses that are logged and now we have way more.

I think we should add a flag to only split at some grain size, which would be a maximum of the page size (where you must split, unless we make it very complicated and check pages are contiguously mapped). I can imagine implementations where you might want to set it to the cache line size too.

Does that make sense? I feel like we might need a table where the rows are example addresses, the columns are the options and the cells are the expected accesses.

Timmmm · 2024-10-08T16:41:29Z

c_emulator/riscv_sim.c

+      break;
+    case OPT_MISALIGNED_TO_BYTE:
+      fprintf(stderr,
+              "misaligned accesses will be split into individual "


What's the motivation for this option? I know Spike does it but I assumed that was just for implementation simplicity. Does real hardware do this? I would maybe say YAGNI for now unless there's some use case I've missed?

It's how Arm models it in ASL afaik (although they have a bunch of toggles and other options too). That doesn't necessarily constrain what implementations do, it just means the observable behaviour when you split into larger accesses shouldn't be different.

The most relaxed permissible weak-memory behaviour is also given by splitting into individual bytes.

Timmmm · 2024-10-08T16:44:03Z

model/riscv_vmem_utils.sail

+  repeat {
+    let offset = i;
+    let vaddr = vaddr + (offset * bytes);
+    match translateAddr(vaddr, Write(Data)) {


It might be tricky but our hardware (and I presume most) does all the address translation first and then the memory access. Could we support at least that ordering as well?

Yes, I think it would be fine to do all the address translations first then the accesses.

Timmmm · 2024-10-08T16:45:35Z

model/riscv_vmem_utils.sail

+    } else {
+      i = offset + step
+    }
+  } until finished;


Huh TIL Sail has do..while! And also while .. do .. apparently. Might be worth mentioning those in the manual I guess?

Timmmm · 2024-10-08T17:32:28Z

model/riscv_vmem_utils.sail

+  _: "sys_misaligned_to_byte"
+} : unit -> bool
+
+val split_misaligned : forall 'width, 'width in {1, 2, 4, 8}.


Unfortunately there's also cbo.zero which does 32 byte writes (by default). I think you can do this without a match anyway though.

val count_trailing_zeros : forall 'n, 'n >= 0. (bits('n)) -> range(0, 'n) function count_trailing_zeros(x) = { foreach (i from 0 to ('n - 1)) { if x[i] == bitone then return i; }; 'n } val split_misaligned : forall 'width, 'width in {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096} . (xlenbits, int('width)) -> {'n 'bytes, 'width == 'n * 'bytes & 'bytes > 0. (int('n), int('bytes))} function split_misaligned(vaddr, width) = { if is_aligned_addr(vaddr, width) then (1, width) else if sys_misaligned_to_byte() then (width, 1) else { let vaddr_alignment_bytes = 2 ^ count_trailing_zeros(vaddr); assert(vaddr_alignment_bytes <= width); (vaddr_alignment_bytes, width / vaddr_alignment_bytes) } }

I couldn't figure out how to get "must be a power of two" into the type system though. :-(

You can't, as it would require an existential quantifier ("M is a power of two" becomes "there exists an N such that M = 2 ^ N"). We could add some syntactic sugar for a set of powers of two up to some bound though.

allenjbaum · 2024-10-09T02:05:23Z

I think that there needs to be a command line option (or whatever) that gives the misaligned granularity, (as a power of 2 bytes). That could be a cacheline boundary (6), a page boundary (12), or 0 (to indicate the size of the access, i.e. a naturally aligned access)

…

On Tue, Oct 8, 2024 at 10:36 AM Tim Hutt ***@***.***> wrote: ***@***.**** commented on this pull request. Hi, I *finally* got around to reading this (sorry!). Some thoughts: 1. It also needs to handle float loads/stores and a gazillion vector loads/stores. I don't know how you do that nicely without first class functions though. vmem_write_from_float_reg, vmem_write_from_vector_reg, etc... 2. This means the misaligned accesses will *always* be split into aligned accesses, but that isn't necessary for the Sail emulator (since the underlying memory natively supports misaligned accesses) and it happens to not match our hardware (which also natively supports misaligned accesses). The only situations where we do need to split is at page boundaries. This doesn't matter *too* much but if you support fine grained (i.e. low G) PMPs and PMAs then it might. Also it makes tracing more difficult because it's physical accesses that are logged and now we have way more. I think we should add a flag to only split at some grain size, which would be a maximum of the page size (where you *must* split, unless we make it very complicated and check pages are contiguously mapped). I can imagine implementations where you might want to set it to the cache line size too. Does that make sense? I feel like we might need a table where the rows are example addresses, the columns are the options and the cells are the expected accesses. ------------------------------ In c_emulator/riscv_sim.c <#467 (comment)>: > @@ -315,6 +319,18 @@ static int process_args(int argc, char **argv) } rv_pmp_grain = pmp_grain; break; + case OPT_MISALIGNED_ORDER_DEC: + fprintf(stderr, + "misaligned access virtual addresses will be translated in " + "decreasing order.\n"); + rv_misaligned_order_decreasing = true; + break; + case OPT_MISALIGNED_TO_BYTE: + fprintf(stderr, + "misaligned accesses will be split into individual " What's the motivation for this option? I know Spike does it but I assumed that was just for implementation simplicity. Does real hardware do this? I would maybe say YAGNI for now unless there's some use case I've missed? ------------------------------ In model/riscv_vmem_utils.sail <#467 (comment)>: > + (xlenbits, int('width), range(0, 31), bool, bool, bool) -> result(unit, (xlenbits, ExceptionType)) + +function vmem_write_from_register(vaddr, bytes, reg, aq, rl, res) = { + /* If the store is misaligned, split into `n` (single-copy-atomic) memory operations, + each of `bytes` width. If the store is aligned, then `n` = 1 and bytes will remain + unchanged. */ + let ('n, bytes) = split_misaligned(vaddr, bytes); + + let (first, last, step) = misaligned_order(n); + var i : range(0, 'n - 1) = first; + var finished : bool = false; + + repeat { + let offset = i; + let vaddr = vaddr + (offset * bytes); + match translateAddr(vaddr, Write(Data)) { It might be tricky but our hardware (and I presume most) does all the address translation first and then the memory access. Could we support at least that ordering as well? ------------------------------ In model/riscv_vmem_utils.sail <#467 (comment)>: > + match mem_write_value(paddr, bytes, data, aq, rl, false) { + MemException(e) => return Err(vaddr, e), + MemValue(true) => (), + MemValue(false) => + internal_error(__FILE__, __LINE__, "got false from mem_write_value"), + } + } + } + }; + + if offset == last then { + finished = true + } else { + i = offset + step + } + } until finished; Huh TIL Sail has do..while! And also while .. do .. apparently. Might be worth mentioning those in the manual I guess? ------------------------------ In model/riscv_vmem_utils.sail <#467 (comment)>: > + * performs misaligned accesses. + */ +val sys_misaligned_order_decreasing = { + ocaml: "Platform.misaligned_order_decreasing", + _: "sys_misaligned_order_decreasing" +} : unit -> bool + +/* This is an external option that, when true, causes all misaligned accesses + * to be split into single byte operations. + */ +val sys_misaligned_to_byte = { + ocaml: "Platform.misaligned_to_byte", + _: "sys_misaligned_to_byte" +} : unit -> bool + +val split_misaligned : forall 'width, 'width in {1, 2, 4, 8}. Unfortunately there's also cbo.zero which does 32 byte writes (by default). I think you can do this without a match anyway though. val count_trailing_zeros : forall 'n, 'n >= 0. (bits('n)) -> range(0, 'n) function count_trailing_zeros(x) = { foreach (i from 0 to ('n - 1)) { if x[i] == bitone then return i; }; 'n } val split_misaligned : forall 'width, 'width in {1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096} . (xlenbits, int('width)) -> {'n 'bytes, 'width == 'n * 'bytes & 'bytes > 0. (int('n), int('bytes))} function split_misaligned(vaddr, width) = { if is_aligned_addr(vaddr, width) then (1, width) else if sys_misaligned_to_byte() then (width, 1) else { let vaddr_alignment_bytes = 2 ^ count_trailing_zeros(vaddr); assert(vaddr_alignment_bytes <= width); (vaddr_alignment_bytes, width / vaddr_alignment_bytes) } } I couldn't figure out how to get "must be a power of two" into the type system though. :-( — Reply to this email directly, view it on GitHub <#467 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHPXVJV73EQVSI73ITNSJU3Z2QJZXAVCNFSM6AAAAABHW7D3ESVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDGNJVGAZDMMBXGI> . You are receiving this because you commented.Message ID: ***@***.***>

Timmmm · 2024-10-09T08:26:29Z

Yeah I agree. I thought about this a bit more and I think in the fullness of time we probably want something like this:

Any of those steps on their own (1, 2a, 2b or 2c) solve the original page boundary problem. The minimum needed to solve it is 1, so I think we should do that first. 2a, 2b and 2c are only required if you want to precisely match PMA/PMP exceptions to a hardware implementation (and that implementation supports PMA/PMPs finer grained than a page; I'm not sure how common that is - our hardware doesn't but I can imagine some does).

Supporting all of those options may be a bit tricky to code in Sail tbf. Probably doable though; I might have a go...

I think actually the biggest issue is mem_write_ea. We don't really want to duplicate this function 3 or 4 times, with just one line in the middle different. :-/

Alasdair · 2024-10-09T22:38:37Z

It also needs to handle float loads/stores and a gazillion vector loads/stores. I don't know how you do that nicely without first class functions though. vmem_write_from_float_reg, vmem_write_from_vector_reg, etc...

I think I was trying too hard to not change the existing order in the code. Realistically if we ditch the write_mem_ea call things become a lot easier and we can just have a vmem_write that everything uses.

The idea is that for the operational concurrency model, an instruction computes the address (and announces it) before it reads the value its writing for re-ordering reasons. However, I think there is an equivalent formalisation of the operational model where the write_mem_ea event takes a set of registers as an additional parameter, and this would remove the need to rely on any ordering in the Sail spec (which has the additional advantage of matching how the axiomatic model works w.r.t. ordering between Sail statements). Although to be clear, this equivalent formalisation only exists in my head and I haven't proven it equivalent.

I think we should add a flag to only split at some grain size, which would be a maximum of the page size (where you must split, unless we make it very complicated and check pages are contiguously mapped). I can imagine implementations where you might want to set it to the cache line size too.

The Zama16b extension mentioned previously is essentially guaranteeing that you can do misaligned accesses in hardware up to 16 byte boundaries, so it makes sense to implement something like this. I wrote a function that implements that, but the logic is quite twisty when you have a different alignments and access width parameters, and you have to be careful when at the the top or bottom of the address space to avoid wrapping when computing the top or bottom byte address used by the access (I might be missing some easy way of doing it though).

allenjbaum · 2024-10-10T06:26:35Z

This is my understanding of the spec-- but the spec is very vague about it all. From a RISC-V spec point of view, it doesn't matter how many pieces a misaligned access is split; it only matters that it is split because a PMP or PMA or PTE will treat (all of) the access(es) on one side one way, and (possibly) all on the other side another way. It can't be treated 3 different ways because the max width of an access should always be smaller than the minimum granule. An FLD or FLQ can span 3 or 5 PMP entries of the minimum granule size. Normally, an access that spans more than one PMP entry is illegal. But, "access" refers to what is actually loaded or stored, so each of 3 or 5 access could be considered a separate access, and each would then be treated separately. But, if we have a misalign split defined (e.g. cacheline or page boundary, or naturally aligned), then by definition any access with those boundaries is not split, so any access within that granule that spans 2 or more PMP entries is automatically an illegal access. My inderstanding of this is not necesssarily gospel, and should be sanity checked (or, if we believe Sail is gospel, we just declare that this interpetation is gospel)

…

On Wed, Oct 9, 2024 at 3:39 PM Alasdair Armstrong ***@***.***> wrote: It also needs to handle float loads/stores and a gazillion vector loads/stores. I don't know how you do that nicely without first class functions though. vmem_write_from_float_reg, vmem_write_from_vector_reg, etc... I think I was trying too hard to not change the existing order in the code. Realistically if we ditch the write_mem_ea call things become a lot easier and we can just have a vmem_write that everything uses. The idea is that for the operational concurrency model, an instruction computes the address (and announces it) before it reads the value its writing for re-ordering reasons. However, I think there is an equivalent formalisation of the operational model where the write_mem_ea event takes a set of registers as an additional parameter, and this would remove the need to rely on any ordering in the Sail spec (which has the additional advantage of matching how the axiomatic model works w.r.t. ordering between Sail statements). Although to be clear, this equivalent formalisation only exists in my head and I haven't proven it equivalent. I think we should add a flag to only split at some grain size, which would be a maximum of the page size (where you must split, unless we make it very complicated and check pages are contiguously mapped). I can imagine implementations where you might want to set it to the cache line size too. The Zama16b extension mentioned previously is essentially guaranteeing that you can do misaligned accesses in hardware up to 16 byte boundaries, so it makes sense to implement something like this. I wrote a function that implements that, but the logic is quite twisty when you have a different alignments and access width parameters, and you have to be careful when at the the top or bottom of the address space to avoid wrapping when computing the top or bottom byte address used by the access (I might be missing some easy way of doing it though). — Reply to this email directly, view it on GitHub <#467 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHPXVJVGS3A7EEDOZOVQ6L3Z2WWAHAVCNFSM6AAAAABHW7D3ESVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMBTGU2DMMZZGY> . You are receiving this because you commented.Message ID: ***@***.***>

Alasdair · 2024-10-28T17:03:42Z

Updated with a plain vmem_write function that can be used everywhere, including in the vector extension (but I still need to update all those instructions, there aren't that many though). I also updated STORECON and LOADRES even though that may not be needed as they are always aligned.

Refactor the LOAD and STORE instruction so they split misaligned accesses into multiple sub-accesses and perform address translation separately. This means we should handle the case where a misaligned access straddles a page boundary in a sensible way, even if we don't yet cover the full range of possibilities allowed for any RISC-V implementation. There are options for the order in which misaligned happen, i.e. from high-to-low or from low-to-high as well as the granularity of the splitting, either all the way to bytes or to the largest aligned size. The splitting can also be disabled if an implementation supports misaligned accesses in hardware. The Zama16b extension is support with an --enable-zama16b flag on the simulator. In addition tidy up the implementation in a few ways: - Very long lines on the LOAD encdec were fixed by adding a helper - Add some linebreaks in the code so it reads as less claustrophobic - Ensure we use the same names for arguments in encdec/execute/assembly. Previously we used 'size' and 'width'. I opted for 'width' consistently.

Alasdair · 2024-10-28T19:22:08Z

I have now added the option to allow misaligned accesses to be atomic provided they exist within some region with a --misaligned-allowed-within flag, and added the Zama16b extension with the --enable-zama16b flag.

Alasdair force-pushed the ldst_misaligned branch 2 times, most recently from 7db08ac to 1f0312e Compare May 15, 2024 15:34

Alasdair force-pushed the ldst_misaligned branch from 1f0312e to 2dd10d3 Compare May 15, 2024 16:28

billmcspadden-riscv added the tgmm-agenda Tagged for the next Golden Model meeting agenda. label May 23, 2024

Alasdair force-pushed the ldst_misaligned branch 3 times, most recently from eb67458 to 4bdc440 Compare July 29, 2024 15:13

Timmmm reviewed Oct 8, 2024

View reviewed changes

Alasdair force-pushed the ldst_misaligned branch from 4bdc440 to 15b6877 Compare October 28, 2024 17:02

Alasdair force-pushed the ldst_misaligned branch 2 times, most recently from b9ba55d to ef84c7c Compare October 28, 2024 19:16

Alasdair force-pushed the ldst_misaligned branch from ef84c7c to 91a16b4 Compare October 28, 2024 19:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle address translation for misaligned loads and stores better #467

Handle address translation for misaligned loads and stores better #467

Alasdair commented May 14, 2024

github-actions bot commented May 14, 2024 •

edited

Loading

Timmmm commented May 15, 2024

Alasdair commented May 15, 2024

Alasdair commented May 15, 2024

Timmmm commented May 15, 2024

Alasdair commented May 15, 2024

Timmmm commented May 15, 2024

Alasdair commented May 15, 2024

PeterSewell commented May 15, 2024 via email

Alasdair commented May 15, 2024

allenjbaum commented May 15, 2024 via email

jrtc27 commented May 15, 2024

Alasdair commented May 15, 2024

allenjbaum commented May 15, 2024 via email

jrtc27 commented May 15, 2024

allenjbaum commented May 15, 2024 via email

Alasdair commented Jul 29, 2024

Alasdair commented Aug 5, 2024

billmcspadden-riscv commented Aug 5, 2024 via email

Timmmm left a comment

Timmmm Oct 8, 2024

Alasdair Oct 9, 2024

Timmmm Oct 8, 2024

Alasdair Oct 9, 2024

Timmmm Oct 8, 2024

Timmmm Oct 8, 2024

Alasdair Oct 9, 2024 •

edited

Loading

allenjbaum commented Oct 9, 2024 via email

Timmmm commented Oct 9, 2024

Alasdair commented Oct 9, 2024

allenjbaum commented Oct 10, 2024 via email

Alasdair commented Oct 28, 2024

Alasdair commented Oct 28, 2024

Handle address translation for misaligned loads and stores better #467

Are you sure you want to change the base?

Handle address translation for misaligned loads and stores better #467

Conversation

Alasdair commented May 14, 2024

github-actions bot commented May 14, 2024 • edited Loading

Test Results

Timmmm commented May 15, 2024

Alasdair commented May 15, 2024

Alasdair commented May 15, 2024

Timmmm commented May 15, 2024

Alasdair commented May 15, 2024

Timmmm commented May 15, 2024

Alasdair commented May 15, 2024

PeterSewell commented May 15, 2024 via email

Alasdair commented May 15, 2024

allenjbaum commented May 15, 2024 via email

jrtc27 commented May 15, 2024

Alasdair commented May 15, 2024

allenjbaum commented May 15, 2024 via email

jrtc27 commented May 15, 2024

allenjbaum commented May 15, 2024 via email

Alasdair commented Jul 29, 2024

Alasdair commented Aug 5, 2024

billmcspadden-riscv commented Aug 5, 2024 via email

Timmmm left a comment

Choose a reason for hiding this comment

Timmmm Oct 8, 2024

Choose a reason for hiding this comment

Alasdair Oct 9, 2024

Choose a reason for hiding this comment

Timmmm Oct 8, 2024

Choose a reason for hiding this comment

Alasdair Oct 9, 2024

Choose a reason for hiding this comment

Timmmm Oct 8, 2024

Choose a reason for hiding this comment

Timmmm Oct 8, 2024

Choose a reason for hiding this comment

Alasdair Oct 9, 2024 • edited Loading

Choose a reason for hiding this comment

allenjbaum commented Oct 9, 2024 via email

Timmmm commented Oct 9, 2024

Alasdair commented Oct 9, 2024

allenjbaum commented Oct 10, 2024 via email

Alasdair commented Oct 28, 2024

Alasdair commented Oct 28, 2024

github-actions bot commented May 14, 2024 •

edited

Loading

Alasdair Oct 9, 2024 •

edited

Loading