Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add human-readable instruction name annotations #520

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

ThinkOpenly
Copy link
Contributor

A simple implementation is proposed to annotate each instruction with its human-readable name.

Two methods are used, given the mildly diverse instruction implementations in Sail:

  • An annotation added to the union clause ast for "simple" instructions that are not defined as a group:
    $[name "jump and link"]
    union clause ast = RISCV_JAL : (bits(21), regidx)
    
  • For "simple grouped" instructions where the set of mnemonics are defined as part of a simple mapping, an annotation is added to each relationship of the mapping:
    mapping utype_mnemonic : uop <-> string = {
      $[name "load upper immediate"]
      RISCV_LUI   <-> "lui",
      $[name "add upper immediate to PC"]
      RISCV_AUIPC <-> "auipc"
    }
    

Other methods may be required for more complicated mnemonics, especially those with varied suffixes, but these methods easily cover most instructions. This, at least, is a start.

These annotations allow a parser/backend to have relevant human-readable information for such purposes as creating documentation.

This PR is currently marked as a draft because (1) it is incomplete, and (2) we seek feedback on the approach before continuing the work. (Thanks!)

@ThinkOpenly ThinkOpenly changed the title Upstream names Add human-readable instruction name annotations Aug 9, 2024
@jrtc27
Copy link
Collaborator

jrtc27 commented Aug 9, 2024

Unless the intent is to move the entirety of riscv-isa-manual's instruction semantics descriptions into sail-riscv (which is doable, it's what we do for cheri-specification with sail-cheri-riscv, but I doubt will happen here), I struggle to see how this achieves anything other than duplicate existing descriptions?

model/riscv_insts_base.sail Outdated Show resolved Hide resolved
Copy link

github-actions bot commented Aug 9, 2024

Test Results

712 tests  ±0   712 ✅ ±0   0s ⏱️ ±0s
  6 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 6670d78. ± Comparison against base commit 05b845c.

♻️ This comment has been updated with latest results.

@Alasdair
Copy link
Collaborator

This would be very useful for some of the documentation things I would like to do, so I am quite strongly in favour of adding this metadata.

@Timmmm
Copy link
Collaborator

Timmmm commented Aug 11, 2024

Yeah I would also like more documentation metadata here for a RISC-V VSCode extension I'm working on. I think there's quite a bit of demand for machine readable metadata & doc type stuff (c.f. Qualcomm's other ISA specification language), so we should probably start actually doing it.

I think at least for the names there isn't really any risk from keeping stuff in sync, and it also helps understand for people who don't know what the ridiculously terse mnemonics mean.

How do these name annotations get exposed? Do they end up in the doc JSON generated by Sail?

@billmcspadden-riscv billmcspadden-riscv added the tgmm-agenda Tagged for the next Golden Model meeting agenda. label Aug 12, 2024
@ThinkOpenly
Copy link
Contributor Author

Unless the intent is to move the entirety of riscv-isa-manual's instruction semantics descriptions into sail-riscv

There's no room for iterative improvement?

I struggle to see how this achieves anything other than duplicate existing descriptions?

To what "existing descriptions" are you referring? I found just a few in the vector code, but little else of actual use.

The goal is to give helpful human-readable names to the mnemonics. Even the current AsciiDoc source doesn't do a consistently good job here.

@ThinkOpenly
Copy link
Contributor Author

How do these name annotations get exposed? Do they end up in the doc JSON generated by Sail?

Currently, the net effect will be to add the annotations to the associated node in the AST. The current backends ignore the new content. Our out-of-tree JSON backend pulls the annotations into the JSON, associated with the respective mnemonic:

{
  "instructions": [
    {
      "mnemonic": "addi",
      "name": "add immediate",
      "operands": [...]
      "fields": [...]
[...]

@ThinkOpenly
Copy link
Contributor Author

I've fixed the issue identified by @jrtc27 (by changing "store fence" to "supervisor memory-management fence" as found in Volume II: RISC-V Privileged Architectures V20211203, section 4.2.1 "Supervisor Memory-Management Fence Instruction").

I'm removing the "draft" tag and marking "ready for review".

@ThinkOpenly ThinkOpenly marked this pull request as ready for review August 12, 2024 19:28
@jrtc27
Copy link
Collaborator

jrtc27 commented Aug 12, 2024

I do not believe this should be here. These names are not official, they are invented, yet they are being put in the official model. Naming the instructions is not the responsibility of the model but of the documentation.

Unless the intent is to move the entirety of riscv-isa-manual's instruction semantics descriptions into sail-riscv

There's no room for iterative improvement?

Not if the end goal isn't to avoid duplication.

I struggle to see how this achieves anything other than duplicate existing descriptions?

To what "existing descriptions" are you referring? I found just a few in the vector code, but little else of actual use.

Words in the ISA manual. That's where the descriptions currently belong

The goal is to give helpful human-readable names to the mnemonics. Even the current AsciiDoc source doesn't do a consistently good job here.

Why is that the Sail model's responsibility? If the AsciiDoc sucks, fix the AsciiDoc.

@ThinkOpenly
Copy link
Contributor Author

I do not believe this should be here. These names are not official, they are invented, yet they are being put in the official model.

That is a valid concern. There should be canonical human-readable names associated with each instruction mnemonic. These should be enumerable and easily associated with the mnemonics, and the best place to represent things like that is in the Sail code. I looked through the PR to see if the names came directly from the ISA documentation or were "invented".

  • Pulled from ISA doc: lui, auipc, jal, jalr, "conditional branch", slti, "load", "store", ecall, ebreak, wfi, sfence.vma
  • "Invented", but with strong analogs/hints in the ISA doc: addi, andi, ori, xori, sltiu, add, slt, sltu, and, or, xor, sll, srl, sub, sra, addiw, addw, subw, sllw, srlw, sraw, slliw, srliw, sraiw
  • "Invented", but analogs in the ISA doc: slli, srli, srai (actually, I see these aren't in the PR. These should be added.)
  • "Invented": fence, fence.tso, mret, sret

Note that any "invented" instruction names were due to the lack of a clear instruction name in the ISA doc.

Naming the instructions is not the responsibility of the model but of the documentation.

I can see your point, but I disagree. If there is to be complete separation between the model and the documentation, then there is a lot that needs to be removed from the documentation. Since humans are writing the model, other humans should be able to read it. What's being added here is annotations, essentially comments that end up in the AST, and otherwise ignored. They make the Sail code better, and easier to understand and interpret.

To what "existing descriptions" are you referring? I found just a few in the vector code, but little else of actual use.

Words in the ISA manual. That's where the descriptions currently belong

I'll disagree here, too. I think the model needs to stand pretty well on it's own. You mentioned earlier the "sail-cheri-riscv" effort to merge the documentation with the model, which sounds quite useful. Why is it OK there, but not OK here? Does it have to be all-or-nothing?

The goal is to give helpful human-readable names to the mnemonics. Even the current AsciiDoc source doesn't do a consistently good job here.

Why is that the Sail model's responsibility? If the AsciiDoc sucks, fix the AsciiDoc.

I think we should address inadequacies in the Sail code as well.

@PeterSewell
Copy link
Collaborator

PeterSewell commented Aug 13, 2024 via email

@arichardson
Copy link
Collaborator

While I agree we shouldn't invent names here that are not official, I do think having names at least for the ones where they are known is a useful step towards being able to generate per-instruction documentation from the sail model.

I don't think this change has any effect on the readability (if anything it makes some of the more obscure mnemonics more obvious without having to find them in the ISA reference) so I'd be fine with this change (at least for the ones with existing long names).

@ThinkOpenly
Copy link
Contributor Author

This PR was discussed in the Tech Golden Model meeting today.

What I heard was a general consensus that this additional information is needed, to improve the readability of the Sail model and to facilitate uses downstream.

The concern about adding names which do not come from the ISA Specification should be addressed in parallel: add the names to Sail, since they are ready to go (although see below), and propose adding the same names to the ISA Specification. If those additions result in changed names, then the names in the Sail code can easily be changed to match.

There was discussion about Sail-to-Asciidoc that I did not follow.

The current implementation covers those mnemonics for which the mnemonic is not "constructed", but appears as a fully-specified continuous string. In contrast, there are instruction groups where the set of mnemonics is constructed, like "LOAD":

mapping clause assembly = LOAD(imm, rs1, rd, is_unsigned, size, aq, rl)
  <-> "l" ^ size_mnemonic(size) ^ maybe_u(is_unsigned) ^ maybe_aq(aq) ^ maybe_rl(rl) ^ spc() ^ reg_name(rd) ^ sep() ^ hex_bits_signed_12(imm) ^ "(" ^ reg_name(rs1) ^ ")"

Here, the mnemonic is constructed from "l" + { "", "u" } + { "", ".aq" } + { "", ".rl" }.

Looking just now, it is easy to add annotations conveying the intent of each constructive element (".aq" == "acquire"; ".rl" == "release"). I'll do that now, and push new changes here.

(My out-of-tree Sail JSON backend will need to add code to glue the pieces together appropriately in order to construct each full human-readable instruction name, but that's a simple matter of programming.)

Some instruction groups combinatorically construct mnemonics with various
suffixes:
```
mapping clause assembly = LOAD(imm, rs1, rd, is_unsigned, size, aq, rl)
  <-> "l" ^ size_mnemonic(size) ^ maybe_u(is_unsigned) ^ maybe_aq(aq) ^ maybe_rl(rl) ^ spc() ^ reg_name(rd) ^ sep() ^ hex_bits_signed_12(imm) ^ "(" ^ reg_name(rs1) ^ ")"
```
Here, the mnemonic is constructed from
"l" + { "b", "h", "w", "d" } + { "", "u" } + { "", ".aq" } + { "", ".rl" }.

These mnemonic suffixes should be annotated inline for readability.

In addition human-readable names can be similarly constructed, as in
"load word unsigned acquire release", by concatenating the notes.
@Timmmm
Copy link
Collaborator

Timmmm commented Sep 9, 2024

There was discussion about Sail-to-Asciidoc that I did not follow.

I think the suggestion was that the Asciidoc backend (which actually outputs JSON) would include the annotations you've added, so you (maybe) wouldn't need a custom JSON backend.

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 9, 2024

I disagree with the path forwards. There should be agreement from riscv-isa-manual people before committing to this in the Sail model. Otherwise you'll have to rip them out again if they disagree.

@Alasdair
Copy link
Collaborator

Alasdair commented Sep 9, 2024

We had people from the RISC-V documentation side of things at the meeting today, and we don't think that will be an issue.

I think the suggestion was that the Asciidoc backend (which actually outputs JSON) would include the annotations you've added, so you (maybe) wouldn't need a custom JSON backend.

It actually already does, but I don't think the Asciidoctor plugin has any commands that use them, e.g.

$[name "long name"]
$[simple_attribute]
$[complex_attribute { extra_data = 3, more_data = "some string" }]
val short : unit -> unit

will generate:

  "vals": {
    "short": {
      "val": {
        "source": {
          "contents": "val short : unit -> unit",
          "file": "test.sail",
          "loc": [ 7, 150, 150, 7, 150, 174 ]
        },
        "type": {
          "contents": "unit -> unit",
          "file": "test.sail",
          "loc": [ 7, 150, 162, 7, 150, 174 ]
        },
        "attributes": [
          [ "name", "long name" ],
          "simple_attribute",
          [
            "complex_attribute",
            { "extra_data": 3, "more_data": "some string" }
          ]
        ]
      }
    }
  },

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 9, 2024

Is there a reason why attributes is an array of strings or pairs, rather than being a map from name to (optional) value (with something like true or {} as the value for attributes with no value)?

@Alasdair
Copy link
Collaborator

Alasdair commented Sep 9, 2024

No particular reason, it just seemed like a fairly simple and lossless JSON encoding. A default value of true would erase the difference between $[attr true] and $[attr] for example.

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 9, 2024

Or null, if you're concerned about being able to differentiate those two. I struggle to see when you would, but null vs not present seems even more unlikely to matter.

@Alasdair
Copy link
Collaborator

Alasdair commented Sep 9, 2024

Null would work in the JSON output as the Sail attributes are essentially JSON without null, and a slightly more Sail-ish syntax ('=' rather than ':', and unquoted keys). The array output does preserve the order, which most JSON parsers would not guarantee for an object, and it also allows the same attribute to appear twice, which is legal but of dubious value.

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 9, 2024

If you need to preserve order or support multiple instances of the same output, then yeah, you can't use an object. But if neither of those are requirements then an object is much easier to work with.

@Alasdair
Copy link
Collaborator

Alasdair commented Sep 9, 2024

I could do [simple_attribute null] if the extra consistency makes it easier to work with, but I think in general it should precisely represent the OCaml type (string * attribute_data option) list.

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 9, 2024

I don't think that really helps, the big pain point is having to iterate to find anything (e.g. trying to extract attributes in jq is likely messy).

@Alasdair
Copy link
Collaborator

Alasdair commented Sep 9, 2024

Not sure that making it easier to use with jq is worth losing the ordering information from the source, and jq does have ways to search lists in queries (I think .select(first | contains("attr")) for the more consistent list with null) so it's not even that bad surely?

@ThinkOpenly
Copy link
Contributor Author

FYI, I've added a commit here which adds annotations for the constructed parts of some instruction groups (like LOAD, STORE), as described in my last comment.

I think this at least addresses concerns about "full coverage" for the base instructions.

Whether these names need to go first into the ISA Spec is the only remaining discussion point to my knowledge.

@AFOliveira
Copy link

I find the content of this PR really interesting and I have a question on this:
I have previously tried to do a similar approach to what that back-end is doing in generating the json file and had trouble with how instruction names are defined in SAIL since I could not find any direct standard or correlation. Has this been (or still is) a problem in generating the back-ends?

@ThinkOpenly
Copy link
Contributor Author

I [...] had trouble with how instruction names are defined in SAIL since I could not find any direct standard or correlation.

If you are referring to human-readable instruction names, there are no human-readable instruction names in the RISC-V Sail code. This PR provides one way of doing so.

Further, there aren't even human-readable instruction names in the RISC-V ISA Specification consistently. This is one of the current objections to this PR, that by adding those names in this way, a standard is being implicitly created without whatever proper review would be warranted. Although, it's not completely clear what review is necessary... is a full new ratified version of the spec with instruction names required before adding names to the Sail model, even though it has also been argued that the names are not part of the model proper.

Next steps with this PR are unclear to me.

Has this been (or still is) a problem in generating the back-ends?

The current in-tree backends of the Sail parser, to my knowledge, have no interest in human-readable instruction names. (Or if they do, are starving for lack of them.)

@AFOliveira
Copy link

If you are referring to human-readable instruction names, there are no human-readable instruction names in the RISC-V Sail code. This PR provides one way of doing so.

What I was mentioning was a way to describe instructions in which we know what instruction we are concretely mentioning. Your approach would definitely solve this problem for humans, but being long format wouldn't it hurt machine readibility, at least from a parsing stand point? What I was trying to question is if there are concrete tags in the sail models that are easy to refer to, take this examples:

  • JAL - it is named RISCV_JAL in riscv_insts_base.sail
  • RORIW - it is named RISCV_RORIW in the riscv_insts_zbb.sail
  • C_NOP - it is named C_NOP in the riscv_insts_zca.sail

Do this names follow any standard? Because I am not seing which standard they are following and I believe that if they could be at least EXT_INSTRUCTIONNAME it would be easier to digest the sail code and use it for different things.

@ThinkOpenly
Copy link
Contributor Author

Your approach would definitely solve this problem for humans, but being long format wouldn't it hurt machine readibility, at least from a parsing stand point?

The only purpose for the "long format" is just for human-readable names. They are annotations, like comments but they end up in the AST after parsing, so the backends can see them and make use of them (or not).

What I was trying to question is if there are concrete tags in the sail models that are easy to refer to, take this examples:

  • JAL - it is named RISCV_JAL in riscv_insts_base.sail
  • RORIW - it is named RISCV_RORIW in the riscv_insts_zbb.sail
  • C_NOP - it is named C_NOP in the riscv_insts_zca.sail

Do this names follow any standard?

These were established long before my involvement, so I am not the right person to answer, but I think the answer is an easy "no".

Because I am not seing which standard they are following and I believe that if they could be at least EXT_INSTRUCTIONNAME it would be easier to digest the sail code and use it for different things.

In a previous PR (#506), I changed the way that instructions are "tagged" by their containing extension to hopefully make it easier for a parser backend to determine that. If that is insufficient, let's talk about ways to improve it further.

As for "INSTRUCTIONNAME", that's not so easy given the grouping for similar instructions. For example, both "lui" and "auipc" are in the "UTYPE" definition. What should the single "instruction name" be? :-)

@AFOliveira
Copy link

I now understand the proper reason of this PR, sorry for my misunderstanding. Although, I still think that my point on instruction naming is valid, I also definitely do not have a proper solution for that, I think I'll take some time to better understand if there is a way to make this better and then file an issue or PR on that separate topic to get more opinions on this. All in all, thank you for your explanation, it was really useful and I totally agree with the need for a proper long naming standard on RISC-V instruction documentation.

@jordancarlin
Copy link
Contributor

@ThinkOpenly It looks like the riscv-unified-db project has also been adding long_names to each instruction. It probably makes sense for these to be consistent with whatever ends up being used for the Sail model/ISA manual so it might be worth connecting with that project.

@jrtc27
Copy link
Collaborator

jrtc27 commented Sep 16, 2024

Or that should be your source of truth for the instruction names, not the Sail model, if there's already such a thing...

@ThinkOpenly
Copy link
Contributor Author

@ThinkOpenly It looks like the riscv-unified-db project has also been adding long_names to each instruction. It probably makes sense for these to be consistent with whatever ends up being used for the Sail model/ISA manual so it might be worth connecting with that project.

Thanks for the pointer @jordancarlin. I'm aware of the project, and getting more connected with it. The question is, where are those names coming from?

Or that should be your source of truth for the instruction names, not the Sail model, if there's already such a thing...

I don't understand how that's a better source for names than what I've already proposed. As there is no canonical single source for names, we're all making them up as we go.

FYI, I did pitch a list of names for the base instructions to riscv-isa-manual: riscv/riscv-isa-manual#1638, presuming that is the preferred route. I supposed I should add another column to the table in that issue for the names used by riscv-unified-db, because there are certainly differences.

Copy link
Collaborator

@arichardson arichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be merged once we have agreement on the isa-manual issue that these names are "official"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tgmm-agenda Tagged for the next Golden Model meeting agenda.
Projects
None yet
Development

Successfully merging this pull request may close these issues.