Add function signature based labeling scheme for landing pad #434

Draft: wants to merge 42 commits into base: cfi-prop

Collaborator

@kito-cheng kito-cheng commented Apr 23, 2024

NOTE: This is a chained PR, based on #417.
TODO: We have not yet added a mechanism for generating the right PLT. We may need a specialized section to record the function signature or hash, so that the static linker can generate the right label at PLT entries.


The function signature based labeling scheme follows the "Function types" mangling rule defined in the Itanium C++ ABI.

With a few specific rules:

  • The main function uses the signature (int, pointer to pointer to char) returning int (FiiPPcE).
  • _dl_runtime_resolve uses zero as its landing pad label.
  • C++ member functions should use the "Pointer-to-member types" mangling rule
    defined in the Itanium C++ ABI.
  • Virtual functions in C++ should use the member function type of the base
    class that first defined the virtual function.
  • If a virtual function is inherited from more than one base class, it should
    use the type of the first base class. Thunk functions will use the type of
    the corresponding base class.
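
To make the FiiPPcE example concrete, here is a minimal Python sketch of how a function-type string could be assembled for plain C signatures. It assumes only builtin types and pointers to them; the helper names (`mangle_type`, `mangle_function_type`) are hypothetical, and the member-function, virtual-function, and thunk rules above are not covered.

```python
# Sketch only: Itanium C++ ABI codes for a handful of builtin C types.
BUILTIN_CODES = {
    "void": "v", "char": "c", "signed char": "a", "unsigned char": "h",
    "short": "s", "unsigned short": "t", "int": "i", "unsigned int": "j",
    "long": "l", "unsigned long": "m", "long long": "x",
    "unsigned long long": "y", "float": "f", "double": "d",
}


def mangle_type(c_type: str) -> str:
    """Mangle a builtin type followed by zero or more '*' pointer levels."""
    c_type = c_type.strip()
    depth = 0
    while c_type.endswith("*"):
        depth += 1
        c_type = c_type[:-1].rstrip()
    # Each pointer level contributes one leading 'P'.
    return "P" * depth + BUILTIN_CODES[c_type]


def mangle_function_type(return_type: str, params: list[str]) -> str:
    """Build 'F <return> <params> E'; an empty parameter list becomes 'v'."""
    mangled_params = "".join(mangle_type(p) for p in params) or "v"
    return "F" + mangle_type(return_type) + mangled_params + "E"


# int main(int, char **)
print(mangle_function_type("int", ["int", "char **"]))  # FiiPPcE
```

The string produced this way is what the scheme would then hash to obtain the landing pad label.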

@mylai-mtk

Regarding the mechanism to pass the calculated lpad labels to the static linker for PLT generation, maybe we can just add a symbol for each called function? Currently, the LLVM KCFI mechanism adds similar symbols to allow assembly code to reference labels computed from C function declarations.

This approach has the benefit of being easily human readable when examining the assembly text and object dumps (through the symbol table), and it does not require us to invent a new format for this purpose, which means it can already be accepted by existing assemblers and by compilation pipelines that use independent assemblers.

The downside may be that it adds quite a lot of additional entries to the symbol table, and the data structure of a symbol table entry is a bit too bloated for this purpose. However, if we decide to pass the symbols along in the relocatables instead of fetching them from shared objects at static link time, maybe we can just advise programmers to strip the symbol tables after linking so the program size can be reduced.

@kito-cheng
Collaborator Author

@mylai-mtk thanks for the input. Here are a few options I have in mind:

  1. Mapping symbol scheme: your proposal, generate a special symbol at the same address as the function symbol.
  2. Relocation scheme: similar to the mapping symbol scheme, but insert a new relocation associated with the lpad instruction, pointing to a dummy symbol string that stores the function signature.
  3. Build attribute: build attributes have reserved space to associate an attribute with a symbol.
  4. Customized section which contains an array of ElfNN_SymSig:
typedef struct {
  ElfNN_Half ss_boundto; /* Direct bindings, symbol bound to */
  ElfNN_Word ss_sig;     /* Signature string, string index in the .riscv.ssstr section */
} ElfNN_SymSig;
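
For illustration, a reader of option 4's section might look roughly like the Python sketch below. Everything about the binary layout is an assumption rather than part of the proposal: little-endian encoding, a 16-bit ElfNN_Half, a 32-bit ElfNN_Word, two bytes of C alignment padding between the fields, and NUL-terminated strings in .riscv.ssstr as in a regular ELF string table. The function names and sample contents are hypothetical, and ss_boundto is treated as an opaque integer.

```python
import struct

# Assumed layout: little-endian, 16-bit Half, 2 bytes of C alignment padding,
# then a 32-bit Word. This is not confirmed by the proposal.
SYMSIG = struct.Struct("<H2xI")


def read_string(strtab: bytes, index: int) -> str:
    """Return the NUL-terminated string starting at `index`."""
    end = strtab.index(b"\x00", index)
    return strtab[index:end].decode("ascii")


def parse_symsig_section(data: bytes, strtab: bytes) -> list[tuple[int, str]]:
    """Decode (ss_boundto, signature string) pairs from a raw section."""
    return [
        (boundto, read_string(strtab, sig_index))
        for boundto, sig_index in SYMSIG.iter_unpack(data)
    ]


# Hypothetical contents: two entries whose signatures live at offsets 1 and 9.
strtab = b"\x00FiiPPcE\x00FvvE\x00"
section = SYMSIG.pack(5, 1) + SYMSIG.pack(7, 9)
print(parse_symsig_section(section, strtab))
```

Because each entry is tied to a symbol rather than to an address, this shape would not depend on the symbol being defined.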

However, options 1 and 2 will have problems when dealing with undefined weak symbols*1: there is no address associated with that kind of symbol.

For option 3: a build attribute for a symbol is a fairly good fit for the current usage, but no linker seems to implement it at all, so I'm a little hesitant to choose this option. Since it is bound to a symbol, it could handle undefined weak symbols well in theory.

For option 4: similar to option 3 in many aspects, but it uses a customized section.

*1: IIRC KCFI requires full LTO, so every symbol should be resolved at that stage and undefined weak symbols may not be a problem in that situation, but I am not an expert on that, so please correct me if I am wrong :)

@mylai-mtk

mylai-mtk commented Apr 25, 2024

Hi @kito-cheng, thanks for the reply.

Build attribute: build attributes have reserved space to associate an attribute with a symbol.

Regarding this option, I don't see any structure that looks like the "build attribute" you mentioned that works with symbols in the "Attributes" section of the linked document. Can you be more specific on what existing format we have that you're referring to?

*1: IIRC KCFI requires full LTO, so every symbol should be resolved at that stage and undefined weak symbols may not be a problem in that situation, but I am not an expert on that, so please correct me if I am wrong :)

I don't think we need LTO with KCFI. KCFI works much like Zicfilp, except that it's software emulated. In KCFI, the label used at the call site comes from the called target's apparent signature as seen from the caller, which should be available in C programs most of the time. The only exception I know of is K&R-style functions, which technically do not have signatures and accept a plethora of argument type combinations. However, intentional K&R-style functions are rare IMHO, and should probably use a special label rule if we are to handle them.

However, options 1 and 2 will have problems when dealing with undefined weak symbols*1: there is no address associated with that kind of symbol.

I wasn't talking about generating a symbol at the same address as the function symbol. In KCFI, the label symbols are associated with the called functions by name, e.g. for an int foo(); C declaration, it would introduce a symbol with the assembly code .weak __kcfi_typeid_foo; .set __kcfi_typeid_foo, <label>;, so I don't think symbol addresses matter here.

What I proposed was to add something like .weak __zicfilp_label_foo; .set __zicfilp_label_foo, <label>; for every called but undefined function in the relocatables. This way, the relocatables would always contain the labels required for PLT generation, and these labels would be calculated from the function signatures that the callers think they're calling. Given that in a relocatable we can't have two different global symbols of the same name (can we?), using the name to associate the called function with its label should be strong enough. Please correct me if I've missed something important here 🙇‍♀️
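
A compiler-side emitter for the name-based scheme proposed above could be sketched as follows. The `__zicfilp_label_` prefix comes from the proposal; the function name `label_symbol_directives` and the label value in the usage line are made up for illustration, and the one-directive-per-line output is a choice made here.

```python
def label_symbol_directives(labels: dict[str, int],
                            prefix: str = "__zicfilp_label_") -> str:
    """Emit .weak/.set directives for each called-but-undefined function.

    `labels` maps a function name to the 20-bit label computed from the
    signature the caller sees.
    """
    lines = []
    for name in sorted(labels):
        label = labels[name]
        if not 0 <= label <= 0xFFFFF:
            raise ValueError(f"label for {name} does not fit in 20 bits")
        symbol = prefix + name
        lines.append(f".weak {symbol}")
        lines.append(f".set {symbol}, {label:#x}")
    return "\n".join(lines) + "\n"


# Placeholder label value, not a real hash result.
print(label_symbol_directives({"foo": 0x12345}), end="")
```

Since the symbols are absolute values keyed by name, the static linker could look up `__zicfilp_label_<name>` when it builds the PLT entry for `<name>`, without needing an address for the callee.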

@mylai-mtk

We need a way (a predefined macro?) to know which labeling scheme (simple/complex) is in use for the current compilation, so that assembly files know which label to use. This is needed for libc implementations.

@mylai-mtk

mylai-mtk commented May 7, 2024

(Nitpicking...) I propose that we move away from the name of "complex" when using function signatures as the label content. Though we have a "simple" labeling scheme and it's natural to use its opposite term "complex" to name our new scheme, which is not simple, I think the term "complex" covers too many possibilities and thus reveals too little about what it actually is. Also, avoiding the "complex" term allows us to define more label schemes in the future without the embarrassment of feeling like defining another new "complex" scheme. Based on these points, I propose to name this current new label scheme using the name "func_sig" or "mangled_sig" (the underscore may be removed), which is precise and tells what it really does.

@kito-cheng kito-cheng changed the title Add complex labeling scheme for landing pad Add function signature based labeling scheme for landing pad May 10, 2024
@kito-cheng
Collaborator Author

(Nitpicking...) I propose that we move away from the name of "complex" when using function signatures as the label content. Though we have a "simple" labeling scheme and it's natural to use its opposite term "complex" to name our new scheme, which is not simple, I think the term "complex" covers too many possibilities and thus reveals too little about what it actually is. Also, avoiding the "complex" term allows us to define more label schemes in the future without the embarrassment of feeling like defining another new "complex" scheme. Based on these points, I propose to name this current new label scheme using the name "func_sig" or "mangled_sig" (the underscore may be removed), which is precise and tells what it really does.

Good suggestion, let me rename it to function signature / func_sig :)

@kito-cheng
Collaborator Author

Created a PR for adding the macro: riscv-non-isa/riscv-c-api-doc#76

@kito-cheng kito-cheng force-pushed the complex-label-lp branch 2 times, most recently from e4ae5de to c3c4adc Compare May 10, 2024 09:52
@kito-cheng
Collaborator Author

Changes:

  • Rename complex labeling scheme to function signature based labeling scheme
  • Fix the PLT stubs
  • Add labeling rule for main and _dl_runtime_resolve.
  • Clarify the rule for virtual functions inherited from more than one base class.
  • Add @mylai-mtk as coauthor since he has contributed a lot of useful feedback.

1: lpad <hash-value-for-function>
auipc t3, %pcrel_hi(function@.got.plt)
l[w|d] t3, %pcrel_lo(1b)(t3)
lui t2, <hash-value-for-function>

It's needed for a direct call to the PLT followed by an indirect tail call from the PLT to the target...
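
To show why the stub carries the same value twice, the sketch below encodes the two label-bearing instructions and models the check performed when the stub's indirect jump lands on the target's lpad. It relies on my reading of the Zicfilp specification (lpad uses the AUIPC opcode with rd = x0, the expected label is taken from bits 31:12 of t2/x7, and a zero label skips the comparison); the helper names are hypothetical.

```python
OPCODE_AUIPC = 0x17  # lpad is assumed to reuse this opcode with rd = x0
OPCODE_LUI = 0x37
REG_ZERO = 0
REG_T2 = 7           # t2 is x7


def encode_u_type(opcode: int, rd: int, imm20: int) -> int:
    """Encode a RISC-V U-type instruction: imm[31:12] | rd | opcode."""
    if not 0 <= imm20 <= 0xFFFFF:
        raise ValueError("immediate does not fit in 20 bits")
    return (imm20 << 12) | (rd << 7) | opcode


def encode_lpad(label: int) -> int:
    return encode_u_type(OPCODE_AUIPC, REG_ZERO, label)


def encode_lui_t2(label: int) -> int:
    return encode_u_type(OPCODE_LUI, REG_T2, label)


def labels_match(lpad_insn: int, t2_value: int) -> bool:
    """Model the landing pad check against bits 31:12 of t2."""
    lpad_label = lpad_insn >> 12
    if lpad_label == 0:
        return True  # assumed: a zero label accepts any caller
    return lpad_label == ((t2_value >> 12) & 0xFFFFF)


# Placeholder label; `lui t2, label` leaves label << 12 in t2.
label = 0x12345
print(hex(encode_lpad(label)), hex(encode_lui_t2(label)))
print(labels_match(encode_lpad(label), label << 12))
```

A 20-bit label fits exactly in the U-type immediate, which is why both the lpad at the top of the stub and the lui before the tail call can carry it in a single instruction.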

kito-cheng and others added 2 commits September 6, 2024 21:05
@kito-cheng
Collaborator Author

Changes:

  • Apply @mylai-mtk's comments
    • Update rule for co-variant return type per
    • Update rule for member pointer
    • Fix typo in PLT stuffs
  • Add a rule for the case where the hash result is zero

riscv-elf.adoc Outdated

The label value is derived from the lower 20 bits of the MD5 hash result of the
function signature string. If the lower 20 bits are all zeros, the higher 20
bits are used. If all 32 bits are zeros, the lower 20 bits of the MD5 hash

MD5 results in a 128-bit number, so I guess you mean 'If all 128 bits are zero' here.
But since MD5 gives a 128-bit number, would you consider taking other parts of the number if both hi20(MD5) and low20(MD5) are zero?

Collaborator Author

Oh yeah, let me update the rule. I thought MD5 was 32 bits, but it is 128 bits...

riscv-elf.adoc Outdated
Comment on lines 1555 to 1556
If less than 20 bits are available in the final segment, the highest 20 bits of
the MD5 hash result will be used. If all 128 bits are zeros, the lower 20 bits

@mylai-mtk mylai-mtk Sep 27, 2024

Would you mind if we use the lowest 8 bits (zero-extended to 20 bits) of the final segment in case the lowest 120 bits are all zero? This saves an additional 12-bit logical left shift (and some bookkeeping) if the following algorithm is used to implement this paragraph:

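/* 128-bit MD5 digest; unsigned __int128 is a GCC/Clang extension. */
unsigned __int128 MD5 = ...;
while (MD5) {
  /* Return the lowest non-zero 20-bit segment. */
  if (MD5 & 0xFFFFF) return MD5 & 0xFFFFF;
  MD5 >>= 20; /* the final segment holds only 8 bits, zero-extended */
}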
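
A runnable Python rendering of the same loop, starting from the signature string, might look like the sketch below. Two points are assumptions rather than anything settled in this thread: the 128-bit digest is read as a big-endian integer, and the all-zero digest case raises an error because its fallback rule is not spelled out here. The function names are hypothetical.

```python
import hashlib

LABEL_MASK = 0xFFFFF  # landing pad labels are 20 bits wide


def label_from_digest(digest: int) -> int:
    """Return the lowest non-zero 20-bit segment of a 128-bit digest.

    128 = 6 * 20 + 8, so the final segment holds only 8 bits and is
    naturally zero-extended by the shift.
    """
    while digest:
        if digest & LABEL_MASK:
            return digest & LABEL_MASK
        digest >>= 20
    raise ValueError("all-zero digest: fallback rule not covered by this sketch")


def label_from_signature(signature: str) -> int:
    """Hash a mangled function-type string and derive its label."""
    raw = hashlib.md5(signature.encode("ascii")).digest()
    # Assumption: interpret the 16 digest bytes as a big-endian integer.
    return label_from_digest(int.from_bytes(raw, "big"))


print(hex(label_from_signature("FiiPPcE")))
```

With this shape, no special case is needed for the short final segment: after six shifts only the top 8 bits remain, and masking with 0xFFFFF returns them unchanged.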

Collaborator Author

Yeah, I don't think this makes much difference, let me update that later :P

kito-cheng and others added 22 commits October 9, 2024 20:39
fix riscv-abi.adoc format
Define two bits for landing pad and shadow stack; we plan to define a
third bit `GNU_PROPERTY_RISCV_FEATURE_1_CFI_LP_COMPLEX` for the complex
labeling scheme.
Changes:
- Rename `GNU_PROPERTY_RISCV_FEATURE_1_CFI_LP_SIMPLE` to `GNU_PROPERTY_RISCV_FEATURE_1_CFI_LP_UNLABELED`
- Fix wrong offset in the first PLT stubs for the simple landing pad PLT.
The function signature based labeling scheme follows the "Function types"
mangling rule defined in the Itanium C++ ABI.

With a few specific rules:

- The `main` function uses the signature
   `(int, pointer to pointer to char) returning int` (`FiiPPcE`).
- `_dl_runtime_resolve` uses zero as its landing pad label.
- {Cpp} member functions should use the "Pointer-to-member types" mangling rule
  defined in the _Itanium {Cpp} ABI_ <<itanium-cxx-abi>>.
- Virtual functions in {Cpp} should use the member function type of the base
  class that first defined the virtual function.
- If a virtual function is inherited from more than one base class, it should
  use the type of the first base class. Thunk functions will use the type of
  the corresponding base class.

Co-authored-by: Ming-Yi Lai <[email protected]>
Changes:
- Rename complex labeling scheme to function signature based labeling scheme
- Fix the PLT stubs
- Add labeling rule for `main` and `_dl_runtime_resolve`.
- Clarify the rule for virtual functions inherited from more than one base
  class.
- Special rule for the return type of member functions.
- Special rule for class destructors.
- <exception-spec> should be ignored.
- Static functions should follow the same rules as normal functions.
- wchar_t is platform dependent.
- Functions with an empty parameter list are treated as `void` (`v`).
- Add note to mention covariant return types
Use a zero-extended value if fewer than 20 bits remain