Update to v4 WIP #2
Conversation
Also fixes CI not running on all branches. Co-authored-by: Anthony Ramine <[email protected]>
GitHub Actions runners are not guaranteed to have the necessary CPU features for these tests to work. Uses a `--target x86_64-unknown-linux-gnu` directive when compiling so the `target_feature` flags don't apply to build scripts.
* Docs unlink from dalek.rs
* Link KaTeX assets to jsdelivr
Co-authored-by: Michael Rosenberg <[email protected]>
… (dalek-cryptography#231) Rust editions 2018+ do not require `extern crate` except for linking `alloc` and `std`.
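For illustration, a minimal sketch of that exception (a hypothetical `no_std` crate root, not code from this PR):

```rust
// Editions 2018+ resolve dependencies from Cargo.toml without `extern crate`,
// but linking the `alloc` sysroot crate in a no_std library still requires
// an explicit declaration.
#![no_std]

extern crate alloc;

use alloc::vec::Vec;

pub fn buffer() -> Vec<u8> {
    Vec::new()
}
```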
As suggested in dalek-cryptography#453, it is sometimes desirable to select the backend bits via an override. This change provides `cfg(curve25519_dalek_bits)` to override the bits used in the serial or fiat target backends.
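As a sketch of how such a cfg override is typically consumed (the limb layouts match the crate's serial backends, but the type name here is invented), with the flag set via something like `RUSTFLAGS='--cfg curve25519_dalek_bits="32"' cargo build`:

```rust
// Hypothetical consumer of the override: pick the limb representation from
// the --cfg flag rather than from a cargo feature.
#[cfg(curve25519_dalek_bits = "32")]
pub type FieldLimbs = [u32; 10]; // 32-bit serial backend representation

#[cfg(curve25519_dalek_bits = "64")]
pub type FieldLimbs = [u64; 5]; // 64-bit serial backend representation
```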
As proposed in dalek-cryptography#442 this makes `rand_core` an optional feature that is not covered by the SemVer public API stability guarantees. Co-authored-by: Michael Rosenberg <[email protected]>
As proposed in dalek-cryptography#442 this makes `digest` an optional feature that is not covered by the SemVer public API stability guarantees. Co-authored-by: Michael Rosenberg <[email protected]>
Gated random() constructors on cfg(test)
… (dalek-cryptography#451) Addresses issue dalek-cryptography#448: `Scalar::bits` may leave unzeroed bits on the stack
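A rough sketch of the fix pattern, assuming the `zeroize` crate (the function and bit layout below are illustrative, not the crate's actual code):

```rust
use zeroize::Zeroize;

// Illustrative only: a stack temporary derived from secret scalar bytes is
// wiped before the function returns, instead of being left behind.
fn secret_bit_weight(bytes: &[u8; 32]) -> u32 {
    let mut bits = [0u8; 256];
    for (i, bit) in bits.iter_mut().enumerate() {
        *bit = (bytes[i >> 3] >> (i & 7)) & 1;
    }
    let weight = bits.iter().map(|&b| u32::from(b)).sum();
    bits.zeroize(); // the concern in dalek-cryptography#448: no secret bits on the stack
    weight
}
```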
… (dalek-cryptography#461) Previously `cargo test --no-default-features` would succeed but with warnings. This commit fixes all of those warnings and tests `--no-default-features` in CI to ensure it stays that way in perpetuity.
* Restructure README and CHANGELOG
* Explain SemVer policy
* Specify feature flags and backends more explicitly
* Remove nightly from CI because it didn't belong there
* Add @pinkforest to thank-you list
Co-authored-by: pinkforest <[email protected]>
… (dalek-cryptography#455) Crate features are intended to be additive, whereas only one of N possible backends can be selected. Features can also be activated by transitive dependencies, which leads to the problem of different dependencies selecting conflicting backends. Using `--cfg` instead moves all backend selection control to the toplevel executable. This commit switches to the following RUSTFLAGS to enable backends:
- `--cfg curve25519_dalek_backend="fiat"`: uses `fiat-crypto`
- `--cfg curve25519_dalek_backend="simd"`: uses nightly-only SIMD
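A sketch of the shape of the resulting dispatch (the module layout and function are stand-ins, not the crate's actual structure):

```rust
// Stub modules standing in for the real backend implementations.
mod backend {
    pub mod serial {
        pub mod fiat {
            pub fn field_mul() { /* fiat-crypto code paths */ }
        }
        pub fn field_mul() { /* plain serial code paths */ }
    }
    pub mod vector {
        pub fn field_mul() { /* nightly-only SIMD code paths */ }
    }
}

// Backend selection is resolved once from --cfg flags set by the toplevel
// build; transitive dependencies cannot flip it the way additive features can.
#[cfg(curve25519_dalek_backend = "fiat")]
use backend::serial::fiat as active_backend;

#[cfg(curve25519_dalek_backend = "simd")]
use backend::vector as active_backend;

// Fall back to the plain serial backend when no --cfg is given.
#[cfg(not(any(curve25519_dalek_backend = "fiat", curve25519_dalek_backend = "simd")))]
use backend::serial as active_backend;

fn main() {
    active_backend::field_mul();
}
```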
… (dalek-cryptography#465) build.rs was using cfg(target), but it has to evaluate the target from the TARGET environment variable: in a build-script context, cfg(target) describes the build host, not the target. This change fixes the automatic curve25519_dalek_bits determination with the help of the platforms crate. As discussed in dalek-cryptography#456, this also prepares well-known defaults for the wasm and arm serial backends via cfg(curve25519_dalek_bits = "64"). If wasm32 or armv7 are going to be u64 serial by default, those defaults will be followed up on later.
… (dalek-cryptography#236) curve25519-dalek:
- Enables `digest` and `rand_core` features
- Removes transitive `nightly`, `simd_backend`, and `std` features
ed25519:
- `AsRef` impl for `Signature` has been removed; uses `to_bytes`
- Uses `try_from` for `InternalSignature` conversion
This is a convenience/marker trait for types which impl `CryptoRng` + `RngCore`, which makes the type signatures a little more readable. It was introduced in `rand_core` v0.6.4 (now pinned as the minimum version).
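A small sketch of the readability win (the function names here are invented; `CryptoRngCore` and its blanket impl are real `rand_core` 0.6.4 items):

```rust
use rand_core::{CryptoRng, CryptoRngCore, RngCore};

// Before: both bounds spelled out at every use site.
fn fill_old<R: RngCore + CryptoRng>(rng: &mut R, buf: &mut [u8]) {
    rng.fill_bytes(buf);
}

// After: the marker trait collapses them into one name. A blanket impl
// covers every type that already implements RngCore + CryptoRng.
fn fill_new<R: CryptoRngCore>(rng: &mut R, buf: &mut [u8]) {
    rng.fill_bytes(buf);
}
```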
… (dalek-cryptography#467) Co-authored-by: Michael Rosenberg <[email protected]>
… (dalek-cryptography#470) For the field element types `FieldElement` and `Scalar`, use inherent constants instead of (non-const) functions to return these constant values. The original functions likely predate support for inherent constants; now that it's available, constants are a better fit for these sorts of values.
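A minimal sketch of the shape of that change (the type's contents are simplified; the real limb layout differs):

```rust
// A simplified stand-in for the field types.
#[derive(Clone, Copy)]
pub struct Scalar {
    bytes: [u8; 32],
}

impl Scalar {
    // Old style: a plain function, unusable in const contexts.
    pub fn zero() -> Scalar {
        Scalar { bytes: [0u8; 32] }
    }

    // New style: an inherent constant, usable wherever a const is required.
    pub const ZERO: Scalar = Scalar { bytes: [0u8; 32] };
}

// For example, a constant table can be built from the constant, not from zero().
pub const SCRATCH: [Scalar; 4] = [Scalar::ZERO; 4];
```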
… (dalek-cryptography#472) This is helpful for implementing `ff::PrimeField::from_repr`. Also changes `Scalar::is_canonical` to return `Choice`.
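Roughly how those two pieces compose, as a hedged sketch (the `Scalar` type and predicate below are simplified stand-ins; `Choice` and `CtOption` are the real `subtle` types):

```rust
use subtle::{Choice, CtOption};

#[derive(Clone, Copy)]
struct Scalar {
    bytes: [u8; 32],
}

impl Scalar {
    // Stand-in predicate: the real check compares against the group order in
    // constant time. Returning Choice instead of bool avoids branching on
    // secret data at the call site.
    fn is_canonical(&self) -> Choice {
        Choice::from(((self.bytes[31] & 0x80) == 0) as u8)
    }
}

// The shape ff::PrimeField::from_repr wants: a CtOption that is "none"
// (decided in constant time) when the encoding is non-canonical.
fn from_repr(repr: [u8; 32]) -> CtOption<Scalar> {
    let s = Scalar { bytes: repr };
    CtOption::new(s, s.is_canonical())
}
```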
Co-authored-by: str4d <[email protected]>
The backends can be overridden via `--cfg` flags in RUSTFLAGS. Features are additive, and the problem was signalling them across the dependency chain: often backends could not be used because transitive dependencies didn't set the features correctly, and when two transitive dependencies enabled conflicting backends, the worst case was a mis-compile. See the rationale in this long discussion: dalek-cryptography#414 EDIT: I see a reply here, oops: dalek-cryptography#632
@pinkforest I apologize, I missed the discussion from dalek-cryptography#414 when I filed dalek-cryptography#632. I was mistaken about how cfg flags are interpreted by cargo. I was trying to keep this fork as close to upstream as possible, so I introduced the cfg flags we needed and switched away from features for specifying dependencies. I was unable to get our optional dependencies to be included when gating them with cfg, but it turned out I was misspecifying them. We're using this to cross-compile for riscv, so I needed to put the rustflags under a target-specific section in `.cargo/config`.
@bunnie Okay. I could use some help finishing this. I believe that I set everything up so that we can now select our backend using cfg flags. For example, if you add …
to the `.cargo/config` in the …
Sure thing, I'll have a look at it soon!
I'd like to sanity check my environment against how you're testing. This is what I'm doing from a base …
I then run … In order to get this to build, I had to add the following to your patch under the …
I also get some warnings like this: …
Does that look consistent with your environment?
Porting this in a manner similar to that used by the sha2 crate, removing the explicit Xous dependency link. See the PR for discussion of issues.
and also make the internal functions use the same conventions
OK, that was exciting. The good news is, I have it working in Renode. The other probably-good news is, I think some performance problems that face the crate are overcome. The bad news is, we have an API problem to solve.

So, the original problem was that the … The problem with that was that it was totally incompatible with integrating into the curve25519-dalek crate. The problem @kotval hit in integration was exactly that: how do you map the registers of the engine into the userspace of the caller using the engine? This rework does the following: …

This is terrible for obvious reasons, but let's first get into why I'm doing this. Previously, the basic check test took 1250ms per iteration. With this optimization, it now takes... 48.9ms. A 25x speedup. The Diffie-Hellman benchmark goes from 12.3ms per iteration to 8.36ms per iteration: a 47% speedup. Not as impressive, but that's because a lot of effort had already gone into hand-coding the DH case to speed it up specifically.

Why is it so much faster? It's faster because if the caller can assume it exclusively owns the hardware, you don't have to use any syscalls to access it. You just deref a pointer, instead of, say, passing memory messages around to a helper in another process space that copies your data and does the pointer deref for you.

So, I think this is a Good Thing, and generally, because we have only one application at a time really using Curve25519, the hardware resource lock problem is probably something we can manage. I don't think the current solution on the table should be the final one -- we'll definitely trip into panics. But it at least sets the stage for this next discussion: how should we implement the locks? I see two obvious approaches: …

So the basic tension is: a "clean API" where we're using implicit hardware free on … A "dirty API" makes it less nice to use the crate, but we have more explicit control over external factors like suspend and resume, and locking policies. My current inclination is maybe to try for the "clean API" method, and: …
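As a minimal sketch of what the implicit-free direction could look like, assuming the mapping is released in `Drop` (every name below is hypothetical; the real driver would go through the Xous memory-mapping syscalls):

```rust
// RAII guard that owns the engine's register mapping and releases it
// automatically when dropped.
pub struct EngineGuard {
    csr_base: *mut u32, // mapped hardware register window
}

impl EngineGuard {
    pub fn acquire() -> Option<EngineGuard> {
        // Fails (returns None) if another process currently holds the mapping.
        try_map_engine_registers().map(|csr_base| EngineGuard { csr_base })
    }
}

impl Drop for EngineGuard {
    fn drop(&mut self) {
        // Implicit hardware free: unmapping hands the engine back.
        unmap_engine_registers(self.csr_base);
    }
}

// Stubs standing in for the actual syscall wrappers.
fn try_map_engine_registers() -> Option<*mut u32> {
    None // always fails in this sketch
}
fn unmap_engine_registers(_csr_base: *mut u32) {}
```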
Either way, to get @kotval moving forward, I would propose we do the following: …

Thoughts?
@xobs, would love to have your feedback on the above comment.
I'm not following how the API is implemented, and I only see …

Isn't the list of use cases of Engine25519 relatively limited? Would it make more sense to bundle the microcode programs into …?

The current approach looks like it invokes the allocator on every operation, which is known to not be very fast. A common pattern is to allocate an array on the stack and copy onto that. I'm not sure how long these operations will be in use, and it sounds like the operation is heavyweight enough that a process will have exclusive access to the hardware for an extended period of time.

If you'd like to transfer ownership of the actual hardware, which it sounds like you'd prefer, you might consider passing it via a …
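To make the stack-buffer suggestion above concrete, a minimal sketch (the size bound and function name are invented for illustration):

```rust
// Copy the microcode into a fixed-size stack array instead of heap-allocating
// on every operation.
const MAX_MICROCODE_WORDS: usize = 1024; // hypothetical upper bound

fn submit_microcode(program: &[u32]) {
    assert!(program.len() <= MAX_MICROCODE_WORDS, "program too large");
    let mut buf = [0u32; MAX_MICROCODE_WORDS];
    buf[..program.len()].copy_from_slice(program);
    // ... hand `buf` (or the copied prefix) to the hardware ...
    let _ = &buf;
}
```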
Hmm, I think maybe I was not clear. The recommendations you laid out match up with what we were doing prior to these changes. However, it's pretty inefficient and heavyweight to do it that way. So, previously, this is how it went: …

Steps 3-7 repeat. The problem is that this is very heavyweight, and in reality only one operation (scalar multiplication) was ever accelerated -- everything else ran locally in software, because by the time you did all the locking, lending, executing, and returning, most 25519 ops could have been executed on the CPU -- except the super-heavyweight ones like a scalar multiply. The new proposal is this: …

In this case, a single process can have multiple threads share the same map, and everything will run fast. This is similar to how we track CID count on objects, except we don't blow up if disconnect (unmap) is called, because we can always just get the mapping back.

The bugaboo is if two separate processes both want accelerated Curve25519 operations, the code keeps the mapping live for the entire duration, and the two operations depend upon each other (like a client and server talking to each other from different process spaces). In this case, you can deadlock, but that can be avoided by manual calls to unmap the hardware after a message has been sent (for example).

But overall this is very fast, because mapping memory to CSR and hard-wired hardware memory windows is fast: you aren't searching the allocation table, the address is fully specified, so it's just updating the PTE. The memory isn't even zeroized, because it's not main memory.

It could even be viable to map/unmap after every arithmetic operation. This would entirely avoid the possibility of deadlock from objects living too long, I think, but at the price of doing the map/unmap on every op. Maybe that's something I should just try and see how much it impacts the benchmark... if it's relatively cheap, then it's the safest way to solve the problem, because then you have no shared state: you just keep retrying MapMemory, yielding your quantum on failure so that the other process can make progress and release the engine for you to take.
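A minimal sketch of that retry-with-yield shape (`try_map_engine`, `unmap_engine`, and `yield_quantum` are hypothetical stand-ins for the underlying Xous MapMemory/UnmapMemory/Yield syscall wrappers):

```rust
// Map the engine for exactly one operation, then release it. No shared state
// outlives the call, so objects living too long can't cause deadlock.
fn with_engine<T>(op: impl FnOnce(*mut u32) -> T) -> T {
    let csr_base = loop {
        match try_map_engine() {
            // Mapping succeeded: this process now owns the hardware window.
            Some(base) => break base,
            // Another process holds it: give up our quantum so it can finish.
            None => yield_quantum(),
        }
    };
    let result = op(csr_base);
    // Release immediately so other processes can map the engine.
    unmap_engine(csr_base);
    result
}

// Stubs standing in for the real syscall wrappers.
fn try_map_engine() -> Option<*mut u32> {
    None // always fails in this sketch
}
fn unmap_engine(_csr_base: *mut u32) {}
fn yield_quantum() {}
```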
I believe this is roughly what exists now, correct? Though the …

The proposed solution is this: …

As you say, this can be made faster by fully-specifying the virtual and physical arguments to `MapMemory`. As an aside, there are a few changes I'd make to …
With this, you could hardcode the syscalls using … I suspect that would improve performance, but benchmarks would be needed. However, for expedience the approach you listed can work as well, and if performance is so much better, then perhaps it's the correct one.
The current approach is almost correct, except that … The low-level driver is now part of the curve25519-dalek backend implementation. This also makes the backend a little easier to maintain, under the theory that it's all self-contained, similar to how we eliminated engine-sha2 and rolled all the drivers into the sha2 crate. In the current implementation, … The only reason …
btw, the current approach also has the advantage of dropping engine25519 as a separate process, freeing up another PID for user programs, and I think it should also reduce net code space, even though it means we're including curve25519 drivers in every app that uses the …
Okay, this works for me, and I can keep working on Signal. I like this approach of not having engine-25519, for the same reasons that I liked getting rid of engine-sha2, i.e. not forcing other applications to be updated because of cargo patching's transitivity. But the question of locking is interesting, and something I hadn't thought about when I started this. I really like the idea of just mapping/unmapping to acquire a lock on hardware; I am very curious how that performs, and how you would go about measuring it.

It is interesting to think about perhaps having two types of drivers: one is a process which arbitrates use of hardware via a message-passing API, and the other is that each process directly includes a driver and uses locks to ensure exclusive access to the hardware. How would you pick between the two for a particular driver? Presumably, if your hardware has some state, like a ring buffer of messages, that state should be managed by only one process, but for things like this and sha2, relying on explicit unlocking seems fine. I still need to understand how we do scheduling and syscalls better to really understand the difference.

For what it is worth, not having engine-25519 means we don't have to patch curve25519-dalek to build apps, which means that each service could in principle have a different version of curve25519-dalek. While we shouldn't be flippant about letting cargo put multiple versions in a build, as that would take up lots of space, given that the main use of hardware acceleration for curve25519 is relatively limited, we could have things which don't need acceleration depend directly on the mainline version, and things which want the acceleration depend on this version. It's not quite as good as "try hw and fall back to software", but it should work now if needed.
It should be possible to enable this function via feature selection.
This is still WIP and not ready to merge. While it is a relatively major refactor in terms of files changed (literally every file), semantically it is fairly straightforward. There were minimal changes to curve25519-dalek itself, though the structure of the backends changed and they moved to using Cargo workspaces. An acceptance test that this is ready to merge is, at a minimum, passing the wycheproof test in xous-core. Care should also be taken to avoid breaking hosted mode.