Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experiment: add flag to enable scalar tagging #4400

Conversation

crusso
Copy link
Contributor

@crusso crusso commented Feb 17, 2024

This modifes #4369 to gate the tagging behind a flag, initially disabled, so we can check in the code and enable it later.

The basic idea is to use all zero tags, widen the representation compact values to what they were before and explicitly
optimise clearing the tag when all zero.

This seems to actually mostly improve perf compared to master:

dfinity/canister-profiling#110

I expect this is because we actually gain a useful bit in the compact Nat32 and Nat64 representation over master, but who knows.

perf

https://docs.google.com/spreadsheets/d/1zC2Hsl9gGUzJESQmSABPiu-XIsICEw1I3O-JKHNWVQs/edit?usp=sharing

test/perf

Master     Widening   Widening vs Master   Gated   Gated vs Master
gas/assetstorage 10013950   gas/assetstorage 10013950 0.00%   gas/assetstorage 10013950 0.00%
size/assetstorage 186455   size/assetstorage 186705 0.13%   size/assetstorage 186520 0.03%
gas/dao 4413634512   gas/dao 4413744976 0.00%   gas/dao 4413743944 0.00%
size/dao 265797   size/dao 266385 0.22%   size/dao 265922 0.05%
gas/qr 1302744688   gas/qr 1305067118 0.18%   gas/qr 1302750018 0.00%
size/qr 256049   size/qr 256925 0.34%   size/qr 256285 0.09%
gas/reversi 80920993   gas/reversi 81019001 0.12%   gas/reversi 80927129 0.01%
size/reversi 175956   size/reversi 176421 0.26%   size/reversi 176084 0.07%
gas/sha224 460197621   gas/sha224 498978947 8.43%        
size/sha224 191929   size/sha224 192859 0.48%        
gas/sha256 14487063673   gas/sha256 15568532694 7.47%   gas/sha256 14486916565 0.00%
size/sha256 179075   size/sha256 180167 0.61%   size/sha256 179223 0.08%

test/bench

Master     Widening   Widening vs Master   Gated   Gated vs Master
gas/alloc 9,243,068,120.00   gas/alloc 10,350,366,461.00 11.98%   gas/alloc 9243068126 0.00%
size/alloc 181,066.00   size/alloc 180,759.00 -0.17%   size/alloc 180464 -0.33%
gas/bignum 130,604,743.00   gas/bignum 130,606,013.00 0.00%   gas/bignum 130604779 0.00%
size/bignum 184,420.00   size/bignum 184,093.00 -0.18%   size/bignum 183790 -0.34%
gas/heap-32 1,610,218,447.00   gas/heap-32 1,695,702,521.00 5.31%   gas/heap-32 1609469958 -0.05%
size/heap-32 182,167.00   size/heap-32 181,856.00 -0.17%   size/heap-32 181556 -0.34%
gas/nat16 61,393,031.00   gas/nat16 65,587,813.00 6.83%   gas/nat16 61393019 0.00%
size/nat16 181,010.00   size/nat16 180,727.00 -0.16%   size/nat16 180408 -0.33%
gas/palindrome 10,131,340.00   gas/palindrome 10,133,866.00 0.02%   gas/palindrome 10131268 0.00%
size/palindrome 185,338.00   size/palindrome 185,024.00 -0.17%   size/palindrome 184695 -0.35%
gas/region0-mem 6,402,149,937.00   gas/region0-mem 6,452,495,054.00 0.79%   gas/region0-mem 6402149955 0.00%
size/region0-mem 181,898.00   size/region0-mem 181,602.00 -0.16%   size/region0-mem 181281 -0.34%
gas/region-mem 5,974,331,587.00   gas/region-mem 6,024,676,752.00 0.84%   gas/region-mem 5974331605 0.00%
size/region-mem 181,539.00   size/region-mem 181,252.00 -0.16%   size/region-mem 180931 -0.33%
gas/stable-mem 3,885,566,188.00   gas/stable-mem 3,935,898,195.00 1.30%   gas/stable-mem 3885566206 0.00%
size/stable-mem 181,896.00   size/stable-mem 181,600.00 -0.16%   size/stable-mem 181279 -0.34%
gas/xxx-nat32 57,198,791.00   gas/xxx-nat32 57,199,237.00 0.00%   gas/xxx-nat32 57198779 0.00%
size/xxx-nat32 181,001.00   size/xxx-nat32 180,694.00 -0.17%   size/xxx-nat32 180399 -0.33%

@crusso crusso added the build_artifacts Upload moc binary as workflow artifacts label Feb 18, 2024
Copy link

github-actions bot commented Feb 18, 2024

Comparing from 7e2421f to 16fa558:
In terms of gas, 3 tests regressed, 1 tests improved and the mean change is +1.0%.
In terms of size, 5 tests regressed and the mean change is +0.1%.

Copy link

github-actions bot commented Feb 18, 2024

Download the artifacts for this pull request:

@crusso crusso marked this pull request as draft February 18, 2024 23:25
@crusso crusso changed the base branch from master to claudio/small-tags-final-untagged-widened February 20, 2024 10:27
@crusso crusso marked this pull request as ready for review February 20, 2024 10:30
Copy link
Contributor

@ggreif ggreif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

@crusso crusso merged commit 0c1faed into claudio/small-tags-final-untagged-widened Feb 22, 2024
9 checks passed
mergify bot pushed a commit that referenced this pull request Feb 23, 2024
The Motoko runtime representation of values is largely untyped, distinguishing only between scalar and boxed values 
 using a single bit of the 32-bit value representation.  The tagging is only to support garbage collection, not precise runtime type information.

In the existing value encoding, a Motoko value in vanilla form is a  32-bit value that is either:
* false `(0b0)`,
* true `(0b1)`, 
* a word-aligned (encoded) pointer to a heap allocated value.
  Encoded by subtracting 1 from the pointer value (ensuring the 2 LSBs are 0b11), pointing
  heap allocated value
* null (some well-known skewed pointer).
* a 31-bit scalar value, stored in the top bits of the value with LSB 0.

Scalar values encode `Nat8/16` and `Int8/16` values and chars, and 31-bit subranges of `Nat32`, `Int32`, `Nat64`, `Int64`, `Nat` and `Int`. Large integer values that don't fit  in a 31-bit scalar are boxed on the heap.

Observe that, in Motoko, some types are always scalar (eg. `Nat8`), some types are always boxed (e.g. `Blob`), and some types have a mixed scalar/boxed representation (e.g. `Nat32` and `Nat`), depending on the size of the value.

This PR adds exact runtime type information to all[*] scalar values, making the scalar values self describing.
Making the _entire_ heap fully self-describing requires refining the heap tags use to identify heap objects, distinguishing boxed `Nat32` from boxed `Int32`, `Blob` from `Principal` and `Text`, tuples from (mutable and immutable) arrays etc. That work of refining heap tags will need to be completed in a follow on or sibling PR, but is hopefully less involved than the changes herein.

To add precise scalar type info, we extend the scalar tagging scheme with a richer set of (inline) type descriptors, using some of the least significant bits of the 31-bit scalar representation.

To avoid dedicating a fix-length suffix (say 1 byte) to the scalar tag, scalar tags are actually variable length, using shorter tags for larger payload types, and longer tags for shorter payload types. This gives us a reasonable tag space (set of possible tags, some still unused), without  reducing the scalar range of mixed representation types too much.

At one extreme, the tag of `Int` (and `Nat`) is just `0b10`, leaving a 30-bit payload for compact `Nat/Int`, losing just `1` bit from the current representation's 31-bit compact range. This is important because `Int`s are common, and `Nat`s are used to index arrays, so we should avoid boxing more than necessary.

In the middle, the tag of `Nat16`, `Int16` is `0b10(0^12)00` and `0b11(0^12)00`, leaving a 16-bit payload in the MSB.

At the other extreme, the tag of the unit value, `()`, is 32-bit `0x01(0^28)00`, occupying the entire value.

The primary motivation of this work is to support value, not type driven, serialization of stable values to a precisely typed stable format, without loss of type information, so that upgrades can still accommodate type dependent changes of representation from one in-memory format to another. Secondary motivations are live and post-mortem heap inspection tools and light-weight debugging tools, that can parse values in locals, arguments and on the heap using tags.

[*] There remain some raw, untagged 31-bit scalars whose type is only known to the compiler. These are used to encode the state of text and blob iterators, hidden in dedicated iterator closure environments. Note that these are not stable types, so need not be precisely tagged for stabilization.

# Tagging Scheme

   | Value | Type | Payload bits |
   |-------| ------| --------------|
   | `((O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))` | TBool (* false *) | 0 |
   | `((O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,I))`   | TBool (* true *) | 0 |
   | `((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,I,I))`                                         | TRef    |    30       |  
   | `((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,I,O))`                                       | TNum |  30        |
   | `((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,O,I,O,O))`                                     | TNat64 | 28       |
   | `((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,I,I,O,O))`                                       |  TInt64  |  28      |
   | `((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,O,I,O,O,O))`                                    | TNat32 |  27      |
   | `((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,I,I,O,O,O))`                                     | TInt32   | 27        |
   | ... unused tags ....                             | ...   | ...       |
   | `((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (_,_,_,_,_,O,I,O), (O,O,O,O,O,O,O,O))` | TChar | 21 |
  | ... unused tags ....                             | ...   | ...       |
   | `((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (O,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))`| TNat16 | 16 |
   | `((_,_,_,_,_,_,_,_), (_,_,_,_,_,_,_,_), (I,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))` | TInt16 | 16 |
  | ... unused tags ....                             | ...   | ...       |
   | `((_,_,_,_,_,_,_,_), (O,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))` | TNat8 | 8 |
   | `((_,_,_,_,_,_,_,_), (I,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))` | TInt8 | 8 |
  | ... unused tags ....                             | ...   | ...       |
   | `((O,I,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O), (O,O,O,O,O,O,O,O))` |  TUnit | 0 |

# Implementation

The implementation was carried out in a number of precursor PRs:

* #4098: Added 1-byte tags to small  values, untagging an retagging on every operation, with many code changes.
* #4278: Made the payload/tag size for scalar values configurable using a fixed compile time constant.
* #4322: Added tags to compact `Nat32/Int32` and `Nat32/Nat64`, making the payload size type-dependent.
            The previously untyped _StackReps_ `UnboxedWord32` and `UnboxedWord64` were extended to carry a type 
argument. The argument is used to remember and re-introduce the precise tag on unboxing and boxing. 
            It can also be used to verify the tag on unboxing, for sanity checking.
* #4345: Tag compact Int and Nat (both as Int due to subtyping) 
* #4353: Extended the range of compact `Int/Nat` from 29 to 30-bit, by adjusting the tagging scheme. This is just 1 bit less
            than with the existing scheme (31-bit, untagged scalars).
* #4354: Improved the tagging scheme to use the longest possible tags for the required payload size, upping the ranges of unused tags (for future use)
* #4357: Merge with master, fixs bugs in sanity checking of tags. Fix bugs revealing by more stringent sanity checks.
* #4363: Uses the `UnboxedWord32/Word64` stack reps also for untagged, 0-right-padded  small tagged values, 
           tagging/untagging only on exit to and from stack. 
           This alone reduces the (large) 80% overhead in bench/nat16.mo to 55%.
           It also has the advantage of reverting almost all changes to the arithmetic code, 
           which can now (again) assume values are right, 0-padded as it did previously,
* #4369: (this PR) does a small tweak so that mutable locals containing small tagged values in untagged form, extending
           the existing optimization done for mutable locals containing unboxed `Nat32`/`Int32` and `Int64`/`Nat64`.
           This reduces the `bench/nat16.mo` overhead from 55% to just 6% (the benchmark use repeated in-place updates in a tight loop so benefits greatly).
           This PR also makes use of the previously unused bit in the the compact representation of `Nat32s` and `Nat64s` which previously had to concur with the representation of `Int32` and `Nat64` and could only represent half the unsigned range.
           With the typed StackRep, we now know whether the values are signed or not and can choose distinct compact
           representation for `Nat32` vs `Int32`, and `Nat64` vs `Int64` rather that shared ones.
           Note however, that the compact representation for `Nat` cannot recover the missing bit because of subtyping.
           A compact `Nat` **must** have the same representation as a compact `Int` to support non-coercive subtyping.
* #4375 (incoming): rewrite array iter optimization to respect compact bignum representation invariants.
* #4400 : gate feature behind `Mo_config.Flags.rtti (default off)`, avoiding overhead for now.
* added (unadvertised) flag `--experimental-rtti` to enable feature for performance feedback from users.

# Overheads

These are the cycle count and code size differences measured using `test/bench` and  `test/perf`, compared against master (see spreadsheet for perf of interim PRs).

Summarized from:

https://docs.google.com/spreadsheets/d/1zC2Hsl9gGUzJESQmSABPiu-XIsICEw1I3O-JKHNWVQs/edit?usp=sharing


## perf


## test/perf

Master |   |   | Widening |   | Widening vs Master |   | Gated |   | Gated vs Master
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
gas/assetstorage | 10013950 |   | gas/assetstorage | 10013950 | 0.00% |   | gas/assetstorage | 10013950 | 0.00%
size/assetstorage | 186455 |   | size/assetstorage | 186705 | 0.13% |   | size/assetstorage | 186520 | 0.03%
gas/dao | 4413634512 |   | gas/dao | 4413744976 | 0.00% |   | gas/dao | 4413743944 | 0.00%
size/dao | 265797 |   | size/dao | 266385 | 0.22% |   | size/dao | 265922 | 0.05%
gas/qr | 1302744688 |   | gas/qr | 1305067118 | 0.18% |   | gas/qr | 1302750018 | 0.00%
size/qr | 256049 |   | size/qr | 256925 | 0.34% |   | size/qr | 256285 | 0.09%
gas/reversi | 80920993 |   | gas/reversi | 81019001 | 0.12% |   | gas/reversi | 80927129 | 0.01%
size/reversi | 175956 |   | size/reversi | 176421 | 0.26% |   | size/reversi | 176084 | 0.07%
gas/sha224 | 460197621 |   | gas/sha224 | 498978947 | 8.43% |   |   |   |  
size/sha224 | 191929 |   | size/sha224 | 192859 | 0.48% |   |   |   |  
gas/sha256 | 14487063673 |   | gas/sha256 | 15568532694 | 7.47% |   | gas/sha256 | 14486916565 | 0.00%
size/sha256 | 179075 |   | size/sha256 | 180167 | 0.61% |   | size/sha256 | 179223 | 0.08%



## test/bench

Master |   |   | Widening |   | Widening vs Master |   | Gated |   | Gated vs Master
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
gas/alloc | 9,243,068,120.00 |   | gas/alloc | 10,350,366,461.00 | 11.98% |   | gas/alloc | 9243068126 | 0.00%
size/alloc | 181,066.00 |   | size/alloc | 180,759.00 | -0.17% |   | size/alloc | 180464 | -0.33%
gas/bignum | 130,604,743.00 |   | gas/bignum | 130,606,013.00 | 0.00% |   | gas/bignum | 130604779 | 0.00%
size/bignum | 184,420.00 |   | size/bignum | 184,093.00 | -0.18% |   | size/bignum | 183790 | -0.34%
gas/heap-32 | 1,610,218,447.00 |   | gas/heap-32 | 1,695,702,521.00 | 5.31% |   | gas/heap-32 | 1609469958 | -0.05%
size/heap-32 | 182,167.00 |   | size/heap-32 | 181,856.00 | -0.17% |   | size/heap-32 | 181556 | -0.34%
gas/nat16 | 61,393,031.00 |   | gas/nat16 | 65,587,813.00 | 6.83% |   | gas/nat16 | 61393019 | 0.00%
size/nat16 | 181,010.00 |   | size/nat16 | 180,727.00 | -0.16% |   | size/nat16 | 180408 | -0.33%
gas/palindrome | 10,131,340.00 |   | gas/palindrome | 10,133,866.00 | 0.02% |   | gas/palindrome | 10131268 | 0.00%
size/palindrome | 185,338.00 |   | size/palindrome | 185,024.00 | -0.17% |   | size/palindrome | 184695 | -0.35%
gas/region0-mem | 6,402,149,937.00 |   | gas/region0-mem | 6,452,495,054.00 | 0.79% |   | gas/region0-mem | 6402149955 | 0.00%
size/region0-mem | 181,898.00 |   | size/region0-mem | 181,602.00 | -0.16% |   | size/region0-mem | 181281 | -0.34%
gas/region-mem | 5,974,331,587.00 |   | gas/region-mem | 6,024,676,752.00 | 0.84% |   | gas/region-mem | 5974331605 | 0.00%
size/region-mem | 181,539.00 |   | size/region-mem | 181,252.00 | -0.16% |   | size/region-mem | 180931 | -0.33%
gas/stable-mem | 3,885,566,188.00 |   | gas/stable-mem | 3,935,898,195.00 | 1.30% |   | gas/stable-mem | 3885566206 | 0.00%
size/stable-mem | 181,896.00 |   | size/stable-mem | 181,600.00 | -0.16% |   | size/stable-mem | 181279 | -0.34%
gas/xxx-nat32 | 57,198,791.00 |   | gas/xxx-nat32 | 57,199,237.00 | 0.00% |   | gas/xxx-nat32 | 57198779 | 0.00%
size/xxx-nat32 | 181,001.00 |   | size/xxx-nat32 | 180,694.00 | -0.17% |   | size/xxx-nat32 | 180399 | -0.33%
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build_artifacts Upload moc binary as workflow artifacts
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants