Skip to content
This repository has been archived by the owner on Aug 31, 2023. It is now read-only.

refactor(rome_rowan): rework SyntaxTokenText #4721

Closed
wants to merge 2 commits into from
Closed

refactor(rome_rowan): rework SyntaxTokenText #4721

wants to merge 2 commits into from

Conversation

Conaclos
Copy link
Contributor

@Conaclos Conaclos commented Jul 22, 2023

Summary

This PR addresses some concerns of @ematipico.

All inner_text functions now return a SyntaxTokenText.
I thus removed Quoted and StaticStringValue.

My main issue, with SyntaxTokenText was the creativeness of the returned range. We lose the absolute location of the text in a source.
To address this issue, I added a (private) absolute offset, which is the starting position of the (private) token in the source.
This allows computing the absolute range from the (private) relative range.

I also modified slice to accept a range relative to the current slice of text. This seems more consistent from a user point of view and this mimics what we are used to for string slicing.

I take the opportunity to remove SyntaxTokenTextSlice in the formatter since SyntaxTokenText is now sufficient.

Test Plan

In progress...

@netlify
Copy link

netlify bot commented Jul 22, 2023

Deploy Preview for docs-rometools canceled.

Built without sensitive environment variables

Name Link
🔨 Latest commit 3da0087
🔍 Latest deploy log https://app.netlify.com/sites/docs-rometools/deploys/64bd15ff75565e0008c776b4

@github-actions github-actions bot added A-Linter Area: linter A-Core Area: core A-Formatter Area: formatter L-JavaScript Langauge: JavaScript A-Parser Area: parser A-Project Area: project configuration and loading L-JSON Language: JSON labels Jul 22, 2023
@Conaclos Conaclos marked this pull request as draft July 22, 2023 10:19
@@ -392,4 +386,4 @@ static_assert!(std::mem::size_of::<crate::format_element::Tag>() == 16usize);

#[cfg(not(debug_assertions))]
#[cfg(target_pointer_width = "64")]
static_assert!(std::mem::size_of::<crate::FormatElement>() == 24usize);
static_assert!(std::mem::size_of::<crate::FormatElement>() == 32usize);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why the size changed.
Moreover, I cannot reproduce this test locally.
Before and after this PR, the size of FormatElement is evaluated to 40 bytes on my machine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you measure the size with a release build? Some elements contain debug only data

Regressing the size has negative consequences on overall performance and memory consumption

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I accept with your point. However, I just merged SyntaxTokenTextSlice into SyntaxTokenText. This should take the same space.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you measure the size with a release build? Some elements contain debug only data

You are right!

Copy link
Contributor

@strager strager Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old layout of format_element::FormatElement::SyntaxTokenTextSlice (24 bytes):

#[repr(C)]
// alignment=8 size=24
struct FormatElement_SyntaxTokenTextSlice {
  variant: i8,          // which element of FormatElement is this?
  padding_0: [u8; 3],   // for text_size alignment
  text_size: u32,       // rome_text_size::TextSize::raw

  // Begin rome_rowan::SyntaxTokenText (alignment=8 size=16)
  green_token_ptr: u64, // rome_rowan::GreenToken::ptr (rome_rowan::ThinArc::ptr (pointer))
  range_start: u32,     // rome_text_size::TextRange::start (rome_text_size::TextSize::raw)
  range_end: u32,       // rome_text_size::TextRange::start (rome_text_size::TextSize::raw)
  // End rome_rowan::SyntaxTokenText (alignment=8 size=16)
}

New layout of format_element::FormatElement::SyntaxTokenTextSlice (32 bytes):

#[repr(C)]
// alignment=8 size=32
struct FormatElement_SyntaxTokenTextSlice {
  variant: i8,          // which element of FormatElement is this?
  padding_0: [u8; 7],   // for rome_rowan::SyntaxTokenText alignment

  // Begin rome_rowan::SyntaxTokenText (alignment=8 size=24)
  text_size: u32,       // rome_text_size::TextSize::raw
  padding_1: [u8; 4],   // for green_token_ptr alignment
  green_token_ptr: u64, // rome_rowan::GreenToken::ptr (rome_rowan::ThinArc::ptr (pointer))
  range_start: u32,     // rome_text_size::TextRange::start (rome_text_size::TextSize::raw)
  range_end: u32,       // rome_text_size::TextRange::start (rome_text_size::TextSize::raw)
  // End rome_rowan::SyntaxTokenText (alignment=8 size=24)
}

Before, the space between format_element::FormatElement's discriminator and the rome_rowan::SyntaxTokenText was used to fit text_size. After, this space cannot be used. +4 bytes

After, text_size forces padding inside rome_rowan::SyntaxTokenText. The size of the members is 20 bytes, but because the struct needs to be 8-byte-aligned (because rome_rowan::GreenToken is 8-byte-aligned because pointers are 8-byte-aligned), 4 padding bytes are added. +4 bytes

+4 bytes + +4 bytes = +8 bytes


This patch helped my investigations (building with cargo +nightly build --release):

diff --git a/crates/rome_formatter/src/format_element.rs b/crates/rome_formatter/src/format_element.rs
index fcb0e92d33..3d6d3e3250 100644
--- a/crates/rome_formatter/src/format_element.rs
+++ b/crates/rome_formatter/src/format_element.rs
@@ -16,6 +16,7 @@ use std::rc::Rc;
 ///
 /// Use the helper functions like [crate::builders::space], [crate::builders::soft_line_break] etc. defined in this file to create elements.
 #[derive(Clone, Eq, PartialEq)]
+#[rustc_layout(debug)]
 pub enum FormatElement {
     /// A space token, see [crate::builders::space] for documentation.
     Space,
diff --git a/crates/rome_formatter/src/lib.rs b/crates/rome_formatter/src/lib.rs
index 21a5d90997..c2322681cd 100644
--- a/crates/rome_formatter/src/lib.rs
+++ b/crates/rome_formatter/src/lib.rs
@@ -1,3 +1,5 @@
+#![feature(rustc_attrs)]
+
 //! Infrastructure for code formatting
 //!
 //! This module defines [FormatElement], an IR to format code documents and provides a mean to print
diff --git a/crates/rome_rowan/src/lib.rs b/crates/rome_rowan/src/lib.rs
index b4f5e659fb..7fc598c5fb 100644
--- a/crates/rome_rowan/src/lib.rs
+++ b/crates/rome_rowan/src/lib.rs
@@ -1,3 +1,5 @@
+#![feature(rustc_attrs)]
+
 //! A generic library for lossless syntax trees.
 //! See `examples/s_expressions.rs` for a tutorial.
 #![forbid(
diff --git a/crates/rome_rowan/src/syntax_token_text.rs b/crates/rome_rowan/src/syntax_token_text.rs
index d5eb1ddbf1..7977189bb3 100644
--- a/crates/rome_rowan/src/syntax_token_text.rs
+++ b/crates/rome_rowan/src/syntax_token_text.rs
@@ -5,6 +5,7 @@ use std::{borrow::Borrow, fmt::Formatter};
 
 /// Reference to the text of a SyntaxToken without having to worry about the lifetime of `&str`.
 #[derive(Eq, Clone)]
+#[rustc_layout(debug)]
 pub struct SyntaxTokenText {
     // Using a green token to ensure this type is Send + Sync.
     token: GreenToken,

Copy link
Contributor Author

@Conaclos Conaclos Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the write-up!

I see two possibilities:

  1. Replacing TextRange with a SmallTextRange struct in SyntaxTokenText to shrink SyntaxTokenText to 16 bytes.
    This is based on the observation that a relative range does not need u32 positions. u16 seems enough.

  2. Introducing GreenTokenText and reverting to the original layout of FormatElement (using GreenTokenText).

Any opinions?

@Conaclos Conaclos marked this pull request as ready for review July 23, 2023 12:02
@ematipico
Copy link
Contributor

!bench_formatter

/// Reference to the text of a SyntaxToken without having to worry about the lifetime of `&str`.
#[derive(Eq, Clone)]
pub struct SyntaxTokenText {
// Absolute start location of `token`
token_start: TextSize,
Copy link
Contributor

@ematipico ematipico Jul 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Conaclos this is the reason why the FormatElement increased in size.

FormatElement is an enum, and its size is equal to the biggest type among all variants. Which means that SyntaxTokenText is now 32kb

Copy link
Contributor Author

@Conaclos Conaclos Jul 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved source_position (renamed token_start) from SyntaxTokenTextSlice to SyntaxTokenText:

struct SyntaxTokenText {
+   token_start: TextSize
    token: GreenToken
    range: TextRange
}
enum FormatElement {
-    SyntaxTokenTextSlice {
-        source_position: TextSize,
-        slice: SyntaxTokenText,
-    },
+    SyntaxTokenText(SyntaxTokenText),
}

FormatElement::SyntaxTokenTextSlice and FormatElement::SyntaxTokenText should have the same size?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@github-actions
Copy link

Formatter Benchmark Results

group                                    main                                   pr
-----                                    ----                                   --
formatter/big5-added.json                1.00    483.0±0.44µs    35.0 MB/sec    1.02    491.4±0.85µs    34.4 MB/sec
formatter/canada.json                    1.00    210.1±5.65ms    10.2 MB/sec    1.05    220.7±1.45ms     9.7 MB/sec
formatter/checker.ts                     1.00    398.6±6.21ms     6.5 MB/sec    1.00    396.9±2.61ms     6.5 MB/sec
formatter/compiler.js                    1.00    218.9±2.00ms     4.8 MB/sec    1.03    225.4±1.01ms     4.6 MB/sec
formatter/d3.min.js                      1.00    170.7±1.68ms  1572.6 KB/sec    1.03    175.8±3.13ms  1527.0 KB/sec
formatter/db.json                        1.00     14.3±0.08ms    12.8 MB/sec    1.02     14.5±0.09ms    12.6 MB/sec
formatter/dojo.js                        1.00     11.8±0.08ms     5.8 MB/sec    1.00     11.9±0.07ms     5.8 MB/sec
formatter/eucjp.json                     1.00    813.9±1.80µs    48.1 MB/sec    1.01    825.5±0.34µs    47.4 MB/sec
formatter/ios.d.ts                       1.00    259.1±1.77ms     7.2 MB/sec    1.01    260.4±3.70ms     7.2 MB/sec
formatter/jquery.min.js                  1.00     48.1±0.42ms  1760.7 KB/sec    1.02     48.9±0.43ms  1729.8 KB/sec
formatter/math.js                        1.00    331.6±3.19ms  1999.8 KB/sec    1.03    341.9±1.67ms  1939.7 KB/sec
formatter/package-lock.json              1.00      6.1±0.02ms    22.8 MB/sec    1.01      6.1±0.05ms    22.5 MB/sec
formatter/parser.ts                      1.00      8.1±0.04ms     6.0 MB/sec    1.01      8.2±0.03ms     5.9 MB/sec
formatter/pixi.min.js                    1.00    184.4±1.35ms     2.4 MB/sec    1.02    188.5±1.26ms     2.3 MB/sec
formatter/react-dom.production.min.js    1.00     55.7±0.60ms     2.1 MB/sec    1.03     57.3±0.67ms     2.0 MB/sec
formatter/react.production.min.js        1.01      2.8±0.01ms     2.2 MB/sec    1.00      2.7±0.01ms     2.2 MB/sec
formatter/router.ts                      1.00      3.3±0.01ms     9.8 MB/sec    1.00      3.3±0.03ms     9.7 MB/sec
formatter/tex-chtml-full.js              1.00    424.7±2.17ms     2.1 MB/sec    1.02    432.7±1.78ms     2.1 MB/sec
formatter/three.min.js                   1.00    221.2±2.10ms     2.7 MB/sec    1.02    225.3±1.91ms     2.6 MB/sec
formatter/typescript.js                  1.00   1428.5±7.95ms     6.7 MB/sec    1.01  1444.9±10.05ms     6.6 MB/sec
formatter/vue.global.prod.js             1.00     73.7±0.79ms  1673.3 KB/sec    1.03     76.3±1.66ms  1617.4 KB/sec

@ematipico
Copy link
Contributor

There are a couple of things that are not clear in this PR, and I would like to clarify them:

  • what's the issue with inner_string_text? (please share an example so we can understand)
  • what's the issue with SyntaxTokenText not returning the range you need? (please share an example so we can understand)

Also, can you explain the relation between the two issues?

If they are unrelated, we could open two different PRs.

Copy link
Contributor

@ematipico ematipico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid increasing memory, if the value we get in return is not enough

let mut text = token.token_text_trimmed();
if token_kind == JsonSyntaxKind::JSON_STRING_LITERAL {
// remove string delimiters
let len = text.len() - TextSize::from(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unsafe because text.len() can be less than 2. You should use .saturing_sub

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should not be the case because it is a string literal?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still an unsafe operation, which is done without the proper checks. If you think it's a safe operation, you should add a // SAFETY comment to explain why it's safe.

@Conaclos
Copy link
Contributor Author

Conaclos commented Jul 24, 2023

* what's the issue with `inner_string_text`? (please share an example so we can understand)

There is no issue except that code is duplicated in several places. I just renamed it to inner_text. However, we can keep the previous name...

* what's the issue with `SyntaxTokenText` not returning the range you need? (please share an example so we can understand)

Let's me introduce some context to clarify the situation.

In some places, we need the text of a token, however, the borrow-checker requires allocating a new string. To avoid the allocation, SyntaxTokenText was introduced.

SyntaxTokenText is used in high-level API of the JSON, JS and JSX linter and formatter. This is also used to return the inner text of a JSON string literal.

I expanded its uses to JS and JSX string literals to be more consistent with the JSON code base. This makes possible the deletion of Quoted and StaticStringValue utility types.

SyntaxTokenText offers a range() method. The prefix Syntax suggests that this range is absolute, similar to SyntaxToken, SyntaxNode, SyntaxNodeText, ...

However, it is not.
This is really misleading.
For instance, let's use the following example:

let x;

The range of the token "x" is (4,5).
The range of its token text is (0,1).

I think the method must either be removed (and SYntaxTokenText renamed to TokenText or GreenTokenText) or reworked to return an absolute range. I chose the second path because it is very convenient in high-level code
to access to the absolute range of a text.
Otherwise, the user should fall back to SyntaxToken that is error-prone in the case where the token can be a string literal.

Moreover, there is the need to track the absolute range in other places like the formatter where a struct SyntaxTokenTextSlice and an enum variant FormatElement::SyntaxTokenTextSlice consisted of a source_location and a SyntaxTokenText. This looked like a good opportunity to merge SyntaxTokenTextSlice into SyntaxTokenText.

@ematipico
Copy link
Contributor

There is no issue except that code is duplicated in several places. I just renamed it to inner_text. However, we can keep the previous name...

We can solve the repetition with a trait. Still, I think the changes can be moved into a separate PR. Doing so will allow to focus on what you're trying to solve.

Otherwise, the user should fall back to SyntaxToken that is error-prone in the case where the token can be a string literal.

When you say "user", do you mean us developers?

I'd like to propose another solution instead, where we remove the API that exposes the range of SyntaxTokenText. This is misleading, a developer should never rely on a range of substring in the first place. When dealing with string literals, the quotes should always be part of the range.

If there's a particular case where we need to extract a sub-range of a string literal, in that case, it's up to the developer to compute it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Core Area: core A-Formatter Area: formatter A-Linter Area: linter A-Parser Area: parser A-Project Area: project configuration and loading L-JavaScript Langauge: JavaScript L-JSON Language: JSON
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants