Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML search: represent indices using JavaScript Map instead of object literals #13097

Open
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

jayaddison
Copy link
Contributor

@jayaddison jayaddison commented Nov 2, 2024

Feature or Bugfix

  • Bugfix

Purpose

  • Prevent queries for the JavaScript class prototype reference propery name from causing errors in the HTML search JavaScript code provided in the basic theme.

Detail

  • Ensure that the user input query term is found as a named property of the relevant index objects before accessing them, using Object.hasOwnProperty.
  • Represent the HTML search indices as JavaScript Map instances instead of Object literals.

Relates

Edit: update description to reflect JavaScript Map-based approach.

@jayaddison jayaddison added html search javascript Pull requests that update Javascript code labels Nov 2, 2024
Copy link
Contributor

@wlach wlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how this would be a problem but I'm a little unclear on how the solution actually works? See comments

@@ -514,8 +514,8 @@ const Search = {
searchTerms.forEach((word) => {
const files = [];
const arr = [
{ files: terms[word], score: Scorer.term },
{ files: titleTerms[word], score: Scorer.title },
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually a little confused by how this might work, as it doesn't appear that JavaScript supports setting the __proto__ property in an object. Am I missing something? Should we be using a map to store these terms instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parsing JSON, an object key named "__proto__" is accepted, and converted to the JavaScript object key ["__proto__"] -- distinct from the __proto__ property (a reference to a class prototype).

During read-access, introducing hasOwnProperty allows us to filter out the latter, so that we don't mistakenly retrieve the prototype class reference from terms.

I'm not certain whether the below example will help to clarify, but I find experimentation helpful to understand some of the behaviours:

> x = JSON.parse('{}')
{}
> y = JSON.parse('{"__proto__": "testing"}')
{ ['__proto__']: 'testing' }
> '__proto__' in x
true
> x.hasOwnProperty('__proto__')
false
> '__proto__' in y
true
> y.hasOwnProperty('__proto__')
true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to respond to your Map question: that is certainly possible, although I feel that doing so might add some runtime overhead (since we'd need to load the native JS object contents from the index into respective Map objects).

As I understand it, Map objects are also more difficult to make immutable in JavaScript than Object and Array instances. Freezing a Map object doesn't prevent its private contents from being updated:

> x = {'foo': 'bar'}
{ foo: 'bar' }
> Object.freeze(x)
{ foo: 'bar' }
> x['foo'] = 'baz'
'baz'
> x['foo']
'bar'
> y = new Map([['foo', 'bar']])
Map(1) { 'foo' => 'bar' }
> Object.freeze(y)
Map(1) { 'foo' => 'bar' }
> y.set('foo', 'baz')
Map(1) { 'foo' => 'baz' }
> y.get('foo')
'baz'

So another factor is that, to the extent that I'd like to see the search index contents become immutable at load-time in future (#13098), I'd prefer not to migrate to the Map type unless there is a compelling reason to (maybe avoiding prototype pollution is that reason -- but I don't think we're exposed to that currently).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it, this makes sense to me re: the property. TIL!

I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, one extra piece of optional feedback: since this is kind of counter-intuitive, you might want to add a comment on this line explaining what you're doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, OK - I'll try to add some phrasing to explain this succinctly. Thanks for the code review!

tests/js/roots/ecmascript/index.rst Outdated Show resolved Hide resolved
The JS prototype property should _not_ be found in unrelated documents
@@ -514,8 +514,8 @@ const Search = {
searchTerms.forEach((word) => {
const files = [];
const arr = [
{ files: terms[word], score: Scorer.term },
{ files: titleTerms[word], score: Scorer.title },
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it, this makes sense to me re: the property. TIL!

I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.

@AA-Turner
Copy link
Member

Would this work as an alternative?

    // in Search.setIndex()
    const frozen_index = JSON.parse(index, (key, value) => {
      typeof value === "object" && value !== null
        ? Object.freeze(value)
        : value;
    });

If we change Search.setIndex() to taking a JSON string, it seems we could use the 'reviver' function to freeze arrays and objects?

A

@jayaddison
Copy link
Contributor Author

@AA-Turner although I'd recommend some performance testing, that could be useul for #13098, yep.

@jayaddison
Copy link
Contributor Author

In fact, no, my mistake: JSON.parse may in fact be necessary to solve this robustly. The test appears only to be passing due to the fact that __proto__ is found in the objects part of the searchindex.js file. The terms entry for __proto__ is not being handled correctly.

@jayaddison
Copy link
Contributor Author

I've pushed the change to use JSON.parse and reviver -- a note of caution, though: when I was testing that against a JSON-ized version of the Python documentation searchindex.js file yesterday, I was encountering ~200ms duration within the setIndex function in my browser.

@wlach
Copy link
Contributor

wlach commented Nov 14, 2024

Personally, I'd be inclined to merge the previous version of this and address that (if at all) in a followup. Trying to "freeze" the index here feels like scope creep to me and it would be nice to get this fix in.

@jayaddison
Copy link
Contributor Author

That's reasonable, yep, and I've no strong opinions on whether we include or omit the Object.freeze aspect. However, without 462c859, I don't believe this branch would genuinely fix the reported bug.

The `JSON.dumps` call escapes single-quotes within strings; all JSON strings are enclosed by double-quotes
@jayaddison
Copy link
Contributor Author

Without the reviver function, I recall the setIndex + JSON.parse performance overhead being in the order of ~60ms on my local machine. I'll confirm those stats in some more detail within the next day or so. I get the feeling that some of this will be coupled to the browser+implementation (Firefox) details, so I may try to compare to Chrom[e|ium] also.

@wlach
Copy link
Contributor

wlach commented Nov 14, 2024

In fact, no, my mistake: JSON.parse may in fact be necessary to solve this robustly. The test appears only to be passing due to the fact that __proto__ is found in the objects part of the searchindex.js file. The terms entry for __proto__ is not being handled correctly.

Sorry, I missed this part. Do you know why this is the case? The solution looked correct to me?

@jayaddison
Copy link
Contributor Author

In fact, no, my mistake: JSON.parse may in fact be necessary to solve this robustly. The test appears only to be passing due to the fact that __proto__ is found in the objects part of the searchindex.js file. The terms entry for __proto__ is not being handled correctly.

Sorry, I missed this part. Do you know why this is the case? The solution looked correct to me?

The test case in 0fff170 was inaccurate -- it demonstrated that an object Object.__proto__ could be found based on a user query for __proto__.

However, there was still a bug: a document text term __proto__ -- represented by the "terms":{"__proto__":0 fragment of the searchindex.js -- was not searchable as intended.

Comment on lines 222 to 223
const reviver = (k, v) => (typeof v === "object" && v !== null) ? Object.freeze(v) : v;
Search._index = JSON.parse(index, reviver);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @wlach is right that immutability is orthoganal here. I think 'simply' using JSON.parse rather than an object literal would solve the __proto__ case. Would we then need to keep the hasOwnProperty checks?

An alternative would be to use a reviver function that constructs Map objects/serialise the index with Map literals, and use .get(), but that may also be expensive.

Suggested change
const reviver = (k, v) => (typeof v === "object" && v !== null) ? Object.freeze(v) : v;
Search._index = JSON.parse(index, reviver);
Search._index = JSON.parse(index);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @wlach is right that immutability is orthoganal here. I think 'simply' using JSON.parse rather than an object literal would solve the __proto__ case.

Ok, yep - I can revert 443fd21 to remove the use of Object.freeze / the reviver function.

Would we then need to keep the hasOwnProperty checks?

We should keep those, I think. Without them, queries for __proto__ may continue to inadvertently retrieve object prototypes instead of getting an undefined result.

An alternative would be to use a reviver function that constructs Map objects/serialise the index with Map literals, and use .get(), but that may also be expensive.

What do you think about constructing those Map objects directly in the searchindex.js file format, instead of using JSON.parse and a reviver?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about constructing those Map objects directly in the searchindex.js file format, instead of using JSON.parse and a reviver?

IMO this is definitely worth a shot, I suspect it'll be quite fast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm giving it a try - it doesn't seem too tricky; the Python serialization of the search index becomes a little more complex. One challenge that I have yet to work out is how the Python unit tests should parse/load the resulting index (ref:

return json.loads(searchindex[16:-1])
).

Comment on lines +71 to +76
indices = ast.literal_eval(
searchindex[16:-1]
.replace('new Map', '')
.replace('null', 'None')
.replace('false', 'False')
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fragile and inelegant; it's a compromise to avoid introducing the complexity of a Python-implemented JavaScript parser dependency given the use of new Map(...) within searchindex.js files.

@jayaddison jayaddison changed the title HTML search: check for query terms presence in index term properties HTML search: represent indices using JavaScript Map instead of object literals Nov 16, 2024
@jayaddison
Copy link
Contributor Author

Without the reviver function, I recall the setIndex + JSON.parse performance overhead being in the order of ~60ms on my local machine. I'll confirm those stats in some more detail within the next day or so. I get the feeling that some of this will be coupled to the browser+implementation (Firefox) details, so I may try to compare to Chrom[e|ium] also.

I haven't gotten around to this comprehensive performance testing yet, but want to share some notes:

Initial results

Initially the native Map approach (up to fae81a0) seems similar in runtime performance to the existing/baseline approach (object literals), based on my testing using Sphinx's self-built documentation.

The page-load searchindex-evaluation time on my machine for a query for phin (~700 results) is ~30ms in Firefox, compared to a ~20ms baseline. Those are both relatively small compared to a ~0.5s duration for display/rendering of the search results, but arguably that is a pessimal query (a more-precise match would reduce the display duration).

There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.

Observations so far

There seem to be multiple factors to balance: Python complexity, JavaScript complexity, code safety (both datastructure immutability, and also prototype access), runtime performance and index filesize.

I'm not sure how much to explore and how to present my thoughts/findings. Generally my feeling is that the Map approach here (again, up to fae81a0) seems good on most criteria except for Python complexity (particularly the unit tests, but also the application serialization code is more complex than it was) and immutability (that it does not address, in my opinion). The tc39 Records-and-Tuples proposal, I think, would be near-optimal on all factors -- but we could be years away from that reaching widespread adoption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DO NOT MERGE html search javascript Pull requests that update Javascript code type:performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTML search: bug with JS-prototype-property query
3 participants