Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Draft][Please review] Make Paginator use simpler queries when there are no sql joins used (#8278) #11595

Draft
wants to merge 3 commits into
base: 3.3.x
Choose a base branch
from
Draft
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
93 changes: 84 additions & 9 deletions src/Tools/Pagination/Paginator.php
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
use Doctrine\ORM\Internal\SQLResultCasing;
use Doctrine\ORM\NoResultException;
use Doctrine\ORM\Query;
use Doctrine\ORM\Query\Expr\Join;
use Doctrine\ORM\Query\Parameter;
use Doctrine\ORM\Query\Parser;
use Doctrine\ORM\Query\ResultSetMapping;
Expand All @@ -21,6 +22,7 @@
use function array_map;
use function array_sum;
use function assert;
use function count;
use function is_string;

/**
Expand All @@ -38,14 +40,43 @@
private readonly Query $query;
private bool|null $useOutputWalkers = null;
private int|null $count = null;
private bool $fetchJoinCollection;
/**
* @var bool The auto-detection of queries style was added a lot later to this class, and this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @var bool The auto-detection of queries style was added a lot later to this class, and this
* The auto-detection of queries style was added a lot later to this class, and this

For a property phpdoc, the description is just the phpdoc description, not something attached to the tag (and then, we don't need @var at all as there is a native type).

* class historically was by default using the more complex queries style, which means that
* the simple queries style is potentially very under-tested in production systems. The purpose
* of this variable is to not introduce breaking changes until an impression is developed that
* the simple queries style has been battle-tested enough.
*/
private bool $queryStyleAutoDetectionEnabled = false;
/** @var bool|null Null means "undetermined". */
private bool|null $queryHasHavingClause = null;

/** @param bool $fetchJoinCollection Whether the query joins a collection (true by default). */
/**
* @param bool|null $fetchJoinCollection Whether the query joins a collection (true by default). Set
* to null to enable auto-detection of this parameter, however note that the auto-detection requires
* a QueryBuilder to be provided to Paginator, otherwise this parameter goes back to its default value.
* Also, for now, when the auto-detection of this parameter is enabled, then auto-detection
* of $useOutputWalkers and of the CountWalker::HINT_DISTINCT is also enabled.
*/
public function __construct(
Query|QueryBuilder $query,
private readonly bool $fetchJoinCollection = true,
bool|null $fetchJoinCollection = true,
Copy link
Member

@greg0ire greg0ire Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would revert this and instead introduce a static method.

public static function newWithAutoDetection(QueryBuilder $queryBuilder): self
{
        // do the autodetection here
        // use the unaltered constructor here
        
}

That way:

  • You remove the possibility of passing a Query instance and null to the constructor.
  • You don't complexify the constructor that much.
  • You don't rely on null for communicating failure.
  • You can reduce the comments a bit because you don't need to explain how both parameters related to each other.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I've introduced the static function as you advised and the constructor returned to its original form in its entirety. Thanks.

) {
if ($fetchJoinCollection === null) {
$fetchJoinCollection = $this->autoDetectFetchJoinCollection($query);

Check warning on line 67 in src/Tools/Pagination/Paginator.php

View check run for this annotation

Codecov / codecov/patch

src/Tools/Pagination/Paginator.php#L67

Added line #L67 was not covered by tests
}

if ($fetchJoinCollection === null) {
$this->fetchJoinCollection = true;

Check warning on line 71 in src/Tools/Pagination/Paginator.php

View check run for this annotation

Codecov / codecov/patch

src/Tools/Pagination/Paginator.php#L71

Added line #L71 was not covered by tests
} else {
$this->fetchJoinCollection = $fetchJoinCollection;
$this->queryStyleAutoDetectionEnabled = true;
}

if ($query instanceof QueryBuilder) {
$query = $query->getQuery();
$this->queryHasHavingClause = $query->getDQLPart('having') !== null;
$query = $query->getQuery();
}

$this->query = $query;
Expand Down Expand Up @@ -154,6 +185,23 @@
return new ArrayIterator($result);
}

/** @return bool|null Null means that auto-detection could not be carried out. */
private function autoDetectFetchJoinCollection(Query|QueryBuilder $query): bool|null

Check warning on line 189 in src/Tools/Pagination/Paginator.php

View check run for this annotation

Codecov / codecov/patch

src/Tools/Pagination/Paginator.php#L189

Added line #L189 was not covered by tests
{
// For now, only working with QueryBuilder is supported.
if (! $query instanceof QueryBuilder) {
return null;

Check warning on line 193 in src/Tools/Pagination/Paginator.php

View check run for this annotation

Codecov / codecov/patch

src/Tools/Pagination/Paginator.php#L192-L193

Added lines #L192 - L193 were not covered by tests
}

/** @var array<string, Join[]> $joinsPerRootAlias */
$joinsPerRootAlias = $query->getDQLPart('join');

Check warning on line 197 in src/Tools/Pagination/Paginator.php

View check run for this annotation

Codecov / codecov/patch

src/Tools/Pagination/Paginator.php#L197

Added line #L197 was not covered by tests

// For now, do not try to investigate what kind of joins are used. It is, however, doable
// to detect a presence of only *ToOne joins via the access to joined entity classes'
// metadata (see: QueryBuilder::getEntityManager()->getClassMetadata(className)).
return count($joinsPerRootAlias) > 0;

Check warning on line 202 in src/Tools/Pagination/Paginator.php

View check run for this annotation

Codecov / codecov/patch

src/Tools/Pagination/Paginator.php#L202

Added line #L202 was not covered by tests
}

private function cloneQuery(Query $query): Query
{
$cloneQuery = clone $query;
Expand All @@ -171,13 +219,30 @@
/**
* Determines whether to use an output walker for the query.
*/
private function useOutputWalker(Query $query): bool
private function useOutputWalker(Query $query, bool $forCountQuery = false): bool
{
if ($this->useOutputWalkers === null) {
return (bool) $query->getHint(Query::HINT_CUSTOM_OUTPUT_WALKER) === false;
if ($this->useOutputWalkers !== null) {
return $this->useOutputWalkers;
}

return $this->useOutputWalkers;
// When a custom output walker already present, then do not use the Paginator's.
if ($query->getHint(Query::HINT_CUSTOM_OUTPUT_WALKER) !== false) {
return false;
}

// When not joining onto *ToMany relations, then do not use the more complex CountOutputWalker.
// phpcs:ignore SlevomatCodingStandard.ControlStructures.UselessIfConditionWithReturn.UselessIfCondition
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I think it's better not to enforce the failing coding standard on that if-condition because otherwise it's a multiline return abomination.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unclear to me why the return abomination would be worse than the if abomination.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The way I see it is that this function is a set of if-statements that each check for a very individual use case. If none of the statements "wants to make a decision", then the default "return true;" is returned.
  2. The if-condition for the new if-statement is quite elaborate, so I didn't want it to be the last statement of the function.
  3. The next dev that wants to add more if-statements to this function, would have to "un-return" my if-statement. This argument is my common argument for all phpstan rules that enforce "return statement code style" rules, which often in my mind are "false positives".

These points are semantics + I'm the "guest" here, so I've merged the last two return statements, as per phpstan detection.

if (
$forCountQuery
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we have a separate detection for count query and result query, I would rather extract the check for $forCountQuery in a separate query (as maybe we will also have smarter detection for the case of the result query in the future)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You said:

extract the check for $forCountQuery in a separate query

Did you mean to say "extract (...) in a separate function"? In other words, are you proposing splitting the useOutputWalker() into useResultOutputWalker() and useCountOutputWalker()?

Copy link
Member

@stof stof Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, I meant a separate condition (no need for a separate function which would force to duplicate the previous logic). We could even factor out the auto-detection check:

This would look like that:

// ...

if (!$this->queryStyleAutoDetectionEnabled) {
    return true; // Old simple guessing
}

if ($forCountQuery) {
    return /* condition for the support of CountWalker */;
}

return /* condition for the support of LimitSubqueryWalker */;

&& $this->queryStyleAutoDetectionEnabled
&& ! $this->fetchJoinCollection === false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't compare boolean variable with boolean constants

Suggested change
&& ! $this->fetchJoinCollection === false
&& $this->fetchJoinCollection

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's surprising to me that I left that in the code, despite reviewing it twice myself. Thanks.

// CountWalker doesn't support the "having" clause, while CountOutputWalker does.
&& $this->queryHasHavingClause === false
) {
return false;

Check warning on line 242 in src/Tools/Pagination/Paginator.php

View check run for this annotation

Codecov / codecov/patch

src/Tools/Pagination/Paginator.php#L242

Added line #L242 was not covered by tests
}

return true;
}

/**
Expand Down Expand Up @@ -205,10 +270,20 @@
$countQuery = $this->cloneQuery($this->query);

if (! $countQuery->hasHint(CountWalker::HINT_DISTINCT)) {
$countQuery->setHint(CountWalker::HINT_DISTINCT, true);
$hintDistinctDefaultTrue = true;

// When not joining onto *ToMany relations, then use a simpler COUNT query in the CountWalker.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fetchJoinCollection is not only about whether we join on a ToMany relation, but whether we include this relation in what get fetched by the ORM. So the condition you use does not match what the comment says.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. It appears I misunderstood the purpose of that variable.

I gave a thought about what to do with it, and I initially concluded that that variable is just misnamed. Consider the following:

  1. The toIterator()'s main if-statement (link) executes the extra "SELECT DISTINCT" subquery only if fetchJoinCollection is true. Otherwise, the original query is executed as-provided (with an OFFSET and LIMIT attached to it). Note that if fetchJoinCollection is supposed to be true only when (a) the original query joins onto a *ToMany relation AND only when (b) the query fetch-loads that relation, then toInterator() would not run the extra "SELECT DISTINCT" subquery when the query does the first without the second, which would produce an incorrect result for the "current pagination page". The result would be incorrect because even though the *ToMany relation is not fetch-loaded, the database would still return (and paginate on) "duplicate records" of the entities that are fetch-loaded (which most likely is only the "root" entity).

In other words, both the following DQL queries require the extra "SELECT DISTINCT" subquery to happen in the toIterator(), otherwise the result is incorrect:

SELECT u FROM User u JOIN u.tags ut;
SELECT u, ut FROM User u JOIN u.tags ut;
  1. The docblocks for the fetchJoinCollection consistently say the following:
/** @param bool $fetchJoinCollection Whether the query joins a collection (true by default). */

// (...)

/**
 * Returns whether the query joins a collection.
 *
 * @return bool Whether the query joins a collection.
 */
public function getFetchJoinCollection(): bool {}

It appears then that the docblocks are under an impression that that variable is "true" regardless of whether a *ToMany relation is fetch-loaded or not.


Do you agree that that instance variable is not named correctly (or the author incorrectly thought that not fetch-loading a *ToMany relation is immune to the "duplicated records being returned from the database" problem)? Would you agree for me to just rename that instance variable to something like: "queryHasToManyJoins"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, maybe it is indeed badly named.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that there is a case where joined collections that are not selected will not impact the count or the iteration: when using SELECT DISTINCT in the query (see SelectClause::$isDistinct)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so that I understand: you're trying to say that running the following DQLs without an "outputWalker" would be fine because it would produce the correct results, therefore the literal meaning of fetchJoinCollection is the correct meaning?

SELECT DISTINCT u FROM User u JOIN u.userGroups;
SELECT COUNT(DISTINCT u) FROM User u JOIN u.userGroups;

I agree, however note that in the current code there is nothing that would ensure that DISTINCT is added/present in the toInterator()'s else branch of the main if-else-statement. I.e. what you say is correct, but the current code would still either not work, or work by an extreme fluke (i.e. because someone did add a DISTINCT keyword to their $query prior to using the Paginator). Based on this, I would still say that fetchJoinCollection should just be renamed to reflect the fact that it's actual present meaning is that " whether the query contains ToMany joins (true/false)".

In case there is a legitimate worry that there might be someone in the userspace that does the (undocumented) due diligence and adds the DISTINCT keyword to their queries when they set fetchJoinCollection to false, and we don't want to break that code, then I can introduce a new queryHasToManyJoins instance variable, and use that instead of fetchJoinCollection (in which case, I would also need to find out what I need to default fetchJoinCollection to, but that would not be an unsurmountable problem).

Other that that, I admit that there is a potential future improvement, where useOutputWalkers(forCountQuery: true) could still returns false (i.e. saying that the "simple" count query should be used) even when queryHasToManyJoins is true, however fetchJoinCollection is false, and we know that a COUNT(DISTINCT foo) is going to be used.

So, are you ok for me to rename that instance variable, or would you rather that I should not touch it, and instead introduce a new instance variable?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was a lot to take in for me.

After a substantial amount of thought, I would like to propose and get your opinion on the following.

  1. Rename the $fetchJoinCollection to $query(Potentially)ProducesDuplicateRows: bool.
    1.1. Why: The new name expresses the "root of the problem", and does not go into details about "how it comes about". It also succinctly expresses why pagination is not a trivial task.
    1.2. What it achieves: it allows the end user to correctly express their confidence (or lack of it) in whether their query produces duplicate rows (which necessitate that the Paginator should run more complex queries). In particular: it puts the onus on the end user to ensure that this "declaration" (in particular: when they "declare false") is accurate.
    1.3. What it achieves: In case the end user's query has a ToMany join, but it doesn't select any fields from it (or only selects SQL aggregate functions that use those fields), and the query uses a "SELECT DISTINCT" (or an appropriate GROUP BY), then the end user can succinctly inform the Paginator that their query is safe for the "fast result & count queries" without explaining the minute details about what exactly allows those fast queries to be used.
    1.4. Bonus: It would "fix" what I said somewhere above that currently when $fetchJoinCollection is false, then the toIterator() runs the "fast result query", however there is nothing that indicates that the end user did the due diligence of using SELECT DISTINCT. The new name of $queryProducesDuplicateRows=false would in this case communicate the intended use case (and requirements) to the end user in a more complete way than $fetchJoinCollection.
    1.5. Breaking change handling: the getFetchJoinCollection() getter would be preserved but deprecated. A new appropriate getter would be introduced and recommended as a replacement.
    1.6. Why: This new name would resolve the review comment that was raised by you in relation to line 257 in the PR.
    1.7. Side note: "$fetchJoinCollection" is just a terrible name, as it suggests that the Paginator would do something to the joined collections in the query.

  2. Split $useOutputWalkers into $useResultQueryOutputWalker and $useCountQueryOutputWalker.
    2.1. Why: Apart from what you said, splitting this config variable just makes sense. The two output walkers (i.e. on the result query and the count query) have different reasons to be enabled/disabled.
    2.2. Breaking change handling: the (get|set)UseOutputWalkers() functions would be preserved but deprecated. A new set of functions (4 in total) would be introduced and recommended as a replacement.
    2.3. Downside: There would effectively be one more pair of public functions to (optionally) configure the Paginator. But again: it should be viewed as a long overdue code change, rather than an questionable and optional addition.
    2.4. Benefit: It would allow end users to manually control whether they are confident to run the "fast" count query, without affecting the runtime of the toIterator() function.

  3. The auto-detection/guessing would investigate the given Query and appropriately set the variables mentioned above.
    3.1. The auto-detection would initially not attempt to auto-detect whether a presence of a ToMany join is compatible with the "fast" queries. This is to limit the scope of this PR. It obviously could be attempted in future PRs. As such, as soon as there is a ToMany join detected, the "fast" queries would not be used (which is equivalent to the current (default) behaviour of the Paginator).

Q1. Are you personally ok with the proposition above?


the fast counting query is compatible with toMany joins when using CountWalker::HINT_DISTINCT as it will count distinct values

Q2. Could you please confirm that you meant: is compatible with toMany joins when using CountWalker::HINT_DISTINCT, AND when the query does NOT select any of the toMany entities/fields? Because when the query selects any of the toMany fields, then it will make the CountWalker (i.e. the fast query) not work correctly, regardless of whether CountWalker::HINT_DISTINCT is added or not.

when no fields from the grouping condition is dependant on the ToMany join (i.e. is part of the joined entity or of other entities joined again from that one)

Thanks for mentioning the case of other entities joined again from that one. This is quite important and I did not initially think of it.

[the points about when the walker-based fetching queries and tree-walker fetching queries are supported]

That is good to know, however note that I do not intend to touch the region of the code that makes the decision whether the walker-based or tree-walker fetching query should be used.

[The "fast" result query] is also the only result fetching mode compatible with composite primary keys.

Fair point, however this again is not something that I'm going to change or touch in my PR. Note that if a query uses composite primary keys AND produces duplicate rows (i.e. is not compatible with the "fast" result query), then the Paginator will not able to be used at all, and it's no different to the current situation in the Paginator.

Regarding output walker vs tree walker, the output walker supports more cases (...) but is generally less performant than then tree-walker query when both are supported.

This again is good to know, however I don't intend to auto-detect/guess in this PR which of the two walkers should be used by the result-fetching query.


Could you reply to Q1 and Q2, please, and share any other thought you have about what I said above?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is good to know, however note that I do not intend to touch the region of the code that makes the decision whether the walker-based or tree-walker fetching query should be used.

your code provides guessing for use*OutputWalker variables, which is precisely about making that decision. So I don't understand what you intend to do if you don't intend to touch that decision.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, are you referring to the following diff in the current PR?

    private function useOutputWalker(Query $query): bool
    {
        if ($this->useOutputWalkers === null) {
            return (bool) $query->getHint(Query::HINT_CUSTOM_OUTPUT_WALKER) === false;
        }

        return $this->useOutputWalkers;
    }

->

    private function useOutputWalker(Query $query, bool $forCountQuery = false): bool
    {
        if ($this->useOutputWalkers !== null) {
            return $this->useOutputWalkers;
        }

        // When a custom output walker already present, then do not use the Paginator's.
        if ($query->getHint(Query::HINT_CUSTOM_OUTPUT_WALKER) !== false) {
            return false;
        }

        // When not joining onto *ToMany relations, then do not use the more complex CountOutputWalker.
        return ! (
            $forCountQuery
// (rest of the diff not relevant)
        );
    }

Do we both agree that this code diff does not result in any change to whether the walker-based or the tree-walker result-fetching query should be used?

Or do you refer to the proposition of splitting useOutputWalker into two variables? In that case I can assure you that I do not intend to modify the current logic for the $useResultQueryOutputWalker variable. It could be done in future PRs, most likely by different contributors than me.

Note that I started this PR with only the intention of adding code that "if there are any dql-joins present, then use the fast version of the queries". I believe this alone would immediately address 80% of the use cases in the wild. Anything more than this, I consider a nicety. However, I do agree that going the extra mile and checking for *ToMany dql-joins should be easy enough, so as an exception from my original plan, I'm willing to attempt to make that change in the current PR. Anything other than that is a "less than 5% of use cases" matter in my mind, and although I do appreciate us discussing it, I do not intend to address it in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when $fetchJoinCollection is true (the default), forbidding to use the fast result fetching query, useOutputWalker is what decides between fetching results with an output walker or a tree walker.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when $fetchJoinCollection is true (the default), forbidding to use the fast result fetching query,

I agree.

when $fetchJoinCollection is true (the default) (...), useOutputWalker is what decides between fetching results with an output walker or a tree walker.

I agree.

useOutputWalker is what decides between fetching results with an output walker or a tree walker.

Do we both agree that my code diff does not change the behaviour of useOutputWalker?

when $fetchJoinCollection is true (the default)

Are you trying to say that because the new Paginator::newWithAutoDetection() function in this PR is now going to decide whether the (former) $fetchJoinCollection is true or false, it means that I should pay more attention to when the function decides that $fetchJoinCollection should be "set" to true, because this implies that useOutputWalker is going to be run? To this I would say that I partially see the concern here, however I don't view Paginator::newWithAutoDetection() function as deciding whether the $fetchJoinCollection should be set to true, but rather whether the $fetchJoinCollection should "revert to the current behaviour of the Paginator" (i.e. setting the variable to true). It's not ideal the the default behaviour of Paginator::newWithAutoDetection() could lead to problems in some cases, but likewise the current default behaviour of the Paginator class would lead to the same problems. And for this reason, I do not see why I should go even farther with this PR by making sure that the details about the output/tree walkers in the result-fetching query are taken into the consideration. Please correct me if I'm wrong.

If you're not referring to the consequences of the Paginator::newWithAutoDetection() function, then I truly struggle to see what we are discussing here, and I would appreciate some more help in understanding your standpoint. Why exactly do you believe I should concern myself with the the output/tree walkers in the result-fetching query in this PR?

if (
$this->queryStyleAutoDetectionEnabled
&& $this->fetchJoinCollection === false
d-ph marked this conversation as resolved.
Show resolved Hide resolved
) {
$hintDistinctDefaultTrue = false;
}

$countQuery->setHint(CountWalker::HINT_DISTINCT, $hintDistinctDefaultTrue);
}

if ($this->useOutputWalker($countQuery)) {
if ($this->useOutputWalker($countQuery, forCountQuery: true)) {
$platform = $countQuery->getEntityManager()->getConnection()->getDatabasePlatform(); // law of demeter win

$rsm = new ResultSetMapping();
Expand Down
Loading