
Faster sampling for Postgres? #387

Open

schuemie opened this issue Sep 19, 2023 · 3 comments

Comments

@schuemie
Member

Currently, sampling on Postgres uses ORDER BY RANDOM() LIMIT (see here), which means the server first has to sort the entire table in random order and then take the top n rows.

Perhaps TABLESAMPLE could be used for better performance? (Similar to MSSQL)
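For concreteness, a sketch of the two query shapes in Postgres syntax (person is a hypothetical table name):

-- Current approach: sorts the entire table before taking the top rows.
SELECT * FROM person ORDER BY RANDOM() LIMIT 1000;

-- TABLESAMPLE alternative: reads only about 1% of the table's pages.
-- SYSTEM samples whole pages (fast, less random); BERNOULLI samples
-- individual rows (more random, but still scans the whole table).
SELECT * FROM person TABLESAMPLE SYSTEM (1) LIMIT 1000;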

@blootsvoets
Collaborator

blootsvoets commented Sep 19, 2023

Makes sense. BigQuery supports it as well, but only in beta. Limitations:

  • TABLESAMPLE takes a percentage instead of a number of rows, and it treats that percentage as approximate rather than exact.
  • It samples rows in pages, so the sample is not truly random: it returns groups of rows, and the grouping is rather static (they are 'close-by' records).
  • Before the query is run, we need to know the table's row count (see the sketch after this list).
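For the last point, a sketch of how the row count could be obtained (again with a hypothetical person table; on Postgres the planner statistics give a cheap estimate, which should be good enough here since the percentage is approximate anyway):

-- Exact, but requires a scan of the table:
SELECT COUNT(*) FROM person;

-- Cheap estimate from Postgres statistics (kept up to date by ANALYZE/autovacuum):
SELECT reltuples::bigint FROM pg_class WHERE relname = 'person';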

To limit the impact of these limitations, we could take the following approach:

// Overestimate the required percentage by a factor of 2, since
// TABLESAMPLE treats the percentage as approximate.
// Use long arithmetic to avoid int overflow on large tables.
long percentage = 2L * 100L * sampleSize / tableCount;
String query;
// Only use TABLESAMPLE when less than 50% of the records would be
// sampled; beyond that it offers little benefit over a full scan.
if (percentage < 50) {
    // Avoid sample percentages so small that too few rows come back.
    if (percentage < 2) {
        percentage = 2;
    }
    query = "SELECT * FROM " + table + " TABLESAMPLE SYSTEM (" + percentage + ") ORDER BY RANDOM() LIMIT " + sampleSize;
    // Optional: resample without TABLESAMPLE if fewer than sampleSize rows are returned.
} else {
    query = "SELECT * FROM " + table + " ORDER BY RANDOM() LIMIT " + sampleSize;
}
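To illustrate with made-up numbers: with sampleSize = 100000 and tableCount = 10000000, percentage = 2 * 100 * 100000 / 10000000 = 2, and the generated query would be:

-- Hypothetical values: reads roughly 2% of the table's pages (~200,000 rows),
-- then shuffles only that subset instead of all 10,000,000 rows.
SELECT * FROM person TABLESAMPLE SYSTEM (2) ORDER BY RANDOM() LIMIT 100000;

The ORDER BY RANDOM() then only sorts the sampled subset, so most of the cost of sorting the full table is avoided.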

@schuemie
Member Author

Yes, that sounds like a good idea. The fact that the sample isn't truly random and is not guaranteed to return the exact number of rows should be ok for this purpose (IMHO).

@janblom
Collaborator

janblom commented Feb 28, 2024

@schuemie do you think this is worth including in the upcoming 1.0 release of WhiteRabbit?

I have some doubt about the combination of taking a TABLESAMPLE and then ordering the result randomly. I think I get why it is done, but wouldn't that cancel part of the performance improvement?
