
add record count meta table #56

Open · smnorris opened this issue Sep 7, 2021 · 12 comments

@smnorris (Owner) commented Sep 7, 2021

For pg_featureserv:

provide a function taking a table name as a parameter and returning the count of records in the table

poissonconsulting/fwapgr#47
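
Something along these lines, maybe (just a sketch - pg_featureserv publishes functions found in the postgisftw schema; the function and parameter names here are placeholders):

CREATE OR REPLACE FUNCTION postgisftw.record_count(table_name text)
RETURNS TABLE (record_count bigint)
LANGUAGE plpgsql STABLE AS
$$
BEGIN
  -- the regclass cast validates the table name and quotes it safely
  RETURN QUERY EXECUTE format('SELECT count(*) FROM %s', table_name::regclass);
END;
$$;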

@smnorris (Owner, Author) commented Sep 7, 2021

Erwin B shows me how to write the function, but pg_fs doesn't seem to like the regclass type: https://www.hillcrestgeo.ca/fwapg/functions/record_count/items.json

I don't want to work around the security features of the above function, and the record count is only needed for a few tables whose counts generally don't change... a meta table seems easier to implement for now?
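
Something minimal like this, perhaps (all names hypothetical):

CREATE TABLE postgisftw.record_counts (
  table_name text PRIMARY KEY,
  n_records  bigint NOT NULL,
  counted_at timestamptz NOT NULL DEFAULT now()
);

-- refresh manually whenever the source data changes
INSERT INTO postgisftw.record_counts (table_name, n_records)
SELECT 'whse_basemapping.fwa_stream_networks_sp', count(*)
FROM whse_basemapping.fwa_stream_networks_sp
ON CONFLICT (table_name)
DO UPDATE SET n_records = EXCLUDED.n_records, counted_at = now();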

@smnorris (Owner, Author) commented Sep 7, 2021

Also, getting the record count takes forever - much better to have this cached somewhere:

$ time psql -c "select postgisftw.record_count('whse_basemapping.fwa_stream_networks_sp');"
 record_count
--------------
      4907325
(1 row)
real	0m50.719s
user	0m0.065s
sys	0m0.038s

(PG 13, Ubuntu 20.04, 4 CPU, 8 GB RAM)

@smnorris changed the title from "add record count function" to "add record count meta table" on Sep 7, 2021
@dr-jts commented Sep 7, 2021

> Also, getting the record count takes forever - much better to have this cached somewhere:

@smnorris Is it faster to get the feature count via Python and WFS? If so, I wonder why it is so much faster?

@smnorris (Owner, Author) commented Sep 7, 2021

I don't have WFS set up on my db - my comment references getting a WFS feature count from DataBC, which I use for paging their collections. That request is pretty much instant, but it is (presumably) a GeoServer / Oracle back end.

@smnorris (Owner, Author) commented Sep 7, 2021

Hmm, I'm not sure why counting the records is so slow on that db; I may want to adjust something in the config. If I make a similar request on my local db it takes <1s.

@dr-jts commented Sep 7, 2021

Could it be something to do with the indexing, or the table statistics? Or perhaps the number of parallel workers allocated? Or how warm the DB instance is?

Doing an EXPLAIN ANALYZE might indicate something.

@smnorris (Owner, Author) commented Sep 7, 2021

Maybe? The indexes should be the same and autovacuum is enabled on both dbs. The query on the Ubuntu db does not speed up on repeated calls.

Ubuntu 20.04 / PG 13.4:

hillcrestgeo=> explain analyze select count(*) from whse_basemapping.fwa_stream_networks_sp;
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=985836.25..985836.26 rows=1 width=8) (actual time=45619.851..45643.503 rows=1 loops=1)
   ->  Gather  (cost=985836.04..985836.25 rows=2 width=8) (actual time=45619.642..45643.488 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=984836.04..984836.05 rows=1 width=8) (actual time=45594.097..45594.099 rows=1 loops=3)
               ->  Parallel Seq Scan on fwa_stream_networks_sp  (cost=0.00..979724.23 rows=2044723 width=0) (actual time=0.134..45281.439 rows=1635775 loops=3)
 Planning Time: 0.141 ms
 JIT:
   Functions: 8
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.485 ms, Inlining 183.040 ms, Optimization 30.949 ms, Emission 30.149 ms, Total 245.624 ms
 Execution Time: 45644.363 ms
(12 rows)

macOS / PG 13.3 (Homebrew):

postgis=# explain analyze select count(*) from whse_basemapping.fwa_stream_networks_sp;
                                                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=67819.06..67819.07 rows=1 width=8) (actual time=689.076..692.822 rows=1 loops=1)
   ->  Gather  (cost=67818.84..67819.05 rows=2 width=8) (actual time=688.895..692.813 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=66818.84..66818.85 rows=1 width=8) (actual time=621.564..621.565 rows=1 loops=3)
               ->  Parallel Index Only Scan using fwa_stream_networks_sp_edge_type_idx1 on fwa_stream_networks_sp  (cost=0.43..61707.05 rows=2044719 width=0) (actual time=0.075..337.181 rows=1635775 loops=3)
                     Heap Fetches: 177
 Planning Time: 1.647 ms
 Execution Time: 692.853 ms
(9 rows)

But this is getting off into the weeds. For paging, an estimate (maybe rounded up to the request limit) might be adequate?

@dr-jts commented Sep 7, 2021

Looks like the MacOS query is using an index-only scan, and the Ubuntu instance is not. If the indexes are indeed defined the same way, I'm not sure why that would be. Something about the table stats perhaps?
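
Just a guess, but index-only scans also depend on the visibility map being up to date - if autovacuum hasn't caught up on the Ubuntu instance, a manual vacuum would rule that out:

-- refreshes the visibility map (and the table statistics)
VACUUM ANALYZE whse_basemapping.fwa_stream_networks_sp;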

Agreed that finding a faster way to get an approximate count would be nice. Perhaps using the pg_class or pg_stat_user_tables statistics as per this SO answer. But that doesn't handle queries with filter conditions, or (probably) views, so the brute-force method will still be needed as a fallback. It might also be nice for the application to be able to choose when to populate this value, to avoid overhead when it isn't needed (via some sort of API extension).
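
For reference, the statistics-based estimate from that answer looks roughly like this (an approximation only - accuracy depends on how recently the table was vacuumed/analyzed):

SELECT reltuples::bigint AS estimated_count
FROM pg_class
WHERE oid = 'whse_basemapping.fwa_stream_networks_sp'::regclass;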

@smnorris (Owner, Author) commented Sep 7, 2021

Well, they should be defined the same way, but that doesn't mean they are. The index being used has a different name in one db - but both are basic btree indexes and were named by the system. I might try rebuilding both at some point.

To solve the fwapgr requirements in the short term I could just create an FWA meta table and manually populate it. The data rarely changes... and I don't want to get in the habit of serving huge collections either. Perhaps pg_fs is better, but paging through large DataBC WFS requests gets unreliable after a few 100k records.
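
A cached count could then be exposed through pg_fs with a trivial lookup function (a sketch only - names match the hypothetical meta table above):

CREATE OR REPLACE FUNCTION postgisftw.record_count_cached(tbl text)
RETURNS TABLE (record_count bigint)
LANGUAGE sql STABLE AS
$$
  -- serve the pre-computed count from the meta table
  SELECT n_records FROM postgisftw.record_counts WHERE table_name = tbl;
$$;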

@dr-jts commented Sep 7, 2021

Would it be helpful to have some way in the request API to indicate that the total number of records in the query response should be calculated and returned? E.g. perhaps a (non-standard) query parameter totalCount which would return the query result size in some way?

And how should the query result size be returned? As a special JSON document? Or as the standard numberMatched response property? (If the latter, the client could make an initial call with limit=0 to get just the numberMatched value, with no overhead of data transfer.)

@dr-jts commented Sep 7, 2021

> Perhaps pg_fs is better, but paging through large DataBC WFS requests gets unreliable after a few 100k records.

Unreliable how? Missing records? The unreliability might be a function of how Oracle implements offsetting. Postgres may or may not be better in this respect.

Hard to see why humans would want to wade through a large number of pages, but I guess automated systems might do data extracts that way, and thus require full reliability?

I'm starting to think about implementing a chunk of the CQL filter spec, so maybe that will provide a better alternative for reducing the size of query results?

@smnorris (Owner, Author) commented Sep 7, 2021

WFS reliability:
Paging through a large set can bail like this: smnorris/bcdata#75
Downloading the entire streams or watersheds table is a bad idea, but something like DRA is usually ok. It seems to depend on data density, not just the number of records.

I have not tried to debug it - the issue might just be that my script doesn't handle network interruptions. Files are generally available for these larger collections, and the files are far faster to download and load to postgres than requesting GeoJSON feature by feature.

API questions:
Those both sound reasonable from my somewhat uninformed viewpoint. I haven't been following the latest OGC / WFS standards or pygeoapi closely, but working with a standard numberMatched that way seems fine to me. My feature server needs are minimal; @joethorley may have more thoughts.
