RAM exhaustion with 59,067,585-line, 102 GB JSONL file #962
I don't immediately know what is going on here, but I have been making other improvements to Tippecanoe in https://github.com/felt/tippecanoe, including some that are meant to reduce memory consumption, so I would suggest trying with that version. If you can share your data file, I can try to reproduce the problem myself.
I'm going to run that fork you mentioned and I'll report back with my results. The dataset itself is the last 14 releases of the FCC's 'without satellite' 477 data: https://www.fcc.gov/general/broadband-deployment-data-fcc-form-477 The 14 CSV files were converted into JSONL. At one point this data lived in my client's BigQuery instance and I don't have a good way to share the ~20 GB compressed version of this dataset. It might be quicker to download the latest release, convert it to JSONL and then duplicate it 14 times.
Just to report back: I tried that fork and the job was killed after some time. The VM I ran it on has 64 GB of RAM.
I partitioned the 60M records on their H3 zoom level 1 values, this broke the records up into 44 files. They weren't even in size but it was the quickest way I could think of to break up the dataset.
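The partitioning step described above can be sketched as follows. This is a minimal sketch, not the poster's actual code: `cell_of` stands in for the real H3 lookup (e.g. `latlng_to_cell(lat, lng, 1)` from the `h3` package, which is an assumption), and the 30-degree band key below is purely hypothetical.

```python
import json
from collections import defaultdict

def partition_jsonl(lines, cell_of):
    """Bucket JSONL records by a coarse spatial key.

    In the thread this key was each record's H3 resolution-1 cell,
    which yielded 44 buckets; cell_of is whatever function computes
    the key (the h3 package is an assumption, not shown in the thread).
    """
    buckets = defaultdict(list)
    for line in lines:
        rec = json.loads(line)
        buckets[cell_of(rec)].append(line)
    return buckets

def degree_band(rec):
    # Hypothetical stand-in key: 30-degree latitude bands, not real H3 cells.
    return int(rec["lat"] // 30)
```

Each bucket would then be written out to its own `.geojson` file and fed to tippecanoe one file at a time, as described next.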
I then ran tippecanoe on them, one file at a time, as there were RAM usage spikes and I didn't want to suffer any OOM issues. Usually ~13 GB of RAM was in use on my 64 GB system, though this would spike at odd times. The process took 3 weeks to complete.

```shell
$ ls 8*.geojson \
    | xargs -P1 \
            -n1 \
            -I% \
            bash -c 'HEXVAL=`echo % | sed "s/.geojson//g"`; tippecanoe --coalesce-densest-as-needed -zg --extend-zooms-if-still-dropping -e fcc_477_$HEXVAL $HEXVAL.geojson'
```

The process produced 4.6 GB of PBF data across 168,754 files.

I took a 100K-record sample (~179 MB in GeoJSON) and ran it through strace to produce a FlameGraph. On an e2-highmem-4 (4 vCPUs, 32 GB of RAM) in GCP's LA zone, the following runs in 115 seconds and produces 31.7K PBFs totalling 147 MB in size. That is one PBF for roughly every 3 records.

```shell
$ tippecanoe \
    --coalesce-densest-as-needed \
    -zg \
    --extend-zooms-if-still-dropping \
    -e \
    fcc_477 \
    out_100k.geojson
```
This operation also has around 9.6K context switches and 43K page faults.
Below is a FlameGraph:

The main overhead appears to be all the files that need to be written out. If there were fewer PBFs, this process should run a lot quicker. It also leads to the small-file problem, where you end up with a lot of filesystem overhead simply from having too many files.

Is there a way to cut down the number of PBFs being produced? If I output to a single .mbtiles file it takes substantially longer, so I'm not sure that alone would be the answer for a 60M-record dataset that already takes 3 weeks to convert to PBFs.

I don't have much more to report in terms of RAM consumption, but if it can be kept down I should be able to run more tippecanoe commands in parallel with one another. The RAM-ceiling-to-process ratio is very high at the moment, and RAM is the most expensive $/GB piece of hardware on GCP.
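One quick way to quantify the small-file problem is to tally tiles per zoom level in the `-e` output directory. Tippecanoe's directory output is laid out as `z/x/y.pbf`, so the second path component is the zoom; the `fcc_477` directory name is taken from the command above.

```shell
# Tally .pbf tiles per zoom level in a z/x/y.pbf tile directory,
# highest-count zoom first.
find fcc_477 -name '*.pbf' | awk -F/ '{print $2}' | sort | uniq -c | sort -rn
```

High-zoom levels usually dominate the file count, so capping the max zoom (or letting `--coalesce-densest-as-needed` drop more aggressively) is where most of the reduction would come from.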
I'm running the following on a system with 64 cores and 64 GB of RAM. After a few hours of running, the application appears to exhaust all available memory and is terminated by the kernel.
Is there any workaround for this?
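One generic workaround is to chunk the input and tile each chunk independently, which is essentially what the H3 partitioning described in this thread does along spatial lines. A crude line-count version is sketched below under stated assumptions: GNU `split` (the `--additional-suffix` flag is GNU-specific), and `input.jsonl` and the `chunk_` names are hypothetical; a tiny stand-in file is used here for illustration.

```shell
# input.jsonl stands in for the 102 GB file; a tiny file for illustration.
printf '{"n": %s}\n' 1 2 3 4 5 6 > input.jsonl

# Chunk it into fixed-line pieces so each tippecanoe run's peak RAM is
# bounded by one chunk. For the real file something like -l 5000000
# would be more sensible than -l 2.
split -l 2 --additional-suffix=.jsonl input.jsonl chunk_

# Then tile each chunk separately, reusing the flags from this thread:
# for f in chunk_*.jsonl; do
#   tippecanoe --coalesce-densest-as-needed -zg \
#     --extend-zooms-if-still-dropping -e "tiles_${f%.jsonl}" "$f"
# done
ls chunk_*.jsonl
```

Note the trade-off: chunking by line count ignores spatial locality and, like the per-partition runs above, produces independent tilesets whose feature dropping is decided per chunk rather than globally.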
Here is an example record from the 102 GB JSONL file: