What does this PR do?
Adds a few lines of code to call `future.release()` when we are done with a future, and adds a `client.run(gc.collect)` call after each evaluation to clean up a potential memory leak.
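For illustration, here is a minimal sketch of the pattern (the `evaluate_pipeline` helper and the surrounding loop are hypothetical stand-ins for TPOT's parallel evaluation code, not the actual diff):

```python
import gc
import numpy as np
from dask.distributed import Client

def evaluate_pipeline(seed):
    # hypothetical stand-in for TPOT's real pipeline evaluation
    X = np.random.default_rng(seed).random((2_000, 200))
    return float((X @ X.T).mean())

client = Client()
futures = [client.submit(evaluate_pipeline, i) for i in range(8)]

for future in futures:
    score = future.result()
    future.release()      # drop the reference so the cluster can forget the result

client.run(gc.collect)    # run a garbage-collection pass on every worker
client.close()
```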
Any background context you want to provide?
I noticed that running graph pipelines on large datasets caused training to effectively stop early. TPOT did not crash, but after a number of generations, every subsequent pipeline evaluated as "INVALID". My theory is that a memory leak ate up all the available RAM, causing those evaluations to fail.
I was able to reproduce a memory leak by running graph pipelines with several polynomial transformers (since this transformation rapidly blows up the data size). Watching the dask dashboard, the leak seems to happen when a future starts using a large amount of RAM. Normally dask is able to clean up the data when the pipeline is done evaluating, but in some cases with particularly large data/transformations it is not able to free the RAM. My theory is that this happens when a transformation inside the future grows beyond system memory and crashes.
I was able to free this memory by calling `client.run(gc.collect)` and `future.release()`. I added these to the parallel evaluation function in the hope that they help TPOT free memory more often and prevent this issue.
While I think this does help, it looks like training with large datasets and graph search spaces still sometimes terminates early in the same way described above, so there is more to look into.
Here's a short script I was able to run in a Jupyter notebook that could cause a potential memory leak with dask. The unmanaged memory could then be freed with `client.run(gc.collect)`:
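A minimal sketch of that kind of script (the `blow_up` helper, worker count, memory limit, and array sizes are illustrative assumptions, not the exact original):

```python
import gc
import numpy as np
from dask.distributed import Client
from sklearn.preprocessing import PolynomialFeatures

client = Client(n_workers=2, memory_limit="4GB")  # sizes are illustrative

def blow_up(seed):
    # degree-2 PolynomialFeatures roughly squares the column count, so the
    # intermediate array is large relative to worker memory
    X = np.random.default_rng(seed).random((10_000, 150))
    return PolynomialFeatures(degree=2).fit_transform(X).nbytes

futures = [client.submit(blow_up, i) for i in range(4)]
print([f.result() for f in futures])

# at this point the dashboard may still report unmanaged memory on the
# workers; a garbage-collection pass on each worker frees it
client.run(gc.collect)
```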