Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding caching #96

Open
2 of 7 tasks
daviddavo opened this issue Jul 28, 2022 · 12 comments
Open
2 of 7 tasks

Adding caching #96

daviddavo opened this issue Jul 28, 2022 · 12 comments
Labels
backend Tasks that modify the backend enhancement New feature or request

Comments

@daviddavo
Copy link
Member

daviddavo commented Jul 28, 2022

  • Make CacheRequester threadsafe (CacheRequester.request is called from multiple threads)
  • Adding flask-caching to the application with either redis or filesystem caching
  • Solve problems with caching when using gunicorn (it spawns multiple processes).
    • It works "the first time", when the cache is clean. But if you visit the same page again, last update date is Jan 1, 1 at 00:00 UTC...
  • Cache dropdown generation
  • Cache layout generation
  • Check what else to cache (search everything that makes complex calculations)
  • Create some kind of test to check how the changes improve the access time to the application
@daviddavo daviddavo added the enhancement New feature or request label Jul 28, 2022
@daviddavo daviddavo added this to the 1.2.0 milestone Jul 28, 2022
@daviddavo
Copy link
Member Author

The staging server seems to reload the page indefinitely... This does not happen in the normal deployment, or in development in localhost...

@daviddavo
Copy link
Member Author

The callbacks that take the most time to load for the first time are:

callback time (c+n)
change_subpage 1500ms + 500ms
load_ecosystem 1900ms + 100ms
org_filters 4000ms

@daviddavo
Copy link
Member Author

daviddavo commented Aug 4, 2022

Both get_layout (inside change_subpage) and load_ecosystem call the get_organization_list function.

callgraph

get_organization_list will always return the same result. So it can be cached even among workers... The first one to enter (each day) will have a hard time. But the following ones will experience a better result. We could even try to populate this cache after running the cache_scripts, somehow.

Nevertheless, as we use multiple processes, caching using a global variable (like we do on the CacheRequester) simply won't do. This method works because we are caching a disk read (and not a complex calculation). With long computations, the result should be cached between workers, and even between runs if possible.

All of this is possible with long callbacks, but we could also use Flask-Caching. Using Flask-Caching doesn't have background tasks, but it would be better because we would cache the result of our functions, and not the entire callback. For example, we should just cache get_organization_list instead of the callbacks that use it. I think this is the method wikichron uses. Furthermore, these processing tasks take a couple of seconds, not half a minute, we don't need background processing nor showing a progress bar.

The cache would be invalidated somehow when the datawarehouse is updated.

On deployment, we could even use the redis server used for wikichron (but I'd prefer we use a different one, to avoid disrupting it).

@daviddavo daviddavo changed the title Page takes time to load Adding flask-caching Aug 11, 2022
@daviddavo
Copy link
Member Author

The main problem with flask-caching is that it is an inter-process cache, while the CacheRequester is inter-thread (but intra-process). Every process has its own CacheRequester, which are not synchronized between them.

If we do the CacheRequester something inter-process, we will need to use the filesystem, which will make the CacheRequester useless. So we can either:

  • Make CacheRequester an inter-process cache (using a backend like Redis)
  • Remove the caching from CacheRequester and make it so the functions that use CacheRequester have to make sure of not requesting a lot.

Perhaps we could even implement both options (use Redis only if a redis backend is specified)

Furthermore, what will happen if we have two users trying to do the same computation? I think that, with our current approach (using just flask-caching), the following will happen:

  1. Computation in P1 starts
  2. Computation in P2 starts
  3. Computation in PX (either P1 or P2) finishes, and the result is saved to the cache
  4. Computation in PY (the other process) finishes, and the result is saved to the cache, overwritting the previous result.

I.e. the computation will be done more than necessary, consuming resources. This could be improved using a task queue, like Celery.

Celery is so well integrated with Flask, that you won't even need to install another package.

@daviddavo
Copy link
Member Author

It seems that Dash Caching (with DiskCache as a backend) allows for fine-tuning the things we couldn't fine-tune with Flask-Caching, like adding a function to check whether to invalidate the cache (which checks the last time the dw changed).

Caching the entire callbacks won't be as space efficient as just caching the "computation functions" (the functions which make long computation and take a lot of time), but I think it will be easier to implement and will integrate better. It also adds a task queue, which will fix a lot of problems: if you go now to dao-analyzer.science, you'll have to wait a bit to see a graph because all callbacks start at the same time.

  • Check if a priority system could be implemented so tasks that render the app go before tasks that render the first few graphs, that go before the rest.

Furthermore, we could create our custom wrappers for more fine-tuning.

At the time, the only thing that has been memoized is PlatformDao, there was not a lot of hours put into the flask-caching issue.

The only drawback of using background callbacks instead of flask-caching is that if two or more callbacks use an intermediate function, the result of that intermediate function won't be cached, which would be great. Nevertheless, we still could have the option of creating a custom decorator with dash's Cache Managers, or even using flask-caching alongside.

tldr: I think that we should pivot to use background callbacks and its caching and ditch the work done with flask-caching. (Or maybe keep both).

@javiag
Copy link
Contributor

javiag commented Sep 5, 2022

Probably we should have a conversation about it. Anyway some thoughts.

Some users have reported very long (unacceptable indeed) loading times in the dropdown menu component while searching for a DAO. This is the kind of problem that we have to fix.

Another important issue is that, since we update data every day, for most situations the data cached will be deprecated each day, right? Moreover, on many days we won't have two users interested in the same DAO. According to this, we should prioritize the DAO platform page, especially the info in the two panels, (even pre-loading the cache) and worry less about the problem for each DAO and the tab data.

As a guiding principle, even in life, "better strive for the good than for the best". So, probably, it makes no sense to keep both lines of work alive, but try to use a good solution even if not perfect. In this case, I'd try to fix the unacceptable loading times that we are aware of and the loading time of the DAOplatform page.

I can be overlooking some other issues and technical problems... but as I said first, we should make a decision by having a conversation.

@daviddavo
Copy link
Member Author

daviddavo commented Sep 7, 2022

I can't make background callbacks work...

I tried making chart_controller use a background callback, but then all graphs will be the same.

This is because the callback is added to a list of callbacks using the following hashing method:

    @staticmethod
    def hash_function(fn):
        fn_source = inspect.getsource(fn)
        fn_str = fn_source
        return hashlib.sha1(fn_str.encode("utf-8")).hexdigest()

Because all functions have the same code (only the id of the output changes), the code will be the same, and the hashed will be equal among all functions. I think the only possible solution would be to define the callback functions using exec.

Nevertheless, I'll ask on the forum to see if anyone else has come with a solution.

@daviddavo
Copy link
Member Author

Instead of creating multiple functions with the same code (one per each component), we could create just one of them, and use pattern matching callbacks for the output.

The main problem is that each controller (and thus, each callback) has a different adapter (and perhaps a different layout), so this is not possible without creating a custom "callback registry"...

@daviddavo
Copy link
Member Author

I think I'll put this on hold for a while, and focus on the dropdown

@daviddavo
Copy link
Member Author

@daviddavo daviddavo changed the title Adding flask-caching Adding caching Sep 8, 2022
@daviddavo
Copy link
Member Author

I managed to use background=True with the functions in main_view_controller. The improvement is not very noticeable in development, but perhaps it's noticeable after deployment.

@daviddavo
Copy link
Member Author

I ended up with a branch with both Flask-Caching and Dash Background Callbacks. I can't test if it's working and it has a lot of problems... Moving this to low priority and help wanted.

@daviddavo daviddavo added the help wanted Extra attention is needed label Sep 24, 2022
@daviddavo daviddavo removed this from the 1.2.0 milestone Sep 27, 2022
@daviddavo daviddavo added backend Tasks that modify the backend and removed help wanted Extra attention is needed labels Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Tasks that modify the backend enhancement New feature or request
Projects
Status: To do (Low Priority)
Development

When branches are created from issues, their pull requests are automatically linked.

2 participants