Dias is an automatic rewriter of pandas code for Jupyter (IPython) notebooks. It rewrites pandas code to semantically equivalent but faster versions, on the fly, transparently and correctly. Dias is extremely lightweight and incurs virtually no extra runtime or memory overhead. At the same time, Dias can provide 100x or even 1000x speedups (see example below).
Dias identifies rewrite opportunities automatically and leaves the rest of the code untouched, so you do not have to change a single line of your pandas code to use it.
Quickstart Colab Notebook
The fastest way to get started is to play around with our Quickstart Google Colab notebook. Otherwise, you can follow the documentation here to experiment locally.
Example: Original: 10.3s | Rewritten: 48.4ms
pip install dias
Make sure that you are using a Jupyter/IPython notebook.
First import the package... That's it!
import dias.rewriter
Our Quickstart notebook contains many examples in a single place. You can also see our examples directory, which lists self-contained examples that showcase different use cases of Dias.
Dias is extremely lightweight. In terms of memory overheads, anything that runs with vanilla pandas runs with Dias enabled too. Dias is just a code rewriter, so it does not alter the way pandas stores data and its internal state is minimal.
Dias' runtime overheads are minimal too. In our experiments, the maximum overhead of Dias is 23ms. You may also want to take a look at this example, where even though the original cell is quick, it is still worth using Dias.
Yes. Dias' output is standard Python code, and so, for example, you do not need to know anything about Dias to know why you got a speedup. Similarly, you can just copy Dias' output and use it as any other Python code.
To inspect the rewritten version, add the comment # DIAS_VERBOSE at the beginning of your cell (right after any magic functions). See this example.
No (which inherently means Dias does not suffer from lack of API support). Dias is a rewriter, which inspects and possibly rewrites pandas code.
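To give a feel for what "inspecting" code can mean, here is a minimal sketch that uses Python's standard ast module to spot one call shape, an expression followed by sort_values() and then head(). This is only an illustration and not Dias' actual implementation; the function name find_sort_head_chains and the sample cell are hypothetical.

```python
import ast


def find_sort_head_chains(source: str) -> list[int]:
    """Return line numbers of calls shaped like <expr>.sort_values(...).head(...).

    Hypothetical helper for illustration; Dias' real matcher may work differently.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        # The outer node must be a method call named `head`.
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr != "head":
            continue
        # The receiver of .head() must itself be a `.sort_values(...)` call.
        inner = node.func.value
        if (
            isinstance(inner, ast.Call)
            and isinstance(inner.func, ast.Attribute)
            and inner.func.attr == "sort_values"
        ):
            hits.append(node.lineno)
    return sorted(hits)


# A made-up notebook cell: only the first statement matches the shape.
cell = """
top = df.sort_values("price").head(5)
other = df.head(5)
"""
hit_lines = find_sort_head_chains(cell)
print(hit_lines)
```

Note that the sketch only parses the source text, so it never needs pandas or the data itself. A purely syntactic match like this cannot tell whether df is really a pandas object, which is one reason a real rewriter has to do more than match call names.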
No. Dias currently uses IPython features.
Dias looks for certain patterns, and upon recognizing one, it rewrites the code to a faster version. Thus, Dias will rewrite the code if it contains one of the patterns it is programmed to look for. Consider this example. One pattern Dias looks for is any expression followed by sort_values(), followed by head(). Upon recognizing this pattern, it rewrites the code to use nsmallest(). You can take a look at the paper for more information.
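The sort_values() / head() pattern can be tried by hand in plain pandas. The sketch below is not Dias' output; it is a manual comparison, on a made-up DataFrame with distinct values in the sort column, of the original form and the nsmallest() form.

```python
import pandas as pd

# Made-up data; the prices are distinct, so tie-breaking does not come into play.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d", "e"],
        "price": [30, 10, 50, 20, 40],
    }
)
n = 3

# Original form: sort the whole frame, then keep the first n rows.
original = df.sort_values("price").head(n)

# Rewritten form: select the n smallest rows directly, without a full sort.
rewritten = df.nsmallest(n, "price")

# Both forms yield the same rows, in the same order, with the same index.
same = original.equals(rewritten)
print(same)
```

The second form avoids sorting the entire DataFrame when only a few rows are needed, which is where the speedup comes from on large inputs. Any extra conditions Dias checks before applying this rewrite are not covered by this sketch.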
Dias is still under early but active development, so expect more patterns to be added soon!
No and no. Dias is not probabilistic; if it rewrites code, the result is always correct (barring implementation bugs). Dias is also not intended to be an assistant. First, it is intended to be quieter than an assistant. If Dias does its job correctly, you should never have to think about it. Second, while you can inspect the rewritten code, Dias does not offer any explanation of why the rewritten version is faster.
Dias is an ongoing research project by the ADAPT group @ UIUC. You can help us by sending us notebooks that you want to speed up, and we will do our best to make Dias do it automatically (send us an email with either the notebook or a Colab link)! Moreover, if you are aware of a pattern that can be rewritten to a faster version, please consider submitting an issue. You can use our template.
We also welcome feedback from all backgrounds, including industry specialists, data analysts and academics. Please reach out to [email protected] to share your opinion!
BibTeX
@article{dias,
author = {Baziotis, Stefanos and Kang, Daniel and Mendis, Charith},
title = {Dias: Dynamic Rewriting of Pandas Code},
year = {2024},
issue_date = {February 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {2},
number = {1},
url = {https://doi.org/10.1145/3639313},
doi = {10.1145/3639313},
abstract = {In recent years, dataframe libraries, such as pandas have exploded in popularity. Due to their flexibility, they are increasingly used in ad-hoc exploratory data analysis (EDA) workloads. These workloads are diverse, including custom functions which can span libraries or be written in pure Python. The majority of systems available to accelerate EDA workloads focus on bulk-parallel workloads, which contain vastly different computational patterns, typically within a single library. As a result, they can introduce excessive overheads for ad-hoc EDA workloads due to their expensive optimization techniques. Instead, we identify source-to-source, external program rewriting as a lightweight technique which can optimize across representations, and offer substantial speedups while also avoiding slowdowns. We implemented Dias, which rewrites notebook cells to be more efficient for ad-hoc EDA workloads. We develop techniques for efficient rewrites in Dias, including checking the preconditions under which rewrites are correct, dynamically, at fine-grained program points. We show that Dias can rewrite individual cells to be 57\texttimes{} faster compared to pandas and 1909\texttimes{} faster compared to optimized systems such as modin. Furthermore, Dias can accelerate whole notebooks by up to 3.6\texttimes{} compared to pandas and 27.1\texttimes{} compared to modin.},
journal = {Proc. ACM Manag. Data},
month = {mar},
articleno = {58},
numpages = {27},
keywords = {cross-representation, dynamic, pandas, rewriting}
}