Data processing projects such as machine learning or data mining take in large amounts of input data. After the input data is loaded, such programs create or convert intermediate objects from it. Loading and converting the data takes a long time. This cost discourages software engineers from refactoring the code and keeping it clean.
We can install hideout with pip. Run the following command.
$ pip install hideout
Hideout supports two styles of caching. One is the decorator style; the other passes the target function as an argument to a Hideout function.
For the decorator style, we just add resumable to the target function.
from time import sleep

# resumable is the decorator provided by the hideout package
@resumable()
def generate():
    sleep(10)
    return {"foobar": "bar"}
If you want to cache the result of an instance method of an object, just add the resumable decorator to the instance method.
class Generator2:
    @resumable()
    def generate(self, baz):
        return {"foobar": baz}
Hideout saves and loads objects with hideout.resume_or_generate. If the cache file for the object exists, Hideout loads it; otherwise it calls the specified function to generate the expected object.
large_object = hideout.resume_or_generate(
    label="large_object",
    func=generate_large_object,
    func_args={"source": "s3-northeast-8.amazonaws.com/large-dic.txt"}
)
hideout.resume_or_generate has a func_args option, which holds the parameters passed to the specified function when it generates the expected object.
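The load-or-generate behaviour described above can be sketched in a few lines. This is a minimal illustration built on pickle, not Hideout's actual implementation; the helper name load_or_generate, its cache_dir parameter, and the .pickle suffix are assumptions made for the example.

```python
import os
import pickle
import tempfile


def load_or_generate(label, func, func_args=None, cache_dir="caches"):
    """Return the cached object for `label`, generating and saving it on a miss."""
    func_args = func_args or {}
    path = os.path.join(cache_dir, label + ".pickle")  # hypothetical file naming
    if os.path.exists(path):
        # Cache hit: load the stored object instead of calling func.
        with open(path, "rb") as f:
            return pickle.load(f)
    # Cache miss: generate the object, then store it for the next run.
    result = func(**func_args)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def generate_large_object(source):
    calls.append(source)  # record every real invocation
    return {"source": source}


demo_dir = tempfile.mkdtemp()
args = {"source": "large-dic.txt"}
first = load_or_generate("large_object", generate_large_object, args, cache_dir=demo_dir)
second = load_or_generate("large_object", generate_large_object, args, cache_dir=demo_dir)
```

The second call returns the same object without invoking generate_large_object again, which is the saving that matters when generation is slow.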
We can specify the prefix of the cache file with the label option. When we do not specify label, resume_or_generate automatically names the cache file from the function name and the arguments.
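One way to derive a cache file name from the function name and its arguments is sketched below. The exact scheme Hideout uses is not described here, so the helper cache_file_name, the hashing of the arguments, and the .pickle suffix are all assumptions for illustration.

```python
import hashlib


def cache_file_name(func, func_args, label=None):
    """Build a cache file name from a label, or from the function name and arguments."""
    if label is not None:
        # An explicit label takes precedence (hypothetical behaviour).
        return label + ".pickle"
    # Sort the arguments so the name does not depend on keyword order.
    arg_text = repr(sorted(func_args.items()))
    digest = hashlib.sha1(arg_text.encode("utf-8")).hexdigest()[:8]
    return "{}-{}.pickle".format(func.__name__, digest)


def generate(baz):
    return {"foobar": baz}


name_a = cache_file_name(generate, {"baz": 1})
name_b = cache_file_name(generate, {"baz": 2})
labelled = cache_file_name(generate, {"baz": 1}, label="large_object")
```

Including the arguments in the name keeps calls with different inputs from overwriting each other's cache files.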
By default, Hideout is not activated and therefore does not save or load cache files. To enable caching, set the environment variable HIDEOUT_ENABLE_CACHE to True.
$ HIDEOUT_ENABLE_CACHE=True your_data_engineering_program.py
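An opt-in switch like this can be read from the environment as sketched below. The function cache_enabled is hypothetical, and treating only the value True (case-insensitive) as "on" is an assumption; Hideout may parse the variable differently.

```python
import os


def cache_enabled(environ=None):
    """Report whether caching is switched on via HIDEOUT_ENABLE_CACHE."""
    environ = os.environ if environ is None else environ
    # Assumption: only the value "True" (any letter case) enables caching.
    return environ.get("HIDEOUT_ENABLE_CACHE", "").strip().lower() == "true"


enabled = cache_enabled({"HIDEOUT_ENABLE_CACHE": "True"})
disabled_by_default = cache_enabled({})
```

Keeping the cache off unless the variable is set means production runs regenerate every object, while development runs can opt in.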
Hideout provides stages for skipping caches at specified points. Users can attach a stage name to an object generation by passing the stage parameter to hideout.resume_or_generate.
large_object = hideout.resume_or_generate(
    label="large_object",
    func=generate_large_object,
    stage="preliminaries",
    func_args={"source": "s3-northeast-8.amazonaws.com/large-dic.txt"}
)
If you use the decorator style, add the stage parameter to the decorator.
@resumable(stage="preliminaries")
def generate():
    sleep(10)
    return {"foobar": "bar"}
When stage names are specified in HIDEOUT_SKIP_STAGES, Hideout skips caching for those stages.
For example, the following command skips caching for the stages named preliminaries and integrate.
$ HIDEOUT_SKIP_STAGES=preliminaries,integrate your_data_engineering_program.py
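The comma-separated stage list can be interpreted as sketched below. The helpers skipped_stages and should_use_cache are hypothetical names, and trimming whitespace around each stage name is an assumption rather than documented Hideout behaviour.

```python
import os


def skipped_stages(environ=None):
    """Parse HIDEOUT_SKIP_STAGES into a set of stage names."""
    environ = os.environ if environ is None else environ
    raw = environ.get("HIDEOUT_SKIP_STAGES", "")
    # Split on commas and drop empty entries such as a trailing comma.
    return {name.strip() for name in raw.split(",") if name.strip()}


def should_use_cache(stage, environ=None):
    """A stage uses its cache unless it is listed in HIDEOUT_SKIP_STAGES."""
    return stage not in skipped_stages(environ)


env = {"HIDEOUT_SKIP_STAGES": "preliminaries,integrate"}
stages = skipped_stages(env)
```

Skipping a stage is useful after changing the code that generates it, since the objects cached before the change are then ignored.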
By default, Hideout saves the cache files in the caches directory under the top project directory. To use a different directory, specify it with the environment variable HIDEOUT_CACHE_DIR.
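Resolving the cache directory with an environment override can be sketched as follows. The function resolve_cache_dir and its project_root parameter are hypothetical; only the caches default and the HIDEOUT_CACHE_DIR variable come from the description above.

```python
import os


def resolve_cache_dir(environ=None, project_root="."):
    """Return HIDEOUT_CACHE_DIR if set, otherwise <project_root>/caches."""
    environ = os.environ if environ is None else environ
    default = os.path.join(project_root, "caches")
    return environ.get("HIDEOUT_CACHE_DIR", default)


default_dir = resolve_cache_dir({}, project_root="proj")
custom_dir = resolve_cache_dir({"HIDEOUT_CACHE_DIR": "/tmp/hideout"})
```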
When you want to use the logger that the rest of your application uses, you can inject it with the hideout.set_logger() function.
We can install the hideout package from source and upload it to the PyPI repository.
$ python setup.py install
$ python setup.py sdist upload
MIT
See CONTRIBUTING.md.