"Delayed Deduplication" background job #515
jennydaman started this conversation in Ideas
Problem: ChRIS Files are Highly Duplicated
Typically, a majority of file data in CUBE is duplicated. The actual amount depends on usage, though I sometimes run feeds where each plugin instance is a copy operation.
ChRIS files are copied each time you run pl-dircopy, pl-topologicalcopy, ..., meaning the same piece of data consumes many times more storage than its own size, depending on how many times it is used. Furthermore, the only way to rename a file is to create a copy of it (e.g. pl-bulk-rename), which creates more duplicated data.
Alternative Solutions
unextpath parameters and ts-plugins should do these efficient copy operations. However, this isn't easy to take advantage of with ChRIS plugins: we must assume ChRIS plugin developers are not CUBE experts and will not bother to use our special "copy" mechanism where appropriate.
How can we deduplicate data (A) without the overhead of hashing, and (B) in a way that's transparent (no development burden for ChRIS plugin developers)?
Delayed Deduplication Service
1. Create a ChRIS plugin, pl-hash, which computes hashes. It writes a single file to its output directory. This output file will contain hashes of its input data.
2. Create the "Periodic Hashing Service" (phs*) which periodically** runs pl-hash on files in CUBE. phs reads from CUBE's database (db) directly, selecting the oldest files which have not been hashed. phs runs pl-hash on these files.
3. Create the "Periodic Deduplication Service" (pdds) which harvests completed pl-hash jobs. After each pl-hash job, compare the new hashes with every existing file hash. For every duplicate found, delete the more recent file from storage, and create a pointer (akin to a hard link) in the db for the more recent file to the older file.
Architecture: phs and pdds connect directly to CUBE's database, storage, and a pfcon.
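The phs selection rule and the pdds duplicate handling described above could be sketched roughly as follows. This is only an illustration under assumptions: `FileRecord` and its fields, `select_unhashed`, `deduplicate`, and the `delete_from_storage` callback are hypothetical names that do not come from CUBE, and an in-memory list stands in for CUBE's database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FileRecord:
    """Hypothetical stand-in for a row in CUBE's file table."""
    id: int
    created: int                    # creation time; smaller means older
    storage_path: str
    hash: Optional[str] = None      # None until pl-hash has processed the file
    link_to: Optional[int] = None   # id of the older file that holds the data


def select_unhashed(records: List[FileRecord], limit: int) -> List[FileRecord]:
    """phs step: pick the oldest files which have not been hashed yet."""
    pending = [r for r in records if r.hash is None and r.link_to is None]
    return sorted(pending, key=lambda r: r.created)[:limit]


def deduplicate(
    records: List[FileRecord],
    delete_from_storage: Callable[[str], None],
) -> None:
    """pdds step: keep the oldest file per hash, link newer copies to it."""
    oldest_by_hash: Dict[str, FileRecord] = {}
    # Visit files oldest first so the first file seen for a hash is the keeper.
    for rec in sorted(records, key=lambda r: r.created):
        if rec.hash is None or rec.link_to is not None:
            continue  # not hashed yet, or already deduplicated
        original = oldest_by_hash.setdefault(rec.hash, rec)
        if original is not rec:
            # Duplicate: drop the newer copy's data and point it at the older file.
            delete_from_storage(rec.storage_path)
            rec.link_to = original.id
```

In this sketch the newer record survives in the database with `link_to` set, so reads of the newer file would have to be resolved through the older file's `storage_path`. How CUBE would actually store and resolve such a pointer is left open here.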
Footnotes