Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: add handling for unmanaged files to vacuum command #1817

Open
wants to merge 25 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
1dc65b3
Add handling for unmanaged files to vacuum command
Jan-Schweizer Nov 7, 2023
e59bb34
collapse nested if block
Jan-Schweizer Nov 7, 2023
90b7741
chore: upgrade to the latest dynamodb-lock crate
rtyler Nov 7, 2023
809f645
feat: default logstore implementation (#1742)
dispanser Nov 9, 2023
8e64a0d
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Nov 9, 2023
7559c24
fix: use correct folder for auto assigned labels
roeap Nov 8, 2023
da6e438
fix: run integration tests in CI
roeap Nov 9, 2023
140f949
Update README.md
dennyglee Nov 9, 2023
a327fa8
Correctly handle hidden files in _change_data and _delta_index & dele…
Jan-Schweizer Nov 11, 2023
b40f276
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Nov 11, 2023
a358f06
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Nov 15, 2023
752773a
Fix paths for managed files
Jan-Schweizer Nov 15, 2023
dc74dcc
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Nov 29, 2023
96a5e0a
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
rtyler Nov 30, 2023
5ea77fd
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
ion-elgreco Nov 30, 2023
48b4e3c
docs: fix all examples and change overall structure (#1931)
ion-elgreco Dec 1, 2023
f90b48c
fix: prune each merge bin with only 1 file (#1902)
haruband Dec 2, 2023
d518f40
chore: update python version (#1934)
wjones127 Dec 2, 2023
2733f3d
Support os.PathLike for table references
bolkedebruin Nov 6, 2023
f4b9e91
add type param
wjones127 Dec 2, 2023
83f2f99
fix: get rid of panic in during table (#1928)
Dec 2, 2023
b946d07
Happify linter
Jan-Schweizer Dec 2, 2023
d4642bf
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Dec 2, 2023
04e3811
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
Jan-Schweizer Dec 14, 2023
27c2b53
Merge branch 'main' into fix/handle-unmanaged-files-in-vacuum-command
ion-elgreco Dec 20, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
34 changes: 28 additions & 6 deletions crates/deltalake-core/src/operations/vacuum.rs
Original file line number Diff line number Diff line change
Expand Up @@ -180,7 +180,6 @@ impl VacuumBuilder {
};

let expired_tombstones = get_stale_files(&self.snapshot, retention_period, now_millis);
let valid_files = self.snapshot.file_paths_iter().collect::<HashSet<Path>>();

let mut files_to_delete = vec![];
let mut file_sizes = vec![];
Expand All @@ -195,12 +194,31 @@ impl VacuumBuilder {
.ok_or(DeltaTableError::NoMetadata)?
.partition_columns;

let managed_files = self
.snapshot
.files()
.iter()
.map(|a| a.path.as_str())
.chain(
self.snapshot
.all_tombstones()
.iter()
.map(|r| r.path.as_str()),
)
.collect::<HashSet<&str>>();

while let Some(obj_meta) = all_files.next().await {
// TODO should we allow NotFound here in case we have a temporary commit file in the list
let obj_meta = obj_meta.map_err(DeltaTableError::from)?;
if valid_files.contains(&obj_meta.location) // file is still being tracked in table
|| !expired_tombstones.contains(obj_meta.location.as_ref()) // file is not an expired tombstone
|| is_hidden_directory(partition_columns, &obj_meta.location)?
let is_hidden = is_hidden_directory(partition_columns, &obj_meta.location)?;

if managed_files.contains(obj_meta.location.as_ref()) {
if !expired_tombstones.contains(obj_meta.location.as_ref()) || is_hidden {
continue;
}
} else if now_millis - retention_period.num_milliseconds()
< obj_meta.last_modified.timestamp_millis()
|| is_hidden
{
continue;
}
Expand Down Expand Up @@ -363,8 +381,12 @@ impl VacuumPlan {
/// deleted even if they'd normally be hidden. The _db_index directory contains (bloom filter)
/// indexes and these must be deleted when the data they are tied to is deleted.
fn is_hidden_directory(partition_columns: &[String], path: &Path) -> Result<bool, DeltaTableError> {
let path_name = path.to_string();
Ok((path_name.starts_with('.') || path_name.starts_with('_'))
let is_hidden = path
.parts()
.any(|p| p.as_ref().starts_with('.') || p.as_ref().starts_with('_'));

let path_name = path.as_ref();
Ok(is_hidden
&& !path_name.starts_with("_delta_index")
&& !path_name.starts_with("_change_data")
&& !partition_columns
Expand Down
3 changes: 1 addition & 2 deletions crates/deltalake-core/tests/command_vacuum.rs
Original file line number Diff line number Diff line change
Expand Up @@ -216,7 +216,6 @@ async fn test_partitions_included() {
);
}

#[ignore]
#[tokio::test]
// files that are not managed by the delta log and have a last_modified greater
// than the retention period should be deleted. Unmanaged files and directories
Expand Down Expand Up @@ -276,7 +275,7 @@ async fn test_non_managed_files() {

// Validate unmanaged files are deleted after the retention period
let res = {
clock.tick(Duration::hours(1));
clock.tick(Duration::days(7));
let (_, metrics) = DeltaOps(table)
.vacuum()
.with_clock(Arc::new(clock.clone()))
Expand Down
Loading