Capture number of snapshots created per day as a metric #149

Open
wants to merge 2 commits into
base: main

@maluchari (Collaborator) commented Jul 30, 2024

Summary

Adds a metric that captures the number of snapshots created per day. This helps detect anomalous behavior in jobs that run against OpenHouse tables. It can also give us a count of unexpired snapshots from earlier days.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

Added a unit test to verify the new metric.

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

A minor non-breaking change.

For all the boxes checked, include additional details of the changes made in this pull request.

@maluchari maluchari force-pushed the malini/add_snapshot_stats branch from c3c15c5 to ffd3afc Compare August 1, 2024 03:53
cbb330 previously approved these changes Aug 12, 2024
@sumedhsakdeo (Collaborator) left a comment

A couple of nits, and a question about storing large maps in the table.

/** Get snapshot distribution for a given table by date. */
private static Map<String, Long> getSnapShotDistributionPerDay(
    Table table, SparkSession spark, MetadataTableType metadataTableType) {
  Dataset<Row> snapShotDistribution =
Collaborator

Suggested change
- Dataset<Row> snapShotDistribution =
+ Dataset<Row> snapshotDistribution =

@@ -35,4 +36,6 @@ public class IcebergTableStats extends BaseTableMetadata {
private Long numReferencedManifestFiles;

private Long numReferencedManifestLists;

private Map<String, Long> snapshotCountByDay;
Collaborator

Would this be a large map if the key covers every day?

Collaborator

If SE is functioning correctly, shouldn't this only hold a bounded number of days?

Collaborator Author

That is right. Ideally it should hold only 3 days' worth of data. We could also consider collecting only the past 2 days, since the older data should already be available from previous runs.

private static Map<String, Long> getSnapShotDistributionPerDay(
    Table table, SparkSession spark, MetadataTableType metadataTableType) {
  Dataset<Row> snapShotDistribution =
      SparkTableUtil.loadMetadataTable(spark, table, metadataTableType)
Collaborator

Will this fetch every snapshot committed since the beginning of the table?
If so, should it also apply a filter so that only snapshots from the last X days are counted?
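
To make the "last X days" idea concrete, here is a minimal sketch in plain Java. It is not the PR's implementation: the class `SnapshotDayCounts`, the method `countByDay`, and the `lookbackDays` parameter are hypothetical names, the list of epoch-millisecond timestamps stands in for the rows of the snapshots metadata table, and days are bucketed in UTC as an assumption.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical helper (not from the PR): per-day snapshot counts within a lookback window. */
class SnapshotDayCounts {

  /**
   * Counts commits per UTC calendar day, keeping only the most recent days.
   *
   * @param committedAtMillis commit timestamps in epoch millis (stand-in for metadata table rows)
   * @param now reference instant that defines "today"
   * @param lookbackDays number of days to keep, counting today as day 1
   */
  static Map<String, Long> countByDay(
      List<Long> committedAtMillis, Instant now, int lookbackDays) {
    LocalDate today = now.atZone(ZoneOffset.UTC).toLocalDate();
    // Earliest day that is still inside the window.
    LocalDate earliest = today.minusDays(lookbackDays - 1L);
    return committedAtMillis.stream()
        .map(ms -> Instant.ofEpochMilli(ms).atZone(ZoneOffset.UTC).toLocalDate())
        // Drop anything committed before the window starts.
        .filter(day -> !day.isBefore(earliest))
        // LocalDate::toString yields ISO "yyyy-MM-dd"; TreeMap keeps days sorted.
        .collect(Collectors.groupingBy(LocalDate::toString, TreeMap::new, Collectors.counting()));
  }
}
```

With a window of 2 or 3 days, as discussed in the thread above, the resulting map stays small no matter how old the table is.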

…eStatsCollectorUtil.java

Co-authored-by: Sumedh Sakdeo <[email protected]>

Comment on lines +147 to +154
Collectors.toMap(
    row -> {
      SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
      return formatter.format(new Date(row.getTimestamp(1).getTime()));
    },
    row -> 1L,
    Long::sum,
    LinkedHashMap::new));
Collaborator

Would it be better to perform this aggregation before collectAsList?
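
Wherever the aggregation ends up running, the collector under review relies on three things: a day key, a merge function that sums counts, and a map that preserves encounter order. The sketch below isolates that pattern in plain Java so it can be run without Spark. It is an illustration under stated assumptions, not the PR's code: `DayKeyedCounts` and `countByDay` are hypothetical names, the input is a list of `Instant` values rather than Spark rows, and the formatter is pinned to UTC, whereas the `SimpleDateFormat` in the snippet uses the JVM default time zone.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical illustration (not from the PR) of the toMap-based per-day count. */
class DayKeyedCounts {

  // DateTimeFormatter is immutable and thread-safe, so a single shared instance
  // replaces the formatter that the reviewed snippet creates for every row.
  // Assumption: days are bucketed in UTC.
  private static final DateTimeFormatter DAY =
      DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

  static Map<String, Long> countByDay(List<Instant> committedAt) {
    return committedAt.stream()
        .collect(
            Collectors.toMap(
                DAY::format, // key: the day the snapshot was committed
                ts -> 1L, // each snapshot contributes a count of one
                Long::sum, // merge counts when two snapshots share a day
                LinkedHashMap::new)); // keep days in first-seen order
  }
}
```

Because the map is a `LinkedHashMap`, the days appear in the order their first snapshot is encountered, so any ordering guarantee has to come from the input.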

5 participants