
Document how to find/delete duplicate archived files #12

Open
ZNikke opened this issue May 31, 2018 · 6 comments
Comments

ZNikke commented May 31, 2018

Older ENDIT versions seem to have been prone to archiving files multiple times when certain error conditions were triggered.

This should be due to old bugs, but we need to document how to detect whether this has happened and, most importantly, how to clean up afterwards.

The procedure is something like:

  • Run dsmc q archive -asnode=NODENAME '/path/out/*' and filter out the duplicates
  • Determine whether to keep the oldest or the newest file (it likely should not matter, but let's settle for the file that matches the operation that dCache deems successful, probably the last one)
  • Delete the other copy. If the descriptions are identical, this requires using dsmc delete archive -pick '/path/to/file' in order to be able to select just one of the duplicates.
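The query-and-filter step above can be sketched as a small shell pipeline. This is only a sketch, not part of ENDIT: it assumes the archived file path is the 5th whitespace-separated column of the dsmc q archive output, which may differ between client versions, so check your client's column layout and adjust the awk field first.

```shell
# find_dup_paths: read "dsmc q archive" output on stdin and print each
# archived file path that occurs more than once.
# ASSUMPTION: the path is the 5th whitespace-separated field of each
# data line; adjust $5 to match your dsmc client's actual output.
find_dup_paths() {
    awk 'NF >= 5 { print $5 }' | sort | uniq -d
}
```

Usage would be something like `dsmc q archive -asnode=NODENAME '/path/out/*' | find_dup_paths`, with header/footer lines from dsmc either filtered out beforehand or skipped by tightening the awk pattern.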
ZNikke commented Jun 12, 2018

The file logged as successfully transferred to tape by dCache is the last one, so that's the one that should be kept.

ZNikke commented Jun 21, 2018

Duplicates will most likely be detected via the logging from tsmtapehints.pl; since it already extracts the full file list, it was trivial to add duplicate detection at the same time.

ZNikke commented Jun 21, 2018

For clarity, the procedure is as follows, assuming old files are present that were written by ENDIT with a static description:

  • Identify the duplicates; if you run tsmtapehints.pl they are logged in tsmtapehints.log
  • su - to the ENDIT runtime user
  • Deletion MUST be done using dsmc in -pick mode, otherwise it's very easy to accidentally delete all archived copies of a file!
  • For each file, do something similar to dsmc delete archive -asnode=MIGRTEST -errorlogname=/tmp/endit-dsmerror.log -dateformat=3 -timeformat=1 -pick /grid/pool/out/0000E8B2DB24C6624B8AA91D0BFCB39AF13F , replacing the -asnode argument with your TSM proxy node name (see your endit.conf) and the file name with the file whose duplicates you want to delete.
    • The date/time format options output dates in ISO-style format
    • The -errorlogname option is needed if the ENDIT runtime user is not allowed to write to the default dsmc error log (usually owned by root).
    • If you forget the -pick option, all copies of the file will be deleted without any further prompt asking for confirmation
    • You should then get a text interface allowing you to select files using a numeric identifier, e.g. type 42 and press Enter
    • Do NOT select the newest copy, since you want to keep the file that corresponds to the object dCache successfully logged as migrated.
    • Select all other copies of the file
    • Execute the deletion by selecting OK, i.e. type o and press Enter
  • Repeat as needed until all duplicates are deleted.
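The per-file step above can be prepared with a small helper that only prints the dsmc delete archive -pick commands, one per duplicate path, so each can be reviewed before being run interactively. Nothing is executed or deleted by the helper itself; MIGRTEST and the error log path are the placeholder values from this comment, so substitute your own.

```shell
# pick_delete_cmds: read duplicate file paths on stdin (one per line)
# and print, for each, the interactive "dsmc delete archive -pick"
# command to run by hand. First argument is the TSM proxy node name
# (ASSUMPTION: "MIGRTEST" in the usage below is a placeholder; use the
# node from your endit.conf). No dsmc command is executed here.
pick_delete_cmds() {
    node="$1"
    while IFS= read -r path; do
        printf 'dsmc delete archive -asnode=%s -errorlogname=/tmp/endit-dsmerror.log -dateformat=3 -timeformat=1 -pick %s\n' "$node" "$path"
    done
}
```

Usage would be something like `pick_delete_cmds MIGRTEST < duplicates.txt`, where duplicates.txt holds one archived file path per line.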

ZNikke commented Jun 21, 2018

One reason for duplicates can be dCache being shut down while tsmarchiver.pl is running dsmc to archive files. Files that are successfully archived while the ENDIT dCache plugin is not running will not be marked as successfully migrated to tape, and dCache will retry the operation.

ZNikke commented Oct 25, 2022

Given a more modern version of the ENDIT daemons, all files are written with a description containing the time when that particular dsmc write session was started. This can be used to uniquely identify a single duplicate, avoiding the tedious manual procedure described previously.

An automated procedure can look like this:

  • Identify the duplicates; if you run tsmtapehints.pl they are logged in tsmtapehints.log
  • su - to the ENDIT runtime user
  • For each file name with duplicates, do query archive to get the descriptions of all duplicates
    • Something along the lines of dsmc query archive -asnode=MIGRTEST -errorlogname=/tmp/endit-dsmerror.log /grid/pool/out/0000E8B2DB24C6624B8AA91D0BFCB39AF13F , replacing the -asnode argument with your TSM proxy node name (see your endit.conf) and the file name.
  • Choose the duplicates you want to remove (usually you want to keep the latest copy)
  • delete archive the duplicates, using the description to identify them
    • Something along the lines of dsmc delete archive -asnode=MIGRTEST -errorlogname=/tmp/endit-dsmerror.log -desc=ENDIT-2022-10-22T22:48:10+0200 /grid/pool/out/0000E8B2DB24C6624B8AA91D0BFCB39AF13F , replacing the -asnode, -desc and file name arguments.
    • If you are unsure, do query archive first. If only a single file is listed, proceed with delete archive using the same arguments.
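Assuming the descriptions have already been extracted from the query archive output (that parsing is site-specific and not shown here), the choose-and-delete steps could be sketched as below: keep the lexically newest ENDIT-... description, since these ISO-style timestamps sort chronologically when compared lexically within one timezone, and print a delete archive -desc=... command for each older duplicate. Again, nothing is executed; review the printed commands before running them.

```shell
# desc_delete_cmds: read ENDIT descriptions on stdin, one per line
# (e.g. "ENDIT-2022-10-22T22:48:10+0200"), keep the lexically newest
# one (the copy to preserve) and print a "dsmc delete archive -desc="
# command for every older duplicate of the given file path.
# ASSUMPTIONS: descriptions share one timezone so lexical order is
# chronological; node and path arguments follow your endit.conf.
# With a single description on stdin, nothing is printed.
desc_delete_cmds() {
    node="$1"
    path="$2"
    sort | sed '$d' | while IFS= read -r desc; do
        printf 'dsmc delete archive -asnode=%s -errorlogname=/tmp/endit-dsmerror.log -desc=%s %s\n' "$node" "$desc" "$path"
    done
}
```

Usage would be something like `desc_delete_cmds MIGRTEST /grid/pool/out/FILE < descriptions.txt`, with descriptions.txt holding one description per line for that file.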

ZNikke commented Oct 25, 2022

The most common cause for duplicates today is pools being restarted while doing pool-to-pool migrations, i.e. moving tape data between instances. This causes the pool to re-transfer the files on disk, creating duplicates.

After such a migration we recommend running tsmtapehints to generate a fresh hint file, and checking the output/log for any duplicates.

ZNikke added a commit that referenced this issue Dec 29, 2022
Warn if we find files with size mismatch on retrieve errors.

There are some corner cases where this helps pinpoint the reason for
repeated retries etc. The most common case is duplicate files,
as described in #12