
Document how to find/delete duplicate archived files #12

Open
ZNikke opened this issue May 31, 2018 · 6 comments
Comments

ZNikke commented May 31, 2018

Older ENDIT versions seem to have been prone to archiving files multiple times when certain error conditions were triggered.

This should be due to old bugs, but we need to document how to detect whether this has happened and, most importantly, how to clean up afterwards.

The procedure is something like:

  • Run dsmc q archive -asnode=NODENAME '/path/out/*' and filter out the duplicates
  • Determine whether to keep the oldest or the newest file (it likely should not matter, but let's settle for the file that matches the operation that dCache deems successful, probably the last one)
  • Delete the other copy. If the descriptions are identical, this requires using dsmc delete archive -pick '/path/to/file' in order to be able to select just one of the duplicates.
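The query-and-filter step above can be sketched as a small shell pipeline. This is only a sketch, not part of ENDIT: it assumes the archived file path is the 5th whitespace-separated column of the dsmc q archive output, which may differ between client versions, so check your client's column layout and adjust the awk field first.

```shell
# find_dup_paths: read "dsmc q archive" output on stdin and print each
# archived file path that occurs more than once.
# ASSUMPTION: the path is the 5th whitespace-separated field of each
# data line; adjust $5 to match your dsmc client's actual output.
find_dup_paths() {
    awk 'NF >= 5 { print $5 }' | sort | uniq -d
}
```

Usage would be something like `dsmc q archive -asnode=NODENAME '/path/out/*' | find_dup_paths`, with header/footer lines from dsmc either filtered out beforehand or skipped by tightening the awk pattern.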
ZNikke commented Jun 12, 2018

The file logged as successfully transferred to tape by dCache is the last one, so that's the one that should be kept.

ZNikke commented Jun 21, 2018

Duplicates will most likely be detected via the logging from tsmtapehints.pl; since it already extracts the full file list, it was trivial to add duplicate detection at the same time.

ZNikke commented Jun 21, 2018

For clarity, the procedure is as follows, assuming old files are present that were written by ENDIT with a static description:

  • Identify the duplicates; if you run tsmtapehints.pl they are logged in tsmtapehints.log
  • su - to the ENDIT runtime user
  • Deletion MUST be done using dsmc in -pick mode, otherwise it's very easy to accidentally delete all archived copies of a file!
  • For each file, do something similar to dsmc delete archive -asnode=MIGRTEST -errorlogname=/tmp/endit-dsmerror.log -dateformat=3 -timeformat=1 -pick /grid/pool/out/0000E8B2DB24C6624B8AA91D0BFCB39AF13F , replacing the -asnode argument with your TSM proxy node name (see your endit.conf) and the file name with the file whose duplicates you want to delete.
    • The date/time format options output dates in ISO-style format
    • The -errorlogname option is needed if the ENDIT runtime user is not allowed to write to the default dsmc error log (usually owned by root).
    • If you forget the -pick option, all copies of the file will be deleted without any further prompt asking for confirmation
    • You should then get a text interface allowing you to select files using a numeric identifier, e.g. type 42 and press Enter
    • Do NOT select the newest copy, since you want to keep the file that corresponds to the object dCache successfully logged as migrated.
    • Select all other copies of the file
    • Execute the deletion by selecting OK, i.e. type o and press Enter
  • Repeat as needed until all duplicates are deleted.
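The per-file step above can be prepared with a small helper that only prints the dsmc delete archive -pick commands, one per duplicate path, so each can be reviewed before being run interactively. Nothing is executed or deleted by the helper itself; MIGRTEST and the error log path are the placeholder values from this comment, so substitute your own.

```shell
# pick_delete_cmds: read duplicate file paths on stdin (one per line)
# and print, for each, the interactive "dsmc delete archive -pick"
# command to run by hand. First argument is the TSM proxy node name
# (ASSUMPTION: "MIGRTEST" in the usage below is a placeholder; use the
# node from your endit.conf). No dsmc command is executed here.
pick_delete_cmds() {
    node="$1"
    while IFS= read -r path; do
        printf 'dsmc delete archive -asnode=%s -errorlogname=/tmp/endit-dsmerror.log -dateformat=3 -timeformat=1 -pick %s\n' "$node" "$path"
    done
}
```

Usage would be something like `pick_delete_cmds MIGRTEST < duplicates.txt`, where duplicates.txt holds one archived file path per line.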

ZNikke commented Jun 21, 2018

One reason for duplicates can be dCache being shut down while tsmarchiver.pl is running dsmc to archive files. Files that are successfully archived while the ENDIT dCache plugin is not running will not be marked as successfully migrated to tape, and dCache will retry the operation.

ZNikke commented Oct 25, 2022

Given a more modern version of the ENDIT daemons, all files are written with a description containing the time when that particular dsmc write session was started. This can be used to uniquely identify a single duplicate, avoiding the tedious manual procedure described previously.

An automated procedure can look like this:

  • Identify the duplicates; if you run tsmtapehints.pl they are logged in tsmtapehints.log
  • su - to the ENDIT runtime user
  • For each file name with duplicates, do query archive to get the descriptions of all duplicates
    • Something along the lines of dsmc query archive -asnode=MIGRTEST -errorlogname=/tmp/endit-dsmerror.log /grid/pool/out/0000E8B2DB24C6624B8AA91D0BFCB39AF13F , replacing the -asnode argument with your TSM proxy node name (see your endit.conf) and the file name.
  • Choose the duplicates you want to remove (usually you want to keep the latest copy)
  • delete archive the duplicates, using the description to identify them
    • Something along the lines of dsmc delete archive -asnode=MIGRTEST -errorlogname=/tmp/endit-dsmerror.log -desc=ENDIT-2022-10-22T22:48:10+0200 /grid/pool/out/0000E8B2DB24C6624B8AA91D0BFCB39AF13F , replacing the -asnode, -desc and file name arguments.
    • If you are unsure, do query archive first. If only a single file is listed, proceed with delete archive using the same arguments.
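Assuming the descriptions have already been extracted from the query archive output (that parsing is site-specific and not shown here), the choose-and-delete steps could be sketched as below: keep the lexically newest ENDIT-... description, since these ISO-style timestamps sort chronologically when compared lexically within one timezone, and print a delete archive -desc=... command for each older duplicate. Again, nothing is executed; review the printed commands before running them.

```shell
# desc_delete_cmds: read ENDIT descriptions on stdin, one per line
# (e.g. "ENDIT-2022-10-22T22:48:10+0200"), keep the lexically newest
# one (the copy to preserve) and print a "dsmc delete archive -desc="
# command for every older duplicate of the given file path.
# ASSUMPTIONS: descriptions share one timezone so lexical order is
# chronological; node and path arguments follow your endit.conf.
# With a single description on stdin, nothing is printed.
desc_delete_cmds() {
    node="$1"
    path="$2"
    sort | sed '$d' | while IFS= read -r desc; do
        printf 'dsmc delete archive -asnode=%s -errorlogname=/tmp/endit-dsmerror.log -desc=%s %s\n' "$node" "$desc" "$path"
    done
}
```

Usage would be something like `desc_delete_cmds MIGRTEST /grid/pool/out/FILE < descriptions.txt`, with descriptions.txt holding one description per line for that file.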

ZNikke commented Oct 25, 2022

The most common cause for duplicates today is pools being restarted while doing pool-to-pool migrations, i.e. moving tape data between instances. This causes the pool to re-transfer the files on disk, creating duplicates.

After such a migration we recommend running tsmtapehints to generate a fresh hint file, and checking the output/log for any duplicates.

ZNikke added a commit that referenced this issue Dec 29, 2022
Warn if we find files with size mismatch on retrieve errors.

There are some corner cases where this helps pinpoint the reason for
repeated retries etc. The most common case is duplicate files,
as described in #12