Add `iter` singular value into TBE optimizer state #2474

csmiler · 2024-10-07T16:00:05Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/326

When TorchRec tracks the optimizer states of sharded embedding tables, it assumes each state is either point-wise (same shape as the embedding table, for example, Adam's exp_avg) or row-wise (same length as the embedding hashsize, for example, rowwise_adagrad's momentum/sum). However, a state can also take a third format: a single value per table. Specifically, for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb and GWD, the iter number is a single-value tensor, which cannot be tracked and checkpointed properly. (This also means that there is a bug in the current Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb usages!)
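To make the three formats concrete, here is a minimal sketch that classifies a state by its shape. The names `StateKind` and `classify_state` are hypothetical and exist only for illustration; they are not TorchRec or FBGEMM APIs, and the single-value state is assumed to have shape `(1,)`.

```python
from enum import Enum
from typing import Tuple


class StateKind(Enum):
    POINT_WISE = "point-wise"      # same shape as the embedding table
    ROW_WISE = "row-wise"          # one entry per embedding row
    SINGLE_VALUE = "single-value"  # one entry per table, e.g. `iter`


def classify_state(
    state_shape: Tuple[int, ...], table_shape: Tuple[int, int]
) -> StateKind:
    """Classify an optimizer state by comparing its shape to the table's shape.

    Hypothetical helper. Note that a one-row table makes row-wise and
    single-value states indistinguishable by shape; row-wise wins here.
    """
    rows, dim = table_shape
    if state_shape == (rows, dim):
        return StateKind.POINT_WISE
    if state_shape == (rows,):
        return StateKind.ROW_WISE
    if state_shape == (1,):
        return StateKind.SINGLE_VALUE
    raise ValueError(
        f"unsupported state shape {state_shape} for table {table_shape}"
    )
```

A tracker that only handles the first two branches has no place to put a `(1,)`-shaped `iter`, which is the gap this PR addresses.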

Here we support tracking and checkpointing single-value states by constructing ShardMetadata with row-wise sharding and replicating the single value for each sharded param (this is similar to how the row-wise states of column-wise sharded tables are concatenated along the row dimension).

By doing so, the single-value iter can be checkpointed just like the other states, so that states are reloaded correctly and training can continue from a checkpoint.

Differential Revision: D63909559
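The replication scheme described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this diff: `ShardMeta` is a hypothetical stand-in that only mirrors the `shard_offsets` / `shard_sizes` / `placement` fields of ShardMetadata, and the helper names and the placement string format are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ShardMeta:
    """Hypothetical stand-in for ShardMetadata (illustration only)."""
    shard_offsets: Tuple[int, ...]
    shard_sizes: Tuple[int, ...]
    placement: str


def single_value_shard_metadata(num_shards: int) -> List[ShardMeta]:
    """Give each shard one row of a 1-D tensor of length `num_shards`."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [
        # Shard `rank` owns the single element at offset `rank`.
        ShardMeta((rank,), (1,), f"rank:{rank}/cuda:{rank}")
        for rank in range(num_shards)
    ]


def replicate_single_value(value: int, num_shards: int) -> List[List[int]]:
    """Each shard stores its own copy of the single value."""
    return [[value] for _ in range(num_shards)]


def global_length(metas: List[ShardMeta]) -> int:
    """Length of the logical row-wise tensor formed by all shards."""
    return sum(m.shard_sizes[0] for m in metas)
```

Under this layout the checkpointed state looks like an ordinary row-wise sharded tensor with one row per shard, and every row holds the same `iter`, so any single copy is enough to restore it on reload.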

facebook-github-bot · 2024-10-07T16:00:13Z

This pull request was exported from Phabricator. Differential Revision: D63909559

Summary:
X-link: pytorch/torchrec#2474

X-link: facebookresearch/FBGEMM#326

When the optimizer states for sharded embedding tables are tracked in TorchRec, they are assumed to be either point-wise (same shape as the embedding table, for example, Adam's exp_avg), or row-wise (same length as the embedding hashsize, for example, rowwise_adagrad's momentum/sum). However, there may be other formats, such as a single value for each table. Specifically, for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb and GWD, the `iter` number is a single-value tensor, which **cannot be tracked and checkpointed properly** (this also means that there is a bug in Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb usages!)

Here we support tracking and checkpointing single-value states by constructing ShardMetadata with row-wise sharding and replicating the single value for each sharded param (this is similar to how the row-wise states of column-wise sharded tables are concatenated along the row dimension).

By doing so, single-value `iter` can be properly checkpointed just like other states, ensuring correct reloading of states and continuous training.

This diff checkpoints `iter` for rowwise_adagrad with GWD. The next diff would checkpoint `iter` for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb.

Reviewed By: spcyppt

Differential Revision: D63909559


facebook-github-bot · 2024-10-09T23:51:14Z

This pull request was exported from Phabricator. Differential Revision: D63909559


facebook-github-bot · 2024-10-10T12:51:35Z

This pull request was exported from Phabricator. Differential Revision: D63909559


facebook-github-bot · 2024-10-10T16:40:32Z

This pull request was exported from Phabricator. Differential Revision: D63909559

Summary:
X-link: pytorch/torchrec#2474

Pull Request resolved: #3228

X-link: facebookresearch/FBGEMM#326

When the optimizer states for sharded embedding tables are tracked in TorchRec, they are assumed to be either point-wise (same shape as the embedding table, for example, Adam's exp_avg), or row-wise (same length as the embedding hashsize, for example, rowwise_adagrad's momentum/sum). However, there may be other formats, such as a single value for each table. Specifically, for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb and GWD, the `iter` number is a single-value tensor, which **cannot be tracked and checkpointed properly** (this also means that there is a bug in Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb usages!)

Here we support tracking and checkpointing single-value states by constructing ShardMetadata with row-wise sharding and replicating the single value for each sharded param (this is similar to how the row-wise states of column-wise sharded tables are concatenated along the row dimension).

By doing so, single-value `iter` can be properly checkpointed just like other states, ensuring correct reloading of states and continuous training.

This diff checkpoints `iter` for rowwise_adagrad with GWD. The next diff would checkpoint `iter` for Adam/Partial_rowwise_adam/Lamb/Partial_rowwise_lamb.

Reviewed By: iamzainhuda, spcyppt

Differential Revision: D63909559

fbshipit-source-id: e14c1dc3e8f87bfc4cc95f2321b358526719d88f

facebook-github-bot added the CLA Signed label Oct 7, 2024

facebook-github-bot added the fb-exported label Oct 7, 2024

csmiler force-pushed the export-D63909559 branch from 4418332 to dec4abf October 9, 2024 23:50

csmiler force-pushed the export-D63909559 branch from dec4abf to 5daddb0 October 10, 2024 12:51

csmiler force-pushed the export-D63909559 branch from 5daddb0 to 263bfc6 October 10, 2024 16:40

facebook-github-bot closed this in be7cadb Oct 11, 2024
