
Error when checkpointing a dataset that uses SentencepieceTokenizer #1289

Open
chrisc36 opened this issue Jun 24, 2024 · 1 comment

@chrisc36

I am running into an error when checkpointing a tf.data.Dataset iterator whose pipeline uses a SentencepieceTokenizer for tokenization. It fails with:

tensorflow.python.framework.errors_impl.FailedPreconditionError: {{function_node __wrapped__SerializeIterator_device_/job:localhost/replica:0/task:0/device:CPU:0}} SentencepieceTokenizeOp is stateful. [Op:SerializeIterator] name:

As a result, I cannot checkpoint datasets that use SentencepieceTokenizer. Is there a fix or workaround that would resolve the issue? I saw

ALLOW_STATEFUL_OP_FOR_DATASET_FUNCTIONS("SentencepieceTokenizeOp");
which makes it look like this is supposed to be possible.

Code to reproduce the issue:

import tensorflow as tf
import tensorflow_text as tf_text

# Load the serialized SentencePiece model and build the tokenizer.
with open("/path/to/tokenizer.model", "rb") as f:
    sp_model = f.read()
tokenizer = tf_text.SentencepieceTokenizer(sp_model)
ds = tf.data.Dataset.from_tensor_slices(dict(data=["ex1", "ex2", "ex3"]))

def _map(ex):
    # Tokenize inside the tf.data pipeline.
    return dict(data=tokenizer.tokenize(ex["data"]))

ds: tf.data.Dataset = ds.map(_map)
iterator = iter(ds)
ckpt = tf.train.Checkpoint(iterator=iterator)
# Raises FailedPreconditionError: SentencepieceTokenizeOp is stateful.
ckpt.write("/tmp/iterator")

colab:
https://colab.research.google.com/drive/1kGYP4GJ2YVGBVQaxNzcIm1M3VxO9yRse?authuser=1#scrollTo=nZ5PVQk-BRP7

@chrisc36

In case anyone else has this issue, one workaround is to use tf.numpy_function with a regular Python tokenizer while setting stateful=False.
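
A minimal sketch of that workaround, under a few assumptions: `byte_encode` is a hypothetical stand-in tokenizer (one id per UTF-8 byte) so the snippet runs without a model file, and `build_dataset` is an illustrative helper name, not something from the original report. In practice `encode_fn` would presumably be the `encode` method of a `sentencepiece.SentencePieceProcessor` loaded from the same model file. Note that the body of a tf.numpy_function lives in the Python process rather than in the graph, so a checkpoint written this way likely needs to be restored by a program that builds the same pipeline.

```python
import numpy as np
import tensorflow as tf


def byte_encode(text):
    """Hypothetical stand-in tokenizer: one id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def build_dataset(encode_fn=byte_encode):
    """Builds the example dataset, tokenizing in Python via tf.numpy_function."""

    def _py_tokenize(raw):
        # The scalar tf.string arrives as bytes (or a 0-d array); normalize it.
        text = np.asarray(raw).item().decode("utf-8")
        return np.asarray(encode_fn(text), dtype=np.int32)

    def _map(ex):
        # stateful=False declares the function pure, so iterator
        # serialization does not reject it as a stateful op.
        ids = tf.numpy_function(
            _py_tokenize, [ex["data"]], tf.int32, stateful=False
        )
        ids.set_shape([None])  # numpy_function loses static shape info
        return dict(data=ids)

    ds = tf.data.Dataset.from_tensor_slices(dict(data=["ex1", "ex2", "ex3"]))
    return ds.map(_map)


if __name__ == "__main__":
    iterator = iter(build_dataset())
    print(next(iterator)["data"].numpy())
    ckpt = tf.train.Checkpoint(iterator=iterator)
    ckpt.write("/tmp/iterator")
```

The trade-off is that tokenization now runs in Python under the GIL rather than inside a TensorFlow op, which may be slower for large pipelines.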
