
How to embed tokenization #2371

Open
piotrkowalczuk opened this issue Oct 19, 2024 · 2 comments
Labels
question Response providing clarification needed. Will not be assigned to a release. (type)

Comments

@piotrkowalczuk

How to embed tokenization ❓

Models created using the Create ML app provide this sleek API that hides some complexity:

/// Model Prediction Input Type
@available(macOS 10.14, iOS 12.0, tvOS 12.0, watchOS 5.0, visionOS 1.0, *)
class ExampleClassifierInput : MLFeatureProvider {

    /// Input text as string value
    var text: String

    var featureNames: Set<String> { ["text"] }

    func featureValue(for featureName: String) -> MLFeatureValue? {
        if featureName == "text" {
            return MLFeatureValue(string: text)
        }
        return nil
    }

    init(text: String) {
        self.text = text
    }
}

While trying to convert the model:

def build_inference_model(weights):
    text = tf.keras.Input(shape=(), dtype=tf.string, name='text')
    input_ids = tf.keras.layers.TextVectorization(
        output_mode='int',
        output_sequence_length=MAX_LENGTH,
        vocabulary='bert_vocabulary.txt',
    )(text)
    attention_mask = tf.cast(tf.not_equal(input_ids, 0), dtype=tf.int32)

    bert = transformers.TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased',
        num_labels=len(class_labels))
    bert.set_weights(weights)

    predictions = bert(input_ids=input_ids, attention_mask=attention_mask)
    predicted_label = ArgmaxAndLabelMappingLayer(class_labels, name="category")(predictions.logits)

    model = tf.keras.Model(inputs=text, outputs=predicted_label, name="classifier")

    model.summary()

    return model

I encountered this error:

TypeError: dtype=<class 'coremltools.converters.mil.mil.types.type_str.str'> is unsupported for inputs/outputs of the model.

What do you think is the best way to handle my use case? How far can the coremltools library take me without my having to create a Swift package that wraps the model?
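As far as I know, coremltools does not support string tensors as model inputs, which is why the TextVectorization layer trips the converter; the lookup-and-pad step has to run outside the converted model. Below is a minimal numpy sketch of roughly what that layer does at inference time, assuming whitespace tokenization and a small hypothetical vocabulary (a real BERT vocabulary uses WordPiece subword tokenization, so this only illustrates the pipeline shape, not the exact tokenizer):

```python
import numpy as np

MAX_LENGTH = 8  # hypothetical; must match output_sequence_length used at training time

# Hypothetical vocabulary; index 0 is reserved for padding and index 1 for
# out-of-vocabulary tokens, mirroring TextVectorization's default layout.
vocab = ["", "[UNK]", "the", "movie", "was", "great", "terrible"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

def vectorize(text: str) -> np.ndarray:
    """Whitespace-split, map tokens to ids (OOV -> 1), pad/truncate to MAX_LENGTH."""
    ids = [token_to_id.get(tok, 1) for tok in text.lower().split()]
    ids = ids[:MAX_LENGTH] + [0] * max(0, MAX_LENGTH - len(ids))
    return np.asarray(ids, dtype=np.int32)

input_ids = vectorize("The movie was great")
attention_mask = (input_ids != 0).astype(np.int32)  # same mask rule as in the Keras graph
```

With tokenization factored out like this, the rest of the graph (the DistilBERT body plus the argmax/label-mapping layer) takes only integer tensors, which coremltools can handle.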

@piotrkowalczuk piotrkowalczuk added the question Response providing clarification needed. Will not be assigned to a release. (type) label Oct 19, 2024
@TobyRoseman
Collaborator

Hi @piotrkowalczuk - I'm confused. What are you trying to do here? Do you have a Core ML model that you're trying to get predictions from in Python?

@piotrkowalczuk
Author

I have a TensorFlow model that I am trying to convert using the Python tooling, and I found that the conversion has some limitations. I want a user experience similar to what the Create ML app offers: a comparable text classifier trained in the Create ML app has the tokenizer included, and such a model is easier to distribute.
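If the tokenizer cannot live inside the converted model itself, one option is to bundle it with the model at the wrapper level, mirroring the shape of the generated Create ML API. A hypothetical Python sketch (the `predict_fn` stub stands in for a converted Core ML model's prediction call, and the vocabulary, class labels, and tokenization rule are all illustrative assumptions):

```python
import numpy as np

class TextClassifier:
    """Hypothetical wrapper: bundles the tokenizer with a prediction function,
    so callers pass a plain string, as with a Create ML generated class."""

    def __init__(self, vocab, max_length, predict_fn):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.max_length = max_length
        self.predict_fn = predict_fn  # e.g. a converted model's predict call

    def predict(self, text: str) -> str:
        # Whitespace tokenization with OOV -> 1 and padding id 0; a real BERT
        # pipeline would use WordPiece here instead.
        ids = [self.token_to_id.get(t, 1) for t in text.lower().split()]
        ids = ids[: self.max_length] + [0] * max(0, self.max_length - len(ids))
        input_ids = np.asarray([ids], dtype=np.int32)
        attention_mask = (input_ids != 0).astype(np.int32)
        return self.predict_fn(input_ids, attention_mask)

# Stub standing in for the converted model, so the wrapper can be exercised
# without an actual .mlpackage on hand.
clf = TextClassifier(
    vocab=["", "[UNK]", "good", "bad"],
    max_length=4,
    predict_fn=lambda ids, mask: "positive" if 2 in ids else "negative",
)
```

The same idea ported to Swift would reproduce the `ExampleClassifierInput`-style API from the first comment, but the tokenizer would be shipped as data (the vocabulary file) alongside the model rather than baked into the model graph.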
