
Approximate vocabulary slower than vocabulary in Dataflow for a large number of features to apply vocab on #279

Open
tanguycdls opened this issue Jun 22, 2022 · 1 comment
@tanguycdls

Hello, for testing reasons we wanted to see whether approximate vocabulary is faster than vocabulary when there are many features (we have 36 features to analyze). In the past we hit the "graph too large" error in Dataflow when using vocabulary (fixed using upload_graph, but we still run into some limits).

However, in our comparison it is at least 4 times slower, and we are not sure we understand why.

Here is our transform function: we reload the data using TFXIO from a TFRecord dataset.

for key in custom_config[VOCABULARY_KEYS]:  # 36 features for that dataset
    ragged = tf.RaggedTensor.from_sparse(inputs[key])

    weights = _compute_weights(ragged, inputs['nbr_target_event'])  # function written in TF
    # Build a vocabulary for this feature.
    _ = tft.vocabulary(inputs[key],
                       weights=weights,
                       vocab_filename=f'{key}{VOCAB}',
                       store_frequency=True,
                       top_k=200000)

We replaced tft.vocabulary with tft.experimental.approximate_vocabulary.
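
For reference, a minimal sketch of the replaced call, assuming the same arguments as the tft.vocabulary call above (top_k is required for the approximate analyzer):

_ = tft.experimental.approximate_vocabulary(
    inputs[key],
    top_k=200000,
    weights=weights,
    vocab_filename=f'{key}{VOCAB}',
    store_frequency=True)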

Here are our stats for vocabulary vs. approximate vocabulary:

  • duration: 25 minutes (21.444 vCPU hr) vs. 2 hours (250 vCPU hr)
  • batch_size_MEAN: 900 vs. 32 --> seems very strange
  • batch_size_MAX: 1000 vs. 1000
  • Slowest operation: TFXIOReadAndDecode[AnalysisIndex3]/RawRecordToRecordBatch/RawRecordToRecordBatch/Decode, and GroupByKey for approximate vocab.

CPU usage peaks at 100% very quickly with all 8 machines started, but the throughput is much lower.

We use the following beam_pipeline_args:

beam_pipeline_args = [
    '--runner=DataflowRunner',
    '--no_use_public_ips',
    '--disk_size_gb=200',
    '--num_workers=1',  # nbr initial workers
    '--autoscaling_algorithm=THROUGHPUT_BASED',
    '--network=private-experiments-network',
    '--max_num_workers=8',
    '--experiments=use_runner_v2',
    '--experiments=use_network_tags=ssh',
    '--experiments=upload_graph',
    '--temp_location=' + _temp_location,
    '--project=' + GOOGLE_CLOUD_PROJECT,
    f'--worker_harness_container_image={docker_image_full_uri}',
    '--region=' + GCP_REGION,
    '--machine_type=e2-standard-16',
]

Do you see a reason why approximate vocabulary would be that much slower than vocabulary here? We tried reducing top_k, but it did not change much.

Thanks,

@pindinagesh pindinagesh self-assigned this Jun 27, 2022
@pindinagesh pindinagesh added the type:performance Performance Issue label Jul 4, 2022
@pindinagesh pindinagesh assigned zoyahav and unassigned pindinagesh Jul 11, 2022
@zoyahav zoyahav assigned zoyahav and iindyk and unassigned zoyahav Jul 13, 2022
@iindyk
Contributor

iindyk commented Jul 14, 2022

Thanks for reporting this!
The batch size decrease is definitely not expected and may be one of the reasons for the slowdown. It is controlled by Apache Beam in beam.BatchElements. Would you be able/willing to test whether the performance gets better by locally modifying this to also pass "min_batch_size": _BATCH_SIZE_CAP?
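
Purely as an illustration of that suggestion, a hedged sketch of such a local modification; the exact shape and location of the beam.BatchElements call inside tensorflow_transform/beam/analyzer_impls.py may differ between TFT versions, and pcoll below is just a placeholder name:

import apache_beam as beam

# Hypothetical sketch: pin the batch size to the cap instead of letting Beam
# tune it down. _BATCH_SIZE_CAP comes from TFT's analyzer_impls.py, and
# pcoll stands for the PCollection being batched.
batched = (
    pcoll
    | 'BatchElements' >> beam.BatchElements(
        min_batch_size=_BATCH_SIZE_CAP,
        max_batch_size=_BATCH_SIZE_CAP))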

As a side note, seeing that you do tf.RaggedTensor.from_sparse, I'm guessing that you have VarLenFeatures? If so, you could switch directly to RaggedTensors (and RaggedFeatures, now fully supported in TFT); this would make the mentioned Decode step and the preprocessing more efficient.
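
For what it's worth, a minimal sketch of what the feature-spec side of that switch could look like, assuming variable-length string features (the feature name below is made up):

import tensorflow as tf

# Hypothetical feature spec: a RaggedFeature in place of a VarLenFeature,
# so inputs[key] already arrives as a tf.RaggedTensor and the
# tf.RaggedTensor.from_sparse conversion in preprocessing_fn goes away.
feature_spec = {
    'some_vocab_feature': tf.io.RaggedFeature(dtype=tf.string),
    # instead of: 'some_vocab_feature': tf.io.VarLenFeature(tf.string),
}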
