Hi @zoyahav thank you for the resources. The ask here is actually specific to BigQuery.
IIUC, there is no tfxio precanned input path for BigQuery sources. It seems like it exists for CSV, but there is no equivalent for BigQuery, unless a user writes the PyArrow RecordBatch conversion code themselves.
Are there any plans to create a tfx_bsl.public.tfxio.BeamRecordBigQueryTFXIO or similar? This would help a lot of our use cases.
AnalyzeAndTransformDataset should not run _InstanceDictInputToTF twice.
AnalyzeAndTransformDataset runs AnalyzeDataset and TransformDataset back-to-back. AnalyzeDataset runs _InstanceDictInputToTFXIOInput, and TransformDataset also runs _InstanceDictInputToTFXIOInput.
But when running AnalyzeAndTransformDataset, the _InstanceDictInputToTFXIOInput call in TransformDataset is unnecessary, since it was already run in AnalyzeDataset.
The _InstanceDictInputToTFXIOInput transformation is expensive, and this redundant call meaningfully increases runtime and cost.
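To make the redundancy concrete, here is a toy model of the call pattern in plain Python. Every name in it is a hypothetical stand-in (nothing here is tf.Transform's actual implementation). It only shows how a combined analyze-and-transform step could convert the input once and share the result.

```python
# Toy model only: all names are hypothetical stand-ins, not tf.Transform code.
CALLS = {"convert": 0}


def convert_instance_dicts(rows):
    """Stand-in for the expensive instance-dict conversion step."""
    CALLS["convert"] += 1
    return [dict(row) for row in rows]


def analyze(rows, already_converted=False):
    """Compute min/max of column 'x', converting the input unless told not to."""
    data = rows if already_converted else convert_instance_dicts(rows)
    values = [row["x"] for row in data]
    return {"min": min(values), "max": max(values)}


def transform(rows, stats, already_converted=False):
    """Scale column 'x' to [0, 1], converting the input unless told not to."""
    data = rows if already_converted else convert_instance_dicts(rows)
    span = (stats["max"] - stats["min"]) or 1
    return [{"x": (row["x"] - stats["min"]) / span} for row in data]


def analyze_and_transform_naive(rows):
    """Mirrors the reported behavior: the conversion runs in both steps."""
    stats = analyze(rows)
    return transform(rows, stats)


def analyze_and_transform_shared(rows):
    """Converts once and hands the converted data to both steps."""
    converted = convert_instance_dicts(rows)
    stats = analyze(converted, already_converted=True)
    return transform(converted, stats, already_converted=True)
```

Both variants return the same output. The naive one calls the conversion twice per dataset and the shared one calls it once, which is the saving this issue asks for.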