[SPARK-49949][PS] Avoid unnecessary analyze task in `attach_sequence_column`

### What changes were proposed in this pull request?
Avoid unnecessary analyze task in `attach_sequence_column`

### Why are the changes needed?
In Connect mode, if the input `sdf` hasn't cached its schema, `attach_sequence_column` triggers an analyze task just to enumerate its columns.
However, the column names are not actually needed in this case, so the round trip can be avoided.
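A minimal sketch of the access pattern being fixed. `LazyFrame` is a hypothetical stand-in (not the actual Connect client code) for a DataFrame whose schema is fetched lazily from the server: reading `.columns` forces an analyze round trip, while selecting `"*"` lets the server resolve columns without one.

```python
class LazyFrame:
    """Hypothetical model of a Connect DataFrame with a lazily fetched schema."""

    def __init__(self, column_names):
        self._server_columns = column_names  # stands in for the remote schema
        self._cached_schema = None
        self.analyze_calls = 0

    @property
    def columns(self):
        # First access with no cached schema simulates an analyze RPC.
        if self._cached_schema is None:
            self.analyze_calls += 1
            self._cached_schema = list(self._server_columns)
        return self._cached_schema

    def select(self, *cols):
        return cols  # placeholder; only the access pattern matters here


# Old behavior: enumerating sdf.columns forces an analyze task.
sdf = LazyFrame(["a", "b"])
_ = [c for c in sdf.columns]
assert sdf.analyze_calls == 1

# New behavior: "*" defers column resolution to the server; no analyze needed.
fresh = LazyFrame(["a", "b"])
fresh.select("seq_col", "*")
assert fresh.analyze_calls == 0
```

In the real patch, `sdf.select(sequential_index.alias(column_name), "*")` relies on Spark expanding `"*"` on the server side, so the client never needs the schema.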

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48448 from zhengruifeng/attach_sequence_column.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
zhengruifeng committed Oct 14, 2024
1 parent 560748c commit eeb044e
Showing 1 changed file with 1 addition and 2 deletions.
`python/pyspark/pandas/internal.py` (1 addition, 2 deletions)

```diff
@@ -902,11 +902,10 @@ def attach_default_index(

     @staticmethod
     def attach_sequence_column(sdf: PySparkDataFrame, column_name: str) -> PySparkDataFrame:
-        scols = [scol_for(sdf, column) for column in sdf.columns]
         sequential_index = (
             F.row_number().over(Window.orderBy(F.monotonically_increasing_id())).cast("long") - 1
         )
-        return sdf.select(sequential_index.alias(column_name), *scols)
+        return sdf.select(sequential_index.alias(column_name), "*")
```
