[SPARK-49949][PS] Avoid unnecessary analyze task in `attach_sequence_column`

### What changes were proposed in this pull request?
Avoid unnecessary analyze task in `attach_sequence_column`

### Why are the changes needed?
In Connect mode, if the input `sdf` hasn't cached its schema, `attach_sequence_column` triggers an analyze task just to enumerate its columns.
However, the column names are not actually needed in this case, so the round trip can be avoided.
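A minimal sketch of the access pattern being fixed. `LazyFrame` is a hypothetical stand-in (not the actual Connect client code) for a DataFrame whose schema is fetched lazily from the server: reading `.columns` forces an analyze round trip, while selecting `"*"` lets the server resolve columns without one.

```python
class LazyFrame:
    """Hypothetical model of a Connect DataFrame with a lazily fetched schema."""

    def __init__(self, column_names):
        self._server_columns = column_names  # stands in for the remote schema
        self._cached_schema = None
        self.analyze_calls = 0

    @property
    def columns(self):
        # First access with no cached schema simulates an analyze RPC.
        if self._cached_schema is None:
            self.analyze_calls += 1
            self._cached_schema = list(self._server_columns)
        return self._cached_schema

    def select(self, *cols):
        return cols  # placeholder; only the access pattern matters here


# Old behavior: enumerating sdf.columns forces an analyze task.
sdf = LazyFrame(["a", "b"])
_ = [c for c in sdf.columns]
assert sdf.analyze_calls == 1

# New behavior: "*" defers column resolution to the server; no analyze needed.
fresh = LazyFrame(["a", "b"])
fresh.select("seq_col", "*")
assert fresh.analyze_calls == 0
```

In the real patch, `sdf.select(sequential_index.alias(column_name), "*")` relies on Spark expanding `"*"` on the server side, so the client never needs the schema.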

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes apache#48448 from zhengruifeng/attach_sequence_column.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
zhengruifeng committed Oct 14, 2024
1 parent 560748c commit eeb044e
Showing 1 changed file with 1 addition and 2 deletions.
`python/pyspark/pandas/internal.py` (1 addition, 2 deletions)

```diff
@@ -902,11 +902,10 @@ def attach_default_index(

     @staticmethod
     def attach_sequence_column(sdf: PySparkDataFrame, column_name: str) -> PySparkDataFrame:
-        scols = [scol_for(sdf, column) for column in sdf.columns]
         sequential_index = (
             F.row_number().over(Window.orderBy(F.monotonically_increasing_id())).cast("long") - 1
         )
-        return sdf.select(sequential_index.alias(column_name), *scols)
+        return sdf.select(sequential_index.alias(column_name), "*")
```
