The columns of A don't match the number of elements of x. A: 768, x: 1536 #14362

SidWeng · 2024-08-08T03:47:35Z


I use the following pipeline with BioBERT sentence embeddings.
However, calling pipeline.fit() throws "The columns of A don't match the number of elements of x. A: 768, x: 1536". I traced the code and found that the dimension of the randMatrix used by BucketedRandomProjectionLSHModel is determined by DatasetUtils.getNumFeatures().
Does this imply that something is wrong with the data I feed into fit()? The data is a DataFrame with a String column code and a String column text. The longest text has length 229.

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

val document_similarity_ranker = new DocumentSimilarityRankerApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("doc_similarity_rankings")
  .setSimilarityMethod("brp")
  .setNumberOfNeighbours(1)
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setVisibleDistances(true)
  .setIdentityRanking(false)

val document_similarity_ranker_finisher = new DocumentSimilarityRankerFinisher()
  .setInputCols("doc_similarity_rankings")
  .setOutputCols("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
  .setExtractNearestNeighbor(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    embeddings,
    document_similarity_ranker,
    document_similarity_ranker_finisher
  ))

24/08/08 03:19:13.581 [task-result-getter-3] WARN o.a.spark.scheduler.TaskSetManager - Lost task 7.2 in stage 10.0 (TID 370) (10.0.0.12 executor 4): org.apache.spark.SparkException: Failed to execute user defined function (LSHModel$$Lambda$5263/1056329262: (struct<type:tinyint,size:int,indices:array,values:array>) => array<struct<type:tinyint,size:int,indices:array,values:array>>)
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:177)
at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.serializefromobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
at org.sparkproject.guava.collect.Ordering.leastOf(Ordering.java:670)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD.$anonfun$takeOrdered$2(RDD.scala:1539)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 768, x: 1536
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:579)
at org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel.hashFunction(BucketedRandomProjectionLSH.scala:87)
at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
... 22 more


SidWeng · 2024-08-10T04:26:26Z

Author

The exception is still raised even when I use sent_roberta_base.

1 reply

SidWeng Aug 14, 2024
Author

It can be reproduced by modifying the test case in DocumentSimilarityRankerTestSpec.scala.

SidWeng · 2024-08-15T06:10:04Z

Author

I finally found the root cause. Some texts in the dataset contain a period (.), like this:

First document, this is my first sentence. This is my second sentence.

SentenceDetector treats this text as 2 sentences, so the output column (sentence_embeddings) of BertSentenceEmbeddings or RoBertaSentenceEmbeddings is an array of size 2.
DocumentSimilarityRankerApproach.train() then flattens sentence_embeddings.embeddings, which makes the dimension 1536 (768 * 2):

val similarityDataset: DataFrame = embeddingsDataset
  .withColumn(s"$LSH_INPUT_COL_NAME", array_to_vector(flatten(col(INPUT_EMBEDDINGS))))
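
To make the 768 vs. 1536 arithmetic concrete, here is a minimal plain-Scala sketch (no Spark needed) that mimics the flatten step and the kind of dimension check that fails. The names FlattenDemo, flattenEmbeddings and checkDimensions are made up for illustration; this is not Spark's or Spark NLP's actual code. It also assumes the projection matrix was sized from a single-sentence row, which is what "A: 768" in the error suggests.

```scala
object FlattenDemo {
  // Dimension of one sentence embedding, as reported in the error message.
  val EmbeddingDim = 768

  // Mimics flatten(col(...)): concatenates all sentence vectors of one row.
  def flattenEmbeddings(sentences: Seq[Array[Float]]): Array[Float] =
    sentences.flatMap(_.toSeq).toArray

  // Mimics the requirement behind the error: a matrix with `matrixCols`
  // columns can only be multiplied by a vector of the same length.
  def checkDimensions(matrixCols: Int, x: Array[Float]): Either[String, Unit] =
    if (matrixCols == x.length) Right(())
    else Left(s"The columns of A don't match the number of elements of x. A: $matrixCols, x: ${x.length}")

  def main(args: Array[String]): Unit = {
    val oneSentence  = Seq(Array.fill(EmbeddingDim)(0.1f))
    val twoSentences = Seq(Array.fill(EmbeddingDim)(0.1f), Array.fill(EmbeddingDim)(0.2f))

    // One sentence: 768 elements, the check passes.
    println(checkDimensions(EmbeddingDim, flattenEmbeddings(oneSentence)))
    // Two sentences: 1536 elements, the check fails with the message above.
    println(checkDimensions(EmbeddingDim, flattenEmbeddings(twoSentences)))
  }
}
```

So any row whose text is split into more than one sentence ends up with a vector that no longer matches the matrix.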

The solution in my case was to set custom bounds for SentenceDetector:

.setCustomBounds(Array("\n"))
.setUseCustomBoundsOnly(true)
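
For context, this is roughly how the sentenceDetector stage from the question looks with those two calls added. It is only a sketch: the setters are the ones already used in this thread, while the import path is my assumption about a typical Spark NLP setup, so adjust it to your project.

```scala
// Assumed import path; it may differ depending on how your project imports Spark NLP.
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector

// Split only on newlines, so a text without "\n" stays a single sentence
// and therefore produces a single embedding per row.
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setCustomBounds(Array("\n"))
  .setUseCustomBoundsOnly(true)
```

Note that a text which does contain a newline would still be split into several sentences and hit the same error.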

1 reply

danilojsl Aug 16, 2024
Collaborator

Hi @SidWeng

Thanks for letting us know about your workaround. We are working on adding a parameter to DocumentSimilarityRankerApproach to choose the aggregation method when a document has multiple sentences. I hope we can include it in the next release.
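
As an illustration of what such an aggregation could look like, one conceivable option is element-wise averaging of a document's sentence vectors, which keeps the dimension at 768 no matter how many sentences there are. The sketch below is plain Scala with hypothetical names (AverageDemo, averageEmbeddings); it is not the planned parameter and not part of the Spark NLP API.

```scala
object AverageDemo {
  // Element-wise mean of equally sized sentence vectors.
  def averageEmbeddings(sentences: Seq[Array[Float]]): Array[Float] = {
    require(sentences.nonEmpty, "need at least one sentence embedding")
    val dim = sentences.head.length
    require(sentences.forall(_.length == dim), "all embeddings must share one dimension")

    // Sum each position across all sentences, then divide by the sentence count.
    val sums = new Array[Float](dim)
    for (vec <- sentences; i <- 0 until dim) sums(i) += vec(i)
    sums.map(_ / sentences.size)
  }

  def main(args: Array[String]): Unit = {
    // Two sentence vectors of dimension 768, as in the failing example.
    val doc = Seq(Array.fill(768)(1.0f), Array.fill(768)(3.0f))
    val pooled = averageEmbeddings(doc)
    println(s"dimension = ${pooled.length}, first value = ${pooled.head}")
    // prints: dimension = 768, first value = 2.0
  }
}
```

Averaging trades per-sentence detail for a fixed-size vector; other choices (for example, keeping only the first sentence) would also keep the dimension constant.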
