spark nlp ner XlmRoBertaForTokenClassification performance improvement #13475

LucaPifferettiPrivate · 2023-02-06T14:24:51Z


Hi everyone!
I'm using the NER model XlmRoBertaForTokenClassification to find person names in a column of messages.
The problem is that the model is really slow: it takes 35 minutes to process 100K messages.
I have this configuration:
spark driver cores = 2
spark driver memory = 48 GB
spark executors = 8
spark executor cores = 8
spark executor memory = 32 GB

Looking at the Spark UI, I found that during a stage involving the NER model a single task takes 30 minutes. To improve performance I would need to use all the executors, but this seems to be a problem related to the model.
Has anyone had the same problem?

maziyarpanahi (Maintainer) · 2023-02-06T14:35:42Z

Hi,

Here is how I would approach it:

8 executors with 8 cores each = 64 cores in total, and all of them must work in parallel (nothing less, nothing more).
Less means you do not have enough partitions in your DataFrame to send to those cores for computation.
Make sure your Dataset/DataFrame is splittable (Parquet is the best format for this), so that you can repartition it to a number that uses all the cores.
Also, use an appropriate partition count for your DataFrame, such as .repartition(64).
The Spark UI is your best friend for checking whether all 64 cores are computing in parallel.
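
The points above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name repartitionToCores, the local[4] master, and the generated sample messages are all assumptions made so the snippet runs on its own.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of Spark or Spark NLP.
object RepartitionSketch {
  // One partition per core, so that every core can be handed a task.
  def repartitionToCores(df: DataFrame, executors: Int, coresPerExecutor: Int): DataFrame =
    df.repartition(executors * coresPerExecutor)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, used only to try the sketch out.
    val spark = SparkSession.builder()
      .appName("repartition-sketch")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the real messages column.
    val messages = (1 to 1000).map(i => s"message $i").toDF("text")

    // 8 executors x 8 cores = 64 partitions.
    val input = repartitionToCores(messages, executors = 8, coresPerExecutor = 8)

    // Check the partition count before handing the DataFrame to a pipeline.
    println(s"Partitions: ${input.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The repartitioned DataFrame is what would go into the pipeline's .fit() and .transform(); the Spark UI then shows whether the stage really runs one task per partition.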

This webinar covers exactly this topic: https://www.johnsnowlabs.com/watch-webinar-speed-optimization-benchmarks-in-spark-nlp-3-making-the-most-of-modern-hardware/

8 replies

LucaPifferettiPrivate Feb 6, 2023
Author

As you can see, with this code I see only one executor working on a single task.

maziyarpanahi Feb 6, 2023
Maintainer

OK, I think you should watch the webinar. You must repartition the input DataFrame, not the output of .fit().transform(), which is practically the end result. (What goes in must be splittable and repartitioned.)
You need to share your .read() (what you are reading as a DataFrame) and then repartition that (textDataset).

LucaPifferettiPrivate Feb 6, 2023
Author

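Yes, this is the read part:

import org.apache.spark.sql.functions.{col, length}

// Read the Parquet source, keep short messages, and expose the column to anonymize as "text".
val textDataset = spark.read
  .options(source.options)
  .format("parquet")
  .load(source.path)
  .filter(length(col("object.body")) < 250)
  .withColumn("text", col(columnAnonymize))
  .repartition(64)
textDataset.cache()
// show() prints the rows itself and returns Unit, so no println is needed.
textDataset.show()

As you can see, I already repartition the data into 64 partitions in the read part.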

maziyarpanahi Feb 6, 2023
Maintainer

What is that formatStr?
Is that type even splittable, so that it can be repartitioned?
Do I really have 8 executors in my Executors tab?
Do I see 64 tasks in parallel if I remove Spark NLP and do something else in Apache Spark?

These are the questions I think you should pursue further online, as this is not really a Spark NLP issue: no matter what you use in Apache Spark, you will see that 1/1 because your source data is not splittable or something else is going on. (Any annotator in Spark NLP runs in a parallel/distributed manner, as it natively extends Spark ML. So if your Apache Spark can run something over 64 cores with 64 tasks in parallel, Spark NLP simply uses that.)
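
One way to run that check without Spark NLP is to count the rows that land in each partition. The sketch below is an assumption-laden illustration, not something posted in the thread: the object name PartitionCheck, the helper rowsPerPartition, the local[4] master, and the sample data are all made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper, not part of Spark or Spark NLP.
object PartitionCheck {
  // Returns (partition index, row count) for every partition, empty ones included.
  def rowsPerPartition(df: DataFrame): Array[(Int, Int)] =
    df.rdd
      .mapPartitionsWithIndex { (index, rows) => Iterator((index, rows.size)) }
      .collect()

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, used only to try the sketch out.
    val spark = SparkSession.builder()
      .appName("partition-check")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for textDataset.
    val textDataset = (1 to 1000).map(i => s"message $i").toDF("text").repartition(64)

    val sizes = rowsPerPartition(textDataset)
    println(s"Partitions: ${sizes.length}")
    println(s"Largest partition: ${sizes.map(_._2).max} rows")
    println(s"Empty partitions: ${sizes.count(_._2 == 0)}")

    spark.stop()
  }
}
```

If nearly all rows sit in one partition, a single long task is expected whatever runs on top of the DataFrame. If the rows are spread evenly and the Spark UI still shows 1/1, the Executors tab and the job's resource settings are the next things to look at.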
