
MLeap usage in streaming scenario (perf issue) #633

Closed
eisber opened this issue Jan 27, 2020 · 7 comments
eisber commented Jan 27, 2020

Hi,

We are using MLeap to perform model inference within Apache Accumulo. Since the Accumulo iterator framework exposes a streaming API (processing row by row rather than in batches), we'd like to re-use as many of the objects required by MLeap as possible.

We managed to create a single "infinite" input dataframe and then produce a single result dataframe from which we pull the data iteratively. It works, but unfortunately it results in a memory leak. I looked through the stack, but wasn't able to figure out at which point things are kept in memory.

The code works, but it doesn't perform that well because we have to call transformer.transform(this.mleapDataFrame) for every single row.

Integration code can be found here: https://github.com/microsoft/masc/blob/master/connector/iterator/src/main/java/com/microsoft/accumulo/spark/processors/AvroRowMLeap.java#L337

Any advice is appreciated.

Markus

@lucagiovagnoli (Member) commented:
Hi @eisber, this sounds very close to what we're trying to do (run MLeap in a streaming Apache Flink job, processing row by row). What do you think is the bottleneck in your case?

Also, what model are you using? We found some performance issues using xgboost, see issue #631

eisber (Author) commented Jan 28, 2020

Initially our code was substantially faster because we didn't call transform() for every row; transform() has to perform some initial setup on each call (e.g. inspection of input types, output dataframe allocation, ...).

@voganrc just to confirm that I read your table correctly: to get the ML examples processed per second, I need to multiply the cell value by the batch size, right? If so, it's surprising that only xgboost4j sees a speed-up (e.g. 10 * 10.135).

I was considering implementing batching in our integration, but the numbers above don't look promising, and the complexity would increase...

@ancasarb (Member) commented:
@eisber I will take a closer look this week, but this seems like a good use case for the RowTransformer that mleap has; you can see more details about how to use it here: https://github.com/combust/mleap/blob/master/mleap-runtime/src/test/scala/ml/combust/mleap/runtime/frame/RowTransformerSpec.scala#L52. You'd be able to provide the schema upfront, from your bundled zip file, and thus skip the cost of creating the leap frame on every call.
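For reference, the pattern in the linked RowTransformerSpec looks roughly like the sketch below in Scala. This is a hedged sketch, not a verified build: the field names, types, and the already-loaded `transformer` are placeholders, and the calls follow the linked spec file.

```scala
import ml.combust.mleap.core.types.{ScalarType, StructField, StructType}
import ml.combust.mleap.runtime.frame.{Row, RowTransformer}

// Assumption: `transformer` was loaded once up front from the bundle zip
// (e.g. via MleapSupport's BundleFile(...).load().get.root).

// One-time setup: declare the input schema (placeholder fields here)
// instead of letting each transform() call re-derive it.
val schema = StructType(
  StructField("feature_a", ScalarType.Double),
  StructField("feature_b", ScalarType.Double)
).get

// One-time setup: bake the pipeline into a per-row transformer.
val rowTransformer = transformer.transform(RowTransformer(schema)).get

// Hot path, executed once per streamed row: no leap-frame allocation
// or input-type inspection per call.
val outputRow = rowTransformer.transform(Row(1.0, 2.0))
```

The point of the design is that all schema work happens once at setup, so the per-row call is just array shuffling plus the model math.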

eisber (Author) commented Jan 29, 2020

@ancasarb thanks for the awesome hint. I changed our code: https://github.com/microsoft/masc/blob/marcozo/rowtransformer/connector/iterator/src/main/java/com/microsoft/accumulo/spark/processors/AvroRowMLeap.java#L299 and got a 40% speed improvement.

Any other hints like that?

Markus

voganrc (Contributor) commented Jan 29, 2020

@eisber Yes, you are reading my table correctly. xgboost4j was the only library I found that actually implemented batching. This PR is open though (#600).

@ancasarb (Member) commented:

@eisber the other thing I can think of is that mleap has avro support (https://mvnrepository.com/artifact/ml.combust.mleap/mleap-avro); perhaps you could use the DefaultRowReader/DefaultRowWriter there to simplify your code somewhat.

ancasarb (Member) commented Mar 9, 2020

@eisber I am going to close this for now, but please feel free to re-open if you have further questions. Thank you!

ancasarb closed this as completed Mar 9, 2020