MLeap usage in streaming scenario (perf issue) #633
Comments
Initially our code was substantially faster because we didn't call transform() for every row, which has to perform some initial setup (e.g. inspection of input types, output dataframe allocation, ...). @voganrc, just to confirm that I'm reading your table correctly: to get the ML examples processed per second, I need to multiply the cell value by the batch size, right? If so, it's surprising that only xgboost4j sees a speed-up (e.g. 10 * 10.135). I was considering implementing batching in our integration, but the numbers above don't look promising, and the complexity would increase.
@eisber I will take a closer look this week, but it seems to me like this would be a good use case for the RowTransformer that MLeap has; you can see more details about how to use it here: https://github.com/combust/mleap/blob/master/mleap-runtime/src/test/scala/ml/combust/mleap/runtime/frame/RowTransformerSpec.scala#L52. You'd be able to provide the schema upfront, from your bundled zip file, and thus skip the cost of creating the leap frame.
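For reference, here is a minimal sketch of that approach in Scala, assuming `pipeline` is a Transformer already loaded from the bundle and using an illustrative single-field input schema; see the linked spec for the exact usage:

```scala
import ml.combust.mleap.core.types.{ScalarType, StructField, StructType}
import ml.combust.mleap.runtime.frame.{Row, RowTransformer, Transformer}

// `pipeline` is assumed to be the Transformer loaded from the bundle zip.
def buildRowTransformer(pipeline: Transformer): RowTransformer = {
  // Declare the input schema once, up front (field name/type are illustrative).
  val schema = StructType(StructField("feature", ScalarType.Double)).get

  // Let the pipeline specialize the RowTransformer for this schema;
  // type inspection and output allocation happen here, once.
  pipeline.transform(RowTransformer(schema)).get
}

// Per streamed row: no leap frame is created, just a Row in and a Row out.
def scoreRow(rowTransformer: RowTransformer, feature: Double): Option[Row] =
  rowTransformer.transformOption(Row(feature))
```

The RowTransformer is built once when the iterator is initialized and then reused for every row, which is what avoids the per-row setup cost.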
@ancasarb thanks for the awesome hint. I changed our code (https://github.com/microsoft/masc/blob/marcozo/rowtransformer/connector/iterator/src/main/java/com/microsoft/accumulo/spark/processors/AvroRowMLeap.java#L299) and got a 40% speed improvement. Any other hints like that? Markus
@eisber the other thing I can think of is that MLeap has support for Avro (https://mvnrepository.com/artifact/ml.combust.mleap/mleap-avro); perhaps you could use the DefaultRowReader/DefaultRowWriter there to simplify your code somewhat.
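A rough sketch of how those classes might fit in; the constructors and the `toBytes`/`fromBytes` calls below are assumptions based on MLeap's RowReader/RowWriter conventions, so please check the mleap-avro sources for the exact signatures:

```scala
import ml.combust.mleap.avro.{DefaultRowReader, DefaultRowWriter}
import ml.combust.mleap.core.types.{ScalarType, StructField, StructType}
import ml.combust.mleap.runtime.frame.Row

// Illustrative schema; in the Accumulo iterator this would mirror the
// Avro schema of the incoming rows.
val schema = StructType(StructField("feature", ScalarType.Double)).get

// Assumed usage: a writer/reader pair bound to the schema, with
// toBytes/fromBytes converting between a Row and Avro-encoded bytes.
val rowWriter = new DefaultRowWriter(schema)
val rowReader = new DefaultRowReader(schema)

val bytes: Array[Byte] = rowWriter.toBytes(Row(1.5)).get
val row: Row = rowReader.fromBytes(bytes).get
```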
@eisber I am going to close this for now, but please feel free to re-open if you have further questions. Thank you!
Hi,
We are using MLeap to perform model inference within Apache Accumulo. Since the Accumulo iterator framework exposes a streaming API (processing row by row rather than in batches), we'd like to reuse as many of the objects required by MLeap as possible.
We managed to create a single "infinite" input data frame and then produce a single result data frame from which we pull the data iteratively. It works, but unfortunately this results in a memory leak. I was looking through the stack, but wasn't able to figure out at which point things are kept in memory.
The code works, but isn't performing that well, as we have to call transformer.transform(this.mleapDataFrame) for every single row.
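Roughly, the per-row pattern looks like this (a simplified Scala sketch of what our Java integration does; field names are illustrative):

```scala
import ml.combust.mleap.core.types.StructType
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row, Transformer}

// For every incoming Accumulo row we build a one-row leap frame,
// run the whole pipeline on it, and pull the single result row back out.
def scoreOneRow(pipeline: Transformer, schema: StructType, feature: Double): Row = {
  val frame = DefaultLeapFrame(schema, Seq(Row(feature)))
  // transform() re-inspects input types and allocates a new output frame
  // on every call, which is the per-row overhead described above.
  pipeline.transform(frame).get.dataset.head
}
```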
Integration code can be found here: https://github.com/microsoft/masc/blob/master/connector/iterator/src/main/java/com/microsoft/accumulo/spark/processors/AvroRowMLeap.java#L337
Any advice appreciated.
Markus