Using XGBoost with the newest mleap=0.22.0 in Python 3.8 #849
Comments
I am using the above and got an error:
Hmmm, I'm not too familiar with that; I don't know if it would impact the error being generated, though. It seems to be complaining about an mleap class not being present in the JVM.
Oh, I see. Can this model be used inside of a bundle (I think I got an error, but I could be wrong)? I.e., I should be able to use it as above and serialize/deserialize it?
I mean, if I have a pipeline with such a model in it, I get an error as below:
Hmmm, @WeichenXu123 any thoughts on this, since I know you added the pyspark bindings in xgboost?
To be clear, this is the code I ran:

```python
sc = SparkContext()
features = ['dayOfWeek', 'hour', 'channel', 'platform', 'deviceType',
            'adUnit', 'pageType', 'zip', 'advertiserId']
xgboost = SparkXGBClassifier(
pipeline = pyspark.ml.Pipeline(
model = pipeline.fit(df)
local_path = "jar:file:/tmp/pyspark.example.zip"
deserialized_model.stages[-1].set(deserialized_model.stages[-1].missing, 0.0)
sc.stop()
```
Could you try the mleap 0.20 version? I remember a similar issue happening with versions > 0.20.
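Pinning to that version line for the experiment would just be (assuming 0.20.0 is the exact release meant):

```shell
# Downgrade mleap to the 0.20 line to test whether the JVM-class error goes away
pip install 'mleap==0.20.0'
```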
Thanks so much for this. Give me a few days to try this and circle back. We actually got it working on 0.22 after changing some files (wrapper.py in the pyspark code), but I'd be interested to see if this works as well...
Thanks for your investigation! Would you mind filing a PR containing the changes that make it work, so that it can help us fix it?
I don't have a PR, but I can tell you some specs. Basically we needed to wrap a bunch of jars into one and run the commands below. This involves installing Java 11 to make 0.22 work, but we also needed to change some things inside Spark-related files. Specifically, see the changes needed to wrapper.py to make xgboost 1.7.3 work, and the change to spark-env.sh to make the right Java be used. The jars I am using are a wrap-up of the mleap jars found on Maven. That sparkxgb zip is one I found online; it wraps xgboost so that it can be used in pyspark.

```shell
# We need this new Java since the new mleap uses Java 11
sudo apt-get -y install openjdk-11-jdk

# Change Java to 11 for Spark
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> $SPARK_HOME/conf/spark-env.sh

# If you don't do this, Spark will complain about Python 3.7 vs 3.8 - pin Python to a specific one
export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python

# Pip install packages - mleap 0.22.0 needs Java 11
pip install mleap==0.22.0

# Change wrapper.py and turn off dynamic allocation; these are needed to make the
# imports work right, plus we get an error with dynamicAllocation=true
sudo sed -i.bak 's/spark.dynamicAllocation.enabled=true/spark.dynamicAllocation.enabled=false/' $SPARK_HOME/conf/spark-defaults.conf

# Transfer files where we need them. Don't forget to add the zip. There are
# 4 jars and 1 zip that need to be placed in the directory, see below
sudo gsutil cp gs://filesForDataproc2.0/jarsAndZip/* $SPARK_HOME/jars/
```
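Since jar files are just zip archives, "wrapping a bunch of jars into one" can be sketched with a small script like the one below. This is a hedged illustration, not the exact procedure the commenter used; `merge_jars` and the jar names are hypothetical.

```python
# Merge several jar (zip) archives into one "uber" jar.
# Later jars win on duplicate entry names, mirroring classpath ordering.
import zipfile

def merge_jars(jar_paths, out_path):
    seen = {}
    for jar in jar_paths:                  # later jars override earlier ones
        with zipfile.ZipFile(jar) as zf:
            for name in zf.namelist():
                if not name.endswith('/'):  # skip bare directory entries
                    seen[name] = zf.read(name)
    with zipfile.ZipFile(out_path, 'w', zipfile.ZIP_DEFLATED) as out:
        for name, data in seen.items():
            out.writestr(name, data)

# Hypothetical usage:
# merge_jars(["mleap-core.jar", "mleap-runtime.jar"], "mleap-combined.jar")
```

Note that a real uber jar may also need service files (META-INF/services) concatenated rather than overwritten; this sketch just takes the last copy.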
Hi, what is the best way to use XGBoost with mleap==0.22.0? Specifically, I'd like to make a pipeline and then serialize it into a bundle. Is there a demo somewhere on how to do this? @jsleight