
An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel. : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: #1011

Open
behnazeslami opened this issue Mar 7, 2024 · 1 comment


@behnazeslami

behnazeslami commented Mar 7, 2024

Hi,
On my CentOS Linux machine, I installed:

1. `! pip install --upgrade -q pyspark==3.4.1 spark-nlp==5.2.2`
2. `! pip install --upgrade spark-nlp-jsl==5.2.1 --user --extra-index-url https://pypi.johnsnowlabs.com/[secret_code]`
I checked the Java version with `java -version`:

```
openjdk version "11.0.13" 2021-10-19
OpenJDK Runtime Environment JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21)
OpenJDK 64-Bit Server VM JBR-11.0.13.7-1751.21-jcef (build 11.0.13+7-b1751.21, mixed mode)
```

In `~/.bashrc`, `JAVA_HOME` is set as:

```
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.332.b09-1.el7_9.x86_64
```
I am trying to run the following program:

```python
import sparknlp
import sparknlp_jsl

from sparknlp.base import *
from sparknlp.util import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql.functions import col
from pyspark.sql.functions import explode

from sparknlp.pretrained import PretrainedPipeline

import gc

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np
#%%
params = {"spark.driver.memory": "50G",
          "spark.kryoserializer.buffer.max": "2000M",
          "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
          "spark.driver.maxResultSize": "16G"}

spark = sparknlp_jsl.start(license_keys['SECRET'], params=params, gpu=True)

print("Spark NLP Version :", sparknlp.version())
print("Spark NLP_JSL Version :", sparknlp_jsl.version())

print(spark)
print("\n========================================================================")

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

token = Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')

embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_clinical_biobert", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

clinical_assertion = AssertionDLModel.pretrained("assertion_dl_biobert_scope_L10R10", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

chunk2doc = Chunk2Doc() \
    .setInputCols("ner_chunk") \
    .setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models") \
    .setInputCols(["ner_chunk_doc"]) \
    .setOutputCol("sbert_embeddings")

snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_findings_aux_concepts", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("snomed_code") \
    .setDistanceFunction("COSINE") \
    .setCaseSensitive(False) \
    .setUseAuxLabel(True) \
    .setNeighbours(10)

resolver = SentenceEntityResolverModel \
    .pretrained("sbiobertresolve_umls_findings", "en", "clinical/models") \
    .setInputCols(["ner_chunk", "sbert_embeddings"]) \
    .setOutputCol("resolution") \
    .setDistanceFunction("EUCLIDEAN")

nlpPipeline = Pipeline(stages=[document,
                               sentenceDetector,
                               token,
                               embeddings,
                               clinical_ner,
                               ner_converter,
                               clinical_assertion,
                               chunk2doc,
                               sbert_embedder,
                               snomed_resolver,
                               resolver])

data = spark.createDataFrame([[""]]).toDF("text")

assertion_model = nlpPipeline.fit(data)
```

However, I get the following error:

```
---------------------------------------------------------------------
|                  |            modules            ||   artifacts   |
|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
|      default     |   75  |   0   |   0   |   3   ||   72  |   0   |
---------------------------------------------------------------------

:: retrieving :: org.apache.spark#spark-submit-parent-b59223ac-26d8-44de-a4c3-d05a558c3faf
confs: [default]
0 artifacts copied, 72 already retrieved (0kB/31ms)
24/03/06 20:34:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark NLP Version : 5.2.2
Spark NLP_JSL Version : 5.2.1
<pyspark.sql.session.SparkSession object at 0x7f5597073190>

========================================================================
biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[ | ]biobert_pubmed_base_cased download started this may take some time.
Approximate size to download 386.4 MB
[ / ]Download done! Loading the resource.

An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:304)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:244)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:332)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:208)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:291)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:291)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:287)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1441)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
at org.apache.spark.rdd.RDD.take(RDD.scala:1435)
at org.apache.spark.rdd.RDD.$anonfun$first$1(RDD.scala:1476)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:405)
at org.apache.spark.rdd.RDD.first(RDD.scala:1476)
at org.apache.spark.ml.util.DefaultParamsReader$.loadMetadata(ReadWrite.scala:587)
at org.apache.spark.ml.util.DefaultParamsReader.load(ReadWrite.scala:465)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:31)
at com.johnsnowlabs.nlp.FeaturesReader.load(ParamsAndFeaturesReadable.scala:24)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:513)
at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:505)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.downloadModel(ResourceDownloader.scala:705)
at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel(ResourceDownloader.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:278)
... 40 more
[OK!]
Traceback (most recent call last):
File "/data/beslami/sample_loaded_models.py", line 75, in <module>
embeddings = BertEmbeddings.pretrained("biobert_pubmed_base_cased")
File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/annotator/embeddings/bert_embeddings.py", line 206, in pretrained
return ResourceDownloader.downloadModel(BertEmbeddings, name, lang, remote_loc)
File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/pretrained/resource_downloader.py", line 99, in downloadModel
raise e
File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/pretrained/resource_downloader.py", line 96, in downloadModel
j_obj = _internal._DownloadModel(reader.name, name, language, remote_loc, j_dwn).apply()
File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/internal/__init__.py", line 352, in __init__
super(_DownloadModel, self).__init__("com.johnsnowlabs.nlp.pretrained." + validator + ".downloadModel", reader,
File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/internal/extended_java_wrapper.py", line 27, in __init__
self._java_obj = self.new_java_obj(java_obj, *args)
File "/home/beslami/.local/lib/python3.9/site-packages/sparknlp/internal/extended_java_wrapper.py", line 37, in new_java_obj
return self._new_java_obj(java_class, *args)
File "/home/beslami/.local/lib/python3.9/site-packages/pyspark/ml/wrapper.py", line 86, in _new_java_obj
return java_obj(*java_args)
File "/home/beslami/.local/lib/python3.9/site-packages/py4j/java_gateway.py", line 1322, in __call__
return_value = get_return_value(
File "/home/beslami/.local/lib/python3.9/site-packages/pyspark/errors/exceptions/captured.py", line 169, in deco
return f(*a, **kw)
File "/home/beslami/.local/lib/python3.9/site-packages/py4j/protocol.py", line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/beslami/cache_pretrained/biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996/metadata
(Java stack trace identical to the one above)
```

behnazeslami changed the title from "py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel. : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:" to "An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.downloadModel. : org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:" on Mar 7, 2024
maziyarpanahi transferred this issue from JohnSnowLabs/spark-nlp on Mar 7, 2024
@maziyarpanahi
Member

maziyarpanahi commented Mar 7, 2024

I transferred this issue because it involves licensed annotators; we cannot reproduce it in the open-source library. That said, I suspect the home directory either does not have the right permissions to download/extract the models, or it is not reachable. Checking the /home/beslami/cache_pretrained/ path and its permissions might help.
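
For anyone who lands here, below is a minimal sketch of the check suggested above: it verifies that the pretrained-model cache directory is readable and writable, and removes a cache entry whose `metadata` folder is missing so the model gets re-downloaded cleanly. The paths are taken from the log output in this issue, and the cleanup step is a common workaround rather than an official fix; adjust both to your environment.

```python
import os
import shutil

# Default pretrained-model cache location, as it appears in the log above.
cache_dir = os.path.expanduser("~/cache_pretrained")
# Cache entry for the model that failed to load (name taken from the error message).
model_dir = os.path.join(cache_dir, "biobert_pubmed_base_cased_en_2.6.2_2.4_1600449177996")

# 1. Verify the cache directory exists and the current user can read/write it.
print("exists:  ", os.path.isdir(cache_dir))
print("readable:", os.access(cache_dir, os.R_OK))
print("writable:", os.access(cache_dir, os.W_OK))

# 2. An interrupted download can leave the model folder without its
#    metadata subfolder (exactly the path the exception complains about);
#    removing the folder forces Spark NLP to download the model again.
if os.path.isdir(model_dir) and not os.path.isdir(os.path.join(model_dir, "metadata")):
    shutil.rmtree(model_dir)
    print("Removed incomplete cache entry:", model_dir)
```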
