[TIKA-3420] Set tesseract ocr langauges as docker build args #2

Open
wants to merge 14 commits into
base: master

@mhf-ir mhf-ir commented Dec 15, 2020

Adds the ability for users to build the Docker image with a list of tesseract-ocr-[lang] packages passed as build args.

@mhf-ir
Author

mhf-ir commented Dec 15, 2020

It seems docker-build.sh must also be updated to accept <TESSERACT_LANGUAGES>.

Member

@dameikle dameikle left a comment

Thanks for this PR, this will be a great addition to tika-docker!

I've got a few comments as it looks like this may not work as intended, so wondering if you could take a look at them.

Thanks,
Dave

docker-tool.sh: three review comments (outdated, resolved)
@mhf-ir
Author

mhf-ir commented Jan 1, 2021

I changed the build to handle the rest of the parameters and fixed the echo problem.

I also made some changes to README.md, based on Markdown lint and on these changes. Feel free to edit my PR if it needs grammar or other fixes.

Thanks for your attention

@mhf-ir mhf-ir requested a review from dameikle January 1, 2021 22:10
@mhf-ir
Author

mhf-ir commented May 24, 2021

@dameikle Could you please take a look at this? Are there any problems, issues, or changes needed? This simple patch would help me a lot.

Member

@lewismc lewismc left a comment

Hi @mhf-ir I like this pull request. Please consider my feedback. Thank you

README.md: one review comment (outdated, resolved)
docker-tool.sh: two review comments (outdated, resolved)
@lewismc lewismc changed the title set tesseract ocr langauges as docker build args [TIKA-3420] Set tesseract ocr langauges as docker build args May 26, 2021
Author

@mhf-ir mhf-ir left a comment

:like:

@mhf-ir
Author

mhf-ir commented May 26, 2021

It seems TIKA_JAR_NAME has also been added; please check it.

@mhf-ir mhf-ir requested a review from lewismc May 31, 2021 06:41
@lewismc
Member

lewismc commented Jun 5, 2021

@mhf-ir OK I tried out the new patch today

./docker-tool.sh build 1.26 tika-1.27-tesseract-french.jar tesseract-ocr-fra

...

 => ERROR [dependencies 1/2] RUN DEBIAN_FRONTEND=noninteractive apt-get -y install openjdk-14-jre-headless gdal-bin tesseract-ocr 'tesseract-ocr-fra'                                                                                    1.4s
------
 > [dependencies 1/2] RUN DEBIAN_FRONTEND=noninteractive apt-get -y install openjdk-14-jre-headless gdal-bin tesseract-ocr 'tesseract-ocr-fra':
#6 0.293 Reading package lists...
#6 1.103 Building dependency tree...
#6 1.268 Reading state information...
#6 1.383 E: Unable to locate package 'tesseract-ocr-fra'
------
executor failed running [/bin/sh -c DEBIAN_FRONTEND=noninteractive apt-get -y install $JRE gdal-bin tesseract-ocr $TESSERACT_LANGUAGES]: exit code: 100

Is this the correct way to invoke ./docker-tool.sh's build command?
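
One plausible reading of the log above is that the single quotes became part of the package name itself, since apt reports the name as 'tesseract-ocr-fra' with the quotes included. The sketch below is only an illustration of that shell behaviour, not code from docker-tool.sh; the helper show_args and both variable names are hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration; not taken from docker-tool.sh.

# show_args prints each argument it receives on its own line, in brackets.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Quote characters stored inside a value are data, not shell syntax.
quoted_value="'tesseract-ocr-fra'"
plain_value="tesseract-ocr-fra"

# Unquoted expansion splits on whitespace but does not strip embedded quotes.
with_quotes=$(show_args $quoted_value)
without_quotes=$(show_args $plain_value)

echo "$with_quotes"     # the quotes are still part of the argument
echo "$without_quotes"  # the bare package name
```

If this is what happened in the build, the value reaching apt-get would need to be the bare package name, with any quoting applied only around the variable expansion in the script.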

@mhf-ir
Author

mhf-ir commented Jun 6, 2021

@lewismc It seems the problem is with passing multiple package names through build args. I will try to find a better way to do that.
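
One way to pass several package names through a single build arg is to join the trailing command-line arguments into one space-separated string. The sketch below is a minimal example under assumptions: the function build_command is hypothetical, the argument order (version, jar name, then packages) mirrors the commands shown in this thread, and the build-arg names are taken from the variables visible in the build logs. It only prints the docker command rather than running it.

```shell
#!/bin/sh
# Hypothetical sketch; not the real docker-tool.sh.

build_command() {
  version=$1
  jar_name=$2
  shift 2
  # "$*" joins the remaining arguments with spaces into one string,
  # so the whole package list travels as a single --build-arg value.
  languages="$*"
  printf 'docker build --build-arg TIKA_VERSION=%s --build-arg TIKA_JAR_NAME=%s --build-arg TESSERACT_LANGUAGES="%s" .\n' \
    "$version" "$jar_name" "$languages"
}

cmd=$(build_command 1.26 tika-server tesseract-ocr-fra tesseract-ocr-fas)
echo "$cmd"
```

Inside the Dockerfile, the value would then have to be expanded unquoted in the apt-get line so that it splits back into separate package names.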

@mhf-ir
Author

mhf-ir commented Jun 7, 2021

@lewismc Try this; it should work:

./docker-tool.sh build 1.26 jar-alt-name tesseract-ocr-fra tesseract-ocr-fas

@lewismc
Member

lewismc commented Jun 7, 2021

This doesn't work either @mhf-ir

./docker-tool.sh build 1.26 jar-alt-name tesseract-ocr-fra
...
#9 9.069 --2021-06-07 19:53:48--  https://www.apache.org/dyn/closer.cgi/tika/jar-alt-name-1.26.jar?filename=tika/jar-alt-name-1.26.jar&action=download
#9 9.070 Resolving www.apache.org (www.apache.org)... 207.244.88.140, 95.216.26.30, 2a01:4f9:2a:1a61::2
#9 9.178 Connecting to www.apache.org (www.apache.org)|207.244.88.140|:443... connected.
#9 9.400 HTTP request sent, awaiting response... 404 Not Found
#9 10.63 2021-06-07 19:53:49 ERROR 404: Not Found.
#9 10.63
#9 10.64 --2021-06-07 19:53:49--  https://archive.apache.org/dist/tika/jar-alt-name-1.26.jar
#9 10.64 Resolving archive.apache.org (archive.apache.org)... 138.201.131.134, 2a01:4f8:172:2ec5::2
#9 10.74 Connecting to archive.apache.org (archive.apache.org)|138.201.131.134|:443... connected.
#9 12.28 HTTP request sent, awaiting response... 404 Not Found
#9 12.50 2021-06-07 19:53:51 ERROR 404: Not Found.
#9 12.50
------
executor failed running [/bin/sh -c DEBIAN_FRONTEND=noninteractive apt-get -y install gnupg2 wget     && wget -t 10 --max-redirect 1 --retry-connrefused -qO- https://downloads.apache.org/tika/KEYS | gpg --import     && wget -t 10 --max-redirect 1 --retry-connrefused $NEAREST_TIKA_SERVER_URL -O /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar || rm /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar     && sh -c "[ -f /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar ]" || wget $ARCHIVE_TIKA_SERVER_URL -O /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar || rm /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar     && sh -c "[ -f /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar ]" || exit 1     && wget -t 10 --max-redirect 1 --retry-connrefused $DEFAULT_TIKA_SERVER_ASC_URL -O /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc  || rm /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc     && sh -c "[ -f /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc ]" || wget $ARCHIVE_TIKA_SERVER_ASC_URL -O /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc || rm /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc     && sh -c "[ -f /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar.asc ]" || exit 1;]: exit code: 1

@mhf-ir
Author

mhf-ir commented Jun 8, 2021

It seems the $jar variable is the problem; its default is tika-server:
https://github.com/apache/tika-docker/blob/master/docker-tool.sh#L61
That is not my modification. I just resolved the conflict to match master.
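
The 404 responses in the earlier log are consistent with the jar name being used directly in the download URL. The sketch below shows how a defaulted positional parameter could produce both URLs seen in this thread. It is an assumption about the general pattern, not a copy of docker-tool.sh, and the function jar_url is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of a defaulted positional parameter;
# the real handling lives in docker-tool.sh and may differ.

jar_url() {
  version=$1
  # Fall back to tika-server when the second argument is missing or empty.
  jar=${2:-tika-server}
  printf 'https://archive.apache.org/dist/tika/%s-%s.jar\n' "$jar" "$version"
}

default_url=$(jar_url 1.26)
custom_url=$(jar_url 1.26 jar-alt-name)

echo "$default_url"  # matches the URL that returned 200 OK in the log
echo "$custom_url"   # matches the URL that returned 404 Not Found
```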

gpg: Total number processed: 7
gpg:               imported: 7
gpg: no ultimately trusted keys found
--2021-06-08 04:16:59--  https://www.apache.org/dyn/closer.cgi/tika/tika-server-1.26.jar?filename=tika/tika-server-1.26.jar&action=download
Resolving www.apache.org (www.apache.org)... 207.244.88.140, 95.216.26.30, 2a01:4f9:2a:1a61::2
Connecting to www.apache.org (www.apache.org)|207.244.88.140|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://archive.apache.org/dist/tika/tika-server-1.26.jar [following]
--2021-06-08 04:17:01--  https://archive.apache.org/dist/tika/tika-server-1.26.jar
Resolving archive.apache.org (archive.apache.org)... 138.201.131.134, 2a01:4f8:172:2ec5::2
Connecting to archive.apache.org (archive.apache.org)|138.201.131.134|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79903002 (76M) [application/java-archive]
Saving to: '/tika-server-1.26.jar'

     0K .......... .......... .......... .......... ..........  0%  171K 7m36s
    50K .......... .......... .......... .......... ..........  0%  283K 6m5s
   100K .......... .......... .......... .......... ..........  0%  483K 4m57s
   150K .......... .......... .......... .......... ..........  0%  496K 4m22s

Try this:

./docker-tool.sh build 1.26 tika-server tesseract-ocr-fra tesseract-ocr-fas
 ---> 594d05b32156
Step 22/26 : COPY --from=fetch_tika /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar
 ---> 74da0f6e2136
Step 23/26 : USER $UID_GID
 ---> Running in 27cd65503cc4
Removing intermediate container 27cd65503cc4
 ---> 3087a7429f40
Step 24/26 : EXPOSE 9998
 ---> Running in 85d927ba4187
Removing intermediate container 85d927ba4187
 ---> bc6b467c7eed
Step 25/26 : ENTRYPOINT [ "/bin/sh", "-c", "exec java -jar /${TIKA_JAR_NAME}-${TIKA_VERSION}.jar -h 0.0.0.0 $0 $@"]
 ---> Running in 22f85e5b9e83
Removing intermediate container 22f85e5b9e83
 ---> 16c47b78d7e9
Step 26/26 : LABEL maintainer="Apache Tika Developers [email protected]"
 ---> Running in d670b34497d3
Removing intermediate container d670b34497d3
 ---> 6f04502585ad
Successfully built 6f04502585ad
Successfully tagged apache/tika:1.26-full
sweb@sweb-laptop:/sweb/tmp/tika-d-mhf$ TZ=UTC date && docker images | grep tika
Tue 08 Jun 2021 04:20:15 AM UTC
apache/tika                             1.26-full         6f04502585ad   2 minutes ago   690MB
apache/tika                             1.26              65ea0073c1e2   7 minutes ago   408MB

3 participants