
[ Setup/examples ] Initial Installation Issues - docker compose errors #25

Open
gaspardc-met opened this issue May 25, 2022 · 13 comments

@gaspardc-met

Hello clearml team,
Congrats on the release of clearml-serving V2 🎉

I really wanted to check it out, and I'm having difficulties running the basic setup and scikit-learn example commands on my side.
I want to run the Installation and the Toy model (scikit learn) deployment example

I have a self-hosted clearml Server built with the helm chart on Kubernetes.

The environment variables of clearml-serving/docker/docker-compose.yml were defined in the myexemple.env file, which starts like this:

CLEARML_WEB_HOST="http://localhost:8080/"
CLEARML_API_HOST="http://localhost:8008/"
CLEARML_FILES_HOST="http://localhost:8081/"

Upon running docker-compose, both clearml-serving-inference and clearml-serving-statistics return errors:

Retrying (Retry(total=236, connect=236, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f4065110310>: Failed to establish a new connection: [Errno 111] Connection refused')': /auth.login
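A "Connection refused" on /auth.login means the address configured as CLEARML_API_HOST is not reachable from inside the container. A minimal sketch for sanity-checking reachability before digging further (the helper name `port_open` is mine, not part of clearml):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. from the same network namespace as the serving containers,
# check the ClearML API server before anything else:
# port_open("localhost", 8008)
```

If this returns False from inside the container while the server is up on the host, the problem is the address (localhost inside a container is the container itself), not the server.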

I think the issue comes from the communication with the Kafka service, but I do not know how to solve it.
Has anyone encountered and solved this issue before, since this is the default installation from the docs?

I haven't found any related issues on any of the GitHub repos.
Thanks for the help 🤖

@thepycoder
Contributor

thepycoder commented May 31, 2022

Hi @gaspardc-met

I'll try to recreate your environment and help out. When you say you have a clearml server running on Kubernetes, how is that running locally? Microk8s?
Also, do you use the experiment manager with that same server too? If so, can you share your ~/.clearml.conf file, because then we can see what URLs the experiment manager uses to connect to the server. I suspect there is a discrepancy there.

If not, when you use a web browser, can you actually go to the server via localhost?

@Muscle-Oliver

Hello @thepycoder
I'm confronted with the same error with clearml-serving-inference here, using minikube on Kubernetes.

To reproduce:

  1. Used minikube to create a single-node cluster;
$ minikube start --driver=docker \
--container-runtime=containerd \
--nodes=1
  2. Used helm to create both clearml and clearml-serving (helm repo already added);
$ helm install clearml allegroai/clearml
$ helm install clearml-serving allegroai/clearml-serving
  3. Seems everything works fine:
$ kubectl get po
NAME                                          READY   STATUS    RESTARTS       AGE
alertmanager-84b874c6f8-nxnqm                 1/1     Running   0              18h
clearml-apiserver-7b46876f44-gpm4v            1/1     Running   3 (8d ago)     8d
clearml-elastic-master-0                      1/1     Running   0              8d
clearml-fileserver-5c968587b4-2zmqx           1/1     Running   0              8d
clearml-k8sagent-5d468b6d47-269qp             1/1     Running   0              5d19h
clearml-mongodb-6b94888687-r4x7d              1/1     Running   0              8d
clearml-redis-master-0                        1/1     Running   1 (7d1h ago)   8d
clearml-serving-inference-85bcf97f69-w5b2b    1/1     Running   2 (171m ago)   18h
clearml-serving-statistics-6ffb8459bc-vhktv   1/1     Running   2 (171m ago)   18h
clearml-serving-triton-666f97b8d6-k8lsd       1/1     Running   2 (171m ago)   18h
clearml-webserver-7d86c649dd-txczl            1/1     Running   0              8d
grafana-84b7f5c559-wnfdx                      1/1     Running   0              18h
kafka-cb849765-7kng5                          1/1     Running   0              18h
prometheus-6f5868884b-9h5h8                   1/1     Running   0              18h
zookeeper-6795454fbf-gqfjh                    1/1     Running   0              18h
  4. Created credentials in the webserver Settings/Workspace page (from a browser). Used clearml-init to configure the ~/.clearml.conf file;
# ClearML SDK configuration file
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: http://127.0.0.1:46555
    web_server: http://127.0.0.1:38063
    files_server: http://127.0.0.1:42347
    # Credentials are generated using the webapp, http://127.0.0.1:45595/settings
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": 
    ......

which I am sure is correct because the webserver works fine, and the Git examples in /clearml/examples/ work.

  5. I tried the Git examples in /clearml-serving/examples/pytorch/, and created an endpoint;
$ clearml-serving model list
clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
List model serving and endpoints, control task id=1f38787f5f7a4ab6b860532369f0aa57
Info: syncing model endpoint configuration, state hash=253e8350252883f7e599572903a5cf63
Endpoints:
{
  "test_pytorch_mnist/1": {
    "engine_type": "triton",
    "serving_url": "test_pytorch_mnist",
    "model_id": "3ed0f8563b56482eb9726230f1171ef1",
    "version": "1",
    "preprocess_artifact": "py_code_test_pytorch_mnist_1",
    "input_size": [
      1,
      28,
      28
    ],
    "input_type": "float32",
    "input_name": "INPUT__0",
    "output_size": [
      -1,
      10
    ],
    "output_type": "float32",
    "output_name": "OUTPUT__0",
    "auxiliary_cfg": null
  }
}
Model Monitoring:
{}
Canary:
{}
  6. Then I found the serving URL cannot be reached. Checked the logs of pod clearml-serving-inference;
CLEARML_SERVING_TASK_ID=ClearML Serving Task ID
CLEARML_SERVING_PORT=8080
CLEARML_USE_GUNICORN=true
CLEARML_EXTRA_PYTHON_PACKAGES=
CLEARML_SERVING_NUM_PROCESS=2
CLEARML_SERVING_POLL_FREQ=1.0
CLEARML_DEFAULT_KAFKA_SERVE_URL=clearml-serving-kafka:9092
WEB_CONCURRENCY=
SERVING_PORT=8080
GUNICORN_NUM_PROCESS=2
GUNICORN_SERVING_TIMEOUT=
GUNICORN_MAX_REQUESTS=0
GUNICORN_EXTRA_ARGS=
UVICORN_SERVE_LOOP=asyncio
UVICORN_EXTRA_ARGS=
UVICORN_LOG_LEVEL=warning
CLEARML_DEFAULT_BASE_SERVE_URL=http://127.0.0.1:8080/serve
CLEARML_DEFAULT_TRITON_GRPC_ADDR=clearml-serving-triton:8001
Starting Gunicorn server
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9b57d87610>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9aefd01760>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /auth.login
Retrying (Retry(total=237, connect=237, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9aefc207f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /auth.login
......
  7. Checked the entrypoint.sh in the clearml-serving-inference pod with kubectl exec;
......
else
  echo "Starting Gunicorn server"
  # start service
  PYTHONPATH=$(pwd) python3 -m gunicorn \
      --preload clearml_serving.serving.main:app \
      --workers $GUNICORN_NUM_PROCESS \
      --worker-class uvicorn.workers.UvicornWorker \
      --max-requests $GUNICORN_MAX_REQUESTS \
      --timeout $GUNICORN_SERVING_TIMEOUT \
      --bind 0.0.0.0:$SERVING_PORT \
      $GUNICORN_EXTRA_ARGS
fi

It seems this gunicorn app failed to communicate with something.
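The "Temporary failure in name resolution" in the log above suggests the hostname in the SDK's api_server URL does not resolve from inside the pod. A quick way to check this, sketched with a hypothetical helper of mine (`host_resolves` is not a clearml function):

```python
import socket
from urllib.parse import urlparse

def host_resolves(url: str) -> bool:
    """Return True if the hostname embedded in `url` resolves via DNS."""
    host = urlparse(url).hostname
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

# Check the api_server address the SDK is configured with, as seen
# from inside the pod, e.g.:
# host_resolves("http://clearml-apiserver:8008")
```

If the configured host doesn't resolve in-cluster, the SDK retries /auth.login forever, which matches the log output above.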

Thanks for any help! :)

@thepycoder
Contributor

Thanks for the detailed writeup @Muscle-Oliver !

So I've taken a look and it seems like a specific parameter is missing from the helm chart.
The URL http://127.0.0.1:8080/serve does not look like the correct URL to connect to; in a Kubernetes cluster you would usually use service addresses rather than localhost.

In order to set this IP address, you'll have to edit the following parameter in the serving docker-compose yaml file:
https://github.com/allegroai/clearml-serving/blob/main/docker/docker-compose-triton-gpu.yml#L92

But it seems that the particular env var CLEARML_DEFAULT_BASE_SERVE_URL isn't exposed in the helm chart at all! So I'm adding @valeriano-manassero to the discussion as he is the maintainer of the helm charts. Using this parameter should allow you to set everything up properly :)
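On the docker-compose side, that would amount to something along these lines (the address shown is a placeholder for whatever is reachable from the other containers in your setup):

```yaml
services:
  clearml-serving-inference:
    environment:
      CLEARML_DEFAULT_BASE_SERVE_URL: "http://clearml-serving-inference:8080/serve"
```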

@Muscle-Oliver

Thanks for the quick reply @thepycoder !

May I ask what problem the /auth.login at the end of the gunicorn startup log suggests?
As it goes:
Retrying (Retry(total=237, connect=237, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9aefc207f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /auth.login

Also, I'm wondering whether this env var just indicates the final serving URL after gunicorn starts up successfully, rather than some destination it is currently trying to connect to? 😄

Thanks for any further update :) ☕

@valeriano-manassero

Hi, I just opened a PR mentioning this issue. Can you please check it and let me know if this is what you are expecting?

@valeriano-manassero

Since that change is not breaking, I just merged the PR and released clearml-serving-0.4.0.
Please let me know if this chart works for you.

@Muscle-Oliver

Muscle-Oliver commented Jun 27, 2022

Thanks for the update @valeriano-manassero !

So, I also tried minikube start --driver=none as root, and helm-installed clearml-serving-0.4.0 as suggested. But everything worked out the same.

The log of pod clearml-serving-inference still goes:

Retrying (Retry(total=234, connect=234, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff3de505880>: Failed to establish a new connection: [Errno -2] Name or service not known')': /auth.login

which means gunicorn failed to start? (Or it did, but it is not working as expected.)

Then I ran kubectl exec into the pod clearml-serving-inference and manually executed clearml_serving/serving/entrypoint.sh.
I used Ctrl+C to terminate the process once the above error log appeared, which produced a more detailed traceback:

root@clearml-serving-inference-85bcf97f69-9jsdh:~/clearml# sh clearml_serving/serving/entrypoint.sh 
CLEARML_SERVING_TASK_ID=ClearML Serving Task ID
CLEARML_SERVING_PORT=8080
CLEARML_USE_GUNICORN=true
EXTRA_PYTHON_PACKAGES=
CLEARML_SERVING_NUM_PROCESS=2
CLEARML_SERVING_POLL_FREQ=1.0
CLEARML_DEFAULT_KAFKA_SERVE_URL=clearml-serving-kafka:9092
CLEARML_DEFAULT_KAFKA_SERVE_URL=clearml-serving-kafka:9092
WEB_CONCURRENCY=
SERVING_PORT=8080
GUNICORN_NUM_PROCESS=2
GUNICORN_SERVING_TIMEOUT=
GUNICORN_EXTRA_ARGS=
UVICORN_SERVE_LOOP=asyncio
UVICORN_EXTRA_ARGS=
CLEARML_DEFAULT_BASE_SERVE_URL=http://127.0.0.1:8080/serve
CLEARML_DEFAULT_TRITON_GRPC_ADDR=clearml-serving-triton:8001
Starting Gunicorn server
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5bfe7e4610>: Failed to establish a new connection: [Errno -2] Name or service not known')': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5b984f2520>: Failed to establish a new connection: [Errno -2] Name or service not known')': /auth.login
^CTraceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f5b984f27f0>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.9/site-packages/gunicorn/__main__.py", line 7, in <module>
    run()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/app/base.py", line 231, in run
    super().run()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/arbiter.py", line 58, in __init__
    self.setup(app)
  File "/usr/local/lib/python3.9/site-packages/gunicorn/arbiter.py", line 118, in setup
    self.app.wsgi()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
    return self.load_wsgiapp()
  File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/local/lib/python3.9/site-packages/gunicorn/util.py", line 359, in import_app
    mod = importlib.import_module(module)
  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/root/clearml/clearml_serving/serving/main.py", line 49, in <module>
    serving_task = ModelRequestProcessor._get_control_plane_task(task_id=serving_service_task_id)
  File "/root/clearml/clearml_serving/serving/model_request_processor.py", line 1094, in _get_control_plane_task
    task = Task.get_task(task_id=task_id)
  File "/usr/local/lib/python3.9/site-packages/clearml/task.py", line 796, in get_task
    return cls.__get_task(
  File "/usr/local/lib/python3.9/site-packages/clearml/task.py", line 3523, in __get_task
    return cls(private=cls.__create_protection, task_id=task_id, log_to_backend=False)
  File "/usr/local/lib/python3.9/site-packages/clearml/task.py", line 169, in __init__
    super(Task, self).__init__(**kwargs)
  File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/task/task.py", line 152, in __init__
    super(Task, self).__init__(id=task_id, session=session, log=log)
  File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/base.py", line 145, in __init__
    super(IdObjectBase, self).__init__(session, log, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/base.py", line 39, in __init__
    self._session = session or self._get_default_session()
  File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/base.py", line 115, in _get_default_session
    InterfaceBase._default_session = Session(
  File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/session/session.py", line 207, in __init__
    self.refresh_token()
  File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/session/token_manager.py", line 112, in refresh_token
    self._set_token(self._do_refresh_token(self.__token, exp=self.req_token_expiration_sec))
  File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/session/session.py", line 736, in _do_refresh_token
    res = self._send_request(
  File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/session/session.py", line 358, in _send_request
    res = self.__http_session.request(
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/utils.py", line 85, in send
    return super(SessionWithTimeout, self).send(request, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 788, in urlopen
    retries.sleep()
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 432, in sleep
    self._sleep_backoff()
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 416, in _sleep_backoff
    time.sleep(backoff)
KeyboardInterrupt

Will this help explain the connection error? Or does Task.init() have something to do with the reported ... Connection refused')': /auth.login? I'm not sure.

Thank you for any reply!

@valeriano-manassero

Just a question: I see clearml.defaultBaseServeUrl in the chart is still the default. Did you try changing that URL to point to the right endpoint?

@Muscle-Oliver

Thanks for the helm update.
But, I'm confused now 😂

How can I get this clearml-serving-inference to start correctly, anyway?

I can infer from the logs that the startup problem may result from some sort of connection error, but I have no idea where exactly gunicorn was connecting to.
As @thepycoder has suggested, the CLEARML_DEFAULT_BASE_SERVE_URL might not be localhost, but some cluster IP. Then I checked all the services in the cluster, and only found one service with port 8080, which is clearml-serving-inference itself!
Does this mean the localhost is actually correct ❓
I really have no idea what is the right endpoint of the clearml.defaultBaseServeUrl.

As the clearml-serving-inference gunicorn entrypoint.sh goes:

#!/bin/bash

# print configuration
echo CLEARML_SERVING_TASK_ID="$CLEARML_SERVING_TASK_ID"
echo CLEARML_SERVING_PORT="$CLEARML_SERVING_PORT"
echo CLEARML_USE_GUNICORN="$CLEARML_USE_GUNICORN"
echo EXTRA_PYTHON_PACKAGES="$EXTRA_PYTHON_PACKAGES"
echo CLEARML_SERVING_NUM_PROCESS="$CLEARML_SERVING_NUM_PROCESS"
echo CLEARML_SERVING_POLL_FREQ="$CLEARML_SERVING_POLL_FREQ"
echo CLEARML_DEFAULT_KAFKA_SERVE_URL="$CLEARML_DEFAULT_KAFKA_SERVE_URL"
echo CLEARML_DEFAULT_KAFKA_SERVE_URL="$CLEARML_DEFAULT_KAFKA_SERVE_URL"

SERVING_PORT="${CLEARML_SERVING_PORT:-8080}"
GUNICORN_NUM_PROCESS="${CLEARML_SERVING_NUM_PROCESS:-4}"
GUNICORN_SERVING_TIMEOUT="${GUNICORN_SERVING_TIMEOUT:-600}"
UVICORN_SERVE_LOOP="${UVICORN_SERVE_LOOP:-asyncio}"

# set default internal serve endpoint (for request pipelining)
CLEARML_DEFAULT_BASE_SERVE_URL="${CLEARML_DEFAULT_BASE_SERVE_URL:-http://127.0.0.1:$SERVING_PORT/serve}"
CLEARML_DEFAULT_TRITON_GRPC_ADDR="${CLEARML_DEFAULT_TRITON_GRPC_ADDR:-127.0.0.1:8001}"

# print configuration
echo WEB_CONCURRENCY="$WEB_CONCURRENCY"
echo SERVING_PORT="$SERVING_PORT"
echo GUNICORN_NUM_PROCESS="$GUNICORN_NUM_PROCESS"
echo GUNICORN_SERVING_TIMEOUT="$GUNICORN_SERVING_PORT"
echo GUNICORN_EXTRA_ARGS="$GUNICORN_EXTRA_ARGS"
echo UVICORN_SERVE_LOOP="$UVICORN_SERVE_LOOP"
echo UVICORN_EXTRA_ARGS="$UVICORN_EXTRA_ARGS"
echo CLEARML_DEFAULT_BASE_SERVE_URL="$CLEARML_DEFAULT_BASE_SERVE_URL"
echo CLEARML_DEFAULT_TRITON_GRPC_ADDR="$CLEARML_DEFAULT_TRITON_GRPC_ADDR"

# runtime add extra python packages
if [ ! -z "$EXTRA_PYTHON_PACKAGES" ]
then
      python3 -m pip install $EXTRA_PYTHON_PACKAGES
fi

if [ -z "$CLEARML_USE_GUNICORN" ]
then
  echo "Starting Uvicorn server"
  PYTHONPATH=$(pwd) python3 -m uvicorn \
      clearml_serving.serving.main:app --host 0.0.0.0 --port $SERVING_PORT --loop $UVICORN_SERVE_LOOP \
      $UVICORN_EXTRA_ARGS
else
  echo "Starting Gunicorn server"
  # start service
  PYTHONPATH=$(pwd) python3 -m gunicorn \
      --preload clearml_serving.serving.main:app \
      --workers $GUNICORN_NUM_PROCESS \
      --worker-class uvicorn.workers.UvicornWorker \
      --timeout $GUNICORN_SERVING_TIMEOUT \
      --bind 0.0.0.0:$SERVING_PORT \
      $GUNICORN_EXTRA_ARGS
fi

Maybe we can clear this up by reproducing the gunicorn startup process in clearml-serving-inference?

@valeriano-manassero

I probably found the issue:
there is likely some misconfiguration of the apiHost value in your helm chart installation.
If they are on the same cluster, the values should be:

  apiHost: http://clearml-enterprise-apiserver:8008
  filesHost: http://clearml-enterprise-fileserver:8081
  webHost: http://clearml-enterprise-webserver:80

Once you no longer get connection errors, you can connect to the inference service simply by doing a port-forward with kubectl -n clearml port-forward svc/clearml-serving-inference 8080:8080.

Let me know if this helps.

@Muscle-Oliver

@valeriano-manassero Thanks! That's it!

Thanks to your reminder, I finally noticed that the configs of clearml-serving are all incorrect. 🤣
Previously I installed clearml-serving via the helm install [RELEASE] [CHART] command line, and all the helm charts used the default configs from values.yaml, left unchecked.

I did a git pull of the clearml-serving repo and checked the values.yaml, which reads:

clearml:
  apiAccessKey: "ClearML API Access Key"
  apiSecretKey: "ClearML API Secret Key"
  apiHost: http://clearml-server-apiserver:8008
  filesHost: http://clearml-server-fileserver:8081
  webHost: http://clearml-server-webserver:80
  servingTaskId: "ClearML Serving Task ID"

......

where none of the host addresses match my current clearml services (I installed clearml via helm, app version 1.4.0).

The correct services should be (version 1.4.0):

clearml-apiserver:8008
clearml-fileserver:8081
clearml-webserver:80
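
So the fix amounts to overriding the chart defaults with the actual in-cluster service names, roughly like this (the exact names depend on your helm release name, here assumed to be `clearml`):

```yaml
clearml:
  apiHost: http://clearml-apiserver:8008
  filesHost: http://clearml-fileserver:8081
  webHost: http://clearml-webserver:80
```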

No more connection error!

@Mithmi

Mithmi commented Oct 31, 2023

Hello, I have the same issue with connection errors.

On my Ubuntu machine:

  1. I successfully installed clearml server and it sits on my localhost:8080

On the same Ubuntu machine i tried:

  1. to run docker-compose from this tutorial - I wasn't able to deploy the serving-inference and serving-statistics parts of this compose - they can't connect and throw /auth.login errors
  2. to run the inference container from the toy model tutorial - I wasn't able to deploy it, because it can't connect and throws /auth.login errors

my example.env file:

CLEARML_WEB_HOST="http://localhost:80"
CLEARML_API_HOST="http://localhost:8008"
CLEARML_FILES_HOST="http://localhost:8081"
CLEARML_API_ACCESS_KEY="IEHHDEZ3HO2MNHYX5OAZ"
CLEARML_API_SECRET_KEY="IbIAqWWAjmWcNxk6uOlFqywuBIT350Dy03II77SE2wOaiAhl8T"
CLEARML_SERVING_TASK_ID="7b94d19189b84692b1450b00037dc45d"

my conf file:

# ClearML SDK configuration file
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: http://localhost:8008
    web_server: http://localhost:8080
    files_server: http://localhost:8081
    # Credentials are generated using the webapp, http://localhost:8080/settings
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "IEHHDEZ3HO2MNHYX5OAZ", "secret_key": "IbIAqWWAjmWcNxk6uOlFqywuBIT350Dy03II77SE2wOaiAhl8T"}
}

@Jahysama

Jahysama commented Nov 8, 2023

Hi @Mithmi, I host my main clearml server and clearml-serving server using different docker-compose files. I resolved this issue by hosting the main and serving composes on the same Docker network. I hope this Stack Overflow answer will help you figure it out for your case.
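
For reference, a shared network can be declared in both compose files roughly like this (the network name `clearml-net` and the pre-created external network are my assumptions, not part of the official compose files):

```yaml
# create once on the host: docker network create clearml-net
# then, in EACH docker-compose.yml:
networks:
  clearml-net:
    external: true
services:
  clearml-serving-inference:
    networks:
      - clearml-net
```

With both stacks attached to the same network, containers can reach each other by service name instead of localhost.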
