Deployment fails using the KubeFATE docker-deploy method #5731

Open
DuanYuFi opened this issue Oct 29, 2024 · 11 comments

Comments

@DuanYuFi

Describe the bug

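Hello,

I followed the KubeFATE docker-compose-release method to deploy FATE training and serving on three machines, but a Docker container keeps retrying.

image

All the other modules start fine. What is going on?

For reference, the image tags are as follows:
image

Thanks!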

@dylan-fan
Collaborator

Your images don't look like the latest ones. Why are so many of them from 16 months ago?

@DuanYuFi
Author

I assumed latest meant the newest version, gg. I'll pin a specific version and try again. Thanks.
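
The `latest` tag is only a label and can point at an old image, which is what happened here. As a rough sketch, a guard like the one below could refuse an empty or `latest` tag before deploying. The function name `require_pinned_tag` and the version string `2.2.0` are hypothetical and only for illustration; they are not part of KubeFATE.

```shell
# Hypothetical helper (not part of KubeFATE): reject unpinned image tags.
require_pinned_tag() {
    tag="$1"
    case "$tag" in
        ""|latest)
            # An empty tag defaults to "latest" in Docker, so treat both the same.
            echo "image tag '$tag' is not pinned; set an explicit version" >&2
            return 1
            ;;
        *)
            echo "using pinned tag: $tag"
            return 0
            ;;
    esac
}

# Example call; "2.2.0" is an assumed version string for illustration only.
require_pinned_tag "2.2.0"
```

Use whichever version the deployment docs for your branch actually specify.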

@DuanYuFi
Author

DuanYuFi commented Oct 31, 2024

After updating, the environment comes up:
image

However, the test fails and I don't know why.
image

My configuration file is as follows:

#!/bin/bash

user=root
dir=/data/projects/fate
party_list=(0 1 2)
party_ip_list=(172.26.232.247 172.26.232.248 172.26.232.249)
serving_ip_list=(172.26.232.247 172.26.232.248 172.26.232.249)

# Engines:
# Computing : Eggroll, Spark, Spark_local
computing=Eggroll
# Federation: OSX(computing: Eggroll/Spark/Spark_local), Pulsar/RabbitMQ(computing: Spark/Spark_local)
federation=OSX
# Storage: Eggroll(computing: Eggroll), HDFS(computing: Spark), LocalFS(computing: Spark_local)
storage=Eggroll
# Algorithm: Basic, NN, ALL
algorithm=Basic
# Device: CPU, IPCL, GPU
device=CPU

# spark and eggroll 
compute_core=8

# You only need to configure this parameter when you want to use the GPU, the default value is 1
gpu_count=1

# default
exchangeip=

# modify if you are going to use an external db
mysql_ip=mysql
mysql_user=fate
mysql_password=fate_dev
mysql_db=fate_flow
serverTimezone=UTC

name_node=hdfs://namenode:9000

# Define fateboard login information
fateboard_username=admin
fateboard_password=admin

# Define serving admin login information
serving_admin_username=admin
serving_admin_password=admin

# Define notebook login information
notebook_hashed_password=

Also, the FATE Board username and password are rejected:
image

@yx0090sh

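docker-compose deployment docs: https://github.com/FederatedAI/KubeFATE/blob/dev-2.2.0/docker-deploy/README_zh.md
The fateboard password is correct. In this situation you only need to refresh the page and log in again. The cause is that the new service replaced the old one, so clearing the cache or refreshing and then logging in again should fix it.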

@DuanYuFi
Author

Sorry, I hadn't noticed the other branches. I'll try again, thanks!

@DuanYuFi
Author

:(
image

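Also, clearing the cache and hard-refreshing fateboard still doesn't work; it still reports a wrong password :(. Switching browsers doesn't help either.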

@yx0090sh

yx0090sh commented Nov 1, 2024

This is a different error: the fate_flow service is not running. You need to check why flow didn't start, or, if it simply hasn't been started, start it.
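
One possible way to check this, sketched with standard Docker CLI commands only: list the containers that are restarting or have exited, then read the tail of the suspect container's log. The helper names `list_failed_containers` and `tail_container_log` are made up for this example, and the container name has to come from your own `docker ps` output.

```shell
# List containers that are stuck restarting or have exited, with their status.
# Repeating a filter key makes `docker ps` match either value.
list_failed_containers() {
    docker ps -a \
        --filter "status=restarting" \
        --filter "status=exited" \
        --format '{{.Names}}: {{.Status}}'
}

# Print the last lines of one container's log.
# $1 = container name (take it from the listing above), $2 = line count (default 50).
tail_container_log() {
    docker logs --tail "${2:-50}" "$1" 2>&1
}
```

Run `list_failed_containers` on each party's host first, then pass the name it prints for the flow container to `tail_container_log`.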

@DuanYuFi
Author

DuanYuFi commented Nov 1, 2024

image
image
The services look fine, so I'll investigate other areas.

@DuanYuFi
Author

DuanYuFi commented Nov 1, 2024

image
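It looks like the fateflow service cannot be reached via 127.0.0.1, while the API behind flow server info goes through localhost, which causes this problem.

Is this a problem with my deployment (I followed the tutorial exactly) or with my machine? Should I change the routing table so that the fateflow client goes through the 192.167 path, or should I investigate why the service is unreachable via localhost?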
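
To narrow down which address actually accepts connections, a small probe along these lines may help. It is only a sketch: it relies on bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command, the helper names are hypothetical, and the default port 9380 is an assumption (commonly the FATE Flow HTTP port) that should be replaced with whatever your deployment exposes.

```shell
# Return 0 if a TCP connection to host:port succeeds within the timeout.
# $1 = host, $2 = port, $3 = timeout in seconds (default 3).
can_connect() {
    _host="$1"; _port="$2"; _secs="${3:-3}"
    timeout "$_secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$_host" "$_port" 2>/dev/null
}

# Probe loopback, localhost, and the machine's LAN address on the same port.
# $1 = LAN IP of this host, $2 = port (9380 is an assumed default, adjust as needed).
compare_endpoints() {
    port="${2:-9380}"
    for host in 127.0.0.1 localhost "$1"; do
        if can_connect "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port unreachable"
        fi
    done
}
```

For example, `compare_endpoints 172.26.232.247` uses the first party IP from the config above. If only the LAN address is reachable, the service is most likely bound to that address, or the port is published only on it, rather than being down.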

@yx0090sh

yx0090sh commented Nov 1, 2024

I'd recommend not using 127.0.0.1.

@DuanYuFi
Author

DuanYuFi commented Nov 1, 2024

I didn't set 127.0.0.1 manually; it is the default after deployment. Where should I change it?

3 participants