Trained models lost in version 1.7.2 #906

Open
ChrisHuo-04 opened this issue Jul 7, 2023 · 7 comments

@ChrisHuo-04

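Description:
We deployed KubeFATE 1.7.2. Training and prediction work normally, but after the deployment has been running for a while, the /data/projects/fate/model_local_cache/ directory in the python container is emptied.

Questions:
1. Are models ultimately stored in /data/projects/fate/model_local_cache/ in the python container? (If they are stored in Eggroll instead, what are the container name and directory?)
2. What could cause KubeFATE's model_local_cache folder to be emptied?

Thanks.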

@dylan-fan
Collaborator

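Yes, models are in the /data/projects/fate/model_local_cache/ directory. Do you have any scheduled cleanup logic? Could the directory have been cleaned up by it? That is worth checking.
Also, for production use, this directory should be made highly available.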

@ChrisHuo-04
Author

ChrisHuo-04 commented Jul 26, 2023

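> Yes, models are in the /data/projects/fate/model_local_cache/ directory. Do you have any scheduled cleanup logic? Could the directory have been cleaned up by it? That is worth checking. Also, for production use, this directory should be made highly available.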

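@dylan-fan We have no scheduled cleanup logic. The python containers on every node of the KubeFATE deployment have been running continuously without a restart, yet on every node the contents of /data/projects/fate/fateflow/model_local_cache and /data/projects/fate/fateflow/jobs were all wiped at the same point in time.
None of the files in /data/projects/fate/fateflow/logs were lost.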

Is there any command in FATE that could trigger clearing of the model_local_cache and jobs folders?
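To help pin down when the directories were emptied, one option is to compare their modification times. On Linux, a directory's mtime is updated whenever entries are added or removed, so the mtime of an emptied directory approximates the time of the deletion. The following is only a minimal sketch to run inside the python container; `summarize_dir` is an illustrative helper name and not part of FATE.

```python
import os
import time


def summarize_dir(path):
    """Return existence, entry count, and last-modified time for a directory."""
    if not os.path.isdir(path):
        return {"path": path, "exists": False, "entries": 0, "modified": None}
    # The directory's own mtime changes when entries are created or deleted.
    mtime = os.stat(path).st_mtime
    return {
        "path": path,
        "exists": True,
        "entries": len(os.listdir(path)),
        "modified": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime)),
    }


if __name__ == "__main__":
    # Paths taken from this thread; adjust if your deployment differs.
    for p in (
        "/data/projects/fate/fateflow/model_local_cache",
        "/data/projects/fate/fateflow/jobs",
        "/data/projects/fate/fateflow/logs",
    ):
        print(summarize_dir(p))
```

If model_local_cache and jobs report nearly identical modification times while logs does not, that timestamp can be matched against the fateflow logs and the Kubernetes events from around the same moment.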

@dylan-fan
Collaborator

fangchi, could you take a look at the KubeFATE side of this?

@wfangchi
Collaborator

wfangchi commented Aug 1, 2023

Are you running KubeFATE in docker-compose mode or in K8s mode?

@ChrisHuo-04
Author

@wfangchi K8s mode.
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.7", GitCommit:"1f86634ff08f37e54e8bfcd86bc90b61c98f84d4", GitTreeState:"clean", BuildDate:"2021-11-17T14:41:19Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

@wfangchi
Collaborator

wfangchi commented Aug 1, 2023

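Thanks. The volume mount path of the fateflow container in version 1.7.2 may be incorrect; this should already be fixed in 1.9 and later versions: #639. @owlet42, could you help confirm?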

@wfangchi wfangchi transferred this issue from FederatedAI/FATE Aug 2, 2023
@owlet42
Collaborator

owlet42 commented Aug 7, 2023

@ChrisHuo-04

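  1. Was persistence enabled when you deployed FATE, i.e. is persistence set to true in cluster.yaml?
  2. Please confirm which path is being cleared: /data/projects/fate/fateflow/model_local_cache or /data/projects/fate/model_local_cache.
  3. Roughly how long does the deployment run before the cleanup happens?
  4. Which K8s version are you using?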
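For the first two points, a quick check is to see which mount each candidate path actually lives on inside the container. The sketch below parses `/proc/mounts`-style text and finds the longest mount point that contains a given path. A path that resolves only to `/` sits on the container's own writable layer rather than on a dedicated volume. The function names are illustrative and not part of FATE or KubeFATE.

```python
import os


def parse_mount_points(mounts_text):
    """Extract mount points (the second field) from /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.append(fields[1])
    return points


def find_mount(path, mount_points):
    """Return the longest mount point that contains path, or None."""
    path = os.path.normpath(path)
    best = None
    for mp in mount_points:
        mp_norm = os.path.normpath(mp)
        # Compare on a directory boundary so /a/logs does not match /a/logs2.
        prefix = mp_norm if mp_norm.endswith("/") else mp_norm + "/"
        if path == mp_norm or path.startswith(prefix):
            if best is None or len(mp_norm) > len(best):
                best = mp_norm
    return best


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            mounts = parse_mount_points(f.read())
        # Both candidate paths mentioned in this thread.
        for p in (
            "/data/projects/fate/fateflow/model_local_cache",
            "/data/projects/fate/model_local_cache",
        ):
            print(p, "->", find_mount(p, mounts))
```

Running the same check against /data/projects/fate/fateflow/logs, which was not cleared, would show whether it is mounted differently from the directories that lost their contents.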
