Openbayes Tianjin University GPU Troubleshooting FAQ
Tianjin University Issue Handling
If a container stays stuck while shutting down, check whether its bj (bayesjob) object is in the Terminated state:
kubectl get bj | grep Terminated
Clean up a single bayesjob object in the Terminated state:
kubectl get bj | grep Terminated
kubectl delete bj <name-of-one-Terminated-bj>
Clean up all bayesjob objects in the Terminated state, then try starting the container from the web page again:
kubectl get bj | grep Terminated | awk '{print $1}' | xargs -I {} kubectl delete bj/{} --force --grace-period=0
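Because the forced bulk delete cannot be undone, it may help to preview the object names before removing them. The following is a minimal sketch, not part of the original procedure: `terminated_bj_names` is a hypothetical helper name, and it assumes (as the awk command above does) that the object name is the first column of the `kubectl get bj` output.

```shell
# Hypothetical helper: read `kubectl get bj` output on stdin and print the
# first column (assumed to be the object name) of every line that
# contains "Terminated".
terminated_bj_names() {
  awk '/Terminated/ {print $1}'
}

# Usage sketch (run against the cluster):
#   kubectl get bj | terminated_bj_names            # preview what would be deleted
#   kubectl get bj | terminated_bj_names \
#     | xargs -I {} kubectl delete bj/{} --force --grace-period=0
```

Running the preview line first makes it easy to confirm that only the intended objects match before the delete is issued.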
Restart openbayes-gear-controller
Beyond the steps above, if the whole cluster can no longer schedule jobs and the cause is not the cluster itself, try restarting the openbayes-gear-controller pod:
kubectl delete pod openbayes-gear-controller-xxxx
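The pod name suffix differs per deployment, so the actual name has to be looked up first. The sketch below is one way to do that under stated assumptions: `find_gear_controller` is a hypothetical helper name, and the namespace of the controller is not given in this FAQ, so the lookup searches all namespaces.

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces` output on
# stdin (columns: NAMESPACE NAME ...) and print "namespace name" for each
# pod whose name starts with openbayes-gear-controller.
find_gear_controller() {
  awk '$2 ~ /^openbayes-gear-controller/ {print $1, $2}'
}

# Usage sketch (run against the cluster):
#   kubectl get pods --all-namespaces | find_gear_controller
#   kubectl delete pod <pod-name> -n <namespace>
```

If the pod is managed by a Deployment, which is an assumption here, deleting it should cause a replacement pod to be created, which is what makes the delete act as a restart.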