I think the fact that on-site training runs keep failing is bad for both Civitai and its users, since the failed runs just burn computational resources for absolutely nothing. To address this, I came up with two features that might help.
Let users terminate a training run before it has finished.
Let users choose what happens next when a training run fails, for example with a toggle that either keeps the last completed epoch of the failed run or jumps straight to the next training attempt. That way users could continue training from the last epoch and finish the rest on another machine. Not to mention that the model could already have been cooked before the last epoch, yet it still gets discarded.
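To illustrate both ideas, here is a rough Python sketch. I don't know how Civitai's trainer works internally, so every name in it (`run_training`, `TrainingResult`, `keep_last_epoch_on_failure`, the `train_one_epoch` callback) is hypothetical, and the "checkpoint" is just a dict standing in for real model weights.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TrainingResult:
    status: str                 # "completed", "cancelled", or "failed"
    last_epoch: int             # last epoch that finished (0 if none)
    checkpoint: Optional[dict]  # last saved state, or None if discarded


def run_training(
    total_epochs: int,
    train_one_epoch: Callable[[int, dict], dict],  # hypothetical per-epoch step
    cancel_event: threading.Event,                 # set by the user to stop early
    keep_last_epoch_on_failure: bool = True,       # the proposed toggle
    start_epoch: int = 0,
    state: Optional[dict] = None,
) -> TrainingResult:
    state = dict(state or {})
    # When resuming, the state passed in is already a valid checkpoint.
    checkpoint = dict(state) if start_epoch > 0 else None
    last_epoch = start_epoch
    for epoch in range(start_epoch + 1, total_epochs + 1):
        # Feature 1: check for a user-requested stop between epochs.
        if cancel_event.is_set():
            return TrainingResult("cancelled", last_epoch, checkpoint)
        try:
            state = train_one_epoch(epoch, state)
        except Exception:
            # Feature 2: keep or discard the last good epoch, as the user chose.
            kept = checkpoint if keep_last_epoch_on_failure else None
            return TrainingResult("failed", last_epoch, kept)
        checkpoint = dict(state)  # save a checkpoint after every epoch
        last_epoch = epoch
    return TrainingResult("completed", last_epoch, checkpoint)


def steady_epoch(epoch: int, state: dict) -> dict:
    # Stand-in for real training: just record how many epochs are done.
    return {**state, "epochs_done": epoch}


def flaky_epoch(epoch: int, state: dict) -> dict:
    # Simulates the failure: the run crashes during epoch 3.
    if epoch == 3:
        raise RuntimeError("simulated trainer crash")
    return steady_epoch(epoch, state)


failed = run_training(5, flaky_epoch, threading.Event())
resumed = run_training(
    5, steady_epoch, threading.Event(),
    start_epoch=failed.last_epoch, state=failed.checkpoint,
)
print(failed.status, failed.last_epoch, resumed.status, resumed.last_epoch)
```

In this sketch the failed run still hands back the epoch 2 checkpoint, and the second call picks up at epoch 3, possibly on another machine. With the toggle off, the checkpoint would be dropped and the next attempt would start from scratch.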
Of course, it would be best if the bug did not exist in the first place.
Thanks for reading this comment. I hope this can be considered.
Awaiting Dev Review
💡 Feature Request
Over 2 years ago
OMG