In the training context, 1 epoch means that the model is trained on every single training image 1 time. So when the model has seen and learned from every image 1 time, 1 epoch has happened.
For example, we have used the below dataset to train a LoRA model. This dataset is not great in consistency, therefore we can't say it is a perfect dataset, but it is okay to use as a demo. There were a total of 66 images, all with white backgrounds. For style training like the below case, the more variety of images you have with the same consistency, the better. More variety means not having a sword twice (same subject / class) but having a different item or object in every image.
Figure 1: Training Dataset, To see full size grid click here
So 1 epoch meant we trained the model on every image 1 time. I have trained for 1000 epochs, therefore the model has seen every image 1000 times in total.
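The arithmetic is simple; a quick sketch using the numbers from this demo (66 images, 1000 epochs):

```python
# Each image is seen exactly once per epoch, so the total number of
# times the model has seen any image is simply images * epochs.
num_images = 66
epochs = 1000
total_views = num_images * epochs
print(total_views)  # 66000
```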
How Many Epochs You Should Do
How many epochs you should do depends on the dataset and the training workflow you are using. Setting up the workflow requires professionalism, understanding, and research. Therefore, rather than trying to understand and tune the workflow, you should focus on improving your training dataset and on comparing different epoch counts.
The more training images you have, the better. However, there are 2 critically important prerequisites for this. With current training workflows on FLUX we can only teach a single subject perfectly. So it is either a specific style or a subject like a character, a specific item, a specific object, or such. If you train multiple concepts that have common features, they will bleed. This means that you can't train 2 people or 2 styles at the same time; they will bleed. At most you can train, say, a dog and a person together because they are totally different concepts. So what are these 2 critically important prerequisites we mentioned?
1: Consistency; Your dataset has to be consistent. If it is a character, all the images must contain the character and the character alone, not other characters as well.
Or if your dataset is a style, all of the training dataset images have to be in the same style, not a mix of different styles.
2: Quality; All of the images have to be high quality. 100% avoid images lower than 1 megapixel. Don't have images with widely different aspect ratios. Try to aim for 1 megapixel for all your images; the best is 1024x1024. Moreover, images shouldn't be blurry, poorly lit, low quality, or AI-upscaled. So the dataset quality matters a lot.
Moreover, make your subject the focus of the image. Below I have given 2 examples.
Bad example below
Good example below
So as you can see, your image should be filled with the subject as much as possible so the model learns most of it.
Ok, so after all, how will we know the best epoch? Sadly, the only way is testing with your dataset. The bigger the dataset you have, the fewer epochs you will need. Other than that, given the training workflow / preset / system we used, we can give the below numbers as estimates.
Dataset of 1-50 images : 200 epochs
Dataset of 50-150 images : 150 epochs
Dataset of 150-300 images : 100 epochs
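These estimates can be expressed as a simple lookup; a minimal sketch, assuming the dataset-size brackets above (they only apply to this workflow / preset, not as a universal rule):

```python
def estimated_epochs(dataset_size: int):
    """Rough epoch estimate from the brackets above; not a universal rule."""
    if dataset_size <= 0:
        raise ValueError("dataset must contain at least 1 image")
    if dataset_size <= 50:
        return 200
    if dataset_size <= 150:
        return 150
    if dataset_size <= 300:
        return 100
    return None  # beyond the table: you have to test checkpoints yourself

print(estimated_epochs(66))  # 150 (the demo dataset falls in the 50-150 bracket)
```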
In all cases you must always compare different epoch checkpoints and decide which one looks the best.
Different Epoch Checkpoints Comparison
As I have shown in Figure 1 above, I have trained that dataset for 1000 epochs. Now I will show the epoch comparisons. I have taken a backup of the training model once every 25 epochs, so in total I have taken 40 checkpoints.
The checkpoints are named like this in my demo case:
Style_Demo-000025 : 25 epochs, so every image was trained a total of 25 times
Style_Demo-000050 : 50 epochs, so every image was trained a total of 50 times
Style_Demo-000150 : 150 epochs, so every image was trained a total of 150 times
Then I have done testing of different epochs like below to demonstrate to you how epoch counts impacted the output.
I have trained the models with only the ohwx activation token and I am using the following suffix:
stylized 3D render with a playful, cartoon aesthetic, featuring smooth, glossy surfaces and soft, rounded shapes.
The first tested prompt is: ohwx car, stylized 3D render with a playful, cartoon aesthetic, featuring smooth, glossy surfaces and soft, rounded shapes.
Compared epochs : 25, 50, 100, 200, 400, 600, 800, 975
Compared seeds : 1, 2, 3, 4
What seed means is that every image generation starts with random noise, and the seed changes that noise; therefore we get a different output with different seed values.
This way we can obtain the best image we need.
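As an illustration of how the seed fixes the starting noise, here is a toy sketch with NumPy (the array shape is illustrative, not FLUX's actual latent shape):

```python
import numpy as np

def initial_noise(seed: int, shape=(4, 64, 64)):
    """Diffusion-style generation starts from Gaussian noise that is
    fully determined by the seed (shape here is just illustrative)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

# The same seed always reproduces the same starting noise (and thus the
# same image), while different seeds give different noise and outputs.
assert np.array_equal(initial_noise(1), initial_noise(1))
assert not np.array_equal(initial_noise(1), initial_noise(2))
```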
The first test result shown below:
Figure 2: Click to download full grid
When we look at the grid, the very first column is the base FLUX model without any trained LoRA. The second column is 25 epochs, the third one is 50, then 100, 200, 400, 600, 800, 975. The first row is seed 1, the second row is seed 2, the third row is seed 3, and the 4th row is seed 4.
What we see here clearly is that, as we trained the model more, it became more consistent with our style. Now you may be asking how to decide which epoch is best; this is totally personal. Moreover, it depends on your task. Usually what makes a model good is both achieving the target style / character resemblance and also being able to generalize, so that you are able to generate everything without making the model very overfit.
As in Figure 2 above, we can see that it is still able to generate different cars and it is still perfectly following the ohwx car prompt. However, you shouldn't decide with a single prompt. For example, my dataset didn't have any car image, therefore it is good that we didn't overwrite the car concept. But what if we had a revolver image in the dataset, as below, and we want to generate a different revolver; will it still work?
Figure: The revolver image we had in dataset
To test the generalization and style learning of our training, I have made another experiment with the below prompt:
ohwx revolver, stylized 3D render with a playful, cartoon aesthetic, featuring smooth, glossy surfaces and soft, rounded shapes.
Figure 3: Click to download full grid
The revolver test, as seen in Figure 3 above, turned out to be great for understanding epoch logic. We can clearly see that the model started to become overfit starting from epoch 200. At epoch 400, none of the seeds generated anything resembling a revolver. This means that the model was totally overfit. So now I can focus on testing epochs between 100 and 200 to find my sweet spot if I am not satisfied with 100 epochs.
Figure 4: Click to download full grid
As seen in Figure 4, after epoch 125 the model started to lose the concept of the revolver due to overfitting. Of course, there is a possibility that with better prompting, or by generating more images with different seeds, I could still generate a revolver. But this is the logic of epochs in training.
Ultimately it depends on your training dataset and your training purpose. If the overfit model works better for your case, there is absolutely nothing wrong with overfitting your training with more epochs.
Step Count and Batch Size Logic
Step count and batch size are common concepts you may have heard of if you are interested in training models.
Meaning of Step Count
1 step means that the GPU is doing a task 1 time, like executing a command. In our case, when training a model, it means it is training our model on samples in a single pass. However, how many samples (images) are trained in 1 task depends on the batch size.
Meaning of Batch Size
Batch size means how many samples (images) the model will be trained on in 1 step. So if batch size is 2, 2 samples (images) will be shown to the model in 1 step. If batch size is 4, 4 images will be shown to the model in 1 step.
Step Count vs Epoch
Epoch determines how many times each sample (image) will be shown to the model in total. However, step count means how many times the GPU will execute a task. We can understand this with a few examples below.
Case 1 : 20 training images and 100 epoch and batch size 1
In this case, we have to show the model a total of 2000 samples (images): 20 images x 100 epochs. Since our batch size is 1, the GPU has to execute 2000 tasks, so it will be 2000 steps.
Case 2 : 20 training images and 100 epoch and batch size 4
In this case, we have to show the model a total of 2000 samples (images). Since our batch size is 4, the GPU has to execute 500 tasks, so it will be 500 steps, because at each task it is able to show the model 4 images.
Case 3 : 20 training images and 100 epoch and batch size 4 and 2 GPUs
In this case, we have to show the model a total of 2000 samples (images). Since our batch size is 4, 500 tasks have to be executed. However, since we have 2 GPUs, the tasks will be equally distributed between the GPUs, therefore each GPU will do 250 tasks. So it will be 250 steps, running on both GPUs at the same time.
Case 4 : 100 training images and 50 epoch and batch size 8
In this case, we have to show the model a total of 5000 samples (images). Since our batch size is 8, the GPU has to execute 625 tasks, so it will be 625 steps, because at each task it is able to show the model 8 images.
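All four cases follow one formula: steps = (images x epochs) / (batch size x number of GPUs). A minimal sketch, assuming samples split evenly across GPUs as in the cases above:

```python
import math

def training_steps(num_images: int, epochs: int,
                   batch_size: int, num_gpus: int = 1) -> int:
    """Step count per GPU; ceil covers a final partial batch."""
    total_samples = num_images * epochs
    return math.ceil(total_samples / (batch_size * num_gpus))

print(training_steps(20, 100, 1))     # Case 1 -> 2000 steps
print(training_steps(20, 100, 4))     # Case 2 -> 500 steps
print(training_steps(20, 100, 4, 2))  # Case 3 -> 250 steps per GPU
print(training_steps(100, 50, 8))     # Case 4 -> 625 steps
```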
How Batch Size Works
In essence, training shows a sample to the model, calculates the loss, and modifies the model weights depending on that loss so the loss is lower next time. When batch size is 1, 1 sample (image) is trained and the model weights are updated, so after every sample the model gets a full update. When batch size is 2, 2 images are shown to the model as a batch and the model weights are updated once with the average over both images, instead of after every single image. This way we can increase training speed, since there are fewer weight updates and fewer calculations.
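A toy sketch of this update logic (plain SGD on a one-parameter linear fit, not the real FLUX trainer), showing that batch size only changes how often the weights are updated, with the gradient averaged over the batch:

```python
import numpy as np

def sgd_fit(xs, ys, batch_size, lr=0.05):
    """Fit y ~ w * x with mean-squared-error loss; returns the weight
    and how many weight updates (steps) were performed."""
    w, updates = 0.0, 0
    for i in range(0, len(xs), batch_size):
        x = np.asarray(xs[i:i + batch_size])
        y = np.asarray(ys[i:i + batch_size])
        grad = np.mean(2 * (w * x - y) * x)  # gradient averaged over the batch
        w -= lr * grad                       # one weight update per step
        updates += 1
    return w, updates

xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]  # true weight is 2
_, u1 = sgd_fit(xs, ys, batch_size=1)  # 4 updates: one per sample
_, u4 = sgd_fit(xs, ys, batch_size=4)  # 1 update: gradients averaged
print(u1, u4)
```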
When there are multiple GPUs, the logic is the same. The training is fully cloned on each GPU. Then each GPU calculates the weight updates according to the samples it sees. These weight changes are accumulated from all GPUs, averaged, and then the models on all GPUs are synchronized. This is for training an image model, by the way, not sharded LLM training.
Why Higher Batch Size Is Not Preferred
This is rather a debated topic, but all of our empirical experiments show that when we fine-tune either a LoRA or a full model with a single subject, batch size 1 yields the very best results. Therefore, a higher batch size should actually be avoided. However, if you need speed, the only way is using multiple GPUs, which reduces training time almost linearly. In this case, the batch size on each GPU should still be 1 to minimize quality loss and maximize the speed gain, since increasing the batch size on each GPU brings so little extra speed.
Is the Same Config Used for All Batch Sizes?
Batch size 100% directly impacts the learning rate of the model, since we average the losses from every sample in the batch. Therefore, each batch size requires a specific learning rate. However, from experience and experiments we can generalize a learning rate like below:
Learning Rate = Square Root of Batch_Size x Batch_Size_1_Learning_Rate
(the square root applies to the batch size only, not to the whole product)
So if the batch size 1 learning rate is 0.0001 and your batch size is 4, your new learning rate becomes sqrt(4) x 0.0001 = 2 x 0.0001 = 0.0002. You can see the square root table below
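The rule above in code (a minimal sketch; the base rate 0.0001 is just the example value from the text):

```python
import math

def scaled_learning_rate(batch_size: int, base_lr: float = 0.0001) -> float:
    """Square-root scaling: LR = sqrt(batch_size) * batch-size-1 LR."""
    return math.sqrt(batch_size) * base_lr

print(scaled_learning_rate(1))  # 0.0001
print(scaled_learning_rate(4))  # 0.0002
```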