How long does it take for the training? #8

Open
moon5756 opened this issue Mar 5, 2020 · 2 comments

moon5756 commented Mar 5, 2020

Hi, thanks for the really helpful work.

I just wonder how long the training took for you.
My desktop has the following CPU, SSD, and GPU:
CPU: Intel Core i7-6900K @ 3.2GHz
SSD: Samsung SSD 850 EVO
GPU: NVIDIA GeForce RTX 2080 Ti

I ran the training script and it prints "active GPUs: 0", from which I can tell that my GPU is being used. I changed the batch size to 50 in config.json because it complained about an OOM issue.

I ran the script for about 23 min and it only completed one epoch.
One concern is that CPU utilization is around 99% while GPU utilization is less than 10%.
Is there any configuration I need to change to fully utilize the GPU?
Following is the command line log.

$ python train.py --config configs/config.json -g 0
=> active GPUs: 0
=> Output folder for this run -- jester_conv6
Using 9 processes for data loader.
Training is getting started...
Training takes 999999 epochs.
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
Epoch: [0][0/2371]      Loss 3.3603 (3.3603)    Prec@1 2.000 (2.000)    Prec@5 24.000 (24.000)
Epoch: [0][100/2371]    Loss 3.3065 (3.3294)    Prec@1 8.000 (5.267)    Prec@5 28.000 (21.010)
Epoch: [0][200/2371]    Loss 3.4034 (3.3176)    Prec@1 6.000 (6.179)    Prec@5 16.000 (21.980)
Epoch: [0][300/2371]    Loss 3.3358 (3.3123)    Prec@1 12.000 (6.698)   Prec@5 20.000 (22.213)
Epoch: [0][400/2371]    Loss 3.2839 (3.3080)    Prec@1 10.000 (7.137)   Prec@5 20.000 (22.339)
Epoch: [0][500/2371]    Loss 3.2690 (3.3068)    Prec@1 12.000 (7.246)   Prec@5 28.000 (22.367)
Epoch: [0][600/2371]    Loss 3.3679 (3.3045)    Prec@1 6.000 (7.384)    Prec@5 22.000 (22.326)
Epoch: [0][700/2371]    Loss 3.3639 (3.3040)    Prec@1 6.000 (7.387)    Prec@5 14.000 (22.397)
Epoch: [0][800/2371]    Loss 3.2118 (3.3035)    Prec@1 8.000 (7.366)    Prec@5 36.000 (22.429)
Epoch: [0][900/2371]    Loss 3.3153 (3.3017)    Prec@1 2.000 (7.478)    Prec@5 24.000 (22.562)
Epoch: [0][1000/2371]   Loss 3.3295 (3.3003)    Prec@1 4.000 (7.538)    Prec@5 16.000 (22.691)
Epoch: [0][1100/2371]   Loss 3.2486 (3.2990)    Prec@1 10.000 (7.599)   Prec@5 30.000 (22.874)
Epoch: [0][1200/2371]   Loss 3.3112 (3.2973)    Prec@1 6.000 (7.607)    Prec@5 14.000 (22.981)
Epoch: [0][1300/2371]   Loss 3.2315 (3.2960)    Prec@1 14.000 (7.631)   Prec@5 36.000 (23.148)
Epoch: [0][1400/2371]   Loss 3.3065 (3.2944)    Prec@1 4.000 (7.659)    Prec@5 26.000 (23.269)
Epoch: [0][1500/2371]   Loss 3.2688 (3.2931)    Prec@1 12.000 (7.695)   Prec@5 34.000 (23.387)
Epoch: [0][1600/2371]   Loss 3.1971 (3.2921)    Prec@1 12.000 (7.734)   Prec@5 40.000 (23.492)
Epoch: [0][1700/2371]   Loss 3.2873 (3.2908)    Prec@1 8.000 (7.790)    Prec@5 20.000 (23.588)
Epoch: [0][1800/2371]   Loss 3.1563 (3.2894)    Prec@1 16.000 (7.842)   Prec@5 42.000 (23.719)
Epoch: [0][1900/2371]   Loss 3.2181 (3.2875)    Prec@1 8.000 (7.883)    Prec@5 36.000 (23.916)
Epoch: [0][2000/2371]   Loss 3.2744 (3.2859)    Prec@1 4.000 (7.929)    Prec@5 18.000 (24.034)
Epoch: [0][2100/2371]   Loss 3.3153 (3.2836)    Prec@1 6.000 (7.952)    Prec@5 28.000 (24.207)
Epoch: [0][2200/2371]   Loss 3.1725 (3.2810)    Prec@1 12.000 (8.038)   Prec@5 36.000 (24.462)
Epoch: [0][2300/2371]   Loss 3.2124 (3.2788)    Prec@1 8.000 (8.044)    Prec@5 38.000 (24.708)
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
=> active GPUs: 0
Test: [0/296]   Loss 3.2033 (3.2033)    Prec@1 14.000 (14.000)  Prec@5 32.000 (32.000)

// EDIT: Wait a sec... I just checked TensorBoard, and is it supposed to take more than 1 day?

OValery16 (Owner) commented

First, your hardware configuration seems quite good; that matters when working with any 3D network. For the training configuration, I remember that I used 3 GPUs, so it is normal that the initial batch size doesn't fit on a single GPU. My advice is to make sure the batch size is always greater than 32 (so that the batch statistics help the model generalize well). Another suggestion is to use the apex library for mixed precision training (https://github.com/NVIDIA/apex); the library wasn't mature enough when I did this project.
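For reference, here is a minimal sketch of how apex's amp hooks into a standard PyTorch training loop. The tiny model, optimizer, and fake batch below are placeholder assumptions for illustration, not the repo's actual train.py code; only the two amp calls are the point.

import torch
import torch.nn as nn
from apex import amp  # https://github.com/NVIDIA/apex

# Toy stand-in for a 3D conv model, optimizer and data; 27 outputs correspond
# to the Jester gesture classes, but any shapes would do for this sketch.
model = nn.Sequential(nn.Conv3d(3, 8, 3), nn.AdaptiveAvgPool3d(1),
                      nn.Flatten(), nn.Linear(8, 27)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# opt_level="O1" casts most ops to fp16 automatically while keeping fp32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

clips = torch.randn(8, 3, 16, 112, 112, device="cuda")   # fake batch (N, C, T, H, W)
labels = torch.randint(0, 27, (8,), device="cuda")

optimizer.zero_grad()
loss = criterion(model(clips), labels)
with amp.scale_loss(loss, optimizer) as scaled_loss:      # loss scaling avoids fp16 underflow
    scaled_loss.backward()
optimizer.step()

On recent PyTorch versions, the built-in torch.cuda.amp (autocast plus GradScaler) gives the same benefit without installing apex.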

Another important thing is data loading. If you have an SSD, make sure your data is stored on it (it greatly increases loading speed). Poor loading leads to GPU starvation, which slows down the training process.
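As an illustration (not the repo's actual loader code), the usual PyTorch DataLoader knobs for keeping the GPU fed look like this; the stand-in dataset is just random tensors.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset of random clips; the real Jester video dataset would go here.
fake_clips = torch.randn(100, 3, 16, 112, 112)
fake_labels = torch.randint(0, 27, (100,))
train_dataset = TensorDataset(fake_clips, fake_labels)

train_loader = DataLoader(
    train_dataset,
    batch_size=50,     # the batch size that fit on the single RTX 2080 Ti
    shuffle=True,
    num_workers=9,     # matches "Using 9 processes for data loader." in the log above
    pin_memory=True,   # page-locked host memory -> faster async host-to-GPU copies
)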

The training could take several days (depending on your configuration).

If you like this project, feel free to leave a star (it is my only reward ^^).

moon5756 (Author) commented Mar 6, 2020

Thanks for the prompt response.
Based on the TensorBoard logs, it seems like yours also took about a day and 14 hours for 350 epochs. Thanks for the suggestions. I already left a star.

moon5756 closed this as completed Mar 6, 2020
OValery16 reopened this Mar 6, 2020