
Training Killed #28

Open
tsugg opened this issue Feb 21, 2024 · 2 comments

tsugg commented Feb 21, 2024

Hi,

I'm trying to get this running on a remote Linux machine with an A10 GPU. I built in headless mode. The build looked fine and colmap2adop runs fine, but training gets killed. Below is the end of the log showing the 'Killed' message. I tried lowering the batch size, render size, and crop size in case it was a GPU memory issue, but even at low values training is still killed. Nothing is written to errors.txt in the experiment directory either. Any ideas on what could be happening? I also tried Docker and got the same message. Thanks.

CAM model: CameraModel::PINHOLE_DISTORTION
Image Size 8831x6732
Aspect 1.31179
K 2456.38 2470.56 4415.5 3366 0
ocam 8831x6732 affine(1, 0, 0, 0, 0) cam2world() world2cam()
ocam cut 1
normalized center 0 0
dist 0 0 0 0 0 0 0 0
CAM model: CameraModel::PINHOLE_DISTORTION
Points 1931815
Colors 1
Normals 1
Avg. EV 0
Num Images 82
Num Cameras 82
Compute scene importance bounding box as 95% of points interval around center of mass
Starting Compute center of mass...center of mass:1.08572
-0.00931753
0.720447
Done in 20.1654ms.
Starting Build range vec... Done in 5.9594ms.
Starting Sort range vec... Done in 152.128ms.
Starting Extend box... Done in 23.709ms.
Box: AABB: [-4.65199 -3.8338 -5.63757 ] [7.34715 3.68846 6.95357 ]

Modulo stepsize: 8
Train(71): 1 2 3 4 5 6 7 9 10 11 12 13 14 15 17 18 19 20 21 22 23 25 26 27 28 29 30 31 33 34 35 36 37 38 39 41 42 43 44 45 46 47 49 50 51 52 53 54 55 57 58 59 60 61 62 63 65 66 67 68 69 70 71 73 74 75 76 77 78 79 81
Test(11): 0 8 16 24 32 40 48 56 64 72 80
Killed
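
For what it's worth, a bare "Killed" with nothing in errors.txt usually means the Linux kernel's OOM killer terminated the process, rather than a crash inside the trainer. One way to confirm this (a generic check, not specific to this project) is to scan the kernel log right after the kill; note that reading dmesg may require root on some systems:

```python
# Generic OOM-killer check, not part of this project: if the kernel killed
# the trainer, dmesg will contain "Out of memory" / oom-kill entries.
import subprocess

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "oom" in line.lower() or "out of memory" in line.lower():
        print(line)
```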

tsugg closed this as completed Feb 22, 2024
tsugg reopened this Feb 22, 2024

tsugg (Author) commented Feb 22, 2024

I did some more digging and found that while the train and test image indices are printed to the log, RAM usage continually balloons past 60 GB until the Killed message appears.
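
To see where in the pipeline the growth happens, a small watcher can log the trainer's resident memory over time. This is only a sketch: it assumes the third-party psutil package is installed, and the process name adop_train is a placeholder for whatever training binary you actually launch.

```python
import time
import psutil  # third-party: pip install psutil

def watch(name="adop_train", interval=1.0):
    # Find the training process by name; "adop_train" is an assumed
    # placeholder, substitute the binary you run.
    proc = next((p for p in psutil.process_iter(["name"])
                 if p.info["name"] == name), None)
    if proc is None:
        raise SystemExit(f"no running process named {name!r}")
    try:
        while proc.is_running():
            # Resident set size in GB, printed once per interval.
            print(f"RSS: {proc.memory_info().rss / 1e9:.1f} GB")
            time.sleep(interval)
    except psutil.NoSuchProcess:
        pass  # process was killed between checks

watch()
```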

lfranke (Owner) commented Mar 4, 2024

Hi, this looks to me like the resolution and the number of cameras are too high for this implementation. Maybe try using shared intrinsics and lowering the camera resolution? I don't think I ever tried with more than 2.5K resolution.
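
As an illustration of that suggestion, one way to get the 8831x6732 inputs down to roughly 2.5K is to resize the images with Pillow before rerunning COLMAP and colmap2adop on the smaller set. The paths, file extension, and 2560 px target width below are assumptions for the sketch, not values from this thread:

```python
# Downscale sketch, assuming source images in `images/` and JPEG inputs;
# adjust the glob pattern and target width for your dataset.
from pathlib import Path
from PIL import Image  # third-party: pip install Pillow

src, dst = Path("images"), Path("images_2_5k")
dst.mkdir(exist_ok=True)
for f in sorted(src.glob("*.jpg")):
    img = Image.open(f)
    scale = 2560 / img.width
    img.resize((2560, round(img.height * scale)), Image.LANCZOS).save(dst / f.name)
```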
