Hi,
I'm currently testing out MILo and its parameters. I'm using the RTX 50 Dockerfile from #20 and the RTX 50 repo. I started the following training run on the Truck demo data on different GPUs and always get a cudaMalloc exception.
Training command
python /workspace/MILo/milo/train.py -s /data/input/Truck -m /data/output/Truck --imp_metric outdoor --rasterizer radegs --eval --mesh_config default --decoupled_appearance --log_interval 200 --save_iterations 2000 4000 6000 8000 10000 12000 14000 16000 18000 --checkpoint_iterations 2000 4000 6000 8000 10000 12000 14000 16000 18000 --data_device cpu --config_path /workspace/MILo/milo/configs/fast
Tested GPUs
- Nvidia RTX 5060 Ti, 16 GB VRAM
- Nvidia RTX 5070 Ti, 16 GB VRAM
- Nvidia RTX 5090, 32 GB VRAM
- Nvidia H100, 80 GB VRAM
Exception
The crash does not happen at the same iteration on every GPU, so I don't think it's a problem with the demo data.
Training progress: 71%|██████████████████████████████████████████████████████████████████████████████████████████▉ | 12790/18000 [31:50<15:18, 5.67it/s, Loss=0.0630130, DNLoss=0.0062229, MDLoss=0.0019296, MNLoss=0.0061558, OccLoss=0.0000477, OccLabLoss=0.0011094, N_Gauss=319469]
[INFO] Resetting occupancy labels at iteration 12800. [03/02 11:15:44]
Computing occupancy from mesh: 0%| | 0/219 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/workspace/MILo/milo/train.py", line 652, in <module>
training(
File "/workspace/MILo/milo/train.py", line 288, in training
mesh_regularization_pkg = compute_mesh_regularization(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/regularization/regularizer/mesh.py", line 535, in compute_mesh_regularization
voronoi_occupancy_labels, _ = evaluate_mesh_occupancy(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/regularization/sdf/depth_fusion.py", line 541, in evaluate_mesh_occupancy
render_pkg = mesh_renderer(
^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/scene/mesh.py", line 396, in forward
fragments, rast_out, pos = self.rasterizer(mesh, cameras, cam_idx, return_rast_out=True, return_positions=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/scene/mesh.py", line 341, in forward
nvdiff_rast_out = nvdiff_rasterization(
^^^^^^^^^^^^^^^^^^^^^
File "/workspace/MILo/milo/scene/mesh.py", line 127, in nvdiff_rasterization
rast_chunk, _ = dr.rasterize(
^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/nvdiffrast/torch/ops.py", line 135, in rasterize
return _rasterize_func.apply(glctx, pos, tri, resolution, ranges, grad_db, -1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/venv/lib/python3.12/site-packages/nvdiffrast/torch/ops.py", line 78, in forward
out, out_db = _nvdiffrast_c.rasterize_fwd_cuda(raster_ctx.cpp_wrapper, pos, tri, resolution, ranges, peeling_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Cuda error: 2[cudaMalloc(&m_gpuPtr, bytes);]
Training progress: 71%|██████████████████████████████████████████████████████████████████████████████████████████▉ | 12790/18000 [31:52<12:59, 6.69it/s, Loss=0.0630130, DNLoss=0.0062229, MDLoss=0.0019296, MNLoss=0.0061558, OccLoss=0.0000477, OccLabLoss=0.0011094, N_Gauss=319469]
Any ideas what's going wrong? Did I set some misleading parameters?
If I use mesh_config=verylowres and set sampling_factor=0.1 or 0.2, it works.