mobilenet V2 train fail #8

Open
fanweiya opened this issue Jan 14, 2021 · 5 comments

Comments

@fanweiya

I am using the MobileNetV2 backbone, but training fails with the following output:

[-] Importing tensorflow...
2021-01-14 13:49:10.317068: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[+] Done! Tensorflow version: 2.5.0-dev20201230
[-] Importing Deeplabv3plus Trainer class...
[-] Importing config files...
2021-01-14 13:49:11.537581: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-14 13:49:11.591072: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-01-14 13:49:11.591101: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (alit-PowerEdge-T640): /proc/driver/nvidia/version does not exist
2021-01-14 13:49:11.591383: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0,/job:localhost/replica:0/task:0/device:GPU:1
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:0,/job:localhost/replica:0/task:0/device:GPU:1
Train Images are good to go
[+] Data points in train dataset: 6400
Train Dataset: <PrefetchDataset shapes: ((16, 512, 512, 3), (16, 512, 512, 1)), types: (tf.float32, tf.float32)>
Train Images are good to go
Data points in train dataset: 1600
Val Dataset: <PrefetchDataset shapes: ((16, 512, 512, 3), (16, 512, 512, 1)), types: (tf.float32, tf.float32)>
2021-01-14 13:49:12.045387: I tensorflow/core/profiler/lib/profiler_session.cc:126] Profiler session initializing.
2021-01-14 13:49:12.045414: I tensorflow/core/profiler/lib/profiler_session.cc:141] Profiler session started.
2021-01-14 13:49:12.100790: I tensorflow/core/profiler/lib/profiler_session.cc:158] Profiler session tear down.
2021-01-14 13:49:12.268507: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Found an unshardable source dataset: name: "TensorSliceDataset/_2"
op: "TensorSliceDataset"
input: "Placeholder/_0"
input: "Placeholder/_1"
attr {
  key: "Toutput_types"
  value {
    list {
      type: DT_STRING
      type: DT_STRING
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
      }
      shape {
      }
    }
  }
}

2021-01-14 13:49:12.362496: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:127] None of the MLIR optimization passes are enabled (registered 2)
2021-01-14 13:49:12.367114: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2300000000 Hz
Epoch 1/100
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 224]. Weights for input shape (224, 224) will be loaded as the default.
WARNING:tensorflow:`input_shape` is undefined or non-square, or `rows` is not in [96, 128, 160, 192, 224]. Weights for input shape (224, 224) will be loaded as the default.
Traceback (most recent call last):
  File "trainer.py", line 47, in <module>
    HISTORY = TRAINER.train()
  File "/data/deeplab/DeepLabV3-Plus/deeplabv3plus/train.py", line 191, in train
    epochs=self.config['epochs'], callbacks=callbacks
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/wandb/integration/keras/keras.py", line 119, in new_v2
    return old_v2(*args, **kwargs)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1135, in fit
    tmp_logs = self.train_function(iterator)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 797, in __call__
    result = self._call(*args, **kwds)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 841, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 695, in _initialize
    *args, **kwds))
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2998, in _get_concrete_function_internal_garbage_collected
    graph_function, _ = self._maybe_define_function(args, kwargs)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3390, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3235, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 998, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 603, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 985, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py:840 train_function  *
        return step_function(self, iterator)
    /data/deeplab/DeepLabV3-Plus/deeplabv3plus/model/deeplabv3_plus.py:104 call  *
        tensor = tf.keras.layers.Concatenate(axis=-1)([input_a, input_b])
    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:1015 __call__  **
        self._maybe_build(inputs)
    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py:2709 _maybe_build
        self.build(input_shapes)  # pylint:disable=not-callable
    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/utils/tf_utils.py:273 wrapper
        output_shape = fn(instance, input_shape)
    /data/Anaconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/keras/layers/merge.py:519 build
        raise ValueError(err_msg)

    ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(8, 128, 128, 256), (8, 64, 64, 48)]


@jeremy-cv

Hi, thanks for this implementation.

I'm having the same issue with the MobileNetV2 backbone model.

@fanweiya, did you solve this?

Thanks

@jeremy-cv

It seems that training runs when the factor is set to 8 in ._get_upsample_layer_fn(input_shape, factor=8).
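
A rough way to see why factor 8 would help is to compare the shapes from the traceback. This is only a sketch under an assumption: that `factor` is the ratio between the network input size (512 here) and the spatial size of the upsampled tensor. The helper names `target_size` and `concat_compatible` are hypothetical and not part of the repository. The low-level MobileNetV2 feature map in the error is 64x64, which is 512 / 8, while the failing branch is 128x128, which is 512 / 4.

```python
# Sketch only: assumes `factor` = input size / upsampled output size.
# `target_size` and `concat_compatible` are hypothetical helpers for illustration.

def target_size(input_size: int, factor: int) -> int:
    """Spatial size of the upsampled tensor under the assumption above."""
    return input_size // factor


def concat_compatible(shape_a, shape_b, axis=-1) -> bool:
    """Concatenate rule: every dimension except the concat axis must match."""
    if len(shape_a) != len(shape_b):
        return False
    axis = axis % len(shape_a)
    return all(
        a == b
        for i, (a, b) in enumerate(zip(shape_a, shape_b))
        if i != axis
    )


INPUT_SIZE = 512                # from the dataset shapes in the log
LOW_LEVEL = (8, 64, 64, 48)     # second input shape reported in the ValueError

for factor in (4, 8):
    size = target_size(INPUT_SIZE, factor)
    upsampled = (8, size, size, 256)
    print(factor, upsampled, concat_compatible(upsampled, LOW_LEVEL))
# prints:
# 4 (8, 128, 128, 256) False
# 8 (8, 64, 64, 256) True
```

If that assumption holds, factor 4 reproduces the (8, 128, 128, 256) shape from the error, and factor 8 gives 64x64, which matches the low-level features on every axis except the channel axis.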

@diogosilva30

+1 Same error here

@shivarajkarki

Same error:
ValueError: A Concatenate layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(1, 128, 128, 256), (1, 64, 64, 48)]

@KozAAAAA

KozAAAAA commented Mar 8, 2024

Has anyone managed to fix this?
