pmix plugin will not initialize without libpmix-dev #29

Open
jamesbeedy opened this issue Oct 8, 2024 · 3 comments

@jamesbeedy
Contributor

Bug Description

The slurmctld and slurmd processes cannot load the pmix plugin. The charms provide Slurm built with PMIx support but do not make the PMIx libraries available at runtime, so the plugin's init() fails with "can not load PMIx library".
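
A quick way to confirm the symptom on an affected unit is to ask the dynamic loader directly whether it can open the library. This is only a sketch and not part of the charms: `can_load` is a hypothetical helper, and the library names you pass to it are assumptions. On Ubuntu the unversioned `libpmix.so` name is typically shipped by a `-dev` package, which would be consistent with the title of this issue, but that has not been verified here.

```python
import ctypes


def can_load(name, loader=ctypes.CDLL):
    """Return True if the dynamic loader can open the shared library `name`.

    `loader` defaults to ctypes.CDLL, which raises OSError when the
    library cannot be found or opened. It is a parameter only so the
    check can be exercised without the real library installed.
    """
    try:
        loader(name)
    except OSError:
        return False
    return True
```

Running `can_load("libpmix.so")` on a unit before and after installing `libpmix-dev` should show whether the missing package explains the plugin failure.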

To Reproduce

juju bootstrap localhost
juju add-model slurm-test
tox -e build
juju deploy ./_build/slurmd.charm --constraints "virt-type=virtual-machine cores=4 mem=4G root-disk=20G"
juju deploy ./_build/slurmctld.charm --constraints "virt-type=virtual-machine cores=4 mem=4G root-disk=20G"
juju relate slurmctld slurmd

Environment

lxd provider, virtual-machines

Relevant log output

$ sudo cat /var/log/slurm/slurmctld.log
[2024-10-08T06:10:43.540] error: Configured MailProg is invalid
[2024-10-08T06:10:43.541] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:10:43.542] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:10:43.542] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:10:43.542] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:10:43.545] No memory enforcing mechanism configured.
[2024-10-08T06:10:43.548] error: read_slurm_conf: default partition not set.
[2024-10-08T06:10:43.548] error: Could not open node state file /var/spool/slurmctld/node_state: No such file or directory
[2024-10-08T06:10:43.548] error: NOTE: Trying backup state save file. Information may be lost!
[2024-10-08T06:10:43.548] No node state file (/var/spool/slurmctld/node_state.old) to recover
[2024-10-08T06:10:43.549] error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
[2024-10-08T06:10:43.549] error: NOTE: Trying backup state save file. Jobs may be lost!
[2024-10-08T06:10:43.549] No job state file (/var/spool/slurmctld/job_state.old) to recover
[2024-10-08T06:10:43.549] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:10:43.549] error: Could not open reservation state file /var/spool/slurmctld/resv_state: No such file or directory
[2024-10-08T06:10:43.549] error: NOTE: Trying backup state save file. Reservations may be lost
[2024-10-08T06:10:43.549] No reservation state file (/var/spool/slurmctld/resv_state.old) to recover
[2024-10-08T06:10:43.549] error: Could not open trigger state file /var/spool/slurmctld/trigger_state: No such file or directory
[2024-10-08T06:10:43.549] error: NOTE: Trying backup state save file. Triggers may be lost!
[2024-10-08T06:10:43.549] No trigger state file (/var/spool/slurmctld/trigger_state.old) to recover
[2024-10-08T06:10:43.549] read_slurm_conf: backup_controller not specified
[2024-10-08T06:10:43.549] Reinitializing job accounting state
[2024-10-08T06:10:43.549] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:10:43.549] Running as primary controller
[2024-10-08T06:10:43.549] No parameter for mcs plugin, default values set
[2024-10-08T06:10:43.549] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:10:44.543] Processing Reconfiguration Request
[2024-10-08T06:10:44.544] No memory enforcing mechanism configured.
[2024-10-08T06:10:44.544] error: read_slurm_conf: default partition not set.
[2024-10-08T06:10:44.544] restoring original state of nodes
[2024-10-08T06:10:44.545] error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
[2024-10-08T06:10:44.545] error: NOTE: Trying backup state save file. Jobs may be lost!
[2024-10-08T06:10:44.545] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:10:44.545] read_slurm_conf: backup_controller not specified
[2024-10-08T06:10:44.545] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:10:44.545] No parameter for mcs plugin, default values set
[2024-10-08T06:10:44.545] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:10:44.547] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:10:44.547] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:10:44.547] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:10:44.547] reconfigure_slurm: completed usec=3269
[2024-10-08T06:10:44.547] error: Could not open job state file /var/spool/slurmctld/job_state: No such file or directory
[2024-10-08T06:10:44.547] error: NOTE: Trying backup state save file. Jobs may be lost!
[2024-10-08T06:10:44.547] No job state file (/var/spool/slurmctld/job_state.old) found
[2024-10-08T06:10:46.554] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2024-10-08T06:11:15.401] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:15.493] Saving all slurm state
[2024-10-08T06:11:15.551] error: Configured MailProg is invalid
[2024-10-08T06:11:15.553] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:15.554] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:15.554] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:15.554] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:15.557] No memory enforcing mechanism configured.
[2024-10-08T06:11:15.559] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:15.559] Recovered state of 0 nodes
[2024-10-08T06:11:15.560] Recovered information about 0 jobs
[2024-10-08T06:11:15.560] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:15.560] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:15.560] Recovered state of 0 reservations
[2024-10-08T06:11:15.560] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:15.560] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:15.560] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:15.560] Running as primary controller
[2024-10-08T06:11:15.560] No parameter for mcs plugin, default values set
[2024-10-08T06:11:15.560] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:16.556] Processing Reconfiguration Request
[2024-10-08T06:11:16.557] No memory enforcing mechanism configured.
[2024-10-08T06:11:16.557] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:16.557] restoring original state of nodes
[2024-10-08T06:11:16.558] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:16.558] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:16.558] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:16.558] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:16.558] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:16.558] No parameter for mcs plugin, default values set
[2024-10-08T06:11:16.558] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:16.558] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:16.559] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:16.559] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:16.559] reconfigure_slurm: completed usec=2800
[2024-10-08T06:11:16.611] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:16.662] Saving all slurm state
[2024-10-08T06:11:16.712] error: Configured MailProg is invalid
[2024-10-08T06:11:16.713] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:16.715] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:16.715] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:16.715] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:16.717] No memory enforcing mechanism configured.
[2024-10-08T06:11:16.720] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:16.720] Recovered state of 1 nodes
[2024-10-08T06:11:16.720] Down nodes: juju-1cc933-0
[2024-10-08T06:11:16.720] Recovered information about 0 jobs
[2024-10-08T06:11:16.720] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:16.721] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:16.721] Recovered state of 0 reservations
[2024-10-08T06:11:16.721] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:16.721] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:16.721] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:16.721] Running as primary controller
[2024-10-08T06:11:16.721] No parameter for mcs plugin, default values set
[2024-10-08T06:11:16.721] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:17.716] Processing Reconfiguration Request
[2024-10-08T06:11:17.717] No memory enforcing mechanism configured.
[2024-10-08T06:11:17.717] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:17.717] restoring original state of nodes
[2024-10-08T06:11:17.718] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:17.718] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:17.718] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:17.718] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:17.718] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:17.718] No parameter for mcs plugin, default values set
[2024-10-08T06:11:17.718] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:17.719] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:17.719] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:17.719] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:17.720] reconfigure_slurm: completed usec=3315
[2024-10-08T06:11:18.184] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:18.222] Saving all slurm state
[2024-10-08T06:11:18.275] error: Configured MailProg is invalid
[2024-10-08T06:11:18.276] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:18.277] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:18.277] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:18.277] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:18.280] No memory enforcing mechanism configured.
[2024-10-08T06:11:18.282] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:18.282] Recovered state of 1 nodes
[2024-10-08T06:11:18.282] Down nodes: juju-1cc933-0
[2024-10-08T06:11:18.283] Recovered information about 0 jobs
[2024-10-08T06:11:18.283] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:18.283] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:18.283] Recovered state of 0 reservations
[2024-10-08T06:11:18.283] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:18.283] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:18.283] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:18.283] Running as primary controller
[2024-10-08T06:11:18.283] No parameter for mcs plugin, default values set
[2024-10-08T06:11:18.283] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:19.279] Processing Reconfiguration Request
[2024-10-08T06:11:19.279] No memory enforcing mechanism configured.
[2024-10-08T06:11:19.280] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:19.280] restoring original state of nodes
[2024-10-08T06:11:19.280] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:19.280] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:19.280] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:19.280] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:19.280] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:19.280] No parameter for mcs plugin, default values set
[2024-10-08T06:11:19.280] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:19.282] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:19.282] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:19.282] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:19.282] reconfigure_slurm: completed usec=3260
[2024-10-08T06:11:19.760] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:19.786] Saving all slurm state
[2024-10-08T06:11:19.839] error: Configured MailProg is invalid
[2024-10-08T06:11:19.840] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:19.841] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:19.841] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:19.841] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:19.844] No memory enforcing mechanism configured.
[2024-10-08T06:11:19.854] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:19.854] Recovered state of 1 nodes
[2024-10-08T06:11:19.854] Down nodes: juju-1cc933-0
[2024-10-08T06:11:19.855] Recovered information about 0 jobs
[2024-10-08T06:11:19.855] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:19.855] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:19.855] Recovered state of 0 reservations
[2024-10-08T06:11:19.856] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:19.856] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:19.856] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:19.856] Running as primary controller
[2024-10-08T06:11:19.856] No parameter for mcs plugin, default values set
[2024-10-08T06:11:19.856] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:20.843] Processing Reconfiguration Request
[2024-10-08T06:11:20.844] No memory enforcing mechanism configured.
[2024-10-08T06:11:20.844] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:20.844] restoring original state of nodes
[2024-10-08T06:11:20.845] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:20.845] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:20.846] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:20.846] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:20.846] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:20.846] No parameter for mcs plugin, default values set
[2024-10-08T06:11:20.846] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:20.847] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:20.847] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:20.847] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:20.848] reconfigure_slurm: completed usec=4984
[2024-10-08T06:11:20.893] Terminate signal (SIGINT or SIGTERM) received
[2024-10-08T06:11:20.957] Saving all slurm state
[2024-10-08T06:11:21.016] error: Configured MailProg is invalid
[2024-10-08T06:11:21.017] slurmctld version 23.02.7 started on cluster osd-cluster
[2024-10-08T06:11:21.019] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:21.019] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:21.019] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:21.021] No memory enforcing mechanism configured.
[2024-10-08T06:11:21.024] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:21.024] Recovered state of 1 nodes
[2024-10-08T06:11:21.024] Down nodes: juju-1cc933-0
[2024-10-08T06:11:21.024] Recovered information about 0 jobs
[2024-10-08T06:11:21.024] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:21.024] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:21.024] Recovered state of 0 reservations
[2024-10-08T06:11:21.024] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:21.024] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:21.024] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:21.025] Running as primary controller
[2024-10-08T06:11:21.025] No parameter for mcs plugin, default values set
[2024-10-08T06:11:21.025] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:22.019] Processing Reconfiguration Request
[2024-10-08T06:11:22.020] No memory enforcing mechanism configured.
[2024-10-08T06:11:22.020] error: read_slurm_conf: default partition not set.
[2024-10-08T06:11:22.020] restoring original state of nodes
[2024-10-08T06:11:22.020] select/cons_tres: select_p_node_init: select/cons_tres SelectTypeParameters not specified, using default value: CR_Core_Memory
[2024-10-08T06:11:22.020] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:22.020] read_slurm_conf: backup_controller not specified
[2024-10-08T06:11:22.020] select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
[2024-10-08T06:11:22.020] select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
[2024-10-08T06:11:22.020] No parameter for mcs plugin, default values set
[2024-10-08T06:11:22.020] mcs: MCSParameters = (null). ondemand set.
[2024-10-08T06:11:22.021] error:  mpi/pmix_v4: init: (null) [0]: mpi_pmix.c:197: pmi/pmix: can not load PMIx library
[2024-10-08T06:11:22.021] error: Couldn't load specified plugin name for mpi/pmix_v4: Plugin init() callback failed
[2024-10-08T06:11:22.021] error: MPI: Cannot create context for mpi/pmix_v4
[2024-10-08T06:11:22.022] reconfigure_slurm: completed usec=2659
[2024-10-08T06:11:24.029] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
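
Most of the log above is restart noise, and the pmix failure repeats on every start and reconfigure. A small filter can make that easier to see. The following is a minimal sketch, assuming the log format shown above; `pmix_errors` is a hypothetical helper name and is not part of Slurm or the charms.

```python
import re
from collections import Counter

# Matches the leading timestamp, e.g. "[2024-10-08T06:10:43.542] ".
_TIMESTAMP = re.compile(r"^\[[^\]]+\]\s*")


def pmix_errors(lines):
    """Count distinct error messages that mention pmix, ignoring timestamps."""
    counts = Counter()
    for line in lines:
        message = _TIMESTAMP.sub("", line.strip())
        if message.startswith("error:") and "pmix" in message.lower():
            counts[message] += 1
    return counts
```

Passing the lines of `/var/log/slurm/slurmctld.log` to `pmix_errors` returns each pmix-related error message with the number of times it occurred.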

Additional context

No response

@NucciTheBoss
Member

Hmm... I thought that we were installing libpmix-dev as part of the common packages, no? 232d4c0. I remember seeing this commit go through while I was at OpenInfra Asia.

Either way, we can move the logic that ensures the common packages are installed on slurmctld and slurmd nodes upstream into slurm_ops. It would look something like the following:

if self._service_name in ["slurmctld", "slurmd"]:
    apt.add_package(["libpmix-dev", "openmpi-bin"])

@jamesbeedy
Contributor Author

jamesbeedy commented Oct 8, 2024

yeah ...

slurmd: [openmpi-bin, libpmix-dev]

slurmctld: [mailutils, libpmix-dev]
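
The per-service lists above could be expressed as a simple lookup. This is a sketch under stated assumptions, not the slurm_ops implementation: `RUNTIME_PACKAGES`, `packages_for`, and `install_runtime_packages` are hypothetical names, and the `install` callable stands in for whatever package installer the charm actually uses.

```python
# Package lists as quoted in this thread. The dict and function names are
# hypothetical and are not part of slurm_ops.
RUNTIME_PACKAGES = {
    "slurmd": ["openmpi-bin", "libpmix-dev"],
    "slurmctld": ["mailutils", "libpmix-dev"],
}


def packages_for(service_name):
    """Return the extra packages for a service, or an empty list if none."""
    return list(RUNTIME_PACKAGES.get(service_name, []))


def install_runtime_packages(service_name, install):
    """Install the extra packages for `service_name` using `install`.

    `install` is any callable that accepts a list of package names, for
    example a thin wrapper around the charm's apt library. Nothing is
    called when the service has no extra packages.
    """
    packages = packages_for(service_name)
    if packages:
        install(packages)
    return packages
```

Keeping the lists in one mapping means a service such as slurmctld only pulls in what it needs (here `mailutils` rather than `openmpi-bin`), while both services still get `libpmix-dev`.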

@NucciTheBoss
Member

Btw @jamesbeedy which branch are you working off of here? Is this main or experimental? Either way I'll ensure that slurm_ops installs the correct packages.
