Skip to content

Commit

Permalink
Merge pull request #3 from mathworks-ref-arch/1.2.0
Browse files Browse the repository at this point in the history
Version 1.2.0
  • Loading branch information
hannahpullen authored Sep 20, 2024
2 parents b3ad3cb + ce628c7 commit 20bf8b2
Show file tree
Hide file tree
Showing 19 changed files with 1,010 additions and 102 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/parallel-server-images.yml
Original file line number Diff line number Diff line change
Expand Up @@ -31,7 +31,7 @@ jobs:
strategy:
fail-fast: false
matrix:
matlab-release: [r2024a]
matlab-release: [r2024a, r2024b]

steps:
- name: Checkout repo
Expand Down
137 changes: 131 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ Before you start, you need the following:
- Uses Kubernetes version 1.21.1 or later.
- Meets the system requirements for running MATLAB Job Scheduler. For details, see the MathWorks documentation for [MATLAB Parallel Server Product Requirements](https://www.mathworks.com/support/requirements/matlab-parallel-server.html).
- Configured to create external load balancers that allow traffic into the cluster.
- Has adequate storage on cluster nodes. When using a MATLAB Parallel Server Docker image for your workers (default behavior), ensure that each cluster node has at least 50GB of storage. If mounting MATLAB Parallel Server from a persistent volume, each cluster node must have at least 20GB of storage.
- Kubectl installed on your computer and configured to access your Kubernetes cluster. For help with installing Kubectl, see [Install Tools](https://kubernetes.io/docs/tasks/tools/) on the Kubernetes website.
- Helm® version 3.8.0 or later installed on your computer. For help with installing Helm, see [Quickstart Guide](https://helm.sh/docs/intro/quickstart/).
- Network access to the MathWorks Container Registry, `containers.mathworks.com`, and the GitHub® Container registry, `ghcr.io`.
Expand Down Expand Up @@ -86,6 +87,7 @@ For details about security levels, see [MATLAB Job Scheduler Security](https://w
When you run MATLAB Job Scheduler with security level 2, you must provide an administrator password.
Create a Kubernetes Secret for your administrator password named `mjs-admin-password` and replace `<password>` with a password of your choice.
If you are using LDAP to authenticate user credentials, this must be the password of the cluster administrator in the LDAP server.
```
kubectl create secret generic mjs-admin-password --from-literal=password=<password> --namespace mjs
```
Expand Down Expand Up @@ -127,6 +129,19 @@ Modify the following values:

For a full list of the configurable Helm values that you can set in this file, see the [Helm Values](helm_values.md) page.

To use an LDAP server to authenticate user credentials, add the following parameters to the `values.yaml` file:
```yaml
adminUser: "admin"
ldapURL: "ldaps://HOST:PORT"
ldapSecurityPrincipalFormat: "[username]@domain.com"
```
Modify the following values:
- `adminUser` &mdash; Specify the username of a valid user in the LDAP server. The secret you created in the [Create Administrator Password Secret](#create-administrator-password-secret) must contain this user's password.
- `ldapURL` &mdash; Specify the URL of the LDAP server as `ldaps://HOST:PORT`. If you have not configured your LDAP server over SSL, specify the URL as `ldap://HOST:PORT`.
- `ldapSecurityPrincipalFormat` &mdash; Specify the format of a security principal (user) for your LDAP server.

**Security Considerations:** Use LDAP over SSL (LDAPS) to encrypt communication between the LDAP server and clients. For additional LDAPS configuration steps, see [Configure LDAP over SSL](#configure-ldap-over-ssl).

### Install Helm Chart

Install the MATLAB Job Scheduler Helm chart with your custom values file:
Expand Down Expand Up @@ -158,10 +173,10 @@ NAME TYPE CLUSTER-IP EXTERNAL-IP PORT
mjs-ingress-proxy-ed5e5db8 LoadBalancer 10.233.12.53 192.168.1.200 27356:31387/TCP,27359:31664/TCP,30000:32212/TCP
```

Configure your firewall so that MATLAB clients can route to the IP address or hostname under the `EXTERNAL-IP` column through the ports this service exposes.
Configure your firewall so that MATLAB clients can connect to the IP address or hostname under the `EXTERNAL-IP` column through the ports this service exposes.
For a description of the ports the load balancer service exposes, see the [Customize Load Balancer](#customize-load-balancer) section.

If you want the MATLAB client to route to this load balancer through a different hostname, for example, an intermediate server or a DNS entry, set the value of the `clusterHost` parameter in your Helm values file before you install MATLAB Job Scheduler on your Kubernetes cluster.
If you want the MATLAB client to connect to this load balancer through a different hostname, for example, an intermediate server or a DNS entry, set the value of the `clusterHost` parameter in your Helm values file before you install MATLAB Job Scheduler on your Kubernetes cluster.

## Download Cluster Profile

Expand All @@ -187,16 +202,97 @@ Import the cluster profile.
2. Click **Import** in the toolbar.
3. Navigate to the location where you saved the profile you created in the previous step and select it.

### Validate Cluster
## Validate Cluster

Cluster validation submits a job of each type to test whether the cluster profile is configured correctly.
In the Cluster Profile Manager, click **Validate**.
If you make a change to the cluster configuration, run cluster validation again to ensure your changes cause no errors.
You do not need to validate the profile each time you use it or each time you start MATLAB.

### Troubleshoot Cluster Validation Failures

The following sections explain how to resolve some common cluster validation failures.

#### Cluster Connection Test Failure

Incorrect cluster profiles or networking issues can cause failures during the "Cluster connection test (parcluster)" cluster validation stage.

If you have uninstalled and reinstalled the MATLAB Job Scheduler Helm chart, make sure you download and import the new cluster profile following the instructions in [Download Cluster Profile](#download-cluster-profile).
Using a cluster profile from a previous deployment in the same Kubernetes cluster results in cluster validation errors.

You must ensure that your MATLAB client can connect to the IP address of the load balancer and that your firewall allows traffic to the MATLAB Job Scheduler ports.
To check the load balancer's IP address, see [Install Helm Chart](#install-helm-chart).
For a description of the MATLAB Job Scheduler ports, see [Customize Load Balancer](#customize-load-balancer).

#### License Checkout Failure

If you have not correctly configured the MATLAB Parallel Server license for your cluster, the "Job test (createJob)" cluster validation stage fails with this message:
```
License checkout failed
```
Make sure you have set either the `useOnlineLicensing` parameter or the `networkLicenseManager` parameter in your `values.yaml` file.
To learn more about the `useOnlineLicensing` and `networkLicenseManager` parameters, see [Create Helm Values File](#create-helm-values-file).

If you continue to experience licensing errors, contact [MathWorks Technical Support](https://www.mathworks.com/support/contact_us.html).

#### Job Test Unresponsive

If the "Job test (createJob)", "SPMD job test (createCommunicatingJob)" or "Pool job test (createCommunicatingJob)" stage takes a very long time to run (> 5 minutes), your Kubernetes cluster might not have sufficient resources to start the worker pods.

Check the status of the worker pods while cluster validation is in progress by running
```
kubectl get pods --label app=mjs-worker --namespace mjs
```
If a worker pod has the `Pending`, `ContainerCreating` or `ContainerStatusUnknown` status, check the pod's details by running
```
kubectl describe pods --namespace mjs <pod-name>
```
Replace `<pod-name>` with the name of the worker pod.
If your Kubernetes cluster does not have enough CPU resources to run the pod, the output might include messages like:
```
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 7s default-scheduler 0/2 nodes are available: 2 Insufficient cpu.
```
If your Kubernetes cluster does not have enough memory resources to run the pod, the output might include messages like:
```
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 43s default-scheduler 0/2 nodes are available: 2 Insufficient memory.
```
If you see either output, your Kubernetes cluster does not have enough resources to run the number of workers you specified in the `maxWorkers` parameter in your `values.yaml` file.
For details on the resource requirements for MATLAB Parallel Server workers, see the MathWorks documentation for [MATLAB Parallel Server Product Requirements](https://www.mathworks.com/support/requirements/matlab-parallel-server.html).
By default, each worker pod requests 2 vCPU and 8GB of memory. If your cluster does not have enough resources, either
- Add more nodes to your Kubernetes cluster or replace your existing nodes with nodes that have more CPU and memory resources.
- Modify your `values.yaml` file to decrease the value of the `maxWorkers` parameter.
- Modify your `values.yaml` file to decrease the values of the `workerMemoryRequest` and `workerMemoryLimit` parameters. A minimum of 4GB per MATLAB worker is recommended. If you are using Simulink, a minimum of 8GB per worker is recommended.
If you modified your `values.yaml` file, uninstall the MATLAB Job Scheduler Helm chart following the instructions in [Uninstall MATLAB Job Scheduler](#uninstall-matlab-job-scheduler), then reinstall the Helm chart following the instructions in [Install Helm chart](#install-helm-chart).
If your Kubernetes cluster nodes do not have enough ephemeral storage to pull the MATLAB Parallel Server Docker image, the output of `kubectl describe pods` might include messages like:
```
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m49s default-scheduler Successfully assigned default/mjs-worker-1-fd697549aeca4c2ab0f1bcb4fe819b0f-5d78457d5c-lcpv5 to my-node
Normal Pulling 4m49s kubelet Pulling image "ghcr.io/mathworks-ref-arch/matlab-parallel-server-k8s/mjs-worker-image:r2024a"
Warning Evicted 88s kubelet The node was low on resource: ephemeral-storage. Threshold quantity: 3219965180, available: 1313816Ki.
```
For details on the node storage requirements, see [Requirements](#requirements).
If your nodes do not have enough ephemeral storage, either
- Replace your Kubernetes nodes with nodes that have more storage.
- Instead of pulling the MATLAB Parallel Server Docker image, mount MATLAB Parallel Server from a PersistentVolume. To learn more, see [Mount MATLAB from a PersistentVolume](#mount-matlab-from-a-persistentvolume).
## Uninstall MATLAB Job Scheduler
To uninstall MATLAB Job Scheduler from your Kubernetes cluster, run this command:
To uninstall the MATLAB Job Scheduler Helm chart from your Kubernetes cluster, run this command:
```
helm uninstall mjs --namespace mjs
```
Expand All @@ -211,12 +307,12 @@ If you created a custom load balancer service, delete the service:
kubectl delete service mjs-ingress-proxy --namespace mjs
```
If you want to reinstall MATLAB Job Scheduler, you must ensure that the load balancer service is deleted first.
If you want to reinstall the MATLAB Job Scheduler Helm chart, you must ensure that the load balancer service is deleted first.
To check the status of the load balancer service, run:
```
kubectl get service mjs-ingress-proxy --namespace mjs
```
If the load balancer service appears, wait for some time, then run the command again to confirm that the load balancer service is not found before proceeding with the MATLAB Job Scheduler reinstallation.
If the load balancer service appears, wait for some time, then run the command again to confirm that the load balancer service is not found before proceeding with the MATLAB Job Scheduler Helm chart reinstallation.
## Examples
Expand Down Expand Up @@ -324,6 +420,35 @@ For details on creating the PersistentVolumeClaim, see the [Create Persistent Vo
Modify your `values.yaml` file to set the `matlabPVC` parameter to the name of your PersistentVolumeClaim before installating the Helm chart.
The worker pods will now use the image URI specified in the `matlabDepsImage` parameter instead of the `workerImage` parameter.

### Run Multiple MATLAB Parallel Server Versions

You can use multiple versions of MATLAB Parallel Server in a single MATLAB Job Scheduler cluster.
When you upgrade to a newer release of MATLAB Parallel Server on your cluster, users can continue to submit jobs from both newer and older releases of the MATLAB client.
The additional MATLAB Parallel Server versions you use must be version R2024a or newer and must be older than the version of MATLAB Job Scheduler you are using.

Create a PersistentVolume and PersistentVolumeClaim for each additional MATLAB Parallel Server installation you want to use.
The root directory of each PersistentVolume must be the MATLAB root folder.
Modify your `values.yaml` file to set the `additionalMatlabPVCs` parameter to the names of the PersistentVolumeClaims.

For example, to use an additional PersistentVolumeClaim `matlab-r2024a-pvc`, add the following line to your `values.yaml` file:
```
additionalMatlabPVCs:
- matlab-r2024a-pvc
```

### Configure LDAP over SSL

When you use an LDAP server configured over SSL, you must add the LDAPS SSL certificate to your Kubernetes cluster.
To obtain the SSL server certificate, follow the instructions in [Connect to LDAP Server to Get Server SSL Certificate](https://www.mathworks.com/help/matlab-parallel-server/configure-ldap-server-authentication-for-matlab-job-scheduler.html#mw_fe8d0f90-2854-42b9-9e04-a2f25a295e61) on the MathWorks website.

If you use a prebuilt job manager image (default behavior), create a Kubernetes secret containing the server SSL certificate.
Replace `<path>` with the path to your server SSL certificate.
```
kubectl create secret generic mjs-ldap-secret --namespace mjs --from-file=cert.pem=<path>
```

If you use a persistent volume for the job manager pod (`matlabPVC` is set to a non-empty string and `jobManagerUsesPVC` is set to `true` in your `values.yaml` file), you must add your certificate to the Java trust store of the MATLAB Parallel Server installation in your persistent volume. For detailed instructions, see [Add Certificate to Java Trust Store](https://mathworks.com/help/matlab-parallel-server/configure-ldap-server-authentication-for-matlab-job-scheduler.html#mw_fe8d0f90-2854-42b9-9e04-a2f25a295e61) on the MathWorks website.

### Customize Load Balancer

MATLAB Job Scheduler in Kubernetes uses a Kubernetes load balancer service to expose MATLAB Job Scheduler to MATLAB clients running outside of the Kubernetes cluster.
Expand Down
4 changes: 2 additions & 2 deletions chart/mjs/Chart.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,5 +3,5 @@ apiVersion: v2
name: mjs
description: A Helm chart for MATLAB (R) Job Scheduler in Kubernetes
type: application
version: 1.1.0
appVersion: 1.1.0
version: 1.2.0
appVersion: 1.2.0
5 changes: 5 additions & 0 deletions chart/mjs/templates/_derived.tpl
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,8 @@
{{- define "derived.enableServiceLinks" -}}
{{ .Values.enableServiceLinks | default false }}
{{- end -}}

# If we are using a secure LDAP server and not using a persistent volume claim for the job manager pod, we need to add the LDAP certificate to the job manager's secret store
{{- define "derived.addLDAPCert" -}}
{{ and (hasPrefix "ldaps://" .Values.ldapURL) (or (empty .Values.matlabPVC) (not .Values.jobManagerUsesPVC)) }}
{{- end -}}
7 changes: 6 additions & 1 deletion chart/mjs/templates/_paths.tpl
Original file line number Diff line number Diff line change
Expand Up @@ -75,4 +75,9 @@ config.json
# Ingress proxy config file
{{- define "paths.ingressProxyConfig" -}}
haproxy.cfg
{{- end -}}
{{- end -}}

# Path to mounted LDAP certificate
{{- define "paths.ldapCert" -}}
/mjs/ldap/cert.pem
{{- end -}}
10 changes: 10 additions & 0 deletions chart/mjs/templates/controller-configmap.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,13 @@ metadata:
data:
{{ include "paths.controllerConfig" . }}: |
{
"AdditionalMatlabPVCs": [
{{- $comma := "" }}
{{- range $matlabPVC := .Values.additionalMatlabPVCs }}
{{ $comma }}{{ $matlabPVC | quote }}
{{- $comma = "," }}
{{- end }}
],
"BasePort": {{ .Values.basePort | int }},
"CertFileName": {{ include "paths.certFile" . | quote }},
"CheckpointBase": {{ include "paths.checkpointbase" . | quote }},
Expand Down Expand Up @@ -45,6 +52,9 @@ data:
"JobManagerGroupID": {{ .Values.jobManagerGroupID }},
"JobManagerUserID": {{ .Values.jobManagerUserID }},
"JobManagerUID": {{ uuidv4 | quote }},
{{ if eq (include "derived.addLDAPCert" .) "true" -}}
"LDAPCertPath": {{ include "paths.ldapCert" . | quote }},
{{- end }}
"LivenessProbeFailureThreshold": {{ .Values.livenessProbeFailureThreshold | default 3 }},
"LivenessProbePeriod": {{ .Values.livenessProbePeriod | default 300 }},
"LivenessProbeTimeout": {{ .Values.livenessProbeTimeout | default 30 }},
Expand Down
17 changes: 16 additions & 1 deletion chart/mjs/templates/mjs-configmap.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
{{- $binDir := printf "%s/toolbox/parallel/bin/" (include "paths.matlabroot" .) }}
{{- $basePort := .Values.basePort | int }}
{{- $workerStartedFile := "/tmp/worker-started" }}
{{- $trustStorePath := printf "%s/sys/java/jre/glnxa64/jre/lib/security/cacerts" (include "paths.matlabroot" .) }}
apiVersion: v1
kind: ConfigMap
metadata:
Expand All @@ -21,6 +22,13 @@ data:
SECURITY_LEVEL={{ .Values.securityLevel }}
SHARED_SECRET_FILE={{ include "paths.secretPath" . }}
USE_ONLINE_LICENSING={{ .Values.useOnlineLicensing }}
ADMIN_USER={{ .Values.adminUser }}
{{- if .Values.ldapURL }}
USE_LDAP_SERVER_AUTHENTICATION=true
LDAP_URL={{ .Values.ldapURL | quote }}
LDAP_SECURITY_PRINCIPAL_FORMAT={{ .Values.ldapSecurityPrincipalFormat | quote }}
LDAP_SYNCHRONIZATION_INTERVAL_SECS={{ .Values.ldapSynchronizationIntervalSecs | quote }}
{{- end }}
jobManager.sh: |
# Script to run on the MATLAB Job Scheduler job manager pod
Expand All @@ -31,6 +39,12 @@ data:
echo "HOSTNAME={{ include "resources.jobManager" . }}" >> {{ $localMJSDef }}
echo "LOGBASE={{ include "paths.logbase" . }}" >> {{ $localMJSDef }}
{{ if eq (include "derived.addLDAPCert" .) "true" -}}
# Add LDAP certificate to the Java trust store
chmod +w {{ $trustStorePath }}
yes | {{ include "paths.matlabroot" . }}/sys/java/jre/glnxa64/jre/bin/keytool -import -keystore {{ $trustStorePath }} -file {{ include "paths.ldapCert" . }} -storepass changeit > /dev/null
{{ end -}}
# Start the MJS service
{{ printf "%s/mjs" $binDir }} start -mjsdef {{ $localMJSDef }} -loglevel {{ .Values.logLevel }} -disablereliableconnections

Expand All @@ -56,6 +70,7 @@ data:
LOGBASE={{ include "paths.logbase" . }}/${WORKER_NAME}
echo "LOGBASE=${LOGBASE}" >> {{ $localMJSDef }}
echo "HOSTNAME=${HOSTNAME}" >> {{ $localMJSDef }}
echo "MJS_ADDITIONAL_MATLABROOTS=${MJS_ADDITIONAL_MATLABROOTS}" >> {{ $localMJSDef }}
# Ensure log directory exists and is writeable by workers
if [ ! -d "${LOGBASE}" ]; then
Expand All @@ -64,7 +79,7 @@ data:
chmod o+w "${LOGBASE}"
# Create a user to run MATLAB as
useradd --create-home {{ .Values.workerUsername }} || echo "User {{ .Values.workerUsersname }} already exists"
useradd --create-home {{ .Values.workerUsername }} || echo "User {{ .Values.workerUsername }} already exists"
echo {{ printf "%s:%s" .Values.workerUsername .Values.workerPassword }} | chpasswd
# Start the MJS service
Expand Down
Loading

0 comments on commit 20bf8b2

Please sign in to comment.