- Important Concepts
- Dynamic Provisioning Workflow
- Static Provisioning Workflow
- Best Practices
- Managing ReadOnly Volumes
- Notes for BeeGFS Administrators
- Limitations and Known Issues
Within the context of this driver, a "volume" is simply a directory within a BeeGFS filesystem. When a volume is mounted by a Kubernetes Pod, only files within this directory and its children are accessible by the Pod. An entire BeeGFS filesystem can be a volume (e.g. by specifying / as the /path/to/dir in the static provisioning workflow) or a single subdirectory many levels deep can be a volume (e.g. by specifying /a/very/deep/directory as the volDirBasePath in the dynamic provisioning workflow).
In this version, the driver ignores the capacity requested for a Kubernetes Persistent Volume. Consider the definition of a "volume" above. While an entire BeeGFS filesystem may have a usable capacity of 100GiB, there is very little meaning associated with the "usable capacity" of a directory within a BeeGFS (or any POSIX) filesystem. The driver does provide integration with BeeGFS permissions and quotas which provides ways to limit the capacity consumed by containers. For more details refer to the documentation on Quotas.
Starting with v1.7.0, the driver also supports volume expansion, which is useful when the persistent volume claim request size has meaning for the application. As with the initial capacity request, the size of the PVC and PV are simply updated in the Kubernetes API to reflect the newly requested capacity; there are no checks that sufficient space is actually available to satisfy the request.
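For example, assuming the backing Storage Class sets allowVolumeExpansion: true (the claim and class names below are placeholders), a user requests expansion simply by raising the PVC's storage request; a minimal sketch:

```yaml
# Sketch: an existing PVC with its storage request raised from 100Gi to 200Gi.
# The new size is recorded in the Kubernetes API; no space check is performed on BeeGFS.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 200Gi
  storageClassName: my-storage-class
```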
As a user, I want a volume to use as high-performance scratch space or semi-temporary storage for my workload. I want the volume to be empty when my workload starts. I may keep my volume around for other stages in my data pipeline, or I may provide access to other users or workloads. Eventually, I'll no longer need the volume and I expect it to clean up automatically.
In the Kubernetes dynamic provisioning workflow, an administrator identifies an existing parent directory within a BeeGFS filesystem. When a user creates a PVC, the driver automatically creates a new subdirectory underneath that parent directory and binds it to the PVC. To the user and/or workload, the subdirectory is the entire volume. It exists as long as the PVC exists.
As an administrator, I want to make a directory within an existing BeeGFS file system available to be mounted by multiple users and/or workloads. This directory probably contains a large, commonly used dataset that I don't want to see copied to multiple locations within my file system. I plan to manage the volume's lifecycle and I don't want it cleaned up automatically.
As a user, I want to consume an existing dataset in my workload.
In the Kubernetes static provisioning workflow, an administrator manually creates a PV and PVC representing an existing BeeGFS file system directory. Multiple users and/or workloads can mount that PVC and consume the data the directory contains.
Depending on your topology, different nodes within your cluster or different BeeGFS file systems accessible by your cluster may need different client configuration parameters. This configuration is NOT handled at the volume level (e.g. in a Kubernetes Storage Class or Kubernetes Persistent Volume). See Managing BeeGFS Client Configuration in the deployment guide for detailed instructions on how to prepare your cluster to mount various BeeGFS file systems.
- A BeeGFS filesystem with its management service listening at sysMgmtdHost already exists and is accessible from all Kubernetes worker nodes.
- A directory that can serve as the parent to all dynamically allocated subdirectories already exists within the BeeGFS filesystem at /path/to/parent/dir OR it is fine for the driver to create one at /path/to/parent/dir.
- An administrator creates a Kubernetes Storage Class describing a particular directory on a particular BeeGFS filesystem under which dynamically provisioned subdirectories should be created.
- A user creates a Kubernetes Persistent Volume Claim requesting access to a newly provisioned subdirectory.
- A user creates a Kubernetes Pod, Deployment, Stateful Set, etc. that references the Persistent Volume Claim.
Under the hood, the driver creates a new BeeGFS subdirectory. This subdirectory is tied to a new Kubernetes Persistent Volume, which is bound to the user-created Kubernetes Persistent Volume Claim. When a Pod is scheduled to a Node, the driver uses information supplied by the Persistent Volume to mount the subdirectory into the Pod's namespace.
Who: A Kubernetes administrator working closely with a BeeGFS administrator
Specify the filesystem and parent directory using the sysMgmtdHost and volDirBasePath parameters respectively.
Striping parameters that can be specified using the beegfs-ctl command line utility in the --setpattern mode can be passed with the prefix stripePattern/ in the parameters map. If no striping parameters are passed, the newly created subdirectory has the same striping configuration as its parent. The following stripePattern/ parameters work with the driver:
Prefix | Parameter | Required | Accepted patterns | Example | Default |
---|---|---|---|---|---|
stripePattern/ | storagePoolID | no | unsigned integer | 1 | file system default |
stripePattern/ | chunkSize | no | unsigned integer + k (kilo) or m (mega) | 512k, 1m | file system default |
stripePattern/ | numTargets | no | unsigned integer | 4 | file system default |
NOTE: While the driver expects values with certain patterns (e.g. unsigned integer), Kubernetes only accepts string values in Storage Classes. These values must be quoted in the Storage Class .yaml (as in the example below).
NOTE: The effects of unlisted configuration options are NOT tested with the driver. Contact your BeeGFS support representative for recommendations on appropriate settings. See the BeeGFS documentation on striping for additional details.
By default, the driver creates all new subdirectories with root:root ownership
and globally read/write/executable 0777 access permissions. This makes it easy
for an arbitrary Pod to consume a dynamically provisioned volume. However,
administrators may want to change the default permissions on a
per-Storage-Class basis, in particular if integration with BeeGFS quotas is desired.
The following permissions/ parameters allow this fine-grained control:
Prefix | Parameter | Required | Accepted patterns | Example | Default |
---|---|---|---|---|---|
permissions/ | uid | no | unsigned integer | 1000 | 0 (root) |
permissions/ | gid | no | unsigned integer | 1000 | 0 (root) |
permissions/ | mode | no | three or four digit octal notation | 755, 0755 | 0777 |
NOTE: While the driver expects values with certain patterns (e.g. unsigned integer), Kubernetes only accepts string values in Storage Classes. These values must be quoted in the Storage Class .yaml (as in the example below).
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: my-storage-class
provisioner: beegfs.csi.netapp.com
parameters:
  # All Storage Class values must be strings. Quotes are required on integers.
  sysMgmtdHost: 10.113.72.217
  volDirBasePath: /path/to/parent/dir
  stripePattern/storagePoolID: "1"
  stripePattern/chunkSize: 512k
  stripePattern/numTargets: "4"
  permissions/uid: "1000"
  permissions/gid: "1000"
  permissions/mode: "0644"
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: false
```
Who: A Kubernetes user
Specify the Kubernetes Storage Class using the storageClassName field in the Kubernetes Persistent Volume Claim spec block.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: my-storage-class
```
Who: A Kubernetes user
Follow standard Kubernetes practices to deploy a Pod that consumes the newly created Kubernetes Persistent Volume Claim.
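For instance, a minimal Pod consuming the claim above might look like the following sketch (the image, Pod name, and mount path are placeholders, not driver requirements):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod   # hypothetical name
spec:
  containers:
    - name: app
      image: alpine:3   # placeholder image
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /mnt/dyn   # placeholder path inside the container
          name: beegfs-volume
  volumes:
    - name: beegfs-volume
      persistentVolumeClaim:
        claimName: my-pvc
```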
- A BeeGFS filesystem with its management service listening at sysMgmtdHost already exists and is accessible from all Kubernetes worker nodes.
- A directory of interest already exists within the BeeGFS filesystem at /path/to/dir. If this whole BeeGFS filesystem is to be consumed, /path/to/dir is /.
- An administrator creates a Kubernetes Persistent Volume referencing a particular directory on a particular BeeGFS filesystem.
- An administrator or a user creates a Kubernetes Persistent Volume Claim that binds to the Persistent Volume.
- A user creates a Kubernetes Pod, Deployment, Stateful Set, etc. that references the Persistent Volume Claim.
When a Pod is scheduled to a Node, the driver uses information supplied by the Persistent Volume to mount the subdirectory into the Pod's namespace.
Who: A Kubernetes administrator working closely with a BeeGFS administrator
The driver receives all the information it requires to mount the directory of interest into a Pod from the volumeHandle field in the csi block of the Persistent Volume spec block. It MUST be formatted as modeled in the example.
NOTE: The driver does NOT provide a way to modify the stripe settings of a directory in the static provisioning workflow.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
spec:
  accessModes:
    - ReadWriteMany
  capacity:
    storage: 100Gi
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: beegfs.csi.netapp.com
    volumeHandle: beegfs://sysMgmtdHost/path/to/dir
```
Who: A Kubernetes administrator or user
Each Persistent Volume Claim participates in a 1:1 mapping with a Persistent Volume. Create a Persistent Volume Claim and set the volumeName field to ensure it maps to the correct Persistent Volume.
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: ""
  volumeName: my-pv
```
Who: A Kubernetes user
Follow standard Kubernetes practices to deploy a Pod that consumes the newly created Kubernetes Persistent Volume Claim.
- While multiple Kubernetes clusters can use the same BeeGFS file system, it is not recommended to have more than one cluster use the same volDirBasePath within the same file system.
- Do not rely on Kubernetes access modes to prevent directory contents from being overwritten. Instead set sensible permissions, especially on static directories containing shared datasets (more details below).
In some cases an administrator or a user may wish to have a BeeGFS volume mounted ReadOnly within a container. There are several mechanisms to accomplish this. The following describes the available approaches to configuring ReadOnly volumes and how to choose the right one for your situation.
Who: A Kubernetes user or administrator
Within a Pod specification there are two options that currently apply to the BeeGFS CSI driver for configuring a volume to be mounted as ReadOnly.
- There is a readOnly attribute of the VolumeMounts for a container.
- There is a readOnly attribute of the PersistentVolumeClaim source in the Volume configuration of a Pod.
The first method (setting readOnly on a container's volumeMounts) can be used for any volume type, including BeeGFS volumes configured as PersistentVolumeClaims. For other volume types that aren't configured as PersistentVolumeClaims, this might be the only option for specifying ReadOnly mode in the Pod configuration. The following is an example of a Pod specification that uses this method to set the readOnly attribute.
```yaml
kind: Pod
apiVersion: v1
metadata:
  name: pod-sample
spec:
  containers:
    - name: sample
      volumeMounts:
        - mountPath: /mnt/static
          name: csi-beegfs-static-volume
          readOnly: true
  volumes:
    - name: csi-beegfs-static-volume
      persistentVolumeClaim:
        claimName: beegfs-pvc-1
```
In this scenario the volume is staged with ReadWrite permissions and the ReadOnly permission is applied in a subsequent bind mount specific to the targeted container. Therefore the scope of this ReadOnly configuration is the single container within the Pod.
The second method (setting readOnly on the persistentVolumeClaim source) is available to any volume presented to a Pod as a PersistentVolumeClaim. The following is an example of a Pod specification that uses this method to set the readOnly attribute.
```yaml
kind: Pod
apiVersion: v1
metadata:
  name: pod-sample
spec:
  containers:
    - name: sample
      volumeMounts:
        - mountPath: /mnt/static
          name: csi-beegfs-static-volume
  volumes:
    - name: csi-beegfs-static-volume
      persistentVolumeClaim:
        claimName: beegfs-pvc-1
        readOnly: true
```
In this scenario the volume is staged with ReadWrite permissions and the ReadOnly permission is applied in a subsequent bind mount where the volume is made available to the Pod. Therefore the scope of this ReadOnly configuration is all containers within the Pod.
You may not want to use one of the Pod configuration methods above under the following circumstances.
- The targeted volume may be used by multiple Pods and you may not control the Pod configuration for all Pods that use the volume.
- You are an administrator and want to control the ReadOnly attribute for a volume yourself instead of letting the users managing the Pods control access to that volume.
Who: A Kubernetes administrator working closely with a BeeGFS administrator
The ReadOnly attribute for a volume can also be configured through the mountOptions property of a persistent volume or a storage class object.
When defining the PersistentVolume spec you can use the mountOptions property to define the mount options to use for that particular volume. This can include configuring the volume to be ReadOnly with the ro mount option. The following is an example.
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-sample
spec:
  accessModes:
    - ReadOnlyMany
  capacity:
    storage: 100Gi
  mountOptions:
    - ro
    - nosuid
    - relatime
  persistentVolumeReclaimPolicy: Retain
  storageClassName: "csi-beegfs-dyn-sc"
  csi:
    driver: beegfs.csi.netapp.com
    # Replace "localhost" with the IP address or hostname of the BeeGFS management daemon.
    # "all" k8s clusters may share access to statically provisioned volumes.
    # Ensure that the directory, e.g. "k8s/all/static", exists on BeeGFS. The driver will not create the directory.
    volumeHandle: beegfs://localhost/k8s/all/static
```
In this scenario the volume is staged as ReadOnly and all bind mounts are ReadOnly, so every use of the volume is ReadOnly regardless of the configuration of any Pod that uses it.
NOTE: If you specify any mount options with the mountOptions property, then you need to specify all of your desired mount options here. See the BeeGFS Mount Options section for information on the default mount options used.
A StorageClass object can be configured with a mountOptions parameter similar to how a PersistentVolume object can be configured with mountOptions. However, the mountOptions for a StorageClass are only applied to dynamically provisioned volumes. Any mountOptions configured on a StorageClass do not apply to a statically defined PersistentVolume that references that StorageClass object.
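As a sketch, a Storage Class that forces all volumes dynamically provisioned from it to mount ReadOnly might look like the following (the class name, sysMgmtdHost, and volDirBasePath are placeholders):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-beegfs-readonly-sc   # hypothetical name
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.113.72.217        # placeholder management address
  volDirBasePath: /path/to/parent/dir
mountOptions:
  # Applied only to volumes dynamically provisioned from this Storage Class.
  - ro
  - nosuid
  - relatime
reclaimPolicy: Delete
volumeBindingMode: Immediate
```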
- By default the driver uses the beegfs-client.conf file at /etc/beegfs/beegfs-client.conf for base configuration. Modifying the location of this file is not currently supported without changing kustomization files.
- When using dynamic provisioning, if --node-unstage-timeout is set to a nonzero value (default: 60) the driver will create a directory structure at volDirBasePath/.csi/ (in the BeeGFS filesystem) and use it to persist any state used by the driver, for example to prevent orphaned mounts. This behavior can optionally be disabled, but leaving it enabled is strongly recommended for the driver to function optimally.
Except for cfgFile (which has to be set by the driver), mount options supported by BeeGFS can be specified on a persistent volume or storage class. Please note the driver DOES NOT validate provided mount options, and use of options not supported by BeeGFS may cause unpredictable behavior.
By default the driver mounts BeeGFS with the following mount options: rw, relatime, and nosuid.
- The cfgFile option is also used, but it is handled entirely by the driver and ignored if specified.
- The nosuid mount option is used to adhere to BeeGFS security recommendations.
For performance (and other) reasons each Persistent Volume used on a given Kubernetes node has a separate mount point. When using remote direct memory access (RDMA) this will increase the amount of memory used for RDMA queue pairs between BeeGFS clients (K8s nodes) and BeeGFS servers. As of BeeGFS 7.2 this is around 12-13MB per mount for each client connection to a BeeGFS storage/metadata service.
Since clients only open connections when needed this is unlikely to be an issue, but in some large environments may result in unexpected memory utilization. This is much more likely to be an issue on BeeGFS storage and metadata servers than the Kubernetes nodes themselves (since multiple clients connect to each server). Administrators are advised to spec out BeeGFS servers accordingly.
Note: See the section on Creating a Storage Class for how to set permissions using the CSI driver.
By default, the driver creates all new subdirectories with root:root ownership and globally read/write/executable 0777 access permissions. This works well in a Kubernetes environment where Pods may run as an arbitrary user or group but still expect to access provisioned volumes.
NOTE: Permissions on the volDirBasePath are not modified by the driver. These permissions can be used to limit external access to dynamically provisioned subdirectories even when these subdirectories themselves have 0777 access permissions.
In certain situations, it makes sense to override the default behavior and instruct the driver to create directories owned by some other user/group or with a different mode. This can be done on a per-Storage-Class basis. Some example scenarios include:
- BeeGFS quotas are in use and all files and directories in provisioned volumes must be associated with a single appropriate GID (as in the project directory quota tracking example in the BeeGFS documentation). See the driver documentation on Quotas for further guidance on how to use BeeGFS quotas with the CSI driver.
- It is important to limit the ability of arbitrary BeeGFS file system users to access dynamically provisioned volumes and the volumes will be accessed by Pods running as a known user or group anyway (see the above note for an alternate potential mitigation).
NOTE: The above BeeGFS quotas documentation suggests using chmod g+s on a directory to enable the setgid bit. The exact same behavior can be obtained using four digit octal permissions in the parameters.permissions/mode field of a BeeGFS Storage Class. For example, 2755 represents the common 755 directory access mode with setgid enabled. See the chmod man page for more details.
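As a sketch (all values below are illustrative placeholders, not recommendations), a Storage Class that enables setgid on provisioned directories might include:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: my-setgid-storage-class   # hypothetical name
provisioner: beegfs.csi.netapp.com
parameters:
  sysMgmtdHost: 10.113.72.217        # placeholder management address
  volDirBasePath: /path/to/parent/dir
  permissions/gid: "1000"            # group that should own provisioned directories
  permissions/mode: "2755"           # 755 access mode with the setgid bit enabled
reclaimPolicy: Delete
volumeBindingMode: Immediate
```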
Under the hood, the driver uses a combination of beegfs-ctl and chown/chmod-like functionality to set the owner, group, and access mode of a new BeeGFS subdirectory. These properties limit access to the subdirectory both outside of (as expected) and inside of Kubernetes. If permissions are set in a Storage Class, Kubernetes Pods likely need to specify one of the following parameters to allow access:
- spec.securityContext.runAsUser
- spec.securityContext.runAsGroup
- spec.securityContext.fsGroup
- spec.container.securityContext.runAsUser
- spec.container.securityContext.runAsGroup
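For example, a Pod consuming a volume provisioned with non-default permissions might run as the matching user and group; a minimal sketch (the Pod name, image, mount path, and claim name are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: permissions-example-pod   # hypothetical name
spec:
  securityContext:
    runAsUser: 1000   # matches permissions/uid in the Storage Class
    runAsGroup: 1000  # matches permissions/gid in the Storage Class
    fsGroup: 1000
  containers:
    - name: example
      image: alpine:3   # placeholder image
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /mnt/dyn
          name: dyn-volume
  volumes:
    - name: dyn-volume
      persistentVolumeClaim:
        claimName: my-pvc
```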
Some CSI drivers support a recursive operation in which the permissions and ownership of all files and directories in a provisioned volume are changed to match the fsGroup parameter of a Security Context on Pod startup. This behavior is generally undesirable with BeeGFS for the following reasons:
- Unexpected permissions changes within a BeeGFS file system may be confusing to administrators and detrimental to security (especially in the static provisioning workflow).
- Competing operations executed by multiple Pods against large file systems may be time-consuming and affect overall system performance.
For clusters running most Kubernetes versions, heuristics enable this behavior on ReadWriteOnce volumes and do NOT enable it on ReadWriteMany volumes. Create only ReadWriteMany volumes to ensure no unexpected permissions updates occur.
For clusters running v1.20 or v1.21 WITH the optional CSIVolumeFSGroupPolicy feature gate (in an eventual future version the feature gate will not be required), the csiDriver.spec.fsGroupPolicy parameter can be used to disable this behavior for all volumes. The beegfs-csi-driver deploys with this parameter set to "None" in case it is deployed to a cluster that supports it.
- Each BeeGFS instance used with the driver must have a unique BeeGFS management IP address.
Access modes in Kubernetes are how a driver understands what K8s wants to do
with a volume, but do not strictly enforce behavior. This may result in
unexpected behavior if administrators expect creating a Persistent Volume with
(for example) ReadOnlyMany
access will enforce read only access across all
nodes accessing the volume. This is a larger issue with Kubernetes/CSI ecosystem
and not specific to the BeeGFS driver. Some relevant discussion can be found in
this GitHub issue.
If the pod.spec.volumes.persistentVolumeClaim.readOnly
flag or the
pod.spec.containers.volumeMounts.readOnly
flag is set, volumes are mounted
read-only as expected. However, this workflow leaves the read-only vs read-write
decision up to the user requesting storage.
Moving forward we plan to look at ways the driver could better enforce read-only capabilities when access modes are specified, but doing so will likely require deviating slightly from the CSI spec. In the meantime one workaround is to set permissions on static BeeGFS directories so they cannot be overwritten. Note that Pods running with root permissions could ignore this.
The volume_id used by this CSI driver is in the format of a Uniform Resource Identifier (URI) generated by aggregating the values of several fields, including a path within a BeeGFS file system.
- In the case of dynamic provisioning, the fields within the StorageClass object (sc) and CreateVolumeRequest message (cvr) combine to yield the volume_id: beegfs://{sc.parameters.sysMgmtdHost}/{sc.parameters.volDirBasePath}/{cvr.name}
- In the case of static provisioning, the volume_id is written directly by the administrator into the Persistent Volume object (pv) as the pv.spec.volumeHandle.
In either case the resulting volume_id URI is generally of the format beegfs://ip-or-domain-name/path/to/sub/directory/volume_name.
The volume_id, like all string field values, is subject to a 128 byte limit unless overridden in the CSI spec:
CSI defines general size limits for fields of various types (see table below). The general size limit for a particular field MAY be overridden by specifying a different size limit in said field's description. Unless otherwise specified, fields SHALL NOT exceed the limits documented here. These limits apply for messages generated by both COs and plugins.
Size | Field Type |
---|---|
128 bytes | string |
4 KiB | map<string, string> |
Source: CSI Specification v1.5.0 Size Limits
CSI specification v1.4.0 relaxed the size limit for some file paths and increased the limit for the node_id field specifically to 192 bytes. CSI specification v1.5.0 further increased the size limit for the node_id field to 256 bytes. However, the volume_id size limit is unchanged.
Some cursory testing of a few CO and CSI deployments suggests that the limits are not strictly enforced. So, rather than impose strict failures or warnings when CSI spec field limits are exceeded, we have elected to only document the possibility that long paths may cause errors.