- Lightweight way to virtualize process
- Kernel namespaces split kernel resources into one instance per namespace
- This partitions processes, users, network stacks and other components into separate analogous pieces in order to provide processes a unique view
- Originated in 1992 in the Plan 9 operating system
- For example, one bash shell have a different hostname than a different bash shell if they are running in different namespaces
- There are currently 6 namespaces:
- mnt (mount points, filesystems)
- pid (processes)
- net (network stack)
- ipc (System V IPC)
- uts (hostname)
- user (UIDs)
- Namespace system calls:
- clone() - creates a new process and a new namespace;
- the process is attached to the new namespace.
- Process creation and process termination methods, fork() and exit() methods, were patched to handle the new namespace CLONE_NEW* flags.
- unshare() - does not create a new process; creates a new namespace and attaches the current process to it.
- setns() - a new system call was added, for joining an existing namespace.
- clone() - creates a new process and a new namespace;
- Namespaces do not have names
- /proc/pid/ns lists the namespace number for each namespace for each process
- ls -al /proc//ns
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 ipc -> ipc:[4026531839]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 mnt -> mnt:[4026531840]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 net -> net:[4026531956]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 pid -> pid:[4026531836]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 user -> user:[4026531837]
- lrwxrwxrwx 1 root root 0 Apr 24 17:29 uts -> uts:[4026531838]
- How are namespaces implemented?
- By extensive modifications in the kernel to recognize namespaces at appropriate places
- Took over 5 years
- Example: gethostname() used to read system->nodename, now reads u->nodename where u is the namespace
- UTS Namespace:
- Provides namespace-specific hostname and domainname (for Network Information Service)
- Network Namespace:
- A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices.
- The network namespace is struct net. (defined in include/net/net_namespace.h)
- Struct net includes all network stack ingredients, like: loopback device, all network tables:routing, neighboring, etc, all sockets, /procfs and /sysfs entries
- Each network namespace has its own IP addresses
- A network device belongs to exactly one network namespace
- A socket belongs to exactly one network namespace
- When you delete a namespace, all its migratable network devices are moved to the the default namespace
- Communicating between two network namespaces:
- Veth (virtual ethernet) is used like a pipe between two namespaces
- Sockets also work
- Mount Namespace:
- On creation, we copy the file system tree to new space
- All previous mounts will be visible
- Future mounts/unmounts invisible to the rest of the system
- Weird bug: mount command updates /etc/mtab, which is visible across all namespaces
- So mount in one namespace will be visible to another
- However, if you cat /proc/mounts, namespace-specific mounts wont be visible externally
- Another weird bug:
- By default, Fedora systemd will share all mount information across namespaces
- Have to explicitly mark as private
- More sophisticated than chroot jails:
- Mount namespaces can be connected so changes in one show up in another (called mount propogation)
- PID Namespace:
- Processes in different PID namespaces can have the same process ID
- When creating the first process in a new namespace, its PID is 1
- User Namespace:
- A process will have distinct set of UIDs, GIDs and capabilities
- IPC Namespace:
- Each namespace gets its own IPC objects and POSIX message queues
- Control Groups (cgroups) are a mechanism for applying hardware resource limits and access controls to a process or collection of processes
- The cgroup mechanism and the related subsystems provide a tree-based hierarchical, inheritable and optionally nested mechanism of resource control
- This work was started by engineers at Google (primarily Paul Menage and Rohit Seth) in 2006 under the name "process containers; in 2007, renamed to "Control Groups".
- The implementation of cgroups requires a few, simple hooks into the rest of the kernel, none in performance-critical paths:
- In boot phase (init/main.c) to preform various initializations.
- In process creation and destroy methods, fork() and exit().
- A new file system of type "cgroup" (VFS) – Process descriptor additions (struct task_struct) – Add procfs entries:
- For each process: /proc/pid/cgroup.
- System-wide: /proc/cgroups
- Note that the cgroups is not dependent upon namespaces; you can build cgroups without namespaces kernel support.
- There are 11 cgroup subsystems: cpuset, freezer, mem, blkio, net_cls, net_prio, devices, perf, hugetlb, cpu_cgroup, cpuacct
- In order to mount a subsystem, you should first create a folder for it under /cgroup. In order to mount a cgroup, you first mount some tmpfs root folder:
- mount -t tmpfs tmpfs /cgroup
- Mounting of the memory subsystem, for example, is done thus:
- mkdir /cgroup/memtest
- mount -t cgroup -o memory test /cgroup/memtest/
- The cpuset control group identifies a set of processors on which each process in a group may run, and it also identifies a set of memory nodes from which processes in a group can allocate memory
- Net_cl and Net_prio uniquely identify each group inside sockets and other networking events
- More fine-grained access control than just root or non-root.
- A process could be non-root, but given root-like privileges for a specific task (example: using a network port)
- Alternatively, a process could be root, and be denied access because of missing capabilities
- For example, root could be denied permission to mount new devices
- This is used to make the “container root” user significantly less powerful than the host root
- Copy-on-write file system that is the union of existing file systems
- UnionFS allows several containers to share common data (if we are running 5 containers based on the same image, we don’t have to keep five copies)
- On write to the UnionFS, the overwritten data is saved to a new path, specific to that container
- Thus, writes of one container do not affect reads of another container
- Built on top of kernel namespaces, cgroups, unionFS, and capabilities
- Each container gets its own set of namespaces and cgroups
- Namespaces isolate containers from each other: one container can’t even see the list of processes in another container
- Cgroups allow the admin to isolate the resources used by each container and its children
- Running the docker daemon requires root privileges
- As of Docker 1.3.2, images are now extracted in a chrooted subprocess on Linux/Unix platforms, being the first-step in a wider effort toward privilege separation
- Docker provides a whitelist of capabilities to root users inside a container