I was quite baffled the first time I saw docker running centos on amazon linux. I thought I would jot down everything I understand about containers so someone can save time without needing to read a whole lot of documents. In this document I will try to demonstrate most (all) of the concepts without using any docker (or rkt) commands, so we can see how these tools work behind the scenes.
NOTE : As of today this document references docker, but it should hold true for other systems too. The topics covered here only show the proof of concept. Tools like docker, rkt, etc. wrap around these same primitives, but they cover all the corner cases and use cases far more thoroughly than what is mentioned here.
DISCLAIMER I am not an expert in these domains, only my curiosity led me to read and write this document. (In other words, don't try this in production.)
The reason behind writing this document: in the operations world, when we debug a production issue, it is better to know the internals than to make foolish assumptions (assumption is the root of all major screw-ups).
I assume the reader already knows about Docker and its features. Knowing the features of Docker will help the reader understand the internals better (otherwise you might wonder why the heck I am saying all this).
FROM centos
If you save the above contents to a Dockerfile and do a docker build -t vigith/centos . followed by a docker run -t -i vigith/centos /bin/bash, you will get a bash prompt in centos (you can confirm this by doing a cat /etc/system-release at your new prompt).
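Putting those commands together (vigith/centos is just an example tag; any name will do):
# build the image from the Dockerfile in the current dir
> docker build -t vigith/centos .
# start a shell in the image we just built
> docker run -t -i vigith/centos /bin/bash
# inside the new prompt, confirm the distro
> cat /etc/system-release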
It also supports mounting volumes, exposing ports, container linking, etc. Though the names may sound unfamiliar, the technology behind them remains the same. Since I am focusing on containers as a whole, I leave it to the reader to explore more of docker.
UnionFS lets you overlay files and directories of different filesystems, forming a single unified mountable filesystem. The merged filesystem will have no duplicates, and later layers take precedence over former layers, such that we end up with a new unified coherent virtual filesystem. A couple of well-known union/layering systems are AUFS, btrfs, DeviceMapper, OverlayFS, etc.
It allows both read-only and read-write filesystems to be merged. Once a write is made, the copy goes to the writable layer. The writes can be either discarded or persisted; persisting enables users to create a snapshot of the changes and later build layers on top of it as if it were the base layer.
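To get a feel for this, here is a minimal OverlayFS sketch (assuming a kernel with overlay support; this is just an illustration, not necessarily what docker does):
# lower is the read-only layer, upper receives all new writes,
# work is scratch space overlayfs needs, merged is the unified view
> mkdir lower upper work merged
> echo base > lower/a
> mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged
# writes land in upper, while merged shows the union of both layers
> echo change > merged/b
> ls merged
a  b
> ls upper
b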
You have to install nginx for your website. The end container on your website server will be an nginx process tailored for the website, called website-nginx. You would also like to reuse your nginx build because it has a lot of patches specific to your env.
This can be done in 2 steps:
- get a specific version of nginx (patched with all the crazy stuff) called ops-nginx
- use ops-nginx to build our website-nginx server; the same ops-nginx can be used for other servers by just putting in the right confs
step 1 Create the ops-nginx image from the base os; it can then be reused later for many other apps
(base os) -> layer 1
\_ installing patched nginx -> layer 2
\_ install users -> layer 3
|_ giving sudo for ops -> layer 3 (snapshot as ops-nginx)
step 2 Create the website-nginx image from ops-nginx, a server with a specific set of configs and other packages
(ops-nginx) -> layer 1 (snapshot)
\_ nginx website conf -> layer 2
|_ ssl conf -> layer 2
|_ log conf -> layer 2 (snapshot as website-nginx)
step 1 Create the ops-nginx image from base os
- docker pull centos
- docker run -t -i centos /bin/bash
- yum install nginx foobar
- ... other crazy command ...
- docker commit -m "ops nginx image" IMAGE_ID ops-nginx
step 2 Create the website-nginx image from ops-nginx
- docker pull ops-nginx
- docker run -t -i ops-nginx /bin/bash
- .. change your config ..
- ... other voodoo stuff ...
- docker commit -m "ops nginx image" IMAGE_ID website-nginx
If there were no docker, how could we have done this? Docker or other tools might do it quite differently, but let's look at a couple of ways we could do it ourselves.
To understand how we can do it, we need to understand the following
- loop device - mount a file as a block device
- sparse file - use the disk efficiently when the content is mostly empty
- device mapper - maps physical block devices onto higher-level virtual block devices (Kernel Device Mapper Doc)
- snapshotting - snapshot the state of a filesystem at a given time
- thin provisioning - allows many virtual devices to be stored on the same data volume
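As a quick illustration of the sparse part (assuming GNU coreutils), a truncated file claims a large apparent size but consumes almost no disk until written to:
# apparent size is 100G
> truncate -s100G test.block
> ls -lh test.block
# actual disk usage is ~0, nothing has been written yet
> du -h test.block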
I will try to give some crude examples of device mappers. These examples do a two-step snapshot:
empty filesystem
|
+- load filesystem (snapshot 1)
|
+- edit filesystem (snapshot 2)
If we can achieve this, then we can do it repeatedly, and we can make and persist any kind of change at any level of snapshot.
This is a crude HOWTO on a working example of device mapper snapshots.
# create a sparse 100G file
> truncate -s100G test.block
# create /dev/loop0
# -f will find an unused device and use it
# --show will print the device name
> losetup -f --show test.block
Now we have /dev/loop0 (my example is based on loop0; if loop0 is not free, do a losetup -d /dev/loop0) attached to test.block (a file mounted as a block device).
# create base target (1953125 = 1000 * 1000 * 1000 / 512)
# where 512 byte = 1 sector, and GB = 1000 * 1000 * 1000 (it would have been 1024
# if GiB was the unit)
> dmsetup create test-snapshot-base-real --table '0 1953125 linear /dev/loop0 0'
# create the cow snapshot target
# 390625 + 1953125 = 2343750 (== 1.2GB)
> dmsetup create test-snapshot-snap-cow --table '0 390625 linear /dev/loop0 1953125'
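You can sanity-check the two linear mappings with dmsetup table (assuming loop0, whose major:minor is 7:0):
> dmsetup table test-snapshot-base-real
0 1953125 linear 7:0 0
> dmsetup table test-snapshot-snap-cow
0 390625 linear 7:0 1953125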
I downloaded a centos rootfs (actually I took a docker centos image and converted it to a tar via docker2aci). This centos tar is named centos-latest.tar.
# format the origin as an ext4 device
> mkfs.ext4 /dev/mapper/test-snapshot-base-real
# create a dir to mount the new ext4 fs
> mkdir -p /mnt/loados
# mount it
> mount /dev/mapper/test-snapshot-base-real /mnt/loados
# load centos to new ext4
> tar -xf centos-latest.tar -C /mnt/loados/
# umount the dir
> umount /mnt/loados
We will make the newly created ext4 filesystem containing the centos rootfs our origin.
# make /dev/mapper/test-snapshot-base-real as origin
> dmsetup create test-snapshot-base --table '0 1953125 snapshot-origin /dev/mapper/test-snapshot-base-real'
This will make a snapshot target, which can be mounted and edited. The snapshot target has the origin as its backend (ie, if no write is made to the snapshot, origin == snapshot; else all new writes go to the snapshot).
# P (2nd last arg) means, make it persistent across reboot
# 8 (last arg) chunk-size, the granularity of copying to the snapshot
> dmsetup create test-snapshot-cow --table '0 1953125 snapshot /dev/mapper/test-snapshot-base-real /dev/mapper/test-snapshot-snap-cow P 8'
Note how the origin device is not the same device as the one we just created (ie test-snapshot-base), but rather the origin's underlying device test-snapshot-base-real. At this point if you do a dmsetup status you will see something like the following:
> dmsetup status
test-snapshot-snap-cow: 0 390625 linear
test-snapshot-base: 0 1953125 snapshot-origin
test-snapshot-base-real: 0 1953125 linear
test-snapshot-cow: 0 1953125 snapshot 16/390625 16
Let's add some data on the CoW snapshot. The origin won't have these changes; only the CoW snapshot will.
# mount the CoW device
> mount /dev/mapper/test-snapshot-cow /mnt/loados
# create a dir (one way to edit)
> mkdir /mnt/loados/vigith_test
# add some data
> echo bar > /mnt/loados/vigith_test/foo
# umount the device
> umount /mnt/loados
Take the changes we have made and merge them into the origin, so the origin will have all these changes. This is good to do because next time we create a snapshot, we will already have these changes.
To merge a snapshot,
- the origin must be suspended
- the snapshot device unmapped
- the origin's table reloaded with the snapshot-merge target
- the origin resumed
- once the merge is complete (poll it via dmsetup status), the origin suspended again
- the snapshot-merge target replaced back with the snapshot-origin target
- the origin resumed
## suspend the origin, remove the snapshot device, and replace the
## snapshot-origin target with the snapshot-merge target
> dmsetup suspend test-snapshot-base
> dmsetup remove test-snapshot-cow
> dmsetup reload test-snapshot-base --table '0 1953125 snapshot-merge /dev/mapper/test-snapshot-base-real /dev/mapper/test-snapshot-snap-cow P 8'
If you do a dmsetup status you will see that test-snapshot-cow is missing now.
> dmsetup status
test-snapshot-snap-cow: 0 390625 linear
test-snapshot-base: 0 1953125 snapshot-origin <--- it is snapshot-origin
test-snapshot-base-real: 0 1953125 linear
do a resume
> dmsetup resume test-snapshot-base
If you do dmsetup status, you will see that snapshot-origin became snapshot-merge:
> dmsetup status
test-snapshot-snap-cow: 0 390625 linear
test-snapshot-base: 0 1953125 snapshot-merge 16/390625 16 <--- snapshot-merge
test-snapshot-base-real: 0 1953125 linear
suspend; replace the snapshot-merge with snapshot-origin; reload; resume
## dmsetup status output will need to be polled to find out when the merge is complete.
## Once the merge is complete, the snapshot-merge target should be replaced with the snapshot-origin target
> dmsetup suspend test-snapshot-base
> dmsetup reload test-snapshot-base --table '0 1953125 snapshot-origin /dev/mapper/test-snapshot-base-real'
> dmsetup resume test-snapshot-base
Now dmsetup status will confirm that snapshot-merge has become snapshot-origin:
> dmsetup status
test-snapshot-snap-cow: 0 390625 linear
test-snapshot-base: 0 1953125 snapshot-origin <--- snapshot-origin
test-snapshot-base-real: 0 1953125 linear
We should see the new directory we created, /mnt/loados/vigith_test, and also the file inside that dir, /mnt/loados/vigith_test/foo:
# mount
> mount /dev/mapper/test-snapshot-base /mnt/loados
# you should be seeing 'bar' as output
> cat /mnt/loados/vigith_test/foo
bar
# unmount it
> umount /mnt/loados
If you remember, we started with a file called test.block. If you run file test.block or tune2fs -l test.block you will see it is an ext4 filesystem. Also, you can mount that file to any dir and you will see that it contains the merged origin you just created.
# run file
> file test.block
# tune2fs
> tune2fs -l test.block
# create a mount dir
> mkdir /tmp/testmnt
# lets mount this test.block
> mount -o loop test.block /tmp/testmnt
# look for the dir and file we created
> cat /tmp/testmnt/vigith_test/foo
bar
# umount it
> umount /tmp/testmnt
Now you have a file that can be mounted.
An astute reader might say: of course you can add files and manipulate the FS, but what about installing packages, or compiling source code against new libraries in the new FS? The answer to that is: keep reading, or skip to the pivot_root section if you are really curious.
Compared to the previous implementation of snapshots, the advantage here is that thin provisioning allows many virtual devices to be stored on the same data volume. Please read the kernel doc to understand more about it.
Here is an example of how thin provisioning works. Thin provisioning requires a metadata store and a data store.
# create a sparse 100G data file
> truncate -s100G testthin.block
# create a sparse 1G metadata file
> truncate -s1G testmetadata.block
# create /dev/loop0
# -f will find an unused device and use it
# --show will print the device name
> losetup -f --show testthin.block
# create /dev/loop1 for metadata
> losetup -f --show testmetadata.block
# clean it with zeros
> dd if=/dev/zero of=/dev/loop1 bs=4096 count=1
# test-thin-pool => poolname
# /dev/loop1 /dev/loop0 => metadata and data devices
# 20971520 => 10GiB (20971520 = 10 * 1024 * 1024 * 1024 / 512)
# 128 => data blocksize
> dmsetup create test-thin-pool --table '0 20971520 thin-pool /dev/loop1 /dev/loop0 128 0'
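To confirm the pool came up (a sanity check, not a required step), ask for its status; it reports, among other fields, the used/total metadata blocks and used/total data blocks:
# sanity-check the pool device
> dmsetup status test-thin-pool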
Creating a new thin volume is a two-step process:
- send a message to the active pool device
- activate the new volume (allocate storage)
# create a new thin volume
# 0 (last arg) => 24 bit identifier
# 0 (other 0) => sector (512 bytes) in the logical device
> dmsetup message /dev/mapper/test-thin-pool 0 "create_thin 0"
# allocate storage/activate
# 0 (last arg) => thinp device identifier
# 2097152 => 1GiB (2097152 sectors = 1024 * 1024 * 1024 / 512)
> dmsetup create test-thin --table '0 2097152 thin /dev/mapper/test-thin-pool 0'
Load the data to the new thin device. We will use this loaded thin device to create snapshots.
# create an ext4 partition
> mkfs.ext4 /dev/mapper/test-thin
# mount the dir
> mount /dev/mapper/test-thin /mnt/loados
# load the partition with centos
> tar -xf centos-smaller.tar -C /mnt/loados/
# unmount it
> umount /mnt/loados/
To create a snapshot:
- suspend the origin device whose snapshot is being taken
- send the "create_snap" message to the pool
- resume the origin device
# suspend origin
> dmsetup suspend /dev/mapper/test-thin
# create snapshot
# 1 => identifier for snapshot
# 0 => identifier for origin device (last arg 0)
> dmsetup message /dev/mapper/test-thin-pool 0 "create_snap 1 0"
# resume the origin
> dmsetup resume /dev/mapper/test-thin
If you do an ls -l /dev/mapper you won't see any snapshot yet.
> ls /dev/mapper/
control test-thin test-thin-pool
Once created, the user doesn't have to worry about any connection between the origin and the snapshot. It can be treated like yet another thinly-provisioned device (ie, you can take snapshots of it).
# activate the snapshot (note that we gave 1)
# 1 => snapshot identifier (same value we gave when we called "create_snap")
# 2097152 => 1GiB (2097152 sectors = 1024 * 1024 * 1024 / 512)
> dmsetup create test-thin-snap --table '0 2097152 thin /dev/mapper/test-thin-pool 1'
If you do a ls -l /dev/mapper you should see test-thin-snap in the listing.
> ls /dev/mapper/
control test-thin test-thin-pool test-thin-snap
Let's mount this thin snapshot and put some data in it.
# mount
> mount /dev/mapper/test-thin-snap /mnt/loados
# create some new dir
> mkdir /mnt/loados/vigith_test
# write some data
> echo bar > /mnt/loados/vigith_test/foo
# umount
> umount /mnt/loados/
Taking this snapshot is exactly the same as the Internal Snapshot discussed earlier (this time we snapshot the snapshot).
# suspend the origin (origin for this snap, but it is a snapshot of 1st origin)
> dmsetup suspend /dev/mapper/test-thin-snap
# please note we have incremented identifier to 2 and origin is 1
# (for the earlier run it was 1 and 0)
> dmsetup message /dev/mapper/test-thin-pool 0 "create_snap 2 1"
# resume the origin
> dmsetup resume /dev/mapper/test-thin-snap
Activating it is the same as activating the earlier Internal Snapshot, except that the identifier is 2 now (it was 1 before).
# earlier the identifier was 1
# lets call it test-thin-snap-2
> dmsetup create test-thin-snap-2 --table '0 2097152 thin /dev/mapper/test-thin-pool 2'
Mount the latest snapshot to see the new dir we created:
> mount /dev/mapper/test-thin-snap-2 /mnt/loados
> ls -l /mnt/loados/vigith_test/foo
# you should be seeing 'bar' as output
> cat /mnt/loados/vigith_test/foo
bar
> umount /mnt/loados
If you remember, we started with a file called testthin.block. If you run file testthin.block or tune2fs -l testthin.block you will see it reported as an ext4 file. Everything lives inside this file (plus the metadata file): you can recreate the pool from it, activate the thin devices again, and find the changes you made.
# mounting thin snapshots is a little different from plain snapshots
# load the thin block
> losetup -f --show testthin.block
# load the metadata
> losetup -f --show testmetadata.block
# create the pool
> dmsetup create test-thin-pool --table '0 20971520 thin-pool /dev/loop1 /dev/loop0 128 0'
# activate the latest snapshot (identifier 2); identifier 0 is the origin,
# which does not have the changes we made on the snapshots
> dmsetup create test-thin-snap-2 --table '0 2097152 thin /dev/mapper/test-thin-pool 2'
# create a mount dir
> mkdir /tmp/testmnt
# lets mount this snapshot
> mount /dev/mapper/test-thin-snap-2 /tmp/testmnt
# look for the dir and file we created
> cat /tmp/testmnt/vigith_test/foo
bar
# umount it
> umount /tmp/testmnt
If you are wondering how the layering is done: it is not done by the operating system. docker has an fsdiff.go program which does it. Basically you do a diff between the archive that was brought in by pull and the changes you made.
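A crude userland approximation of the idea (this is not docker's actual code path; base/ and changed/ are hypothetical rootfs trees, the pulled one and the edited one):
# list files that were added, removed, or modified between the two trees
> diff -qr base/ changed/
Only in changed: vigith_test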
FIXME: If I am wrong
The clone syscall, really! That is all kernel namespaces are about.
Using Kernel Namespaces, we achieve process isolation:
- ipc - InterProcess Communication (flag: CLONE_NEWIPC)
- mnt - Mount points (flag: CLONE_NEWNS)
- pid - Process ID (flag: CLONE_NEWPID)
- net - Networking (flag: CLONE_NEWNET)
- uts - set of identifiers returned by uname(2) (flag: CLONE_NEWUTS)
When a process is created, the new process inherits most of the parent process's execution context. To use namespaces we just need to pass the right flags to clone.
int clone(int (*fn)(void *), void *child_stack,
int flags, void *arg, ...
/* pid_t *ptid, struct user_desc *tls, pid_t *ctid */ );
The 3rd argument is the flags. For example, clone(fn_child, child_stack, SIGCHLD|CLONE_NEWPID|CLONE_NEWNET, &fn_child_args); can be called to create a child process with new net and pid namespaces.
To understand Kernel Namespaces, let's write a sample clone code and start building on it. The major change will be in static int clone_flags = SIGCHLD;, where we will add more flags. This code, when executed, will run a new bash process in the child context.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>
#include <sched.h>
#include <errno.h>
#include <string.h>
#define STACKSIZE (1024*1024)
/* the flags */
static int clone_flags = SIGCHLD;
/* fn_child_exec is the func that will be executed by clone
and when this function returns, the child process will be
terminated */
static int fn_child_exec(void *arg) {
char * const cmd[] = { "/bin/bash", NULL};
fprintf(stderr, "Child Pid: [%d] Invoking Command [%s] \n", getpid(), cmd[0]);
if (execv(cmd[0], cmd) != 0) {
fprintf(stderr, "Failed to Run [%s] (Error: %s)\n", cmd[0], strerror(errno));
exit(-1);
}
/* exec never returns */
exit(EXIT_FAILURE);
}
int main(int argc, char *argv[]) {
char *child_stack = (char *)malloc(STACKSIZE*sizeof(char));
/* create a new process, the function fn_child_exec will be called */
pid_t pid = clone(fn_child_exec, child_stack + STACKSIZE, clone_flags, NULL);
if (pid < 0) {
fprintf(stderr, "clone failed (Reason: %s)\n", strerror(errno));
exit(EXIT_FAILURE);
}
/* wait on our child process before the parent exits, else init will reap it.
we could also do other bookkeeping in the parent, eg to set up cgroups */
if (waitpid(pid, NULL, 0) == -1) {
fprintf(stderr, "'waitpid' for pid [%d] failed (Reason: %s)\n", pid, strerror(errno));
exit(EXIT_FAILURE);
}
return 0;
}
To compile, save this code (TODO: save this as a git code) as clone_example.c
> gcc clone_example.c -o bash_ex
Now when you run ./bash_ex, you will get a new bash child process.
> ./bash_ex
Child Pid: [13225] Invoking Command [/bin/bash]
The man page of clone describes the CLONE_NEWIPC flag as below:
If CLONE_NEWIPC is set, then create the process in a new IPC namespace. If this flag is not set,
then (as with fork(2)), the process is created in the same IPC namespace as the calling process.
This flag is intended for the implementation of containers.
Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWIPC;.
We will create a Shared Memory Segment in the parent shell and confirm that we can see the segment we created.
# create a segment
> ipcmk -M 4096
# list the shared memory segments
> ipcs -m
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
0x6b68f06d 0 ec2-user 644 4096 0
Now run the ./bash_ex you just created (with the new flag); when you do ipcs -m, you won't see the segments created earlier, because you are in a new IPC namespace.
# cloned process with CLONE_NEWIPC set
> ./bash_ex
Child Pid: [12624] Invoking Command [/bin/bash]
## no shared memory listed
> ipcs -m
------ Shared Memory Segments --------
key shmid owner perms bytes nattch status
CLONE_NEWNS creates a new mount namespace. If the child process you created has CLONE_NEWNS set, then the mount(2) and umount(2) system calls will only affect the child process (or processes that live in the same namespace). You can have multiple processes in the same mount namespace if you create new child processes without setting CLONE_NEWNS.
Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWNS;.
Run ./bash_ex; we will umount tmpfs and prove that it only gets removed in the child process, not in the parent.
> ./bash_ex
Child Pid: [12785] Invoking Command [/bin/bash]
## list the mount type tmpfs
> mount -l -t tmpfs
tmpfs on /dev/shm type tmpfs (rw,relatime)
## unmount tmpfs
> umount tmpfs
## show that the tmpfs got unmounted
> mount -l -t tmpfs
Meanwhile the parent shell (or any other process in the global namespace) will still see tmpfs mounted.
## tmpfs is still mounted
> mount -l -t tmpfs
tmpfs on /dev/shm type tmpfs (rw,relatime)
NOTE: Please don't mistake the mount namespace for process jailing; this has nothing to do with jailing.
A PID namespace provides an isolated environment for PIDs: PIDs in a new namespace start at 1, somewhat like a standalone system, and calls to
fork(2), vfork(2), or clone() will produce processes with PIDs that are unique within the namespace. The first process created in a new
namespace (i.e., the process created using the CLONE_NEWPID flag) has the PID 1, and is the "init" process for the namespace. Children that are
orphaned within the namespace will be reparented to this process rather than init(8).
Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWPID;. Execute the code and check the pid of the process; it should be 1.
> ./bash_ex
Child Pid: [1] Invoking Command [/bin/bash]
> echo $$
1
If you do a pstree or ps auxwww, you will see a lot of other processes too. This is because those tools work by reading the /proc dir, and our /proc is still pointing to the parent process's namespace.
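If you also pass CLONE_NEWNS (an extra assumption on top of the code above, so that the mount below doesn't clobber the parent's /proc) and mount a fresh procfs inside the child, the tools start agreeing with the namespace:
# inside a child created with SIGCHLD|CLONE_NEWPID|CLONE_NEWNS
# mount a proc instance that belongs to this pid namespace
> mount -t proc proc /proc
# now ps/pstree only see processes in this namespace (bash is pid 1)
> ps aux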
Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWNET;.
If you do ip addr on the terminal you will see multiple interfaces, like lo, eth0, etc. Now execute the newly compiled code and do an ip addr at the child bash prompt; you will see only the lo interface.
ip addr on a normal bash prompt:
> ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
..snip..
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc pfifo_fast state UP qlen 1000
link/ether 0a:a5:84:25:0a:db brd ff:ff:ff:ff:ff:ff
..snip..
ip addr on a bash process created with CLONE_NEWNET:
> ./bash_ex
Child Pid: [13182] Invoking Command [/bin/bash]
## only lo is shown
> ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
..snip..
A UTS namespace is the set of identifiers returned by uname(2); among these, the domain name and the host name can be modified by setdomainname(2) and sethostname(2), respectively. Changes made to the identifiers in a UTS namespace are visible to all other processes in the same namespace, but are not visible to processes in other UTS namespaces.
Recompile the code after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWUTS;.
We should be able to change the hostname in the new process and still not affect the hostname of the global namespace.
> ./bash_ex
Child Pid: [13225] Invoking Command [/bin/bash]
# change the hostname
> hostname foo.bar
> hostname
foo.bar
Meanwhile, the hostname as per the global namespace is still unaltered.
## hostname of the system
> hostname
test.qa
The redhat cgroup Doc is a very beautiful doc on per-process resource management. Reading it is a must to really understand cgroups and use them efficiently. The Kernel Doc has the implementation details. The cgroup subsystem-level doc is useful when you want to tweak each subsystem.
cgroups (control groups) is a Linux kernel feature that limits, accounts for and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes. cgroups allow you to allocate resources—such as CPU time, system memory, network bandwidth, or combinations of these resources—among user-defined groups of tasks (processes) running on a system.
I will just show an example of how to use them; me trying to explain cgroups would be doing injustice to cgroups :-)
As you know, cgroups doesn't have any system calls of its own but uses the VFS system calls instead. The cgroup filesystem is mounted at /cgroup. So to limit memory we just need to put a cap on memory.limit_in_bytes and memory.memsw.limit_in_bytes in /cgroup/memory/<subgroup>/.
# create the subgroup
> mkdir /cgroup/memory/test/
# jump to newly created subgroup
> cd /cgroup/memory/test/
# list the cgroup
> lscgroup
..snip..
memory:/
memory:/test
..snip..
Now let's write a quick one-liner that uses well over 1MB of memory and make sure it is able to run.
# create an array with elements 1 to 1024x1024x10 > 1M
> perl -le '@x=[1..1024*1024*10]; print "done"'
done
Using cgroups we can put a cap on the process's memory and swap:
# cap on memory
> echo $((1024*1024)) > /cgroup/memory/test/memory.limit_in_bytes
# cap on swap too (else it will swap out and run)
> echo $((1024*1024)) > /cgroup/memory/test/memory.memsw.limit_in_bytes
# run the process in the newly created cgroup `test` (and get killed)
> cgexec -g memory:test perl -le '@x=[1..1024*1024*10];print "done"'
Killed
# We can check the `dmesg` to confirm it
> dmesg | tail
[76470.936104] [<ffffffff8148b948>] page_fault+0x28/0x30
[76470.938226] Task in /test killed as a result of limit of /test
[76470.940815] memory: usage 1024kB, limit 1024kB, failcnt 7
[76470.943113] memory+swap: usage 1024kB, limit 1024kB, failcnt 0
[76470.945670] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[76470.948289] Memory cgroup stats for /test: cache:0KB rss:1024KB rss_huge:0KB mapped_file:0KB writeback:0KB swap:0KB inactive_anon:0KB active_anon:976KB inactive_file:0KB active_file:0KB unevictable:0KB
[76470.957530] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[76470.960909] [ 4986] 0 4986 85839 595 22 0 0 perl
[76470.964260] Memory cgroup out of memory: Kill process 4986 (perl) score 2339 or sacrifice child
[76470.968144] Killed process 4986 (perl) total-vm:343356kB, anon-rss:828kB, file-rss:1552kB
>
You can also use the cgroup.procs file to achieve the same without using cgexec.
- cgroup.procs: list of thread group IDs in the cgroup. This list is
not guaranteed to be sorted or free of duplicate TGIDs, and userspace
should sort/uniquify the list if this property is required.
Writing a thread group ID into this file moves all threads in that
group into this cgroup.
To use cgroup.procs, we just need to write our pid to /cgroup/memory/test/cgroup.procs (if you are wondering who created this file: when you create a subgroup via mkdir, this file gets created for you).
> echo $$ > /cgroup/memory/test/cgroup.procs
> perl -le '@x=[1..1024*1024*10];print "done"'
Killed
cgroup is very powerful; I would suggest you read the docs before implementing anything with it.
- veth - Virtual Ethernet device that comes as a pair of devices; anything that is sent to one device will come out of the other
- bridge - Virtual Ethernet Bridge Device
- netns - Network Namespace
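As a quick sanity check of the veth idea (separate from the full container walkthrough below), you can create a pair and list it; deleting one end removes both:
# create a veth pair; packets sent into veth0 come out of veth1
> ip link add veth0 type veth peer name veth1
> ip link show type veth
# clean up (removes both ends of the pair)
> ip link del veth0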
+--------------+ +--------------+
| | | |
| Container 1 (iface) <====== (bridge) ======> (iface) Container 2 |
| | | |
+--------------+ +--------------+
- create a bridge
- activate the bridge
- create a veth pair (vethA1/vethA2)
- set the bridge as the master of one end of the veth pair (vethA1)
- bring vethA1 up
- attach vethA2 to your container namespace (container1)
- rename the vethA2 interface to eth1 (optional)
- give an ip addr to the interface (use the new name, or vethA2)
- bring up the interface
- do an arping to make sure the interface is good
- create another veth pair (vethB1/vethB2)
- set the bridge as the master of one end of the veth pair (vethB1)
- bring vethB1 up
- attach vethB2 to your other container namespace (container2)
- rename the vethB2 interface to eth1 (optional)
- give an ip addr to the interface (use the new name, or vethB2)
- bring up the interface
- do an arping to make sure the interface is good
Start two processes with new network namespaces. In case you forgot the HOWTO, recompile the template code we wrote for Kernel Namespaces after changing static int clone_flags = SIGCHLD; to static int clone_flags = SIGCHLD|CLONE_NEWNET;.
When you execute the compiled binary twice, you will get two processes in two different network namespaces. By default these two containers won't have any interface attached other than lo, so they won't be able to talk to each other. Let's make them talk!
Start Process 1 (prompt 1)
# process 1
> ./bash_ex
Child Pid: [2264] Invoking Command [/bin/bash]
Start Process 2 (prompt 2)
# process 2
> ./bash_ex
Child Pid: [2307] Invoking Command [/bin/bash]
On the Global Namespace
# to make life easier, lets set the 2 pids as our
# namespaces
# (pid from prompt 1)
> pidA=2264
# (pid from prompt 2)
> pidB=2307
# make it ready for `ip netns` to read
# (strace told me so)
> mkdir -p /var/run/netns
> ln -s /proc/$pidA/ns/net /var/run/netns/$pidA
> ln -s /proc/$pidB/ns/net /var/run/netns/$pidB
# create the bridge
> ip link add dev br1 type bridge
# bring up the bridge
> ip link set br1 up
# veth pair I
# mtu can be fetched by calling (ip link show br1)
> ip link add name vethA1 mtu 1500 type veth peer name vethA2 mtu 1500
# enslave vethA1 to br1
> ip link set vethA1 master br1
# bring vethA1 up
> ip link set vethA1 up
# attach other end of veth to a namespace
> ip link set vethA2 netns $pidA
# rename vethA2 to eth1 (optional)
> ip netns exec $pidA ip link set vethA2 name eth1
# attach an ipaddr to the interface
> ip netns exec $pidA ip addr add 192.168.1.1/24 dev eth1
# bring the interface up
> ip netns exec $pidA ip link set eth1 up
# test by an arping
> ip netns exec $pidA arping -c 1 -A -I eth1 192.168.1.1
ARPING 192.168.1.1 from 192.168.1.1 eth1
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
# veth pair II
> ip link add name vethB1 mtu 1500 type veth peer name vethB2 mtu 1500
# enslave vethB1 to br1
> ip link set vethB1 master br1
# bring vethB1 up
> ip link set vethB1 up
# attach vethB to a namespace
> ip link set vethB2 netns $pidB
# rename to eth1 (optional)
> ip netns exec $pidB ip link set vethB2 name eth1
# attach an ipaddr to interface
> ip netns exec $pidB ip addr add 192.168.1.2/24 dev eth1
# bring the interface up
> ip netns exec $pidB ip link set eth1 up
# arping test
> ip netns exec $pidB arping -c 1 -A -I eth1 192.168.1.2
ARPING 192.168.1.2 from 192.168.1.2 eth1
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
# remove the stuff we brought in
> unlink /var/run/netns/$pidA
> unlink /var/run/netns/$pidB
> rmdir /var/run/netns
In the prompts you created (that is how you got the PIDs), try to do a connect. Reuse the same prompts; don't kill the processes already created (and thus the namespaces).
Prompt 1
> ./bash_ex
Child Pid: [2264] Invoking Command [/bin/bash]
> ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 6a:8f:4a:61:97:90 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.1/24 scope global eth1
..snip..
# listen on one container
> nc -l 1234
hi
Prompt 2
> ./bash_ex
Child Pid: [2307] Invoking Command [/bin/bash]
> ip addr
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
6: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 1e:44:08:bf:a5:ad brd ff:ff:ff:ff:ff:ff
inet 192.168.1.2/24 scope global eth1
..snip..
# talk to the other container
> echo "hi" | nc 192.168.1.1 1234
>
To poke into your docker container's network namespace (netns), you need to link it into /var/run/netns and use ip netns exec (the ip command looks into that dir to get its info). man(7) namespaces has more info about the /proc/$pid/ns namespaces.
eg,
# it might be missing
> mkdir -p /var/run/netns
# link the process in the container
# you should find $pid
> ln -s /proc/$pid/ns/net /var/run/netns/$pid
# list your namespaces (it should return the pid)
> ip netns ls
<pid>
# list the interfaces in
> ip netns exec <pid> ip addr
....info about interface in the namespace <pid> ...
# please remove what you did after you are done with the experiments
> unlink /var/run/netns/$pid
# don't force delete, someone else too might be mucking around :-)
> rmdir /var/run/netns
A curious mind can do a tcpdump -i br1 -n (provided br1 is your bridge name) to see the packets going back and forth.
+---------------------------------------+
| |
| +-----------+ |
| | | |
| | Container | |
| | (iface) <===== bridge |
| +-----------+ ^ |
| | |
| Host (iface) |
+---------------------------------------+
- create a bridge
- activate the bridge
- create a veth pair (vethA1/vethA2)
- set the bridge as the master of one end of the veth pair (vethA1)
- bring vethA1 up
- attach vethA2 to your container namespace (container1)
- rename the vethA2 interface to eth1 (optional)
- give an ip addr to the interface (use the new name, or vethA2)
- bring up the interface
- do an arping to make sure the interface is good
- add a routing entry from the container to the bridge
- assign an ip addr to the bridge
- add a routing entry from the host to the container (via the bridge)
We need to attach one end of the veth pair to the container while the other end stays in the global namespace. Here we need to manually assign an ip to the bridge br1, add the routing table entry from the host to the container, and also add a routing entry back from the container to the host (via the veth endpoint).
Start Process 1 (prompt 1)
> ./bash_ex
Child Pid: [2264] Invoking Command [/bin/bash]
On Global Namespace
# pid from the prompt 1
> pidA=2264
# make network namespace visible to `ip netns`
> mkdir -p /var/run/netns
> ln -s /proc/$pidA/ns/net /var/run/netns/$pidA
# setup the veth
> ip link add dev br1 type bridge
> ip link set br1 up
> ip link add name vethA1 mtu 1500 type veth peer name vethA2 mtu 1500
> ip link set vethA1 master br1
> ip link set vethA1 up
> ip link set vethA2 netns $pidA
> ip netns exec $pidA ip link set vethA2 name eth1
> ip netns exec $pidA ip addr add 192.168.1.1/24 dev eth1
> ip netns exec $pidA ip link set eth1 up
> ip netns exec $pidA arping -c 1 -A -I eth1 192.168.1.1
ARPING 192.168.1.1 from 192.168.1.1 eth1
Sent 1 probes (1 broadcast(s))
Received 0 response(s)
# route entry back to the host
> ip netns exec $pidA ip route add 192.168.2.0/24 dev eth1 proto kernel scope link src 192.168.1.1
# add ip addr to bridge
> ip addr add 192.168.2.1/24 dev br1
# add a route entry
> ip route add 192.168.1.0/24 dev br1 proto kernel scope link src 192.168.2.1
Reuse the same prompt; don't kill the process (and thus the namespaces). Start nc on the container.
Prompt 1
> ./bash_ex
Child Pid: [2264] Invoking Command [/bin/bash]
> nc -l 1234
hi
Send "hi" to port 1234
listening on container using nc
. You should be seeing "hi" in the container.
Global Namespace
> echo hi | nc 192.168.1.1 1234
pivot_root moves the root file system of the current process to the directory put-old and makes new-root the new root file system. chroot runs a command with a changed root dir. This will help the process run in the rootfs it prefers, along with its custom libraries.
Earlier we mentioned test.block, which can be mounted and contains the rootfs for centos. It also contains some changes which we brought in (a file with content "bar"). You can mount test.block and make your process run with the new mount point as its rootfs.
# mount the ext4 filesystem
> mount -o loop test.block /tmp/mnt/
# copy the new code to the new mount
> cp bash_ex /tmp/mnt
# change to new root
> cd /tmp/mnt/
# dir for pivot_root
> mkdir put-old
# pivot root, so you can umount put-old
> pivot_root . put-old
# chroot and start your process
> chroot . ./bash_ex
# you are now in /
> pwd
/
# ls should return a view from the new /
> ls
bash_ex bin dev etc home lib lib64 lost+found media mnt opt proc put-old root run sbin selinux srv sys tmp usr var vigith_test
# mount your proc
> mount -t proc none /proc
# put your resolv.conf
> cat > /etc/resolv.conf
.. write your stuff ..
# fill in the mtab
> cat /proc/mounts > /etc/mtab
# you are process 1 now
> pstree -a -p
bash,1
└─bash,22 -i
└─pstree,46 -a -p
I am learning and new to this, so there will be a lot of bugs and misunderstanding of concepts. Please send me pull requests if you find something really preposterous.