In these notes, we follow The Linux Kernel Module Programming Guide (LKMPG) closely. Each directory in this repository contains one or more kernel module examples, most of them taken from LKMPG. We attempt to tidy up the examples, modernize them, or comment on them further. When in doubt, the ultimate arbiter is the kernel source. Use the source, Luke.
Other resources:
The build system is called kbuild.
- Begin with Building external modules.
- Even more details about kernel makefiles here.
- Notes online on what happens when make is run under kbuild: https://0xax.gitbooks.io/linux-insides/content/Misc/linux-misc-2.html.
We need the kernel headers, which we can install with:
sudo apt install linux-headers-$(uname -r)
- modinfo, inspect a .ko file.
- lsmod, view which modules are loaded and the use count of each module.
  - Modules are also listed under /sys/module; built-in kernel modules can appear there as well.
- insmod, load a module.
  - Arguments can be passed via insmod mymodule.ko variable=value; the module declares them with module_param(). See hello-5.c. E.g. insmod hello-5.ko mystring="foo" myintarray=-1,3
  - A module cannot be loaded if one with the same name is already loaded. This includes built-in kernel modules.
- rmmod, remove a module.
- dmesg | tail, view diagnostic messages printed with pr_info() and similar functions.
Kernel modules are object files whose symbols are resolved when the module is loaded with insmod (or with modprobe for already-installed modules). Kernel symbols, including the exported ones, are listed in /proc/kallsyms.
The MODULE_LICENSE() macro exists primarily for three reasons:
- So modinfo can show license info for users wanting to verify that their setup is free
- So the community can ignore bug reports including proprietary modules
- So vendors can do likewise based on their own policies
Information on MODULE_LICENSE() and other MODULE_* macros can be found by reading the source of the linux/module.h header.
Device drivers are a class of kernel modules providing functionality for hardware.
Device files live under /dev and provide a means of communicating with hardware. They give a general method of talking to drivers: /dev/sound may be backed by the es1370.ko driver, or by some other one. Device files can be created with e.g. mknod /dev/coffee c 12 2.
Major and minor numbers (see the assigned major/minor number listing) are shown by ls -l in the form MAJOR, MINOR. The major number identifies the device driver controlling the file, and the minor number distinguishes between the different (potentially abstract) pieces of hardware handled by that driver. When a device file is accessed, the kernel determines the module controlling it by the major number. The minor number is for the module itself to consume. Major numbers and the drivers currently registered are listed in /proc/devices.
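To make the major/minor pairing concrete, here is a minimal userland sketch. It assumes Linux with glibc, where makedev(), major() and minor() come from sys/sysmacros.h; the helper names pack_device_number() and device_number_matches() are made up for illustration. The kernel-side counterpart of makedev() is MKDEV().

```c
#include <sys/sysmacros.h> /* makedev(), major(), minor() on Linux/glibc */
#include <sys/types.h>     /* dev_t */

/* Pack a major/minor pair into a single dev_t value. */
static dev_t pack_device_number(unsigned int maj, unsigned int min)
{
    return makedev(maj, min);
}

/* Return 1 if dev carries exactly the given major and minor numbers. */
static int device_number_matches(dev_t dev, unsigned int maj, unsigned int min)
{
    return major(dev) == maj && minor(dev) == min;
}
```

For the mknod /dev/coffee c 12 2 example, pack_device_number(12, 2) yields the device number whose major part selects the driver and whose minor part is left for the driver to interpret.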
Device files are either “character” or “block”. Block devices have IO in blocks of bytes and can buffer requests, while character devices work with bytes.
The struct file_operations holds operation callbacks such as read() and write(). Unused operations are set to NULL. The variable is commonly called fops. For merely registering /proc handlers, struct proc_ops replaces struct file_operations (kernels v5.6+).
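The callback-table idea can be sketched in plain userland C. Everything below is a simplified stand-in: struct toy_file_operations, toy_read() and toy_dispatch_write() are hypothetical names, and the real struct file_operations has many more members with different signatures.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h> /* ssize_t */

/* Hypothetical, heavily simplified stand-in for struct file_operations. */
struct toy_file_operations {
    ssize_t (*read)(char *buf, size_t len);
    ssize_t (*write)(const char *buf, size_t len);
};

/* Copy at most len bytes of a fixed message into buf. */
static ssize_t toy_read(char *buf, size_t len)
{
    static const char msg[] = "hello";
    size_t n = len < sizeof(msg) - 1 ? len : sizeof(msg) - 1;

    memcpy(buf, msg, n);
    return (ssize_t)n;
}

/* Designated initializers: .write is not mentioned, so it is NULL. */
static const struct toy_file_operations toy_fops = {
    .read = toy_read,
};

/* Dispatch through the table; an unset callback means "not supported". */
static ssize_t toy_dispatch_write(const struct toy_file_operations *ops,
                                  const char *buf, size_t len)
{
    if (ops->write == NULL)
        return -1; /* the kernel would return a negative errno instead */
    return ops->write(buf, len);
}
```

The point of the designated-initializer style is that only the supported operations need to be listed; everything else defaults to NULL.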
The struct file represents an abstract open file.
To register a major number for a character device, use either register_chrdev_region() or alloc_chrdev_region(). The former takes a fixed major number, while the latter dynamically allocates one that is available. Then the functions cdev_alloc(), cdev_init(), cdev_add(), etc., are used. For an example of the cdev interface, see ioctl/ioctl.c.
In <linux/module.h>, the following functions are available to view or modify the use counter:
try_module_get() /* Increment the reference count of current module. */
module_put() /* Decrement the reference count of current module. */
module_refcount() /* Return the value of reference count of current module. */
This can all be accomplished better by the .owner = THIS_MODULE member of struct file_operations. See SA/a/6079839 and an explanation of the VFS as well as lwn.net/Articles/22197/.
This covers the advanced situation in which a module must support multiple incompatible kernel versions.
#include <linux/version.h> /* LINUX_VERSION_CODE, KERNEL_VERSION() */

/* Conditionally compile for kernel 2.6.16 or less */
#if LINUX_VERSION_CODE <= KERNEL_VERSION(2,6,16)
/* ... */
#endif
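To see why such comparisons work, here is a small userland sketch of the version packing. TOY_KERNEL_VERSION and TOY_LINUX_VERSION_CODE are hypothetical mirrors of the kernel macros, and the 5.6.0 value is an assumed build target chosen for illustration; recent kernels additionally clamp the last component to 255.

```c
/* Hypothetical userland mirror of the kernel's KERNEL_VERSION() packing. */
#define TOY_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/* Assumed version we pretend to compile against. */
#define TOY_LINUX_VERSION_CODE TOY_KERNEL_VERSION(5, 6, 0)

/* Pick the structure used for /proc handlers depending on the version. */
static const char *proc_handler_struct(void)
{
#if TOY_LINUX_VERSION_CODE >= TOY_KERNEL_VERSION(5, 6, 0)
    return "struct proc_ops";
#else
    return "struct file_operations";
#endif
}
```

Because each component occupies its own bit range, ordinary integer comparison orders versions correctly, which is what makes the #if tests possible.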
Each directory in this repository contains one or more kernel module examples. Here we describe them and comment on the particularities of their source code.
This kernel module is a character device. Userland processes can interact with the device by treating it as a file (named /dev/chardev).
We define four functions, device_{open,release,read,write}, with which we populate a struct file_operations. The file_operations structure controls the behavior of the character device. For example, when a process attempts to read from the character device, the function registered under the structure member .read is called.
There are two functions marked with __init and __exit, which are the entry point and exit point of the kernel module (analogous to main in a C userspace program). The __init attribute allows the kernel to free the function's memory once initialization is done, and __exit allows the exit code to be dropped when the module is built into the kernel, so both are optional optimizations. The actual lines that tell the kernel which functions are the entry and exit points are module_init() and module_exit().
In our init function, we register a character device with register_chrdev so that the kernel dynamically assigns a major number for us (see the 234-254 range in the device number listing). This looks like:
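major = register_chrdev(0, DEVICE_NAME, &chardev_fops); /* 0 requests a dynamic major */
/* ... */
cls = class_create(THIS_MODULE, "chardev"); /* since kernel v6.4: class_create("chardev") */
device_create(cls, NULL, MKDEV(major, 0), NULL, "chardev"); /* makes /dev/chardev appear */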
The class_create call creates a class structure. Classes have multiple uses; a notable one is exporting device numbers under /sys/class/$name, where $name is the second parameter of class_create(). The device numbers are used by udev(7), e.g. with tools like udevadm(8), for device discovery (for example, mounting a filesystem when a USB stick is plugged in). Note that cls must be deallocated with class_destroy(); THIS_MODULE is a macro that expands to a pointer to the module's struct module, and MKDEV() combines a major and a minor number.
Our driver has a global buffer called msg, which must be protected from concurrent use by multiple processes; only one process can use the buffer at a time. For this purpose, we use a binary semaphore built on atomic updates: ATOMIC_INIT(val), atomic_cmpxchg(&x, comp, newval), and atomic_set(&x, val).
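The same compare-and-exchange pattern can be tried out in userland with C11 atomics. This is only an analogy, not the module's code: atomic_compare_exchange_strong() plays the role of atomic_cmpxchg(), atomic_store() that of atomic_set(), and the names DEVICE_FREE, DEVICE_BUSY, device_try_acquire() and device_release() are assumptions made for this sketch.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { DEVICE_FREE = 0, DEVICE_BUSY = 1 }; /* hypothetical state names */

/* Shared flag guarding the buffer; starts out free. */
static atomic_int device_state = DEVICE_FREE;

/* Try to take exclusive ownership; only one caller can win the exchange. */
static bool device_try_acquire(void)
{
    int expected = DEVICE_FREE;

    return atomic_compare_exchange_strong(&device_state, &expected, DEVICE_BUSY);
}

/* Give ownership back so that the next caller can acquire the device. */
static void device_release(void)
{
    atomic_store(&device_state, DEVICE_FREE);
}
```

A second call to device_try_acquire() fails until device_release() runs, which is the behavior an open handler would turn into an error such as -EBUSY.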
We keep track of the number of processes currently using the kernel module with try_module_get(THIS_MODULE) and module_put(THIS_MODULE), so that the kernel knows not to unload the module prematurely. Note that try_module_get() presents an issue, and there is a superior alternative. See SA/a/6079839.
Writing to the device fails with -EINVAL.
Reading from the device essentially calls put_user(*msg++, buf++) over and over until the whole message has been copied, and returns the number of bytes copied. The function put_user() copies from kernel memory to user memory: when a userland program attempts to read from the character device, a userland buffer is provided to kernel space for filling; note that it is marked with __user, as in char __user *buf.
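The shape of that read loop can be sketched in userland. toy_put_user() is a made-up stand-in that simply stores a byte, whereas the real put_user() also validates the user pointer; toy_device_read() is likewise a hypothetical name, and the real handler has a different signature.

```c
#include <stddef.h>
#include <sys/types.h> /* ssize_t */

/* Hypothetical userland stand-in for put_user(): store one byte at dst. */
static int toy_put_user(char value, char *dst)
{
    *dst = value;
    return 0; /* the real put_user() returns -EFAULT on a bad user pointer */
}

/* Copy msg into buf one byte at a time; return how many bytes were copied. */
static ssize_t toy_device_read(const char *msg, char *buf, size_t len)
{
    ssize_t bytes_read = 0;

    while (len > 0 && *msg != '\0') {
        if (toy_put_user(*msg++, buf++) != 0)
            break;
        len--;
        bytes_read++;
    }
    return bytes_read;
}
```

The loop stops either when the message ends or when the caller's buffer is full, so the return value may be smaller than the requested length.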
We can invoke trigger.sh every time chardev is loaded by writing the following udev rule in /etc/udev/rules.d/80-chardev.rules:
SUBSYSTEM=="chardev", ACTION=="add", RUN+="/path/to/chardev/trigger.sh"
Assuming the path is correctly modified to point to trigger.sh, and that we then run udevadm control --reload, the script will be invoked whenever insmod chardev.ko is performed. We can check that it has indeed run by inspecting its output in /tmp/chardev_trigger.log.
Another mechanism of communication with character devices is demonstrated: ioctl(2) calls. The function that handles the ioctl call is device_ioctl(), stored under the .unlocked_ioctl member of the fops structure. To define our own ioctls, we use the _IO* macros in chardev2.h. This public header is also used by userland programs, as they too need the ioctl macros. One important difference from the old chardev is that we no longer dynamically register a major number; instead we provide a fixed number MAJOR_NUM to register_chrdev(). This is important because the device number is used in the ioctl macros.
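To illustrate how the major number ends up inside an ioctl command, here is a userland sketch. The value 100 for MAJOR_NUM and the TOY_IOCTL_* command names are assumptions for this example, not necessarily what chardev2.h defines, and the bit positions used by the helpers are those of the common Linux encoding.

```c
#include <sys/ioctl.h> /* _IOR(), _IOW() */

#define MAJOR_NUM 100 /* assumed value, for illustration only */

/* Hypothetical commands: MAJOR_NUM is the type field, then a sequence number. */
#define TOY_IOCTL_SET_MSG _IOW(MAJOR_NUM, 0, char *)
#define TOY_IOCTL_GET_MSG _IOR(MAJOR_NUM, 1, char *)

/* In the common Linux encoding, bits 8-15 of a command hold the type field. */
static unsigned int toy_ioctl_type(unsigned long cmd)
{
    return (unsigned int)((cmd >> 8) & 0xff);
}

/* Bits 0-7 hold the command's sequence number. */
static unsigned int toy_ioctl_nr(unsigned long cmd)
{
    return (unsigned int)(cmd & 0xff);
}
```

Since the type field is baked into the command value at compile time, both the module and the userland program must agree on MAJOR_NUM, which is why a dynamically assigned major does not fit here.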
In chardev we used the .release fops, but now we use a worse alternative, try_module_get() and module_put(). This shouldn’t be used, but we demonstrate it regardless.
The init and exit functions use proc_create() and proc_remove() to create and remove the proc file. proc_create() returns a struct proc_dir_entry *. It is passed the file permissions, e.g. 0644, and a proc_ops struct with .proc_read = procfile_read. See linux/proc_fs.h for kernels v5.6+.
The function procfile_read uses copy_to_user(buffer, s, len) and advances the offset with *offset += len.
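The offset protocol can be simulated in userland. In this sketch memcpy() stands in for copy_to_user(), off_t for loff_t, and toy_procfile_read() together with its message string are hypothetical; the real handler also receives a struct file pointer.

```c
#include <string.h>
#include <sys/types.h> /* ssize_t, off_t */

/* Hypothetical stand-in for a .proc_read handler. */
static ssize_t toy_procfile_read(char *buffer, size_t buffer_len, off_t *offset)
{
    static const char s[] = "HelloWorld!\n";
    size_t len = sizeof(s) - 1;

    if (*offset >= (off_t)len)
        return 0; /* everything was already delivered: signal end of file */

    len -= (size_t)*offset;
    if (len > buffer_len)
        len = buffer_len;

    memcpy(buffer, s + *offset, len); /* copy_to_user() in the kernel */
    *offset += (off_t)len;            /* the handler advances the offset itself */
    return (ssize_t)len;
}
```

A caller such as cat keeps invoking the handler with the same offset pointer until it returns 0, so forgetting to advance the offset would make the read loop never terminate.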
After loading the module, use journalctl | tail
to find out the major number, and use
mknod mydevfile c <MAJOR> 0
to create a device file corresponding to this driver. Reading from this character device yields the configured byte value indefinitely.
When a process makes a syscall, it jumps to a location in the kernel named system_call. The syscall handlers are indexed in sys_call_table by the syscall number.
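From userland, the syscall number is visible through syscall(2). The sketch below assumes Linux with glibc, where SYS_getpid is provided by sys/syscall.h; getpid_by_number() is a name made up for this example.

```c
#include <sys/syscall.h> /* SYS_getpid */
#include <unistd.h>      /* syscall(), getpid() */

/* Invoke getpid by its syscall number instead of the libc wrapper. */
static long getpid_by_number(void)
{
    return syscall(SYS_getpid);
}
```

Both paths end up at the same table entry in the kernel, so getpid_by_number() and getpid() report the same process id.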
We wish to modify sys_call_table
to wrap our code around a particular syscall.
The control register cr0 modifies the behavior of the x86 processor. While the write-protect (WP) flag is set, the processor disallows kernel writes to read-only pages. Thus, to modify the table, we must temporarily clear WP.
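The bit manipulation involved is small enough to sketch in userland. WP is bit 16 of cr0; the helpers below only compute new register values and never touch the real register, and all TOY_ and cr0_ names are made up for this example.

```c
/* CR0.WP is bit 16 of the x86 control register cr0. */
#define TOY_CR0_WP (1UL << 16)

/* Return the cr0 value with write protection cleared. */
static unsigned long cr0_without_wp(unsigned long cr0)
{
    return cr0 & ~TOY_CR0_WP;
}

/* Return the cr0 value with write protection set again. */
static unsigned long cr0_with_wp(unsigned long cr0)
{
    return cr0 | TOY_CR0_WP;
}

/* Return 1 if the WP flag is set in the given cr0 value. */
static int cr0_wp_is_set(unsigned long cr0)
{
    return (cr0 & TOY_CR0_WP) != 0;
}
```

In a module, the value from cr0_without_wp() would be written to the register before patching the table, and the original value restored right afterwards.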
We will replace open() with what is conceptually
new_open():
    if proc_id() == MAGIC:
        pr_info(report which file is being opened)
    continue with normal open()
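The wrapping technique itself, saving the original table entry and installing a function that reports and then forwards, can be shown with an ordinary function-pointer table. Everything here is a userland toy: the table, toy_real_open() and the recorded path are hypothetical, and the check against MAGIC is left out for brevity.

```c
#include <stddef.h>

typedef long (*toy_syscall_fn)(const char *path);

enum { TOY_NR_OPEN = 0, TOY_NR_SYSCALLS = 1 };

/* Stand-in for the original open(): always "returns" descriptor 3. */
static long toy_real_open(const char *path)
{
    (void)path;
    return 3;
}

static toy_syscall_fn toy_sys_call_table[TOY_NR_SYSCALLS] = { toy_real_open };
static toy_syscall_fn toy_original_open;   /* saved entry, restored on exit */
static const char *toy_last_reported_path; /* stands in for pr_info() output */

/* The wrapper: report which file is being opened, then continue normally. */
static long toy_new_open(const char *path)
{
    toy_last_reported_path = path;
    return toy_original_open(path);
}

/* Save the original entry and put the wrapper in its place. */
static void toy_install_hook(void)
{
    toy_original_open = toy_sys_call_table[TOY_NR_OPEN];
    toy_sys_call_table[TOY_NR_OPEN] = toy_new_open;
}

/* Restore the original entry, as a module's exit function would. */
static void toy_remove_hook(void)
{
    toy_sys_call_table[TOY_NR_OPEN] = toy_original_open;
}
```

Restoring the saved entry on removal matters: leaving a pointer to unloaded code in the table would crash the next caller.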
We show various things that a kernel module can do with userspace processes.
Print information about a process.
The VFS is the layer between a call to write() and the specific code responsible for dealing with e.g. ext4, btrfs, and so on.
VFS translates pathnames into directory entries (dentries). A dentry points to an inode, a filesystem object. The inode contains information about the file, for example the file’s permissions, together with a pointer to the disk location or locations where the file’s data can be found.
To open an inode, a file structure is allocated (the kernel-side counterpart of a file descriptor). The file structure points to the dentry and to the operation callbacks taken from the inode; in particular, open() is then called so that the particular filesystem can do its work.
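Part of what an inode stores is visible from userland through stat(2), which resolves a pathname and reports the inode's metadata. The helper names in this small sketch are made up for illustration.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Fill *st from the inode behind path; return 0 on success, -1 on error. */
static int inode_info(const char *path, struct stat *st)
{
    return stat(path, st);
}

/* Return nonzero if the inode describes a directory. */
static int inode_is_directory(const struct stat *st)
{
    return S_ISDIR(st->st_mode);
}

/* Return only the permission bits of the inode's mode. */
static unsigned int inode_permissions(const struct stat *st)
{
    return (unsigned int)(st->st_mode & 07777);
}
```

After a successful inode_info() call, st_ino holds the inode number and inode_permissions() the permission bits mentioned above.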
Filesystems are registered and unregistered with
int register_filesystem(struct file_system_type *);
int unregister_filesystem(struct file_system_type *);
The registered filesystems are listed in /proc/filesystems. To mount a filesystem, VFS calls mount() and a new vfsmount is attached to the mountpoint; when pathname resolution reaches the mountpoint, it jumps into the root of the vfsmount.
A superblock object represents a mounted filesystem.
- [X] What is the loff_t * parameter in the .read operations of struct file_operations and struct proc_ops?
  The offset is the current position in the file. The read operation gets called again and again until 0 is returned. Notice that it is we who advance the offset via a simple +=.
- [X] How does the sysfs example work? I don’t understand kobject_create_and_add(), especially the second argument. How is an attribute a kobject?
  Passing kernel_kobj as the second argument makes it the parent, and so the kobject lies under /sys/kernel.
- [X] What does class_create() do?
  Creates entries with major/minor under /sys/class, useful for device discovery by udev(7).