The Linux Kernel Series: Every Article

System Calls T-Z

times() - This call returns the processor time consumed by the calling process and its children. The "time data" includes user time, system time, the children's user time, and the children's system time.
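For illustration, here is a minimal sketch that calls times() and converts the four values from clock ticks to seconds -
Code:
#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>

int main(void)
{
   struct tms t;
   long ticks = sysconf(_SC_CLK_TCK);   /* clock ticks per second */

   if (times(&t) == (clock_t)-1)
     return 1;

   printf("user:            %.2f s\n", (double)t.tms_utime  / ticks);
   printf("system:          %.2f s\n", (double)t.tms_stime  / ticks);
   printf("children user:   %.2f s\n", (double)t.tms_cutime / ticks);
   printf("children system: %.2f s\n", (double)t.tms_cstime / ticks);
   return 0;
}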

truncate() - A file (specified by its path) is resized to a particular size. If the file is larger than the desired size, the extra data is lost. If the file is smaller than the desired size, null bytes are appended to reach the final file size.
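A minimal sketch (the path "example.txt" is only an illustration; the file must already exist for the call to succeed) -
Code:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
   /* Resize the file to exactly 1024 bytes: longer files lose the
      extra data; shorter files are padded with null bytes. */
   if (truncate("example.txt", 1024) == -1) {
     perror("truncate");
     return 1;
   }
   return 0;
}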

umask() - When files are created, permissions must be set on them. The umask() kernel call sets the calling process's file mode creation mask, which determines which permission bits are cleared on the files the process creates.
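For example, with a mask of 022, a file requested with mode 0666 is created as 0644 (the group/other write bits are cleared). A minimal sketch (the file name "masked.txt" is made up) -
Code:
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
   mode_t old = umask(022);   /* set the new mask; the old one is returned */

   int fd = open("masked.txt", O_CREAT | O_WRONLY, 0666); /* becomes 0644 */
   if (fd == -1) {
     perror("open");
     return 1;
   }
   close(fd);

   printf("previous mask was %03o\n", (unsigned)old);
   umask(old);                /* restore the original mask */
   return 0;
}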

umount() - This is the kernel call used to unmount filesystems.

umount2() - This syscall is the same as umount(), except that it also accepts flags that control how the filesystem is unmounted.

uname() - This syscall shares its name with a commonly known and used command-line tool. Uname() returns various information about the running kernel.
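A minimal sketch that prints the same fields the uname command shows -
Code:
#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
   struct utsname u;

   if (uname(&u) == -1)
     return 1;

   /* Roughly the output of "uname -snrvm" */
   printf("%s %s %s %s %s\n",
          u.sysname, u.nodename, u.release, u.version, u.machine);
   return 0;
}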



unlink() - This syscall is used to delete soft links, sockets, pipes, and device files. The remove() function is used for regular files and rmdir() is needed to delete directories, but these calls do not work on soft links, sockets, pipes, or device files. That is why unlink() is needed. As you know, sockets and pipes are used by processes quite often, so when the object being removed is still in use, unlink() removes its name immediately, but the object itself persists until the last process using it is finished with it. unlinkat() is the same as unlink() except for some minor differences in its functioning and the fact that it accepts flags.

unshare() - Sometimes, processes may share various resources (like virtual memory) after they fork. It is not always desirable for processes to share so many resources. The unshare() syscall separates the processes further by giving the calling process its own copies of resources rather than sharing them.

uselib() - As many GNU/Linux users know, many programs need various shared libraries (like the ones under /lib/, /lib32/, /lib64/, /libx32/, and elsewhere). The uselib() system call is used by processes to load a needed library, which is specified by its path. (Modern dynamic linkers map libraries with mmap() instead, so this syscall is now largely historical.)

ustat() - The calling process can view the statistics of mounted filesystems using this syscall.

utime() - This system call is used to change the access and modification times associated with a file, which is specified by its path. This kernel call is commonly used after a file is accessed or modified.
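A minimal sketch that sets a file's access time to "now" and its modification time to an hour ago (the path "example.txt" is only an illustration) -
Code:
#include <stdio.h>
#include <time.h>
#include <utime.h>

int main(void)
{
   struct utimbuf t;
   t.actime  = time(NULL);          /* new access time: now */
   t.modtime = time(NULL) - 3600;   /* new modification time: an hour ago */

   if (utime("example.txt", &t) == -1) {
     perror("utime");
     return 1;
   }
   return 0;
}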

utimensat() - This kernel call is just like utime(), but the difference lies in the precision. utime() is precise down to the second (and its relative utimes() to the microsecond) while utimensat() can handle nanoseconds.

vfork() - This syscall is like fork(), but with some differences. fork() gives the child process its own copy of the parent's page tables (copied on write), while vfork() makes the parent and child share the parent's address space until the child calls execve() or exits. The parent is suspended while the sharing lasts, and the child still shares many other attributes with the parent.

vhangup() - A hang-up is simulated on the terminal. This is needed so users get a fresh terminal (tty) when they log in after someone else has been logged into the terminal.

vm86() - This is a kernel call used to initiate a virtual 8086 mode which is commonly needed to run dosemu. The older (and obsolete) version of this system call has been renamed to vm86old(). If dosemu is able to run on your system, then your kernel has the vm86() system call. If not, then your platform cannot support this platform-specific kernel call.

vmsplice() - This syscall splices pages of memory to a pipe.

wait() - The calling process will be suspended until one of its child processes exits or is killed.

wait3() and wait4() - These syscalls are deprecated, so waitid() is used instead.

waitid() - This syscall makes the parent process pause until one of its child processes undergoes a particular state change (see the NOTE a few lines below). waitid() is more precise than waitpid() since a specific state change can be the trigger for the parent to resume execution.

waitpid() - A child process can be specified by PID. Then, the parent process (which is also the calling process of waitpid()) will be paused until the specified child process experiences a state change of any kind.

NOTE: A "state change" refers to one of the following events - the termination of a process, a process resumes execution, or a process is stopped.

write() - This syscall writes data from a buffer to the specified file descriptor.
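As a tiny example, this sketch writes a string to file descriptor 1 (standard output) -
Code:
#include <unistd.h>

int main(void)
{
   const char msg[] = "Hello from write()\n";
   write(STDOUT_FILENO, msg, sizeof(msg) - 1);   /* -1 skips the trailing NUL */
   return 0;
}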

Now, we have discussed most of the official syscalls. The only ones I did not discuss were most of the obsolete or platform-specific syscalls (although I mentioned some).
 


Non-Standard Syscalls

There is another set of kernel calls called the non-standard syscalls or unimplemented system calls. These kernel calls are not part of the vanilla kernel, or at least not in recent versions.

afs_syscall() - This is a seldom-used system call for OpenAFS. OpenAFS is an open-source alternative to the Andrew File System, both of which are distributed filesystems. This syscall managed the input/output of OpenAFS. Most developers now use ioctl() instead.

break() - No info available. I cannot find any information about this syscall. If anyone knows anything, please share with us.

getpmsg() - This kernel call is equivalent to getmsg(). However, getpmsg() offers calling processes greater control than getmsg(). Both of these syscalls allow a process to get a message from a stream.

gtty(), stty() - These system calls are mostly obsolete due to ioctl(). gtty() may still exist for backwards compatibility, but in general, developers will no longer see this syscall. It is (or was) used to control terminal devices through the device files (/dev/tty*). stty() is the counterpart of gtty().

idle() - This kernel call makes the CPU idle.

lock() - This system call locks or unlocks access to registers.

madvise1() - This syscall is not used in the vanilla kernel, but it is used in the Android Linux kernels. madvise1() is an alias of madvise(), both of which advise the kernel on how to allocate and manage memory for the calling process.

mpx() - This kernel call creates and manages multiplexed files.

phys() - mmap() is now used instead of phys().

prof() - This syscall is a profiler which is used to measure the performance of the kernel.

putpmsg() - This is equivalent to putmsg(). Both of these syscalls are used to put messages on the message stream.

security() - Some Linux Security Modules (LSMs) use (or had used) this kernel call to define system calls for security purposes. Most LSMs now use socketcall().

tuxcall() - This call comes from a TUX module and is sent to the kernel. The call asks the kernel to perform some task for the module. A TUX module is basically a server application/daemon in the form of a Linux module. Imagine an Apache server being a kernel module; that is essentially how TUX works.

vserver() - This is a virtualization system call used by the specialized Linux-VServer kernel (http://linux-vserver.org/Welcome_to_Linux-VServer.org).


Last Note on System Calls

This last "syscall" does not belong with the others, but I will mention it here. On some syscall tables (I will explain that in a moment), people may see a ni(), sys_ni(), ni_syscall(), or something of that manner. This is a syscall place holder. To better understand this, it is important to know that syscalls are assigned a syscall number. For instance, here is a syscall table for the v3.10 Android's Linux kernel (https://android.googlesource.com/kernel/common.git/ /android-3.10-adf/arch/m32r/kernel/syscall_table.S). Notice that each syscall has a number associated with it. For instance, write() is syscall "6". When syscalls are given new numbers or are removed, the ni() syscall is a null kernel call that reserves that syscall number for later or private use. These syscall tables may also be called "system call maps". The Linux kernel (when executing) stores the syscall map in the "%eax" register. These numbers are probably only important to syscall developers or assembly programmers.
 
IDs

Linux and many other operating systems use various identifiers. An identifier is a token or number used to designate some object or entity. Such an object or entity may be a user, group, process, etc. The Linux kernel prefers to use identifiers when referring to some object. For instance, the kill() system call sends a signal that can stop a process, but it must first be given the process's identifier. This syscall (as well as many other kernel calls) will only accept an identifier, not a name. The Linux kernel has a variety of types of identifiers, as you will see.

Group ID (GID) - Every group is given a GID. Many Unix systems have different GID maximums (GIDs are not unlimited). Many distros have a limit of 0-65535 or 0-4294967295. This limit is due to how GIDs are stored. As many of you know, the Linux kernel is written in C and Assembly. A GID is (depending on the system) an unsigned 16-bit integer or an unsigned 32-bit integer. Every user must belong to at least one group. The group a user primarily belongs to is defined in /etc/passwd. Additional groups are listed in /etc/group.

Process ID (PID) - Every process, whether it be a daemon, GUI application, kernel thread, etc., has a process ID. PID 0 is reserved for the kernel's idle/swapper task and init (or an alternative) has a PID of 1. When a process is no longer running, its PID can be reused by a new process. There is a limit to the number of available PIDs (see /proc/sys/kernel/pid_max), but in practice users would typically run out of memory and/or CPU resources before the PID limit is reached.

Thread ID (TID) - Just as each process has an ID, so does each individual thread.

Parent Process ID (PPID) - A process has a PPID, which is the same number as the PID belonging to the process's parent. For example, if I open Xterm, which has a PID of 123, and I start Firefox in Xterm by typing "firefox", then Firefox's PPID will be 123, which is Xterm's PID. The PPID is a way of indicating which process (the parent) started the child process.

User ID (UID) - Every user has a UID. Like GIDs, the number of UIDs is limited by the type of integer that defines a UID. Again like GIDs, many distros have a limit of 0-65535 or 0-4294967295 as an unsigned 16-bit integer or unsigned 32-bit integer, respectively. The Root user always has a UID of 0. The UIDs are stored in /etc/passwd.

Effective User ID (EUID) - When a process or user is trying to access or create some object, the kernel checks the EUID to see if the user or process has permission to perform the desired action. In most cases, the EUID is exactly the same as the UID. However, the EUID may differ from the UID. Assume a regular user uses a command that requires Root privileges. The user's permissions are temporarily upgraded (if allowed) to perform the special function (like "sudo"). Users cannot change their UID while logged in. This is why the EUID is needed, so users can change their EUID when needed. Remember, the UID is used to identify the user/process and the EUID is used to identify the permissions. In other words, UID 0 is Root and an EUID of 0 grants Root's permissions.

FUN FACT: The "whoami" command shows who you are based on your Effective User ID, not your User ID.

Real User ID (RUID) - The Real User ID is like the EUID, but it is related to sending signals between processes. For instance, an unprivileged process can only send signals to other processes with the same RUID. When a user starts a program, the program has the same RUID as the user, hence the name "Real User ID". The RUID normally does not change. Sometimes the EUID may need to change back to the user's own ID, and the EUID gets this value from the RUID. The RUID has the same limits as the UID (being an unsigned 16-bit or 32-bit integer).

Saved/Set User ID (SUID) - Remember how the Effective User ID (EUID) changes? Well, sometimes it may need to temporarily change to the same value as the RUID. It would be inefficient to go through the process of re-elevating privileges again, so the EUID can go back to the ID saved in the SUID. For illustration, a process or user may perform some action that needs elevated privileges. The EUID is changed, but then something needs to be done under the RUID before changing back. For security reasons, a user/process cannot simply grant itself privileges. This is where the SUID comes into play. When the EUID needs to switch back to the elevated privileges, the ID is retrieved from the Saved User ID (or Set User ID). However, such elevated privileges can be revoked even from the SUID. By default, the SUID has the same value as the EUID.

NOTE: Saved IDs and Set IDs are the same.
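To see the real, effective, and saved user IDs in action, here is a minimal sketch using the Linux-specific getresuid() call (which requires _GNU_SOURCE on glibc). Run it normally and then through "sudo" to watch the values change -
Code:
#define _GNU_SOURCE         /* for getresuid() on glibc */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
   uid_t ruid, euid, suid;

   if (getresuid(&ruid, &euid, &suid) == -1)
     return 1;

   printf("real: %d  effective: %d  saved: %d\n",
          (int)ruid, (int)euid, (int)suid);
   return 0;
}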

Real Group ID (RGID) - The Real Group ID is like the Real User ID, but this applies to groups.

Effective Group ID (EGID) - This is just like an Effective User ID, but this is for groups.

Saved Group ID (SGID) - The SGID uses the same concepts as Saved/Set User IDs.

Session ID (SID) - A Session ID is used for communication purposes so only the authenticated processes can communicate on that particular session. This helps to prevent hijackers and malware from intercepting a data transaction. The only processes that can communicate in that "conversation" are those with the session ID for that particular communication session. Session IDs are commonly seen with HTTP connections.

Process Group ID (PGID) - A Process Group ID is the ID given to a set of processes that group together. The PGID is the same as the PID of the process that formed the group (the group leader) via the setpgid() kernel call. This is kind of like groups (as in GID), but PGIDs are not defined/preset in a file like Group IDs (/etc/passwd and /etc/group). Process groups are needed for inter-process communication and job-control purposes, such as delivering a signal to every member of the group at once.

Filesystem User ID (FSUID) - When a filesystem operation is to be performed on a file or the filesystem itself, the kernel checks the FSUID rather than the EUID. Normally the FSUID silently tracks the EUID, but a process can set it separately (via setfsuid()) so that it acts with different permissions for filesystem access only.

Universally Unique ID (UUID) - This 16-octet number has a variety of uses. Many Linux users may have seen a storage unit's UUID. UUIDs are needed to provide some kind of general unique ID that is permanent (unless changed by the user or programmer). For example, Gparted can be used to change a memory card's or hard-drive's UUID. MAC addresses are not the same as UUIDs, although they seem similar. A MAC address is made up of six sets of hexadecimal pairs, like this - ##:##:##:##:##:##. An online UUID generator can be found here (http://www.uuidgenerator.net/). Or, you can do it locally as a fan of mine suggested on Google Plus - cat /proc/sys/kernel/random/uuid

Partition ID - The different types of filesystems have an ID. This ID is usually used by partitioning tools when formatting a partition. For example, swap partitions have a partition ID of 82. However, not every filesystem has a unique ID; xiafs, ext2, ext3, ReiserFS, and others share ID 83.

USB ID - Different USB devices also have an ID. The different IDs are listed here (http://www.linux-usb.org/usb.ids). Other types of hardware (like PCI devices) have IDs of their own, but I will not bore you with all of the different kinds of hardware IDs since I am sure you get the point - PCI IDs, SCSI IDs, etc.
 
Descriptors

If you have read the part of the Linux kernel series that discusses system calls, then you are probably wondering a lot about file descriptors. In short, a file descriptor is a special indicator (in the form of an integer) used to access files. Each descriptor is an entry in a table that tracks the descriptor and various information about the open file. A file descriptor may not point to a regular file; it may instead point to a socket, pipe, directory, device, etc. Remember, in Linux, everything is a file. We will take a moment to try to thoroughly understand file descriptors.

Only the kernel can access the file descriptor table or index (whichever you prefer to call it). The file descriptor index keeps track of a file's location (i.e. /some/path), the process/user accessing the file, the process's PID, the file's type (i.e. pipe, directory, regular file, etc.), and some other information.

Many kernel calls access files using a file descriptor, just as processes are referred to by their PIDs and other IDs. Closed files do not have file descriptors, just as processes are only given a PID while they are executing. Files cannot get a file descriptor until they are opened/accessed.

The "lsof" command LiSts Open Files by checking the list of file descriptors. Users can use lsof to view the file descriptor index. Users can also see what files particular processes have open. Users can also see the file descriptors processes have open via /proc/. In a process's proc folder (like /proc/2010/), users can see the file descriptors that the process is using by looking in the "fd" folder. For instance, to see the file descriptors that are being used by process 2010, type "ls /proc/2010/fd/".


If you run "ls /proc/PID/fd/" for multiple "PIDs", you may notice that file descriptors "0", "1", and "2" appear under most or all PID folders. "0" is standard in (stdin), "1" is standard out (stdout), and "2" is standard error (stderr).
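As a small illustration, a process can list its own descriptors by reading /proc/self/fd/ like any other directory. Expect "0", "1", and "2" plus the descriptor that opendir() itself is holding -
Code:
#include <dirent.h>
#include <stdio.h>

int main(void)
{
   DIR *dir = opendir("/proc/self/fd");   /* this opens a descriptor too */
   if (dir == NULL)
     return 1;

   struct dirent *entry;
   while ((entry = readdir(dir)) != NULL)
     printf("%s\n", entry->d_name);       /* ".", "..", "0", "1", "2", ... */

   closedir(dir);
   return 0;
}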



Processes can share file descriptors. For instance, a process may pass one of its file descriptors to another process. Also, when a process is forked, the child process has the same file descriptors as the parent process.



There is a limit to the number of system-wide file descriptors that can be active at any given moment. This limit can be seen by running this command - "cat /proc/sys/fs/file-max". Every system and distro is different. To see the hard limit that is allowed for the average single user/process, execute this command - "ulimit -Hn". The soft limit can be viewed with a similar command - "ulimit -Sn". The soft limit is a limit that may be exceeded for a limited amount of time when needed. The hard limit is a limit that may not be exceeded or bent in any way. In other words, the hard limit must be obeyed with no exceptions. The system-wide file descriptor limit can be changed via this command - "sysctl -w fs.file-max=#", where "#" is the desired number. Then, using the same number, add this line to /etc/sysctl.conf - "fs.file-max = #". Reboot the system to apply the changes or run "sysctl -p" if you prefer not to reboot. Obviously, making such changes must be performed as Root. To change the hard and soft limits of a particular user, add or edit these lines in the /etc/security/limits.conf file -

USER soft nofile ##
USER hard nofile ##

Where "USER" is the name of a user or process and "##" is a number. Again, perform this action as Root.



NOTE: Running "sysctl fs.file-max" will produce the same output as "cat /proc/sys/fs/file-max".

The soft limit can be temporarily increased in the shell that executes this command - "ulimit -n ###".

To see how many file descriptors are open on your system, type "lsof | wc -l". To see how many file descriptors are currently being actively used, look at the first number of this output - "cat /proc/sys/fs/file-nr".

If you are a programmer, then you may be familiar with the open(), close(), read(), and write() functions which are commonly used in C/C++, Python, and other languages. These functions interact with file descriptors. open() creates (opens) a file descriptor and close() releases it; without open(), no descriptor exists, and without close(), the descriptor is left open.
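Below is a minimal sketch of that life cycle (the path /etc/hostname is just an example of a file that exists on most systems) -
Code:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
   int fd = open("/etc/hostname", O_RDONLY);    /* a descriptor is created */
   if (fd == -1) {
     perror("open");
     return 1;
   }

   char buf[256];
   ssize_t n = read(fd, buf, sizeof(buf) - 1);  /* read via the descriptor */
   if (n > 0) {
     buf[n] = '\0';
     printf("fd %d read: %s", fd, buf);
   }

   close(fd);                                   /* the descriptor is released */
   return 0;
}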

Hopefully, this article has helped readers to understand that file descriptors are basically PIDs for files. This knowledge should help make system calls easier to understand.
 
PulseAudio and Other Sound Components

Many Linux users have probably heard of PulseAudio, but many users may not completely understand why it is needed or specifically how it works. PulseAudio is a part of the sound system and helps to make all of the sound components work together, as you will see.

NOTE: This article will explain how the sound system works. This article does not explain installation or command-line parameters.

There are many ways to send sound from one location to another. Sound may come from a process/application (like Clementine, VLC, or a video game), the network (like a meeting with a friend over the Internet), from a microphone, or other source. Then, the sound may be sent to a file, over the network, to a speaker, to headphones, etc. How can Linux properly and accurately ensure sound is sent from point A to point B? The answer is PulseAudio (PulseAudio.org). PulseAudio is a daemon that acts as a proxy server for sound. All sound goes through PulseAudio. PulseAudio will then determine where the sound is meant to go. When a process makes a sound (like your music or video game), the application does not send sound straight to the speaker. Rather, it is sent to PulseAudio. If PulseAudio were to be killed with a kill command, then you would lose your sound. However, PulseAudio tends to restart itself on many systems, or users may have an alternative to PulseAudio.
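To illustrate how an application hands its sound to the PulseAudio daemon rather than to the hardware, here is a hedged sketch using PulseAudio's "simple" client API (the program and stream names are made up; compile with something like "gcc demo.c $(pkg-config --cflags --libs libpulse-simple) -lm") -
Code:
#include <math.h>
#include <pulse/simple.h>
#include <stdint.h>

int main(void)
{
   static const pa_sample_spec spec = {
     .format   = PA_SAMPLE_S16NE,   /* signed 16-bit, native endian */
     .rate     = 44100,
     .channels = 1
   };

   /* Connect to the daemon; a NULL server means the default PulseAudio. */
   pa_simple *s = pa_simple_new(NULL, "pa_demo", PA_STREAM_PLAYBACK,
                                NULL, "tone", &spec, NULL, NULL, NULL);
   if (s == NULL)
     return 1;

   int16_t buf[44100];                /* one second of a 440 Hz sine tone */
   for (int i = 0; i < 44100; i++)
     buf[i] = (int16_t)(10000 * sin(2.0 * M_PI * 440.0 * i / 44100.0));

   pa_simple_write(s, buf, sizeof(buf), NULL);  /* hand the samples to the daemon */
   pa_simple_drain(s, NULL);                    /* wait until playback finishes */
   pa_simple_free(s);
   return 0;
}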

Once PulseAudio receives the sound, what happens next? Well, once PulseAudio has a sound destined for the speaker or headphones, PulseAudio will send the sound to Advanced Linux Sound Architecture (ALSA) (www.alsa-project.org). ALSA is a sound API for the Linux kernel. ALSA runs in kernel-space and contains many sound drivers and features. Users can install various applications that allow them to control the audio system for various reasons (like JACK {http://jackaudio.org/}). AlsaMixer is an interface to ALSA used to configure sound and volume (http://alsa.opensrc.org/Alsamixer).

After ALSA makes any modifications to the sound (due to mixers and such), the drivers perform their low-level operations with the hardware. The sound is then sent to the desired destination. If PulseAudio sends sound over the network, then the client's ALSA process will perform such operations.

Some users have the Open Sound System (OSS) instead of ALSA. However, ALSA is the newest and most recommended sound-interface/sound-architecture for the kernel.

Sound can be sent via Bluetooth (like for headphones) thanks to the "bluetooth-alsa" software (http://bluetooth-alsa.sourceforge.net/).

As with anything, there are exceptions. Processes can send sound directly to ALSA, thus bypassing PulseAudio. Alternately, processes can send data to ALSA's own sound server. However, most systems use a "PulseAudio to ALSA to hardware" approach.

This is a little off topic, but I am sure some of you have thought of this while reading this article, so I will briefly inform you about some related software. GStreamer is a multimedia framework that provides many codecs and helps link various sound components. Phonon is like a Qt/KDE alternative to GStreamer (both of which have entirely different code and teams of coders).
 
FUSE

NOTE: For those of you that enjoy learning about filesystems, you may enjoy this link - http://www.linux.org/threads/filesystem-article-index.6028/

The Filesystem in Userspace (FUSE) is a special part of the Linux kernel that allows regular users to make and use their own filesystems without needing to change the kernel or have Root privileges. The filesystems used in FUSE are virtual filesystems, although not all virtual filesystems use FUSE. The code for FUSE itself is in the kernel, but the filesystem runs in userspace, whereas typical filesystems exist in kernelspace.

FUSE allows users to make their own filesystems, modify filesystems, or use a special filesystem temporarily. FUSE can also be used to add extra features and abilities to a system or software. For example, GVFS (GNOME Virtual FileSystem) is a filesystem that allows applications to access remote files as if they were local. FUSE makes this possible. Also, at the time this article was written, the Linux kernel did not natively support exFAT (also called FAT64). If users wish to access a storage unit with an exFAT/FAT64 filesystem, they can mount the filesystem in FUSE. However, FUSE may need some extra packages to be installed to run some filesystems or virtual filesystems. For instance, if a user wanted to use exFAT as in the example above, then the user would need to install the "exfat-fuse" package, which extends FUSE's abilities. The "exfat-fuse" package is a FUSE driver. Many drivers/extensions are available for FUSE.



FUSE is hosted at http://fuse.sourceforge.net/. FUSE is free, open-source software that anyone may obtain and use. FUSE is stable, secure, and reliable. However, it is more efficient to use a "real" filesystem if possible. FUSE works on Solaris, FreeBSD, Darwin, GNU/Hurd, OS X, OpenSolaris, and others. Some operating systems do not support FUSE. Such systems can use a fork of FUSE or an alternative. For example, NetBSD uses PUFFS and Windows uses "fuse4win".

FUSE not only mounts virtual filesystems, but also "real" filesystems like ZFS and exFAT. FUSE can also mount files like ISOs, CD track files, and compressed files (zip, gzip, tar, etc.). FUSE's abilities extend to network filesystems like HTTP-FS, aptfs (apt repos), and others. FUSE can also be used to transfer files to Apple devices like the iPod and iPhone (iFUSE). Amazingly, FUSE can even be used to convert FLAC to MP3 on the fly via the MP3 filesystem (mp3fs).

Users can program their own filesystems and use them in FUSE without changing the Linux kernel or needing Root privileges. Users program filesystems in a similar way to how other programs are coded. If a user wished to program a virtual filesystem using Ruby, then the programmer would need the "ruby-fusefs" package. FUSE bindings exist for many programming languages including Perl, Python, C/C++, C#, and many others. The complete list of language bindings is here (http://sourceforge.net/apps/mediawiki/fuse/index.php?title=LanguageBindings).

With FUSE, these virtual filesystems are not formatted. Rather, all a user needs to do to initiate a FUSE filesystem is to mount the specified FUSE filesystem to an empty directory that a user has permissions to use.

Now that many of you know a lot about FUSE, it is time to explain how it works. FUSE is a Linux module that acts as a “middle-man” or mediator between the FUSE filesystem and the Linux kernel's VFS module. The VFS module can only be accessed by privileged users or processes. Since all users may access FUSE and FUSE may access VFS, that is how the permissions work out to allow any user to use a FUSE filesystem. As for the FUSE filesystem itself, the code contains a structure variable holding pointers to the functions in the FUSE filesystem's code that respond to the various actions a process may try to perform. In simple terms, the structure variable is a set of code that says a “read” syscall must have a particular set of parameters and is handled by a specific function defined in the FUSE filesystem's code.

Sample FUSE structure variable code -
Code:
struct fuse_operations {
  int (*mknod) (const char *, mode_t, dev_t);
  int (*getdir) (const char *, fuse_dirh_t, fuse_dirfil_t);
  int (*mkdir) (const char *, mode_t);
  int (*rmdir) (const char *);
  int (*readlink) (const char *, char *, size_t);
  int (*symlink) (const char *, const char *);
  int (*unlink) (const char *);
  int (*rename) (const char *, const char *);
  int (*link) (const char *, const char *);
  int (*chmod) (const char *, mode_t);
  int (*chown) (const char *, uid_t, gid_t);
  int (*truncate) (const char *, off_t);
  int (*utime) (const char *, struct utimbuf *);
  int (*open) (const char *, struct fuse_file_info *);
  int (*read) (const char *, char *, size_t, off_t, struct fuse_file_info *);
  int (*write) (const char *, const char *, size_t, off_t,struct fuse_file_info *);
  int (*statfs) (const char *, struct statfs *);
  int (*flush) (const char *, struct fuse_file_info *);
  int (*release) (const char *, struct fuse_file_info *);
  int (*fsync) (const char *, int, struct fuse_file_info *);
  int (*getattr) (const char *, struct stat *);
  int (*setxattr) (const char *, const char *, const char *, size_t, int);
  int (*getxattr) (const char *, const char *, char *, size_t);
  int (*listxattr) (const char *, char *, size_t);
  int (*removexattr) (const char *, const char *);
};

So, if a process using a FUSE filesystem performs some task that uses the write() syscall, then the filesystem code will execute its code for write(). When you write code for a FUSE filesystem, you could, for example, program it so that when a process writes to a file, a copy of the file is made first.

Here is a list to help summarize the parts of a FUSE filesystem's code -
Import Headers – All C/C++ code includes headers, and other programming languages import some kind of library.

Declare Variables – Any variables that are used a lot in the code should be declared near the top of the source code so programmers can easily find and change global variables.

Syscall Declaration – A structure variable titled “fuse_operations” maps each syscall to the function that handles it and declares the needed parameters.

Syscall Functions – Next, programmers write code for the action or actions that should occur when a particular syscall is invoked. Developers may have a function for open(), write(), read(), and many other syscalls or needed features.

Main() - Obviously, if the FUSE filesystem is coded in C/C++, you will have an “int main()”, which is where the code “starts” after the functions and variables are set.

When the filesystem “executes”, FUSE will be a mediator and communicate with the kernel on behalf of the FUSE filesystem.

Below is sample FUSE filesystem code that came from FUSE's Sourceforge page (http://fuse.sourceforge.net/helloworld.html).

Code:
/*
  FUSE: Filesystem in Userspace
  Copyright (C) 2001-2007  Miklos Szeredi <[email protected]>

  This program can be distributed under the terms of the GNU GPL.
  See the file COPYING.

  gcc -Wall hello.c `pkg-config fuse --cflags --libs` -o hello
*/

#define FUSE_USE_VERSION 26

#include <fuse.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>

static const char *hello_str = "Hello World!\n";
static const char *hello_path = "/hello";

static int hello_getattr(const char *path, struct stat *stbuf)
{
   int res = 0;

   memset(stbuf, 0, sizeof(struct stat));
   if (strcmp(path, "/") == 0) {
     stbuf->st_mode = S_IFDIR | 0755;
     stbuf->st_nlink = 2;
   } else if (strcmp(path, hello_path) == 0) {
     stbuf->st_mode = S_IFREG | 0444;
     stbuf->st_nlink = 1;
     stbuf->st_size = strlen(hello_str);
   } else
     res = -ENOENT;

   return res;
}

static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
        off_t offset, struct fuse_file_info *fi)
{
   (void) offset;
   (void) fi;

   if (strcmp(path, "/") != 0)
     return -ENOENT;

   filler(buf, ".", NULL, 0);
   filler(buf, "..", NULL, 0);
   filler(buf, hello_path + 1, NULL, 0);

   return 0;
}

static int hello_open(const char *path, struct fuse_file_info *fi)
{
   if (strcmp(path, hello_path) != 0)
     return -ENOENT;

   if ((fi->flags & 3) != O_RDONLY)
     return -EACCES;

   return 0;
}

static int hello_read(const char *path, char *buf, size_t size, off_t offset,
      struct fuse_file_info *fi)
{
   size_t len;
   (void) fi;
   if(strcmp(path, hello_path) != 0)
     return -ENOENT;

   len = strlen(hello_str);
   if (offset < len) {
     if (offset + size > len)
       size = len - offset;
     memcpy(buf, hello_str + offset, size);
   } else
     size = 0;

   return size;
}

static struct fuse_operations hello_oper = {
   .getattr   = hello_getattr,
   .readdir   = hello_readdir,
   .open     = hello_open,
   .read     = hello_read,
};

int main(int argc, char *argv[])
{
   return fuse_main(argc, argv, &hello_oper, NULL);
}
 
Virtualization

Many of us have heard of virtualization, guest operating systems, hypervisors, and many other related terms. Virtualization allows users to run an operating system (OS) inside the "real" OS as if it were an application, or virtualize some device or network. But, what exactly is virtualization and how does it all work?

NOTE: This article discusses virtualization as in operating system virtualization. This article does not discuss virtualization as seen in Java's virtual machine, runtime environments, and similar systems.

Virtualization is the creation of a system or component that is not truly real. In other words, virtualization makes something seem genuine. The virtual object is not exactly real, but it is not entirely fake either. In computer technology, virtualization can be used to make "semi-real" networks, devices, computer systems, etc. Users may make a virtual device that the operating system thinks is a genuine physical device. The user knows the device is entirely software, but the computer cannot see the difference between a virtual device and a physical device. There is also OS virtualization, which is where an operating system (called the guest) runs inside (or on top of) the OS on the hardware (called the host). The host runs on the physical computer while the guest system thinks it is running on bare-metal hardware. The guest runs as it would on real hardware, but it cannot see or access the physical hardware. Instead, it sees the hardware that the virtualization software presents to it.

Virtualization is not emulation. Emulation is the process of imitating some other device or system. Think about game emulators like ZSNES, which mimics the Super Nintendo console. The emulator performs the functions of the real device, but it does not truly work like the device internally. Emulators copy the behavior of real devices enough to complete the same tasks. For instance, Super Nintendo games work on the console, and SNES ROMs (files containing an SNES game) work on ZSNES. On both systems, the game functions perfectly, or at least nearly so on the emulator. However, emulators do not act or appear exactly like the real system. For instance, most game emulators support the ability to save the game's current state/progress while consoles usually use a different way to save the game (like checkpoints). Also, emulators may provide the user with additional functionality. In contrast, virtualization is nearly real. Virtualization is faster than emulation in nearly all cases.

A virtual machine is a software representation of a computer system. Guest operating systems run inside of virtual machines, which are created and managed by a hypervisor (like Oracle's VirtualBox). The guest OS thinks the virtual machine is real hardware. Multiple hypervisors and guest systems can be run on a host system as long as there is enough “real” hardware to satisfy the needs of the guests. A guest can also be run inside a guest.

NOTE: Usually, when many guests are run, a single hypervisor manages all of the virtual machines containing guests.

Virtual machines can represent hardware of an architecture that is very different from the host's. This is possible because the guest's instructions can be mapped (or "translated") to instructions the host understands. Users can run 32-bit guests on 64-bit hosts. Operating systems designed for one processor type can be run on a host that uses an entirely different CPU. However, not all processors, BIOS chips, and hypervisors support certain guest architectures. For instance, some 64-bit hosts cannot support a 64-bit guest (usually due to limitations in the CPU and/or BIOS). Because of this fact, be sure to understand the virtualization capabilities of your BIOS, CPU, hypervisor, and host kernel when choosing the right hypervisor+guest+host combination for your needs. To run a particular guest, you may need a different hypervisor, host OS, or entirely different physical hardware.

FUN FACT: Under very special conditions, a 64-bit guest can run on a 32-bit host. This is possible if the physical hardware is 64-bit and has a CPU that offers certain virtualization extensions.

Hypervisors may work at different levels of the host system depending on the particular hypervisor's abilities. KVM (Kernel-based Virtual Machine) is a hypervisor in the Linux kernel (or as a module) that runs virtual machines in separate userspaces. QEMU and Oracle's VirtualBox (VBox) are hypervisors that run in the host's userspace as an application. Some hypervisors may work at the hardware level. Such hypervisors may exist as firmware or act as a compatibility layer under the OS kernels (such as Xen which is a hypervisor in the form of a microkernel). Many hypervisors exist, each with their advantages and disadvantages as well as supported features. Not all hardware supports the same hypervisors. Some hypervisors require special CPU commands (instruction sets and extensions). Computers with a CPU that lacks virtualization extensions will only be able to use hypervisors that run as applications or use paravirtualization (discussed below).

NOTE: Processors with virtualization instructions/extensions can help speed-up virtualization. This is known as hardware-assisted virtualization.

Virtual machines often use virtual devices, and they may use virtual networks. Many guest systems use a file as a hard-drive (device virtualization). ISO files are examples of virtual optical discs (like CDs and DVDs), and “.img” files are often used as virtual floppies. Many virtual machines allow users to use ISO files on their host system as discs on the guest. Guests can form a virtual network with other systems like other guests, the host, and external systems. Some hypervisors allow the host's directories to be mounted in the guest as a drive. This helps users transfer files between systems.

Another type of virtualization exists called paravirtualization (in contrast to full virtualization). This is a form of virtualization where the guest knows it is running in a virtual machine. Thus, the guest will modify its behavior slightly to enhance its performance. The guest does so by using a special set of system calls referred to as "hypercalls". Not all operating systems support hypercalls. Operating systems that support hypercalls either have the hypercalls built into the kernel or as a driver. Hypercalls use the paravirtualization application-programming-interface (para-API). Hypercalls come from the kernel and are sent to the hypervisor while the guest's userspace still sends system calls to the guest's kernel. Using paravirtualization (also called para-virt or PV) does not require the processor to have virtualization extensions. The Xen hypervisor supports paravirtualization.

NOTE: Hypercalls are to hypervisors as system calls are to kernels. (Analog from http://wiki.xen.org/wiki/Hypercall)

Guest's userspace >>(syscalls)>> Guest's kernel >>(hypercalls)>> Hypervisor

Instead of hypervisors, OS-level virtualization can be used. In this virtualization process, the kernel runs multiple sets of userspaces. Hypercalls and virtual machines are not used since the userspaces interact with the kernel directly. The userspaces communicate with the kernel using the typical system calls like "normal" systems. There is only one kernel running on the physical hardware. Since there is only one kernel that interacts with the userspaces directly, each userspace must be compatible with the kernel. This means OS-level virtualization is only possible among systems that use the same kernel. For instance, users running a Linux kernel in this fashion can only run userspaces of various Linux distros. Each userspace instance is referred to as a "container" in contrast to the term “virtual machine”. The machine can act as a server that uses one IP address with each userspace using its own hostname or domain-name. Alternately, the userspaces can each be set to use a different port. Either way, only one IP address is used, thus saving IP addresses for other uses.

There is also a well-known implementation of OS-level virtualization called LXC. LXC is not a hypervisor; it uses the kernel's cgroups (and namespaces) to divide the hardware resources among containers as if the hardware were multiple computers, while every container shares the host's single Linux kernel. Such systems downplay the concept of "host and guest"; technically speaking, each container behaves like its own system.

NOTE: OS-level virtualization is not the same as multibooting.


In summary, operating systems can be run in another operating system by using virtualization or paravirtualization. Virtualization also allows multiple operating systems to run on the same piece of hardware simultaneously. Hypervisors manage and create virtual machines that contain the guest OS. All this works together to allow users to take advantage of other operating systems to perform tasks that may not be easily performed using their usual OS. Understanding the basics of virtualization can help users with creating virtual systems.
 
Xen

For those of you that understand virtualization, you may want to have an understanding of the most popular hypervisor for Linux - Xen. Xen is an open-source hypervisor that is well supported by Linux. Some other operating systems support Xen as well. Knowing the basics of Xen can help you understand hypervisors better and know why Xen is so popular.

Xen is a popular hypervisor that supports OS virtualization and paravirtualization (PV). Xen is a "Type 1" hypervisor, meaning that it runs directly on the hardware. When the computer boots up, the bootloader (typically GRUB) loads Xen (Xen is a microkernel). Then, Xen sets up specialized virtual machines called "Domains". Domains are low-level virtual machines that do not have a host, and only one domain can access the hardware directly. Xen first sets up the primary domain. This domain is known by many names such as "Domain Zero", "Dom0", "Domain0", "Host Domain", or "The Control Domain". Xen cannot function without the control domain. The control domain provides the drivers that are used by itself and other domains. The control domain also contains the controls and settings for Xen as well as Xen itself. The other domains are called "Guest Domains", "Unprivileged Domains", or "DomU". The guest domains use the drivers provided by Domain0. The guest domains are not permitted to access the hardware directly. Instead, they must allow Dom0 to manage the hardware. (http://www.xenproject.org/)(http://wiki.xen.org)

NOTE: With a virtualization structure like this, it is hard to specifically say which system is the host. Some may argue that Xen is the host system since it is a layer between the hardware and operating systems. As for the other view, Domain0 could be considered the host since Xen depends on it and unprivileged domains need Domain0 for hardware access. Either way, the unprivileged domains act like guests, so they are often called guest domains. Some manuals may refer to the hardware itself as the host.

Most Linux distros contain PV-enabled kernels; PV is enabled in the vanilla kernel. PV-enabled kernels contain the paravirt_ops (pv-ops) framework which is a kernel infrastructure that provides hypercalls and other components needed for paravirtualization. Thus, paravirtualization does not require virtualization support from the CPU. Just as the userspace sends system calls to the kernel, the Linux kernel sends hypercalls to Xen. Hypercalls usually have a naming scheme like this "__HYPERVISOR_*", where the asterisk represents the rest of the hypercall's name. (http://xenbits.xen.org/docs/unstable/hypercall/index.html)

NOTE: Other hypervisors recognize and use the hypercalls from the paravirt_ops framework.

Xen can send the virtual operating systems "Virtual Interrupts". These virtual interrupts go to the kernel of the virtual OS. The names of the interrupts begin with (VIRQ_). "Event Channels" provide the framework for event notifications. These notifications are Xen's equivalent to hardware interrupts.

Xen Cloud Platform (XCP) is an ISO file that comes with Xen and a Linux distro set as Dom0. In other words, this is a complete Xen system that you install from the ISO. Otherwise, install a Linux distro and use the package manager to install Xen. If you are interested in installing Xen, then I recommend you go to Xen's official website and read the installation instructions. (http://wiki.xenproject.org/wiki/Xen_Project_Beginners_Guide)

Oftentimes, Xen systems use a Logical Volume Manager (LVM) for local storage, but this is not required. Xen can also use network storage. Xen also uses “Xen Virtual Block Devices” (VBD or XVD). These are files where the file itself is a virtual hard-drive. Such files use the “.dsk” file extension. The first virtual Xen hard-drive is designated /dev/xvda and the second is /dev/xvdb (just change the last letter). The virtual hard-drives are treated as IDE drives, but the maximum number of partitions is fifteen. Xen Virtual Block Devices are commonly used with paravirtualized domains.

NOTE: Domain0 must be installed on the physical hardware, but all of the guest domains may be installed in XVDs.

You may be wondering, “why install domains in an XVD file?”. Well, doing so allows the virtual OS to be copied as a file. Admins can copy the “.dsk” file and easily and quickly install preconfigured systems or make backups. Having a complete OS in a single file makes backups and installation very easy. Since Xen is a hypervisor, nearly any system running Xen would be able to use the OS in the “.dsk” file. This also makes hardware requirements flexible because as long as Xen is supported and Domain0 has the needed drivers, a particular virtual OS will be able to run on any machine. If an OS becomes corrupted, the file can be copied over the top of the corrupted Domain's image.

LINK: This site offers some pre-made Xen images as well as images for other hypervisors. http://stacklet.com/downloads/images/public

XenServer is an open-source Xen system offered by Citrix. Users can purchase support, but the software itself is free. XenServer has an interesting piece of software called XenMotion. XenMotion allows a currently running guest domain to be moved from one physical machine to another host machine that both share the same storage. This allows admins to repair servers without having the system down. (http://xenserver.org/)

Xen supports various operating systems including Windows. However, Windows can only be in a guest domain, never in the control domain. NetBSD and OpenSolaris are two examples of operating systems that can function as Domain0.

"xencommons" is the init script that each operating system domain must have properly installed to run the required Xen services.

Various interfaces exist for managing a Xen system. Users can use Xen's command-line tools or install a separate interface. Many management tools exist including graphical interfaces. "libvirt" is an example of a hypervisor management API that many management tools use. "libvirt" is a library written in C, but various programming bindings allow programmers to implement libvirt in management tools that are programmed in other computer languages. (http://wiki.xen.org/wiki/Xen_Management_Tools)(http://wiki.xen.org/wiki/XCP_Management_Tools)

Xen uses the Hierarchical Protection Domains (CPU Protection Rings) security feature offered by most (or all) processors. A CPU can execute code under different security modes. Most CPUs have four rings (security modes) designated "Ring 0" through "Ring 3". Ring 0 is the most privileged and allows complete direct access to the hardware. In contrast, Ring 3 has very few privileges and must request services from more privileged code (via system calls) to perform special tasks. Xen runs in Ring 0 while Dom0 and the guest domains all run in Ring 1. User applications run in Ring 3, which is the default for most operating systems. Xen does not assign any process to Ring 2.


Xen is a powerful hypervisor with many features and uses. Having a general understanding of Xen can help admins decide which hypervisor to use. Understanding Xen also helps users gain a better understanding of virtualization and Linux.
 
Hi, DevynCJohnson
Do you know whether XFS supports SMR drives?

Wow! I never heard of SMR until now. After some research, I assume that these drives would appear (to the user) to be no different than conventional magnetic drives. After reading http://en.wikipedia.org/wiki/Shingled_magnetic_recording, if people could not tell how Seagate made its new 8TB drive and could not distinguish this drive type from conventional magnetic drives, then it seems like you could use XFS. However, I had never heard of these drives until now, so you may want to do more research or contact the XFS developers.
 
