POSIX Capabilities is a fundamentally sound idea hobbled by silliness that makes the capability system difficult to understand.
The idea behind POSIX capabilities is to replace the old model, in which root is god and everyone else isn't, with the idea of a process having capabilities. Capabilities as defined in the POSIX draft may be seen as subdivided root powers. For example, in the old model, root (UID 0) can bypass file permissions (that is, root is always permitted access). In the capabilities model, a process that has CAP_DAC_OVERRIDE is always permitted access no matter what the file's permission and owner bits say.
It should be noted that what the POSIX draft calls “capabilities” are actually privileges and should not be confused with capabilities in EROS or similar systems. Since the term “capabilities” is now used throughout the Linux kernel code and surrounding internet discussions, this document will also use the term “capabilities”.
The IronPenguin implementation of capabilities is similar to the POSIX draft, but seeks to simplify matters somewhat and fix a few specification flaws so that the potential benefits may finally be realized.
Capabilities allow for much finer-grained control over system administration than the old root privilege model. For example, in the old model, ping is setuid root so that it will be allowed to send raw ICMP packets. So long as ping is perfectly secured, that isn't much of a problem. Unfortunately, 'perfectly secured' can NEVER be assumed. Should someone figure out an obscure way to subvert ping and get it to (for example) exec /bin/bash, some user will be able to become root at will.
Under the new model, ping is granted CAP_NET_RAW. That is adequate for sending the necessary ICMP echo packets but does NOT (for example) allow overriding file permissions, remounting filesystems, making binaries suid-root, etc. The 'benefits' of exploiting a flaw in ping under the new system are quite limited compared to the old system.
POSIX Capabilities are assigned per-process rather than per-user. Capabilities are only inherited if both the process and the executable have the inheritable bit set for a given capability.
Process capabilities are contained within 3 'sets': Permitted, Inherit, and Effective. Capabilities are re-computed whenever the exec system call is invoked.
Effective is the 'working' capabilities. That is, when the process tries to do something privileged, the kernel checks the effective capability set when it decides whether to permit the action or not.
Permitted is the upper bound on capabilities. The process cannot have an effective capability unless it is also Permitted.
Inherit decides which Permitted capabilities will still be permitted after executing a program (binary or script).
A system call exists for a process to ask for an effective capability. That system call consults Permitted to decide if it should grant the request or not.
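The relationship between the three process sets can be sketched as a toy model. This is an illustration only, not kernel code: the class `Process` and the methods `raise_effective` and `capable` are hypothetical names, and the bit positions are the ones Linux uses for these two capabilities.

```python
# Toy model of the three per-process capability sets (not kernel code).
CAP_DAC_OVERRIDE = 1 << 1
CAP_NET_RAW = 1 << 13


class Process:
    def __init__(self, permitted=0, inheritable=0):
        self.permitted = permitted      # upper bound on Effective
        self.inheritable = inheritable  # what may survive an exec
        self.effective = 0              # what privilege checks look at

    def raise_effective(self, caps):
        """Hypothetical stand-in for the system call: grant only what is Permitted."""
        if caps & ~self.permitted:
            raise PermissionError("capability not in Permitted set")
        self.effective |= caps

    def capable(self, cap):
        """A privilege check consults Effective, not Permitted."""
        return bool(self.effective & cap)


p = Process(permitted=CAP_NET_RAW)
assert not p.capable(CAP_NET_RAW)  # permitted, but not yet effective
p.raise_effective(CAP_NET_RAW)     # the process asks, Permitted says yes
assert p.capable(CAP_NET_RAW)
```

Asking for a capability outside Permitted (for example `p.raise_effective(CAP_DAC_OVERRIDE)`) raises `PermissionError` in this sketch.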
Meanwhile, executables carry 3 complementary capability sets: Allowed, Forced, and Effective.
Allowed is the upper bound on capabilities that should be Permitted once the program is run. Capabilities not in the allowed set will be removed when the program is executed.
Forced is the set of capabilities that the program grants when it is run. In the ping example above, ping would have CAP_NET_RAW in its Forced set.
Effective is what capabilities should be made effective in the process (restricted by Permitted, of course). Even if a file has everything set in effective, they will not ACTUALLY be effective unless they are ALSO permitted by the process itself OR forced by the file.
In the ping example above, ping will have CAP_NET_RAW in all 3 of its sets. A user process having no capabilities at all will be granted CAP_NET_RAW when running ping.
Note that once ping terminates, the capabilities die with it. That is, capabilities are granted to the process itself, not its parent(s) or the user that owns the process.
Clear as MUD, right?
This is silliness. From a security standpoint, if something is permitted we must assume it will be asked for and granted (that is, it WILL become effective). There is little point in playing 'Mother may I' with security.
The effective set for a file is likewise silly. Again, if we force CAP_NET_RAW, we must assume that if the program wants to use it (or an attacker wants it to use it), it will be asked for and granted freely ('Mother may I?').
Do we REALLY think we'll trip up the bad guys that way? (“You didn't say 'Mother may I'”!!!). OK, so we'll stop the incredibly stupid bad guys.
Proposed new Capabilities
The sensible approach is to eliminate the Effective sets and base security decisions on the Permitted set.
This greatly simplifies everything with no real loss of actual expressiveness.
So the process has Permitted and Inheritable capabilities and executables have Forced and Allowed.
In cases where nothing is set for the file, the default is everything Allowed, nothing Forced. A process neither gains nor loses capabilities when running the executable.
The default state for a process is nothing Permitted, everything Inheritable.
A system call exists so that a program may voluntarily drop Permitted or Inheritable capabilities. Naturally, once dropped, a capability can only be re-acquired by being Forced.
Why Inheritable? Many daemons require a capability or two to function. Some need it only at the beginning (CAP_NET_BIND_SERVICE, the ability to bind to a port <1024, comes to mind). Others need it on an ongoing basis. For example, a userspace NFS server will need CAP_DAC_OVERRIDE so it can modify files on other users' behalf.
In that case, an attacker might like to exploit unfsd and get it to execute a nice privileged shell for them. So, unfsd voluntarily gives up its Inheritable capabilities. If the bad guy DOES exploit unfsd and get a shell, it will have no privileges at all.
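The unfsd scenario can be sketched in a few lines. This is a minimal illustration that assumes the exec rule summarized later in this document (Permitted is masked by Inheritable and Allowed, then Forced is added); the function name `exec_simplified` is hypothetical.

```python
# Sketch of the unfsd scenario under the simplified (no Effective set) model.
CAP_DAC_OVERRIDE = 1 << 1
ALL_CAPS = ~0  # "everything"; Python ints are unbounded, so ~0 works as a full mask


def exec_simplified(permitted, inheritable, allowed=ALL_CAPS, forced=0):
    """Recompute (permitted, inheritable) across an exec.

    The defaults model a file with no capability record:
    everything Allowed, nothing Forced.
    """
    permitted &= inheritable
    permitted &= allowed
    inheritable &= allowed
    permitted |= forced
    inheritable |= forced
    return permitted, inheritable


# unfsd keeps CAP_DAC_OVERRIDE for itself but clears its Inheritable set.
unfsd_permitted, unfsd_inheritable = CAP_DAC_OVERRIDE, 0

# An attacker tricks unfsd into exec'ing a shell that has no capability record.
shell_permitted, shell_inheritable = exec_simplified(unfsd_permitted, unfsd_inheritable)
assert shell_permitted == 0  # the shell gets no privileges

# Had unfsd left Inheritable alone, the shell would have kept the capability.
careless_permitted, _ = exec_simplified(CAP_DAC_OVERRIDE, ALL_CAPS)
assert careless_permitted == CAP_DAC_OVERRIDE
```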
So why an Allowed set for files?
Honestly, it's just a bit of insurance. We've all seen programs that stridently warn never to run them as root. Some even refuse to run as root. Instead, those should set Allowed to nothing and be happy.
The current state of capabilities in Linux:
The Linux kernel already has MOST of what's needed internally to go to a capability system. In the 2.3.x to 2.4.x kernels, all checks for uid or euid==0 were changed to check capabilities instead. To maintain backwards compatibility, when a setuid-root program runs, all capabilities are forced in the kernel. Each process's task_struct has fields for Permitted, Effective, and Inherited.
However, because of the withdrawal of the POSIX draft specification and a few early difficulties, the filesystem parts of capabilities were never completed.
A number of proposals were made for the implementation of file capabilities. The most prominent would have stored capabilities in a notes segment of an ELF binary. This was proposed primarily because, at the time, xattrs had not been implemented at all.
As of kernel 2.6.24-rc2, file based capabilities are available in the mainstream kernel.
The kernel for the most part implements exactly the POSIX draft, except that fE is a single boolean applied to all of pP. The xattr is formatted differently than in the original IronPenguin implementation, but seems reasonable. As a result, IronPenguin has been updated to be reasonably compatible with (but an extension of) the mainstream.
In the mainstream, the capabilities computation is:
pP' = fP | (pI & fI)
pI' = pI
pE' = pP' & fE
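The mainstream computation can be written out as a small function to make the behaviour concrete. This is a sketch, not the kernel's code: the function name `exec_mainstream` is hypothetical, capability sets are modelled as integer bitmasks, fE is modelled as a boolean (as described above), and the bounding set and uid 0 special cases are deliberately left out.

```python
# Sketch of the mainstream exec computation (bitmasks as Python ints).
CAP_NET_RAW = 1 << 13


def exec_mainstream(pP, pI, fP=0, fI=0, fE=False):
    """Return (pP', pI', pE') for a process exec'ing a file.

    fE is a single boolean: either all of pP' becomes effective or none of it.
    The defaults model a file with no capability record (nothing set).
    Note that the old pP does not appear in the formulas at all.
    """
    new_pP = fP | (pI & fI)
    new_pI = pI
    new_pE = new_pP if fE else 0
    return new_pP, new_pI, new_pE


# A user process with no capabilities runs ping, whose file forces CAP_NET_RAW.
pP, pI, pE = exec_mainstream(pP=0, pI=0, fP=CAP_NET_RAW, fE=True)
assert (pP, pI, pE) == (CAP_NET_RAW, 0, CAP_NET_RAW)

# If ping then execs a file with no capability record, the capability is gone.
pP, pI, pE = exec_mainstream(pP, pI)
assert (pP, pI, pE) == (0, 0, 0)
```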
IronPenguin alters things a bit.
pP' = fP | (pI & fI) | (pP & pI)
pI' = pI | (fP & fI)
pE' = pP' & fE
Meanwhile, mainstream assumes nothing set when no capabilities record exists for a file. IronPenguin assumes fE is set in that case.
The upshot of the difference is that an executable can grant a persistent capability to a process if fP, fI, and fE are set. A file can refuse to run with any capabilities at all by having a capability xattr with everything cleared.
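Both effects can be checked against the IronPenguin formulas with a short sketch. The function name `exec_ironpenguin` is hypothetical, sets are modelled as integer bitmasks, and fE is treated as a boolean that defaults to set when the file has no capability record, as described above.

```python
# Sketch of the IronPenguin exec computation (bitmasks as Python ints).
CAP_NET_BIND_SERVICE = 1 << 10


def exec_ironpenguin(pP, pI, fP=0, fI=0, fE=True):
    """Return (pP', pI', pE') using the IronPenguin formulas.

    The defaults model a file with no capability record: nothing in fP or fI,
    but fE assumed set.
    """
    new_pP = fP | (pI & fI) | (pP & pI)
    new_pI = pI | (fP & fI)
    new_pE = new_pP if fE else 0
    return new_pP, new_pI, new_pE


cap = CAP_NET_BIND_SERVICE

# A file with fP, fI and fE set grants the capability and makes it inheritable.
pP, pI, pE = exec_ironpenguin(0, 0, fP=cap, fI=cap, fE=True)
assert (pP, pI, pE) == (cap, cap, cap)

# The capability persists across a later exec of a file with no record.
pP, pI, pE = exec_ironpenguin(pP, pI)
assert (pP, pI, pE) == (cap, cap, cap)

# A file whose record has everything cleared runs with nothing effective.
# As the formulas are written, pP' can still be non-empty here; only pE' is emptied.
_, _, pE_refused = exec_ironpenguin(cap, cap, fP=0, fI=0, fE=False)
assert pE_refused == 0
```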
A privileged process can clear its own pI so that it can use a capability but executables it runs cannot.
A future patch to the capset system call is under consideration so that pI can be cleared by a process, but can only be set by executing something with fI and fP set. That is, clearing pI cannot be undone by calling capset again.
The capabilities patch avoids hacking everything up by simply setting Effective and Permitted together. It retains the behaviour of setting everything when execing a suid-root program.
File capabilities are stored in the new user.fscap xattr (extended attribute). xattrs were added to ext2, ext3, and the VFS some time ago. Reiserfs v3 was patched to support xattrs over Hans' vigorous objections. Since then, xattrs have been added to most Linux filesystems that could credibly support them (msdosfs need not apply). IronPenguin modifies the xattr_permission function in fs/xattr.c so that 'user.fscap' can only be modified by a process with CAP_SETUID. Without that restriction, any idiot could set a forced capability on his own programs and wreak havoc. CAP_SETUID is chosen because it is certainly related, because CAP_SETPCAP is generally unavailable system-wide, and because admins of virtual systems must be able to manipulate this attribute but are restricted from CAP_SYS_ADMIN. Arguably, CAP_SETUID should be adequate for manipulating fscap.
The user. prefix was chosen due to infelicities in the way xattrs have been implemented in Linux to date (see [[xattr proposal]]).
So, in summary, when exec is called, the process's capabilities get re-computed as follows:
If xattr user.fscap doesn't exist, Allowed = ~0 and Forced = 0.
If xattr user.fscap doesn't exist AND executable is setuid-root, Allowed = ~0 and Forced = ~0.
Permitted &= Inherited
Permitted &= Allowed
Inherited &= Allowed
Permitted |= Forced
Inherited |= Forced
Note on terminology: The above assumes that capabilities are stored in bitfields (as they are in Linux). If you prefer set terminology, A &= B may be read as A becomes the intersection of A and B. A |= B reads A becomes the union of A and B. A = 0 reads A becomes the empty set, and A = ~0 reads A becomes the set of all capabilities.
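The correspondence between the two notations can be checked directly. The sketch below is only an illustration: the helper `to_set` and the two-entry `NAMES` table are hypothetical, and the bit positions are the ones Linux uses for these capabilities.

```python
# Bitfield operators and set operations describe the same thing.
NAMES = {1: "CAP_DAC_OVERRIDE", 13: "CAP_NET_RAW"}  # bit position -> name


def to_set(mask):
    """Convert a capability bitmask to the set of capability names it holds."""
    return {name for bit, name in NAMES.items() if mask & (1 << bit)}


a_bits = (1 << 1) | (1 << 13)   # {CAP_DAC_OVERRIDE, CAP_NET_RAW}
b_bits = 1 << 13                # {CAP_NET_RAW}
a_set, b_set = to_set(a_bits), to_set(b_bits)

assert to_set(a_bits & b_bits) == a_set & b_set   # A &= B is intersection
assert to_set(a_bits | b_bits) == a_set | b_set   # A |= B is union
assert to_set(0) == set()                         # A = 0 is the empty set
assert to_set(~0) == set(NAMES.values())          # A = ~0 is every capability
```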
In Linux, the kernel ALSO has a global upper limit on the capabilities that any process (except for init) may ever have, called the capability bounding set. This set may be altered (downward) through /proc/sys/kernel/cap-bound. This adds two additional computations:
Permitted &= Bound
Inherited &= Bound
Furthermore, the IronPenguin patch set adds a per-process capability ceiling as well, and two more computations:
Permitted &= Ceiling
Inherited &= Ceiling
Finally, since as explained above, there is no point in separating Permitted and Effective capabilities, we do:
Effective = Permitted
This is done simply to avoid hacking up the whole kernel.
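Putting the steps above together gives a compact picture of the whole recomputation. This is a sketch under stated assumptions rather than the patch itself: the function name `exec_recompute` and its parameters are hypothetical, the file's record is modelled as an `(allowed, forced)` pair or `None` when no user.fscap xattr exists, and the bound and ceiling masks are assumed to be applied after Forced, in the order the steps are listed.

```python
# Sketch of the full exec-time recomputation described above.
ALL = ~0  # every capability; may show up as a negative int, which is fine for masks
CAP_NET_RAW = 1 << 13
CAP_SYS_MODULE = 1 << 16


def exec_recompute(permitted, inherited, fscap=None, setuid_root=False,
                   bound=ALL, ceiling=ALL):
    """Return (permitted, inherited, effective) after exec.

    fscap is a hypothetical (allowed, forced) pair read from the xattr,
    or None when the file has no user.fscap xattr.
    """
    if fscap is None:
        allowed = ALL
        forced = ALL if setuid_root else 0
    else:
        allowed, forced = fscap

    permitted &= inherited
    permitted &= allowed
    inherited &= allowed

    permitted |= forced
    inherited |= forced

    permitted &= bound      # global capability bounding set
    inherited &= bound

    permitted &= ceiling    # per-process ceiling
    inherited &= ceiling

    effective = permitted   # Effective simply mirrors Permitted
    return permitted, inherited, effective


# An unprivileged process (nothing Permitted, everything Inheritable) runs ping.
p, i, e = exec_recompute(0, ALL, fscap=(CAP_NET_RAW, CAP_NET_RAW))
assert (p, i, e) == (CAP_NET_RAW, CAP_NET_RAW, CAP_NET_RAW)

# A setuid-root binary gets everything, minus what the bounding set removes.
p, i, e = exec_recompute(0, ALL, setuid_root=True, bound=ALL & ~CAP_SYS_MODULE)
assert e & CAP_NET_RAW and not e & CAP_SYS_MODULE
```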
The fscaps xattr is left as is for mainstream. The patch is entirely within security/commoncap.c and security/Kconfig. It might be nice to add a full fE set to partially make up for the Denied set for files, but that is for later. Currently, an executable can deny all capabilities, it just can't be more specific.
As for POSIX compliance, it's irrelevant now, since the 'official' POSIX draft capabilities proposal was ambiguous in a few areas and was withdrawn in 1998.
The essential IDEA was sound even if the description, terminology, and specifics were not. The principle of least privilege is not only sound, but is a cornerstone of security.