POSIX Capabilities is a fundamentally sound idea hobbled by silliness that makes the capability system difficult to understand.
The concept is to replace the old model of root is god and everyone else isn't with the idea of a process having capabilities. Capabilities can be seen as sub-divided root powers. For example, in the old model, root (UID 0) can bypass file permissions (that is, root is always permitted). In the capabilities model, if a process has CAP_DAC_OVERRIDE, it is always permitted no matter what the file's permission and owner bits say.
This gives us much finer-grained control over administration and security on the system. For example, in the old model, ping is setuid root so it will be allowed to send raw ICMP packets. So long as ping is perfectly secured, that isn't much of a problem. Unfortunately 'perfectly secured' can NEVER be assUme-d. Should someone figure out an obscure way to subvert ping and get it to (for example) exec /bin/bash, some user will be able to become root at will.
Under the new model, ping is granted CAP_NET_RAW. That is adequate for sending the necessary ICMP echo packets but does NOT (for example) allow overriding file permissions, remounting filesystems, making binaries setuid-root, etc. The 'benefits' of exploiting a flaw in ping under the new system are quite limited compared to the old system.
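To make the difference concrete, here is a minimal sketch of the two pings as bitmasks. The bit positions follow linux/capability.h; the 32-bit width, the capable() helper and the variable names are my own assumptions for illustration, not kernel code.

```python
# Bit positions as in <linux/capability.h>; everything else here is illustrative.
CAP_DAC_OVERRIDE = 1 << 1   # bypass file permission checks
CAP_SETUID = 1 << 7         # change UIDs at will
CAP_NET_RAW = 1 << 13       # open raw sockets (what ping needs)
CAP_SYS_ADMIN = 1 << 21     # mount/remount and much else

ALL_CAPS = (1 << 32) - 1    # assumption: a 32-bit capability field


def capable(effective, cap):
    """Hypothetical stand-in for the kernel's privilege check."""
    return (effective & cap) != 0


# Old model: setuid-root ping runs with every root power.
setuid_root_ping = ALL_CAPS
# New model: ping carries exactly one capability.
capability_ping = CAP_NET_RAW

# What an attacker inherits by subverting each variant.
old_loot = [c for c in (CAP_DAC_OVERRIDE, CAP_SETUID, CAP_SYS_ADMIN)
            if capable(setuid_root_ping, c)]
new_loot = [c for c in (CAP_DAC_OVERRIDE, CAP_SETUID, CAP_SYS_ADMIN)
            if capable(capability_ping, c)]
```

Both variants can still send their echo packets, but only the setuid-root one hands over anything worth stealing.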
Capabilities are assigned per-process rather than per-user. All capabilities are inherited by children of the process.
Process capabilities are contained within three 'sets': Permitted, Inheritable, and Effective. Capabilities are re-computed whenever the exec system call is invoked.
Effective is the 'working' capabilities. That is, when the process tries to do something privileged, the kernel checks the effective capability set when it decides whether to permit the action or not.
Permitted is the upper bound on capabilities. The process cannot have an effective capability unless it is also Permitted.
Inheritable decides which Permitted capabilities will still be permitted after executing a program (binary or script).
A system call exists for a process to ask for an effective capability. That system call consults Permitted to decide if it should grant the request or not.
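The relationship between Permitted and Effective can be sketched as below. This is a toy model only: the Process class and its method names are hypothetical and are NOT the real capset(2) interface.

```python
CAP_DAC_OVERRIDE = 1 << 1   # bit positions as in <linux/capability.h>
CAP_NET_RAW = 1 << 13


class Process:
    """Toy model of the three per-process sets (names are hypothetical)."""

    def __init__(self, permitted=0, inheritable=0):
        self.permitted = permitted
        self.inheritable = inheritable
        self.effective = 0          # nothing is effective until asked for

    def request_effective(self, cap):
        """'Mother may I': granted only if cap lies entirely within Permitted."""
        if cap & ~self.permitted:
            return False
        self.effective |= cap
        return True

    def may(self, cap):
        """The check made when something privileged is attempted."""
        return (self.effective & cap) == cap


p = Process(permitted=CAP_NET_RAW)
before = p.may(CAP_NET_RAW)                      # not yet asked for
granted = p.request_effective(CAP_NET_RAW)       # within Permitted
refused = p.request_effective(CAP_DAC_OVERRIDE)  # outside Permitted
```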
Meanwhile, executables carry three complementary capability sets: Allowed, Forced, and Effective.
Allowed is the upper bound on capabilities that should be Permitted once the program is run. Capabilities not in the allowed set will be removed when the program is executed.
Forced is the set of capabilities that the program grants when it is run. In the ping example above, ping would have CAP_NET_RAW in its Forced set.
Effective is what capabilities should be made effective in the process (restricted by Permitted, of course). Even if a file has everything set in effective, they will not ACTUALLY be effective unless they are ALSO permitted by the process itself OR forced by the file.
In the ping example above, ping will have CAP_NET_RAW in all three of its sets. A user process having no capabilities at all will be granted CAP_NET_RAW when running ping.
Note that once ping terminates, the capabilities die with it. That is, capabilities are granted to the process itself, not its parent(s) or the user that owns the process.
Clear as MUD, right?
This is silliness! I see no reason for the Effective set at all. From a security standpoint, if it is permitted we must assume it will be asked for and granted (that is, it WILL become effective). What is the point in playing 'Mother may I' with security?
The effective set for a file is likewise silly. Again, if we force CAP_NET_RAW, we must assume that if the program wants to use it (or an attacker wants it to use it), it will be asked for and granted freely ('Mother may I?').
Do we REALLY think we'll trip up the bad guys that way? ('You didn't say Mother may I!!!') OK, so we'll stop the incredibly stupid bad guys.
Dropping the Effective set from both processes and files greatly simplifies everything with no loss of actual expressiveness.
Proposed new Capabilities.
So the process has Permitted and Inheritable capabilities and executables have Forced and Allowed.
In cases where nothing is set for the file, the default is everything Allowed, nothing Forced. A process neither gains nor loses capabilities when running the executable.
The default state for a process is nothing Permitted, everything Inheritable.
A system call exists so that a program may voluntarily drop Permitted or Inheritable capabilities. Naturally, once dropped, a capability can only be re-acquired by being Forced.
Why Inheritable? Many daemons require a capability or two to function. Some need it only at the beginning (CAP_NET_BIND_SERVICE, the ability to bind to a port below 1024, comes to mind). Others need it on an ongoing basis; for example, a userspace NFS server will need CAP_DAC_OVERRIDE so it can modify files on other users' behalf.
In that case, an attacker might like to exploit unfsd and get it to execute a nice privileged shell for them. So, unfsd voluntarily gives up its Inheritable capabilities. If the bad guy DOES exploit unfsd and get a shell, the shell will have no privileges at all.
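A small sketch of the unfsd scenario, using the exec rules summarised later in this document for a file with no fscap xattr (everything Allowed, nothing Forced). The function name and the 32-bit width are assumptions made for illustration.

```python
CAP_DAC_OVERRIDE = 1 << 1    # bit position as in <linux/capability.h>
ALL = (1 << 32) - 1          # assumption: a 32-bit capability field


def exec_plain_binary(permitted, inheritable):
    """exec of a file with no fscap: Allowed = everything, Forced = nothing."""
    allowed, forced = ALL, 0
    permitted &= inheritable     # only inheritable capabilities survive exec
    permitted &= allowed
    inheritable &= allowed
    permitted |= forced
    inheritable |= forced
    return permitted, inheritable


# unfsd that never dropped anything: an exploited exec of /bin/sh keeps the power.
careless_shell = exec_plain_binary(CAP_DAC_OVERRIDE, ALL)

# unfsd that gave up its Inheritable set right after startup.
careful_shell = exec_plain_binary(CAP_DAC_OVERRIDE, 0)
```

Here careless_shell still carries CAP_DAC_OVERRIDE in Permitted, while careful_shell ends up with both sets empty. unfsd itself keeps working, since dropping Inheritable does not touch its own Permitted set.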
So why an Allowed set for files?
Honestly, it's just a bit of insurance. We've all seen programs that stridently warn 'never run as root'. Some even refuse to run as root. Instead, those should set Allowed to nothing and be happy.
The Linux kernel already has MOST of what's needed to go to a capability system. In the 2.3.x to 2.4.x kernels, all checks for uid or euid==0 were changed to check capabilities instead. To maintain backwards compatibility, when a setuid-root program runs, all capabilities are forced in the kernel. Each process's task_struct has fields for Permitted, Effective, and Inherited.
The capabilities patch avoids hacking everything up by simply setting Effective and Permitted together. It retains the behaviour of setting everything when execing a suid-root program.
It adds the file capabilities in the form of the fscap xattr (extended attribute). Xattrs were added to ext2, ext3, and the VFS some time ago. Reiserfs v3 was patched to support xattrs over Hans' vigorous objections. Since then, xattrs have been added to most filesystems in Linux that could credibly support them (msdosfs need not apply). Iron Penguin modifies fs/xattr.c's xattr_permission function so that 'fscap' can only be modified by a process with CAP_SETUID. CAP_SETUID is chosen since it is certainly related, CAP_SETPCAP is generally unavailable system-wide, and admins of virtual systems must be able to manipulate this attribute but are restricted from CAP_SYS_ADMIN. Without that restriction, any idiot could set a forced capability on his own programs and wreak havoc. Arguably, CAP_SETUID should be adequate for manipulating fscap.
So, in summary, when exec is called, the process's capabilities get re-computed as follows:
If xattr trusted.fscap doesn't exist, Allowed = ~0 and Forced = 0.
If xattr trusted.fscap doesn't exist AND executable is setuid-root, Allowed = ~0 and Forced = ~0.
Permitted &= Inherited
Permitted &= Allowed
Inherited &= Allowed
Permitted |= Forced
Inherited |= Forced
Note on terminology: The above assumes that capabilities are stored in bitfields (as they are in Linux). If you prefer set notation, A &= B may be read as A becomes the intersection of A and B. A |= B reads A becomes the union of A and B. A = 0 reads A becomes the empty set, and A = ~0 reads A becomes the set of all capabilities.
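For anyone who wants to convince themselves that the two notations agree, here is a quick sketch. The three capabilities and their bit numbers are taken from linux/capability.h; the to_bits and to_names helpers are hypothetical.

```python
# Capability name -> bit number (as in <linux/capability.h>); a small sample only.
CAPS = {"CAP_DAC_OVERRIDE": 1, "CAP_SETUID": 7, "CAP_NET_RAW": 13}


def to_bits(names):
    """Set notation -> bitfield."""
    bits = 0
    for name in names:
        bits |= 1 << CAPS[name]
    return bits


def to_names(bits):
    """Bitfield -> set notation (restricted to the sample above)."""
    return {name for name, bit in CAPS.items() if bits & (1 << bit)}


a = {"CAP_DAC_OVERRIDE", "CAP_NET_RAW"}
b = {"CAP_NET_RAW", "CAP_SETUID"}

intersection = to_names(to_bits(a) & to_bits(b))   # A &= B
union = to_names(to_bits(a) | to_bits(b))          # A |= B
empty = to_names(0)                                # A = 0
everything = to_names(~0)                          # A = ~0
```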
In Linux, the kernel ALSO has a global value, called the capability bounding set, that limits the capabilities any process at all (except for init) may ever have. This set may be altered (downward) through /proc/sys/kernel/cap-bound. This adds two additional computations:
Permitted &= Bound
Inherited &= Bound
Furthermore, the Iron Penguin patch set adds a per-process capability ceiling as well and two more computations:
Permitted &= Ceiling
Inherited &= Ceiling
Finally, since (as explained above) there is no point in separating Permitted and Effective capabilities, we do:
Effective = Permitted
This is done simply to avoid hacking up the whole kernel.
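Putting the whole recomputation together, in the order given above, might look like the following sketch. It is a model for experimenting with the rules, NOT the patch's kernel code: the Proc class, the function names, the way fscap is passed in, and the 32-bit width are all assumptions.

```python
from dataclasses import dataclass

CAP_DAC_OVERRIDE = 1 << 1    # bit positions as in <linux/capability.h>
CAP_NET_RAW = 1 << 13
ALL = (1 << 32) - 1          # stands in for ~0; 32-bit width is an assumption
NONE = 0


@dataclass
class Proc:
    permitted: int = NONE    # default: nothing Permitted
    inherited: int = ALL     # default: everything Inheritable
    effective: int = NONE
    ceiling: int = ALL       # per-process ceiling (modelled as a plain field)


def file_sets(fscap, setuid_root):
    """Return (allowed, forced); fscap is None when trusted.fscap is absent."""
    if fscap is not None:
        return fscap
    return ALL, (ALL if setuid_root else NONE)


def exec_recompute(p, fscap=None, setuid_root=False, bound=ALL):
    allowed, forced = file_sets(fscap, setuid_root)
    p.permitted &= p.inherited
    p.permitted &= allowed
    p.inherited &= allowed
    p.permitted |= forced
    p.inherited |= forced
    p.permitted &= bound         # global bounding set
    p.inherited &= bound
    p.permitted &= p.ceiling     # per-process ceiling
    p.inherited &= p.ceiling
    p.effective = p.permitted    # no separate Effective set in practice
    return p


# An unprivileged user runs ping, whose fscap allows and forces CAP_NET_RAW.
ping = exec_recompute(Proc(), fscap=(CAP_NET_RAW, CAP_NET_RAW))
# The same user runs an ordinary binary with no fscap.
plain = exec_recompute(Proc())
# A setuid-root binary with no fscap, under a bounding set lacking CAP_DAC_OVERRIDE.
bounded = exec_recompute(Proc(), setuid_root=True, bound=ALL & ~CAP_DAC_OVERRIDE)
```

In this model ping ends up with exactly CAP_NET_RAW, the plain binary with nothing, and the setuid-root binary with everything except what the bounding set removed.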
As for POSIX compliance, it's irrelevant now, since the 'official' POSIX capabilities proposal managed to be so foggy that no two people interpreted it the same way (that is, amongst the few who didn't just quit reading it) and so it went nowhere and was withdrawn in 1998.
The essential IDEA was sound even if the description, terminology, and implementation were not. The principle of least privilege is not only sound, but is a cornerstone of security.
I have retained the mis-naming of 'capability' throughout this document and the actual code for the sake of consistency with the kernel and terminology used throughout the Internet.
Properly, the 'capability' referred to here is a 'privilege'. A proper 'capability' is a very different concept documented elsewhere.