The jail call is inspired by and somewhat modeled on the jail syscall from FreeBSD. Jail brings together the IronPenguin improvements in chroot, bind mounts, etc., and adds signal and process isolation.
The primary purpose of jail is to allow the creation of extremely lightweight virtual servers that live in a chroot jail.
The jail call accepts a chroot path, a capability ceiling, and an IP address. Jail chroots to the provided path, sets the capability ceiling on the process, and 'locks' the process and its descendants to the provided IP address. Finally, it marks the process as being in jail (sets the new PF_JAILED flag on the process's task_struct in the kernel).
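The order of operations described above can be sketched as a small userspace model. This is only an illustration, not the kernel implementation: struct task_model, model_jail, the field layout, and the bit value chosen for PF_JAILED are all assumptions made for the example. Only the names PF_JAILED and capceiling come from the text.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* PF_JAILED is named in the text; the bit value used here is an assumption. */
#define PF_JAILED 0x1u

/* Hypothetical stand-in for the task_struct fields that jail touches. */
struct task_model {
    char     root[256];   /* filesystem root the task is confined to */
    uint64_t caps;        /* capabilities the task currently holds   */
    uint64_t capceiling;  /* upper bound on capabilities             */
    uint32_t jail_ip;     /* IPv4 address, host byte order           */
    unsigned flags;       /* process flags, including PF_JAILED      */
};

/* Apply the steps in the order the text gives them.
 * Returns 0 on success, -1 if the path is not absolute or does not fit. */
int model_jail(struct task_model *t, const char *path,
               uint64_t capceiling, uint32_t ip)
{
    assert(t != NULL && path != NULL);

    size_t len = strlen(path);
    if (len == 0 || path[0] != '/' || len >= sizeof t->root)
        return -1;

    memcpy(t->root, path, len + 1); /* 1. chroot to the provided path   */
    t->capceiling = capceiling;     /* 2. set the capability ceiling    */
    t->caps &= capceiling;          /*    and drop anything above it    */
    t->jail_ip = ip;                /* 3. lock the task to the jail IP  */
    t->flags |= PF_JAILED;          /* 4. mark the task as jailed       */
    return 0;
}
```

In this model a child would simply inherit a copy of the struct, which is how the lock is assumed to extend to descendants.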
Procfs has been modified so that jailed processes can only see processes whose root is reachable from the current process. That is, a jailed process doing ps afx will see only the processes in the same jail. The signal syscalls are likewise modified to prevent jailed processes from blindly signaling random pids to attack the unjailed processes.
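One way to picture the "root is reachable" rule is as a path-prefix test on the two processes' roots. The sketch below is a userspace approximation under stated assumptions: roots are represented as absolute, normalized path strings, and the names root_reachable and may_signal are hypothetical. The real kernel check would work on dentries rather than strings.

```c
#include <stdbool.h>
#include <string.h>

/* True if target_root is viewer_root itself or lies beneath it.
 * Both paths are assumed absolute and normalized, with no trailing
 * slash except for the root directory "/" itself. */
bool root_reachable(const char *viewer_root, const char *target_root)
{
    size_t n = strlen(viewer_root);

    /* A viewer rooted at "/" can reach every absolute path. */
    if (n == 1 && viewer_root[0] == '/')
        return target_root[0] == '/';

    if (strncmp(viewer_root, target_root, n) != 0)
        return false;

    /* Require a component boundary so "/jail/a" does not match "/jail/ab". */
    return target_root[n] == '\0' || target_root[n] == '/';
}

/* Hypothetical signal filter: an unjailed sender is not restricted by
 * this rule; a jailed sender may only signal tasks it could also see. */
bool may_signal(bool sender_jailed, const char *sender_root,
                const char *target_root)
{
    return !sender_jailed || root_reachable(sender_root, target_root);
}
```

The same predicate serves both procfs visibility and signal delivery in this sketch, which matches the text's statement that the signal syscalls are "likewise modified".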
In addition, all sysctls and /proc/sys become read-only even for a jailed root.
The ipv4 protocol is modified so that any attempt to bind to an IP address is re-directed to bind to the IP parameter of the jail call, including binds to 127.0.0.0/8.
Further, connects to localhost addresses (127.0.0.0/8) are re-directed to the jail IP.
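The two address rewrites can be expressed as pure functions. This is a minimal sketch, assuming IPv4 addresses held as 32-bit values in host byte order; the function names are hypothetical and do not correspond to anything in the actual patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for any address in 127.0.0.0/8 (host byte order). */
bool is_loopback(uint32_t addr)
{
    return (addr >> 24) == 127;
}

/* bind(): whatever address the jailed process asks for, including
 * loopback addresses, the socket ends up bound to the jail IP. */
uint32_t jail_bind_addr(uint32_t requested, uint32_t jail_ip)
{
    (void)requested; /* the requested address is deliberately ignored */
    return jail_ip;
}

/* connect(): only loopback destinations are rewritten to the jail IP;
 * every other destination passes through unchanged. */
uint32_t jail_connect_addr(uint32_t dest, uint32_t jail_ip)
{
    return is_loopback(dest) ? jail_ip : dest;
}
```

With both rewrites in place, a service inside the jail that binds to 127.0.0.1 and a client inside the same jail that connects to 127.0.0.1 still meet each other, because both ends are redirected to the jail IP.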
In order for a jail to be secure, a number of conditions need to be met:
A fully capable root will always be able to find a way out of jail. This is the reason a capceiling is provided to jail. At least the following capabilities must be disabled to seal the virtual server tightly:
Obviously, jailed processes cannot be allowed to reboot the host!
If modules can be loaded, the kernel can be modified to open the jail, so this is disallowed.
If a jailed root can mknod devices, it can make its own r/w mount of the root filesystem inside the jail.
A jailed root must not be permitted to alter the host's routes or firewall. At the least, such a capability would allow redirection of server traffic or simply taking the host off the net.
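A capability ceiling that excludes the four abilities above might be built as follows. The mapping to CAP_SYS_BOOT, CAP_SYS_MODULE, CAP_MKNOD, and CAP_NET_ADMIN is an assumption based on the descriptions (only CAP_MKNOD is named explicitly in the text). The 64-bit mask representation, the JAIL_CAP_BIT macro, and sealed_capceiling are likewise hypothetical. The constants themselves come from the standard Linux header.

```c
#include <stdint.h>
#include <linux/capability.h>

/* Hypothetical helper: one bit per capability number in a 64-bit mask. */
#define JAIL_CAP_BIT(cap) (UINT64_C(1) << (cap))

/* Start from a candidate ceiling and clear the capabilities that
 * would let a jailed root escape or disrupt the host. */
uint64_t sealed_capceiling(uint64_t candidate)
{
    const uint64_t forbidden =
        JAIL_CAP_BIT(CAP_SYS_BOOT)   |  /* reboot the host            */
        JAIL_CAP_BIT(CAP_SYS_MODULE) |  /* load kernel modules        */
        JAIL_CAP_BIT(CAP_MKNOD)      |  /* create device nodes        */
        JAIL_CAP_BIT(CAP_NET_ADMIN);    /* alter routes and firewall  */

    return candidate & ~forbidden;
}
```

The text says "at least" these capabilities, so a real ceiling would probably clear more bits than this sketch does.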
A jail's filesystem needs to be carefully sanitized, particularly /dev. By removing CAP_MKNOD, root in the jail can't create arbitrary device nodes, but it can still use those that have been provided for it.
mem, kcore, agpgart:
Any of these would allow root to modify the running kernel and/or its own task_struct (for example, clearing PF_JAILED and changing current->capceiling to all enabled). Note that though agpgart is MEANT for video access, a GART is a memory management/mapping device, so who knows what weirdness could result!
hdd*, sd*, md*, mapper, VolGroup* etc:
If a jailed root can mount copies of the root filesystem, all bets are off!
Can't have a jailed process snooping on the sysadmin.
Ethertaps. It's OK for a jail to access specific assigned ethertaps, but certainly not just ANY ethertap.
If a jail can frob the hardware, all bets are off!
A jail should have a PRIVATE and isolated shared memory space.
Actually, this is OK and REQUIRED for a functional virtual system. For example, without it, sshd and friends won't work. The pty driver has been modified to restrict access appropriately.
The rule here is “When in doubt, leave it out”. As the jail syscall is brand new, best practices remain under development. Hopefully, those best practices will be filled in over time.
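A jail's /dev can be audited against the device names called out above. The sketch below covers only the names the text spells out (mem, kcore, agpgart, hdd*, sd*, md*, mapper, VolGroup*). The function name dev_name_is_risky is hypothetical, and the list is not complete. It uses the POSIX fnmatch() call for the glob-style patterns.

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <stddef.h>

/* Device name patterns the text names as unsafe inside a jail. */
static const char *const risky_dev_patterns[] = {
    "mem", "kcore", "agpgart",            /* kernel and memory access    */
    "hdd*", "sd*", "md*",                 /* block devices               */
    "mapper", "VolGroup*",                /* device-mapper / LVM entries */
};

/* True if a /dev entry name matches any pattern in the list above. */
bool dev_name_is_risky(const char *name)
{
    size_t count = sizeof risky_dev_patterns / sizeof risky_dev_patterns[0];

    for (size_t i = 0; i < count; i++) {
        if (fnmatch(risky_dev_patterns[i], name, 0) == 0)
            return true;
    }
    return false;
}
```

A deny list like this can only catch known-bad names. Following "When in doubt, leave it out", populating the jail's /dev from an explicit allow list is likely the safer design, with a check like this as a second line of defense.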
The host should configure an alias OR an 802.1q VLAN interface for the exclusive use of the jailed virtual server. The IP address of that interface should be passed to the jail call.