OFS: OPUS File System

(Open Parallel Unified System)

The state of filesystems: In spite of advances in clustered and distributed computing, filesystems remain in the dark ages. The primary option for sharing filesystems remains NFS. Development exists in the form of GFS and CODA, but neither really meets requirements. Filesystems remain intimately bound to their block device based backing store. Even with LVM and other techniques that improve the block storage itself, filesystems are not readily resizable. Most shared filesystems continue to rely on a centralized resource for storage, and many do not support fully coherent operation. The few that avoid these problems rely on Fibre Channel or other expensive (and somewhat uncommon) hardware, and even they are not readily resizable or device independent.

As a result, most HPC clusters still mount NFS storage and rely on rsync and other utilities to occasionally update local or backup storage. Few express elation with this solution, but they use it because it exists.

While CODA and PVFS are fine filesystem developments, both are designed for specialized needs and neither is a suitable replacement for NFS in most cases. CODA focuses on disconnected operation and cannot possibly be coherent or support atomic operations. PVFS is targeted primarily as a data staging area for parallel jobs, with no support for redundancy or resilient operation. GFS comes close, but it emphasizes more or less exotic specialized storage devices and makes no effort to extend filesystem semantics or even to fully support mmap semantics.

OFS is a collection of technologies intended to facilitate R&D efforts at producing a versatile distributed collection of filesystems that address the needs of HPC and failover clusters (including true single system image) as well as general computation environments.

Fuse: Fuse is a GPL project that exports filesystem functionality from the kernel to userspace (a microkernel approach). This eases development in several respects: a wider range of development tools is available, the consequences of a crash are minimized, languages other than C can be used, and robust use of networking is simpler. Linux Labs International, Inc. has adapted the Fuse project to support a simple inode based API for filesystem research purposes, along with a Python binding for libfuse. These changes have been sent to the maintainer and are expected to be released in the upcoming 2.0 version.

The hope is to use the obvious advantages of a microkernel filesystem approach without introducing the difficulties of switching away from the most commonly used operating system in clusters.

Python: At first glance, Python may seem to be an odd choice for developing an OS resource. However, it offers the advantage of being a fully introspective object oriented language with good debugging support, which lends itself well to rapid prototyping in situations where the design may change frequently (as is the case for many research projects). Python also lends itself well to selectively translating performance critical portions of the program to C through its easy to use C interfaces. The latter advantage opens the possibility of having a usable system earlier in the process while development continues in an unstable branch.

Backing store: Initially, a BackingStore module has been written. Its primary objectives are functional correctness and ease of debugging. It consists of an Inode class for regular files, a dnode class derived from Inode for directories, and a dentry class to handle directory entries. These are wrapped in a filesystem class which manages caching operations, the translation of thrown exceptions into Unix style errno values, and superblock related functionality.

Currently, the BackingStore operates on top of any reasonable filesystem such as ext2/3. Each inode is stored as two separate files, one for metadata (such as owner, permissions, and timestamps) and one for data. This is done so that work may focus on filesystem semantics rather than on block allocation and similar details. It also makes inspection of the low-level inode a simple matter using regular command line utilities. Even in that simple form, it makes the filesystem easily resizable and allows nothing more than cp to be used to evacuate a volume (directory!) that is to be unmounted. As inodes are assigned without regard to block numbers, kernel imposed block device limitations do not apply to the filesystem, only file size limits.
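The two-files-per-inode layout and the errno translation described above can be sketched roughly as follows. This is an illustration only, not the actual BackingStore code: the file naming (<number>.meta and <number>.data), the JSON metadata encoding, the NoSuchInode exception, the method names, and the convention of returning a negated errno are all assumptions made for the example. Caching, directories, and superblock handling are left out.

```python
import errno
import json
import os
import time


class NoSuchInode(Exception):
    """Hypothetical exception: the files backing an inode are missing."""


class Inode:
    """One inode kept as two ordinary files in a host directory."""

    def __init__(self, root, number):
        self.number = number
        # Assumed naming scheme: <number>.meta and <number>.data
        self.meta_path = os.path.join(root, "%d.meta" % number)
        self.data_path = os.path.join(root, "%d.data" % number)

    def create(self, owner, mode):
        meta = {"owner": owner, "mode": mode, "mtime": time.time()}
        with open(self.meta_path, "w") as f:
            json.dump(meta, f)
        open(self.data_path, "wb").close()  # start with an empty data file

    def metadata(self):
        try:
            with open(self.meta_path) as f:
                return json.load(f)
        except FileNotFoundError:
            raise NoSuchInode(self.number) from None

    def write(self, offset, data):
        try:
            with open(self.data_path, "r+b") as f:
                f.seek(offset)
                return f.write(data)
        except FileNotFoundError:
            raise NoSuchInode(self.number) from None

    def read(self, offset, size):
        try:
            with open(self.data_path, "rb") as f:
                f.seek(offset)
                return f.read(size)
        except FileNotFoundError:
            raise NoSuchInode(self.number) from None


class FileSystem:
    """Wraps inodes and turns exceptions into negated errno values."""

    def __init__(self, root):
        self.root = root

    def _call(self, number, method, *args):
        try:
            return getattr(Inode(self.root, number), method)(*args)
        except NoSuchInode:
            return -errno.ENOENT

    def create(self, number, owner, mode):
        Inode(self.root, number).create(owner, mode)
        return 0

    def getattr(self, number):
        return self._call(number, "metadata")

    def write(self, number, offset, data):
        return self._call(number, "write", offset, data)

    def read(self, number, offset, size):
        return self._call(number, "read", offset, size)
```

Because every inode is just a pair of regular files, the metadata in this sketch can be inspected with cat, and the whole volume can be copied with cp, which is the property the text relies on.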

Glue: Finally, a glue class (which derives from Fuse, the Fuse Python binding) has been developed which translates kernel requests into OFS requests. Because of the design of the storage API, the glue layer is quite thin and could degenerate into a simple mix-in in the near future. Effectively, this is a VFS implementation in userspace with an interface to the Linux kernel's VFS.
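To give a feel for how thin such a glue layer can be, here is a minimal sketch of the mix-in idea. It deliberately does not use the real Fuse Python binding: KernelFacade is a hypothetical stand-in for the binding's base class, ToyStore is a hypothetical stand-in for the OFS storage API, and all method names and argument orders are assumptions chosen for the example.

```python
import errno


class ToyStore:
    """Hypothetical stand-in for the OFS storage API, keyed by inode number."""

    def __init__(self):
        self.files = {1: b"hello world"}

    def read_inode(self, ino, offset, size):
        if ino not in self.files:
            return -errno.ENOENT
        return self.files[ino][offset:offset + size]

    def inode_size(self, ino):
        if ino not in self.files:
            return -errno.ENOENT
        return len(self.files[ino])


class GlueMixin:
    """Kernel-facing operation names forwarded to the store, nothing more."""

    OPERATIONS = ("read", "getattr")

    def read(self, ino, size, offset):
        # Only the argument order differs between the two sides.
        return self.store.read_inode(ino, offset, size)

    def getattr(self, ino):
        size = self.store.inode_size(ino)
        if size < 0:
            return size  # pass the store's errno straight through
        return {"st_ino": ino, "st_size": size}


class KernelFacade:
    """Stand-in for the binding's base class: routes a request by name."""

    OPERATIONS = ()

    def dispatch(self, op, *args):
        if op not in self.OPERATIONS:
            return -errno.ENOSYS  # operation not implemented
        return getattr(self, op)(*args)


class OFSGlue(GlueMixin, KernelFacade):
    """The glue itself adds only a reference to the store."""

    def __init__(self, store):
        self.store = store
```

With these stand-ins, OFSGlue(ToyStore()).dispatch("read", 1, 5, 0) forwards to the store and hands back whatever the store returns, including its error codes.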

Future: With correctness and debugging under control through testing of the BackingStore module, the natural next step is to place a DistributedStore module between the glue and the BackingStore. The DistributedStore module would be responsible for maintaining coherency in the distributed filesystem, local caching of data, and transparently locating Inode objects wherever they might exist in the system.

The DistributedStore module will use a Communicator class for all inter-node communication. Work has already begun on this module. To maximise its versatility, its design criteria include simple datagram based messages and minimal dependence on the underlying network for reliability, fragmenting, and other semantics. By developing it in this way, it should be relatively painless to adapt it to Dolphin interconnect, Myrinet, raw Ethernet, or InfiniBand as an underlying fabric. The Communicator includes its own discovery mechanism along with machine number to address translation. By presenting an abstract machine number to its client, it ensures that the DistributedStore will not need to understand the addressing scheme of any particular communications fabric.

The Communicator supports an authentication scheme based on SHA1 hashes to prevent unauthorized systems from accessing or altering the filesystem. However, the data is not encrypted at this time. While some applications may require encryption in the future, that will be a subject of later development, and it will remain optional since it could introduce a significant performance penalty and will be unnecessary for many HPC applications.
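The Communicator ideas above (datagram messages, fragmenting done by the Communicator rather than the network, machine number to address translation, and SHA1 based authentication) can be sketched as follows. This is not the actual Communicator: the header layout, the 1024 byte fragment size, the use of HMAC-SHA1 with a shared secret, and every name in the code are assumptions for illustration. Discovery, retransmission, and the sockets themselves are omitted.

```python
import hashlib
import hmac
import struct

# Assumed header: source machine, destination machine,
# fragment index, fragment count (all 16-bit, network byte order).
HEADER = struct.Struct("!HHHH")
DIGEST_LEN = hashlib.sha1().digest_size  # 20 bytes for SHA1
MAX_PAYLOAD = 1024  # assumed per-datagram payload limit


class Communicator:
    def __init__(self, machine, secret, addresses):
        self.machine = machine      # this node's abstract machine number
        self.secret = secret        # shared secret (assumption)
        self.addresses = addresses  # machine number -> fabric address

    def _sign(self, body):
        return hmac.new(self.secret, body, hashlib.sha1).digest()

    def pack(self, dst, payload):
        """Split payload into signed datagrams for machine dst.

        Returns (fabric_address, list_of_datagrams)."""
        chunks = [payload[i:i + MAX_PAYLOAD]
                  for i in range(0, len(payload), MAX_PAYLOAD)] or [b""]
        datagrams = []
        for index, chunk in enumerate(chunks):
            body = HEADER.pack(self.machine, dst, index, len(chunks)) + chunk
            datagrams.append(body + self._sign(body))
        return self.addresses[dst], datagrams

    def unpack(self, datagram):
        """Return (src, index, count, chunk), or None if rejected."""
        if len(datagram) < HEADER.size + DIGEST_LEN:
            return None
        body, digest = datagram[:-DIGEST_LEN], datagram[-DIGEST_LEN:]
        if not hmac.compare_digest(digest, self._sign(body)):
            return None  # not signed with our secret, or altered in transit
        src, dst, index, count = HEADER.unpack(body[:HEADER.size])
        if dst != self.machine:
            return None
        return src, index, count, body[HEADER.size:]


def reassemble(fragments):
    """Join unpacked fragments (in any order); None if any are missing."""
    if not fragments:
        return None
    count = fragments[0][2]
    chunks = {index: chunk for _src, index, _count, chunk in fragments}
    if set(chunks) != set(range(count)):
        return None
    return b"".join(chunks[i] for i in range(count))
```

The client only ever names a machine number; the address returned by pack is whatever the fabric needs (a host and port, a MAC address, and so on), so swapping the fabric would only change the contents of the address table in this sketch.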

Development Objectives: OFS is to be a fully distributed filesystem supporting fully coherent operation (implying distributed locking and atomicity), data redundancy, graceful failover, and graceful degradation. While the initial work has focused on correct implementation of POSIX file semantics, future work is likely to include inode versioning and Plan 9-like device and pseudo device access semantics. As VM and FS are intimately tied together, future work will naturally include interfacing with distributed shared memory through other LinuxLabs work such as l4mmu. Given a coherent distributed VM and Plan 9 pseudo file access to system resources, a natural conclusion will be a strong single system image.

Another Linux Labs International GPL Project…

Steven James & Eduard Goiu 2006/07/23 20:55

ofs.txt · Last modified: 2010/04/15 21:18 (external edit)