PVenti is a new filesystem drawing on the Plan 9 Venti storage server with updates to reflect more modern assumptions about data storage.
Venti is a block storage system in which blocks are identified by a hash of their contents rather than by a fixed block number. When Venti was designed, it was assumed that massive storage would require WORM media in a jukebox and that such media would have a lifetime in the tens of years. Thus, Venti does not maintain refcounts on blocks; instead, once created, a block is never destroyed.
A useful feature of Venti is that no matter how many times a file is copied, it is stored only once, as a natural consequence of the way blocks are named rather than through special measures to eliminate duplicate data. Further, if those copies are versions with only a few changes between them, only the actual changes increase storage use. This too arises from the fundamental design rather than from special measures.
In normal operation on Plan 9, Venti is used in combination with Fossil. Fossil is a caching filesystem meant to run on read-write media such as a hard drive. Periodically, the contents of Fossil are flushed to the underlying Venti, creating a new version of the filesystem. Because Venti never discards data, this naturally results in the ability to retrieve any previous version of a file.
PVenti, by contrast, is written with the assumption that large volumes of read-write media will remain cheaper than a jukebox of WORM media and that, in practice, WORM media may rot in less than 10 years. Based on that, there is now value in maintaining a refcount, discarding old versions (under user control), and storing multiple copies of a block on different servers/media in order to keep ahead of drive failure.
In addition, the concepts of Fossil and Venti are implemented as layers in a single system rather than as separate filesystems, and the moving of files from fossil to venti happens based on transactions rather than on a fixed timetable. It also adds a full write-logging facility such that the log can be sent to a remote mirror server in real time, or periodically, in order to maintain an offsite backup.
The write log is maintained in such a way that it is entirely sufficient to maintain a mirror AND such that inadvertently replaying a log multiple times will do no harm.
In order to handle this, metadata is stored separately from the venti blocks. Currently both are stored in an underlying filesystem such as ext3. The metadata includes typical stat data such as size, the time fields, owner UID and GID, Unix permissions, ACLs, xattrs, and a list of the PVenti blocks that make up the file.
PVenti blocks are stored based on the hash of their contents, just as in venti. However, they are stored as files in an underlying fs. They also maintain their own metadata including refcount, ownership, and (experimentally) a backreference listing of files that reference them.
Versioning shall be based on files opened for writing rather than day by day. That is, a new version is created when a file that is open read-only, or not open at all, is opened for writing. The transaction closes only when no process has the file open for writing. In special cases, the transaction can be logically closed and a new one immediately reopened without actually closing the file.
The transaction log will include each individual write so that even a failure in mid update will be recoverable.
The primary goal for PVenti is data safety. That in turn requires correctness. Speed is a distant third. That is, PVenti will never be the fastest filesystem available; if speed is valued over data integrity, PVenti is not the correct choice.
All PVenti data is maintained in an underlying filesystem, at least for version 1.0. This is for several reasons, including that block allocation and indexing is not an interesting problem for PVenti, so practically any existing implementation is preferable to developing yet another one.
The root of a PVenti contains /meta, /venti, and /fossil.
/meta stores file blocklists, /venti contains the blocks themselves, /fossil contains a skeleton of the filesystem structure and all metadata (including size) for the files. When a file is opened, the venti block data is copied into the fossil. When the fossil is closed, the venti metafile is (re)created and the fossil's blocks are cleared.
The /fossil system takes advantage of the ability to create files with holes, that is, files where blocks of all zeros do not actually take up space in storage (hence the name 'fossil': it has the form of the file but not the substance).
Notably, the /venti component may be shared amongst multiple otherwise independent filesystems. Access to the data is already controlled by the time the venti becomes involved, so sharing is not an issue.
When a file is opened, after checking relevant permissions, etc., the metadata is accessed from a file of the same name/path in /meta and the venti blocks are fetched and written into the fossil. Any read and write accesses are directed to the fossil. If versioning is in use, the metafile is renamed and a new empty fossil is created for that version. Otherwise, the metafile is unlinked.
When the file is released (the last open is closed), the contents of the fossil are copied into venti blocks and the metafile is created with a list of those blocks. Then the fossil is truncated to 0 and then back to its correct size. The latter step causes the underlying filesystem to release all of its actually allocated blocks and assume the data is all zeros. At that point, the fossil consumes only one block, for its inode.
This handling allows PVenti blocks to be larger than the OS/FUSE blocks without doing anything special and without thrashing the block store with new blocks that would be immediately destroyed when the next sequential write happens. This strategy also allows for data compression at the venti level. In addition, it allows the most common filesystem accesses to be thin translations back to the underlying filesystem, and thus fast and efficient.
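For illustration, the hole-punching step of the release path described above could look roughly like this sketch (the function name and the trailing fdatasync are assumptions; the design itself only specifies the truncate-to-zero-then-back-to-size sequence):

    /* Sketch: after the fossil's contents have been copied into venti
     * blocks and the metafile written, release the fossil's data blocks
     * while preserving its apparent size.  fossil_fd is an open
     * descriptor for the fossil, file_size its logical size. */
    #include <sys/types.h>
    #include <unistd.h>

    static int fossil_clear(int fossil_fd, off_t file_size)
    {
        if (ftruncate(fossil_fd, 0) < 0)         /* drop all allocated blocks */
            return -1;
        if (ftruncate(fossil_fd, file_size) < 0) /* restore the size as a hole */
            return -1;
        return fdatasync(fossil_fd);             /* make the new state durable */
    }

On a filesystem that supports sparse files (such as ext3), the re-extended region is a hole, so the fossil keeps its correct stat size while occupying essentially no data blocks.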
Files in PVenti are versioned. That is, old versions of files are kept available unless explicitly removed. The Plan 9 Venti/Fossil combination versions day by day, flushing all files in the fossil filesystem to the Venti at a fixed time of day (and, in some cases, also when triggered by the fossil filesystem filling up).
By contrast, PVenti defines the sequence of opening for write (or read-write) then closing as a transaction, and maintains versions transaction by transaction.
In order to avoid namespace pollution, older versions of a file are stored in a special sub-directory named '.&&'. In addition, when a file is deleted, its final form is also moved into '.&&'. That name is chosen because it is rather unlikely to be used for anything else, and the leading '.' marks it as a hidden file on POSIX systems. Samba can be configured to map dot files as hidden for Windows as well.
The .&& directory is always set as read-only as are all files in it. While in theory, versions of version files could be stored in a directory called .&& inside the .&& directory, that's likely to be a trip down the rabbit hole and so is not supported. If an old version of a file is to be used as the basis for a newly modified file, it should be copied out of the .&& directory first, then modified.
Version files have a suffix appended of the form -YYYYMMDDHHMMSS-x (year, month, day, hour, minute, second), where x is a sequence number within that second. In that way, a version name can never clash with an existing version.
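For illustration, such a version suffix could be generated roughly as follows; the function name and the use of local time are assumptions, not part of the design:

    /* Sketch: build a version file name such as "file1.txt-20080229120000-0".
     * 'seq' is the per-second sequence number used to avoid clashes. */
    #include <stdio.h>
    #include <time.h>

    static void version_name(char *buf, size_t len,
                             const char *name, time_t when, int seq)
    {
        char stamp[16];
        struct tm tm;

        localtime_r(&when, &tm);
        strftime(stamp, sizeof stamp, "%Y%m%d%H%M%S", &tm);
        snprintf(buf, len, "%s-%s-%d", name, stamp, seq);
    }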
This means that file version rollback (and even undoing that rollback) is a simple matter of copying an old version of a file over the current version (with the overwritten version itself becoming a version file).
So, given that it is Mar 30th, 2008 at noon and we wish to roll the file file1.txt back to the Feb 29th version, do (quoting the path, since '&&' is special to the shell):
cp '.&&/file1.txt-20080229120000-0' file1.txt
To undo that rollback, do:
cp '.&&/file1.txt-20080330120000-0' file1.txt
State data in the filesystem is carefully managed such that should a crash (such as a power loss) happen at any point, the state of a given file can be accurately reconstructed on recovery. The state rules are simple. A metafile with no corresponding fossil should be unlinked unconditionally. All existing files always have a fossil but may not have a meta. Any fossil with no corresponding meta should be copied to a meta and its contents zeroed. A fossil with a corresponding meta should be zeroed.
Any files in the special directories meta/:tmp/ or fossil/:tmp/ should be unlinked.
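For illustration, a per-file recovery routine implementing these rules might look like the sketch below; the helper names (meta_exists, fossil_zero, and so on) are placeholders rather than part of the design, and cleanup of the :tmp directories is assumed to happen separately:

    /* Placeholder declarations; the real calls live in the meta and fossil layers. */
    extern int meta_exists(const char *path);
    extern int fossil_exists(const char *path);
    extern int meta_unlink(const char *path);             /* put blocks, unlink meta */
    extern int meta_create_from_fossil(const char *path); /* rebuild meta from fossil data */
    extern int fossil_zero(const char *path);             /* truncate fossil back to a hole */

    /* Sketch of the per-file crash-recovery rules stated above. */
    static int recover_file(const char *path)
    {
        int has_meta   = meta_exists(path);
        int has_fossil = fossil_exists(path);

        if (has_meta && !has_fossil)
            return meta_unlink(path);              /* stale meta: discard it */
        if (has_fossil && !has_meta) {
            if (meta_create_from_fossil(path) < 0) /* the fossil is the only copy */
                return -1;
            return fossil_zero(path);
        }
        if (has_fossil && has_meta)
            return fossil_zero(path);              /* the meta is authoritative */
        return 0;
    }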
Notably, PVenti is designed such that crash recovery can happen upon the next access to a file. The only harm in delaying recovery is that extra storage may be taken up by fossils containing data that is duplicated in the block store. This means the filesystem becomes available quickly after the system is restarted (e.g. power is restored). The fossil layer provides a function to recover a file without the extra overhead of opening and closing, though it is possible to recover simply by opening the file read-only and closing it. Either process can happen on the live filesystem with no penalty other than the slight load increase.
When a file is opened O_RDONLY, the meta is copied into the fossil. Clearly, should power fail before the file is closed, the meta is naturally up to date, and so the fossil is merely zeroed.
When a file is opened for writing, the meta begins copying to the fossil. As no write can yet happen, a failure looks like, and effectively IS, the read-only case.
Only once the fossil is fully copied out and fdatasync has been called is the meta deleted or renamed. A failure after the delete is again the read-only scenario.
Renaming happens when a version is to be created. In this case, the meta is renamed first. A failure at this point is fine: the renamed meta gets deleted on recovery, and the fossil is copied back to a meta again (again, the read-only scenario!).
Finally, a fossil is created in :tmp and its attributes are set. Then it is renamed to be the empty fossil for the new version. Since a write can only happen after this completes, the filesystem state is never in danger.
When a fossil is copied back to a meta on close, the meta is created as a file in meta/:tmp first, and is only renamed into place AFTER fdatasync has been called on it. Thus, it either contains all of the data in the fossil or it is deleted on recovery and recreated.
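A minimal sketch of that create-sync-rename sequence, assuming the metafile body is produced by a hypothetical write_meta_body helper and that the temporary name merely needs to be unique within :tmp:

    /* Sketch: write the new metafile to meta/:tmp/, fdatasync it, and only
     * then rename it into place, so a crash leaves either the complete new
     * meta or a :tmp leftover that recovery unlinks. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    extern int write_meta_body(int fd);   /* placeholder: emit stat header + block list */

    static int meta_commit(const char *tmpdir_path, const char *meta_path)
    {
        char tmp[4096];
        int fd;

        snprintf(tmp, sizeof tmp, "%s/%ld", tmpdir_path, (long)getpid());
        fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0)
            return -1;
        if (write_meta_body(fd) < 0 || fdatasync(fd) < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);
        return rename(tmp, meta_path);     /* atomic replace on POSIX filesystems */
    }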
Again, filesystem state integrity is maintained.
The only state error that is not fully covered is the case of a venti block having a higher refcount than it should. In this case no data is lost, but space may be taken up unnecessarily. This can be fixed on a live filesystem simply by checking that all back-references in a venti block actually still exist. A later version may address even this issue with a venti block journal.
When a file is opened for writing, the data from its meta is copied into the fossil. A failure at this point is the same as the read scenario. Once the fossil is complete, a stub fossil is created in .&& to hold the previous version. Finally, the meta is atomically renamed to be the backing meta for the previous version. Simultaneously, the opened fossil loses its meta. From that instant forward, the fossil may be considered marked as dirty, and recovery will recreate the meta by copying the fossil in.
pventi_get
pventi_put
pventi_create
pventi_addref
pventi_delref
pventi_create_if_not_ref
pventi_create_and_ref
The PVenti meta system maintains directory structure and ownership information for the filesystem. A look into filesystem/meta/ will show the complete file structure. Where possible, the underlying filesystem's own metadata handling is used. For example, if a file in PVenti is owned by Joe in the admin group, the metafile will be owned by Joe and the admin group. ACLs, xattrs, and Unix permissions are stored the same way. On the other hand, file size is stored in the metafile itself. After the stat header is a list of PVenti block names that make up the file data.
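The exact on-disk layout of the metafile body is not fixed by this document; purely as an illustration, it might look something like the following (the struct and field names are hypothetical):

    /* Hypothetical metafile layout: a small header carrying the logical
     * file size, followed by the SHA1 names of the blocks, in order.
     * Ownership, permissions, ACLs, and xattrs live on the metafile itself
     * in the underlying filesystem, so they are not repeated here. */
    #include <stdint.h>

    #define PVENTI_SCORE_LEN 20              /* SHA1 digest size in bytes */

    struct pventi_meta_header {
        uint64_t file_size;                  /* logical size of the file */
        uint32_t block_count;                /* number of block names that follow */
        uint32_t block_size;                 /* nominal venti block size */
    };
    /* followed by block_count names of PVENTI_SCORE_LEN bytes each */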
When a file is cached, the cache file is consulted for stat information. However, when it is copied back to meta, the metafile's utimes are reset after closing to reflect user rather than filesystem access.
The PVenti metafiles and PVenti blocks are an excellent storage format for a file, but are poor choices for user access. Amongst other reasons, the venti block size is a multiple of the OS block size. Were venti blocks used directly, a single sequential write of 2 OS blocks would result in the creation of 2 new venti blocks, with one destroyed almost immediately. With 64K blocks, as many as 15 venti blocks might be created and immediately destroyed for what should be a quick sequential write. In addition, creating new venti blocks is somewhat more computationally intensive than writing a block in a fossil. Finally, venti blocks may be transparently compressed. That is highly useful for long-term storage, but constantly decompressing and recompressing is not very efficient. It is far better to decompress on open and recompress updated data on release.
Naturally, as storage grows, adding additional volumes (a drive with a filesystem) to the venti is highly desirable. Fortunately, the SHA1 hash values for blocks are randomly distributed through the full name. Since venti blocks are stored by splitting the name up into directories and sub-directories, the first-level sub-directories may be placed on different physical volumes, which naturally results in a reasonably even distribution. Future versions may add a translation table and volume manager to permit live migrations, but for now the procedure is eminently manageable using simple tools during a planned downtime.
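For example, a block name could be mapped to its storage path roughly as follows; the particular two-level split shown here is an assumption, since the design only requires that the first-level directories be placeable on separate volumes:

    /* Sketch: derive the storage path for a block from its 40-character
     * hexadecimal SHA1 name, e.g. venti/ab/cd/abcd... for a name
     * beginning "abcd".  The first-level directory ("ab") can live on
     * any physical volume. */
    #include <stdio.h>
    #include <string.h>

    static int block_path(char *buf, size_t len,
                          const char *venti_root, const char *name_hex)
    {
        if (strlen(name_hex) != 40)          /* SHA1 in hex */
            return -1;
        snprintf(buf, len, "%s/%.2s/%.2s/%s",
                 venti_root, name_hex, name_hex + 2, name_hex);
        return 0;
    }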
Keeping old versions of files (including the last version when the file is unlinked) is simple enough, but what to do about deleted directories? The directory itself is unimportant, but the files in it need to be retained. What if it is created and removed a few times? Applying the same version renaming as files get is semantically valid, but results in the potential for files to exist in several old versions of the same directory. Another valid approach is a single deleted version with all files stored into it (due to date/time based versioning, the filenames necessarily cannot conflict).
When a directory is deleted, all files that were in it will now be in dirname/.&& (since non 'empty' directories can't be deleted). Should they all be moved up into dirname after dirname moves to parent/.&& ?
Can FUSE be made to grok that a directory containing .&& is empty as far as deleting is concerned?
Should directory reads be hacked not to return .&& at all? If so, can Windows handle that? Perhaps a special-case hack where .&& isn't returned when it is the ONLY 'file' contained by the directory, in order to trick FUSE into allowing the unlink command?
Since old versions are always R/O files in an R/O directory, how shall a user (given sufficient permission) delete them? Who should have sufficient permission (if anyone)?
Internally, the entire system is layered with defined APIs between the layers.
At the bottom is the PVenti block store. It knows nothing of files, metadata, or permissions. It supports a simple API:
Given a block of data (of arbitrary length), either create the block and return its name (SHA1 hash) or increment the existing block's refcount and return its name.
Given the name of an existing block, increment its refcount. This could be accomplished by reading and re-creating the block, but is a much more efficient shortcut.
Given the name of an existing block, decrement its refcount and destroy the block if its new value is zero.
Given the name of an existing block, return its contents.
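Expressed as C prototypes, and borrowing the function names listed earlier in this document, the block-store API might look roughly like this; the score type, signatures, and return conventions are assumptions, not a defined interface:

    /* Hypothetical prototypes for the block-store layer.  A "score" is
     * the 20-byte SHA1 name of a block. */
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint8_t bytes[20]; } pventi_score;

    /* Store a block (or bump its refcount if it already exists); returns its name. */
    int pventi_create(const void *data, size_t len, pventi_score *out);

    /* Increment the refcount of an existing block without rewriting its data. */
    int pventi_addref(const pventi_score *s);

    /* Decrement the refcount; the block is destroyed when the count reaches zero. */
    int pventi_delref(const pventi_score *s);

    /* Fetch the contents of an existing block. */
    int pventi_get(const pventi_score *s, void *buf, size_t buflen, size_t *outlen);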
The PVenti meta layer looks much more like an actual filesystem. It creates files in the meta directory that contain the metadata of a PVenti file. However, instead of storing the actual file data, it stores a list of PVenti blocks containing that data. It does NOT deal in read/write at a given position. Rather, it supports:
Given a descriptor to a file and a name for the metafile, create the meta (unlinking it should it already exist) and fill it with the data in the file described by the descriptor.
Given a filename and a descriptor, copy the metafile's data into the descriptor.
Rename the meta-file.
Put the blocks in the metafile, then unlink it.
Return true if the metafile exists.
More a nicety than a necessity: directories are created as needed when files are copied in or renamed, but they can also be explicitly created here.
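Expressed as C prototypes, the meta layer might look like the following sketch; none of these function names or signatures appear in the design itself and are purely illustrative:

    /* Hypothetical prototypes for the meta layer, following the
     * operations described above. */
    #include <stdbool.h>

    /* Create (or replace) the metafile 'name' from the open file 'fd'. */
    int meta_copyin(int fd, const char *name);

    /* Copy the data described by metafile 'name' into the open file 'fd'. */
    int meta_copyout(const char *name, int fd);

    /* Rename a metafile (used when creating a version). */
    int meta_rename(const char *oldname, const char *newname);

    /* Put (release) the blocks listed in the metafile, then unlink it. */
    int meta_unlink(const char *name);

    /* Report whether the metafile exists. */
    bool meta_exists(const char *name);

    /* Explicitly create a directory (normally created on demand). */
    int meta_mkdir(const char *name);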
Fossil is the layer that supports typical filesystem operations on files. It maintains a skeleton of the entire filesystem in the fossil directory. When a file is open, the fossil actually IS the file and supports everything the underlying filesystem does. When the file is NOT open, the fossil is a stub of the correct length with all metadata, but the data itself is absent. Instead, the file contains a data hole of the same size as the file it represents. This is accomplished by calling copyout on the corresponding meta file when the file is opened, and by calling meta's copyin operation and then truncating the file when it is closed. To maintain the correct size, a seek is performed on the fossil after truncating, creating a data hole. The underlying filesystem must support this operation and not require actually allocating blocks filled with zeros. In Linux, ext3 meets these requirements.
In addition, the fossil layer actually handles the versioning process. meta and block have no real concept of the versioning, but do contribute by making it considerably less expensive.
Two different top layers are available depending on the role of the given filesystem.
When used explicitly as a filesystem, the logging Fuse interface is used. It interfaces with Fuse and allows mounting of the filesystem. It also maintains a log (locally or over an optionally encrypted and authenticated TCP connection). All write operations are sent into the log. It completes all operations by passing them on to the fossil layer and translating the results for the Fuse module.
The replication layer is used on backup servers. It accepts connections from a filesystem and simply performs the logged write operations as they come in, in order to make its filesystem a duplicate of the remote live filesystem.
Periodically, the question of hash collision comes up. It is worth considering, but in practice it is not a likely cause of failure. If blocks consisted of truly random data *AND* could be of freely variable length *AND* the storage were to grow to contain a significant portion of all possible blocks, it could be an issue. However, there are 2^32768 possible distinct blocks at a size of 4KB/block. Notably, we have no names for numbers even approaching that size; a googol is only about 2^332! At the current rate of production, taking into account the rate at which storage density increases, and even assuming that old hard drives never fail and are never retired, several lifetimes would be needed to store that much data at all (and most of it would be duplicates of things like OSes and popular executables).
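As a rough, purely illustrative check (these numbers are not part of the original argument): with n blocks stored and a 160-bit hash, the birthday bound puts the probability of any collision at roughly

    p \approx \frac{n(n-1)}{2} \cdot 2^{-160}

An exabyte of 8KB blocks is about n = 2^47 blocks, giving p on the order of 2^93 / 2^160 = 2^-67, i.e. roughly 10^-20.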
The best comparative estimate of the odds of a hash collision in a Venti filesystem is about the same as the odds of being struck by lightning while claiming the grand-prize-winning lottery ticket, TWICE!
By contrast, the odds have always been much better that just the right bits will flip on a HD platter such that the CRC is still correct but the data is corrupt, or that a similar event will happen over a network connection (or even in a CPU). That is, the underlying hardware is not reliable enough to reveal the theoretical failure of the SHA1-hash-based Venti.
As for the case of a deliberate computed collision, simply coming up with a block that will collide with an existing block won't do (since the original block would be retained, not the attacking block). Thus, the attacker would have to predict the value of a particular block, discover a block that would result in a collision, and then get that block written first. Given that there exists no known computationally feasible way to compute a colliding block, that would be quite a task.
The reported weakness in SHA1 is that given months of computation, it is possible to discover a pair of inputs (that may differ in length) that collide. Neither input is distinguishable from random numbers. At this point, the weakness is not significant even in a cryptographic context. There exists no known way to apply that discovery to a case where one of the inputs is a given value.
No filesystem can offer true data safety without a method to replicate the data at a remote location. The current methods present a number of shortcomings. By far the most common method is nightly backups or rsync to another location. While these do safeguard the bulk of the data (and if the native filesystem supports versioning, they may even be adequate), they do so with a typical granularity of 24 hours, such that a day's work can potentially be lost between close of business and the completion of that evening's backup.
When up-to-the-minute backups are desired, a common solution is to store the data on a Fibre Channel SAN with RAID 1, with the mirror volumes located at a remote site. The difficulty is that such solutions strongly depend on 100% uptime on the site-to-site link (otherwise the volume must fully rebuild, a lengthy process, or the filesystem must become read-only the instant connectivity is lost). Further, a full-scale Fibre Channel SAN is prohibitively expensive for small to medium sized businesses and is simply out of the question for home use. Finally, RAID is block-level replication. This means that in the event of an abrupt interruption (common exactly when the backup is most needed), the on-disk filesystem on the mirror will inevitably be in an inconsistent state and will require a full filesystem check.
In order to maintain consistency with a replication server, and to ensure that consistency can be restored should the filesystem lose connectivity to the replication server at some point, all operations that alter filesystem state are fully journaled both locally and with respect to the replication server(s).
Each operation that may change the state of the filesystem begins by creating a journal entry of all parameters (including the timestamp) necessary to re-create the transaction. The journal system generates a unique transaction number, commits the entry to the journal, and queues it for sending to the replication server. Then the operation is performed as expected. Finally, a completion entry is created and committed to the local journal. Completion entries are NOT sent to the replication server.
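A sketch of that write path in C; the structure layout and every helper name here (journal_append, replica_queue, and so on) are illustrative assumptions, not the actual interface:

    /* Sketch: journal a state-changing operation, perform it, then record
     * completion.  Completion records stay local; only the entry itself is
     * queued for the replication server. */
    #include <stdint.h>
    #include <time.h>

    struct journal_entry {
        uint64_t txn;          /* unique, sequential transaction number */
        time_t   stamp;        /* "official time" of the operation */
        int      opcode;       /* which filesystem operation */
        /* ... operation parameters follow ... */
    };

    extern uint64_t journal_next_txn(void);
    extern int journal_append(const struct journal_entry *e);   /* commit locally */
    extern int journal_complete(uint64_t txn);                  /* local completion record */
    extern int replica_queue(const struct journal_entry *e);    /* send to mirror(s) */
    extern int perform_op(const struct journal_entry *e);       /* the real fs work */

    static int journaled_op(struct journal_entry *e)
    {
        e->txn   = journal_next_txn();
        e->stamp = time(NULL);

        if (journal_append(e) < 0)     /* must hit the local journal first */
            return -1;
        replica_queue(e);              /* async; acknowledgements tracked separately */
        perform_op(e);                 /* errors replay identically on the replicas */
        return journal_complete(e->txn);
    }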
Should the fileserver shut down uncleanly (such as on a sudden loss of power), the first order of business is to scan the journal. Any entry without a matching completion entry is replayed on the filesystem. This ensures that the last few modifications are retained AND that the local filesystem state matches the replication server's view. Next, the replication server indicates the last transaction it received and handled. Any journal entries with a higher transaction number are (re-)sent to the replication server. Once complete, the two states are back in sync.
Finally, any entries with a matching completion and acknowledgement from the replication server(s) are discarded or (in future) converted to an audit log by discarding the actual data.
Because of the transaction journaling, state can also be recovered should the replication server fail for a period of time. PVenti only discards journal entries once the completion record is committed AND the replication server acknowledges its own completion of the operation.
Because transaction numbers are sequential and unique, all operations occur in the same order on all copies of the filesystem. The timestamping establishes an "official time" for the operation. Thus, any operation that fails (generally a permission issue) on the primary filesystem will fail on the replicas as well, so journal entries are logged, sent, and completed without regard for such errors. The only case where an error will result in a completion not being committed is a local resource problem (out of memory, etc.) that can and should complete successfully if retried during journal recovery.