ï»¿

Development Wish List / Road Map for OpenAFS

In its ten plus years as an open source project, OpenAFS has established AFS as an open source success stories. OpenAFS provides clients for all of the major operating system distributions and servers on all UNIX/Linux variants. Even so, there is still a great deal that must be finished in order for AFS to acheive first class status on MacOS X and Microsoft Windows.

The work to be accomplished on OpenAFS falls into six broad categories:

Core Client Functionality
Human Interfaces
AFSv3 Remote Procedure Call Extensions
Server Enhancements and Management Tools
High Performance Computing Extensions
Miscellaneous
Documentation

It is the goal of the OpenAFS Elders to raise resources from the OpenAFS Community and others to successfully implement all of these functions over the next three to five years.

An Implementation and Release Schedule is provided at the end of the page.

A note about the estimates provided on this page. For many of the projects the Gatekeepers have designs, partially completed work, or even fully implemented systems. The estimates that are provided is the time necessary to complete the project and/or integrate it into a standard release of OpenAFS.

1. Core Client Functionality

Core client functionality encompasses the AFS cache manager, file system interfaces, pioctl interfaces, and credential management.

Microsoft Windows:

The Microsoft Windows client has received significant attention over the last four years. It is a fully functional client that works on all Microsoft Windows releases from Windows 2000 SP4 through Windows Vista and Server 2008. For a summary, see the OpenAFS for Windows Status Report. Still, there are a number of deficiencies that adversely impact the ability of end users to use AFS to its full existing potential.

Read-only or Read-write disconnected mode:
Microsoft Windows users are used to the "Windows Offline Folders" functionality which permits them to synchronize local copies of files or folders from a CIFS server to their local disk for use when disconnected from the network. UMichigan long ago implemented a read-write disconnected mode for the UNIX AFS client which permits users to continue using data within the AFS cache while in an offline mode. Once the client is restored to an online state the modifications made to the cache buffers are written back to the file server provided that there are no conflicts. If there are conflicts a manual conflict resolution process must be initiated. Conflict resolution is hard but AFS users would gain a great deal even if the contents of the AFS cache were available when disconnected from the file servers in a read-only manner.
Estimate: 3 months
Status: no resource commitments
Native file system driver interface (aka IFS):
The current OpenAFS client on Windows is not a true Windows file system. Instead it operates as a SMB translator service. The Windows OpenAFS client creates a SMB fileserver on the client machine, and Windows accesses this SMB server as a normal Windows shared volume. For each I/O operation made to this virtual SMB server, the OpenAFS client translates the SMB request into a comparable operation on the OpenAFS fileserver. This impacts the Windows OpenAFS client in a number of negative ways:
- The semantics of the CIFS file system are different than the semantics provided by OpenAFS. Because Windows sees the OpenAFS file system as an CIFS share, it has no way of acquiring the true capabilities or semantics of OpenAFS. This causes some applications to perform poorly when they expect the semantics of CIFS, which OpenAFS does not necessarily provide.
- The use of the translator service requires data to be received by the OpenAFS client via RX, translated into SMB packets, and then sent over the virtual loopback interface to the actual SMB server on the same machine. This results in a number of extra data copies, which greatly reduces OpenAFS performance. Making OpenAFS a native Windows file system will reduce the number of data copies and protocol translations, which will increase performance.
- The CIFS/SMB protocol does not provide any mechanism for the server to inform a client that an operation is actively being processed even if it is taking a long time to complete. The CIFS clients in Windows 2000 and above implement a dynamic timeout algorithm that estimates how long a request should take based upon the prior response time of the server and the amount of data being transferred. As the OpenAFS SMB server and cache manager are local to the machine, it is frequently the case that the response time in on the order of hundreds of microseconds. When a request to read or write large amounts of data from/to a file server occurs or if the needed volume is temporarily busy, the CIFS client will frequently timeout the request and tear down the SMB virtual circuit. This has a negative impact on applications as it results in all file handles being invalidated and all locks being dropped which must then be re-established.
- In order for the UNC server name "AFS" to be visible on all clients, the Microsoft Loopback Adapter (MLA) must be installed in order to provide a private network adapter to which the "AFS" Netbios name can be bound. The installation of the MLA adversely effects several popular software licensing and anti-spyware products which use the network adapter MAC address as a unique key.
The existing OpenAFS Client relies on an SMB server implementation (similar to Samba) to export the AFS name space to Windows Applications. This has a number of negative side effects that would be avoided if the OpenAFS for Windows client were to be implemented via a combination of File System Redirector, File System Filter driver, and a Network Provider.
The solution is to replace the SMB server with a native File System Redirector, a File System Filter driver, and a Network Provider that can be supported on Windows XP SP2, Windows 2003 SP1, Windows XP 64, Windows 2003 R2, Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2.
For more details ...
Estimate: to be included in the 1.7 development series one month after the release of 1.6.0. Our goal is to have a stable 1.8 release series with this functionality by June 2011.

All UNIX platforms:

split readonly/readwrite cache usage limits:
Some usage scenarios make it desirable to ensure that a certain amount of the available cache space is reserved for read-only data in order to reduce the likelihood that the cached data will be swapped out.
(testing needed)
demand prefetch:
In order to reduce wait time while streaming files, it is desirable to allow for a sliding prefetch window when a file is opened and being read.
(testing needed)
read/write disconnected AFS:
Read-only and partial read-write operational support is present in the 1.5 release series. Additional work needs to completed to support hard links, conflict resolution, and pinning of objects, directory trees or volumes.
Status: core functionality substantially complete on Unix platforms except for pinning and user interface issues.
intermediate fileserver - client as fileserver:
Each AFS cache contains up-to-date copies of data when a callback is registered with a file server. In a cluster of clients that are reading the same data over a low bandwidth network, it would be more efficient if the AFS cache manager could obtain data from a nearby up-to-date copy as opposed to reading it from the file server on the slow link.
Estimate: two to three months. Requires protocol standardization
prevent repeated queries of the same volume location data:
The UNIX client does not prevent multiple threads from repeatedly querying the same volume location data.
O_DIRECT and O_SYNC:
These fopen flags should be supported. O_SYNC should force each write to block until the data is sent to the file server. O_DIRECT should perform a direct write to the file server bypassing the cache (and until cache bypass is implemented, it should behave like O_SYNC.)

MacOS X:

Provide for token acquisition as part of the logon process
PAG support

Linux:

On-going maintenance necessary to keep up with backward incompatible changes to the Linux kernel and new interfaces provided and used therein.

Solaris:

update client to use system inodes instead of private inode pool.

AIX:

AIX 5 kcred PAG support

BSD:

Port the cache manager to NetBSD
Estimate: 1 month

New Client Platforms:

There is a growing demand for pervasive access to data from handset devices. Clients for Symbian S60, Windows Mobile, Apple's iOS, OpenHandsetAlliance (aka Android), and Nokia/Intel Maemo/Meego devices will be critical in the years to come.

All Platforms:

Client Cache Usage Tracking and Tuning:
The current cache managers implementation an explicit Least Recently Used algorithm for recycling objects. This algorithm does not take into account:
- frequency of use
- whether the file is currently open
By tracking the right statistics, the cache manager would be more likely to have the right data cached. One of the extreme examples that has been discussed is a maintenance application that runs five nights a week at 1am on tens of thousands of machines. The running of this application pounds the volume and file server because every machine must reload the application contents into the cache at the same time. If the cache manager could predict this utilization based upon past history, then the clients could slowly prefetch the data ahead of time easing the load on the file servers and improving the performance of the application on the local machine.

2. Human Interfaces

There have been many discussions about how hard AFS is to use, how end users don't want AFS and really want a WebDAV solution. What do those statements really mean? First, AFS isn't any harder to use than any other authenticated file system from the perspective of end users. If a user has an "encrypted" local disk she has to authenticate herself by providing her password. With the single sign-on solutions available for OpenAFS there isn't much reason for users today to be running without tokens when they have network access. Second, the statement that end users don't want AFS (as opposed to some other centralize storage solution) really makes no sense. End users don't ask for technologies, they ask for functionality. If a user wants centralized storage then the user wants centralized storage.

Users describe their desires using the technologies that are most familiar to them which today most often means Windows Shares (CIFS) and Browser based services. Why? Because those are the technologies the user is presented on his or her operating system's desktop. The vast majority of daily users are uncomfortable with command-line operations. Improving the ease of use of AFS can be achieved by providing tighter integration with the operating system desktop environment.

Microsoft Windows:

The Secure Endpoints Inc. OpenAFS Windows Road Map web page provides an number of mock ups of Explorer Shell extensions that can be used to not only make AFS more accessible to end users but also significantly improve its ease of use. By making the Explorer Shell AFS aware, users will be more comfortable using it. No longer will users have to use command line techniques to access AFS and manage its contents and metadata.

One of the most important ideas that was the result of discussions with Stanford University's Help Desk staff is the concept of Custom Name Spaces. On Microsoft Windows a Name Space is a virtual folder that appears as part of the Explorer Shell. The objects "My Computer", "My Documents", "Control Panel", My Network Places", "My Sharing Folders", etc. are all name spaces. Stanford University has been shipping for many years variations of an application now called "Stanford Desktop Tools". One of the features of SDT is the ability to search for classes, users, departments, and projects and map a drive letter to the associated AFS volume. Another feature is the ability to quickly map a drive letter to "my home directory". A final feature is the most recently used volume list.

With Name Spaces, we can implement all of this functionality. We can define a "recently used volumes list" which is always populated with the volumes the user most recently read or stored data to. We can define a "My Stanford Home Directory" name space that always contains a shortcut to the volume associated with the user's token for the ir.stanford.edu cell. We can also create name spaces for "Stanford Users", "Stanford Classes", "Stanford Departments", etc. Other organizations can distribute their own AFS name spaces that represent important data that is stored in their cell. AFS name spaces from multiple organizations can co-exist on the same system. Since name spaces are built into the Explorer Shell they are always easily accessible to the end user because they become a part of the Desktop.

A detailed proposal describing an AFS Name Spaces implementation is available in PDF.

Users expect to find a Control Panel for Services that support per-user configuration. For OpenAFS users can configure the behavior of the AFS Credential Provider for Network Identity Manager and their Protection Service Groups. For more details ...

System-wide configuration of Services are performed via Microsoft Management Console plug-ins. For more details ...

Microsoft Windows Vista User Account Control Privilege Separation. For more details ...

MacOS X:

Apple doesn't permit the same degree of customization of the Finder as Microsoft does for the Explorer Shell. However, the Finder can be customized with an AFS virtual folder and AFS context menus. Likewise, certain other graphical interfaces which will become available in Leopard provide opportunities for customization to ease use of AFS.

Enhance Finder with an OpenAFS Context menu

3. AFS3 Protocol Feature Enhancements

In order for AFS to be treated as a first class file system for MacOS X and Microsoft Windows it must gain the following functionality:

Removing Directory Limitations:
The current AFS directory format and RPCs suffer from a number of limitations that adversely affect the user experience. A directory has a maximum of 64,000 entries if all file names are 16 or fewer octets. Longer names are implemented by consuming an additional entry for each additional 32 octets of file name. Given the ever increasing length of file names some cells are filling the directory with as few as 10,000 entries.   Some scientific research projects require the use of millions of files perhaps containing a single data byte within a single directory.
The current AFS directory format is very inefficient for searching when case-insensitivity or Unicode normalization is required. Under these circumstances search time is linear to the number of entries in the directory. Many modern file systems implement the directory as a B+ tree to permit O(log n) searching. The existing format places a heavy burden on each and every cache manager. Each client must download a copy of the directory buffers and perform linear searching. This results in heavy use of the CPU when searching directories with 500 or more entries.
Another issue is the lack of support for internationalization. In the current directory format directory entries are stores as a sequence of octets without any character set hinting. A file that is stored using a name encoded with ISO 8859-5 or CP437 will not be represented correctly to the user on a system that expects UTF-8. Even when file names are stored using UTF-8 it is important to recognize that depending on the input mechanisms it is possible for a user to enter the same semantic string using different octet sequences. Therefore it is crucial that any implementation of Unicode file names support normalized forms for comparison.
Finally, Microsoft Windows and MacOS X are now requiring that first class file systems support the concept of multiple data streams per file. These streams are used to store extended attributes, security zone information, resource forks, and other forms of meta data in addition to providing a general purpose storage mechanism for applications. For more details ...
At the 2004 AFS Hackathon in Stockholm there was much discussion of potential methods of extending the existing directory format to support Unicode.   http://www.afsig.se/afsig/space/AFS+directory+format+extensions However, these approaches did not address the directory search performance issues, the entry limitations or multiple data streams.
The current direction under consideration is to completely replace the on disk directory format with an entirely new one consisting of data blocks representing nodes in a B+ tree with each block containing a variable number of entries. The new data structure would be Unicode aware and support multiple data streams. Microsoft Windows clients would implement extended attributes in a reserved data stream. MacOS X clients would use a reserved stream for the resource fork.
New versions of all of the directory RPCs would be implemented to support the new data structure. Clients that use the new APIs would be delivered directory buffers which construct a B+ tree which in turn would significantly improve directory search times.
For old clients, new implementations of the old RPCs would deliver directory data translated to the old linear format up to the maximum number of directory entries. It is possible that old clients will not be able to see all the files in a given directory.

Status: Implementation promised by Your File System, 2011.
Extended Attributes:
Extended Attributes are used by MacOS X to store resources and DOS Attributes. When they are not supported by the file system, MacOS X is forced to create the ._ (Apple Double files). Extended Attributes on Microsoft Windows are used to store a variety of meta-data about files and directories. The lack of EA support in AFS damages the Windows User Experience. AFS Cache Managers can implement support for extended attributes and store them in hidden Apple Double files while waiting for full EA support within AFS volumes.
For more details ... Status: Implementation promised by Your File System, 2011.
Per-file ACLs:
AFS supports per-directory ACLs. Per-file ACLs would make it possible to apply a different set of access constraints on a single object within a directory. At the present time storing multiple objects with different access controls requires that they be stored in separate directories. The AFS protocol provides partial support for this from the AFS/DFS translator, and this is supported in clients going back to IBM AFS. Status: Implementation promised by Your File System, 2011.
Mandatory Locking and Byte range locks:
Platforms such as Microsoft Windows and MacOS X require that their first class file systems support mandatory lock semantics and byte ranges. Applications which rely on these capabilities such as Microsoft Office and databases risk data corruption if their data files are altered while they are assumed to be under a lock. AFS only provides advisory full file locks and provides no upgradeable lock type. The existing AFS file server lock implementation doesn't keep track of which clients were issued locks which results a number of situations in which lock counts can become incorrect and produce a denial of service on a given file.
The Windows AFS client in the 1.5 series has added a localized implementation of mandatory locking and byte range locks. Each time an application requests a byte range to be locked, the cache manager ensures that it has an appropriate full lock on the object. The cache manager than accepts the responsibility of tracking each of the locks and doling out a range at a time.
Estimate: 2 months.

Status: Implementation promised by Your File System, 2011. Requires Standardization.
Status Data (Callback Registration) Expiration Algorithm Improvements:
Status data and callback registration expiration is currently determined based upon the number of clients that are accessing the data instead of the likelihood that the data is going to change.

Status: Implementation in progress. Awaiting Standardization.
OPEN/CLOSE File Server RPCs:
New file server RPCs would provide new audit data

Status: Implementation promised by Your File System, 2011

4. Server Enhancements and Management Tools

All Platforms

LDAP integration:
Protection Server to LDAP Proxy implementation and Direct File Server to LDAP implementation. Separate schemas will be necessary depending on whether the LDAP server is OpenLDAP or Active Directory.

Brett Trotter <blt@iastate.edu> wrote an implementation called ptsldap. In July 2005 he had working code using the Mozilla LDAP library and was porting it to use OpenLDAP instead.
Additionally, Volker Lendecke implemented a similar project for use with Samba, which is not known to have been completed.
Luke Howard PADL Ltd.) developed an AFS Protection Service as part of his Active Directory clone, XAD. Ownership of XAD has since been transferred to Novell. However, it is expected that Luke will assist us in developing a new implementation in the coming months.
Multiple back-end file server:
Current file servers must be built to support one type of file partition, either inode or namei. This makes it impossible to mix partition types on the same machine for the purpose of performance comparison or conversion. This project will make the storage back end modular, and allow for additional back ends to be provided.
Estimate: 1 month
Posix Extended Attribute File Server Back-end:
Using Posix Extended Attributes to store AFS metadata will provide the portability of the namei backend with the performance of the inode backend. A PosixEA backend would not require a special fsck which would permit the use of journal logs.

Status: Implementation promised by Your File System, 2011
Demand Attach File Server:
The Demand Attach File Server is a compile time option that produces a file server that supports several new and important features:

* an enhanced volume management library that supports:
    . lock-less I/O
    . on-demand attachment of volumes
    . parallel shutdown of the file server
    . on-line salvaging of volumes
    . automatic detachment of inactive volumes

* a new salvageserver daemon which can salvage volumes on-demand

* a modified bos and bosserver
    . fileserver state saving and restoration
      - host state
      - callback state

With the DAFS, shutdown and startup of the file server is significantly faster. Instead of salvaging a partition at a time, individual volumes are salvaged in the background while volumes that do not need salvaging are served by the fileserver to clients. Volumes are attached on-demand and detached when no longer required. This reduces the number of volumes that may require salvaging upon a restart.

Status: Present in 1.6 pre-releases. Requires additional testing.
Multi-Protocol File Server:
The current AFS file server can only communicate using the AFS3 protocol. Once all of the functionality in section 3, AFS Protocol Feature Enhancements, is implemented, the file server will support all of the capabilities necessary to alternative protocol front ends including CIFS/SMB2.0 and WebDAV.

Estimated: To be determined.
Protection Server enhancements for alternate principal names:
The AFS Protection Service maintains a database that maps user names to AFS IDs. This table needs to be extended to a many-to-one relationship of user names to AFS IDs. In addition, multiple name forms must be supported in order to permit different types of user authentication.
A proposed implementation was sketched out at the AFSig Hackathon in Stockholm in 2004 (details)
Status: Standardization complete. Implementation promised by Your File System, 2011
Volume Server enhancements for large volume names:
Volume names within the existing RPCs and database are too short given the number of volumes that are now being deployed in existing cells.

Estimate: 1 week
Implementation promised by Your File System, 2011
Volume Server enhancements for split horizon addressing:
Many sites now provide services from hosts which exist behind a NAT to hosts both inside and outside that NAT. The vlserver should provide interfaces to allow returning only some addresses to queries when responding to volume location requests.

Implementation promised by Your File System, 2011
Volume Server read-only replication optimizations:
A volume release current requires that the entire file be copied from the server maintaining the release clone to the replication site instead of only copying modified chunks. This is a significant performance issue especially as 64-bit file sizes and append only files become more common. Modifying the volume replication process would provide significant clock time and network traffic savings. Possible approaches include:
- data compression of the volume dumps can reduce network bandwidth and storage requirements
- rsync/rdiff style replication can be used to reduce the cost of replicating large files that have only had small changes since the previous replica was created. This can be coupled with additional read-only clone revisions and volume version numbers to produce very rdiff streams that can be used to update "old release" replicas without requiring a full volume replication.
- encrypted volume dumps can increase security when storing dump files in external backup systems
Estimate: 2 to 3 months
Implementation promised by Your File System, 2011

Read/write volumes and replication:
After optimizing the size of volume dumps it should be possible to implement a simple read/write replica model in which there is a single master for each volume to be used for writes and locking and multiple replicas that can be used for reads.

Implementation promised by Your File System, 2011
Partition UUIDs:
Assigning UUIDs to each partition instead of the exsting non-unique names will permit drives to be migrated between machines with just a simple update of the volume database before continued use. This will eliminate most of the work associated with synchronizing the database and removing stale entries.
Estimate: 1 to 2 weeks
Implementation promised by Your File System, 2011
Machine Accounts in the Protection Database:
Add the concept of machine accounts to the protection server database permitting authenticated machine access that does not result in the client being treated as a member of system:authuser.
Implementation promised by Your File System, 2011
UBIK improvements or replacements:
UBIK is the database replication protocol used by AFS server for the synchronizing the volume and protection databases among servers. It is an elected single master model in which the selection of the master is determined strictly based upon the IP addresses of the server subset which are capable of mutual communication and which are sufficient to establish quorum. The server with the lowest IP address is elected master. There are a number of scalability and performance issues with UBIK that must be addressed.
- Master Selection Algorithms
  Master selection should be configurable based upon the needs of the organization and not tied to the IP addressing. One model involves configuring additional state information including priorities based upon the servers in the quorum, the time of data, source of network traffic, etc. It should be possible a master to be selected with a quorum that is not represented by a fully connected graph.
- True Server Multi-homing:
  CellServDB changes are required
- pthreading:
  The new implementation should support the use of pthreads instead of lwp in order to maximize performance of multiple processor systems.
- Increase Record Sizes:
  The existing record size results in significant design restrictions and imposes undesirable overhead.
- Multi-master Replication:
  Single master replication is easier to implement than multiple master replication models. However, multiple master replication makes the most efficient use of network resources.

Microsoft Windows

Once AFS is capable of being used as a first class file system for Microsoft Windows clients it will make sense to support the AFS servers on the Windows Server platform as there are a large number of Microsoft Windows only IT organizations that do not have the expertise to manage UNIX/Linux systems. The servers are mostly there already. There is work that needs to be done on the NTFS Namei implementation and there needs to be much better integration with power management, plug-n-play networking, and Windows Event Logging.

Of course if you want to host services on Windows, you must provide a Microsoft Management Console plug-in to manage them.

The primary reason that we haven't spent the time and energy to get the AFS servers in tip top shape is that without the protocol feature enhancements, users that attempt to deploy AFS in an all Windows environment are bound to be disappointed.

For more details ...

Estimate: 4 to 6 weeks

5. High Performance Computing Extensions

RX/TCP and IPv6:
The networking protocol used by OpenAFS was developed in the late 1980s. It is a Remote Procedure Call transport called Rx that provides a stream interface that runs over UDP/IP. This protocol was designed to address the traditionally poor performance of TCP implementations at the time, and to be scalable to large numbers of clients.
Since that time, a large amount of networking research has been done on TCP performance, and modern TCP implementations are capable of good performance on high-speed networks. Also, other protocols such as Infiniband and SCTP have emerged as alternative transports to TCP. The growth of the World Wide Web has pushed operating system vendors to develop interfaces that allow applications to manage thousands of clients simultaneously in a scalable manner. While incremental improvements have been made to the Rx protocol since it was first developed, it has not been able to take advantage of the performance available in modern networks.
Our proposal includes the following work items:/p>
Â·         Develop an Rx protocol that uses TCP as a transport (RxTCP). As part of an earlier project, Kenneth Hornstein has developed a prototype of RxTCP; our goal is to complete this work and integrate it properly into the OpenAFS distribution.

There are many challenges with this approach. Much of the OpenAFS code has an implicit assumption that the transport protocol is the traditional Rx interface. This is further complicated by the lack of any formal API for Rx. As a result of this, many OpenAFS programs use what normally would be considered internal interfaces, and perform internal operations such as directly manipulate queues of packets. The Rx API has no way to indicate such things as connection types, so new API functions will need to be created. The Rx library makes heavy uses of threads, but uses two distinct threading libraries (pthreads and a custom thread library called LWP). Any new transport must not only be thread-safe, but also present the same threading model to applications so that they do not need to be rewritten. To maintain compatibility with existing clients and servers, the original UDP transport must still function simultaneously with the new transport.

Despite these obstacles, much work has been completed. The RxTCP transport has been implemented, and tests have shown excellent performance on Gigabit networks. The remaining challenge is to integrate this transport into the actual OpenAFS clients and servers. This new transport protocol addresses a number of deficiencies in the original Rx protocol and implementation:
o       The original Rx protocol was (with one exception) limited in packet size to one Ethernet MTU (1500 bytes). In general, research has shown that larger packet sizes facilitate higher performance due to the overall reduction in per-packet processing time and the need to process fewer packets in order to send the same amount of data. With the current networking APIs available to applications today, programs cannot query the size of the MTU on a networking interface, nor can they determine the MTU of a particular network path (even though the operating system may have determined that already).

With the use of TCP, the operating system can make use of knowledge not available to the application, such as accurate estimates of round trip time, network path MTU, and interface MTU, and as a result can take advantage of the capabilities of modern networks, such as Ethernet jumbo frames. The exception to this 1500 byte packet limit is that Rx has a concept called a â€œjumbogramâ€, which places multiple Rx packets into one UDP datagram. Unfortunately, in practice this results in no net gain in performance, since the Rx per-packet processing is not reduced.
o       Since Rx utilizes a datagram transport but provides a reliable stream interface, it must implement many of the features that are already provided by TCP, such as a windowing algorithm, packet loss detection and retransmission, and congestion control. While incremental improvements have been made to the Rx implementation since it was first developed, it has not received the attention that TCP performance has had during the same time period. The use of TCP as the basis for Rx allows us to leverage the serious engineering work that has been done on TCP, rather than requiring that same level of effort be placed into Rx.
o       The Rx implementation is very large and monolithic. Part of the size of Rx results in having to perform many of the same functions of TCP. This size and complexity makes it extremely difficult to understand and follow the code. In addition, most of the Rx work is done by a relatively small number of extremely complicated functions. In addition to lacking modularity, this makes profiling extremely difficult, as most of the time spent by Rx takes place in few functions, and it becomes difficult to get further granularity by profiling since most profiling tools operate on a function basis.

In contrast, RxTCP has a much smaller implementation, and is very modular. Since the overall complexity is reduced, this makes it easier to profile and understand.
o       All data transmitted or received by Rx is divided internally into per-packet buffers, which means data that is delivered to an application must undergo two copies to reach its destination (kernel to Rx buffers, Rx buffers to application buffers).

RxTCP eliminates this copy completely on writes; data is sent directly from the application buffer to the network stack. On reads data is delivered directly from the network stack to an application when an application buffer is available; if an application buffer is not available, then data is placed into a large contiguous buffer internal to Rx and copied out completely when an application buffer becomes available.
o       Since Rx is a custom protocol, any tool used to analyze Rx performance must be designed specifically for Rx use, or we must make modifications to an existing tool. With the use of TCP, we can use any one of a number of off-the-shelf tools to analyze TCP performance.
Â·         Analyze remaining AFS performance deficiencies. Under an SBIR grant, Sine Nomine has already conducted a formal analysis of AFS performance. Poor Rx behavior was identified as the number one item impacting performance. Our strategy for dealing with Rx performance is to utilize RxTCP, as detailed above.

Aside from Rx, a number of other bottlenecks were identified in the UNIX OpenAFS client.   After the integration of RxTCP, we will perform another series of benchmarks to mimic the ones in the Sine Nomine report to analyze the issues reported by Sine Nomine, and to see if any new issues arise. The issues identified by the Sine Nomine research were as follows:
o       Write performance is hurt by UNIX OpenAFSâ€™s sync-on-close semantics. Currently the UNIX OpenAFS client performs data synchronization when a file is closed by an application, or the client cache dirty blocks have exceeded a high-water mark threshold. For large files, this results in large time periods during writes where the network is quiescent, and the application must wait either when the file is closed or during a write for all of the outstanding data to be written to the fileserver.

One proposal offered in the Sine Nomine paper was to relax the sync-on-close semantics as offered by the AFS client today. Multi-client data consistency has never been an area where AFS has excelled, but our long-term experience has shown this is not a necessary function of OpenAFS. Especially for larger files, a more intelligent scheme would be to implement a relaxed consistency model where synchronization could occur between the client and server at any arbitrary time. In the case of HPC or video applications, an adaptive write-behind mechanism would be the most desirable option. We recognize that not all users desire the same semantics, so after a write-behind mechanism has been implemented the next step will be to develop the ability to select the data consistency model on a per-volume basis.

In addition, write-on-close semantics are incompatible with file locks. There is a strong desire in the AFS community to support mandatory file locking, byte range locks, and optimistic locking algorithms. This desire is primarily the result of wishing AFS to be a first class file system on the Microsoft Windows and MacOS X operating systems which require those semantics. When locks are obtained and released, the buffers affected by the locks must be flushed to the file server prior to the completion of the lock release.

The Windows OpenAFS client does not implement write-on-close semantics because of the heavy use of byte range file locks.
o       If the client cache fills up during writes, the UNIX OpenAFS client blocks the process performing writes until the cache is completely flushed to the low-water mark. A better strategy would be to dispatch the truncation daemon in advance of the cache reaching a high-water mark threshold, using one of the many I/O prediction algorithms available in the common literature.
o       Read performance of both OpenAFS client implementations are hampered by an inability to read more than a single chunk per RPC. When flushing dirty buffers to the file server, the client is able to send multiple contiguous chunks at a time thereby reducing the number of RPCs. On operating systems that provide hints as to the usage patterns of the file, performance can be improved by optimistically reading chunks whose need is anticipated.
o       The restriction that no more than four RPCs may be outstanding on any Rx connection is another observed bottleneck. This is particularly troublesome when AFS is being used primarily to serve a single service such as a web server as only one Rx connection is created for any collection of client, server, and authentication ID.
Given the limitations imposed by network processing overhead, available network APIs, and system bus limitations we do not expect to achieve line rate performance at network speeds past 1 Gigabit. We believe that performance faster than 1 Gigabit is achievable, but we do not know what the practical limits are beyond that. One of our out year tasks is to investigate work being done at CERN in using â€œoff-line storageâ€, where an OpenAFS client bypasses the OpenAFS fileserver completely and retrieves files via direct communication with disk hardware.

Estimated: 20 to 25 months.
Implementation promised by Your File System, 2012
Asynchronous RX RPCs:
All Rx calls in the existing implementation are synchronous. The currently executing thread must wait for completion. The maximum number of simultaneous requests that can be processed is limited by the number of threads that can be allocated to the process. By adding an asynchronous Rx call mode, the file server can be redesigned to process requests without blocking threads for callback breaks, whoareyou? probes, and getcps calls. This will permit a significant reduction in client requests waiting for threads.

Estimated: 6 weeks for Asynchronous Rx and 3 weeks for file server modifications.
Implementation promised by Your File System, 2012
Increase RX/UDP Maximum Number of Calls:
RX/UDP has a maximum of four simultaneous calls per connection. The initial number of calls for each connection cannot be increased due to the lack of an appropriate negotiation mechanism at the RX layer. However, an application layer negotiation could be used to permit the application to hint to the rx connection how many calls should be permitted.

Implementation promised by Your File System, 2011. Requires Standardization
rxgk:
rxgk is a security class based upon the Generic Security Service Application Programming Interface (GSS-API) that attempts to address a much broader range of security weaknesses in AFS; not simply the use of single DES encryption ciphers. These include issues such as:
- users can impersonate the server to the cache manager since the user knows the key obtained from the Kerberos service ticket
- neither the AFS clients nor the servers contribute any random data to the construction of the key
- the desire to support individual keys per service per host instead of a single key for all services on all hosts within a given cell
- the desire to provide data confidentiality and integrity protection on anonymous connections as well as authenticated ones
- the desire to provide for algorithm agility
- the desire to allow the server to require the use of crypto by the clients
- the desire to map multiple authentication names to a single AFS ID within the protection database
rxgk is designed but has not yet been fully implemented. Love HÃ¶rnquist Ã…strand, Magnus Ahltorp, Jeffrey Hutzelman, Derrick Brashear and Jeffrey Altman met at KTH the week of 22 Jan 2007 to begin implementation of rxgk and modify as many of the AFS services as possible. Love presented a status report at the 2007 AFS & Kerberos Best Practice Workshop and did more work with Derrick the following week. Status: Standardization in progress. Implementation substantially complete. promised by Your File System, 2011
Target: 2.0
rxk5:
rxk5 is a Kerberos v5 based replacement for the existing security class, rxkad, developed by Marcus Watts (UMich) and Matt Benjamin (LinuxBox). This is desirable because rxkad only works with DES,
which is an increasingly insecure choice. The goal here is to provide the highest quality cryptography
possible using code and standards that exist today.
A key sub-goal of this project is to keep the code footprint as small
as possible. Less code should produce better code reliability,
and should also facilitate running this code inside a kernel
environment, such as the AFS cache manager.
rxk5 has several minor limitations:
- no support for Kerberos v5 tickets containing authorization data (aka Microsoft PAC data)
- there is no enc-type negotiation separate from the Kerberos ticket session key enc-type
- the client chooses the checksum type and the server has no influence
- it doesn't use the Kerberos v5 pseudo-random function due to limited availability
- it requires functionality not currently exported from MIT Kerberos as of release 1.6.1
- it cannot be implemented on platforms such as Solaris which do not ship a public Kerberos API
- implementation on MacOS requires access to functions which are not currently exported
rxk5 is significantly stronger than rxkad when it comes to security and requires relatively minor changes to the OpenAFS architecture. Estimate: 1-2 months for protocol review and revision
Target: 1.10

6. Miscellaneous

HostAFSD and Peer-to-Peer AFS:
One of the things that the Gatekeepers are frequently asked by AFS newbies is "why can't I share my local files by AFS?" Many users have experienced either CIFS or NFSv3 file sharing and wish to be able to do the same using AFS. The existing AFS file server back ends store files within physical files known as volumes which can be migrated, cloned, replicated to any AFS file server within the same cell regardless of the operating system, partition file system, or hardware.
For users that are willing to give up the location independence of the data, there isn't much preventing the construction of a file server back end that reads and writes from the native file system provided that native file system has some way of notifying AFS when a file changed. Change notification is required for the file server to be able to callback the clients and report the invalidation of their data.
Another question that needs to be addressed is how to provide for authenticated access and access control lists. Finally, location discovery is a challenge that might be addressed with Apple's Bonjour and/or dyndns; this work can be extended to provide similar ability to discover a local cell for any client.
Estimate: 2 months
Backup solutions:
Backup systems in large organizations are often quite institution dependent. When backing up AFS there are two different views. There is how the directory tree is viewed by the end user and the AFS volume view. It is the mapping of these two views that frequently results in differing requirements on the backup systems that an organization wants to deploy.
Most off the shelf backup systems only see file systems from the viewpoint of the user. Whereas backing up AFS so that a given volume can be restored as needed in a location independent manner is much more similar to backing up a distributed database. Backing up the files that the database writes does not allow for the necessary granularity of restores that are required. In addition, backing up the database files while they are in use results in data inconsistencies.
Teradactyl is one of the few remaining commercial offerings that have integrated support for AFS. VERITAS Net Backup and Tivoli Storage Manager have both dropped integrated AFS support. Teradactyl have been a sponsor of the AFS & Kerberos Best Practice Workshops for the last couple of years.
http://www.teradactyl.com/Documents/OpenAFSbasics.html
There have also been various efforts to contribute AFS support to Amanda, http://www.amanda.org/, and there have been efforts to provide an AFS wrapper to Legato Networker.
Object Storage:
The RxOSD-extension to AFS allows to store metadata and actual filedata separately. The filedata may also reside on more than one server or on HSM-systems. This extension is somewhat based on the SCSI-T10-standard. A file stored on OSDs is generally called an "object". The AFS-fileserver is extended inasmuch to serve the metadata of an object. A new server instance the OSD-server is serving the actual file-data. A new server osddb server finally keeps track of all registered osd-server. The osddb server is somewhat comparable to the vldb-server which gives the location of a volume. http://pfanne.rzg.mpg.de/trac/openAFS-OSD/wiki/Specs

Status: Integration is in progress.
Target: 1.10
Follow up work promised by Your File System, 2011.
Automatic Load Balancing:
CMU has a load balancing tool that could be polished and supported.
Official Support for Cell Clones:
Morgan Stanley has often described their Volume Management Service which manages clones of read-only cells across multiple data centers. There are many things that could be done to make such deployments easier to manage including support for automatic failover between read only cells which are known to be clones.

7. Documentation

Installation and Maintenance
Reference Manuals
Developer Guides
End User Guides

Implementation and Schedule

The implementation schedule for these projects is entirely dependent upon resource availability. Please send inquiries, comments, and offers of support to openafs-gatekeepers@openafs.org. Where external contributors have promised contributions, they are included, as are timelines when those are provided. The following release schedule is subject to change.

1.4.15

The next release in the stable series for UNIX is expected before June 2011. This release will correct implementation defects and support for newer kernel revisions on supported operating systems.

1.6

The 1.6 series will replace the 1.4 series as the current stable series for UNIX and the 1.5 series as the last stable series for Microsoft Windows without a native IFS implementation. The 1.6 series will include significant improvements to source code quality and one major feature change: the demand attach file server. Pre-release testing for 1.6 is currently under way.

1.7

The 1.7 series will replace the 1.5 series as the experimental release series. 1.7 releases will begin shortly after the 1.6 series enters pre-release testing. The Windows IFS implementation will be integrated into 1.7 releases in preparation for the 1.8 stable release. The 1.7 release series will track the 1.6 series. Other than the Window IFS changes no commits will be made to this branch that are not also committed to the 1.6 release series. May 2011.

1.8

The 1.8 series will become the first stable release of OpenAFS to include the Windows IFS implementation. No other new features will be added to 1.8. June 2011.

1.9

The 1.9 series will replace the 1.5 series as the experimental release series. 1.9 releases will begin shortly after the 1.7 series has the Windows IFS implementation committed. Major new features will be integrated into 1.9 releases in preparation for the 1.10 stable release. October 2011.

1.10

The 1.10 series will replace the 1.8 series as the stable release series for UNIX and Microsoft Windows. The 1.10 series are scheduled to include the rxk5 security class, object storage, RxUDP performance improvements, PTS authentication name extensions, and extended callbacks. Pre-release testing for 1.10 is expected to begin in February 2012.

1.11

The 1.11 series will replace the 1.9 series as the experimental release series. 1.11 releases will begin shortly after the 1.10 series enters pre-release testing. Major new features will be integrated into 1.11 releases in preparation for the 2.0 stable release. March 2012.

2.0

The 2.0 series will replace the 1.10 series as the current stable series for UNIX and Microsoft Windows. The 2.0 series will include the rxgk security class including Kerberos v5, X.509 and SCRAM authentication, protection of anonymous connections, protection of the server to client callback connection, and server coordinated byte range locking, . Pre-release testing for 2.0 is expected to begin in June 2012.

Search OpenAFS
Downloads	Microsoft Windows * MacOS X * 1.4.x Maintenance Release * 1.5.x Features Release * Browse Source
OpenAFS Info	Documentation * Man Pages * AFSLore Wiki * Security Advisories * Getting Support * Workshops
Mailing Lists	openafs-announce * openafs-info * openafs-devel * openafs-win32-devel * afs3-standardization
Support OpenAFS	Donate * Store * Success Stories * Project Contributors
Development	Road Map * Projects * Bug Tracking * Submit a Bug * Source Repository
Links	Arla * grand.central.org

<webmaster@openafs.org>

Last modified: 2011/03/18 01:12:14 EDT