trckpd

Design and Implementation of Sun Network Filesystem


Paraphrasing the first paragraph of the paper, NFS is a

  • transparent
  • platform independent
  • remote filesystem.

The design and implementation details of the Virtual Filesystem (VFS) are the most important part of this paper.

The NFS design is broken into three major parts:

  • protocol
  • server side
  • client side

NFS Protocol

RPC

The NFS protocol runs on top of RPC (Remote Procedure Call). Using RPC simplifies the protocol definition, organization, and service implementation. RPCs are synchronous: the client blocks until the server has completed the call and returned the results. This makes a remote call behave much like a local filesystem call.

RPC is transport protocol independent. NFS uses UDP (User Datagram Protocol) and IP (Internet Protocol) for transport.

RPC is built on top of XDR (External Data Representation). XDR defines a machine-independent encoding for data and comes with a data definition language that has a C-like syntax.
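
To make the XDR layer a little more concrete, here is a minimal sketch of how call arguments could be serialized. It relies on two XDR rules: integers are 4 bytes in big-endian order, and variable-length byte strings are a 4-byte length followed by the data, zero-padded to a 4-byte boundary. The function names and the argument layout for `lookup` are assumptions for illustration, not the real NFS wire format.

```python
import struct


def xdr_uint(value: int) -> bytes:
    """Encode an unsigned 32-bit integer as 4 big-endian bytes."""
    return struct.pack(">I", value)


def xdr_opaque(data: bytes) -> bytes:
    """Encode variable-length bytes: length, data, zero padding to 4 bytes."""
    padding = (4 - len(data) % 4) % 4
    return xdr_uint(len(data)) + data + b"\x00" * padding


def encode_lookup_args(dirfh: bytes, name: str) -> bytes:
    """Hypothetical argument layout for lookup(dirfh, name)."""
    return xdr_opaque(dirfh) + xdr_opaque(name.encode("utf-8"))


# A 5-byte handle and a 9-byte name are each padded to a 4-byte boundary.
args = encode_lookup_args(b"\x01\x02\x03\x04\x05", "notes.txt")
```

Because every field is padded and byte order is fixed, a client and a server on different hardware can agree on the encoding without negotiating.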

Stateless

The protocol is stateless. Every call to the server contains all the information required to complete the call, and the server does not maintain any client state. Statelessness makes crash recovery very simple. If the server crashes, it does no recovery at all; the client simply resends the request until it gets a response. If a client crashes, no recovery is needed on either side. If state were maintained, recovery would have complicated the design of both the server and the client.
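
The recovery story can be sketched in a few lines. The toy server below drops the first few requests to simulate a crash or a lost packet; the client just repeats the same self-contained request. `FlakyServer` and `read_with_retry` are hypothetical names, and a real client would also wait between retries.

```python
class FlakyServer:
    """Toy server: each request is self-contained, so no per-client state."""

    def __init__(self, files, drop_first):
        self.files = files                # file handle -> file contents
        self.drop_remaining = drop_first  # requests to drop before answering

    def read(self, fh, offset, count):
        if self.drop_remaining > 0:
            self.drop_remaining -= 1
            raise TimeoutError("no reply from server")
        return self.files[fh][offset:offset + count]


def read_with_retry(server, fh, offset, count, max_tries=5):
    """Resend the identical request until the server answers."""
    for attempt in range(1, max_tries + 1):
        try:
            return server.read(fh, offset, count), attempt
        except TimeoutError:
            continue  # nothing to repair: the request carries everything
    raise TimeoutError("server did not respond")


server = FlakyServer({"fh1": b"hello world"}, drop_first=2)
data, attempts = read_with_retry(server, "fh1", offset=6, count=5)
```

The retry is safe here because a read with an explicit offset and count gives the same answer no matter how many times it is repeated.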

Definition

| Signature | Returns | Description |
| --- | --- | --- |
| null() | () | does nothing; used to ping the server and measure round-trip time |
| lookup(dirfh, name) | (fh, attr) | returns the file handle and attributes for name in directory dirfh |
| create(dirfh, name, attr) | (newfh, attr) | creates a new file name in dirfh; returns its handle and attributes |
| remove(dirfh, name) | (status) | removes file name from directory dirfh |
| getattr(fh) | (attr) | returns file attributes, similar to the stat call |
| setattr(fh, attr) | (attr) | sets file attributes |
| read(fh, offset, count) | (attr, data) | returns up to count bytes of data starting at offset |
| write(fh, offset, count, data) | (attr) | writes count bytes of data at offset from the beginning of the file; returns the file attributes |
| rename(dirfh, name, tofh, toname) | (status) | renames file name in dirfh to toname in directory tofh |
| link(dirfh, name, tofh, toname) | (status) | creates file toname in directory tofh as a link to file name in dirfh |
| symlink(dirfh, name, string) | (status) | creates a symlink name in dirfh with value string |
| readlink(fh) | (string) | returns the string associated with the symlink |
| mkdir(dirfh, name, attr) | (fh, newattr) | creates a new directory name in dirfh |
| rmdir(dirfh, name) | (status) | removes the empty directory name from dirfh |
| readdir(dirfh, cookie, count) | (entries) | returns up to count bytes of directory entries from dirfh. Each entry consists of a file name, a file id, and an opaque pointer to the next directory entry, called a cookie. A readdir call with a zero cookie returns entries starting with the first entry |
| statfs(fh) | (fsstats) | returns filesystem information |
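
The cookie in `readdir` is what lets a stateless server page through a large directory: the client hands back the cookie from the last entry it received, so the server never has to remember a position. A minimal sketch follows, with two simplifications that are assumptions of this example: `count` limits the number of entries rather than bytes, and the cookie is just an index into a list.

```python
def readdir(entries, cookie, count):
    """Return up to `count` entries starting at `cookie`, plus an EOF flag.

    Each returned entry is (name, file_id, next_cookie).
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    page = []
    for index in range(cookie, min(cookie + count, len(entries))):
        name, file_id = entries[index]
        page.append((name, file_id, index + 1))
    eof = cookie + len(page) >= len(entries)
    return page, eof


def list_directory(entries, count):
    """Client side: keep calling readdir, resuming from the last cookie."""
    names, cookie = [], 0  # a zero cookie means "start at the first entry"
    while True:
        page, eof = readdir(entries, cookie, count)
        names.extend(name for name, _, _ in page)
        if eof:
            return names
        cookie = page[-1][2]


directory = [("a", 10), ("b", 11), ("c", 12)]
first_page, first_eof = readdir(directory, cookie=0, count=2)
all_names = list_directory(directory, count=2)
```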

Server side

The server does not keep any client state, so every transaction has to be persisted to disk before the server replies. For a write call, the data block, all modified indirect blocks, and the inode block must all be flushed to storage before the call returns to the client.
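
A user-level analogue of this rule is to call `fsync` before acknowledging a write. The sketch below is only an illustration of the idea, not how the kernel NFS server is implemented; `stable_write` is a hypothetical helper name.

```python
import os
import tempfile


def stable_write(path, offset, data):
    """Write data at offset and return only after the OS has flushed it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        written = 0
        while written < len(data):  # os.write may write fewer bytes than asked
            written += os.write(fd, data[written:])
        os.fsync(fd)  # flush file data and metadata before replying
        return os.fstat(fd).st_size  # stands in for the returned attributes
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.dat")
    size_after_first = stable_write(path, 0, b"hello")
    size_after_second = stable_write(path, 5, b" nfs")
    with open(path, "rb") as f:
        contents = f.read()
```

The cost of this design is latency: every write pays for a trip to stable storage, which is the price of having nothing to recover after a crash.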

To support a stateless server, the Unix filesystem implementation was modified: a generation number was added to the inode, and a filesystem id was added to the superblock. The inode number, generation number, and filesystem id together make up the file handle for a file.

Every time an inode is freed, its generation number is incremented. If a file is deleted and its inode is reused while a client still holds the old file handle, the server can tell that the handle refers to an old inode.
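
The stale-handle check can be sketched as follows. `FileHandle` and `InodeTable` are hypothetical names, and the table is a stand-in for the on-disk inode structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileHandle:
    fsid: int        # filesystem id
    inode: int       # inode number
    generation: int  # generation of the inode when the handle was issued


class InodeTable:
    def __init__(self, fsid):
        self.fsid = fsid
        self.generation = {}  # inode number -> current generation
        self.in_use = set()

    def allocate(self, inode):
        """Hand out a handle for an inode that is now in use."""
        self.generation.setdefault(inode, 0)
        self.in_use.add(inode)
        return FileHandle(self.fsid, inode, self.generation[inode])

    def free(self, inode):
        """Free the inode and bump its generation so old handles go stale."""
        self.in_use.discard(inode)
        self.generation[inode] += 1

    def is_valid(self, fh):
        return (
            fh.fsid == self.fsid
            and fh.inode in self.in_use
            and self.generation.get(fh.inode) == fh.generation
        )


table = InodeTable(fsid=7)
old_handle = table.allocate(42)
table.free(42)                   # file deleted
new_handle = table.allocate(42)  # inode number reused for a different file
```

Here `old_handle` and `new_handle` share an inode number but differ in generation, so only the new one passes `is_valid`.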

NFS Server

Client side

NFS does not use a server:path naming format for file lookup, because that is not compatible with other Unix filesystems. Instead, the client binds the remote filesystem at mount time. The client cannot access the filesystem until the mount is complete.
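
The effect of binding at mount time is that ordinary local paths end up referring to remote files. The sketch below models that with a string-prefix mount table. This is a deliberate simplification: the kernel resolves paths through vnodes rather than prefixes, and `MountTable`, the server name, and the paths are all made up for the example.

```python
import posixpath


class MountTable:
    def __init__(self):
        self.mounts = {}  # local mount point -> (server, exported path)

    def mount(self, server, exported, mount_point):
        """Bind a remote filesystem to a local directory, once, at mount time."""
        self.mounts[posixpath.normpath(mount_point)] = (server, exported)

    def resolve(self, path):
        """Map a local path to (server, remote path); server is None if local."""
        path = posixpath.normpath(path)
        # Try the longest mount point first so nested mounts win.
        for mount_point in sorted(self.mounts, key=len, reverse=True):
            prefix = mount_point.rstrip("/") + "/"
            if path == mount_point or path.startswith(prefix):
                server, exported = self.mounts[mount_point]
                rest = path[len(mount_point):].lstrip("/")
                remote = posixpath.join(exported, rest) if rest else exported
                return server, remote
        return None, path


table = MountTable()
table.mount("fileserver", "/export/home", "/home")
remote_hit = table.resolve("/home/alice/notes.txt")
local_hit = table.resolve("/tmp/scratch")
```

Programs keep using plain paths such as `/home/alice/notes.txt`; only the mount step ever names the server.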


Virtual File System

VFS is an abstraction layer on top of native filesystems. Clients can program against the VFS API and transparently interact with multiple filesystems.

A VFS is implemented as a structure that contains the operations which can be applied to a whole filesystem. A vnode is a structure that contains all the operations for a node, where a node is a file or a directory.

Each mounted filesystem has an associated VFS structure in the kernel, and each active node has a vnode associated with it. Each vnode contains two VFS pointers: one to its parent VFS and one to a mounted-on VFS. This way a client can navigate to any part of the filesystem without any knowledge of the underlying filesystem.
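
The two pointers are easiest to see in a path walk that crosses a mount point. Below is a minimal sketch, assuming an in-memory directory tree; the class names, the `children` dictionary, and the `CountingVFS` subclass (which stands in for a filesystem whose lookups would be remote calls) are inventions of this example.

```python
class Vnode:
    """An active file or directory."""

    def __init__(self, vfs, children=None):
        self.vfs = vfs                # parent VFS: the filesystem it lives in
        self.vfs_mounted_here = None  # VFS mounted on this vnode, if any
        self.children = children      # name -> Vnode for directories


class VFS:
    """One mounted filesystem and its filesystem-wide operations."""

    def __init__(self, name):
        self.name = name
        self.root_vnode = Vnode(self, {})

    def root(self):
        return self.root_vnode

    def lookup(self, dvnode, name):
        return dvnode.children[name]


class CountingVFS(VFS):
    """Stand-in for a remote filesystem: counts lookups it would send."""

    def __init__(self, name):
        super().__init__(name)
        self.calls = 0

    def lookup(self, dvnode, name):
        self.calls += 1
        return super().lookup(dvnode, name)


def walk(root_vfs, path):
    """Resolve a path one component at a time, crossing mount points."""
    vnode = root_vfs.root()
    for component in filter(None, path.split("/")):
        vnode = vnode.vfs.lookup(vnode, component)  # dispatch via parent VFS
        if vnode.vfs_mounted_here is not None:      # step into mounted VFS
            vnode = vnode.vfs_mounted_here.root()
    return vnode


local = VFS("local")
remote = CountingVFS("remote")
local.root().children["mnt"] = Vnode(local, {})
remote.root().children["notes.txt"] = Vnode(remote)
local.root().children["mnt"].vfs_mounted_here = remote  # mount remote on /mnt
found = walk(local, "/mnt/notes.txt")
```

The caller of `walk` never checks which filesystem it is in; the vnode's parent VFS decides which `lookup` implementation runs.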

File System Operation

| Name | Description |
| --- | --- |
| mount() | system call to mount filesystems |
| mount_root() | mount filesystem as root |

VFS Operation

| Name | Return | Description |
| --- | --- | --- |
| unmount(vfs) | () | unmount filesystem |
| root(vfs) | (vnode) | returns the vnode of the filesystem root |
| statfs(vfs) | (fsstatbuf) | returns filesystem statistics |
| sync(vfs) | () | flush delayed writes |

Vnode Operation

| Name | Return | Description |
| --- | --- | --- |
| open(vnode, flags) | () | mark a file open |
| close(vnode, flags) | () | mark a file closed |
| rdwr(vnode, uio, rwflag, flags) | () | read or write a file |
| ioctl(vnode, cmd, data, rwflag) | () | do I/O control operation |
| select(vnode, rwflag) | () | do select |
| getattr(vnode) | (attr) | returns file attributes |
| setattr(vnode, attr) | () | set file attributes |
| access(vnode, mode) | () | check access permission |
| lookup(dvnode, name) | (vnode) | look up file name in directory |
| create(dvnode, name, attr, excl, mode) | (vnode) | create a file |
| remove(dvnode, name) | () | remove file name from directory |
| link(dvnode, todvnode, toname) | () | link to a file |
| rename(dvnode, name, todvnode, toname) | () | rename a file |
| mkdir(dvnode, name, attr) | (dvnode) | create a directory |
| rmdir(dvnode, name) | () | remove a directory |
| readdir(dvnode) | (entries) | read directory entries |
| symlink(dvnode, name, attr, to_name) | () | create symbolic link |
| readlink(vp) | (data) | read value of symlink |
| fsync(vnode) | () | flush dirty blocks of a file |
| inactive(vnode) | () | mark inactive, do cleanup |
| bmap(vnode, blk) | (devnode, mappedblk) | map block number |
| strategy(bp) | () | read and write filesystem block |
| bread(vnode, blockno) | (buf) | read a block |
| brelse(vnode, buf) | () | release a block buffer |