Who knew creating a filesystem was difficult? // the crosseroads

It turns out that trying to create your own filesystem is a bit more difficult than one might expect. I have been nose-deep in various documentation about filesystem theory and the layouts of various popular filesystems for a little over a week. (Okay, okay, so most of it begins with “http://en.wikipedia.org/wiki/...", but still–lots of good information out there.)

Things I have learned, after the break.

Building my Own

I’ve been thinking / whiteboarding out my own sort of file system based on some of the concepts of the filesystems listed below, plus others like the new Btrfs. It’s been an interesting experience so far.

The reason I started down this path instead of implementing one of the existing ones is because I’m at that stage in the project where I have to to read the kernel bootstrap file from the filesystem. That part really doesn’t mean I couldn’t use something already out there, though. The problem is that I’m using a Mac to develop this thing (at least, so far), and it seems that OSX only knows how to create FAT and Mac-type file systems (HFS+, etc.). At least, I haven’t found out how to make it create FFS or ext2, which if I were going to re-implement something, it’d be one of those. So, if I can’t create and mount the filesystem to put the kernel onto it….

I guess I could just implement something temporarily and get back onto the build-a-kernel track, but for some reason it bothers me to do something that I know I’ll have to rip out later. Plus, this is part of the learning process, so why not learn it now? One site I read said something like “just implement one of the easy ones for now so you can get back to the fun stuff like kernel dev.” Well, who said this part wasn’t the fun part as well?

What follows is what I’ve learned so far. I just need a place to put this, really, so that it stops leaking out of my brain.

FAT

FAT places its file allocation tables (there are two) at the beginning of the disk, just after the boot sector and any “reserved” sectors. Each “cluster” (just a block of space on disk) has an entry in the FAT that describes whether the cluster is free, used, bad, or a couple of other things. Thus, the FAT is essentially a free/used bitmap. For a “used” entry, the value stored is the next cluster in the file or directory chain, or the special “End of Chain” cluster to denote, well, the end of the file or directory chain. This means that each file is described as essentially a singly-linked list of clusters.

After the FATs is the Root Directory Entry, which is just a special “directory table” describing the root directory of a volume. It is of the same format as entries in the Data Region, which contains entries that describe files and directories; things such as filename, creation date, and the first cluster in the file or directory are stored here. The Data Region follows the Root Directory entry, and contains all other directories and all files.

So, all of the free/used information is stored at the beginning of the disk, and all other information is interspersed throughout the rest of the disk. This differs from other filesystems like ext{2,3,4} and UFS/FFS, all of which break the disk into smaller groups of discrete blocks (“block groups” in ext2+; “cylinder groups” in UFS/FFS).

FAT is supposedly one of the easiest filesystems to implement because it is so simplistic. However, that simplicity doesn’t come without drawbacks:

FAT can’t handle UNIX permissions;
FAT fragments easily;
For larger disks, FAT’s method of storing file chains as singly-linked lists becomes inefficient and contributes to less-than-stellar performance;
Implementing FAT may or may not require a license be purchased from Microsoft (at least, for some of the FAT implementations, especially concerning long file names).

ext2

The Second Extended Filesystem was introduced in 1993 and replaced the original “ext” filesystem in Linux, which itself replaced the Minix filesystem. It uses a bitmap for free space tracking. The ext2 “superblock” is analogous to (I think) FAT’s BIOS Parameter Block / Extended BPB, and there are copies of the superblock stored in each “block group”. A block group is just a contiguous number of blocks, each block encompassing one or multiple disk sectors. A block group contains the superblock copy, inode information, and a data region. Each block group has the same amount of inodes in it. By dispersing the inode data into block groups, I think this allows ext2 to locate similar data (files in the same directory, etc.) close together (see “UFS” below). If they just started allocating space at the start of the data region, files and directories could be spread out all over the disk (I guess they still could be) instead of closer together. The theory is that if you access a directory, you’re probably also going to access one or more files in that same directory, and by locating the data close together you can reduce read times.

File block data is represented as a bunch of pointers to the various blocks used by the file. There is enough room in the inode structure to store the first 12 blocks. Then there is a pointer to what they call an “indirect block”, which stores more pointers. There is also a pointer to a doubly-indirect block and a trebly-indirect block. Perhaps you can make more sense of Wikipedia’s diagram than I can, but I don’t fully understand the double- and triple-indirect blocks yet. That being said, I haven’t started reading the ext2 Internal Layout Guide by David Poirier yet, mostly because it’s a 58-page guide describing the guts of ext2 in excruciating detail.

FFS/UFS

The Unix File System / Fast File System was the original filesystem used on the *BSDs, and still in use today. The Berkeley Fast File System was written by Dr. Marshall Kirk McKusick while still at Berkeley. He also added “soft updates” to FFS (a alternative to journalling) and then revised FFS into FFS2.

Many of the ideas built into ext2 came from FFS, so things might look a little similar here. FFS has the idea of “cylinder groups”, each of which is a group of inodes and the associated data.

Straight from Wikipedia:

Early versions of Unix filesystems were referred to simply as FS. FS only included the boot block, superblock, a clump of inodes, and the data blocks. This worked well for the small disks early Unixes were designed for, but as technology advanced and disks got larger, moving the head back and forth between the clump of inodes and the data blocks they referred to caused thrashing. Marshall Kirk McKusick, then a Berkeley graduate student, optimized this for 4.2BSD’s FFS (Fast File System) by inventing cylinder groups. This breaks the disk up into smaller chunks, each with its own inode clump and data blocks.

The intent of BSD FFS is to try to localize associated data blocks and metadata in the same cylinder group; and, ideally, all of the contents of a directory (both data and metadata for all the files) in the same or nearby cylinder group, thus reducing fragmentation caused by scattering a directory’s contents over a whole disk.

I guess that answers my question as to why cylinder/block groups are used.