The file system as applications see it
Understanding the design and implementation of two file systems
A minimal Linux system image
initramfs
├── busybox
└── init
/busybox is a binary file
/init has only 3 lines
exec /busybox sh
“Almost nothing at all”
/busybox mkdir -p /bin && /busybox mv /busybox /bin/
c1="arch ash base64 cat chattr chgrp chmod chown conspy cp cpio cttyhack date dd df dmesg dnsdomainname dumpkmap echo ed egrep false fatattr fdflush fgrep fsync getopt grep gunzip gzip hostname hush ionice iostat ipcalc kbd_mode kill link linux32 linux64 ln login ls lsattr lzop makemime mkdir mknod mktemp more mount mountpoint mpstat mt mv netstat nice nuke pidof ping ping6 pipe_progress printenv ps pwd reformime resume rev rm rmdir rpm run-parts scriptreplay sed setarch setpriv setserial sh sleep stat stty su sync tar touch true umount uname usleep vi watch zcat"
c2="[ [[ awk basename bc beep blkdiscard bunzip2 bzcat bzip2 cal chpst chrt chvt cksum clear cmp comm crontab cryptpw cut dc deallocvt diff dirname dos2unix dpkg dpkg-deb du dumpleases eject env envdir envuidgid expand expr factor fallocate fgconsole find flock fold free ftpget ftpput fuser groups hd head hexdump hexedit hostid id install ipcrm ipcs killall last less logger logname lpq lpr lsof lspci lsscsi lsusb lzcat lzma man md5sum mesg microcom mkfifo mkpasswd nc nl nmeter nohup nproc nsenter nslookup od openvt passwd paste patch pgrep pkill pmap printf pscan"
c3="pstree pwdx readlink realpath renice reset resize rpm2cpio runsv runsvdir rx script seq setfattr setkeycodes setsid setuidgid sha1sum sha256sum sha3sum sha512sum showkey shred shuf smemcap softlimit sort split ssl_client strings sum sv svc svok tac tail taskset tcpsvd tee telnet test tftp time timeout top tr traceroute traceroute6 truncate ts tty ttysize udhcpc6 udpsvd unexpand uniq unix2dos unlink unlzma unshare unxz unzip uptime users uudecode uuencode vlock volname w wall wc wget which who whoami whois xargs xxd xz xzcat yes"
for cmd in $c1 $c2 $c3; do
/bin/busybox ln -s /bin/busybox /bin/$cmd
done
mkdir -p /proc && mount -t proc none /proc
mkdir -p /sys && mount -t sysfs none /sys
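Implement a “directory tree” data structure on top of an I/O device (driver).
VFS: manages the parts shared by all file systems
Block device driver: read_block, write_block
Directory/file API: mkdir, rmdir, link, unlink; open, read, write, stat
File system implementation: a “directory tree” abstract data type built on top of read_block and write_block
5.25" floppy disk: single-sided, 180 KiB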
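The layering described above (everything built on read_block and write_block) can be sketched roughly as follows. This is only an illustration: the RAM-backed `disk` array, the 512-byte `BLOCK_SZ`, the function signatures, and the `read_bytes` helper are all assumptions, not part of the lecture material. With 360 blocks of 512 bytes the device holds exactly 180 KiB, matching the single-sided floppy.

```c
#include <assert.h>
#include <string.h>

#define BLOCK_SZ   512   /* assumed block size */
#define NUM_BLOCKS 360   /* 360 * 512 B = 180 KiB */

/* Hypothetical RAM-backed "disk" standing in for a real driver. */
static unsigned char disk[NUM_BLOCKS][BLOCK_SZ];

/* Copy one whole block out of the device; 0 on success, -1 if out of range. */
int read_block(int blk, void *buf) {
    assert(buf != NULL);
    if (blk < 0 || blk >= NUM_BLOCKS) return -1;
    memcpy(buf, disk[blk], BLOCK_SZ);
    return 0;
}

/* Overwrite one whole block on the device; 0 on success, -1 if out of range. */
int write_block(int blk, const void *buf) {
    assert(buf != NULL);
    if (blk < 0 || blk >= NUM_BLOCKS) return -1;
    memcpy(disk[blk], buf, BLOCK_SZ);
    return 0;
}

/* Byte-granular read layered purely on read_block (hypothetical helper):
 * the kind of building block a file system adds above the driver. */
int read_bytes(size_t off, void *dst, size_t len) {
    unsigned char tmp[BLOCK_SZ];
    unsigned char *out = dst;
    while (len > 0) {
        size_t in = off % BLOCK_SZ;      /* offset inside the current block */
        size_t n = BLOCK_SZ - in;        /* bytes available in this block */
        if (n > len) n = len;
        if (read_block((int)(off / BLOCK_SZ), tmp) != 0) return -1;
        memcpy(out, tmp + in, n);
        out += n; off += n; len -= n;
    }
    return 0;
}
```

A read that straddles two blocks turns into two read_block calls; the caller never sees the block boundary.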
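Talking about optimization without a workload is just hand-waving.
Requirements
Implementation options
Which option has flaws that are fatal and hard to fix?
Centrally stored pointers are easy to corrupt? Just store $n$ copies!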
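One possible way to make use of $n$ copies is a per-entry majority vote. The sketch below is a swapped-in illustration of that idea, not the recovery procedure of any particular file system; `N_COPIES` and `vote_entry` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_COPIES 3   /* assumed number of replicas of the pointer table */

/* Return the value that the most replicas agree on for entry idx.
 * Ties are resolved in favor of the lowest-numbered copy. */
uint32_t vote_entry(const uint32_t *copies[N_COPIES], size_t idx) {
    assert(copies != NULL);
    uint32_t best = copies[0][idx];
    int best_votes = 0;
    for (int i = 0; i < N_COPIES; i++) {
        int votes = 0;
        for (int j = 0; j < N_COPIES; j++)
            if (copies[j][idx] == copies[i][idx]) votes++;
        if (votes > best_votes) {
            best_votes = votes;
            best = copies[i][idx];
        }
    }
    return best;
}
```

With three copies, a single corrupted replica is outvoted by the other two; the price is $n$ times the space and $n$ writes per update.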
RTFM to get the necessary details
mmap the volume into memory, then access it directly
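A minimal sketch of the mmap step, assuming the volume lives in an ordinary image file on a POSIX system. `map_image` is a hypothetical helper name; the calls it uses (`open`, `fstat`, `mmap`, `close`) are standard POSIX.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole image file read-only.
 * Returns NULL on failure; on success stores the file size in *len. */
const unsigned char *map_image(const char *path, size_t *len) {
    assert(path != NULL && len != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) return NULL;

    *len = (size_t)st.st_size;
    return p;
}
```

Once mapped, the on-disk layout can be read through ordinary pointers, for example by casting the returned address to a struct that mirrors the volume format.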
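struct fat_volume {
    struct fat_header header;     // volume header (layout parameters)
    struct fat fat[FAT_NUM];      // FAT_NUM copies of the allocation table
    char clusters[][CLUSTER_SZ];  // data area: an array of CLUSTER_SZ-byte clusters
};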
The “next” array of the “FAT”
0: free; 2...MAX: allocated; ffffff7: bad cluster; ffffff8-ffffffe, -1: end-of-file
The “directory” data structure is stored as an ordinary file
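Following a file through the “next” array amounts to walking a linked list. The sketch below assumes FAT32-style entries whose low 28 bits carry the value (consistent with the seven-digit hex constants listed above); `chain_length` and the constant names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLUS_BAD     0xffffff7u   /* bad cluster marker */
#define CLUS_EOF_MIN 0xffffff8u   /* ffffff8 and above (including -1) end a chain */
#define CLUS_MASK    0xfffffffu   /* assumed: only the low 28 bits are significant */

/* Count the clusters in the chain that starts at `start`.
 * Returns -1 if the chain runs into a free, bad, or out-of-range cluster,
 * or if it loops. */
int chain_length(const uint32_t *next, size_t n_entries, uint32_t start) {
    assert(next != NULL);
    int len = 0;
    uint32_t c = start & CLUS_MASK;
    while (c < CLUS_EOF_MIN) {
        if (c < 2 || c >= n_entries || c == CLUS_BAD) return -1;
        if ((size_t)++len > n_entries) return -1;   /* cycle guard */
        c = next[c] & CLUS_MASK;                    /* follow the "next" pointer */
    }
    return len;
}
```

Reaching the k-th cluster of a file takes k steps through the table, which is why random access into a large file is linear in the file's length under this design.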
Performance
Reliability
It cannot be perfect, but it can be as good as possible under the actual workload
Summary | Findings |
---|---|
Most files are small | Roughly 2K is the most common size |
Average file size is growing | Almost 200K is the average |
Most bytes are stored in large files | A few big files use most of the space |
File systems contain lots of files | Almost 100K on average |
File systems are roughly half full | Even as disks grow, file systems remain ~50% full |
Directories are typically small | Many have few entries; most have 20 or fewer |
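Store file/directory metadata centrally, as objects
Separate fast and slow paths for small and large files
Essentially the same as FAT: the directory data structure is built on top of a file
Random read/write performance on large files improves markedly ($O(1)$)
But reliability remains a big problem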
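The fast/slow path split and the $O(1)$ bound above can be illustrated with an index that keeps a few direct pointers plus indirect blocks. The sketch assumes an ext2-style layout (12 direct pointers, then single, double, and triple indirection) with 1 KiB blocks and 4-byte block numbers; `indirection_level` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define N_DIRECT       12u    /* assumed: direct pointers kept in the metadata object */
#define PTRS_PER_BLOCK 256u   /* assumed: 1 KiB blocks holding 4-byte block numbers */

/* Number of index-block reads needed to locate logical block `lbn`.
 * 0 is the fast path (small files); 1..3 are the slow paths.
 * Returns -1 if lbn lies beyond what triple indirection can address. */
int indirection_level(uint64_t lbn) {
    uint64_t span = PTRS_PER_BLOCK;   /* blocks reachable at the current level */
    if (lbn < N_DIRECT) return 0;
    lbn -= N_DIRECT;
    for (int level = 1; level <= 3; level++) {
        if (lbn < span) return level;
        lbn -= span;
        span *= PTRS_PER_BLOCK;
    }
    return -1;
}
```

Whatever the file size, a lookup costs at most three index reads, a constant, whereas a FAT chain must be followed one cluster at a time.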
btrfs: Everything is a B-tree
Content and goals of this lecture
Takeaway messages