Zettabyte File System Explained

In this article, I will answer the most common questions about ZFS: what it is, why you should use it, and what you can do with it. Let's begin:

What are some of the attributes of ZFS?

  • ZFS is a fully-featured filesystem
  • Does data integrity checking
  • Uses snapshots
  • Created by Sun Microsystems; Oracle closed further development after acquiring Sun
  • Oracle's version is less full-featured
  • OpenZFS -- the open-source continuation of ZFS
  • Feeds into FreeBSD, illumos, ZFSonLinux, Canonical (Ubuntu)

What makes ZFS special?

  • Builds the capabilities of well-understood standard userland tools into the filesystem itself
    • Checksums everything
    • Metadata abounds
    • Uses compression
    • zfs diff -- diff(1), but between snapshots
    • Copy-on-Write (COW)

What is Copy-on-Write?

  • ZFS never overwrites a written disk sector
  • A sector changes? Allocate a new sector and write the data there
  • Data on disk is always coherent
  • Power loss half-way through a write? The old data is still there, untouched -- version control at the disk level
  • Interesting side effect: snapshots are effectively free

ZFS Assumptions?

  • ZFS is not your typical EXT/UFS filesystem
  • Traditional assumptions about filesystems will come back to haunt you
  • Non-ZFS tools such as dump(8) will appear to work, but will not give you what you expect

ZFS Hardware?

  • RAID Controllers -- Absolutely NOT!
    • ZFS expects raw disk access
    • If you must use one, put the RAID controller in JBOD mode or build single-disk RAID-0 volumes
  • RAM -- ECC is strongly recommended
  • Disk redundancy -- let ZFS handle it

ZFS Terminology

  • VDEV or Virtual Device -- a group of storage providers
  • Pool -- a group of identical VDEVs
  • Dataset -- a named chunk of data on a pool
  • You can arrange data in a pool any way you desire
  • The -f switch overrides safety checks -- be very careful how you use it

Virtual Devices (VDEVs) and Pools

  • Basic unit of storage in ZFS
  • All ZFS redundancy occurs at the virtual device level
  • Can be built out of any storage provider
  • Most common providers: whole disk or GPT partition
    • Could be a FreeBSD crypto (GELI) device
    • Or a software RAID volume, such as Linux LVM RAID
  • A Pool contains only one type of VDEV
  • "X VDEV" and "X Pool" get used interchangeably
  • VDEVs are added to Pools
  • Typically providers are not added to VDEVs but to Pools

Stripe VDEV/Pool

  • Each disk is its own VDEV
  • Data is striped across all VDEVs in the Pool
  • Can add striped VDEVs to grow Pools
  • No redundancy. Absolutely none. Nada!
  • No self-healing
  • Set copies=2 to get self-healing; must be set before data is written, and it only protects against bad blocks, not a failed disk
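The copies property from the last bullet can be set like this (the pool name "scratch" is hypothetical); remember that it only affects blocks written after the change:

```shell
# Hypothetical striped pool named "scratch".
# copies=2 stores every block twice, enabling self-healing of bad
# sectors -- but it does NOT survive losing a whole disk.
zfs set copies=2 scratch
zfs get copies scratch   # confirm the setting
```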

Mirror VDEV/Pool

  • Each VDEV contains multiple disks that replicate the data of all other disks in the VDEV
  • A Pool with multiple mirror VDEVs is analogous to RAID-10 (Stripe over Mirrors)
  • Can add more mirror VDEVs to grow Pool
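A minimal sketch of building and later growing a mirrored pool, assuming GPT partitions labeled gpt/zfs0 through gpt/zfs3:

```shell
# Two-way mirror: both partitions hold identical data
zpool create db mirror gpt/zfs0 gpt/zfs1
# Later, add a second mirror VDEV; data now stripes across both
# mirrors, giving a RAID-10-style layout
zpool add db mirror gpt/zfs2 gpt/zfs3
```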

RAIDZ VDEV/Pool

  • Each VDEV contains multiple disks
  • Data integrity maintained via parity, similar to RAID-5
  • Lose a disk - No data loss
  • Can self-heal via redundant checksums
  • RAIDZ Pool can have multiple identical VDEVs
  • Cannot expand the size of a RAIDZ VDEV by adding more disks

RAIDZ Types

  • RAID-Z1
    • 3+ Disks
    • Can lose 1 disk/VDEV
  • RAID-Z2
    • 4+ Disks
    • Can lose 2 disks/VDEV
  • RAID-Z3
    • 5+ Disks
    • Can lose 3 disks/VDEV
  • Disk capacity has grown far faster than disk access speed, so rebuilding a large disk takes a long time
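Creating each RAIDZ level looks the same apart from the type keyword and the minimum disk count (pool and provider names here are hypothetical):

```shell
# RAID-Z1: 3+ providers, survives one disk failure per VDEV
zpool create trinity raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2
# RAID-Z2: 4+ providers, survives two failures per VDEV
zpool create vault raidz2 gpt/zfs3 gpt/zfs4 gpt/zfs5 gpt/zfs6
```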

Number of Disks and Pools?

  • No more than 9-12 disks per VDEV
  • Pool size is your choice
  • Avoid putting everything in one massive Pool
  • Best practice is to put OS in one mirrored Pool, and data in a separate Pool
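That best practice might look like this, assuming hypothetical GPT labels for the OS and data disks:

```shell
# Small mirrored pool for the operating system
zpool create zroot mirror gpt/os0 gpt/os1
# Separate RAID-Z2 pool for bulk data
zpool create data raidz2 gpt/data0 gpt/data1 gpt/data2 gpt/data3
```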

RAIDZ vs. Traditional RAID

  • ZFS combines the filesystem and volume manager, so it rebuilds only live data -- faster recovery
  • No RAID-5 "write hole": a crash between the data write and the parity write cannot silently corrupt a stripe
  • Copy-on-Write -- never modify a block in place, only write new blocks

Create Striped Pools

  • Each VDEV is a single disk
  • No special label for VDEV of striped disk
  • # zpool create trinity gpt/zfs0 gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4

Viewing Stripe/Mirror/RAIDZ Pool Results

  • Use # zpool status

Multi-VDEV RAIDZ

  • Stripes are inherently multi-VDEV
  • There's no traditional RAID equivalent
  • Use type keyword multiple times
    • # zpool create trinity raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2 raidz1 gpt/zfs3 gpt/zfs4 gpt/zfs5

Malformed Pool Example

  • # zpool create trinity raidz1 gpt/zfs0 gpt/zfs1 gpt/zfs2 mirror gpt/zfs3 gpt/zfs4 gpt/zfs5
    • Fails with an "invalid vdev specification" message
  • Don't use -f here: it forces ZFS to accept a layout it knows is a bad idea
  • Attempting to mix a mirror VDEV into a RAIDZ pool -- no go!

Reusing Providers

  • # zpool create db gpt/zfs1 gpt/zfs2 gpt/zfs3 gpt/zfs4
  • Fails because /dev/gpt/zfs3 was part of an exported pool 'db'
  • Once you are sure the old pool is dead, the use of -f here is appropriate and essential

Pool Integrity

  • ZFS is self-healing at the Pool and VDEV level
  • Parity allows data to be rebuilt
  • Every block is hashed; hash is stored in the parent
  • Data integrity is checked as the data is accessed on the disk
  • A Scrub checks the live filesystem's integrity without taking it offline
  • If you don't have VDEV redundancy, use dataset copies property
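Kicking off and monitoring an integrity check on a hypothetical pool named trinity:

```shell
# Walk every allocated block, verify its checksum, and repair
# any bad copies from VDEV redundancy
zpool scrub trinity
# The "scan:" line shows scrub progress and any errors found
zpool status trinity
```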

Scrub vs fsck

  • ZFS has no offline integrity checker
  • ZFS scrub does everything that fsck does, and more
  • You can offline your Pool to scrub, but why would you?
  • Scrub isn't perfect, but it's better than fsck

Pool Properties

  • Properties are tunables
  • Both Pools and Datasets have properties
  • Commands: zpool set and zpool get
  • Some are read-only
    • # zpool get all | less

Changing Pool Properties

  • # zpool set comment="Main OS Files" zroot
  • # zpool set copies=2 zroot

Pool History

  • # zpool history zroot

ZPool Feature Flags

  • ZFS originally had version numbers
  • Then, Oracle assimilated Sun
  • OpenZFS froze the version number at 5000 and switched to feature flags
  • Feature flag support varies by operating system
  • # zpool get all trinity | grep feature

Datasets

  • A named chunk of data
  • Filesystems
  • Volume
  • Snapshot
  • Clone
  • Bookmark
  • Properties and features work on a per-dataset basis
  • # zfs list -r zroot/ROOT

Creating Datasets

  • # zfs create zroot/var/mysql
  • # zfs create -V 4G zroot/vmware

Destroying Datasets

  • # zfs destroy zroot/var/old-mysql
  • -v -- verbose mode
  • -n -- no-op flag

Parent-Child Relationships

  • Datasets inherit their parent's properties
  • If you set a property locally but want to revert to the parent's inherited value, use zfs inherit
  • Renaming a Dataset changes its inheritance
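A sketch of the override-then-revert cycle, using compression on a hypothetical zroot/var/log dataset:

```shell
# Override the inherited value locally
zfs set compression=off zroot/var/log
# SOURCE column now reads "local"
zfs get compression zroot/var/log
# Revert to whatever the parent provides
zfs inherit compression zroot/var/log
# SOURCE column now shows the value as inherited from the parent
zfs get compression zroot/var/log
```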

Pool Repair & Maintenance

  • Resilvering
  • Rebuild from parity
  • Uses VDEV redundancy data
  • No redundancy? No resilvering
  • Throttled by Disk I/O
  • Happens automatically when disk is replaced
  • Can add VDEVs to Pools, not disks to VDEV
  • Be cautious of replacement disks that are slightly smaller (usable capacity varies from disk to disk of nominally equal size)

Add VDEV to Pool

  • New VDEVs must be identical to existing VDEVs in the Pool
    • # zpool add scratch gpt/zfs99
    • # zpool add db mirror gpt/zfs6 gpt/zfs7
    • # zpool add trinity raidz1 gpt/zfs3 gpt/zfs4 gpt/zfs5

Hardware States in ZFS

  • ONLINE -- operating normally
  • DEGRADED -- at least one storage provider has failed
  • FAULTED -- generated too many errors
  • UNAVAIL -- cannot open storage provider
  • OFFLINE -- storage provider has been shut down
  • REMOVED -- hardware detection of unplugged device
  • Errors percolate up through the ZFS stack
  • Hardware RAID hides errors - ZFS does not!

Log and Cache Devices

  • Read Cache -- L2ARC (Level 2 Adaptive Replacement Cache)
  • Synchronous Write Log -- ZIL, SLOG (ZFS Intent Log, Separate Log Device)
  • Where is the bottleneck?
  • Log/Cache Hardware
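Log and cache devices are attached to an existing pool with zpool add; the SSD partition labels here are hypothetical:

```shell
# Mirrored SLOG: losing an unmirrored log device can cost you
# the most recent synchronous writes
zpool add db log mirror gpt/slog0 gpt/slog1
# L2ARC read cache: contents are disposable, so no mirror needed
zpool add db cache gpt/l2arc0
```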

Filesystem Compression

  • Compression exchanges CPU time for disk I/O
  • Disk I/O is very limited
  • CPU time is plentiful
  • LZ4 by default
  • Enable compression before writing any data
  • # zfs set compress=lz4 zroot
  • gzip-9 squeezes out a better ratio, but lz4 is typically the better overall trade-off
  • No more userland log compression
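After enabling compression, the savings on newly written data can be checked per pool or dataset:

```shell
# compressratio is read-only and reflects only data written
# after compression was enabled
zfs get compressratio zroot
zfs get compressratio zroot/var/log
```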

Memory Cache Compression

  • Adaptive Replacement Cache (ARC) is ZFS' buffer cache
  • ARC compression exchanges CPU time for memory
  • Memory can be somewhat limited
  • CPU time is plentiful
  • ZFS ARC auto compresses what can be compressed

Deduplication (Dedup)

  • ZFS deduplication isn't as good as you would imagine
  • Only deduplicates identical filesystem blocks
  • Most data is not ZFS-deduplicable
  • Rule of thumb: 1 TB of dedup'd data needs about 5 GB of RAM for the dedup table
  • Plan on roughly 4x the dedup RAM in total system RAM
  • Effectiveness: run zdb -S zroot and check the dedup ratio
  • Cost-effective ZFS dedup just doesn't exist

Snapshots

  • # zfs snapshot trinity/home@<today's date>
  • # zfs list -t snapshot
  • # zfs snapshot -r zroot@<today's date>
  • Access snapshots in hidden .zfs directory (especially when using ZFSonLinux)
  • # zfs destroy trinity/home@<today's date>
  • Use -vn in destroy operations

Snapshot Disk Use

  • Delete file from live filesystem
  • Blocks in a snapshot remain in use
  • Blocks are freed only when no snapshot uses them
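This is why snapshot space accounting looks odd at first; a quick way to inspect it on a hypothetical trinity/home dataset:

```shell
# USED:  space freed if this snapshot alone were destroyed
# REFER: all data the snapshot can still reach
zfs list -t snapshot -o name,used,refer -r trinity/home
```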

Roll Back

  • Can roll back a filesystem to its most recent snapshot
  • # zfs rollback zroot/ROOT@<before upgrade>
  • Newer data is destroyed

Clones

  • A read-write copy of a snapshot
  • # zfs clone zroot/var/mysql@<today> zroot/var/mysql-test
  • Run a test, then discard afterward

Boot Environments

  • Built on clones and snapshots
  • Snapshot root filesystem dataset before an upgrade
  • If the upgrade goes awry, roll back!
  • FreeBSD: sysutils/beadm
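A sketch of the boot-environment workflow with beadm (the environment name is arbitrary):

```shell
# Cheap snapshot+clone of the current root filesystem
beadm create pre-upgrade
# ...run the upgrade; if the new system turns out to be broken...
beadm activate pre-upgrade   # boot this environment on next reboot
# then reboot into the known-good environment
```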

ZFS Send/Receive

  • Move whole filesystems to another host
  • Blows rsync out of the water
  • Resumable
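A minimal replication sketch over SSH; hostnames and dataset names are hypothetical:

```shell
# Full replication of a dataset and all its snapshots
zfs snapshot -r trinity/home@backup1
zfs send -R trinity/home@backup1 | ssh backuphost zfs receive -u tank/home
# Later, send only the blocks changed since the last snapshot
zfs snapshot -r trinity/home@backup2
zfs send -R -i @backup1 trinity/home@backup2 | ssh backuphost zfs receive -u tank/home
```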

Let's install ZFS on Linux and start using it.