Issue with my SSD + btrfs + discard

5 minute read

Ubuntu 13.04 Upgrade + discard mount option for /

I upgrade to Ubuntu 13.04 the other week and it pretty much went without incident. On a side note, I don’t remember an upgrade going this smooth since before Unity was introduced.

Following the upgrade I wanted to enable TRIM support for my rootfs, so I added “discard” to the mount options. I thought it would improve performance by sending the SSD TRIM commands when files were deleted. I thought wrong and didn’t notice initially, this caused me to blame btrfs and trying a number of other things with no success (btrfs balance, clear_cache, finally re-making the filesystem…).

Everytime I would write something on my SSD it was terribly slow. For example, opening /etc/fstab with vim and writing it would take 5+ seconds. Something was wrong.

Of course fstab is not a big file, and I’m not even writing any changes:

-rw-r--r-- 1 root root 1226 May  5 16:21 /etc/fstab

While it’s hanging for a few seconds, lets look at where the vim code was hanging:

$ sudo cat /proc/$(pidof vim)/stack 
[<ffffffff8132c529>] blkdev_issue_discard+0x279/0x2a0
[<ffffffffa016691c>] btrfs_discard_extent.isra.48+0x9c/0xf0 [btrfs]
[<ffffffffa016f8c1>] btrfs_finish_extent_commit+0xc1/0xf0 [btrfs]
[<ffffffffa01831c4>] btrfs_commit_transaction+0x974/0xac0 [btrfs]
[<ffffffffa0192733>] btrfs_sync_file+0x193/0x230 [btrfs]
[<ffffffff811c396d>] do_fsync+0x5d/0x90
[<ffffffff811c3bd0>] sys_fsync+0x10/0x20
[<ffffffff816d37dd>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

There we have the smoking gun… blkdev_issue_discard(). At first I thought it was btrfs being slow, but it was actually the discard code. I thought the discard just marked deleted blocks for removal later, and then charged on. Apparently it’s blocking for some reason when vim calls fsync() after each write.

Closer investigation reverals that every btrfs sync does pretty much the same thing even with back to back syncs. The second sync should be almost instantenous:

$ sudo cat /proc/$(pidof btrfs)/stack
[<ffffffff8132c529>] blkdev_issue_discard+0x279/0x2a0
[<ffffffffa016691c>] btrfs_discard_extent.isra.48+0x9c/0xf0 [btrfs]
[<ffffffffa016f8c1>] btrfs_finish_extent_commit+0xc1/0xf0 [btrfs]
[<ffffffffa01831c4>] btrfs_commit_transaction+0x974/0xac0 [btrfs]
[<ffffffffa01559ca>] btrfs_sync_fs+0x5a/0xc0 [btrfs]
[<ffffffffa01b3385>] btrfs_ioctl+0x13c5/0x1b80 [btrfs]
[<ffffffff811a5969>] do_vfs_ioctl+0x99/0x570
[<ffffffff811a5ed1>] sys_ioctl+0x91/0xb0
[<ffffffff816d37dd>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

$ sudo sh -c "( time btrfs filesystem sync / ; time btrfs filesystem sync / )"
FSSync '/'

real	0m7.015s
user	0m0.008s
sys	0m0.000s
FSSync '/'

real	0m9.050s
user	0m0.004s
sys	0m0.008s

System wide sync is pretty much the same thing, but more drama on a pretty much idle system:

$ cat /proc/$(pidof sync)/stack
[<ffffffff8132c529>] blkdev_issue_discard+0x279/0x2a0
[<ffffffffa016691c>] btrfs_discard_extent.isra.48+0x9c/0xf0 [btrfs]
[<ffffffffa016f8c1>] btrfs_finish_extent_commit+0xc1/0xf0 [btrfs]
[<ffffffffa01831c4>] btrfs_commit_transaction+0x974/0xac0 [btrfs]
[<ffffffffa01559ca>] btrfs_sync_fs+0x5a/0xc0 [btrfs]
[<ffffffff811c3900>] sync_fs_one_sb+0x20/0x30
[<ffffffff8119768a>] iterate_supers+0xfa/0x100
[<ffffffff811c3a95>] sys_sync+0x55/0x90
[<ffffffff816d37dd>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff

[email protected]:~$ sync; time sync; time sync 

real	0m19.202s
user	0m0.004s
sys	0m0.024s

real	0m16.882s
user	0m0.000s
sys	0m0.028s

Running iostat also points at a problem, note the %utilization column:

$ iostat -ctx -p sda 1
05/05/2013 04:24:26 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
		   2.81    0.00    1.28    0.26    0.00   95.65

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00   23.00     0.00   100.00     8.70     1.00   42.61    0.00   42.61  43.48 100.00

Checking btrfs sync on another drive

To make sure it’s not a btrfs problem, I checked another traditional harddrive and it behaved as expected. The filesystem is nearly 2TB and syncs instantly… why doesn’t my SSD sync in a reasonable amount of time?

$ sudo sh -c "( time btrfs filesystem sync /mnt/massive ; time btrfs filesystem sync /mnt/massive )"
FSSync '/mnt/massive'

real	0m0.001s
user	0m0.000s
sys	0m0.000s
FSSync '/mnt/massive'

real	0m0.001s
user	0m0.000s
sys	0m0.000s

Applying the fix

I deleted the discard option from my mount options and now performance is back to this:

$ mount
/dev/sda5 on / type btrfs (rw,noatime,nodiratime,[email protected],discard)

$ sudo sh -c "( time btrfs filesystem sync / ; time btrfs filesystem sync / )"
FSSync '/'

real    0m0.119s
user    0m0.000s
sys     0m0.000s
FSSync '/'

real    0m0.105s
user    0m0.000s
sys     0m0.000s

And I was happy once again. I’m sure ext4 would behave the same way. I think I’m going to setup a cron script to run fstrim / once a week or so to free up the deleted blocks.

Drive Information

For the record, here’s the information about my drive. Still not sure if discard flag is supposed to behave like this, I suspect it might be buggy firmware on my old SSD, but I don’t have any newer more modern SSDs to compare it to. I updated from v1.10 firmware to v1.37 firmware and it appears to make no change. Also the Data Set Management TRIM supported (limit 1 block) is interesting. It appears newer higher performance drives have a higher block limit.

sudo hdparm -I /dev/sda

ATA device, with non-removable media
	Model Number:       OCZ-AGILITY2                            
	Serial Number:      <serial>           
	Firmware Revision:  1.10    
	Transport:          Serial
	Used: unknown (minor revision code 0x0028) 
	Supported: 8 7 6 5 
	Likely used: 8
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  234441648
	LBA48  user addressable sectors:  234441648
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	Logical Sector-0 offset:                  0 bytes
	device size with M = 1024*1024:      114473 MBytes
	device size with M = 1000*1000:      120034 MBytes (120 GB)
	cache/buffer size  = unknown
	Nominal Media Rotation Rate: Solid State Device
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 1
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
		 Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
		 Cycle time: no flow control=120ns  IORDY flow control=120ns
	Enabled	Supported:
	   *	SMART feature set
			Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
			Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	NOP cmd
			SET_MAX security extension
	   *	48-bit Address feature set
	   *	Mandatory FLUSH_CACHE
	   *	SMART error logging
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	64-bit World wide name
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Host-initiated interface power management
	   *	Phy event counters
	   *	DMA Setup Auto-Activate optimization
			Device-initiated interface power management
	   *	Software settings preservation
	   *	SMART Command Transport (SCT) feature set
	   *	SCT Write Same (AC2)
	   *	SCT Error Recovery Control (AC3)
	   *	SCT Features Control (AC4)
	   *	SCT Data Tables (AC5)
	   *	Data Set Management TRIM supported (limit 1 block)
	   *	Deterministic read data after TRIM
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
	not	supported: enhanced erase
Logical Unit WWN Device Identifier: <id>
	NAA		: 5
	IEEE OUI	: e83a97
	Unique ID	: <uid>
Checksum: correct