Issue with my SSD + btrfs + discard
Ubuntu 13.04 Upgrade + discard mount option for /
I upgraded to Ubuntu 13.04 the other week and it went pretty much without incident. On a side note, I don’t remember an upgrade going this smoothly since before Unity was introduced.
Following the upgrade I wanted to enable TRIM support for my rootfs, so I added “discard” to the mount options. I thought it would improve performance by sending the SSD TRIM commands when files were deleted. I thought wrong, and I didn’t notice initially. That led me to blame btrfs and try a number of other things with no success (btrfs balance, clear_cache, finally re-making the filesystem…).
Every time I wrote something to my SSD it was terribly slow. For example, opening /etc/fstab with vim and writing it would take 5+ seconds. Something was wrong.
Of course fstab is not a big file, and I’m not even writing any changes:
-rw-r--r-- 1 root root 1226 May 5 16:21 /etc/fstab
While it’s hanging for a few seconds, let’s look at where vim is stuck in the kernel:
$ sudo cat /proc/$(pidof vim)/stack
[<ffffffff8132c529>] blkdev_issue_discard+0x279/0x2a0
[<ffffffffa016691c>] btrfs_discard_extent.isra.48+0x9c/0xf0 [btrfs]
[<ffffffffa016f8c1>] btrfs_finish_extent_commit+0xc1/0xf0 [btrfs]
[<ffffffffa01831c4>] btrfs_commit_transaction+0x974/0xac0 [btrfs]
[<ffffffffa0192733>] btrfs_sync_file+0x193/0x230 [btrfs]
[<ffffffff811c396d>] do_fsync+0x5d/0x90
[<ffffffff811c3bd0>] sys_fsync+0x10/0x20
[<ffffffff816d37dd>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff
There we have the smoking gun… blkdev_issue_discard(). At first I thought it was btrfs being slow, but it was actually the discard code. I thought discard just marked deleted blocks for removal later and then carried on. Apparently it blocks instead: the stack shows that vim’s fsync() waits for the btrfs transaction commit, and the commit issues the discards before it returns.
Closer investigation reveals that every btrfs sync does pretty much the same thing, even with back-to-back syncs. The second sync should be almost instantaneous:
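The stall can be reproduced without vim. Here is a minimal sketch, assuming GNU dd and GNU date (for the %N nanosecond format); fsync_probe and the probe file name are made up for illustration. It times a single 4 KiB write that dd flushes with fsync() before it exits, which is the same kind of write vim issues when saving:

```shell
#!/bin/sh
# fsync_probe DIR: time one small write followed by fsync().
# DIR must live on the filesystem under test (defaults to the current directory).
fsync_probe() {
    dir=${1:-.}
    probe="$dir/.fsync-probe.$$"              # hypothetical scratch file name
    start=$(date +%s%N)                       # nanoseconds; GNU date extension
    # conv=fsync makes dd call fsync() on the output file before exiting
    dd if=/dev/zero of="$probe" bs=4k count=1 conv=fsync 2>/dev/null
    end=$(date +%s%N)
    rm -f "$probe"
    echo "4 KiB write + fsync took $(( (end - start) / 1000000 )) ms"
}
```

Running fsync_probe on a directory of the affected filesystem and again on another disk gives a quick comparison.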
$ sudo cat /proc/$(pidof btrfs)/stack
[<ffffffff8132c529>] blkdev_issue_discard+0x279/0x2a0
[<ffffffffa016691c>] btrfs_discard_extent.isra.48+0x9c/0xf0 [btrfs]
[<ffffffffa016f8c1>] btrfs_finish_extent_commit+0xc1/0xf0 [btrfs]
[<ffffffffa01831c4>] btrfs_commit_transaction+0x974/0xac0 [btrfs]
[<ffffffffa01559ca>] btrfs_sync_fs+0x5a/0xc0 [btrfs]
[<ffffffffa01b3385>] btrfs_ioctl+0x13c5/0x1b80 [btrfs]
[<ffffffff811a5969>] do_vfs_ioctl+0x99/0x570
[<ffffffff811a5ed1>] sys_ioctl+0x91/0xb0
[<ffffffff816d37dd>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff
$ sudo sh -c "( time btrfs filesystem sync / ; time btrfs filesystem sync / )"
FSSync '/'
real 0m7.015s
user 0m0.008s
sys 0m0.000s
FSSync '/'
real 0m9.050s
user 0m0.004s
sys 0m0.008s
A system-wide sync hits pretty much the same code path, with even more drama on a pretty much idle system:
$ cat /proc/$(pidof sync)/stack
[<ffffffff8132c529>] blkdev_issue_discard+0x279/0x2a0
[<ffffffffa016691c>] btrfs_discard_extent.isra.48+0x9c/0xf0 [btrfs]
[<ffffffffa016f8c1>] btrfs_finish_extent_commit+0xc1/0xf0 [btrfs]
[<ffffffffa01831c4>] btrfs_commit_transaction+0x974/0xac0 [btrfs]
[<ffffffffa01559ca>] btrfs_sync_fs+0x5a/0xc0 [btrfs]
[<ffffffff811c3900>] sync_fs_one_sb+0x20/0x30
[<ffffffff8119768a>] iterate_supers+0xfa/0x100
[<ffffffff811c3a95>] sys_sync+0x55/0x90
[<ffffffff816d37dd>] system_call_fastpath+0x1a/0x1f
[<ffffffffffffffff>] 0xffffffffffffffff
nitro@core:~$ sync; time sync; time sync
real 0m19.202s
user 0m0.004s
sys 0m0.024s
real 0m16.882s
user 0m0.000s
sys 0m0.028s
Running iostat also points at a problem; note the %util column:
$ iostat -ctx -p sda 1
05/05/2013 04:24:26 PM
avg-cpu: %user %nice %system %iowait %steal %idle
2.81 0.00 1.28 0.26 0.00 95.65
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.00 0.00 23.00 0.00 100.00 8.70 1.00 42.61 0.00 42.61 43.48 100.00
Checking btrfs sync on another drive
To make sure it’s not a btrfs problem, I checked a traditional hard drive, and it behaved as expected. The filesystem is nearly 2TB and syncs instantly… why doesn’t my SSD sync in a reasonable amount of time?
$ sudo sh -c "( time btrfs filesystem sync /mnt/massive ; time btrfs filesystem sync /mnt/massive )"
FSSync '/mnt/massive'
real 0m0.001s
user 0m0.000s
sys 0m0.000s
FSSync '/mnt/massive'
real 0m0.001s
user 0m0.000s
sys 0m0.000s
Applying the fix
I deleted the discard option from my mount options and now performance is back to this:
$ mount
/dev/sda5 on / type btrfs (rw,noatime,nodiratime,subvol=@,discard)
$ sudo sh -c "( time btrfs filesystem sync / ; time btrfs filesystem sync / )"
FSSync '/'
real 0m0.119s
user 0m0.000s
sys 0m0.000s
FSSync '/'
real 0m0.105s
user 0m0.000s
sys 0m0.000s
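Removing the option means editing the options field of the / entry in /etc/fstab. As a small sketch (strip_discard is a hypothetical helper, not a standard tool), this is one way to drop discard from a comma-separated option string while leaving the other options untouched:

```shell
#!/bin/sh
# strip_discard: hypothetical helper that removes "discard" from a
# comma-separated mount option string and prints the result.
strip_discard() {
    opts=$(printf '%s\n' "$1" | tr ',' '\n' \
        | { grep -vx 'discard' || true; } \
        | paste -sd, -)
    # An empty options field is not valid in fstab, so fall back to "defaults".
    printf '%s\n' "${opts:-defaults}"
}

strip_discard 'rw,noatime,nodiratime,subvol=@,discard'
# prints: rw,noatime,nodiratime,subvol=@
```

The edited options apply the next time the filesystem is mounted.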
And I was happy once again. I’m sure ext4 would behave the same way. I think I’m going to set up a cron script to run fstrim / once a week or so to free up the deleted blocks.
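Such a cron script could look roughly like this. It is only a sketch: the file location (for example /etc/cron.weekly/fstrim), the FSTRIM variable, and the trim_all function are my own illustrative choices. The only real tool it relies on is fstrim from util-linux with its -v (verbose) flag, and it has to run as root:

```shell
#!/bin/sh
# Weekly batch TRIM instead of the "discard" mount option.
# FSTRIM and trim_all are illustrative names; set FSTRIM=echo for a dry run.
FSTRIM=${FSTRIM:-fstrim}

trim_all() {
    for mnt in "$@"; do
        # -v asks fstrim to report how much space it discarded
        "$FSTRIM" -v "$mnt" || echo "trim failed on $mnt" >&2
    done
}

# Only the root filesystem is trimmed here; add more mount points if needed.
trim_all /
```

Batching the TRIM work this way keeps the slow discards out of the fsync() path and moves them to a time when a stall does not matter.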
Drive Information
For the record, here’s the information about my drive. I’m still not sure if the discard flag is supposed to behave like this. I suspect it might be buggy firmware on my old SSD, but I don’t have any newer SSDs to compare it to. I updated from v1.10 firmware to v1.37 firmware and it appears to make no difference. Also, the “Data Set Management TRIM supported (limit 1 block)” line is interesting. It appears newer, higher-performance drives have a higher block limit.
$ sudo hdparm -I /dev/sda
/dev/sda:
ATA device, with non-removable media
Model Number: OCZ-AGILITY2
Serial Number: <serial>
Firmware Revision: 1.10
Transport: Serial
Standards:
Used: unknown (minor revision code 0x0028)
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 16383 16383
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 16514064
LBA user addressable sectors: 234441648
LBA48 user addressable sectors: 234441648
Logical Sector size: 512 bytes
Physical Sector size: 512 bytes
Logical Sector-0 offset: 0 bytes
device size with M = 1024*1024: 114473 MBytes
device size with M = 1000*1000: 120034 MBytes (120 GB)
cache/buffer size = unknown
Nominal Media Rotation Rate: Solid State Device
Capabilities:
LBA, IORDY(can be disabled)
Queue depth: 32
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 16 Current = 1
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* SMART feature set
Security Mode feature set
* Power Management feature set
* Write cache
* Look-ahead
Host Protected Area feature set
* WRITE_BUFFER command
* READ_BUFFER command
* NOP cmd
* DOWNLOAD_MICROCODE
SET_MAX security extension
* 48-bit Address feature set
* Mandatory FLUSH_CACHE
* FLUSH_CACHE_EXT
* SMART error logging
* SMART self-test
* General Purpose Logging feature set
* WRITE_{DMA|MULTIPLE}_FUA_EXT
* 64-bit World wide name
* IDLE_IMMEDIATE with UNLOAD
* Segmented DOWNLOAD_MICROCODE
* Gen1 signaling speed (1.5Gb/s)
* Gen2 signaling speed (3.0Gb/s)
* Native Command Queueing (NCQ)
* Host-initiated interface power management
* Phy event counters
* DMA Setup Auto-Activate optimization
Device-initiated interface power management
* Software settings preservation
* SMART Command Transport (SCT) feature set
* SCT Write Same (AC2)
* SCT Error Recovery Control (AC3)
* SCT Features Control (AC4)
* SCT Data Tables (AC5)
* Data Set Management TRIM supported (limit 1 block)
* Deterministic read data after TRIM
Security:
supported
not enabled
not locked
not frozen
not expired: security count
not supported: enhanced erase
Logical Unit WWN Device Identifier: <id>
NAA : 5
IEEE OUI : e83a97
Unique ID : <uid>
Checksum: correct
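hdparm reports what the drive advertises in its ATA identify data. The kernel exposes its own view of the discard limits in sysfs, which makes it easy to compare drives. The sketch below assumes the standard /sys/block/DEV/queue/discard_max_bytes and discard_granularity files; discard_info and SYSFS_ROOT are hypothetical names used only for this example:

```shell
#!/bin/sh
# discard_info DEV: print the kernel's discard limits for a block device.
# SYSFS_ROOT is an illustrative override so the function can be tried
# against a fake directory tree; it normally stays at /sys.
discard_info() {
    dev=${1:-sda}
    queue=${SYSFS_ROOT:-/sys}/block/$dev/queue
    if [ ! -r "$queue/discard_max_bytes" ]; then
        echo "$dev: no discard information in sysfs"
        return 1
    fi
    max=$(cat "$queue/discard_max_bytes")
    gran=$(cat "$queue/discard_granularity")
    if [ "$max" -eq 0 ]; then
        # The kernel documents 0 as "device does not support discard"
        echo "$dev: discard not supported"
    else
        echo "$dev: discard granularity $gran bytes, max $max bytes per request"
    fi
}

# Example: discard_info sda
```

The lsblk --discard command from util-linux prints the same limits as a table for all block devices.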