A Pitfall of Running MySQL on XFS


Preface

XFS, the default filesystem in RHEL 7, is drawing more and more attention. For an introduction, see here. XFS is an excellent and mature filesystem with more features than ext3; on our Docker platform we also use XFS quotas to enforce disk space limits.

The Problem

A few days ago, our DBAs were testing MySQL on top of it and found the performance to be very poor. They tried quite a few optimizations, none of which helped:

(1) Set the RAID stripe size to 256K (see here).

(2) Align the disk partition to 256K:

```
# fdisk -ul /dev/sdb
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1             512  2929356359  1464677924   83  Linux
```

(The partition starts at sector 512, i.e. 512 × 512-byte sectors = 256K, so it lines up with the stripe size.)

(3) Set the XFS su (stripe unit) and sw (stripe width, in units of su) to match the RAID layout (see here):

```
# mkfs.xfs -d su=256k,sw=5 /dev/sdb1
```

With su=256k and sw=5 (five data disks), a full stripe is 1280K, matching the RAID layout above.

(4) Disable the XFS write barrier:

```
# cat /proc/mounts
/dev/sdb1 /data1 xfs rw,noatime,attr2,delaylog,nobarrier,logbsize=256k,sunit=512,swidth=2560,noquota 0 0
```

Analysis

I benchmarked with fio:

```
# fio --direct=1 --thread --rw=randrw --rwmixread=60 --ioengine=libaio --runtime=300 --iodepth=1 --size=40G --numjobs=32 --name=test_rw --group_reporting --bs=16k --time_based
```

With direct IO, XFS performed about the same as ext3, or even slightly better:

|  | read (IOPS) | read latency (ms) | write (IOPS) | write latency (ms) |
| --- | --- | --- | --- | --- |
| direct IO + ext3 | 1848 | 17.243 | 1232 | 0.086 |
| direct IO + XFS | 1954 | 16.31 | 1304 | 0.086 |

Then I learned that the DBAs run MySQL with buffered IO, so I repeated the test with buffered IO (the same fio job with --direct=0), and that is where the catch showed up:

|  | read (IOPS) | read latency (ms) | write (IOPS) | write latency (ms) |
| --- | --- | --- | --- | --- |
| buffered IO + ext3 | 1976 | 15.661 | 1319 | 0.78 |
| buffered IO + XFS | 307 | 57.7 | 203 | 148.22 |

As you can see, XFS's direct IO is slightly faster than ext3's, but it is far, far faster than XFS's own buffered IO.

So the question is:

**Why is there no such gap between buffered IO and direct IO on ext3?**

So I reported the problem to the XFS developers, and Eric Sandeen's reply gave me a big hint:

With buffered I/O, these writes are exclusively locked whereas direct I/O is shared locked.

So you aren’t really exercising the filesystem or storage by pounding on the same file with 32 threads. If you use separate files for buffered I/O (e.g., remove the --filename param), you’ll probably see a bump in iops.

XFS treats buffered IO and direct IO differently: a buffered write takes an exclusive (X) lock on the file's inode, while a direct write takes only a shared (S) lock. I read the XFS code, and that is indeed the case:

**Buffered IO:**

```c
STATIC ssize_t
xfs_file_buffered_aio_write(
	struct kiocb		*iocb,
	const struct iovec	*iovp,
	unsigned long		nr_segs,
	loff_t			pos,
	size_t			ocount)
{
	struct file		*file = iocb->ki_filp;
	struct address_space	*mapping = file->f_mapping;
	struct inode		*inode = mapping->host;
	struct xfs_inode	*ip = XFS_I(inode);
	ssize_t			ret;
	int			enospc = 0;
	int			iolock = XFS_IOLOCK_EXCL; // X lock
	size_t			count = ocount;

	xfs_rw_ilock(ip, iolock);

	ret = xfs_file_aio_write_checks(file, &pos, &count, &iolock);
	if (ret)
...
```

**Direct IO:**

```c
STATIC ssize_t
xfs_file_dio_aio_write(
	struct kiocb		*iocb,
	const struct iovec	*iovp,
	unsigned long		nr_segs,
	loff_t			pos,
	size_t			ocount)
{
	struct file		*file = iocb->ki_filp;
	struct address_space	*mapping = file->f_mapping;
	struct inode		*inode = mapping->host;
	struct xfs_inode	*ip = XFS_I(inode);
...
	/*
	 * We don't need to take an exclusive lock unless there page cache needs
	 * to be invalidated or unaligned IO is being executed. We don't need to
	 * consider the EOF extension case here because
	 * xfs_file_aio_write_checks() will relock the inode as necessary for
	 * EOF zeroing cases and fill out the new inode size as appropriate.
	 */
	if (unaligned_io || mapping->nrpages)
		iolock = XFS_IOLOCK_EXCL;
	else
		iolock = XFS_IOLOCK_SHARED; //S lock
	xfs_rw_ilock(ip, iolock);
...
```
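The practical effect of that one lock-mode difference is easy to see from user space. Below is a minimal sketch (not the original fio benchmark; the path /data1/locktest and the block/write counts are made up for illustration): 32 threads issue 16k pwrite()s against the same file, either buffered or with O_DIRECT. On XFS the buffered run serialises on the exclusive inode IO lock, while the O_DIRECT run proceeds in parallel under the shared lock.

```c
/*
 * Minimal sketch: N threads issue 16k pwrite()s to the SAME file,
 * either buffered or with O_DIRECT (pass "direct" as argv[1]).
 * Path, sizes and counts are illustrative only.
 * Build: gcc -O2 -pthread locktest.c -o locktest
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS 32
#define BS       16384            /* 16k blocks, as in the fio test */
#define NWRITES  2048             /* writes per thread */

static int use_direct;

static void *writer(void *arg)
{
	long id = (long)arg;
	int fd = open("/data1/locktest",
		      O_WRONLY | O_CREAT | (use_direct ? O_DIRECT : 0), 0644);
	if (fd < 0) { perror("open"); return NULL; }

	/* O_DIRECT requires aligned buffers, offsets and lengths. */
	void *buf;
	if (posix_memalign(&buf, 4096, BS)) { close(fd); return NULL; }
	memset(buf, 'x', BS);

	/* Each thread writes its own region; contention is only on the
	 * inode IO lock, not on overlapping data. */
	for (long i = 0; i < NWRITES; i++) {
		off_t off = (off_t)(id * NWRITES + i) * BS;
		if (pwrite(fd, buf, BS, off) != BS)
			perror("pwrite");
	}
	free(buf);
	close(fd);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tids[NTHREADS];
	struct timespec t0, t1;

	use_direct = argc > 1 && strcmp(argv[1], "direct") == 0;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, writer, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%s IO: %.2f s\n", use_direct ? "direct" : "buffered",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}
```

Run it once as ./locktest and once as ./locktest direct: on XFS the buffered run should take noticeably longer, because every pwrite() contends for XFS_IOLOCK_EXCL, while the direct run only shares the lock and is bounded by the device, matching the fio numbers above.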

For comparison I also looked at the ext3 write path (ext3 goes through the generic VFS routine below) and found that ext3 takes a lock as well, inode->i_mutex:

```c
ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
		unsigned long nr_segs, loff_t pos)
{
	struct file *file = iocb->ki_filp;
	struct inode *inode = file->f_mapping->host;
	ssize_t ret;

	BUG_ON(iocb->ki_pos != pos);

	sb_start_write(inode->i_sb);
	mutex_lock(&inode->i_mutex);
	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
	mutex_unlock(&inode->i_mutex);
...
```

So why is ext3 unaffected? I then looked at the read implementation and found the difference:

```c
STATIC ssize_t
xfs_file_aio_read(
	struct kiocb		*iocb,
	const struct iovec	*iovp,
	unsigned long		nr_segs,
	loff_t			pos)
{
...
	/*
	 * Locking is a bit tricky here. If we take an exclusive lock
	 * for direct IO, we effectively serialise all new concurrent
	 * read IO to this file and block it behind IO that is currently in
	 * progress because IO in progress holds the IO lock shared. We only
	 * need to hold the lock exclusive to blow away the page cache, so
	 * only take lock exclusively if the page cache needs invalidation.
	 * This allows the normal direct IO case of no page cache pages to
	 * proceeed concurrently without serialisation.
	 */
	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
	if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);

		if (inode->i_mapping->nrpages) {
			ret = -xfs_flushinval_pages(ip,
					(iocb->ki_pos & PAGE_CACHE_MASK),
					-1, FI_REMAPF_LOCKED);
			if (ret) {
				xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
				return ret;
			}
		}
		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
	}

	trace_xfs_file_read(ip, size, iocb->ki_pos, ioflags);

	ret = generic_file_aio_read(iocb, iovp, nr_segs, iocb->ki_pos);
	if (ret > 0)
		XFS_STATS_ADD(xs_read_bytes, ret);

	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
...
```

So on XFS, a read also takes the shared (S) lock on the inode, whereas ext3 never acquires inode->i_mutex on its read path. Since a shared lock cannot be granted while the exclusive lock is held, every buffered read on XFS queues up behind in-flight buffered writes to the same file; on ext3, readers never wait for writers.

**Why the difference? Could this be a design flaw in XFS?**

BUG?

Dave Chinner, the XFS maintainer, cleared up my confusion about the difference between XFS and ext3:

This is a bug and design flaw in ext3, and most other Linux filesystems. Posix states that write() must execute atomically and so no concurrent operation that reads or modifies data should see a partial write. The linux page cache doesn’t enforce this - a read to the same range as a write can return partially written data on page granularity, as read/write only serialise on page locks in the page cache.

XFS is the only Linux filesystem that actually follows POSIX requirements here - the shared/exclusive locking guarantees that a buffer write completes wholly before a read is allowed to access the data. There is a down side - you can’t run concurrent buffered reads and writes to the same file - if you need to do that then that’s what direct IO is for, and coherency between overlapping reads and writes is then the application’s problem, not the filesystem…

Maybe at some point in the future we might address this with ranged IO locks, but there really aren’t many multithreaded programs that hit this issue…

In one sentence: this is not a bug in XFS, but a bug in ext3 and most other Linux filesystems.

**Yes, you read that right: this is not an XFS problem, it is an ext3 problem!!!**

After some searching online, I finally found the statement in question:

POSIX requires that a read(2) which can be proved to occur after a write() has returned returns the new data. Note that not all filesystems are POSIX conforming.

See here.
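To make Dave's point concrete, here is a small sketch (again with an illustrative path, /data1/atomic_test, not from the original post) that hunts for the partial read he describes: a writer thread alternately fills the first two pages of a file with 'a's and with 'b's in single complete write()s, and the main thread checks whether one read() ever returns a mix of the two.

```c
/*
 * Sketch of the partial-read window: ext3 reads take no inode lock,
 * so one read() can see a mix of two complete writes; on XFS the
 * exclusive buffered-write lock makes each read see a whole write.
 * The path /data1/atomic_test is illustrative only.
 * Build: gcc -O2 -pthread torntest.c -o torntest
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define LEN 8192                  /* two 4k pages */

static void *writer(void *arg)
{
	int fd = *(int *)arg;
	char a[LEN], b[LEN];

	memset(a, 'a', LEN);
	memset(b, 'b', LEN);
	for (;;) {                /* each pwrite() is one complete write */
		if (pwrite(fd, a, LEN, 0) != LEN ||
		    pwrite(fd, b, LEN, 0) != LEN)
			perror("pwrite");
	}
	return NULL;
}

int main(void)
{
	int fd = open("/data1/atomic_test", O_RDWR | O_CREAT, 0644);
	char buf[LEN];
	pthread_t tid;

	if (fd < 0) { perror("open"); return 1; }
	pthread_create(&tid, NULL, writer, &fd);
	for (;;) {
		if (pread(fd, buf, LEN, 0) == LEN && buf[0] != buf[LEN - 1]) {
			/* A single write() was observed half-applied:
			 * this violates POSIX write atomicity. */
			printf("torn read: starts '%c', ends '%c'\n",
			       buf[0], buf[LEN - 1]);
			return 1;
		}
	}
}
```

On ext3 this should report a torn read fairly quickly, because reads only serialise on per-page locks; on XFS the exclusive buffered-write lock means each read sees a complete write, and the loop runs forever.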

Summary

Because XFS honors this POSIX requirement, buffered IO performs very poorly when multiple threads read and write the same file concurrently; as Dave Chinner said, only ranged IO locks would lift this limitation. Filesystems such as ext3 take no lock on the read path, so under concurrent threads a read may return partially written data, which the application must handle itself. MySQL is protected from that by its own concurrency control; and since Dave's advice for this workload is direct IO, running InnoDB with direct IO (MySQL's standard innodb_flush_method=O_DIRECT option) is the natural way to avoid the exclusive-lock penalty on XFS.

