Mount namespace and mount propagation (2017-03-10, http://hustcat.github.io/mount-namespace-and-mount-propagation)

Mount namespace and problems

When a new mount namespace is created, it receives a copy of the mount point list replicated from the namespace of the caller of clone() or unshare().

create_new_namespaces -> copy_mnt_ns -> dup_mnt_ns:

/*
 * Allocate a new namespace structure and populate it with contents
 * copied from the namespace of the passed in task structure.
 */
static struct mnt_namespace *dup_mnt_ns(struct mnt_namespace *mnt_ns,
		struct fs_struct *fs)
{
	struct mnt_namespace *new_ns;
	struct vfsmount *rootmnt = NULL, *pwdmnt = NULL;
	struct vfsmount *p, *q;

	new_ns = alloc_mnt_ns();
	if (IS_ERR(new_ns))
		return new_ns;

	down_write(&namespace_sem);
	/* First pass: copy the tree topology */
	new_ns->root = copy_tree(mnt_ns->root, mnt_ns->root->mnt_root,
					CL_COPY_ALL | CL_EXPIRE); ///copy all mounts from the original namespace
 ///...

Each mount namespace has its own independent view of the filesystem, but this isolation also brings problems. For example, when a new disk is attached to the system, in the original implementation every namespace had to mount the disk separately before it became visible there. Often we would rather mount it once and have it visible in all mount namespaces. To this end, the kernel introduced the shared subtrees feature in 2.6.15.

The key benefit of shared subtrees is to allow automatic, controlled propagation of mount and unmount events between namespaces. This means, for example, that mounting an optical disk in one mount namespace can trigger a mount of that disk in all other namespaces.

To support the shared subtrees feature, every mount point is tagged with a propagation type, which determines whether creating or removing (child) mount points under it is propagated to other mount points.

propagation type

The kernel supports the following propagation types (a minimal sketch of setting these flags via mount(2) follows the list):

  • MS_SHARED

This mount point shares mount and unmount events with other mount points that are members of its “peer group”. When a mount point is added or removed under this mount point, this change will propagate to the peer group, so that the mount or unmount will also take place under each of the peer mount points. Propagation also occurs in the reverse direction, so that mount and unmount events on a peer mount will also propagate to this mount point.

  • MS_PRIVATE

This is the converse of a shared mount point. The mount point does not propagate events to any peers, and does not receive propagation events from any peers.

  • MS_SLAVE

This propagation type sits midway between shared and private. A slave mount has a master—a shared peer group whose members propagate mount and unmount events to the slave mount. However, the slave mount does not propagate events to the master peer group.

  • MS_UNBINDABLE

This mount point is unbindable. Like a private mount point, this mount point does not propagate events to or from peers. In addition, this mount point can’t be the source for a bind mount operation.
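
As a minimal sketch (assuming Linux, root privileges, and example paths), the propagation type of an existing mount point can be changed with mount(2), roughly what `mount --make-private` and `mount --make-shared` do under the hood:

package main

import (
	"log"
	"syscall"
)

func main() {
	// Make / private (recursively), so mount events no longer propagate to or from peers.
	if err := syscall.Mount("", "/", "", syscall.MS_PRIVATE|syscall.MS_REC, ""); err != nil {
		log.Fatalf("make-private /: %v", err)
	}
	// Make /X a shared mount point: mounts created under it propagate to its peer group.
	if err := syscall.Mount("", "/X", "", syscall.MS_SHARED, ""); err != nil {
		log.Fatalf("make-shared /X: %v", err)
	}
}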

A few things to note:

(1) The propagation type is a per-mount-point setting.

(2) The propagation type governs the propagation of mount/umount events for mount points immediately under the mount point. For example, when a new mount point Y is created under mount point X, Y is propagated to X's peer group, but X's setting does not affect mount points created under Y.

(3)

Peer groups

A peer group is a set of mount points that propagate mount/umount events to one another. For a shared mount point, a new member is added whenever a new mount namespace is created from it, or when it is used as the source of a bind mount. In both cases a new mount point is created, and the new mount point joins the peer group of the original one. Conversely, when a mount namespace is released, or when a mount point is unmounted, it is removed from the corresponding peer group.

A peer group is a set of mount points that propagate mount and unmount events to one another. A peer group acquires new members when a mount point whose propagation type is shared is either replicated during the creation of a new namespace or is used as the source for a bind mount. In both cases, the new mount point is made a member of the same peer group as the existing mount point. Conversely, a mount point ceases to be a member of a peer group when it is unmounted, either explicitly, or implicitly when a mount namespace is torn down because the last member process terminates or moves to another namespace.

  • Example

In sh1, mark / as private and create two shared mount points:

sh1# mount --make-private / 
sh1# mount --make-shared /dev/sda3 /X 
sh1# mount --make-shared /dev/sda5 /Y 

Then in sh2, create a new mount namespace:

sh2# unshare -m --propagation unchanged sh 

Then back in sh1, bind mount X to Z:

sh1# mkdir /Z 
sh1# mount --bind /X /Z 

This creates two peer groups:

  • The first peer group contains X, X’, and Z: X and X’ became peers when the new namespace was created, and X and Z when the bind mount was made.
  • The second peer group contains only Y and Y’.

Note that because / is private, the bind mount of Z does not propagate into the second namespace.

Let's look at the code where Docker uses MS_PRIVATE:

// InitializeMountNamespace sets up the devices, mount points, and filesystems for use inside a
// new mount namespace.
func InitializeMountNamespace(rootfs, console string, sysReadonly bool, mountConfig *MountConfig) error {
	var (
		err  error
		flag = syscall.MS_PRIVATE
	)

	if mountConfig.NoPivotRoot {
		flag = syscall.MS_SLAVE   ///mount events in the container will not propagate to the init ns
	}

	if err := syscall.Mount("", "/", "", uintptr(flag|syscall.MS_REC), ""); err != nil { ///set / to private, completely isolated from the init ns
		return fmt.Errorf("mounting / with flags %X %s", (flag | syscall.MS_REC), err)
	}

	if err := syscall.Mount(rootfs, rootfs, "bind", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("mouting %s as bind %s", rootfs, err)
	}
///...

Reference

TCP SYN cookies make window size suddenly become smaller (2017-03-03, http://hustcat.github.io/tcp_syn_cookies_and_window_size)

Problem

Recently, a service team reported a problem where the window of a TCP connection suddenly became abnormally small, making data transfer extremely slow:

In the SYN-ACK the server sends back, the window size is still 144800; but when the server ACKs the client's first data packet, the window suddenly drops to 60, even though that data packet is only 86 bytes long.

After quite a bit of debugging together with several colleagues, it finally turned out to be caused by TCP SYN cookies. Here is a short write-up for future reference.

TCP introduced SYN cookies to deal with the SYN flood problem.

SYN cookie is a technique used to resist SYN flood attacks. The technique’s primary inventor Daniel J. Bernstein defines SYN cookies as “particular choices of initial TCP sequence numbers by TCP servers.” In particular, the use of SYN cookies allows a server to avoid dropping connections when the SYN queue fills up. Instead, the server behaves as if the SYN queue had been enlarged. The server sends back the appropriate SYN+ACK response to the client but discards the SYN queue entry. If the server then receives a subsequent ACK response from the client, the server is able to reconstruct the SYN queue entry using information encoded in the TCP sequence number.

Kernel parameters

  • sysctl_max_syn_backlog

sysctl_max_syn_backlog controls the length of a listen socket's half-open (SYN_RECV) queue:

struct inet_connection_sock {

	struct request_sock_queue icsk_accept_queue; ////SYN_RECV sockets queue
}

int inet_csk_listen_start(struct sock *sk, const int nr_table_entries)
{
	struct inet_sock *inet = inet_sk(sk);
	struct inet_connection_sock *icsk = inet_csk(sk);
	int rc = reqsk_queue_alloc(&icsk->icsk_accept_queue, nr_table_entries);
///...
	sk->sk_state = TCP_LISTEN;
}

int reqsk_queue_alloc(struct request_sock_queue *queue,
		      unsigned int nr_table_entries)
{ 
///...
	nr_table_entries = min_t(u32, nr_table_entries, sysctl_max_syn_backlog);
	
	for (lopt->max_qlen_log = 3;
	     (1 << lopt->max_qlen_log) < nr_table_entries;
	     lopt->max_qlen_log++);
///...
}

static inline int reqsk_queue_is_full(const struct request_sock_queue *queue)
{
	return queue->listen_opt->qlen >> queue->listen_opt->max_qlen_log;
}

The kernel computes the length of the listen socket's SYN queue from the listen() backlog and sysctl_max_syn_backlog. When the queue is full, the following message is logged (a sketch of the queue-length computation follows the log line):

TCP: TCP: Possible SYN flooding on port 6000. Dropping request.  Check SNMP counters.
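
As a rough sketch of that computation (simplified from the excerpt above; the real kernel applies a few more bounds), with an assumed listen(backlog=128) and sysctl_max_syn_backlog=2048:

package main

import "fmt"

// maxQlenLog mirrors the loop in reqsk_queue_alloc: clamp by sysctl_max_syn_backlog,
// then find the smallest power-of-two exponent (>= 3) covering the result.
func maxQlenLog(backlog, maxSynBacklog uint32) uint32 {
	n := backlog
	if maxSynBacklog < n {
		n = maxSynBacklog
	}
	qlenLog := uint32(3)
	for (uint32(1) << qlenLog) < n {
		qlenLog++
	}
	return qlenLog
}

func main() {
	l := maxQlenLog(128, 2048)
	fmt.Println(l)        // 7
	fmt.Println(200 >> l) // 1: with qlen=200, reqsk_queue_is_full() reports the queue as full
}
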
  • sysctl_tcp_syncookies

Controls whether the TCP SYN cookies mechanism is enabled.

extern int sysctl_tcp_syncookies;

How TCP handles a new connection

When the receiver gets the sender's SYN packet, it creates a request_sock, replies to the sender with a SYN/ACK, and adds the request_sock to the LISTEN socket's SYN table:

tcp_v4_do_rcv(TCP_LISTEN) -> tcp_rcv_state_process -> tcp_v4_conn_request:

///ipv4_specific, LISTEN socket handle SYN packet
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
        struct request_sock *req;
///...
	/* TW buckets are converted to open requests without
	 * limitations, they conserve resources and peer is
	 * evidently real one.
	 */
	if (inet_csk_reqsk_queue_is_full(sk) && !isn) { ///SYN queue is full
		want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
		if (!want_cookie) ///no tcp_syncookies, drop SKB
			goto drop;
	}

	/* Accept backlog is full. If we have already queued enough
	 * of warm entries in syn queue, drop request. It is better than
	 * clogging syn queue with openreqs with exponentially increasing
	 * timeout.
	 */
	if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
		goto drop;
	}
	
	req = inet_reqsk_alloc(&tcp_request_sock_ops);
	if (!req)
		goto drop;
///...
	if (likely(!do_fastopen)) {
		int err;
		err = ip_build_and_send_pkt(skb_synack, sk, ireq->loc_addr, ///send SYN/ACK
		     ireq->rmt_addr, ireq->opt);
		err = net_xmit_eval(err);
		if (err || want_cookie) ///tcp_syncookies, don't add to SYN queue
			goto drop_and_free;

		tcp_rsk(req)->snt_synack = tcp_time_stamp;
		tcp_rsk(req)->listener = NULL;
		/* Add the request_sock to the SYN table */
		inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT); ///Add SYN table
		if (fastopen_cookie_present(&foc) && foc.len != 0)
			NET_INC_STATS_BH(sock_net(sk),
			    LINUX_MIB_TCPFASTOPENPASSIVEFAIL);
	} else if (tcp_v4_conn_req_fastopen(sk, skb, skb_synack, req)) ///fast open
		goto drop_and_free;
///...
}

When the receiver later gets the sender's ACK, the kernel looks up the matching request_sock in the SYN table, and tcp_check_req then creates a new socket; at this point the TCP connection is established (TCP_ESTABLISHED): tcp_v4_do_rcv(TCP_LISTEN) -> tcp_v4_hnd_req -> tcp_check_req:

static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
{
	struct tcphdr *th = tcp_hdr(skb);
	const struct iphdr *iph = ip_hdr(skb);
	struct sock *nsk;
	struct request_sock **prev;
	/* Find possible connection requests. */
	struct request_sock *req = inet_csk_search_req(sk, &prev, th->source,
						       iph->saddr, iph->daddr); ///get request_sock from SYN table
	if (req)
		return tcp_check_req(sk, skb, req, prev, false); /// create new socket
///...
}

SYN cookies

With the tcp_syncookies option disabled, once the LISTEN socket's SYN queue is full the SKB is simply dropped:

///ipv4_specific, LISTEN socket handle SYN packet
int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{
///..
	if (inet_csk_reqsk_queue_is_full(sk) && !isn) { ///SYN queue is full
		want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
		if (!want_cookie) ///no tcp_syncookies, drop SKB
			goto drop;
	}

With tcp_syncookies enabled, when the LISTEN socket's SYN queue is full the kernel still creates a request_sock and replies to the peer with a SYN/ACK, but instead of adding the request_sock to the SYN queue it frees it:

	if (likely(!do_fastopen)) {
		int err;
		err = ip_build_and_send_pkt(skb_synack, sk, ireq->loc_addr, ///send SYN/ACK
		     ireq->rmt_addr, ireq->opt);
		err = net_xmit_eval(err);
		if (err || want_cookie) ///tcp_syncookies, don't add to SYN queue
			goto drop_and_free;

As a result, when the peer's ACK arrives, tcp_v4_hnd_req cannot find a matching request_sock in the SYN queue and falls through to the syncookies handling: tcp_v4_do_rcv -> tcp_v4_hnd_req:

static struct sock *tcp_v4_hnd_req(struct sock *sk, struct sk_buff *skb)
{
///...
#ifdef CONFIG_SYN_COOKIES
	if (!th->syn)
		sk = cookie_v4_check(sk, skb, &(IPCB(skb)->opt));
#endif
	return sk;
}

cookie_v4_check verifies whether the cookie is valid, creates a new request_sock object, and re-enters the normal connection setup path.

SYN cookies and TCP options

For connections that go through the SYN cookies path, the kernel keeps no per-socket state, so the TCP options carried in the SYN packet are lost.

  • MSS

When the receiver returns the cookie to the sender, it encodes the MSS value into the cookie; after the sender echoes the cookie back, the receiver recovers the MSS value in cookie_v4_check:

struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb,
			     struct ip_options *opt)
{
///...
	if (!sysctl_tcp_syncookies || !th->ack || th->rst)
		goto out;

	if (tcp_synq_no_recent_overflow(sk) ||
	    (mss = cookie_check(skb, cookie)) == 0) { ///mss option value
		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SYNCOOKIESFAILED);
		goto out;
	}

///...
	req = inet_reqsk_alloc(&tcp_request_sock_ops); /* for safety */
	if (!req)
		goto out;
///...
	/* Try to redo what tcp_v4_send_synack did. */
	req->window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
	///initial window
	tcp_select_initial_window(tcp_full_space(sk), req->mss,
				  &req->rcv_wnd, &req->window_clamp,
				  ireq->wscale_ok, &rcv_wscale,
				  dst_metric(&rt->dst, RTAX_INITRWND));

	ireq->rcv_wscale  = rcv_wscale;
///...
}
  • wscale

However, other options, such as wscale and SACK, are lost. Later the timestamp option was used to carry wscale, and later still this was removed again (see references 1 and 2). For details, see Improving syncookies.

In the SYN cookies path, after the receiver gets the peer's ACK it recomputes wscale instead of using the wscale negotiated during the SYN/SYN-ACK handshake. Because the wscale computation depends on parameters such as the receive buffer size, the recomputed wscale may differ from the negotiated one, so the sender and receiver end up with inconsistent wscale values:

void tcp_select_initial_window(int __space, __u32 mss,
			       __u32 *rcv_wnd, __u32 *window_clamp,
			       int wscale_ok, __u8 *rcv_wscale,
			       __u32 init_rcv_wnd)
{
	unsigned int space = (__space < 0 ? 0 : __space); ///sk_rcvbuf size
///...
	(*rcv_wscale) = 0;
	if (wscale_ok) {
		/* Set window scaling on max possible window
		 * See RFC1323 for an explanation of the limit to 14
		 */
		space = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max);
		space = min_t(u32, space, *window_clamp);
		while (space > 65535 && (*rcv_wscale) < 14) {
			space >>= 1;
			(*rcv_wscale)++;
		}
	}

Since the TCP window is scaled by wscale, this mismatch leads directly to the symptom described at the beginning.
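
An illustrative snippet with made-up numbers (not taken from the capture above): the advertised window is the 16-bit window field shifted left by wscale, so when the two ends apply different shifts, the usable window collapses.

package main

import "fmt"

func main() {
	field := uint32(60)     // 16-bit window field carried in the TCP header
	fmt.Println(field << 7) // 7680 bytes if both sides agree on wscale = 7
	fmt.Println(field << 0) // only 60 bytes if the peer applies wscale = 0
}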

Summary

This was in fact a simple problem, but the investigation took quite a few detours: from the start we focused on the TCP window mechanism and tried to find the cause there, while overlooking some key kernel log output. It proves the point once again: *problems that look complicated on the surface often have very simple causes!*

In any case, the current kernel TCP SYN cookies mechanism has shortcomings, so use it with care.

Reference

Dive deep into inotify and overlayfs (2017-01-06, http://hustcat.github.io/dive-into-inotify-and-overlayfs)

Introduction

Userspace can use the kernel's filesystem notification APIs to learn about changes in the filesystem, such as opening, closing, creating, and deleting files or directories. The kernel first implemented dnotify in 2.4.0, but dnotify reuses the fcntl system call and has a number of problems: (1) it can only watch directories, not individual files; (2) it delivers events via the SIGIO signal, but signals are asynchronous and can be lost, and they carry too little information; for example, you cannot tell which file in the directory the event happened on.

Later, the kernel implemented inotify in 2.6.13. inotify adds several new system calls and solves dnotify's problems.

  • inotifywait

We can use inotifywait, which ships with inotify-tools, to watch events on a directory.

#inotifywait -rme modify,open,create,delete,close /root/dbyin/test/
Setting up watches.  Beware: since -r was given, this may take a while!
Watches established.
/root/dbyin/test/ CREATE f1.txt
/root/dbyin/test/ OPEN f1.txt
/root/dbyin/test/ MODIFY f1.txt
/root/dbyin/test/ CLOSE_WRITE,CLOSE f1.txt
/root/dbyin/test/ DELETE f1.txt

Another terminal:

#echo hello > /root/dbyin/test/f1.txt
#rm /root/dbyin/test/f1.txt

For a sample program, see here.
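
For reference, a minimal sketch of the raw inotify API (Linux only; the watched path is an example and little-endian event parsing is assumed) looks roughly like this:

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"log"
	"syscall"
)

func main() {
	fd, err := syscall.InotifyInit()
	if err != nil {
		log.Fatal(err)
	}
	// Watch one directory for create/open/modify/delete events.
	mask := uint32(syscall.IN_CREATE | syscall.IN_OPEN | syscall.IN_MODIFY | syscall.IN_DELETE)
	if _, err := syscall.InotifyAddWatch(fd, "/tmp/test", mask); err != nil {
		log.Fatal(err)
	}

	buf := make([]byte, 4096)
	for {
		n, err := syscall.Read(fd, buf)
		if err != nil {
			log.Fatal(err)
		}
		// Each record: wd(4) mask(4) cookie(4) len(4), then `len` bytes of NUL-padded name.
		for off := 0; off+syscall.SizeofInotifyEvent <= n; {
			evMask := binary.LittleEndian.Uint32(buf[off+4 : off+8])
			nameLen := int(binary.LittleEndian.Uint32(buf[off+12 : off+16]))
			name := string(bytes.TrimRight(buf[off+syscall.SizeofInotifyEvent:off+syscall.SizeofInotifyEvent+nameLen], "\x00"))
			fmt.Printf("mask=%#x name=%q\n", evMask, name)
			off += syscall.SizeofInotifyEvent + nameLen
		}
	}
}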

Implementation of inotify

Core data structures

  • fsnotify_group

fsnotify_group represents an inotify instance; one is created each time userspace calls inotify_init, and it keeps all event information for that instance:

/*
 * A group is a "thing" that wants to receive notification about filesystem
 * events.  The mask holds the subset of event types this group cares about.
 * refcnt on a group is up to the implementor and at any moment if it goes 0
 * everything will be cleaned up.
 */
struct fsnotify_group {

	const struct fsnotify_ops *ops;	/* how this group handles things, inotify_fops */
	
	struct list_head notification_list;	/* list of event_holder this group needs to send to userspace, fsnotify_event list */
	wait_queue_head_t notification_waitq;	/* read() on the notification file blocks on this waitq */
	unsigned int q_len;			/* events on the queue */
	unsigned int max_events;		/* maximum events allowed on the list */

	struct list_head marks_list;	/* all inode marks for this group, struct fsnotify_mark list */

	struct fasync_struct    *fsn_fa;    /* async notification */

	/* groups can define private fields here or use the void *private */
	union {
		void *private;
#ifdef CONFIG_INOTIFY_USER
		struct inotify_group_private_data {
			spinlock_t	idr_lock;
			struct idr      idr;   ///id -> inotify_inode_mark*
			struct user_struct      *user;
		} inotify_data; ///for inotify
#endif
	}
}
  • fsnotify_mark

fsnotify_mark is the bridge between an fsnotify_group and an inode: fsnotify_group->marks_list is a list of fsnotify_mark objects, and fsnotify_mark.i.inode points to the inode of the watched file. inode->i_fsnotify_marks holds all the inotify instances watching that inode.

struct inotify_inode_mark {
	struct fsnotify_mark fsn_mark;
	int wd; ///watch descriptor
};


struct fsnotify_mark {
	__u32 mask;			/* mask this mark is for */
	/* we hold ref for each i_list and g_list.  also one ref for each 'thing'
	 * in kernel that found and may be using this mark. */
	atomic_t refcnt;		/* active things looking at this mark */
	struct fsnotify_group *group;	/* group this mark is for */
	struct list_head g_list;	/* list of marks by group->i_fsnotify_marks */
	spinlock_t lock;		/* protect group and inode */
	union {
		struct fsnotify_inode_mark i;
		struct fsnotify_vfsmount_mark m;
	};
	__u32 ignored_mask;		/* events types to ignore */
#define FSNOTIFY_MARK_FLAG_INODE		0x01
#define FSNOTIFY_MARK_FLAG_VFSMOUNT		0x02
#define FSNOTIFY_MARK_FLAG_OBJECT_PINNED	0x04
#define FSNOTIFY_MARK_FLAG_IGNORED_SURV_MODIFY	0x08
#define FSNOTIFY_MARK_FLAG_ALIVE		0x10
	unsigned int flags;		/* vfsmount or inode mark? */
	struct list_head destroy_list;
	void (*free_mark)(struct fsnotify_mark *mark); /* called on final put+free */
};

/*
 * Inode specific fields in an fsnotify_mark
 */
struct fsnotify_inode_mark {
	struct inode *inode;		/* inode this mark is associated with */
	struct hlist_node i_list;	/* list of marks by inode->i_fsnotify_marks */
	struct list_head free_i_list;	/* tmp list used when freeing this mark */
};
  • inode and file
/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {

#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	struct hlist_head	i_fsnotify_marks; ///struct fsnotify_inode_mark list, see fsnotify_inode_mark.i_list
#endif 
}


struct file {
    ///...
	
	void			*private_data; ///fsnotify_group*
}

Implementation of overlayfs

Data structures

The key data structures of overlayfs:

struct dentry {
	struct dentry *d_parent;	/* parent directory dentry object */
	struct qstr d_name;   ///name of the current path component
	struct inode *d_inode;		/* inode object, created by ovl_new_inode */
	
	const struct dentry_operations *d_op; /// == super_block->s_d_op == ovl_dentry_operations
	struct super_block *d_sb;	/* The root of the dentry tree */

	void *d_fsdata;			/* fs-specific data, struct ovl_entry */
}


/* private information held for every overlayfs dentry */
struct ovl_entry {
	struct dentry *__upperdentry; ///not NULL if got in upperdir
	struct ovl_dir_cache *cache;
	union {
		struct {
			u64 version;
			bool opaque;
		};
		struct rcu_head rcu;
	};
	unsigned numlower;
	struct path lowerstack[]; ///not NULL if got in lowdir
};


struct inode {
	const struct inode_operations	*i_op; ///ovl_dir_inode_operations
	struct super_block	*i_sb;
	
	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops, ovl_dir_operations */
	
	void			*i_private; /* fs or device private pointer,  struct ovl_entry*/
};

dentry is the kernel's directory entry object; every directory (or file) has one. For overlayfs, the inode a dentry points to has no actual on-disk data; it is an in-memory inode created by ovl_new_inode. dentry->d_fsdata points to an ovl_entry, which in turn points to the real dentries of the underlying filesystems.

During path walking in overlayfs, dentry->d_inode is of little use; in fact, in ovl_lookup the struct inode *dir argument representing the parent directory is not used at all. What matters for the lookup is the ovl_entry pointed to by dentry->d_fsdata, through which the lookup descends into the underlying filesystems.

///dir: parent directory inode object, dentry: dentry object for the directory entry being looked up
struct dentry *ovl_lookup(struct inode *dir, struct dentry *dentry,
			  unsigned int flags) ///called by lookup_real
{
	struct ovl_entry *oe;
	struct ovl_entry *poe = dentry->d_parent->d_fsdata; ///dentry->d_parent->d_inode == dir
	struct path *stack = NULL;
	struct dentry *upperdir, *upperdentry = NULL;
	unsigned int ctr = 0;
	struct inode *inode = NULL;
	bool upperopaque = false;
	struct dentry *this, *prev = NULL;
	unsigned int i;
	int err;

	upperdir = ovl_upperdentry_dereference(poe);
	if (upperdir) { ///(1)lookup in upperdir firstly
		this = ovl_lookup_real(upperdir, &dentry->d_name);
		err = PTR_ERR(this);
		if (IS_ERR(this))
			goto out;

		if (this) {///exist in upperdir
			if (unlikely(ovl_dentry_remote(this))) {
				dput(this);
				err = -EREMOTE;
				goto out;
			}
			if (ovl_is_whiteout(this)) {
				dput(this); ///whiteout file
				this = NULL;
				upperopaque = true;
			} else if (poe->numlower && ovl_is_opaquedir(this)) {
				upperopaque = true; ///opaque dir
			}
		}
		upperdentry = prev = this;
	}
	///(2)didn't find dentry in upperdir
	if (!upperopaque && poe->numlower) {
		err = -ENOMEM;
		stack = kcalloc(poe->numlower, sizeof(struct path), GFP_KERNEL);
		if (!stack)
			goto out_put_upper;
	}
	///(3)find dentry in lowdir
	for (i = 0; !upperopaque && i < poe->numlower; i++) {
		bool opaque = false;
		struct path lowerpath = poe->lowerstack[i];

		this = ovl_lookup_real(lowerpath.dentry, &dentry->d_name);
		err = PTR_ERR(this);
		if (IS_ERR(this)) {
			/*
			 * If it's positive, then treat ENAMETOOLONG as ENOENT.
			 */
			if (err == -ENAMETOOLONG && (upperdentry || ctr))
				continue;
			goto out_put;
		}
		if (!this)
			continue;
		if (ovl_is_whiteout(this)) {
			dput(this);
			break;
		}
		/*
		 * Only makes sense to check opaque dir if this is not the
		 * lowermost layer.
		 */
		if (i < poe->numlower - 1 && ovl_is_opaquedir(this))
			opaque = true;

		if (prev && (!S_ISDIR(prev->d_inode->i_mode) ||
			     !S_ISDIR(this->d_inode->i_mode))) {
			/*
			 * FIXME: check for upper-opaqueness maybe better done
			 * in remove code.
			 */
			if (prev == upperdentry)
				upperopaque = true;
			dput(this);
			break;
		}
		/*
		 * If this is a non-directory then stop here.
		 */
		if (!S_ISDIR(this->d_inode->i_mode))
			opaque = true;

		stack[ctr].dentry = this;
		stack[ctr].mnt = lowerpath.mnt;
		ctr++;
		prev = this;
		if (opaque)
			break;
	}

	oe = ovl_alloc_entry(ctr); ///ovl_dentry for current finding dentry
	err = -ENOMEM;
	if (!oe)
		goto out_put;

	if (upperdentry || ctr) {///if got in upperdir, upperdentry != NULL; else if got in lowdir, ctr > 0
		struct dentry *realdentry;

		realdentry = upperdentry ? upperdentry : stack[0].dentry;
		///alloc overlayfs inode for current real inode
		err = -ENOMEM;
		inode = ovl_new_inode(dentry->d_sb, realdentry->d_inode->i_mode,
				      oe);
		if (!inode)
			goto out_free_oe;
		ovl_copyattr(realdentry->d_inode, inode);
	}

	oe->opaque = upperopaque;
	oe->__upperdentry = upperdentry;
	memcpy(oe->lowerstack, stack, sizeof(struct path) * ctr);
	kfree(stack);
	dentry->d_fsdata = oe; ///ovl_entry
	d_add(dentry, inode);

	return NULL;

out_free_oe:
	kfree(oe);
out_put:
	for (i = 0; i < ctr; i++)
		dput(stack[i].dentry);
	kfree(stack);
out_put_upper:
	dput(upperdentry);
out:
	return ERR_PTR(err);
}
  • open and copy up

When overlayfs opens a file, it makes struct file->f_inode point to the real inode; moreover, if the file will be modified and does not yet exist in upperdir, it is copied up from lowerdir:

int vfs_open(const struct path *path, struct file *file,
            const struct cred *cred)
{
	struct dentry *dentry = path->dentry; ///overlayfs dentry
	struct inode *inode = dentry->d_inode; ///overlayfs inode

	file->f_path = *path;
	if (dentry->d_flags & DCACHE_OP_SELECT_INODE) {
		inode = dentry->d_op->d_select_inode(dentry, file->f_flags); ///get real inode, ovl_dentry_operations
		if (IS_ERR(inode))
			return PTR_ERR(inode);
	}

	return do_dentry_open(file, inode, NULL, cred); ///file->f_inode = inode
}

///return underlay fs inode
struct inode *ovl_d_select_inode(struct dentry *dentry, unsigned file_flags)
{
	int err;
	struct path realpath;
	enum ovl_path_type type;

	if (S_ISDIR(dentry->d_inode->i_mode))
		return dentry->d_inode;

	type = ovl_path_real(dentry, &realpath); ///real dentry
	if (ovl_open_need_copy_up(file_flags, type, realpath.dentry)) { ///need copy up
		err = ovl_want_write(dentry);
		if (err)
			return ERR_PTR(err);

		if (file_flags & O_TRUNC)
			err = ovl_copy_up_truncate(dentry);
		else
			err = ovl_copy_up(dentry); ///copy up
		ovl_drop_write(dentry);
		if (err)
			return ERR_PTR(err);

		ovl_path_upper(dentry, &realpath);
	}

	if (realpath.dentry->d_flags & DCACHE_OP_SELECT_INODE)
		return realpath.dentry->d_op->d_select_inode(realpath.dentry, file_flags);

	return realpath.dentry->d_inode; ///return real inode
}

Inotify and Overlayfs

inotify_add_watch uses the overlayfs inode:

SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
		u32, mask)
{

///...
	ret = inotify_find_inode(pathname, &path, flags); ///returns the overlayfs inode
	if (ret)
		goto fput_and_out;

	/* inode held in place by reference to path; group by fget on fd */
	inode = path.dentry->d_inode; ///monitored file(overlay inode)
	group = f.file->private_data; ///notify group

	/* create/update an inode mark */
	ret = inotify_update_watch(group, inode, mask);

///...
}

fsnotify_open, however, uses the underlay filesystem's inode:

/*
 * fsnotify_open - file was opened
 */
static inline void fsnotify_open(struct file *file)
{
	struct path *path = &file->f_path;
	struct inode *inode = file_inode(file); ///for overlayfs , after vfs_open, f->f_inode == underlay inode
	__u32 mask = FS_OPEN;

	if (S_ISDIR(inode->i_mode))
		mask |= FS_ISDIR;

	fsnotify_parent(path, NULL, mask);
	fsnotify(inode, mask, path, FSNOTIFY_EVENT_PATH, NULL, 0);
}

In vfs_open, the kernel makes file->f_inode point to the underlay filesystem's inode:

int vfs_open(const struct path *path, struct file *file,
            const struct cred *cred)
{
	struct dentry *dentry = path->dentry; ///overlayfs dentry
	struct inode *inode = dentry->d_inode; ///overlayfs inode

	file->f_path = *path;
	if (dentry->d_flags & DCACHE_OP_SELECT_INODE) {
		inode = dentry->d_op->d_select_inode(dentry, file->f_flags); ///get underlayfs inode, ovl_dentry_operations
		if (IS_ERR(inode))
			return PTR_ERR(inode);
	}

	return do_dentry_open(file, inode, NULL, cred); ///file->f_inode = inode
}

So, when watching a single file on overlayfs, no events are received: the watch mark is attached to the overlayfs inode, while the events are generated against the underlay inode.

Reference

Socket activation in systemd (2017-01-05, http://hustcat.github.io/socket-activation-in-systemd)

Introduction

To speed up system boot, systemd uses socket activation to start all system services concurrently. The idea of socket activation is an old one; inetd used it to start network services on demand.

The core of socket activation is to move the creation of the listening socket from the service daemon into systemd. Even if the service itself has not started, other services that depend on it can already connect to the listening socket; systemd then spawns the service process and hands the listening socket over to the daemon, which handles the requests on it. This allows all service daemons to start at the same time.

Socket activation makes it possible to start all four services completely simultaneously, without any kind of ordering. Since the creation of the listening sockets is moved outside of the daemons themselves we can start them all at the same time, and they are able to connect to each other’s sockets right-away.

Write socket activation daemon

A service that uses socket activation must receive its socket from systemd instead of creating it itself.

A service capable of socket activation must be able to receive its preinitialized sockets from systemd, instead of creating them internally.

  • NON-SOCKET-ACTIVATABLE SERVICE

A traditional, non-socket-activatable service normally creates its socket itself:

/* Source Code Example #1: ORIGINAL, NOT SOCKET-ACTIVATABLE SERVICE */
...
union {
        struct sockaddr sa;
        struct sockaddr_un un;
} sa;
int fd;

fd = socket(AF_UNIX, SOCK_STREAM, 0);
if (fd < 0) {
        fprintf(stderr, "socket(): %m\n");
        exit(1);
}

memset(&sa, 0, sizeof(sa));
sa.un.sun_family = AF_UNIX;
strncpy(sa.un.sun_path, "/run/foobar.sk", sizeof(sa.un.sun_path));

if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
        fprintf(stderr, "bind(): %m\n");
        exit(1);
}

if (listen(fd, SOMAXCONN) < 0) {
        fprintf(stderr, "listen(): %m\n");
        exit(1);
}
...

With this approach, other services that depend on it can only access the service after the daemon process has started.

  • SOCKET-ACTIVATABLE SERVICE
/* Source Code Example #2: UPDATED, SOCKET-ACTIVATABLE SERVICE */
...
#include "sd-daemon.h"
...
int fd;

if (sd_listen_fds(0) != 1) { ///use the socket created by systemd
        fprintf(stderr, "No or too many file descriptors received.\n");
        exit(1);
}

fd = SD_LISTEN_FDS_START + 0;
...

With this approach, starting the service in the traditional way no longer works. To support both modes, the following approach can be used:

  • SOCKET-ACTIVATABLE SERVICE WITH COMPATIBILITY
/* Source Code Example #3: UPDATED, SOCKET-ACTIVATABLE SERVICE WITH COMPATIBILITY */
...
#include "sd-daemon.h"
...
int fd, n;

n = sd_listen_fds(0);
if (n > 1) {
        fprintf(stderr, "Too many file descriptors received.\n");
        exit(1);
} else if (n == 1)
        fd = SD_LISTEN_FDS_START + 0;
else {
        union {
                struct sockaddr sa;
                struct sockaddr_un un;
        } sa;

        fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) {
                fprintf(stderr, "socket(): %m\n");
                exit(1);
        }

        memset(&sa, 0, sizeof(sa));
        sa.un.sun_family = AF_UNIX;
        strncpy(sa.un.sun_path, "/run/foobar.sk", sizeof(sa.un.sun_path));

        if (bind(fd, &sa.sa, sizeof(sa)) < 0) {
                fprintf(stderr, "bind(): %m\n");
                exit(1);
        }

        if (listen(fd, SOMAXCONN) < 0) {
                fprintf(stderr, "listen(): %m\n");
                exit(1);
        }
}
...

For the complete program, see here. There is also a Go example here.
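
As a hedged Go sketch of the same idea (following the sd_listen_fds convention: passed sockets start at fd 3, and systemd sets LISTEN_PID/LISTEN_FDS; names below are illustrative):

package main

import (
	"io"
	"log"
	"net"
	"os"
	"strconv"
)

// activationListener wraps the socket passed by systemd instead of calling
// socket()/bind()/listen() itself. It returns (nil, nil) if no socket was passed.
func activationListener() (net.Listener, error) {
	if pid, _ := strconv.Atoi(os.Getenv("LISTEN_PID")); pid != os.Getpid() {
		return nil, nil // not started by systemd, or the fds are not for us
	}
	if n, _ := strconv.Atoi(os.Getenv("LISTEN_FDS")); n != 1 {
		return nil, nil
	}
	f := os.NewFile(3, "from-systemd") // SD_LISTEN_FDS_START == 3
	return net.FileListener(f)
}

func main() {
	ln, err := activationListener()
	if err != nil || ln == nil {
		log.Fatal("no socket received from systemd")
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// Echo back whatever the client sends, like the foobard example above.
		go func(c net.Conn) { defer c.Close(); io.Copy(c, c) }(conn)
	}
}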

  • Enable service in systemd

Create the socket unit file:

# cat /etc/systemd/system/foobar.socket 
[Socket]
ListenStream=/run/foobar.sk

[Install]
WantedBy=sockets.target

Create the corresponding service file:

# cat /etc/systemd/system/foobar.service 
[Service]
ExecStart=/usr/local/bin/foobard

Start the socket unit:

# systemctl enable foobar.socket
# systemctl start foobar.socket
# systemctl status foobar.socket
● foobar.socket
   Loaded: loaded (/etc/systemd/system/foobar.socket; enabled; vendor preset: disabled)
   Active: active (listening) since 四 2017-01-05 18:59:41 CST; 31s ago
   Listen: /run/foobar.sk (Stream)

1月 05 18:59:41 centos7 systemd[1]: Listening on foobar.socket.
1月 05 18:59:41 centos7 systemd[1]: Starting foobar.socket.

# lsof /run/foobar.sk 
COMMAND PID USER   FD   TYPE             DEVICE SIZE/OFF  NODE NAME
systemd   1 root   28u  unix 0xffff88002db53400      0t0 29058 /run/foobar.sk

We can see that systemd has created the socket, but at this point the foobard process has not been started.

When we connect to /run/foobar.sk, systemd starts the foobard process:

# socat - unix-connect:/run/foobar.sk 
hello world
hello world
again
again


# ps -ef|grep foob
root      3589  3338  0 19:01 pts/1    00:00:00 socat - unix-connect:/run/foobar.sk
root      3590     1  0 19:01 ?        00:00:00 /usr/local/bin/foobard

Internal

Before starting the service process, systemd sets the environment variables LISTEN_PID and LISTEN_FDS:

static int build_environment(
                const ExecContext *c,
                unsigned n_fds,
                usec_t watchdog_usec,
                const char *home,
                const char *username,
                const char *shell,
                char ***ret) {

        _cleanup_strv_free_ char **our_env = NULL;
        unsigned n_env = 0;
        char *x;

        assert(c);
        assert(ret);

        our_env = new0(char*, 10);
        if (!our_env)
                return -ENOMEM;

        if (n_fds > 0) {
                if (asprintf(&x, "LISTEN_PID="PID_FMT, getpid()) < 0)
                        return -ENOMEM;
                our_env[n_env++] = x;

                if (asprintf(&x, "LISTEN_FDS=%u", n_fds) < 0)
                        return -ENOMEM;
                our_env[n_env++] = x;
        }
...

The service process reads these environment variables via sd_listen_fds:

_public_ int sd_listen_fds(int unset_environment) {
        const char *e;
        int n, r, fd;
        pid_t pid;

        e = getenv("LISTEN_PID");
        if (!e) {
                r = 0;
                goto finish;
        }

        r = parse_pid(e, &pid);
        if (r < 0)
                goto finish;

        /* Is this for us? */
        if (getpid() != pid) {
                r = 0;
                goto finish;
        }

        e = getenv("LISTEN_FDS");
        if (!e) {
                r = 0;
                goto finish;
        }
...

Reference

Getting started with D-BUS (2017-01-04, http://hustcat.github.io/getting-started-with-dbus)

Introduction

D-BUS is an inter-process communication (IPC) mechanism. It is mainly used for local IPC over AF_UNIX sockets, although it can also run over TCP/IP for communication across hosts.

Linux already has many local IPC mechanisms, such as FIFOs and UNIX sockets, so why add D-BUS? D-BUS adopts the idea of RPC (Remote Procedure Call): it is essentially RPC on the local machine. Compared with raw FIFOs and unix sockets, RPC is a more modern way to communicate; the RPC framework itself takes care of message encoding/decoding, security checks, and so on, which greatly simplifies application development.

D-Bus consists of the following pieces:

(1) libdbus: a low-level library

(2) dbus-daemon: a daemon based on libdbus. Handles and controls data transfers between DBus peers

(3) two types of busses: a system and a session one. Each bus instance is managed by a dbus-daemon

(4) a security mechanism using policy files

It is worth mentioning that systemd does not use libdbus, but its own library implementation.

D-Bus Concepts

The D-BUS protocol is a peer-to-peer (or client-server) communication protocol. It involves a few basic concepts:

  • bus

The bus is D-BUS's communication channel: applications talk to each other over a bus and look up services on it. There are two kinds of buses:

A “system bus” for notifications from the system to user sessions, and to allow the system to request input from user sessions.

A “session bus” used to implement desktop environments such as GNOME and KDE.

  • service

A service is a program that exposes an IPC API; each service has an identifying name in reverse domain name form. For example, org.freedesktop.NetworkManager corresponds to NetworkManager on the system bus, and org.freedesktop.login1 corresponds to systemd-logind on the system bus.

  • object

An object is, roughly, an address for communication. Every object of a service is identified by an object path, which looks like a filesystem path. For example, /org/freedesktop/login1 is the path of the manager object of the org.freedesktop.login1 service.

  • interfaces

Each object implements one or more interfaces. D-Bus interfaces define the methods and signals supported by D-Bus objects.

  • method

D-Bus methods may accept any number of arguments and may return any number of values, including none.

  • signal

D-Bus signals provide a 1-to-many, publish-subscribe mechanism. Similar to method return values, D-Bus signals may contain an arbitrary amount of data. Unlike methods however, signals are entirely asynchronous and may be emitted by D-Bus objects at any time.

  • signature

A signature describes the data types of arguments; for example, s is a UTF-8 string, ss is two strings, and a{sv} is a dictionary mapping strings to variants (both show up in the busctl and gdbus examples below).

For more concepts, see the DBus Overview.

From shell

There are several D-BUS related tools, such as busctl, gdbus, and dbus-send, which can be used to perform D-BUS operations from the shell.

List all endpoints connected to the system bus:

# busctl
NAME                                   PID PROCESS         USER             CONNECTION    UNIT                      SESSION    DESCR
:1.0                                     1 systemd         root             :1.0          init.scope                -          -    
:1.1                                   516 polkitd         polkitd          :1.1          polkit.service            -          -    
:1.10                                 1305 busctl          root             :1.10         sshd.service              -          -    
:1.2                                   719 tuned           root             :1.2          tuned.service             -          -    
com.redhat.tuned                       719 tuned           root             :1.2          tuned.service             -          -    
fi.epitest.hostap.WPASupplicant          - -               -                (activatable) -                         -         
fi.w1.wpa_supplicant1                    - -               -                (activatable) -                         -         
org.freedesktop.DBus                     - -               -                -             -                         -          -    
org.freedesktop.PolicyKit1             516 polkitd         polkitd          :1.1          polkit.service            -          -    
org.freedesktop.hostname1                - -               -                (activatable) -                         -         
org.freedesktop.import1                  - -               -                (activatable) -                         -         
org.freedesktop.locale1                  - -               -                (activatable) -                         -         
org.freedesktop.login1                   - -               -                (activatable) -                         -         
org.freedesktop.machine1                 - -               -                (activatable) -                         -         
org.freedesktop.network1                 - -               -                (activatable) -                         -         
org.freedesktop.resolve1                 - -               -                (activatable) -                         -         
org.freedesktop.systemd1                 1 systemd         root             :1.0          init.scope                -          -    
org.freedesktop.timedate1                - -               -                (activatable) -                         -         

D-Bus API of systemd

systemd exposes a number of D-BUS APIs; through them we can perform various systemd operations, such as starting and stopping services.

List all the APIs:

# gdbus introspect --system --dest org.freedesktop.systemd1 --object-path /org/freedesktop/systemd1
node /org/freedesktop/systemd1 {
  interface org.freedesktop.DBus.Peer {
    methods:
      Ping();
      GetMachineId(out s machine_uuid);
    signals:
    properties:
  };
  interface org.freedesktop.DBus.Introspectable {
    methods:
      Introspect(out s data);
    signals:
    properties:
  };
  interface org.freedesktop.DBus.Properties {
    methods:
      Get(in  s interface,
          in  s property,
          out v value);
      GetAll(in  s interface,
             out a{sv} properties);
      Set(in  s interface,
          in  s property,
          in  v value);
    signals:
      PropertiesChanged(s interface,
                        a{sv} changed_properties,
                        as invalidated_properties);
    properties:
  };
  interface org.freedesktop.systemd1.Manager {
    methods:
      GetUnit(in  s arg_0,
              out o arg_1);
      GetUnitByPID(in  u arg_0,
                   out o arg_1);
      LoadUnit(in  s arg_0,
               out o arg_1);
      StartUnit(in  s arg_0,
                in  s arg_1,
                out o arg_2);
      StartUnitReplace(in  s arg_0,
                       in  s arg_1,
                       in  s arg_2,
                       out o arg_3);
      StopUnit(in  s arg_0,
               in  s arg_1,
               out o arg_2);
      ReloadUnit(in  s arg_0,
                 in  s arg_1,
                 out o arg_2);
      RestartUnit(in  s arg_0,
                  in  s arg_1,
                  out o arg_2);
...

The org.freedesktop.systemd1.Manager interface contains the main methods systemd provides.

  • StartUnit/StopUnit
# busctl --system call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager GetUnit s crond.service 
o "/org/freedesktop/systemd1/unit/crond_2eservice"


# busctl --system call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StopUnit ss crond.service replace
o "/org/freedesktop/systemd1/job/904"

# systemctl status crond.service
● crond.service - Command Scheduler
   Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since 四 2017-01-05 03:13:29 CST; 4s ago
  Process: 656 ExecStart=/usr/sbin/crond -n $CRONDARGS (code=exited, status=0/SUCCESS)
 Main PID: 656 (code=exited, status=0/SUCCESS)
...

# busctl --system call org.freedesktop.systemd1 /org/freedesktop/systemd1 org.freedesktop.systemd1.Manager StartUnit ss crond.service replace
o "/org/freedesktop/systemd1/job/905"

# systemctl status crond.service
● crond.service - Command Scheduler
   Loaded: loaded (/usr/lib/systemd/system/crond.service; enabled; vendor preset: enabled)
   Active: active (running) since 四 2017-01-05 03:13:53 CST; 1s ago
 Main PID: 12277 (crond)
   Memory: 624.0K
   CGroup: /system.slice/crond.service
           └─12277 /usr/sbin/crond -n
...

Reference

The Go netpoller and timeout (2016-12-30, http://hustcat.github.io/go-netpoller-and-timeout)

Data structures
  • netFD

netFD is the network file descriptor type of the net package.

//net.go
type conn struct {
	fd *netFD
}

//net/fd_unix.go
// Network file descriptor.
type netFD struct {
	// locking/lifetime of sysfd + serialize access to Read and Write methods
	fdmu fdMutex

	// immutable until Close
	sysfd       int ///OS socket fd
	family      int
	sotype      int
	isConnected bool
	net         string
	laddr       Addr
	raddr       Addr

	// wait server
	pd pollDesc
}

//net/fd_poll_runtime.go
type pollDesc struct {
	runtimeCtx uintptr ///point to runtime PollDesc
}
  • PollDesc

PollDesc is the runtime's representation of the file descriptor.

///runtime/netpoll.goc
struct PollDesc
{
	PollDesc* link;	// in pollcache, protected by pollcache.Lock

	// The lock protects pollOpen, pollSetDeadline, pollUnblock and deadlineimpl operations.
	// This fully covers seq, rt and wt variables. fd is constant throughout the PollDesc lifetime.
	// pollReset, pollWait, pollWaitCanceled and runtime·netpollready (IO rediness notification)
	// proceed w/o taking the lock. So closing, rg, rd, wg and wd are manipulated
	// in a lock-free way by all operations.
	Lock;		// protectes the following fields
	uintptr	fd;  //OS socket fd, corresponds to netFD.sysfd
	bool	closing;
	uintptr	seq;	// protects from stale timers and ready notifications
	G*	rg;	// READY, WAIT, G waiting for read or nil
	Timer	rt;	// read deadline timer (set if rt.fv != nil)
	int64	rd;	// read deadline
	G*	wg;	// READY, WAIT, G waiting for write or nil
	Timer	wt;	// write deadline timer
	int64	wd;	// write deadline
	void*	user;	// user settable cookie
};

Read

When reading from a network fd, the goroutine keeps calling Read until EAGAIN occurs, at which point the current goroutine is put on the wait queue:

//net/fd_unix.go
func (fd *netFD) Read(p []byte) (n int, err error) {
	if err := fd.readLock(); err != nil {
		return 0, err
	}
	defer fd.readUnlock()
	if err := fd.pd.PrepareRead(); err != nil { ///check for errors first
		return 0, &OpError{"read", fd.net, fd.raddr, err}
	}
	for { ///keep reading until EAGAIN, then put the current goroutine on the wait queue
		n, err = syscall.Read(int(fd.sysfd), p)
		if err != nil {
			n = 0
			if err == syscall.EAGAIN {
				if err = fd.pd.WaitRead(); err == nil { ///wait to be woken up by epoll
					continue
				}
			}
		}
		err = chkReadErr(n, err, fd)
		break
	}
	if err != nil && err != io.EOF {
		err = &OpError{"read", fd.net, fd.raddr, err}
	}
	return
}
//net/fd_poll_runtime.go
func (pd *pollDesc) Wait(mode int) error {
	res := runtime_pollWait(pd.runtimeCtx, mode)
	return convertErr(res)
}
  • runtime_pollWait
//runtime/netpoll.goc
func runtime_pollWait(pd *PollDesc, mode int) (err int) {
	err = checkerr(pd, mode);
	if(err == 0) {
		// As for now only Solaris uses level-triggered IO.
		if(Solaris)
			runtime·netpollarm(pd, mode);
		while(!netpollblock(pd, mode, false)) {
			err = checkerr(pd, mode);
			if(err != 0)
				break;
			// Can happen if timeout has fired and unblocked us,
			// but before we had a chance to run, timeout has been reset.
			// Pretend it has not happened and retry.
		}
	}
}

// returns true if IO is ready, or false if timedout or closed
// waitio - wait only for completed IO, ignore errors
static bool
netpollblock(PollDesc *pd, int32 mode, bool waitio)
{
	G **gpp, *old;

	gpp = &pd->rg;
	if(mode == 'w')
		gpp = &pd->wg;

	// set the gpp semaphore to WAIT
	for(;;) {
		old = *gpp;
		if(old == READY) {
			*gpp = nil;
			return true;
		}
		if(old != nil)
			runtime·throw("netpollblock: double wait");
		if(runtime·casp(gpp, nil, WAIT))
			break;
	}

	// need to recheck error states after setting gpp to WAIT
	// this is necessary because runtime_pollUnblock/runtime_pollSetDeadline/deadlineimpl
	// do the opposite: store to closing/rd/wd, membarrier, load of rg/wg
	if(waitio || checkerr(pd, mode) == 0)
		runtime·park((bool(*)(G*, void*))blockcommit, gpp, "IO wait"); ////park (block) the current goroutine
	// be careful to not lose concurrent READY notification
	old = runtime·xchgp(gpp, nil);
	if(old > WAIT)
		runtime·throw("netpollblock: corrupted state");
	return old == READY;
}

static intgo
checkerr(PollDesc *pd, int32 mode)
{
	if(pd->closing)
		return 1;  // errClosing
	if((mode == 'r' && pd->rd < 0) || (mode == 'w' && pd->wd < 0))
		return 2;  // errTimeout
	return 0;
}

sysmon

The sysmon OS thread periodically polls for epoll events and puts the ready Gs onto the global run queue (note that it does not call runtime·ready directly):

//runtime/proc.c
static void
sysmon(void)
{
///..
		// poll network if not polled for more than 10ms
		lastpoll = runtime·atomicload64(&runtime·sched.lastpoll);
		now = runtime·nanotime();
		if(lastpoll != 0 && lastpoll + 10*1000*1000 < now) {
			runtime·cas64(&runtime·sched.lastpoll, lastpoll, now);
			gp = runtime·netpoll(false);  // non-blocking
			if(gp) {
				// Need to decrement number of idle locked M's
				// (pretending that one more is running) before injectglist.
				// Otherwise it can lead to the following situation:
				// injectglist grabs all P's but before it starts M's to run the P's,
				// another M returns from syscall, finishes running its G,
				// observes that there is no work to do and no other running M's
				// and reports deadlock.
				incidlelocked(-1);
				injectglist(gp); // add to the global run queue
				incidlelocked(1);
			}
		}
}

  • epoll_wait
//runtime/netpoll_epoll.c
// polls for ready network connections
// returns list of goroutines that become runnable
G*
runtime·netpoll(bool block)
{
	static int32 lasterr;
	EpollEvent events[128], *ev;
	int32 n, i, waitms, mode;
	G *gp;

	if(epfd == -1)
		return nil;
	waitms = -1;
	if(!block)
		waitms = 0;
retry:
	n = runtime·epollwait(epfd, events, nelem(events), waitms);
	if(n < 0) {
		if(n != -EINTR && n != lasterr) {
			lasterr = n;
			runtime·printf("runtime: epollwait on fd %d failed with %d\n", epfd, -n);
		}
		goto retry;
	}
	gp = nil;
	for(i = 0; i < n; i++) {
		ev = &events[i];
		if(ev->events == 0)
			continue;
		mode = 0;
		if(ev->events & (EPOLLIN|EPOLLRDHUP|EPOLLHUP|EPOLLERR))
			mode += 'r';
		if(ev->events & (EPOLLOUT|EPOLLHUP|EPOLLERR))
			mode += 'w';
		if(mode)
			runtime·netpollready(&gp, (void*)ev->data, mode);
	}
	if(block && gp == nil)
		goto retry;
	return gp;
}

// Injects the list of runnable G's into the scheduler.
// Can run concurrently with GC.
static void
injectglist(G *glist)
{
	int32 n;
	G *gp;

	if(glist == nil)
		return;
	runtime·lock(&runtime·sched);
	for(n = 0; glist; n++) {
		gp = glist;
		glist = gp->schedlink;
		gp->status = Grunnable;
		globrunqput(gp);
	}
	runtime·unlock(&runtime·sched);

	for(; n && runtime·sched.npidle; n--)
		startm(nil, false);
}
///runtime/netpoll.goc
// make pd ready, newly runnable goroutines (if any) are enqueued info gpp list
void
runtime·netpollready(G **gpp, PollDesc *pd, int32 mode)
{
	G *rg, *wg;

	rg = wg = nil;
	if(mode == 'r' || mode == 'r'+'w')
		rg = netpollunblock(pd, 'r', true);
	if(mode == 'w' || mode == 'r'+'w')
		wg = netpollunblock(pd, 'w', true);
	if(rg) {
		rg->schedlink = *gpp;
		*gpp = rg;
	}
	if(wg) {
		wg->schedlink = *gpp;
		*gpp = wg;
	}
}

static G*
netpollunblock(PollDesc *pd, int32 mode, bool ioready)
{
	G **gpp, *old, *new;

	gpp = &pd->rg;
	if(mode == 'w')
		gpp = &pd->wg;

	for(;;) {
		old = *gpp;
		if(old == READY)
			return nil;
		if(old == nil && !ioready) {
			// Only set READY for ioready. runtime_pollWait
			// will check for timeout/cancel before waiting.
			return nil;
		}
		new = nil;
		if(ioready)
			new = READY;
		if(runtime·casp(gpp, old, new))
			break;
	}
	if(old > WAIT)
		return old;  // must be G*
	return nil;
}

SetReadDeadline

SetReadDeadline implements Go's network I/O timeout primitive. It creates an I/O timer for the netFD; when the timer fires, the runtime calls runtime·ready to wake up the goroutine blocked in Read/Write, if that goroutine is waiting. (By default the deadline is 0 and no timer is created.)

Go does not implement I/O timeouts with epoll_wait; instead it sets a timeout on each netFD via Set[Read|Write]Deadline(time.Time).

When the timer set by SetDeadline fires, the timeout handler removes the timer; moreover, the timer is not reset each time data is received or sent. Therefore, it must be called again before every Read/Write operation:

// A net.Conn that sets a deadline for every Read or Write operation
type TimeoutConn struct {
	net.Conn
	timeout time.Duration
}

func (c *TimeoutConn) Read(b []byte) (int, error) {
	if c.timeout > 0 {
		err := c.Conn.SetReadDeadline(time.Now().Add(c.timeout))
		if err != nil {
			return 0, err
		}
	}
	return c.Conn.Read(b)
}
// SetReadDeadline implements the Conn SetReadDeadline method.
func (c *conn) SetReadDeadline(t time.Time) error {
	if !c.ok() {
		return syscall.EINVAL
	}
	return c.fd.setReadDeadline(t)
}
///net/fd_poll_runtime.go
func (fd *netFD) setReadDeadline(t time.Time) error {
	return setDeadlineImpl(fd, t, 'r')
}

func (fd *netFD) setWriteDeadline(t time.Time) error {
	return setDeadlineImpl(fd, t, 'w')
}

func setDeadlineImpl(fd *netFD, t time.Time, mode int) error {
	d := runtimeNano() + int64(t.Sub(time.Now()))
	if t.IsZero() {
		d = 0
	}
	if err := fd.incref(); err != nil {
		return err
	}
	runtime_pollSetDeadline(fd.pd.runtimeCtx, d, mode)
	fd.decref()
	return nil
}
func runtime_pollSetDeadline(pd *PollDesc, d int64, mode int) {
	G *rg, *wg;

	runtime·lock(pd);
	if(pd->closing) {
		runtime·unlock(pd);
		return;
	}
	pd->seq++;  // invalidate current timers
	// Reset current timers.
	if(pd->rt.fv) {
		runtime·deltimer(&pd->rt);
		pd->rt.fv = nil;
	}
	if(pd->wt.fv) {
		runtime·deltimer(&pd->wt);
		pd->wt.fv = nil;
	}
	// Setup new timers.
	if(d != 0 && d <= runtime·nanotime())
		d = -1;
	if(mode == 'r' || mode == 'r'+'w')
		pd->rd = d;
	if(mode == 'w' || mode == 'r'+'w')
		pd->wd = d;
	if(pd->rd > 0 && pd->rd == pd->wd) {
		pd->rt.fv = &deadlineFn;
		pd->rt.when = pd->rd;
		// Copy current seq into the timer arg.
		// Timer func will check the seq against current descriptor seq,
		// if they differ the descriptor was reused or timers were reset.
		pd->rt.arg.type = (Type*)pd->seq;
		pd->rt.arg.data = pd;
		runtime·addtimer(&pd->rt);
	} else {
		if(pd->rd > 0) {
			pd->rt.fv = &readDeadlineFn;
			pd->rt.when = pd->rd;
			pd->rt.arg.type = (Type*)pd->seq;
			pd->rt.arg.data = pd;
			runtime·addtimer(&pd->rt);
		}
		if(pd->wd > 0) {
			pd->wt.fv = &writeDeadlineFn;
			pd->wt.when = pd->wd;
			pd->wt.arg.type = (Type*)pd->seq;
			pd->wt.arg.data = pd;
			runtime·addtimer(&pd->wt);
		}
	}
	// If we set the new deadline in the past, unblock currently pending IO if any.
	rg = nil;
	runtime·atomicstorep(&wg, nil);  // full memory barrier between stores to rd/wd and load of rg/wg in netpollunblock
	if(pd->rd < 0)
		rg = netpollunblock(pd, 'r', false);
	if(pd->wd < 0)
		wg = netpollunblock(pd, 'w', false);
	runtime·unlock(pd);
	if(rg)
		runtime·ready(rg);
	if(wg)
		runtime·ready(wg);
}

static FuncVal deadlineFn	= {(void(*)(void))deadline};
static FuncVal readDeadlineFn	= {(void(*)(void))readDeadline};
static FuncVal writeDeadlineFn	= {(void(*)(void))writeDeadline};


static void
readDeadline(int64 now, Eface arg)
{
	deadlineimpl(now, arg, true, false);
}

static void
deadlineimpl(int64 now, Eface arg, bool read, bool write)
{
	PollDesc *pd;
	uint32 seq;
	G *rg, *wg;

	USED(now);
	pd = (PollDesc*)arg.data;
	// This is the seq when the timer was set.
	// If it's stale, ignore the timer event.
	seq = (uintptr)arg.type;
	rg = wg = nil;
	runtime·lock(pd);
	if(seq != pd->seq) {
		// The descriptor was reused or timers were reset.
		runtime·unlock(pd);
		return;
	}
	if(read) {
		if(pd->rd <= 0 || pd->rt.fv == nil)
			runtime·throw("deadlineimpl: inconsistent read deadline");
		pd->rd = -1;
		runtime·atomicstorep(&pd->rt.fv, nil);  // full memory barrier between store to rd and load of rg in netpollunblock
		rg = netpollunblock(pd, 'r', false);
	}
	if(write) {
		if(pd->wd <= 0 || (pd->wt.fv == nil && !read))
			runtime·throw("deadlineimpl: inconsistent write deadline");
		pd->wd = -1;
		runtime·atomicstorep(&pd->wt.fv, nil);  // full memory barrier between store to wd and load of wg in netpollunblock
		wg = netpollunblock(pd, 'w', false);
	}
	runtime·unlock(pd);
	if(rg)
		runtime·ready(rg);
	if(wg)
		runtime·ready(wg);
}

Initialization

  • epoll init
///runtime/netpoll_epoll.c
static int32 epfd = -1;  // epoll descriptor

void
runtime·netpollinit(void)
{
	epfd = runtime·epollcreate1(EPOLL_CLOEXEC);
	if(epfd >= 0)
		return;
	epfd = runtime·epollcreate(1024);
	if(epfd >= 0) {
		runtime·closeonexec(epfd);
		return;
	}
	runtime·printf("netpollinit: failed to create descriptor (%d)\n", -epfd);
	runtime·throw("netpollinit: failed to create descriptor");
}
  • netFD init

After the runtime creates a socket fd, it adds it to epoll:

//net/fd_unix.go
func (fd *netFD) init() error {
	if err := fd.pd.Init(fd); err != nil {
		return err
	}
	return nil
}

//net/fd_poll_runtime.go
func (pd *pollDesc) Init(fd *netFD) error {
	serverInit.Do(runtime_pollServerInit)
	ctx, errno := runtime_pollOpen(uintptr(fd.sysfd)) ///add to epoll
	if errno != 0 {
		return syscall.Errno(errno)
	}
	pd.runtimeCtx = ctx
	return nil
}

//runtime/netpoll.goc
func runtime_pollOpen(fd uintptr) (pd *PollDesc, errno int) {
	pd = allocPollDesc();
	runtime·lock(pd);
	if(pd->wg != nil && pd->wg != READY)
		runtime·throw("runtime_pollOpen: blocked write on free descriptor");
	if(pd->rg != nil && pd->rg != READY)
		runtime·throw("runtime_pollOpen: blocked read on free descriptor");
	pd->fd = fd;
	pd->closing = false;
	pd->seq++;
	pd->rg = nil;
	pd->rd = 0;
	pd->wg = nil;
	pd->wd = 0;
	runtime·unlock(pd);

	errno = runtime·netpollopen(fd, pd);
}

int32
runtime·netpollopen(uintptr fd, PollDesc *pd)
{
	EpollEvent ev;
	int32 res;

	ev.events = EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET;
	ev.data = (uint64)pd;
	res = runtime·epollctl(epfd, EPOLL_CTL_ADD, (int32)fd, &ev);
	return -res;
}

Summary

SetReadDeadline implements Go's network I/O timeout primitive. It creates an I/O timer for the netFD; when the timer fires, the runtime calls runtime·ready to wake up the goroutine blocked in Read/Write, if that goroutine is waiting. (By default the deadline is 0 and no timer is created.)

Go does not implement I/O timeouts with epoll_wait; instead it sets a timeout on each netFD via Set[Read|Write]Deadline(time.Time).

After the timer set by SetDeadline fires, the timeout handler removes the timer, and the timer is not reset when data is received or sent, so it must be called again before every Read/Write operation.

Reference

Timeout in Go net/http client (2016-12-30, http://hustcat.github.io/timeout-in-net-http-of-golang)

The typical flow of using the net/http client is as follows:

tr := &http.Transport{
	TLSClientConfig:    &tls.Config{RootCAs: pool},
	DisableCompression: true,
}

client := &http.Client{
	Transport: tr,
	CheckRedirect: redirectPolicyFunc,
}

req, err := http.NewRequest("GET", "http://example.com", nil)
// ...
req.Header.Add("If-None-Match", `W/"wyzzy"`)
resp, err := client.Do(req)
// ...

The implementation of http.Client.Do:

Data structures

  • Transport

Transport represents the transport pipe between client and server. It is higher-level than net.Conn: it transfers data over net.Conn (actually http.persistConn) and manages idle connections.

// Transport is an implementation of RoundTripper that supports http,
// https, and http proxies (for either http or https with CONNECT).
// Transport can also cache connections for future re-use.
type Transport struct {
	idleConn    map[connectMethodKey][]*persistConn ///idle connections available for reuse
	reqCanceler map[*Request]func() ///used to implement request cancellation

	// Dial specifies the dial function for creating TCP
	// connections.
	// If Dial is nil, net.Dial is used.
	Dial func(network, addr string) (net.Conn, error)

	// TLSHandshakeTimeout specifies the maximum amount of time waiting to
	// wait for a TLS handshake. Zero means no timeout.
	TLSHandshakeTimeout time.Duration

	// ResponseHeaderTimeout, if non-zero, specifies the amount of
	// time to wait for a server's response headers after fully
	// writing the request (including its body, if any). This
	// time does not include the time to read the response body.
	ResponseHeaderTimeout time.Duration
}


var DefaultTransport RoundTripper = &Transport{
	Proxy: ProxyFromEnvironment,
	Dial: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
	}).Dial,
	TLSHandshakeTimeout: 10 * time.Second,
}

net.Dialer.Timeout limits the time spent establishing a TCP connection (if a new one is needed). http.Transport.TLSHandshakeTimeout limits the time spent performing the TLS handshake. http.Transport.ResponseHeaderTimeout limits the time spent reading the headers of the response.

  • Client
// A Client is higher-level than a RoundTripper (such as Transport)
// and additionally handles HTTP details such as cookies and
// redirects.
type Client struct {
	// Transport specifies the mechanism by which individual
	// HTTP requests are made.
	// If nil, DefaultTransport is used.
	Transport RoundTripper

	// Timeout specifies a time limit for requests made by this
	// Client. The timeout includes connection time, any
	// redirects, and reading the response body. The timer remains
	// running after Get, Head, Post, or Do return and will
	// interrupt reading of the Response.Body.
	//
	// A Timeout of zero means no timeout.
	//
	// The Client's Transport must support the CancelRequest
	// method or Client will return errors when attempting to make
	// a request with Get, Head, Post, or Do. Client's default
	// Transport (DefaultTransport) supports CancelRequest.
	Timeout time.Duration
}

Client.Timeout specifies a time limit for requests made by this Client. The timeout includes connection time, any redirects, and reading the response body.

Client requires the underlying Transport to support the CancelRequest method:

// CancelRequest cancels an in-flight request by closing its
// connection.
func (t *Transport) CancelRequest(req *Request) {
	t.reqMu.Lock()
	cancel := t.reqCanceler[req]
	t.reqMu.Unlock()
	if cancel != nil {
		cancel()
	}
}

The relationship between these timeout variables:
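
Roughly, the timeouts nest: Dialer.Timeout covers only the TCP connect, TLSHandshakeTimeout only the TLS handshake, ResponseHeaderTimeout only the wait for the response headers, while Client.Timeout spans the whole request including reading the body. A sketch with example values:

package main

import (
	"net"
	"net/http"
	"time"
)

func newClient() *http.Client {
	tr := &http.Transport{
		Dial: (&net.Dialer{
			Timeout: 5 * time.Second, // establish the TCP connection
		}).Dial,
		TLSHandshakeTimeout:   10 * time.Second, // TLS handshake only
		ResponseHeaderTimeout: 15 * time.Second, // waiting for response headers only
	}
	return &http.Client{
		Transport: tr,
		Timeout:   60 * time.Second, // whole request, including reading the body
	}
}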

Implementation

  • Client.Timeout
func (c *Client) Do(req *Request) (resp *Response, err error) {
	if req.Method == "GET" || req.Method == "HEAD" {
		return c.doFollowingRedirects(req, shouldRedirectGet)
	}
	if req.Method == "POST" || req.Method == "PUT" {
		return c.doFollowingRedirects(req, shouldRedirectPost)
	}
	return c.send(req)
}

func (c *Client) doFollowingRedirects(ireq *Request, shouldRedirect func(int) bool) (resp *Response, err error) {
///...
	var timer *time.Timer
	if c.Timeout > 0 {
		type canceler interface {
			CancelRequest(*Request)
		}
		tr, ok := c.transport().(canceler)
		if !ok {
			return nil, fmt.Errorf("net/http: Client Transport of type %T doesn't support CancelRequest; Timeout not supported", c.transport())
		}
		timer = time.AfterFunc(c.Timeout, func() {
			reqmu.Lock()
			defer reqmu.Unlock()
			tr.CancelRequest(req) ///on timeout, cancel the request
		})
	}
  • Transport.TLSHandshakeTimeout
func (t *Transport) dialConn(cm connectMethod) (*persistConn, error) {
	conn, err := t.dial("tcp", cm.addr()) ///tcp conn

	pconn := &persistConn{
		t:          t,
		cacheKey:   cm.key(),
		conn:       conn,
		reqch:      make(chan requestAndChan, 1),
		writech:    make(chan writeRequest, 1),
		closech:    make(chan struct{}),
		writeErrCh: make(chan error, 1),
	}

///...
	if cm.targetScheme == "https" {

		plainConn := conn
		tlsConn := tls.Client(plainConn, cfg)
		errc := make(chan error, 2)
		var timer *time.Timer // for canceling TLS handshake
		if d := t.TLSHandshakeTimeout; d != 0 {
			timer = time.AfterFunc(d, func() {
				errc <- tlsHandshakeTimeoutError{}  ///handshake timed out
			})
		}
		go func() {
			err := tlsConn.Handshake()
			if timer != nil {
				timer.Stop()
			}
			errc <- err
		}()
		if err := <-errc; err != nil {
			plainConn.Close()
			return nil, err
		}

}
  • Transport.ResponseHeaderTimeout
func (pc *persistConn) roundTrip(req *transportRequest) (resp *Response, err error) {
	pc.t.setReqCanceler(req.Request, pc.cancelRequest)

	// Write the request concurrently with waiting for a response,
	// in case the server decides to reply before reading our full
	// request body.
	writeErrCh := make(chan error, 1)
	pc.writech <- writeRequest{req, writeErrCh} ///persistConn.writeLoop write Request

	resc := make(chan responseAndError, 1)
	pc.reqch <- requestAndChan{req.Request, resc, requestedGzip} ///persistConn.readLoop read Response

	var re responseAndError
	var pconnDeadCh = pc.closech
	var failTicker <-chan time.Time
	var respHeaderTimer <-chan time.Time
WaitResponse:
	for {
		select {
		case err := <-writeErrCh:
			if err != nil {
				re = responseAndError{nil, err}
				pc.close()
				break WaitResponse
			}
			if d := pc.t.ResponseHeaderTimeout; d > 0 { ///request written successfully, so arm ResponseHeaderTimeout
				respHeaderTimer = time.After(d)
			}
		case <-pconnDeadCh:
			// The persist connection is dead. This shouldn't
			// usually happen (only with Connection: close responses
			// with no response bodies), but if it does happen it
			// means either a) the remote server hung up on us
			// prematurely, or b) the readLoop sent us a response &
			// closed its closech at roughly the same time, and we
			// selected this case first, in which case a response
			// might still be coming soon.
			//
			// We can't avoid the select race in b) by using a unbuffered
			// resc channel instead, because then goroutines can
			// leak if we exit due to other errors.
			pconnDeadCh = nil                               // avoid spinning
			failTicker = time.After(100 * time.Millisecond) // arbitrary time to wait for resc
		case <-failTicker:
			re = responseAndError{err: errClosed}
			break WaitResponse
		case <-respHeaderTimer:  ///ResponseHeaderTimeout
			pc.close()
			re = responseAndError{err: errTimeout}
			break WaitResponse
		case re = <-resc:
			break WaitResponse
		}
	}

Docker pull client

When pulling an image, Docker sets connect and read timeouts on its registry HTTP client through a custom Dial function:

///registry/registry.go
func newClient(jar http.CookieJar, roots *x509.CertPool, cert *tls.Certificate, timeout TimeoutType, secure bool) *http.Client {

	httpTransport := &http.Transport{
		DisableKeepAlives: true,
		Proxy:             http.ProxyFromEnvironment,
		TLSClientConfig:   &tlsConfig,
	}

	switch timeout {
	case ConnectTimeout:
		httpTransport.Dial = func(proto string, addr string) (net.Conn, error) {
			// Set the connect timeout to 5 seconds
			conn, err := net.DialTimeout(proto, addr, 5*time.Second)
			if err != nil {
				return nil, err
			}
			// Set the recv timeout to 10 seconds
			conn.SetDeadline(time.Now().Add(10 * time.Second))
			return conn, nil
		}
	case ReceiveTimeout: ///go here
		httpTransport.Dial = func(proto string, addr string) (net.Conn, error) {
			conn, err := net.Dial(proto, addr)
			if err != nil {
				return nil, err
			}
			conn = utils.NewTimeoutConn(conn, 1*time.Minute)
			return conn, nil
		}
	}

	return &http.Client{
		Transport:     httpTransport,
		CheckRedirect: AddRequiredHeadersToRedirectedRequests,
		Jar:           jar,
	}
}

It also sets a read timeout by calling net.Conn.SetReadDeadline, via the TimeoutConn wrapper:

func NewTimeoutConn(conn net.Conn, timeout time.Duration) net.Conn {
	return &TimeoutConn{conn, timeout}
}

// A net.Conn that sets a deadline for every Read or Write operation
type TimeoutConn struct {
	net.Conn
	timeout time.Duration
}

func (c *TimeoutConn) Read(b []byte) (int, error) {
	if c.timeout > 0 {
		err := c.Conn.SetReadDeadline(time.Now().Add(c.timeout))
		if err != nil {
			return 0, err
		}
	}
	return c.Conn.Read(b)
}
  • net.Conn.SetReadDeadline

SetReadDeadline is the primitive behind Go's network I/O timeouts. It creates an I/O timer for the netFD; when the timer expires, the runtime calls runtime·ready to wake the goroutine blocked in Read/Write, provided that goroutine is actually waiting. (By default the deadline is zero and no timer is created.)

Go does not implement I/O timeouts with epoll_wait; instead, each netFD gets its own timeout via Set[Read|Write]Deadline(time.Time).

When the timer set by SetDeadline expires, the timeout handler removes the timer; the timer is also not reset when data is received or sent. The deadline is therefore an absolute point in time, so SetDeadline must be called again before every Read/Write operation. See The Go netpoller and timeout.
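Because the deadline is absolute and never re-armed by I/O, a per-read idle timeout has to re-set it before every Read, exactly as TimeoutConn does above. A minimal sketch of that idiom (assumes the usual io, net, and time imports; not from the original post):

// readWithIdleTimeout reads from conn until EOF, allowing at most
// idle between two successive reads. Setting the deadline only once,
// before the loop, would bound the total time instead, because the
// deadline is an absolute instant that is not re-armed when data arrives.
func readWithIdleTimeout(conn net.Conn, idle time.Duration) ([]byte, error) {
	var out []byte
	buf := make([]byte, 4096)
	for {
		if err := conn.SetReadDeadline(time.Now().Add(idle)); err != nil {
			return out, err
		}
		n, err := conn.Read(buf)
		out = append(out, buf[:n]...)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err // includes "i/o timeout" errors
		}
	}
}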

//net/net.go
// SetReadDeadline implements the Conn SetReadDeadline method.
func (c *conn) SetReadDeadline(t time.Time) error {
	if !c.ok() {
		return syscall.EINVAL
	}
	return c.fd.setReadDeadline(t)
}


//net/fd_poll_runtime.go
func (fd *netFD) setReadDeadline(t time.Time) error {
	return setDeadlineImpl(fd, t, 'r')
}

func (fd *netFD) setWriteDeadline(t time.Time) error {
	return setDeadlineImpl(fd, t, 'w')
}

func setDeadlineImpl(fd *netFD, t time.Time, mode int) error {
	d := runtimeNano() + int64(t.Sub(time.Now()))
	if t.IsZero() {
		d = 0
	}
	if err := fd.incref(); err != nil {
		return err
	}
	runtime_pollSetDeadline(fd.pd.runtimeCtx, d, mode)
	fd.decref()
	return nil
}

Reference

perf-trace and perf-probe and uprobe 2016-12-26T12:00:30+00:00 hustcat http://hustcat.github.io/perf-trace-and-perf-probe perf-trace

perf-trace is a subcommand provided by the kernel's tools/perf. Compared with the traditional strace, it does not need to stop the target process, so its overhead is much lower.

# dd if=/dev/zero of=/dev/null bs=1 count=500k
512000+0 records in
512000+0 records out
512000 bytes (512 kB) copied, 0.421561 s, 1.2 MB/s

# strace -c dd if=/dev/zero of=/dev/null bs=1 count=500k       
512000+0 records in
512000+0 records out
512000 bytes (512 kB) copied, 19.8597 s, 25.8 kB/s

# perf stat -e 'syscalls:sys_enter_*' dd if=/dev/zero of=/dev/null bs=1 count=500k   
512000+0 records in
512000+0 records out
512000 bytes (512 kB) copied, 0.510631 s, 1.0 MB/s

As you can see, perf performs far better than strace. Unfortunately, the perf-trace shipped with 3.10.x kernels cannot be restricted to specific events.

perf-probe

perf-probe can create dynamic tracepoints; with kernel debuginfo available, it can trace kernel code dynamically down to individual source lines.

We can preview what a perf probe command would do, without actually executing it, by passing -nv:

# perf probe -nv 'dump_write:0 file addr nr'           
probe-definition(0): dump_write:0 file addr nr 
symbol:dump_write file:(null) line:0 offset:0 return:0 lazy:(null)
parsing arg: file into file
parsing arg: addr into addr
parsing arg: nr into nr
3 arguments
Looking at the vmlinux_path (6 entries long)
Using /boot/vmlinux-3.10.102-1-tlinux2-0040.tl1 for symbols
Probe point found: dump_write+0
Searching 'file' variable in context.
Converting variable file into trace event.
file type is (null).
Searching 'addr' variable in context.
Converting variable addr into trace event.
addr type is (null).
Searching 'nr' variable in context.
Converting variable nr into trace event.
nr type is int.
find 1 probe_trace_events.
Opening /sys/kernel/debug//tracing/kprobe_events write=1
Added new event:
Writing event: p:probe/dump_write dump_write+0 file=%di:u64 addr=%si:u64 nr=%dx:s32
  probe:dump_write     (on dump_write with file addr nr)

You can now use it in all perf tools, such as:

        perf record -e probe:dump_write -aR sleep 1

Then, on a kernel without debuginfo, we can run the following command:

# perf probe 'dump_write+0 file=%di:u64 addr=%si:u64 nr=%dx:s32'
Failed to find path of kernel module.
Added new event:
  probe:dump_write     (on dump_write with file=%di:u64 addr=%si:u64 nr=%dx:s32)

You can now use it in all perf tools, such as:

        perf record -e probe:dump_write -aR sleep 1
# perf probe --list
  probe:dump_write     (on dump_write with file addr nr)

# ls /sys/kernel/debug/tracing/events/probe/
dump_write  enable  filter

# perf record -e probe:dump_write -aR sleep 5

# perf script 
...
     test_signal 25262 [005] 10874780.803504: probe:dump_write: (ffffffff811d472f) file=ffff881e95dbf100 addr=ffff880fcb8a7840 nr=64
     test_signal 25262 [005] 10874780.803522: probe:dump_write: (ffffffff811d472f) file=ffff881e95dbf100 addr=ffff880fcb8a7e00 nr=56
     test_signal 25262 [005] 10874780.803523: probe:dump_write: (ffffffff811d472f) file=ffff881e95dbf100 addr=ffff880fc949bc38 nr=56
     test_signal 25262 [005] 10874780.803525: probe:dump_write: (ffffffff811d472f) file=ffff881e95dbf100 addr=ffff880fc949bc38 nr=56
     test_signal 25262 [005] 10874780.803526: probe:dump_write: (ffffffff811d472f) file=ffff881e95dbf100 addr=ffff880fc949bc38 nr=56
     test_signal 25262 [005] 10874780.803527: probe:dump_write: (ffffffff811d472f) file=ffff881e95dbf100 addr=ffff880fc949bc38 nr=56
....

也可以使用perf-tools中的kprobe:

# ./kprobe 'p:dump_write nr=%dx:s32'
dump_write
Tracing kprobe dump_write. Ctrl-C to end.
     test_signal-31031 [005] d... 10875856.829642: dump_write: (dump_write+0x0/0x70) nr=64
     test_signal-31031 [005] d... 10875856.829658: dump_write: (dump_write+0x0/0x70) nr=56
     test_signal-31031 [005] d... 10875856.829660: dump_write: (dump_write+0x0/0x70) nr=56
     test_signal-31031 [005] d... 10875856.829660: dump_write: (dump_write+0x0/0x70) nr=56
     test_signal-31031 [005] d... 10875856.829661: dump_write: (dump_write+0x0/0x70) nr=56
...

uprobe

uprobes were added to the kernel in 3.5; they allow dynamic tracing of user-space processes.

# perf probe -x /lib64/libc.so.6 malloc
Added new event:
  probe_libc:malloc    (on 0x7a640)

You can now use it in all perf tools, such as:

        perf record -e probe_libc:malloc -aR sleep 1

# cat uprobe_events  
p:probe_libc/malloc /lib64/libc.so.6:0x000000000007a640
# perf probe --list
  probe_libc:malloc    (on 0x000000000007a640)

# perf record -e probe_libc:malloc -aR sleep 3 
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.092 MB perf.data (~4016 samples) ]


# perf script
...
           sleep  5924 [002] 256797.635705: probe_libc:malloc: (7fd7e092a640)
           sleep  5924 [002] 256797.635740: probe_libc:malloc: (7fd7e092a640)
           sleep  5924 [002] 256797.635752: probe_libc:malloc: (7fd7e092a640)

# perf probe --del probe_libc:malloc
Removed event: probe_libc:malloc

uprobes received many improvements in 3.14. Based on Brendan Gregg's experience, it is best to use uprobes on kernels 4.0 or later:

I was hoping to use uprobes on the Linux 3.13 kernels I’m now debugging, but have frequently hit issues where the target process either crashes or enters an endless spin loop. These bugs seem to have been fixed by Linux 4.0 (maybe earlier). For that reason, uprobe won’t run on kernels older than 4.0 (without -F to force). Maybe that’s pessimistic, and it should be 3.18 or something.

Reference

Introduction to the perf-tools 2016-12-23T18:00:30+00:00 hustcat http://hustcat.github.io/the-introduction-to-perf-tools ftrace is the tracing facility the kernel has provided since 2.6.27. With ftrace you can trace and analyze what happens inside the kernel without disturbing the running processes. It offers different tracers for different situations, such as tracing kernel function calls, tracing context switches, measuring how long interrupts stay disabled, and tracking latency and performance problems in kernel space.

perf-tools is a set of scripts written by Brendan Gregg on top of perf_events (aka perf) and ftrace. It wraps and simplifies the use of perf, ftrace, and kprobes; compared with driving the ftrace kernel interfaces directly, these mature scripts are more convenient and safer to use.

I have personally used perf-tools to track down and fix countless problems in production (familiarity with the kernel source still helps, of course). The following introduces a few commonly used tools.

funccount

funccount counts how many times kernel functions are called:

# ./funccount "schedule"
Tracing "schedule"... Ctrl-C to end.
^C
FUNC                              COUNT
schedule                          64261

Ending tracing...

funccount works by writing to /sys/kernel/debug/tracing/set_ftrace_filter:

# cat /sys/kernel/debug/tracing/set_ftrace_filter 
schedule

The files under trace_stat/ show, per CPU, how many times the function was executed:

# ls /sys/kernel/debug/tracing/trace_stat/
function0   function13  function18  function22  function27  function31  function36  function40  function45  function7
function1   function14  function19  function23  function28  function32  function37  function41  function46  function8
function10  function15  function2   function24  function29  function33  function38  function42  function47  function9
function11  function16  function20  function25  function3   function34  function39  function43  function5
function12  function17  function21  function26  function30  function35  function4   function44  function6

# cat /sys/kernel/debug/tracing/trace_stat/function0   
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  schedule                               821    495097294 us     603041.7 us     8175448486 us 

funcgraph

funcgraph traces the call graph beneath a given kernel function, for example:

# ./funcgraph "xfs_dir_open"
Tracing "xfs_dir_open"... Ctrl-C to end.
  2)               |  xfs_dir_open [xfs]() {
  2)   0.160 us    |    xfs_file_open [xfs]();
  2)               |    xfs_ilock_map_shared [xfs]() {
  2)               |      xfs_ilock [xfs]() {
  2)   0.164 us    |        down_read();
  2)   0.642 us    |      }
  2)   0.922 us    |    }
...

The script mainly sets the following tracing parameters:

# cat current_tracer 
function_graph
# cat set_graph_function 
xfs_dir_open [xfs]

In addition, -p restricts tracing to a given process ID and -m limits the graph depth.

-m maxdepth     # max stack depth to show
 -p PID          # trace when this pid is on-CPU
# ./funcgraph -m 3 "xfs_dir_open"  
Tracing "xfs_dir_open"... Ctrl-C to end.
 18)               |  xfs_dir_open [xfs]() {
 18)   0.103 us    |    xfs_file_open [xfs]();
 18)               |    xfs_ilock_map_shared [xfs]() {
 18)   0.264 us    |      xfs_ilock [xfs]();
 18)   0.609 us    |    }
 18)               |    xfs_dir3_data_readahead [xfs]() {
 18)   6.987 us    |      xfs_da_reada_buf [xfs]();
 18)   7.683 us    |    }
 18)               |    xfs_iunlock [xfs]() {
 18)   0.034 us    |      up_read();
 18)   0.446 us    |    }
 18) + 10.990 us   |  }
^C
Ending tracing...

These two options map to tracing/set_ftrace_pid and tracing/max_graph_depth respectively.

tpoint

tpoint - trace a given tracepoint. Static tracing

tpoint traces the kernel's static tracepoints. A tracepoint can print the values of related kernel variables, which is extremely useful.

All of the kernel's tracepoints live under /sys/kernel/debug/tracing/events. For example, there are two signal-related tracepoints:

# ls /sys/kernel/debug/tracing/events/signal/
enable  filter  signal_deliver  signal_generate

signal_generate traces signal sending. It is emitted from the kernel function __send_signal, which every signal-sending path ends up calling:

# ./tpoint -p 32309 signal:signal_generate
Tracing signal:signal_generate. Ctrl-C to end.
     test_signal-32309 [011] 51151417.094942: signal_generate: sig=27 errno=0 code=128 comm=test_signal pid=32309 grp=1 res=0
     test_signal-32309 [011] 51151418.097461: signal_generate: sig=27 errno=0 code=128 comm=test_signal pid=32309 grp=1 res=0
           <...>-32309 [011] 51151419.099980: signal_generate: sig=27 errno=0 code=128 comm=test_signal pid=32309 grp=1 res=0
           <...>-32309 [011] 51151419.642213: signal_generate: sig=17 errno=0 code=262146 comm=bash pid=26674 grp=1 res=0
^C
Ending tracing...

The -s option additionally shows the kernel stack trace (this is actually provided by ftrace):

-s              # show kernel stack traces

In general, a function may be called from several different parents, so the stack trace makes it easy to see which path produced the current call; combined with the values of the related variables, this helps us analyze the problem.

# ./tpoint -s -p 10213 signal:signal_generate     
Tracing signal:signal_generate. Ctrl-C to end.
     test_signal-10213 [020] 51151620.771581: signal_generate: sig=27 errno=0 code=128 comm=test_signal pid=10213 grp=1 res=0
     test_signal-10213 [020] 51151620.771582: <stack trace>
 => send_signal
 => __group_send_sig_info
 => check_cpu_itimer
 => run_posix_cpu_timers
 => update_process_times
 => tick_sched_timer
 => __run_hrtimer
 => hrtimer_interrupt

The main tracing parameters it sets:

# cat options/stacktrace 
1
# cat events/signal/signal_generate/enable 
1
# cat events/signal/signal_generate/filter 
common_pid == 10213

kprobe

kprobe - trace a given kprobe definition. Kernel dynamic tracing.

The kprobe script is built on the kernel's kprobes and is very powerful for dynamic tracing: whereas tpoint can only look at the kernel's static tracepoints, kprobe can probe essentially any kernel function.

For example, to look at the call stack and return value of xfs_dir_open:

# ./kprobe -s 'r:xfs_dir_open $retval' 
xfs_dir_open [xfs]
Tracing kprobe xfs_dir_open. Ctrl-C to end.
              ls-32222 [002] d... 1561178.430757: xfs_dir_open: (do_dentry_open+0x20e/0x290 <- xfs_dir_open) arg1=0
              ls-32222 [002] d... 1561178.430761: <stack trace>
 => vfs_open
 => do_last
 => path_openat
 => do_filp_open
 => do_sys_open
 => SyS_open
 => system_call_fastpath
^C
Ending tracing...

The kernel interface for dynamic tracing is kprobe_events:

# cat kprobe_events 
r:kprobes/xfs_dir_open xfs_dir_open arg1=$retval
# ls events/kprobes/
enable  filter  xfs_dir_open

The format for defining a kprobe event is as follows:

Synopsis of kprobe_events
-------------------------
  p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS]	: Set a probe
  r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]		: Set a return probe
  -:[GRP/]EVENT						: Clear a probe

 GRP		: Group name. If omitted, use "kprobes" for it.
 EVENT		: Event name. If omitted, the event name is generated
		  based on SYM+offs or MEMADDR.
 MOD		: Module name which has given SYM.
 SYM[+offs]	: Symbol+offset where the probe is inserted.
 MEMADDR	: Address where the probe is inserted.

 FETCHARGS	: Arguments. Each probe can have up to 128 args.
  %REG		: Fetch register REG
  @ADDR		: Fetch memory at ADDR (ADDR should be in kernel)
  @SYM[+|-offs]	: Fetch memory at SYM +|- offs (SYM should be a data symbol)
  $stackN	: Fetch Nth entry of stack (N >= 0)
  $stack	: Fetch stack address.
  $retval	: Fetch return value.(*)
  $comm		: Fetch current task comm.
  +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**)
  NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
  FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
		  (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
		  (x8/x16/x32/x64), "string" and bitfield are supported.

  (*) only for return probe.
  (**) this is useful for fetching a field of data structures.

See Kprobe-based Event Tracing. Defining events in this advanced format directly through the kernel interface /sys/kernel/debug/tracing/kprobe_events requires familiarity with the compiled kernel binary (registers and calling conventions) and is not very convenient; it is much easier to define more complex kprobe events with the perf probe command provided by the kernel's own perf tool.

perf probe

The kernel ships with the tools/perf utility, and we can use the perf probe command to add trace events dynamically.

First, identify the source code where the trace event should be defined. Suppose we want to trace the kernel function dump_write: perf probe --line dump_write shows the source code and marks which lines can take an event:

# perf probe --line dump_write 
<dump_write@/usr/src/kernels/kernel-tlinux2-3.10.102//fs/coredump.c:0>
      0  int dump_write(struct file *file, const void *addr, int nr)
      1  {
      2         return !dump_interrupted() &&
      3                 access_ok(VERIFY_READ, addr, nr) &&
      4                 file->f_op->write(file, addr, nr, &file->f_pos) == nr;
      5  }
         EXPORT_SYMBOL(dump_write);
         
         int dump_seek(struct file *file, loff_t off)

perf probe --vars shows which variables are accessible at a given line of the function:

# perf probe --vars dump_write:0                                            
Available variables at dump_write
        @<dump_write+0>
                (unknown_type)  addr
                int     nr
                struct file*    file

perf probe --add defines the trace event:

# perf probe --add 'dump_write:0 nr file->f_path.dentry->d_name.name:string'   
Added new event:
  probe:dump_write     (on dump_write with nr name=file->f_path.dentry->d_name.name:string)

You can now use it in all perf tools, such as:

        perf record -e probe:dump_write -aR sleep 1

# perf probe --list
  probe:dump_write     (on dump_write@fs/coredump.c with nr name)
# ls events/probe/  
dump_write  enable  filter

The format for defining a trace event is as follows:

(1) When defining by function name:

[EVENT=]FUNC[@FILE][:line-within-function|+offset|%return|;pattern]

(2) When defining by file name and line number:

[EVENT=]FILE[:line-within-file|;pattern]

Start tracing:

# echo probe:dump_write > set_event

# head trace -n 20
# tracer: nop
#
# entries-in-buffer/entries-written: 22249/526438   #P:48
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
           <...>-41773 [002] d...  7171.883693: dump_write: (dump_write+0x0/0x70) nr=4096 name="core_test_signal_1482493290.41773"
           <...>-41773 [002] d...  7171.883695: dump_write: (dump_write+0x0/0x70) nr=4096 name="core_test_signal_1482493290.41773"
           <...>-41773 [002] d...  7171.883698: dump_write: (dump_write+0x0/0x70) nr=4096 name="core_test_signal_1482493290.41773"
           <...>-41773 [002] d...  7171.883700: dump_write: (dump_write+0x0/0x70) nr=4096 name="core_test_signal_1482493290.41773"
           <...>-41773 [002] d...  7171.883703: dump_write: (dump_write+0x0/0x70) nr=4096 name="core_test_signal_1482493290.41773"

You can see the name of the file dump_write is writing, as well as the value of nr.

delete probe event:

# perf probe --list
  probe:dump_write     (on dump_write@fs/coredump.c with nr name)
# echo > set_event
# perf probe --del probe:dump_write
Removed event: probe:dump_write

Summary

Starting with 4.x, the kernel community has been moving toward BPF for dynamic kernel tracing, but it will take a while before 4.x kernels are widely deployed in production. On 3.x kernels, perf-tools together with the kernel's own perf remain the best choice for tracking down kernel problems.

Reference

Getting into the Linux ELF and core dump file 2016-12-21T17:00:30+00:00 hustcat http://hustcat.github.io/getting-into-core-dump-file Recently I ran into a case where a service produced an incomplete core dump file, which prompted me to dig a little deeper into the ELF format and core dump files.

ELF format

The basic structure of an ELF file is as follows:

Each ELF file consists mainly of the ELF header, the Program header table, the Section header table, and the program or control data pointed to by the program/section table entries.

An ELF file has two views, the Execution View and the Linking View:

ELF header

The ELF header sits at the very beginning of the file and describes the file's overall layout:

typedef struct elf64_hdr {
  unsigned char	e_ident[EI_NIDENT];	/* ELF "magic number" 16 bytes*/
  Elf64_Half e_type;
  Elf64_Half e_machine;
  Elf64_Word e_version;
  Elf64_Addr e_entry;		/* Entry point virtual address */
  Elf64_Off e_phoff;		/* Program header table file offset */
  Elf64_Off e_shoff;		/* Section header table file offset */
  Elf64_Word e_flags;
  Elf64_Half e_ehsize;      ///elf header size
  Elf64_Half e_phentsize;   ///program header entry size
  Elf64_Half e_phnum;       ///program header count
  Elf64_Half e_shentsize;   ///section header entry size
  Elf64_Half e_shnum;		///section header count
  Elf64_Half e_shstrndx;
} Elf64_Ehdr;

View an executable's ELF header:

# readelf -h test 
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x400470
  Start of program headers:          64 (bytes into file)
  Start of section headers:          2680 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         8
  Size of section headers:           64 (bytes)
  Number of section headers:         30
  Section header string table index: 27
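The same information can also be read programmatically; here is a small sketch using Go's debug/elf package (illustrative only, not part of the original post):

package main

import (
	"debug/elf"
	"fmt"
	"os"
)

func main() {
	// Print a few ELF header fields of the file given on the command
	// line, mirroring part of the readelf -h output above.
	f, err := elf.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	fmt.Println("Class:   ", f.Class)
	fmt.Println("Type:    ", f.Type)    // e.g. ET_EXEC
	fmt.Println("Machine: ", f.Machine) // e.g. EM_X86_64
	fmt.Printf("Entry:    0x%x\n", f.Entry)
	fmt.Println("Progs:   ", len(f.Progs))    // number of program headers
	fmt.Println("Sections:", len(f.Sections)) // number of section headers
}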
  • Program headers

Program headers are what the kernel uses to construct the process's memory image; they correspond to /proc/$PID/maps.

typedef struct elf64_phdr {
  Elf64_Word p_type;
  Elf64_Word p_flags;
  Elf64_Off p_offset;		/* Segment file offset */
  Elf64_Addr p_vaddr;		/* Segment virtual address */
  Elf64_Addr p_paddr;		/* Segment physical address */
  Elf64_Xword p_filesz;		/* Segment size in file */
  Elf64_Xword p_memsz;		/* Segment size in memory */
  Elf64_Xword p_align;		/* Segment alignment, file & memory */
} Elf64_Phdr;

View the executable's program headers:

# readelf -l test  

Elf file type is EXEC (Executable file)
Entry point 0x400470
There are 8 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001c0 0x00000000000001c0  R E    8
  INTERP         0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x000000000000074c 0x000000000000074c  R E    200000
  LOAD           0x0000000000000750 0x0000000000600750 0x0000000000600750
                 0x00000000000001fc 0x0000000000000210  RW     200000
  DYNAMIC        0x0000000000000778 0x0000000000600778 0x0000000000600778
                 0x0000000000000190 0x0000000000000190  RW     8
  NOTE           0x000000000000021c 0x000000000040021c 0x000000000040021c
                 0x0000000000000044 0x0000000000000044  R      4
  GNU_EH_FRAME   0x00000000000006a8 0x00000000004006a8 0x00000000004006a8
                 0x0000000000000024 0x0000000000000024  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     8

 Section to Segment mapping:
  Segment Sections...
   00     
   01     .interp 
   02     .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame 
   03     .ctors .dtors .jcr .dynamic .got .got.plt .data .bss 
   04     .dynamic 
   05     .note.ABI-tag .note.gnu.build-id 
   06     .eh_frame_hdr 
   07     
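A similar sketch (again illustrative, using Go's debug/elf) walks the program header table; the PT_LOAD entries are the segments the kernel maps into the process image:

package main

import (
	"debug/elf"
	"fmt"
	"os"
)

func main() {
	f, err := elf.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// Print each program header; compare the PT_LOAD entries with the
	// mappings shown in /proc/$PID/maps for the running process.
	for _, p := range f.Progs {
		fmt.Printf("%-12v off=%#x vaddr=%#x filesz=%#x memsz=%#x flags=%v\n",
			p.Type, p.Off, p.Vaddr, p.Filesz, p.Memsz, p.Flags)
	}
}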
  • Section headers

Sections hold the bulk of object file information for the linking view: instructions, data, symbol table, relocation information, and so on.

typedef struct elf64_shdr {
  Elf64_Word sh_name;		/* Section name, index in string tbl */
  Elf64_Word sh_type;		/* Type of section */
  Elf64_Xword sh_flags;		/* Miscellaneous section attributes */
  Elf64_Addr sh_addr;		/* Section virtual addr at execution */
  Elf64_Off sh_offset;		/* Section file offset */
  Elf64_Xword sh_size;		/* Size of section in bytes */
  Elf64_Word sh_link;		/* Index of another section */
  Elf64_Word sh_info;		/* Additional section information */
  Elf64_Xword sh_addralign;	/* Section alignment */
  Elf64_Xword sh_entsize;	/* Entry size if section holds table */
} Elf64_Shdr;

View the executable's section headers:

# readelf -S test  
There are 30 section headers, starting at offset 0xa78:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         0000000000400200  00000200
       000000000000001c  0000000000000000   A       0     0     1
  [ 2] .note.ABI-tag     NOTE             000000000040021c  0000021c
       0000000000000020  0000000000000000   A       0     0     4
  [ 3] .note.gnu.build-i NOTE             000000000040023c  0000023c
       0000000000000024  0000000000000000   A       0     0     4
  [ 4] .gnu.hash         GNU_HASH         0000000000400260  00000260
       000000000000001c  0000000000000000   A       5     0     8
  [ 5] .dynsym           DYNSYM           0000000000400280  00000280
       0000000000000090  0000000000000018   A       6     1     8
  [ 6] .dynstr           STRTAB           0000000000400310  00000310
       000000000000004e  0000000000000000   A       0     0     1
  [ 7] .gnu.version      VERSYM           000000000040035e  0000035e
       000000000000000c  0000000000000002   A       5     0     2
  [ 8] .gnu.version_r    VERNEED          0000000000400370  00000370
       0000000000000020  0000000000000000   A       6     1     8
  [ 9] .rela.dyn         RELA             0000000000400390  00000390
       0000000000000018  0000000000000018   A       5     0     8
  [10] .rela.plt         RELA             00000000004003a8  000003a8
       0000000000000060  0000000000000018   A       5    12     8
  [11] .init             PROGBITS         0000000000400408  00000408
       0000000000000018  0000000000000000  AX       0     0     4
  [12] .plt              PROGBITS         0000000000400420  00000420
       0000000000000050  0000000000000010  AX       0     0     4
  [13] .text             PROGBITS         0000000000400470  00000470
       0000000000000218  0000000000000000  AX       0     0     16
  [14] .fini             PROGBITS         0000000000400688  00000688
       000000000000000e  0000000000000000  AX       0     0     4
  [15] .rodata           PROGBITS         0000000000400698  00000698
       0000000000000010  0000000000000000   A       0     0     8
  [16] .eh_frame_hdr     PROGBITS         00000000004006a8  000006a8
       0000000000000024  0000000000000000   A       0     0     4
  [17] .eh_frame         PROGBITS         00000000004006d0  000006d0
       000000000000007c  0000000000000000   A       0     0     8
  [18] .ctors            PROGBITS         0000000000600750  00000750
       0000000000000010  0000000000000000  WA       0     0     8
  [19] .dtors            PROGBITS         0000000000600760  00000760
       0000000000000010  0000000000000000  WA       0     0     8
  [20] .jcr              PROGBITS         0000000000600770  00000770
       0000000000000008  0000000000000000  WA       0     0     8
  [21] .dynamic          DYNAMIC          0000000000600778  00000778
       0000000000000190  0000000000000010  WA       6     0     8
  [22] .got              PROGBITS         0000000000600908  00000908
       0000000000000008  0000000000000008  WA       0     0     8
  [23] .got.plt          PROGBITS         0000000000600910  00000910
       0000000000000038  0000000000000008  WA       0     0     8
  [24] .data             PROGBITS         0000000000600948  00000948
       0000000000000004  0000000000000000  WA       0     0     4
  [25] .bss              NOBITS           0000000000600950  0000094c
       0000000000000010  0000000000000000  WA       0     0     8
  [26] .comment          PROGBITS         0000000000000000  0000094c
       000000000000002c  0000000000000001  MS       0     0     1
  [27] .shstrtab         STRTAB           0000000000000000  00000978
       00000000000000fe  0000000000000000           0     0     1
  [28] .symtab           SYMTAB           0000000000000000  000011f8
       0000000000000630  0000000000000018          29    46     8
  [29] .strtab           STRTAB           0000000000000000  00001828
       000000000000021b  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

Core dump file

When a process receives a signal such as SIGABRT, it performs a core dump:

int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
			  struct pt_regs *regs, void *cookie)
{
///...
		if (sig_kernel_coredump(signr)) {
			if (print_fatal_signals)
				print_fatal_signal(info->si_signo);
			proc_coredump_connector(current);
			/*
			 * If it was able to dump core, this kills all
			 * other threads in the group and synchronizes with
			 * their demise.  If we lost the race with another
			 * thread getting here, it set group_exit_code
			 * first and our do_group_exit call below will use
			 * that value and ignore the one we pass it.
			 */
			do_coredump(info);
		}

For ELF executables, this eventually calls elf_core_dump (https://bitbucket.org/hustcat/kernel-3.10.83/src/cf765cbd7202f226f5e6c1945dbf4fcea3bd6853/fs/binfmt_elf.c?at=master&fileviewer=file-view-default#binfmt_elf.c-2054).

An ELF core dump file consists mainly of four parts, in the order they are written:

(1) ELF header

	size += sizeof(*elf);
	if (size > cprm->limit || !dump_write(cprm->file, elf, sizeof(*elf)))
		goto end_coredump;

(2) Program headers

	size += sizeof(*phdr4note); ///hdr note
	if (size > cprm->limit
	    || !dump_write(cprm->file, phdr4note, sizeof(*phdr4note)))
		goto end_coredump;

	/* Write program headers for segments dump */
	for (vma = first_vma(current, gate_vma); vma != NULL;
			vma = next_vma(vma, gate_vma)) {
		struct elf_phdr phdr;

		phdr.p_type = PT_LOAD;
		phdr.p_offset = offset;
		phdr.p_vaddr = vma->vm_start;
		phdr.p_paddr = 0;
		phdr.p_filesz = vma_dump_size(vma, cprm->mm_flags);
		phdr.p_memsz = vma->vm_end - vma->vm_start;
		offset += phdr.p_filesz;
		phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
		if (vma->vm_flags & VM_WRITE)
			phdr.p_flags |= PF_W;
		if (vma->vm_flags & VM_EXEC)
			phdr.p_flags |= PF_X;
		phdr.p_align = ELF_EXEC_PAGESIZE;

		size += sizeof(phdr);
		if (size > cprm->limit
		    || !dump_write(cprm->file, &phdr, sizeof(phdr)))
			goto end_coredump;
	}

(3) Process information, including signals, registers, task_struct data, and so on.

 	/* write out the notes section */
	if (!write_note_info(&info, cprm->file, &foffset))
		goto end_coredump;

(4) Memory data

	/* Align to page */
	if (!dump_seek(cprm->file, dataoff - foffset))
		goto end_coredump;

	for (vma = first_vma(current, gate_vma); vma != NULL;
			vma = next_vma(vma, gate_vma)) {
		unsigned long addr;
		unsigned long end;

		end = vma->vm_start + vma_dump_size(vma, cprm->mm_flags);

		for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) {
			struct page *page;
			int stop;

			page = get_dump_page(addr);
			if (page) {
				void *kaddr = kmap(page);
				stop = ((size += PAGE_SIZE) > cprm->limit) ||
					!dump_write(cprm->file, kaddr,
						    PAGE_SIZE);
				kunmap(page);
				page_cache_release(page);
			} else
				stop = !dump_seek(cprm->file, PAGE_SIZE);
			if (stop)
				goto end_coredump;
		}
	}

You can use readelf to inspect a core dump file as well. While the kernel is writing the core dump, it may be interrupted by a signal, which leaves the core dump incomplete (truncated).

/*
 * Core dumping helper functions.  These are the only things you should
 * do on a core-file: use only these functions to write out all the
 * necessary info.
 */
int dump_write(struct file *file, const void *addr, int nr)
{
	return !dump_interrupted() && ///stop dumping if a signal has been received
		access_ok(VERIFY_READ, addr, nr) &&
		file->f_op->write(file, addr, nr, &file->f_pos) == nr;
}
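One practical consequence is that a truncated core no longer contains all of the data its program headers describe. A small sketch (not from the original post) that uses Go's debug/elf to detect this by comparing each segment's file extent with the actual file size:

package main

import (
	"debug/elf"
	"fmt"
	"os"
)

func main() {
	path := os.Args[1] // path to the core file

	fi, err := os.Stat(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	f, err := elf.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	truncated := false
	for _, p := range f.Progs {
		// Each program header claims Filesz bytes starting at Off; in a
		// truncated core the file ends before some of those bytes.
		if need := int64(p.Off + p.Filesz); need > fi.Size() {
			fmt.Printf("%v at vaddr %#x: needs %d bytes in file, file has only %d\n",
				p.Type, p.Vaddr, need, fi.Size())
			truncated = true
		}
	}
	if !truncated {
		fmt.Println("all segments are fully present in the file")
	}
}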

Trace signal send and deliver

The kernel has two signal-related tracepoints that can be used to trace how signals are sent and delivered:

# ./tpoint -l | grep signal
signal:signal_deliver
signal:signal_generate
  • Trace signal deliver
# ps -ef|grep test_signal
root     32309 26674 89 10:42 ?        00:00:12 ./test_signal
root     32621  1418  0 10:43 pts/1    00:00:00 grep test_signal

# ./tpoint -p 32309 signal:signal_deliver                   
Tracing signal:signal_deliver. Ctrl-C to end.
     test_signal-32309 [019] 51151109.432816: signal_deliver: sig=27 errno=0 code=128 sa_handler=400634 sa_flags=14000004
           <...>-32309 [019] 51151110.433334: signal_deliver: sig=27 errno=0 code=128 sa_handler=400634 sa_flags=14000004
           <...>-32309 [019] 51151111.434852: signal_deliver: sig=27 errno=0 code=128 sa_handler=400634 sa_flags=14000004
^C
Ending tracing...

Test program:
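The original test program is not included in this post; as a stand-in, the following minimal sketch (hypothetical, not the author's test_signal, which judging from the traces drives SIGPROF through an ITIMER_PROF interval timer) periodically sends itself a signal, which is enough to exercise the signal_generate and signal_deliver tracepoints:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// The Go runtime normally consumes SIGPROF for its own profiler, so
	// this stand-in uses SIGUSR1 instead. Any signal that is sent to and
	// handled by the process makes both tracepoints fire for this PID.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)

	go func() {
		for range time.Tick(time.Second) {
			syscall.Kill(os.Getpid(), syscall.SIGUSR1) // send to ourselves
		}
	}()

	fmt.Println("pid:", os.Getpid())
	for s := range sigs {
		fmt.Println("got signal:", s)
	}
}

Run it, note the printed PID, and point tpoint at that PID as in the sessions above.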

  • Trace signal send

All signal sending in the kernel eventually goes through __send_signal:

static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
			int group, int from_ancestor_ns)
{
///...
ret:
	trace_signal_generate(sig, info, t, group, result);
	return ret;
}
# ./tpoint -p 32309 signal:signal_generate
Tracing signal:signal_generate. Ctrl-C to end.
     test_signal-32309 [011] 51151417.094942: signal_generate: sig=27 errno=0 code=128 comm=test_signal pid=32309 grp=1 res=0
     test_signal-32309 [011] 51151418.097461: signal_generate: sig=27 errno=0 code=128 comm=test_signal pid=32309 grp=1 res=0
           <...>-32309 [011] 51151419.099980: signal_generate: sig=27 errno=0 code=128 comm=test_signal pid=32309 grp=1 res=0
           <...>-32309 [011] 51151419.642213: signal_generate: sig=17 errno=0 code=262146 comm=bash pid=26674 grp=1 res=0
^C
Ending tracing...

As the output shows, the SIGPROF (27) signals are sent by the process to itself; in addition, when the process exits, it sends SIGCHLD (17) to its parent.

Interval timer

setitimer sets up interval timers. For the ITIMER_PROF timer, the kernel sends the process a SIGPROF signal when the timer expires.

Early Linux distinguished two kinds of timers: timers the kernel itself needs, also called dynamic timers, and timers requested from user space via setitimer, also called interval timers. POSIX timers were added in 2.5.63. 2.6.16 introduced the high-resolution hrtimer, and all other timers are now built on top of hrtimers.

Data structure:

struct signal_struct {
///...
	/*
	 * ITIMER_PROF and ITIMER_VIRTUAL timers for the process, we use
	 * CPUCLOCK_PROF and CPUCLOCK_VIRT for indexing array as these
	 * values are defined to 0 and 1 respectively
	 */
	struct cpu_itimer it[2]; ///interval timers per-process, see set_cpu_itimer

Kernel call stack:

# ./tpoint -s -p 10213 signal:signal_generate     
Tracing signal:signal_generate. Ctrl-C to end.
     test_signal-10213 [020] 51151620.771581: signal_generate: sig=27 errno=0 code=128 comm=test_signal pid=10213 grp=1 res=0
     test_signal-10213 [020] 51151620.771582: <stack trace>
 => send_signal
 => __group_send_sig_info
 => check_cpu_itimer
 => run_posix_cpu_timers
 => update_process_times
 => tick_sched_timer
 => __run_hrtimer
 => hrtimer_interrupt

Reference
