Friday, October 17, 2008

What do IPsec and Larvae Have in Common?

This is a bit of a digression from the usual topic of this blog, but I found the problem interesting enough to mention.

I have a Linux server at home, on which I tend to keep the command-line rtorrent client running in a detached screen session. I recently noticed that my rtorrent started hanging randomly -- the process would be in the sleeping state with no noticeable CPU usage, but it was entirely unresponsive to key presses. All other processes on the system were unaffected.

The first thing I did was a Google search for "rtorrent hang". Sure enough, someone had had this problem before and reported that rtorrent was hanging on the "madvise" system call. The comment claimed that it was a kernel bug, which wasn't entirely unrealistic, since madvise seems to be a fairly esoteric, rarely used, and therefore rarely tested system call. The post pointed to a newer version of rtorrent but, to my disappointment, even the latest version from SVN didn't solve the problem. My rtorrent was still hanging. On top of that, HTTP requests seemed to be broken in the newer version because of what I can only assume is some incompatibility with the relatively old version of libcurl on my machine.

I recompiled the original stable version of rtorrent, and thankfully HTTP torrent fetches and tracker requests worked again, but the hang was still there. I decided to investigate based on a clue left by the madvise post: use strace to see what syscall rtorrent was blocked in. This is what I found:
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 579
fcntl64(579, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(579, SOL_IP, IP_TOS, [8], 4) = 0
connect(579, {sa_family=AF_INET, sin_port=htons(37320), sin_addr=inet_addr("189.51.247.163")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(3, EPOLL_CTL_ADD, 579, {EPOLLOUT, {u32=139804600, u64=139804600}}) = 0
epoll_ctl(3, EPOLL_CTL_MOD, 579, {EPOLLOUT|EPOLLERR, {u32=139804600, u64=139804600}}) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 580
fcntl64(580, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(580, SOL_IP, IP_TOS, [8], 4) = 0
connect(580, {sa_family=AF_INET, sin_port=htons(57524), sin_addr=inet_addr("192.168.0.10")}, 16

"How was this possible??" I asked myself. You can clearly see that socket 579 was put in nonblocking more and connect immediately returned "EINPROGRESS", as it should. But socket 580, which was doing virtually the same thing, was blocking in "connect." How was that possible? What was the difference?

It turns out the difference was in the destination IP address. I had almost forgotten that, prior to this hanging problem, I had set up an IPsec tunnel to my friend's network. His network is on 192.168.0.0/24, so when rtorrent tried to connect to 192.168.0.10, my kernel was actually trying to establish an SA (security association, i.e. the IPsec tunnel) to his network. A search for "ipsec blocking socket" quickly revealed that this behaviour is documented and configurable (http://lkml.org/lkml/2007/12/4/260). In fact, by simply running "echo 1 > /proc/sys/net/core/xfrm_larval_drop", I solved the mystery of the hanging rtorrent. With this setting enabled, the kernel simply drops any packets associated with an IPsec tunnel that is in the process of being established (a "larval" SA, in kernel terms) -- instead of blocking the calling process until that tunnel is fully established.
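
If you would rather check or change the setting from a script than with echo, something along these lines should work. It is a minimal sketch: the function names are made up for this example, the optional path argument exists only so the helpers can be tried against an ordinary file, and writing to the real /proc entry requires root.

```python
from pathlib import Path

# The sysctl file mentioned above.
LARVAL_DROP = Path("/proc/sys/net/core/xfrm_larval_drop")


def read_sysctl_int(path=LARVAL_DROP):
    """Return the integer currently stored in the sysctl file."""
    return int(Path(path).read_text().strip())


def write_sysctl_int(value, path=LARVAL_DROP):
    """Write an integer to the sysctl file (needs root for the real /proc entry)."""
    Path(path).write_text(f"{int(value)}\n")
```

Calling read_sysctl_int() before and after a change is a cheap way to confirm that the write actually took effect. Note that a value written this way does not survive a reboot.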

This solution, of course, does have its downsides: when setting up or debugging an IPsec tunnel, you end up seeing packet loss without really knowing the reason. Temporarily turning xfrm_larval_drop off is probably a good idea while tweaking your IPsec configuration!

See you next time.

-Cat
