I've been investigating why qmail has been looping some messages through the same server. Even though the MX records are configured correctly, the headers show that a message is being sent back to the same server repeatedly until it reaches the hop limit.
Example
Here's an example bounce message:
78.158.64.13 failed after I sent the message.
Remote host said: 554 too many hops, this message is looping (#5.4.6)

--- Below this line is a copy of the message.

Return-Path: <ELIDED@ELIDED.com>
Received: (qmail 3552 invoked from network); 10 Jun 2009 21:06:14 +0100
Received: by simscan 1.4.0 ppid: 3549, pid: 3550, t: 0.0480s
  scanners: clamav: 0.95.1/m:51/d:9451
Received: from sec2.mail.sargasso.net (78.158.64.13)
  by sec2.mail.sargasso.net with SMTP; 10 Jun 2009 21:06:14 +0100
Received: (qmail 3546 invoked from network); 10 Jun 2009 21:06:14 +0100
Received: by simscan 1.4.0 ppid: 3543, pid: 3544, t: 0.0443s
  scanners: clamav: 0.95.1/m:51/d:9451
Received: from sec2.mail.sargasso.net (78.158.64.13)
  by sec2.mail.sargasso.net with SMTP; 10 Jun 2009 21:06:14 +0100
Received: (qmail 3540 invoked from network); 10 Jun 2009 21:06:13 +0100
Received: by simscan 1.4.0 ppid: 3537, pid: 3538, t: 0.0437s
  scanners: clamav: 0.95.1/m:51/d:9451
Received: from sec2.mail.sargasso.net (78.158.64.13)
  by sec2.mail.sargasso.net with SMTP; 10 Jun 2009 21:06:13 +0100
Received: (qmail 3534 invoked from network); 10 Jun 2009 21:06:13 +0100
etc. etc.
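Each pass through the server adds the same group of three Received lines to the message, and it is the growing pile of these lines that eventually triggers the "too many hops" rejection. As a rough illustration of that kind of hop counting, here is a minimal sketch. It is not qmail's code: count_received() and is_looping() are hypothetical names, and the limit of 100 is assumed to match the default hop count.

```c
#include <stddef.h>
#include <strings.h>

#define MAX_HOPS 100  /* assumed default hop limit */

/* Count header lines that begin with "Received:" (case-insensitive).
   Folded continuation lines start with whitespace and are not counted. */
int count_received(const char *headers)
{
  int hops = 0;
  const char *line = headers;

  while (*line) {
    if (strncasecmp(line, "Received:", 9) == 0) ++hops;
    /* skip to the start of the next line */
    while (*line && *line != '\n') ++line;
    if (*line == '\n') ++line;
  }
  return hops;
}

/* A message is treated as looping once it has reached the hop limit. */
int is_looping(const char *headers)
{
  return count_received(headers) >= MAX_HOPS;
}
```

With three Received lines added per pass, as in the bounce above, a counter like this would reach 100 after a few dozen passes.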
Analysis
An MTA that is itself listed as an MX should never attempt delivery to an MX with the same or a higher preference value than its own, period, let alone try to deliver remotely to itself. As the MX records are correctly set up, I looked into qmail-remote and found that it does try to exclude its own IP addresses by testing each MX record with ipme_is(). Clearly, though, either the test was not working or the list of local addresses was incomplete. The next step was to add a quick debug main() function to ipme.c to find out what qmail thought the server's addresses were:
#include <stdio.h>

/* Temporary debug entry point: print every address ipme_init() collects. */
int main(int argc, char **argv)
{
  int i;
  unsigned char *b;

  ipme_init();
  for (i = 0;i < ipme.len;++i) {
    b = (unsigned char *) &ipme.ix[i].ip;
    printf("%d.%d.%d.%d\n", b[0], b[1], b[2], b[3]);
  }
  return 0;
}
$ ./ipme
0.0.0.0
127.0.0.1
78.158.64.6
78.158.64.6
78.158.64.6
The primary address is listed three times, and the two other addresses configured on the interface are missing. Compare with what is actually set:
[irrelevant lines elided]
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    inet 78.158.64.6/28 brd 78.158.64.15 scope global eth0
    inet 78.158.64.13/32 scope global eth0
    inet 78.158.64.14/32 scope global eth0
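The problem, then, is in ipme_init(). It appears that qmail does not correctly handle IP addresses added as aliases using "ip addr add", which is how this new server was set up, although it handles old-format "sub-interfaces" such as "eth0:2" just fine. Addresses added with "ip addr add" all share the interface name eth0, so looking up the address by interface name returns the primary address every time.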
Impact
Because qmail correctly tries the MX with the lowest preference value first, this problem only appears when the primary MX is unreachable. The message then loops rapidly through the same server until it reaches the maximum hop count (100 by default) and bounces back to the sender.
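To make the failure mode concrete, here is a simplified sketch of the usual MX filtering rule: find the lowest preference value among the MX entries that point at the local host, then keep only entries with a strictly lower value. This is an illustration, not the code from qmail-remote. struct mx_entry, is_local() and filter_mx() are hypothetical names, and in the usage note below the address 192.0.2.1 and the preference values 10 and 20 are made up for the example.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

struct mx_entry {
  unsigned char ip[4];
  unsigned int pref;   /* MX preference value: lower is tried first */
};

/* Return 1 if ip appears in the list of local addresses. */
static int is_local(const unsigned char *ip,
                    const unsigned char (*local)[4], size_t nlocal)
{
  size_t i;
  for (i = 0; i < nlocal; ++i)
    if (memcmp(ip, local[i], 4) == 0) return 1;
  return 0;
}

/* Drop every MX whose preference value is equal to or higher than
   the lowest preference value of any entry that is this host.
   Compacts mx in place and returns the number of entries kept. */
size_t filter_mx(struct mx_entry *mx, size_t n,
                 const unsigned char (*local)[4], size_t nlocal)
{
  unsigned int prefme = UINT_MAX;
  size_t i, kept = 0;

  for (i = 0; i < n; ++i)
    if (is_local(mx[i].ip, local, nlocal) && mx[i].pref < prefme)
      prefme = mx[i].pref;

  for (i = 0; i < n; ++i)
    if (mx[i].pref < prefme)
      mx[kept++] = mx[i];

  return kept;
}
```

Given a primary MX at 192.0.2.1 (preference 10) and this server at 78.158.64.13 (preference 20), a local list containing only 127.0.0.1 and 78.158.64.6 keeps both entries, so the server remains a fallback target for itself. With 78.158.64.13 in the local list, only the primary survives.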
Solution
ipme_init() needs to be patched to use the address returned by the original SIOCGIFCONF call instead of looking up the primary address for the interface using SIOCGIFADDR. Linux has no sa_len, so this is the second case (else) of #ifdef HASSALEN. It also needs to copy the address before reusing the structure for the SIOCGIFFLAGS call. This ends up with the code looking virtually identical to the #ifdef HASSALEN case, so the simplest patch is:
$ diff -U 4 ipme.c.original ipme.c
--- ipme.c.original 2009-06-13 05:26:25.000000000 +0100
+++ ipme.c 2009-06-13 05:28:34.000000000 +0100
@@ -73,26 +73,18 @@
 #ifdef HASSALEN
     len = sizeof(ifr->ifr_name) + ifr->ifr_addr.sa_len;
     if (len < sizeof(*ifr))
       len = sizeof(*ifr);
+#else
+    len = sizeof(*ifr);
+#endif
     if (ifr->ifr_addr.sa_family == AF_INET) {
       sin = (struct sockaddr_in *) &ifr->ifr_addr;
       byte_copy(&ix.ip,4,&sin->sin_addr);
       if (ioctl(s,SIOCGIFFLAGS,x) == 0)
         if (ifr->ifr_flags & IFF_UP)
           if (!ipalloc_append(&ipme,&ix)) { close(s); return 0; }
     }
-#else
-    len = sizeof(*ifr);
-    if (ioctl(s,SIOCGIFFLAGS,x) == 0)
-      if (ifr->ifr_flags & IFF_UP)
-        if (ioctl(s,SIOCGIFADDR,x) == 0)
-          if (ifr->ifr_addr.sa_family == AF_INET) {
-            sin = (struct sockaddr_in *) &ifr->ifr_addr;
-            byte_copy(&ix.ip,4,&sin->sin_addr);
-            if (!ipalloc_append(&ipme,&ix)) { close(s); return 0; }
-          }
-#endif
     x += len;
   }
   close(s);
   ipmeok = 1;
Result: success!
$ ./ipme
0.0.0.0
127.0.0.1
78.158.64.6
78.158.64.13
78.158.64.14