Qmail looping messages around the same server

Submitted by davidc on Sat, 13/06/2009 - 05:43

I've been investigating why qmail has been looping some messages through the same server. Even though the MX records are configured correctly, the headers show each message being sent back to the same server repeatedly until it reaches the hop limit.

Example

Here's an example bounce message:

78.158.64.13 failed after I sent the message.
Remote host said: 554 too many hops, this message is looping (#5.4.6)
 
--- Below this line is a copy of the message.
 
Return-Path: <ELIDED@ELIDED.com>
Received: (qmail 3552 invoked from network); 10 Jun 2009 21:06:14 +0100
Received: by simscan 1.4.0 ppid: 3549, pid: 3550, t: 0.0480s
         scanners: clamav: 0.95.1/m:51/d:9451
Received: from sec2.mail.sargasso.net (78.158.64.13)
  by sec2.mail.sargasso.net with SMTP; 10 Jun 2009 21:06:14 +0100
Received: (qmail 3546 invoked from network); 10 Jun 2009 21:06:14 +0100
Received: by simscan 1.4.0 ppid: 3543, pid: 3544, t: 0.0443s
         scanners: clamav: 0.95.1/m:51/d:9451
Received: from sec2.mail.sargasso.net (78.158.64.13)
  by sec2.mail.sargasso.net with SMTP; 10 Jun 2009 21:06:14 +0100
Received: (qmail 3540 invoked from network); 10 Jun 2009 21:06:13 +0100
Received: by simscan 1.4.0 ppid: 3537, pid: 3538, t: 0.0437s
         scanners: clamav: 0.95.1/m:51/d:9451
Received: from sec2.mail.sargasso.net (78.158.64.13)
  by sec2.mail.sargasso.net with SMTP; 10 Jun 2009 21:06:13 +0100
Received: (qmail 3534 invoked from network); 10 Jun 2009 21:06:13 +0100
etc. etc.

Analysis

An MTA should never attempt delivery to an MX whose preference value is the same as or higher than its own (that is, one no more preferred than itself), period, let alone try to deliver remotely to itself. As the MX records are correctly set up, I looked into qmail-remote and found that it does try to exclude its own IP addresses by testing each MX address with ipme_is(). Clearly, though, either that test was not working or the address list was incomplete. The next step was to add a quick debug main() function to ipme.c to find out what qmail thought the server's addresses were:

#include <stdio.h>  /* for printf(); only needed by this debug main() */

/* Debug only: print every address that ipme_init() collected. */
int main(int argc, char **argv)
{
  int i;
  unsigned char *b;
  ipme_init();
  for (i = 0;i < ipme.len;++i) {
    b = (unsigned char *) &ipme.ix[i].ip;  /* the four address bytes */
    printf("%d.%d.%d.%d\n",b[0],b[1],b[2],b[3]);
  }
  return 0;
}

$ ./ipme
0.0.0.0
127.0.0.1
78.158.64.6
78.158.64.6
78.158.64.6

The same address repeated three times, and the two other addresses configured on this server missing entirely? Compare that with what the kernel reports:

[irrelevant lines elided]
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    inet 78.158.64.6/28 brd 78.158.64.15 scope global eth0
    inet 78.158.64.13/32 scope global eth0
    inet 78.158.64.14/32 scope global eth0

The problem, then, is in ipme_init(). It appears that qmail does not correctly handle IP addresses added as aliases using "ip addr add", which is how this new server was set up, although it handles old-format "sub-interfaces" such as "eth0:2" just fine.

Impact

As qmail will correctly try the most preferred MX (the one with the lowest preference value) first, this problem only appears when the primary MX is unreachable. The message then loops around the same server in rapid succession until it reaches the maximum hop count (100 by default), and bounces back to the sender.
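
To make the failure mode concrete, here is a minimal sketch of the MX rule from the Analysis section. It is not qmail-remote's actual code: the struct mx_target type and the keep_better_than_self() function are hypothetical names invented for this illustration, and the is_local flag stands in for whatever ipme_is() would report for that address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for this sketch, not a qmail structure. */
struct mx_target {
  unsigned int pref;  /* MX preference value: lower is tried first */
  int is_local;       /* nonzero if this address belongs to this host */
};

/* Compact mx[] in place so that only targets strictly more preferred
   than this host remain. Returns the number of targets kept. */
size_t keep_better_than_self(struct mx_target *mx, size_t n)
{
  unsigned int own_pref = 65536;  /* above any 16-bit MX preference */
  size_t i, kept = 0;

  assert(mx != NULL || n == 0);

  /* Find the best (lowest) preference value that points at this host. */
  for (i = 0; i < n; ++i)
    if (mx[i].is_local && mx[i].pref < own_pref)
      own_pref = mx[i].pref;

  /* Keep only targets that are strictly better than this host. */
  for (i = 0; i < n; ++i)
    if (mx[i].pref < own_pref)
      mx[kept++] = mx[i];

  return kept;
}
```

If the host's own address is missing from the local address list, is_local is never set for it, own_pref stays at its initial value, and nothing is filtered out. The host's own MX entry then remains a valid delivery target, which matches the loop seen in the bounce message.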

Solution

ipme_init() needs to be patched to use the address returned by the original SIOCGIFCONF call instead of looking up the primary address for the interface using SIOCGIFADDR. Linux has no sa_len, so the affected code is the second (#else) branch of #ifdef HASSALEN. The patched code also needs to copy the address before reusing the structure for the SIOCGIFFLAGS call. The result is virtually identical to the #ifdef HASSALEN branch, so the simplest patch is to share that code between both branches:

$ diff -U 4 ipme.c.original ipme.c
--- ipme.c.original     2009-06-13 05:26:25.000000000 +0100
+++ ipme.c      2009-06-13 05:28:34.000000000 +0100
@@ -73,26 +73,18 @@
 #ifdef HASSALEN
     len = sizeof(ifr->ifr_name) + ifr->ifr_addr.sa_len;
     if (len < sizeof(*ifr))
       len = sizeof(*ifr);
+#else
+    len = sizeof(*ifr);
+#endif
     if (ifr->ifr_addr.sa_family == AF_INET) {
       sin = (struct sockaddr_in *) &ifr->ifr_addr;
       byte_copy(&ix.ip,4,&sin->sin_addr);
       if (ioctl(s,SIOCGIFFLAGS,x) == 0)
         if (ifr->ifr_flags & IFF_UP)
           if (!ipalloc_append(&ipme,&ix)) { close(s); return 0; }
     }
-#else
-    len = sizeof(*ifr);
-    if (ioctl(s,SIOCGIFFLAGS,x) == 0)
-      if (ifr->ifr_flags & IFF_UP)
-        if (ioctl(s,SIOCGIFADDR,x) == 0)
-         if (ifr->ifr_addr.sa_family == AF_INET) {
-           sin = (struct sockaddr_in *) &ifr->ifr_addr;
-           byte_copy(&ix.ip,4,&sin->sin_addr);
-           if (!ipalloc_append(&ipme,&ix)) { close(s); return 0; }
-         }
-#endif
     x += len;
   }
   close(s);
   ipmeok = 1;

Result: success!

$ ./ipme
0.0.0.0
127.0.0.1
78.158.64.6
78.158.64.13
78.158.64.14
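
For anyone who wants to check the same behaviour outside qmail, the following is a rough, Linux-only sketch of the approach the patch takes: read each address straight out of the SIOCGIFCONF result rather than asking SIOCGIFADDR for it by interface name. The function names collect_ipv4() and local_ipv4() are hypothetical, the 64-entry buffer is an arbitrary limit chosen for this sketch, and unlike ipme_init() it does not check IFF_UP.

```c
#define _DEFAULT_SOURCE  /* expose struct ifreq under strict -std= modes */
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <netinet/in.h>

/* Copy the IPv4 address carried by each SIOCGIFCONF entry into out[].
   Entries that share an interface name (for example three "eth0"
   entries) each keep their own address, because SIOCGIFADDR is never
   called. Returns the number of addresses stored. */
int collect_ipv4(const struct ifreq *reqs, int nreqs,
                 struct in_addr *out, int max)
{
  int i, n = 0;

  assert(reqs != NULL && out != NULL);
  for (i = 0; i < nreqs && n < max; ++i) {
    struct sockaddr_in sin;
    if (reqs[i].ifr_addr.sa_family != AF_INET) continue;
    memcpy(&sin, &reqs[i].ifr_addr, sizeof sin);
    out[n++] = sin.sin_addr;
  }
  return n;
}

/* Ask the kernel for its address list and collect the IPv4 addresses.
   Assumes Linux, where the entries are fixed-size struct ifreq records
   (no sa_len). Returns -1 on error; a host with more than 64 entries
   would be silently truncated. */
int local_ipv4(struct in_addr *out, int max)
{
  struct ifreq reqs[64];
  struct ifconf ifc;
  int s, n;

  s = socket(AF_INET, SOCK_DGRAM, 0);
  if (s == -1) return -1;
  ifc.ifc_len = sizeof reqs;
  ifc.ifc_req = reqs;
  if (ioctl(s, SIOCGIFCONF, &ifc) == -1) { close(s); return -1; }
  close(s);
  n = ifc.ifc_len / (int) sizeof(struct ifreq);
  return collect_ipv4(reqs, n, out, max);
}
```

Calling local_ipv4() and printing the results should list each configured IPv4 address once. Keeping the buffer walk in collect_ipv4() separate from the ioctl makes it possible to exercise the walk with hand-built entries.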