Wednesday, January 23, 2013

Adventures in Ethernet

It's always rough when you follow the directions and something doesn't turn out, even more so when you are familiar with what you are trying to do. While setting up a new server on December 14, 2012, I believe I discovered a bug somewhere in Debian's Ethernet configuration (or in the ifenslave-2.6 package). Once I finish writing this post, I am off to find where to submit an issue, to see if I can save someone else from similar madness.

I was setting up a new install (Debian 6.0.6 i686) at work and was struggling to get Ethernet bonding working. I've done it in the past, and newer versions of Debian have made it easier than ever to configure, so I was really stumped as to why it was not working.

Entries like these from dmesg tell me the bond is working, but ping shows that something is clearly wrong:

Dec 14 14:20:09 ferrari kernel: [    4.595988] bonding: bond0: setting mode to active-backup (1).
Dec 14 14:20:09 ferrari kernel: [    4.596045] bonding: bond0: Setting MII monitoring interval to 100.
Dec 14 14:20:09 ferrari kernel: [    4.596087] bonding: bond0: Setting up delay to 200.
Dec 14 14:20:09 ferrari kernel: [    4.596121] bonding: bond0: Setting down delay to 200.
Dec 14 14:20:09 ferrari kernel: [    4.658073] bonding: bond0: doing slave updates when interface is down.
Dec 14 14:20:09 ferrari kernel: [    4.658079] bonding: bond0: Adding slave eth0.
Dec 14 14:20:09 ferrari kernel: [    4.658082] bonding bond0: master_dev is not up in bond_enslave
Dec 14 14:20:09 ferrari kernel: [    4.676526] tg3 0000:03:06.0: firmware: requesting tigon/tg3_tso.bin
Dec 14 14:20:09 ferrari kernel: [    4.923645] bonding: bond0: enslaving eth0 as a backup interface with a down link.
Dec 14 14:20:09 ferrari kernel: [    4.934060] bonding: bond0: doing slave updates when interface is down.
Dec 14 14:20:09 ferrari kernel: [    4.934066] bonding: bond0: Adding slave eth1.
Dec 14 14:20:09 ferrari kernel: [    4.934069] bonding bond0: master_dev is not up in bond_enslave
Dec 14 14:20:09 ferrari kernel: [    4.956523] tg3 0000:03:08.0: firmware: requesting tigon/tg3_tso.bin
Dec 14 14:20:09 ferrari kernel: [    5.208291] bonding: bond0: enslaving eth1 as a backup interface with a down link.
Dec 14 14:20:09 ferrari kernel: [    5.212315] ADDRCONF(NETDEV_UP): bond0: link is not ready
Dec 14 14:20:11 ferrari kernel: [    7.813163] tg3 0000:03:08.0: eth1: Link is up at 1000 Mbps, full duplex
Dec 14 14:20:11 ferrari kernel: [    7.813167] tg3 0000:03:08.0: eth1: Flow control is on for TX and on for RX
Dec 14 14:20:11 ferrari kernel: [    7.912012] bonding: bond0: link status up for interface eth1, enabling it in 0 ms.
Dec 14 14:20:11 ferrari kernel: [    7.912016] bonding: bond0: link status definitely up for interface eth1.
Dec 14 14:20:11 ferrari kernel: [    7.912020] bonding: bond0: making interface eth1 the new active one.
Dec 14 14:20:11 ferrari kernel: [    7.912044] bonding: bond0: first active interface up!
Dec 14 14:20:11 ferrari kernel: [    7.912172] ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
Dec 14 14:20:11 ferrari kernel: [    8.148079] tg3 0000:03:06.0: eth0: Link is up at 1000 Mbps, full duplex
Dec 14 14:20:11 ferrari kernel: [    8.148084] tg3 0000:03:06.0: eth0: Flow control is on for TX and on for RX
Dec 14 14:20:11 ferrari kernel: [    8.212012] bonding: bond0: link status up for interface eth0, enabling it in 200 ms.
Dec 14 14:20:11 ferrari kernel: [    8.412010] bonding: bond0: link status definitely up for interface eth0.

While trying to figure this out, I noticed some strange entries in both the routing table and the output of /sbin/ifconfig.

/sbin/route:
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.0.0     *               255.255.255.0   U     0      0        0 eth0
192.168.0.0     *               255.255.255.0   U     0      0        0 bond0
default         192.168.0.1     0.0.0.0         UG    0      0        0 bond0

As you can see, eth0 still has an entry in the routing table for some reason. Seeing that as a problem, I tried to delete it, with no success. Below you'll see that eth0, while "RUNNING SLAVE", still has the IP address it had before that address was reassigned to bond0.

/sbin/ifconfig:
bond0     Link encap:Ethernet  HWaddr 00:0b:db:e2:ce:db
          inet addr:192.168.0.215  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::20b:dbff:fee2:cedb/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:220 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:23522 (22.9 KiB)  TX bytes:2028 (1.9 KiB)

eth0      Link encap:Ethernet  HWaddr 00:0b:db:e2:ce:db
          inet addr:192.168.0.215  Bcast:192.168.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:142 errors:0 dropped:0 overruns:0 frame:0
          TX packets:21 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:15158 (14.8 KiB)  TX bytes:1344 (1.3 KiB)
          Interrupt:28

eth1      Link encap:Ethernet  HWaddr 00:0b:db:e2:ce:db
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:78 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:8364 (8.1 KiB)  TX bytes:684 (684.0 B)
          Interrupt:29

At first I thought it might be some strange notation for bonded interfaces, but the more I thought about it, the more I felt it was wrong. After some searching I ended up reading http://www.kernel.org/doc/Documentation/networking/bonding.txt and found that section 8.1, "Adventures in Routing", explains exactly the issue I was having: slave devices should not have routes that take precedence over those of the master. For reasons unknown to me, I was not able to delete the route I wanted to delete. In the end, what worked was getting the bonded connection set up and then rebooting. Only then did I lose the eth0 entry in the routing table and the IP address on eth0 as reported by /sbin/ifconfig.
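
A duplicate route like the eth0/bond0 pair above is easy to miss by eye on a longer table. As a rough sketch, the shell function below reads route-style output and prints any destination/netmask pair that is reachable through more than one interface. The name find_duplicate_routes is made up for this post and is not part of any package, and the parsing assumes the eight-column net-tools layout shown above.

```shell
# Hypothetical helper: list destination/netmask pairs routed via
# more than one interface. Reads `route`-style text on stdin.
find_duplicate_routes() {
    awk '
        # Skip the column header; the "Kernel IP routing table" title
        # line has fewer than 8 fields and is dropped by the NF test.
        $1 == "Destination" { next }
        NF >= 8 {
            key = $1 "/" $3            # destination/genmask
            if (key in ifaces)
                ifaces[key] = ifaces[key] " " $8
            else
                ifaces[key] = $8
            count[key]++
        }
        END {
            for (key in count)
                if (count[key] > 1)
                    print key ": " ifaces[key]
        }
    '
}
```

It can be fed directly from the live table with `/sbin/route | find_duplicate_routes`. No output means no destination is claimed by two interfaces.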

I went through some trials using ifup and ifdown to get rid of the eth0 entry in the routing table, and I even put a short line in the interfaces file:

iface eth0 inet manual

Bringing eth0 up and down removed the errant entries, but restarting networking brought them back, even with eth0 removed from the interfaces file aside from the slave command.
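
For readers who have not set up bonding on Debian before, here is a sketch of what a bond0 stanza in /etc/network/interfaces could look like for this setup. It is not a copy of the file from this server. The address, netmask, and gateway come from the output earlier in the post, and the mode and timer values mirror the kernel messages. The option spellings (slaves, bond-mode, bond-miimon, bond-updelay, bond-downdelay) are the ones I understand ifenslave-2.6 to accept, so check them against the README shipped with the package. The function name print_bond0_stanza is made up for illustration.

```shell
# Hypothetical helper: print an example bond0 stanza for review.
# Values are taken from the logs in this post; this is an assumed
# configuration, not the exact file from the server.
print_bond0_stanza() {
    cat <<'EOF'
auto bond0
iface bond0 inet static
    address 192.168.0.215
    netmask 255.255.255.0
    gateway 192.168.0.1
    slaves eth0 eth1
    bond-mode active-backup
    bond-miimon 100
    bond-updelay 200
    bond-downdelay 200
EOF
}
```

Printing the stanza rather than writing it straight into place leaves room to compare it with the existing file before changing anything.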

So far my only success has been a reboot, after which the bonding works fine.
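
To confirm after a reboot (or a networking restart) that no slave is still holding an address, a small check along these lines may help. It is only a sketch: find_slaves_with_address is a made-up name, and the parsing assumes the classic net-tools ifconfig layout shown earlier, where the "inet addr:" line comes before the flags line within each interface block.

```shell
# Hypothetical helper: report interfaces flagged SLAVE that still
# carry an IPv4 address. Reads net-tools `ifconfig` output on stdin.
find_slaves_with_address() {
    awk '
        # A new interface block starts with a name in column one.
        /^[a-zA-Z]/ { name = $1; addr = "" }
        # Remember the IPv4 address of the current block, if any.
        $1 == "inet" && $2 ~ /^addr:/ { addr = substr($2, 6) }
        # The flags line comes after the address lines in a block.
        / SLAVE / && addr != "" { print name " " addr }
    '
}
```

Run it as `/sbin/ifconfig | find_slaves_with_address`. Against the output earlier in this post it flags eth0 with 192.168.0.215. On a healthy bond it prints nothing.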

Update:

Bug reported: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=698797
