Bonds and VLANs randomly fail to become active while duplicate IP verification is active

This document (000021492) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server 15 Service Pack 5
SUSE Linux Enterprise Server 15 Service Pack 4
SUSE Linux Enterprise Server 15 Service Pack 3
SUSE Linux Enterprise Micro 5.5
SUSE Linux Enterprise Micro 5.4
SUSE Linux Enterprise Micro 5.3
SUSE Linux Enterprise Micro 5.2
SUSE Linux Enterprise Micro 5.1

Situation

Bonds and VLANs randomly fail to become active while duplicate IP verification is active. The problem happens after a server reboot or after a restart of Wicked.

Resolution

Wicked now has an extended timeout/retry period that overcomes most of the issues caused by a problem in the bond driver.

Please install the following version of Wicked or later:

SUSE Linux Enterprise Server 12 SP5: wicked-0.6.75-3.43.1

SUSE Linux Enterprise Server 15 SP3: wicked-0.6.75-150300.4.32.1
SUSE Linux Enterprise Micro 5.1: wicked-0.6.75-150300.4.32.1
SUSE Linux Enterprise Micro 5.2: wicked-0.6.75-150300.4.32.1

SUSE Linux Enterprise Server 15 SP4: wicked-0.6.75-150400.3.27.1
SUSE Linux Enterprise Micro 5.3: wicked-0.6.75-150400.3.27.1
SUSE Linux Enterprise Micro 5.4: wicked-0.6.75-150400.3.27.1

SUSE Linux Enterprise Server 15 SP5: wicked-0.6.75-150500.3.29.1
SUSE Linux Enterprise Micro 5.5: wicked-0.6.75-150500.3.29.1

To check whether a wicked rpm contains the relevant fix, inspect its changelog:
# rpm -qp --changelog wicked-0.6.75-150500.3.29.1.x86_64.rpm | less

A package that contains the fix lists the following changelog entry:

- arp: increase arp-send retry value to avoid address configuration
  failure due to ENOBUF reported by kernel while duplicate address
  detection with underlying bonding in 802.3ad mode reporting link
  "up & running" too early (bsc#1218668, gh#openSUSE/wicked#1020,
  gh#openSUSE/wicked#1022).
  [+ 0002-increase-arp-retry-attempts-on-sending-bsc1218668.patch]
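
The same check can be made against the installed package instead of a downloaded rpm file. The following is a minimal sketch, assuming the installed package is named wicked and that the bug reference bsc#1218668 from the entry above is a sufficient marker for the fix. The helper name has_arp_retry_fix is hypothetical and not part of wicked or rpm.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the changelog text read from stdin
# mentions the bug reference from the changelog entry shown above.
has_arp_retry_fix() {
    grep -q 'bsc#1218668'
}

# Usage on a real system (not executed here):
#   rpm -q wicked
#   rpm -q --changelog wicked | has_arp_retry_fix && echo "fix present"
```

If the helper reports nothing, the installed version most likely predates the fix and should be compared with the versions listed in the Resolution section.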

Cause

In some LACP setups (such as multi-chassis), the start-up sequence switches the active aggregator. This disables all slaves and causes every transmit request to fail until the remote LACP partner sends a further request to re-enable them. This operation can take several seconds and is partner-dependent. If Wicked tries to send ARP requests during the aggregator transition, the kernel returns ENOBUFS and the duplicate IP verification fails.

Additional Information

The Wicked patch only reduces the likelihood of the reported problem (by extending the interface's verification time); it does not solve it, because the root cause is in the bonding driver. In rare cases the extended verification time is still not enough, and the following workaround can be used.

Increase the number of ARP send attempts that Wicked makes before giving up on the NIC configuration.

Add the following to the /etc/wicked/local.xml file and restart wicked.service:

<config>
    <addrconf>
        <arp>
            <verify>
                <count>2</count>
                <interval>2000</interval>
                <retries>10</retries>
            </verify>
        </arp>
    </addrconf>
</config>

Note that the file local.xml may have to be created if not present.


Changing only the retries value is actually sufficient, because it controls how ENOBUFS errors are handled. As an alternative to the example above, the following sets the retries value for dhcp4, auto4 and static IP addresses:
<config>
    <addrconf>
        <auto4>
            <arp>
                <verify>
                    <retries>20</retries>
                </verify>
            </arp>
        </auto4>
        <dhcp4>
            <arp>
                <verify>
                    <retries>20</retries>
                </verify>
            </arp>
        </dhcp4>
        <arp>
            <verify>
                <retries>20</retries>
            </verify>
        </arp>
    </addrconf>
</config>
With this configuration, Wicked tries to send 3 verify packets at an interval between 0.67s and 2s (this is the default). Across these 3 attempts it tolerates up to 20 ENOBUFS errors, which means it keeps trying for at least ~13s. A message similar to the following should appear in the debug output:

wickedd[8721]: en0: ARP verify failed for 192.168.0.22 - ENOBUFS, probes:0/3 errors:8/20
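
To pick such messages out of a larger log, a simple filter is usually enough. The following is a minimal sketch, assuming that debug logging is enabled and that wickedd messages end up in the systemd journal under the wickedd unit. The helper name enobufs_failures is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keep only ARP verification failures caused by
# ENOBUFS, in the format of the sample message above.
enobufs_failures() {
    grep 'ARP verify failed' | grep 'ENOBUFS'
}

# Usage on a real system (not executed here), assuming journal logging:
#   journalctl -b -u wickedd | enobufs_failures
```

The errors:N/M field in each matching line shows how many of the configured retries have already been used.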


Note: There is an important limitation: the total verify duration cannot exceed 15 seconds, that is, interval * count <= 15000 (interval in milliseconds).
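
The limit can be checked before editing local.xml. The following is a minimal sketch of that arithmetic only. The helper name verify_window_ok is hypothetical, and it does not read or validate the configuration file itself.

```shell
#!/bin/sh
# Hypothetical helper: checks the limit quoted above,
# count * interval (in milliseconds) <= 15000.
verify_window_ok() {
    [ $(( $1 * $2 )) -le 15000 ]
}

# Values from the first local.xml example: 2 probes, 2000 ms apart.
if verify_window_ok 2 2000; then
    echo "ok: 2 x 2000 ms = 4000 ms"
fi
```

A combination such as 4 probes at 5000 ms would give 20000 ms and fail the check.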

Note: The time LACP needs to complete depends also on the switch setup, e.g. VPC (Virtual Port Channel) or MLAG (Multi-chassis Link Aggregation Group).

Note: To test whether the local.xml workaround is active, run tcpdump on another host and check that wickedd sends its ARP requests according to the configured verify parameters.
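
A capture limited to ARP traffic for the address under verification keeps the output readable. The following is a minimal sketch, assuming the observing host is on the same layer 2 segment. The wrapper name watch_arp_probes, the interface eth0 and the address 192.168.0.22 are placeholders for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: show ARP packets that mention the given address.
#   -n  do not resolve names
#   -e  print link-level headers (MAC addresses)
#   -i  capture interface on the observing host
watch_arp_probes() {
    tcpdump -n -e -i "$1" arp host "$2"
}

# Usage (requires root, not executed here):
#   watch_arp_probes eth0 192.168.0.22
```

The number of requests and the spacing between their timestamps can then be compared with the configured count and interval values.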

Note: Since the root cause of the problem is actually in the bonding driver, not Wicked, the bonding driver problem is being investigated. The intention is to eventually arrive at a fix which is acceptable to upstream.

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.

  • Document ID: 000021492
  • Creation Date: 12-Jul-2024
  • Modified Date: 16-Jul-2024
    • SUSE Linux Enterprise Server
    • SUSE Linux Enterprise Micro

For questions or concerns with the SUSE Knowledgebase please contact: tidfeedback[at]suse.com