Hi,

My wifi AP is a Linux machine running hostapd and dhcpd.  I couldn't get wifi working on the device: it was just stuck flashing blue, and I could see the DHCP server giving it an IP address, but the device never brought it up.  It's taken me weeks to figure out why and how to fix or work around it.  I traced the problem into the DHCP client, but couldn't figure out how to use Serial.print() there, because that's C++ and the DHCP client is C code.  Instead I used UDPSend() to send debug messages out via the network so I could see them with Wireshark.  That worked, although the device crashed after sending the packet.  It was better than nothing, but it meant recompiling the firmware many times, moving the UDPSend() call each time.

The problem was in the DHCP state machine, where this code in libraries/DEIPcK/utility/DHCP.c:

                    if (pbOp[2] == dhcpACK) {
                        UdpLog (pLLAdp, "GOT HERE 15");
                        // stuff our IP away for awhile so we can do an ARP
                        //if the ARP fails, we must zero this out before restarting.
                        memcpy((void *) &pLLAdp->pDHCPMem->dgDHCP.ciaddr, &pDHCPDG->yiaddr, sizeof(IPv4));
                        ((LLADP *) pLLAdp)->dhcpState = dhcpARPWait;

was never being executed.  pbOp[2] actually held the value of dhcpOFFER rather than dhcpACK.  So why was that?

Well, it turns out there is some kind of bug in the Linux wifi stack that duplicates incoming broadcast packets, which includes the client's DHCP Discover and Request.  The DHCP server dutifully responds to both copies, so the client receives a 2nd dhcpOFFER packet, which breaks the above code.

DHCP is supposed to look like this:

 13  15.886985      0.0.0.0 -> 255.255.255.255 DHCP 68 67 314 DHCP Discover - Transaction ID 0xfa0fe5d   13
 14  15.887122 192.168.193.4 -> 255.255.255.255 DHCP 67 68 365 DHCP Offer    - Transaction ID 0xfa0fe5d   14
 15  16.197498      0.0.0.0 -> 255.255.255.255 DHCP 68 67 314 DHCP Request  - Transaction ID 0xfa0fe5d   15
 16  16.197657 192.168.193.4 -> 255.255.255.255 DHCP 67 68 342 DHCP ACK      - Transaction ID 0xfa0fe5d   16

but it looked like this: 

  5  16.942850      0.0.0.0 -> 255.255.255.255 DHCP 68 67 314 DHCP Discover - Transaction ID 0x5abfed5d    5
  6  16.942881      0.0.0.0 -> 255.255.255.255 DHCP 68 67 314 DHCP Discover - Transaction ID 0x5abfed5d    6
  7  16.943054 192.168.193.4 -> 255.255.255.255 DHCP 67 68 365 DHCP Offer    - Transaction ID 0x5abfed5d    7
  8  16.943100 192.168.193.4 -> 255.255.255.255 DHCP 67 68 365 DHCP Offer    - Transaction ID 0x5abfed5d    8
  9  17.155067      0.0.0.0 -> 255.255.255.255 DHCP 68 67 314 DHCP Request  - Transaction ID 0x5abfed5d    9
 10  17.155105      0.0.0.0 -> 255.255.255.255 DHCP 68 67 314 DHCP Request  - Transaction ID 0x5abfed5d   10
 11  17.155266 192.168.193.4 -> 255.255.255.255 DHCP 67 68 342 DHCP ACK      - Transaction ID 0x5abfed5d   11
 12  17.155325 192.168.193.4 -> 255.255.255.255 DHCP 67 68 342 DHCP ACK      - Transaction ID 0x5abfed5d   12

At the point the client is looking for the final ACK, it actually sees the second Offer in packet 8.
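
As background on what pbOp[2] is probably comparing: in DHCP the message type travels as option 53, encoded as three bytes (code 53, length 1, value), with Offer = 2 and ACK = 5 per RFC 2132. Assuming pbOp points at the start of that option (an assumption, not checked against DHCP.c), index 2 would be the value byte. Below is a minimal standalone sketch of finding that byte in an options buffer; dhcp_msg_type is a hypothetical helper and not part of DEIPcK.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from DEIPcK): scan a DHCP options buffer,
 * i.e. the bytes after the magic cookie, and return the value of
 * option 53 (message type), or -1 if it is missing or truncated. */
int dhcp_msg_type(const uint8_t *opts, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t code = opts[i];
        if (code == 0) { i++; continue; }   /* pad option: single byte */
        if (code == 255) break;             /* end option */
        if (i + 1 >= len) break;            /* no room for a length byte */
        uint8_t olen = opts[i + 1];
        if (i + 2 + (size_t)olen > len) break;  /* value runs off the end */
        if (code == 53 && olen == 1)
            return opts[i + 2];             /* the byte at index 2 of the option */
        i += 2 + (size_t)olen;              /* skip code, length, value */
    }
    return -1;
}
```

For the second Offer in packet 8 this would return 2, where the code waiting for the final reply expects 5.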

So I just changed it (and everywhere else pbOp[2] == dhcpACK is checked) to      

               if (pbOp[2] == dhcpACK || pbOp[2] == dhcpOFFER)  {

and that fixed the problem, wifi finally worked!  Then I hunted down where the duplicate packet is coming from, and found it here:

 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/net/mac80211/rx.c?h=linux-4.14.y&id=0d59679df5b53755c00ea0292df696f97bfc950d#n2285

I don't understand why it's doing that, but I commented it out, and that fixed the duplicate packets and also fixed the issue...  I don't think either of these code changes is a valid fix, so the next step is to understand why the kernel is duplicating the packets, but I'm going to leave that for another day...

Regards,

Laurence Darby

 

 

Edited by ldarby

This is some nice debugging work, and you have hit the gap between the real world and the ideal protocol.

You can check on Wikipedia for the protocol:

https://en.wikipedia.org/wiki/Dynamic_Host_Configuration_Protocol

In short, the network stack should ignore the second offer, but the internal state machine follows the protocol, and in response to a request it expects an ACK. DHCP is an open UDP protocol: not only can UDP packets be dropped, but the server can offer the same IP address to multiple machines up until it ACKs a request, so at any time during the discover, offer, request sequence the server has the option not to ACK the request. In fact, when a client does a discovery, many servers can respond, offering the client many IP addresses. It is up to the client to pick one and request it; only if the server ACKs it does the client get it. So the state machine needs to walk the states in order. The code was written to abort when it sees an out-of-order sequence, on the assumption that if anything goes wrong, the sequence is over and a new discovery should be started. If I remember correctly, I try a few discoveries before giving up. Of course, if the sequence is disrupted every time, the client will eventually give up no matter how many retries are attempted, which is what I am guessing is going on.
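
A minimal sketch of the strict policy described above, assuming a much simplified client with only two waiting states. The state names and strict_run are hypothetical and are not taken from DHCP.c; only the message type values (Offer = 2, ACK = 5, NAK = 6) come from RFC 2132.

```c
#include <stddef.h>

/* DHCP message types (option 53 values from RFC 2132). */
enum { MSG_OFFER = 2, MSG_ACK = 5, MSG_NAK = 6 };

/* Hypothetical client states; these names do not appear in DHCP.c. */
typedef enum { ST_WAIT_OFFER, ST_WAIT_ACK, ST_BOUND, ST_RESTART } State;

/* Strict policy: anything other than the expected message aborts
 * the exchange so that a new discovery can be started. */
static State strict_step(State s, int msg)
{
    switch (s) {
    case ST_WAIT_OFFER:
        /* On an Offer the client would send its Request here. */
        return (msg == MSG_OFFER) ? ST_WAIT_ACK : ST_RESTART;
    case ST_WAIT_ACK:
        return (msg == MSG_ACK) ? ST_BOUND : ST_RESTART;
    default:
        return s;   /* ST_BOUND and ST_RESTART ignore further input */
    }
}

/* Feed messages in the order the client processes them. */
State strict_run(const int *msgs, size_t n)
{
    State s = ST_WAIT_OFFER;
    for (size_t i = 0; i < n; i++)
        s = strict_step(s, msgs[i]);
    return s;
}
```

Fed the clean sequence Offer, ACK, this ends in ST_BOUND. Fed Offer, Offer, ACK, ACK, as in the duplicated trace, it ends in ST_RESTART because the second Offer arrives while the client is waiting for the ACK.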

The problem with your fix is that the client must have sent a request if it is looking for an ACK, and what comes in is an offer. An offer is NOT an ACK, and the server has every right not to honor the IP address in the offer (which is why an offer can't act as an ACK); the DHCP server is only required to honor the IP address if it ACKs the client's request. So an ACK is the only valid response at this point in the protocol. It is confusing to the client to see an offer come in (directed to the client's MAC) at this time.

You are also dealing with a ton of intermediaries, the WiFi AP being one. The 802.11 WiFi protocol also has retries in it, and sometimes at the WiFi level a packet will get duplicated, which can cause issues. The WiFi layer is supposed to flush duplicate packets so they never get onto the Ethernet, but I have seen duplicates make it through the WiFi onto the Ethernet. When I originally wrote the DHCP client, I was on a wired LAN and did not see these kinds of duplicates. I suspect your Linux AP assumes it can put duplicates on the Ethernet, and that is why you are getting all of the packets duplicated, in both directions. This is a really poor AP if that is what it is doing. Also be aware that the order you sniff is not necessarily the order in which packets are processed. Since the client is looking for an ACK, the client must have sent the request. Internally, the client saw the first offer and sent the request, and probably while it was sending the request the second offer was sitting in the socket buffer waiting to be processed, which happened after the client sent the request. So the client-side processing order was discover, offer, request, offer.

But you point out a good real-world issue: in the real world we might get duplicate packets, out of sequence. This can potentially happen even on a wired LAN, although it is extremely unlikely (it requires two routings, since it is UDP and there are no retries). My design choice at the time was: something went wrong, start over. Not a bad design choice. However, the real world is saying a better design choice would have been: ignore the unexpected packet, keep waiting, and if we time out, try again.
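
The alternative policy (ignore the unexpected packet, keep waiting, restart on timeout) could look roughly like the sketch below. It is only an illustration: RxMsg, tolerant_wait_ack, and the millisecond timestamps are hypothetical, and treating a NAK as an immediate restart is an assumption rather than something stated in this thread.

```c
#include <stddef.h>

/* DHCP message types (option 53 values from RFC 2132). */
enum { TOL_MSG_OFFER = 2, TOL_MSG_ACK = 5, TOL_MSG_NAK = 6 };

typedef enum { TOL_BOUND, TOL_RESTART } TolResult;

/* Hypothetical record of one received message: its type and the time,
 * in ms after the Request was sent, at which the client processed it. */
typedef struct { int type; unsigned at_ms; } RxMsg;

/* Tolerant policy while waiting for the ACK: skip anything unexpected
 * (such as a stale second Offer) and only give up on NAK or timeout. */
TolResult tolerant_wait_ack(const RxMsg *rx, size_t n, unsigned timeout_ms)
{
    for (size_t i = 0; i < n; i++) {
        if (rx[i].at_ms > timeout_ms)
            break;                      /* arrived after the deadline */
        if (rx[i].type == TOL_MSG_ACK)
            return TOL_BOUND;           /* the server honored the request */
        if (rx[i].type == TOL_MSG_NAK)
            return TOL_RESTART;         /* assumed: a NAK restarts discovery */
        /* any other message type is ignored; keep waiting */
    }
    return TOL_RESTART;                 /* timed out: start a new discovery */
}
```

With the processing order from this thread (a stale Offer, then the ACK), this returns TOL_BOUND without treating the Offer as an ACK, so the server's right not to honor an offer is preserved.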

I am going to be revisiting the network stack in the next year or so, and I will put this on the list of things to address. For now, I would fix the AP!

Nice debugging.


Thanks for the reply.  To clarify, in the tcpdump, packet 6 is a duplicate but packet 8 (the 2nd Offer) isn't: the DHCP server really sent it, because it saw the duplicated Discover in packet 6 and replied to it.  You're right that the Offer packet just sits in the client's receive buffer until the client checks for the ACK, after it has sent the Request.

Also, Ethernet doesn't come into this, and it's not some "Wifi AP" product; it's just a USB wifi stick connected to my PC, which is running Linux.  I tried 2 different brands, NetGear WNA1100 and Ralink RT5370, and the same happened with both. It's not confirmed yet, but I think this thread is caused by the same issue:

 

because that's also a linux based AP. 

I don't see anything wrong with the logic in the DHCP client, so what needs to be fixed is the Linux kernel.  That will need discussion with the kernel developers, which I hope to get round to eventually.  However, if this is causing the other thread's issue and others, then even once it's fixed, the fix will take forever to reach the various AP products out there, so changing the DHCP client to just ignore the 2nd Offer would resolve the issue quicker.

Regards,

Laurence

 
