[BUG] - All five kp115 smart plugs went offline according to Sense but are still really on

:sweat_smile: silently hopes nothing breaks


I’m at 62 Hue bulbs, 1 Wemo, and 41 Kasa plugs integrated. :sweat_smile:


@Offthewall - I have good news for you! I have experienced the N/A bug.

I’m going to leave it at that for now. I have to get to sleep because I have to take my mother to the airport in the morning. I did a little bit of digging around and saw the following:

  • PCAPs show that the Sense is still sending its broadcast to UDP port 9999 asking all devices to report back. My monitor is set to poll every 6 seconds, but yours should be every 2. I am seeing a request every 6 seconds, as expected.
  • PCAPs show that the Kasa device is no longer responding to that broadcast, although it is still communicating with the Kasa servers and still shows up in ARP requests.
  • The Kasa device did have a DHCP failure today, around the time the issue started.
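
For anyone who wants to reproduce the poll without a Sense monitor, here is a minimal Python sketch of the broadcast described above. It assumes the community reverse-engineered Kasa local protocol (JSON obfuscated with an XOR autokey cipher seeded with 171, sent to UDP port 9999). None of that detail comes from this thread, the exact query Sense sends may differ, and the function names are made up for illustration.

```python
import json
import socket

KASA_PORT = 9999     # port seen in the PCAPs above
INITIAL_KEY = 171    # assumption: seed of the community-documented cipher


def kasa_obfuscate(plaintext: str) -> bytes:
    """XOR each byte with the previous ciphertext byte (autokey)."""
    key = INITIAL_KEY
    out = bytearray()
    for byte in plaintext.encode("utf-8"):
        key ^= byte          # ciphertext byte becomes the next key
        out.append(key)
    return bytes(out)


def kasa_deobfuscate(payload: bytes) -> str:
    """Reverse of kasa_obfuscate."""
    key = INITIAL_KEY
    out = bytearray()
    for byte in payload:
        out.append(key ^ byte)
        key = byte
    return out.decode("utf-8")


def poll_once(timeout: float = 2.0, address: str = "255.255.255.255") -> dict:
    """Send one discovery query and collect replies until the timeout."""
    # Assumption: a plain get_sysinfo query; Sense may ask for more than this.
    query = json.dumps({"system": {"get_sysinfo": {}}})
    replies = {}
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.settimeout(timeout)
        sock.sendto(kasa_obfuscate(query), (address, KASA_PORT))
        try:
            while True:
                data, (ip, _port) = sock.recvfrom(4096)
                replies[ip] = json.loads(kasa_deobfuscate(data))
        except socket.timeout:
            pass  # no more replies within the window
    return replies
```

Calling `poll_once()` on the same subnet as the plugs should return a dict keyed by the IP of every device that answered. A plug that is in the n/a state would simply be missing from that dict.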

I have to piece together when everything went haywire, and that will take some time. I will work on that over the next couple of days.

I don’t think it’s your environment. I have all commercial grade Cisco switches, routers, access points, security appliances, etc. Based on my findings so far, I do think it’s a result of the Kasa firmware or hardware. I want to figure out if it’s just not responding anymore or if it’s some other issue. In my case it’s actually responding, but only part of the time. It’s like it’s halfway working.


I don’t quite understand the “DHCP failure” that the Kasa plug had. Are you saying it had an issue when renewing its lease? Everyone having trouble with their KP115s sees them all go “n/a” at the same time, which would point to some external cause. The DHCP failure you mention might be that cause.
EDIT (additional thoughts): I believe the problem may well be in the KP115, as the Sense polling that you see catches the plugs as they are individually restarted.

@demiller9 I’m going to dive into it more and see if there’s any correlation between the DHCP failure and the issue everyone is having. I assumed early on, if you read back a few posts, that it was either a DHCP issue or a Kasa issue. It might very well be both. However, based on the timeline below, it’s not looking like a DHCP issue, even though there are recorded failures.

The DHCP failure happened at 0154 Sat July 10th
The first detected issue in the energy monitor was at 1213 On Fri July 9th
The first DHCP lease was at 1544 Tues July 6th
DHCP Leases are 24 Hours
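
To make the ordering explicit, here is a small Python sketch that puts the three timestamps above on one timeline. The year is an assumption (the posts only give weekday and date), and the variable names are just for illustration.

```python
from datetime import datetime

YEAR = 2021  # assumption: the posts give weekday and date but no year

first_lease = datetime(YEAR, 7, 6, 15, 44)    # first DHCP lease
first_issue = datetime(YEAR, 7, 9, 12, 13)    # first issue seen in the energy monitor
dhcp_failure = datetime(YEAR, 7, 10, 1, 54)   # logged DHCP failure

# How long before the DHCP failure did the monitor already show a problem?
lead = dhcp_failure - first_issue
print(f"Issue preceded the DHCP failure by {lead}")

# How long had the plug been on the network when the issue first appeared?
print(f"Time on the network at first issue: {first_issue - first_lease}")
```

The first detected issue comes 13 hours 41 minutes before the logged DHCP failure, which is why the failure looks more like a symptom than the cause.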

The commonality that I’m seeing on the KP115 but not on the HS300 is random deauthentications. There’s no specified reason for it either.


image

This image shows the % of failed connections, number of failed/connected and the reason


The KP115s aren’t responding to the polling when the issue happens. I’ve confirmed that with PCAPs. So I really don’t think this is a Sense issue. It’s looking like a Kasa issue. The fact that a power-pull fixes it suggests that the device’s internal software reboots and resumes working as expected until something gums it up. I’m just trying to figure out what is causing it to gum up.


THX, I am waiting with great anticipation for your additional findings!

What’s serving DHCP in your network?

Edit: It’s rare that the AP sees the DHCP request but doesn’t see a DHCP response. Assuming the AP can be trusted, this would imply a failure somewhere between the AP and the DHCP server (inclusive). Can you confirm whether or not there is a response from the DHCP server, i.e. with a sniff on the DHCP server or a span port?

@dennypage The DHCP server is responding; the plug just doesn’t want to accept it. The message about the DHCP not responding is from the device level, not the AP level. I’m assuming that the device does a discover, the DHCP server does an offer, and then either the request or the subsequent ack fails. When I force a discover on the same network I have yet to see it fail.

I don’t understand. The above image appears to be a standard error from an AP indicating a client association failure: “Client made a request to the DHCP server, but it did not respond … type=‘NO DHCP response’”

If you are confident that the DHCP server sent a response, then the response would have to have been lost, either at or before the AP.

@dennypage I understand your confusion. I wish I could see the KP115 logs. I have looked over the logs on the AP and Security Appliance which is in charge of DHCP for this DMZ. The only issues reported for DHCP are from the KP115 devices. I don’t have any other devices (out of 168) in that DMZ that are reporting issues. When I force DHCP renewals I can see the process go through as expected.

Now … at some point they are renewing a lease, because even after the failure their lease renews. So I don’t think this DHCP failure is ultimately what’s causing the problems. As I laid out in the timeline, it doesn’t appear to be DHCP related, unless the device itself is doing something funky with the DHCP request, which, if it’s locking up, could be the result of a timeout, etc.

Lastly, that one screenshot could very well just be an anomaly, as that AP is the only AP that works in repeater mode. I’m posting everything in regard to one single KP115 for documentation purposes. I haven’t been able to catch this happening in real time yet.

All requests from the single Kasa device that got to the DHCP server were answered. The error in the screenshot you provided was on July 10th at 0150. As you can see the KP115 actually got a renewed lease 2 hours prior. Since 12 hours is the first available renew, it could be that the DHCP server just ignored the request. Again, I don’t have pcaps at that time so I’m only assuming. See below:

I have cross posted this if anyone is interested in chiming in or up-voting:

Based on my findings, this is a Kasa issue and not related to Sense. Additionally, I don’t think this is an infrastructure issue, and although there are some strange things coming from these plugs, I still think it’s related to firmware. If anyone feels otherwise please let me know.


You’re running a backhaul network like @Offthewall?

The DHCP server should never ignore a renew request. It should send either an ack or a nak.

It does occur to me that the plugs could have a defect in that if no response was received to a renew request, they fail to fall back into discover. Triggering this to recreate a persistent failure as described above would require either a network or DHCP failure in renew, followed by the plug failing to ever ask again. Sounds like a long shot, but I suppose it’s possible. It would explain why it’s so rare.
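
As a rough illustration of the fallback a well-behaved client should follow, here is a Python sketch of where a DHCP client ought to be if it has heard nothing from the server since the lease was granted. It uses the RFC 2131 default timers (T1 at 50% of the lease, T2 at 87.5%); a server can override these, and the function and enum names are hypothetical.

```python
from enum import Enum


class DhcpState(Enum):
    BOUND = "bound"
    RENEWING = "renewing (unicast request to the leasing server)"
    REBINDING = "rebinding (broadcast request to any server)"
    INIT = "lease expired, must start over with discover"


def expected_state(lease_seconds: float, elapsed_seconds: float) -> DhcpState:
    """State a client should be in if no ack has arrived since the lease began."""
    t1 = 0.5 * lease_seconds      # RFC 2131 default renewal time
    t2 = 0.875 * lease_seconds    # RFC 2131 default rebinding time
    if elapsed_seconds < t1:
        return DhcpState.BOUND
    if elapsed_seconds < t2:
        return DhcpState.RENEWING
    if elapsed_seconds < lease_seconds:
        return DhcpState.REBINDING
    return DhcpState.INIT


HOUR = 3600
for hours in (6, 13, 22, 25):
    state = expected_state(24 * HOUR, hours * HOUR)
    print(f"{hours:>2} h into a 24 h lease: {state.value}")
```

With a 24 hour lease the first renew attempt is due at 12 hours. The defect hypothesized above would be a plug that never leaves the renewing or rebinding step and so never reaches discover again.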

@Offthewall, what is your current lease time, and can you increase it to test? Something like 14400 minutes (10 days)?

I’ve had it at 1 day, then tried 1 hour. Two different DHCP servers. In each case the plugs went n/a after 4 and 5 days.

How do you feel about trying a longer lease? Like 10 days (14400 minutes).

Only 1 of 3 APs, but I’m seeing this issue on KP115s connected to all of the APs.

By ignore I’m referring to nak. I’m trying to keep it somewhat common folk language. Also my logs don’t show naks so that’s why I say I’m assuming because I don’t have logs or pcaps of that.

One other thing to note is that some people who don’t closely monitor their devices might not see this issue. Let me show you example screenshots where it appears that the device is still working, but if you look closely it’s really not: there are huge gaps. So there could be many people who have KP115s and don’t know that the data is incorrect.

Normal KP115 on an always on device:

Bugged out KP115 on the same device:

I think that’s the issue of the Sense processor being overloaded (why support moved you to a higher polling interval). The issue @Offthewall is having is that the plugs go offline completely until he resets them. Seems to be a completely different issue.

That issue was very different and didn’t result in the n/a either. It was an overloaded device, but in that case Sense didn’t report any data for any loads and the monitor locked up. This is quite different and mimics what @Offthewall is seeing: the n/a and the Kasa device not responding to the broadcast UDP. Additionally, this is only happening to my KP115s and not the HS300 or the virtual KS110 via HA.

Maybe he can chime in and report if he gets no data at all or some spotty data.
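
One way to tell “no data at all” from “spotty data” is to look at the spacing between replies from a single plug. The Python sketch below is only an illustration: the function name, the list of reply times, and the three-missed-polls threshold are all assumptions, not anything Sense or Kasa defines.

```python
def find_gaps(reply_times, poll_interval, missed_polls=3):
    """Return (start, end) pairs where the plug went quiet.

    reply_times   -- sorted timestamps (seconds) of replies from one plug
    poll_interval -- seconds between polls (6 or 2 in this thread)
    missed_polls  -- how many consecutive missed polls count as a gap
    """
    threshold = poll_interval * missed_polls
    gaps = []
    for earlier, later in zip(reply_times, reply_times[1:]):
        if later - earlier > threshold:
            gaps.append((earlier, later))
    return gaps


# Hypothetical reply times: steady at 6 s, then silent from 12 s to 60 s.
print(find_gaps([0, 6, 12, 60, 66], poll_interval=6))
```

An empty result with a non-empty reply list means steady data, one or more gaps means spotty data, and an empty reply list means the plug never answered at all.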


I feel like we dive down all kinds of paths here but here’s what I can definitively say:

This is about the KP115s and the n/a issue.

I can say without a doubt:

  • The Sense monitor is sending broadcast UDP to port 9999 at the configured polling interval
  • The plug is not responding, or is responding only sporadically
  • HS300 devices don’t have this issue
  • HS300 and KP115 have different firmware versions
  • If it were the monitor, the HS300s would have the same issue

If more people had PCAPs and could verify this, it would be nice. In order to troubleshoot we have to eliminate one or the other. We can’t keep saying “well, what if this or that.” Let’s go with what we know.

15:13:27.848552 IP 10.0.2.248.9999 > 255.255.255.255.9999: UDP, length 63
15:13:27.888424 IP 10.0.2.204.9999 > 10.0.2.248.9999: UDP, length 729
15:13:27.916211 IP 10.0.2.201.9999 > 10.0.2.248.51615: UDP, length 1145
15:13:27.877220 IP 10.0.2.204.9999 > 10.0.2.248.9999: UDP, length 729
15:13:27.905034 IP 10.0.2.201.9999 > 10.0.2.248.51615: UDP, length 1145

.248 = Sense
.206 = Device I’m having issues with (not on the list)

While I’m only showing a piece of a huge cap, the same pattern repeats throughout.
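
For anyone who wants to check their own capture the same way, here is a short Python sketch that reads tcpdump text output in the format shown above and lists which expected plugs never answered the Sense. The sample lines are the ones from this post; the `EXPECTED` set and the function name are assumptions for illustration.

```python
import re

# Matches lines like:
# 15:13:27.888424 IP 10.0.2.204.9999 > 10.0.2.248.9999: UDP, length 729
LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) IP "
    r"(?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): UDP, length (?P<length>\d+)"
)


def responders(capture_text: str, sense_ip: str) -> set:
    """IPs that sent a UDP reply from port 9999 to the Sense monitor."""
    seen = set()
    for line in capture_text.splitlines():
        match = LINE.match(line.strip())
        if match and match["dst"] == sense_ip and match["sport"] == "9999":
            seen.add(match["src"])
    return seen


SAMPLE = """\
15:13:27.848552 IP 10.0.2.248.9999 > 255.255.255.255.9999: UDP, length 63
15:13:27.888424 IP 10.0.2.204.9999 > 10.0.2.248.9999: UDP, length 729
15:13:27.916211 IP 10.0.2.201.9999 > 10.0.2.248.51615: UDP, length 1145
"""

# Assumption: the plugs you expect to answer; .206 is the problem device above.
EXPECTED = {"10.0.2.201", "10.0.2.204", "10.0.2.206"}

silent = EXPECTED - responders(SAMPLE, sense_ip="10.0.2.248")
print(sorted(silent))
```

Run against the sample, the only address left in `silent` is 10.0.2.206, matching what the capture shows by eye.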
