Solving This Common Scenario
A mysterious new Fridge 3 shows up! How do you figure out whether it is related to any of the existing Fridges in your house, or any other device for that matter? Mathematical correlation can help find the hidden relationships, probably more easily than overlaying waveforms. I’m going to try three comparisons using the Pearson correlation coefficient. I’m not going to go into the math, but the picture below is very intuitive - plot the energy/power usage of two devices at each hour (from the Sense export) against one another and take a look at the line that fits them.
If the line is a perfect fit and has a positive slope, they are 100% positively correlated, or the Pearson correlation coefficient = 1. If the line is a perfect fit, but the slope is negative, then the two devices are 100% negatively correlated, with a correlation coefficient of -1. If the line is not a perfect fit, the coefficient is somewhere in between. A correlation coefficient close to 1.0 likely means the Sense-detected devices are the same. But there are caveats, especially if one is comparing data from smart plugs vs native detections (more on that later).
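For anyone who wants to see the coefficient without a charting tool, here is a minimal Python sketch of the standard Pearson formula. The device readings are made up for illustration and are not from my export. Note that the "same shape" device reads exactly double the first one and still scores 1.0, so a high coefficient says the usage patterns move together, not that the readings are identical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up hourly kWh readings, for illustration only
device_a = [0.10, 0.20, 0.30, 0.40, 0.50]
same_shape = [0.20, 0.40, 0.60, 0.80, 1.00]  # exactly 2x device_a
mirror = [0.50, 0.40, 0.30, 0.20, 0.10]      # falls as device_a rises
noisy = [0.12, 0.15, 0.35, 0.33, 0.55]       # roughly follows device_a

r_same = pearson(device_a, same_shape)    # perfect fit, positive slope
r_mirror = pearson(device_a, mirror)      # perfect fit, negative slope
r_noisy = pearson(device_a, noisy)        # imperfect fit, somewhere in between
print(round(r_same, 3), round(r_mirror, 3), round(r_noisy, 3))
```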
I’m going to try to do 3 things with hourly data from the Sense data Export.
- Look at a few devices with the same name to test my correlation hypothesis - are they really the same?
- Test the same approach with a couple of other devices I know to be related and others that are completely unrelated.
- Attempt to connect a few mystery devices around my house to existing known detections.
Correlation Between Devices With the Same Name
When I downloaded my hourly Sense data for 2022 to do some analysis, I discovered a few devices that had the same Name but different Device IDs in the data. Originally I thought these duplicates (below) might have been the results of Merges, but after a little more investigation, the origins of the duplicate names turned out to be a little more complex.
From the duplicates list above:
- The Cannon Printer and Kitchen Overheads are both on Wiser smart plugs/switches, that I reinitialized mid-year. Both have one ID showing usage from Jan 1 - Aug 4, and the other usage from Aug 4 - Dec 31. I’ll check these later because they should show negative correlation with one another.
- One of the mystery Motor 2 occurrences was only on for a total of 3 hours in the year. So I’m not going to bother trying to check.
- I think the Coffee Maker and the Microwave duplicates resulted from new detections that got named the same as existing detections with Locations. Notice that one of the duplicates in each case does not have a location.
- AC 3 (Air Conditioning Compressor) is a bit of a mystery. It’s not a merge and both IDs have the Garage location. One of them has usage hours from Feb 14 - Nov 8, the other from Feb 14 - Aug 25. The data in my Sense app lines up with the first one (Feb 14 - Aug 25) in the list that used 893kWh during the year, so Sense is not summing the two. I’m guessing that the second one is some kind of data remnant/artifact.
Going Deep on the AC 3 Pair
I’m going to compare the AC 3 devices first, though not before a diversion to talk about Sense and messy data. Since correlation analysis is going to compare usage side-by-side between devices, let’s take a look at a representative slice of the AC 3 data (below).
Notice the profusion of NAs, mostly in both columns together. NAs show up when Sense does not log any information in the Export file for that device in that hour. You won’t see NAs in the Export file, but you will see the “holes” appear, where Sense hasn’t logged any usage, if/when you pivot the Sense data from its long “log” format to a columnar format with each device having its own column. The “pivoting” software (Excel, R, etc.) usually gives you an option of how to fill the “holes” that are created. In my case I filled them with NAs (not available), but the real data question entails getting back to what the NAs mean. The NA could be caused by:
- Native detections that were off for the entire hour - Sense does NOT log a zero in the Export file for these hours. Before I do a comparison, I really want to fill in this type of NA with real 0’s.
- Periods before or after the device was installed in your house. This is a different form of a 0, because you might not want to do comparisons against this device for periods when it wasn’t in your home.
- A real NA where networking issues or Sense monitor issues prevented detections.
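To make the "holes" concrete, here is a small Python sketch of the long-to-columnar pivot. The row layout `(hour, device, kWh)`, the device labels and the values are all hypothetical; the real Sense export has more columns, and this is not the tool I used. The point is only that an hour with no logged row turns into a hole, which the sketch fills with `None` (Python's stand-in for NA).

```python
# Hypothetical long-format rows: (hour, device, kWh). Names and values are
# made up for illustration; they are not copied from a real Sense export.
long_rows = [
    ("2022-07-01 13:00", "AC 3", 1.20),
    ("2022-07-01 13:00", "AC 3 (artifact)", 1.20),
    ("2022-07-01 14:00", "AC 3", 0.90),
    ("2022-07-01 16:00", "AC 3", 1.10),
    ("2022-07-01 16:00", "AC 3 (artifact)", 0.80),
]

def pivot_wide(rows, all_hours, fill=None):
    """Pivot long rows into {device: [value per hour]}, filling holes with `fill`."""
    devices = sorted({device for _, device, _ in rows})
    lookup = {(hour, device): kwh for hour, device, kwh in rows}
    return {
        device: [lookup.get((hour, device), fill) for hour in all_hours]
        for device in devices
    }

hours = [
    "2022-07-01 13:00",
    "2022-07-01 14:00",
    "2022-07-01 15:00",  # no rows at all for this hour: a hole in both columns
    "2022-07-01 16:00",
]
wide = pivot_wide(long_rows, hours)
for device, values in wide.items():
    print(device, values)
```

Passing `fill=0.0` instead of the default would silently treat every hole as "off for the whole hour", which is only the right reading for the first kind of NA in the list above.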
Other notes on the side-by-side device / column comparisons:
- There are many hours when both devices show the exact same usage.
- There are a few hours where the column on the right (the AC 3 artifact) shows a lower usage reading (in red).
- There are a few hours where the column on the right shows NA, even though there are real data readings on the left (in blue).
Now that you have a little background in the data, here is the corresponding correlation data:
First, using my data above, with all the holes filled with NAs. All the hours with 2 side-by-side NAs, or with an NA in either single column (mostly the left), are removed from the comparison. Looking at the chart, these two are very likely the same device, though there are some differences, where the remnant is lower than the full AC 3. The correlation coefficient is 0.992***, showing significant correlation, which is quite good considering the number of points compared. BTW - the number of asterisks indicates the statistical significance of the coefficient. Three asterisks is the max (conventionally p < 0.001) and says that there is almost no chance the correlation is accidental.
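The NA handling described above is usually called pairwise deletion: any hour where either device is missing is dropped before the coefficient is computed. A minimal Python sketch, with made-up readings and `None` standing in for NA (the function names are mine, not from any library):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def pearson_complete_pairs(xs, ys):
    """Correlate only the hours where both devices have a reading (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    r = pearson([x for x, _ in pairs], [y for _, y in pairs])
    return r, len(pairs)

# Made-up hourly kWh; None marks an hour with no row in the export
ac3 = [1.2, 0.9, None, 1.1, 0.4, None, 0.7]
artifact = [1.2, None, None, 0.8, 0.4, None, 0.7]

r, n_used = pearson_complete_pairs(ac3, artifact)
print(round(r, 3), "from", n_used, "of", len(ac3), "hours")
```

It is worth reporting the number of hours that survive the deletion alongside the coefficient, since a high value built on a handful of hours means much less than the same value built on hundreds.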
Here’s the same comparison when we fill all the “holes” with zeros instead (all the points on the bottom of the lower left graph). Many points where the left column is a real usage number and the right was NA, but now 0, are suddenly introduced into the correlation calculation taking the coefficient down to 0.839***.
Here’s a third version where I tried to intelligently insert zeros in place of NAs, but only when the device appears to reside in my house. I basically only converted NAs to zeros between AC 3’s (artifact) first appearance and last appearance. Even though the lower left chart looks similar to the last one, the correlation is far better than even the first comparison, with a coefficient of 0.994***. How could the correlation be better vs. the NA-filled example at the top, given all the zeros along the bottom? The distribution curves, on the upper left and lower right, show the reason - the Intelligent 0’s approach adds many 0,0 data points to the correlation calculation that weren’t there in the first calculation.
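The "intelligent zeros" rule can be sketched in a few lines of Python. This is my reading of the rule as a standalone function, assuming `None` marks an NA and applying the window per device; the readings are made up.

```python
def fill_zeros_in_window(values):
    """Replace None with 0.0 only between the first and last real reading."""
    seen = [i for i, v in enumerate(values) if v is not None]
    if not seen:
        return list(values)  # device never appeared: leave every hole as NA
    first, last = seen[0], seen[-1]
    return [
        0.0 if v is None and first <= i <= last else v
        for i, v in enumerate(values)
    ]

# Made-up hourly kWh for one device; None marks an hour with no logged row
artifact = [None, None, 1.2, None, 0.8, None, None, 0.4, None, None]

windowed = fill_zeros_in_window(artifact)               # zeros only while present
all_zero = [0.0 if v is None else v for v in artifact]  # zeros everywhere
print(windowed)
print(all_zero)
```

The holes before the first reading and after the last one stay as NA, so those hours still drop out of the correlation instead of counting as "off".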
Going forward, I’m going to need to be exceptionally careful with this hidden filling, especially since native detections look far different than smart plug data when it comes to “holes”.
Moving to the Kitchen
Let’s move to the kitchen and look at the correlation between the 4 devices there. The Microwave and Coffee Maker in green correspond to the real devices in our kitchen. Based on what I’m seeing, I believe the other two are devices that were detected as Microwave and Coffee Maker, but I haven’t been able to connect them to existing devices (they aren’t merged - they just have the same name).
The correlogram below shows correlation based on data that had its “holes” filled with NAs, which are ignored by the correlation calculation. This chart shows that you can’t just look at the correlation coefficient. Notice the amazing 0.974 correlation between the Microwave that isn’t my Microwave and the Coffee Maker that isn’t my Coffee Maker! But notice two other things as well - no asterisks (very low statistical significance - it could all be chance) and only 3 data points that apparently form something close to a straight line with a reasonable slope (the small number of data points is one of the reasons the statistical significance is super low). For contrast, look at the relationship between my real Microwave and Coffee Maker - a very low correlation coefficient, with lots of points, but also very scattered, so it has low statistical significance as well.
But now we get to the interesting part! I’m going to replace NAs with zeros in an intelligent way (only replacing after the first time the device showed up, and only until the last time we saw the device on). In essence, this smart conversion of NAs to zeros adds the missing off-hours data to all the devices. When I run the chart, I see a very different result. The correlation between the Microwave that isn’t my Microwave and the Coffee Maker that isn’t my Coffee Maker has completely disappeared - the 3 points have blossomed into many more with either the Coffee Maker or the Microwave being off, killing off the calculated correlation. In a reversal, the addition of off behaviors (0s) has improved the correlation between my real Microwave and my real Coffee Maker, as well as confirming the coefficient as statistically significant. And there is a real world phenomenon that explains this partial correlation - I make my single serving Nespresso coffee in the morning, usually within the same hour as I heat up dog food in the Microwave for our pups. But there are many other hours when we use the Microwave and my wife’s coffee comes an hour later.
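To see why 0.974 earns no asterisks with only 3 points, here is a hedged Python sketch of the usual t-test for a Pearson coefficient. I am assuming that is the kind of test behind the asterisks; the chart tool may do something slightly different. The closed-form p-value below works only because n = 3 leaves 1 degree of freedom, where the t distribution is the standard Cauchy distribution.

```python
import math

r, n = 0.974, 3   # coefficient and number of points quoted above
df = n - 2        # degrees of freedom for the correlation t-test

# Test statistic: t = r * sqrt(df / (1 - r^2))
t = r * math.sqrt(df / (1 - r ** 2))

# With df = 1, the t distribution is the standard Cauchy distribution,
# so the two-sided p-value has a closed form (valid for n = 3 only).
p_two_sided = 1 - 2 * math.atan(abs(t)) / math.pi

print(round(t, 2), round(p_two_sided, 3))
```

The p-value comes out well above the usual 0.05 cutoff for even a single asterisk, which matches what the chart shows: three points can line up nicely by chance.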
BTW - Thanks to this exercise, I think I have also IDed the real second Coffee Maker - it belongs to our dog sitter. She brings her own - I noticed that it only showed up during times when we were away from home! Still need to ID the second Microwave (really the third, because Sense has found both the active Microwaves in our house).
Now that I think I have the methodology down, I’m going to work on a new post looking at all the non-duplicate mystery devices together with likely correlation partners to see if I can solve any more mysteries. LMK if you have any questions.