Solving Sense Device Mysteries Using Correlation

Solving This Common Scenario
A mysterious new Fridge 3 shows up! How do you figure out whether it is related to any of the existing Fridges in your house, or any other device for that matter? Mathematical correlation can help find the hidden relationships, probably more easily than overlaying waveforms. I’m going to make three comparisons using the Pearson correlation coefficient. I’m not going to go into the math, but the picture below is very intuitive - plot the energy/power usage of two devices at each hour (from the Sense export) against one another and take a look at the line that fits them.

If the line is a perfect fit and has a positive slope, the two devices are 100% positively correlated, i.e. the Pearson correlation coefficient = 1. If the line is a perfect fit but the slope is negative, then the two devices are 100% negatively correlated, with a correlation coefficient of -1. If the line is not a perfect fit, the coefficient lands somewhere in between. A correlation coefficient close to 1.0 likely means the Sense-detected devices are the same. But there are caveats, especially if one is comparing data from smart plugs vs. native detections (more on that later).
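To make the idea concrete, here is a minimal sketch of the Pearson coefficient in plain Python. The hourly kWh lists are invented illustrative numbers, not real Sense export data.

```python
# Minimal Pearson correlation between two equal-length hourly series.
# fridge_a / fridge_b are made-up example values, not real Sense data.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

fridge_a = [0.12, 0.30, 0.05, 0.42, 0.18]
fridge_b = [0.13, 0.29, 0.06, 0.41, 0.19]   # tracks fridge_a closely
print(round(pearson(fridge_a, fridge_b), 3))  # close to 1.0
```

Two identical waveforms give exactly 1, a perfectly inverted one gives -1, and messy real data lands in between.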

I’m going to try to do 3 things with hourly data from the Sense Data Export.

  • Look at a few devices with the same name to test my correlation hypothesis - are they really the same?
  • Test the same approach with a couple of other devices I know to be related and others that are completely unrelated.
  • Attempt to connect a few mystery devices around my house to existing known detections.

Correlation Between Devices With the Same Name
When I downloaded my hourly Sense data for 2022 to do some analysis, I discovered a few devices that had the same Name but different Device IDs in the data. Originally I thought these duplicates (below) might have been the results of Merges, but after a little more investigation, the origins of the duplicate names turned out to be a little more complex.

From the duplicates list above:

  • The Cannon Printer and Kitchen Overheads are both on Wiser smart plugs/switches that I reinitialized mid-year. Both have one ID showing usage from Jan 1 - Aug 4, and the other showing usage from Aug 4 - Dec 31. I’ll check these later, because the two IDs should show negative correlation with one another.
  • One of the mystery Motor 2 occurrences was only on for a total of 3 hours in the year. So I’m not going to bother trying to check.
  • I think the Coffee Maker and the Microwave duplicates resulted from new detections that got named the same as existing detections with Locations. Notice that one of the duplicates in each case does not have a location.
  • AC 3 (Air Conditioning Compressor) is a bit of a mystery. It’s not a merge, and both IDs have the Garage location. One of them has usage hours from Feb 14 - Nov 8, the other from Feb 14 - Aug 25. The data in my Sense app lines up with the ID that ran Feb 14 - Aug 25 and used 893kWh during the year, so Sense is not summing the two. My guess is that the other one is some kind of data remnant/artifact.

Going Deep on the AC 3 Pair
I’m going to compare the AC 3 devices first, though not before a diversion to talk about Sense and messy data. Since correlation analysis is going to compare usage side-by-side between devices, let’s take a look at a representative slice of the AC 3 data (below).

Notice the profusion of NAs, mostly in both columns together. NAs show up when Sense does not log any information in the Export file for that device in that hour. You won’t see NAs in the Export file itself, but you will see the “holes” - hours where Sense hasn’t logged any usage - appear if/when you pivot the Sense data from its long “log” format to a columnar format with a column per device. The pivoting software (Excel, R, etc.) usually gives you an option of how to fill the “holes” that are created. In my case I filled them with NAs (not available), but the real data question entails getting back to what the NAs mean. An NA could be caused by:

  • Native detections that were off for the entire hour - Sense does NOT log a zero in the Export file for these hours. Before I do a comparison, I really want to fill in this type of NA with real 0’s.
  • Periods before or after the device was installed in your house. This is a different form of a 0, because you might not want to do comparisons against this device for periods when it wasn’t in your home.
  • A real NA where networking issues or Sense monitor issues prevented detections.
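The long-to-wide pivot that creates the “holes” can be sketched in a few lines. The rows below are invented, but they mimic the shape of the Sense export: one (hour, device, kWh) log row per device-hour, with nothing logged when a device was off.

```python
# A sketch of pivoting long Sense-export-style log rows into one column
# per device. Hours with no log entry for a device become None ("NA").
# The rows and device names here are invented for illustration.
rows = [
    ("01:00", "AC 3",     0.50),
    ("01:00", "AC 3 dup", 0.50),
    ("02:00", "AC 3",     0.45),   # the duplicate logged nothing at 02:00
    ("03:00", "AC 3 dup", 0.10),
]

hours = sorted({h for h, _, _ in rows})
devices = sorted({d for _, d, _ in rows})
lookup = {(h, d): kwh for h, d, kwh in rows}

# Wide table: one column per device, None wherever there was no log entry
wide = {h: {d: lookup.get((h, d)) for d in devices} for h in hours}

for h in hours:
    print(h, wide[h])
```

Excel and R pivot functions do the same thing at scale; the important part is deciding what those None/NA holes should become before computing any correlations.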

Other notes on the side-by-side device / column comparisons:

  • There are many hours when both devices show the exact same usage.
  • There are a few hours where the column on the right (the AC 3 artifact) shows a lower usage reading (in red).
  • There are a few hours where the column on the right shows NA, even though there are real data readings on the left (in blue).

Now that you have a little background in the data, here is the corresponding correlation data:

First, using my data above with all the holes filled with NAs. All the rows with two side-by-side NAs, or an NA in either single column (mostly the left), are removed from the comparison. Looking at the chart, these two are very likely the same device, though there are some differences where the remnant is lower than the full AC 3. The correlation coefficient is 0.992***, which is quite good considering the number of points compared. BTW - the number of asterisks indicates the statistical significance of the coefficient. Three asterisks is the max and says that there is essentially no chance the correlation is accidental.
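The asterisks follow the usual statistics convention for p-value thresholds; a sketch is below. The exact thresholds your charting package uses may differ, and the t statistic shown is the standard test for whether a Pearson r over n points is nonzero (it is not pulled from my Sense data).

```python
# The common significance-star convention: more stars = smaller chance
# the correlation is accidental. Thresholds are the usual 0.05/0.01/0.001.
from math import sqrt

def t_statistic(r, n):
    """Standard t statistic for testing whether a Pearson r of n points is nonzero."""
    return r * sqrt((n - 2) / (1 - r * r))

def stars(p):
    """Map a p-value to the usual significance stars."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""

print(stars(0.0002))  # '***' - essentially no chance the correlation is accidental
print(stars(0.2))     # ''    - could easily be chance
```

This is why a high coefficient with only a handful of points can still carry zero stars: with small n the t statistic is tiny even when r looks impressive.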

Here’s the same comparison when we fill all the “holes” with zeros instead (all the points along the bottom of the lower-left graph). Many points where the left column is a real usage number and the right was NA, but is now 0, are suddenly introduced into the correlation calculation, taking the coefficient down to 0.839***.

Here’s a third version where I tried to intelligently insert zeros in place of NAs, but only for the period when the device appears to have been active in my house: I only converted NAs to zeros between AC 3’s (artifact) first appearance and its last appearance. Even though the lower-left chart looks similar to the last one, the correlation is better than even the first comparison, with a coefficient of 0.994***. How could the correlation be better vs. the NA-filled example at the top, given all the zeros along the bottom? The distribution curves, on the upper left and lower right, show the reason - the intelligent-zeros approach adds many (0,0) data points to the correlation calculation that weren’t there in the first calculation.
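The “intelligent zeros” rule is simple to express in code: only turn NAs into zeros inside the window where the device was ever seen. The usage list below is invented, with None standing in for NA.

```python
# A sketch of the "intelligent zeros" fill: convert NA (None) to 0 only
# between the first and last hour the device was ever seen; NAs outside
# that window stay NA and stay out of the correlation. Data is invented.
usage = [None, None, 0.3, None, 0.1, None, 0.4, None, None]

seen = [i for i, v in enumerate(usage) if v is not None]
first, last = seen[0], seen[-1]

filled = [
    (0.0 if v is None else v) if first <= i <= last else None
    for i, v in enumerate(usage)
]
print(filled)  # [None, None, 0.3, 0.0, 0.1, 0.0, 0.4, None, None]
```

The leading and trailing NAs survive because the device hadn’t appeared yet (or was gone), while the in-window holes become honest “off” hours.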

Going forward, I’m going to need to be exceptionally careful with this hidden filling, especially since native detections look far different than smart plug data when it comes to “holes”.

Moving to the Kitchen
Let’s move to the kitchen and look at the correlation between the 4 devices there. The Microwave and Coffee Maker in green correspond to the real devices in our kitchen. Based on what I’m seeing, I believe the other two are devices that were detected as Microwave and Coffee Maker, but I haven’t been able to connect them to the existing ones (they aren’t merged - they just have the same names).

The correlogram below shows correlation based on data that had its “holes” filled with NAs, which are ignored by the correlation calculation. This chart shows that you can’t just look at the correlation coefficient. Notice the amazing 0.974 correlation between the Microwave that isn’t my Microwave and the Coffee Maker that isn’t my Coffee Maker! But notice two other things as well - no asterisks (very low statistical significance - it could all be chance) and only 3 data points, which happen to form something close to a straight line with a reasonable slope (the small number of data points is one of the reasons the statistical significance is super low). For contrast, look at the relationship between my real Microwave and Coffee Maker - a very low correlation coefficient with lots of points, but very scattered, so it has low statistical significance as well.

But now we get to the interesting part! I’m going to replace NAs with zeros in an intelligent way (only replacing after the first time the device showed up, and only until the last time we saw the device on). In essence, this smart conversion of NAs to zeros adds the missing “off” data to all the devices. When I run the chart, I see a very different result. The correlation between the Microwave that isn’t my Microwave and the Coffee Maker that isn’t my Coffee Maker has completely disappeared - the 3 points have blossomed into many more, with either the Coffee Maker or the Microwave being off, killing off the calculated correlation. In a reversal, the addition of off behaviors (0s) has improved the correlation between my real Microwave and my real Coffee Maker, as well as confirming the coefficient as statistically significant. And there is a real-world phenomenon that explains this partial correlation - I make my single-serving Nespresso coffee in the morning, usually within the same hour as I heat up dog food in the Microwave for our pups. But there are many other hours when we use the Microwave and my wife’s coffee comes an hour later.

BTW - thanks to this exercise, I think I have also IDed the real second Coffee Maker - it belongs to our dog sitter. She brings her own - I noticed that it only showed up during times when we were away from home! Still need to ID the second Microwave detection (really a third Microwave, because Sense has found both the active Microwaves in our house).

Now that I think I have the methodology down, I’m going to work on a new post looking at all the non-duplicate mystery devices together with likely correlation partners, to see if I can solve any more mysteries. LMK if you have any questions.


Fortuitously, here’s a new article that evaluates a bunch of algorithms for quantitatively evaluating the differences between two (electrical usage, in their case) waveforms!

The investigator’s verdict:

If the absolute amplitude of these peaks and troughs doesn’t matter to us, then we’re probably best off choosing the correlation coefficient as similarity metric.


Same Experiment with Home Assistant Data
Given that I can get higher sample-rate data out of Home Assistant, I decided to try a similar experiment using Pearson correlation to determine whether waveforms are related. But due to a couple of factors, I had to be especially careful, above and beyond the care I took with the direct Sense data. The reasons?

  • Naming Conventions - it appears that Home Assistant (HA) maintains different “traces” for different Device IDs, but drops the Device ID itself. Instead, I see coffee_maker_usage and coffee_maker_usage_2 in the HA data.
  • Data fill - HA inherits the Sense problem that when a device is off, HA does not log that condition - only when a device is on and using more than 1/2 W of power. Mix that with different sample rates between device integrations, and it becomes difficult to compare data using the same time points. In my case, I try to fix the data by moving it all to a common 5 min sample rate and filling empty samples with the previous values. Not particularly accurate in some cases, but useful in most. Then I also apply the earlier trick of adding NAs before a device went online with Sense and after it was removed.
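The snap-to-a-common-grid step above can be sketched as follows. The (minute, watts) samples are invented; the point is the forward-fill onto a fixed 5-minute grid.

```python
# A sketch of resampling irregular HA samples onto a common 5-minute
# grid, carrying the previous value forward into empty slots (the
# forward-fill described above). Timestamps and watt values are invented.
samples = [(0, 120.0), (7, 180.0), (21, 90.0)]   # (minute, watts)

grid = []
value = None
idx = 0
for minute in range(0, 30, 5):                   # 5-minute grid slots
    # advance to the latest sample at or before this grid slot
    while idx < len(samples) and samples[idx][0] <= minute:
        value = samples[idx][1]
        idx += 1
    grid.append((minute, value))

print(grid)
# [(0, 120.0), (5, 120.0), (10, 180.0), (15, 180.0), (20, 180.0), (25, 90.0)]
```

Once every device sits on the same grid, the samples can be compared point-for-point in a correlation calculation.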

My first attempt at correlation used data for Solar Production from my Sense vs. my Tesla inverter data. There were 3 separate entities associated with that: 35_elmwood_solar_power and 35_elmwood_solar_power_2 come from the Tesla cloud/inverter - there are two because the Tesla HACS integration changed in the middle of the month. And energy_production comes from Sense. HA does some renaming to meet its conventions, sometimes. Here’s a view of the results based on sampled and filled 5 minute data for Sense vs. Tesla.

A few things are visible:

  • There is no correlation measurement (NA) between the two feeds from the Tesla integration. That’s easily explained, because there were never any measurements in common (at the same point in time). One trace ended just as the next one started.
  • The first Tesla integration (35_elmwood_solar_power) shows extremely good correlation with Sense - 0.981 with very high statistical significance.
  • The second Tesla integration (35_elmwood_solar_power_2) shows good, but lesser correlation with Sense - 0.921 with very high statistical significance.

Why the difference between the two? If we look at the overlapped waveforms, Tesla vs Sense, in Home Assistant, the difference is fairly clear.

This is 35_elmwood_solar_power (Tesla) vs energy_production (Sense) for a single day. Tight correlation, except the Tesla data occasionally undershoots or overshoots the Sense measurement.

Below is 35_elmwood_solar_power_2 (Tesla) vs energy_production (Sense) for a couple of days. Notice that the Tesla sampling interval is larger, albeit with no overshoots or undershoots. Judging from the scale, the sampling interval is 15 min rather than what looked to be 1 minute with the earlier integration. The coarser sampling also delays the Tesla waveform relative to the Sense waveform, while making it less accurate in the time domain. That helps explain the “bee wing” shape of the line.

Eyeballing the two waveforms in InfluxDB (because it allows more plotting options than native HA), it looks like the new sampling technique in the Tesla integration delays the data by about 40 minutes. I’m not sure where that exact number comes from - given a 15 min sample interval, I’d expect either a 15 min or a 30 min delay - but I’ll look at both 30 min and 40 min when I adjust the data to check correlation again.
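Rather than eyeballing, the lag can also be found by brute force: shift one series left by k samples at a time and keep the lag with the highest Pearson r. The two series below are invented stand-ins (the “tesla” one is just the “sense” one delayed by 2 samples), not my actual solar data.

```python
# A sketch of scanning candidate lags to find the delay between two
# feeds: shift the delayed series left by k samples and keep the k with
# the highest Pearson r. Both series are invented for illustration; at
# a 5-minute sample rate, a 40-minute delay would be a shift of 8.
from math import sqrt

def pearson(x, y):
    n, mx, my = len(x), sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

sense = [0, 1, 4, 9, 16, 25, 16, 9, 4, 1, 0, 0]
tesla = [0, 0, 0, 1, 4, 9, 16, 25, 16, 9, 4, 1]   # sense delayed by 2 samples

# For each candidate lag k, trim the series to equal lengths and correlate
best = max(range(5), key=lambda k: pearson(sense[:len(sense) - k], tesla[k:]))
print(best)  # 2 - the shift (in samples) with the highest correlation
```

The same scan over the real 5-minute data would distinguish a 30-minute shift (6 samples) from a 40-minute one (8 samples) directly, instead of guessing from the plot.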

Here’s the correlation result if I offset the Tesla waveform by 30 min to the left - better !

Here’s the correlation result if I offset the Tesla waveform by 40 min to the left - even better! Oddly, the correlation now exceeds the correlation for the original Tesla integration, even though the plot shows seemingly more scattering.