Varying goals for solar

So with Sense’s awesome solar charts fresh in mind (which look, btw, totally unlike my Sense#2 with Mains=hot water tank & Solar=air conditioner!):

Easy, near-immediate detection

  • Whole-array or large string inverter fail on a sunny day

Easy, not so fast, 2-day-long(?) detection

  • Partial array, string inverter fail on a cloudy day followed by a sunny day

Hard detection, a week?

  • Inverter degradation in a partially cloudy week … you may know it’s sunny at times but how does Sense? Do inverter fails or other transient fails match cloud periods or do they “pop out”? This is where you really start to need the Sense-wide dataset and geolocation to speed detection.

Very hard detection, a while?

  • 1 in 100 micro-inverter fail or a nasty bird. Sense: “Is it less sunny today or is that just me?”

Leading to: Is it really that easy to verify panel degradation (referring again to @kevin1’s chart) in anything less than, say, a year or two?

This is the real potential of using big data and ML … much more subtle changes than you can see may (will!) be detectable, and potentially very quickly. A quoted “In 25 years output will drop to no less than 80%” would be great to verify sooner rather than later, for example.

In the extreme case you could disaggregate the array (panel vs inverter): “Panel C6 is misbehaving” … but let’s not get ahead of ourselves.

@ixu, @dianecarolmark and all,
I’m starting to push down into this interesting subject, and I wanted to share some thoughts from the journey so far with my own data. First off, I’m considering 4 different approaches, kind of a 2 x 2 matrix.

  • Interday vs. intraday patterns - Solar has two levels of cyclicality to work with: daily and yearly. @dianecarolmark, you are looking at intraday patterns, but they are a bit more difficult to analyze than you might think, since the intraday trendline is linked to the yearly cycle. More on that later. The good news about intraday analysis is that you get faster feedback on rapid failures, but slow degradations can get buried by the yearly cycle unless one compensates. Interday is better for slow degradation. And weather adds noise to both interday and intraday, so one has to have appropriate filtering / triggering.

  • Analytical vs. time series ML approaches - Analytical means using the mechanics of the Earth’s motion to establish a prediction baseline. Time series analysis uses moving averages or localized polynomial fitting to separate the periodic, random and trend components of the data (a minimal decompose() sketch follows this list). The challenges with an analytical approach are that the user has to find a way to cut through the weather/atmospheric noise to fit the underlying function, and that doing intraday predictions is incredibly complex math (fortunately it has been done before). The main challenge with the time series approach is that most methods can only handle one level of periodicity.
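
To make the time-series side concrete, here’s a minimal decompose() sketch; `daily_kwh` (one production total per day) is a hypothetical stand-in for your own series:

```r
# Classical decomposition: split a daily-total series into seasonal,
# trend and random components using moving averages.
y   <- ts(daily_kwh, frequency = 365)  # one-year periodicity
dec <- decompose(y)
plot(dec)                              # observed / trend / seasonal / random
```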

Combining those two dimensions gives 4 possible approaches. Currently I’m trying to tackle the interday-analytical approach. I have implemented formula 4.8 for H0h (daily extraterrestrial irradiation energy falling on a plane horizontal to the Earth’s surface at my latitude) from this tutorial:

It’s a good approximation for the energy hitting a south-facing solar install, without all the atmospheric effects (listed below) or the exact geometric effects of the panel.

So again, H0h represents the solar energy hitting a plane parallel to the surface of the earth above my exact latitude, for each day throughout the year (a rough sketch of the calculation follows). If we look at all the data for all 6 years of operation, we can see one set of patterns, but with a poor degree of fit due to all the weather noise. This assessment does, however, highlight panel degradation over time.
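
For anyone following along, here’s a rough R sketch of the standard extraterrestrial daily irradiation formula (the Duffie & Beckman form; the tutorial’s formula 4.8 may differ slightly in constants or notation):

```r
# Daily extraterrestrial irradiation on a horizontal surface, in Wh/m^2.
Gsc <- 1367                                   # solar constant, W/m^2
H0h <- function(n, lat_deg) {                 # n = day of year
  phi   <- lat_deg * pi / 180                 # latitude (rad)
  delta <- 23.45 * pi / 180 *
    sin(2 * pi * (284 + n) / 365)             # solar declination (rad)
  ws    <- acos(-tan(phi) * tan(delta))       # sunset hour angle (rad)
  E0    <- 1 + 0.033 * cos(2 * pi * n / 365)  # eccentricity correction
  (24 * 3600 * Gsc / pi) * E0 *
    (cos(phi) * cos(delta) * sin(ws) +
       ws * sin(phi) * sin(delta)) / 3600     # J/m^2 -> Wh/m^2
}

# Example: the yearly H0h curve at an illustrative latitude of 42 N
plot(1:365, H0h(1:365, 42), type = "l",
     xlab = "Day of year", ylab = "H0h (Wh/m^2)")
```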

If I look instead at the max daily production for each day of the year, selected from the 6 years of operation, I get a much better fit to the theoretical line. But I also see that each month has slightly different characteristics.

I can share more on the fit numbers a little later - I’m still looking at outlier datapoints to see if there are reasons to exclude them, which would help improve the fit. Once I get the fit to where I want it, I’ll have a baseline for big daily deviations. Just looking at the statistics, I have never seen a deviation outside the lower 95% confidence bound for a local 30 day window more than two days in a row.
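
A minimal sketch of that rolling check, assuming the zoo package and the hypothetical `daily_kwh` series from before:

```r
library(zoo)
mu  <- rollapply(daily_kwh, 30, mean, fill = NA)  # 30-day local mean
sdv <- rollapply(daily_kwh, 30, sd,   fill = NA)  # 30-day local sd
low <- mu - 1.96 * sdv               # approximate lower 95% bound
which(daily_kwh < low)               # candidate anomaly days
```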

Other directions:

  • Investigate the solaR package for computing daily and hourly solar irradiance
    https://cran.r-project.org/web/packages/solaR/solaR.pdf
  • Obtain hourly data from SolarCity for all 6 years so I can do intraday analysis. One issue - the SolarCity database doesn’t handle daylight saving time well - it can’t handle the extra hour, so that hour disappears.
  • Look at more sophisticated time series analyses than the basic “decompose” for interday analyses. The basic default “decompose” doesn’t do enough smoothing - there are still too many jaggies in the seasonal waves.

1 Like

Gonna need to dream on these for a couple of solar cycles! Great work.

I would be interested in a trigger from my solar vs usage. Is that easy? “Hey, you have too much solar production, time to run the dishwasher” or better yet, a trigger that goes straight to the Tesla and starts it charging, and again when a cloud rolls over, stops it.

If I can do that, I’m gonna need to get a 240v switch I can trigger that would allow my hot tub to only run when the sun is full

Sure.

George explains @2:50

@ixu, @dianecarolmark,
Here’s a basic time series analysis of the daily solar cycle using readings from my solar every 15 min. It’s very hard to make sense of the graphs since there are so many points (almost 224K in all). Each day is just shy of 100 points (24 * 4 = 96), so about 2.3K daily cycles.
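
The decomposition itself is short. A sketch, assuming `kwh_filled` is the gap-filled 15 min vector (stl() rejects NAs):

```r
solar_ts <- ts(kwh_filled, frequency = 96)   # 24 h x 4 readings/h
fit <- stl(solar_ts, s.window = "periodic")  # seasonal-trend decomposition via loess
plot(fit)                                    # data, seasonal, trend, remainder
```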

A few things of note:

  • I had to fill in almost 1200 missing data points stemming from internet/power/SolarCity outages. That’s about 300 hours of downtime over 6 years. For the purposes of the time series, I initially filled them in with zeros, since the stl() analysis I’m using doesn’t like NA data or missing time periods.
  • You can see the yearly cycle emerging in the long term trend line - that means the 15 min measurement trends up and down with the yearly season.
  • It appears that even the “random” component has a rough pattern to it. More negative excursions during the late winter and spring. That’s the more cloud prone season here.
  • I can also see that there were some troublesome data points four years ago during late summer and fall. I’ll have to investigate.

Next step is to look at what kinds of prediction windows (80% and 95% confidence) come out of forecasts using the intraday and interday time series analyses, though I probably don’t need the entire time history since I’m using a somewhat localized time-series fitting.

Here’s what the prediction looks like based on intraday data. I have pared back the number of datapoints to 1000, which means a prediction based on the previous 10 days of data. The black line shows the centerline prediction and the light and darker blue show the 80% and 95% confidence intervals for prediction.
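
For reference, here’s roughly how a forecast like this can be produced, assuming the `forecast` package supplies the Arima machinery and `solar_ts` is the 15 min series from the stl() sketch earlier:

```r
library(forecast)
recent <- ts(tail(as.numeric(solar_ts), 1000), frequency = 96)  # ~10 days
fit <- auto.arima(recent)
fc  <- forecast(fit, h = 96,            # predict one day ahead
                level = c(80, 95))      # the two interval bands plotted
plot(fc)
```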

Similar for daily data and 1000 points.

Based on these plots, I think the intraday prediction is going to be better for looking for most failures. But I really need to verify.
Next: Let’s do some random sampling of sunny, cloudy and partially cloudy days to see how often the solar behavior strays outside of the 95% confidence interval for the previous 10 days.

2 Likes

Yeah, that’s using software that interrogates the solar system and controls the Tesla high power wall charger, not the one I use that came with the Model S. Certainly I could buy one of those for $500, but if there were an IFTTT trigger that fired when excess production is detected, I could feed that into my car to have it turn on charging. I could feed it to my phone so I could turn on the laundry. I just think it would be useful to everyone.

You may want to look at the WeatherFlow weather station. It is reasonably priced, includes a UV detector, and has a full API for getting the data. I believe you could monitor UV intensity at your home with it and create a model to account for local cloud cover.

This is what I came here to post. A simple application could be written to poll the WeatherFlow and Sense APIs and compare current solar PV output with solar insolation (W/m2) from the WeatherFlow instrument. If the correlation factor changes significantly from its normal value, a problem exists.

@pswired this is a fork in the goal that we haven’t really been focused on here, but I agree that there are immediate and relatively simple (weather/insolation/external data source) methods with IFTTT that make sense for @israndy’s needs.

What @kevin1 is working on is post-analysis (hopefully very quick analysis!), and what you and others need for these situations is fundamentally as long-range a weather prediction as possible. I would point to radar and satellite feeds as probably being the most useful. You need fine-grained, hyper-local cloud prediction that is outside the Sense dataset. That said, there is a crossover insofar as, if you know what the wind is doing, then neighboring Sense Solar data could be very handy.

Laundry is a great test as would be, say, “When can I take a long solar drive in the next week?” (Starts to feel like what early sailors had to contend with). You don’t want to get stuck half-way to your destination and you don’t want half-washed clothes.

@ixu, I’m really trying to see if there are ways to use the big data approach, sans orthogonal solar input, to identify likely failure patterns. Once you find a technique based on past data that works, it’s possible to “can” that analysis into realtime identification. But so far, cloud and weather noise have put the kibosh on simple techniques. The Arima time series prediction that looked relatively good above comes unglued when it encounters a series of cloudy days. Expanding the analysis window to include more sunny days doesn’t seem to help.

1000 points

2000 points

4000 points

Indeed. I’m beating back the branches here to clear a path for your SoSi charts.

Meanwhile I’ve been looking at cloud nowcasting and NASA/IBM “Cloud computing on the cloud”.
Understanding the underlying factors affecting insolation, and the computational (ML) methods, parallels the sans-orthogonal method.

I have a mental image of a literal PV-covered globe. At least for short-term (nowcasting) the “weather” is then mapped onto the PV globe. Wind can be inferred.

Insofar as cloud periodicity (if there is such a thing) and wind speed are concerned, you would imagine that the SoSi data might just need the right culling/compression? Taking the daily max is an obvious first look, as you have done. I can’t wrap my head around the other cycles, and sure, it might be essentially random, but from experience you know that, say, an eclipse is rare and highly predictable (= not weather) while clouds are highly unpredictable and predictably not simply on & off.

1 Like

The good news for compression is that the insolation, assuming a constant clear atmosphere, is:

  • The max envelope
  • Symmetric over both daily and yearly cycles
  • Predictable over both daily and yearly cycles

So it probably only takes two parameters to characterize yearly and daily cycles at a high level:

  • The estimated daily max (peak 15 min or hourly reading) and yearly max (peak daily total). I say estimated because you want to get past a cloudy noontime, or a cloudy June 21st. This can be compared to the theoretical max for calibration. The peaks also give insight into degradation.
  • The measured energy delivered over the day/year. This gives a better view of atmospheric factors. There’s probably a function relating the estimated peak to the total energy generated that would be a good proxy metric for geographic cloud cover (a rough sketch follows this list).
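
A rough sketch of that two-parameter compression; `df`, with a `date` and a 15 min `kwh` column, is a hypothetical name:

```r
peak  <- tapply(df$kwh, df$date, max)  # estimated daily peak (15 min)
total <- tapply(df$kwh, df$date, sum)  # measured daily energy
plot(peak, total)  # the peak-vs-total relationship is the candidate
                   # proxy metric for local cloud cover
```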

One of the most important, but least glamorous, parts of data science is cleaning up the data, or even more importantly, automatically separating and remediating (if possible) anomalous data. And quite honestly, without a feedback mechanism from the data acquisition system that flags bad data, the problem of finding after-the-fact data collection failures is very similar to detecting solar panel / inverter failures. Here’s a real life example:

If I go back to my every 15min time series decomposition, it’s easy to spot some clearly anomalous data highlighted in the red boxes below:

  • Individual spikes well above the max
  • Clusters of spikes well above the max
  • Large periods of time where production is zero (we have to remember that at this scale, we’re not even seeing nights of zero production, so these are periods longer than 12 hours).

If I zoom closely into one of the spike clusters, I can get a better idea of what is going on. The data collected by SolarCity/Tesla appears to ping back and forth between zero and double the typical amount of energy received in 15 min.

If I look into my SolarCity data for one of those days, I get an even better idea of what is going on. The first 5 columns come directly from the API on the SolarCity website. The rows with NAs (Not Available) in those columns are 15 min intervals that were missing from the SolarCity data - I added them so that I could analyze the integrity of my data (BTW - I do the same with my Sense export data, because there are indeed missing hours from occasional data dropouts). The 6th column, “kWh”, is one I created from my SolarCity “Energy(kWh)” solar production data, except I replaced the NAs with zeros, because some time-series analyses don’t like missing intervals or NAs.
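
The gap-filling step looks roughly like this; `prod` (a data frame with a POSIXct `time` column plus the SolarCity `Energy.kWh.` column) is an illustrative name:

```r
# Build the complete 15 min grid, merge to expose missing intervals
# as NA rows, then zero-fill for the NA-averse analyses.
full  <- data.frame(time = seq(min(prod$time), max(prod$time), by = "15 min"))
m     <- merge(full, prod, all.x = TRUE)   # NA rows = missing intervals
m$kWh <- ifelse(is.na(m$Energy.kWh.), 0, m$Energy.kWh.)
```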

On initial inspection, it looks as if there might have been a data acquisition / networking problem that caused occasional dropouts, and the SolarCity collection compensated by doubling up the previous data point to keep the final total right. If it were that simple, I could do an automated fix. Unfortunately, it’s not: if I look at the sequence in the red box, the 1.80 reading doesn’t have any adjacent NA row, and it’s unclear whether the nearby, but not adjacent, 0.01 reading might have been a partial dropout or just a very cloudy 15 min. If any of you want to do some forensic pattern checking, I can send you a spreadsheet for the whole month of June 2015.

As for the gaps, I think I have most of them identified. The longest time-series strings of 0 kWh readings all correlate with the missing “dropout” hours that I inserted into the time series, so I at least have control of those points, if I wanted to remediate them, though picking the right value might be more challenging.

Here’s one example: the night of Jun 15th 2015 running into a dropout in the early AM of Jun 16th, continuing through until the morning of Jun 17th.


1 Like

@ixu, @dianecarolmark, all,
Here comes the “fun” part: trying to figure out, automatically, when my 15 min SolarCity readings are too big (beyond the capability of my system at that particular moment in time) and might be due to an acquisition glitch. Solving that should shed some light on the panel/inverter failure problem as well.

The first simple step might be to look for a value that is too big for that time of day during that time of year. Just looking at the data, it appears that any 15 min reading above 1 kWh is a problem, but I’m sure that max cutoff varies with both time of day and time of year. Since I haven’t yet invested in solving the orbital dynamics of hourly insolation, I’m going to try to use just my daily calculations from earlier. If I plot my 15 min max data for each day against the theoretical daily insolation (H0h) number for that day, I get essentially a straight line that I can regress, once I pull out the bad points.

This gives a date-dependent max value for 15 min generation, varying between 0.8 and 1.0 kWh, that is good for finding the biggest problem points (sketch below). All I need to do is compare every 15 min value against the predicted max and remove points above that max for every point in the year. That gives a really busy but pretty cool chart where 902 data points have been deemed bad, or at least worthy of investigation. That’s not so far off from the 1200 or so missing points I needed to add. But looking at the line of demarcation, I’m not so sure that every point near it is bad under my simple criterion, plus I’m not detecting any too-big points during the less sunny hours of the day, since my criterion is based on the daily max.
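
A sketch of that screen; `daily` (per-day `H0h` and `max15` columns) and the per-reading `df` are hypothetical names:

```r
fit <- lm(max15 ~ H0h, data = daily)        # regress daily max on H0h
df$max_pred <- predict(fit, newdata = data.frame(H0h = df$H0h))
suspect <- which(df$kwh > df$max_pred)      # "too big" candidates
length(suspect)                             # 902 in my case
```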

Two things to try next:

  • Look more closely at the 902 - how many are truly acquisition issues vs. just best case numbers.
  • Start attacking an hourly insolation model.
1 Like

How do you sort through the 902 “potentially too big” data values to figure out which ones are legit and which ones aren’t? One way would be to eyeball each one, but that kind of defeats the automation approach. Since I know one mode of failure, the ping-ponging between NA (0) and doubled values, I’m going to try an automated approach that looks at nearest neighbors! I’m going to cluster the points using a simple kmeans() algorithm based on the current kWh and its two nearest neighbors. kmeans() will automatically cluster the data based on the 3D spatial distance between those three values. I arbitrarily picked 3 clusters for the algorithm (sketch below).
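
The clustering step, sketched with the hypothetical `df` and the `suspect` indices from the earlier screen (edge indices at the very start/end of the series would need guarding in real code):

```r
feat <- cbind(prev = df$kwh[suspect - 1],   # previous 15 min reading
              curr = df$kwh[suspect],       # the flagged reading
              nxt  = df$kwh[suspect + 1])   # next 15 min reading
set.seed(1)                                 # kmeans starts are random
cl <- kmeans(feat, centers = 3)             # 3 clusters, picked arbitrarily
plot(feat[, "prev"], feat[, "curr"], col = cl$cluster)
```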

Here’s the clustering vs the previous neighbor:

Here’s the same clustering vs. the next neighbor:

A few things quickly become apparent:

  • The reddish cluster represents points that are close to the bottom of my “bad” range, but have close nearest neighbors. And none have nearby neighbors, even 2 away on either side, with small values (no ping-pong pattern). That tells me they are legit.
  • The blue cluster is filled with the “ping-pong points” where the point is well into my “bad” range, and where at least one next door neighbor is 0 (NA) or some small value.
  • The green cluster is a little more dicey - some points are clearly well into the “bad” range, but others are on the edge of acceptability. Looking at the data more closely, I’m guessing nearly all of them are “bad” but only 36 of the 98 data points have the “ping-pong” symptom where a nearest neighbor either next-door or 2 away is a very small value (< 0.02). So there must be another kind of data acquisition gremlin causing those.

Bottom line, looking at the values vs. neighbors has been a great tool for discriminating between real data issues and data on the cusp. The real trick is to “can” this kind of testing in a way that works for smaller hourly values as well. So far we’ve only been comparing against the max daily estimates, not hourly.

BTW - As you can see, data cleaning and integrity checking isn’t all that glamorous, and can require as much focus as the machine learning itself, especially when the data is remote and comes without any additional feedback (network out, clamp slightly ajar, etc.). But it is really crucial before you admit said data into the “golden dataset” used for training machine learning…

Just a little factlet in case it’s useful: you can end up with generation greater than your calculated maximum output for a given value of solar insolation due to reflection, including reflection from clouds. I found this out after researching why, on partly cloudy days, my sister’s PV array sometimes spiked above the sunny-day value for that time of day.

These excursions tend to last on the order of seconds, not minutes, though.

I’m not suggesting this may be relevant to your data, but it is something that I found counterintuitive and so worth noting.

1 Like

Next step - Intradaily insolation. But now I’m going to leave the orbital and atmospheric dynamics to somebody else: Oscar Perpiñán Lamigueiro of the Universidad Politécnica de Madrid, who has built out a great R-based environment for simulating and analyzing solar production, solaR.

Plus, it’s a useful education on everything that happens between the sun, the earth and the electric grid, and on how solaR simulates each step along the way.

Oscar Perpiñán (2012). solaR: Solar Radiation and Photovoltaic Systems with R. Journal of Statistical Software, 50(9), 1-32.
His flow diagram and associated legend highlight how little of the calculation I have simulated so far. And for you @ixu, Oscar offers a method for detecting failures in one system via comparison with other nearby producing systems.

Just to be clear: on my own, I have only explored the sun geometry component (Sol), and only at a daily level so far. That just provides a calculation of the amount of solar energy hitting a plane horizontal to the surface of the earth, at the top of our atmosphere (a minimal solaR sketch follows). The atmosphere adds all kinds of absorption, reflection, diffusion and spectral effects, based on the weather and time of day.
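
A minimal sketch of just that Sol step, following the solaR docs (the latitude and time base here are illustrative):

```r
library(solaR)
sol <- calcSol(lat = 37.8,                  # your latitude goes here
               BTd = fBTd(mode = 'serie'),  # daily time base for one year
               sample = 'hour')             # intraday resolution
head(as.data.frameD(sol))                   # daily sun-geometry results, incl. Bo0d
```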

Thanks @dianecarolmark,
Right now I’m still dealing with the pre-atmospheric calculation that represents the maximum envelope of energy hitting the top of the atmosphere. From there, the light energy gets absorbed as well as separated into global (G0), diffuse (D0), and direct (B0) irradiance components.

At some point I’ll get to that step in the calculation using solaR, but first I have to tackle the daily calculation, then figure out where I’m going to get my meteorological data from. I’m hopeful, since I’m close to several NOAA sites.

As you suggest, diffuse (D0d) irradiance can exceed direct (B0d) irradiance due to weather and time of year (D0d spikes and B0d slumps in Oscar’s simulations below)

Ok, on to the intraday calculations. First off, a bit of terminology, since I’ll be doing my new charts using solaR conventions, which are slightly different from my simpler, earlier calculations.

From solaR

  • Bo0 - Extra-atmospheric instantaneous irradiance (W/m2) - One needs to integrate over a specific time region to get energy/m2
  • Bo0d - Extra-atmospheric DAILY irradiation incident on a horizontal surface (Wh/m2)

From my older calculation

  • H0h - An approximation of Bo0d - Extra-atmospheric DAILY irradiation incident on a horizontal surface (Wh/m2)

I probably should compare Bo0d vs. H0h, but that’s a daily calculation vs. an intraday one. Instead I’m going to do a sanity check of Bo0 over the 6 years that my system has been in operation - pretty! Each color is a different hour of the day that the sun is out.

[Chart: Bo0 by hour of day across the 6 years of operation]

Seems to make sense!

Now it’s time to look at my 15 min solar output (kWh) vs. Bo0 at the beginning (or end?) of that time interval.
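
Roughly, the join behind this chart; `prod` (POSIXct `time`, 15 min `kwh`) is a hypothetical name, and the start-vs-end-of-interval question applies to the merge as well:

```r
geo <- data.frame(time = indexI(sol),            # intradaily time index
                  Bo0  = as.data.frameI(sol)$Bo0)
m <- merge(prod, geo)                            # join on `time`
plot(m$Bo0, m$kwh, col = factor(months(m$time)), # color by month
     xlab = "Bo0 (W/m2)", ylab = "15 min production (kWh)")
```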

[Chart: 15 min production (kWh) vs. Bo0, colored by month]

Wow, I was expecting a straight line but I got a “wing” instead. I colored by month to see if I could spot any discernible pattern. I see two things of interest.

  • We can see the SolarCity data acquisition glitch in the “shadow wing” that has a 2x kWh pattern above the main “wing”. But just because I can see it doesn’t mean I can automatically identify and remediate it just yet.
  • As expected, the kWh vs. Bo0 “wing” is longest during the high-sun months and shortest during the lower-sun months. That’s especially obvious in the same graph from the solaR plotting package.

Switching to coloring by hour, another pattern emerges:

The curve goes out lower in the morning and comes back with higher production in the afternoon, for what should be the same Bo0. Why would the same power/energy striking the top of the atmosphere result in substantially different production? I have a few theories:

  • Bo0 really isn’t the same - maybe my time base is off by 15 min (start of interval vs. end of interval?)
  • Atmospheric conditions and temperature?
  • Angle of incidence on my solar panels could favor afternoons

I guess I’m going to have to go through all the solaR calculations! But first, maybe I’ll just double-check my daily data, since that takes the time base issue off the table. Tomorrow.

2 Likes

Wow all this is great. Still way back digesting. Tomorrow? I need a month!

Not that you need asides, but I assume you saw this