Varying goals for solar

The good news for compression is that the insolation, assuming a constant clear atmosphere, is:

  • The max envelope
  • Symmetric over both daily and yearly cycles
  • Predictable over both daily and yearly cycles

So it probably only takes two parameters to characterize yearly and daily cycles at a high level:

  • The estimated daily (peak 15 min or hour) / yearly (peak daily) max. I say estimated because you want to get past a cloudy noontime, or a cloudy June 21st. This can be compared to the theoretical max for calibration. The peaks also give insight into degradation.
  • The measured energy delivered over the day/year. Gives a better view of atmospheric factors. There’s probably a function relating the estimated peak to the total energy generated that would be a good proxy metric for geographic cloud cover.
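To make the two-parameter idea concrete, here’s a base-R sketch (hypothetical column names, toy data) that reduces a 15 min series to a daily peak and daily energy, plus a crude fill-factor that could serve as a cloud-cover proxy:

```r
# A base-R sketch (hypothetical column names, toy data) of reducing a
# 15 min series to the two per-day parameters described above.
df <- data.frame(
  timestamp = seq(as.POSIXct("2015-06-01 00:00", tz = "UTC"),
                  by = "15 min", length.out = 96 * 2),            # two days
  kWh = pmax(0, sin(seq(0, 4 * pi, length.out = 96 * 2))) * 0.9   # toy production
)
df$day <- as.Date(df$timestamp)

# Per-day peak 15 min reading and total energy delivered
daily <- aggregate(kWh ~ day, data = df,
                   FUN = function(x) c(peak = max(x), energy = sum(x)))
daily <- do.call(data.frame, daily)   # flatten the matrix column

# Crude proxy: how "full" the day was relative to its own peak
daily$fill_factor <- daily$kWh.energy / (daily$kWh.peak * 96)
```

The fill_factor here is just one candidate for the peak-vs-energy function; a real proxy would need calibration against known cloud data.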

One of the most important, but least glamorous, parts of data science is cleaning up the data, or even more importantly, automatically separating and remediating (if possible) anomalous data. And quite honestly, without a feedback mechanism from the data acquisition system that marks the data as “bad”, the problem of finding after-the-fact data collection failures is very similar to detecting solar panel / inverter failures. Here’s a real life example:

If I go back to my every 15min time series decomposition, it’s easy to spot some clearly anomalous data highlighted in the red boxes below:

  • Individual spikes well above the max
  • Clusters of spikes well above the max
  • Large periods of time where production is zero (we have to remember that at this scale, we’re not even seeing nights of zero production, so these are periods longer than 12 hours)

If I zoom closely into one of the spike clusters, I can get a better idea of what is going on. The data collected by SolarCity/Tesla appears to ping back and forth between zero and double the typical amount of energy received in 15 min.

If I look into my SolarCity data for one of those days, I get an even better idea of what is going on. The first 5 columns come directly from the API on the SolarCity website. The rows with the NA’s (Not Available) in those columns are 15 min intervals that were missing from the SolarCity data - I added them so that I could analyze the integrity of my data (BTW - I do the same with my Sense export data because there are indeed missing hours from occasional data dropouts). The 6th column, “kWh” is one I created from my SolarCity “Energy(kWh)” solar production data, except I replaced the NAs with zeros, because some time-series analyses don’t like missing intervals or NAs.
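The gap-filling step described above can be sketched in base R (column names are hypothetical); the full 15 min grid makes the dropouts explicit as NA rows, and a second zero-filled column keeps the time-series functions happy:

```r
# Sketch of the gap-filling step: build the complete 15-min grid for the
# period, merge the (possibly gappy) export against it, and add a
# zero-filled column for analyses that can't handle NAs. Names hypothetical.
raw <- data.frame(
  timestamp = as.POSIXct(c("2015-06-16 10:00", "2015-06-16 10:30"), tz = "UTC"),
  energy_kWh = c(0.45, 0.90)   # note the missing 10:15 interval
)

grid <- data.frame(timestamp = seq(min(raw$timestamp), max(raw$timestamp),
                                   by = "15 min"))
filled <- merge(grid, raw, by = "timestamp", all.x = TRUE)

# NA marks a dropout; a separate zero-filled column keeps both views
filled$kWh <- ifelse(is.na(filled$energy_kWh), 0, filled$energy_kWh)
```

Keeping both the NA column and the zeroed column preserves the dropout markers for forensic work while still feeding the decomposition a complete series.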

On initial inspection, it looks as if there might have been a data acquisition / networking problem that caused occasional dropouts, and the SolarCity collection compensated by doubling up on the previous data point in the final total. If it were that simple, I could do an automated fix. Unfortunately, it’s not that simple if I look at the sequence in the red box. The 1.80 reading doesn’t have any adjacent NA row, and it’s unclear whether the nearby, but not adjacent, 0.01 reading might have been a partial dropout or just a very cloudy 15 min. If any of you guys want to do some forensic pattern checking, I can send you a spreadsheet for the whole month of June 2015.

As for the gaps, I think I have most of them identified. The longest time-series strings of 0 kWh readings all correlate with the missing “dropout” hours that I inserted into the time series, so I at least have control of those points, if I wanted to remediate them, though picking the right value might be more challenging.

Here’s one example: The night of Jun 15th 2015 running into a dropout in the early AM of Jun 16th, running through until the morning of Jun 17th.



@ixu, @dianecarolmark, all,
Here comes the “fun” part. Trying to figure out automatically when my 15 min SolarCity readings are too big (beyond the capability of my system at that particular moment in time) and might be due to an acquisition glitch. Solving that will shed some light on solving the panel/inverter failure issue.

The first simple step might be to look for a value that is too big for that time of day during that time of year. Just looking at the data, it appears that any 15 min reading above 1 kWh is a problem, but I’m sure that the max cutoff varies with both time of day and time of year. Since I haven’t yet invested in solving the orbital dynamics of hourly insolation, I’m going to try to use just my daily calculations from earlier. If I plot my 15 min max data for each day against the daily theoretical insolation (H0h) number for that day, I get essentially a straight line that I can regress, once I pull out the bad points.

This gives a date-dependent max value for 15 min generation that varies between 0.8 and 1.0 kWh, and is good for finding the biggest problem points. All I need to do is compare every 15 min value against the predicted max and remove points above that max for every point in the year. That gives a really busy but pretty cool chart where 902 data points have been deemed bad or at least worthy of investigation. That’s not so far off from the 1200 or so missing points I needed to add. But looking at the line of demarcation, I’m not so sure that every point on the horizon for my simple criterion is bad, plus I’m not detecting any too-big points during the less sunny hours of the day, since my criterion is based on the daily max.
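Here’s roughly how that screening could look in base R, on synthetic data (the real inputs would be my daily 15 min maxima and the H0h series):

```r
# Sketch of the screening rule on synthetic data: regress each day's max
# 15 min reading against that day's theoretical daily insolation (H0h),
# then flag readings above the fitted per-day max (plus 5% slack).
set.seed(1)
days <- data.frame(H0h = runif(365, 4, 11))            # toy daily insolation
days$max15 <- 0.08 * days$H0h + rnorm(365, 0, 0.01)    # toy daily 15 min maxima

fit <- lm(max15 ~ H0h, data = days)
days$pred_max <- as.numeric(predict(fit, days))

# Compare some 15 min readings against their day's predicted ceiling
readings <- data.frame(day = c(1, 1, 2), kWh = c(0.30, 1.60, 0.50))
readings$bad <- readings$kWh > days$pred_max[readings$day] * 1.05
```

In practice the regression itself needs the bad points pulled out first, so this would be run iteratively: fit, flag, drop, refit.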

Two things to try next:

  • Look more closely at the 902 - how many are truly acquisition issues vs. just best case numbers.
  • Start attacking an hourly insolation model.

How do you sort through the 902 “potentially too big” data values to figure out which ones are legit and which ones aren’t? One way would be to eyeball each one, but that kind of defeats the automation approach. Since I know one mode of failure, the ping-ponging between NA(0) and double values, I’m going to try an automated approach that looks at nearest neighbors! I’m going to cluster the points using a simple kmeans() algorithm based on the current kWh and its two nearest neighbors. kmeans() will automatically cluster the data based on the 3D spatial distance between those three values. I arbitrarily picked 3 clusters for the algorithm.
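A minimal base-R sketch of that clustering, on a toy series (embed() builds the previous/current/next triples):

```r
# Toy sketch of the neighbor clustering: build (prev, current, next)
# triples with embed(), then cluster with kmeans() and k = 3.
set.seed(42)
kwh <- c(0.4, 0.5, 0.0, 1.2, 0.0, 1.1, 0.5, 0.6, 0.5, 0.4)  # toy 15 min series
triples <- embed(kwh, 3)[, 3:1]          # columns become prev, current, next
colnames(triples) <- c("prev", "current", "nxt")

km <- kmeans(triples, centers = 3, nstart = 25)
result <- data.frame(triples, cluster = km$cluster)
# Ping-pong points (large current value, near-zero neighbor) should land
# in the same cluster
```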

Here’s the clustering vs the previous neighbor:

Here’s the same clustering vs. the next neighbor:

Something obvious becomes apparent quickly.

  • The reddish cluster represents points that are close to the bottom of my “bad” range, but have close nearest neighbors. And none have nearby neighbors, even 2 away on either side, that are small values (no ping-pong pattern). That tells me they are legit.
  • The blue cluster is filled with the “ping-pong points” where the point is well into my “bad” range, and where at least one next door neighbor is 0 (NA) or some small value.
  • The green cluster is a little more dicey - some points are clearly well into the “bad” range, but others are on the edge of acceptability. Looking at the data more closely, I’m guessing nearly all of them are “bad” but only 36 of the 98 data points have the “ping-pong” symptom where a nearest neighbor either next-door or 2 away is a very small value (< 0.02). So there must be another kind of data acquisition gremlin causing those.

Bottom line, looking at the values vs. their neighbors has been a great tool for discriminating between real data issues and data on the cusp. The real trick is to “can” this kind of testing in a way that works for smaller hourly values as well. So far we’ve only been comparing against the max daily estimates, not hourly.
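One way to “can” the ping-pong test is a small function like this sketch - max_ok, eps, and reach are all made-up knobs, and max_ok would eventually need to come from an hourly model rather than a daily one:

```r
# Hypothetical "canned" ping-pong test: flag a reading when it exceeds a
# supplied ceiling AND any neighbor within `reach` positions is near zero.
is_pingpong <- function(kwh, max_ok, eps = 0.02, reach = 2) {
  n <- length(kwh)
  sapply(seq_len(n), function(i) {
    if (kwh[i] <= max_ok) return(FALSE)
    nb <- kwh[setdiff(max(1, i - reach):min(n, i + reach), i)]
    any(nb < eps)
  })
}

x <- c(0.5, 0.0, 1.3, 0.6, 0.7, 1.3, 0.6, 0.6)
is_pingpong(x, max_ok = 1.0)   # only the first 1.3 has a near-zero neighbor
```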

BTW - As you can see, data cleaning and integrity checking isn’t all that glamorous, and can require as much focus as machine learning, especially when the data is remote and without any additional feedback (network out, clamp slightly ajar, etc.). But it is really crucial before you admit said data into your “golden dataset” for training machine learning…

Just a little factlet in case it’s useful: you can end up with generation greater than your calculated maximum output for a given value of solar insolation due to reflection, including reflection from clouds. I found this out after researching why on partly cloudy days, my sister’s pv array sometimes spiked above the sunny day value for that time of day.

These excursions tend to last more on the order of seconds, not minutes, though.

I’m not suggesting this may be relevant to your data, but it is something that I found counterintuitive and so worth noting.


Next step - Intradaily insolation. But now I’m going to leave the orbital and atmospheric dynamics to somebody else, Oscar Perpinan Lamigueiro of Universidad Politecnica de Madrid, who has built out a great R-based environment for simulating and analyzing solar production, solaR.

Plus a useful education on everything that happens between the sun, the earth and the electric grid, and how solaR simulates each step along the way.
Oscar Perpiñán (2012). solaR: Solar Radiation and Photovoltaic Systems with R. Journal of Statistical Software, 50(9), 1-32.
His flow diagram and associated legend highlight how little of the calculation I have simulated so far. And for you @ixu, Oscar offers a method for detecting failures in one system via comparison with other nearby producing systems.

Just to be clear, on my own, I have only explored the sun geometry component (Sol), and only at a daily level so far. That just provides a calculation for the amount of solar energy hitting a horizontal plane to the surface of the earth, at the top of our atmosphere. The atmosphere adds all kinds of absorptions, reflection, diffusion and spectral effects, based on the weather and time of day.

Thanks @dianecarolmark,
Right now I’m still dealing with the pre-atmospheric calculations that represent the maximum envelope of energy hitting the top of the atmosphere. From there the light energy gets absorbed as well as separated into Global (G0), diffuse (D0), and direct (B0) irradiance components.

At some point I’ll get to that step in the calculation using solaR, but first I have to tackle the daily calculation, then figure out where I’m going to get my meteorological data. I’m hopeful since I’m close to several NOAA sites.

As you suggest, diffuse (D0d) irradiance can exceed direct (B0d) irradiance due to weather and time of year (D0d spikes and B0d slumps in Oscar’s simulations below).

Ok, onto the intraday calculations. First off, a bit of terminology, since I’ll be doing my new charts using solaR conventions, which are slightly different than my simpler, early calculations.

From solaR

  • Bo0 - Extra-atmospheric instantaneous irradiance (W/m2) - One needs to integrate over a specific time region to get energy/m2
  • Bo0d - Extra-atmospheric DAILY irradiation incident on a horizontal surface (Wh/m2)

From my older calculation

  • H0h - An approximation of Bo0d - Extra-atmospheric DAILY irradiation incident on a horizontal surface (Wh/m2)
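For anyone who wants to reproduce the H0h-style number, here’s a sketch of the standard textbook formula (after Duffie & Beckman) for daily extra-atmospheric irradiation on a horizontal surface - this is my reading of the standard equations, not solaR’s internal code:

```r
# Sketch of the standard textbook formula for daily extra-atmospheric
# irradiation on a horizontal surface, in Wh/m2 - the quantity that H0h
# and Bo0d approximate.
H0_daily <- function(lat_deg, doy) {
  Gsc   <- 1367                                  # solar constant, W/m2
  phi   <- lat_deg * pi / 180
  delta <- 23.45 * pi / 180 * sin(2 * pi * (284 + doy) / 365)  # declination
  ws    <- acos(pmin(1, pmax(-1, -tan(phi) * tan(delta))))     # sunset hour angle
  E0    <- 1 + 0.033 * cos(2 * pi * doy / 365)   # eccentricity correction
  (24 / pi) * Gsc * E0 *
    (cos(phi) * cos(delta) * sin(ws) + ws * sin(phi) * sin(delta))
}

H0_daily(37.45, 172)   # near the June solstice at my approximate latitude
```

The clamping inside acos() keeps the sunset hour angle defined at polar latitudes, where the sun may never rise or set.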

I probably should compare Bo0d vs. H0h, but that’s a daily calculation vs. an intraday calc. Instead I’m going to do a sanity check of Bo0 over the 6 years that my system has been in operation - pretty! Each color is a different hour of the day that the sun is out.


Seems to make sense!

Now it’s time to look at my 15 min solar output (kWh) vs. Bo0 at the beginning (or end?) of that time interval.


Wow, I was expecting a straight line but I have a “wing” instead. I colored by month to see if I could spot any discernible pattern. I see two things of interest.

  • We can see the SolarCity data acquisition glitch in the “shadow wing” that has a 2x kWh pattern above the main “wing”. But just because I can see it doesn’t mean I can automatically identify and remediate it just yet.
  • As expected, the kWh vs. Bo0 “wing” is longest during the high-sun months and shortest during the lower-sun months. That’s especially obvious in the same graph from the solaR plotting package.

Switching to coloring by hour, another pattern emerges:

The curve goes out lower in the morning and comes back with higher production, for what should be the same Bo0. Why would the same power/energy striking the top of the atmosphere result in substantially different production? I have a few theories:

  • Bo0 really isn’t the same - maybe my time base is off by 15 min (start of interval vs. end of interval?)
  • Atmospheric conditions and temperature?
  • Angle of incidence on my solar panels could favor afternoons

I guess I’m going to have to go through all the solaR calculations! But first, maybe I’ll just double-check my daily data, since that takes the time base issue off the table. Tomorrow.


Wow all this is great. Still way back digesting. Tomorrow? I need a month!

Not that you need asides, but I assume you saw this


Cool - I hadn’t looked at rasterVis yet, though I have been using his xyplot() methods occasionally when I first try to plot an object out of his code. But I’m doing most of my plotting in ggplot2 since it is very predictable and fast.

And the radiation plots are truly cool, though I’m trying to wrap my mind around how you would use these month to month comparisons.

Couldn’t hold myself back… Had to double-check my 2 old equations of daily irradiance against the solaR package’s much more sophisticated calculation. Good news, everything seems kosher. My older analysis holds, and Bo0d looks like it takes into account some very small additional tertiary effects beyond H0h. Bo0d is almost exactly 10000x H0h (just different units).

But the best part is that I now trust my intraday results as well.

solaR Bo0d vs. Cyclic (a simple cosine equation with a period of 365.25 days)

lm(formula = Cyclic ~ Bo0d, data = SolarMix)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.25787 -0.14119  0.05914  0.12754  0.16210 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.159e+00  1.009e-02  -214.0   <2e-16 ***
Bo0d         2.637e-04  1.166e-06   226.1   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1455 on 2302 degrees of freedom
Multiple R-squared:  0.9569,	Adjusted R-squared:  0.9569 
F-statistic: 5.113e+04 on 1 and 2302 DF,  p-value: < 2.2e-16

solaR Bo0d vs. my more sophisticated H0h formula

lm(formula = H0h ~ Bo0d, data = SolarMix)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0223529 -0.0081799  0.0009893  0.0102988  0.0163886 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.767e-02  7.701e-04  -74.89   <2e-16 ***
Bo0d         1.027e-04  8.904e-08 1153.38   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01111 on 2302 degrees of freedom
Multiple R-squared:  0.9983,	Adjusted R-squared:  0.9983 
F-statistic: 1.33e+06 on 1 and 2302 DF,  p-value: < 2.2e-16

And for those of you who know what a Residual is, you can see the extra effects and associated magnitude calculated in Bo0d vs. H0h in this plot…

Time to solve my “wing” problem!

Still sorting out ways to linearize my “wing” more using the solaR package. I’m now trying to convert from radiation striking the top of the atmosphere to radiation hitting the earth’s surface, then hitting my tilted solar panels, which are oriented about 20 degrees west of south. Those factors might just be the reason for the loop pattern.

I have seen a similar kind of loop before when comparing my Sense solar generation vs. my SolarCity generation, where Sense was showing more power being produced in the morning hours vs. SolarCity, and the opposite in the afternoon. I eventually tracked it down to interval accounting being off between the two of them. But these “wings” aren’t as symmetrical around the unity line, so there’s more at play here…

I did do a quick shot at automatically locating the bad points using a simple linear model. First, I removed the 605 missing points (that number dropped from over 1100, because I’m now only looking at daytime data) that were manufactured by me to fill in the gaps. Those were simple because they had their other entries marked as NA. Then I used linear modeling to identify the 400 points with the largest positive residuals. That did a good job finding most of the “shadow wing”. I’m guessing I could now cut the power production for those points in half and stuff that data into the nearest NA or smallest neighbor. All this is under the assumption that SolarCity and the inverter are set up to work around data transmission failures.
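The residual-ranking step looks something like this in base R, with synthetic data standing in for the Bo0 and kWh series (the planted 2x points play the role of the shadow wing):

```r
# Synthetic stand-in for the residual-ranking step: fit a line, then take
# the largest positive residuals as "shadow wing" candidates.
set.seed(7)
x <- runif(2000, 0, 1200)                  # stands in for Bo0
y <- 0.0008 * x + rnorm(2000, 0, 0.02)     # stands in for 15 min kWh
doubled <- sample(2000, 50)                # plant a 2x acquisition glitch
y[doubled] <- 2 * y[doubled]

fit <- lm(y ~ x)
suspects <- order(resid(fit), decreasing = TRUE)[1:50]

# Most of the planted glitch points should surface among the top residuals
mean(suspects %in% doubled)
```

Glitched points with small x (low sun) have small residuals and slip through, which mirrors the low-sun blind spot noted earlier.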

I also decided to remove those 400 + 600 points and remodel using a new linear equation. Then I picked off the next 400 points with the highest absolute residual (plus or minus). The next bunch would come from the bottom.

I also tried one more interesting R package called “robustbase”, which offers more robust fitting algorithms that are designed to deal with bad data. In the case of a linear model, lmrob() iteratively assigns weights to points during regression, giving smaller weights to points with the highest residuals. After a number of remodeling iterations, the weights used in regression stabilize, and you have your result. I have rounded the weights (usually 0.0-1.0) and multiplied them by 10 for annotating the “wing” below. As you can see, this also helps pick off the shadow wing above the wing, as well as low-lying power production during high-Bo0 periods (everything 7 and below). But I’m sure that some of the “shadow wing” is still obscured by the good data. Plus, the weight 7 and 6 data below the wing is probably legit.
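robustbase’s lmrob() does this properly; for intuition, here’s a minimal base-R sketch of the iteratively-reweighted idea with Tukey bisquare weights on synthetic data - not lmrob’s actual algorithm, just the core loop:

```r
# Minimal base-R sketch of the reweighting idea (Tukey bisquare), to show
# why glitch points end up with small weights.
set.seed(3)
x <- runif(500, 0, 1200)
y <- 0.0008 * x + rnorm(500, 0, 0.02)
y[1:20] <- 2 * y[1:20]                  # planted "shadow wing" points

w <- rep(1, length(y))
for (it in 1:10) {
  fit <- lm(y ~ x, weights = w)
  r <- resid(fit)
  s <- median(abs(r)) / 0.6745          # robust scale estimate (MAD)
  u <- pmin(abs(r) / (4.685 * s), 1)    # 4.685 = usual bisquare tuning constant
  w <- (1 - u^2)^2                      # weight falls to 0 for large residuals
}
# w[1:20] (the planted outliers) should now sit near zero
```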

OK, time to make some more progress on connecting ground-based measurements with simulated analytic results from solaR. I’ve been stymied a bit by the last stages of solaR that translate top of atmosphere equations/readings for Bo0 to best case and realistic solar production. There are really 3 steps in between:

  • Bo0, geometric, temperature and calibration measurements to earth-surface measurements of G0 (GHI - global horizontal irradiance) and D0 (DHI - diffuse horizontal irradiance)
  • From those to Geffective and Deffective, based on the tilt and orientation of my fixed solar install
  • And from those irradiances to actual solar production.

I really just want to get to Geffective and chart it against my solar production. But right now, it’s not clear those steps are working the way I’m trying to use the package.

In the meantime, I dug up some local half-hourly solar measurements for my area courtesy of the NREL NSRDB resource. It’s a 2013 through 2015 record that I can use to compare my data against, as well as to calibrate and even feed solaR, if I can get it working. But for now I’m going to use it to make sense of my solar data.

The first thing I attempted to do was chart measured GHI (G0) for this time period against my kWh production during the same period.

Wow! That looks familiar. Another “wing” pattern. It looks very close to my Bo0 (calculated using solar geometry) vs. kWh, though a little more diffuse since it only contains 50K points instead of the 225K points in the original (the measured data covers only half the data period at half the sample frequency - 1/4 as many points). If I compare against the Clearsky GHI, it looks even closer to the Bo0 theoretical, because NREL’s Clearsky calculation removes the effects of clouds.

The same pattern exactly, including the “shadow wing” (of course), just compressed in the x dimension. Just for fun, let’s compare the Bo0 (theoretical calculated value) against the Clearsky GHI (a compensated ground measurement).

Very close to a linear relationship, with a little bit of a loop/hysteresis. I’m still suspicious that the loop may come from some timebase offset between the two measurements, even though they are charted based on the same measurement times. Just estimating, it looks like Clearsky GHI equals 1000 when the top-of-the-atmosphere calculation gives 1250, so there’s about a 20% energy loss from the top to the bottom of the atmosphere without clouds at my latitude.
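That eyeball estimate can be turned into a number by regressing Clearsky GHI on Bo0 through the origin - the slope is an effective clear-sky transmission. A sketch on synthetic stand-in data:

```r
# Regress clear-sky GHI on Bo0 with no intercept; the slope is an
# effective clear-sky atmospheric transmission. Toy stand-in data.
set.seed(11)
bo0 <- seq(100, 1250, by = 50)                  # top-of-atmosphere, W/m2
ghi <- 0.8 * bo0 + rnorm(length(bo0), 0, 5)     # toy clear-sky ground values

fit <- lm(ghi ~ 0 + bo0)                        # through-origin fit
transmission <- coef(fit)[["bo0"]]
loss <- 1 - transmission                        # should sit near the ~20% above
```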

Two pieces of good news…

  • Local ground data agrees with my solaR-based Bo0 calculations, so I should be able to use it to calibrate. I might still need to take a look at what creates the loop between Bo0 and Clearsky GHI.
  • I’m encouraged to focus on the next two steps in the calculation process.

ps: The other thing that makes me want to look more closely at the loop between my calculated data (Bo0) and my ground-based observation data (Clearsky GHI) is that the Zenith angle coming from each shows the same loop behavior in the early morning and late evening.

ZenithN is ground-based from the NREL data for latitude 37.65N, and ZenithS, produced by solaR, is for latitude 37.453N. That suggests that either my timebase is off, or it might be caused by slight differences in longitude (BTW - longitude differences = time differences) and latitude between the measurement points. More, soon…
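The longitude-as-time point is easy to quantify: the earth rotates 360 degrees in 24 hours, so each degree of longitude is 4 minutes of solar time. A tiny helper (the longitudes below are made up for illustration):

```r
# 360 degrees / 24 h => 1 degree of longitude = 4 minutes of solar time.
solar_time_offset_min <- function(lon_a, lon_b) (lon_a - lon_b) * 4

solar_time_offset_min(-122.0, -124.5)   # 2.5 degrees apart -> 10 minutes
```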


@kevin1, here’s another solar goal I want to (philosophically at least) throw into the mix, seeded by this article in the NYTimes (amazing opening picture btw!)

Before I got to it in the article, I was thinking this very thought (to quote):

“Ms. Polos, a nurse, recalls the power going out 10 times in the past year. If she and her family need to get out because of a fire, she said, she wants to be able to keep her Nissan Leaf electric car charged.”

One of the factors in insolation is of course smoke (& pollution).

So in parsing solar data for indications of inverter anomalies and so on, and (at scale) being able to see cloud patterns, we can add smoke/pollution detection, and more crucially (combined with wind data) tell your car when to charge for emergency escapes.


Two Suggestions.

  1. If you are looking for pollution data you should check out AirNow. You can get air quality data for locations throughout the U.S. And, they have an API so you can integrate the data into your calculations. I use AirNow data in my own home to help with managing indoor air quality, as August is the time we have issues with heavy smoke from wildfires.

  2. I tried to make a suggestion about using a Weatherflow smart weather station earlier in this thread while I was on vacation. Brain cells must have been on vacation too at the time, and I didn’t provide the correct information. The Weatherflow station is made up of 2 parts, the Sky unit and the Air unit. The Sky unit includes a sensor for solar radiance. Again, Weatherflow has an easy-to-use API for accessing the data from your weather station, and this could be used to directly integrate the data into your calculations.

Hope this helps


A couple things I have discovered so far.

  • Cloud cover and other weather, at least in my area, is hyperlocal. I have been trying to compare local NOAA / NREL data from various sources and can’t get the periods of lower intensity to line up, even when the reading sources are within a 10 mile radius.
  • We can try using proxies like UV index or AQI, but they generally don’t line up either. And using uncorrelated data for training is just a formula for bad predictions.

I may have to buy a Weatherflow station.


Hyperlocal indeed!

The definition of weather is borne out in Summer when some people wear sweaters in the office and others don’t.

Some thoughts:

  • One cannot argue with incorporating as much hyperlocal weather data as possible into solar optimization and Sense-data decryption BUT …

  • Using a dissimilar PV panel for solar irradiance measurement, along the lines of the Weatherflow system, is non-optimal. While I would expect the Weatherflow PV panel to align with a given array’s inferred measurement, having only that ground-level irradiance broken out from the array is less useful than separating out an individual array’s output … and getting that into Sense directly.

  • I don’t fully comprehend the varied output efficiency across different PV panel types for the same solar spectrum, but there has to be some non-linear variation, and introducing that into the assessment may make life more difficult. Of course, that can potentially be accounted for in the calculus, but the simplest version (if simple!?) would be to separate out the readings from one panel.

How localized are services like this getting?

They quote a spatial resolution of 250m. At 20 mph, that’s 28 sec. For a 500m well-defined cloud shadow, that’s around a 1-minute-long “pulse” on a small array. The interesting thing about non-hyperlocal measurement in these days of satellite (and ground-based camera!) cloud-movement assessment is that it starts to feel less like weather prediction and more like “atmospheric notification”.

In that regard, I watched clouds last week. The speed is mesmerizing. The bigger (and non-orthogonal to the wind!) the array(s), the more one imagines the gentle rolling of the Sense Solar waveform.


I would guess that any kind of relatively similar PV-oriented solar cell, mounted nearby with the same tilt and azimuth as my panels, and out of the shade line, would give much better correlation with my production than any kind of geometric prediction.

But you have me questioning one other parameter. Most solar monitoring is done by measuring current production into a fixed load. But PV solar panels have optimizers that continuously vary the load for max power production per panel.

Interesting paper on solar soiling I thought I would throw in here … just in case anybody thought it wasn’t a potential major component of what Sense should do. Damn, what were those birds eating?



Including the following for completeness. I finally sorted out a reasonable comparison of solar simulations vs. Sense data. If I look at just the months where we have reasonably clear skies here, I get a fit (R, R2) of about 0.86. I’m also underproducing the model, but I’m on my 7th year, so some degradation and soiling has happened over time, which is also traceable through my data over the previous 6 years.