Loadshape clustering from AMI data

tagged: Loads Data Analytics

How to generate loadshape clusters from AMI data.

Daily loadshape identification is a common problem in utility data analysis. Before the advent of AMI metering, end-use metering surveys and simulation were often used to develop diurnal loadshapes for different seasons and weekday types. These surveys usually involved relative modest numbers of customers, whose customer class was usually well-known, and the loadshapes were often one of primary deliverables. Consequently generating the range of loadshapes observed in the surveys was not typically a problem for utility analysts.

With the advent of automated metering infrastructure (AMI), the volume of data increased significantly. Unfortunately, the analysis methods used for the end-use load surveys are not always helpful because the AMI data is not labeled. To address this problem, unsupervised machine learning methods can be used to identify the principal loadshapes found in the AMI data.

This post describes the methodology use by the OpenFIDO loadshape pipeline. The use-case presented here is based on a data set from a utility containing 1 year of AMI data from 2015 for 973 customers. The original dataset has been sanitized to preserve privacy and reduce the size of the data repository. In addition the subhourly data has been rolled up to hourly data.

The methodology used to create the loadshape data is based on an unsupervised learning algorithm known as k-means clustering which looks for $k$ loadshapes in the data. This method partitions $N$ observations into $k$ clusters in which each observation belongs to the cluster with the nearest mean. Each cluster mean serves as a prototype of the cluster. This partition produces a data space divided into so-called “Voronoi” cells. The clusters minimize the loadshape variances (i.e., the Euclidean distances).

The analysis performed here applies the loadshape clustering method step by step, visualizing the intermediate results for inspection, and comparing the final result with an alternate t-distributed stochastic neighbor embedded ($t$-SNE) partitioning method. The $t$-SNE algorithm is a dimensionality reduction method that can be used to visualize high-dimensional data as a two or three-dimensional map that does not share any common elements with the $k$-means method being validated.

The entire process is divided into seven steps, briefly summarized as follows:

Load the AMI data - The original data was downloaded from a utility AMI system and stored in a CSV file. This file was cleaned of identifiable information and aggregated to hourly interval energy values before being stored for analysis.
Identify hour types - The hours of the year are grouped into a total of 192 hour types for each season, day type, and hour that comprise the loadshapes to be found in the data.
Aggregate hour types - The hourly meter data is aggregated by hour type to find the mean value for each meter during each hour type.
Cluster the meters - The meters are clustered using $k$-means into $K$ loadshapes and each meter is assigned to the loadshape with the smallest mean-squared difference.
Rescale the loadshapes - Each meter is assigned a loadshape scale and offset to ensure that the loadshape has the same variance as the meter which is assigned to it.
Generate the loadshape weights - Each loadshape is assigned a weight based on the fraction of the total annual energy associated with that loadshape.
Validate the results - The results are validated using a $t$-SNE partition method.

Load the AMI data

The first task is to load the AMI data from the sanitized AMI data repository. In this case we are only interested in the timestamps with timezones, the meter ids, and the power values. All other data in the repository is ignored.

import pandas as pd
import datetime as dt
pd.set_option("max_columns",11)
pd.set_option("max_rows",9)

data = pd.read_csv("ami.csv.gz",
    converters = {
        "timestamp" : dt.datetime.fromisoformat,
        "meter_id" : int,
        "tz" : int,
        "power" : float,
    })

data

	meter_id	timestamp	tz	power
0	1	2015-01-01 09:00:00	-8	6.31
1	1	2015-01-01 10:00:00	-8	4.48
2	1	2015-01-01 11:00:00	-8	5.88
3	1	2015-01-01 12:00:00	-8	6.52
...	...	...	...	...
8904095	973	2016-01-02 05:00:00	-8	5.61
8904096	973	2016-01-02 06:00:00	-8	5.39
8904097	973	2016-01-02 07:00:00	-8	6.08
8904098	973	2016-01-02 08:00:00	-8	5.30

8904099 rows × 4 columns

The data set contains nearly 9 million records, with meter id, timestamps, timezone, and power measurements. The timestamps are provided in UTC and the timezone provides the offset of the local time from the UTC timestamp. These will be necessary to determine time-dependent information such as the season, day of week, and local hour of day.

Identify hour types

The timestamps are then analysed to identify the hour types that are used to generate the loadshapes. The hour type is based on the season (winter=0, spring=1, summer=2, and fall=3) and the day type (weekday=0 and weekend=1). These, along with the 24 hours of the day, result in 192 hour types, which are associated with each meter observation.

hour = (data["timestamp"].apply(lambda x:x.hour-1)+data["tz"]).astype(int).mod(24)
weekend = data["timestamp"].apply(lambda x:(int(x.weekday()/5)))
season = data["timestamp"].apply(lambda x:x.quarter-1)
data["hourtype"] = season*48 + weekend*24 + hour

data

	meter_id	timestamp	tz	power	hourtype
0	1	2015-01-01 09:00:00	-8	6.31	0
1	1	2015-01-01 10:00:00	-8	4.48	1
2	1	2015-01-01 11:00:00	-8	5.88	2
3	1	2015-01-01 12:00:00	-8	6.52	3
...	...	...	...	...	...
8904095	973	2016-01-02 05:00:00	-8	5.61	44
8904096	973	2016-01-02 06:00:00	-8	5.39	45
8904097	973	2016-01-02 07:00:00	-8	6.08	46
8904098	973	2016-01-02 08:00:00	-8	5.30	47

8904099 rows × 5 columns

Aggregate hour types

The next task is to calculate the average load for each hour type of each meter. The result is then pivoted to create one column for each hour type and one row for each meter.

groups = data.groupby(["meter_id","hourtype"]).mean().reset_index()
groups = groups.pivot(index="meter_id",columns="hourtype",values="power")
groups.fillna(method="ffill",inplace=True)

groups

hourtype	0	1	2	3	4	...	187	188	189	190	191
meter_id
1	5.737077	5.319385	4.922769	4.934000	4.465538	...	7.204231	7.076154	6.870385	6.311154	5.900769
2	0.549077	0.547538	0.547231	0.548154	0.546923	...	0.693462	0.690000	0.686538	0.684231	0.683462
3	2.947692	2.996923	2.886154	2.873846	3.926154	...	10.800000	10.476923	6.923077	3.107692	2.538462
4	31.214615	31.007846	32.279231	32.323846	38.841538	...	48.923462	45.794615	41.991538	38.596538	36.475000
...	...	...	...	...	...	...	...	...	...	...	...
970	8.640615	7.935385	7.638000	7.110000	7.212154	...	14.208077	13.861923	12.586923	11.545769	10.458077
971	10.575231	9.265846	7.694308	7.691077	8.035385	...	13.876538	13.421538	14.191923	13.007308	11.916154
972	0.369385	0.370308	0.369692	0.367385	0.362308	...	0.437308	0.435769	0.436154	0.435769	0.430000
973	12.342769	12.147077	10.622462	8.666462	8.354923	...	10.270000	10.162692	9.937692	9.845769	9.771538

973 rows × 192 columns

Note that some meters may not have sufficient data to generate loads shapes. The following identifies which meters do not have loadshapes associated with them.

missing = data.iloc[list(set(data["meter_id"].unique()).difference(set(groups.index.values)))]
#missing.to_csv("missing.csv")
list(missing["meter_id"].unique())

[]

We can visualize the average load for each meter by hour type, where hours 0-23 are winter weekdays, hours 24-47 are winter weekends, hours 48-71 are spring weekdays, etc.

groups.T.plot(figsize=(15,8), legend=False, color='blue', alpha=0.05,grid=True,xlim=[0,191],xticks=range(0,192,24));

png

Clustering the meters

The next task is to find the clusters that partition this dataset among loadshapes with minimal Euclidean distances to the cluster centroid. We use the $k$-means algorithm to identify 10 clusters on the normalized hour type profiles and search for the smallest number of clusters necessary to ensure every group has more than 1 loadshape in it.

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
rescaled = MinMaxScaler().fit_transform(groups.values.copy())
ok = False
n_clusters = 10
while not ok and n_clusters > 1:
    n_clusters -= 1
    groups = data.groupby(["meter_id","hourtype"]).mean().reset_index()
    groups = groups.pivot(index="meter_id",columns="hourtype",values="power")
    groups.fillna(method="ffill",inplace=True)    
    kmeans = KMeans(n_clusters=n_clusters)
    group_found = kmeans.fit_predict(rescaled)
    group_data = pd.Series(group_found, name="loadshape")
    groups.set_index(group_data, append=True, inplace=True)
    ok = True
    for group in groups.index.get_level_values(level=1).unique():
        if len(groups.xs(group, level=1)) == 1:
            ok = False

A histogram of the loadshapes found shows that only a few loadshapes are common and the remaining ones are idiosyncratic.

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
ax = groups.reset_index().groupby("loadshape")["meter_id"].count().plot.pie()
ax.set_title("Loadshape customer count")
ax.set_ylabel("");

png

The loadshape data is obtained by aggregating the median load and relabeling the columns to improve readability.

loadshapes = groups.groupby("loadshape").mean()
seasons = ["win","spr","sum","fal"]
weekdays = ["wd","we"]
loadshapes.columns = [f"{seasons[season]}_{weekdays[weekend]}_{hour}h" for season in range(4) for weekend in range(2) for hour in range(24)]
loadshapes

	win_wd_0h	win_wd_1h	win_wd_2h	win_wd_3h	win_wd_4h	...	fal_we_19h	fal_we_20h	fal_we_21h	fal_we_22h	fal_we_23h
loadshape
0	7.518608	6.831472	6.463544	6.296450	6.407418	...	13.502548	13.197431	12.505875	11.486388	10.157408
1	334.999769	329.453596	331.760750	330.994173	342.655173	...	384.495144	373.887308	356.455865	333.860000	312.682019
2	135.139937	133.300208	129.952633	130.317991	133.486226	...	199.509276	193.379548	185.591290	172.179457	160.572534
3	862.073077	862.615385	862.716923	859.179231	851.286923	...	870.253846	867.132692	863.890385	862.736538	862.396154
4	42.580202	40.047221	39.161096	38.817104	40.577592	...	68.477997	64.067743	59.361592	53.568761	48.451841

5 rows × 192 columns

The final results are plotted with the original data to visualize how the loadshapes are representative of each cluster.

rows = int(loadshapes.shape[0]/2+0.5)
fig,ax = plt.subplots(rows,2, figsize=(18,5*rows))
for group in groups.index.get_level_values(level=1).unique():
    subset = groups.xs(group, level=1)
    ax = plt.subplot(rows,2,group+1)
    plt.plot(subset.T,alpha=0.05,ls='-',color="blue")
    plt.plot(subset.median(), alpha=1.0, ls='-',color="black")
    ax.set_title(f"Loadshape {group} (N={len(subset)})")
    ax.set_ylabel('Power [kW]')
    ax.set_xlabel('\nHour type')
    ax.grid()
    y0 = ax.get_ylim()[0]
    for s in range(4):
        for w in range(2):
            label = plt.text(s*48+w*24+12,y0,["Weekday","Weekend"][w],axes=ax)
            label.set_ha("center")
            label.set_va("top")
        label = plt.text(s*48+24,y0,["\nWinter","\nSpring","\nSummer","\nFall"][s],axes=ax)    
        label.set_ha("center")
        label.set_va("top")
    plt.xticks(range(0,192,24),[],axes=ax)
    plt.xlim([0,191])            

png

Rescaling the loadshapes

When the identified loadshapes are used to generate a load model, the mean loadshape power (as well as the total energy) will always match the original meters’ means and total energy because the methodology uses the mean power to identify the cluster centroids. However, the methodology does not address the different variances among the meters and these will not match the loadshapes centroids’ variances.

To address this, we compute the standard deviation for each meter power reading and its corresponding loadshape value.

values = groups.reset_index().set_index(["meter_id","loadshape"]).melt(ignore_index=False).reset_index().set_index(["meter_id","loadshape","hourtype"])
meters = data.drop(["tz","timestamp"],axis=1).set_index(["meter_id","hourtype"])
mapping = meters.join(values)
mapping.groupby("meter_id").std()

	power	value
meter_id
1	1.582521	1.127028
2	0.076137	0.071638
3	4.697922	4.305768
4	7.073779	5.767524
...	...	...
970	3.891709	3.231951
971	3.088949	2.161910
972	0.056724	0.025850
973	5.555886	3.713034

973 rows × 2 columns

To rescale the loadshape so that each meter’s loadshape has the same variance we use a linear function for each meter $m$

\[y_{m,t} = scale_m x[h_{t}] + offset_m\]

where the scale $scale_m = { \sigma^2_m \over \sigma^2_l}$ with $\sigma^2_m$ is the variance of the meter data and $\sigma^2_l$ is the variance corresponding loadshape data, and $offset_m = (1-scale_m)~\mu_m$ with $\mu_m$ being the mean of the meter data.

The histograms of the scales and offsets verifies the reasonableness of the range of values found.

fig,ax = plt.subplots(1,2,figsize=(18,5))
scale = pd.DataFrame(mapping.groupby("meter_id").std().power/mapping.groupby("meter_id").std().value,columns=["scale"])
scale.hist(figsize=(10,5),density=True,ax=ax[0]);
offset = pd.DataFrame((1-scale).scale*mapping.groupby("meter_id").mean()["power"],columns=["offset"])
offset.hist(figsize=(10,5),density=True,ax=ax[1]);

png

The scales and offsets are now joined with the original meter data to compute the rescaled values.

mapping = mapping.join(scale)
mapping = mapping.join(offset)
mapping["rescaled"] = mapping.value*mapping.scale + mapping.offset
mapping

			power	value	scale	offset	rescaled
meter_id	hourtype	loadshape
1	0	0	6.31	5.737077	1.404154	-2.264103	5.791635
		0	8.52	5.737077	1.404154	-2.264103	5.791635
		0	6.22	5.737077	1.404154	-2.264103	5.791635
		0	6.07	5.737077	1.404154	-2.264103	5.791635
...	...	...	...	...	...	...	...
973	191	0	14.57	9.771538	1.496319	-6.557975	8.063369
		0	5.88	9.771538	1.496319	-6.557975	8.063369
		0	7.76	9.771538	1.496319	-6.557975	8.063369
		0	6.66	9.771538	1.496319	-6.557975	8.063369

8904099 rows × 5 columns

The final rescaled loadshape values have the same variance, mean power, and total energy use as the original AMI meter data:

final=mapping.drop(["scale","offset","value"],axis=1)
pd.DataFrame(list(map(lambda x:(x.power-x.rescaled)*(x.power-x.rescaled).max(),[final.std(),final.mean(),final.sum()])),index=["std","mean","sum"],columns=["MaxSE"]).T

	std	mean	sum
MaxSE	2.764673e-25	2.051925e-22	1.626241e-08

Generate the loadshape weights

The loadshape weights can be generated by computing the total energy in each loadshape meter group and dividing it by the total energy over all loadshapes.

loadshapes = groups.drop(range(192),axis=1)
total = data.drop("tz",axis=1).set_index(["meter_id","hourtype"]).join(loadshapes).groupby("loadshape").sum()
total.columns = ["energy"]
(total/total.sum()).plot.pie(y="energy",figsize=(8,8),normalize=True);
(total/total.sum()).round(3).T

loadshape	0	1	2	3	4
energy	0.363	0.178	0.146	0.109	0.204

png

Validation

The final result is validated using a $t$-distributed stochastic neighbor embedding ($t$-SNE) algorithm to reduce the dimensionality of the original data set to two dimensions. The color of each point in the map corresponds to each loadshape. The grouping of similar loadshapes shows how the cluster methods produce similar results.

from sklearn.manifold import TSNE
import matplotlib.colors
import numpy as np
tsne = TSNE()
results_tsne = tsne.fit_transform(rescaled)
color_list = list(zip(np.arange(0,1,0.1),np.arange(0,1,0.1),np.arange(1,0,-0.1)))
cmap = matplotlib.colors.LinearSegmentedColormap.from_list(groups.index.get_level_values(level=1).unique(), color_list)
plt.figure(figsize=(10,10))
plt.scatter(results_tsne[:,0], results_tsne[:,1],
    c=groups.index.get_level_values(level=1),
    cmap=cmap, 
    alpha=0.6, 
    );

png

This provides an simple example from which most analysts and researchers can develop their own loadshapes from AMI data. In some cases it may be necessary to do some preprocessing to the AMI data to normalize timestamps, aggregate subhourly data up to hourly day, or removed extraneous columns. These tasks are relatively simple and straightforward, and are common enough that the reader should have no problem implementing these for themselves.

Written on September 26, 2022