KTH Architecture and the Built Environment
The Influence of Energy Use Visualization
on the Energy Consumption in Municipal
Multi-Apartment Buildings
The Case of Nynäshamnbostäder
Marc C. Azar
Master of Science Thesis 2012:125
KTH School of Architecture and the Built Environment Division of Building Service and Energy Systems
© 2012 School of Architecture and the Built Environment
Division of Building Service and Energy Systems
SE-100 44 Stockholm, Sweden
Marc C. Azar
Approved: 2012
Examiner: Prof. Ivo Martinac
Supervisor: Prof. Ivo Martinac
Commissioner:
Contact Person:
Abstract
This thesis investigates the influence of energy use visualization on the hot water consumption of multi-apartment buildings in Sweden, and builds up the framework for the analysis of electricity consumption. 115 apartments in Nynäshamn have been scheduled to be equipped with feedback visualization, through the use of television sets, allowing the monitoring of electricity and hot water consumption on a monthly basis. One year of consumption data prior to this feedback introduction was acquired for statistical analysis. The results were then displayed and analysed, allowing for the composition of a generalized conclusion whilst revealing the need for further investigation and future work.
To achieve this end, clustering was performed on the 115 apartments according to the following characteristics: number of tenants per apartment, area of the apartment, location of the apartment, and age of the tenants. Principal Component Analysis was used to select dominant characteristics by eliminating highly correlated components, after which trend analysis was performed on each of the separate clusters, revealing a seasonal change model.
Finally, a Multivariate Analysis of Variance applied to the paired clusters to identify any significant change, along with post-hoc tests to specify the groups in which significant change was detected, is presented to be applied in future work.
The preliminary results clearly show that the characteristic data can be grouped into three distinct clusters, each with a distinct hot water consumption trend. Moreover, the data reveals a correlation between an apartment's characteristics and its hot water consumption. However, further monitoring and data collection will be required before any strong trend can be identified, and power analysis will have to be applied before any significant change can be concluded. Nonetheless, the initial results demonstrate promising signs that ought to be investigated further in the future.
Acknowledgments
I would like to acknowledge first and foremost my supervisor and advisor Prof. Ivo Martinac, whose advice and flexibility gave me the freedom to tackle the issues at hand and further enhance my knowledge and abilities, and who extended me his utmost trust and faith in conducting this thesis. I would also like to extend an acknowledgement to Nynäshamnbostäder for all the data acquired, without which this thesis would not have been possible, and to Anna-Carin Öster in particular for her cooperation in facilitating the data acquisition.
Finally, I would like to thank PhD student Jimmy Azar, my brother, for his continuous advice and support in pattern recognition, and my family & friends for their love and tolerance of my behavior throughout the duration of this thesis.
Contents
1 Introduction
2 Literature Review
3 Background
4 Methodology
  4.1 Data Collection & Representation
  4.2 Hypothesis
  4.3 Statistical Analysis
    4.3.1 Principal Component Analysis
    4.3.2 Clustering
    4.3.3 Cluster Validation
    4.3.4 Multivariate Analysis of Variance (MANOVA)
5 Results & Discussion
  5.1 Box Plot
  5.2 Principal Components
  5.3 Clustering
  5.4 Cluster Validation
  5.5 Power & Size Analysis
6 Conclusion
7 Future Work
Appendix A
Appendix B
List of Figures
1 Biplot of correlation values and scores
2 k-means of data
3 Davies-Bouldin index for k = 2, 3, 4, 5
4 Gaussian vs. k-means
5 F-measure
6 Trend analysis for seasonal changes
7 ANOVA vs. MANOVA
8 Box plot of characteristic data
9 Variance in percentage
10 2-D Principal Component Analysis
11 Possible clusters
12 3-D Principal Component Analysis
13 Cluster validation measures
14 Gaussian clustering with k = 3
15 Tracking mean of cluster 1
16 Tracking mean of cluster 2
17 Average temperature of Stockholm in Celsius
18 Tracking mean of cluster 3
19 F-measure & Rand index
20 F-statistic for α = 0.1
21 Chi-squared table
List of Tables
1 Possible outcomes of hypothesis testing
2 Set of generated data
3 Normalized data with h̄ = 1.85, w̄ = 80.33, ā = 24.67, & r̄ = 3.02
4 SVD decomposition of the matrix into U, Σ, & V
5 1st & 2nd principal components
6 Set of generated data with repeated measurements
7 Set of generated data with repeated measurements, continued
8 Set of generated data with repeated measurements for males & females
9 Summary of clustering of characteristic data
1 Introduction
According to the 2020 agreement signed by Sweden, a twenty percent reduction of carbon dioxide emissions, a twenty percent increase of renewable energy production, and a twenty percent decrease in energy consumption should be achieved by the year 2020 [1]. Further investigations conducted by the OECD (Organisation for Economic Co-operation and Development) show that residential and commercial buildings account for about 30 to 40% of primary energy use and greenhouse gas emissions [2]. It is evident, therefore, that for the target to be met, Sweden has to concentrate its efforts on reducing the energy consumption of residential buildings.
Advances in today's technology have made easy the installation of "smart meters" capable of real-time monitoring and display of household energy consumption [3]. Whilst the trend in energy reduction of residential buildings focuses on the introduction of integrated automation systems and smart metering ensuring efficient operation and maintenance [4], a different approach should be adopted for Sweden due to its unique building demographic. The fact that more than 90% of Swedish buildings date back to before 1990 [34] limits the potential of BEAM (Building Energy Automation Management) systems. Instead, our focus should be on investigating the largest energy-wasting source in buildings, typically accounting for more than 38% on average [5]: the occupant himself.
Influencing the lifestyle of a tenant to include energy-saving decisions in his everyday activities has been the subject of several studies revolving around the effect of information feedback on energy consumption through continuous monitoring. The effects of visualizing energy consumption will be the focus of this thesis, in the hope of shedding more light on the complex relationship between consumption and influential exterior variables.
2 Literature Review
The magnitude of the effect of information feedback on household consumption has sparked a widespread debate within the academic field. In 1987 Sexton et al. published an empirical experiment conducted on four hundred and eighty Californian households showing the effect of information feedback on electricity consumption [6]. Since then, several papers have surfaced revolving around the same topic, of which twenty-two publications, including Sexton's, were summarized in Corinna Fischer's own publication for the Energy Efficiency journal under the title "Feedback on household electricity consumption: a tool for saving energy?". Fischer concluded in the latter publication that future work in this field is required to bridge the recurring gaps in the summarized studies, among which the lack of a large and multi-national sample, as well as a concerning lack of studies on preferences for graphical representation and visualization in the feedback process, should be inspected. The results of the summarized publications varied from a negative effect on electricity consumption (believed to be due to the rebound effect, as documented by Sexton) to a respectable 14% decrease in consumption (relating to Group 3 of the Mosler and Gutscher case study of 2004) [7].
It is worth noting that the latter issue of graphical preference has been the subject of recent publications, such as the Energy Saving Trust report written by Will Anderson and Vicki White entitled "The smart way to display. Full report: exploring consumer preferences for home energy display functionality", as seen in the Center for Sustainable Energy Journal. Similarly, IEEE Transactions on Industrial Electronics Vol. 59 published a paper entitled "A Comparative Study of Three Feedback Devices for Residential Real-Time Energy Monitoring" in April 2012, tackling the display preference of households on feedback monitors. However, it is evident after reviewing all these studies that they have concentrated on electric space heating while neglecting other sources of space and water heating/cooling, such as natural gas or district heating, save for Haakana's study on one hundred and five single-family houses in Finland [8]. This is highly significant due to the fact that more than seventy-two percent of residential buildings' energy consumption is attributed to water and space heating/cooling [9].
As an example, in Wilhits’s publication, which involves around two thousand Norwegian households and is included in Fischer’s summary of studies, it is clearly stated that: "The electricity savings potential of historical feedback may be higher in Norway than in other countries, primarily because of the extensive amount of electric space heating in Norway." He goes on to state: "These techniques could however be applied to natural gas billing. And
the customer satisfaction aspects should be universal." [10]
In another publication, Nielsen examines one thousand five hundred Danish households, concluding that: "Heating consumption is closely tied to housing standard, size and external temperature, and is, furthermore, subject to behavior oriented variations, mainly in connection with the household's comfort standards (room temperature, hot water demand etc). Electricity consumption for appliances is, on the other hand, much less dependent on the house's particular characteristics than it is on the other factors, such as the household's group size, group structure and requirements or lifestyle. Nordstrand et al, Salg af kollektiv energi, AKF, 1984, also have reservations about the estimated price elasticities, and conclude that electricity use in heating can be considered much more elastic than that in appliances." [11]
I have elaborated in this section in order to establish two main points: the first being that water and space heating/cooling is the main contributor to energy consumption in residential areas; the second, and more important, being that very little literature exists on the effect information feedback has on household "heating demand" when it comes to water and space heating. This could have a major impact on European countries, and more specifically on Nordic countries, where district heating accounts for a large portion of residential space and water heating. I would also note that the papers discussed did not take into account the different types of residences, considering only the overall change. It is vital to investigate energy conservation methods specific to each type of household; therefore each residence group, from family households to elderly housing areas, single-apartment facilities, and rental apartments, should be examined for energy saving via occupant behavior alteration. Another point worthy of mention is the striking lack of power analysis in the mentioned articles, which undermines the statistical results reported in the respective publications.
3 Background
Nynäshamnbostäder AB is a Swedish housing company, operating since 1995, located in Nynäshamn, some sixty kilometers south of Stockholm. The company houses more than two thousand seven hundred flats, targeting the middle class, often young couples and retired seniors, according to SCB's (Statistiska Centralbyrån) Boendeprofiler 2009 [12]. Recently the company decided on the implementation of feedback visualization in a selected group of flats.
Minol is a family-owned company that specializes in metering and billing services worldwide. The company develops innovative metering solutions for heating and cold/hot water consumption, as well as billing services. Nynäshamnbostäder AB has contracted Minol to implement the monitoring of heating and cold/hot water consumption in the selected flats. As a result, Minol has installed two temperature sensors, located in one of the rooms and on one of the radiators, along with two water meters to monitor the cold and hot water usage of each flat.
As for the analysis of the effect of energy use visualization on energy consumption, Nynäshamnbostäder AB contracted KTH (Kungliga Tekniska högskolan) for the duration of one year. As a result, two preliminary reports were produced, in addition to the thesis study at hand.
The 115 apartments selected for this experiment were distributed among four residential areas in Nynäshamn: (a) Balder, (b) Mastra, (c) Torp, and (d) Vansta. One year of monthly consumption data prior to feedback introduction was provided for hot water usage, and the following characteristics were disclosed: (a) area, (b) number of occupants, (c) age of tenant, and (d) location factor.
At first glance, limitations on this study can be predicted to emerge in the form of the resolution of monitoring and feedback, the visualization methodology adopted, the lack of electricity consumption data prior to feedback introduction, the short analysis period after feedback introduction, and the fixed sample pool available.
Subsequently, the effects of these limitations on the experimental setup might come in the form of restrictions on the trend analysis due to the short analysis period available after feedback introduction, a lack of behavioral electricity consumption data, delayed behavioral effects due to the low monthly resolution of the feedback, and low confidence levels in the power analysis due to inadequate sample size.
Building on the latter, I propose to properly investigate the influence of information on a household's heating demand through an empirical experiment involving Swedish residential households with district heating as the sole source of space and water heating/cooling. This comes as the final chapter following two prequel studies conducted at KTH: Cenac-Morthe Romain's master thesis work in 2011, entitled "Heating energy consumption of a multi-storey municipal residential building", and François Leconte's pre-study report in 2009, entitled "Visualization of Energy Use in Households".
"Grown-ups love figures... One must not hold it against them. Children must be very indulgent toward grown-ups." (Le Petit Prince)
4 Methodology
4.1 Data Collection & Representation
All data collected in this thesis falls under the interval measurement scale, in that it allows us to rank, quantify, and thus compare different readings. This includes the following measurements: age, apartment area, number of tenants, location factor, electricity consumption, and hot water consumption. The measurement error and tolerance of the collected data are noted and accounted for accordingly.
Furthermore, the data is encoded in accordance with the Unicode standard, using the 16-bit variable-width transformation format (UTF-16), for computer handling and processing. The data presented is referred to by codename only, allowing discreet handling of user information.
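As a minimal illustration of this handling (not the project's actual pipeline; record contents and codename scheme are invented for the example), a reading can be round-tripped through UTF-16 and tenant identifiers replaced with opaque codenames before analysis:

```python
# Sketch: UTF-16 round-trip of a reading containing non-ASCII characters,
# plus a hypothetical pseudonymization of apartment identifiers.
record = "Nynäshamn;Balder;hot water: 3.4 m3"

encoded = record.encode("utf-16")   # UTF-16 with BOM, variable width
decoded = encoded.decode("utf-16")  # recovers the original string exactly

# Hypothetical apartment IDs mapped to opaque codenames
codenames = {apt_id: f"APT-{i:03d}"
             for i, apt_id in enumerate(["A17", "B02", "C44"])}
```

Only the codenames (e.g. `APT-001`) would then appear in the analysis output, keeping tenant identities out of the reported results.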
4.2 Hypothesis
Given the characteristic data of the apartments, as well as their consumption data, we are to look into the effect of feedback introduction to electricity and hot water consumption. Hence we formulate our Null Hypothesis as follows:
H0 → Feedback introduction has no significant effect on electricity and/or hot water consumption.
The null hypothesis will be tested against an alternative hypothesis which is formulated as follows:
H1 → Feedback introduction has a significant effect on electricity and/or hot water consumption.
Possible outcomes of hypothesis testing are illustrated below:
                                  Null hypothesis (H0) is true       Null hypothesis (H0) is false
Reject null hypothesis            Type I error (false positive)      Correct outcome (true positive)
Fail to reject null hypothesis    Correct outcome (true negative)    Type II error (false negative)
Table 1: Possible outcomes of hypothesis testing
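The Type I error in the table above can be made concrete with a small simulation, here a sketch with invented parameters (standard normal population, two-sample z-test, α = 0.05) rather than the thesis data: when H0 is true, a test at significance level α should wrongly reject roughly α of the time.

```python
import math
import random

random.seed(42)

def z_rejects(n=50, z_crit=1.96):
    """Draw two samples from the SAME normal population (H0 is true) and
    report whether a two-sample z-test would (wrongly) reject H0."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    ma, mb = sum(a) / n, sum(b) / n
    va = sum((x - ma) ** 2 for x in a) / (n - 1)
    vb = sum((x - mb) ** 2 for x in b) / (n - 1)
    z = (ma - mb) / math.sqrt(va / n + vb / n)
    return abs(z) > z_crit

trials = 2000
# Fraction of false positives; should hover around alpha = 0.05
type_i_rate = sum(z_rejects() for _ in range(trials)) / trials
```

The observed rejection rate lands near 5%, which is exactly the Type I ("false positive") cell of Table 1.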
The mathematical formulation of the hypothesis will be dependent on the type of hypothesis testing adopted and as such will be presented at a later stage in the report.
4.3 Statistical Analysis
In order to properly investigate whether feedback introduction has any effect on the population's consumption of electricity and hot water, we first have to take a closer look at the population itself. An average change across the whole population after feedback introduction might conceal hidden dynamics in smaller population groups. Hence the need to categorize the population into smaller samples depending on certain characteristic criteria. As a result, another question surfaces that needs to be addressed: that of criteria selection.
Principal Component Analysis is a method for visualizing higher dimensional data. It does so through a mathematical procedure which applies an orthogonal transformation to correlated variables, yielding a set of linearly uncorrelated principal components [13]. This is highly relevant to our case study in that it helps us answer our proposed question of determining the selection criteria. Given multiple characteristics of the population, such as age of tenants, area of the apartment, number of occupants, etc., it would be quite difficult to visualize, and more difficult to analyze, the statistical results of a population grouped on the basis of high-dimensional data vectors. However, the number of occupants may be highly correlated with the apartment size. Therefore, if we can implement a procedure to transform the correlated categories into linearly uncorrelated principal components, we are able to eliminate high dimensionality in our data.
4.3.1 Principal Component Analysis
Principal Component Analysis (PCA) makes use of eigenvector decomposition to provide us with a lower dimensional picture of a set of variables, viewed from its most influential point of view [13]. In other words, PCA incorporates a change of basis in order to represent the data in accordance with its characteristic values, or eigenvalues. This is achieved by finding an orthonormal matrix P such that

Y = P X,

where X is our m by n data matrix, such that m is the number of categories and n the number of samples, while

C_Y = (1/n) Y Y^T

is the covariance matrix of Y in diagonal form. The rows of P will be the principal components sought after. One can easily rewrite:
C_Y = (1/n) Y Y^T = (1/n)(P X)(P X)^T = (1/n) P X X^T P^T = P ((1/n) X X^T) P^T = P C_X P^T,
where C_X is the covariance of X. In order to calculate P, we make use of an important fact:

A matrix can be orthogonally diagonalized if and only if it is symmetric.

Knowing this, the argument proceeds as follows:

• Let P be a matrix with rows equal to the eigenvectors of C_X.
• Knowing that C_X = (1/n) X X^T is a symmetric matrix,
• we have C_X = E D E^T, where E is a matrix of the eigenvectors of C_X arranged in columns, and D is a diagonal matrix.
• Hence we can write P = E^T.
• Finally, substituting C_X = P^T D P into C_Y = P C_X P^T, we obtain C_Y = D.
Thus we conclude that P diagonalizes C_Y, and as a result the principal components of X, which are the rows of P, are also the eigenvectors of the covariance matrix C_X. To compute the eigenvectors of C_X, we first have to look into Singular Value Decomposition (SVD) [15].
SVD is the factorization of a matrix M into the form

M = U Σ V^T,

where U is a matrix of the eigenvectors of the covariance matrix M M^T, Σ is a rectangular diagonal matrix, and V is a matrix of the eigenvectors of the covariance matrix M^T M. This relates to PCA as follows: let Y be an n by m matrix where

Y = (1/√n) X^T.

We therefore have:

Y^T Y = ((1/√n) X^T)^T ((1/√n) X^T) = (1/n) X X^T = C_X,

where C_X is the covariance matrix of X. It is clear now that in order to find the principal components of X, which is equivalent to finding the eigenvectors of C_X, we have to calculate the columns of V through an SVD of Y = (1/√n) X^T [16].
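The two SVD properties used above can be checked numerically. The following sketch (assuming numpy is available; the matrix is random and purely illustrative) verifies that U Σ V^T reconstructs M and that the rows of V^T are eigenvectors of M^T M with eigenvalues equal to the squared singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))  # arbitrary 6-by-4 matrix

# Thin SVD: M = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# 1) The factorization reconstructs M
M_rec = U @ np.diag(s) @ Vt

# 2) The first row of Vt is an eigenvector of M^T M
#    with eigenvalue s[0]**2: (M^T M) v = s^2 v
MtM = M.T @ M
v0 = Vt[0]
lhs = MtM @ v0
rhs = (s[0] ** 2) * v0
```

This is precisely why computing the SVD of Y = (1/√n) X^T delivers the eigenvectors of C_X without ever forming the covariance matrix explicitly.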
There are several assumptions taken into consideration in the application of PCA. Firstly, PCA assumes there is a linear relationship between the variables; this is inherent in performing a linear change of basis. Secondly, PCA is highly sensitive to scaling; this also follows from the linear correlation of the variables, since scaling has a direct effect on the extracted eigenvalues. Therefore it is important to normalize the data, either by subtracting the mean or through z-normalization using the standard deviation and mean [17].
Let us now give a statistical example where PCA proves useful. Let us assume we have the following set of data:
height (m)   weight (kg)   age (y)   reach (m)
1.88         88            24        3.01
1.68         64            18        2.71
1.75         68            25        2.91
1.98         90            22        3.41
2.08         92            24        4.05
1.74         80            35        2.02

Table 2: Set of generated data
Thus the first step is to normalize our data set through mean subtraction, yielding the following result:
height   weight   age     reach
 0.028     7.67   -0.67   -0.01
-0.17    -16.33   -6.67   -0.31
-0.10    -12.33    0.33   -0.11
 0.13      9.67   -2.67    0.39
 0.23     11.67   -0.67    1.03
-0.11     -0.33   10.33   -1.00

Table 3: Normalized data with h̄ = 1.85, w̄ = 80.33, ā = 24.67, & r̄ = 3.02
We now calculate Y = (1/√N) X^T:

Y =
 0.0142    3.8333   -0.3333   -0.0042
-0.0858   -8.1667   -3.3333   -0.1542
-0.0508   -6.1667    0.1667   -0.0542
 0.0642    4.8333   -1.3333    0.1958
 0.1142    5.8333   -0.3333    0.5158
-0.0558   -0.1667    5.1667   -0.4992
Next we proceed to calculate the SVD of Y:

U =
-0.2820  -0.1205   0.4406   0.7356  -0.2464   0.3318
 0.6349  -0.3880   0.2421  -0.1529   0.2045   0.5680
 0.4569   0.1356  -0.4827   0.2098  -0.6977   0.0955
-0.3484  -0.3008   0.2322  -0.5895  -0.6049   0.1441
-0.4318  -0.1627  -0.6444   0.0068   0.2118   0.5718
-0.0295   0.8364   0.2123  -0.2097  -0.0039   0.4588

Σ =
13.3710   0        0        0
 0        6.1907   0        0
 0        0        0.4120   0
 0        0        0        0.0126
 0        0        0        0
 0        0        0        0

V =
-0.0113  -0.0097  -0.1469  -0.9890
-0.9933  -0.1086   0.0400   0.0065
-0.1114   0.9907  -0.0786   0.0033
-0.0297  -0.0820  -0.9852   0.1475

Table 4: SVD decomposition of the matrix into U, Σ, & V

Therefore our variances can be calculated by squaring the diagonal values of Σ:

            1st PC      2nd PC     3rd PC    4th PC
Variance    178.7827    38.3248    0.1697    0.0002
                     1st PC     2nd PC     3rd PC    4th PC
Percent Explained    82.28%     17.64%     0.08%     0.00%
It is clear that the first two principal components make up more than 99% of the variance, and as such only these need be taken into consideration.
The principal component scores are then calculated by multiplying our normalized data with V^T:

1st principal component:    0.0113    0.9933    0.1114    0.0297
2nd principal component:   -0.0097   -0.1086    0.9907   -0.0820

Table 5: 1st & 2nd principal components
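The worked example above can be reproduced in a few lines of numpy (a sketch: numpy's SVD may flip component signs relative to Table 5, and the explained-variance ratios are scale-independent, so the 1/√N factor can be omitted):

```python
import numpy as np

# Data from Table 2: height (m), weight (kg), age (y), reach (m)
X = np.array([
    [1.88, 88, 24, 3.01],
    [1.68, 64, 18, 2.71],
    [1.75, 68, 25, 2.91],
    [1.98, 90, 22, 3.41],
    [2.08, 92, 24, 4.05],
    [1.74, 80, 35, 2.02],
])

Xc = X - X.mean(axis=0)                 # mean subtraction (Table 3)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

variances = s ** 2                       # proportional to the PC variances
explained = variances / variances.sum()  # percent-explained of each PC
top2 = explained[:2].sum()               # share carried by the first two PCs
```

Running this confirms the claim in the text: the first two principal components together carry over 99% of the variance of the generated data.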
To help us visualize the result of the above PCA, we consider the following biplot:
[Figure: biplot of the variables height, weight, age, and reach over Component 1 (horizontal) and Component 2 (vertical), with both axes ranging from -1 to 1]

Figure 1: biplot of correlation values and scores
Note that the min and max values of the axes, ranging from -1 to 1, mark the correlation circle. Thus, when variables (represented in blue) are located far from the center, the following conclusions can be drawn [14]:
• if they are closely grouped, then they are significantly positively correlated.
• if they are grouped at opposite sides, then they are significantly negatively correlated.
• if they fall orthogonal to each other, then they are uncorrelated.
However, if the variables happen to be located near the center of the axes, then no conclusions can be drawn from these two principal component factors, leaving information hidden in other principal components. Moreover, the location of the observations (represented in red), falling in one of the four quadrants, can be used to estimate the number of clusters required for grouping, revealing the underlying structure of the given data.
Therefore, from the above biplot, one can conclude that age and weight, being orthogonally placed at a distance from the center, are uncorrelated, while different principal components need to be plotted to draw correlation-based conclusions for the height and reach variables, due to their close proximity to the center. It is also worth noting that the first principal component, the horizontal axis, is linked to age, reach and height, while the second principal component, the vertical axis, correlates with age. Finally, the location of the observations implies the existence of three possible clusters in the data.
Building on this analysis, the next logical step is the clustering and grouping of the data. Data clustering is vital to us, since the average behavior of the observations might not tell the full story, hiding the inner group dynamics sought after.
4.3.2 Clustering
In distribution-based clustering, two important factors come into play: the first being the type of distribution applied in modeling, and the second the number of clusters required for grouping. The previous Principal Component Analysis answers both. Since PCA carries the underlying assumption of linearly related, normally distributed variables [14], it is straightforward to decide on the use of a Gaussian probability distribution function. As for the number of clusters, two contributors can aid us in this vital decision. The first is the initial guess we conclude from the analysis of the observation scatter in the PCA we performed. The other comes in the form of a centroid-based clustering method known as k-means.
K-means is a clustering method based on a voting system. Data is randomly assigned to an initial number of clusters, and the new centroid of each cluster is then calculated. Afterwards, the distance of every observation to every cluster mean is measured, and each observation is then reassigned to the closest cluster [18]. Thus k-means aims to minimize the within-cluster sum of squares of each cluster. The method is an iterative one, whereby the algorithm stops once it reaches saturation. The iteration can be formulated mathematically as:
• Expectation step:
x_p ∈ C_i iff ||x_p − m_i|| ≤ ||x_p − m_j|| ∀ 1 ≤ j ≤ k,

where x_p represents the observation point, C_i is the cluster in consideration, m_i is the mean of cluster C_i, and k is the number of clusters provided.
• Maximization step:
m_i = ⟨C_i⟩
Note that the distance measure usually considered is either the Euclidean distance or the more general Mahalanobis distance. Performing k-means using the Euclidean distance measure, with the initial number of clusters k = 3 deduced from the PCA biplot, we obtain the following superimposed scatter:
[Figure: scatter of the data over the scores of the 1st and 2nd principal components, colored by the three k-means clusters 1, 2, 3]

Figure 2: k-means of data
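The expectation/maximization iteration described above can be sketched in a few lines of pure Python. This is an illustrative toy (2-D points and initial means invented for the example, Euclidean distance, fixed iteration count rather than a saturation check):

```python
def kmeans(points, means, iters=20):
    """Toy k-means on 2-D tuples with Euclidean distance."""
    for _ in range(iters):
        # Expectation step: assign each point to its nearest cluster mean
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)),
                    key=lambda i: (p[0] - means[i][0]) ** 2
                                + (p[1] - means[i][1]) ** 2)
            clusters[i].append(p)
        # Maximization step: recompute each mean as its cluster centroid
        means = [(sum(p[0] for p in c) / len(c),
                  sum(p[1] for p in c) / len(c)) if c else m
                 for c, m in zip(clusters, means)]
    return means, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(pts, [(0, 0), (10, 10)])
# two well-separated groups of three points each
```

With well-separated groups the assignment stabilizes after one pass, which is the "saturation" stopping criterion mentioned above.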
k-means revealed the obvious grouping, given an initial number of three clusters, according to centroid closeness. The figure above uses the PCA scores to display the grouping in 2-dimensional space, making it simpler to draw conclusions from. Alas, k-means is a crude voting method that does not rely on any normal probability distribution of votes [18]. Thus a more adequate clustering technique for higher dimensional data is the mixture-of-Gaussians clustering methodology using the Expectation-Maximization algorithm [19]. Nonetheless, k-means provides us with a computationally efficient algorithm to test for the number of clusters by iteratively increasing k (k = 1, 2, 3, etc.), and also provides the initial parameters for a mixture of Gaussians, such as an initial guess on indexing, cluster centers, and the covariance matrix. However, before we move on to the mixture of Gaussians, it is important to point out the methodology for determining the correct number of clusters in the iterative mechanism of k-means. Increasing the number of clusters will yield different groupings; thus, in order to determine the correct number of clusters, we look into the Davies-Bouldin index.
The Davies-Bouldin index is based on a measure of separation of the clusters as well as a measure of scatter within each cluster. In short, Davies-Bouldin rewards the configuration with the most separated clusters having the least scatter of observations within each cluster (a measure of the compactness of each cluster) [20]. This is expressed mathematically as follows:
follows:
• S_i, the compactness measure, is the average Euclidean distance between the centroid of the cluster and its observations:

S_i = (1/T) Σ_j ||x_j − A_i||,

where x_j represents an observation point within the cluster, A_i is the centroid of the cluster in consideration, and T is the cluster size.
• M_{i,j}, the measure of separation, is the Euclidean distance between cluster centroids i & j:

M_{i,j} = ||A_i − A_j||.

The index itself averages, over all clusters, the worst-case ratio (S_i + S_j)/M_{i,j} of compactness to separation, so a lower value indicates a better clustering.

[Figure: Davies-Bouldin index plotted against the number of clusters k = 2, 3, 4, 5, with a fitted curve]

Figure 3: Davies-Bouldin index for k = 2,3,4,5
Here the smallest Davies-Bouldin number, 0.128, implies the need for 4 separate clusters instead of the 3 clusters utilized in the above example. This is simple to justify by looking at the second cluster, represented in red in figure 2, where the two observation points are relatively widely separated, and as such should each be considered a separate cluster. Note that the curve has been smoothed using cubic spline interpolation, in order to better signify the global minimum of the curve.
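The compactness and separation measures above combine into the index as follows, in a short pure-Python sketch (clusters and centroids invented for the example; a tight, well-separated configuration should score lower than a loose, overlapping one):

```python
import math

def db_index(clusters, centroids):
    """Davies-Bouldin index for a hard clustering.
    clusters[i] is the list of 2-D points assigned to centroids[i]."""
    # S_i: average distance of a cluster's points to its centroid
    S = [sum(math.dist(p, c) for p in pts) / len(pts)
         for pts, c in zip(clusters, centroids)]
    k = len(centroids)
    total = 0.0
    for i in range(k):
        # worst-case compactness/separation ratio against any other cluster
        total += max((S[i] + S[j]) / math.dist(centroids[i], centroids[j])
                     for j in range(k) if j != i)
    return total / k

tight = db_index([[(0, 0), (0, 1)], [(10, 10), (10, 11)]],
                 [(0, 0.5), (10, 10.5)])
loose = db_index([[(0, 0), (5, 5)], [(10, 10), (6, 6)]],
                 [(2.5, 2.5), (8, 8)])
# tight < loose: compact, well-separated clusters score lower
```

This is the sense in which the minimum of the curve in figure 3 picks out the preferred number of clusters.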
After concluding the number of clusters as well as the initial parameters, it is time to take a closer look at the mixture-of-Gaussians clustering technique. It is similar to k-means in that it can be described as a two-step algorithm of Expectation and Maximization, clustering the data points in accordance with centroid closeness. However, the voting mechanism is a soft one, using a normalized probability function; hence, instead of a rigid boundary as in k-means, each data point is assigned to a density function according to its respective log-likelihood.
The multivariate normal distribution is of the following form:

f(x) = (2π)^{-p/2} |Σ|^{-1/2} exp{ -(1/2) (x − µ)^T Σ^{-1} (x − µ) },

noted as

X ∼ N(µ, Σ),

where Σ represents the covariance matrix and µ the mean of the distribution. Both parameters are obtained from the k-means clustering as initial guesses.

Notice that the expression inside the brackets, (x − µ)^T Σ^{-1} (x − µ), is the squared Mahalanobis distance between x & µ.
Also, in case we have uncorrelated variables, our variance-covariance matrix becomes diagonal:

Σ = diag(σ_1^2, σ_2^2, ..., σ_p^2).

As a result, our multivariate distribution factorizes into a product of univariate normal densities (spherical when all variances are equal):

φ(x) = Π_{j=1}^{p} (1/√(2π σ_j^2)) exp{ -(1/(2σ_j^2)) (x_j − µ_j)^2 }
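The factorization can be verified numerically with a small sketch (pure Python, p = 2, with an invented evaluation point and variances): the product of univariate densities must equal the joint multivariate density with a diagonal covariance matrix.

```python
import math

def mvn_diag_pdf(x, mu, sigmas):
    """Multivariate normal density with diagonal covariance, computed as
    the product of univariate normal densities (the phi(x) above)."""
    return math.prod(
        math.exp(-((xj - mj) ** 2) / (2 * sj ** 2))
        / math.sqrt(2 * math.pi * sj ** 2)
        for xj, mj, sj in zip(x, mu, sigmas)
    )

# Evaluate at x = (0, 1) for mu = (0, 0), sigma = (1, 2)
p = mvn_diag_pdf((0.0, 1.0), (0.0, 0.0), (1.0, 2.0))
```

For the diagonal case the joint density reduces to 1/(2π σ_1 σ_2) · exp{-(1/2)(x_1²/σ_1² + x_2²/σ_2²)}, and the product form returns exactly that value.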
This highlights the relationship between the mixture of Gaussians and k-means: the latter amounts to a hard clustering with an uncorrelated variance-covariance matrix.
The Mixture Model of Gaussians is defined as the weighted sum of Gaussian probability distribution functions [19]:

p(x) = π_0 f_0(x) + π_1 f_1(x) + π_2 f_2(x) + ... + π_k f_k(x)

where the π_i are weights determined a priori with Σ_{i=0}^{k} π_i = 1, k is a predetermined number of clusters, and f_i(x) = N(x|µ_i, Σ_i) is the multivariate normal distribution with mean µ_i and covariance matrix Σ_i. Thus p(x) can be noted as:

p(x) = Σ_{i=0}^{k} π_i N(x|µ_i, Σ_i)
The aim of Gaussian clustering is to maximize the likelihood of the mixture function, that is, to maximize:

ln p(x|π, µ, Σ) = Σ_{n=1}^{N} ln{ Σ_{i=0}^{k} π_i N(x_n|µ_i, Σ_i) }
To perform this, the Mixture of Gaussians model utilizes Bayes' rule to evaluate the probability of an observation belonging to a distribution [19]. This is often referred to as the Expectation step, and is formulated as follows:

τ_{nk} = P(n ∈ C_k | n) = P(n | n ∈ C_k) P(C_k) / P(n) = π_k N(x_n|µ_k, Σ_k) / Σ_{j=1}^{k} π_j N(x_n|µ_j, Σ_j)
After evaluating the responsibilities τ_{nk} of each observation belonging to every Gaussian distribution in the mixture model, we move on to maximize the likelihood of the probability, in what is often known as the Maximization step, as follows:

ln p(x|π, µ, Σ) = Σ_{n=1}^{N} ln{ Σ_{i=0}^{k} π_i N(x_n|µ_i, Σ_i) }

∂ ln p(x|π, µ, Σ) / ∂µ_k = Σ_{n=1}^{N} [ π_k N(x_n|µ_k, Σ_k) / Σ_{j=1}^{k} π_j N(x_n|µ_j, Σ_j) ] Σ_k^(−1)(x_n − µ_k) = Σ_{n=1}^{N} τ_{nk} Σ_k^(−1)(x_n − µ_k) = 0
Therefore allowing us to update our initial parameter guesses obtained from k-means as follows:

µ_k = Σ_{n=1}^{N} τ_{nk} x_n / Σ_{n=1}^{N} τ_{nk}

Σ_k = (1 / Σ_{n=1}^{N} τ_{nk}) Σ_{n=1}^{N} τ_{nk} (x_n − µ_k)(x_n − µ_k)^T

π_k = Σ_{n=1}^{N} τ_{nk} / N
Then we iterate and check for convergence of the likelihood function. To better demonstrate this, let us consider the example data provided in the k-means example. With the number of clusters equal to 3, and initial cluster means:

µ_1 = [1.74; 80.00; 35.00; 2.02]
µ_2 = [1.72; 66.00; 21.50; 2.81]
µ_3 = [1.98; 90.00; 23.33; 3.49]
Along with the initial covariance matrices1:

     | 0.0025  0.14   0.2450  0.0070 |
Σ1 = | 0.14    8.00   14.00   0.40   |
     | 0.2450  14.00  24.50   0.70   |
     | 0.0070  0.40   0.70    0.02   |

     | 29.38  0  0  0 |
Σ2 = |  0     0  0  0 |
     |  0     0  0  0 |
     |  0     0  0  0 |

     | 20.01   0.20  0     0.0520 |
Σ3 = | 0.20    4.00  0     1.04   |
     | 0       0     1.33  0.08   |
     | 0.0520  1.04  0.08  0.2752 |
Finally, assigning the initial weights equally among the three clusters:

π_1 = π_2 = π_3 = 1/3
We now proceed to evaluate the responsibilities of each observation and maximize the likelihood in order to re-evaluate the new parameters, iterating until convergence. Finally we obtain the following clusters:
[Scatter plots of scores on the 1st and 2nd principal components:]

(a) Gaussian clustering, k = 3
(b) K-means clustering, k = 3

Figure 4: Gaussian vs. K-means
For the six simple data points provided, the two clustering techniques yield identical groupings, as shown in figure 4. However, as the dimensionality and population of the data grow, the Mixture of Gaussians has the upper hand of being a generalized form of k-means, offering more flexible and useful clustering.
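The E- and M-steps described above can be sketched with numpy as follows (an illustrative implementation, not the code used in this work; the small regularization term added to each covariance is my own addition for numerical stability, and the initial parameters would in practice come from k-means):

```python
import numpy as np

def gaussian_pdf(X, mu, sigma):
    """Multivariate normal density evaluated at each row of X."""
    p = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** p * np.linalg.det(sigma))
    expo = -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)  # (x-mu)' inv (x-mu)
    return norm * np.exp(expo)

def gmm_em(X, mus, sigmas, pis, n_iter=50):
    """EM for a mixture of Gaussians, initialized e.g. from k-means output."""
    k = len(pis)
    for _ in range(n_iter):
        # E-step: responsibilities tau[n, i] via Bayes' rule.
        dens = np.column_stack([pis[i] * gaussian_pdf(X, mus[i], sigmas[i])
                                for i in range(k)])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate means, covariances and weights.
        Nk = tau.sum(axis=0)
        mus = [tau[:, i] @ X / Nk[i] for i in range(k)]
        sigmas = [((X - mus[i]).T * tau[:, i]) @ (X - mus[i]) / Nk[i]
                  + 1e-6 * np.eye(X.shape[1]) for i in range(k)]
        pis = Nk / len(X)
    return mus, sigmas, pis, tau
```

A hard cluster assignment is then obtained by taking, for each observation, the component with the largest responsibility, tau.argmax(axis=1).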
4.3.3 Cluster Validation
In a repeated-measures experiment, as is the case here, it is not sufficient only to cluster the data at different intervals; it is highly important to ensure that the new clusters contain the same data points as the previous indexing. This is known as cluster validation, whereby we measure the validity of the new clusters with respect to the initial target clustering. The particular cluster validation method adopted in this thesis is the F-measure
[21]. This criterion describes the validation process in four categories:

• True Positive (TP): an observation in cluster X is found in cluster Y

• False Positive (FP): an observation in cluster X is not found in cluster Y

• False Negative (FN): an observation not in cluster X is found in cluster Y

• True Negative (TN): an observation not in cluster X is not found in cluster Y
The F-measure is similar to the Rand index, which penalizes both False Positive and False Negative decisions; however, the F-measure assigns weights to the associated penalties.
F-measure can be formulated as such:
F-measure: F_β = (1 + β^2) P R / (β^2 P + R)

where

Precision: P = TP / (TP + FP)

and

Recall: R = TP / (TP + FN)
Note that β is the weighted penalty that is predetermined, usually set to 1.
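The formula translates directly into a few lines of code (a hypothetical helper, shown here for illustration):

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Weighted F-measure from the TP/FP/FN counts of a cluster comparison."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

With β = 1 the measure reduces to the harmonic mean of precision and recall; β > 1 weights recall (false negatives) more heavily than precision.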
Hence, going back to our example data points, we take an additional measurement of mathematical skill:
height  weight  age   reach   math
 (m)     (kg)   (y)    (m)   (/100)
1.88      88    24    3.01    100
1.68      64    18    2.71     60
1.75      68    25    2.91     88
1.98      90    22    3.41    100
2.08      92    24    4.05     20
1.74      80    35    2.02     55
Clustering with the reach and mathematical-skill measurements, we obtain the following groupings:

[Scatter plots of scores on the 1st and 2nd principal components:]

(a) Gaussian (k = 3), with reach
(b) Gaussian (k = 3), with mathematical skills
Figure 5: F-measure
Notice how one of the green and one of the blue observations in the left-hand clustering have turned into red observations. This indicates that data points have shifted between the clusters. Thus, calculating the F-measure with β = 1:
with TP = 1, TN = 8, FP = 3, & FN = 3. This yields:

F_β = 0.25 ≪ 1
Therefore the clusters cannot be considered to be the same in both groupings.
However, in the case of F_β being close to 1, we are able to track the shift of each cluster's centroid. This allows us, at a later stage, to perform trend analysis through least-squares fitting, highlighting seasonal changes exhibited in the tenants' consumption.
We are able to visualize the summary result of our work thus far as follows:
[Schematic: hot water consumption plotted against electricity consumption for each month, January through December, with trend lines.]
Figure 6: Trend analysis for Seasonal Changes
Here each apartment is considered an observation, grouped according to its characteristics as well as its electricity and hot water consumption data. Furthermore, we were able to track the cluster centres through cluster validation, in order to analyse the trend using regression and conclude a seasonal change model.
4.3.4 Multivariate Analysis of Variance (MANOVA)
Having obtained the different groups from the applied clustering, we can now proceed to investigate whether feedback introduction holds any significant effect on any of the groups in terms of electricity and hot water consumption. We therefore want to look into the change in variance to verify whether the change is a significant one or the result of random chance. Having different groups implies the use of multiple t-tests, which could lead to a high type 1 error [22]. Hence the need for an ANOVA-type test, which can take the different groups into consideration in a single test, thus minimizing the sensitivity to type 1 error. However, since we are required to look into two dependent variables, the multivariate case of ANOVA, or MANOVA, comes into play. MANOVA can handle the analysis of two separate dependent variables in a single test [23].
Analysis of Variance examines the change in the mean distribution between groups. Seen as an extension of the t-test, ANOVA examines the total Sum of Squares Within (SSW) as well as the total Sum of Squares Between (SSB) [22]. ANOVA utilizes Fisher's distribution as part of its statistical significance testing, and as such requires us to compute the F-statistic:

F-statistic = [SSB / (m − 1)] / [SSW / (m(n − 1))]
Note that the denominators (m − 1) & m(n − 1) represent the degrees of freedom (df), where m is the number of data sets and n is the number of observations within a data set. SSB is calculated as:

SSB = n Σ_{j=1}^{m} (x̄_j − X̄)^2

& SSW is calculated as:

SSW = Σ_{j=1}^{m} Σ_{i=1}^{n} (x_i − x̄_j)^2

where x_i denotes observation i, x̄_j represents the mean of group j, and X̄ is the total mean of all data groups.
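The SSB/SSW computation above can be sketched as follows (an illustrative implementation assuming m groups of equal size n, not the code used in this work):

```python
import numpy as np

def anova_f(groups):
    """One-way ANOVA F-statistic for m groups of n observations each."""
    groups = np.asarray(groups, dtype=float)   # shape (m, n)
    m, n = groups.shape
    group_means = groups.mean(axis=1)
    grand_mean = groups.mean()
    # Between-groups and within-groups sums of squares.
    ssb = n * np.sum((group_means - grand_mean) ** 2)
    ssw = np.sum((groups - group_means[:, None]) ** 2)
    return (ssb / (m - 1)) / (ssw / (m * (n - 1)))
```

Applied to the five measurement groups of the worked example that follows, this returns an F-statistic of roughly 35.7, in line with the hand calculation.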
However, an apparent question surfaces which bids us to evaluate the use of MANOVA in contrast to a repeated-measures ANOVA. Two straightforward answers come to mind. The first is that a MANOVA, in contrast to a repeated ANOVA, does not assume data sphericity; sphericity assumes that the variance of the population remains constant regardless of the combination of dependent variables [23]. Secondly, performing repeated ANOVAs increases the type 1 error in the experiment. A more subtle and intuitive answer, however, is that repeated ANOVAs run separately will not account for the covariation pattern of the Dependent Variables, DVs [24]. To help us better understand this we consider the following
figure:

[Schematic: three group distributions A, B & C projected onto the individual dependent-variable axes and onto a discriminant combination.]
Figure 7: ANOVA vs. MANOVA
Figure 7 shows that performing repeated ANOVAs on three normally distributed groups, A, B, & C, results in three normal distributions projected onto the respective orthogonal axes. These two projections show the three group means close to each other, making it difficult to determine whether A, B, and C indeed have different mean distributions. A MANOVA, on the other hand, projects onto a linear combination of the dependent-variable axes [25], thus accurately portraying the three groups as having three distinct mean distributions. This clearly demonstrates the advantage of a MANOVA over repeated ANOVAs in this aspect of statistical analysis.
height  weight  age   reach   math
 (m)     (kg)   (y)    (m)   (/100)
1.88      88    24    3.01    100
1.68      64    18    2.71     60
1.75      68    25    2.91     88
1.98      90    22    3.41    100
2.08      92    24    4.05     20
1.74      80    35    2.02     55
x̄_h = 1.85   x̄_w = 80.33   x̄_a = 24.67   x̄_r = 3.02   x̄_m = 70.5

Table 7: Set of generated data with repeated measurements, continued

With

X̄ = (x̄_h + x̄_w + x̄_a + x̄_r + x̄_m) / 5 = 36.07

Thus

SSB = n Σ_{i=1}^{m} (x̄_i − X̄)^2
    = 6[(1.85 − 36.07)^2 + (80.33 − 36.07)^2 + (24.67 − 36.07)^2 + (3.02 − 36.07)^2 + (70.5 − 36.07)^2]
    = 3.32 · 10^4

&

SSW = Σ_{j=1}^{m} Σ_{i=1}^{n} (x_i − x̄_j)^2
    = [(1.88 − 1.85)^2 + (1.68 − 1.85)^2 + (1.75 − 1.85)^2 + (1.98 − 1.85)^2 + (2.08 − 1.85)^2 + (1.74 − 1.85)^2]
    + [(88 − 80.33)^2 + (64 − 80.33)^2 + (68 − 80.33)^2 + (90 − 80.33)^2 + (92 − 80.33)^2 + (80 − 80.33)^2]
    + [(24 − 24.67)^2 + (18 − 24.67)^2 + (25 − 24.67)^2 + (22 − 24.67)^2 + (24 − 24.67)^2 + (35 − 24.67)^2]
    + [(3.01 − 3.02)^2 + (2.71 − 3.02)^2 + (2.91 − 3.02)^2 + (3.41 − 3.02)^2 + (4.05 − 3.02)^2 + (2.02 − 3.02)^2]
    + [(100 − 70.5)^2 + (60 − 70.5)^2 + (88 − 70.5)^2 + (100 − 70.5)^2 + (20 − 70.5)^2 + (55 − 70.5)^2]
    = 0.1217 + 707.33 + 159.33 + 2.32 + 4.95 · 10^3 = 5.82 · 10^3

Therefore:

F-statistic = [SSB / (m − 1)] / [SSW / (m(n − 1))] = (3.32 · 10^4 / 4) / (5.82 · 10^3 / 25) = 8300 / 232.80 = 35.65
With a significance level of α = 0.10, and with numerator degrees of freedom df_1 = 4 and denominator degrees of freedom df_2 = 25, we conclude that our F-statistic of 35.65 far exceeds the critical F value of 2.18, meaning that there is very little possibility that our null hypothesis is true. We therefore move to reject the null hypothesis in favour of the alternate hypothesis, which states that there is a significant difference in our population means. Note that the null hypothesis in an ANOVA is usually formulated as:
H0 : ¯X1 = ¯X2 = ¯X3 = ... = ¯XK
where K represents the number of levels of the independent variable. The alternate hypothesis, on the other hand, rejects H0 with:
H1 : ¯X1 6= ¯X2 6= ¯X3 6= ... 6= ¯XK
Analogously, a multivariate ANOVA, or MANOVA, dictates the null hypothesis to be of a more general form:

H0: [X̄_11, X̄_21, ..., X̄_p1]' = [X̄_12, X̄_22, ..., X̄_p2]' = ... = [X̄_1K, X̄_2K, ..., X̄_pK]'

where p represents the number of dependent variables in all K levels. Similarly, the alternate hypothesis in a MANOVA is formulated as:

H1: [X̄_11, X̄_21, ..., X̄_p1]' ≠ [X̄_12, X̄_22, ..., X̄_p2]' ≠ ... ≠ [X̄_1K, X̄_2K, ..., X̄_pK]'
While ANOVA evaluates the probability of rejecting H0 given the F-statistic and the degrees of freedom, the MANOVA's test statistic is λ:

λ = |SSW| / |SSW + SSB|

where |SSW| & |SST| = |SSB + SSW| are the determinants of the within-groups and total sum-of-squares and cross-products matrices.

Notice the key role SSB plays in the variation of λ between 1 and 0: as SSB increases in value, λ decreases towards zero, while as SSB decreases, λ increases towards one.
Taking the example data in table 6 and adding an additional group to each set, we end up with:
 height       weight        age         reach         math
  (m)          (kg)         (y)          (m)         (/100)
 M     F     M     F      M    F      M     F      M     F
1.88  1.65   88    59     24   21    3.01  2.64   100    58
1.68  1.58   64    45     18   17    2.71  2.45    60    51
1.75  1.65   68    55     25   28    2.91  2.04    88    90
1.98  1.78   90    66     22   18    3.41  2.87   100   100
2.08  1.88   92    75     24   26    4.05  2.52    20    60
1.74  1.74   80    61     35   32    2.02  1.75    55    43
x̄_11 x̄_21  x̄_12 x̄_22  x̄_13 x̄_23 x̄_14  x̄_24  x̄_15  x̄_25
1.85  1.71  80.33 60.17 24.67 23.67  3.02  2.38   70.50   67
Table 8: Set of generated data with repeated measurements for Males & Females
Calculating λ:

SSW = SSW_1 + SSW_2 + SSW_3 + SSW_4 + SSW_5

where:

SSW_1 = [ ss_1 ss_12; ss_21 ss_2 ]
      = [ Σ_{j=1}^{6}(x_1(j) − x̄_11)^2  Σ_{j=1}^{6}(x_1(j) − x̄_11)(x_2(j) − x̄_21); Σ_{j=1}^{6}(x_1(j) − x̄_11)(x_2(j) − x̄_21)  Σ_{j=1}^{6}(x_2(j) − x̄_21)^2 ]
      = [ 0.1217 0.0712; 0.0712 0.0587 ]

Similarly,

SSW_2 = [ 707.33 531.66; 531.66 512.83 ]
SSW_3 = [ 159.33 147.33; 147.33 177.33 ]
SSW_4 = [ 2.32 0.97; 0.97 0.84 ]
SSW_5 = [ 4.94 · 10^3 2004; 2004 2580 ]

Having,

SSW = [ 5.8091 · 10^3 2.6840 · 10^3; 2.6840 · 10^3 3.2711 · 10^3 ]

While,

X̄ = 33.53

Therefore,

SSB_1 = [ 6.021 · 10^3 6.048 · 10^3; 6.048 · 10^3 6.075 · 10^3 ]
SSB_2 = [ 1.31 · 10^4 7.48 · 10^3; 7.48 · 10^3 4.25 · 10^3 ]
SSB_3 = [ 470.99 524.15; 524.15 583.31 ]
SSB_4 = [ 5.58 · 10^3 5.70 · 10^3; 5.70 · 10^3 5.82 · 10^3 ]
SSB_5 = [ 8.20 · 10^3 7.42 · 10^3; 7.42 · 10^3 6.72 · 10^3 ]

Having,

SSB = SSB_1 + SSB_2 + SSB_3 + SSB_4 + SSB_5 = [ 3.3372 · 10^4 2.7172 · 10^4; 2.7172 · 10^4 2.3448 · 10^4 ]

Concluding that:

SST = SSW + SSB = [ 3.91 · 10^4 2.98 · 10^4; 2.98 · 10^4 2.6719 · 10^4 ]

λ = |SSW| / |SST| = (1.1798 · 10^7) / (1.555 · 10^8) = 0.0759
Note that this is Wilks' definition of:

λ = |SSW| / |SST|

whereas Pillai's trace is defined as:

V = Σ eig(SSB / SST)

the sum of the eigenvalues of the matrix [SSB / SST]; Roy's formulation of λ is the largest eigenvalue of the matrix [SSB / SSW]; and finally Hotelling's trace is defined as the sum of the eigenvalues of the matrix [SSB / SSW] [23].
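These four statistics can be computed in a few lines of numpy (a hypothetical helper for illustration; the matrix "ratios" are implemented as products with matrix inverses):

```python
import numpy as np

def manova_statistics(ssb, ssw):
    """Wilks', Pillai's, Roy's and Hotelling's statistics from the between-
    and within-groups sum-of-squares and cross-products matrices."""
    sst = ssb + ssw
    eig_t = np.linalg.eigvals(ssb @ np.linalg.inv(sst)).real  # for Pillai
    eig_w = np.linalg.eigvals(ssb @ np.linalg.inv(ssw)).real  # for Roy/Hotelling
    return {
        'wilks': np.linalg.det(ssw) / np.linalg.det(sst),
        'pillai': eig_t.sum(),
        'roy': eig_w.max(),
        'hotelling': eig_w.sum(),
    }
```

Plugging in the SSW and SSB matrices from the worked example above reproduces a Wilks' λ of approximately 0.0759.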
These different tests provide us with different viewing angles in assessing differences in the linear combinations of dependent variables. In simpler terms, they aid us in the analysis of the variance-covariance matrices resulting from the MANOVA [22]. The usual tests conducted in a MANOVA are Wilks' or Pillai's, due to their robustness towards violations of homogeneity of the covariance matrix, and because they yield conservative results. This relates to the assumptions made in applying MANOVA to our data set [25]:
• Normality: MANOVA has the underlying assumption that the dependent variables and their linear combinations are normally distributed.

• Independence: MANOVA assumes the observations to be independent of one another.

• Outliers: MANOVA is sensitive to data outliers, which impact the type 1 error; hence the need to identify outliers using Mahalanobis' distance measure.

• Linearity: MANOVA assumes all dependent variables to be linearly correlated.
Adopting Wilks' lambda formulation, we proceed to approximate the significance through a chi-squared test. In other words, assuming a chi-squared distribution, we have:

χ^2 = −[(N − 1) − 0.5(p + k)] ln λ

where N is the total number of observations, and p(k − 1) represents the degrees of freedom.
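This approximation is a one-liner in code (a sketch with hypothetical helper names):

```python
import numpy as np

def wilks_chi2(lmbda, n_obs, p, k):
    """Chi-squared approximation for Wilks' lambda,
    with p * (k - 1) degrees of freedom."""
    chi2 = -((n_obs - 1) - 0.5 * (p + k)) * np.log(lmbda)
    return chi2, p * (k - 1)
```

For instance, wilks_chi2(0.0759, 30, 2, 5) returns a statistic of about 65.75 with 8 degrees of freedom, matching the worked example that follows.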
Hence, following up on our example, we have:

χ^2 = −[(30 − 1) − 0.5(2 + 5)] ln 0.0759 = 65.75

with 8 degrees of freedom. Thus, consulting the chi-squared chart (see Appendix B) and adopting a significance level of 0.10, we find that the critical statistic χ^2_c = 13.36 is far below our computed value of 65.75. We can then move to reject the null hypothesis in favour of the alternate hypothesis, realizing that there is indeed a significant change in mean distribution among the groups.

After identifying a significant change in the mean distribution, we proceed to determine in which of the dependent variables the change is exhibited, and in which groups the change was significant. To answer the question of which group exhibited the significant change, we look into multiple multivariate t-tests such as Tukey's Honestly Significant Difference (HSD) test, which performs pairwise comparisons of means µ_i − µ_j in order to identify which group mean holds the highest significance level [23]. As for the problem of determining which DV invokes the highest mean shift for the different groups, we approach this through Discriminant Function Analysis, DFA. The most common type of DFA is what is referred to as the Roy-Bargmann step-down procedure. In short, we can identify whether the groups differ by taking a single dependent variable at a time. We thus prioritize the DVs, consider the (theoretically) most important DV first, and then check all subsequent DVs after removing the effects of the higher-priority variables by treating them as covariates [23]. This type of canonical analysis is similar to the PCA described previously, where we investigate the combination of the original variables that has the highest variation; unlike PCA, however, in multivariate analysis we investigate the linear combination of variables which yields the highest separation between groups. This procedure is repeated to reveal all levels of separation through the canonical analysis described above.
"...there is something inherently ridiculous about digital watches... We were trying to find ways of translating purely numeric data into graphic form so that the information leapt easily to the eye... then we suddenly got terribly excited at the idea of translating them 'back' to numeric data, simply because we suddenly had the technology to do so."

FAX from Douglas Adams to Byron Preiss
5 Results & Discussion
In this section we apply the theoretical procedures described in the Methodology section to the data collected thus far. The information obtained from Nynäshamnbostäder AB, at present, is one year of hot water consumption data spanning January 2011 to December 2011 for 115 residential apartments in Nynäshamn. We were also provided with apartment characteristic data for the following categories:
• Age of official tenant in years,
• Number of occupants in an apartment,
• Apartment area in square metres (m²),
• Location factor.
The location factor is an index value between 0.25 and 1, taking into consideration the location of the apartment with respect to building level (assigning higher values to low-level apartments and lower values to higher-level apartments). This is justified by the increase in heating consumption necessary to overcome wind and natural heat dissipation in higher-level apartments. The hot water consumption data has a monthly resolution, taken at the start of every month and measured in cubic metres (m³).
5.1 Box Plot
In order to get a general and graphical look at the characteristic data of the apartments at hand, we produce the box plot below:
[Box plots: number of occupants and location factor (left); apartment area and tenant age (right).]

Figure 8: Box Plot of Characteristic Data

The box plot reveals the following statistical descriptions:
• Standard deviation: displays the standard deviation above and below the mean, represented through the box's length.

• Min/Max: marks the minimum and maximum values in the data, represented by two dotted T's.

• Outliers: reveals possible outliers in the data, represented by red plus signs (+).
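For reference, a conventional quartile-based box-plot summary can be computed as in the sketch below (this uses Tukey's 1.5 · IQR fences for outliers, which differs slightly from the standard-deviation boxes described above; the function is illustrative, not the code used in this work):

```python
import numpy as np

def box_summary(x):
    """Five-number summary plus outliers beyond the 1.5 * IQR fences."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = x[(x < lo) | (x > hi)]
    return {'min': x.min(), 'q1': q1, 'median': med, 'q3': q3,
            'max': x.max(), 'outliers': outliers}
```

Applied to an age-like sample dominated by values in the seventies, a lone value of 35 would be flagged as an outlier, mirroring the young-tenant outliers visible in the plot.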
From the box plot we conclude that the median value of the location factor is 0.65, with minimum and maximum values of 0.25 & 1 respectively. The standard deviation of the location factor is evenly distributed below and above the mean, resulting in a symmetrical box plot around the mean. This suggests a roughly normal distribution of apartments, since the location factor depends on the geographical location of the apartment.

When it comes to the number of occupants in an apartment, we see a mean of 1.34 occupants per apartment with a high positive standard deviation, while the maximum number of occupants is 3. The majority of these apartments are thus inhabited by a single tenant, while a distributed few have 3 occupants, indicating that we are dealing with single-occupant apartments rather than family housing.

As for the age distribution among the residents, we have an average age of 73.12, with minimum & maximum values at 55 and 95 years respectively. The box plot also shows a somewhat symmetrical distribution around the mean, however with several outliers in the younger age group below the minimum value, the outermost being 35 years of age. This demonstrates that the population at hand is mostly an elderly one, and the presence of outliers in the younger age group suggests the possible need to divide the age category into two groups for further processing.
Finally, the apartment area values reveal that the average residence area lies around 60.88 m², with most of the distribution skewed towards the region above the mean. The minimum and maximum values are 43.4 m² and 71.7 m². This shows that most of the apartments are mid-to-small sized, which further backs up the occupancy data with its mean of 1.3 occupants per apartment.
5.2 Principal Components
Principal Component Analysis performed on the characteristic data results in the following:

eigenvalues: 0.5326, 0.7024, 1.1454, 1.6196

eigenvectors:
  0.7082  −0.0277  −0.3944   0.5850
 −0.6966  −0.1863  −0.3566   0.5940
  0.0668  −0.7357  −0.4996  −0.4525
  0.0939  −0.6506   0.6838   0.3166

Variance: 1.6196, 1.1454, 0.7024, 0.5326
Percent Explained: 40.4909, 28.6357, 17.5595, 13.3139
Represented graphically in figure 9 below:
[Pareto chart: variance explained (%) by each principal component.]
Figure 9: Variance in percentage
Thus far PCA has given us four principal components, with the highest two accounting for 69.12 % of the total variance; adding the third principal component brings us to 86.68 % of all data variance. This informs us that we should be investigating 2-dimensional as well as 3-dimensional PCA charts to properly address the hidden structure behind the characteristic data.
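The eigen-decomposition above can be reproduced with numpy (an illustrative sketch on standardized data, so that the eigenvalues of the correlation matrix sum to the number of variables; not the code used in this work):

```python
import numpy as np

def pca_explained(X):
    """PCA via eigendecomposition of the correlation matrix of standardized
    data; returns eigenvalues (descending), eigenvectors and percent variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    C = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]        # sort descending
    eigvals = eigvals[order]
    percent = 100 * eigvals / eigvals.sum()  # variance explained per PC
    return eigvals, eigvecs[:, order], percent
```

The cumulative sum of the percentage vector then gives the 69.12 % and 86.68 % style figures used to decide how many components to retain.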
Visualizing the first two principal components, we investigate Figure 10:

[Biplot: projected data with component vectors (occupants, area, age, locfac) on the first two principal components.]
Figure 10: 2-D Principal Component Analysis
First let us inspect the data scatter in the biplot of Figure 10. The distribution of red points shows that there are few outliers in the data, since most of the points are located near the centre of the axes. To verify this we look at Hotelling's T² as a measure of the multivariate distance of each observation from the centre of the data set. Through this analytical procedure we are able to evaluate the outliers and the mean distance from the centre. Hotelling's T² reveals two extreme outliers with distances of 15.29 & 15.0374, while the mean distance is 3.965. With this we can proceed to remove the outliers from our data set. This verifies the biplot result, namely that most of the observations are centred close to each other with few outliers. However, the biplot scatter of data points also reveals a possible 3-cluster formation, highlighted below:
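Hotelling's T² per observation can be sketched from the PCA scores and eigenvalues (a hypothetical helper; each squared score is scaled by the variance of its component):

```python
import numpy as np

def hotelling_t2(scores, eigvals):
    """Hotelling's T^2 distance of each observation from the data centre:
    sum over components of score^2 / eigenvalue."""
    return np.sum(np.asarray(scores) ** 2 / np.asarray(eigvals), axis=1)
```

Observations whose T² lies far above the mean distance (here 15.29 and 15.04 against a mean of 3.965) are the candidates for removal as outliers.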
[Biplot of Figure 10 with three candidate clusters highlighted.]
Figure 11: Possible Clusters
Now shifting our attention to the coefficients, we first note that all categories are far from the centre, meaning certain conclusions about their correlations can be drawn. The Area and Occupant coefficients being closely placed indicates that those two categories are positively correlated, whilst the Location Factor coefficient being orthogonal to both the Area and Occupant coefficients hints that these are in fact uncorrelated. Age is similarly orthogonal to the Area and Occupant coefficients, indicating that these too are uncorrelated, whilst Age and Location Factor being opposite to each other indicates that they are in fact negatively correlated.
Making sense of these correlations must be done with caution: correlation does not imply causality, though it hints at a possible underlying structure. That being said, it is interesting to analyse these correlations to better understand the workings behind the characteristic data. Having Area and Occupant positively correlated comes as no surprise, since a smaller apartment area logically attracts fewer occupants; with an increase in apartment size we witness an increase in the number of inhabitants. It is worth noting that Area is the driving force behind Occupant, by the logical reasoning that the apartment area is present first and thus plays a role in attracting the tenants at a later stage. Therefore, the value of Area tends to explain the value of Occupant.
The biplot also shows that Area and Occupant are uncorrelated with respect to Location Factor. Location Factor is based on the apartment's level in the building: a high value indicates a ground-floor apartment, while a low value indicates a top-floor apartment. Having said that, we see no effect of apartment level on the Area and Occupant categories, and as such they are deemed uncorrelated.
Finally, investigating the negative correlation between the Location Factor coefficient and Age reveals an interesting fact which is backed up by the box plot summary presented previously. Knowing our population to be an elderly one, based on the age category mean in the box plot in figure 8, this translates to a higher rate of older tenants (high Age value) renting upper-floor apartments (low Location Factor value). The apartment level thus appears to play a role in the age distribution of the tenants, with older tenants renting the upper floors while younger ones occupy ground levels. Note that since Age is orthogonal to both Area and Occupant, they are found to be uncorrelated.
Taking into account the third principal component, we obtain Figure 12:
Figure 12: 3-D Principal Component Analysis
Adding the third principal component reveals that Location Factor is coupled with the third principal component rather than the first two. As a consequence, projecting the data solely onto two principal components will not accurately represent the full dimensions of the observations. This is backed up by the variances displayed in figure 9. However, weighing the cost of adding a third principal component against the gain in representation accuracy, it is sufficient to take two principal components into consideration. We can therefore conclude that the first two principal components are sufficient for our statistical analysis of the data.
5.3 Clustering
Based on the preliminary conclusions drawn from the PCA, it is recommended to cluster our characteristic data, along with the hot water consumption data, into 3 cluster groups. However, it is important to verify the proposed number of clusters analytically, and as such we compute the previously described Davies-Bouldin index for:
2 ≤ k ≤ 9

where 2 is the minimum number of clusters allowed, whilst 9 comes as a conservative estimate from the rule of thumb for determining the number of clusters in a data set formulated by [26]:

k ≈ √(n/2)
where n is the total number of observations, here found to be 115 apartments. Running the iterative program for different k values we obtain:
(a) Davies-Bouldin index
(b) Akaike information criterion
Figure 13: Cluster Validation Measures
The Akaike Information Criterion, or AIC, is another relative goodness-of-fit measure, based on the entropy of the data. AIC is often used when applying a Gaussian mixture model clustering algorithm, since the log-likelihood is a required measurement in both. AIC is formulated as:

AIC = 2k − 2 ln(L)

where L represents the likelihood under the probability distribution, Gaussian in our case, calculated as previously described in the Clustering section.
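As a sketch of how AIC can be used for model selection over k (illustrative helpers with my own names; here k in the AIC formula is taken as the number of free model parameters, with a full-covariance Gaussian mixture contributing (k − 1) weights, k·p means and k symmetric p × p covariances):

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

def gmm_n_params(k, p):
    """Free parameters of a k-component Gaussian mixture in p dimensions."""
    return (k - 1) + k * p + k * (p * (p + 1) // 2)
```

One would evaluate aic(log_likelihood_k, gmm_n_params(k, p)) for each candidate k and prefer the smallest value, mirroring the AIC curve in figure 13.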
Note that the two output measures broadly agree on a number of clusters equal to 3. The Davies-Bouldin graph shows that k = 3 yields the best results, while the Akaike criterion shows that k = 4 yields the best fit. However, the D-B index shows that k = 4 is not an ideal case, whilst the AIC agrees that k = 3 will yield respectable results. Thus, based on the latter and on the observations seen in the PCA biplot, we conclude that k = 3 is an adequate choice.
Having agreed on the number of clusters, we move on to the actual clustering itself. Applying the Mixture of Gaussians with k = 3, and representing the clustering results through the principal-components projection of the multivariate data, we obtain:
[Monthly Gaussian clustering results projected onto the first two principal components, with cluster centroids marked: (a) January through (l) December.]
Figure 14: Gaussian clustering with k = 3
This series of clusterings takes into account the monthly hot water consumption along with the characteristic data of every apartment. In the series of subplots above, we visualize every apartment as a coloured point in the scatter. The colour indicates the point's association with one of the three clusters, represented in red, green, & blue. Also shown are the cluster centres, represented as ⊗. Note that a cluster's centre does not necessarily