
KTH Architecture and the Built Environment

The Influence of Energy Use Visualization

on the Energy Consumption in Municipal

Multi-Apartment Buildings

The Case of Nynäshamnbostäder

Marc C. Azar

Master of Science Thesis 2012:125

KTH School of Architecture and the Built Environment
Division of Building Service and Energy Systems

Master of Science Thesis 2012:125

The Influence of Energy Use Visualization on the Energy Consumption in Municipal Multi-Apartment Buildings
The Case of Nynäshamnbostäder

Marc C. Azar

© 2012 KTH School of Architecture and the Built Environment
Division of Building Service and Energy Systems
SE-100 44 STOCKHOLM
Sweden

Approved: 2012
Examiner: Prof. Ivo Martinac
Supervisor: Prof. Ivo Martinac
Commissioner / Contact Person

Abstract

This thesis investigates the influence of energy visualization on the hot water consumption of multi-apartment buildings in Sweden, and builds up the framework for the analysis of their electricity consumption. 115 apartments in Nynäshamn have been scheduled to be equipped with feedback visualization, through the use of television sets, allowing the monitoring of electricity and hot water consumption on a monthly basis. One year of consumption data prior to this feedback introduction was acquired for statistical analysis. The results were then displayed and analysed, allowing for the composition of a generalized conclusion whilst revealing the need for further investigation and future work.

To achieve this end, clustering was performed on the 115 apartments according to the following characteristics: number of tenants per apartment, area of the apartment, location of the apartment, and age of the tenants. Principal Component Analysis was used to select dominant characteristics by eliminating highly correlated components, after which trend analysis was performed on each of the separate clusters, revealing a seasonal change model.

Finally, a Multivariate Analysis of Variance applied to the paired clusters to identify any significant change, along with post-analysis tests to specify the groups in which significant change was detected, is presented for application in future work.


The preliminary results clearly show that the characteristic data can be grouped into three distinct clusters, each with a distinct hot water consumption trend. Moreover, the data reveal a correlation between an apartment's characteristics and its hot water consumption. However, further monitoring and data collection will be required before any strong trend can be identified, and power analysis will have to be applied before any significant change can be concluded. Nonetheless, the initial results demonstrate promising signs that ought to be investigated further in the future.


Acknowledgments

I would like to acknowledge first and foremost my supervisor and advisor Prof. Ivo Martinac, whose advice and flexibility gave me the freedom of tackling the issues at hand to further enhance my knowledge and abilities, and for extending me his utmost trust and faith in conducting this thesis.

I would also like to extend an acknowledgement to Nynäshamnbostäder for all the data acquired, without which this thesis would not have been possible, and to Anna-Carin Öster in particular for her cooperation in facilitating the data acquisition.

Finally, I would like to thank PhD student Jimmy Azar, my brother, for his continuous advice and support in pattern recognition, and my family & friends for their love and tolerance of my behavior throughout the duration of this thesis.


Contents

1 Introduction
2 Literature Review
3 Background
4 Methodology
4.1 Data Collection & Representation
4.2 Hypothesis
4.3 Statistical Analysis
4.3.1 Principal Component Analysis
4.3.2 Clustering
4.3.3 Cluster Validation
4.3.4 Multivariate Analysis of Variance (MANOVA)
5 Results & Discussion
5.1 Box Plot
5.2 Principal Components
5.3 Clustering
5.4 Cluster Validation
5.5 Power & Size Analysis
6 Conclusion
7 Future Work
Appendix A
Appendix B


List of Figures

1 Biplot of correlation values and scores
2 k-means of data
3 Davies-Bouldin index for k = 2, 3, 4, 5
4 Gaussian vs. k-means
5 F-measure
6 Trend analysis for seasonal changes
7 ANOVA vs. MANOVA
8 Box plot of characteristic data
9 Variance in percentage
10 2-D Principal Component Analysis
11 Possible clusters
12 3-D Principal Component Analysis
13 Cluster validation measures
14 Gaussian clustering with k = 3
15 Tracking mean of cluster 1
16 Tracking mean of cluster 2
17 Average temperature of Stockholm in Celsius
18 Tracking mean of cluster 3
19 F-measure & Rand index
20 F-statistic for α = 0.1
21 Chi-squared table


List of Tables

1 Possible outcomes of hypothesis testing
2 Set of generated data
3 Normalized data with h̄ = 1.85, w̄ = 80.33, ā = 24.67, & r̄ = 3.02
4 SVD decomposition of matrix M into U, Σ, & V
5 1st & 2nd principal components
6 Set of generated data with repeated measurements
7 Set of generated data with repeated measurements, continued
8 Set of generated data with repeated measurements for males & females
9 Summary of clustering of characteristic data


1 Introduction

According to the 2020 agreement signed by Sweden, a twenty percent reduction of carbon dioxide emissions, a twenty percent increase of renewable energy production, and a twenty percent decrease in energy consumption should be achieved by the year 2020 [1]. Further investigations conducted by the OECD (Organization for Economic Cooperation and Development) show that residential and commercial buildings account for about 30 ∼ 40 % of primary energy use and greenhouse gas emissions [2]. It is evident therefore that, for the target to be met, Sweden has to concentrate its efforts on reducing the energy consumption of residential buildings.

The advancement of today's technology has made easy the installation of "smart meters" capable of real-time monitoring and display of household energy consumption [3]. Whilst the trend in energy reduction of residential buildings focuses on the introduction of integrated automation systems and smart metering ensuring efficient operation and maintenance [4], a different approach should be adopted for Sweden due to its unique building demographic. The fact that more than 90% of Swedish buildings date back to before 1990 [34] limits the potential of BEAM (Building Energy Automation Management) systems. Instead our focus should be on investigating the largest energy-wasting source in buildings, accounting for more than 38% on average [5]: the occupant himself.

Influencing the lifestyle of a tenant to include energy-saving decisions in his everyday activities has been the subject of several studies revolving around the effect of information feedback on energy consumption through continuous monitoring. The effects of visualizing energy consumption will be the focus of this thesis, in the hope of shedding more light on the complex relationship between consumption and influential exterior variables.


2 Literature Review

The magnitude of the effect of information feedback on household consumption has sparked a widespread debate within the academic field. In 1987 Sexton et al. published an empirical experiment conducted on four hundred and eighty Californian households showing the effect of information feedback on electricity consumption [6]. Since then several papers have surfaced revolving around the same topic, of which twenty-two publications, including Sexton's, were summarized in Corinna Fischer's own publication for the Energy Efficiency journal under the title "Feedback on household electricity consumption: a tool for saving energy?". Fischer concluded in the latter publication that future work in this field is required to bridge the recurring gaps in each of the summarized studies, of which the lack of a large and multi-national sample, as well as a concerning lack of studies on the preference of graphical representation and visualization in the feedback process, should be inspected. The results of the summarized publications varied from a negative effect on electricity consumption (believed to be due to the rebound effect as documented by Sexton) to a respectable 14% decrease in consumption (relating to Group 3 of Mosler and Gutscher's case study of 2004) [7].

It is worth noting that the latter issue of graphical preference has been the subject of recent publications, such as the Energy Saving Trust report written by Will Anderson and Vicki White entitled "The smart way to display, Full report: exploring consumer preferences for home energy display functionality", as seen in the Center for Sustainable Energy Journal. Similarly, IEEE Transactions on Industrial Electronics Vol. 59 published a paper entitled "A Comparative Study of Three Feedback Devices for Residential Real-Time Energy Monitoring" in April 2012, tackling the display preference of households on feedback monitors. However, it is evident after reviewing all these studies that they have concentrated on electric space heating while neglecting other sources of space and water heating/cooling, such as natural gas or district heating, save for Haakana's study on one hundred and five single-family houses in Finland [8]. This is highly significant due to the fact that more than seventy-two percent of residential buildings' energy consumption is attributed to water and space heating/cooling [9].

As an example, in Wilhite's publication, which involves around two thousand Norwegian households and is included in Fischer's summary of studies, it is clearly stated that: "The electricity savings potential of historical feedback may be higher in Norway than in other countries, primarily because of the extensive amount of electric space heating in Norway." He goes on to state: "These techniques could however be applied to natural gas billing. And the customer satisfaction aspects should be universal." [10]

In another publication, Nielsen examines one thousand five hundred Danish households, only to conclude that: "Heating consumption is closely tied to housing standard, size and external temperature, and is, furthermore, subject to behavior oriented variations, mainly in connection with the household's comfort standards (room temperature, hot water demand etc). Electricity consumption for appliances is, on the other hand, much less dependent on the house's particular characteristics than it is on the other factors, such as the household's group size, group structure and requirements or lifestyle. Nordstrand et al, Salg af kollektiv energi, AKF, 1984, also have reservations about the estimated price elasticities, and conclude that electricity use in heating can be considered much more elastic than that in appliances." [11]

I have elaborated in this section in order to establish two main points: first, that water and space heating/cooling is the main contributor to energy consumption in residential areas; second, and more importantly, to emphasize that very little literature exists on the effect information feedback has on household "heating demand" when it comes to water and space heating. This could have a major impact on European countries, and more specifically on Nordic countries, where district heating accounts for a large portion of residential space and water heating. I would also like to note that the papers discussed did not take into account the different types of residences, considering only the overall change. It is vital to investigate energy conservation methods specific to each type of household; therefore it is vital to look into each residence group, from family households to elderly housing areas, single-apartment facilities, and rental apartments, for energy-saving methods via occupant behavior alteration. Another note worthy of mention is the striking lack of power analysis in the mentioned articles, which undermines the statistical analysis results reported in the respective publications.


3 Background

Nynäshamnbostäder AB is a Swedish housing company, operating since 1995, located in Nynäshamn some sixty kilometers south of Stockholm. The company houses more than two thousand seven hundred flats, targeting the middle class, often young couples and retired seniors, according to SCB's (Statistiska Centralbyrån) Boendeprofiler 2009 [12]. Recently the company has decided on the implementation of feedback visualization in a selected group of flats.

Minol is a family-owned company that specializes in metering and billing services worldwide. The company develops innovative metering solutions for heating and cold/hot water consumption, as well as billing services. Nynäshamnbostäder AB has contracted Minol to implement the monitoring of heating and cold/hot water consumption of the selected flats. As a result Minol has installed two temperature sensors, located in one of the rooms and on one of the radiators, along with two water meters to monitor the cold and hot water usage of each flat.

As for the analysis of the effect of energy use visualization on energy consumption, Nynäshamnbostäder AB contracted KTH (Kungliga Tekniska högskolan) of Sweden for the duration of one year. As a result two preliminary reports were produced, in addition to the thesis study at hand.

The 115 apartments that were selected for this experiment were distributed among four residential areas in Nynäshamn: (a) Balder, (b) Mastra, (c) Torp, (d) Vansta. One year of monthly consumption data prior to feedback introduction was provided for hot water usage, and the following characteristics were disclosed: (a) area, (b) number of occupants, (c) age of tenant, (d) location factor.

At first glance, limitations on this study can be predicted to emerge in the form of the resolution of monitoring and feedback, the visualization methodology adopted, the lack of electricity consumption data prior to feedback introduction, the short analysis period post feedback introduction, and the fixed sample pool available.

Subsequently, the effects of these limitations on the experimental setup might come in the form of limitations on the trend analysis due to the short analysis period available post feedback introduction, lack of behavioral electricity consumption data due to non-existing records, delayed behavioral effects due to the low monthly resolution of feedback, and low confidence levels in the power analysis due to inadequate sample size.

Building on the latter, I propose to properly investigate the influence of information on a household's heating demand through an empirical experiment involving Swedish residential households with district heating as the sole source of space and water heating/cooling. This comes as the final chapter of two prequel studies conducted at KTH: Cenac-Morthe Romain's master thesis work in 2011, entitled "Heating energy consumption of a multi-storey municipal residential building", and François Leconte's pre-study report in 2009, entitled "Visualization of Energy Use in Households".

"Grown-ups love figures... One must not hold it against them. Children must be very indulgent towards grown-ups." (Le Petit Prince)


4 Methodology

4.1 Data Collection & Representation

All data collected in this thesis fall under the interval measurement scale, in that it allows us to rank, quantify, and thus compare different readings. This includes the following measurements: age, apartment area, number of tenants, location factor, electricity consumption, and hot water consumption. The measurement error and tolerance of the collected data are noted and calculated accordingly.

Furthermore, the data are encoded in accordance with the Unicode standard, using the 16-bit variable-width transformation format (UTF-16), for computer handling and processing. The data are referred to by codename only, allowing a discreet handling of user information.

4.2 Hypothesis

Given the characteristic data of the apartments, as well as their consumption data, we are to look into the effect of feedback introduction on electricity and hot water consumption. Hence we formulate our null hypothesis as follows:

H0 → Feedback introduction has no significant effect on electricity and/or hot water consumption.

The null hypothesis will be tested against an alternative hypothesis which is formulated as follows:

H1 → Feedback introduction has a significant effect on electricity and/or hot water consumption.

Possible outcomes of hypothesis testing are illustrated below:

                                 Null hypothesis (H0) is true      Null hypothesis (H0) is false
Reject null hypothesis           Type I error (false positive)     Correct outcome (true positive)
Fail to reject null hypothesis   Correct outcome (true negative)   Type II error (false negative)

Table 1: Possible outcomes of hypothesis testing

The mathematical formulation of the hypothesis will be dependent on the type of hypothesis testing adopted and as such will be presented at a later stage in the report.

4.3 Statistical Analysis

In order to properly investigate whether feedback introduction has any effect on the population's consumption of electricity and hot water, we first have to take a closer look at the population itself, as averaging over the whole population post feedback introduction might conceal hidden dynamics in smaller population groups. Hence the need for categorizing the population into smaller samples depending on certain characteristic criteria. As a result, another question surfaces that needs to be addressed: that of criteria selection.

Principal Component Analysis is a method for visualizing higher dimensional data. It does so through a mathematical procedure which applies an orthogonal transformation to a set of correlated variables, producing a set of linearly uncorrelated principal components [13]. This is highly relevant to our case study in that it helps us answer the proposed question of determining the selection criteria. Given multiple characteristics of the population, such as age of tenants, area of the apartment, number of occupants, etc., it would be quite difficult to visualize, and more difficult to analyze, the statistical results of a population grouped on the basis of high dimensional data vectors. However, the number of occupants can be highly correlated with the apartment size. Therefore, if we can implement a procedure to transform the correlated categories into linearly uncorrelated principal components, we are able to eliminate the use of high dimensionality in our data.

4.3.1 Principal Component Analysis

Principal Component Analysis (PCA) makes use of eigenvector decomposition to provide us with a lower dimensional picture of a set of variables, viewed from its most influential point of view [13]. In other words, PCA incorporates a change of basis in order to represent the data in accordance with its characteristic values, or eigenvalues. This is achieved by finding an orthonormal matrix P such that

Y = P X,

where X is our set of data, m by n in dimensions, such that m is the number of categories and n represents the number of samples, while

C_Y = (1/n) Y Y^T

is the covariance matrix of Y in diagonal form. The rows of P will be the principal components sought after. One can easily rewrite:

C_Y = (1/n) Y Y^T = (1/n) (P X)(P X)^T = (1/n) P X X^T P^T = P ((1/n) X X^T) P^T = P C_X P^T

where C_X is the covariance of X. In order to calculate P we have to recall an important fact:

A matrix is symmetric iff it can be orthogonally diagonalized.

Knowing this, let P be a matrix with rows equal to the eigenvectors of C_X. Since C_X = (1/n) X X^T is a symmetric matrix, we have C_X = E D E^T, where E is a matrix of the eigenvectors of C_X arranged in columns and D is a diagonal matrix. Hence we can write P = E^T. Finally, substituting C_X = P^T D P into C_Y = P C_X P^T, we obtain C_Y = D.

Thus we conclude that P diagonalizes C_Y, and as a result the principal components of X, which are the rows of P, are also the eigenvectors of the covariance matrix of X, C_X. To compute the eigenvectors of C_X we will first have to look into Singular Value Decomposition (SVD) [15].

SVD is the factorization of a matrix M into the form

M = U Σ V^T,

where U is a matrix of the eigenvectors of the covariance matrix M M^T, Σ is a rectangular diagonal matrix, and V is a matrix of the eigenvectors of the covariance matrix M^T M. This relates to PCA as follows: let Y be an n by m matrix where

Y = (1/√n) X^T.

We therefore have:

Y^T Y = ((1/√n) X^T)^T ((1/√n) X^T) = (1/n) X X^T = C_X

where C_X is the covariance matrix of X. It is clear now that in order to find the principal components of X, which is equivalent to finding the eigenvectors of C_X, we have to calculate the columns of V through an SVD of [16]:

Y = (1/√n) X^T

There are several assumptions taken into consideration in the application of PCA. Firstly, PCA assumes there is a linear relationship between the variables; this is inherent in performing a linear transformation of basis. Secondly, PCA is highly sensitive to scaling; this also follows from the linear correlation of the variables, since scaling has a direct effect on the extracted eigenvalues. Therefore it is important to perform data normalization, either by subtracting the mean, or through z-normalization using the standard deviation and mean [17].
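As a concrete illustration of the second option, here is a minimal z-normalization sketch (the function name `z_normalize` is mine, not from the thesis), applied to the generated sample data used in the example below:

```python
import numpy as np

def z_normalize(X):
    """Center each column on its mean and scale by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Generated sample data: height (m), weight (kg), age (y), reach (m)
data = np.array([[1.88, 88, 24, 3.01],
                 [1.68, 64, 18, 2.71],
                 [1.75, 68, 25, 2.91],
                 [1.98, 90, 22, 3.41],
                 [2.08, 92, 24, 4.05],
                 [1.74, 80, 35, 2.02]])
Z = z_normalize(data)
```

After this transformation every column has zero mean and unit variance, so no single variable dominates the eigenvalues through its measurement scale alone.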

Let us now give a statistical example where PCA proves useful. Let us assume we have the following set of data:


height (m)   weight (kg)   age (y)   reach (m)
1.88         88            24        3.01
1.68         64            18        2.71
1.75         68            25        2.91
1.98         90            22        3.41
2.08         92            24        4.05
1.74         80            35        2.02

Table 2: Set of generated data

Thus the first step is to normalize our data set through mean subtraction, yielding the following result:

height    weight    age      reach
0.028     7.67      -0.67    -0.01
-0.17     -16.33    -6.67    -0.31
-0.10     -12.33    0.33     -0.11
0.13      9.67      -2.67    0.39
0.23      11.67     -0.67    1.03
-0.11     -0.33     10.33    -1.00

Table 3: Normalized data with h̄ = 1.85, w̄ = 80.33, ā = 24.67, & r̄ = 3.02

We now calculate Y = (1/√n) X^T:

Y =
0.0142    3.8333    -0.3333   -0.0042
-0.0858   -8.1667   -3.3333   -0.1542
-0.0508   -6.1667   0.1667    -0.0542
0.0642    4.8333    -1.3333   0.1958
0.1142    5.8333    -0.3333   0.5158
-0.0558   -0.1667   5.1667    -0.4992


Next we proceed to calculate the SVD of Y:

U =
-0.2820   -0.1205   0.4406    0.7356    -0.2464   0.3318
0.6349    -0.3880   0.2421    -0.1529   0.2045    0.5680
0.4569    0.1356    -0.4827   0.2098    -0.6977   0.0955
-0.3484   -0.3008   0.2322    -0.5895   -0.6049   0.1441
-0.4318   -0.1627   -0.6444   0.0068    0.2118    0.5718
-0.0295   0.8364    0.2123    -0.2097   -0.0039   0.4588

Σ =
13.3710   0         0         0
0         6.1907    0         0
0         0         0.4120    0
0         0         0         0.0126
0         0         0         0
0         0         0         0

V =
-0.0113   -0.0097   -0.1469   -0.9890
-0.9933   -0.1086   0.0400    0.0065
-0.1114   0.9907    -0.0786   0.0033
-0.0297   -0.0820   -0.9852   0.1475

Table 4: SVD decomposition of matrix M into U, Σ, & V

Therefore our variances can be calculated by squaring the diagonal values of Σ:

Variance:
1st PC: 178.7827    2nd PC: 38.3248    3rd PC: 0.1697    4th PC: 0.0002

Percent explained:
1st PC: 82.28 %    2nd PC: 17.64 %    3rd PC: 0.08 %    4th PC: 0.0001 %


It is clear that the first two principal components make up more than 99% of the variance, and as such only they should be taken into consideration.

The principal components are then read off from the columns of V (up to sign), and the scores are calculated by multiplying our normalized data by V:

1st principal component:  0.0113   0.9933   0.1114   0.0297
2nd principal component:  -0.0097  -0.1086  0.9907   -0.0820

Table 5: 1st & 2nd principal components
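The worked example above can be reproduced with a few lines of NumPy. This is a sketch of the same PCA-via-SVD procedure, not the thesis's actual code; the individual singular values depend on the scaling convention chosen for Y, but the explained-variance fractions do not:

```python
import numpy as np

# Table 2 data: height (m), weight (kg), age (y), reach (m)
X = np.array([[1.88, 88, 24, 3.01],
              [1.68, 64, 18, 2.71],
              [1.75, 68, 25, 2.91],
              [1.98, 90, 22, 3.41],
              [2.08, 92, 24, 4.05],
              [1.74, 80, 35, 2.02]])

Xc = X - X.mean(axis=0)            # mean subtraction (as in Table 3)
Y = Xc / np.sqrt(len(X))           # scaled data matrix Y = (1/sqrt(n)) X^T convention
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

variance = s ** 2                  # variances of the principal components
explained = variance / variance.sum()
components = Vt                    # rows of V^T are the principal components
```

The first two entries of `explained` together exceed 0.99, matching the conclusion drawn from the table of variances.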

To help us visualize the result of the above PCA, we consider the following biplot:

[Figure 1: biplot of correlation values and scores; Component 1 (horizontal) vs. Component 2 (vertical), both axes ranging from -1 to 1, with the variables height, weight, age, and reach plotted]

Note that the min and max values of the axes, ranging from -1 to 1, mark the correlation circle. Thus when variables (represented in blue) are located far from the center, the following conclusions can be drawn [14]:

• if they are closely grouped, then they are significantly positively correlated;

• if they are grouped at opposite sides, then they are significantly negatively correlated;

• if they fall orthogonal to each other, then they are uncorrelated.

However, if the variables happen to be located near the center of the axes, then no conclusions can be drawn from these two principal component factors, revealing hidden information in other principal components. Moreover, the locations of the observations (represented in red), falling in one of the four quadrants, can be used to estimate the number of clusters required for grouping, revealing the underlying structure of the given data.

Therefore from the above biplot one can conclude that age and weight, being orthogonally placed at a distance from the center, are uncorrelated, while different principal components need to be plotted to draw correlation-based conclusions for the height and reach variables, due to their close proximity to the center. It is also worth noting that the first principal component, the horizontal axis, is linked to age, reach and height, while the second principal component, the vertical axis, correlates with age. Finally, the locations of the observations imply the existence of three possible clusters in the data.

Building on this analysis, the next logical step ventures into clustering and grouping of the data. Data clustering is vital to us, since the average behavior of the observations might not tell the full story, hiding the inner group dynamics sought after.

4.3.2 Clustering

In distribution-based clustering two important factors come into play: the type of distribution applied in modeling, and the number of clusters required for grouping. The previous Principal Component Analysis answers both. Since PCA has the underlying assumption of a linear, normal distribution of variables [14], it is straightforward to decide on the use of a Gaussian probability distribution function. As for the number of clusters, two contributors can aid us in this vital decision. The first is the initial guess we conclude from the analysis of the observation scatter in the PCA we performed. The other comes in the form of a centroid-based clustering method known as k-means.

K-means is a clustering method based on a voting system. Data are randomly assigned to an initial number of clusters, and the new centroid of each cluster is then calculated. Afterwards the distance of every observation to every cluster mean is measured, and each observation is then reassigned to the closest cluster [18]. Thus k-means aims to minimize the within-cluster sum of squares of each cluster. The method is an iterative one, whereby the algorithm stops once it reaches saturation. The iteration can be formulated mathematically as:

• Expectation step:

x_p ∈ C_i iff ||x_p − m_i|| ≤ ||x_p − m_j|| ∀ 1 ≤ j ≤ k

where x_p represents the observation point, C_i is the cluster in consideration, m_i is the mean of cluster C_i, and k is the number of clusters provided.

• Maximization step:

m_i = ⟨C_i⟩

Note that the distance measure usually considered is either the Euclidean distance or the more general Mahalanobis distance. Performing k-means using the Euclidean distance measure, with the initial number of clusters k = 3 deduced from the PCA biplot, we obtain the following superimposed scatter:


[Figure 2: k-means of data; scores on the 1st principal component vs. scores on the 2nd principal component, colored by clusters 1, 2, 3]

k-means revealed the obvious grouping, given an initial number of three clusters, according to centroid closeness. The figure above uses the PCA scores to display the scatter grouping in 2-dimensional space, making it simpler to draw conclusions from. Alas, k-means is a crude voting method that does not rely on any normal probability distribution of votes [18]. Thus a more adequate clustering technique for higher dimensional data is the mixture of Gaussians clustering methodology using the Expectation-Maximization algorithm [19]. Nonetheless, k-means provides us with a computationally efficient algorithm to test for the number of clusters by iteratively increasing k (k = 1, 2, 3, etc.), as well as providing the initial parameters for a mixture of Gaussians, such as an initial guess on indexing, the cluster centers, and the covariance matrix. However, before we move on to the mixture of Gaussians, it is important to point out the methodology of determining the correct number of clusters in the iterative mechanism of k-means. Increasing the number of clusters will yield different groupings; thus, in order to determine the correct number of clusters, we look into the Davies-Bouldin index.

The Davies-Bouldin index is based on a measure of separation between the clusters as well as a measure of scatter within each cluster. In short, Davies-Bouldin rewards the configuration with the most separate clusters having the least scatter of observations within each cluster (a measure of compactness of each cluster) [20]. This is expressed mathematically as follows:

• S, the compactness measure, is the average Euclidean distance between the centroid of the cluster and its observations:

S_i = (1/T_i) Σ_j ||x_j − A_i||

where x_j represents an observation point within the cluster, A_i is the centroid of the cluster in consideration, and T_i is the cluster size.

• M_{i,j}, the measure of separation, is the Euclidean distance between cluster centroids i & j:

M_{i,j} = ||A_i − A_j||

[Figure 3 plots the index against the number of clusters k.]

Figure 3: Davies-Bouldin index for k = 2,3,4,5

Here the smallest Davies-Bouldin value of 0.128 implies the need for 4 separate clusters instead of the 3 clusters used in the above example. This is simple to justify by looking at the second cluster, represented in red in Figure 2, where the two observation points are relatively widely separated, and as such each should be considered a separate cluster. Note that the curve has been smoothed using cubic spline interpolation, in order to better signify the global minimum of the curvature.
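The standard Davies-Bouldin index combines these two quantities as the average, over clusters, of the worst-case ratio max_{j≠i} (S_i + S_j)/M_{i,j}. A small self-contained sketch (my own implementation, not the thesis code):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: lower means more compact, better-separated clusters."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == i].mean(axis=0) for i in ids])
    # S_i: average Euclidean distance of cluster members to their centroid
    S = np.array([np.linalg.norm(X[labels == i] - c, axis=1).mean()
                  for i, c in zip(ids, centroids)])
    total = 0.0
    for a in range(len(ids)):
        # R_ab = (S_a + S_b) / M_ab, with M_ab the distance between centroids
        R = [(S[a] + S[b]) / np.linalg.norm(centroids[a] - centroids[b])
             for b in range(len(ids)) if b != a]
        total += max(R)
    return total / len(ids)
```

Evaluating this for each candidate k and picking the smallest value is exactly the selection rule applied above.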

After concluding the number of clusters as well as the initial parameters, it is time to take a closer look at the mixture of Gaussians clustering technique. It is similar to k-means in that it can be described as a two-step algorithm of Expectation and Maximization, clustering the data points in accordance with centroid closeness. However, the voting mechanism is a soft one, using a normalized probabilistic function; hence instead of a rigid boundary, as in k-means, each data point is assigned to a density function according to its respective log-likelihood.

The multivariate normal distribution in p dimensions is of the following form:

f(x) = (2π)^(−p/2) |Σ|^(−1/2) exp{−(1/2)(x − µ)'Σ^(−1)(x − µ)}

noted as:

X ∼ N(µ, Σ)

where Σ represents the covariance matrix and µ is the mean of the distribution. All parameters are obtained from the k-means clustering as initial guesses.

Notice that the expression inside the brackets, (x − µ)'Σ^(−1)(x − µ), is the squared Mahalanobis distance between x & µ.

Also, in case we have uncorrelated variables, our variance-covariance matrix becomes diagonal:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_p^2 \end{pmatrix}$$

As a result, our multivariate distribution attains a spherical form with the following notation:

$$\phi(x) = \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\Big\{-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\Big\}$$

This highlights the relationship between the mixture of Gaussians and k-means: k-means corresponds to hard clustering with an uncorrelated variance-covariance matrix.

The Mixture Model of Gaussians is defined as the weighted sum of Gaussian probability distribution functions [19]:

$$p(x) = \pi_0 f_0(x) + \pi_1 f_1(x) + \pi_2 f_2(x) + \dots + \pi_k f_k(x)$$

where the $\pi_i$ are weights determined a priori with $\sum_{i=0}^{k} \pi_i = 1$, $k$ is a predetermined number of clusters, and $f_i(x) = N(x \mid \mu_i, \Sigma_i)$ is the multivariate normal distribution with mean $\mu_i$ and covariance matrix $\Sigma_i$. Thus $p(x)$ can be noted as:

$$p(x) = \sum_{i=0}^{k} \pi_i N(x \mid \mu_i, \Sigma_i)$$

The aim of Gaussian clustering is to maximize the likelihood of the mixture function, that is, to maximize:

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \Big\{ \sum_{i=0}^{k} \pi_i N(x_n \mid \mu_i, \Sigma_i) \Big\}$$

To perform this, the Mixture of Gaussian Models utilizes Bayes' rule to evaluate the probability of an observation belonging to a distribution [19]. This is often referred to as the Expectation step, and is formulated as follows:

$$\tau_{nk} = P(n \in C_k \mid n) = \frac{P(n \mid n \in C_k)\,P(C_k)}{P(n)} = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{k} \pi_j N(x_n \mid \mu_j, \Sigma_j)}$$

After evaluating the responsibility $\tau_{nk}$ of each observation belonging to every Gaussian distribution in the mixture model, we move on to maximize the likelihood, in what is often known as the Maximization step. Differentiating the log-likelihood with respect to $\mu_k$ and setting the derivative to zero:

$$\frac{\partial \ln p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} = \sum_{n=1}^{N} \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{k} \pi_j N(x_n \mid \mu_j, \Sigma_j)}\, \Sigma_k^{-1}(x_n - \mu_k) = \sum_{n=1}^{N} \tau_{nk}\, \Sigma_k^{-1}(x_n - \mu_k) = 0$$

This allows us to update our initial parameter guesses obtained from k-means as follows:

$$\mu_k = \frac{\sum_{n=1}^{N} \tau_{nk} x_n}{\sum_{n=1}^{N} \tau_{nk}}, \qquad \Sigma_k = \frac{1}{\sum_{n=1}^{N} \tau_{nk}} \sum_{n=1}^{N} \tau_{nk} (x_n - \mu_k)(x_n - \mu_k)^T, \qquad \pi_k = \frac{\sum_{n=1}^{N} \tau_{nk}}{N}$$
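The E- and M-steps above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration on synthetic 2-D data, not the thesis's implementation: random initialization stands in for the k-means seed used in the text, and a small diagonal term is added to each covariance to guard against numerical singularity.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # synthetic data: 50 points in 2-D
k = 3

# Initial guesses (in the thesis these come from k-means).
mu = X[rng.choice(len(X), k, replace=False)]
Sigma = np.array([np.eye(2)] * k)
pi = np.full(k, 1.0 / k)

for _ in range(20):
    # E-step: responsibilities tau[n, i] proportional to pi_i * N(x_n | mu_i, Sigma_i)
    dens = np.column_stack([
        pi[i] * multivariate_normal.pdf(X, mu[i], Sigma[i]) for i in range(k)
    ])
    tau = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate means, covariances and weights from tau.
    Nk = tau.sum(axis=0)
    mu = (tau.T @ X) / Nk[:, None]
    for i in range(k):
        d = X - mu[i]
        Sigma[i] = (tau[:, i, None] * d).T @ d / Nk[i] + 1e-6 * np.eye(2)
    pi = Nk / len(X)
```

In practice one would also monitor the log-likelihood between iterations and stop once its change falls below a tolerance, which is the convergence check described next.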

Then we iterate and check for convergence of the likelihood function. To better demonstrate this, let us consider the example data provided in the k-means example, with the number of clusters equal to 3 and initial center means:

µ1 = [1.74; 80.00; 35.00; 2.02]

µ2 = [1.72; 66.00; 21.50; 2.81]

µ3 = [1.98; 90.00; 23.33; 3.49]

Along with the initial covariance matrices:

$$\Sigma_1 = \begin{pmatrix} 0.0025 & 0.14 & 0.245 & 0.007 \\ 0.14 & 8.00 & 14.00 & 0.40 \\ 0.245 & 14.00 & 24.50 & 0.70 \\ 0.007 & 0.40 & 0.70 & 0.02 \end{pmatrix} \qquad \Sigma_2 = \begin{pmatrix} 29.38 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$

$$\Sigma_3 = \begin{pmatrix} 20.01 & 0.20 & 0 & 0.052 \\ 0.20 & 4.00 & 0 & 1.04 \\ 0 & 0 & 1.33 & 0.08 \\ 0.052 & 1.04 & 0.08 & 0.2752 \end{pmatrix}$$

Finally, assigning the initial weights equally among the three clusters:

$$\pi_1 = \pi_2 = \pi_3 = \frac{1}{3}$$

We now proceed to evaluate the responsibilities of each observation and maximize the likelihood in order to re-evaluate the parameters, iterating until convergence. Finally, we obtain the following clusters:

Figure 4: Gaussian vs. k-means clustering; panels: (a) Gaussian clustering k = 3, (b) k-means clustering k = 3; axes: scores on the 1st and 2nd principal components.

For the simple six data points provided, the two clustering techniques produce identical groupings, as shown in figure 4. However, with growing data dimension and population, the Mixture of Gaussians has the advantage of being a generalized form of k-means, offering more flexible and useful clustering.

4.3.3 Cluster Validation

In a repeated-measures experiment, as is the case here, it is not sufficient only to cluster the data at different intervals; it is also highly important to ensure that the new clusters contain the same data points as the previous indexing. This is known as cluster validation, whereby we measure the validity of the new clusters with respect to the initial target clustering. The particular cluster validation method adopted in this thesis is the F-measure [21]. This internal criterion describes the validation process in four categories:

• True Positive (TP): an observation in cluster X is found in cluster Y

• True Negative (TN): an observation not in cluster X is not found in cluster Y

• False Positive (FP): an observation not in cluster X is found in cluster Y

• False Negative (FN): an observation in cluster X is not found in cluster Y

The F-measure is similar to the Rand index, which penalizes both False Positive and False Negative decisions; however, the F-measure assigns weights to the associated penalties.

The F-measure can be formulated as:

$$F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}$$

where

$$\text{Precision: } P = \frac{TP}{TP + FP}, \qquad \text{Recall: } R = \frac{TP}{TP + FN}$$

Note that β is the predetermined penalty weight, usually set to 1.
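The formula above is a one-liner once the four counts are known. The sketch below (an illustration, not the thesis code) evaluates it with the counts TP = 1, FP = 3, FN = 3 used in the worked example later in this section.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Weighted F-measure from the pairwise cluster-validation counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Counts from the worked example: TP = 1, FP = 3, FN = 3.
f1 = f_measure(1, 3, 3)   # → 0.25, far below 1, so the clusterings differ
```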

Hence, going back to our example data points, we take an additional measurement of mathematical skill:

height (m) | weight (kg) | age (y) | reach (m) | math (/100)
1.88 | 88 | 24 | 3.01 | 100
1.68 | 64 | 18 | 2.71 | 60
1.75 | 68 | 25 | 2.91 | 88
1.98 | 90 | 22 | 3.41 | 100
2.08 | 92 | 24 | 4.05 | 20
1.74 | 80 | 35 | 2.02 | 55


Clustering with the reach and mathematical-skill measurements, we obtain the following groupings:

Figure 5: F-measure comparison; panels: (a) Gaussian clustering (k = 3) with reach, (b) Gaussian clustering (k = 3) with mathematical skills; axes: scores on the 1st and 2nd principal components.

Notice how one of the green and one of the blue observations in the left-hand clustering turned into red-colored observations. This indicates that there has been a shift of data points between the clusters. Calculating the F-measure with β = 1, and with TP = 1, TN = 8, FP = 3, and FN = 3, we get P = R = 1/4, which yields:

$$F_1 = \frac{2PR}{P + R} = 0.25 \ll 1$$

Therefore the clusters cannot be considered to be the same in both groupings.

However, in the case of $F_\beta$ being close to 1, we are able to track the cluster's centroid shift. This allows us, at a later stage, to perform trend analysis through least-squares fitting, highlighting seasonal changes exhibited by the tenants' consumption.


We are able to visualize the summary result of our work thus far in figure 6.

Figure 6: Trend analysis for seasonal changes (hot water consumption plotted against electricity consumption in January, June, and December, with trend lines)

Here each apartment is considered as an observation and is grouped according to its characteristics as well as its electricity and hot water consumption data. Furthermore, we are able to track the center means through cluster validation, in order to analyse the trend using regression and conclude a seasonal change model.

4.3.4 Multivariate Analysis of Variance (MANOVA)

Having obtained the different groups from the applied clustering, we can now investigate whether feedback introduction has a significant effect on any of the groups in terms of electricity and hot water consumption. We therefore examine the change in variance to verify whether the change is significant or a result of random chance. Having different groups implies the use of multiple t-tests, which could lead to a high type 1 error [22]. Hence the need for an ANOVA-type test, which takes the different groups into consideration in a single test, thus minimizing the sensitivity to type 1 error. However, since we need to look into two dependent variables, the multivariate case of ANOVA, or MANOVA, comes into play. MANOVA can handle the analysis of two separate dependent variables in a single test [23].

Analysis of Variance examines the change in the mean distribution between groups. Seen as an extension of the t-test, ANOVA examines the total Sum of Squares Within ($SSW$) as well as the total Sum of Squares Between ($SSB$) [22]. ANOVA utilizes Fisher's distribution as part of its statistical significance test, and as such requires us to compute the F-statistic:

$$F = \frac{SSB/(m-1)}{SSW/(m(n-1))}$$

Note that the denominators $(m - 1)$ and $m(n - 1)$ represent the degrees of freedom ($df$), where $m$ is the number of data sets and $n$ is the number of observations within a data set. $SSB$ is calculated as:

$$SSB = n \sum_{j=1}^{m} (\bar{x}_j - \bar{X})^2$$

and $SSW$ is calculated as:

$$SSW = \sum_{j=1}^{m} \sum_{i=1}^{n} (x_i - \bar{x}_j)^2$$

where $x_i$ denotes observation $i$, $\bar{x}_j$ represents the mean of group $j$, and $\bar{X}$ is the total mean of all data groups.
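The definitions above translate directly into code. The sketch below is a hypothetical equal-group-size example (the consumption values are made up for illustration, not thesis data), following the SSB/SSW formulas just given.

```python
import numpy as np

def one_way_f(groups):
    """F-statistic for a one-way ANOVA with m groups of n observations each."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    m = len(groups)
    n = len(groups[0])
    grand_mean = np.mean(np.concatenate(groups))
    ssb = n * sum((g.mean() - grand_mean) ** 2 for g in groups)          # between
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)               # within
    return (ssb / (m - 1)) / (ssw / (m * (n - 1)))

# Hypothetical monthly consumption readings for three apartment clusters.
f = one_way_f([[2.1, 2.3, 1.9], [3.0, 3.2, 2.8], [1.2, 1.1, 1.4]])
```

For equal group sizes this matches `scipy.stats.f_oneway`, which can serve as a cross-check.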

However, an apparent question surfaces which bids us to evaluate the use of MANOVA in contrast to a repeated-measures ANOVA. Two straightforward answers come to mind. The first is that a MANOVA, in contrast to a repeated ANOVA, does not assume data sphericity; sphericity assumes that the variance of the population remains constant regardless of the combination of dependent variables [23]. Secondly, performing repeated ANOVAs increases the type 1 error in the experiment. A more subtle and intuitive answer to consider, however, is that repeated ANOVAs run separately will not account for the covariation pattern of the Dependent Variables, DVs [24]. To help us better understand this, we consider figure 7:

Figure 7: ANOVA vs. MANOVA

Figure 7 shows that performing repeated ANOVAs on three normally distributed groups, A, B, and C, results in three normal distributions projected on the respective orthogonal axes. These projections show that the three group means lie close to each other, making it difficult to determine whether A, B, and C indeed have different mean distributions. A MANOVA, on the other hand, projects onto a linear combination of the dependent-variable axes [25], thus accurately portraying the three groups as having three distinct mean distributions. This clearly demonstrates the advantage of a MANOVA over repeated ANOVAs in this aspect of statistical analysis.


height (m) | weight (kg) | age (y) | reach (m) | math (/100)
1.88 | 88 | 24 | 3.01 | 100
1.68 | 64 | 18 | 2.71 | 60
1.75 | 68 | 25 | 2.91 | 88
1.98 | 90 | 22 | 3.41 | 100
2.08 | 92 | 24 | 4.05 | 20
1.74 | 80 | 35 | 2.02 | 55
x̄h = 1.85 | x̄w = 80.33 | x̄a = 24.67 | x̄r = 3.02 | x̄m = 70.5

Table 7: Set of generated data with repeated measurements, continued

With

$$\bar{X} = \frac{\bar{x}_h + \bar{x}_w + \bar{x}_a + \bar{x}_r + \bar{x}_m}{5} = 36.07$$

Thus

$$SSB = n \sum_{i=1}^{m} (\bar{x}_i - \bar{X})^2 = 6\big[(1.85 - 36.07)^2 + (80.33 - 36.07)^2 + (24.67 - 36.07)^2 + (3.02 - 36.07)^2 + (70.5 - 36.07)^2\big] = 3.32 \times 10^4$$

and

$$SSW = \sum_{j=1}^{m} \sum_{i=1}^{n} (x_i - \bar{x}_j)^2 = 0.1217 + 707.33 + 159.33 + 2.32 + 4.95 \times 10^3 = 5.82 \times 10^3$$

where each term is the within-group sum of squares of one measurement set. Therefore:

$$F = \frac{SSB/(m-1)}{SSW/(m(n-1))} = \frac{3.32 \times 10^4 / 4}{5.82 \times 10^3 / 25} = \frac{8300}{232.80} = 35.65$$

With a significance level of α = 0.10, numerator degrees of freedom df₁ = 4, and denominator degrees of freedom df₂ = 25, we conclude that our F-statistic of 35.65 ≫ the critical F value of 2.18, meaning there is very little possibility that our null hypothesis is true. We therefore reject the null hypothesis in favor of the alternate hypothesis, which states that there is a significant difference in our population means. Note that the null hypothesis in an ANOVA is usually formulated as:

$$H_0: \bar{X}_1 = \bar{X}_2 = \bar{X}_3 = \dots = \bar{X}_K$$

Where K represents the number of independent variables, referred to as levels. On the other

hand, the alternate hypothesis rejects H0 with:

$$H_1: \bar{X}_1 \neq \bar{X}_2 \neq \bar{X}_3 \neq \dots \neq \bar{X}_K$$

Analogously, a multivariate ANOVA, or MANOVA, dictates the null hypothesis to be of a more general form:

$$H_0: \begin{pmatrix} \bar{X}_{11} \\ \bar{X}_{21} \\ \vdots \\ \bar{X}_{p1} \end{pmatrix} = \begin{pmatrix} \bar{X}_{12} \\ \bar{X}_{22} \\ \vdots \\ \bar{X}_{p2} \end{pmatrix} = \dots = \begin{pmatrix} \bar{X}_{1K} \\ \bar{X}_{2K} \\ \vdots \\ \bar{X}_{pK} \end{pmatrix}$$

where $p$ represents the number of dependent variables in all $K$ levels. Similarly, the alternate hypothesis in a MANOVA is formulated as:

$$H_1: \begin{pmatrix} \bar{X}_{11} \\ \bar{X}_{21} \\ \vdots \\ \bar{X}_{p1} \end{pmatrix} \neq \begin{pmatrix} \bar{X}_{12} \\ \bar{X}_{22} \\ \vdots \\ \bar{X}_{p2} \end{pmatrix} \neq \dots \neq \begin{pmatrix} \bar{X}_{1K} \\ \bar{X}_{2K} \\ \vdots \\ \bar{X}_{pK} \end{pmatrix}$$

While ANOVA assesses the probability of rejecting $H_0$ given the F-statistic and our degrees of freedom, the MANOVA's test statistic is λ:

$$\lambda = \frac{|SSW|}{|SSW + SSB|}$$

where $|SSW|$ and $|SST| = |SSB + SSW|$ are the determinants of the within-groups and total sum-of-squares and cross-product matrices.

Notice how the value of SSB plays a key role in the variation of λ between 1 and 0. As SSB increases in value, λ decreases approaching zero; while as SSB decreases in value, λ increases towards one.

Taking the example data in table 6 and adding an additional group to each set, we end up with table 8:

height (m) M/F | weight (kg) M/F | age (y) M/F | reach (m) M/F | math (/100) M/F
1.88 / 1.65 | 88 / 59 | 24 / 21 | 3.01 / 2.64 | 100 / 58
1.68 / 1.58 | 64 / 45 | 18 / 17 | 2.71 / 2.45 | 60 / 51
1.75 / 1.65 | 68 / 55 | 25 / 28 | 2.91 / 2.04 | 88 / 90
1.98 / 1.78 | 90 / 66 | 22 / 18 | 3.41 / 2.87 | 100 / 100
2.08 / 1.88 | 92 / 75 | 24 / 26 | 4.05 / 2.52 | 20 / 60
1.74 / 1.74 | 80 / 61 | 35 / 32 | 2.02 / 1.75 | 55 / 43
means: 1.85 / 1.71 | 80.33 / 60.17 | 24.67 / 23.67 | 3.02 / 2.38 | 70.50 / 67

Table 8: Set of generated data with repeated measurements for Males & Females

Calculating λ, the within-groups matrix is the sum over the five measurement sets:

$$SSW = SSW_1 + SSW_2 + SSW_3 + SSW_4 + SSW_5$$

where, for the first set:

$$SSW_1 = \begin{pmatrix} \sum_{j=1}^{6}(x_{1(j)} - \bar{x}_{11})^2 & \sum_{j=1}^{6}(x_{1(j)} - \bar{x}_{11})(x_{2(j)} - \bar{x}_{21}) \\ \sum_{j=1}^{6}(x_{1(j)} - \bar{x}_{11})(x_{2(j)} - \bar{x}_{21}) & \sum_{j=1}^{6}(x_{2(j)} - \bar{x}_{21})^2 \end{pmatrix} = \begin{pmatrix} 0.1217 & 0.0712 \\ 0.0712 & 0.0587 \end{pmatrix}$$

Similarly,

$$SSW_2 = \begin{pmatrix} 707.33 & 531.66 \\ 531.66 & 512.83 \end{pmatrix} \quad SSW_3 = \begin{pmatrix} 159.33 & 147.33 \\ 147.33 & 177.33 \end{pmatrix} \quad SSW_4 = \begin{pmatrix} 2.32 & 0.97 \\ 0.97 & 0.84 \end{pmatrix} \quad SSW_5 = \begin{pmatrix} 4.94 \times 10^3 & 2004 \\ 2004 & 2580 \end{pmatrix}$$

giving

$$SSW = \begin{pmatrix} 5.8091 \times 10^3 & 2.6840 \times 10^3 \\ 2.6840 \times 10^3 & 3.2711 \times 10^3 \end{pmatrix}$$

While, with $\bar{X} = 33.53$, the between-groups matrices include:

$$SSB_1 = \begin{pmatrix} 6.021 \times 10^3 & 6.048 \times 10^3 \\ 6.048 \times 10^3 & 6.075 \times 10^3 \end{pmatrix} \quad SSB_3 = \begin{pmatrix} 470.99 & 524.15 \\ 524.15 & 583.31 \end{pmatrix} \quad SSB_4 = \begin{pmatrix} 5.58 \times 10^3 & 5.70 \times 10^3 \\ 5.70 \times 10^3 & 5.82 \times 10^3 \end{pmatrix} \quad SSB_5 = \begin{pmatrix} 8.20 \times 10^3 & 7.42 \times 10^3 \\ 7.42 \times 10^3 & 6.72 \times 10^3 \end{pmatrix}$$

(for $SSB_2$, the diagonal entries are $1.31 \times 10^4$ and $7.48 \times 10^3$). Summing:

$$SSB = SSB_1 + SSB_2 + SSB_3 + SSB_4 + SSB_5 = \begin{pmatrix} 3.3372 \times 10^4 & 2.7172 \times 10^4 \\ 2.7172 \times 10^4 & 2.3448 \times 10^4 \end{pmatrix}$$

Concluding that:

$$SST = SSW + SSB = \begin{pmatrix} 3.91 \times 10^4 & 2.98 \times 10^4 \\ 2.98 \times 10^4 & 2.6719 \times 10^4 \end{pmatrix}$$

$$\lambda = \frac{\begin{vmatrix} 5.8091 \times 10^3 & 2.6840 \times 10^3 \\ 2.6840 \times 10^3 & 3.2711 \times 10^3 \end{vmatrix}}{\begin{vmatrix} 3.91 \times 10^4 & 2.98 \times 10^4 \\ 2.98 \times 10^4 & 2.6719 \times 10^4 \end{vmatrix}} = \frac{1.1798 \times 10^7}{1.555 \times 10^8} = 0.0759$$

Note that this is Wilks' definition:

$$\lambda = \frac{|SSW|}{|SST|}$$

whereas Pillai's trace is defined as:

$$V = \sum \text{eig}\big(SSB \cdot SST^{-1}\big)$$

the sum of the eigenvalues of the matrix $[SSB \cdot SST^{-1}]$, while Roy's formulation of λ is the largest eigenvalue of the matrix $[SSB \cdot SSW^{-1}]$, and finally Hotelling's trace is defined as the sum of the eigenvalues of the matrix $[SSB \cdot SSW^{-1}]$ [23].
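The four statistics can be computed numerically from the SSCP matrices of the worked example above. The NumPy sketch below reproduces the λ value reported in the text; it is an illustration only, using the matrix entries as printed.

```python
import numpy as np

# Within- and between-groups SSCP matrices from the worked example.
SSW = np.array([[5.8091e3, 2.6840e3],
                [2.6840e3, 3.2711e3]])
SSB = np.array([[3.3372e4, 2.7172e4],
                [2.7172e4, 2.3448e4]])
SST = SSW + SSB

wilks = np.linalg.det(SSW) / np.linalg.det(SST)   # Wilks' lambda ≈ 0.0759

# The remaining statistics follow from eigenvalues of the indicated products.
eig_w = np.linalg.eigvals(SSB @ np.linalg.inv(SSW))
hotelling = eig_w.sum().real                       # Hotelling's trace
roy = eig_w.real.max()                             # Roy's largest root
pillai = np.linalg.eigvals(SSB @ np.linalg.inv(SST)).sum().real  # Pillai's trace
```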

These different tests provide us with different viewing angles for assessing differences in the linear combination of dependent variables. In simpler terms, they aid us in the analysis of the variance-covariance matrices resulting from the MANOVA [22]. The tests usually conducted in a MANOVA are Wilks' or Pillai's, owing to their robustness towards violations of homogeneity of the covariance matrix; they yield conservative results. This follows from the assumptions made when applying MANOVA to our data set [25]:

• Normality: MANOVA has the underlying assumption that the dependent variables

and their linear combinations are normally distributed.

• Independence: MANOVA assumes the observations to be independent of one another.

• Outliers: MANOVA is sensitive to data outliers, which impact the type 1 error; hence the need to identify outliers using Mahalanobis' distance measure.

• Linearity: MANOVA assumes all dependent variables to be linearly correlated.

Adopting Wilks' lambda formulation, we proceed to approximate the significance through a Chi-square test. In other words, assuming a Chi-squared distribution, we have:

$$\chi^2 = -[(N - 1) - 0.5(p + k)] \ln \lambda$$

where $N$ is the total number of observations and $p(k - 1)$ represents the degrees of freedom.

Hence, following up on our example, we have:

$$\chi^2 = -[(30 - 1) - 0.5(2 + 5)] \ln 0.0759 = 65.75$$
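The approximation can be checked directly; the sketch below plugs in the example values and looks up the critical value with SciPy instead of a printed chart (an illustration, not thesis code).

```python
import numpy as np
from scipy.stats import chi2

lam, N, p, k = 0.0759, 30, 2, 5          # values from the worked example
chi_sq = -((N - 1) - 0.5 * (p + k)) * np.log(lam)
df = p * (k - 1)                          # 8 degrees of freedom
critical = chi2.ppf(0.90, df)             # significance level alpha = 0.10

reject = chi_sq > critical                # True: reject the null hypothesis
```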

with 8 degrees of freedom. Consulting the Chi-squared chart (see Appendix B) and adopting a significance level of 0.10, we find the critical statistic χ²c = 13.36 ≪ 65.75. We can therefore reject the null hypothesis in favor of the alternate hypothesis, realizing that there is indeed a significant change in mean distribution among the groups.

After identifying a significant change in the mean distribution, we proceed to determine in which of the dependent variables the change is exhibited, and in which groups the change was significant. To answer the question of which group exhibited the significant change, we look into multiple multivariate t-tests such as the Tukey Honestly Significant Difference (HSD) test, which performs a pairwise comparison of means µi − µj in order to identify which group mean holds the highest significance level [23]. As for the problem of determining which DV invokes the highest mean shift for the different groups, we approach this through Discriminant Function Analysis (DFA). The most common type of DFA is what is referred to as a Roy-Bargman step-down procedure. In short, we can identify whether the groups differ by taking a single dependent variable at a time. We thus prioritize the DVs, then consider the (theoretically) most important DV, after which all subsequent DVs are checked after removing the effects of the higher variables by treating them as covariates [23]. This type of canonical analysis is similar to the PCA described previously, where we investigate the combination of the original variables that has the highest variation; however, unlike PCA, in multivariate analysis we investigate the linear combination of variables which yields the highest separation between groups. This procedure is repeated to reveal all levels of separation through the canonical analysis described above.


"...there is something inherently ridiculous about digital watches... We were trying to find ways of translating purely numeric data into graphic form so that the information leapt easily to the eye... then we suddenly got terribly excited at the idea of translating them 'back' to numeric data, simply because we suddenly had the technology to do so."

FAX from Douglas Adams to Byron Preiss


5 Results & Discussion

In this section we apply the theoretical procedures described in the Methodology section to the data collected thus far. The information obtained from Nynäshamnbostäder AB at present is one year of hot water consumption data, spanning January 2011 to December 2011, for 115 residential apartments in Nynäshamn. We were also provided with apartment characteristic data for the following categories:

• Age of official tenant in years,

• Number of occupants in an apartment,

• Apartment area in square metres (m²),

• Location factor.

The location factor is an index value between 0.25 and 1, taking into consideration the location of the apartment with respect to building level (assigning higher values to low-level apartments and lower values to higher-level apartments). This is justified by the increase in heating consumption necessary to overcome wind and natural heat dissipation attributed to higher-level apartments. The hot water consumption data has a monthly resolution, taken at the start of every month, measured in cubic metres (m³).

5.1 Box Plot

In order to get a general, graphical look at the characteristic data of the apartments at hand, we produce the box plot in figure 8.

Figure 8: Box plot of characteristic data (left panel: occupants and location factor; right panel: area and age)

The box plot reveals the following statistical descriptions:


• Standard deviation: displays the standard deviation above and below the mean, represented through the box's length.

• Min/Max: outputs the minimum and maximum values in the data, represented by two dotted whisker ends (T's).

• Outliers: reveals possible outliers in the data, represented by red plus signs (+).

From the box plot we were able to conclude that the median value for the location factor is 0.65, with minimum and maximum values of 0.25 and 1 respectively. The standard deviation of the location factor is evenly distributed below and above the mean, resulting in a symmetrical box plot around the mean. This shows a balanced distribution of apartments, since the location factor depends on the geographical location of the apartment.

When it comes to the number of occupants per apartment, we see a mean of 1.34 occupants per apartment with a high positive standard deviation, while the maximum number of occupants is 3. This shows that the majority of these apartments are inhabited by a single tenant, while a distributed few have 3 occupants. This indicates that we are dealing with single-occupant apartments rather than family housing areas.

As for the age distribution among the residents, we have an average age of 73.12, with minimum and maximum values at 55 and 95 years respectively. The box plot also shows a somewhat symmetrical distribution around the mean, however with several outliers in the younger age group below the minimum value, the outermost having a value of 35 years of age. This demonstrates that the population at hand is largely an elderly one, and the presence of outliers amongst the younger age group suggests the possible need to divide the age category into two groups for further processing.

Finally, the apartment area values reveal that the average residence area lies around 60.88 m², with most of the distribution skewed towards the upper region above the mean. The minimum and maximum values are 43.4 m² and 71.7 m². This shows that most of the apartments are mid-to-small sized, which further backs up the occupancy data having a mean of 1.3 occupants per apartment.

5.2 Principal Components

Principal Component Analysis performed on the characteristic data results in the following:

eigenvalues (ascending): 0.5326, 0.7024, 1.1454, 1.6196

eigenvectors:

$$v_1 = \begin{pmatrix} 0.7082 \\ -0.0277 \\ -0.3944 \\ 0.5850 \end{pmatrix} \quad v_2 = \begin{pmatrix} -0.6966 \\ -0.1863 \\ -0.3566 \\ 0.5940 \end{pmatrix} \quad v_3 = \begin{pmatrix} 0.0668 \\ -0.7357 \\ -0.4996 \\ -0.4525 \end{pmatrix} \quad v_4 = \begin{pmatrix} 0.0939 \\ -0.6506 \\ 0.6838 \\ 0.3166 \end{pmatrix}$$

variances (descending): 1.6196, 1.1454, 0.7024, 0.5326

percent explained: 40.49 %, 28.64 %, 17.56 %, 13.31 %

This is represented graphically in figure 9.

Figure 9: Variance explained (%) by each principal component

Thus far PCA has given us four principal components, with the highest two accounting for 69.12 % of the total variance; adding the third principal component, we reach 86.68 % inclusion of all data scores. This informs us that we should investigate 2-dimensional as well as 3-dimensional PCA charts to properly address the hidden structure behind the characteristic data.
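The percentages follow directly from the eigenvalues; a small numerical check using the values reported above (an illustration of the arithmetic, not a re-derivation of the PCA):

```python
import numpy as np

# Eigenvalues of the characteristic-data PCA, sorted in descending order.
eigenvalues = np.array([1.6196, 1.1454, 0.7024, 0.5326])

explained = 100 * eigenvalues / eigenvalues.sum()   # percent explained per PC
cumulative = np.cumsum(explained)                   # cumulative coverage
# cumulative[1] ≈ 69.1 % (two PCs), cumulative[2] ≈ 86.7 % (three PCs)
```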


Visualizing the first two principal components, we investigate figure 10.

Figure 10: 2-D Principal Component Analysis (biplot of the projected data with component vectors for occupants, area, age, and locfac)

First, let us inspect the data scatter in the biplot of figure 10. The distribution of red points shows that there are few outliers in the data, since most of the points are located near the center of the axes. To verify this, we look at Hotelling's T² as a measure of the multivariate distance of each observation from the center of the data set. Through this analytical procedure we are able to evaluate the outliers and the mean distance from the center. Hotelling's T² reveals two extreme outliers with distances of 15.29 and 15.04, while the mean distance is 3.97. With this we can proceed to exclude the outliers from our data set. This verifies the biplot result, which shows that most of the observations are centered close to each other with few outliers. However, the biplot scatter of data points also reveals a possible 3-cluster formation, highlighted below:


Figure 11: Possible clusters (biplot of the projected data with component vectors)

Now shifting our attention to the coefficients, we first note that all categories lie far from the center, meaning certain conclusions concerning their correlation can be drawn. Having the Area and Occupant coefficients closely placed indicates that those two categories are positively correlated. The Location Factor coefficient being orthogonal to both the Area and Occupant coefficients hints that these are in fact uncorrelated. Age is similarly orthogonal to the Area and Occupant coefficients, indicating that they are uncorrelated, whilst Age and Location Factor pointing in opposite directions indicates that they are negatively correlated.

Making sense of these correlations must be done with caution. Correlation does not imply causality; however, it hints at a possible underlying structure. That being said, it is interesting to analyse these correlations to better understand the workings behind the characteristic data. Having Area and Occupant positively correlated comes as no surprise, since a smaller apartment area logically attracts a smaller number of occupants; with an increase in apartment size we witness an increase in the number of inhabitants. It is worth noting that the Area factor is the driving force behind the Occupant factor, since the apartment area is present first and plays a role in attracting tenants at a later stage. Therefore, the value of Area tends to explain the value of Occupant.

The biplot also shows that Area and Occupant are uncorrelated with the Location Factor. The Location Factor is based on the apartment's level in a building; thus a high value indicates a ground-floor apartment, while a low value indicates a top-floor apartment. Having said that, we see no effect of apartment level on the Area and Occupant categories, and as such they are deemed uncorrelated.

Finally, investigating the negative correlation between the Location Factor coefficient and Age reveals an interesting fact backed up by the box plot summary presented previously. Knowing our population to be an elderly one, based on the age category mean in the box plot in figure 8, this translates to a higher rate of older tenants (high Age value) renting upper-floor apartments (low Location Factor value); the building level thus plays a role in the age distribution of the tenants, with older tenants renting the upper floors while younger ones occupy ground levels. Note that since Age is orthogonal to both Area and Occupant, they are found to be uncorrelated.

Taking into account the third principal component, we obtain figure 12.

Figure 12: 3-D Principal Component Analysis (component vectors for occupants, area, age, and locfac on the first three principal components)

Adding the third principal component reveals that the Location Factor is coupled with the third principal component rather than the first two. As a consequence, projecting the data solely onto two principal components will not fully represent all dimensions of the observations, which is backed up by the variances displayed in figure 9. However, a cost-benefit analysis, weighing the addition of a third principal component against the gain in data representation accuracy, shows that two principal components are sufficient.

Therefore we can conclude that the first two principal components are sufficient for our statistical analysis of the data.


5.3 Clustering

Based on the preliminary conclusions drawn from the PCA, it is recommended to cluster our characteristic data, along with the hot water consumption data, into 3 cluster groups. However, it is important to verify the proposed number of clusters analytically, and as such we compute the previously described Davies-Bouldin index for:

2 ≤ k ≤ 9

where 2 is the minimum number of clusters allowed, whilst 9 is a conservative estimate from the rule of thumb for determining the number of clusters in a data set, formulated by [26]:

$$k \approx \sqrt{\frac{n}{2}}$$

where $n$ is the total number of observations, here found to be 115 apartments. Running the iterative program for different k values we obtain:

Figure 13: Cluster validation measures; panels: (a) Davies-Bouldin index versus number of clusters, (b) Akaike information criterion versus number of clusters.

The Akaike Information Criterion (AIC) is another relative goodness-of-fit measure, based on the entropy of the data. AIC is often used when applying a Gaussian mixture model clustering algorithm, since the log-likelihood is a required quantity in both. AIC is formulated as:

$$AIC = 2k - 2\ln(L)$$

where $L$ represents the likelihood of the probability distribution (Gaussian in our case), calculated as previously described in the Clustering section.

Note that the common ground between both measures is a cluster count of 3. The Davies-Bouldin graph shows that k = 3 yields the best results, while the Akaike criterion shows that k = 4 yields the best fit. However, the D-B index shows that k = 4 is not ideal, whilst the AIC agrees that k = 3 will yield respectable results. Thus, based on the latter and the observations seen in the PCA biplot, we conclude that k = 3 is an adequate number of clusters.
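The rule-of-thumb bound and the AIC sweep can be sketched with scikit-learn. The data below is a synthetic stand-in for the 115 apartments (the real characteristic and consumption data are not reproduced here), so the selected k is illustrative only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = rng.normal(size=(115, 4))             # synthetic stand-in for 115 apartments

# Rule-of-thumb upper bound on the cluster count: k ≈ sqrt(n / 2).
k_max = int(round(np.sqrt(len(X) / 2)))

# Fit a Gaussian mixture for each candidate k and record its AIC;
# the lowest AIC indicates the best entropy-based trade-off.
aic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).aic(X)
       for k in range(2, k_max + 1)}
best_k = min(aic, key=aic.get)
```

In practice this sweep would be run together with the Davies-Bouldin index, taking the value of k on which both measures agree.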


Having agreed on the issue of number of clusters, we move on to the actual clustering itself. Applying Mixture of Gaussian with k = 3, and representing the clustering results through the principal components projection of the multivariate data, we obtain:

Figure 14: Gaussian clustering with k = 3, one panel per month from (a) January to (l) December; axes: scores on the 1st and 2nd principal components, with cluster centroids marked ⊗.

This series of clusterings takes into account the monthly hot water consumption along with the characteristic data of every apartment. In the series of subplots above, we visualize every apartment as a colored point in the scatter; the color indicates the point's association with one of the three clusters, represented in red, green, and blue. Also shown in the graphs above are the cluster centers, represented as ⊗. Note that a cluster's center does not necessarily
