Article info
Article history:
Received 11 July 2013
Received in revised form 18 October 2013
Accepted 5 February 2014
Keywords:
Data mining
Building automation system
Feature extraction
Clustering analysis
Association rule mining
Recursive partitioning
Abstract
Today's building automation system (BAS) provides us with a tremendous amount of data on actual
building operation. Buildings are becoming not only energy-intensive, but also information-intensive.
Data mining (DM) is an emerging powerful technique with great potential to discover hidden knowledge
in large data sets. This study investigates the use of DM for analyzing the large data sets in BAS with the
aim of improving building operational performance. An applicable framework for mining BAS database
is proposed. The framework is implemented to mine the BAS database of the tallest building in Hong
Kong. After data preparation, clustering analysis is performed to identify the typical power consumption
patterns of the building. Then, association rule mining is adopted to unveil the associations among power
consumptions of major components in each cluster. Lastly, post-mining is conducted to interpret the
rules. 457 rules are obtained in association rule mining, of which the majority can be easily deduced
from domain knowledge and hence be ignored in this study. Four of the rules are used for improving
building performance. This study shows that DM techniques are valuable for knowledge discovery in
BAS database; however, solid domain knowledge is still needed to apply the knowledge discovered to
achieve better building operational performance.
© 2014 Elsevier B.V. All rights reserved.
1. Introduction
Buildings have great impacts on human life and global sustainability. They consume a large amount of energy to build a comfortable, healthy, safe and productive environment for human beings. Buildings consume 41% of primary energy in the United States, which exceeds the transportation sector (29%) and the industry sector (30%) [1]. In Hong Kong, buildings contribute to nearly 90% of the total electric energy consumption and around 60% of greenhouse gas emissions [2]. Buildings consume energy throughout their whole life cycles. Normally, the energy use during the operation stage accounts for 80–90% of their lifecycle energy use [3]. Improving building operational performance is therefore of significant importance for energy saving in the building sector.
Modern buildings are usually integrated with a diversity of advanced technologies. Building automation system (BAS) is a typical example, which integrates technologies from information science, computing science, control theory, etc. BAS enables modern buildings to be more intelligent through real-time automatic monitoring and control. A huge number of records of
Corresponding author. Tel.: +852 2766 4194; fax: +852 2765 7198.
E-mail address: linda.xiao@polyu.edu.hk (F. Xiao).
http://dx.doi.org/10.1016/j.enbuild.2014.02.005
[Framework figure: data preparation (data cleaning, data transformation, data reduction, etc.) → clustering analysis (partitioning clustering, hierarchical clustering, etc.) → post-mining (rule selection, rule interpretation, etc.)]
to 2-level categorical data, e.g., Low and High. The scales of BAS data are also very different due to the different units used. For instance, a typical control signal usually changes from 0 to 1; the temperature measurements may change from 0 °C to 40 °C; and the power measurements may change from 0 kW to 5000 kW. Some predictive data mining techniques (e.g. support vector machines) perform better if the input data have similar scales. Therefore, scaling methods, such as normalization or standardization, should be performed in data preparation.
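The two scaling methods mentioned above can be sketched in a few lines of Python; the sample power values below are hypothetical and the functions assume the inputs are not all identical:

```python
# Min-max normalization and z-score standardization, sketched for
# BAS-style variables that arrive on very different scales.
def min_max(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values on zero with unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

power_kw = [120.0, 850.0, 4200.0, 5000.0]  # raw 0-5000 kW range
print(min_max(power_kw))                    # all values now in [0, 1]
```

After such scaling, a 0-1 control signal and a 0-5000 kW power measurement contribute comparably to distance-based methods.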
Generally speaking, data preparation fulfills three tasks: data cleaning, data transformation and data reduction. Data cleaning handles missing values, resolves inconsistencies, and detects and removes outliers. Missing values can be filled in using a global constant, moving averages, imputation or inference-based models [18]. Outliers can be identified using unsupervised clustering, supervised classification and semi-supervised recognition [4]. Data transformation includes scaling of data sets and transformation of data attributes or types. Commonly used scaling methods include max–min normalization, Z-score normalization and decimal point normalization [18]. Attribute transformation prepares the data in the format required by a DM algorithm, for instance, transforming numerical data into categorical data. Popular methods of attribute transformation include feature extraction, equal-frequency binning, equal-interval binning and entropy-based discretization [18]. Data reduction aims to reduce the dimension of the data sets so as to improve the calculation efficiency. Data sets retrieved from BAS can usually form a matrix, with each row representing an observation set at a specific time instant and each column representing a variable or an item, such as the chiller power consumption or the space temperature. Sampling techniques, such as random sampling and stratified sampling, can be applied to reduce the number of rows (i.e. the number of observation sets). Three methods are commonly used to reduce the number of columns. Firstly, one may use domain expertise to select the most relevant variables. Secondly, one may create a few representative variables using a linear combination of the original variables (e.g. principal component analysis [12]). Thirdly, one may use heuristic methods, such as step-wise forward selection and step-wise backward elimination, to reduce the number of columns [18].
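As an illustration of attribute transformation, equal-frequency binning can be sketched as follows; the pump power values in the example are hypothetical:

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin label (0 .. n_bins-1) so that the bins
    hold roughly equal numbers of observations; ties are broken by
    sort order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [None] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

pump_kw = [12.0, 3.5, 48.0, 30.1, 7.7, 25.0, 41.2, 18.9]
print(equal_frequency_bins(pump_kw, 4))  # quartile labels 0..3
```

Binning of this kind is what turns continuous power readings into categorical levels such as the "3rd" and "4th" quartile labels used later in the association rules.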
The interestingness of a rule is evaluated using three parameters, i.e. support, confidence and lift:

Support(A → B) = P(A, B)

Confidence(A → B) = P(B|A) = P(A, B) / P(A)

Lift(A → B) = P(B|A) / P(B) = P(A, B) / (P(A) P(B))
ARM aims to find all rules satisfying the user-specified minimum support and minimum confidence. The support of a rule is the joint probability of the antecedent and the consequent. The confidence is the conditional probability of the consequent, given the antecedent. Support and confidence are normally used to determine whether a rule is statistically significant or not. Lift is a measure of the dependence and correlation between the antecedent and the consequent. A lift equal to 1 indicates that the antecedent and the consequent are independent of each other, and hence the discovered knowledge has little value. A lift larger than 1 indicates positive correlation, meaning that the probability of the consequent is positively affected by the occurrence of the antecedent. In contrast, a lift smaller than 1 indicates negative correlation. Therefore, desired association rules should have lift values deviating from 1.
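For a single candidate rule A → B, the three measures can be computed directly from binary occurrence indicators; the following sketch uses made-up data, not values from the study:

```python
# Support, confidence and lift of a rule A -> B, estimated from paired
# binary indicators of whether A and B occurred in each observation.
def rule_metrics(a_events, b_events):
    n = len(a_events)
    p_a = sum(a_events) / n
    p_b = sum(b_events) / n
    p_ab = sum(1 for a, b in zip(a_events, b_events) if a and b) / n
    support = p_ab              # joint probability P(A, B)
    confidence = p_ab / p_a     # conditional probability P(B|A)
    lift = p_ab / (p_a * p_b)   # P(B|A) / P(B)
    return support, confidence, lift

# A occurs in 4 of 10 observations, B in 5, both together in 3.
A = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
B = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(rule_metrics(A, B))  # support, confidence, lift
```

Here the lift exceeds 1, so A and B are positively correlated and the rule would be worth keeping under the criterion above.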
Many algorithms are available to perform ARM, including Apriori, ECLAT and FP-growth. In this study, the Apriori algorithm [4] is employed. Its key assumption is that any subset of a frequent item set must also be frequent, which ensures efficiency in generating candidate frequent sets. More specifically, the Apriori algorithm first generates candidate frequent item sets from the original large data sets. Then, by comparing the frequency counts against the user-specified threshold for support, frequent item sets can be selected. Association rules are then derived within each frequent item set considering the user-specified threshold for confidence.
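A minimal sketch of the Apriori level-wise search described above, reduced to frequent-item-set generation; the transaction data in the usage example are hypothetical:

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise frequent-item-set search: any subset of a frequent
    item set must itself be frequent, so size-k candidates are built
    only from the frequent item sets of size k-1."""
    n = len(transactions)
    level = [frozenset([i]) for i in {i for t in transactions for i in t}]
    frequent = {}
    while level:
        kept = {}
        for cand in level:
            support = sum(1 for t in transactions if cand <= t) / n
            if support >= min_support:
                kept[cand] = support
        frequent.update(kept)
        # Join step: merge pairs into candidates one item larger,
        # pruning any candidate that has an infrequent subset.
        level = list({a | b for a in kept for b in kept
                      if len(a | b) == len(a) + 1
                      and all(frozenset(s) in kept
                              for s in combinations(a | b, len(a)))})
    return frequent

tx = [{"chiller_hi", "cdwp_hi"}, {"chiller_hi", "cdwp_hi", "fan_hi"},
      {"chiller_lo"}, {"chiller_hi", "cdwp_hi"}]
print(apriori_frequent_itemsets(tx, min_support=0.5))
```

Rules are then enumerated within each surviving item set and filtered by the confidence threshold, which this sketch omits.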
2.4. Recursive partitioning for post-mining
Recursive partitioning is a supervised, nonparametric method used to develop tree-structure models for prediction or classification. Recursive partitioning models are self-explanatory and easy to follow, enabling users to conveniently understand the underlying reasoning process. A tree-structure model normally consists of a root node, internal nodes and terminal nodes. A root node only has outgoing edges, while a terminal node only has incoming edges. An internal node has both incoming and outgoing edges. A simple tree model for classifying chiller energy consumption levels based on outdoor temperature and occupancy is shown in Fig. 2. The first splitting variable is the outdoor temperature, which appears in the root node, or Node 1 in the figure. If the outdoor temperature is higher than 24 °C, the chiller energy consumption level should be High. Otherwise, the occupancy level should be considered. The occupancy level is selected as the splitting variable in the internal node (i.e., Node 2). For instance, if the outdoor temperature is no more than 24 °C and the occupancy level is larger than 0.5, then the chiller energy consumption level should also be High. In this tree model, Nodes 3, 4 and 5 are terminal nodes, showing the classification results with proportions. In this example, the classification accuracy is 100%, because all the proportions of the resultant chiller energy consumption levels are 1 (or 100%) at the three terminal nodes. Recursive partitioning has been extensively used in analyzing problems in genetics, clinical medicine, and bioinformatics [20].
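The decision logic of the Fig. 2 example can be written directly as nested conditions; the mapping of terminal-node numbers to branches below is an assumption based on the description above:

```python
def chiller_consumption_level(outdoor_temp_c, occupancy):
    """Classify chiller energy consumption with the two-split tree of
    Fig. 2: root split on outdoor temperature, internal split on
    occupancy (node numbering assumed from the text)."""
    if outdoor_temp_c > 24.0:   # Node 1 (root): first splitting variable
        return "High"           # terminal node: warm weather
    if occupancy > 0.5:         # Node 2 (internal): second splitting variable
        return "High"           # terminal node: cool weather, busy building
    return "Low"                # terminal node: cool weather, low occupancy

print(chiller_consumption_level(28.0, 0.2))  # High
print(chiller_consumption_level(20.0, 0.3))  # Low
```

A real recursive-partitioning algorithm would learn these thresholds and splitting variables from data rather than hard-coding them.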
There are many algorithms to perform recursive partitioning,
such as the conditional inference tree method [19], CART and C4.5
[18]. The conditional inference tree method, which has shown
effective diagnostic capability, is employed in the post-mining step
to develop the tree-structure model for analyzing the abnormalities
detected by the association rules.
3. Mining BAS data sets retrieved from a real building
3.1. Description of the raw BAS data sets
The data sets concerned in this study were collected from the
tallest commercial building in Hong Kong [20]. This building was
designated as an Intelligent Building of 2011 by the Asian Institute of Intelligent Buildings. An advanced BAS is installed in this
building. Over 500 power meters record the real-time power consumption of various components such as chillers, pumps, fans, lifts, and lighting devices. Eight months of data (from January 2012 to August 2012) were collected at 15-min intervals, resulting in 22,974 observation sets in total. The collected data consist
of the date and time (i.e. year, month, day, hour, minute, weekday), measurements of indoor and outdoor variables (e.g. outdoor
temperature and relative humidity, indoor CO2 concentration)
and various power consumptions of sub-systems and components
including essential power, normal power, plumbing & drainage,
lift & escalator, mechanical ventilation system, air-handling units,
primary air units, chillers, cooling towers, primary chilled water
pumps (PCHWP), secondary chilled water pumps (SCHWP), condenser water pumps (CDWP). All the data are considered in this
study.
3.2. Data preparation
3.2.1. Data cleaning
The raw BAS data sets contain significant numbers of missing values and outliers. Discarding low-quality data will enhance the reliability of mining results. In this study, missing values are handled using a simple moving average method with a window size of 5 samples. In the raw BAS data set, there are also some dead values, which do not change over a long time. In this study, if a variable does not change within one hour, the corresponding observation set is discarded. Outliers are detected using a simple filter, the interquartile range rule [15]. The interquartile range is the difference between the third quartile (i.e., Q3) and the first quartile (i.e., Q1). The lower limit and the upper limit are defined as Q1 − 1.5(Q3 − Q1) and Q3 + 1.5(Q3 − Q1), respectively.
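The two cleaning steps described above, moving-average imputation and the interquartile range rule, can be sketched as follows; the linear-interpolation quantile and the assumption that a series starts with a valid sample are implementation choices, not necessarily those of the study:

```python
def fill_missing_moving_average(series, window=5):
    """Replace None entries with the mean of the preceding `window`
    valid samples (assumes the series starts with a valid sample)."""
    filled = []
    for v in series:
        if v is None:
            recent = filled[-window:]
            v = sum(recent) / len(recent)
        filled.append(v)
    return filled

def iqr_outlier_limits(values):
    """Lower/upper limits of the interquartile range rule:
    [Q1 - 1.5*IQR, Q3 + 1.5*IQR], with linear-interpolation quantiles."""
    s = sorted(values)
    def quantile(q):
        pos = q * (len(s) - 1)
        lo = int(pos)
        frac = pos - lo
        return s[lo] + (s[min(lo + 1, len(s) - 1)] - s[lo]) * frac
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(iqr_outlier_limits([1.0, 2.0, 3.0, 4.0, 100.0]))  # 100.0 lies outside
```

Any observation falling outside the returned limits would be treated as an outlier and removed.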
Fig. 3. Clustering validation results (Dunn index and Silhouette width versus number of clusters for the hierarchical, k-means and PAM algorithms).
ratio of the minimal inter-cluster distance to the maximal intra-cluster distance; therefore, the Dunn index should be maximized. The Silhouette width is the average of each observation's Silhouette value, which reflects the confidence level of the clustering of a particular observation. The Silhouette width ranges from −1 to 1 and it should also be maximized.
Three popular clustering algorithms, i.e. the agglomerative hierarchical, k-means and PAM algorithms, were compared. The searching range of the cluster number was set from 2 to 7. Clustering validation results are shown in Fig. 3. Both indexes show that the k-means algorithm with 3 clusters gives the best clustering results. Therefore, the k-means algorithm was selected with k equal to 3.
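A minimal sketch of the Dunn index for one-dimensional data illustrates why tight, well-separated clusters score higher; the clusters in the usage example are made up:

```python
def dunn_index(clusters):
    """Ratio of the minimal inter-cluster distance to the maximal
    intra-cluster distance (1-D points for brevity; higher is better)."""
    def dist(a, b):
        return abs(a - b)
    inter = min(dist(a, b)
                for i, ca in enumerate(clusters)
                for cb in clusters[i + 1:]
                for a in ca for b in cb)
    intra = max(dist(a, b) for c in clusters for a in c for b in c)
    return inter / intra

tight = [[1.0, 1.2], [5.0, 5.1], [9.0, 9.3]]   # compact, well separated
loose = [[1.0, 4.0], [5.0, 8.0], [9.0, 12.0]]  # spread out, overlapping gaps
print(dunn_index(tight) > dunn_index(loose))   # True
```

Evaluating this index across candidate cluster numbers (here, 2 to 7) and picking the maximum is the selection procedure the validation step performs.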
The entropy-weighted k-means (EWKM) algorithm, which is an extension of the k-means algorithm, was applied to perform the clustering analysis. EWKM can obtain not only the clustering results, but also the relative importance (RI) of each variable. The Dunn index and Silhouette width were used to find the optimal parameters. As a result, the weight distribution parameter and the convergence threshold were set to 0.2 and 0.0001, respectively.
The EWKM clustering result is shown in Fig. 4. Nearly all the daily building power consumption feature data in Cluster 1 come from weekdays (i.e., Monday to Friday). Cluster 2 mainly consists of data from Saturday, while the majority of observations in Cluster 3 come from Sunday. This result indicates that power consumption
Fig. 4. EWKM clustering result (cluster ID versus day-of-week label, Mon–Sun).
Table 1
Summary of interesting association rules.

No.  Antecedent         Consequent          Supp.  Conf.  Lift  Cluster
1    Out.T = (15,20)    Pwr.Chiller = Low   0.25   0.88   2.10  Saturday
2    Pwr.PAU = High     Pwr.Lift = High     0.35   0.86   1.76  Weekday
3    Pwr.PCHWP = 4th    Pwr.CDWP = 3rd      0.27   0.99   2.73  Weekday
4    Pwr.PCHWP = 3rd    Pwr.CDWP = 2nd      0.32   0.88   1.78  Saturday
5    Pwr.PCHWP = 3rd    Pwr.CDWP = 2nd      0.34   0.89   1.69  Sunday
6    Pwr.PCHWP = 4th    Pwr.SCHWP = High    0.24   0.89   2.83  Weekday
Acknowledgements
The authors gratefully acknowledge the support of this research
by the Hong Kong Polytechnic University (project No. G-YM86). We
would also like to thank Professor Shengwei Wang for his invaluable advice and help.
References
[1] 2011 Building Energy Data Book, U.S. Department of Energy, March 2012.
[2] Hong Kong Energy End-use Data 2012, Hong Kong Electrical & Mechanical Services Department, September 2012.
[3] T. Ramesh, R. Prakash, K.K. Shukla, Life cycle energy analysis of buildings: an overview, Energy and Buildings 42 (10) (2010) 1592–1600.
[4] O. Maimon, L. Rokach, Data Mining and Knowledge Discovery Handbook, 2nd ed., Springer, New York, 2010.
[5] B. Dong, C. Cao, S.E. Lee, Applying support vector machines to predict building energy consumption in tropical region, Energy and Buildings 37 (2005) 545–553.
[6] M.R. Amin-Naseri, A.R. Soroush, Combined use of unsupervised and supervised learning for daily peak load forecasting, Energy Conversion and Management 49 (2008) 1302–1308.
[7] A. Kusiak, M.Y. Li, F. Tang, Modeling and optimization of HVAC energy consumption, Applied Energy 87 (2010) 3092–3102.
[8] A. Ahmed, N.E. Korres, J. Ploennigs, H. Elhadi, K. Menzel, Mining building performance data for energy-efficient operation, Advanced Engineering Informatics 25 (2011) 341–354.
[9] Z. Yu, F. Haghighat, C.M. Fung, H. Yoshino, A decision tree method for building energy demand modeling, Energy and Buildings 42 (2010) 1637–1646.
[10] Z. Yu, F. Haghighat, C.M. Fung, L. Zhou, A novel methodology for knowledge discovery through mining associations between building operational data, Energy and Buildings 47 (2012) 430–440.
[11] D.F.M. Cabrera, H. Zareipour, Data association mining for identifying lighting energy waste patterns in educational institutes, Energy and Buildings (2013), http://dx.doi.org/10.1016/j.enbuild.2013.02.049.
[12] D.L. Olson, D. Delen, Advanced Data Mining Techniques, Springer-Verlag, Berlin, Heidelberg, 2008.
[13] T. Hothorn, K. Hornik, A. Zeileis, Unbiased recursive partitioning: a conditional inference framework, Journal of Computational and Graphical Statistics 15 (2006) 651–674.
[14] S.C. Zhang, C.Q. Zhang, Q. Yang, Data preparation for data mining, Applied Artificial Intelligence 17 (2003) 375–381.
[15] S.W. Wang, Q. Zhou, F. Xiao, A system-level fault detection and diagnosis strategy for HVAC systems involving sensor faults, Energy and Buildings 42 (2010) 477–490.
[16] Z.J. Ma, S.W. Wang, Supervisory and optimal control of central chiller plants using simplified adaptive models and genetic algorithm, Applied Energy 88 (2011) 198–211.
[17] A. Khamis, Z. Ismail, K. Haron, A.T. Mohammed, The effects of outliers data on neural network performance, Journal of Applied Science 5 (8) (2005) 1394–1398.
[18] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed., Springer Series in Statistics, New York, USA, 2009.
[19] L.P. Jing, M.K. Ng, J.Z. Huang, An entropy-weighting k-means algorithm for subspace clustering of high-dimensional sparse data, IEEE Transactions on Knowledge and Data Engineering 19 (2007) 1026–1041.
[20] C. Strobl, J. Malley, G. Tutz, An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests, Psychological Methods 14 (2009) 323–348.