
Information Sciences 379 (2017) 128–145


Dependable large scale behavioral patterns mining from sensor data using Hadoop platform
Md. Mamunur Rashid a,∗, Iqbal Gondal b,a, Joarder Kamruzzaman b,a
a Faculty of Information Technology, Monash University, Clayton, Australia
b School of Engineering and Information Technology, Federation University, Australia

Article history:
Received 14 November 2015
Revised 14 June 2016
Accepted 23 June 2016
Available online 25 June 2016

Keywords: Wireless sensor networks; Data mining; Knowledge discovery; Frequent pattern; Regularly frequent sensor pattern; MapReduce

Abstract: Wireless sensor networks (WSNs) will be an integral part of the future Internet of Things (IoT) environment and generate large volumes of data. However, these data would only be of benefit if useful knowledge can be mined from them. A data mining framework for WSNs includes data extraction, storage and mining techniques, and must be efficient and dependable. In this paper, we propose a new type of behavioral pattern mining technique from sensor data called regularly frequent sensor patterns (RFSPs). RFSPs can identify a set of temporally correlated sensors which can reveal significant knowledge from the monitored data. A distributed data extraction model to prepare the data required for mining RFSPs is proposed, as the distributed scheme ensures higher availability through greater redundancy. The tree structure for RFSPs is compact, requires less memory and can be constructed using only a single scan through the dataset, and the mining technique is efficient with low runtime. Current mining techniques in the literature on sensor data employ a single memory-based sequential approach and hence are not efficient. Moreover, usage of the MapReduce model for the distributed solution has not been explored extensively. Since MapReduce is becoming the de facto model for computation on large data, we also propose a parallel implementation of the RFSP mining algorithm, called RFSP on Hadoop (RFSP-H), which uses a MapReduce-based framework to gain further efficiency. Experiments conducted to evaluate the compactness and performance of the data extraction model, RFSP-tree and RFSP-H mining show improved results.
© 2016 Elsevier Inc. All rights reserved.

1. Introduction

Wireless sensor networks (WSNs) in diverse applications generate huge volumes of dynamic, geographically distributed
and heterogeneous data [5,17]. Data mining techniques may play a vital role in efficiently extracting and analyzing usable
information from the raw data to facilitate automated or human induced decision making. A data mining application for
large scale WSN data must be dependable. A dependable system should include reliability and availability of operation,
among other attributes. In addition, dependability must be analyzed with a focus on the particular service or application.
Data mining from WSN includes data acquisition and storage as well as applying mining techniques on the stored data.
The greater redundancy and reconfigurability of distributed systems allow for very high availability to be designed into


∗ Corresponding author.
E-mail addresses: md.rashid@monash.edu (Md.M. Rashid), iqbal.gondal@federation.edu.au (I. Gondal), joarder.kamruzzaman@federation.edu.au (J. Kamruzzaman).

http://dx.doi.org/10.1016/j.ins.2016.06.036
0020-0255/© 2016 Elsevier Inc. All rights reserved.

distributed systems [27]. Therefore, a distributed data extraction, storage and mining technique, especially one built on a big data platform, will improve the availability of the system; this is what we address in this paper. Another important dependability issue is the response time of the mining operation [32], which is particularly important when a WSN is deployed for time-critical applications.
Knowledge discovery in WSN has been used to extract information about the surrounding environment by deducing from
the data reported by sensor nodes or behavioral patterns about sensor nodes from the meta-data describing sensor behavior.
In the literature different behavioral pattern mining techniques, such as sensor association rules [6,28], target association
rules [26], associated sensor patterns [21], and share-frequent sensor patterns [22] have been successfully used on sensor
data, where behavioral patterns are extracted from the status of the nodes' sensing events. Generating association rules
[6,26,28] that have a certain frequency (support) requires finding all the patterns present in the data, i.e., frequent patterns.
Associated sensor patterns [21] can capture association-like co-occurrences and the strong temporal correlations implied by
such co-occurrences of sensor data patterns. Both techniques (sensor association rules [6,28] and associated patterns [21])
work based on the binary occurrence frequency of a pattern (i.e., only occurrence or non-occurrence of sensor triggers in
an epoch, not their numbers) and reflect only the number of epochs in the database which contain that pattern. On the
other hand, share-frequent patterns [22] deal with non-binary frequency values of sensors, considering the exact number of
sensor triggers in epochs.
Another important criterion for identifying the ‘interestingness’ of frequent patterns is their shape of occurrence, i.e., whether they occur regularly, irregularly, or mostly at specific time intervals in the sensor database. A frequent pattern that occurs at regular intervals is called a regularly frequent sensor pattern. In a WSN, a set of patterns that not only appear frequently but also co-occur at regular intervals may carry more significant information about the environment being monitored. Regularly frequent sensor patterns can also identify a set of temporally correlated sensors. This knowledge can help overcome the undesirable effects (e.g., missed readings) of unreliable wireless communications. Traditional frequent pattern mining methods in the literature fail to discover such regularly frequent sensor patterns because they focus only on high frequency patterns. Therefore, in this paper we propose a new type of behavioral sensor pattern called regularly frequent sensor patterns (RFSPs).
The main challenges in mining RFSPs are: firstly, to find a formal definition of RFSPs that maintains the regularity of the patterns throughout the sensor dataset. Secondly, to design a distributed data extraction mechanism that can collect data from nodes considering their limited resources (e.g., energy). Previous studies on behavioral pattern mining techniques [6,21,22,26,28] stored all sensor data at a central location and therefore may fail to perform mining within an acceptable time when the monitoring duration is long and/or the collected data are voluminous. A distributed data extraction model is necessary in such cases, and thereby achieves greater dependability. Thirdly, to devise a compact tree structure that can capture important information from the sensor dataset in a very compact manner and ensure fast mining. Current studies [6,20,21,28] consider only a single-processor, main memory-based machine for behavioral pattern mining. However, an enormous amount of data will be generated from scenarios like the IoT, which is essentially composed of sensors deployed everywhere. Such limited hardware resources cannot handle large sensor data for mining and analysis, and suffer from scalability problems. To handle these bottlenecks, approaches more efficient than the serial one are needed.
Traditional parallel and distributed data mining (PDDM) techniques assume that data are partitioned and transmitted
to the computing nodes in advance. PDDM approaches need a great deal of message passing and I/O operations, since the
distributed nodes have to share/pass the needed data. This approach is impractical in distributed systems for mining of large
sensor data. To handle big data, some researchers have proposed the use of MapReduce [8] to mine the search space in a
parallel or distributed manner. It assumes a data-centric method of distributed computing with the principle of ‘moving
computation to data’. It uses a distributed file system that is particularly optimized to improve the I/O performance while
handling big data. Hadoop is an open-source implementation of the MapReduce framework. In its potential use for RFSP mining, MapReduce on Hadoop only needs to share and pass the support of individual candidate patterns rather than the whole sensor dataset. Therefore, the communication cost is low compared with traditional distributed environments. In the literature, the impact of MapReduce in discovering behavioral sensor patterns from big sensor data has not yet been investigated. In this work, the RFSP mining technique is further enhanced into RFSP-H (RFSP mining on Hadoop), a distributed regularly frequent sensor pattern mining technique over MapReduce.
The contribution of this paper can be summarized as follows:

• We propose a new type of behavioral pattern called regularly frequent sensor patterns that captures the shape-of-occurrence behavior among sensors in a WSN.
• We have designed a distributed data extraction mechanism which can handle extraction of large sensor data efficiently
and aid faster data mining through its parallel implementation. Such a distributed mechanism contributes to the high
dependability of the overall system.
• We have developed a single-pass tree structure, called the RFSP-tree (regularly frequent sensor patterns tree), that can efficiently mine the regularly frequent sensor patterns in a sensor dataset using a pattern growth approach. Finally, we propose a MapReduce-based algorithm, called RFSP-H, to mine RFSPs among a set of dedicated nodes in parallel to overcome the limitations of single-processor, main memory-based techniques. A Hadoop platform-based implementation of the whole mining framework ensures better service availability and timely mining results.

The rest of this paper is organized as follows. In Section 2, we describe the related works in sensor data mining. In
Section 3, problem formulation of mining regularly frequent sensor patterns is presented. In Section 4, we discuss the dis-
tributed data extraction methodology. In Section 5, we develop our proposed tree structure and mining process. In Section 6,
we develop our proposed method based on MapReduce. In Section 7, experimental results are presented and analyzed. Fi-
nally, Section 8 concludes the paper.

2. Related works

After the first frequent pattern mining technique was introduced by Agrawal et al. in [1], a large number of algorithms (e.g., [11]) were proposed to improve its performance. However, these techniques are not suitable for mining ‘interestingness’ in patterns (e.g., periodic patterns) because their outputs are based only on the support threshold. Mining periodic patterns [10,19,30] and cyclic patterns [19] in a static database has been well addressed over the last decade. Periodic pattern mining in time series data focuses on the cyclic behavior of patterns in either full periodic pattern mining [10,19] or partial periodic pattern mining of time series. Tanbeer et al. [29] proposed the Regular Pattern tree (RP-tree) to mine regularly occurring patterns in static transactional databases. They define the regularity measure for a pattern as the maximum interval at which the same pattern occurs. However, in many real-world applications, due to an erroneous or noisy environment, patterns may not appear regularly without interruption. In such cases, the maximum interval measure for regularity calculation is not effective.
Recently, Rashid et al. [23] have introduced a method of finding regularly frequent patterns in transactional databases
that follow a temporal regularity in their occurrence characteristics by using a tree structure, called a Regularly Frequent
Pattern tree (RF-tree). However, the requirement of two database scans for RF-tree is inefficient in mining regularly frequent
sensor patterns from sensor datasets. In this paper, we propose RFSP-tree, which takes only one database scan to mine
RFSPs from WSNs.
The data extraction model for sensor data or metadata forms an essential part of sensor data mining. Most of the existing
data extraction models in WSN focus on extracting patterns regarding the phenomenon monitored by the nodes. In these
schemes, the sensed data received from the sensor nodes are first accumulated at a central database and then mining
techniques are applied. In [34], a decentralized approach for mining event correlations of the sensed data was proposed,
which suggested that events should be aggregated to a set of databases rather than a centralized database. Boukerche et al.
[6] proposed a data extraction method to mine metadata instead of sensed data regarding the behavior of nodes where data
were stored in a centralized database. Here, we propose a solution to extract metadata to mine the behavior of the sensor
nodes in a network where the data are stored in a distributed manner.
Recently proposed behavioral pattern mining techniques for WSNs (e.g., sensor association rules [6], target association rules [26] and associated patterns [21]) are main memory and single processor based techniques. These techniques assume
that datasets fit well in the main memory and the mining task finishes within a reasonable amount of time. However,
emerging platforms like IoT will generate huge amounts of sensor data. Therefore, this type of assumption will no longer be
valid [7]. To process large data in transactional databases, researchers have focused on large-scale parallel and distributed
frequent pattern mining techniques [2,15,36] to resolve the sequential bottlenecks and to improve scalability and response
time. However, these techniques are not suitable for handling large-scale sensor datasets. Rashid et al. [22] introduced a parallel technique for mining share-frequent sensor patterns from WSN data, considering both homogeneous and heterogeneous computing environments. Although this technique reduces the inter-process communication cost and I/O cost by performing a single database scan, it needs more time for the insertion and restructuring phases during local tree construction at each node.
Recently, the MapReduce [8] framework of distributed computing has proven highly successful for analyzing massive data. Earlier works on MapReduce focused on either data processing [8] or data mining tasks other than frequent pattern mining. Lin et al. [14] proposed three Apriori-based algorithms, called SPC (single-pass counting), FPC (fixed-passes combined-counting) and DPC (dynamic-passes combined-counting), to mine frequent patterns from transactional data using MapReduce. Riondato et al. [24] proposed a parallel randomized algorithm called PARMA for mining approximations to the top-k frequent itemsets and association rules from transactional data using MapReduce. Aridhi et al. [3] proposed a novel MapReduce-based approach for distributed frequent subgraph mining. Bhuiyan and Al Hasan [4] proposed an iterative MapReduce algorithm to mine frequent subgraphs. In [13], the MapReduce-growth (MR-growth) algorithm uses MapReduce to mine truly frequent itemsets from uncertain data. Although these works have established the foundation for using MapReduce in data mining from ‘stored databases’, mining behavioral patterns from a sensor database based on MapReduce has received little attention.
The current work differs from our previous works in the following ways. The tree construction mechanisms in [21] and [22] are based on the binary and non-binary values of patterns only and therefore mine associated and share-frequent patterns, respectively. However, none of these works capture regularity among sensor occurrences or identify sensors that are temporally correlated. Although the preliminary concept of regularly frequent pattern mining was presented in [20], that work does not integrate any mechanism for data extraction from WSNs and lacks analyses of tree construction and properties as well as of time and space complexity. Tree characteristics and performance were not adequately explored through extensive simulation or using diversified, large, real-world datasets. Moreover, [20] is a single-memory-based sequential approach and hence inefficient for handling large amounts of sensor data. The current work addresses these aspects and for the

first time presents a Hadoop-based framework fully integrated for data extraction, storage and behavioral pattern mining
from large scale sensor data offering dependable mining services.

3. RFSPs mining in WSNs: problem formulation

Let S = {s1 , s2 , . . . , sn } be the set of sensors in a specific WSN. We assume that time is divided into equal-sized slots t = {t1 , t2 , . . . , tq } such that t_{j+1} − t_j = λ, j ∈ [1, q − 1], where λ is the size of each time slot. A set P = {s1 , s2 , . . . , s_p } ⊆ S is called a sensor pattern.
An epoch is a tuple e(ets , Y) such that Y is a pattern of the event-detecting sensors that report events within the same time slot and ets is the epoch’s time slot. A sensor database SD is a set of epochs E = {e1 , e2 , . . . , em } with m = |SD|, i.e., the total number of epochs in SD. If X ⊆ Y, it is said that X occurs in e_j and is denoted as e_j^X, j ∈ [1, m]. Let E^X = {e_j^X , . . . , e_k^X }, where j ≤ k and j, k ∈ [1, m], be the ordered set of epochs in which pattern X occurs in SD. Let e_s^X and e_t^X , where j ≤ s < t ≤ k, be two consecutive epochs in E^X . The number of epochs, or time difference, between e_t^X and e_s^X is defined as a period of X, say p^X ; that is, p^X = e_t^X − e_s^X . Let P^X = {p_1^X , p_2^X , . . . , p_s^X } be the set of periods for pattern X. For simplicity in period computation, we assume a null first epoch e_f with e_f = 0, and identify the last epoch e_l with the final epoch e_m (i.e., e_l = e_m ).
Some recent studies [29,30] have used the maximum period (maxPrd) of a pattern’s occurrence intervals as a temporal regularity measure. However, in the erroneous or noisy environments to which WSN operation and its stream data collection are subject, the maxPrd measure is not effective for regularity calculation. With maxPrd as the regularity measure, a single large interval can make an otherwise regular pattern appear irregular. On the other hand, if we use the variance of the interval times between pattern occurrences, one large interval has only a small effect on the variance calculation, capturing the true nature of regularity more closely. Therefore, by using the variance of intervals as the regularity measure, we can mine regular patterns properly.
 
Definition 1 (regularity of pattern X). For a given E^X , let P^X = {p_1^X , p_2^X , . . . , p_N^X } be the set of all periods of X, where N is the total number of periods in P^X . The average period value of pattern X is X̄ = (1/N) Σ_{k=1}^{N} p_k^X , and the variance of the periods is σ^X = (1/N) Σ_{k=1}^{N} (p_k^X − X̄)^2 . We define the regularity of X as Reg(X ) = σ^X (the variance of periods for pattern X).

Definition 2 (support of a pattern X). The number of epochs in a SD that contain X is called the support of X in SD and is
denoted as Sup(X ) = |E X |, where |EX | is the size of EX .

Definition 3 (regularly frequent sensor pattern). A pattern is called a regularly frequent sensor pattern if it satisfies both of the
following two conditions: (i) its support value is no less than a user-given minimum support threshold, say, min_sup and
(ii) its regularity is no greater than a user-given maximum regularity threshold, say, max_var.

Problem definition: Given a SD, min_sup and max_var constraints, the objective is to discover the complete set of regu-
larly frequent sensor patterns in SD having support no less than min_sup and regularity no more than max_var.
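Definitions 1–3 can be checked directly on the example database of Table 1. The sketch below is an illustrative Python rendering, not code from the paper: SD, the pattern and the thresholds come from the running example, while the function names are our own.

```python
SD = {  # TS -> triggered sensors (Table 1)
    1: {"s1", "s2", "s3", "s5"}, 2: {"s1", "s2", "s5", "s6"},
    3: {"s1", "s2", "s3", "s5"}, 4: {"s1", "s2", "s5", "s6"},
    5: {"s3", "s4", "s5"},       6: {"s2", "s3", "s4"},
    7: {"s4", "s5", "s6"},       8: {"s2", "s3", "s4"},
}

def support(pattern, sd):
    # Sup(X): number of epochs that contain every sensor of X (Definition 2)
    return sum(1 for sensors in sd.values() if pattern <= sensors)

def regularity(pattern, sd):
    # Reg(X): variance of the periods of X, with e_f = 0 and e_l = e_m (Definition 1)
    ts = sorted(t for t, sensors in sd.items() if pattern <= sensors)
    m = max(sd)  # last epoch e_m
    periods = [b - a for a, b in zip([0] + ts, ts + [m])]
    mean = sum(periods) / len(periods)
    return sum((p - mean) ** 2 for p in periods) / len(periods)

def is_rfsp(pattern, sd, min_sup, max_var):
    # Definition 3: frequent enough AND regular enough
    return support(pattern, sd) >= min_sup and regularity(pattern, sd) <= max_var

X = {"s1", "s2", "s5"}
print(support(X, SD))      # 4 (epochs 1, 2, 3, 4)
print(regularity(X, SD))   # periods [1, 1, 1, 1, 4] -> variance ~1.44
print(is_rfsp(X, SD, min_sup=3, max_var=2.0))
```

For X = {s1, s2, s5} the periods are [1, 1, 1, 1, 4], so despite one large gap the variance stays low and the pattern qualifies as an RFSP under min_sup = 3 and max_var = 2; under a maxPrd measure, the single gap of 4 would dominate.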

4. Distributed data extraction methodology

In our model, the sensors themselves are the main objects, regardless of the values they sense. We collect sensor activity data, called metadata (e.g., triggering on detection of events); we are interested in capturing sensors’ activity data, not the actual values of the sensed data, and use these activity data for mining later. The proposed network architecture for data extraction is shown in Fig. 1. It consists of sensors and a well-equipped sink, where the sensors are deployed in an ad hoc fashion over the monitored area. Each sensor node embeds a flash memory device that acts as local storage to keep records of the detected events during monitoring. It is shown in [18] that the energy consumed to maintain a unit of data in a flash memory embedded in a sensor node is very low compared to the energy required to transmit that unit of data. All the sensor nodes share a sensor distributed file system (SDFS) along with their local storage, as shown in Fig. 1. Each sensor node can download files from the SDFS to its local storage and upload files to the SDFS.
The decentralized method for data extraction is designed to put more of the computational load on the sensors. This is achieved by equipping each sensor with additional storage for the metadata that record the sensor’s activity during the monitoring/observation period. The idea behind the decentralized data extraction is to filter out the sensors whose frequencies (i.e., support) are less than min_sup, which reduces the communication cost when messages are uploaded to the SDFS. The data extraction process begins with the sink sending the mining parameters to the sensor nodes in the network. These parameters are the historical period (This ), the time slot (Ts ) and the minimum support, min_sup. After receiving the parameters, each sensor creates a local buffer with a one-bit entry for each time slot in the historical period of the data extraction. At first, all the bit entries in the buffer are unset. At the end of each time slot, every node checks whether any event was detected in the current time slot; if so, the bit for that time slot is set. After the end of the historical period, each sensor scans its local buffer. If the number of set bits is greater

Fig. 1. Network architecture.

Fig. 2. Detected events for the 80 min historical period (10-min slots starting at 9:00):

TS 1 (9:00–9:10): s1 s2 s3 s5 s7    TS 5 (9:40–9:50): s3 s4 s5
TS 2 (9:10–9:20): s1 s2 s5 s6       TS 6 (9:50–10:00): s2 s3 s4
TS 3 (9:20–9:30): s1 s2 s3 s5       TS 7 (10:00–10:10): s4 s5 s6 s7
TS 4 (9:30–9:40): s1 s2 s5 s6       TS 8 (10:10–10:20): s2 s3 s4

than or equal to min_sup, the node forms a message or a series of messages, depending on the packet size. The message contains the sensor ID and the time slot numbers in which the corresponding bits are set. Then the sensor uploads these messages to the SDFS.
Note that the messages may be stored in different sensor nodes. This characteristic of our model ensures data availability
in a noisy/erroneous environment. Then a MapReduce pass is employed to merge messages into epochs in parallel. Each
mapper takes an input pair of the form (key = message ID, value = message), where message = (ts, s) is a message that was generated previously. It splits the message into a time slot (ts) and a sensor (s), and outputs a key-value pair (key
= ts, value = s). After all mapper instances have completed, for each distinct key ts, the MapReduce infrastructure collects
its corresponding values as sensors, and feeds the reducers with key-value pair (key = ts, value = S), where S is the set of
sensors that triggered at the same time slot. The reducer receives the key-value pair, merges all sensors with the same time
slot into an epoch E, and associates its time slot number (TS) with ts. Finally, it outputs the key-value pair (key = TS, value
= E). Algorithm 1 shows the pseudo-code for the distributed data extraction model.

Example. Let us consider the following simple scenario. Let S = {s1 , s2 , s3 , s4 , s5 , s6 , s7 } be the sensors in a particular sensor network. Let the time slot Ts = 10 min and the historical period This = 80 min. Suppose that the data extraction process starts at 09:00. Each sensor node keeps a buffer of length 8, one entry for each time slot. A buffer entry is set if an event is detected in that time slot. The detected events for the 80 min historical period are shown in Fig. 2. Assume the minimum support min_sup = 3. At the end of the historical period (10:20), sensors s1 , s2 , s3 , s4 , s5 and s6 will formulate the following messages: (s1 , [1, 1, 1, 1, 0, 0, 0, 0]), (s2 , [1, 1, 1, 1, 0, 1, 0, 1]), (s3 , [1, 0, 1, 0, 1, 1, 0, 1]), (s4 , [0, 0, 0, 0, 1, 1, 1, 1]), (s5 , [1, 1, 1, 1, 1, 0, 1, 0]) and (s6 , [0, 1, 0, 1, 0, 0, 1, 0]). Then each sensor sends its message as a set of sensor IDs and time slots to the SDFS (e.g., the messages sent by sensor s1 are m1 (s1 , 1), m2 (s1 , 2), m3 (s1 , 3), m4 (s1 , 4)). Finally,

Algorithm 1 Distributed data extraction


Input: Raw sensor data
Output: A set of epochs on the SDFS
Sink:
1: Broadcast parameters (This , Ts , min_sup)
Node:
2: Upon receiving mining parameter
3: Slot Number = 1;
4: Time = current time;
5: while (current time ≤ Time + This ) do
6: if (current time ≤ Time + Slot Number * Ts ) then
7: if there is a detected event then
8: Set buffer[Slot Number]
9: end if
10: else
11: Slot Number ++
12: end if
13: end while
14: if (number of set bits ≥ min_sup) then
15: Form message, m = (Sensor ID, ts)
16: Send m to the SDFS
17: end if
SDFS:
18: Procedure Mapper (key = message ID, value = m)
19: Begin
20: ts ← message.Ts
21: sensor ← message.sensor
22: Output (<key=ts, value = sensor>)
23: END
24: Procedure Reducer (key = ts, value = S)
25: Begin
26: Epoch E.TS ← ts
27: for each sensor s ∈ S do
28: E ← E ∪(s)
29: end for
30: Output (key =TS, value = E)
31: END

Table 1
Epochs on SDFS (An example sensor database (SD)).

TS Epoch TS Epoch

1 s1 s2 s3 s5 5 s3 s4 s5
2 s1 s2 s5 s6 6 s2 s3 s4
3 s1 s2 s3 s5 7 s4 s5 s6
4 s1 s2 s5 s6 8 s2 s3 s4

the MapReduce pass is used to merge these messages into the epochs on the SDFS shown in Table 1. Sensor node s7 does not send any message because its number of set entries is less than the required min_sup.
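The node-side filtering and the MapReduce merge of Algorithm 1 can be simulated off-platform. The sketch below is illustrative Python, not the paper's Hadoop code: buffer contents are those of Fig. 2, and all names are our own assumptions.

```python
from collections import defaultdict

MIN_SUP = 3
# Per-sensor buffers for the 8 time slots of the example (1 = event detected)
buffers = {
    "s1": [1, 1, 1, 1, 0, 0, 0, 0], "s2": [1, 1, 1, 1, 0, 1, 0, 1],
    "s3": [1, 0, 1, 0, 1, 1, 0, 1], "s4": [0, 0, 0, 0, 1, 1, 1, 1],
    "s5": [1, 1, 1, 1, 1, 0, 1, 0], "s6": [0, 1, 0, 1, 0, 0, 1, 0],
    "s7": [1, 0, 0, 0, 0, 0, 1, 0],
}

# Node side: a sensor uploads (sensor, ts) messages only if it has >= min_sup set bits
messages = [(s, ts + 1)
            for s, buf in buffers.items() if sum(buf) >= MIN_SUP
            for ts, bit in enumerate(buf) if bit]

# Mapper: (message ID, message) -> (key = ts, value = sensor)
mapped = [(ts, s) for s, ts in messages]

# Shuffle + Reducer: group all sensors with the same time slot into an epoch
epochs = defaultdict(set)
for ts, s in mapped:
    epochs[ts].add(s)

for ts in sorted(epochs):
    print(ts, sorted(epochs[ts]))   # reproduces the epochs of Table 1
```

Note that s7 never uploads anything (only 2 set bits, below min_sup), so the reduced epochs for TS 1 and TS 7 omit it, matching Table 1.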

5. Proposed RFSP-tree structure and mining process

In this section, we present the construction and mining process of the regularly frequent sensor pattern tree (RFSP-tree) for finding regularly frequent patterns. We also analyze the complexity of RFSP-tree construction. The RFSP-tree construction has two phases: an insertion phase and a restructuring phase. The step-by-step construction process of the RFSP-tree is presented below in Fig. 3(a–e), with examples based on the sensor database in Table 1. For the sake of clarity, we do not show the node traversal pointers in the tree.
Each node in an RFSP-tree represents the sensor set in the path from the root to that node. An important feature of an RFSP-tree is that it maintains the appearance information for each epoch within the tree structure. To explicitly track such

Fig. 3. RFSP-tree construction: (a) initial empty RFSP-treeL ; (b) RFSP-treeL after inserting TS=1; (c) RFSP-treeL after inserting TS=2; (d) RFSP-treeL after inserting all epochs; (e) the final RFSP-tree (restructured in SFD order).

information, it keeps a list of TS (time-slot) information only at the last sensor-node for an epoch. Such a node is denoted
as a tail-node. Hence, an RFSP-tree maintains two types of nodes, namely, ordinary nodes and tail nodes. The former are the
type of nodes that do not maintain TS information. On the other hand, the latter type can be defined as follows:

Definition 4 (tail node). Let e = {si , s j , . . . , sk } be an epoch whose sensors are sorted according to the sensor lexicographic order list, SL-list (where i < j < k). If e is inserted into the RFSP-tree in this order, then the node of the tree that represents sensor sk is defined as the tail-node for e, and it explicitly maintains e’s TS.

Irrespective of node type, no node in the RFSP-tree needs to maintain a support count value as in the FP-tree [11]. Each node in the RFSP-tree maintains parent, children, and node traversal pointers. The structures of an ordinary node and a tail node are thus as follows. An ordinary node is denoted by M, where M is the sensor name of the node. A tail node is denoted by M[e1 , e2 , . . . , en ], where ei , i ∈ [1, n], is an epoch TS in the TS-list, indicating that M is the tail-node for epoch ei .

Lemma 1. A tail-node in an RFSP-tree inherits the structure of an ordinary node, but not vice versa.

Proof. The structure of an ordinary node states that it maintains exactly three types of pointers: a parent pointer, a list of child pointers and a node traversal pointer. A tail-node maintains all the information of an ordinary node and, in addition, maintains the TS-list. Since the TS-list is not maintained in an ordinary node, every tail-node contains an ordinary node within it, whereas no ordinary node is a tail-node. □
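The relationship between the two node types in Lemma 1 maps naturally onto subclassing. The sketch below is illustrative Python; the class and field names are our own assumptions, not the paper's implementation.

```python
class OrdinaryNode:
    """A node keeping parent/children/traversal pointers but no TS-list."""
    def __init__(self, sensor, parent=None):
        self.sensor = sensor
        self.parent = parent
        self.children = {}   # sensor name -> child node
        self.next = None     # node-traversal pointer

class TailNode(OrdinaryNode):
    """A tail node extends an ordinary node with a TS-list (Lemma 1)."""
    def __init__(self, sensor, parent=None):
        super().__init__(sensor, parent)
        self.ts_list = []    # time slots of the epochs ending at this node
```

Every TailNode is an OrdinaryNode (it inherits all three pointer types), but an OrdinaryNode carries no ts_list, mirroring the "not vice versa" direction of the lemma.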

5.1. RFSP-tree construction process

Insertion phase: In this phase, the RFSP-tree arranges sensors according to the lexicographic sensor order in the database and is built by inserting the epochs one after another. At this stage we call it the RFSP-treeL . It simply maintains a sensor lexicographic order list (SL-list). The SL-list includes each distinct sensor found in the epochs of the database, in sensor lexicographic order (e.g., s1 , s2 , s3 , s4 , s5 , s6 for the example sensor database in Table 1), and contains the support value of each sensor in the database. Initially the RFSP-tree is empty, and construction starts with the null root node shown in Fig. 3(a). To include the first epoch (i.e., TS = 1), {s1 , s2 , s3 , s5 } is inserted into the tree as-is as < {} → s1 → s2 → s3 → s5 :1 >, building the first branch of the tree with s1 as the initial node just after the root and s5 :1 as the tail-node, as shown in Fig. 3(b). Hence, the tail-node carries the epoch’s TS (i.e., 1) in its TS-list. The support count entries for sensors s1 , s2 , s3 and s5 are updated at the same time. Fig. 3(c) shows the status of the SL-list and the RFSP-treeL after inserting TS = 2, {s1 s2 s5 s6 }. TS = 2 has its prefix < {} → s1 → s2 > in common with TS = 1. Therefore, epoch TS = 2 is inserted into the tree following the path < {} → s1 → s2 >, which then creates a new child from s2 for the uncommon part of the epoch, with node s6 :2 being the tail-node that carries the TS information for the epoch. In this way, after adding all epochs (TS = 3–8), we obtain the complete RFSP-treeL shown in Fig. 3(d). We call the SL-list of the constructed RFSP-treeL SL. Here the insertion phase ends and the restructuring phase starts.
Restructuring phase: The purpose of this phase is to achieve a highly compact RFSP-tree that uses less memory and facilitates a fast mining process. We first sort the SL in frequency-descending order (SFD ) using merge sort and then reorganize the tree structure according to the SFD order. To restructure the RFSP-tree, we use the branch sorting method (BSM) proposed in [31]. BSM uses merge sort to sort every path of the prefix tree: it first removes the unsorted paths, then sorts them and reinserts them into the tree. Fig. 3(e) shows the structure of the final RFSP-tree obtained by the restructuring operation. The pseudo-code for RFSP-tree construction is shown in Algorithm 2. The RFSP-tree supports the following properties and lemma.
Md.M. Rashid et al. / Information Sciences 379 (2017) 128–145 135

Algorithm 2 RFSP-tree construction

Input: SD, initial sensor lexicographic order (ISLO)
Output: RFSP-tree of SD
1: Begin
2: SL ← an SL-list arranged in ISLO
3: create the root R of the RFSP-tree and label it as 'null'
4: for each epoch Ei in SD do
5: sort Ei according to SL-list order;
6: set the current node N ← R;
7: for each sensor y in the sorted Ei do
8: if N has a child C such that C.sensor-name = y.sensor-name then
9: select C as the current node, N ← C;
10: else
11: create a new node C as a child of N and set N ← C;
12: end if
13: if y = the tail-sensor of Ei then
14: if C = an ordinary node then
15: assign a TS-list to C;
16: end if
17: add the TS of Ei to C's TS-list;
18: end if
19: end for
20: end for
21: calculate SFD from SL in frequency-descending order using merge sort;
22: for each branch in the RFSP-tree do
23: sort the branch in SFD order using the branch sorting method (BSM);
24: end for
25: End
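The reconstruction phase of Algorithm 2 can likewise be sketched. In the illustrative code below (ours, not the optimized BSM of [31]), a tree node is a dict mapping a sensor name to a (children, TS-list) pair: every tail-node path is extracted with its TS-list, re-sorted into frequency-descending (SFD) order, and reinserted.

```python
# Illustrative sketch of the reconstruction phase: each path ending in a tail
# node is pulled out, sorted into SFD order, and reinserted into a new tree.

def insert_path(tree, path, ts_list):
    node, entry = tree, None
    for sensor in path:
        entry = node.setdefault(sensor, ({}, []))
        node = entry[0]
    entry[1].extend(ts_list)        # the tail node accumulates the epochs' TSs

def tail_paths(tree, prefix=()):
    """Yield (path, TS-list) for every tail node in the tree."""
    for sensor, (children, ts) in tree.items():
        if ts:
            yield prefix + (sensor,), ts
        yield from tail_paths(children, prefix + (sensor,))

def restructure(tree, support):
    sfd = lambda s: (-support[s], s)     # frequency-descending, ties lexicographic
    new_tree = {}
    for path, ts_list in tail_paths(tree):
        insert_path(new_tree, sorted(path, key=sfd), ts_list)
    return new_tree

# Two epochs of the running example, first inserted in lexicographic order
tree = {}
insert_path(tree, ["s1", "s2", "s3", "s5"], [1])
insert_path(tree, ["s1", "s2", "s5", "s6"], [2])
support = {"s1": 2, "s2": 2, "s3": 1, "s5": 2, "s6": 1}
sorted_tree = restructure(tree, support)
# After restructuring, both branches share the longer prefix s1 -> s2 -> s5
assert set(sorted_tree["s1"][0]["s2"][0]["s5"][0]) == {"s3", "s6"}
```

Moving the less frequent sensors toward the leaves is what makes the restructured tree share longer prefixes, hence its compactness.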

Property 1. An RFSP-tree contains a complete set of frequent sensor projection for each epoch in the sensor database (SD)
only once.

Property 2. The TS-list in an RFSP-tree maintains the occurrence information for all the nodes in the path (from tail to root
node) at least in the epochs of the list.

Lemma 2. Let P = b1 , b2 , . . . , bn be a path in an RFSP-tree where node bn is the tail node that carries the TS-list of the path. If the
TS-list is pushed up to the node bn−1 , then the node bn−1 maintains the occurrence information of the path P′ = b1 , b2 , . . . , bn−1
for the same set of epochs in the TS-list without any loss.

Proof. Based on Property 2, the TS-list at node bn maintains the occurrence information of the path P at least in the epochs
it contains. Therefore, the same TS-list at node bn−1 maintains exactly the same epoch information for P′ without any loss. 
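As a small illustration of Lemma 2 (our own sketch, representing TS-lists as Python lists): removing a tail node during mining simply merges its TS-list into the parent's, so no occurrence information is lost.

```python
def push_up(parent_ts, tail_ts):
    """Merge a removed tail node's TS-list into its parent's TS-list.
    A path occurs at most once per epoch, so the union has no duplicates."""
    return sorted(set(parent_ts) | set(tail_ts))

# s6's TS-list is pushed up to its parent (cf. the mining example in Section 5.2)
assert push_up([], [2, 4]) == [2, 4]
assert push_up([2, 4], [7]) == [2, 4, 7]
```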

5.2. RFSP-tree mining process

Here, we describe the mining process of our proposed RFSP-tree. Similar to the FP-growth [11] mining approach, we
recursively mine RFSP-trees of decreasing size to generate regularly frequent patterns by creating conditional pattern-
bases (PB) and the corresponding conditional trees (CT) without an additional database scan. We then generate the frequent
patterns from the conditional trees. At the end, we check the regularity of the generated frequent patterns to find the regularly
frequent sensor patterns.
For the example database shown in Table 1, suppose min_sup = 3 and max_var = 1.1. We explain our mining pro-
cedure below using the bottom-most sensor s6 . The conditional pattern-base and conditional tree of s6 are shown in Fig. 4(a). According
to Lemma 2, the TS-list of s6 is pushed up to its respective parent nodes s1 and s4 ; therefore, each parent node of s6 is
converted to a tail-node. For node s6 , its immediate frequent pattern is (s6 : 2, 4, 7), i.e., s6 occurs in epochs 2, 4 and 7.
Therefore, its support is 3 and it has two paths in the RFSP-tree: (s2 , s5 , s1 , s6 : 2, 4) and (s5 , s4 , s6 : 7), where the numbers after
":" indicate the TSs in which each sub-pattern occurs. The s6 conditional pattern-base is then {(s2 , s5 , s1 : 2, 4), (s5 , s4 : 7)}, which is
shown in Fig. 4(a). The s6 conditional tree has only one branch, (s5 : 2, 4, 7), and the generated frequent patterns are (s5 s6 : 2,
4, 7) and (s6 : 2, 4, 7). We then calculate the regularity values of s5 s6 and s6 using Definition 6.1, which are both 1.5.
Since {Reg(s5 s6 ), Reg(s6 )} > 1.1, the patterns s5 s6 and s6 are not regularly frequent sensor patterns. A
similar process is repeated for the other sensors in the RFSP-tree to find the complete set of regularly frequent sensor patterns,
as shown in Fig. 4(b–e).
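The regularity measure itself (Definition 6.1) is not reproduced in this excerpt. Purely as an illustration, the sketch below assumes a variance-style measure over a pattern's occurrence periods, including the gaps before the first and after the last occurrence. Under this assumption it reproduces several of the values quoted in Table 7 (e.g., 1.44 for s1, 0.556 for s3 and 3.04 for s4), though not every entry, so it should be read as a plausible interpretation rather than the paper's exact definition.

```python
def regularity(ts_list, total_epochs):
    """ASSUMED regularity measure (our reading of Definition 6.1, which is not
    shown in this excerpt): the population variance of a pattern's occurrence
    periods, where the periods include the gap from epoch 0 to the first
    occurrence and from the last occurrence to the end of the database."""
    occ = sorted(ts_list)
    periods = [occ[0]]                                   # gap from epoch 0
    periods += [b - a for a, b in zip(occ, occ[1:])]     # inter-occurrence gaps
    periods.append(total_epochs - occ[-1])               # trailing gap
    mean = sum(periods) / len(periods)
    return sum((p - mean) ** 2 for p in periods) / len(periods)

# s1 occurs in epochs 1-4 of the 8-epoch example database
assert abs(regularity([1, 2, 3, 4], 8) - 1.44) < 1e-9
# s4 occurs in epochs 5-8
assert abs(regularity([5, 6, 7, 8], 8) - 3.04) < 1e-9
```

A perfectly periodic pattern yields identical periods and hence regularity 0; the max_var threshold bounds how far a pattern may deviate from such periodicity.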
Table 2 shows the overall mining process of the regularly frequent sensor patterns for our example sensor database
in Table 1. With the above mining technique, one can see that, from an RFSP-tree constructed on a SD, the complete set

[Fig. 4 panels: (a) conditional pattern-base and conditional tree for s6; (b) conditional pattern-base and conditional tree for s4; (c) conditional pattern-base and conditional tree for s1; (d) conditional tree for s3; (e) conditional tree for s5.]

Fig. 4. Conditional pattern-base and conditional tree construction with the RFSP-tree.

Table 2
Mining the RFSP-tree by creating conditional (sub-) pattern base.

Sensor | Conditional pattern-base | Conditional-tree | FSP | RFSPs
s6 | {(s2 s5 s1 : 2, 4), (s5 s4 : 7)} | < s5 : 2, 4, 7 > | (s5 s6 : 2, 4, 7), (s6 : 2, 4, 7) | 0
s4 | {(s2 s3 : 6, 8), (s5 s3 : 5), (s5 : 7)} | < s3 : 5, 6, 8 > | (s3 s4 : 5, 6, 8), (s4 : 5, 6, 7, 8) | 0
s1 | {(s2 s5 : 2, 4), (s2 s5 s3 : 1, 3)} | < s2 s5 : 1, 2, 3, 4 > | (s1 s2 : 1, 2, 3, 4), (s1 s5 : 1, 2, 3, 4), (s1 s2 s5 : 1, 2, 3, 4), (s1 : 1, 2, 3, 4) | 0
s3 | {(s2 : 6, 8), (s2 s5 : 1, 3), (s5 : 5)} | < (s2 : 1, 3, 6, 8), (s5 : 1, 3, 5) > | (s2 s3 : 1, 3, 6, 8), (s3 s5 : 1, 3, 5), (s3 : 1, 3, 5, 6, 8) | s2 s3 , s3 s5 , s3
s5 | {(s2 : 1, 2, 3, 4)} | < s2 : 1, 2, 3, 4 > | (s2 s5 : 1, 2, 3, 4), (s5 : 1, 2, 3, 4, 5, 7) | s5
s2 | 0 | 0 | (s2 : 1, 2, 3, 4, 6, 8) | s2

of regularly frequent sensor patterns for given min_sup and max_var thresholds can be mined efficiently with the pattern
growth approach.

5.3. Complexity analysis of RFSP-tree

Let N be the number of epochs in a SD and M be the average length of all epochs. Assume that the average computation
cost to scan one epoch from SD and to insert it into the RFSP-tree are CS and CI , respectively. Therefore, the total cost to scan
all epochs from SD and insert them into the RFSP-tree is CostT = N × (CS + CI ). Let CFD be the average cost to sort one
epoch in the frequency-descending order. Then, the cost required to restructure the RFSP-tree in the frequency-descending
order by scanning SD is, CostS = N × (CF D ) + CostT = N × (CS + CF D + CI ).
In the restructuring phase, we employed BSM that utilizes the merge sort technique to sort the nodes of any epoch.
Therefore, the degree of disorder is not an important factor for performance during sorting, since irrespective of data distri-
bution the complexity of merge sort is always O(nlog2 n), where n is the total number of sensors in the list. BSM sorts each
epoch (i.e., each path) with its support count. It processes as many epochs as the count value with a single operation. As a
result, the total sorting cost of all epochs is P × (CFD ), where P is the total number of identical epochs in SD, with P ≤ N.
Let CrIM be the cost required to read an epoch of length M from internal memory in the initial RFSP-tree using BSM.
If there is no sorted path in the RFSP-tree, then the computation cost for restructure becomes, CostR = P × (CrIM + CF D + CI ).
On the other hand, when the path is already sorted, the restructuring cost will be reduced. If Q is the total number of
sorted paths found during the RFSP-tree restructuring process, then CostR can be reduced to, C ostR = P × (CrIM + CF D + CI ) −
Q × (CF D + CI ).
In the worst case, when N = P and Q = 0, the restructuring cost can be computed as C ostR = N × (CrIM + CF D + CI ). Thus the total
cost of RFSP-tree construction is C ostRF SP = C ostS + C ostR .

6. RFSPs mining using MapReduce model

6.1. Preliminary

MapReduce is a high-level programming model that allows distributed computation over a large amount of data [8]. The
two key functional programming primitives in MapReduce are ‘Map’ and ‘Reduce’. The map function takes a pair of (key,
value) data and returns a list of intermediate < key, value > pairs.
Md.M. Rashid et al. / Information Sciences 379 (2017) 128–145 137

[Fig. 5 sketch: the input data is partitioned into Part1 , Part2 , …, Partn ; each partition is processed by a Mapper M1 , …, Mn ; Reducers R1 , …, Rn output (FSP, support) pairs in phase 1 and the final RFSPs in phase 2.]
Fig. 5. Proposed MapReduce framework for RFSP mining (RFSP-H).

map: (key1 , value1 ) → list of (key2 , value2 ).


Then, these pairs are shuffled and sorted. Each node then applies the reduce function to the set of intermediate pairs
with the same key. Typically, the Reduce function produces zero or more output pairs by performing a merging opera-
tion. The Reduce function 'reduces' the list of values associated with a given key (for all k keys) by combining, aggregating,
summarizing, filtering, or transforming them, and returns a list of k values.
reduce: (key2 , list of value2 ) → list of (value3 ).
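The map, shuffle/sort and reduce contract above can be illustrated with a minimal in-memory simulation (ours, not Hadoop itself); the toy job below counts sensor occurrences across epochs.

```python
# Minimal in-memory illustration of the map -> shuffle/sort -> reduce contract.
from itertools import groupby

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))          # map: (k1, v1) -> [(k2, v2)]
    intermediate.sort(key=lambda kv: kv[0])              # shuffle/sort by key
    output = []
    for key, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.extend(reduce_fn(key, [v for _, v in group]))  # reduce: (k2, [v2]) -> [v3]
    return output

# Toy job: count how often each sensor fires across two epochs
epochs = [(1, ["s1", "s2"]), (2, ["s2", "s5"])]
counts = run_mapreduce(
    epochs,
    map_fn=lambda ts, sensors: [(s, 1) for s in sensors],
    reduce_fn=lambda sensor, ones: [(sensor, sum(ones))],
)
assert counts == [("s1", 1), ("s2", 2), ("s5", 1)]
```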
Hadoop is an open source implementation of MapReduce written in Java that comprises two primary components:
(i) the Hadoop Distributed File System (HDFS) and (ii) MapReduce.

6.2. Proposed MapReduce model for RFSPs mining

In this section, we present the proposed model for large scale RFSPs mining with MapReduce shown in Fig. 5.
As shown in Fig. 5, our model works as follows:

1. The input sensor database (SD) is partitioned into N partitions. Each partition is processed by a Mapper machine.
2. Mapper i reads the assigned data partition and generates the corresponding local candidate sensorsets according to a
local_min_sup threshold. Mapper i outputs < c.sensorset, sup > (key, value) pairs, where 'sensorset' refers to a set of sensors, S.
3. For each unique intermediate key, the Reducer passes the key and the corresponding set of intermediate values to the
defined Reduce function. From these (key, value) pairs, the Reducer outputs the final list of (key, value) pairs as
< f.sensorset, sup > after filtering against global_min_sup in reduce phase 1. Finally, these (key, value) pairs
are filtered against max_var in reduce phase 2, which outputs the complete set of RFSPs.

The framework of RFSP-H consists of two steps:


(i) Data Partition:
In this step, RFSP-H splits the input sensor database (SD) into many partitions. The straightforward partition strategy is
to distribute the epochs evenly across the partitions. The algorithm for data partition is shown in Algorithm 3, where in lines 3
and 4 RFSP-H performs the partitioning of the input dataset and writes each partition to HDFS.
(ii) Distributed RFSPs mining:

Algorithm 3 Data partition


1: Create a data directory (D) in distributed file system (DFS)
2: while data available in D do
3: partition = create partition()
4: write partition to file parti in data directory
5: end while
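Outside Hadoop, the even-partition strategy of Algorithm 3 amounts to splitting the epoch list into near-equal contiguous chunks; the sketch below is illustrative (the real RFSP-H writes each partition to HDFS).

```python
def partition_epochs(epochs, n_parts):
    """Split the epoch list into n_parts contiguous, near-equal partitions."""
    size, rem = divmod(len(epochs), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rem else 0)   # spread the remainder
        parts.append(epochs[start:end])
        start = end
    return parts

# Eight epochs split for two mappers, as in the step-by-step example (Section 6.3)
parts = partition_epochs(list(range(1, 9)), 2)
assert parts == [[1, 2, 3, 4], [5, 6, 7, 8]]
```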

Algorithm 4 MAP function

Input: A sensor database on HDFS, local_min_sup, Key = TS, Value = sensorset
Output: Candidate sensorsets (c.sensorset)
1: Generate candidates Ci using balanced FP-growth [36]
2: for all c in Ci do
3: Emit (c.sensorset, sup)
4: end for

Algorithm 5 REDUCE function

Input: global_min_sup, max_var, key = a candidate sensorset (c.sensorset), values = local supports of c.sensorset
Output: Regularly frequent sensor patterns (RFSPs)
1: support = 0
2: for all v in values do
3: support += v
4: end for
5: if support ≥ global_min_sup then
6: Emit (f.sensorset, sup) {reduce phase 1}
7: if Reg(f.sensorset) ≤ max_var then
8: Emit (RFSP) {reduce phase 2}
9: end if
10: end if

Table 3
I/O scheme for proposed model.

I/O | Map | Reduce-1 | Reduce-2
Input key/value pairs | key: TS; value: sensorset | key: c.sensorset; value: support | key: f.sensorset; value: regularity
Output key/value pairs | key: c.sensorset; value: support | key: f.sensorset; value: support | key: RFSP; value: support

The distributed RFSPs mining step consists of mining the set of regularly frequent sensor patterns (RFSPs). The input to this
step is the set of partitions of the SD and the output is the set of RFSPs. This step is executed
through a MapReduce pass. In the Map step, we use the balanced FP-growth [36] mining technique, which runs on each
partition in parallel. In the Reduce step, we compute the final set of frequent patterns and use these frequent patterns to
discover the RFSPs. Algorithms 4 and 5 present our Map and Reduce functions, respectively.
After partitioning the sensor database into smaller parts, the master node assigns a task to each idle slave node (i.e.,
Mapper). The execution flow of RFSP-H is shown in Fig. 6. Execution starts at the Mappers, which read the epochs
in partition p from HDFS as < TS, sensorset > key-value pairs. The Mappers then generate all possible candidate
patterns using balanced FP-growth [36]. Each Mapper builds its key-value pairs as < c.sensorset, sup > and emits
them to the Reducers. These values are stored on the local disk as intermediate results, and HDFS performs the sorting and
merging operations to produce < c.sensorset, sup > pairs. We use two levels of pruning, namely local pruning and
global pruning, which use different minimum support thresholds expressed as local_min_sup and global_min_sup, respectively.
Local pruning is applied in the Map phase on each segment, while global pruning is applied in the Reduce phase. For this
purpose, we modified balanced FP-growth [36] using the MapReduce library functions written in Java. Algorithm 4 presents
our Map function.
The master node assigns the Reduce operation to the slave nodes. In reduce phase 1, a slave node (i.e., Reducer 1) takes
< c.sensorset, sup > pairs as input and generates < f.sensorset, sup > pairs as output. In reduce phase 2, Reducer 2
takes < f.sensorset, sup > pairs as input, checks the regularity criterion among the frequent patterns, writes the
regularly frequent sensor patterns to the output files as < RFSP, sup > pairs and stores them on the local disks. Table 3 shows the
input/output schemes of the proposed framework. Algorithm 5 presents our Reduce function.
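This two-phase flow can be simulated in memory. The sketch below is ours (Python rather than the paper's Java/Hadoop implementation): `mine_candidates` is a brute-force stand-in for balanced FP-growth [36], the eight epochs are reconstructed from the TS-lists of the running example, and local_min_sup is deliberately set to 1 so that local pruning cannot discard a globally frequent sensorset.

```python
# In-memory simulation of the RFSP-H map and reduce-phase-1 flow (illustrative).
from collections import Counter
from itertools import combinations

def mine_candidates(segment, local_min_sup):
    """Brute-force stand-in for balanced FP-growth: count every sensorset
    occurring in the segment and keep the locally frequent ones."""
    counts = Counter()
    for sensors in segment:
        for r in range(1, len(sensors) + 1):
            counts.update(combinations(sorted(sensors), r))
    return {s: c for s, c in counts.items() if c >= local_min_sup}

def reduce_phase1(mapper_outputs, global_min_sup):
    """Sum local supports and keep the globally frequent sensorsets."""
    total = Counter()
    for local in mapper_outputs:
        total.update(local)
    return {s: c for s, c in total.items() if c >= global_min_sup}

# Epochs reconstructed from the running example, split as in Section 6.3
segment1 = [["s1", "s2", "s3", "s5"], ["s1", "s2", "s5", "s6"],
            ["s1", "s2", "s3", "s5"], ["s1", "s2", "s5", "s6"]]
segment2 = [["s3", "s4", "s5"], ["s2", "s3", "s4"],
            ["s4", "s5", "s6"], ["s2", "s3", "s4"]]
maps = [mine_candidates(seg, local_min_sup=1) for seg in (segment1, segment2)]
frequent = reduce_phase1(maps, global_min_sup=3)
# Reduce phase 2 would now filter `frequent` by the regularity criterion.
assert frequent[("s3", "s4")] == 3 and frequent[("s6",)] == 3
```

Note the role of local_min_sup: s6 is locally infrequent in segment 2 (support 1), so a local threshold above 1 would prune it there and lose a globally frequent pattern; a safe local threshold must be scaled down accordingly.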

[Fig. 6 sketch: Mapper: read epochs from HDFS, generate candidates, emit (key, value) pairs; HDFS: sort/shuffle; Reducer: receive (key, value) pairs from the Mapper, count supports, write frequent (key, value) pairs to HDFS, check regularity, and write the final RFSPs to HDFS.]
Fig. 6. The execution flow of RFSP-H.

Table 4
Candidate sensorset (c.sensorset) for segment 1 from Mapper 1.

c.sensorset Support c.sensorset Support

s1 4 s2 s3 2
s2 4 s2 s5 4
s3 2 s3 s5 2
s5 4 s2 s6 2
s6 2 s1 s2 s3 2
s1 s2 4 s1 s2 s6 2
s1 s5 4 s2 s3 s5 2
s1 s6 2 s1 s2 s3 s5 2

6.3. Step-by-step example

Consider min_sup = 3 and max_var = 1.1. Assume the SD in Table 1 is divided into two segments, each of which
contains four epochs: TSs 1 to 4 are in the first segment and TSs 5 to 8 are in the second segment. Assume the master node
has assigned segment 1 to Mapper 1 and segment 2 to Mapper 2. The Mappers map the < TS, sensorset > pairs, generate
< c.sensorset, sup > pairs as output, store them as intermediate values on the local disk of the respective Mapper and inform
the master node. Tables 4 and 5 show the results of the Map phase. These values are fed into the reduce phase after sorting and
merging operations, if necessary.
The master node then assigns reduce tasks to the Reducers. In reduce phase 1, Reducer 1 takes < c.sensorset,
sup > pairs as input, shares the support values of the candidate sensorsets with the other Reducers, finds the complete set of frequent
sensor patterns as < f.sensorset, sup > pairs and writes the result to the local disk of Reducer 1, as shown in
Table 6. In reduce phase 2, Reducer 2 takes the < f.sensorset, sup > pairs as input, applies the constraint indicated in
Definition 2 and generates < RFSP, sup > pairs as output, as indicated in Table 7.
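The reduce phase 1 merge can be checked directly against the singleton rows of Tables 4 and 5: summing the local supports and filtering by min_sup = 3 reproduces the single-sensor rows of Table 6. This small check (ours) uses plain Python dicts.

```python
# Singleton supports copied from Table 4 (Mapper 1) and Table 5 (Mapper 2)
mapper1 = {"s1": 4, "s2": 4, "s3": 2, "s5": 4, "s6": 2}
mapper2 = {"s2": 2, "s3": 3, "s4": 4, "s5": 2, "s6": 1}

merged = {}
for local in (mapper1, mapper2):
    for sensorset, sup in local.items():
        merged[sensorset] = merged.get(sensorset, 0) + sup

# global_min_sup = 3 keeps exactly the single-sensor rows of Table 6
frequent_singletons = {s: c for s, c in merged.items() if c >= 3}
assert frequent_singletons == {"s1": 4, "s2": 6, "s3": 5, "s4": 4, "s5": 6, "s6": 3}
```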

Table 5
Candidate sensorset (c.sensorset) for segment 2 from Mapper 2.

c.sensorset Support c.sensorset Support

s2 2 s2 s4 2
s3 3 s3 s4 3
s4 4 s3 s5 1
s5 2 s2 s3 s4 2
s6 1 s3 s4 s5 1
s2 s3 2 – –

Table 6
Frequent sensorset (f.sensorset) from Reducer 1.

f.sensorset Support f.sensorset Support

s1 4 s1 s2 4
s2 6 s1 s5 4
s3 5 s2 s3 4
s4 4 s2 s5 3
s5 6 s3 s4 3
s6 3 s3 s5 3

Table 7
Final output from Reducer 2.

f.sensorset Regularity RFSP f.sensorset Regularity RFSP

s1 1.44 × s1 s2 1.44 ×
s2 0.693  s1 s5 1.44 ×
s3 0.556  s2 s3 1.04 
s4 3.04 × s2 s5 1.44 ×
s5 0.085  s3 s5 0 
s6 1.5 × – – –

6.4. Complexity analysis of RFSP-H

The time complexity to read each epoch in the assigned segment on Hadoop is O(n), where n is the number of sensors
in an epoch. The complexity for prefix creation is O(mn), where m is the number of prefixes that occur together for n
sensors in an epoch using at most (n-1) keys. The complexity of merging the intermediate results without the scanning
time is O(nlogn) on merge sort and the complexity of message passing between the master node and the mappers/reducers
is almost constant. Therefore, the time complexity of each Mapper is O(n + mn). Moreover, due to the removal of
the infrequent sensors, fewer sensors need to be scanned in each segment.

7. Experimental results

In this section, we first present the simulation results of our proposed data extraction model and then evaluate
the performance of pattern mining using the RFSP-tree and RFSP-H techniques. First, we used synthetic datasets generated by our
simulator. To generate synthetic data for WSNs, an event generation program was written in Microsoft Visual C++ and run
under Windows 7 on a 2.66 GHz machine with 4 GB of main memory. Second, we used another dataset containing real WSN
data from the Intel Berkeley Research Lab [16], which has been widely used in the literature (e.g., [6], [9], [25]). From the available
Intel dataset, we utilized the sensor data collected over 5 days and 10 days at 30 s intervals, where the datasets consist
of tuples from 54 sensors reporting environmental readings every 30 s. Since the radio transmitter used for data
extraction was of quite poor quality, many readings were missed in each epoch. We treated the missed readings from
sensors as undetected events, which is beneficial for generating patterns despite lost readings. In the absence of any regularly
frequent sensor pattern mining technique in the literature for sensor datasets, we compared the performance of the RFSP-tree with
the RF-tree [23], which was proposed to mine regularly frequent patterns in transactional databases and shown to outperform
other related techniques.
To evaluate the performance of RFSP-H, we used Hadoop version 1.2.1, running on a cluster with nine nodes, one of
which was selected as the master and the other 8 as slaves (data nodes). The master node had 3.8 GHz Intel core i5
processor with 8 GB RAM and each slave machine had 2.63 GHz Intel core i5 processor with 4 GB RAM. We configured
HDFS on Ubuntu-14.04.1. We implemented RFSP-H on Hadoop using Java on MapReduce library functions. We followed the
load distribution indicated in [35].

Table 8
Simulation’s parameters [6,26].

Parameter Value

Number of sensor nodes 150, 300


Historical period 5, 10 days
Time slot 15, 60 sec
Minimum support 0 - 80 (%)
Flash memory size 16 MB
Read, write, and erase from flash 0.017 μJ/Byte
Transmission and receive energy 50 nJ/bit
Transmitter’s amplifier 100 pJ/bit/m2
Size of the grid 250m × 250m

Table 9
Dataset characteristics.

Dataset | No. of sensors | No. of epochs | Historical period (day) | Time slot (s) | Data type
S150H5dT15s | 150 | 28,800 | 5 | 15 | Simulator
S150H5dT60s | 150 | 7,200 | 5 | 60 | Simulator
S150H10dT15s | 150 | 57,600 | 10 | 15 | Simulator
S150H10dT60s | 150 | 14,400 | 10 | 60 | Simulator
S300H5dT15s | 300 | 28,800 | 5 | 15 | Simulator
S300H5dT60s | 300 | 7,200 | 5 | 60 | Simulator
S300H10dT15s | 300 | 57,600 | 10 | 15 | Simulator
S300H10dT60s | 300 | 14,400 | 10 | 60 | Simulator
S300H20dT10s | 300 | 172,000 | 20 | 10 | Simulator
Intel data (5-day) | 54 | 14,400 | 5 | 30 | Real
Intel data (10-day) | 54 | 28,800 | 10 | 30 | Real



Fig. 7. Data size v/s support values: (a) S300 data, (b) S150 data, and (c) Intel lab data.

7.1. Synthetic data generation and its characteristics

The simulator generates events with several input parameters, including the number of nodes, the historical period,
the slot size and the minimum support value for mining. In this simulator, we considered two scenarios consisting of 150 and 300
nodes, respectively, placed in a grid of 250 m × 250 m, where the sensors are evenly placed over the area. We used the
radio model introduced in [12] and the storage model for sensor nodes introduced in [18]. We assumed a Toshiba
16 MB NAND flash memory that costs 0.017 μJ to read, write, or erase a byte of data [18,33]. The simulation parameters
used in our experiments are summarized in Table 8. Event generation by each sensor node was assumed to be uniformly
distributed over the possible number of slots within the given historical period. In addition, we assumed the nodes were
uniformly distributed over the monitoring area and messages were delivered reliably by acknowledging the number
of set bits in each sensor's buffer. The dataset characteristics are shown in Table 9.
The performance of the proposed data model is evaluated on the basis of the amount of data accumulated in
the SDFS. Fig. 7 shows the data size accumulated at the SDFS for different support values. The amount of data at support
value 0 refers to the data size accumulated using the centralized mechanism. The results indicate that the data size is reduced
with increasing support values, and this reduction rate is very sharp when the support value exceeds 40%.

7.2. Performance of RFSP-tree

In the first experiment, we show the effectiveness of the RFSP-tree in mining regularly frequent sensor patterns in
terms of execution time. Experiments were conducted on the given datasets by varying the max_var
values while the min_sup value was fixed at 20%; the results are shown in Fig. 8. The x-axis in each graph shows the
change of the max_var value as a percentage of the database size and the y-axis indicates the overall execution time. The



Fig. 8. Execution time: RFSP-tree v/s RF-tree by varying max_var. (a) S300H10dT60s data, (b) S150H10dT60s data, (c) 10-day Intel data and (d) 5-day Intel
data.



Fig. 9. Memory size of RFSP-tree and RF-tree by varying max_var: (a) S300H10dT60s data, (b) S150H10dT60s data, (c) 10-day Intel data and (d) 5-day Intel
data.

Fig. 10. Scalability of the RFSP-tree v/s RF-tree (a) over execution, and (b) over memory.

RFSP-tree structure significantly outperforms the RF-tree structure in terms of overall execution time for all datasets. The
reason for this performance gain is that RF-tree construction requires two database scans, while RFSP-tree construction
requires only one database scan. In the second experiment, we show the compactness of the RFSP-tree. The memory usage
of our RFSP-tree and the RF-tree for different max_var values on the different datasets is shown in Fig. 9. The memory requirement of
the RFSP-tree is at least 40% and 45% less than that of the RF-tree for the synthetic and real datasets, respectively, at a 50% max_var value.
This shows the compactness attained by our proposed tree. We have conducted experiments with different values of min_sup,
which show the same level of performance superiority of the RFSP-tree over the RF-tree.
In the third experiment, we also studied the scalability of the RFSP-tree by varying the number of epochs in the dataset
and observing the impact on the overall execution time and required memory. To test the scalability of the RFSP-tree, we used
the S300H10dT60s dataset, which contains 300 distinct sensors and 15K epochs. This dataset was divided into five portions, each
of 3K epochs. The experimental results are presented in Fig. 10, where we kept min_sup and max_var fixed at 20% and
50%, respectively. The figure indicates that, as the size of the data increases, the execution time and memory usage increase for
both the RFSP-tree and the RF-tree. Although both tree structures show a linear increase of execution time and memory usage with the
increase of data size, the RFSP-tree requires significantly less execution time and lower memory usage with respect to the size
of the data.

7.3. Performance of RFSP-H

7.3.1. Runtime efficiency of RFSP-H


In this experiment, we demonstrate the overall execution time of RFSP-H, distributed over the Map and Reduce phases, for
varying max_var values. We conducted this experiment on the S150H10dT60s, S300H10dT60s, and Intel 10-day and 5-day datasets.
We fixed the number of processors (data nodes) at 6 for this experiment. We kept track of the running time of RFSP-H and
varied the value of max_var from 20% to 50%, while min_sup was fixed at 20%. As expected, the runtime increases as
the max_var value increases, because increasing max_var also increases the number of candidate patterns
for regularly frequent patterns. The results are shown in Fig. 11(a–d). From the results, we observe that for different max_var



Fig. 11. Overall execution time of RFSP-H by varying max_var: (a) S300H10dT60s data, (b) S150H10dT60s data, (c) 10-day Intel data and (d) 5-day Intel data.



Fig. 12. Execution time: by varying data nodes (a) S300H10dT60s data, (b) S150H10dT60s data, (c) 10-day Intel data and (d) 5-day Intel data.

Table 10
Execution time on large data set.

Nodes | Map runtime (s) | Reduce runtime (s) | Total runtime (s)
2 | 1180 | 420 | 1500
4 | 680 | 220 | 900
6 | 490 | 110 | 600
8 | 330 | 70 | 400

values, the Map phase takes longer to execute than the Reduce phase. The execution time of the Map phase is approximately
60% of the overall execution time.

7.3.2. Runtime of RFSP-H on varying the number of data nodes


Here, we demonstrate how RFSP-H’s overall runtime varies with the number of slave nodes. We fixed the min_sup and
max_var values at 20% and 50%, respectively for S150H10dT60s, S300H10dT60s, Intel 10 days and 5 Days datasets. We varied
the number of slaves nodes between 2 to 8, and recorded the runtime for each dataset. Fig. 12 shows that the runtime
reduces significantly with an increasing number of data nodes.
To evaluate the effectiveness of RFSP-H technique on large dataset, we have conducted a new experiment by using
S300H20dT10s synthetic data of size 280MB that contains 172,0 0 0 epochs. The dataset was executed on 2, 4, 6 and 8 nodes
keeping min_sup and max_var values fixed at 20% and 50% and the execution times needed in each case are recorded in
Table 10. The table demonstrates that, RFSP-H technique efficiently mines regularly frequent patterns from the large dataset
within reasonable time.

7.3.3. Runtime of RFSP-H on varying MapReduce parameters


We evaluated the performance of RFSP-H by varying the MapReduce parameters, namely the block size and the replication factor
(i.e., the number of copies of the data). First, we varied the block size and observed the execution time of RFSP-H. For this
experiment, we used the S150H10dT15s and S300H10dT15s datasets and varied the block size from 10 MB to 80 MB. The results
shown in Fig. 13(a) suggest that when the dataset is large, a change in block size significantly affects the execution time,
while it has minimal effect on a small dataset (e.g., S150H10dT15s).
Results obtained by varying the replication factor on the same datasets are shown in Fig. 13(b). They show that an increased replication
factor improves execution time, because a higher replication factor produces better data locality.
However, at high replication factors, the performance reaches a plateau.

7.3.4. Comparison RFSP-H v/s RFSP-tree


In the absence of any MapReduce-based regularly frequent pattern mining technique in the literature, we compared the
performance of the RFSP-H technique with the single-processor-based RFSP-tree. We compared the runtime of RFSP-H with that of the RFSP-
tree by varying the max_var values, with min_sup fixed at 20%. Although the RFSP-tree shows good performance, it
uses a single processor to process a large amount of data. In this experiment, the number of processors for RFSP-H
was fixed at 6 for each dataset. Fig. 14 indicates that RFSP-H significantly outperforms the RFSP-tree in terms of runtime. The
reason is that RFSP-H uses MapReduce-based parallel processing with little inter-processor communication, which enables

Fig. 13. Execution time of RFSP-H (a) effect of block size, and (b) effect of replication factor.



Fig. 14. Execution time: RFSP-H vs RFSP-tree (a) S300H10dT60s data, (b) S150H10dT60s data, (c) 10-day Intel data and (d) 5-days Intel data.

RFSP-H to efficiently mine the RFSPs from large data. Experiments with different min_sup values show a similar trend.
All this indicates that the use of the Hadoop platform for large scale sensor data mining reduces response time and thereby
enhances the dependability of such data mining applications.

8. Conclusions

In this paper, we have introduced a new type of sensor data mining called regularly frequent sensor pattern (RFSP) mining that captures the temporal regularity among sensors, together with a distributed mechanism to aid data extraction. To extract such patterns, we have devised a novel prefix-tree structure called the RFSP-tree that stores sensor data in a compact manner, and based upon this tree we have proposed a mining algorithm that effectively mines regularly frequent sensor patterns from sensor databases in only one scan. To realize data mining in large-scale sensor networks in the IoT environment, we provide a MapReduce-based regularly frequent sensor pattern mining algorithm for sensor datasets, called RFSP-H, which enables parallel execution of the algorithm and ensures the timeliness of mined results. The dependability of the proposed data mining framework is enhanced through a distributed data extraction and storage mechanism that improves availability, and through the parallel implementation on the Hadoop platform that improves the timeliness of mining responses. Comparative performance analyses show that our techniques are effective and efficient for mining regularly frequent sensor patterns from sensor data and outperform existing related algorithms in both runtime and memory usage. Future research will explore ways to use the extracted knowledge to improve the operational efficiency of WSNs and promote green communications.
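The two thresholds that define an RFSP can be made concrete with a small sketch. The exact regularity measure used here (variance of the inter-occurrence gaps compared against max_var) and all names are illustrative assumptions for exposition, not the paper's precise definitions.

```python
from statistics import pvariance

def is_rfsp(occurrence_epochs, total_epochs, min_sup, max_var):
    """Illustrative check: a pattern is 'regularly frequent' if it is
    frequent (support >= min_sup) and temporally regular (variance of
    inter-occurrence gaps <= max_var). The regularity measure is an
    assumption made for this sketch."""
    support = len(occurrence_epochs) / total_epochs
    if support < min_sup:
        return False
    # Gaps between consecutive epochs in which the pattern occurred.
    gaps = [b - a for a, b in zip(occurrence_epochs, occurrence_epochs[1:])]
    return pvariance(gaps) <= max_var if len(gaps) > 1 else True

# A pattern seen at evenly spaced epochs is regular; a bursty one is not.
print(is_rfsp([2, 4, 6, 8, 10], total_epochs=10, min_sup=0.2, max_var=1.0))  # True
print(is_rfsp([1, 2, 3, 9, 10], total_epochs=10, min_sup=0.2, max_var=1.0))  # False
```

Both example patterns have the same support (0.5), so the frequency threshold alone cannot separate them; it is the regularity constraint that rejects the bursty second pattern.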

