ENTERPRISE INFORMATION SYSTEMS, 2018
https://doi.org/10.1080/17517575.2018.1442934

ARTICLE

Big data for cyber physical systems in industry 4.0: a survey


Li Da Xu^a and Lian Duan^b

^a Department of Information Technology & Decision Sciences, Strome College of Business, Old Dominion University, Norfolk, VA, USA; ^b Department of Information Systems and Business Analytics, Frank G. Zarb School of Business, Hofstra University, Hempstead, NY, USA

ABSTRACT
With the technological developments in cyber physical systems and big data, there is huge potential to apply them to achieve personalization and improve resource efficiency in Industry 4.0. As Industry 4.0 is a relatively new concept, originating from an advanced manufacturing vision supported by the German government in 2011, there are only a few existing surveys on either cyber physical systems or big data in Industry 4.0, and even fewer surveys on the intersection between cyber physical systems and big data in Industry 4.0. However, cyber physical systems are closely related to big data in nature. For example, cyber physical systems continuously generate a large amount of data, which requires big data techniques to process and which can help to improve system scalability, security, and efficiency. Therefore, we conduct this survey to bring more attention to this critical intersection and to highlight future research directions for achieving full autonomy in Industry 4.0.

ARTICLE HISTORY
Received 17 July 2017
Accepted 3 January 2018

KEYWORDS
Industry 4.0; IoT; cloud computing; cyber-physical systems; big data; data science; industrial information integration engineering

1. Introduction
Industry 4.0 originated from a project for an advanced manufacturing vision supported by the German government in 2011 (Lasi et al. 2014; Xu, Xu, and Li 2018) and has become a widely used concept since then. It refers to the fourth industrial revolution (Lu 2017a, 2017b). After mechanization and steam power (Industry 1.0), mass production and assembly lines (Industry 2.0), and digitalization and automation (Industry 3.0), Industry 4.0 is designed for decentralized production through shared facilities in an integrated global industrial system for on-demand manufacturing, in order to achieve personalization and resource efficiency (Brettel et al. 2014). It has a profound impact on both producers and consumers. From the manufacturers' perspective, it requires only marginal human intervention, because computers can automatically reconfigure facilities to meet the production plan. In addition, it is no longer necessary for manufacturers to own their factories and facilities (Brettel et al. 2014). Specialized companies provide manufacturers with physical facilities for production, and manufacturers pay these facility providers based on their usage of the physical facilities. Such a metered service can achieve much higher resource efficiency and is highly flexible. For example, nowadays a typical manufacturer must own enough facilities to satisfy the demand of its busy season and leave those facilities underutilized in its off-season. In Industry 4.0, the manufacturer only needs to pay more to use more facilities in its busy season and can release unneeded facilities to the cloud for others to use in its off-season. In addition, the specialized companies that provide the physical facilities can hire a more specialized and cost-effective team to maintain them because of their economies of scale. From the consumers' perspective, Industry 4.0 allows consumers to get individualized products because
manufacturers can dynamically reconfigure manufacturing systems based on customer needs collected on an online platform (Ganschar et al. 2013). Industry 4.0 can particularly help small and medium enterprises with limited resources to dynamically follow market opportunities.
Despite the huge potential to achieve personalization and resource efficiency in Industry 4.0,
there are significant challenges to overcome. The most relevant term for Industry 4.0 is Cyber Physical
System (CPS) (Chen 2017a, 2017b). The term 'Cyber Physical System' was coined by Helen Gill at the US National Science Foundation (NSF) to encourage research on the interaction between physical systems and computing systems (Lee 2015). CPS refers to physical facilities with embedded sensors, processors, and actuators that can be controlled or monitored by computers. Such a system has feedback loops in which physical facilities affect computing processes and vice versa, improving the adaptability, resiliency, scalability, and security of the physical facilities (Lee 2008). CPS has wide applications in manufacturing systems, medical operation and monitoring systems, military systems, traffic control and safety, power generation and distribution, and so on (Lee 2015). Although CPS and Industry 4.0 are used interchangeably in many cases, they are, in theory, two distinct concepts that intersect. On one hand, CPS has applications in industry and is considered an important component of Industry 4.0, but it can also be applied to other areas, such as healthcare, public transportation, and the military. On the other hand, Industry 4.0 is not just the interaction between physical systems and computing systems. It serves the entire business cycle, from gathering natural resources, producing components, and assembling products to delivering products to customers and managing customer relationships (Xu 2007). As both Industry 4.0 and CPS are huge topics, this paper discusses only the intersection between Industry 4.0 and CPS in order to stay focused. In other words,
neither the application of CPS in other areas, such as healthcare, public transportation, and military,
nor the other types of interaction beyond CPS, such as human-computer interactions, computer
interactions, and human interactions, will be discussed in this paper.
In addition, with the recent development of more affordable sensors, better data acquisition systems, and faster communication networks in the CPS of Industry 4.0, there is a growing use of interconnected physical facilities that continuously generate a large amount of data to process, called Big Data (Lee et al. 2013). The manufacturing sector generates and stores more data than any other sector (Baily and Manyka 2013; Chen et al. 2016). For example, a single machine can generate thousands of records of production and health monitoring information within a second, which amounts to several trillion records in a year (Yin and Kaynak 2015). Such big data in the CPS of Industry 4.0 has huge potential to help reduce malfunction rates and improve production rates and quality for better supply chain management. In addition, although the intersection between big data and CPS has only recently caught attention, big data has been closely related to CPS since its beginning. With the two main components of collaborating IT devices and physical objects controlled by IT devices, the big data techniques implicitly used in CPS are related to the Internet of Things (referring to the network connectivity that enables data communication among physical devices with embedded cyber devices; Li, Xu, and Zhao 2018) for IT devices to communicate with each other, wireless sensor networks for data collection, and cloud computing for coordination among IT devices. Lee et al. (2015b) proposed a 5-layer CPS structure spanning the initial data acquisition, conversion, and analysis through to intervention and self-adaptation. In the manufacturing process, there is much existing work using big data from different devices and systems to improve quality and productivity (Slay and Miller 2007; Tang et al. 2010; Hunter et al. 2013; Kao et al. 2015). Therefore,
applying Big Data techniques becomes critically important for a more adaptive, intelligent, and
resilient CPS in Industry 4.0.
When searching Google Scholar with two keyword phrases, 'big data industry 4.0 survey' and 'cyber physical systems industry 4.0 survey', we selected only high-impact papers with more than 50 citations for study. Among them, Lee (2008), Khaitan and
McCalley (2015), Lee (2015), Lee et al. (2015a), Lee, Bagheri, and Kao (2015b) are related to cyber
physical systems in industry 4.0, and Lee et al. (2013), Lee, Kao, and Yang (2014), Lee et al. (2015a),
and Yin and Kaynak (2015) are related to big data in industry 4.0. Only Lee et al. (2015a) is related
to the intersection between cyber physical systems and big data in industry 4.0. On one hand,
cyber physical systems and big data are tightly connected with each other in industry 4.0. On the
other hand, there are few surveys related to the intersection between cyber physical systems and
big data in industry 4.0. To bring more attention to this critical intersection and highlight future research directions, we conduct this survey of this emerging topic.
The rest of the paper is organized as follows. Section 2 reviews big data characteristics. The big data issues of the CPS in Industry 4.0 are discussed in Section 3, and the currently active big data research on the CPS in Industry 4.0 is presented in Section 4. Finally, promising future directions for big data research on the CPS in Industry 4.0 are discussed in Section 5.

2. Big data characteristics


Big data is an umbrella term for any technique to process a large amount of data, including capture, transfer, storage, curation, search, analysis, visualization, security, and privacy. The size of big data is a constantly changing threshold, from terabytes in 2005 and petabytes in 2010 to exabytes or zettabytes in 2017, and is usually defined as the amount of data beyond what a commonly used computer can process within a tolerable amount of time.
The most widely used characterization of big data is the '3 Vs': volume, velocity, and variety (Laney 2001; De Mauro, Greco, and Grimaldi 2016). Volume refers to how much data are generated, velocity to how fast data are generated, and variety to how many different types of data are generated. The '3 Vs' therefore require big data techniques to handle a large amount of data, to process it quickly, and to be robust in dealing with heterogeneous data. Volume is considered an important characteristic because a more reliable estimation can be calculated with more data, according to the central limit theorem. Velocity is important because data are continuously generated from social interactions, sensor monitoring, and business activities: if the related techniques cannot process data faster than they are generated, much of the data will never be analyzed for insights. Variety is important because useful patterns are easier to capture when observed from different perspectives. For example, it might be hard to identify tigers in a jungle through pictures captured by normal camera lenses; combined with infrared photos, it is much easier to notice tigers because their body temperature is much higher than that of their background. Beyond the '3 Vs', others have extended the list with additional Vs, such as veracity (Schroeck et al. 2012) and value (Dijcks 2012). Veracity refers to how accurate data are. There are various ways to end up with inaccurate data, such as wrong manual input, machine failure, and problematic data ETL procedures. If raw data are incorrectly recorded, any decision based on them is problematic; for example, if customer gender is incorrectly recorded in a supermarket, the supermarket could send lipstick coupons to single males. Value refers to how much ultimate gain and social impact data provide, and the ultimate driving factor for the big data topic is its value. In fact, big data is not a new concept. Back in the 1980s, we already had many big data problems, such as stock exchanges' calls and puts, the human genome, and particle physics. The topic of big data has become increasingly popular recently, in Industry 4.0 and many other applications, because the existing techniques have matured enough to handle big data and gain value from it.

3. Big data issues of the CPS in industry 4.0


In general, there are two main functional components, system infrastructures and data analytics, to handle the big data issues of the CPS in Industry 4.0, as shown in Figure 1. System infrastructures oversee connectivity to ensure real-time communication between facilities and cyber devices, while data analytics focuses on improving product personalization and resource efficiency in Industry 4.0. In addition, several important issues of the CPS in Industry 4.0, such as adaptability, security, and resiliency, are related to both system infrastructures and data analytics.
Figure 1. The taxonomy of current techniques for the big data issues of the CPS in Industry 4.0. [The figure divides big data in the CPS of Industry 4.0 into system infrastructures (data capture, database, distributed computing) and data analytics (descriptive, predictive, and prescriptive analytics).]

3.1. System infrastructures


As system infrastructures oversee connectivity to ensure real-time communication between facilities and cyber devices, they are related to data capture, transfer, storage, and computing in a distributed environment.

3.1.1. Data capture


Data are the raw materials for data analytics. Searching for useful patterns to gain insights without
data is like manufacturing products from thin air. Therefore, capturing accurate and reliable data is
the first step of data analytics. With the recent development of automatic data collection techniques, more and more useful data can be captured in a cost-effective and reliable way. The captured
data include sensor data (Li et al. 2017), system logs (Qun et al. 2015), camera images (Weyer et al.
2015), radio-frequency identification (RFID) records (Albrecht et al. 2015), GPS data (Gorecky et al.
2014), Enterprise resource planning (ERP) data (Wollschlaeger, Sauter, and Jasperneite 2017), and
social media data (Jiang, Ding, and Leng 2016).
There are several important factors to consider for correct and meaningful data capture. First, it requires a seamless procedure from data collection and transfer to data storage on servers (Vijayaraghavan et al. 2008). Any manual intervention has two consequences: (1) the related data analytics will be delayed and cannot detect some important warning signals on time; consequently, we might miss the opportunity for preventive measures, which cost much less than repairs after machine failures; and (2) any manual intervention will significantly increase companies' labor costs, because digital devices are much faster and less error-prone than humans for routine information-processing tasks. Second, selecting the
proper data collection devices must be tightly connected with the goal of data analytics (Lee et al.
2015a) since there are many different types of sensors to measure different factors, such as light,
motion, temperature, sound, humidity, and oxygen level. For example, an oil refinery plant requires
the corresponding chemical sensors to measure whether different chemical components are mixed in the right proportions. However, the same set of chemical sensors is useless for a car assembly
factory, which needs a different set of sensors to monitor the status of operating facilities.

3.1.2. Data storage and retrieval


As big data is beyond what a commonly used computer can process within a tolerable amount of time, it is stored in different database systems with different formats for different purposes. The traditional relational database system proposed by Codd (1970) is the most widely used database system for accurate and consistent data storage. Cassandra, MongoDB, and data warehouses are three widely used non-traditional database systems for handling big data (Gölzer, Cato, and Amberg 2015). These big data database systems compromise on some level of data accuracy, consistency, and redundancy for faster processing of big data. Although relational database systems do not handle big data very well and many other database systems have been proposed to replace them, they are still widely used and especially irreplaceable for monetary transaction data because of their accuracy and consistency.
Cassandra is designed to store the same type of tabular data with rows and columns as traditional relational database systems (Hewitt 2010). Instead of saving all the data on one computer, Cassandra spreads data and workloads over a cluster. Such a distributed design makes it faster and more robust than traditional relational database systems. However, it cannot guarantee row-level consistency (Anderson et al. 2010). In other words, when the same row is updated at approximately the same time, the non-key attributes might not be updated consistently. For some applications, such as social interactions, system logs, and news feeds, a rare inconsistent update is not a critical issue. However, for applications that cannot tolerate even rare inconsistent updates, such as customer relationship management, stock trading, and customer purchases, Cassandra is not a good option.
MongoDB is designed to avoid the high cost of join operations in traditional relational database systems (Chodorow 2013). To make sure all data are correctly saved and updated in traditional relational database systems, normalization is a critical procedure that ensures each table relates to only one entity type. Although such normalization helps to improve data quality, it requires computationally expensive join operations over many tables when retrieving data. To reduce expensive join operations, MongoDB saves entity types that are tightly connected with each other in one JSON-like document. For example, product information and customer reviews are two different entity types. In traditional relational database systems, two separate tables would be created through normalization to store product information and customer reviews respectively. In MongoDB, however, a product's information and its customer reviews are saved in the same document (the equivalent of a table in traditional relational database systems), because product information and its customer reviews are always presented together to customers. Putting them together saves the cost of join operations when retrieving them.
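To make the embedding idea concrete, the following is a minimal, illustrative pymongo sketch; the database, collection, and field names are our own assumptions for illustration rather than anything prescribed by the cited work.

```python
# Minimal, illustrative sketch of MongoDB document embedding (pymongo).
# Database, collection, and field names are assumptions for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# The product and its reviews are embedded in one JSON-like document,
# so retrieving both requires no join across separate tables.
products.insert_one({
    "name": "wireless drill",
    "price": 129.99,
    "reviews": [
        {"user": "alice", "rating": 5, "text": "Works great."},
        {"user": "bob", "rating": 4, "text": "Battery runs down quickly."},
    ],
})

# A single query returns the product together with all of its reviews.
doc = products.find_one({"name": "wireless drill"})
print(doc["reviews"])
```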
A data warehouse is designed to save summarized information for further analysis (Kagermann et al. 2013; Xu et al. 2008a; Yasmina Santos, Martinho, and Costa 2017). In some applications, users are only interested in summarized information instead of individual detailed records. For example, inventory replenishment is based on total sales instead of each individual transaction. If a company has billions of transactions in a year, computers must scan billions of transactions to calculate the annual total sales, which is very slow. With a data warehouse, however, the daily total sales can be aggregated and calculated after each workday. Later, if the annual total sales are needed, the calculation only involves 365 records in the data warehouse instead of billions of transactions in a traditional relational database system, which is much faster.
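A minimal pandas sketch of this pre-aggregation idea follows; the transaction data and column names are made up for illustration.

```python
# Minimal sketch of the data warehouse idea: pre-aggregate raw transactions
# into one row per day so that later queries scan days, not transactions.
# The data and column names are illustrative.
import pandas as pd

transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-03-01 09:15", "2017-03-01 17:40",
                                 "2017-03-02 11:05"]),
    "amount": [19.99, 5.49, 102.00],
})

# End-of-day batch job: roll raw transactions up into daily totals.
daily_totals = transactions.groupby(transactions["timestamp"].dt.date)["amount"].sum()

# The annual total now only needs to sum at most 365 pre-aggregated rows.
print(daily_totals)
print("annual total:", daily_totals.sum())
```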

3.1.3. Distributed computing


As big data is beyond what a commonly used computer can process, it requires either a supercomputer or a cluster. As supercomputers are very expensive, clusters are more cost-effective and widely used in industry. Depending on the conditions under which big data is processed, different distributed computing systems have been designed (He and Xu 2014).
The most common scenario is that a large amount of data is saved in a cluster whose computers are in the same building and connected by a high-speed local area network (LAN). In this scenario, the network connection is fast, and the delay of data communication between computers is not a serious concern. Therefore, more attention is paid to a robust system for coordinating thousands of computers. The well-known system of this kind is Hadoop (Shvachko et al. 2010). It is designed under the assumption that hardware failures occur frequently in a large cluster: if a typical PC lasts three years before its first hardware failure, a cluster of 1000 computers should expect roughly one hardware failure every day. The Hadoop system automatically spreads data over the cluster, maintains data redundancy against potential hardware failures, and rebalances data when new computers are added to the cluster or old computers fail. With Hadoop, data analytical algorithms only need to focus on searching for insightful patterns, without worrying about data correction issues caused by system failures. One typical data programming model is MapReduce (Stonebraker et al. 2010). It parcels out key-value pairs to nodes in the cluster, and then reduces the set of values sharing the same key to a summarized value. A complicated data analytical method might involve a long sequence of MapReduce operations, and each round of MapReduce incurs many I/O operations because the reduced result is saved to hard drives. To avoid the high cost of hard-drive I/O, the Spark system is designed to process and keep data directly in memory instead of on hard drives (Karau et al. 2015). Generally speaking, Spark can be orders of magnitude faster than MapReduce because memory I/O is much faster than hard-drive I/O.
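As a rough, single-machine illustration of the MapReduce model (real Hadoop and Spark distribute the same map, shuffle, and reduce steps over a cluster), the sketch below summarizes made-up sensor readings per machine.

```python
# Toy single-process sketch of the MapReduce programming model; the
# machine ids and temperature readings are made up for illustration.
from collections import defaultdict

records = [("machine_a", 71.2), ("machine_b", 68.5),
           ("machine_a", 74.8), ("machine_b", 69.1)]

# Map: emit (key, value) pairs; here each record is already in that form.
mapped = [(machine, temperature) for machine, temperature in records]

# Shuffle: group all values that share the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: summarize each key's values (here, the maximum temperature).
reduced = {key: max(values) for key, values in groups.items()}
print(reduced)  # {'machine_a': 74.8, 'machine_b': 69.1}
```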
Besides the scenario in which a large amount of data is saved in one cluster, another common scenario is that the data are distributed over several computers (or clusters) that are physically far away from each other. Such a setting is very common for companies that do international business and have branches in different locations. In this setting, network bandwidth is the bottleneck, and the critical issue is how to reduce the amount of data communication between the computers or clusters (Jagannathan and Wright 2005). Take the simple calculation of average sales as an example. The simplest way is to transfer all the data to one computer (or cluster) to calculate the average sale, but this consumes a large amount of network bandwidth for transferring raw data. A better solution is to calculate the total local sales amount and the count of local sales on each local computer (or cluster), and transfer only these two numbers to a central server. The central server can then calculate the global average sales by dividing the sum of all the local sales amounts by the sum of all the local sales counts, as sketched below. This method significantly reduces the amount of data transferred to, and calculation performed by, the central server. Because there is no routine way to handle this setting for arbitrary methods, researchers must redesign their algorithms to fit the need. A similar setting in Industry 4.0 is outdoor environmental data collection by sensors (Yao, Cao, and Vasilakos 2015). When environmental data are collected by scattered sensors, each sensor will run out of battery power very soon if it sends its data to the central server by itself. A better solution is to have one powerful and expensive sensor collect data from nearby less powerful but cheaper sensors, summarize the collected data, and then send the summarized data to the central server.
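The following minimal sketch implements the sum-and-count trick just described; the site names and sales figures are illustrative.

```python
# Minimal sketch of the bandwidth-saving pattern described above: each
# site sends only (local sum, local count), and the central server
# combines them into the exact global average. Figures are made up.
local_sites = {
    "plant_us": [120.0, 80.0, 100.0],
    "plant_de": [200.0, 150.0],
}

# Each site transmits two numbers instead of all of its raw records.
summaries = [(sum(sales), len(sales)) for sales in local_sites.values()]

# Central server: global average = total sum / total count.
total_sum = sum(s for s, _ in summaries)
total_count = sum(c for _, c in summaries)
print(total_sum / total_count)  # 130.0
```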
Another important term related to distributed computing is cloud computing. It is the technology that enables unified access to shared pools of system resources over the Internet. It helps minimize up-front IT infrastructure costs and leaves the management and expansion of IT infrastructure to a third-party cloud computing service provider. The cloud model enables companies to offer automated services from a dynamic infrastructure to satisfy dynamic needs. For example, Tao et al. (2011) combined some existing manufacturing models with cloud computing technology for service-oriented cloud manufacturing. Xu (2012) studied the key technologies for managing distributed resources encapsulated into cloud services.

3.2. Data analytics


Besides system infrastructures, which are related to data capture, transfer, storage, and computing in a distributed environment, the other important component is data analytics, which concerns how to gain insights from the data prepared by system infrastructures. The many different data analytical methods can be categorized into three types: descriptive analytics, predictive analytics, and prescriptive analytics (Delen and Demirkan 2013; Daniel 2015). Descriptive analytics describes what happened in the past; predictive analytics predicts what will happen in the future, based on the assumption that what happened in the past will happen in the same or a similar way in the future; and prescriptive analytics determines how to be better prepared for the future based on our prediction of future needs.

3.2.1. Descriptive analytics


The most widely used descriptive analytical methods are descriptive statistical functions, such as mean, variance, and median. These statistical values support a basic understanding of the data, such as trends of different combinations, correlations among attributes, and outlier detection. Besides such descriptive statistical functions, there are more sophisticated types of methods, including correlation analysis, clustering, and generative models.
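As a small illustration of these basic descriptive statistics, consider the following sketch on a made-up stream of vibration readings.

```python
# Minimal sketch of basic descriptive statistics on made-up vibration data.
import numpy as np

vibration = np.array([0.31, 0.29, 0.33, 0.30, 0.95, 0.32])

print("mean:", vibration.mean())
print("variance:", vibration.var())
print("median:", np.median(vibration))

# Crude outlier check: flag readings far from the median (the 0.95 reading).
deviation = np.abs(vibration - np.median(vibration))
print("possible outliers:", vibration[deviation > 2 * vibration.std()])
```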

3.2.1.1. Correlation analysis. Correlation analysis searches for attributes that change at the same time. Research on correlation analysis has a very long history, as it is a sub-topic of statistics. Well-known methods include the Chi-square test (Elderton 1902) for categorical data and the Pearson correlation coefficient (Pearson 1895) for numeric data. Research on correlation analysis can be categorized into two directions: effectiveness and efficiency.
In the effectiveness direction, many researchers proposed different new methods, such as odds ratio
(Mosteller 1968), relative risk (Sistrom and Garvan 2004), likelihood ratio (Neyman and Pearson 1992),
lift (Brin et al. 1997), leverage (Piateski and Frawley 1991), BCPNN (Bate et al. 1998), two-way support
(Tew et al. 2014), added value (Kannan and Bhaskaran 2009), and putative causal dependency (Huynh
et al. 2007). Different methods highlight different patterns, as random noise has a different impact on different methods for different patterns. For example, leverage highlights correlated patterns that occur frequently in the dataset, while BCPNN highlights correlated patterns that
occur infrequently (Duan et al. 2014). Besides handling random noise, another direction is causal analysis (Krämer et al. 2013). The results of correlation analysis are useful for prediction: if events A and B are correlated, we can expect a higher chance of A happening when B occurs. However, such a correlation relationship is not very useful for intervention; in other words, making efforts to reduce the probability of B does not necessarily reduce the probability of A. It is therefore very important to detect confounding factors in order to distinguish genuinely causal relationships from mere correlation. A confounding factor is an event C that is associated with both events A and B (Pearl 2000). A seemingly positive correlation between A and B is spurious when the confounding factor C is taken into account. For example, in healthcare, the drug Naltrexone is positively correlated with the disease pancreatitis, because Naltrexone is used to treat alcoholism and alcoholism often leads to pancreatitis. The confounding factor alcoholism creates a spurious correlation between Naltrexone and pancreatitis. Popular methods for detecting confounding factors include the Cochran-Mantel-Haenszel method (Cochran 1954), logistic regression models (Li et al. 2014), and partial correlation (Baba, Shibata, and Sibuya 2004). In addition, timestamps of events are also useful for causal analysis, because causes always happen earlier than their effects (Kleinberg and Mishra 2009). Correlation and causal analysis are very useful for machine failure monitoring and maintenance in Industry 4.0. For example, machines can have many different failure types that require different interventions and maintenance, yet among the numerous signals generated by machines and the events associated with them, a given signal or event is associated with one failure type but not another. Correlation and causal analysis can help to associate signals and events with failure types, which is useful for machine failure prediction and maintenance planning. Zaki, Lesh, and Ogihara (2001) utilized correlation analysis to prune out unpredictable and redundant patterns to improve machine failure prediction performance. Sammouri (2014) utilized correlation analysis to connect severe railway operation failures with sensor data for vehicles, rails, high-voltage lines, track geometry, and other railway infrastructure, allowing constant, daily diagnosis of both vehicle components and railway infrastructure.
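As a small illustration of the partial correlation idea cited above, the synthetic sketch below shows how a strong raw correlation between A and B can vanish once a confounder C is controlled for; all data are made up.

```python
# Minimal sketch of partial correlation for confounder detection: correlate
# A and B after regressing out a suspected confounder C. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
c = rng.normal(size=500)            # confounder (e.g., alcoholism severity)
a = 0.8 * c + rng.normal(size=500)  # event indicator driven by c
b = 0.8 * c + rng.normal(size=500)  # another event indicator driven by c

def residuals(y, x):
    """Residuals of y after removing the least-squares linear effect of x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

print(np.corrcoef(a, b)[0, 1])                              # clearly positive
print(np.corrcoef(residuals(a, c), residuals(b, c))[0, 1])  # near zero
```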
In the efficiency direction, the first well-known algorithm is Apriori (Agrawal and Srikant 1994). It was proposed to search for sets of items that co-occur frequently by utilizing a downward-closed property. As event occurrence probabilities follow a power-law distribution in many cases, the Apriori algorithm can help to prune the exponential search space. Other classical methods of this type include FP-Tree (Han, Pei, and Yin 2000), which uses an extended prefix-tree structure to store compressed information, and ECLAT (Zaki 2000), which stores transaction information in a vertical data layout for fast support counting. Beyond the classical methods, more recent research includes Niche-Aided Gene Expression Programming, a specialized algorithm for gene data (Chen, Li, and Fan 2015); a multi-objective particle swarm optimization for numerical attributes without a priori discretization (Beiranvand, Mobasher-Kashani, and Bakar 2014); and a more efficient candidate generation algorithm for Apriori (Jiao 2013). However, the main problem with this type of algorithm is that it uses the co-occurrence frequency as a sub-optimal measure of correlation. Take supermarket purchases as an example. Both bananas and milk are frequently purchased by customers, while spaghetti and tomato sauce are purchased less frequently. If the co-occurrence frequency is used, it indicates that the correlation between bananas and milk is stronger than that between spaghetti and tomato sauce, which contradicts our intuition. Therefore, it is more appropriate to use a correlation function that compares the actual co-occurrence against the co-occurrence expected under the assumption of independence, as illustrated in the sketch after this paragraph. However, no existing correlation function has the downward-closed property needed to prune the search space. For this issue, Duan and Street (2009) proposed a fully-correlated itemset framework to decouple correlation functions from the downward-closed property. Although the fully-correlated itemset framework helps to prune itemsets of size 3 and above, it cannot speed up the search for pairs. To speed up the search for pairs, Xiong et al. (2006) utilized the monotonic property of an upper bound of the Pearson correlation, and Duan, Street, and Liu (2013) extended this work to correlation functions that satisfy three correlation properties. Another extension in the efficiency direction is the search for confounding factors. As a confounding factor involves three items, the search space is much larger than for pairs. Zhou and Xiong (2009) proposed a method to speed up this search, but it can only be applied to a special type of confounding factor, called reversing confounding factors.
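The sketch below makes the supermarket point concrete with made-up frequencies: raw co-occurrence favors the popular pair, while lift, which compares actual against expected co-occurrence under independence, favors the genuinely correlated pair.

```python
# Minimal sketch contrasting raw co-occurrence with lift; the item
# frequencies below are made up for illustration.
p_banana, p_milk = 0.60, 0.70   # individually very popular items
p_banana_milk = 0.45            # co-occur often, mostly due to popularity
p_spaghetti, p_sauce = 0.10, 0.08
p_spaghetti_sauce = 0.05        # rarer in absolute terms

def lift(p_xy, p_x, p_y):
    # Actual co-occurrence divided by the co-occurrence expected
    # if the two items were independent.
    return p_xy / (p_x * p_y)

print(lift(p_banana_milk, p_banana, p_milk))          # ~1.07: weak
print(lift(p_spaghetti_sauce, p_spaghetti, p_sauce))  # 6.25: strong
```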

3.2.1.2. Clustering. When data are presented in a structured table with rows and columns, correlation analysis searches for sets of attributes that change together, while clustering searches for groups of similar records. The detected clusters can be a group of customers with similar preferences or a group of factories geographically close to each other. Usually, on one hand, the entire population is heterogeneous, and it is hard to apply one effective intervention to the whole population; on the other hand, it is not cost-effective to design an intervention for each individual object. After dividing the entire heterogeneous population into several homogeneous clusters, however, it is easier to apply a different effective intervention to each group, which strikes a better balance between cost and effectiveness. There is much potential for applying clustering in Industry 4.0. For example, Younis and Fahmy (2004) utilized a clustering algorithm to group sensor nodes in an ad hoc sensor network to reduce message overhead. Lapira (2012) used clustering to group similar machines for fault detection.
Many different clustering algorithms have been proposed. Based on how objects are clustered, these methods can be categorized into five types: partitioning-based, density-based, grid-based, subspace-based, and model-based. Partitioning-based methods usually start with a random partition and then iteratively refine it by moving objects between clusters. K-means (MacQueen 1967) was the first method of this type; other common methods include K-modes (Huang and Ng 1999), CLARANS (Ng and Han 2002), and K-medians (Bradley, Mangasarian, and Street 1997). When moving objects between clusters, they move each object to the cluster with the closest center. Therefore, the detected clusters are spherical, but many applications have arbitrarily shaped clusters; for example, residential areas, lakes, and tumors are all arbitrarily shaped. To solve this problem, density-based methods iteratively join each object with its small spherical neighborhood. Because the objects of a cluster are scattered throughout its region of the feature space, their small spherical neighborhoods connect to form an arbitrarily shaped region. DBSCAN (Ester et al. 1996) was the first method of this type; other common methods include ST-DBSCAN (Birant and Kut 2007), LDBSCAN (Duan et al. 2007), and OPTICS (Ankerst et al. 1999). Both partitioning-based and density-based methods process individual records and can be very computationally expensive on large amounts of data. To solve this issue, grid-based methods divide the entire feature space into grid cells and map each object to its corresponding cell. When clustering, they cluster cells instead of the objects nested inside them. As the number of cells is usually much smaller than the number of objects, they can run very fast. Typical methods of this type include STING (Wang, Yang, and Muntz 1997), OptiGrid (Hinneburg and Keim 1999), and DGB (Wu and Wilamowski 2016). All the above methods use the entire feature space for clustering. However, different results are possible with different feature subspaces. Take cats, dogs, tigers, and wolves as an example. If we use biological features, cats and tigers are clustered as felines, while dogs and wolves are clustered as canines. If we use features describing how they get along with humans, cats and dogs are clustered as pets, while tigers and wolves are clustered as wild animals. Therefore, subspace-based methods search for different clusters under different combinations of feature subspaces. Typical methods of this type include GPCA (Vidal, Ma, and Sastry 2005), SSC (Elhamifar and Vidal 2013), and RSC (Soltanolkotabi, Elhamifar, and Candes 2014). Model-based methods find the best parameter fit for a predefined model. There are two general models: statistical models (Xu and Tian 2015) and neural network models (Du 2010). Typical statistical models include COBWEB (Fisher 1987) and GMM (Rasmussen 2000), and typical neural network models include SOM (Kohonen 1990) and ART (Carpenter and Grossberg 1990). SOM is based on a reduced mapping dimension through the neural network, while ART is an incremental algorithm that generates a new neuron to match a new cluster when the current neurons are not enough to represent the underlying patterns. Through model-based methods, the original high-dimensional space can be reduced to a low-dimensional space with a clear clustering structure to improve performance.
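As a minimal illustration of partitioning-based clustering in an Industry 4.0 flavor (grouping similar machines, in the spirit of Lapira 2012), the scikit-learn sketch below uses made-up sensor features.

```python
# Minimal scikit-learn sketch of k-means on made-up machine sensor features.
import numpy as np
from sklearn.cluster import KMeans

# Rows: machines; columns: mean temperature (C), mean vibration (mm/s).
features = np.array([
    [60.1, 0.30], [61.4, 0.28], [59.8, 0.33],  # normal-looking machines
    [82.5, 0.91], [84.0, 0.88],                # hotter, shakier machines
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(kmeans.labels_)           # cluster membership for each machine
print(kmeans.cluster_centers_)  # the (spherical) cluster centers
```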
The above taxonomy is based on how clusters are formed. If clustering methods are instead classified by how object membership is assigned, they can be classified into overlapping and non-overlapping types. In non-overlapping clustering, each object belongs to exactly one cluster and cannot be assigned to another. In overlapping clustering, each object can belong to several clusters. Even for the same dataset, both overlapping and non-overlapping clustering are useful for different purposes. Take the employees of a company as an example. Non-overlapping clustering is useful when assigning employees to different full-time tasks, while overlapping clustering is useful for information diffusion, because employees with multiple memberships are important bridge objects across clusters. Generally speaking, fuzzy logic (Lee and Cheng 2012; Melin and Castillo 2014) is the common way to convert a non-overlapping clustering algorithm into an overlapping one. Another common way is to apply a probabilistic model for multiple memberships (Yu, Zhang, and Wang 2016; Khanmohammadi, Adibeig, and Shanehbandy 2017). In addition, a hierarchical structure can be considered a special case of overlapping clustering. Such structures are very common in our daily life; for example, a company can have several branches, and each branch has its own departments. Typically, hierarchical methods iteratively group the previously generated clusters to form a hierarchical structure (Murtagh and Legendre 2014; Müllner 2013).

3.2.1.3. Generative model. Any real-life data are generated according to some rules in nature, although we do not know the exact rules and use randomness to explain the part we cannot explain. The idea of a generative model is to define a set of rules to generate our data. If the set of rules we define is close to the ground-truth rules in nature, the data generated by our rules will be similar to the real data we observed. In manufacturing, Jałowiecki, Kłusek, and Skarka (2017) proposed a generative modelling method for the dynamic development of computer-aided design. Alleman et al. (2010) used matrix factorization to monitor air pollution in an industrial zone. By defining a likelihood function that describes how likely the generated data are to match the real data, we can search for a better generative model that approximates the ground-truth rules in nature. Typical generative models include Naïve Bayes (Simha et al. 2015), Latent Dirichlet Allocation (Blei, Ng, and Jordan 2003), Hidden Markov Models (Rabiner 1989), and matrix factorization (Lee and Seung 1999). As generative models approximate the ground-truth rules in nature, they can reveal the hidden parameters behind any real data record, and those hidden parameters can be used for either clustering or prediction later on.
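A minimal sketch of the generative idea, fitting a two-component Gaussian mixture to synthetic data and then sampling new data from the learned rules, is shown below.

```python
# Minimal sketch of a generative model: fit a Gaussian mixture to synthetic
# data, inspect the recovered hidden parameters, and generate similar data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real_data = np.concatenate([rng.normal(0.0, 1.0, size=(300, 1)),
                            rng.normal(5.0, 0.5, size=(300, 1))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(real_data)
print(gmm.means_.ravel())       # recovered hidden means, roughly 0 and 5
print(gmm.score(real_data))     # average log-likelihood guiding model search
generated, _ = gmm.sample(100)  # new data generated by the learned rules
```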

3.2.2. Predictive analytics


Different from descriptive analytics, which focuses on describing what happened in the past, predictive analytics utilizes past patterns to predict what will happen in the future, based on the assumption that what happened in the past will happen in the same or a similar way in the future. It typically involves a dataset with many normal attributes and one target attribute to predict. The actual value of the target attribute is available only for past data, not for current data. As the attributes are interrelated, different predictive methods search for the relationship between the normal attributes and the target attribute, and then use this relationship to predict the target attribute from the values of the normal attributes. Kuo et al. (2017) applied neural networks to predict machine status with inexpensive add-on triaxial sensors for small factories that cannot afford sensor-embedded machines. Bagheri, Ahmadi, and Labbafi (2011) used acoustic signals from gearboxes to predict worn tooth face gears and broken tooth gears. The many different predictive methods can be categorized into five types: regression, decision trees, Bayesian statistics, neural networks, and support vector machines.
Regression models are widely used and have a very long history in statistics. They can not only be used as predictive models; the coefficients in their models are also used for descriptive purposes. The first method of this type is the linear regression model (Neter et al. 1996). However, linear regression can only be used to predict numeric target attributes, and sometimes the target attribute is nominal. For example, we may be interested in whether a given tumor is benign or malignant, whether a customer is interested in an on-sale product, or whether a student is engaged or losing interest. To solve this issue, logistic regression (Hosmer, Lemeshow, and Sturdivant 2013) was proposed. The above regression models assume a linear relationship among variables, and sometimes this assumption is not valid. To solve this issue, local regression models have been proposed to fit non-linear models, such as LOESS (Cleveland and Devlin 1988) and LOWESS (Howarth and McArthur 1997).
Decision tree methods construct a tree structure by allocating records to branches, utilizing different functions to make the records in each branch as pure as possible. Three popular functions are information gain (Quinlan 1986), gain ratio (Karegowda, Manjunath, and Jayaram 2010), and the Gini index (Raileanu and Stoffel 2004). As the tree grows larger, the number of records in each branch becomes smaller, and the estimated probability in each branch becomes less reliable. Therefore, one important technique for improving decision tree prediction performance is pruning. In addition, random forest is another popular method of this type. Because the splitting attribute is selected by considering only the performance at the current level, only locally optimal attributes are chosen to construct a single tree, and a locally optimal attribute is not necessarily globally optimal. To avoid being trapped in a local optimum, random forest randomly selects a subset of attributes to construct a tree in each round and uses the ensemble result for the final prediction, as sketched below.
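A minimal scikit-learn sketch of this ensemble-of-randomized-trees idea, on made-up sensor readings labeled by whether the machine later failed, is given below.

```python
# Minimal scikit-learn sketch of random forest on made-up sensor data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Features: temperature, vibration; label: 1 = failed soon after, 0 = ran fine.
X = np.array([[60, 0.3], [62, 0.2], [61, 0.4],
              [85, 0.9], [88, 1.1], [84, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Each tree considers a random subset of attributes at its splits, and the
# ensemble vote reduces the risk of one locally optimal tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[83, 0.95]]))  # hot, shaky machine: failure class
```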
Bayesian statistics is based on Bayes' theorem. Naïve Bayes is the most widely used method of this type (Domingos and Pazzani 1997). It naively assumes attributes are independent of each other in order to speed up calculation. Naïve Bayes can achieve high prediction performance if this assumption holds. In many cases, however, the independence assumption is incorrect. For example, age is related to education level: it is much harder for a 20-year-old to hold a Ph.D. degree than a 30-year-old, and salary is positively correlated with years of working experience. To address this dependency issue, Bayesian networks were proposed to improve prediction performance (Friedman, Geiger, and Goldszmidt 1997).
A neural network is a network of input nodes, hidden nodes, and output nodes connected by weighted edges (Demuth et al. 2014). It mimics how humans learn from their failures. For example, when a person learns to toss a ball into a basket, he tries an initial angle and strength on the first throw; based on how far the ball lands from the basket, he adjusts the angle and strength in the next round, and this adjustment continues until the muscles adapt to the right angle and strength. In a neural network, the edge weights are initially assigned randomly. Based on how far the predicted value is from the actual value, the network adjusts its weights toward the goal. After a reasonable number of training rounds, the weights in the neural network become stable and accurate.
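The scikit-learn sketch below illustrates this weight-adjustment loop on the classic XOR toy problem; the network size and data are illustrative.

```python
# Minimal scikit-learn sketch of a neural network: weights start random and
# are iteratively adjusted from the prediction error. XOR data for illustration.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=2000, random_state=0).fit(X, y)
print(net.predict(X))  # ideally recovers [0, 1, 1, 0] once weights stabilize
```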
A Support Vector Machine (SVM) searches for a linear hyperplane that separates two classes (Suykens and Vandewalle 1999). It was originally invented by Vapnik in 1963 as a linear classifier. Because it uses only a linear hyperplane for separation, it is less likely to overfit. However, some data are not linearly separable in their original feature space. To solve this issue, a non-linear kernel function (Boser, Guyon, and Vapnik 1992) can be used to map the original feature space into a higher-dimensional space in which the data are better separated linearly. Another solution is to use the hinge loss function for a soft margin that tolerates some mistakes (Chen et al. 2004). As the original SVM can only make predictions for nominal target attributes, Drucker et al. (1997) proposed a combination with regression to make predictions for numeric data.
Besides research on specific prediction models, feature selection (Chandrashekar and Sahin 2014; Khalid, Khalil, and Nasreen 2014) and ensembles (Kuncheva and Rodríguez 2014) are two general techniques that can be applied to any prediction model to improve prediction performance. Feature selection is a preprocessing step that lets predictive models work on better data by removing irrelevant and redundant features. Typical methods can be classified into feature selection algorithms, which select a subset of the original feature space, and feature extraction algorithms, which map the original feature space into a new one. Popular feature selection algorithms are correlation-based feature selection (Hall 2000), plus-l minus-r selection (Yu and Liu 2003), and minimal redundancy maximal relevance (Peng, Long, and Ding 2005); popular feature extraction algorithms are PCA, SVD, and PKPCA (Zhang et al. 2009). Besides feature selection, ensembles are another common way to improve prediction performance. The basic idea is to build multiple prediction models and combine their predictions. Three typical methods are bagging (Breiman 1996), boosting (Schapire 1990), and stacked generalization (Wolpert 1992). Bagging is the simplest, with each model voting with equal weight, as sketched below. Boosting trains each new model to emphasize the data misclassified by previous models. Stacked generalization trains a final model based on the predictions of the other models.
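As a minimal illustration of bagging (scikit-learn's BaggingClassifier defaults to decision trees as the base models), consider the sketch below on synthetic data.

```python
# Minimal scikit-learn sketch of bagging: many base models trained on
# bootstrap samples vote with equal weight. Data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

bag = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)
print(bag.predict(X[:5]))  # equal-weight vote over 25 bootstrapped models
```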

3.2.3. Prescriptive analytics


Predictive analytics can predict what will happen in the future based on the assumption that what happened in the past will happen in the same or a similar way in the future. For example, it can be used to forecast the demand for an existing product. With the predicted demand, companies can draw up different plans to produce the predicted number of products by considering when the products are needed, the currently available production capacity, raw material requirements, labor costs, and other factors. Different plans incur different costs for different factors; for example, manufacturing products at full load can reduce the labor cost but increase the inventory storage cost. Prescriptive analytics is very important because it can search for the optimal plan with the lowest overall cost. Existing research on prescriptive analytics in Industry 4.0 includes different self-optimizing autonomic strategies to accomplish a given goal under dynamically changing environmental conditions and demands (Maggio et al. 2012), and goal-oriented self-organization algorithms to optimize design cost functions in a distributed fashion and induce an overall degree of autonomy in the CPS (Bogdan 2015). There are two types of prescriptive analytics methods: mathematical programming and heuristic search. While mathematical programming is designed to find the globally optimal solution, heuristic search is designed to find locally optimal solutions.
Linear programming is the origin of mathematical programming (Stone and Tovey 1991). It searches for the best solution of a linear objective function subject to linear equality and inequality constraints. It has been well studied and is solvable in polynomial time; a small example is sketched below. Different from linear programming, integer programming deals with variables that must be integers (Wolsey and Nemhauser 1999). Because of this special constraint, integer programming is not always solvable in polynomial time. One straightforward way to solve an integer program is to drop the integer constraints and solve it as a linear program. If the relevant variables happen to be integers in the optimal solution of the linear program, the integer program has been solved directly. Otherwise, the branch-and-bound procedure (Lawler and Wood 1966) is a popular way to solve it. In addition, there are many other types of programming problems, such as stochastic programming with random variables (Birge and Louveaux 2011), dynamic programming for problems that can be divided into smaller subproblems (Bellman 2013), and nonlinear programming with non-linear functions in either the objective or the constraints (Bazaraa, Sherali, and Shetty 2013).
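A minimal production-planning linear program, solvable with scipy (all coefficients below are made up), looks as follows.

```python
# Minimal sketch of a linear program: choose quantities x1, x2 of two
# products to maximize profit under capacity limits. Numbers are made up.
from scipy.optimize import linprog

# Maximize 40*x1 + 30*x2; linprog minimizes, so negate the objective.
c = [-40, -30]

# Machine hours: 2*x1 + 1*x2 <= 100; labor hours: 1*x1 + 2*x2 <= 80.
A_ub = [[2, 1], [1, 2]]
b_ub = [100, 80]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x)     # optimal quantities, here (40, 20)
print(-result.fun)  # maximized profit, here 2200
```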
However, some problems are so complicated in their natural form that they cannot be formulated as the right type of mathematical program to find the globally optimal solution. Heuristic methods are proposed to find locally optimal solutions that solve the problem in a sub-optimal way. Typical heuristic methods include genetic algorithms (Deb et al. 2002), simulated annealing (Kirkpatrick, Gelatt, and Vecchi 1983), hill climbing (Tsamardinos, Brown, and Aliferis 2006), tabu search (Glover and Laguna 2013), and ant colony optimization (Dorigo, Birattari, and Stutzle 2006).

4. Current research
For CPS in Industry 4.0, there are two important aspects, robustness and intelligence, that make it work properly, as shown in Figure 2. The robustness aspect is measured by whether the CPS is available and working, while the intelligence aspect is measured by whether the CPS is efficient and cost-effective. Accordingly, the many existing research efforts on big data for CPS in Industry 4.0 can be categorized into efforts on system infrastructures, to ensure security, resiliency, and reliability, and efforts on data analytics, to improve self-awareness and self-maintenance capabilities.

4.1. Research on system infrastructures to ensure security, resiliency, and reliability


Besides the common research questions on system infrastructures for collecting and storing big data, there are some unique requirements for CPS related to security, resiliency, and reliability (Khaitan and McCalley 2015).

Figure 2. The taxonomy of current research for CPS in Industry 4.0. [The figure divides the research into robustness, addressed by system infrastructures (security, resiliency, and reliability), and intelligence, addressed by data analytics (self-awareness and self-maintenance).]
Because of the dynamic connections within a CPS, a failure in one component might lead to a cascading failure in the entire system (Anand et al. 2006), which challenges its design for security, resiliency, and reliability.
First, each component, in both the physical facilities and the cyber devices, needs an appropriate level of security protection; otherwise, the entire system might malfunction. For example, one attack on an Australian sewage treatment system caused its raw sewage to be released into local rivers (Slay and Miller 2007). To improve security levels, there have been the following research efforts. To meet the need for flexible configuration, equipment can be freely added to and removed from a CPS. To secure the communication among devices, traditional public key cryptography is widely used, but it suffers from costly and complex key management issues. To solve this issue, Xu et al. (2008b) proposed a certificateless public key cryptography method that is effective and efficient in defending against two common attacks, the black hole attack and the rushing attack. Increasing the level of security inevitably reduces CPS accessibility, so each security measure needs to balance security against other domain requirements. Zhu and Basar (2011) proposed a unifying framework for a cross-layer security architecture to implement security mitigation strategies. Sun et al. (2009) proposed a formal framework to detect conflicts between security and other domain requirements at the design stage. Security in CPS concerns both the cyber aspect and the physical aspect. Burmester, Magkos, and Chrissikopoulos (2012) presented a unified system that takes both the cyber and physical aspects into consideration to address combined vector attacks and synchronization issues.
Second, although a centralized controlling system is easier to implement, it makes it hard for a CPS to satisfy its 24 × 7 availability requirement when it is attacked or when systems are upgraded. A scalable and distributed coordination layer is very important to ensure its resiliency. In Industry 4.0, the CPS potentially involves trillions of devices, and the communication among devices must be highly scalable. Stojmenovic (2014) studied strategies for localized coordination and communication to meet the need of communicating among trillions of devices. The ADREAM project (Arlat, Diaz, and Kaâniche 2014) tackled not only scientific challenges but also social, legal, and ethical concerns in a highly dynamic service-oriented CPS. The CPS of Industry 4.0 is supported by numerous feedback control loops for dynamic configuration. As these control systems become more and more decentralized to meet the needs of a large, integrated system, they become more vulnerable to attacks, and an attack on these control loops can have disastrous consequences. Fawzi, Tabuada, and Diggavi (2014) studied a secure local control loop design to help reconstruct the state of a system. Shoukry et al. (2017) used a satisfiability modulo theory approach to estimate the state of a dynamical CPS under attack. Lucia, Sinopoli, and Franze (2016) proposed a set-theoretic control framework against false data injection attacks on the communication channels.
Third, another important dimension for CPS is reliability: maintaining the same quality of service (QoS) and generating trustworthy data. Tang et al. (2010) proposed a method to find meaningful information in a large amount of noisy data by conducting trustworthiness linkage inference on a constructed object-alarm graph. Balasubramanian et al. (2010) presented the NetQoPE middleware to free application developers from complicated admission control and network resource allocation. Wang et al. (2016a) modeled mobile data traffic offloading with WiFi and VANET to minimize mobile data traffic and obtain a global QoS guarantee in vehicular cyber-physical systems.

4.2. Research on data analytics to improve self-awareness and self-maintenance capabilities


Research efforts on system infrastructures ensure that the CPS is available to end users. The other important aspect, however, is to ensure that the CPS is efficient and cost-effective. Here, data analytics can help to improve the CPS's self-awareness and self-maintenance capabilities.
The most straightforward application of data analytics is machine health maintenance. Most existing
manufacturing strategies assume continuous device readiness and optimal performance, which is not
true in a real-life manufacturing environment. Any physical machine wears out during its operation
and needs timely maintenance to keep it running properly. However, due to different workloads and
working environments, machines of the same type might stop working at different times. Although an
intrusive examination of a machine gives a more accurate picture of its health state, it requires
stopping the machine and spending excessive human labor on the examination. Data analytics methods
can be more cost-effective for this problem. When a machine starts to malfunction, abnormal events are
associated with it, such as higher temperatures, larger vibrations, and different power consumption.
As this indirect but related information is easy to collect with different sensors, predictive
analytical methods can utilize it to predict machine failures. Such predictions can significantly
reduce the number of unnecessary intrusive examinations. Because predictive analytical methods only
have indirect information, and an abnormal value generated by a sensor might be caused by a sensor
failure instead of the real machine state, prediction can never be perfect, and increasing its
accuracy is critical to reduce the number of unnecessary interventions. Lee, Kao, and Yang
(2014) studied the machinery maintenance systems of Komatsu, a manufacturing company for construction,
mining, and military equipment. Komatsu has a data collection system that collects the pressure, fuel
flow rate, temperature, and rotational speed of its diesel engines. As there is no guarantee that the
deployed sensors are problem-free, the machinery maintenance system uses the Huber method, a
descriptive method, for outlier removal. In the meantime, an autoregressive moving average approach is
used to fill missing values caused by either sensor or data transfer failures. Then a Bayesian belief
network, a predictive method, is used to predict whether a machine will fail soon and to identify the
root cause of the problem at the early stage of degradation. Fang et al. (2016) utilized data from a
supervisory control and data acquisition system to derive influential factors with gray relation
analysis. With the transformed features, a support vector machine regression model is used to predict
the wind turbine status. Bagheri, Ahmadi, and Labbafi (2011) used acoustic signals from a gearbox to
predict worn tooth face gears and broken tooth gears. A discrete wavelet transform is used to transform
the features, and a neural network is used to predict the gear fault.
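
As a rough, synthetic illustration of such a maintenance pipeline, the sketch below chains a robust outlier filter (a rolling-median/MAD rule standing in for the Huber method), interpolation of the removed points (standing in for ARMA imputation), and an off-the-shelf classifier (a random forest standing in for the Bayesian belief network). All telemetry values, thresholds, and failure labels are fabricated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def clean_signal(series, window=25, k=3.0):
    """Mask points far from a rolling median (in MAD units) as sensor
    glitches, then interpolate the gaps."""
    med = series.rolling(window, center=True, min_periods=1).median()
    mad = (series - med).abs().rolling(window, center=True, min_periods=1).median()
    return (series.mask((series - med).abs() > k * (mad + 1e-9))
                  .interpolate(limit_direction="both"))

rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({                                 # hypothetical engine telemetry
    "temperature": rng.normal(80, 5, n),
    "vibration": rng.normal(1.0, 0.2, n),
    "power": rng.normal(50, 4, n),
})
X.loc[rng.choice(n, 20), "temperature"] = 500.0    # injected sensor glitches
X = X.apply(clean_signal)

# Toy failure label: hot and strongly vibrating machines are 'failing'.
y = (X["temperature"] > 85) & (X["vibration"] > 1.1)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("failure risk of latest readings:", clf.predict_proba(X.tail(5))[:, 1])
```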
Another straightforward application of data analytics is self-optimization strategies in a
distributed environment. As a CPS needs to satisfy its 24 × 7 availability requirement even when it is
attacked or being upgraded, a highly scalable distributed system is desired for CPS in Industry 4.0.
Different from traditional optimization problems, where all the information resides in a central
server, there is an increasing need to make decisions based on local information while still pursuing
globally optimal strategies. Maggio et al. (2012) studied different self-optimizing autonomic
strategies to accomplish a given goal under dynamically changing environmental conditions and demands.
They found that adaptive and model predictive control systems can produce good performance, especially
in a priori unknown situations. Wang et al. (2016b) integrated autonomous agents with feedback and
coordination, and proposed a contract net protocol negotiation mechanism that enables agents to
cooperate with each other. Bogdan (2015) studied goal-oriented self-organization algorithms to optimize
design cost functions in a distributed fashion and induce an overall degree of autonomy in the CPS.
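
A contract net negotiation round reduces to announce, bid, and award. The sketch below shows this core loop for assigning a production task to the machine agent promising the earliest completion; it is a simplified illustration under assumed bidding rules, not the full contract net protocol specification or the mechanism of Wang et al. (2016b).

```python
from dataclasses import dataclass

@dataclass
class MachineAgent:
    name: str
    load: float      # current utilization in [0, 1]
    speed: float     # parts per hour at full availability

    def bid(self, task_size):
        """Return an estimated completion time in hours, or None to decline."""
        if self.load > 0.9:                      # nearly saturated: refuse
            return None
        return task_size / (self.speed * (1.0 - self.load))

def contract_net_round(task_size, agents):
    """Announce the task, collect bids, and award to the best bidder."""
    bids = {a.name: a.bid(task_size) for a in agents}
    valid = {name: b for name, b in bids.items() if b is not None}
    winner = min(valid, key=valid.get)
    return winner, valid[winner]

agents = [MachineAgent("M1", 0.5, 100), MachineAgent("M2", 0.2, 80),
          MachineAgent("M3", 0.95, 200)]
winner, eta = contract_net_round(500, agents)
print(f"task awarded to {winner}, estimated completion in {eta:.1f} hours")
```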

4.3. Summary
As the CPS in Industry 4.0 involves trillions of physical facilities controlled by computing devices
that collaboratively work together, the critical dimensions for a successful CPS include scalability,
security, resiliency, and efficiency. With the huge amount of data generated in a CPS with a
large number of heterogeneous devices, big data techniques are expected to play an important
role in making the CPS work properly. However, since we are still in the preliminary stage of Industry
4.0, the current research has some typical limitations. First, most current research efforts focus
on the system infrastructure perspective of big data to make the CPS work, and less attention
is paid to the data analytics perspective to make the CPS more efficient and cost-effective.
Second, even in the intensively studied system infrastructure perspective, most research is related
to a relatively simple CPS within one company/organization. After the techniques for a simple CPS
within one company/organization become mature, we expect more and more research to
shift towards more advanced CPS across companies/organizations, where research on game theory
and incentive mechanisms will play a more important role. Third, among the few studies related to the
data analytics perspective of big data, most address machine failure prediction to improve
system resiliency or self-optimization in a distributed environment, which can still be considered
efforts to make the system work.

5. Conclusion
As discussed above, cyber physical systems are the fundamental infrastructure in Industry 4.0,
while big data is the critical component for efficiently and effectively dealing with the data
generated from cyber physical systems. For the relatively new concept of Industry 4.0, there are only
several existing surveys on either its cyber physical system aspect or its big data aspect, and
they miss the interaction between these two aspects. Therefore, we conducted this survey to bring
more attention to this critical intersection. In this survey, the detailed methods in both cyber
physical systems and big data are introduced to help readers understand their role in Industry
4.0. Then, we group the research in cyber physical systems into the robustness category and the
intelligence category. Finally, the role of big data and its related applications in helping each
category of cyber physical systems are discussed. In addition, we summarize the existing work
and highlight the future research directions for this intersection to achieve full autonomy
in Industry 4.0.

6. Future research direction


Besides the more advanced CPS across companies/organizations mentioned in the previous section,
there are many other underexplored areas for improving the CPS. First, data analytics in the CPS can
also potentially be applied to improve scalability and security. Younis and Fahmy (2004) utilized a
clustering algorithm to group sensor nodes in an ad hoc sensor network; the clustering structure
can be used to reduce message overhead in the network, prolong the network lifetime, and
support scalable data aggregation. Duan et al. (2009) utilized an outlier detection method to find
abnormal network throughput for further investigation.
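
As a small illustration of how clustering supports scalable aggregation, the sketch below groups hypothetical sensor positions with k-means, elects the node nearest each centroid as cluster head, and compares total radio distance against every node reporting directly to the sink. HEED additionally weighs residual energy when electing heads, which this sketch omits.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
nodes = rng.uniform(0, 100, size=(200, 2))   # sensor positions on a 100x100 field
sink = np.array([0.0, 0.0])                  # base station in one corner
k = 8

labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(nodes)

# Elect as head the node closest to its cluster centroid; members report to
# the head, which forwards a single aggregate to the sink.
heads = []
for c in range(k):
    members = np.where(labels == c)[0]
    centroid = nodes[members].mean(axis=0)
    heads.append(members[np.argmin(np.linalg.norm(nodes[members] - centroid, axis=1))])
head_pos = nodes[heads]

flat_cost = np.linalg.norm(nodes - sink, axis=1).sum()
member_cost = sum(np.linalg.norm(nodes[labels == c] - head_pos[c], axis=1).sum()
                  for c in range(k))
clustered_cost = member_cost + np.linalg.norm(head_pos - sink, axis=1).sum()
print(f"total radio distance: {flat_cost:.0f} flat vs {clustered_cost:.0f} clustered")
```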
Second, it is even more important to have a global design, spanning data collection and various forms
of data processing, to predict future events and optimize the control and production plan beyond the
scope of the CPS to serve the goal of Industry 4.0. Although Industry 4.0 is currently in the
developing stage of its system infrastructures and there are not many opportunities to collect
real-life data for further data analytics, it eventually requires a seamless connection between system
infrastructures and data analytics to achieve full autonomy. For example, on one hand, the machine
health status monitoring and prediction results need to be used for modifying manufacturing plans, and
the manufacturing plan, together with machine health status monitoring, will be used for grouped
machine maintenance planning. On the other hand, different data analytics applications have an impact
on system infrastructures for a closed-loop lifecycle redesign, such as which sensors to deploy to
collect the needed data and how the data can be collected more efficiently. In all, the customer
preferences collected in the CRM system, the raw material supply from collaborating companies in the
ERP system, and machine health status monitoring need to be combined for an optimal plan that
satisfies dynamic market opportunities.
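
As a toy illustration of such combined planning, the sketch below folds a predicted machine failure risk into available capacity and solves a small linear program over CRM-derived unit profits and ERP-derived material limits; every figure and constraint is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

profit = np.array([40.0, 55.0])          # per-unit profit, from CRM demand data
hours_per_unit = np.array([[2.0, 3.0]])  # machine-hours consumed by each product
nominal_capacity = 400.0                 # machine-hours available this week
failure_risk = 0.15                      # from machine health monitoring
capacity = nominal_capacity * (1.0 - failure_risk)
material_limit = [150.0, 90.0]           # units buildable, from ERP supply data

# linprog minimizes, so negate profit to maximize it.
res = linprog(c=-profit, A_ub=hours_per_unit, b_ub=[capacity],
              bounds=[(0, material_limit[0]), (0, material_limit[1])])
print("production plan:", np.round(res.x, 1), "expected profit:", round(-res.fun, 1))
```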

Disclosure statement
No potential conflict of interest was reported by the authors.

ORCID
Lian Duan http://orcid.org/0000-0002-0618-8628

References
Agrawal, R., and R. Srikant. 1994. “Fast Algorithms for Mining Association Rules.” In Proceedings of 20th International
Conference Very Large Data Bases, VLDB, September 12 - 15, vol. 1215, 487–499. San Francisco, CA: Morgan
Kaufmann Publishers Inc.
Albrecht, J., R. Dudek, J. Auersperg, R. Pantou, and S. Rzepka. 2015. “Thermal and Mechanical Behaviour of an RFID
Based Smart System Embedded in a Transmission Belt Determined by FEM Simulations for Industry 4.0
Applications.” In Thermal, Mechanical and Multi-Physics Simulation and Experiments in Microelectronics and
Microsystems (EuroSimE), 2015 16th International Conference on, 1–5. IEEE.
Alleman, L. Y., L. Lamaison, E. Perdrix, A. Robache, and J. C. Galloo. 2010. “PM 10 Metal Concentrations and Source
Identification Using Positive Matrix Factorization and Wind Sectoring in a French Industrial Zone.” Atmospheric
Research 96 (4): 612–625. doi:10.1016/j.atmosres.2010.02.008.
Anand, M., E. Cronin, M. Sherr, M. Blaze, Z. Ives, and I. Lee. 2006. “Security Challenges in Next Generation Cyber
Physical Systems.” In Beyond SCADA: Cyber Physical Systems Meeting (HCSS-NEC4CPS), edited by Bruce Krogh, Marija
Ilic, and S. Shankar Sastry, November 8 & 9, 2006, Pittsburgh, Pennsylvania.
Anderson, E., X. Li, M. A. Shah, J. Tucek, and J. J. Wylie. 2010. “What Consistency Does Your Key-Value Store Actually
Provide?” In HotDep, vol. 10, 1–16.
Ankerst, M., M. M. Breunig, H. P. Kriegel, and J. Sander. 1999, June. “OPTICS: Ordering Points to Identify the Clustering
Structure.” In ACM Sigmod Record, vol. 28, no. 2, 49–60. ACM.
Arlat, J., M. Diaz, and M. Kaâniche. 2014. “Towards Resilient Cyber-Physical Systems: The ADREAM Project.” In Design &
Technology of Integrated Systems in Nanoscale Era (DTIS), 2014 9th IEEE International Conference On, 1–5. IEEE.
Baba, K., R. Shibata, and M. Sibuya. 2004. “Partial Correlation and Conditional Correlation as Measures of Conditional
Independence.” Australian & New Zealand Journal of Statistics 46 (4): 657–664. doi:10.1111/j.1467-842X.2004.00360.x.
Bagheri, B., H. Ahmadi, and R. Labbafi. 2011. “Implementing Discrete Wavelet Transform and Artificial Neural Networks
for Acoustic Condition Monitoring of Gearbox.” Elixir Mech. Engg 35: 2909–2911.
Baily, M., and J. Manyka. 2013. Is Manufacturing ‘Cool’ Again. San Francisco, CA: McKinsey Global Institute.
Balasubramanian, J., S. Tambe, A. Gokhale, B. Dasarathy, S. Gadgil, and D. C. Schmidt. 2010. A Model-Driven QoS
Provisioning Engine for Cyber Physical Systems. Technical Report, Vanderbilt University and Telcordia Technologie.
Bate, A., M. Lindquist, I. R. Edwards, S. Olsson, R. Orre, A. Lansner, and R. M. De Freitas. 1998. “A Bayesian Neural
Network Method for Adverse Drug Reaction Signal Generation.” European Journal of Clinical Pharmacology 54 (4):
315–321. doi:10.1007/s002280050466.
Bazaraa, M. S., H. D. Sherali, and C. M. Shetty. 2013. Nonlinear Programming: Theory and Algorithms. Hoboken, NJ: John
Wiley & Sons.
Beiranvand, V., M. Mobasher-Kashani, and A. A. Bakar. 2014. “Multi-Objective PSO Algorithm for Mining Numerical
Association Rules without a Priori Discretization.” Expert Systems with Applications 41 (9): 4259–4273. doi:10.1016/j.
eswa.2013.12.043.
Bellman, R. 2013. Dynamic Programming. North Chelmsford, MA: Courier Corporation.
Birant, D., and A. Kut. 2007. “ST-DBSCAN: An Algorithm for Clustering Spatial–Temporal Data.” Data & Knowledge
Engineering 60 (1): 208–221. doi:10.1016/j.datak.2006.01.013.
Birge, J. R., and F. Louveaux. 2011. Introduction to Stochastic Programming. New York, NY: Springer Science & Business
Media.
Blei, D. M., A. Y. Ng, and M. I. Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan):
993–1022.
Bogdan, P. 2015. “A Cyber-Physical Systems Approach to Personalized Medicine: Challenges and Opportunities for
Noc-Based Multicore Platforms.” In Proceedings of the 2015 Design, Automation & Test in Europe Conference &
Exhibition, 253–258. EDA Consortium.
Boser, B. E., I. M. Guyon, and V. N. Vapnik. 1992. “A Training Algorithm for Optimal Margin Classifiers.” In Proceedings
of the Fifth Annual Workshop on Computational Learning Theory 144–152. ACM.
Bradley, P. S., O. L. Mangasarian, and W. N. Street. 1997. “Clustering via Concave Minimization.” In Advances in Neural
Information Processing Systems, edited by Mozer, M. C., M. I. Jordan, and T. Petsche, 368–374. MIT Press. http://
papers.nips.cc/paper/1260-clustering-via-concave-minimization.pdf
Breiman, L. 1996. “Bagging Predictors.” Machine Learning 24 (2): 123–140. doi:10.1007/BF00058655.
Brettel, M., N. Friederichsen, M. Keller, and M. Rosenberg. 2014. “How Virtualization, Decentralization and Network
Building Change the Manufacturing Landscape: An Industry 4.0 Perspective.” International Journal of Mechanical,
Industrial Science and Engineering 8 (1): 37–44.
Brin, S., R. Motwani, J. D. Ullman, and S. Tsur. 1997. “Dynamic Itemset Counting and Implication Rules for Market
Basket Data.” In ACM SIGMOD Record, vol. 26, no. 2, 255–264. ACM.
Burmester, M., E. Magkos, and V. Chrissikopoulos. 2012. “Modeling Security in Cyber–Physical Systems.” International
Journal of Critical Infrastructure Protection 5 (3): 118–126. doi:10.1016/j.ijcip.2012.08.002.
Carpenter, G. A., and S. Grossberg. 1990. “ART 3: Hierarchical Search Using Chemical Transmitters in Self-Organizing
Pattern Recognition Architectures.” Neural Networks 3 (2): 129–152. doi:10.1016/0893-6080(90)90085-Y.
Chandrashekar, G., and F. Sahin. 2014. “A Survey on Feature Selection Methods.” Computers & Electrical Engineering 40
(1): 16–28. doi:10.1016/j.compeleceng.2013.11.024.
Chen, D. R., Q. Wu, Y. Ying, and D. X. Zhou. 2004. “Support Vector Machine Soft Margin Classifiers: Error Analysis.”
Journal of Machine Learning Research 5 (Sep): 1143–1175.
Chen, H. 2017a. “Theoretical Foundations for Cyber Physical Systems-A Literature Review.” Journal of Industrial
Integration and Management 2 (3). doi:10.1142/S2424862217500130.
Chen, H. 2017b. “Applications of Cyber Physical Systems-A Literature Review.” Journal of Industrial Integration and
Management 2 (3). doi:10.1142/S2424862217500129.
Chen, Y., F. Li, and J. Fan. 2015. “Mining Association Rules in Big Data with NGEP.” Cluster Computing 18 (2): 577–585.
doi:10.1007/s10586-014-0419-3.
Chen, Y., H. Chen, A. Gorkhali, Y. Lu, Y. Ma, and L. Li. 2016. “Big Data Analytics and Big Data Science: A Survey.” Journal
of Management Analytics 3 (1): 1–4. doi:10.1080/23270012.2016.1141332.
Chodorow, K. 2013. MongoDB: The Definitive Guide: Powerful and Scalable Data Storage. Sebastopol, CA: O’Reilly Media.
Cleveland, W. S., and S. J. Devlin. 1988. “Locally Weighted Regression: An Approach to Regression Analysis by Local
Fitting.” Journal of the American Statistical Association 83 (403): 596–610. doi:10.1080/01621459.1988.10478639.
Cochran, W. G. 1954. “Some Methods for Strengthening the Common Tests.” Biometrics 10 (4): 417–451. doi:10.2307/
3001616.
Codd, E. F. 1970. “A Relational Model of Data for Large Shared Data Banks.” Communications of the ACM 13 (6): 377–
387. doi:10.1145/362384.362685.
Daniel, B. 2015. “Big Data and Analytics in Higher Education: Opportunities and Challenges.” British Journal of
Educational Technology 46 (5): 904–920. doi:10.1111/bjet.2015.46.issue-5.
De Mauro, A., M. Greco, and M. Grimaldi. 2016. “A Formal Definition of Big Data Based on Its Essential Features.”
Library Review 65 (3): 122–135. doi:10.1108/LR-06-2015-0061.
Deb, K., A. Pratap, S. Agarwal, and T. A. M. T. Meyarivan. 2002. “A Fast and Elitist Multiobjective Genetic Algorithm:
NSGA-II.” IEEE Transactions on Evolutionary Computation 6 (2): 182–197. doi:10.1109/4235.996017.
Delen, D., and H. Demirkan. 2013. “Data, Information and Analytics as Services.” Decision Support Systems 55 (1): 359–
363. doi:10.1016/j.dss.2012.05.044.
Demuth, H. B., M. H. Beale, O. De Jess, and M. T. Hagan. 2014. Neural Network Design. Stillwater, OK: Martin Hagan.
Dijcks, J. P. 2012. “Oracle: Big Data for the Enterprise.” Oracle White Paper. http://www.oracle.com/us/products/
database/big-data-for-enterprise-519135.pdf
Domingos, P., and M. Pazzani. 1997. “On the Optimality of the Simple Bayesian Classifier under Zero-One Loss.”
Machine Learning 29 (2): 103–130. doi:10.1023/A:1007413511361.
Dorigo, M., M. Birattari, and T. Stutzle. 2006. “Ant Colony Optimization.” IEEE Computational Intelligence Magazine 1 (4):
28–39. doi:10.1109/MCI.2006.329691.
Drucker, H., C. J. Burges, L. Kaufman, A. J. Smola, and V. Vapnik. 1997. “Support Vector Regression Machines.” In
Advances in Neural Information Processing Systems, edited by Mozer, M. C., M. I. Jordan, and T. Petsche, 155–161.
MIT Press. http://papers.nips.cc/paper/1238-support-vector-regression-machines.pdf
Du, K. L. 2010. “Clustering: A Neural Network Approach.” Neural Networks 23 (1): 89–107. doi:10.1016/j.
neunet.2009.08.007.
Duan, L., L. Xu, F. Guo, J. Lee, and B. Yan. 2007. “A Local-Density Based Spatial Clustering Algorithm with Noise.”
Information Systems 32 (7): 978–986. doi:10.1016/j.is.2006.10.006.
Duan, L., L. Xu, Y. Liu, and J. Lee. 2009. “Cluster-Based Outlier Detection.” Annals of Operations Research 168 (1): 151–
168. doi:10.1007/s10479-008-0371-9.
Duan, L., and W. N. Street. 2009. “Finding Maximal Fully-Correlated Itemsets in Large Databases.” In ICDM, vol. 9,
770–775.
Duan, L., W. N. Street, and Y. Liu. 2013. “Speeding up Correlation Search for Binary Data.” Pattern Recognition Letters 34
(13): 1499–1507. doi:10.1016/j.patrec.2013.05.027.
Duan, L., W. N. Street, Y. Liu, S. Xu, and B. Wu. 2014. “Selecting the Right Correlation Measure for Binary Data.” ACM
Transactions on Knowledge Discovery from Data (TKDD) 9 (2): 13. doi:10.1145/2637484.
Elderton, W. P. 1902. “Tables for Testing the Goodness of Fit of Theory to Observation.” Biometrika 1 (2): 155–163.
Elhamifar, E., and R. Vidal. 2013. “Sparse Subspace Clustering: Algorithm, Theory, and Applications.” IEEE Transactions
on Pattern Analysis and Machine Intelligence 35 (11): 2765–2781. doi:10.1109/TPAMI.2013.57.
Ester, M., H. P. Kriegel, J. Sander, and X. Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial
Databases with Noise.” In KDD, vol. 96, no. 34, 226–231.
Fang, R., M. Wu, R. Shang, and C. Peng. 2016. “Failure Prediction Of Wind Turbines Using Improved Gray Relation
Analysis Based Support Vector Machine Method.” Journal of Computational and Theoretical Nanoscience 13 (9):
5887–5895. doi:10.1166/jctn.2016.5502.
Fawzi, H., P. Tabuada, and S. Diggavi. 2014. “Secure Estimation and Control for Cyber-Physical Systems under
Adversarial Attacks.” IEEE Transactions on Automatic Control 59 (6): 1454–1467. doi:10.1109/TAC.2014.2303233.
Fisher, D. H. 1987. “Knowledge Acquisition via Incremental Conceptual Clustering.” Machine Learning 2 (2): 139–172.
doi:10.1007/BF00114265.
Friedman, N., D. Geiger, and M. Goldszmidt. 1997. “Bayesian Network Classifiers.” Machine Learning 29 (2–3): 131–163.
doi:10.1023/A:1007465528199.
Ganschar, O., S. Gerlach, M. Hämmerle, T. Krause, and S. Schlund. 2013. “Arbeit der Zukunft – Mensch und
Automatisierung.” In Produktionsarbeit Der Zukunft-Industrie 4.0, edited by D. Spath, 50–56. Stuttgart: Fraunhofer
Verlag.
Glover, F., and M. Laguna. 2013. “Tabu Search∗.” In Handbook of Combinatorial Optimization, 3261–3362. Norwell, MA:
Kluwer Academic Publishers.
Gölzer, P., P. Cato, and M. Amberg. 2015. “Data Processing Requirements of Industry 4.0-Use Cases for Big Data
Applications.” In ECIS.
Gorecky, D., M. Schmitt, M. Loskyll, and D. Zühlke. 2014. “Human-Machine-Interaction in the Industry 4.0 Era.” In
Industrial Informatics (INDIN), 2014 12th IEEE International Conference on, 289–294. IEEE.
Hall, M. A. 2000. “Correlation-Based Feature Selection of Discrete and Numeric Class Machine Learning.” In Proceedings
of the Seventeenth International Conference on Machine Learning (ICML '00), edited by Pat Langley, 359-366. San
Francisco, CA: Morgan Kaufmann Publishers Inc.
Han, J., J. Pei, and Y. Yin. 2000. “Mining Frequent Patterns without Candidate Generation.” In ACM Sigmod Record, vol.
29, no. 2, 1–12. ACM.
He, W., and L. Xu. 2014. “Integration of Distributed Enterprise Applications: A Survey.” IEEE Transactions on Industrial
Informatics 10 (1): 35–42. doi:10.1109/TII.2012.2189221.
Hewitt, E. 2010. Cassandra-The Definitive Guide: Distributed Data at Web Scale. Definitive Guide Series. Sebastopol, CA:
O'Reilly Media.
Hinneburg, A., and D. A. Keim. 1999. “Optimal Grid-Clustering: Towards Breaking the Curse of Dimensionality in High-
Dimensional Clustering.” Proceedings of the 25th International Conference on Very Large Databases, 506–517.
Hosmer Jr, D. W., S. Lemeshow, and R. X. Sturdivant. 2013. Applied Logistic Regression. Vol. 398. Hoboken, NJ: John
Wiley & Sons.
Howarth, R. J., and J. M. McArthur. 1997. “Statistics for Strontium Isotope Stratigraphy: A Robust LOWESS Fit to the
Marine Sr-Isotope Curve for 0 to 206 Ma, with Look-Up Table for Derivation of Numeric Age.” The Journal of Geology
105 (4): 441–456. doi:10.1086/515938.
Huang, Z., and M. K. Ng. 1999. “A Fuzzy K-Modes Algorithm for Clustering Categorical Data.” IEEE Transactions on Fuzzy
Systems 7 (4): 446–452. doi:10.1109/91.784206.
Hunter, T., T. Das, M. Zaharia, P. Abbeel, and A. M. Bayen. 2013. “Large-Scale Estimation in Cyberphysical Systems
Using Streaming Data: A Case Study with Arterial Traffic Estimation.” IEEE Transactions on Automation Science and
Engineering 10 (4): 884–898.
Huynh, H. X., F. Guillet, J. Blanchard, P. Kuntz, H. Briand, and R. Gras. 2007. “A Graph-Based Clustering Approach to
Evaluate Interestingness Measures: A Tool and A Comparative Study.” Quality Measures in Data Mining 43: 25–50.
Jagannathan, G., and R. N. Wright. 2005. “Privacy-Preserving Distributed K-Means Clustering over Arbitrarily
Partitioned Data.” In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in
Data Mining, 593–599. ACM.
Jałowiecki, A., P. Kłusek, and W. Skarka. 2017. “Skeleton-Based Generative Modelling Method in the Context of
Increasing Functionality of Virtual Product Assembly.” Procedia Manufacturing 11: 2211–2218. doi:10.1016/j.
promfg.2017.07.368.
Jiang, P., K. Ding, and J. Leng. 2016. “Towards a Cyber-Physical-Social-Connected and Service-Oriented Manufacturing
Paradigm: Social Manufacturing.” Manufacturing Letters 7: 15–21. doi:10.1016/j.mfglet.2015.12.002.
Jiao, Y. 2013. “Research of an Improved Apriori Algorithm in Data Mining Association Rules.” International Journal of
Computer and Communication Engineering 2 (1): 25.
Kagermann, H., J. Helbig, A. Hellinger, and W. Wahlster. 2013. Recommendations for Implementing the Strategic
Initiative INDUSTRIE 4.0: Securing the Future of German Manufacturing Industry. final report of the Industrie 4.0
Working Group. Forschungsunion.
Kannan, S., and R. Bhaskaran. 2009. “Association Rule Pruning Based on Interestingness Measures with Clustering.”
International Journal of Computer Science Issues 6 (1): 35–45.
Kao, H. A., W. Jin, D. Siegel, and J. Lee. 2015. “A Cyber Physical Interface for Automation Systems—Methodology and
Examples.” Machines 3 (2): 93–106. doi:10.3390/machines3020093.
Karau, H., A. Konwinski, P. Wendell, and M. Zaharia. 2015. Learning Spark: Lightning-Fast Big Data Analysis. Sebastopol,
CA: O’Reilly Media.
Karegowda, A. G., A. S. Manjunath, and M. A. Jayaram. 2010. “Comparative Study of Attribute Selection Using Gain
Ratio and Correlation Based Feature Selection.” International Journal of Information Technology and Knowledge
Management 2 (2): 271–277.
Khaitan, S. K., and J. D. McCalley. 2015. “Design Techniques and Applications of Cyberphysical Systems: A Survey.” IEEE
Systems Journal 9 (2): 350–365. doi:10.1109/JSYST.2014.2322503.
Khalid, S., T. Khalil, and S. Nasreen. 2014. “A Survey of Feature Selection and Feature Extraction Techniques in Machine
Learning.” In Science and Information Conference (SAI), 2014, 372–378. IEEE.
Khanmohammadi, S., N. Adibeig, and S. Shanehbandy. 2017. “An Improved Overlapping K-Means Clustering Method
for Medical Applications.” Expert Systems with Applications 67: 12–18. doi:10.1016/j.eswa.2016.09.025.
Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi. 1983. “Optimization by Simulated Annealing.” Science 220 (4598): 671–
680. doi:10.1126/science.220.4598.671.
Kleinberg, S., and B. Mishra. 2009. “The Temporal Logic of Causal Structures.” In Proceedings of the Twenty-Fifth
Conference on Uncertainty in Artificial Intelligence, 303–312. AUAI Press.
Kohonen, T. 1990. “The Self-Organizing Map.” Proceedings of the IEEE 78 (9): 1464–1480. doi:10.1109/5.58325.
Krämer, A., J. Green, J. Pollard Jr, and S. Tugendreich. 2013. “Causal Analysis Approaches in Ingenuity Pathway
Analysis.” Bioinformatics 30 (4): 523–530. doi:10.1093/bioinformatics/btt703.
Kuncheva, L. I., and J. J. Rodríguez. 2014. “A Weighted Voting Framework for Classifiers Ensembles.” Knowledge and
Information Systems 38 (2): 259–275. doi:10.1007/s10115-012-0586-6.
Kuo, C. J., K. C. Ting, Y. C. Chen, D. L. Yang, and H. M. Chen. 2017. “Automatic Machine Status Prediction in the Era of
Industry 4.0: Case Study of Machines in a Spring Factory.” Journal of Systems Architecture 81: 44–53. doi:10.1016/j.
sysarc.2017.10.007.
Laney, D. 2001. ”3-D Data Management: Controlling Data Volume, Velocity and Variety.” META Group. Research Note.
February 2001.
Lapira, E. R. 2012. “Fault Detection in a Network of Similar Machines Using Clustering Approach.” Doctoral diss.,
University of Cincinnati.
Lasi, H., P. Fettke, H. G. Kemper, T. Feld, and M. Hoffmann. 2014. “Industry 4.0.” Business & Information Systems
Engineering 6 (4): 239. doi:10.1007/s12599-014-0334-4.
Lawler, E. L., and D. E. Wood. 1966. “Branch-And-Bound Methods: A Survey.” Operations Research 14 (4): 699–719.
doi:10.1287/opre.14.4.699.
Lee, D. D., and H. S. Seung. 1999. “Learning the Parts of Objects by Non-Negative Matrix Factorization.” Nature 401
(6755): 788–791. doi:10.1038/44565.
Lee, E. A. 2008. “Cyber Physical Systems: Design Challenges.” In Object Oriented Real-Time Distributed Computing
(ISORC), 2008 11th IEEE International Symposium on, 363–369. IEEE.
Lee, E. A. 2015. “The Past, Present and Future of Cyber-Physical Systems: A Focus on Models.” Sensors 15 (3): 4837–
4869. doi:10.3390/s150304837.
Lee, J., B. Bagheri, and H. A. Kao. 2015b. “A Cyber-Physical Systems Architecture for Industry 4.0-Based Manufacturing
Systems.” Manufacturing Letters 3: 18–23. doi:10.1016/j.mfglet.2014.12.001.
Lee, J., E. Lapira, B. Bagheri, and H. A. Kao. 2013. “Recent Advances and Trends in Predictive Manufacturing Systems in
Big Data Environment.” Manufacturing Letters 1 (1): 38–41. doi:10.1016/j.mfglet.2013.09.005.
Lee, J., H. A. Kao, and S. Yang. 2014. “Service Innovation and Smart Analytics for Industry 4.0 And Big Data
Environment.” Procedia Cirp 16: 3–8. doi:10.1016/j.procir.2014.02.001.
Lee, J., H. D. Ardakani, S. Yang, and B. Bagheri. 2015a. “Industrial Big Data Analytics and Cyber-Physical Systems for
Future Maintenance & Service Innovation.” Procedia CIRP 38: 3–7. doi:10.1016/j.procir.2015.08.026.
Lee, J. S., and W. L. Cheng. 2012. “Fuzzy-Logic-Based Clustering Approach for Wireless Sensor Networks Using Energy
Predication.” IEEE Sensors Journal 12 (9): 2891–2897. doi:10.1109/JSEN.2012.2204737.
Li, X., D. Li, J. Wan, A. V. Vasilakos, C. F. Lai, and S. Wang. 2017. “A Review of Industrial Wireless Networks in the Context
of Industry 4.0.” Wireless Networks 23 (1): 23–41. doi:10.1007/s11276-015-1133-7.
Li, Y., H. Salmasian, S. Vilar, H. Chase, C. Friedman, and Y. Wei. 2014. “A Method for Controlling Complex Confounding
Effects in the Detection of Adverse Drug Reactions Using Electronic Health Records.” Journal of the American
Medical Informatics Association 21 (2): 308–314. doi:10.1136/amiajnl-2013-001718.
Li, S., L. D. Xu, and S. Zhao. 2018. “5G Internet of Things: A Survey..” Journal of Industrial Information Integration.
doi:10.1016/j.jii.2018.01.005.
Lu, Y. 2017a. “Industry 4.0: A Survey on Technologies, Applications and Open Research Issues.” Journal of Industrial
Information Integration 6: 1–10. doi:10.1016/j.jii.2017.04.005.
Lu, Y. 2017b. “Cyber Physical System (Cps)-Based Industry 4.0: A Survey.” Journal of Industrial Integration and
Management 2 (3). doi:10.1142/S2424862217500142.
Lucia, W., B. Sinopoli, and G. Franze. 2016. “A Set-Theoretic Approach for Secure and Resilient Control of Cyber-
Physical Systems Subject to False Data Injection Attacks.” In Cyber-Physical Systems Workshop (SOSCYPS), Science of
Security for, 1–5. IEEE.
MacQueen, J. 1967. “Some Methods for Classification and Analysis of Multivariate Observations.” In Proceedings of the
Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, 281–297.
Maggio, M., H. Hoffmann, A. V. Papadopoulos, J. Panerati, M. D. Santambrogio, A. Agarwal, and A. Leva. 2012.
“Comparison of Decision-Making Strategies for Self-Optimization in Autonomic Computing Systems.” ACM
Transactions on Autonomous and Adaptive Systems (TAAS) 7 (4): 36.
Melin, P., and O. Castillo. 2014. “A Review on Type-2 Fuzzy Logic Applications in Clustering, Classification and Pattern
Recognition.” Applied Soft Computing 21: 568–577. doi:10.1016/j.asoc.2014.04.017.
Mosteller, F. 1968. “Association and Estimation in Contingency Tables.” Journal of the American Statistical Association
63 (321): 1–28.
Müllner, D. 2013. “Fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python.” Journal of
Statistical Software 53 (9): 1–18. doi:10.18637/jss.v053.i09.
Murtagh, F., and P. Legendre. 2014. “Ward’s Hierarchical Agglomerative Clustering Method: Which Algorithms
Implement Ward’s Criterion?.” Journal of Classification 31 (3): 274–295. doi:10.1007/s00357-014-9161-z.
Neter, J., M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models, 318. Vol. 4. Chicago:
Irwin.
Neyman, J., and E. S. Pearson. 1992. “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” In:
Breakthroughs in Statistics, edited by Kotz, S., and N. L. Johnson, 73–108. New York: Springer.
Ng, R. T., and J. Han. 2002. “CLARANS: A Method for Clustering Objects for Spatial Data Mining.” IEEE Transactions on
Knowledge and Data Engineering 14 (5): 1003–1016. doi:10.1109/TKDE.2002.1033770.
Pearl, J. 2000. “Simpson’s Paradox, Confounding, and Collapsibility.” In Causality: Models, Reasoning and Inference, 173–
200. Cambridge: Cambridge University Press.
Pearson, K. 1895. “Note on Regression and Inheritance in the Case of Two Parents.” Proceedings of the Royal Society of
London 58: 240–242. doi:10.1098/rspl.1895.0041.
Peng, H., F. Long, and C. Ding. 2005. “Feature Selection Based on Mutual Information Criteria of Max-Dependency,
Max-Relevance, and Min-Redundancy.” IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8): 1226–
1238. doi:10.1109/TPAMI.2005.159.
Piateski, G., and W. Frawley. 1991. Knowledge Discovery in Databases. Cambridge, MA: MIT press.
Quinlan, J. R. 1986. “Induction of Decision Trees.” Machine Learning 1 (1): 81–106. doi:10.1007/BF00116251.
Qun, Z., S. Lai, S. Dai, X. Gao, and T. Li. 2015. “Mining User’s Preference Information through System Log toward a
Personalized ERP System.” Communications of the IIMA 5 (3): 6.
Rabiner, L. R. 1989. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.”
Proceedings of the IEEE 77 (2): 257–286. doi:10.1109/5.18626.
Raileanu, L. E., and K. Stoffel. 2004. “Theoretical Comparison between the Gini Index and Information Gain Criteria.”
Annals of Mathematics and Artificial Intelligence 41 (1): 77–93. doi:10.1023/B:AMAI.0000018580.96245.c6.
Rasmussen, C. E. 2000. “The Infinite Gaussian Mixture Model.” In Advances in Neural Information Processing Systems,
edited by Solla, S. A., T. K. Leen, and K. Muller, 554–560. MIT Press: http://papers.nips.cc/paper/1745-the-infinite-
gaussian-mixture-model.pdf
Sammouri, W. 2014. “Data Mining of Temporal Sequences for the Prediction of Infrequent Failure Events: Application
on Floating Train Data for Predictive Maintenance.” Doctoral diss., Université Paris-Est.
Schapire, R. E. 1990. “The Strength of Weak Learnability.” Machine Learning 5 (2): 197–227. doi:10.1007/BF00116037.
Schroeck, M., R. Shockley, J. Smart, D. Romero-Morales, and P. Tufano. 2012. “Analytics: The Real-World Use of Big
Data.” IBM Global Business Services 1–20. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?htmlfid=
GBE03519USEN
Shoukry, Y., P. Nuzzo, A. Puggelli, A. L. Sangiovanni-Vincentelli, S. A. Seshia, and P. Tabuada. 2017. “Secure State
Estimation for Cyber Physical Systems under Sensor Attacks: A Satisfiability Modulo Theory Approach.” IEEE
Transactions on Automatic Control 62: 4917–4932. doi:10.1109/TAC.2017.2676679.
Shvachko, K., H. Kuang, S. Radia, and R. Chansler. 2010. “The Hadoop Distributed File System.” In Mass Storage Systems
and Technologies (MSST), 2010 IEEE 26th Symposium on, 1–10. IEEE.
Simha, R., S. Briesemeister, O. Kohlbacher, and H. Shatkay. 2015. “Protein (Multi-) Location Prediction: Utilizing
Interdependencies via a Generative Model.” Bioinformatics 31 (12): i365–i374. doi:10.1093/bioinformatics/
btv264.
Sistrom, C. L., and C. W. Garvan. 2004. “Proportions, Odds, and Risk.” Radiology 230 (1): 12–19. doi:10.1148/
radiol.2301031028.
Slay, J., and M. Miller. 2007. “Lessons Learned from the Maroochy Water Breach.” In Critical Infrastructure Protection,
73–82. Boston, MA: Springer.
Soltanolkotabi, M., E. Elhamifar, and E. J. Candes. 2014. “Robust Subspace Clustering.” The Annals of Statistics 42 (2):
669–699. doi:10.1214/13-AOS1199.
Stojmenovic, I. 2014. “Machine-To-Machine Communications with In-Network Data Aggregation, Processing, and
Actuation for Large-Scale Cyber-Physical Systems.” IEEE Internet of Things Journal 1 (2): 122–128. doi:10.1109/
JIOT.2014.2311693.
Stone, R. E., and C. A. Tovey. 1991. “The Simplex and Projective Scaling Algorithms as Iteratively Reweighted Least
Squares Methods.” SIAM Review 33 (2): 220–237. doi:10.1137/1033049.
Stonebraker, M., D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. 2010. “MapReduce and Parallel
DBMSs: Friends or Foes?.” Communications of the ACM 53 (1): 64–71. doi:10.1145/1629175.
Sun, M., S. Mohan, L. Sha, and C. Gunter. 2009. “Addressing Safety and Security Contradictions in Cyber-Physical
Systems.” In Proceedings of the 1st Workshop on Future Directions in Cyber-Physical Systems Security (CPSSW’09).
Suykens, J. A., and J. Vandewalle. 1999. “Least Squares Support Vector Machine Classifiers.” Neural Processing Letters 9
(3): 293–300. doi:10.1023/A:1018628609742.
Tang, L. A., X. Yu, S. Kim, J. Han, C. C. Hung, and W. C. Peng. 2010. “Tru-Alarm: Trustworthiness Analysis of Sensor Networks
in Cyber-Physical Systems.” In Data Mining (ICDM), 2010 IEEE 10th International Conference on, 1079–1084. IEEE.
Tao, F., L. Zhang, V. C. Venkatesh, Y. Luo, and Y. Cheng. 2011. “Cloud Manufacturing: A Computing and Service-
Oriented Manufacturing Model.” Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering
Manufacture 225 (10): 1969–1976. doi:10.1177/0954405411405575.
Tew, C., C. Giraud-Carrier, K. Tanner, and S. Burton. 2014. “Behavior-Based Clustering and Analysis of Interestingness
Measures for Association Rule Mining.” Data Mining and Knowledge Discovery 28 (4): 1004–1045. doi:10.1007/
s10618-013-0326-x.
Tsamardinos, I., L. E. Brown, and C. F. Aliferis. 2006. “The Max-Min Hill-Climbing Bayesian Network Structure Learning
Algorithm.” Machine Learning 65 (1): 31–78. doi:10.1007/s10994-006-6889-7.
Vidal, R., Y. Ma, and S. Sastry. 2005. “Generalized Principal Component Analysis (GPCA).” IEEE Transactions on Pattern
Analysis and Machine Intelligence 27 (12): 1945–1959. doi:10.1109/TPAMI.2005.244.
Vijayaraghavan, A., W. Sobel, A. Fox, D. Dornfeld, and P. Warndorf. 2008. “Improving Machine Tool Interoperability
Using Standardized Interface Protocols: MT Connect.” In Laboratory for Manufacturing and Sustainability. 2008
International Symposium on Flexible Automation, Atlanta, GA, USA, June 23-26.
Wang, S., J. Wan, D. Zhang, D. Li, and C. Zhang. 2016b. “Towards Smart Factory for Industry 4.0: A Self-Organized Multi-
Agent System with Big Data Based Feedback and Coordination.” Computer Networks 101: 158–168. doi:10.1016/j.
comnet.2015.12.017.
Wang, S., T. Lei, L. Zhang, C. H. Hsu, and F. Yang. 2016a. “Offloading Mobile Data Traffic for QoS-aware Service
Provision in Vehicular Cyber-Physical Systems.” Future Generation Computer Systems 61: 118–127. doi:10.1016/j.
future.2015.10.004.
Wang, W., J. Yang, and R. Muntz. 1997. “STING: A Statistical Information Grid Approach to Spatial Data Mining.” In
VLDB, vol. 97, 186–195.
Weyer, S., M. Schmitt, M. Ohmer, and D. Gorecky. 2015. “Towards Industry 4.0-Standardization as the Crucial Challenge
for Highly Modular, Multi-Vendor Production Systems.” IFAC-PapersOnLine 48 (3): 579–584. doi:10.1016/j.
ifacol.2015.06.143.
Wollschlaeger, M., T. Sauter, and J. Jasperneite. 2017. “The Future of Industrial Communication: Automation Networks
in the Era of the Internet of Things and Industry 4.0.” IEEE Industrial Electronics Magazine 11 (1): 17–27. doi:10.1109/
MIE.2017.2649104.
Wolpert, D. H. 1992. “Stacked Generalization.” Neural Networks 5 (2): 241–259. doi:10.1016/S0893-6080(05)80023-1.
Wolsey, L. A., and G. L. Nemhauser. 1999. Integer and Combinatorial Optimization. 1st ed. Hoboken, NJ: Wiley-
Interscience.
Wu, B., and B. M. Wilamowski. 2016. “A Fast Density and Grid Based Clustering Method for Data with Arbitrary Shapes
and Noise.” IEEE Transactions on Industrial Informatics 13 (4): 1620–1628. doi:10.1109/TII.2016.2628747.
Xiong, H., S. Shekhar, P. M. Tan, and V. Kumar. 2006. “TAPER: A Two-Step Approach for All-Strong-Pairs Correlation
Query in Large Databases.” IEEE Transactions on Knowledge and Data Engineering 18 (4): 493–508. doi:10.1109/
TKDE.2006.1599388.
Xu, L. 2007. “Editorial: Inaugural Issue.” Enterprise Information Systems 1 (1): 1–2. doi:10.1080/17517570712331393320.
Xu, D., and Y. Tian. 2015. “A Comprehensive Survey of Clustering Algorithms.” Annals of Data Science 2 (2): 165–193.
doi:10.1007/s40745-015-0040-1.
Xu, L., E. Xu, and L. Li. 2018. “Industry 4.0: State of the Art and Future Trends.” International Journal of Production
Research. in press.
Xu, L., N. Liang, and Q. Gao. 2008a. “An Integrated Approach for Agricultural Ecosystem Management.” IEEE
Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38 (4): 590–599. doi:10.1109/
TSMCC.2007.913894.
Xu, X. 2012. “From Cloud Computing to Cloud Manufacturing.” Robotics and Computer-Integrated Manufacturing 28 (1):
75–86. doi:10.1016/j.rcim.2011.07.002.
Xu, Z., X. Liu, G. Zhang, W. He, G. Dai, and W. Shu. 2008b. “A Certificateless Signature Scheme for Mobile Wireless
Cyber-Physical Systems.” In Distributed Computing Systems Workshops, 2008. ICDCS’08. 28th International Conference
on, 489–494. IEEE.
Yao, Y., Q. Cao, and A. V. Vasilakos. 2015. “EDAL: An Energy-Efficient, Delay-Aware, and Lifetime-Balancing Data
Collection Protocol for Heterogeneous Wireless Sensor Networks.” IEEE/ACM Transactions on Networking (TON) 23
(3): 810–823. doi:10.1109/TNET.2014.2306592.
Yasmina Santos, M., B. Martinho, and C. Costa. 2017. “Modelling and Implementing Big Data Warehouses for Decision
Support.” Journal of Management Analytics 4 (2): 111–129. doi:10.1080/23270012.2017.1304292.
Yin, S., and O. Kaynak. 2015. “Big Data for Modern Industry: Challenges and Trends [Point of View].” Proceedings of the
IEEE 103 (2): 143–146. doi:10.1109/JPROC.2015.2388958.
Younis, O., and S. Fahmy. 2004. “HEED: A Hybrid, Energy-Efficient, Distributed Clustering Approach for Ad Hoc Sensor
Networks.” IEEE Transactions on Mobile Computing 3 (4): 366–379. doi:10.1109/TMC.2004.41.
Yu, H., C. Zhang, and G. Wang. 2016. “A Tree-Based Incremental Overlapping Clustering Method Using the Three-Way
Decision Theory.” Knowledge-Based Systems 91: 189–203. doi:10.1016/j.knosys.2015.05.028.
Yu, L., and H. Liu. 2003. “Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution.” In
Proceedings of the 20th International Conference on Machine Learning (ICML-03), 856–863.
Zaki, M. J. 2000. “Scalable Algorithms for Association Mining.” IEEE Transactions on Knowledge and Data Engineering 12
(3): 372–390. doi:10.1109/69.846291.
Zaki, M. J., N. Lesh, and M. Ogihara. 2001. “Predicting Failures in Event Sequences.” In Data Mining for Scientific and
Engineering Applications, edited by Robert L. Grossman, Chandrika Kamath, Philip Kegelmeyer, Vipin Kumar, and
Raju R. Namburu. Kluwer Academic Publishers.
Zhang, T., D. Tao, X. Li, and J. Yang. 2009. “Patch Alignment for Dimensionality Reduction.” IEEE Transactions on
Knowledge and Data Engineering 21 (9): 1299–1313. doi:10.1109/TKDE.2008.212.
Zhou, W., and H. Xiong. 2009. “Efficient Discovery of Confounders in Large Data Sets.” In Data Mining, 2009. ICDM’09.
Ninth IEEE International Conference on, 647–656. IEEE.
Zhu, Q., and T. Basar. 2011. “Towards a Unifying Security Framework for Cyber-Physical Systems.” In Proceedings of
Workshop on Foundations of Dependable and Secure Cyber-Physical Systems, 47–50.
