Yuri Demchenko
University of Amsterdam
y.demchenko@uva.nl
I. INTRODUCTION
[Fig.: Hadoop deployment across two sites. Site A hosts the Master Node (Name Node, Job Tracker), a Slave Node (Data Node) and secure storage, reached by clients over the client protocol; Site B hosts Slave Nodes (Data Node, Task Tracker, Map and Reduce Tasks).]
(Create, Read, Update and Delete). HDFS must also offer the capability to define mandatory encryption zones/containers in which data cannot be stored in the clear. The data stored in HDFS encryption zones must nevertheless remain accessible to processes (e.g. during the MapReduce phase). The KMS must therefore be invoked both by a Client when data is stored or read, and by the Name Node.
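HDFS transparent encryption follows an envelope scheme: each file in an encryption zone is encrypted with its own Data Encryption Key (DEK), and the DEK is stored only in wrapped form, encrypted under the zone's Encryption Zone Key (EZK) held by the KMS. The sketch below illustrates that key indirection only; the class and helper names are hypothetical, and a toy XOR keystream stands in for the AES-CTR cipher actually used by HDFS:

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    # Toy stream cipher standing in for AES-CTR: XOR the data with a
    # SHA-256-derived keystream. Symmetric, so it both encrypts and decrypts.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

class ToyKMS:
    """Holds Encryption Zone Keys (EZK); hands out DEKs wrapped under them."""
    def __init__(self):
        self.ezks = {}

    def create_zone_key(self, zone: str) -> None:
        self.ezks[zone] = secrets.token_bytes(32)

    def new_wrapped_dek(self, zone: str):
        dek = secrets.token_bytes(32)                  # per-file DEK
        wrapped = keystream_xor(self.ezks[zone], dek)  # DEK wrapped by EZK
        return dek, wrapped

    def unwrap_dek(self, zone: str, wrapped: bytes) -> bytes:
        return keystream_xor(self.ezks[zone], wrapped)

kms = ToyKMS()
kms.create_zone_key("/dataE")
dek, wrapped = kms.new_wrapped_dek("/dataE")
ciphertext = keystream_xor(dek, b"sensitive record")  # client-side encryption
plaintext = keystream_xor(kms.unwrap_dek("/dataE", wrapped), ciphertext)
assert plaintext == b"sensitive record"
```

Since the DEK circulates only in wrapped form, reading or writing a file in the zone forces a round trip to the KMS, which is consistent with the requirement above that both the Client and the Name Node interact with it.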
[Fig.: Hadoop deployment across two sites with HDFS transparent encryption. Legend: RPC/SASL for service-to-service communications; RPC/SASL or HTTP for the client protocol. Site A hosts the Master Node (Name Node, Job Tracker) and secure storage; Site B hosts Slave Nodes (Task Tracker, Map and Reduce Tasks) whose Data Nodes carry encryption zones (EZ). A Key Management System holds the Encryption Zone Keys (EZK) and Data Encryption Keys (DEK), alongside an Identity and Accountability Management system for user accounts.]
B. Drawbacks of the strategies
1) Strategy #1 (Inter-region encrypted tunnels and VPCs)
Machine-to-machine encryption seems at first a very attractive solution, since its deployment could be almost fully automated. Such a setup, however, presents several drawbacks:
III. EXPERIMENTAL TESTBED
Fig. 4. Initial architecture for the testbed using the AWS free tier
A. Hardware configuration
In order to remain consistent with the aim of the experiment (applied testing of the impact of encryption on performance), we selected the standard hardware configuration proposed by default for EMR/EC2 instances. The hardware configuration of these instances is as follows:
EC2 instances: CPUUtilization, NetworkIn, NetworkOut.
Temporary EC2 storage (temp files and swap): DiskReadOps, DiskWriteOps, DiskReadBytes, DiskWriteBytes.
EMR: ClusterStatus, MapReduceStatus, NodeStatus, IO.
1+2 using HDFS with transparent AES256/CTR/NoPadding encryption.
Repeat with 1+4.
Repeat with 1+8.
Repeat with 1+16.
Repeat with 1+19 (the overall maximum of running EC2 instances is 20).
IV.
1) S3 bucket, no encryption
The initial run, which we qualify as a dry run since it is not yet burdened by any security mechanism (except basic AAAA, a service enabled by default on AWS), demonstrated that a roughly 4.7 GB dataset can be processed rather rapidly using only one master and two slave nodes (1+2). Doubling the number of slave nodes reduced processing time significantly (approx. 50%), but to halve it again we had to quadruple the 4 slave nodes. Adding 3 more nodes to the set of 16, which processes 4.7 GB in 3:42, brings only a small improvement of roughly 24 s. Evidently, horizontal scaling tends toward a limit, which appears to sit around 3 min for the processing of 4.7 GB.
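This saturation can be verified numerically from the reported timings; a short Python check (timings transcribed into seconds):

```python
# S3, no encryption: slave-node count -> processing time, as measured.
times = {2: "00:14:38", 4: "00:07:42", 8: "00:05:12",
         16: "00:03:42", 19: "00:03:18"}

def to_seconds(hms: str) -> int:
    # Convert an hh:mm:ss timing into seconds.
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

secs = {n: to_seconds(t) for n, t in times.items()}
base = secs[2]  # 1+2 run as baseline
for n in sorted(secs):
    print(f"1+{n:<2} {secs[n]:>4}s  speedup x{base / secs[n]:.2f}")
```

The speedup roughly doubles from 1+2 to 1+4 but reaches only about x4.4 at 1+19, far below the x9.5 a linear scale-out would suggest.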
slave nodes #                1+2        1+4        1+8        1+16       1+19
processing time (hh:mm:ss)   00:14:38   00:07:42   00:05:12   00:03:42   00:03:18
with AES256                  00:14:14   00:07:56   00:05:12   00:03:26   00:03:06
slave nodes #                    1+2    1+4    1+8    1+16   1+19
encryption performance impact    -3%    3%     0%     -8%    -6%
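Small rounding differences aside, the impact row can be recomputed from the two timing series; a negative value means the encrypted run completed faster than the plain one:

```python
def to_seconds(hms: str) -> int:
    # Convert an hh:mm:ss timing into seconds.
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

nodes = [2, 4, 8, 16, 19]
plain = ["00:14:38", "00:07:42", "00:05:12", "00:03:42", "00:03:18"]
aes   = ["00:14:14", "00:07:56", "00:05:12", "00:03:26", "00:03:06"]

for n, p, a in zip(nodes, plain, aes):
    impact = 100 * (to_seconds(a) - to_seconds(p)) / to_seconds(p)
    print(f"1+{n:<2} encryption impact {impact:+.0f}%")
```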
4) HDFS, no encryption
Since we used an S3 bucket for our initial dry run, we used it as a baseline against which to compare the behaviour of the other mechanisms, including the native HDFS provided with Hadoop. HDFS is notably not a great performer, but it has proven extremely reliable in ensuring data consistency and integrity, thanks to its replication mechanisms and the integrity checks it performs continuously. Overall, HDFS still provides a valuable service, despite the complaints one can read about its performance. Furthermore, until Hadoop v2.6.0, HDFS was unable to provide any security function for data. Since then, HDFS transparent encryption has been introduced, and HDFS is now able to play a role in the security of the information it contains.
To start with, we uploaded our dataset onto the Hadoop Master Node from an SSH shell, using the grunt shell's cp command. For the 4.7 GB dataset, the cp command took a little under 1 min to complete. Once uploaded, the dataset was ready to be processed; we used the same script as previously, but launched each line of the script from the grunt shell.
Performance metrics show that HDFS performs much better than S3 when only a few slave nodes are used (up to 8). Above 8 slave nodes, HDFS seems to provide no further advantage and tends toward a lower processing-time limit of 4 min, instead of 3 min for S3.
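This crossover can be checked directly against the unencrypted S3 timings from the previous subsection:

```python
def to_seconds(hms: str) -> int:
    # Convert an hh:mm:ss timing into seconds.
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s

nodes = [2, 4, 8, 16, 19]
s3   = ["00:14:38", "00:07:42", "00:05:12", "00:03:42", "00:03:18"]
hdfs = ["00:10:47", "00:06:03", "00:04:33", "00:04:13", "00:04:17"]

for n, a, b in zip(nodes, s3, hdfs):
    faster = "HDFS" if to_seconds(b) < to_seconds(a) else "S3"
    print(f"1+{n:<2} S3 {a}  HDFS {b}  -> {faster} faster")
# HDFS wins at 1+2, 1+4 and 1+8; S3 wins at 1+16 and 1+19.
```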
slave nodes #                1+2        1+4        1+8        1+16       1+19
processing time (hh:mm:ss)   00:10:47   00:06:03   00:04:33   00:04:13   00:04:17
with AES256 CTR              00:10:27   00:06:12   00:04:36   00:04:36   00:04:11
Fig. 10. 4.7 GB, HDFS transparent encryption, Encrypted Zone folder dataE
slave nodes #         1+2    1+4    1+8    1+16   1+19
performance impact    -3%    2%     1%     8%     -2%
slave nodes #                1+8        1+16       1+19
processing time (hh:mm:ss)   00:28:09   00:25:29   00:25:19
with AES256 CTR              00:27:39   00:20:43   00:19:53
performance impact           -2%        -23%       -27%
V. ANALYSIS
Fig. 15. Network Out (Bytes) behavior during 1+19 AES CTR processing (job start: 20h10)