MULTIPURPOSE SHARED
GLOBAL STORAGE WITH ECS
Labs
    Lab 4 - Querying Object Metadata with Apache Spark SQL for use by HDFS-based applications
        Querying object metadata with Apache Spark SQL for analytics
Troubleshooting
    Troubleshooting & Tips
Conclusion
Solution
    ECS
This lab offers first-hand experience of how ECS smart storage simplifies multi-tenant data access and management,
with instant metadata search and multi-protocol access to the same data set using NFS or S3.
• 1 x Windows 2012 Domain Controller. This server provides DNS and AD services to the environment.
• 1 x Windows 2012 Launchpad. This server is the host you are currently logged into and provides access to
the other components of this lab.
• Two sites with three ECS nodes each. Each node is a VM running ECS 2.2 software and has two disks: one for
the OS install and another for object storage. Real-world ECS installations require a minimum of four nodes; this
three-node install is for demonstration purposes only.
• 1 x Linux VM that hosts Spark Notebook
Lab Architecture
• Username: ecsadmin01@vlab.local
• Password: Password123!
• Username: rainpoleadmin@vlab.local
• Password: Password123!
• Username: mediauser@vlab.local
• Password: Password123!
Alternatively, if you chose to skip the configuration steps in Lab 1 and Lab 2, the following set of credentials has been
pre-configured for use in Labs 3, 4, and 5:
• Username: rainpoleadmin2@vlab.local
• Password: Password123!
• Username: mediauser2@vlab.local
• Password: Password123!
The following is a list of all lab servers and appliances with their corresponding IP addresses and credentials.
Note: This table is for reference purposes; you may not need to access everything when completing the use cases in
this lab guide.
The content generation team needs to ingest image content and its associated metadata.
Two different teams would use the metadata associated with the media images:
• The media review team has a custom application (Media Reviewer) that needs to search for images based on
the metadata to review the content.
• The analytics team uses a Zeppelin notebook to view object metadata and run Spark SQL queries to extract
data for analysis.
A remote advertising team needs to access archived media content from their Windows platform as a mounted file
system, for reference and reuse.
You will have the opportunity to perform tenant configuration, object data ingest using the S3 protocol, metadata search
using Spark SQL and S3 client applications, and NFS access from a Windows environment.
Note: You can optionally start with Lab 3, which allows you to skip the tenant configuration and object data ingest
modules. You will use a pre-configured tenant and pre-populated data for the metadata search and NFS access Labs.
Contents
• Perform the role of a system administrator to create a new namespace and configure a namespace admin
• Perform the role of a namespace admin to enable access for users and applications within the media unit
• Observe how S3 client applications can leverage the ECS 2.2 metadata search capability to efficiently
discover and access data sets
• Experience the ability to use an SQL interface to the metadata search capability in ECS
• Write SQL queries to access data based on metadata search criteria
• Walk through the use case of ingesting content via S3 and then accessing that content via NFS mount in
Windows.
Note: You may skip this lab (Lab 1) and Lab 2 if you prefer to explore the features in ECS with preconfigured tenants
and data.
Open Site 1
The lab is backed by two virtual instances of ECS software, simulating Boston and New York data centers in a geo-
federated environment. The lab experience is identical to provisioning a true ECS appliance.
Click Advanced.
Click Login
To ensure all configuration steps are performed in the right order, the storage administrator can use the configuration
guide, which is displayed on their first login to ECS.
To display the guide on subsequent logins, click the Open Guide icon.
Pre-configuration
Before a new tenant/namespace can be provisioned, a storage pool, virtual data center (VDC) and replication group
need to be configured.
Note: All steps appear complete in green because a tenant has been preconfigured for lab users who prefer to skip
this lab (Lab 1) and Lab 2.
Logout
You will now log out of the ECS portal as the storage administrator and continue configuration of the tenant by acting
as the Namespace administrator, rainpoleadmin@vlab.local.
Click Logout.
You will now access the ECS portal using the credentials of the Namespace admin domain user. You will create an
object user and provision a bucket for object storage within the media Namespace.
Error message
An error message will be displayed. This will not affect lab functionality and can be ignored.
Close dialog
To provide access to the ECS object store, the Namespace administrator must first create an object user with the
appropriate credentials.
An object user account is used by a user or an application to consume the object storage platform via REST APIs such
as S3, Swift, or Centera CAS.
Object users need an S3 secret key to be able to connect to the HTTP endpoint and read/write objects. You will now
create one for the user mediauser@vlab.local. This S3 secret key is distributed out of band. You will save this access
key for use in later labs.
Click Generate and Add Password to create an S3 Secret Key for mediauser@vlab.local.
Click in the key control field. A blue outline should appear around the field to signify it is active. Triple-click in the field
to select all of the key text. Right-click on the selection and select Copy to copy the access key to clipboard.
Minimize the ECS Portal. Right-click on the desktop, and select New > Text Document. Name the file
mediauser@vlab.local.txt.
Double-click to open the text file. Right-click, and paste the access key into the file. Click File and select Save. Finally,
close the text document.
Now that you have provisioned an object user and secret key, you will provision a bucket for the object user to
consume.
Click Buckets, and then click New Bucket.
Proceed to the next step to enable metadata search on the bucket and configure indexes.
The media unit has both a media review team and an analytics team that need to be able to search for objects within
the bucket using metadata search criteria.
In this lab, metadata search is enabled by the Namespace admin using the portal. A bucket owner or an admin user
can also configure metadata search and indexes at the time of bucket creation by using the ECS management API or
the S3 API.
The Namespace admin must know the metadata attributes that are required to be searchable. While system metadata
attributes are available to be selected, user metadata keys need to be manually entered.
The teams in the media unit need to be able to search for data based on the following user metadata attributes:
• image-width (integer)
• image-height (integer)
• image-viewcount (integer)
• gps-latitude (decimal)
• gps-longitude (decimal)
• Click Add
• In the Metadata Key Type list box, select User
• In the Key Name field, type x-amz-meta-image-width. The name is already prefixed; complete the rest.
• In the Data Type list box, select integer
• Click Add
Repeat the previous step for the remaining four metadata search keys:
• image-height (integer)
• image-viewcount (integer)
• gps-latitude (decimal)
• gps-longitude (decimal)
When the five keys are complete, scroll down, and then click Save.
You have successfully operated as a Namespace administrator to create an object user and provision a brand new
bucket for the teams in the media unit. They will now use the object user to ingest and access data.
Proceed to the next lab to start executing the media team use cases of ingest, metadata search and NFS based access.
The content generation team in the media wing uses an S3-based client application called Media Loader to copy
images to the ECS platform. Its configuration process is covered in this lab and is similar to that of any S3 client.
You may skip Lab 1 and Lab 2 if you prefer to explore metadata search and NFS access capabilities with a pre-
configured tenant and pre-populated data.
Minimize the ECS portal window, and on the desktop, double-click Media Loader.
For the secret key, follow the instructions detailed in the next steps.
Object users need a secret key to be able to connect to the HTTP endpoint and read/write objects to this bucket. You
created this key in the previous lab and copied it into a file on the desktop called mediauser@vlab.local.txt.
Paste Key
Paste the copied key into the Secret Key field in the Media Loader application.
Click Upload to start loading the image files using the S3 client application.
Experience how quickly image contents are copied. When the process is complete you will see the success status and
duration of the copy.
Success!
Proceed to experience how the uploaded data can be accessed through S3-based metadata search and NFS protocols.
The media review team of the media unit has an S3 client application to review the content. This lab will take you
through the type of metadata queries that can be performed efficiently by using the ECS metadata search API.
Open the mediauser2@vlab.local.txt file located on the desktop and copy the secret key.
On the desktop, there are two icons to open the Media Reviewer.
Click Login
Click Search.
All images in the bucket with a metadata attribute named viewcount greater than 5000 will be returned.
Click any of the red pushpins on the map to view more details about the selected image. Observe that the Number of
views is above 5000.
Click Close.
Next, you will not be specifying a Minimum number of views but instead will search within a selected geographic
boundary.
Delete 5000 from the Minimum number of views field, and then select Use selected area.
On the map, click and drag to move the gray marquee down to South America. You can use the circular touch points to
cover the entire landmass of South America.
Click Search.
Based on the geographic boundaries you provided on the map, four metadata search parameters were passed in the
query to ECS. Two parameters indicated that the latitude had to be between the north and south boundaries; the
other two indicated that the longitude had to be between the east and west boundaries.
At the bottom of the window, observe the latitude and longitude are within the boundaries of the search you
performed.
Click Close.
Click Logout, and then close the Media Reviewer.
Conclusion
This lab demonstrates how an application can query objects using metadata indexes specified on the S3 bucket. The
application in this lab used different variations of metadata search queries to find and group results.
You began with a simple search that located all images with a metadata attribute viewcount larger than 5000. The
second search was more complex, using a geographic range in which longitude and latitude needed to be within the
plot you specified.
The analytics wing of the media unit uses Zeppelin Notebook with Spark SQL libraries so that they can run SQL
queries on large data sets. Apache Spark is a fast, general-purpose engine for large-scale data processing.
In this lab, Spark SQL queries are translated into ECS metadata search queries so that data can be accessed efficiently
and queries can be serviced by the ECS metadata indexes.
Launch Zeppelin
Apache Zeppelin is a web-based notebook for data analytics. It provides an easy front end for interacting with Apache
Spark data sets.
Notebook Overview
Inside the notebook, you will see a list of paragraphs. Each paragraph can be run individually by clicking the run icon
on the paragraph, or you can click Run all paragraphs at the top of the notebook. In this lab, you will run them one at a
time.
First, you will need to load the ECS search extension into Zeppelin.
Click the run icon in the paragraph with the z.load() statement. This step may take up to a minute to complete the first
time it is run.
Result
When the extension is loaded, you will see a result object res0 or res1.
Next, a connection to ECS is configured. There are four parameters already provided:
• bucketName = images2
• endpoint = http://192.168.1.11:9020/
• secretKey = the shared secret key for the user, mediauser2@vlab.local.
• user = mediauser2@vlab.local
The last two lines in this script will register the bucket as a temporary table for use with SparkSQL. This does not load
any ECS objects into memory; it simply retrieves the indexed attributes on the bucket from ECS for later querying.
Run
When complete, you will now have a DataFrame object that has columns defined for the indexed attributes in ECS.
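If you want to confirm which columns were registered, you can run a quick sanity check in a new paragraph. This is only a sketch; the table name images is an assumption and should match whatever name the registration paragraph used:

%sql
-- "images" is an assumed table name; use the name registered in the previous paragraph
SELECT * FROM images LIMIT 10

The column headers in the result correspond to the metadata keys indexed on the bucket.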
Next, you can use Spark SQL to query the objects in your bucket. In the query, you will select all of the objects in the
bucket that have at least 5000 views and up to 10000 views.
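A minimal sketch of such a query is shown below. The table name images and the column name image-viewcount are assumptions; substitute the names used when the bucket was registered in your notebook:

%sql
-- "images" and "image-viewcount" are assumed names; adjust to match your notebook
SELECT * FROM images
WHERE `image-viewcount` >= 5000 AND `image-viewcount` <= 10000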
The query will run and display the results. Feel free to modify the SQL and try your own examples. You can quickly
execute the SQL statement by pressing Shift-Enter.
While it is outside the scope of this lab, such query results can easily be saved on ECS as an HDFS file and made
available to analytics applications.
The next paragraph will compute the average number of views for the images in the bucket. It uses the avg function to
compute the averages of all the results in the column. Click the run icon to execute the query.
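Under the same naming assumptions as before, the aggregation might look like this sketch:

%sql
-- assumes the "images" table and "image-viewcount" column names used earlier
SELECT AVG(`image-viewcount`) AS average_views
FROM images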
Next, you will create a histogram of the view counts. This query uses Hive operators and grouping to break the objects
down into buckets of 1000 views. Click the run icon to execute the query.
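One possible form of such a grouping query is sketched below; it rounds each view count down to the nearest 1000 and counts the images in each band (table and column names are assumptions, as before):

%sql
-- assumes the "images" table and "image-viewcount" column names used earlier
SELECT FLOOR(`image-viewcount` / 1000) * 1000 AS views_bucket,
       COUNT(*) AS image_count
FROM images
GROUP BY FLOOR(`image-viewcount` / 1000) * 1000
ORDER BY views_bucket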
A table is displayed with the number of images in each 1000-view bucket. To view this visually, click the
bar graph icon:
You can now see the same data set rendered as a bar graph. Clicking on the other icons can change the view style of
the result set.
Explore - Optional
SparkSQL is a powerful engine that can be used to analyze your data in many ways. Feel free to experiment. To create
a new paragraph, hover under the previous paragraph and click +
The results of such analysis can also be made available to Hadoop applications directly from the Zeppelin notebook or
by persisting the results as an HDFS file on ECS.
To use SparkSQL in your paragraph, start the first line with %sql then begin your SELECT statement on the following
line. Press Shift-Enter to execute your query.
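For example, a new paragraph that counts the images taken north of the equator might look like the following sketch (it assumes the images table and the gps-latitude column registered earlier):

%sql
-- assumes the "images" table and "gps-latitude" column names used earlier
SELECT COUNT(*) AS northern_hemisphere_images
FROM images
WHERE `gps-latitude` > 0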
Close Zeppelin
• You can get more information about SparkSQL on the Apache Spark website.
• The Zeppelin notebook application is available from Apache.
• The ECS driver for Spark is available from GitHub under emcvipr/spark-ecs-s3.
• More information on HDFS capabilities in ECS is available at http://www.emc.com/campaign/ecs-hadoop/index.htm
• This ECS demo is also available as a Docker container from DockerHub under emccorp/spark-ecs-s3.
This lab details how an outside marketing firm could share advertising media via S3 with RainPole's advertising team,
who will then access this data via an NFS mount in Windows.
Logout
Click Login
Error message
An error message will be displayed. This will not affect lab functionality and can be ignored.
Close dialog
1. Click Manage
2. Click Buckets
3. In the Namespace list box, ensure media2 is selected
4. Click New Bucket.
Bucket details
Save
1. Click File
2. In the Namespace list box, ensure media2 is selected
3. Click New Export.
Export details
1. In the Export Host field, type * (this will allow access from all NFS clients)
Verify configuration
Verify your NFS export configuration was added for the rainpole-media bucket.
You will now configure access to the file-system-enabled bucket via S3 Browser. This will allow you to upload the
source image that you want to access and read back via NFS.
You are now prompted to Add New Account. Specify the following:
Upload
Click Upload, and then select Upload file(s).
In the right navigation pane, select ECS.jpg, and then click Open.
Confirm ECS.jpg has been successfully uploaded into the rainpole-media bucket.
You will now use the Windows Client for NFS mount command to mount the export. This command will allow access
to the export as if it were another drive in the Windows environment.
Press Enter.
Type:
mount
Press Enter.
Using Windows Explorer, you gain a more visual representation of the export you mounted in ECS.
In the left navigation pane, the NFS drive is mounted and named rainpole-media (\\192.168.1.11\media2) (E:).
You can think of this as a listing of the objects in the rainpole-media bucket.
Right-click ECS.jpg, and then select Edit.
Close Paint.
You have now experienced the capabilities of using NFS within ECS. Furthermore, you have seen first-hand how cross-
head access can be leveraged.
You first created a new file-system-enabled bucket in ECS, which was then referenced via an NFS export configuration.
Then, you uploaded content to the bucket via the AWS S3 API using S3 Browser. To experience cross-head access, you
viewed the uploaded image by creating a mount in Windows (using the Windows NFS Client) and opening/previewing
the image.
If you have trouble logging in to the ECS GUI (hangs, reports bad password), you should make sure that ECS has fully
initialized. Sometimes if the vLab is busy, this can take a little longer than normal.
Inside the GUI, you'll see the initialization status for each Virtual Data Center (VDC). If both VDCs do not indicate
100% readiness, wait for initialization to complete and then return to your task.
If you get any strange errors with Zeppelin, you can try restarting its interpreter. First, click on the Interpreter tab at the
top:
Then click the restart button to restart the interpreter. You can then go back to your notebook and restart your
execution from the top.
If you experience any issues (like the message seen below) attempting to access or modify images via NFS, you can try
to recreate the NFS mount by following the steps below.
umount e:
Press Enter. If you're warned and asked to continue this operation, type y and press Enter.
Re-run the mount command you used earlier, and press Enter. You will see a success message.
Type the following command and press Enter to verify your mount properties.
mount
Verify the mount is listed and ready. Note that the drive letter is e:
RainPole's IT team has successfully extended the ECS cloud storage platform to its media and communications wing.
All of the following needs have been successfully met and the IT team is convinced that ECS is a smart and flexible
multipurpose cloud storage platform that addresses varied needs:
Ingest of image content and associated metadata by the content generation team.
Use of the metadata associated with the media images by two different teams:
• The media review team used a custom application (Media Reviewer) to search for images based on the
metadata to review the content
• The analytics team used a Zeppelin notebook to view object metadata and run Spark SQL queries to extract
data for analysis
A remote advertising team could access archived media content from their Windows platform as a mounted file
system, for reference and reuse.
RainPole’s enterprise IT offers a competitive cloud storage platform to all business units using EMC Elastic Cloud
Storage. They have achieved Terabyte to Exabyte multisite scale-out architecture, giving customers flexibility and
control to scale out according to business needs.
RainPole now wants to extend their cloud infrastructure to the media and communications wing. The media wing is
faced with an explosive growth of unstructured data and the need to serve modern cloud applications.
The following use cases are to be supported for the media wing:
The content generation team needs to ingest image content and its associated metadata.
Two different teams would use the metadata associated with the media images:
• The media review team has a custom application (Media Reviewer) that needs to search for images based on
the metadata to review the content.
• The analytics team wants to use Zeppelin notebook to view object metadata and run Spark SQL queries to
extract data for analysis.
A remote advertising team needs to access archived media content from their Windows platform as a mounted file
system for reference and reuse.
The ECS infrastructure gives RainPole’s IT team the ability to offer a wide range of data ingest and access options such
as Object protocols (S3, Swift), NFS, and Hadoop file system access to achieve a TCO that is 30% lower than that of
public cloud storage. ECS also provides the flexibility of being deployed as software on commodity hardware or in a
turnkey appliance form factor.
ECS is built from the ground up to support multiple protocols for unstructured (Object and File) workloads on a single
cloud-scale storage platform, natively speeding up storage for traditional archive services as well as modern web,
mobile, Internet of Things (IoT), and Hadoop applications. You can store, access or consume data without any
translation gateways.
ECS delivers simple storage management of globally-distributed infrastructure under a single global namespace with
anywhere access to content. It simplifies data governance and management with instant metadata search, analytics
enabled by HDFS, and built-in optimizations for speed and storage efficiency.
Building on the EMC experience with enterprise-grade features for protection, availability, encryption, authentication,
and access controls, ECS delivers a long list of ISV application and technology support. ECS provides you with more
control of your data assets with enterprise-class object, file, and HDFS storage in a secure and compliant system. With
the new management, monitoring, and scripting capabilities, you can offer storage as a service within your enterprise.
ECS features a flexible software-defined architecture that is layered in such a way to promote limitless scalability. Each
layer is completely abstracted and independently scalable with high availability and no single point of failure.
ECS brings public cloud capabilities within your own datacenter and behind the corporate firewall. Leveraging COTS
(commodity off-the-shelf) hardware, high storage efficiency, a smaller datacenter footprint, and simple management,
enterprises can benefit from public-cloud economics within their own datacenters.
Media Reviewer
A sample custom client application that runs on Windows platforms and uses the S3 API interface to extract data from
ECS.
Media Loader
A sample S3 client application that ingests image data into the ECS cloud storage platform using the S3 object
protocol.
Zeppelin Notebook
A sample general-purpose client application. It is a web-based notebook that enables interactive data analytics and
the use of Spark SQL on top of the ECS metadata search APIs.