
Big Data Architecture

Big data architecture is the overarching system that a business uses to steer its data analytics work.
Planning this analytics system ahead of time is crucial for success.

Posted June 8, 2017


By Christine Taylor


Big data architecture is the foundation for big data analytics. Think of big data architecture as an
architectural blueprint of a large campus or office building. Architects begin by understanding the
goals and objectives of the building project, and the advantages and limitations of different
approaches. It's not an easy task, but it's perfectly doable with the right planning and tools.

System architects go through a similar process to plan big data architecture. They meet with
stakeholders to understand the company's objectives for its big data, and plan the computing framework
with appropriate hardware and software, data sources and formats, analytics tools, data storage
decisions, and results consumption.

Do I Need Big Data Architecture?


Not everyone needs big data architecture. Single computing tasks rarely exceed
100GB of data, which does not require a big data architecture. Unless you are analyzing
terabytes or petabytes of data, and doing it consistently, look to a scalable server instead of a
massively scale-out architecture like Hadoop. If you need analytics, then consider a scalable array
that offers native analytics for stored data.

You probably do need big data architecture if any of the following applies to you:

1. You want to extract information from extensive networking or web logs.
2. You process massive datasets over 100GB in size. Some of these computing tasks run 8
hours or longer.
3. You are willing to invest in a big data project, including third-party products to optimize your
environment.
4. You store large amounts of unstructured data that you need to summarize or transform into
a structured format for better analytics.
5. You have multiple large data sources to analyze, including structured and unstructured.
6. You want to proactively analyze big data for business needs, such as analyzing store sales
by season and advertising, applying sentiment analysis to social media posts,
investigating email for suspicious communication patterns, or all of the above.

With use cases like these, chances are that your organization will benefit from a big data
architecture expressly built for these challenging tasks. Plan for an environment that will capture,
store, transform, and communicate this valuable intelligence.

Planning the Big Data Architecture


Big data architecture includes mechanisms for ingesting, protecting, processing, and transforming
data into filesystems or database structures. Analytics tools and analyst queries run in the
environment to mine intelligence from the data, and the results flow to a variety of outputs.

The architecture has multiple layers. Let's start by discussing the Big Four logical layers that exist in
any big data architecture.

1. Big data sources layer: Data sources for big data architecture are all over the map. Data
can come through from company servers and sensors, or from third-party data providers.
The big data environment can ingest data in batch mode or in real time. A few data source
examples include enterprise applications like ERP or CRM, MS Office docs, data
warehouses and relational database management systems (RDBMS), other databases, mobile
devices, sensors, social media, and email.
2. Data massaging and storage layer: This layer receives data from the sources. If
necessary, it converts unstructured data to a format that analytic tools can understand and
stores the data according to its format. The big data architecture might store structured data
in an RDBMS, and unstructured data in a specialized file system like Hadoop Distributed File
System (HDFS) or in a NoSQL database.
3. Analysis layer: The analytics layer interacts with stored data to extract business
intelligence. Multiple analytics tools operate in the big data environment. Structured data
supports mature technologies like sampling, while unstructured data needs more advanced
(and newer) specialized analytics toolsets.
4. Consumption layer: This layer receives analysis results and presents them to the
appropriate output. Outputs serve human viewers, applications, and
business processes.
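
The four layers above can be pictured as functions that hand data to one another. The following is only a minimal sketch in Python: the sample records, the function names, and the "JSON means structured" rule are all assumptions made for illustration, not part of any particular product.

```python
import json
from collections import Counter

# Hypothetical sample records; a real deployment would read from ERP/CRM
# systems, sensors, email, and so on.
RAW_SOURCES = [
    {"source": "crm", "payload": '{"customer": "a1", "amount": 20.0}'},
    {"source": "crm", "payload": '{"customer": "b2", "amount": 5.5}'},
    {"source": "email", "payload": "please cancel my order"},
]


def sources_layer():
    """Layer 1: hand over raw records from every source."""
    return list(RAW_SOURCES)


def storage_layer(records):
    """Layer 2: parse what can be parsed and store each record by format."""
    store = {"structured": [], "unstructured": []}
    for record in records:
        try:
            parsed = json.loads(record["payload"])
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            store["structured"].append(parsed)
        else:
            store["unstructured"].append(record["payload"])
    return store


def analysis_layer(store):
    """Layer 3: run a simple analysis against each kind of stored data."""
    total = sum(row["amount"] for row in store["structured"])
    words = Counter(w for text in store["unstructured"] for w in text.split())
    return {"total_amount": total, "word_counts": dict(words)}


def consumption_layer(results):
    """Layer 4: present results to a consumer (here, a one-line report)."""
    return f"Total sales: {results['total_amount']:.2f}"


results = analysis_layer(storage_layer(sources_layer()))
report = consumption_layer(results)
print(report)
```

In a real environment each function would be a separate system (connectors, HDFS or an RDBMS, analytics tools, dashboards), but the hand-off order stays the same.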

In addition to the logical layers, four major processes operate cross-layer in the big data
environment: data source connection, governance, systems management, and quality of service
(QoS).

1. Connecting to data sources: Fast data ingress requires connectors and adapters that can
efficiently connect to different storage systems, protocols, and networks; and data formats
running the gamut from database records to social media content to sensors.
2. Governing big data: Big data architecture includes governance provisions for privacy and
security. Organizations can choose to use native compliance tools on analytics storage
systems, invest in specialized compliance software for their Hadoop environment, or sign
service level security agreements with their cloud Hadoop provider. Compliance policies
must operate from the point of ingestion through processing, storage, analysis, and deletion
or archive.
3. Managing systems: Big data architecture is typically built on large-scale distributed
clusters with highly scalable performance and capacity. IT must continually monitor and
address system health via central management consoles. If your big data environment is in
the cloud, you will still need to spend time and effort to establish and monitor strong service
level agreements (SLAs) with your cloud provider.
4. Protecting quality of service: QoS is the framework that supports defining data quality,
compliance policies, ingestion frequency and sizes, and filtering data. For example, a public
cloud provider experimented with QoS-based data storage scheduling in a cloud-based,
distributed big data environment. The provider wanted to improve the availability and response
time of the data massaging and storage layer, so they automatically routed ingested data to
predefined virtual clusters based on QoS service levels.
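
The QoS routing example in the last item can be sketched as a simple lookup. The service level names ("gold", "silver", "bronze") and the virtual cluster names below are hypothetical; the article does not say which levels or clusters the provider actually used.

```python
from dataclasses import dataclass

# Hypothetical mapping from QoS service level to a predefined virtual cluster.
VIRTUAL_CLUSTERS = {
    "gold": "vc-low-latency",
    "silver": "vc-standard",
    "bronze": "vc-batch",
}


@dataclass
class IngestRequest:
    dataset: str
    qos_level: str


def route(request):
    """Return the virtual cluster that should store this ingested data."""
    try:
        return VIRTUAL_CLUSTERS[request.qos_level]
    except KeyError:
        raise ValueError(f"unknown QoS level: {request.qos_level!r}") from None


target = route(IngestRequest(dataset="clickstream", qos_level="gold"))
print(target)
```

Rejecting unknown levels outright, rather than falling back to a default cluster, is a design choice made here so that misconfigured sources surface early.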

Big data architecture combines myriad concerns into one all-encompassing plan to make
the most of a company's data mining efforts.

Critical Components

Let's look at a big data architecture using Hadoop as a popular example ecosystem. Hadoop is open source,
and several vendors and large cloud providers offer Hadoop systems and support. There are also
numerous open source and commercial products that expand Hadoop capabilities.

Core Clusters

Hadoop architecture is cluster architecture. Hadoop runs on commodity servers; the recommended
configuration is dual-CPU servers with 4-8 cores each and at least 48GB of RAM. (Using accelerated
analytics technologies like Apache Spark will speed up the environment even more.) Storage must also be
highly scalable.

Another option is a cloud Hadoop environment, where the cloud provider manages the infrastructure for
you. The cloud might add latency, you'll be in a shared environment, and you'll want to avoid
lock-in. But the cloud is an excellent choice for a new Hadoop installation, or when you know that
you don't want to grow your data center racks or IT staff to support on-premises Hadoop.

Loading the Data


Loading data onto the clusters is an ongoing event. Hadoop supports both batched data such as
loading in files or records at specific times of the day, and event-driven data such as loading
transactional data as the transactions occur. Software tools for loading source data include Apache
Sqoop for batch loading and Apache Flume for event-driven data loading.
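
The difference between the two loading styles can be illustrated without either tool. The sketch below uses plain Python and does not call Sqoop or Flume; the `LandingZone` class and its method names are invented for this example.

```python
import csv
import io


class LandingZone:
    """Hypothetical in-memory stand-in for the cluster's landing area."""

    def __init__(self):
        self.rows = []

    def load_batch(self, file_obj):
        """Batch loading: read a whole file of records at a scheduled time."""
        rows = list(csv.DictReader(file_obj))
        self.rows.extend(rows)
        return len(rows)

    def on_event(self, event):
        """Event-driven loading: append one record as soon as it occurs."""
        self.rows.append(event)


landing = LandingZone()

# A nightly export arrives as one file.
nightly_export = io.StringIO("id,amount\n1,10\n2,15\n")
loaded = landing.load_batch(nightly_export)

# A transaction arrives on its own, as it happens.
landing.on_event({"id": "3", "amount": "7"})
print(loaded, len(landing.rows))
```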

Your big data environment will also stage the incoming data for processing, including converting
data as needed and sending it to the correct storage in the right format. Additional activities include
partitioning data and assigning access controls.
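
As a rough illustration of staging, the sketch below converts string fields to typed values, partitions rows by month, and attaches a reader list to each partition. The month-based partition key and the "analysts" reader group are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical default access control: which groups may read a partition.
DEFAULT_READERS = frozenset({"analysts"})


def stage(raw_rows, readers=DEFAULT_READERS):
    """Convert, partition, and assign access controls to incoming rows."""
    partitions = defaultdict(list)
    for row in raw_rows:
        day = date.fromisoformat(row["date"])           # convert text to a date
        converted = {"date": day, "amount": float(row["amount"])}
        partitions[f"{day.year}-{day.month:02d}"].append(converted)
    return {
        key: {"rows": rows, "readers": set(readers)}
        for key, rows in partitions.items()
    }


staged = stage([
    {"date": "2017-06-08", "amount": "19.99"},
    {"date": "2017-06-09", "amount": "5.00"},
    {"date": "2017-07-01", "amount": "42"},
])
print(sorted(staged))
```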

Processing the Data


Once the system has ingested, identified, and stored the data, it will automatically process it. This is
a two-step process of transforming the data and analyzing it. Transforming the data simply means
processing it into analytics-ready formats and/or compressing it.

In Hadoop, this is MapReduce territory. MapReduce is the core component of Hadoop that filters
(maps) data among nodes, and aggregates (reduces) data returned in response to a query.
MapReduce achieves high performance thanks to parallel operations across massive clusters, and
its fault tolerance reassigns work from a failing node. MapReduce works on both structured and
unstructured data.
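
The map and reduce steps can be shown with a word count, the customary introductory example. This sketch runs in a single Python process and does not use Hadoop's API; in a real cluster each chunk would be mapped on a different node in parallel, which is not simulated here.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (key, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def map_reduce(chunks):
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


# Each string stands in for a block of a log file stored on a different node.
counts = map_reduce(["ERROR warn error", "warn info"])
print(counts)
```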

Many analysts and vendors run MapReduce (MR) with additional filters, like adding collaborative filtering
to MR to identify user preferences in Twitter data. Other analytics products replace it, such as Google's
proprietary Cloud Dataflow.

Output and Querying


One of Hadoop's shining features is that once data is processed and placed, different analytics
tools can operate on the unchanging data set. There is no need to re-process it for different tools, or
to copy it to different locations. The same copy of data serves for all queries.

Output covers a variety of destinations, including reports and dashboard visualizations for users, or
next-step triggers in business processes.

Data Pipelines

Micro- and macro-pipelines enable discrete processing steps. Micro-pipelines operate at a
step-based level to create sub-processes on granular data. In a typical scenario, one source of data is
customer transactional data from the company's primary data center. The data enters Hadoop so
company analysts can investigate customer churn. However, compliance is an issue because the
data includes customer credit card numbers. A micro-pipeline adds a granular processing step that
removes credit card numbers from the analyst team's reports.
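
A micro-pipeline step of this kind might look like the sketch below. The regular expression is a deliberately simple assumption (13 to 16 digits, optionally separated by spaces or hyphens) and is not a complete or compliance-grade card detector; the function names are invented for the example.

```python
import re

# Assumed pattern: 13-16 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def mask_card_numbers(text):
    """Replace anything that looks like a card number with a placeholder."""
    return CARD_PATTERN.sub("[REDACTED]", text)


def clean_record(record):
    """Micro-pipeline step: mask card numbers in every string field."""
    return {
        key: mask_card_numbers(value) if isinstance(value, str) else value
        for key, value in record.items()
    }


cleaned = clean_record({
    "customer_id": 42,
    "note": "Paid with 4111 1111 1111 1111 on Monday",
})
print(cleaned["note"])
```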

Macro-pipelines operate on a workflow level. They define 1) workflow control: what steps enable the
workflow, and 2) action: what occurs at each stage to enable proper workflow.
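
One way to picture the two parts of a macro-pipeline is a list of steps, each pairing a control condition with an action. This is a sketch under that assumption; the step names and the `run_workflow` helper are hypothetical.

```python
def run_workflow(steps, data):
    """Run each step whose control condition holds; return data and a log.

    steps: list of (name, should_run, action) tuples, where should_run is
    the workflow control and action is what happens at that stage.
    """
    completed = []
    for name, should_run, action in steps:
        if not should_run(data):
            continue  # control says this stage is not needed
        data = action(data)
        completed.append(name)
    return data, completed


steps = [
    ("drop_empty", lambda rows: True,
     lambda rows: [r for r in rows if r]),
    ("dedupe", lambda rows: len(rows) != len(set(rows)),
     lambda rows: sorted(set(rows))),
]

result, completed = run_workflow(steps, ["b", "a", "", "a"])
print(result, completed)
```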

Big Data Architecture: Crucial for Analytics Success

Big data architecture takes ongoing attention and investment. Before you run screaming for the
hills, remember that a well-executed big data architecture will do much of this for you behind the
scenes. You can offload even more planning and management tasks if you're working with
consultants and service providers.

Despite complexity and cost, big data architecture lets you extract vital business information from
your otherwise opaque data for higher profit and lower risk. Done well, these results are more than
worth the price of admission.