TDWI CHECKLIST REPORT
DECEMBER 2016
By Philip Russom
Sponsored by Pentaho
TABLE OF CONTENTS
FOREWORD
NUMBER ONE: Design a data lake for both business and technology goals.
NUMBER TWO: Simplify your data lake with a scalable onboarding process.
NUMBER THREE: Rely on data integration infrastructure to make the lake work.
NUMBER FOUR: Integrate your data lake with enterprise data architectures.
NUMBER FIVE: Embrace new data management best practices for the data lake.
NUMBER SIX: Empower new best practices for business analytics via a data lake.
ABOUT OUR SPONSOR
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT TDWI CHECKLIST REPORTS
© 2016 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or
in part are prohibited except by written permission. Email requests or feedback to info@tdwi.org.
Product and company names mentioned herein may be trademarks and/or registered trademarks of
their respective companies.
FOREWORD
A confluence of trends is pushing many organizations toward the use
of data lakes.
Businesses depend on data more heavily than ever before. They
need data for fact-based decision making (both operational and
strategic) as well as to run the business. Many want to deepen their
commitment to data-driven processes by competing on analytics
and by designing new products and services.
Technical teams are under pressure to capture new data and
develop its business value. Many organizations are now receiving
data from more sources than ever before, ranging from the Internet
of Things (including sensors, vehicles, mobile devices, and robots)
to Web-based applications and social media. Savvy businesses
are eager to capture and use big data and other new data sources
because they realize they can leverage these data assets for driving
analytics insights, competitive advantage, operational excellence,
and other business value.
Note that both trends are driving toward a greater use of analytics
because analytics is a critical component of modern strategies
for growth, profitability, efficiency, and competitiveness. Many
organizations already have mature programs for set-based
analytics based on SQL and OLAP. Those are still valuable and
relevant investments, although they are based on well-defined
business facts that must be tracked over time in reports and
dimensional data.
As a complement to those, forward-looking businesses today need
discovery-oriented analytics, which helps them learn new things
about customers, partners, operations, locations, employees,
competitors, and other important entities and parties. The catch is
that discovery analytics tends to work best with large volumes of raw
source data because that kind of data set is used by technologies
for mining, statistics, graph analytics, clustering, natural language
processing (NLP), and other forms of advanced analytics.
The point of managing data in its original raw state is so that its
details can be repurposed repeatedly as new business requirements
and opportunities for new analytics applications arise. After all, once
data is remodeled, standardized, and otherwise transformed (as is
required for report-oriented data warehousing), its applicability for
other unforeseen use cases is greatly narrowed.
Hadoop is the preferred platform for data lakes. Due to the
great size and diversity of data in a lake, plus the many ways
data must be processed and repurposed, Hadoop has become an
important enabling platform for data lakes (and for other purposes,
too, of course). Hadoop scales linearly, supports a wide range of
processing techniques, and costs a fraction of similar relational
configurations. For these reasons, Hadoop is now the preferred data
platform for data lakes.
Successful data lakes depend on significant data integration
(DI) infrastructure. Given the great diversity of data types and
sources seen in data lakes, DI must support a wide range of
interfaces, platforms, data structures, and processing methods.
Furthermore, DI helps define the key characteristics of a data lake,
such as early ingestion, metadata management, and repurposing
data on the fly for queries, data prep, and other ad hoc practices.
Equally important, DI must make Hadoop easier to work with
by supporting multiple interfaces to Hadoop, enriching Hadoop
with metadata, and pushing processing into Hadoop without any
programming required.
This TDWI Checklist Report will discuss many of the emerging best
practices for data lakes, including technical data management
issues and practical business use cases.
The data lake enables analytics with big data and other diverse
sources. A design pattern known as the data lake has arisen to
address the need for business analytics with large volumes of
raw, detailed source data. A data lake is, first and foremost, a
repository for raw data. A data lake tends to manage highly diverse
data types and can scale to handle tens or hundreds of terabytes,
sometimes petabytes. A data lake is optimized to ingest raw data
quickly from both new and traditional sources.
tdwi.org
NUMBER ONE
Govern the data lake and its users for business compliance. After
all, governance is a goal for all enterprise data collections, including
lakes. Note that the self-service data access and broad data
exploration discussed earlier are inherently risky in terms of privacy
violations and compliance infractions.
To avoid these problems, data governance policies should be
updated or extended to encompass data from the lake, and users
should be trained in how the policies affect their work with data
in the lake. The data integration tools used for the lake's data
onboarding process can help by identifying sensitive data and
applying encryption or masking policies where necessary. All
tools used to access or analyze a lake's data should have security
measures enabled.
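To make the masking idea concrete, here is a minimal Python sketch of applying a masking policy during onboarding. The column list, the regular expression, and the `apply_masking` helper are hypothetical illustrations of the technique, not any particular tool's API; real DI tools use curated classification rules and managed keys rather than a hard-coded salt.

```python
import hashlib
import re

# Hypothetical policy: column names and value patterns treated as sensitive.
SENSITIVE_COLUMNS = {"ssn", "email", "phone"}
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def mask(value):
    """Replace a sensitive value with a salted one-way hash (pseudonymization)."""
    return hashlib.sha256(("salt:" + value).encode()).hexdigest()[:12]

def apply_masking(record):
    """Mask columns named in the policy, plus any value that looks like an email."""
    out = {}
    for column, value in record.items():
        if column.lower() in SENSITIVE_COLUMNS:
            out[column] = mask(str(value))
        elif isinstance(value, str) and EMAIL_PATTERN.fullmatch(value):
            out[column] = mask(value)
        else:
            out[column] = value
    return out

raw = {"customer_id": 42, "email": "ann@example.com", "region": "EMEA"}
safe = apply_masking(raw)
print(safe["region"])                 # unchanged: EMEA
print(safe["email"] != raw["email"])  # True: masked before landing in the lake
```

Because the hash is deterministic, masked values can still be joined and counted in analytics without exposing the underlying identifier.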
NUMBER TWO
Organizations seeking to capture and leverage big data via a Hadoop-based data lake are typically besieged with a swelling swarm of data
sources and schemas. Many organizations start by employing hand-coded programming and manual methods for data ingestion, but they
soon realize this approach is not scalable.
The issue is that many forms of big data lack obvious or easily
accessed metadata, which seriously impedes data onboarding,
ingestion, exploration, and analytics. However, new technologies can
now detect structure on the fly at run time, generate metadata, then
manage that metadata in a shared repository.
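The run-time approach above can be sketched in a few lines of Python: the schema is inferred from raw records as they arrive rather than declared in advance, and the generated metadata is registered in a shared repository. The `MetadataRepository` class and the `iot_feed` source name are hypothetical illustrations of the pattern, not a specific product's interface.

```python
import json
from collections import Counter

def infer_field_types(records):
    """Inspect raw records and derive a simple schema: field -> most common type."""
    observed = {}
    for record in records:
        for field, value in record.items():
            observed.setdefault(field, Counter())[type(value).__name__] += 1
    return {field: types.most_common(1)[0][0] for field, types in observed.items()}

class MetadataRepository:
    """Stand-in for a shared repository that registers generated schemas by source."""
    def __init__(self):
        self._schemas = {}
    def register(self, source, schema):
        self._schemas[source] = schema
    def lookup(self, source):
        return self._schemas.get(source)

# Raw JSON lines as they might arrive from a new, undocumented source.
raw_lines = [
    '{"device": "sensor-7", "temp": 21.5, "ok": true}',
    '{"device": "sensor-9", "temp": 19.0, "ok": false}',
]
records = [json.loads(line) for line in raw_lines]
repo = MetadataRepository()
repo.register("iot_feed", infer_field_types(records))
print(repo.lookup("iot_feed"))  # {'device': 'str', 'temp': 'float', 'ok': 'bool'}
```

Once the schema sits in a shared repository, exploration and analytics tools can discover the new source without waiting for a manually authored data dictionary.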
NUMBER THREE
Data prep is designed for self-service data access and light data set creation.
Data prep targets nontechnical and mildly technical users, and these
classes of users typically assume metadata will be available within
the data lake.
Note that the emerging tools and best practices for data prep do not
replace the mature tools and best practices of traditional
data integration. The two are designed for different classes of users,
in different development contexts, so they are complementary. A
modern, up-to-date DI environment should support both.
NUMBER FOUR
User organizations are interested in the data lake right now because
it has many beneficial use cases in data warehouse architectures
and related technology stacks for analytics. Data lakes also have
desirable uses in other complex data ecosystems across enterprises,
especially multichannel marketing, data archiving, content
management, ERP, and the supply chain.
The overall point is that a data lake rarely exists in a vacuum.
Instead, most data lakes are integrated into larger enterprise data
architectures. It behooves users to design the data lake and to select
tools that will result in a lake that integrates and interoperates well
with other enterprise processes, as in the following:
Multiplatform data warehouse environments (DWEs). A recent
TDWI survey indicates that the number of deployed Hadoop clusters
is up 60 percent over two years. In another TDWI survey, 17 percent
of data warehouse professionals already have Hadoop in production
in their extended DWE. The users surveyed also indicated that
Hadoop complements their primary traditional warehouse platform
without replacing it.
NUMBER FIVE
Data lakes, big data, and Hadoop are still new to most IT and data
management teams. The good news is that these teams'
prior experience and skills with traditional data platforms are still
relevant. However, to get maximum value from a Hadoop-based data
lake, technical teams will need to adjust older best practices in data
management and embrace a few emerging ones.
Developing metadata as you ingest or explore the data in the
lake. Data management professionals, especially those who work
in data warehousing, are used to extracting metadata from source
systems long before a DI solution connects to that system and
extracts actual data. That method simply doesn't work with many
new big data sources. For one thing, connecting to the source system
is often impossible (as with most sensors, which broadcast or push
out data without a connection). For another, many users must onboard
new data suddenly, then quickly develop metadata from it, as
discussed in the onboarding section of this report. Hence, many data
management professionals need to think in a new way so they can
adapt to the changing data landscape. Metadata management is just
one area where that's true.
Modeling data on the fly. Analytics data modeling is another skill
that is evolving. The most dramatic example is how a user creates
a data model as he or she explores the data in a lake with iterative
ad hoc queries. Likewise, the new practice of data prep is all about
creating a new data set (and typically a new metadata model)
very quickly, based on manipulating the actual data. This is quite
different from the older academic approach of designing a model
in a vacuum with a dedicated modeling tool, then testing the model
by loading data into it. The old practice too often leads to beautiful
models that don't fit the available data; the new practice avoids such
disconnects. As more ecosystems mix old and new data platforms
and their attendant practices, there's a need for both approaches to
data modeling.
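As a small illustration of modeling from the data itself, the Python sketch below explores a handful of raw records and then derives an aggregated data set (a churn rate keyed by region) directly from what the exploration revealed. The sample records and field names are hypothetical; real data prep tools do the same thing interactively at much larger scale.

```python
from collections import defaultdict

# Raw, detailed records as they might sit in the lake (hypothetical sample).
events = [
    {"customer": "a", "region": "west", "churned": True,  "spend": 120.0},
    {"customer": "b", "region": "west", "churned": False, "spend": 310.0},
    {"customer": "c", "region": "east", "churned": True,  "spend": 95.0},
]

# Step 1: ad hoc exploration -- what fields exist in this data?
print(sorted(events[0].keys()))

# Step 2: the exploration suggests a model; derive it directly from the data.
# The "model" here is simply an aggregated data set keyed by region.
churn_by_region = defaultdict(lambda: {"customers": 0, "churned": 0})
for e in events:
    churn_by_region[e["region"]]["customers"] += 1
    churn_by_region[e["region"]]["churned"] += int(e["churned"])

prepped = {
    region: round(stats["churned"] / stats["customers"], 2)
    for region, stats in churn_by_region.items()
}
print(prepped)  # {'west': 0.5, 'east': 1.0}
```

Note that no schema was designed up front; the shape of the prepped data set fell out of the actual records, which is the point of modeling on the fly.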
Adopting early ingestion and late processing. This is standard
procedure with a Hadoop-based data lake, and for good reasons. It
makes data available as soon as possible for reporting and analytics,
which enables business monitoring, operational decision making,
and other time-sensitive business practices. It avoids processing too
much data on the off chance that someone might need it (a common
cause of squandered resources in data warehousing). Early ingestion
also fosters a newfound respect for datas original state, which is
beneficial to emerging forms of discovery-oriented analytics.
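Early ingestion with late processing is often called schema-on-read, and a minimal sketch makes the division of labor clear: the lake lands payloads untouched with only arrival metadata, and a parser (the "schema") is applied when a consumer asks. The `LAKE` list, `ingest`, and `read_with_schema` names are illustrative stand-ins for raw file storage such as an HDFS directory.

```python
import json
import time

LAKE = []  # stand-in for raw file storage (e.g., an HDFS directory)

def ingest(raw_line):
    """Early ingestion: land the payload untouched, with only arrival metadata."""
    LAKE.append({"arrived_at": time.time(), "payload": raw_line})

def read_with_schema(parser):
    """Late processing: apply a parser (schema) only when a consumer asks."""
    return [parser(entry["payload"]) for entry in LAKE]

ingest('{"order": 1, "amount": 9.99}')
ingest('{"order": 2, "amount": 4.50}')

# Different consumers can apply different "schemas" to the same raw bytes.
totals = sum(rec["amount"] for rec in read_with_schema(json.loads))
print(round(totals, 2))  # 14.49
```

Because nothing was transformed at load time, the same raw payloads remain available for any future consumer with different parsing needs.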
NUMBER SIX
Managing diverse big data is important, and yet it's just a means
to an end. The end goal for savvy managers is to get business value
and drive organizational effectiveness from the data, not just capture
it in a cost center, and the primary path to that value is through
analytics. In that context, here are some of the new analytics best
practices empowered by a Hadoop-based data lake:
Advanced analytics that complement older analytics. In a nutshell,
organizations need to preserve existing analytics based on reporting,
OLAP, and SQL and also complement these with advanced analytics
based on technologies for mining, clustering, graph, statistics, and
natural language processing. Traditional forms of analytics and
reporting are mostly about tracking the facts and business entities
that you know well and need to monitor over time. New, advanced
forms of analytics are mostly about discovering facts that you did not
know before, as well as linking together highly diverse facts, events,
and entity characteristics (such as customer behavior, partner
reliability, or operational metrics) to form new insights and develop
new business opportunities.
Note that traditional analytics tends to require squeaky-clean
data on a relational platform for highly precise and structured
output, as seen in standard reports and cubes. However, today's
analytics focuses on raw, detailed data because it fuels discoveries
and complex linkages without the need for obsessive precision or
structure. Given the differences in data requirements, a growing
number of data warehouse and data management teams work with
both relational databases and Hadoop-based data lakes.
Multiple forms of analytics in tandem. One of the strongest trends
in analytics is toward using multiple forms of analytics. That's
because each method tells you something different about the same
issue. Connect multiple analytics results in tandem, and you get a
more comprehensive insight for business advantage.
When a Hadoop-based data lake captures and manages data in
its raw original state, data can easily be repurposed for multiple
forms of analytics. Depending on the design and data content of an
individual data lake, it may support both set-based analytics (based
on OLAP, SQL, and other relational techniques) and algorithmic
analytics (based on mining, clustering, graph, statistics, and NLP).
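The two styles can be shown side by side on the same raw records: a set-based SQL aggregation over well-defined facts, and an algorithmic pass over the same data. The purchase records, category totals, and fixed centroids below are hypothetical, and the "clustering" is deliberately crude (nearest fixed centroid) to keep the sketch short; real algorithmic analytics would learn the centroids from the data.

```python
import sqlite3

# The same raw purchase records feed two styles of analytics.
rows = [("ann", "books", 12.0), ("bob", "books", 300.0),
        ("cai", "games", 15.0), ("dee", "games", 290.0)]

# Set-based analytics: a SQL aggregation over well-defined facts.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE purchases (buyer TEXT, category TEXT, amount REAL)")
db.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
by_category = dict(db.execute(
    "SELECT category, SUM(amount) FROM purchases GROUP BY category"))
print(by_category)  # books: 312.0, games: 305.0

# Algorithmic analytics: cluster the same raw amounts into "small" and
# "large" baskets by distance to a fixed centroid.
centroids = {"small": 20.0, "large": 250.0}
clusters = {
    buyer: min(centroids, key=lambda c: abs(amount - centroids[c]))
    for buyer, _, amount in rows
}
print(clusters)  # ann and cai land in "small"; bob and dee in "large"
```

Each pass answers a different question about the same data, which is why repurposable raw data is so valuable: neither analysis required reshaping the records for the other.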
Integrated sequence of self-service best practices. One of the
most desirable emerging analytics practices today is to connect,
in sequence, several related self-service data-driven tasks. The
sequence typically follows this order: data access, exploration, prep,
visualization, and analysis. For example, as users access and explore
data, they may discover something meaningful in the data, such
as the root cause of the most recent churn or a cost center that's
eroding profits from the bottom line. After the discovery, they want
to quickly prepare a data set based on what they learned, then share
the prepped data set with colleagues or seamlessly move the data
set to other tools for further analysis and visualization.
The assumption is that several tool types are employed (one per
step in this multistep analytics process), and the tools are tightly
integrated for seamless handoff. This multistep analytics process
seems to work well with Hadoop-based data lakes, but only when
users are given an integrated toolset that supports self-service. As
discussed earlier, self-service isn't for everyone; it succeeds when
provided to certain classes of users, who are governed carefully.
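The access-explore-prep handoff can be sketched as a simple function chain. Each function stands in for a tool in the integrated toolset; the cost-center records, the `explore` heuristic, and the function names are hypothetical illustrations of the sequence, not any vendor's workflow.

```python
def access(source):
    """Self-service access: pull raw records (hypothetical in-memory source)."""
    return list(source)

def explore(records):
    """Exploration: surface a candidate insight -- here, the costliest center."""
    worst = max(records, key=lambda r: r["cost"])
    return records, worst["center"]

def prep(records, center):
    """Prep: build a focused data set around what exploration uncovered."""
    return [r for r in records if r["center"] == center]

raw = [{"center": "ops", "cost": 40}, {"center": "it", "cost": 90},
       {"center": "it", "cost": 75}]
records, suspect = explore(access(raw))
handoff = prep(records, suspect)  # passed on for visualization and analysis
print(suspect, len(handoff))  # it 2
```

The value of the tight integration the report describes is exactly this seamlessness: the prepped set flows to the next tool without an export-and-reload detour.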
Analytics value from human language, text, and other
unstructured data. Theoretically, you can put any data or other
digital information in a file, and Hadoop can manage it and
make it available for analytics processing. Within the category of
unstructured data, file-based human language and other text is
already being leveraged via analytics. The killer app is sentiment
analysis, which scans mountains of comments from customers,
prospects, and other people (perhaps drawn from social media or
text fields in call center apps) to determine what the marketplace
is saying about your firm, its products, or its services. As another
example, the claims process in insurance captures a ton of text
about losses; insurance companies collect this in lakes, process it
to extract facts about entities of interest, and use the output data
to extend analytics applications in fraud detection and actuarial
calculations. Similar text-driven analytics are seen in the patient
outcome analyses of healthcare (by both insurer and provider).
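At its simplest, sentiment analysis scores text by comparing it against word lists, as in the Python sketch below. The word lists and sample comments are hypothetical; production sentiment analysis relies on trained models or curated lexicons with negation handling, but the scoring idea is the same.

```python
# Hypothetical sentiment lexicons for illustration only.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"broken", "slow", "hate", "refund"}

def sentiment(comment):
    """Score a comment by counting positive versus negative words."""
    words = comment.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

comments = ["I love the fast shipping", "slow support and a broken part"]
print([sentiment(c) for c in comments])  # ['positive', 'negative']
```

Run over millions of social media posts or call-center text fields in a lake, even a scoring pass this simple yields an aggregate read on what the marketplace is saying.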
Analytics focused on a single department or application. Some
data lakes are designed for multitenant enterprise use, as when
the lake is part of a larger data warehouse environment or provided
by central IT. However, other data lakes are designed specifically
for a single department or application. The departmental analytics
program is a rising phenomenon because most analytics applications
have a departmental association. For example, sales and marketing
want to own and control customer analytics so they get the
information they need for campaigns, telemarketing, prospecting,
and their general understanding of customers. Likewise, the financial
department needs to control financial analytics, and procurement
needs to control supply-chain analytics. As departmental analytics
programs are modernized, they tend to integrate Hadoop, then
migrate data collections to a data lake atop it. Hadoop's relatively
low cost puts the data lake within reach of departmental budgets.
www.pentaho.com
Pentaho, a Hitachi Group company, is a leading data integration
and business analytics company with an enterprise-class, open
source-based platform for diverse big data deployments. Pentaho's
unified data integration and analytics platform is comprehensive,
completely embeddable, and delivers governed data to power
any analytics in any environment. Pentaho's mission is to help
organizations across multiple industries harness the value from
all their data, including big data and IoT, enabling them to find
new revenue streams, operate more efficiently, deliver outstanding
service, and minimize risk. Pentaho has over 15,000 product
deployments and 1,500 commercial customers today, including
ABN-AMRO Clearing, BT, Caterpillar Marine Asset Intelligence, EMC,
Moody's, NASDAQ, Opower, and Sears Holdings Corporation. For more
information, visit www.pentaho.com.