TDWI CHECKLIST REPORT
DECEMBER 2016
By Philip Russom
Sponsored by Pentaho
TABLE OF CONTENTS
FOREWORD
NUMBER ONE: Design a data lake for both business and technology goals.
NUMBER TWO: Simplify your data lake with a scalable onboarding process.
NUMBER THREE: Rely on data integration infrastructure to make the lake work.
NUMBER FOUR: Integrate your data lake with enterprise data architectures.
NUMBER FIVE: Embrace new data management best practices for the data lake.
NUMBER SIX: Empower new best practices for business analytics via a data lake.
ABOUT OUR SPONSOR
ABOUT THE AUTHOR
ABOUT TDWI RESEARCH
ABOUT TDWI CHECKLIST REPORTS
© 2016 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or
in part are prohibited except by written permission. Email requests or feedback to info@tdwi.org.
Product and company names mentioned herein may be trademarks and/or registered trademarks of
their respective companies.
FOREWORD
A confluence of trends is pushing many organizations toward the use
of data lakes.
Businesses depend on data more heavily than ever before. They
need data for fact-based decision making (both operational and
strategic) as well as to run the business. Many want to deepen their
commitment to data-driven processes by competing on analytics
and by designing new products and services.
Technical teams are under pressure to capture new data and
develop its business value. Many organizations are now receiving
data from more sources than ever before, ranging from the Internet
of Things (including sensors, vehicles, mobile devices, and robots)
to Web-based applications and social media. Savvy businesses
are eager to capture and use big data and other new data sources
because they realize they can leverage these data assets for driving
analytics insights, competitive advantage, operational excellence,
and other business value.
Note that both trends are driving toward a greater use of analytics
because analytics is a critical component of modern strategies
for growth, profitability, efficiency, and competitiveness. Many
organizations already have mature programs for set-based
analytics based on SQL and OLAP. Those are still valuable and
relevant investments, although they are based on well-defined
business facts that must be tracked over time in reports and
dimensional data.
As a complement to those, forward-looking businesses today need
discovery-oriented analytics, which helps them learn new things
about customers, partners, operations, locations, employees,
competitors, and other important entities and parties. The catch is
that discovery analytics tends to work best with large volumes of raw
source data because that kind of data set is used by technologies
for mining, statistics, graph analytics, clustering, natural language
processing (NLP), and other forms of advanced analytics.
The point of managing data in its original raw state is so that its
details can be repurposed repeatedly as new business requirements
and opportunities for new analytics applications arise. After all, once
data is remodeled, standardized, and otherwise transformed (as is
required for report-oriented data warehousing), its applicability for
other unforeseen use cases is greatly narrowed.
Hadoop is the preferred platform for data lakes. Due to the
great size and diversity of data in a lake, plus the many ways
data must be processed and repurposed, Hadoop has become an
important enabling platform for data lakes (and for other purposes,
too, of course). Hadoop scales linearly, supports a wide range of
processing techniques, and costs a fraction of similar relational
configurations. For these reasons, Hadoop is now the preferred data
platform for data lakes.
Successful data lakes depend on significant data integration
(DI) infrastructure. Given the great diversity of data types and
sources seen in data lakes, DI must support a wide range of
interfaces, platforms, data structures, and processing methods.
Furthermore, DI helps define the key characteristics of a data lake,
such as early ingestion, metadata management, and repurposing
data on the fly for queries, data prep, and other ad hoc practices.
Equally important, DI must make Hadoop easier to work with
by supporting multiple interfaces to Hadoop, enriching Hadoop
with metadata, and pushing processing into Hadoop without any
programming required.
This TDWI Checklist Report will discuss many of the emerging best
practices for data lakes, including technical data management
issues and practical business use cases.
The data lake enables analytics with big data and other diverse
sources. A design pattern known as the data lake has arisen to
address the need for business analytics with large volumes of
raw, detailed source data. A data lake is, first and foremost, a
repository for raw data. A data lake tends to manage highly diverse
data types and can scale to handle tens or hundreds of terabytes,
sometimes petabytes. A data lake is optimized to ingest raw data
quickly from both new and traditional sources.
tdwi.org
NUMBER ONE
Govern the data lake and its users for business compliance. After
all, governance is a goal for all enterprise data collections, including
lakes. Note that the self-service data access and broad data
exploration discussed earlier are inherently risky in terms of privacy
violations and compliance infractions.
To avoid these problems, data governance policies should be
updated or extended to encompass data from the lake, and users
should be trained in how the policies affect their work with data
in the lake. The data integration tools used for the lake's data
onboarding process can help by identifying sensitive data and
applying encryption or masking policies where necessary. All
tools used to access or analyze a lake's data should have security
measures enabled.
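To make the masking idea concrete, here is a minimal Python sketch of applying a masking policy during onboarding. The column list, the regular expression, and the `apply_masking` helper are hypothetical illustrations of the technique, not any particular tool's API; real DI tools use curated classification rules and managed keys rather than a hard-coded salt.

```python
import hashlib
import re

# Hypothetical policy: column names and value patterns treated as sensitive.
SENSITIVE_COLUMNS = {"ssn", "email", "phone"}
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def mask(value):
    """Replace a sensitive value with a salted one-way hash (pseudonymization)."""
    return hashlib.sha256(("salt:" + value).encode()).hexdigest()[:12]

def apply_masking(record):
    """Mask columns named in the policy, plus any value that looks like an email."""
    out = {}
    for column, value in record.items():
        if column.lower() in SENSITIVE_COLUMNS:
            out[column] = mask(str(value))
        elif isinstance(value, str) and EMAIL_PATTERN.fullmatch(value):
            out[column] = mask(value)
        else:
            out[column] = value
    return out

raw = {"customer_id": 42, "email": "ann@example.com", "region": "EMEA"}
safe = apply_masking(raw)
print(safe["region"])                 # unchanged: EMEA
print(safe["email"] != raw["email"])  # True: masked before landing in the lake
```

Because the hash is deterministic, masked values can still be joined and counted in analytics without exposing the underlying identifier.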
NUMBER TWO
Organizations seeking to capture and leverage big data via a Hadoop-based data lake are typically besieged with a swelling swarm of data
sources and schemas. Many organizations start by employing hand-coded programming and manual methods for data ingestion, but they
soon realize this approach is not scalable.
The issue is that many forms of big data lack obvious or easily
accessed metadata, which seriously impedes data onboarding,
ingestion, exploration, and analytics. However, new technologies can
now detect structure on the fly at run time, generate metadata, then
manage that metadata in a shared repository.
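The run-time approach above can be sketched in a few lines of Python: the schema is inferred from raw records as they arrive rather than declared in advance, and the generated metadata is registered in a shared repository. The `MetadataRepository` class and the `iot_feed` source name are hypothetical illustrations of the pattern, not a specific product's interface.

```python
import json
from collections import Counter

def infer_field_types(records):
    """Inspect raw records and derive a simple schema: field -> most common type."""
    observed = {}
    for record in records:
        for field, value in record.items():
            observed.setdefault(field, Counter())[type(value).__name__] += 1
    return {field: types.most_common(1)[0][0] for field, types in observed.items()}

class MetadataRepository:
    """Stand-in for a shared repository that registers generated schemas by source."""
    def __init__(self):
        self._schemas = {}
    def register(self, source, schema):
        self._schemas[source] = schema
    def lookup(self, source):
        return self._schemas.get(source)

# Raw JSON lines as they might arrive from a new, undocumented source.
raw_lines = [
    '{"device": "sensor-7", "temp": 21.5, "ok": true}',
    '{"device": "sensor-9", "temp": 19.0, "ok": false}',
]
records = [json.loads(line) for line in raw_lines]
repo = MetadataRepository()
repo.register("iot_feed", infer_field_types(records))
print(repo.lookup("iot_feed"))  # {'device': 'str', 'temp': 'float', 'ok': 'bool'}
```

Once the schema sits in a shared repository, exploration and analytics tools can discover the new source without waiting for a manually authored data dictionary.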
NUMBER THREE
Data prep is designed for self-service data access and light data set creation.
Data prep targets nontechnical and mildly technical users, and these
classes of users typically assume metadata will be available within
the data lake.
Note that the emerging tools and best practices for data prep do not
replace the mature tools and best practices of traditional
data integration. The two are designed for different classes of users,
in different development contexts, so they are complementary. A
modern, up-to-date DI environment should support both.
NUMBER FOUR
User organizations are interested in the data lake right now because
it has many beneficial use cases in data warehouse architectures
and related technology stacks for analytics. Data lakes also have
desirable uses in other complex data ecosystems across enterprises,
especially multichannel marketing, data archiving, content
management, ERP, and the supply chain.
The overall point is that a data lake rarely exists in a vacuum.
Instead, most data lakes are integrated into larger enterprise data
architectures. It behooves users to design the data lake and to select
tools that will result in a lake that integrates and interoperates well
with other enterprise processes, as in the following:
Multiplatform data warehouse environments (DWEs). A recent
TDWI survey indicates that the number of deployed Hadoop clusters
is up 60 percent over two years. In another TDWI survey, 17 percent
of data warehouse professionals already have Hadoop in production
in their extended DWE. The users surveyed also indicated that
Hadoop complements their primary traditional warehouse platform
without replacing it.
NUMBER FIVE
Data lakes, big data, and Hadoop are still new to most IT and data
management teams. The good news is that these teams'
prior experience and skills with traditional data platforms are still
relevant. However, to get maximum value from a Hadoop-based data
lake, technical teams will need to adjust older best practices in data
management and embrace a few emerging ones.
Developing metadata as you ingest or explore the data in the
lake. Data management professionals, especially those who work
in data warehousing, are used to extracting metadata from source
systems long before a DI solution connects to that system and
extracts actual data. That method simply doesn't work with many
new big data sources. For one thing, connecting to the source system
is often impossible (as with most sensors, which broadcast or push
out data without a connection). For another, many users must onboard
new data suddenly, then quickly develop metadata from it, as
discussed in the onboarding section of this report. Hence, many data
management professionals need to think in a new way so they can
adapt to the changing data landscape. Metadata management is just
one area where that's true.
Modeling data on the fly. Analytics data modeling is another skill
that is evolving. The most dramatic example is how a user creates
a data model as he or she explores the data in a lake with iterative
ad hoc queries. Likewise, the new practice of data prep is all about
creating a new data set (and typically a new metadata model)
very quickly, based on manipulating the actual data. This is quite
different from the older academic approach of designing a model
in a vacuum with a dedicated modeling tool, then testing the model
by loading data into it. The old practice too often leads to beautiful
models that don't fit the available data; the new practice avoids such
disconnects. As more ecosystems mix old and new data platforms
and their attendant practices, there's a need for both approaches to
data modeling.
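As a small illustration of modeling from the data itself, the Python sketch below explores a handful of raw records and then derives an aggregated data set (a churn rate keyed by region) directly from what the exploration revealed. The sample records and field names are hypothetical; real data prep tools do the same thing interactively at much larger scale.

```python
from collections import defaultdict

# Raw, detailed records as they might sit in the lake (hypothetical sample).
events = [
    {"customer": "a", "region": "west", "churned": True,  "spend": 120.0},
    {"customer": "b", "region": "west", "churned": False, "spend": 310.0},
    {"customer": "c", "region": "east", "churned": True,  "spend": 95.0},
]

# Step 1: ad hoc exploration -- what fields exist in this data?
print(sorted(events[0].keys()))

# Step 2: the exploration suggests a model; derive it directly from the data.
# The "model" here is simply an aggregated data set keyed by region.
churn_by_region = defaultdict(lambda: {"customers": 0, "churned": 0})
for e in events:
    churn_by_region[e["region"]]["customers"] += 1
    churn_by_region[e["region"]]["churned"] += int(e["churned"])

prepped = {
    region: round(stats["churned"] / stats["customers"], 2)
    for region, stats in churn_by_region.items()
}
print(prepped)  # {'west': 0.5, 'east': 1.0}
```

Note that no schema was designed up front; the shape of the prepped data set fell out of the actual records, which is the point of modeling on the fly.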
Adopting early ingestion and late processing. This is standard
procedure with a Hadoop-based data lake, and for good reasons. It
makes data available as soon as possible for reporting and analytics,
which enables business monitoring, operational decision making,
and other time-sensitive business practices. It avoids processing too
much data on the off chance that someone might need it (a common
cause of squandered resources in data warehousing). Early ingestion
also fosters a newfound respect for datas original state, which is
beneficial to emerging forms of discovery-oriented analytics.
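Early ingestion with late processing is often called schema-on-read, and a minimal sketch makes the division of labor clear: the lake lands payloads untouched with only arrival metadata, and a parser (the "schema") is applied when a consumer asks. The `LAKE` list, `ingest`, and `read_with_schema` names are illustrative stand-ins for raw file storage such as an HDFS directory.

```python
import json
import time

LAKE = []  # stand-in for raw file storage (e.g., an HDFS directory)

def ingest(raw_line):
    """Early ingestion: land the payload untouched, with only arrival metadata."""
    LAKE.append({"arrived_at": time.time(), "payload": raw_line})

def read_with_schema(parser):
    """Late processing: apply a parser (schema) only when a consumer asks."""
    return [parser(entry["payload"]) for entry in LAKE]

ingest('{"order": 1, "amount": 9.99}')
ingest('{"order": 2, "amount": 4.50}')

# Different consumers can apply different "schemas" to the same raw bytes.
totals = sum(rec["amount"] for rec in read_with_schema(json.loads))
print(round(totals, 2))  # 14.49
```

Because nothing was transformed at load time, the same raw payloads remain available for any future consumer with different parsing needs.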
NUMBER SIX
Managing diverse big data is important, and yet it's just a means
to an end. The end goal for savvy managers is to get business value
and drive organizational effectiveness from the data, not just capture
it in a cost center, and the primary path to that value is through
analytics. In that context, here are some of the new analytics best
practices empowered by a Hadoop-based data lake:
Advanced analytics that complement older analytics. In a nutshell,
organizations need to preserve existing analytics based on reporting,
OLAP, and SQL and also complement these with advanced analytics
based on technologies for mining, clustering, graph, statistics, and
natural language processing. Traditional forms of analytics and
reporting are mostly about tracking the facts and business entities
that you know well and need to monitor over time. New, advanced
forms of analytics are mostly about discovering facts that you did not
know before, as well as linking together highly diverse facts, events,
and entity characteristics (such as customer behavior, partner
reliability, or operational metrics) to form new insights and develop
new business opportunities.
Note that traditional analytics tends to require squeaky-clean
data on a relational platform for highly precise and structured
output, as seen in standard reports and cubes. However, today's
analytics focuses on raw, detailed data because it fuels discoveries
and complex linkages without the need for obsessive precision or
structure. Given the differences in data requirements, a growing
number of data warehouse and data management teams work with
both relational databases and Hadoop-based data lakes.
Multiple forms of analytics in tandem. One of the strongest trends
in analytics is toward using multiple forms of analytics. That's
because each method tells you something different about the same
issue. Connect multiple analytics results in tandem, and you get a
more comprehensive insight for business advantage.
When a Hadoop-based data lake captures and manages data in
its raw original state, data can easily be repurposed for multiple
forms of analytics. Depending on the design and data content of an
individual data lake, it may support both set-based analytics (based
on OLAP, SQL, and other relational techniques) and algorithmic
analytics (based on mining, clustering, graph, statistics, and NLP).
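The two styles can be shown side by side on the same raw records: a set-based SQL aggregation over well-defined facts, and an algorithmic pass over the same data. The purchase records, category totals, and fixed centroids below are hypothetical, and the "clustering" is deliberately crude (nearest fixed centroid) to keep the sketch short; real algorithmic analytics would learn the centroids from the data.

```python
import sqlite3

# The same raw purchase records feed two styles of analytics.
rows = [("ann", "books", 12.0), ("bob", "books", 300.0),
        ("cai", "games", 15.0), ("dee", "games", 290.0)]

# Set-based analytics: a SQL aggregation over well-defined facts.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE purchases (buyer TEXT, category TEXT, amount REAL)")
db.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
by_category = dict(db.execute(
    "SELECT category, SUM(amount) FROM purchases GROUP BY category"))
print(by_category)  # books: 312.0, games: 305.0

# Algorithmic analytics: cluster the same raw amounts into "small" and
# "large" baskets by distance to a fixed centroid.
centroids = {"small": 20.0, "large": 250.0}
clusters = {
    buyer: min(centroids, key=lambda c: abs(amount - centroids[c]))
    for buyer, _, amount in rows
}
print(clusters)  # ann and cai land in "small"; bob and dee in "large"
```

Each pass answers a different question about the same data, which is why repurposable raw data is so valuable: neither analysis required reshaping the records for the other.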
Integrated sequence of self-service best practices. One of the
most desirable emerging analytics practices today is to connect,
in sequence, several related self-service data-driven tasks. The
sequence typically follows this order: data access, exploration, prep,
visualization, and analysis. For example, as users access and explore
data, they may discover something meaningful in the data, such
as the root cause of the most recent churn or a cost center that's
eroding profits from the bottom line. After the discovery, they want
to quickly prepare a data set based on what they learned, then share
the prepped data set with colleagues or seamlessly move the data
set to other tools for further analysis and visualization.
The assumption is that several tool types are employed (one per
step in this multistep analytics process), and the tools are tightly
integrated for seamless handoff. This multistep analytics process
seems to work well with Hadoop-based data lakes, but only when
users are given an integrated toolset that supports self-service. As
discussed earlier, self-service isn't for everyone; it succeeds when
provided to certain classes of users, who are governed carefully.
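The access-explore-prep handoff can be sketched as a simple function chain. Each function stands in for a tool in the integrated toolset; the cost-center records, the `explore` heuristic, and the function names are hypothetical illustrations of the sequence, not any vendor's workflow.

```python
def access(source):
    """Self-service access: pull raw records (hypothetical in-memory source)."""
    return list(source)

def explore(records):
    """Exploration: surface a candidate insight -- here, the costliest center."""
    worst = max(records, key=lambda r: r["cost"])
    return records, worst["center"]

def prep(records, center):
    """Prep: build a focused data set around what exploration uncovered."""
    return [r for r in records if r["center"] == center]

raw = [{"center": "ops", "cost": 40}, {"center": "it", "cost": 90},
       {"center": "it", "cost": 75}]
records, suspect = explore(access(raw))
handoff = prep(records, suspect)  # passed on for visualization and analysis
print(suspect, len(handoff))  # it 2
```

The value of the tight integration the report describes is exactly this seamlessness: the prepped set flows to the next tool without an export-and-reload detour.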
Analytics value from human language, text, and other
unstructured data. Theoretically, you can put any data or other
digital information in a file, and Hadoop can manage it and
make it available for analytics processing. Within the category of
unstructured data, file-based human language and other text is
already being leveraged via analytics. The killer app is sentiment
analysis, which scans mountains of comments from customers,
prospects, and other people (perhaps drawn from social media or
text fields in call center apps) to determine what the marketplace
is saying about your firm, its products, or its services. As another
example, the claims process in insurance captures a ton of text
about losses; insurance companies collect this in lakes, process it
to extract facts about entities of interest, and use the output data
to extend analytics applications in fraud detection and actuarial
calculations. Similar text-driven analytics are seen in the patient
outcome analyses of healthcare (by both insurer and provider).
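At its simplest, sentiment analysis scores text by comparing it against word lists, as in the Python sketch below. The word lists and sample comments are hypothetical; production sentiment analysis relies on trained models or curated lexicons with negation handling, but the scoring idea is the same.

```python
# Hypothetical sentiment lexicons for illustration only.
POSITIVE = {"great", "love", "fast", "helpful"}
NEGATIVE = {"broken", "slow", "hate", "refund"}

def sentiment(comment):
    """Score a comment by counting positive versus negative words."""
    words = comment.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

comments = ["I love the fast shipping", "slow support and a broken part"]
print([sentiment(c) for c in comments])  # ['positive', 'negative']
```

Run over millions of social media posts or call-center text fields in a lake, even a scoring pass this simple yields an aggregate read on what the marketplace is saying.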
Analytics focused on a single department or application. Some
data lakes are designed for multitenant enterprise use, as when
the lake is part of a larger data warehouse environment or provided
by central IT. However, other data lakes are designed specifically
for a single department or application. The departmental analytics
program is a rising phenomenon because most analytics applications
have a departmental association. For example, sales and marketing
want to own and control customer analytics so they get the
information they need for campaigns, telemarketing, prospecting,
and their general understanding of customers. Likewise, the financial
department needs to control financial analytics, and procurement
needs to control supply-chain analytics. As departmental analytics
programs are modernized, they tend to integrate Hadoop, then
migrate data collections to a data lake atop it. Hadoop's relatively
low cost puts the data lake within reach of departmental budgets.
www.pentaho.com
Pentaho, a Hitachi Group company, is a leading data integration
and business analytics company with an enterprise-class, open
source-based platform for diverse big data deployments. Pentaho's
unified data integration and analytics platform is comprehensive,
completely embeddable, and delivers governed data to power
any analytics in any environment. Pentaho's mission is to help
organizations across multiple industries harness the value from
all their data, including big data and IoT, enabling them to find
new revenue streams, operate more efficiently, deliver outstanding
service, and minimize risk. Pentaho has over 15,000 product
deployments and 1,500 commercial customers today, including
ABN-AMRO Clearing, BT, Caterpillar Marine Asset Intelligence, EMC,
Moody's, NASDAQ, Opower, and Sears Holdings Corporation. For more
information, visit www.pentaho.com.