Hadoop

Heading Towards Big Data
Building A Better Data Warehouse For More Data, More Speed, And More Users
Raymond Gardiner Goss
raymond.goss@globalfoundries.com
Kousikan Veeramuthu
kousikan.veeramuthu@globalfoundries.com
Manufacturing Technology
GLOBALFOUNDRIES
Malta, NY, USA
determine the callers income level and specify to which agent
to route the call or the switch would timeout. When switches
were overwhelmed with data, they would drop packets and
algorithms had to infer states based on most probable current
state. Other industries, such as social media, are challenged
more by unstructured data and need tools to help turn text
messages and photos into useful information for search engines
and marketing purposes. The challenge in the semiconductor
world is with the size of the data. Speed becomes a secondary
problem because so many sources are needed to be joined
together in a timely manner. Large recipes, complex output
from the test floor combined now with more Interface-A trace
data amass terabytes each month that need to be handled for
both real-time SPC, APC, and command and control scenarios
as well as offline yield analyses. Users now require real-time
access to data from a much larger pool of sources. This paper
describes the various states of handling the increasing
complexity and volumes today and the challenges ahead.
AbstractAs a new company, GLOBALFOUNDRIES is

aggressively agile and looking at ways to not just mimic existing
semiconductor manufacturing data management but to leverage
new technologies and advances in data management without
sacrificing performance or scalability. Being a global technology
company that relies on the understanding of data, it is important
to centralize the visibility and control of this information, bringing
it to the engineers and customers as they need it.
Currently, the factories are employing the best practices and
data architectures combined with business intelligence analysis
and reporting tools. However, the expected growth in data over
the next several years and the need to deliver more complex data
integration for analysis will easily stress the traditional tools
beyond the limits of the traditional data infrastructure. The
manufacturing systems vendors need to offer new solutions based
on Big Data concepts to reach the new level of information
processing that work well with other vendor offerings.
In this paper, we will show where we are and where we are
heading to manage the increasing needs for handling larger
amounts of data with faster as well as secure access for more
users.
KeywordsData
Warehousing,
Reporting, Scaling, Big Data
I.
Real-Time,
II.
A. Many types of Big Data

In the past year, Big Data has been gaining more buzz. It
isnt uncommon to hear someone say, we will scale with a
Big Data solution, Google does it just fine, or vendor x
must have already a Big Data solution in the plans.
However, there are different Big Data problems and solutions
and not all apply or can be used at once. We need to first
define the term.
Analysis,
INTRODUCTION
Not long ago, the price of gasoline was less expensive and
we drove cars based on features we desired like roof racks,
cargo space, sporty looks, and prestige, but the world is
changing. The cost of fuel has increased and we are more
aware of the environmental concerns. We are switching to
vehicles with different engines that go much further with less
energy, but still expect all the new features of built-in GPS,
backup cameras and keyless ignition. The move to Big Data
will be a similar paradigm shift. The principle analysis is the
same, but the engines and amount of data are changing.
Big Data is the territory where our existing traditional

relational database and file systems processing capacities are
exceeded in high transactional volumes, velocity
responsiveness, and the quantity and or variety of data. The
data are too big, move too fast, or dont fit the strictures of
RDBMS architectures. Scaling also becomes a problem. To
gain value from these data, we must choose alternative ways to
process them.
Various industries have different problems, but most will

have Big Data needs. When first moving to the semiconductor
manufacturing industry, we noticed that the transaction volume
was a fraction of what we had experienced in the telecom
world, where data were optimized into compressed bytes and
streamed over raw sockets and switch responses were expected
in milliseconds. For instance, for a 800 number routing
scheme, we had less than 250ms to look the phone number up,
978-1-4673-5007-5/13/$31.00 2013 IEEE
TRADITIONAL SOLUTION, GROWTH AND BIG DATA
B. Complexity Example
Big Data covers a range of situations, all with the common
theme of more more variety, more quantity, more users,
more speed, more complexity. There are currently different
Big Data solution approaches to each of these. Let us start off
with an example of determining root cause and correlation of a
220
ASMC 2013
new variety. In one fab, there was a challenge of reticle

hazing. It wasnt hard to determine the culprit of the haze by
sending it off to the lab but other details were not as easy.
From a few of the facilitys air quality sensors, it could be seen
that there were traces of an oxidizing agent detected in the air
and the fab had started the practice of inspecting the reticles
after every 200 wafers. While the risk to the wafers was
mitigated, the transports became high and potentially created
more exposure of contamination while in the Automated
Material Handling System (AMHS). Where were the reticles
picking up the haze? Was it from outgases in the tool or while
they were in transit to the inspection tools or the stocker? In
order to solve the problem, we needed temporal data from the
process, metrology and inspection tools, MES, facilities, MCS,
and AMHS to be brought together in one place for analysis. Up
until this point, the data warehouse had not yet contained all
such sources. From problems like this, we needed a solution,
which resulted in the creation of the General Engineering and
Manufacturing Data warehouse (GEM-D).
either by asking for a user account and then creating a stand

alone application around the data usage or by first creating
independent MS Access, PHP, or Perl applications that then
urgently need to be connected to factory systems to solve
some pressing need.
In Fig. 1, a standard factory system with a well organized
SOA architecture is shown, with the introduction of Ad Hoc
data consumers and generators. These ad hoc systems could
eventually be migrated into or become new core systems but
serve as a reminder that our systems today have a long way to
go to offer universal and consistent data integration.
D. Yesterdays Technology Today
Traditionally, GLOBALFOUNDRIES was comprised of
systems from Chartered Ltd and AMD which were focused on
self-contained processes and reports. Analyses were either
limited to information placed in a data warehouse using the
older batched Extract Transfer and Load (ETL) paradigm that
could have a lag of hours or to specialized reports generated
from separate run-time systems. The data warehouse, for
example, had data from the SiView MES, inline SPC results,
and engineering data from WET and SORT, but lacked data
from advanced reticle handling and preventative maintenance
activities. Direct queries to quality systems were often
performed as a side activity, and not well correlated with the
other data. The data warehouse in a single fab housed 40+
terabytes and yet did not house any of the newer Interface-A
tool data. Similarly, reporting focused on MES WIP status and
lacked access to data other than through some random run-time
systems that were supposed to be used for decision services.
The staff used tools like APF RTD reports, which are familiar
to dispatch writers but not well-suited for analysis or traditional
scheduled static reports using applications like SAP Business
Objects. Both the data warehouse and WIP reporting did not
stand up to the demands of more voluminous and real-time
data. The number of systems that have relevant data for each
use case continues to increase.
Data performance is becoming equally ripe for

improvements. In our factories, command and control is
continually leveraging more data sources and visualization of
real-time data to make decisions. For instance, before shipping
wafers, a fab out inspection occurs comparing all experiments,
incidents, quality reviews, and prior holds from SPC events.
All these data are made available not only to systems but also
to the users so they can rectify any outstanding issues. The
real-time systems need to support nearly instant responses. At
the same time we have data retention and archiving
requirements to keep much of these data online for some time
and then in a dearchivable state.
C. Ad Hoc Organic Data Organization and Growth
Nearly every engineering university graduate has had some
programming experience and understands how to use a
database. Even though we have clear requirements for
architectural review for new system connections and
introductions, a large percentage of connections are created
Ad Hoc
Ad Hoc
Ad Hoc App
Factory Systems
MES Siv iew
UI
framew ork
SPC
FDC
Setup
Decison
Serv ices &
Integration
Replication
MQ Message Bus
EI
Dispatch &
Scheduling
APC
Other Fabs/Corporate
Systems
CMMS
RM S
eTEST
Data
Warehouse
(GEM-D)
Business Analysis
and Reporting
Fig. 1. Factory Systems with Ad Hoc applications
221
ASMC 2013
The landscape has now changed. GLOBALFOUNDRIES

is focused on gathering data in real time, with less than 10
second latencies, in a new General Engineering and
Manufacturing Data Warehouse (GEM-D, see Fig. 2) taking
advantage of Oracle GoldenGate feeds that have minimal
impact to run-time systems and provide the data in easily
joinable schemas. These feeds were previously only used for
decision services, but now are scaling across the enterprise.
New compression techniques are also utilized to reduce storage
volumes. GEM-D offers one stop shopping for users with
specific organized data marts built on top. New Business
Intelligence tools are used to empower the users to sort and sift
through the newly accumulated data with in-memory
associative technology in a compressed form with associations
defined between data items. However, we are just beginning to
enter the Big Data world. As masking layers increase,
transistors shrink, tool data increase, and wafer sizes move to
450mm, there are challenges ahead for such data-centric
companies.
III.
C. Automated Algorithms
Big Data can feed advanced analytics and algorithms to
vastly improve the decision making process and identify
valuable insights which were previously hidden or not easily
available. Fabs can adjust production lines automatically to
optimize efficiency, reduce waste, and avoid dangerous
conditions. At GLOBALFOUNDRIES, we are already using
controlled experiments to make better decisions by embedding
real-time, highly granular data from networked sensors in the
supply chain and production processes. Automating the
analysis of the data reported by sensors embedded in complex
products combined with the tool owners input enables
manufacturers to create proactive smart preventive
maintenance service. Service personnel can perform the PM
operations before there is an equipment failure which may
cause costly fab disruptions. Also this enables the fab to have
a new business model like leasing the portions of fab space for
specific customers based on the sensor data.
D. Virtual Factory.
Taking product development and historical data and real-time
inputs from MES data, fabs can apply advanced computational
methods to create a digital model of the entire manufacturing
process. This model can be used to design and simulate the
most efficient production system. Some of the applications of
virtual factory include: 1. validation of designed production
concept; 2. processes verification before start of production; 3.
optimization of production equipment allocation; 4. bottlenecks
and collisions analysis; 5. better utilization of existing
resources; 6. eliminating errors in the production line. Fab
engineers can leverage the power of Big Data to simulate these
operations with millions of different combinations to optimally
schedule and dispatch WIP.
WHY DO WE NEED BIG DATA CAPABILITIES?
Big Data analytics can reveal insights previously hidden by

data that were too costly to process, such as sensor logs and
wafer maps in conjunction with other factory information.
Being able to process every required item in a reasonable
amount of time could increase throughput of the factory, skip
sampling operations, ensure tool performance between
maintenance cycles, promote an investigative approach to data,
in contrast to the somewhat static nature of running
predetermined reports. A large amount of data is already kept
in the FAB archives unused since there is no cost effective way
to process it and get value out of it. Some of the Big Data use
cases are explained below.
A. Transparency
Making Big Data more easily accessible to relevant
stakeholders in a timely way can create tremendous value. Data
are often compartmentalized within a single group in the fab.
Several departments have their own IT systems and
unfortunately frequently store and maintain redundant data.
We need to have a free interchange of data among different
department systems. Here at GLOBALFOUNDRIES a lot of
teams were trying to gain access to the business process
request information and wanted to use it for automation but
that infrastructure only supported access to a copy refreshed
every six hours. Real-time information for decisions was
impossible. Also integrating data from R&D, engineering, and
manufacturing units in the same fab and between fabs can
significantly reduce the wasteful redundancy and improve
much faster turnaround time for resolving issues and accelerate
time to market.
IV.
BIG DATA SOUTIONS
As mentioned, there are several types of Big Data problems

to be solved. The computer science industry offers a few
models. Described here are the main variants and applicable
uses. A summary pros and cons list is provided in Table 1 and
a corresponding radar chart, which includes traditional
RDBMS, similar to GEM-D is shown in Fig. 3. The desirable
solution would be the one which gets high score in all or most
of the axes.
A. The Data Appliance
Data appliance offerings include Oracle Exadata, IBM
Netezza, and Teradatas platforms. These solutions offer a
complete closed solution with optimized hardware accelerators
that scale according to the rack space. These appliances offer
access to data via traditional SQL, but indexing and queries are
optimized by proprietary architectures that consist of query
distributions and select statements offloaded from the CPU to
specialized chips.
B. Experimental Analysis
Experimental analysis is critical in many areas of our fabs.
Ability to process huge amounts of data will open up the
possibility of conducting new tests and analysis which were
unimaginable earlier. This will have a dramatic effect in areas
like R&D and will help achieve faster time to market.
222
ASMC 2013
Current State
Future State Additions and

Considerations
Factory Source Systems
Untapped Factory Source Systems
EI logs, bitmaps, system logs,

test data
Real time data

feeds
GEM-D Staging
Layer
Integrated Identity
Mgmt
Evaluating compression,
In-memory analysis and
hardware solutions
GEM-D Logical
layer
New BI
Tools &
Reports
New
Analysis
Tools
Real time and

automated
analytics
Fig. 2. The GEM-D Model
223
ASMC 2013
B. The Hadoop Derivative

In order to scale to the petabytes of unstructured data (not
modeled in tables with well defined rows and columns),
Google introduced the Map Reduce paradigm that later was
made available in the Hadoop Open Source project. This has
been extended with tools like HBase, Pig, Hive, etc. which
improve the usability and reduce the complexity of the Map
Reduce coding. Using commodity hardware as a foundation,
Hadoop provides a layer of software that spans the entire grid,
turning it into a single system. Hadoop based solutions are
provided by companies like Cloudera, Hortonworks and
MapR.
Work together to offer a standard nomenclature of

objects to better tie system data together.
Expect to work with new analytic tools for

analysis and reporting not requiring proprietary
reporting platforms and perhaps providing
components or models for BI solutions.
B. Big Data Environment

We are looking to provide vendors the opportunity to test
their Big Data platforms in our lab. Our state of the art test
environment has been an environment where vendors have
introduced hardware and software solutions and see how they
work together with other factory applications.
C. Massive In Memory Database

While the previous two models offer ways to scale to
support larger amounts of data, a new model attempts to make
larger amounts of data available in real time by placing the
database in a highly available cluster of servers that keep all
the data in memory. This eliminates the need for indexing and
I/O on the traditional drive storage is completely eliminated.
Some systems, such as SAP Hana, also flush data to disk for
recovery scenarios. It is by far the fastest solution for critical
structured data but comes with price tag.
D. Solid State Disk (SSD)
There is yet another improvement to any of the other Big
Data solutions or even the traditional model that addresses the
speed of access. The hard drive storage itself can be moved to
solid state. These flash drives can be leveraged to all or just
a subset of the data. SSD based solutions can be used with
Hadoop systems to address the big data problems. Companies
like Fusion-io, NetApp, EMC, etc. provide SSD based
solutions.
Fig. 3. Comparison chart of Big Data technologies
VI.
V.
We are entering a new realm of data management.

Solutions will perhaps take several forms. As the complexity
of our needs scale, we need our suppliers to move away from
stand alone proprietary infrastructure, reporting and offer plugand-play components.
WHERE DO WE GO FROM HERE?
The path for the next level of data management is not yet
defined. Reporting and analysis has already moved beyond the
single system or data set. The factory data volumes are
expanding rapidly. The reader can see from Fig. 3 that there
are significant tradeoffs with using these solutions. The
appliance covers a subset of the traditional RDBMS whereas
the Hadoop paradigm covers new territory.
ACKNOWLEDGMENT
The authors would like to thank the Data Integration and IT
DBA teams at GLOBALFOUNDRIES in deploying GEM-D
and making the current solutions a reality.
A. A call for action

We appeal to our vendors to work together to leverage Big
Data strategies within their product offerings. Several things
can be done to help including:
Provide schemas that are portable and not locked

into specific RDBS providers.
Provide Map-Reduce stubs that can be grouped

together with other such routines. This applies to
tool vendors log files, test file output and
otherwise untapped data today.
Leverage SOA architectures using interchangeable

message buses for communication.
CONCLUSION
224
ASMC 2013
TABLE I. TECHNOLOGY PROS AND CONS

Technology
Pros
Appliance
RDBMS
Hadoop
In Memory DB
SSD
Cons
Very mature. Long history of successful

installation.
Can be used by multiple user types, from
Business users using reporting tools,
through SQL novices and expert users.
Vendors like Teradata support data volume
in petabyes.
Custom hardware accelerates query and
transformations.
Data volumes in Petabytes.

Open source.
Commodity hardware and inexpensive.
Fault tolerant, designed to survive multiple
node failures.
Supports both structured and unstructured
data. Data can be operated on in native
format.
Heterogeneous hardware, the nodes in the
installation can be different.
In-memory. Real-time analysis.

Single foundation for OLTP+OLAP.
No need for Indexing.
Very high performance.

No spinning Disks.
Consumes less power and space.
Leverage existing software architecture
and data base systems, improving virtually
any disk-based solution.
Fault Tolerance is only at transaction level;

cannot survive node failure.
Supports only structured data.
Homogenous hardware - all nodes in the
installation must be the same.
Disk based data storage. Limited real-time
analysis capabilities compared to in-memory
technologies.
OLTP and OLAP layers are separate
More expensive compared to open source
solutions.
Currently it only has batch processing

capabilities. Real-time data processing is still
under works.
Requires programming skills to work with.
Smaller pool of individuals capable of
performing these tasks.
Relatively new technology. The source is
continuously under development with new
features being added.
Data volume in terabytes. Doesn't scale to

petabyte size currently.
Very expensive. Enterprise level toolset.
Proprietary hardware and software.
Limited fault tolerance capabilities.
Very expensive. About ten times that of a
standard hard drive.
Limited no. of writes, thus limiting the life of
the device.
Cannot scale out.
Relatively less reliable.
[5]
[1]
[2]
[3]
[4]
http://blogs.gartner.com/merv-adrian/2013/02/10/hadoop-and-di-aplatform-is-not-a-solution/
[6] http://www.saphana.com/community/blogs/blog/2012/04/30/whatoracle-wont-tell-you-about-sap-hana
[7] http://www.dbms2.com/2012/08/20/in-memory-hybrid-memorycentric/
[8] http://hadoop.apache.org/
[9] http://en.wikipedia.org/wiki/MapReduce
[10] http://en.wikipedia.org/wiki/Big_data
[11] http://www.datanami.com/datanami/2012-0213/big_data_and_the_ssd_mystique.html
REFERENCES
J Manyika, M Chui, B Brown, J Bughin, R Dobbs, C Roxburgh, A H
Byers, McKinsey Global Institute Publication "Big data: The next
frontier for innovation, competition, and productivity"
J G Kobielus, The Forrester Wave: Enterprise Hadoop Solutions,
Q1 2012
A McClean, R. C. Conceio, M OHalloran, "A Comparison of
MapReduce and Parallel Database Management Systems," ICONS
2013, The Eighth International Conference on Systems.
"Big Data Now: 2012 Edition" OReilly Media, Inc.
225
ASMC 2013

Hadoop

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Hadoop

Uploaded by

Copyright:

Available Formats

Heading Towards Big Data

AbstractAs a new company, GLOBALFOUNDRIES is

A. Many types of Big Data

Big Data is the territory where our existing traditional

Various industries have different problems, but most will

978-1-4673-5007-5/13/$31.00 2013 IEEE

TRADITIONAL SOLUTION, GROWTH AND BIG DATA

new variety. In one fab, there was a challenge of reticle

either by asking for a user account and then creating a stand

Data performance is becoming equally ripe for

MES Siv iew

Fig. 1. Factory Systems with Ad Hoc applications

The landscape has now changed. GLOBALFOUNDRIES

WHY DO WE NEED BIG DATA CAPABILITIES?

Big Data analytics can reveal insights previously hidden by

BIG DATA SOUTIONS

As mentioned, there are several types of Big Data problems

Future State Additions and

Factory Source Systems

Untapped Factory Source Systems

EI logs, bitmaps, system logs,

Real time data

Real time and

Fig. 2. The GEM-D Model

B. The Hadoop Derivative

Work together to offer a standard nomenclature of

Expect to work with new analytic tools for

B. Big Data Environment

C. Massive In Memory Database

Fig. 3. Comparison chart of Big Data technologies

We are entering a new realm of data management.

WHERE DO WE GO FROM HERE?

A. A call for action

Provide schemas that are portable and not locked

Provide Map-Reduce stubs that can be grouped

Leverage SOA architectures using interchangeable

TABLE I. TECHNOLOGY PROS AND CONS

Very mature. Long history of successful

Data volumes in Petabytes.

In-memory. Real-time analysis.

Very high performance.

Fault Tolerance is only at transaction level;

Currently it only has batch processing

Data volume in terabytes. Doesn't scale to

You might also like