
High Performance Data Mining and Big Data Analytics

Author: Khosrow Hassibi, PhD
Contact: Khosrow.hassibi@sas.com

January 2012
Last Modified Date: April 17, 2012

This document is still a work in progress.

Table of Contents
1 Summary
2 Introduction
3 State of Computing
  3.1 Microprocessors and Moore's Law
  3.2 Traditional Supercomputing
  3.3 High Performance Computing
  3.4 New Computing Delivery Models
  3.5 Key Future Trends in Computing
4 Exponential Data Growth and Big Data
  4.1 What defines a Big Data scenario?
  4.2 6Vs of Data
    4.2.1 Value
    4.2.2 Volume
    4.2.3 Variety
    4.2.4 Velocity
    4.2.5 Validity
    4.2.6 Volatility
  4.3 The Impact of 6Vs
5 Big Data Analytics
  5.1 Major Developments in Big Data Analytics
    5.1.1 Hadoop and MapReduce
    5.1.2 Scalable database
    5.1.3 Real-time stream processing
    5.1.4 The In-memory Big Data appliance
  5.2 High Performance Data Mining and Big Data
    5.2.1 Approaches to Achieve High Performance and Scalability in Data Mining
      5.2.1.1 Chunking or Data Partitioning
      5.2.1.2 Statistical Query Model (Machine Learning and Advanced Analytics)
      5.2.1.3 Serial Computing Environment
      5.2.1.4 Multiprocessor Computing Environment (Symmetric Multiprocessing or SMP)
      5.2.1.5 Distributed Computing Environment (Cluster Computing) with Shared Storage
      5.2.1.6 Shared-Nothing Distributed Computing Environment
      5.2.1.7 In-memory Distributed Computing Environment
      5.2.1.8 Side Note on Accelerated Processing Units (APUs or CPU/GPU combinations)
6 Applications of Big Data Analytics
  6.1 Applications of Big Data in Traditional Analytic Environments
    6.1.1 ETL
    6.1.2 Extracting Specific Events, Anomalies or Patterns
    6.1.3 Low End Analysis, Queries, and Visualization
    6.1.4 Data Mining and Advanced Analytics
  6.2 Big Data Mining and Sampling
    6.2.1 Structured Risk Minimization (SRM) and VC Dimension
    6.2.2 Properly Sampled Data is Good Enough for Most Modeling Tasks
    6.2.3 Where to Use all the Data?
7 Evolution of Business Analytics Environments
8 REFERENCES

1 Summary

Due to the exponential growth of data, there is an ever-increasing need to process and analyze Big Data. High performance computing architectures have been devised to address the needs of handling Big Data not only from a transaction processing standpoint but also from a tactical and strategic analytics viewpoint. The objective of this write-up is to provide a historical and comprehensive view of the recent trend toward high performance computing architectures, especially as it relates to analytics and data mining. The article also emphasizes that Big Data requires a rethinking of every aspect of the analytics life cycle, from data management and data analysis to data mining and deployment.

2 Introduction

With the exponential growth of data comes an ever-increasing need to process and analyze the so-called Big Data. High performance computing architectures have been devised to address the needs of handling Big Data not only from a transaction processing viewpoint but also from an analytics perspective. The main goal of this paper is to provide the reader with a historical and comprehensive view of the recent trend toward high performance computing architectures, especially as it relates to analytics and data mining.

There are a variety of readings available separately on Big Data (and its characteristics), high performance computing for analytics, Massively Parallel Processing (MPP) databases, algorithms for Big Data, in-memory databases, implementations of machine learning algorithms for Big Data platforms, analytics environments, and so on. However, none gives a historical and comprehensive view of all these separate topics in a single document. This is the author's first attempt to bring as many of these topics together as possible and to portray an ideal analytic environment that is better suited to the challenges of today's analytics demands.

In Section 3, today's state of computing is examined, with special focus on the microprocessor Power Wall and the impact it has had on new processor designs and on the popularity of parallel and distributed processing architectures in our everyday lives. The section ends by enumerating a few key future trends in computing. In Section 4, the ever-increasing exponential growth of data in today's world is examined under what is now called Big Data in marketing jargon. What is considered Big Data is described in terms of what I call the 6Vs of data: value, volume, velocity, variety, validity, and volatility. Each combination of these characteristics could require a different process, and possibly a different infrastructure, for data management and data analysis. Section 5 starts with major developments in Big Data analytics, including Hadoop and MapReduce, scalable databases, real-time stream processing, and the in-memory Big Data appliance. Algorithms that can be parallelized and that fall under the external memory and statistical query model definitions are also described. A variety of approaches to achieve high performance and scalability in data mining and analytics are covered, including parallel processing and distributed computing architectures. Section 6 discusses the applications of Big Data at a high level. These include, but are not limited to, ETL, event and anomaly extraction, low end analysis, and data mining. Sampling is then examined, and the question of whether all the data should be used in a data mining exercise is discussed, where I draw on a couple of topics in statistical learning theory to describe my viewpoint. Section 7 portrays a picture of an ideal analytic environment in which all sorts of data can be processed and analyzed depending on a variety of business needs. These environments will be the fertile ground for many new innovations and drastic evolutions in analytics and data mining practices. Section 8 provides a list of references for further reading.

3 State of Computing

3.1 Microprocessors and Moore's Law

Microprocessor performance has grown 1,000-fold over the past 20 years, driven by transistor speed and energy scaling, by microarchitecture advances that exploited the transistor density gains from Moore's Law2 (such as pipelining, branch prediction, out-of-order execution, and speculation), and by cache memory architectures. Several years ago, however, the top speed for most microprocessors peaked when their clocks hit about 3 gigahertz. The problem is not that the individual transistors themselves can't be pushed to run faster; they can. But doing so for the many millions of them found on a typical microprocessor would require that chip to dissipate impractical amounts of heat. Computer engineers call this the Power Wall. Given that obstacle, it is clear that all kinds of computers, including PCs, servers, and supercomputers, are not going to advance at nearly the rates they have in the past. This means faster single-threaded performance has hit a limit [2].

One technique to cope with the Power Wall has been the introduction of multi-core processor chips: keeping clock frequency fixed, but increasing the number of processing cores on a chip. One can maintain lower power this way while increasing the speed of many applications. This allows, for example, two threads to process twice as much data in parallel, but at the same speed at which they operated previously. As a result, there has been an industry-wide shift to multicore in the last few years.

In the next two decades, diminishing transistor speed scaling and practical energy limits create new challenges for continued performance scaling. As a result, the frequency of operations will increase slowly, with energy the key limiter of performance, forcing designs to use large-scale parallelism, heterogeneous cores, and accelerators (e.g., GPUs, FPGAs) to achieve performance and energy efficiency. Software-hardware partnership to achieve efficient data orchestration is increasingly critical in the drive toward energy-proportional computing.
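As a simple illustration of what the multicore shift demands of software, the sketch below splits a CPU-bound scoring loop across cores with Python's multiprocessing module. This is a minimal sketch, not taken from the paper: the workload is made up, and the observed speedup depends on the number of cores available.

```python
# Illustrative only: the same per-record work, run serially and then split across cores.
# On a fixed-clock multicore CPU, the parallel version finishes roughly N times sooner
# (for N cores), even though each core runs no faster than before.
import multiprocessing as mp
import time

def score(record):
    # Stand-in for a CPU-bound per-record computation (e.g., scoring a model).
    total = 0
    for i in range(50_000):
        total += (record * i) % 7
    return total

if __name__ == "__main__":
    records = list(range(2_000))

    start = time.time()
    serial = [score(r) for r in records]              # one core
    print("serial:   %.2fs" % (time.time() - start))

    start = time.time()
    with mp.Pool(processes=mp.cpu_count()) as pool:   # all available cores
        parallel = pool.map(score, records)
    print("parallel: %.2fs" % (time.time() - start))

    assert serial == parallel
```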

3.2 Traditional Supercomputing

Traditional Supercomputers are one of the greatest achievements of the digital age even though yesterday's supercomputer is today's game console (Table 1), as far as performance goes.

2. Moore's Law states that the density of integrated circuits doubles every generation (every two years) for half the cost.

                    SANDIA Labs ASCI RED     Sony PlayStation 3
Date Introduced     1997                     2006
Peak Performance    1.8 Teraflops            1.8 Teraflops
Size                150 square meters        0.08 square meters
Power Usage         800,000 Watts            < 200 Watts

Table 1: Sony PlayStation 3 versus SANDIA Labs Supercomputer.

Traditional supercomputers are very expensive and are employed for specialized applications that require immense amounts of mathematical calculations. During the past five decades these machines have driven some fascinating pursuits; for example, weather forecasting, animated graphics, fluid dynamic calculations, nuclear energy research, designing new drugs, and petroleum exploration. Traditional supercomputing is based on groups of tightly interconnected microprocessors connected by a local high speed bus. Examples of these are DOE Supercomputers used for weather forecasting and nuclear energy research3. The focus of design for these is to run compute-intensive tasks [1].

3.3 High Performance Computing

In recent years, modern supercomputers have shaped our daily lives more directly. We now rely on them every time we do a Google search or try to find an old high school friend on Facebook, for example. And you can rarely watch a big-budget movie without seeing supercomputer-generated special effects. The focus of design for these is to run compute-intensive tasks on huge amounts of data [2]. Data distribution and the colocation of data and computing are the keys to the successful application of these systems and algorithms.

Modern supercomputing comes in several architectural varieties compared to traditional supercomputer architectures.

Distributed Computing4
This is a method of computer processing in which different parts of a program run simultaneously on two or more computers that communicate with each other over a network. Distributed computing is a type of segmented or parallel computing, but the latter term most commonly refers to processing in which different parts of a program run simultaneously on two or more processors that are part of the same computer and share memory (also called the Shared Memory parallel computing model). Parallel computing may be seen as a particularly tightly-coupled form of distributed computing, and distributed computing may be seen as a loosely-coupled form of parallel computing. In distributed computing, each processor has its own private memory, i.e., the memory is distributed (also called the Distributed Memory parallel computing model) and local to each processor. Information is exchanged by passing messages between the processors, for example using MPI (Message Passing Interface). MPI is a standardized and portable message-passing protocol that was designed by a group of researchers from academia and industry in 1994 and fostered the development of a parallel software industry. The standard defines the syntax and semantics of a core of library routines. While both types of processing require that a program be segmented (divided into sections that can run simultaneously), distributed computing also requires that the division of the program take into account the different environments on which the different sections of the program will be running. For example, two computers are likely to have different file systems and different hardware components. It is important to emphasize that distributed computing covers both functional partitioning (breaking a program into parts that can run in parallel) and data partitioning5, where very large data is broken into pieces and the same program runs on each piece (also called SPMD, or Single Program Multiple Data), or a combination of both. True scalability for very large applications is only realized through distributed computing techniques.

3. The fastest supercomputer at the time of this writing is Japan's Fujitsu K Computer, at 10.51 petaFLOPS.
4. http://en.wikipedia.org/wiki/Distributed_computing
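As a small illustration of the message-passing and SPMD (data-partitioning) style described above, here is a sketch using the mpi4py binding of MPI, in which every process runs the same program on its own slice of the data and the partial results are combined with a reduction. The dataset and the computation are made up for illustration; in practice each rank would read only its own chunk of a large file rather than slicing a list.

```python
# SPMD sketch with MPI: every rank runs this same script on its own data partition.
# Launch with something like: mpiexec -n 4 python partial_sums.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's id
size = comm.Get_size()      # total number of processes

# Each rank owns one partition of a (made-up) dataset and works only on it.
full_data = list(range(1_000_000))
partition = full_data[rank::size]          # simple round-robin data partitioning
local_sum = sum(x * x for x in partition)  # local computation on the partition

# Exchange information by message passing: combine the partial results on rank 0.
global_sum = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of squares =", global_sum)
```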

Cluster Computing6
Computer clusters (a form of distributed computing) emerged as a result of the convergence of a number of computing trends, including the availability of low cost microprocessors, high speed networks, and software for high performance distributed computing. A computer cluster is a group of linked computers working together so closely that in many respects they form a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.

Grid Computing7
Grid computing is a term referring to the federation of computer resources from multiple administrative domains to reach a common goal. The GRID is a form of distributed computing with non-interactive workloads that involve a large number of files. What distinguishes GRID computing from conventional high performance computing systems such as cluster computing is that grids tend to be more loosely coupled, heterogeneous, and geographically dispersed.

5. This is in particular of great importance for Big Data processing.
6. http://en.wikipedia.org/wiki/Computer_cluster
7. http://en.wikipedia.org/wiki/Grid_Computing

Although a GRID can be dedicated to a specialized application, it is more common that a single grid will be used for a variety of different purposes. Grids are often constructed with the aid of general-purpose grid software libraries known as middleware8. One of the main strategies of GRID computing is to use this middleware to divide and apportion pieces of a program among several computers, sometimes up to many thousands9.

In-memory Computing
With the advent of 64-bit computing in the last several years (the emergence of both 64-bit CPUs and 64-bit operating systems), the addressable memory for a program has become theoretically unlimited. The only practical limit on memory size is the physical memory that can be provided on the CPU board. This has been a blessing for developers of data-intensive applications that have to rely on I/O performance. The availability of multicore processing units, huge 64-bit addressable memory, and distributed computing/data paradigms has made full in-memory distributed processing possible, especially for data-intensive and I/O-bound applications such as data mining and large-scale analytics.

Virtual Computing
Virtualization in computing is the creation of a virtual (rather than actual) version of a hardware platform, operating system, storage device, or network resource. The usual goal of virtualization is to centralize administrative tasks while improving scalability and overall hardware-resource utilization.

3.4 New Computing Delivery Models

To deliver computing power and applications, there are new computing models in addition to the personal and traditional client/server computing models. These computing delivery models could democratize the use of supercomputing for a much larger pool of users10 [10].
8. There is both open source and commercial GRID management software. Condor is an open source software framework for managing workload on a dedicated cluster of computers. Platform LSF (IBM) is a commercial job scheduler sold by Platform Computing that is used to execute batch jobs on networked Unix and Windows systems on many different architectures.
9. Today's 20-nanometer computer chip design and verification process is a good example where thousands of computers in a GRID may be needed to verify the logic and the layout. A typical process may require running hundreds of thousands of small jobs and handling millions of files.
10. Amazon's virtual supercomputer is capable of running 240 trillion calculations per second (240 teraflops) on 17,000 cores. While undoubtedly impressive, this pales in comparison to Fujitsu's K supercomputer, which hit 10 petaflops in November 2011, equating to 10 quadrillion (one quadrillion = 10^15) calculations a second.


Cloud Computing11
Cloud computing is a computing paradigm shift where computing is moved away from personal computers or an individual application server to a cloud of computers. Users of the cloud only need to be concerned with the computing service being asked for, as the underlying details of how it is achieved are hidden. This method of distributed computing is done by pooling all computer resources together and having them managed by software rather than humans. The services requested of a cloud are not limited to web applications; they can also be IT management tasks such as the requesting of systems, a software stack, or a specific web appliance.

Utility Computing12
Conventional Internet hosting services have the capability to quickly arrange for the rental of individual servers, for example to provision a bank of web servers to accommodate a sudden surge in traffic to a web site. Utility computing usually envisions some form of virtualization so that the amount of storage or computing power available is considerably larger than that of a single time-sharing computer. Multiple servers are used on the back end to make this possible. These might be a dedicated computer cluster specifically built for the purpose of being rented out, or even an under-utilized supercomputer.

3.5 Key Future Trends in Computing

The following are the key future trends in computing:
1. Moore's Law continues but demands radical changes in architecture and software.
2. Architectures will go beyond homogeneous parallelism, embrace heterogeneity, and exploit the bounty of transistors to incorporate application-customized hardware [2].
3. Software must increase parallelism and exploit heterogeneous and application-customized hardware to deliver performance growth.
4. Where data is large, colocation of data and processing will be the key to truly improving performance. This scheme has been commercially employed by some data warehouse vendors for a long time for simple analytics, and it is used in open source systems such as Hadoop.

11. http://en.wikipedia.org/wiki/Cloud_computing
12. http://en.wikipedia.org/wiki/Utility_computing

The key takeaway is that the same application software that, with no change, has enjoyed a 20-30% year-over-year performance increase over the last two decades (due to consistent processor speed improvements) now has to be written differently and more intelligently to exploit parallelism, distributed data, and modern hardware architectures for speedup. That means a lot more work for software engineers and application developers. This is particularly true for advanced analytics and data mining applications that rely on processing very large amounts of data.


4 Exponential Data Growth and Big Data

With the exponential growth in data generation and acquisition, whether in business (CRM, web data, mobile data, social media, etc.) or in scientific applications (next-generation telescopes, petascale scientific computing, high-resolution sensors), it is an exciting time for discovery, analysis, and better decision making. As a result of these technological advances, the next decade will see even more significant impacts in fields such as commerce, medicine, astronomy, materials science, and climate. Interesting discoveries and better business decisions will likely be made possible with amounts of data previously unavailable, or available but not minable. In the early 2000s, storage technologies were overwhelmed by the numerous terabytes of big data, to the point that IT faced a data scalability crisis. Then, yet again, storage technologies not only developed greater capacity, speed, and intelligence; they also fell in price. Enterprises went from being unable to afford or manage big data to lavishing budgets on its collection and analysis. And using advanced analytics, businesses could analyze big data at the detailed level to understand the current state of the business and track still-evolving aspects such as customer behavior. An optimal approach to the overall data analysis problem requires close cooperation between domain experts and computer scientists. If we asked computational or data scientists today what would maximize progress in their field, most would still say more disk space and more CPU cycles. However, the emerging petabytes13 of data fundamentally change every aspect of business and scientific discovery: the tools (computer hardware and software), the techniques (algorithms and statistics), and, as a result, the cycle of the scientific method itself, which is greatly accelerated [9].

4.1 What defines a Big Data scenario?

Big Data is a new marketing term14 to highlight the ever-increasing exponential growth of data in every aspect. The term Big Data15 originated within the open source community, where there was an effort to develop analytics processes that were faster and more scalable than traditional data warehousing and could extract value from the vast amounts of unstructured data produced daily by web users. All vendors, including database/data warehousing vendors, hardware vendors, and analytics vendors, have jumped on the Big Data bandwagon.
13. In May 2011, IBM announced that it would invest $100 million in Big Data analysis research, including tools and service offerings. IBM specifically mentions the challenge of analyzing petabytes of data, which requires advances in software, systems, and services.
14. Very large databases are nothing new. For at least three decades, the term very large database (VLDB) has been used in technical circles, defined as a database that contains an extremely high number of tuples (database rows) or occupies an extremely large physical file system storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte or contains several billion rows, although this definition naturally changes over time; today, petabytes may be more appropriate.
15. Due to the importance of Big Data, the US Government (Office of Science and Technology Policy at The White House) has announced the Big Data Research and Development Initiative, with a $200 million commitment from six agencies.

The most common definition of Big Data16 is datasets that grow so large that they become awkward to work with using available database management tools. Difficulties include the capture, storage, search, sharing, analysis, and visualization of such data. The trend of working with ever larger datasets continues because, first and foremost, companies at the scale of Google, Facebook, LinkedIn, and Yahoo! have to capture and manage all the data on their sites at the most granular detail to provide the services they offer. Second, for more traditional companies in industries such as finance, telco, and retail, there are potential benefits in analyzing big data that allow their data scientists to unravel business trends, detect novel customer behavior, detect fraud, combat crime, and so on. Though Big Data is always a moving target, current limits are on the order of terabytes and petabytes, that is, the size of a single dataset or combination of datasets that has to be analyzed for a specific analysis purpose at a time. Scientists regularly encounter this problem in data mining, meteorology, genomics, connectomics, complex physics simulations, biological/environmental research, Internet search, and finance. Data sets also grow in size because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per capita capacity to store information has roughly doubled every 40 months since the 1980s (about every 3 years) [15].

Data Inflation     Byte Size   Equivalent
Megabyte (MB)      2^20
Gigabyte (GB)      2^30        1,000 MB
Terabyte (TB)      2^40        1,000 GB
Petabyte (PB)      2^50        1,000 TB
Exabyte (EB)       2^60        1,000 PB
Zettabyte (ZB)     2^70        1,000 EB
Yottabyte (YB)     2^80        1,000 ZB

4.2 6Vs of Data

The exponential growth of data in the last decade can be looked at in six dimensions which I call the 6Vs [5]: Value, Volume, Variety, Velocity, Validity, and Volatility.

16. Refer to Wikipedia.

4.2.1 Value
Almost all big organizations, specifically in the last 15 years, have started to look at the analysis of the data they generate (interactions with their customers, vendors, machines, etc.) strategically. They treat that data as a strategic asset from an analysis standpoint. Detailed analysis of data has proved to bring significant competitive advantage in risk, fraud, marketing, and sales. As such, the value of data from a business standpoint has increased significantly, resulting in easier justification of storing and processing that data for analysis. If the perception of the strategic value of data were not real, none of the other data dimensions would matter. The great success of data warehouse companies such as Teradata in the last two decades exemplifies this point. These companies have been able to create a separate market (in comparison with transaction processing databases) focused mainly on the value of strategic data analysis.

4.2.2 Volume
With great advances in technology in the last two decades, many transactions between interested entities (human to human, human to machine, machine to machine) can be electronically captured and stored. New transactions that did not exist before are now captured: smart meters, mobile data, RFID data, and healthcare data are just a few of the newcomers. One can now request a cab from a cell phone without entering any address and then pay for the ride by tapping the phone on a reader. The volume of traditional transactions has also significantly increased year over year. For example, traditional payment transactions such as credit and debit have increased significantly in the last decade, after huge growth in the decade just before that. The reader can check Reference [15] for some interesting statistics. We can look at data in terms of different metrics such as the number of rows and columns, files and tables, transactions, or terabytes. All of these metrics have increased significantly, resulting in higher volumes of data available for processing and analysis. For example, a decade ago a large bank may have looked at tens of variables for its risk or marketing predictive models. Now it is looking at thousands of variables instead, and one or more of these variables by itself may provide great competitive advantage for a specific business question.

4.2.3 Variety
Data is now collected in all shapes and forms and for all new purposes. It could be structured, unstructured (e.g., free text), or semi-structured (XML). A decade ago, the capture of unstructured data was only of interest to pioneers in the private sector. Now it is the norm. The data could be static or slowly changing (customer master data), dynamic (transactional data), highly volatile (social media data), etc. Data is even generated when one is not actively engaged, like cell phone location data or TV viewing. The important thing is that each variety of data often requires its own special storage, processing, and analysis.

4.2.4 Velocity
With the advent of the Internet, mobile, and social media, the velocity of transactions among entities has significantly increased. This has impacted data volume. For particular applications, a decision has to be made in real time, in sub-seconds, while other decisions can still be made in batch or near real time. The most famous and pioneering example (which has been in use since 1993) is payment card fraud detection, where authorization decisions based on the latest transaction's fraud score must be made in milliseconds [6].

4.2.5 Validity
Data validity and quality have always been a challenge. Historically, it was typical to expect data validity/quality to suffer as data volumes increased. In the last several years, with improvements in the tools and processes surrounding data quality and the attention given to the value of data, the validity of captured data has improved. This is particularly true for structured data that is captured into data warehouses for business analysis and decision making. The value of data as an asset cannot be realized unless there is trust that it represents the true state of reality in whatever application it is used.

4.2.6 Volatility
One aspect of some of the newer data types that is often ignored is their volatility (or transient nature) and the ever-shrinking time windows to act upon them. Even though data can be captured and analyzed, its relevance may be short lived for a given business purpose. Consider someone who is browsing different sites on the internet looking to buy a specific product. For the specific purchase he has in mind at that moment in time, the collected data has short utility. If one observes his click streams, compares them to the outcomes of others who have generated similar click streams, and quickly provides the right offering to him, a sale could be made. If this data is not used in a short time span, it is no longer useful for revenue generation but only good for some summary statistic in a report. Twitter and some other social media data are good examples of data volatility. The best known example is automatic short-term trading systems, where stock trends over seconds or less are crucial in making a profitable trade. In general, one would expect data volatility as a whole to increase irrespective of the channel or medium, because people have shorter attention spans and their habits change more often than was the norm in the past. Cisco Systems Inc. estimates there are approximately 35 billion electronic devices that can connect to the Internet. Any electronic device can be connected (wired or wirelessly) to the Internet, and even automakers are building Internet connectivity into vehicles. Connected cars could become commonplace by 2012-13 and generate millions of new transient data streams17.

4.3 The Impact of 6Vs

Each combination of these characteristics could require a different process, and maybe a different infrastructure, for data management and data analysis. I divide these six characteristics into two groups:
Group 1 (the traditional 3Vs): Volume, Velocity, Variety
Group 2: Value, Validity, Volatility

The size of data is directly related to any increase in Group 1 (the 3Vs), and this is well understood. Group 2, however, is indirectly related to the size of data. As the value perception of data increases, it creates the appetite for collecting more of it, often with more granularity, velocity, and variety. As an example, consider a department in an organization that has benefited from analysis of its customer data for a particular business need. This propels the curiosity and the drive to collect more data on the customer (internal or third party data) to possibly add more business benefit. Other departments in the organization can follow the same path, and all of this translates into higher volumes of data for the enterprise to store and analyze. That has been the story of the last two decades and of the growth of the data warehouse and analytics market. Some data is known to be valuable but in its current form has low validity. This may have to do with the way the data is generated, transferred, or captured. Such data is deemed useless unless its validity can be improved and assured. As data validity improves, given its known value, it will directly affect volume. As an example, log data from web servers is known to be valuable for the analysis of online fraud, where information about IP address, time stamp, operating system, and browser is captured. If this information is not generated or captured accurately, it will be useless and not worth storing or analyzing. If the validity of this data improves, it affects the volume of data to be processed and analyzed for that specific business need. Note that validity is not equivalent to data quality. One can have valid data with low or medium quality which is still acceptable for data mining18 and analytics. Data can always be cleaned, standardized, and enhanced through data quality procedures.

17. Analytics derived from wireless intra-VAN (Intra Vehicle Area Network) and inter-VAN (between vehicles) data networks are becoming necessary components of smart vehicles in the year ahead and could be another source of data for a slew of new applications [11].
18. Data mining methodologies and algorithms are inherently designed to be resistant to noise and can cope with lower quality data as long as it is valid and has some value.

High volatility data will reduce the volume of data that needs to be stored and analyzed. Highly volatile data has a short useful life span that may not justify its storage and analysis outside an established time window. And if a longer history outside that time window is ever needed, it could be captured and stored as summaries rather than in granular, detailed form.


5 Big Data Analytics

Big Data analytics is more interesting and multifaceted than Big Data storage, but less understood, especially by IT organizations. The development of Big Data analytics processes has historically been driven by the need to cope with web-generated data. However, the rapid growth of applications for Big Data analytics is taking place in all major vertical industry segments and now represents a growth opportunity for vendors that is worth all the hype [9]. One of the challenges of Big Data analytics is the sheer scale of the data: the data simply can't be moved for analysis. Therefore, the data must be analyzed in situ, and/or one must develop methods for extracting a smaller set of relevant data. Big Data analytics is an area of rapidly growing diversity. Therefore, trying to define it is probably not helpful. What is helpful, however, is identifying the characteristics that are common to the technologies now identified with Big Data analytics [4]. These include:

- The perception that traditional data warehousing processes are too slow and limited in scalability;
- The ability to converge data from multiple data sources, both structured and unstructured;
- The realization that time to information is critical to extract value from data sources that include mobile devices, RFID, logs, the web, and a growing list of automated sensory technologies19;
- The accelerating trend of companies moving from annual budgets and monthly/daily reviews to instant responses: they need to know in real time what is going on in their markets, how to predict changes, and how to change their operations faster than their competition.

The key to the success of Big Data analytics projects is the appropriate combination of analytics (advanced software tools combined with the right methodology), expertise (domain knowledge), and delivery platform (hardware architecture) around specific business problems, enabling outcomes and results that are simply not possible in traditional ways.

5.1 Major Developments in Big Data Analytics

There are at least four major development segments that underline the diversity to be found within Big Data analytics. These segments are MapReduce and Hadoop, scalable databases, real-time stream processing, and the Big Data appliance [9].

19. The marketing jargon often used for this is "analytics at the speed of thought."

5.1.1 Hadoop and MapReduce
Apache Hadoop is a software framework that conceptually began with a paper published by Google in 2004 [3]. The paper described a programming model, called MapReduce, for parallelizing the processing of web-based data for Google's search engine implementation using the Google File System (GFS). MapReduce is a software framework to support distributed computing on large data sets across clusters of computers. Shortly thereafter, Apache Hadoop was born as an open source implementation of the MapReduce process. The community surrounding it is growing dramatically and producing add-ons that expand Apache Hadoop's usability within corporate data centers.
In summary, the Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. Performing computation on large volumes of data has been done before, usually in a distributed setting. What makes Hadoop unique is its simplified programming model20, which allows the user to quickly write and test distributed systems, and its efficient, automatic distribution of data and work across machines, which in turn utilizes the underlying parallelism of the CPU cores [13]. Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thus delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
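To make the programming model concrete, the sketch below shows the classic word-count job written for Hadoop Streaming, which lets the map and reduce steps be written in any language that reads from stdin and writes to stdout. This is only an illustrative sketch; the script name and the way mapper and reducer are selected via a command-line argument are choices made here, not part of Hadoop itself.

```python
#!/usr/bin/env python
# Word count for Hadoop Streaming: the same file acts as mapper or reducer,
# selected by a command-line argument ("map" or "reduce"). Illustrative sketch only.
import sys

def mapper():
    # Emit "word<TAB>1" for every word on every input line.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

def reducer():
    # Hadoop sorts mapper output by key, so all counts for a word arrive together.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A job would then be submitted through the Hadoop streaming jar with options such as -input, -output, -mapper "wordcount.py map", -reducer "wordcount.py reduce", and -file wordcount.py, where the input and output HDFS paths are whatever the application requires.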

Apache Hadoop users typically build their own parallelized computing clusters from commodity servers, each with dedicated storage in the form of a small disk array or solid-state drive (SSD) for performance. Historically, such architectures have been called shared-nothing architectures21 and have been in use by some database vendors for some time. Traditional shared storage architectures use scalable and resilient storage area network (SAN) and network-attached storage (NAS), but these typically lack the kind of I/O performance needed to rise above the capabilities of the standard data warehouse. Hadoop storage is direct-attached storage (DAS). A typical issue with Hadoop is the growing list of sourcing choices, which range from pure open source to highly commercialized versions like Cloudera Inc. and MapR. Apache Hadoop and related tools are available for free at the Apache Hadoop site.

20. Traditionally, the collaboration between multiple distributed compute nodes is managed with a communication system such as MPI.
21. Some data warehouse architectures use shared-nothing architectures; the most notable are Teradata and Aster Data. DB2 and Oracle use shared storage.

There are a variety of tools available for Hadoop, such as the following (a small Hive usage sketch appears after the list):
- Hive (initiated by Facebook): data warehouse software with an SQL-like language that facilitates querying and managing large datasets residing in distributed storage.
- HBase (initiated by Powerset): used for random, real-time read/write access to Big Data, where the goal is to host very large tables (billions of rows and millions of columns) atop clusters of commodity hardware.
- Pig (initiated by Yahoo!): an analysis platform that includes a high-level programming language.
- Oozie: provides users with the ability to define workflows (actions and the dependencies between them) and to schedule actions for execution when their dependencies are met.
- Mahout22: an open source, scalable Java library for machine learning and data mining. It is a framework of tools intended to be used and adapted by developers. Mahout aims to be the machine learning tool of choice when the data to be processed is very large, perhaps far too large for a single machine. In its current incarnation, these scalable implementations are written in Java, and some portions are built upon Apache's Hadoop distributed computation project.
- Lucene: very popular for text search; it predates Hadoop. It provides full-text indexing and search libraries for use in an application (Java, C++, Python, Perl, etc.).
- Sqoop, Flume, and Chukwa: allow users to procure data to be ingested and placed in a Hadoop-based cluster.
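As a small illustration of the Hive item above, the following sketch runs a HiveQL statement from Python by shelling out to the standard hive command-line client with its -e option. The table name, column names, and query itself are hypothetical, invented purely for illustration; treat this as a sketch of the SQL-like interaction style, not a reference implementation.

```python
import subprocess

# Hypothetical HiveQL: aggregate daily click counts from a made-up web_logs table.
# Hive compiles this SQL-like statement into MapReduce jobs that run on the cluster.
query = """
    SELECT to_date(event_time) AS day, COUNT(*) AS clicks
    FROM web_logs
    WHERE event_type = 'click'
    GROUP BY to_date(event_time)
"""

# 'hive -e "<statement>"' executes a quoted HiveQL string from the command line.
result = subprocess.run(["hive", "-e", query],
                        capture_output=True, text=True, check=True)
print(result.stdout)
```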

Figure 1: Current Hadoop Big Data Analytics software ecosystem.

22. As of this writing, Mahout is still a work in progress. Some of the algorithms are implemented on top of Apache Hadoop using the map/reduce paradigm. However, Mahout does not restrict contributions to Hadoop implementations; contributions that run on a single node or on a non-Hadoop cluster are also accepted. The core libraries are highly optimized to allow for good performance in non-distributed environments.

In the traditional corporate world, the typical analytics use cases mentioned for Hadoop focus on ETL and the simple processing and analysis of Big Data such as web and computer logs. The following link is an alphabetical list of organizations that are currently using Hadoop; for each organization, there is an explanation of the application and the size of the Hadoop cluster: http://wiki.apache.org/hadoop/PoweredBy

One of the largest Hadoop implementations is at Yahoo!, where most of the user interactions are with data processed by Hadoop, including content optimization, the search index, ads optimization, and content feed processing. Yahoo! Hadoop clusters currently have more than 40,000 computers (100,000+ CPUs). The biggest cluster, used to support research for web search and ad systems, has 4,500 nodes (each node is a 2x4-CPU box with 4x1 TB disks and 16 GB RAM). 60% of Hadoop jobs at Yahoo! are Pig jobs.

Some types of big data are too expensive for IT to store in any other way. These include:
- Unstructured data (search optimization, filtering, indexing)
- Web data: web logs, click streams
- Telemetry and tracking data (e.g., smart meters, automobile tracking)
- Security logs, application/server performance logs, ad serving logs, network security data
- Multi-channel detailed granular data
- Media data (pictures, audio, video, ...)
- Social network data

These data are too large, and storing them on sequential file systems makes their analysis long and tedious. Hadoop provides an efficient way to store them in distributed fashion for much faster and more scalable analysis23. A typical ETL application may require computing and extracting some hand-picked useful variables from these large datasets and storing them in a warehouse as structured fields for BI/reporting and modeling purposes. There are also archiving use cases for Hadoop, storing older data as a better alternative to tapes, whose processing is cumbersome and manually expensive. Another use case could be to go through computer and network security logs24 and look for anything suspicious or abnormal: again, high volume and simple processing.

23. As an example, Yahoo! Search Assist, which uses three years of log data, used to take 26 days to process. With Hadoop and a 20-step MapReduce, this time was reduced to 20 minutes (source: search for "Hadoop and its Real-world Applications").
24. In big organizations, the variety of system logs generated by servers, networks, databases, etc. is massive in size. They need to be analyzed for a variety of purposes, including performance improvement of systems, security/fraud, prediction/provisioning, and reporting/profiling.

One other example Hadoop use by IT is to parse web logs and extract a number of fields into the warehouse for incorporation into existing analytic processing flows. Processing the data in Hadoop significantly reduces the ETL time. Both ETL vendors and data warehouse vendors have sensed a threat for some time and are realigning their products to incorporate Hadoop. Based on anecdotal information from corporate IT staff, from a storage standpoint Hadoop is anywhere from 20-100 times cheaper than a data warehouse storage option; in one case Hadoop cost $500/terabyte versus $20,000/terabyte for an alternative warehouse option. However, a totally different skill set is required to do data analysis with Hadoop, and the cost comparison does not take into account the different skill sets and resources required to use it, especially in the context of a more traditional business analytics environment. In the corporate world, we do not see many applications (except the big, well-known applications at LinkedIn, Yahoo!, and the like) that perform any type of advanced analytics directly in Hadoop. However, many corporations are actively investigating the use of Hadoop for specific internal advanced analytic applications that were not conceivable before. Easier-to-use tools and applications that leverage Hadoop more efficiently in the corporate world are being developed. Hadoop is a better deal for mass-volume unstructured data processing and mass-volume semi-structured data (like XML data). In all instances, it still needs a lot of savvy programming, though many companies are working toward making this programming much simpler.
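As a concrete illustration of the web-log ETL pattern just described, here is a minimal map-only Hadoop Streaming sketch that parses Apache combined-format log lines and emits a few tab-separated fields ready to be loaded into a warehouse table. The regular expression and the choice of fields are illustrative assumptions, not a prescription.

```python
#!/usr/bin/env python
# Map-only ETL step for Hadoop Streaming: parse Apache combined-format web logs
# and emit tab-separated fields (IP, timestamp, method, URL, status, bytes, user agent).
# A sketch under illustrative assumptions about the log layout.
import re
import sys

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)

def main():
    for line in sys.stdin:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip malformed lines rather than failing the whole job
        fields = match.groupdict()
        fields["bytes"] = fields["bytes"] if fields["bytes"].isdigit() else "0"
        print("\t".join([fields["ip"], fields["ts"], fields["method"],
                         fields["url"], fields["status"], fields["bytes"],
                         fields["agent"]]))

if __name__ == "__main__":
    main()
```

Submitted as the mapper of a streaming job with no reduce step, the output lands in HDFS as structured rows that a warehouse bulk loader (or a Hive external table) can then pick up.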

5.1.2 Scalable database
While Hadoop has grabbed most of the headlines because of its ability to process unstructured data in a data warehouse-like environment, there is much more going on in the Big Data analytics space. Structured data is also getting lots of attention. There is a rapidly growing community surrounding NoSQL25 databases. NoSQL is an open source, non-relational, distributed, and horizontally scalable (elastic scaling, or scale out instead of scale up) collection of database structures that address the need for a web-scale database designed for high-traffic websites and streaming media. Compared to their SQL counterparts, NoSQL databases have flexible data models and are designed from the ground up to require less maintenance (less tuning, automatic repair, etc.). NoSQL databases are often categorized according to the way they store the data and fall under categories such as key-value stores, BigTable implementations26, document store databases27, and graph databases. NoSQL database systems rose alongside major internet companies such as Google, Amazon, Twitter, and Facebook, which had significantly different challenges in dealing with data that traditional RDBMS solutions could not cope with. With the rise of the real-time web, there was a need to provide information out of large volumes of data that more or less followed similar horizontal structures. Available implementations include MongoDB (as in humongous DB), Cassandra28, Hadoop HBase, and Terrastore. Another analytics-oriented database emanating from the open source community is SciDB, which is being developed for use cases that include environmental observation and monitoring, radio astronomy, and seismology, among others.

Traditional commercial data warehouse vendors aren't standing idly by. Oracle Corp. is building its next-generation Big Data platforms, which will leverage its analytical platform and in-memory computing for real-time information delivery. Teradata Corp. recently acquired Aster Data Systems Inc. to add Aster Data's SQL-MapReduce implementation to its product portfolio. EMC/Greenplum and a slew of others have been realigning their products to leverage these new technologies. There are also faster implementations of Hadoop and similar technologies (some with real-time processing support): MapR, DataRush, HStreaming, HPCC, Platform Computing, DataStax, etc. are examples of technologies that can serve as alternatives to Apache Hadoop.

25. NoSQL is sometimes expanded to "Not Only SQL" and is an important Big Data technology. In academic communities, these databases are referred to as structured storage, of which relational databases are a subset.
26. Examples of BigTable-based NoSQL databases are Cassandra and Hadoop HBase.
27. MongoDB, for example.

5.1.3 Real-time stream processing
Applications that require real-time processing of high-volume data streams are pushing the limits of traditional data processing infrastructures. These stream-based applications include market feed processing29, electronic trading on Wall Street30, network and infrastructure monitoring, fraud detection, micro-sensors (RFID, LoJack), smart sensors, product recommendation on the web, and command and control in military environments. Furthermore, as the sea change caused by cheap micro-sensor technology takes hold, one expects to see everything of material significance on the planet get sensor-tagged and report its state or location in real time. This sensorization of the real world will lead to a green field of novel monitoring and control applications with high-volume and low-latency processing requirements. In real-time stream processing, data has to be analyzed and acted upon in motion, not at rest. Current database management systems assume a pull-based model of data access, as opposed to the push-based model of a streaming system, in which users are passive and the data management system is active [21].
28. Apache Cassandra is an open source distributed database management system. It is designed to handle very large amounts of data spread across many commodity servers while providing a highly available service with no single point of failure. Cassandra was developed at Facebook to power their Inbox Search feature.
29. The Options Price Reporting Authority (OPRA), which aggregates all the quotes and trades from the options exchanges, estimates peak rates of 122,000 messages per second in 2005, with rates doubling every year. This dramatic escalation in feed volumes is stressing or breaking traditional feed processing systems.
30. In electronic trading, a latency of even one second is unacceptable, and the trading operation whose engine produces the most current results will maximize arbitrage profits. This fact is causing financial services companies to require very high volume processing of feed data with very low latency.
Traditionally, custom coding has been used to solve high-volume, low-latency stream processing problems. Even though the "design it yourself" approach is universally despised because of its inflexibility, high cost of development and maintenance, and slow response time to new feature requests, application developers had to resort to it, as they have not had success with traditional off-the-shelf software. In 2005, Stonebraker et al. [8] listed the eight requirements for real-time stream processing as follows:

(1) Process messages in-stream, without any requirement to store them in order to perform any operation or sequence of operations. Ideally the system should also use an active (i.e., non-polling) processing model.
(2) Support a high-level StreamSQL language with built-in extensible stream-oriented primitives and operators.
(3) Have built-in mechanisms to provide resiliency against stream imperfections, including missing and out-of-order data, which are commonly present in real-world data streams.
(4) Guarantee predictable and repeatable outcomes.
(5) Have the capability to efficiently store, access, and modify state information, and combine it with live streaming data. For seamless integration, the system should use a uniform language when dealing with either type of data.
(6) Ensure that the applications are up and available, and that the integrity of the data is maintained at all times, despite failures.
(7) Have the capability to distribute processing across multiple processors and machines to achieve incremental scalability. Ideally, the distribution should be automatic and transparent.
(8) Have a highly optimized, minimal-overhead execution engine to deliver real-time response for high-volume applications.

In the last few years, several traditional software technologies, such as in-memory databases and rule engines31, have been repurposed and remarketed to address this application space. In addition, stream processing engines (SPEs) have emerged to specifically support high-volume, low-latency processing and analysis. SPEs perform SQL-style processing on the incoming messages (or transactions) as they fly by, without necessarily storing them. Clearly, to store state when necessary, one can use a conventional SQL database embedded in the system for efficiency. SPEs use specialized primitives and constructs (e.g., time-windows) to express stream-oriented processing logic. The following table shows how the three technology approaches stack up against the requirements.

31. Rule engines date from the early 1970-80s, when systems such as PLANNER, Conniver, and Prolog were initially proposed by the artificial intelligence community.

Requirement                     In-memory DBMS (IMDBMS)   Rule Engine (RE)   Stream Processing Engine (SPE)
Keep the data moving            No                        Yes                Yes
SQL on streams                  No                        No                 Yes
Handle stream imperfections     Hard                      Possible           Possible
Predictable outcome             Hard                      Possible           Possible
High availability               Possible                  Possible           Possible
Stored and streamed data        No                        No                 Yes
Distribution and scalability    Possible                  Possible           Possible
Instantaneous response          Possible                  Possible           Possible

Table 2: How the three different technologies stack up against the 8 aforementioned requirements.

One early example of a real-time stream processing system is Falcon™ Payment Card Fraud Detection, which has been in operation since 1993 [6]. This system, in addition to a real-time rule engine and a mechanism to store the state of each individual entity (in this case a payment card user), also incorporated state-of-the-art predictive analytics algorithms. In short, it could process transactions on the fly and assign an account-level fraud score to each without ever storing any of the transactions. Though specifically designed and implemented for transactional payment fraud, it satisfied some of these requirements in the context of its specific application. The ability to do real-time analytics32 on multiple data streams using StreamSQL has been available since 2003. Up until now, StreamSQL has only been able to penetrate some relatively small niche markets in the financial services, surveillance, and telecommunications network monitoring areas. However, with the interest in all things Big Data, StreamSQL is bound to get more attention and find more market opportunities. StreamSQL is an outgrowth of an area of computational research called Complex Event Processing (CEP), a technology for low-latency processing of real-world event data. Both IBM, with InfoSphere Streams, and StreamBase Systems Inc. have products in this space.
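To make the push-based, in-stream model concrete, the following is a minimal sketch, not tied to any particular SPE product, of a time-windowed aggregation over an unbounded stream of transaction events: each event updates a running per-account aggregate and is never stored. The window length, threshold, and field names are illustrative assumptions only.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60  # illustrative sliding time-window, as in SPE "time-window" primitives

# Per-account history of (timestamp, amount) pairs inside the current window.
windows = defaultdict(deque)

def process(event):
    """Consume one event in-stream and emit a decision without storing the stream."""
    now, account, amount = event["ts"], event["account"], event["amount"]
    history = windows[account]
    history.append((now, amount))
    # Evict events that fell out of the sliding window.
    while history and history[0][0] < now - WINDOW_SECONDS:
        history.popleft()
    total = sum(a for _, a in history)
    # A toy "score": flag accounts whose spend inside the window exceeds a threshold.
    if total > 1000:
        print("ALERT account=%s windowed_total=%.2f" % (account, total))

# Push-based loop: events arrive from a feed (here, a generator standing in for one).
def feed():
    yield {"ts": time.time(), "account": "A1", "amount": 700.0}
    yield {"ts": time.time(), "account": "A1", "amount": 450.0}

for evt in feed():
    process(evt)
```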

5.1.4 The In-memory Big Data appliance

As the interest in Big Data analytics expands into enterprise and corporate data centers, the vendor community has seen an opportunity to put together Big Data appliances. These appliances are pre-configured; they integrate servers (including software), networking, and storage gear into a single enclosure and run analytics software that accelerates information delivery to users.
32 Real-time analytics is finding more and more applications in business. It spans technologies such as in-database analytics, data warehouse MPP (massively parallel processing) appliances, and In-memory analytics.


They typically use in-memory computing schemes such as an In-memory database (IMDB), where all data is distributed across the memory of the database nodes for much faster access. These appliances are targeted at enterprise buyers who value the ease of implementation and use inherent in Big Data appliances. Vendors in this space include EMC with appliances built around the Greenplum database engine, IBM/Netezza, MapR with its recently announced commercial distribution of Hadoop, SAP (HANA), and Oracle (Exalytics), as well as Teradata with comparable, pre-integrated systems.


5.2 High Performance Data Mining and Big Data

Data mining is an interdisciplinary field of computer science that focuses on the process of discovering novel patterns from large data sets, using methods at the intersection of artificial intelligence, machine learning, pattern recognition, statistics, and database systems. The main defining characteristic of a data mining system is the large input data sets it must process. Novel patterns can be inferred from these data sets using a variety of computations, ranging from simple and complex queries to OLAP to Advanced Analytics and Machine Learning algorithms. Query and OLAP are considered low end analytics, since their focus is to aggregate, to slice and dice the data, or to compute statistics that present new perspectives on the data. To cope with the huge size of the data, especially in analytics-focused data warehouses and applications, distributed computing techniques have been in use for a long time33. The use of Advanced Analytics algorithms on very large data has been the core and the main differentiator of data mining. These algorithms and techniques provide the ability to discover, learn, and predict new and novel patterns beyond simple analysis of past and current data. They have enabled businesses to make smarter decisions at the level of individual customers (or entities34) in order to impact their behavior. The use of very large data (millions of records and gigabyte-range tables) is nothing new to data mining algorithms. With the exponential growth of data, the new era of Big Data, and the continuous drop in CPU and disk prices, corporations can potentially capture Petabytes and Exabytes of data of all varieties from their interactions with customers and counterparties. A typical data mining analysis may then require processing billions of rows and terabyte-range tables. Getting to this scale requires a rethinking of how to use and deploy existing data mining algorithms and applications. There is no question that data mining systems must do a much better job of separating noise from signal in this sea of data and of discovering interesting patterns in a reasonable amount of time to satisfy business requirements.

5.2.1 Approaches to Achieve High Performance and Scalability in Data Mining

Any successful data mining effort is built upon five pillars:
(a) Availability of relevant data,
(b) Business domain knowledge,
(c) Analytic methodology and know-how,
(d) Analytics software (including algorithms) to implement the methodology,
(e) The right platform for development and deployment of results.

33 Teradata, which delivered its first system in 1983, is probably the first commercial implementation of a Shared-Nothing distributed data computing architecture.
34 In the realm of Data Mining, an entity can be defined as a customer, household, merchant, a device (ATM, POS, cell phone, ...), etc.


Until a few years ago, consistent improvements in processor speed, architecture, and memory kept pace with data growth. However, with the exponential growth of data, this is no longer the case and the gap has been widening. Since the mid-1990s there have been efforts in academic circles to address this gap35. With the success of Hadoop and MapReduce, customers now demand that commercial vendors address the Big Data market with new product offerings. When Big Data is involved, it is essential for the solution to be massively scalable, for which one has to go beyond the normal and traditional ways of doing (c), (d), and (e) above. The traditional ways of speeding up data mining applications have been:
(T1) Improvements in clock speed and processor performance in general,
(T2) Placement of all the data in memory,
(T3) Chunking, which is typically used to achieve data scalability (the ability to handle large data files of any size).
The more modern approaches for increasing analysis speed and scalability are to use parallel and distributed computing techniques:
(N1) Functional partitioning across multiple processing units (cores with shared memory, or computer nodes on a cluster),
(N2) Data partitioning (distribution) across multiple computer nodes, with co-location of the program and the data.

5.2.1.1 Chunking or Data Partitioning

In Computer Science, algorithms that are optimized to process external memory data are referred to as External Memory (EM) Algorithms. External memory typically refers to slower storage media such as disk. The problem domains include databases and data management (sorting, merging, transformation, new variable creation, permuting, ...), scientific computing (FFT, ...), machine learning algorithms (many clustering and decision tree algorithms, linear models, maximum likelihood and optimization algorithms, neural networks, ...), text and string processing, geographic information systems, graphs, and computational geometry. Chunking or data partitioning (T3 above) is a technique used by External Memory algorithms and was originally developed for single-threaded, single-processor applications. In chunking, a subset of the data is read into the available RAM and processed optimally in memory. This continues with reading other data subsets into RAM and processing them sequentially. The results and outcomes on the chunks are then aggregated to get the final output of the algorithm.
35 Many of these efforts have been funded by government agencies.


Handling large data (data scalability) is achieved this way. Many data mining and data analysis tools operating on very large data use this approach. Those tools and applications that require loading the whole dataset into internal memory fail to scale (open source R is a good example).

5.2.1.2 Statistical Query Model (Machine Learning and Advanced Analytics)

In 2006, Chu et al. [7] at Stanford developed a general and exact technique for parallel programming of a large class of machine learning algorithms on multi-core processors. Their central idea was to allow a programmer to speed up machine learning applications simply by throwing more cores at the problem, rather than searching for specialized optimizations of the software or hardware, or using programming languages with parallel constructs (such as Orca, Occam, MPI, SNOW, or PARLOG). They showed that:
(i) Any algorithm that fits the Statistical Query Model36 (SQM) may be written in a certain summation form without changing the underlying algorithm (so it is not an approximation);
(ii) The summation form can be expressed in the MapReduce framework;
(iii) The technique achieves essentially linear speedup with the number of cores.
They adapted Google's MapReduce paradigm to demonstrate this parallel speedup scheme. The algorithms that satisfied these conditions included Linear Regression, Logistic Regression, Naive Bayes, Gaussian Discriminant Analysis, K-Means Clustering, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, and Support Vector Machines. Their experimental runs showed that these machine learning algorithms achieve linear speedup in the number of cores.
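As an illustration of the summation form (a sketch under simplifying assumptions, not the authors' implementation), ordinary least-squares linear regression reduces to two sums over the data, X'X and X'y, which can be computed independently on each data partition (the map step) and then added together and solved on a single node (the reduce step):

```python
import numpy as np

def map_partial_sums(X_part, y_part):
    """Map step: each partition contributes its local sufficient statistics."""
    return X_part.T @ X_part, X_part.T @ y_part

def reduce_and_solve(partials):
    """Reduce step: add the partial sums and solve the normal equations."""
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)

# Synthetic example split into three "partitions" (illustrative data only).
rng = np.random.default_rng(0)
X = rng.normal(size=(900, 4))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta_true + 0.01 * rng.normal(size=900)

partitions = [(X[i:i + 300], y[i:i + 300]) for i in range(0, 900, 300)]
partials = [map_partial_sums(Xp, yp) for Xp, yp in partitions]
print(reduce_and_solve(partials))  # close to beta_true
```

Because the partial sums are small (a k-by-k matrix and a k-vector), only tiny intermediate results need to travel between the map and reduce steps, which is what makes the approach scale with the number of cores or nodes.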

5.2.1.3 Serial Computing Environment

In a serial computing environment, where the algorithm of interest is single-threaded and runs on a single core, placing the whole data in memory will naturally speed up processing, especially if the algorithm requires multiple iterations through the data. However, this only works if the data fits completely into the fast internal memory (RAM)37. For data mining applications, where datasets are almost always very large, this is rarely the case. Historically, any data mining tool requiring data to fully reside in memory has not been scalable in terms of its ability to handle data of any size.
36 Statistical query (SQ) learning is a natural restriction of probably approximately correct (PAC) learning, proposed by Leslie Valiant; it models algorithms that use statistical properties of a data set rather than individual examples.
37 64-bit processors, which have a practically unlimited addressable memory space, are constrained only by the size of physical memory. In 32-bit processors, the maximum addressable space was at best 4 gigabytes (typically lower, depending on the OS), no matter the size of the physical memory.


Scalable data mining tools have to continuously access the external memory. The resulting input/output (I/O) overhead between fast internal memory and slower external memory (such as disk) has always been their major performance bottleneck. Chunking has been a technique to provide data scalability while also optimizing the I/O by taking advantage of locality (caching subsets of the data). Statistical Query Model algorithms can also be implemented in this way to obtain data scalability. Many data mining and data analysis tools operating on very large data use this approach. In summary, (T1) through (T3) are the popular techniques that have been applied in these environments. Algorithms deployed in these environments historically benefited from consistent increases in processor speed, until recently. The Power Wall and Big Data are both forcing data mining software developers to use parallel and distributed approaches for further processing speedups.
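As a concrete illustration of chunking in a serial environment (a minimal sketch; the file name, column name, and chunk size are assumptions), the following snippet streams a CSV file that may not fit in RAM, holds only one chunk in memory at a time, and aggregates running sums into a final statistic:

```python
import csv

CHUNK_ROWS = 100_000  # how many rows to hold in RAM at once (illustrative)

def chunked_mean(path, column):
    """Compute the mean of one numeric column without loading the whole file."""
    total, count, chunk = 0.0, 0, []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            chunk.append(float(row[column]))
            if len(chunk) == CHUNK_ROWS:
                # Process the chunk in memory, then discard it.
                total += sum(chunk)
                count += len(chunk)
                chunk = []
        if chunk:  # leftover partial chunk
            total += sum(chunk)
            count += len(chunk)
    return total / count if count else float("nan")

# Hypothetical usage: a large transaction file with an 'amount' column.
# print(chunked_mean("transactions.csv", "amount"))
```

The same pattern generalizes to any statistic that can be accumulated from per-chunk partial results, which is exactly the property exploited by EM and SQM algorithms.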

5.2.1.4 Multiprocessor Computing Environment (Symmetric Multiprocessing or SMP)

Symmetric multiprocessing (SMP38) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors. In an SMP environment that uses shared external storage (external memory), the schemes (T1) through (T3) will provide speedup39. The real benefit, however, is realized by leveraging the parallelism of the algorithm and mapping it to the multicore environment using multithreading40. The latter requires careful analysis and implementation of the algorithms in a multithreaded fashion. If the entire algorithm can be parallelized and multi-threaded, in theory it can be sped up linearly with the number of processors. In practice, however, not all parts of an algorithm are parallelizable, and the maximum speedup follows Amdahl's Law41.
38 NUMA or ccNUMA (cache-coherent Non-Uniform Memory Access) is an enhancement of the SMP architecture to surpass the scalability of pure SMP. NUMA or ccNUMA could also be used in place of SMP in this environment.
39 There are In-memory relational databases, such as Oracle TimesTen, that are designed for low-latency, high-volume data, transaction, and event management. An In-memory database (IMDB) typically uses an SMP architecture and stores the whole database in the shared memory. As such, the database size is limited by the size of RAM.
40 The term multi-threaded is used for software systems that create and manage multiple points of execution simultaneously. Compared to a process, a thread is lightweight because its state can be preserved by a set of CPU registers and a stack, making thread switching very efficient. Memory and other resources are all shared among threads in the same process.
41 Amdahl's law is a model for the relationship between the expected speedup of a parallelized implementation of an algorithm and the serial algorithm, under the assumption that the problem size remains the same when parallelized. More technically, the law is concerned with the speedup achievable from an improvement to a computation that affects a proportion P of that computation, where the improvement has a speedup of S. The overall speedup is then 1 / ((1 - P) + P/S); for example, if 90% of a computation can be parallelized (P = 0.9), the overall speedup can never exceed 10x, no matter how large S is.


It is important to realize that it is possible to automatically parallelize and distribute External Memory algorithms. Chunking provides a way for data subsets to be processed in parallel on different cores at the same time: all cores access the same shared storage (typically disk) but operate on their own subset of the data. Statistical Query Model algorithms are ideal for implementation in this environment and can take advantage of parallel processing on multicore. Many programming frameworks exist for the implementation of Statistical Query Model algorithms. Google's MapReduce, which is based on functional programming constructs, is one way to implement these algorithms in a multicore environment. Refer to [7] for details.
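A minimal sketch of this idea, using Python's standard multiprocessing module (the dataset, the four-way partitioning, and the statistics computed are illustrative assumptions), distributes chunks across cores, lets each core compute a partial summation-form result, and combines the partials in a final reduce step:

```python
from multiprocessing import Pool

def partial_stats(chunk):
    """Each core computes sufficient statistics for its own chunk."""
    return sum(chunk), sum(x * x for x in chunk), len(chunk)

def combine(results):
    """Reduce step: merge partial sums into a global mean and variance."""
    s = sum(r[0] for r in results)
    ss = sum(r[1] for r in results)
    n = sum(r[2] for r in results)
    mean = s / n
    return mean, ss / n - mean * mean

if __name__ == "__main__":
    data = [float(i % 100) for i in range(1_000_000)]   # stand-in dataset
    chunks = [data[i::4] for i in range(4)]              # four partitions, one per core
    with Pool(processes=4) as pool:
        partials = pool.map(partial_stats, chunks)       # map step across cores
    print(combine(partials))                              # (mean, variance)
```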

Figure 2: Symmetric Multi-Processing (SMP) architecture, where all processor cores share all resources, including RAM and storage (a NAS, SAN, or DAS storage appliance). These systems can be highly optimized in terms of hardware and software and can be very expensive.

5.2.1.5 Distributed Computing Environment (Cluster Computing) with Shared Storage

A computer cluster is a group of linked computers, each with its own local memory and OS instance, working together so closely that they resemble a single computer. Speeding up data mining algorithms in these environments is in essence similar to SMP, with the difference that each compute node in this case is a separate computer with its own resources. The cluster nodes share only the external memory (disk), and any scheme used to parallelize an algorithm should manage and account for the communication between the cluster compute nodes. On each node, (T1) and (T2) can be leveraged for speedup. In this environment, data lives on shared storage and is not distributed at rest across the cluster compute nodes. Similar to SMP, cluster (or grid) computing is mainly geared toward functional partitioning (N1), that is, distributing programs across nodes and load balancing them, not data partitioning.


Figure 3: Distributed Computing Architecture using commodity computing nodes with shared storage. Each node has its own resources (RAM, multicore processors, DAS), with the exception of the shared storage appliance (NAS, SAN, ...) reached over the network, where the data to be processed resides. Typically one node is dedicated as the Master (or Queen) and the rest are assigned as Slave (or Worker) nodes.

5.2.1.6 Shared-Nothing Distributed Computing Environment

A Shared-Nothing (SN) architecture is a parallel distributed computing architecture in which each compute node is independent and self-sufficient, i.e., none of the nodes share any resources, including internal memory or external memory (disk storage). As such, there is no single point of contention across the system. SN architectures are highly scalable and have become prevalent in the data warehousing space42 and in web development, largely because of their sheer scalability: they can scale almost without limit by adding inexpensive computers as nodes. With distributed partitioning of the data (called sharding), each compute node can operate in parallel on its own data subset, locally and independently. In this environment, data is distributed at rest. The key to optimal parallelism here is the co-location of the data and the program. This is ideal for EM and SQM algorithms. Since the mid-1990s, when the use of commodity servers as compute nodes became popular in academia, there has been a consistent trend toward the use of clusters of commodity computers in commercial applications, including databases and data analysis applications.
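To illustrate the shared-nothing pattern (a toy sketch, not a real cluster framework; the node names and their local partitions are assumptions), each simulated node below owns its own data partition, computes a local aggregate independently, and ships only the small partial result back to a master process that merges them:

```python
from multiprocessing import Pool
from collections import Counter

# Each "node" owns its local partition at rest; no partition is shared.
NODE_PARTITIONS = {
    "node1": ["books", "music", "books", "games"],
    "node2": ["music", "music", "books"],
    "node3": ["games", "books", "games", "games"],
}

def local_group_by(partition):
    """Runs on one node: aggregate only the local data."""
    return Counter(partition)

if __name__ == "__main__":
    with Pool(processes=len(NODE_PARTITIONS)) as pool:
        # Ship the computation to the data; only small Counters travel back.
        partials = pool.map(local_group_by, NODE_PARTITIONS.values())
    master_total = sum(partials, Counter())   # master merges partial results
    print(master_total)                        # Counter({'books': 4, 'games': 4, 'music': 3})
```

The point of the sketch is the co-location of program and data: the heavy work happens where each partition already lives, and only compact aggregates cross the network.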

42 There are computer cluster architectures for data warehouses that are shared-everything, which means every node in the cluster has access to the resources of the other nodes. Sybase IQ (a part of SAP AG), which is a columnar data warehouse, is an example of such an architecture, where the nodes are connected through a full-mesh high-speed interconnect and the storage is shared among all nodes.


Figure 4: A depiction of a Shared-Nothing (SN) Distributed Computing Architecture using commodity computing nodes connected over a network. As in Figure 3, each node has its own local resources (RAM, multicore processors), but here local storage (DAS) is included and each node is in full control of those resources. In this scheme the data is also distributed across the nodes on their local storage, i.e., the data is distributed at rest. Each node operates only on its local data and reports its processing results back to the Master (or Queen) node.

5.2.1.7 In-memory Distributed Computing Environment

In a Shared-Nothing distributed computing environment, the partitioned, distributed data can be placed in the local memory of each computing node instead of on its local disk, resulting in a significant improvement in processing speed. In-memory distributed databases are the latest innovation in database technology43. IT organizations dedicate significant resources to building data structures (either through denormalized schemas or OLAP programming) that provide acceptable performance for analytics applications, and even then, maintaining these structures can introduce formidable challenges. The most obvious advantage of In-memory technology, then, is the improved performance of analytics applications. Equally beneficial is the simplicity of In-memory solutions, which eliminate arduous OLAP requirements. OLAP technologies require significant expertise around building dimensions, aggregates, and measures; typically this means expertise in SQL development and multi-dimensional programming languages such as MDX. The result is the delivery of enterprise reporting and analytics with less complexity and simplified application maintenance compared to traditional OLAP solutions. In addition to low end analytics (basic data analysis, OLAP, and visualization), In-memory architectures can also be extended to perform sophisticated data mining and advanced analytics. There are applications that can greatly benefit from this capability.

43 SAP HANA and Oracle Exalytics (which uses Oracle TimesTen) are examples of In-memory databases (IMDBs) that aim at applications requiring close to real-time response for analysis and exploration of structured data. They are not intended to handle unstructured data or Petabyte-scale data. IMDBs primarily rely on main memory for database storage instead of disk.


5.2.1.8 Side Note on Accelerated Processing Units (APUs, or CPU/GPU combinations)

Graphics Processing Units (GPUs) are SIMD (single instruction, multiple data) architectures that have been used for fast graphics processing by graphics card vendors such as Nvidia, Intel, and AMD/ATI. With the limits imposed on CPU clock speeds by the Power Wall phenomenon, chip vendors have been looking to pack more power into their processing units in a variety of ways. Aside from multicore chips, one trend is to fuse the CPU and GPU on the same die. Today, parallel GPUs have begun making computational inroads against the CPU44, and a subfield of research called GPGPU (General Purpose Computing on GPUs) has found its way into fields as diverse as machine learning, linear algebra, statistics, stream processing, and stock option pricing. Many algorithms in Data Mining and Analytics can be sped up using the parallel computational power of a GPU. The challenge with the use of CPU/GPU combinations (called APUs) is the recoding required to convert existing algorithms to run on the GPU. There are languages such as OpenCL45 for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors such as DSPs (Digital Signal Processors). As of the time of this writing, OpenCL is considered the most widely used development platform for this purpose. Furthermore, GPU-based high performance computers are starting to play a significant role in large-scale modeling. Three of the five most powerful supercomputers in the world take advantage of GPU acceleration, including the leader as of October 2010, Tianhe-1A, which uses the Nvidia Tesla platform.

NOTE on FPGAs: FPGAs (Field Programmable Gate Arrays) are reconfigurable hardware that can be used as accelerators to speed up specific data processing applications. FPGAs have been used in database appliances from Netezza: Netezza's appliances use a proprietary Asymmetric Massively Parallel Processing architecture that combines open blade-based servers and disk storage with a proprietary data filtering process using FPGAs. For Analytics in general, FPGAs have been recommended for use in high-volume real-time stream processing, dynamic query workloads, and data mining, though overall they have not found much traction.

44 Folding@home is a distributed computing project designed to perform simulations of protein folding and other molecular dynamics, and to improve the methods of doing so. It is developed and operated by Stanford University and pioneered the use of GPUs, PlayStation 3s, and MPI on multicore processors.
45 Originally developed by Apple, OpenCL is supported by Intel, AMD (which acquired ATI), Nvidia, and ARM.


6 Applications of Big Data Analytics

Shared-Nothing distributed computing environments are currently the best alternative for solving Big Data problems. The In-memory version of the architecture is suited to specific analytics applications that require close to real-time responses. These architectures benefit from distributing the data across many computing nodes and co-locating the program with the distributed data. Some data warehouse vendors have been using such architectures in their products for some time. For them, the capability to run low end analytics in such an environment has provided new commercial opportunities and has been the source of their success. They have been able to speed up basic data management, statistics, and analytics significantly. Some have also tried to implement a subset of the more advanced machine learning and statistical modeling algorithms on their platforms. However, due to the lack of popularity of SQL as a language for advanced analytics, none has succeeded in getting consistent traction46. There are also third-party analytic software alternatives, such as Fuzzy Logix, that provide analytics libraries that run in-database using SQL. For companies heavily invested in such parallel data warehouses, there has been a real benefit in doing more of their analytics in-database. This created a trend, starting in the early 2000s, toward moving more of the analytical processing that was being conducted outside the data warehouse into the data warehouse, reducing unnecessary large data movements (and hence the I/O burden). In-database model development never got much traction, since many modeling problems work equally well on properly sampled data outside the data warehouse. Data preparation for model development and model scoring, however, were the great beneficiaries of the in-database trend. Scoring (including its data preparation step) has to be executed on the data in its entirety, not just on sampled data, and moving the scoring code to the data warehouse significantly reduces data movement post-model. In existing databases, performing in-database analytics and scoring can be challenging due to the variety of limitations they impose on third-party developers. A data warehouse, moreover, is a general purpose shared resource that serves many different purposes; it is not really intended to solve Big Data problems. Data warehouse vendors (e.g., Teradata, Oracle, IBM Netezza47, Greenplum, ...) have realized that their current architectures can be employed as an appliance (a dedicated platform) to solve Big Data problems, and analytic software vendors can export their advanced analytic capabilities into these appliances. One issue with Big Data appliances is their high cost. In academia, clusters of commodity computers have long been used to solve challenging problems at much lower cost.

46 A good example is Teradata's venture into advanced analytics and Data Mining with TeraMiner.
47 Netezza was the first to coin the term Data Warehouse Appliance.


They have also been successfully employed commercially to solve specific Big Data problems in the last several years, and this has created another alternative that is Open Source and low cost. Popular languages for performing serious analytics on very large data (beyond reporting and OLAP) are still few; the notable ones are SAS and R. These languages must now offer a subset of their capabilities and power on these new architectures. This is especially pressing in the presence of Open Source alternatives (i.e., the Hadoop File System, the MapReduce parallel computing paradigm, and their associated software such as Hive, Mahout, ...) that have been used successfully to solve specific but challenging data-intensive problems, problems that were not cost effective or practical to solve using traditional databases.

6.1 Applications of Big Data in Traditional Analytic Environments

IT departments, typically in conjunction with data consumers (the business), select high-value data and take it through sophisticated ETL (or ELT) and data quality steps for storage in an Enterprise Data Warehouse (EDW). When the value per byte of the data is low or unknown, it is not justified to let it into the EDW in any shape or form, especially if it is Big Data. Such data first needs to be stored in raw form somewhere for possible further analysis, until its real value to the business can be assessed. Also, some data (unstructured and semi-structured) is simply never suited for storage in an EDW. In Sub-section 5.1.1, I mentioned some examples of Big Data that were too expensive to store and analyze in any way other than Hadoop. These included:
- Unstructured data (on the Internet, intranets, social media, etc.) for search optimization, filtering, indexing, and social media analytics,
- Semi-structured data:
  o Web logs,
  o Click streams,
  o Security logs,
  o Application/server performance logs48,
  o Ad serving logs,
- Multi-channel granular customer behavior data (including speech, web browsing, call center, etc.),
- Network data and computer security data,
- Sensor/telemetry data,
- Media files (pictures, audio, video, ...).
48 There are many start-ups specializing in handling specific types of big data. For example, Splunk is enterprise software used to monitor, report on, and analyze the machine data produced by the applications, systems, and infrastructure that run a business. Splunk lets users search, monitor, and analyze machine-generated data via a web-style interface.


The following is an attempt to show, at a high level, the different applications of Big Data in today's traditional analytic environments, from an Analytics perspective (where the value of the data can be realized) rather than a storage perspective.

6.1.1 ETL

Sometimes the Big Data (typically in unstructured or semi-structured form) is captured and processed solely for the purpose of deriving new variables (columns or attributes) for each entity of interest, which are then loaded into an existing enterprise data warehouse. This ETL use case is of great interest to Information Technology teams because it lets them provide business analysts and data scientists with new information about the entities of interest, information that was previously impossible to provide. In this use case, big data volumes are reduced to a set of new variables that can be appended to the existing structured variables for each entity. The addition of new variables can further augment the current analytic processes. Hadoop and MapReduce are becoming popular in IT departments for this purpose since they significantly reduce ETL time, are cheaper (compared to database alternatives), and often allow tapping into new sources of information previously deemed impossible to analyze.
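As a toy illustration of this ETL pattern (a sketch only; the log format, field names, and derived variables are assumptions, and a real deployment would express the same logic as MapReduce jobs on Hadoop), the snippet below reduces raw click-stream records to a handful of per-customer variables ready to be appended to a warehouse table:

```python
from collections import defaultdict

# Hypothetical raw semi-structured click-stream records.
raw_events = [
    {"customer": "C1", "page": "/checkout", "seconds": 42},
    {"customer": "C1", "page": "/help", "seconds": 130},
    {"customer": "C2", "page": "/home", "seconds": 8},
    {"customer": "C1", "page": "/checkout", "seconds": 15},
]

def derive_variables(events):
    """Reduce raw events to one row of derived variables per entity."""
    per_customer = defaultdict(lambda: {"visits": 0, "total_seconds": 0, "checkout_visits": 0})
    for e in events:
        row = per_customer[e["customer"]]
        row["visits"] += 1
        row["total_seconds"] += e["seconds"]
        row["checkout_visits"] += e["page"] == "/checkout"
    return per_customer

for customer, variables in derive_variables(raw_events).items():
    print(customer, variables)  # these rows would be appended to the EDW table
```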

6.1.2 Extracting Specific Events, Anomalies or Patterns

There are situations in which the Big Data (typically unstructured, semi-structured, or media) is to be processed for the detection and analysis of interesting events, rare events (anomalies), salient patterns, or interesting bounds. Example use scenarios are monitoring, security, fraud, or intelligence applications. The real objective is to detect interesting or suspicious events and patterns that match specific statistical or syntactic signatures.
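A minimal sketch of this use case (the log format, the regular-expression signature, and the two-standard-deviation threshold are illustrative assumptions) scans log lines for a syntactic signature and flags statistically anomalous values with a simple z-score style test:

```python
import re
from statistics import mean, stdev

# Syntactic signature: failed logins from any IP address (pattern is illustrative).
FAILED_LOGIN = re.compile(r"FAILED LOGIN from (\d+\.\d+\.\d+\.\d+)")

log_lines = [
    "2012-04-01 10:00 FAILED LOGIN from 10.0.0.5",
    "2012-04-01 10:01 login ok user=alice",
    "2012-04-01 10:02 FAILED LOGIN from 10.0.0.5",
]
matches = []
for line in log_lines:
    m = FAILED_LOGIN.search(line)
    if m:
        matches.append(m.group(1))
print("signature hits:", matches)

# Statistical signature: flag amounts more than 2 standard deviations from the mean.
amounts = [12.0, 15.5, 11.0, 14.2, 13.3, 980.0, 12.8]
mu, sigma = mean(amounts), stdev(amounts)
anomalies = [a for a in amounts if abs(a - mu) > 2 * sigma]
print("anomalous amounts:", anomalies)  # flags the 980.0 outlier
```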

6.1.3 Low End Analysis, Queries, and Visualization

In this use case, the Big Data in its raw form needs to be queried or visualized for different business objectives and at all times. Unlike the first use case, it is simply not cost effective to convert the data into a structured form for storage in a data warehouse. Unlike the second use case, the interest here is not focused on specific events or patterns in the data but on every aspect of the data. Consider non-traditional environments such as Yahoo! or LinkedIn that use, for example, Hadoop to capture, store, and analyze all of their site activity, based on which they can provide new services and products to their users.

The objective here is not to store the data and summarize it once and for all, but to house the data permanently for continuous analysis and reporting. These companies also perform sophisticated analytics and data mining on this data, which is the subject of the next section. In a nutshell, this is a use case in which queries are run against a NoSQL, non-relational Big Data repository. In a traditional environment such as a bank, one may need to store all the granular website data for more detailed analysis that suits the needs of the Marketing or Fraud departments. The Big Data, however, could be structured and already placed in a distributed relational database or a distributed file repository for the purpose of batch analysis. In-memory distributed database systems or In-memory distributed file systems provide possibilities for real-time analysis of this Big Data, e.g., real-time queries, real-time OLAP, or real-time visualization to implement a variety of if-then scenarios of interest to data scientists and business analysts. Risk, Fraud, Compliance, and Marketing/Sales can benefit from such analysis as long as the business problem and goals are clearly defined. In this context (structured Big Data), gigabyte- to terabyte-scale tables are analyzed, or datasets containing billions of rows instead of millions.

6.1.4 Data Mining and Advanced Analytics

Performing Data Mining and advanced analytics on Big Data is even more challenging. The analytic datasets in this case are almost always in non-normalized structured form, as rows and columns, and could be sparse. These datasets are typically the result of processing and preparing other raw Big Data datasets of all shapes and forms (structured or unstructured). Data Mining algorithms are often iterative and require more compute power compared to low end analytics, and they are harder to implement on distributed computing systems. Examples include recommendation engines (e.g., Amazon and Netflix), Internet search (e.g., the PageRank algorithm of Google [17]), clustering (e.g., LinkedIn), etc. Some data mining applications dealing with structured data require access to all the data in its entirety. As an example, link analysis for a large Telco, or Customer/Merchant analysis for a payment card processor (Visa, MasterCard, AMEX, Discover, ...), requires a huge amount of transactional data to be processed in whole so that network-related variables (communities, influencers, followers, etc.) can be extracted. Aside from the massive amount of transactional data that has to be processed to derive all the relationships, some of the algorithms, like community detection, are iterative, and all can benefit from high performance data mining architectures. Text Mining, Text Analytics, and Content Mining are known and popular applications for unstructured Big Data. For the last fifteen years, search engine companies have dealt with the Internet-scale Big Data problem, and many Big Data technologies have emerged from their solutions (see Section 5.1.1 on Google GFS and MapReduce). Mining media content (speech, audio, pictures, video) is still the subject of research, but there are technologies developed today that address specific applications.

For example, Nexidia has developed a phonetic indexing and search approach that is superior to the slow and inaccurate process of large-vocabulary continuous speech recognition (speech-to-text).

6.2 Big Data Mining and Sampling

Historically, one major distinction between Data Mining and traditional Statistics approaches to modeling has been the issue of sampling. Statistics, due to its historical roots, has always emphasized reducing the data (rows and columns) as much as possible before any final analysis. Data Mining (and Machine Learning), on the other hand, has its roots in Computer Science, where computing power has always been available to throw at large data problems using often heuristic or iterative algorithms. Hence, the Data Mining approach is very liberal in its use of data and typically preaches "the more data the better." However, even in this context, "the more data the better" does not necessarily mean that all available data has to be used. Many Data Mining algorithms, though not all, benefit from established sampling methodologies. Sampling has been a cornerstone of data mining and a great contributor to its success: many existing Data Mining algorithms and methodologies work as well on properly sampled data as they do on the whole data. Data Mining has always opted for using larger datasets (larger samples) when available, as long as doing so empirically adds value to the final outcome. The capability to store and analyze Big Data on high performance computing platforms also brings the possibility of performing all data mining steps on the whole data, including the model development step, which is typically performed on a data sample. In this context, I am often asked whether this capability, developing the model on the whole data, provides any real, measurable advantages, and if so, what those advantages are. Over twenty years, the author has dealt with a variety of very large data, from image recognition to payment card transactions to retail transaction data. Empirically, it is the experience of the author that the performance of typical models built on properly sampled data does not improve much when all the data is used instead. This holds regardless of the algorithm, as long as the random sample is chosen properly, which comes with experience. Machine learning practitioners learn through empirical experience how to choose their training sample. Data miners always use validation and test datasets to select the model with the best generalization. This is in contrast to what traditional statistics has suggested, where models are selected based on satisfying certain criteria during training, which often requires making many assumptions about the distribution of the data and the structure of the model. Today, Statistical Learning Theory provides the theoretical principles and underpinnings for what data mining and machine learning practitioners had learned and practiced empirically over decades. In the next subsection, I summarize some key concepts in Statistical Learning Theory that are relevant to this sampling discussion.


6.2.1 Structural Risk Minimization (SRM) and VC Dimension

For a given classification learning task49, with a given finite amount of training data (N observations), the best generalization performance (the lowest Test Error R(α)) will be achieved if the right balance is struck between the accuracy attained on that particular training set (called the Empirical Risk or Empirical Error R_emp(α)) and the capacity of the learning machine (measured by a single non-negative integer h, called the VC Dimension), that is, the ability of the machine to learn any new training set. A machine with too much capacity (large h, or high complexity) is like someone with a photographic memory who, when presented with a new tree, concludes that it is not a tree because it has a different number of leaves from anything she has seen before; a machine with too little capacity is like someone who declares that if it's green, it's a tree. Neither can generalize well. The exploration and formalization of these concepts has resulted in the Structural Risk Minimization (SRM) principle, one of the shining peaks of Statistical Learning Theory. The principle states that, with probability (1 - η), the Test Error of a learning machine is bounded as follows, irrespective of the data distribution:

$R(\alpha) \le R_{emp}(\alpha) + \Phi(h, N, \eta)$    (EQ 1)

where the Confidence Term is:

$\Phi(h, N, \eta) = \sqrt{\frac{h\left(\ln(2N/h) + 1\right) - \ln(\eta/4)}{N}}$    (EQ 2)

In simple terms, this states that the Generalization Error of a learning system (typically represented by a family of functions used to model the phenomenon of interest) is bounded by the Empirical Error (Training Error) plus a Confidence Term.

49 The SRM results can be generalized to estimation learning tasks as well, where the target of interest is continuous instead of binary.


The Confidence Term is a function only of the number of training observations N and the capacity of the learning machine (measured by h, the VC Dimension), whatever the data. The Confidence Term is very conservative (it could be hundreds of times larger than the empirically observed over-fitting effect), but this is expected since it makes no assumption about the data distribution; it holds true whatever the data. The important principle (EQ 1) outlines is that to select50 the best learning system51, one has to choose the one with the lowest right-hand side, which depends only on the Training Error, N, and h (assuming a fixed η). When N is also fixed (all the data is used, or the size of the training sample is already decided), the generalization error bound is fully controlled by the capacity h. There are a few very important observations. For a fixed N, as h increases (the learning system capacity or complexity increases), R_emp(α), the Training Error, reduces to zero (the system memorizes the training observations), while the Confidence Term keeps increasing with h, causing poor generalization (higher Test Error). All of this matches the empirical results obtained in Machine Learning, for example when training high-capacity learning systems such as neural networks (parametric), decision trees, or K-nearest neighbor (the last two are non-parametric). The best model, however, is the one that minimizes the right-hand side of (EQ 1).

50 There are other model selection techniques, such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), etc.
51 Occam's Razor states that "simpler explanations are, other things being equal, generally better than more complex ones." In Machine Learning this is interpreted as: less complex models are to be selected if they provide the same performance as more complex models. So the goal of any learning algorithm is not to learn spurious or transient patterns but the core underlying patterns that are more persistent and stable.


For a fixed h (learning system capacity), as N increases, R_emp(α) approaches a minimum limit asymptotically and the Confidence Term goes to zero. This means that for large N, R(α) will be equal to R_emp(α), or in simple terms, training and test errors become the same. This matches the empirical results and is what statistics is all about. In data mining, N is typically much larger than h:

$N \gg h$    (EQ 3)

Note that there is no mention of the number of variables (inputs) or the number of learning system parameters in (EQ 1). This also explains empirical results where good, generalizable models have been successfully developed when v (the number of variables) or w (the number of parameters) was much larger than N, because (EQ 3) still held. h (capacity or VC Dimension) is related to, but not the same as, the number of free parameters of the learning system. There are systems with a single free parameter that have infinite VC Dimension52, and systems with an infinite number of parameters that have finite VC Dimension. One can interpret h as the effective number of parameters of a system. There has been a lot of effort to compute h (the VC Dimension) for different classes of learning systems, either in closed form or empirically. As an example, for a linear regression model (a single-layer network with a linear threshold), the VC Dimension is equivalent to the number of linear model parameters w. For feed-forward neural networks with threshold units, the VC Dimension is of the order of w log w, where w denotes the number of parameters of the feed-forward network; this higher capacity, as measured by VC Dimension, explains the empirical results of such networks having higher classification capacity. For feed-forward networks with continuous non-linear thresholds, the VC Dimension is of the order of w^2 [20]. VC Dimension also explains beautifully why techniques such as regularization and maximum margin classifiers generalize better: these techniques effectively reduce the VC Dimension, or capacity, of the learning machine. For example, in multiple linear regression, regularization helps combat multi-collinearity (LASSO is an example) by introducing bias into the estimation while reducing the variance of the estimated coefficients. Regularization in essence reduces the capacity of the system (reducing the VC Dimension), providing better generalization (reducing h reduces the Confidence Term).

52 For points on a single dimension x, the function f(x, α) = θ(sin(αx)) (where θ is the step function) has infinite VC Dimension or capacity while having just a single parameter α.


6.2.2 Properly Sampled Data is Good Enough for Most Modeling Tasks

Based on the SRM summary in 6.2.1, to obtain good generalization there is no need to use the whole dataset, as long as the number of training observations is sufficiently larger than the capacity of the learning system.

From a Data Mining perspective, here are some typical questions that are raised:

(1) How is the size of the sample to be selected? Isn't it better to use all the data if one can?

Answer: Machine learning (and statistical modeling) is by nature an approximation exercise: explaining a phenomenon using the simplest yet most accurate model. Historically, the size of the sample required to solve the problem has been selected empirically, based on the learning problem, the choice of the learning system, and experimentation. In the case of linear regression, the VC Dimension is known. For cases where it is not known, some practitioners have used empirical methods to measure it and then used it directly to build good classification systems, without using any validation dataset. The typical and simplest approach is to use a properly sampled data size (with training and validation sets) that one has learned through experience. If one can afford to use more of the data (e.g., by having access to a Big Data high performance analytic platform) and does not have much prior experience with the problem, it is justified to use it all, at least as a starting point. However, one should not necessarily expect better classification or estimation performance by using all the data53.

(2) What about highly unbalanced datasets (classification tasks)?

Answer: When the data is highly unbalanced, meaning that the event (target) to be modeled is extremely rare, the solution requires under-sampling of the non-rare event. As an example, in payment card fraud, on average 1 out of every 2,500 active credit card accounts could be fraudulent [6]. This means that for every 100 million accounts, only 40,000 accounts are fraudulent. The correct methodology requires proper under-sampling of non-fraud accounts to build a more reasonably balanced dataset that can then be learned by the learning algorithms of interest. In such a case, access to all the data for the purpose of model development does not add any value; it actually backfires and generates poorly performing models. One may argue that instead of under-sampling the non-event, one could over-sample (replicate) the rare events to the extent of making the data more balanced; this also makes the data even bigger. There are two objections to this approach. First, by the simple act of replication, the validation dataset will look exactly like the training dataset in terms of rare-event patterns.
53 For novices, the use of all the data could actually backfire. Some algorithms are extremely sensitive to outliers, and their presence could produce completely wrong models. Outliers are more probable when a larger subset of the data is used. Experienced data miners always account for possible outliers.


As a result, in such a case the training-validation methodology must not be used, and not using this methodology always results in poor models. Second, to benefit from the over-sampling exercise, for some algorithms the rare events should be of the same order as the non-events. This could easily balloon the already large training data for no good reason.
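A minimal sketch of the under-sampling step described above (the record layout, fraud rate, and target ratio are illustrative assumptions): keep every rare-event record and draw a random subset of the non-events to form a more balanced development dataset.

```python
import random

random.seed(42)

# Hypothetical account-level records: roughly 1 fraud per 2,500 accounts.
records = [{"account": i, "is_fraud": (i % 2500 == 0)} for i in range(1_000_000)]

def undersample(records, nonfraud_per_fraud=10):
    """Keep every rare event; sample the non-events down to the requested ratio."""
    frauds = [r for r in records if r["is_fraud"]]
    nonfrauds = [r for r in records if not r["is_fraud"]]
    k = min(len(nonfrauds), nonfraud_per_fraud * len(frauds))
    return frauds + random.sample(nonfrauds, k)

dev_set = undersample(records)
print(len(dev_set), sum(r["is_fraud"] for r in dev_set))  # ~4,400 records, 400 frauds
```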

(3) If model performance does not gain much by using all the data, then are there any advantages to using all the data on a high performance analytics platform?

Answer:
(1) One benefit of using all the data on an HPA platform is that data mining algorithms that require multiple iterations through the training data can be run much faster. The data mining can still be run on a random sample (instead of the whole data), but each iteration of the algorithm will execute much faster, which is an advantage.
(2) A second advantage is that different learning structures and algorithms can be applied to the same (sampled) data to find the best model performance on the validation or test data54.
(3) Another benefit is that models can be built quickly on different segments by applying different filters to explore different possibilities. This segmented modeling approach has been shown to sometimes provide superior models for important problems. The ability to run these different scenarios quickly and test different hypotheses is of great importance for some applications (Fraud and Credit Risk in particular). The models may still be built on samples of the data, but they are built at a much faster rate, providing an opportunity for analysis of different scenarios and segmentation combinations.
(4) If a model is built using all the data, then even though there may be no performance improvement, once the model is built and selected all the data has already been scored. If the scores are to be used in batch processes and do not need near real-time or real-time updates, there is no need for a separate scoring process; the scores are already there when model development is done.
(5) The main and most time consuming part of data mining is the data preparation for building the analytic (or modeling) dataset. Here the assumption is that the data from which the variables are to be derived is stored, validated, and of good modeling quality. The ability to build the analytic dataset quickly (and sometimes iteratively) is of great value to a data miner. Today, thousands of variables are often used as input to a model, which at the end of development may only use tens of those variables. Any speedup in building analytic datasets and selecting variables, in particular faster variable binning and grouping on high performance platforms, will be of great benefit to a power user.

54 It is the experience of the author that, for human-generated data, it is more important to spend time on data preparation and encoding than on trying different algorithms.


6.2.3 Where to Use all the Data?

In a nutshell, here are some situations in which using all the data in a data mining task, and big data mining on an HPA platform, offers advantages:
- One can use all the data simply because it is feasible to do so: the time it takes to build a model (excluding all data preparation tasks) on all the data is so short that it is justified as a standard practice.
- After the development of the model is done using all the data, one already has the scores for all the records. If the scores are to be used in a batch process, no separate scoring process will be required.
- The learning system is complex (has a huge capacity or VC Dimension, e.g., neural networks, nearest neighbor, ...) and one is not sure what the right sample size should be, so all the data should be used at the start. Linear models, including logistic regression, do not fall into this category since their capacity is known. However, when the problem is linearized by using transformations on the inputs (e.g., binning, grouping, etc.), these transformations increase the complexity of the system as a whole, which could justify the use of all the data.
- The nature of the data mining problem is such that no sampling can be used and the data has to be analyzed in its entirety. Some examples of data mining tasks of this nature are anomaly detection, known event pattern matching, PageRank calculation (Google-like), recommendation on the long tail (Netflix), link analysis, etc.; see the sketch after this list.
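As an illustration of an algorithm that must see the data in its entirety rather than a sample, here is a toy sketch of the basic PageRank iteration [17]; the link graph, damping factor, and iteration count are illustrative assumptions:

```python
# Toy link graph: page -> pages it links to (illustrative data).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Basic iterative PageRank over the full link graph; sampling the graph would break it."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank(links))  # 'C' ends up with the highest rank in this toy graph
```

Every page's rank depends on the ranks of all pages linking to it, which is exactly why such computations are run on the whole data on distributed platforms rather than on samples.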


7 Evolution of Business Analytics Environments

Figure 5 illustrates what an ideal analytic environment may look like in the near future. Such an environment will be a merger of three analytics environments:
(1) the traditional analytic environment of today, centered around an Enterprise Data Warehouse with data marts, BI and analytic servers, and associated software tools;
(2) an open source Hadoop (or similar) environment with its software ecosystem;
(3) a High Performance In-memory Distributed Data Analytics (HPIMDDA) platform for specific business needs, covering high performance visualization, basic analytics, data mining and advanced analytics, and streaming (including a rule engine).

By near future, it is not meant that no such environment is in use today; partial subsets of it are already in use. Traditional analytics environments will be moving to integrate with (2) and/or (3), and some existing environments built around Hadoop will be moving to integrate and include (1). Organizations will develop best practices on when and how to use these different technologies based on their specific data characteristics (the 6Vs of data): value, variety, volume, velocity, validity, and volatility. Some features of each environment will appear in others; for example, MapReduce capability could become available in some structured databases55 (as SQL extensions), while there are already efforts to add real-time capability to Hadoop, which is inherently a batch environment. As an example, LinkedIn is leveraging Hadoop to store and analyze all of its site data. The products and services it offers to its customers (150 million people) are mostly, if not all, based on analysis performed in Hadoop. They use high-level languages such as Pig, R, and Python (also Java) to interface with Hadoop, and other tools like Gephi (for visualization), Kafka, etc. At the same time, they are building a Teradata Enterprise Data Warehouse (EDW) by ETLing a subset of the data from Hadoop into it; the EDW is intended specifically for shorter-term ad hoc queries. They also use Aster Data (as a data mart) for longer time-series data using SQL-MapReduce. If we consider the traditional analytics environments that exist in most banks, they are built around an EDW, a variety of departmental data marts, analytic servers, and a variety of analytics software tools (for query, reporting, OLAP, data mining, and visualization).

55 There is already some SQL-MapReduce functionality in Aster Data.


Sandboxes in the data warehouse56, or standalone sandboxes, are typically used by power users for exploration and data mining activities. Hadoop has found applications in these environments specifically for unstructured or semi-structured Big Data such as web and log data. It is envisioned that Hadoop will play a very important role in these environments going forward. As discussed before, the applications may range from ETL, to archiving (the new tape drive), to operations, to tactical and strategic analytics.

At the same time, for specific applications such as Risk, Fraud, and Marketing, HPIMDDA platforms will play a crucial role as well, especially when fast response times (reactive or proactive) are required for making decisions. These platforms are specifically designed to analyze very large data (typically structured data) in real-time for very specific business needs in BI (fast low end analytics, real-time OLAP, visualization) and Data Mining. A high performance, scalable stream processing engine will be an important part of the new environment, in which some data needs to be acted upon before it gets stored, if it is ever stored. There is a fuzzy segmentation of the environment into two parts: Business Intelligence (BI) and Data Mining (DM). Typically, business users operate in the BI world and power users operate in the DM world. In this new environment, a power user (e.g., a data scientist or analytic scientist) with the right permissions could potentially need to access and analyze any of the data wherever it resides: Hadoop, EDW, In-memory sandbox, data mart, etc. The software tool sets required to access and use these different data repositories currently vary widely. For analytics vendors, there is an opportunity to provide their users with a single interface for accessing and analyzing the data wherever it is. From a user point of view, all complexities of data access and analysis on different data platforms should be hidden, and any code written should have the intelligence to execute itself efficiently on a specific platform with minimal change and with proper interaction with the user. The Hadoop (and other NoSQL database) ecosystem is only in its infancy, and it will continue to evolve and integrate more tightly into traditional analytic environments. Structured database environments will also evolve and will coexist with NoSQL environments. Big analytic software vendors such as SAS Institute are realigning their analytic software tools and solutions to play a big part in these new analytics environments and will play a great role in shaping them in the years to come.

56 Teradata has introduced a product called Data Lab, which is a sandbox inside the database with better DBA management control over user activity.


Figure 5: This diagram depicts the evolution of current business analytics environments in the near future. The traditional business intelligence and analytics environment (operational databases, EDW, departmental data marts, BI and data mining servers, analytic sandboxes) will coexist with and integrate into environments suited for Big Data Analytics, such as Hadoop clusters (for ETL, operations and tactical analytics, strategic analytics, and archiving of unstructured, semi-structured, web, media, and machine data), High Performance In-memory Distributed Data (HPIMDD) databases, HPIMDDA servers for BI (serving business users) and for Data Mining (serving power users), and high performance streaming/rule engines. (HPIMDDA: High Performance In-Memory Distributed Data Analytics.)


REFERENCES
[1] Next Generation Supercomputers, IEEE Spectrum, February 2011.
[2] The Future of Microprocessors, Communications of the ACM, May 2011.
[3] MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, 2004.
[4] Big Data Analytics, Philip Russom, TDWI Research, 4th Quarter 2011.
[5] New Challenges for Creating Predictive Analytic Models, Khosrow Hassibi, 2009.
[6] Detecting Payment Card Fraud with Neural Networks, Khosrow Hassibi, in Business Applications of Neural Networks, p. 141, 2001.
[7] Map-Reduce for Machine Learning on Multicore, 20th Annual Conference on Neural Information Processing Systems (NIPS), 2006.
[8] The 8 Requirements of Real-Time Stream Processing, Stonebraker et al.
[9] Understanding Big Data Analytics, John Webster.
[10] Amazon Takes Supercomputing to the Cloud, CNET article, December 2011.
[11] Progress and Challenges in Intelligent Vehicle Area Networks, Communications of the ACM, February 2012.
[12] MapReduce Tutorial, Apache.org.
[13] Hadoop Tutorial, Yahoo.
[14] Organizations using Hadoop with Example Use Cases and Cluster Sizes.
[15] The World's Technological Capacity to Store, Communicate, and Compute Information, Martin Hilbert and Priscila López (2011), Science, 332(6025), 60-65; free access to the article through: http://martinhilbert.net/WorldInfoCapacity.html
[16] In-memory Analysis: Delivering Insights at the Speed of Thought, Wayne Eckerson, TechTarget, December 2011.
[17] PageRank: Standing on the Shoulders of Giants, Communications of the ACM, June 2011, Volume 54, No. 6.


[18] Scorecard Construction with Unbalanced Class Sizes, David Hand et al., JIRSS (2003), Vol. 2, No. 2, pp. 189-205.
[19] A Tutorial on Support Vector Machines for Pattern Recognition, Christopher J.C. Burges, Bell Labs, Data Mining and Knowledge Discovery, 2, 121-167 (1998).
[20] Neural Networks with Quadratic VC Dimension, Pascal Koiran and Eduardo D. Sontag, Journal of Computer and System Sciences, 1997.
[21] Scalable Distributed Stream Processing, Mitch Cherniack et al., Proceedings of the 2003 CIDR Conference.

