Beyond Hadoop - Next-Generation Big Data Architectures - Cloud Computing News

http://gigaom.com/cloud/beyond-hadoop-next-generation-big-data-archi...
GigaOM
Oct 23, 2010 - 9:00AM PT
Beyond Hadoop: Next-Generation Big Data Architectures

By Bill McColl
People who have cutting-edge performance and scalability requirements today have already moved on from the Hadoop
model. Some back to SQL, but more to a raft of radically new post-Hadoop architectures. Welcome to the NoHadoop era,
as companies realize big data requires Not Only Hadoop.
After 25 years of dominance, relational databases and SQL have in recent years come under fire from the growing
"NoSQL movement." A key element of this movement is Hadoop, the open-source clone of Google's internal MapReduce
system. Whether it's interpreted as "No SQL" or "Not Only SQL," the message has been clear: If you have big data
challenges, then your programming tool of choice should be Hadoop.
The only problem with this story is that the people who really do have cutting-edge performance and scalability
requirements today have already moved on from the Hadoop model. A few have moved back to SQL, but the much more
significant trend is that, as people have come to realize the capabilities and limitations of MapReduce and Hadoop, a
whole raft of radically new post-Hadoop architectures is now being developed that are, in most cases, orders of
magnitude faster at scale than Hadoop. We are now at the start of a new "NoHadoop" era, as companies increasingly
realize that big data requires "Not Only Hadoop."
Simple batch processing tools like MapReduce and Hadoop are just not powerful enough in any one of the dimensions of the
big data space that really matters. Sure, Hadoop is great for simple batch processing tasks that are embarrassingly parallel,
but most of the difficult big data tasks confronting companies today are much more complex than that. They can involve
complex joins, ACID requirements, real-time requirements, supercomputing algorithms, graph computing, interactive
analysis, or the need for continuous incremental updates. In each case, Hadoop is unable to provide anything close to the
levels of performance required. Fortunately, however, in each case there now exist next-generation big data architectures
that can provide that required scale and performance. Over the next couple of years, these architectures will break out into
the mainstream.
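For reference, the batch model being compared against can be sketched in a few lines. This is a minimal single-process illustration in Python, not Hadoop's API; the names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical. Note that the reduce step cannot begin until every map output has been grouped, which is one reason the model suits batch jobs rather than low-latency work.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    pairs = []
    for doc in documents:
        for word in doc.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(pairs):
    """Group values by key; this needs the complete map output."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three phases strictly one after another, as a batch."""
    return reduce_phase(shuffle(map_phase(documents)))


counts = word_count(["big data", "big graphs"])
```

Each document is processed independently in the map phase, which is what makes this kind of job embarrassingly parallel.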
Here is a brief overview of the current NoHadoop or post-Hadoop space. In each case, the next-gen architecture beats
MapReduce/Hadoop by anything from 10x to 10,000x in terms of performance at scale.
8/11/2012 12:04 AM
SQL. Having been around for 25 years, it's a bit weird to call SQL next-gen, but it is! There's currently a tremendous
amount of innovation going on around SQL from companies like VoltDB, Clustrix and others. If you need to handle complex
joins, or have ACID requirements, SQL is still the way to go. Applications: Complex business queries, online transaction
processing.
Cloudscale. [McColl is the CEO of Cloudscale. See his bio below.] For realtime analytics on big data, it's essential to break
free from the constraints of batch processing. For example, if you're looking to continuously analyze a stream of events at a
rate of one million events per second per server, and deliver results with a maximum latency of five seconds between data in
and analytics out, then you need a real-time data flow architecture. The Cloudscale architecture provides this kind of
realtime big data analytics, with latency up to 10,000x lower than that of batch processing systems such as Hadoop.
Applications: Algorithmic trading, fraud detection, mobile advertising, location services, marketing intelligence.
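The article does not describe Cloudscale's internals, so the following is not its design. It is only a generic sketch, in Python, of what continuous analysis of an event stream can look like: counts over a sliding time window are updated on every event, so results are available immediately instead of at the end of a batch. The class name `SlidingWindowCounter` is hypothetical, and the sketch assumes timestamps arrive in non-decreasing order.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Keep per-key counts over the last `window_seconds` of events."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, key) pairs, oldest first
        self.counts = Counter()

    def add(self, timestamp, key):
        """Ingest one event, then expire events that left the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def snapshot(self):
        """Return the current counts; valid after every single event."""
        return dict(self.counts)
```

A caller would invoke `add` for each incoming event and read `snapshot` whenever a result is needed; the work per event is small and does not depend on the total volume of data seen so far.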
MPI and BSP. Many supercomputing applications require complex algorithms on big data, in which processors
communicate directly at very high speed in order to deliver performance at scale. Parallel programming tools such as MPI
and BSP are necessary for this kind of high performance supercomputing. Applications: Modelling and simulation, fluid
dynamics.
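As a rough illustration of the BSP style, the sketch below simulates a global sum in two supersteps: local computation, message exchange, a barrier, then a final local combine. It runs sequentially in plain Python and uses no real BSP or MPI library; the function name `bsp_global_sum` and the list-of-lists layout standing in for processors are assumptions made for the example.

```python
def bsp_global_sum(partitions):
    """Simulate a two-superstep BSP all-reduce over `partitions`.

    Each inner list plays the role of one processor's local data.
    Returns the value each simulated processor ends up holding.
    """
    p = len(partitions)

    # Superstep 1: every processor sums its local data and sends the
    # partial sum to every processor (including itself).
    in_transit = [[] for _ in range(p)]
    for local_data in partitions:
        partial = sum(local_data)
        for dest in range(p):
            in_transit[dest].append(partial)

    # Barrier: messages sent during superstep 1 are delivered only now.
    inboxes = in_transit

    # Superstep 2: every processor combines the partial sums it received.
    return [sum(inbox) for inbox in inboxes]


totals = bsp_global_sum([[1, 2], [3], [4, 5, 6]])
```

In a real deployment the processors would run concurrently and the barrier would be a genuine synchronization point; the structure of compute, communicate, synchronize is the part this sketch is meant to show.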
Pregel. Need to analyse a complex social graph? Need to analyse the web? It's not just big data, it's big graphs! We're
rapidly moving to a world where the ability to analyse very-large-scale dynamic graphs (billions of nodes, trillions of edges)
is becoming critical for some important applications. Google's Pregel architecture uses a BSP model to enable highly
efficient graph computing at enormous scale. Applications: Web algorithms, social graph algorithms, location graphs,
learning and discovery, network optimisation, internet of things.
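To make the vertex-centric, superstep-based idea concrete, here is a small single-machine sketch in Python that propagates the maximum value through a graph. It is an illustration of the programming style only, not Google's Pregel API; the function name `propagate_max` and the dictionary-based graph representation are assumptions, and every neighbor listed in `edges` is assumed to appear in `values`.

```python
def propagate_max(values, edges):
    """Spread the largest vertex value along edges, superstep by superstep.

    values: {vertex: number}; edges: {vertex: [neighbor, ...]}.
    Returns (final_values, number_of_supersteps).
    """
    values = dict(values)

    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {v: [] for v in values}
    for v, val in values.items():
        for n in edges.get(v, []):
            inbox[n].append(val)
    supersteps = 1

    # Later supersteps: only vertices that received messages do any work.
    while any(inbox.values()):
        next_inbox = {v: [] for v in values}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in edges.get(v, []):
                    next_inbox[n].append(values[v])
            # Otherwise the vertex stays silent (it "votes to halt").
        inbox = next_inbox
        supersteps += 1

    return values, supersteps


final, steps = propagate_max(
    {"a": 3, "b": 6, "c": 2},
    {"a": ["b"], "b": ["a", "c"], "c": ["b"]},
)
```

The computation stops when no messages are in flight, which is the same termination idea as in the BSP model: work proceeds in rounds separated by barriers until every vertex is idle.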
Dremel. Need to interact with web-scale data sets? Google's Dremel architecture is designed to support interactive, ad hoc
queries over trillion-row tables in seconds! It executes queries natively without translating them into MapReduce jobs.
Dremel has been in production since 2006 and has thousands of users within Google. Applications: Data exploration,
customer support, data center monitoring.
Percolator (Caffeine). If you need to incrementally update the analytics on a massive data set continuously, as Google now
has to do on its index of the web, then an architecture like Percolator (Caffeine) beats Hadoop easily; Google Instant just
wouldn't be possible without it. Because the index can be updated incrementally, the median document moves through
Caffeine over 100 times faster than it moved through the company's old MapReduce setup. Applications: Real time search.
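The difference between rebuilding an index in batch and updating it incrementally can be sketched as follows. This is a toy in-memory inverted index in Python, not Percolator's actual design or API; the class name `IncrementalIndex` and its methods are hypothetical. The point is only that a change to one document touches the postings for that document's words and nothing else.

```python
from collections import defaultdict


class IncrementalIndex:
    """Inverted index updated per document instead of rebuilt per batch."""

    def __init__(self):
        self.doc_words = {}               # doc_id -> set of words
        self.postings = defaultdict(set)  # word -> set of doc_ids

    def upsert(self, doc_id, text):
        """Apply one document change, touching only affected postings."""
        new_words = set(text.lower().split())
        old_words = self.doc_words.get(doc_id, set())
        for word in old_words - new_words:
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]
        for word in new_words - old_words:
            self.postings[word].add(doc_id)
        self.doc_words[doc_id] = new_words

    def search(self, word):
        """Return the sorted ids of documents containing `word`."""
        return sorted(self.postings.get(word.lower(), set()))
```

With a batch rebuild, the cost of reflecting one changed page is proportional to the whole corpus; here it is proportional to the size of the changed document, which is the property that makes continuous updating practical.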
The fact that Hadoop is freely available to everyone means it will remain an important entry point to the world of big data
for many people. However, as the performance demands for big data apps continue to increase, we will find these new, more
powerful forms of big data architecture will be required in many cases.
Bill McColl is the founder and CEO of Cloudscale Inc. and a former professor of Computer Science, Head of the Parallel
Computing Research Center, and Chairman of the Computer Science Faculty at Oxford University.
18 Comments
1.
Eddie Saturday, October 23 2010
Good article, although its verbosity could have been map reduced ;-) to simply "it's a matter of horses for courses" or
"Hadoop is not the be-all end-all" (but then again, who said that Hadoop was the be-all end-all? I don't recall anyone
in the Hadoop community stating so).
2.
Andrew Purtell Saturday, October 23 2010
Percolator is built on top of BigTable. In the Hadoop ecosystem, we have HBase as an open source implementation of
BigTable, and it seems feasible to build an open equivalent to Percolator on top of the HBase coprocessor framework.
Hadoop is not just MapReduce. This article is written as if Hadoop has stood still since 2006.
3.
Steve Loughran Sunday, October 24 2010
1. Google are using something nobody else can see - it's hard to say they've moved on from Hadoop, merely evolved
their own MR engine.
2. Nobody in Hadoop-land is going to say you should use Hadoop and friends if you want transactions, ACID, etc.
What we do say is you don't need to index all the stuff you want to search through later, and if you keep some
stuff in a distributed filesystem, you make storing PB affordable.
3. What Hadoop does have is testing at double digit petabyte storage capacity, thousands of servers, each with 6+
HDDs.
4. MPI. MPI doesn't handle failure well. Which is why most HPC facilities don't like MPI jobs that take more than
48h to complete - too much risk of an outage. I think MPI is great for some problems, but it's not the silver bullet
either.
What Hadoop does bring to the table is community and scale. Nobody in the group thinks it's perfect, but we know
what the MapReduce problems are (latency due to the saving of intermediate results to HDD and a wait for all maps
to complete before the reduces), and those of the filesystem (the namenode is an SPOF, better checksumming and
security; the latter is trickling out). It's also designed for a static set of machines; when hosted in on-demand
infrastructure you need to integrate the infrastructure operations into your workflow. We know them, people in
different companies and some universities are working on them. It's going to be hard to compete with the community,
even if you have better solutions.
People used to dismiss Linux compared to real unix, remember?
rohitsift Tuesday, October 26 2010

+1
never, never underestimate community.
4.
John Monday, October 25 2010
Great article. Very interesting and to the point, thank you for it.
5.
Yyzfan@gmail.com Monday, October 25 2010
Memory-based architectures are the future. Spinning disks are the root problem. There are a few in the field: VMware
just added GemFire, and Oracle and IBM have competing tech.
It's the future in about 5-7 years.
Razi Sharir Tuesday, October 26 2010

The future is here: check out Xeround.com and enjoy the best of both worlds, a SQL cloud DB with the
benefits of NoSQL underneath the hood.
yyzfan@gmail.com Saturday, October 30 2010

The reason why it's the future is it needs integration at the app / code level. Most existing apps will not be
ported.
SQL s*cks as well. As much as I loved linear algebra, that's not how the world is organized.
Another reason there is time is that we need to teach the next generation of developers a different way; that's
what takes the real time, not the technology.
6.
Stefan Groschupf Monday, October 25 2010
Nice try
If someone uses Hadoop for a realtime or ACID environment, he / she has the wrong job. Hadoop is about offline
analytics and the cost to scale equation.
7.
Henry Robinson Monday, October 25 2010
Pregel effectively *is* BSP (synchronous checkpointed steps consisting of local processing then message passing).
If you squint hard enough, MapReduce fits into this model as well.
Andrew Purtell Monday, October 25 2010

The Apache Hama project appears to be busy implementing BSP on Hadoop:
http://people.apache.org/~edwardyoon/papers/Apache_HAMA_BSP.pdf
8.
Stephen T Monday, October 25 2010
Really hilarious to see the comments on this article from the Hadoop guys about MPI and BSP. McColl invented the
BSP approach to parallel programming along with Leslie Valiant of Harvard back in the 1990s. He also led the
international team that defined the standard library for BSP programming, as a simpler and faster alternative to MPI,
and his team built all the programming tools. His 1996 paper "Questions and Answers about BSP" is the standard
introduction to BSP software. The Wikipedia page
http://en.wikipedia.org/wiki/Bulk_synchronous_parallel
is based on that. There's even a link there to an ancient (1998) web page by McColl on BSP which has tons of papers,
and a link to a Cover Story in New Scientist on McColl, Valiant and BSP.
The fact that Google has recently rediscovered BSP and has used it in an exciting way to build Pregel, which now
accounts for more than 20% of all big data computing at Google, is further validation of how important this approach
is today.
Andrew Purtell Monday, October 25 2010

Speaking of MPI as a next generation technology beyond Hadoop is inverting history. But implied in your
comment (I think) is that this is some kind of competition. I don't get it. MPI and MapReduce are very different
tools that solve different and at least partially exclusive sets of problems. I go back to the "horses for courses"
comment made by the first poster.
Patrick Angeles Monday, October 25 2010

And yet you haven't addressed Steve L's assertion that MPI doesn't handle failure well. And none of the
Hadoop guys actually put down BSP, but somehow it got non-sequitur'd into a plug for the author.
Speaking of non-sequiturs, how about this one:
"[the] No SQL the message has been clear: If you have big data challenges, then your programming tool of
choice should be Hadoop."
Nobody working closely with Hadoop will:
1. touch the NoSQL movement with a 10-foot pole.
2. use it as a silver bullet for all big data challenges.
9.
Bill McColl Monday, October 25 2010
Hey guys. Thanks for the comments.
I was surprised that some of you thought it controversial that we were moving into a Post-Hadoop era. I guess it
depends which world you live in. I write from the perspective of the Silicon Valley startup world. In that world the
game is moving on, since MapReduce (a 7-year-old programming model) is now available from any one of the big
established vendors: Amazon Elastic MapReduce for cloud MR, IBM BigInsights for spreadsheet-fronted Hadoop and
for DB2 with Hadoop tables, Microsoft's Dryad being prepared for commercial launch, Oracle integration with
Hadoop. It doesn't get more mainstream than Amazon, IBM, Microsoft and Oracle! As I say in the article, the new
wave of startup innovation in big data architectures is going to be around the many areas where MR/Hadoop is
nowhere near enough: in-memory, transactions, realtime, graph, exascale fault-tolerant message passing and RDMA,
interactive, incremental.
As I also said in the article, Hadoop is, and will remain, a small but important part of the overall big data ecosystem.
Its very basic coarse-grain style provides easy fault tolerance for the simple types of parallel apps that some
organizations have today. And it comes in a free version too, which is great! However, as soon as performance
(throughput and/or latency) begins to matter, as usual you're in a "One Size Doesn't Fit All" situation and you need to
look for the right kind of architecture.
Bill
10.
Razi Sharir Tuesday, October 26 2010
You get what you pay for; as simple as that. If NoSQL serves you well and there's no need for relational and/or
transaction modeling, then go for it.
Practical experience suggests this is usually not the case, and we do indeed see many folks going back to RDBMS. For those
who never left, or for those who are already there with MySQL backend applications, we're there to support: a SQL
cloud DB that is elastically scalable and highly available. Don't take my word for it; check this out on our beta
@xeround.com
Displaying 16 of 18 comments.