A Simple Introduction To Playing With Big Data

http://regulargeek.com/2010/09/19/a-simple-introduction-to-playing-wi...
Published on September 19th, 2010
Posted by Rob Diana in Miscellaneous
With social media, big data has come to the forefront of technology. Whether you want to continuously
search Twitter, aggregate the social activity on several sites, or mine people's activity on
Facebook, handling big data is critical. There are two questions you need to answer when looking at a project
that will handle big data. First, how is big data defined, and how do you know when you are dealing with it?
Second, how do you deal with big data?
How big is big data?

Big data thresholds change over time, largely because of how well traditional storage mechanisms can cope
with the volume. Part of the storage problem is hardware related, e.g., can the disk store a file larger than 4GB?
That question may not be a big deal now, but 15 years ago it was a major concern. Another question about size is
how well an RDBMS can store the data. Will the database crash if it tries to manage 100GB of data? Yes,
100GB of data in one database was huge before 2000. Technologies like database partitioning, where a large
table is physically split and managed by the database engine, were still young. Now, even open source and
free databases have partitioning and replication. The size of big data has increased dramatically as well.
When people talk of big data, they mean hundreds of millions of rows in one table and a database potentially
over 1TB, yes, one terabyte. Even though big data is a hot topic, you have few opportunities to really interact
with big data. For our purposes, let's assume you are going to aggregate data from social services in some way;
otherwise this post would be fairly short and uninteresting.
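To get a feel for when a project crosses into that territory, a quick back-of-the-envelope estimate helps. The sketch below is only an illustration: the function name, the 30% index allowance and the sample workload are all assumptions made up for this example, not measurements from any real service.

```python
def estimated_size_gb(row_count: int, avg_row_bytes: int, index_overhead: float = 0.3) -> float:
    """Rough on-disk size in GB: raw row bytes plus a fractional allowance for indexes.

    index_overhead is an assumed fudge factor, not a property of any particular database.
    """
    raw_bytes = row_count * avg_row_bytes
    return raw_bytes * (1 + index_overhead) / 1024**3


# Hypothetical workload: 500 million status updates averaging 2 KB each.
size = estimated_size_gb(500_000_000, 2048)
print(f"Estimated size: {size:,.0f} GB")
```

With these made-up numbers the estimate lands a little above 1TB, which is the range the paragraph above describes as big data.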
How do you deal with big data?

One of the first questions when dealing with any database, big or small, is what you are going to
do with it. Is your primary function searching the data? Are you going to analyze the data using typical
data mining techniques? Are you creating more of a FriendFeed-like reading and browsing service? Knowing
your target is very important, as it will likely change the way your data is stored as well as affect your choice
of technologies. One major assumption that I am making is that you do not want to spend money on
expensive tools like Oracle, SAS or Informatica. So, what kind of tools and technologies do you need to look
at?
Data Storage
Data storage is possibly the most important decision when dealing with large amounts of data. Traditional
RDBMS software can handle huge amounts of data but sometimes requires extensive knowledge to manage.
MySQL can easily handle many data storage needs and it is well known by many developers. It is the easy
choice for many people. However, there is a growing number of NoSQL choices that may also make sense.
Some of the NoSQL options have very good text search capabilities, while others have been optimized for
speed of reads or writes. Knowing how your application will handle data access helps refine this choice. Also,
do not forget about the potential of a mixed environment where some data is in an RDBMS and other data is
better suited to a NoSQL datastore. There is a large list of categorized NoSQL options at NoSQLDatabases.org.
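As a rough illustration of the mixed environment idea, here is a minimal Python sketch. It assumes an in-memory SQLite database for the structured user records and uses a plain dict as a stand-in for a key-value or document store; the table, key format and function names are hypothetical and chosen only for this example.

```python
import json
import sqlite3

# Relational side: structured, queryable user records (in-memory SQLite for the demo).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "alice"))
db.commit()

# Document side: a dict standing in for a NoSQL store. Activity items from
# different services have no fixed schema, so they are kept as JSON blobs.
activity_store = {}


def save_activity(user_id, item):
    """Append one schemaless activity item to the user's JSON document."""
    key = f"activity:{user_id}"
    items = json.loads(activity_store.get(key, "[]"))
    items.append(item)
    activity_store[key] = json.dumps(items)


def load_profile(user_id):
    """Combine the relational row with the user's activity document."""
    row = db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    if row is None:
        raise KeyError(user_id)
    items = json.loads(activity_store.get(f"activity:{user_id}", "[]"))
    return {"name": row[0], "activity": items}


save_activity(1, {"service": "twitter", "text": "hello"})
save_activity(1, {"service": "blog", "title": "Big data", "tags": ["data"]})
print(load_profile(1))
```

The point of the split is that the user table keeps its fixed columns and SQL queries, while the activity items can differ in shape from one service to the next.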
Data Caching
No matter how well architected your data storage solution is, sometimes reads are just not fast enough. This
typically happens on a highly trafficked site, but caching also pays off whenever some data does not
change very often. In order to squeeze as much speed as possible out of your application, you probably
want some level of data caching in your application. The basic idea is that your data cache is one big hashmap
stored in memory, which allows extremely fast reads. This is much faster than traditional database access or
basic file I/O. If you have paid any attention to web application development over the past several years, you
have heard of memcached. Memcached is a data caching server that you can use with your application. This is
one option you can take, but some people like to have more control over how data caching works with their
application. In that case, you need to find a data caching library for your language of choice. For Java, there
are several available, and some have been integrated into web frameworks like Spring. In particular, Ehcache
has good integration with Spring, so you could quickly include data caching in your application.
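To make the "one big hashmap in memory" idea concrete, here is a minimal sketch of a read-through cache with expiring entries. It is not how memcached or Ehcache are implemented; the class name, the time-to-live policy and the loader callback are assumptions for illustration only.

```python
import time


class TTLCache:
    """A dict-backed cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._entries = {}           # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, or call loader(key) on a miss or expired entry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # cache hit: no trip to the slow data store
        value = loader(key)          # cache miss: fetch from the slow data store
        self._entries[key] = (now + self._ttl, value)
        return value


def slow_lookup(user_id):
    # Hypothetical stand-in for a database query.
    return {"id": user_id, "name": f"user-{user_id}"}


cache = TTLCache(ttl_seconds=60)
first = cache.get_or_load(42, slow_lookup)   # goes to slow_lookup
second = cache.get_or_load(42, slow_lookup)  # served from memory
```

A real cache also needs a size limit and an eviction policy, which is a large part of what a dedicated server or library gives you.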
Distributed Computing
If you take the NoSQL route, many of those solutions are meant to be deployed in a distributed environment.
In many cases, the software will have an agent running on each of several servers in order to store some of your
data on that server. The master or orchestrator (the terminology could be different) will be configured to know
which agent to talk to for the requested data. This is a gross simplification of the process, but it should give
you an idea of what to expect. Distributed computing has various potential issues as well. If one of your
servers crashes, or even if you have to perform some maintenance on a server, how do you continue to
retrieve the data stored on that server? Is the data replicated to multiple agents in order to provide simple
fail-over capabilities? Do you need to provide your own clustering solution to support the data storage? In
some cases, you may even feel that the existing software does not provide you with a good enough solution, so
you need to build your own. Distributed computing is at the center of many solutions when dealing with big
data. Knowing more about how things work will give you a better idea of how to architect your solution as
well as what failure points may exist.
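The orchestrator and agent arrangement described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: agents are just in-process dicts, keys are routed with simple modulo hashing (not the consistent hashing that many real systems use), and every name here is hypothetical.

```python
import hashlib


class Orchestrator:
    """Routes each key to a primary agent plus replicas and fails over on reads."""

    def __init__(self, agents, replicas=2):
        self.agents = list(agents)
        self.replicas = min(replicas, len(self.agents))
        self.storage = {name: {} for name in self.agents}  # one dict per agent
        self.down = set()                                   # agents currently unavailable

    def _owners(self, key):
        """Pick the primary by hashing the key, then the next agents as replicas."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.agents)
        return [self.agents[(start + i) % len(self.agents)] for i in range(self.replicas)]

    def put(self, key, value):
        """Write the value to every owner that is currently up."""
        for agent in self._owners(key):
            if agent not in self.down:
                self.storage[agent][key] = value

    def get(self, key):
        """Read from the first owner that is up and holds the key."""
        for agent in self._owners(key):
            if agent not in self.down and key in self.storage[agent]:
                return self.storage[agent][key]
        raise KeyError(key)


cluster = Orchestrator(["agent-1", "agent-2", "agent-3"], replicas=2)
cluster.put("user:42", {"name": "alice"})
primary = cluster._owners("user:42")[0]
cluster.down.add(primary)            # simulate a crash or maintenance window
print(cluster.get("user:42"))        # still answered by the replica
```

Even this toy version shows the failure questions from the paragraph above: a write made while an agent is down never reaches it, so a real system also needs some way to repair replicas when the agent comes back.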
Search
Search is a separate field entirely due to its focus on relevance and speed. Speed is critical in search because
nobody wants to wait more than a minute for reasonable results. Thanks to Google, the longer a user waits,
the higher the expectations will be. For example, if I wait a minute to get results from a search engine, I
would expect them to be highly relevant to my question. Google's focus on speed with good enough
results definitely changed how we interact with search. If your application will have significant search
requirements, you need to look at your data storage to determine whether search is core to its function or
whether you need an external solution. In years past, search was the domain of the RDBMS vendors, but the
rise of the internet and Google has changed things. Search is no longer just about finding the structured data in
your database; it now looks at anything on the internet. There are several search projects at Apache that deal with
different levels of search. Lucene is the core search engine index software and can be considered a low-level
search technology. Solr, using the Lucene libraries, provides search through web services in order to keep
search as a distinct application outside of your application. Solr and Lucene are focused on keyword searches
just like most search technology. Nutch, also built on Lucene, is Apache's answer to web crawling, so if you
want to search the contents of various web pages, this is the solution for you.
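If you have never looked inside a keyword search engine, the core data structure is an inverted index: a map from each term to the documents that contain it. The sketch below is a toy version in plain Python and does not use or imitate the Lucene or Solr APIs; the class and function names are made up for this example, and it has no relevance ranking at all.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query (AND search)."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


index = InvertedIndex()
index.add(1, "Big data and social media")
index.add(2, "Caching data in memory")
index.add(3, "Social search on the web")
print(index.search("social data"))
```

Lookups touch only the postings for the query terms rather than scanning every document, which is where the speed comes from. Scoring results by relevance is the hard part that the real search projects add on top.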
Probability, Statistics and Machine Learning

If you decide to do any analysis of your data, there is a host of information you may need to review. If you
are planning to graph trends or even simply report on your data, then you need a basic understanding of some
simple statistics. You do not need a deep understanding, but even gaining knowledge of standard deviations
could prove valuable. If you decide that you want to take your trends a step further and look at expected
trends or even simple prediction, probability will rear its ugly head. Just like statistics, some simple concepts
in probability will go a long way for many web applications. However, there are times when statistics and
probability do not give you the results or the functionality that you desire. At that point you will need to delve
into the realm of machine learning. This is not something to jump into lightly, as machine learning
relies on some advanced statistics, probability and mathematics. In some cases, you may
be able to treat things like a black box and implement an algorithm for simple categorization, like naive Bayes,
but it may not give you the results you desire. In those cases, you may need to understand more of how these
machine learning algorithms work in order to determine what the best approach may be. This may be a
difficult area to understand, but you can do some amazing things with machine learning. How cool would it be
to personalize your site based on the user's past behavior without the user needing to explicitly select
categories or keywords?
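To show how small the black box can be, here is a minimal sketch of a naive Bayes text categorizer in Python. It assumes the multinomial variant with add-one (Laplace) smoothing, which is one common choice among several; the class name and the tiny training set are invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over word lists, with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # label -> number of training documents
        self.word_counts = defaultdict(Counter)  # label -> word -> occurrences
        self.vocab = set()                       # every word seen during training

    def train(self, label, words):
        self.class_counts[label] += 1
        self.word_counts[label].update(words)
        self.vocab.update(words)

    def classify(self, words):
        """Return the label with the highest log-probability, or None if untrained."""
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)           # log prior
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing out the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes()
model.train("tech", ["java", "database", "cache"])
model.train("tech", ["database", "search", "index"])
model.train("sports", ["game", "score", "team"])
model.train("sports", ["team", "win", "score"])
print(model.classify(["database", "index"]))
```

With four training documents this is a toy, but the same structure is what sits behind simple categorization of posts or status updates. The quality of the results depends far more on the training data than on the code.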
Do I need a PhD to do all this?

Typical databases are easy to work with. You can use a GUI to create a database and some tables. You can
write a query to get back information. Big data changes everything, and there are a lot of technologies that try
to make things easier. Thankfully, you do not need a PhD to work with big data, because many tools and
libraries have been created to make these technologies more accessible to the typical developer. Sometimes
more advanced knowledge would be helpful, but in many cases you might be able to treat the technologies as
a black box, just like your old RDBMS. You might also think that your case is special and nobody has done
anything like it before. If you are developing a web application, I highly doubt what you are doing is really
unknown. It may not be known to you, but there may be academic papers explaining things or even solutions
in an unrelated field. Big data did not start with social media; it really started in finance, pharmaceuticals
and health information. So, if you can't find something specific to fit your needs, broaden your search and the
information is probably out there.