http://regulargeek.com/2010/09/19/a-simple-introduction-to-playing-wi...
With social media, big data has come to the forefront of technology. Whether you want to continuously
search Twitter, aggregate the social activity on several sites, or mine people's activity on
Facebook, handling big data is critical. There are two questions you need to answer when looking at a project
that will handle big data. First, how is big data defined, and how do you know when you are dealing with it?
Second, how do you deal with it?
8/11/2012 12:56 AM
Data Storage
Data storage is possibly the most important decision when dealing with large amounts of data. Traditional
RDBMS software can handle huge amounts of data but sometimes requires extensive knowledge to manage.
MySQL can handle many data storage needs and is well known by many developers, which makes it the easy
choice for many people. However, there is a growing number of NoSQL choices that may also make sense.
Some of the NoSQL options have very good text search capabilities, while others have been optimized for
the speed of reads or writes. Knowing how your application will access its data helps refine this choice. Also,
do not forget about the potential of a mixed environment where some data is in an RDBMS and other data is
better suited to a NoSQL datastore. There is a large list of categorized NoSQL options at NoSQLDatabases.org.
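To make the mixed environment idea concrete, here is a minimal sketch in Python. It assumes an in-memory SQLite database as a stand-in for an RDBMS such as MySQL, and a plain dict as a stand-in for a key-value NoSQL store. The table, the function names, and the sample data are all hypothetical and only illustrate the split between structured and schemaless data.

```python
import sqlite3

# Relational side: an in-memory SQLite database stands in for MySQL (assumption).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "alice"))
db.commit()

# Non-relational side: a plain dict stands in for a key-value NoSQL store (assumption).
# Keys are user ids, values are lists of free-form activity documents.
activity_store = {}


def record_activity(user_id, document):
    """Append a schemaless activity document for a user."""
    activity_store.setdefault(user_id, []).append(document)


def load_profile(user_id):
    """Combine the structured row with the schemaless activity documents."""
    row = db.execute(
        "SELECT id, name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    return {"id": row[0], "name": row[1], "activity": activity_store.get(user_id, [])}


# The two activity documents deliberately have different shapes.
record_activity(1, {"site": "twitter", "text": "hello"})
record_activity(1, {"site": "facebook", "liked": "some page", "tags": ["a", "b"]})
profile = load_profile(1)
```

The user record has a fixed shape and fits a table, while the activity documents vary from site to site, which is the kind of data that tends to be easier to keep in a schemaless store.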
Data Caching
No matter how well architected your data storage solution is, sometimes reads are just not fast enough. This
typically happens on a highly trafficked site, or when some of your data simply does not
change very often. To squeeze as much speed as possible out of your application, you probably
want some level of data caching. The basic idea is that your data cache is one big hashmap
stored in memory, which allows extremely fast reads. This is much faster than traditional database access or
basic file I/O. If you have paid any attention to web application development over the past several years, you
have heard of memcached (often written as memcache). Memcached is a data caching server that you can use with
your application. This is one option, but some people like to have more control over how data caching works with their
application. In that case, you need to find a data caching library for your language of choice. For Java, there
are several available, and some have been integrated into web frameworks like Spring. In particular, Ehcache
has good integration with Spring, so you could quickly include data caching in your application.
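The "one big hashmap in memory" idea can be sketched in a few lines of Python. This is only an illustration, not how memcached or Ehcache work internally. The class name ReadThroughCache, the time-to-live expiry policy, and the slow_lookup function (a stand-in for a database query) are all assumptions made for the example.

```python
import time


class ReadThroughCache:
    """A dict-backed cache that calls a loader function on a miss."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # called when a key is missing or expired
        self._ttl = ttl_seconds    # how long an entry stays valid
        self._clock = clock        # injectable so expiry can be tested
        self._entries = {}         # key -> (value, expiry time)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: go to the slow source and remember the result.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key):
        """Drop an entry, for example after the underlying data changes."""
        self._entries.pop(key, None)


def slow_lookup(key):
    """Hypothetical stand-in for a database query."""
    return key.upper()


cache = ReadThroughCache(slow_lookup, ttl_seconds=30.0)
first = cache.get("user:1")    # miss: calls slow_lookup
second = cache.get("user:1")   # hit: served from the dict
```

The time-to-live is the simplest way to cope with data that changes only occasionally. A real cache also needs a size limit and an eviction policy, which this sketch leaves out.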
Distributed Computing
If you take the NoSQL route, many of those solutions are meant to be deployed in a distributed environment.
In many cases, the software will have an agent running on each of several servers, and each agent stores some of
your data on its server. The master or orchestrator (the terminology could be different) will be configured to know
which agent to talk to for the requested data. This is a gross simplification of the process, but it should give
you an idea of what to expect. Distributed computing has various potential issues as well. If one of your
servers crashes, or even if you have to perform some maintenance on a server, how do you continue to
retrieve the data stored on that server? Is the data replicated to multiple agents in order to provide simple
fail-over capabilities? Do you need to provide your own clustering solution to support the data storage? In
some cases, you may even feel that the existing software does not provide you with a good enough solution, so
you need to build your own. Distributed computing is at the center of many solutions when dealing with big
data. Knowing more about how things work will give you a better idea of how to architect your solution as
well as what failure points may exist.
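The orchestrator and agent arrangement described above, along with the replication question, can be sketched as follows. This is a toy model under stated assumptions, not the design of any particular NoSQL product. The Agent and Orchestrator classes, the hash-based placement, and the replica count of two are all hypothetical.

```python
import hashlib


class Agent:
    """One storage server; holds a slice of the data in a dict."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True   # flipped to False to simulate a crash or maintenance


class Orchestrator:
    """Knows which agents hold each key and routes reads and writes to them."""

    def __init__(self, agents, replicas=2):
        self.agents = agents
        self.replicas = min(replicas, len(agents))

    def replicas_for(self, key):
        # Hash the key to pick a starting agent, then take the next ones in order.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        start = int(digest, 16) % len(self.agents)
        return [self.agents[(start + i) % len(self.agents)]
                for i in range(self.replicas)]

    def put(self, key, value):
        written = 0
        for agent in self.replicas_for(key):
            if agent.up:
                agent.data[key] = value
                written += 1
        if written == 0:
            raise RuntimeError("no replica available for " + key)

    def get(self, key):
        # Fail over: try each replica in turn until one is up and has the key.
        for agent in self.replicas_for(key):
            if agent.up and key in agent.data:
                return agent.data[key]
        raise KeyError(key)


cluster = Orchestrator([Agent("a"), Agent("b"), Agent("c")], replicas=2)
cluster.put("user:1", "alice")
cluster.replicas_for("user:1")[0].up = False   # the first replica goes down
value = cluster.get("user:1")                  # still served by the second replica
```

Even this small model shows where the failure points are: a write that happens while one replica is down leaves that replica stale when it comes back, and adding or removing an agent changes the placement of most keys. Real systems spend much of their complexity on exactly those cases.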
Search
Search is a separate field entirely due to its focus on relevance and speed. Speed is critical in search because
nobody wants to wait more than a minute for reasonable results. Thanks to Google, the longer a user waits,
the higher the expectations will be. For example, if I wait a minute to get results from a search engine, I
would expect them to be highly relevant to my question. Google's focus on speed with good enough
results definitely changed how we interact with search. If your application will have significant search
requirements, you need to look at your data storage to determine whether search is core to its function or
whether you need an external solution. In years past, search was the domain of the RDBMS vendors, but the
rise of the internet and Google has changed things. Search is no longer just about finding the structured data in your
database; it now looks at anything on the internet. There are several search projects at Apache that address
different levels of search. Lucene is the core search engine index software and can be considered a low-level
search technology. Solr, using the Lucene libraries, provides search through web services in order to keep
search as a distinct application outside of your application. Solr and Lucene are focused on keyword searches,
just like most search technology. Nutch, also built on Lucene, is Apache's answer to web crawling, so if you
want to search the contents of various web pages, this is the solution for you.
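Keyword search of the kind mentioned above is usually built on an inverted index, which maps each term to the documents that contain it. The sketch below shows only that general idea in Python. It does not use or imitate the Lucene API, it does no relevance ranking, and the InvertedIndex class and the sample documents are hypothetical.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)   # term -> set of doc ids

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        for term in self.tokenize(text):
            self._postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every term in the query."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            postings = self._postings.get(term, set())
            result = set(postings) if result is None else result & postings
        return result


index = InvertedIndex()
index.add(1, "Handling big data with NoSQL")
index.add(2, "Caching big result sets in memory")
index.add(3, "Search relevance and speed")
matches = index.search("big data")
```

Lookups are fast because a query touches only the entries for its own terms instead of scanning every document. Relevance, the other half of search, requires scoring the matches, which is left out here.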