FS Shell
The FileSystem (FS) shell is invoked by bin/hadoop fs <args>. All FS shell commands take
path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme
is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional; if
not specified, the default scheme from the configuration is used. An HDFS file or
directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply
as /parent/child.
Administrator Commands:
fsck /
Runs the HDFS filesystem checking utility.
Example: hadoop fsck /
balancer
Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the rebalancing process.
Example: hadoop balancer
version
Prints the Hadoop version installed on the machine.
Example: hadoop version
Hadoop fs commands:
ls
For a file, ls returns stat on the file with the following format:
filename <number of replicas> filesize modification_date modification_time permissions userid groupid
For a directory, it returns a list of its direct children, as in Unix. A directory is listed as:
dirname <dir> modification_date modification_time permissions userid groupid
Usage: hadoop fs -ls <args>
Example: hadoop fs -ls
lsr
Recursive version of ls. Similar to Unix ls -R.
Usage: hadoop fs -lsr <args>
Example: hadoop fs -lsr
mkdir
Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p,
creating parent directories along the path.
Usage: hadoop fs -mkdir <paths>
Example: hadoop fs -mkdir /user/hadoop/Aravindu
mv
Moves files from source to destination. This command allows multiple sources, in which
case the destination must be a directory. Moving files across filesystems is not permitted.
Usage: hadoop fs -mv URI [URI ] <dest>
Example: hadoop fs -mv /user/hduser/Aravindu/Consolidated_Sheet.csv
/user/hduser/sandela/
put
Copies a single src, or multiple srcs, from the local filesystem to the destination filesystem. Also
reads input from stdin and writes to the destination filesystem.
Usage: hadoop fs -put <localsrc> <dst>
rm
Deletes files specified as args. Refer to rmr for recursive deletes of directories.
Usage: hadoop fs -rm URI [URI ]
Example: hadoop fs -rm /user/hduser/sandela/Consolidated_Sheet.csv
rmr
Recursive version of rm; deletes directories and their contents.
Usage: hadoop fs -rmr URI [URI ...]
cat
The cat command concatenates and displays files; it works like the Unix cat command.
Usage: hadoop fs -cat URI [URI ]
Example: hadoop fs -cat /user/hduser/Aravindu/Consolidated_sheet.csv
chgrp
Changes the group association of files. With -R, makes the change recursively through the
directory structure. The user must be the owner of the file, or else a super-user.
Usage: hadoop fs -chgrp [-R] GROUP URI [URI ...]
chmod
Change the permissions of files. With -R, make the change recursively through the directory
structure. The user must be the owner of the file, or else a super-user.
Usage: hadoop fs -chmod [-R] <MODE[,MODE] | OCTALMODE> URI [URI ]
chown
Change the owner of files. With -R, make the change recursively through the directory structure.
The user must be a super-user.
Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ]
copyFromLocal
Copies a file from the local filesystem to the given HDFS path. Similar to put, except that the
source is restricted to a local file reference.
Usage: hadoop fs -copyFromLocal <localsrc> URI
copyToLocal
Copies a file from HDFS to the local filesystem. Similar to get, except that the destination is
restricted to a local file reference.
Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Example: hadoop fs -copyToLocal
/user/hduser/output/Expecting_result_set_112613/part-r-00000
/home/hduser/contact_wordcount/11-26-2013
cp
Copies files from source to destination. This command allows multiple sources, in which
case the destination must be a directory.
Usage: hadoop fs -cp URI [URI ] <dest>
Example: hadoop fs -cp /user/hduser/input/Consolidated_Sheet.csv
/user/hduser/Aravindu
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
du
Displays the aggregate length of files contained in the directory, or the length of a file in case it
is just a file.
Usage: hadoop fs -du URI [URI ]
Example: hadoop fs -du /user/hduser/Aravindu/Consolidated_Sheet.csv
dus
Displays a summary of file lengths.
Usage: hadoop fs -dus <args>
Example: hadoop fs -dus /user/hduser/Aravindu/Consolidated_Sheet.csv
count
Counts the number of directories, files and bytes under the paths that match the specified file
pattern.
Usage: hadoop fs -count [-q] <paths>
Example: hadoop fs -count hdfs:/
expunge
Empty the Trash.
Usage: hadoop fs -expunge
Example: hadoop fs -expunge
setrep
Changes the replication factor of a file. The -R option recursively changes the replication factor
of files within a directory; -w waits for the replication to complete.
Usage: hadoop fs -setrep [-R] [-w] <rep> <path>
Example: hadoop fs -setrep -w 3 -R /user/hadoop/Aravindu
stat
Returns the stat information on the path.
Usage: hadoop fs -stat URI [URI ]
Example: hadoop fs -stat /user/hduser/Aravindu
tail
Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.
Usage: hadoop fs -tail [-f] URI
text
Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.
Usage: hadoop fs -text <src>
Example: hadoop fs -text /user/hduser/Aravindu/Info
touchz
Create a file of zero length.
Usage: hadoop fs -touchz URI [URI ]
Please find below the complete step-by-step process for installing the Hadoop 2.2.0 stable
version on Ubuntu, as requested by many of this blog's visitors, friends and subscribers.
The Apache Hadoop 2.2.0 release has significant changes compared to the previous stable
release, Apache Hadoop 1.2.1 (setting up Hadoop 1.2.1 can be found here).
In short, this release has a number of changes compared to version 1.2.1:
The HDFS symlinks feature is disabled and will be taken out in future versions.
The JobTracker has been replaced with the Resource Manager and Node Manager.
Before starting to set up Apache Hadoop 2.2.0, please understand the concepts of Big Data
and Hadoop from my previous blog posts:
Big Data Characteristics, Problems and Solution.
What is Apache Hadoop?
Setting up Single node Hadoop Cluster.
Setting up Multi node Hadoop Cluster.
Understanding HDFS architecture (in comic format).
Setting up the environment:
In this tutorial you will learn the step-by-step process for setting up a Hadoop single-node
cluster, so that you can play around with the framework and learn more about it.
In this tutorial we are using the following software versions; you can download the same by
clicking the hyperlinks:
If you are using PuTTY to access your Linux box remotely, please install openssh by running
this command; this also helps in configuring SSH access easily in the later part of the installation:
sudo apt-get install openssh-server
Prerequisites:
1. Installing Java v1.7
2. Adding dedicated Hadoop system user.
3. Configuring SSH access.
4. Disabling IPv6.
Before starting to install any applications or software, please make sure your list of
packages from all repositories and PPAs is up to date, or update them by using this
command:
sudo apt-get update
If the download fails, please use the given command, which helps to avoid passing a
username and password.
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F
%2Fwww.oracle.com" "https://edelivery.oracle.com/otn-pub/java/jdk/7u45b18/jdk-7u45-linux-x64.tar.gz"
c. Create a Java directory using mkdir under /usr/local/ and change the directory to
/usr/local/Java by using these commands:
sudo mkdir -p /usr/local/Java
cd /usr/local/Java
e. Edit the system PATH file /etc/profile and add the following system variables to your system
path
sudo nano /etc/profile
or
f. Scroll down to the end of the file using your arrow keys and add the following lines below to
the end of your /etc/profile file:
JAVA_HOME=/usr/local/Java/jdk1.7.0_45
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH
g. Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located. This will tell
the system that the new Oracle Java version is available for use.
sudo update-alternatives --install "/usr/bin/javac" "javac"
"/usr/local/java/jdk1.7.0_45/bin/javac" 1
sudo update-alternatives --set javac /usr/local/Java/jdk1.7.0_45/bin/javac
This command notifies the system that Oracle Java JDK is available for use
h. Reload your system wide PATH /etc/profile by typing the following command:
. /etc/profile
2. Adding dedicated Hadoop system user.
We will use a dedicated Hadoop user account for running Hadoop. While that's not required, it
is recommended, because it helps to separate the Hadoop installation from other software
applications and user accounts running on the same machine.
a. Adding a group:
sudo addgroup hadoop
b. Adding a user to that group:
sudo adduser --ingroup hadoop hduser
It will ask you to provide the new UNIX password and information, as shown in the image below.
Before this step you have to make sure that SSH is up and running on your machine and
configured to allow SSH public-key authentication.
3. Configuring SSH access.
Generating an SSH key for the hduser user:
a. Login as the hduser user.
b. Run this key generation command:
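The key generation command itself was lost from this write-up; a typical invocation (an assumption based on standard OpenSSH tooling, matching the empty-passphrase behaviour described in step c) is:

```shell
# Generate an RSA key pair with an empty passphrase (-P "") so that
# later SSH logins to localhost do not prompt for one.
ssh-keygen -t rsa -P ""
```

Pressing Enter at the file-name prompt accepts the default location, /home/hduser/.ssh/id_rsa.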
c. It will ask you to provide the file name in which to save the key; just press Enter so that it
will generate the key at /home/hduser/.ssh
d. Enable SSH access to your local machine with this newly created key.
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
e. The final step is to test the SSH setup by connecting to your local machine with the hduser
user.
ssh hduser@localhost
4. Disabling IPv6.
We need to disable IPv6 because Ubuntu uses the 0.0.0.0 address for different Hadoop configurations.
You will need to run the following commands using a root account:
sudo gedit /etc/sysctl.conf
Add the following lines to the end of the file and reboot the machine to apply the
configuration correctly.
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Hadoop Installation:
Go to Apache Downloads and download Hadoop version 2.2.0 (prefer to download stable
versions).
i. Run the following command to download Hadoop version 2.2.0:
wget http://apache.mirrors.pair.com/hadoop/common/stable2/hadoop-2.2.0.tar.gz
iv. Move the hadoop package to a location of your choice; I picked /usr/local for my convenience:
sudo mv hadoop /usr/local/
v. Make sure to change the owner of all the files to the hduser user and hadoop group by using
this command:
sudo chown -R hduser:hadoop /usr/local/hadoop
Configuring Hadoop:
The following are the required files we will use for the perfect configuration of the single node
Hadoop cluster:
a. yarn-site.xml
b. core-site.xml
c. mapred-site.xml
d. hdfs-site.xml
e. Update $HOME/.bashrc
We can find these files in the Hadoop configuration directory:
cd /usr/local/hadoop/etc/hadoop
a. yarn-site.xml:
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
b. core-site.xml:
i. Change the user to hduser. Change the directory to /usr/local/hadoop/etc/hadoop and edit the core-site.xml file.
vi core-site.xml
ii. Add the following entry to the file and save and quit the file:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
c. mapred-site.xml:
If this file does not exist, copy mapred-site.xml.template as mapred-site.xml
i. Edit the mapred-site.xml file
vi mapred-site.xml
ii. Add the following entry to the file and save and quit the file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
d. hdfs-site.xml:
i. Edit the hdfs-site.xml file
vi hdfs-site.xml
ii. Add the following entry to the file and save and quit the file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/yarn_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/yarn_data/hdfs/datanode</value>
</property>
</configuration>
e. Update $HOME/.bashrc
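The exact .bashrc lines are not shown in this write-up; a minimal sketch, assuming the install paths used earlier in this guide (/usr/local/Java/jdk1.7.0_45 and /usr/local/hadoop), would be:

```shell
# Hypothetical .bashrc additions; adjust the paths to your actual layout.
export JAVA_HOME=/usr/local/Java/jdk1.7.0_45
export HADOOP_HOME=/usr/local/hadoop
# bin holds the hadoop/hdfs commands, sbin the daemon start/stop scripts.
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
```

After reloading the file (source $HOME/.bashrc), the NameNode is normally formatted once (hdfs namenode -format) and started (hadoop-daemon.sh start namenode) before bringing up the remaining daemons.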
Data node:
$ hadoop-daemon.sh start datanode
Resource Manager:
$ yarn-daemon.sh start resourcemanager
Node Manager:
$ yarn-daemon.sh start nodemanager
Hadoop comes with several web interfaces, which are by default available at these locations (for
Hadoop 2.2.0, typically the NameNode web UI at http://localhost:50070 and the ResourceManager
web UI at http://localhost:8088):
With this we are done setting up a single-node Hadoop cluster v2.2.0; I hope this step-by-step
guide helps you to set up the same environment at your end.
Please leave a comment/suggestion in the comment section; I will try to answer as soon as
possible. And don't forget to subscribe to the newsletter and leave a Facebook like.
Setting up Hive
Posted on October 28, 2013 by aravindu012
As I said earlier, Apache Hive is an open-source data warehouse infrastructure built on top of
Hadoop for providing data summarization, query, and analysis of large datasets stored in
Hadoop files. It was developed by Facebook, and it provides:
Access to files stored either directly in Apache HDFS or in other data storage systems
such as Apache HBase.
In this post we will get to know how to set up Hive on top of a Hadoop cluster.
Objective
The objective of this tutorial is for setting up Hive and running HiveQL scripts.
Prerequisites
The following are the prerequisites for setting up Hive.
You should have the latest stable build of Hadoop up and running. To install Hadoop, please
check my previous blog article on Hadoop Setup.
Setting up Hive:
Procedure
1. Download a stable version of the Hive file from the Apache download mirrors. For this tutorial
we are using Hive-0.12.0; this release works with Hadoop 0.20.X, 1.X, 0.23.X and 2.X.
wget http://apache.osuosl.org/hive/hive-0.12.0/hive-0.12.0.tar.gz
3. Create a hive directory under the /usr/local directory as the root user and change the ownership
to hduser as shown. This is for our convenience, to differentiate each framework, software and
application with different users.
cd /usr/local
mkdir hive
sudo chown -R hduser:hadoop /usr/local/hive
9. Create a table in hive with the following command. Also, after creating it, check whether the table exists.
create table test (field1 string, field2 string);
show tables;
From this output we know that Hive was set up correctly on top of the Hadoop cluster; it's time
to learn HiveQL.
It supports queries expressed in a language called HiveQL, which automatically translates
SQL-like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom
MapReduce scripts that can be plugged into queries. Hive also enables data
serialization/deserialization and increases flexibility in schema design by including a system
catalog called the Hive Metastore.
According to the Apache Hive wiki, Hive is not designed for OLTP workloads and does not
offer real-time queries or row-level updates. It is best used for batch jobs over large sets of
append-only data (like web logs).
Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary
key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a
columnar fashion).
Setting up Pig
Posted on October 28, 2013 by aravindu012
Apache Pig is a high-level procedural language platform developed to simplify querying large
data sets in Apache Hadoop and MapReduce. Pig is popular for performing query operations in
Hadoop using the Pig Latin language, a layer that enables SQL-like queries to be performed on
distributed datasets within Hadoop applications, thanks to its simple interface and support for
complex operations such as joins and filters. It has the following key properties:
Ease of programming: Pig programs are easy to write, yet accomplish huge tasks just as
other MapReduce programs do.
Optimization: The system optimizes the execution of Pig jobs automatically, allowing the
user to focus on semantics rather than efficiency.
Extensibility: Pig users can write their own user-defined functions (UDFs) to do
special-purpose processing as per their requirements, using Java, Python, or JavaScript.
Objective
The objective of this tutorial is for setting up Pig and running Pig scripts.
Prerequisites
The following are the prerequisites for setting up Pig and running Pig scripts.
You should have the latest stable build of Hadoop up and running, to install hadoop,
please check my previous blog article on Hadoop Setup.
Setting up Pig
Procedure
1. Download a stable version of the Pig file from the Apache download mirrors. For this tutorial
we are using pig-0.11.1; this release works with Hadoop 0.20.X, 1.X, 0.23.X and 2.X.
wget http://apache.mirrors.hoobly.com/pig/pig-0.11.1/pig-0.11.1.tar.gz
5. Set PIG_HOME in $HOME/.bashrc so it will be set every time you login, by adding the
following lines to it.
export PIG_HOME=<path_to_pig_home_directory>
e.g.
export PIG_HOME='/usr/local/pig/pig-0.11.1'
export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$JAVA_HOME/bin:$PATH
6. Set the environment variable JAVA_HOME to point to the Java installation directory, which
Pig uses internally.
export JAVA_HOME=<<Java_installation_directory>>
Execution Modes
Pig has two modes of execution: local mode and MapReduce mode.
Local Mode
Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets
that a single machine can handle. It runs in a single JVM and accesses the local filesystem.
To run in local mode, please pass the following command:
$ pig -x local
grunt>
MapReduce Mode
This is the default mode; Pig translates the queries into MapReduce jobs, which requires access
to a Hadoop cluster.
$ pig
You can see the log reports from Pig stating the filesystem and JobTracker it connected to. Grunt
is an interactive shell for your Pig queries. You can run Pig programs in three ways: via a script,
via Grunt, or by embedding the script into Java code. Running in the interactive shell is shown in
the Problem section. To run a batch of Pig scripts, it is recommended to place them in a single
file with the .pig extension and execute them in batch mode; I will explain them in depth in
coming posts.
How it works:
Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System
(HDFS). The language for the platform is called Pig Latin, which abstracts from the Java
MapReduce idiom into a form similar to SQL. Pig Latin is a dataflow language that allows you to
write a data flow describing how your data will be transformed. Since Pig Latin scripts can
be graphs, it is possible to build complex data flows involving multiple inputs, transforms, and
outputs. Users can extend Pig Latin by writing their own user-defined functions, using Java,
Python, Ruby, or other scripting languages.
We can run Pig in two modes:
Local Mode
Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets
that a single machine can handle. It runs in a single JVM and accesses the local filesystem.
MapReduce Mode
This is the default mode; Pig translates the queries into MapReduce jobs, which requires access
to a Hadoop cluster.
We will discuss more about Pig, setting up Pig with Hadoop, and running Pig Latin scripts in
local and MapReduce mode in my next posts.
Step 1: Download the VMware player from the link shown and install it as shown in the
images.
URL to download VMware Player (non-commercial use):
https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0|
PLAYER-600-A|product_downloads
Step 2: Download the Cloudera Setup File from the given url and extract that zipped file onto
your hard drive.
Clicking the Play virtual machine button will start the VM in a couple of minutes, but
sometimes you may get the issue "This host does not support Intel VT-x".
There are two reasons why you may be getting this error:
Login credentials:
The machine login credentials are:
a. Username: cloudera
b. Password: cloudera
1. Prerequisites
i. Networking
Networking plays an important role here. Before merging both single-node servers into a multi-node
cluster, we need to make sure that both nodes can ping each other (they need to be connected
on the same network/hub, so both machines can speak to each other). Once we are done with
this process, we will move to the next step of selecting the master node and slave node; here
we are selecting 172.16.17.68 as the master machine (Hadoopmaster) and 172.16.17.61 as a slave
(hadoopnode). Then we need to add them to the /etc/hosts file on each machine as follows.
sudo vi /etc/hosts
172.16.17.68 Hadoopmaster
172.16.17.61 hadoopnode
Note: The addition of more slaves should be updated here on each machine, using unique names
for the slaves (e.g.: 172.16.17.xx hadoopnode01, 172.16.17.xy hadoopnode02 and so on).
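For example, after adding one more slave, the /etc/hosts entries on every machine might look like this (the third IP and hostname are placeholders, not part of the original setup):

```
172.16.17.68   Hadoopmaster
172.16.17.61   hadoopnode
172.16.17.62   hadoopnode01
```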
If you can see the below output when you run the given commands on both master and slave,
then we have configured it correctly.
ssh Hadoopmaster
ssh hadoopnode
2. Configurations:
The following are the required files we will use for the perfect configuration of the multi node
Hadoop cluster:
a. masters
b. slaves
c. core-site.xml
d. mapred-site.xml
e. hdfs-site.xml
b. slaves:
Lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers)
will be running as shown:
Hadoopmaster
hadoopnode
If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.
Configuring all *-site.xml files:
We need to use the same configuration on all the nodes of the Hadoop cluster, i.e. we need to
edit all the *-site.xml files on each and every server accordingly.
c. core-site.xml:
We are changing the host name from localhost to Hadoopmaster, which specifies the
NameNode (the HDFS master) host and port.
vi core-site.xml
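The resulting entry would look something like the following sketch; the port shown, 54310, is an assumption from common Hadoop 1.x setups, so use whatever port your cluster actually runs the NameNode on:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://Hadoopmaster:54310</value>
  </property>
</configuration>
```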
d. hdfs-site.xml:
We are changing the replication factor to 2. The default value of dfs.replication is 3; however,
we have only two nodes available, so we set dfs.replication to 2.
vi hdfs-site.xml
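A sketch of the edited property, per the description above:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```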
e. mapred-site.xml:
We are changing the host name from localhost to Hadoopmaster, which specifies the
JobTracker (MapReduce master) host and port
vi mapred-site.xml
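A sketch of the edited entry; mapred.job.tracker is the standard Hadoop 1.x property for the JobTracker address, and the port 54311 is an assumption from common setups:

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>Hadoopmaster:54311</value>
  </property>
</configuration>
```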
By running the jps command, we will see the list of Java processes running on the master and slaves:
This will bring up the MapReduce cluster with the JobTracker running on the machine you ran
the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
By running the jps command, we will see the list of Java processes, including the JobTracker and
TaskTrackers, running on the master and slaves.
Scalable: New nodes can be added as needed, and added without needing to change data
formats, how data is loaded, how jobs are written, or the applications on top.
Flexible: Hadoop is schema-less, and can absorb any type of data, structured or not,
from any number of sources. Data from multiple sources can be joined and aggregated in
arbitrary ways, enabling deeper analyses than any one system can provide.
Reliable: When you lose a node, the system redirects work to another location of the
data and continues processing without missing a beat. (Credit: Cloudera Blog)
Applications that run on HDFS need continuous access to their data sets. HDFS is designed more
for batch processing rather than interactive use by users. The emphasis is on high throughput of
data access rather than low latency of data access.
It is also worth examining the applications for which using HDFS does not work so well. While
this may change in the future, these are areas where HDFS is not a good fit today:
Applications that require low-latency access to data, in the tens of milliseconds range, will not
work well with HDFS. Remember HDFS is optimized for delivering a high throughput of data,
and this may be at the expense of latency.
http://blog.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf
http://en.wikipedia.org/wiki/Apache_Hadoop
http://hadoop.apache.org/
Big Data has become a buzzword these days. Data in every possible form, whether
through social media, structured, unstructured, text, images, audio, video, log files,
emails, simulations, 3D models, military surveillance, e-commerce and so on, amounts
to around some zettabytes of data! This huge data is what we call BIG DATA!
Big data is nothing but a synonym for data so huge and complex that it becomes very
tiresome or slow to capture, store, process, retrieve and analyze it with the help of
relational database management tools or traditional data processing techniques.
Let's take a moment to look into the above picture; it should explain what happens
every 60 seconds on the internet. From this we can understand how much data is being
generated in a second, a minute, a day or a year, and how exponentially it is growing; as
per the analysis by TechNewsDaily, we might generate more than 8 zettabytes of data by
2015.
Over the next 10 years: the number of servers worldwide will grow by 10x, the amount of
information managed by enterprise data centers will grow by 50x, and the number of files
an enterprise data center handles will grow by 75x. Systems and enterprises generate huge
amounts of data, from terabytes to even petabytes of information.
Volume: How large the data is. It could amount to hundreds of terabytes or even
petabytes of information.
Velocity: The increasing rate at which data flows into an organization.
Variety: A common theme in big data systems is that the source data is diverse and
doesn't fall into neat relational structures.
As we are speaking about data sizes, the above image should help us understand and
correlate what we are speaking about.
As per our earlier discussion, we should now understand what Big Data is; now let's
discuss the problems we might face with Big Data.
Traditional systems built within the company for handling relational databases may
not be able to support or scale as data is generated with high volume, velocity and
variety.
1. Volume: For instance, terabytes of Facebook posts or 400 billion annual tweets
could mean Big Data! This data needs to be stored in order to analyze it and come up with
data science reports for different solutions and problem-solving approaches.
2. Velocity: Big data requires fast processing. The time factor plays a very crucial role in
several organizations. For instance, when processing millions of share-market records, the
system needs to be able to write the data at the same speed at which it is coming in.
3. Variety: Big Data may not belong to a specific format. It could be in any form, such as
structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D
models, etc. To date we have been working with structured data; it may be difficult to
handle unstructured or semi-structured data with the quality and quantity we are generating
on a daily basis.
Parallel Processing:
Data resides on N servers, holds the power of those N servers, and can be processed in
parallel for analysis, which helps the user reduce the wait time to generate the final
report or analyzed data.
Fault Tolerance:
One of the primary reasons to use Big Data frameworks (e.g. Hadoop) to run your jobs is
their high degree of fault tolerance. Even when running jobs on a large cluster where
individual nodes or network components may experience high rates of failure, Big Data
frameworks can guide jobs to a successful completion, as the data is replicated onto
multiple nodes/slaves.