A distributed file system replicates data across racks and across computers distributed over geographical regions.


Data replication makes the data more fault tolerant. Data replication also helps scale access to the data by many users.
On the other hand, highly concurrent access lowers consistency when multiple users change data on different nodes, so the
system has to reconcile the changes made on those separate nodes.
When many data computers are connected through a network and expose their storage as one file system, we call it a
distributed file system (DFS). A DFS provides data scalability, fault tolerance, and concurrency through partitioning and
replication of data.
A parallel computer is a very large number of single computing nodes with
specialized capabilities connected over a network.
Commodity clusters are affordable parallel computers with an average number of
computing nodes. The nodes in a commodity cluster are more generic in their
computing capabilities.
Computing in one or more of these clusters across a local area network or the
internet is called distributed computing. Such architectures enable what we call
data-parallelism. In data-parallelism many jobs that share nothing can work on
different data sets or parts of a data set.
Large volumes and varieties of big data can be analyzed using this mode of
parallelism, achieving scalability, performance, and cost reduction. As you can
imagine, such systems have many potential points of failure. The ability to recover from such
failures is called fault tolerance.
Two solutions emerged to provide fault tolerance in such systems:
redundant data storage and the restart of failed individual parallel jobs.
In summary, commodity clusters are a cost-effective way of achieving data-parallel
scalability for big data applications. These types of systems have a higher
potential for partial failures. It is this type of distributed computing that pushed the
move toward cost-effective, reliable, and fault-tolerant systems for the management
and analysis of big data.
A programming model is an abstraction over existing machinery or infrastructure. It is
a set of abstract runtime libraries and programming languages that form a model of
computation; Java is one example. So we can say, if the enabling infrastructure for big
data analysis is the distributed file system,
as we mentioned, then the programming model for big data should enable the
programmability of operations within distributed file systems.
What we mean by this is being able to write computer programs that work efficiently
on top of distributed file systems with big data while making it easy to cope with all
the potential issues.
First of all, such a programming model for big data should support common big
data operations like splitting large volumes of data. This means support for partitioning and
placement of data in and out of computer memory, along with a model to
synchronize the datasets later on. Access to the data should be fast.
It should allow fast distribution of data to nodes within a rack; these are potentially the
data nodes we move the computation to. Since there is a variety of different
types of data, such as documents, graphs, tables, and key-value pairs, a programming
model should enable operations over a particular set of these types. Not every type
of data may be supported by a particular model, but each model should be
optimized for at least one type.
MapReduce is a big data programming model that supports all the requirements of
big data modeling we mentioned. It can model the processing of large data, split
computations into parallel tasks, and make efficient use of large commodity
clusters and distributed file systems. In addition, it abstracts out the details of
parallelization, fault tolerance, data distribution, monitoring, and load balancing.
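As a concrete illustration, below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API. The class names and the convention of passing the input and output HDFS paths as command-line arguments are ours, not part of the notes; treat it as a sketch rather than a definitive implementation.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split assigned to this task.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups values by key; sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregates on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Notice that the programmer only writes the map and reduce functions; parallelization, data distribution, and failure handling are left to the framework, which is exactly the abstraction described above.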
Why Hadoop:
Major questions about Hadoop: what, why, where, who, and how.
The first goal of Hadoop ecosystem frameworks and applications is scalability: storing large
volumes of data on commodity hardware.
A second goal of the Hadoop ecosystem is the ability to gracefully recover from hardware failure
problems.
The Hadoop framework handles a variety of file types, since big data comes in a variety of
flavors, such as text files, graphs of social networks, streaming sensor data, and
raster images. A third goal for the Hadoop ecosystem, then, is the ability to handle
these different data types for any given type of data.
A fourth goal of the Hadoop ecosystem is the ability to facilitate a shared environment. Since even
modest-sized clusters can have many cores, it is important to allow multiple jobs to
execute simultaneously.

The Hadoop Distributed File System, or HDFS, is the storage layer. YARN is the scheduler and resource
manager. MapReduce is a programming model for processing big data.
Hadoop Ecosystem:
MapReduce is a programming model that simplifies parallel computing. However, MapReduce
assumes only a limited model for expressing data. Hive and Pig are two additional
programming models built on top of MapReduce that augment its data modeling with
relational algebra and data flow modeling, respectively. Hive was
created at Facebook to issue SQL-like queries using MapReduce on data in
HDFS. Pig was created at Yahoo to model data-flow-based programs using
MapReduce.
Giraph was built for processing large-scale graphs efficiently. For example, Facebook
uses Giraph to analyze the social graphs of its users. Similarly, Storm, Spark, and
Flink were built for real time and in memory processing of big data on top of the
YARN resource scheduler and HDFS. In-memory processing is a powerful way of
running big data applications even faster, achieving up to 100x better performance for
some tasks.
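To give a feel for the in-memory style, here is a minimal Spark sketch in Java. The log file path is hypothetical, and the local master setting is used only to keep the sketch self-contained; on a real cluster such a job would normally be submitted through YARN.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class InMemoryWordFilter {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM, purely for illustration.
        SparkConf conf = new SparkConf().setAppName("InMemoryWordFilter").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/logs.txt"); // hypothetical path

        // cache() keeps the filtered data in memory, so the actions below reuse it
        // instead of re-reading the file from HDFS each time.
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();

        long errorCount = errors.count();
        System.out.println("error lines: " + errorCount);
        if (errorCount > 0) {
            System.out.println("first error: " + errors.first());
        }

        sc.stop();
    }
}
```

The speedup mentioned above comes from exactly this kind of reuse: once the data is cached in memory, repeated queries avoid going back to disk.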
Sometimes your data or processing tasks are not easily or efficiently represented
using the file-and-directory model of storage; examples include collections of key-value pairs or
large sparse tables. NoSQL projects such as Cassandra, MongoDB, and HBase
handle these cases. Cassandra was created at Facebook, but Facebook also used
HBase for its messaging platform.
Running all of these tools requires a centralized management system for
synchronization, configuration and to ensure high availability. Zookeeper performs
these duties. It was created by Yahoo to wrangle services named after animals.
The Hadoop Distributed File System (HDFS) is a storage system for big data that serves as the
foundation for most tools in the Hadoop ecosystem. It provides scalability to large data sets
and reliability to cope with hardware failures. HDFS has shown production scalability
up to 200 petabytes and a single cluster of 4,500 servers, with close to a billion files
and blocks.
HDFS achieves scalability by partitioning or splitting large files across multiple
computers. This allows parallel access to very large files since the computations run
in parallel on each node where the data is stored. Typical file size is gigabytes to
terabytes. The default chunk size, the size of each piece of a file is 64 megabytes.
But you can configure this to any size.
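For illustration, the sketch below uses the Hadoop FileSystem API to write a file with a non-default block size and replication factor. The path and the 128 MB value are just examples, and the cluster configuration (core-site.xml / hdfs-site.xml) is assumed to be available on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads cluster settings from the classpath
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/large-dataset.txt"); // hypothetical path
        int bufferSize = 4096;
        short replication = 3;
        long blockSize = 128L * 1024 * 1024; // 128 MB instead of the cluster default

        // This overload of create() lets a client choose block size and replication per file.
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeUTF("example content");
        }
    }
}
```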
By spreading a file across many nodes, the chance increases that some node
storing one of its blocks will fail. What happens then? Do we lose the information
stored on that node? HDFS replicates, or makes copies of, file blocks on different nodes
to prevent data loss. Suppose the node that crashed stored block C of a file; because block C was
replicated on two other nodes in the cluster, nothing is lost. By default, HDFS maintains three
copies of every block, but you can change this globally or on a per-file
basis. HDFS is also designed to handle a variety of data types, aligned with big data
variety. HDFS provides a set of formats for common data types, but this is
extensible and you can provide custom formats for your own data types. For example,
text files can be read line by line or a word at a time, and geospatial data can be
read as vectors or rasters.
HDFS consists of two components: the NameNode and the DataNode. These operate
in a master-slave relationship, where the NameNode issues commands to
DataNodes across the cluster. The NameNode is responsible for metadata, and
DataNodes provide block storage. There is usually one NameNode per cluster; a DataNode,
however, runs on each node in the cluster. In some sense the NameNode is the
administrator or coordinator of the HDFS cluster. When a file is created, the
NameNode records its name, its location in the directory hierarchy, and other
metadata. The NameNode also decides which DataNodes store the contents of
the file and remembers this mapping. The DataNode runs on each node in the
cluster and is responsible for storing the file blocks. It listens to
commands from the NameNode for block creation, deletion, and replication.
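As a small illustration of the NameNode's role as metadata keeper, the sketch below asks it, through the FileSystem API, where the blocks of a file live. The file path is hypothetical; no block data is read, only metadata.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/large-dataset.txt"); // hypothetical path

        // The NameNode answers this metadata query: which DataNodes hold which blocks.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
        }
    }
}
```

This block-to-host mapping is exactly what a scheduler uses to move computation to the data.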
Replication provides two key capabilities: fault tolerance and data locality.
Replication also means that the same block will be stored on different nodes in the
system, which may be in different geographical locations. A location may mean a specific
rack or a data center in a different town. Location is important, since we want to
move computation to data. A high replication factor means more protection against
hardware failures and better chances for data locality, but it also means more
storage space is used.
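For example, the replication factor can be raised for a single heavily read file through the FileSystem API, as in the sketch below. The path and the factor of 5 are purely illustrative; the cluster-wide default normally comes from the dfs.replication setting in hdfs-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path hotFile = new Path("/user/demo/frequently-read.txt"); // hypothetical path

        // Raise the replication factor for this one file; the NameNode then
        // instructs DataNodes to copy its blocks until 5 replicas exist.
        boolean accepted = fs.setReplication(hotFile, (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}
```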
In summary, HDFS provides scalable big data storage by partitioning files over
multiple nodes. This helps scale big data analytics to large data volumes. The
replication protects against hardware failures and provides data locality when we
move analytical computations to the data.
YARN is a resource management layer that sits just above the storage layer,
HDFS. It interacts with applications and schedules resources for their use. YARN
enables running multiple applications over HDFS, increases resource efficiency, and
lets you go beyond MapReduce, or even beyond the data-parallel programming
model.
One of the biggest limitations of Hadoop 1.0 is its inability to support non-MapReduce applications. This meant that for advanced applications such as graph
analysis, which required different ways of modeling and looking at data, you would
need to move your data to another platform.
Adding YARN in between HDFS and the applications enabled new systems to be
built, focusing on different types of big data applications such as Giraph for graph

data analysis, Storm for streaming data analysis, and Spark for in-memory analysis.
YARN does so by providing a standard framework that supports customized
application development in the Hadoop ecosystem.

In the YARN architecture, the ResourceManager sits at the center, and a NodeManager runs
on each node of the cluster. The ResourceManager controls all the
resources and decides who gets what. The NodeManager operates at the machine level
and is in charge of a single machine. Together, the ResourceManager and the NodeManagers
form the data computation framework.
Each application gets an ApplicationMaster. It negotiates resources from the
ResourceManager and talks to the NodeManagers to get its tasks completed.
A container is an abstract notion that signifies a resource allocation: a collection of
CPU, memory, disk, network, and other resources within a compute node. To simplify,
and be a little less precise, you can think of a container as a machine.
YARN's success is evident from the explosive growth of different applications that the
Hadoop ecosystem now has.
In summary, YARN gives applications many ways to extract value from data. It
lets you run many distributed applications over the same Hadoop cluster. YARN
reduces the need to move data around and supports higher resource utilization,
resulting in lower costs.
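As a small illustration of the ResourceManager's role as the cluster-wide bookkeeper, the sketch below uses the YarnClient API to list the applications it currently knows about. It assumes a reachable cluster whose settings are available through yarn-site.xml on the classpath.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration()); // assumes yarn-site.xml points at the ResourceManager
        client.start();

        // The ResourceManager tracks every application (MapReduce, Spark, etc.) on the cluster.
        for (ApplicationReport app : client.getApplications()) {
            System.out.printf("%s  %s  %s%n",
                    app.getApplicationId(), app.getApplicationType(), app.getYarnApplicationState());
        }

        client.stop();
    }
}
```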
MapReduce is a programming model for the Hadoop ecosystem; it executes parallel
processing over the distributed file blocks in HDFS. Hive has an SQL-like interface
that adds capabilities that help with relational data modeling, and Pig is a high-level
data flow language that adds capabilities that help with process-map modeling. Map
and reduce are two concepts from functional programming, where the output of a
function depends solely on its input. In map, the same operation is applied to each data
element independently; in reduce, an operation summarizes the elements in some manner.
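The same two ideas exist in ordinary functional programming. The sketch below uses Java streams purely as an analogy: map transforms each element independently, and reduce folds the results together with an associative operation, which is exactly what makes the model easy to parallelize.

```java
import java.util.List;

public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> words = List.of("big", "data", "needs", "parallel", "thinking");

        // map: apply the same pure function to every element independently.
        // reduce: summarize the mapped results with an associative operation.
        int totalChars = words.stream()
                .map(String::length)      // "big" -> 3, "data" -> 4, ...
                .reduce(0, Integer::sum); // 3 + 4 + 5 + 8 + 8 = 28

        System.out.println("total characters: " + totalChars);
    }
}
```

Because each map call is independent and the reduce operation is associative, the work can be split across many nodes, which is the essence of the MapReduce model.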
We should not use MapReduce when the data changes frequently; MapReduce is slow in that case,
since it reads the entire input data set each time.
The MapReduce model requires that maps and reduces execute independently of
each other. This greatly simplifies your job as a designer, since you do not have to
deal with synchronization issues. If computations have dependencies between them, they cannot
be expressed with MapReduce.
MapReduce does not return any results until the entire process is finished. It must
read the entire input data set. This makes it unsuitable for interactive applications
where the results must be presented to the user very quickly, with the user expecting an
immediate response.
In summary, MapReduce hides the complexities of parallel programming and greatly
simplifies building parallel applications.
If the data you need to tackle is growing large, Hadoop is a good tool to
use.
If you want access to huge amounts of old data that would otherwise go to tape
drives for archival storage, Hadoop is a good alternative.
Hadoop is also a good choice when you want to use multiple applications over the same data store. High volume
or high variety are also great indicators for Hadoop as a platform choice.
Hadoop is good for data parallelism. Data parallelism is the simultaneous execution
of the same function on multiple nodes across the elements of a dataset.
Task parallelism is the simultaneous execution of many different functions on
multiple nodes across the same or different data sets.
Not all algorithms are scalable in Hadoop or reducible to one of the programming
models supported by YARN. Hence, if you are looking to deploy tightly coupled data
processing algorithms, proceed with caution.
Hadoop can, however, be a good platform where your diverse data sets can land and get
processed into a form digestible by your database.

HDFS stores data in blocks of 64 megabytes or larger, so you may have to read an
entire file just to pick out one data entry. That makes it harder to perform random
data access.
Other challenging cases include latency-sensitive tasks and the cybersecurity of sensitive data.
The Hadoop framework is not the best fit for working with small data sets, advanced
algorithms that require a specific hardware type, task-level parallelism,
infrastructure replacement, or random data access.
Cloud computing comes as infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS).
Cloud does the heavy lifting, so your team can extract value from data without
getting bogged down in the infrastructure details.
The cloud provides convenient and viable solutions for scaling your prototype to a full-fledged
application. You can leverage the experts to handle security and robustness,
and let them deal with the technical issues.
Your team can work on utilizing your strengths to solve your domain specific
problem.
