
An introductory study on what and why

MongoDB

GE Confidential

MongoDB at a glance

Until the 1970s, business organizations typically had their own database structures.
Edgar F. Codd pioneered the relational model of databases, first described in 1969 and published in 1970; his 12 rules followed in 1985.
Relational databases follow normalization and maintain ACID properties.
With an RDBMS we often think of fact-dimension modelling, with a large number of joins.

GE MSAT Internal

Relational Database


A little about Big Data

Google started its journey with its search engine in 1996.
It tried to index all the websites worldwide.
In doing so it ran into the three Vs:
Volume too large
Velocity too quick
Variety too different

To address these problems, along with some security concerns, it came up with:
Google File System (GFS)
Bigtable
MapReduce


The beginning

Google's research papers describing these systems were published.
Doug Cutting of Yahoo took the papers and created a similar framework named Hadoop.
The entire ecosystem is now open source under the Apache Software Foundation, as the Apache Hadoop project.
Google        Apache Hadoop
GFS           HDFS
Bigtable      HBase
MapReduce     MapReduce


How Hadoop came into the picture

In the Hadoop ecosystem we work with a cluster of nodes instead of a single processor, and present the cluster as a single source.
The data is also replicated across different nodes to reduce the probability of loss.
Finally, when taking data out of Hadoop, we assign Mapper and Reducer tasks to the individual Task Trackers on the slave nodes.
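
The Mapper and Reducer tasks mentioned above can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop code: the function names (mapper, shuffle, reducer, map_reduce) are made up for illustration, and each input line stands in for the split that one mapper task would receive.

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)

def map_reduce(lines):
    # Each line stands in for an input split handled by one mapper task.
    emitted = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(emitted).items())

counts = map_reduce(["big data big table", "map reduce map"])
```

In a real cluster the map and reduce calls would run on different nodes; here they run one after another in the same process.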


Hadoop Framework

open source
document oriented
high performance
scalable
NoSQL


MongoDB

Big Data
Database

NoSQL

MongoDB

A NoSQL database, also read as "Not only SQL", is an approach to data management and database design suited to accessing and analyzing very large sets of distributed data.
NoSQL
Tabular
Key-Value Store
No Joins
No Complex Queries
No Constraints

Document-Oriented


What's NoSQL

RDBMS             MongoDB
Database          Database
Table             Collection
Rows / Records    Documents
Value             Field : Value pair



Mongo basics

Data Modeling
Modeling in RDBMS: One to One, One to Many, Many to Many
Modeling in MongoDB: Embedded Data, References

{
  first_name: "Paul",
  surname: "Miller",
  city: "London",
  location: [45.123, 47.232],
  cars: [
    { model: "Bentley",
      year: 1973,
      value: 100000, ... },
    { model: "Rolls Royce",
      year: 1965,
      value: 330000, ... }
  ]
}
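
To make the difference between embedded data and references concrete, the document above can be written both ways as plain Python dictionaries. This is a sketch only: the cars lookup table, the car_ids field and the two helper functions are hypothetical names chosen for illustration, and no MongoDB driver is involved.

```python
# Embedded model: the cars live inside the owner document.
person_embedded = {
    "first_name": "Paul",
    "surname": "Miller",
    "city": "London",
    "cars": [
        {"model": "Bentley", "year": 1973, "value": 100000},
        {"model": "Rolls Royce", "year": 1965, "value": 330000},
    ],
}

# Referenced model: cars are separate documents, linked by a hypothetical _id.
cars = {
    1: {"_id": 1, "model": "Bentley", "year": 1973, "value": 100000},
    2: {"_id": 2, "model": "Rolls Royce", "year": 1965, "value": 330000},
}
person_referenced = {"first_name": "Paul", "surname": "Miller", "car_ids": [1, 2]}

def total_value_embedded(person):
    """One document read is enough: everything is already in place."""
    return sum(car["value"] for car in person["cars"])

def total_value_referenced(person, car_collection):
    """Each reference needs an extra lookup, much like a join done in the application."""
    return sum(car_collection[car_id]["value"] for car_id in person["car_ids"])
```

Both helpers return the same total; the embedded form gets there with a single document, while the referenced form needs one lookup per car.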


To address Relational
Database Modeling

Added features of MongoDB

Ad hoc queries (field queries, range queries, RegEx search)
Indexing (any field in a document can be indexed)
Replication (master-slave replication: the master handles reads and writes, the slave copies the data and serves as a backup)
Duplication of data (runs over multiple servers)
Load balancing (scales horizontally)
Journaling (crash recovery mechanism)
Schemaless structure (a collection can hold documents of different shapes and sizes)
Capped collection (maintains insertion order; once the specified size is reached it behaves like a circular queue)
File storage (GridFS is available for this purpose and is used as a file system)
Aggregation (aggregation pipeline, MapReduce)
Server-side JavaScript execution (JavaScript in queries)
Location support (understands longitude and latitude natively)
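
The circular-queue behaviour of a capped collection, as listed above, can be sketched in a few lines of plain Python. The CappedCollection class is a toy invented for this illustration, and for simplicity it is capped by number of documents, which is an assumption made here rather than a description of how MongoDB measures the limit.

```python
from collections import deque

class CappedCollection:
    """Toy capped collection: fixed capacity, insertion order, oldest entry evicted first."""

    def __init__(self, max_documents):
        # deque(maxlen=...) drops the oldest item automatically when full.
        self._docs = deque(maxlen=max_documents)

    def insert(self, document):
        self._docs.append(document)

    def find(self):
        """Return the stored documents in insertion order."""
        return list(self._docs)

log = CappedCollection(max_documents=3)
for i in range(5):
    log.insert({"event": i})
```

After five inserts into a collection capped at three, only the three most recent events remain, still in the order they arrived.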

Sharding


Sharding is a method for storing data across multiple machines. It is how MongoDB supports horizontal scalability, and it requires no change in the application code.
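
The idea of spreading one collection over several machines can be sketched as follows. This is a simplified illustration in plain Python, assuming a hashed shard key; shard_for and distribute are hypothetical helpers, and the routing rule shown is not a description of MongoDB's internal algorithm.

```python
import hashlib

def shard_for(shard_key, shard_count):
    """Pick a shard by hashing the shard key (stable across runs, unlike hash())."""
    digest = hashlib.md5(str(shard_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

def distribute(documents, key_field, shard_count):
    """Place each document on the shard chosen by its shard key."""
    shards = [[] for _ in range(shard_count)]
    for doc in documents:
        shards[shard_for(doc[key_field], shard_count)].append(doc)
    return shards

docs = [{"user_id": i, "name": f"user{i}"} for i in range(100)]
shards = distribute(docs, "user_id", shard_count=3)
```

The application only ever supplies the document and its key; which machine the document lands on is decided by the routing function, which is why application code does not need to change.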

GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16 MB.
GridFS stores files in two collections:
chunks stores the binary chunks.
files stores the file's metadata.
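
The split into a chunks collection and a files collection can be sketched without a database. The code below is an illustration in plain Python, not the GridFS driver API: split_into_chunks and reassemble are hypothetical helpers, the field names follow the GridFS layout as an assumption, and the 255 KiB default chunk size is likewise assumed (the example overrides it with a small value).

```python
def split_into_chunks(file_id, filename, data, chunk_size=255 * 1024):
    """Return (files_doc, chunk_docs) for one file, in the spirit of GridFS."""
    chunks = [
        {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        for n, start in enumerate(range(0, len(data), chunk_size))
    ]
    files_doc = {
        "_id": file_id,
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    return files_doc, chunks

def reassemble(chunks):
    """Rebuild the file by joining chunks in order of their sequence number n."""
    return b"".join(c["data"] for c in sorted(chunks, key=lambda c: c["n"]))

payload = bytes(range(256)) * 10  # 2560 bytes of sample data
files_doc, chunk_docs = split_into_chunks(1, "sample.bin", payload, chunk_size=1000)
```

Each chunk carries the id of the file it belongs to and its position, so the file can be rebuilt from the chunks alone, while the metadata document records the total length and chunk size.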


GridFS

Indexes exist for efficient execution of queries.
An index stores a small portion of the collection's data set in an easy-to-traverse form.
Indexes are defined at the collection level.
Indexing is supported on any field or sub-field of the document.
Indexes follow a B-tree architecture.
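
Why a small, ordered copy of one field speeds up queries can be shown with a sorted list and binary search. This is a sketch under stated assumptions: a sorted Python list stands in for the B-tree, and build_index and find_range are hypothetical helpers, not MongoDB calls.

```python
import bisect

collection = [
    {"_id": 0, "model": "Bentley", "year": 1973},
    {"_id": 1, "model": "Rolls Royce", "year": 1965},
    {"_id": 2, "model": "Mini", "year": 1973},
    {"_id": 3, "model": "Jaguar", "year": 1961},
]

def build_index(docs, field):
    """Store only (field value, position) pairs, sorted by value."""
    return sorted((doc[field], pos) for pos, doc in enumerate(docs))

def find_range(docs, index, low, high):
    """Return documents with low <= field <= high using binary search on the index."""
    start = bisect.bisect_left(index, (low,))
    end = bisect.bisect_right(index, (high, len(docs)))
    return [docs[pos] for _, pos in index[start:end]]

year_index = build_index(collection, "year")
hits = find_range(collection, year_index, 1965, 1973)
```

The range query touches only the matching slice of the index instead of scanning every document, which is the benefit an ordered structure such as a B-tree provides.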


Index

Replication


Primary (read and write operations from client drivers happen on this member)
Secondary (a secondary's data set reflects the primary's data set)
Arbiter (arbiters exist only to vote in elections)

When a primary does not communicate with the other members of the set for more than 10 seconds, the replica set will attempt to select another member to become the new primary.
The first secondary that receives a majority of the votes becomes primary.
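
The majority rule described above can be sketched as a small vote count. This is an illustration only, assuming that a majority means more than half of all voting members in the set; elect_primary and the member names are hypothetical and do not reflect the actual election protocol.

```python
def elect_primary(members, votes):
    """Return the first candidate to hold a strict majority of all voting members, else None.

    members: names of all voting members in the set (including arbiters).
    votes:   mapping of voter name -> candidate name; unreachable members cast no vote.
    """
    needed = len(members) // 2 + 1  # strict majority of the full set
    tally = {}
    for voter, candidate in votes.items():
        if voter in members:
            tally[candidate] = tally.get(candidate, 0) + 1
            if tally[candidate] >= needed:
                return candidate
    return None

members = ["node_a", "node_b", "arbiter"]
# node_a (the old primary) is unreachable, so only two votes are cast.
winner = elect_primary(members, {"node_b": "node_b", "arbiter": "node_b"})
no_winner = elect_primary(members, {"node_b": "node_b"})
```

With three voting members, two votes are enough to elect node_b; a single vote is not, which is why an arbiter can make the difference in a small set.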


Failure Recovery


MapReduce


Thank you
