
Introduction to MongoDB

Anton Radice
eScience Research Institute

March 15, 2016

Overview:
1. Databases overview
2. MongoDB Introduction
   Data format, schema, and modeling
   CRUD operations (read, write)
3. Mongo for Big Data
   Replication and sharding
   Aggregation pipeline
4. Q&A

Databases overview:

            SQL                        NoSQL                      NewSQL            Graph
Data Model  Relational                 Document based             Hybrid            Graph based
Schema      Fixed                      Flexible                   Mixed             Flexible
Examples    MySQL, SQLite, PostgreSQL  MongoDB, HBase, Cassandra  Clustrix, MemSQL  Neo4j, OrientDB

Which database is right for you? It depends on:

1. Characteristics of your data
   Is your data tabular or complex (e.g. multiple levels of nesting)?
2. Volatility of your data model
   Is your schema likely to evolve over time or stay the same?
3. Operational considerations (scalability, performance, availability)
   CAP theorem: consistency, availability, partition tolerance (choose 2)

MongoDB: Introduction
Open-source, cross-platform NoSQL database released in 2009
Written in C, C++, and JavaScript
Drivers for almost all popular languages:
   C, C++, Java, Perl, Python, Ruby, Scala, and more
According to the DB-Engines.com ranking, Mongo is the fourth most popular database and the most popular NoSQL database
Cheaper than traditional enterprise systems
   4 servers, 2 processors each:
   Oracle Enterprise Edition: $456,000.00
   MongoDB Subscription: $16,000.00
Features: high availability, journaling, replication, auto-sharding, aggregation framework

Source: indeed.com

MongoDB: Data format

{
  name: "denis",
  age: 29,
  occupation: "researcher",
  interests: ["high performance computing", "big data", "long walks on the beach"]
}
Each entry has the form field: value, with the value's type stored alongside it (hidden from the user)
Data is stored in documents, which are serialized as BSON (Binary JSON)
BSON includes additional type information as well as a prefixed length field for large elements to facilitate efficient scanning
The maximum BSON document size is 16 MB
Documents are grouped into collections, which are analogous to tables in relational databases
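
To make the "hidden type" idea concrete, here is a minimal sketch in plain JavaScript (no MongoDB driver involved). The helper `bsonTypeName` is hypothetical and only approximates how values map to BSON type names; it is not the driver's serialization logic. It assumes the mongo shell convention that plain numbers are stored as doubles.

```javascript
// Rough, illustrative mapping from JavaScript values to BSON type names.
// This is NOT how a real driver serializes; it only mirrors the idea that
// every stored value carries a type tag next to it.
function bsonTypeName(value) {
  if (value === null) return "null";
  if (Array.isArray(value)) return "array";
  if (value instanceof Date) return "date";
  switch (typeof value) {
    case "string": return "string";
    case "boolean": return "bool";
    case "number": return "double"; // assumption: mongo shell default for numbers
    case "object": return "object"; // embedded document
    default: throw new TypeError("no BSON mapping sketched for " + typeof value);
  }
}

const denis = {
  name: "denis",
  age: 29,
  occupation: "researcher",
  interests: ["high performance computing", "big data", "long walks on the beach"]
};

// Pair each field with its type tag, as in "field: value (type)".
const typed = Object.fromEntries(
  Object.entries(denis).map(([field, value]) => [field, bsonTypeName(value)])
);
console.log(typed);
```

The point of the sketch is only that the type travels with each value, which is what lets BSON be scanned without a separate schema.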

MongoDB: Schema
*Collections do not enforce document structure*
{
  name: "denis",
  age: 29,
  occupation: "researcher",
  interests: ["high performance computing", "big data", "long walks on the beach"],
  address: <DenisAddressObject>,    // document reference
  teaching: {
    course_name: "Big Data Technologies",
    semester: "Spring 2016",
    students_count: 11
  }
},
{
  name: "anton",
  age: 24,
  occupation: "student",
  interests: ["data science", "big data", "eating"],
  address: {                        // embedded document
    street: "Beechwood Drive",
    zip_code: 19083,
    country: "USA"
  }
}
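
One practical consequence of a flexible schema is that code reading a collection has to tolerate missing fields. The following is a rough sketch in plain JavaScript, using an array as a stand-in for a collection; `getPath` is a hypothetical helper, and the two documents are trimmed versions of the ones above (the referenced address is simply left out).

```javascript
// Toy in-memory "collection": an array of documents with different shapes.
const people = [
  { name: "denis", age: 29, teaching: { course_name: "Big Data Technologies", students_count: 11 } },
  { name: "anton", age: 24, address: { street: "Beechwood Drive", zip_code: 19083, country: "USA" } }
];

// Read a nested field by dotted path; yields undefined when any step is missing.
function getPath(doc, path) {
  return path.split(".").reduce(
    (value, key) => (value !== null && typeof value === "object" ? value[key] : undefined),
    doc
  );
}

// Documents lacking the field are skipped rather than causing an error.
const countries = people
  .map(doc => getPath(doc, "address.country"))
  .filter(country => country !== undefined);
console.log(countries);
```

Nothing in the "collection" rejects the differing shapes, so the handling of absent fields moves into the application.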

MongoDB: Data modeling

Embedded (denormalized) models:
   best for entities that have "contains" relationships
   Example: customer contains address, orders, account
   also use for one-to-many relationships where the child documents are referenced by a field in one parent document
Reference (normalized) models:
   best for representing complex many-to-many relationships or to model large hierarchical datasets
   also use when you expect documents to grow

MongoDB: Data modeling

One-to-one (embedded documents)

Separate documents:
{
  _id: 123,
  name: "John Smith"
}
{
  _id: 456,
  user_id: 123,
  position: "accountant",
  company: "Deloitte"
}

Embedded:
{
  _id: 123,
  name: "John Smith",
  job: {
    position: "accountant",
    company: "Deloitte"
  }
}

MongoDB: Data modeling

One-to-many (embedded documents)

Separate documents:
{
  _id: 123,
  name: "John Smith"
}
{
  _id: 456,
  user_id: 123,
  position: "accountant",
  company: "Deloitte"
}
{
  _id: 789,
  user_id: 123,
  position: "consultant",
  company: "StartupX"
}

Embedded:
{
  _id: 123,
  name: "John Smith",
  jobs: [
    {
      position: "accountant",
      company: "Deloitte"
    },
    {
      position: "consultant",
      company: "StartupX"
    }
  ]
}

MongoDB: Data modeling

One-to-many (document references)

Separate documents:
{
  _id: 123,
  name: "John Smith"
}
{
  _id: 456,
  user_id: 123,
  position: "accountant",
  company: "Deloitte"
}
{
  _id: 789,
  user_id: 123,
  position: "consultant",
  company: "StartupX"
}

Parent with document references:
{
  _id: 123,
  name: "John Smith",
  jobs: [456, 789]
}
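
With references, the application usually has to follow the stored _id values itself. Below is a minimal sketch of that lookup in plain JavaScript, with arrays standing in for two collections; `resolveJobs` is a hypothetical helper, not a MongoDB API.

```javascript
// Two toy collections mirroring the referenced one-to-many layout.
const users = [{ _id: 123, name: "John Smith", jobs: [456, 789] }];
const jobs = [
  { _id: 456, user_id: 123, position: "accountant", company: "Deloitte" },
  { _id: 789, user_id: 123, position: "consultant", company: "StartupX" }
];

// Application-side "join": swap each referenced _id for the document it points to.
function resolveJobs(user, jobCollection) {
  const byId = new Map(jobCollection.map(job => [job._id, job]));
  return {
    ...user,
    jobs: user.jobs.map(id => byId.get(id)).filter(job => job !== undefined)
  };
}

const resolved = resolveJobs(users[0], jobs);
console.log(resolved.jobs.map(job => job.company));
```

The embedded layout avoids this second lookup entirely, which is the trade-off between the two models: fewer reads with embedding, less duplication and smaller documents with references.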

MongoDB: CRUD operations

Create (write):
   An _id field (like a primary key) is required for every document
   If _id is not specified, the mongod instance creates one
   Writes are atomic at the document level, even if they modify embedded documents

Mongo:
db.big_data_students.insertOne({
  name: "Max Petrov",
  IQ: 175
})

SQL:
INSERT INTO big_data_students (name, IQ)
VALUES ('Max Petrov', 175)
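
The _id rule can be illustrated with a toy in-memory collection in plain JavaScript. This is only a sketch: `makeCollection` is hypothetical, the counter stands in for MongoDB's generated ObjectId, and the second student is made-up sample data. The returned object copies the shape of the shell's insertOne result (`acknowledged`, `insertedId`).

```javascript
// Toy stand-in for a collection. Real MongoDB generates an ObjectId, not a counter.
function makeCollection() {
  const docs = [];
  let nextId = 1; // hypothetical _id generator, for illustration only
  return {
    insertOne(doc) {
      const _id = doc._id !== undefined ? doc._id : nextId++;
      if (docs.some(existing => existing._id === _id)) {
        throw new Error("duplicate _id: " + _id); // _id must be unique per collection
      }
      docs.push({ ...doc, _id });
      return { acknowledged: true, insertedId: _id };
    },
    count() {
      return docs.length;
    }
  };
}

const bigDataStudents = makeCollection();
const first = bigDataStudents.insertOne({ name: "Max Petrov", IQ: 175 });
const second = bigDataStudents.insertOne({ _id: "custom", name: "Anna Ivanova", IQ: 160 });
console.log(first.insertedId, second.insertedId);
```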

MongoDB: CRUD operations

Read (query):
   All queries address a single collection
   Queries can include projection criteria (i.e. what data to return)
   Queries return a cursor to the matching documents
   Primary method: db.collection.find()

Mongo:
db.big_data_students.find(
  { nationality: "Russian", age: { $gt: 25 } },
  { name: 1, gpa: 1, _id: 0 }
)

SQL:
SELECT name, gpa
FROM big_data_students
WHERE nationality = 'Russian'
AND age > 25
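
To show how the filter and projection arguments divide the work, here is a rough sketch of a tiny `find` in plain JavaScript. It is not MongoDB's implementation: it supports only equality and `$gt`, returns an array instead of a cursor, and the helper names and sample students are hypothetical.

```javascript
// Does `doc` satisfy `filter`? Supports only equality and the $gt operator.
function matches(doc, filter) {
  return Object.entries(filter).every(([field, cond]) => {
    if (cond !== null && typeof cond === "object" && "$gt" in cond) {
      return doc[field] > cond.$gt;
    }
    return doc[field] === cond;
  });
}

// Inclusion projection: keep fields marked 1; _id is kept unless set to 0.
function project(doc, projection) {
  const out = {};
  if (projection._id !== 0 && "_id" in doc) out._id = doc._id;
  for (const [field, flag] of Object.entries(projection)) {
    if (field !== "_id" && flag === 1 && field in doc) out[field] = doc[field];
  }
  return out;
}

function find(collection, filter, projection) {
  return collection
    .filter(doc => matches(doc, filter))
    .map(doc => project(doc, projection));
}

// Hypothetical sample data.
const students = [
  { _id: 1, name: "Max Petrov", nationality: "Russian", age: 27, gpa: 3.9 },
  { _id: 2, name: "Ivan Orlov", nationality: "Russian", age: 22, gpa: 3.4 },
  { _id: 3, name: "Jane Doe", nationality: "American", age: 30, gpa: 3.7 }
];

const result = find(
  students,
  { nationality: "Russian", age: { $gt: 25 } },
  { name: 1, gpa: 1, _id: 0 }
);
console.log(result);
```

The first argument plays the role of the SQL WHERE clause and the second the role of the SELECT column list.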

MongoDB for Big Data: Overview

"MongoDB is web scale"
   Viral video released in 2010 that criticized MongoDB's ability to handle large workloads
   MongoDB version at the time: 1.6.1
   Latest MongoDB version: 3.2.4
Source: mongodb-is-web-scale.com

MongoDB for Big Data: Overview

The largest search engine in Russia uses MongoDB to manage all user and metadata for its file sharing service. MongoDB has scaled to support tens of billions of objects and TBs of data, growing at 10 million new file uploads per day.
Source: mongodb.com/mongodb-scale

MongoDB for Big Data: Overview


Strengths:
1. Storing unstructured, heterogeneous data
   Collect data from a variety of sources without the need to prepare it
2. Scalability
   Cluster, performance, and data scalability
3. Native JS Map-Reduce function
4. Aggregation pipeline
   Declarative, Unix-like pipeline processing to transform documents in stages into an aggregated result

MongoDB for Big Data: Replication


Replication: storing multiple copies of data on different servers to increase data availability and provide redundancy
Replica sets are groups of mongod instances that maintain the same dataset
The primary node receives all write operations
Secondary nodes apply the primary's operations to their own data asynchronously and elect a new primary if the current one goes down (failover)
Source: docs.mongodb.org
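
The write path and failover described above can be sketched as a toy simulation in plain JavaScript. Everything here is an assumption made for illustration: the class names are hypothetical, asynchronous replication is modelled as an explicit `replicate()` call, and the "election" just picks the most up-to-date live member, whereas real replica sets use a voting protocol.

```javascript
// One member of the toy replica set.
class Member {
  constructor(name) {
    this.name = name;
    this.data = [];
    this.applied = 0; // how many oplog entries this member has applied
    this.up = true;
  }
}

class ReplicaSet {
  constructor(names) {
    this.members = names.map(name => new Member(name));
    this.primary = this.members[0];
    this.oplog = []; // ordered log of write operations
  }
  // All writes go to the primary and are recorded in the oplog.
  write(doc) {
    if (!this.primary.up) throw new Error("no primary available");
    this.oplog.push(doc);
    this.primary.data.push(doc);
    this.primary.applied = this.oplog.length;
  }
  // Secondaries catch up later by replaying oplog entries they have not seen.
  replicate() {
    for (const member of this.members) {
      if (!member.up || member === this.primary) continue;
      member.data.push(...this.oplog.slice(member.applied));
      member.applied = this.oplog.length;
    }
  }
  // Simplified failover: promote the most up-to-date live member.
  failover() {
    this.primary.up = false;
    const live = this.members.filter(member => member.up);
    if (live.length === 0) throw new Error("no live members");
    this.primary = live.reduce((best, m) => (m.applied > best.applied ? m : best));
    this.oplog.length = this.primary.applied; // unreplicated writes are dropped
  }
}

const rs = new ReplicaSet(["a", "b", "c"]);
rs.write({ x: 1 });
rs.replicate();
rs.write({ x: 2 }); // never replicated before the primary fails
rs.failover();
rs.write({ x: 3 });
rs.replicate();
console.log(rs.primary.name, rs.members.map(m => m.data.length));
```

In this run the write `{ x: 2 }` exists only on the failed primary, which shows why asynchronous replication trades some durability for lower write latency.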

MongoDB for Big Data: Sharding


Sharding: a method for storing data across multiple servers in order to store massive datasets yet maintain high-throughput operations
Also known as horizontal scaling
3 components of a sharded cluster:
1. Shard: a replica set
2. Query router: mongos instances that interface with client applications and direct operations to the shards
3. Config server: stores the cluster's metadata
Source: docs.mongodb.org
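
The role of the query router can be sketched in plain JavaScript as follows. This is a toy under stated assumptions: `Router` and `hashKey` are hypothetical, the hash is a simple string hash rather than MongoDB's hashed-index function, each shard is a plain array instead of a replica set, and the chunk metadata a config server would hold is reduced to a modulo.

```javascript
// Simple deterministic string hash (illustrative only).
function hashKey(key) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep as unsigned 32-bit
  }
  return h;
}

// Toy query router: decides which shard owns a given shard-key value.
class Router {
  constructor(shardCount) {
    this.shards = Array.from({ length: shardCount }, () => []);
  }
  shardFor(value) {
    return hashKey(value) % this.shards.length;
  }
  insert(doc, shardKeyField) {
    this.shards[this.shardFor(doc[shardKeyField])].push(doc);
  }
  // A query on the shard key is routed to exactly one shard.
  findByKey(shardKeyField, value) {
    return this.shards[this.shardFor(value)].filter(doc => doc[shardKeyField] === value);
  }
}

const router = new Router(3);
for (let id = 1; id <= 9; id++) {
  router.insert({ student_id: id, name: "student" + id }, "student_id");
}
const hit = router.findByKey("student_id", 7);
console.log(hit, router.shards.map(shard => shard.length));
```

A query that does not include the shard key would have to be sent to every shard, which is why the choice of shard key matters for throughput.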

MongoDB for Big Data: Aggregation pipeline


Includes an optimization stage and can operate on sharded collections
Aggregation operators: $group, $match, $map, $project, $sort, $unwind, $lookup, and more

Source: docs.mongodb.org

MongoDB for Big Data: Aggregation pipeline
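
Example query:

Mongo:
db.students.aggregate([
  // sum grade points and credits per student
  { $group: {
      _id: "$student_id",
      grades: { $sum: "$grades" },
      credits: { $sum: "$credits" }
  } },
  // gpa = total grade points / total credits
  { $project: { gpa: { $divide: ["$grades", "$credits"] } } },
  { $match: { gpa: { $gte: 3 } } }
])

SQL:
SELECT student_id,
       SUM(grades) / SUM(credits) AS gpa
FROM students
GROUP BY student_id
HAVING gpa >= 3

Output:
{
  _id: 718581,
  gpa: 3.56
}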
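
The staged, Unix-pipe-like flow of an aggregation can be imitated in plain JavaScript, where each stage is a function from a list of documents to a list of documents. This is only a sketch: the stage helpers are hypothetical, the input rows are made-up sample data, and it assumes one document per completed course with `grades` holding grade points earned.

```javascript
// Hypothetical input: one document per completed course.
const courses = [
  { student_id: 718581, grades: 12, credits: 3 },
  { student_id: 718581, grades: 9, credits: 3 },
  { student_id: 100001, grades: 6, credits: 3 },
  { student_id: 100001, grades: 9, credits: 3 }
];

// $group-like stage: total grade points and credits per student, then divide.
function groupGpa(docs) {
  const totals = new Map();
  for (const { student_id, grades, credits } of docs) {
    const t = totals.get(student_id) || { grades: 0, credits: 0 };
    t.grades += grades;
    t.credits += credits;
    totals.set(student_id, t);
  }
  return [...totals].map(([_id, t]) => ({ _id, gpa: t.grades / t.credits }));
}

// $match-like stage: keep documents whose field meets the threshold.
const matchGte = (field, min) => docs => docs.filter(doc => doc[field] >= min);

// Run the stages left to right; each consumes the previous stage's output.
const pipeline = [groupGpa, matchGte("gpa", 3)];
const output = pipeline.reduce((docs, stage) => stage(docs), courses);
console.log(output);
```

Because every stage has the same input and output shape, stages can be reordered or added freely, which is the property the real pipeline's optimizer relies on.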

Further resources:
https://docs.mongodb.org/manual/
Detailed documentation
Tutorials
Presentations/webinars
MongoDB University

Any questions?
antonradice@gmail.com

