MongoDB Tutorial

Introduction to MongoDB
Anton Radice
eScience Research Institute
March 15, 2016
Overview:
1. Databases overview
2. MongoDB Introduction
   Data format, schema, and modeling
   CRUD operations (read, write)
3. Mongo for Big Data
   Replication and sharding
   Aggregation pipeline
4. Q&A
Databases overview:

             SQL                         NoSQL                       NewSQL              Graph
Data model   Relational                  Document based              Hybrid              Graph based
Schema       Fixed                       Flexible                    Mixed               Flexible
Examples     MySQL, SQLite, PostgreSQL   MongoDB, HBase, Cassandra   Clustrix, MemSQL    Neo4j, OrientDB

Which database is right for you? It depends on:

1. Characteristics of your data
   Is your data tabular or complex (e.g. multiple levels of nesting)?
2. Volatility of your data model
   Is your schema likely to evolve over time or stay the same?
3. Operational considerations (scalability, performance, availability)
   CAP theorem: consistency, availability, partition tolerance; choose 2
MongoDB: Introduction
Open-source, cross-platform NoSQL database released in 2009
Written in C, C++, and JavaScript
Drivers for almost all popular languages: C, C++, Java, Perl, Python, Ruby, Scala, and more
According to the DB-engines.com ranking, Mongo is the fourth most popular database and the most popular NoSQL database
Cheaper than traditional enterprise systems. Example for 4 servers with 2 processors each:
   Oracle Enterprise Edition: $456,000.00
   MongoDB Subscription: $16,000.00
Features: high availability, journaling, replication, auto-sharding, aggregation framework
Source: indeed.com
MongoDB: Data format

{
  name: "denis",
  age: 29,
  occupation: "researcher",
  interests: ["high performance computing", "big data", "long walks on the beach"]
}
Each entry is a field: value pair; the type of each value is stored with it but hidden from view
Data is stored in documents, which are encoded as BSON (Binary JSON)
BSON includes additional type information as well as a prefixed length field for large elements to facilitate efficient scanning
The maximum BSON document size is 16 MB
Documents are grouped into collections, which are analogous to tables in relational databases
MongoDB: Schema
*Collections do not enforce document structure*
Document with a document reference (address):
{
  name: "denis",
  age: 29,
  occupation: "researcher",
  interests: ["high performance computing", "big data", "long walks on the beach"],
  address: <DenisAddressObject>,
  teaching: {
    course_name: "Big Data Technologies",
    semester: "Spring 2016",
    students_count: 11
  }
}

Document with an embedded document (address):
{
  name: "anton",
  age: 24,
  occupation: "student",
  interests: ["data science", "big data", "eating"],
  address: {
    street: "Beechwood Drive",
    zip_code: 19083,
    country: "USA"
  }
}
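
The point that collections do not enforce structure can be sketched in plain JavaScript, with an array standing in for a collection. This is only an illustration, not how MongoDB stores data; the helper name `optionalFields` is an assumption introduced here, and the documents are trimmed versions of the two above.

```javascript
// Two documents in the same "collection" with different sets of fields.
const people = [
  { name: "denis", age: 29, occupation: "researcher",
    teaching: { course_name: "Big Data Technologies", semester: "Spring 2016" } },
  { name: "anton", age: 24, occupation: "student",
    address: { street: "Beechwood Drive", country: "USA" } }
];

// Return the fields that appear in some documents but not in all of them.
function optionalFields(docs) {
  const counts = new Map();
  for (const doc of docs) {
    for (const key of Object.keys(doc)) {
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  return [...counts]
    .filter(([, n]) => n < docs.length)
    .map(([key]) => key)
    .sort();
}

console.log(optionalFields(people)); // [ 'address', 'teaching' ]
```

Application code reading such a collection has to be ready for a field to be missing, since nothing in the collection guarantees it.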
MongoDB: Data modeling

Embedded (denormalized) models:
   best for entities that have "contains" relationships
      Example: customer contains address, orders, account
   also use for one-to-many relationships where the child documents are referenced by a field in one parent document
Reference (normalized) models:
   best for representing complex many-to-many relationships or for modeling large hierarchical datasets
   also use when you expect documents to grow

One-to-one (embedded documents)

Separate documents:
{
  _id: 123,
  name: "John Smith"
}
{
  _id: 456,
  user_id: 123,
  company: "Deloitte"
}

Embedded:
{
  _id: 123,
  name: "John Smith",
  job: {
    position: "accountant",
    company: "Deloitte"
  }
}

One-to-many (embedded documents)

Separate documents:
{
  _id: 123,
  name: "John Smith"
}
{
  _id: 456,
  user_id: 123,
  company: "Deloitte"
}
{
  _id: 789,
  user_id: 123,
  company: "StartupX"
}

Embedded:
{
  _id: 123,
  name: "John Smith",
  jobs: [
    {
      company: "Deloitte"
    },
    {
      position: "consultant",
      company: "StartupX"
    }]
}

One-to-many (document references)

Parent without references:
{
  _id: 123,
  name: "John Smith"
}

Parent referencing its child documents by _id:
{
  _id: 123,
  name: "John Smith",
  jobs: [456, 789]
}
{
  _id: 456,
  user_id: 123,
  company: "Deloitte"
}
{
  _id: 789,
  user_id: 123,
  company: "StartupX"
}
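
With document references, the application resolves the links itself, typically with a second query. A minimal sketch in plain JavaScript, using arrays in place of collections; `jobsOf` and `embedJobs` are hypothetical helper names, and no MongoDB driver is involved.

```javascript
// Normalized data: the parent stores the _id values of its children.
const users = [{ _id: 123, name: "John Smith", jobs: [456, 789] }];
const jobs = [
  { _id: 456, user_id: 123, company: "Deloitte" },
  { _id: 789, user_id: 123, company: "StartupX" }
];

// Follow the references stored in the parent's jobs array.
function jobsOf(user) {
  return user.jobs.map(id => jobs.find(job => job._id === id));
}

// Build the embedded (denormalized) form of the same data.
function embedJobs(user) {
  const embedded = jobsOf(user).map(({ company }) => ({ company }));
  return { _id: user._id, name: user.name, jobs: embedded };
}

console.log(JSON.stringify(embedJobs(users[0])));
// {"_id":123,"name":"John Smith","jobs":[{"company":"Deloitte"},{"company":"StartupX"}]}
```

The embedded form is read in one step; the referenced form costs an extra lookup but keeps each job in a single place.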
MongoDB: CRUD operations

Create (write):
An _id field (like a primary key) is required for every document; if _id is not specified, the mongod instance creates one
Writes are atomic at the document level, even if they modify embedded documents

Mongo:
db.big_data_students.insertOne(
  {
    name: "Max Petrov",
    IQ: 175
  })

SQL:
INSERT INTO big_data_students
  (name, IQ)
VALUES
  ('Max Petrov', 175)
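
The _id rule can be sketched with a toy in-memory collection. This is a stand-in written for illustration, not the MongoDB driver: `ToyCollection` is a hypothetical class, and it uses a simple counter where MongoDB would generate an ObjectId.

```javascript
// Toy in-memory collection; NOT the MongoDB driver.
class ToyCollection {
  constructor() {
    this.docs = new Map(); // _id -> document
    this.nextId = 1;       // counter used in place of ObjectId generation
  }

  insertOne(doc) {
    let _id = doc._id;
    if (_id === undefined) {
      // No _id supplied: generate one, skipping values already in use.
      while (this.docs.has(this.nextId)) this.nextId++;
      _id = this.nextId++;
    }
    // _id acts like a primary key, so it must be unique in the collection.
    if (this.docs.has(_id)) throw new Error("duplicate _id: " + _id);
    this.docs.set(_id, { ...doc, _id });
    return { acknowledged: true, insertedId: _id };
  }
}

const bigDataStudents = new ToyCollection();
const result = bigDataStudents.insertOne({ name: "Max Petrov", IQ: 175 });
console.log(result.insertedId); // 1
```

Inserting a second document with an _id that is already present throws, which mirrors the uniqueness of a primary key.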
MongoDB: CRUD operations

Read (query):
all queries address a single collection
queries can include projection criteria (i.e. what data to return)
queries return a cursor to the matching documents
primary method: db.collection.find()

Mongo:
db.big_data_students.find(
  { nationality: "Russian", age: { $gt: 25 } },
  { name: 1, gpa: 1, _id: 0 }
)

SQL:
SELECT name, gpa
FROM big_data_students
WHERE nationality = 'Russian'
AND age > 25
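
What the filter and the projection each do can be sketched in plain JavaScript. This is a simplified model, not MongoDB's query engine: it supports only equality and `$gt`, returns an array instead of a cursor, and the sample students and the helper names `matches`, `project`, and `find` are assumptions made for illustration.

```javascript
// Hypothetical sample data.
const students = [
  { _id: 1, name: "Ivan", nationality: "Russian", age: 27, gpa: 3.8 },
  { _id: 2, name: "Olga", nationality: "Russian", age: 23, gpa: 3.9 },
  { _id: 3, name: "Sam", nationality: "American", age: 30, gpa: 3.2 }
];

// Does doc satisfy filter? Supports equality and the $gt operator only.
function matches(doc, filter) {
  return Object.entries(filter).every(([field, cond]) => {
    if (cond !== null && typeof cond === "object") {
      return "$gt" in cond ? doc[field] > cond.$gt : false;
    }
    return doc[field] === cond;
  });
}

// Keep only the fields marked 1; _id is kept unless it is set to 0.
function project(doc, projection) {
  const out = {};
  if (projection._id !== 0 && "_id" in doc) out._id = doc._id;
  for (const [field, flag] of Object.entries(projection)) {
    if (flag === 1 && field in doc) out[field] = doc[field];
  }
  return out;
}

// Filter first (the WHERE clause), then project (the SELECT list).
function find(docs, filter, projection) {
  return docs.filter(d => matches(d, filter)).map(d => project(d, projection));
}

const found = find(
  students,
  { nationality: "Russian", age: { $gt: 25 } },
  { name: 1, gpa: 1, _id: 0 }
);
console.log(JSON.stringify(found)); // [{"name":"Ivan","gpa":3.8}]
```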
MongoDB for Big Data: Overview

"MongoDB is web scale"
Viral video released in 2010 that criticized MongoDB's ability to handle large workloads
MongoDB version at the time: 1.6.1
Latest MongoDB version (as of this talk): 3.2.4
Source: mongodb-is-web-scale.com
The largest search engine in Russia uses MongoDB to manage all user and metadata for its file sharing service. MongoDB has scaled to support tens of billions of objects and TBs of data, growing at 10 million new file uploads per day.
Source: mongodb.com/mongodb-scale

Strengths:
1. Storing unstructured, heterogeneous data
   Collect data from a variety of sources without the need to prepare it
2. Scalability
   Cluster, performance, and data scalability
3. Native JS Map-Reduce function
4. Aggregation pipeline
   Declarative, Unix-like pipeline processing to transform documents in stages into an aggregated result
MongoDB for Big Data: Replication

Replication: storing multiple copies of data on different servers to increase data availability and provide redundancy
Replica sets are groups of mongod instances that maintain the same dataset
The primary node receives all write operations
Secondary nodes apply the primary's operations to their own data asynchronously and elect a new primary if the current one goes down (failover)
Source: docs.mongodb.org
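
The roles of primary and secondaries can be sketched with a toy model. This is a heavy simplification written for illustration, not MongoDB's replication protocol: `ToyReplicaSet` and its methods are hypothetical names, and the "election" simply promotes the most up-to-date live node, whereas a real replica set holds a vote among its members.

```javascript
// Toy replica set: one primary, the rest secondaries, and an operation log.
class ToyReplicaSet {
  constructor(names) {
    this.nodes = names.map(name => ({ name, up: true, applied: 0, data: [] }));
    this.primary = this.nodes[0];
    this.oplog = [];
  }

  // All writes go to the primary, which records them in the oplog.
  write(doc) {
    this.oplog.push(doc);
    this.primary.data.push(doc);
    this.primary.applied = this.oplog.length;
  }

  // Secondaries catch up later by replaying the entries they have not applied.
  replicate() {
    for (const node of this.nodes) {
      if (!node.up || node === this.primary) continue;
      node.data.push(...this.oplog.slice(node.applied));
      node.applied = this.oplog.length;
    }
  }

  // Failover: the primary goes down and the most up-to-date live node takes over.
  failPrimary() {
    this.primary.up = false;
    const candidates = this.nodes.filter(n => n.up);
    if (candidates.length === 0) throw new Error("no live node left");
    candidates.sort((a, b) => b.applied - a.applied);
    this.primary = candidates[0];
    // Writes the new primary never received are dropped in this toy model.
    this.oplog = this.oplog.slice(0, this.primary.applied);
  }
}

const rs = new ToyReplicaSet(["a", "b", "c"]);
rs.write({ x: 1 });
rs.replicate();      // secondaries now hold { x: 1 }
rs.write({ x: 2 });  // acknowledged by the primary, not yet replicated
rs.failPrimary();
console.log(rs.primary.name, rs.primary.data.length); // b 1
```

Because replication is asynchronous, the second write exists only on the failed primary, so the new primary holds one document.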
MongoDB for Big Data: Sharding

Sharding: a method for storing data across multiple servers in order to store massive datasets yet maintain high-throughput operations
Also known as horizontal scaling
3 components of a sharded cluster:
1. Shard: a replica set
2. Query router: mongos instances that interface with client applications and direct operations to the shards
3. Config server: stores the cluster's metadata
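
How a query router directs operations to shards can be sketched with a toy model. This is an illustration only: `ToyRouter` and `hashString` are hypothetical names, and the modulo placement used here is an assumption made for brevity, whereas MongoDB assigns ranges of shard key values (chunks) to shards and records that mapping on the config servers.

```javascript
// Simple string hash used to spread shard key values across shards.
function hashString(value) {
  let h = 0;
  for (const ch of String(value)) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep h an unsigned 32-bit integer
  }
  return h;
}

// Toy query router: picks a shard from the hash of the shard key value.
class ToyRouter {
  constructor(shardNames, shardKey) {
    this.shardKey = shardKey;
    this.shards = shardNames.map(name => ({ name, docs: [] }));
  }

  shardFor(value) {
    return this.shards[hashString(value) % this.shards.length];
  }

  // Each document is stored on exactly one shard.
  insert(doc) {
    this.shardFor(doc[this.shardKey]).docs.push(doc);
  }

  // A query on the shard key is sent to a single shard only.
  findByKey(value) {
    return this.shardFor(value).docs.filter(d => d[this.shardKey] === value);
  }
}

const router = new ToyRouter(["shard0", "shard1", "shard2"], "student_id");
for (let id = 1; id <= 10; id++) router.insert({ student_id: id });
console.log(router.shards.map(s => s.docs.length)); // documents per shard
console.log(router.findByKey(7));
```

A query that does not include the shard key cannot be routed this way and would have to be sent to every shard.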
MongoDB for Big Data: Aggregation pipeline

Includes an optimization stage and can operate on sharded collections
Aggregation operators: $group, $match, $map, $project, $sort, $unwind, $lookup, and more
MongoDB for Big Data: Aggregation pipeline

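Example query:

Mongo:
db.students.aggregate( [
  // assumes one document per course, with grade points in grades and credit hours in credits
  { $group: {
      _id: "$student_id",
      grades: { $sum: "$grades" },
      credits: { $sum: "$credits" }
    }
  },
  { $project: {
      gpa: { $divide: [ "$grades", "$credits" ] }
    }
  },
  { $match: {
      gpa: { $gte: 3 }
    }
  }
])

SQL:
SELECT student_id,
  SUM(grades) / SUM(credits) AS gpa
FROM students
GROUP BY student_id
HAVING SUM(grades) / SUM(credits) >= 3

Output:
{
  _id: 718581,
  gpa: 3.56
}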
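
The staged processing of the pipeline can be sketched in plain JavaScript, with each stage written as a function that takes documents and returns documents. This is a model of the idea, not MongoDB's implementation; the sample data, the one-document-per-course layout, and the function names `group`, `project`, and `match` are assumptions made for illustration.

```javascript
// Hypothetical data: one document per course, grades = grade points earned.
const students = [
  { student_id: 1, grades: 12, credits: 3 },
  { student_id: 1, grades: 9, credits: 3 },
  { student_id: 2, grades: 6, credits: 3 },
  { student_id: 2, grades: 8, credits: 4 }
];

// Stage 1 (like $group): total grade points and credits per student.
function group(docs) {
  const totals = new Map();
  for (const { student_id, grades, credits } of docs) {
    const t = totals.get(student_id) || { _id: student_id, grades: 0, credits: 0 };
    t.grades += grades;
    t.credits += credits;
    totals.set(student_id, t);
  }
  return [...totals.values()];
}

// Stage 2 (like $project): derive gpa from the two totals.
const project = docs => docs.map(d => ({ _id: d._id, gpa: d.grades / d.credits }));

// Stage 3 (like $match): keep students with a gpa of at least 3.
const match = docs => docs.filter(d => d.gpa >= 3);

// Run the stages in order; the output of one stage is the input of the next.
const pipeline = [group, project, match];
const result = pipeline.reduce((docs, stage) => stage(docs), students);
console.log(result); // [ { _id: 1, gpa: 3.5 } ]
```

Filtering on gpa has to come after grouping, which is why the SQL version uses HAVING instead of WHERE.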
Further resources:
https://docs.mongodb.org/manual/
Detailed documentation
Tutorials
Presentations/webinars
MongoDB University
Any questions?
antonradice@gmail.com