
Table of Contents

1. Structure:
a. Markus
b. Flavio
2. Who are we?
a. Markus Gattol
b. Flavio Percoco Premoli
3. Introduction Part 1
a. What I am going to tell you
4. Integration with other Technologies
5. Frequently Asked Questions
a. Basics
i. Are there any Reasons not to use MongoDB?
ii. What are the supported Programming Languages?
iii. What is the Status of Python 3 Support?
iv. What is the difference in the main Building-blocks to RDBMSs?
b. Administration
i. Is there a Web GUI? What about a REST Interface/API?
ii. Can I rename a Database?
iii. How do I physically migrate a Database?
1. Secure Copy .... as in scp
2. Minimum Downtime
iv. How do I update to a new MongoDB version?
v. What is the default listening Port and IP?
vi. Is there a Way to do automatic Backups?
vii. What is getSisterDB() good for?
viii. How can I make MongoDB automatically start/restart on Server boot/reboot?
c. Resource Usage
i. Why is my Database growing so fast?
ii. What Caching Algorithm does MongoDB use?
iii. Why does MongoDB use so much RAM?
iv. What is the so-called Working Set Size?
v. How much RAM does MongoDB need?
1. Speed Impact of not having enough RAM
vi. Can I limit MongoDB's RAM Usage?
vii. What can I do about Out Of Memory Errors?
1. OpenVZ
viii. Does MongoDB use more than one CPU Core?
ix. How can I tell how many clients are connected?
x. How many parallel Client Connections to MongoDB can there be?
xi. Does MongoDB do Connection Pooling?
xii. Is there a Size limit of how much Data can be stored inside MongoDB?
xiii. Do embedded Documents count toward the 4 MiB BSON Document Size Limit?
xiv. Does Document Size impact read/write Performance?
xv. Is there a Way to tell the Size of a specific Document?
xvi. How can I tell the Size of a Collection and its Indexes?
d. Collections / Namespaces
i. What is a Capped Collection? Why use it?
ii. Can I rename a Collection?
iii. What is a Virtual Collection? Why use it?
iv. Can I use a larger Number of Collections/Namespaces?
v. How about cloning a Collection?
vi. Can I merge two or more Collections into one?
vii. How can I get a list of Collections in my Database?
viii. How do I delete a Collection?
ix. What is a Namespace with regards to MongoDB?
x. How can I get a list of Namespaces in a Database?
e. Statistics / Monitoring
i. The Server Status, what does it tell?
f. Schema / Configuration
g. Indexes / Search / Metadata
h. Map / Reduce
i. GridFS / Data Size
i. What is GridFS?
1. What can we do with GridFS
ii. Why use GridFS over ordinary Filesystem Storage?
j. Scalability / Fault Tolerance / Load Balancing
k. Miscellaneous
6. Use Case
7. Summary Part 1
8. Introduction Part 2
9. Existing Technologies
10. SQL to MongoDB Query Translation....
11. Keeping things lazy...
12. Keeping Relations or Embedding?
a. Using References:
b. Without references:
c. Light and fast (For registered users):
d. Heavy and slow (For any user):
e. Lazy relations or mongodb like ones:
13. Taking Advantage of schema-less Databases for Web Development
14. Summary Part 2

Structure:
Markus

2min: tell the audience what I am going to tell them (a summary) and why I think it's
worth mentioning
3min: I'll start with a big-picture view (how MongoDB integrates nicely with existing
setups, e.g. folks can continue using dm-crypt/LUKS) and basic principles like ...
5min: pick a few FAQ items and elaborate on them, e.g. "Why is MongoDB using so much
RAM?"
5min: I will then take a use case as an example (a web application built with
Django and MongoDB) from the financial domain where we need transactions/locking/ACID,
and talk about the differences to e.g. MySQL/PostgreSQL
5min: also, with this use case, other things like storing various-precision numbers
5min: summarize what I've told them

You start after me and drill down on details (the stuff you mentioned in your email ~9 days ago)
or whatever you/we see fit.
Flavio

2min: I'll tell the audience the topics I'll talk about and how they help us with MongoDB
and Django integration
5min: Mappers & Stack, I'll list some of the current ODMs used to integrate MongoDB and
Django, and how django-mongodb-engine integrates with Django and MongoDB.
5min: I'll talk about queries, what we have in SQL that we don't have in MongoDB and how
we can obtain the same results with MongoDB
o perfect, nothing to add/change here
3min: I'll talk about embedding and referencing, when it's worth doing each and why
5min: I'll talk about how it is possible to take advantage of schema-less databases in web
programming (Django oriented)
o ok sounds good, not sure I understand exactly; approach me today on #sunoano
and give me an example
5min: Summarize and maybe some benchmarks!!!

Who are we?


Still, with all the technology we have these days, at the end of the day it is all about the
people ...

/me definitely not a

Markus Gattol

grew up in Carinthia (southernmost Austrian state, bordering Italy), lives in the UK now
o http://sunoano.name/albums/places/austria/index.html
technical background, MSc (Computer Science, Electrical Engineering)
with Linux (Debian) since 1995, Contributor
RDBMSs, the usual ...
Open Source Developer/Contributor in general
website http://sunoano.name
o http://sunoano.name/ws/mongodb.html
works for Heart Internet Ltd., NSN before that
o http://www.heartinternet.co.uk

Flavio Percoco Premoli


GNOME a11y Contributor (MouseTrap [http://live.gnome.org/MouseTrap])
Open Source Developer/Contributor (Web and Desktop)
R&D Developer at The Net Planet Europe
o NoSQL Technologies
o Cloud Computing
o Knowledge Management Systems
Linux Lover/User and Mac user too
website: http://www.flaper87.org
Twitter: FlaPer87
Github: FlaPer87
Bitbucket: FlaPer87
Everywhere else: FlaPer87

Introduction Part 1
The why ...

1. why are you here today?
2. why does some business want to know about new technology?
3. why are we looking to move away from RDBMSs to NoSQL DBMSs?
4. a German saying, roughly: hardware and software are good when they can be understood
while you use them - and not when you can maybe fly to Mars with them.

Part 1 is mainly about MongoDB itself and not about Django/Python .... Part 2? .... Django!

What I am going to tell you


Best listener experience possible ...

Introduction Part 1 ... Tell the audience what you're going to tell them
Tell them

Integration with other Technologies

Frequently Asked Questions

Use Case

Summary Part 1 ... Tell the audience what you told them

Integration with other Technologies


How can I get MongoDB?
Ok, have it! Now what?
a. full-disk encryption / filesystem-level encryption
b. backup technologies, Rsync/Unison, Bacula, Amanda
c. LVM
d. VPN, SSH
e. Virtualization, OpenVZ

Frequently Asked Questions


Well, just because ...

Basics
Before we start running we need to be able to walk ...
Are there any Reasons not to use MongoDB?

1. We need transactions (ACID (Atomicity, Consistency, Isolation, Durability)).


2. Our data is very relational.
3. Related to 2, we want to be able to do joins on the server (and cannot restructure the
data into embedded objects/arrays instead).
4. We need triggers on our tables. (There might be triggers available soon, however.)
5. We rely on triggers (or similar functionality) for cascading updates or deletes.
6. We need the database to enforce referential integrity (MongoDB has no notion of this at all).
7. We need 100% per-node durability.
8. We need a write-ahead log. MongoDB does not have one, simply because it does not need one.
9. Dynamic aggregation with ad-hoc queries: Crystal Reports, reporting, business logic, ...
RDBMS heartland ...

What are the supported Programming Languages?

Right now (June 2010) we can use MongoDB from at least C, C++, C#, .NET, ColdFusion,
Erlang, Factor, Java, Javascript, PHP, Python, Ruby, Perl. Of course, there might be more
languages available in the future.

What is the Status of Python 3 Support?

The current thought is to use Django as more or less a signal for when adding full support
for Python 3 makes sense. MongoDB can probably support it a bit earlier than Django does, but
that is certainly not something the MongoDB community wants to rush and then have to support
two totally different code bases.

What is the difference in the main Building-blocks to RDBMSs?

We have RDBMSs, for example MySQL, Oracle and PostgreSQL, and then there are NoSQL DBMSs,
for example MongoDB. Below is a breakdown of how MongoDB relates to
the aforementioned, i.e. how the main building blocks of each compare:

MySQL, PostgreSQL, Oracle


--------------------------------------------
Server:Port
- Database
- Table
- Row

MongoDB
--------------------------------------------
Server:Port
- Database
- Collection
- Document

Administration
The usual handicraft work ... get and keep it running ... if in doubt, automate!
Is there a Web GUI? What about a REST Interface/API?

Assuming a mongod process is running on localhost, we can access some statistics at
http://localhost:28017/ and http://localhost:28017/_status
In order to have a REST interface to MongoDB, like the one CouchDB has, we have to start
mongod with the --rest switch.
o Note however that this is just a read-only REST interface.
For a read and/or write REST interface:
o http://www.mongodb.org/display/DOCS/Http+Interface
o http://github.com/kchodorow/sleepy.mongoose
o http://github.com/tdegrunt/mongodb-rest
If we wanted real-time updates from the CLI, then we could also use mongostat.

Can I rename a Database?

Yes, but it is not as easy as renaming a collection. As of now, the recommended
way to rename a database is to clone it and thereby rename it. This requires
enough free disk space to hold the database twice (the current/old one plus its clone).

How do I physically migrate a Database?

There is even a clone command for that. Note however that neither copyDatabase()
nor cloneDatabase() actually performs a point-in-time snapshot of the entire
database -- what they basically do is query the source database and then
replicate to the target database. That is, if we use copyDatabase() or cloneDatabase()
on a source database which is online and has operations performed on it, the
target database will not be a point-in-time snapshot of the exact moment
when either command was issued. Rather, at some point in time, it
will/might have the same data/state as its source database.
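For illustration, the clone approach from a driver: a minimal PyMongo sketch (hostnames are made up; the 2010-era driver class is Connection, newer versions use MongoClient):

from pymongo import Connection  # newer PyMongo versions use MongoClient

# Connect to the new machine and pull the database over from the old one.
# As noted above, the source may still be taking writes while this runs,
# so the result is not a point-in-time snapshot.
target = Connection('new-host.example.com', 27017)  # hypothetical host
target.copy_database('blog', 'blog', from_host='old-host.example.com')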

Secure Copy .... as in scp

A bit of downtime, but the chance to resume a canceled transfer ...

shut down mongod on the old machine

copy/sync the database directory to the new machine

start mongod on the new machine with dbpath set appropriately
o http://sunoano.name/ws/debian_notes_cheat_sheets.html#resume_an_scp_transfer

Minimum Downtime

Below is what we could do in order to have as little downtime as possible:

stop and re-start the existing mongod as master (if it is not already running as master,
that is)

install mongod on the new machine and configure it as slave using --slave and --source

wait while the slave copies the database, re-indexes and then catches up with its master
(this happens automatically when we point a slave to its master). Once the slave has
caught up, we

disable writes to the master (clients can still read/query)

once all outstanding writes have been committed on the master and the slave has caught up,
we shut down the master and restart the slave as the new master. The old master can now be
removed entirely.

now we point all traffic at the new master

finally we enable writes on the new master again ... Et voilà!

Of course, we might also use OpenVZ and its live-migration feature ...

How do I update to a new MongoDB version?

If it is a drop-in replacement, we just need to shut down the older version and start the
new one with the appropriate dbpath. Otherwise, i.e. if it is not a drop-in replacement, we
would use mongoexport followed by mongoimport.

What is the default listening Port and IP?

We can use netstat to find out:

wks:/home/sa# netstat -tulpena | grep mongo


tcp 0 0 0.0.0.0:27017 0.0.0.0:* LISTEN 124 1474236 8822/mongod
tcp 0 0 0.0.0.0:28017 0.0.0.0:* LISTEN 124 1474237 8822/mongod
wks:/home/sa#

The default listening port for mongod is 27017. 28017 is where we can point our web browser
in order to get some statistics. By default mongod listens on all local IPs, i.e. it binds to
0.0.0.0 with netmask 0.0.0.0, which matches every address the local machine has.

And yes, this includes the loopback device/address/network 127.0.0.0/8, the private class
A network 10.0.0.0/8, the private class B network 172.16.0.0/12 and of course also the
private class C network 192.168.0.0/16, amongst others.

Both listening port and IP address can be changed, either by using the CLI switches --port
and --bind_ip or via the configuration file; the current values can be checked by looking
at the runtime configuration.
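As a quick illustration from the driver side, a minimal PyMongo sketch connecting to a mongod started with non-default settings, e.g. mongod --port 27018 --bind_ip 127.0.0.1 (2010-era Connection class; newer versions use MongoClient):

from pymongo import Connection

# mongod was started with: mongod --port 27018 --bind_ip 127.0.0.1
conn = Connection('127.0.0.1', 27018)
print(conn.server_info()['version'])  # quick check that we reached the server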
Is there a Way to do automatic Backups?

Yes, http://github.com/micahwedemeyer/automongobackup

What is getSisterDB() good for?

We can use it to get references to databases, which not only saves a lot of typing
but is, once we are used to it, a lot more intuitive:

1 sa@wks:~/mm/new$ mongo
2 MongoDB shell version: 1.5.2-pre-
3 url: test
4 connecting to: test
5 type "help" for help
6 > db.getCollectionNames();
7 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
8 > reference_to_test_db = db.getSisterDB('test');
9 test
10 > reference_to_test_db.getCollectionNames();
11 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
12 > use admin
13 switched to db admin
14 > reference_to_test_db.getCollectionNames();
15 [ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
16 > bye
17 sa@wks:~/mm/new$

Note how we get a reference to our test database on line 8 and how it is used on lines 10 and
even 14, after switching from our test database to the admin database. getCollectionNames()
was just chosen as an example; it could of course have been any other command.
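The same pattern is natural from Python, since PyMongo has no notion of a "current" database; a minimal sketch (2010-era Connection class):

from pymongo import Connection

conn = Connection()
test = conn['test']    # like getSisterDB('test') -- just a reference
admin = conn['admin']  # holding several references at once is fine
print(test.collection_names())  # works no matter which db we touched last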

How can I make MongoDB automatically start/restart on Server boot/reboot?

One way would be to use the @reboot directive with Cron. However, the .deb and .rpm packages
install init scripts (sysv or upstart style, as appropriate) on Debian, Ubuntu, Fedora, and
CentOS already, so there MongoDB will restart without us needing to do anything special.

For other setups, http://gist.github.com/409301 is an init.d script for Unix-like
systems, based on
http://bitbucket.org/bwmcadams/toybox/src/3e84be941408/mongodb.init.rhel.

For Mac OS X, people have reported that launchctl configurations like
http://github.com/AndreiRailean/MongoDB-OSX-Launchctl/blob/master/org.mongo.mongod.plist
work.

For Windows, we have the http://www.mongodb.org/display/DOCS/Windows+Service
documentation.
Resource Usage
Lots of confusion amongst beginners ...

Why is my Database growing so fast?

The first file for a database is dbname.0, then dbname.1, etc. dbname.0 will be 64 MiB,
dbname.1 128 MiB, ... up to 2 GiB. Once the files reach 2 GiB in size, each successive file
is also 2 GiB.

So, if we have, say, database files up to dbname.n, then dbname.n-1 might be 90% unused, but
dbname.n has already been allocated once we start using dbname.n-1. The reasoning here is
simple: we do not want to wait for new database files when we need them, so we always
allocate the next one in the background as soon as we start to use an empty one.

Note that deleting data and/or dropping a collection or index will not release already
allocated disk space, since it is allocated per database. Disk space will only be released
if the database is repaired or dropped altogether. Go to
http://www.mongodb.org/display/DOCS/Developer+FAQ#DeveloperFAQ-Whyaremydatafilessolarge%3F
for more information.

What Caching Algorithm does MongoDB use?

Actually, caching is handled by the OS, which uses the LRU (Least Recently Used) caching pattern.

Why does MongoDB use so much RAM?

Well, it does not, actually; it is just that most folks do not really understand memory
management -- there is more to it than just "is in RAM" or "is not in RAM".

The current default storage engine for MongoDB is called MongoMemMapped_RecStore. It uses
memory-mapped files for all disk I/O operations. Using this strategy, the operating
system's virtual memory manager is in charge of caching. This has several implications:

There is no redundancy between file system cache and database cache; actually,
they are one and the same.
MongoDB can use all free memory on the server for cache space automatically, without
any configuration of a cache size.
Virtual memory size and RSS (Resident Set Size) will appear to be very large for the
mongod process. This is benign, however: virtual memory space will be just larger than
the size of the datafiles open and mapped, while resident size will vary depending on the
amount of memory not used by other processes on the machine.
Caching behavior (such as LRU'ing out of pages, and laziness of page writes) is
controlled by the operating system. The quality of the VMM (Virtual Memory
Manager) implementation will vary by OS.
As of now, an alternative storage engine (CachedBasicRecStore), which does not use
memory-mapped files, is under development. This engine is more traditional in design with its
own page cache. With this store the database has more control over the exact timing of reads and
writes, and of the cache LRU strategy.

Generally, the memory-mapped store (MongoMemMapped_RecStore) works quite well. The
alternative store will be useful in cases where an operating system's VMM is behaving
suboptimally.

What is the so-called Working Set Size?

Working set size can roughly be thought of as how much data we will need MongoDB (or any
other DBMS, relational or non-relational) to access in a period of time.

For example, YouTube has ridiculous amounts of data, but only 1% may be accessed at any
given time. If, however, we are in the rare case where all the data we store is accessed at
the same rate at all times, then our working set is our entire data set stored in MongoDB.

How much RAM does MongoDB need?

We now know MongoDB's caching pattern, and we know what a working set size is. Therefore we
can use the following rule of thumb for how much RAM a machine needs in order to work
properly:

it is the working set plus MongoDB's indexes which should reside in RAM at all times,
i.e. the amount of available RAM should be at least the working set size plus the size of
the indexes plus what the rest of the OS and other software running on the same machine needs.
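A made-up example to put numbers on that rule of thumb: with a 20 GiB working set, 4 GiB of indexes and, say, 2 GiB for the OS and other processes, we should plan for at least 26 GiB of RAM -- in practice the next common size up, e.g. 32 GiB.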

Speed Impact of not having enough RAM

Generally, when databases are too big to fit into RAM entirely and we are doing random
access, we are in trouble, as HDDs are slow at that (roughly 100 operations per second per
drive).

One solution is to have lots of HDDs (10, 100, ...). Another one is to use SSDs (Solid State
Drives) or, even better, add more RAM. Now that being said, the key factor here is random
access. If we do sequential access to data bigger than RAM, then that is fine.

So, it is ok if the database is huge (more than RAM size), but if we do a lot of random access
to data, it is best if the working set fits in RAM entirely.

However, there are some nuances around having indexes bigger than RAM with MongoDB.
For example, we can speed up inserts if the index keys have certain properties -- if inserts
are an issue, then that would help.

Can I limit MongoDB's RAM Usage?

No, it is not designed to do that, it is designed for speed and scalability.


If we wanted to run MongoDB on the same physical machine alongside some web server and, for
example, an application server running Django, then we could enforce memory limits on each
one simply by using virtualization and putting each one in its own VE (Virtual Environment).
In the end we would thus have a web application made of MongoDB, Django and, for example,
Cherokee, all running on the same physical machine but limited to whatever limits we set on
each VE they run in.

What can I do about Out Of Memory Errors?

If we are getting something like Fri May 21 08:29:52 JS Error: out of memory (or similar)
in our logs, then we have hit a memory limit.

As we already know, MongoDB takes all the RAM it can get, i.e. RAM or, more precisely, RSS
(Resident Set Size), itself part of virtual memory, will appear to be very large for the
mongod process.

The important point here is how this is handled by the OS. If the OS just blocks any attempt
to get more virtual memory or, even worse, kills the process (e.g. mongod) which tries to
get more virtual memory, then we have a problem. What can be done is to elevate/alter a
few settings:

1 sa@wks:~$ ulimit -a | egrep virtual\|open
2 open files (-n) 1024
3 virtual memory (kbytes, -v) unlimited
4 sa@wks:~$ lsb_release -irc
5 Distributor ID: Debian
6 Release: unstable
7 Codename: sid
8 sa@wks:~$ uname -a
9 Linux wks 2.6.32-trunk-amd64 #1 SMP Sun Jan 10 22:40:40 UTC 2010 x86_64
GNU/Linux
10 sa@wks:~$

As we can see from lines 5 to 9, I am on Debian sid (still in development) running the 2.6.32
Linux kernel.

The settings we are interested in are on lines 2 and 3. Virtual memory is unlimited by
default, so that is fine already -- this is actually what causes the most problems, so we
need to make sure virtual memory is either reasonably high or, even better, set to unlimited
as shown above. With regards to allowed open file descriptors: by default we are limited to
1024 open files which, in some cases, might pose a problem -- simply raising that limit
might already be enough to make the memory errors go away.

Note that we need to run these commands (e.g. ulimit -v unlimited) in the same user context as
mongod i.e. we basically want to script them as part of our mongod startup process.
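A minimal sketch of scripting exactly that from Python (paths and numbers are made up; raising hard limits requires root):

import resource
import subprocess

def raise_limits():
    # Runs in the child just before mongod is started -- the Python
    # equivalent of `ulimit -n 20000` and `ulimit -v unlimited`.
    resource.setrlimit(resource.RLIMIT_NOFILE, (20000, 20000))
    resource.setrlimit(resource.RLIMIT_AS,
                       (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

subprocess.Popen(['mongod', '--dbpath', '/var/lib/mongodb'],
                 preexec_fn=raise_limits)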

OpenVZ

If we are running MongoDB with OpenVZ, then there are some more settings we might want to
tune in order to keep the OOM (Out Of Memory) killer from kicking in, or to avoid simply
hitting the virtual memory ceiling if it is not set to unlimited. Special attention should
be paid to the OpenVZ memory settings, i.e. they should be set to reflect MongoDB's memory
usage.
Does MongoDB use more than one CPU Core?

For write operations MongoDB makes use of one CPU core. For read operations, however,
which tend to be the majority of operations, MongoDB uses all the CPU cores available to it.

In short: one will notice a speed increase going from a single-core CPU to dual-core or
even higher, e.g. quad-core or maybe even octo-core, since for reads the speed increase is
roughly proportional to the number of available CPU cores.

How can I tell how many clients are connected?

We can look at the connections field (current) with the server status:

sa@wks:~$ mongo --quiet


type "help" for help
> db.serverStatus();
{

[skipping a lot of lines ...]

"connections" : {
"current" : 2,
"available" : 19998
},

[skipping a lot of lines ...]

}
> bye
sa@wks:~$
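The same numbers are available from any driver; for example, a minimal PyMongo sketch (2010-era Connection class):

from pymongo import Connection

conn = Connection()
conns = conn.admin.command('serverStatus')['connections']
print('current: %s, available: %s' % (conns['current'], conns['available']))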

How many parallel Client Connections to MongoDB can there be?

Have a look at the connections field (available) with the server status.

Does MongoDB do Connection Pooling?

Yes, connection pooling is done for performance reasons and overall resource-usage
optimization -- without it, things would be a lot slower and more resource intensive. As of
now (June 2010) most of the client drivers do connection pooling; how exactly it is done
varies by driver, e.g. PyMongo.

Is there a Size limit of how much Data can be stored inside MongoDB?

4 MiB is the limit on individual documents, but GridFS uses many documents, so there is
no limit, technically/practically speaking.
While the above is true for x86-64, it is not entirely true for x86 (32-bit): because of
how memory-mapped files work, there is a limit of roughly 2 GiB of data per mongod process
there.

Do embedded Documents count toward the 4 MiB BSON Document Size Limit?

Yes, the entire BSON (Binary JSON) document (including all embedded documents, etc.) cannot
be more than 4 MiB in size.

Does Document Size impact read/write Performance?

Yes, but this is mostly due to network limitations e.g. one will max out a GigE link with inserts
before document size starts to slow down MongoDB itself.

Is there a Way to tell the Size of a specific Document?

Yes, one can use Object.bsonsize(db.whatever.findOne()) in the shell like this:

sa@wks:~$ mongo
MongoDB shell version: 1.5.1-pre-
url: test
connecting to: test
type "help" for help
> db.test.save({ name : "katze" });
> Object.bsonsize(db.test.findOne({ name : "katze"}))
38
> bye
sa@wks:~$
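From Python we can get the same number by BSON-encoding the document ourselves; a minimal sketch (assuming a PyMongo recent enough to ship the standalone bson package):

from pymongo import Connection
from bson import BSON

doc = Connection().test.test.find_one({'name': 'katze'})
print(len(BSON.encode(doc)))  # size in bytes, like Object.bsonsize() above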

How can I tell the Size of a Collection and its Indexes?

sa@wks:~$ mongo --quiet


type "help" for help
> db.getCollectionNames();
[ "fs.chunks", "fs.files", "people", "system.indexes", "test" ]
> db.test.dataSize();
160
> db.test.storageSize();
2304
> db.test.totalIndexSize();
8192
> db.test.totalSize();
10496

We are using the test collection here. dataSize() is self-explanatory. storageSize()
includes our data plus the disk space already allocated to this collection but still free.
totalIndexSize() is the size in bytes of all the indexes in this collection, and totalSize()
is all the storage allocated for all data and indexes in this collection. If we need/want a
more detailed view, we can also have a look at:
> db.test.validate();
{
"ns" : "test.test",
"result" : "
validate
firstExtent:2:2b00 ns:test.test
lastExtent:2:2b00 ns:test.test
# extents:1
datasize?:160 nrecords?:4 lastExtentSize:2304
padding:1
first extent:
loc:2:2b00 xnext:null xprev:null
nsdiag:test.test
size:2304 firstRecord:2:2be8 lastRecord:2:2c58
4 objects found, nobj:4
224 bytes data w/headers
160 bytes data wout/headers
deletedList: 0000001000000000000
deleted: n: 1 size: 1904
nIndexes:1
test.test.$_id_ keys:4
",
"ok" : 1,
"valid" : true,
"lastExtentSize" : 2304
}
> bye
sa@wks:~$

Note that MongoDB generally does a lot of pre-allocation; we can curb this by
starting mongod with --noprealloc and --smallfiles.
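The same figures can also be fetched in one go with the collstats command; a minimal PyMongo sketch:

from pymongo import Connection

db = Connection().test
stats = db.command('collstats', 'test')
# 'size' corresponds to dataSize(); all values are in bytes
print('%(size)s %(storageSize)s %(totalIndexSize)s' % stats)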

Collections / Namespaces
Needs to be known, plain and simple ...

What is a Capped Collection? Why use it?

Size: http://www.mongodb.org/display/DOCS/Capped+Collections
Time (TTL Collections): http://jira.mongodb.org/browse/SERVER-211

Can I rename a Collection?

Yes. Using help(); in MongoDB's interactive shell we get, amongst others,
db.test.renameCollection( newName , <dropTarget> ), which renames the collection. So yes, we
can do db.foo.renameCollection('bar'); and have the collection foo renamed to bar. Renaming
a collection is an atomic operation, by the way.
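The same operation from PyMongo is a one-liner:

from pymongo import Connection

db = Connection().test
db.foo.rename('bar')  # the collection foo is now called bar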
What is a Virtual Collection? Why use it?

It refers to the ability to reference embedded documents as if they were a
first-class collection of top-level documents, querying on them and returning
them as stand-alone entities, etc.

Can I use a larger Number of Collections/Namespaces?

There is a limit to how many collections/namespaces we can have within a single MongoDB
database: ~24000 namespaces per database. This is essentially the number of
collections plus the number of indexes.

How about cloning a Collection?

Yes, that is possible. Have a look at mongoexport and mongoimport.

Can I merge two or more Collections into one?

Yes: we read from all the collections we want to merge and use insert() to write the
documents into our single target collection. This can be done on the server (using MongoDB's
interactive shell) or from a client, as sketched below.
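A minimal client-side sketch of such a merge (the source collection names are made up; insert() is the 2010-era PyMongo method):

from pymongo import Connection

db = Connection().test
for source in (db.posts_2009, db.posts_2010):  # hypothetical sources
    for doc in source.find():
        doc.pop('_id')        # let MongoDB assign fresh _ids, avoiding clashes
        db.merged.insert(doc)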

How can I get a list of Collections in my Database?

We can use getCollectionNames() as shown below on lines 8 and 9. Yet another possibility is
shown on lines 23 to 28. Of course, since every collection is also a namespace, we can also
find them alongside the indexes on lines 11 to 21:

1 sa@wks:~$ mongo
2 MongoDB shell version: 1.2.4
3 url: test
4 connecting to: test
5 type "help" for help
6 > db
7 test
8 > db.getCollectionNames();
9 [ "fs.chunks", "fs.files", "mycollection", "system.indexes", "things" ]
10 > db.system.namespaces.find();
11 { "name" : "test.system.indexes" }
12 { "name" : "test.fs.files" }
13 { "name" : "test.fs.files.$_id_" }
14 { "name" : "test.fs.files.$filename_1" }
15 { "name" : "test.fs.chunks" }
16 { "name" : "test.fs.chunks.$_id_" }
17 { "name" : "test.fs.chunks.$files_id_1_n_1" }
18 { "name" : "test.things" }
19 { "name" : "test.things.$_id_" }
20 { "name" : "test.mycollection" }
21 { "name" : "test.mycollection.$_id_" }
23 > show collections
24 fs.chunks
25 fs.files
26 mycollection
27 system.indexes
28 things
29 > bye
30 sa@wks:~$

How do I delete a Collection?

db.collection.drop() -- but there is no undo, so beware.

What is a Namespace with regards to MongoDB?

Collections can be organized in namespaces. These are named groups of collections
defined using a dot notation. For example, we could define collections blog.posts and
blog.authors; both reside under the namespace blog but are two separate collections.

Namespaces can then be used to access these collections using the dot notation e.g.
db.blog.posts.find(); will return all documents from the collection blog.posts but nothing from the
collection blog.authors.

Namespaces simply provide an organizational mechanism for the user i.e. the collection
namespace is flat from the database point of view which means that blog.authors really just
is a collection on its own and not some collection authors grouped under some namespace blog.
Again, the collection namespace is flat from the database point of view i.e. technically speaking
blog.authors is no different than foo or foo.bar.baz -- grouping just helps the humans keep
track ...

How can I get a list of Namespaces in a Database?

One way to list all namespaces for a particular database would be to enter MongoDB's interactive
shell:

sa@wks:~$ mongo
MongoDB shell version: 1.2.4
url: test
connecting to: test
type "help" for help
> db.system.namespaces.find();
{ "name" : "test.system.indexes" }
{ "name" : "test.fs.files" }
{ "name" : "test.fs.files.$_id_" }
{ "name" : "test.fs.files.$filename_1" }
{ "name" : "test.fs.chunks" }
{ "name" : "test.fs.chunks.$_id_" }
{ "name" : "test.fs.chunks.$files_id_1_n_1" }
{ "name" : "test.things" }
{ "name" : "test.things.$_id_" }
{ "name" : "test.mycollection" }
{ "name" : "test.mycollection.$_id_" }
> db.system.namespaces.count();
11
> bye
sa@wks:~$

The system namespace in MongoDB is special since it contains database system information
(read: metadata). There are several collections, for example system.namespaces, which can
be used to get information about all the namespaces within a database.

Statistics / Monitoring
Because pilots need to know ...

The Server Status, what does it tell?

sa@wks:~$ mongo --quiet


type "help" for help
> db.serverStatus();
{
"uptime" : 6695,
"localTime" : "Sun Apr 11 2010 11:22:19 GMT+0200 (CEST)",
"globalLock" : {
"totalTime" : 6694193239,
"lockTime" : 45048,
"ratio" : 0.000006729414343397326
},
"mem" : {
"resident" : 3,
"virtual" : 138,
"supported" : true,
"mapped" : 0
},

Most of it is obvious, for example uptime. The globalLock part is interesting: totalTime is
the same as uptime but in microseconds. lockTime is the amount of time the global lock has
been held, i.e. the total time write queries spent waiting until a lock had been assigned
and thus a write could be made.

One may ask: what is the point of having both uptime and totalTime? Well, totalTime will
roll over faster since it is in microseconds, so at some point they diverge. The rollover is
coordinated between totalTime and lockTime.

The mem units are all MiB. resident is what is in physical memory (also known as RAM),
virtual is the virtual address space, mapped is the memory-mapped space, and supported tells
us whether memory info is supported on our platform.

"connections" : {
"current" : 2,
"available" : 19998
},
"extra_info" : {
"note" : "fields vary by platform",
"heap_usage_bytes" : 146048,
"page_faults" : 57
},
"indexCounters" : {
"btree" : {
"accesses" : 0,
"hits" : 0,
"misses" : 0,
"resets" : 0,
"missRatio" : 0
}
},
"backgroundFlushing" : {
"flushes" : 111,
"total_ms" : 2,
"average_ms" : 0.018018018018018018,
"last_ms" : 0,
"last_finished" : "Sun Apr 11 2010 11:21:45 GMT+0200 (CEST)"
},

connections tells us about client connections to mongod; more precisely, current tells us
how many client connections to mongod exist right now and available shows us how many we
have left.

Within the extra_info part we have heap_usage_bytes, which is the heap memory used by the
database.

"opcounters" : {
"insert" : 16513,
"query" : 1482263,
"update" : 141594,
"delete" : 38,
"getmore" : 246889,
"command" : 1247316
},
"asserts" : {
"regular" : 0,
"warning" : 0,
"msg" : 0,
"user" : 0,
"rollovers" : 0
},
"ok" : 1
}
> bye
sa@wks:~$

The opcounters part is also pretty interesting. insert, query, update, and delete are
self-explanatory, but getmore and command are probably not. When we do a query, we get
results in batches: the first batch is counted in query, all subsequent ones in getmore.
Commands are things like count, group, distinct, etc.

And yes, taking those numbers and dividing them by time (delta or total) will give us
operations/time, e.g. operations per second or operations since mongod was started. In fact,
there is a Munin plugin (http://github.com/erh/mongo-munin) which does exactly this.
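Sketching that by hand is straightforward; here we sample the opcounters twice, ten seconds apart (the window size is arbitrary):

import time
from pymongo import Connection

conn = Connection()

def opcounters():
    return conn.admin.command('serverStatus')['opcounters']

before = opcounters()
time.sleep(10)
after = opcounters()
for op, count in before.items():
    print('%s: %.1f ops/s' % (op, (after[op] - count) / 10.0))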
Schema / Configuration
Sorry folks, no can do, lack of time ... go to
http://sunoano.name/ws/mongodb.html#faqs_schema_configuration

Indexes / Search / Metadata


Sorry folks, no can do, lack of time ... go to
http://sunoano.name/ws/mongodb.html#faqs_indexes_search_metadata

Map / Reduce
Sorry folks, no can do, lack of time ... go to
http://sunoano.name/ws/mongodb.html#faqs_map_reduce

GridFS / Data Size


Store tons of data reliably and smartly ...

What is GridFS?

Basically, it is a collection of normal documents. We have two collections: one for metadata
(fs.files) and one consisting of chunks of data (fs.chunks).

The GridFS spec provides a mechanism for transparently dividing a large file among
multiple documents. This allows us to efficiently store large objects, and in the case of
especially large files, such as videos, permits range operations (e.g., fetching only the first n
bytes of a file).

What can we do with GridFS

Store ridiculous amounts of data in a smart way.
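A minimal sketch using PyMongo's gridfs module (file name and metadata are made up; extra keyword arguments to put() end up in fs.files):

import gridfs
from pymongo import Connection

db = Connection().test
fs = gridfs.GridFS(db)

# Store a file; the extra keyword argument is kept as metadata in fs.files.
file_id = fs.put(open('video.avi', 'rb'), filename='video.avi',
                 description='talk recording')

out = fs.get_last_version('video.avi')
header = out.read(1024)  # only the chunks covering these bytes are fetched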

Why use GridFS over ordinary Filesystem Storage?

If we used the filesystem, we would have to handle backup/replication/scaling ourselves. We
would also have to come up with some sort of hashing scheme ourselves, plus we would need to
take care of cleanup/sorting/moving, because filesystems do not love lots of small
files.

With GridFS, we can use MongoDB's built-in replication/backup/scaling, e.g. scale reads by
adding more read-only slaves and writes by using sharding. We also get out-of-the-box
hashing (read: UUID (Universally Unique Identifier)) for stored content, plus we do not
suffer from filesystem performance degradation caused by a myriad of small files.

Also, we can easily access information from random sections of large files, another thing
traditional tools working with data right off the filesystem are not good at. Last but not
least, we can keep information associated with the file (who has edited it, download count,
description, etc.) right with the file itself.

Scalability / Fault Tolerance / Load Balancing


Sorry folks, no can do, lack of time ... go to
http://sunoano.name/ws/mongodb.html#faqs_scalability_fault_tolerance_load_balancing

Miscellaneous
Sorry folks, no can do, lack of time ... go to
http://sunoano.name/ws/mongodb.html#faqs_miscellaneous

Use Case
This should have been my major part

o locking (read: transactions)
o asynchronous as opposed to synchronous operations
o numbers (double precision)

Again, lack of time ... go to http://sunoano.name/ws/mongodb.html

Summary Part 1
Tell them what you told them ... simple as that ...

Introduction Part 2

Before starting with MongoDB-specific topics, it's important to know that we don't dislike
relational databases. We know they are good for many things, but we also know that a web
application's success is largely based on its performance and speed. That's what we're
running after, and that's why we're all here.
Existing Technologies

MongoKit (Nicolas Clairon):

o Great for completely unstructured model programming. It has structure validation,
but I've never used that; I prefer to use MongoKit on models that may be constantly
changing their structure.

mongoengine (Harry Marr):

o It allows you to define schemas for documents and query collections using Django-like
syntax.

django-mongodb-engine (Alberto Paro and myself):

o This is a real Django backend based on django-mongodb and mongoengine,
adapted to work with django-nonrel and MongoDB without changing anything in the
code.

SQL to MongoDB Query Translation....


"What matters is who adapts faster to the changing conditions"

- Charles Darwin

The first thing we should remember when moving from SQL databases to NoSQL ones is that
models were made to model data, but models can be modeled too. What I mean is that people
tend to adapt database features to their models instead of adapting models to databases.
I'll try to mention some of the common questions found on the mailing list:

Let's start with JOINs. Why JOINs? Because we don't have those in MongoDB and we
might need them, so we have to figure out the best workaround. The best
thing you can do here is forget about JOINs; you won't have them. We are not talking about
highly relational databases, we are talking about non-relational ones, so there can't be
joins between 2 collections if there's no relation between them. One of the things we did
was remodel the way we stored data: we embedded what could be embedded and did 2 or
more queries where embedding was not possible (see the sketch after this list).

What about ForeignKeys, do we have those? Yes, or kind of. We have DBRef, which is a
kind of ForeignKey, but I personally wouldn't use refs in MongoDB. As I said, MongoDB is
not about referencing and collection relations; it is about performance based on dynamism.

Since MongoDB barely has references, you can guess that many-to-many is insignificant;
instead, I would start thinking about dictionaries/maps and lists/arrays.

And last but not least, if you really need to do a query that joins 2 collections based on a
field reference that would otherwise handle a many-to-many relation, you have map/reduce.
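Here is the "two or more queries" idea from the first point as a minimal sketch (collection and field names are made up):

from pymongo import Connection

db = Connection().blog
# first query: the posts we are interested in
posts = list(db.posts.find({'tags': 'mongodb'}))
# second query: all referenced authors in a single round trip
ids = [p['author_id'] for p in posts]
authors = dict((a['_id'], a) for a in db.authors.find({'_id': {'$in': ids}}))
for p in posts:
    p['author'] = authors[p['author_id']]  # the "join" happens client side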

Keeping things lazy...

Yes, because we're lazy people, so we do lazy things ...

It is important when getting ORMs to work with MongoDB that we keep things lazy, to avoid
bottlenecks in our web applications. MongoDB doesn't have many-to-many relations, but it can
store lists and dictionaries. For example:

class User(models.Model):
    nickname = models.CharField(max_length=255)
    full_name = models.CharField(max_length=255)
    friends = ListField()
    groups = ListField()

In the User model we have 2 ListFields that may cause some slowdowns in our web application.
The first is a list containing the ids/names of the user's friends, and the second contains
the groups the user is related to. So, think of a user that has many friends and is related
to many groups (a popular one): that's a lot of data transfer and many instantiations for
our code, because each object/id in the ListField has to be instantiated. This might sound
obvious, but trust me, nothing is obvious when doing web programming.
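One way to keep that lazy: store only light ids in the list and dereference a bounded page of them on demand, instead of instantiating every friend each time the user is loaded. A minimal sketch (the collection layout is made up):

from pymongo import Connection

db = Connection().test
user = db.users.find_one({'nickname': 'FlaPer87'})

def friends(user_doc, limit=10):
    # one extra, bounded query instead of instantiating all friends up front
    return db.users.find({'_id': {'$in': user_doc['friends']}}).limit(limit)

first_page = list(friends(user))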

Keeping Relations or Embedding?

This is a common question when moving from relational databases to non-rel ones. Should we
keep our models related or embed the smallest ones into the biggest ones? The answer is no,
you shouldn't keep them related. For example, a common situation (or one commonly used to
show how MongoDB works) is a blog engine with posts and comments. Let's see how we could
handle comments (not threaded) in our blog engine:

Using References:

class Comment(models.Model):
    post = models.ForeignKey(Post)
    user = models.ForeignKey(User)
    text = models.CharField(max_length=255)

my_comment, created = Comment.objects.get_or_create(post=my_post, user=my_user,
                                                    text=my_text, defaults={})

Without references:

class Post(models.Model):
    ....
    comments = ListField()

post.comments.append({'user': user, 'text': text})
post.save()

The first example is the most used, because it is the way we're used to thinking when we
write our models, but the second one is the right one when talking about NoSQL databases,
because references make things slower.
The bad thing about embedding our comments like that is that we have to worry about the
4 MiB document limit: if we are really popular on the net and many people come to our blog
and comment on our posts, that might become a problem for us. Even so, this is great; I
mean, we have removed a model from our app, so it should be easier to maintain, shouldn't
it? But what is user supposed to be? Is it an embedded user object? Is it a ForeignKey?
What is it? How should we handle users there?

It again depends on how you'd like to do things. For example, it is possible to save the
username as it should be shown and then, when the comments are loaded, just show the
username; for those wanting to know more about this user, clicking on the username can load
the user's personal info. Here are some examples:

Light and fast (For registered users):

post.comments.append({'user': 'FlaPer87', 'text': 'My Comment'})
post.save()

Heavy and slow (For any user):

post.comments.append({'user': {'username': 'FlaPer87',
                               'email': 'flaper87@flaper87.org',
                               'url': 'http://blog.flaper87.org'},
                      'text': 'My Comment'})
post.save()

Lazy relations or MongoDB-like ones:

# Automatic serialization done in django-mongodb-engine
post.comments.append({'user': {'_app': model._meta.app_label,
                               '_model': model._meta.module_name,
                               'pk': model.pk,
                               '_type': "django"},
                      'text': 'My Comment'})
post.save()
Taking Advantage of schema-less Databases for Web Development

One of the things I like most about MongoDB is that it is schema-less. People tend to think
of schema-less DBs as a mess, which they're not. Schema-less databases do have a structure;
the difference between them and schema-based ones is that schema-less structures are
dynamic, meaning they can be modified at any time, and they're not typed. You can think of
schema-less DBs as (just like MongoDB does) JSON-based maps.

This kind of structure can be really helpful when doing web programming; in our case it lets
us save any kind of data in our collections and have generic structures that change over
time. For example, let's try to improve our Comment model (in case we decided to keep some
relations):

class Comment(models.Model):
    post = models.ForeignKey(Post)
    user = GenericField()
    text = models.CharField(max_length=255)

my_user = "FlaPer87"  # Known User

my_comment, created = Comment.objects.get_or_create(post=my_post,
                                                    user=my_user,
                                                    text=my_text, defaults={})

my_user = {'nickname': 'FlaPer87',
           'full_name': 'Flavio Percoco Premoli',
           'email': 'flaper87@flaper87.org',
           'url': 'http://blog.flaper87.org'}  # Anonymous User

my_comment2, created = Comment.objects.get_or_create(post=my_post,
                                                     user=my_user,
                                                     text=my_text,
                                                     defaults={})

Using a GenericField we'll be able to save anything into that attribute, and we'll have to
do our checks and controls on the code side. In this case the schema-less collection helped
us get/save the anonymous user's information without having to create a record in our Users
table and without forcing the user to register.

Summary Part 2
Re-model your models
Be Lazy to be faster
Forget about relations, they will slow you down
Remember that dynamism is better than restrictions
