Implementation free
Benoit Perroud 21. January 2011, 01. March 2011
Disclaimer
Any views or opinions presented in this presentation are solely those of the author and do not necessarily represent those of Verisign.
Outline
Introduction
Scalability and Availability
Difficulties to scale
CAP Theorem
NoSQL Goals
NoSQL Taxonomy
Concepts and Patterns
Existing implementations
NoSQL in the Real World
Conclusion
Introduction
NoSQL Term
Mandatory name disambiguation: NoSQL stands for Not Only SQL. The term NoSQL is generally attributed to Eric Evans, a Rackspace employee, who used it in early 2009 when Johan Oskarsson, a Last.fm employee, wanted to organize an event to discuss open-source distributed databases.
Wikipedia Definition
[Wikipedia] NoSQL is a term used to designate database management systems that differ from classic relational database management systems (RDBMS) in some way. These data stores may not require fixed table schemas, usually avoid join operations, do not attempt to provide ACID (atomicity, consistency, isolation, durability) properties and typically scale horizontally.
Tag Cloud
Why NoSQL?
Is there really a problem with SQL and RDBMS?
No, there isn't!
SQL is powerful, ACID (atomicity, consistency, isolation, durability) properties are well established, and developers and DBAs have mastered it.
So why do we need NoSQL? What motivated Google and Amazon to invest so heavily in research around NoSQL? Why are social sites so hard to scale?
Scalability definition
[Wikipedia] Scalability is a desirable property of a system, a network, or a process, which indicates its ability to either handle growing amounts of work in a graceful manner or to be readily enlarged.
In summary: handle load and peaks.
Scalability in two dimensions:
Scale up: scale vertically (increase RAM in an existing node)
Scale out: scale horizontally (add a node to the cluster)
Availability definition
[Wikipedia] Availability refers to the ability of the users to access and use the system. If a user cannot access the system, it is said to be unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable.
In summary: minimize downtime.
Availability %            Downtime per year   Downtime per month   Downtime per week
90% ("one nine")          36.5 days           72 hours             16.8 hours
95%                       18.25 days          36 hours             8.4 hours
99% ("two nines")         3.65 days           7.20 hours           1.68 hours
99.9% ("three nines")     8.76 hours          43.2 minutes         10.1 minutes
99.99% ("four nines")     52.56 minutes       4.32 minutes         1.01 minutes
99.999% ("five nines")    5.26 minutes        25.9 seconds         6.05 seconds
99.9999% ("six nines")    31.5 seconds        2.59 seconds         0.605 seconds
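The downtime figures above follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year, a 30-day month and a 7-day week (the period lengths the figures imply); the function name `downtime_hours` is only illustrative:

```python
HOURS_PER_YEAR = 365 * 24   # 8760 hours
HOURS_PER_MONTH = 30 * 24   # 720 hours (assumed 30-day month)
HOURS_PER_WEEK = 7 * 24     # 168 hours


def downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime, in hours, allowed by a given availability."""
    return (1 - availability_percent / 100) * period_hours


for pct in (90, 99, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_hours(pct):.2f} hours of downtime per year")
```

Each additional nine divides the allowed downtime by ten, which is why every extra nine is so much harder to reach.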
Difficulties to Scale
RDBMS Scalability
RDBMS are hard to scale. Hard should be understood as costly:
RDBMS licenses, hardware, DBA and operational costs grow non-linearly with the load.
or replication latency
distributed transactions: two-phase commit, Paxos algorithm.
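To see why distributed transactions are expensive, here is a minimal in-memory sketch of two-phase commit. The `Participant` class and `two_phase_commit` function are hypothetical names for illustration only; timeouts, logging and coordinator failure are deliberately left out.

```python
class Participant:
    """A hypothetical resource manager taking part in a transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: commit everywhere only if every participant votes yes."""
    votes = [p.prepare() for p in participants]   # phase 1: collect votes
    if all(votes):
        for p in participants:                    # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                        # phase 2: abort everywhere
        p.rollback()
    return False
```

Every transaction costs two round trips to every participant, and a single slow or unreachable node holds back all the others.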
Hardware Scalability
Commodity hardware and appliances are I/O bound
But network throughput is cheaper than hard disk throughput.
Distributing data across a network of small computers (and applying the data locality concept) scales better (and more cheaply) than a huge appliance.
CAP Theorem
NoSQL trade-offs
NoSQL datastores have typically made different trade-offs than RDBMS with respect to the CAP theorem.
Most of them gave up the C (consistency) of the theorem, and with it the ACID properties.
NoSQL Goals
NoSQL promises
The aim of NoSQL is to achieve (horizontal) scalability and high availability.
Business goal:
Keep costs growing proportionally with the load (tight provisioning).
Operational goal:
Scale the system by simply adding (or removing) nodes.
The system runs on commodity hardware.
NoSQL Taxonomy
Taxonomy
The most common NoSQL types are:
Document store
stores documents whose structure can be explored.
Key/value store
simple hash map data access pattern.
Graph database
stores nodes and edges, walk-through data access.
Object database
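The access patterns of the first three types can be sketched with plain Python structures. This is only an illustration of the data models, not the API of any particular datastore; all names and sample values are made up.

```python
# Key/value store: opaque values looked up by key, like a hash map.
kv_store = {"user:42": b"serialized-blob"}
value = kv_store["user:42"]

# Document store: queries can look inside the structure of each document.
documents = [
    {"_id": 1, "name": "Alice", "address": {"city": "Fribourg"}},
    {"_id": 2, "name": "Bob", "address": {"city": "Geneva"}},
]
in_geneva = [d["name"] for d in documents if d["address"]["city"] == "Geneva"]

# Graph database: nodes and edges, queried by walking from node to node.
edges = {"alice": ["bob"], "bob": ["carol"], "carol": []}


def reachable(start):
    """Return every node reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges[node])
    return seen
```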
Classification
Merkle Tree
The main use of hash trees is to make sure that data blocks received from other peers in a peer-to-peer network are received undamaged and unaltered
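A minimal sketch of how a hash tree root can be computed, assuming SHA-256 and the convention of duplicating the last hash when a level has an odd number of entries (other conventions exist). The function names are illustrative.

```python
import hashlib


def sha256(data):
    """Return the SHA-256 digest of a bytes value."""
    return hashlib.sha256(data).digest()


def merkle_root(blocks):
    """Return the root hash of a hash tree built over the given blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    level = [sha256(b) for b in blocks]       # leaves: one hash per block
    while len(level) > 1:
        if len(level) % 2:                    # odd count: duplicate the last hash
            level.append(level[-1])
        # Each parent is the hash of its two children concatenated.
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

A peer that knows the trusted root can recompute it from the blocks it received: any damaged or altered block changes the root. Comparing subtree hashes also narrows down which blocks differ without transferring all the data.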
SEDA
Staged Event-Driven Architecture
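In a staged event-driven architecture, request processing is split into stages connected by event queues. The sketch below is a deliberately simplified illustration using one thread per stage and unbounded queues; real implementations typically give each stage its own thread pool and admission control. All names are hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(handler, inbox, outbox):
    """Run one stage: take events from inbox, handle them, pass results on."""
    while True:
        event = inbox.get()
        if event is STOP:
            outbox.put(STOP)   # propagate shutdown to the next stage
            return
        outbox.put(handler(event))


def run_pipeline(events, handlers):
    """Chain one queue-fed thread per handler and return the final results."""
    queues = [queue.Queue() for _ in range(len(handlers) + 1)]
    threads = [
        threading.Thread(target=stage, args=(h, queues[i], queues[i + 1]))
        for i, h in enumerate(handlers)
    ]
    for t in threads:
        t.start()
    for e in events:
        queues[0].put(e)
    queues[0].put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)
```

For example, `run_pipeline(["get a"], [str.upper, str.split])` passes each event through an upper-casing stage and then a parsing stage. Because stages only communicate through queues, a slow stage builds up a backlog instead of blocking the stages before it.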
Data locality
Processing is sent to the data instead of the data being sent to the workers.
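A minimal sketch of the idea, using an in-memory dictionary as a stand-in for a cluster. The node names, the sample data and the `run_on_node` helper are hypothetical; in a real system this call would be a remote execution on the node that holds the partition.

```python
# Hypothetical cluster: each node holds one partition of the data set.
partitions = {
    "node-1": [3, 8, 15],
    "node-2": [4, 16],
    "node-3": [23, 42],
}


def run_on_node(node, task):
    """Stand-in for remote execution: run task next to the node's data."""
    return task(partitions[node])


def distributed_sum():
    """Sum the whole data set while moving only one number per node."""
    # The small function travels to each node; only the partial result
    # comes back, instead of every record crossing the network.
    partial_sums = [run_on_node(node, sum) for node in partitions]
    return sum(partial_sums)
```

The network carries one number per node rather than the full partitions, which is what makes the approach attractive when the data set is large.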
Existing Implementations
Twitter
Uses Cassandra for real-time analytics
Google Megastore
ACID within partitions, lower consistency across partitions
Synchronous replication with the Paxos algorithm
Conclusion
NoSQL is not a general-purpose datastore!
The eventually consistent model can be tricky and can hold nasty surprises if not used carefully.
MapReduce search jobs have high latency.