
http://wiki.apache.org/cassandra/ArticlesAndPresentations
http://docs.datastax.com/en/landing_page/doc/landing_page/current.html

Info from website: http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis


================================================
Written in: Java
Main point: Store huge datasets in "almost" SQL
License: Apache
Protocol: CQL3 & Thrift
- CQL3 is very similar to SQL, but with some limitations that come from its
scalability (most notably: no JOINs, no aggregate functions.)
- CQL3 is now the official interface. Don't look at Thrift, unless you're working
on a legacy app. This way, you can live without understanding ColumnFamilies,
SuperColumns, etc.
- Querying by key, or key range (secondary indices are also available)
- Tunable trade-offs for distribution and replication (N, R, W)
- Data can have expiration (set on INSERT).
- Writes can be much faster than reads (when reads are disk-bound)
- Map/reduce possible with Apache Hadoop
- All nodes are similar, as opposed to Hadoop/HBase
- Very good and reliable cross-datacenter replication
- Distributed counter datatype.
- You can write triggers in Java.
Best used: When you need to store data so huge that it doesn't fit on one server,
but still want a friendly, familiar interface to it.
For example: Web analytics, to count hits by hour, by browser, by IP, etc.
Transaction logging. Data collection from huge sensor arrays.
================================================

A) Cassandra Architecture: -

Cassandra
- A distributed database.
- There is no master-slave concept and each node is equal.
- A cluster can easily be across more than one data center.

Snitch
- It is how the nodes in a cluster learn about the topology of the cluster.
- Types: Dynamic Snitching, SimpleSnitch, RackInferringSnitch, PropertyFileSnitch,
GossipingPropertyFileSnitch, EC2Snitch, EC2MultiRegionSnitch

Gossip (Internal communication)


- It is how the nodes in a cluster communicate with each other.
- Every second, each node communicates with up to three other nodes, exchanging
information about itself and all other nodes that it has information about.
Note: For external communication, such as from an application to the C* database,
CQL (Cassandra Query Language) or Thrift is used.

Data Distribution
- It is done through consistent hashing, to strive for even distribution of data
across the nodes in the cluster.
- Rather than all rows of a table existing on only one node, the rows are
distributed across the nodes in the cluster, in an attempt to evenly spread out
the load of the table's data.
- To distribute the rows across the nodes, a partitioner is used. The partitioner
uses an algorithm to determine which node a given row of data will go to.
- The default partitioner in Cassandra is Murmur3 (the Murmur3Partitioner).
Murmur3: It hashes the row's partition key value to generate a number between
-2^63 and 2^63 - 1.
Calculate the token ranges: -
<In the Python 2 one-liner below it is calculated for 4 nodes; replace 4 with the
actual number of nodes in your environment.>
$ python -c 'print [str(((2**64 / 4) * i) - 2**63) for i in range(4)]'
['-9223372036854775808', '-4611686018427387904', '0', '4611686018427387904']
OR
Use a Murmur3 calculator
- Each node in a cluster is assigned one token range (or multiple ranges with
virtual nodes)
e.g.: Each node is responsible for the token range between its endpoint and the
endpoint of the previous node.
Node wise endpoint is defined below.
NodeA: -100
NodeB: 0
NodeC: 51
NodeD: 100
-> NodeA can store values greater than 100 (positive side) and values of -100 and
below (negative side)
-> NodeB can store values from -99 to 0
-> NodeC can store values from 1 to 51
-> NodeD can store values from 52 to 100
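- To see which token the partitioner generated for a row, CQL's built-in token()
function can be used. A minimal sketch, assuming the activity table defined later
in section E exists and contains data:
e.g.: SELECT token(home_id), home_id, datetime FROM activity;
-> The returned token can be compared against the node token ranges above to see
which node owns each row.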

Replication Factor:
- It must be specified whenever a database is defined.
- It specifies how many instances of the data there will be within a given
database.
- Although 1 can be specified, it is common to specify 2, 3, or more so that if a
node goes down, there is at least one other replica of the data, and the data is
not lost with the down node.
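- The replication factor is set when a keyspace is created (full syntax in section
D). A minimal sketch, using the home_security keyspace name that appears later in
these notes:
e.g.: CREATE KEYSPACE home_security WITH REPLICATION =
{'class':'SimpleStrategy', 'replication_factor': 3};
-> With replication_factor 3, every row in the keyspace is stored on 3 nodes.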

Virtual Nodes:
- They are an alternative way to assign token ranges to nodes, and Virtual Nodes
are now the default in Cassandra.
- With Virtual Nodes, instead of a node being responsible for just one token range,
it is instead responsible for many small token ranges (by default, 256 of them)
- Virtual Nodes allow for assigning a high number of ranges to a powerful
computer (e.g. 512) and a lower number of ranges (e.g. 128) to a less powerful
computer
- Virtual Nodes (aka vnodes) were created to make it easier to add new nodes to a
cluster while keeping the cluster balanced
- When a new node is added, it receives many small token range slices from the
existing nodes, to maintain a balanced cluster

===================================================================================
=========================================================================
B) Installing and Configuring

Installation: -
- http://www.planetcassandra.org/cassandra/
- Cassandra is installed in the directory where you unzip the archive.

Configuration: -
- Go inside conf directory to see configuration files.
(/Users/ashah/cassandra/dsc-cassandra-3.0.0/conf)
- cassandra.yaml is the main configuration file.
File permissions: -
<if you have modified cassandra.yaml to use the directories below, then create
those directories and give them the permissions shown>
- sudo mkdir /var/lib/cassandra
- sudo mkdir /var/log/cassandra
- sudo chown -R $USER:$GROUP /var/lib/cassandra
- sudo chown -R $USER:$GROUP /var/log/cassandra

Starting/Stopping Cassandra: -

Way 1)
-> Start
<for now it is via root user>
- $pwd
o/p:/Users/ashah/cassandra/dsc-cassandra-3.0.0
- bin/cassandra

-> Stop
- ps aux | grep cass
- kill <pid>

Way 2)
- start: bin/cassandra -f
- stop: Control + C (when running in the foreground)

Checking Status: -
- bin/nodetool status
- bin/nodetool info [-h <host>]
- bin/nodetool ring

Accessing the Cassandra system.log File


- Location: /Users/ashah/cassandra/dsc-cassandra-3.0.0/logs
- The file names are system.log and debug.log.
- Current versions:: The log file directory is set in
/Users/ashah/cassandra/dsc-cassandra-3.0.0/conf/logback.xml
- Earlier versions:: The log file directory is set in
/Users/ashah/cassandra/dsc-cassandra-<x>/conf/log4j-server.properties

===================================================================================
=========================================================================
C) Communicating with Cassandra

Understanding ways to communicate with Cassandra: -


- CQL (Cassandra Query Language) is a SQL-like query language for communicating
with Cassandra, created to make it easy for people familiar with SQL to work with
Cassandra.
e.g.: select home_id, datetime, event, code_used from activity;
* CQL commands are not case-sensitive.
* Although CQL looks similar to SQL, it does not have all of the options that
SQL has, due to the distributed nature of the C* database.
- Thrift is a low-level API, currently still supported in Cassandra (support may be
phased out in a future release of C*). (It existed before CQL.)
- For administrative activities, such as cluster monitoring and management tasks,
tools built on JMX (Java Management Extensions) are commonly used.

CQLSH: -
- bin/cqlsh
- cqlsh> HELP
- cqlsh> help create_keyspace
- A semicolon (";") is optional for CQLSH commands but mandatory for CQL commands.

===================================================================================
========================================================================
D) Creating a database

Understanding a Cassandra Database: -


- In C*, a database is defined as a keyspace -> Within a keyspace, tables can be
defined.
- Check existing keyspaces: -
cqlsh> describe keyspaces;
- To see inside keyspace: -
cqlsh> describe keyspace <name>;

Defining a keyspace: -
- A keyspace name is case sensitive only if you put it inside double quotes;
otherwise it is folded to lower case.
e.g.: a) CREATE KEYSPACE "Test" :: This will be created as Test.
b) CREATE KEYSPACE Test :: This will be created as test.
- A keyspace can be defined through the create keyspace command.
->
CREATE KEYSPACE vehicle_tracker WITH REPLICATION =
{'class':'NetworkTopologyStrategy', 'dc1':3, 'dc2':2};
<'dc1':3 means data center 1 contains 3 replicas of the data and, in the same way,
data center 2 contains 2 replicas>
->
CREATE KEYSPACE vehicle_tracker WITH REPLICATION = {'class':'SimpleStrategy',
'replication_factor':1};

Deleting a keyspace: -
- DROP KEYSPACE vehicle_tracker;

Working inside a keyspace: -


- USE <keyspace_name>;

===================================================================================
=========================================================================
E) Creating a Table

Creating/dropping a Table: -
- CREATE TABLE activity
(home_id text, datetime timestamp, event text, code_used text, PRIMARY
KEY (home_id, datetime)) WITH CLUSTERING ORDER BY (datetime DESC);
- DROP TABLE activity;

Defining Columns And Data Type: -


- Data types: ascii, bigint, blob, boolean, counter, decimal, double, float, inet,
int, list, map, set, text, timestamp, uuid, timeuuid, varchar, varint
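
- A minimal sketch of a table using a few of these types (the table and column
names here are made up for illustration):
e.g.: CREATE TABLE home_sensors (
home_id text,
sensor_id uuid,
readings list<double>,
tags set<text>,
settings map<text, text>,
PRIMARY KEY (home_id, sensor_id)
);
-> list, set, and map are collection types; uuid stores a universally unique id.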

Defining a primary key: -


- Same as in other databases: defined with the PRIMARY KEY clause in the CREATE
TABLE statement.

Recognizing a partition key: -


- The partition key is hashed by the partitioner to determine which node in the
cluster will store the partition.
- For a simple (single-column) primary key, that column is the partition key.
- For a compound primary key, the first column listed in the primary key defines
the partition key.
-> Internally, all of the CQL rows that have the same partition key value are
stored in the same partition (aka internal storage row, identified by the RowKey).
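- For example (a sketch; the second variant with a date column is hypothetical):
e.g.: a) PRIMARY KEY (home_id, datetime) :: home_id is the partition key and
datetime is a clustering column.
b) PRIMARY KEY ((home_id, date), datetime) :: the inner parentheses define a
composite partition key made of home_id and date.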

Specifying a descending clustering order


- A table can be defined to store its data in ascending (default) or descending
order
e.g.: WITH CLUSTERING ORDER BY (datetime DESC)
- Specifying descending causes writes to take a little longer, as cells are
inserted at the start of a partition rather than added at the end, but it improves
read performance when descending order is needed by an application
- Once clustering order is defined, changing the clustering order of a table is not
an option.
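- Within a single partition, a query can still reverse the stored order at read
time. A minimal sketch against the activity table (the home_id value is
illustrative):
e.g.: SELECT datetime, event FROM activity
WHERE home_id = 'H01474777'
ORDER BY datetime ASC;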

===================================================================================
=========================================================================
F) Inserting Data

Understanding Ways to Write Data


- INSERT INTO (CQL command)
- COPY command
- sstableloader tool (bulk loading)

Using the INSERT INTO command


- Same as the INSERT command in other databases.
e.g. : INSERT INTO activity (home_id, datetime , event, code_used) VALUES
('H01474777', '2014-05-21 07:32:16', 'alarm set', '5599');

Using the COPY command


- The COPY command can be used to import data (COPY FROM) from a .csv file.
e.g.: COPY activity (home_id, datetime , event, code_used) FROM
'/Users/ashah/events.csv' WITH header = true AND delimiter = '|';

- The COPY command can be used to export data (COPY TO) to a .csv file.
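A sketch of the export direction, mirroring the import example above (the output
file name is illustrative):
e.g.: COPY activity (home_id, datetime, event, code_used) TO
'/Users/ashah/events_export.csv' WITH HEADER = true AND DELIMITER = '|';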

How Data is stored in C*


- Internally, a partition key value (in Thrift, referred to as a row key value) is
what makes an internal storage row unique.

How Data is stored on Disk


- When data is written to a table in Cassandra, it goes to both a commit log on
disk (for playback, in case of node failure) and to memory (called the memtable).
- Once the memtable for a table is full, it is flushed to disk as an SSTable
- For each table on each node there is a memtable
- The SSTables for a table are stored on disk, in the location specified in the
cassandra.yaml file.
- To see the contents of an SSTable, sstable2json can be used. (It appears to be
obsolete in 3.0.)
- To flush the content to disk use below command.
-> bin/nodetool flush home_security

===================================================================================
=========================================================================
G) Modelling Data

===================================================================================
=========================================================================
H) Creating an application

===================================================================================
=========================================================================
