
Cosmos, Big Data GE implementation

Building your first application using FI-WARE

Open APIs for Open Minds

Big Data: what is it and how much data is there?

What is big data?

> small data

What is big data?

(Image: interior view of the Stockholm Public Library)
http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg

> big data

How much data is there?

Data growing forecast


(Chart: Cisco data growth forecast, 2012 vs. 2017)
1 zettabyte = 10^21 bytes = 1,000,000,000,000,000,000,000 bytes

                                      2012    2017
Global users (billions)                2.3     3.6
Global networked devices (billions)     12      19
Global broadband speed (Mbps)         11.3      39
Global traffic (zettabytes)            0.5     1.4

http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast

It is not only about storing big data, but about using it!

(Image: interior view of the Stockholm Public Library)
http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg

> big data

> tools

How to deal with it: the Hadoop reference

Hadoop was created by Doug Cutting at Yahoo!...
...based on the MapReduce patent by Google

Well, MapReduce was really invented by Julius Caesar:

Divide et impera*

* Divide and conquer

An example
How many pages are written in Latin among the books in the Ancient Library of Alexandria?

(Diagram, shown step by step on slides 11 to 15: a set of mappers reads the books in parallel, LATIN REF1 P45, LATIN REF4 P73, LATIN REF5 P34, GREEK REF2 P128, GREEK REF7 P20, GREEK REF8 P230, EGYPTIAN REF3 P12, EGYPTIAN REF6 P10. Every time a mapper finishes a Latin book it emits its page count to the reducer, 45 (ref 1), +73 (ref 4), +34 (ref 5), while books in other languages are discarded. When all the mappers become idle, the reducer adds up the received counts.)

152 TOTAL

Hadoop architecture

(Diagram: Hadoop architecture, showing the head node and the rest of the cluster)

FI-WARE proposal: Cosmos Big Data

What is Cosmos?

Cosmos is Telefónica's Big Data platform
  Dynamic creation of private computing clusters as a service
  Infinity, a cluster for persistent storage

Cosmos is Hadoop ecosystem-based
  HDFS as its distributed file system
  Hadoop core as its MapReduce engine
  HiveQL and Pig for querying the data
  Oozie as the launcher for remote MapReduce jobs and Hive queries

Plus other proprietary features
  Infinity protocol (secure WebHDFS)
  Cygnus, an injector for context data coming from Orion CB

Cosmos architecture


What can be done with Cosmos?

What                              | Locally (sshing into the Head Node) | Remotely (connecting your app)
Clusters operation                | Cosmos CLI                          | REST API
I/O operation                     | hadoop fs command                   | REST API (WebHDFS, HttpFS, Infinity protocol)
Querying tools (basic analysis)   | Hive CLI                            | JDBC, Thrift*
MapReduce (advanced analysis)     | hadoop jar command                  | Oozie REST API

Clusters operation: getting your own Roman legion

Using the RESTful API (1)

Using the RESTful API (2)

Using the RESTful API (3)
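The original slides illustrated these calls with screenshots. As a rough sketch only (the base URL, the resource paths and the JSON fields below are placeholders, not the documented Cosmos endpoints), driving cluster management from the command line would look something like this:

# Hypothetical sketch: <COSMOS_API>, the /cosmos/v1/... paths and the JSON body
# stand in for the real, documented Cosmos REST API
$ curl -X POST "https://<COSMOS_API>/cosmos/v1/clusters" \
       -H "Content-Type: application/json" \
       -d '{"name": "mylegion", "size": 2}'
$ curl -X GET "https://<COSMOS_API>/cosmos/v1/clusters"
$ curl -X DELETE "https://<COSMOS_API>/cosmos/v1/clusters/<CLUSTER_ID>"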

Using the Python CLI

Creating a cluster
$ cosmos create --name <STRING> --size <INT>

Listing all the clusters
$ cosmos list

Showing a cluster's details
$ cosmos show <CLUSTER_ID>

Connecting to the Head Node of a cluster
$ cosmos ssh <CLUSTER_ID>

Terminating a cluster
$ cosmos terminate <CLUSTER_ID>

Listing available services
$ cosmos list-services

Creating a cluster with specific services
$ cosmos create --name <STRING> --size <INT> --services <SERVICES_LIST>
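As a usage illustration (the cluster name and size are made up for the example; the cluster ID is the one returned by the create command):

$ cosmos list-services
$ cosmos create --name mylegion --size 2 --services <SERVICES_LIST>
$ cosmos list
$ cosmos ssh <CLUSTER_ID>
$ cosmos terminate <CLUSTER_ID>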

How to exploit the data: commanding your Roman legion

1. Hadoop filesystem commands

Hadoop general command
$ hadoop

Hadoop file system subcommand
$ hadoop fs

Hadoop file system options
$ hadoop fs -ls
$ hadoop fs -mkdir <hdfs-dir>
$ hadoop fs -rmr <hdfs-file>
$ hadoop fs -cat <hdfs-file>
$ hadoop fs -put <local-file> <hdfs-dir>
$ hadoop fs -get <hdfs-file> <local-dir>

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html
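For instance (the file and directory names are only illustrative), uploading a local log file to HDFS and reading it back:

$ hadoop fs -mkdir /user/myuser/logs
$ hadoop fs -put access.log /user/myuser/logs
$ hadoop fs -ls /user/myuser/logs
$ hadoop fs -cat /user/myuser/logs/access.log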

2. WebHDFS/HttpFS REST API

List a directory
GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS

Create a new directory
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS[&permission=<OCTAL>]

Delete a file or directory
DELETE http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=DELETE[&recursive=<true|false>]

Rename a file or directory
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=RENAME&destination=<PATH>

Concat files
POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CONCAT&sources=<PATHS>

Set permission
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETPERMISSION[&permission=<OCTAL>]

Set owner
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETOWNER[&owner=<USER>][&group=<GROUP>]

2. WebHDFS/HttpFS REST API (cont.)

Create a new file with initial content (2-step operation)
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE[&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>][&permission=<OCTAL>][&buffersize=<INT>]
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0
PUT -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...

Append to a file (2-step operation)
POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=APPEND[&buffersize=<INT>]
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
Content-Length: 0
POST -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
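As an illustration of the 2-step dance with curl (the host, path and user below are placeholders; on a default deployment the NameNode WebHDFS port is usually 50070, and the user.name parameter selects the HDFS user under pseudo authentication):

# step 1: the Namenode answers 307 with the Datanode URL in the Location header
$ curl -i -X PUT "http://<HOST>:50070/webhdfs/v1/user/myuser/data/newfile.txt?op=CREATE&user.name=myuser"
# step 2: upload the content to the URL returned in the Location header
$ curl -i -X PUT -T localfile.txt "http://<DATANODE>:<PORT>/webhdfs/v1/user/myuser/data/newfile.txt?op=CREATE&user.name=myuser"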

2. WebHDFS/HttpFS REST API (cont.)

Open and read a file (2-step operation)
GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN[&offset=<LONG>][&length=<LONG>][&buffersize=<INT>]
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...
Content-Length: 0
GET http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

HttpFS does not redirect to the Datanode but to the HttpFS server, hiding the Datanodes (and saving tens of public IP addresses)
The API is the same
http://hadoop.apache.org/docs/current/hadoop-hdfs-httpfs/index.html
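Reading can be done in a single curl command because the -L flag follows the 307 redirect automatically (path and user are placeholders again; against an HttpFS gateway the default port is usually 14000 instead):

$ curl -L "http://<HOST>:50070/webhdfs/v1/user/myuser/data/newfile.txt?op=OPEN&user.name=myuser"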

3. Local Hive CLI

Hive is a querying tool
Queries are expressed in HiveQL, a SQL-like language
https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Hive uses pre-defined MapReduce jobs for
  Column selection
  Fields grouping
  Table joining

All the data is loaded into Hive tables
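Before querying, the data has to be mapped onto a table. A minimal sketch, reusing the column names of the query on the next slide (the table name, the CSV format and the HDFS location are assumptions for the example):

hive> CREATE EXTERNAL TABLE mytable (column1 STRING, column2 INT, otherColumns STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/user/myuser/data';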

3. Local Hive CLI (cont.)

Log on to the Master node
Run the hive command
Type your SQL-like sentence!

$ hive
Hive history file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
hive> select column1,column2,otherColumns from mytable where column1='whatever' and column2 like '%whatever%';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201308280930_0953, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=cosmosmaster-gi:8021 -kill job_201308280930_0953
2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%

4. Remote Hive client

Hive CLI is OK for human-driven testing purposes
But it is not usable by remote applications
  Hive has no REST API

Hive has several drivers and libraries
  JDBC for Java
  Python
  PHP
  ODBC for C/C++
  Thrift for Java and C++
https://cwiki.apache.org/confluence/display/Hive/HiveClient

A remote Hive client usually performs:
  A connection to the Hive server (TCP/10000)

4. Remote Hive client: get a connection

private Connection getConnection(
    String ip, String port, String user, String password) {
  try {
    // dynamically load the Hive JDBC driver
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
  } catch (ClassNotFoundException e) {
    System.out.println(e.getMessage());
    return null;
  } // try catch

  try {
    // return a connection based on the Hive JDBC driver, default DB
    return DriverManager.getConnection("jdbc:hive://" + ip + ":" +
        port + "/default?user=" + user + "&password=" + password);
  } catch (SQLException e) {
    System.out.println(e.getMessage());
    return null;
  } // try catch
} // getConnection

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client

4. Remote Hive client: do the query

private void doQuery() {
  try {
    // from here on, everything is SQL!
    Statement stmt = con.createStatement();
    ResultSet res = stmt.executeQuery("select column1,column2," +
        "otherColumns from mytable where column1='whatever' and " +
        "column2 like '%whatever%'");

    // iterate on the result
    while (res.next()) {
      String column1 = res.getString(1);
      Integer column2 = res.getInt(2);
      // whatever you want to do with this row, here
    } // while

    // close everything
    res.close(); stmt.close(); con.close();
  } catch (SQLException ex) {
    System.exit(0);
  } // try catch
} // doQuery

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client

4. Remote Hive client: Plague Tracker demo

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/plague-tracker

5. MapReduce applications

MapReduce applications are commonly written in Java
  They can be written in other languages through Hadoop Streaming

They are executed in the command line
$ hadoop jar <jar-file> <main-class> <input-dir> <output-dir>

A MapReduce job consists of:
  A driver, a piece of software where to define inputs, outputs, formats, etc. and the entry point for launching the job
  A set of Mappers, given by a piece of software defining their behaviour
  A set of Reducers, given by a piece of software defining their behaviour

There are 2 APIs
  org.apache.hadoop.mapred (the old one)
  org.apache.hadoop.mapreduce (the new one)

Hadoop is distributed with MapReduce examples
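For instance, the bundled examples can be run directly; the jar location and the input/output HDFS paths below depend on the installation, so treat them as placeholders:

$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/myuser/input /user/myuser/output
$ hadoop fs -cat /user/myuser/output/part-00000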

5. MapReduce applications: Map

/* org.apache.hadoop.mapred example */
public static class MapClass extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    /* use the input value; the input key is the offset within the
       file and it is not necessary in this example */
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);

    /* iterate on the string, getting each word */
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      /* emit an output (key,value) pair based on the word and 1 */
      output.collect(word, one);
    } // while
  } // map
} // MapClass

5. MapReduce applications: Reduce

/* org.apache.hadoop.mapred example */
public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;

    /* iterate on all the values and add them */
    while (values.hasNext()) {
      sum += values.next().get();
    } // while

    /* emit an output (key,value) pair based on the word and its count */
    output.collect(key, new IntWritable(sum));
  } // reduce
} // ReduceClass

5. MapReduce applications: Driver

/* org.apache.hadoop.mapred example */
package my.org;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class);
    conf.setReducerClass(ReduceClass.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  } // main
} // WordCount

6. Launching tasks with Oozie

Oozie is a workflow scheduler system to manage Hadoop jobs
  Java map-reduce
  Pig and Hive
  Sqoop
  System-specific jobs (such as Java programs and shell scripts)

Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.

Writing Oozie applications is about including in a package
  The MapReduce jobs, Hive/Pig scripts, etc. (executable code)
  A Workflow
  Parameters for the Workflow

Oozie can be used locally or remotely

https://oozie.apache.org/docs/4.0.0/index.html#Developer_Documentation
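Used locally (from the Head Node), a job is typically submitted with the oozie command-line client. The host and HDFS paths below are illustrative, but the property names (nameNode, jobTracker, oozie.wf.application.path) are the standard ones, matching what the Java client on the next slide sets programmatically:

# job.properties (oozie.wf.application.path points to the HDFS folder holding workflow.xml)
nameNode=hdfs://cosmosmaster-gi:8020
jobTracker=cosmosmaster-gi:8021
queueName=default
oozie.wf.application.path=${nameNode}/user/frb/mrjobs

# submit and start the workflow, then poll its status with the returned job id
$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
$ oozie job -oozie http://localhost:11000/oozie -info <JOB_ID>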

6. Launching tasks with Oozie: Java client

OozieClient client = new OozieClient("http://130.206.80.46:11000/oozie/");

// create a workflow job configuration and set the workflow application path
Properties conf = client.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://cosmosmaster-gi:8020/user/frb/mrjobs");
conf.setProperty("nameNode", "hdfs://cosmosmaster-gi:8020");
conf.setProperty("jobTracker", "cosmosmaster-gi:8021");
conf.setProperty("outputDir", "output");
conf.setProperty("inputDir", "input");
conf.setProperty("examplesRoot", "mrjobs");
conf.setProperty("queueName", "default");

// submit and start the workflow job
String jobId = client.run(conf);

// wait until the workflow job finishes, printing the status every 10 seconds
while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
    System.out.println("Workflow job running ...");
    Thread.sleep(10 * 1000);
} // while

System.out.println("Workflow job completed");

Useful references

Hive resources:
  HiveQL language: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
  How to create Hive clients: https://cwiki.apache.org/confluence/display/Hive/HiveClient
  Hive client example: https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client
  Plague Tracker demo: https://github.com/telefonicaid/fiware-livedemoapp/tree/master/cosmos/plague-tracker
  Plague Tracker instance: http://130.206.81.65/plague-tracker/

Hadoop filesystem commands:
  http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html

WebHDFS and HttpFS REST APIs:
  http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
  http://hadoop.apache.org/docs/current/hadoop-hdfs-httpfs/index.html

Cosmos' place in FIWARE: typical scenarios

General IoT platform

(Architecture diagram; the note on the slide reads: "You don't have to use them all!")
Components shown: RULES DEFINITION, OPERATIONAL DASHBOARD, REAL TIME PROCESSING (CEP), DATA QUERYING (SUBS), GIS, BI/ETL, SHORT TERM HISTORIC, DATA PROCESSING, COSMOS (BIG DATA), OPEN DATA (CKAN), CONTEXT BROKER with Context Adapters, SENSOR 2 THINGS, IoT Backend / Device Management (measures / commands), PORTALS (IoT/Sensor, Open Data, City Services), Service Orchestration, IDM & Auth, KPI GOVERNANCE, Accounting & Payment & Billing.

Real time context data persistence (architecture)

https://forge.fi-ware.eu/plugins/mediawiki/wiki/fiware/index.php/How_to_persist_Orion_data_in_Cosmos
https://github.com/telefonicaid/fiware-connectors/tree/develop/flume

Real time context data persistence (detail)


Real time context data persistence (examples)

Information coming from city sensors
  Presence map gradients, agglomerations
  Services usage distributions, top users (if available), top POIs, unused resources

Information generated by smartphones
  Geolocation routes, map gradients, agglomerations
  Issues reporting: top neighbourhoods in incidents, criminality, noises, garbage, plagues

Any other real time information
  Depending on your app, this could be product likes, product consumption, user-2-user...

Roadmap: more functionalities and integrations

Roadmap

Integrate the clusters creation with the cloud portal
  No more REST API work
Streaming analysis capabilities
  Not all analyses can wait for batch processing
Geolocation analysis capabilities
  An important source of data nowadays
Integrate with CKAN
  As a source of batch data
Integrate with the Marketplace
  Selling datasets
  Selling analysis results

fiware-lab-help@lists.fi-ware.org
francisco.romerobueno@telefonica.com

Thanks!

http://fi-ppp.eu
http://fi-ware.eu
Follow @Fiware on Twitter!
