
Cosmos, Big Data GE implementation

Building your first application using FI-WARE

Open APIs for Open Minds

Big Data: what is it and how much data is there?

What is big data?

> small data

What is big data?

(Image: interior view of the Stockholm Public Library)
http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg

> big data

How much data is there?

Data growing forecast


(Chart: Cisco data growth forecast, 2012 vs. 2017)
1 zettabyte = 10^21 bytes = 1,000,000,000,000,000,000,000 bytes

                                      2012    2017
Global users (billions)                2.3     3.6
Global networked devices (billions)     12      19
Global broadband speed (Mbps)         11.3      39
Global traffic (zettabytes)            0.5     1.4

http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast

It is not only about storing big data, but about using it!

(Image: interior view of the Stockholm Public Library)
http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg

> big data

> tools

How to deal with it: the Hadoop reference

Hadoop was created by Doug Cutting at Yahoo!...
...based on the MapReduce patent by Google

Well, MapReduce was really invented by Julius Caesar:

Divide et impera*

* Divide and conquer

An example
How many pages are written in Latin among the books in the Ancient Library of Alexandria?

(Diagram, shown step by step on slides 11 to 15: a set of mappers reads the books in parallel, LATIN REF1 P45, LATIN REF4 P73, LATIN REF5 P34, GREEK REF2 P128, GREEK REF7 P20, GREEK REF8 P230, EGYPTIAN REF3 P12, EGYPTIAN REF6 P10. Every time a mapper finishes a Latin book it emits its page count to the reducer, 45 (ref 1), +73 (ref 4), +34 (ref 5), while books in other languages are discarded. When all the mappers become idle, the reducer adds up the received counts.)

152 TOTAL

Hadoop architecture

(Diagram: Hadoop architecture, showing the head node and the rest of the cluster)

FI-WARE proposal: Cosmos Big Data

What is Cosmos?

Cosmos is Telefónica's Big Data platform
  Dynamic creation of private computing clusters as a service
  Infinity, a cluster for persistent storage

Cosmos is Hadoop ecosystem-based
  HDFS as its distributed file system
  Hadoop core as its MapReduce engine
  HiveQL and Pig for querying the data
  Oozie as the launcher for remote MapReduce jobs and Hive queries

Plus other proprietary features
  Infinity protocol (secure WebHDFS)
  Cygnus, an injector for context data coming from Orion CB

Cosmos architecture


What can be done with Cosmos?

What                              | Locally (sshing into the Head Node) | Remotely (connecting your app)
Clusters operation                | Cosmos CLI                          | REST API
I/O operation                     | hadoop fs command                   | REST API (WebHDFS, HttpFS, Infinity protocol)
Querying tools (basic analysis)   | Hive CLI                            | JDBC, Thrift*
MapReduce (advanced analysis)     | hadoop jar command                  | Oozie REST API

Clusters operation: getting your own Roman legion

Using the RESTful API (1)

Using the RESTful API (2)

Using the RESTful API (3)
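The original slides illustrated these calls with screenshots. As a rough sketch only (the base URL, the resource paths and the JSON fields below are placeholders, not the documented Cosmos endpoints), driving cluster management from the command line would look something like this:

# Hypothetical sketch: <COSMOS_API>, the /cosmos/v1/... paths and the JSON body
# stand in for the real, documented Cosmos REST API
$ curl -X POST "https://<COSMOS_API>/cosmos/v1/clusters" \
       -H "Content-Type: application/json" \
       -d '{"name": "mylegion", "size": 2}'
$ curl -X GET "https://<COSMOS_API>/cosmos/v1/clusters"
$ curl -X DELETE "https://<COSMOS_API>/cosmos/v1/clusters/<CLUSTER_ID>"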

Using the Python CLI

Creating a cluster
$ cosmos create --name <STRING> --size <INT>

Listing all the clusters
$ cosmos list

Showing a cluster's details
$ cosmos show <CLUSTER_ID>

Connecting to the Head Node of a cluster
$ cosmos ssh <CLUSTER_ID>

Terminating a cluster
$ cosmos terminate <CLUSTER_ID>

Listing available services
$ cosmos list-services

Creating a cluster with specific services
$ cosmos create --name <STRING> --size <INT> --services <SERVICES_LIST>
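As a usage illustration (the cluster name and size are made up for the example; the cluster ID is the one returned by the create command):

$ cosmos list-services
$ cosmos create --name mylegion --size 2 --services <SERVICES_LIST>
$ cosmos list
$ cosmos ssh <CLUSTER_ID>
$ cosmos terminate <CLUSTER_ID>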

How to exploit the data: commanding your Roman legion

1. Hadoop filesystem commands

Hadoop general command
$ hadoop

Hadoop file system subcommand
$ hadoop fs

Hadoop file system options
$ hadoop fs -ls
$ hadoop fs -mkdir <hdfs-dir>
$ hadoop fs -rmr <hdfs-file>
$ hadoop fs -cat <hdfs-file>
$ hadoop fs -put <local-file> <hdfs-dir>
$ hadoop fs -get <hdfs-file> <local-dir>

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html
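For instance (the file and directory names are only illustrative), uploading a local log file to HDFS and reading it back:

$ hadoop fs -mkdir /user/myuser/logs
$ hadoop fs -put access.log /user/myuser/logs
$ hadoop fs -ls /user/myuser/logs
$ hadoop fs -cat /user/myuser/logs/access.log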

2. WebHDFS/HttpFS REST API

List a directory
GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS

Create a new directory
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS[&permission=<OCTAL>]

Delete a file or directory
DELETE http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=DELETE[&recursive=<true|false>]

Rename a file or directory
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=RENAME&destination=<PATH>

Concat files
POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CONCAT&sources=<PATHS>

Set permission
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETPERMISSION[&permission=<OCTAL>]

Set owner
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETOWNER[&owner=<USER>][&group=<GROUP>]

2. WebHDFS/HttpFS REST API (cont.)

Create a new file with initial content (2-step operation)
PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE[&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>][&permission=<OCTAL>][&buffersize=<INT>]
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
Content-Length: 0
PUT -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...

Append to a file (2-step operation)
POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=APPEND[&buffersize=<INT>]
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
Content-Length: 0
POST -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
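As an illustration of the 2-step dance with curl (the host, path and user below are placeholders; on a default deployment the NameNode WebHDFS port is usually 50070, and the user.name parameter selects the HDFS user under pseudo authentication):

# step 1: the Namenode answers 307 with the Datanode URL in the Location header
$ curl -i -X PUT "http://<HOST>:50070/webhdfs/v1/user/myuser/data/newfile.txt?op=CREATE&user.name=myuser"
# step 2: upload the content to the URL returned in the Location header
$ curl -i -X PUT -T localfile.txt "http://<DATANODE>:<PORT>/webhdfs/v1/user/myuser/data/newfile.txt?op=CREATE&user.name=myuser"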

2. WebHDFS/HttpFS REST API (cont.)

Open and read a file (2-step operation)
GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN[&offset=<LONG>][&length=<LONG>][&buffersize=<INT>]
HTTP/1.1 307 TEMPORARY_REDIRECT
Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...
Content-Length: 0
GET http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

HttpFS does not redirect to the Datanode but to the HttpFS server, hiding the Datanodes (and saving tens of public IP addresses)
The API is the same
http://hadoop.apache.org/docs/current/hadoop-hdfs-httpfs/index.html
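Reading can be done in a single curl command because the -L flag follows the 307 redirect automatically (path and user are placeholders again; against an HttpFS gateway the default port is usually 14000 instead):

$ curl -L "http://<HOST>:50070/webhdfs/v1/user/myuser/data/newfile.txt?op=OPEN&user.name=myuser"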

3. Local Hive CLI

Hive is a querying tool
Queries are expressed in HiveQL, a SQL-like language
https://cwiki.apache.org/confluence/display/Hive/LanguageManual

Hive uses pre-defined MapReduce jobs for
  Column selection
  Fields grouping
  Table joining

All the data is loaded into Hive tables
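Before querying, the data has to be mapped onto a table. A minimal sketch, reusing the column names of the query on the next slide (the table name, the CSV format and the HDFS location are assumptions for the example):

hive> CREATE EXTERNAL TABLE mytable (column1 STRING, column2 INT, otherColumns STRING)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/user/myuser/data';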

3. Local Hive CLI (cont.)

Log on to the Master node
Run the hive command
Type your SQL-like sentence!

$ hive
Hive history file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
hive> select column1,column2,otherColumns from mytable where column1='whatever' and column2 like '%whatever%';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201308280930_0953, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=cosmosmaster-gi:8021 -kill job_201308280930_0953
2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%

4. Remote Hive client

Hive CLI is OK for human-driven testing purposes
But it is not usable by remote applications
  Hive has no REST API

Hive has several drivers and libraries
  JDBC for Java
  Python
  PHP
  ODBC for C/C++
  Thrift for Java and C++
https://cwiki.apache.org/confluence/display/Hive/HiveClient

A remote Hive client usually performs:
  A connection to the Hive server (TCP/10000)

4. Remote Hive client: get a connection

private Connection getConnection(
    String ip, String port, String user, String password) {
  try {
    // dynamically load the Hive JDBC driver
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
  } catch (ClassNotFoundException e) {
    System.out.println(e.getMessage());
    return null;
  } // try catch

  try {
    // return a connection based on the Hive JDBC driver, default DB
    return DriverManager.getConnection("jdbc:hive://" + ip + ":" +
        port + "/default?user=" + user + "&password=" + password);
  } catch (SQLException e) {
    System.out.println(e.getMessage());
    return null;
  } // try catch
} // getConnection

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client

4. Remote Hive client: do the query

private void doQuery() {
  try {
    // from here on, everything is SQL!
    Statement stmt = con.createStatement();
    ResultSet res = stmt.executeQuery("select column1,column2," +
        "otherColumns from mytable where column1='whatever' and " +
        "column2 like '%whatever%'");

    // iterate on the result
    while (res.next()) {
      String column1 = res.getString(1);
      Integer column2 = res.getInt(2);
      // whatever you want to do with this row, here
    } // while

    // close everything
    res.close(); stmt.close(); con.close();
  } catch (SQLException ex) {
    System.exit(0);
  } // try catch
} // doQuery

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client

4. Remote Hive client: Plague Tracker demo

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/plague-tracker

5. MapReduce applications

MapReduce applications are commonly written in Java
  They can be written in other languages through Hadoop Streaming

They are executed in the command line
$ hadoop jar <jar-file> <main-class> <input-dir> <output-dir>

A MapReduce job consists of:
  A driver, a piece of software where to define inputs, outputs, formats, etc. and the entry point for launching the job
  A set of Mappers, given by a piece of software defining their behaviour
  A set of Reducers, given by a piece of software defining their behaviour

There are 2 APIs
  org.apache.hadoop.mapred (the old one)
  org.apache.hadoop.mapreduce (the new one)

Hadoop is distributed with MapReduce examples
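For instance, the bundled examples can be run directly; the jar location and the input/output HDFS paths below depend on the installation, so treat them as placeholders:

$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/myuser/input /user/myuser/output
$ hadoop fs -cat /user/myuser/output/part-00000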

5. MapReduce applications: Map

/* org.apache.hadoop.mapred example */
public static class MapClass extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    /* use the input value; the input key is the offset within the
       file and it is not necessary in this example */
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);

    /* iterate on the string, getting each word */
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      /* emit an output (key,value) pair based on the word and 1 */
      output.collect(word, one);
    } // while
  } // map
} // MapClass

5. MapReduce applications: Reduce

/* org.apache.hadoop.mapred example */
public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;

    /* iterate on all the values and add them */
    while (values.hasNext()) {
      sum += values.next().get();
    } // while

    /* emit an output (key,value) pair based on the word and its count */
    output.collect(key, new IntWritable(sum));
  } // reduce
} // ReduceClass

5. MapReduce applications: Driver

/* org.apache.hadoop.mapred example */
package my.org;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class);
    conf.setReducerClass(ReduceClass.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  } // main
} // WordCount

6. Launching tasks with Oozie

Oozie is a workflow scheduler system to manage Hadoop jobs
  Java map-reduce
  Pig and Hive
  Sqoop
  System-specific jobs (such as Java programs and shell scripts)

Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.

Writing Oozie applications is about including in a package
  The MapReduce jobs, Hive/Pig scripts, etc. (executable code)
  A Workflow
  Parameters for the Workflow

Oozie can be used locally or remotely

https://oozie.apache.org/docs/4.0.0/index.html#Developer_Documentation
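Used locally (from the Head Node), a job is typically submitted with the oozie command-line client. The host and HDFS paths below are illustrative, but the property names (nameNode, jobTracker, oozie.wf.application.path) are the standard ones, matching what the Java client on the next slide sets programmatically:

# job.properties (oozie.wf.application.path points to the HDFS folder holding workflow.xml)
nameNode=hdfs://cosmosmaster-gi:8020
jobTracker=cosmosmaster-gi:8021
queueName=default
oozie.wf.application.path=${nameNode}/user/frb/mrjobs

# submit and start the workflow, then poll its status with the returned job id
$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run
$ oozie job -oozie http://localhost:11000/oozie -info <JOB_ID>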

6. Launching tasks with Oozie: Java client

OozieClient client = new OozieClient("http://130.206.80.46:11000/oozie/");

// create a workflow job configuration and set the workflow application path
Properties conf = client.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://cosmosmaster-gi:8020/user/frb/mrjobs");
conf.setProperty("nameNode", "hdfs://cosmosmaster-gi:8020");
conf.setProperty("jobTracker", "cosmosmaster-gi:8021");
conf.setProperty("outputDir", "output");
conf.setProperty("inputDir", "input");
conf.setProperty("examplesRoot", "mrjobs");
conf.setProperty("queueName", "default");

// submit and start the workflow job
String jobId = client.run(conf);

// wait until the workflow job finishes, printing the status every 10 seconds
while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
    System.out.println("Workflow job running ...");
    Thread.sleep(10 * 1000);
} // while

System.out.println("Workflow job completed");

Useful references

Hive resources:
  HiveQL language: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
  How to create Hive clients: https://cwiki.apache.org/confluence/display/Hive/HiveClient
  Hive client example: https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client
  Plague Tracker demo: https://github.com/telefonicaid/fiware-livedemoapp/tree/master/cosmos/plague-tracker
  Plague Tracker instance: http://130.206.81.65/plague-tracker/

Hadoop filesystem commands:
  http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html

WebHDFS and HttpFS REST APIs:
  http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
  http://hadoop.apache.org/docs/current/hadoop-hdfs-httpfs/index.html

Cosmos' place in FIWARE: typical scenarios

General IoT platform

(Architecture diagram; the note on the slide reads: "You don't have to use them all!")
Components shown: RULES DEFINITION, OPERATIONAL DASHBOARD, REAL TIME PROCESSING (CEP), DATA QUERYING (SUBS), GIS, BI/ETL, SHORT TERM HISTORIC, DATA PROCESSING, COSMOS (BIG DATA), OPEN DATA (CKAN), CONTEXT BROKER with Context Adapters, SENSOR 2 THINGS, IoT Backend / Device Management (measures / commands), PORTALS (IoT/Sensor, Open Data, City Services), Service Orchestration, IDM & Auth, KPI GOVERNANCE, Accounting & Payment & Billing.

Real time context data persistence (architecture)

https://forge.fi-ware.eu/plugins/mediawiki/wiki/fiware/index.php/How_to_persist_Orion_data_in_Cosmos
https://github.com/telefonicaid/fiware-connectors/tree/develop/flume

Real time context data persistence (detail)


Real time context data persistence (examples)

Information coming from city sensors
  Presence map gradients, agglomerations
  Services usage distributions, top users (if available), top POIs, unused resources

Information generated by smartphones
  Geolocation routes, map gradients, agglomerations
  Issues reporting: top neighbourhoods in incidents, criminality, noises, garbage, plagues

Any other real time information
  Depending on your app, this could be product likes, product consumption, user-2-user...

Roadmap: more functionalities and integrations

Roadmap

Integrate the clusters creation with the cloud portal
  No more REST API work
Streaming analysis capabilities
  Not all analyses can wait for batch processing
Geolocation analysis capabilities
  An important source of data nowadays
Integrate with CKAN
  As a source of batch data
Integrate with the Marketplace
  Selling datasets
  Selling analysis results

fiware-lab-help@lists.fi-ware.org
francisco.romerobueno@telefonica.com

Thanks!

http://fi-ppp.eu
http://fi-ware.eu
Follow @Fiware on Twitter!
