
Kafka
-----
1. General-purpose publish-subscribe messaging system
2. Easy to add more consumers without downtime
3. Subscribers are responsible for pulling data and for maintaining their own
offset pointer
4. Provides fault tolerance
5. Writes from producers to brokers and reads from brokers by consumers can happen
at their own pace
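
The pull model in point 3 can be sketched in plain Python. This is a hypothetical in-memory illustration (the `Broker` and `Consumer` classes below are made up, not the real Kafka client API): the broker only appends to a log, and each consumer pulls at its own pace while keeping its own offset pointer.

```python
# Hypothetical sketch of Kafka's pull model, not the real protocol:
# the broker keeps an append-only log; consumers track their own offsets.

class Broker:
    def __init__(self):
        self.log = []                            # append-only partition log

    def append(self, record):                    # producer write
        self.log.append(record)

    def fetch(self, offset, max_records=10):     # consumer pull
        return self.log[offset:offset + max_records]

class Consumer:
    def __init__(self, broker):
        self.broker = broker
        self.offset = 0                          # consumer-maintained pointer

    def poll(self):
        records = self.broker.fetch(self.offset)
        self.offset += len(records)              # advance the offset pointer
        return records

broker = Broker()
for i in range(5):
    broker.append(f"event-{i}")

c = Consumer(broker)
first = c.poll()    # pulls the 5 available events
again = c.poll()    # nothing new; offset is already at the end of the log
```

Because the offset lives with the consumer, a slow consumer simply falls behind in the log instead of blocking the producer, which is why writes and reads can proceed at independent paces (point 5).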

Flume
-----
1. Distributed, reliable system for collecting, aggregating, and moving large
amounts of data to a centralized datastore such as HDFS
2. Supports many built-in sources and sinks out of the box
3. Flume pushes data into the sink, so consumers do not have to maintain an offset
4. Events are lost if the agent goes down
5. Because Flume pushes data to the sink, writes to the sink can overwhelm reads
from the sink
6. Tightly integrated with Hadoop
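
A Flume agent is wired together in a properties file: a source feeds a channel, and a sink drains it. The sketch below uses hypothetical component names (a1, r1, c1, k1) and a placeholder HDFS path; it shows the push path into an HDFS sink described in points 3 and 5.

```properties
# Hypothetical agent "a1": netcat source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

Note that the memory channel here is also why point 4 holds: events buffered in memory are lost if the agent process dies.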

Spark MLlib
-----------
MLlib is Spark's machine learning (ML) library.

1. ML Algorithms: common learning algorithms such as classification, regression,
clustering, and collaborative filtering
2. Featurization: feature extraction, transformation, dimensionality reduction, and
selection
3. Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
4. Persistence: saving and loading algorithms, models, and Pipelines
5. Utilities: linear algebra, statistics, data handling, etc.
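
The Pipeline idea in point 3, a chain of feature transformers followed by an estimator, can be sketched in plain Python. The minimal `Scaler`, `MeanModel`, and `Pipeline` classes below are hypothetical stand-ins for illustration; the real API lives in `pyspark.ml`.

```python
# Hypothetical minimal sketch of the fit/transform pipeline pattern
# that MLlib Pipelines follow; not the actual pyspark.ml API.

class Scaler:
    """Transformer: rescales values to [0, 1] using the fitted min/max."""
    def fit(self, xs):
        self.lo, self.hi = min(xs), max(xs)
        return self

    def transform(self, xs):
        span = (self.hi - self.lo) or 1.0
        return [(x - self.lo) / span for x in xs]

class MeanModel:
    """Estimator: 'trains' by memorizing the mean of the scaled features."""
    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, xs):
        for stage in self.stages[:-1]:      # fit and apply each transformer
            xs = stage.fit(xs).transform(xs)
        return self.stages[-1].fit(xs)      # fit the final estimator

model = Pipeline([Scaler(), MeanModel()]).fit([2.0, 4.0, 6.0])
# model.mean holds the mean of the min-max scaled values
```

The value of the pattern is that featurization (point 2) and training (point 1) are tuned and persisted (point 4) as one unit rather than as loose steps.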

Data Hub
--------
SAP Data Hub enables you to perform data integration, orchestrate the movement of
data, and provide governance capabilities for data across a complex and diverse
data landscape.

You can also use Big Data processing to create uniquely powerful data pipelines
based on the serverless computing paradigm.

Existing data and processes can be managed, shared, and distributed across the
enterprise with seamless, unified, and enterprise-ready monitoring and landscape
management capabilities.

SAP Vora
--------
SAP Vora is a distributed database system for big data processing. SAP Vora can run
on a cluster of commodity hardware compute nodes and is built to scale with the
size of the data by scaling up the compute cluster.

SAP Vora Engine Architecture
----------------------------
SAP Vora comes with support for various data types, such as relational data, graph
data, collections of JSON (JavaScript Object Notation) documents, and time series.
Each of these data types is managed by a specialized engine, which has tailored
internal data structures and algorithms to natively support and efficiently process
that type of data.

SAP Vora can load and index data from external distributed data stores, such as
HDFS and WebHDFS, Azure Data Lake (ADL), Microsoft Windows Azure Storage Blob
(WASB), and Amazon S3. The data is either kept in main memory for fast processing,
or, in the case of the relational disk engine, it is indexed and stored on the hard
disks which are locally attached to the compute nodes. Data loaded to SAP Vora can
be partitioned by user-defined partitioning schemes, such as range, block, or hash
partitioning. SAP Vora contains a distributed query processor, which can evaluate
queries on the partitioned data. Metadata (that is, table schemas, partition
schemes, and so on) is stored in SAP Vora's own catalog, which persists the catalog
entries using SAP Vora's Distributed Log (DLog) infrastructure.
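
The user-defined partitioning schemes named above (range, block, and hash) can be illustrated with a small sketch in Python. The helper functions are hypothetical, for illustration only, and are not SAP Vora DDL; each one maps a row to one of n partitions.

```python
# Hypothetical illustrations of the three partitioning schemes named above.

def hash_partition(key, n):
    # Rows with equal keys hash to the same partition within a run.
    return hash(key) % n

def range_partition(key, boundaries):
    # boundaries are sorted upper bounds, e.g. [100, 200] -> 3 partitions:
    # (-inf, 100), [100, 200), [200, +inf)
    for i, bound in enumerate(boundaries):
        if key < bound:
            return i
    return len(boundaries)

def block_partition(row_index, block_size, n):
    # Contiguous blocks of rows assigned round-robin to partitions.
    return (row_index // block_size) % n

parts = [range_partition(k, [100, 200]) for k in (50, 150, 250)]
# parts maps each key to its range partition: 0, 1, and 2 respectively
```

Partition pruning is what makes this matter for the distributed query processor: a range query only has to visit the partitions whose bounds overlap the predicate.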

Apache Spark Integration
------------------------
SAP Vora is also accessible through Spark SQL (with Spark 2.1 or higher) by
implementing the Spark data source API and a public SAP Vora/HANA client. The Spark
2 integration emphasizes the separation between the SAP Vora database commands and
Spark commands for a clearer and more intuitive usage of the SAP Vora
functionality.

Contextualize the Technical Architecture specific to Kao, including the various
systems and applications
Include provision for IoT in the technical architecture diagram
Recommend specific tools and solutions (for example which reporting tools to use)
Propose a BI Analytics roadmap that will take them until 2020
They want the data lake to be in AWS

Give a BI Analytics and roadmap plan for the next 2 years that Tanay can present
to management to secure budget
Propose design thinking workshops with different groups to come up with use cases

Business Units have their own IT in silos

Kao is buying duplicate data (syndicated data from Nielsen and POS data from
retailers)