
Next-generation Python Big Data Tools,
powered by Apache Arrow
Wes McKinney @wesmckinn
SF Big Analytics Meetup, 2016-04-05
Cloudera, Inc. All rights reserved.

Me
Data Science Tools at Cloudera, formerly DataPad CEO/founder
Serial creator of structured data tools / user interfaces
Wrote bestseller Python for Data Analysis 2012
Open source projects
Python {pandas, Ibis, statsmodels}
Apache {Arrow, Parquet, Kudu (incubating)}
Mostly work in Python and Cython/C/C++


In process:
Python for Data Analysis: 2nd Edition
Coming late 2016 / early 2017


Python + Big Data: The State of things


See Python and Apache Hadoop: A State of the Union from February 17
Areas where much more work needed
Binary file format read/write support (e.g. Parquet files)
File system libraries (HDFS, S3, etc.)
Client drivers (Spark, Hive, Impala, Kudu)
Compute system integration (Spark, Impala, etc.)


Apache Arrow
Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow


Arrow in a Slide
New Top-level Apache Software Foundation project
Announced Feb 17, 2016
Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best of breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is

Developers from 13+ major open source projects involved

A significant % of the world's data will be processed through Arrow!

Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R


Apache Arrow: What is it?


http://arrow.apache.org
Not a piece of software, exactly!
A standardized in-memory representation for columnar data
Enables
Suitable for implementing high-performance analytics in-memory (think like pandas internals)
Cheap data interchange amongst systems, little or no serialization
Flexible support for complex JSON-like data
Targets: Impala, Kudu, Parquet, Spark
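
To make the "complex JSON-like data" point concrete, here is a minimal sketch of how a column of variable-length lists can be stored in flat buffers: one offsets buffer plus one contiguous values buffer, which is the general idea behind Arrow's list layout. The function names are hypothetical, the sketch uses only the Python standard library rather than any Arrow library, and it omits null handling (Arrow tracks nulls in a separate validity bitmap).

```python
from array import array


def encode_list_column(rows):
    """Flatten a column of integer lists into (offsets, values) buffers."""
    offsets = array("i", [0])  # C int offsets; row i spans offsets[i]:offsets[i + 1]
    values = array("q")        # all child values, stored contiguously as 64-bit ints
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return offsets, values


def get_row(offsets, values, i):
    """Look up row i with two offset reads and one slice of the values buffer."""
    return list(values[offsets[i]:offsets[i + 1]])


rows = [[1, 2, 3], [], [4, 5]]
offsets, values = encode_list_column(rows)
print(list(offsets))                # [0, 3, 3, 5]
print(list(values))                 # [1, 2, 3, 4, 5]
print(get_row(offsets, values, 2))  # [4, 5]
```

Because the nested data ends up in two flat buffers, it can be scanned or handed to another system without walking per-row Python objects.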


Focus on CPU Efficiency

Cache Locality
Super-scalar & vectorized operation
Minimal Structure Overhead
Constant value access

With minimal structure overhead
Operate directly on columnar compressed data

[Figure: a four-row table with columns session_id, timestamp, and source_ip, shown in two layouts. Traditional memory buffer: Row 1 through Row 4 stored one after another, each row holding its session_id, timestamp, and source_ip together. Arrow memory buffer: all session_id values stored contiguously, then all timestamp values, then all source_ip values.]
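
The difference between the two layouts can be sketched with the standard library alone. The records below are made-up values (not the ones from the figure), and the column names are hypothetical. The row buffer interleaves every field, so summing one field still touches all of them; the columnar version reads one contiguous buffer.

```python
import struct
from array import array

# Hypothetical records: (session_id, duration_ms, status). Values are made up.
records = [(101, 250, 200), (102, 1200, 404), (103, 75, 200), (104, 430, 500)]

# Row-oriented layout: the three fields of each record are interleaved.
ROW = struct.Struct("<qqq")
row_buffer = b"".join(ROW.pack(*r) for r in records)

# Column-oriented layout: one contiguous buffer per column.
columns = {
    "session_id": array("q", (r[0] for r in records)),
    "duration_ms": array("q", (r[1] for r in records)),
    "status": array("q", (r[2] for r in records)),
}


def sum_duration_rows(buf):
    """Decode every field of every row just to read one of them."""
    total = 0
    for _session_id, duration_ms, _status in ROW.iter_unpack(buf):
        total += duration_ms
    return total


def sum_duration_columns(cols):
    """Scan a single contiguous column; the other columns are never touched."""
    return sum(cols["duration_ms"])


print(sum_duration_rows(row_buffer))  # 1955
print(sum_duration_columns(columns))  # 1955
```

Both functions return the same answer; the columnar scan simply reads a third of the bytes here, which is the cache-locality argument from the slide in miniature.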

High Performance Sharing & Interchange


Today
[Figure: Pandas, Drill, Spark, Impala, Parquet, Cassandra, HBase, and Kudu, with a "Copy & Convert" step on each connection between systems.]
Each system has its own internal memory format
70-80% CPU wasted on serialization and deserialization
Similar functionality implemented in multiple projects

With Arrow
[Figure: the same systems (Pandas, Drill, Spark, Impala, Parquet, Cassandra, HBase, Kudu) all connected through a shared "Arrow Memory" block.]
All systems utilize the same memory format
No overhead for cross-system communication
Projects can share functionality (e.g., Parquet-to-Arrow reader)
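
As a rough single-process analogy for the two pictures above, the sketch below contrasts "copy & convert" (serialize, then deserialize into a new object) with a consumer that reads the producer's buffer directly. It uses only the standard library; `pickle` and `memoryview` stand in for a real wire format and for a shared Arrow buffer, which is an assumption made for illustration.

```python
import pickle
from array import array

column = array("d", [1.5, 2.5, 3.5, 4.5])

# "Copy & convert": serialize to bytes, then deserialize into a second object.
received_copy = pickle.loads(pickle.dumps(column))

# Shared format: the consumer gets a view over the very same memory.
shared_view = memoryview(column)

# A later write by the producer shows which consumer actually shares the data.
column[0] = 99.0
print(received_copy[0])  # 1.5  (independent copy, made before the write)
print(shared_view[0])    # 99.0 (same underlying buffer)
```

The copy costs CPU time and memory proportional to the data size; the view costs neither, which is the saving the "With Arrow" side of the slide is pointing at.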

Big Data Systems: Poor Python IO performance

http://wesmckinney.com/blog/pandas-and-apache-arrow/

Real World Example: Feather File Format for Python and R

Problem: fast, language-agnostic binary data frame file format
Written by Wes McKinney (Python) and Hadley Wickham (R)
Read speeds close to disk IO performance

[Figure: layout of a Feather file. Arrow array 0, Arrow array 1, ..., Arrow array n are stored as Apache Arrow memory, followed by the Feather metadata, which is encoded with Google flatbuffers.]


Real World Example: Feather File Format for Python and R

R
library(feather)

path <- "my_data.feather"
write_feather(df, path)

df <- read_feather(path)

Python
import feather

path = 'my_data.feather'

feather.write_dataframe(df, path)
df = feather.read_dataframe(path)


Apache Parquet: Binary columnar storage format


I just became a Parquet committer!
github.com/apache/parquet-cpp
Python users will soon be able to read Parquet files via PyArrow
parquet-cpp <-> PyArrow <-> pandas


Language Bindings
Target Languages
Java (beta)
CPP (underway)
Python & Pandas (underway)
R
Julia
Initial Focus
Read a structure
Write a structure
Manage Memory

pandas and Arrow in context


RPC & IPC: Moving Data Between Systems


RPC
Avoid Serialization & Deserialization
Layer TBD: Focused on supporting vectored IO
Scatter/gather reads/writes against socket
IPC
Alpha implementation using memory mapped files
Moving data between Python and Drill
Working on shared allocation approach
Shared reference counting and well-defined ownership semantics
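
The memory-mapped-file idea can be sketched with the standard library. This is not the Arrow IPC implementation: the raw native-endian int64 layout and the function names are assumptions made for illustration. A producer writes a column as raw bytes, and a consumer maps the file and reads it as typed values without a parsing or deserialization step.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Producer: write a column of int64 values as raw native-endian bytes."""
    with open(path, "wb") as f:
        f.write(array("q", values).tobytes())


def sum_column(path):
    """Consumer: map the file and read it as int64 without deserializing."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The views are released (in reverse order) before the map closes.
            with memoryview(mm) as raw, raw.cast("q") as view:
                return sum(view)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.bin")
    write_column(path, [10, 20, 30])
    total = sum_column(path)

print(total)  # 60
```

The consumer never builds an intermediate copy of the column; the operating system pages the file in on demand, and two processes mapping the same file would see the same physical pages.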

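Executing data science languages in the compute layer

[Figure: a three-layer stack, with data science languages executing in the compute layer.]
UI: Ibis, SQL, Spark API
Languages: Python, R, Julia, ?
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase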



Real World Example: Python With Spark, Drill, Impala


[Figure: the SQL engine passes each input partition (in partition 0 through in partition n-1) to the user-supplied Python function, which returns the corresponding output partition (out partition 0 through out partition n-1) to the SQL engine.]
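
The data flow in the figure can be sketched as a function applied once per partition, receiving and returning a whole column batch rather than one value at a time. The names `apply_udf` and `double_values` are hypothetical, and plain `array` objects stand in for Arrow column batches; a real engine would hand over Arrow memory instead.

```python
from array import array


def apply_udf(partitions, udf):
    """Engine side: call the user function once per partition."""
    return [udf(batch) for batch in partitions]


def double_values(batch):
    """Hypothetical user-supplied function: column batch in, column batch out."""
    return array("d", (2.0 * x for x in batch))


in_partitions = [
    array("d", [1.0, 2.0]),
    array("d", [3.0]),
    array("d", [4.0, 5.0, 6.0]),
]
out_partitions = apply_udf(in_partitions, double_values)
print([list(p) for p in out_partitions])
# [[2.0, 4.0], [6.0], [8.0, 10.0, 12.0]]
```

Calling the function per batch rather than per row is what lets the per-call overhead be amortized over many values.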


What's Next
Parquet for Python & C++
Using Arrow as intermediary
Available IPC Implementation
Spark, Drill Integration
Faster UDFs, Storage interfaces


Apache Arrow in practice


Get Involved
Join the community
dev@arrow.apache.org
Slack: https://apachearrowslackin.herokuapp.com/
http://arrow.apache.org
@ApacheArrow


Thank you
Wes McKinney @wesmckinn
Views are my own
