Next-Generation Python Big Data Tools, Powered by Apache Arrow

Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney @wesmckinn
SF Big Analytics Meetup, 2016-04-05
Cloudera, Inc. All rights reserved.
Me
Data Science Tools at Cloudera, formerly DataPad CEO/founder
Serial creator of structured data tools / user interfaces
Wrote bestseller Python for Data Analysis 2012
Open source projects
Python {pandas, Ibis, statsmodels}
Apache {Arrow, Parquet, Kudu (incubating)}
Mostly work in Python and Cython/C/C++
In process:
Python for Data Analysis: 2nd Edition
Coming late 2016 / early 2017
Python + Big Data: The State of things

See Python and Apache Hadoop: A State of the Union from February 17
Areas where much more work needed
Binary file format read/write support (e.g. Parquet files)
File system libraries (HDFS, S3, etc.)
Client drivers (Spark, Hive, Impala, Kudu)
Compute system integration (Spark, Impala, etc.)
Apache Arrow
Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow
Arrow in a Slide
New Top-level Apache Software Foundation project
Announced Feb 17, 2016
Focused on Columnar In-Memory Analytics
1. 10-100x speedup on many workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Designed to work with any programming language
4. Support for both relational and complex data as-is
Developers from 13+ major open source projects involved
A significant % of the world's data will be processed through Arrow!
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
Apache Arrow: What is it?

http://arrow.apache.org
Not a piece of software, exactly!
A standardized in-memory representation for columnar data
Enables
Implementing high-performance analytics in-memory (think pandas internals)
Cheap data interchange amongst systems, with little or no serialization
Flexible support for complex JSON-like data
Targets: Impala, Kudu, Parquet, Spark
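
To make "complex JSON-like data" in a columnar layout more concrete, here is a minimal sketch of how a column of variable-length lists can be stored as two flat buffers: one of offsets and one of values. The function names are hypothetical and chosen for illustration, and null handling (a validity bitmap) is left out to keep the example short.

```python
from array import array

def encode_list_column(rows):
    """Flatten a column of lists into an offsets buffer and a values buffer."""
    offsets = array("i", [0])  # row i spans values[offsets[i]:offsets[i + 1]]
    values = array("q")        # all child values stored contiguously as int64
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return offsets, values

def get_row(offsets, values, i):
    """Recover row i as a Python list using two offset lookups and one slice."""
    return values[offsets[i]:offsets[i + 1]].tolist()

rows = [[1, 2, 3], [], [4, 5]]
offsets, values = encode_list_column(rows)
```

Any row can be located with two offset lookups, so nested data keeps constant-time access while the values themselves stay in one dense buffer.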
Focus on CPU Efficiency

Cache Locality
Super-scalar & vectorized operation
Minimal Structure Overhead
Constant value access with minimal structure overhead
Operate directly on columnar compressed data

[Figure: a four-row table with columns session_id, timestamp, and source_ip, shown once in a traditional row-oriented memory buffer (Row 1 through Row 4 stored one after another) and once in an Arrow memory buffer (each column's values stored contiguously)]
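
The difference between the two memory buffers can be sketched with the standard library alone. This is an illustration under stated assumptions, not Arrow's actual implementation: the sample values come from the slide's table, but the pairing of each session_id with a source_ip is assumed, and the 16-byte padded string field is a simplification.

```python
import struct
from array import array

# Values from the slide's table; the session_id/source_ip pairing is assumed.
records = [
    (1331246660, b"99.155.155.225"),
    (1331246351, b"65.87.165.114"),
    (1331244570, b"71.10.106.181"),
    (1331261196, b"76.102.156.138"),
]

# Row-oriented buffer: the fields of each record sit next to each other.
ROW = struct.Struct("<q16s")  # int64 session_id + 16-byte padded source_ip
row_buffer = b"".join(ROW.pack(sid, ip) for sid, ip in records)

def max_session_id_rows(buf):
    # Every step has to hop over the 16 bytes of source_ip it does not need.
    return max(ROW.unpack_from(buf, off)[0]
               for off in range(0, len(buf), ROW.size))

# Column-oriented buffers: all values of one column are contiguous.
session_id = array("q", (sid for sid, _ in records))
source_ip = [ip for _, ip in records]

def max_session_id_columnar(column):
    # Scans one dense buffer of 8-byte integers and nothing else.
    return max(column)
```

Both functions return the same answer; the columnar scan touches 32 bytes of session_id data, while the row scan walks across all 96 bytes of the row buffer.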
High Performance Sharing & Interchange

Today
[Diagram: Pandas, Drill, Spark, Impala, Parquet, Cassandra, HBase, and Kudu, with a Copy & Convert step between each pair of systems]
Each system has its own internal memory format
70-80% CPU wasted on serialization and deserialization
Similar functionality implemented in multiple projects

With Arrow
[Diagram: Pandas, Drill, Spark, Impala, Parquet, HBase, Cassandra, and Kudu all connected through shared Arrow Memory]
All systems utilize the same memory format
No overhead for cross-system communication
Projects can share functionality (e.g., Parquet-to-Arrow reader)
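
A small single-process sketch of the contrast between "Copy & Convert" and a shared memory format. Here `pickle` stands in for whatever serialization two systems might use, and a `memoryview` stands in for a consumer that understands the producer's buffer layout directly; both are assumptions made for illustration.

```python
import pickle
from array import array

column = array("d", [1.5, 2.5, 3.5, 4.5])

# Copy & convert: the producer serializes, the consumer deserializes,
# and the result is a second, independent copy of the data.
wire_bytes = pickle.dumps(column.tolist())
converted = array("d", pickle.loads(wire_bytes))

# Shared format: the consumer wraps the producer's buffer as-is.
shared = memoryview(column)  # no bytes are copied here

# A later write by the producer is visible through the shared view only.
column[0] = 99.0
```

After the final assignment, `shared[0]` reflects the new value while `converted[0]` still holds the old one, which shows that the shared view never duplicated the data.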
Big Data Systems: Poor Python IO performance
http://wesmckinney.com/blog/pandas-and-apache-arrow/
Real World Example: Feather File Format for Python and R

Problem: fast, language-agnostic binary data frame file format
Written by Wes McKinney (Python) and Hadley Wickham (R)
Read speeds close to disk IO performance

[Diagram: a Feather file holds Arrow array 0, Arrow array 1, ..., Arrow array n (Apache Arrow memory), followed by Feather metadata (Google flatbuffers)]
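
The general idea on this slide, raw column arrays followed by a block of metadata that says where each one lives, can be sketched as a toy file format. This is not the Feather format itself: the JSON footer (in place of flatbuffers), the trailing length field, and the function names are all assumptions made for illustration.

```python
import io
import json
import struct
from array import array

def write_toy_file(f, columns):
    """Write each column's raw bytes, then a JSON footer, then the footer length."""
    meta = []
    for name, col in columns.items():
        meta.append({"name": name, "typecode": col.typecode,
                     "offset": f.tell(), "nbytes": len(col) * col.itemsize})
        f.write(col.tobytes())
    footer = json.dumps(meta).encode("utf-8")
    f.write(footer)
    f.write(struct.pack("<I", len(footer)))

def read_toy_column(f, name):
    """Read one column by seeking straight to its bytes via the footer."""
    f.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", f.read(4))
    f.seek(-4 - footer_len, io.SEEK_END)
    meta = json.loads(f.read(footer_len).decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    f.seek(entry["offset"])
    col = array(entry["typecode"])
    col.frombytes(f.read(entry["nbytes"]))  # bytes load as-is, no per-value parsing
    return col

buf = io.BytesIO()
write_toy_file(buf, {"x": array("q", [1, 2, 3]),
                     "y": array("d", [0.5, 1.5, 2.5])})
y = read_toy_column(buf, "y")
```

Because the column bytes on disk match the in-memory layout, reading a column is one seek plus one bulk read, which is why a format of this shape can approach disk IO speed.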
Real World Example: Feather File Format for Python and R

R
library(feather)

path <- "my_data.feather"
write_feather(df, path)

df <- read_feather(path)
Python
import feather

path = 'my_data.feather'

feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
Apache Parquet: Binary columnar storage format

I just became a Parquet committer!
github.com/apache/parquet-cpp
Python users will soon be able to read Parquet files via PyArrow
parquet-cpp <-> PyArrow <-> pandas
Language Bindings
Target Languages
Java (beta)
CPP (underway)
Python & Pandas (underway)
R
Julia
Initial Focus
Read a structure
Write a structure
Manage Memory
pandas and Arrow in context
RPC & IPC: Moving Data Between Systems

RPC
Avoid Serialization & Deserialization
Layer TBD: Focused on supporting vectored IO
Scatter/gather reads/writes against socket
IPC
Alpha implementation using memory-mapped files
Moving data between Python and Drill
Working on shared allocation approach
Shared reference counting and well-defined ownership semantics
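
A rough sketch of IPC through a memory-mapped file, using only the Python standard library. Both the producer and the consumer run in one process here to keep the example self-contained; in practice they would be separate processes agreeing on the file path and the column type, both of which are assumptions in this sketch.

```python
import mmap
import os
import tempfile
from array import array

values = array("q", [10, 20, 30, 40])

# Producer: write the raw column buffer to a file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(values.tobytes())

# Consumer (normally a different process): map the file and view it as int64.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # cast() reinterprets the mapped bytes in place; there is no parsing step.
    with memoryview(mapped) as raw, raw.cast("q") as view:
        total = sum(view)
        first = view[0]
    mapped.close()

os.remove(path)
```

The consumer never deserializes anything: it reads the producer's bytes through the page cache and interprets them with the agreed layout.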
Executing data science languages in the compute layer

UI: Ibis, SQL, Spark API, Python, R, Julia, ?
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase
Real World Example: Python With Spark, Drill, Impala

[Diagram: user-supplied Python code runs between two SQL Engine stages. For each partition 0 through n-1, the input partition is passed to the Python function, and the function's output becomes the matching out partition.]
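
The data flow on this slide can be sketched in plain Python. The `run_udf` helper is a hypothetical stand-in for the SQL engine, and `double_values` is a made-up user function; the point is only that the function receives a whole column chunk per partition rather than one row at a time.

```python
from array import array

def run_udf(partitions, func):
    """Stand-in for the engine: apply func to each input partition in order."""
    return [func(part) for part in partitions]

def double_values(column):
    # User-supplied function: takes a column chunk, returns a column chunk.
    return array(column.typecode, (2 * v for v in column))

input_partitions = [array("q", [1, 2, 3]), array("q", [4, 5]), array("q", [6])]
output_partitions = run_udf(input_partitions, double_values)
```

Each input partition maps to exactly one output partition, so the engine can run the function on partitions independently.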
What's Next
Parquet for Python & C++
Using Arrow as intermediary
Available IPC Implementation
Spark, Drill Integration
Faster UDFs, Storage interfaces
Apache Arrow in practice
Get Involved
Join the community
dev@arrow.apache.org
Slack: https://apachearrowslackin.herokuapp.com/
http://arrow.apache.org
@ApacheArrow
Thank you
Wes McKinney @wesmckinn
Views are my own
