Me
Data Science Tools at Cloudera, formerly DataPad CEO/founder
Serial creator of structured data tools / user interfaces
Wrote bestseller Python for Data Analysis 2012
Open source projects
Python {pandas, Ibis, statsmodels}
Apache {Arrow, Parquet, Kudu (incubating)}
Mostly work in Python and Cython/C/C++
In process: Python for Data Analysis: 2nd Edition
Coming late 2016 / early 2017
Apache Arrow
Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow
Arrow in a Slide
New Top-level Apache Software Foundation project
Announced Feb 17, 2016
Focused on Columnar In-Memory Analytics
Arrow!
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
pandas internals)
Cheap data interchange amongst systems, little or no serialization
Flexible support for complex JSON-like data
Targets: Impala, Kudu, Parquet, Spark
operation
Minimal Structure Overhead
Constant value access
[Figure: the same four-row table with columns session_id, timestamp, and source_ip, shown in two layouts. Traditional Memory Buffer: values stored row by row (Row 1 through Row 4). Arrow Memory Buffer: values stored column by column (all session_id values, then all timestamp values, then all source_ip values).]
Cloudera, Inc. All rights reserved.
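The row-oriented versus columnar layouts contrasted in the figure can be sketched with the Python standard library alone. This is a minimal illustration, not Arrow's actual memory format: the values are those shown on the slide, but their grouping into rows and the helper name `value_at` are assumptions made for the example.

```python
import struct
from array import array

# Row-oriented records. The grouping of values into rows is an assumption.
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Columnar layout: transpose so that each column's values sit together.
session_id, timestamp, source_ip = (list(col) for col in zip(*rows))

# A fixed-width column fits in one contiguous buffer of 64-bit integers.
session_buf = array("q", session_id).tobytes()


def value_at(buf, i, width=8):
    """Read the i-th 64-bit value straight from its byte offset (hypothetical helper)."""
    return struct.unpack_from("=q", buf, i * width)[0]


# Constant-time access: the offset is computed, nothing is scanned.
third_session = value_at(session_buf, 2)

# A whole-column operation touches only the session_id bytes.
max_session = max(v for (v,) in struct.iter_unpack("=q", session_buf))
```

Because every value in the column has the same width, locating element `i` is a single multiplication, which is one way to read the slide's "constant value access" point.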
[Figure: With Arrow. Drill, Spark, Pandas, Impala, Parquet, Cassandra, HBase, and Kudu exchange data through shared Arrow Memory.]
http://wesmckinney.com/blog/pandas-and-apache-arrow/
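The idea of several systems reading one shared in-memory representation, rather than each converting to its own, can be hinted at with Python's built-in `memoryview`. This is only an analogy within a single process, assuming two hypothetical consumers; it is not how Arrow itself shares memory between systems.

```python
from array import array

# One producer owns the column data in a contiguous buffer.
column = array("q", [10, 20, 30, 40])

# Two hypothetical consumers receive views of the same memory, not copies.
view_a = memoryview(column)
view_b = memoryview(column)[1:3]  # slicing a memoryview does not copy

# A write through one view is visible through the other: the bytes are shared.
view_a[2] = 99
shared = view_b.tolist()
```

No serialization step occurs between the two consumers; both simply interpret the same bytes using an agreed layout.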
[Figure: Feather file. Contains Arrow array 0, Arrow array 1, through Arrow array n (Apache Arrow memory), plus Feather metadata (Google FlatBuffers).]
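The structure in the figure, column arrays plus a metadata section that describes them, can be imitated with a toy container. This sketch is not the real Feather format: it uses JSON in place of FlatBuffers, stores only float64 columns, and places the metadata in a footer. All of those choices, and the function names, are assumptions made for illustration.

```python
import io
import json
import struct
from array import array


def write_columns(stream, columns):
    """Write each column buffer back to back, then a metadata footer."""
    meta = []
    for name, values in columns.items():
        meta.append({"name": name, "offset": stream.tell(), "length": len(values)})
        stream.write(array("d", values).tobytes())
    footer = json.dumps(meta).encode("utf-8")
    stream.write(footer)
    stream.write(struct.pack("<I", len(footer)))  # trailer: footer size in bytes


def read_column(stream, name):
    """Find one column through the footer and read only that column's bytes."""
    stream.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", stream.read(4))
    stream.seek(-4 - footer_len, io.SEEK_END)
    meta = json.loads(stream.read(footer_len).decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    stream.seek(entry["offset"])
    out = array("d")
    out.frombytes(stream.read(entry["length"] * out.itemsize))
    return out.tolist()


buf = io.BytesIO()
write_columns(buf, {"a": [1.0, 2.0, 3.0], "b": [0.5, 0.25]})
column_b = read_column(buf, "b")
```

Reading column `b` never touches the bytes of column `a`, which is the practical benefit of keeping each array contiguous and describing it in separate metadata.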
Python
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
Language Bindings
Target Languages
Java (beta)
CPP (underway)
Python & Pandas (underway)
R
Julia
Initial Focus
Read a structure
Write a structure
Manage Memory
Python, R, Julia, ?
Storage
Compute
[Figure: a SQL Engine passes input partitions (in partition 0 through in partition n-1) to a Python function, which returns output partitions (out partition 0 through out partition n-1) to the SQL Engine.]
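The data flow in the figure, one Python function call per partition with a whole batch in and a whole batch out, might look like the following. The names `python_udf` and `run_partitioned` are hypothetical, and a plain list of `array` objects stands in for the engine's partitions; no real SQL engine API is implied.

```python
from array import array


def python_udf(values):
    """Hypothetical user-defined function applied to one whole column batch."""
    return array("d", (v * 2.0 for v in values))


def run_partitioned(partitions, func):
    """Apply func to each input partition, yielding one output partition each."""
    return [func(part) for part in partitions]


# Input partitions 0 through n-1, each a contiguous column batch.
inputs = [array("d", [1.0, 2.0]), array("d", [3.0]), array("d", [4.0, 5.0, 6.0])]
outputs = run_partitioned(inputs, python_udf)
```

Calling the function once per batch rather than once per value is the design point: the per-call overhead is paid n times instead of once for every row.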
What's Next
Parquet for Python & C++
Using Arrow as intermediary
Available IPC Implementation
Spark, Drill Integration
Faster UDFs, Storage interfaces
Get Involved
Join the community
dev@arrow.apache.org
Slack: https://apachearrowslackin.herokuapp.com/
http://arrow.apache.org
@ApacheArrow
Thank you
Wes McKinney @wesmckinn
Views are my own