
Spark at eBay - Troubleshooting the everyday issues

Aug. 6, 2014
Seattle Spark Meetup
Don Watters - Sr. Manager of Architecture, eBay Inc.
Suzanne Monthofer - Solutions Architect, eBay Inc.


Agenda


eBay Overview
Spark Motivation
Use Cases At eBay
Troubleshooting the everyday issues

eBay Overview
> 50 thousand categories of products
> 200 million items listed for sale on the site
Average retailer has thousands of products
PLATFORM
Data @ eBay
>50 TB/day new data
>100 PB/day processed
>100 trillion pairs of information
Millions of queries/day
>6,000 business users & analysts
>50k chains of logic
>100k data elements
24x7x365, always online
99.98+% availability, Active/Active
Turning over a TB every second
Near-real-time
Spark Motivation


Great Promise!
Fits our pattern well
Iterative approach possible, like SQL



Agenda


Use Cases At eBay
eBay Transformer = More Data
Agenda


Troubleshooting the everyday issues

Tools and Skill Sets
JIRA issue tracking (internal and Apache)
GitHub repository: source version control, documentation (.md)
Compilation/dependencies: Maven jar dependencies
Java versioning, debugging stack traces, environments, multiple JDKs/JREs, compatibility errors
POSIX OS: environment variables, directory structures, permissions, shell scripting
HDFS: Hadoop queues, formats, compression
Yarn/Mesos environments: debugging, logs, killing jobs
JIRA, internal wikis: global internal collaboration
User groups, internal DLs, platform support teams, informal emails
Ability to decipher Java stack traces
Stack Overflow, Googling, indirect clues
Scrappiness: when dwarfed by a challenge, compensating for seeming inadequacies through will, persistence, and heart

Most Common Question: Yarn ShellException
(GiraphApplicationMaster.java:onContainersCompleted(574)) - Got container status for
containerID=container_1392317581183_0245_01_000003, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:252)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


This means an error occurred in the Yarn container; search for the Java stack trace deeper in the Yarn logs.
Killing Yarn Jobs and Viewing Yarn Logs
Logs and status live in many places:
Hadoop console (transient; disappears after the job is done)
Aggregated Yarn logs are not available until the job finishes or is killed
Execution shell shows only very high-level status

Killing: Ctrl-C, then
/apache/hadoop/bin/yarn application -kill application_1392973982912_7321

Viewing Logs:
/apache/hadoop/bin/yarn logs -applicationId application_1392973982912_7321

Sifting to find text such as Exception, Memory, etc.:
| grep Exception -5
| grep Memory -5
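As a concrete illustration, here is a minimal, self-contained sketch of the sifting step. The log content and the /tmp path are made up for the example; in practice the input would come from `yarn logs -applicationId ...`. A bare `-5` is grep shorthand for 5 lines of context (the same as `-C 5`).

```shell
# Synthetic stand-in for aggregated Yarn log output (illustrative content only).
cat > /tmp/yarn_app.log <<'EOF'
14/08/06 10:01:12 INFO yarn.Client: container launched
14/08/06 10:01:13 INFO storage.BlockManager: registering block manager
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
14/08/06 10:01:14 INFO yarn.Client: shutting down
EOF

# Surface the stack trace with 5 lines of surrounding context.
grep Exception -5 /tmp/yarn_app.log
```

The context lines matter: the exception line alone rarely identifies the failing stage, while the surrounding INFO lines usually do.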

Would like easier debugging and exiting on errors
May look at log4j appenders
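A minimal sketch of the kind of log4j appender configuration that could help here, assuming Spark's log4j 1.x setup; the file path and sizes are illustrative, not a recommendation:

```properties
# Hypothetical log4j.properties fragment: route driver output to a rolling
# file so errors survive after the transient Hadoop console entry disappears.
log4j.rootCategory=INFO, console, file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/tmp/spark-driver.log
log4j.appender.file.MaxFileSize=50MB
log4j.appender.file.MaxBackupIndex=5
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```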




Biggest Challenge: Resource Allocation/Capacity Scheduling
Users must request needed resources
Long-running jobs hang without releasing resources and must be killed manually
Created a dedicated Spark queue; still not equitable
Capacity allocation prioritization is complex
Spark shell hangs on to memory

Many users are deciding to wait for better stability and better guarantees of resource availability and job completion

Yarn vs. Mesos debate?

Tuning Spark: Hanging Jobs and Out-of-Memory Errors

spark.default.parallelism - set to the number of requested Yarn containers
spark.executor.memory - ~75-90% of the requested Yarn container memory size
spark.storage.memoryFraction - lower from the default 0.6 to ~0.2 (if you are not pinning a significant amount of data)

Remove outliers from the dataset (dual-pass with larger entities)
Use primitive data types; avoid Strings
Use Kryo serialization
App UI at localhost:4040 (disabled on our cluster)
Need to understand the inner workings of Spark
The community is working to reduce the amount of configuration needed
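The knobs above might be collected in a spark-defaults.conf like the following sketch. The numbers are illustrative assumptions (a hypothetical 48-container job with ~8 GB containers), not cluster recommendations:

```properties
# Hypothetical spark-defaults.conf sketch; values are illustrative only.
# Roughly the number of requested Yarn containers:
spark.default.parallelism      48
# ~75-90% of an assumed 8 GB Yarn container:
spark.executor.memory          6g
# Down from the 0.6 default, assuming little data is cached/pinned:
spark.storage.memoryFraction   0.2
# Kryo serialization, as suggested above:
spark.serializer               org.apache.spark.serializer.KryoSerializer
```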

Alex Rubinsteyn blog post: Spark should be better than MapReduce (if only it worked)
http://blog.explainmydata.com/2014/05/spark-should-be-better-than-mapreduce.html?m=1
Patrick Wendell's talk on performance at Spark Summit 2013:
https://spark-summit.org/talk/wendell-understanding-the-performance-of-spark-applications/
Tuning Guide: https://spark.apache.org/docs/latest/tuning.html


Yarn Improvements Needed for Spark
Great talk by Sandy Ryza from Cloudera at Spark Summit 2014
https://www.youtube.com/watch?v=N6pJhxCPe-Y





Rapid Pace of Change
SPARK-1203: spark-shell on yarn-client race in properly getting hdfs delegation tokens - error on saveAsTextFile
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException):
Delegation Token can be issued only with kerberos or web authentication
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:6211)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:461)
...
at org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:920)
at org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1336)
at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:527)
at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:505)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:202)
Burnt by bugs in snapshots during the incubating phase
Check Spark JIRA issues: https://issues.apache.org/jira/browse/SPARK/
Apache Shark (Hive on Spark)
NOW OBSOLETE

Google protobuf error (notorious): had to replace the bundled jar
Caused by: java.lang.VerifyError: class
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method
getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
Had to replace hadoop core/security jars with eBay jars
JDBC driver: mysql-connector-java-5.0.8-bin.jar
Got it working on a single node; able to access/query existing Hive tables
Couldn't use it for extremely large tables/joins yet (needs multi-node)
Requires JDK 1.7; couldn't run on multiple nodes in the cluster (still on 1.6)
./bin/shark-withinfo -skipRddReload to avoid a bad-table error
Performance 2-5x better than Hive for an 8M-row table count query

Start Looking at Spark SQL!





Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072)
at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49)
at shark.SharkCliDriver.<init>(SharkCliDriver.scala:283)
at shark.SharkCliDriver$.main(SharkCliDriver.scala:162)
at shark.SharkCliDriver.main(SharkCliDriver.scala)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:51)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070)
... 4 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1137)
... 9 more
Caused by: java.lang.VerifyError: class org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetOwnerRequestProto overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)







Shark Jar Incompatibilities
Caused by: KrbException: Server not found in Kerberos database (7)
at sun.security.krb5.KrbTgsRep.<init>(Unknown Source)
at sun.security.krb5.KrbTgsReq.getReply(Unknown Source)
at sun.security.krb5.KrbTgsReq.sendAndGetCreds(Unknown Source)
at sun.security.krb5.internal.CredentialsUtil.serviceCreds(Unknown Source)

14/05/07 17:49:58 ERROR security.UserGroupInformation: PriviledgedActionException
as:XXUSER@XXXX.XXXX.XXX cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos
database (7))]
14/05/07 17:49:58 INFO security.UserGroupInformation: Initiating logout for USER@XXXX.XXXX.XXX
14/05/07 17:49:58 INFO security.UserGroupInformation: Initiating re-login for USER@XXXX.XXXX.XXX
14/05/07 17:50:02 ERROR security.UserGroupInformation: PriviledgedActionException as:USER@XXXX.XXXX.XXX
cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by
GSSException: No valid credentials provided (Mechanism level: Server not found
in Kerberos database (7))]
14/05/07 17:50:02 WARN security.UserGroupInformation: Not attempting to re-login since the last re-login was
attempted less than 600 seconds before.
Shark vs. Hive, Spark SQL vs. Shark
Big Data Benchmarks

https://amplab.cs.berkeley.edu/benchmark/
http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
Compilation: Maven, sbt, ivy, ant
Maven/sbt/ivy/munge can be complex and finicky

[info] Resolving com.ebay.incdata.metis#metis-matching-engine;1.0-SNAPSHOT ...
[warn] module not found: com.ebay.incdata.metis#metis-matching-engine;1.0-SNAPSHOT
[warn] ==== local: tried
[warn] /Users/smonthofer/.ivy2/local/com.ebay.incdata.metis/metis-matching-engine/1.0-SNAPSHOT/ivys/ivy.xml
[warn] ==== public: tried
[warn] http://repo1.maven.org/maven2/com/ebay/incdata/metis/metis-matching-engine/1.0-SNAPSHOT/metis-matching-engine-1.0-SNAPSHOT.pom
[warn] ==== Local Maven Repository: tried
[warn] file:///var/root/.m2/repository/com/ebay/incdata/metis/metis-matching-engine/1.0-SNAPSHOT/metis-matching-engine-1.0-SNAPSHOT.pom
URI has an authority component
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:213)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:122)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:121)
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
java.net.MalformedURLException: no protocol: /Users/smonthofer/.m2/repository

Fix in build.sbt:
resolvers += "Local Maven Repository" at "file:///Users/smonthofer/.m2/repository"
Needed 3 slashes (platform-independence feature)!!! Grrrr

Learned New Term:
Yak Shaving
From Urban Dictionary:
Any seemingly pointless activity which is actually
necessary to solve a problem which solves a
problem which, several levels of recursion later,
solves the real problem you're working on.
origin: MIT AI Lab, after 2000: orig. probably from a
Ren & Stimpy episode.


Building scalable systems is not all sexy roflscale fun.
It's a lot of plumbing and yak shaving. A lot of
hacking together tools that really ought to exist
already, but all the open source solutions out there
are too bad (and yours ends up bad too, but at least
it solves your particular problem).

- Martin Kleppmann, LinkedIn, Founder of Rapportive
Simple documentation saves time later, for yourself and for others

Cut/paste/collect things that work, errors, and common commands and put them on a wiki page (even email drafts are a fast holding place).

Source control/backups for working versions; be able to start from scratch

Maven, sbt, dependencies: complex, corruptible, bizarre tricks, multiple open source projects magic (also scary)

Get ahead of the curve on new technology, 'cause new challenges will always come up





From xkcd

Quoted by a Spark user group user:
If you want to succeed as badly as you want the air, then you will get it; there is no other secret to success
- Socrates (lesson to his students)

Spark at eBay -
Troubleshooting the
everyday issues

Aug. 6, 2014
Seattle Spark Meetup
Don Watters dwatters@ebay.com
Suzanne Monthofer smonthofer@ebay.com
