Apache Spark for Data Science Cookbook
Apache Spark for Data Science Cookbook - Padma Priya Chitturi
Table of Contents
Apache Spark for Data Science Cookbook
Credits
About the Author
About the Reviewer
www.PacktPub.com
Why subscribe?
Customer Feedback
Preface
What this book covers
What you need for this book
Who this book is for
Sections
Getting ready
How to do it…
How it works…
There's more…
See also
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
1. Big Data Analytics with Spark
Introduction
Initializing SparkContext
Getting ready
How to do it…
How it works…
There's more…
See also
Working with Spark's Python and Scala shells
How to do it…
How it works…
There's more…
See also
Building standalone applications
Getting ready
How to do it…
How it works…
There's more…
See also
Working with the Spark programming model
How to do it…
How it works…
There's more…
See also
Working with pair RDDs
Getting ready
How to do it…
How it works…
There's more…
See also
Persisting RDDs
Getting ready
How to do it…
How it works…
There's more…
See also
Loading and saving data
Getting ready
How to do it…
How it works…
There's more…
See also
Creating broadcast variables and accumulators
Getting ready
How to do it…
How it works…
There's more…
See also
Submitting applications to a cluster
Getting ready
How to do it…
How it works…
There's more…
See also
Working with DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
Working with Spark Streaming
Getting ready
How to do it…
How it works…
There's more…
See also
2. Tricky Statistics with Spark
Introduction
Working with Pandas
Variable identification
Getting ready
How to do it…
How it works…
There's more…
See also
Sampling data
Getting ready
How to do it…
How it works…
There's more…
See also
Summary and descriptive statistics
Getting ready
How to do it…
How it works…
There's more…
See also
Generating frequency tables
Getting ready
How to do it…
How it works…
There's more…
See also
Installing Pandas on Linux
Getting ready
How to do it…
How it works…
There's more…
See also
Installing Pandas from source
Getting ready
How to do it…
How it works…
There's more…
See also
Using IPython with PySpark
Getting ready
How to do it…
How it works…
There's more…
See also
Creating Pandas DataFrames over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Splitting, slicing, sorting, filtering, and grouping DataFrames over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing co-variance and correlation using Pandas
Getting ready
How to do it…
How it works…
There's more…
See also
Concatenating and merging operations over DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
Complex operations over DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
Sparkling Pandas
Getting ready
How to do it…
How it works…
There's more…
See also
3. Data Analysis with Spark
Introduction
Univariate analysis
Getting ready
How to do it…
How it works…
There's more…
See also
Bivariate analysis
Getting ready
How to do it…
How it works…
There's more…
See also
Missing value treatment
Getting ready
How to do it…
How it works…
There's more…
See also
Outlier detection
Getting ready
How to do it…
How it works…
There's more…
See also
Use case - analyzing the MovieLens dataset
Getting ready
How to do it…
How it works…
There's more…
See also
Use case - analyzing the Uber dataset
Getting ready
How to do it…
How it works…
There's more…
See also
4. Clustering, Classification, and Regression
Introduction
Supervised learning
Unsupervised learning
Applying regression analysis for sales data
Variable identification
Getting ready
How to do it…
How it works…
There's more…
See also
Data exploration
Getting ready
How to do it…
How it works…
There's more…
See also
Feature engineering
Getting ready
How to do it…
How it works…
There's more…
See also
Applying linear regression
Getting ready
How to do it…
How it works…
There's more…
See also
Applying logistic regression on bank marketing data
Variable identification
Getting ready
How to do it…
How it works…
There's more…
See also
Data exploration
Getting ready
How to do it…
How it works…
There's more…
See also
Feature engineering
Getting ready
How to do it…
How it works…
There's more…
See also
Applying logistic regression
Getting ready
How to do it…
How it works…
There's more…
See also
Real-time intrusion detection using streaming k-means
Variable identification
Getting ready
How to do it…
How it works…
There's more…
See also
Simulating real-time data
Getting ready
How to do it…
How it works…
There's more…
See also
Applying streaming k-means
Getting ready
How to do it…
How it works…
There's more…
See also
5. Working with Spark MLlib
Introduction
Working with Spark ML pipelines
Implementing Naive Bayes' classification
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing decision trees
Getting ready
How to do it…
How it works…
There's more…
See also
Building a recommendation system
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing logistic regression using Spark ML pipelines
Getting ready
How to do it…
How it works…
There's more…
See also
6. NLP with Spark
Introduction
Installing NLTK on Linux
Getting ready
How to do it…
How it works…
There's more…
See also
Installing Anaconda on Linux
Getting ready
How to do it…
How it works…
There's more…
See also
Anaconda for cluster management
Getting ready
How to do it…
How it works…
There's more…
See also
POS tagging with PySpark on an Anaconda cluster
Getting ready
How to do it…
How it works…
There's more…
See also
NER with IPython over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing openNLP - chunker over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing openNLP - sentence detector over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing stanford NLP - lemmatization over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing sentiment analysis using stanford NLP over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
7. Working with Sparkling Water - H2O
Introduction
Features
Working with H2O on Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing k-means using H2O over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing spam detection with Sparkling Water
Getting ready
How to do it…
How it works…
There's more…
See also
Deep learning with airlines and weather data
Getting ready
How to do it…
How it works…
There's more…
See also
Implementing a crime detection application
Getting ready
How to do it…
How it works…
There's more…
See also
Running SVM with H2O over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
8. Data Visualization with Spark
Introduction
Visualization using Zeppelin
Getting ready
How to do it…
Installing Zeppelin
Customizing Zeppelin's server and websocket port
Visualizing data on HDFS - parameterizing inputs
Running custom functions
Adding external dependencies to Zeppelin
Pointing to an external Spark Cluster
How to do it…
How it works…
There's more…
See also
Creating scatter plots with Bokeh-Scala
Getting ready
How to do it…
How it works…
There's more…
See also
Creating a time series MultiPlot with Bokeh-Scala
Getting ready
How to do it…
How it works…
There's more…
See also
Creating plots with the lightning visualization server
Getting ready
How to do it…
How it works…
There's more…
See also
Visualize machine learning models with Databricks notebook
Getting ready
How to do it…
How it works…
There's more…
See also
9. Deep Learning on Spark
Introduction
Installing CaffeOnSpark
Getting ready
How to do it…
How it works…
There's more…
See also
Working with CaffeOnSpark
Getting ready
How to do it…
How it works…
There's more…
See also
Running a feed-forward neural network with DeepLearning 4j over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Running an RBM with DeepLearning4j over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Running a CNN for learning MNIST with DeepLearning4j over Spark
Getting ready
How to do it…
How it works…
There's more…
See also
Installing TensorFlow
Getting ready
How to do it…
How it works…
There's more…
See also
Working with Spark TensorFlow
Getting ready
How to do it…
How it works…
There's more…
See also
10. Working with SparkR
Introduction
Installing R
Getting ready
How to do it…
How it works…
There's more…
See also
Interactive analysis with the SparkR shell
Getting ready
How to do it…
How it works…
There's more…
See also
Creating a SparkR standalone application from RStudio
Getting ready
How to do it…
How it works…
There's more…
See also
Creating SparkR DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
SparkR DataFrame operations
Getting ready
How to do it…
How it works…
There's more…
See also
Applying user-defined functions in SparkR
Getting ready
How to do it…
How it works…
There's more…
See also
Running SQL queries from SparkR and caching DataFrames
Getting ready
How to do it…
How it works…
There's more…
See also
Machine learning with SparkR
Getting ready
How to do it…
How it works…
There's more…
See also
Apache Spark for Data Science Cookbook
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, nor its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: December 2016
Production reference: 1161216
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78588-010-0
www.packtpub.com
Credits
About the Author
Padma Priya Chitturi is Analytics Lead at Fractal Analytics Pvt Ltd and has over five years of experience in Big Data processing. Currently, she is part of capability development at Fractal and is responsible for large-scale solution development for analytical problems across multiple business domains. Prior to this, she worked at Amadeus Software Labs on an airlines product with a real-time processing platform serving one million user requests per second. She has worked on realizing large-scale deep networks (Jeffrey Dean's work on Google Brain) for image classification on the big data platform Spark. She works closely with Big Data technologies such as Spark, Storm, Cassandra, and Hadoop, and has been an open source contributor to Apache Storm.
First, I would like to thank the Packt Publishing team for providing a great opportunity for me to take part in this exciting journey, and I would like to express my special thanks and gratitude to my family, friends, and colleagues, who have been very supportive and helped me finish this project on time.
About the Reviewer
Roberto Corizzo is a PhD student at the Department of Computer Science, University of Bari, Italy. His research interests include Big Data analytics, data mining, and predictive modeling techniques for sensor networks. He was a technical reviewer for Packt's Learning Hadoop 2 and Learning Python Web Penetration Testing video courses.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thank you for purchasing this Packt book. We take our commitment to improving our content and products to meet your needs seriously; that's why your feedback is so valuable. Whatever your feelings about your purchase, please consider leaving a review on this book's Amazon page. Not only will this help us; more importantly, it will also help others in the community make an informed decision about the resources they invest in to learn. You can also review for us on a regular basis by joining our reviewers' club. If you're interested in joining, or would like to learn more about the benefits we offer, please contact us at customerreviews@packtpub.com.
Preface
In recent years, the volume of data being collected, stored, and analyzed has exploded, in particular in relation to activity on the Web and mobile devices, as well as data from the physical world collected via sensor networks. While large-scale data storage, processing, analysis, and modeling were previously the domain of the largest institutions, such as Google, Yahoo!, Facebook, and Twitter, increasingly many organizations are faced with the challenge of handling massive amounts of data.
With the advent of big data, extracting knowledge from large, heterogeneous, and noisy datasets requires not only powerful computing resources, but the programming abstractions to use them effectively. The abstractions that emerged in the last decade blend ideas from parallel databases, distributed systems, and programming languages to create a new class of scalable data analytics platforms that form the foundation for data science at realistic scales.
The objective of this book is to give the audience a flavor of the challenges in data science and show how to address them with a variety of analytical tools on a distributed system such as Spark, which is apt for iterative algorithms, offers in-memory processing, and provides greater flexibility for data analysis at scale. This book introduces readers to the fundamentals of Spark and helps them learn the concepts through code examples. It also briefly covers data mining, text mining, NLP, machine learning, and so on. Readers learn how to solve real-world analytical problems with large datasets, through a practical approach and code that applies analytical tools leveraging the features of Spark.
What this book covers
Chapter 1, Big Data Analytics with Spark, introduces how Scala, Python, and R can be used for data analysis. It details the Spark programming model and API, shows how to install Spark, set up a development environment for the framework, and run jobs in distributed mode, and demonstrates working with DataFrames and the Spark Streaming computation model.
Chapter 2, Tricky Statistics with Spark, shows how to apply various statistical measures, such as sampling data, constructing frequency tables, and computing summary and descriptive statistics, on large datasets using Spark and Pandas.
Chapter 3, Data Analysis with Spark, details how to apply common data exploration and preparation techniques, such as univariate analysis, bivariate analysis, missing value treatment, outlier detection, and variable transformation, using Spark.
Chapter 4, Clustering, Classification, and Regression, deals with creating models for regression, classification, and clustering, and shows how to apply standard performance-evaluation methodologies to the machine learning models built.
Chapter 5, Working with Spark MLlib, provides an overview of Spark MLlib and ML pipelines and presents examples of implementing Naive Bayes classification, decision trees, and recommendation systems.
Chapter 6, NLP with Spark, shows how to install NLTK and Anaconda and how to apply NLP tasks such as POS tagging, named entity recognition, chunking, sentence detection, and lemmatization using OpenNLP and Stanford NLP over Spark.
Chapter 7, Working with Sparkling Water - H2O, details how to integrate H2O with Spark, shows how to apply algorithms such as k-means, deep learning, and SVM, and walks through developing spam detection and crime detection applications with Sparkling Water.
Chapter 8, Data Visualization with Spark, shows how to integrate widely used visualization tools such as Zeppelin, the Lightning server, and the actively developed Scala bindings for Bokeh (Bokeh-Scala) to visualize large datasets.
Chapter 9, Deep Learning on Spark, shows how to implement deep learning algorithms such as RBMs, CNNs for learning MNIST, and feed-forward neural networks, with DeepLearning4j and TensorFlow on Spark.
Chapter 10, Working with SparkR, provides examples of creating distributed data frames in R and the various operations that can be applied in SparkR, and details how to apply user-defined functions, SQL queries, and machine learning in SparkR.
What you need for this book
Throughout this book, we assume that you have some basic experience with programming in Scala, Java, or Python and have some basic knowledge of machine learning, statistics, and data analysis.
Who this book is for
This book is intended for entry-level to intermediate data scientists, data analysts, engineers, and practitioners who want to get acquainted with solving numerous data science problems using a distributed computing framework such as Spark. Readers are expected to have knowledge of statistics and of data science tools such as R and Pandas, as well as an understanding of distributed systems (some exposure to Hadoop).
Sections
In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also).
To give clear instructions on how to complete a recipe, we use these sections as follows:
Getting ready
This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.
How to do it…
This section contains the steps required to follow the recipe.
How it works…
This section usually consists of a detailed explanation of what happened in the previous section.
There's more…
This section consists of additional information about the recipe in order to make the reader more knowledgeable about the recipe.
See also
This section provides helpful links to other useful information for the recipe.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Both spark-shell and PySpark are available in the bin directory of SPARK_HOME, that is, SPARK_HOME/bin."
A block of code is set as follows:
from pyspark import SparkContext

stocks = "hdfs://namenode:9000/stocks.txt"
sc = SparkContext("spark://master:7077", "ApplicationName")
data = sc.textFile(stocks)
totalLines = data.count()
print("Total Lines are: %i" % (totalLines))
Any command-line input or output is written as follows:
$SPARK_HOME/bin/spark-shell --master spark://master:7077
Spark context available as sc.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in or register to our website using your e-mail address and password.
Hover the mouse pointer on the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book from.
Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/ChitturiPadma/SparkforDataScienceCookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.
Chapter 1. Big Data Analytics with Spark
In this chapter, we will cover the components of Spark. You will learn them through the following recipes:
Initializing SparkContext
Working with Spark's Python and Scala shells
Building standalone applications
Working with the Spark programming model
Working with pair RDDs
Persisting RDDs
Loading and saving data
Creating broadcast variables and accumulators
Submitting applications to a cluster
Working with DataFrames
Working with Spark Streaming
Introduction
Apache Spark is a general-purpose distributed computing engine for large-scale data processing. It began as an open source initiative at UC Berkeley's AMPLab and was donated to the Apache Software Foundation, where it is now one of the top-level projects. Apache Spark offers a data abstraction called Resilient Distributed Datasets (RDDs) for analyzing data in parallel on top of a cluster of resources. The Apache Spark framework is an alternative to Hadoop MapReduce: it can be up to 100 times faster than MapReduce for in-memory workloads and offers expressive APIs well suited to iterative processing. The project is written in Scala and offers client APIs in Scala, Java, Python, and R.
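Before diving into the recipes, here is a minimal sketch of the RDD abstraction in Scala; the application name and local master URL are illustrative only, not part of the recipes that follow:

import org.apache.spark.{SparkConf, SparkContext}

object RDDTaste {
  def main(args: Array[String]) {
    // Illustrative local-mode configuration; a real cluster URL works the same way.
    val conf = new SparkConf().setAppName("RDDTaste").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Distribute an in-memory collection across the cluster as an RDD.
    val numbers = sc.parallelize(1 to 100)

    // filter is a lazy transformation; sum is an action that triggers execution.
    val evenSum = numbers.filter(_ % 2 == 0).sum()
    println("Sum of even numbers: " + evenSum)

    sc.stop()
  }
}

Transformations build up a lineage of computations, and nothing runs until an action such as sum or count is invoked; this laziness is what lets Spark optimize and parallelize the whole pipeline at once.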
Initializing SparkContext
This recipe shows how to initialize the SparkContext object as part of many Spark applications. SparkContext is the object that allows us to create the base RDDs. Every Spark application must create this object to interact with Spark. It is also used to initialize StreamingContext, SQLContext, and HiveContext.
Getting ready
To step through this recipe, you will need a running Spark cluster in any one of the modes, that is, local, standalone, YARN, or Mesos. For installing Spark on a standalone cluster, please refer to http://spark.apache.org/docs/latest/spark-standalone.html. Also install Hadoop (optional), Scala, and Java. Please download the data from the following location:
https://github.com/ChitturiPadma/datasets/blob/master/stocks.txt
How to do it…
Let's see how to initialize SparkContext:
Invoke spark-shell:
$SPARK_HOME/bin/spark-shell --master spark://master:7077
Spark context available as sc.
Invoke PySpark:
$SPARK_HOME/bin/pyspark --master spark://master:7077
SparkContext available as sc.
Invoke SparkR:
$SPARK_HOME/bin/sparkR --master spark://master:7077
Spark context is available as sc.
Now, let's initiate SparkContext in different standalone applications, such as Scala, Java, and Python:
Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SparkContextExample {
  def main(args: Array[String]) {
    val stocksPath = "hdfs://namenode:9000/stocks.txt"
    val conf = new SparkConf().setAppName("Counting Lines")
                              .setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    // Read the stocks file as an RDD with a minimum of two partitions.
    val data = sc.textFile(stocksPath, 2)
    val totalLines = data.count()
    println("Total number of Lines: %s".format(totalLines))
  }
}
Java:
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;

public class SparkContextExample {
  public static void main(String[] args) {
    String stocksPath = "hdfs://namenode:9000/stocks.txt";
    SparkConf conf = new SparkConf().setAppName("Counting Lines")
                                    .setMaster("spark://master:7077");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Read the stocks file as an RDD of lines.
    JavaRDD<String> stocks = sc.textFile(stocksPath);
    long totalLines = stocks.count();
    System.out.println("Total number of Lines: " + totalLines);
  }
}
Python:
from pyspark import SparkContext

stocks = "hdfs://namenode:9000/stocks.txt"
sc = SparkContext("spark://master:7077", "ApplicationName")
data = sc.textFile(stocks)
totalLines = data.count()
print("Total Lines are: %i" % (totalLines))
How it works…
In the preceding code snippets, new SparkContext(conf), new JavaSparkContext(conf), and SparkContext("spark://master:7077", "ApplicationName") initialize SparkContext in three different languages: Scala, Java, and Python. SparkContext is the starting point for Spark functionality. It represents the connection to a Spark cluster, and it can be used to create RDDs, accumulators, and broadcast variables on that cluster.
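To illustrate those capabilities, here is a minimal sketch, assuming the Spark 1.6-style API this book references and the SparkContext sc from the snippets above; the sample records and lookup table are made up for illustration:

// Assumes sc is the SparkContext created in the preceding snippets.
val lines = sc.parallelize(Seq("IBM,100", "GOOG,200", "bad-record"))

// Broadcast: ship a small read-only lookup table to each executor once.
val sectors = sc.broadcast(Map("IBM" -> "Tech", "GOOG" -> "Tech"))

// Accumulator: tasks can only add to it; the driver reads the final value.
val badRecords = sc.accumulator(0)

lines.foreach { line =>
  if (!line.contains(",")) badRecords += 1
}
println("Malformed records: " + badRecords.value)
println("IBM sector: " + sectors.value.getOrElse("IBM", "Unknown"))

Broadcast variables and accumulators are covered in detail in their own recipe later in this chapter.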
There's more…
SparkContext is created on the driver and connects with the cluster; RDDs are initially created using SparkContext. It is not serializable, hence it cannot be shipped to workers, and only one SparkContext is available per application. For streaming applications and the Spark SQL module, StreamingContext and SQLContext are created on top of SparkContext.
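As a brief illustration (a minimal sketch, assuming the Spark 1.6-style APIs referenced in this book and an existing SparkContext named sc), both contexts can be built as follows:

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// SQLContext is the entry point for DataFrames and SQL queries.
val sqlContext = new SQLContext(sc)

// StreamingContext wraps the same SparkContext with a 10-second batch interval.
val ssc = new StreamingContext(sc, Seconds(10))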
See also
To understand more about the SparkContext object and its methods, please refer to this documentation page: https://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.SparkContext.