Get a quick start with PySpark and spark-submit

We just released a new open source boilerplate template to help any Spark user run spark-submit commands smoothly, including loading dependencies, project source code and more.

TLDR: Here is an open source template to help you get started.

At Soluto, as part of our everyday Data Science work, we create ETL (Extract, Transform, Load) jobs. Our main tool for this is Spark, specifically PySpark, run with spark-submit.

Spark is used for distributed computing on large-scale data sets. The spark-submit script launches your application on your cluster.
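To make the launch step concrete, here is a minimal sketch that assembles a spark-submit command line in Python. The `--master` and `--py-files` options are standard spark-submit options; the helper name `build_spark_submit_command`, the file names `main.py` and `src.zip`, and the `--job` application argument are assumptions made for illustration and are not taken from the template.

```python
import shlex


def build_spark_submit_command(main_script, py_files=(), master="local[*]", app_args=()):
    """Return the spark-submit invocation as a list of arguments.

    Hypothetical helper: the names and defaults here are illustrative only.
    """
    cmd = ["spark-submit", "--master", master]
    if py_files:
        # --py-files takes a comma-separated list of .zip, .egg or .py files
        cmd += ["--py-files", ",".join(py_files)]
    # Everything after the main script is passed to the application itself
    cmd.append(main_script)
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    cmd = build_spark_submit_command(
        "main.py",
        py_files=["src.zip"],
        app_args=["--job", "sessions"],
    )
    # Print a shell-safe version of the command instead of running it
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list could be handed to `subprocess.run` on a machine where Spark is installed; building it as a list rather than a single string avoids shell quoting problems.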

Here are some examples of jobs we run daily at Soluto:

  • Creating offline content recommendations for users
  • Aggregating single events into more logical tables. As part of our service we offer tech support via chat messaging; instead of keeping multiple message events for a single support session, we create a Sessions table with one session entity that holds all the aggregated information of a single chat session

Some of the basic needs when using Spark for ETL jobs:

  • Passing arguments
  • Creating a Spark context and SQL context
  • Loading your project source code (src directory)
  • Loading pip modules (with a simple requirements file)
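The first three needs in the list above can be sketched roughly as follows. This is not the template's actual code: the function names, the argument names (`--job-name`, `--input-path`, `--output-path`) and the zip layout are assumptions for illustration. The sketch uses `SparkSession`, which in current PySpark versions gives access to both the Spark context and the SQL functionality, and it imports PySpark lazily so the argument and packaging helpers can be used without Spark installed. Installing pip modules from a requirements file is not covered here.

```python
import argparse
import os
import zipfile


def parse_args(argv=None):
    """Parse job arguments (argument names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Example PySpark ETL job")
    parser.add_argument("--job-name", required=True, help="name shown in the Spark UI")
    parser.add_argument("--input-path", help="where the job reads from")
    parser.add_argument("--output-path", help="where the job writes to")
    return parser.parse_args(argv)


def zip_source_dir(src_dir, zip_path):
    """Zip the .py files under src_dir, keeping the directory name as the package root."""
    parent = os.path.dirname(os.path.abspath(src_dir))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                if name.endswith(".py"):
                    full_path = os.path.join(root, name)
                    # Store e.g. "src/jobs.py" so "import src.jobs" resolves from the zip
                    zf.write(full_path, os.path.relpath(full_path, parent))
    return zip_path


def create_spark(app_name, py_files=()):
    """Create a SparkSession and ship the given files to the executors."""
    # Imported here so the helpers above work on machines without PySpark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    for path in py_files:
        # addPyFile distributes a .py or .zip dependency to every executor
        spark.sparkContext.addPyFile(path)
    return spark
```

A job entry point would then call `parse_args()`, pass `args.job_name` and the result of `zip_source_dir("src", "src.zip")` to `create_spark`, and use the returned session for its reads and writes.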

We created a simple template that helps you get started running ETL jobs with PySpark, both through spark-submit and the interactive shell. It creates the Spark context and SQL context, handles simple command-line arguments and loads all your dependencies (your project source code and third-party requirements).

So if you’re starting a new Spark project, “Fork” it on GitHub and enjoy!

Please feel free to share any thoughts, open issues and contribute code!


