Getting started with spark-jobserver

Spark-jobserver is a really cool RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts. At megam our analytics platform Meglytics is powered by apache spark and we leverage spark-jobserver to execute spark jobs. This blog post we will see how to get started with apache spark jobserver. Before we go ahead, a big thanks to the Ooyala folks for making the spark-jobserver opensource. Lets get started.

Note: Make sure you have spark installed locally

1. Running spark-jobserver

For sanity check…

 $sudo apt-get update

Now clone the [spark-jobserver] project

  $git clone https://github.com/spark-jobserver/spark-jobserver

To run it,

  $export VER=`sbt version | tail -1 | cut -f2`
  >reStart

Your dev setup is done, fire up your browser and point it to localhost:8090 and you can see a not-so-quagmire kinda UI.

Note: For proper deployment you can find the conf and scripts here

2. Building and deploying a jar:

The fundamental steps in setting up and working in SJS is that,

  • First build jar(like duh!) with your sparkContext(s) and you push it to SJS where your spark Master is also running(in our case, it is local).

  • Then run the jar by providing the classPath and the name of the jar.

Let us look at the simple wordCount example for now to get all the missing pieces together.

cd into spark-jobserver and run this,

sbt job-server-tests/package

They are examples that you can find here. Now your wordcount example is built. Lets push the jar to SJS /jar API

 curl --data-binary @job-server-tests/target/scala-2.10/job-server-tests.jar localhost:8090/jars/firsttest

3. Submitting a job:

Lets run it and get the output,

curl -d "input.string = a b c a b see" 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample'

We send a request to /jobs api with the appName and classPath. Upon every job submission SJS gives you an jobID ` “jobId”: “5453779a-f004-45fc-a11d-a39dae0f9bf4”`

4. Getting the status of the job:

Call the /jobs api with the key to get the status/result and also the duration of your job. Also, fire up the spark master UI to see the job getting exectuted.

curl localhost:8090/jobs/5453779a-f004-45fc-a11d-a39dae0f9bf4

SJS is a really nice project which makes a ton easy to work with apache spark and the production cases looks promising aswell. There is also a gitter chat room where all the SJS folks hang out and solve any kind of queries.

Thats it for now. If I find time I will write about spark-jobserver in production and using sqlContext and dataframes with spark-jobserver. Any questions regarding spark-jobserver comment below or shoot me an email.

VirtEngine by DET.io

VirtEngine by DET.io
VirtEngine specializes in building Virtualization Software and powering Cloud Service Providers / Hosting Providers..

Meet VirtEngine at HostingCon India 2016!

Virtualization platform VirtEngine will be exhibiting in HostingCon India 2016! Continue reading

Installing VirtEngine on CentOS

Published on October 18, 2016

Cassandra Replication - HA

Published on July 14, 2016