Simple Spark ML pipeline

Mediative recently hosted an Apache Spark Montreal Meetup project night where some of us decided to create a simple ML pipeline. To spare ourselves the installation of Spark, we used the Databricks community edition. Since the goal was to see if we could make it work, we wanted to use data that we knew was correlated. But to make the project a little more fun, we decided to explore something other than the usual data sets, so we went for the Dow Jones and Nasdaq. »

Sparrow version 0.2.0

Sparrow version 0.2.0 is now available, with its Spark dependency updated to 1.6.0. It is available both as a Spark package and from the YPG Data Bintray repository. Release notes: bump Spark version to 1.6.0; test against Scala 2.10.6 on Travis; bump the Macro Paradise plugin to 2.1.0 »

News from Spark Summit East

Mediative is building a data pipeline on top of Spark, so I went to Spark Summit East to see what other people are doing and what is coming. There were many conference tracks, including Enterprise, Developer and Data Science. I mostly attended Data Science talks, and below are the highlights. Some of this information also came from the NYC Spark Meetup, held on the first evening of the conference. Spark 2.0: Some of the main news about Spark 2. »

Running Zeppelin on CDH

Download and Build Zeppelin

Go to the download page and get the latest source package. Untar the source package and create a git repo to make bower happy:

    $ tar zxvf zeppelin-0.5.6-incubating.tgz
    $ cd zeppelin-0.5.6-incubating
    $ git init

Before building from source, first determine the Hadoop version by running the following command on the edge node:

    $ hadoop version
    Hadoop 2.6.0-cdh5.4.8
    ...
    This command was run using /opt/cloudera/parcels/CDH-5.4.8-1.cdh5.4.8.p0.4/lib/hadoop/hadoop-common-2.6.0-cdh5.4.8.jar

Build Zeppelin with YARN support enabled using the Maven profile corresponding to the Hadoop version found above: »
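The excerpt cuts off before the build command itself. As a minimal sketch of the step it describes, the Maven invocation below derives the Hadoop profile from the version string reported above; the profile names (`-Pyarn`, `-Phadoop-2.6`, `-Pvendor-repo`) follow the Zeppelin 0.5.x build instructions and are assumptions that may differ for other releases:

```shell
# Version string taken from `hadoop version` on the edge node above.
hadoop_version="2.6.0-cdh5.4.8"

# Derive the matching Maven profile, e.g. 2.6.0-cdh5.4.8 -> hadoop-2.6.
hadoop_profile="hadoop-$(echo "$hadoop_version" | cut -d. -f1-2)"

# Print the build command for review; remove the `echo` to actually build
# (run inside the unpacked Zeppelin source tree).
echo mvn clean package -DskipTests -Pyarn \
    "-P${hadoop_profile}" -Pvendor-repo "-Dhadoop.version=${hadoop_version}"
```

Passing the exact CDH-flavoured `hadoop.version` ensures Zeppelin links against the same Hadoop client libraries the cluster runs.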

Mesos Stack version 0.4.0

Version 0.4.0 of our Mesos stack has been released. It updates Marathon-LB to use an upstream released version and adds a new GlusterFS role to distribute files across the Mesos cluster. Also enjoy the new and improved documentation, which is generated from the Ansible role files. Release notes, improvements: mesos-master, mesos-agent: use fully qualified host names; generate Ansible role documentation from the YAML files so it is always up to date. »

Installing a Custom Spark Version on CDH

Since Spark can be run as a YARN application, it is possible to run a Spark version other than the one provided by the Cloudera platform (CDH). This document describes how to compile a specific Spark version against the Hadoop version supported by CDH. The instructions are based on the post Running Spark 1.5.1 on CDH. Determine the version of CDH and Hadoop by running the following command on the edge node: »
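The excerpt ends before the build step, but the compile it describes can be sketched as below. This assumes the same CDH 5.4.8 cluster shown in the Zeppelin post above; the `make-distribution.sh` flags follow Spark's own building guide and the referenced "Running Spark 1.5.1 on CDH" post, so verify them against the Spark version you are compiling:

```shell
# Hadoop version as reported by `hadoop version` on the edge node
# (an assumption here; substitute your cluster's value).
hadoop_version="2.6.0-cdh5.4.8"

# Print the command for review; drop the `echo` and run it inside an
# unpacked Spark source tree to produce a distributable tarball that is
# compiled against the CDH-flavoured Hadoop client libraries.
echo ./make-distribution.sh --tgz -Pyarn -Phadoop-2.6 \
    "-Dhadoop.version=${hadoop_version}" -DskipTests
```

The resulting tarball can then be unpacked on the edge node and submitted to YARN alongside the CDH-provided Spark without touching the cluster-wide installation.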