PySpark and Jupyter: Quick Local Setup with Docker

Aspiring data scientists and data analysts looking to get started quickly with PySpark and Jupyter: here is a short write-up showing how to spin up a local workspace using Docker.

First, make sure you have Docker, docker-machine, and docker-compose installed on your machine.
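
If you would rather check the prerequisites from Python than by hand, here is a minimal sketch. It only looks for the three command names on your PATH and does not verify versions; the helper name `find_missing_tools` and the `REQUIRED_TOOLS` list are illustrative names, not part of any of these tools.

```python
import shutil

# Command names taken from the prerequisite list above.
REQUIRED_TOOLS = ["docker", "docker-machine", "docker-compose"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools were found.")
```

An empty result means all three commands were found; anything listed needs to be installed before you continue.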

  1. Create a new Docker machine:

    In your terminal, run the following commands:

    cd /to/your/workspace
    mkdir learning_pyspark && cd learning_pyspark
    mkdir -p code data notebooks
    docker-machine create -d virtualbox SciMachine
    eval "$(docker-machine env SciMachine)"

  2. Create your docker configuration files and scripts:

    In your learning_pyspark folder:

    touch Dockerfile
    touch docker-compose.yml

    The contents of Dockerfile and docker-compose.yml are below.

    Dockerfile:

    FROM jupyter/pyspark-notebook
    MAINTAINER outcastgeek <outcastgeek+docker@gmail.com>
    WORKDIR /workspace/notebooks
    CMD ["/workspace/start-notebook.sh", "--NotebookApp.base_url=/workspace"]

    docker-compose.yml:

    # The service name was lost from this listing; "pyspark" is a placeholder.
    pyspark:
      build: .
      restart: always
      ports:
        - "4040:4040"
        - "8888:8888"
      volumes:
        - .:/workspace

  3. Create your startup script:

    Still inside your learning_pyspark folder:

    touch start-notebook.sh
    chmod +x start-notebook.sh

    Again, the content of start-notebook.sh is below:


    #!/bin/bash

    # Change UID of NB_USER to NB_UID if it does not match
    if [ "$NB_UID" != "$(id -u $NB_USER)" ] ; then
        usermod -u $NB_UID $NB_USER
        chown -R $NB_UID $CONDA_DIR
    fi

    # Enable sudo if requested
    if [ ! -z "$GRANT_SUDO" ]; then
        echo "$NB_USER ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/notebook
    fi

    # Start the notebook server as NB_USER, passing along any extra arguments
    exec su $NB_USER -c "env PATH=$PATH jupyter notebook $*"

  4. Run your environment:

    From within your learning_pyspark folder

    Run your container:

    docker-compose up

    Obtain the IP address of your Docker machine (the container's ports are published on it):

    docker-machine ip SciMachine

  5. Now get to work:

    Your Jupyter workspace is available here:

    http://${SciMachine IP Address}:8888/workspace

    Create a notebook and run a PySpark workload in it. While its Spark context is running, the Spark UI will be available here:

    http://${SciMachine IP Address}:4040
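
If you need a first workload to make the Spark UI appear, the sketch below estimates pi by random sampling. It is only an illustration, not taken from the repository: the function names, the `pi-estimate` app name, and the `local[*]` master are assumptions chosen for this example. `pyspark` is imported inside the Spark function so the two small helpers can be tried on their own.

```python
import random


def inside_unit_circle(_):
    """Draw a random point in the unit square; True if it lies inside the quarter circle."""
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0


def pi_from_hits(hits, num_samples):
    """Turn a count of hits into a Monte Carlo estimate of pi."""
    return 4.0 * hits / num_samples


def estimate_pi_with_spark(num_samples=1000000):
    """Run the sampling as a Spark job and return the estimate."""
    # Imported here so the helpers above stay usable without Spark installed.
    from pyspark import SparkConf, SparkContext

    # Assumed settings: run locally on all cores under an illustrative app name.
    conf = SparkConf().setMaster("local[*]").setAppName("pi-estimate")
    sc = SparkContext.getOrCreate(conf)  # reuses a context if one already exists
    hits = sc.parallelize(range(num_samples)).filter(inside_unit_circle).count()
    return pi_from_hits(hits, num_samples)
```

In a notebook cell, call `estimate_pi_with_spark()` and then open the Spark UI address above to inspect the job. The context is deliberately left running, because the UI goes away once the Spark context is stopped.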

Feel free to clone https://github.com/outcastgeek/docker_pyspark.git and play around.

Any questions, feedback, or comments?
