
PySpark and Jupyter: Quick local setup with Docker


For aspiring Data Scientists and Data Analysts looking to get started quickly with PySpark and Jupyter, here is a short write-up showing how to spin up a local workspace using Docker.

First, make sure you have Docker, docker-machine, and docker-compose installed on your machine.
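
You can quickly check that all three tools are on your PATH:

    docker --version
    docker-machine --version
    docker-compose --version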

  1. Create a new Docker machine:

    in your terminal, run the following commands:

    cd /to/your/workspace
    mkdir learning_pyspark && cd learning_pyspark
    mkdir -p code data notebooks
    docker-machine create -d virtualbox SciMachine
    eval $(docker-machine env SciMachine)
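
    Before moving on, you can optionally confirm the new machine is running:

    docker-machine status SciMachine
    docker-machine ls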
    

  2. Create your docker configuration files and scripts:

    in your learning_pyspark folder

    touch Dockerfile
    touch docker-compose.yml
    

    The contents of Dockerfile and docker-compose.yml are below:

    Dockerfile

    FROM jupyter/pyspark-notebook
    
    MAINTAINER outcastgeek <outcastgeek+docker@gmail.com>
    
    WORKDIR /workspace/notebooks
    
    CMD ["/workspace/start-notebook.sh", "--NotebookApp.base_url=/workspace"]
    

    docker-compose.yml

    learning_pyspark:
      build: .
      restart: always
      ports:
        - "4040:4040"
        - "8888:8888"
      volumes:
        - .:/workspace
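
    If you want a quick sanity check of the compose file, docker-compose can validate it and print the resolved configuration:

    docker-compose config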
    

  3. Create your startup script:

    still inside your learning_pyspark folder

    touch start-notebook.sh
    

    Again, the content of start-notebook.sh is below:

    start-notebook.sh

    #!/bin/bash
    
    # Change UID of NB_USER to NB_UID if it does not match
    if [ "$NB_UID" != $(id -u $NB_USER) ] ; then
        usermod -u $NB_UID $NB_USER
        chown -R $NB_UID $CONDA_DIR
    fi
    
    # Enable sudo if requested
    if [ ! -z "$GRANT_SUDO" ]; then
        echo "$NB_USER ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/notebook
    fi
    
    # Start the notebook server
    exec su $NB_USER -c "env PATH=$PATH jupyter notebook $*"
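
    The container runs this script straight from the mounted volume, so it needs to be executable on your host; if it is not already, mark it executable:

    chmod +x start-notebook.sh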
    

  4. Run your environment:

    From within your learning_pyspark folder

    Run your container:

    docker-compose up
    

    Obtain the IP address of your Docker machine:

    docker-machine ip SciMachine
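
    If you prefer to keep your terminal free, you can also start the stack in the background and follow the logs separately:

    docker-compose up -d
    docker-compose logs -f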
    

  5. Now get to work:

    Your Jupyter workspace is available here:

    http://${SciMachine IP Address}:8888/workspace

    Create a notebook and run some PySpark workload in it (a minimal example follows below); the Spark UI will then be available here:

    http://${SciMachine IP Address}:4040
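
    For a first workload, here is a minimal sketch to paste into a new notebook cell; the app name and variable names are just illustrative, and the pyspark-notebook image is expected to already have pyspark on the Python path:

    from pyspark import SparkContext

    # Create a local SparkContext (illustrative app name)
    sc = SparkContext(appName="learning_pyspark")

    # Distribute a small range and square each element on the workers
    squares = sc.parallelize(range(10)).map(lambda x: x * x).collect()
    print(squares)

    While that SparkContext is alive, the Spark UI on port 4040 shows the job you just ran.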

Feel free to clone https://github.com/outcastgeek/docker_pyspark.git and play around.

Any questions, feedback, or comments?

