Configuring Mesos Fetcher & Hadoop for AWS S3

Apache Mesos bills itself as a "distributed systems kernel": it runs at a higher level of abstraction than a single machine's OS, acting as a cluster manager that provides resource isolation and sharing across distributed applications or frameworks.

My team uses Mesos heavily (and yes, I know, we aren't on Kubernetes yet: Mesos solves most of my team's problems, is pretty mature, and we have a ton of experience running it at scale). When we decided to upgrade our clusters and rewrite the infrastructure code, one of the features I was excited to use was the Fetcher.

Deployment workflow

We use Marathon as our main orchestration platform. It lets us run bundled apps and Docker images on our Mesos cluster and makes it very easy to scale them up or down.

Our usual pattern of deploying apps is —

  1. CI builds a Docker image of the app or bundles it up as an archive and pushes it to AWS S3.
  2. CD pushes a configuration JSON to Marathon with the --force-deploy flag.

For Docker images, the Marathon configuration is straightforward. For bundled archives, the cmd directive on Marathon is auto-generated by the CD pipeline to look like this —

s3cmd get s3://app-name/app.tgz && tar -xzf *.tgz && ./bin/start.sh

This works, but the generated cmd directive is awkward to read. It also means s3cmd has to be installed on every Mesos agent; s3cmd has its own set of quirks, and upgrading packages across the entire Mesos agent fleet is painful.
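
For a sense of what that looks like end to end, the CD push is roughly equivalent to the sketch below. The app id, resource values and Marathon endpoint are placeholders (our real pipeline templates them), and the force=true query parameter on Marathon's REST API is presumably what the --force-deploy flag translates to.

#!/bin/bash
# Hypothetical CD step: render the Marathon app definition and push it.
cat <<'EOF' > app.json
{
  "id": "/app-name",
  "cpus": 1,
  "mem": 512,
  "instances": 2,
  "cmd": "s3cmd get s3://app-name/app.tgz && tar -xzf *.tgz && ./bin/start.sh"
}
EOF

# PUT the app definition; force=true overrides any deployment already in flight.
curl -s -X PUT -H "Content-Type: application/json" \
  --data @app.json \
  "http://marathon.example.com:8080/v2/apps/app-name?force=true"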

Enter Mesos Fetcher

Mesos Fetcher natively supports fetching resources from HTTP, HTTPS, FTP & FTPS URIs. Additionally, it supports caching (a big win in deploy time and S3 transfer costs if you deploy often) and auto-extraction (see supported formats) of resources. If a local Hadoop client is installed, it can also fetch resources from HDFS & S3; this last bit is what we're interested in.
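
Worth noting: the fetcher has no S3 client of its own. For URIs it cannot handle natively, it shells out to the locally installed hadoop binary, doing roughly the equivalent of the command below to pull the resource into the task sandbox (the bucket and destination path here are illustrative):

hadoop fs -copyToLocal s3a://app-name/app.tgz /path/to/task/sandbox/app.tgz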

Installing Hadoop

To support fetching S3 URIs, let's first install Hadoop on our Mesos agents and set it up so that it is accessible inside the container sandbox. Ideally, this should be baked into the AMI for your Mesos agent machines.

The code is heavily documented inline, explaining why we do each of these steps.

#!/bin/bash
# Installation & configuration of Hadoop & Hadoop AWS Tools for Mesos agent.
# Battle-tested on Ubuntu 18.04.

set -e

export HADOOP_VERSION="3.2.0" # See https://hadoop.apache.org/releases.html for latest version
export BASE_DIR="/apps"
export HADOOP_APP_DIR="${BASE_DIR}/hadoop-${HADOOP_VERSION}"
export HADOOP_LOG_DIR="/logs/hadoop"
export MESOS_FETCHER_CACHE_DIR="/data/cache/mesos"

sudo mkdir -p ${HADOOP_APP_DIR} ${HADOOP_LOG_DIR} ${MESOS_FETCHER_CACHE_DIR}

# Download & extract Hadoop binary release.
cd ${HADOOP_APP_DIR}
    # Mirrors are at https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
    sudo wget -O hadoop.tar.gz http://mirrors.estointernet.in/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
    sudo tar -xzf hadoop.tar.gz --strip-components=1 -C .
    sudo rm hadoop.tar.gz
    # Cleanup unnecessary items to keep the Hadoop installation lean.
    sudo rm -rf *.txt
    sudo rm -rf share/doc
cd -

# Set up $JAVA_HOME for Hadoop.
# This locates the JAVA_HOME and updates it in Hadoop's environment file.
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
sudo sed -i "s@# export JAVA_HOME=.*@export JAVA_HOME=${JAVA_HOME}@g" ${HADOOP_APP_DIR}/etc/hadoop/hadoop-env.sh

# Turn on Hadoop-AWS optional tools.
# We need this to be able to fetch s3://, s3a:// and s3n:// URIs.
sudo sed -i "s@# export HADOOP_OPTIONAL_TOOLS=.*@export HADOOP_OPTIONAL_TOOLS=\"hadoop-aws\"@g" ${HADOOP_APP_DIR}/etc/hadoop/hadoop-env.sh

# Ask Hadoop to load the AWS SDK.
# This assumes '/root' is $HOME.
# If your Marathon tasks run as a user other than 'root' (ideally they should not run as root), change this to that user's home directory.
cat <<EOF | sudo tee /root/.hadooprc
hadoop_add_to_classpath_tools hadoop-aws

EOF

# Add Hadoop to $PATH and customize a few parameters.
# $PATH is available inside each container sandbox, making fetcher work.
# HADOOP_HOME will be picked up by Mesos in place of the --hadoop-home agent flag.
cat <<EOF | sudo tee /etc/profile.d/A00-add-hadoop.sh
export PATH="\$PATH:${HADOOP_APP_DIR}/bin"
export HADOOP_HOME="${HADOOP_APP_DIR}"
export HADOOP_LOG_DIR="${HADOOP_LOG_DIR}"
export HADOOP_ROOT_LOGGER=WARN,console

EOF

# Set up Hadoop executables to be discoverable for all users.
# This is so that just running 'hadoop' will work for everyone.
sudo sed -i -e "/ENV_SUPATH/ s[=.*[&:${HADOOP_APP_DIR}/bin[" /etc/login.defs
sudo sed -i -e "/ENV_PATH/ s[=.*[&:${HADOOP_APP_DIR}/bin[" /etc/login.defs
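
One thing the script does not cover is AWS credentials for hadoop-aws. On EC2, the simplest option is to attach an IAM instance profile with read access to your artifact buckets to the agent machines; the S3A connector picks those credentials up automatically. If instance profiles aren't an option, static keys can be configured in Hadoop's core-site.xml instead. A minimal sketch, reusing the variables from the script above and with obvious placeholder values:

cat <<EOF | sudo tee ${HADOOP_APP_DIR}/etc/hadoop/core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
EOF

Either way, it is worth confirming from an agent (after re-logging in so the profile.d changes take effect) that the hadoop client can actually reach S3 before involving the fetcher at all; the bucket here is a placeholder:

# Should list the bucket if hadoop-aws, the AWS SDK and credentials are wired up correctly.
hadoop fs -ls s3a://app-name/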

Additional Agent configuration

Once Hadoop is set up, make sure to set sensible values for these two agent configuration flags

  1. --fetcher-cache-size flag or MESOS_FETCHER_CACHE_SIZE environment variable.
  2. --fetcher-cache-dir flag or MESOS_FETCHER_CACHE_DIR environment variable.

For example —

MESOS_FETCHER_CACHE_SIZE=1GB
MESOS_FETCHER_CACHE_DIR=/data/cache/mesos
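
Where these values live depends on how your agents are launched. One option is to drop them into the environment file that the agent's service definition sources. A sketch, assuming a Debian-style package whose unit reads /etc/default/mesos-slave (the file path and unit name are assumptions; substitute whatever your packaging actually uses):

# Assumed environment file and unit name; adjust to your packaging.
cat <<EOF | sudo tee -a /etc/default/mesos-slave
MESOS_FETCHER_CACHE_SIZE=1GB
MESOS_FETCHER_CACHE_DIR=/data/cache/mesos
EOF
sudo systemctl restart mesos-slave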

Marathon configuration

Once the agents are redeployed with the Fetcher and Hadoop setup in place, you can change your Marathon deployment configuration from —

{
  "cmd": "s3cmd get s3://app-name/app.tgz && tar -xzf *.tgz && ./bin/start.sh"
}

to —

{
  "fetch": [
    {
      "uri": "s3a://app-name/app.tgz",
      "extract": true,
      "cache": true
    }
  ]
}
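
The cmd directive then reduces to just ./bin/start.sh: the fetcher downloads the archive into the task sandbox, extracts it there, and, with cache enabled, serves subsequent deployments of the same URI from the fetcher cache directory instead of going back to S3.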