Vlasov Simulations on the Cloud

How to build a modular private supercomputer with a UI

Why?

VlaPy on JAX needs GPUs

In a previous post, I discussed porting VlaPy to JAX in order to be able to use GPUs and perform differentiable programming. However, I don’t even own a GPU rig nor do I have access to supercomputing resources like those I used at universities and national labs. Thankfully, scientific computing is alive and well in the Cloud, thanks in part to the proliferation of deep learning. That’s part of the reason behind JAX’s invention and success in the first place! So the next step was to start using VlaPy with JAX on a GPU on the cloud.

At first, I ran the prototypical Cloud-starter workflow. This is usually one of two things

  • A managed Jupyter Notebook

  • A bare VM / EC2 Instance.

To prepare these, we have to provision the instance with the code we would like to run. If the code lives in a private git repo, we also have to provision the instance with credentials to clone it. If we wish to transfer data to an object store or database, we have to grant the instance the right permissions so it can write to a DB or S3 bucket.

Doing this repeatedly can be very time-consuming, and it becomes useful to adopt some Dev/ML Ops principles. I found infrastructure-as-code so powerful that going from one-off Jupyter Notebooks/EC2 instances to something akin to a Vlasov-as-a-Service platform was doable within a couple of months of side-project time.

In concrete terms, what this means is that I developed a tight integration of the solver with the computation, storage, and management components. In a way, this is easier to implement in a cloud computing environment. In a supercomputing environment, the storage and computation are very tightly coupled, but the management piece is not as easy to incorporate due to what essentially amounts to different hardware requirements. The management component often requires a DB server and needs to be accessible via the web.

This is where the promise of the cloud comes in. Since we are able to access HPC-like resources in terms of compute and storage, we can generally match what supercomputing facilities have to offer for anything short of hero runs, i.e. runs that need a million tightly coupled cores.

There is a lot of science to be done with more practical compute requirements

Even more so, there is a different kind of science to be done. With the rapid developments in data-driven techniques, having the ability to amass a large number of coarse/medium-scale simulations can be more useful than a few, highly-resolved, large-scale simulations.

TLDR

Take Vlasov simulations to the cloud so that you can create simulation software that has

  1. Scalable compute capacity
  2. Scalable storage capacity
  3. Easily accessible and usable data

Architecture Walkthrough

The DAG

You can think of this as a standard web app of sorts. In a three-tier web app, you have the data, the frontend, and the application itself. Here, we more or less combine the application layer with the frontend layer by using Streamlit. The database/storage server is more like an experiment management server here. Lastly, we also have to add the computation piece for the simulations. So for our purposes, we need

  1. Frontend / Application
  2. Computation
  3. Experiment Management & Storage

For each of these we choose

  1. Streamlit on Amazon ECS
  2. Containers on AWS Batch + EC2 GPU Instances
  3. MLFlow on Amazon ECS

The DAG needs to look like this –

Continuum - Single Simulation Flow

In this figure, there is a major glue component that holds it all together. Typically, you might have something like a message broker/queue that holds on to the request from the front-end until the back-end has time to process it. We need that here as well, in order to hold on to the simulation metadata created by the user at the front-end. We use MLFlow + AWS Batch for that purpose.

On AWS, the architecture looks like

Continuum - Single Sim Arch

The user interacts with the front-end through the web. The browser reaches an Application Load Balancer which is integrated with Cognito. This is actually very cool and makes it easy for a web dev noob like myself to create an authentication portal for my simulation manager.

For now, I will just describe how we address the 3 previously mentioned items.

Frontend - Streamlit Application

Once the user logs in, the Application Load Balancer routes the request to a container hosted using ECS Fargate. The Streamlit site does not require much in terms of compute and memory, so this container can be provisioned to be relatively small. The application is built using Streamlit, which is a very high-level Python library for creating “data” applications. For our purposes, we get dropdowns, number inputs, text inputs, etc. and the ability to store these into a Python object (typically a dictionary). This is really all we need to parameterize our simulations.

Continuum - Streamlit App
A subset of what the streamlit application can do for us
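
To give a flavor, here is a minimal sketch of what collecting these parameters might look like. The field names, defaults, and the submit_run helper are illustrative rather than the actual Continuum inputs.

import streamlit as st

# Hypothetical simulation parameters -- the real form has many more fields
all_params = {
    "solver": st.selectbox("Vlasov solver", ("leapfrog", "pefrl")),
    "nx": st.number_input("Number of spatial grid points", value=64),
    "nv": st.number_input("Number of velocity grid points", value=512),
    "tmax": st.number_input("Simulation end time", value=100.0),
    "comment": st.text_input("Run description"),
}

if st.button("Submit simulation"):
    # hand the dictionary off to the experiment manager and the job queue
    # (a sketch of submit_run appears in "Wrapping it all up" below)
    submit_run(all_params)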

The container in which the application runs is built using GitHub Actions and pushed to a private ECR repository. Using the CDK, we create a Fargate ECS Service. We use Fargate because the compute requirements are negligible and, most importantly, ignorable. To create a Service, one defines a Cluster (which is just an object that can house ECS Services within a VPC), then creates a Service and attaches a Task Definition to it. The Task Definition is where the container details are provided, and also where the IAM roles and security groups are defined.
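
A stripped-down sketch of that CDK code might look like the following. The identifiers, image tag, and sizes are illustrative, and the full version also grants the task role permission to submit Batch jobs and talk to MLFlow.

from aws_cdk import aws_ec2 as ec2, aws_ecs as ecs


def setup_streamlit_app(stack, app_prefix=""):
    # The Cluster is just a logical grouping of ECS Services inside the VPC;
    # it is reused later for the MLFlow service as stack.cluster
    stack.cluster = ecs.Cluster(stack, app_prefix + "-Cluster", vpc=stack.vpc)

    # The Task Definition holds the container details; Streamlit needs very little
    ui_task_def = ecs.FargateTaskDefinition(stack, app_prefix + "-UITaskDef", cpu=256, memory_limit_mib=512)
    ui_container = ui_task_def.add_container(
        app_prefix + "-UIContainer",
        image=ecs.ContainerImage.from_ecr_repository(repository=stack.ecr_repo, tag="latest-ui"),
        logging=ecs.LogDriver.aws_logs(stream_prefix=app_prefix + "-ui_log"),
    )
    ui_container.add_port_mappings(ecs.PortMapping(protocol=ecs.Protocol.TCP, container_port=80))

    # The Service keeps one copy of the container running on Fargate
    return ecs.FargateService(
        stack,
        app_prefix + "-UIService",
        cluster=stack.cluster,
        task_definition=ui_task_def,
        desired_count=1,
        vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT),
    )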

Experiment Management & Storage - MLFlow Server

The MLFlow server is hosted similarly, with a container on ECS Fargate. The best way to run MLFlow in production is to back it with a database store and an object store. On AWS, we use a PostgreSQL RDS instance (a t3.micro works for our purposes) and an S3 bucket.

The RDS instance is provisioned using the CDK. It is easy to create a secret by using rds.DatabaseSecret. This secret is needed by the MLFlow application to be able to connect to the RDS instance. It is provided here by passing the rds.DatabaseSecret object to the container definition using the secrets keyword. As someone with little experience with databases, I found this to be a very simple way of creating a secure database without ever writing down a password.

We also need to provide some way for the rest of the compute infrastructure to communicate with the MLFlow server, which does not have a static IP. We choose to use Cloud Map and service discovery for this. Cloud Map allows the user to specify a host domain internal to the VPC for the ECS Service; we can call this domain continuum. Using service discovery, we can call the MLFlow Service mlfs, such that http://mlfs.continuum reaches the Fargate Service that is hosting the MLFlow server.

Here’s what this code looks like

from aws_cdk import (
    aws_rds as rds,
    aws_ec2 as ec2,
    aws_ecs as ecs,
    aws_servicediscovery as serv_disc,
)


def __setup_rds_db__(stack, vpc, app_prefix=""):
    db_name = "postgres"
    # Generates a username/password pair in Secrets Manager -- no password is ever written down by hand
    db_secret = rds.DatabaseSecret(stack, app_prefix + "-mlfs_db_secret", username="db_username")

    this_rds = rds.DatabaseInstance(
        stack,
        app_prefix + "-mlfs-dbi",
        engine=rds.DatabaseInstanceEngine.POSTGRES,
        vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT),
        instance_type=ec2.InstanceType("t3.micro"),
        vpc=vpc,
        database_name=db_name,
        allocated_storage=8,
        credentials=rds.Credentials.from_secret(db_secret),
    )

    # host:port/dbname -- consumed by the MLFlow entrypoint as BACKEND_STORE_URI
    backend_store_uri = this_rds.db_instance_endpoint_address + ":" + this_rds.db_instance_endpoint_port + "/" + db_name

    return this_rds, db_secret, backend_store_uri


def setup_mlfs(
    stack,
    namespace,
    mlflow_public_path="",
    app_prefix="",
):
    this_rds, db_secret, backend_store_uri = __setup_rds_db__(stack, stack.vpc, app_prefix=app_prefix)

    # The MLFlow server itself is small; the task role needs S3 access for the artifact store
    mlfs_task_def = ecs.FargateTaskDefinition(stack, app_prefix + "-MLFSTaskDef", cpu=512, memory_limit_mib=1024)
    mlfs_task_def.add_to_task_role_policy(statement=stack.s3_dict["policy_statement"])

    mlfs_container = mlfs_task_def.add_container(
        app_prefix + "-MLFSContainer",
        image=ecs.ContainerImage.from_ecr_repository(repository=stack.ecr_repo, tag="latest-mlfs"),
        logging=ecs.LogDriver.aws_logs(stream_prefix=app_prefix + "-mlfs_log"),
        environment={
            "BACKEND_STORE_URI": backend_store_uri,
            "ARTIFACT_STORE_URI": stack.s3_dict["artifact_store_uri"],
            "MLFS_PATH": mlflow_public_path,
        },
        secrets={"DB_PASSWORD": ecs.Secret.from_secrets_manager(db_secret, "password")},
    )
    mlfs_container.add_port_mappings(ecs.PortMapping(protocol=ecs.Protocol.TCP, container_port=80))

    mlfs_service = ecs.FargateService(
        stack,
        app_prefix + "-MLFSService",
        cluster=stack.cluster,
        task_definition=mlfs_task_def,
        vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT),
        assign_public_ip=False,
        desired_count=1,
    )
    # Register the service with Cloud Map so it is reachable at http://mlfs.<namespace> inside the VPC
    mlfs_service.enable_cloud_map(
        cloud_map_namespace=namespace,
        dns_record_type=serv_disc.DnsRecordType.A,
        name="mlfs",
    )

    return {
        "service": mlfs_service,
        "container": mlfs_container,
        "path": mlflow_public_path,
        "db": this_rds,
    }

This is what the entrypoint code that uses those environment variables looks like

#!/bin/sh

# Clean up any runs that have been marked as deleted
mlflow gc --backend-store-uri "postgresql://db_username:$DB_PASSWORD@$BACKEND_STORE_URI"

# Serve the tracking UI/API on port 80, under the path prefix that the load balancer routes to
mlflow server \
    --backend-store-uri "postgresql://db_username:$DB_PASSWORD@$BACKEND_STORE_URI" \
    --default-artifact-root "$ARTIFACT_STORE_URI" \
    --host 0.0.0.0 --port 80 --static-prefix "$MLFS_PATH"

The MLFS_PATH is the path prefix under which the Application Load Balancer routes traffic to the MLFlow UI. This is discussed in more detail later in the post.

Computation - AWS Batch and Containers

To perform the calculations, we would like to be able to “submit jobs” to something that resembles a supercomputer on the cloud. For these purposes, the requirements are straightforward. Having a T4 GPU + ~400 GB in storage per simulation is adequate. We don’t need the storage to be particularly performant so gp3 should be fine.

We use AWS Batch to handle the queueing of the jobs and the provisioning of the containers. AWS Batch is a three-tier abstraction.

First, one must define a Compute Environment. This is simply an abstraction of the hardware you might need. For us, T4 + ~400 GB. You can also specify if you would like On Demand or Spot Instances. If your runs can be checkpointed and restarted, it can be very cost-efficient to be able to use Spot Instances. We typically do so ourselves.

Second, one creates a Job Queue and attaches it to one or more compute environments. This is lightweight.
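
Before getting to the third piece, here is a rough sketch of the first two using the low-level (CloudFormation-flavored) CDK constructs. The instance types, vCPU limits, and role/security-group wiring are assumptions; the higher-level aws_batch constructs are more ergonomic but their API has shifted across CDK versions.

from aws_cdk import aws_batch as batch


def setup_batch_queue(stack, instance_profile_arn, security_group_id, app_prefix=""):
    # Managed compute environment: Spot-priced g4dn (T4) instances, scaled down to zero when idle
    gpu_ce = batch.CfnComputeEnvironment(
        stack,
        app_prefix + "-gpu-ce",
        type="MANAGED",
        compute_resources=batch.CfnComputeEnvironment.ComputeResourcesProperty(
            type="SPOT",
            minv_cpus=0,
            maxv_cpus=64,
            instance_types=["g4dn.xlarge"],
            subnets=[subnet.subnet_id for subnet in stack.vpc.private_subnets],
            security_group_ids=[security_group_id],
            instance_role=instance_profile_arn,
        ),
    )

    # The job queue is lightweight -- it just points at one or more compute environments
    return batch.CfnJobQueue(
        stack,
        app_prefix + "-sim-queue",
        priority=1,
        compute_environment_order=[
            batch.CfnJobQueue.ComputeEnvironmentOrderProperty(compute_environment=gpu_ce.ref, order=1)
        ],
    )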

Third, one creates a Job Definition, which is very much like an ECS Task Definition. This is the object that holds the container specs, IAM roles, and security groups. Because AWS Batch actually uses ECS under the hood, one tricky detail here was that the IAM role for the Task/Job itself requires a CompositePrincipal, i.e. it is responsible for working with more than one service, namely EC2 and ECS. It also needs permissions to be able to log to CloudWatch.

from aws_cdk import aws_iam as iam

# The job role can be assumed by both ECS tasks and EC2 instances, and can write logs/metrics
calculator_role = iam.Role(
    stack,
    app_prefix + "-calculatorRole",
    assumed_by=iam.CompositePrincipal(
        iam.ServicePrincipal("ecs-tasks.amazonaws.com"),
        iam.ServicePrincipal("ec2.amazonaws.com"),
    ),
)
calculator_role.add_to_policy(iam.PolicyStatement(resources=["*"], actions=["logs:*", "cloudwatch:*"]))
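
With that role in hand, the Job Definition itself can be sketched with the low-level construct as well. The image tag, vCPU/memory numbers, and GPU count below are assumptions.

from aws_cdk import aws_batch as batch

sim_job_def = batch.CfnJobDefinition(
    stack,
    app_prefix + "-sim-job-def",
    type="container",
    container_properties=batch.CfnJobDefinition.ContainerPropertiesProperty(
        image=stack.ecr_repo.repository_uri_for_tag("latest-solver"),
        job_role_arn=calculator_role.role_arn,
        resource_requirements=[
            # one T4 GPU, 4 vCPUs, 16 GiB of RAM per simulation
            batch.CfnJobDefinition.ResourceRequirementProperty(type="GPU", value="1"),
            batch.CfnJobDefinition.ResourceRequirementProperty(type="VCPU", value="4"),
            batch.CfnJobDefinition.ResourceRequirementProperty(type="MEMORY", value="16384"),
        ],
    ),
)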

Wrapping it all up

The Streamlit application needs to submit jobs to the AWS Batch Queues. It also needs to store the input parameters to the simulation in the experiment manager. Each time we submit a job, the application performs those two steps.

The flow is as follows: (1) all the parameters are collected into a dictionary, (2) an MLFlow run is started, (3) the parameters are stored as an artifact within that run, and (4) the unique run ID is passed to the boto3 AWS Batch submit-job call.
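
In code, that sequence looks roughly like the following. The queue and job-definition names, the artifact file name, and the MLFLOW_RUN_ID environment variable are assumptions.

import json
import os
import tempfile

import boto3
import mlflow

mlflow.set_tracking_uri("http://mlfs.continuum")  # the Cloud Map name from earlier
batch_client = boto3.client("batch")


def submit_run(all_params: dict):
    # (1)-(3): start an MLFlow run and store the parameter dictionary as an artifact
    with mlflow.start_run(run_name=all_params.get("comment", "continuum-run")) as run:
        with tempfile.TemporaryDirectory() as tmp_dir:
            params_path = os.path.join(tmp_dir, "params.json")
            with open(params_path, "w") as f:
                json.dump(all_params, f)
            mlflow.log_artifact(params_path)

    # (4): pass the unique run ID to AWS Batch so the solver container can find its inputs
    batch_client.submit_job(
        jobName=f"continuum-{run.info.run_id[:8]}",
        jobQueue="continuum-sim-queue",
        jobDefinition="continuum-sim-job-def",
        containerOverrides={"environment": [{"name": "MLFLOW_RUN_ID", "value": run.info.run_id}]},
    )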

When AWS Batch runs that job, the code that is executed reads the unique run ID, goes to the MLFlow server to ask what parameters belong to that run ID, downloads and loads the parameter dictionary, and then executes that job! Any metrics and artifacts get logged back to that same run. In fact, as the architecture diagram shows, we implement an additional post-processing step where we need a larger RAM machine to generate the plots. That is, after the computation is completed, the raw data gets logged and the GPU machine submits another job, this time to the post-processing queue. This gets picked up by a machine that has 128 GB of RAM, which is enough to handle our post-processing needs. 🎉
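
On the solver side, a sketch of the container entrypoint, again with assumed names, might look like this. The run_vlasov_simulation function and the structure of its results are hypothetical.

import json
import os

import boto3
import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://mlfs.continuum")


def main():
    # The run ID injected by the submit_job call above
    run_id = os.environ["MLFLOW_RUN_ID"]

    # Fetch the parameter dictionary that the front-end attached to this run
    local_path = MlflowClient().download_artifacts(run_id, "params.json")
    with open(local_path) as f:
        all_params = json.load(f)

    # Resume the same run so metrics and raw data land in the same place
    with mlflow.start_run(run_id=run_id):
        results = run_vlasov_simulation(all_params)  # hypothetical solver entrypoint
        mlflow.log_metrics(results["scalars"])
        mlflow.log_artifact(results["raw_data_path"])

    # Hand off to the large-memory post-processing queue
    boto3.client("batch").submit_job(
        jobName=f"postprocess-{run_id[:8]}",
        jobQueue="continuum-postprocess-queue",
        jobDefinition="continuum-postprocess-job-def",
        containerOverrides={"environment": [{"name": "MLFLOW_RUN_ID", "value": run_id}]},
    )


if __name__ == "__main__":
    main()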

Other

There are two other miscellaneous but important components of what we have built: we need to be able to reach all of this in a web browser, and to authenticate into the application as well as the experiment manager.

ALB

To do this, we create an Application Load Balancer that is hit when one navigates to our domain https://continuum-public.ergodic.io. Since we have already created our two user-facing services, the Streamlit application and the MLFlow application, we can just attach them to the ALB using path-based routing: /experiments takes you to the MLFlow server, and / goes to the Streamlit application.
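
A sketch of that routing in CDK looks like the following. The certificate ARN, rule priority, and construct names are placeholders, and the Cognito authentication action is omitted.

from aws_cdk import aws_elasticloadbalancingv2 as elbv2


def setup_alb(stack, streamlit_service, mlfs_service, cert_arn, app_prefix=""):
    alb = elbv2.ApplicationLoadBalancer(stack, app_prefix + "-ALB", vpc=stack.vpc, internet_facing=True)
    listener = alb.add_listener(
        app_prefix + "-HttpsListener",
        port=443,
        certificates=[elbv2.ListenerCertificate.from_arn(cert_arn)],
    )

    # /experiments* is routed to the MLFlow service (this is where MLFS_PATH comes in)
    listener.add_targets(
        app_prefix + "-MLFSTarget",
        port=80,
        targets=[mlfs_service],
        priority=10,
        conditions=[elbv2.ListenerCondition.path_patterns(["/experiments*"])],
    )

    # Everything else falls through to the Streamlit application
    listener.add_targets(app_prefix + "-UITarget", port=80, targets=[streamlit_service])

    return alb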

Cognito

There is some (unfortunately convoluted) CDK code for Cognito sitting in front of one of these Application Load Balancers. I won’t share it here but please reach out if you need help with this.

Conclusion

In this post, we went through an approach with which one can create a “private supercomputer” on the cloud. We can submit jobs to it using a Web App. Each run is logged to an Experiment Manager (MLFlow) that has effectively unlimited storage due to S3, and a performant long-lived database. We can access the experiment manager through a Web App as well. This is all built using the AWS CDK, which is an imperative API (with Python support!) that makes gluing all these pieces together a very “Lego”-like experience. Plus, all of this can be versioned, and transferred to others. If you are interested in implementing something like this, feel free to reach out.

This gives us access to as many T4 GPUs (or whatever else they may have at the time of your reading) as we need, at Spot pricing of roughly 90c/hr, i.e. very affordable. I have discussed some benchmarks of running JAX code on GPUs in a previous post. As of this writing, this is how I perform my Vlasov-Maxwell-Fokker-Planck simulations while not having access to an NSF/DOE funded supercomputer.

In my opinion, this workflow enables citizen science because we are able to perform serious simulations at very cost-effective rates without needing access to NSF/DOE funded facilities. In fact, this workflow enables more than what is typically afforded to scientists because here, we can provision different, specialized compute for the web app, the experiment manager, the PDE solve, and the post-processing. This is something that is not easy to do within a traditional academic/national lab supercomputer workflow.

Dr. Archis Joglekar
ML Researcher | Research Engineer | Theoretical Physicist

I like doing math with computers. I got a PhD in fusion plasma physics. It happens to be a perfect blend of applied mathematics, physics, and computing. I used to use supercomputers to do the math, now I use the cloud. I also like to do written math. I am currently working on something new at the intersection of deep learning and fusion. I am an Affiliate Researcher with the Laboratory for Laser Energetics and I am also an Adjunct Professor at the University of Michigan.