Work with GPUs¶

BentoML provides a streamlined approach to deploying Services that require GPU resources for inference tasks.

This document explains how to configure and allocate GPUs to run inference with BentoML.

Configure GPU resources¶

When creating your BentoML Service, you need to make sure your Service implementation has the correct GPU configuration.

A single device¶

When a single GPU is available, frameworks like PyTorch and TensorFlow default to using cuda:0 or cuda. In PyTorch, for example, you assign a model to the GPU with .to('cuda:0'). Here is an example of setting up a BentoML Service to use a single GPU:

@bentoml.service(resources={"gpu": 1})
class MyService:
    def __init__(self):
        import torch
        self.model = torch.load('model.pth').to('cuda:0')  # Place the model on the first (and only) GPU

Multiple devices¶

In systems with multiple GPUs, each GPU is assigned an index starting from 0 (cuda:0, cuda:1, cuda:2, etc.). You can specify which GPU to use or distribute operations across multiple GPUs.

To assign models to specific GPUs:

@bentoml.service(resources={"gpu": 2})
class MultiGPUService:
    def __init__(self):
        import torch
        self.model1 = torch.load('model1.pth').to("cuda:0")  # Using the first GPU
        self.model2 = torch.load('model2.pth').to("cuda:1")  # Using the second GPU

The following diagram shows how different models use the GPUs assigned to them.

[Figure: gpu-inference-architecture — models mapped to their assigned GPUs]

Note

Workers are the processes that actually run the code logic within a BentoML Service. By default, a BentoML Service has one worker. It is possible to set multiple workers and allocate specific GPUs to individual workers. See Parallelize requests handling for details.
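For example, here is a minimal sketch of pinning each worker to its own GPU, assuming the 1-based bentoml.server_context.worker_index described in that guide:

import bentoml

@bentoml.service(workers=2, resources={"gpu": 2})
class MyService:
    def __init__(self):
        import torch
        # worker_index is 1-based: worker 1 -> cuda:0, worker 2 -> cuda:1
        device = f"cuda:{bentoml.server_context.worker_index - 1}"
        self.model = torch.load('model.pth').to(device)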

If you want to use multiple GPUs for distributed operations (multiple GPUs for the same worker), PyTorch and TensorFlow offer their own methods, such as torch.nn.parallel.DistributedDataParallel and torch.nn.DataParallel in PyTorch, or tf.distribute.Strategy in TensorFlow.
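For instance, a minimal PyTorch sketch using torch.nn.DataParallel to split inference batches across both GPUs assigned to one worker (model.pth is a placeholder path):

@bentoml.service(resources={"gpu": 2})
class DistributedService:
    def __init__(self):
        import torch
        model = torch.load('model.pth')
        # Replicate the model on both GPUs; inputs are split along the batch dimension
        self.model = torch.nn.DataParallel(model, device_ids=[0, 1]).to('cuda:0')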

GPU deployment¶

When using PyTorch or TensorFlow to run models on GPUs, we recommend installing them directly via pip, along with their respective CUDA dependencies. This ensures:

  • Minimal package size since only the required components are installed.

  • Better compatibility as the correct CUDA version is automatically installed alongside the frameworks.

For development, install PyTorch or TensorFlow with the appropriate CUDA version using the following commands:

pip install torch
pip install tensorflow[and-cuda]

When building your Bento, you DO NOT need to specify cuda_version again in your bentofile.yaml to install the CUDA toolkit separately. Simply add PyTorch or TensorFlow under packages, or include them in a separate requirements.txt file.

python:
  packages:
    - torch
    - tensorflow[and-cuda]

If you want to customize the installation of the CUDA driver and libraries, use the system_packages, setup_script, or base_image options under the docker field.
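For example, here is a sketch of a bentofile.yaml using a custom CUDA base image (the image tag and package names are illustrative assumptions, not requirements):

docker:
  base_image: "nvidia/cuda:12.2.2-cudnn8-runtime-ubuntu22.04"
  system_packages:
    - curl
python:
  packages:
    - torch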

BentoCloud¶

When deploying on BentoCloud, specify resources with gpu or gpu_type in the @bentoml.service decorator to allow BentoCloud to allocate the necessary GPU resources:

@bentoml.service(
    resources={
        "gpu": 1, # The number of allocated GPUs
        "gpu_type": "nvidia-l4" # A specific GPU type on BentoCloud
    }
)
class MyService:
    # Service implementation

To list available GPU types on your BentoCloud account, run:

$ bentoml deployment list-instance-types

Name        Price  CPU    Memory  GPU  GPU Type
cpu.1       *      500m   2Gi
cpu.2       *      1000m  2Gi
cpu.4       *      2000m  8Gi
cpu.8       *      4000m  16Gi
gpu.t4.1    *      2000m  8Gi     1    nvidia-tesla-t4
gpu.l4.1    *      4000m  16Gi    1    nvidia-l4
gpu.a100.1  *      6000m  43Gi    1    nvidia-tesla-a100

After your Service is ready, you can deploy it to BentoCloud by running the bentoml deploy command. See Create Deployments for details.

Docker¶

You need to install the NVIDIA Container Toolkit to run Docker containers with NVIDIA GPUs. NVIDIA provides detailed instructions for installing both Docker CE and nvidia-docker.

After you build a Docker image for your Bento with bentoml containerize, you can run it on all available GPUs like this:

docker run --gpus all -p 3000:3000 bento_image:latest

You can also use the --device option to expose specific GPU devices to the container:

docker run --gpus all --device /dev/nvidia0 \
            --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
            --device /dev/nvidia-modeset --device /dev/nvidiactl <docker-args>
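Alternatively, the --gpus flag accepts specific device indices or UUIDs (note the extra quoting Docker requires around the device list):

docker run --gpus '"device=0,1"' -p 3000:3000 bento_image:latest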

To check whether a BentoML Service or Bento is using the GPU, use the nvidia-smi tool. You can run it in a separate terminal while your BentoML Service is handling requests.

# Refresh the output every second
watch -n 1 nvidia-smi

Example output:

Every 1.0s: nvidia-smi                            ps49pl48tek0: Mon Jun 17 13:09:46 2024

Mon Jun 17 13:09:46 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:00:05.0 Off |                    0 |
| N/A   30C    P0              60W / 400W |   3493MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1813      G   /usr/lib/xorg/Xorg                           70MiB |
|    0   N/A  N/A      1946      G   /usr/bin/gnome-shell                         78MiB |
|    0   N/A  N/A     11197      C   /Home/Documents/BentoML/demo/bin/python     3328MiB |
+---------------------------------------------------------------------------------------+

For more information, see the Docker documentation.

Limit GPU visibility¶

By setting CUDA_VISIBLE_DEVICES to the IDs of the GPUs you want to use, you can limit BentoML to only use certain GPUs for your Service. GPU IDs are typically numbered starting from 0. For example:

  • CUDA_VISIBLE_DEVICES=0 makes only the first GPU visible.

  • CUDA_VISIBLE_DEVICES=1,2 makes the second and third GPUs visible.
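For example, here is a sketch of serving with only the second and third GPUs visible (service:MyService is a placeholder for your own Service target):

CUDA_VISIBLE_DEVICES=1,2 bentoml serve service:MyService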