Search - BentoML

Get Started

Hello world
Adaptive batching
Model composition
Async task queues
Packaging for deployment
Cloud deployment

Learn by Examples

Overview
LLM inference: vLLM
Agent: Function calling
Agent: LangGraph
LLM safety: ShieldGemma
RAG: Document ingestion and search
Stable Diffusion XL Turbo
ComfyUI: Deploy workflows as APIs
ControlNet
MLflow
XGBoost

Build with BentoML

Create online API Services
Define input and output types
Load and manage models
Work with GPUs
Call an API endpoint
Parallelize requests handling
Define the runtime environment
Run distributed Services
Configure template arguments
Configure lifecycle hooks
Mount ASGI applications
Stream responses
Define a WebSocket endpoint
Add a UI with Gradio
Observability
- Monitoring
- Logging
- Metrics
- Tracing
Customize error responses
Test API endpoints

Scale with BentoCloud

Deployment
Scaling
- Concurrency and autoscaling
- Scale across multiple regions with Gateways
Manage secrets
Manage API tokens
Develop with Codespaces
Administering

References

BentoML
- Bento and model APIs
- BentoML SDK
- Bento build options
- BentoML CLI
- Client API
- Framework APIs
  - Diffusers
  - ONNX
  - Scikit-Learn
  - Transformers
  - Flax
  - TensorFlow
  - TorchScript
  - XGBoost
  - Picklable Model
  - PyTorch
  - LightGBM
  - MLflow
  - CatBoost
  - fast.ai
  - EasyOCR
  - Keras
  - Ray
  - Detectron
- Configurations
- Batch inference
- Exceptions
- Container APIs
- Types
BentoCloud

Copyright © 2022-2026, bentoml.com