RAG: Document ingestion and search

A retrieval-augmented generation (RAG) system retrieves relevant information from an external knowledge base and supplies it to an LLM as context for generating a response. This improves the accuracy and relevance of the LLM’s response, especially in domains that require up-to-date or factual information.
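
The retrieve-then-augment flow described above can be sketched in a few lines of plain Python. This is a toy illustration only, not BentoML's API or the approach used in the RAG tutorials: the `DOCUMENTS` list, the `retrieve` and `build_prompt` helpers, and the prompt wording are all hypothetical, and word-count cosine similarity stands in for the embedding model and vector store a real system would likely use.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory knowledge base (assumption for illustration).
# A real RAG system would ingest documents and index their embeddings.
DOCUMENTS = [
    "BentoML packages models into deployable units called Bentos.",
    "Adaptive batching groups incoming requests to improve throughput.",
    "A RAG system retrieves documents and passes them to an LLM as context.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, documents, top_k=1):
    """Augment the user query with retrieved context before calling an LLM."""
    context = "\n".join(retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    # The resulting prompt would be sent to an LLM; that call is omitted here.
    print(build_prompt("What does adaptive batching do?", DOCUMENTS))
```

The sketch stops at building the prompt because the generation step depends on which LLM is served. In practice, the retrieval step is where up-to-date or domain-specific knowledge enters the response.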

For more information, see the RAG tutorials, which walk through building a RAG application with open-source models and BentoML.

Copyright © 2022-2026, bentoml.com