Agent: LangGraph¶
LangGraph is an open-source library for building stateful, multi-actor applications with LLMs. It allows you to define diverse control flows to create agent and multi-agent workflows.
This document demonstrates how to serve a LangGraph agent application with BentoML.
The example LangGraph agent invokes DuckDuckGo to retrieve the latest information when the LLM used lacks the necessary knowledge. For example:
{
"query": "Who won the gold medal at the men's 100 metres event at the 2024 Summer Olympic?"
}
Example output:
Noah Lyles (USA) won the gold medal at the men's 100 metres event at the 2024 Summer Olympic Games. He won by five-thousandths of a second over Jamaica's Kishane Thompson.
This example is ready for easy deployment and scaling on BentoCloud. You can use either external LLM APIs or deploy an open-source LLM together with the LangGraph agent. With a single command, you get a production-grade application with fast autoscaling, secure deployment in your cloud, and comprehensive observability.
Architecture¶
This project consists of two main components: a BentoML Service that serves a LangGraph agent as REST APIs and an LLM that generates text. The LLM can be an external API like Claude 3.5 Sonnet or an open-source model served via BentoML (Mistral 7B in this example).
After a user submits a query, it is processed through the LangGraph agent, which includes:

An agent node that uses the LLM to understand the query and decide on actions.
A tools node that can invoke external tools if needed.

In this example, if the LLM needs additional information, the tools node calls DuckDuckGo to search the internet for the necessary data. DuckDuckGo then returns the search results to the agent, which compiles the information and delivers the final response to the user.
Code explanations¶
This example contains the following two sub-projects that demonstrate the use of different LLMs:
langgraph-anthropic uses Claude 3.5 Sonnet
langgraph-mistral uses Mistral 7B Instruct
Both sub-projects follow the same logic for implementing the LangGraph agent. This document explains the key code implementation in langgraph-mistral.
mistral.py¶
The mistral.py file defines a BentoML Service MistralService that serves the Mistral 7B model. You can switch to a different model by changing the MODEL_ID if necessary.
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"
MistralService provides OpenAI-compatible APIs and uses vLLM as the inference backend. It is a dependent BentoML Service and can be invoked by the LangGraph agent. For more information, see LLM inference: vLLM.
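Because the Service exposes OpenAI-compatible endpoints, you can also exercise it directly with the official openai client once it is served on its own (for example with bentoml serve mistral:MistralService). This is a minimal sketch that assumes the server listens on localhost:3000 and uses the /v1 prefix referenced later in service.py:

from openai import OpenAI

# Point the standard OpenAI client at the locally served MistralService;
# the API key is unused by the local server but required by the client.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="N/A")
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)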
service.py¶
The service.py file defines SearchAgentService, a BentoML Service that wraps the LangGraph agent and calls the MistralService.
Create a Python class and decorate it with @bentoml.service, which transforms it into a BentoML Service. You can optionally set configurations like workers and concurrency.

@bentoml.service(
    workers=2,
    resources={
        "cpu": "2000m"
    },
    traffic={
        "concurrency": 16,
        "external_queue": True
    }
)
class SearchAgentService:
    ...
For deployment on BentoCloud, we recommend you set concurrency and enable external_queue. Concurrency refers to the number of requests the Service can handle at the same time. With external_queue enabled, if the application receives more than 16 requests simultaneously, the extra requests are placed in an external queue. They will be processed once the current ones are completed, allowing you to handle traffic spikes without dropping requests.

Define the logic to call the MistralService. Use the bentoml.depends() function to invoke it, which allows SearchAgentService to utilize all its functionalities, such as calling its OpenAI-compatible API endpoints.

from mistral import MistralService
from langchain_openai import ChatOpenAI

...

class SearchAgentService:
    # OpenAI compatible API
    llm_service = bentoml.depends(MistralService)

    def __init__(self):
        openai_api_base = f"{self.llm_service.client_url}/v1"
        self.model = ChatOpenAI(
            model="mistralai/Mistral-7B-Instruct-v0.3",
            openai_api_key="N/A",
            openai_api_base=openai_api_base,
            temperature=0,
            verbose=True,
            http_client=self.llm_service.to_sync.client,
        )
        # Logic to call the model, create LangGraph graph and add nodes & edges
        ...
Once the Mistral Service is injected, use the ChatOpenAI API from langchain_openai to configure an interface to interact with it. Since the MistralService provides OpenAI-compatible API endpoints, you can use its HTTP client (to_sync.client) and client URL (client_url) to easily construct an OpenAI client for interaction.

After that, define the LangGraph workflow that uses the model. The LangGraph agent will call this model and build its flow with nodes and edges, connecting the outputs of the LLM with the rest of the system. For detailed explanations of implementing LangGraph workflows, see the LangGraph documentation.
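As a rough sketch of how that graph construction might look where the "create LangGraph graph and add nodes & edges" comment sits in __init__ (the DuckDuckGo tool, node names, and the should_continue router below are illustrative assumptions, not necessarily the project's exact code):

from langchain_community.tools import DuckDuckGoSearchRun
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode

# Inside SearchAgentService.__init__, after self.model has been created:
tools = [DuckDuckGoSearchRun()]
model_with_tools = self.model.bind_tools(tools)

def call_model(state: MessagesState):
    # Agent node: let the LLM decide whether to answer or request a tool call
    return {"messages": [model_with_tools.invoke(state["messages"])]}

def should_continue(state: MessagesState):
    # Route to the tools node when the last LLM message requested a tool call
    return "tools" if state["messages"][-1].tool_calls else END

workflow = StateGraph(MessagesState)
workflow.add_node("agent", call_model)
workflow.add_node("tools", ToolNode(tools))
workflow.add_edge(START, "agent")
workflow.add_conditional_edges("agent", should_continue)
workflow.add_edge("tools", "agent")
self.app = workflow.compile()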
Define a BentoML task endpoint invoke with @bentoml.task to handle the LangGraph workflow asynchronously. It is a background task that supports long-running operations. This ensures that complex LangGraph workflows involving external tools can complete without timing out. After sending the user's query to the LangGraph agent, the task retrieves the final state and provides the results back to the user.
# Define a task endpoint
@bentoml.task
async def invoke(
    self,
    input_query: str = "What is the weather in San Francisco today?",
) -> str:
    try:
        # Invoke the LangGraph agent workflow asynchronously
        final_state = await self.app.ainvoke(
            {"messages": [HumanMessage(content=input_query)]}
        )
        # Return the final message from the workflow
        return final_state["messages"][-1].content
    # Handle errors that may occur during model invocation
    except OpenAIError as e:
        print(f"An error occurred: {e}")
        import traceback
        print(traceback.format_exc())
        return "I'm sorry, but I encountered an error while processing your request. Please try again later."
Tip
We recommend you use a task endpoint for this LangGraph agent application. This is because the LangGraph agent often uses multi-step workflows including querying an LLM and invoking external tools. Such workflows may take longer than the typical HTTP request cycle. If handled synchronously, your application could face request timeouts, especially under high traffic. BentoML task endpoints solve this problem by offloading long-running tasks to the background. You can send a query and check back later for the results, ensuring smooth inference without timeouts.
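As an illustration of this check-back-later pattern, here is a hedged sketch of calling the task endpoint with the BentoML Python client against a locally served app; the URL is a placeholder, and submit, get_status, and get are the task-style client calls:

import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    # Submit the query as a background task; this returns immediately
    task = client.invoke.submit(
        input_query="Who won the gold medal at the men's 100 metres event at the 2024 Summer Olympic?"
    )
    # Check back later: poll the status, then fetch the final result
    print(task.get_status())
    result = task.get()
    print(result)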
Optionally, add a streaming API to send intermediate results in real time. Use @bentoml.api to turn the stream function into an API endpoint and call astream_events to stream events generated by the LangGraph agent.

@bentoml.api
async def stream(
    self,
    input_query: str = "What is the weather in San Francisco today?",
) -> AsyncGenerator[str, None]:
    # Loop through the events generated by the LangGraph workflow
    async for event in self.app.astream_events(
        {"messages": [HumanMessage(content=input_query)]}, version="v2"
    ):
        # Yield each event and stream it back
        yield str(event) + "\n"
For more information about the astream_events API, see the LangGraph documentation.
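On the client side, a streaming endpoint like this can be consumed by iterating over the call. A minimal sketch with the BentoML Python client, assuming the client exposes the endpoint as an iterator and the URL is a placeholder:

import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    # Each chunk is one serialized LangGraph event followed by a newline
    for chunk in client.stream(input_query="What is the weather in San Francisco today?"):
        print(chunk, end="")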
bentofile.yaml¶
This configuration file defines the build options for a Bento, the unified distribution format in BentoML, which contains source code, Python packages, model references, and environment setup. It helps ensure reproducibility across development and production environments.
Here is an example file for BentoLangGraph/langgraph-mistral:
service: "service:SearchAgentService"
labels:
author: "bentoml-team"
project: "langgraph-example"
include:
- "*.py"
python:
requirements_txt: "./requirements.txt"
lock_packages: false
envs:
# Set HF environment variable here or use BentoCloud secret
- name: HF_TOKEN
docker:
python_version: "3.11"
Try it out¶
You can run this example project on BentoCloud, or serve it locally, containerize it as an OCI-compliant image, and deploy it anywhere.
BentoCloud¶
BentoCloud provides fast and scalable infrastructure for building and scaling AI applications with BentoML in the cloud.
Install BentoML and log in to BentoCloud through the BentoML CLI. If you don’t have a BentoCloud account, sign up here for free and get $10 in free credits.
pip install bentoml
bentoml cloud login
Clone the repository and select the desired project to deploy it. We recommend you create a BentoCloud secret to store the required environment variable.
git clone https://github.com/bentoml/BentoLangGraph.git

# Use Mistral 7B
cd BentoLangGraph/langgraph-mistral
bentoml secret create huggingface HF_TOKEN=$HF_TOKEN
bentoml deploy . --secret huggingface

# Use Claude 3.5 Sonnet
cd BentoLangGraph/langgraph-anthropic
bentoml secret create anthropic ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY
bentoml deploy . --secret anthropic
Once it is up and running on BentoCloud, you can call the endpoint in the following ways:
import bentoml

with bentoml.SyncHTTPClient("<your_deployment_endpoint_url>") as client:
    result = client.invoke(
        input_query="Who won the gold medal at the men's 100 metres event at the 2024 Summer Olympic?",
    )
    print(result)
curl -s -X POST \
  'https://<your_deployment_endpoint_url>/invoke' \
  -H 'Content-Type: application/json' \
  -d '{
    "input_query": "Who won the gold medal at the men'\''s 100 metres event at the 2024 Summer Olympic?"
  }'
To make sure the Deployment automatically scales within a certain replica range, add the scaling flags:
bentoml deploy . --secret huggingface --scaling-min 0 --scaling-max 3 # Set your desired count
If it’s already deployed, update its allowed replicas as follows:
bentoml deployment update <deployment-name> --scaling-min 0 --scaling-max 3 # Set your desired count
For more information, see how to configure concurrency and autoscaling.
Local serving¶
BentoML allows you to run and test your code locally, so that you can quickly validate your code with local compute resources.
Clone the repository and choose your desired project.
git clone https://github.com/bentoml/BentoLangGraph.git

# Recommend Python 3.11

# Use Mistral 7B
cd BentoLangGraph/langgraph-mistral
pip install -r requirements.txt
export HF_TOKEN=<your-hf-token>

# Use Claude 3.5 Sonnet
cd BentoLangGraph/langgraph-anthropic
pip install -r requirements.txt
export ANTHROPIC_API_KEY=<your-anthropic-api-key>
Serve it locally.
bentoml serve .
Note
To run this project with Mistral 7B locally, you need an NVIDIA GPU with at least 16 GB of VRAM.
Visit or send API requests to http://localhost:3000.
For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.
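As a rough sketch of that flow (the Bento tag is a placeholder; use the tag printed by the build step):

# Build the Bento from the project directory, then produce an OCI-compliant image
cd BentoLangGraph/langgraph-mistral
bentoml build
bentoml containerize <bento_name>:<tag>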