Agent: Function calling¶

LLM function calling refers to the capability of LLMs to interact with user-defined functions or APIs through natural language prompts. This allows the model to execute specific tasks, retrieve real-time data, or perform calculations beyond its trained knowledge. As a result, the model can provide more accurate and dynamic responses by integrating external resources or executing code in real time.

This document demonstrates how to build an AI agent capable of calling a user-defined function using Llama 3.1 70B, powered by LMDeploy and BentoML.

The example defines a Python function for currency conversion and exposes it through an API, allowing users to submit queries like the following:

{
   "query": "I want to exchange 42 US dollars to Canadian dollars"
}

The application processes this request and responds by converting USD to CAD using a fictitious exchange rate of 1 to 3.14159.

The converted amount of 42 US dollars to Canadian dollars is 131.95.

This example is ready for easy deployment and scaling on BentoCloud. With a single command, you can deploy a production-grade application with fast autoscaling, secure deployment in your cloud, and comprehensive observability.

[GIF: calling the currency exchange endpoint in the BentoCloud Playground]

Architecture¶

This example includes two BentoML Services, a Currency Exchange Assistant and an LLM. The LLM Service exposes an OpenAI-compatible API, so the Exchange Assistant can call it through the OpenAI client. Here is the general workflow of this example:

[Diagram: architecture of the function calling example]
  1. A user submits a query to the Exchange Assistant’s Query API, which processes the query and forwards it to the LLM to determine the required function and extract parameters.

  2. With the extracted parameters, the Query API invokes the identified Exchange Function, which is responsible for the exchange conversion using the specified parameters.

  3. After the Exchange Function computes the results, these are sent back to the LLM. The LLM then uses this data to generate a natural language response, which is returned to the user through the Exchange Assistant.

Code explanations¶

You can find the source code on GitHub. Below is a breakdown of the key code implementations within this project.

service.py¶

The service.py file outlines the logic of the two required BentoML Services.

  1. Begin by specifying the LLM for the project. This example uses Llama 3.1 70B Instruct AWQ in INT4, but you may choose an alternative model as needed.

    MODEL_ID = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"
    
  2. Create a Python class (Llama in the example) to initialize the model and tokenizer, and use the following decorators to add BentoML functionalities.

    • @bentoml.service: Converts this class into a BentoML Service. You can optionally set configurations like timeout and GPU resources to use on BentoCloud. We recommend an NVIDIA A100 80 GB GPU for optimal performance.

    • @bentoml.mount_asgi_app: Mounts an existing ASGI application defined in the openai_endpoints.py file to this class. It sets the base path to /v1, making it accessible via HTTP requests. The mounted ASGI application provides OpenAI-compatible APIs and can be served side-by-side with the LLM Service. For more information, see Mount ASGI applications.

    import bentoml
    from openai_endpoints import openai_api_app
    
    @bentoml.mount_asgi_app(openai_api_app, path="/v1")
    @bentoml.service(
        traffic={
            "timeout": 300,
        },
        resources={
            "gpu": 1,
            "gpu_type": "nvidia-a100-80gb",
        },
    )
    class Llama:
        def __init__(self) -> None:
            # Logic to initialize the model and tokenizer
            ...
    
  3. Next, use the @bentoml.service decorator to create another BentoML Service called ExchangeAssistant. Unlike the LLM, the function calling logic does not require a GPU and can run on a single CPU. Running the two Services on separate instances also means you can scale them independently on BentoCloud later.

    Key elements within the ExchangeAssistant Service:

    • bentoml.depends(): This function declares the Llama Service as a dependency, which allows ExchangeAssistant to utilize all its functionalities. For more information, see Run distributed Services.

    • Service initialization: Because the Llama Service provides OpenAI-compatible endpoints, you can use its HTTP client and client_url to construct an OpenAI client to interact with it.

    • A front-facing API /exchange: Define the endpoint using the @bentoml.api decorator to handle currency exchange queries.

    from openai import OpenAI
    
    @bentoml.service(resources={"cpu": "1"})
    class ExchangeAssistant:
        # Declare dependency on the Llama class
        llm = bentoml.depends(Llama)
    
        def __init__(self):
            # Setup HTTP client to interact with the LLM
            self.client = OpenAI(
                base_url=f"{self.llm.client_url}/v1",
                http_client=self.llm.to_sync.client,
                api_key="API_TOKEN_NOT_NEEDED"
            )
            ...
    
        @bentoml.api
        def exchange(self, query: str = "I want to exchange 42 US dollars to Canadian dollars") -> str:
            # Implementation logic
            ...
    
  4. The exchange method uses the OpenAI client to integrate function calling capabilities with the specified LLM. After parsing the query to determine the necessary function and extract the relevant parameters, it invokes the identified exchange function to generate the results. For detailed information on OpenAI’s function calling client APIs, see the OpenAI documentation.

    @bentoml.api
    def exchange(self, query: str = "I want to exchange 42 US dollars to Canadian dollars") -> str:
        tools = [
            {
                "type": "function",
                "function": {
                    "name": "convert_currency",
                    "description": "Convert from one currency to another. Result is returned in the 'converted_amount' key.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "from_currency": {"type": "string", "description": "The source currency to convert from, e.g. USD",},
                            "to_currency": {"type": "string", "description": "The target currency to convert to, e.g. CAD",},
                            "amount": {"type": "number", "description": "The amount to be converted"},
                        },
                        "required": [],
                    },
                },
            }
        ]
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": query},
        ]
        response_message = self.client.chat.completions.create(
            model=MODEL_ID,
            messages=messages,
            tools=tools,
        ).choices[0].message
        tool_calls = response_message.tool_calls
    
  5. You can then call the function and add additional functions as needed. Ensure that the function definitions in JSON match the corresponding Python function signatures (a sketch of the convert_currency method itself follows the snippet below).

    # Check if there are function calls from the LLM response
    if tool_calls:
    
        # Map the function name to the actual method
        available_functions = {
            "convert_currency": self.convert_currency,
        }
    
        # Append the initial LLM response to messages for complete context
        messages.append(response_message)
        for tool_call in tool_calls:
            function_name = tool_call.function.name
            function_to_call = available_functions[function_name]
            function_args = json.loads(tool_call.function.arguments)
    
            # Call the mapped function with parsed arguments
            function_response = function_to_call(
                from_currency=function_args.get("from_currency"),
                to_currency=function_args.get("to_currency"),
                amount=function_args.get("amount"),
            )
    
            # Append function responses to the message chain
            messages.append(
                {
                    "role": "user",
                    "name": function_name,
                    "content": function_response,
                }
            )
    
        # Generate the final response from the LLM incorporating the function responses
        final_response = self.client.chat.completions.create(
            model=MODEL_ID,
            messages=messages,
        )
        return final_response.choices[0].message.content
    else:
        return "Unable to use the available tools."
    

bentofile.yaml¶

This configuration file defines the build options for a Bento, the unified distribution format in BentoML, which contains source code, Python packages, model references, and environment setup. It helps ensure reproducibility across development and production environments.

Here is an example file:

service: 'service:ExchangeAssistant'
labels:
  owner: bentoml-team
  stage: demo
include:
  - '*.py'
python:
  requirements_txt: './requirements.txt'
  lock_packages: false
docker:
  python_version: "3.11"

Try it out¶

You can run this example project on BentoCloud, or serve it locally, containerize it as an OCI-compliant image, and deploy it anywhere.

BentoCloud¶

BentoCloud provides fast and scalable infrastructure for building and scaling AI applications with BentoML in the cloud.

  1. Install BentoML and log in to BentoCloud through the BentoML CLI. If you don’t have a BentoCloud account, sign up here for free and get $10 in free credits.

    pip install bentoml
    bentoml cloud login
    
  2. Clone the repository and deploy the project to BentoCloud.

    git clone https://github.com/bentoml/BentoFunctionCalling.git
    cd BentoFunctionCalling
    bentoml deploy .
    
  3. Once it is up and running on BentoCloud, you can call the endpoint in the following ways:

    [Screenshot: calling the endpoint in the BentoCloud Playground]
    import bentoml
    
    with bentoml.SyncHTTPClient("<your_deployment_endpoint_url>") as client:
        response = client.exchange(
            query="I want to exchange 42 US dollars to Canadian dollars"
        )
        print(response)
    
    curl -X 'POST' \
      '<your_deployment_endpoint_url>/exchange' \
      -H 'accept: text/plain' \
      -H 'Content-Type: application/json' \
      -d '{
        "query": "I want to exchange 42 US dollars to Canadian dollars"
    }'
    
  4. To make sure the Deployment automatically scales within a certain replica range, add the scaling flags:

    bentoml deploy . --scaling-min 0 --scaling-max 3 # Set your desired count
    

    If it’s already deployed, update its allowed replicas as follows:

    bentoml deployment update <deployment-name> --scaling-min 0 --scaling-max 3 # Set your desired count
    

    For more information, see how to configure concurrency and autoscaling.

Local serving¶

BentoML allows you to run and test your code locally, so that you can quickly validate it with local compute resources.

Important

To serve this project locally, you need an NVIDIA GPU with sufficient VRAM to run the LLM. We recommend an NVIDIA A100 80 GB GPU for optimal performance with the included Llama 3.1 70B Instruct AWQ in INT4.

  1. Clone the project repository and install the dependencies.

    git clone https://github.com/bentoml/BentoFunctionCalling.git
    cd BentoFunctionCalling
    
    # Recommend Python 3.11
    pip install -r requirements.txt
    
  2. Serve it locally.

    bentoml serve .
    
  3. Visit http://localhost:3000, or send API requests to it (for example, with the BentoML Python client shown below).
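
    A minimal sketch of calling the locally served exchange endpoint, assuming the default address http://localhost:3000 and the same client usage shown in the BentoCloud section:

    import bentoml

    # Call the locally served ExchangeAssistant; assumes the server is running
    # on the default port 3000 via `bentoml serve .`.
    with bentoml.SyncHTTPClient("http://localhost:3000") as client:
        response = client.exchange(
            query="I want to exchange 42 US dollars to Canadian dollars"
        )
        print(response)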

For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.