Service and APIs#
The service definition is the core building block in BentoML, embodying its service-oriented architecture, where users define the model serving logic. This guide dissects and explains the key components of the service definition.
Creating a Service#
A BentoML service is composed of Runners and APIs. Consider the following service definition from the tutorial:
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray
iris_clf_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    result = iris_clf_runner.predict.run(input_series)
    return result
Services are initialized through the bentoml.Service() call, with the service name and a list of Runners required in the service:
# Create the iris_classifier_service with the ScikitLearn runner
svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])
Note
The service name will become the name of the Bento.
The svc object created provides a decorator method, svc.api, for defining APIs in this service:
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    result = iris_clf_runner.predict.run(input_series)
    return result
Runners#
Runners represent a unit of serving logic that can be scaled horizontally to maximize throughput and resource utilization.
BentoML provides a convenient way of creating a Runner instance from a saved model:
runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
Tip
Users can also create custom Runners via the Runner and Runnable interface.
A Runner created from a model will automatically choose the optimal runner configuration for the target ML framework.
For example, if an ML framework releases the Python GIL and supports concurrent access natively, BentoML will create a single global instance of the runner worker and route all API requests to that global instance; otherwise, BentoML will create multiple runner instances based on the available system resources. Advanced users can also customize the runtime configurations to fine-tune runner performance. To learn more, see the introduction to Runners.
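The tip above mentions the custom Runner path; the following is a minimal sketch of that interface, assuming BentoML 1.x, where load_my_model is a hypothetical loader used only for illustration:

import bentoml

class MyRunnable(bentoml.Runnable):
    SUPPORTED_RESOURCES = ("cpu",)
    SUPPORTS_CPU_MULTI_THREADING = True

    def __init__(self):
        # load_my_model is a hypothetical loader used only for illustration
        self.model = load_my_model()

    @bentoml.Runnable.method(batchable=False)
    def predict(self, input_data):
        return self.model(input_data)

# Wrap the Runnable class in a Runner and pass it to the Service as usual
my_runner = bentoml.Runner(MyRunnable, name="my_custom_runner")
svc = bentoml.Service("my_service", runners=[my_runner])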
Debugging Runners#
Runners must be initialized in order to function. Normally, this is handled internally by BentoML when bentoml serve is called.

If you want to import and run a service without using BentoML, this must be done manually. For example, to debug a service called svc in service.py:
from service import svc
for runner in svc.runners:
    runner.init_local()
result = svc.apis["my_endpoint"].func(inp)
Service APIs#
Inference APIs define how the service functionality can be called remotely. A service can have one or more APIs. An API consists of its input/output specs and a callback function:
# Create new API and add it to "svc"
@svc.api(input=NumpyNdarray(), output=NumpyNdarray()) # define IO spec
def predict(input_array: np.ndarray) -> np.ndarray:
    # Define business logic
    # Define pre-processing logic
    result = runner.run(input_array)  # model inference call
    # Define post-processing logic
    return result
By decorating a function with @svc.api, we declare that the function shall be invoked when this API is called. The API function is a great place for defining your serving logic, such as feature fetching, pre- and post-processing, and model inference via Runners.
When running bentoml serve with the example above, this API function is transformed into an HTTP endpoint, /predict, that takes an np.ndarray as input and returns an np.ndarray as output. The endpoint can be called with the following curl command:
» curl -X POST \
-H "content-type: application/json" \
--data "[[5.9, 3, 5.1, 1.8]]" \
http://127.0.0.1:3000/predict
"[0]"
Tip
BentoML also plans to support translating the same Service API definition into a gRPC server endpoint, in addition to the default HTTP server. See #703.
Route#
By default, the function name becomes the endpoint URL. Users can also customize this URL via the route option, e.g.:
@svc.api(
    input=NumpyNdarray(), output=NumpyNdarray(),
    route="/v2/models/my_model/versions/v0/infer",
)
def predict(input_array: np.ndarray) -> np.ndarray:
    return runner.run(input_array)
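With this route option, the request from the earlier example is sent to the custom path instead of the default /predict URL (a sketch, assuming the server runs on the default local address):

» curl -X POST \
-H "content-type: application/json" \
--data "[[5.9, 3, 5.1, 1.8]]" \
http://127.0.0.1:3000/v2/models/my_model/versions/v0/infer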
Note
BentoML aims to parallelize API logic by starting multiple instances of the API server based on available system resources.
Inference Context#
The context of an inference call can be accessed through the additional bentoml.Context argument added to the service API function. Both the request and response contexts can be accessed through the inference context for getting and setting the headers, cookies, and status codes. Additionally, you can read and write to the global state dictionary via the ctx.state attribute, which is a per-worker dictionary that can be read and written across API endpoints.
@svc.api(
    input=NumpyNdarray(),
    output=NumpyNdarray(),
)
def predict(input_array: np.ndarray, ctx: bentoml.Context) -> np.ndarray:
    # get request headers
    request_headers = ctx.request.headers

    result = runner.run(input_array)

    # set response headers, cookies, and status code
    ctx.response.status_code = 202
    ctx.response.cookies = [
        bentoml.Cookie(
            key="key",
            value="value",
            max_age=None,
            expires=None,
            path="/predict",
            domain=None,
            secure=True,
            httponly=True,
            samesite="None",
        )
    ]
    ctx.response.headers.append("X-Custom-Header", "value")

    return result
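The ctx.state dictionary mentioned above can be used directly in an API function as well; the sketch below uses a hypothetical request_count key purely for illustration:

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict_with_state(input_array: np.ndarray, ctx: bentoml.Context) -> np.ndarray:
    # ctx.state is a plain per-worker dict shared across this worker's endpoints
    ctx.state["request_count"] = ctx.state.get("request_count", 0) + 1
    return runner.run(input_array)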
Lifecycle Hooks#
BentoML service provides a set of lifecycle hooks that can be used to execute code before startup and after shutdown. In the hook function, you can access the inference context introduced in the previous section.
@svc.on_startup
async def connect_db_on_startup(context: bentoml.Context):
context.state["db"] = await get_db_connection()
# ctx.request # this will raise an error because no request has been served yet.
@svc.on_shutdown
async def close_db_on_shutdown(context: bentoml.Context):
await context.state["db"].close()
The on_startup and on_shutdown hooks will be evaluated on each API server process (worker). Users should avoid file system access in these hooks because of possible contention; they are better suited for initializing in-process objects such as database connections.
The BentoML service also provides an on_deployment hook that is evaluated only once when the service starts. This is a good place to download model files once, to be shared by all API server processes (workers).
@svc.on_deployment
def download_model_on_serve():
    download_model_files()
This hook is executed on bentoml serve, before any process (worker) starts. However, users cannot access the inference context from the on_deployment hook.
Note
The on_deployment hook is executed every time the service is started, so we still recommend putting one-time initialization work in the Setup Script to avoid repeated execution.
You can register multiple functions for each hook, and they will be executed in the order they are registered. All hooks support both synchronous and asynchronous functions.
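For example, both hooks below can be registered on the same service and will run in registration order when each worker starts; the config values and warm_up_model coroutine are hypothetical placeholders:

@svc.on_startup
def load_config_on_startup(context: bentoml.Context):
    # synchronous hook; runs first because it is registered first
    context.state["config"] = {"threshold": 0.5}

@svc.on_startup
async def warm_up_on_startup(context: bentoml.Context):
    # asynchronous hook; warm_up_model is a hypothetical coroutine
    await warm_up_model()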
IO Descriptors#
IO descriptors are used for defining an API's input and output specifications. They describe the expected data type, help validate that the input and output conform to the expected format and schema, and convert them from and to the native types. They are specified through the input and output arguments in the @svc.api decorator method.

Recall the API we created in the tutorial. The classify API both accepts arguments and returns results in the type of bentoml.io.NumpyNdarray:
import numpy as np
from bentoml.io import NumpyNdarray
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_array: np.ndarray) -> np.ndarray:
    ...
Besides the NumpyNdarray IO descriptor, BentoML supports a variety of IO descriptors including PandasDataFrame, JSON, String, Image, Text, and File. For detailed documentation on how to declare and invoke these descriptors, please see the IO Descriptors API reference page.
Schema and Validation#
IO descriptors allow users to define the expected data types, shape, and schema, based on the type of input and output descriptor specified. IO descriptors can also be defined through examples with the from_sample API to simplify the development of service definitions.
Numpy#
The data type and shape of the NumpyNdarray can be specified with the dtype and shape arguments. By setting the enforce_shape and enforce_dtype arguments to True, the IO descriptor will strictly validate the input and output data based on the specified data type and shape. To learn more, see the IO descriptor reference for NumPy ndarray.
import numpy as np

import bentoml
from bentoml.io import NumpyNdarray

svc = bentoml.Service("iris_classifier")

# Define IO descriptors through samples
output_descriptor = NumpyNdarray.from_sample(np.array([[1.0, 2.0, 3.0, 4.0]]))

@svc.api(
    input=NumpyNdarray(
        shape=(-1, 4),
        dtype=np.float32,
        enforce_dtype=True,
        enforce_shape=True,
    ),
    output=output_descriptor,
)
def classify(input_array: np.ndarray) -> np.ndarray:
    ...
Pandas DataFrame#
The data type and shape of the PandasDataFrame can be specified with the dtype and shape arguments. By setting the enforce_shape and enforce_dtype arguments to True, the IO descriptor will strictly validate the input and output data based on the specified data type and shape. To learn more, see the IO descriptor reference for Tabular Data with Pandas.
import numpy as np
import pandas as pd

import bentoml
from bentoml.io import PandasDataFrame

svc = bentoml.Service("iris_classifier")

# Define IO descriptors through samples
output_descriptor = PandasDataFrame.from_sample(pd.DataFrame([[5, 4, 3, 2]]))

@svc.api(
    input=PandasDataFrame(
        orient="records",
        dtype=np.float32,
        enforce_dtype=True,
        shape=(-1, 4),
        enforce_shape=True,
    ),
    output=output_descriptor,
)
def classify(input_series: pd.DataFrame) -> pd.DataFrame:
    ...
JSON#
The data type of a JSON IO descriptor can be specified through a Pydantic model. When a Pydantic model is set, the IO descriptor will validate the input against the model and convert it into a model instance. To learn more, see the IO descriptor reference for Structured Data with JSON. We also provide an example project using Pydantic for request validation.
from typing import Any, Dict

import pandas as pd
from pydantic import BaseModel

import bentoml
from bentoml.io import JSON

svc = bentoml.Service("iris_classifier")

class IrisFeatures(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

@svc.api(
    input=JSON(pydantic_model=IrisFeatures),
    output=JSON(),
)
def classify(input_series: IrisFeatures) -> Dict[str, Any]:
    input_df = pd.DataFrame([input_series.dict()])
    results = iris_clf_runner.predict.run(input_df).to_list()
    return {"predictions": results}
Built-in Types#
Besides NumpyNdarray, BentoML supports a variety of other built-in IO descriptor types under the bentoml.io module. Each type comes with support for type validation and OpenAPI specification generation. For example:
| IO Descriptor   | Type                | Arguments           | Schema Type              |
|-----------------|---------------------|---------------------|--------------------------|
| NumpyNdarray    | numpy.ndarray       | validate, schema    | numpy.dtype              |
| PandasDataFrame | pandas.DataFrame    | validate, schema    | pandas.DataFrame.dtypes  |
| JSON            | Python native types | validate, schema    | Pydantic.BaseModel       |
| Image           | PIL.Image.Image     | pilmodel, mime_type |                          |
| Text            | str                 |                     |                          |
| File            | BytesIOFile         | kind, mime_type     |                          |
Learn more about other built-in IO Descriptors here.
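As an illustration of another built-in descriptor, the Image type passes a PIL image into and out of the API function. The sketch below converts the input to grayscale purely for demonstration:

from PIL.Image import Image as PILImage

from bentoml.io import Image

@svc.api(input=Image(), output=Image())
def to_grayscale(img: PILImage) -> PILImage:
    # the function receives a decoded PIL image and must return one
    return img.convert("L")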
Composite Types#
The Multipart IO descriptor can be used to group multiple IO Descriptor instances, which allows the API function to accept multiple arguments or return multiple values. Each IO descriptor can be customized with independent schema and validation logic:
from __future__ import annotations
from typing import Any
import numpy as np
from pydantic import BaseModel
from bentoml.io import Multipart, NumpyNdarray, JSON
class IrisFeatures(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

output_descriptor_numpy = NumpyNdarray.from_sample(np.array([2]))

@svc.api(
    input=Multipart(
        arr=NumpyNdarray(
            shape=(-1, 4),
            dtype=np.float32,
            enforce_dtype=True,
            enforce_shape=True,
        ),
        json=JSON(pydantic_model=IrisFeatures),
    ),
    output=output_descriptor_numpy,
)
def multi_part_predict(arr: np.ndarray, json: dict[str, Any]) -> np.ndarray:
    ...
Sync vs Async APIs#
APIs can be defined as either synchronous functions or asynchronous coroutines in Python. The API we created in the tutorial was a synchronous API. BentoML will intelligently create an optimally sized pool of workers to execute the synchronous logic. Synchronous APIs are simple and capable of getting the job done for most model serving scenarios.
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_array: np.ndarray) -> np.ndarray:
    result = runner.run(input_array)
    return result
Synchronous APIs fall short when we want to maximize the performance and throughput of the service. Asynchronous APIs are preferred if the processing logic is IO-bound or invokes multiple runners simultaneously. The following async API example calls a remote feature store asynchronously, invokes two runners simultaneously, and returns a combined result.
import aiohttp
import asyncio
# Load two runners for two different versions of the ScikitLearn
# Iris Classifier models we saved before
runner1 = bentoml.sklearn.get("iris_clf:yftvuwkbbbi6zc").to_runner()
runner2 = bentoml.sklearn.get("iris_clf:edq3adsfhzi6zg").to_runner()
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(input_array: np.ndarray) -> np.ndarray:
    # Call a remote feature store to pre-process the request
    async with aiohttp.ClientSession() as session:
        async with session.get('https://features/get', params=input_array[0]) as resp:
            features = get_features(await resp.text())

    # Invoke both model runners simultaneously
    results = await asyncio.gather(
        runner1.predict.async_run(input_array, features),
        runner2.predict.async_run(input_array, features),
    )
    return combine_results(results)
The asynchronous API implementation is more efficient because when an asynchronous method is invoked, the event loop is released to service other requests while this request awaits the results of the method. In addition, BentoML will automatically configure the ideal amount of parallelism based on the available number of CPU cores. Further tuning of the event loop configuration is not needed in common use cases.
Tip
Blocking logic, such as communicating with an API or database without the await keyword, will block the event loop and prevent it from completing other IO tasks. If you must use a library that does not support asynchronous IO with await, use the synchronous API instead. If you are not sure, use the synchronous API as well to prevent unexpected errors.
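As a sketch of the difference, the first endpoint below blocks the event loop with a synchronous HTTP call, while the second awaits the response so other requests can proceed; the feature-store URL is reused from the earlier example and is purely illustrative:

import requests  # synchronous HTTP client
import aiohttp   # asynchronous HTTP client

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def blocking_predict(input_array: np.ndarray) -> np.ndarray:
    # requests.get blocks the event loop; other requests stall until it returns
    features = requests.get("https://features/get").json()
    return await runner1.predict.async_run(input_array, features)

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def non_blocking_predict(input_array: np.ndarray) -> np.ndarray:
    # awaiting releases the event loop while this request waits on the feature store
    async with aiohttp.ClientSession() as session:
        async with session.get("https://features/get") as resp:
            features = await resp.json()
    return await runner1.predict.async_run(input_array, features)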