BentoML SDK
Service decorator
- bentoml.service(inner: type[T], /) → Service[T]
- bentoml.service(inner: None = None, /, *, image: Image | None = None, envs: list[dict[str, Any]] | None = None, **kwargs: Unpack) → _ServiceDecorator
Mark a class as a BentoML service.
Example:

```python
@service(traffic={"timeout": 60})
class InferenceService:
    @api
    def predict(self, input: str) -> str:
        return input
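```

The decorator also accepts the configuration keyword arguments shown in the signature above, such as envs and traffic. A minimal sketch, assuming env vars are passed as name/value dicts and that the timeout is in seconds (the values here are illustrative, not prescribed defaults):

```python
import bentoml

@bentoml.service(
    envs=[{"name": "HF_TOKEN"}],  # assumption: env vars as name/value dicts
    traffic={"timeout": 120},
)
class Summarizer:
    @bentoml.api
    def summarize(self, text: str) -> str:
        return text[:100]  # stand-in for real model inference
```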
Service API
- bentoml.api(func: t.Callable[t.Concatenate[t.Any, P], R]) → APIMethod[P, R]
- bentoml.api(*, route: str | None = None, name: str | None = None, input_spec: type[IODescriptor] | None = None, output_spec: type[IODescriptor] | None = None, batchable: bool = False, batch_dim: int | tuple[int, int] = 0, max_batch_size: int = 100, max_latency_ms: int = 60000) → t.Callable[[t.Callable[t.Concatenate[t.Any, P], R]], APIMethod[P, R]]
Make a BentoML API method. This decorator can be used either with or without arguments.
- Parameters:
  - func – The function to be wrapped.
  - route – The route of the API, e.g. "/predict".
  - name – The name of the API.
  - input_spec – The input spec of the API, should be a subclass of pydantic.BaseModel.
  - output_spec – The output spec of the API, should be a subclass of pydantic.BaseModel.
  - batchable – Whether the API is batchable.
  - batch_dim – The batch dimension of the API.
  - max_batch_size – The maximum batch size of the API.
  - max_latency_ms – The maximum latency of the API, in milliseconds.
Note that when you enable batching, batch_dim can be a tuple or a single value.

For a tuple (input_dim, output_dim):

- input_dim: Determines along which dimension the input arrays should be batched (or stacked) together before they are sent for processing. For example, if you are working with 2-D arrays and input_dim is set to 0, BentoML will stack the arrays along the first dimension. This means that if you have two 2-D input arrays with dimensions 5x2 and 10x2, specifying an input_dim of 0 combines these into a single 15x2 array for processing.
- output_dim: After inference is done, the output array needs to be split back into the original batch sizes. output_dim indicates along which dimension the output array should be split. In the example above, if the inference process returns a 15x2 array and output_dim is set to 0, BentoML will split this array back into the original sizes of 5x2 and 10x2, based on the recorded boundaries of the input batch. This ensures that each requester receives the portion of the output corresponding to their input.

If you specify a single value for batch_dim, this value applies to both input_dim and output_dim. In other words, the same dimension is used for batching inputs and splitting outputs.
Image illustration of batch_dim

This image illustrates the concept of batch_dim in the context of processing 2-D arrays. On the left side, there are two 2-D arrays of size 5x2, represented by blue and green boxes. The arrows show two different paths that these arrays can take depending on the batch_dim configuration:

- The top path has batch_dim=(0, 0): batching occurs along the first dimension (the number of rows). The two arrays are stacked on top of each other, resulting in a combined array of size 10x2, which is sent for inference. After inference, the result is split back into two separate 5x2 arrays.
- The bottom path has batch_dim=(1, 1): batching occurs along the second dimension (the number of columns). The two arrays are concatenated side by side, forming a larger array of size 5x4, which is processed by the model. After inference, the output array is split back into the original dimensions, resulting in two separate 5x2 arrays.
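A minimal sketch of a batchable endpoint under these semantics (the service name, the NumPy input type, and the doubling operation are illustrative assumptions):

```python
import numpy as np
import bentoml

@bentoml.service
class BatchedInference:
    @bentoml.api(batchable=True, batch_dim=(0, 0), max_batch_size=64, max_latency_ms=500)
    def predict(self, arrays: np.ndarray) -> np.ndarray:
        # BentoML stacks concurrent requests along dimension 0 before this
        # call and splits the returned array along dimension 0 afterwards,
        # so the method body only ever sees one combined batch.
        return arrays * 2.0  # stand-in for real model inference
```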
- bentoml.task(func: t.Callable[t.Concatenate[t.Any, P], R]) → APIMethod[P, R]
- bentoml.task(*, route: str | None = None, name: str | None = None, input_spec: type[IODescriptor] | None = None, output_spec: type[IODescriptor] | None = None, batchable: bool = False, batch_dim: int | tuple[int, int] = 0, max_batch_size: int = 100, max_latency_ms: int = 60000) → t.Callable[[t.Callable[t.Concatenate[t.Any, P], R]], APIMethod[P, R]]
Mark a method as a BentoML async task. This decorator can be used either with or without arguments.
- Parameters:
  - func – The function to be wrapped.
  - route – The route of the API, e.g. "/predict".
  - name – The name of the API.
  - input_spec – The input spec of the API, should be a subclass of pydantic.BaseModel.
  - output_spec – The output spec of the API, should be a subclass of pydantic.BaseModel.
  - batchable – Whether the API is batchable.
  - batch_dim – The batch dimension of the API.
  - max_batch_size – The maximum batch size of the API.
  - max_latency_ms – The maximum latency of the API, in milliseconds.
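A minimal sketch of a task endpoint, with the client-side submit-and-poll pattern shown in comments (the service name, the transcode example, and the local server URL are illustrative assumptions):

```python
import bentoml

@bentoml.service
class BatchWorker:
    @bentoml.task
    def transcode(self, path: str) -> str:
        # Long-running work executes in the background; clients submit the
        # task and poll for the result instead of holding the connection open.
        return path.replace(".mov", ".mp4")  # stand-in for real work

# Client side (assumption: the service is running locally on port 3000):
# client = bentoml.SyncHTTPClient("http://localhost:3000")
# handle = client.transcode.submit(path="clip.mov")  # returns immediately
# result = handle.get()                              # blocks until the task finishes
```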
bentoml.depends
- bentoml.depends(*, url: str | None = None, deployment: str | None = None, cluster: str | None = None) → Dependency[None]
- bentoml.depends(on: Service[T], *, url: str | None = None, deployment: str | None = None, cluster: str | None = None) → Dependency[T]
Create a dependency on another service or deployment.
- Parameters:
  - on – Service[T] | None: The service to depend on.
  - url – str | None: The URL of the service to depend on.
  - deployment – str | None: The deployment of the service to depend on.
  - cluster – str | None: The cluster of the service to depend on.
Examples:

```python
@bentoml.service
class MyService:
    # depends on a service
    svc_a = bentoml.depends(SVC_A)
    # depends on a deployment
    svc_b = bentoml.depends(deployment="ci-iris")
    # depends on a remote service with url
    svc_c = bentoml.depends(url="http://192.168.1.1:3000")
    # For the latter two cases, the service can be given to provide
    # more accurate types:
    svc_d = bentoml.depends(url="http://192.168.1.1:3000", on=SVC_D)
```
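Once declared, the dependency resolves to a client for the target service, so its APIs can be called from the depending service's own endpoints. A hedged sketch (the Preprocessing service and its preprocess API are assumptions for illustration):

```python
import bentoml

@bentoml.service
class Preprocessing:
    @bentoml.api
    def preprocess(self, text: str) -> str:
        return text.strip().lower()

@bentoml.service
class Predictor:
    preprocessing = bentoml.depends(Preprocessing)

    @bentoml.api
    def predict(self, text: str) -> str:
        # Calls the dependent service; BentoML routes the call locally or
        # remotely depending on how the services are deployed.
        cleaned = self.preprocessing.preprocess(text)
        return cleaned  # stand-in for real prediction
```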
bentoml.validators
- class bentoml.validators.FileSchema(format: str = 'binary', content_type: str | None = None)
  Bases: object
- class bentoml.validators.TensorSchema(format: TensorFormat, dtype: t.Optional[str] = None, shape: t.Optional[t.Tuple[int, ...]] = None)
  Bases: object
  - format: TensorFormat
- class bentoml.validators.DataframeSchema(orient: str = 'records', columns=None)
  Bases: object
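These schemas are typically attached to API parameter types as Annotated metadata so BentoML can validate and document the payload. A hedged sketch (the "numpy-array" format string and the shape/dtype values are assumptions for illustration):

```python
from typing import Annotated

import numpy as np
import pandas as pd
import bentoml
from bentoml.validators import DataframeSchema, TensorSchema

@bentoml.service
class Validated:
    @bentoml.api
    def embed(
        self,
        # Assumption: "numpy-array" is a valid TensorFormat value.
        tensor: Annotated[
            np.ndarray, TensorSchema(format="numpy-array", dtype="float32", shape=(-1, 4))
        ],
    ) -> np.ndarray:
        return tensor

    @bentoml.api
    def count_rows(
        self,
        df: Annotated[pd.DataFrame, DataframeSchema(orient="records")],
    ) -> int:
        return len(df)
```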