Metrics¶
Metrics are measurements that provide insights into the usage and performance of Services. BentoML provides a set of default metrics for performance analysis, and you can also define custom metrics with the Prometheus client library.
In this document, you will:
- Learn about and configure the default metrics in BentoML
- Create custom metrics for BentoML Services
- Use Prometheus to scrape metrics
- Create a Grafana dashboard to visualize metrics
Understand metrics¶
You can access metrics via the `metrics` endpoint of a BentoML Service. This endpoint is enabled by default and outputs metrics that Prometheus can scrape to monitor your Services continuously.
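For example, with a Service running locally on the default port 3000, you can fetch the raw scrape output directly:

curl http://localhost:3000/metrics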
Default metrics¶
BentoML automatically collects a set of default metrics for each Service. These metrics are tracked across different dimensions to provide detailed visibility into Service operations:
Name | Type | Description
---|---|---
`request_in_progress` | Gauge | The number of requests that are currently being processed by a Service.
`request_total` | Counter | The total number of requests that a Service has processed.
`request_duration_seconds` | Histogram | The time taken to process requests, including the total sum of request processing time, the count of requests processed, and the distribution across the configured duration buckets.
`adaptive_batch_size` | Histogram | The adaptive batch sizes used during Service execution, which is relevant for optimizing performance in batch processing scenarios. You need to enable adaptive batching to collect this metric.
Metric types¶
BentoML supports all metric types provided by Prometheus.
- `Gauge`: A metric that represents a single numerical value that can arbitrarily go up and down.
- `Counter`: A cumulative metric that only increases, useful for counting total requests.
- `Histogram`: Tracks the number of observations and the sum of the observed values in configurable buckets, allowing you to calculate averages, percentiles, and so on.
- `Summary`: Similar to a Histogram, but provides a total count of observations and a sum of observed values.
For more information, see the Prometheus documentation.
Dimensions¶
Dimensions tracked for the default BentoML metrics include:
- `endpoint`: The specific API endpoint being accessed.
- `runner_name`: The name of the running Service handling the request.
- `service_name`: The name of the Bento Service handling the request.
- `service_version`: The version of the Service.
- `http_response_code`: The HTTP response code of the request.
- `worker_index`: The worker instance that is running the inference.
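In the scrape output, these dimensions appear as Prometheus labels on each metric. An illustrative sample line, assuming the default `bentoml_service` namespace (the label values here are hypothetical):

bentoml_service_request_total{endpoint="/summarize",service_name="Summarization",service_version="1.0",http_response_code="200"} 12.0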
Configure default metrics¶
To customize how metrics are collected and reported in BentoML, use the `metrics` parameter within the `@bentoml.service` decorator:
@bentoml.service(metrics={
    "enabled": True,
    "namespace": "custom_namespace",
})
class MyService:
    # Service implementation
    ...
- `enabled`: Enabled by default. When enabled, you can access the metrics through the `metrics` endpoint of a BentoML Service.
- `namespace`: Follows the labeling convention of Prometheus. The default namespace is `bentoml_service`, which covers most use cases.
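Note that the namespace becomes the prefix of the exported metric names. With the `custom_namespace` value above, the mapping would look like this (illustrative):

# Default namespace:             bentoml_service_request_total
# namespace="custom_namespace":  custom_namespace_request_total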
Customize the duration bucket size¶
You can customize the duration bucket size of `request_duration_seconds` in the following two ways:
Manual bucket definition. Specify explicit bucket boundaries using `buckets`:

@bentoml.service(metrics={
    "enabled": True,
    "namespace": "bentoml_service",
    "duration": {
        "buckets": [0.1, 0.2, 0.5, 1, 2, 5, 10]
    }
})
class MyService:
    # Service implementation
    ...
Exponential bucket generation. Automatically generate exponential buckets with given `min`, `max`, and `factor` values (see the sketch after the example below):

- `min`: The lower bound of the smallest bucket in the histogram.
- `max`: The upper bound of the largest bucket in the histogram.
- `factor`: Determines the exponential growth rate of the bucket sizes. Each subsequent bucket boundary is calculated by multiplying the previous boundary by the factor.

@bentoml.service(metrics={
    "enabled": True,
    "namespace": "bentoml_service",
    "duration": {
        "min": 0.1,
        "max": 10,
        "factor": 1.2
    }
})
class MyService:
    # Service implementation
    ...
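To make the `factor` semantics concrete, here is a minimal sketch of how exponential boundaries can be derived from `min`, `max`, and `factor`. It illustrates the documented behavior only; it is not BentoML's internal implementation:

# Sketch: derive exponential histogram buckets from min, factor, and max.
def exponential_buckets(minimum: float, factor: float, maximum: float) -> list[float]:
    buckets = [minimum]
    # Grow each boundary by the factor until the next one would exceed the maximum
    while buckets[-1] * factor < maximum:
        buckets.append(buckets[-1] * factor)
    buckets.append(maximum)
    return buckets

print(exponential_buckets(0.1, 1.2, 10.0))
# [0.1, 0.12, 0.144, ..., 10.0]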
Note

- `duration.min`, `duration.max`, and `duration.factor` are mutually exclusive with `duration.buckets`. `duration.factor` must be greater than 1 to ensure each subsequent bucket is larger than the previous one.
- The buckets for the `adaptive_batch_size` Histogram are calculated based on the `max_batch_size` defined. The bucket sizes start at 1 and increase exponentially up to the `max_batch_size` with a factor of 2.
By default, BentoML uses the duration buckets provided by Prometheus.
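These defaults come from the Prometheus Python client, and you can inspect them directly:

from prometheus_client import Histogram

# Default duration buckets (in seconds) defined by the Prometheus client
print(Histogram.DEFAULT_BUCKETS)
# (0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0, inf)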
Create custom metrics¶
You can define and use custom metrics of the `Counter`, `Histogram`, `Summary`, and `Gauge` types within your BentoML Service using the `prometheus_client` API.
Prerequisites¶
Install the Prometheus Python client package:
pip install prometheus-client
Define custom metrics¶
To define custom metrics, use the metric classes from the `prometheus_client` module and set the following parameters as needed:
- `name`: A unique string identifier for the metric.
- `documentation`: A description of what the metric measures.
- `labelnames`: A list of strings defining the labels to apply to the metric. Labels add dimensions to the metric, which is useful for querying and aggregation purposes. When you record a metric, you specify the labels in the format `<metric_object>.labels(<label_name>='<label_value>').<metric_function>`. Once you define a label for a metric, all instances of that metric must include that label with some value. The value of a label can also be dynamic, meaning it can change based on the context of the tracked metric; for example, you can use a label to log the version of the model serving predictions, and that version label can change as you update the model.
- `buckets`: A Histogram-specific parameter which defines the boundaries for Histogram buckets, useful for categorizing measurement ranges. The list should end with `float('inf')` to capture all values that exceed the highest defined boundary. See the Prometheus documentation on Histograms for more details.
import time

import bentoml
from prometheus_client import Histogram

# Define a Histogram metric
inference_duration_histogram = Histogram(
    name="inference_duration_seconds",
    documentation="Time taken for inference",
    labelnames=["endpoint"],
    buckets=(
        0.005, 0.01, 0.025, 0.05, 0.075,
        0.1, 0.25, 0.5, 0.75, 1.0,
        2.5, 5.0, 7.5, 10.0, float("inf"),
    ),
)

@bentoml.service
class HistogramService:
    def __init__(self) -> None:
        # Initialization code
        ...

    @bentoml.api
    def infer(self, text: str) -> str:
        start_time = time.time()
        # Implementation logic
        ...
        # Track the metric: observe the elapsed inference time in seconds
        inference_duration_histogram.labels(endpoint='summarize').observe(time.time() - start_time)
        return text
import bentoml
from prometheus_client import Counter

# Define a Counter metric
inference_requests_counter = Counter(
    name="inference_requests_total",
    documentation="Total number of inference requests",
    labelnames=["endpoint"],
)

@bentoml.service
class CounterService:
    def __init__(self) -> None:
        # Initialization code
        ...

    @bentoml.api
    def infer(self, text: str) -> str:
        # Track the metric: increment the counter by 1
        inference_requests_counter.labels(endpoint='summarize').inc()
        # Implementation logic
        ...
        return text
import bentoml
from prometheus_client import Summary

# Define a Summary metric
response_size_summary = Summary(
    name="response_size_bytes",
    documentation="Response size in bytes",
    labelnames=["endpoint"],
)

@bentoml.service
class SummaryService:
    def __init__(self) -> None:
        # Initialization code
        ...

    @bentoml.api
    def infer(self, text: str) -> str:
        # Implementation logic
        result = text
        # Track the metric: observe the response size in bytes
        response_size_summary.labels(endpoint='summarize').observe(len(result.encode("utf-8")))
        return result
import bentoml
from prometheus_client import Gauge

# Define a Gauge metric
in_progress_gauge = Gauge(
    name="in_progress_requests",
    documentation="In-progress inference requests",
    labelnames=["endpoint"],
)

@bentoml.service
class GaugeService:
    def __init__(self) -> None:
        # Initialization code
        ...

    @bentoml.api
    def infer(self, text: str) -> str:
        # Track the metric: increment by 1 when the request starts
        in_progress_gauge.labels(endpoint='summarize').inc()
        # Implementation logic
        ...
        # Decrement by 1 when the request finishes
        in_progress_gauge.labels(endpoint='summarize').dec()
        return text
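As noted above, label values can be dynamic. A hypothetical sketch where a `model_version` label (not part of the examples above) changes as the model is updated:

from prometheus_client import Counter

# Hypothetical metric; "model_version" is a label whose value changes over time
prediction_counter = Counter(
    name="predictions_total",
    documentation="Total predictions served, by model version",
    labelnames=["model_version"],
)

MODEL_VERSION = "v1"  # e.g. updated when a new model is deployed

def predict(data: str) -> str:
    # The label value is resolved at call time, so a new version creates a new series
    prediction_counter.labels(model_version=MODEL_VERSION).inc()
    return data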
For more information on `prometheus_client`, see the Prometheus Python client library documentation.
An example with custom metrics¶
The following `service.py` file contains a custom Histogram and a custom Counter metric to measure the inference time and track the total number of requests.
from __future__ import annotations
import bentoml
from prometheus_client import Histogram, Counter
from transformers import pipeline
import time
# Define the metrics
request_counter = Counter(
name='summary_requests_total',
documentation='Total number of summarization requests',
labelnames=['status']
)
inference_time_histogram = Histogram(
name='inference_time_seconds',
documentation='Time taken for summarization inference',
labelnames=['status'],
buckets=(0.1, 0.2, 0.5, 1, 2, 5, 10, float('inf')) # Example buckets
)
EXAMPLE_INPUT = "Breaking News: In an astonishing turn of events, the small town of Willow Creek has been taken by storm as local resident Jerry Thompson's cat, Whiskers, performed what witnesses are calling a 'miraculous and gravity-defying leap.' Eyewitnesses report that Whiskers, an otherwise unremarkable tabby cat, jumped a record-breaking 20 feet into the air to catch a fly. The event, which took place in Thompson's backyard, is now being investigated by scientists for potential breaches in the laws of physics. Local authorities are considering a town festival to celebrate what is being hailed as 'The Leap of the Century.'"
@bentoml.service(
resources={"cpu": "2"},
traffic={"timeout": 10},
)
class Summarization:
def __init__(self) -> None:
self.pipeline = pipeline('summarization')
@bentoml.api
def summarize(self, text: str = EXAMPLE_INPUT) -> str:
start_time = time.time()
try:
result = self.pipeline(text)
summary_text = result[0]['summary_text']
# Capture successful requests
status = 'success'
except Exception as e:
# Capture failures
summary_text = str(e)
status = 'failure'
finally:
# Measure how long the inference took and update the histogram
inference_time_histogram.labels(status=status).observe(time.time() - start_time)
# Increment the request counter
request_counter.labels(status=status).inc()
return summary_text
Run this Service locally:
bentoml serve service:Summarization
Make sure you have sent some requests to the `summarize` endpoint, then view the custom metrics by running the following command. Replace `inference_time_seconds` and `summary_requests_total` with your own metric names.
curl -X 'GET' 'http://localhost:3000/metrics' -H 'accept: */*' | grep -E 'inference_time_seconds|summary_requests_total'
Expected output:
# HELP summary_requests_total Total number of summarization requests
# TYPE summary_requests_total counter
summary_requests_total{status="success"} 12.0
# HELP inference_time_seconds Time taken for summarization inference
# TYPE inference_time_seconds histogram
inference_time_seconds_sum{status="success"} 51.74311947822571
inference_time_seconds_bucket{le="0.1",status="success"} 0.0
inference_time_seconds_bucket{le="0.2",status="success"} 0.0
inference_time_seconds_bucket{le="0.5",status="success"} 0.0
inference_time_seconds_bucket{le="1.0",status="success"} 0.0
inference_time_seconds_bucket{le="2.0",status="success"} 0.0
inference_time_seconds_bucket{le="5.0",status="success"} 12.0
inference_time_seconds_bucket{le="10.0",status="success"} 12.0
inference_time_seconds_bucket{le="+Inf",status="success"} 12.0
inference_time_seconds_count{status="success"} 12.0
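Once Prometheus is scraping these metrics (see the next section), you can combine the `_sum` and `_count` series to chart average latency. For example, this PromQL expression computes the mean inference time over the last five minutes:

rate(inference_time_seconds_sum{status="success"}[5m]) / rate(inference_time_seconds_count{status="success"}[5m])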
Use Prometheus to scrape metrics¶
You can integrate Prometheus to scrape and visualize both default and custom metrics from your BentoML Service.
1. Create a Prometheus configuration file to define scrape jobs. Here is an example that scrapes metrics every 5 seconds from a BentoML Service:

   global:
     scrape_interval: 5s
     evaluation_interval: 15s

   scrape_configs:
     - job_name: prometheus
       metrics_path: "/metrics" # The metrics endpoint of the BentoML Service
       static_configs:
         - targets: ["0.0.0.0:3000"] # The address where the BentoML Service is running

2. Make sure you have a BentoML Service running, then start Prometheus in a different terminal session using the configuration file you created:

   ./prometheus --config.file=/path/to/the/file/prometheus.yml

3. Once Prometheus is running, access its web UI by visiting `http://localhost:9090` in your web browser. This interface allows you to query and visualize metrics collected from your BentoML Service.

4. Use PromQL expressions to query and visualize metrics. For example, to get the 99th percentile of request durations to the `/encode` endpoint over the last minute, use:

   histogram_quantile(0.99, rate(bentoml_service_request_duration_seconds_bucket{endpoint="/encode"}[1m]))
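Other queries follow the same pattern. For example, this expression charts per-endpoint request throughput over the last minute, assuming the default `bentoml_service` namespace:

sum by (endpoint) (rate(bentoml_service_request_total[1m]))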
Create a Grafana dashboard¶
Grafana is an analytics platform that allows you to create dynamic and informative dashboards to visualize BentoML metrics. Do the following to create a Grafana dashboard.
1. By default, Grafana runs on port 3000, which conflicts with BentoML's default port. To avoid this, change Grafana's default port. For example:

   sudo nano /etc/grafana/grafana.ini

2. Find the `[http]` section and change `http_port` to a free port like `4000`:

   ;http_port = 3000 # Change it to a port of your choice and uncomment the line by removing the semicolon
   http_port = 4000

3. Save the file and restart Grafana to apply the change:

   sudo systemctl restart grafana-server

4. Access the Grafana web UI at `http://localhost:4000/` (use your own port). Log in with the default credentials (`admin`/`admin`).

5. In the Grafana search box at the top, enter `Data sources` and add Prometheus as an available option. In Connection, set the URL to the address of your running Prometheus instance, such as `http://localhost:9090`. Save the configuration and test the connection to ensure Grafana can retrieve data from Prometheus.

6. With Prometheus configured as a data source, you can create a new dashboard. Start by adding a panel and selecting a metric to visualize, such as `bentoml_service_request_duration_seconds_bucket`. Grafana offers a wide array of visualization options, from simple line graphs to more complex representations like heatmaps and gauges.

For detailed instructions on dashboard creation and customization, read the Grafana documentation.
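If you manage Grafana configuration as code, you can also register the Prometheus data source through Grafana's provisioning mechanism instead of the UI. A minimal sketch, assuming a standard package installation where provisioning files live under /etc/grafana/provisioning/datasources/ (the path may vary by installation):

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true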