Sentence Transformer#

In natural language processing (NLP), embeddings enable computers to understand the underlying semantics of language by transforming words, phrases, or even documents into numerical vectors. They support a variety of use cases, from recommending products based on textual descriptions to translating languages and identifying relevant images through semantic understanding.

This document demonstrates how to build a sentence embedding application, Sentence Transformer, using BentoML. It uses the all-MiniLM-L6-v2 model, a compact language model developed for generating embeddings. Because of its small size, all-MiniLM-L6-v2 is efficient in terms of computational resources and speed, making it a good choice for embedding generation in environments with limited resources.
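To make the idea of comparing meanings as vectors concrete, here is a minimal sketch of cosine similarity, a common way to measure how close two embeddings are. The function name and the three tiny vectors are made up for illustration; they are not produced by the model, whose real embeddings have 384 dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors, for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Vectors that point in similar directions score closer to 1.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

With real embeddings, the same comparison lets an application rank sentences by how semantically close they are to a query.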


Install dependencies#

Clone the project repository and install all the dependencies.

git clone
cd BentoSentenceTransformers
pip install -r requirements.txt

Create a BentoML Service#

Define a BentoML Service to use a model for generating sentence embeddings. The example file in this project uses sentence-transformers/all-MiniLM-L6-v2:
from __future__ import annotations

import typing as t

import numpy as np
import torch
import bentoml
from sentence_transformers import SentenceTransformer, models

# Default input used when the endpoint is called without sentences
SAMPLE_SENTENCES = [
    "The sun dips below the horizon, painting the sky orange.",
    "A gentle breeze whispers through the autumn leaves.",
    "The moon casts a silver glow on the tranquil lake.",
    "A solitary lighthouse stands guard on the rocky shore.",
    "The city awakens as morning light filters through the streets.",
    "Stars twinkle in the velvety blanket of the night sky.",
    "The aroma of fresh coffee fills the cozy kitchen.",
    "A curious kitten pounces on a fluttering butterfly.",
]

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"

@bentoml.service(
    traffic={"timeout": 60},
    resources={"memory": "2Gi"},
)
class SentenceEmbedding:

    def __init__(self) -> None:

        # Load model and tokenizer
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # define layers
        first_layer = SentenceTransformer(MODEL_ID)
        pooling_model = models.Pooling(first_layer.get_sentence_embedding_dimension())
        self.model = SentenceTransformer(modules=[first_layer, pooling_model])
        print("Model loaded", "device:", self.device)

    @bentoml.api
    def encode(
        self,
        sentences: t.List[str] = SAMPLE_SENTENCES,
    ) -> np.ndarray:
        print("encoding sentences:", len(sentences))
        # Compute one embedding vector per input sentence
        sentence_embeddings = self.model.encode(sentences)
        return sentence_embeddings

Here is a breakdown of the Service code:

  • The script uses the @bentoml.service decorator to annotate the SentenceEmbedding class as a BentoML Service with timeout and memory specified. You can set more configurations as needed.

  • __init__ loads the model and tokenizer when an instance of the SentenceEmbedding class is created. The model is loaded onto the appropriate device (GPU if available, otherwise CPU).

  • The model consists of two layers: The first layer is the pre-trained MiniLM model (all-MiniLM-L6-v2), and the second layer is a pooling layer to aggregate word embeddings into sentence embeddings.

  • The encode method is defined as a BentoML API endpoint. It takes a list of sentences as input and uses the sentence transformer model to generate sentence embeddings. The returned embeddings are NumPy arrays.
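The pooling step described above can be sketched in plain Python. This is a minimal illustration of mean pooling, assuming the pooling layer is in its default mean mode; the function name `mean_pool` and the toy vectors are hypothetical and are not part of the project or of the sentence-transformers API.

```python
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("need exactly one mask entry per token")
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("at least one token must be unmasked")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three made-up 2-dimensional token vectors; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```

The padding vector is excluded by the mask, so the sentence vector is the average of the first two token vectors only.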

Run bentoml serve in your project directory to start the Service.

$ bentoml serve service:SentenceEmbedding

2023-12-27T07:49:25+0000 [WARNING] [cli] Converting 'all-MiniLM-L6-v2' to lowercase: 'all-minilm-l6-v2'.
2023-12-27T07:49:26+0000 [INFO] [cli] Starting production HTTP BentoServer from "service:SentenceEmbedding" listening on http://localhost:3000 (Press CTRL+C to quit)
Model loaded device: cuda

The server is active at http://localhost:3000. You can interact with it in different ways.

CURL

curl -X 'POST' \
    'http://localhost:3000/encode' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "sentences": [
        "hello world"
    ]
}'

Python client

import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    result = client.encode(
        sentences=[
            "hello world"
        ],
    )

Swagger UI

Visit http://localhost:3000, scroll down to Service APIs, and click Try it out. In the Request body box, enter your sentences and click Execute.


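Expected output: the response contains one embedding vector, a list of floating-point numbers, for each input sentence.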


Deploy to BentoCloud#

After the Service is ready, you can deploy the project to BentoCloud for better management and scalability. Sign up for a BentoCloud account and get $30 in free credits.

First, specify a configuration YAML file (bentofile.yaml) to define the build options for your application. It is used for packaging your application into a Bento. Here is an example file in the project:

service: "service:SentenceEmbedding"
labels:
  owner: bentoml-team
  project: gallery
include:
- "*.py"
python:
  requirements_txt: "./requirements.txt"
docker:
  env:
    NORMALIZE : "True"

Create an API token with Developer Operations Access to log in to BentoCloud, then run the following command to deploy the project.

bentoml deploy .

Once the Deployment is up and running on BentoCloud, you can access it via the exposed URL.
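As a rough sketch of calling the Deployment over HTTP, the snippet below builds the same POST request as the local curl example, using only the Python standard library. `DEPLOYMENT_URL` is a hypothetical placeholder and `build_encode_request` is an illustrative helper, not part of BentoML. Authentication, if your Deployment requires it, is not shown.

```python
import json
import urllib.request
from typing import List

# Hypothetical placeholder; replace with the URL exposed for your Deployment.
DEPLOYMENT_URL = "https://example.invalid"


def build_encode_request(
    base_url: str, sentences: List[str]
) -> urllib.request.Request:
    """Build a POST request for the /encode endpoint without sending it."""
    body = json.dumps({"sentences": sentences}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/encode",
        data=body,
        headers={
            "Content-Type": "application/json",
            "accept": "application/json",
        },
        method="POST",
    )


request = build_encode_request(DEPLOYMENT_URL, ["hello world"])
print(request.get_method(), request.full_url)

# To actually send it (requires a live Deployment):
# with urllib.request.urlopen(request, timeout=60) as response:
#     embeddings = json.loads(response.read())
```

The `bentoml.SyncHTTPClient` shown earlier for the local server is the more direct option when BentoML is installed on the client side; pass it the Deployment URL instead of http://localhost:3000.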



For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.