Deploy BERT to Azure ML Studio (HuggingFace Transformers)
Azure ML Studio is a powerful platform that allows us to manage our ML/AI projects in various ways, from simple “Level 0” manual deployments all the way to “Level 4” fully automated MLOps.
In this scenario, we will deploy our HuggingFace BERT model running on the transformers library from a local folder to an online endpoint using Azure ML SDK v2.
The two main advantages of using Azure ML online endpoints are:
- the ability to load-balance multiple deployments: you can mirror or funnel traffic between different models in a high-throughput (or production) environment without causing downtime or delays in your app (see the sketch right after this list),
- high observability: deployed containers come with Application Insights logging out of the box, so you don’t have to set anything up for logging.
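To illustrate the first point, here is a minimal sketch of splitting traffic between two deployments, assuming an endpoint with deployments named “blue” and “green” already exists and ml_client is an authenticated MLClient; the names and percentages are only examples:

# Send 90% of requests to "blue" and 10% to "green" without downtime
endpoint = ml_client.online_endpoints.get(name=endpoint_name)
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()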
Prerequisites:
- VSCode with Python notebook
- Azure CLI (installed and signed in)
- Docker Desktop
- Azure ML Studio account with a workspace created
How-to:
For the most part, all you have to do is follow this tutorial, so I’m not going to repeat it. Instead, I will just show you the parts where things are a bit different.
Project structure
- Create a folder for your deployment or project (e.g., bert).
- Inside that folder, create a Python notebook (e.g., deploy.ipynb); this will be our workbook.
- Create a folder called model and copy your model files into that folder.
- Copy this Conda file into your model folder. Edit that file to specify the desired Python and pip versions, and add all required libraries, including transformers and PyTorch. This file will be used to set up your model’s environment. (At the time of writing, the deployment image has a bug where it is missing the azureml-inference-server-http library, so you may want to add it to your conda file as well.)
- Copy this score.py file into your model folder. This file will be used to init() and run() your model when it’s deployed.
- Copy this sample-request.json file into the root of your project. Edit it to provide the desired request payload format for your model.
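If you follow these steps, the project should end up looking roughly like this (the names match the examples above; the exact model files depend on your checkpoint):

bert/
├── deploy.ipynb
├── sample-request.json
└── model/
    ├── conda.yaml
    ├── score.py
    └── (your model and tokenizer files)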
Configure deployment
When configuring the deployment in your Python notebook, provide your model folder and your conda.yaml file as follows:
from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    ManagedOnlineDeployment,
    Model,
)

model = Model(path="./model/")

env = Environment(
    conda_file="./model/conda.yaml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=endpoint_name,
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="./model/", scoring_script="score.py"
    ),
    instance_type="Standard_DS1_v2",  # or "Standard_DS3_v2", etc.
    instance_count=1,
)
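The snippet above assumes that endpoint_name and an authenticated MLClient already exist earlier in the notebook. A minimal sketch of that setup, with placeholder subscription, resource group, workspace, and endpoint names, could look like this:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

# Placeholders: replace with your own subscription, resource group, and workspace
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

endpoint_name = "bert-endpoint"  # hypothetical endpoint name
endpoint = ManagedOnlineEndpoint(name=endpoint_name, auth_mode="key")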
Scoring script
The last thing you need to do is modify your score.py script to initialize and run your BERT model using the transformers library. It should look something like this:
import json
import logging
import os

from torch import no_grad
from transformers import AutoModelForTokenClassification, AutoTokenizer


def init():
    """
    This function is called when the container is initialized/started, typically after create/update of the deployment.
    You can write the logic here to perform init operations like caching the model in memory.
    """
    global model, tokenizer
    model = AutoModelForTokenClassification.from_pretrained(os.path.dirname(__file__), local_files_only=True)
    tokenizer = AutoTokenizer.from_pretrained(os.path.dirname(__file__), local_files_only=True)
    logging.info("Init complete")


def run(raw_data):
    """
    This function is called for every invocation of the endpoint to perform the actual scoring/prediction.
    Here we extract the data from the JSON input, run the model, and return the result.
    """
    # TODO: add your deserialization and validation guards here
    # TODO: add your inference code here
    # TODO: add your return statement here
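To give an idea of what those TODOs could turn into, here is a minimal sketch of run() for token classification. It assumes sample-request.json sends a payload like {"text": "some sentence"}; that key is an illustrative choice, not a requirement:

def run(raw_data):
    # Deserialize the request; the "text" key is a hypothetical payload format
    data = json.loads(raw_data)
    text = data["text"]

    # Tokenize the input and run the model without tracking gradients
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with no_grad():
        outputs = model(**inputs)

    # Map each token's highest-scoring class id to its label name
    predictions = outputs.logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [model.config.id2label[p] for p in predictions]
    return json.dumps(list(zip(tokens, labels)))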
Once that’s done, you are ready to run your model locally. When the local deployment has succeeded, give it a quick test with Postman, and then you are ready to publish!
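For reference, here is a hedged sketch of that local-then-published flow with SDK v2, assuming ml_client, endpoint, and blue_deployment are the objects created above:

# Create the endpoint and deployment locally (this is what Docker Desktop is for)
ml_client.online_endpoints.begin_create_or_update(endpoint, local=True)
ml_client.online_deployments.begin_create_or_update(deployment=blue_deployment, local=True)

# Smoke-test the local container with the sample request
ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    request_file="./sample-request.json",
    local=True,
)

# Once it works locally, publish to Azure by dropping local=True
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
ml_client.online_deployments.begin_create_or_update(deployment=blue_deployment).result()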