In this post, I’ll walk you through how to deploy a custom synchronous/asynchronous inference endpoint on AWS SageMaker using a FastAPI server wrapped in a lightweight Docker container.

When deploying machine learning models to production with SageMaker, one of the key architectural decisions is choosing between synchronous and asynchronous inference modes. This choice directly impacts throughput, latency, cost, timeout behavior, and scalability strategy.

The downside of async mode is cold-start latency. If your endpoint has been idle for a while, the first request may incur several minutes of delay while SageMaker initializes containers and provisions compute resources. For some applications this trade-off is completely acceptable; for others, especially latency-sensitive ones, it’s a deal-breaker.

Let’s start with a minimal FastAPI server. A custom SageMaker container must answer two routes on port 8080: `GET /ping` for health checks and `POST /invocations` for inference requests.
```python
# server.py
import time

import uvicorn
from fastapi import FastAPI, Form
from loguru import logger

app = FastAPI()


@app.get("/ping")
async def ping() -> dict[str, str]:
    # SageMaker calls this route periodically to check container health
    return {
        "status": "ok",
        "message": "API is healthy.",
    }


@app.post("/invocations")
async def invocations(
    id: str = Form(...),
    request_id: str = Form(...),
) -> str:
    logger.info(f"{request_id}, {id}, starting dummy process")
    time.sleep(1)  # your model inference goes here
    logger.info(f"{request_id}, {id}, ending dummy process")
    # In sync mode: the endpoint should return the model response.
    # In async mode: the endpoint should return just an ack,
    # and the real response will be delivered via SNS.
    return "ok"


if __name__ == "__main__":
    uvicorn.run("server:app", host="0.0.0.0", port=8080)
```
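The Dockerfile below installs from a `requirements.txt` that isn’t shown in the listing; a minimal version (an assumption on my part, left unpinned) might look like this. Note that `python-multipart` is required for FastAPI’s `Form(...)` parsing:

```text
# requirements.txt (assumed; pin versions for reproducible builds)
fastapi
uvicorn
loguru
python-multipart  # needed by FastAPI to parse Form(...) fields
```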
```dockerfile
# Dockerfile
FROM python:3.12-slim-bookworm

ENV \
    PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PYTHONFAULTHANDLER=1 \
    PYTHONHASHSEED=random \
    PIP_DEFAULT_TIMEOUT=100 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PIP_NO_CACHE_DIR=1

# venv
RUN python -m venv /venv
ENV PATH=/venv/bin:$PATH
COPY requirements.txt .
RUN pip install -r requirements.txt

WORKDIR /src
COPY src .

EXPOSE 8080
ENTRYPOINT ["sh", "-c", "uvicorn server:app --host 0.0.0.0 --port 8080"]
```
1) Build and tag locally
```bash
docker build -f Dockerfile -t dummy-inference .

# Test it
docker run --network=host dummy-inference:latest

# Should return {"status":"ok","message":"API is healthy."}
curl http://localhost:8080/ping

# Should return "ok" after the one-second dummy sleep
curl -X 'POST' 'http://localhost:8080/invocations' -H 'accept: application/json' \
  -H 'Content-Type: application/x-www-form-urlencoded' -d 'id=abc&request_id=def'
```
2) Publish to AWS ECR
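If this is your first push to the repository, you’ll likely need to authenticate Docker against ECR and create the repository first. A sketch, assuming the AWS CLI is configured and your region is us-east-1:

```bash
# Log the local Docker client in to ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com

# Create the repository if it doesn't exist yet
aws ecr create-repository --repository-name dummy-inference --region us-east-1
```

Then tag and push the image: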
```bash
docker tag dummy-inference:latest <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/dummy-inference:latest
docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/dummy-inference:latest
```
3) Link the ECR image to a SageMaker model

Run the following inside a SageMaker notebook, where `sagemaker.get_execution_role()` can resolve the attached IAM role:
```python
import boto3
import sagemaker

image_name = "dummy-inference"
model_name = image_name + "-model"
endpoint_name = image_name + "-endpoint"

sts = boto3.client("sts")
account_id = sts.get_caller_identity()["Account"]
image_uri = f"{account_id}.dkr.ecr.us-east-1.amazonaws.com/{image_name}:latest"

role = sagemaker.get_execution_role()
model = sagemaker.Model(image_uri=image_uri, role=role, name=model_name)
```
For async mode, prepare an `AsyncInferenceConfig` that tells SageMaker where to write results:

```python
from sagemaker.async_inference import AsyncInferenceConfig

bucket = sagemaker.Session().default_bucket()  # or any S3 bucket you can write to
async_output_path = f"s3://{bucket}/async-output/"
async_config = AsyncInferenceConfig(
    output_path=async_output_path,
    max_concurrent_invocations_per_instance=1,
    # notification_config=...,  # configure SNS (see below)
)
```
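If you’d rather be notified than poll S3, `notification_config` takes SNS topic ARNs for success and error results. A sketch with hypothetical topic ARNs (create the topics and subscriptions beforehand):

```python
async_config = AsyncInferenceConfig(
    output_path=async_output_path,
    max_concurrent_invocations_per_instance=1,
    notification_config={
        # Hypothetical ARNs; replace with your own SNS topics
        "SuccessTopic": "arn:aws:sns:us-east-1:123456789012:dummy-inference-success",
        "ErrorTopic": "arn:aws:sns:us-east-1:123456789012:dummy-inference-error",
    },
)
```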
4) Create the endpoint configuration and endpoint
```python
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.t2.medium",  # one of the cheapest instance types, just for testing
    endpoint_name=endpoint_name,
    # async_inference_config=async_config,  # uncomment to deploy in async mode
)
```
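Deployment usually takes a few minutes. If you want to check progress yourself instead of waiting for `deploy()` to return, a minimal status check:

```python
import boto3

sm = boto3.client("sagemaker", region_name="us-east-1")
status = sm.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(status)  # "Creating" while provisioning, "InService" when ready
```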
5) Test the endpoint
```python
import urllib.parse

import boto3

payload = urllib.parse.urlencode(
    {
        "id": "sku-12345",
        "request_id": "req-003",
    }
)

client = boto3.client("sagemaker-runtime", region_name="us-east-1")
# Synchronous invocation: the response body is the model's answer
response = client.invoke_endpoint(
    EndpointName="dummy-inference-endpoint",
    ContentType="application/x-www-form-urlencoded",
    Body=payload,
)
print(response["ResponseMetadata"]["HTTPStatusCode"])
print(response["Body"].read().decode())
```