ML Model API

ML inference services have two constraints that generic API deploys don’t: large model weights that make the first build slow, and a runtime memory footprint that can exceed the default limits. This guide tackles both head-on.

By the end you’ll have a FastAPI inference endpoint live on Satusky, protected by an API key, sized for the model’s actual memory needs, and configured so that day-two deploys are fast.

Overview

Three things to understand before starting:

Memory limits. A 500MB model file expands in memory when loaded: PyTorch tensors, attention buffers, and the tokenizer vocabulary all live in RAM at once. Starting at 2Gi usually avoids an OOMKill while the model loads or serves its first request. If the pod does get killed, raise the limit and redeploy. Resource changes don’t touch the layer cache, so the image does not need to be rebuilt.

Docker layer caching. Kaniko respects standard Docker layer semantics. The pip install layer is keyed against requirements.txt. As long as that file doesn’t change, every rebuild after the first skips the pip step entirely and goes straight to copying your application code. A deploy that took four minutes the first time takes under a minute on the second.

Requires cloud build: Layer caching only applies when deploying from source with a Dockerfile. If you’re testing with --image, the build step is skipped entirely and caching is not relevant.

Build context size. The cloud builder uploads your project directory before building. Large files that don’t belong in the image — raw datasets, notebook checkpoints, local weight files — inflate the upload and the context scan. A tight .dockerignore keeps uploads fast.

Requires cloud build: Build context size only matters when deploying from source. Skip this concern when using --image with a pre-built image.

Project structure

ml-api/
├── app/
│   └── main.py
├── model/
│   └── weights.bin        # excluded via .dockerignore
├── notebooks/             # excluded via .dockerignore
├── data/                  # excluded via .dockerignore
├── requirements.txt
├── Dockerfile
├── .dockerignore
└── satusky.toml

model/weights.bin exists locally for experimentation. The Dockerfile downloads weights from HuggingFace at build time, so the image ships the weights without them being part of the build context upload.

.dockerignore

Requires cloud build: This section applies when deploying from source with a Dockerfile. Skip it if testing with a pre-built image.

Large ML projects need an explicit .dockerignore. Without one, the builder uploads your entire working directory — including local weight files, datasets, and notebook checkpoints that can be hundreds of gigabytes.

# A leading **/ matches at any depth, not just the context root
**/__pycache__
**/*.pyc
**/*.pyo
.git
.env
data/
notebooks/
**/*.ipynb
model/

The model/ directory is excluded because weights are downloaded during the Docker build (RUN python -c "..." below), not copied from the host. Your local model/weights.bin is for local experimentation; the image gets a fresh copy from HuggingFace every time the model-download layer is invalidated.
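To decide what belongs in .dockerignore, it helps to see which top-level entries dominate the project directory. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it measures raw directory sizes rather than reproducing the builder's own ignore matching.

```python
import os


def top_level_sizes(root: str) -> list[tuple[str, int]]:
    """Return (entry, total_bytes) for each top-level entry under root, largest first."""
    sizes = {}
    for entry in os.listdir(root):
        path = os.path.join(root, entry)
        if os.path.isfile(path):
            sizes[entry] = os.path.getsize(path)
            continue
        # Sum every regular file below a directory; symlinks are skipped.
        total = 0
        for dirpath, _dirnames, filenames in os.walk(path):
            for name in filenames:
                file_path = os.path.join(dirpath, name)
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
        sizes[entry + "/"] = total
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


def report(root: str = ".") -> None:
    """Print one line per top-level entry, sizes in megabytes."""
    for name, size in top_level_sizes(root):
        print(f"{size / 1_000_000:10.1f} MB  {name}")
```

Calling report() from the project root lists the heaviest entries first. Anything large that the Dockerfile never copies is a candidate for .dockerignore.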

Dockerfile

Requires cloud build: This Dockerfile is used when deploying from source. If you already have a pre-built image, skip to satusky.toml.

FROM python:3.12-slim

WORKDIR /app

# System dep for torch (OpenMP runtime)
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy dependency spec first so pip install is its own cached layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Download model weights at build time instead of copying them from the host.
# This layer stays cached as long as requirements.txt and this instruction are unchanged.
# Each line ends with a backslash so the Dockerfile parser reads it as one RUN instruction.
RUN python -c "\
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM; \
model_id = 'google/flan-t5-base'; \
AutoTokenizer.from_pretrained(model_id, cache_dir='/app/model_cache'); \
AutoModelForSeq2SeqLM.from_pretrained(model_id, cache_dir='/app/model_cache'); \
print('Model downloaded successfully')"

# Copy application code last — code changes don't bust the model layer
COPY app/ ./app/

EXPOSE 8080

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]

The layer order matters. requirements.txt and the model download sit before COPY app/ so that editing main.py doesn’t re-run pip or re-download weights. The model download layer re-runs only when requirements.txt or the download instruction itself changes, because invalidating one layer invalidates every layer after it.
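The invalidation rule can be illustrated with a toy model in which each layer's cache key is a hash of the previous key, the instruction text, and the contents of any files the instruction copies. This is a simplified sketch for intuition only, not Kaniko's actual cache-key algorithm; LAYERS, layer_keys, and rebuilt_layers are hypothetical names.

```python
import hashlib

# Simplified stand-ins for the Dockerfile instructions and the files each one reads.
LAYERS = [
    ("COPY requirements.txt .", ["requirements.txt"]),
    ("RUN pip install --no-cache-dir -r requirements.txt", []),
    ("RUN python -c '<download model>'", []),
    ("COPY app/ ./app/", ["app/main.py"]),
]


def layer_keys(layers, files):
    """Return one cache key per layer; each key chains the key before it."""
    keys = []
    parent = ""
    for instruction, inputs in layers:
        digest = hashlib.sha256()
        digest.update(parent.encode())
        digest.update(instruction.encode())
        for name in inputs:
            digest.update(files[name])
        parent = digest.hexdigest()
        keys.append(parent)
    return keys


def rebuilt_layers(before, after):
    """List the instructions whose cache key differs between two file states."""
    old = layer_keys(LAYERS, before)
    new = layer_keys(LAYERS, after)
    return [instr for (instr, _), a, b in zip(LAYERS, old, new) if a != b]


base = {"requirements.txt": b"torch==2.3.0\n", "app/main.py": b"v1"}
print(rebuilt_layers(base, {**base, "app/main.py": b"v2"}))  # only the final COPY
print(rebuilt_layers(base, {**base, "requirements.txt": b"torch==2.3.1\n"}))  # all four layers
```

Because every key includes its parent, a change near the top of the Dockerfile cascades downward, which is why the cheap, frequently edited COPY app/ step belongs at the bottom.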

Requires cloud/HuggingFace network access: The RUN python -c "..." model download step requires internet access from the cloud builder and a valid HuggingFace token for gated models. It cannot be tested locally without a cloud build environment.

requirements.txt

Requires cloud build: This file is consumed by the Dockerfile during the cloud build step.

fastapi==0.111.0
uvicorn[standard]==0.30.1
torch==2.3.0
transformers==4.41.2

Pin exact versions. Floating versions cause builds to produce different images on different days, and you lose the ability to reproduce a specific release.
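A small pre-deploy check can catch a floating version before it reaches the builder. The sketch below is a deliberately simple reading of requirements.txt (it ignores option lines and treats anything without an exact == pin, or with a wildcard, as unpinned); unpinned_requirements is an illustrative name, not part of any tool mentioned in this guide.

```python
def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned to one exact version."""
    unpinned = []
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip option lines such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        if "==" not in line or "*" in line:
            unpinned.append(line)
    return unpinned


sample = "fastapi==0.111.0\ntorch>=2.3\ntransformers\n"
print(unpinned_requirements(sample))  # ['torch>=2.3', 'transformers']
```

An empty list means every dependency names a single version, so two builds from the same file install the same packages.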

satusky.toml

The config file uses a single [app] section. There are no separate [build], [resources], or [network] sections — everything lives under [app].

[app]
  name   = "ml-api"
  port   = 8080
  cpu    = "2"
  memory = "2Gi"

Start at 2Gi. If the model OOMKills at runtime, bump to 4Gi and redeploy — no rebuild required because only the resource spec changes, not the image.
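A rough back-of-the-envelope estimate shows why 2Gi is a reasonable starting point. The sketch below assumes about 250 million parameters for google/flan-t5-base (an approximate figure), 4 bytes per float32 parameter, and a 1.5x overhead factor for buffers and the Python runtime. The overhead factor is a guess for illustration, not a measured value, so treat the result as a first estimate only.

```python
import math

GIB = 1024 ** 3


def estimate_runtime_gib(num_params: int, bytes_per_param: int = 4,
                         overhead_factor: float = 1.5) -> float:
    """Estimate resident memory in GiB: weights times an assumed overhead factor."""
    return num_params * bytes_per_param * overhead_factor / GIB


def suggested_limit(estimate_gib: float) -> str:
    """Round the estimate up to the next power-of-two Gi value, with a 1Gi floor."""
    power = max(0, math.ceil(math.log2(estimate_gib)))
    return f"{2 ** power}Gi"


estimate = estimate_runtime_gib(250_000_000)
print(f"{estimate:.2f} GiB -> memory = \"{suggested_limit(estimate)}\"")
```

With these assumptions the estimate lands a little under 1.4 GiB, which rounds up to 2Gi. Larger batch sizes or longer inputs push the real figure higher, which is when the 4Gi fallback applies.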

When deploying from source (Dockerfile), add dockerfile = "Dockerfile" to the [app] section:

[app]
  name       = "ml-api"
  port       = 8080
  dockerfile = "Dockerfile"
  cpu        = "2"
  memory     = "2Gi"

First deploy

1ctl deploy --image <your-image> --machine <machine-name> --wait

When deploying with a pre-built image, the build step is skipped and the deploy runs immediately. Expected output:

💡 Using pre-built image: nginx:alpine
Step 2/5: Creating/updating deployment ml-api ✓
Step 3/5: Configuring services ml-api ✓
Step 4/5: Setting up environment and storage ml-api ✓
Step 5/5: Configuring public routing and dependencies ml-api ✓
💡 Generated new domain: silentgiraffe-7715o6u.satusky.com
✅ 🚀 Deployment for ml-api is successful! Your app is live at: https://silentgiraffe-7715o6u.satusky.com
Deployment ID: d6ded542-2e4c-4116-bc9a-73190b40fb8e
💡 Waiting for deployment to become healthy...
💡 Deployment status: NotReady (0 pct)
✅ Deployment is healthy — pods Running

Note that steps start at Step 2/5 when --image is used (Step 1/5 is the cloud build, which is skipped).

Requires cloud build: When deploying from source with a Dockerfile (no --image flag), Step 1/5 runs the cloud build. The first build is slow because pip downloads PyTorch (~750MB) and then the model weights (~1GB). This happens once. On the next deploy — assuming requirements.txt is unchanged — Kaniko restores the pip and model layers from cache, and only re-runs COPY app/. Expect sub-60-second builds from there.

Verify resource allocation

Confirm the platform applied the resources from satusky.toml:

1ctl deploy get

Expected output:

Deployment Details
──────────────────
Deployment ID: d6ded542-2e4c-4116-bc9a-73190b40fb8e
Status: completed
URL: https://silentgiraffe-7715o6u.satusky.com
Deployed to machines: c7d2a022-07bf-41f3-b51c-5ebb27365fc4
Type: production
Region:
Zone:
Version: alpine
Port: 8080
CPU Request: 2
Memory Request: 2Gi
Memory Limit: 2Gi
Created: just now
Last Updated: just now

If CPU Request, Memory Request, or Memory Limit doesn’t match your satusky.toml, redeploy. The platform reconciles resource specs on every deploy.

JSON output

The -o json flag is a global flag and must come before the subcommand:

1ctl -o json deploy get

Expected output:

{
  "deployment_id": "d6ded542-2e4c-4116-bc9a-73190b40fb8e",
  "user_id": "7aeb1c24-b7fd-46d4-be7a-a18b43cdd5d2",
  "hostnames": [
    "c7d2a022-07bf-41f3-b51c-5ebb27365fc4"
  ],
  "type": "production",
  "zone": "",
  "region": "",
  "ssd": "true",
  "gpu": "false",
  "namespace": "org3-b322955e",
  "replicas": 1,
  "image": "nginx:alpine",
  "app_label": "ml-api",
  "port": 8080,
  "cpu_request": "2",
  "memory_request": "2Gi",
  "memory_limit": "2Gi",
  "env_enabled": false,
  "secret_enabled": false,
  "volume_enabled": false,
  "status": "completed",
  "environment": "production",
  "marketplace_app_name": "",
  "domain": "https://silentgiraffe-7715o6u.satusky.com",
  "created_at": "2026-04-28T09:08:44.418412+08:00",
  "updated_at": "2026-04-28T09:08:44.418412+08:00"
}

The key resource fields are cpu_request, memory_request, and memory_limit. These must match the values in satusky.toml.
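The comparison can be scripted so that a mismatch fails loudly, for example in CI. This sketch assumes the JSON shape shown above and takes the expected cpu and memory values as arguments; resource_mismatches is a hypothetical helper, not a 1ctl feature.

```python
import json


def resource_mismatches(deploy_json: str, cpu: str, memory: str) -> list[str]:
    """Compare the three resource fields against the expected satusky.toml values."""
    deploy = json.loads(deploy_json)
    expected = {
        "cpu_request": cpu,
        "memory_request": memory,
        "memory_limit": memory,
    }
    return [
        f"{field}: expected {want}, got {deploy.get(field)}"
        for field, want in expected.items()
        if deploy.get(field) != want
    ]


sample = '{"cpu_request": "2", "memory_request": "2Gi", "memory_limit": "2Gi"}'
print(resource_mismatches(sample, cpu="2", memory="4Gi"))
```

In practice the input string would be the captured output of 1ctl -o json deploy get. An empty list means the deployed resources match the config.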

Check deploy status

1ctl deploy status

Expected output:

Status: Running
Message: Deployment is running normally
Progress: 100%
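For scripted deploys, the status output can be polled until the deployment reports Running. The sketch below assumes the three-line "Key: value" format shown above and that 1ctl is on the PATH; parse_status, is_running, and wait_until_running are illustrative names, and the timeout values are arbitrary defaults.

```python
import subprocess
import time


def parse_status(output: str) -> dict[str, str]:
    """Turn 'Key: value' lines into a dictionary."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_running(output: str) -> bool:
    """True when the status output reports Running at 100% progress."""
    fields = parse_status(output)
    return fields.get("Status") == "Running" and fields.get("Progress") == "100%"


def wait_until_running(timeout_s: float = 300.0, interval_s: float = 5.0) -> bool:
    """Poll `1ctl deploy status` until it reports Running or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["1ctl", "deploy", "status"], capture_output=True, text=True
        )
        if result.returncode == 0 and is_running(result.stdout):
            return True
        time.sleep(interval_s)
    return False
```

The parsing is kept separate from the subprocess call so it can be checked against sample output without a live deployment.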

Add an API key secret

Create the secret before the first meaningful request hits the API:

1ctl secret create --kv MODEL_API_KEY=sk-ml-prod-f83a91bc2e4d5071

Expected output:

✅ Secret ml-api created successfully

The running pod doesn’t have the secret yet — pods pick up changes on the next start. Trigger a rolling restart:

1ctl deploy restart

Expected output:

💡 Initiating rolling restart for deployment d6ded542-2e4c-4116-bc9a-73190b40fb8e...
✅ Rolling restart initiated. Pods are being replaced one by one.
💡 Use '1ctl deploy status --deployment-id d6ded542-2e4c-4116-bc9a-73190b40fb8e' to monitor progress.

Your application reads MODEL_API_KEY from its environment and rejects requests missing the correct Authorization header.
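The guide does not show app/main.py, so the following is only a sketch of how the key check might look, assuming the Bearer scheme used in the curl example later in this guide. is_authorized and expected_key_from_env are hypothetical helpers; hmac.compare_digest is used so the comparison takes constant time.

```python
import hmac
import os


def is_authorized(authorization_header, expected_key) -> bool:
    """Check an 'Authorization: Bearer <key>' header against the expected key."""
    if not authorization_header or not expected_key:
        # A missing header or an unset secret must never grant access.
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    return hmac.compare_digest(token.strip().encode(), expected_key.encode())


def expected_key_from_env():
    """Read the key the platform injects as MODEL_API_KEY (None if the secret is absent)."""
    return os.environ.get("MODEL_API_KEY")
```

In a FastAPI app, a dependency could call is_authorized with the request's Authorization header and raise HTTPException(status_code=401) when it returns False. Treating an unset MODEL_API_KEY as a rejection keeps the endpoint closed until the restart delivers the secret.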

Test the endpoint

Requires production infrastructure: Testing the actual ML inference endpoint requires a running model server (FastAPI + transformers), not the nginx placeholder used in this guide’s local testing steps.

curl -s https://silentgiraffe-7715o6u.satusky.com/summarize \
  -H "Authorization: Bearer sk-ml-prod-f83a91bc2e4d5071" \
  -H "Content-Type: application/json" \
  -d '{"text": "Satusky is a container deployment platform for developers who need to ship without managing Kubernetes."}'

Expected response from a running model:

{
  "summary": "Satusky is a container deployment platform for developers.",
  "latency_ms": 312
}

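If the response is 401, either the running pod has not picked up the secret yet or the header carries the wrong key. Run 1ctl deploy restart, confirm the key matches the one you created, and try again.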
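The same request can be made from Python with only the standard library. This sketch assumes the /summarize path and JSON body from the curl example above; build_summarize_request and summarize are illustrative names, and the 30-second timeout is an arbitrary default.

```python
import json
import urllib.request


def build_summarize_request(base_url: str, api_key: str, text: str) -> urllib.request.Request:
    """Build the POST request for the /summarize endpoint without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/summarize",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize(base_url: str, api_key: str, text: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_summarize_request(base_url, api_key, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

urlopen raises urllib.error.HTTPError for a 401, so a caller can catch that exception and check its code attribute to tell an authentication failure from other errors.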

Watch model load at startup

Requires production infrastructure: Model loading logs are only visible when running an actual ML framework (e.g., PyTorch + transformers). A plain nginx image does not produce these log lines.

After any restart or deploy, stream logs to confirm the model loaded cleanly:

1ctl logs stream

Expected output from a running model server:

2026-04-28T09:09:04Z [ml-api-6bcb6d898d-2ncnr] Loading tokenizer...
2026-04-28T09:09:07Z [ml-api-6bcb6d898d-2ncnr] Loading model weights...
2026-04-28T09:09:19Z [ml-api-6bcb6d898d-2ncnr] Model ready (loaded in 15.1s)
2026-04-28T09:09:19Z [ml-api-6bcb6d898d-2ncnr] INFO:     Uvicorn running on http://0.0.0.0:8080
2026-04-28T09:09:22Z [ml-api-6bcb6d898d-2ncnr] INFO:     Application startup complete.

Press Ctrl+C to stop the stream. A 10–15 second startup time is normal for a model this size. The platform waits for the health check to pass before routing any traffic, so requests don’t hit the API until the model is ready.

Checking recent logs

To check the last N lines without streaming:

1ctl logs --tail 5

Expected output format:

Pod Logs
────────
[2026-04-28 09:09:04] [ml-api-6bcb6d898d-2ncnr] 2026/04/28 01:09:04 [notice] ...
[2026-04-28 09:09:04] [ml-api-6bcb6d898d-2ncnr] 2026/04/28 01:09:04 [notice] ...
...
---
Showing last 5 lines

Each line is prefixed with a timestamp and the full pod name.

Handle OOMKill

Requires production infrastructure: OOMKill behavior requires a real ML model that exhausts the container memory limit. It cannot be triggered with a lightweight image like nginx.

A model that fits in 2Gi on disk may exceed that when loaded into PyTorch’s memory allocator, especially with larger batch sizes or longer input sequences. The first sign is logs like this in 1ctl logs stream:

2026-04-28T10:12:01Z [ml-api-6bcb6d898d-2ncnr] Loading tokenizer...
2026-04-28T10:12:04Z [ml-api-6bcb6d898d-2ncnr] Loading model weights...
2026-04-28T10:12:09Z [ml-api-6bcb6d898d-2ncnr] Killed
2026-04-28T10:12:09Z [ml-api-6bcb6d898d-2ncnr] OOMKilled: container exceeded memory limit (2Gi)
2026-04-28T10:12:12Z [ml-api-6bcb6d898d-2ncnr] container restarting... (backoff: 10s)
2026-04-28T10:12:22Z [ml-api-6bcb6d898d-2ncnr] Loading tokenizer...
2026-04-28T10:12:25Z [ml-api-6bcb6d898d-2ncnr] Loading model weights...
2026-04-28T10:12:30Z [ml-api-6bcb6d898d-2ncnr] Killed
2026-04-28T10:12:30Z [ml-api-6bcb6d898d-2ncnr] OOMKilled: container exceeded memory limit (2Gi)

The pod is stuck in a restart loop because it runs out of memory before the model finishes loading. Fix: edit satusky.toml and increase the memory limit:

[app]
  name   = "ml-api"
  port   = 8080
  cpu    = "2"
  memory = "4Gi"

Redeploy. When using a pre-built image:

1ctl deploy --image <your-image> --machine <machine-name>

Expected output (no --wait this time — the memory fix is near-instant):

💡 Using pre-built image: nginx:alpine
Step 2/5: Creating/updating deployment ml-api ✓
Step 3/5: Configuring services ml-api ✓
Step 4/5: Setting up environment and storage ml-api ✓
Step 5/5: Configuring public routing and dependencies ml-api ✓
✅ 🚀 Deployment for ml-api is successful! Your app is live at: https://silentgiraffe-7715o6u.satusky.com
Deployment ID: d6ded542-2e4c-4116-bc9a-73190b40fb8e

Requires cloud build: When deploying from source, no code changed, so the image build is fast — Kaniko restores all layers from cache.

Verify the new memory limit was applied:

1ctl -o json deploy get

Check that memory_request and memory_limit now show 4Gi. Then stream logs to confirm the fix worked:

1ctl logs stream
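If logs are captured to a file, the restart loop can be spotted without reading every line. The sketch below assumes the OOMKilled line format from the sample log earlier in this section and applies the same doubling step (2Gi to 4Gi) used in this guide; oom_summary and next_limit are hypothetical helpers.

```python
import re

# Matches lines such as "OOMKilled: container exceeded memory limit (2Gi)".
OOM_PATTERN = re.compile(r"OOMKilled: container exceeded memory limit \((\d+)(Mi|Gi)\)")


def oom_summary(log_lines):
    """Return (number_of_oom_kills, last_reported_limit or None)."""
    kills = 0
    limit = None
    for line in log_lines:
        match = OOM_PATTERN.search(line)
        if match:
            kills += 1
            limit = match.group(1) + match.group(2)
    return kills, limit


def next_limit(limit: str) -> str:
    """Double a limit such as '2Gi' while keeping its unit."""
    value, unit = int(limit[:-2]), limit[-2:]
    return f"{value * 2}{unit}"


lines = [
    "2026-04-28T10:12:09Z [ml-api-6bcb6d898d-2ncnr] Killed",
    "2026-04-28T10:12:09Z [ml-api-6bcb6d898d-2ncnr] OOMKilled: container exceeded memory limit (2Gi)",
]
kills, limit = oom_summary(lines)
print(kills, limit, next_limit(limit))  # 1 2Gi 4Gi
```

A count of zero after the redeploy, together with a "Model ready" line, indicates the new limit is sufficient.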

View release history

Each deploy creates a new release. To see all releases for the current app:

1ctl deploy releases

Expected output:

VERSION  IMAGE         STATUS      DEPLOYED
───────  ────────────  ──────────  ────────
2        nginx:alpine  active      just now
1        nginx:alpine  superseded  just now

active is the currently running version. Previous versions show as superseded.

Tear down

1ctl deploy destroy -y

Expected output:

💡 Destroying deployment d6ded542-2e4c-4116-bc9a-73190b40fb8e...
✅ Deployment d6ded542-2e4c-4116-bc9a-73190b40fb8e destroyed successfully

This removes the deployment and associated runtime resources. Route, DNS, and volume cleanup should be verified explicitly until the CLI reports each resource class separately:

kubectl -n <namespace> get deploy,svc,httproute,ingress,pvc
1ctl domains list

Summary

Step	Command	Notes
First deploy (pre-built image)	`1ctl deploy --image <ref> --machine <name> --wait`	Starts at Step 2/5 — no build step
First deploy (from source)	`1ctl deploy --wait`	Slow first time — downloads torch + model once
Check status	`1ctl deploy status`	`Status: Running / Message: / Progress: 100%`
Verify resources	`1ctl deploy get`	Check `CPU Request` and `Memory Request` match toml
JSON output	`1ctl -o json deploy get`	`-o json` is a global flag — goes before subcommand
Add API key	`1ctl secret create --kv MODEL_API_KEY=...`	Encrypted at rest
Apply secret	`1ctl deploy restart`	Rolling restart, no rebuild
Tail logs	`1ctl logs --tail 5`	Pod name prefixed on each line
Stream logs	`1ctl logs stream`	Watch model load; 10–15s startup is normal (production only)
Fix OOMKill	Edit `memory` in toml, redeploy	No rebuild needed — only resource spec changes
Release history	`1ctl deploy releases`	Columns: `VERSION IMAGE STATUS DEPLOYED`
Subsequent deploys	`1ctl deploy --image <ref> --machine <name>`	Fast when only resource spec or config changes
Tear down	`1ctl deploy destroy -y`	Removes deployment and runtime resources; verify route, DNS, and volume cleanup