ML Model API
ML inference services have two constraints that generic API deploys don’t: large model weights that make the first build slow, and a runtime memory footprint that exceeds the default limits. This guide tackles both head-on.
By the end you’ll have a FastAPI inference endpoint live on Satusky, protected by an API key, sized for the model’s actual memory needs, and configured so that day-two deploys are fast.
Overview
Three things to understand before starting:
Memory limits. A 500MB model file expands in memory when loaded — PyTorch tensors, attention buffers, and the tokenizer vocabulary all live in RAM at once. Starting at 2Gi avoids an OOMKill on the first request. If the pod does get killed, you can raise the limit and redeploy without rebuilding the image — resource changes don’t touch the layer cache.
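As a rough way to reason about the 2Gi starting point, the sketch below estimates resident memory from a parameter count. The parameter count used in the example (about 250 million, assumed here for a model the size of flan-t5-base), the 4 bytes per fp32 parameter, and the 1.5x overhead factor for buffers and the tokenizer are illustrative assumptions, not measured values.

```python
def estimate_memory_gib(num_params: int, bytes_per_param: int = 4,
                        overhead_factor: float = 1.5) -> float:
    """Rough resident-memory estimate in GiB for a loaded model.

    overhead_factor is an assumed multiplier covering buffers, the
    tokenizer, and allocator slack; tune it against real measurements.
    """
    weight_bytes = num_params * bytes_per_param
    return weight_bytes * overhead_factor / (1024 ** 3)


if __name__ == "__main__":
    # 250 million parameters is an assumed figure for illustration.
    estimate = estimate_memory_gib(250_000_000)
    print(f"Estimated footprint: {estimate:.2f} GiB")
```

Under these assumptions the estimate lands at roughly 1.4 GiB, which is consistent with starting at 2Gi and leaving some headroom. Treat it as a first guess and confirm against the pod's actual behavior.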
Docker layer caching. Kaniko respects standard Docker layer semantics. The pip install layer is keyed against requirements.txt. As long as that file doesn’t change, every rebuild after the first skips the pip step entirely and goes straight to copying your application code. A deploy that took four minutes the first time takes under a minute on the second.
Requires cloud build: Layer caching only applies when deploying from source with a Dockerfile. If you’re testing with --image, the build step is skipped entirely and caching is not relevant.
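To make the caching behavior concrete, here is a simplified model of how a layer cache key can be chained from the parent layer, the instruction text, and the content of any copied files. It is an illustration only: the function names are hypothetical, and Kaniko's real key derivation differs in detail.

```python
import hashlib


def layer_key(parent_key: str, instruction: str, file_contents: bytes = b"") -> str:
    """Derive a cache key from the parent key, the instruction, and copied bytes."""
    digest = hashlib.sha256()
    digest.update(parent_key.encode())
    digest.update(instruction.encode())
    digest.update(file_contents)
    return digest.hexdigest()


def build_keys(requirements: bytes, app_code: bytes) -> list[str]:
    """Return chained keys for: COPY requirements.txt, pip install, COPY app/."""
    copy_reqs = layer_key("base", "COPY requirements.txt .", requirements)
    pip_install = layer_key(copy_reqs, "RUN pip install --no-cache-dir -r requirements.txt")
    copy_app = layer_key(pip_install, "COPY app/ ./app/", app_code)
    return [copy_reqs, pip_install, copy_app]


if __name__ == "__main__":
    first = build_keys(b"torch==2.3.0\n", b"print('v1')")
    second = build_keys(b"torch==2.3.0\n", b"print('v2')")
    # Compare the dependency layers and the app layer between the two builds.
    print("dependency layers reused:", first[:2] == second[:2])
    print("app layer reused:", first[2] == second[2])
```

Because each key includes its parent's key, a change to requirements.txt changes every key after it, while a change to application code only changes the final one.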
Build context size. The cloud builder uploads your project directory before building. Large files that don’t belong in the image — raw datasets, notebook checkpoints, local weight files — inflate the upload and the context scan. A tight .dockerignore keeps uploads fast.
Requires cloud build: Build context size only matters when deploying from source. Skip this concern when using --image with a pre-built image.
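One way to see what an ignore list saves is to measure the directory with and without it. The sketch below is an approximation under stated assumptions: it uses simple fnmatch-based matching, which is looser than Docker's real .dockerignore rules, and the function names are hypothetical.

```python
import fnmatch
import os


def _ignored(rel_path: str, patterns: list[str]) -> bool:
    """Match a pattern against the relative path or its final component."""
    base = os.path.basename(rel_path.rstrip("/"))
    return any(
        fnmatch.fnmatch(rel_path, p) or fnmatch.fnmatch(base, p.rstrip("/"))
        for p in patterns
    )


def context_size_bytes(root: str, ignore_patterns: list[str]) -> int:
    """Sum file sizes under root, skipping paths that match an ignore pattern."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        # Prune ignored directories in place so os.walk does not descend into them.
        dirnames[:] = [
            d for d in dirnames
            if not _ignored(os.path.normpath(os.path.join(rel_dir, d)) + "/", ignore_patterns)
        ]
        for name in filenames:
            rel_path = os.path.normpath(os.path.join(rel_dir, name))
            if not _ignored(rel_path, ignore_patterns):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total


if __name__ == "__main__":
    patterns = ["data/", "notebooks/", "model/", "*.ipynb", "*.pyc"]
    print("without ignores:", context_size_bytes(".", []))
    print("with ignores:   ", context_size_bytes(".", patterns))
```

Running it from the project root before a deploy gives a quick sense of how much the upload shrinks once large directories are excluded.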
Project structure
    ml-api/
    ├── app/
    │   └── main.py
    ├── model/
    │   └── weights.bin      # excluded via .dockerignore
    ├── notebooks/           # excluded via .dockerignore
    ├── data/                # excluded via .dockerignore
    ├── requirements.txt
    ├── Dockerfile
    ├── .dockerignore
    └── satusky.toml

model/weights.bin exists locally for experimentation. The Dockerfile downloads weights from HuggingFace at build time, so the image ships the weights without them being part of the build context upload.
.dockerignore
Requires cloud build: This section applies when deploying from source with a Dockerfile. Skip it if testing with a pre-built image.
Large ML projects need an explicit .dockerignore. Without one, the builder uploads your entire working directory — including local weight files, datasets, and notebook checkpoints that can be hundreds of gigabytes.
    __pycache__
    *.pyc
    *.pyo
    .git
    .env
    data/
    notebooks/
    *.ipynb
    model/

The model/ directory is excluded because weights are downloaded during the Docker build (the RUN python -c "..." step in the Dockerfile), not copied from the host. Your local model/weights.bin is for local experimentation; the image gets a fresh copy from HuggingFace every time the model-download layer is invalidated.
Dockerfile
Requires cloud build: This Dockerfile is used when deploying from source. If you already have a pre-built image, skip to satusky.toml.
    FROM python:3.12-slim

    WORKDIR /app

    # System dep for torch (OpenMP runtime)
    RUN apt-get update && apt-get install -y --no-install-recommends \
        libgomp1 \
        && rm -rf /var/lib/apt/lists/*

    # Copy dependency spec first so pip install is its own cached layer
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Download model weights at build time; they are not copied from the host.
    # This layer is cached as long as requirements.txt is unchanged.
    # The statements are joined with ";" and "\" so the whole script stays
    # inside a single RUN instruction.
    RUN python -c "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM; \
    model_id = 'google/flan-t5-base'; \
    AutoTokenizer.from_pretrained(model_id, cache_dir='/app/model_cache'); \
    AutoModelForSeq2SeqLM.from_pretrained(model_id, cache_dir='/app/model_cache'); \
    print('Model downloaded successfully')"

    # Copy application code last so code changes don't bust the model layer
    COPY app/ ./app/

    EXPOSE 8080

    CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8080"]

The layer order matters. requirements.txt and the model download sit before COPY app/ so that editing main.py doesn’t re-run pip or re-download weights. The model download layer only re-runs when requirements.txt changes, because that invalidates the pip layer and every layer after it.
Requires cloud/HuggingFace network access: The RUN python -c "..." model download step requires internet access from the cloud builder and a valid HuggingFace token for gated models. It cannot be tested locally without a cloud build environment.
requirements.txt
Requires cloud build: This file is consumed by the Dockerfile during the cloud build step.
    fastapi==0.111.0
    uvicorn[standard]==0.30.1
    torch==2.3.0
    transformers==4.41.2

Pin exact versions. Floating versions cause builds to produce different images on different days, and you lose the ability to reproduce a specific release.
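A small pre-deploy check can catch floating versions before they reach the builder. This is a minimal sketch, assuming a plain requirements.txt with one requirement per line; the regular expression is a simplification and does not cover every form pip accepts (URLs, environment markers, and so on).

```python
import re

# name, optional [extras], then an exact '==' version with no wildcard
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[A-Za-z0-9_,.\-]+\])?==[A-Za-z0-9_.+!\-]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with an exact '==' version."""
    offenders = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            offenders.append(line)
    return offenders


if __name__ == "__main__":
    with open("requirements.txt", encoding="utf-8") as handle:
        for line in unpinned_requirements(handle.read()):
            print(f"not pinned: {line}")
```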
satusky.toml
The config file uses a single [app] section. There are no separate [build], [resources], or [network] sections; everything lives under [app].
    [app]
    name = "ml-api"
    port = 8080
    cpu = "2"
    memory = "2Gi"

Start at 2Gi. If the model OOMKills at runtime, bump to 4Gi and redeploy. No rebuild is required because only the resource spec changes, not the image.
When deploying from source (Dockerfile), add dockerfile = "Dockerfile" to the [app] section:
    [app]
    name = "ml-api"
    port = 8080
    dockerfile = "Dockerfile"
    cpu = "2"
    memory = "2Gi"

First deploy
    1ctl deploy --image <your-image> --machine <machine-name> --wait

When deploying with a pre-built image, the build step is skipped and the deploy runs immediately. Expected output:

    💡 Using pre-built image: nginx:alpine
    Step 2/5: Creating/updating deployment ml-api ✓
    Step 3/5: Configuring services ml-api ✓
    Step 4/5: Setting up environment and storage ml-api ✓
    Step 5/5: Configuring public routing and dependencies ml-api ✓
    💡 Generated new domain: silentgiraffe-7715o6u.satusky.com
    ✅ 🚀 Deployment for ml-api is successful! Your app is live at: https://silentgiraffe-7715o6u.satusky.com
    Deployment ID: d6ded542-2e4c-4116-bc9a-73190b40fb8e
    💡 Waiting for deployment to become healthy...
    💡 Deployment status: NotReady (0 pct)
    ✅ Deployment is healthy — pods Running

Note that steps start at Step 2/5 when --image is used (Step 1/5 is the cloud build, which is skipped).
Requires cloud build: When deploying from source with a Dockerfile (no --image flag), Step 1/5 runs the cloud build. The first build is slow because pip downloads PyTorch (~750MB) and the build then downloads the model weights (~1GB). This happens once. On the next deploy, assuming requirements.txt is unchanged, Kaniko restores the pip and model layers from cache and only re-runs COPY app/. Expect sub-60-second builds from there.
Verify resource allocation
Confirm the platform applied the resources from satusky.toml:

    1ctl deploy get

Expected output:

    Deployment Details
    ──────────────────
    Deployment ID: d6ded542-2e4c-4116-bc9a-73190b40fb8e
    Status: completed
    URL: https://silentgiraffe-7715o6u.satusky.com
    Deployed to machines: c7d2a022-07bf-41f3-b51c-5ebb27365fc4
    Type: production
    Region:
    Zone:
    Version: alpine
    Port: 8080
    CPU Request: 2
    Memory Request: 2Gi
    Memory Limit: 2Gi
    Created: just now
    Last Updated: just now

If CPU Request or Memory Request doesn’t match your satusky.toml, redeploy. The platform reconciles resource specs on every deploy.
JSON output
The -o json flag is a global flag and must come before the subcommand:

    1ctl -o json deploy get

Expected output:

    {
      "deployment_id": "d6ded542-2e4c-4116-bc9a-73190b40fb8e",
      "user_id": "7aeb1c24-b7fd-46d4-be7a-a18b43cdd5d2",
      "hostnames": [
        "c7d2a022-07bf-41f3-b51c-5ebb27365fc4"
      ],
      "type": "production",
      "zone": "",
      "region": "",
      "ssd": "true",
      "gpu": "false",
      "namespace": "org3-b322955e",
      "replicas": 1,
      "image": "nginx:alpine",
      "app_label": "ml-api",
      "port": 8080,
      "cpu_request": "2",
      "memory_request": "2Gi",
      "memory_limit": "2Gi",
      "env_enabled": false,
      "secret_enabled": false,
      "volume_enabled": false,
      "status": "completed",
      "environment": "production",
      "marketplace_app_name": "",
      "domain": "https://silentgiraffe-7715o6u.satusky.com",
      "created_at": "2026-04-28T09:08:44.418412+08:00",
      "updated_at": "2026-04-28T09:08:44.418412+08:00"
    }

The key resource fields are cpu_request, memory_request, and memory_limit. These must match the values in satusky.toml.
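The comparison can be scripted against the JSON output. The sketch below is one possible approach: it takes the JSON text and a dictionary of expected values and reports any field that differs. The function name is hypothetical, and the expected values are assumed to be copied by hand from satusky.toml.

```python
import json


def resource_mismatches(deploy_json: str, expected: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map each mismatched field to (expected, actual) from the deploy JSON."""
    actual = json.loads(deploy_json)
    return {
        field: (want, str(actual.get(field)))
        for field, want in expected.items()
        if str(actual.get(field)) != want
    }


if __name__ == "__main__":
    import sys

    # Expected values mirror cpu = "2" and memory = "2Gi" in satusky.toml.
    expected = {"cpu_request": "2", "memory_request": "2Gi", "memory_limit": "2Gi"}
    problems = resource_mismatches(sys.stdin.read(), expected)
    for field, (want, got) in problems.items():
        print(f"{field}: expected {want}, got {got}")
    sys.exit(1 if problems else 0)
```

Saved under a name of your choosing (for example check_resources.py, a hypothetical filename), it could be fed by piping the output of 1ctl -o json deploy get into it.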
Check deploy status
    1ctl deploy status

Expected output:

    Status: Running
    Message: Deployment is running normally
    Progress: 100%

Add an API key secret
Create the secret before the first meaningful request hits the API:

    1ctl secret create --kv MODEL_API_KEY=sk-ml-prod-f83a91bc2e4d5071

Expected output:

    ✅ Secret ml-api created successfully

The running pod doesn’t have the secret yet, because pods pick up changes on the next start. Trigger a rolling restart:

    1ctl deploy restart

Expected output:

    💡 Initiating rolling restart for deployment d6ded542-2e4c-4116-bc9a-73190b40fb8e...
    ✅ Rolling restart initiated. Pods are being replaced one by one.
    💡 Use '1ctl deploy status --deployment-id d6ded542-2e4c-4116-bc9a-73190b40fb8e' to monitor progress.

Your application reads MODEL_API_KEY from its environment and rejects requests missing the correct Authorization header.
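The guide does not show the application code, so the following is only a sketch of how that check might look. It assumes the header has the form "Bearer <key>" (as in the curl example in this guide) and uses only the standard library; the function name is hypothetical, and in a FastAPI app it could be called from a dependency that returns a 401 when it yields False.

```python
import hmac
import os
from typing import Optional


def is_authorized(authorization_header: Optional[str], expected_key: str) -> bool:
    """Return True only for a header of the form 'Bearer <expected_key>'."""
    if not authorization_header or not expected_key:
        # A missing header or an unset key must never authorize a request.
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(token.strip().encode(), expected_key.encode())


if __name__ == "__main__":
    key = os.environ.get("MODEL_API_KEY", "")
    print(is_authorized("Bearer example-key", key))
```

Refusing every request when MODEL_API_KEY is empty is a deliberate choice here: a pod that started before the secret existed fails closed rather than accepting an empty token.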
Test the endpoint
Requires production infrastructure: Testing the actual ML inference endpoint requires a running model server (FastAPI + transformers), not the nginx placeholder used in this guide’s local testing steps.

    curl -s https://silentgiraffe-7715o6u.satusky.com/summarize \
      -H "Authorization: Bearer sk-ml-prod-f83a91bc2e4d5071" \
      -H "Content-Type: application/json" \
      -d '{"text": "Satusky is a container deployment platform for developers who need to ship without managing Kubernetes."}'

Expected response from a running model:

    {
      "summary": "Satusky is a container deployment platform for developers.",
      "latency_ms": 312
    }

If the response is 401, the secret hasn’t propagated yet. Run 1ctl deploy restart and try again.
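The same request can be issued from Python with the standard library. This is a minimal client sketch, assuming the /summarize path and Bearer header shown in the curl command; the function names are hypothetical.

```python
import json
import urllib.request


def build_summarize_request(base_url: str, api_key: str, text: str) -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/summarize",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def summarize(base_url: str, api_key: str, text: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_summarize_request(base_url, api_key, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Note that urlopen raises urllib.error.HTTPError for a 401, so a caller that wants to retry after a restart would need to catch that exception.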
Watch model load at startup
Requires production infrastructure: Model loading logs are only visible when running an actual ML framework (e.g., PyTorch + transformers). A plain nginx image does not produce these log lines.
After any restart or deploy, stream logs to confirm the model loaded cleanly:
    1ctl logs stream

Expected output from a running model server:

    2026-04-28T09:09:04Z [ml-api-6bcb6d898d-2ncnr] Loading tokenizer...
    2026-04-28T09:09:07Z [ml-api-6bcb6d898d-2ncnr] Loading model weights...
    2026-04-28T09:09:19Z [ml-api-6bcb6d898d-2ncnr] Model ready (loaded in 15.1s)
    2026-04-28T09:09:19Z [ml-api-6bcb6d898d-2ncnr] INFO: Uvicorn running on http://0.0.0.0:8080
    2026-04-28T09:09:22Z [ml-api-6bcb6d898d-2ncnr] INFO: Application startup complete.

Press Ctrl+C to stop the stream. A 10–15 second startup time is normal for a model this size. The platform waits for the health check to pass before routing any traffic, so requests don’t hit the API until the model is ready.
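The "Loading..." and "Model ready" lines come from the application itself, not the platform. As a sketch of how startup code might emit them, the helper below wraps two loader callables with progress messages and a timer. The helper name and its parameters are hypothetical; in a real app the loaders would presumably wrap the tokenizer and model from_pretrained calls.

```python
import time
from typing import Any, Callable


def load_with_logging(load_tokenizer: Callable[[], Any],
                      load_model: Callable[[], Any],
                      log: Callable[[str], None] = print) -> tuple[Any, Any]:
    """Run both loaders, logging progress and the total load time."""
    start = time.perf_counter()
    log("Loading tokenizer...")
    tokenizer = load_tokenizer()
    log("Loading model weights...")
    model = load_model()
    log(f"Model ready (loaded in {time.perf_counter() - start:.1f}s)")
    return tokenizer, model


if __name__ == "__main__":
    # Stand-in loaders so the sketch runs without any ML dependencies.
    tokenizer, model = load_with_logging(lambda: "tokenizer", lambda: "model")
```

Logging the elapsed time at startup makes regressions in load time visible in the stream without extra tooling.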
Checking recent logs
To check the last N lines without streaming:

    1ctl logs --tail 5

Expected output format:

    Pod Logs
    ────────
    [2026-04-28 09:09:04] [ml-api-6bcb6d898d-2ncnr] 2026/04/28 01:09:04 [notice] ...
    [2026-04-28 09:09:04] [ml-api-6bcb6d898d-2ncnr] 2026/04/28 01:09:04 [notice] ...
    ...
    ---
    Showing last 5 lines

Each line is prefixed with a timestamp and the full pod name.
Handle OOMKill
Requires production infrastructure: OOMKill behavior requires a real ML model that exhausts the container memory limit. It cannot be triggered with a lightweight image like nginx.
A model that fits in 2Gi on disk may exceed that when loaded into PyTorch’s memory allocator, especially with larger batch sizes or longer input sequences. The first sign is logs like this in 1ctl logs stream:
    2026-04-28T10:12:01Z [ml-api-6bcb6d898d-2ncnr] Loading tokenizer...
    2026-04-28T10:12:04Z [ml-api-6bcb6d898d-2ncnr] Loading model weights...
    2026-04-28T10:12:09Z [ml-api-6bcb6d898d-2ncnr] Killed
    2026-04-28T10:12:09Z [ml-api-6bcb6d898d-2ncnr] OOMKilled: container exceeded memory limit (2Gi)
    2026-04-28T10:12:12Z [ml-api-6bcb6d898d-2ncnr] container restarting... (backoff: 10s)
    2026-04-28T10:12:22Z [ml-api-6bcb6d898d-2ncnr] Loading tokenizer...
    2026-04-28T10:12:25Z [ml-api-6bcb6d898d-2ncnr] Loading model weights...
    2026-04-28T10:12:30Z [ml-api-6bcb6d898d-2ncnr] Killed
    2026-04-28T10:12:30Z [ml-api-6bcb6d898d-2ncnr] OOMKilled: container exceeded memory limit (2Gi)

The pod is stuck in a restart loop because it runs out of memory before the model finishes loading. To fix it, edit satusky.toml and increase the memory limit:

    [app]
    name = "ml-api"
    port = 8080
    cpu = "2"
    memory = "4Gi"

Redeploy. When using a pre-built image:

    1ctl deploy --image <your-image> --machine <machine-name>

Expected output (no --wait this time, because the memory fix is near-instant):

    💡 Using pre-built image: nginx:alpine
    Step 2/5: Creating/updating deployment ml-api ✓
    Step 3/5: Configuring services ml-api ✓
    Step 4/5: Setting up environment and storage ml-api ✓
    Step 5/5: Configuring public routing and dependencies ml-api ✓
    ✅ 🚀 Deployment for ml-api is successful! Your app is live at: https://silentgiraffe-7715o6u.satusky.com
    Deployment ID: d6ded542-2e4c-4116-bc9a-73190b40fb8e

Requires cloud build: When deploying from source, no code changed, so the image build is fast: Kaniko restores all layers from cache.
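If you capture log output to a file, the restart loop can be spotted mechanically. The sketch below counts lines containing "OOMKilled" (matching the sample log format in this section) and doubles a memory value such as 2Gi to 4Gi. Both helper names are hypothetical, and doubling is just one simple policy, not something the platform prescribes.

```python
import re

MEMORY_RE = re.compile(r"^(\d+)(Mi|Gi)$")


def count_oom_kills(log_lines: list[str]) -> int:
    """Count log lines that report an OOMKilled container."""
    return sum(1 for line in log_lines if "OOMKilled" in line)


def double_memory(limit: str) -> str:
    """Return the limit doubled in the same unit, e.g. '2Gi' -> '4Gi'."""
    match = MEMORY_RE.match(limit)
    if match is None:
        raise ValueError(f"unsupported memory value: {limit!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return f"{amount * 2}{unit}"


if __name__ == "__main__":
    import sys

    kills = count_oom_kills(sys.stdin.read().splitlines())
    if kills:
        print(f"{kills} OOMKilled events; consider memory = \"{double_memory('2Gi')}\"")
```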
Verify the new memory limit was applied:
    1ctl -o json deploy get

Check that memory_request and memory_limit now show 4Gi. Then stream logs to confirm the fix worked:

    1ctl logs stream

View release history
Each deploy creates a new release. To see all releases for the current app:

    1ctl deploy releases

Expected output:

    VERSION  IMAGE         STATUS      DEPLOYED
    ───────  ────────────  ──────────  ────────
    2        nginx:alpine  active      just now
    1        nginx:alpine  superseded  just now

active is the currently running version. Previous versions show as superseded.
Tear down
    1ctl deploy destroy -y

Expected output:

    💡 Destroying deployment d6ded542-2e4c-4116-bc9a-73190b40fb8e...
    ✅ Deployment d6ded542-2e4c-4116-bc9a-73190b40fb8e destroyed successfully

This removes the deployment and associated runtime resources. Route, DNS, and volume cleanup should be verified explicitly until the CLI reports each resource class separately:

    kubectl -n <namespace> get deploy,svc,httproute,ingress,pvc
    1ctl domains list

Summary
| Step | Command | Notes |
|---|---|---|
| First deploy (pre-built image) | 1ctl deploy --image <ref> --machine <name> --wait | Starts at Step 2/5 — no build step |
| First deploy (from source) | 1ctl deploy --wait | Slow first time — downloads torch + model once |
| Check status | 1ctl deploy status | Status: Running / Message: / Progress: 100% |
| Verify resources | 1ctl deploy get | Check CPU Request and Memory Request match toml |
| JSON output | 1ctl -o json deploy get | -o json is a global flag — goes before subcommand |
| Add API key | 1ctl secret create --kv MODEL_API_KEY=... | Encrypted at rest |
| Apply secret | 1ctl deploy restart | Rolling restart, no rebuild |
| Tail logs | 1ctl logs --tail 5 | Pod name prefixed on each line |
| Stream logs | 1ctl logs stream | Watch model load; 10–15s startup is normal (production only) |
| Fix OOMKill | Edit memory in toml, redeploy | No rebuild needed — only resource spec changes |
| Release history | 1ctl deploy releases | Columns: VERSION IMAGE STATUS DEPLOYED |
| Subsequent deploys | 1ctl deploy --image <ref> --machine <name> | Fast when only resource spec or config changes |
| Tear down | 1ctl deploy destroy -y | Removes deployment and associated runtime resources; verify route, DNS, and volume cleanup |