Modal Cloud GPU Computing

In one sentence: Modal is a serverless GPU cloud platform where Python decorators are enough to deploy local code to cloud GPUs for AI inference, training, and batch jobs.

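Why learn it

Zero-ops GPUs: no servers, Docker, or Kubernetes to manage; adding a decorator deploys the code
Per-second billing: you pay only for actual GPU compute time, and containers that have scaled to zero cost nothing
Fast cold starts: container image pre-warming brings up GPU instances in seconds
Python-native: no YAML or Dockerfiles; infrastructure is defined in plain Python
Built for AI workloads: covers inference, fine-tuning, and batch data pipelines, with native A100/H100 support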

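Core concepts in detail

Core abstractions

Concept	Description	Analogy
`App`	Top-level container that groups a set of functions	A microservice
`@app.function`	Marks a function for remote execution	Serverless function
`Image`	Runtime environment definition (packages, dependencies)	Docker image
`Volume`	Persistent storage	Cloud disk / EBS
`Secret`	Secure credential management	Environment variables
`Cls`	Stateful class, declared with `@app.cls` (the model stays resident in memory)	Long-running service
`web_endpoint`	HTTP endpoint	API Gateway + Lambda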

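GPU options

GPU type	VRAM	Use cases	Approx. price
T4	16 GB	Inference, small-model training	~$0.59/h
A10G	24 GB	Mid-sized model inference/training	~$1.10/h
A100-40GB	40 GB	Large-model training/inference	~$3.00/h
A100-80GB	80 GB	LLM inference, large-scale training	~$4.50/h
H100	80 GB	Highest-performance training	~$7.00/h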
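
Because billing is per second, a job's GPU cost is roughly the hourly rate divided by 3600, times the runtime in seconds. The sketch below works that out with the approximate prices from the table. `HOURLY_USD` and `job_cost` are illustrative names, not part of the Modal SDK, and a real bill may include other charges such as CPU and memory.

```python
# Approximate GPU prices copied from the table above (USD per hour).
HOURLY_USD = {
    "T4": 0.59,
    "A10G": 1.10,
    "A100-40GB": 3.00,
    "A100-80GB": 4.50,
    "H100": 7.00,
}


def job_cost(gpu: str, seconds: float) -> float:
    """Estimate the GPU cost of a job billed per second (hypothetical helper)."""
    return HOURLY_USD[gpu] / 3600 * seconds


# A 90-second inference burst versus a full hour on the same GPU.
burst = job_cost("A10G", 90)
full_hour = job_cost("A10G", 3600)
print(f"90 s on A10G: ${burst:.4f}")
print(f"1 h on A10G:  ${full_hour:.2f}")
```

The point of the comparison is that short, bursty workloads pay for seconds rather than for a reserved hour.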

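Execution model

Local Python --> Modal SDK --> cloud container (GPU) --> result returned to the local process

The function's code is serialized and uploaded to Modal's cloud
It runs inside a prebuilt container image
Results stream back to the local process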
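
One practical consequence of this model is that arguments and return values have to survive a serialization round trip. The following is a purely local illustration using the standard `pickle` module; `simulate_remote_call` is a hypothetical stand-in and does not reflect how Modal is actually implemented.

```python
import pickle


def square(x: int) -> int:
    return x ** 2


def simulate_remote_call(func, *args):
    """Mimic a remote call: serialize inputs, run, serialize the output."""
    payload = pickle.dumps(args)           # what would be sent to the container
    received = pickle.loads(payload)       # what the container would decode
    result = func(*received)               # execution on the "remote" side
    return pickle.loads(pickle.dumps(result))  # result shipped back


print(simulate_remote_call(square, 42))
```

Objects that cannot be pickled, such as open file handles or live database connections, are the usual source of trouble at this boundary, so it is safer to pass paths or identifiers instead.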

Installation and setup

Install the SDK

pip install modal

Authentication

# First use: opens a browser window to log in
modal setup

Verify

modal --version
# Check which account profile is active
modal profile current

Account credits

Modal gives new users $30 of free credit, which is enough for a good deal of experimentation.

Quick start

Minimal example: run a function remotely

# hello_modal.py
import modal

app = modal.App("hello-modal")

@app.function()
def square(x: int) -> int:
    return x ** 2

@app.local_entrypoint()
def main():
    # Runs in a Modal container rather than on this machine
    result = square.remote(42)
    print(f"42² = {result}")

modal run hello_modal.py

GPU inference example

import modal

app = modal.App("gpu-inference")

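# Define the container image
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate"
)

# The Llama weights are gated on Hugging Face, so the function needs a token.
# Assumes a Secret named "huggingface-token" that sets HF_TOKEN.
@app.function(
    gpu="A10G",
    image=image,
    timeout=300,
    secrets=[modal.Secret.from_name("huggingface-token")],
)
def generate_text(prompt: str) -> str:
    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="meta-llama/Llama-3.1-8B-Instruct",
        # bfloat16 weights take ~16 GB; float32 (~32 GB) would not fit in 24 GB
        torch_dtype=torch.bfloat16,
        device="cuda",
    )
    result = pipe(prompt, max_new_tokens=256)
    return result[0]["generated_text"]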

@app.local_entrypoint()
def main():
    output = generate_text.remote("Explain quantum computing in simple terms:")
    print(output)

Advanced usage

1. Keeping the model resident in memory (Cls)

import modal

app = modal.App("model-server")

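# fastapi is included because recent Modal image builders do not bundle it
# for web endpoints
image = modal.Image.debian_slim().pip_install(
    "vllm", "torch", "fastapi[standard]"
)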

@app.cls(gpu="A100", image=image, container_idle_timeout=300)
class LLMServer:
    @modal.enter()
    def load_model(self):
        """Load the model when the container starts (runs only once)."""
        from vllm import LLM
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    @modal.method()
    def generate(self, prompt: str) -> str:
        from vllm import SamplingParams
        params = SamplingParams(max_tokens=512, temperature=0.7)
        outputs = self.llm.generate([prompt], params)
        return outputs[0].outputs[0].text

    @modal.web_endpoint(method="POST")
    def api(self, item: dict):
        # Inside the container, call the sibling method in-process with .local()
        return {"response": self.generate.local(item["prompt"])}
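
The reason for splitting loading from serving is that the expensive step runs once per container while requests reuse its result. The plain-Python sketch below illustrates that lifecycle locally. `LocalModelServer` is a hypothetical stand-in that uses no Modal APIs, and `str.upper` plays the role of the model.

```python
class LocalModelServer:
    """Plain-Python stand-in for a Modal class; no Modal APIs are used."""

    def __init__(self) -> None:
        self.load_count = 0
        self.model = None

    def load_model(self) -> None:
        # Plays the role of the @modal.enter() hook: expensive, run once.
        self.load_count += 1
        self.model = str.upper  # placeholder "model"

    def generate(self, prompt: str) -> str:
        # Plays the role of a @modal.method(): cheap, run per request.
        if self.model is None:
            self.load_model()
        return self.model(prompt)


server = LocalModelServer()
replies = [server.generate(p) for p in ["hello", "modal", "gpu"]]
print(replies, "loads:", server.load_count)
```

Three requests trigger a single load, which is the behavior the `container_idle_timeout` setting is meant to preserve between bursts of traffic.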

2. Parallel batch processing

@app.function(gpu="T4", image=image, concurrency_limit=10)
def process_image(image_path: str) -> dict:
    """Process a single image."""
    # ... model inference logic goes here
    return {"path": image_path, "result": "..."}

@app.local_entrypoint()
def main():
    image_paths = [f"s3://bucket/img_{i}.jpg" for i in range(1000)]

    # Process 1,000 images in parallel (containers scale automatically, capped at 10)
    results = list(process_image.map(image_paths))
    print(f"Processed {len(results)} images")
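
For intuition about what a capped parallel map does, here is a local analogy built on the standard library: ten worker threads stand in for the ten-container limit, and results come back in input order. `process_item` is a hypothetical placeholder. Modal's `.map` is understood to preserve input order by default as well, which is worth confirming against the current docs.

```python
from concurrent.futures import ThreadPoolExecutor


def process_item(path: str) -> dict:
    """Stand-in for per-image inference (hypothetical)."""
    return {"path": path, "result": len(path)}


paths = [f"img_{i}.jpg" for i in range(25)]

# max_workers=10 mirrors the cap of 10 concurrent containers.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(process_item, paths))

print(len(results), results[0])
```

Because order is preserved, results can be zipped back against the input list without carrying an explicit index.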

3. Scheduled jobs (Cron)

@app.function(schedule=modal.Cron("0 */6 * * *"))  # every 6 hours
def periodic_job():
    """A job that runs on a schedule."""
    # Data sync, model evaluation, and so on
    pass
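
To read a cron expression such as `0 */6 * * *`, it helps to expand each field into the concrete values it matches. The helper below is a simplified sketch that handles only `*`, `*/n`, single numbers, and comma lists. `expand_cron_field` is an illustrative name rather than a Modal function, and the times are assumed to be in UTC.

```python
def expand_cron_field(field: str, low: int, high: int) -> list[int]:
    """Expand one cron field: supports '*', '*/n', 'a', and 'a,b,c' only."""
    values: set[int] = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(low, high + 1))
        elif part.startswith("*/"):
            step = int(part[2:])
            values.update(range(low, high + 1, step))
        else:
            values.add(int(part))
    return sorted(values)


minute, hour, *_ = "0 */6 * * *".split()
print(expand_cron_field(minute, 0, 59))  # → [0]
print(expand_cron_field(hour, 0, 23))    # → [0, 6, 12, 18]
```

So the schedule above fires at minute 0 of hours 0, 6, 12, and 18, four times a day.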

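4. Persisting data with a Volume

vol = modal.Volume.from_name("my-model-cache", create_if_missing=True)

# train_model() and load_model() are placeholders for your own code;
# `image` is assumed to include torch.
@app.function(volumes={"/cache": vol}, gpu="A100", image=image)
def train():
    import torch
    # Train the model
    model = train_model()
    # Save to the persistent volume
    torch.save(model.state_dict(), "/cache/model.pt")
    vol.commit()  # make the write visible to other containers

@app.function(volumes={"/cache": vol}, gpu="T4", image=image)
def inference(prompt: str):
    import torch
    # Load from the persistent volume
    model = load_model("/cache/model.pt")
    return model.generate(prompt)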
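
A common pattern around persistent storage is to check for an existing checkpoint before repeating expensive work. The sketch below shows that logic locally, with a temporary directory standing in for the mounted `/cache` path and JSON standing in for real model weights. `load_or_train` is a hypothetical helper. On Modal, the writing side would additionally call `vol.commit()`.

```python
import json
import tempfile
from pathlib import Path


def load_or_train(cache_dir: Path) -> tuple[dict, bool]:
    """Return (weights, was_cached); train only when no checkpoint exists."""
    checkpoint = cache_dir / "model.json"
    if checkpoint.exists():
        return json.loads(checkpoint.read_text()), True
    weights = {"w": [0.1, 0.2, 0.3]}  # placeholder for real training
    checkpoint.write_text(json.dumps(weights))
    return weights, False


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_train(Path(tmp))
    second = load_or_train(Path(tmp))

print(first[1], second[1])  # → False True
```

The first call trains and writes the checkpoint, and the second call finds it and skips training.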

5. Deploying a web endpoint

from modal import web_endpoint

@app.function(gpu="T4", image=image)
@web_endpoint(method="POST")
def predict(item: dict):
    """Expose the function as an HTTP API."""
    # run_inference() is a placeholder for your own inference code
    result = run_inference(item["input"])
    return {"prediction": result}

After `modal deploy`, the endpoint gets a persistent URL of the form https://<workspace>--<app-name>-predict.modal.run
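
Once deployed, the endpoint can be called like any JSON-over-HTTP API. The client sketch below builds the POST request with the standard library only. The URL is a placeholder to be replaced with the one printed at deploy time, and `build_request` is an illustrative helper name.

```python
import json
import urllib.request

# Placeholder URL: substitute the one printed by `modal deploy`.
ENDPOINT = "https://your-workspace--your-app-predict.modal.run"


def build_request(payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for the endpoint (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request({"input": "hello"})
print(req.get_method(), req.full_url)

# Sending it requires a live deployment:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp))
```

The payload key `input` matches what the `predict` function above reads from `item`.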

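6. Custom images

image = (
    # The CUDA base image ships without Python, so ask Modal to add one
    modal.Image.from_registry("nvidia/cuda:12.1.0-devel-ubuntu22.04", add_python="3.11")
    .apt_install("git", "wget")
    # extra_index_url keeps PyPI available: flash-attn is not hosted on the
    # PyTorch index (it may also need extra build steps, depending on version)
    .pip_install("torch==2.1.0", "flash-attn", extra_index_url="https://download.pytorch.org/whl/cu121")
    .run_commands("git clone https://github.com/some/repo /opt/repo")
    .env({"HF_HOME": "/cache/huggingface"})
)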

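Troubleshooting and FAQ

Q: Cold starts are too slow

# Use keep_warm to keep at least one container running
# (a warm container is billed while idle; newer Modal releases
# call this parameter min_containers)
@app.function(gpu="T4", keep_warm=1)
def fast_inference(x):
    ...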

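Q: Downloading the model inside the container is too slow

# Download during the image build so the weights are cached in the image.
# download_model is your own function (not shown) that fetches the weights.
image = modal.Image.debian_slim().pip_install("transformers").run_function(
    download_model,  # runs at build time
    secrets=[modal.Secret.from_name("huggingface-token")],
)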

Q: OOM (out of GPU memory)

Switch to a larger GPU: gpu="A100-80GB"
Use quantization: load_in_4bit=True
Reduce the batch size
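
A quick back-of-the-envelope check before choosing a GPU is parameter count times bytes per parameter. The sketch below estimates weight memory only; the KV cache, activations, and framework overhead come on top. `weight_gb` is an illustrative helper, and 1 GB is taken as 1e9 bytes.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]


for dtype in BYTES_PER_PARAM:
    size = weight_gb(8, dtype)
    verdict = "fits" if size < 24 else "does not fit"
    print(f"8B @ {dtype}: {size:.0f} GB, {verdict} in a 24 GB A10G (weights only)")
```

By this estimate an 8B model needs about 32 GB in fp32 but about 4 GB at 4-bit, which is why quantization is often the cheapest first fix.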

Q: How do I debug a remote function?

# .local() runs the function body in the current process; nothing is sent to Modal
result = square.local(42)

Q: Managing Secrets

# Create via the CLI
modal secret create my-secret API_KEY=xxx

# Or manage through the web console
# https://modal.com/secrets

@app.function(secrets=[modal.Secret.from_name("my-secret")])
def use_secret():
    import os
    key = os.environ["API_KEY"]  # injected from my-secret as an environment variable
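
Since a Secret surfaces as ordinary environment variables, a missing or misnamed Secret shows up as a bare `KeyError`. A small guard can make that failure easier to diagnose. `require_env` below is a hypothetical helper, and the assignment to `os.environ` only simulates what an attached Secret would inject.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; is the Secret attached to this function?"
        )
    return value


os.environ["API_KEY"] = "demo-value"  # simulates an attached Secret
print(require_env("API_KEY"))
```

Failing early with the variable name in the message is usually quicker to debug than an authentication error from a downstream API.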

References

Official documentation: https://modal.com/docs
GitHub examples: https://github.com/modal-labs/modal-examples
Pricing page: https://modal.com/pricing
Blog and tutorials: https://modal.com/blog
Discord community: https://discord.gg/modal
Official cookbook: https://modal.com/docs/examples