AI Inference Backends

AnimaFlow supports four powerful AI backends to run your local LLM inference. Choose the right engine for your studio's hardware and workflow.

AnimaFlow's AI assistants can run on your own hardware, keeping all data inside your studio. The IT Dashboard lets you install, start, stop, upgrade, and benchmark each backend with a single click. All backends serve the same OpenAI-compatible API, so your assistants work identically regardless of which engine is running.
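Because every backend serves the same OpenAI-compatible API, switching engines only changes the base URL a client talks to. A minimal sketch using Python's standard library (the port shown is Ollama's default; the model name and prompt are illustrative, not AnimaFlow defaults):

```python
import json
import urllib.request

# Illustrative: Ollama listens on 11434 by default; the other backends
# differ only in host/port, not in the shape of the request.
BASE_URL = "http://localhost:11434/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style /chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

payload = build_chat_request("llama3", "Name this shot in five words.")
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# reply = json.load(urllib.request.urlopen(req))
# print(reply["choices"][0]["message"]["content"])
```

The same payload works unchanged against whichever backend the IT Dashboard currently has running.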

Ollama

Easiest to get started

The most user-friendly backend. Ollama wraps llama.cpp with a polished CLI and REST API, automatic model management, and one-command model pulling. Ideal for studios that want a "just works" experience without tuning low-level parameters.

Advantages

  • One-command install and model pull (ollama pull llama3)
  • Built-in model library with hundreds of models
  • Automatic GPU detection and layer offloading
  • Runs on CPU, NVIDIA, AMD, and Apple Silicon
  • OpenAI-compatible REST API out of the box
  • Minimal configuration required
  • Hot-swap models without restart

Limitations

  • Single-user inference (no batched requests)
  • No tensor parallelism across multiple GPUs
  • Slightly lower throughput than vLLM/SGLang at scale

llama.cpp

Maximum hardware compatibility

The original C/C++ LLM inference engine. llama.cpp runs on virtually any hardware, including CPUs without a GPU, and offers fine-grained control over quantization, context size, and memory allocation. It is the same engine that powers Ollama, exposed directly for maximum tuning.

Advantages

  • Runs on CPU, NVIDIA (CUDA), AMD (ROCm), Apple Metal, Vulkan
  • Extremely low memory footprint with GGUF quantization
  • Fine-grained control: threads, batch size, context length, GPU layers
  • No Python dependencies, pure C/C++ binary
  • Built-in HTTP server with OpenAI-compatible API
  • Supports speculative decoding and grammar-constrained output

Limitations

  • Manual model download and path configuration
  • Requires more technical knowledge to optimize
  • Single-GPU only (no tensor parallelism)
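The tuning knobs above are passed as flags to the llama-server binary. A sketch of how a launcher might assemble them, assuming the upstream flag names (-t, -c, -ngl); the model path and values are illustrative:

```python
import shlex

def llama_server_cmd(model_path, threads=8, ctx=4096, gpu_layers=35, port=8080):
    """Assemble a llama-server command line with the common tuning flags."""
    return [
        "llama-server",
        "-m", model_path,         # GGUF model file (manually downloaded)
        "-t", str(threads),       # CPU threads
        "-c", str(ctx),           # context length in tokens
        "-ngl", str(gpu_layers),  # layers offloaded to the GPU
        "--port", str(port),      # OpenAI-compatible HTTP server port
    ]

cmd = llama_server_cmd("models/llama-3-8b-Q4_K_M.gguf")
print(shlex.join(cmd))
```

On a CPU-only box, setting gpu_layers to 0 keeps the whole model in system RAM.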

SGLang

Highest throughput for structured output

SGLang (Structured Generation Language) is a cutting-edge inference framework from UC Berkeley. It excels at structured output generation (JSON, code) and offers RadixAttention for efficient prefix caching. Best choice for studios running complex AI pipelines with structured prompts.

Advantages

  • RadixAttention: automatic prefix caching for repeated prompts
  • Fastest structured output (JSON mode) of any backend
  • Multi-GPU tensor parallelism
  • Continuous batching for high concurrent throughput
  • OpenAI-compatible API
  • Excellent for pipeline automation (shot descriptions, metadata extraction)

Limitations

  • NVIDIA GPUs only (CUDA required)
  • Higher VRAM requirements than GGUF-based backends
  • Newer project, smaller community than Ollama
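Structured output is requested through the OpenAI-style response_format field, which SGLang's compatible server accepts. A sketch of a schema-constrained request; the schema, model name, and prompt are illustrative:

```python
import json

# Hypothetical schema for a shot-metadata extraction step.
shot_schema = {
    "type": "object",
    "properties": {
        "shot_id": {"type": "string"},
        "description": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["shot_id", "description"],
}

payload = {
    "model": "meta-llama/Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Extract metadata for shot 042."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "shot_metadata", "schema": shot_schema},
    },
}
# POST this to the SGLang server's /v1/chat/completions endpoint; the
# engine constrains decoding so the reply always parses as valid JSON.
print(json.dumps(payload)[:40])
```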

vLLM

Production-grade serving at scale

vLLM is the industry standard for production LLM serving, used by companies like Databricks and Anyscale. Its PagedAttention algorithm achieves near-optimal GPU memory utilization, and it supports multi-GPU tensor parallelism for large models. The best choice for studios serving many concurrent artists.

Advantages

  • PagedAttention: near-zero memory waste, 2-4x throughput vs naive serving
  • Multi-GPU tensor parallelism (run 70B+ models across GPUs)
  • Continuous batching for dozens of concurrent requests
  • Supports GPTQ, AWQ, and FP16 model formats
  • OpenAI-compatible API with streaming
  • Battle-tested in production at scale

Limitations

  • NVIDIA GPUs only (CUDA required)
  • Higher VRAM baseline than GGUF quantized models
  • Longer cold-start time (model loading)
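The "near-zero memory waste" claim follows from how PagedAttention allocates the KV cache: in small fixed-size blocks on demand, rather than reserving the full maximum context per request up front. A toy comparison (block size and token counts are illustrative):

```python
def kv_waste(used_tokens, max_ctx, block=16):
    """Compare reserved-but-unused KV-cache tokens under naive
    preallocation vs fixed-size paged blocks (PagedAttention-style)."""
    naive_waste = max_ctx - used_tokens         # whole context reserved up front
    blocks = -(-used_tokens // block)           # ceiling division
    paged_waste = blocks * block - used_tokens  # at most block-1 tokens wasted
    return naive_waste, paged_waste

naive, paged = kv_waste(used_tokens=500, max_ctx=4096)
print(naive, paged)  # → 3596 12
```

With waste capped at one partial block per request, far more concurrent requests fit in the same VRAM, which is where the throughput gain over naive serving comes from.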

GPU Compatibility

AnimaFlow supports both major GPU vendors for AI inference

NVIDIA GPUs

Full support across all four backends. CUDA acceleration for maximum performance. Recommended: RTX 4090, RTX A6000, A100, H100.

AMD Radeon GPUs

ROCm support via Ollama and llama.cpp. Vulkan fallback for broader compatibility. Recommended: RX 7900 XTX, Instinct MI300X.

Side-by-Side Comparison

Choose the backend that matches your hardware and team size

| Feature              | Ollama                   | llama.cpp                       | SGLang                  | vLLM                        |
|----------------------|--------------------------|---------------------------------|-------------------------|-----------------------------|
| Ease of Setup        | Easiest                  | Moderate                        | Advanced                | Advanced                    |
| GPU Support          | NVIDIA, AMD, Apple, CPU  | NVIDIA, AMD, Apple, Vulkan, CPU | NVIDIA only             | NVIDIA only                 |
| Multi-GPU            | No                       | No                              | Yes (TP)                | Yes (TP)                    |
| Concurrent Users     | 1-3                      | 1-5                             | 50+                     | 50+                         |
| Model Formats        | GGUF (auto-pull)         | GGUF                            | HuggingFace, GPTQ, AWQ  | HuggingFace, GPTQ, AWQ      |
| Min VRAM (7B model)  | ~4 GB (Q4)               | ~4 GB (Q4)                      | ~14 GB (FP16)           | ~14 GB (FP16)               |
| Structured Output    | Basic JSON               | Grammar-based                   | Best                    | Good                        |
| Best For             | Small studios, quick setup | CPU-only, edge, max compat    | Complex AI pipelines    | Large teams, production     |
| AnimaFlow Management | Full (install, tune, models) | Full (start, stop, upgrade) | Full (install, start, stop) | Full (install, start, stop) |
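The Min VRAM row is roughly parameter count times bytes per weight; the KV cache and runtime overhead come on top, which is why ~3.5 GB of Q4 weights is listed as "~4 GB":

```python
def weight_gb(params_billions, bytes_per_weight):
    """Approximate model weight size in GB (1 GB = 1e9 bytes here).
    Q4 quantization stores ~0.5 bytes per weight; FP16 stores 2 bytes."""
    return params_billions * 1e9 * bytes_per_weight / 1e9

print(weight_gb(7, 0.5))  # Q4: ~3.5 GB of weights, listed as "~4 GB"
print(weight_gb(7, 2.0))  # FP16: ~14 GB of weights
```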

Which backend should I choose?

Small Studio (1-10 artists)

Start with Ollama. Zero configuration, works on any GPU (or CPU), and the IT Dashboard handles everything. Upgrade to vLLM later if you need more throughput.

Large Studio (10-100+ artists)

Use vLLM or SGLang with NVIDIA GPUs. Continuous batching serves many concurrent requests efficiently. Multi-GPU support for 70B+ parameter models.

No GPU Available

Use llama.cpp or Ollama. Both support CPU-only inference with GGUF quantized models. Slower but functional for light workloads.

Heavy AI Pipeline Automation

Use SGLang. RadixAttention caches repeated prompt prefixes, making batch processing of shot descriptions and metadata extraction significantly faster.
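The speedup comes from sharing one long instruction prefix across many short requests: RadixAttention computes the shared prompt's KV cache once and reuses it for every shot. A sketch of how such a batch might be built (the prompt text and model name are illustrative):

```python
# One long shared instruction prefix plus short per-shot suffixes: with
# prefix caching, the prefix's KV cache is computed once and reused.
SYSTEM_PROMPT = (
    "You are a studio metadata assistant. For each shot, return a "
    "one-line description and a comma-separated tag list."
)

shots = ["sq01_sh010", "sq01_sh020", "sq02_sh005"]

batch = [
    {
        "model": "meta-llama/Llama-3-8B-Instruct",  # illustrative model name
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # shared prefix
            {"role": "user", "content": f"Describe shot {s}."},
        ],
    }
    for s in shots
]
print(len(batch))  # → 3
```

Only the short per-shot suffixes are recomputed, so cost per request drops as the shared prefix grows.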

AnimaFlow's IT Dashboard lets you install, benchmark, and switch between all four backends with a single click. The built-in auto-benchmark feature tests each backend on your hardware and recommends the optimal choice.