AnimaFlow supports four AI backends for running LLM inference locally. Choose the engine that matches your studio's hardware and workflow.
AnimaFlow's AI assistants can run on your own hardware, keeping all data inside your studio. The IT Dashboard lets you install, start, stop, upgrade, and benchmark each backend with a single click. All backends serve the same OpenAI-compatible API, so your assistants work identically regardless of which engine is running.
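Because every backend speaks the same OpenAI-compatible API, client code is identical no matter which engine the IT Dashboard has running. A minimal sketch of building such a request, where the base URL, port, and model name are placeholders for whatever your deployment uses:

```python
import json

# Hypothetical local endpoint; each backend serves the same
# /v1/chat/completions route, so only this URL ever changes.
BASE_URL = "http://localhost:8000/v1"

def build_chat_request(model: str, user_message: str) -> tuple[str, str]:
    """Return the (url, json_body) pair for an OpenAI-compatible
    chat-completions call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.2,
    }
    return f"{BASE_URL}/chat/completions", json.dumps(payload)

url, body = build_chat_request("llama3", "Summarize the notes for shot 042.")
```

POST the body with any HTTP client; the response follows the standard OpenAI shape, with the generated text under `choices[0].message.content`.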
Easiest to get started
The most user-friendly backend. Ollama wraps llama.cpp with a polished CLI and REST API, automatic model management, and one-command model pulling. Ideal for studios that want a "just works" experience without tuning low-level parameters.
`ollama pull llama3`

Maximum hardware compatibility
The original C/C++ LLM inference engine. llama.cpp runs on virtually any hardware, including CPUs without a GPU. It offers fine-grained control over quantization, context size, and memory allocation. It is the same engine that powers Ollama, exposed directly for maximum tuning.
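The knobs llama.cpp exposes (quantization level, context size) translate directly into memory budgets. A back-of-the-envelope sketch, assuming typical Llama-style 7B dimensions (32 layers, 32 KV heads, head dimension 128) and roughly 4.5 bits per weight for a Q4-class quantization:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold the quantized weights."""
    return n_params * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_ctx: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """fp16 K and V tensors for every layer at full context length."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem / 1e9

# 7B model at ~4.5 bits/weight (Q4-class average) -- close to the
# ~4 GB figure in the comparison table below.
weights = weight_memory_gb(7e9, 4.5)
# KV cache at a 4096-token context with assumed Llama-7B dimensions.
kv = kv_cache_gb(n_layers=32, n_ctx=4096, n_kv_heads=32, head_dim=128)
```

Halving the context size halves the KV cache, which is why tuning `n_ctx` is the first lever on memory-constrained machines.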
Highest throughput for structured output
SGLang (Structured Generation Language) is a cutting-edge inference framework from UC Berkeley. It excels at structured output generation (JSON, code) and offers RadixAttention for efficient prefix caching. Best choice for studios running complex AI pipelines with structured prompts.
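The prefix-caching idea behind RadixAttention can be illustrated with a toy trie: tokens that match an earlier prompt's prefix need no recomputation. This sketch uses word-level "tokens" and counts reuse; a real engine caches KV tensors, and the prompt text is purely illustrative:

```python
class PrefixCache:
    """Toy illustration of RadixAttention-style prefix reuse."""
    def __init__(self):
        self.root = {}

    def lookup_and_insert(self, tokens):
        """Return how many leading tokens were already cached,
        then cache the full token sequence."""
        node, reused, matching = self.root, 0, True
        for tok in tokens:
            if matching and tok in node:
                reused += 1
            else:
                matching = False
            node = node.setdefault(tok, {})
        return reused

cache = PrefixCache()
template = "Extract JSON metadata for shot".split()
first = cache.lookup_and_insert(template + ["042"])   # cold cache: nothing reused
second = cache.lookup_and_insert(template + ["043"])  # shared 5-token prefix reused
```

With a common template prefix across a batch of shot descriptions, only the suffix of each prompt incurs fresh computation, which is where SGLang's batch-processing speedups come from.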
Production-grade serving at scale
vLLM is the industry standard for production LLM serving, used by companies like Databricks and Anyscale. Its PagedAttention algorithm achieves near-optimal GPU memory utilization, and it supports multi-GPU tensor parallelism for large models. The best choice for studios serving many concurrent artists.
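PagedAttention's memory efficiency comes from carving the KV cache into fixed-size blocks handed out from a shared pool, so sequences of different lengths waste at most one partial block each. A toy sketch of that allocation scheme (block count and size are illustrative, not vLLM defaults):

```python
class PagedKVAllocator:
    """Toy sketch of PagedAttention-style block allocation."""
    def __init__(self, total_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(total_blocks))
        self.tables = {}  # sequence id -> list of block ids

    def allocate(self, seq_id: str, n_tokens: int) -> list:
        n_blocks = -(-n_tokens // self.block_size)  # ceiling division
        if n_blocks > len(self.free):
            raise MemoryError("KV pool exhausted")
        blocks = [self.free.pop() for _ in range(n_blocks)]
        self.tables[seq_id] = blocks
        return blocks

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id))

pool = PagedKVAllocator(total_blocks=64, block_size=16)
a = pool.allocate("artist-1", 100)  # 100 tokens -> 7 blocks of 16
b = pool.allocate("artist-2", 16)   # exactly 1 block
pool.release("artist-1")            # freed blocks serve the next request
```

Because blocks return to the pool the moment a request finishes, many concurrent artists can share one GPU's KV memory with minimal fragmentation.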
AnimaFlow supports both major GPU vendors for AI inference.
**NVIDIA**: Full support across all four backends. CUDA acceleration for maximum performance. Recommended: RTX 4090, RTX A6000, A100, H100.

**AMD**: ROCm support via Ollama and llama.cpp. Vulkan fallback for broader compatibility. Recommended: RX 7900 XTX, Instinct MI300X.
Choose the backend that matches your hardware and team size
| Feature | Ollama | llama.cpp | SGLang | vLLM |
|---|---|---|---|---|
| Ease of Setup | Easiest | Moderate | Advanced | Advanced |
| GPU Support | NVIDIA, AMD, Apple, CPU | NVIDIA, AMD, Apple, Vulkan, CPU | NVIDIA only | NVIDIA only |
| Multi-GPU | No | No | Yes (TP) | Yes (TP) |
| Concurrent Users | 1-3 | 1-5 | 50+ | 50+ |
| Model Formats | GGUF (auto-pull) | GGUF | HuggingFace, GPTQ, AWQ | HuggingFace, GPTQ, AWQ |
| Min VRAM (7B model) | ~4 GB (Q4) | ~4 GB (Q4) | ~14 GB (FP16) | ~14 GB (FP16) |
| Structured Output | Basic JSON | Grammar-based | Best (constrained decoding) | Good |
| Best For | Small studios, quick setup | CPU-only, edge, max compat | Complex AI pipelines | Large teams, production |
| AnimaFlow Management | Full (install, tune, models) | Full (start, stop, upgrade) | Full (install, start, stop) | Full (install, start, stop) |
Start with Ollama. Zero configuration, works on any GPU (or CPU), and the IT Dashboard handles everything. Upgrade to vLLM later if you need more throughput.
Use vLLM or SGLang with NVIDIA GPUs. Continuous batching serves many concurrent requests efficiently. Multi-GPU support for 70B+ parameter models.
Use llama.cpp or Ollama. Both support CPU-only inference with GGUF quantized models. Slower but functional for light workloads.
Use SGLang. RadixAttention caches repeated prompt prefixes, making batch processing of shot descriptions and metadata extraction significantly faster.
AnimaFlow's IT Dashboard lets you install, benchmark, and switch between all four backends with a single click. The built-in auto-benchmark feature tests each backend on your hardware and recommends the optimal choice.