LLM Service
Unified streaming and completion client supporting Ollama, OpenAI, Anthropic, Google, Groq, Cohere, Mistral, and NVIDIA NIM.
Overview
backend/services/llm/ is Luna's unified LLM abstraction layer. It exposes a single LLMClient class whose methods (stream_chat, complete, embed) route to whichever provider is configured in .env, so your calling code never changes when you swap providers.
| Module | Contents |
|---|---|
| llm/providers.py | All per-provider streaming generators and non-streaming completions. |
| llm/client.py | LLMClient class + ollama singleton. |
| llm/__init__.py | Public re-exports. |
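To make the routing idea concrete, here is a minimal sketch of how a provider name and its model could be resolved from the environment variables listed under Configuration. It is not Luna's actual implementation: the function name `resolve_provider_and_model` and the lookup table are hypothetical, and the fallback to `ollama` only mirrors the "default" label in the configuration example.

```python
from typing import Mapping

# Provider names and model variables as they appear in the Configuration section.
MODEL_ENV_VAR = {
    "ollama": "OLLAMA_MODEL",
    "openai-compatible": "OPENAI_MODEL",
    "anthropic": "ANTHROPIC_MODEL",
    "google": "GOOGLE_MODEL",
    "groq": "GROQ_MODEL",
    "cohere": "COHERE_MODEL",
    "mistral": "MISTRAL_MODEL",
    "nvidia-nim": "NVIDIA_NIM_MODEL",
}


def resolve_provider_and_model(env: Mapping[str, str]) -> tuple[str, str]:
    """Hypothetical helper: pick the provider and its model from env-style settings."""
    # Assumption: an unset LLM_PROVIDER falls back to the local Ollama default.
    provider = env.get("LLM_PROVIDER", "ollama").strip().lower()
    if provider not in MODEL_ENV_VAR:
        raise ValueError(f"Unsupported LLM_PROVIDER: {provider!r}")
    variable = MODEL_ENV_VAR[provider]
    model = env.get(variable, "").strip()
    if not model:
        raise ValueError(f"{variable} must be set when LLM_PROVIDER={provider}")
    return provider, model
```

In an application you would typically pass `os.environ` (or your settings object) as `env`; keeping the input a plain mapping makes the lookup easy to test.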
Configuration
Set LLM_PROVIDER in .env and supply the matching key/model:
# ── Ollama (local, default) ──────────────────────────────────────
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=qwen2.5:7b
OLLAMA_EMBED_MODEL=nomic-embed-text
# ── OpenAI or any OpenAI-compatible endpoint ─────────────────────
LLM_PROVIDER=openai-compatible
OPENAI_BASE_URL=https://api.openai.com/v1 # or LM Studio / Jan.ai / OpenRouter
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
OPENAI_EMBED_MODEL=text-embedding-3-small
# ── Anthropic Claude ─────────────────────────────────────────────
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-sonnet-4-6
# ── Google Gemini ─────────────────────────────────────────────────
LLM_PROVIDER=google
GOOGLE_API_KEY=AIza...
GOOGLE_MODEL=gemini-2.0-flash
# ── Groq ──────────────────────────────────────────────────────────
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile
# ── Cohere ────────────────────────────────────────────────────────
LLM_PROVIDER=cohere
COHERE_API_KEY=...
COHERE_MODEL=command-r-plus
# ── Mistral AI ────────────────────────────────────────────────────
LLM_PROVIDER=mistral
MISTRAL_API_KEY=...
MISTRAL_MODEL=mistral-large-latest
# ── NVIDIA NIM ────────────────────────────────────────────────────
LLM_PROVIDER=nvidia-nim
NVIDIA_NIM_BASE_URL=https://integrate.api.nvidia.com/v1
NVIDIA_NIM_API_KEY=nvapi-...
NVIDIA_NIM_MODEL=meta/llama-3.1-8b-instruct
Embeddings provider
Embeddings (used by the memory manager) can use a different provider from chat:
# Use Ollama for chat but OpenAI for embeddings (higher quality vectors)
LLM_PROVIDER=ollama
EMBEDDING_PROVIDER=openai-compatible
OPENAI_API_KEY=sk-...
OPENAI_EMBED_MODEL=text-embedding-3-small
stream_chat()
Returns an async generator that yields string tokens as they arrive from the model. This is the method used by every chat router, coding agent, and streaming response.
def stream_chat(
    messages: list[dict],            # [{"role": "user"|"assistant", "content": "..."}]
    system_prompt: str,
    *,
    num_ctx: int | None = None,      # Ollama only: context window override
    num_predict: int | None = None,  # Ollama only: max new tokens override
    temperature: float = 0.7,
) -> AsyncGenerator[str, None]
Usage
import asyncio
from backend.services.llm import ollama

async def main():
    messages = [
        {"role": "user", "content": "Explain how transformers work."}
    ]
    async for token in ollama.stream_chat(messages, system_prompt="Be concise."):
        print(token, end="", flush=True)

asyncio.run(main())
Multi-turn conversation
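import asyncio
from backend.services.llm import ollama

async def chat_loop():
    history = []
    while True:
        user_input = input("You: ")
        history.append({"role": "user", "content": user_input})
        response = ""
        # Accumulate streamed tokens so the full reply can be added to the history
        async for token in ollama.stream_chat(history, system_prompt="You are helpful."):
            print(token, end="", flush=True)
            response += token
        history.append({"role": "assistant", "content": response})
        print()

asyncio.run(chat_loop())
complete()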
Non-streaming, awaitable completion. Used internally by the memory manager for fact extraction and contradiction detection.
async def complete(
    prompt: str,
    system: str = "",
    temperature: float = 0.3,
) -> str
Usage
import asyncio
from backend.services.llm import ollama

document = "..."  # placeholder: the text to analyze

result = asyncio.run(
    ollama.complete(
        "Extract the main topics from this text: " + document,
        system="Return a JSON list of strings.",
        temperature=0.1,
    )
)
print(result)
embed()
Generates a float embedding vector for a piece of text. Used by the memory manager to store and retrieve facts via semantic similarity.
async def embed(text: str) -> list[float]
Usage
import asyncio
from backend.services.llm import ollama

vector = asyncio.run(ollama.embed("the user prefers dark mode"))
print(len(vector))  # 768 for nomic-embed-text, 1536 for text-embedding-3-small
Check that len(vector) > 0 before storing.
Provider notes
Ollama — adaptive context window
Luna automatically sizes the Ollama context window to the actual prompt length instead of always allocating the full 8,192-token KV cache. This reduces time to first token (TTFT) by 20–40% on short turns. Override with num_ctx if needed:
async for token in ollama.stream_chat(msgs, sys, num_ctx=4096, num_predict=512):
    ...
OpenAI-compatible — any endpoint
Setting LLM_PROVIDER=openai-compatible works with OpenAI, OpenRouter, LM Studio, Jan.ai, llama.cpp, Ollama's OpenAI-compat endpoint, and any server that exposes /chat/completions.
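The reason one provider setting covers so many servers is that they all accept the same request shape on /chat/completions. The sketch below builds such a request from the stream_chat-style arguments. It is an illustration only, not the code in llm/providers.py: the helper name `build_chat_completions_request` is hypothetical, and passing the system prompt as a leading `system` message is an assumption about how it would be mapped.

```python
from typing import Any


def build_chat_completions_request(
    base_url: str,
    api_key: str,
    model: str,
    messages: list[dict],
    system_prompt: str = "",
    temperature: float = 0.7,
    stream: bool = True,
) -> dict[str, Any]:
    """Hypothetical helper: assemble URL, headers, and JSON body for /chat/completions."""
    chat_messages: list[dict] = []
    if system_prompt:
        # Assumption: the system prompt travels as the first message.
        chat_messages.append({"role": "system", "content": system_prompt})
    chat_messages.extend(messages)
    return {
        "url": base_url.rstrip("/") + "/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "json": {
            "model": model,
            "messages": chat_messages,
            "temperature": temperature,
            "stream": stream,
        },
    }
```

Pointing the same request at another server only changes `base_url`, `api_key`, and `model`, which is what OPENAI_BASE_URL, OPENAI_API_KEY, and OPENAI_MODEL control.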
Anthropic — max tokens
The Anthropic handler hard-codes max_tokens=8192 for stream_chat and max_tokens=4096 for complete. Adjust these values in llm/providers.py if you need longer outputs.
Groq — rate limits
Groq's free tier has per-minute token limits. If you hit them, catch httpx.HTTPStatusError (status 429) and back off.
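One way to back off is a small retry wrapper with exponential delays. The sketch below is a generic helper, not part of Luna: `retry_on_rate_limit` and its parameters are hypothetical names, and the caller supplies the check that decides whether an exception is a rate limit.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_on_rate_limit(
    call: Callable[[], Awaitable[T]],
    is_rate_limited: Callable[[Exception], bool],
    *,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
) -> T:
    """Await call(), retrying with exponential backoff while is_rate_limited(exc) is true."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or the final failure.
            if not is_rate_limited(exc) or attempt == max_attempts:
                raise
            # Delay doubles each attempt (base, 2x, 4x, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A little random jitter keeps concurrent callers from retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError("unreachable")
```

With httpx, the check could be `lambda e: isinstance(e, httpx.HTTPStatusError) and e.response.status_code == 429`, and `call` could be `lambda: ollama.complete(prompt)`. Wrapping stream_chat this way needs more care, because a retry after tokens have already been yielded would repeat output.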
The ollama singleton
from backend.services.llm import ollama gives you a pre-created LLMClient() instance. This is a convenience: it reads the provider from settings at call time, not at import time, so changing .env and restarting is all that's needed to swap providers.
The name ollama is historical (Luna originally only supported Ollama). It is now a fully multi-provider client regardless of its name.
import asyncio
from backend.services.llm import ollama, LLMClient

async def main():
    # Using the shared singleton (recommended)
    await ollama.complete("Hello")

    # Or create your own client with different settings
    client = LLMClient()
    print(client.provider)  # reads LLM_PROVIDER from .env
    print(client.model)     # resolves to the correct model name

asyncio.run(main())
Via HTTP
The LLM service powers every chat endpoint. You don't call it directly over HTTP; instead, you call the chat router, which invokes it internally:
| Endpoint | Method | Description |
|---|---|---|
| /api/chat/stream | POST (SSE) | Streaming chat with memory + personality. |
| /api/chat | POST | Non-streaming chat. |
| /api/coding/stream | POST (SSE) | Coding agent with tool calls. |
See API Reference for full request/response shapes.
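The two streaming endpoints respond with Server-Sent Events (SSE). As a rough illustration of the wire format only, the sketch below extracts the `data:` payloads from a sequence of SSE lines. It assumes nothing about what Luna puts inside each payload (that is defined in the API Reference), and `iter_sse_data` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in an SSE line stream."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            # Comment line, commonly used as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event, id, retry) are ignored in this sketch.
    # An event not terminated by a blank line is discarded.
```

In practice the lines would come from a streaming HTTP response, and each yielded payload would then be decoded according to the documented response shape.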