LLM Service

Unified streaming and completion client supporting Ollama, OpenAI, Anthropic, Google, Groq, Cohere, Mistral, and NVIDIA NIM.

Overview

backend/services/llm/ is Luna's unified LLM abstraction layer. It exposes a single LLMClient class whose methods (stream_chat, complete, embed) route to whichever provider is configured in .env, so your calling code never changes when you swap providers.

| Module | Contents |
| --- | --- |
| llm/providers.py | All per-provider streaming generators and non-streaming completions. |
| llm/client.py | LLMClient class + ollama singleton. |
| llm/__init__.py | Public re-exports. |

Configuration

Set LLM_PROVIDER in .env and supply the matching key/model:

.env — provider options
# ── Ollama (local, default) ──────────────────────────────────────
LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=qwen2.5:7b
OLLAMA_EMBED_MODEL=nomic-embed-text

# ── OpenAI or any OpenAI-compatible endpoint ─────────────────────
LLM_PROVIDER=openai-compatible
OPENAI_BASE_URL=https://api.openai.com/v1   # or LM Studio / Jan.ai / OpenRouter
OPENAI_API_KEY=sk-...
OPENAI_MODEL=gpt-4o-mini
OPENAI_EMBED_MODEL=text-embedding-3-small

# ── Anthropic Claude ─────────────────────────────────────────────
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-sonnet-4-6

# ── Google Gemini ─────────────────────────────────────────────────
LLM_PROVIDER=google
GOOGLE_API_KEY=AIza...
GOOGLE_MODEL=gemini-2.0-flash

# ── Groq ──────────────────────────────────────────────────────────
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile

# ── Cohere ────────────────────────────────────────────────────────
LLM_PROVIDER=cohere
COHERE_API_KEY=...
COHERE_MODEL=command-r-plus

# ── Mistral AI ────────────────────────────────────────────────────
LLM_PROVIDER=mistral
MISTRAL_API_KEY=...
MISTRAL_MODEL=mistral-large-latest

# ── NVIDIA NIM ────────────────────────────────────────────────────
LLM_PROVIDER=nvidia-nim
NVIDIA_NIM_BASE_URL=https://integrate.api.nvidia.com/v1
NVIDIA_NIM_API_KEY=nvapi-...
NVIDIA_NIM_MODEL=meta/llama-3.1-8b-instruct
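
Because each provider needs its own set of variables, a startup check can catch a missing key before the first request fails. The sketch below is not part of Luna: REQUIRED_VARS and missing_vars are hypothetical names, and which variables count as mandatory is an assumption read off the listing above.

```python
import os

# Hypothetical mapping derived from the .env listing above.
# Which variables Luna actually treats as mandatory is an assumption.
REQUIRED_VARS = {
    "ollama": ["OLLAMA_BASE_URL", "OLLAMA_MODEL"],
    "openai-compatible": ["OPENAI_BASE_URL", "OPENAI_API_KEY", "OPENAI_MODEL"],
    "anthropic": ["ANTHROPIC_API_KEY", "ANTHROPIC_MODEL"],
    "google": ["GOOGLE_API_KEY", "GOOGLE_MODEL"],
    "groq": ["GROQ_API_KEY", "GROQ_MODEL"],
    "cohere": ["COHERE_API_KEY", "COHERE_MODEL"],
    "mistral": ["MISTRAL_API_KEY", "MISTRAL_MODEL"],
    "nvidia-nim": ["NVIDIA_NIM_BASE_URL", "NVIDIA_NIM_API_KEY", "NVIDIA_NIM_MODEL"],
}


def missing_vars(env=None):
    """Return the variables that are unset or empty for the selected provider."""
    env = os.environ if env is None else env
    # The listing marks ollama as the default provider.
    provider = env.get("LLM_PROVIDER", "ollama")
    if provider not in REQUIRED_VARS:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]
```

Calling missing_vars() with no argument inspects the process environment; passing a plain dict makes the check easy to exercise in tests.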

Embeddings provider

Embeddings (used by the memory manager) can use a different provider from chat:

.env
# Use Ollama for chat but OpenAI for embeddings (higher quality vectors)
LLM_PROVIDER=ollama
EMBEDDING_PROVIDER=openai-compatible
OPENAI_API_KEY=sk-...
OPENAI_EMBED_MODEL=text-embedding-3-small
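
The split between chat and embedding providers can be pictured as two independent lookups. This is only a sketch: resolve_providers is a hypothetical helper, and the fallback to the chat provider when EMBEDDING_PROVIDER is unset is an assumption, not documented behaviour.

```python
def resolve_providers(env):
    """Return (chat_provider, embedding_provider) from a mapping of settings."""
    chat = env.get("LLM_PROVIDER", "ollama")
    # Assumption: an unset or empty EMBEDDING_PROVIDER falls back to the chat provider.
    embedding = env.get("EMBEDDING_PROVIDER") or chat
    return chat, embedding
```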

stream_chat()

Returns an async generator that yields string tokens as they arrive from the model. This is the method used by every chat router, coding agent, and streaming response.

signature
def stream_chat(
    messages: list[dict],      # [{"role": "user"|"assistant", "content": "..."}]
    system_prompt: str,
    *,
    num_ctx: int | None = None,      # Ollama only: context window override
    num_predict: int | None = None,  # Ollama only: max new tokens override
    temperature: float = 0.7,
) -> AsyncGenerator[str, None]

Usage

example.py
import asyncio
from backend.services.llm import ollama

async def main():
    messages = [
        {"role": "user", "content": "Explain how transformers work."}
    ]
    async for token in ollama.stream_chat(messages, system_prompt="Be concise."):
        print(token, end="", flush=True)

asyncio.run(main())

Multi-turn conversation

example.py
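import asyncio
from backend.services.llm import ollama

async def chat_loop():
    history = []
    while True:
        user_input = input("You: ")  # blocks the event loop; acceptable for a demo
        history.append({"role": "user", "content": user_input})

        # Accumulate streamed tokens so the full reply can be stored in history.
        response = ""
        async for token in ollama.stream_chat(history, system_prompt="You are helpful."):
            print(token, end="", flush=True)
            response += token

        history.append({"role": "assistant", "content": response})
        print()

asyncio.run(chat_loop())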

complete()

Non-streaming, awaitable completion. Used internally by the memory manager for fact extraction and contradiction detection.

signature
async def complete(
    prompt: str,
    system: str = "",
    temperature: float = 0.3,
) -> str

Usage

example.py
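import asyncio
from backend.services.llm import ollama

# Placeholder input; replace with the text you want analysed.
document = "Transformers use self-attention to relate every token to every other token."

result = asyncio.run(
    ollama.complete(
        "Extract the main topics from this text: " + document,
        system="Return a JSON list of strings.",
        temperature=0.1,
    )
)
print(result)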
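
complete() returns a plain string, so a prompt that asks for JSON still needs defensive parsing on the caller's side. A minimal sketch follows; parse_string_list is a hypothetical helper, not something Luna ships, and the code-fence handling reflects a common model habit rather than any documented behaviour.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_string_list(raw):
    """Parse a model reply that should be a JSON list of strings.

    Returns [] when the reply is not valid JSON or is not a list.
    """
    text = raw.strip()
    # Models sometimes wrap JSON in a Markdown code fence; drop the fence lines.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only string items; anything else is silently dropped.
    return [item for item in data if isinstance(item, str)]
```

With the example above, topics = parse_string_list(result) yields a list that is safe to iterate even when the model ignores the requested format.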

embed()

Generates a float embedding vector for a piece of text. Used by the memory manager to store and retrieve facts via semantic similarity.

signature
async def embed(text: str) -> list[float]

Usage

example.py
import asyncio
from backend.services.llm import ollama

vector = asyncio.run(ollama.embed("the user prefers dark mode"))
print(len(vector))  # 768 for nomic-embed-text, 1536 for text-embedding-3-small
Note: Returns an empty list if the embedding provider is not configured or unavailable. Always check len(vector) > 0 before storing.
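
To make the empty-list case concrete, here is a sketch of a similarity function that tolerates it. Cosine similarity is a common choice for comparing embeddings; whether Luna's memory manager uses exactly this measure is an assumption, and cosine_similarity is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors.

    Returns 0.0 if either vector is empty (embedding unavailable) or all zeros.
    """
    if not a or not b:
        return 0.0
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The length check also guards against mixing vectors from two embedding models, for example a 768-dimensional vector with a 1536-dimensional one.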

Provider notes

Ollama — adaptive context window

Luna automatically sizes the Ollama context window based on actual prompt length rather than always allocating the full 8,192-token KV cache. This reduces time to first token (TTFT) by 20–40% on short turns. Override with num_ctx if needed:

async for token in ollama.stream_chat(msgs, sys, num_ctx=4096, num_predict=512):
    ...
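
One way to picture adaptive sizing is to round the estimated token count up to the next power-of-two bucket. This is an illustrative sketch only: pick_num_ctx, the 4-characters-per-token heuristic, and the 1,024 floor are all assumptions, and Luna's actual sizing logic may differ.

```python
def pick_num_ctx(prompt_chars, num_predict=512, max_ctx=8192, min_ctx=1024):
    """Pick the smallest power-of-two context size that fits prompt plus reply."""
    # Rough heuristic (assumption): about 4 characters per token for English text.
    prompt_tokens = prompt_chars // 4 + 1
    needed = prompt_tokens + num_predict
    ctx = min_ctx
    # Double the bucket until it fits or the cap is reached.
    while ctx < needed and ctx < max_ctx:
        ctx *= 2
    return min(ctx, max_ctx)
```

Bucketing keeps the number of distinct context sizes small while still scaling with the prompt.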

OpenAI-compatible — any endpoint

Setting LLM_PROVIDER=openai-compatible works with OpenAI, OpenRouter, LM Studio, Jan.ai, llama.cpp, Ollama's OpenAI-compat endpoint, and any server that exposes /chat/completions.

Anthropic — max tokens

The Anthropic streaming handler hard-codes max_tokens=8192 for stream_chat and max_tokens=4096 for complete. Adjust in llm/providers.py if you need longer outputs.

Groq — rate limits

Groq's free tier has per-minute token limits. If you hit them, catch httpx.HTTPStatusError (status 429) and back off.
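
The back-off advice can be sketched as a small retry wrapper with exponential delay. with_backoff is a hypothetical helper, not part of Luna. It reads the status from exc.response.status_code, which is the attribute httpx.HTTPStatusError exposes, and takes the exception class as a parameter so the sketch itself has no third-party dependency.

```python
import asyncio
import random


async def with_backoff(make_call, *, retry_on, max_attempts=5, base_delay=1.0,
                       sleep=asyncio.sleep):
    """Await make_call(), retrying on HTTP 429 with exponential back-off.

    make_call: zero-argument callable returning a fresh awaitable per attempt.
    retry_on:  exception class to intercept, e.g. httpx.HTTPStatusError.
    """
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except retry_on as exc:
            response = getattr(exc, "response", None)
            status = getattr(response, "status_code", None)
            # Re-raise anything that is not a rate limit, or the final failure.
            if status != 429 or attempt == max_attempts - 1:
                raise
            # Delays of 1s, 2s, 4s, ... plus a little jitter to avoid lockstep retries.
            await sleep(base_delay * 2 ** attempt + random.uniform(0, 0.25))
```

With httpx installed, a call might look like await with_backoff(lambda: ollama.complete(prompt), retry_on=httpx.HTTPStatusError). Retrying a partially consumed stream_chat generator is more involved and is not covered by this sketch.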

The ollama singleton

from backend.services.llm import ollama gives you a pre-created LLMClient() instance. This is a convenience: it reads the provider from settings at call time, not at import time, so changing .env and restarting is all that's needed to swap providers.

The name ollama is historical (Luna originally only supported Ollama). It is now a fully multi-provider client regardless of its name.

example.py
import asyncio
from backend.services.llm import ollama, LLMClient

async def main():
    # Using the shared singleton (recommended)
    await ollama.complete("Hello")

    # Or create your own client with different settings
    client = LLMClient()
    print(client.provider)   # reads LLM_PROVIDER from .env
    print(client.model)      # resolves to the correct model name

asyncio.run(main())
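
The difference between call-time and import-time lookup can be shown with a property that reads the setting on every access. LazyProviderClient is a hypothetical stand-in for LLMClient. It only illustrates the timing, and reading os.environ directly is an assumption about where the settings live.

```python
import os


class LazyProviderClient:
    """Hypothetical stand-in for LLMClient; only the lookup timing is illustrated."""

    @property
    def provider(self):
        # Evaluated on every access, so the value is never frozen at import time.
        return os.environ.get("LLM_PROVIDER", "ollama")


client = LazyProviderClient()  # created once, like a module-level singleton
```

Because nothing is cached on the instance, the same object reports whichever provider is configured at the moment it is asked.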

Via HTTP

The LLM service powers every chat endpoint. You don't call it directly over HTTP; you call the chat router, which invokes it internally:

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/chat/stream | POST (SSE) | Streaming chat with memory + personality. |
| /api/chat | POST | Non-streaming chat. |
| /api/coding/stream | POST (SSE) | Coding agent with tool calls. |

See API Reference for full request/response shapes.
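
For the two SSE endpoints, a client has to split the response into events before it can read tokens. The sketch below handles only the generic Server-Sent Events framing. iter_sse_data is a hypothetical helper, and the shape of each payload (plain text or JSON) is not specified here, so it is passed through untouched.

```python
def iter_sse_data(lines):
    """Yield the data payload of each event from an iterable of SSE text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one leading space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and ":" comments are ignored here.
    # Lenient: flush a trailing event that lacks its final blank line.
    if buffer:
        yield "\n".join(buffer)
```

Any HTTP client that can iterate over response lines can feed this generator.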