Build with Luna — SDK Overview

Use Luna's AI engine in your own applications. Two integration paths, nine LLM providers, detailed service APIs.

What is the Luna SDK?

Luna is a full AI backend — not just a chat wrapper. When you run it, you get a FastAPI server at http://localhost:8899 that exposes a rich set of services your application can call: streaming LLM inference, persistent memory, personality-driven prompts, proactive scheduling, state detection, tool execution, and more.

You can integrate Luna into your own project in two ways:

  • HTTP API — call the REST endpoints from any language or framework.
  • Python import — import the service modules directly into your Python app.

Either way, one .env file controls which LLM provider Luna talks to, and swapping providers is a one-line change.

Two integration modes

Mode | When to use | Language | Overhead
HTTP API | Any language, microservice architecture, Electron/web apps, mobile | Any (REST + SSE) | Network round-trip (~1 ms local)
Python import | Python monorepo, scripts, notebooks, agents in same process | Python 3.10+ | None (in-process call)
💡 Recommended for most apps: use the HTTP API. It isolates your app from Luna's internals, lets your app and Luna restart independently, and works from any language. The Python import path is best when you're building a Python-first tool and want zero network overhead.

LLM provider compatibility

Luna's LLMClient is a unified interface that routes to any of the providers below. All providers expose the same stream_chat(), complete(), and embed() methods — your code doesn't change when you switch providers.
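
Because every provider exposes the same three methods, application code can be written against the interface rather than against one provider. The sketch below illustrates that idea under stated assumptions: the stream_chat() and complete() parameters are taken from the examples later on this page, the embed(text) signature is a guess, and ChatProvider and EchoProvider are hypothetical names that stand in for Luna's real classes.

```python
import asyncio
from typing import AsyncIterator, Protocol


class ChatProvider(Protocol):
    """Hypothetical shape of the unified interface; signatures are assumptions."""

    def stream_chat(self, messages: list[dict], system_prompt: str = "") -> AsyncIterator[str]: ...
    async def complete(self, prompt: str, temperature: float = 0.7) -> str: ...
    async def embed(self, text: str) -> list[float]: ...


class EchoProvider:
    """Stand-in provider for illustration only; it never calls a model."""

    async def stream_chat(self, messages, system_prompt=""):
        # Yield the last message back word by word, like a token stream.
        for word in messages[-1]["content"].split():
            yield word + " "

    async def complete(self, prompt, temperature=0.7):
        return prompt.upper()

    async def embed(self, text):
        return [float(len(text))]


async def ask(provider: ChatProvider, question: str) -> str:
    """Provider-agnostic caller: works with anything that offers stream_chat()."""
    parts = []
    async for token in provider.stream_chat(
        messages=[{"role": "user", "content": question}],
        system_prompt="You are a concise technical writer.",
    ):
        parts.append(token)
    return "".join(parts).strip()


answer = asyncio.run(ask(EchoProvider(), "hello from Luna"))
print(answer)
```

Only ask() would live in application code; swapping EchoProvider for a real provider object should not require touching it.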

Ollama (default)

Local inference — zero API cost, full privacy.

LLM_PROVIDER=ollama

Models: qwen2.5, llama3, mistral, phi3, …

OpenAI (supported)

GPT-4o, GPT-4-turbo, GPT-3.5-turbo and any OpenAI-compatible endpoint (OpenRouter, Jan.ai, llama.cpp, LM Studio).

LLM_PROVIDER=openai-compatible

Models: gpt-4o, gpt-4o-mini, gpt-4-turbo

Anthropic (supported)

Native Claude Messages API with SSE streaming.

LLM_PROVIDER=anthropic

Models: claude-opus-4, claude-sonnet-4-6, claude-haiku-4-5

Google Gemini (supported)

Native Gemini REST API with streaming.

LLM_PROVIDER=google

Models: gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash

Groq (supported)

Fast hosted inference; the lowest-latency option among the hosted providers listed here.

LLM_PROVIDER=groq

Models: llama-3.3-70b-versatile, mixtral-8x7b

Mistral AI (supported)

Native Mistral API.

LLM_PROVIDER=mistral

Models: mistral-large-latest, mistral-medium, mistral-small

Cohere (supported)

Cohere Chat API v2 with streaming.

LLM_PROVIDER=cohere

Models: command-r-plus, command-r, command-light

NVIDIA NIM (supported)

OpenAI-compatible endpoint for NVIDIA-optimised models.

LLM_PROVIDER=nvidia-nim

Models: meta/llama-3.1-8b-instruct, nvidia/nemotron-4-340b

LM Studio (compatible)

Set OPENAI_BASE_URL in .env to the address of your local LM Studio server.

LLM_PROVIDER=openai-compatible

Models: any model loaded in LM Studio

Switching providers

Edit .env and restart the backend. No code changes needed anywhere:

.env
# Keep only ONE of the blocks below active; they are alternatives.

# Local (default)
LLM_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:7b

# Groq — fastest hosted option
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile

# Anthropic Claude
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-sonnet-4-6

# Any OpenAI-compatible endpoint (OpenRouter, LM Studio, llama.cpp)
LLM_PROVIDER=openai-compatible
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=sk-or-...
OPENAI_MODEL=meta-llama/llama-3.3-70b-instruct
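
The blocks above are alternatives, so a .env file that still contains two LLM_PROVIDER lines is easy to produce by accident. A minimal sketch of a sanity check follows; it assumes plain KEY=VALUE lines with # comments, and parse_env and check_provider are hypothetical helpers, not part of Luna.

```python
def parse_env(text: str) -> dict[str, list[str]]:
    """Collect every value assigned to each key in simple KEY=VALUE text."""
    values: dict[str, list[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        values.setdefault(key.strip(), []).append(value.strip())
    return values


def check_provider(text: str) -> str:
    """Return the single configured provider, or raise if missing or ambiguous."""
    providers = parse_env(text).get("LLM_PROVIDER", [])
    if len(providers) != 1:
        raise ValueError(f"expected exactly one LLM_PROVIDER line, found {len(providers)}")
    return providers[0]


sample = """
# Groq
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile
"""
print(check_provider(sample))
```

Running check_provider() on the real file contents before restarting the backend catches a leftover second provider line early.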

Separate coding model

The coding agent can use a different model from the chat LLM — useful if you want a specialist coder locally while routing conversation to a cloud provider:

.env
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...

# Override just the coding agent to a local coder
CODING_PROVIDER=ollama
CODING_MODEL=qwen2.5-coder:7b
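
One way to picture the override is as a lookup with a fallback. The sketch assumes that the coding agent uses CODING_PROVIDER when it is set and otherwise falls back to LLM_PROVIDER, and that ollama applies when neither is set; resolve_coding_provider is a hypothetical helper, not a Luna function.

```python
def resolve_coding_provider(env: dict[str, str]) -> str:
    """Pick the coding agent's provider (fallback order is an assumption)."""
    chat_provider = env.get("LLM_PROVIDER", "ollama")
    # An empty or missing CODING_PROVIDER means "use the chat provider".
    return env.get("CODING_PROVIDER") or chat_provider


config = {"LLM_PROVIDER": "groq", "CODING_PROVIDER": "ollama"}
print("chat:", config["LLM_PROVIDER"], "| coding:", resolve_coding_provider(config))
```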

HTTP API quick start

1. Start the backend

cd /path/to/Luna
pip install -r backend/requirements.txt
uvicorn backend.main:app --host 127.0.0.1 --port 8899

The Swagger UI is available at http://localhost:8899/docs once running.
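
A client that starts alongside the backend may want to wait until the server answers before sending its first request. The sketch below polls the Swagger UI URL mentioned above using only the standard library; is_up and wait_until_ready are hypothetical helpers, and treating an HTTP 200 from /docs as "ready" is an assumption.

```python
import http.client
import time
import urllib.request


def is_up(url: str = "http://localhost:8899/docs") -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a broken reply.
        return False


def wait_until_ready(probe=is_up, attempts: int = 30, delay: float = 1.0) -> bool:
    """Call probe() until it succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # pause between attempts, not after the last one
    return False
```

Calling wait_until_ready() once at startup and aborting when it returns False keeps connection errors out of the request code.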

2. Stream a chat response (Python)

my_app.py
import httpx

with httpx.stream(
    "POST", "http://localhost:8899/api/chat/stream",
    json={"message": "Summarise my tasks for today", "conversation_id": None},
    timeout=60,
) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            print(line[6:], end="", flush=True)
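
The loop above treats every line that begins with "data: " as a piece of the reply. That filtering can be pulled into a small reusable generator, sketched below. It assumes the chat stream carries plain-text payloads, as the example above implies; sse_data is a hypothetical helper and the sample lines are made up.

```python
from typing import Iterable, Iterator


def sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line, skipping comments and other fields."""
    for line in lines:
        if line.startswith("data:"):
            payload = line[5:]
            # The SSE format allows one optional space after the colon.
            yield payload[1:] if payload.startswith(" ") else payload


sample = [": keep-alive", "data: Hello", "", "data:  world", "event: done"]
print("".join(sse_data(sample)))
```

With httpx, the same generator can be fed r.iter_lines() directly, so the consuming code only sees payload strings.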

2b. Stream a chat response (JavaScript / Node)

my_app.js
const response = await fetch('http://localhost:8899/api/chat/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ message: 'What is on my calendar today?', conversation_id: null }),
});

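const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep the trailing partial line for the next read
  for (const line of lines) {
    if (line.startsWith('data: ')) process.stdout.write(line.slice(6));
  }
}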

2c. Non-streaming (single response)

curl -s -X POST http://localhost:8899/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"message": "What can you do?"}' | jq .response
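
The same non-streaming call can be made from Python with only the standard library. This is a sketch under the assumptions visible in the curl command: a JSON body with a message field, and a JSON reply with a response field. build_chat_request and chat are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(message: str, base_url: str = "http://localhost:8899") -> urllib.request.Request:
    """Build the POST request for the non-streaming chat endpoint."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(message: str) -> str:
    """Send the request and return the `response` field of the JSON reply."""
    with urllib.request.urlopen(build_chat_request(message), timeout=60) as resp:
        return json.load(resp)["response"]
```

With the backend running, print(chat("What can you do?")) mirrors the curl command above.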

3. Run the coding agent

my_app.py
import json

import httpx

with httpx.stream(
    "POST", "http://localhost:8899/api/coding/stream",
    json={
        "message": "Create a FastAPI endpoint that returns a hello world JSON",
        "workspace_root": "/my/project",
    },
    timeout=120,
) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            # Each data line carries one JSON-encoded event
            event = json.loads(line[6:])
            # str() guards against non-string content before truncating
            print(event["type"], "→", str(event.get("content", ""))[:80])
📌 SSE event types from the coding agent: workspace_index · plan · tool_call · tool_result · token · done · error
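
A client will usually want to branch on the event type rather than print every event. The sketch below shows one possible dispatcher. It assumes that token events carry generated text in content, that done marks the end of the stream, and that error events carry a message in content; none of this is confirmed by the page, and summarise_events and the sample lines are hypothetical.

```python
import json


def summarise_events(lines):
    """Count coding-agent events by type and collect the generated text."""
    counts = {}
    tokens = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # ignore blank separators and non-data fields
        event = json.loads(line[6:])
        kind = event["type"]
        counts[kind] = counts.get(kind, 0) + 1
        if kind == "token":
            tokens.append(str(event.get("content", "")))
        elif kind == "error":
            raise RuntimeError(event.get("content", "coding agent reported an error"))
        elif kind == "done":
            break  # assumed to be the final event
    return counts, "".join(tokens)


sample = [
    'data: {"type": "plan", "content": "1. create main.py"}',
    'data: {"type": "token", "content": "def "}',
    'data: {"type": "token", "content": "hello()"}',
    'data: {"type": "done"}',
    'data: {"type": "token", "content": "ignored"}',
]
counts, text = summarise_events(sample)
print(counts, repr(text))
```

Passing r.iter_lines() from the httpx example in place of sample applies the same logic to a live stream.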

Python import quick start

Install the backend dependencies once, then import any service directly. No server needed.

pip install -r /path/to/Luna/backend/requirements.txt

LLM — streaming

my_app.py
import asyncio, sys
sys.path.insert(0, "/path/to/Luna")

from backend.services.llm import ollama

async def main():
    async for token in ollama.stream_chat(
        messages=[{"role": "user", "content": "Explain async/await in Python"}],
        system_prompt="You are a concise technical writer.",
    ):
        print(token, end="", flush=True)

asyncio.run(main())

LLM — one-shot completion

my_app.py
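import asyncio, sys
sys.path.insert(0, "/path/to/Luna")
from backend.services.llm import ollama
my_text = "Luna exposes memory, personality, and scheduling services."  # sample input
result = asyncio.run(
    ollama.complete("Extract the key topics from: " + my_text, temperature=0.2)
)
print(result)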

Memory — store and retrieve facts

my_app.py
import asyncio, sys
sys.path.insert(0, "/path/to/Luna")

from backend.services.memory_manager import MemoryManager
from backend.models.database import SessionLocal

db = SessionLocal()
mm = MemoryManager(db)

try:
    # Store a fact
    asyncio.run(mm.store_fact(
        "User prefers dark mode",
        category="preference",
        importance=0.8,
    ))

    # Retrieve semantically relevant facts
    facts = asyncio.run(mm.retrieve_relevant("what UI preferences does the user have?"))
    for f in facts:
        print(f.content, f.confidence)
finally:
    # Release the database session even if a call above fails
    db.close()

Personality — build a system prompt

my_app.py
import sys
sys.path.insert(0, "/path/to/Luna")

from backend.services.personality import PersonalityEngine
from backend.models.database import SessionLocal

db = SessionLocal()
engine = PersonalityEngine(db)

try:
    # Update mood based on the user's last message
    engine.update_mood("I'm so excited about this!")

    # Get a personality-aware system prompt
    prompt = engine.build_personality_prompt(user_name="Alex")
    print(prompt[:300])
finally:
    # Release the database session even if a call above fails
    db.close()

Service map

Every Luna service lives in backend/services/ and is documented on its own page.

Service | What it does
LLM Service | Multi-provider streaming + completion client.
Memory Manager | Long-term facts, ChromaDB vectors, conversation context.
Personality Engine | Mood state, RL-style style preferences, prompt building.
Scheduler | Background jobs, proactive messages, Windows notifications.
State Engine | Time-aware user state classification + response policies.
Command Parser | Intent detection, bracket commands, launch/Spotify/map parsing.
Tool Runner | LLM tool call JSON parsing, execution, result summarisation.
Memory Graph | Knowledge graph traversal + episodic memory.
MCP Servers | Model Context Protocol servers for Claude Desktop and agents.

Next steps