Build with Luna — SDK Overview
Use Luna's AI engine in your own applications. Two integration paths, nine LLM providers, detailed service APIs.
What is the Luna SDK?
Luna is a full AI backend — not just a chat wrapper. When you run it, you get a FastAPI server at http://localhost:8899 that exposes a rich set of services your application can call: streaming LLM inference, persistent memory, personality-driven prompts, proactive scheduling, state detection, tool execution, and more.
You can integrate Luna into your own project in two ways:
- HTTP API — call the REST endpoints from any language or framework.
- Python import — import the service modules directly into your Python app.
Either way, one .env file controls which LLM provider Luna talks to, and swapping providers is a one-line change.
Two integration modes
| Mode | When to use | Language | Overhead |
|---|---|---|---|
| HTTP API | Any language, microservice architecture, Electron/web apps, mobile | Any (REST + SSE) | Network round-trip (~1 ms local) |
| Python import | Python monorepo, scripts, notebooks, agents in same process | Python 3.10+ | None — in-process call |
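The HTTP mode streams responses as server-sent events (SSE). As a rough sketch of what a client has to do with that stream, the helper below extracts the payload of each `data: ` line from any iterable of text lines. `iter_sse_data` is a hypothetical name, and the sketch assumes each payload arrives on a single `data: ` line, as in the quick-start examples on this page.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of every `data: ` line and skip everything else.

    Blank lines (event separators) and lines starting with ":" (SSE
    comments, often used as keep-alives) carry no payload and are ignored.
    """
    prefix = "data: "
    for line in lines:
        if line.startswith(prefix):
            yield line[len(prefix):]


# A hand-written sample stream, standing in for response lines from the server.
sample = ["data: Hel", "", ": keep-alive", "data: lo", "event: done"]
text = "".join(iter_sse_data(sample))
print(text)
```

Because the helper only needs an iterable of strings, the same function can be fed from `httpx`'s `iter_lines()` or from a recorded transcript in a unit test.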
LLM provider compatibility
Luna's LLMClient is a unified interface that routes to any of the providers below. All providers expose the same stream_chat(), complete(), and embed() methods — your code doesn't change when you switch providers.
Local inference: zero API cost, full privacy.
LLM_PROVIDER=ollama
Models: qwen2.5, llama3, mistral, phi3, …
GPT-4o, GPT-4-turbo, GPT-3.5-turbo and any OpenAI-compatible endpoint (OpenRouter, Jan.ai, llama.cpp, LM Studio).
LLM_PROVIDER=openai-compatible
Models: gpt-4o, gpt-4o-mini, gpt-4-turbo
Native Claude Messages API with SSE streaming.
LLM_PROVIDER=anthropic
Models: claude-opus-4, claude-sonnet-4-6, claude-haiku-4-5
Native Gemini REST API with streaming.
LLM_PROVIDER=google
Models: gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash
Ultra-fast cloud inference: lowest latency of any hosted provider.
LLM_PROVIDER=groq
Models: llama-3.3-70b-versatile, mixtral-8x7b
Native Mistral API.
LLM_PROVIDER=mistral
Models: mistral-large-latest, mistral-medium, mistral-small
Cohere Chat API v2 with streaming.
LLM_PROVIDER=cohere
Models: command-r-plus, command-r, command-light
OpenAI-compatible endpoint for NVIDIA-optimised models.
LLM_PROVIDER=nvidia-nim
Models: meta/llama-3.1-8b-instruct, nvidia/nemotron-4-340b
Point openai_base_url at your LM Studio local server.
LLM_PROVIDER=openai-compatible
Models: any model loaded in LM Studio
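The idea of one interface in front of many providers can be pictured with a small, self-contained sketch. It does not use Luna's real LLMClient: `ChatProvider`, `EchoProvider`, `ShoutProvider`, `PROVIDERS`, and `ask` are hypothetical stand-ins, the signatures are assumptions, and only the three method names (`stream_chat`, `complete`, `embed`) come from this page.

```python
import asyncio
from typing import AsyncIterator, Protocol


class ChatProvider(Protocol):
    """Assumed shape of a provider; only the method names are from the docs."""

    def stream_chat(self, messages: list[dict], system_prompt: str = "") -> AsyncIterator[str]: ...
    async def complete(self, prompt: str, temperature: float = 0.7) -> str: ...
    async def embed(self, text: str) -> list[float]: ...


class EchoProvider:
    """Toy provider that streams the last user message back word by word."""

    async def stream_chat(self, messages, system_prompt=""):
        for word in messages[-1]["content"].split():
            yield word + " "

    async def complete(self, prompt, temperature=0.7):
        return prompt

    async def embed(self, text):
        return [float(len(text))]


class ShoutProvider(EchoProvider):
    """Second toy provider with different behaviour behind the same methods."""

    async def stream_chat(self, messages, system_prompt=""):
        async for token in super().stream_chat(messages, system_prompt):
            yield token.upper()

    async def complete(self, prompt, temperature=0.7):
        return prompt.upper()


# Registry keyed by a provider name, playing the role of LLM_PROVIDER.
PROVIDERS: dict[str, type] = {"echo": EchoProvider, "shout": ShoutProvider}


async def ask(provider_name: str, question: str) -> str:
    """Caller code: identical no matter which provider is selected."""
    provider: ChatProvider = PROVIDERS[provider_name]()
    messages = [{"role": "user", "content": question}]
    tokens = [token async for token in provider.stream_chat(messages)]
    return "".join(tokens).strip()


print(asyncio.run(ask("echo", "hello there")))
print(asyncio.run(ask("shout", "hello there")))
```

The point of the design is that `ask` never branches on the provider: selecting a different registry key changes the behaviour without touching the calling code.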
Switching providers
Edit .env and restart the backend. No code changes needed anywhere:
# Local (default)
LLM_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:7b
# Groq — fastest hosted option
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile
# Anthropic Claude
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-sonnet-4-6
# Any OpenAI-compatible endpoint (OpenRouter, LM Studio, llama.cpp)
LLM_PROVIDER=openai-compatible
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=sk-or-...
OPENAI_MODEL=meta-llama/llama-3.3-70b-instruct
Separate coding model
The coding agent can use a different model from the chat LLM — useful if you want a specialist coder locally while routing conversation to a cloud provider:
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
# Override just the coding agent to a local coder
CODING_PROVIDER=ollama
CODING_MODEL=qwen2.5-coder:7b
HTTP API quick start
1. Start the backend
cd /path/to/Luna
pip install -r backend/requirements.txt
uvicorn backend.main:app --host 127.0.0.1 --port 8899
The Swagger UI is available at http://localhost:8899/docs once running.
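Scripts that launch the backend and call it right away may need to wait until the server is accepting requests. The sketch below polls a URL until it answers with HTTP 200. `http_ok` and `wait_until_ready` are hypothetical helpers, and treating a 200 from the Swagger UI page as "ready" is an assumption rather than a documented health check.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(
    url: str,
    attempts: int = 30,
    delay: float = 0.5,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Call `probe(url)` up to `attempts` times, sleeping `delay` seconds after each miss."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False

# Example (assumes the backend from step 1 is starting on the default port):
#   wait_until_ready("http://localhost:8899/docs")
```

The `probe` parameter exists so the retry loop can be exercised without a running server, for example by passing a stub that returns `True` on the third call.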
2. Stream a chat response (Python)
import httpx

# Stream the reply and print each SSE "data: " payload as it arrives
with httpx.stream(
    "POST", "http://localhost:8899/api/chat/stream",
    json={"message": "Summarise my tasks for today", "conversation_id": None},
    timeout=60,
) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            print(line[6:], end="", flush=True)
2b. Stream a chat response (JavaScript / Node)
const response = await fetch('http://localhost:8899/api/chat/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ message: 'What is on my calendar today?', conversation_id: null }),
});
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters intact across chunk boundaries
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep a trailing partial line for the next chunk
  for (const line of lines) {
    if (line.startsWith('data: ')) process.stdout.write(line.slice(6));
  }
}
2c. Non-streaming (single response)
curl -s -X POST http://localhost:8899/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"message": "What can you do?"}' | jq .response
3. Run the coding agent
import json
import httpx

with httpx.stream(
    "POST", "http://localhost:8899/api/coding/stream",
    json={
        "message": "Create a FastAPI endpoint that returns a hello world JSON",
        "workspace_root": "/my/project",
    },
    timeout=120,
) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            # Each SSE payload is a JSON event with a "type" field
            event = json.loads(line[6:])
            print(event["type"], "→", str(event.get("content", ""))[:80])
Event types: workspace_index · plan · tool_call · tool_result · token · done · error
Python import quick start
Install the backend dependencies once, then import any service directly. No server needed.
pip install -r /path/to/Luna/backend/requirements.txt
LLM: streaming
import asyncio
import sys

# Make the Luna checkout importable
sys.path.insert(0, "/path/to/Luna")
from backend.services.llm import ollama

async def main():
    async for token in ollama.stream_chat(
        messages=[{"role": "user", "content": "Explain async/await in Python"}],
        system_prompt="You are a concise technical writer.",
    ):
        print(token, end="", flush=True)

asyncio.run(main())
LLM: one-shot completion
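# Reuses asyncio and ollama from the streaming example
my_text = "Luna exposes memory, personality, and scheduling services."  # placeholder input
result = asyncio.run(
    ollama.complete("Extract the key topics from: " + my_text, temperature=0.2)
)
print(result)
Memory: store and retrieve facts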
import asyncio
from backend.services.memory_manager import MemoryManager
from backend.models.database import SessionLocal

db = SessionLocal()
mm = MemoryManager(db)
# Store a fact
asyncio.run(mm.store_fact(
    "User prefers dark mode",
    category="preference",
    importance=0.8,
))
# Retrieve semantically relevant facts
facts = asyncio.run(mm.retrieve_relevant("what UI preferences does the user have?"))
for f in facts:
    print(f.content, f.confidence)
db.close()
Personality: build a system prompt
from backend.services.personality import PersonalityEngine
from backend.models.database import SessionLocal

db = SessionLocal()
engine = PersonalityEngine(db)
# Update mood based on the user's last message
engine.update_mood("I'm so excited about this!")
# Get a personality-aware system prompt
prompt = engine.build_personality_prompt(user_name="Alex")
print(prompt[:300])
db.close()
Service map
Every Luna service lives in backend/services/ and is documented on its own page.
| Service | What it does |
|---|---|
| LLM Service | Multi-provider streaming + completion client. |
| Memory Manager | Long-term facts, ChromaDB vectors, conversation context. |
| Personality Engine | Mood state, RL-style style preferences, prompt building. |
| Scheduler | Background jobs, proactive messages, Windows notifications. |
| State Engine | Time-aware user state classification + response policies. |
| Command Parser | Intent detection, bracket commands, launch/Spotify/map parsing. |
| Tool Runner | LLM tool call JSON parsing, execution, result summarisation. |
| Memory Graph | Knowledge graph traversal + episodic memory. |
| MCP Servers | Model Context Protocol servers for Claude Desktop and agents. |