Build with Luna — SDK Overview

What is the Luna SDK?

Luna is a full AI backend — not just a chat wrapper. When you run it, you get a FastAPI server at http://localhost:8899 that exposes a rich set of services your application can call: streaming LLM inference, persistent memory, personality-driven prompts, proactive scheduling, state detection, tool execution, and more.

You can integrate Luna into your own project in two ways:

HTTP API — call the REST endpoints from any language or framework.
Python import — import the service modules directly into your Python app.

Either way, one .env file controls which LLM provider Luna talks to, and swapping providers is a one-line change.

Two integration modes

Mode	When to use	Language	Overhead
HTTP API	Any language, microservice architecture, Electron/web apps, mobile	Any (REST + SSE)	Network round-trip (~1 ms local)
Python import	Python monorepo, scripts, notebooks, agents in same process	Python 3.10+	None — in-process call

💡 Recommended for most apps

Use the HTTP API. It isolates your app from Luna's internals, survives Luna restarts independently, and works from any language. The Python import path is best when you're building a Python-first tool and want zero network overhead.

LLM provider compatibility

Luna's LLMClient is a unified interface that routes to any of the providers below. All providers expose the same stream_chat(), complete(), and embed() methods — your code doesn't change when you switch providers.
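
To illustrate what a shared interface buys you, here is a minimal sketch of provider-agnostic calling code. The `ChatProvider` protocol, the `EchoProvider` stand-in, and `collect_reply` are hypothetical names invented for this example. Only the method names stream_chat(), complete(), and embed() come from the text above, and their real signatures in Luna may differ.

```python
import asyncio
from typing import AsyncIterator, Dict, List, Protocol


class ChatProvider(Protocol):
    """Hypothetical provider shape; the signatures are assumptions."""

    def stream_chat(self, messages: List[Dict[str, str]]) -> AsyncIterator[str]: ...
    async def complete(self, prompt: str) -> str: ...
    async def embed(self, text: str) -> List[float]: ...


class EchoProvider:
    """Stand-in provider so the sketch runs without any LLM backend."""

    async def stream_chat(self, messages: List[Dict[str, str]]) -> AsyncIterator[str]:
        # Yield the last message back one word at a time, like a token stream.
        for word in messages[-1]["content"].split():
            yield word + " "

    async def complete(self, prompt: str) -> str:
        return prompt.upper()

    async def embed(self, text: str) -> List[float]:
        return [float(len(text))]


async def collect_reply(provider: ChatProvider, text: str) -> str:
    # The caller depends only on the shared method names, not on the provider.
    messages = [{"role": "user", "content": text}]
    tokens = [token async for token in provider.stream_chat(messages)]
    return "".join(tokens).strip()


reply = asyncio.run(collect_reply(EchoProvider(), "hello from Luna"))
print(reply)
```

Because `collect_reply` only touches the shared methods, any object with the same shape could be passed in without changing the caller.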

Ollama (default)

Local inference — zero API cost, full privacy.

LLM_PROVIDER=ollama

Models: qwen2.5, llama3, mistral, phi3, …

OpenAI (supported)

GPT-4o, GPT-4-turbo, GPT-3.5-turbo and any OpenAI-compatible endpoint (OpenRouter, Jan.ai, llama.cpp, LM Studio).

LLM_PROVIDER=openai-compatible

Models: gpt-4o, gpt-4o-mini, gpt-4-turbo

Anthropic (supported)

Native Claude Messages API with SSE streaming.

LLM_PROVIDER=anthropic

Models: claude-opus-4, claude-sonnet-4-6, claude-haiku-4-5

Google Gemini (supported)

Native Gemini REST API with streaming.

LLM_PROVIDER=google

Models: gemini-2.0-flash, gemini-1.5-pro, gemini-1.5-flash

Groq (supported)

Ultra-fast cloud inference — lowest latency of any hosted provider.

LLM_PROVIDER=groq

Models: llama-3.3-70b-versatile, mixtral-8x7b

Mistral AI (supported)

Native Mistral API.

LLM_PROVIDER=mistral

Models: mistral-large-latest, mistral-medium, mistral-small

Cohere (supported)

Cohere Chat API v2 with streaming.

LLM_PROVIDER=cohere

Models: command-r-plus, command-r, command-light

NVIDIA NIM (supported)

OpenAI-compatible endpoint for NVIDIA-optimised models.

LLM_PROVIDER=nvidia-nim

Models: meta/llama-3.1-8b-instruct, nvidia/nemotron-4-340b

LM Studio (compatible)

Point OPENAI_BASE_URL at your LM Studio local server.

LLM_PROVIDER=openai-compatible

Models: any model loaded in LM Studio

Switching providers

Edit .env and restart the backend. No code changes needed anywhere:

.env

# Local (default)
LLM_PROVIDER=ollama
OLLAMA_MODEL=qwen2.5:7b

# Groq — fastest hosted option
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile

# Anthropic Claude
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-sonnet-4-6

# Any OpenAI-compatible endpoint (OpenRouter, LM Studio, llama.cpp)
LLM_PROVIDER=openai-compatible
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_API_KEY=sk-or-...
OPENAI_MODEL=meta-llama/llama-3.3-70b-instruct
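
As an illustration of how such a file maps to an active provider and model, here is a small sketch that parses .env-style text. The helpers `parse_env` and `active_model` are hypothetical and are not Luna's own loader. The variable naming rule (`<PROVIDER>_MODEL`, with `OPENAI_` for openai-compatible) is inferred from the examples above and is an assumption for providers not shown there.

```python
from typing import Dict, Optional, Tuple


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


# Assumption: openai-compatible reads the OPENAI_* variables, as in the example above.
MODEL_PREFIX = {"openai-compatible": "OPENAI"}


def active_model(env: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return (provider, model) for the chat LLM; model is None if unset."""
    provider = env.get("LLM_PROVIDER", "ollama")  # ollama is the documented default
    prefix = MODEL_PREFIX.get(provider, provider.upper().replace("-", "_"))
    return provider, env.get(f"{prefix}_MODEL")


sample = """
# Groq
LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...
GROQ_MODEL=llama-3.3-70b-versatile
"""
print(active_model(parse_env(sample)))
```

A check like this can be handy in a startup script to confirm which provider a given .env selects before the backend is launched.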

Separate coding model

The coding agent can use a different model from the chat LLM — useful if you want a specialist coder locally while routing conversation to a cloud provider:

.env

LLM_PROVIDER=groq
GROQ_API_KEY=gsk_...

# Override just the coding agent to a local coder
CODING_PROVIDER=ollama
CODING_MODEL=qwen2.5-coder:7b
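
The override can be pictured as a simple fallback rule. The sketch below is an assumption about the behaviour, not Luna's actual resolution code, and `resolve_coding_target` is a hypothetical helper: it treats an unset CODING_PROVIDER as "use the chat provider".

```python
from typing import Dict, Optional, Tuple


def resolve_coding_target(env: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Pick the provider and model for the coding agent from parsed .env values."""
    # Assumption: an unset CODING_PROVIDER means the coding agent uses the chat provider.
    provider = env.get("CODING_PROVIDER") or env.get("LLM_PROVIDER", "ollama")
    # Assumption: an unset CODING_MODEL leaves the choice to the provider's own default.
    model = env.get("CODING_MODEL") or None
    return provider, model


mixed = {
    "LLM_PROVIDER": "groq",
    "CODING_PROVIDER": "ollama",
    "CODING_MODEL": "qwen2.5-coder:7b",
}
print(resolve_coding_target(mixed))
print(resolve_coding_target({"LLM_PROVIDER": "groq"}))
```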

HTTP API quick start

1. Start the backend

cd /path/to/Luna
pip install -r backend/requirements.txt
uvicorn backend.main:app --host 127.0.0.1 --port 8899

The Swagger UI is available at http://localhost:8899/docs once running.
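
If your app launches the backend itself, it may help to wait until the port accepts connections before sending the first request. The following is a minimal sketch using only the standard library. `wait_for_port` is a hypothetical helper, and a successful TCP connection only shows that something is listening, not that Luna has finished initialising.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True once a TCP connection succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a server is listening on host:port.
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)  # not up yet, retry shortly
    return False
```

For the command above, the call would be `wait_for_port("127.0.0.1", 8899)`.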

2. Stream a chat response (Python)

my_app.py

import httpx

with httpx.stream(
    "POST", "http://localhost:8899/api/chat/stream",
    json={"message": "Summarise my tasks for today", "conversation_id": None},
    timeout=60,
) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            print(line[6:], end="", flush=True)
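
The loop above matches the exact prefix "data: ". A slightly more tolerant parser, sketched here as a hypothetical `iter_sse_data` helper, follows the Server-Sent Events convention that the space after the colon is optional and that lines starting with ":" are comments. Whether Luna ever emits such lines is not stated in this page, so treat this as a defensive sketch.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line, skipping blanks, comments, and other fields."""
    for line in lines:
        if line.startswith("data:"):
            payload = line[5:]
            # In the SSE format a single leading space after the colon is dropped.
            yield payload[1:] if payload.startswith(" ") else payload


sample = ["data: Hello", "", ": keep-alive", "data:  world", "event: done"]
print("".join(iter_sse_data(sample)))
```

With httpx, the same helper could wrap the response lines: `for chunk in iter_sse_data(r.iter_lines()): ...`.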

2b. Stream a chat response (JavaScript / Node)

my_app.js

const response = await fetch('http://localhost:8899/api/chat/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ message: 'What is on my calendar today?', conversation_id: null }),
});

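const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // stream: true keeps multi-byte characters that span two chunks intact
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop(); // keep the trailing partial line for the next chunk
  for (const line of lines) {
    if (line.startsWith('data: ')) process.stdout.write(line.slice(6));
  }
}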

2c. Non-streaming (single response)

curl -s -X POST http://localhost:8899/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"message": "What can you do?"}' | jq .response
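
The same non-streaming call can be made from Python with only the standard library. This is a sketch under the assumption that the reply is a JSON object with a "response" field, as the jq filter above suggests. `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(message: str, base_url: str = "http://localhost:8899") -> urllib.request.Request:
    """Build the POST /api/chat request without sending it."""
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(message: str) -> str:
    """Send the request; needs a running backend."""
    with urllib.request.urlopen(build_chat_request(message), timeout=60) as resp:
        # Assumption: the reply carries the text under "response".
        return json.load(resp)["response"]
```

Calling `ask("What can you do?")` against a running backend should return the same text as the curl command.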

3. Run the coding agent

my_app.py

import json

import httpx

with httpx.stream(
    "POST", "http://localhost:8899/api/coding/stream",
    json={
        "message": "Create a FastAPI endpoint that returns a hello world JSON",
        "workspace_root": "/my/project",
    },
    timeout=120,
) as r:
    for line in r.iter_lines():
        if line.startswith("data: "):
            # Each data line carries one JSON-encoded event
            event = json.loads(line[6:])
            # str() guards against non-string content before truncating
            print(event["type"], "→", str(event.get("content", ""))[:80])

📌 SSE event types from the coding agent

workspace_index · plan · tool_call · tool_result · token · done · error
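
One way to consume these events is to branch on the type field. The sketch below works on already-extracted JSON payloads so it runs without a server. `summarise_events` is a hypothetical helper, and the assumptions that token events carry text in "content", that error events carry a message there, and that done ends the stream are not confirmed by this page.

```python
import json

KNOWN_TYPES = {"workspace_index", "plan", "tool_call", "tool_result", "token", "done", "error"}


def summarise_events(data_lines):
    """Join streamed tokens into text and count how often each event type occurs."""
    text, counts = [], {}
    for raw in data_lines:
        event = json.loads(raw)
        kind = event.get("type")
        if kind not in KNOWN_TYPES:
            continue  # skip anything this sketch does not know about
        counts[kind] = counts.get(kind, 0) + 1
        if kind == "token":
            text.append(event.get("content", ""))  # assumed: token text is in "content"
        elif kind == "error":
            raise RuntimeError(event.get("content", "coding agent error"))
        elif kind == "done":
            break  # assumed: "done" is the final event
    return "".join(text), counts


sample = [
    '{"type": "plan", "content": "1. create main.py"}',
    '{"type": "token", "content": "Hello"}',
    '{"type": "token", "content": " world"}',
    '{"type": "done"}',
]
print(summarise_events(sample))
```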

Python import quick start

Install the backend dependencies once, then import any service directly. No server needed.

pip install -r /path/to/Luna/backend/requirements.txt

LLM — streaming

my_app.py

import asyncio, sys
sys.path.insert(0, "/path/to/Luna")

from backend.services.llm import ollama

async def main():
    async for token in ollama.stream_chat(
        messages=[{"role": "user", "content": "Explain async/await in Python"}],
        system_prompt="You are a concise technical writer.",
    ):
        print(token, end="", flush=True)

asyncio.run(main())

LLM — one-shot completion

my_app.py

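# Reuses the imports from the streaming example above (asyncio, ollama).
my_text = "Luna exposes streaming inference, memory, and scheduling."  # any input text

result = asyncio.run(
    ollama.complete("Extract the key topics from: " + my_text, temperature=0.2)
)
print(result)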

Memory — store and retrieve facts

my_app.py

import asyncio, sys
sys.path.insert(0, "/path/to/Luna")

from backend.services.memory_manager import MemoryManager
from backend.models.database import SessionLocal

db = SessionLocal()
mm = MemoryManager(db)

# Store a fact
asyncio.run(mm.store_fact(
    "User prefers dark mode",
    category="preference",
    importance=0.8,
))

# Retrieve semantically relevant facts
facts = asyncio.run(mm.retrieve_relevant("what UI preferences does the user have?"))
for f in facts:
    print(f.content, f.confidence)

db.close()
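
In longer scripts, two small refinements may be worth considering: closing the session even when a call raises, and running all awaits inside one event loop instead of calling asyncio.run() repeatedly. The sketch below shows the pattern with a stand-in session. `FakeSession` and `work` are hypothetical, and the only thing assumed about the real session object is that it has a close() method, as used above.

```python
import asyncio
from contextlib import closing


class FakeSession:
    """Stand-in for a database session; only close() matters here."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


async def work(session) -> str:
    # Placeholder for several awaits, e.g. storing a fact and then retrieving facts.
    await asyncio.sleep(0)
    return "ok"


session = FakeSession()
# closing() calls session.close() on exit, even if work() raises.
with closing(session) as db:
    result = asyncio.run(work(db))

print(result, session.closed)
```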

Personality — build a system prompt

my_app.py

import sys
sys.path.insert(0, "/path/to/Luna")

from backend.services.personality import PersonalityEngine
from backend.models.database import SessionLocal

db = SessionLocal()
engine = PersonalityEngine(db)

# Update mood based on the user's last message
engine.update_mood("I'm so excited about this!")

# Get a personality-aware system prompt
prompt = engine.build_personality_prompt(user_name="Alex")
print(prompt[:300])
db.close()

Service map

Every Luna service lives in backend/services/ and is documented on its own page.

Service	What it does
LLM Service	Multi-provider streaming + completion client.
Memory Manager	Long-term facts, ChromaDB vectors, conversation context.
Personality Engine	Mood state, RL-style style preferences, prompt building.
Scheduler	Background jobs, proactive messages, Windows notifications.
State Engine	Time-aware user state classification + response policies.
Command Parser	Intent detection, bracket commands, launch/Spotify/map parsing.
Tool Runner	LLM tool call JSON parsing, execution, result summarisation.
Memory Graph	Knowledge graph traversal + episodic memory.
MCP Servers	Model Context Protocol servers for Claude Desktop and agents.
