What Dig is

The retrieval core for music search.
Humans and agents hit the same answer.

Built on the Discogs public catalog - monthly XML dumps, CC0 licensed, full catalog depth. Every artist, label, release, pressing, format, credit, identifier, and the relationships between all of them. Normalised, indexed, served through a REST API and a fully open MCP server.

01

Open by default

No keys. No signup. Any agent, any workflow. Point at it.

02

Deterministic MCP

The MCP tools are deterministic. Every response includes provenance - why a record matched, where the data came from, where confidence drops. Agents get structured facts, not vibes.
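A minimal sketch of what a provenance-carrying tool response could look like. The field names and values here are hypothetical, not Dig's actual schema; the point is the contract: structured facts plus source, match reason, and confidence on every hit.

```python
def search_catalog(query: str) -> dict:
    """Illustrative deterministic tool response with provenance.

    A real deployment would run a Postgres query; one hard-coded
    row stands in here to show the response shape.
    """
    row = {"release_id": 123, "title": "Blue Train", "artist": "John Coltrane"}
    return {
        "query": query,
        "results": [
            {
                **row,
                "provenance": {
                    "source": "discogs_dump",     # where the data came from
                    "matched_on": ["title"],      # why this record matched
                    "confidence": 0.92,           # where confidence drops
                },
            }
        ],
    }

resp = search_catalog("blue train")
assert resp["results"][0]["provenance"]["matched_on"] == ["title"]
```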

03

Shared retrieval core

The same retrieval services power the human-facing web app, so humans and agents always get the same answer.

04

Retrieval, not generation

Dig doesn't generate, recommend, or invent. It retrieves. The AI sits on top. Dig is what the AI uses when it can't afford to guess.

Why agents will reach for it

The open MCP layer is a deliberate choice. Dig doesn't gate access because the goal isn't to monetise queries - it's to become the default tool agents reach for.

01 Right now there is no canonical music data layer for LLM workflows. When Dig exists and is open, the agent calls Dig, gets structured facts, returns a good answer. Every time.
02 That's how infrastructure gets adopted. You make it reliable, you make it open, and the ecosystem routes to it because it's the best answer to the tool call.
03 Developers building music agents find it, it works, they keep using it. The MCP layer is distribution that compounds without effort.
04 The human product - search, record pages, crates - sits on top of the same retrieval core. It gives Dig a face. It's where curators and DJs live. But the durable asset is the retrieval graph and the trust that agents place in it.
Why the data is the hard part

Getting the data is easy.
Normalising it correctly is the work.

Why it's buildable

Small-team architecture on purpose.
Cheap until usage forces it to grow.

Deliberately simple stack for a small team. Modular monolith. Postgres as canonical source of truth. Full text search with pg_trgm to start - no search cluster until the metrics justify one. Redis for jobs and caching. One codebase powering the web app, the API, and the MCP server from the same domain services.
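To show why pg_trgm is enough to start: it ranks candidates by trigram-set similarity, which tolerates typos without any dedicated search cluster. A rough Python re-implementation of that scoring (pg_trgm lowercases, pads each word with spaces, then compares distinct 3-grams), as an approximation rather than the exact Postgres algorithm:

```python
def trigrams(s: str) -> set[str]:
    """Collect distinct 3-grams, pg_trgm style: lowercase, space-padded words."""
    grams: set[str] = set()
    for word in s.lower().split():
        padded = f"  {word} "
        grams |= {padded[i:i + 3] for i in range(len(padded) - 2)}
    return grams

def similarity(a: str, b: str) -> float:
    """Shared trigrams over total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

assert similarity("Aphex Twin", "Aphex Twin") == 1.0
# A one-letter typo still scores far above an unrelated artist:
assert similarity("Aphex Twin", "Aphex Twim") > similarity("Aphex Twin", "Boards of Canada")
```

When query volume and relevance metrics eventually justify it, the same retrieval service boundary lets a dedicated search engine replace the Postgres backend without touching callers.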

01

No LLM inference in the retrieval path

The MCP layer serves structured database queries, not LLM inference. No expensive compute behind every agent call.

02

Open data + open interface

The data is open. The interface is open. The cost is manageable.

03

Distribution is already there

The relationships to bring it to the right people are already there.

Technical appendix

Architecture, ingestion pipeline, and API/MCP interface details are below. This is the technical layer the product and agent workflows both sit on top of.

Architecture diagram

flowchart TB
  subgraph Clients["Clients"]
    Web["Mobile-First Web App"]
    Apps["Apps / Partners"]
    LLM["LLM Agent Runtime"]
  end
  subgraph Surface["Interfaces"]
    REST["REST API"]
    MCP["MCP Server"]
    Admin["Editorial/Admin"]
  end
  subgraph Core["Modular Monolith"]
    Catalog["Catalog"]
    Search["Search (Postgres FTS + pg_trgm)"]
    Curation["Curation"]
    Media["Media Links"]
    Jobs["Ingest / Workers"]
    Match["Matching / Export (later)"]
  end
  subgraph Data["Data"]
    PG["Postgres"]
    Redis["Redis"]
  end
  Web --> REST
  Apps --> REST
  LLM --> MCP
  MCP --> REST
  Admin --> REST
  REST --> Catalog
  REST --> Search
  REST --> Curation
  REST --> Media
  REST --> Match
  REST --> Jobs
  Catalog --> PG
  Search --> PG
  Curation --> PG
  Media --> PG
  Match --> PG
  Jobs --> PG
  Jobs --> Redis

Pipeline diagram

flowchart LR
  A["Discogs XML Dumps"] --> B["dump_batches"]
  B --> C["XML Stream Parse"]
  C --> D["raw_entities (JSON payloads)"]
  D --> E["Normalize / Validate"]
  E --> F["catalog.* canonical tables"]
  F --> G["search_documents (Postgres FTS)"]
  F --> H["curation + media links"]
  F --> I["matching/export (later)"]
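The "XML Stream Parse → raw_entities" step can be sketched with a streaming parser: iterparse walks the dump without loading it into memory, and each entity is staged as a raw JSON payload before any normalisation happens. The XML below is a simplified stand-in for the real Discogs dump schema:

```python
import json
import xml.etree.ElementTree as ET
from io import BytesIO

SAMPLE = b"""<artists>
  <artist><id>45</id><name>Aphex Twin</name></artist>
  <artist><id>108713</id><name>Burial</name></artist>
</artists>"""

def stream_raw_entities(stream):
    """Yield one raw JSON payload per entity, constant memory."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "artist":
            payload = {child.tag: child.text for child in elem}
            yield json.dumps(payload)  # row destined for ingest.raw_entities
            elem.clear()               # free the subtree as we go

rows = list(stream_raw_entities(BytesIO(SAMPLE)))
assert json.loads(rows[0]) == {"id": "45", "name": "Aphex Twin"}
```

Staging raw payloads first means a normalisation bug never forces a re-download: re-run the normalize step against what's already in the database.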

API + MCP

LLMs orchestrate. Dig retrieves. Every tool output is structured, reproducible, and explainable. MCP and REST call the same retrieval services. No hidden writes - agent tools read, rank, and explain by default.

Core retrieval API (REST)

  • GET /search (catalog entities + filters)
  • GET /artists/:id
  • GET /labels/:id
  • GET /masters/:id
  • GET /releases/:id
  • GET /releases/:id/media-links
  • GET /curators/:slug, /lists/:slug
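A minimal client-side sketch of calling the retrieval API. The endpoint paths come from the list above, but the base URL and filter parameter names are assumptions for illustration:

```python
from urllib.parse import urlencode

BASE = "https://api.example.com"  # hypothetical host

def search_url(q: str, **filters: str) -> str:
    """Build a GET /search URL with optional catalog filters."""
    return f"{BASE}/search?" + urlencode({"q": q, **filters})

def release_url(release_id: int) -> str:
    """Build a GET /releases/:id URL."""
    return f"{BASE}/releases/{release_id}"

assert search_url("blue train", format="Vinyl") == \
    "https://api.example.com/search?q=blue+train&format=Vinyl"
assert release_url(123) == "https://api.example.com/releases/123"
```

No keys or signup means these are plain GETs: any HTTP client, agent runtime, or curl one-liner works unchanged.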

LLM-facing MCP tools

  • search_catalog
  • get_release, get_master_release
  • get_artist, get_label
  • get_related_releases
  • get_media_links
  • create_crate_draft
  • explain_relationships
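The "read, rank, and explain by default" contract can be sketched as a dispatcher that tags each tool read or write and refuses writes unless explicitly allowed. Tool names come from the list above; the handlers and the `allow_writes` flag are illustrative stubs, not Dig's implementation:

```python
TOOLS = {
    # retrieval tools: read-only, provenance on every response
    "search_catalog":     ("read",  lambda args: {"results": [], "provenance": []}),
    "get_release":        ("read",  lambda args: {"release": None, "provenance": []}),
    # the one explicitly write-capable tool in the list
    "create_crate_draft": ("write", lambda args: {"draft_id": "stub"}),
}

def call_tool(name: str, args: dict, allow_writes: bool = False) -> dict:
    mode, handler = TOOLS[name]
    if mode == "write" and not allow_writes:
        raise PermissionError(f"{name} writes; agents are read-only by default")
    return handler(args)

assert "provenance" in call_tool("get_release", {"id": 123})
try:
    call_tool("create_crate_draft", {"name": "warehouse 94"})
    raised = False
except PermissionError:
    raised = True
assert raised  # hidden writes are impossible by construction
```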

Defaults

  • Architecture — Modular monolith
  • Search v1 — Postgres FTS + pg_trgm
  • Ingest — Raw payload staging in ingest.raw_entities
  • Auth v1 — Editorial-only
  • LLM strategy — No proxying. Expose MCP/API, users bring models
  • Non-goals — Marketplace, compliance stack, Discogs write-back