What Dig is

The retrieval core for music search.
Humans and agents hit the same answer.

Built on the Discogs public catalog - monthly XML dumps, CC0 licensed, full catalog depth. Every artist, label, release, pressing, format, credit, identifier, and the relationships between all of them. Normalised, indexed, served through a REST API and a fully open MCP server.

01

Open by default

No keys. No signup. Any agent, any workflow. Point at it.

02

Deterministic MCP

The MCP tools are deterministic. Every response includes provenance - why a record matched, where the data came from, where confidence drops. Agents get structured facts, not vibes.
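A minimal sketch of what a provenance-carrying tool response could look like. The field names and values here are hypothetical, not Dig's actual schema; the point is the contract: structured facts plus source, match reason, and confidence on every hit.

```python
def search_catalog(query: str) -> dict:
    """Illustrative deterministic tool response with provenance.

    A real deployment would run a Postgres query; one hard-coded
    row stands in here to show the response shape.
    """
    row = {"release_id": 123, "title": "Blue Train", "artist": "John Coltrane"}
    return {
        "query": query,
        "results": [
            {
                **row,
                "provenance": {
                    "source": "discogs_dump",     # where the data came from
                    "matched_on": ["title"],      # why this record matched
                    "confidence": 0.92,           # where confidence drops
                },
            }
        ],
    }

resp = search_catalog("blue train")
assert resp["results"][0]["provenance"]["matched_on"] == ["title"]
```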

03

Shared retrieval core

The same retrieval services power the human-facing web app, so humans and agents always get the same answer.

04

Retrieval, not generation

Dig doesn't generate, recommend, or invent. It retrieves. The AI sits on top. Dig is what the AI uses when it can't afford to guess.

Why agents will reach for it

The open MCP layer is a deliberate choice. Dig doesn't gate access because the goal isn't to monetise queries - it's to become the default tool agents reach for.

01 Right now there is no canonical music data layer for LLM workflows. When Dig exists and is open, the agent calls Dig, gets structured facts, returns a good answer. Every time.
02 That's how infrastructure gets adopted. You make it reliable, you make it open, and the ecosystem routes to it because it's the best answer to the tool call.
03 Developers building music agents find it, it works, they keep using it. The MCP layer is distribution that compounds without effort.
04 The human product - search, record pages, crates - sits on top of the same retrieval core. It gives Dig a face. It's where curators and DJs live. But the durable asset is the retrieval graph and the trust that agents place in it.
Why the data is the hard part

Getting the data is easy.
Normalising it correctly is the work.

Why it's buildable

Small-team architecture on purpose.
Cheap until usage forces it to grow.

Deliberately simple stack for a small team. Modular monolith. Postgres as canonical source of truth. Full text search with pg_trgm to start - no search cluster until the metrics justify one. Redis for jobs and caching. One codebase powering the web app, the API, and the MCP server from the same domain services.
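To show why pg_trgm is enough to start: it ranks candidates by trigram-set similarity, which tolerates typos without any dedicated search cluster. A rough Python re-implementation of that scoring (pg_trgm lowercases, pads each word with spaces, then compares distinct 3-grams), as an approximation rather than the exact Postgres algorithm:

```python
def trigrams(s: str) -> set[str]:
    """Collect distinct 3-grams, pg_trgm style: lowercase, space-padded words."""
    grams: set[str] = set()
    for word in s.lower().split():
        padded = f"  {word} "
        grams |= {padded[i:i + 3] for i in range(len(padded) - 2)}
    return grams

def similarity(a: str, b: str) -> float:
    """Shared trigrams over total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

assert similarity("Aphex Twin", "Aphex Twin") == 1.0
# A one-letter typo still scores far above an unrelated artist:
assert similarity("Aphex Twin", "Aphex Twim") > similarity("Aphex Twin", "Boards of Canada")
```

When query volume and relevance metrics eventually justify it, the same retrieval service boundary lets a dedicated search engine replace the Postgres backend without touching callers.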

01

No LLM inference in the retrieval path

The MCP layer serves structured database queries, not LLM inference. No expensive compute behind every agent call.

02

Open data + open interface

The data is open. The interface is open. The cost is manageable.

03

Distribution is already there

The relationships to bring it to the right people are already there.

Technical appendix

Architecture, ingestion pipeline, and API/MCP interface details are below. This is the technical layer the product and agent workflows both sit on top of.

Architecture diagram

flowchart TB
  subgraph Clients["Clients"]
    Web["Mobile-First Web App"]
    Apps["Apps / Partners"]
    LLM["LLM Agent Runtime"]
  end
  subgraph Surface["Interfaces"]
    REST["REST API"]
    MCP["MCP Server"]
    Admin["Editorial/Admin"]
  end
  subgraph Core["Modular Monolith"]
    Catalog["Catalog"]
    Search["Search (Postgres FTS + pg_trgm)"]
    Curation["Curation"]
    Media["Media Links"]
    Jobs["Ingest / Workers"]
    Match["Matching / Export (later)"]
  end
  subgraph Data["Data"]
    PG["Postgres"]
    Redis["Redis"]
  end
  Web --> REST
  Apps --> REST
  LLM --> MCP
  MCP --> REST
  Admin --> REST
  REST --> Catalog
  REST --> Search
  REST --> Curation
  REST --> Media
  REST --> Match
  REST --> Jobs
  Catalog --> PG
  Search --> PG
  Curation --> PG
  Media --> PG
  Match --> PG
  Jobs --> PG
  Jobs --> Redis

Pipeline diagram

flowchart LR
  A["Discogs XML Dumps"] --> B["dump_batches"]
  B --> C["XML Stream Parse"]
  C --> D["raw_entities (JSON payloads)"]
  D --> E["Normalize / Validate"]
  E --> F["catalog.* canonical tables"]
  F --> G["search_documents (Postgres FTS)"]
  F --> H["curation + media links"]
  F --> I["matching/export (later)"]
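The "XML Stream Parse → raw_entities" step can be sketched with a streaming parser: iterparse walks the dump without loading it into memory, and each entity is staged as a raw JSON payload before any normalisation happens. The XML below is a simplified stand-in for the real Discogs dump schema:

```python
import json
import xml.etree.ElementTree as ET
from io import BytesIO

SAMPLE = b"""<artists>
  <artist><id>45</id><name>Aphex Twin</name></artist>
  <artist><id>108713</id><name>Burial</name></artist>
</artists>"""

def stream_raw_entities(stream):
    """Yield one raw JSON payload per entity, constant memory."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "artist":
            payload = {child.tag: child.text for child in elem}
            yield json.dumps(payload)  # row destined for ingest.raw_entities
            elem.clear()               # free the subtree as we go

rows = list(stream_raw_entities(BytesIO(SAMPLE)))
assert json.loads(rows[0]) == {"id": "45", "name": "Aphex Twin"}
```

Staging raw payloads first means a normalisation bug never forces a re-download: re-run the normalize step against what's already in the database.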

API + MCP

LLMs orchestrate. Dig retrieves. Every tool output is structured, reproducible, and explainable. MCP and REST call the same retrieval services. No hidden writes - agent tools read, rank, and explain by default.

Core retrieval API (REST)

  • GET /search (catalog entities + filters)
  • GET /artists/:id
  • GET /labels/:id
  • GET /masters/:id
  • GET /releases/:id
  • GET /releases/:id/media-links
  • GET /curators/:slug, /lists/:slug
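A minimal client-side sketch of calling the retrieval API. The endpoint paths come from the list above, but the base URL and filter parameter names are assumptions for illustration:

```python
from urllib.parse import urlencode

BASE = "https://api.example.com"  # hypothetical host

def search_url(q: str, **filters: str) -> str:
    """Build a GET /search URL with optional catalog filters."""
    return f"{BASE}/search?" + urlencode({"q": q, **filters})

def release_url(release_id: int) -> str:
    """Build a GET /releases/:id URL."""
    return f"{BASE}/releases/{release_id}"

assert search_url("blue train", format="Vinyl") == \
    "https://api.example.com/search?q=blue+train&format=Vinyl"
assert release_url(123) == "https://api.example.com/releases/123"
```

No keys or signup means these are plain GETs: any HTTP client, agent runtime, or curl one-liner works unchanged.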

LLM-facing MCP tools

  • search_catalog
  • get_release, get_master_release
  • get_artist, get_label
  • get_related_releases
  • get_media_links
  • create_crate_draft
  • explain_relationships
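The "read, rank, and explain by default" contract can be sketched as a dispatcher that tags each tool read or write and refuses writes unless explicitly allowed. Tool names come from the list above; the handlers and the `allow_writes` flag are illustrative stubs, not Dig's implementation:

```python
TOOLS = {
    # retrieval tools: read-only, provenance on every response
    "search_catalog":     ("read",  lambda args: {"results": [], "provenance": []}),
    "get_release":        ("read",  lambda args: {"release": None, "provenance": []}),
    # the one explicitly write-capable tool in the list
    "create_crate_draft": ("write", lambda args: {"draft_id": "stub"}),
}

def call_tool(name: str, args: dict, allow_writes: bool = False) -> dict:
    mode, handler = TOOLS[name]
    if mode == "write" and not allow_writes:
        raise PermissionError(f"{name} writes; agents are read-only by default")
    return handler(args)

assert "provenance" in call_tool("get_release", {"id": 123})
try:
    call_tool("create_crate_draft", {"name": "warehouse 94"})
    raised = False
except PermissionError:
    raised = True
assert raised  # hidden writes are impossible by construction
```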

Defaults

  • Architecture — Modular monolith
  • Search v1 — Postgres FTS + pg_trgm
  • Ingest — Raw payload staging in ingest.raw_entities
  • Auth v1 — Editorial-only
  • LLM strategy — No proxying. Expose MCP/API, users bring models
  • Non-goals — Marketplace, compliance stack, Discogs write-back