konar est. 2026
CASE STUDY · 04
LEGISLATIVE TRANSPARENCY  ·  2026  ·  SOLE ENGINEER

190 years of Spanish law as a Git repository.

Ley Abierta reconstructs every Spanish law since 1835 as a Markdown file with a Git commit dated to its official BOE publication: 12,272 laws, 43,883 commits, 18 jurisdictions. On top of the repository lives a hybrid retrieval engine with a hand-rolled int8 SIMD C kernel and a Qwen stack on Nan (EU servers, no-log policy): embeddings, query analyzer, reranker and synthesis. AGPL-3.0. Live at leyabierta.es.

STATUS     live             leyabierta.es · agpl-3.0
CORPUS     12,272 laws      43,883 commits · 18 jurisdictions · 1835–2026
INDEX      7.6 GB → 1.4 GB  int8 simd · custom AVX2/NEON · -81%
RETRIEVAL  hybrid           bm25 + rag · llm rerank · synthesis
INFERENCE  flat-rate        nan · eu servers · no-log · unmetered

Spanish legislation is nominally public: the BOE exposes every law via an XML API. But the raw feed is a stream of documents, not a navigable history. No diff between versions, no graph of which reform amended which article, no way to ask "what changed in labor law in 2018?" and get a useful answer. The official search returns keyword matches in legal prose, not structured documents. Ley Abierta treats the law as source code. Each reform is a commit, each jurisdiction a folder, each version a checkable diff. The result is 190 years of legislation that's queryable and diffable, with full version history.

  BOE open data API  (XML/JSON · no auth required)
          │
          ▼
  ┌─ pipeline  [ TypeScript · Bun ] ──────────────────────────────────┐
  │   fetch XML → parse → Markdown + YAML frontmatter                 │
  │   git commit --date=BOE_PUBLICATION_DATE                          │
  │   pre-1970 laws: date in frontmatter, git date = 1970-01-02       │
  │   ELI folder structure  (es + 17 CCAA = 18 jurisdictions)        │
  └───────────────────────────────────────────────────────────────────┘
          │                                │
          ▼                                ▼
  leyes/ git repo                   JSON cache (scratch)
  (public human-readable artifact)
          │
          ▼
  ┌─ api  [ Elysia · SQLite · FTS5 ] ────────────────────────────────┐
  │                                                                    │
  │   BM25 full-text search (FTS5)                                    │
  │         +                                                         │
  │   semantic search  — 486k int8 vectors                            │
  │   ┌────────────────────────────────────┐                          │
  │   │  C SIMD int8 cosine kernel          │                          │
  │   │  -81% index size (7.6 GB → 1.4 GB)  │                          │
  │   │  AVX2/FMA + NEON paths              │                          │
  │   │  SharedArrayBuffer worker pool      │                          │
  │   └────────────────────────────────────┘                          │
  │         +                                                         │
  │   RRF fusion  ·  P50 vector stage: 2.1s → 0.8s                    │
  │                                                                    │
  │   Stack: Qwen on Nan (embed + analyzer + rerank + synthesis)      │
  └───────────────────────────────────────────────────────────────────┘
          │
          ▼
  ┌─ web  [ Astro SSG · Cloudflare Pages ] ──────────────────────────┐
  │   law content from leyes/ checkout at build time                  │
  │   derived data fetched from API at build time                     │
  │   live at leyabierta.es                                           │
  └───────────────────────────────────────────────────────────────────┘

  Infrastructure: self-hosted Docker · Watchtower · GitHub Actions cron
                  (daily BOE ingestion) · Resend (email alerts)
01  ·  DECISION
Git as the storage layer

Laws are versioned documents, and Git is the right tool for them: git log -- Codigo_Penal.md is a meaningful civic interface, and git diff between two commits is a legislative audit trail. Git stores commit timestamps as seconds since the Unix epoch and its tooling does not reliably handle earlier dates, so pre-1970 laws are clamped to 1970-01-02, with the real publication date preserved in the YAML frontmatter. The repo is the product.
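The clamping rule can be sketched in a few lines of TypeScript. This is a minimal illustration, not the project's pipeline code: LawMeta, GIT_FLOOR, commitDateFor, frontmatterFor and commitInvocation are hypothetical names, and the frontmatter fields are assumptions. Only the 1970-01-02 floor and the "real date lives in the frontmatter" rule come from the text above.

```typescript
// Hypothetical shape of a law's metadata; field names are assumptions.
interface LawMeta {
  id: string;          // some stable identifier for the law
  title: string;
  publishedOn: string; // official publication date, ISO format YYYY-MM-DD
}

// Floor date quoted in the case study for pre-1970 laws.
const GIT_FLOOR = "1970-01-02";

/** ISO YYYY-MM-DD dates sort lexicographically, so a string comparison is enough. */
function commitDateFor(law: LawMeta): string {
  return law.publishedOn < GIT_FLOOR ? GIT_FLOOR : law.publishedOn;
}

/** The real publication date is always written to the frontmatter, clamped or not. */
function frontmatterFor(law: LawMeta): string {
  return [
    "---",
    `id: ${law.id}`,
    `title: ${JSON.stringify(law.title)}`,
    `published: ${law.publishedOn}`,
    "---",
    "",
  ].join("\n");
}

/**
 * Arguments and environment for `git commit`.
 * `--date` sets the author date; the committer date is taken from
 * the GIT_COMMITTER_DATE environment variable.
 */
function commitInvocation(law: LawMeta): { args: string[]; env: Record<string, string> } {
  const date = `${commitDateFor(law)} 00:00:00 +0000`;
  return {
    args: ["commit", "-m", law.title, `--date=${date}`],
    env: { GIT_COMMITTER_DATE: date },
  };
}
```

With this shape, a law published in the 19th century gets a commit dated 1970-01-02 while its frontmatter still carries the original date, so anything that reads the Markdown can sort by the true date rather than by the commit date.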

02  ·  DECISION
Custom int8 SIMD C kernel

Off-the-shelf vector libraries were too slow at 486k vectors on a single VM. Storing the vectors as int8 cut the index from 7.6 GB to 1.4 GB (-81%), and a hand-rolled int8 cosine kernel with SIMD intrinsics (AVX2+FMA on x86_64, NEON on arm64, two-way unrolled accumulator) brought the vector-search stage p50 from 2.1s to 0.8s. A SharedArrayBuffer worker pool keeps a single copy of the index in memory, shared across 4 Bun Workers. No GPU and no managed vector DB.
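For readers who want to see what such a kernel computes, here is a scalar reference in TypeScript. It is not the project's C kernel and contains no SIMD. The quantization scheme (symmetric, per-vector scale) is an assumption, since the text does not say how the int8 values are produced, and the function names are hypothetical.

```typescript
/**
 * Quantize a float vector to int8 with symmetric per-vector scaling
 * (assumed scheme): the largest absolute component maps to 127.
 */
function quantizeInt8(v: Float32Array): Int8Array {
  let maxAbs = 0;
  for (let i = 0; i < v.length; i++) maxAbs = Math.max(maxAbs, Math.abs(v[i]));
  const scale = maxAbs === 0 ? 0 : 127 / maxAbs;
  const q = new Int8Array(v.length);
  for (let i = 0; i < v.length; i++) q[i] = Math.round(v[i] * scale);
  return q;
}

/**
 * Cosine similarity over int8 vectors. Cosine is invariant to positive
 * scaling, so the per-vector scale used above does not need to be stored.
 * A SIMD kernel computes the same three sums, several lanes at a time.
 */
function cosineInt8(a: Int8Array, b: Int8Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

An Int8Array can be constructed as a view over a SharedArrayBuffer, which is what makes it possible for several workers to read one copy of the index instead of each holding its own.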

03  ·  DECISION
The BM25 regression that taught the stack

BM25 looked obvious for legal text. The early hybrid (BM25 + RRF over Gemini dense embeddings) actually regressed on real citizen queries: FTS5 OR-expansion on "horas extras que no me pagan" ("overtime they aren't paying me") matched noise across centuries of legal prose. The fix wasn't tuning BM25 weights. It was a modern-bias prompt on the Qwen embeddings that steers retrieval toward recent statutes over 19th-century codes. A reproducible eval harness against citizen and omnibus question sets caught the regression before it reached prod.
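The fusion step helps explain why a noisy BM25 list hurts. Below is a minimal sketch of Reciprocal Rank Fusion in TypeScript. The function name is hypothetical, and k = 60 is the constant commonly used for RRF, not a value stated for this project.

```typescript
/**
 * Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank)
 * for every id it contains, with rank starting at 1.
 * k = 60 is the conventional default (an assumption here).
 */
function rrfFuse(rankings: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id for a stable order.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}
```

RRF looks only at ranks, not at raw scores, so every input list carries equal weight. If the BM25 list is dominated by loosely matching passages, those passages still receive full rank credit in the fused result, which is consistent with the regression described above.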

04  ·  DECISION
Full Qwen stack over Gemini + Cohere

The original stack used Gemini embeddings, Gemini Flash Lite as the query analyzer, and Cohere Rerank 4 Pro, all routed through OpenRouter at ~$2 per 1,000 queries with every request leaving the server. After A/B evaluation, each component was replaced with Qwen on Nan, an EU-based inference provider with a no-log policy: embeddings, query analyzer, LLM reranker, and synthesis. Queries stay in the EU, and nothing is logged on the inference side.
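A component-by-component comparison of this kind can be sketched as follows. The text does not say which metric the evaluation used, so recall@k is an assumption, and EvalCase, Retriever, recallAtK, meanRecallAtK and compareRetrievers are hypothetical names. Retrievers are synchronous here to keep the sketch short; real ones would be async.

```typescript
// One labeled question: the ids of the articles that should be retrieved.
interface EvalCase {
  question: string;
  relevant: string[];
}

// A retriever returns ranked document ids for a question.
type Retriever = (question: string) => string[];

/** Fraction of the relevant ids that appear in the top k results. */
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const top = new Set(retrieved.slice(0, k));
  const hits = relevant.filter((id) => top.has(id)).length;
  return hits / relevant.length;
}

/** Mean recall@k of one retriever over a question set. */
function meanRecallAtK(cases: EvalCase[], retrieve: Retriever, k: number): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce(
    (sum, c) => sum + recallAtK(retrieve(c.question), c.relevant, k),
    0,
  );
  return total / cases.length;
}

/** Run baseline and candidate on the same cases and report the difference. */
function compareRetrievers(
  cases: EvalCase[],
  baseline: Retriever,
  candidate: Retriever,
  k: number,
): { baseline: number; candidate: number; delta: number } {
  const b = meanRecallAtK(cases, baseline, k);
  const c = meanRecallAtK(cases, candidate, k);
  return { baseline: b, candidate: c, delta: c - b };
}
```

Swapping one component at a time while holding the rest of the pipeline fixed keeps any change in the metric attributable to that component.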

05  ·  DECISION
Self-hosted Opik per-span RAG tracing

Every /v1/ask call creates an Opik trace with spans for embed_query, bm25, vector_knn, aggregate_pool, rrf_fusion, rerank, and synthesis. Tracing is fail-safe: span failures never break the pipeline. Opik runs self-hosted on the same VM (full backend + frontend + Python OSS stack), so per-stage latency is debuggable in prod without exfiltrating queries to a SaaS. The catch that justified the work: the BM25 OR-expansion bug was identified at the bm25 span before it tanked recall.
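The fail-safe property can be sketched with a small wrapper. The Tracer and Span interfaces below are hypothetical stand-ins, not the Opik SDK API, and withSpan is an illustrative name. The sketch is synchronous; the same pattern applies to async stages with await.

```typescript
// Hypothetical minimal tracing interfaces (NOT the Opik SDK).
interface Span {
  end(meta?: Record<string, unknown>): void;
}
interface Tracer {
  startSpan(name: string): Span;
}

/**
 * Run one pipeline stage inside a span.
 * Errors thrown by the tracer are swallowed; errors thrown by the
 * stage itself still propagate, because those are real failures.
 */
function withSpan<T>(tracer: Tracer, name: string, work: () => T): T {
  let span: Span | undefined;
  try {
    span = tracer.startSpan(name);
  } catch {
    span = undefined; // tracing is unavailable, keep going without it
  }
  const started = Date.now();
  try {
    return work();
  } finally {
    try {
      span?.end({ ms: Date.now() - started });
    } catch {
      // a failed span flush must not affect the response
    }
  }
}
```

A stage such as bm25 would then be invoked as withSpan(tracer, "bm25", () => runBm25(query)), where runBm25 stands for whatever function implements that stage.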

06  ·  DECISION
AGPL-3.0

Public legal infrastructure shouldn't be capturable by a SaaS. AGPL guarantees that any deployment that serves Ley Abierta over a network must publish its modifications. The license matches the project's politics: if public law is a commons, so is the engine that makes it searchable.

PACKAGE            STACK                                                 COMMITS  ROLE
pipeline           typescript · bun · git                                         sole
api                elysia · sqlite · fts5 · c simd · qwen nan                     sole
web                astro ssg · cloudflare pages                                   sole
shared             typescript · bun                                               sole
leyes/ (git repo)  markdown · yaml frontmatter · eli folder structure    43,883   sole

Ley Abierta turns 190 years of Spanish legislation, technically public and practically unusable, into a queryable Git repository with a search layer that holds up on real citizen language. A structured A/B program replaced every Gemini and Cohere component (embeddings, query analyzer, reranker, and synthesis) with Qwen on Nan, an EU-based inference provider with a no-log policy, with each swap measured in isolation against citizen and omnibus question sets. An early hybrid regression, in which BM25 OR-expansion on citizen Spanish matched noise across centuries of legal prose, was caught by the eval harness before prod. Index size dropped from 7.6 GB to 1.4 GB with int8 vectors and the custom SIMD C kernel. Queries stay in the EU. The whole stack is AGPL-3.0 and live at leyabierta.es.