🔥 TODAY’S SPECIAL BRIEFING From Our Sponsors
TinyFish Open-Sources Bigset: A Multi-Agent System That Builds Verifiable Structured Datasets from the Live Web Using Plain-English Descriptions
Building a structured dataset from the web is still a pipeline problem. You identify a source, write a scraper, design a schema, handle deduplication, schedule refreshes, and fix breakage when upstream sites change. Bigset is an open-source multi-agent system (AGPL-3.0) from TinyFish that collapses that entire workflow into one sentence.
You type: "YC companies that are currently hiring engineers, with their funding stage, location, and number of open roles." That is the input. Bigset infers the schema, dispatches agents to the live web, deduplicates, and hands you a downloadable CSV or XLSX. Dataset generation takes 2–5 minutes. The agents are doing real web research — not cached lookups.
The architecture is a two-tier multi-agent system, not a prompt loop.
Step 1 is schema inference. Claude Sonnet (via OpenRouter, defaulting to anthropic/claude-sonnet-4.6) reads your sentence and outputs column names, data types, primary keys, and where to look — before any web access. Step 2 is orchestrator discovery. A Qwen agent (via OpenRouter, qwen/qwen3.7-max) runs broad TinyFish Search to identify which entities exist and queues them as sub-agents. Step 3 is fan-out. Each entity gets its own isolated sub-agent running in parallel, capped at 6 tool calls — fetch, search, insert, done. Step 4 is deduplication and source attribution. Primary key dedup collapses duplicates. Every row carries a traceable source URL. Step 5 is export: CSV or XLSX.
The security boundary is the part worth studying. Each sub-agent's authorized dataset ID is embedded in a JavaScript closure at workflow start — invisible to the LLM. The insert_row tool takes no datasetId argument at all. A prompt-injection payload embedded in a fetched web page literally cannot redirect writes. The boundary is enforced in infrastructure, not in a system prompt.
Refresh cadences: 30 min, 6 hours, 12 hours, daily, weekly — fully automatic. The full stack runs on Next.js 16, Fastify, Convex (self-hosted), Mastra workflows, and Clerk auth. Self-hostable via Docker in three commands. BYOK for OpenRouter and TinyFish. 2,500 row operations per account per month on the free self-hosted tier.
The agent-native API — where agents generate datasets directly without a human in the loop — is on the roadmap.

