Introduction: The Core Shift
BigSet is not just another AI tool; it is a structural shift in how structured datasets are built from live web data. It accepts a plain-English description and returns a downloadable CSV or XLSX, so the user no longer has to handle scraper configuration, schema design, deduplication, or refresh scheduling. The multi-agent architecture (schema inference via Claude Sonnet, orchestration via Qwen, and parallel sub-agents each capped at 6 tool calls) reduces a process that once took hours or days to 2–5 minutes. This is a direct challenge to proprietary data platforms like Apify and Exa Websets, and it signals a broader trend toward open-source, self-hosted data generation.
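To make the per-sub-agent cap concrete, here is a minimal sketch of how a tool budget could be enforced. It is not BigSet's code: `withBudget`, `ToolRunner`, and `stubFetch` are hypothetical names, and only the limit of 6 comes from the description above.

```typescript
// Hypothetical names throughout; this is not BigSet's actual code.
type ToolCall = { tool: string; input: string };
type ToolRunner = (call: ToolCall) => string;

const TOOL_BUDGET = 6; // the per-sub-agent cap described in the article

// Wraps a tool runner so it stops executing once the budget is spent.
function withBudget(run: ToolRunner, budget: number = TOOL_BUDGET) {
  let used = 0;
  return {
    // Returns the tool output, or null when the budget is exhausted.
    call(c: ToolCall): string | null {
      if (used >= budget) return null;
      used += 1;
      return run(c);
    },
    remaining(): number {
      return budget - used;
    },
  };
}

// Stub tool standing in for a real search/fetch call (assumption).
const stubFetch: ToolRunner = (c) => `${c.tool}(${c.input})`;

const subAgent = withBudget(stubFetch);
const outputs: string[] = [];

// The sub-agent "wants" 10 calls but is cut off after 6.
for (let i = 0; i < 10; i++) {
  const result = subAgent.call({ tool: "fetch", input: `page-${i}` });
  if (result === null) break;
  outputs.push(result);
}
```

Enforcing the cap in the wrapper rather than in the prompt means a sub-agent cannot talk its way past the limit, which keeps both latency and API spend bounded per row.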
Strategic Consequences
Who Gains?
Data scientists and analysts gain a low-code tool that accelerates data pipeline creation. Instead of writing scrapers or configuring APIs, they describe what they need and get a structured, refreshable dataset. TinyFish itself gains a powerful distribution channel for its search, fetch, and browser APIs through the BYOK model—every BigSet user becomes a TinyFish API customer. OpenRouter benefits from increased LLM inference volume as the default provider for schema inference and orchestration. The open-source community can inspect, modify, and extend BigSet under AGPL-3.0, fostering innovation and customization for niche use cases.
Who Loses?
Proprietary data marketplace platforms like Apify and Exa Websets face direct competition from an open-source alternative that undercuts their subscription pricing and offers self-hosting. Traditional web scraping service providers risk losing customers who prefer an automated, AI-driven approach over manual scripts. Companies selling static datasets will likely see reduced demand, because BigSet's live refresh feature makes one-time dataset purchases less attractive.
What Shifts Next?
The BYOK pricing model (bring your own API keys for TinyFish and OpenRouter) shifts cost control to the user. Enterprises pay the providers directly for what they consume, so they can forecast costs from their own usage and avoid being locked into a subscription tier. The self-hosted deployment addresses data privacy concerns in regulated sectors like healthcare and finance. The roadmap includes SQL query support and an agent-native API, which would expand use cases to analytics and automation, further eroding the moat of proprietary platforms.
Winners & Losers
Winners
- Data scientists and analysts: Low-code, live dataset generation.
- TinyFish: API adoption and ecosystem growth.
- OpenRouter: Increased inference revenue.
- Open-source community: Customizability and transparency.
Losers
- Apify, Exa Websets: Open-source competition erodes market share.
- Traditional scraping services: Automation reduces need for manual scraping.
- Static dataset vendors: Live refresh diminishes one-time sales.
Second-Order Effects
BigSet's architecture, in which the dataset ID is captured in a JavaScript closure and is invisible to the LLM, is a security pattern worth noting for agent-based data extraction. Because the model never sees or supplies the ID, instructions injected through a fetched page cannot redirect writes to a different dataset. This pattern will likely be adopted by other multi-agent systems to limit the damage a prompt injection attack can do. The AGPL-3.0 license may deter some commercial adoption but will attract organizations that value data sovereignty and transparency. Expect a wave of community-contributed dataset templates and integrations, accelerating the tool's capabilities beyond its initial scope.
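The closure pattern can be sketched in a few lines. This is an illustration under assumptions, not BigSet's implementation: `makeInsertRowTool`, the `Tool` shape, and the in-memory `store` are all hypothetical. The point it shows is that the dataset ID is bound when the tool is created and is never read from model-supplied arguments.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not BigSet's API.
type Row = Record<string, string>;

interface Tool {
  name: string;
  // Parameter names advertised to the model. Note: no dataset ID here.
  parameters: string[];
  execute(args: Record<string, unknown>): string;
}

// In-memory stand-in for a datastore keyed by dataset ID (assumption).
const store: Record<string, Row[]> = {};

function makeInsertRowTool(datasetId: string): Tool {
  // `datasetId` lives only in this closure. The model never sees it
  // and has no argument through which to change it.
  return {
    name: "insert_row",
    parameters: ["row"],
    execute(args) {
      const row = args.row as Row;
      const rows = store[datasetId] ?? [];
      rows.push(row);
      store[datasetId] = rows;
      return `inserted row ${rows.length}`;
    },
  };
}

const tool = makeInsertRowTool("ds_A");

// Simulated model output after a prompt injection: it tries to smuggle
// in a different dataset ID. The extra keys are simply ignored.
const reply = tool.execute({
  row: { company: "Acme" },
  datasetId: "ds_B",
  dataset_id: "ds_B",
});
```

In this sketch the write lands in `ds_A` regardless of what the model emits, because `execute` reads the target only from the enclosing scope. The injection can still corrupt the row contents, so the pattern narrows the blast radius rather than removing the threat.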
Market / Industry Impact
The open-source, self-hosted model with BYOK pricing challenges the traditional data-as-a-service subscription model. Over time, this could shift market expectations toward transparent, auditable data generation tools and away from black-box APIs. Incumbents will be pressured to offer more flexible licensing and on-premise options. The total addressable market for structured data generation expands as technical barriers drop, potentially creating new use cases in real-time market research, competitive intelligence, and compliance monitoring.
Executive Action
- Evaluate BigSet for internal data pipelines: Start with a pilot project to generate live datasets for market analysis or sales intelligence. The self-hosted deployment keeps the resulting datasets within your infrastructure; note that under BYOK, requests still go to the OpenRouter and TinyFish APIs.
- Monitor competitive response: Watch for pricing changes or open-source releases from Apify and Exa Websets. Their reaction will indicate the severity of the threat BigSet poses.
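- Assess AGPL-3.0 implications: If your organization has strict policies against copyleft licenses, have legal counsel review the license's network-use terms before deploying. A fork cannot relicense AGPL-3.0 code under a permissive license without the copyright holders' consent, so the realistic options are to comply with AGPL-3.0, ask TinyFish whether other terms are available, or evaluate alternative tools.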
Why This Matters
BigSet compresses a multi-hour data pipeline into a 2–5 minute natural-language interaction. For executives, this means faster access to decision-grade data without dedicated engineering resources. The open-source, self-hosted model also reduces vendor lock-in and data sovereignty concerns, which is critical for regulated industries. Ignoring this shift risks falling behind competitors who adopt AI-driven data generation.
Final Take
BigSet is a blueprint for the future of data extraction: open, agent-driven, and user-controlled. TinyFish has not just released a tool—it has redefined the economics and security of structured dataset creation. The winners will be those who embrace this model early, while incumbents scramble to adapt.
Intelligence FAQ
BigSet uses a two-tier agent system where sub-agents fetch real web pages and extract fields with source attribution. Each sub-agent has a tool budget of 6 calls, and the dataset ID is locked in a JavaScript closure so that an injected prompt cannot redirect writes to another dataset. Deduplication via primary keys guards against duplicate rows for the same entity.
Costs are variable: you pay for OpenRouter API calls ($5–10 initial credits recommended) and TinyFish API usage. No subscription fees. Self-hosting requires Docker and your own infrastructure. For heavy use, costs scale linearly with the number of datasets and refresh frequency.
For structured dataset generation from natural-language descriptions, yes. But BigSet currently lacks the breadth of pre-built scrapers and site-specific actors that Apify offers. It is best suited for ad-hoc, custom datasets rather than large-scale, recurring extraction from known sites.
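The primary-key deduplication mentioned in the first answer above can be sketched as follows. This is a minimal illustration, not BigSet's logic: `dedupeByPrimaryKey`, the sample rows, and the trim-and-lowercase normalization are assumptions made for the example.

```typescript
// Hypothetical sketch of primary-key deduplication; not BigSet's code.
type Row = Record<string, string>;

function dedupeByPrimaryKey(rows: Row[], keyFields: string[]): Row[] {
  // Null-prototype object so arbitrary key strings are safe to use.
  const seen: Record<string, boolean> = Object.create(null);
  const unique: Row[] = [];
  for (const row of rows) {
    // Normalizing case and whitespace is an assumption of this sketch.
    const key = JSON.stringify(
      keyFields.map((f) => (row[f] ?? "").trim().toLowerCase())
    );
    if (seen[key]) continue; // a row with this key was already kept
    seen[key] = true;
    unique.push(row);
  }
  return unique;
}

// Two sub-agents found the same company on different pages (made-up data).
const extracted: Row[] = [
  { company: "Acme", source: "https://example.com/a" },
  { company: "acme ", source: "https://example.com/b" },
  { company: "Globex", source: "https://example.com/c" },
];

const deduped = dedupeByPrimaryKey(extracted, ["company"]);
```

The first occurrence wins here, so `deduped` keeps the `Acme` row from the first source and the `Globex` row. A production system might instead merge fields or prefer the most recent source; that choice is outside what the article describes.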


