A web infrastructure company entering the US market with proxy, SERP, and scraping APIs
Building a Signal-Driven Lead Intelligence System for Thor Data
Context
Thor Data is a web infrastructure company offering proxy networks, SERP APIs, and web scraping tools. The product suite includes residential and datacenter proxies, a Web Unlocker for anti-bot bypass, a SERP API for search engine results, and a LinkedIn Scraper API. The company is established in international markets with an existing customer base and is now entering the US for the first time.
The competitive landscape is dominated by Bright Data and Oxylabs. Both are well funded, have an established US presence, compete in a saturated keyword space, and run aggressive outbound operations, with large BDR teams pushing ZoomInfo-sourced lists through multi-touch sequences. The standard playbook for a new entrant would be the same: buy lists, run sequences, compete on volume.
Thor Data had a different advantage. Their own APIs are the infrastructure that makes web scraping possible. A system built to find their buyers could run entirely on their own product — capturing signals through Web Unlocker, searching via SERP API, enriching through LinkedIn Scraper. The GTM system wouldn’t just find customers. It would demonstrate the product.
Smoothed built that system.
The problem
No US inbound pipeline. No brand recognition in the market. No US sales team in place. The company needed to go from zero to qualified pipeline in a market where the incumbents have years of brand equity and content ranking.
List-based outbound would put Thor Data in the same inbox as every other proxy vendor. A VP of Engineering receiving a cold email from an unknown proxy company — alongside cold emails from Bright Data, Oxylabs, Smartproxy, and a dozen others — has no reason to reply. The message gets lost in the noise.
Intent data vendors wouldn’t help either. Bombora and G2 sell aggregated, anonymized signals at the account level, delivered weekly. By the time you see “Company X is researching proxies,” Bright Data’s BDR team has already seen the same signal. And you still don’t know who at Company X is doing the research or what they specifically need.
The structural problem: the best leads are people who are actively expressing a need right now, on platforms that competitors’ tooling doesn’t monitor. Reddit posts asking for alternatives. GitHub issues describing scraping infrastructure needs. HackerNews threads discussing proxy providers. Twitter complaints about incumbent pricing or reliability.
These signals exist. They’re public. But no list vendor captures them, no intent platform aggregates them, and no CRM workflow monitors them. They require a purpose-built system.
Diagnosis
Before building, we audited the signal landscape across four platforms.
The platform audit
Reddit — 15+ relevant subreddits (r/webscraping, r/dataengineering, r/MachineLearning, r/node, r/python, among others). Posts asking for proxy recommendations, Bright Data alternatives, and web scraping infrastructure advice appear daily. The signal quality is exceptional: posters describe their use case, their current provider, their pain points, and their budget constraints in detail. Reddit turned out to be the highest-quality signal source — 99% of captured signals classified as Tier 1 or 2.
GitHub — Issues, discussions, and repository descriptions mentioning proxy providers, web scraping tools, and data collection infrastructure. Volume is high (532 signals in the first capture cycle), but signal-to-noise is lower than Reddit — many mentions are incidental (a dependency listed in a package.json) rather than intentional (someone actively looking for a tool).
Twitter — Complaint signals and alternative-seeking posts. Lower volume than Reddit or GitHub, but the urgency is higher. Someone publicly tweeting “anyone have a Bright Data alternative?” is typically in active buying mode — they want a response now, not next week.
HackerNews — “Ask HN” threads are gold: engineers asking peers for tool recommendations with detailed context about their use case. Lower frequency but high-intent when they appear.
The keyword taxonomy
From the platform audit, we built a 200+ keyword taxonomy organized by three dimensions:
- Product keywords — terms specific to Thor Data’s product categories: residential proxies, SERP API, web unlocker, datacenter proxies, scraping API
- Cohort keywords — terms indicating which buyer segment the signal comes from: AI training data, e-commerce price intelligence, competitive intelligence, SEO monitoring, ad verification
- Intent keywords — terms indicating urgency level: “alternative to,” “switching from,” “need recommendation,” “looking for,” “help with,” “too expensive,” “unreliable”
The taxonomy is the system’s primary tuning lever. Expanding it expands the signal surface. Refining it improves signal-to-noise. It’s a living document, updated based on classification accuracy data.
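One way to picture the taxonomy is as a small lookup structure keyed by dimension. The sketch below is illustrative only: it uses a handful of the terms listed above, and the `match_taxonomy` and `is_candidate` helpers, along with the rule that a post needs both a product term and an intent term, are assumptions rather than the production logic.

```python
# Illustrative subset of the taxonomy; the real one has 200+ terms.
TAXONOMY = {
    "product": ["residential proxies", "serp api", "web unlocker",
                "datacenter proxies", "scraping api"],
    "cohort": ["ai training data", "e-commerce price intelligence",
               "competitive intelligence", "seo monitoring", "ad verification"],
    "intent": ["alternative to", "switching from", "need recommendation",
               "looking for", "help with", "too expensive", "unreliable"],
}


def match_taxonomy(text: str) -> dict[str, list[str]]:
    """Return the taxonomy terms found in `text`, grouped by dimension."""
    lowered = text.lower()
    return {
        dimension: [term for term in terms if term in lowered]
        for dimension, terms in TAXONOMY.items()
    }


def is_candidate(matches: dict[str, list[str]]) -> bool:
    """Assumed filter: keep posts with at least one product and one intent term."""
    return bool(matches["product"]) and bool(matches["intent"])


post = ("Looking for residential proxies for e-commerce price intelligence. "
        "Our current provider is too expensive.")
matches = match_taxonomy(post)
```

Adding a term to any list widens the signal surface, and tightening `is_candidate` trades volume for signal-to-noise, which is the tuning lever described above.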
The classification framework
Four tiers based on expressed intent:
- Tier 1 — Active intent (respond same day): explicitly seeking a solution, naming competitors, describing specific pain
- Tier 2 — Research (respond next business day): evaluating options, comparing providers, building requirements
- Tier 3 — Building (nurture): constructing infrastructure that will eventually need proxy/scraping tools
- Tier 4 — Noise (drop): students, hobbyists, tangential mentions
The framework maps directly to response playbooks. Each tier gets a different velocity, channel, and messaging approach.
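The tier-to-playbook mapping can be sketched as a plain lookup table. The response windows come from the tier definitions above; the `Playbook` fields and the `playbook_for` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    ACTIVE_INTENT = 1
    RESEARCH = 2
    BUILDING = 3
    NOISE = 4


@dataclass(frozen=True)
class Playbook:
    action: str
    # Business days allowed before responding; None means no response deadline.
    respond_within_business_days: Optional[int]


PLAYBOOKS = {
    Tier.ACTIVE_INTENT: Playbook("respond", 0),  # same day
    Tier.RESEARCH: Playbook("respond", 1),       # next business day
    Tier.BUILDING: Playbook("nurture", None),
    Tier.NOISE: Playbook("drop", None),
}


def playbook_for(tier: int) -> Playbook:
    """Look up the playbook; Tier(tier) raises ValueError for values outside 1-4."""
    return PLAYBOOKS[Tier(tier)]
```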
The system
We built a 6-stage pipeline that processes signals from capture through to qualified leads ready for outbound.
- Stage 1: 4 parallel scrapers (Reddit, GitHub, HackerNews, Twitter), keyword-driven, running every 4 hours
- Stage 2: Intent tier assignment (1–4), company name extraction, signal type classification
- Stage 3: Auto-create lead records, deduplication, Slack alerts for Tier 1 signals
- Stage 4: LinkedIn company pages via Thor Data Scraper API: employee count, industry, tech stack
- Stage 5: 3-tier API fallback for decision-maker profiles: Scraper API → Web Unlocker → SERP
- Stage 6: Scoring on signal tier × company fit × contact availability → priority queue
Signal capture
Four parallel scrapers, each built for its platform’s native data structure. Reddit via JSON API with Web Unlocker fallback. GitHub via API with SERP fallback. HackerNews via Algolia API. Twitter via Serper site search. Scheduled every 4 hours via Supabase pg_cron.
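As a rough illustration of one capture path, the sketch below builds a query against the public HackerNews Search API hosted by Algolia and normalizes the response into signal records. The choice of the `ask_hn` tag, the 4-hour lookback expressed through `numericFilters`, and the output field names are assumptions made for the example; no network call is made here.

```python
import json
from urllib.parse import urlencode

HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search_by_date"


def build_hn_query(keyword: str, since_unix: int) -> str:
    """Build a search URL for Ask HN posts newer than `since_unix`."""
    params = {
        "query": keyword,
        "tags": "ask_hn",
        "numericFilters": f"created_at_i>{since_unix}",
    }
    return f"{HN_SEARCH_URL}?{urlencode(params)}"


def parse_hits(payload: str) -> list[dict]:
    """Convert a raw JSON response body into normalized signal records."""
    hits = json.loads(payload).get("hits", [])
    return [
        {
            "source": "hackernews",
            "external_id": hit["objectID"],
            "title": hit.get("title") or "",
            "url": f"https://news.ycombinator.com/item?id={hit['objectID']}",
        }
        for hit in hits
    ]


url = build_hn_query("proxy provider", since_unix=1700000000)
sample = '{"hits": [{"objectID": "123", "title": "Ask HN: proxies?"}]}'
signals = parse_hits(sample)
```

A scheduler running every 4 hours would pass the previous run's timestamp as `since_unix` so each cycle only picks up new posts.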
Every scraper uses Thor Data’s own APIs. The Reddit scraper goes through Web Unlocker. The GitHub and Twitter scrapers use SERP API (via Serper, which itself runs on Thor Data’s SERP infrastructure). The LinkedIn enrichment scrapers use the Scraper API. The system generates real product usage telemetry while finding customers.
Classification and lead creation
Supabase edge functions triggered by database webhooks process each new signal. Claude 3 Haiku classifies the tier, extracts the company name, and assigns the signal type. When a company is identified, the system auto-creates or updates a lead record with deduplication. Tier 1 signals fire a Slack webhook immediately.
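The create-or-update step with deduplication might look something like the following in-memory sketch. The normalization rule, the `LeadStore` class, and the `alerts` list standing in for the Slack webhook are all hypothetical; the real system keeps leads in Supabase.

```python
import re
from dataclasses import dataclass, field


def normalize_company(name: str) -> str:
    """Assumed dedup key: lowercase, drop punctuation and common legal suffixes."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    key = re.sub(r"\b(inc|llc|ltd|corp|co)\b", "", key)
    return " ".join(key.split())


@dataclass
class Lead:
    company: str
    best_tier: int
    signal_ids: list = field(default_factory=list)


class LeadStore:
    def __init__(self) -> None:
        self.leads: dict[str, Lead] = {}
        self.alerts: list[str] = []  # stand-in for Slack webhook calls

    def upsert(self, company: str, tier: int, signal_id: str) -> Lead:
        """Create the lead if new, otherwise merge the signal into the existing one."""
        key = normalize_company(company)
        lead = self.leads.get(key)
        if lead is None:
            lead = Lead(company=company, best_tier=tier)
            self.leads[key] = lead
        lead.best_tier = min(lead.best_tier, tier)  # lower number = stronger intent
        if signal_id not in lead.signal_ids:
            lead.signal_ids.append(signal_id)
        if tier == 1:
            self.alerts.append(signal_id)  # Tier 1 signals alert immediately
        return lead


store = LeadStore()
store.upsert("DataForge AI, Inc.", tier=2, signal_id="s1")
lead = store.upsert("dataforge ai", tier=1, signal_id="s2")
```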
Enrichment
LinkedIn company page scraping via Thor Data’s Scraper API extracts firmographic data: employee count, industry, location, specialties. LinkedIn profile scraping identifies decision makers — CTOs, VPs of Engineering, Heads of Data — using a 3-tier API fallback for resilience.
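The 3-tier fallback can be sketched as an ordered list of fetchers tried in turn. The fetcher functions below are stubs that simulate failures; they are not real Thor Data client calls, and their names and return shapes are assumptions.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[dict]]


def fetch_with_fallback(url: str, fetchers: list[tuple[str, Fetcher]]) -> tuple[str, dict]:
    """Try each fetcher in order; return (provider_name, result) for the first success."""
    errors = []
    for name, fetch in fetchers:
        try:
            result = fetch(url)
        except Exception as exc:  # a failing tier should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
        errors.append(f"{name}: empty result")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub fetchers standing in for the three tiers (hypothetical behavior).
def scraper_api(url: str) -> Optional[dict]:
    raise TimeoutError("timed out")


def web_unlocker(url: str) -> Optional[dict]:
    return None  # simulates a blocked or empty page


def serp(url: str) -> Optional[dict]:
    return {"title": "CTO", "profile_url": url}


provider, profile = fetch_with_fallback(
    "https://www.linkedin.com/in/example",
    [("scraper_api", scraper_api), ("web_unlocker", web_unlocker), ("serp", serp)],
)
```

Returning the provider name alongside the result makes it easy to log which tier actually served each request, which is the kind of success-rate data the next paragraph refers to.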
The enrichment pipeline is where the self-demonstrating architecture pays off most visibly. Every LinkedIn page scraped to enrich a lead is a real product usage event. The scraper’s success rate, speed, and cost are production metrics that directly inform sales conversations about the product.
Qualification
Deterministic scoring across three dimensions: signal strength (0–40), company fit (0–35), and contact access (0–25). The formula is explicit and auditable — when a lead scores 91, you can trace exactly which factors contributed. High-scoring leads enter the priority outbound queue.
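A minimal sketch of what such a scoring function could look like follows. Only the three dimension caps (40, 35, 25) come from the description above; the point values per tier, the employee-count band, and the contact thresholds are invented for illustration.

```python
from dataclasses import dataclass

# Assumed point values per tier; only the 0-40 cap comes from the text.
SIGNAL_POINTS = {1: 40, 2: 28, 3: 12, 4: 0}


@dataclass
class LeadFacts:
    tier: int
    employee_count: int
    in_target_cohort: bool
    decision_makers_found: int


def score(facts: LeadFacts) -> dict:
    """Return a per-dimension breakdown plus the total, so a score can be traced."""
    signal = SIGNAL_POINTS[facts.tier]                      # 0-40
    fit = 20 if facts.in_target_cohort else 0               # cohort match
    fit += 15 if 50 <= facts.employee_count <= 1000 else 5  # size band, fit caps at 35
    if facts.decision_makers_found >= 2:                    # 0-25
        contact = 25
    elif facts.decision_makers_found == 1:
        contact = 15
    else:
        contact = 0
    return {"signal": signal, "fit": fit, "contact": contact,
            "total": signal + fit + contact}


breakdown = score(LeadFacts(tier=1, employee_count=85,
                            in_target_cohort=True, decision_makers_found=2))
```

Because the function returns the breakdown rather than just the total, any score can be traced back to the factors that produced it.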
Signal Intelligence Feed
Buying signals captured across 4 platforms, classified by intent tier
| Source | Signal | Company | Tier | Intent | When |
|---|---|---|---|---|---|
| Reddit | Looking for Bright Data alternative — current costs unsustainable at scale | DataForge AI | Tier 1 | active intent | 2h ago |
| GitHub | Issue: Web Unlocker rate limiting on protected sites, need proxy rotation | ScaleML | Tier 1 | active pain | 3h ago |
|  | Anyone have a Bright Data alternative? Their residential pool has been unreliable this week | CrawlBase | Tier 1 | active intent | 4h ago |
|  | Switching from Oxylabs — need better success rates on e-commerce sites | PriceTrack | Tier 1 | active intent | 6h ago |
| GitHub | PR: Replace brightdata-sdk with generic proxy rotation layer | Moneta Analytics | Tier 2 | research | 8h ago |
| HN | Ask HN: What proxy infrastructure are you using for large-scale scraping? | — | Tier 2 | research | 10h ago |
| GitHub | Evaluating SERP API providers for competitive intelligence pipeline | InsightBridge | Tier 2 | research | 12h ago |
| GitHub | Building data collection pipeline — need reliable residential proxies | Nexus Data | Tier 2 | building | 14h ago |
| HN | How do you handle anti-bot detection at scale? Cloudflare is killing us | SpectrumIO | Tier 3 | active pain | 1d ago |
| GitHub | New repo: web-scraper-benchmark — comparing proxy providers | — | Tier 3 | research | 1d ago |
| GitHub | Training data collection for LLM fine-tuning — need web scraping infra | Meridian Labs | Tier 3 | building | 2d ago |
|  | Exploring proxy options for a side project — any recommendations? | — | Tier 4 | research | 2d ago |
Company Context
DataForge AI builds price intelligence tools for e-commerce brands, scraping product data across 200+ retail sites daily. The team is 85 people, Series A funded, based in Austin. Their data pipeline is core infrastructure — proxy reliability directly impacts product accuracy and customer SLAs.
Key Contacts
- James Chen — CTO (LinkedIn)
- Sarah Okafor — VP Engineering (LinkedIn)
Signal Context
Posted in r/webscraping, a subreddit with 45K members focused on web scraping tools and infrastructure. The post received 12 replies, several recommending specific providers. The author described a specific use case (e-commerce price intelligence), a specific pain point ($6K/month cost), and is actively evaluating — all Tier 1 indicators.
Recommended Response
- Lead with cost comparison — they cited $6K/month on Bright Data. Thor Data's pricing at their volume would be roughly 40% lower.
- Reference e-commerce scraping specifically — Thor Data's Web Unlocker has strong success rates on Shopify, Amazon, and major retail platforms.
- The Reddit post mentions "evaluating alternatives" — they're in active buying mode. Response within 24 hours is critical.
What we chose not to build
Scope discipline matters as much as architecture:
- No automated outbound. Month 1 was 100% manual review. Every signal-sourced lead was reviewed by a human before outreach. This was deliberate — you calibrate the system before you automate it.
- No CRM integration. Supabase is the system of record during the validation phase. CRM sync is a Phase 2 concern, after signal quality and classification accuracy are proven.
- No custom-trained classifier. Classification uses Claude Haiku with structured prompts and explicit tier criteria. The model isn’t trained on historical data; it applies the criteria written in the prompt. This makes the system inspectable: you can read the prompt, understand the criteria, and tune the classification by adjusting the prompt, not retraining a model.
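To make the "read the prompt, tune the prompt" point concrete, here is a hedged sketch of a prompt builder and a response validator. The tier wording is taken from the classification framework; the prompt layout, the JSON response shape, and both function names are assumptions, and the call to the model itself is deliberately left out.

```python
import json

TIER_CRITERIA = {
    1: "Active intent: explicitly seeking a solution, naming competitors, describing specific pain",
    2: "Research: evaluating options, comparing providers, building requirements",
    3: "Building: constructing infrastructure that will eventually need proxy/scraping tools",
    4: "Noise: students, hobbyists, tangential mentions",
}


def build_prompt(signal_text: str) -> str:
    """Assemble the classification prompt from the explicit tier criteria."""
    criteria = "\n".join(f"Tier {tier}: {desc}" for tier, desc in TIER_CRITERIA.items())
    return (
        "Classify the post below into exactly one tier.\n"
        f"{criteria}\n"
        'Reply with JSON only: {"tier": 1-4, "company": string or null, "signal_type": string}\n'
        f"Post:\n{signal_text}"
    )


def parse_classification(raw: str) -> dict:
    """Validate the model's JSON reply; reject tiers outside the defined set."""
    data = json.loads(raw)
    tier = data.get("tier")
    if not isinstance(tier, int) or tier not in TIER_CRITERIA:
        raise ValueError(f"tier out of range: {tier!r}")
    return {
        "tier": tier,
        "company": data.get("company"),
        "signal_type": data.get("signal_type", ""),
    }


prompt = build_prompt("Looking for a Bright Data alternative, costs are unsustainable")
parsed = parse_classification(
    '{"tier": 1, "company": "DataForge AI", "signal_type": "active intent"}'
)
```

Changing how a tier is classified means editing one string in `TIER_CRITERIA`, which is the inspectability trade-off described above.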
Outcomes
- 910 buying signals captured across 4 platforms in the first capture cycle
- 332 Tier 1 high-intent prospects — a 36% quality rate
- Signal distribution by source: GitHub 532 (118 Tier 1), Reddit 135 (133 Tier 1), Twitter 100 (49 Tier 1), HackerNews 143 (32 Tier 1)
- Reddit signal quality: 99% of captured signals classified Tier 1 or 2 — the highest quality-to-volume ratio of any platform
- Pipeline cost: under $0.05 per qualified lead
- 200+ keyword taxonomy across product, cohort, and intent dimensions
- System runs entirely on Thor Data’s own APIs — zero third-party scraping infrastructure required
Signal Intelligence Observability
Aggregate signal metrics, platform performance, and cost tracking across all sources
[Charts: Signal Distribution by Source; Tier Distribution]
Platform Performance
| Source | Signals | Tier 1 | Tier 1 Rate | Avg Classification | Cost / Signal |
|---|---|---|---|---|---|
| Reddit | 135 | 133 | 99% | 1.2s | $0.001 |
| GitHub | 532 | 118 | 22% | 1.1s | $0.001 |
| Twitter | 100 | 49 | 49% | 1.3s | $0.002 |
| HackerNews | 143 | 32 | 22% | 1.0s | $0.001 |
The 36% Tier 1 rate deserves emphasis. It measures the share of captured signals that show active intent, not a reply or conversion rate, but the contrast with the alternatives is still stark: cold list outbound converts at 0.5–1%, and intent data vendors deliver signals that every competitor also receives. This system found 332 prospects who were actively expressing a need on public platforms, prospects that no list vendor, no intent platform, and no competitor’s BDR team was systematically capturing.
Reddit’s 99% quality rate is the standout. The platform’s structure — detailed posts with context, use case descriptions, and explicit asks — produces signals that classify cleanly. GitHub produces higher volume but lower quality because many mentions are incidental rather than intentional.
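The headline rates can be re-derived from the per-source counts reported in the outcomes. The snippet below is just that arithmetic; the variable names are illustrative.

```python
# (signals captured, Tier 1 signals) per source, from the first capture cycle.
SIGNALS = {
    "GitHub": (532, 118),
    "Reddit": (135, 133),
    "Twitter": (100, 49),
    "HackerNews": (143, 32),
}

total_signals = sum(captured for captured, _ in SIGNALS.values())
total_tier1 = sum(tier1 for _, tier1 in SIGNALS.values())

# Tier 1 rate per source, rounded to the nearest whole percent.
tier1_rate = {
    source: round(100 * tier1 / captured)
    for source, (captured, tier1) in SIGNALS.items()
}
overall_rate = round(100 * total_tier1 / total_signals)
```

The totals come to 910 signals and 332 Tier 1 prospects, and the rounded rates match the Platform Performance table.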
What’s next
The intelligence layer feeds the outbound layer. The next phase: persona-based sequencing across 48 segments (16 sophistication levels across 3 roles), AI-powered email personalization that references the specific signal that identified the prospect, and Smartlead integration for campaign delivery at scale.
The system that finds the prospects hands them to the system that engages them. The intelligence layer’s tier classification determines the outbound system’s velocity, channel, and messaging — a seamless handoff from signal to sequence.
For anyone reading this with a similar challenge: the architecture pattern — capture signals, classify intent, enrich companies, qualify leads — applies to any market where buyers express needs on public platforms. The platforms change. The keywords change. The classification criteria change. The pipeline doesn’t.
Systems Demonstrated
- 02 Lead Intelligence Layer
Signal capture, intent classification, enrichment, and qualification — built on the platforms where buyers actually talk