Building a Signal-Driven Lead Intelligence System for Thor Data

Context

Thor Data is a web infrastructure company: proxy networks, SERP APIs, web scraping tools. The product suite includes residential and datacenter proxies, a Web Unlocker for anti-bot bypass, a SERP API for search engine results, and a LinkedIn Scraper API. The company is established in international markets with an existing customer base and is now entering the US for the first time.

The competitive landscape is dominated by Bright Data and Oxylabs — well-funded, established US presence, saturated keyword space, and aggressive outbound operations. Both have large BDR teams running ZoomInfo-sourced lists through multi-touch sequences. The standard playbook for a new entrant would be the same: buy lists, run sequences, compete on volume.

Thor Data had a different advantage. Their own APIs are the infrastructure that makes web scraping possible. A system built to find their buyers could run entirely on their own product — capturing signals through Web Unlocker, searching via SERP API, enriching through LinkedIn Scraper. The GTM system wouldn’t just find customers. It would demonstrate the product.

Smoothed built that system.

The problem

No US inbound pipeline. No brand recognition in the market. No US sales team in place. The company needed to go from zero to qualified pipeline in a market where the incumbents have years of brand equity and content ranking.

List-based outbound would put Thor Data in the same inbox as every other proxy vendor. A VP of Engineering receiving a cold email from an unknown proxy company — alongside cold emails from Bright Data, Oxylabs, Smartproxy, and a dozen others — has no reason to reply. The message gets lost in the noise.

Intent data vendors wouldn’t help either. Bombora and G2 sell aggregated, anonymized signals at the account level, delivered weekly. By the time you see “Company X is researching proxies,” Bright Data’s BDR team has already seen the same signal. And you still don’t know who at Company X is doing the research or what they specifically need.

The structural problem: the best leads are people who are actively expressing a need right now, on platforms that competitors’ sales teams aren’t monitoring. Reddit posts asking for alternatives. GitHub issues describing scraping infrastructure needs. HackerNews threads discussing proxy providers. Twitter complaints about incumbent pricing or reliability.

These signals exist. They’re public. But no list vendor captures them, no intent platform aggregates them, and no CRM workflow monitors them. They require a purpose-built system.

Diagnosis

Before building, we audited the signal landscape across four platforms.

The platform audit

Reddit — 15+ relevant subreddits (r/webscraping, r/dataengineering, r/MachineLearning, r/node, r/python, among others). Posts asking for proxy recommendations, Bright Data alternatives, and web scraping infrastructure advice appear daily. The signal quality is exceptional: posters describe their use case, their current provider, their pain points, and their budget constraints in detail. Reddit turned out to be the highest-quality signal source — 99% of captured signals classified as Tier 1 or 2.

GitHub — Issues, discussions, and repository descriptions mentioning proxy providers, web scraping tools, and data collection infrastructure. Volume is high (532 signals in the first capture cycle), but signal-to-noise is lower than Reddit — many mentions are incidental (a dependency listed in a package.json) rather than intentional (someone actively looking for a tool).

Twitter — Complaint signals and alternative-seeking posts. Lower volume than Reddit or GitHub, but the urgency is higher. Someone publicly tweeting “anyone have a Bright Data alternative?” is typically in active buying mode — they want a response now, not next week.

HackerNews — “Ask HN” threads are gold: engineers asking peers for tool recommendations with detailed context about their use case. Lower frequency but high-intent when they appear.

The keyword taxonomy

From the platform audit, we built a 200+ keyword taxonomy organized by three dimensions:

Product keywords — terms specific to Thor Data’s product categories: residential proxies, SERP API, web unlocker, datacenter proxies, scraping API
Cohort keywords — terms indicating which buyer segment the signal comes from: AI training data, e-commerce price intelligence, competitive intelligence, SEO monitoring, ad verification
Intent keywords — terms indicating urgency level: “alternative to,” “switching from,” “need recommendation,” “looking for,” “help with,” “too expensive,” “unreliable”

The taxonomy is the system’s primary tuning lever. Expanding it expands the signal surface. Refining it improves signal-to-noise. It’s a living document, updated based on classification accuracy data.
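A minimal sketch of how a three-dimension taxonomy could be applied to a post. The terms are the examples listed above; the `match_taxonomy` and `is_candidate` helpers, and the rule that a candidate needs both a product term and an intent term, are assumptions for illustration rather than the production logic.

```python
# Illustrative subset of the taxonomy; the full version has 200+ terms.
TAXONOMY = {
    "product": ["residential proxies", "serp api", "web unlocker",
                "datacenter proxies", "scraping api"],
    "cohort": ["ai training data", "e-commerce price intelligence",
               "competitive intelligence", "seo monitoring", "ad verification"],
    "intent": ["alternative to", "switching from", "need recommendation",
               "looking for", "help with", "too expensive", "unreliable"],
}


def match_taxonomy(text: str, taxonomy: dict = TAXONOMY) -> dict:
    """Return the taxonomy terms found in `text`, grouped by dimension."""
    lowered = text.lower()
    return {
        dimension: [term for term in terms if term in lowered]
        for dimension, terms in taxonomy.items()
    }


def is_candidate(matches: dict) -> bool:
    """Hypothetical capture rule: at least one product term and one intent term."""
    return bool(matches["product"]) and bool(matches["intent"])


post = ("Looking for an alternative to our current residential proxies, "
        "too expensive at scale for AI training data")
matches = match_taxonomy(post)
print(matches, is_candidate(matches))
```

Keeping the terms in plain data is what makes the taxonomy a tuning lever: adding or removing a term changes what gets captured without touching the matching code.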

The classification framework

Four tiers based on expressed intent:

Tier 1 — Active intent (respond same day): explicitly seeking a solution, naming competitors, describing specific pain
Tier 2 — Research (respond next business day): evaluating options, comparing providers, building requirements
Tier 3 — Building (nurture): constructing infrastructure that will eventually need proxy/scraping tools
Tier 4 — Noise (drop): students, hobbyists, tangential mentions

The framework maps directly to response playbooks. Each tier gets a different velocity, channel, and messaging approach.
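The tier-to-playbook mapping can be sketched as a lookup table. Only the action and response velocity come from the tier definitions above; channel and messaging are not specified here, so they are left out, and the `Playbook` shape is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    ACTIVE_INTENT = 1
    RESEARCH = 2
    BUILDING = 3
    NOISE = 4


@dataclass(frozen=True)
class Playbook:
    action: str
    # Business days allowed before responding; None means no direct response.
    respond_within_business_days: Optional[int]


PLAYBOOKS = {
    Tier.ACTIVE_INTENT: Playbook("respond", 0),  # same day
    Tier.RESEARCH: Playbook("respond", 1),       # next business day
    Tier.BUILDING: Playbook("nurture", None),
    Tier.NOISE: Playbook("drop", None),
}


def playbook_for(tier: int) -> Playbook:
    """Look up the playbook; Tier() raises ValueError for anything outside 1-4."""
    return PLAYBOOKS[Tier(tier)]


print(playbook_for(1))
```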

The system

We built a 6-stage pipeline that processes signals from capture through to qualified leads ready for outbound.

Stage 1: Signal Capture (Deterministic). 4 parallel scrapers for Reddit, GitHub, HackerNews, and Twitter; keyword-driven, run every 4 hours
Stage 2: Signal Classification (AI · Claude Haiku). Intent tier assignment (1–4), company name extraction, signal type classification
Stage 3: Lead Creation (Deterministic). Auto-create lead records, deduplication, Slack alerts for Tier 1 signals
Stage 4: Company Enrichment (Scraping · LinkedIn). LinkedIn company pages via Thor Data Scraper API: employee count, industry, tech stack
Stage 5: Contact Enrichment (Scraping · LinkedIn). 3-tier API fallback for decision-maker profiles: Scraper API → Web Unlocker → SERP
Stage 6: Lead Qualification (Deterministic). Scoring on signal tier × company fit × contact availability → priority queue

Stage types: Deterministic (auditable, reproducible); AI-powered (classification and synthesis); Scraping (platform-native data extraction)

Signal capture

Four parallel scrapers, each built for its platform’s native data structure. Reddit via JSON API with Web Unlocker fallback. GitHub via API with SERP fallback. HackerNews via Algolia API. Twitter via Serper site search. Scheduled every 4 hours via Supabase pg_cron.
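As an illustration of one capture path, here is a rough sketch of a HackerNews query against the public Algolia-backed HN Search API. The endpoint, the `numericFilters` parameter, and the `objectID` / `title` / `story_text` response fields belong to that public API; the signal record shape and the function names are assumptions, not the production scraper.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search_by_date"


def build_hn_query(keyword: str, since_epoch: int) -> str:
    """Build a search URL for stories newer than `since_epoch` (Unix seconds)."""
    params = {
        "query": keyword,
        "tags": "story",
        "numericFilters": f"created_at_i>{since_epoch}",
    }
    return f"{HN_SEARCH_URL}?{urlencode(params)}"


def parse_hits(payload: dict, keyword: str) -> list:
    """Map API hits to signal records (the record shape is hypothetical)."""
    signals = []
    for hit in payload.get("hits", []):
        signals.append({
            "source": "hackernews",
            "external_id": hit["objectID"],
            "title": hit.get("title") or "",
            "body": hit.get("story_text") or "",
            "url": f"https://news.ycombinator.com/item?id={hit['objectID']}",
            "matched_keyword": keyword,
        })
    return signals


def fetch_signals(keyword: str, since_epoch: int) -> list:
    """Perform the live request; a scheduler would call this once per keyword."""
    with urlopen(build_hn_query(keyword, since_epoch), timeout=10) as resp:
        return parse_hits(json.load(resp), keyword)
```

For a 4-hour cadence, `since_epoch` would be the current time minus `4 * 3600` seconds, so each run only sees posts that arrived since the previous one.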

Every scraper uses Thor Data’s own APIs. The Reddit scraper goes through Web Unlocker. The GitHub and Twitter scrapers use SERP API (via Serper, which itself runs on Thor Data’s SERP infrastructure). The LinkedIn enrichment scrapers use the Scraper API. The system generates real product usage telemetry while finding customers.

Classification and lead creation

Supabase edge functions triggered by database webhooks process each new signal. Claude 3 Haiku classifies the tier, extracts the company name, and assigns the signal type. When a company is identified, the system auto-creates or updates a lead record with deduplication. Tier 1 signals fire a Slack webhook immediately.
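The lead-creation step can be sketched roughly as follows, assuming the classifier returns a dict with `tier` and `company` keys. The field names, the company-name normalization rule, and the in-memory `leads` store are all hypothetical stand-ins; the production system runs in Supabase edge functions against database tables.

```python
import re

VALID_TIERS = {1, 2, 3, 4}


def normalize_company(name: str) -> str:
    """Dedup key: lowercase, drop punctuation and common legal suffixes."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    key = re.sub(r"\b(inc|llc|ltd|corp|co)\b", "", key)
    return " ".join(key.split())


def apply_classification(leads: dict, signal_id: str, result: dict) -> bool:
    """Create or update a lead; return True when a Tier 1 alert should fire."""
    tier = result.get("tier")
    if tier not in VALID_TIERS:
        raise ValueError(f"unexpected tier from classifier: {tier!r}")
    company = (result.get("company") or "").strip()
    if company:
        key = normalize_company(company)
        lead = leads.setdefault(
            key, {"company": company, "best_tier": tier, "signal_ids": []}
        )
        # Keep the strongest (lowest-numbered) tier seen for this company.
        lead["best_tier"] = min(lead["best_tier"], tier)
        if signal_id not in lead["signal_ids"]:
            lead["signal_ids"].append(signal_id)
    return tier == 1


leads = {}
apply_classification(leads, "reddit-1", {"tier": 2, "company": "DataForge AI, Inc."})
alert = apply_classification(leads, "github-7", {"tier": 1, "company": "dataforge ai"})
print(alert, leads)
```

Validating the tier before writing anything keeps a malformed model response from creating a lead record.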

Enrichment

LinkedIn company page scraping via Thor Data’s Scraper API extracts firmographic data: employee count, industry, location, specialties. LinkedIn profile scraping identifies decision makers — CTOs, VPs of Engineering, Heads of Data — using a 3-tier API fallback for resilience.
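A 3-tier fallback of this kind can be sketched as an ordered list of fetchers tried in turn. The fetcher names mirror the Scraper API → Web Unlocker → SERP order; the callables themselves are placeholders, since the actual API clients are not shown in this write-up.

```python
from typing import Callable, Optional, Sequence, Tuple

Fetcher = Callable[[str], Optional[dict]]


def fetch_with_fallback(
    target: str, fetchers: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, dict]:
    """Try each fetcher in order; return (fetcher_name, result) from the first success."""
    errors = []
    for name, fetch in fetchers:
        try:
            result = fetch(target)
        except Exception as exc:  # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
            continue
        if result:
            return name, result
        errors.append(f"{name}: empty result")
    raise RuntimeError(f"all fetchers failed for {target}: " + "; ".join(errors))


# Placeholder fetchers standing in for the three real API clients.
def scraper_api(target: str) -> Optional[dict]:
    raise TimeoutError("timed out")


def web_unlocker(target: str) -> Optional[dict]:
    return None


def serp(target: str) -> Optional[dict]:
    return {"profile": target}


chain = [("scraper_api", scraper_api), ("web_unlocker", web_unlocker), ("serp", serp)]
print(fetch_with_fallback("example-company", chain))
```

Returning the name of the fetcher that succeeded makes it possible to track how often each tier is actually needed.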

The enrichment pipeline is where the self-demonstrating architecture pays off most visibly. Every LinkedIn page scraped to enrich a lead is a real product usage event. The scraper’s success rate, speed, and cost are production metrics that directly inform sales conversations about the product.

Qualification

Deterministic scoring across three dimensions: signal strength (0–40), company fit (0–35), and contact access (0–25). The formula is explicit and auditable — when a lead scores 91, you can trace exactly which factors contributed. High-scoring leads enter the priority outbound queue.
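To show what "traceable" means in practice, here is a small sketch that sums itemized factors per dimension. The factor labels and point values are taken from the DataForge AI example in this write-up; the data layout and the `score_lead` helper are assumptions, and per-dimension caps are left out of the sketch.

```python
from collections import defaultdict

# (dimension, factor, points) from the DataForge AI example.
FACTORS = [
    ("signal_strength", "Tier 1: active buying intent", 20),
    ("signal_strength", "Named competitor (Bright Data)", 10),
    ("signal_strength", "Specific use case described", 5),
    ("company_fit", "AI/ML company (target cohort)", 15),
    ("company_fit", "85 employees (mid-market)", 10),
    ("company_fit", "Web scraping as core workflow", 5),
    ("contact_access", "CTO identified via LinkedIn", 15),
    ("contact_access", "VP Engineering identified", 10),
    ("contact_access", "Email verified via enrichment", 1),
]


def score_lead(factors):
    """Return (total, per-dimension subtotals) for a list of scored factors."""
    by_dimension = defaultdict(int)
    for dimension, _label, points in factors:
        by_dimension[dimension] += points
    return sum(by_dimension.values()), dict(by_dimension)


total, breakdown = score_lead(FACTORS)
print(total, breakdown)
```

Because every point is attached to a named factor, the list itself is the audit trail for the score.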

Signal Intelligence Feed

Buying signals captured across 4 platforms, classified by intent tier

910 signals | 332 Tier 1 | 4 sources | $0.05/qualified lead

Source	Signal	Company	Tier	Intent	When
Reddit	Looking for Bright Data alternative — current costs unsustainable at scale	DataForge AI	Tier 1	active intent	2h ago
GitHub	Issue: Web Unlocker rate limiting on protected sites, need proxy rotation	ScaleML	Tier 1	active pain	3h ago
Twitter	Anyone have a Bright Data alternative? Their residential pool has been unreliable this week	CrawlBase	Tier 1	active intent	4h ago
Reddit	Switching from Oxylabs — need better success rates on e-commerce sites	PriceTrack	Tier 1	active intent	6h ago
GitHub	PR: Replace brightdata-sdk with generic proxy rotation layer	Moneta Analytics	Tier 2	research	8h ago
HN	Ask HN: What proxy infrastructure are you using for large-scale scraping?	—	Tier 2	research	10h ago
GitHub	Evaluating SERP API providers for competitive intelligence pipeline	InsightBridge	Tier 2	research	12h ago
GitHub	Building data collection pipeline — need reliable residential proxies	Nexus Data	Tier 2	building	14h ago
HN	How do you handle anti-bot detection at scale? Cloudflare is killing us	SpectrumIO	Tier 3	active pain	1d ago
GitHub	New repo: web-scraper-benchmark — comparing proxy providers	—	Tier 3	research	1d ago
GitHub	Training data collection for LLM fine-tuning — need web scraping infra	Meridian Labs	Tier 3	building	2d ago
Twitter	Exploring proxy options for a side project — any recommendations?	—	Tier 4	research	2d ago

DataForge AI

Artificial Intelligence

Employees: 85

Signal Source

Reddit · r/webscraping · 2 hours ago

We're scraping product data across 200+ e-commerce sites for our price intelligence platform. Bright Data costs are unsustainable at our volume — $6K/month and climbing. Need a residential proxy solution with comparable success rates on protected sites. Currently evaluating alternatives.

Classification

Tier 1 — active buying intent. Named competitor, specific volume/cost pain, evaluating alternatives now. Score: 91/100.

Qualification Score: 91/100

Signal Strength 35/40

✓ Tier 1: active buying intent +20 pts
✓ Named competitor (Bright Data) +10 pts
✓ Specific use case described +5 pts

Company Fit 30/35

✓ AI/ML company (target cohort) +15 pts
✓ 85 employees (mid-market) +10 pts
✓ Web scraping as core workflow +5 pts

Contact Access 26/30

✓ CTO identified via LinkedIn +15 pts
✓ VP Engineering identified +10 pts
✓ Email verified via enrichment +1 pt

Company Context

DataForge AI builds price intelligence tools for e-commerce brands, scraping product data across 200+ retail sites daily. The team is 85 people, Series A funded, based in Austin. Their data pipeline is core infrastructure — proxy reliability directly impacts product accuracy and customer SLAs.

Key Contacts

James Chen — CTO (LinkedIn)
Sarah Okafor — VP Engineering (LinkedIn)

Signal Context

Posted in r/webscraping, a subreddit with 45K members focused on web scraping tools and infrastructure. The post received 12 replies, several recommending specific providers. The author described a specific use case (e-commerce price intelligence), a specific pain point ($6K/month cost), and is actively evaluating — all Tier 1 indicators.

Recommended Response

Lead with cost comparison — they cited $6K/month on Bright Data. Thor Data's pricing at their volume would be roughly 40% lower.
Reference e-commerce scraping specifically — Thor Data's Web Unlocker has strong success rates on Shopify, Amazon, and major retail platforms.
The Reddit post mentions "evaluating alternatives", so they're in active buying mode. As a Tier 1 signal, it calls for a same-day response.

What we chose not to build

Scope discipline matters as much as architecture:

No automated outbound. Month 1 was 100% manual review. Every signal-sourced lead was reviewed by a human before outreach. This was deliberate — you calibrate the system before you automate it.
No CRM integration. Supabase is the system of record during the validation phase. CRM sync is a Phase 2 concern, after signal quality and classification accuracy are proven.
No custom-trained classifier. Classification is Claude Haiku with structured prompts and explicit tier criteria. The model isn’t trained on historical lead data; it applies the criteria written in the prompt. This makes the system inspectable: you can read the prompt, understand the criteria, and tune the classification by adjusting the prompt, not retraining a model.

Outcomes

910 buying signals captured across 4 platforms in the first capture cycle
332 Tier 1 high-intent prospects — a 36% quality rate
Signal distribution by source: GitHub 532 (118 Tier 1), Reddit 135 (133 Tier 1), Twitter 100 (49 Tier 1), HackerNews 143 (32 Tier 1)
Reddit signal quality: 99% of captured signals classified Tier 1 or 2 — the highest quality-to-volume ratio of any platform
Pipeline cost: under $0.05 per qualified lead
200+ keyword taxonomy across product, cohort, and intent dimensions
System runs entirely on Thor Data’s own APIs — zero third-party scraping infrastructure required

Signal Intelligence Observability

Aggregate signal metrics, platform performance, and cost tracking across all sources

Total Signals 910 | Tier 1 Signals 332 | Quality Rate 36% | Cost / Qualified Lead $0.05

Signal Distribution by Source

Source	Signals
GitHub	532
HackerNews	143
Reddit	135
Twitter	100

Tier Distribution

Tier	Signals
Tier 1	332
Tier 2	287
Tier 3	210
Tier 4	81

Platform Performance

Source	Signals	Tier 1	Tier 1 Rate	Avg Classification Time	Cost / Signal
Reddit	135	133	99%	1.2s	$0.001
GitHub	532	118	22%	1.1s	$0.001
Twitter	100	49	49%	1.3s	$0.002
HackerNews	143	32	22%	1.0s	$0.001
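The Tier 1 rates in the table follow directly from the counts. A short check, using only the figures reported above (the helper name is ours):

```python
# (signals, tier_1) per platform, as reported in the table above.
PLATFORMS = {
    "Reddit": (135, 133),
    "GitHub": (532, 118),
    "Twitter": (100, 49),
    "HackerNews": (143, 32),
}


def tier1_rate(signals: int, tier1: int) -> int:
    """Tier 1 share as a whole-number percentage."""
    return round(100 * tier1 / signals)


rates = {name: tier1_rate(s, t) for name, (s, t) in PLATFORMS.items()}
total_signals = sum(s for s, _ in PLATFORMS.values())
total_tier1 = sum(t for _, t in PLATFORMS.values())

print(rates)
print(total_signals, total_tier1, tier1_rate(total_signals, total_tier1))
```

The per-platform counts add up to the 910 signals and 332 Tier 1 prospects reported for the capture cycle, which gives the overall 36% rate.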

The 36% Tier 1 rate deserves emphasis. Cold list outbound converts at 0.5–1%. Intent data vendors deliver signals that every competitor also receives. This system found 332 prospects that were actively expressing a need on public platforms — prospects that no list vendor, no intent platform, and no competitor’s BDR team was systematically capturing.

Reddit’s 99% quality rate is the standout. The platform’s structure — detailed posts with context, use case descriptions, and explicit asks — produces signals that classify cleanly. GitHub produces higher volume but lower quality because many mentions are incidental rather than intentional.

What’s next

The intelligence layer feeds the outbound layer. The next phase: persona-based sequencing across 48 segments (16 sophistication levels across 3 roles), AI-powered email personalization that references the specific signal that identified the prospect, and Smartlead integration for campaign delivery at scale.

The system that finds the prospects hands them to the system that engages them. The intelligence layer’s tier classification determines the outbound system’s velocity, channel, and messaging — a seamless handoff from signal to sequence.

For anyone reading this with a similar challenge: the architecture pattern — capture signals, classify intent, enrich companies, qualify leads — applies to any market where buyers express needs on public platforms. The platforms change. The keywords change. The classification criteria change. The pipeline doesn’t.