A web infrastructure company entering the US market with proxy, SERP, and scraping APIs

Building a Signal-Driven Lead Intelligence System for Thor Data

Context

Thor Data is a web infrastructure company: proxy networks, SERP APIs, web scraping tools. The product suite includes residential and datacenter proxies, a Web Unlocker for anti-bot bypass, a SERP API for search engine results, and a LinkedIn Scraper API. The company is established in international markets with an existing customer base and is now entering the US for the first time.

The competitive landscape is dominated by Bright Data and Oxylabs — well-funded, established US presence, saturated keyword space, and aggressive outbound operations. Both have large BDR teams running ZoomInfo-sourced lists through multi-touch sequences. The standard playbook for a new entrant would be the same: buy lists, run sequences, compete on volume.

Thor Data had a different advantage. Their own APIs are the infrastructure that makes web scraping possible. A system built to find their buyers could run entirely on their own product — capturing signals through Web Unlocker, searching via SERP API, enriching through LinkedIn Scraper. The GTM system wouldn’t just find customers. It would demonstrate the product.

Smoothed built that system.

The problem

No US inbound pipeline. No brand recognition in the market. No US sales team in place. The company needed to go from zero to qualified pipeline in a market where the incumbents have years of brand equity and content ranking.

List-based outbound would put Thor Data in the same inbox as every other proxy vendor. A VP of Engineering receiving a cold email from an unknown proxy company — alongside cold emails from Bright Data, Oxylabs, Smartproxy, and a dozen others — has no reason to reply. The message gets lost in the noise.

Intent data vendors wouldn’t help either. Bombora and G2 sell aggregated, anonymized signals at the account level, delivered weekly. By the time you see “Company X is researching proxies,” Bright Data’s BDR team has already seen the same signal. And you still don’t know who at Company X is doing the research or what they specifically need.

The structural problem: the best leads are people who are actively expressing a need right now, on platforms that competitors are not systematically monitoring. Reddit posts asking for alternatives. GitHub issues describing scraping infrastructure needs. HackerNews threads discussing proxy providers. Twitter complaints about incumbent pricing or reliability.

These signals exist. They’re public. But no list vendor captures them, no intent platform aggregates them, and no CRM workflow monitors them. They require a purpose-built system.

Diagnosis

Before building, we audited the signal landscape across four platforms.

The platform audit

Reddit — 15+ relevant subreddits (r/webscraping, r/dataengineering, r/MachineLearning, r/node, r/python, among others). Posts asking for proxy recommendations, Bright Data alternatives, and web scraping infrastructure advice appear daily. The signal quality is exceptional: posters describe their use case, their current provider, their pain points, and their budget constraints in detail. Reddit turned out to be the highest-quality signal source — 99% of captured signals classified as Tier 1 or 2.

GitHub — Issues, discussions, and repository descriptions mentioning proxy providers, web scraping tools, and data collection infrastructure. Volume is high (532 signals in the first capture cycle), but signal-to-noise is lower than Reddit — many mentions are incidental (a dependency listed in a package.json) rather than intentional (someone actively looking for a tool).

Twitter — Complaint signals and alternative-seeking posts. Lower volume than Reddit or GitHub, but the urgency is higher. Someone publicly tweeting “anyone have a Bright Data alternative?” is typically in active buying mode — they want a response now, not next week.

HackerNews — “Ask HN” threads are gold: engineers asking peers for tool recommendations with detailed context about their use case. Lower frequency but high-intent when they appear.

The keyword taxonomy

From the platform audit, we built a 200+ keyword taxonomy organized by three dimensions:

  • Product keywords — terms specific to Thor Data’s product categories: residential proxies, SERP API, web unlocker, datacenter proxies, scraping API
  • Cohort keywords — terms indicating which buyer segment the signal comes from: AI training data, e-commerce price intelligence, competitive intelligence, SEO monitoring, ad verification
  • Intent keywords — terms indicating urgency level: “alternative to,” “switching from,” “need recommendation,” “looking for,” “help with,” “too expensive,” “unreliable”

The taxonomy is the system’s primary tuning lever. Expanding it expands the signal surface. Refining it improves signal-to-noise. It’s a living document, updated based on classification accuracy data.
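A minimal sketch of how a three-dimension taxonomy could drive capture, assuming plain case-insensitive substring matching. The terms are the examples listed above, not the full 200+ list, and the `is_candidate` rule (at least one product term plus one intent term) is a hypothetical filter, not the system's documented logic.

```python
# Illustrative subset of the taxonomy; the production list has 200+ terms.
TAXONOMY = {
    "product": ["residential proxies", "serp api", "web unlocker",
                "datacenter proxies", "scraping api"],
    "cohort": ["ai training data", "e-commerce price intelligence",
               "competitive intelligence", "seo monitoring", "ad verification"],
    "intent": ["alternative to", "switching from", "need recommendation",
               "looking for", "help with", "too expensive", "unreliable"],
}


def match_keywords(text: str, taxonomy: dict = TAXONOMY) -> dict:
    """Return the taxonomy terms found in `text`, grouped by dimension."""
    lowered = text.lower()
    return {
        dimension: [term for term in terms if term in lowered]
        for dimension, terms in taxonomy.items()
    }


def is_candidate(text: str) -> bool:
    """Hypothetical capture rule: require a product term AND an intent term."""
    hits = match_keywords(text)
    return bool(hits["product"]) and bool(hits["intent"])


post = "Looking for an alternative to our residential proxies, too expensive at scale"
hits = match_keywords(post)
```

Under this sketch, adding terms widens the signal surface and tightening `is_candidate` raises signal-to-noise, which are the two tuning moves described above.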

The classification framework

Four tiers based on expressed intent:

  • Tier 1 — Active intent (respond same day): explicitly seeking a solution, naming competitors, describing specific pain
  • Tier 2 — Research (respond next business day): evaluating options, comparing providers, building requirements
  • Tier 3 — Building (nurture): constructing infrastructure that will eventually need proxy/scraping tools
  • Tier 4 — Noise (drop): students, hobbyists, tangential mentions

The framework maps directly to response playbooks. Each tier gets a different velocity, channel, and messaging approach.
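One way to make the tier-to-playbook mapping explicit is a small lookup table. The sketch below encodes only the action and response velocity stated in the tier list; channel and messaging are left out because the text does not specify them. The `Playbook` type and `playbook_for` helper are illustrative names.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    ACTIVE_INTENT = 1
    RESEARCH = 2
    BUILDING = 3
    NOISE = 4


@dataclass(frozen=True)
class Playbook:
    action: str                              # "respond", "nurture", or "drop"
    respond_within_business_days: Optional[int]  # None: no direct response


PLAYBOOKS = {
    Tier.ACTIVE_INTENT: Playbook("respond", 0),  # same day
    Tier.RESEARCH: Playbook("respond", 1),       # next business day
    Tier.BUILDING: Playbook("nurture", None),
    Tier.NOISE: Playbook("drop", None),
}


def playbook_for(tier: int) -> Playbook:
    """Look up the playbook; raises ValueError for anything outside 1-4."""
    return PLAYBOOKS[Tier(tier)]
```

Rejecting out-of-range tiers at lookup time means a malformed classification fails loudly instead of silently skipping a response.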

The system

We built a 6-stage pipeline that processes signals from capture through to qualified leads ready for outbound.

  • 01 · Signal Capture (Deterministic): 4 parallel scrapers (Reddit, GitHub, HackerNews, Twitter), keyword-driven, every 4 hours
  • 02 · Signal Classification (AI · Claude Haiku): intent tier assignment (1–4), company name extraction, signal type classification
  • 03 · Lead Creation (Deterministic): auto-create lead records, deduplication, Slack alerts for Tier 1 signals
  • 04 · Company Enrichment (Scraping · LinkedIn): LinkedIn company pages via Thor Data Scraper API for employee count, industry, tech stack
  • 05 · Contact Enrichment (Scraping · LinkedIn): 3-tier API fallback for decision-maker profiles, Scraper API → Web Unlocker → SERP
  • 06 · Lead Qualification (Deterministic): scoring on signal tier × company fit × contact availability → priority queue

Stage types:

  • Deterministic: auditable, reproducible
  • AI-powered: classification and synthesis
  • Scraping: platform-native data extraction

Signal capture

Four parallel scrapers, each built for its platform’s native data structure. Reddit via JSON API with Web Unlocker fallback. GitHub via API with SERP fallback. HackerNews via Algolia API. Twitter via Serper site search. Scheduled every 4 hours via Supabase pg_cron.

Every scraper uses Thor Data’s own APIs. The Reddit scraper goes through Web Unlocker. The GitHub and Twitter scrapers use SERP API (via Serper, which itself runs on Thor Data’s SERP infrastructure). The LinkedIn enrichment scrapers use the Scraper API. The system generates real product usage telemetry while finding customers.
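As an illustration of one capture path, the sketch below queries the public HackerNews Algolia search endpoint for recent Ask HN stories and flattens the hits into signal records. The record shape (`external_id`, `matched_keyword`, and so on) is an assumption for illustration, not the system's actual schema, and the other three scrapers are not shown.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

HN_SEARCH_URL = "https://hn.algolia.com/api/v1/search_by_date"


def build_hn_query(keyword: str, since_epoch: int) -> str:
    """Build a search URL for Ask HN stories newer than `since_epoch` (Unix seconds)."""
    params = {
        "query": keyword,
        "tags": "ask_hn",
        "numericFilters": f"created_at_i>{since_epoch}",
    }
    return f"{HN_SEARCH_URL}?{urlencode(params)}"


def normalize_hits(payload: dict, keyword: str) -> list:
    """Map Algolia hits onto flat signal records (record shape is assumed)."""
    signals = []
    for hit in payload.get("hits", []):
        signals.append({
            "source": "hackernews",
            "external_id": hit["objectID"],
            "title": hit.get("title") or "",
            "body": hit.get("story_text") or "",
            "author": hit.get("author"),
            "created_at": hit.get("created_at"),
            "matched_keyword": keyword,
            "url": f"https://news.ycombinator.com/item?id={hit['objectID']}",
        })
    return signals


def fetch_hn_signals(keyword: str, since_epoch: int) -> list:
    """Perform the network call and return normalized signals."""
    with urlopen(build_hn_query(keyword, since_epoch), timeout=10) as resp:
        return normalize_hits(json.load(resp), keyword)
```

With a 4-hour schedule, each run would pass `since_epoch` as roughly the current time minus 4 × 3600 seconds, so consecutive runs cover the timeline without large overlaps.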

Classification and lead creation

Supabase edge functions triggered by database webhooks process each new signal. Claude 3 Haiku classifies the tier, extracts the company name, and assigns the signal type. When a company is identified, the system auto-creates or updates a lead record with deduplication. Tier 1 signals fire a Slack webhook immediately.
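The steps after the model call can be sketched without the model itself: validate the reply, upsert a deduplicated lead, and build a Slack alert for Tier 1. This assumes the classifier is prompted to reply with a JSON object containing `tier`, `company`, and `signal_type`; that schema, the name-normalization rule, and the in-memory `leads` dict (standing in for the database table) are all illustrative assumptions.

```python
import json
import re
from typing import Optional

VALID_TIERS = {1, 2, 3, 4}


def parse_classification(raw: str) -> dict:
    """Validate the model's JSON reply (assumed schema: tier, company, signal_type)."""
    data = json.loads(raw)
    tier = int(data["tier"])
    if tier not in VALID_TIERS:
        raise ValueError(f"tier out of range: {tier}")
    company = (data.get("company") or "").strip() or None
    return {"tier": tier, "company": company,
            "signal_type": data.get("signal_type", "unknown")}


def company_key(name: str) -> str:
    """Normalize a company name for deduplication."""
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())        # drop punctuation
    key = re.sub(r"\b(inc|llc|ltd|corp|co)\b", "", key)  # drop common suffixes
    return " ".join(key.split())


def upsert_lead(leads: dict, classification: dict, signal_id: str) -> Optional[dict]:
    """Create or update the lead for this company; skip signals with no company."""
    company = classification["company"]
    if company is None:
        return None
    lead = leads.setdefault(company_key(company), {
        "company": company, "best_tier": classification["tier"], "signal_ids": []})
    lead["best_tier"] = min(lead["best_tier"], classification["tier"])
    if signal_id not in lead["signal_ids"]:
        lead["signal_ids"].append(signal_id)
    return lead


def slack_alert(lead: dict, classification: dict) -> Optional[dict]:
    """Return a Slack incoming-webhook payload for Tier 1 signals, else None."""
    if classification["tier"] != 1:
        return None
    return {"text": f"Tier 1 signal: {lead['company']} ({classification['signal_type']})"}
```

Keeping the lowest tier number seen per company means a lead is prioritized by its strongest signal, even when later mentions are weaker.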

Enrichment

LinkedIn company page scraping via Thor Data’s Scraper API extracts firmographic data: employee count, industry, location, specialties. LinkedIn profile scraping identifies decision makers — CTOs, VPs of Engineering, Heads of Data — using a 3-tier API fallback for resilience.
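The 3-tier fallback can be pictured as an ordered chain of fetchers where the first usable result wins. The sketch below shows the pattern only: the three `via_*` functions are hypothetical stand-ins that simulate a timeout, an empty result, and a success, and do not call any real API.

```python
from typing import Callable, Optional

# A fetcher takes a profile URL and returns parsed data, or None on a miss.
Fetcher = Callable[[str], Optional[dict]]


def fetch_with_fallback(url: str, fetchers: list) -> tuple:
    """Try each (name, fetcher) in order; return (name, result) for the first success."""
    for name, fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            continue  # treat any error as a miss and fall through to the next tier
        if result:
            return name, result
    return None, None


# Hypothetical stand-ins for the three tiers.
def via_scraper_api(url: str) -> Optional[dict]:
    raise TimeoutError("simulated Scraper API timeout")


def via_web_unlocker(url: str) -> Optional[dict]:
    return None  # simulated block: nothing usable came back


def via_serp(url: str) -> Optional[dict]:
    return {"profile_url": url, "title": "CTO"}


CHAIN = [
    ("scraper_api", via_scraper_api),
    ("web_unlocker", via_web_unlocker),
    ("serp", via_serp),
]

tier_used, profile = fetch_with_fallback("https://www.linkedin.com/in/example", CHAIN)
```

Returning the name of the tier that succeeded makes it possible to track how often each fallback level is actually needed.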

The enrichment pipeline is where the self-demonstrating architecture pays off most visibly. Every LinkedIn page scraped to enrich a lead is a real product usage event. The scraper’s success rate, speed, and cost are production metrics that directly inform sales conversations about the product.

Qualification

Deterministic scoring across three dimensions: signal strength (0–40), company fit (0–35), and contact access (0–25). The formula is explicit and auditable — when a lead scores 91, you can trace exactly which factors contributed. High-scoring leads enter the priority outbound queue.
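A minimal sketch of an additive, capped score using the three ranges stated above. Clamping each dimension to its cap, the individual point values in the example, and the `PRIORITY_THRESHOLD` of 80 are assumptions for illustration; the text does not state the actual cutoff.

```python
# Dimension caps as stated: signal strength 0-40, company fit 0-35, contact access 0-25.
CAPS = {"signal_strength": 40, "company_fit": 35, "contact_access": 25}

PRIORITY_THRESHOLD = 80  # hypothetical cutoff for the priority outbound queue


def score_lead(points: dict) -> dict:
    """Sum the points awarded in each dimension, clamp to the cap, and total them."""
    breakdown = {
        dimension: min(sum(points.get(dimension, [])), cap)
        for dimension, cap in CAPS.items()
    }
    total = sum(breakdown.values())
    return {
        "total": total,
        "breakdown": breakdown,
        "priority": total >= PRIORITY_THRESHOLD,
    }


# Illustrative point awards, one list of contributing factors per dimension.
example = score_lead({
    "signal_strength": [20, 10, 5],
    "company_fit": [15, 10, 5],
    "contact_access": [15, 5],
})
```

Because the result carries the per-dimension breakdown alongside the total, any score can be traced back to the factors that produced it.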

Signal Intelligence Feed

Buying signals captured across 4 platforms, classified by intent tier

910 signals | 332 Tier 1 | 4 sources | $0.05/qualified lead

Source | Signal | Company | Tier | Intent | When
Reddit | Looking for Bright Data alternative; current costs unsustainable at scale | DataForge AI | Tier 1 | active intent | 2h ago
GitHub | Issue: Web Unlocker rate limiting on protected sites, need proxy rotation | ScaleML | Tier 1 | active pain | 3h ago
Twitter | Anyone have a Bright Data alternative? Their residential pool has been unreliable this week | CrawlBase | Tier 1 | active intent | 4h ago
Reddit | Switching from Oxylabs; need better success rates on e-commerce sites | PriceTrack | Tier 1 | active intent | 6h ago
GitHub | PR: Replace brightdata-sdk with generic proxy rotation layer | Moneta Analytics | Tier 2 | research | 8h ago
HN | Ask HN: What proxy infrastructure are you using for large-scale scraping? | - | Tier 2 | research | 10h ago
GitHub | Evaluating SERP API providers for competitive intelligence pipeline | InsightBridge | Tier 2 | research | 12h ago
GitHub | Building data collection pipeline; need reliable residential proxies | Nexus Data | Tier 2 | building | 14h ago
HN | How do you handle anti-bot detection at scale? Cloudflare is killing us | SpectrumIO | Tier 3 | active pain | 1d ago
GitHub | New repo: web-scraper-benchmark, comparing proxy providers | - | Tier 3 | research | 1d ago
GitHub | Training data collection for LLM fine-tuning; need web scraping infra | Meridian Labs | Tier 3 | building | 2d ago
Twitter | Exploring proxy options for a side project, any recommendations? | - | Tier 4 | research | 2d ago

DataForge AI

Artificial Intelligence

Employees: 85

Reddit · r/webscraping · 2 hours ago

We're scraping product data across 200+ e-commerce sites for our price intelligence platform. Bright Data costs are unsustainable at our volume — $6K/month and climbing. Need a residential proxy solution with comparable success rates on protected sites. Currently evaluating alternatives.

Tier 1 — active buying intent. Named competitor, specific volume/cost pain, evaluating alternatives now. Score: 91/100.

Signal Strength 35/40
  • Tier 1: active buying intent +20 pts
  • Named competitor (Bright Data) +10 pts
  • Specific use case described +5 pts
Company Fit 30/35
  • AI/ML company (target cohort) +15 pts
  • 85 employees (mid-market) +10 pts
  • Web scraping as core workflow +5 pts
Contact Access 26/30
  • CTO identified via LinkedIn +15 pts
  • VP Engineering identified +10 pts
  • Email verified via enrichment +1 pt

DataForge AI builds price intelligence tools for e-commerce brands, scraping product data across 200+ retail sites daily. The team is 85 people, Series A funded, based in Austin. Their data pipeline is core infrastructure — proxy reliability directly impacts product accuracy and customer SLAs.

  • James Chen — CTO (LinkedIn)
  • Sarah Okafor — VP Engineering (LinkedIn)

Posted in r/webscraping, a subreddit with 45K members focused on web scraping tools and infrastructure. The post received 12 replies, several recommending specific providers. The author described a specific use case (e-commerce price intelligence), a specific pain point ($6K/month cost), and is actively evaluating — all Tier 1 indicators.

  • Lead with cost comparison — they cited $6K/month on Bright Data. Thor Data's pricing at their volume would be roughly 40% lower.
  • Reference e-commerce scraping specifically — Thor Data's Web Unlocker has strong success rates on Shopify, Amazon, and major retail platforms.
  • The Reddit post mentions "evaluating alternatives" — they're in active buying mode. Response within 24 hours is critical.

What we chose not to build

Scope discipline matters as much as architecture:

  • No automated outbound. Month 1 was 100% manual review. Every signal-sourced lead was reviewed by a human before outreach. This was deliberate — you calibrate the system before you automate it.
  • No CRM integration. Supabase is the system of record during the validation phase. CRM sync is a Phase 2 concern, after signal quality and classification accuracy are proven.
  • No ML-based classification. Claude Haiku with structured prompts and explicit tier criteria. The model doesn’t learn from historical data — it applies rules. This makes the system inspectable: you can read the prompt, understand the criteria, and tune the classification by adjusting the prompt, not retraining a model.

Outcomes

  • 910 buying signals captured across 4 platforms in the first capture cycle
  • 332 Tier 1 high-intent prospects — a 36% quality rate
  • Signal distribution by source: GitHub 532 (118 Tier 1), Reddit 135 (133 Tier 1), Twitter 100 (49 Tier 1), HackerNews 143 (32 Tier 1)
  • Reddit signal quality: 99% of captured signals classified Tier 1 or 2 — the highest quality-to-volume ratio of any platform
  • Pipeline cost: under $0.05 per qualified lead
  • 200+ keyword taxonomy across product, cohort, and intent dimensions
  • System runs entirely on Thor Data’s own APIs — zero third-party scraping infrastructure required

Signal Intelligence Observability

Aggregate signal metrics, platform performance, and cost tracking across all sources

  • Total signals: 910
  • Tier 1 signals: 332
  • Quality rate: 36%
  • Cost per qualified lead: $0.05

Signal Distribution by Source

  • GitHub: 532
  • HackerNews: 143
  • Reddit: 135
  • Twitter: 100

Tier Distribution

  • Tier 1: 332
  • Tier 2: 287
  • Tier 3: 210
  • Tier 4: 81

Platform Performance

Source | Signals | Tier 1 | Tier 1 Rate | Avg Classification Time | Cost / Signal
Reddit | 135 | 133 | 99% | 1.2s | $0.001
GitHub | 532 | 118 | 22% | 1.1s | $0.001
Twitter | 100 | 49 | 49% | 1.3s | $0.002
HackerNews | 143 | 32 | 22% | 1.0s | $0.001

The 36% Tier 1 rate deserves emphasis. Cold list outbound converts at 0.5–1%. Intent data vendors deliver signals that every competitor also receives. This system found 332 prospects that were actively expressing a need on public platforms — prospects that no list vendor, no intent platform, and no competitor’s BDR team was systematically capturing.

Reddit’s 99% quality rate is the standout. The platform’s structure — detailed posts with context, use case descriptions, and explicit asks — produces signals that classify cleanly. GitHub produces higher volume but lower quality because many mentions are incidental rather than intentional.
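The headline rates follow directly from the per-source counts reported above. The short calculation below recomputes them, rounding to whole percentages; only the counts come from the text, and the helper name is illustrative.

```python
# (signals, tier_1) per source, as reported in the outcomes above.
COUNTS = {
    "GitHub": (532, 118),
    "Reddit": (135, 133),
    "Twitter": (100, 49),
    "HackerNews": (143, 32),
}


def tier1_rate(signals: int, tier1: int) -> int:
    """Tier 1 share of captured signals as a whole-number percentage."""
    return round(100 * tier1 / signals)


total_signals = sum(signals for signals, _ in COUNTS.values())
total_tier1 = sum(tier1 for _, tier1 in COUNTS.values())
overall_rate = tier1_rate(total_signals, total_tier1)
rates = {source: tier1_rate(s, t) for source, (s, t) in COUNTS.items()}
```

The per-source counts sum to the 910 total and 332 Tier 1 figures, and the overall rate is dominated by GitHub's volume, which is why it sits far below Reddit's.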

What’s next

The intelligence layer feeds the outbound layer. The next phase: persona-based sequencing across 48 segments (16 sophistication levels across 3 roles), AI-powered email personalization that references the specific signal that identified the prospect, and Smartlead integration for campaign delivery at scale.

The system that finds the prospects hands them to the system that engages them. The intelligence layer’s tier classification determines the outbound system’s velocity, channel, and messaging — a seamless handoff from signal to sequence.

For anyone reading this with a similar challenge: the architecture pattern — capture signals, classify intent, enrich companies, qualify leads — applies to any market where buyers express needs on public platforms. The platforms change. The keywords change. The classification criteria change. The pipeline doesn’t.

Systems Demonstrated

  • 02 · Lead Intelligence Layer: signal capture, intent classification, enrichment, and qualification, built on the platforms where buyers actually talk
