Lead Intelligence Layer

The problem this solved

Thor Data was entering the US market and needed a pipeline strategy that didn’t look like everyone else’s — buy a list, enrich it, hope. Their buyers weren’t in ZoomInfo. They were on Reddit asking which scraping tool to use, on GitHub filing issues against proxy libraries, on HackerNews comparing providers, on Twitter complaining about competitor outages. The intent was public. The standard outbound playbook couldn’t see it.

The Lead Intelligence Layer was the answer. It captures buying signals from the platforms where Thor Data’s prospects actually talk, classifies them by intent tier, enriches them with company and contact data, and qualifies them against ICP. Instead of cold-listing into the dark, Thor Data started responding to people who had just told the internet they were looking.

Notably, the system runs against Thor Data’s own APIs — the product that captures the signals is built on the same infrastructure being sold. Every SERP query, every LinkedIn scrape, every proxy request is live product usage. The lead-gen system is also the flagship case study.

36% of captured signals qualified as Tier 1 (332 of 910), roughly 36 to 72 times the 0.5–1% you’d expect from cold list outreach.

Architecture

The system is a 6-stage pipeline. One stage is AI-powered — signal classification, where an LLM assigns intent tiers and extracts company names. Two are scraping-based — leveraging platform-native data extraction to pull structured information from LinkedIn. Three are deterministic — pure business logic for signal capture, lead creation, and qualification scoring.

Signal Capture · Deterministic

4 parallel scrapers (Reddit, GitHub, HackerNews, Twitter), keyword-driven, every 4 hours

Signal Classification · AI (Claude Haiku)

Intent tier assignment (1–4), company name extraction, signal type classification

Lead Creation · Deterministic

Auto-create lead records, deduplication, Slack alerts for Tier 1 signals

Company Enrichment · Scraping (LinkedIn)

LinkedIn company pages via Thor Data Scraper API: employee count, industry, tech stack

Contact Enrichment · Scraping (LinkedIn)

3-tier API fallback for decision-maker profiles: Scraper API → Web Unlocker → SERP

Lead Qualification · Deterministic

Scoring on signal tier × company fit × contact availability → priority queue

Deterministic: auditable, reproducible
AI-powered: classification and synthesis
Scraping: platform-native data extraction

The design principle: capture where the signal lives, classify with AI, enrich through scraping, qualify with rules. AI handles judgment — is this person actively looking, or just curious? Business rules handle routing — Tier 1 gets immediate outreach, Tier 3 gets nurture.

The stages

01 — Signal Capture

Four parallel scrapers, each built for its platform’s native data structure:

Reddit — monitors r/webscraping, r/dataengineering, r/MachineLearning, and 12 other subreddits via Reddit’s JSON API, with Thor Data’s Web Unlocker as fallback. Captures full post content, comments, author, subreddit context. Reddit produces the highest signal quality: 99% of captured signals classify as Tier 1 or 2.

GitHub — searches issues, discussions, and repository descriptions via the GitHub API, with SERP API fallback for deeper discovery. Captures repo context, issue content, dependency information. GitHub produces the highest volume: 532 signals from a single capture cycle.

HackerNews — queries “Ask HN” threads and comments via the Algolia API. Free tier, no rate limit concerns. Captures discussion threads where engineers ask peers for tool recommendations.

Twitter — searches for complaint signals and alternative-seeking posts via Serper (which runs on Thor Data’s own SERP infrastructure). Captures the highest-urgency signals: someone publicly asking for an alternative is typically in active buying mode.

All four scrapers are keyword-driven, using a 200+ term taxonomy organized by product, cohort, and intent level. Scheduled every 4 hours via pg_cron. The keyword taxonomy is the system’s primary tuning lever — expanding it expands the signal surface.
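
A minimal sketch of how a taxonomy organized by product, cohort, and intent level could drive matching. The entries, field names, and the `matchKeywords` helper are illustrative assumptions, not the production taxonomy.

```javascript
// Hypothetical slice of a keyword taxonomy; the real one has 200+ terms.
const taxonomy = [
  { term: "bright data alternative", product: "proxies", cohort: "ecommerce", intent: "high" },
  { term: "residential proxies", product: "proxies", cohort: "general", intent: "medium" },
  { term: "serp api", product: "serp", cohort: "seo", intent: "medium" },
  { term: "web scraping", product: "scraper", cohort: "general", intent: "low" },
];

// Return every taxonomy entry whose term appears in the text (case-insensitive).
function matchKeywords(text, entries = taxonomy) {
  const haystack = text.toLowerCase();
  return entries.filter((entry) => haystack.includes(entry.term));
}

// Usage: one post can match several terms at different intent levels.
const hits = matchKeywords("Looking for a Bright Data alternative for web scraping at scale");
```

Adding a row to the taxonomy widens what every scraper can see, which is why it works as the tuning lever. For the schedule, a four-hour cadence corresponds to the standard cron expression `0 */4 * * *`, the format pg_cron accepts.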

02 — Signal Classification

Claude 3 Haiku via OpenRouter classifies each captured signal into one of four tiers:

Tier 1 — Active intent. Explicitly looking for a solution now. “Looking for a Bright Data alternative.” “Need residential proxies for e-commerce scraping.” These get same-day response.
Tier 2 — Research. Evaluating the space, comparing options, building requirements. “What proxy providers do you recommend?” “Bright Data vs Oxylabs?” These enter the priority queue.
Tier 3 — Building. Building infrastructure that will eventually need proxy/scraping tools. “Setting up a data collection pipeline.” “Need web scraping for training data.” These go to nurture.
Tier 4 — Noise. Students, hobbyists, tangential mentions. Dropped from the pipeline.

Classification also extracts the company name when identifiable from the signal content, assigns a signal type (active_intent, active_pain, research, building), and scores confidence. Low-confidence classifications get flagged rather than silently passed through.
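
One way to enforce that rule is to validate the model's reply before anything downstream sees it. The sketch below assumes the classifier returns JSON with `tier`, `signalType`, `company`, and `confidence` fields and uses a 0.7 review threshold. The field names and the threshold are assumptions, not the production schema.

```javascript
const SIGNAL_TYPES = ["active_intent", "active_pain", "research", "building"];
const CONFIDENCE_FLOOR = 0.7; // assumed review threshold, not from the source

// Parse the model's JSON reply and flag anything malformed or low-confidence.
function parseClassification(rawReply) {
  let parsed;
  try {
    parsed = JSON.parse(rawReply);
  } catch {
    return { ok: false, reason: "invalid_json" };
  }
  if (parsed === null || typeof parsed !== "object") {
    return { ok: false, reason: "invalid_shape" };
  }
  const { tier, signalType, company = null, confidence } = parsed;
  if (![1, 2, 3, 4].includes(tier)) return { ok: false, reason: "invalid_tier" };
  if (!SIGNAL_TYPES.includes(signalType)) return { ok: false, reason: "invalid_signal_type" };
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    return { ok: false, reason: "invalid_confidence" };
  }
  // Valid but uncertain results are kept and marked for human review.
  return { ok: true, tier, signalType, company, confidence, needsReview: confidence < CONFIDENCE_FLOOR };
}
```

Rejected replies carry a `reason`, so a malformed response is visible in logs instead of becoming a silent Tier 4.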

Cost: about $0.001 per signal. At 910 signals, total classification cost is roughly $1.

03 — Lead Creation

When classification identifies a company from a signal, the system auto-creates a lead record in Supabase. Deduplication runs against existing leads — if the same company was identified from a previous signal, the new signal is linked to the existing lead rather than creating a duplicate.
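
The deduplication step can be sketched as a normalized company key plus an upsert. The normalization rules, the `companyKey` and `upsertLead` names, and the in-memory `Map` standing in for the Supabase leads table are all assumptions for illustration.

```javascript
// Normalize a company name so "DataForge AI, Inc." and "dataforge ai" collide.
function companyKey(companyName) {
  return companyName
    .toLowerCase()
    .replace(/[.,]/g, "")
    .replace(/\s+/g, " ")
    .trim()
    .replace(/ (inc|llc|ltd|corp|gmbh)$/, "");
}

// In-memory stand-in for the leads table; a real build would query the database.
const leadsByKey = new Map();

// Link the signal to an existing lead when the company is already known,
// otherwise create a new lead record.
function upsertLead(companyName, signalId) {
  const key = companyKey(companyName);
  const existing = leadsByKey.get(key);
  if (existing) {
    existing.signalIds.push(signalId);
    return { lead: existing, created: false };
  }
  const lead = { company: companyName, signalIds: [signalId] };
  leadsByKey.set(key, lead);
  return { lead, created: true };
}
```

In PostgreSQL the same effect is usually achieved with a unique index on the normalized key and an `INSERT ... ON CONFLICT` statement, so that two signals arriving at once cannot create two leads.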

Tier 1 signals trigger a Slack webhook immediately. The sales team sees the signal source, the classification, and the company name within minutes of capture.
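
A sketch of the alert itself, assuming a standard Slack incoming webhook, which accepts a JSON body with a `text` field. The signal fields and the helper names are hypothetical.

```javascript
// Only Tier 1 signals should reach the sales channel.
function shouldAlert(signal) {
  return signal.tier === 1;
}

// Build the JSON body for a Slack incoming webhook.
function buildTier1Alert(signal) {
  return {
    text: [
      `:rotating_light: Tier 1 signal from ${signal.source}`,
      `Company: ${signal.company ?? "unknown"}`,
      `Type: ${signal.signalType}`,
      `Link: ${signal.url}`,
    ].join("\n"),
  };
}

// Sending is a plain POST of JSON.stringify(buildTier1Alert(signal)) to the
// webhook URL with a Content-Type of application/json.
```

Keeping the payload builder separate from the HTTP call makes the message format easy to test without touching Slack.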

04 — Company Enrichment

LinkedIn company page scraping via Thor Data’s Scraper API extracts: employee count, industry, headquarters location, description, specialties. Web Unlocker serves as fallback when the primary scraper encounters rate limits or anti-bot detection.

This stage matters because company fit is half the qualification score. A Tier 1 signal from a 3-person consultancy and a Tier 1 signal from an 85-person AI company require very different response strategies. The enrichment makes that distinction visible before a human touches the lead.

05 — Contact Enrichment

LinkedIn profile scraping identifies decision makers at the enriched company. Three-tier API fallback for resilience: Thor Data Scraper API (primary), Web Unlocker (secondary), SERP API (tertiary). Title matching filters for target personas — CTOs, VPs of Engineering, Head of Data, Infrastructure leads.
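
The three-tier fallback can be sketched as an ordered list of providers tried in sequence. The provider objects below are stubs; their names and the `fetchWithFallback` helper are assumptions, not Thor Data's client API.

```javascript
// Try each provider in order; return the first successful result with its source.
async function fetchWithFallback(url, providers) {
  const errors = [];
  for (const provider of providers) {
    try {
      const data = await provider.fetch(url);
      return { source: provider.name, data };
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`All providers failed for ${url} (${errors.join("; ")})`);
}

// Stub providers standing in for the three real API clients.
const profileProviders = [
  { name: "scraper_api", fetch: async () => { throw new Error("rate limited"); } },
  { name: "web_unlocker", fetch: async (url) => ({ url, title: "CTO" }) },
  { name: "serp_api", fetch: async (url) => ({ url, title: null }) },
];
```

Recording which provider answered is worth keeping in the result: a rising share of tertiary hits is an early sign that the primary path is degrading.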

The output: names, titles, and LinkedIn profile URLs for 1–3 decision makers at each qualified company. Contact email enrichment integrates with FullEnrich and Uplead for verified email addresses.
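
Title matching for the personas named above might look like the following. The patterns are assumptions derived from those titles and would need tuning against real profile data.

```javascript
// Assumed persona patterns: CTO, VP of Engineering, Head of Data, infrastructure leads.
const PERSONA_PATTERNS = [
  /\b(cto|chief technology officer)\b/i,
  /\b(vp|vice president)\b.*\bengineering\b/i,
  /\bhead of data\b/i,
  /\binfrastructure\b.*\b(lead|manager|director)\b/i,
];

function isTargetPersona(title) {
  return PERSONA_PATTERNS.some((pattern) => pattern.test(title));
}

// Keep at most `limit` matching contacts, preserving input order.
function pickDecisionMakers(profiles, limit = 3) {
  return profiles.filter((profile) => isTargetPersona(profile.title)).slice(0, limit);
}
```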

06 — Lead Qualification

Deterministic scoring based on three dimensions:

Signal strength (0–40 points) — tier classification, competitor mention, specificity of use case described
Company fit (0–35 points) — industry match against target cohorts, employee count in target range, web scraping as core workflow
Contact access (0–25 points) — decision maker identified, email verified, multiple contacts available

Total score determines priority queue position. Leads scoring 70+ enter immediate outbound. Leads scoring 40–69 enter priority nurture. Below 40, the lead is deprioritized.
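
The scoring and routing rules above fit in a few lines. The caps and thresholds come from the rubric; the function names, queue labels, and the clamping behavior are assumptions.

```javascript
// Dimension caps from the rubric: signal 0-40, company 0-35, contact 0-25.
const SCORE_CAPS = { signal: 40, company: 35, contact: 25 };

function clamp(points, max) {
  return Math.max(0, Math.min(points, max));
}

// Sum the three capped dimensions and map the total to a queue.
function qualify(points) {
  const total =
    clamp(points.signal, SCORE_CAPS.signal) +
    clamp(points.company, SCORE_CAPS.company) +
    clamp(points.contact, SCORE_CAPS.contact);
  let queue;
  if (total >= 70) queue = "immediate_outbound";
  else if (total >= 40) queue = "priority_nurture";
  else queue = "deprioritized";
  return { total, queue };
}
```

Because the function is pure, the same inputs always give the same queue, which is what makes the stage auditable.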

Design principles

Signal-first over list-first

The system doesn’t start with a list and enrich it. It starts with expressed intent and builds backward to the company and contact. This inverts the traditional prospecting funnel.

The result: 36% of captured signals qualify as Tier 1, compared to 0.5–1% conversion rates from cold list outreach. The difference isn’t incremental — it’s structural. When someone tells you they’re looking, responding is fundamentally different from guessing they might be.

Platform-native capture

Each scraper is built for its platform’s data structure, not a generic crawler. Reddit signals include the full post context, subreddit, comment thread, and author history. GitHub signals include the repository, issue content, dependency context, and contributor profile. This platform-specific capture is what makes downstream classification accurate — the AI has rich context, not a stripped-down text snippet.

Generic web crawlers miss this context. A “Bright Data alternative” mention in a Reddit post with 12 replies and a detailed use case description is a fundamentally different signal than the same phrase in a tweet with no context. The scraper captures the difference. The classifier uses it.
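
As an illustration, here is a sketch that maps one post from a Reddit listing response into a signal record. The input fields (`id`, `permalink`, `author`, `title`, `selftext`, `subreddit`, `num_comments`) are ones Reddit's JSON listings expose; the output shape, the `redditPostToSignal` name, and the sample post are assumptions.

```javascript
// Map one post from a Reddit listing (`data.children[i].data`) to a signal record.
function redditPostToSignal(post) {
  return {
    platform: "reddit",
    externalId: post.id,
    url: `https://www.reddit.com${post.permalink}`,
    author: post.author,
    // Title and body together give the classifier the full use case.
    text: [post.title, post.selftext].filter(Boolean).join("\n\n"),
    context: { subreddit: post.subreddit, replyCount: post.num_comments },
  };
}

// Hypothetical sample post.
const redditSignal = redditPostToSignal({
  id: "abc123",
  permalink: "/r/webscraping/comments/abc123/example/",
  author: "example_user",
  title: "Looking for a Bright Data alternative",
  selftext: "Costs are unsustainable at our volume.",
  subreddit: "webscraping",
  num_comments: 12,
});
```

The `context` object is where the platform-specific detail lives: reply count and subreddit travel with the text, so the classifier can weigh them.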

Tiered classification over binary qualification

Four tiers, not qualified/unqualified. The tier determines everything downstream — response velocity, channel, messaging, and escalation path.

Tier 1 gets same-day outreach with a personalized message referencing the specific signal. Tier 2 enters a priority queue for next-business-day response. Tier 3 enters automated nurture. Tier 4 is dropped. Each tier has a different expected conversion rate: Tier 1 at 15–30%, Tier 2 at 5–10%, Tier 3 at 2–5%.

This matters because treating all qualified leads the same wastes the urgency advantage on Tier 1 signals. When someone posts “I need this today,” responding three days later with a generic BDR sequence defeats the entire point.
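
The tier-to-response mapping is simple enough to express as a lookup table. The action labels and the `routeSignal` helper are assumptions; the response windows follow the tiers described above.

```javascript
// One policy per tier: what happens, and how fast.
const TIER_POLICY = {
  1: { action: "personalized_outreach", respondWithin: "same day" },
  2: { action: "priority_queue", respondWithin: "next business day" },
  3: { action: "automated_nurture", respondWithin: null },
  4: { action: "drop", respondWithin: null },
};

// Attach the tier's policy to a classified signal.
function routeSignal(signal) {
  const policy = TIER_POLICY[signal.tier];
  if (!policy) throw new Error(`Unknown tier: ${signal.tier}`);
  return { signalId: signal.id, ...policy };
}
```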

Self-demonstrating architecture

The system that generates leads for Thor Data runs on Thor Data’s own APIs. Every signal captured through Web Unlocker, every search executed through SERP API, every LinkedIn page scraped through the Scraper API is a live product usage metric.

This isn’t a marketing gimmick. It means the system’s uptime, accuracy, and cost metrics are real product telemetry. When we tell a prospect “Thor Data’s SERP API handles 235,000 queries with 99.9% uptime,” that’s because we ran those queries ourselves.

Tech approach

Key implementation choices for this build:

Claude 3 Haiku via OpenRouter for signal classification — the only AI stage. Chosen for speed and cost: $0.001 per classification, sub-2-second response time. Structured prompts with explicit tier criteria, not open-ended generation.
Thor Data APIs for all scraping — Web Unlocker for Reddit and LinkedIn fallback, Scraper API for LinkedIn company and profile pages, SERP API (via Serper) for GitHub and Twitter search. Total platform cost under $50/month at current volume.
Supabase (PostgreSQL) for persistence — signals, leads, and enrichment data stored in structured tables. Edge functions for signal processing triggered by database webhooks. pg_cron for scheduled scraper runs.
Node.js scrapers with circuit breakers (Opossum library) and async-retry for resilience. Pino for structured logging. Each scraper runs independently — a GitHub API outage doesn’t block Reddit capture.
Slack webhooks for Tier 1 alerting — the sales team sees high-intent signals within minutes of capture.

Full pipeline cost: under $0.05 per qualified lead.
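
The independence claim for the scrapers (a GitHub outage should not block Reddit capture) comes down to one breaker per scraper. The build uses the Opossum library; the sketch below is a hand-rolled illustration of the pattern, not Opossum's API, and every name and default in it is an assumption.

```javascript
// Minimal circuit breaker: opens after `maxFailures` consecutive errors,
// then rejects calls until `resetMs` has elapsed.
class CircuitBreaker {
  constructor(action, { maxFailures = 3, resetMs = 30_000, now = Date.now } = {}) {
    this.action = action;
    this.maxFailures = maxFailures;
    this.resetMs = resetMs;
    this.now = now;
    this.failures = 0;
    this.openedAt = null;
  }

  get isOpen() {
    return this.openedAt !== null && this.now() - this.openedAt < this.resetMs;
  }

  async fire(...args) {
    if (this.isOpen) throw new Error("circuit open");
    try {
      const result = await this.action(...args);
      // A success closes the circuit and clears the failure count.
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.maxFailures) this.openedAt = this.now();
      throw err;
    }
  }
}

// One breaker per scraper, so a failing source only trips its own circuit.
const breakers = {
  reddit: new CircuitBreaker(async () => "reddit ok"),
  github: new CircuitBreaker(async () => { throw new Error("GitHub API down"); }, { maxFailures: 2 }),
};
```

Once the GitHub breaker opens, further calls fail fast instead of waiting on timeouts, and the Reddit breaker is unaffected.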

Signal Intelligence Feed

Buying signals captured across 4 platforms, classified by intent tier

910 signals | 332 Tier 1 | 4 sources | $0.05/qualified lead

Source	Signal	Company	Tier	Intent	When
Reddit	Looking for Bright Data alternative — current costs unsustainable at scale	DataForge AI	Tier 1	active intent	2h ago
GitHub	Issue: Web Unlocker rate limiting on protected sites, need proxy rotation	ScaleML	Tier 1	active pain	3h ago
Twitter	Anyone have a Bright Data alternative? Their residential pool has been unreliable this week	CrawlBase	Tier 1	active intent	4h ago
Reddit	Switching from Oxylabs — need better success rates on e-commerce sites	PriceTrack	Tier 1	active intent	6h ago
GitHub	PR: Replace brightdata-sdk with generic proxy rotation layer	Moneta Analytics	Tier 2	research	8h ago
HN	Ask HN: What proxy infrastructure are you using for large-scale scraping?	—	Tier 2	research	10h ago
GitHub	Evaluating SERP API providers for competitive intelligence pipeline	InsightBridge	Tier 2	research	12h ago
GitHub	Building data collection pipeline — need reliable residential proxies	Nexus Data	Tier 2	building	14h ago
HN	How do you handle anti-bot detection at scale? Cloudflare is killing us	SpectrumIO	Tier 3	active pain	1d ago
GitHub	New repo: web-scraper-benchmark — comparing proxy providers	—	Tier 3	research	1d ago
GitHub	Training data collection for LLM fine-tuning — need web scraping infra	Meridian Labs	Tier 3	building	2d ago
Twitter	Exploring proxy options for a side project — any recommendations?	—	Tier 4	research	2d ago

DataForge AI

Artificial Intelligence

Employees: 85

Signal Source

Reddit · r/webscraping · 2 hours ago

We're scraping product data across 200+ e-commerce sites for our price intelligence platform. Bright Data costs are unsustainable at our volume — $6K/month and climbing. Need a residential proxy solution with comparable success rates on protected sites. Currently evaluating alternatives.

Classification

Tier 1 — active buying intent. Named competitor, specific volume/cost pain, evaluating alternatives now. Score: 91/100.

Qualification Score: 91/100

Signal Strength 35/40

✓ Tier 1: active buying intent +20 pts
✓ Named competitor (Bright Data) +10 pts
✓ Specific use case described +5 pts

Company Fit 30/35

✓ AI/ML company (target cohort) +15 pts
✓ 85 employees (mid-market) +10 pts
✓ Web scraping as core workflow +5 pts

Contact Access 26/30

✓ CTO identified via LinkedIn +15 pts
✓ VP Engineering identified +10 pts
✓ Email verified via enrichment +1 pt

Company Context

DataForge AI builds price intelligence tools for e-commerce brands, scraping product data across 200+ retail sites daily. The team is 85 people, Series A funded, based in Austin. Their data pipeline is core infrastructure — proxy reliability directly impacts product accuracy and customer SLAs.

Key Contacts

James Chen — CTO (LinkedIn)
Sarah Okafor — VP Engineering (LinkedIn)

Signal Context

Posted in r/webscraping, a subreddit with 45K members focused on web scraping tools and infrastructure. The post received 12 replies, several recommending specific providers. The author described a specific use case (e-commerce price intelligence), a specific pain point ($6K/month cost), and is actively evaluating — all Tier 1 indicators.

Recommended Response

Lead with cost comparison — they cited $6K/month on Bright Data. Thor Data's pricing at their volume would be roughly 40% lower.
Reference e-commerce scraping specifically — Thor Data's Web Unlocker has strong success rates on Shopify, Amazon, and major retail platforms.
The Reddit post mentions "evaluating alternatives" — they're in active buying mode. Response within 24 hours is critical.

Signal Intelligence Observability

Aggregate signal metrics, platform performance, and cost tracking across all sources

Total Signals 910 | Tier 1 Signals 332 | Quality Rate 36% | Cost / Qualified Lead $0.05

Signal Distribution by Source

GitHub: 532
HackerNews: 143
Reddit: 135
Twitter: 100

Tier Distribution

Tier 1: 332
Tier 2: 287
Tier 3: 210
Tier 4: 81

Platform Performance

Source	Signals	Tier 1	Tier 1 Rate	Avg Classification Time	Cost / Signal
Reddit	135	133	99%	1.2s	$0.001
GitHub	532	118	22%	1.1s	$0.001
Twitter	100	49	49%	1.3s	$0.002
HackerNews	143	32	22%	1.0s	$0.001
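
The rates in this table follow directly from the counts. A short sketch that recomputes them (the variable names are illustrative):

```javascript
// Signal and Tier 1 counts from the Platform Performance table.
const platformStats = [
  { source: "Reddit", signals: 135, tier1: 133 },
  { source: "GitHub", signals: 532, tier1: 118 },
  { source: "Twitter", signals: 100, tier1: 49 },
  { source: "HackerNews", signals: 143, tier1: 32 },
];

// Tier 1 rate per platform, rounded to a whole percent.
const tier1Rates = platformStats.map((row) => ({
  source: row.source,
  ratePct: Math.round((row.tier1 / row.signals) * 100),
}));

// Aggregate totals and the overall quality rate.
const totals = platformStats.reduce(
  (acc, row) => ({ signals: acc.signals + row.signals, tier1: acc.tier1 + row.tier1 }),
  { signals: 0, tier1: 0 }
);
const qualityRatePct = Math.round((totals.tier1 / totals.signals) * 100);
```

The totals come to 910 signals and 332 Tier 1, which rounds to the 36% quality rate reported above.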

This is one approach

This particular architecture was the right answer for Thor Data’s situation: a go-to-market where the buyers are publicly technical, the signals are scrapeable, and the product being sold is the same infrastructure the pipeline runs on. For a company whose buyers don’t express intent in public channels — or where the signal-to-noise ratio is wrong for classification — the same problem would get solved differently. Sometimes the fix is tighter integration between existing enrichment tools. Sometimes it’s better scoring on leads already flowing into the CRM. The diagnosis decides the shape.

Where an engagement starts

Not every engagement that ends in a system like this starts with “build me one.” Most start a level up.

Start with an audit. What’s actually producing leads today, what isn’t, and which platforms your buyers are actually using to ask questions about the category. Sometimes this surfaces that signal capture isn’t the bottleneck — the existing lead flow is fine, and the real gap is scoring or routing. The engagement ends there, and that’s a good outcome.

When the audit points at a signal-capture build, the engagement looks like this:

Signal landscape design — identify which platforms your buyers use to ask questions, evaluate tools, and complain about incumbents. Map high-intent language in your space. Estimate signal-to-noise ratio by platform.
Architecture scoped to your motion — pipeline stages tailored to your platforms, classification tiers mapped to your sales motions, enrichment sources selected for your market.
Staged build with checkpoints — each pipeline stage delivered and reviewed independently. You see working signal capture before enrichment is built. You validate classification accuracy before qualification logic runs.
Calibration against live signals — run the pipeline against real platform data. Tune classification thresholds. Validate enrichment quality. Adjust keyword taxonomy based on actual signal-to-noise ratios.
Handoff with documentation — the system is yours. Full code, architecture docs, keyword taxonomy, calibration playbook.

Ongoing calibration is available as needed — keyword expansion, new platform coverage, classification tuning as your market evolves.