Lead Intelligence Layer
Signal capture, intent classification, enrichment, and qualification — built on the platforms where buyers actually talk
The problem this solved
Thor Data was entering the US market and needed a pipeline strategy that didn’t look like everyone else’s — buy a list, enrich it, hope. Their buyers weren’t in ZoomInfo. They were on Reddit asking which scraping tool to use, on GitHub filing issues against proxy libraries, on HackerNews comparing providers, on Twitter complaining about competitor outages. The intent was public. The standard outbound playbook couldn’t see it.
The Lead Intelligence Layer was the answer. It captures buying signals from the platforms where Thor Data’s prospects actually talk, classifies them by intent tier, enriches them with company and contact data, and qualifies them against ICP. Instead of cold-listing into the dark, Thor Data started responding to people who had just told the internet they were looking.
Notably, the system runs against Thor Data’s own APIs — the product that captures the signals is built on the same infrastructure being sold. Every SERP query, every LinkedIn scrape, every proxy request is live product usage. The lead-gen system is also the flagship case study.
36% of captured signals qualified as Tier 1, roughly 36 to 72 times the 0.5–1% you’d expect from cold list outreach.
Architecture
The system is a 6-stage pipeline. One stage is AI-powered — signal classification, where an LLM assigns intent tiers and extracts company names. Two are scraping-based — leveraging platform-native data extraction to pull structured information from LinkedIn. Three are deterministic — pure business logic for signal capture, lead creation, and qualification scoring.
- 01 Signal Capture: 4 parallel scrapers (Reddit, GitHub, HackerNews, Twitter), keyword-driven, every 4 hours
- 02 Signal Classification: intent tier assignment (1–4), company name extraction, signal type classification
- 03 Lead Creation: auto-created lead records, deduplication, Slack alerts for Tier 1 signals
- 04 Company Enrichment: LinkedIn company pages via Thor Data Scraper API for employee count, industry, tech stack
- 05 Contact Enrichment: 3-tier API fallback for decision-maker profiles: Scraper API → Web Unlocker → SERP API
- 06 Lead Qualification: scoring on signal tier, company fit, and contact availability → priority queue
The design principle: capture where the signal lives, classify with AI, enrich through scraping, qualify with rules. AI handles judgment — is this person actively looking, or just curious? Business rules handle routing — Tier 1 gets immediate outreach, Tier 3 gets nurture.
The stages
01 — Signal Capture
Four parallel scrapers, each built for its platform’s native data structure:
Reddit — monitors r/webscraping, r/dataengineering, r/MachineLearning, and 12 other subreddits via Reddit’s JSON API, with Thor Data’s Web Unlocker as fallback. Captures full post content, comments, author, subreddit context. Reddit produces the highest signal quality: 99% of captured signals classify as Tier 1 or 2.
GitHub — searches issues, discussions, and repository descriptions via the GitHub API, with SERP API fallback for deeper discovery. Captures repo context, issue content, dependency information. GitHub produces the highest volume: 532 signals from a single capture cycle.
HackerNews — queries “Ask HN” threads and comments via the Algolia API. Free tier, no rate limit concerns. Captures discussion threads where engineers ask peers for tool recommendations.
Twitter — searches for complaint signals and alternative-seeking posts via Serper (which runs on Thor Data’s own SERP infrastructure). Captures the highest-urgency signals: someone publicly asking for an alternative is typically in active buying mode.
All four scrapers are keyword-driven, using a 200+ term taxonomy organized by product, cohort, and intent level. Scheduled every 4 hours via pg_cron. The keyword taxonomy is the system’s primary tuning lever — expanding it expands the signal surface.
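The keyword-driven matching described above can be sketched in a few lines. This is a minimal Python illustration, not the production Node.js scrapers: the three taxonomy entries, the field values, and the `match_keywords` helper are hypothetical stand-ins for the real 200+ term taxonomy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyword:
    term: str     # phrase to look for, stored lowercase
    product: str  # hypothetical product grouping
    cohort: str   # hypothetical buyer cohort
    intent: str   # hypothetical intent level label


# Illustrative entries only; the real taxonomy is far larger.
TAXONOMY = [
    Keyword("bright data alternative", "proxy", "scraping-teams", "high"),
    Keyword("residential proxies", "proxy", "e-commerce", "medium"),
    Keyword("serp api", "serp", "seo-tools", "medium"),
]


def match_keywords(text: str, taxonomy=TAXONOMY) -> list:
    """Return every taxonomy entry whose term appears in the text (case-insensitive)."""
    lowered = text.lower()
    return [kw for kw in taxonomy if kw.term in lowered]


post = "Looking for a Bright Data alternative, need residential proxies for e-commerce scraping"
hits = match_keywords(post)
print([kw.term for kw in hits])
# prints ['bright data alternative', 'residential proxies']
```

Adding a term to `TAXONOMY` widens what every scraper can match, which is why the taxonomy works as the main tuning lever.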
02 — Signal Classification
Claude 3 Haiku via OpenRouter classifies each captured signal into one of four tiers:
- Tier 1 — Active intent. Explicitly looking for a solution now. “Looking for a Bright Data alternative.” “Need residential proxies for e-commerce scraping.” These get same-day response.
- Tier 2 — Research. Evaluating the space, comparing options, building requirements. “What proxy providers do you recommend?” “Bright Data vs Oxylabs?” These enter the priority queue.
- Tier 3 — Building. Building infrastructure that will eventually need proxy/scraping tools. “Setting up a data collection pipeline.” “Need web scraping for training data.” These go to nurture.
- Tier 4 — Noise. Students, hobbyists, tangential mentions. Dropped from the pipeline.
Classification also extracts the company name when identifiable from the signal content, assigns a signal type (active_intent, active_pain, research, building), and scores confidence. Low-confidence classifications get flagged rather than silently passed through.
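One way to enforce "flagged rather than silently passed through" is to validate the model's reply before it reaches the lead table. The sketch below is Python and assumes the model is prompted to answer in JSON; the field names, the `parse_classification` helper, and the 0.6 confidence floor are assumptions, since the source states none of them.

```python
import json
from dataclasses import dataclass
from typing import Optional

SIGNAL_TYPES = {"active_intent", "active_pain", "research", "building"}
CONFIDENCE_FLOOR = 0.6  # assumed threshold; the source does not give one


@dataclass
class Classification:
    tier: int
    signal_type: str
    company: Optional[str]
    confidence: float
    needs_review: bool


def parse_classification(raw_reply: str) -> Classification:
    """Validate the model's JSON reply and flag low-confidence results for review."""
    data = json.loads(raw_reply)
    tier = int(data["tier"])
    if tier not in (1, 2, 3, 4):
        raise ValueError(f"tier out of range: {tier}")
    signal_type = data["signal_type"]
    if signal_type not in SIGNAL_TYPES:
        raise ValueError(f"unknown signal type: {signal_type}")
    confidence = float(data["confidence"])
    company = data.get("company") or None  # empty string and null both mean "not identified"
    return Classification(tier, signal_type, company, confidence,
                          needs_review=confidence < CONFIDENCE_FLOOR)


confident = parse_classification(
    '{"tier": 1, "signal_type": "active_intent", "company": "DataForge AI", "confidence": 0.92}'
)
uncertain = parse_classification(
    '{"tier": 2, "signal_type": "research", "company": null, "confidence": 0.41}'
)
print(confident.tier, confident.company, confident.needs_review)
print(uncertain.tier, uncertain.company, uncertain.needs_review)
```

Rejecting out-of-range tiers and unknown signal types outright keeps malformed model output from being stored as if it were a valid classification.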
Cost: $0.001 per signal. At 910 signals, total classification cost is about $0.91 (910 × $0.001).
03 — Lead Creation
When classification identifies a company from a signal, the system auto-creates a lead record in Supabase. Deduplication runs against existing leads — if the same company was identified from a previous signal, the new signal is linked to the existing lead rather than creating a duplicate.
Tier 1 signals trigger a Slack webhook immediately. The sales team sees the signal source, the classification, and the company name within minutes of capture.
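The link-instead-of-duplicate behavior can be sketched as an upsert keyed on a normalized company name. This is illustrative Python with an in-memory dictionary standing in for the Supabase table; `normalize_company`, `LeadStore`, and the suffix list are hypothetical, and the real matching rules are not described in the source. The alert helper only builds the JSON body that a Slack incoming webhook accepts (a `text` field) and does not send it.

```python
import re
from dataclasses import dataclass, field


def normalize_company(name: str) -> str:
    """Lowercase, drop punctuation and common legal suffixes so name variants share one key."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    cleaned = re.sub(r"\b(inc|llc|ltd|corp)\b", "", cleaned)
    return " ".join(cleaned.split())


@dataclass
class Lead:
    company: str
    signal_ids: list = field(default_factory=list)


class LeadStore:
    """In-memory stand-in for the leads table."""

    def __init__(self):
        self._by_key = {}

    def upsert(self, company: str, signal_id: str):
        """Link the signal to an existing lead or create a new one. Returns (lead, created)."""
        key = normalize_company(company)
        lead = self._by_key.get(key)
        created = lead is None
        if created:
            lead = Lead(company=company)
            self._by_key[key] = lead
        lead.signal_ids.append(signal_id)
        return lead, created


def slack_alert_payload(company: str, source: str, tier: int, excerpt: str) -> dict:
    """Build the JSON body for a Slack incoming webhook."""
    return {"text": f"Tier {tier} signal from {source}: {company}\n> {excerpt}"}


store = LeadStore()
lead, created = store.upsert("DataForge AI", "sig-001")
same_lead, created_again = store.upsert("DataForge AI, Inc.", "sig-002")
print(created, created_again, same_lead.signal_ids)
# prints True False ['sig-001', 'sig-002']
```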
04 — Company Enrichment
LinkedIn company page scraping via Thor Data’s Scraper API extracts: employee count, industry, headquarters location, description, specialties. Web Unlocker serves as fallback when the primary scraper encounters rate limits or anti-bot detection.
This stage matters because company fit is half the qualification score. A Tier 1 signal from a 3-person consultancy and a Tier 1 signal from an 85-person AI company require very different response strategies. The enrichment makes that distinction visible before a human touches the lead.
05 — Contact Enrichment
LinkedIn profile scraping identifies decision makers at the enriched company. Three-tier API fallback for resilience: Thor Data Scraper API (primary), Web Unlocker (secondary), SERP API (tertiary). Title matching filters for target personas — CTOs, VPs of Engineering, Head of Data, Infrastructure leads.
The output: names, titles, and LinkedIn profile URLs for 1–3 decision makers at each qualified company. Contact email enrichment integrates with FullEnrich and Uplead for verified email addresses.
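The three-tier fallback and the title filter can be sketched together. This is illustrative Python: the three provider functions are hypothetical stubs rather than real Thor Data client calls, the URL is a placeholder, and the title patterns are an assumed reading of the target personas. The patterns use word boundaries because a plain substring check for "cto" would also match "Director".

```python
import re

# Assumed patterns for the target personas named in the text.
TARGET_TITLE_PATTERNS = [
    re.compile(r"\bcto\b"),
    re.compile(r"\bchief technology officer\b"),
    re.compile(r"\bvp,? (of )?engineering\b"),
    re.compile(r"\bhead of data\b"),
    re.compile(r"\binfrastructure\b"),
]


def is_target_persona(title: str) -> bool:
    """True when the title matches any target persona pattern."""
    lowered = title.lower()
    return any(p.search(lowered) for p in TARGET_TITLE_PATTERNS)


def fetch_with_fallback(url: str, providers: list) -> tuple:
    """Try each (name, fetch) pair in order; return (name, result) from the first success."""
    errors = []
    for name, fetch in providers:
        try:
            return name, fetch(url)
        except Exception as exc:  # a real build would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stubs standing in for the three API clients.
def scraper_api(url):
    raise TimeoutError("rate limited")


def web_unlocker(url):
    return [
        {"name": "James Chen", "title": "CTO"},
        {"name": "Sarah Okafor", "title": "VP Engineering"},
        {"name": "Example Person", "title": "Director of Marketing"},
    ]


def serp_api(url):
    return []


providers = [("scraper_api", scraper_api), ("web_unlocker", web_unlocker), ("serp_api", serp_api)]
used, profiles = fetch_with_fallback("https://www.linkedin.com/company/example", providers)
decision_makers = [p for p in profiles if is_target_persona(p["title"])][:3]
print(used, [p["name"] for p in decision_makers])
# prints web_unlocker ['James Chen', 'Sarah Okafor']
```

Returning the provider name with the result keeps it visible how often the primary path fails and the fallbacks carry the load.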
06 — Lead Qualification
Deterministic scoring based on three dimensions:
- Signal strength (0–40 points) — tier classification, competitor mention, specificity of use case described
- Company fit (0–35 points) — industry match against target cohorts, employee count in target range, web scraping as core workflow
- Contact access (0–25 points) — decision maker identified, email verified, multiple contacts available
Total score determines priority queue position. Leads scoring 70+ enter immediate outbound. Leads scoring 40–69 enter priority nurture. Below 40, the lead is deprioritized.
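The scoring rules above translate directly into code. In this Python sketch the three caps (40, 35, 25) and the queue thresholds (70 and 40) come from the text; the per-criterion point values, the 20 to 500 employee range, and all names are assumptions chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass
class LeadFacts:
    tier: int
    competitor_mentioned: bool
    use_case_specific: bool
    industry_match: bool
    employees: int
    scraping_is_core: bool
    decision_maker_found: bool
    email_verified: bool
    contact_count: int


def signal_strength(f: LeadFacts) -> int:
    """0-40 points. Per-criterion weights are assumed."""
    points = {1: 25, 2: 15, 3: 8}.get(f.tier, 0)
    points += 8 if f.competitor_mentioned else 0
    points += 7 if f.use_case_specific else 0
    return min(points, 40)


def company_fit(f: LeadFacts) -> int:
    """0-35 points. The target employee range is assumed."""
    points = 15 if f.industry_match else 0
    points += 10 if 20 <= f.employees <= 500 else 0
    points += 10 if f.scraping_is_core else 0
    return min(points, 35)


def contact_access(f: LeadFacts) -> int:
    """0-25 points. Per-criterion weights are assumed."""
    points = 10 if f.decision_maker_found else 0
    points += 10 if f.email_verified else 0
    points += 5 if f.contact_count >= 2 else 0
    return min(points, 25)


def qualify(f: LeadFacts) -> tuple:
    """Sum the three dimensions and map the total to a queue using the stated thresholds."""
    score = signal_strength(f) + company_fit(f) + contact_access(f)
    if score >= 70:
        return score, "immediate_outbound"
    if score >= 40:
        return score, "priority_nurture"
    return score, "deprioritized"


strong = LeadFacts(1, True, True, True, 85, True, True, False, 2)
middle = LeadFacts(2, True, False, True, 85, False, True, False, 1)
weak = LeadFacts(3, False, True, False, 3, True, False, False, 0)
for facts in (strong, middle, weak):
    print(qualify(facts))
# prints (90, 'immediate_outbound'), (58, 'priority_nurture'), (25, 'deprioritized')
```

Because every input is a plain fact about the lead, the same lead always gets the same score, which is what makes this stage deterministic.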
Design principles
Signal-first over list-first
The system doesn’t start with a list and enrich it. It starts with expressed intent and builds backward to the company and contact. This inverts the traditional prospecting funnel.
The result: 36% of captured signals qualify as Tier 1, compared to 0.5–1% conversion rates from cold list outreach. The difference isn’t incremental — it’s structural. When someone tells you they’re looking, responding is fundamentally different from guessing they might be.
Platform-native capture
Each scraper is built for its platform’s data structure, not a generic crawler. Reddit signals include the full post context, subreddit, comment thread, and author history. GitHub signals include the repository, issue content, dependency context, and contributor profile. This platform-specific capture is what makes downstream classification accurate — the AI has rich context, not a stripped-down text snippet.
Generic web crawlers miss this context. A “Bright Data alternative” mention in a Reddit post with 12 replies and a detailed use case description is a fundamentally different signal than the same phrase in a tweet with no context. The scraper captures the difference. The classifier uses it.
Tiered classification over binary qualification
Four tiers, not qualified/unqualified. The tier determines everything downstream — response velocity, channel, messaging, and escalation path.
Tier 1 gets same-day outreach with a personalized message referencing the specific signal. Tier 2 enters a priority queue for next-business-day response. Tier 3 enters automated nurture. Tier 4 is dropped. Each tier has a different expected conversion rate: Tier 1 at 15–30%, Tier 2 at 5–10%, Tier 3 at 2–5%.
This matters because treating all qualified leads the same wastes the urgency advantage on Tier 1 signals. When someone posts “I need this today,” responding three days later with a generic BDR sequence defeats the entire point.
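Because the tier drives everything downstream, routing reduces to a lookup table. A minimal Python sketch, with the actions and response windows taken from the text and the dictionary keys and `route` helper invented for illustration:

```python
# Tier-to-handling table. Keys and action names are hypothetical labels.
ROUTES = {
    1: {"action": "personalized_outreach", "respond_within": "same day", "slack_alert": True},
    2: {"action": "priority_queue", "respond_within": "next business day", "slack_alert": False},
    3: {"action": "automated_nurture", "respond_within": None, "slack_alert": False},
    4: {"action": "drop", "respond_within": None, "slack_alert": False},
}


def route(tier: int) -> dict:
    """Look up downstream handling for a tier; an unknown tier is an error, not a silent default."""
    if tier not in ROUTES:
        raise ValueError(f"unknown tier: {tier}")
    return ROUTES[tier]


for tier in (1, 2, 3, 4):
    print(tier, route(tier)["action"], route(tier)["respond_within"])
```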
Self-demonstrating architecture
The system that generates leads for Thor Data runs on Thor Data’s own APIs. Every signal captured through Web Unlocker, every search executed through SERP API, every LinkedIn page scraped through the Scraper API is a live product usage metric.
This isn’t a marketing gimmick. It means the system’s uptime, accuracy, and cost metrics are real product telemetry. When we tell a prospect “Thor Data’s SERP API handles 235,000 queries with 99.9% uptime,” that’s because we ran those queries ourselves.
Tech approach
Key implementation choices for this build:
- Claude 3 Haiku via OpenRouter for signal classification — the only AI stage. Chosen for speed and cost: $0.001 per classification, sub-2-second response time. Structured prompts with explicit tier criteria, not open-ended generation.
- Thor Data APIs for all scraping — Web Unlocker for Reddit and LinkedIn fallback, Scraper API for LinkedIn company and profile pages, SERP API (via Serper) for GitHub and Twitter search. Total platform cost under $50/month at current volume.
- Supabase (PostgreSQL) for persistence — signals, leads, and enrichment data stored in structured tables. Edge functions for signal processing triggered by database webhooks. pg_cron for scheduled scraper runs.
- Node.js scrapers with circuit breakers (Opossum library) and async-retry for resilience. Pino for structured logging. Each scraper runs independently — a GitHub API outage doesn’t block Reddit capture.
- Slack webhooks for Tier 1 alerting — the sales team sees high-intent signals within minutes of capture.
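The isolation property, where one platform's outage does not block the others, is what the circuit breakers buy. The build uses the Opossum library in Node.js; the sketch below is a simplified Python stand-in written for illustration, not Opossum's API. It runs the scrapers sequentially for brevity, and the failure counts and stub scrapers are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows a trial call after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def run_cycle(scrapers: dict, breakers: dict) -> dict:
    """Run every scraper behind its own breaker so one failure never blocks the rest."""
    results = {}
    for name, scrape in scrapers.items():
        try:
            results[name] = breakers[name].call(scrape)
        except Exception as exc:
            results[name] = f"skipped: {exc}"
    return results


# Hypothetical stubs: GitHub is down, Reddit is healthy.
def github():
    raise ConnectionError("GitHub API outage")


def reddit():
    return ["signal-1", "signal-2"]


scrapers = {"github": github, "reddit": reddit}
breakers = {name: CircuitBreaker(max_failures=2) for name in scrapers}
for _ in range(3):
    outcome = run_cycle(scrapers, breakers)
print(outcome)
# prints {'github': 'skipped: circuit open', 'reddit': ['signal-1', 'signal-2']}
```

By the third cycle the GitHub breaker has opened and the failing call is no longer attempted, while Reddit capture continues untouched.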
Full pipeline cost: under $0.05 per qualified lead.
Signal Intelligence Feed
Buying signals captured across 4 platforms, classified by intent tier
| Source | Signal | Company | Tier | Intent | When |
|---|---|---|---|---|---|
| Reddit | Looking for Bright Data alternative — current costs unsustainable at scale | DataForge AI | Tier 1 | active intent | 2h ago |
| GitHub | Issue: Web Unlocker rate limiting on protected sites, need proxy rotation | ScaleML | Tier 1 | active pain | 3h ago |
|  | Anyone have a Bright Data alternative? Their residential pool has been unreliable this week | CrawlBase | Tier 1 | active intent | 4h ago |
|  | Switching from Oxylabs — need better success rates on e-commerce sites | PriceTrack | Tier 1 | active intent | 6h ago |
| GitHub | PR: Replace brightdata-sdk with generic proxy rotation layer | Moneta Analytics | Tier 2 | research | 8h ago |
| HN | Ask HN: What proxy infrastructure are you using for large-scale scraping? | — | Tier 2 | research | 10h ago |
| GitHub | Evaluating SERP API providers for competitive intelligence pipeline | InsightBridge | Tier 2 | research | 12h ago |
| GitHub | Building data collection pipeline — need reliable residential proxies | Nexus Data | Tier 2 | building | 14h ago |
| HN | How do you handle anti-bot detection at scale? Cloudflare is killing us | SpectrumIO | Tier 3 | active pain | 1d ago |
| GitHub | New repo: web-scraper-benchmark — comparing proxy providers | — | Tier 3 | research | 1d ago |
| GitHub | Training data collection for LLM fine-tuning — need web scraping infra | Meridian Labs | Tier 3 | building | 2d ago |
|  | Exploring proxy options for a side project — any recommendations? | — | Tier 4 | research | 2d ago |
Company Context
DataForge AI builds price intelligence tools for e-commerce brands, scraping product data across 200+ retail sites daily. The team is 85 people, Series A funded, based in Austin. Their data pipeline is core infrastructure — proxy reliability directly impacts product accuracy and customer SLAs.
Key Contacts
- James Chen — CTO (LinkedIn)
- Sarah Okafor — VP Engineering (LinkedIn)
Signal Context
Posted in r/webscraping, a subreddit with 45K members focused on web scraping tools and infrastructure. The post received 12 replies, several recommending specific providers. The author described a specific use case (e-commerce price intelligence), a specific pain point ($6K/month cost), and is actively evaluating — all Tier 1 indicators.
Recommended Response
- Lead with cost comparison — they cited $6K/month on Bright Data. Thor Data's pricing at their volume would be roughly 40% lower.
- Reference e-commerce scraping specifically — Thor Data's Web Unlocker has strong success rates on Shopify, Amazon, and major retail platforms.
- The Reddit post mentions "evaluating alternatives" — they're in active buying mode. Response within 24 hours is critical.
Signal Intelligence Observability
Aggregate signal metrics, platform performance, and cost tracking across all sources
Signal Distribution by Source
Tier Distribution
Platform Performance
| Source | Signals | Tier 1 | Tier 1 Rate | Avg Classification Time | Cost / Signal |
|---|---|---|---|---|---|
| Reddit | 135 | 133 | 99% | 1.2s | $0.001 |
| GitHub | 532 | 118 | 22% | 1.1s | $0.001 |
| Twitter | 100 | 49 | 49% | 1.3s | $0.002 |
| HackerNews | 143 | 32 | 22% | 1.0s | $0.001 |
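
The headline figures follow from the rows above. A short Python check, using only the signal and Tier 1 counts from the table and the $0.001 per-signal classification cost stated earlier (variable names are illustrative):

```python
# (signals, tier_1) per platform, in table order.
rows = [(135, 133), (532, 118), (100, 49), (143, 32)]

total_signals = sum(signals for signals, _ in rows)
total_tier1 = sum(tier1 for _, tier1 in rows)
overall_rate = round(100 * total_tier1 / total_signals)
per_row_rates = [round(100 * tier1 / signals) for signals, tier1 in rows]
classification_cost = round(total_signals * 0.001, 2)

print(total_signals, total_tier1, overall_rate)  # prints 910 332 36
print(per_row_rates)                             # prints [99, 22, 49, 22]
print(classification_cost)                       # prints 0.91
```

The 36% Tier 1 rate is therefore 332 of 910 signals across the four platforms.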
This is one approach
This particular architecture was the right answer for Thor Data’s situation: a go-to-market where the buyers are publicly technical, the signals are scrapeable, and the product being sold is the same infrastructure the pipeline runs on. For a company whose buyers don’t express intent in public channels — or where the signal-to-noise ratio is wrong for classification — the same problem would get solved differently. Sometimes the fix is tighter integration between existing enrichment tools. Sometimes it’s better scoring on leads already flowing into the CRM. The diagnosis decides the shape.
Where an engagement starts
Not every engagement that ends in a system like this starts with “build me one.” Most start a level up.
Start with an audit. What’s actually producing leads today, what isn’t, and which platforms your buyers are actually using to ask questions about the category. Sometimes this surfaces that signal capture isn’t the bottleneck — the existing lead flow is fine, and the real gap is scoring or routing. The engagement ends there, and that’s a good outcome.
When the audit points at a signal-capture build, the engagement looks like this:
- Signal landscape design — identify which platforms your buyers use to ask questions, evaluate tools, and complain about incumbents. Map high-intent language in your space. Estimate signal-to-noise ratio by platform.
- Architecture scoped to your motion — pipeline stages tailored to your platforms, classification tiers mapped to your sales motions, enrichment sources selected for your market.
- Staged build with checkpoints — each pipeline stage delivered and reviewed independently. You see working signal capture before enrichment is built. You validate classification accuracy before qualification logic runs.
- Calibration against live signals — run the pipeline against real platform data. Tune classification thresholds. Validate enrichment quality. Adjust keyword taxonomy based on actual signal-to-noise ratios.
- Handoff with documentation — the system is yours. Full code, architecture docs, keyword taxonomy, calibration playbook.
Ongoing calibration is available as needed — keyword expansion, new platform coverage, classification tuning as your market evolves.
Case study
Building a Signal-Driven Lead Intelligence System for Thor Data
A web infrastructure company entering the US market with proxy, SERP, and scraping APIs
Read the case study →
Want to see this built for your stack? Let's scope it.