
How to Use Claude Code to Scrape and Analyze Competitor Websites (Ethically)

March 18, 2026

Your competitor just dropped their prices. You didn't notice for three weeks. By the time you adjusted, you'd already lost a handful of deals to someone who looked more competitive on paper. Sound familiar? This is the tax every business pays for manual competitive intelligence — the slow, labor-intensive process of checking competitor websites by hand, copying pricing tables into spreadsheets, and hoping you catch the important changes before they hurt you.

Claude Code changes that equation entirely. With a few hundred lines of Python and the right architectural approach, you can build a system that monitors competitor websites continuously, extracts structured data automatically, flags meaningful changes, and synthesizes findings into actionable intelligence reports — all while staying firmly within the bounds of ethical, legal web data collection.


This guide walks you through building that system from scratch. We'll cover the technical implementation step by step, including the exact code patterns, configuration values, and Claude Code prompts that make each piece work. Whether you're a solo founder trying to stay ahead of three competitors or a growth team tracking an entire industry, the same framework applies.

What you'll build: A modular competitive intelligence pipeline that scrapes publicly available data, uses Claude to analyze and categorize it, stores changes over time, and produces structured reports on pricing, content strategy, and messaging shifts — automatically.

Estimated total build time: 4-6 hours across five steps. Prerequisites: Basic Python familiarity, a Claude API key (available at console.anthropic.com), and a working Python environment (3.10+).

Step 1: Understand the Legal and Ethical Boundaries

Before building any scraping tool, you must understand what you're legally and ethically permitted to collect. This isn't a disclaimer buried at the bottom — it's the foundation of the entire project. Getting this wrong doesn't just create legal exposure; it can get your IP address blocked, generate cease-and-desist letters, and damage your reputation if it becomes public that your "competitive intelligence" crossed lines it shouldn't have.

What's Generally Permitted

Publicly accessible information — data that any visitor can see in their browser without logging in — is generally fair game for automated collection. The landmark hiQ Labs v. LinkedIn case suggested that scraping publicly available data doesn't automatically violate the Computer Fraud and Abuse Act, though this remains a developing area of law. What you can typically collect: pricing pages, product descriptions, public blog posts, public job listings, and meta information like page titles and structured data.

What's Off-Limits

You should not collect: data behind login walls, data explicitly prohibited in a site's robots.txt file (which we'll check programmatically), personally identifiable information about individuals, or data at a volume or rate that constitutes a denial-of-service attack on the target server. Aggressive scraping that hammers a server with thousands of requests per minute is both unethical and potentially illegal under computer abuse statutes, regardless of what data you're collecting.

The robots.txt Checker — Your First Code Requirement

Your tool must check and respect robots.txt before collecting anything. This is non-negotiable. Here's the module you'll build first:


import urllib.robotparser
import urllib.parse
import time

def check_robots_permission(url: str, user_agent: str = "CompetitiveIntelligenceBot") -> bool:
    """Returns True if scraping this URL is permitted by robots.txt"""
    parsed = urllib.parse.urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        # If robots.txt is unreachable, default to conservative behavior
        return False

def get_crawl_delay(url: str, user_agent: str = "CompetitiveIntelligenceBot") -> float:
    """Returns the required crawl delay in seconds, minimum 2.0"""
    parsed = urllib.parse.urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    
    try:
        rp.read()
        delay = rp.crawl_delay(user_agent)
        return max(delay or 2.0, 2.0)  # Always wait at least 2 seconds
    except Exception:
        return 2.0
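Before pointing the checker at real sites, it helps to see how urllib.robotparser interprets rules. A self-contained sketch that parses an inline robots.txt instead of fetching one over the network (the rules shown are illustrative):

```python
import urllib.robotparser

# Illustrative robots.txt: one blanket rule set with a disallowed path and a crawl delay.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(SAMPLE_ROBOTS.splitlines())

print(rp.can_fetch("CompetitiveIntelligenceBot", "https://example.com/pricing"))      # True
print(rp.can_fetch("CompetitiveIntelligenceBot", "https://example.com/admin/panel"))  # False
print(rp.crawl_delay("CompetitiveIntelligenceBot"))                                   # 5
```

The same parser object backs both functions above, which is why a missing Crawl-delay directive returns None and falls through to the 2-second floor.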

Pro tip: Set your user agent string to something identifiable and include a contact email in your request headers. Something like CompetitiveResearchBot/1.0 (contact@yourcompany.com) is transparent about who you are and gives site owners a way to reach you if there's an issue. This is basic professional courtesy and reduces friction.

Common mistake to avoid: Don't skip robots.txt checking because "everyone does it." The fact that unethical scraping is common doesn't make it acceptable. Beyond the legal risk, your tool's longevity depends on not getting blocked — and sites that detect aggressive scraping will implement countermeasures that break your pipeline.

Estimated time for this step: 30 minutes to implement the robots checker and review your target competitor sites' robots.txt files manually.

Step 2: Set Up Your Claude Code Environment and Project Structure

A well-organized project structure will save you hours of debugging later and make the tool maintainable as your competitor list grows. Claude Code works best when you give it a clear, consistent file structure to work within — it can reference existing modules, understand your data schemas, and generate new components that fit naturally into what's already there.

Install Dependencies

Create a new project directory and set up a virtual environment:


mkdir competitor-intel
cd competitor-intel
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install anthropic requests beautifulsoup4 lxml playwright \
            python-dotenv pyyaml schedule rich pandas

Note that sqlite3 ships with Python's standard library, so it doesn't need installing, and pyyaml is required by the config loader you'll write in Step 5.

Install Playwright's browser binaries:


playwright install chromium

You'll use Playwright for JavaScript-heavy pages where requests alone won't render dynamic content. Many modern competitor pricing pages load their data via JavaScript, and a headless browser is the only way to get the actual rendered content.

Project File Structure


competitor-intel/
├── .env                    # API keys (never commit this)
├── config.yaml             # Competitor URLs and scraping settings
├── main.py                 # Entry point and scheduler
├── scraper/
│   ├── __init__.py
│   ├── robots_checker.py   # The module from Step 1
│   ├── static_scraper.py   # requests + BeautifulSoup
│   └── dynamic_scraper.py  # Playwright for JS pages
├── analyzer/
│   ├── __init__.py
│   ├── claude_client.py    # Claude API wrapper
│   ├── pricing_analyzer.py
│   ├── content_analyzer.py
│   └── change_detector.py
├── storage/
│   ├── __init__.py
│   └── database.py         # SQLite operations
├── reports/
│   ├── __init__.py
│   └── report_generator.py
└── data/
    └── competitor_intel.db  # Auto-created SQLite database

Configure Your .env File


ANTHROPIC_API_KEY=sk-ant-your-key-here
CLAUDE_MODEL=claude-opus-4-5
COMPANY_DOMAIN=yourcompany.com
REQUEST_DELAY_SECONDS=3
MAX_PAGES_PER_DOMAIN=20

COMPANY_DOMAIN feeds the contact address in the scraper's User-Agent header in Step 3.

Set Up the config.yaml

This is where you define your competitor list. Structure it by company, not by URL, so you can group related pages:


competitors:
  - name: "CompetitorA"
    domain: "https://competitor-a.com"
    pages:
      - url: "https://competitor-a.com/pricing"
        type: "pricing"
        js_required: false
      - url: "https://competitor-a.com/blog"
        type: "content"
        js_required: false
      - url: "https://competitor-a.com/features"
        type: "features"
        js_required: true
    
  - name: "CompetitorB"
    domain: "https://competitor-b.com"
    pages:
      - url: "https://competitor-b.com/plans"
        type: "pricing"
        js_required: true

Common mistake to avoid: Don't add every page on a competitor's site. Focus on high-signal pages: pricing, features, homepage, about, careers (for growth signals), and blog index. Adding too many pages creates noise and increases your API costs substantially.

Pro tip: Add a priority: high/medium/low field to each page entry. High-priority pages (like pricing) should be checked daily; low-priority pages (like about pages) weekly. This prevents unnecessary API calls and keeps your costs manageable.
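With that convention, a page entry might look like this (the priority field and its values are a suggested extension, not something the pipeline reads yet):

```yaml
competitors:
  - name: "CompetitorA"
    domain: "https://competitor-a.com"
    pages:
      - url: "https://competitor-a.com/pricing"
        type: "pricing"
        js_required: false
        priority: high    # check daily
      - url: "https://competitor-a.com/about"
        type: "content"
        js_required: false
        priority: low     # check weekly
```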

Estimated time for this step: 45 minutes including dependency installation and initial configuration.

Step 3: Build the Scraping Layer — Static and Dynamic Content

The scraping layer is your data collection foundation, and it needs to handle two fundamentally different types of web pages: those that deliver all their content in the initial HTML response, and those that require JavaScript execution to render their actual content. Getting this distinction right is what separates a scraper that works reliably from one that silently returns empty data half the time.

Static Scraper (requests + BeautifulSoup)

Build scraper/static_scraper.py:


import requests
from bs4 import BeautifulSoup
import time
import os
from dotenv import load_dotenv
from scraper.robots_checker import check_robots_permission, get_crawl_delay

load_dotenv()  # Ensure COMPANY_DOMAIN is set before HEADERS is built at import time

HEADERS = {
    "User-Agent": f"CompetitiveResearchBot/1.0 (contact@{os.getenv('COMPANY_DOMAIN', 'example.com')})",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

def scrape_static_page(url: str) -> dict:
    """
    Scrapes a static page. Returns structured content dict.
    Returns None if robots.txt disallows or request fails.
    """
    if not check_robots_permission(url):
        print(f"[BLOCKED] robots.txt disallows: {url}")
        return None
    
    delay = get_crawl_delay(url)
    time.sleep(delay)
    
    try:
        response = requests.get(url, headers=HEADERS, timeout=15)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, 'lxml')
        
        # Remove script and style tags — they're noise for text analysis
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        
        return {
            "url": url,
            "status_code": response.status_code,
            "title": soup.title.string if soup.title else "",
            "meta_description": get_meta_description(soup),
            "h1_tags": [h.get_text(strip=True) for h in soup.find_all('h1')],
            "h2_tags": [h.get_text(strip=True) for h in soup.find_all('h2')],
            "body_text": soup.get_text(separator='\n', strip=True)[:15000],  # Limit for API
            "structured_data": extract_structured_data(soup),
            "scraped_at": time.time()
        }
    
    except requests.RequestException as e:
        print(f"[ERROR] Failed to scrape {url}: {e}")
        return None

def get_meta_description(soup) -> str:
    meta = soup.find('meta', attrs={'name': 'description'})
    return meta['content'] if meta and meta.get('content') else ""

def extract_structured_data(soup) -> list:
    """Extracts JSON-LD structured data — often contains pricing info"""
    import json
    structured = []
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
            structured.append(data)
        except (json.JSONDecodeError, TypeError):
            pass
    return structured
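The JSON-LD that extract_structured_data pulls out often includes schema.org Offer markup with machine-readable prices. A standard-library sketch of what one of those blocks can look like and how a price falls out of it (the sample markup is invented for illustration):

```python
import json

# Invented JSON-LD payload, shaped like schema.org Product/Offer markup
# as it would appear inside a <script type="application/ld+json"> tag.
SAMPLE_JSON_LD = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Pro Plan",
  "offers": {"@type": "Offer", "price": "49.00", "priceCurrency": "USD"}
}
"""

data = json.loads(SAMPLE_JSON_LD)
offer = data.get("offers", {})
print(f'{data["name"]}: {offer.get("price")} {offer.get("priceCurrency")}')
# → Pro Plan: 49.00 USD
```

When a pricing page exposes this markup, it is usually more reliable than parsing the rendered price text, since it's the same data search engines consume.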

Dynamic Scraper (Playwright for JavaScript Pages)

Build scraper/dynamic_scraper.py:


from playwright.sync_api import sync_playwright
from scraper.robots_checker import check_robots_permission, get_crawl_delay
from bs4 import BeautifulSoup
import time

def scrape_dynamic_page(url: str, wait_selector: str = None) -> dict:
    """
    Uses headless Chromium to render JavaScript-heavy pages.
    wait_selector: CSS selector to wait for before extracting content.
    """
    if not check_robots_permission(url):
        print(f"[BLOCKED] robots.txt disallows: {url}")
        return None
    
    delay = get_crawl_delay(url)
    time.sleep(delay)
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="CompetitiveResearchBot/1.0",
            extra_http_headers={"Accept-Language": "en-US,en;q=0.9"}
        )
        page = context.new_page()
        
        try:
            page.goto(url, wait_until="networkidle", timeout=30000)
            
            if wait_selector:
                page.wait_for_selector(wait_selector, timeout=10000)
            
            # Additional wait for dynamic content
            page.wait_for_timeout(2000)
            
            html_content = page.content()
            soup = BeautifulSoup(html_content, 'lxml')
            
            for tag in soup(['script', 'style', 'nav', 'footer']):
                tag.decompose()
            
            return {
                "url": url,
                "title": page.title(),
                "body_text": soup.get_text(separator='\n', strip=True)[:15000],
                "h1_tags": [h.get_text(strip=True) for h in soup.find_all('h1')],
                "h2_tags": [h.get_text(strip=True) for h in soup.find_all('h2')],
                "scraped_at": time.time(),
                "rendered": True
            }
        
        except Exception as e:
            print(f"[ERROR] Playwright failed for {url}: {e}")
            return None
        
        finally:
            browser.close()

Pro tip: For pricing pages that load prices via API calls (which is increasingly common in SaaS), open your browser's Developer Tools, go to the Network tab, and look for XHR/fetch calls when the pricing loads. Sometimes you can call that API endpoint directly, which is far more reliable than scraping rendered HTML. Check whether that endpoint is documented or publicly accessible before using it.

Common mistake to avoid: Using wait_until="load" instead of wait_until="networkidle" in Playwright. The "load" event fires before many JavaScript frameworks have finished rendering their content. "networkidle" waits until there have been no network connections for at least 500ms — much more reliable for dynamic pages.

Estimated time for this step: 60-90 minutes including testing against your actual competitor URLs.

Step 4: Build the Claude Analysis Layer — Where Raw Data Becomes Intelligence

This is where the project transforms from a basic scraper into a genuine intelligence tool. Raw HTML text is nearly useless for decision-making. Claude's ability to read unstructured text and extract structured, categorized, actionable information is what makes this architecture valuable. A competitor's pricing page might contain paragraphs of marketing copy, feature comparison tables, FAQ sections, and footnotes — Claude can parse all of that and return a clean JSON object with the actual prices, plan names, and key differentiators.

Set Up the Claude Client

Build analyzer/claude_client.py:


import anthropic
import json
import os
from dotenv import load_dotenv

load_dotenv()

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
MODEL = os.getenv("CLAUDE_MODEL", "claude-opus-4-5")

def analyze_with_claude(prompt: str, content: str, system_prompt: str = None) -> str:
    """
    Sends content to Claude for analysis. Returns the response text.
    """
    messages = [
        {
            "role": "user",
            "content": f"{prompt}\n\n---CONTENT---\n{content}"
        }
    ]
    
    system = system_prompt or (
        "You are a competitive intelligence analyst. Extract structured, accurate information "
        "from web page content. Always respond with valid JSON unless instructed otherwise. "
        "If information is not present in the content, use null rather than guessing."
    )
    
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        system=system,
        messages=messages
    )
    
    return response.content[0].text

def parse_json_response(response_text: str) -> dict:
    """Safely parses JSON from Claude's response, handling markdown code blocks"""
    text = response_text.strip()
    if text.startswith("```json"):
        text = text[7:]
    if text.startswith("```"):
        text = text[3:]
    if text.endswith("```"):
        text = text[:-3]
    
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError as e:
        print(f"[WARNING] JSON parse failed: {e}")
        return {"raw_response": response_text, "parse_error": str(e)}
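Why the fence-stripping matters: even when instructed to return raw JSON, Claude sometimes wraps its answer in a markdown code block, and json.loads chokes on the backticks. A quick standalone check of the stripping logic, mirroring parse_json_response above:

```python
import json

# A response wrapped the way Claude often returns JSON, despite instructions.
wrapped = '```json\n{"plans": [{"name": "Pro", "monthly_price": 49.0}]}\n```'

text = wrapped.strip()
if text.startswith("```json"):
    text = text[7:]
if text.startswith("```"):
    text = text[3:]
if text.endswith("```"):
    text = text[:-3]

parsed = json.loads(text.strip())
print(parsed["plans"][0]["name"])  # → Pro
```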

Pricing Analyzer

Build analyzer/pricing_analyzer.py:


from analyzer.claude_client import analyze_with_claude, parse_json_response

PRICING_PROMPT = """
Analyze this pricing page content and extract all pricing information.

Return a JSON object with this exact structure:
{
  "plans": [
    {
      "name": "Plan name",
      "monthly_price": 0.00 or null,
      "annual_price": 0.00 or null,
      "annual_monthly_equivalent": 0.00 or null,
      "currency": "USD",
      "billing_period": "monthly|annual|one-time|usage-based",
      "user_limit": "number or 'unlimited' or null",
      "key_features": ["feature1", "feature2"],
      "highlighted_features": ["features emphasized in marketing copy"],
      "cta_text": "The call-to-action button text"
    }
  ],
  "free_trial": true/false,
  "free_tier_available": true/false,
  "enterprise_option": true/false,
  "pricing_model": "per-seat|flat-rate|usage-based|tiered|hybrid",
  "key_differentiators": ["What they emphasize vs competitors"],
  "recent_changes_mentioned": ["Any 'new', 'updated', 'launching soon' language found"],
  "confidence_score": 0-100
}
"""

def analyze_pricing_page(scraped_data: dict) -> dict:
    content = f"""
    Page Title: {scraped_data.get('title', '')}
    H1 Tags: {', '.join(scraped_data.get('h1_tags', []))}
    H2 Tags: {', '.join(scraped_data.get('h2_tags', []))}
    
    Body Text:
    {scraped_data.get('body_text', '')}
    """
    
    response = analyze_with_claude(PRICING_PROMPT, content)
    result = parse_json_response(response)
    result['source_url'] = scraped_data.get('url')
    result['scraped_at'] = scraped_data.get('scraped_at')
    return result

Content Strategy Analyzer

Build analyzer/content_analyzer.py. This one is particularly valuable for tracking how competitors are shifting their messaging — are they moving upmarket? Targeting a new persona? Pivoting their positioning?


from analyzer.claude_client import analyze_with_claude, parse_json_response

CONTENT_PROMPT = """
Analyze this blog/content page and extract strategic intelligence about the company's content approach.

Return JSON with this structure:
{
  "primary_topics": ["main topics covered"],
  "target_personas": ["inferred audience segments"],
  "content_stage": "awareness|consideration|decision",
  "seo_keywords_apparent": ["keywords they appear to be targeting"],
  "messaging_themes": ["core messages and value propositions"],
  "competitive_positioning": "how they position against alternatives",
  "pain_points_addressed": ["customer problems they highlight"],
  "tone": "technical|conversational|formal|casual|urgent",
  "calls_to_action": ["CTAs found on page"],
  "publication_recency": "recent|moderate|dated based on content references",
  "strategic_signals": ["anything suggesting strategic direction changes"]
}
"""

def analyze_content_page(scraped_data: dict) -> dict:
    content = f"""
    Title: {scraped_data.get('title', '')}
    Meta Description: {scraped_data.get('meta_description', '')}
    H1: {', '.join(scraped_data.get('h1_tags', []))}
    H2s: {', '.join(scraped_data.get('h2_tags', []))}
    Body: {scraped_data.get('body_text', '')[:8000]}
    """
    
    response = analyze_with_claude(CONTENT_PROMPT, content)
    return parse_json_response(response)

Pro tip: The strategic_signals field in the content analyzer is often where the most valuable intelligence surfaces. When a competitor's blog starts publishing heavily about a new topic — say, AI features, enterprise compliance, or a specific industry vertical — that's often a leading indicator of a product roadmap or marketing pivot, months before a formal announcement.

Common mistake to avoid: Sending the entire raw page body to Claude without trimming it first. Most pricing pages have extensive boilerplate, repeated navigation text, and footer content that adds tokens without adding analytical value. The 15,000-character limit in the scraper module and the 8,000-character limit in the content analyzer are intentional — they keep your API costs reasonable and actually improve response quality by forcing Claude to work with signal-dense content.

Estimated time for this step: 60-75 minutes to build and test both analyzers.

Step 5: Build Change Detection, Storage, and Automated Reporting

A competitive intelligence tool that requires you to manually compare this week's data to last week's is only marginally better than doing it by hand. The real value comes from automated change detection — a system that knows what the baseline looked like, identifies meaningful differences, and surfaces only what matters rather than overwhelming you with raw data dumps.

Database Setup

Build storage/database.py using SQLite for simplicity (upgrade to PostgreSQL for team deployments):


import sqlite3
import json
import time
import os

DB_PATH = os.path.join("data", "competitor_intel.db")

def init_database():
    os.makedirs("data", exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    
    cursor.executescript("""
        CREATE TABLE IF NOT EXISTS scrape_results (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            competitor_name TEXT NOT NULL,
            url TEXT NOT NULL,
            page_type TEXT NOT NULL,
            raw_content TEXT,
            analyzed_data TEXT,
            scraped_at REAL,
            created_at REAL DEFAULT (unixepoch())
        );
        
        CREATE TABLE IF NOT EXISTS changes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            competitor_name TEXT NOT NULL,
            page_type TEXT NOT NULL,
            change_type TEXT NOT NULL,
            change_summary TEXT,
            old_value TEXT,
            new_value TEXT,
            severity TEXT DEFAULT 'medium',
            detected_at REAL DEFAULT (unixepoch())
        );
        
        CREATE INDEX IF NOT EXISTS idx_scrape_url ON scrape_results(url, scraped_at);
        CREATE INDEX IF NOT EXISTS idx_changes_competitor ON changes(competitor_name, detected_at);
    """)
    
    conn.commit()
    conn.close()

def save_scrape_result(competitor_name: str, url: str, page_type: str, 
                       raw_content: dict, analyzed_data: dict):
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO scrape_results (competitor_name, url, page_type, raw_content, analyzed_data, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?)
    """, (competitor_name, url, page_type, 
          json.dumps(raw_content), json.dumps(analyzed_data), time.time()))
    conn.commit()
    conn.close()

def get_previous_analysis(url: str) -> dict:
    """Returns the most recent stored analysis for this URL.

    main.py calls this BEFORE saving the current scrape, so the newest
    row in the table is the previous run's snapshot.
    """
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.execute("""
        SELECT analyzed_data FROM scrape_results 
        WHERE url = ? 
        ORDER BY scraped_at DESC 
        LIMIT 1
    """, (url,))
    row = cursor.fetchone()
    conn.close()
    return json.loads(row[0]) if row else None
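One gap worth closing while you're in this file: nothing in the pipeline as written actually inserts rows into the changes table that the report generator reads. A minimal helper you could add to storage/database.py (a sketch — it assumes the changes schema from init_database and the JSON shape the change detector returns):

```python
import json
import sqlite3
import os

DB_PATH = os.path.join("data", "competitor_intel.db")

def save_changes(competitor_name: str, page_type: str, detection_result: dict) -> int:
    """Persists each detected change so generate_weekly_report() has rows to read.

    Returns the number of rows inserted. Call it from main.py right after
    detect_changes() when changes_detected is true.
    """
    changes = detection_result.get("changes", [])
    conn = sqlite3.connect(DB_PATH)
    for change in changes:
        conn.execute(
            """
            INSERT INTO changes (competitor_name, page_type, change_type,
                                 change_summary, old_value, new_value, severity)
            VALUES (?, ?, ?, ?, ?, ?, ?)
            """,
            (
                competitor_name,
                page_type,
                change.get("type", "unknown"),
                change.get("summary", ""),
                json.dumps(change.get("old_value")),
                json.dumps(change.get("new_value")),
                change.get("severity", "medium"),
            ),
        )
    conn.commit()
    conn.close()
    return len(changes)
```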

Change Detector

Build analyzer/change_detector.py. This is where Claude does its most impressive work — comparing two snapshots of analyzed data and producing a human-readable, prioritized summary of what changed:


from analyzer.claude_client import analyze_with_claude, parse_json_response
import json

CHANGE_DETECTION_PROMPT = """
Compare these two competitive intelligence snapshots (previous vs current) and identify meaningful changes.

Ignore trivial differences like minor wording changes. Focus on strategically significant changes.

Return JSON:
{
  "changes_detected": true/false,
  "change_count": 0,
  "changes": [
    {
      "type": "pricing_change|feature_addition|feature_removal|messaging_shift|positioning_change|new_cta|plan_restructure",
      "severity": "high|medium|low",
      "summary": "One sentence description of the change",
      "strategic_implication": "What this likely means for their strategy",
      "recommended_action": "What you should consider doing in response",
      "old_value": "previous state",
      "new_value": "current state"
    }
  ],
  "overall_assessment": "Executive summary of competitive landscape shift"
}
"""

def detect_changes(competitor_name: str, page_type: str, 
                   previous_data: dict, current_data: dict) -> dict:
    if not previous_data:
        return {"changes_detected": False, "message": "No baseline to compare against"}
    
    content = f"""
PREVIOUS SNAPSHOT:
{json.dumps(previous_data, indent=2)[:6000]}

CURRENT SNAPSHOT:
{json.dumps(current_data, indent=2)[:6000]}

Competitor: {competitor_name}
Page Type: {page_type}
"""
    
    response = analyze_with_claude(CHANGE_DETECTION_PROMPT, content)
    result = parse_json_response(response)
    result['competitor_name'] = competitor_name
    result['page_type'] = page_type
    return result

Report Generator

Build reports/report_generator.py to produce a clean weekly digest:


import sqlite3
import json
from analyzer.claude_client import analyze_with_claude
import time
from datetime import datetime

DB_PATH = "data/competitor_intel.db"

def generate_weekly_report() -> str:
    conn = sqlite3.connect(DB_PATH)
    
    # Get all changes from the past 7 days
    week_ago = time.time() - (7 * 24 * 3600)
    cursor = conn.execute("""
        SELECT competitor_name, page_type, change_type, change_summary, 
               severity, old_value, new_value, detected_at
        FROM changes 
        WHERE detected_at > ?
        ORDER BY severity DESC, detected_at DESC
    """, (week_ago,))
    
    changes = cursor.fetchall()
    conn.close()
    
    if not changes:
        return "No significant changes detected in the past 7 days."
    
    changes_text = "\n".join([
        f"[{c[4].upper()}] {c[0]} - {c[2]}: {c[3]}"
        for c in changes
    ])
    
    synthesis_prompt = """
Synthesize these competitive intelligence findings into a concise executive briefing.
Structure it as:
1. Top 3 most urgent findings requiring immediate action
2. Strategic trends observed across competitors
3. Recommended actions for this week

Write in clear, direct business language. No bullet-point padding.
"""
    
    report = analyze_with_claude(synthesis_prompt, changes_text, 
                                  system_prompt="You are a senior competitive strategist writing an executive briefing.")
    
    timestamp = datetime.now().strftime("%m/%d/%Y")
    return f"COMPETITIVE INTELLIGENCE BRIEF — Week of {timestamp}\n\n{report}"

Wire It All Together in main.py


import yaml
import schedule
import time
from storage.database import init_database, save_scrape_result, get_previous_analysis
from scraper.static_scraper import scrape_static_page
from scraper.dynamic_scraper import scrape_dynamic_page
from analyzer.pricing_analyzer import analyze_pricing_page
from analyzer.content_analyzer import analyze_content_page
from analyzer.change_detector import detect_changes
from reports.report_generator import generate_weekly_report

def load_config():
    with open("config.yaml") as f:
        return yaml.safe_load(f)

def run_intelligence_cycle():
    config = load_config()
    
    for competitor in config['competitors']:
        name = competitor['name']
        print(f"\n[INFO] Processing {name}...")
        
        for page_config in competitor['pages']:
            url = page_config['url']
            page_type = page_config['type']
            js_required = page_config.get('js_required', False)
            
            # Scrape
            if js_required:
                raw_data = scrape_dynamic_page(url)
            else:
                raw_data = scrape_static_page(url)
            
            if not raw_data:
                print(f"[SKIP] Failed to scrape: {url}")
                continue
            
            # Analyze
            if page_type == 'pricing':
                analyzed = analyze_pricing_page(raw_data)
            else:
                analyzed = analyze_content_page(raw_data)
            
            # Get previous for comparison
            previous = get_previous_analysis(url)
            
            # Save current
            save_scrape_result(name, url, page_type, raw_data, analyzed)
            
            # Detect changes
            if previous:
                changes = detect_changes(name, page_type, previous, analyzed)
                if changes.get('changes_detected'):
                    print(f"[ALERT] Changes detected for {name} - {page_type}!")
                    # Here: persist rows to the changes table, send Slack/email alerts, etc.
            
            time.sleep(2)  # Polite delay between pages

def main():
    init_database()
    
    # Run immediately on startup
    run_intelligence_cycle()
    
    # Schedule ongoing runs
    schedule.every().day.at("07:00").do(run_intelligence_cycle)
    schedule.every().monday.at("08:00").do(lambda: print(generate_weekly_report()))
    
    while True:
        schedule.run_pending()
        time.sleep(60)

if __name__ == "__main__":
    main()

Pro tip: Add Slack webhook notifications for high-severity changes. When a competitor drops their price or adds a major feature, you want to know within hours, not at your Monday morning briefing. The requests library makes this a five-line addition to the change detection logic.
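A sketch of that addition, kept to the standard library (the webhook URL is a placeholder you'd generate in Slack's incoming-webhooks settings, and the payload is the simple text format those webhooks accept):

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def format_change_alert(competitor: str, change: dict) -> dict:
    """Builds a Slack incoming-webhook payload for one detected change."""
    severity = change.get("severity", "medium").upper()
    return {
        "text": (
            f":rotating_light: [{severity}] {competitor}: "
            f"{change.get('summary', '')}"
        )
    }

def send_slack_alert(payload: dict) -> None:
    """POSTs the payload to the webhook. Call only for severity == 'high'."""
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

payload = format_change_alert(
    "CompetitorA",
    {"type": "pricing_change", "severity": "high",
     "summary": "Pro plan dropped from $49 to $39."},
)
print(payload["text"])
```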

Common mistake to avoid: Running the full intelligence cycle too frequently. Daily is appropriate for pricing pages; weekly is usually sufficient for content and feature pages. Overly frequent scraping increases your API costs, strains competitor servers, and produces noise from minor, irrelevant changes.

Estimated time for this step: 90 minutes to build and 30 minutes to run initial tests.

Going Deeper: Advanced Use Cases and Extending the Framework

Once your core pipeline is running, you have a foundation that can be extended in several high-value directions without rebuilding from scratch. The modular architecture you've built is specifically designed to accommodate new data sources, new analysis types, and new output formats by adding new modules rather than modifying existing ones.

Tracking Ad Creative and Landing Page Changes

One of the most valuable competitive signals — and one that most teams miss entirely — is monitoring what competitors are doing with their paid advertising. While you can't directly access their ad accounts, you can monitor their landing pages, which often reveal their current messaging priorities, offers, and conversion strategies. Add landing pages as type: "landing_page" entries in your config.yaml, and extend the content analyzer with a landing-page-specific prompt that looks for: urgency language, specific offer terms, social proof claims, and A/B testing signals (like multiple variants of a page being accessible via different URL parameters).

For more sophisticated ad intelligence, tools like the Meta Ad Library and the Google Ads Transparency Center provide public access to active ad creative. With Claude Code, you can build scrapers for these platforms (they have their own APIs and public-facing data) and feed that intelligence back into your analysis pipeline. Meta's Ad Library is particularly valuable for B2C competitors — you can see exactly what creative they're running, for how long, and across which regions.

Job Listing Intelligence

Competitor job postings are an underutilized intelligence source. A competitor suddenly posting five senior AI/ML engineer roles tells you something about their product roadmap that no press release will. A cluster of sales hiring in a specific geographic region suggests market expansion. Build a job listing scraper that monitors their careers page (usually a simple static page) and uses Claude to extract: role categories, seniority distribution, required technology stack, and location patterns. This kind of intelligence is consistently available, ethically unambiguous, and highly predictive.
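A job-listings extraction prompt can follow the same JSON-contract style as the pricing analyzer's prompt. The field names below are illustrative, chosen to match the signals described above:

```python
# Sketch of a careers-page extraction prompt. Field names are illustrative —
# adapt them to whatever signals matter for your market.
JOBS_PROMPT = """
Analyze this careers page content and extract hiring intelligence.

Return JSON with this structure:
{
  "open_roles": [
    {
      "title": "Role title",
      "category": "engineering|sales|marketing|ops|other",
      "seniority": "junior|mid|senior|executive",
      "location": "City, region, or 'remote'"
    }
  ],
  "role_category_counts": {"engineering": 0, "sales": 0},
  "technologies_mentioned": ["stack items named in job descriptions"],
  "expansion_signals": ["new offices, new regions, new team names"]
}
If information is not present, use null rather than guessing.
"""
```

Feed this prompt through the same analyze_with_claude / parse_json_response path as the other analyzers, and the output slots straight into your existing storage and change detection.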

Integrating with Your Existing Reporting Stack

The SQLite database can be queried directly by tools like Retool, Metabase, or even Google Sheets (via a Python script that exports to CSV). For teams already using data visualization platforms, adding a "competitive intelligence" dashboard fed by this pipeline takes a few hours and creates genuinely useful visibility that most teams don't have.

If you're interested in taking your Claude Code skills significantly further — building more complex pipelines, integrating with APIs, and creating production-ready tools — Adventure Media's hands-on Claude Code workshop covers exactly this kind of real-world project development in a single day. The workshop is built for people who already understand the basics and want to move into building actual tools, which is precisely where this competitive intelligence project sits.

Frequently Asked Questions

Is it legal to scrape competitor websites?

Scraping publicly available information is generally legal in the US, but the law is not fully settled. The Computer Fraud and Abuse Act (CFAA) has been interpreted in various ways, and the hiQ v. LinkedIn case provided some protection for public data scraping. However, you should always: respect robots.txt, avoid logging into competitors' systems, not scrape at volumes that constitute a denial-of-service attack, and review each target site's Terms of Service. Consult a technology attorney if you're doing this at scale or in a regulated industry. This guide is not legal advice.

How much will this cost to run in API fees?

For a typical setup monitoring 5 competitors with 3-4 pages each, daily runs will cost roughly $5-15 per month in Claude API fees — depending on page length and your model choice. Using claude-haiku for initial extraction and claude-opus only for change synthesis and report generation can reduce costs significantly. Monitor your usage in the Anthropic console and set billing alerts.

What if a competitor's pricing page uses a paywall or login?

Don't attempt to access any content behind authentication. If a pricing page requires login, that's a clear signal that the company considers it non-public. In these cases, use alternative sources: industry databases, G2/Capterra reviews (which often include pricing), press releases, or sales intelligence platforms like ZoomInfo. You can feed manually collected data into your Claude analysis pipeline just as easily as scraped data.

How do I handle false positives in change detection?

The most common source of false positives is dynamic content like "featured this week" sections, date-stamped content, and personalized elements. Mitigate this by: adding CSS selectors to specifically target the content sections you care about rather than grabbing full page text, running the scraper multiple times on the same day before flagging a change as confirmed, and tuning your Claude prompt to explicitly ignore "trivial wording changes" and time-sensitive content.
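The CSS-selector approach can be a small helper layered on top of the scraper — here's a minimal sketch using BeautifulSoup (the selector strings are hypothetical; inspect your target pages to find the right ones):

```python
from bs4 import BeautifulSoup

def extract_target_sections(html: str, selectors: list[str]) -> str:
    """Returns only the text inside the given CSS selectors, so that dynamic
    sections elsewhere on the page can't trigger false change alerts."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for selector in selectors:
        for node in soup.select(selector):
            parts.append(node.get_text(separator="\n", strip=True))
    return "\n\n".join(parts)
```

Store the selector list per page in config.yaml, and run change detection against this narrowed text instead of the full body_text.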

Can I use this framework to track Google Ads and paid search strategies?

Yes, with some adaptation. You can use the Google Ads Transparency Center to see active ads for any advertiser. Your scraper can collect this public data, and Claude can analyze messaging patterns, offer language, and positioning shifts. This is particularly valuable if you're in a competitive paid search environment where competitor ad copy directly affects your click-through rates and Quality Scores.

What's the right frequency for running the pipeline?

Match frequency to the volatility of the information type. Pricing pages: daily or every two days. Feature pages: twice weekly. Blog/content pages: weekly. Homepage: twice weekly (messaging changes fast). Job listings: weekly. Running everything daily creates unnecessary costs and API load without proportional intelligence value.

How do I avoid getting my IP blocked?

The key practices are: respect crawl delays, use realistic request headers, limit requests to 1-2 per minute per domain, and don't scrape the same domain from multiple threads simultaneously. If you do get blocked (you'll see 403 or 429 responses), wait 24-48 hours before retrying, reduce your frequency, and consider using a residential proxy service if the intelligence is valuable enough to warrant the cost. Never try to circumvent anti-bot systems — that's where you cross from competitive intelligence into legally risky territory.

Can I extend this to monitor social media?

Social media platforms have strict Terms of Service against automated scraping, and most have APIs with their own rate limits and data access policies. For Twitter/X and LinkedIn, use their official APIs within permitted rate limits. For Facebook, the Ad Library API is the legitimate path for ad intelligence. Attempting to scrape social platforms outside their official APIs is high-risk and frequently breaks as platforms update their anti-bot measures.

How do I handle competitor sites that use Cloudflare or other bot protection?

If a site is aggressively blocking automated requests, that's a strong signal to stop and find alternative data sources. Heavy bot protection indicates the site owner actively doesn't want automated access. Attempting to bypass Cloudflare or similar systems violates terms of service and potentially computer abuse laws. Use manual collection, third-party data providers, or focus your automation on sites with permissive robots.txt and no bot protection.

What's the best way to share findings with my team?

The report generator can output to Slack, email, or a shared document depending on your team's workflow. For Slack, add a webhook integration to the report generator. For email, Python's smtplib with a SendGrid or Amazon SES integration sends the weekly report automatically. For documentation, outputting to a Notion database via their API keeps all competitive intelligence searchable and linked to other strategic documents.
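For the email path, a minimal smtplib sketch — the SMTP host, port, and credentials are whatever your provider (SendGrid, Amazon SES, or similar) gives you, so the values below are placeholders:

```python
import smtplib
from email.message import EmailMessage

def build_report_email(report_text: str, sender: str, recipient: str) -> EmailMessage:
    """Wraps the weekly report in a plain-text email message."""
    msg = EmailMessage()
    msg["Subject"] = "Weekly Competitive Intelligence Report"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(report_text)
    return msg

def send_report(msg: EmailMessage, smtp_host: str, smtp_port: int,
                username: str, password: str) -> None:
    """Sends via an SMTP relay — SendGrid and SES both expose one.
    Credentials come from your provider's dashboard."""
    with smtplib.SMTP(smtp_host, smtp_port) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(msg)
```

Hook send_report into the Monday job in main.py's scheduler and the weekly report arrives in inboxes automatically.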

Do I need Claude Code specifically, or can I use another LLM?

The architecture in this guide uses the Anthropic Python SDK and is optimized for Claude's instruction-following capabilities, particularly for structured JSON extraction. You could adapt the analysis layer to use other models, but Claude's ability to reliably return valid JSON from complex, messy web content — and to follow nuanced extraction instructions without hallucinating fields — makes it the strongest choice for this specific use case as of early 2026. The structured output reliability is critical when downstream code depends on specific JSON fields being present and accurate.

How do I validate that Claude's extractions are accurate?

Build a validation step into your pipeline that spot-checks 10-20% of extractions against the source URL. For pricing specifically, manually verify the first week's extractions against what you see in the browser before trusting the automated data. Add a confidence_score field to your prompts (as shown in the pricing analyzer) and treat anything below 70 as requiring manual review. Over time, you'll develop a sense for which page layouts Claude handles reliably and which require prompt refinement.
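That selection logic might look like the following sketch — it pulls everything below the confidence floor plus a random sample of the rest (the function name and sample rate are illustrative):

```python
import random

def select_for_review(extractions: list[dict], sample_rate: float = 0.15,
                      confidence_floor: int = 70) -> list[dict]:
    """Picks extractions needing a human look: everything below the
    confidence floor, plus a random spot-check sample of the remainder."""
    low_confidence = [e for e in extractions
                      if e.get("confidence_score", 0) < confidence_floor]
    remainder = [e for e in extractions if e not in low_confidence]
    sample_size = max(1, round(len(remainder) * sample_rate)) if remainder else 0
    return low_confidence + random.sample(remainder, sample_size)
```

Run it after each intelligence cycle and route the selected records to whatever review surface your team uses — a Slack message or a "needs review" flag in the database both work.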

Conclusion: From Reactive to Proactive Competitive Strategy

The competitive intelligence gap between companies that monitor manually and those that monitor systematically isn't just a time efficiency difference — it's a strategic decision-making gap. When you know that a competitor restructured their pricing two days ago, you can respond in days rather than weeks. When you see that three competitors have started emphasizing the same new feature in their messaging, you have an early signal about where customer expectations are heading. When you notice a competitor's job postings shift heavily toward enterprise sales roles, you have months of lead time before their public announcement.

The framework you've built in this guide gives you all of that — automatically, continuously, and within clear ethical boundaries. The five-step architecture covers everything from responsible data collection through Claude-powered analysis to automated change detection and reporting. It's production-ready with some hardening work (adding proper error handling, logging, and monitoring), and it scales cleanly from a solo founder watching three competitors to a growth team monitoring an entire industry vertical.

The most important thing to do next is actually run it. Put in your first three competitor URLs, let it collect a baseline, and come back a week later to see what Claude surfaces. The first time it catches a pricing change you didn't notice, the value of the entire project will be immediately obvious.

If you want to go deeper into building production-ready tools with Claude Code — beyond this pipeline into more complex architectures, API integrations, and multi-agent systems — Adventure Media is running a hands-on workshop called Master Claude Code in One Day. It's designed for people who are past the basics and ready to build real things that solve real business problems, which is exactly the mindset this guide was written for.

Competitive intelligence is no longer a resource reserved for enterprise teams with dedicated analysts. With Claude Code and a few hundred lines of well-structured Python, it's available to any team that decides to build it.

Ready to Master Claude Code?

Stop reading tutorials and start building. Adventure Media's "Master Claude Code in One Day" workshop takes you from zero to building real, functional AI tools — in a single day. Hands-on projects. Expert guidance. No coding experience required.

Reserve Your Spot — Seats Are Limited


Step 1: Establish the Ethical and Legal Ground Rules

Before building any scraping tool, you must understand what you're legally and ethically permitted to collect. This isn't a disclaimer buried at the bottom — it's the foundation of the entire project. Getting this wrong doesn't just create legal exposure; it can get your IP address blocked, generate cease-and-desist letters, and damage your reputation if it becomes public that your "competitive intelligence" crossed lines it shouldn't have.

What's Generally Permitted

Publicly accessible information — data that any visitor can see in their browser without logging in — is generally fair game for automated collection, though this area of law continues to evolve. The landmark hiQ Labs v. LinkedIn case established important precedent suggesting that scraping publicly available data doesn't automatically violate the Computer Fraud and Abuse Act, though this remains a developing area of law. What you can typically collect: pricing pages, product descriptions, public blog posts, public job listings, and meta information like page titles and structured data.

What's Off-Limits

You should not collect: data behind login walls, data explicitly prohibited in a site's robots.txt file (which we'll check programmatically), personally identifiable information about individuals, or data at a volume or rate that constitutes a denial-of-service attack on the target server. Aggressive scraping that hammers a server with thousands of requests per minute is both unethical and potentially illegal under computer abuse statutes, regardless of what data you're collecting.

The robots.txt Checker — Your First Code Requirement

Your tool must check and respect robots.txt before collecting anything. This is non-negotiable. Here's the module you'll build first:


import urllib.robotparser
import urllib.parse
import time

def check_robots_permission(url: str, user_agent: str = "CompetitiveIntelligenceBot") -> bool:
    """Returns True if scraping this URL is permitted by robots.txt"""
    parsed = urllib.parse.urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        # If robots.txt is unreachable, default to conservative behavior
        return False

def get_crawl_delay(url: str, user_agent: str = "CompetitiveIntelligenceBot") -> float:
    """Returns the required crawl delay in seconds, minimum 2.0"""
    parsed = urllib.parse.urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    
    try:
        rp.read()
        delay = rp.crawl_delay(user_agent)
        return max(delay or 2.0, 2.0)  # Always wait at least 2 seconds
    except Exception:
        return 2.0

Pro tip: Set your user agent string to something identifiable and include a contact email in your request headers. Something like CompetitiveResearchBot/1.0 (contact@yourcompany.com) is transparent about who you are and gives site owners a way to reach you if there's an issue. This is basic professional courtesy and reduces friction.

Common mistake to avoid: Don't skip robots.txt checking because "everyone does it." The fact that unethical scraping is common doesn't make it acceptable. Beyond the legal risk, your tool's longevity depends on not getting blocked — and sites that detect aggressive scraping will implement countermeasures that break your pipeline.

Estimated time for this step: 30 minutes to implement the robots checker and review your target competitor sites' robots.txt files manually.

Step 2: Set Up Your Claude Code Environment and Project Structure

A well-organized project structure will save you hours of debugging later and make the tool maintainable as your competitor list grows. Claude Code works best when you give it a clear, consistent file structure to work within — it can reference existing modules, understand your data schemas, and generate new components that fit naturally into what's already there.

Install Dependencies

Create a new project directory and set up a virtual environment:


mkdir competitor-intel
cd competitor-intel
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

pip install anthropic requests beautifulsoup4 lxml playwright \
            python-dotenv pyyaml schedule rich pandas

Note that sqlite3 ships with Python's standard library, so it isn't installed via pip, and pyyaml is needed to read config.yaml.

Install Playwright's browser binaries:


playwright install chromium

You'll use Playwright for JavaScript-heavy pages where requests alone won't render dynamic content. Many modern competitor pricing pages load their data via JavaScript, and a headless browser is the only way to get the actual rendered content.

Project File Structure


competitor-intel/
├── .env                    # API keys (never commit this)
├── config.yaml             # Competitor URLs and scraping settings
├── main.py                 # Entry point and scheduler
├── scraper/
│   ├── __init__.py
│   ├── robots_checker.py   # The module from Step 1
│   ├── static_scraper.py   # requests + BeautifulSoup
│   └── dynamic_scraper.py  # Playwright for JS pages
├── analyzer/
│   ├── __init__.py
│   ├── claude_client.py    # Claude API wrapper
│   ├── pricing_analyzer.py
│   ├── content_analyzer.py
│   └── change_detector.py
├── storage/
│   ├── __init__.py
│   └── database.py         # SQLite operations
├── reports/
│   ├── __init__.py
│   └── report_generator.py
└── data/
    └── competitor_intel.db  # Auto-created SQLite database

Configure Your .env File


ANTHROPIC_API_KEY=sk-ant-your-key-here
CLAUDE_MODEL=claude-opus-4-5
REQUEST_DELAY_SECONDS=3
MAX_PAGES_PER_DOMAIN=20

Set Up the config.yaml

This is where you define your competitor list. Structure it by company, not by URL, so you can group related pages:


competitors:
  - name: "CompetitorA"
    domain: "https://competitor-a.com"
    pages:
      - url: "https://competitor-a.com/pricing"
        type: "pricing"
        js_required: false
      - url: "https://competitor-a.com/blog"
        type: "content"
        js_required: false
      - url: "https://competitor-a.com/features"
        type: "features"
        js_required: true
    
  - name: "CompetitorB"
    domain: "https://competitor-b.com"
    pages:
      - url: "https://competitor-b.com/plans"
        type: "pricing"
        js_required: true

Common mistake to avoid: Don't add every page on a competitor's site. Focus on high-signal pages: pricing, features, homepage, about, careers (for growth signals), and blog index. Adding too many pages creates noise and increases your API costs substantially.

Pro tip: Add a priority: high/medium/low field to each page entry. High-priority pages (like pricing) should be checked daily; low-priority pages (like about pages) weekly. This prevents unnecessary API calls and keeps your costs manageable.

Estimated time for this step: 45 minutes including dependency installation and initial configuration.

Step 3: Build the Scraping Layer — Static and Dynamic Content

The scraping layer is your data collection foundation, and it needs to handle two fundamentally different types of web pages: those that deliver all their content in the initial HTML response, and those that require JavaScript execution to render their actual content. Getting this distinction right is what separates a scraper that works reliably from one that silently returns empty data half the time.

Static Scraper (requests + BeautifulSoup)

Build scraper/static_scraper.py:


import requests
from bs4 import BeautifulSoup
import time
from scraper.robots_checker import check_robots_permission, get_crawl_delay
import os
from dotenv import load_dotenv

load_dotenv()  # COMPANY_DOMAIN is read from .env when HEADERS is built below

HEADERS = {
    "User-Agent": f"CompetitiveResearchBot/1.0 (contact@{os.getenv('COMPANY_DOMAIN', 'example.com')})",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

def scrape_static_page(url: str) -> dict:
    """
    Scrapes a static page. Returns structured content dict.
    Returns None if robots.txt disallows or request fails.
    """
    if not check_robots_permission(url):
        print(f"[BLOCKED] robots.txt disallows: {url}")
        return None
    
    delay = get_crawl_delay(url)
    time.sleep(delay)
    
    try:
        response = requests.get(url, headers=HEADERS, timeout=15)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.text, 'lxml')
        
        # Remove script and style tags — they're noise for text analysis
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        
        return {
            "url": url,
            "status_code": response.status_code,
            "title": soup.title.string if soup.title else "",
            "meta_description": get_meta_description(soup),
            "h1_tags": [h.get_text(strip=True) for h in soup.find_all('h1')],
            "h2_tags": [h.get_text(strip=True) for h in soup.find_all('h2')],
            "body_text": soup.get_text(separator='\n', strip=True)[:15000],  # Limit for API
            "structured_data": extract_structured_data(soup),
            "scraped_at": time.time()
        }
    
    except requests.RequestException as e:
        print(f"[ERROR] Failed to scrape {url}: {e}")
        return None

def get_meta_description(soup) -> str:
    meta = soup.find('meta', attrs={'name': 'description'})
    return meta['content'] if meta and meta.get('content') else ""

def extract_structured_data(soup) -> list:
    """Extracts JSON-LD structured data — often contains pricing info"""
    import json
    structured = []
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
            structured.append(data)
        except (json.JSONDecodeError, TypeError):
            pass
    return structured

Dynamic Scraper (Playwright for JavaScript Pages)

Build scraper/dynamic_scraper.py:


from playwright.sync_api import sync_playwright
from scraper.robots_checker import check_robots_permission, get_crawl_delay
from bs4 import BeautifulSoup
import time

def scrape_dynamic_page(url: str, wait_selector: str = None) -> dict:
    """
    Uses headless Chromium to render JavaScript-heavy pages.
    wait_selector: CSS selector to wait for before extracting content.
    """
    if not check_robots_permission(url):
        print(f"[BLOCKED] robots.txt disallows: {url}")
        return None
    
    delay = get_crawl_delay(url)
    time.sleep(delay)
    
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="CompetitiveResearchBot/1.0",
            extra_http_headers={"Accept-Language": "en-US,en;q=0.9"}
        )
        page = context.new_page()
        
        try:
            page.goto(url, wait_until="networkidle", timeout=30000)
            
            if wait_selector:
                page.wait_for_selector(wait_selector, timeout=10000)
            
            # Additional wait for dynamic content
            page.wait_for_timeout(2000)
            
            html_content = page.content()
            soup = BeautifulSoup(html_content, 'lxml')
            
            for tag in soup(['script', 'style', 'nav', 'footer']):
                tag.decompose()
            
            return {
                "url": url,
                "title": page.title(),
                "body_text": soup.get_text(separator='\n', strip=True)[:15000],
                "h1_tags": [h.get_text(strip=True) for h in soup.find_all('h1')],
                "h2_tags": [h.get_text(strip=True) for h in soup.find_all('h2')],
                "scraped_at": time.time(),
                "rendered": True
            }
        
        except Exception as e:
            print(f"[ERROR] Playwright failed for {url}: {e}")
            return None
        
        finally:
            browser.close()

Pro tip: For pricing pages that load prices via API calls (which is increasingly common in SaaS), open your browser's Developer Tools, go to the Network tab, and look for XHR/fetch calls when the pricing loads. Sometimes you can call that API endpoint directly, which is far more reliable than scraping rendered HTML. Check whether that endpoint is documented or publicly accessible before using it.

Common mistake to avoid: Using wait_until="load" instead of wait_until="networkidle" in Playwright. The "load" event fires before many JavaScript frameworks have finished rendering their content. "networkidle" waits until there have been no network connections for at least 500 ms — much more reliable for dynamic pages, though Playwright's documentation notes it's best suited to scraping-style scripts rather than tests.

Estimated time for this step: 60-90 minutes including testing against your actual competitor URLs.

Step 4: Build the Claude Analysis Layer — Where Raw Data Becomes Intelligence

This is where the project transforms from a basic scraper into a genuine intelligence tool. Raw HTML text is nearly useless for decision-making. Claude's ability to read unstructured text and extract structured, categorized, actionable information is what makes this architecture valuable. A competitor's pricing page might contain paragraphs of marketing copy, feature comparison tables, FAQ sections, and footnotes — Claude can parse all of that and return a clean JSON object with the actual prices, plan names, and key differentiators.

Set Up the Claude Client

Build analyzer/claude_client.py:


import anthropic
import json
import os
from dotenv import load_dotenv

load_dotenv()

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
MODEL = os.getenv("CLAUDE_MODEL", "claude-opus-4-5")

def analyze_with_claude(prompt: str, content: str, system_prompt: str = None) -> str:
    """
    Sends content to Claude for analysis. Returns the response text.
    """
    messages = [
        {
            "role": "user",
            "content": f"{prompt}\n\n---CONTENT---\n{content}"
        }
    ]
    
    system = system_prompt or (
        "You are a competitive intelligence analyst. Extract structured, accurate information "
        "from web page content. Always respond with valid JSON unless instructed otherwise. "
        "If information is not present in the content, use null rather than guessing."
    )
    
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        system=system,
        messages=messages
    )
    
    return response.content[0].text

def parse_json_response(response_text: str) -> dict:
    """Safely parses JSON from Claude's response, handling markdown code blocks"""
    text = response_text.strip()
    if text.startswith("```json"):
        text = text[7:]
    if text.startswith("```"):
        text = text[3:]
    if text.endswith("```"):
        text = text[:-3]
    
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError as e:
        print(f"[WARNING] JSON parse failed: {e}")
        return {"raw_response": response_text, "parse_error": str(e)}

Pricing Analyzer

Build analyzer/pricing_analyzer.py:


from analyzer.claude_client import analyze_with_claude, parse_json_response

PRICING_PROMPT = """
Analyze this pricing page content and extract all pricing information.

Return a JSON object with this exact structure:
{
  "plans": [
    {
      "name": "Plan name",
      "monthly_price": 0.00 or null,
      "annual_price": 0.00 or null,
      "annual_monthly_equivalent": 0.00 or null,
      "currency": "USD",
      "billing_period": "monthly|annual|one-time|usage-based",
      "user_limit": "number or 'unlimited' or null",
      "key_features": ["feature1", "feature2"],
      "highlighted_features": ["features emphasized in marketing copy"],
      "cta_text": "The call-to-action button text"
    }
  ],
  "free_trial": true/false,
  "free_tier_available": true/false,
  "enterprise_option": true/false,
  "pricing_model": "per-seat|flat-rate|usage-based|tiered|hybrid",
  "key_differentiators": ["What they emphasize vs competitors"],
  "recent_changes_mentioned": ["Any 'new', 'updated', 'launching soon' language found"],
  "confidence_score": 0-100
}
"""

def analyze_pricing_page(scraped_data: dict) -> dict:
    content = f"""
    Page Title: {scraped_data.get('title', '')}
    H1 Tags: {', '.join(scraped_data.get('h1_tags', []))}
    H2 Tags: {', '.join(scraped_data.get('h2_tags', []))}
    
    Body Text:
    {scraped_data.get('body_text', '')}
    """
    
    response = analyze_with_claude(PRICING_PROMPT, content)
    result = parse_json_response(response)
    result['source_url'] = scraped_data.get('url')
    result['scraped_at'] = scraped_data.get('scraped_at')
    return result

Content Strategy Analyzer

Build analyzer/content_analyzer.py. This one is particularly valuable for tracking how competitors are shifting their messaging — are they moving upmarket? Targeting a new persona? Pivoting their positioning?


from analyzer.claude_client import analyze_with_claude, parse_json_response

CONTENT_PROMPT = """
Analyze this blog/content page and extract strategic intelligence about the company's content approach.

Return JSON with this structure:
{
  "primary_topics": ["main topics covered"],
  "target_personas": ["inferred audience segments"],
  "content_stage": "awareness|consideration|decision",
  "seo_keywords_apparent": ["keywords they appear to be targeting"],
  "messaging_themes": ["core messages and value propositions"],
  "competitive_positioning": "how they position against alternatives",
  "pain_points_addressed": ["customer problems they highlight"],
  "tone": "technical|conversational|formal|casual|urgent",
  "calls_to_action": ["CTAs found on page"],
  "publication_recency": "recent|moderate|dated based on content references",
  "strategic_signals": ["anything suggesting strategic direction changes"]
}
"""

def analyze_content_page(scraped_data: dict) -> dict:
    content = f"""
    Title: {scraped_data.get('title', '')}
    Meta Description: {scraped_data.get('meta_description', '')}
    H1: {', '.join(scraped_data.get('h1_tags', []))}
    H2s: {', '.join(scraped_data.get('h2_tags', []))}
    Body: {scraped_data.get('body_text', '')[:8000]}
    """
    
    response = analyze_with_claude(CONTENT_PROMPT, content)
    return parse_json_response(response)

Pro tip: The strategic_signals field in the content analyzer is often where the most valuable intelligence surfaces. When a competitor's blog starts publishing heavily about a new topic — say, AI features, enterprise compliance, or a specific industry vertical — that's often a leading indicator of a product roadmap or marketing pivot, months before a formal announcement.

Common mistake to avoid: Sending the entire raw page body to Claude without trimming it first. Most pricing pages have extensive boilerplate, repeated navigation text, and footer content that adds tokens without adding analytical value. The 15,000-character limit in the scraper module and the 8,000-character limit in the content analyzer are intentional — they keep your API costs reasonable and actually improve response quality by forcing Claude to work with signal-dense content.

Estimated time for this step: 60-75 minutes to build and test both analyzers.

Step 5: Build Change Detection, Storage, and Automated Reporting

A competitive intelligence tool that requires you to manually compare this week's data to last week's is only marginally better than doing it by hand. The real value comes from automated change detection — a system that knows what the baseline looked like, identifies meaningful differences, and surfaces only what matters rather than overwhelming you with raw data dumps.

Database Setup

Build storage/database.py using SQLite for simplicity (upgrade to PostgreSQL for team deployments):


import sqlite3
import json
import time
import os

DB_PATH = os.path.join("data", "competitor_intel.db")

def init_database():
    os.makedirs("data", exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    
    cursor.executescript("""
        CREATE TABLE IF NOT EXISTS scrape_results (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            competitor_name TEXT NOT NULL,
            url TEXT NOT NULL,
            page_type TEXT NOT NULL,
            raw_content TEXT,
            analyzed_data TEXT,
            scraped_at REAL,
            created_at REAL DEFAULT (unixepoch())
        );
        
        CREATE TABLE IF NOT EXISTS changes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            competitor_name TEXT NOT NULL,
            page_type TEXT NOT NULL,
            change_type TEXT NOT NULL,
            change_summary TEXT,
            old_value TEXT,
            new_value TEXT,
            severity TEXT DEFAULT 'medium',
            detected_at REAL DEFAULT (unixepoch())
        );
        
        CREATE INDEX IF NOT EXISTS idx_scrape_url ON scrape_results(url, scraped_at);
        CREATE INDEX IF NOT EXISTS idx_changes_competitor ON changes(competitor_name, detected_at);
    """)
    
    conn.commit()
    conn.close()

def save_scrape_result(competitor_name: str, url: str, page_type: str, 
                       raw_content: dict, analyzed_data: dict):
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO scrape_results (competitor_name, url, page_type, raw_content, analyzed_data, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?)
    """, (competitor_name, url, page_type, 
          json.dumps(raw_content), json.dumps(analyzed_data), time.time()))
    conn.commit()
    conn.close()

def get_previous_analysis(url: str) -> dict | None:
    """Gets the most recent previous analysis for comparison"""
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.execute("""
        SELECT analyzed_data FROM scrape_results 
        WHERE url = ? 
        ORDER BY scraped_at DESC 
        LIMIT 1 OFFSET 1
    """, (url,))
    row = cursor.fetchone()
    conn.close()
    return json.loads(row[0]) if row else None
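The `LIMIT 1 OFFSET 1` trick is easy to get backwards, so here's a self-contained sanity check against an in-memory database (a trimmed version of the `scrape_results` schema above, with made-up snapshot data):

```python
import json
import sqlite3
import time

# LIMIT 1 OFFSET 1 skips the newest row and returns the one before it,
# which is exactly the "previous snapshot" we want as a baseline.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE scrape_results (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        url TEXT NOT NULL,
        analyzed_data TEXT,
        scraped_at REAL
    )
""")

url = "https://example.com/pricing"
for i, price in enumerate([99, 119]):  # two snapshots: old, then new
    conn.execute(
        "INSERT INTO scrape_results (url, analyzed_data, scraped_at) VALUES (?, ?, ?)",
        (url, json.dumps({"price": price}), time.time() + i),
    )

row = conn.execute("""
    SELECT analyzed_data FROM scrape_results
    WHERE url = ?
    ORDER BY scraped_at DESC
    LIMIT 1 OFFSET 1
""", (url,)).fetchone()

previous = json.loads(row[0]) if row else None
print(previous)  # {'price': 99} — the older snapshot, not the newest
```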

Change Detector

Build analyzer/change_detector.py. This is where Claude does its most impressive work — comparing two snapshots of analyzed data and producing a human-readable, prioritized summary of what changed:


from analyzer.claude_client import analyze_with_claude, parse_json_response
import sqlite3
import json
import os

DB_PATH = os.path.join("data", "competitor_intel.db")

CHANGE_DETECTION_PROMPT = """
Compare these two competitive intelligence snapshots (previous vs current) and identify meaningful changes.

Ignore trivial differences like minor wording changes. Focus on strategically significant changes.

Return JSON:
{
  "changes_detected": true/false,
  "change_count": 0,
  "changes": [
    {
      "type": "pricing_change|feature_addition|feature_removal|messaging_shift|positioning_change|new_cta|plan_restructure",
      "severity": "high|medium|low",
      "summary": "One sentence description of the change",
      "strategic_implication": "What this likely means for their strategy",
      "recommended_action": "What you should consider doing in response",
      "old_value": "previous state",
      "new_value": "current state"
    }
  ],
  "overall_assessment": "Executive summary of competitive landscape shift"
}
"""

def save_changes(competitor_name: str, page_type: str, changes: list):
    """Persists detected changes to the changes table so the weekly
    report generator has rows to read."""
    conn = sqlite3.connect(DB_PATH)
    for change in changes:
        conn.execute("""
            INSERT INTO changes (competitor_name, page_type, change_type,
                                 change_summary, old_value, new_value, severity)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (competitor_name, page_type, change.get('type', 'unknown'),
              change.get('summary', ''), change.get('old_value', ''),
              change.get('new_value', ''), change.get('severity', 'medium')))
    conn.commit()
    conn.close()

def detect_changes(competitor_name: str, page_type: str,
                   previous_data: dict, current_data: dict) -> dict:
    if not previous_data:
        return {"changes_detected": False, "message": "No baseline to compare against"}

    content = f"""
PREVIOUS SNAPSHOT:
{json.dumps(previous_data, indent=2)[:6000]}

CURRENT SNAPSHOT:
{json.dumps(current_data, indent=2)[:6000]}

Competitor: {competitor_name}
Page Type: {page_type}
"""

    response = analyze_with_claude(CHANGE_DETECTION_PROMPT, content)
    result = parse_json_response(response)
    result['competitor_name'] = competitor_name
    result['page_type'] = page_type
    if result.get('changes_detected'):
        save_changes(competitor_name, page_type, result.get('changes', []))
    return result

Report Generator

Build reports/report_generator.py to produce a clean weekly digest:


import os
import sqlite3
import time
from datetime import datetime

from analyzer.claude_client import analyze_with_claude

DB_PATH = os.path.join("data", "competitor_intel.db")

def generate_weekly_report() -> str:
    conn = sqlite3.connect(DB_PATH)
    
    # Get all changes from the past 7 days
    week_ago = time.time() - (7 * 24 * 3600)
    cursor = conn.execute("""
        SELECT competitor_name, page_type, change_type, change_summary,
               severity, old_value, new_value, detected_at
        FROM changes
        WHERE detected_at > ?
        -- 'high'/'medium'/'low' don't sort correctly as plain text,
        -- so map severity to an explicit rank before ordering
        ORDER BY CASE severity WHEN 'high' THEN 0 WHEN 'medium' THEN 1 ELSE 2 END,
                 detected_at DESC
    """, (week_ago,))
    
    changes = cursor.fetchall()
    conn.close()
    
    if not changes:
        return "No significant changes detected in the past 7 days."
    
    changes_text = "\n".join([
        f"[{c[4].upper()}] {c[0]} - {c[2]}: {c[3]}"
        for c in changes
    ])
    
    synthesis_prompt = """
Synthesize these competitive intelligence findings into a concise executive briefing.
Structure it as:
1. Top 3 most urgent findings requiring immediate action
2. Strategic trends observed across competitors
3. Recommended actions for this week

Write in clear, direct business language. No bullet-point padding.
"""
    
    report = analyze_with_claude(synthesis_prompt, changes_text, 
                                  system_prompt="You are a senior competitive strategist writing an executive briefing.")
    
    timestamp = datetime.now().strftime("%m/%d/%Y")
    return f"COMPETITIVE INTELLIGENCE BRIEF — Week of {timestamp}\n\n{report}"

Wire It All Together in main.py


import yaml
import schedule
import time
from storage.database import init_database, save_scrape_result, get_previous_analysis
from scraper.static_scraper import scrape_static_page
from scraper.dynamic_scraper import scrape_dynamic_page
from analyzer.pricing_analyzer import analyze_pricing_page
from analyzer.content_analyzer import analyze_content_page
from analyzer.change_detector import detect_changes
from reports.report_generator import generate_weekly_report

def load_config():
    with open("config.yaml") as f:
        return yaml.safe_load(f)

def run_intelligence_cycle():
    config = load_config()
    
    for competitor in config['competitors']:
        name = competitor['name']
        print(f"\n[INFO] Processing {name}...")
        
        for page_config in competitor['pages']:
            url = page_config['url']
            page_type = page_config['type']
            js_required = page_config.get('js_required', False)
            
            # Scrape
            if js_required:
                raw_data = scrape_dynamic_page(url)
            else:
                raw_data = scrape_static_page(url)
            
            if not raw_data:
                print(f"[SKIP] Failed to scrape: {url}")
                continue
            
            # Analyze
            if page_type == 'pricing':
                analyzed = analyze_pricing_page(raw_data)
            else:
                analyzed = analyze_content_page(raw_data)
            
            # Get previous for comparison
            previous = get_previous_analysis(url)
            
            # Save current
            save_scrape_result(name, url, page_type, raw_data, analyzed)
            
            # Detect changes
            if previous:
                changes = detect_changes(name, page_type, previous, analyzed)
                if changes.get('changes_detected'):
                    print(f"[ALERT] Changes detected for {name} - {page_type}!")
                    # Here: send Slack notification, email, etc.
            
            time.sleep(2)  # Polite delay between pages

def main():
    init_database()
    
    # Run immediately on startup
    run_intelligence_cycle()
    
    # Schedule ongoing runs
    schedule.every().day.at("07:00").do(run_intelligence_cycle)
    schedule.every().monday.at("08:00").do(lambda: print(generate_weekly_report()))
    
    while True:
        schedule.run_pending()
        time.sleep(60)

if __name__ == "__main__":
    main()

Pro tip: Add Slack webhook notifications for high-severity changes. When a competitor drops their price or adds a major feature, you want to know within hours, not at your Monday morning briefing. The requests library makes this a five-line addition to the change detection logic.

Common mistake to avoid: Running the full intelligence cycle too frequently. Daily is appropriate for pricing pages; weekly is usually sufficient for content and feature pages. Overly frequent scraping increases your API costs, strains competitor servers, and produces noise from minor, irrelevant changes.

Estimated time for this step: 90 minutes to build and 30 minutes to run initial tests.

Going Deeper: Advanced Use Cases and Extending the Framework

Once your core pipeline is running, you have a foundation that can be extended in several high-value directions without rebuilding from scratch. The modular architecture you've built is specifically designed to accommodate new data sources, new analysis types, and new output formats by adding new modules rather than modifying existing ones.

Tracking Ad Creative and Landing Page Changes

One of the most valuable competitive signals — and one that most teams miss entirely — is monitoring what competitors are doing with their paid advertising. While you can't directly access their ad accounts, you can monitor their landing pages, which often reveal their current messaging priorities, offers, and conversion strategies. Add landing pages as type: "landing_page" entries in your config.yaml, and extend the content analyzer with a landing-page-specific prompt that looks for: urgency language, specific offer terms, social proof claims, and A/B testing signals (like multiple variants of a page being accessible via different URL parameters).

For more sophisticated ad intelligence, tools like the Meta Ad Library and Google's Ads Transparency Center provide public access to active ad creative. With Claude Code, you can build scrapers for these platforms (they have their own APIs and public-facing data) and feed that intelligence back into your analysis pipeline. Meta's Ad Library is particularly valuable for B2C competitors — you can see exactly what creative they're running, for how long, and across which regions.

Job Listing Intelligence

Competitor job postings are an underutilized intelligence source. A competitor suddenly posting five senior AI/ML engineer roles tells you something about their product roadmap that no press release will. A cluster of sales hiring in a specific geographic region suggests market expansion. Build a job listing scraper that monitors their careers page (usually a simple static page) and uses Claude to extract: role categories, seniority distribution, required technology stack, and location patterns. This kind of intelligence is consistently available, ethically unambiguous, and highly predictive.
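If you follow the same JSON-schema prompt pattern as the pricing and content analyzers, a job-listing analyzer might look like this sketch (the field names are assumptions, and it reuses the `analyzer.claude_client` helpers from this guide's earlier modules):

```python
# Hypothetical prompt, mirroring the JSON-schema pattern used by the
# pricing and content analyzers earlier in this guide.
JOBS_PROMPT = """
Analyze these job listings and return JSON:
{
  "role_categories": ["engineering|sales|marketing|support|operations"],
  "seniority_distribution": {"junior": 0, "mid": 0, "senior": 0},
  "tech_stack_signals": ["technologies named in requirements"],
  "location_patterns": ["cities or regions with clustered hiring"],
  "strategic_signals": ["what this hiring pattern suggests about roadmap or expansion"]
}
"""

def analyze_job_listings(listings_text: str) -> dict:
    # Reuses the shared helpers from analyzer/claude_client.py
    from analyzer.claude_client import analyze_with_claude, parse_json_response
    return parse_json_response(analyze_with_claude(JOBS_PROMPT, listings_text))
```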

Integrating with Your Existing Reporting Stack

The SQLite database can be queried directly by tools like Retool, Metabase, or even Google Sheets (via a Python script that exports to CSV). For teams already using data visualization platforms, adding a "competitive intelligence" dashboard fed by this pipeline takes a few hours and creates genuinely useful visibility that most teams don't have.
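For the Google Sheets route, the export can be as simple as this sketch (it assumes the `changes` table schema from Step 5; the function name and output path are arbitrary):

```python
import csv
import sqlite3

def export_changes_csv(db_path: str, out_path: str) -> int:
    """Dump the changes table to CSV for import into Sheets, Metabase, or Retool.
    Returns the number of rows exported."""
    conn = sqlite3.connect(db_path)
    cursor = conn.execute("""
        SELECT competitor_name, page_type, change_type,
               change_summary, severity, detected_at
        FROM changes
        ORDER BY detected_at DESC
    """)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([d[0] for d in cursor.description])  # header row
        rows = cursor.fetchall()
        writer.writerows(rows)
    conn.close()
    return len(rows)
```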

If you're interested in taking your Claude Code skills significantly further — building more complex pipelines, integrating with APIs, and creating production-ready tools — Adventure Media's hands-on Claude Code workshop covers exactly this kind of real-world project development in a single day. The workshop is built for people who already understand the basics and want to move into building actual tools, which is precisely where this competitive intelligence project sits.

Frequently Asked Questions

Is it legal to scrape competitor websites?

Scraping publicly available information is generally legal in the US, but the law is not fully settled. The Computer Fraud and Abuse Act (CFAA) has been interpreted in various ways, and the hiQ v. LinkedIn case provided some protection for public data scraping. However, you should always: respect robots.txt, avoid logging into competitors' systems, not scrape at volumes that constitute a denial-of-service attack, and review each target site's Terms of Service. Consult a technology attorney if you're doing this at scale or in a regulated industry. This guide is not legal advice.

How much will this cost to run in API fees?

For a typical setup monitoring 5 competitors with 3-4 pages each, daily runs will cost roughly $5-15 per month in Claude API fees, depending on page length and your model choice. Using a smaller, cheaper model (Claude Haiku) for initial extraction and a larger model (Claude Opus) only for change synthesis and report generation can reduce costs significantly. Monitor your usage in the Anthropic console and set billing alerts.

What if a competitor's pricing page uses a paywall or login?

Don't attempt to access any content behind authentication. If a pricing page requires login, that's a clear signal that the company considers it non-public. In these cases, use alternative sources: industry databases, G2/Capterra reviews (which often include pricing), press releases, or sales intelligence platforms like ZoomInfo. You can feed manually collected data into your Claude analysis pipeline just as easily as scraped data.

How do I handle false positives in change detection?

The most common source of false positives is dynamic content like "featured this week" sections, date-stamped content, and personalized elements. Mitigate this by: adding CSS selectors to specifically target the content sections you care about rather than grabbing full page text, running the scraper multiple times on the same day before flagging a change as confirmed, and tuning your Claude prompt to explicitly ignore "trivial wording changes" and time-sensitive content.
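The first mitigation can be a small helper in the scraper layer. A sketch, assuming BeautifulSoup is already installed; the selectors shown are placeholders you'd replace after inspecting each competitor page:

```python
from bs4 import BeautifulSoup

# Hypothetical selectors: inspect each competitor page and replace these
# with selectors that wrap only the content you actually care about.
STABLE_SELECTORS = ["main .pricing-table", "section#plans", "div.plan-card"]

def extract_stable_text(html: str, selectors: list[str]) -> str:
    """Pull text only from the targeted sections, skipping nav, footer,
    and 'featured this week' widgets that trigger false-positive alerts."""
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for selector in selectors:
        for node in soup.select(selector):
            chunks.append(node.get_text(" ", strip=True))
    return "\n".join(chunks)
```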

Can I use this framework to track Google Ads and paid search strategies?

Yes, with some adaptation. You can use the Google Ads Transparency Center to see active ads for any advertiser. Your scraper can collect this public data, and Claude can analyze messaging patterns, offer language, and positioning shifts. This is particularly valuable if you're in a competitive paid search environment where competitor ad copy directly affects your click-through rates and Quality Scores.

What's the right frequency for running the pipeline?

Match frequency to the volatility of the information type. Pricing pages: daily or every two days. Feature pages: twice weekly. Blog/content pages: weekly. Homepage: twice weekly (messaging changes fast). Job listings: weekly. Running everything daily creates unnecessary costs and API load without proportional intelligence value.

How do I avoid getting my IP blocked?

The key practices are: respect crawl delays, use realistic request headers, limit requests to 1-2 per minute per domain, and don't scrape the same domain from multiple threads simultaneously. If you do get blocked (you'll see 403 or 429 responses), wait 24-48 hours before retrying, reduce your frequency, and consider using a residential proxy service if the intelligence is valuable enough to warrant the cost. Never try to circumvent anti-bot systems — that's where you cross from competitive intelligence into legally risky territory.
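Those practices can be folded into a small wrapper around `requests.get`. A sketch (the user-agent string, retry count, and wait times are illustrative defaults, not requirements):

```python
import time
import requests

def backoff_seconds(attempt: int, base: int = 60) -> int:
    """Exponential backoff schedule: 60s, 120s, 240s, ..."""
    return base * (2 ** attempt)

def polite_get(url: str, max_retries: int = 3):
    """GET with backoff on 403/429; gives up instead of hammering a site
    that is signalling it wants automated traffic to stop."""
    headers = {"User-Agent": "Mozilla/5.0 (compatible; CompetitiveIntelBot/1.0)"}
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=15)
        if resp.status_code in (403, 429):
            time.sleep(backoff_seconds(attempt))
            continue
        return resp
    return None  # persistent blocking: stop and use another data source
```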

Can I extend this to monitor social media?

Social media platforms have strict Terms of Service against automated scraping, and most have APIs with their own rate limits and data access policies. For Twitter/X and LinkedIn, use their official APIs within permitted rate limits. For Facebook, the Ad Library API is the legitimate path for ad intelligence. Attempting to scrape social platforms outside their official APIs is high-risk and frequently breaks as platforms update their anti-bot measures.

How do I handle competitor sites that use Cloudflare or other bot protection?

If a site is aggressively blocking automated requests, that's a strong signal to stop and find alternative data sources. Heavy bot protection indicates the site owner actively doesn't want automated access. Attempting to bypass Cloudflare or similar systems violates terms of service and potentially computer abuse laws. Use manual collection, third-party data providers, or focus your automation on sites with permissive robots.txt and no bot protection.

What's the best way to share findings with my team?

The report generator can output to Slack, email, or a shared document depending on your team's workflow. For Slack, add a webhook integration to the report generator. For email, Python's smtplib with a SendGrid or Amazon SES integration sends the weekly report automatically. For documentation, outputting to a Notion database via their API keeps all competitive intelligence searchable and linked to other strategic documents.
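The email path is only a few lines with the standard library. A sketch, assuming your SMTP provider's host and port (the subject line and function names are illustrative):

```python
import smtplib
from email.mime.text import MIMEText

def build_report_email(report: str, sender: str, recipient: str) -> MIMEText:
    """Wraps the weekly report text in a plain-text email message."""
    msg = MIMEText(report)
    msg["Subject"] = "Weekly Competitive Intelligence Brief"
    msg["From"] = sender
    msg["To"] = recipient
    return msg

def send_report(report: str, host: str, port: int,
                sender: str, recipient: str) -> None:
    """Sends via your SMTP provider (e.g. an SES or SendGrid SMTP relay).
    Authentication is omitted; add server.login() as your provider requires."""
    msg = build_report_email(report, sender, recipient)
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.send_message(msg)
```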

Do I need Claude Code specifically, or can I use another LLM?

The architecture in this guide uses the Anthropic Python SDK and is optimized for Claude's instruction-following capabilities, particularly for structured JSON extraction. You could adapt the analysis layer to use other models, but Claude's ability to reliably return valid JSON from complex, messy web content — and to follow nuanced extraction instructions without hallucinating fields — makes it the strongest choice for this specific use case as of early 2026. The structured output reliability is critical when downstream code depends on specific JSON fields being present and accurate.

How do I validate that Claude's extractions are accurate?

Build a validation step into your pipeline that spot-checks 10-20% of extractions against the source URL. For pricing specifically, manually verify the first week's extractions against what you see in the browser before trusting the automated data. Add a confidence_score field to your prompts (as shown in the pricing analyzer) and treat anything below 70 as requiring manual review. Over time, you'll develop a sense for which page layouts Claude handles reliably and which require prompt refinement.
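That threshold rule is worth centralizing in one helper. A sketch (it assumes the analyzer output includes the `confidence_score` field described above; treating a missing score as low confidence is a design choice, not a requirement):

```python
def needs_review(analyzed: dict, threshold: int = 70) -> bool:
    """Flag extractions below the confidence threshold for manual review.
    A missing confidence_score counts as low confidence, not as a pass."""
    score = analyzed.get("confidence_score")
    if score is None:
        return True
    return score < threshold

# Example queue of extractions; only the confident one passes.
review_queue = [a for a in [
    {"url": "https://example.com/pricing", "confidence_score": 55},
    {"url": "https://example.com/features", "confidence_score": 92},
    {"url": "https://example.com/blog"},  # analyzer omitted the score
] if needs_review(a)]
print(len(review_queue))  # prints 2
```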

Conclusion: From Reactive to Proactive Competitive Strategy

The competitive intelligence gap between companies that monitor manually and those that monitor systematically isn't just a time efficiency difference — it's a strategic decision-making gap. When you know that a competitor restructured their pricing two days ago, you can respond in days rather than weeks. When you see that three competitors have started emphasizing the same new feature in their messaging, you have an early signal about where customer expectations are heading. When you notice a competitor's job postings shift heavily toward enterprise sales roles, you have months of lead time before their public announcement.

The framework you've built in this guide gives you all of that — automatically, continuously, and within clear ethical boundaries. The five-step architecture covers everything from responsible data collection through Claude-powered analysis to automated change detection and reporting. It's production-ready with some hardening work (adding proper error handling, logging, and monitoring), and it scales cleanly from a solo founder watching three competitors to a growth team monitoring an entire industry vertical.

The most important thing to do next is actually run it. Put in your first three competitor URLs, let it collect a baseline, and come back a week later to see what Claude surfaces. The first time it catches a pricing change you didn't notice, the value of the entire project will be immediately obvious.

If you want to go deeper into building production-ready tools with Claude Code — beyond this pipeline into more complex architectures, API integrations, and multi-agent systems — Adventure Media is running a hands-on workshop called Master Claude Code in One Day. It's designed for people who are past the basics and ready to build real things that solve real business problems, which is exactly the mindset this guide was written for.

Competitive intelligence is no longer a resource reserved for enterprise teams with dedicated analysts. With Claude Code and a few hundred lines of well-structured Python, it's available to any team that decides to build it.

Ready to Master Claude Code?

Stop reading tutorials and start building. Adventure Media's "Master Claude Code in One Day" workshop takes you from zero to building real, functional AI tools — in a single day. Hands-on projects. Expert guidance. No coding experience required.

Reserve Your Spot — Seats Are Limited
