
Your competitor just dropped their prices. You didn't notice for three weeks. By the time you adjusted, you'd already lost a handful of deals to someone who looked more competitive on paper. Sound familiar? This is the tax every business pays for manual competitive intelligence — the slow, labor-intensive process of checking competitor websites by hand, copying pricing tables into spreadsheets, and hoping you catch the important changes before they hurt you.
Claude Code changes that equation entirely. With a few hundred lines of Python and the right architectural approach, you can build a system that monitors competitor websites continuously, extracts structured data automatically, flags meaningful changes, and synthesizes findings into actionable intelligence reports — all while staying firmly within the bounds of ethical, legal web data collection.
This guide walks you through building that system from scratch. We'll cover the technical implementation step by step, including the exact code patterns, configuration values, and Claude Code prompts that make each piece work. Whether you're a solo founder trying to stay ahead of three competitors or a growth team tracking an entire industry, the same framework applies.
What you'll build: A modular competitive intelligence pipeline that scrapes publicly available data, uses Claude to analyze and categorize it, stores changes over time, and produces structured reports on pricing, content strategy, and messaging shifts — automatically.
Estimated total build time: 4-6 hours across five steps. Prerequisites: Basic Python familiarity, a Claude API key (available at console.anthropic.com), and a working Python environment (3.10+).
Before building any scraping tool, you must understand what you're legally and ethically permitted to collect. This isn't a disclaimer buried at the bottom — it's the foundation of the entire project. Getting this wrong doesn't just create legal exposure; it can get your IP address blocked, generate cease-and-desist letters, and damage your reputation if it becomes public that your "competitive intelligence" crossed lines it shouldn't have.
Publicly accessible information — data that any visitor can see in their browser without logging in — is generally fair game for automated collection, though this area of law continues to evolve. The landmark hiQ Labs v. LinkedIn case established precedent suggesting that scraping publicly available data doesn't automatically violate the Computer Fraud and Abuse Act. What you can typically collect: pricing pages, product descriptions, public blog posts, public job listings, and meta information like page titles and structured data.
You should not collect: data behind login walls, data explicitly prohibited in a site's robots.txt file (which we'll check programmatically), personally identifiable information about individuals, or data at a volume or rate that constitutes a denial-of-service attack on the target server. Aggressive scraping that hammers a server with thousands of requests per minute is both unethical and potentially illegal under computer abuse statutes, regardless of what data you're collecting.
Your tool must check and respect robots.txt before collecting anything. This is non-negotiable. Here's the module you'll build first:
import urllib.robotparser
import urllib.parse

def check_robots_permission(url: str, user_agent: str = "CompetitiveIntelligenceBot") -> bool:
    """Returns True if scraping this URL is permitted by robots.txt"""
    parsed = urllib.parse.urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        # If robots.txt is unreachable, default to conservative behavior
        return False

def get_crawl_delay(url: str, user_agent: str = "CompetitiveIntelligenceBot") -> float:
    """Returns the required crawl delay in seconds, minimum 2.0"""
    parsed = urllib.parse.urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
        delay = rp.crawl_delay(user_agent)
        return max(delay or 2.0, 2.0)  # Always wait at least 2 seconds
    except Exception:
        return 2.0
Pro tip: Set your user agent string to something identifiable and include a contact email in your request headers. Something like CompetitiveResearchBot/1.0 (contact@yourcompany.com) is transparent about who you are and gives site owners a way to reach you if there's an issue. This is basic professional courtesy and reduces friction.
Common mistake to avoid: Don't skip robots.txt checking because "everyone does it." The fact that unethical scraping is common doesn't make it acceptable. Beyond the legal risk, your tool's longevity depends on not getting blocked — and sites that detect aggressive scraping will implement countermeasures that break your pipeline.
Estimated time for this step: 30 minutes to implement the robots checker and review your target competitor sites' robots.txt files manually.
A well-organized project structure will save you hours of debugging later and make the tool maintainable as your competitor list grows. Claude Code works best when you give it a clear, consistent file structure to work within — it can reference existing modules, understand your data schemas, and generate new components that fit naturally into what's already there.
Create a new project directory and set up a virtual environment:
mkdir competitor-intel
cd competitor-intel
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install anthropic requests beautifulsoup4 lxml playwright \
    python-dotenv pyyaml schedule rich pandas
Note that sqlite3 ships with Python's standard library, so it isn't installed via pip, and pyyaml is required to read the config.yaml file you'll create below.
Install Playwright's browser binaries:
playwright install chromium
You'll use Playwright for JavaScript-heavy pages where requests alone won't render dynamic content. Many modern competitor pricing pages load their data via JavaScript, and a headless browser is the only way to get the actual rendered content.
competitor-intel/
├── .env # API keys (never commit this)
├── config.yaml # Competitor URLs and scraping settings
├── main.py # Entry point and scheduler
├── scraper/
│ ├── __init__.py
│ ├── robots_checker.py # The module from Step 1
│ ├── static_scraper.py # requests + BeautifulSoup
│ └── dynamic_scraper.py # Playwright for JS pages
├── analyzer/
│ ├── __init__.py
│ ├── claude_client.py # Claude API wrapper
│ ├── pricing_analyzer.py
│ ├── content_analyzer.py
│ └── change_detector.py
├── storage/
│ ├── __init__.py
│ └── database.py # SQLite operations
├── reports/
│ ├── __init__.py
│ └── report_generator.py
└── data/
└── competitor_intel.db # Auto-created SQLite database
Create a .env file at the project root with your API key and scraping defaults:
ANTHROPIC_API_KEY=sk-ant-your-key-here
CLAUDE_MODEL=claude-opus-4-5
REQUEST_DELAY_SECONDS=3
MAX_PAGES_PER_DOMAIN=20
In config.yaml, define your competitor list. Structure it by company, not by URL, so you can group related pages:
competitors:
  - name: "CompetitorA"
    domain: "https://competitor-a.com"
    pages:
      - url: "https://competitor-a.com/pricing"
        type: "pricing"
        js_required: false
      - url: "https://competitor-a.com/blog"
        type: "content"
        js_required: false
      - url: "https://competitor-a.com/features"
        type: "features"
        js_required: true
  - name: "CompetitorB"
    domain: "https://competitor-b.com"
    pages:
      - url: "https://competitor-b.com/plans"
        type: "pricing"
        js_required: true
Common mistake to avoid: Don't add every page on a competitor's site. Focus on high-signal pages: pricing, features, homepage, about, careers (for growth signals), and blog index. Adding too many pages creates noise and increases your API costs substantially.
Pro tip: Add a priority: high/medium/low field to each page entry. High-priority pages (like pricing) should be checked daily; low-priority pages (like about pages) weekly. This prevents unnecessary API calls and keeps your costs manageable.
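One way to act on the priority field, as a sketch (the field itself and these interval values are assumptions, not something the config above already defines):

```python
import time

# Hypothetical mapping from a page's priority field to the minimum
# number of seconds between checks (daily / every 3 days / weekly).
PRIORITY_INTERVALS = {
    "high": 24 * 3600,
    "medium": 3 * 24 * 3600,
    "low": 7 * 24 * 3600,
}

def is_due(page_config: dict, last_checked: float, now: float = None) -> bool:
    """True if enough time has passed since this page was last scraped."""
    now = time.time() if now is None else now
    priority = page_config.get("priority", "medium")
    interval = PRIORITY_INTERVALS.get(priority, PRIORITY_INTERVALS["medium"])
    return (now - last_checked) >= interval
```

In the main loop, you would call is_due for each page (using the last scraped_at from the database) and skip pages that aren't due yet.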
Estimated time for this step: 45 minutes including dependency installation and initial configuration.
The scraping layer is your data collection foundation, and it needs to handle two fundamentally different types of web pages: those that deliver all their content in the initial HTML response, and those that require JavaScript execution to render their actual content. Getting this distinction right is what separates a scraper that works reliably from one that silently returns empty data half the time.
Build scraper/static_scraper.py:
import requests
from bs4 import BeautifulSoup
import time
import json
import os
from scraper.robots_checker import check_robots_permission, get_crawl_delay

HEADERS = {
    "User-Agent": f"CompetitiveResearchBot/1.0 (contact@{os.getenv('COMPANY_DOMAIN', 'example.com')})",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

def scrape_static_page(url: str) -> dict:
    """
    Scrapes a static page. Returns structured content dict.
    Returns None if robots.txt disallows or the request fails.
    """
    if not check_robots_permission(url):
        print(f"[BLOCKED] robots.txt disallows: {url}")
        return None
    delay = get_crawl_delay(url)
    time.sleep(delay)
    try:
        response = requests.get(url, headers=HEADERS, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml')
        # Remove script and style tags — they're noise for text analysis
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        return {
            "url": url,
            "status_code": response.status_code,
            "title": soup.title.string if soup.title and soup.title.string else "",
            "meta_description": get_meta_description(soup),
            "h1_tags": [h.get_text(strip=True) for h in soup.find_all('h1')],
            "h2_tags": [h.get_text(strip=True) for h in soup.find_all('h2')],
            "body_text": soup.get_text(separator='\n', strip=True)[:15000],  # Limit for API
            "structured_data": extract_structured_data(soup),
            "scraped_at": time.time()
        }
    except requests.RequestException as e:
        print(f"[ERROR] Failed to scrape {url}: {e}")
        return None

def get_meta_description(soup) -> str:
    meta = soup.find('meta', attrs={'name': 'description'})
    return meta['content'] if meta and meta.get('content') else ""

def extract_structured_data(soup) -> list:
    """Extracts JSON-LD structured data — often contains pricing info"""
    structured = []
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
            structured.append(data)
        except (json.JSONDecodeError, TypeError):
            pass
    return structured
Build scraper/dynamic_scraper.py:
from playwright.sync_api import sync_playwright
from scraper.robots_checker import check_robots_permission, get_crawl_delay
from bs4 import BeautifulSoup
import time

def scrape_dynamic_page(url: str, wait_selector: str = None) -> dict:
    """
    Uses headless Chromium to render JavaScript-heavy pages.
    wait_selector: CSS selector to wait for before extracting content.
    """
    if not check_robots_permission(url):
        print(f"[BLOCKED] robots.txt disallows: {url}")
        return None
    delay = get_crawl_delay(url)
    time.sleep(delay)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="CompetitiveResearchBot/1.0",
            extra_http_headers={"Accept-Language": "en-US,en;q=0.9"}
        )
        page = context.new_page()
        try:
            page.goto(url, wait_until="networkidle", timeout=30000)
            if wait_selector:
                page.wait_for_selector(wait_selector, timeout=10000)
            # Additional wait for dynamic content
            page.wait_for_timeout(2000)
            html_content = page.content()
            soup = BeautifulSoup(html_content, 'lxml')
            for tag in soup(['script', 'style', 'nav', 'footer']):
                tag.decompose()
            return {
                "url": url,
                "title": page.title(),
                "body_text": soup.get_text(separator='\n', strip=True)[:15000],
                "h1_tags": [h.get_text(strip=True) for h in soup.find_all('h1')],
                "h2_tags": [h.get_text(strip=True) for h in soup.find_all('h2')],
                "scraped_at": time.time(),
                "rendered": True
            }
        except Exception as e:
            print(f"[ERROR] Playwright failed for {url}: {e}")
            return None
        finally:
            browser.close()
Pro tip: For pricing pages that load prices via API calls (which is increasingly common in SaaS), open your browser's Developer Tools, go to the Network tab, and look for XHR/fetch calls when the pricing loads. Sometimes you can call that API endpoint directly, which is far more reliable than scraping rendered HTML. Check whether that endpoint is documented or publicly accessible before using it.
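If you do find such an endpoint, calling it could look like this sketch (the endpoint URL and the response shape here are purely hypothetical; discover the real ones in the Network tab and verify you're permitted to fetch them):

```python
import requests

def fetch_pricing_api(endpoint: str) -> dict:
    """Fetches a pricing API endpoint directly and returns parsed JSON.
    The endpoint URL is whatever you observed in the Network tab; confirm
    robots.txt and the site's terms permit fetching it first."""
    response = requests.get(endpoint, headers={
        "User-Agent": "CompetitiveResearchBot/1.0",
        "Accept": "application/json",
    }, timeout=15)
    response.raise_for_status()
    return response.json()

def extract_plan_names(payload: dict) -> list:
    """Pulls plan names from a payload shaped like {"plans": [{"name": ...}]}.
    That shape is an assumption; adapt it to the response you actually see."""
    return [plan.get("name") for plan in payload.get("plans", [])]
```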
Common mistake to avoid: Using wait_until="load" instead of wait_until="networkidle" in Playwright. The "load" event fires before many JavaScript frameworks have finished rendering their content. "networkidle" waits until there have been no network connections for at least 500ms, which is much more reliable for dynamic pages.
Estimated time for this step: 60-90 minutes including testing against your actual competitor URLs.
This is where the project transforms from a basic scraper into a genuine intelligence tool. Raw HTML text is nearly useless for decision-making. Claude's ability to read unstructured text and extract structured, categorized, actionable information is what makes this architecture valuable. A competitor's pricing page might contain paragraphs of marketing copy, feature comparison tables, FAQ sections, and footnotes — Claude can parse all of that and return a clean JSON object with the actual prices, plan names, and key differentiators.
Build analyzer/claude_client.py:
import anthropic
import json
import os
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
MODEL = os.getenv("CLAUDE_MODEL", "claude-opus-4-5")

def analyze_with_claude(prompt: str, content: str, system_prompt: str = None) -> str:
    """
    Sends content to Claude for analysis. Returns the response text.
    """
    messages = [
        {
            "role": "user",
            "content": f"{prompt}\n\n---CONTENT---\n{content}"
        }
    ]
    system = system_prompt or (
        "You are a competitive intelligence analyst. Extract structured, accurate information "
        "from web page content. Always respond with valid JSON unless instructed otherwise. "
        "If information is not present in the content, use null rather than guessing."
    )
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        system=system,
        messages=messages
    )
    return response.content[0].text

def parse_json_response(response_text: str) -> dict:
    """Safely parses JSON from Claude's response, handling markdown code blocks"""
    text = response_text.strip()
    if text.startswith("```json"):
        text = text[7:]
    elif text.startswith("```"):
        text = text[3:]
    if text.endswith("```"):
        text = text[:-3]
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError as e:
        print(f"[WARNING] JSON parse failed: {e}")
        return {"raw_response": response_text, "parse_error": str(e)}
Build analyzer/pricing_analyzer.py:
from analyzer.claude_client import analyze_with_claude, parse_json_response

PRICING_PROMPT = """
Analyze this pricing page content and extract all pricing information.
Return a JSON object with this exact structure:
{
  "plans": [
    {
      "name": "Plan name",
      "monthly_price": 0.00 or null,
      "annual_price": 0.00 or null,
      "annual_monthly_equivalent": 0.00 or null,
      "currency": "USD",
      "billing_period": "monthly|annual|one-time|usage-based",
      "user_limit": "number or 'unlimited' or null",
      "key_features": ["feature1", "feature2"],
      "highlighted_features": ["features emphasized in marketing copy"],
      "cta_text": "The call-to-action button text"
    }
  ],
  "free_trial": true/false,
  "free_tier_available": true/false,
  "enterprise_option": true/false,
  "pricing_model": "per-seat|flat-rate|usage-based|tiered|hybrid",
  "key_differentiators": ["What they emphasize vs competitors"],
  "recent_changes_mentioned": ["Any 'new', 'updated', 'launching soon' language found"],
  "confidence_score": 0-100
}
"""

def analyze_pricing_page(scraped_data: dict) -> dict:
    content = f"""
Page Title: {scraped_data.get('title', '')}
H1 Tags: {', '.join(scraped_data.get('h1_tags', []))}
H2 Tags: {', '.join(scraped_data.get('h2_tags', []))}
Body Text:
{scraped_data.get('body_text', '')}
"""
    response = analyze_with_claude(PRICING_PROMPT, content)
    result = parse_json_response(response)
    result['source_url'] = scraped_data.get('url')
    result['scraped_at'] = scraped_data.get('scraped_at')
    return result
Build analyzer/content_analyzer.py. This one is particularly valuable for tracking how competitors are shifting their messaging — are they moving upmarket? Targeting a new persona? Pivoting their positioning?
from analyzer.claude_client import analyze_with_claude, parse_json_response

CONTENT_PROMPT = """
Analyze this blog/content page and extract strategic intelligence about the company's content approach.
Return JSON with this structure:
{
  "primary_topics": ["main topics covered"],
  "target_personas": ["inferred audience segments"],
  "content_stage": "awareness|consideration|decision",
  "seo_keywords_apparent": ["keywords they appear to be targeting"],
  "messaging_themes": ["core messages and value propositions"],
  "competitive_positioning": "how they position against alternatives",
  "pain_points_addressed": ["customer problems they highlight"],
  "tone": "technical|conversational|formal|casual|urgent",
  "calls_to_action": ["CTAs found on page"],
  "publication_recency": "recent|moderate|dated based on content references",
  "strategic_signals": ["anything suggesting strategic direction changes"]
}
"""

def analyze_content_page(scraped_data: dict) -> dict:
    content = f"""
Title: {scraped_data.get('title', '')}
Meta Description: {scraped_data.get('meta_description', '')}
H1: {', '.join(scraped_data.get('h1_tags', []))}
H2s: {', '.join(scraped_data.get('h2_tags', []))}
Body: {scraped_data.get('body_text', '')[:8000]}
"""
    response = analyze_with_claude(CONTENT_PROMPT, content)
    return parse_json_response(response)
Pro tip: The strategic_signals field in the content analyzer is often where the most valuable intelligence surfaces. When a competitor's blog starts publishing heavily about a new topic — say, AI features, enterprise compliance, or a specific industry vertical — that's often a leading indicator of a product roadmap or marketing pivot, months before a formal announcement.
Common mistake to avoid: Sending the entire raw page body to Claude without trimming it first. Most pricing pages have extensive boilerplate, repeated navigation text, and footer content that adds tokens without adding analytical value. The 15,000-character limit in the scraper module and the 8,000-character limit in the content analyzer are intentional — they keep your API costs reasonable and actually improve response quality by forcing Claude to work with signal-dense content.
Estimated time for this step: 60-75 minutes to build and test both analyzers.
A competitive intelligence tool that requires you to manually compare this week's data to last week's is only marginally better than doing it by hand. The real value comes from automated change detection — a system that knows what the baseline looked like, identifies meaningful differences, and surfaces only what matters rather than overwhelming you with raw data dumps.
Build storage/database.py using SQLite for simplicity (upgrade to PostgreSQL for team deployments):
import sqlite3
import json
import time
import os

DB_PATH = os.path.join("data", "competitor_intel.db")

def init_database():
    os.makedirs("data", exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    cursor.executescript("""
        CREATE TABLE IF NOT EXISTS scrape_results (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            competitor_name TEXT NOT NULL,
            url TEXT NOT NULL,
            page_type TEXT NOT NULL,
            raw_content TEXT,
            analyzed_data TEXT,
            scraped_at REAL,
            created_at REAL DEFAULT (unixepoch())  -- unixepoch() requires SQLite 3.38+
        );
        CREATE TABLE IF NOT EXISTS changes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            competitor_name TEXT NOT NULL,
            page_type TEXT NOT NULL,
            change_type TEXT NOT NULL,
            change_summary TEXT,
            old_value TEXT,
            new_value TEXT,
            severity TEXT DEFAULT 'medium',
            detected_at REAL DEFAULT (unixepoch())
        );
        CREATE INDEX IF NOT EXISTS idx_scrape_url ON scrape_results(url, scraped_at);
        CREATE INDEX IF NOT EXISTS idx_changes_competitor ON changes(competitor_name, detected_at);
    """)
    conn.commit()
    conn.close()

def save_scrape_result(competitor_name: str, url: str, page_type: str,
                       raw_content: dict, analyzed_data: dict):
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO scrape_results (competitor_name, url, page_type, raw_content, analyzed_data, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?)
    """, (competitor_name, url, page_type,
          json.dumps(raw_content), json.dumps(analyzed_data), time.time()))
    conn.commit()
    conn.close()

def get_previous_analysis(url: str) -> dict:
    """Gets the most recent stored analysis for comparison. Call this
    BEFORE saving the current run's result, so the newest row in the
    table is still the previous snapshot."""
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.execute("""
        SELECT analyzed_data FROM scrape_results
        WHERE url = ?
        ORDER BY scraped_at DESC
        LIMIT 1
    """, (url,))
    row = cursor.fetchone()
    conn.close()
    return json.loads(row[0]) if row else None
Build analyzer/change_detector.py. This is where Claude does its most impressive work — comparing two snapshots of analyzed data and producing a human-readable, prioritized summary of what changed:
from analyzer.claude_client import analyze_with_claude, parse_json_response
import json

CHANGE_DETECTION_PROMPT = """
Compare these two competitive intelligence snapshots (previous vs current) and identify meaningful changes.
Ignore trivial differences like minor wording changes. Focus on strategically significant changes.
Return JSON:
{
  "changes_detected": true/false,
  "change_count": 0,
  "changes": [
    {
      "type": "pricing_change|feature_addition|feature_removal|messaging_shift|positioning_change|new_cta|plan_restructure",
      "severity": "high|medium|low",
      "summary": "One sentence description of the change",
      "strategic_implication": "What this likely means for their strategy",
      "recommended_action": "What you should consider doing in response",
      "old_value": "previous state",
      "new_value": "current state"
    }
  ],
  "overall_assessment": "Executive summary of competitive landscape shift"
}
"""

def detect_changes(competitor_name: str, page_type: str,
                   previous_data: dict, current_data: dict) -> dict:
    if not previous_data:
        return {"changes_detected": False, "message": "No baseline to compare against"}
    content = f"""
PREVIOUS SNAPSHOT:
{json.dumps(previous_data, indent=2)[:6000]}

CURRENT SNAPSHOT:
{json.dumps(current_data, indent=2)[:6000]}

Competitor: {competitor_name}
Page Type: {page_type}
"""
    response = analyze_with_claude(CHANGE_DETECTION_PROMPT, content)
    result = parse_json_response(response)
    result['competitor_name'] = competitor_name
    result['page_type'] = page_type
    return result
Build reports/report_generator.py to produce a clean weekly digest:
import sqlite3
import time
from datetime import datetime
from analyzer.claude_client import analyze_with_claude

DB_PATH = "data/competitor_intel.db"

def generate_weekly_report() -> str:
    conn = sqlite3.connect(DB_PATH)
    # Get all changes from the past 7 days
    week_ago = time.time() - (7 * 24 * 3600)
    # Rank severity explicitly; a plain ORDER BY severity DESC would sort
    # alphabetically and put 'medium' ahead of 'high'
    cursor = conn.execute("""
        SELECT competitor_name, page_type, change_type, change_summary,
               severity, old_value, new_value, detected_at
        FROM changes
        WHERE detected_at > ?
        ORDER BY CASE severity WHEN 'high' THEN 0 WHEN 'medium' THEN 1 ELSE 2 END,
                 detected_at DESC
    """, (week_ago,))
    changes = cursor.fetchall()
    conn.close()
    if not changes:
        return "No significant changes detected in the past 7 days."
    changes_text = "\n".join([
        f"[{c[4].upper()}] {c[0]} - {c[2]}: {c[3]}"
        for c in changes
    ])
    synthesis_prompt = """
Synthesize these competitive intelligence findings into a concise executive briefing.
Structure it as:
1. Top 3 most urgent findings requiring immediate action
2. Strategic trends observed across competitors
3. Recommended actions for this week
Write in clear, direct business language. No bullet-point padding.
"""
    report = analyze_with_claude(synthesis_prompt, changes_text,
                                 system_prompt="You are a senior competitive strategist writing an executive briefing.")
    timestamp = datetime.now().strftime("%m/%d/%Y")
    return f"COMPETITIVE INTELLIGENCE BRIEF — Week of {timestamp}\n\n{report}"
Finally, wire everything together in main.py:
import yaml
import schedule
import time
from storage.database import init_database, save_scrape_result, get_previous_analysis
from scraper.static_scraper import scrape_static_page
from scraper.dynamic_scraper import scrape_dynamic_page
from analyzer.pricing_analyzer import analyze_pricing_page
from analyzer.content_analyzer import analyze_content_page
from analyzer.change_detector import detect_changes
from reports.report_generator import generate_weekly_report

def load_config():
    with open("config.yaml") as f:
        return yaml.safe_load(f)

def run_intelligence_cycle():
    config = load_config()
    for competitor in config['competitors']:
        name = competitor['name']
        print(f"\n[INFO] Processing {name}...")
        for page_config in competitor['pages']:
            url = page_config['url']
            page_type = page_config['type']
            js_required = page_config.get('js_required', False)
            # Scrape
            if js_required:
                raw_data = scrape_dynamic_page(url)
            else:
                raw_data = scrape_static_page(url)
            if not raw_data:
                print(f"[SKIP] Failed to scrape: {url}")
                continue
            # Analyze
            if page_type == 'pricing':
                analyzed = analyze_pricing_page(raw_data)
            else:
                analyzed = analyze_content_page(raw_data)
            # Get the previous snapshot BEFORE saving the current one
            previous = get_previous_analysis(url)
            # Save current
            save_scrape_result(name, url, page_type, raw_data, analyzed)
            # Detect changes
            if previous:
                changes = detect_changes(name, page_type, previous, analyzed)
                if changes.get('changes_detected'):
                    print(f"[ALERT] Changes detected for {name} - {page_type}!")
                    # Here: persist to the changes table, send Slack notification, email, etc.
            time.sleep(2)  # Polite delay between pages

def main():
    init_database()
    # Run immediately on startup
    run_intelligence_cycle()
    # Schedule ongoing runs
    schedule.every().day.at("07:00").do(run_intelligence_cycle)
    schedule.every().monday.at("08:00").do(lambda: print(generate_weekly_report()))
    while True:
        schedule.run_pending()
        time.sleep(60)

if __name__ == "__main__":
    main()
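One gap worth closing: run_intelligence_cycle prints alerts but never writes to the changes table that generate_weekly_report reads from. A minimal persistence helper, assuming the changes schema from storage/database.py (you would call save_change for each entry in changes['changes'] right after the alert print):

```python
import sqlite3
import json
import time

DB_PATH = "data/competitor_intel.db"

def save_change(competitor_name: str, page_type: str, change: dict):
    """Persists one detected change so the weekly report can find it.
    Expects the per-change dict shape defined in CHANGE_DETECTION_PROMPT."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO changes (competitor_name, page_type, change_type,
                             change_summary, old_value, new_value, severity, detected_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """, (competitor_name, page_type,
          change.get("type", "unknown"),
          change.get("summary", ""),
          json.dumps(change.get("old_value")),
          json.dumps(change.get("new_value")),
          change.get("severity", "medium"),
          time.time()))
    conn.commit()
    conn.close()
```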
Pro tip: Add Slack webhook notifications for high-severity changes. When a competitor drops their price or adds a major feature, you want to know within hours, not at your Monday morning briefing. The requests library makes this a five-line addition to the change detection logic.
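A sketch of that addition, using Slack's Incoming Webhooks (the webhook URL is a placeholder you create in your Slack workspace; keep the real one in .env, not in code):

```python
import requests

def format_alert(change: dict) -> str:
    """Builds the one-line Slack message for a detected change."""
    return (f"[{change.get('severity', 'medium').upper()}] "
            f"{change.get('type', 'change')}: {change.get('summary', '')}")

def notify_slack(webhook_url: str, change: dict) -> bool:
    """Posts the alert to a Slack Incoming Webhook. Returns True on success."""
    resp = requests.post(webhook_url, json={"text": format_alert(change)}, timeout=10)
    return resp.status_code == 200
```

Calling notify_slack only for changes whose severity is "high" keeps the channel quiet enough that alerts stay meaningful.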
Common mistake to avoid: Running the full intelligence cycle too frequently. Daily is appropriate for pricing pages; weekly is usually sufficient for content and feature pages. Overly frequent scraping increases your API costs, strains competitor servers, and produces noise from minor, irrelevant changes.
Estimated time for this step: 90 minutes to build and 30 minutes to run initial tests.
Once your core pipeline is running, you have a foundation that can be extended in several high-value directions without rebuilding from scratch. The modular architecture you've built is specifically designed to accommodate new data sources, new analysis types, and new output formats by adding new modules rather than modifying existing ones.
One of the most valuable competitive signals — and one that most teams miss entirely — is monitoring what competitors are doing with their paid advertising. While you can't directly access their ad accounts, you can monitor their landing pages, which often reveal their current messaging priorities, offers, and conversion strategies. Add landing pages as type: "landing_page" entries in your config.yaml, and extend the content analyzer with a landing-page-specific prompt that looks for: urgency language, specific offer terms, social proof claims, and A/B testing signals (like multiple variants of a page being accessible via different URL parameters).
For more sophisticated ad intelligence, tools like the Meta Ad Library and the Google Ads Transparency Center provide public access to active ad creative. With Claude Code, you can build scrapers for these platforms (they have their own APIs and public-facing data) and feed that intelligence back into your analysis pipeline. Meta's Ad Library is particularly valuable for B2C competitors — you can see exactly what creative they're running, for how long, and across which regions.
Competitor job postings are an underutilized intelligence source. A competitor suddenly posting five senior AI/ML engineer roles tells you something about their product roadmap that no press release will. A cluster of sales hiring in a specific geographic region suggests market expansion. Build a job listing scraper that monitors their careers page (usually a simple static page) and uses Claude to extract: role categories, seniority distribution, required technology stack, and location patterns. This kind of intelligence is consistently available, ethically unambiguous, and highly predictive.
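A sketch of what that careers-page analyzer could look like, following the same pattern as the other analyzers (the JSON field names are suggestions, not a fixed schema):

```python
# Careers-page analyzer sketch. The import is deferred so this snippet reads
# standalone; in the real project it would live in analyzer/ alongside the
# pricing and content analyzers.
JOBS_PROMPT = """
Analyze this careers/job-listings page and extract hiring intelligence.
Return JSON with this structure:
{
  "role_categories": ["engineering", "sales", "marketing"],
  "seniority_distribution": {"junior": 0, "mid": 0, "senior": 0, "exec": 0},
  "tech_stack_signals": ["technologies named in job requirements"],
  "location_patterns": ["cities or regions being hired into"],
  "strategic_signals": ["what this hiring pattern suggests about roadmap or expansion"]
}
"""

def analyze_jobs_page(scraped_data: dict) -> dict:
    from analyzer.claude_client import analyze_with_claude, parse_json_response
    content = f"""
Title: {scraped_data.get('title', '')}
Body: {scraped_data.get('body_text', '')[:10000]}
"""
    return parse_json_response(analyze_with_claude(JOBS_PROMPT, content))
```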
The SQLite database can be queried directly by tools like Retool, Metabase, or even Google Sheets (via a Python script that exports to CSV). For teams already using data visualization platforms, adding a "competitive intelligence" dashboard fed by this pipeline takes a few hours and creates genuinely useful visibility that most teams don't have.
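For example, a small export helper using the pandas dependency installed earlier (the output path is arbitrary):

```python
import sqlite3
import pandas as pd

def export_changes_csv(db_path: str = "data/competitor_intel.db",
                       out_path: str = "changes_export.csv") -> int:
    """Dumps the changes table to CSV for Sheets or a BI tool.
    Returns the number of rows exported."""
    conn = sqlite3.connect(db_path)
    df = pd.read_sql_query(
        "SELECT competitor_name, page_type, change_type, change_summary, "
        "severity, detected_at FROM changes ORDER BY detected_at DESC",
        conn)
    conn.close()
    df.to_csv(out_path, index=False)
    return len(df)
```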
If you're interested in taking your Claude Code skills significantly further — building more complex pipelines, integrating with APIs, and creating production-ready tools — Adventure Media's hands-on Claude Code workshop covers exactly this kind of real-world project development in a single day. The workshop is built for people who already understand the basics and want to move into building actual tools, which is precisely where this competitive intelligence project sits.
Scraping publicly available information is generally legal in the US, but the law is not fully settled. The Computer Fraud and Abuse Act (CFAA) has been interpreted in various ways, and the hiQ v. LinkedIn case provided some protection for public data scraping. However, you should always: respect robots.txt, avoid logging into competitors' systems, not scrape at volumes that constitute a denial-of-service attack, and review each target site's Terms of Service. Consult a technology attorney if you're doing this at scale or in a regulated industry. This guide is not legal advice.
For a typical setup monitoring 5 competitors with 3-4 pages each, daily runs will cost roughly $5-15 per month in Claude API fees — depending on page length and your model choice. Using claude-haiku for initial extraction and claude-opus only for change synthesis and report generation can reduce costs significantly. Monitor your usage in the Anthropic console and set billing alerts.
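One way to implement that split, as a sketch (the model identifiers are examples; check the Anthropic console for current names before relying on them):

```python
import os

# Cheap model for per-page extraction, stronger model for synthesis.
# Both identifiers are assumptions; override them via environment variables.
TASK_MODELS = {
    "extraction": os.getenv("EXTRACTION_MODEL", "claude-haiku-4-5"),
    "synthesis": os.getenv("SYNTHESIS_MODEL", "claude-opus-4-5"),
}

def model_for(task: str) -> str:
    """Returns the model for a task, defaulting to the cheaper extraction model."""
    return TASK_MODELS.get(task, TASK_MODELS["extraction"])
```

You would then pass model_for("extraction") into the page analyzers and model_for("synthesis") into change detection and report generation, which means threading a model parameter through analyze_with_claude.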
Don't attempt to access any content behind authentication. If a pricing page requires login, that's a clear signal that the company considers it non-public. In these cases, use alternative sources: industry databases, G2/Capterra reviews (which often include pricing), press releases, or sales intelligence platforms like ZoomInfo. You can feed manually collected data into your Claude analysis pipeline just as easily as scraped data.
The most common source of false positives is dynamic content like "featured this week" sections, date-stamped content, and personalized elements. Mitigate this by: adding CSS selectors to specifically target the content sections you care about rather than grabbing full page text, running the scraper multiple times on the same day before flagging a change as confirmed, and tuning your Claude prompt to explicitly ignore "trivial wording changes" and time-sensitive content.
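The selector-targeting idea could look like this sketch (the selectors themselves would live in config.yaml per page; html.parser is used here so the example has no lxml dependency, though the scrapers above use lxml):

```python
from bs4 import BeautifulSoup

def extract_target_sections(html: str, selectors: list) -> str:
    """Returns only the text inside the given CSS selectors, so volatile
    regions like 'featured this week' blocks never reach the change detector."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for selector in selectors:
        for node in soup.select(selector):
            parts.append(node.get_text(separator="\n", strip=True))
    return "\n".join(parts)
```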
Yes, with some adaptation. You can use the Google Ads Transparency Center to see active ads for any advertiser. Your scraper can collect this public data, and Claude can analyze messaging patterns, offer language, and positioning shifts. This is particularly valuable if you're in a competitive paid search environment where competitor ad copy directly affects your click-through rates and Quality Scores.
Match frequency to the volatility of the information type. Pricing pages: daily or every two days. Feature pages: twice weekly. Blog/content pages: weekly. Homepage: twice weekly (messaging changes fast). Job listings: weekly. Running everything daily creates unnecessary costs and API load without proportional intelligence value.
The key practices are: respect crawl delays, use realistic request headers, limit requests to 1-2 per minute per domain, and don't scrape the same domain from multiple threads simultaneously. If you do get blocked (you'll see 403 or 429 responses), wait 24-48 hours before retrying, reduce your frequency, and consider using a residential proxy service if the intelligence is valuable enough to warrant the cost. Never try to circumvent anti-bot systems — that's where you cross from competitive intelligence into legally risky territory.
Social media platforms have strict Terms of Service against automated scraping, and most have APIs with their own rate limits and data access policies. For Twitter/X and LinkedIn, use their official APIs within permitted rate limits. For Facebook, the Ad Library API is the legitimate path for ad intelligence. Attempting to scrape social platforms outside their official APIs is high-risk and frequently breaks as platforms update their anti-bot measures.
If a site is aggressively blocking automated requests, that's a strong signal to stop and find alternative data sources. Heavy bot protection indicates the site owner actively doesn't want automated access. Attempting to bypass Cloudflare or similar systems violates terms of service and potentially computer abuse laws. Use manual collection, third-party data providers, or focus your automation on sites with permissive robots.txt and no bot protection.
The report generator can output to Slack, email, or a shared document depending on your team's workflow. For Slack, add a webhook integration to the report generator. For email, Python's smtplib with a SendGrid or Amazon SES integration sends the weekly report automatically. For documentation, outputting to a Notion database via their API keeps all competitive intelligence searchable and linked to other strategic documents.
The architecture in this guide uses the Anthropic Python SDK and is optimized for Claude's instruction-following capabilities, particularly for structured JSON extraction. You could adapt the analysis layer to use other models, but Claude's ability to reliably return valid JSON from complex, messy web content — and to follow nuanced extraction instructions without hallucinating fields — makes it the strongest choice for this specific use case as of early 2026. The structured output reliability is critical when downstream code depends on specific JSON fields being present and accurate.
Build a validation step into your pipeline that spot-checks 10-20% of extractions against the source URL. For pricing specifically, manually verify the first week's extractions against what you see in the browser before trusting the automated data. Add a confidence_score field to your prompts (as shown in the pricing analyzer) and treat anything below 70 as requiring manual review. Over time, you'll develop a sense for which page layouts Claude handles reliably and which require prompt refinement.
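A minimal sketch of that threshold rule. The function name is invented and the 70 cutoff comes from the text above; everything else is illustrative:

```python
def triage_extractions(extractions, threshold=70):
    """Split extraction results into auto-accepted vs needs-manual-review."""
    accepted, review = [], []
    for item in extractions:
        # A missing confidence_score is treated as zero, i.e. untrusted
        score = item.get("confidence_score") or 0
        (accepted if score >= threshold else review).append(item)
    return accepted, review

results = [
    {"source_url": "https://competitor-a.com/pricing", "confidence_score": 92},
    {"source_url": "https://competitor-b.com/plans", "confidence_score": 55},
]
accepted, review = triage_extractions(results)
# accepted holds the 92-score result; review holds the 55-score one
```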
The competitive intelligence gap between companies that monitor manually and those that monitor systematically isn't just a time efficiency difference — it's a strategic decision-making gap. When you know that a competitor restructured their pricing two days ago, you can respond in days rather than weeks. When you see that three competitors have started emphasizing the same new feature in their messaging, you have an early signal about where customer expectations are heading. When you notice a competitor's job postings shift heavily toward enterprise sales roles, you have months of lead time before their public announcement.
The framework you've built in this guide gives you all of that — automatically, continuously, and within clear ethical boundaries. The five-step architecture covers everything from responsible data collection through Claude-powered analysis to automated change detection and reporting. It's production-ready with some hardening work (adding proper error handling, logging, and monitoring), and it scales cleanly from a solo founder watching three competitors to a growth team monitoring an entire industry vertical.
The most important thing to do next is actually run it. Put in your first three competitor URLs, let it collect a baseline, and come back a week later to see what Claude surfaces. The first time it catches a pricing change you didn't notice, the value of the entire project will be immediately obvious.
If you want to go deeper into building production-ready tools with Claude Code — beyond this pipeline into more complex architectures, API integrations, and multi-agent systems — Adventure Media is running a hands-on workshop called Master Claude Code in One Day. It's designed for people who are past the basics and ready to build real things that solve real business problems, which is exactly the mindset this guide was written for.
Competitive intelligence is no longer a resource reserved for enterprise teams with dedicated analysts. With Claude Code and a few hundred lines of well-structured Python, it's available to any team that decides to build it.
Stop reading tutorials and start building. Adventure Media's "Master Claude Code in One Day" workshop takes you from zero to building real, functional AI tools — in a single day. Hands-on projects. Expert guidance. No coding experience required.
Publicly accessible information — data that any visitor can see in their browser without logging in — is generally fair game for automated collection. The landmark hiQ Labs v. LinkedIn case established precedent suggesting that scraping publicly available data doesn't automatically violate the Computer Fraud and Abuse Act, though this remains a developing area of law. What you can typically collect: pricing pages, product descriptions, public blog posts, public job listings, and meta information like page titles and structured data.
You should not collect: data behind login walls, data explicitly prohibited in a site's robots.txt file (which we'll check programmatically), personally identifiable information about individuals, or data at a volume or rate that constitutes a denial-of-service attack on the target server. Aggressive scraping that hammers a server with thousands of requests per minute is both unethical and potentially illegal under computer abuse statutes, regardless of what data you're collecting.
Your tool must check and respect robots.txt before collecting anything. This is non-negotiable. Here's the module you'll build first:
import urllib.robotparser
import urllib.parse

def check_robots_permission(url: str, user_agent: str = "CompetitiveIntelligenceBot") -> bool:
    """Returns True if scraping this URL is permitted by robots.txt"""
    parsed = urllib.parse.urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        # If robots.txt is unreachable, default to conservative behavior
        return False

def get_crawl_delay(url: str, user_agent: str = "CompetitiveIntelligenceBot") -> float:
    """Returns the required crawl delay in seconds, minimum 2.0"""
    parsed = urllib.parse.urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        rp.read()
        delay = rp.crawl_delay(user_agent)
        return max(delay or 2.0, 2.0)  # Always wait at least 2 seconds
    except Exception:
        return 2.0
Pro tip: Set your user agent string to something identifiable and include a contact email in your request headers. Something like CompetitiveResearchBot/1.0 (contact@yourcompany.com) is transparent about who you are and gives site owners a way to reach you if there's an issue. This is basic professional courtesy and reduces friction.
Common mistake to avoid: Don't skip robots.txt checking because "everyone does it." The fact that unethical scraping is common doesn't make it acceptable. Beyond the legal risk, your tool's longevity depends on not getting blocked — and sites that detect aggressive scraping will implement countermeasures that break your pipeline.
Estimated time for this step: 30 minutes to implement the robots checker and review your target competitor sites' robots.txt files manually.
A well-organized project structure will save you hours of debugging later and make the tool maintainable as your competitor list grows. Claude Code works best when you give it a clear, consistent file structure to work within — it can reference existing modules, understand your data schemas, and generate new components that fit naturally into what's already there.
Create a new project directory and set up a virtual environment:
mkdir competitor-intel
cd competitor-intel
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install anthropic requests beautifulsoup4 lxml playwright \
    python-dotenv pyyaml schedule rich pandas
Note that sqlite3 ships with the Python standard library, so it doesn't need installing; pyyaml is required to load config.yaml.
Install Playwright's browser binaries:
playwright install chromium
You'll use Playwright for JavaScript-heavy pages where requests alone won't render dynamic content. Many modern competitor pricing pages load their data via JavaScript, and a headless browser is the only way to get the actual rendered content.
Set up this project structure:
competitor-intel/
├── .env                     # API keys (never commit this)
├── config.yaml              # Competitor URLs and scraping settings
├── main.py                  # Entry point and scheduler
├── scraper/
│   ├── __init__.py
│   ├── robots_checker.py    # The module from Step 1
│   ├── static_scraper.py    # requests + BeautifulSoup
│   └── dynamic_scraper.py   # Playwright for JS pages
├── analyzer/
│   ├── __init__.py
│   ├── claude_client.py     # Claude API wrapper
│   ├── pricing_analyzer.py
│   ├── content_analyzer.py
│   └── change_detector.py
├── storage/
│   ├── __init__.py
│   └── database.py          # SQLite operations
├── reports/
│   ├── __init__.py
│   └── report_generator.py
└── data/
    └── competitor_intel.db  # Auto-created SQLite database
Create a .env file in the project root:
ANTHROPIC_API_KEY=sk-ant-your-key-here
CLAUDE_MODEL=claude-opus-4-5
REQUEST_DELAY_SECONDS=3
MAX_PAGES_PER_DOMAIN=20
This is where you define your competitor list. Structure it by company, not by URL, so you can group related pages:
competitors:
  - name: "CompetitorA"
    domain: "https://competitor-a.com"
    pages:
      - url: "https://competitor-a.com/pricing"
        type: "pricing"
        js_required: false
      - url: "https://competitor-a.com/blog"
        type: "content"
        js_required: false
      - url: "https://competitor-a.com/features"
        type: "features"
        js_required: true
  - name: "CompetitorB"
    domain: "https://competitor-b.com"
    pages:
      - url: "https://competitor-b.com/plans"
        type: "pricing"
        js_required: true
Common mistake to avoid: Don't add every page on a competitor's site. Focus on high-signal pages: pricing, features, homepage, about, careers (for growth signals), and blog index. Adding too many pages creates noise and increases your API costs substantially.
Pro tip: Add a priority: high/medium/low field to each page entry. High-priority pages (like pricing) should be checked daily; low-priority pages (like about pages) weekly. This prevents unnecessary API calls and keeps your costs manageable.
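Extended this way, a page entry might read as follows. The priority values and the daily/weekly split are conventions you'd enforce in your own scheduler, not built-in behavior:

```yaml
pages:
  - url: "https://competitor-a.com/pricing"
    type: "pricing"
    js_required: false
    priority: high     # include in every daily run
  - url: "https://competitor-a.com/about"
    type: "content"
    js_required: false
    priority: low      # include only in the weekly run
```

Your main loop can then skip low-priority entries except on a designated weekday.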
Estimated time for this step: 45 minutes including dependency installation and initial configuration.
The scraping layer is your data collection foundation, and it needs to handle two fundamentally different types of web pages: those that deliver all their content in the initial HTML response, and those that require JavaScript execution to render their actual content. Getting this distinction right is what separates a scraper that works reliably from one that silently returns empty data half the time.
Build scraper/static_scraper.py:
import requests
from bs4 import BeautifulSoup
import time
from scraper.robots_checker import check_robots_permission, get_crawl_delay
import os

HEADERS = {
    "User-Agent": f"CompetitiveResearchBot/1.0 (contact@{os.getenv('COMPANY_DOMAIN', 'example.com')})",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}

def scrape_static_page(url: str) -> dict:
    """
    Scrapes a static page. Returns structured content dict.
    Returns None if robots.txt disallows or request fails.
    """
    if not check_robots_permission(url):
        print(f"[BLOCKED] robots.txt disallows: {url}")
        return None
    delay = get_crawl_delay(url)
    time.sleep(delay)
    try:
        response = requests.get(url, headers=HEADERS, timeout=15)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'lxml')
        # Remove script and style tags — they're noise for text analysis
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        return {
            "url": url,
            "status_code": response.status_code,
            "title": soup.title.string if soup.title else "",
            "meta_description": get_meta_description(soup),
            "h1_tags": [h.get_text(strip=True) for h in soup.find_all('h1')],
            "h2_tags": [h.get_text(strip=True) for h in soup.find_all('h2')],
            "body_text": soup.get_text(separator='\n', strip=True)[:15000],  # Limit for API
            "structured_data": extract_structured_data(soup),
            "scraped_at": time.time()
        }
    except requests.RequestException as e:
        print(f"[ERROR] Failed to scrape {url}: {e}")
        return None

def get_meta_description(soup) -> str:
    meta = soup.find('meta', attrs={'name': 'description'})
    return meta['content'] if meta and meta.get('content') else ""

def extract_structured_data(soup) -> list:
    """Extracts JSON-LD structured data — often contains pricing info"""
    import json
    structured = []
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
            structured.append(data)
        except (json.JSONDecodeError, TypeError):
            pass
    return structured
Build scraper/dynamic_scraper.py:
from playwright.sync_api import sync_playwright
from scraper.robots_checker import check_robots_permission, get_crawl_delay
from bs4 import BeautifulSoup
import time

def scrape_dynamic_page(url: str, wait_selector: str = None) -> dict:
    """
    Uses headless Chromium to render JavaScript-heavy pages.
    wait_selector: CSS selector to wait for before extracting content.
    """
    if not check_robots_permission(url):
        print(f"[BLOCKED] robots.txt disallows: {url}")
        return None
    delay = get_crawl_delay(url)
    time.sleep(delay)
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="CompetitiveResearchBot/1.0",
            extra_http_headers={"Accept-Language": "en-US,en;q=0.9"}
        )
        page = context.new_page()
        try:
            page.goto(url, wait_until="networkidle", timeout=30000)
            if wait_selector:
                page.wait_for_selector(wait_selector, timeout=10000)
            # Additional wait for dynamic content
            page.wait_for_timeout(2000)
            html_content = page.content()
            soup = BeautifulSoup(html_content, 'lxml')
            for tag in soup(['script', 'style', 'nav', 'footer']):
                tag.decompose()
            return {
                "url": url,
                "title": page.title(),
                "body_text": soup.get_text(separator='\n', strip=True)[:15000],
                "h1_tags": [h.get_text(strip=True) for h in soup.find_all('h1')],
                "h2_tags": [h.get_text(strip=True) for h in soup.find_all('h2')],
                "scraped_at": time.time(),
                "rendered": True
            }
        except Exception as e:
            print(f"[ERROR] Playwright failed for {url}: {e}")
            return None
        finally:
            browser.close()
Pro tip: For pricing pages that load prices via API calls (which is increasingly common in SaaS), open your browser's Developer Tools, go to the Network tab, and look for XHR/fetch calls when the pricing loads. Sometimes you can call that API endpoint directly, which is far more reliable than scraping rendered HTML. Check whether that endpoint is documented or publicly accessible before using it.
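If you do find such an endpoint, parsing its JSON is far more robust than scraping rendered HTML. The payload shape below is invented for illustration; inspect the real response in the Network tab and adapt the field names:

```python
def parse_pricing_api(payload: dict) -> list:
    """Normalize a hypothetical pricing API response into plan records."""
    plans = []
    for plan in payload.get("plans", []):
        price = plan.get("price", {})
        plans.append({
            "name": plan.get("displayName"),
            "monthly_price": price.get("monthly"),
            "currency": price.get("currency", "USD"),
        })
    return plans

# Example payload as it might appear in DevTools (structure is illustrative);
# in the pipeline you'd fetch it with requests.get(endpoint, headers=HEADERS).json()
sample = {
    "plans": [
        {"displayName": "Starter", "price": {"monthly": 29, "currency": "USD"}},
        {"displayName": "Pro", "price": {"monthly": 79, "currency": "USD"}},
    ]
}
records = parse_pricing_api(sample)
```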
Common mistake to avoid: Using wait_until="load" instead of wait_until="networkidle" in Playwright. The "load" event fires before many JavaScript frameworks have finished rendering their content. "networkidle" waits until there have been no network connections for at least 500 ms — much more reliable for dynamic pages.
Estimated time for this step: 60-90 minutes including testing against your actual competitor URLs.
This is where the project transforms from a basic scraper into a genuine intelligence tool. Raw HTML text is nearly useless for decision-making. Claude's ability to read unstructured text and extract structured, categorized, actionable information is what makes this architecture valuable. A competitor's pricing page might contain paragraphs of marketing copy, feature comparison tables, FAQ sections, and footnotes — Claude can parse all of that and return a clean JSON object with the actual prices, plan names, and key differentiators.
Build analyzer/claude_client.py:
import anthropic
import json
import os
from dotenv import load_dotenv

load_dotenv()
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
MODEL = os.getenv("CLAUDE_MODEL", "claude-opus-4-5")

def analyze_with_claude(prompt: str, content: str, system_prompt: str = None) -> str:
    """
    Sends content to Claude for analysis. Returns the response text.
    """
    messages = [
        {
            "role": "user",
            "content": f"{prompt}\n\n---CONTENT---\n{content}"
        }
    ]
    system = system_prompt or (
        "You are a competitive intelligence analyst. Extract structured, accurate information "
        "from web page content. Always respond with valid JSON unless instructed otherwise. "
        "If information is not present in the content, use null rather than guessing."
    )
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        system=system,
        messages=messages
    )
    return response.content[0].text

def parse_json_response(response_text: str) -> dict:
    """Safely parses JSON from Claude's response, handling markdown code blocks"""
    text = response_text.strip()
    if text.startswith("```json"):
        text = text[7:]
    if text.startswith("```"):
        text = text[3:]
    if text.endswith("```"):
        text = text[:-3]
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError as e:
        print(f"[WARNING] JSON parse failed: {e}")
        return {"raw_response": response_text, "parse_error": str(e)}
Build analyzer/pricing_analyzer.py:
from analyzer.claude_client import analyze_with_claude, parse_json_response

PRICING_PROMPT = """
Analyze this pricing page content and extract all pricing information.
Return a JSON object with this exact structure:
{
  "plans": [
    {
      "name": "Plan name",
      "monthly_price": 0.00 or null,
      "annual_price": 0.00 or null,
      "annual_monthly_equivalent": 0.00 or null,
      "currency": "USD",
      "billing_period": "monthly|annual|one-time|usage-based",
      "user_limit": "number or 'unlimited' or null",
      "key_features": ["feature1", "feature2"],
      "highlighted_features": ["features emphasized in marketing copy"],
      "cta_text": "The call-to-action button text"
    }
  ],
  "free_trial": true/false,
  "free_tier_available": true/false,
  "enterprise_option": true/false,
  "pricing_model": "per-seat|flat-rate|usage-based|tiered|hybrid",
  "key_differentiators": ["What they emphasize vs competitors"],
  "recent_changes_mentioned": ["Any 'new', 'updated', 'launching soon' language found"],
  "confidence_score": 0-100
}
"""

def analyze_pricing_page(scraped_data: dict) -> dict:
    content = f"""
Page Title: {scraped_data.get('title', '')}
H1 Tags: {', '.join(scraped_data.get('h1_tags', []))}
H2 Tags: {', '.join(scraped_data.get('h2_tags', []))}
Body Text:
{scraped_data.get('body_text', '')}
"""
    response = analyze_with_claude(PRICING_PROMPT, content)
    result = parse_json_response(response)
    result['source_url'] = scraped_data.get('url')
    result['scraped_at'] = scraped_data.get('scraped_at')
    return result
Build analyzer/content_analyzer.py. This one is particularly valuable for tracking how competitors are shifting their messaging — are they moving upmarket? Targeting a new persona? Pivoting their positioning?
from analyzer.claude_client import analyze_with_claude, parse_json_response

CONTENT_PROMPT = """
Analyze this blog/content page and extract strategic intelligence about the company's content approach.
Return JSON with this structure:
{
  "primary_topics": ["main topics covered"],
  "target_personas": ["inferred audience segments"],
  "content_stage": "awareness|consideration|decision",
  "seo_keywords_apparent": ["keywords they appear to be targeting"],
  "messaging_themes": ["core messages and value propositions"],
  "competitive_positioning": "how they position against alternatives",
  "pain_points_addressed": ["customer problems they highlight"],
  "tone": "technical|conversational|formal|casual|urgent",
  "calls_to_action": ["CTAs found on page"],
  "publication_recency": "recent|moderate|dated based on content references",
  "strategic_signals": ["anything suggesting strategic direction changes"]
}
"""

def analyze_content_page(scraped_data: dict) -> dict:
    content = f"""
Title: {scraped_data.get('title', '')}
Meta Description: {scraped_data.get('meta_description', '')}
H1: {', '.join(scraped_data.get('h1_tags', []))}
H2s: {', '.join(scraped_data.get('h2_tags', []))}
Body: {scraped_data.get('body_text', '')[:8000]}
"""
    response = analyze_with_claude(CONTENT_PROMPT, content)
    return parse_json_response(response)
Pro tip: The strategic_signals field in the content analyzer is often where the most valuable intelligence surfaces. When a competitor's blog starts publishing heavily about a new topic — say, AI features, enterprise compliance, or a specific industry vertical — that's often a leading indicator of a product roadmap or marketing pivot, months before a formal announcement.
Common mistake to avoid: Sending the entire raw page body to Claude without trimming it first. Most pricing pages have extensive boilerplate, repeated navigation text, and footer content that adds tokens without adding analytical value. The 15,000-character limit in the scraper module and the 8,000-character limit in the content analyzer are intentional — they keep your API costs reasonable and actually improve response quality by forcing Claude to work with signal-dense content.
Estimated time for this step: 60-75 minutes to build and test both analyzers.
A competitive intelligence tool that requires you to manually compare this week's data to last week's is only marginally better than doing it by hand. The real value comes from automated change detection — a system that knows what the baseline looked like, identifies meaningful differences, and surfaces only what matters rather than overwhelming you with raw data dumps.
Build storage/database.py using SQLite for simplicity (upgrade to PostgreSQL for team deployments):
import sqlite3
import json
import time
import os

DB_PATH = os.path.join("data", "competitor_intel.db")

def init_database():
    os.makedirs("data", exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()
    cursor.executescript("""
        CREATE TABLE IF NOT EXISTS scrape_results (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            competitor_name TEXT NOT NULL,
            url TEXT NOT NULL,
            page_type TEXT NOT NULL,
            raw_content TEXT,
            analyzed_data TEXT,
            scraped_at REAL,
            created_at REAL DEFAULT (unixepoch())
        );
        CREATE TABLE IF NOT EXISTS changes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            competitor_name TEXT NOT NULL,
            page_type TEXT NOT NULL,
            change_type TEXT NOT NULL,
            change_summary TEXT,
            old_value TEXT,
            new_value TEXT,
            severity TEXT DEFAULT 'medium',
            detected_at REAL DEFAULT (unixepoch())
        );
        CREATE INDEX IF NOT EXISTS idx_scrape_url ON scrape_results(url, scraped_at);
        CREATE INDEX IF NOT EXISTS idx_changes_competitor ON changes(competitor_name, detected_at);
    """)
    conn.commit()
    conn.close()

def save_scrape_result(competitor_name: str, url: str, page_type: str,
                       raw_content: dict, analyzed_data: dict):
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""
        INSERT INTO scrape_results (competitor_name, url, page_type, raw_content, analyzed_data, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?)
    """, (competitor_name, url, page_type,
          json.dumps(raw_content), json.dumps(analyzed_data), time.time()))
    conn.commit()
    conn.close()

def get_previous_analysis(url: str) -> dict:
    """Gets the most recent stored analysis for comparison.
    Call this BEFORE saving the current snapshot, so the newest row
    in the table is still the previous run's result."""
    conn = sqlite3.connect(DB_PATH)
    cursor = conn.execute("""
        SELECT analyzed_data FROM scrape_results
        WHERE url = ?
        ORDER BY scraped_at DESC
        LIMIT 1
    """, (url,))
    row = cursor.fetchone()
    conn.close()
    return json.loads(row[0]) if row else None
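One payoff of storing analyzed_data as JSON text: SQLite's built-in json_extract can pull a price trend straight out of the history table, with no schema changes. A self-contained sketch using an in-memory database (the JSON path matches the pricing analyzer's plans array; in the real pipeline you'd connect to DB_PATH):

```python
import sqlite3
import json

conn = sqlite3.connect(":memory:")  # use DB_PATH in the real pipeline
conn.execute("""CREATE TABLE scrape_results (
    url TEXT, analyzed_data TEXT, scraped_at REAL)""")

# Two snapshots of the same pricing page, a price increase apart
for ts, price in [(1.0, 49.0), (2.0, 59.0)]:
    conn.execute("INSERT INTO scrape_results VALUES (?, ?, ?)",
                 ("https://competitor-a.com/pricing",
                  json.dumps({"plans": [{"name": "Pro", "monthly_price": price}]}),
                  ts))

# Monthly price of the first listed plan across snapshots, oldest first
rows = conn.execute("""
    SELECT scraped_at,
           json_extract(analyzed_data, '$.plans[0].monthly_price')
    FROM scrape_results
    WHERE url = ?
    ORDER BY scraped_at
""", ("https://competitor-a.com/pricing",)).fetchall()
# rows → [(1.0, 49.0), (2.0, 59.0)]
```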
Build analyzer/change_detector.py. This is where Claude does its most impressive work — comparing two snapshots of analyzed data and producing a human-readable, prioritized summary of what changed:
from analyzer.claude_client import analyze_with_claude, parse_json_response
import json

CHANGE_DETECTION_PROMPT = """
Compare these two competitive intelligence snapshots (previous vs current) and identify meaningful changes.
Ignore trivial differences like minor wording changes. Focus on strategically significant changes.
Return JSON:
{
  "changes_detected": true/false,
  "change_count": 0,
  "changes": [
    {
      "type": "pricing_change|feature_addition|feature_removal|messaging_shift|positioning_change|new_cta|plan_restructure",
      "severity": "high|medium|low",
      "summary": "One sentence description of the change",
      "strategic_implication": "What this likely means for their strategy",
      "recommended_action": "What you should consider doing in response",
      "old_value": "previous state",
      "new_value": "current state"
    }
  ],
  "overall_assessment": "Executive summary of competitive landscape shift"
}
"""

def detect_changes(competitor_name: str, page_type: str,
                   previous_data: dict, current_data: dict) -> dict:
    if not previous_data:
        return {"changes_detected": False, "message": "No baseline to compare against"}
    content = f"""
PREVIOUS SNAPSHOT:
{json.dumps(previous_data, indent=2)[:6000]}

CURRENT SNAPSHOT:
{json.dumps(current_data, indent=2)[:6000]}

Competitor: {competitor_name}
Page Type: {page_type}
"""
    response = analyze_with_claude(CHANGE_DETECTION_PROMPT, content)
    result = parse_json_response(response)
    result['competitor_name'] = competitor_name
    result['page_type'] = page_type
    return result
Build reports/report_generator.py to produce a clean weekly digest:
import sqlite3
import time
from datetime import datetime
from analyzer.claude_client import analyze_with_claude

DB_PATH = "data/competitor_intel.db"

def generate_weekly_report() -> str:
    conn = sqlite3.connect(DB_PATH)
    # Get all changes from the past 7 days
    week_ago = time.time() - (7 * 24 * 3600)
    # severity is stored as text, so rank it explicitly instead of sorting alphabetically
    cursor = conn.execute("""
        SELECT competitor_name, page_type, change_type, change_summary,
               severity, old_value, new_value, detected_at
        FROM changes
        WHERE detected_at > ?
        ORDER BY CASE severity WHEN 'high' THEN 0 WHEN 'medium' THEN 1 ELSE 2 END,
                 detected_at DESC
    """, (week_ago,))
    changes = cursor.fetchall()
    conn.close()
    if not changes:
        return "No significant changes detected in the past 7 days."
    changes_text = "\n".join([
        f"[{c[4].upper()}] {c[0]} - {c[2]}: {c[3]}"
        for c in changes
    ])
    synthesis_prompt = """
Synthesize these competitive intelligence findings into a concise executive briefing.
Structure it as:
1. Top 3 most urgent findings requiring immediate action
2. Strategic trends observed across competitors
3. Recommended actions for this week
Write in clear, direct business language. No bullet-point padding.
"""
    report = analyze_with_claude(synthesis_prompt, changes_text,
                                 system_prompt="You are a senior competitive strategist writing an executive briefing.")
    timestamp = datetime.now().strftime("%m/%d/%Y")
    return f"COMPETITIVE INTELLIGENCE BRIEF — Week of {timestamp}\n\n{report}"
Finally, tie everything together in main.py:
import sqlite3
import json
import time
import yaml
import schedule
from storage.database import init_database, save_scrape_result, get_previous_analysis
from scraper.static_scraper import scrape_static_page
from scraper.dynamic_scraper import scrape_dynamic_page
from analyzer.pricing_analyzer import analyze_pricing_page
from analyzer.content_analyzer import analyze_content_page
from analyzer.change_detector import detect_changes
from reports.report_generator import generate_weekly_report

DB_PATH = "data/competitor_intel.db"

def load_config():
    with open("config.yaml") as f:
        return yaml.safe_load(f)

def save_changes(competitor_name: str, page_type: str, change_report: dict):
    """Persist detected changes so the weekly report can query them."""
    conn = sqlite3.connect(DB_PATH)
    for ch in change_report.get('changes', []):
        conn.execute("""
            INSERT INTO changes (competitor_name, page_type, change_type,
                                 change_summary, old_value, new_value, severity)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (competitor_name, page_type, ch.get('type', 'unknown'),
              ch.get('summary', ''), json.dumps(ch.get('old_value')),
              json.dumps(ch.get('new_value')), ch.get('severity', 'medium')))
    conn.commit()
    conn.close()

def run_intelligence_cycle():
    config = load_config()
    for competitor in config['competitors']:
        name = competitor['name']
        print(f"\n[INFO] Processing {name}...")
        for page_config in competitor['pages']:
            url = page_config['url']
            page_type = page_config['type']
            js_required = page_config.get('js_required', False)
            # Scrape
            if js_required:
                raw_data = scrape_dynamic_page(url)
            else:
                raw_data = scrape_static_page(url)
            if not raw_data:
                print(f"[SKIP] Failed to scrape: {url}")
                continue
            # Analyze
            if page_type == 'pricing':
                analyzed = analyze_pricing_page(raw_data)
            else:
                analyzed = analyze_content_page(raw_data)
            # Get previous for comparison (before saving the current run)
            previous = get_previous_analysis(url)
            # Save current
            save_scrape_result(name, url, page_type, raw_data, analyzed)
            # Detect changes
            if previous:
                changes = detect_changes(name, page_type, previous, analyzed)
                if changes.get('changes_detected'):
                    print(f"[ALERT] Changes detected for {name} - {page_type}!")
                    save_changes(name, page_type, changes)
                    # Here: send Slack notification, email, etc.
            time.sleep(2)  # Polite delay between pages

def main():
    init_database()
    # Run immediately on startup
    run_intelligence_cycle()
    # Schedule ongoing runs
    schedule.every().day.at("07:00").do(run_intelligence_cycle)
    schedule.every().monday.at("08:00").do(lambda: print(generate_weekly_report()))
    while True:
        schedule.run_pending()
        time.sleep(60)

if __name__ == "__main__":
    main()
Pro tip: Add Slack webhook notifications for high-severity changes. When a competitor drops their price or adds a major feature, you want to know within hours, not at your Monday morning briefing. The requests library makes this a five-line addition to the change detection logic.
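A sketch of that addition. The webhook URL is a placeholder you'd create in Slack's incoming-webhooks setup, and the payload builder is split out so it can be tested without a network call:

```python
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"  # placeholder

def build_slack_payload(competitor: str, change: dict) -> dict:
    """Format a detected change as a Slack message payload."""
    return {
        "text": (f":rotating_light: [{change.get('severity', 'medium').upper()}] "
                 f"{competitor}: {change.get('summary', 'Change detected')}")
    }

def notify_slack(competitor: str, change: dict) -> None:
    """POST a high-severity change to Slack."""
    import requests  # already in the project's dependencies
    # Only alert on high severity to avoid notification fatigue
    if change.get("severity") == "high":
        requests.post(SLACK_WEBHOOK_URL,
                      json=build_slack_payload(competitor, change), timeout=10)

payload = build_slack_payload(
    "CompetitorA", {"severity": "high", "summary": "Pro plan dropped from $79 to $59"})
```

Calling notify_slack from the change-detection branch in main.py is the whole integration.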
Common mistake to avoid: Running the full intelligence cycle too frequently. Daily is appropriate for pricing pages; weekly is usually sufficient for content and feature pages. Overly frequent scraping increases your API costs, strains competitor servers, and produces noise from minor, irrelevant changes.
Estimated time for this step: 90 minutes to build and 30 minutes to run initial tests.
Once your core pipeline is running, you have a foundation that can be extended in several high-value directions without rebuilding from scratch. The modular architecture you've built is specifically designed to accommodate new data sources, new analysis types, and new output formats by adding new modules rather than modifying existing ones.
One of the most valuable competitive signals — and one that most teams miss entirely — is monitoring what competitors are doing with their paid advertising. While you can't directly access their ad accounts, you can monitor their landing pages, which often reveal their current messaging priorities, offers, and conversion strategies. Add landing pages as type: "landing_page" entries in your config.yaml, and extend the content analyzer with a landing-page-specific prompt that looks for: urgency language, specific offer terms, social proof claims, and A/B testing signals (like multiple variants of a page being accessible via different URL parameters).
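As a sketch of what that landing-page-specific prompt might look like (the prompt wording and the `build_landing_page_prompt` helper are illustrative, not the guide's canonical analyzer):

```python
# Illustrative prompt for the landing-page analyzer extension.
LANDING_PAGE_PROMPT = """Analyze this competitor landing page and return JSON with:
- urgency_language: list of urgency phrases ("limited time", countdown timers, etc.)
- offer_terms: specific discounts, free trials, or guarantees mentioned
- social_proof: customer counts, client logos, testimonials, review scores
- ab_test_signals: URL parameters or variant markers suggesting a split test
Ignore navigation, footer, and cookie-banner text.

Page text:
{page_text}"""


def build_landing_page_prompt(page_text, max_chars=12000):
    """Truncate long pages so the prompt stays within a predictable token budget."""
    return LANDING_PAGE_PROMPT.format(page_text=page_text[:max_chars])
```

The analyzer module then sends this prompt through the same Claude call pattern as the pricing and content analyzers, so no pipeline plumbing changes.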
For more sophisticated ad intelligence, tools like the Meta Ad Library and the Google Ads Transparency Center provide public access to active ad creative. With Claude Code, you can build scrapers for these platforms (they have their own APIs and public-facing data) and feed that intelligence back into your analysis pipeline. Meta's Ad Library is particularly valuable for B2C competitors — you can see exactly what creative they're running, for how long, and across which regions.
Competitor job postings are an underutilized intelligence source. A competitor suddenly posting five senior AI/ML engineer roles tells you something about their product roadmap that no press release will. A cluster of sales hiring in a specific geographic region suggests market expansion. Build a job listing scraper that monitors their careers page (usually a simple static page) and uses Claude to extract: role categories, seniority distribution, required technology stack, and location patterns. This kind of intelligence is consistently available, ethically unambiguous, and highly predictive.
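Before sending full listings to Claude, a cheap local pre-filter can summarize seniority distribution from job titles alone. This keyword heuristic is a rough sketch (the keyword lists and function names are assumptions, and Claude handles the nuanced categorization):

```python
# Rough keyword heuristic -- Claude does the real categorization downstream.
SENIORITY_KEYWORDS = {
    "senior": ["senior", "sr.", "staff", "principal", "lead"],
    "junior": ["junior", "jr.", "associate", "intern"],
}


def classify_seniority(title):
    """Bucket a job title by seniority keywords; 'unspecified' if none match."""
    t = title.lower()
    for level, keywords in SENIORITY_KEYWORDS.items():
        if any(k in t for k in keywords):
            return level
    return "unspecified"


def summarize_postings(titles):
    """Count postings per seniority level across a careers-page snapshot."""
    counts = {}
    for title in titles:
        level = classify_seniority(title)
        counts[level] = counts.get(level, 0) + 1
    return counts
```

A sudden jump in the "senior" bucket between weekly snapshots is exactly the kind of signal worth flagging for the full Claude analysis.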
The SQLite database can be queried directly by tools like Retool, Metabase, or even Google Sheets (via a Python script that exports to CSV). For teams already using data visualization platforms, adding a "competitive intelligence" dashboard fed by this pipeline takes a few hours and creates genuinely useful visibility that most teams don't have.
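A minimal CSV export script might look like this. It assumes a `changes` table with `competitor`, `page_type`, `detected_at`, and `summary` columns; adjust the query to match your actual schema:

```python
import csv
import sqlite3


def export_changes_to_csv(db_path, out_path):
    """Dump the change log to CSV for import into Google Sheets, Metabase, etc.

    Assumes a `changes` table (competitor, page_type, detected_at, summary);
    adapt the SELECT to your schema.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT competitor, page_type, detected_at, summary "
            "FROM changes ORDER BY detected_at DESC"
        ).fetchall()
    finally:
        conn.close()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["competitor", "page_type", "detected_at", "summary"])
        writer.writerows(rows)
    return len(rows)
```

Schedule this alongside the weekly report and point your dashboard tool at the CSV (or query the SQLite file directly, which Retool and Metabase both support).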
If you're interested in taking your Claude Code skills significantly further — building more complex pipelines, integrating with APIs, and creating production-ready tools — Adventure Media's hands-on Claude Code workshop covers exactly this kind of real-world project development in a single day. The workshop is built for people who already understand the basics and want to move into building actual tools, which is precisely where this competitive intelligence project sits.
Scraping publicly available information is generally legal in the US, but the law is not fully settled. The Computer Fraud and Abuse Act (CFAA) has been interpreted in various ways, and the hiQ v. LinkedIn case provided some protection for public data scraping. However, you should always: respect robots.txt, avoid logging into competitors' systems, not scrape at volumes that constitute a denial-of-service attack, and review each target site's Terms of Service. Consult a technology attorney if you're doing this at scale or in a regulated industry. This guide is not legal advice.
For a typical setup monitoring 5 competitors with 3-4 pages each, daily runs will cost roughly $5-15 per month in Claude API fees — depending on page length and your model choice. Using claude-haiku for initial extraction and claude-opus only for change synthesis and report generation can reduce costs significantly. Monitor your usage in the Anthropic console and set billing alerts.
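One way to implement that tiering is a simple task-to-model routing table. The model IDs below are placeholders — check the Anthropic console or docs for the current identifiers before using them:

```python
# Placeholder model IDs -- substitute the current identifiers from Anthropic's docs.
MODELS = {
    "extraction": "claude-haiku-latest",  # high-volume per-page extraction (cheap)
    "synthesis": "claude-opus-latest",    # change synthesis and weekly reports (expensive, rare)
}


def pick_model(task):
    """Route a pipeline task to the cheapest model that can handle it."""
    return MODELS.get(task, MODELS["extraction"])
```

Because every analyzer already goes through one Claude call site, swapping `model=pick_model("extraction")` into that call is a one-line change per module.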
Don't attempt to access any content behind authentication. If a pricing page requires login, that's a clear signal that the company considers it non-public. In these cases, use alternative sources: industry databases, G2/Capterra reviews (which often include pricing), press releases, or sales intelligence platforms like ZoomInfo. You can feed manually collected data into your Claude analysis pipeline just as easily as scraped data.
The most common source of false positives is dynamic content like "featured this week" sections, date-stamped content, and personalized elements. Mitigate this by: adding CSS selectors to specifically target the content sections you care about rather than grabbing full page text, running the scraper multiple times on the same day before flagging a change as confirmed, and tuning your Claude prompt to explicitly ignore "trivial wording changes" and time-sensitive content.
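A complementary local defense is normalizing snapshots before diffing them, so dates and counters never reach the comparison at all. This is a sketch — the regex list here is a starting point you'd extend for the volatile elements you actually see:

```python
import re

# Patterns that commonly produce false-positive "changes" between snapshots.
VOLATILE_PATTERNS = [
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",  # numeric dates like 3/14/2026
    r"\b(january|february|march|april|may|june|july|august|september"
    r"|october|november|december)\s+\d{1,2},?\s+\d{4}\b",  # written dates
    r"\b\d+\s+(views|comments|reviews)\b",  # engagement counters
]


def normalize_for_diff(text):
    """Strip volatile tokens before comparing snapshots, so timestamps
    and counters don't trigger change alerts."""
    out = text.lower()
    for pattern in VOLATILE_PATTERNS:
        out = re.sub(pattern, "", out)
    return re.sub(r"\s+", " ", out).strip()
```

Compare `normalize_for_diff(old)` against `normalize_for_diff(new)` as a cheap pre-check; only if they differ do you pay for a Claude change-analysis call.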
Monitoring competitors' Google Ads works too, with some adaptation. You can use the Google Ads Transparency Center to see active ads for any advertiser. Your scraper can collect this public data, and Claude can analyze messaging patterns, offer language, and positioning shifts. This is particularly valuable if you're in a competitive paid search environment where competitor ad copy directly affects your click-through rates and Quality Scores.
Match frequency to the volatility of the information type. Pricing pages: daily or every two days. Feature pages: twice weekly. Blog/content pages: weekly. Homepage: twice weekly (messaging changes fast). Job listings: weekly. Running everything daily creates unnecessary costs and API load without proportional intelligence value.
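Those cadences translate naturally into a per-page-type interval table plus a due-check before each scrape. The table values mirror the recommendations above (twice weekly approximated as every 3 days); the function name and schema are illustrative:

```python
from datetime import datetime, timedelta

# Scrape cadence per page type, matching volatility (see recommendations above).
SCAN_INTERVALS = {
    "pricing": timedelta(days=1),
    "features": timedelta(days=3),   # twice weekly
    "homepage": timedelta(days=3),   # twice weekly
    "blog": timedelta(days=7),
    "jobs": timedelta(days=7),
}


def is_due(page_type, last_scraped, now=None):
    """True if this page type is due for another scrape.

    `last_scraped` is a datetime from your database, or None if never scraped.
    Unknown page types default to weekly.
    """
    now = now or datetime.utcnow()
    interval = SCAN_INTERVALS.get(page_type, timedelta(days=7))
    return last_scraped is None or now - last_scraped >= interval
```

With this check in `run_intelligence_cycle`, you can keep a single daily scheduler run and let each page skip itself until its interval has elapsed.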
The key practices are: respect crawl delays, use realistic request headers, limit requests to 1-2 per minute per domain, and don't scrape the same domain from multiple threads simultaneously. If you do get blocked (you'll see 403 or 429 responses), wait 24-48 hours before retrying, reduce your frequency, and consider using a residential proxy service if the intelligence is valuable enough to warrant the cost. Never try to circumvent anti-bot systems — that's where you cross from competitive intelligence into legally risky territory.
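A sketch of the retry side of this: exponential backoff with jitter for 403/429 responses, plus a transparent User-Agent. The specific header values and delay parameters are illustrative defaults, not requirements:

```python
import random

# Identify your bot honestly; many sites tolerate transparent, low-rate crawlers.
# These header values are illustrative -- put your own contact address here.
POLITE_HEADERS = {
    "User-Agent": "CompetitiveIntelBot/1.0 (contact: you@example.com)",
    "Accept": "text/html,application/xhtml+xml",
}


def backoff_delay(attempt, base=30.0, cap=3600.0):
    """Exponential backoff with jitter for 403/429 responses:
    ~30s, ~60s, ~120s, ... capped at one hour."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.1)
```

On a 429, sleep for `backoff_delay(attempt)` and retry a couple of times; on a 403, treat it as a signal to stop for the day rather than something to retry through.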
Social media platforms have strict Terms of Service against automated scraping, and most have APIs with their own rate limits and data access policies. For Twitter/X and LinkedIn, use their official APIs within permitted rate limits. For Facebook, the Ad Library API is the legitimate path for ad intelligence. Attempting to scrape social platforms outside their official APIs is high-risk and frequently breaks as platforms update their anti-bot measures.
If a site is aggressively blocking automated requests, that's a strong signal to stop and find alternative data sources. Heavy bot protection indicates the site owner actively doesn't want automated access. Attempting to bypass Cloudflare or similar systems violates terms of service and potentially computer abuse laws. Use manual collection, third-party data providers, or focus your automation on sites with permissive robots.txt and no bot protection.
The report generator can output to Slack, email, or a shared document depending on your team's workflow. For Slack, add a webhook integration to the report generator. For email, Python's smtplib with a SendGrid or Amazon SES integration sends the weekly report automatically. For documentation, outputting to a Notion database via their API keeps all competitive intelligence searchable and linked to other strategic documents.
The architecture in this guide uses the Anthropic Python SDK and is optimized for Claude's instruction-following capabilities, particularly for structured JSON extraction. You could adapt the analysis layer to use other models, but Claude's ability to reliably return valid JSON from complex, messy web content — and to follow nuanced extraction instructions without hallucinating fields — makes it the strongest choice for this specific use case as of early 2026. The structured output reliability is critical when downstream code depends on specific JSON fields being present and accurate.
Build a validation step into your pipeline that spot-checks 10-20% of extractions against the source URL. For pricing specifically, manually verify the first week's extractions against what you see in the browser before trusting the automated data. Add a confidence_score field to your prompts (as shown in the pricing analyzer) and treat anything below 70 as requiring manual review. Over time, you'll develop a sense for which page layouts Claude handles reliably and which require prompt refinement.
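The confidence gate can be a few lines in the pipeline. This sketch assumes your prompt asks Claude for a 0-100 `confidence_score` field, as described above; the function name and return labels are illustrative:

```python
CONFIDENCE_THRESHOLD = 70  # below this, route to manual review


def triage_extraction(analyzed):
    """Route an extraction result to 'accept' or 'review'.

    Assumes the analyzer prompt requests a 0-100 `confidence_score` field.
    A missing score is treated as low confidence -- never trust it silently.
    """
    score = analyzed.get("confidence_score")
    if score is None:
        return "review"
    return "accept" if score >= CONFIDENCE_THRESHOLD else "review"
```

Results routed to "review" can be written to a separate table or flagged in the weekly report so a human checks them against the live page.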
The competitive intelligence gap between companies that monitor manually and those that monitor systematically isn't just a time efficiency difference — it's a strategic decision-making gap. When you know that a competitor restructured their pricing two days ago, you can respond in days rather than weeks. When you see that three competitors have started emphasizing the same new feature in their messaging, you have an early signal about where customer expectations are heading. When you notice a competitor's job postings shift heavily toward enterprise sales roles, you have months of lead time before their public announcement.
The framework you've built in this guide gives you all of that — automatically, continuously, and within clear ethical boundaries. The five-step architecture covers everything from responsible data collection through Claude-powered analysis to automated change detection and reporting. It's production-ready with some hardening work (adding proper error handling, logging, and monitoring), and it scales cleanly from a solo founder watching three competitors to a growth team monitoring an entire industry vertical.
The most important thing to do next is actually run it. Put in your first three competitor URLs, let it collect a baseline, and come back a week later to see what Claude surfaces. The first time it catches a pricing change you didn't notice, the value of the entire project will be immediately obvious.
If you want to go deeper into building production-ready tools with Claude Code — beyond this pipeline into more complex architectures, API integrations, and multi-agent systems — Adventure Media is running a hands-on workshop called Master Claude Code in One Day. It's designed for people who are past the basics and ready to build real things that solve real business problems, which is exactly the mindset this guide was written for.
Competitive intelligence is no longer a resource reserved for enterprise teams with dedicated analysts. With Claude Code and a few hundred lines of well-structured Python, it's available to any team that decides to build it.