TLDR: I wanted to discover more personal websites, so I scraped millions of candidate URLs and indexed 1 million personal sites. Demo
Personal websites impose no structure on the creator, making them much more expressive and interesting. However, finding personal websites, especially new personal websites, is hard.
To solve this, I built a crawler that pulls millions of candidate URLs, classifies them with an LLM, and indexes 1 million personal sites in a browsable interface. Inspired by Andrew Chan's challenge to pull 1B in a day, I aimed to index 1 million personal sites in under an hour; in practice, rate limits (OpenRouter, per-host fetch) stretched the run to about 4 hours.
This required several mechanisms to avoid rate limits and optimize speed while maintaining quality, described below.
To keep quality high, I pull profiles from a small selection of websites I expected to be high quality: Hacker News, Are.na, Twitter/X, LessWrong, SoundCloud, and HuggingFace.
| Source | Method | Raw URLs | Personal rate |
|---|---|---|---|
| HN | BigQuery + profile scrape + megathreads | ~1.5M | ~30% |
| Are.na | GraphQL batch API | ~500K | ~45% |
| Twitter/X | Follow-graph BFS from seed accounts | ~2M | ~20% |
| LessWrong | Profile scrape | ~80K | ~35% |
| HuggingFace | Profile pages | ~200K | ~25% |
| SoundCloud | Artist pages, linked sites | ~200K | ~15% |
I used three extraction methods in parallel. Google hosts a public BigQuery dataset of every HN item, so a single query over the about field gives ~800K candidate URLs in seconds, though it's noisy since profiles contain GitHub repos, company pages, and raw text that happens to contain a dot. HN also periodically runs "drop your personal site" megathreads (e.g. item 4646), which I scraped for ~30K URLs at nearly 100% hit rate. Finally, I fetched the profile page for every HN username and extracted the URL field directly. Overlap across the three is about 40%; I dedup at the URL level but keep all source tags.
Are.na's API is GraphQL, so I batch 25 aliased identity(id:) fragments into a single request to pull 25 users' bios at once:
```graphql
query($i0: ID!, $i1: ID!, ..., $i24: ID!) {
  u0: identity(id: $i0) { name description }
  u1: identity(id: $i1) { name description }
  ...
}
```
Link extraction runs three regexes over each bio, then filters through a frozenset of valid TLDs to drop garbage. I checkpoint every 10K user IDs by rewriting the output CSV so I can resume from the last ID. Hit rate after classification is ~45%, roughly 2x the average across all sources.
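A minimal sketch of that extraction step; the regex and the TLD set here are illustrative stand-ins for the real ones (the actual filter covers the full IANA TLD list):

```python
import re

# illustrative subset; the real frozenset covers all valid TLDs
VALID_TLDS = frozenset({"com", "net", "org", "io", "dev", "me", "xyz"})

# matches full URLs, or bare domains like "mysite.dev" not preceded by a word char or dot
URL_RE = re.compile(r"https?://[^\s)\"']+|(?<![\w.])[\w-]+\.[a-z]{2,}(?:/[^\s)\"']*)?")

def extract_links(bio: str) -> list[str]:
    out = []
    for match in URL_RE.findall(bio):
        # take the hostname part, then its last dot-separated label
        bare = match.split("//")[-1].split("/")[0]
        tld = bare.rsplit(".", 1)[-1].lower()
        if tld in VALID_TLDS:  # drops garbage like "v2.0" or "file.txt"
            out.append(match)
    return out
```

The TLD filter is what keeps "raw text that happens to contain a dot" out of the candidate pool.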
I start with ~50 seed accounts and crawl their followings (depth 1), then those users' followings (depth 2), authenticating via Chrome cookies. The done set is set[tuple[int, str]] of (depth, screen_name) pairs, and a sidecar .depth.txt file appends depth\tscreen_name as each user is processed. If the run gets killed at depth 2, I can restart with --depth1-path pointing to the previous file to recover the parent set without re-crawling depth 1.
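A sketch of that sidecar checkpoint format (the function names here are mine, not the crawler's):

```python
from pathlib import Path

def load_done(path: Path) -> set[tuple[int, str]]:
    """Rebuild the done set from the append-only depth\\tscreen_name sidecar."""
    if not path.exists():
        return set()
    done: set[tuple[int, str]] = set()
    for line in path.read_text().splitlines():
        depth, name = line.split("\t", 1)
        done.add((int(depth), name))
    return done

def mark_done(path: Path, depth: int, screen_name: str) -> None:
    # append-only: a crashed run can be replayed, the set dedups on load
    with path.open("a") as f:
        f.write(f"{depth}\t{screen_name}\n")
```

Because the file is append-only, a killed run loses at most the in-flight user, and recovery is a single linear read.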
With ~4.5M candidate URLs in CSVs, the architecture is an asyncio producer/consumer with bounded backpressure:
```
producer (CSV reader)
  │ asyncio.Queue(maxsize=1000)
  │ N_CONSUMERS=20
  ├── fetch (global sem=300, per-host sem=4)
  ├── HTML→MD (ThreadPoolExecutor, 24 workers)
  ├── LLM classify (API sem=25)
  └── append to good_list
```
The producer pushes URLs onto a bounded queue so it blocks when consumers fall behind, keeping memory constant. Twenty consumer coroutines each run the full pipeline for one URL, with concurrency at each stage controlled by separate semaphores.
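The skeleton, with the per-stage work collapsed into a stand-in `process` function:

```python
import asyncio

N_CONSUMERS = 20

async def process(url: str) -> str:
    # stand-in for fetch → HTML→MD → LLM classify
    return url

async def producer(queue: asyncio.Queue, urls) -> None:
    for url in urls:
        await queue.put(url)   # blocks when the queue is full: backpressure
    for _ in range(N_CONSUMERS):
        await queue.put(None)  # one shutdown sentinel per consumer

async def consumer(queue: asyncio.Queue, results: list) -> None:
    while (url := await queue.get()) is not None:
        results.append(await process(url))

async def run(urls) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1000)
    results: list = []
    await asyncio.gather(
        producer(queue, urls),
        *(consumer(queue, results) for _ in range(N_CONSUMERS)),
    )
    return results
```

The `maxsize=1000` bound is the whole memory story: the producer can never get more than 1000 URLs ahead of the slowest consumer.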
A global asyncio.Semaphore(300) caps total in-flight requests, and per-host semaphores cap any single domain at 4:
```python
import asyncio
from urllib.parse import urlparse

host_sems: dict[str, asyncio.Semaphore] = {}

async def fetch(url: str) -> str:
    host = urlparse(url).hostname
    # lazily create a per-host semaphore capping each domain at PER_HOST (4)
    host_sem = host_sems.setdefault(host, asyncio.Semaphore(PER_HOST))
    async with global_sem, host_sem:  # global_sem = asyncio.Semaphore(300)
        async with session.get(url, timeout=10) as resp:
            return await resp.text()
```
The HTML-to-Markdown converter is a Node subprocess wrapping htmlparser2, which is significantly faster than Python parsers for this workload. It runs in a ThreadPoolExecutor(24) via run_in_executor and compresses a 500KB HTML page down to ~20KB of Markdown, which is what makes cheap LLM classification feasible.
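The executor hand-off looks roughly like this; the converter below is a pure-Python stand-in for the Node/htmlparser2 subprocess call:

```python
import asyncio
import re
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=24)

def html_to_md(html: str) -> str:
    # stand-in: the real version pipes the page through a Node subprocess
    return re.sub(r"<[^>]+>", "", html).strip()

async def convert(html: str) -> str:
    loop = asyncio.get_running_loop()
    # blocking subprocess I/O runs in the thread pool, keeping the event loop free
    return await loop.run_in_executor(executor, html_to_md, html)
```

Running the converter in threads rather than on the event loop means a slow page never stalls the other 19 consumers.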
Two append-only files handle resume: processed_urls.txt (one URL per line, appended immediately after processing) and good_checkpoint.csv (URLs that passed classification, snapshotted every 60 seconds by a background task). On crash you lose at most 60 seconds of good results. At 10M+ URLs the processed set gets expensive to load from disk; the upgrade path is a Bloom filter backed by mmap.
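That upgrade path could look like this minimal stdlib sketch, using double hashing over a blake2b digest; the `bytearray` is where an mmap-backed buffer would slot in (sizes and hash count here are my choices, not measured ones):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 27, num_hashes: int = 7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)  # swap for mmap at 10M+ URLs

    def _positions(self, item: str) -> list[int]:
        # derive k hash positions from two halves of one blake2b digest
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))
```

False positives here are harmless: a URL wrongly marked "processed" is just skipped, and false negatives can't happen, so no good site is lost.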
First a HEAD check filters on status 200, correct content-type, and reasonable response time, eliminating ~50% of candidates. Then an iframe embeddability check parses X-Frame-Options and CSP frame-ancestors headers to reject sites that can't be shown in the app's full-screen iframe:
```python
def blocks_embedding(headers: dict) -> bool:
    xfo = headers.get("x-frame-options", "").lower()
    if xfo in ("deny", "sameorigin"):
        return True
    csp = headers.get("content-security-policy", "")
    for directive in csp.split(";"):
        directive = directive.strip().lower()
        if directive.startswith("frame-ancestors"):
            values = directive.split()[1:]
            if "'none'" in values:
                return True
            if "*" not in values and "https:" not in values:
                return True
    return False
```
Without this check ~20% of sites render as blank iframes.
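The HEAD pre-check itself factors into a pure predicate like the one below; the response-time threshold is my guess at a reasonable value, not the original's:

```python
def passes_head_check(status: int, content_type: str, elapsed_s: float,
                      max_elapsed_s: float = 5.0) -> bool:
    # status 200 only: redirects and errors at HEAD time are already a bad sign
    if status != 200:
        return False
    # only HTML pages are worth converting and classifying
    if not content_type.lower().startswith("text/html"):
        return False
    # hosts that are slow on HEAD would stall the fetch stage later
    return elapsed_s <= max_elapsed_s
```

Keeping the decision logic separate from the network call makes it trivial to unit-test against recorded headers.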
"Personal website" is fuzzy enough that rule-based classification fails immediately, so I use an LLM. The pipeline truncates the fetched HTML, converts to Markdown via the Node subprocess (dramatically reducing token count), truncates again, and sends it to Parasail via OpenRouter with a system prompt defining "personal website" with examples and counter-examples. Response is Yes or No.
Parasail costs about $0.0001 per classification versus GPT-4 at $0.001. For a binary classification where false positives are tolerable and false negatives are invisible, the cheap model wins. On a hand-labeled sample of 200 sites I got ~88% agreement; most disagreements are genuinely borderline.
At 25 concurrent calls and ~500ms/call, raw throughput is ~50 classifications/sec, which would still take many hours for millions of URLs. Three things bring this down:
- Pre-filtering with domain heuristics skips ~20% of URLs without any LLM call.
- Content fingerprinting via SimHash over the Markdown skips another ~15% by reusing cached labels for near-duplicate pages: GitHub profiles, HuggingFace model cards, and default Jekyll blogs all share templates with slight variations that exact hashing misses but SimHash within Hamming distance 3 catches.
- Batching 10 URLs per LLM call gives ~5x throughput.
```python
import mmh3  # MurmurHash3 bindings

def simhash(tokens: list[str], bits: int = 64) -> int:
    v = [0] * bits
    for token in tokens:
        h = mmh3.hash64(token)[0]
        for i in range(bits):
            v[i] += 1 if h & (1 << i) else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)
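The label-reuse lookup then becomes a Hamming-distance scan over cached fingerprints; a linear scan (as sketched here, with names of my choosing) is fine at this scale, and a real system could bucket by fingerprint prefix:

```python
def hamming(a: int, b: int) -> int:
    # popcount of the XOR gives the number of differing bits
    return bin(a ^ b).count("1")

def cached_label(fingerprint: int, cache: dict[int, bool], max_dist: int = 3):
    """Reuse a label if any cached fingerprint is within Hamming distance 3."""
    for fp, label in cache.items():
        if hamming(fingerprint, fp) <= max_dist:
            return label
    return None  # cache miss: this page needs a real LLM call
```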
Combined these reduce the volume to ~320K batched API calls for ~2M URLs that need classification, yielding ~1M indexed personal sites.
```sql
CREATE TABLE sites (
  id SERIAL PRIMARY KEY,
  url TEXT UNIQUE NOT NULL,
  source TEXT,
  created_at TIMESTAMP DEFAULT NOW(),
  source_url TEXT,
  score INT DEFAULT 0
);
```
The feed query filters WHERE score >= 0 so downvoted sites drop out organically. Ingest is bulk COPY FROM CSV. 1M rows at ~500 bytes is ~500MB with indexes, which fits in Neon's free tier. ORDER BY random() LIMIT 15 takes ~50ms at 1M rows; at 10M+ it would need a materialized sample. If no DATABASE_URL is set the app degrades to flat files (sites.txt, votes.json, views.txt).
Next.js 16, React 19, Tailwind 4. One page, one component: full-screen iframe with swipe/arrow-key navigation.
```tsx
const [history, setHistory] = useState<string[]>([]);
const [queue, setQueue] = useState<string[]>([]);
const [historyIdx, setHistoryIdx] = useState<number>(0);

function getUrlFrom(history: string[], queue: string[], idx: number) {
  if (idx < history.length) return history[idx];
  const qIdx = idx - history.length;
  return qIdx < queue.length ? queue[qIdx] : null;
}
```
history and queue form a virtual list addressed by a single cursor. Back decrements the cursor (O(1), no network). Forward pops from queue into history and refills in the background. Prefetch fires when queue.length < 10 via /api/random?count=15&exclude=[...history], where the SQL uses url != ALL($1::text[]) to avoid re-showing sites in a session.
| Stage | Latency/item | Concurrency | Throughput |
|---|---|---|---|
| Fetch | ~200ms | 300 | ~1500/sec |
| HTML to MD | ~50ms | 24 threads | ~480/sec |
| LLM classify | ~500ms | 25 | ~50/sec |
| DB insert | bulk COPY | 1 | ~100K/sec |
With batching and pre-filtering, the workload comes to ~320K batched API calls at 25 concurrent; the full run (fetch, conversion, classification) took about 4 hours. End-to-end, that lands at ~1M indexed personal sites.
| Line item | Unit cost | Quantity | Total |
|---|---|---|---|
| LLM classification (Parasail) | ~$0.0001/call | ~320K batched | ~$32 |
| Storage (Neon) | ~$0.10/CU-hr | ~20 CU-hrs | ~$2 |
| Total | | | ~$35 |
Every classified site already has Markdown content. Embedding it (768-dim, ~3GB for 1M sites) and building a vector index adds "find sites like this one." At 1M vectors HNSW fits in memory; at 10M something like DiskANN queries from disk. At that scale Postgres ORDER BY random() also breaks, the processed-URLs set needs a Bloom filter, and classification needs to run continuously rather than in batch.
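At query time, "find sites like this one" reduces to a nearest-neighbor search; a brute-force cosine scan like the sketch below is what HNSW replaces at 1M+ vectors (names here are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similar_sites(query: list[float], embeddings: dict[str, list[float]], k: int = 5):
    """Rank site URLs by cosine similarity to the query embedding."""
    return sorted(embeddings,
                  key=lambda url: cosine(query, embeddings[url]),
                  reverse=True)[:k]
```

Brute force is O(n) per query, which is why an approximate index (HNSW in memory, DiskANN from disk) takes over as the corpus grows.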