Adaptive Q-Learning Web Crawler
A hybrid reinforcement learning system combining Q-Learning, Contextual Bandits, and Graph Neural Networks for intelligent focused web crawling

A novel hybrid approach to focused web crawling that combines three complementary AI techniques for intelligent link selection and web navigation.
Key Features
- Hybrid RL Architecture: Integrates Q-Learning, Contextual Bandits (LinUCB), and Graph Neural Networks (GNNs)
- Graph-Aware Navigation: GraphSAGE-based node embeddings capture web topology and structural information
- Intelligent Link Selection: LinUCB bandit algorithm balances exploration-exploitation for efficient link discovery
- Value-Based Learning: Q-Learning provides long-term navigation strategy and reward optimization
- Contextual Decisions: Fast convergence using contextual information from web graph structure
- Research-Backed: Based on peer-reviewed papers on RL-based web crawling and focused crawlers
- Student-Friendly: $0.10 total cost, CPU-only, no GPU required, 60-70% harvest rate
How It Works
- Graph Construction: Builds web graph from seed URLs with node/edge features
- GNN Pre-training: GraphSAGE learns structural embeddings from web topology (frozen for stability)
- Q-Agent Training: Learns navigation strategy through offline simulation environment
- Link Selection: LinUCB contextual bandit selects promising links with exploration-exploitation balance
- Reward Signal: Positive rewards for target-domain discoveries, penalties for inefficient navigation
- Adaptive Crawling: System dynamically improves link selection based on cumulative rewards
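The value-based learning and reward signal described above can be sketched as a tabular Q-learning update. This is an illustrative simplification, not the project's actual agent: the state names, reward values, and hyperparameters are assumptions for the sake of the example.

```python
import random

# Illustrative tabular Q-learning for a crawl-level "continue vs. stop" decision.
# States, actions, and reward values here are simplified assumptions.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2
ACTIONS = ["continue", "stop"]

q_table = {}  # maps (state, action) -> estimated long-term value


def q_value(state, action):
    return q_table.get((state, action), 0.0)


def choose_action(state):
    # epsilon-greedy: explore with probability EPSILON, otherwise exploit
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_value(state, a))


def update(state, action, reward, next_state):
    # Standard Q-learning target: r + gamma * max_a' Q(s', a')
    best_next = max(q_value(next_state, a) for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] = q_value(state, action) + ALPHA * (
        td_target - q_value(state, action)
    )


# Example: positive reward for landing on a target-domain page
update("on_topic_page", "continue", reward=1.0, next_state="on_topic_page")
```

Positive rewards for target-domain discoveries raise the value of "continue" in productive regions, while penalties for inefficient navigation push the agent toward "stop".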
How It Works (Simple)
Think of it like a person browsing Wikipedia trying to stay on the topic of Machine Learning.
- GNN: gives the crawler a map of where it is. “I’m in this neighborhood of the web, these pages are nearby.” Frozen after training, just used for context.
- Bandit (LinUCB): when the crawler sees 50 links on a page, this picks the best one. Like a slot-machine learner, it balances “go with what’s worked before” vs. “try something new.”
- Q-Learning: the boss layer. After each page, it decides: keep going or stop? Prevents the crawler from going down rabbit holes and wasting time.
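The bandit layer's exploration-exploitation balance can be sketched as a minimal LinUCB scorer. This assumes each candidate link is described by a small context vector (e.g., GNN embedding plus link features); the dimensions, `alpha` value, and class layout are illustrative, not the project's exact implementation.

```python
import numpy as np


# Minimal LinUCB sketch for picking one link among many candidates.
# Each link is assumed to have a d-dimensional context vector.
class LinUCB:
    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha      # exploration strength
        self.A = np.eye(dim)    # ridge-regression Gram matrix
        self.b = np.zeros(dim)  # reward-weighted context sum

    def score(self, x):
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b  # current reward estimate
        # UCB score = predicted reward + uncertainty bonus
        return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)

    def select(self, contexts):
        # "go with what's worked" (theta @ x) vs. "try something new" (bonus)
        return max(range(len(contexts)), key=lambda i: self.score(contexts[i]))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x


bandit = LinUCB(dim=3)
links = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
choice = bandit.select(links)
bandit.update(links[choice], reward=1.0)  # the chosen link paid off
```

After a rewarded pull, the model's predicted reward for similar contexts rises, while rarely tried directions keep a large uncertainty bonus and still get explored.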
The loop
visit page → look at links → Q-agent says continue → bandit picks best link → visit that page → repeat
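The loop above can be made concrete as a toy end-to-end run on a tiny synthetic web graph. Everything here is a stand-in: the graph, the `"ml/"` target-domain convention, and the greedy link policy replace the real Q-agent and bandit purely for illustration.

```python
# Toy version of the crawl loop on a tiny synthetic web graph.
# The graph, reward scheme, and greedy policy are illustrative stand-ins
# for the real fetcher, Q-agent, and LinUCB bandit.
GRAPH = {
    "seed": ["ml/intro", "sports/news"],
    "ml/intro": ["ml/deep", "sports/news"],
    "ml/deep": [],
    "sports/news": [],
}


def compute_reward(url):
    # Positive reward for target-domain pages, small penalty otherwise
    return 1.0 if url.startswith("ml/") else -0.1


def crawl(seed, select_link, max_pages=10):
    url, total_reward, visited = seed, 0.0, [seed]
    for _ in range(max_pages):
        links = GRAPH.get(url, [])           # look at links
        if not links:                        # stand-in for Q-agent's "stop"
            break
        url = select_link(links)             # stand-in for bandit's pick
        total_reward += compute_reward(url)  # reward signal
        visited.append(url)                  # visit that page, repeat
    return visited, total_reward


# Greedy stand-in policy: prefer links that look on-topic
path, reward = crawl("seed", lambda links: max(links, key=compute_reward))
```

Running this, the crawler stays on the `ml/` branch and skips the off-topic page, which is exactly the behavior the reward signal is meant to induce.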
Technologies Used
- Reinforcement Learning: Q-Learning, Contextual Bandits (LinUCB)
- Graph Neural Networks: PyTorch, GraphSAGE (SAGEConv)
- Framework: PyTorch for deep learning
- Web Technology: HTTP crawling, BeautifulSoup for HTML parsing
- Data Processing: Pandas, NumPy for preprocessing
- Language: Python (100% of codebase)
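To show what the GraphSAGE component computes, here is a single GraphSAGE-style layer with mean aggregation written in plain NumPy. The project itself uses PyTorch's `SAGEConv`; the toy graph, feature dimensions, and weight initialization below are assumptions for illustration only.

```python
import numpy as np


# One GraphSAGE-style layer with mean aggregation, in NumPy for illustration.
# The actual project uses PyTorch's SAGEConv; shapes here are assumptions.
def sage_layer(features, neighbors, W_self, W_neigh):
    out = []
    for i in range(len(features)):
        nbrs = neighbors[i]
        # mean of neighbor embeddings (zeros if the node has no neighbors)
        agg = np.mean(features[nbrs], axis=0) if nbrs else np.zeros_like(features[i])
        h = W_self @ features[i] + W_neigh @ agg  # combine self + neighborhood
        out.append(np.maximum(h, 0.0))            # ReLU nonlinearity
    return np.array(out)


# Toy web graph: page 0 links to pages 1 and 2
features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = [[1, 2], [0], [0]]
rng = np.random.default_rng(0)
W_self, W_neigh = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
embeddings = sage_layer(features, neighbors, W_self, W_neigh)
```

Each page's embedding mixes its own features with the average of its linked neighbors, which is how the frozen GNN encodes "which neighborhood of the web am I in" for the downstream bandit and Q-agent.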
Research Foundation
Based on peer-reviewed research papers:
- Tree-based Focused Web Crawling with Reinforcement Learning (2021)
- Deep Reinforcement Learning for Web Crawling (2021)
- Efficient Deep Web Crawling Using Reinforcement Learning (2010) - 59 citations
- Learning to Crawl Deep Web (2013) - 71 citations
Performance Metrics
- Coverage: 60-70% harvest rate (publishable results)
- Efficiency: High coverage with minimal crawl cost through adaptive link selection
- Training: 3-4 days total on CPU only; runs unattended (e.g., overnight sessions) on standard laptops
- Resource Usage: 8GB RAM sufficient, works on older hardware
- Convergence: Fast convergence through contextual information from graph structure