This project is an AI-powered web scraper built with Crawl4AI. It automates lead generation by extracting the names, addresses, phone numbers, and more of local businesses (dentists, restaurants, etc.) from YellowPages. With the help of LLMs like GPT-4o, Claude, and DeepSeek, it intelligently processes the data and saves it to CSV files, ready for outreach or analysis!
- Extract Business Information – Scrape business names, contact details, and other key data.
- AI-Powered Data Processing – Use LLMs to clean, format, and enhance the extracted data.
- Customizable Scraper – Adapt it to different websites and data types.
- Flexible LLM Integration – Choose from AI models like GPT-4, Claude, and DeepSeek.
This scraper is designed for YellowPages but can be used on any website. You can change the target URL, modify the AI instructions to adjust how the data is processed, and define new data fields based on your needs.
- Lead Generation – Collect business emails, phone numbers, and addresses to build targeted outreach lists.
- Market Research – Gather real-time industry data to analyze trends and customer behavior.
- Competitor Analysis – Monitor pricing, services, and customer reviews to stay competitive.
- AI Data Enrichment – Use LLMs to clean and categorize data for better insights.
- Research & Analysis – Extract structured data from directories, reports, and other sources for business or academic studies.
```
.
├── main.py             # Main entry point for the crawler
├── config.py           # Configuration constants (LLM model, base URL, CSS selectors, etc.)
├── models
│   └── business.py     # Defines the local business data model using Pydantic
├── src
│   ├── utils.py        # Utility functions for processing and saving data
│   └── scraper.py      # Functions for configuring and running the crawler
└── requirements.txt    # Python package dependencies
```
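As an illustrative sketch of the Pydantic model in `models/business.py` (the field names here are assumptions, not the project's exact schema), the LLM's output is validated into typed records along these lines:

```python
from typing import Optional

from pydantic import BaseModel


class Business(BaseModel):
    """One local business record extracted by the LLM (illustrative fields)."""

    name: str
    address: Optional[str] = None
    phone: Optional[str] = None
    website: Optional[str] = None


# Pydantic validates the raw LLM output and fills missing optional fields
record = Business(name="Smile Dental", phone="416-555-0199")
print(record.model_dump())
```

Defining the schema this way lets the extraction strategy reject malformed LLM responses instead of silently writing bad rows to the CSV.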
Ensure you have the following installed:
- Python 3.11+
- LLM provider API key (OpenAI, Gemini, Claude,...)
- Necessary Python libraries (listed in `requirements.txt`)
```bash
git clone https://github.com/kaymen99/llm-web-scraper
cd llm-web-scraper
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r requirements.txt
playwright install
```
Create a `.env` file in the root directory and add the necessary credentials:
```env
# API keys for LLM providers; add a key for every provider you want to use
OPENAI_API_KEY=""  # OpenAI API key for accessing OpenAI's models
GEMINI_API_KEY=""  # Google API key for accessing Gemini models
GROQ_API_KEY=""    # Groq platform API key for using Groq-hosted models
```
To start the scraper, run:
```bash
python main.py
```
The script will crawl the specified website, extract data page by page, and save the complete business records to a `businesses_data.csv` file in the project directory. Additionally, usage statistics for the LLM strategy will be displayed after crawling.
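Internally, the save step can be as simple as a `csv.DictWriter` dump. A minimal sketch follows; the helper name and signature are assumptions for illustration, the real implementation lives in `src/utils.py`:

```python
import csv


def save_businesses_to_csv(businesses: list[dict], path: str = "businesses_data.csv") -> None:
    """Write extracted business records to a CSV file (hypothetical helper)."""
    if not businesses:
        return  # nothing extracted, so no file is written
    with open(path, "w", newline="", encoding="utf-8") as f:
        # Use the first record's keys as the header row
        writer = csv.DictWriter(f, fieldnames=businesses[0].keys())
        writer.writeheader()
        writer.writerows(businesses)


save_businesses_to_csv(
    [{"name": "Smile Dental", "phone": "416-555-0199", "address": "123 Queen St W"}]
)
```

Because `DictWriter` derives columns from the record keys, adding a new field to the data model automatically adds a column to the output file.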
The `config.py` file contains key settings for controlling the scraper's behavior. You can modify these values to customize the scraping process:
- `LLM_MODEL`: The AI model used for data extraction. Supports any LLM available through LiteLLM (e.g., `gpt-4o`, `claude`, `deepseek-chat`, `gemini-2.0-flash`).
- `BASE_URL`: The target website to scrape. By default, it extracts dentists in Toronto from Yellow Pages, but you can change this to any business category or location.
- `CSS_SELECTOR`: The HTML selector used to pinpoint business details within the page.
- `MAX_PAGES`: Limits the number of pages to crawl (default: `3`). Increase this value to scrape more data.
- `SCRAPER_INSTRUCTIONS`: Custom LLM prompt defining which details to extract.
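Put together, `config.py` is essentially a handful of constants. The sketch below is illustrative only; the URL and selector strings are placeholders, not the project's real defaults:

```python
# config.py -- illustrative sketch; check the actual file for the real values
LLM_MODEL = "gpt-4o"  # any LiteLLM-supported model id, e.g. "deepseek-chat"
BASE_URL = "https://www.yellowpages.ca/..."  # target search URL (placeholder)
CSS_SELECTOR = "div.listing"  # HTML block holding one business (placeholder)
MAX_PAGES = 3  # number of result pages to crawl; raise this to scrape more
SCRAPER_INSTRUCTIONS = (
    "Extract the business name, address, phone number and website "
    "from the listing content below."
)
```

Keeping these in one module means switching models or targets is a one-line change with no edits to the crawler code itself.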
Contributions are welcome! Please open an issue or submit a pull request for any changes.
If you have any questions or suggestions, feel free to contact me at [email protected].