Building a Web Crawler and Data Storage System

Building a Web Crawler and Data Storage System with Django, Celery, RabbitMQ, Selenium, BeautifulSoup, and MongoDB

In this project, we built a web crawling system using Django, Celery, RabbitMQ, Selenium, BeautifulSoup, and MongoDB. This system periodically crawls data from a specified website and stores it in MongoDB. All services are containerized using Docker and managed with Docker Compose.

For testing purposes, we used the example site (https://example.com) and sample data. This allowed us to verify that the web crawling and data storage functionality worked correctly without relying on real-world data.

Tech Stack

  • Django: Used as the web framework to define and manage Celery tasks.

  • Celery: Utilized as the asynchronous task queue to periodically execute crawling tasks.

  • RabbitMQ: Acts as the message broker for Celery.

  • Selenium: Performs the web crawling.

  • BeautifulSoup: Parses the HTML of the crawled web page.

  • MongoDB: Stores the crawled data.

  • Docker: Containerizes all services.

  • Docker Compose: Manages multiple containers easily.

Project Architecture and Flow

  1. Django:

    • Defines the web crawling task using Celery.

    • Manages the lifecycle of the task, including scheduling and execution.

  2. Celery:

    • Executes the web crawling task asynchronously.

    • Periodically schedules the task using Celery Beat.

  3. RabbitMQ:

    • Serves as the message broker, facilitating communication between Django and Celery.

  4. Selenium:

    • Automates the web browser to crawl the specified web page.

    • Extracts the required data.

  5. BeautifulSoup:

    • Parses the HTML content of the crawled web page.

    • Extracts specific elements from the HTML.

  6. MongoDB:

    • Stores the crawled data in a structured format.

  7. Docker and Docker Compose:

    • Containerizes each service for easy deployment and management.

    • Uses Docker Compose to define and manage the multi-container application.

Why Celery and RabbitMQ?

Celery is used for its robust support for asynchronous task execution, which allows us to handle long-running crawling tasks efficiently. It also provides an easy way to schedule periodic tasks using Celery Beat. This ensures that the web crawling tasks are executed at regular intervals without manual intervention.

RabbitMQ is chosen as the message broker because of its reliability, performance, and widespread adoption in the industry. It ensures that tasks are reliably sent to Celery workers, enabling efficient task distribution and execution.

Scheduler Setup

To schedule the periodic execution of the crawling task, we use Celery Beat. Celery Beat is a scheduler that sends tasks to Celery at regular intervals. We configure Celery Beat to trigger our crawling task every 5 minutes. This allows our system to automatically crawl the specified website and update the data in MongoDB at regular intervals.
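
Since docker-compose.yml runs Celery Beat with the django_celery_beat DatabaseScheduler, the same schedule can equivalently be created in the database (via the Django admin or the shell) instead of, or in addition to, the CELERY_BEAT_SCHEDULE setting shown below. The following is only an illustrative sketch; the display name string is arbitrary.

from django_celery_beat.models import IntervalSchedule, PeriodicTask

# Run once, e.g. from `python manage.py shell` in the backend container
schedule, _ = IntervalSchedule.objects.get_or_create(
    every=5,
    period=IntervalSchedule.MINUTES,
)

PeriodicTask.objects.get_or_create(
    name='Scrape and store every 5 minutes',  # arbitrary, human-readable name
    task='myproject.tasks.scrape_and_store',
    interval=schedule,
)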

Code

Directory Structure

.
├── backend/
│   ├── myproject/
│   │   ├── __init__.py
│   │   ├── celery.py
│   │   ├── settings.py
│   │   ├── tasks.py
│   │   └── urls.py
│   ├── manage.py
│   └── Dockerfile
├── docker-compose.yml
└── requirements.txt

requirements.txt

Django==3.2.4
djangorestframework==3.12.4
pymongo==3.11.4
celery==5.1.2
django-celery-beat==2.2.1
selenium==3.141.0
beautifulsoup4==4.9.3
urllib3==1.26.6

celery.py

from __future__ import absolute_import, unicode_literals
import os
from celery import Celery
from django.conf import settings

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')

app = Celery('myproject')

app.config_from_object('django.conf:settings', namespace='CELERY')

app.autodiscover_tasks(lambda: settings.INSTALLED_APPS)

@app.task(bind=True)
def debug_task(self):
    print(f'Request: {self.request!r}')

__init__.py

from __future__ import absolute_import, unicode_literals

# This will make sure the app is always imported when
# Django starts so that shared_task will use this app.
from .celery import app as celery_app

__all__ = ('celery_app',)

settings.py

# Celery settings
INSTALLED_APPS = [
    ...
    'myproject',
    'django_celery_beat',
]
# Using RabbitMQ as the broker
CELERY_BROKER_URL = 'amqp://guest:guest@rabbitmq:5672//'
CELERY_RESULT_BACKEND = 'rpc://'

CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TIMEZONE = 'UTC'

# Celery Beat settings
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'scrape-every-5-minutes': {
        'task': 'myproject.tasks.scrape_and_store',
        'schedule': crontab(minute='*/5'),  # Execute every 5 minutes
    },
}

tasks.py

from celery import shared_task
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from pymongo import MongoClient, errors
import time
import logging

# Configure logger
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@shared_task
def scrape_and_store():
    try:
        url = 'https://example.com'  # The URL to crawl (the example site used for testing in this project)

        chrome_options = Options()
        chrome_options.add_argument("--headless")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.binary_location = "/usr/bin/chromium"

        # Selenium 3.141.0 (pinned in requirements.txt) takes the driver path
        # directly; the `service=` keyword only exists in Selenium 4
        driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver", options=chrome_options)
        driver.get(url)
        time.sleep(3)  # Wait for the page to load

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        data = {
            'title': soup.title.string,
            'body': soup.body.text.strip()
        }

        driver.quit()

        # MongoDB connection
        try:
            logger.info("Connecting to MongoDB with the following timeouts:")
            logger.info("serverSelectionTimeoutMS=5000, connectTimeoutMS=10000, socketTimeoutMS=10000")

            client = MongoClient(
                'mongodb://mongo:27017/',
                serverSelectionTimeoutMS=5000, 
                connectTimeoutMS=10000,   
                socketTimeoutMS=10000   
            )

            logger.info("Connected to MongoDB. Inserting data...")
            db = client['mydatabase']
            collection = db['scraped_data']
            collection.insert_one(data)
            client.close()
            logger.info("Data scraped and stored successfully")
        except errors.ServerSelectionTimeoutError as err:
            logger.error("MongoDB connection timed out: %s", err)

    except Exception as e:
        logger.error("Error occurred: %s", e)

Dockerfile

FROM python:3.9-slim

ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1

WORKDIR /code

# Install Chromium and the matching chromedriver before the Python dependencies
# (chrome_options.binary_location in tasks.py points at /usr/bin/chromium)
RUN apt-get update && apt-get install -y \
    chromium \
    chromium-driver \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt /code/
RUN pip install -r requirements.txt

COPY . /code/

docker-compose.yml

version: '3'

services:
  mongo:
    image: mongo
    container_name: mongodb
    ports:
      - "27017:27017"

  rabbitmq:
    image: rabbitmq:3-management
    container_name: rabbitmq
    ports:
      - "5672:5672"
      - "15672:15672"

  backend:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: django
    command: sh -c "python manage.py migrate && python manage.py runserver 0.0.0.0:8000"
    volumes:
      - ./backend:/code
    ports:
      - "8000:8000"
    depends_on:
      - mongo
      - rabbitmq

  celery:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: celery
    command: celery -A myproject worker -l info
    volumes:
      - ./backend:/code
    depends_on:
      - backend
      - rabbitmq
      - mongo

  celery-beat:
    build:
      context: ./backend
      dockerfile: Dockerfile
    container_name: celery-beat
    command: sh -c "python manage.py migrate && celery -A myproject beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler"
    volumes:
      - ./backend:/code
    depends_on:
      - backend
      - rabbitmq
      - mongo

Running the Services

  1. Start Docker Compose services:

    • We use Docker Compose to start and manage our services. By running docker-compose up --build, we can build and start all the services defined in our docker-compose.yml file.

  2. Access the Django container:

    • Using Docker Compose, we can open a shell in the backend service by running docker-compose run backend sh (or docker-compose exec backend sh if the container is already running). This allows us to interact with the Django application directly.

  3. Run Django shell and invoke the Celery task:

    • Within the Django container, we can start the Django shell by running python manage.py shell. From the shell, we can manually invoke the Celery task to test the web crawling functionality (see the snippet after this list).

  4. Check the Celery worker logs:

    • We can check the logs of the Celery worker to monitor the execution of the crawling tasks. This helps us ensure that the tasks are being executed as expected and troubleshoot any issues that may arise.
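
For step 3, a minimal sketch of invoking the task from the Django shell (the Celery worker container must be running for the queued call to be picked up):

from myproject.tasks import scrape_and_store

result = scrape_and_store.delay()  # queue the task via RabbitMQ for the worker
print(result.id)                   # task ID to look for in the worker logs

# Alternatively, run it synchronously in the current process for a quick check:
# scrape_and_store()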

Screenshots

  • Celery

  • RabbitMQ

  • Flower

  • MongoDB

Troubleshooting

1. Celery Worker Not Receiving Tasks

Issue: Celery worker logs show that tasks are not being received.

Solution:

  • Ensure that RabbitMQ is running and accessible.

  • Verify that the CELERY_BROKER_URL in your Django settings points to the correct RabbitMQ instance.

  • Check if the Celery worker is connected to the correct Django app. Run celery -A myproject worker -l info to start the worker.
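
As a quick sanity check (a sketch assuming the project layout above), the workers can be pinged through the broker from the Django shell:

from myproject.celery import app

# Returns one reply per connected worker; an empty list usually means
# no worker is connected to this RabbitMQ instance.
print(app.control.ping(timeout=2.0))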

2. Timeout value connect was <object object at ...>, but it must be an int, float or None

Issue: This error typically occurs when there is a version conflict between urllib3 and selenium.

Solution:

  • Downgrade Selenium to a compatible version. For this project, using selenium==3.141.0 resolved the issue.

  • Ensure urllib3 is set to 1.26.6 in your requirements.txt.

3. MongoDB Connection Timeout

Issue: Celery logs show a timeout error when connecting to MongoDB.

Solution:

  • Verify that MongoDB is running and accessible.

  • Ensure that the MongoDB connection string in tasks.py is correct.

  • Increase the timeout values in the MongoDB client initialization.
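
A minimal reachability check, assuming the mongo service name from docker-compose.yml, run from inside one of the containers:

from pymongo import MongoClient, errors

client = MongoClient('mongodb://mongo:27017/', serverSelectionTimeoutMS=5000)
try:
    client.admin.command('ping')  # raises ServerSelectionTimeoutError if unreachable
    print('MongoDB is reachable')
except errors.ServerSelectionTimeoutError as err:
    print('MongoDB is not reachable:', err)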

4. ModuleNotFoundError: No module named 'myproject.celery'

Issue: Celery cannot find the Django project module.

Solution:

  • Ensure that your Django project structure is correct.

  • Make sure there is an __init__.py file in your Django project directory.

  • Verify that the app argument passed to the celery command (celery -A myproject) matches the package that contains celery.py.

5. Docker Container Exits Immediately

Issue: One or more Docker containers exit immediately after starting.

Solution:

  • Check the container logs using docker-compose logs <container_name>.

  • Ensure that all services in docker-compose.yml have correct configurations.

  • Verify that the commands specified in the Dockerfile and docker-compose.yml are correct.

6. Periodic Task Not Running

Issue: Celery Beat is not triggering the periodic task.

Solution:

  • Ensure Celery Beat is running with the correct scheduler: celery -A myproject beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler.

  • Verify that the periodic task is correctly defined in the CELERY_BEAT_SCHEDULE setting in Django settings.

  • Check the Django admin interface to ensure the periodic task is enabled.

7. chromedriver Not Found

Issue: Selenium cannot find chromedriver.

Solution:

  • Ensure chromedriver is installed in the Docker container.

  • Verify the executable_path in the Selenium setup points to the correct location of chromedriver.

  • Check if the chromedriver binary has the correct permissions.
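
A small check that can be run inside the container (paths match the Dockerfile and tasks.py above):

import os
import subprocess

path = '/usr/bin/chromedriver'
print('exists:', os.path.exists(path))
print('executable:', os.access(path, os.X_OK))
subprocess.run([path, '--version'])  # prints the installed ChromeDriver version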

Conclusion

In this project, we successfully built a web crawling and data storage system using Django, Celery, RabbitMQ, Selenium, BeautifulSoup, and MongoDB. Docker and Docker Compose made it easy to containerize and manage these services. This system can be used for various web data collection and analysis tasks, providing a robust and scalable solution for automated data extraction.
