Building a Web Crawler and Data Storage System with Django, Celery, RabbitMQ, Selenium, BeautifulSoup, and MongoDB
In this project, we built a web crawling system using Django, Celery, RabbitMQ, Selenium, BeautifulSoup, and MongoDB. This system periodically crawls data from a specified website and stores it in MongoDB. All services are containerized using Docker and managed with Docker Compose.
For testing purposes, we used the example site (https://example.com) and sample data. This allowed us to ensure that the web crawling and data storage functionality works correctly without accessing real-world data.
Tech Stack
Django: Used as the web framework to define and manage Celery tasks.
Celery: Utilized as the asynchronous task queue to periodically execute crawling tasks.
RabbitMQ: Acts as the message broker for Celery.
Selenium: Performs the web crawling.
BeautifulSoup: Parses the HTML of the crawled web page.
MongoDB: Stores the crawled data.
Docker: Containerizes all services.
Docker Compose: Manages multiple containers easily.
Project Architecture and Flow
Django:
Defines the web crawling task using Celery.
Manages the lifecycle of the task, including scheduling and execution.
Celery:
Executes the web crawling task asynchronously.
Periodically schedules the task using Celery Beat.
RabbitMQ:
Serves as the message broker, facilitating communication between Django and Celery.
Selenium:
Automates the web browser to crawl the specified web page.
Extracts the required data.
BeautifulSoup:
Parses the HTML content of the crawled web page.
Extracts specific elements from the HTML.
MongoDB:
Stores the crawled data in a structured format.
Docker and Docker Compose:
Containerizes each service for easy deployment and management.
Uses Docker Compose to define and manage the multi-container application.
Why Celery and RabbitMQ?
Celery is used for its robust support for asynchronous task execution, which allows us to handle long-running crawling tasks efficiently. It also provides an easy way to schedule periodic tasks using Celery Beat. This ensures that the web crawling tasks are executed at regular intervals without manual intervention.
RabbitMQ is chosen as the message broker because of its reliability, performance, and widespread adoption in the industry. It ensures that tasks are reliably sent to Celery workers, enabling efficient task distribution and execution.
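Connecting the two comes down to a single Django setting. A minimal sketch, assuming the RabbitMQ service is named `rabbitmq` in `docker-compose.yml` and runs with the default guest account (both are assumptions, not the exact project values):

```python
# settings.py (excerpt) -- hypothetical values; "rabbitmq" is the assumed
# docker-compose service name, using RabbitMQ's default guest credentials
CELERY_BROKER_URL = "amqp://guest:guest@rabbitmq:5672//"
```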
Scheduler Setup
To schedule the periodic execution of the crawling task, we use Celery Beat. Celery Beat is a scheduler that sends tasks to Celery at regular intervals. We configure Celery Beat to trigger our crawling task every 5 minutes. This allows our system to automatically crawl the specified website and update the data in MongoDB at regular intervals.
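A minimal sketch of what the 5-minute schedule could look like in `settings.py`; the task path `myapp.tasks.crawl_example` is a placeholder for the real task name:

```python
# settings.py (excerpt) -- the task path below is a placeholder, not the exact name
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    "crawl-every-5-minutes": {
        "task": "myapp.tasks.crawl_example",  # placeholder task path
        "schedule": crontab(minute="*/5"),    # run at every 5th minute
    },
}
```

With `django_celery_beat` and its database scheduler (used later in the troubleshooting section), the same schedule can instead be managed from the Django admin.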
Code
Directory Structure
requirements.txt
celery.py
__init__.py
settings.py
tasks.py
Dockerfile
docker-compose.yml
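The heart of the system is the Celery task in `tasks.py`. The sketch below is not the exact project code, but it shows the overall shape of the task, assuming a headless Chrome driver installed at `/usr/local/bin/chromedriver` inside the container, the `https://example.com` test page, and a MongoDB service named `mongodb` in `docker-compose.yml`:

```python
# tasks.py -- a minimal sketch; the chromedriver path, service names, and the
# parsed fields are assumptions, not the exact project code
from bs4 import BeautifulSoup
from celery import shared_task
from pymongo import MongoClient
from selenium import webdriver


@shared_task
def crawl_example():
    # Headless Chrome options so the browser can run inside the container
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")

    # Selenium 3.x style, matching the pinned selenium==3.141.0
    driver = webdriver.Chrome(executable_path="/usr/local/bin/chromedriver",
                              options=options)
    try:
        # Selenium loads the page; BeautifulSoup parses the rendered HTML
        driver.get("https://example.com")
        soup = BeautifulSoup(driver.page_source, "html.parser")
        heading = soup.find("h1")
        data = {
            "title": soup.title.string if soup.title else None,
            "heading": heading.get_text(strip=True) if heading else None,
        }
    finally:
        driver.quit()

    # Store the extracted data in MongoDB ("mongodb" is the assumed service name)
    client = MongoClient("mongodb://mongodb:27017/")
    client["crawler"]["pages"].insert_one(data)
    client.close()
    return data["title"]
```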
Running the Services
Start Docker Compose services:
We use Docker Compose to start and manage our services. Running `docker-compose up --build` builds and starts all the services defined in our `docker-compose.yml` file.
Access the Django container:
Using Docker Compose, we can access the Django container by running `docker-compose run backend sh`. This allows us to interact with the Django application directly.
Run Django shell and invoke the Celery task:
Within the Django container, we can start the Django shell by running `python manage.py shell`. From the shell, we can manually invoke the Celery task to test the web crawling functionality (see the sketch after this list).
Check the Celery worker logs:
We can check the logs of the Celery worker to monitor the execution of the crawling tasks. This helps us ensure that the tasks are being executed as expected and troubleshoot any issues that may arise.
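For step 3, a manual invocation from the Django shell might look like this; the app label and task name are placeholders:

```python
# Run inside `python manage.py shell`; "myapp.tasks.crawl_example" is a placeholder
from myapp.tasks import crawl_example

# .delay() only enqueues the task on RabbitMQ; a Celery worker actually runs it
result = crawl_example.delay()
print(result.id)  # task id to look for in the Celery worker logs
```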
Screenshots
Screenshots of Celery, RabbitMQ, Flower, and MongoDB in action.
Troubleshooting
1. Celery Worker Not Receiving Tasks
Issue: Celery worker logs show that tasks are not being received.
Solution:
Ensure that RabbitMQ is running and accessible.
Verify that the `CELERY_BROKER_URL` in your Django settings points to the correct RabbitMQ instance.
Check if the Celery worker is connected to the correct Django app. Run `celery -A myproject worker -l info` to start the worker.
2. Timeout value connect was <object object at ...>, but it must be an int, float or None
Issue: This error typically occurs when there is a version conflict between `urllib3` and `selenium`.
Solution:
Downgrade Selenium to a compatible version. For this project, using `selenium==3.141.0` resolved the issue.
Ensure `urllib3` is set to `1.26.6` in your `requirements.txt`.
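A quick way to confirm which versions actually ended up inside the container:

```python
# Run inside the backend container, e.g. in `python manage.py shell`
import selenium
import urllib3

print(selenium.__version__)  # expected: 3.141.0
print(urllib3.__version__)   # expected: 1.26.6
```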
3. MongoDB Connection Timeout
Issue: Celery logs show a timeout error when connecting to MongoDB.
Solution:
Verify that MongoDB is running and accessible.
Ensure that the MongoDB connection string in `tasks.py` is correct.
Increase the timeout values in the MongoDB client initialization, as in the sketch below.
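A sketch of such a client initialization with explicit timeouts; the host name `mongodb` assumes the docker-compose service name, and the values are only examples:

```python
from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongodb:27017/",      # assumed docker-compose service name
    serverSelectionTimeoutMS=10000,  # wait up to 10 s to find a reachable server
    connectTimeoutMS=10000,          # wait up to 10 s for the initial connection
    socketTimeoutMS=20000,           # wait up to 20 s for individual operations
)
client.admin.command("ping")         # fails fast if MongoDB is unreachable
```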
4. ModuleNotFoundError: No module named 'myproject.celery'
Issue: Celery cannot find the Django project module.
Solution:
Ensure that your Django project structure is correct.
Make sure there is an `__init__.py` file in your Django project directory.
Verify that the `CELERY_APP` in your Django settings points to `myproject.celery`.
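For reference, the standard wiring looks roughly like this sketch of `myproject/celery.py`:

```python
# myproject/celery.py -- a minimal sketch of the usual Django + Celery wiring
import os

from celery import Celery

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")

app = Celery("myproject")
# Read the CELERY_* settings from Django's settings.py
app.config_from_object("django.conf:settings", namespace="CELERY")
# Discover tasks.py modules in the installed apps
app.autodiscover_tasks()
```

`myproject/__init__.py` should then contain `from .celery import app as celery_app` (and `__all__ = ("celery_app",)`) so the Celery app is imported whenever Django starts.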
5. Docker Container Exits Immediately
Issue: One or more Docker containers exit immediately after starting.
Solution:
Check the container logs using `docker-compose logs <container_name>`.
Ensure that all services in `docker-compose.yml` have correct configurations.
Verify that the commands specified in the Dockerfile and `docker-compose.yml` are correct.
6. Periodic Task Not Running
Issue: Celery Beat is not triggering the periodic task.
Solution:
Ensure Celery Beat is running with the correct scheduler: `celery -A myproject beat -l info --scheduler django_celery_beat.schedulers:DatabaseScheduler`.
Verify that the periodic task is correctly defined in the `CELERY_BEAT_SCHEDULE` setting in Django settings.
Check the Django admin interface to ensure the periodic task is enabled (a programmatic sketch follows below).
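When the database scheduler is used, the periodic task can also be created from code instead of the admin. A sketch using `django_celery_beat`'s models; the task path is a placeholder:

```python
# Only relevant when using django_celery_beat's DatabaseScheduler
from django_celery_beat.models import IntervalSchedule, PeriodicTask

# Create (or reuse) a 5-minute interval
schedule, _ = IntervalSchedule.objects.get_or_create(
    every=5, period=IntervalSchedule.MINUTES
)

# Register the crawling task against that interval
PeriodicTask.objects.get_or_create(
    name="crawl-every-5-minutes",
    task="myapp.tasks.crawl_example",  # placeholder task path
    interval=schedule,
)
```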
7. chromedriver Not Found
Issue: Selenium cannot find `chromedriver`.
Solution:
Ensure `chromedriver` is installed in the Docker container.
Verify the `executable_path` in the Selenium setup points to the correct location of `chromedriver` (see the sketch below).
Check if the `chromedriver` binary has the correct permissions.
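A sketch of the Selenium 3.x setup with an explicit driver path, matching the pinned `selenium==3.141.0`; the path itself is an assumption and must match wherever the Dockerfile installs `chromedriver`:

```python
from selenium import webdriver

CHROMEDRIVER_PATH = "/usr/local/bin/chromedriver"  # assumed install location

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--no-sandbox")

# executable_path is the Selenium 3.x API; if this raises a permission error,
# make the binary executable in the Dockerfile (chmod +x)
driver = webdriver.Chrome(executable_path=CHROMEDRIVER_PATH, options=options)
```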
Conclusion
In this project, we successfully built a web crawling and data storage system using Django, Celery, RabbitMQ, Selenium, BeautifulSoup, and MongoDB. Docker and Docker Compose made it easy to containerize and manage these services. This system can be used for various web data collection and analysis tasks, providing a robust and scalable solution for automated data extraction.