
AI Data Engineering - Crawling and Scraping Data

This project aimed to design and implement an AI Data Engineering Pipeline to efficiently crawl and scrape data from websites related to concerts, operas, and other domains. The extracted data underwent processing using Natural Language Processing (NLP) techniques such as Named Entity Recognition (NER), summarization, keyword extraction, topic modeling, and classification. Additionally, updates in previously crawled HTML pages were detected using clustering techniques.

To enhance functionality, the solution included a face cropping API built with MTCNN (PyTorch), which detected and cropped the faces of individual artists and artist groups. The entire system was deployed on AWS EC2 with Flask APIs for interaction and containerized with Docker for scalable, production-ready deployment.

Challenges

  1. Dynamic Web Content:
    • Frequent updates to HTML pages required a mechanism to detect and track changes effectively.
  2. Unstructured Data:
    • Extracting structured information like names, keywords, and topics from raw, unstructured web data.
  3. Scalability and Automation:
    • Managing a high volume of URLs and integrating multiple data processing modules in a unified pipeline.
  4. High-Quality Face Detection:
    • Designing an API to crop faces accurately for individual and group photos in varying image resolutions and contexts.
  5. Data Storage and Retrieval:
    • Efficiently storing large volumes of crawled and processed data with metadata in MongoDB and DynamoDB.

Our Solutions

The solution incorporated advanced web scraping, NLP, and AI techniques, ensuring accurate data extraction, update tracking, and face cropping.

  1. Web Crawling and Scraping:
    • Used Scrapy to crawl URLs and scrape data dynamically.
    • Incorporated Selenium to handle JavaScript-heavy and dynamic content.
    • Parsed and cleaned HTML using Beautiful Soup (a crawling sketch follows this list).
  2. NLP Techniques:
    • Implemented spaCy NER to extract named entities such as organization and artist names.
    • Used Gensim and Hugging Face models for summarization, keyword extraction, and topic modeling (an entity-extraction and summarization sketch follows this list).
    • Built classification models to categorize data based on project requirements.
  3. HTML Update Detection:
    • Compared previously stored and newly scraped HTML using clustering techniques.
    • Aligned (equalized) the two documents so corresponding regions could be compared directly.
    • Identified changes such as additions, removals, and replacements within the refined regions of interest (a simplified comparison sketch follows this list).
  4. Face Cropping API:
    • Developed an API using MTCNN (PyTorch) for detecting and cropping faces in group and individual artist images (a cropping sketch follows this list).
    • Integrated this API into the pipeline for automated processing.
  5. Deployment and Integration:
    • Dockerized the entire pipeline for portability and scalability.
    • Hosted the system on AWS EC2 with Nginx as the web server and APIs built with Flask.
    • Stored structured data and metadata in MongoDB and DynamoDB for efficient access.
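For illustration, a minimal sketch of the crawling step is shown below. The EventSpider class, the seed URL, and the yielded fields are hypothetical placeholders, not the project's actual spider definition.

```python
# Hypothetical Scrapy spider sketch: crawls a placeholder URL, cleans each page
# with Beautiful Soup, and follows in-domain links. Names and selectors are
# illustrative only.
import scrapy
from bs4 import BeautifulSoup


class EventSpider(scrapy.Spider):
    name = "event_spider"
    start_urls = ["https://example.com/concerts"]  # placeholder seed URL

    def parse(self, response):
        # Strip script/style noise before extracting text.
        soup = BeautifulSoup(response.text, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()

        yield {
            "url": response.url,
            "title": soup.title.get_text(strip=True) if soup.title else None,
            "text": soup.get_text(separator=" ", strip=True),
        }

        # Follow links so the crawl continues across the site.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

A spider like this can be run with `scrapy runspider event_spider.py -o events.json`; JavaScript-heavy pages would instead be fetched through Selenium before parsing.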
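The entity-extraction and summarization sketch below assumes the off-the-shelf en_core_web_sm spaCy model and a distilled BART summarizer from Hugging Face; the actual models and entity labels used in the project may differ.

```python
# NER + summarization sketch. Assumes `python -m spacy download en_core_web_sm`
# has been run; the summarization model is an illustrative choice.
import spacy
from transformers import pipeline

nlp = spacy.load("en_core_web_sm")
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")


def extract_entities(text: str) -> dict:
    """Group person/organization/event-style entities found in the text by label."""
    doc = nlp(text)
    entities = {}
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "ORG", "EVENT", "GPE", "DATE"}:
            entities.setdefault(ent.label_, set()).add(ent.text)
    return {label: sorted(values) for label, values in entities.items()}


def summarize(text: str) -> str:
    """Return a short abstractive summary of a scraped page."""
    result = summarizer(text, max_length=60, min_length=20, do_sample=False)
    return result[0]["summary_text"]
```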
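The exact clustering and equalization logic used for update detection is project-specific; the sketch below is a simplified stand-in that pairs text blocks from the old and new pages by TF-IDF cosine similarity and labels unmatched blocks as additions or removals.

```python
# Simplified HTML change-detection sketch (cosine-similarity matching stands in
# for the clustering-based comparison described above).
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def text_blocks(html):
    """Extract visible text blocks from an HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = [el.get_text(" ", strip=True)
              for el in soup.find_all(["p", "li", "h1", "h2", "h3"])]
    return [b for b in blocks if b]


def diff_pages(old_html, new_html, threshold=0.8):
    """Classify block-level changes between two versions of a page."""
    old_blocks, new_blocks = text_blocks(old_html), text_blocks(new_html)
    if not old_blocks or not new_blocks:
        return {"added": new_blocks, "removed": old_blocks, "replaced": []}

    vectorizer = TfidfVectorizer().fit(old_blocks + new_blocks)
    sims = cosine_similarity(vectorizer.transform(new_blocks),
                             vectorizer.transform(old_blocks))

    added, replaced, matched_old = [], [], set()
    for i, block in enumerate(new_blocks):
        j = int(sims[i].argmax())
        if sims[i, j] >= threshold:
            matched_old.add(j)
            if block != old_blocks[j]:
                replaced.append((old_blocks[j], block))
        else:
            added.append(block)

    removed = [b for j, b in enumerate(old_blocks) if j not in matched_old]
    return {"added": added, "removed": removed, "replaced": replaced}
```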
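A sketch of the face-cropping step follows, assuming the facenet-pytorch implementation of MTCNN; the confidence threshold, margins, and resizing used in the production API may differ.

```python
# Face-cropping sketch using facenet-pytorch's MTCNN detector.
from facenet_pytorch import MTCNN
from PIL import Image

detector = MTCNN(keep_all=True)  # keep_all=True returns every face in a group photo


def crop_faces(image_path, min_confidence=0.90):
    """Detect all faces in an image and return them as cropped PIL images."""
    image = Image.open(image_path).convert("RGB")
    boxes, probs = detector.detect(image)
    if boxes is None:
        return []
    crops = []
    for (x1, y1, x2, y2), prob in zip(boxes, probs):
        if prob is not None and prob >= min_confidence:  # drop weak detections
            crops.append(image.crop((int(x1), int(y1), int(x2), int(y2))))
    return crops
```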

Technology Stack

Selenium, AWS EC2, LightTag, Flask, Docker, Python, MongoDB, Hugging Face models

Impacts

  1. Data Crawling and Extraction:
    • Scraped data from multiple domains using Scrapy and Selenium.
    • Parsed and structured data with Beautiful Soup for further processing.
  2. NLP Processing:
    • Identified key entities such as artist and organization names using spaCy NER.
    • Summarized data and extracted keywords for content categorization.
    • Used topic modeling to classify data into relevant categories.
  3. HTML Update Tracking:
    • Compared newly scraped data with previously stored HTML content.
    • Detected and classified updates such as text additions or removals using clustering logic.
    • Logged all updates for downstream analysis.
  4. Face Cropping API:
    • Developed a PyTorch-based MTCNN API to detect and crop artist faces from images.
    • Automated the integration of image processing with the scraping pipeline.
  5. Deployment and Storage:
    • Hosted the pipeline on AWS EC2 for high availability and scalability.
    • Stored processed data in MongoDB and DynamoDB for efficient query and retrieval.
  6. APIs and Integration:
    • Built Flask APIs to provide endpoints for data access, HTML update detection, and face cropping services (a minimal endpoint sketch follows this list).
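For illustration, a minimal Flask endpoint in this style is sketched below; the route, payload fields, and response shape are hypothetical, not the project's actual API contract.

```python
# Hypothetical Flask endpoint sketch for the face-cropping service.
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/crop-faces", methods=["POST"])
def crop_faces_endpoint():
    # In the deployed pipeline this would invoke the MTCNN cropping module;
    # here the request is only echoed to show the endpoint shape.
    payload = request.get_json(force=True)
    return jsonify({"image_url": payload.get("image_url"), "status": "queued"})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```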

Benefits

  1. Comprehensive Data Processing:
    • Delivered structured data with key insights like topics, keywords, and named entities.
  2. Efficient Update Detection:
    • Automated detection of content changes in HTML pages saved time and resources.
  3. Enhanced Face Detection:
    • High-quality face cropping for individuals and groups enhanced downstream tasks like profile creation.
  4. Scalable Infrastructure:
    • Dockerized and AWS-hosted pipeline ensured seamless scaling and deployment.
  5. Streamlined Access:
    • Flask APIs allowed easy integration of data scraping and processing capabilities with other systems.

Future Scope

  1. Real-Time Update Monitoring:
    • Implement real-time crawling and update detection for dynamic websites.
  2. Advanced AI Models:
    • Use transformer-based models like BERT for better summarization and topic modeling.
  3. Dashboard Integration:
    • Develop an interactive dashboard for visualizing scraped data, HTML updates, and processed results.
  4. Cross-Domain Adaptation:
    • Extend the pipeline to support data scraping and processing for additional domains and use cases.
  5. Multilingual NLP Support:
    • Incorporate multilingual models to handle data in diverse languages.

Conclusion

This AI Data Engineering Pipeline demonstrated the power of combining web scraping, NLP, and AI for structured data extraction and processing. The use of clustering techniques for HTML update detection and a PyTorch-based face cropping API added unique value to the solution. With its scalable deployment on AWS and integration with MongoDB and DynamoDB, the project provided a robust and reliable platform for large-scale data engineering tasks.