Event HTML Page Classification Using Gensim GloVe Model

This project focused on classifying HTML pages for musical events using a Gensim GloVe model. The goal was to identify specific event-related URLs (e.g., ticket purchase links or event detail pages) within a domain and distinguish them from irrelevant URLs. By combining NLP-based similarity models with pattern extraction techniques, the solution provided efficient URL classification and discovery for event pages.

The solution also included URL discovery for image source links and ticket URLs, providing APIs for seamless access. The model was deployed on AWS EC2 with Flask APIs and Dockerized for efficient deployment in production.

Challenges

  1. HTML Page Complexity:
    • HTML pages contained a mix of relevant and irrelevant URLs.
    • Extracting meaningful patterns from the page structure and content required advanced preprocessing.
  2. Event-Specific Classification:
    • Accurately distinguishing event-related URLs (e.g., ticket purchase links) from other URLs within the same domain.
  3. Data Pipeline Design:
    • Building a scalable and efficient pipeline for processing large volumes of HTML pages.
  4. Deployment and Access:
    • Providing APIs for seamless integration of the classification model into production workflows.
  5. Scalability:
    • Ensuring the system could handle a large number of pages in either real-time or batch processing.

Our Solutions

The solution used Gensim GloVe embeddings to build similarity models and Flask APIs for seamless integration into production.

  1. HTML Page Classification:
    • Used a Gensim GloVe model to generate a similarity model based on a corpus of event-related text.
    • Extracted patterns from URLs and keywords from summarized HTML content.
    • Classified URLs as event-related or irrelevant using the similarity model.
  2. URL Discovery:
    • Ticket URL Discovery: Identified ticket purchase links using the similarity model.
    • Image Source Discovery: Extracted image source URLs for event pages to ensure complete page rendering.
  3. Deployment:
    • Deployed models on AWS EC2 instances with an Nginx server for high availability.
    • Developed APIs using Flask for accessing classification and discovery functionalities.
  4. Pipeline Design:
    • Dockerized the entire solution for portability and scalability.
    • Designed a pipeline for efficient preprocessing, model inference, and API deployment.
  5. API Services:
    • Created APIs for:
      • Classifying event-related URLs.
      • Discovering ticket URLs and image sources.
      • Accessing classified and discovered URLs.
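
The similarity-based classification step described above can be illustrated with a minimal sketch. It is not the project's actual code: the tiny vector table, the event term list, the function names, and the 0.8 threshold are all assumptions made for illustration. In a real setup the vectors would come from a pretrained GloVe model (for example, loaded through Gensim) rather than from hand-written values.

```python
import math
import re

# Toy 3-dimensional vectors standing in for real GloVe embeddings.
# The values are made up for illustration only.
TOY_VECTORS = {
    "tickets": [0.9, 0.1, 0.0],
    "concert": [0.8, 0.2, 0.1],
    "event": [0.85, 0.15, 0.05],
    "tour": [0.7, 0.3, 0.1],
    "privacy": [0.0, 0.1, 0.9],
    "careers": [0.1, 0.0, 0.8],
    "policy": [0.05, 0.2, 0.85],
}

# Hypothetical reference terms representing the event-related corpus.
EVENT_TERMS = ["tickets", "concert", "event", "tour"]


def tokenize_url(url):
    """Split a URL into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", url.lower())


def mean_vector(tokens):
    """Average the vectors of all known tokens; return None if none are known."""
    vecs = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Centroid of the event vocabulary, used as the comparison target.
EVENT_CENTROID = mean_vector(EVENT_TERMS)


def classify_url(url, threshold=0.8):
    """Label a URL 'event' if its tokens are close to the event centroid."""
    vec = mean_vector(tokenize_url(url))
    if vec is None:
        return "irrelevant"
    return "event" if cosine(vec, EVENT_CENTROID) >= threshold else "irrelevant"
```

Calling `classify_url("https://example.com/concert/tickets")` compares the averaged vectors of "concert" and "tickets" with the event centroid. URLs with no known tokens fall back to "irrelevant", which is one possible design choice among several.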

Technology Stack

MongoDB

Docker

Flask

Python

Impacts

  1. Data Preprocessing:
    • Parsed HTML pages to extract all URLs and summarized page content using NLP techniques.
    • Used Gensim GloVe embeddings to create a similarity model based on extracted keywords and patterns.
  2. Model Training and Classification:
    • Trained a classification model using Gensim GloVe embeddings to identify event-specific patterns in URLs and content.
    • Classified URLs based on their similarity to event-related corpus data.
  3. URL Discovery:
    • Developed separate modules for:
      • Ticket URL discovery: Extracted and classified ticket purchase links.
      • Image source discovery: Identified image URLs needed for rendering event pages.
  4. API Development and Deployment:
    • Built RESTful APIs using Flask for accessing the classification and discovery services.
    • Deployed the solution on AWS EC2 with an Nginx server for scalability.
  5. Production Deployment:
    • Dockerized the entire system for ease of deployment and scalability.
    • Set up CI/CD pipelines to streamline updates and maintenance.
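
The preprocessing and discovery steps above start from extracting every link and image source on a page. A minimal sketch of that extraction, using only the Python standard library, might look like the following. The class and function names are hypothetical, and the project itself may have used a different HTML parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageExtractor(HTMLParser):
    """Collect absolute URLs from <a href> and <img src> attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, attr_map["href"]))
        elif tag == "img" and attr_map.get("src"):
            self.images.append(urljoin(self.base_url, attr_map["src"]))


def extract_urls(html, base_url):
    """Return (links, images) found in the given HTML string."""
    parser = LinkAndImageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, parser.images
```

The returned links could then be passed to a classifier to pick out ticket URLs, while the image list covers the image source discovery step.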

Benefits

  1. Accurate Event URL Identification:
    • Efficiently classified and discovered event-specific URLs, saving time and effort in manual filtering.
  2. Scalability:
    • Leveraged AWS EC2 and Docker to handle large-scale HTML page processing.
  3. Improved API Accessibility:
    • Provided seamless integration with existing systems via Flask APIs.
  4. Enhanced Event Page Quality:
    • Identified key elements like image sources and ticket URLs to ensure complete event page representation.
  5. Real-Time Processing:
    • Enabled real-time classification and URL discovery for faster workflows.

Future Scope

  1. Multi-Language Support:
    • Extend the model to handle HTML content in multiple languages.
  2. Integration with Event Platforms:
    • Automate event page classification and URL discovery for platforms like Eventbrite or Meetup.
  3. Advanced NLP Models:
    • Incorporate transformer-based models like BERT for improved classification accuracy.
  4. Dashboard Integration:
    • Build dashboards for monitoring classification metrics and visualizing discovered URLs.
  5. Real-Time Crawling:
    • Integrate web crawlers for automated page scraping and real-time URL discovery.

Conclusion

The Event HTML Page Classification System provided an efficient and scalable solution for identifying event-related URLs using Gensim GloVe embeddings. Deploying the model on AWS EC2 behind Flask APIs gave existing systems seamless, real-time access to the classification and discovery services. The solution can be extended in the future to handle multi-domain classification tasks and more advanced NLP requirements.