This project designed and implemented an AI data engineering pipeline that crawls and scrapes websites covering concerts, operas, and other domains. The extracted data was processed with Natural Language Processing (NLP) techniques: Named Entity Recognition (NER), summarization, keyword extraction, topic modeling, and classification. In addition, clustering techniques were used to detect updates to previously crawled HTML pages.
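The update-detection idea can be illustrated with a minimal sketch. It is not the project's actual implementation: the clustering method, features, and threshold used in the project are not stated, so this example assumes a simple single-linkage threshold clustering over Jaccard distance between the visible word sets of page snapshots. The names `visible_tokens`, `cluster_snapshots`, `page_changed`, and the `threshold` value of 0.2 are all hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_tokens(html):
    """Return the set of lowercased whitespace-separated words visible on the page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return set(" ".join(parser.parts).lower().split())


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def cluster_snapshots(snapshots, threshold=0.2):
    """Assumed approach: single-linkage clustering with a distance threshold.

    Returns one cluster label per snapshot; snapshots whose visible text is
    within `threshold` Jaccard distance end up with the same label.
    """
    tokens = [visible_tokens(s) for s in snapshots]
    labels = list(range(len(tokens)))  # start with every snapshot in its own cluster
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if jaccard_distance(tokens[i], tokens[j]) <= threshold:
                old, new = labels[j], labels[i]
                labels = [new if label == old else label for label in labels]
    return labels


def page_changed(old_html, new_html, threshold=0.2):
    """A page counts as updated when the two snapshots fall in different clusters."""
    labels = cluster_snapshots([old_html, new_html], threshold)
    return labels[0] != labels[1]
```

Comparing visible text rather than raw markup means that a change in page layout or scripts alone does not register as a content update, while a rewritten event description does.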
The solution also included a face-cropping API, built on an MTCNN implementation in PyTorch, that detected and cropped the faces of individual artists and groups of artists. The system exposed its functionality through Flask APIs, was containerized with Docker, and was deployed on AWS EC2 for scalable, production-ready operation.
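The cropping step for individuals and groups can be sketched as follows. This covers only the box arithmetic, not the detector or the Flask endpoint, and it is an illustration under stated assumptions: face boxes are assumed to arrive as `(x1, y1, x2, y2)` pixel coordinates, a layout MTCNN implementations commonly return, and the helper names `expand_box` and `group_box` and the `margin` parameter are hypothetical.

```python
def expand_box(box, image_size, margin=0.2):
    """Pad an (x1, y1, x2, y2) box by `margin` of its width and height,
    then clamp the result to the image bounds given as (width, height)."""
    x1, y1, x2, y2 = box
    width, height = image_size
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin
    return (
        max(0, int(round(x1 - pad_x))),
        max(0, int(round(y1 - pad_y))),
        min(width, int(round(x2 + pad_x))),
        min(height, int(round(y2 + pad_y))),
    )


def group_box(boxes, image_size, margin=0.2):
    """Smallest padded box containing every detected face, or None if no faces.

    Assumption: a group crop is the union of the individual face boxes.
    """
    if not boxes:
        return None
    union = (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
    return expand_box(union, image_size, margin)
```

The returned tuple follows the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, so an individual crop would use `expand_box` on one detection and a group crop would use `group_box` on all detections in the image.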