Introduction
It is a human tendency to label the things we encounter. The internet, along with its advantages, has also brought an abundance of available data, and the larger that data grows, the greater the need to divide it into smaller chunks so it can be accessed and used more effectively. Continuously trained machine learning models can learn this ability to label content. The goal is to design a model that trains on a corpus of text files to generate a finite bag of words, which can then be used to predict labels for an unknown or unlabelled text document. For this project we designed and built a topic prediction model that predicts topic labels for TED talks.
The dataset is a collection of text transcripts of TED talks pulled from the official TED website before November 2016. A web crawler script pulls the list of all TED talks and their corresponding transcripts. The talks come from different events, i.e. TEDx, TEDConference, and TEDGlobal. Each talk is labelled with multiple topic labels and contains other metadata such as the author, talk ratings, and talk date. The training data comprises talks from TEDx and TEDConference (1,245 text files), and the test data (472 text files) is the collection of talks from TEDGlobal.
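As a rough sketch of how such a corpus could be read in for training, the snippet below loads every transcript in a directory into a dictionary keyed by file name. The directory layout and the `load_transcripts` helper name are assumptions for illustration; the crawler's actual output format is not documented here.

```python
from pathlib import Path


def load_transcripts(folder):
    """Read every .txt transcript in `folder` into a {talk_name: text} dict.

    Assumes one plain-text transcript per file, which is a hypothetical
    layout, not the project's documented format.
    """
    talks = {}
    for path in sorted(Path(folder).glob("*.txt")):
        talks[path.stem] = path.read_text(encoding="utf-8")
    return talks
```

With a layout like this, the training set (TEDx and TEDConference talks) and the test set (TEDGlobal talks) would simply live in two separate folders.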
Methods Used
We used four different models to predict the topics for every talk based on its transcript. Before implementing the models we performed data cleaning and ignored talks whose transcripts contain only music.
- LDA Model
- tf-Idf Weight Ranking Model
- kNN Model
- Word2Vec Model
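To illustrate the tf-idf and kNN ideas from the list above, the sketch below ranks training transcripts by tf-idf cosine similarity to a query transcript; a kNN topic predictor can then vote with the topic labels of the nearest training talks. This uses scikit-learn as an assumed toolkit, and the `nearest_talks` helper is hypothetical, not the project's actual implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def nearest_talks(train_texts, query_text, k=3):
    """Return indices of the k training transcripts most similar to the query.

    Similarity is cosine distance over tf-idf vectors; stop words are
    removed, mirroring the data-cleaning step described above.
    """
    vectorizer = TfidfVectorizer(stop_words="english")
    train_matrix = vectorizer.fit_transform(train_texts)
    knn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(train_matrix)
    _, indices = knn.kneighbors(vectorizer.transform([query_text]))
    return list(indices[0])
```

A topic label for an unseen TEDGlobal talk could then be predicted by pooling the labels attached to its nearest TEDx/TEDConference neighbours; the LDA and Word2Vec models would replace the tf-idf vectors with topic distributions or averaged word embeddings, respectively.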
Authors and Contributors
Project Team:
- Bharat Vemulapalli (@bharatvem)
- Jaideep Patel (@jaideeppatel)