Search engines are software systems that perform fast keyword lookups across large collections of documents. They have been around since the dawn of the Internet and will be around forever. They rely on a set of complex data structures to keep lookups fast across billions of documents.
Fascinated by everything involved, we decided to build a search engine for our College Summer Design Project in 2016. While hunting for ideas, we settled on song lyrics. The topic caught our interest because we had also come across a paper on Query Expansion for Mixed Script Information Retrieval by Mr. Parth Gupta, who happened to be our project mentor. The idea was to use this method to normalize search queries involving mixed scripts, and also to suggest queries in other scripts.
So what this search engine would do is treat mai and main as the same word and also suggest results involving मैं. This is particularly helpful when people type in song lyrics or names to search for a song on online music websites such as Saavn and Gaana.
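To make the idea concrete, here is a minimal sketch of that normalization step. It assumes a tiny hand-written lookup table; the names `CANONICAL`, `DEVANAGARI`, `normalize_query` and `suggest_other_script` are hypothetical and only for illustration. The query expansion method from the paper finds such equivalents itself, so treat this as a toy stand-in and not as the paper's method.

```python
# Toy stand-in for mixed-script query normalization.
# ASSUMPTION: both tables are hand-written for illustration only.

# Roman-script spelling variants mapped to one canonical spelling.
CANONICAL = {"mai": "main", "main": "main"}

# Canonical Roman spelling mapped to its Devanagari form.
DEVANAGARI = {"main": "मैं"}


def normalize_query(query):
    """Lowercase the query and replace known variants with the canonical form."""
    return [CANONICAL.get(token, token) for token in query.lower().split()]


def suggest_other_script(tokens):
    """Return Devanagari suggestions for the tokens we know how to map."""
    return [DEVANAGARI[token] for token in tokens if token in DEVANAGARI]
```

With this table, `normalize_query("mai")` and `normalize_query("main")` give the same token list, so both spellings hit the same index entries, and `suggest_other_script` offers मैं as a query in the other script.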
The project primarily consisted of three parts -
We made custom crawlers for 5 websites -
For indexing, we chose Xapian. It is an open source search engine library (GPL2+), powered by speed of
The indexer periodically fetches only new results from the database and updates the existing index. This made sure that we did not have to index every document as it arrived. It not only provided persistent storage, but also a fast enough way to query a few million songs.
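The incremental update described above can be sketched roughly like this. This is not the project's Xapian code: it uses a plain in-memory inverted index and an SQLite table as stand-ins, and the `songs` table, its columns, the `IncrementalIndexer` class and the sample rows are all assumptions made up for illustration.

```python
import sqlite3
from collections import defaultdict


class IncrementalIndexer:
    """Toy inverted index that only reads rows added since the last update."""

    def __init__(self, conn):
        self.conn = conn
        self.index = defaultdict(set)  # term -> set of song ids
        self.last_id = 0               # highest song id indexed so far

    def update(self):
        """Index rows with id > last_id and return how many were added."""
        rows = self.conn.execute(
            "SELECT id, lyrics FROM songs WHERE id > ? ORDER BY id",
            (self.last_id,),
        ).fetchall()
        for song_id, lyrics in rows:
            for term in lyrics.lower().split():
                self.index[term].add(song_id)
            self.last_id = song_id
        return len(rows)

    def search(self, query):
        """Return ids of songs that contain every term in the query."""
        terms = query.lower().split()
        if not terms:
            return set()
        postings = [self.index.get(term, set()) for term in terms]
        return set.intersection(*postings)


# Made-up sample data, only to show two update rounds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (id INTEGER PRIMARY KEY, lyrics TEXT)")
conn.execute("INSERT INTO songs (lyrics) VALUES ('main yahan hoon')")

indexer = IncrementalIndexer(conn)
first_round = indexer.update()   # picks up the one existing row

conn.execute("INSERT INTO songs (lyrics) VALUES ('tum kahan ho')")
second_round = indexer.update()  # picks up only the newly added row
```

The sketch leaves out the periodic scheduling, and it only handles newly added rows: edited or deleted songs would need extra bookkeeping.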
We wrote a web app and an Android app to give users an interface where they can search. The Android used
During the project, we managed to crawl lyrics for over
The source code can be found here.