#1 Mixed Lyrics Search Engine

Pratyush Singh
Jan. 3, 2017, 12:39 p.m.
Tag(s): postgresql flask crawling xapian indexing bootstrap webdev python searching
A search engine for lyrics of songs based on mixed script information retrieval.

Search engines are software systems that look up keywords quickly across large collections of documents. They have been around since the early days of the Internet and are unlikely to go away. They rely on specialised data structures, such as inverted indexes, to keep lookups fast even across billions of documents.

Intrigued by everything that goes on inside one, we decided to build a search engine for our college Summer Design Project in 2016. While hunting for ideas, I settled on song lyrics. The topic interested us because we had also come across a paper on Query Expansion for Mixed Script Information Retrieval by Mr. Parth Gupta, who happened to be our project mentor. The idea was to use this method to normalize search queries involving mixed scripts, and also to suggest queries in other scripts.

There were four of us in the group: Ankit, Avi, Saurabh and me.

In practice, the search engine treats mai and main as the same word and also suggests results involving मैं. This is particularly helpful when people type lyrics or song names to find a song on online music websites such as Saavn and Gaana.
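To make the idea concrete, here is a minimal sketch of query normalization. It does not reproduce the method from the paper; it swaps in a tiny hand-written lookup table so the behaviour is easy to follow. The table contents and the names `VARIANTS`, `normalize` and `suggest_native` are all made up for illustration.

```python
# Hypothetical equivalence table: canonical Roman form -> (Roman spellings, Devanagari form).
# A real system would learn or derive these mappings instead of listing them by hand.
VARIANTS = {
    "main": (["mai", "main"], "मैं"),
    "pyaar": (["pyar", "pyaar"], "प्यार"),
}

# Reverse lookup: any known spelling (Roman or Devanagari) -> canonical Roman form.
CANONICAL = {}
for canon, (spellings, native) in VARIANTS.items():
    for spelling in spellings:
        CANONICAL[spelling] = canon
    CANONICAL[native] = canon


def normalize(query):
    """Map every known spelling variant in the query to its canonical form."""
    return [CANONICAL.get(token, token) for token in query.lower().split()]


def suggest_native(query):
    """Rewrite the query with native-script forms wherever one is known."""
    return " ".join(
        VARIANTS[token][1] if token in VARIANTS else token
        for token in normalize(query)
    )
```

With this table, `normalize("mai")` and `normalize("main")` give the same result, and `suggest_native("mai")` yields the Devanagari form as a suggestion.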

Overview

The project consisted of three main parts:

  1. Crawling
  2. Indexing
  3. Query Formulation and Searching
We used Python for the project. The following diagram depicts the overall flow of the project:

Crawling

We wrote custom crawlers for five websites:

  1. AZLyrics
  2. HindiLyrics
  3. Smriti
  4. LyricsMasti
  5. MetroLyrics
These websites were chosen mainly because we wanted to test mixed-script search and to focus on Indian songs. We additionally crawled MetroLyrics because we wanted a large volume of data to search, so that we could test lookup speed. The crawlers wrote the data to a PostgreSQL database, from which the indexer would read it and build the index.
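The crawl-and-store step might look roughly like the sketch below. Several things here are assumptions: the page layout (lyrics inside a `<div class="lyrics">`), the `songs` table schema, and the function names. To keep the example self-contained it uses the standard library's `sqlite3` as a stand-in for PostgreSQL.

```python
import sqlite3
from html.parser import HTMLParser


class LyricsParser(HTMLParser):
    """Collects the <title> text and the text inside <div class="lyrics"> (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.lyrics_parts = []
        self._in_title = False
        self._lyrics_depth = 0  # > 0 while inside the lyrics <div>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "div":
            if self._lyrics_depth:
                self._lyrics_depth += 1  # nested <div> inside the lyrics block
            elif ("class", "lyrics") in attrs:
                self._lyrics_depth = 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "div" and self._lyrics_depth:
            self._lyrics_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._lyrics_depth:
            self.lyrics_parts.append(data)


def create_schema(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS songs ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, title TEXT, lyrics TEXT)"
    )


def store_song(conn, url, html):
    """Parse one crawled page and store it; returns False if the URL was already stored."""
    parser = LyricsParser()
    parser.feed(html)
    lyrics = "\n".join(p.strip() for p in parser.lyrics_parts if p.strip())
    before = conn.total_changes
    # SQLite syntax; the PostgreSQL equivalent is INSERT ... ON CONFLICT DO NOTHING.
    conn.execute(
        "INSERT OR IGNORE INTO songs (url, title, lyrics) VALUES (?, ?, ?)",
        (url, parser.title.strip(), lyrics),
    )
    conn.commit()
    return conn.total_changes > before
```

The unique constraint on `url` means a page that is crawled twice is stored only once, so the indexer never sees duplicates from re-crawls.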

Indexing

For indexing, we chose Xapian. It is an open source search engine library (GPL2+) that combines the speed of C++ with the flexibility of Python through its bindings. Xapian is so nicely designed that we could implement complex indexing functionality in just a few lines of code.
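As an illustration of how little code this takes, here is a sketch along the lines of Xapian's own getting-started examples. It is not our project code: the song dictionary layout (`id`, `title`, `lyrics`) and the function names are assumptions. The `S` and `Q` prefixes follow Xapian's usual conventions for title terms and unique identifiers.

```python
import json


def song_id_term(song_id):
    """Build the unique-ID term for a song ('Q' is Xapian's conventional ID prefix)."""
    return "Q" + str(song_id)


def index_songs(index_path, songs):
    """Add or update songs in a Xapian index stored at index_path."""
    # Imported here so the module still loads where the Xapian bindings are not installed.
    import xapian

    db = xapian.WritableDatabase(index_path, xapian.DB_CREATE_OR_OPEN)
    termgen = xapian.TermGenerator()

    for song in songs:
        doc = xapian.Document()
        termgen.set_document(doc)

        # Index the title under the 'S' prefix so it can be searched on its own.
        termgen.index_text(song["title"], 1, "S")

        # Index title and lyrics as general free text; the gap in term positions
        # stops phrase searches from matching across the two fields.
        termgen.index_text(song["title"])
        termgen.increase_termpos()
        termgen.index_text(song["lyrics"])

        # Store the full record so it can be shown in search results.
        doc.set_data(json.dumps(song, ensure_ascii=False))

        # replace_document with an ID term updates an existing entry
        # instead of creating a duplicate.
        idterm = song_id_term(song["id"])
        doc.add_boolean_term(idterm)
        db.replace_document(idterm, doc)

    db.commit()
```

Calling `index_songs("lyrics_index", songs)` a second time with the same songs leaves the index unchanged in size, because each song is keyed by its ID term.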

The indexer periodically fetches only new records from the database and updates the existing index. This meant we did not have to index each document the moment it arrived. The index not only gave us persistent storage, but was also fast enough to query a few million songs.
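One simple way to fetch only new records is to remember the highest row ID indexed so far and ask the database for anything above it. The sketch below assumes a `songs` table with an auto-incrementing `id` column and takes the actual indexing step as a callback; the names are hypothetical, and `sqlite3` again stands in for PostgreSQL.

```python
import sqlite3


def fetch_new_songs(conn, last_indexed_id, batch_size=1000):
    """Return up to batch_size rows that were added after last_indexed_id."""
    return conn.execute(
        "SELECT id, title, lyrics FROM songs WHERE id > ? ORDER BY id LIMIT ?",
        (last_indexed_id, batch_size),
    ).fetchall()


def run_indexer_pass(conn, last_indexed_id, index_fn, batch_size=1000):
    """Index everything newer than last_indexed_id and return the new high-water mark."""
    while True:
        rows = fetch_new_songs(conn, last_indexed_id, batch_size)
        if not rows:
            return last_indexed_id
        for row in rows:
            index_fn(row)
        # Rows are ordered by id, so the last one carries the highest id seen.
        last_indexed_id = rows[-1][0]
```

A scheduler (cron, for example) can call `run_indexer_pass` periodically and persist the returned value between runs, so each pass only touches songs that arrived since the previous one.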

Searching

We wrote a web app and an Android app to give users an interface for searching. The Android app used a REST API to talk to the server. The API and the web app were both written in Flask. We record each user activity so that we can improve our suggestions and listings in later phases.
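A search endpoint with activity logging could be sketched as follows. The route `/api/search`, the `q` and `client` parameters, the log file name and the JSON shape are all assumptions for illustration, not the actual API. The search itself is passed in as a function so the sketch stays independent of the index.

```python
import json
import time


def make_log_entry(query, n_hits, client, now=None):
    """Build one activity record; stored so suggestions can be improved later."""
    return {
        "time": time.time() if now is None else now,
        "query": query,
        "hits": n_hits,
        "client": client,
    }


def create_app(search_fn, log_path="search_activity.log"):
    """Create a Flask app exposing a JSON search endpoint (hypothetical route and fields)."""
    # Imported here so the helper above can be used without Flask installed.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/api/search")
    def search():
        query = request.args.get("q", "").strip()
        if not query:
            return jsonify(error="missing query parameter 'q'"), 400

        hits = search_fn(query)

        # Append one JSON line per search to the activity log.
        client = request.args.get("client", "web")
        entry = make_log_entry(query, len(hits), client)
        with open(log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry, ensure_ascii=False) + "\n")

        return jsonify(query=query, results=hits)

    return app
```

Both the web app and the Android app could then call the same endpoint, for example `GET /api/search?q=mai&client=android`, and differ only in how they render the JSON.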

Conclusion

During the project, we managed to crawl and index lyrics for over 2M songs. The total size of the index was ~8.3 GB. The lowest, highest and average document lengths were 6, 15220 and 380 respectively.

The source code can be found here.
