With an increasing amount of digital music released daily, streaming platforms such as Spotify face the following question: how can song recommendation systems be designed to best improve the user experience? This thesis aims to address the semantic gap in the Music Recommender System (MRS) literature by incorporating both audio and lyrics information in a deep neural network. The approach consists of four main steps.
First, Mel-spectrogram images of the songs’ audio are processed with a convolutional neural network (CNN). Second, the lyrics’ text sequences are embedded with a long short-term memory (LSTM) network. Next, the two embeddings are combined in a multimodal network that is trained to classify the songs into four balanced genres: rock, rap, blues, and pop. Lastly, the classification layer is discarded so that the latent representations of the songs can be used to generate recommendations based on cosine similarity scores. To evaluate the model, an experiment was set up in which two participants listened to multiple recommended songs across three conditions: recommendations drawn randomly from the dataset, recommendations based on audio information only, and recommendations based on both audio and lyrics features. The participants are expected to prefer the songs recommended by the multimodal network over those from the other two conditions.
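The final recommendation step, ranking songs by the cosine similarity of their latent representations, can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the function names `cosine_similarity` and `recommend`, the song identifiers, and the three-dimensional toy vectors are all hypothetical, and in practice the vectors would be the much higher-dimensional activations taken from the trained network once its classification layer is removed.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(query_id, embeddings, top_k=3):
    """Return the top_k (song_id, score) pairs most similar to the query song."""
    query = embeddings[query_id]
    scores = [
        (song_id, cosine_similarity(query, vector))
        for song_id, vector in embeddings.items()
        if song_id != query_id  # never recommend the query song itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Hypothetical toy embeddings standing in for the learned latent vectors.
embeddings = {
    "song_a": [1.0, 0.0, 0.0],
    "song_b": [0.9, 0.1, 0.0],
    "song_c": [0.0, 1.0, 0.0],
    "song_d": [-1.0, 0.0, 0.0],
}

top = recommend("song_a", embeddings, top_k=2)
for song_id, score in top:
    print(f"{song_id}: {score:.3f}")
```

Because cosine similarity depends only on the direction of the vectors and not on their magnitude, two songs with proportional latent representations receive the maximum score of 1 regardless of scale.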