Transfer Learning

Figure: Transfer learning

The core idea of transfer learning is to i) learn knowledge from solving a problem (source task) and ii) apply that knowledge to solve other, related problems (target task) [TS10]. For example, if a model is able to perform instrument identification (source task), the learned knowledge would be useful for music genre classification (target task), since the underlying concepts of music genres are related to instrumentation. The assumption is that although the source and target tasks are not identical, if the dataset for the source task is much larger than that of the target task, transferring the learned knowledge can lead to better performance.

In practice, i) we first fit a model to solve the source task and ii) further optimize the pretrained model to solve the target task. In the second step, all or a subset of the learned parameters are updated. For example, the authors of [CFSC17b] pretrained a music tagging model on the Million Song Dataset. The model was then transferred to solve downstream tasks such as genre classification, emotion recognition, and audio event classification.
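The sketch below illustrates this two-step recipe in PyTorch. The architecture, dimensions, and tag/genre counts are hypothetical placeholders, not the actual model of the cited work; the point is that the pretrained backbone is reused while the task-specific head is replaced.

```python
# Minimal transfer-learning sketch (hypothetical architecture and sizes).
import torch
import torch.nn as nn

# Backbone that maps a mel spectrogram to a fixed-size embedding.
backbone = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 128),
)

# i) Source task: music tagging (e.g., 50 tags, multi-label).
tagging_head = nn.Linear(128, 50)
source_model = nn.Sequential(backbone, tagging_head)
# ... train source_model with nn.BCEWithLogitsLoss() on the large tagging dataset ...

# ii) Target task: genre classification (e.g., 10 classes) on a smaller dataset.
genre_head = nn.Linear(128, 10)
target_model = nn.Sequential(backbone, genre_head)  # reuses the pretrained backbone

# Option A: fine-tune all parameters.
optimizer = torch.optim.Adam(target_model.parameters(), lr=1e-4)

# Option B: freeze the pretrained backbone and update only the new head.
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(genre_head.parameters(), lr=1e-3)

# One training step with a dummy batch of mel spectrograms (batch, 1, mels, frames).
x = torch.randn(4, 1, 96, 128)
logits = target_model(x)                      # shape: (4, 10)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (4,)))
loss.backward()
optimizer.step()
```

Whether to fine-tune everything or only the new head is a practical choice: freezing the backbone is cheaper and less prone to overfitting when the target dataset is small.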

Music information can be classified into three categories based on how the metadata is produced: editorial, cultural, and acoustic [Pac05]. The aforementioned transfer learning experiment takes advantage of the music tags in the Million Song Dataset, which are mostly related to acoustic information (e.g., genre, instrument). However, those music tags still rely on human labeling effort. Instead of targeting acoustic information, we can also design the source task to predict editorial or cultural metadata.

Pretext using editorial information

Editorial metadata is, by definition, provided by an editor. It includes written information about a song such as artist name, album name, song title, or release date. Since artists can be distinguished by their acoustic characteristics, a previous work proposed artist classification as a pretext task for music representation learning and transferred the learned representation to downstream music genre classification tasks [PLP+17].

However, there are millions of artists, which makes this pretext task impractical for large-scale music libraries. To alleviate this issue, follow-up research proposed using clusters of artists (artist group factors) as the prediction target of the source task [KWSL18].

Figure: Transfer learning of artist group factors
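The sketch below shows one way to reduce millions of artist labels to a tractable number of targets by clustering artist-level feature vectors with k-means. The features, dataset sizes, and clustering method here are assumptions for illustration; the cited work derives its artist group factors with a different procedure.

```python
# Sketch: cluster artists into groups and use the group id as the source-task label.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_artists, feat_dim, n_groups = 2000, 128, 50   # toy sizes

# One feature vector per artist (e.g., the average embedding of the artist's tracks).
artist_features = rng.normal(size=(n_artists, feat_dim))

# Cluster artists into a fixed, manageable number of groups.
kmeans = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
artist_group = kmeans.fit_predict(artist_features)   # shape: (n_artists,)

# Each track inherits its artist's group id, so the source task predicts one of
# n_groups classes instead of one of n_artists classes.
track_artist_ids = rng.integers(0, n_artists, size=100)
track_targets = artist_group[track_artist_ids]
print(track_targets[:10])
```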

Pretext using cultural information

Cultural information is generated by the way music is perceived and consumed in society. One well-known way to model it is collaborative filtering, which captures the interests of users from user-item interaction data. As shown in the figure below, a user-item interaction matrix can be decomposed into two lower-dimensional matrices using matrix factorization; they represent the embeddings of items and users, respectively.

Figure: Matrix factorization
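A minimal sketch of this decomposition is shown below, using truncated SVD on a toy play-count matrix. The matrix, sizes, and factorization method are illustrative assumptions; production recommender systems typically use implicit-feedback methods such as weighted alternating least squares instead.

```python
# Sketch: factorize a toy user-item matrix into user and item embeddings.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 6, 8, 3

# Toy interaction matrix R (users x items), e.g., play counts.
R = rng.poisson(lam=1.0, size=(n_users, n_items)).astype(float)

# Truncated SVD: R is approximated by U_k @ diag(s_k) @ Vt_k.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_emb = U[:, :k] * np.sqrt(s[:k])        # (n_users, k) user embeddings
item_emb = Vt[:k, :].T * np.sqrt(s[:k])     # (n_items, k) item embeddings

# The predicted affinity of a user for an item is the dot product of their embeddings.
R_hat = user_emb @ item_emb.T
print(np.round(R_hat[0], 2))
print(R[0])
```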

A previous work trained a pretext music representation model by predicting these item (song) embeddings from audio [VdODS13]. The learned representation can capture rich acoustic information if the original user-item interaction dataset is large enough. This pretext (source) task is especially beneficial in industry, where this type of data is readily accessible.
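In the spirit of that setup, the pretext task can be sketched as a regression problem: an audio encoder is trained to predict each song's collaborative-filtering embedding. The encoder architecture, embedding dimensionality, and loss below are assumptions for illustration, not the exact configuration of the cited work.

```python
# Sketch: regress collaborative-filtering item embeddings from audio (PyTorch).
import torch
import torch.nn as nn

emb_dim = 50                       # assumed dimensionality of the CF item embeddings
audio_model = nn.Sequential(       # toy audio encoder: mel spectrogram -> embedding
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, emb_dim),
)

optimizer = torch.optim.Adam(audio_model.parameters(), lr=1e-3)
mse = nn.MSELoss()

# Dummy batch: mel spectrograms and the corresponding songs' CF item embeddings
# obtained from matrix factorization of the user-item data.
mel = torch.randn(8, 1, 96, 128)
cf_item_emb = torch.randn(8, emb_dim)

pred = audio_model(mel)
loss = mse(pred, cf_item_emb)
loss.backward()
optimizer.step()
```

Once trained, the encoder can be used like any other pretrained backbone: its output (or an intermediate layer) serves as the representation that is transferred to downstream tasks.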