References

ABC+16

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, and others. TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283. 2016.

BCG+19

David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: a holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.

BMEWL11

Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR 2011: Proceedings of the 12th International Society for Music Information Retrieval Conference, October 24-28, 2011, Miami, Florida, 591–596. University of Miami, 2011.

BFR+19

Rachel M Bittner, Magdalena Fuentes, David Rubinstein, Andreas Jansson, Keunwoo Choi, and Thor Kell. mirdata: software for reproducible usage of datasets. In Proceedings of the 20th Conference of the International Society for Music Information Retrieval, Delft, The Netherlands. International Society for Music Information Retrieval (ISMIR), 2019.

BPS+19

Dmitry Bogdanov, Alastair Porter, Hendrik Schreiber, Julián Urbano, and Sergio Oramas. The AcousticBrainz genre dataset: multi-source, multi-level, multi-label, and large-scale. In Proceedings of the 20th Conference of the International Society for Music Information Retrieval (ISMIR 2019), Delft, The Netherlands. International Society for Music Information Retrieval (ISMIR), 2019.

BWGomezGutierrez+13

Dmitry Bogdanov, Nicolas Wack, Emilia Gómez Gutiérrez, Sankalp Gulati, Perfecto Herrera-Boyer, Oscar Mayor, Gerard Roma Trepat, Justin Salamon, José Ricardo Zapata González, Xavier Serra, and others. Essentia: an audio analysis library for music information retrieval. In Proceedings of the 14th Conference of the International Society for Music Information Retrieval (ISMIR 2013), Curitiba, Brazil, 493–498. International Society for Music Information Retrieval (ISMIR), 2013.

BWT+19

Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra. The MTG-Jamendo dataset for automatic music tagging. In Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States, 2019. URL: http://hdl.handle.net/10230/42015.

BJFH12

Juan J Bosch, Jordi Janer, Ferdinand Fuhrmann, and Perfecto Herrera. A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In ISMIR, 559–564. Citeseer, 2012.

CGomezG+06

Pedro Cano, Emilia Gómez, Fabien Gouyon, Perfecto Herrera, Markus Koppenberger, Bee Suan Ong, Xavier Serra, Sebastian Streich, and Nicolas Wack. ISMIR 2004 audio description contest. Music Technology Group of the Universitat Pompeu Fabra, Tech. Rep., 2006.

Cel10

Oscar Celma. Music recommendation. In Music recommendation and discovery, pages 43–85. Springer, 2010.

CSR11

Vijay Chandrasekhar, Mehmet Emre Sargin, and David A Ross. Automatic language identification in music videos with low level audio and visual features. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5724–5727. IEEE, 2011.

CKNH20

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, 1597–1607. PMLR, 2020.

CAAH20

Kin Wai Cheuk, Hans Anderson, Kat Agres, and Dorien Herremans. nnAudio: an on-the-fly GPU audio to spectrogram conversion toolbox using 1D convolutional neural networks. IEEE Access, 8:161981–162003, 2020.

CFCS17a

Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler. A comparison of audio signal preprocessing methods for deep neural networks on music tagging. arXiv preprint arXiv:1709.01922, 2017.

CFCS17b

Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler. A tutorial on deep learning for music information retrieval. arXiv preprint arXiv:1709.04396, 2017.

CFS16

Keunwoo Choi, György Fazekas, and Mark Sandler. Automatic tagging using deep convolutional neural networks. In The 17th International Society for Music Information Retrieval Conference, New York, USA. International Society for Music Information Retrieval, 2016.

CFSC17a

Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2392–2396. IEEE, 2017.

CFSC17b

Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. Transfer learning for music classification and regression tasks. In Proceedings of the 18th Conference of the International Society for Music Information Retrieval, Suzhou, China. International Society for Music Information Retrieval (ISMIR), 2017.

CFCS18

Keunwoo Choi, György Fazekas, Kyunghyun Cho, and Mark Sandler. The effects of noisy labels on deep convolutional neural networks for music tagging. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2):139–149, 2018. URL: https://ieeexplore.ieee.org/abstract/document/8323324, doi:10.1109/TETCI.2017.2771298.

CJK17

Keunwoo Choi, Deokjin Joo, and Juho Kim. Kapre: on-GPU audio preprocessing layers for a quick implementation of deep neural network models with Keras. In Machine Learning for Music Discovery Workshop at the 34th International Conference on Machine Learning, 2017.

CW21

Keunwoo Choi and Yuxuan Wang. Listen, read, and identify: multimodal singing language identification of music. In Proceedings of the 22nd Conference of the International Society for Music Information Retrieval (ISMIR 2021). International Society for Music Information Retrieval (ISMIR), 2021.

Com20

Executable Books Community. Jupyter Book. February 2020. URL: https://doi.org/10.5281/zenodo.4539666, doi:10.5281/zenodo.4539666.

DBVB16

Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson. FMA: a dataset for music analysis. In The 17th International Society for Music Information Retrieval Conference, New York, USA. International Society for Music Information Retrieval, 2016.

DS14

Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6964–6968. IEEE, 2014. URL: https://ieeexplore.ieee.org/document/6854950.

ERR+17

Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan. Neural audio synthesis of musical notes with WaveNet autoencoders. In International Conference on Machine Learning, 1068–1077. PMLR, 2017.

FLTZ10

Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang. A survey of audio-based music classification and annotation. IEEE Transactions on Multimedia, 13(2):303–319, 2010.

GEF+17

Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017. New Orleans, LA, 2017.

GB+05

Yves Grandvalet, Yoshua Bengio, and others. Semi-supervised learning by entropy minimization. CAP, 367:281–296, 2005.

GHyvarinen10

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, 297–304. JMLR Workshop and Conference Proceedings, 2010.

GCCE+21

Juan Sebastián Gómez-Cañón, Estefanía Cano, Tuomas Eerola, Perfecto Herrera, Xiao Hu, Yi-Hsuan Yang, and Emilia Gómez. Music emotion recognition: towards new robust standards in personalized and context-sensitive applications. IEEE Signal Processing Magazine, 2021. doi:10.1109/MSP.2021.3106232.

HMvdW+20

Charles R Harris, K Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, and others. Array programming with NumPy. Nature, 585(7825):357–362, 2020.

HFW+20

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738. 2020.

HBPD03

Perfecto Herrera-Boyer, Geoffroy Peeters, and Shlomo Dubnov. Automatic classification of musical instrument sounds. Journal of New Music Research, 32(1):3–21, 2003.

HCE+17

Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, and others. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 131–135. IEEE, 2017.

HDM18

Eric Humphrey, Simon Durand, and Brian McFee. OpenMIC-2018: an open dataset for multiple instrument recognition. In The 19th International Society for Music Information Retrieval Conference, Paris, France. International Society for Music Information Retrieval, 2018.

HWW+21

Yun-Ning Hung, Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife, Kelian Li, Pavan Seshadri, and Junyoung Lee. AVASpeech-SMAD: a strongly labelled speech and music activity detection dataset with label co-occurrence. 2021. arXiv:2111.01320.

Hun07

John D Hunter. Matplotlib: a 2D graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007.

KWSL18

Jaehun Kim, Minz Won, Xavier Serra, and Cynthia CS Liem. Transfer learning of artist group factors to musical genre classification. In Companion Proceedings of The Web Conference 2018, 1929–1934. 2018.

KR16

Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947, 2016.

KSM+10

Youngmoo E Kim, Erik M Schmidt, Raymond Migneco, Brandon G Morton, Patrick Richardson, Jeffrey Scott, Jacquelin A Speck, and Douglas Turnbull. Music emotion recognition: a state of the art review. In Proc. ISMIR, volume 86, 937–952. 2010.

KMRW14

Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 3581–3589. 2014.

KMS+11

Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. DEAP: a database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing, 3(1):18–31, 2011.

Lam08

Paul Lamere. Social tagging and music information retrieval. Journal of New Music Research, 37(2):101–114, 2008.

LWM+09

Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie. Evaluation of algorithms using games: the case of music tagging. In The 10th International Society for Music Information Retrieval Conference, 387–392. International Society for Music Information Retrieval, 2009.

L+13

Dong-Hyun Lee and others. Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 896. 2013.

LPKN17

Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, and Juhan Nam. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. In Sound and Music Computing Conference (SMC). 2017. URL: https://arxiv.org/abs/1703.01789.

MMM+21

Brian McFee, Alexandros Metsai, Matt McVicar, Stefan Balke, Carl Thomé, Colin Raffel, Frank Zalkow, Ayoub Malek, Dana, Kyungyun Lee, Oriol Nieto, Dan Ellis, Jack Mason, Eric Battenberg, Scott Seyfarth, Ryuichi Yamamoto, viktorandreevichmorozov, Keunwoo Choi, Josh Moore, Rachel Bittner, Shunsuke Hidaka, Ziyao Wei, nullmightybofo, Darío Hereñú, Fabian-Robert Stöter, Pius Friesch, Adam Weiss, Matt Vollrath, Taewoon Kim, and Thassilo. Librosa/librosa: 0.8.1rc2. May 2021. URL: https://doi.org/10.5281/zenodo.4792298, doi:10.5281/zenodo.4792298.

MRL+15

Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8, 18–25. Citeseer, 2015.

MelendezCatalanMGomez19

Blai Meléndez-Catalán, Emilio Molina, and Emilia Gómez. Open Broadcast Media Audio from TV: a dataset of TV broadcast audio with relative music loudness annotations. Transactions of the International Society for Music Information Retrieval, 2019.

NCL+18

Juhan Nam, Keunwoo Choi, Jongpil Lee, Szu-Yu Chou, and Yi-Hsuan Yang. Deep learning for audio-based music classification and tagging: teaching computers to distinguish rock from Bach. IEEE Signal Processing Magazine, 36(1):41–51, 2018.

OLV18

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

Pac05

François Pachet. Knowledge management and musical metadata. Idea Group, 2005.

PLP+17

Jiyoung Park, Jongpil Lee, Jangyeon Park, Jung-Woo Ha, and Juhan Nam. Representation learning of music using artist labels. arXiv preprint arXiv:1710.06648, 2017.

PRS+19

Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, and Yoshua Bengio. Learning problem-agnostic speech representations from multiple self-supervised tasks. arXiv preprint arXiv:1904.03416, 2019.

PGM+19

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and others. PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.

PS19

Jordi Pons and Xavier Serra. musicnn: pre-trained convolutional neural networks for music audio tagging. arXiv preprint arXiv:1909.06654, 2019.

Rox19

Linus Roxbergh. Language classification of music using metadata. 2019.

SPD+20

Igor André Pegoraro Santana, Fabio Pinhelli, Juliano Donini, Leonardo Catharin, Rafael Biazus Mangolin, Valéria Delisandra Feltrim, Marcos Aurélio Domingues, and others. Music4All: a new music database and its applications. In 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), 399–404. IEEE, 2020.

SLBock20

Alexander Schindler, Thomas Lidy, and Sebastian Böck. Deep learning for MIR tutorial. arXiv preprint arXiv:2001.05266, 2020.

Sch15

Hendrik Schreiber. Improving genre annotations for the million song dataset. In ISMIR, 241–247. 2015.

SHG+14

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. Machine learning: the high interest credit card of technical debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop). 2014.

SSP+03

Patrice Y Simard, David Steinkraus, John C Platt, and others. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, volume 3. 2003.

SAY16

M Soleymani, A Aljanaki, and YH Yang. DEAM: MediaEval database for emotional analysis in music. 2016.

SCS+13

Mohammad Soleymani, Micheal N Caro, Erik M Schmidt, Cheng-Ya Sha, and Yi-Hsuan Yang. 1000 songs for emotional analysis of music. In Proceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia, 1–6. 2013.

SB21

Janne Spijkervet and John Ashley Burgoyne. Contrastive learning of musical representations. In Proceedings of the 22nd Conference of the International Society for Music Information Retrieval (ISMIR 2021). International Society for Music Information Retrieval (ISMIR), 2021. URL: https://arxiv.org/abs/2103.09410.

Stu13

Bob L Sturm. The GTZAN dataset: its contents, its faults, their effects on evaluation, and its future use. arXiv preprint arXiv:1306.1461, 2013.

TS10

Lisa Torrey and Jude Shavlik. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, pages 242–264. IGI Global, 2010.

Tza99

George Tzanetakis. GTZAN music/speech. Available online at http://marsyas.info/download/data_sets, 1999.

TC01

George Tzanetakis and Perry Cook. GTZAN genre collection. Web resource, 2001. URL: http://marsyas.info/downloads/datasets.html.

TC02

George Tzanetakis and Perry Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, 2002.

VdODS13

Aaron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In Advances in Neural Information Processing Systems, 2643–2651. 2013. URL: https://biblio.ugent.be/publication/4324554.

VGO+20

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi:10.1038/s41592-019-0686-2.

WBKW96

Erling Wold, Thom Blum, Douglass Keislar, and James Wheaton. Content-based classification, search, and retrieval of audio. IEEE MultiMedia, 3(3):27–36, 1996.

WCS21

Minz Won, Keunwoo Choi, and Xavier Serra. Semi-supervised music tagging transformer. In Proceedings of the 22nd Conference of the International Society for Music Information Retrieval (ISMIR 2021). International Society for Music Information Retrieval (ISMIR), 2021.

WCNS20

Minz Won, Sanghyuk Chun, Oriol Nieto, and Xavier Serra. Data-driven harmonic filters for audio representation learning. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 536–540. IEEE, 2020.

WCNCS19

Minz Won, Sanghyuk Chun, Oriol Nieto Caballero, and Xavier Serra. Automatic music tagging with harmonic CNN. In 20th International Society for Music Information Retrieval Conference, Late-Breaking/Demo Session, 2019.

WCS19

Minz Won, Sanghyuk Chun, and Xavier Serra. Toward interpretable music tagging with self-attention. arXiv preprint arXiv:1906.04972, 2019.

WFBS20

Minz Won, Andres Ferraro, Dmitry Bogdanov, and Xavier Serra. Evaluation of CNN-based automatic music tagging models. In Sound and Music Computing Conference (SMC 2020). 2020. URL: https://arxiv.org/abs/2006.00751.

XLHL20

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10687–10698. 2020.

YJegouC+19

I Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.

YHN+21

Yao-Yuan Yang, Moto Hira, Zhaoheng Ni, Anjali Chourdia, Artyom Astafurov, Caroline Chen, Ching-Feng Yeh, Christian Puhrsch, David Pollack, Dmitriy Genzel, Donny Greenberg, Edward Z. Yang, Jason Lian, Jay Mahadeokar, Jeff Hwang, Ji Chen, Peter Goldsborough, Prabhat Roy, Sean Narenthiran, Shinji Watanabe, Soumith Chintala, Vincent Quenneville-Bélair, and Yangyang Shi. TorchAudio: building blocks for audio and speech processing. 2021. arXiv:2110.15018.

ZGL03

Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 912–919. 2003.