MAGNATAGATUNE


Magnatagatune is a ready-to-use research dataset for music information retrieval (MIR) tasks such as automatic tagging.  It contains:


  1. Human annotations collected by Edith Law’s TagATune game.

  2. The corresponding sound clips from magnatune.com, encoded as 16 kHz, 32 kbps, mono mp3.

  3. The source code of the scripts used to generate the dataset.

  4. A detailed analysis of each track’s structure and musical content, including rhythm, pitch and timbre.

CONTRIBUTORS

Edith Law is a Ph.D. student in the Machine Learning Department at Carnegie Mellon University and a Microsoft Graduate Research Fellow, working with Luis von Ahn and Tom Mitchell.  Her research focuses on innovative applications that close the loop for machine learning and allow human users to directly interact with, monitor and coach learning algorithms.  She is the mastermind behind the conception of TagATune.

John Buckman is the founder of Magnatune, a record label known for its commercial application of Creative Commons licensing and its openly artist- and research-friendly business practices.  Since founding Magnatune, Buckman has signed over 250 recording artists across multiple genres.  Buckman is a key reason why this dataset is publicly available to researchers, having ensured that the release respects the rights of Magnatune and the artists it publishes.

CONTENTS

The dataset consists of the following files:



Olivier Gillet works at Google, where he crunches vast amounts of text and audio signals daily.  During his Ph.D. at ENST Paris, he worked on automatic music transcription, source separation, and novel applications of video analysis methods for music videos.  He has made Magnatagatune his personal project by creating scripts that render this dataset readily accessible to researchers.

LICENSING AND REFERENCES

This data is licensed under a Creative Commons Attribution - Noncommercial-Share Alike 3.0 Unported License, with the exception of the scripts, which are released under a GPL v3 license.  This permits the distribution of this package for non-commercial research use.


The audio clips are excerpts of original work released by numerous artists under a similar license, which allows the present non-commercial redistribution/repurposing.  A list of URLs pointing to each artist/album page (with a link for purchase/commercial licensing) is available in the data file and in the ID3 tags of each audio clip.


If you plan to cite this data in your research, please cite the following paper:


Tags in the dataset are verified (i.e. a tag is associated with an audio clip only if it is generated independently by more than 2 players) and useful for training learning algorithms (i.e. only tags that are associated with more than 50 songs are included).
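The two filtering rules above can be sketched in Python as follows.  This is only an illustration under stated assumptions, not the code that was used to build the dataset: the input format (an iterable of (clip_id, player_id, tag) triples) and the function name verified_tags are hypothetical, and the sketch applies the second threshold to clips rather than to songs.

```python
from collections import defaultdict


def verified_tags(raw_tags, min_players=2, min_clips=50):
    """Return the set of (clip_id, tag) pairs that pass both filters.

    raw_tags is a hypothetical input format: an iterable of
    (clip_id, player_id, tag) triples, one per tag entered by a player.
    """
    # Collect the distinct players who proposed each (clip, tag) pair.
    players = defaultdict(set)
    for clip_id, player_id, tag in raw_tags:
        players[(clip_id, tag)].add(player_id)

    # Rule 1: keep a pair only if MORE than min_players players agree on it.
    agreed = {pair for pair, who in players.items() if len(who) > min_players}

    # Rule 2: keep a tag only if it survives on MORE than min_clips clips.
    clips_per_tag = defaultdict(set)
    for clip_id, tag in agreed:
        clips_per_tag[tag].add(clip_id)
    frequent = {t for t, clips in clips_per_tag.items() if len(clips) > min_clips}

    return {(clip_id, tag) for clip_id, tag in agreed if tag in frequent}
```

The default thresholds mirror the "more than 2 players" and "more than 50" wording above; both are parameters so the sketch can be tried on small toy inputs.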


The audio clips are binned into 16 shards (directories numbered 0-9 and a-f); the sharding key is artist + album name.  This binning can be used as a standardized, pseudo-random way of assigning songs to cross-validation folds, ensuring that songs from the same album, or excerpts from the same track, are not assigned to both the training and testing sets.
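As a minimal sketch of how the shards could be used for cross-validation, the Python snippet below groups the 16 shard directories into folds and splits a list of clip paths accordingly.  It assumes that each clip path is relative and begins with its shard directory (for example "f/some_clip.mp3", a made-up file name); the function names are hypothetical and are not part of the distributed scripts.

```python
SHARDS = "0123456789abcdef"  # the 16 shard directory names


def shard_folds(n_folds=4):
    """Group the 16 shards into n_folds groups of equal size."""
    if len(SHARDS) % n_folds != 0:
        raise ValueError("n_folds must divide 16")
    size = len(SHARDS) // n_folds
    return [SHARDS[i * size:(i + 1) * size] for i in range(n_folds)]


def shard_of(clip_path):
    """Return the shard directory, assumed to be the first path component."""
    return clip_path.replace("\\", "/").split("/")[0]


def split_by_shard(clip_paths, test_shards):
    """Split clip paths into (train, test) according to their shard."""
    train, test = [], []
    for path in clip_paths:
        (test if shard_of(path) in test_shards else train).append(path)
    return train, test


if __name__ == "__main__":
    paths = ["0/a.mp3", "7/b.mp3", "c/c.mp3", "f/d.mp3"]  # made-up names
    for fold in shard_folds(4):
        train, test = split_by_shard(paths, fold)
        print(fold, "train:", train, "test:", test)
```

Because the sharding key is artist + album name, holding out whole shards keeps every clip from a given album on the same side of the split.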


A detailed description of the Echo Nest analysis can be found at http://developer.echonest.com.

DETAILS

“SOURCE ONLY” VERSION

The "source only" version of the dataset can be downloaded here: source_only.tar.bz2.  It contains three Python scripts (join.py, cut_clips.py, join_with_clip_info.py), along with the original data/clip_info.csv, data/comparisons.csv, data/annotations.csv and data/song_info2_xml (the magnatune catalog XML) files.


Instructions


Prerequisites: sox (http://sox.sourceforge.net/) and lame (http://lame.sourceforge.net/), both installed in /usr/local/bin, and Python >= 2.4 (http://www.python.org/).


1. Download and unarchive source_only.tar.bz2.


2. In order to download and cut the mp3 files, run:

python cut_clips.py [-f encoding_flags] [-L nbClips] -c cut_mp3_files_path -u uncut_mp3_files_path data/clip_info.csv clip_info_final.csv

cut_mp3_files_path

a directory that will contain 16 subdirectories with the cut, re-encoded mp3 files.

uncut_mp3_files_path

a directory that will contain the full mp3 files (e.g. /tmp for safe deletion afterwards).

encoding_flags

an optional flag for specifying the encoding options passed to lame (e.g. -f " -m m -b 64 --resample 32 " for 32 kHz/64 kbps/mono).  The default encoding is 16 kHz, 32 kbps, mono.

nbClips

an optional flag for specifying how many lines of the input data file to process.  This can be used for test runs, e.g. -L 40 will process only the first 40 lines of the input data file.

clip_info_final.csv is the clip_info.csv file with two additional columns: the URL of the original mp3 and the path to the clip file.


N.B. This may take several days.  To distribute the work across several machines (4, for example), run one of the following commands on each machine:

python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path -d "0123" data/clip_info.csv clip_info_final_0123.csv


python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path -d "4567" data/clip_info.csv clip_info_final_4567.csv


python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path -d "89ab" data/clip_info.csv clip_info_final_89ab.csv


python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path -d "cdef" data/clip_info.csv clip_info_final_cdef.csv

where uncut_mp3_files_path is located on a networked storage system.  Then, on any machine, run:


python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path data/clip_info.csv clip_info_final.csv*


*This won't download/cut anything since all the clip mp3 files are already there from the previous run.
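For a different number of machines, the per-machine command lines can be generated rather than typed by hand.  The Python sketch below only builds and prints the commands, following the pattern of the four examples above (the -d option and the clip_info_final_<shards>.csv naming are taken from those examples); the function name and the directory arguments are placeholders, and the sketch targets Python 3 rather than the Python 2.4 required by the scripts themselves.

```python
import shlex

SHARDS = "0123456789abcdef"  # the 16 shard directory names


def shard_commands(n_machines, cut_dir, uncut_dir,
                   clip_info="data/clip_info.csv"):
    """Build one cut_clips.py argument list per machine."""
    if len(SHARDS) % n_machines != 0:
        raise ValueError("n_machines must divide 16")
    size = len(SHARDS) // n_machines
    commands = []
    for i in range(n_machines):
        group = SHARDS[i * size:(i + 1) * size]
        commands.append([
            "python", "cut_clips.py",
            "-c", cut_dir,
            "-u", uncut_dir,
            "-d", group,
            clip_info,
            "clip_info_final_%s.csv" % group,
        ])
    return commands


if __name__ == "__main__":
    # Placeholder directory names, as in the instructions above.
    for cmd in shard_commands(4, "cut_mp3_files_path", "uncut_mp3_files_path"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```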

3. In order to remove entries in the annotations file for which there is no matching mp3 file (and optionally add the path to the clip file in the annotations), run:

python join_with_clip_info.py --annotations [-j|-f] -i clip_info_final.csv data/annotations.csv annotations_final.csv

-f (filter)

an option that will only remove from annotations.csv the entries for which there is no matching mp3 in clip_info

-j (join)

an option that will remove from annotations.csv the entries for which there is no matching mp3 in clip_info and, for the remaining entries, add an extra column to the annotations file with the path of the mp3 file.

4. In order to remove entries in the comparisons file for which there is no matching mp3 file (and optionally add the path to the clip file in the comparisons), run:

python join_with_clip_info.py --comparisons [-j|-f] -i clip_info_final.csv data/comparisons.csv comparisons_final.csv
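The difference between the filter (-f) and join (-j) options can be illustrated with a small Python sketch operating on rows held in memory.  This is not the code of join_with_clip_info.py: the function name is hypothetical, and the column names clip_id and mp3_path are assumptions made for the example, so check them against the header of your own CSV files.

```python
def filter_or_join(rows, clip_info, key="clip_id", path_field="mp3_path",
                   add_path=False):
    """Keep only the rows whose clip has an mp3 path in clip_info.

    rows and clip_info are lists of dicts (e.g. from csv.DictReader).
    With add_path=False this mimics a filter; with add_path=True it
    mimics a join by also copying the mp3 path into each kept row.
    The column names are assumptions for this sketch.
    """
    # Map each clip id to its mp3 path, skipping clips without a path.
    paths = {c[key]: c[path_field] for c in clip_info if c.get(path_field)}

    result = []
    for row in rows:
        path = paths.get(row[key])
        if path is None:
            continue  # no matching mp3: drop the entry
        kept = dict(row)  # copy so the input rows are left untouched
        if add_path:
            kept[path_field] = path
        result.append(kept)
    return result


if __name__ == "__main__":
    clip_info = [{"clip_id": "1", "mp3_path": "0/a.mp3"},
                 {"clip_id": "2", "mp3_path": ""}]
    annotations = [{"clip_id": "1", "rock": "1"},
                   {"clip_id": "2", "rock": "0"}]
    print(filter_or_join(annotations, clip_info))
    print(filter_or_join(annotations, clip_info, add_path=True))
```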

Edith Law and Luis von Ahn.  Input-agreement: A New Mechanism for Data Collection Using Human Computation Games.  To Appear in CHI 2009.

audio clips information, such as title, artist, album, url, start and end time, download URL for the mp3 file (entire song), and path to the mp3 clip.

tags associated with each audio clip, and path to the clip mp3 file.

similarity judgments (number of people who voted that a particular clip is the most different) associated with a tuple of audio clips, and paths to the mp3 clips.

TAR archive with all the audio clips as 16kHz, 32kbps, mono mp3.


If you want to handle the download and processing of the audio files yourself, you can use the “source only” version of the dataset by following the instructions in the last section of this page.

CONTACT

Email

    Edith Law (edith@cmu.edu) for questions about TagATune and the data collection procedures

    Olivier Gillet (ol.gillet@gmail.com) for questions about scripts/technical details

    Paul Lamere (paul@echonest.com) for questions about the Echo Nest analysis

    John Buckman (john@magnatune.com) to thank him for all the great music!

The Echo Nest analysis of each of the clips in XML format.