Tags in the dataset are verified (i.e. a tag is associated with an audio clip only if it is generated independently by more than 2 players) and useful for training learning algorithms (i.e. only tags that are associated with more than 50 songs are included).
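The two filtering rules above can be sketched as follows. This is only an illustration, not the script shipped with the dataset: the input layout (an iterable of `(clip_id, player_id, tag)` tuples) and the function name `verified_tags` are assumptions made for this example, and the sketch counts clips where the text says songs.

```python
from collections import defaultdict


def verified_tags(annotations, min_players=3, min_clips=51):
    """Keep (tag -> clips) pairs that pass both filtering rules.

    `annotations` is a hypothetical iterable of (clip_id, player_id, tag)
    tuples; the real annotation files may be laid out differently.
    "More than 2 players" is read as at least 3 distinct players, and
    "more than 50 songs" as at least 51 clips (an assumption).
    """
    # Distinct players who independently produced each (clip, tag) pair.
    players = defaultdict(set)
    for clip_id, player_id, tag in annotations:
        players[(clip_id, tag)].add(player_id)

    # A tag is attached to a clip only if enough players agreed on it.
    clips_per_tag = defaultdict(set)
    for (clip_id, tag), who in players.items():
        if len(who) >= min_players:
            clips_per_tag[tag].add(clip_id)

    # Drop tags that are attached to too few clips to be useful for training.
    return {tag: clips for tag, clips in clips_per_tag.items()
            if len(clips) >= min_clips}
```

The thresholds are parameters so the same function can be tried on a small sample with lower values.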

The audio clips are binned into 16 shards (directories numbered 0-9 and a-f); the sharding key is artist + album name. This binning can be used as a standardized, pseudo-random way of assigning songs to cross-validation folds, ensuring that songs from the same album, or excerpts from the same track, are not assigned to both the training and testing sets.
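A minimal sketch of shard-based fold assignment is given below. It assumes that each clip is referenced by a relative path whose first component is the shard directory (for example `f/some_clip.mp3`); the file names, the helper names and the choice of 4 folds are illustrative assumptions, not part of the dataset.

```python
from pathlib import PurePosixPath

SHARDS = "0123456789abcdef"  # the 16 shard directory names


def shard_of(clip_path):
    """Return the shard directory (first path component) of a clip path."""
    shard = PurePosixPath(clip_path).parts[0]
    if len(shard) != 1 or shard not in SHARDS:
        raise ValueError(f"not a shard directory: {shard!r}")
    return shard


def fold_of(clip_path, n_folds=4):
    """Map a clip to a fold using only its shard, so that clips sharing a
    shard (hence an artist + album) always land in the same fold."""
    return SHARDS.index(shard_of(clip_path)) % n_folds


def split(clip_paths, test_fold, n_folds=4):
    """Split clip paths into (train, test) lists for one cross-validation round."""
    train, test = [], []
    for path in clip_paths:
        (test if fold_of(path, n_folds) == test_fold else train).append(path)
    return train, test
```

Choosing a number of folds that divides 16 (2, 4, 8 or 16) gives every fold the same number of shards.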

A detailed description of the Echo Nest analysis can be found at http://developer.echonest.com.


LICENSING AND REFERENCES

This data is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License, with the exception of the script, which is released under a GPL v3 license. This enables the distribution of this package for non-commercial research use.

The audio clips are excerpts of original work released by numerous artists under a similar license, allowing the present non-commercial redistribution/repurposing. A list of URLs pointing to each artist/album page (with a link for purchase/commercial licensing) is available in the data file, and in the ID3 tags of each audio clip.

If you plan to cite this data in your research, please cite the following paper:

Edith Law and Luis von Ahn. Input-agreement: A New Mechanism for Data Collection Using Human Computation Games. To Appear in CHI 2009. 


Magnatagatune

Magnatagatune is a ready-to-use research dataset for MIR tasks such as automatic tagging. It contains:

  • Human annotations collected by Edith Law’s TagATune game.
  • The corresponding sound clips from magnatune.com, encoded as 16 kHz, 32 kbps, mono MP3.
  • The source code of the scripts that generated this dataset.
  • A detailed analysis of each track’s structure and musical content, including rhythm, pitch and timbre.