TagATune

The "source only" version of the dataset can be downloaded here: source_only.tar.bz2. It contains three Python scripts (join.py, cut_clips.py, join_with_clip_info.py), the original data/clip_info.csv, data/comparisons.csv and data/annotations.csv files, and data/song_info2_xml (the Magnatune catalog XML).

Instructions

Prerequisites: sox (http://sox.sourceforge.net/) and lame (http://lame.sourceforge.net/), both installed in /usr/local/bin, and Python >= 2.4 (http://www.python.org/).
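
Before starting a multi-day run, it may be worth confirming that the prerequisites are actually in place. The following is a minimal sketch, not part of the distributed scripts: the helper names find_tool and check_prerequisites are hypothetical, and the PATH fallback via shutil.which is an assumption (the scripts themselves may only look in /usr/local/bin). It targets a modern Python 3 interpreter rather than the Python 2.4 minimum listed above.

```python
import os
import shutil
import sys

# Tools the instructions list as prerequisites.
REQUIRED_TOOLS = ("sox", "lame")
# Directory the instructions say both tools are installed in.
EXPECTED_DIR = "/usr/local/bin"


def find_tool(name, expected_dir=EXPECTED_DIR):
    """Return the path of an executable, or None if it cannot be found."""
    candidate = os.path.join(expected_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Assumption: fall back to a PATH lookup; the scripts may not do this.
    return shutil.which(name)


def check_prerequisites(tools=REQUIRED_TOOLS, expected_dir=EXPECTED_DIR):
    """Map each tool name to its location (None when it is missing)."""
    return {name: find_tool(name, expected_dir) for name in tools}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for tool, path in check_prerequisites().items():
        print(tool, "->", path or "NOT FOUND")
```

A tool reported as NOT FOUND, or found only outside /usr/local/bin, is a hint that the cutting step may fail.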

1. Download and unarchive source_only.tar.bz2.

2. In order to download and cut the mp3 files, run:

python cut_clips.py [-f encoding_flags] [-L nbClips] -c cut_mp3_files_path -u uncut_mp3_files_path data/clip_info.csv clip_info_final.csv
cut_mp3_files_path
a directory that will contain 16 subdirectories with the cut, re-encoded mp3 files.
uncut_mp3_files_path
a directory that will contain the full mp3 files (e.g. /tmp for safe deletion afterwards).
encoding_flags
an optional flag for specifying the encoding options passed to lame (e.g. -f " -m m -b 64 --resample 32 " for 32 kHz / 64 kbps / mono). The default encoding is 32 kbps, 16 kHz, mono.
nbClips
an optional flag for specifying how many lines of the input data file to process. This can be used for test runs, e.g. -L 40 will process only the first 40 lines of the input data file.
clip_info_final.csv is the clip_info.csv file with two additional columns: the URL of the original mp3 and the path to the clip file.
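
The option syntax above can be assembled programmatically, which helps when scripting a short test run before the full one. This is a sketch only: build_cut_clips_command is a hypothetical helper, and it uses just the flags documented above (-f, -L, -c, -u). It builds the argument list and prints it without launching anything.

```python
import shlex


def build_cut_clips_command(cut_dir, uncut_dir, output_csv,
                            input_csv="data/clip_info.csv",
                            encoding_flags=None, nb_clips=None,
                            python_exe="python"):
    """Build the argument list for cut_clips.py from the documented options."""
    cmd = [python_exe, "cut_clips.py"]
    if encoding_flags is not None:
        # Passed as a single argument, as in the quoted example above.
        cmd += ["-f", encoding_flags]
    if nb_clips is not None:
        # Limit processing to the first nb_clips lines of the input file.
        cmd += ["-L", str(nb_clips)]
    cmd += ["-c", cut_dir, "-u", uncut_dir, input_csv, output_csv]
    return cmd


if __name__ == "__main__":
    # Test run on the first 40 lines; the directory names are placeholders.
    test_cmd = build_cut_clips_command("clips", "/tmp", "clip_info_final.csv",
                                       nb_clips=40)
    print(shlex.join(test_cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) from the unarchived directory once the printed command looks right.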

N.B. This may take several days. To distribute the work over several machines (4 in this example), run one of the following commands on each machine:
python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path -d "0123" data/clip_info.csv clip_info_final_0123.csv

python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path -d "4567" data/clip_info.csv clip_info_final_4567.csv

python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path -d "89ab" data/clip_info.csv clip_info_final_89ab.csv

python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path -d "cdef" data/clip_info.csv clip_info_final_cdef.csv
where uncut_mp3_files_path is on a networked storage system. Then, on any machine, run:

python cut_clips.py -c cut_mp3_files_path -u uncut_mp3_files_path data/clip_info.csv clip_info_final.csv

This last run won't download or cut anything, since all the clip mp3 files are already there from the previous runs.
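
The four commands above follow a regular pattern, so they can be generated for other machine counts. The sketch below assumes, based only on the examples above, that -d takes a string of hexadecimal digits and that the 16 digits can be split into equal groups of any size; neither point is confirmed by the documentation. The helper names split_digits and shard_commands are hypothetical.

```python
HEX_DIGITS = "0123456789abcdef"


def split_digits(n_machines, digits=HEX_DIGITS):
    """Split the hex digits into n_machines equal, contiguous groups."""
    if n_machines < 1 or len(digits) % n_machines != 0:
        raise ValueError("n_machines must divide %d evenly" % len(digits))
    size = len(digits) // n_machines
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def shard_commands(n_machines, cut_dir, uncut_dir):
    """Return one cut_clips.py command line per machine."""
    template = ('python cut_clips.py -c %s -u %s -d "%s" '
                'data/clip_info.csv clip_info_final_%s.csv')
    return [template % (cut_dir, uncut_dir, group, group)
            for group in split_digits(n_machines)]


if __name__ == "__main__":
    for line in shard_commands(4, "cut_mp3_files_path", "uncut_mp3_files_path"):
        print(line)
```

With 4 machines this prints the same four command lines as listed above.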

3. In order to remove entries in the annotations file for which there is no matching mp3 file (and optionally add the path to the clip file in the annotations), run:

python join_with_clip_info.py --annotations [-j|-f] -i clip_info_final.csv data/annotations.csv annotations_final.csv
-f (filter)
an option that only removes from annotations.csv the entries for which there is no matching mp3 in clip_info.
-j (join)
an option that removes from annotations.csv the entries for which there is no matching mp3 in clip_info and, for the remaining entries, adds an extra column to the annotations file with the path of the mp3 file.

4. In order to remove entries in the comparisons file for which there is no matching mp3 file (and optionally add the path to the clip file in the comparisons), run:

python join_with_clip_info.py --comparisons [-j|-f] -i clip_info_final.csv data/comparisons.csv comparisons_final.csv
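
The difference between -f (filter) and -j (join) in steps 3 and 4 can be pictured with a small in-memory example. This is an illustration of the described behaviour, not the code of join_with_clip_info.py: the function filter_or_join and the column names clip_id and mp3_path are hypothetical, chosen only for the sketch.

```python
def filter_or_join(rows, clip_paths, join=False,
                   id_key="clip_id", path_key="mp3_path"):
    """Drop rows without a matching clip; with join=True also add its path.

    rows       -- list of dicts, one per annotation or comparison entry
    clip_paths -- dict mapping a clip identifier to the path of its mp3
    """
    kept = []
    for row in rows:
        path = clip_paths.get(row[id_key])
        if not path:
            # No matching mp3: the entry is removed in both modes.
            continue
        new_row = dict(row)
        if join:
            # Join mode adds one extra column holding the clip path.
            new_row[path_key] = path
        kept.append(new_row)
    return kept


if __name__ == "__main__":
    annotations = [{"clip_id": "2", "tag": "guitar"},
                   {"clip_id": "6", "tag": "piano"}]
    available = {"2": "f/example_clip.mp3"}  # clip 6 has no mp3
    print(filter_or_join(annotations, available))
    print(filter_or_join(annotations, available, join=True))
```

In both modes the entry for the missing clip disappears; only join mode changes the shape of the remaining rows.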