PRD (Pattern Recognition Data) is an ASCII file format designed to make the databases easy to read from C/C++ programs. It is defined as follows (each line is prefixed with its line number for reference only):
where 'm' is the number of samples, 'n' is the number of features, and s[a][b] denotes feature 'b' of sample 'a'.
The class label must be an integer, and the sequence of class labels must be contiguous. For example, in a 10-class database the classes must be labeled from 0 to 9, but they can appear in the file in any order. The features of a sample are separated by commas, with no spaces before or after them. The lines are numbered here only for reference; they are not numbered in the database files. Finally, there are no spaces after the three header lines.
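The two rules above (comma-separated features with no surrounding spaces, and contiguous integer labels starting at 0) can be sketched in C++ as follows. This is only an illustrative sketch: the helper names `splitFeatures` and `labelsAreContiguous` are hypothetical and not part of any PRD library, and the code deliberately makes no assumption about where the class label sits in a line.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split one comma-separated line of feature values.
// Assumes there are no spaces around the commas, as the format requires.
std::vector<double> splitFeatures(const std::string& line) {
    std::vector<double> values;
    std::stringstream stream(line);
    std::string field;
    while (std::getline(stream, field, ',')) {
        values.push_back(std::stod(field));
    }
    return values;
}

// Hypothetical helper: true when the labels are exactly 0, 1, ..., k-1,
// in any order and with any number of repetitions.
bool labelsAreContiguous(const std::vector<int>& labels) {
    std::set<int> distinct(labels.begin(), labels.end());
    if (distinct.empty()) {
        return false;
    }
    // Distinct integers with minimum 0 and maximum k-1 leave no gaps.
    return *distinct.begin() == 0 &&
           *distinct.rbegin() == static_cast<int>(distinct.size()) - 1;
}
```

For instance, the labels 2, 0, 1, 1 would pass this check, while 0, 2, 3 would not, because class 1 is missing.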
1. For the dermatology database, the 8 patterns with missing values were removed, changing the number of samples from 366 to 358;
2. The forest database was split into training and test data, with 387344 samples for training and 193668 for testing, preserving the class proportions. See the problems statistics for details;
3. The glass database has one of its 7 classes with no patterns. This class was not considered in the PRD file class labeling, so the labels run from 0 to 5;
4. In the ionosphere database, the second feature is constant (equal to 0) and was removed, changing the number of features from 34 to 33;
5. The letter database was split into training and test data, with the first 12200 samples for training and the remaining 7800 for testing;
6. The classes in the original lrs database are numbered from 0 to 99, but this numbering is not contiguous. The 48 classes present were relabeled from 0 to 47. Also, the 10 "header" features were eliminated, changing the number of features from 103 to 93;
7. For the lung database, the 26th sample and the 5th feature were removed because of missing values, changing the number of samples from 32 to 31 and the number of features from 56 to 55;
8. The satimage database has one of its 7 classes with no patterns. This class was not considered in the PRD file class labeling, so the labels run from 0 to 5;
9. For the segment database, the feature that contains the number of pixels ("region-pixel-count") is constant (always 9) and was removed, changing the original number of features from 19 to 18;
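Two of the preprocessing steps listed above, relabeling non-contiguous classes (as for lrs) and removing constant features (as for ionosphere and segment), can be sketched in C++ as follows. This is not the code used to build the PRD files: the function names `relabelContiguous` and `dropConstantFeatures` are hypothetical, and mapping the original labels in ascending order is an assumption, since the text does not say which order was used.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: map the distinct labels to 0..k-1.
// Assumption: original labels are mapped in ascending order.
std::vector<int> relabelContiguous(const std::vector<int>& labels) {
    std::map<int, int> newLabel;
    for (int label : labels) {
        newLabel[label] = 0;  // collect distinct labels; std::map keeps them sorted
    }
    int next = 0;
    for (auto& entry : newLabel) {
        entry.second = next++;
    }
    std::vector<int> result;
    result.reserve(labels.size());
    for (int label : labels) {
        result.push_back(newLabel.at(label));
    }
    return result;
}

// Hypothetical helper: remove every feature whose value is identical in all
// samples. Assumes every sample has the same number of features.
std::vector<std::vector<double>> dropConstantFeatures(
        const std::vector<std::vector<double>>& samples) {
    if (samples.empty()) {
        return samples;
    }
    const std::size_t n = samples[0].size();
    std::vector<bool> keep(n, false);
    for (std::size_t b = 0; b < n; ++b) {
        for (const auto& sample : samples) {
            if (sample[b] != samples[0][b]) {
                keep[b] = true;  // feature b varies, so it carries information
                break;
            }
        }
    }
    std::vector<std::vector<double>> result;
    result.reserve(samples.size());
    for (const auto& sample : samples) {
        std::vector<double> row;
        for (std::size_t b = 0; b < n; ++b) {
            if (keep[b]) {
                row.push_back(sample[b]);
            }
        }
        result.push_back(row);
    }
    return result;
}
```

Under the ascending-order assumption, the original labels 10, 3, 10, 42 would become 1, 0, 1, 2, and a feature that equals 0 in every sample would be dropped, reducing the number of features by one.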