n South African Computer Journal - A comparative study of sample selection methods for classification

Volume 2006, Issue 36
  • ISSN : 1015-7999
  • E-ISSN: 2313-7835



Sampling of large datasets for data mining is important for at least two reasons. The processing of large amounts of data results in increased computational complexity. The cost of this additional complexity may not be justifiable. On the other hand, the use of small samples results in fast and efficient computation for data mining algorithms. Statistical methods for obtaining sufficient samples from datasets for classification problems are discussed in this paper. Results are presented for an empirical study based on the use of sequential random sampling and sample evaluation using univariate hypothesis testing and an information theoretic measure. Comparisons are made between theoretical and empirical estimates.

Loading full text...

Full text loading...


Article metrics loading...


This is a required field
Please enter a valid email address
Approval was a Success
Invalid data
An Error Occurred
Approval was partially successful, following selected items could not be processed due to error