Abstract
This paper presents a novel approach to bitmap-indexing for data mining purposes. Currently, bitmap-indexing enables efficient data storage and retrieval, but it is limited in its support for similarity measurement, and hence for classification, clustering and data mining. Bitmap-indexes mainly fit nominal discrete attributes and are thus unattractive for widespread use, which requires the ability to handle continuous data in a raw format. The research describes a scheme for representing ordinal and continuous data by applying the concept of "padding", in which each discrete nominal data value is transformed into a range of nominal-discrete values. This "padding" is done by adding adjacent bits "around" the original value (bin). The padding factor, i.e., the number of adjacent bits added, is calculated from the first and second derivative degrees of each attribute's domain-distribution. The padded representation better supports similarity measures, and therefore improves the accuracy of clustering and mining. The advantages of padding bitmaps are demonstrated on Fisher's Iris dataset.
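The padding idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names `encode` and `jaccard`, the equal-width binning, the choice of Jaccard similarity, and the fixed padding factor `pad` are all assumptions made for illustration. In particular, the abstract states that the padding factor is derived from the attribute's domain distribution, and that derivation is not reproduced here.

```python
def encode(value, lo, hi, n_bins, pad=0):
    """Bitmap-encode a continuous value into n_bins equal-width bins.

    The bin containing the value is set to 1. With pad > 0, up to `pad`
    adjacent bins on each side are also set (clipped at the domain edges).
    `pad` is a fixed, hand-picked number here (an assumption).
    """
    width = (hi - lo) / n_bins
    idx = int((value - lo) / width)
    idx = max(0, min(idx, n_bins - 1))  # clamp values on or beyond the edges
    bits = [0] * n_bins
    for i in range(max(0, idx - pad), min(n_bins, idx + pad + 1)):
        bits[i] = 1
    return bits


def jaccard(a, b):
    """Jaccard similarity of two equal-length bit lists (assumed measure)."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0


# Two nearby values that fall into adjacent bins of a [0, 10) domain.
plain_a, plain_b = encode(4.2, 0, 10, 10), encode(5.1, 0, 10, 10)
padded_a, padded_b = encode(4.2, 0, 10, 10, pad=1), encode(5.1, 0, 10, 10, pad=1)

print(jaccard(plain_a, plain_b))    # unpadded bitmaps share no bits
print(jaccard(padded_a, padded_b))  # padded bitmaps overlap in two bins
```

Without padding, the two values look entirely dissimilar because their single set bits do not coincide. With one bit of padding on each side, their bitmaps overlap, so a bitwise similarity measure can reflect that the values are close.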
Original language | English |
---|---|
Pages (from-to) | 99-110 |
Number of pages | 12 |
Journal | Information Systems Frontiers |
Volume | 15 |
Issue number | 1 |
State | Published - Mar 2013 |
Keywords
- Bitmap-index
- Classification
- Cluster analysis
- Data mining
- Data representation
- Similarity index
All Science Journal Classification (ASJC) codes
- Software
- Theoretical Computer Science
- Information Systems
- Computer Networks and Communications