REEF: Resolving length bias in frequent sequence mining using sampling

Ariella Richardson, Gal A. Kaminka, Sarit Kraus

Research output: Contribution to journal › Article › Peer-reviewed


Classic support-based approaches efficiently address frequent sequence mining. However, support-based mining has been shown to suffer from a bias towards short sequences. In this paper, we propose a method to resolve this bias when mining the most frequent sequences. To resolve the length bias, we define norm-frequency, based on the statistical z-score of support, and use it to replace support-based frequency. Our approach mines the subsequences that are frequent relative to other subsequences of the same length. Unfortunately, naive use of norm-frequency hinders mining scalability: it breaks the anti-monotonic property of support, which is important for pruning large sets of candidate sequences. We describe a bound that enables pruning and thereby provides scalability. Calculating the bound uses a preprocessing stage on a sample of the dataset. Sampling the data distorts the sample's measures, and we present a method to correct this distortion. We conducted experiments on four data sets, including synthetic data, textual data, remote control zapping data, and computer user input data. The experimental results establish that we successfully overcome the short sequence bias, and they illustrate that our mining algorithm produces meaningful sequences.
Original language: English
Pages (from-to): 208-222
Number of pages: 15
Journal: International Journal On Advances in Intelligent Systems
Issue number: 1
Publication status: Published - 2014
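
The norm-frequency idea in the abstract can be illustrated with a minimal sketch. This is not the authors' REEF implementation: it assumes contiguous subsequences, counts support as the number of database sequences that contain a pattern, and uses the population standard deviation for the z-score. The function names `support_counts` and `norm_frequency` and the toy database are hypothetical, and the sketch omits the pruning bound and the sampling correction described above.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def support_counts(database, max_len):
    """Count how many sequences contain each contiguous pattern up to max_len.

    Assumption: patterns are contiguous; the paper's definition may differ.
    """
    counts = Counter()
    for seq in database:
        seen = set()
        for n in range(1, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
        counts.update(seen)  # each sequence contributes at most once per pattern
    return counts


def norm_frequency(counts):
    """Z-score each pattern's support against patterns of the same length."""
    by_len = defaultdict(list)
    for pattern, sup in counts.items():
        by_len[len(pattern)].append(sup)
    # Mean and population standard deviation of support, per pattern length.
    stats = {n: (mean(v), pstdev(v)) for n, v in by_len.items()}
    scores = {}
    for pattern, sup in counts.items():
        mu, sigma = stats[len(pattern)]
        # A length whose patterns all share one support value carries no signal.
        scores[pattern] = (sup - mu) / sigma if sigma > 0 else 0.0
    return scores


# Toy database (hypothetical): three short sequences over single-character items.
database = ["abc", "abd", "abc"]
counts = support_counts(database, max_len=2)
scores = norm_frequency(counts)

# Rank patterns by norm-frequency instead of raw support.
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy database the patterns `a` and `ab` have the same raw support, but `ab` scores higher under the z-score because it stands out more among patterns of length two than `a` does among patterns of length one. That is the sense in which a length-normalized measure can counter a bias towards short sequences.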
