Abstract
Deep learning has become a very popular method for text classification in recent years, due to its ability to improve the accuracy of previous state-of-the-art methods on several benchmarks. However, these improvements required hundreds of thousands to millions of labeled training examples, which in many cases are time-consuming and/or expensive to acquire. This problem is especially significant in domain-specific text classification tasks, where pretrained embeddings and models are not optimal. To cope with this problem, we propose a novel learning framework, Ensembled Transferred Embeddings (ETE), which relies on two key ideas: (1) labeling a relatively small sample of the target dataset in a semi-automatic process, and (2) leveraging other datasets from related domains or related tasks that are large-scale and labeled, to extract "transferable embeddings". Evaluation of ETE on a large-scale, real-world item categorization dataset provided to us by PayPal shows that it significantly outperforms traditional as well as state-of-the-art item categorization methods.
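The two ideas in the abstract can be sketched roughly as follows. This is a minimal, hypothetical illustration, not the chapter's implementation: it assumes each "transferred embedding" is a feature extractor derived from a related large-scale labeled dataset, concatenates the extractors' outputs into one ensembled representation, and trains a simple nearest-centroid classifier (standing in for whatever classifier ETE actually uses) on a small labeled target sample. All names, extractors, and data below are illustrative.

```python
from typing import Callable, Dict, List

def ensemble_embed(text: str, extractors: List[Callable[[str], List[float]]]) -> List[float]:
    """Concatenate the embeddings produced by each transferred extractor."""
    vec: List[float] = []
    for extract in extractors:
        vec.extend(extract(text))
    return vec

def train_centroids(samples: List[str], labels: List[str],
                    extractors: List[Callable[[str], List[float]]]) -> Dict[str, List[float]]:
    """Per-class mean of the ensembled embeddings, learned from a small
    labeled sample of the target dataset (idea 1 in the abstract)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for text, label in zip(samples, labels):
        v = ensemble_embed(text, extractors)
        acc = sums.setdefault(label, [0.0] * len(v))
        for i, x in enumerate(v):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def predict(text: str, centroids: Dict[str, List[float]],
            extractors: List[Callable[[str], List[float]]]) -> str:
    """Assign the class whose centroid is nearest in the ensembled space."""
    v = ensemble_embed(text, extractors)
    def sq_dist(c: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(v, c))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

# Toy stand-ins for embeddings transferred from related domains (idea 2):
# real extractors would be learned representations, not character statistics.
ext_length = lambda t: [len(t) / 10.0]
ext_digits = lambda t: [sum(ch.isdigit() for ch in t) / max(len(t), 1)]

texts = ["usb cable 2m", "gold ring 18k", "hdmi adapter", "silver necklace"]
labels = ["electronics", "jewelry", "electronics", "jewelry"]
centroids = train_centroids(texts, labels, [ext_length, ext_digits])
print(predict("usb charger", centroids, [ext_length, ext_digits]))  # → electronics
```

The key design point the sketch mirrors is that the classifier never sees raw text: it operates only on the concatenated outputs of extractors trained elsewhere, which is what lets a small labeled target sample suffice.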
| Original language | English |
| --- | --- |
| Title of host publication | Machine Learning for Data Science Handbook |
| Subtitle of host publication | Data Mining and Knowledge Discovery Handbook, Third Edition |
| Pages | 587-606 |
| Number of pages | 20 |
| ISBN (Electronic) | 9783031246289 |
| DOIs | |
| State | Published - 1 Jan 2023 |
All Science Journal Classification (ASJC) codes
- General Computer Science
- General Mathematics