Quantifying the value of user-level data cleaning for big data: A case study using mammal distribution models

Tomer Gueta, Yohay Carmel

Research output: Contribution to journal › Article › peer-review

Abstract

The recent availability of species occurrence data from numerous sources, standardized and connected within a single portal, has the potential to answer fundamental ecological questions. These aggregated big biodiversity databases are prone to numerous data errors and biases. The data user is responsible for identifying these errors and assessing whether the data are suitable for a given purpose. Complex technical skills are increasingly required for handling and cleaning biodiversity data, while biodiversity scientists possessing these skills are rare. Here, we estimate the effect of user-level data cleaning on species distribution model (SDM) performance. We implement several simple and easy-to-execute data cleaning procedures, and evaluate the change in SDM performance. Additionally, we examine whether a certain group of species is more sensitive to the use of erroneous or unsuitable data. The cleaning procedures used in this research improved SDM performance significantly, across all scales and for all performance measures. The largest improvement in distribution models following data cleaning was for small mammals (1 g to 100 g). Data cleaning at the user level is crucial when using aggregated occurrence data, and facilitating its implementation is key to advancing data-intensive biodiversity studies. Adopting a more comprehensive approach for incorporating data cleaning as part of data analysis will not only improve the quality of biodiversity data, but will also encourage more appropriate use of such data.

Original language: English
Pages (from-to): 139-145
Number of pages: 7
Journal: Ecological Informatics
Volume: 34
State: Published - 1 Jul 2016

Keywords

  • Australian mammals
  • Big-data
  • Biodiversity informatics
  • Data-cleaning
  • MaxEnt
  • SDM performance

All Science Journal Classification (ASJC) codes

  • Ecology, Evolution, Behavior and Systematics
  • Ecology
  • Modelling and Simulation
  • Ecological Modelling
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Applied Mathematics
