TY - GEN
T1 - Machine learning to data management
T2 - 34th IEEE International Conference on Data Engineering, ICDE 2018
AU - Berti-Equille, Laure
AU - Bonifati, Angela
AU - Milo, Tova
N1 - Funding Information: Laure Berti-Equille received her Ph.D. degree in Computer Science from the University of Toulon in France in 1999. From 2000 to 2010, she was a tenured Associate Professor at University of Rennes 1, and for two years a visiting researcher at AT&T Labs Research in New Jersey, USA, as a recipient of the prestigious European Marie Curie Outgoing Fellowship (2007-2009). From 2011 to 2017, she was a Research Director at IRD, the French Institute of Research for Development. From 2014 to 2017, she was a Senior Scientist at Qatar Computing Research Institute (Hamad Bin Khalifa University). She is now a full Professor at Aix-Marseille University (AMU) in France. Her interests are at the intersection of large-scale data science, data analytics, and machine learning, with a focus on data quality and truth discovery research. She initiated the very first workshop editions on information and data quality in information systems (IQIS 2005) and in databases (QDB 2009 and 2016) in conjunction with SIGMOD and VLDB respectively, and co-organized the first French workshops on Data and Knowledge Quality in conjunction with EGC (Extraction et Gestion de Connaissances) in 2005, 2006, 2010, and 2011. Laure is serving as an associate editor of the ACM Journal on Data and Information Quality and served as a Program Chair of the International Conferences on Information Quality (ICIQ) in 2012 and 2016. She has received various grants from the French Agency for National Research (ANR), the French National Research Council (CNRS), and the European Union. Funding Information: Angela Bonifati received her Ph.D. degree in Computer Science from Politecnico di Milano in 2002. After graduating she worked as a postdoctoral researcher at the INRIA research institute in Paris. She then obtained a permanent position as a researcher at the Italian National Research Council in 2003. She has been a full Professor in France since 2011, currently at University of Lyon 1. 
Her research focuses on advanced database applications such as data integration and exchange, web and graph databases, and query inference, considering both structured and semi-structured data models. She has been a visiting professor at several foreign universities, such as Stanford University, UBC, and Saarland University. Angela served as the Program Chair of several international conferences, including ICDE 2011 (Semi-structured data Track) and ICDE 2018 (Information Extraction and Data Cleaning and Curation Track), WebDB 2013, and XSym 2009. She is currently an associate editor of the VLDB Journal, ACM Transactions on Database Systems (TODS), and Distributed and Parallel Databases. She was the recipient of the prestigious Palse Impulsion Starting Grant at the University of Lyon (IDEX) in 2016. She has received grants from the French and Italian Ministries of Science and the French National Research Council (CNRS). Funding Information: Tova Milo received her Ph.D. degree in Computer Science from the Hebrew University, Jerusalem, in 1992. After graduating she worked at the INRIA research institute in Paris and at the University of Toronto and returned to Israel in 1995, joining the School of Computer Science at Tel Aviv University, where she is now a full Professor. She is the head of the Database research group and holds the Chair of Information Management. She served as the Head of the Computer Science Department from 2011 to 2014. Her research focuses on large-scale data management applications such as data integration, semi-structured information, data-centered business processes, and crowdsourcing, studying both theoretical and practical aspects. Tova served as the Program Chair of several international conferences, including PODS, VLDB, ICDT, XSym, and WebDB, and as the chair of the PODS Executive Committee. She served as a member of the VLDB Endowment and the PODS and ICDT executive boards and as an editor of TODS and the Logical Methods in Computer Science Journal. 
Tova has received grants from the Israel Science Foundation, the US-Israel Binational Science Foundation, the Israeli and French Ministries of Science, and the European Union. She is an ACM Fellow, a member of Academia Europaea, and a recipient of the 2010 ACM PODS Alberto O. Mendelzon Test-of-Time Award, the 2017 VLDB Women in Database Research award, the 2017 Weizmann award for Exact Sciences Research, and the prestigious EU ERC Advanced Investigators grant. Publisher Copyright: © 2018 IEEE. Copyright 2019 Elsevier B.V., All rights reserved.
PY - 2018/10/24
Y1 - 2018/10/24
N2 - With the emergence of machine learning (ML) techniques in database research, ML has already shown tremendous potential to impact the foundations, algorithms, and models of several data management tasks, such as error detection, data cleaning, data integration, and query inference. Parts of the data preparation, standardization, and cleaning processes, such as data matching and deduplication, could be automated by having an ML model 'learn' and predict the matches routinely. Data integration can also benefit from ML, as the data to be integrated can be sampled and used to design the data integration algorithms. After the initial manual work to set up the labels, ML models can start learning from the new incoming data submitted for standardization, integration, and cleaning. The more data supplied to the model, the better the ML algorithm can perform and deliver accurate results. ML is therefore more scalable than traditional, time-consuming approaches. Nevertheless, many ML algorithms require tuning out of the box, and their parameters and scope are often not adapted to the problem at hand. For example, in cleaning and integration processes, the window sizes of values used for the ML models cannot be chosen arbitrarily and require an adaptation of the learning parameters. This tutorial will survey the recent trend of applying machine learning solutions to improve data management tasks and establish new paradigms to sharpen data error detection, cleaning, and integration at the data instance level, as well as at schema, system, and user levels.
KW - Classification
KW - Clustering
KW - Data cleaning
KW - Data management
KW - Data repairing
KW - Error detection
KW - Machine learning
KW - Query inference
UR - http://www.scopus.com/inward/record.url?scp=85057090634&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2018.00226
DO - 10.1109/ICDE.2018.00226
M3 - Conference contribution
T3 - Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018
SP - 1735
EP - 1738
BT - Proceedings - IEEE 34th International Conference on Data Engineering, ICDE 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 April 2018 through 19 April 2018
ER -