Publications | Hazar Harmouch

2025

The Effects of Data Quality on Machine Learning Performance on Tabular Data

Sedir Mohammed , Lukas Budach , Moritz Feuerpfeil , Nina Ihde , Andrea Nathansen , Nele Sina Noack , Hendrik Patzlaff , Felix Naumann , and Hazar Harmouch

Information Systems Journal, 2025

Abs Bib PDF

Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example, incomplete, erroneous, or inappropriate training data can lead to unreliable models that produce ultimately poor decisions. Trustworthy AI applications require high quality training and test data along many quality dimensions, such as accuracy, completeness, and consistency. We explore empirically the relationship between six data quality dimensions and the performance of 19 popular machine learning algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality. Our experiments distinguish three scenarios based on the AI pipeline steps that were fed with polluted data: polluted training data, test data, or both. We conclude the paper with an extensive discussion of our observations.
@article{budach2022effects, title = {The Effects of Data Quality on Machine Learning Performance on Tabular Data}, author = {Mohammed, Sedir and Budach, Lukas and Feuerpfeil, Moritz and Ihde, Nina and Nathansen, Andrea and Noack, Nele Sina and Patzlaff, Hendrik and Naumann, Felix and Harmouch, Hazar}, journal = {Information Systems Journal}, year = {2025}, }
The Five Facets of Data Quality Assessment

Sedir Mohammed , Lisa Ehrlinger , Hazar Harmouch, Felix Naumann , and Divesh Srivastava

SIGMOD Records, Apr 2025
Step-by-Step Data Cleaning Recommendations to Improve ML Prediction Accuracy

Sedir Mohammed , Felix Naumann , and Hazar Harmouch

EDBT 2025, Feb 2025

Abs PDF

Data quality is crucial in machine learning (ML) applications, as errors in the data can significantly impact the prediction accuracy of the underlying ML model. Therefore, data cleaning is an integral component of any ML pipeline. However, in practical scenarios, data cleaning incurs significant costs, as it often involves domain experts for configuring and executing the cleaning process. Thus, efficient resource allocation during data cleaning can enhance ML prediction accuracy while controlling expenses. This paper presents Comet, a system designed to optimize data cleaning efforts for ML tasks. Comet gives step-by-step recommendations on which feature to clean next, maximizing the efficiency of data cleaning under resource constraints. We evaluated Comet across various datasets, ML algorithms, and data error types, demonstrating its robustness and adaptability. Our results show that Comet consistently outperforms feature importance-based, random, and another well-known cleaning method, achieving up to 52 and on average 5 percentage points higher ML prediction accuracy than the proposed baselines.
BASIL DB: BioActive Semantic Integration and Linking Database

David Jackson , Paul Groth , and Hazar Harmouch

Journal of Biomedical Semantics, Mar 2025

2024

A Data Quality Glossary

Sedir Mohammed , Lou Therese Brandner , Felicia Burtscher , Sebastian Hallensleben , Hazar Harmouch, Andreas Hauschke , Jessica Heesen , Stefanie Hildebrandt , Simon David Hirsbrunner , Julia Keselj , Philipp Mahlow , Marie Massow , Felix Naumann , Frauke Rostalski , Anna Wilken , and Annika Wölke

Jan 2024

Bib PDF

@manual{mohammed_2024_10474880,
  title = {A Data Quality Glossary},
  author = {Mohammed, Sedir and Brandner, Lou Therese and Burtscher, Felicia and Hallensleben, Sebastian and Harmouch, Hazar and Hauschke, Andreas and Heesen, Jessica and Hildebrandt, Stefanie and Hirsbrunner, Simon David and Keselj, Julia and Mahlow, Philipp and Massow, Marie and Naumann, Felix and Rostalski, Frauke and Wilken, Anna and Wölke, Annika},
  month = jan,
  year = {2024},
  doi = {10.5281/zenodo.10474880},
  url = {https://doi.org/10.5281/zenodo.10474880},
}

13th International Workshop on Quality in Databases: Preface

Sourav S. Bhowmick , Lisa Ehrlinger , and Hazar Harmouch

In Proceedings of Workshops at the 50th International Conference on Very Large Data Bases, VLDB 2024, Guangzhou, China, August 26-30, 2024 , Jan 2024

2023

How Data Quality Determines AI Fairness: The Case of Automated Interviewing

Lou T. Brandner , Philipp Mahlow , Anna Wilken , Annika Wölke , Hazar Harmouch, and Simon D. Hirsbrunner

European Workshop on Algorithmic Fairness (EWAF2023), Jan 2023

Abs Bib PDF

Artificial Intelligence (AI) supported job interviewing, i.e., one-sided automated applicant interviews assessed by AI-based systems, presents itself as a new mainstream solution in hiring, promising to be more efficient and effective than human recruiters, but also fairer and more objective. Selecting this technology as an illustrative case, we focus on a central element in the development of fair AI: the issue of (training) data quality (DQ). ML models with unsuitable, biased, or erroneous training data is a major source of bias in AI-based applications and therefore potentially discriminatory, unfair outcomes. However, DQ is often cast aside as one of many technical factors contributing to the overall quality of ML-based systems; this approach runs the risk of understating its crucial relevance. We select salient issues along the technology lifecycle to take a detailed look at the interrelation of fairness and DQ, illustrating how both fairness and DQ must be understood in a broad sense, taking into account normative considerations beyond technical aspects, to facilitate desirable outcomes such as the promotion of diversity, the prevention of discrimination, and the protection of workers’ rights.
@article{EWAF23, title = {How Data Quality Determines AI Fairness: The Case of Automated Interviewing}, author = {Brandner, Lou T. and Mahlow, Philipp and Wilken, Anna and Wölke, Annika and Harmouch, Hazar and Hirsbrunner, Simon D.}, journal = {European Workshop on Algorithmic Fairness (EWAF2023)}, year = {2023}, }

Ein Glossar zur Datenqualität (German)

Sedir Mohammed , Lou Brandner , Sebastian Hallensleben , Hazar Harmouch, Andreas Hauschke , Jessica Heesen , Stefanie Hildebrandt , Simon David Hirsbrunner , Julia Keselj , Philipp Mahlow , Felix Naumann , Frauke Rostalski , Anna Wilken , and Annika Wölke

Mar 2023

Die Forschung für diesen Artikel wurde gefördert durch das deutsche Bundesministerium für Arbeit und Soziales (BMAS) / The research for this article has been funded by the Federal Ministry of Labour and Social Affairs (in Germany).

Bib PDF

@manual{sedir_mohammed_2023_7702426,
  title = {Ein Glossar zur Datenqualität (German)},
  author = {Mohammed, Sedir and Brandner, Lou and Hallensleben, Sebastian and Harmouch, Hazar and Hauschke, Andreas and Heesen, Jessica and Hildebrandt, Stefanie and Hirsbrunner, Simon David and Keselj, Julia and Mahlow, Philipp and Naumann, Felix and Rostalski, Frauke and Wilken, Anna and Wölke, Annika},
  month = mar,
  year = {2023},
  note = {{Die Forschung für diesen Artikel wurde gefördert 
                     durch das deutsche Bundesministerium für Arbeit
                     und Soziales (BMAS) / The research for this
                     article has been funded by the Federal Ministry of
                     Labour and Social Affairs (in Germany).}},
  doi = {10.5281/zenodo.7702426},
}

Joint Proceedings of Workshops at the 49th International Conference on Very Large Data Bases (VLDB 2023), Vancouver, Canada, August 28 - September 1, 2023

Mar 2023

Bib PDF

@proceedings{DBLP:conf/vldb/2023w,
  editor = {Bordawekar, Rajesh and Cappiello, Cinzia and Efthymiou, Vasilis and Ehrlinger, Lisa and Gadepally, Vijay and Galhotra, Sainyam and Geisler, Sandra and Groppe, Sven and Gruenwald, Le and Halevy, Alon Y. and Harmouch, Hazar and Hassanzadeh, Oktie and Ilyas, Ihab F. and Jiménez-Ruiz, Ernesto and Krishnan, Sanjay and Lahiri, Tirthankar and Li, Guoliang and Lu, Jiaheng and Mauerer, Wolfgang and Minhas, Umar Farooq and Naumann, Felix and Özsu, M. Tamer and Rezig, El Kindi and Srinivas, Kavitha and Stonebraker, Michael and Valluri, Satyanarayana R. and Vidal, MariaEsther and Wang, Haixun and Wang, Jiannan and Wu, Yingjun and Xue, Xun and Zait, Mohamed and Zeng, Kai},
  title = {Joint Proceedings of Workshops at the 49th International Conference
                    on Very Large Data Bases {(VLDB} 2023), Vancouver, Canada, August
                    28 - September 1, 2023},
  series = {{CEUR} Workshop Proceedings},
  volume = {3462},
  publisher = {CEUR-WS.org},
  year = {2023},
  urn = {urn:nbn:de:0074-3462-7},
  timestamp = {Mon, 28 Aug 2023 17:23:07 +0200},
  biburl = {https://dblp.org/rec/conf/vldb/2023w.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org},
}

2021

Relational Header Discovery using Similarity Search in a Table Corpus.

Hazar Harmouch, Thorsten Papenbrock , and Felix Naumann

Proceedings of the International Conference on Data Engineering (ICDE), Mar 2021

Abs Bib PDF

Column headers are among the most relevant types of meta-data for relational tables, because they provide meaning and context in which the data is to be interpreted. Headers play an important role in many data integration, exploration, and cleaning scenarios, such as schema matching, knowledge base augmentation, and similarity search. Unfortunately, in many cases column headers are missing, because they were never defined properly, are meaningless, or have been lost during data extraction, transmission, or storage. For example, around one third of the tables on the Web have missing headers. Missing headers leave abundant tabular data shrouded and inaccessible to many data-driven applications.We introduce a fully automated, multi-phase system that discovers table column headers for cases where headers are missing, meaningless, or unrepresentative for the column values. It leverages existing table headers from web tables to suggest human-understandable, representative, and consistent headers for any target table. We evaluate our system on tables extracted from Wikipedia. Overall, 60% of the automatically discovered table headers are exact and complete. Considering more header candidates, top-5 for example, increases this percentage to 72%.
@article{harmouch2019coherance, title = {Relational Header Discovery using Similarity Search in a Table Corpus.}, author = {Harmouch, Hazar and Papenbrock, Thorsten and Naumann, Felix}, journal = {Proceedings of the International Conference on Data Engineering (ICDE)}, volume = {}, number = {}, pages = {444--455}, year = {2021}, doi = {10.1109/ICDE51399.2021.00045}, }

2020

Single-column data profiling

Hazar Harmouch

University of Potsdam, Germany , Mar 2020

Abs Bib PDF

The research area of data profiling consists of a large set of methods and processes to examine a given dataset and determine metadata about it. Typically, different data profiling tasks address different kinds of metadata, comprising either various statistics about individual columns (Single-column Analysis) or relationships among them (Dependency Discovery). Among the basic statistics about a column are data type, header, the number of unique values (the column’s cardinality), maximum and minimum values, the number of null values, and the value distribution. Dependencies involve, for instance, functional dependencies (FDs), inclusion dependencies (INDs), and their approximate versions. Data profiling has a wide range of conventional use cases, namely data exploration, cleansing, and integration. The produced metadata is also useful for database management and schema reverse engineering. Data profiling has also more novel use cases, such as big data analytics. The generated metadata describes the structure of the data at hand, how to import it, what it is about, and how much of it there is. Thus, data profiling can be considered as an important preparatory task for many data analysis and mining scenarios to assess which data might be useful and to reveal and understand a new dataset’s characteristics. In this thesis, the main focus is on the single-column analysis class of data profiling tasks. We study the impact and the extraction of three of the most important metadata about a column, namely the cardinality, the header, and the number of null values. First, we present a detailed experimental study of twelve cardinality estimation algorithms. We classify the algorithms and analyze their efficiency, scaling far beyond the original experiments and testing theoretical guarantees. Our results highlight their trade-offs and point out the possibility to create a parallel or a distributed version of these algorithms to cope with the growing size of modern datasets. Then, we present a fully automated, multi-phase system to discover human-understandable, representative, and consistent headers for a target table in cases where headers are missing, meaningless, or unrepresentative for the column values. Our evaluation on Wikipedia tables shows that 60% of the automatically discovered schemata are exact and complete. Considering more schema candidates, top-5 for example, increases this percentage to 72%. Finally, we formally and experimentally show the ghost and fake FDs phenomenon caused by FD discovery over datasets with missing values. We propose two efficient scores, probabilistic and likelihood-based, for estimating the genuineness of a discovered FD. Our extensive set of experiments on real-world and semi-synthetic datasets show the effectiveness and efficiency of these scores
@phdthesis{Harmouch20, author = {Harmouch, Hazar}, title = {Single-column data profiling}, school = {University of Potsdam, Germany}, year = {2020}, url = {https://publishup.uni-potsdam.de/frontdoor/index/index/docId/47455}, urn = {urn:nbn:de:kobv:517-opus4-474554}, timestamp = {Sat, 17 Jul 2021 09:02:38 +0200}, biburl = {https://dblp.org/rec/phd/dnb/Harmouch20.bib}, bibsource = {dblp computer science bibliography, https://dblp.org}, }

2019

Inclusion Dependency Discovery:An Experimental Evaluation of Thirteen Algorithms.

Falco Dürsch , Axel Stebner , Fabian Windheuser , Maxi Fischer , Tim Friedrich , Nils Strelow , Tobias Bleifuß , Hazar Harmouch, Lan Jiang , Thorsten Papenbrock , and Felix Naumann

Proceedings of the International Conference on Information and Knowledge Management (CIKM), Mar 2019

Abs Bib PDF

Inclusion dependencies are an important type of metadata in relational databases, because they indicate foreign key relationships and serve a variety of data management tasks, such as data linkage, query optimization, and data integration. The discovery of inclusion dependencies is, therefore, a well-studied problem and has been addressed by many algorithms. Each of these discovery algorithms follows its own strategy with certain strengths and weaknesses, which makes it difficult for data scientists to choose the optimal algorithm for a given profiling task. This paper summarizes the different state-of-the-art discovery approaches and discusses their commonalities. For evaluation purposes, we carefully re-implemented the thirteen most popular discovery algorithms and discuss their individual properties. Our extensive evaluation on several real-world and synthetic datasets shows the unbiased performance of the different discovery approaches and, hence, provides a guideline on when and where each approach works best. Comparing the different runtimes and scalability graphs, we identify the best approaches for certain situations and demonstrate where certain algorithms fail.
@article{ind2019, title = {Inclusion Dependency Discovery:An Experimental Evaluation of Thirteen Algorithms.}, author = {D\"ursch, Falco and Stebner, Axel and Windheuser, Fabian and Fischer, Maxi and Friedrich, Tim and Strelow, Nils and Bleifu\ss, Tobias and Harmouch, Hazar and Jiang, Lan and Papenbrock, Thorsten and Naumann, Felix}, journal = {Proceedings of the International Conference on Information and Knowledge Management (CIKM)}, volume = {}, number = {}, pages = {}, year = {2019}, publisher = {}, }

2018

Discovery of genuine functional dependencies from relational data with missing values

Laure Berti-Equille , Hazar Harmouch, Felix Naumann , Noël Novelli , and Saravanan Thirumuruganathan

PVLDB, Mar 2018

Abs Bib PDF

Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs could not be detected precisely due to missing values or some non-genuine FDs can be discovered even though they are caused by missing values with a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
@article{berti2018discovery, title = {Discovery of genuine functional dependencies from relational data with missing values}, author = {Berti-Equille, Laure and Harmouch, Hazar and Naumann, Felix and Novelli, No\"el and Thirumuruganathan, Saravanan}, journal = {PVLDB}, volume = {11}, number = {8}, pages = {880--892}, year = {2018}, publisher = {VLDB Endowment}, }

2017

Cardinality estimation: an experimental survey

Hazar Harmouch, and Felix Naumann

PVLDB, Mar 2017

Abs Bib PDF

Data preparation and data profiling comprise many both basic and complex tasks to analyze a dataset at hand and extract metadata, such as data distributions, key candidates, and functional dependencies. Among the most important types of metadata is the number of distinct values in a column, also known as the zeroth-frequency moment. Cardinality estimation itself has been an active research topic in the past decades due to its many applications. The aim of this paper is to review the literature of cardinality estimation and to present a detailed experimental study of twelve algorithms, scaling far beyond the original experiments. First, we outline and classify approaches to solve the problem of cardinality estimation - we describe their main idea, error-guarantees, advantages, and disadvantages. Our experimental survey then compares the performance all twelve cardinality estimation algorithms. We evaluate the algorithms’ accuracy, runtime, and memory consumption using synthetic and real-world datasets. Our results show that different algorithms excel in different in categories, and we highlight their trade-offs.
@article{harmouch2017cardinality, title = {Cardinality estimation: an experimental survey}, author = {Harmouch, Hazar and Naumann, Felix}, journal = {PVLDB}, volume = {11}, number = {4}, pages = {499--512}, year = {2017}, publisher = {VLDB Endowment}, }

2016

Data Anamnesis: Admitting Raw Data into an Organization.

Sebastian Kruse , Thorsten Papenbrock , Hazar Harmouch, and Felix Naumann

IEEE Data Engineering Bulletin, Mar 2016

Abs Bib PDF

Today’s internet offers a plethora of openly available datasets, bearing great potential for novel applications and research. Likewise, rich datasets slumber within organizations. However, all too often those datasets are available only as raw dumps and lack proper documentation or even a schema. Data anamnesis is the first step of any effort to work with such datasets: It determines fundamental properties regarding the datasets’ content, structure, and quality to assess their utility and to put them to use appropriately. Detecting such properties is a key concern of the research area of data profiling, which has developed several viable instruments, such as data type recognition and foreign key discovery. In this article, we perform an anamnesis of the MusicBrainz dataset, an openly available and complex discographic database. In particular, we employ data profiling methods to create data summaries and then further analyze those summaries to reverse-engineer the database schema, to understand the data semantics, and to point out tangible schema quality issues. We propose two bottom-up schema quality dimensions, namely conciseness and normality, that measure the fit of the schema with its data, in contrast to a top-down approach that compares a schema with its application requirements.
@article{kruse2016data, title = {Data Anamnesis: Admitting Raw Data into an Organization.}, author = {Kruse, Sebastian and Papenbrock, Thorsten and Harmouch, Hazar and Naumann, Felix}, journal = {IEEE Data Engineering Bulletin}, volume = {39}, number = {2}, pages = {8--20}, year = {2016}, }

2015

Evaluating four of the most popular open source and free data mining tools

Ahmad Al-Khoder , and Hazar Harmouch

IJASR International Journal of Academic Scientific Research, Mar 2015

Abs Bib PDF

The ability of DM to provide predictive information derived from huge datasets became an effective tool for companies and individuals. Along with the increasing importance of this science, there was rapid increase in the number of free and open source tools developed to implement its concepts. It wouldn’t be easy to decide which tool performs the desired task better, plus we cannot rely solely on description provided by the vendor. This paper aims to evaluate four of the most popular open source and free DM tools, namely: R, RapidMiner, WEKA and KNIME to help user, developer, and researcher in choosing his preferred tool in terms of platform in use, format of data to be mined and desired output format, needed data visualization form, performance, and the intent to develop unexciting functionality. As a result, All tools under study are modular, easy to extend, and can run on cross-platforms. R is the leading in terms of range of input/output formats, and visualization types, followed by RapidMiner, KNIME, and finally WEKA. Based on the results yielded it can be conducted that WEKA outperformed the highest accuracy level and subsequently the best performance.
@article{al2015evaluating, title = {Evaluating four of the most popular open source and free data mining tools}, author = {Al-Khoder, Ahmad and Harmouch, Hazar}, journal = {IJASR International Journal of Academic Scientific Research}, volume = {3}, number = {1}, pages = {13--23}, year = {2015}, }