Publications
2024
- A Data Quality Glossary. Sedir Mohammed, Lou Therese Brandner, Felicia Burtscher, Sebastian Hallensleben, Hazar Harmouch, Andreas Hauschke, Jessica Heesen, Stefanie Hildebrandt, Simon David Hirsbrunner, Julia Keselj, Philipp Mahlow, Marie Massow, Felix Naumann, Frauke Rostalski, Anna Wilken, and Annika Wölke. Jan 2024
- Data Quality Assessment: Challenges and Opportunities. Sedir Mohammed, Hazar Harmouch, Felix Naumann, and Divesh Srivastava. arXiv preprint arXiv:2207.14529, Jan 2024
Data-oriented applications, their users, and even the law require data of high quality. Research has broken down the rather vague notion of data quality into various dimensions, such as accuracy, consistency, and reputation, to name but a few. To achieve the goal of high data quality, many tools and techniques exist to clean and otherwise improve data. Yet, systematic research on actually assessing data quality in all of its dimensions is largely absent, and with it the ability to gauge the success of any data cleaning effort. It is our vision to establish a systematic and comprehensive framework for the (numeric) assessment of data quality for a given dataset and its intended use. Such a framework must cover the various facets that influence data quality, as well as the many types of data quality dimensions. In particular, we identify five facets that serve as a foundation of data quality assessment. For each facet, we outline the challenges and opportunities that arise when trying to actually assign quality scores to data and create a data quality profile for it, along with a wide range of technologies needed for this purpose.
2023
- How Data Quality Determines AI Fairness: The Case of Automated Interviewing. Lou T. Brandner, Philipp Mahlow, Anna Wilken, Annika Wölke, Hazar Harmouch, and Simon D. Hirsbrunner. European Workshop on Algorithmic Fairness (EWAF2023), Jan 2023
Artificial Intelligence (AI) supported job interviewing, i.e., one-sided automated applicant interviews assessed by AI-based systems, presents itself as a new mainstream solution in hiring, promising to be more efficient and effective than human recruiters, but also fairer and more objective. Selecting this technology as an illustrative case, we focus on a central element in the development of fair AI: the issue of (training) data quality (DQ). Unsuitable, biased, or erroneous training data for ML models is a major source of bias in AI-based applications and therefore of potentially discriminatory, unfair outcomes. However, DQ is often cast aside as one of many technical factors contributing to the overall quality of ML-based systems; this approach runs the risk of understating its crucial relevance. We select salient issues along the technology lifecycle to take a detailed look at the interrelation of fairness and DQ, illustrating how both fairness and DQ must be understood in a broad sense, taking into account normative considerations beyond technical aspects, to facilitate desirable outcomes such as the promotion of diversity, the prevention of discrimination, and the protection of workers’ rights.
- Ein Glossar zur Datenqualität (German). Sedir Mohammed, Lou Brandner, Sebastian Hallensleben, Hazar Harmouch, Andreas Hauschke, Jessica Heesen, Stefanie Hildebrandt, Simon David Hirsbrunner, Julia Keselj, Philipp Mahlow, Felix Naumann, Frauke Rostalski, Anna Wilken, and Annika Wölke. Mar 2023. The research for this article was funded by the German Federal Ministry of Labour and Social Affairs (BMAS).
2022
- The Effects of Data Quality on Machine Learning Performance. Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Sina Noack, Hendrik Patzlaff, Felix Naumann, and Hazar Harmouch. arXiv preprint arXiv:2207.14529, Mar 2022
Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example, incomplete, erroneous or inappropriate training data can lead to unreliable models that produce ultimately poor decisions. Trustworthy AI applications require high-quality training and test data along many dimensions, such as accuracy, completeness, consistency, and uniformity. We explore empirically the relationship between six of the traditional data quality dimensions and the performance of fifteen widely used machine learning (ML) algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality. Our experiments distinguish three scenarios based on the AI pipeline steps that were fed with polluted data: polluted training data, test data, or both. We conclude the paper with an extensive discussion of our observations.
2021
- Relational Header Discovery using Similarity Search in a Table Corpus. Hazar Harmouch, Thorsten Papenbrock, and Felix Naumann. Proceedings of the International Conference on Data Engineering (ICDE), Mar 2021
Column headers are among the most relevant types of meta-data for relational tables, because they provide meaning and context in which the data is to be interpreted. Headers play an important role in many data integration, exploration, and cleaning scenarios, such as schema matching, knowledge base augmentation, and similarity search. Unfortunately, in many cases column headers are missing, because they were never defined properly, are meaningless, or have been lost during data extraction, transmission, or storage. For example, around one third of the tables on the Web have missing headers. Missing headers leave abundant tabular data shrouded and inaccessible to many data-driven applications. We introduce a fully automated, multi-phase system that discovers table column headers for cases where headers are missing, meaningless, or unrepresentative for the column values. It leverages existing table headers from web tables to suggest human-understandable, representative, and consistent headers for any target table. We evaluate our system on tables extracted from Wikipedia. Overall, 60% of the automatically discovered table headers are exact and complete. Considering more header candidates, top-5 for example, increases this percentage to 72%.
2020
- Single-column data profiling. Hazar Harmouch. University of Potsdam, Germany, Mar 2020
The research area of data profiling consists of a large set of methods and processes to examine a given dataset and determine metadata about it. Typically, different data profiling tasks address different kinds of metadata, comprising either various statistics about individual columns (Single-column Analysis) or relationships among them (Dependency Discovery). Among the basic statistics about a column are data type, header, the number of unique values (the column’s cardinality), maximum and minimum values, the number of null values, and the value distribution. Dependencies involve, for instance, functional dependencies (FDs), inclusion dependencies (INDs), and their approximate versions. Data profiling has a wide range of conventional use cases, namely data exploration, cleansing, and integration. The produced metadata is also useful for database management and schema reverse engineering. Data profiling also has more novel use cases, such as big data analytics. The generated metadata describes the structure of the data at hand, how to import it, what it is about, and how much of it there is. Thus, data profiling can be considered an important preparatory task for many data analysis and mining scenarios to assess which data might be useful and to reveal and understand a new dataset’s characteristics. In this thesis, the main focus is on the single-column analysis class of data profiling tasks. We study the impact and the extraction of three of the most important types of metadata about a column, namely the cardinality, the header, and the number of null values. First, we present a detailed experimental study of twelve cardinality estimation algorithms. We classify the algorithms and analyze their efficiency, scaling far beyond the original experiments and testing theoretical guarantees. Our results highlight their trade-offs and point out the possibility of creating a parallel or a distributed version of these algorithms to cope with the growing size of modern datasets.
Then, we present a fully automated, multi-phase system to discover human-understandable, representative, and consistent headers for a target table in cases where headers are missing, meaningless, or unrepresentative for the column values. Our evaluation on Wikipedia tables shows that 60% of the automatically discovered schemata are exact and complete. Considering more schema candidates, top-5 for example, increases this percentage to 72%. Finally, we formally and experimentally show the ghost and fake FDs phenomenon caused by FD discovery over datasets with missing values. We propose two efficient scores, probabilistic and likelihood-based, for estimating the genuineness of a discovered FD. Our extensive set of experiments on real-world and semi-synthetic datasets shows the effectiveness and efficiency of these scores.
2019
- Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms. Falco Dürsch, Axel Stebner, Fabian Windheuser, Maxi Fischer, Tim Friedrich, Nils Strelow, Tobias Bleifuß, Hazar Harmouch, Lan Jiang, Thorsten Papenbrock, and Felix Naumann. Proceedings of the International Conference on Information and Knowledge Management (CIKM), Mar 2019
Inclusion dependencies are an important type of metadata in relational databases, because they indicate foreign key relationships and serve a variety of data management tasks, such as data linkage, query optimization, and data integration. The discovery of inclusion dependencies is, therefore, a well-studied problem and has been addressed by many algorithms. Each of these discovery algorithms follows its own strategy with certain strengths and weaknesses, which makes it difficult for data scientists to choose the optimal algorithm for a given profiling task. This paper summarizes the different state-of-the-art discovery approaches and discusses their commonalities. For evaluation purposes, we carefully re-implemented the thirteen most popular discovery algorithms and discuss their individual properties. Our extensive evaluation on several real-world and synthetic datasets shows the unbiased performance of the different discovery approaches and, hence, provides a guideline on when and where each approach works best. Comparing the different runtimes and scalability graphs, we identify the best approaches for certain situations and demonstrate where certain algorithms fail.
2018
- Discovery of genuine functional dependencies from relational data with missing values. Laure Berti-Equille, Hazar Harmouch, Felix Naumann, Noël Novelli, and Saravanan Thirumuruganathan. PVLDB, Mar 2018
Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs cannot be detected precisely because of missing values, while some non-genuine FDs are discovered even though they are merely caused by missing values under a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This score can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
2017
- Cardinality estimation: an experimental survey. Hazar Harmouch and Felix Naumann. PVLDB, Mar 2017
Data preparation and data profiling comprise many basic and complex tasks to analyze a dataset at hand and extract metadata, such as data distributions, key candidates, and functional dependencies. Among the most important types of metadata is the number of distinct values in a column, also known as the zeroth-frequency moment. Cardinality estimation itself has been an active research topic in the past decades due to its many applications. The aim of this paper is to review the literature of cardinality estimation and to present a detailed experimental study of twelve algorithms, scaling far beyond the original experiments. First, we outline and classify approaches to solve the problem of cardinality estimation: we describe their main idea, error guarantees, advantages, and disadvantages. Our experimental survey then compares the performance of all twelve cardinality estimation algorithms. We evaluate the algorithms’ accuracy, runtime, and memory consumption using synthetic and real-world datasets. Our results show that different algorithms excel in different categories, and we highlight their trade-offs.
2016
- Data Anamnesis: Admitting Raw Data into an Organization. Sebastian Kruse, Thorsten Papenbrock, Hazar Harmouch, and Felix Naumann. IEEE Data Engineering Bulletin, Mar 2016
Today’s internet offers a plethora of openly available datasets, bearing great potential for novel applications and research. Likewise, rich datasets slumber within organizations. However, all too often those datasets are available only as raw dumps and lack proper documentation or even a schema. Data anamnesis is the first step of any effort to work with such datasets: It determines fundamental properties regarding the datasets’ content, structure, and quality to assess their utility and to put them to use appropriately. Detecting such properties is a key concern of the research area of data profiling, which has developed several viable instruments, such as data type recognition and foreign key discovery. In this article, we perform an anamnesis of the MusicBrainz dataset, an openly available and complex discographic database. In particular, we employ data profiling methods to create data summaries and then further analyze those summaries to reverse-engineer the database schema, to understand the data semantics, and to point out tangible schema quality issues. We propose two bottom-up schema quality dimensions, namely conciseness and normality, that measure the fit of the schema with its data, in contrast to a top-down approach that compares a schema with its application requirements.
2015
- Evaluating four of the most popular open source and free data mining tools. Ahmad Al-Khoder and Hazar Harmouch. IJASR International Journal of Academic Scientific Research, Mar 2015
The ability of data mining (DM) to provide predictive information derived from huge datasets has made it an effective tool for companies and individuals. Along with the increasing importance of this field, the number of free and open source tools developed to implement its concepts has increased rapidly. It is not easy to decide which tool performs a desired task best, and one cannot rely solely on the description provided by the vendor. This paper evaluates four of the most popular open source and free DM tools, namely R, RapidMiner, WEKA, and KNIME, to help users, developers, and researchers choose their preferred tool in terms of the platform in use, the format of the data to be mined, the desired output format, the needed form of data visualization, performance, and the intent to develop functionality that does not yet exist. All tools under study are modular, easy to extend, and cross-platform. R leads in terms of the range of input/output formats and visualization types, followed by RapidMiner, KNIME, and finally WEKA. Based on the results, it can be concluded that WEKA achieved the highest accuracy level and, consequently, the best performance.