Format:
1 online resource (ii, 97 pages, 6627 KB), illustrations, diagrams
Content:
Successfully completing any data science project demands careful consideration across its whole process. Although the focus is often put on the later phases of the process, in practice experts spend more time in the earlier phases, preparing data to make them consistent with the systems' requirements or to improve their models' accuracy. Duplicate detection is typically applied during the data cleaning phase, which is dedicated to removing data inconsistencies and improving the overall quality and usability of the data. While data cleaning involves a plethora of approaches for specific operations, such as schema alignment and data normalization, the task of detecting and removing duplicate records is particularly challenging. Duplicates arise when multiple records representing the same entity exist in a database, for numerous reasons ranging from simple typographical errors to the differing schemas and formats of integrated databases. Keeping a database free of duplicates is crucial for most use cases, as their existence causes ...
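As a concrete illustration of the duplicate problem described in the abstract, the following is a minimal sketch of pairwise, token-based duplicate detection using Jaccard similarity. It is a hypothetical example for intuition only, not the method developed in the dissertation; the records, the threshold of 0.5, and all function names are illustrative assumptions.

```python
from itertools import combinations

def normalize(record):
    """Illustrative normalization: lowercase, drop commas, split into tokens."""
    return set(record.lower().replace(",", " ").split())

def jaccard(a, b):
    """Jaccard similarity of two token sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def find_duplicates(records, threshold=0.5):
    """Return index pairs whose token overlap reaches the threshold.

    This naive all-pairs comparison is quadratic; real duplicate
    detection systems add blocking or indexing to scale.
    """
    tokens = [normalize(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if jaccard(tokens[i], tokens[j]) >= threshold
    ]

records = [
    "John Smith, 12 Main St, Potsdam",
    "Jon Smith, 12 Main Street, Potsdam",  # typographical variant of record 0
    "Jane Doe, 5 Oak Ave, Berlin",
]
print(find_duplicates(records))  # → [(0, 1)]
```

The typographical variants ("John"/"Jon", "St"/"Street") keep the first two records from matching exactly, which is precisely why similarity-based comparison, rather than exact equality, is needed for this task.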
Note:
Dissertation, Universität Potsdam, 2020
Language:
English
Keywords:
Hochschulschrift
DOI:
10.25932/publishup-48913
URN:
urn:nbn:de:kobv:517-opus4-489131
URL:
https://d-nb.info/1225792576/34
Author information:
Naumann, Felix 1971-
Author information:
Ritter, Norbert