Abstract
In this paper, we propose a novel approach to the unsupervised detection of abnormal records in tabular data. We first describe each record in a tabular dataset using a set of features and then employ a one-class support vector machine to classify records as either normal or abnormal. We then select the features that are most relevant for distinguishing normal from abnormal records and apply clustering to identify groups of records that have similar characteristics according to these features. In the final step, we use information-based measures to identify the purest abnormal clusters, providing a descriptive representation that allows a user to better understand and identify abnormal records in the dataset. We evaluate our approach on datasets from three different domains: historical birth certificates, social network posts, and COVID-19 data. This evaluation demonstrates that our approach is well suited to identifying anomalies in tabular data in an unsupervised manner and that it outperforms the baseline.
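The final step described above, ranking clusters by how purely abnormal they are, can be illustrated with a minimal sketch. The abstract does not say which information-based measure is used, so Shannon entropy is an assumption here, as is the majority threshold of 0.5. The function names and the toy data are hypothetical. The label convention (-1 for abnormal, +1 for normal) follows what scikit-learn's `OneClassSVM.predict` returns, and the cluster ids stand in for the output of a clustering step such as k-means.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(labels):
    """Shannon entropy (in bits) of the normal/abnormal labels in one cluster."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def rank_abnormal_clusters(cluster_ids, svm_labels, abnormal=-1):
    """Return (cluster_id, abnormal_fraction, entropy) tuples, purest first.

    Only clusters whose majority of records is abnormal are kept
    (the 0.5 threshold is an assumption for this sketch).
    """
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, svm_labels):
        by_cluster[cid].append(label)

    ranked = []
    for cid, labels in by_cluster.items():
        fraction = labels.count(abnormal) / len(labels)
        if fraction > 0.5:
            ranked.append((cid, fraction, cluster_entropy(labels)))

    # Lower entropy means a purer cluster; ties go to the higher abnormal fraction.
    return sorted(ranked, key=lambda r: (r[2], -r[1]))


# Hypothetical toy input: one cluster id and one one-class-SVM label per record.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
svm_labels = [1, 1, 1, -1, -1, -1, 1, -1, -1, -1]

ranking = rank_abnormal_clusters(cluster_ids, svm_labels)
for cid, fraction, entropy in ranking:
    print(f"cluster {cid}: abnormal fraction {fraction:.2f}, entropy {entropy:.3f}")
```

In this toy input, cluster 2 contains only abnormal records and is ranked first, cluster 1 is mostly abnormal and follows, and cluster 0 is dropped because it contains only normal records.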
Original language | English |
---|---|
Number of pages | 16 |
Journal | CEUR Workshop Proceedings |
Volume | 3433 |
Publication status | Published - 2023 |
Externally published | Yes |
Event | AAAI 2023 Spring Symposium on Challenges Requiring the Combination of Machine Learning and Knowledge Engineering, AAAI-MAKE 2023, San Francisco, United States (27 Mar 2023 → 29 Mar 2023) |
Keywords
- data quality enhancement
- k-means clustering
- one-class support vector machine
- unsupervised learning