Skip to yearly menu bar Skip to main content

Workshop: Synthetic Data for Empowering ML Research

Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings.

Ajay Chawda · Marius Kloft · Stefanie Grimm


In this paper, we introduce Vehicle Claims datasets which belong to the category of Auditing data that includes Journals, Insurance claims, and Intrusion data for information systems. It consists of fraudulent insurance claims for automotive repairs. Insurance claims data are distinguishable from network intrusion datasets, for example, KDD data by the number of categoricalattributes. We tackle the problem of missing benchmark datasets for anomaly detection as the datasets are mostly confidential, and the public tabular datasets do not contain relevant and sufficient categorical attributes. Therefore, a large-sized dataset is created for this purpose and referred to as Vehicle Claims (VC) dataset. The dataset is evaluated on shallow and deep learning methods. Due to theintroduction of categorical attributes, we encounter the challenge of encoding them for the large dataset. As One Hot encoding of high cardinal dataset invokes the "curse of dimensionality", we experiment with GEL encoding and embedding layerfor representing categorical attributes. Our work compares competitive learning, reconstruction-error, density estimation and contrastive learning approaches for Label, One Hot, GEL encoding and embedding layer to handle categorical values.

Chat is not available.