A lot of data contains implicit hierarchical structures, e.g. type hierarchies. The property graph model – among others employed in some graph databases – provides no tools to capture those internally.
In this thesis we derive such hierarchies automatically. First a survey is conducted to find the most promising approaches that cluster a data set hierarchically. In the next step various features and vectors thereof are experimented with to extend the methodology to graphs, capturing the structure as well as possible.
We found that there is not one specific feature vector that works well for all data sets and forms of representation in a graph, but rather needs to be constructed adaptive, depending on the way data is modelled. Finally, some extensions of a specific algorithm that was used during experimentation – namely Cobweb – are discussed as well as the use case of cardinality estimation in property graph databases, leveraging the hierarchy as an associative multi-level histogram.
Table of Contents
- 1 Introduction
- 2 Background
- 2.1 The Property Graph Model
- 2.2 Cluster Analysis
- 2.2.1 Approaches
- 2.2.2 Clustering as a Search
- 2.3 Taxonomy
- 2.4 Probability Theory
- 3 Algorithms
- 3.1 Hierarchical Clustering Algorithms
- 3.1.1 Hierarchical Agglomerative Clustering
- 3.1.2 Robust Single Linkage
- 3.2 Partition-based Clustering
- 3.2.1 K-Means
- 3.2.2 TTSAS
- 3.3 Density-based Clustering
- 3.3.1 DBSCAN
- 3.3.2 OPTICS
- 3.3.3 HDBSCAN
- 3.4 Model-based Clustering - Conceptual Clustering
- 3.4.1 Cobweb
- 3.4.2 Extensions of Cobweb
- 3.5 Feature Extraction
- 3.5.1 Characteristic Sets
- 3.5.2 Recursive Feature Extraction
- 3.1 Hierarchical Clustering Algorithms
- 4 Label Inference
- 4.1 Problem Statement
- 4.2 Proposed Solution
- 4.2.1 Pre-Processing
- 4.2.1.1 Encoding Sets of Tags as Vectors
- 4.2.1.2 Feature Vector Extension
- 4.2.2 Clustering
- 4.2.3 Post-Processing: Taxonomy Extraction
- 4.2.1 Pre-Processing
- 5 Evaluation
- 5.1 Tag-based clustering
- 5.1.1 Setup
- 5.1.1.1 Data
- 5.1.1.2 Implementation
- 5.1.2 Results
- 5.1.3 Discussion
- 5.1.1 Setup
- 5.2 Graph-aware clustering of Nodes
- 5.2.1 Setup
- 5.2.1.1 Data
- 5.2.1.2 Pre-Processing
- 5.2.1.3 Implementation
- 5.2.2 Results
- 5.2.3 Discussion
- 5.2.1 Setup
- 5.1 Tag-based clustering
Objectives and Key Themes
The main objective of this thesis is to develop a method for automatically deriving hierarchical structures from data, specifically focusing on property graph models used in graph databases. The research explores various clustering algorithms and feature extraction techniques to achieve this goal, considering the challenges posed by different data representations within graph structures.
- Automatic Hierarchy Derivation from Data
- Hierarchical Clustering Algorithms for Graph Data
- Feature Extraction for Graph-Based Clustering
- Adaptive Methodology for Different Data Representations
- Application to Cardinality Estimation in Property Graph Databases
Chapter Summaries
1 Introduction: This chapter introduces the problem of automatically deriving hierarchical structures from data, particularly within the context of property graph databases which lack inherent tools for representing such hierarchies. It lays the groundwork for the thesis by highlighting the importance of this research and outlining the approach taken.
2 Background: This chapter provides necessary background information on key concepts relevant to the thesis. It covers the property graph model, various approaches to cluster analysis (including hierarchical, partition-based, and density-based methods), taxonomy as a structured representation of hierarchies, and relevant concepts from probability theory, all of which are foundational to the proposed methods.
3 Algorithms: This chapter delves into a detailed exploration of several hierarchical, partition-based, density-based, and model-based clustering algorithms. It provides a comprehensive overview of their functionalities, strengths, and limitations, laying the theoretical foundation for the algorithm selection and adaptation processes employed in later chapters. Particular focus is given to Cobweb, a conceptual clustering algorithm, and its extensions, highlighting their potential for the problem at hand.
4 Label Inference: This chapter presents the proposed solution for automatically inferring hierarchical structures from data represented as property graphs. It details the pre-processing steps involved, including encoding sets of tags as vectors and extending the feature vectors to capture graph structure. The core of this chapter focuses on the clustering process and the subsequent post-processing steps for extracting a taxonomy from the results.
5 Evaluation: This chapter presents the results of the proposed method's evaluation through two different setups: tag-based clustering and graph-aware clustering of nodes. For each, data sets, implementation details, and results are discussed, focusing on how the selected algorithm performed and what insights they provide into the effectiveness of the proposed approach for different data characteristics and structures.
Keywords
Property graph, hierarchical clustering, feature extraction, graph databases, cardinality estimation, Cobweb, taxonomy, data representation, label inference, adaptive methodology.
Frequently Asked Questions: A Comprehensive Language Preview
What is the main topic of this document?
This document provides a comprehensive preview of a thesis focused on automatically deriving hierarchical structures from data, specifically within property graph databases. It details the methodology, algorithms used, evaluation process, and key findings.
What are the key objectives of the research?
The main objective is to develop a method for automatically creating hierarchical structures from data represented in property graph models. This involves exploring various clustering algorithms and feature extraction techniques to handle different data representations within graph structures, ultimately aiming for application in cardinality estimation within property graph databases.
What are the key themes explored in the thesis?
Key themes include automatic hierarchy derivation from data, hierarchical clustering algorithms for graph data, feature extraction for graph-based clustering, an adaptive methodology for different data representations, and the application to cardinality estimation in property graph databases.
What types of clustering algorithms are discussed?
The thesis explores several clustering algorithms, categorized as hierarchical (including hierarchical agglomerative clustering and robust single linkage), partition-based (K-Means and TTSAS), density-based (DBSCAN, OPTICS, and HDBSCAN), and model-based/conceptual clustering (Cobweb and its extensions).
What is the role of feature extraction in this research?
Feature extraction plays a crucial role in preparing the data for clustering. The thesis investigates techniques such as characteristic sets and recursive feature extraction to effectively represent the data's structure and properties for optimal clustering performance.
How does the proposed method handle different data representations?
The methodology is designed to be adaptive to different data representations within graph structures. Pre-processing steps, such as encoding sets of tags as vectors and extending feature vectors, are used to ensure the chosen clustering algorithms can effectively handle various data formats within the property graph model.
What is the proposed solution for inferring hierarchical structures?
The proposed solution involves a three-stage process: pre-processing (encoding tags as vectors and extending feature vectors), clustering using the selected algorithm (based on data characteristics), and post-processing to extract a taxonomy from the clustering results.
How is the proposed method evaluated?
The evaluation is conducted using two setups: tag-based clustering and graph-aware clustering of nodes. Each setup involves specific datasets, implementation details, and results analysis, focusing on algorithm performance and insights into the effectiveness of the proposed approach for different data characteristics and structures.
What are the key algorithms used in the proposed method?
While the specific algorithm selection depends on the data characteristics, the thesis deeply explores several algorithms, particularly focusing on Cobweb and its extensions for conceptual clustering. The final choice of algorithm is determined and justified within the context of the evaluation section.
What are the key takeaways from the evaluation?
The evaluation chapter provides a detailed analysis of the performance of the chosen clustering algorithm(s) under different conditions (tag-based vs. graph-aware clustering). This analysis highlights the strengths and limitations of the proposed method and provides insights into its applicability to various types of graph data and their properties.
What are the key words associated with this research?
Key words include property graph, hierarchical clustering, feature extraction, graph databases, cardinality estimation, Cobweb, taxonomy, data representation, label inference, and adaptive methodology.
- Quote paper
- Fabian Klopfer (Author), 2020, Label Hierarchy Inference in Property Graph Databases, Munich, GRIN Verlag, https://www.grin.com/document/1012318