Welcome to PyGraft Documentation!

PyGraft is an open-source Python library for generating synthetic yet realistic schemas and/or knowledge graphs based on user-specified parameters. The generated resources are domain-agnostic, i.e. they are not tied to a specific application field.

Being able to synthesize schemas and knowledge graphs is an important milestone for conducting research in domains where data is sensitive or not readily available. PyGraft allows researchers and practitioners to generate schemas and KGs on the fly, given only minimal knowledge of the desired specifications.

PyGraft has the following features:

  • possibility to generate a schema, a KG, or both

  • highly-tunable process based on a broad array of user-specified parameters

  • schemas and KGs are built with an extended set of RDFS and OWL constructs

  • logical consistency is ensured by the use of a DL reasoner (HermiT)

Note

This project is under active development.

Installation

Note

In order to benefit from all the functionalities PyGraft offers, you need Java to be installed and the $JAVA_HOME environment variable to be properly assigned. This is because the HermiT reasoner currently runs using Java.
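Before installing, it can help to confirm that Java is reachable. The snippet below is a minimal sketch using only the Python standard library; `java_status` is a hypothetical helper, not part of PyGraft, and it only reports what it finds without guaranteeing that HermiT will run.

```python
import os
import shutil


def java_status():
    """Report whether Java looks reachable from the current environment."""
    java_home = os.environ.get("JAVA_HOME")
    return {
        # None when the variable is unset.
        "JAVA_HOME": java_home,
        # True only if JAVA_HOME is set and points to an existing directory.
        "java_home_is_dir": bool(java_home) and os.path.isdir(java_home),
        # True if a `java` executable can be found on the PATH.
        "java_on_path": shutil.which("java") is not None,
    }


if __name__ == "__main__":
    for key, value in java_status().items():
        print(f"{key}: {value}")
```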

The latest stable version of PyGraft can be downloaded and installed from PyPI with:

$ pip install pygraft

The latest version of PyGraft can be installed directly from the source on GitHub with:

$ pip install git+https://github.com/nicolas-hbt/pygraft.git

Please note that installing PyGraft will also set up the following Python dependencies:

art
matplotlib
numpy
pyyaml
Owlready2
rdflib
tabulate
tqdm

And that’s it! You are all set! Before generating your first schemas and Knowledge Graphs (KGs), we recommend taking a look at the Overview section to get a better understanding of how PyGraft operates. Not entirely familiar with what schemas and KGs are? Consider going through the Background section. Next, you can jump to the First Steps section for a first hands-on experience with PyGraft.

Background

Schemas

A schema – e.g. an ontology – refers to an explicit specification of a conceptualization that includes concepts, properties, and restrictions within a particular domain of knowledge [Gru95]. It helps ensure consistency, clarity, and interoperability when representing and sharing knowledge. We consider schemas to be represented as a collection of concepts \(\mathcal{C}\), properties \(\mathcal{P}\), and axioms \(\mathcal{A}\), i.e. \(\mathcal{S} = \{ \mathcal{C}, \mathcal{P}, \mathcal{A}\}\).

They are typically represented using formal languages or vocabularies such as RDFS (Resource Description Framework Schema) and OWL (Web Ontology Language).

Knowledge Graphs

Regarding KGs, distinct definitions co-exist [BDPP18], [EWoss16]. In this work, we stick to the inclusive definition of Hogan et al. [HBC+21], i.e. we consider a KG to be a graph where nodes represent entities and edges represent relations between these entities. The link between schemas and KGs lies in the fact that schemas are often used to define the structure and semantics of a KG. In other words, a schema defines the vocabulary and rules that govern entities and relationships in a KG. In this view, a KG is a data graph that can be potentially enhanced with a schema [HBC+21].
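The relationship between a schema and a data graph can be sketched in a few lines of plain Python. All names below (`Person`, `livesIn`, `alice`, `conforms`) are invented for illustration, and treating the domain and range of a relation as hard constraints is a simplifying assumption rather than a description of how PyGraft or RDFS handles them.

```python
# Schema: a vocabulary of classes plus relations with a domain and a range.
# These names are illustrative only; PyGraft's own output uses identifiers
# such as C21, R47, and E1810.
classes = {"Person", "City"}
relations = {"livesIn": ("Person", "City")}  # relation -> (domain, range)

# Data graph: typed entities and (head, relation, tail) edges.
entity_types = {"alice": "Person", "paris": "City"}
triples = {("alice", "livesIn", "paris")}


def conforms(triple):
    """Return True if the triple uses a known relation and respects its domain and range."""
    head, relation, tail = triple
    if relation not in relations:
        return False
    domain, range_ = relations[relation]
    return entity_types.get(head) == domain and entity_types.get(tail) == range_


print(all(conforms(t) for t in triples))
```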

Graph Generation

Depending on their underlying generation principle, synthetic graph generators fall into three main families: stochastic-based, deep generative, and semantic-driven generators.

Stochastic-based Generators

Stochastic-based generators are usually characterized by their ability to output large graphs in a short amount of time. An early representative of this family is the Erdős–Rényi model [ERwi59], a foundational stochastic model that generates random graphs by independently assigning edges between pairs of nodes with a fixed probability. The Barabási–Albert model [ABarabasi02] is another stochastic model, which produces scale-free degree distributions. It is based on the principle of preferential attachment, where new nodes are more likely to attach to nodes with higher degrees. The R-MAT model [CZF04] is another well-known stochastic generator that produces large-scale graphs with power-law degree distributions, small-world characteristics, and community structures. More recently, TrillionG [PK17] has been presented as an extension of R-MAT. TrillionG represents nodes and edges as vectors in a high-dimensional space. It captures the structural characteristics of real-world graphs and generates synthetic graphs that mimic the properties observed in those graphs. TrillionG allows users to generate large graphs of up to trillions of edges while exhibiting lower space and time complexities than previously proposed generators.
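To make the first two models concrete, here is a minimal standard-library sketch of both. The function names and the seed-node initialisation of the preferential-attachment variant are choices made for this example; they are not taken from the cited papers or from PyGraft.

```python
import itertools
import random


def erdos_renyi(n, p, rng):
    """G(n, p): keep each of the n*(n-1)/2 possible edges independently with probability p."""
    return [(u, v) for u, v in itertools.combinations(range(n), 2) if rng.random() < p]


def barabasi_albert(n, m, rng):
    """Grow a graph node by node; each new node links to m distinct existing
    nodes, drawn with probability proportional to their current degree.
    Requires n > m >= 1."""
    edges = []
    targets = list(range(m))   # the first new node links to the m seed nodes
    endpoints = []             # every node appears here once per incident edge
    for new in range(m, n):
        edges.extend((new, t) for t in targets)
        endpoints.extend(targets)
        endpoints.extend([new] * m)
        chosen = set()
        while len(chosen) < m:  # sampling from `endpoints` is degree-proportional
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return edges


rng = random.Random(42)
print(len(erdos_renyi(100, 0.05, rng)), "edges in the Erdős–Rényi sample")
print(len(barabasi_albert(100, 2, rng)), "edges in the Barabási–Albert sample")
```

The Barabási–Albert sketch always yields m * (n - m) edges, whereas the Erdős–Rényi edge count varies from one sample to the next.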

Deep Generative Graph Generators

Another line of research revolves around the development of deep generative graph generators. These models are trained on existing graph datasets and learn to capture the underlying patterns and structures of the input graphs. Deep generative graph models are typically based on generative adversarial networks (GANs), graph neural networks (GNNs), recurrent neural networks (RNNs), or variational autoencoders (VAEs). They often take into account both the structural and attribute information of the input graphs to generate new graphs that exhibit similar properties. GraphGAN [WWW+18] leverages the GAN structure, in which the generative model receives a vertex and aims at fitting its true connectivity distribution over all other vertices, thereby producing fake samples for the discriminative model to differentiate from ground-truth samples. GraphRNN [YYR+18] is a deep autoregressive model that trains on a collection of graphs. It can be viewed as a hierarchical model adding nodes and edges in a sequential manner: a graph-level RNN maintains the state of the graph and generates new nodes, while an edge-level RNN generates the edges for each newly generated node. A representative of the VAE family of generators is NeVAE [SDJ+20], which is specifically designed for molecular graphs. NeVAE features a decoder that is able to guarantee a set of valid properties in the generated molecules.

Semantic-driven Generators

Semantic-driven synthetic generators, in contrast, incorporate schema-based constraints or external knowledge to generate graphs that exhibit specific properties or follow certain patterns relevant to the given field of application. In [GPH05], the Lehigh University Benchmark (LUBM) and the Univ-Bench Artificial data generator (UBA) are presented. The former is an ontology modelling the university domain, while the latter generates synthetic graphs based on the LUBM schema as well as user-defined queries and restrictions. Similarly, the Linked Data Benchmark Council (LDBC) [ABLarribaPey+14] released the Social Network Benchmark (SNB), which includes a graph generator for synthesizing social network data based on realistic distributions. gMark [BBC+17] has subsequently been presented as the first generator that satisfies the criteria of being domain-independent, extensible, schema-driven, and highly configurable, all at the same time. In [MP17], Melo and Paulheim focus on the synthesis of KGs for the purpose of benchmarking link prediction and type prediction tasks. The authors claim that there is a need for more diverse benchmark datasets for link prediction, with the possibility of having control over their characteristics (e.g. the number of entities, relation assertions, number of types, etc.). Therefore, Melo and Paulheim propose a synthesis approach whose outputs closely resemble real-world graphs while allowing for controlled variations in graph properties. Notably, they highlight the fact that most works focus on synthesizing KGs based on an existing schema, which leads them to formulate the desideratum of generating both a schema and a KG from scratch as a promising avenue for future work, which is what PyGraft does. Subsequently, Feng et al. [FMH+21] proposed a schema-driven graph generator based on the concept of Extended Graph Differential Dependencies (\(GDD^{x}\)), which exhibits user-specified graph patterns, node attributes and degree distributions based on the graph’s schema. The DLCC benchmark proposed in [PP22] features a synthetic KG generator based on user-specified graph and schema properties. Beyond asking for a given number of nodes, relations and degree distribution in the resulting KG, it allows for specifying a few RDFS constraints for the generation of the underpinning schema. To the best of our knowledge, this is the only prior work that allows generating both a schema and a KG. However, the DLCC benchmark is specifically designed for the node classification task. Besides, only three RDFS assertions are taken into account, and the final logical consistency of the KG is not guaranteed.

Overview

We present PyGraft, a Python-based tool that allows generating highly parametrizable, domain-agnostic schemas and KGs. Importantly, the logical consistency of these schemas and KGs is checked using the HermiT reasoner.

The contributions of PyGraft are as follows:

  • To the best of our knowledge, PyGraft is the first generator able to synthesize both schemas and KGs in a single pipeline.

  • The generated schemas and KGs are described with an extended set of RDFS and OWL constructs, allowing for both fine-grained resource descriptions and strict compliance with common Semantic Web standards.

  • A broad range of parameters can be specified by the user. These allow for creating an infinite number of graphs with different characteristics. More details on parameters can be found in the Parameters section.

From a high-level perspective, the entire PyGraft generation pipeline is depicted in Figure 1. In particular, Class and Relation Generators are initialized with user-specified parameters and used to build the schema incrementally. The logical consistency of the schema is subsequently checked using the HermiT reasoner from owlready2. If you are also interested in generating a KG based on this schema, the KG Generator is initialized with KG-related parameters and fused with the previously generated schema to sequentially build the KG. Ultimately, the logical consistency of the resulting KG is (again) assessed using HermiT.

_images/pygraft-overview.png

Figure 1: PyGraft Overview

First Steps

Once installed, PyGraft can be loaded with:

>>> import pygraft

Importantly, you can access all the functions with:

>>> pygraft.__all__
['create_template',
'create_json_template',
'create_yaml_template',
'generate_schema',
'generate_kg',
'generate']

When generating a schema and/or a KG, the output files will be stored in output/ (relative to the current directory). Under output/, each schema and its associated KG are stored in a dedicated folder. By default, the name of this folder corresponds to the value of the schema_name parameter, which can be modified in the configuration file (see below).

Generating a Schema

Let us assume we are only interested in generating a schema. We first need to retrieve the template configuration file (e.g. a .yaml configuration file), which is as simple as calling create_yaml_template():

>>> pygraft.create_yaml_template()

Now, the template has been generated under the current working directory, and is named template.yml by default. Let us inspect the file:

# GENERAL ARGS #
schema_name: template
format: xml

# SCHEMA ARGS #
## CLASSES ##
num_classes: 50
max_hierarchy_depth: 4
avg_class_depth: 2.5
class_inheritance_ratio: 2.0
avg_disjointness: 0.3
verbose: true

## RELATIONS ##
num_relations: 50
relation_specificity: 2.5
prop_profiled_relations: 0.9
profile_side: both
prop_symmetric_relations: 0.3
prop_inverse_relations: 0.3
prop_transitive_relations: 0.1
prop_asymmetric_relations: 0.0
prop_reflexive_relations: 0.3
prop_irreflexive_relations: 0.0
prop_functional_relations: 0.0
prop_inverse_functional_relations: 0.0
prop_subproperties: 0.3

# KNOWLEDGE GRAPH ARGS ##
num_entities: 3000
num_triples: 30000
fast_gen: true
oversample: false
relation_balance_ratio: 0.9
prop_untyped_entities: 0.0
avg_depth_specific_class: 2.0
multityping: false
avg_multityping: 1.5

This file contains all the tunable parameters. For more details on their meanings, please check the Parameters section.

For now, we do not plan to modify this template and stick with the default parameter values. Refer to the Advanced Usage section for more detailed examples.

Generating an ontology is made possible via the generate_schema(path) function, which only requires the relative path to the configuration file.

Note

For the following steps, i.e. generating a schema and a KG, you need Java to be installed and the $JAVA_HOME environment variable to be properly assigned. This is because the HermiT reasoner currently runs using Java.

In our case, the configuration file is named template.yml and is located in the same directory, thereby:

>>> pygraft.generate_schema("template.yml")

  ____      __   __    ____      ____         _        _____    _____
U|  _"\ u   \ \ / / U /"___|u U |  _"\ u  U  /"\  u   |" ___|  |_ " _|
\| |_) |/    \ V /  \| |  _ /  \| |_) |/   \/ _ \/   U| |_  u    | |
 |  __/     U_|"|_u  | |_| |    |  _ <     / ___ \   \|  _|/    /| |\
 |_|          |_|     \____|    |_| \_\   /_/   \_\   |_|      u |_|U
 ||>>_    .-,//|(_    _)(|_     //   \\_   \\    >>   )(\\,-   _// \\_
(__)__)    \_) (__)  (__)__)   (__)  (__) (__)  (__) (__)(_/  (__) (__)


Ontology Generated.
===================
Ontology Parameters:
===================

+-------------------------+-------+-----------------+
|      Class Metric       | Value | Specified Value |
+-------------------------+-------+-----------------+
|    Number of Classes    |  50   |       50        |
| Maximum Hierarchy Depth |   4   |        4        |
|   Average Class Depth   | 2.52  |       2.5       |
| Class Inheritance Ratio | 2.05  |       2.0       |
|  Average Disjointness   |  0.3  |       0.3       |
+-------------------------+-------+-----------------+

+-----------------------------+-------+-----------------+
|       Relation Metric       | Value | Specified Value |
+-----------------------------+-------+-----------------+
|     Number of Relations     |  50   |       50        |
|   SubProperty Proportion    | 0.32  |       0.3       |
|     Reflexive Relations     |  0.3  |       0.3       |
|    Irreflexive Relations    |  0.0  |       0.0       |
|    Functional Relations     |  0.0  |       0.0       |
| InverseFunctional Relations |  0.0  |       0.0       |
|     Symmetric Relations     |  0.3  |       0.3       |
|    Asymmetric Relations     |  0.0  |       0.0       |
|    Transitive Relations     |  0.1  |       0.1       |
|     InverseOf Relations     | 0.32  |       0.3       |
|     Profiled Relations      | 0.94  |       0.9       |
|    Relation Specificity     |  2.5  |       2.5       |
+-----------------------------+-------+-----------------+

Writing classes: 100% 50/50 [00:00<00:00, 1224.05classes/s]
Writing relations: 100% 50/50 [00:00<00:00, 1194.00relations/s]

Schema created.

Consistent schema.

The output above highlights important facts:

  • The first two tables compare the user requirements for the schema with the actual values of the generated schema. In most cases, the generated schema closely matches the user-specified parameters. Mismatches arise when users specify parameter values that conflict. For instance, asking for a max_hierarchy_depth that is lower than the avg_class_depth is not possible. Incomplete matching with the configuration file can also occur if, for example, the following parameter values are specified: num_classes = 6, max_hierarchy_depth = 3, and class_inheritance_ratio = 2.5. In this case, too many constraints have to be satisfied simultaneously, which can result in the different situations depicted in Figure 2.

  • After the schema is created, its semantic consistency is checked using the HermiT reasoner from owlready2. In this example, the schema is consistent. Note that across the several hundred schemas generated during our experiments, PyGraft did not produce any inconsistent schema.

The generated schema can be retrieved in output/template/schema.rdf. Additional files are created during the process: output/template/class_info.json and output/template/relation_info.json. These files give important information about the classes and relations of the generated schema, respectively.
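The file locations above can be derived from the schema_name value. The helper below is a hypothetical convenience function, not part of PyGraft; it assumes the default xml output format, for which the schema file carries the .rdf extension shown above.

```python
from pathlib import Path


def expected_outputs(schema_name, root="output"):
    """Return the schema-related files expected for a given schema_name."""
    folder = Path(root) / schema_name
    return {
        "schema": folder / "schema.rdf",
        "class_info": folder / "class_info.json",
        "relation_info": folder / "relation_info.json",
    }


if __name__ == "__main__":
    # Report which of the expected files are present in the current directory.
    for label, path in expected_outputs("template").items():
        print(f"{label}: {path} (exists: {path.exists()})")
```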

_images/class-trees.png

Figure 2: Potential class hierarchies for the constraints num_classes = 6, max_hierarchy_depth = 3, and class_inheritance_ratio = 2.5. The left and middle class hierarchies are built by giving priority to one parameter. The right class hierarchy is built with a best-effort strategy, without privileging any specific parameter.
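Some conflicts, such as a max_hierarchy_depth lower than avg_class_depth, can be caught before calling the generator. The following is a minimal sketch of such a pre-check. `find_conflicts` is a hypothetical helper rather than a PyGraft function, and the [0, 1] range check on prop_* parameters is an assumption based on these parameters being proportions.

```python
def find_conflicts(config):
    """Return a list of human-readable problems found in a configuration mapping."""
    problems = []
    # Conflict stated in the text: the average depth cannot exceed the maximum depth.
    if config["max_hierarchy_depth"] < config["avg_class_depth"]:
        problems.append("max_hierarchy_depth is lower than avg_class_depth")
    # Assumption: every prop_* parameter is a proportion and should lie in [0, 1].
    for key, value in config.items():
        if key.startswith("prop_") and not 0.0 <= value <= 1.0:
            problems.append(f"{key} must lie in [0, 1]")
    return problems


print(find_conflicts({"max_hierarchy_depth": 2, "avg_class_depth": 2.5}))
```

An empty list means that none of these two checks failed; it does not guarantee that all parameters can be satisfied simultaneously.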

Generating a KG

Let us now explore how to use PyGraft to generate a KG. In this section, we assume we already have a schema that will serve as a blueprint for generating our KG. We can use the same configuration file as before, since it also contains the parameters related to KG generation (which were not used earlier because we only asked for a schema):

>>> pygraft.generate_kg("template.yml")


 ______          _______                   ___
(_____ \        (_______)                 / __)   _
 _____) ) _   _  _   ___   ____  _____  _| |__  _| |_
|  ____/ | | | || | (_  | / ___)(____ |(_   __)(_   _)
| |      | |_| || |___) || |    / ___ |  | |     | |_
|_|       \__  | \_____/ |_|    \_____|  |_|      \__)
        (____/


Writing instance triples: 100% 30000/30000 [00:01<00:00, 19830.78triples/s]

Consistent KG.

And that’s it! We have now generated a KG containing 30K triples and roughly 3K distinct entities, as defined in template.yml. The generated KG can be retrieved in output/template/full_graph.rdf. It combines information inherited from output/template/schema.rdf (i.e. ontological information) with information related to individuals. Let us inspect its first few lines:

<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF
   xmlns:ns1="http://purl.org/dc/terms/"
   xmlns:owl="http://www.w3.org/2002/07/owl#"
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
   xmlns:sc="http://pygraf.t/"
>
  <rdf:Description rdf:about="http://pygraf.t/E1810">
    <rdf:type rdf:resource="http://pygraf.t/C21"/>
    <sc:R47 rdf:resource="http://pygraf.t/E622"/>
    <sc:R12 rdf:resource="http://pygraf.t/E447"/>
    <sc:R32 rdf:resource="http://pygraf.t/E761"/>
    <sc:R32 rdf:resource="http://pygraf.t/E4"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://pygraf.t/C21">
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
    <rdfs:subClassOf rdf:resource="http://pygraf.t/C37"/>
    <owl:disjointWith rdf:resource="http://pygraf.t/C11"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://pygraf.t/R47">
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#ObjectProperty"/>
    <owl:inverseOf rdf:resource="http://pygraf.t/R50"/>
  </rdf:Description>
</rdf:RDF>

The above displayed RDF graph is easily readable: for instance, entity E1810 is linked to entity E622 via the relation R47 (inverse of relation R50). We also know that E1810 is of type C21, which is both disjoint with C11 and a subclass of C37.
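The same reading can be done programmatically. The sketch below uses only the standard library and handles just the flat rdf:Description / rdf:resource layout shown above, applied to a shortened copy of the excerpt; it is not a general RDF/XML parser. For real files, a dedicated library such as rdflib (one of the listed dependencies) is the safer choice. The function name `iter_triples` is made up for this example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened copy of the excerpt above.
snippet = """<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
   xmlns:sc="http://pygraf.t/"
>
  <rdf:Description rdf:about="http://pygraf.t/E1810">
    <rdf:type rdf:resource="http://pygraf.t/C21"/>
    <sc:R47 rdf:resource="http://pygraf.t/E622"/>
  </rdf:Description>
</rdf:RDF>"""


def iter_triples(xml_text):
    """Yield (subject, predicate, object) IRIs from flat rdf:Description blocks."""
    root = ET.fromstring(xml_text)
    for description in root.findall(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for child in description:
            # ElementTree writes qualified tags as "{namespace}local".
            predicate = child.tag.replace("{", "").replace("}", "")
            yield subject, predicate, child.get(f"{{{RDF}}}resource")


for triple in iter_triples(snippet):
    print(triple)
```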

Full Pipeline Execution

In most cases, one wants to generate both a schema and a KG in a single process. PyGraft allows this with the generate(path) function, which operates just like the two functions described previously, generate_schema(path) and generate_kg(path):

>>> pygraft.generate("template.yml")


,---.   .-.   .-.  ,--,   ,---.      .--.    ,---.  _______
| .-.\   \ \_/ )/.' .'    | .-.\    / /\ \   | .-' |__   __|
| |-' )   \   (_)|  |  __ | `-'/   / /__\ \  | `-.   )| |
| |--'     ) (   \  \ ( _)|   (    |  __  |  | .-'  (_) |
| |        | |    \  `-) )| |\ \   | |  |)|  | |      | |
/(        /(_|    )\____/ |_| \)\  |_|  (_)  )\|      `-'
(__)      (__)    (__)         (__)          (__)


Ontology Generated.
===================
Ontology Parameters:
===================

+-------------------------+-------+-----------------+
|      Class Metric       | Value | Specified Value |
+-------------------------+-------+-----------------+
|    Number of Classes    |  50   |       50        |
| Maximum Hierarchy Depth |   4   |        4        |
|   Average Class Depth   | 2.52  |       2.5       |
| Class Inheritance Ratio | 1.95  |       2.0       |
|  Average Disjointness   |  0.3  |       0.3       |
+-------------------------+-------+-----------------+

+-----------------------------+-------+-----------------+
|       Relation Metric       | Value | Specified Value |
+-----------------------------+-------+-----------------+
|     Number of Relations     |  50   |       50        |
|   SubProperty Proportion    | 0.32  |       0.3       |
|     Reflexive Relations     |  0.3  |       0.3       |
|    Irreflexive Relations    |  0.0  |       0.0       |
|    Functional Relations     |  0.0  |       0.0       |
| InverseFunctional Relations |  0.0  |       0.0       |
|     Symmetric Relations     |  0.3  |       0.3       |
|    Asymmetric Relations     |  0.0  |       0.0       |
|    Transitive Relations     |  0.1  |       0.1       |
|     InverseOf Relations     | 0.32  |       0.3       |
|     Profiled Relations      | 0.91  |       0.9       |
|    Relation Specificity     | 2.47  |       2.5       |
+-----------------------------+-------+-----------------+

Writing classes: 100% 50/50 [00:00<00:00, 800.84classes/s]
Writing relations: 100% 50/50 [00:00<00:00, 1492.49relations/s]

Schema created.

Consistent schema.

Writing instance triples: 100% 30000/30000 [00:01<00:00, 28517.72triples/s]

Consistent KG.

Advanced Usage

Note

This section is under construction.

Built-in Functions

The core functions of PyGraft are reported below:

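+----------------------------+-----------------------------------------------------------------------------------+
| Function                   | Description                                                                       |
+----------------------------+-----------------------------------------------------------------------------------+
| create_template(extension) | Generate a configuration file template in the working directory                   |
| create_json_template()     | Generate a .json configuration file template in the working directory             |
| create_yaml_template()     | Generate a .yml configuration file template in the working directory              |
| generate_schema(path)      | Generate a schema based on a specified path to a configuration file               |
| generate_kg(path)          | Generate a KG based on a specified path to a configuration file                   |
| generate(path)             | Generate both a schema and a KG based on a specified path to a configuration file |
+----------------------------+-----------------------------------------------------------------------------------+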

Parameters

PyGraft allows for a broad range of user-specified parameters. Overall, these fall into three groups:

  • general parameters, seen as metadata for the generation process

  • schema parameters, governing the number of classes, relations, and how they are expected to interact

  • KG parameters, governing the number of triples, instances, and how the latter should populate the schema

All these parameters can be freely modified in the json and yaml configuration files provided as templates. For a quick example of how you can fetch the template in the current working directory and modify the parameters manually, see the Advanced Usage section.
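Parameters can also be changed programmatically rather than by hand. The sketch below is a hypothetical helper, not part of PyGraft. It assumes that the JSON template is a flat mapping with the same keys as the YAML template shown in First Steps, and the file names in the usage comment are likewise assumptions.

```python
import json
from pathlib import Path


def override_parameters(source, target, **overrides):
    """Copy a JSON configuration file, replacing the given parameter values."""
    config = json.loads(Path(source).read_text(encoding="utf-8"))
    # Refuse keys that the template does not contain, to catch typos early.
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        raise KeyError(f"not in the template: {unknown}")
    config.update(overrides)
    Path(target).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Example usage (assumed file names):
# override_parameters("template.json", "small.json",
#                     schema_name="small", num_classes=10, num_relations=10)
```

Writing to a separate target file keeps the original template untouched, which makes it easy to derive several configurations from one template.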

Metadata

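+-------------+-----------------------------------------------------+
| Parameter   | Description                                         |
+-------------+-----------------------------------------------------+
| schema_name | Which schema to use                                 |
| format      | Output format for the schema. Options: xml, ttl, nt |
+-------------+-----------------------------------------------------+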

Schema Parameters

Classes

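+-------------------------+--------------------------------+
| Parameter               | Description                    |
+-------------------------+--------------------------------+
| num_classes             | Number of classes              |
| max_hierarchy_depth     | Maximum hierarchy depth        |
| avg_class_depth         | Average class depth            |
| class_inheritance_ratio | Class inheritance ratio        |
| avg_disjointness        | Proportion of owl:disjointWith |
+-------------------------+--------------------------------+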

Relations

+-----------------------------------+------------------------------------------------------------+
| Parameter                         | Description                                                |
+-----------------------------------+------------------------------------------------------------+
| num_relations                     | Number of relations                                        |
| relation_specificity              | Relation specificity                                       |
| prop_profiled_relations           | Proportion of rdfs:domain and rdfs:range                   |
| profile_side                      | Whether profiled relations should have both a domain and a |
|                                   | range or whether they should have at least one of them     |
| prop_symmetric_relations          | Proportion of symmetric relations                          |
| prop_inverse_relations            | Proportion of owl:inverseOf                                |
| prop_transitive_relations         | Proportion of owl:TransitiveProperty                       |
| prop_asymmetric_relations         | Proportion of owl:AsymmetricProperty                       |
| prop_reflexive_relations          | Proportion of owl:ReflexiveProperty                        |
| prop_irreflexive_relations        | Proportion of owl:IrreflexiveProperty                      |
| prop_subproperties                | Proportion of rdfs:subPropertyOf                           |
| prop_functional_relations         | Proportion of owl:FunctionalProperty (not debugged)        |
| prop_inverse_functional_relations | Proportion of owl:InverseFunctionalProperty (not debugged) |
+-----------------------------------+------------------------------------------------------------+

KG Parameters

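+--------------------------+-----------------------------------------------------------------------+
| Parameter                | Description                                                           |
+--------------------------+-----------------------------------------------------------------------+
| num_entities             | Number of entities                                                    |
| num_triples              | Number of triples                                                     |
| relation_balance_ratio   | Distribution of relations across triples                              |
| prop_untyped_entities    | Proportion of untyped entities                                        |
| avg_depth_specific_class | Average depth of most specific class for all entities                 |
| multityping              | Whether entities are multi-typed                                      |
| avg_multityping          | Average number of most-specific classes that typed entities belong to |
| format                   | Output format for the final graph                                     |
+--------------------------+-----------------------------------------------------------------------+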

Execution Time

The efficiency and scalability of PyGraft are benchmarked across several schema and graph configurations. Each schema specification reported in Table 1 is paired with each graph specification from Table 2. This leads to 27 distinct combinations.

In particular, schemas from \(\mathcal{S}1\) to \(\mathcal{S}3\) are small-sized, schemas from \(\mathcal{S}4\) to \(\mathcal{S}6\) are medium-sized, and schemas from \(\mathcal{S}7\) to \(\mathcal{S}9\) are of larger sizes. For each schema size, the degree of constraint varies, as the schemas contain different levels of OWL and RDFS constructs. For example, \(\mathcal{S}1\) has fewer constraints than \(\mathcal{S}2\), which itself has fewer constraints than \(\mathcal{S}3\). Graph specifications \(\mathcal{G}1\), \(\mathcal{G}2\), and \(\mathcal{G}3\) correspond to small-sized, medium-sized and large-sized graphs, respectively.
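The pairing of schema and graph specifications can be enumerated in a few lines. The values below are transcribed from Table 1 and Table 2; the dictionary layout and variable names are choices made for this sketch and do not reflect how the benchmark was actually scripted.

```python
from itertools import product

# (number of classes and of relations, maximum depth, average class depth)
# for the small, medium and large schema sizes.
sizes = [(25, 3, 1.5), (100, 4, 2.5), (250, 5, 3.0)]
# Value shared by the cd, ref, irr, asy, sym, tra and inv columns
# at each constraint level.
levels = [0.1, 0.2, 0.3]

# S1..S3 are small, S4..S6 medium, S7..S9 large, with increasing constraints.
schemas = {
    f"S{index}": {"size": size, "level": level}
    for index, (size, level) in enumerate(product(sizes, levels), start=1)
}

# (number of entities, number of triples) per graph specification.
graphs = {"G1": (100, 1_000), "G2": (1_000, 10_000), "G3": (10_000, 100_000)}

# Every schema specification is paired with every graph specification.
runs = list(product(schemas, graphs))
print(len(runs))
```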

For these 27 unique configurations, execution times w.r.t. several dimensions are computed and shown in Figure 3. Execution times related to the schema generation are omitted as they are negligible. Experiments were conducted on a machine with two Intel Xeon E5-2650 v4 CPUs (12 cores per CPU) and 128 GB of RAM.

Table 1. Generated schemas. Column headers from left to right: number of classes, class hierarchy depth, average class depth, proportion of class disjointness (cd), number of relations, average depth of relation domains and ranges (rs), and proportions of reflexive (ref), irreflexive (irr), asymmetric (asy), symmetric (sym), transitive (tra), and inverse (inv) relations.

+------------------+-------------------+-------------------------------------+-------------------------------------+-----+-------------------+-----+-----+-----+-----+-----+-----+-----+
|                  | \(|\mathcal{C}|\) | \(\operatorname{MAX}(\mathcal{D})\) | \(\operatorname{AVG}(\mathcal{D})\) | cd  | \(|\mathcal{R}|\) | rs  | ref | irr | asy | sym | tra | inv |
+------------------+-------------------+-------------------------------------+-------------------------------------+-----+-------------------+-----+-----+-----+-----+-----+-----+-----+
| \(\mathcal{S}1\) | 25                | 3                                   | 1.5                                 | 0.1 | 25                | 1.5 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| \(\mathcal{S}2\) | 25                | 3                                   | 1.5                                 | 0.2 | 25                | 1.5 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| \(\mathcal{S}3\) | 25                | 3                                   | 1.5                                 | 0.3 | 25                | 1.5 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 |
| \(\mathcal{S}4\) | 100               | 4                                   | 2.5                                 | 0.1 | 100               | 2.5 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| \(\mathcal{S}5\) | 100               | 4                                   | 2.5                                 | 0.2 | 100               | 2.5 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| \(\mathcal{S}6\) | 100               | 4                                   | 2.5                                 | 0.3 | 100               | 2.5 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 |
| \(\mathcal{S}7\) | 250               | 5                                   | 3.0                                 | 0.1 | 250               | 3.0 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| \(\mathcal{S}8\) | 250               | 5                                   | 3.0                                 | 0.2 | 250               | 3.0 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 | 0.2 |
| \(\mathcal{S}9\) | 250               | 5                                   | 3.0                                 | 0.3 | 250               | 3.0 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 | 0.3 |
+------------------+-------------------+-------------------------------------+-------------------------------------+-----+-------------------+-----+-----+-----+-----+-----+-----+-----+

Table 2. Different graph specifications. Column headers from left to right: number of entities, number of triples, proportion of untyped entities (unt), average depth of the most specific class (asc), average number of most-specific classes per multi-typed entity (mul).

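+-------------------+-------------------+-------------------+-----+-----+-----+
|                   | \(|\mathcal{E}|\) | \(|\mathcal{T}|\) | unt | asc | mul |
+-------------------+-------------------+-------------------+-----+-----+-----+
| \(\mathcal{G}_1\) | 100               | 1,000             | 0.3 | 2.0 | 2.0 |
| \(\mathcal{G}_2\) | 1,000             | 10,000            | 0.3 | 2.0 | 2.0 |
| \(\mathcal{G}_3\) | 10,000            | 100,000           | 0.3 | 2.0 | 2.0 |
+-------------------+-------------------+-------------------+-----+-----+-----+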


_images/stacked-bar-plot.png

Figure 3: Execution time results

Bibliography

[ABarabasi02]

Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74:47–97, Jan 2002. doi:10.1103/RevModPhys.74.47.

[ABLarribaPey+14]

Renzo Angles, Peter A. Boncz, Josep Lluís Larriba-Pey, Irini Fundulaki, Thomas Neumann, Orri Erling, Peter Neubauer, Norbert Martínez-Bazan, Venelin Kotsev, and Ioan Toma. The linked data benchmark council: a graph and RDF industry benchmarking effort. SIGMOD Rec., 43(1):27–31, 2014. doi:10.1145/2627692.2627697.

[BBC+17]

Guillaume Bagan, Angela Bonifati, Radu Ciucanu, George H. L. Fletcher, Aurélien Lemay, and Nicky Advokaat. Gmark: schema-driven generation of graphs and queries. IEEE Trans. Knowl. Data Eng., 29(4):856–869, 2017. doi:10.1109/TKDE.2016.2633993.

[BDPP18]

Piero Andrea Bonatti, Stefan Decker, Axel Polleres, and Valentina Presutti. Knowledge graphs: new directions for knowledge representation on the semantic web (dagstuhl seminar 18371). Dagstuhl Reports, 8(9):29–111, 2018. doi:10.4230/DagRep.8.9.29.

[CZF04]

Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-MAT: A recursive model for graph mining. In Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22-24, 2004, 442–446. SIAM, 2004. doi:10.1137/1.9781611972740.43.

[DCK18]

Nicola De Cao and Thomas Kipf. MolGAN: An implicit generative model for small molecular graphs. ICML 2018 workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.

[EWoss16]

Lisa Ehrlinger and Wolfram Wöß. Towards a definition of knowledge graphs. In Joint Proceedings of the Posters and Demos Track of the 12th International Conference on Semantic Systems - SEMANTiCS2016 and the 1st International Workshop on Semantic Change & Evolving Semantics (SuCCESS'16) co-located with the 12th International Conference on Semantic Systems (SEMANTiCS 2016), Leipzig, Germany, September 12-15, 2016, volume 1695 of CEUR Workshop Proceedings. CEUR-WS.org, 2016.

[ERwi59]

P. Erdős and A. Rényi. On random graphs I. Publ. Math. Debrecen, 6:290–297, 1959.

[FMH+21]

Zaiwen Feng, Wolfgang Mayer, Keqing He, Selasi Kwashie, Markus Stumptner, Georg Grossmann, Rong Peng, and Wangyu Huang. A schema-driven synthetic knowledge graph generation approach with extended graph differential dependencies (GDD\(^{x}\)s). IEEE Access, 9:5609–5639, 2021. doi:10.1109/ACCESS.2020.3048186.

[GJR20]

Nikhil Goyal, Harsh Vardhan Jain, and Sayan Ranu. Graphgen: A scalable approach to domain-agnostic labeled graph generation. In WWW '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, 1253–1263. ACM / IW3C2, 2020. doi:10.1145/3366423.3380201.

[Gru95]

Thomas R. Gruber. Toward principles for the design of ontologies used for knowledge sharing? Int. J. Hum. Comput. Stud., 43(5-6):907–928, 1995. doi:10.1006/ijhc.1995.1081.

[GPH05]

Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. LUBM: A benchmark for OWL knowledge base systems. J. Web Semant., 3(2-3):158–182, 2005. doi:10.1016/j.websem.2005.06.005.

[HBC+21]

Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d'Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, Axel-Cyrille Ngonga Ngomo, Axel Polleres, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan Sequeda, Steffen Staab, and Antoine Zimmermann. Knowledge Graphs. Synthesis Lectures on Data, Semantics, and Knowledge. Morgan & Claypool Publishers, 2021. doi:10.2200/S01125ED1V01Y202109DSK022.

[MP17]

André Melo and Heiko Paulheim. Synthesizing knowledge graphs for link and type prediction benchmarking. In The Semantic Web - 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28 - June 1, 2017, Proceedings, Part I, volume 10249 of Lecture Notes in Computer Science, 136–151. 2017. doi:10.1007/978-3-319-58068-5_9.

[PTMP22]

John Palowitch, Anton Tsitsulin, Brandon Mayer, and Bryan Perozzi. Graphworld: fake graphs bring real insights for gnns. In KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, 3691–3701. ACM, 2022. doi:10.1145/3534678.3539203.

[PK17]

Himchan Park and Min-Soo Kim. Trilliong: A trillion-scale synthetic graph generator using a recursive vector model. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, 913–928. ACM, 2017. doi:10.1145/3035918.3064014.

[PP22]

Jan Portisch and Heiko Paulheim. The DLCC node classification benchmark for analyzing knowledge graph embeddings. In The Semantic Web - ISWC 2022 - 21st International Semantic Web Conference, Virtual Event, October 23-27, 2022, Proceedings, volume 13489 of Lecture Notes in Computer Science, 592–609. Springer, 2022. doi:10.1007/978-3-031-19433-7_34.

[SDJ+20]

Bidisha Samanta, Abir De, Gourhari Jana, Vicenç Gómez, Pratim Kumar Chattaraj, Niloy Ganguly, and Manuel Gomez-Rodriguez. NEVAE: A deep generative model for molecular graphs. J. Mach. Learn. Res., 21:114:1–114:33, 2020.

[SK18]

Martin Simonovsky and Nikos Komodakis. Graphvae: towards generation of small graphs using variational autoencoders. In Artificial Neural Networks and Machine Learning - ICANN 2018 - 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part I, volume 11139 of Lecture Notes in Computer Science, 412–422. Springer, 2018. doi:10.1007/978-3-030-01418-6_41.

[WWW+18]

Hongwei Wang, Jia Wang, Jialin Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Xing Xie, and Minyi Guo. Graphgan: graph representation learning with generative adversarial nets. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2508–2515. AAAI Press, 2018.

[YYR+18]

Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. Graphrnn: generating realistic graphs with deep auto-regressive models. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, 5694–5703. PMLR, 2018.

About

How to Contribute

Interested in contributing to PyGraft? Please consider reaching out:

nicolas.hubert@univ-lorraine.fr

How to Cite

If you like PyGraft, consider downloading PyGraft and starring our GitHub repository to make it known and promote its development 😊

If you use or mention PyGraft in a publication, cite our work as:

@misc{pygraft,
author= {Nicolas Hubert and
        Pierre Monnin and
        Mathieu d'Aquin and
        Armelle Brun and
        Davy Monticolo},
title = {{PyGraft: Configurable Generation of Schemas and Knowledge Graphs at Your Fingertips}},
month = sep,
year  = 2023,
doi   = {},
url   = {}
}

Contributors

License

PyGraft is licensed under the MIT License.