Introduction

What is relational deep learning?

Much of the world's most valuable data is stored in data warehouses, where it is spread across many tables connected by primary-foreign key relations. However, building machine learning models on this data is both challenging and time-consuming. The core problem is that existing machine learning methods cannot learn directly from data spread across multiple relational tables.

Current methods can only learn from a single table, so the data must first be joined and aggregated into a single training table, a process known as feature engineering. Here we introduce an end-to-end deep representation learning approach that learns directly on data spread across multiple tables. We call this approach Relational Deep Learning.
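To make the single-table workflow concrete, here is a minimal sketch of manual feature engineering in plain Python. The users and orders tables, their columns, and the chosen aggregates are hypothetical and serve only as illustration; none of them come from RelBench.

```python
from collections import defaultdict

# Hypothetical toy tables: orders.user_id is a foreign key into users.user_id.
users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
]
orders = [
    {"order_id": 10, "user_id": 1, "amount": 20.0},
    {"order_id": 11, "user_id": 1, "amount": 5.0},
    {"order_id": 12, "user_id": 2, "amount": 7.5},
]


def build_training_table(users, orders):
    """Join orders onto users and aggregate them into one row per user."""
    # Group order amounts by the foreign key (the "join" step).
    amounts_by_user = defaultdict(list)
    for order in orders:
        amounts_by_user[order["user_id"]].append(order["amount"])

    table = []
    for user in users:
        amounts = amounts_by_user.get(user["user_id"], [])
        table.append({
            "user_id": user["user_id"],
            "country": user["country"],
            # Hand-picked aggregates: each one is a manual design decision.
            "num_orders": len(amounts),
            "total_amount": sum(amounts),
        })
    return table


training_table = build_training_table(users, orders)
```

Every aggregate column here has to be chosen by hand, and any signal not captured by those columns is lost to the downstream model. That is the step Relational Deep Learning aims to replace.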

The core idea is to view relational tables as a heterogeneous graph, with a node for each row in each table, and edges specified by primary-foreign key relations. Message Passing Neural Networks can then automatically learn across multiple tables to extract representations that leverage all input data, without any manual feature engineering.
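The table-to-graph view can be sketched in a few lines of plain Python. This is an illustrative sketch under assumed names, not RelBench's implementation: the users and posts tables, the owner_id column, and the tables_to_hetero_graph helper are all hypothetical.

```python
# Hypothetical toy schema: posts.owner_id is a foreign key into users.user_id.
tables = {
    "users": [{"user_id": 1}, {"user_id": 2}],
    "posts": [
        {"post_id": 100, "owner_id": 1},
        {"post_id": 101, "owner_id": 2},
        {"post_id": 102, "owner_id": 1},
    ],
}
primary_keys = {"users": "user_id", "posts": "post_id"}
# Each entry is (child table, foreign key column, parent table).
foreign_keys = [("posts", "owner_id", "users")]


def tables_to_hetero_graph(tables, primary_keys, foreign_keys):
    """Return per-type node counts and typed edge lists of (child_idx, parent_idx)."""
    # One node per row; a node's index is its row position within its table.
    index_of = {
        name: {row[primary_keys[name]]: i for i, row in enumerate(rows)}
        for name, rows in tables.items()
    }
    num_nodes = {name: len(rows) for name, rows in tables.items()}

    edges = {}
    for child, fk_column, parent in foreign_keys:
        edge_list = []
        for child_idx, row in enumerate(tables[child]):
            parent_idx = index_of[parent].get(row[fk_column])
            if parent_idx is not None:  # skip null or dangling foreign keys
                edge_list.append((child_idx, parent_idx))
        # The edge type is named after the relation that produced it.
        edges[(child, fk_column, parent)] = edge_list
    return num_nodes, edges


num_nodes, edges = tables_to_hetero_graph(tables, primary_keys, foreign_keys)
```

Node types correspond to tables and edge types to foreign key columns, so the graph is heterogeneous by construction. A message passing model would typically also add the reverse edge type so that information can flow from posts to users and back.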

Overall, we define a new research area that generalizes graph machine learning and broadens its applicability to a wide set of AI use cases.

RelBench Overview

RelBench is a set of benchmark datasets and an implementation of Relational Deep Learning. The data covers a wide spectrum of human and natural activity, and spans several orders of magnitude in size, from 1M to 1B entities.

RelBench contains data loaders for relational databases and their associated predictive tasks. The loaders handle downloading, pre-processing, and splitting of the datasets. Additionally, RelBench has standardized evaluators and leaderboards to keep track of state-of-the-art results.

Installation

To install the RelBench Python package, use the following:

pip install relbench

Package Usage

Here we describe the key functions of RelBench. RelBench provides a collection of APIs for easy access to machine-learning-ready relational databases.

For a concrete example, to obtain the rel-stackex relational database, do:

from relbench.datasets import get_dataset
dataset = get_dataset(name="rel-stackex")

Next, to retrieve the rel-stackex-votes predictive task, which is to predict the number of upvotes a post will receive in the next 2 years, do:

task = dataset.get_task("rel-stackex-votes")
task.train_table, task.val_table, task.test_table # training/validation/testing tables

The training, validation, and test tables are generated automatically using a predefined, standardized temporal split. You can then build your preferred relational deep learning model on top of them. After training and validation, you can make predictions with your model on task.test_table. If your prediction pred is an array that follows the row order of task.test_table, you can call the following to compute the unified evaluation metrics:

task.evaluate(pred)
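For intuition about what a temporal split and an order-aligned evaluation involve, here is a minimal, self-contained sketch in plain Python. It does not use or reproduce RelBench internals: the rows, the cutoff dates, the temporal_split helper, and the choice of mean absolute error as the metric are all assumptions made for illustration. With RelBench itself, the tables and task.evaluate shown above are what you would use.

```python
from datetime import date

# Hypothetical labeled rows; one row per post with the time the label applies to.
rows = [
    {"post_id": 1, "timestamp": date(2018, 3, 1), "label": 2},
    {"post_id": 2, "timestamp": date(2018, 9, 15), "label": 4},
    {"post_id": 3, "timestamp": date(2019, 6, 1), "label": 1},
    {"post_id": 4, "timestamp": date(2020, 2, 10), "label": 5},
    {"post_id": 5, "timestamp": date(2020, 11, 30), "label": 1},
]
# Illustrative cutoffs only; these are not the cutoffs RelBench uses.
val_cutoff = date(2019, 1, 1)
test_cutoff = date(2020, 1, 1)


def temporal_split(rows, val_cutoff, test_cutoff):
    """Split rows by time so that validation and test data lie after training data."""
    train = [r for r in rows if r["timestamp"] < val_cutoff]
    val = [r for r in rows if val_cutoff <= r["timestamp"] < test_cutoff]
    test = [r for r in rows if r["timestamp"] >= test_cutoff]
    return train, val, test


def mean_absolute_error(pred, table):
    """Compare pred[i] with the label of table[i]; the two orders must match."""
    if len(pred) != len(table):
        raise ValueError("pred must have one entry per row of the table")
    return sum(abs(p - r["label"]) for p, r in zip(pred, table)) / len(table)


train, val, test = temporal_split(rows, val_cutoff, test_cutoff)

# Trivial baseline: predict the training mean for every test row.
train_mean = sum(r["label"] for r in train) / len(train)
pred = [train_mean] * len(test)
test_mae = mean_absolute_error(pred, test)
```

Splitting by time rather than at random keeps future information out of training, which matches how a model would be deployed. The length check mirrors the requirement that predictions follow the row order of the test table.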

Public Leaderboards and Benchmarks

RelBench provides leaderboards for systematic model evaluation and comparison. Each task defined on a relational database constitutes a benchmark. Dataset splits and evaluation metrics reflect the real-world difficulty of relational database problems. We are currently constructing the leaderboards and expect to accept submissions in the near future.

Cite Us

If you use RelBench, please cite our paper:

@article{relbench,
  title={Relational Deep Learning: Graph Representation Learning on Relational Tables},
  author={Matthias Fey and Weihua Hu and Kexin Huang and Jan Eric Lenssen and Rishabh Ranjan and Joshua Robinson and Rex Ying and Jiaxuan You and Jure Leskovec},
  year={2023}
}

Explore RelBench Datasets

Stack-exchange

Amazon-reviews