rel-arxiv: arXiv citations database
Database Description: arXiv-physics is a large-scale relational dataset of physics research papers published on arXiv between 2018 and 2023, designed for relational deep learning and scholarly network analysis. It contains over 222,000 papers, 143,000 authors, 53 hierarchical physics categories, and more than 1.5 million directed citation links, organized in a normalized relational schema capturing papers, authorship, categories, and citations.
Database Statistics:
| Statistic | Value |
| --- | --- |
| Number of tables | 6 |
| Number of rows | 2,146,112 |
| Number of columns | 21 |
| Starting time | 2018-01-01 |
| Validation timestamp | 2022-01-01 |
| Testing timestamp | 2023-01-01 |
| Time window | 6 months |
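The timestamps in the table imply a temporal split. The following is a minimal sketch of how a row could be assigned to a split by its timestamp. The half-open boundaries and the helper name `split_of` are assumptions made for illustration, not something this page specifies.

```python
from datetime import date

# Cutoffs copied from the statistics table.
START = date(2018, 1, 1)
VAL_TIMESTAMP = date(2022, 1, 1)
TEST_TIMESTAMP = date(2023, 1, 1)


def split_of(ts: date) -> str:
    """Return the split a row with timestamp `ts` would fall into.

    Assumption: rows strictly before the validation timestamp are training
    data, rows from the validation timestamp up to (not including) the
    testing timestamp are validation data, and later rows are test data.
    """
    if ts < START:
        raise ValueError(f"{ts} precedes the dataset start {START}")
    if ts < VAL_TIMESTAMP:
        return "train"
    if ts < TEST_TIMESTAMP:
        return "val"
    return "test"


for day in (date(2019, 6, 30), date(2022, 1, 1), date(2023, 3, 15)):
    print(day, split_of(day))
```

Splitting by time rather than at random keeps information from the prediction window out of the training data.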
Database schema:

To load this relational database in RelBench, do:
from relbench.datasets import get_dataset
dataset = get_dataset("rel-arxiv")
Dataset License: CC BY 4.0.
Entity Classification Tasks
paper-citation
Task Description: For each paper, predict whether it will receive at least one citation in the next 6 months.
Evaluation metric: AUROC
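As an illustration of the task definition, the sketch below derives the binary label from a toy citation list and scores made-up predictions with AUROC. The tuple layout, the 182-day approximation of six months, the half-open window, and all ids and scores are assumptions; this is not the RelBench implementation.

```python
from datetime import date, timedelta

# Assumption: the 6-month window is approximated as 182 days.
WINDOW = timedelta(days=182)


def citation_labels(paper_ids, citations, seed):
    """Label each paper 1 if it is cited in [seed, seed + WINDOW), else 0.

    `citations` is an iterable of (citing_id, cited_id, citation_date) tuples.
    """
    end = seed + WINDOW
    cited = {dst for _, dst, when in citations if seed <= when < end}
    return {pid: int(pid in cited) for pid in paper_ids}


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half. Quadratic in the number of examples, so only
    suitable for small illustrations.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical ids and dates).
papers = ["p1", "p2", "p3", "p4"]
citations = [
    ("p9", "p1", date(2022, 2, 10)),  # inside the window
    ("p8", "p2", date(2021, 12, 1)),  # before the seed time, ignored
    ("p7", "p3", date(2022, 9, 1)),   # after the window, ignored
    ("p6", "p4", date(2022, 6, 30)),  # inside the window
]
labels = citation_labels(papers, citations, seed=date(2022, 1, 1))
scores = {"p1": 0.9, "p2": 0.2, "p3": 0.6, "p4": 0.4}  # made-up model outputs

y_true = [labels[p] for p in papers]
y_score = [scores[p] for p in papers]
print(labels)                  # {'p1': 1, 'p2': 0, 'p3': 0, 'p4': 1}
print(auroc(y_true, y_score))  # 0.75
```

AUROC depends only on how the scores rank the papers, so it needs no decision threshold.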
author-category
Task Description: For each author, predict their primary research category, i.e., the category in which they will publish the most papers in the next 6 months.
Evaluation metric: Multiclass F1
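A minimal sketch of how such a label could be derived and scored follows. This page does not say how ties are broken or which F1 averaging is used, so the alphabetical tie-break and macro averaging below are assumptions, as are the 182-day window and the toy records.

```python
from collections import Counter
from datetime import date, timedelta

WINDOW = timedelta(days=182)  # assumption: 6 months approximated as 182 days


def primary_categories(publications, seed):
    """Map each author to their most frequent category in [seed, seed + WINDOW).

    `publications` is an iterable of (author_id, category, publication_date).
    Assumption: ties go to the alphabetically first category. Authors with
    no papers in the window get no label.
    """
    end = seed + WINDOW
    counts = {}
    for author, category, when in publications:
        if seed <= when < end:
            counts.setdefault(author, Counter())[category] += 1
    return {
        author: min(c, key=lambda cat: (-c[cat], cat))
        for author, c in counts.items()
    }


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    total = 0.0
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / len(classes)


# Toy records (hypothetical authors; category names are illustrative).
publications = [
    ("a1", "hep-th", date(2022, 2, 1)),
    ("a1", "hep-th", date(2022, 3, 1)),
    ("a1", "gr-qc", date(2022, 4, 1)),
    ("a2", "quant-ph", date(2022, 5, 1)),
    ("a2", "cond-mat", date(2021, 6, 1)),  # before the seed time, ignored
    ("a3", "gr-qc", date(2022, 1, 15)),
    ("a3", "astro-ph", date(2022, 1, 20)),  # tie, resolved alphabetically
]
truth = primary_categories(publications, seed=date(2022, 1, 1))
predicted = {"a1": "hep-th", "a2": "quant-ph", "a3": "gr-qc"}  # made-up outputs

authors = sorted(truth)
print(truth)
print(macro_f1([truth[a] for a in authors], [predicted[a] for a in authors]))  # 0.5
```

Macro averaging weights every category equally. If the benchmark uses micro or weighted F1 instead, the score on imbalanced categories will differ.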
Entity Regression Tasks
author-publication
Task Description: For each author, predict how many papers they will publish in the next 6 months.
Evaluation metric: MAE
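The regression target can be sketched as a per-author count over the prediction window, scored with mean absolute error. The record layout, the 182-day window, and the toy values are assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta

WINDOW = timedelta(days=182)  # assumption: 6 months approximated as 182 days


def publication_counts(author_ids, authorships, seed):
    """Count each author's papers dated in [seed, seed + WINDOW).

    `authorships` is an iterable of (author_id, paper_id, publication_date).
    Authors without papers in the window get a count of 0.
    """
    end = seed + WINDOW
    counts = Counter(a for a, _, when in authorships if seed <= when < end)
    return {a: counts[a] for a in author_ids}


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy records (hypothetical ids and dates).
authors = ["a1", "a2", "a3"]
authorships = [
    ("a1", "p1", date(2022, 2, 1)),
    ("a1", "p2", date(2022, 5, 1)),
    ("a2", "p2", date(2022, 5, 1)),   # co-authored paper counts for both
    ("a1", "p3", date(2022, 12, 1)),  # after the window, ignored
    ("a3", "p0", date(2021, 11, 1)),  # before the seed time, ignored
]
targets = publication_counts(authors, authorships, seed=date(2022, 1, 1))
predictions = {"a1": 1.5, "a2": 1.0, "a3": 0.5}  # made-up model outputs

print(targets)  # {'a1': 2, 'a2': 1, 'a3': 0}
print(mean_absolute_error([targets[a] for a in authors],
                          [predictions[a] for a in authors]))
```

Keeping authors with zero papers in the window matters here: dropping them would bias the target toward active authors.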
Link Prediction Tasks
paper-paper-cocitation
Task Description: For each paper, predict which other papers will be co-cited with it (i.e., cited in the same reference list) in the next 6 months.
Evaluation metric: MAP
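To make the co-citation definition concrete, the sketch below groups citations in the window by citing paper, pairs up the papers that share a reference list, and scores ranked predictions with mean average precision. It is a sketch under stated assumptions: the tuple layout, the 182-day window, and the MAP variant (no rank cutoff, normalised by the number of relevant papers) are all chosen for illustration and may differ from the benchmark's definition.

```python
from collections import defaultdict
from datetime import date, timedelta
from itertools import permutations

WINDOW = timedelta(days=182)  # assumption: 6 months approximated as 182 days


def cocitation_targets(citations, seed):
    """Map each paper to the papers co-cited with it in [seed, seed + WINDOW).

    `citations` is an iterable of (citing_id, cited_id, citation_date).
    Two papers are co-cited when the same citing paper references both.
    """
    end = seed + WINDOW
    reference_lists = defaultdict(set)
    for citing, cited, when in citations:
        if seed <= when < end:
            reference_lists[citing].add(cited)
    targets = defaultdict(set)
    for refs in reference_lists.values():
        for a, b in permutations(refs, 2):
            targets[a].add(b)
    return dict(targets)


def average_precision(ranked, relevant):
    """Mean of precision@rank taken at each rank that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, targets):
    """Average the per-paper average precision over papers with targets."""
    aps = [average_precision(rankings.get(p, []), rel) for p, rel in targets.items()]
    return sum(aps) / len(aps)


# Toy data (hypothetical ids and dates).
citations = [
    ("c1", "p1", date(2022, 2, 1)),
    ("c1", "p2", date(2022, 2, 1)),
    ("c1", "p3", date(2022, 2, 1)),
    ("c2", "p1", date(2022, 3, 1)),
    ("c2", "p4", date(2022, 3, 1)),
    ("c3", "p2", date(2023, 2, 1)),  # after the window, ignored
    ("c3", "p5", date(2023, 2, 1)),  # after the window, ignored
]
targets = cocitation_targets(citations, seed=date(2022, 1, 1))
rankings = {  # made-up ranked predictions, best first
    "p1": ["p2", "p5", "p4", "p3"],
    "p2": ["p1", "p3"],
    "p3": ["p4", "p1", "p2"],
    "p4": ["p1"],
}
print(targets)
print(mean_average_precision(rankings, targets))
```

Co-citation is symmetric, so each pair appears in the targets of both papers; `permutations` produces both directions.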