rel-arxiv arXiv citations database

Database Description: arXiv-physics is a large-scale relational dataset of physics research papers published on arXiv between 2018 and 2023, designed for relational deep learning and scholarly network analysis. It contains over 222,000 papers, 143,000 authors, 53 hierarchical physics categories, and more than 1.5 million directed citation links, organized in a normalized relational schema capturing papers, authorship, categories, and citations.

Database Statistics:

Number of tables: 6
Number of rows: 2,146,112
Number of columns: 21
Starting time: 2018-01-01
Validation timestamp: 2022-01-01
Testing timestamp: 2023-01-01
Time window: 6 months
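
The validation and testing timestamps act as temporal cutoffs. As a minimal sketch, assuming the usual temporal-split convention (RelBench's own handling may differ in detail), a timestamp can be assigned to a period as follows. The function name `split_for` is hypothetical and not part of RelBench.

```python
from datetime import date

# Cutoffs copied from the statistics above.
START_TIME = date(2018, 1, 1)
VAL_TIMESTAMP = date(2022, 1, 1)
TEST_TIMESTAMP = date(2023, 1, 1)


def split_for(timestamp: date) -> str:
    """Assign a timestamp to a period by comparing it with the two cutoffs.

    Assumed convention: anything before the validation timestamp is training
    data, anything from the validation timestamp up to (but excluding) the
    testing timestamp is validation data, and the rest is test data.
    """
    if timestamp < VAL_TIMESTAMP:
        return "train"
    if timestamp < TEST_TIMESTAMP:
        return "val"
    return "test"


print(split_for(date(2021, 6, 30)))
print(split_for(date(2022, 1, 1)))
print(split_for(date(2023, 1, 1)))
```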

To load this relational database in RelBench, do:

from relbench.datasets import get_dataset

# download=True fetches the preprocessed database instead of rebuilding it locally
dataset = get_dataset("rel-arxiv", download=True)

References:

[1] arXiv-physics.

Dataset License: CC BY 4.0.


Entity Classification Tasks

paper-citation

Task Description: For each paper, predict whether it will receive at least one citation in the next 6 months.

Evaluation metric: AUROC
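
To make the label and the metric concrete, here is a minimal sketch in plain Python. It is not RelBench's implementation: the citation rows, the tuple layout, and the helper names `citation_labels` and `auroc` are hypothetical, and the 6-month window is passed in as explicit start and end dates.

```python
from datetime import date

# Hypothetical citation rows: (citing_paper_id, cited_paper_id, citation_date).
citations = [
    ("p3", "p1", date(2022, 2, 10)),
    ("p4", "p1", date(2022, 9, 1)),
    ("p4", "p2", date(2021, 11, 5)),
]


def citation_labels(paper_ids, citations, window_start, window_end):
    """Label a paper 1 if it is cited at least once in [window_start, window_end)."""
    cited = {
        cited_id
        for _, cited_id, when in citations
        if window_start <= when < window_end
    }
    return {pid: int(pid in cited) for pid in paper_ids}


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = citation_labels(["p1", "p2"], citations, date(2022, 1, 1), date(2022, 7, 1))
print(labels)
print(auroc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
```

The pairwise formulation is quadratic in the number of papers, so it is only suitable for illustration; a rank-based implementation would be used at the scale of this dataset.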

author-category

Task Description: For each author, predict the research category in which they will publish the most papers in the next 6 months.

Evaluation metric: Multiclass F1
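
A minimal sketch of how such a label could be derived, assuming each publication row carries an author, a category, and a date. The rows and the helper name `top_category` are hypothetical, and this is not RelBench's implementation. Authors with no publications in the window get no label in this sketch.

```python
from collections import Counter
from datetime import date

# Hypothetical rows: (author_id, category, publication_date).
publications = [
    ("a1", "hep-th", date(2022, 1, 15)),
    ("a1", "hep-th", date(2022, 3, 2)),
    ("a1", "gr-qc", date(2022, 4, 20)),
    ("a2", "quant-ph", date(2022, 5, 9)),
    ("a2", "cond-mat", date(2021, 12, 30)),
]


def top_category(publications, window_start, window_end):
    """Return each author's most frequent category in [window_start, window_end)."""
    counts = {}
    for author, category, when in publications:
        if window_start <= when < window_end:
            counts.setdefault(author, Counter())[category] += 1
    # most_common(1) returns a one-element list of (category, count) pairs.
    return {author: c.most_common(1)[0][0] for author, c in counts.items()}


targets = top_category(publications, date(2022, 1, 1), date(2022, 7, 1))
print(targets)
```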

Entity Regression Tasks

author-publication

Task Description: For each author, predict how many papers they will publish in the next 6 months.

Evaluation metric: MAE
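
The regression target and the metric can be sketched as follows. This is an illustration under assumed data, not RelBench's implementation: the rows and the helper names `publication_counts` and `mae` are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical rows: (author_id, publication_date).
author_papers = [
    ("a1", date(2022, 1, 15)),
    ("a1", date(2022, 3, 2)),
    ("a2", date(2022, 8, 1)),
]


def publication_counts(author_ids, author_papers, window_start, window_end):
    """Count each author's papers published in [window_start, window_end)."""
    counts = Counter(
        author
        for author, when in author_papers
        if window_start <= when < window_end
    )
    # Authors without papers in the window get an explicit count of 0.
    return {author: counts.get(author, 0) for author in author_ids}


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


counts = publication_counts(
    ["a1", "a2", "a3"], author_papers, date(2022, 1, 1), date(2022, 7, 1)
)
print(counts)
print(mae([2, 0, 0], [1, 1, 0]))
```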

Link Prediction Tasks

paper-paper-cocitation

Task Description: For each paper, predict which other papers will be co-cited with it (i.e., cited in the same reference list) in the next 6 months.

Evaluation metric: MAP
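
As a minimal sketch of the co-citation target and the ranking metric: two papers are treated as co-cited when they appear in the same reference list of a citing paper within the window. The reference lists and the helper names are hypothetical, and normalizing average precision by min(number of relevant papers, k) is one common convention that may differ from RelBench's exact definition.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical reference lists of citing papers published within the window:
# citing_paper_id -> list of cited paper ids.
reference_lists = {
    "c1": ["p1", "p2", "p3"],
    "c2": ["p1", "p4"],
}


def cocited_partners(reference_lists):
    """Map each cited paper to the set of papers sharing a reference list with it."""
    partners = defaultdict(set)
    for refs in reference_lists.values():
        for a, b in combinations(sorted(set(refs)), 2):
            partners[a].add(b)
            partners[b].add(a)
    return dict(partners)


def average_precision(ranked, relevant, k):
    """Average precision at k for one paper's ranked list of predicted partners."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank, counted only on hits
    return total / min(len(relevant), k)


partners = cocited_partners(reference_lists)
ap_p1 = average_precision(["p2", "p9", "p4"], partners["p1"], k=3)
print(partners["p1"])
print(ap_p1)
```

MAP is then the mean of these per-paper average precision values over all evaluated papers.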