rel-stackex: Stack Exchange Q&A Website Database

Database Description: Stack Exchange is a network of question-and-answer websites, each covering a specific topic, where questions, answers, and users are subject to a reputation award process. The reputation system allows the sites to be self-moderating. In our benchmark, we use the stats-exchange site. The database is derived from the raw data dump of 2023-09-12.

Database Statistics:

Num of tables: 7
Num of users: 333,784
Num of posts: 415,913
Num of comments: 794,597
Num of votes: 1,673,836
Num of post links: 103,969
Num of badges: 590,833
Num of post history: 1,486,886
Time range: from 2009-02-02 to 2023-09-03
Validation timestamp: 2019-01-01
Testing timestamp: 2021-01-01
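
The validation and testing timestamps above suggest a temporal split. The sketch below shows one plausible reading, assuming rows dated before the validation timestamp are used for training, rows between the two timestamps for validation, and the rest for testing. The function name `assign_split` and the boundary handling are assumptions for illustration, not RelBench's actual implementation.

```python
from datetime import date

VAL_TIMESTAMP = date(2019, 1, 1)   # validation timestamp from the statistics above
TEST_TIMESTAMP = date(2021, 1, 1)  # testing timestamp from the statistics above


def assign_split(ts: date) -> str:
    """Assumed rule: train before VAL_TIMESTAMP, val before TEST_TIMESTAMP, else test."""
    if ts < VAL_TIMESTAMP:
        return "train"
    if ts < TEST_TIMESTAMP:
        return "val"
    return "test"


# A few illustrative dates inside the database time range.
example_dates = [date(2009, 2, 2), date(2019, 1, 1), date(2020, 12, 31), date(2023, 9, 3)]
splits = [assign_split(d) for d in example_dates]
print(splits)
```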

Database schema:

To load this relational database in RelBench, do:

from relbench.datasets import get_dataset
dataset = get_dataset("rel-stackex")

References:

[1] Stack Exchange Data Dump.

Dataset License: CC BY-SA 4.0.


Predictive Tasks

rel-stackex-engage: Predict if a user will engage with the site

Task Description: Predict whether a user will engage with the site in the next 2 years, where engagement is defined as a vote, comment, or post.

Time window size: 2 years.

Entity filtering: We filter on active users, defined as users who have made at least one comment, post, or vote before the timestamp.
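
The label definition and the entity filter can be illustrated on toy data. This is a minimal sketch under stated assumptions: the activity log, the helper `engage_labels`, the 730-day approximation of "2 years", and the half-open window are all hypothetical and are not taken from the RelBench code.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=2 * 365)  # approximation of the 2-year window

# Hypothetical activity log: (user_id, time of a vote, comment, or post).
events = [
    (1, datetime(2018, 3, 1)),
    (1, datetime(2019, 6, 1)),
    (2, datetime(2017, 5, 1)),
    (3, datetime(2019, 2, 1)),
]


def engage_labels(events, timestamp):
    """Return {user_id: 0 or 1} for users active before `timestamp`.

    A user is labeled 1 if they have any event in [timestamp, timestamp + WINDOW).
    """
    active = {user for user, t in events if t < timestamp}
    engaged = {user for user, t in events if timestamp <= t < timestamp + WINDOW}
    return {user: int(user in engaged) for user in sorted(active)}


labels = engage_labels(events, datetime(2019, 1, 1))
print(labels)
```

In this toy example, user 3 is excluded because their first activity happens after the timestamp, so they do not pass the active-user filter.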

Task significance: By accurately forecasting the levels of user engagement, website administrators can effectively gauge and oversee user activity. This insight allows for well-informed choices across various business aspects. For instance, it aids in preempting and mitigating user attrition, as well as in enhancing strategies to foster increased user interaction and involvement. This predictive task serves as a crucial tool in optimizing user experience and sustaining a dynamic and engaged user base.

Machine learning task: BinaryClassification

Evaluation metric: AP
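
Assuming AP denotes average precision, the metric can be sketched as the mean of the precision values at the rank of each positive example, with examples ranked by predicted score. The function below is an illustrative standard-library version that ignores score ties. It is not the benchmark's evaluation code.

```python
def average_precision(y_true, y_score):
    """Mean of precision@rank over the ranks where a positive example appears."""
    # Rank example indices by descending predicted score.
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy labels and scores: positives land at ranks 1 and 3.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
print(round(ap, 4))
```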

To load the dataset and the split, do:

from relbench.datasets import get_dataset
dataset = get_dataset(name="rel-stackex")
task = dataset.get_task("rel-stackex-engage")
task.train_table, task.val_table, task.test_table # training/validation/testing tables

rel-stackex-votes: Predict upvotes of a question post

Task Description: Predict the number of upvotes a question post will receive in the next six months.

Time window size: 6 months.

Entity filtering: We filter on question posts that were posted within the 2 years before the timestamp. This ensures that we do not predict on old questions that have become outdated.
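
The regression target and the recency filter can be sketched on toy data as follows. The question and upvote records, the helper `upvote_targets`, and the day-based approximations of "2 years" and "6 months" are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(days=2 * 365)  # approximation of the 2-year recency filter
WINDOW = timedelta(days=182)        # approximation of the 6-month prediction window

# Hypothetical data: question post_id -> creation time, and (post_id, upvote time).
questions = {10: datetime(2018, 6, 1), 11: datetime(2015, 1, 1)}
upvotes = [
    (10, datetime(2019, 2, 1)),
    (10, datetime(2019, 3, 1)),
    (10, datetime(2019, 9, 1)),  # falls outside the 6-month window
    (11, datetime(2019, 2, 1)),  # belongs to a question that is too old
]


def upvote_targets(questions, upvotes, timestamp):
    """Count upvotes in [timestamp, timestamp + WINDOW) for recently posted questions."""
    counts = {
        post_id: 0
        for post_id, created in questions.items()
        if timestamp - LOOKBACK <= created < timestamp
    }
    for post_id, t in upvotes:
        if post_id in counts and timestamp <= t < timestamp + WINDOW:
            counts[post_id] += 1
    return counts


targets = upvote_targets(questions, upvotes, datetime(2019, 1, 1))
print(targets)
```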

Task significance: Predicting the upvotes of a question post is valuable as it empowers site managers to predict and prepare for the influx of traffic directed towards that particular post. This foresight is instrumental in making strategic business decisions, such as curating question recommendations and optimizing content visibility. Understanding which posts are likely to attract more attention helps in tailoring the user experience and managing resources effectively, ensuring that the most engaging and relevant content is highlighted to maintain and enhance user engagement.

Machine learning task: Regression

Evaluation metric: MAE
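
MAE (mean absolute error) averages the absolute difference between the predicted and the true upvote counts. A minimal standard-library sketch, with toy values chosen only for illustration:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all examples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy upvote counts and predictions: absolute errors are 1, 0, and 2.
mae = mean_absolute_error([0, 2, 5], [1, 2, 3])
print(mae)
```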

To load the dataset and the split, do:

from relbench.datasets import get_dataset
dataset = get_dataset(name="rel-stackex")
task = dataset.get_task("rel-stackex-votes")
task.train_table, task.val_table, task.test_table # training/validation/testing tables