rel-stackex
Stack-Exchange Q&A Website Database
Database Description: Stack Exchange is a network of question-and-answer websites, each covering a specific topic, where questions, answers, and users are subject to a reputation award process. The reputation system allows the sites to be self-moderating. This benchmark uses the stats-exchange site and is derived from the raw data dump of 2023-09-12.
Database Statistics:
Number of tables | 7 |
Number of users | 333,784 |
Number of posts | 415,913 |
Number of comments | 794,597 |
Number of votes | 1,673,836 |
Number of post links | 103,969 |
Number of badges | 590,833 |
Number of post history entries | 1,486,886 |
Time range | 2009-02-02 to 2023-09-03 |
Validation timestamp | 2019-01-01 |
Testing timestamp | 2021-01-01 |
Database schema:
To load this relational database in RelBench, do:
from relbench.datasets import get_dataset
dataset = get_dataset("rel-stackex")
Dataset License: CC BY-SA 4.0.
Predictive Tasks
rel-stackex-engage
Predict whether a user will engage with the site
Task Description: Predict whether the user will engage with the site, where engagement is defined as a vote, comment, or post, in the next 2 years.
Time window size: 2 years.
Entity filtering: We keep only active users, defined as users who have made at least one comment, post, or vote before the timestamp.
Task significance: By accurately forecasting the levels of user engagement, website administrators can effectively gauge and oversee user activity. This insight allows for well-informed choices across various business aspects. For instance, it aids in preempting and mitigating user attrition, as well as in enhancing strategies to foster increased user interaction and involvement. This predictive task serves as a crucial tool in optimizing user experience and sustaining a dynamic and engaged user base.
Machine learning task: BinaryClassification
Evaluation metric: AP (average precision)
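The task description and entity filter above can be illustrated with a minimal sketch. It is not the RelBench implementation: the toy event log, the function name `engagement_labels`, and the half-open window `[start, end)` are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical toy activity log: (user_id, kind, day) per vote, comment, or post.
events = [
    (1, "post", date(2017, 5, 1)),
    (1, "vote", date(2019, 6, 1)),
    (2, "comment", date(2018, 3, 1)),
    (3, "vote", date(2016, 1, 1)),
    (3, "post", date(2022, 2, 1)),
    (4, "post", date(2019, 8, 1)),
]


def engagement_labels(events, start, end):
    """Map each active user to 1 if they engage in [start, end), else 0."""
    # Entity filter: keep users with at least one event before `start`.
    active = {user for user, _, day in events if day < start}
    # Users with any vote, comment, or post inside the prediction window.
    engaged = {user for user, _, day in events if start <= day < end}
    return {user: int(user in engaged) for user in sorted(active)}


# Two-year window starting at the validation timestamp (boundary handling assumed).
labels = engagement_labels(events, date(2019, 1, 1), date(2021, 1, 1))
print(labels)  # {1: 1, 2: 0, 3: 0}
```

User 4 has no activity before the timestamp, so the filter drops them even though they post inside the window; user 3 is active but only re-engages after the window closes, so their label is 0.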
To load the dataset and the split, do:
from relbench.datasets import get_dataset
dataset = get_dataset(name="rel-stackex")
task = dataset.get_task("rel-stackex-engage")
task.train_table, task.val_table, task.test_table # training/validation/testing tables
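For the evaluation metric of this task, the following is a small sketch assuming that AP denotes average precision over a ranked list of predictions. The function `average_precision` is written here for illustration and is not taken from RelBench; it does not treat tied scores specially, so library implementations may differ when ties occur.

```python
def average_precision(y_true, y_score):
    """Mean of precision@k over the ranks k at which a positive appears."""
    # Rank examples from highest to lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    # With no positives the metric is undefined; 0.0 is a convention chosen here.
    return sum(precisions) / len(precisions) if precisions else 0.0


# Hypothetical labels (1 = user engaged) and model scores.
y_true = [1, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(y_true, y_score)
print(round(ap, 4))  # 0.8333
```

The positives sit at ranks 1 and 3, giving precisions 1/1 and 2/3, whose mean is 5/6.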
rel-stackex-votes
Predict upvotes of a question post
Task Description: Predict the number of upvotes of a question post in the next six months.
Time window size: 6 months.
Entity filtering: We keep only question posts created within the 2 years before the timestamp. This ensures that we do not predict on old, outdated questions.
Task significance: Predicting the upvotes of a question post is valuable as it empowers site managers to predict and prepare for the influx of traffic directed towards that particular post. This foresight is instrumental in making strategic business decisions, such as curating question recommendations and optimizing content visibility. Understanding which posts are likely to attract more attention helps in tailoring the user experience and managing resources effectively, ensuring that the most engaging and relevant content is highlighted to maintain and enhance user engagement.
Machine learning task: Regression
Evaluation metric: MAE (mean absolute error)
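The regression target and the recency filter above can be sketched as follows. This is an illustration under stated assumptions, not the RelBench implementation: the toy data, the function name `upvote_targets`, and the half-open date ranges are hypothetical.

```python
from datetime import date

# Hypothetical toy data: question id -> creation day, and (question id, day) upvotes.
questions = {
    10: date(2015, 3, 1),
    11: date(2017, 9, 1),
    12: date(2018, 11, 15),
}
upvotes = [
    (10, date(2019, 2, 1)),
    (11, date(2019, 1, 20)),
    (11, date(2019, 5, 5)),
    (11, date(2019, 9, 1)),
    (12, date(2018, 12, 1)),
]


def upvote_targets(questions, upvotes, start, end, earliest_created):
    """Count upvotes in [start, end) for questions created in [earliest_created, start)."""
    # Entity filter: keep only recently created questions.
    recent = [q for q, created in questions.items() if earliest_created <= created < start]
    counts = {q: 0 for q in sorted(recent)}
    for q, day in upvotes:
        if q in counts and start <= day < end:
            counts[q] += 1
    return counts


# Six-month window from 2019-01-01; questions must be at most 2 years old.
targets = upvote_targets(
    questions,
    upvotes,
    start=date(2019, 1, 1),
    end=date(2019, 7, 1),
    earliest_created=date(2017, 1, 1),
)
print(targets)  # {11: 2, 12: 0}
```

Question 10 is older than 2 years and is filtered out; question 12 is recent but receives no upvotes inside the window, so its target is 0.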
To load the dataset and the split, do:
from relbench.datasets import get_dataset
dataset = get_dataset(name="rel-stackex")
task = dataset.get_task("rel-stackex-votes")
task.train_table, task.val_table, task.test_table # training/validation/testing tables
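For the evaluation metric of this task, the following is a small sketch assuming that MAE denotes the mean absolute error between predicted and true upvote counts. The function and the numbers are illustrative only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all examples."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true upvote counts and model predictions for three questions.
mae = mean_absolute_error([2, 0, 5], [1.5, 1.0, 3.0])
print(round(mae, 4))  # 1.1667
```

The absolute errors are 0.5, 1.0, and 2.0, so the mean is 3.5 / 3.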