Main repository gathering everything related to talk page analysis using the wikiconv dataset.
Some records are missing even in the originally provided dataset.
Edits and replies on a restored comment have the original comment as their parent, instead of the restored node.
YuBot messes up the dataset for the Italian Wikipedia:
- First, the bot generates a table that wikiconv isn't able to parse (addition);
- Later, it edits that table (modification);
- As a result, the dataset contains some modification actions without a parent.
The id field of each record consists of three integers following the format a.b.c, where a is the id of the revision from which the record comes.
To sort the dataset by page, you need to use five fields:
- pageId
- timestamp
- a.b.c (from the id field), where a, b and c are integers
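
The ordering above can be sketched as a sort key. This is a minimal sketch, assuming each record is a dict with the pageId, timestamp and id fields named in this README, and that timestamp values compare correctly as-is (for example ISO 8601 strings). The helper names `id_key` and `sort_key` are hypothetical. The id has to be split into integers because comparing it as a string would put 100.12.0 before 100.5.0.

```python
def id_key(record_id):
    """Split an id of the form 'a.b.c' into a tuple of three ints."""
    a, b, c = record_id.split(".")
    return int(a), int(b), int(c)


def sort_key(record):
    """Five-field key: pageId, timestamp, then a, b, c from the id."""
    return (record["pageId"], record["timestamp"], *id_key(record["id"]))


# Made-up sample records, only to illustrate the ordering.
records = [
    {"pageId": 2, "timestamp": "2010-01-01T00:00:00Z", "id": "50.0.0"},
    {"pageId": 1, "timestamp": "2010-01-02T00:00:00Z", "id": "100.12.0"},
    {"pageId": 1, "timestamp": "2010-01-02T00:00:00Z", "id": "100.5.0"},
    {"pageId": 1, "timestamp": "2010-01-01T00:00:00Z", "id": "99.0.0"},
]

ordered = sorted(records, key=sort_key)
```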
To build the graph of the actions of a page:
- Creations and additions use the replytoId field
- Modifications, deletions and restorations use the parentId field

Deletions and restorations are always leaves.
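
The rules above can be sketched as follows. This is only an illustration under stated assumptions: records are dicts, the action kind is stored in a field called type with the upper-case values shown (both are assumptions, not confirmed by this README), and `build_graph` is a hypothetical helper. Actions whose target is not in the data, such as the parentless modifications mentioned earlier, are collected separately instead of being linked.

```python
from collections import defaultdict

# Assumed values of the (assumed) "type" field.
REPLY_TYPES = {"CREATION", "ADDITION"}


def build_graph(records):
    """Return (children, orphans) for the actions of one page.

    children maps an action id to the ids of the actions pointing at it.
    orphans lists actions whose target is missing from the records.
    """
    known = {r["id"] for r in records}
    children = defaultdict(list)
    orphans = []
    for r in records:
        is_reply = r["type"] in REPLY_TYPES
        # Creations/additions follow replytoId, everything else parentId.
        target = r.get("replytoId") if is_reply else r.get("parentId")
        if target is None and is_reply:
            continue  # top-level action, nothing to link
        if target is None or target not in known:
            orphans.append(r["id"])
            continue
        children[target].append(r["id"])
    return dict(children), orphans
```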
The authorlist field contains all the users who have edited the message, as well as the original creator.
The thresholds for Perspective API used in the original paper are 0.64 and 0.92 for toxicity and severe_toxicity respectively.
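
Applying these thresholds could look like the sketch below. It assumes the scores are already available as floats under the keys toxicity and severe_toxicity, and it treats a score equal to the threshold as above it; whether the original paper uses a strict or non-strict comparison is not stated here, so that choice is an assumption. The function name `label` is hypothetical.

```python
TOXICITY_THRESHOLD = 0.64
SEVERE_TOXICITY_THRESHOLD = 0.92


def label(scores):
    """Turn Perspective-style scores into boolean flags.

    Assumption: a score equal to the threshold counts as toxic.
    """
    return {
        "toxic": scores["toxicity"] >= TOXICITY_THRESHOLD,
        "severely_toxic": scores["severe_toxicity"] >= SEVERE_TOXICITY_THRESHOLD,
    }
```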
To be considered a restore, a message needs to stay deleted for no more than two weeks.
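
As a rough sketch, the two-week rule can be checked like this, assuming the deletion and restoration times are available as datetime objects. `counts_as_restore` is a hypothetical helper, not a function from the dataset tooling.

```python
from datetime import datetime, timedelta

MAX_DELETED = timedelta(weeks=2)


def counts_as_restore(deleted_at, restored_at):
    """True if the message stayed deleted for no more than two weeks."""
    gap = restored_at - deleted_at
    return timedelta(0) <= gap <= MAX_DELETED
```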