User:Crispy1989
From Wikipedia, the free encyclopedia
Hi, I'm Crispy. I haven't done much with Wikipedia (I'm more of a programmer and a sysadmin), but I'm working on a new core engine for ClueBot's vandalism detection, and I'm using this page to document some of the work.
New ClueBot Vandalism Detection Engine
General Structure
Cobi and I are working together to create the architecture of the new engine. All of the Wikipedia interface code will remain the same, using Cobi's wikibot.classes PHP code. The core vandalism detection engine, however, must perform heavy computation, which a scripting language cannot do in a reasonable amount of time.
Core Engine
The engine's core will be a feed-forward artificial neural network trained with back-propagation. Given numerous examples, it will be able to learn what vandalism looks like. Currently, the examples are provided by ClueBot's existing vandalism detection heuristics, run with a higher-than-normal threshold value. The neural network code is a modified version of annutils.
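As a rough illustration of the technique named above, here is a minimal sketch of a feed-forward network with one hidden layer, trained by back-propagation on squared error. It is not the annutils code and not the engine itself: the class `TinyNetwork`, the function `train`, the layer sizes, the learning rate, and the two-value feature vectors in `EXAMPLES` are all assumptions made up for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNetwork:
    """One hidden layer, sigmoid activations, a single output in (0, 1)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # One row of weights per hidden unit; the last entry of each row is the bias.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        # Output weights, one per hidden unit, plus a trailing bias.
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, inputs):
        """Return (hidden activations, output activation)."""
        h = [sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
             for row in self.hidden]
        y = sigmoid(sum(w * a for w, a in zip(self.output, h)) + self.output[-1])
        return h, y

    def train_one(self, inputs, target, rate=0.5):
        """One back-propagation step on a single (inputs, target) pair."""
        h, y = self.forward(inputs)
        # Error signal at the output for E = 0.5 * (y - target)^2.
        delta_out = (y - target) * y * (1.0 - y)
        # Propagate the error back to the hidden layer (using pre-update weights).
        delta_hidden = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j in range(len(h)):
            self.output[j] -= rate * delta_out * h[j]
        self.output[-1] -= rate * delta_out
        for j, row in enumerate(self.hidden):
            for i, x in enumerate(inputs):
                row[i] -= rate * delta_hidden[j] * x
            row[-1] -= rate * delta_hidden[j]
        return 0.5 * (y - target) ** 2


def train(examples, epochs=5000, seed=0):
    net = TinyNetwork(n_inputs=len(examples[0][0]), n_hidden=3, seed=seed)
    for _ in range(epochs):
        for inputs, target in examples:
            net.train_one(inputs, target)
    return net


# Hypothetical, hand-made feature vectors: (inputs, 1.0 = vandalism, 0.0 = not).
EXAMPLES = [
    ([0.05, 0.00], 0.0),
    ([0.10, 0.02], 0.0),
    ([0.90, 0.30], 1.0),
    ([0.60, 0.50], 1.0),
]

if __name__ == "__main__":
    net = train(EXAMPLES)
    for inputs, target in EXAMPLES:
        print(inputs, target, round(net.forward(inputs)[1], 3))
```

The output can be read as a score between 0 and 1, so a threshold on it plays the same role as the threshold on the existing heuristics.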
Neural Network Inputs
The raw text of an edit cannot be fed directly into the neural network, so a preprocessor performs many operations on the edit to convert it into scaled floating-point values that the ANN can process.
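To make the idea of "scaled floating-point values" concrete, here is a small sketch of what such a preprocessor could look like. The function `edit_features` and the four features it computes (size change, uppercase fraction, longest repeated-character run, exclamation mark fraction) are assumptions chosen for illustration, not the features the real preprocessor uses. Each value is squeezed into the range 0 to 1 so that no single input dominates the network.

```python
import math
import re


def edit_features(old_text, new_text):
    """Turn one edit into a fixed-length list of floats in [0, 1].

    Hypothetical feature order:
    [size_change, upper_frac, repeat_run, exclaim_frac]
    """
    # Size change: tanh squashes any growth or shrinkage into (-1, 1),
    # which is then shifted into (0, 1); 0.5 means "no change".
    size_change = 0.5 + 0.5 * math.tanh((len(new_text) - len(old_text)) / 1000.0)

    # Fraction of letters that are uppercase.
    letters = [c for c in new_text if c.isalpha()]
    upper_frac = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

    # Longest run of one repeated character, capped at 20 characters.
    longest_run = max((len(m.group(0)) for m in re.finditer(r"(.)\1*", new_text)),
                      default=0)
    repeat_run = min(longest_run / 20.0, 1.0)

    # Fraction of characters that are exclamation marks.
    exclaim_frac = new_text.count("!") / len(new_text) if new_text else 0.0

    return [size_change, upper_frac, repeat_run, exclaim_frac]


if __name__ == "__main__":
    print(edit_features("The cat sat.", "THE CAT SAT!!!!!!!!"))
```

A fixed feature order matters here, because each position in the list feeds one specific input neuron.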
Training Set
This type of supervised-learning neural network depends on data pairs, each consisting of an edit and whether or not that edit is vandalism. A large number of such pairs is needed to train the neural network properly. The dataset I am currently testing the neural network with consists of ClueBot's current outputs. However, this scheme has a number of problems, primarily that the current ClueBot misses a substantial amount of vandalism and also produces a fair number of false positives. These errors introduce inaccuracies into the dataset, which can cause significant problems in the operation of the neural network. By design, ClueBot's current false positive count is significantly lower than the amount of vandalism it misses, but because a neural network needs to be trained with both vandalism and non-vandalism examples, both types of errors pollute the dataset significantly.
Because ClueBot's direct output appears to be unsuitable for reliably training the neural network, I'm reaching out to the Wikipedia community to see if anyone is willing to help generate this dataset. If anyone wants to help (and any help would be greatly appreciated), look at the dataset page.
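The pollution argument above can be put into numbers. When the bot's verdicts are used as labels, every false positive becomes a wrong "vandalism" example and every missed vandalism becomes a wrong "not vandalism" example. The sketch below computes the share of wrong labels in each class; the function name `label_pollution` and the counts in the usage example are invented for illustration and are not measurements of ClueBot.

```python
def label_pollution(true_pos, false_pos, false_neg, true_neg):
    """Fraction of wrong labels in each class when bot verdicts are the labels.

    true_pos:  vandalism the bot flagged
    false_pos: good edits the bot flagged
    false_neg: vandalism the bot missed
    true_neg:  good edits the bot left alone
    """
    labelled_vandalism = true_pos + false_pos
    labelled_clean = true_neg + false_neg
    return {
        "vandalism": false_pos / labelled_vandalism if labelled_vandalism else 0.0,
        "not_vandalism": false_neg / labelled_clean if labelled_clean else 0.0,
    }


if __name__ == "__main__":
    # Made-up counts, chosen only so that misses outnumber false positives.
    print(label_pollution(true_pos=400, false_pos=10, false_neg=600, true_neg=9000))
```

Human-reviewed labels would drive both fractions toward zero, which is the point of asking for help with the dataset.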
Word Categories
Part of the preprocessor groups words into categories (based on Wiktionary categories) for processing by the neural network. If there are any additional Wiktionary categories that you think might be pertinent to vandalism (i.e., in a reasonable number of cases, vandalism will show a marked difference from normal edits in its use of words from those categories), add them to the bottom of the list:
- 1000 English basic words - 724
- 100 English basic words - 101
- Abbreviations, acronyms and initialisms - 4926
- Alcoholic beverages - 110
- Anatomy - 991
- Animals - 181
- Bible - 322
- Biblical characters - 135
- Books of the Bible - 187
- Christianity - 257
- Classic 1811 Dictionary of the Vulgar Tongue - 132
- Clothing - 173
- Colloquial - 726
- Communication - 130
- Currency - 274
- Death - 199
- Derogatory - 183
- Diseases - 188
- Dogs - 114
- Electronics - 207
- Euphemisms - 235
- Family - 221
- Food and drink - 164
- Foods - 379
- Football (Soccer) - 147
- Games - 101
- Grammar - 517
- Hair - 124
- Informal - 1429
- Internet - 345
- Jocular - 174
- Logic - 175
- Mammals - 414
- Misspellings - 515
- Mythology - 125
- Pejoratives - 545
- Poetic - 121
- Recreational drugs - 125
- Religion - 406
- Sex - 121
- Slang - 3347
- Vulgarities - 452
- Weapons - 152
- Crime - 000
- Internet slang - 82
- Leet - 25
- English first person pronouns - 12
- English second person pronouns - 15
- English pronouns - 143 (not as sure about this one)
- Female given names - 1175
- Male given names - 993
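
One plausible way to turn the categories above into network inputs is to compute, for each category, the fraction of words in an edit that belong to it. The sketch below assumes this approach; it is not necessarily how the preprocessor does it. The function `category_fractions` is hypothetical, and the tiny word sets in `CATEGORIES` are stand-ins for the real Wiktionary category contents.

```python
import re

# Stand-in word sets; the real lists would come from the Wiktionary categories.
CATEGORIES = {
    "Animals": {"cat", "dog", "horse"},
    "Internet slang": {"lol", "omg"},
    "English second person pronouns": {"you", "your"},
}


def category_fractions(text, categories=CATEGORIES):
    """Map each category name to the fraction of words in `text` found in it."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {name: 0.0 for name in categories}
    return {name: sum(w in vocab for w in words) / len(words)
            for name, vocab in categories.items()}


if __name__ == "__main__":
    print(category_fractions("lol you are a dog"))
```

Because each result is already a fraction between 0 and 1, it needs no further scaling before it reaches the network.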

