User:Shalom Yechiel/Drafts and archives/Offdays analysis


One of the techniques I developed while working on Poetlister's case is offdays analysis. An "offday" is defined as any day when a user account does not edit Wikipedia. Over a long period, all of the accounts belonging to one person will likely fall into a pattern of simultaneous offdays. Since a sockpuppeteer, in the strictest definition, is one person in real life, offdays analysis can predict whether the activity of multiple accounts is consistent with a nonrandom pattern that correlates with real-world human motives.

Every day, when you wake up in the morning, you will do one of two things: either you will edit Wikipedia, or you will not. It does not matter, for offdays analysis, whether you edit Wikipedia once or a hundred times. Either way, you edit Wikipedia on that day, thus demonstrating that you avail yourself of Internet access and log on to Wikipedia to make an edit.
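The yes-or-no nature of an offday can be sketched in Python. This is only an illustration: the function names and the sample timestamps are hypothetical, and the sketch assumes that each timestamp's own calendar date (for example, UTC) marks the day boundary.

```python
from datetime import date, datetime


def edit_days(timestamps):
    """Collapse edit timestamps into the set of calendar days with at least one edit."""
    return {ts.date() for ts in timestamps}


def edited_on(day, days_with_edits):
    """Return 1 if the account edited on `day`, otherwise 0 (an offday)."""
    return 1 if day in days_with_edits else 0


# Hypothetical timestamps, for illustration only.
sample = [
    datetime(2007, 5, 28, 9, 15),
    datetime(2007, 5, 28, 22, 40),
    datetime(2007, 5, 30, 13, 5),
]
days = edit_days(sample)

print(edited_on(date(2007, 5, 28), days))  # two edits still count as one editing day
print(edited_on(date(2007, 5, 29), days))  # no edits on this day: an offday
```

Collapsing timestamps into a set of dates discards the edit count on purpose, since one edit and a hundred edits are treated alike.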

Analysis of Runcorn and Poetlister

I was inspired to try an offdays analysis for the alleged Runcorn/Poetlister sock-farm when I noticed, to my amazement, that the 11 suspected sockpuppets, taken as a group, did not miss a single day in all of 2007 until they were blocked. I started to ask myself, when was the last day that none of the accounts edited? It was October 13, 2006, and before that, July 15, 2006. I have documented the details on a subpage of my report on Poetlister. I find it difficult to believe that a single person could have produced the profile of offdays indicated by the Runcorn/Poetlister accounts. For this and other reasons, I believe Runcorn and Poetlister are two different people.
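The question "when was the last day that none of the accounts edited?" can be sketched as a search for group offdays. The account names and dates below are hypothetical placeholders, not data from the case, and the function name is my own.

```python
from datetime import date, timedelta


def group_offdays(accounts, start, end):
    """Return the days in [start, end] on which no account edited, newest first."""
    # Union of every account's editing days: a day is "active" if anyone edited.
    active = set().union(*accounts.values())
    offdays = []
    day = end
    while day >= start:
        if day not in active:
            offdays.append(day)
        day -= timedelta(days=1)
    return offdays


# Hypothetical editing days for two placeholder accounts.
accounts = {
    "AccountA": {date(2007, 1, 1), date(2007, 1, 2)},
    "AccountB": {date(2007, 1, 2), date(2007, 1, 4)},
}
result = group_offdays(accounts, date(2007, 1, 1), date(2007, 1, 5))
print(result[0])  # most recent day on which nobody in the group edited
```

The first element of the returned list is the most recent group offday. An empty list means the group, taken together, never missed a day in the interval.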

I examined the offdays of Runcorn and Poetlister specifically from February 27, 2006, through May 29, 2007, when all of the 2005 suspects were no longer blocked and Runcorn was active. In this 457-day interval, when both Runcorn and Poetlister were actively editing, Runcorn edited on 397 days, and Poetlister edited on 106 days. (Poetlister's numbers include edits made by Poetlister's accounts on all Wikimedia projects, including Wikisource, Wikiquote and Meta.) Runcorn's frequency, defined as the number of days Runcorn edited divided by the number of days in the interval, was 397/457 = 0.869 (slightly more often than 6 days a week, which corresponds to a frequency of 6/7 = 0.857). Poetlister's frequency was 106/457 = 0.232.
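The interval length and the two frequencies can be double-checked with a short Python sketch. The dates and day counts are the ones given above. The variable and function names are my own.

```python
from datetime import date

start, end = date(2006, 2, 27), date(2007, 5, 29)
# Add 1 because both the first and the last day belong to the interval.
days_in_interval = (end - start).days + 1


def frequency(days_edited, days_total):
    """Fraction of days in the interval on which the account edited."""
    return days_edited / days_total


runcorn_freq = frequency(397, days_in_interval)
poetlister_freq = frequency(106, days_in_interval)

print(days_in_interval)            # 457
print(round(runcorn_freq, 3))      # 0.869
print(round(poetlister_freq, 3))   # 0.232
print(round(6 / 7, 3))             # 0.857, i.e. six days a week
```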

I created a Microsoft Excel spreadsheet to analyze the offdays data. In the top row I placed the names of the users: Poetlister and Runcorn. (I analyzed all eleven accounts, but I'll focus on these two.) In the first column I asked the computer to fill in dates automatically in reverse chronological order: 29-May-07, 28-May-07, 27-May-07, and so forth, going back into the past until 27-Feb-06. In the second column I manually typed the number 1 into each cell corresponding to a date on which Poetlister edited. On other days, I typed the number 0. I did likewise for Runcorn in the third column. Thus, 29-May-07 was a 0 for Poetlister but a 1 for Runcorn because Runcorn edited that day but Poetlister did not.
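The same layout can be built outside Excel. The Python sketch below is only an illustration of the table structure: apart from 29 May 2007, which is taken from the description above, the editing days are hypothetical placeholders.

```python
from datetime import date, timedelta


def build_rows(start, end, poetlister_days, runcorn_days):
    """One row per day, newest first: (date, Poetlister 0/1, Runcorn 0/1)."""
    rows = []
    day = end
    while day >= start:
        rows.append((day, int(day in poetlister_days), int(day in runcorn_days)))
        day -= timedelta(days=1)
    return rows


# Placeholder data: only 29 May 2007 reflects the text; 27 May is hypothetical.
poetlister_days = {date(2007, 5, 27)}
runcorn_days = {date(2007, 5, 29), date(2007, 5, 27)}

rows = build_rows(date(2006, 2, 27), date(2007, 5, 29), poetlister_days, runcorn_days)
print(len(rows))  # 457, matching spreadsheet rows 2 through 458
print(rows[0])    # the 29-May-07 row: 0 for Poetlister, 1 for Runcorn
```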

Below the oldest date I added four more rows: "Sum", "Days", "Frequency" and "Prediction". "Sum" is the number of days on which a user edited. In cell B459 I typed the formula:

=SUM(B2:B458)

"Days" is the number of days in the interval. By visual check I could see this was 457 (rows 2 through 458, one date each), so I typed 457 in cell B460.

"Frequency" is "Sum" divided by "Days", as explained already. In cell B461 I typed:

=B459/B460

I added a fourth column for "Both", i.e. days on which both Runcorn and Poetlister edited. In order to locate these days automatically, I typed the following expression into cell D2 and completed the column automatically:

=PRODUCT(B2,C2)

If both users edited, their product is 1*1 = 1. If either user did not edit, the product is 0.

I added a fifth column for "Neither", i.e. days on which neither Runcorn nor Poetlister edited. I typed the following expression into cell E2 and completed the column automatically:

=1-MAX(B2,C2)

If both users did not edit, their values are both zero, and the maximum is zero, so the cell will display 1-0 = 1. If either user edited, the cell will display 1-1 = 0.
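The "Both" and "Neither" formulas can be checked against all four possible combinations of 0 and 1. This Python sketch mirrors the two Excel expressions. The function names are my own.

```python
def both(p, r):
    """Mirror of =PRODUCT(B2,C2): 1 only when both users edited."""
    return p * r


def neither(p, r):
    """Mirror of =1-MAX(B2,C2): 1 only when neither user edited."""
    return 1 - max(p, r)


# Truth table over every combination of the two 0/1 indicators.
table = [(p, r, both(p, r), neither(p, r)) for p in (0, 1) for r in (0, 1)]
for row in table:
    print(row)
```

No day can be a 1 in both columns, and days on which exactly one user edited are 0 in both.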

I calculated the sum of days on which both or neither user edited Wikipedia, and calculated the frequency of these occurrences.

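In the "Prediction" row, I compared these frequencies with a predicted value based on a random distribution of offdays, that is, on the assumption that one user's decision to edit on a given day is independent of the other's. Under that assumption, the expected frequency of a joint event is the product of the individual frequencies. For days on which both users edited, my prediction was expressed as: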

=B461*C461

which in this case was (0.232 * 0.869) = 0.201. This agreed closely with the empirical frequency of 0.195.

For days on which neither user edited, my prediction was:

=(1-B461)*(1-C461)

which in this case was (0.768 * 0.131) = 0.101. This agreed closely with the empirical frequency of 0.094.
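Both predictions can be reproduced in a few lines of Python. The sketch uses the unrounded frequencies 106/457 and 397/457, as the spreadsheet cells would. The function name is hypothetical.

```python
def predict(freq_a, freq_b):
    """Expected 'both' and 'neither' frequencies if the two accounts edit independently."""
    both = freq_a * freq_b
    neither = (1 - freq_a) * (1 - freq_b)
    return both, neither


days = 457
poetlister_freq = 106 / days
runcorn_freq = 397 / days

pred_both, pred_neither = predict(poetlister_freq, runcorn_freq)
print(round(pred_both, 3))     # 0.201, compared with the observed 0.195
print(round(pred_neither, 3))  # 0.101, compared with the observed 0.094
```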

I do not have the training in statistics to know at what point an empirical result deviates significantly from a predicted value based on an assumed random distribution. I prefer to supplement the numerical test with a manual check of contributions on days when both users edited to look for instances where both users edited in a short time frame, and to examine if the nature of those edits indicates one person or two unrelated people. A random result in offdays analysis does not prove innocence, but a nonrandom result increases suspicion of guilt where other factors already exist to link two user accounts.
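One conventional way to put a number on "deviates significantly" is Pearson's chi-square test of independence on the 2x2 table of days (both, Poetlister only, Runcorn only, neither). This test is not part of the spreadsheet and is offered only as a sketch. The counts 89 and 43 are back-calculated from the reported frequencies 0.195 and 0.094 over 457 days and are an assumption, not figures copied from the data. The test also treats each day as an independent observation, which real editing habits may not satisfy.

```python
import math


def chi_square_2x2(both, only_a, only_b, neither):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    n = both + only_a + only_b + neither
    a_yes, a_no = both + only_a, only_b + neither
    b_yes, b_no = both + only_b, only_a + neither
    chi2 = n * (both * neither - only_a * only_b) ** 2 / (a_yes * a_no * b_yes * b_no)
    # With 1 degree of freedom, the upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


days = 457
both_days = round(0.195 * days)      # assumed count, back-calculated from 0.195
neither_days = round(0.094 * days)   # assumed count, back-calculated from 0.094
only_poetlister = 106 - both_days    # Poetlister edited, Runcorn did not
only_runcorn = 397 - both_days       # Runcorn edited, Poetlister did not

chi2, p_value = chi_square_2x2(both_days, only_poetlister, only_runcorn, neither_days)
print(round(chi2, 2), round(p_value, 2))
```

With these back-calculated counts the p-value comes out well above the conventional 0.05 threshold, so the test would not flag a departure from a random distribution. As with the offdays analysis itself, that outcome does not prove the accounts are unrelated.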

Case studies