Anti-spam measures for the Wiki. Names are given as (Suggested / Implemented), or a single name where the same person did both. The source for the current wiki CGI script, and the patch stack, are available here.
Already implemented
001.rdns-check.patch: Prevent access if the user has no reverse DNS record (HugoMills)
- Allows access to registered users even if they don't have RDNS
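For reference, a rough sketch of the kind of check this sort of patch performs (this is not the actual patch; the use of REMOTE_ADDR and the plain-text error response are assumptions):

 # Refuse anonymous access when the client IP has no reverse DNS record.
 use strict;
 use warnings;
 use Socket qw(inet_aton AF_INET);

 sub has_reverse_dns {
     my ($ip) = @_;
     my $packed = inet_aton($ip) or return 0;
     my $host = gethostbyaddr($packed, AF_INET);
     return defined $host && length $host;
 }

 my $ip = $ENV{REMOTE_ADDR} || '';
 unless (has_reverse_dns($ip)) {
     # The real patch also lets registered users through at this point.
     print "Content-type: text/plain\n\n";
     print "Access denied: no reverse DNS record for $ip\n";
     exit;
 }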
002.no-robots-in-diff.patch: Prevent diff and history pages being picked up and indexed by spiders (DanPope / GrahamBleach)
I submitted a patch based on Tom Scanlan's RobotsNoFollow patch (http://www.usemod.com/cgi-bin/wiki.pl?WikiPatches/RobotsNoFollow - it's the one at the end of the page). — GrahamBleach
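A rough sketch of the idea, for anyone reading along (generic code, not the actual RobotsNoFollow patch; the action names are assumptions about how the request type is identified):

 # Emit a robots meta tag on diff and history views so spiders neither
 # index them nor follow the links they contain.
 use strict;
 use warnings;

 sub robots_meta_for {
     my ($action, $is_diff) = @_;
     if ($is_diff || (defined $action && $action eq 'history')) {
         return qq{<meta name="robots" content="noindex,nofollow">\n};
     }
     return '';
 }

 # Example: building the <head> of a history view.
 print '<head>', robots_meta_for('history', 0), '</head>', "\n";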
003.content-blacklist.patch: Prevent edits containing banned strings, e.g. http://(.*)\.cc(/|\b) (GrahamBleach & DavidRamsden / DavidRamsden)
Note we only need to check link URIs against a banned list of regexes, not the whole text of the edit — GrahamBleach
The patch checks against the whole text: it's easier to implement that way. Note, however, that it makes it impossible for a non-administrator to edit any page that already contains a banned word. (But that's not a serious problem.) — HugoMills
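To illustrate the whole-text check Hugo describes, a minimal sketch (the pattern list here is only an example, not our real banned list):

 # Reject an edit whose submitted text matches any banned pattern.
 use strict;
 use warnings;

 my @banned = (
     qr{http://[^\s]*\.cc(/|\b)}i,   # the example pattern given above
     qr{texas-holdem}i,              # illustrative keyword entry
 );

 sub edit_is_banned {
     my ($text) = @_;
     for my $re (@banned) {
         return 1 if $text =~ $re;
     }
     return 0;
 }

 # Example use inside the save handler:
 my $new_text = 'Nice page! Visit http://casino.example.cc/ now';
 die "Edit rejected: matches the banned-content list\n"
     if edit_is_banned($new_text);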
004.dnsbl.patch: Prevent edits from sites in DNS blackhole lists. (DavidRamsden)
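The underlying lookup is simple: reverse the octets of the client address, append the blacklist zone, and see whether the name resolves. A hedged sketch (the zone name is a placeholder, not necessarily the list the patch uses):

 # Check the client IP against a DNS blackhole list.
 use strict;
 use warnings;

 sub listed_in_dnsbl {
     my ($ip, $zone) = @_;
     return 0 unless $ip =~ /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/;
     my $query = join('.', $4, $3, $2, $1) . ".$zone";
     # An A record for the reversed address means the IP is listed.
     return defined gethostbyname($query);
 }

 # Example: refuse edits from listed addresses.
 if (listed_in_dnsbl($ENV{REMOTE_ADDR} || '', 'dnsbl.example.org')) {
     print "Content-type: text/plain\n\nEdits from your address are blocked.\n";
     exit;
 }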
005.rollback.patch: Add a "roll-back to previous revision" link at the bottom of the page. This would reduce the effort required to unspam a page. (TonyWhitmore / HugoMills)
Developed but not applied
U006.spamassassin.patch: Feed all changes through spamassassin before acceptance. (TomBragg)
Not applied because it's too slow at present. — HugoMills
Maybe we should use spammonkey
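For the record, the shape of the U006 idea is roughly this (a sketch only, not the actual patch; it assumes spamd is running locally, and the slowness Hugo mentions still applies):

 # Hand the proposed edit to SpamAssassin's spamc client and refuse it
 # if it scores as spam.  `spamc -c` prints "score/threshold" and exits
 # non-zero when the message is judged to be spam.
 use strict;
 use warnings;
 use IPC::Open2 qw(open2);

 sub edit_looks_like_spam {
     my ($text) = @_;
     my ($from_child, $to_child);
     my $pid = open2($from_child, $to_child, 'spamc', '-c');
     print {$to_child} "Subject: wiki edit\n\n$text";
     close $to_child;
     my $score = <$from_child>;   # e.g. "6.3/5.0"
     close $from_child;
     waitpid $pid, 0;
     return ($? >> 8) != 0;
 }

 print "rejected\n" if edit_looks_like_spam('BUY CHEAP PILLS http://spam.example/');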
In progress
Recent spam analysis
I said I'd have a look at the Apache logs and see what was going on at the time of the wiki spam attack. With this information we might be able to prepare better defences. I started by grabbing the Apache access logs for the hantslug site down to my local machine. Don't fear, though: I shall not release any information about anyone's browsing habits. I'm only interested in the hacks, ma'am, just the hacks.
Next I grabbed a page that was hacked recently. Here's a good one: the InterMezzo page will be easy to grep for.
http://www.hantslug.org.uk/cgi-bin/wiki.pl?LinuxHints/InterMezzo

The spammer appeared to commit their changes twice, once at 23:17 and again at 23:19:

 Revision 35 . . February 16, 2005 11:19 pm by 69.50.166.2-custblock.intercage.com
 Revision 34 . . February 16, 2005 11:17 pm by 69.50.166.2-custblock.intercage.com

I found the first visit to that page that day:

 218.103.59.59 - - [16/Feb/2005:03:50:08 +0000] "GET /cgi-bin/wiki.pl?action=edit&id=LinuxHints/SambaAuth HTTP/1.0" 200 1137 "-" "libwww-perl/5.803"
 218.103.59.59 - - [16/Feb/2005:03:50:12 +0000] "GET /cgi-bin/wiki.pl?action=edit&id=MailingList/TopPosting HTTP/1.0" 200 1137 "-" "libwww-perl/5.803"
 218.103.59.59 - - [16/Feb/2005:03:50:16 +0000] "GET /cgi-bin/wiki.pl?action=edit&id=AboutWiki HTTP/1.0" 200 1137 "-" "libwww-perl/5.803"
 218.103.59.59 - - [16/Feb/2005:03:50:28 +0000] "GET /cgi-bin/wiki.pl?action=edit&id=LinuxHints HTTP/1.0" 200 1137 "-" "libwww-perl/5.803"
 218.103.59.59 - - [16/Feb/2005:03:50:40 +0000] "GET /cgi-bin/wiki.pl?action=edit&id=LinuxHints/InterMezzo HTTP/1.0" 200 1137 "-" "libwww-perl/5.803"

Note with interest that the netblock is owned by someone in the Far East. Next we look for all other visits from that IP. Boy oh boy, it has hit every page, some many times:

 218.103.59.59 - - [16/Feb/2005:03:50:08 +0000] "GET /cgi-bin/wiki.pl?action=edit&id=LinuxHints/SambaAuth HTTP/1.0" 200 1137 "-" "libwww-perl/5.803"
 :
 : snipped 591 lines
 :
 218.103.59.59 - - [16/Feb/2005:04:37:17 +0000] "GET /cgi-bin/wiki.pl?action=edit&id=LinuxHints/UpdatingGrub HTTP/1.0" 200 1137 "-" "libwww-perl/5.803"

That's about 590-odd hits over a period of roughly 45 minutes, which works out at around 13 a minute, or one hit every four to five seconds. I don't care how fast ThomasAdam is at maintaining the wiki, he can't keep up with this puppy! OK, move on to look for the next hits, because those times don't tie up with the times of the spam. Skip forward to the next hits on the InterMezzo page:

 69.50.166.2 - - [16/Feb/2005:23:17:19 +0000] "POST /cgi-bin/wiki.pl HTTP/1.0" 302 162 "-" "libwww-perl/5.803"
 69.50.166.2 - - [16/Feb/2005:23:17:20 +0000] "GET /cgi-bin/wiki.pl?action=edit&id=LinuxHints/InterMezzo HTTP/1.0" 200 14270 "-" "libwww-perl/5.803"
 69.50.166.2 - - [16/Feb/2005:23:17:21 +0000] "POST /cgi-bin/wiki.pl HTTP/1.0" 302 195 "-" "libwww-perl/5.803"
 69.50.166.2 - - [16/Feb/2005:23:17:22 +0000] "GET /cgi-bin/wiki.pl?action=edit&id=MailingList/UnSubscribe HTTP/1.0" 200 3088 "-" "libwww-perl/5.803"

Some GETs and POSTs: this is where the hack actually took place. Note that the IP address ties up with the RDNS recorded in the recent changes to the page. After that, the next hits to the page were the recovery of it by the crack wiki-fix team.
Here's what we learn:
- The spammer is using some kind of Perl-based bot from a machine in Hong Kong to read the wiki pages.
- A prep run, gathering all the URLs in readiness for the attack?
- Then, later on, a bot is spawned on another box entirely to carry out the actual spamming.
- A quick Google for 'libwww-perl/5.803 spam' reveals numerous sites (mostly in Japanese) where people report comment spam on their blogs [1] [2] [3] exhibiting similar behaviour.
- The spammer appears to let this thing run flat out for a while, grabbing and then modifying many pages one after the other.
[1] http://taka.no32.tk/diary/?date=200501 [2] http://www.yaizawa.jp/diary/?date=20050118 [3] http://www.fkimura.com/diary/?date=20050114
Some suggestions which may need to be moved to the next section down, but which appear appropriate here because they are directly related to the above research.
- Put a rate limiter on the apache side?
- Nobody other than a legitimate bot (Googlebot etc.) has a good reason to hit a site that quickly. A page every 4 seconds, sustained for 40-odd minutes, can NEVER be good.
- Put a rate limiter on the wiki side?
- Same as above. It's probably easier to limit the reads per minute on the Apache side, but easier to limit the writes/updates per minute on the wiki side?
- Block the "libwww-perl" user agent for now.
- Yes it's not a permanent fix.
- Yes I'm sure there's some valid reason why someone may want to use libwww-perl version 5.803 to browse our site, but IMO it's not likely.
- I've checked the last month or so and the ONLY thing using libwww-perl is a spammer.
- Other tools *may* use libwww-perl, such as archive.org and possibly the Yahoo spider, but most send a user agent string which identifies the bot, followed by libwww-perl. The spammer in our case only reports "libwww-perl/5.803".
- Something akin to this should do it (assuming mod_rewrite is already enabled):
 RewriteCond %{HTTP_USER_AGENT} ^libwww-perl/[0-9] [NC]
 RewriteRule .* - [F]
The above was initially written by AlanPope
Suggestions
Remove diff record of spam entirely (DanPope)
Link/text ratio (GrahamBleach)
- Problem: What if you're legitimately adding a link?
Maybe a single link being added would be allowed (this is a perfectly reasonable edit), but adding more than one link, with a link-to-text ratio greater than some defined threshold, would be blocked. One problem might be that spammers pick up on this and start adding random passages of text to their spam, in the same manner as they have done for e-mail spam. — TonyWhitmore
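A sketch of what that check might look like (the one-free-link rule and the 25% threshold are illustrative assumptions):

 use strict;
 use warnings;

 # Count URLs in a chunk of text.
 sub count_links {
     my ($text) = @_;
     my @links = $text =~ m{(https?://\S+)}g;
     return scalar @links;
 }

 # Reject edits that add more than one link and are mostly links by volume.
 sub too_link_heavy {
     my ($old_text, $new_text) = @_;
     my $added = count_links($new_text) - count_links($old_text);
     return 0 if $added <= 1;                 # a single new link is fine
     my $link_chars = 0;
     $link_chars += length($_) for $new_text =~ m{(https?://\S+)}g;
     my $ratio = $link_chars / (length($new_text) || 1);
     return $ratio > 0.25;                    # over a quarter of the page is URLs
 }

 print too_link_heavy('old page text',
     'spam http://a.example/ http://b.example/ http://c.example/')
     ? "reject\n" : "accept\n";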
Bayesian checks on new content (GrahamBleach)
- Require people to be registered if they wish to edit pages. (jcdutton)
Not currently implemented in UseMod. The closest equivalent would be distributing an "editor" password, which people are required to set once registered and which is stored in a cookie — TonyWhitmore
I favour this amongst all the others, since it would probably kill all spammers in one go. At the moment, there is only a small subset of people in the LUG who edit the wiki anyway. We could simply distribute the relevant details to the list every month. — ThomasAdam
Actually, there have been two cases to date of spammers registering an ID. —HugoMills
I think ThomasAdam was referring to my comment, rather than the OP's. It is possibly the "least-effort" solution. The problem (although other people may not see it as such) is that this method excludes the possibility of people editing the wiki who aren't list members. Editors would also need to set up an ID on each machine they use to edit the wiki and save that as a cookie on that system, re-entering it every time they clear their cookies. Whilst none of this is earth-shatteringly difficult, I think it's important to keep the technical barrier to contributions to the wiki as low as possible. — TonyWhitmore
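The "editor" password mechanism TonyWhitmore describes above appears to map onto UseMod's $EditPass setting, if I'm reading the stock wiki.pl correctly: users enter the password once on the Preferences page and it is then kept in their cookie. A hedged, excerpt-style sketch of the relevant settings (exact behaviour should be checked against our installed version):

 $EditAllowed = 1;          # editing is on...
 $EditPass    = "secret";   # ...but only for users who have entered this
                            # password on the Preferences page (it is then
                            # stored in their browser cookie)
 $AdminPass   = "other";    # separate password for admin functions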
- Have the admins moderate who gets registered. (jcdutton)
Admins probably wouldn't like the extra workload. —HugoMills
Would have to implement a more advanced registration technique than the currently available one — TonyWhitmore
Match registration emails against the LUG MailingList members, and if they match, email the registration password to that user. (jcdutton)
We'd need to add the email address to the user database – that's not currently stored by the wiki. We'd also need to have a hook into the mailman database on lug.org.uk. This one would be very tricky indeed. —HugoMills
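For completeness, a sketch of the membership check itself, ignoring the integration problems Hugo raises (the list_members path and the list name are guesses, and this would have to run on the machine hosting Mailman):

 # Check a registration email address against the mailing list membership.
 use strict;
 use warnings;

 sub is_list_member {
     my ($email) = @_;
     my @members = `/usr/lib/mailman/bin/list_members hampshire`;  # hypothetical path and list name
     chomp @members;
     return scalar grep { lc($_) eq lc($email) } @members;
 }

 if (is_list_member('someone@example.org')) {
     # ... email the wiki registration password to that address ...
 }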
Have a security code that must be entered on submission. The code could be randomly generated, and displayed as a distorted png image. This should reduce the automated editing of pages. (PhilipStubbs)
Although a common suggestion for wikis, there is software available to "scan" these images and interpret the graphic as digits. This makes scripting just as possible as before. The trouble is that the technological barrier for entry is then raised for normal users. Just my £0.02 — TonyWhitmore
Would also obstruct visually impaired people from editing the Wiki — GrahamBleach
Semi-automated abuse reports to netblock owner and those responsible, similar to http://www.spamcop.net/. Web form to paste in URL of affected revision and have it do all the donkey work of whois lookups and adding boilerplate text. — GrahamBleach
Shared blacklists of URIs and domains. This is a feature of MT-Blacklist, a Movable Type plugin for preventing blog comment spam. http://www.chongqed.org has a page listing URIs and keywords which are known to be added in wiki spam attacks. — GrahamBleach
Try the new Google et al rel="nofollow" link attribute. This isn't a magic bullet, but it may help. See Babbage's journal on use.perl.org for an example and links. — AdamTrickett
We have already eliminated the benefits of link spamming by asking robots not to index or follow links from the diff and history pages. Providing the LUG members continue to remove spam promptly, there is no need to patch the wiki code. Adding nofollow to all links would also prevent sites we link to legitimately from benefitting from the extra pagerank. Unfortunately I don't think that the spammers have the technical nous to identify that they will not gain any benefit from spamming this Wiki. — GrahamBleach
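If it were tried despite the reservations above, one low-impact option is to post-process the generated HTML rather than patch the link-rendering code. A sketch (generic code, not UseMod's own; as written it tags all external links, not just recently added ones, so Graham's pagerank concern still applies):

 # Add rel="nofollow" to external links in a page of generated HTML.
 use strict;
 use warnings;

 sub nofollow_external_links {
     my ($html) = @_;
     $html =~ s{<a\s+href="(https?://[^"]+)"}{<a rel="nofollow" href="$1"}gi;
     return $html;
 }

 print nofollow_external_links('<a href="http://example.com/">spam</a>'), "\n";
 # prints: <a rel="nofollow" href="http://example.com/">spam</a>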
Rate limiter. Say, 4 changes from the same IP address in the space of 1 minute leads to an automatic place on the banned list. — HugoMills
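A sketch of how that could be bolted on, assuming a small DBM file for the per-IP timestamps (the file location, window and limit are all assumptions):

 # Allow at most 4 saves per IP address per minute; anything beyond
 # that is a candidate for the banned list.
 use strict;
 use warnings;

 my $WINDOW = 60;   # seconds
 my $LIMIT  = 4;    # edits allowed per window

 sub edit_allowed {
     my ($ip) = @_;
     my %recent;
     dbmopen(%recent, '/tmp/wiki-edit-times', 0600) or die "dbmopen: $!";
     my $now   = time;
     my @times = grep { $now - $_ < $WINDOW } split /,/, ($recent{$ip} || '');
     push @times, $now;
     $recent{$ip} = join ',', @times;
     dbmclose(%recent);
     # A real patch would add $ip to the wiki's banned list when this returns false.
     return @times <= $LIMIT;
 }

 print edit_allowed($ENV{REMOTE_ADDR} || '127.0.0.1') ? "ok\n" : "over the limit\n";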
Roll back by IP. Add functionality to list commits by IP address and remove all changes made by a chosen IP. This would make it easier to roll the whole site back to fix it. — (MatGrove)
De-wiki-fication. Make the HTML generated more unique to make it less likely to be found when a spammer uses a search engine to locate potential sites to spam. Renaming things like "class=wikiheader". — (MatGrove)
Spam alert to MailingList/WikiAdmins if it detects potential spam – DanPope
If we can detect spam, why not just ban the spammer immediately? – HugoMills
Shuffle the field names for input – DanPope
Honeypot pages, leading to instant bans – DanPope
Static site generated (daily?) from wiki content by a spider – DanPope
Quite a lot of work to implement? – HugoMills
Ability to remove comments from spam edits - these show up in RecentChanges. TonyWhitmore
Ability to access a remote blocked list specifically for wiki spammers. AlanPope
- Have a centralised "Banned Words" list to allow other wikis to benefit from our banned words.
- Allow other wikis to submit to the "Banned words" list?
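A sketch of pulling such a shared list over HTTP and merging it with the local one (the URL and the one-regex-per-line format are placeholders, not a real service):

 # Fetch a shared banned-pattern list: one regex per line, '#' comments.
 use strict;
 use warnings;
 use LWP::Simple qw(get);

 sub fetch_banned_patterns {
     my ($url) = @_;
     my $body = get($url);
     return () unless defined $body;
     return map  { qr/$_/i }
            grep { length && !/^\s*#/ }
            split /\r?\n/, $body;
 }

 my @banned = fetch_banned_patterns('http://wiki.example.org/banned-patterns.txt');
 print scalar(@banned), " shared patterns loaded\n";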
Ask a simple question (what is the date, name the item in the logo, etc.) when editing a page. This should defeat any (non-sentient) bot. (DeanEarley and KathrynJones)
- It's 03/04/06 in the UK at the time of writing, but already 04/04/06 in parts of Europe, so dates are tricky; time would be even trickier.
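A sketch of the mechanics, sidestepping the date problem by using questions with stable answers (the questions and answers here are placeholders):

 # Keep a small pool of questions; the edit form embeds the question's
 # index as a hidden field and the answer is checked on submission.
 use strict;
 use warnings;

 my @questions = (
     { q => 'What animal appears in the LUG logo?', a => qr/^penguin$/i },
     { q => 'What is two plus three, in words?',    a => qr/^five$/i    },
 );

 sub check_answer {
     my ($index, $answer) = @_;
     return $answer =~ $questions[$index]{a};
 }

 print check_answer(1, 'Five') ? "accepted\n" : "rejected\n";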
Consider KittenAuth: http://www.thepcspy.com/kittenauthtest and the discussion at http://www.thepcspy.com/articles/security/the_cutest_humantest_kittenauth (JimKissel)
I just tried the demo site linked to and it linked to another version which took between 10 and 20 seconds to load the images. Dunno if that's an issue with the code or his server. -- AlanPope
Consider using a remote spam filter such as http://blogspam.net/ which offers an API to filter wiki and blog spam. -- AdamTrickett