Thursday, October 4, 2012

Can we end SPAM without CAPTCHAs?

Death to CAPTCHAs and bad automated link building

I saw this article online - http://coding.smashingmagazine.com/2011/03/04/in-search-of-the-perfect-captcha/ - and it poses the question: should websites use CAPTCHAs at all? The basic premise is that they shouldn't, because a CAPTCHA places a barrier between users and the site. In other words, you can't use the site until you fill in an annoying piece of garbled text or click on pictures of cats.

The author also suggests that the final death blow to website spam will be a combination of technology and laws. I find the idea that laws will stop website spam laughable because, frankly, spammers in Africa and Asia haven't slowed their email campaigns at all despite hefty fines here in the US. Do you really think China will ever make it illegal to wreck US or European websites, or to dupe the gullible in those countries? No way.

Tech to solve the problem of bad links

There were some discussions in the article on technology solutions that might eliminate automated link building/website spam.

Side note...

Clearly, the people suggesting tech solutions had never bothered to build or use one of these automated link builders, or to understand how they work at a low level.

Yours truly has - several times. I consider it unethical to hammer a site, but I don't consider it unethical to manipulate the search engines. I spend a lot of time trying to make the automated content for these link building tools readable and useful, and it really isn't that hard. For example, find blog posts and comments on some recent event - like a presidential election - and post a content excerpt from another news org on the topic. With a little forethought you can grab a lot of articles from news orgs around the world on the topic, extract the keywords for each article, and post an excerpt from the article that matches the most keywords. The usefulness of your comment depends heavily on how good your keyword extraction is, and it isn't perfect, but the comment will be at least loosely related to the topic at hand and more often than not at least a little useful. It isn't hard; it's just harder than putting out random garbage all the time.
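To make that concrete, here's a minimal sketch of the keyword matching idea in PHP - assuming you already have the target post's text and a handful of candidate article excerpts in memory, and using a deliberately crude stop word list (all the names here are illustrative, not from any particular tool):

  <?php
  // Minimal sketch: score candidate excerpts by keyword overlap with a target post.
  // Assumes the target text and candidate excerpts are already fetched from
  // wherever you pull them; a real tool would use a much better stop word list.
  function extractKeywords($text, $minLength = 5) {
      $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
      $stopWords = array('about', 'after', 'their', 'there', 'which', 'would', 'these');
      $counts = array();
      foreach ($words as $word) {
          if (strlen($word) >= $minLength && !in_array($word, $stopWords)) {
              $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
          }
      }
      arsort($counts);
      return array_slice(array_keys($counts), 0, 20); // top 20 keywords by frequency
  }

  function bestExcerpt($targetText, array $candidates) {
      $targetKeywords = extractKeywords($targetText);
      $best = '';
      $bestScore = -1;
      foreach ($candidates as $excerpt) {
          $score = count(array_intersect($targetKeywords, extractKeywords($excerpt)));
          if ($score > $bestScore) {
              $bestScore = $score;
              $best = $excerpt;
          }
      }
      return $best; // the excerpt that shares the most keywords with the target
  }

The better the keyword extraction, the more on-topic the posted excerpt looks - which is the whole game.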

How do these automated link building tools work?

Frankly, most of these tools are built by people with no coding knowledge or experience, passing down vague requirements to a coder in India who couldn't care less what the code is supposed to do. In some cases the tool has been broken up so that no single coder even knows what the whole is doing (probably not a bad idea).

The sites most often manipulated are PHP-based sites like WordPress self installs and Pligg installs. Having looked at a lot of their code, I can make a few statements about it:
  • The tools are typically built by people who visit the site, extract all the fields on a given HTML form, and figure out what the POST URL for the form is. They then write a PHP cURL command to post all the required fields to the POST URL. The tool never does a GET on the site to look for the form fields (see the sketch after this list).
  • Much of the code has hardcoded parameters in it like the POST URL, field names, etc. Since the code for each site is often written by a different individual, or an individual not working on the whole project, the hardcoded values are necessary. But they are also a vulnerability in the code.
  • The cURL code almost never includes a Referer header, meaning the page grab will be a direct grab. This should fall into the suspicious category.
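For illustration, here's what that crude pattern typically looks like in practice - everything hardcoded, no GET, no Referer (the URL and field names below are made up):

  <?php
  // Sketch of the typical hardcoded bot: POST straight at the form handler with
  // hardcoded field names, no prior GET and no Referer header. The URL and
  // field names are hypothetical.
  $ch = curl_init('http://example-pligg-site.com/register.php');
  curl_setopt($ch, CURLOPT_POST, true);
  curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
      'username' => 'somename',
      'email'    => 'somename@example.com',
      'password' => 'secret123',
  )));
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  $response = curl_exec($ch);
  curl_close($ch);
  // The moment the site renames a field or moves the form, this silently breaks.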

I hate F*#?ing SPAM on my site...

I have a serious, nonprofit site devoted to an engineering field I used to work in. And it gets hammered by spam. It pisses me off. Honestly, I wouldn't mind these spammers creating accounts (without the garbage link spam text) and pumping up my stats to make the site look a lot more popular than it likely is in reality. Perceived popularity tends to breed real popularity and credibility. I implemented a reCAPTCHA on my site registration and it did exactly nothing. So I removed the registration form entirely, and that fixed exactly nothing as well. I think the issue is actually an XSS hole opened up by one of the plugins. My lazy fix to date has been a PHP script that scrubs the DB every 5 minutes, deleting any users who registered after a certain date along with everything they've added to the site.

I still have to manually delete several items about once a month. It's a pain. In fact I'm multi-tasking while writing this and deleting those pages that slip through my DB scrub. So I am on the other side of this problem as well.
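For what it's worth, the scrub itself is nothing fancy. A minimal sketch along these lines, run from cron every 5 minutes, does the job - the table and column names here are hypothetical, so adjust them to your own schema:

  <?php
  // Minimal sketch of a cron-driven scrub: delete users who registered after a
  // cutoff date, along with everything they have posted. Table and column names
  // are hypothetical - adjust to your CMS's schema.
  $db = new mysqli('localhost', 'dbuser', 'dbpass', 'mysite');
  $cutoff = '2012-06-01 00:00:00'; // registrations after this date are presumed spam

  $result = $db->query("SELECT id FROM users WHERE registered_at > '$cutoff'");
  while ($row = $result->fetch_assoc()) {
      $uid = (int) $row['id'];
      $db->query("DELETE FROM posts WHERE author_id = $uid");
      $db->query("DELETE FROM comments WHERE author_id = $uid");
      $db->query("DELETE FROM users WHERE id = $uid");
  }
  $db->close();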

The irony...

The irony is that if the articles were at all related to the site, even marginally informative, and included a link in a bio box/resource box, then these spammy submissions would likely help my site's SEO (marginally) by adding a lot of content to the site. But they aren't even marginally related to the site; they are nothing but ads (poorly written at that); and they are filled with spammy links, almost assuring that Google discounts any SEO benefit the creator had hoped for while reducing the quality of my site and my site's SEO goodness in Google.

I understand why SPAM links are so ubiquitous

Look I understand why these links get put out there. Lots of people (myself included) want to make money online from the comfort of their home. I also understand that getting a site to make any money is more difficult than those selling "make money at home solutions" will ever tell you it is. And that most sites you ever build will be lucky to make more than a couple of hundred bucks a month.

So you need dozens of sites to add up to enough revenue to make a living. And each month a whole bunch of the sites that made money last month will stop making you any money, so you will need to build new sites to replace them. In the meantime you need to build backlinks for traffic and SEO purposes, and doing even a minimal set of backlinks manually for one site can take tens of hours - certainly not something you have the time to do properly.

On top of all of that, if you build more than a couple of sites you will quickly realize that getting Google's attention - just getting indexed, let alone ranking in the top 100 pages for a keyword, let alone on the first page - often takes hundreds of backlinks from a wide variety of domains.

It just isn't possible to compete without automation anymore. (Unless you have a really big advertising budget and can just buy attention via the ads on the side.)

Having been on both sides of this coin, I do the following...

I fight SPAM on my sites regularly and I've given some thought to what I would find acceptable - even helpful. I'm an engineer by training and enjoy studying artificial intelligence. And, honestly, I'm a hell of a lot smarter than most internet marketers. And a hell of a lot more ethical - so I don't make millions with all the marginally legal things I could automate and make money from using the half dozen 10+ year old PCs I have sitting in my house.

Having written some code myself for building links on other sites, here are my best practices for not getting caught:
  • Go to the site home page, then go to the login/join/post form, passing the correct Referer.
  • Extract the form fields from the page (see the sketch after this list).
  • POST the form fields you extracted
    • Form fields are usually easy to recognize with simple string matching
    • This step is somewhat unnecessary (it's a longish story why; the short version is that I have never come across a site that rejects your form entry because of extra fields in the POST)
    • I've written code that manipulates thousands of sites from the same code base, and some of those sites change monthly. If you aren't extracting the form fields, your code starts to break down almost immediately, leading to a big, big maintenance headache.
  • Wait a random but reasonable time between calling the page and posting the form data.
    • This is probably unnecessary, but you are attempting to look as human as possible. Remember that the busiest sites will give you the best bang for your effort, but they are also going to be the most aggressive about website spam and banning user accounts. Once an account is banned you can bet your links have all disappeared.
    • The big sites have so many users and so many spammers that they have to use automation to find the spammers. The more human you look in the logs the harder it will be for their automation to find you and thus ban you.
    • Their efforts to ban you will never be 100% effective. Your efforts to avoid being banned will never be 100% effective. Live with it.
  • Don't hammer a site 1000 times from the same account. Limit each account to logging in only every so many hours. Even better, define a window each day during which each account is allowed to log in - like 8 am to 9 pm Monday through Friday and not at all on the weekend - and then allow only 6 - 10 logins a day during that window.
  • Create an account with a common, forgettable name rather than random garbage.
  • Participate in other ways as much as possible:
    • Fill in your profile (with picture if possible)
    • Vote other posts up or down
    • Etc.
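For the first few bullets above, a minimal sketch of the fetch-and-extract part looks something like this (the URL is made up; the final POST works the same way as any other cURL POST):

  <?php
  // Sketch: GET the form page with a Referer, pull the input names out of the
  // HTML, then wait a human-ish amount of time before POSTing. The URL is
  // hypothetical; error handling is omitted for brevity.
  $formUrl = 'http://example-target-site.com/register';
  $ch = curl_init($formUrl);
  curl_setopt($ch, CURLOPT_REFERER, 'http://example-target-site.com/');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  $html = curl_exec($ch);
  curl_close($ch);

  // Extract whatever input fields the form actually has today.
  $doc = new DOMDocument();
  @$doc->loadHTML($html);
  $fields = array();
  foreach ($doc->getElementsByTagName('input') as $input) {
      $name = $input->getAttribute('name');
      if ($name !== '') {
          $fields[$name] = $input->getAttribute('value'); // keep hidden/default values
      }
  }

  // Fill in the fields you care about, then wait before submitting.
  sleep(rand(20, 90));
  // ...POST $fields back to the form's action URL with cURL, as in the earlier sketch.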
And here are my best practices for not being an unethical prick:
  • Post relevant, and hopefully useful, content.
  • Post more links to sites like CNN.com, Whitehouse.gov, etc. than to your own sites.
    • This helps build up the SEO of the site you are putting links on as well as adding some authority to the links you post.
    • It also changes the context in which the site owner (and Google) views your "contributions" to the site.
  • There are plenty of high quality news aggregation sites out there. Automated content doesn't have to be irrelevant junk. And most of the posts your bot makes shouldn't lead back to one of your sites.

Can SPAM be easily stopped?

Is there an easy solution to automated website spam that doesn't require an annoying CAPTCHA for the users?

For automated spam I think there are ways to dramatically reduce the numbers. You can't eliminate all automated SPAM, but I think what I propose below would eliminate the vast majority of what is already out there and push most of the players out of the field, leading to fewer tools in the future. Here are a few things I've noticed in the tools I've bought:
  • Most are PHP, which runs server side and is incapable of running javascript. That doesn't mean you can't get around javascript, but it takes another tool (like WireShark) and in general is a much bigger pain in the ass than a pure HTML form.
  • Most bots have very poor reporting. In other words, they report a link as having been submitted when it wasn't - some error occurred and the user of the tool never heard about it. The reasons for this are twofold. First, no developer/seller wants to report when the bot failed because it makes the tool look bad. Second, reliably determining when the bot failed is a lot trickier than it sounds. (You'd assume a simple string comparison would work, but I haven't found that to be reliable. The bigger sites often change these strings and display them within javascript portions of the page, so PHP can't see the error.)
  • Most bots are single threaded. This is the surest way to tell that the tool was built by someone who knows nothing about what they are doing. PHP is single threaded, but you can get around that with multi-cURL. It took me a couple of hours, back when I was first learning PHP, to build a multi-cURL based PHP class to handle this problem.
  • These bots require a lot of maintenance, so they begin to fail regularly only a few months after they are released. If you do some simple IP tracking on your site you can probably find a pattern of IPs where the vast majority of the forms submitted from those IPs have failed over a 2 week sliding window. Ban those IPs (I think using your .htaccess on Apache would be best, but this isn't my area of expertise) and you will have eliminated much of the problem. A sketch of that idea follows this list.
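A minimal sketch of that IP banning idea, assuming you already log form submissions and their outcomes somewhere (the table and column names are hypothetical):

  <?php
  // Sketch: find IPs whose form submissions over the last 14 days almost always
  // failed, and emit Apache 'Deny from' lines for them. Assumes a form_log table
  // with ip, succeeded (0/1) and submitted_at columns - all hypothetical names.
  $db = new mysqli('localhost', 'dbuser', 'dbpass', 'mysite');
  $sql = "SELECT ip, COUNT(*) AS attempts, SUM(succeeded) AS successes
          FROM form_log
          WHERE submitted_at > DATE_SUB(NOW(), INTERVAL 14 DAY)
          GROUP BY ip
          HAVING attempts >= 20 AND successes / attempts < 0.05";
  $rules = '';
  $result = $db->query($sql);
  while ($row = $result->fetch_assoc()) {
      $rules .= 'Deny from ' . $row['ip'] . "\n";
  }
  $db->close();
  // A real script would merge these lines into .htaccess between marker comments;
  // here they just go into a separate file for review.
  file_put_contents('/var/www/mysite/banned_ips.txt', $rules);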
My bots are a combination of PHP and something else I'm not going to share. They allow me to answer CAPTCHAs via automation and deal with javascript just fine. They are much more capable than your standard bot out there (but also tend to be more buggy due to the much larger code base).

Here's what I would suggest to someone building the next Pligg or WordPress - sites built on PHP and MySQL. Use PHP to generate the field names of all of your forms dynamically. For example, when the registration page is called, create the field names with something like this:
  $now = md5(date(DATE_RFC822));
  $usrFld = md5('username' . $now . $my_private_key);
  $emailFld = md5('email' . $now . $my_private_key);
  $pwdFld = md5('password' . $now . $my_private_key);
In a hidden field you provide the $now value. Then the field names in the POSTed form data will change every time the form is called up. The field names are easy to recreate in the page that processes the form entries as well. Because the names change every time the page is loaded and are derived from your private key, they are almost impossible to guess. This all by itself eliminates most bots, since they never bother to do a GET on the form page.
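Here's a minimal sketch of both halves - the form page and the processing page, shown as one listing and using the same variable names as above. The file names and markup are illustrative, not a drop-in for Pligg or WordPress:

  <?php
  // register_form.php (hypothetical) - generate per-request field names.
  $my_private_key = 'change-me';          // server-side secret, never sent to the browser
  $now = md5(date(DATE_RFC822));          // per-request token
  $usrFld   = md5('username' . $now . $my_private_key);
  $emailFld = md5('email' . $now . $my_private_key);
  $pwdFld   = md5('password' . $now . $my_private_key);
  ?>
  <form method="post" action="register_process.php">
    <input type="hidden" name="t" value="<?php echo $now; ?>">
    <input type="text" name="<?php echo $usrFld; ?>">
    <input type="text" name="<?php echo $emailFld; ?>">
    <input type="password" name="<?php echo $pwdFld; ?>">
    <input type="submit" value="Register">
  </form>

  <?php
  // register_process.php (hypothetical) - rebuild the same names from the hidden token.
  $my_private_key = 'change-me';
  $now = isset($_POST['t']) ? $_POST['t'] : '';
  $usrFld   = md5('username' . $now . $my_private_key);
  $emailFld = md5('email' . $now . $my_private_key);
  $pwdFld   = md5('password' . $now . $my_private_key);
  if (!isset($_POST[$usrFld], $_POST[$emailFld], $_POST[$pwdFld])) {
      die('Form fields missing - almost certainly a bot.');
  }
  $username = $_POST[$usrFld];
  $email    = $_POST[$emailFld];
  $password = $_POST[$pwdFld];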

Now, if the order of the input fields remains the same every time the form is generated, then a bot could fairly easily be written to grab the page and parse the form fields for submission. However, a little CSS trickery could be employed to output the form fields in a different order while maintaining the same appearance for your users. And honestly, having the fields show up in a different order shouldn't be a problem, or even really a nuisance, for a real person - for something like a registration page they should only see it once or a few times anyway.

These form changes shouldn't hurt the user experience. In fact mostly they should be transparent to a person. However, you've made the coding of a bot significantly more difficult - especially for the lazy hacks out there.

You can eliminate PHP-based tools almost entirely by simply requiring some javascript to run before the user can submit the form. You may or may not want to do this depending on what your site is and does.
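One low-friction way to do that - a sketch of my own, not the only approach - is to have a tiny inline script fill a hidden field that the processing page checks. A server-side-only PHP bot never executes the script, so the field stays empty:

  <?php
  // On the form page: emit a challenge the browser must answer with javascript.
  session_start();
  $challenge = rand(1000, 9999);
  $_SESSION['js_expected'] = $challenge * 7;  // expected answer, kept server side
  ?>
  <form method="post" action="register_process.php">
    <!-- ...the usual fields... -->
    <input type="hidden" name="js_answer" id="js_answer" value="">
    <script>
      // A PHP-only bot never runs this, so js_answer stays empty.
      document.getElementById('js_answer').value = <?php echo $challenge; ?> * 7;
    </script>
    <input type="submit" value="Register">
  </form>

  <?php
  // On the processing page: reject submissions that didn't run the script.
  session_start();
  if (!isset($_POST['js_answer']) || (int) $_POST['js_answer'] !== $_SESSION['js_expected']) {
      die('Javascript check failed - almost certainly a bot.');
  }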

An end to SPAM?

Nope.

Until you remove the profit motive for link building there will always be SPAM, both automated and manually produced by cheap workers in China and India (or somewhere else once they get too expensive). The only way to remove link building as a profitable venture is to remove it from Google's search algorithm. If you've read any of the original papers, or even summaries of them, then you know that backlinks are the fundamental principle behind Google's search algorithm; everything since has been a tweak to that principle.
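For reference, the PageRank formula from Brin and Page's original paper makes the point - a page's rank is driven almost entirely by the ranks of the pages linking to it:

  PR(A) = (1 - d) + d * ( PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n) )

where T_1 through T_n are the pages that link to A, C(T_i) is the number of outbound links on T_i, and d is a damping factor (roughly 0.85 in the paper). As long as something like that sits at the core of ranking, links are worth money, and people will keep manufacturing them.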

In the meantime, just put up enough hurdles that the SPAMMERs skip your site and pick on someone else's. It's probably the best you can hope for.