Spammers Messing With My Site

Lyle Zapato | 2011-05-17.8650 LMT | Site

Somehow, someone is appending spam to Google Blog Search results from my site. To see what I mean, do a blog search for "blogurl:http://zapatopi.net/".

[2011-08-16: Found the problem, maybe, but not a complete solution. See update to update below.]

The green link at the bottom of a blog result from my site should read "http://zapatopi.net/blog/" (that's what it has been in the past; I guess Google gets it from the <link> element in the RSS feed). Instead, it's now this monstrosity:

http://zapatopi.net/?s=make+money+online&q=http://www.*****.com/****_******_TClick.aspx

(Don't worry if you've clicked on it; it won't do anything since my site doesn't parse those variables. But don't go to the appended URL. I have no idea what that will do.)

Anyone seeing my site in Google Blog Search will think I'm some sort of spammer and avoid me. I know I wouldn't click on a link with that garbage in the description. Whoever's doing this isn't merely spamming, they're poisoning my Google presence, impugning my reputation, and driving away at least some of my visitors. I am not pleased with this.

The spam isn't coming from my site. The RSS feed and the posts have not been altered to include the spam text. As far as I can tell, the spammer is somehow externally influencing what Google puts in its metadata for my site, which they shouldn't be able to do.

The spam isn't showing up in normal Google search, only Blog Search, and it starts on posts from January of this year, with one outlier on October 9, 2010. If you do a normal search for "site:zapatopi.net [name of spammer]" my blog posts come up, so Google is acting as though those pages are associated with the word "[name of spammer]", even if the word doesn't appear on them (until this post, which will add it to the "recent posts" sidebar) or in the metatext of the search results. (This doesn't negatively affect me, per se, since all that means is people searching for "[name of spammer]" might end up reading about tree octopuses instead. But I still don't like it.)

I've sent a spam report to Google, but, yeah, not holding my breath on a reply to that. Maybe it'll be mysteriously fixed by Google elves, who knows.

(Note: I may delete this post if the problem is mysteriously solved since I'd rather not muck up my blog with references to spammer gibberish. Or maybe I'll replace it with useful information for anyone else similarly spammed if I learn any. [Which I did, see update...])

UPDATE 2011-08-16: It's been over a month and a half since I implemented the fix in the update below. The spam is still on my blog search listings. So the fix doesn't actually work, at least not for posts already indexed. Although, oddly, this post has been fixed. (It's now showing my main URL, "zapatopi.net/", instead of the blog URL, "zapatopi.net/blog/", like all the older, pre-spammed posts, presumably because of the redirect I implemented.) Google might have updated this one because I changed the name/URL. I tried changing one of the other posts' names by adding an 'x' to the end and letting it sit for a few days, but that didn't spur an update. I also tried changing the content a little, but that didn't do anything either. So, yeah, that makes no sense.

Also, now none of the newer posts are showing up in blog search. The June 6 post is the most recent listed there, but there have been five posts since then. The newer ones do show up in regular search. In fact, when I remembered to look two hours after posting the last one, it was indexed, so that's working fine and fast; it's just the blog search that's screwed up.

If anyone knows how to get Google to 1) remove the spam from my blog search listings, and 2) start indexing newer posts in blog search, please email me. Barring that, does anyone know how to remove my blog from Google's blog search but not the regular search? I'd rather not have it listed at all if it's just going to make my blog look like some spam farm.

UPDATE 2011-06-28: I found the reason why spammers were able to do this. My site was simply ignoring the existence of unexpected URL queries (e.g. "http://zapatopi.net/?spam=blah") and serving a page with HTTP status code 200 ("OK"). A spam site would massively link to my site with spam queries and since they returned valid pages, Google would happily include them in its index, spam and all.

(While this explains the inclusion of spam injected URLs in Google's index, it still doesn't explain why Google would allow this to influence that link on the bottom of Blog Search results, which one would assume would be the main link to the blog -- and is in every other case I've seen. I would think weight would be given to the RSS link element and the site's structure over queerly queried links that only appear on external sites -- especially sites that Google has marked as malware, if the incoming links listing in Google's Webmaster Tools is showing what I think it is. Seems kind of a brain-dead way of doing things. Is Blog Search still in beta?)
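For anyone wanting to check whether their own site has the same weakness, here is a rough sketch of how the check could be scripted in PHP. The function names are made up for this example, and it assumes PHP is allowed to open URLs (allow_url_fopen); treat it as a starting point, not a tested tool.

```php
<?php
// Hypothetical helper (not part of my site's code): pulls the numeric status
// code out of an HTTP status line such as "HTTP/1.1 301 Moved Permanently".
// Returns null if the line doesn't look like a status line.
function status_code_from_line($status_line)
{
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $status_line, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical check: returns true if the site answers a junk query with
// 200 ("OK"), false if it answers with anything else, null on failure.
function accepts_junk_query($base_url)
{
    // get_headers() puts the status line of the first response at index 0.
    $headers = @get_headers($base_url . '?spam=blah');
    if ($headers === false) {
        return null;
    }
    return status_code_from_line($headers[0]) === 200;
}
```

Calling accepts_junk_query() with your own site's base URL should return true before a fix like the one below and false once junk queries are answered with a 301.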

The fix for this is to strip the queries off the URL (except desired ones) and return an HTTP status code 301 ("Moved Permanently") redirect. This can be done on Apache servers using mod_rewrite in .htaccess like so:

RewriteEngine On
RewriteCond %{QUERY_STRING} .
RewriteCond %{REQUEST_URI} !^/blog/.* [NC]
RewriteRule ^(.*)$ /$1? [R=301,L]

Since my blog uses queries, its directory is excluded from the rewrite (the third line above). Instead, the blog URL sanitization is handled with PHP to only allow 'post', 'start', and 'perpage' and strip out anything else. The details of this are specific to my site, but here's what it looks like in case anyone wants an example (feel free to adapt and use):

function change_url($goto)
{
header( "HTTP/1.1 301 Moved Permanently" );
header( "Location: $goto" );
exit;
}
function clean_queries()
{
$request_uri = $_SERVER['REQUEST_URI'];
$host = $_SERVER['HTTP_HOST'];
$query_string = $_SERVER['QUERY_STRING'];

$qm = strpos($request_uri,"?");
if ($qm !== false)
	{
	// '?' but no queries, or too long (probably spam)
	if ($query_string == '' || strlen($query_string) > 100)
		{
		$request_uri = substr($request_uri,0,$qm);
		$goto = "http://".$host.$request_uri;
		change_url($goto);
		}
	$query_string = strtolower($query_string);
	
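	// separate $query_string into query keys and values
	$qarray = array();
	$queries = explode('&', $query_string);
	foreach ($queries as $q)
		{
		// split on the first '=' only; a key with no '=' gets an empty value
		$qe = explode('=', $q, 2);
		$qarray[$qe[0]] = isset($qe[1]) ? $qe[1] : '';
		}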
	unset($qe);
	unset($queries);

	// if post is set but there are other queries, remove them and redirect
	if (array_key_exists('post', $qarray)) 
		{
		if (count($qarray) > 1)
			{
			$request_uri = substr($request_uri,0,$qm);
			$goto = "http://".$host.$request_uri.'?post='.$qarray['post'];
			change_url($goto);
			}
		}
	else
		{
		// unallowed query -> reformat + redirect
		$allow = array('start' => 0, 'perpage' => 0);
		$qint = array_intersect_key($allow, $qarray);
		if (count($qint) != count($qarray))
			{
			$nq = ''; // stays empty if neither allowed key is present
			if (array_key_exists('start', $qarray))
				{
				$nq = '?start=' . $qarray['start'];
				if (array_key_exists('perpage', $qarray))
					$nq = $nq . '&perpage=' . $qarray['perpage'];
				}
			else if (array_key_exists('perpage', $qarray))
				$nq = '?perpage=' . $qarray['perpage'];

			$request_uri = substr($request_uri,0,$qm);
			$goto = "http://".$host.$request_uri.$nq;
			change_url($goto);
			}
		unset($allow);
		unset($qint);
		}
	unset($qarray);
	}
}

Note: clean_queries() needs to be called before any output, including spaces and line feeds, since header() can't send headers once any output has been sent.
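If you'd rather test the whitelist logic without triggering redirects, the same rules can be sketched as a function that only returns the cleaned query string. This is an illustrative variant, not the code running on my site; the name sanitize_query() and the $max_len parameter are made up for this example.

```php
<?php
// Hypothetical helper: reduces a raw query string to the allowed keys,
// following the same rules as clean_queries() but without sending headers.
function sanitize_query($query_string, $max_len = 100)
{
    // Empty or suspiciously long query strings are dropped entirely.
    if ($query_string === '' || strlen($query_string) > $max_len) {
        return '';
    }

    // Split into key/value pairs; a key without '=' gets an empty value.
    $pairs = array();
    foreach (explode('&', strtolower($query_string)) as $part) {
        $kv = explode('=', $part, 2);
        $pairs[$kv[0]] = isset($kv[1]) ? $kv[1] : '';
    }

    // If 'post' is present, it is the only query that survives.
    if (array_key_exists('post', $pairs)) {
        return 'post=' . $pairs['post'];
    }

    // Otherwise keep only 'start' and 'perpage', in that order.
    $kept = array();
    foreach (array('start', 'perpage') as $key) {
        if (array_key_exists($key, $pairs)) {
            $kept[] = $key . '=' . $pairs[$key];
        }
    }
    return implode('&', $kept);
}

// sanitize_query('post=5&s=make+money+online')  returns 'post=5'
// sanitize_query('s=spam&q=junk')               returns ''
// sanitize_query('perpage=10&junk=1&start=20')  returns 'start=20&perpage=10'
```

A caller would redirect whenever the result differs from the original query string. That also normalizes letter case and key order, which is a small departure from clean_queries() above, where an all-allowed query is left alone.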

I also added "rel=canonical" attributes in the header as appropriate.
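For reference, a canonical hint is usually written as a link element inside the page's head. A minimal sketch of generating one in PHP follows; the function name is hypothetical, and which URL counts as canonical for a given page is up to you.

```php
<?php
// Hypothetical helper: builds a canonical <link> element for a page.
// $canonical_url is whatever you consider the "real" address of the page.
function canonical_link_tag($canonical_url)
{
    // Escape the URL so characters like & and " can't break the attribute.
    $href = htmlspecialchars($canonical_url, ENT_QUOTES, 'UTF-8');
    return '<link rel="canonical" href="' . $href . '">';
}

// Inside the page's <head>, something like:
//   echo canonical_link_tag('http://zapatopi.net/blog/');
```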

(Post changes: I removed the rage-update and all references to the spammer's site since I'm tired of seeing their name and it's irrelevant to the problem anyway. I'll keep the post up though since it might be of use to others.)
