Generating Google Webmaster Tools Disavow Text File in Linux

I had a domain that makes me about a grand each month get spanked badly by Google due to the work of an outsourced SEO. Luckily Google released a tool to disavow bad links. I suggest reading Google's blog post on the matter at http://googlewebmastercentral.blogspot.com/2012/10/a-new-tool-to-disavow-links.html to make sure you are using the tool properly and know when to use it. With that out of the way, I created a number of custom scripts and learned some nifty shell commands to help me out in the process.

What I Needed

  • I needed a list of all my links. I received a list of 23,000 from Google Webmaster Tools even though my domain has over 50,000 (terrible, I know).
  • For each one, find out if the page still links to my site.
  • Determine whether it is a good link or a bad link.
  • Do all of this fast.

From the resulting data I would then contact the site owner and see if they could remove the link. If not, I’d add them to a section in the text file and inform Google they would not remove it. If the link came from a bad domain I’d inform Google to not count any links from that domain. If the link was eventually removed I’d add that to a special section in the file and inform Google the link was no longer there.
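
For reference, the disavow file itself is just plain text: a bare URL disavows a single page, a line starting with domain: disavows every link from that domain, and lines starting with # are comments. A file organized the way I describe above would look roughly like this (the domains below are invented for illustration):

    # Owner contacted, refused to remove the link
    http://some-blog.example.net/post-with-my-link.html

    # Bad domain, do not count any links from it
    domain:spam-directory.example.org

    # Links that have since been removed (kept here for my own records)
    # http://old-directory.example.net/listing.html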

How I Did It

  • Download links from Google Webmaster Tools. Download both the Sample and the Recent Links.
  • Concatenate these into a single file with cat (see the command sketch after this list).
  • Remove the date column from the resulting file.
  • Remove all quotes from the file.
  • Run disavow.php, which generates domain-links.csv.
  • Since domain-links.csv could end up containing a lot of links (23,000 in my case), checking all those pages could take quite some time, so split it into multiple files and run 1-n instances of a script to complete the job faster. The split command in the sketch after this list breaks the file into 256k chunks.
  • Next, run checker.php in multiple instances. To do this, open multiple shell tabs and run the command shown in the sketch below, one instance per tab.
  • checker.php accepts a single parameter, an integer. This is how the script knows which split file to look at and what the resulting file name will be; for the first chunk that is checker-00.csv.
  • When done, recombine all the checker files into a single file using cat again, just like at the beginning.
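
Here is a rough sketch of the shell side of the steps above. Apart from domain-links.csv, which disavow.php writes, the file names and the split prefix are placeholders I picked for illustration, so check the scripts for the exact names they expect:

    # 1. Concatenate the two Webmaster Tools downloads into one file
    cat sample-links.csv latest-links.csv > all-links.csv

    # 2. Keep only the URL column (assuming the date is the second column)
    #    and strip all quotes
    cut -d',' -f1 all-links.csv | tr -d '"' > links.csv

    # 3. Generate domain-links.csv
    php disavow.php

    # 4. Split domain-links.csv into 256k chunks; -C keeps whole lines
    #    together and -d gives numeric suffixes (00, 01, 02, ...)
    split -d -C 256k domain-links.csv domain-links-split-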
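
Checking and recombining then looks something like this, one shell tab per chunk; the integer picks which split file to read, as described above (all-checked.csv is just a name I made up for the combined file):

    # first tab: reads the first split file, writes checker-00.csv
    php checker.php 0

    # second tab, and so on for each remaining chunk
    php checker.php 1

    # once every instance has finished, recombine the results
    cat checker-*.csv > all-checked.csv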

Final Notes

Download disavow.php here, checker.php here, and CsvParser.class.php here (my scripts need this class, but you don't need to know anything about it). The resulting data will be in the form domain,url,result. Result is a boolean (0 or 1), where 0 means no link was found and 1 means a link was found. You will want to update the regular expression in checker.php to check for your domain instead of example.com. Once you've combined the checker files, you can use your favorite spreadsheet to organize the data, or modify the script to load the data into a database instead and run more complex queries, and then finally create your text file to upload to Google. I am pretty fucking irate that I had to do this. Enjoy.
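
If you would rather stay in the shell than reach for a spreadsheet, a couple of awk one-liners can split the combined file on the result column. This assumes the combined file is named all-checked.csv as in the sketch above, and that the stray carriage returns mentioned under Bugs below have been stripped first:

    # rows where a link to my site was still found (result column = 1)
    awk -F',' '$3 == 1' all-checked.csv > still-linked.csv

    # rows where the link is already gone (result column = 0)
    awk -F',' '$3 == 0' all-checked.csv > link-gone.csv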

My Results

Of the 23,000 links, only about 3,000 pages still linked to my site, and of those about half were actually good links.

Other Considerations

If you can, ping Google with the URLs that no longer contain your links, or find a way to push links to those pages to get Google to crawl them again. The point here is don't just treat the disavow tool as your only solution; be proactive. Contact site owners, and get Google to recrawl the pages where the links have been removed.

Bugs

It is possible to get carriage returns in the second column of the final checker-xx.csv files. You can remove these using the SUBSTITUTE function in LibreOffice Calc and then move the clean data to a new column using Paste Special.
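
If you would rather clean this up in the shell before importing anything, stripping the carriage returns with sed (or tr) does the same job:

    # remove stray carriage returns from the checker files in place (GNU sed)
    sed -i 's/\r//g' checker-*.csv

    # or, per file, without editing in place
    tr -d '\r' < checker-00.csv > checker-00-clean.csv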
