28 12 2012
Generating Google Webmaster Tools Disavow Text File in Linux
A domain of mine that generates about a grand each month got spanked badly by Google due to the work of an outsourced SEO. Luckily, Google released a tool to disavow bad links. I suggest reading Google's blog post on the matter at http://googlewebmastercentral.blogspot.com/2012/10/a-new-tool-to-disavow-links.html to ensure you use the tool properly and know when to use it. With that out of the way, I created a number of custom scripts and learned some nifty shell commands to help me through the process.
What I Needed
- I needed a list of all my links. Google Webmaster Tools gave me 23,000 even though my domain has over 50,000 (terrible, I know).
- I needed to find out whether each linking page still has a link to my site.
- I needed to determine whether each link is a good link or a bad link.
- I needed to do all of this fast.
From the resulting data I would then contact each site owner and ask them to remove the link. If they would not, I’d add the link to a section in the text file telling Google the owner refused to remove it. If the link came from a bad domain, I’d tell Google not to count any links from that domain. If the link was eventually removed, I’d add it to a special section in the file telling Google the link was no longer there.
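For reference, the disavow file is plain text with one entry per line: a full URL disavows a single page, a `domain:` prefix disavows a whole domain, and lines starting with `#` are comments. Here is a minimal sketch of how the sections described above could be assembled. The function name and the three one-entry-per-line input files are hypothetical, and writing the already-removed links as comments only is my assumption, not something the tool requires.

```shell
#!/bin/sh
# Sketch only: build_disavow and its three input lists are hypothetical names.
build_disavow() {
    bad_domains=$1    # bare domains, one per line
    refused_urls=$2   # URLs whose owners would not remove the link
    removed_urls=$3   # URLs where the link is already gone

    echo "# Bad domains: ignore every link from these"
    sed 's/^/domain:/' "$bad_domains"

    echo "# Owner contacted, link not removed"
    cat "$refused_urls"

    echo "# Link already removed (kept as comments, assumption)"
    sed 's/^/# /' "$removed_urls"
}

# Usage: build_disavow bad-domains.txt refused-urls.txt removed-urls.txt > disavow.txt
```

Keeping each group in its own input file means the disavow file can be regenerated whenever a site owner replies.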
How I Did It
- Download links from Google Webmaster Tools. Download both the Sample and the Recent Links.
- Concatenate these into a single file:
cat list-1.csv list-2.csv >> dislinks.csv
- Remove the date column from the resulting file.
- Remove all quotes from the file.
- Run disavow.php which generates domain-links.csv
- Since domain-links.csv could end up containing lots of links (23,000 in my case) it could take quite some time to check all these pages. So we are going to split it up into multiple files and run 1-n instances of a script to complete the job faster. You can split this up as follows:
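split --line-bytes=256k -d domain-links.csv domain-link-split
- The above shell command will split the file into multiple chunks of at most 256k each. Using --line-bytes rather than --bytes keeps every line whole, so no URL gets cut in half at a chunk boundary.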
- Next run checker.php in multiple instances. To do this open multiple shell tabs and type the following command:
php checker.php 0
- The script accepts a single parameter, an integer. This is how it knows which domain-link-split file to look at and what the resulting file name will be. In this case it will be checker-00.csv.
- When done, you can recombine all these checker files into a single file using the cat command again, as we did at the beginning:
cat checker-*.csv >> checked-links.csv
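The manual clean-up steps (removing the date column and the quotes) and the multiple shell tabs could also be handled from the command line. The following is a rough sketch, not the scripts I actually used. The function names are hypothetical. It assumes the URL is the first comma-separated column and contains no commas itself, and that checker.php takes the chunk number exactly as shown above.

```shell
#!/bin/sh
# Sketch only: clean_links and run_checkers are hypothetical helper names.

# Drop all double quotes, then keep only the first comma-separated column.
# Assumes the URL is column 1 and contains no commas of its own.
clean_links() {
    tr -d '"' < "$1" | cut -d, -f1
}

# Start one checker per chunk in the background instead of one per shell tab.
# $1 is the number of domain-link-split files produced by split.
run_checkers() {
    i=0
    while [ "$i" -lt "$1" ]; do
        php checker.php "$i" &
        i=$((i + 1))
    done
    wait    # return only after every background checker has finished
}

# Usage:
#   clean_links dislinks.csv > dislinks-clean.csv
#   run_checkers 6
```

Every chunk is checked at the same time, so with a large number of chunks you may prefer to launch them in smaller batches.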
Of the 23,000 links only about 3,000 still had links to my site on them and of those about half were actually good links.
If you can, ping Google with the URLs that no longer contain your links, or find a way to push links to those pages so that Google crawls them again. The point here is not to think of the disavow tool as your only solution. Be proactive: contact site owners, and find a way to get Google to recrawl the pages where your links have been removed.
It is possible to get carriage returns in the 2nd column of the final checker-xx.csv file. You can remove these using the SUBSTITUTE function in LibreOffice Calc and then move the clean data to a new column using Paste Special.
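If the stray characters are plain carriage returns (`\r`), a command-line alternative to the spreadsheet clean-up is to delete them with tr. This is a small sketch with a hypothetical function name. Note that it removes every carriage return in the file, so Windows-style line endings become Unix ones as well.

```shell
#!/bin/sh
# Sketch only: strip_cr is a hypothetical helper name.
# Deletes every carriage return (\r) from the file given as $1.
strip_cr() {
    tr -d '\r' < "$1"
}

# Usage: strip_cr checked-links.csv > checked-links-clean.csv
```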