Oct 12 2012
Scrape Faster with PHP DomDocument and Safely with Tor
I was recently tasked with a rather complex project that involved scraping hundreds of thousands of HTML documents. Normally scraping is quite easy. I have a lot of experience with it and just use the wonderful Simple HTML DOM library. Simple HTML DOM has some issues, though. It chokes on large HTML documents. When running inside a loop you can easily hit segmentation faults, and even if you manually clear memory with unset() you can still run into memory exhaustion errors. For some projects Simple HTML DOM just doesn’t cut it.
So I began looking at other solutions, the most promising of which was DOMDocument. DOMDocument wasn’t without its own problems, though: its parser quickly fails on invalid HTML. I tried some additional solutions and then tried optimizing Simple HTML DOM. Nothing was working, and then it hit me: I don’t need the entire HTML document. I only need a specific piece to be parsed. To get it, I would fetch the entire page via cURL and then cut it down with stripos() and substr().
Cutting Out The Fat
- Find the starting point I needed using stripos()
- Remove all text before that point using substr()
- Find the ending point I needed, again using stripos()
- And again remove everything after that point using substr()
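```php
// Find where the listing starts and drop everything before it.
$charAt = stripos($html, '<ul class=Listing');
$htmlList = substr($html, $charAt);
// Find the closing tag and keep everything up to and including it.
$charLast = stripos($htmlList, '</ul>');
$htmlList = substr($htmlList, 0, $charLast + 5); // 5 = strlen('</ul>')
```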
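The four steps above can be wrapped in a small helper that also copes with a missing marker, since stripos() returns false when it finds nothing. This is only a minimal sketch: the function name extractFragment() and the sample markup are made up for illustration.

```php
<?php
/**
 * Return the part of $html from $startMarker through $endMarker (inclusive),
 * or null when either marker cannot be found. Matching is case-insensitive.
 * extractFragment() is a hypothetical helper, not part of any library.
 */
function extractFragment($html, $startMarker, $endMarker)
{
    $start = stripos($html, $startMarker);
    if ($start === false) {
        return null; // start marker missing
    }

    // Look for the end marker only after the start marker.
    $end = stripos($html, $endMarker, $start + strlen($startMarker));
    if ($end === false) {
        return null; // end marker missing
    }

    return substr($html, $start, $end - $start + strlen($endMarker));
}

// Made-up sample page for illustration.
$page = '<html><body><div>nav</div>'
      . '<ul class=Listing><li><a href="/a">A</a></li></ul>'
      . '<div>footer</div></body></html>';

$fragment = extractFragment($page, '<ul class=Listing', '</ul>');
```

Note that this stops at the first closing tag after the start marker, so a listing that contains nested lists would be cut short and would need a more specific end marker.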
Only What We Need
As luck would have it, the portion of HTML I needed was valid enough for DOMDocument to parse. The parser threw out some trivial warnings that I decided to just suppress with the @ operator. Next was looping through the node list the parser returned.
Code:
```php
// Parse only the trimmed fragment; @ silences warnings about malformed markup.
$o = new DOMDocument();
@$o->loadHTML($htmlList);

$li = $o->getElementsByTagName('li');
foreach ($li as $i) {
    // Each list item holds the link(s) we are after.
    $a = $i->getElementsByTagName('a');
    foreach ($a as $href) {
        $url  = $href->getAttribute('href');
        $name = $href->nodeValue;
    }
}
```
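Instead of silencing warnings with @, libxml can be told to collect parse errors internally. The sketch below does that and gathers the links into an array. The helper name extractLinks() and the sample fragment are assumptions for illustration, not the code used in the project.

```php
<?php
/**
 * Parse an HTML fragment and return the links found inside <li> elements
 * as a list of ['url' => ..., 'name' => ...] arrays.
 * extractLinks() is a hypothetical helper name.
 */
function extractLinks($htmlFragment)
{
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($htmlFragment);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('li') as $li) {
        foreach ($li->getElementsByTagName('a') as $a) {
            $links[] = array(
                'url'  => $a->getAttribute('href'),
                'name' => trim($a->nodeValue),
            );
        }
    }
    return $links;
}

// Made-up fragment for illustration.
$fragment = '<ul class=Listing>'
          . '<li><a href="/items/1">First item</a></li>'
          . '<li><a href="/items/2">Second item</a></li>'
          . '</ul>';

$links = extractLinks($fragment);
```

Returning a plain array keeps the DOMDocument object local to the function, so it can be freed as soon as each page has been processed.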
This REALLY reduced memory consumption and execution time compared to Simple HTML DOM. I don’t have benchmarks, but it’s A LOT faster. The memory overhead of one instance of this script was so low, in fact, that I fired up 8 instances on a quad-core 8GB machine. The overhead was low enough that I was able to program with NetBeans open and have multiple browser tabs open while it chugged away. This was aided by the fact that I cut out Apache (no mod_php) and ran the script directly from the PHP CLI.
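One way to split the work across several CLI instances is to give each process a worker index and have it take every Nth URL. The post doesn’t say how the jobs were actually divided, so this is only an assumed scheme; the script name, argument order, and URLs are hypothetical.

```php
<?php
/**
 * Return the URLs assigned to one worker: every $workerCount-th URL,
 * starting at $workerIndex (0-based). Hypothetical helper.
 */
function urlsForWorker(array $urls, $workerIndex, $workerCount)
{
    $mine = array();
    foreach (array_values($urls) as $position => $url) {
        if ($position % $workerCount === $workerIndex) {
            $mine[] = $url;
        }
    }
    return $mine;
}

// Made-up URL list; in practice this would come from a file or database.
$allUrls = array(
    'http://example.com/page/1',
    'http://example.com/page/2',
    'http://example.com/page/3',
    'http://example.com/page/4',
    'http://example.com/page/5',
);

// Assumed invocation: php scrape.php <workerIndex> <workerCount>
$workerIndex = isset($argv[1]) ? (int) $argv[1] : 0;
$workerCount = isset($argv[2]) ? max(1, (int) $argv[2]) : 1;

foreach (urlsForWorker($allUrls, $workerIndex, $workerCount) as $url) {
    echo $url, PHP_EOL; // the fetch-and-parse step would go here
}
```

With this layout, eight instances would be started as `php scrape.php 0 8` through `php scrape.php 7 8`, and no two of them would touch the same URL.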
Like A Bandit
Now for the fun part. To reduce the risk of having my IP blocked I wanted to spread the jobs across multiple IP addresses. To do this I used the Tor network. This is really simple to do once you have Tor installed.
Code:
```php
$curl = curl_init();
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
curl_setopt($curl, CURLOPT_URL, 'http://whatismyipaddress.com/');
curl_setopt($curl, CURLOPT_TIMEOUT, 60);
curl_setopt($curl, CURLOPT_HTTPGET, 1);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
// Route the request through the local Tor SOCKS5 proxy.
curl_setopt($curl, CURLOPT_PROXY, '127.0.0.1:9050');
curl_setopt($curl, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
$html = curl_exec($curl);
```
That bit of code executes an HTTP GET request to whatismyipaddress.com, and the returned HTML will show some random IP address that isn’t yours. The magic settings in there are CURLOPT_PROXY and CURLOPT_PROXYTYPE. You can expect some slowness going through Tor, but you can’t optimize peace of mind. Despite Tor’s slowness and the number of checks and balances I needed when adding the data to the database, I was able to parse hundreds of thousands of pages in roughly a week’s time.
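The request can also be wrapped in a function that reports failures, since curl_exec() returns false when the transfer fails (for example, when Tor isn’t running). This is a sketch under the same assumption as the code above, namely that Tor is listening as a SOCKS5 proxy on 127.0.0.1:9050. The function name fetchViaTor() is made up.

```php
<?php
/**
 * Fetch $url through a local Tor SOCKS5 proxy and return the response body.
 * Throws RuntimeException when the transfer fails.
 * fetchViaTor() is a hypothetical helper; the default proxy address assumes
 * Tor's SOCKS port is 127.0.0.1:9050, as in the post.
 */
function fetchViaTor($url, $proxy = '127.0.0.1:9050')
{
    $curl = curl_init();
    curl_setopt_array($curl, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 60,
        CURLOPT_PROXY          => $proxy,
        // CURLPROXY_SOCKS5_HOSTNAME is an alternative that also lets
        // the proxy resolve hostnames instead of the local machine.
        CURLOPT_PROXYTYPE      => CURLPROXY_SOCKS5,
    ));

    $html = curl_exec($curl);
    if ($html === false) {
        // Surface the cURL error instead of passing false to the parser.
        throw new RuntimeException('Request failed: ' . curl_error($curl));
    }

    return $html;
}
```

A call such as `$html = fetchViaTor('http://whatismyipaddress.com/');` inside a try/catch block lets the scraper log a failed page and move on rather than feeding an empty result to the parser.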


Hey, I know this post was made a long time ago, but I wanted to ask. I am making a web scraper for YouTube so I can crawl the pages, pull video IDs, and save them in an XML file. Would this method work for that as well? I know I would be able to get a video ID from the tag. However, I know I will still need to pull the ID out of the URL string in the tag; I already have the regex code for that, though.
Yeah, it will work with some tweaking. I’ve never looked at YouTube’s source code, so you’ll just need to hack around till you get it. With scraping YouTube I’d be worried about them blocking IPs. I know it’s difficult to scrape Google results in general, so I assume they’ve put similar safeguards in place with YouTube.