Scrape Faster with PHP DomDocument and Safely with Tor

I was recently tasked with a rather complex project involving scraping hundreds of thousands of HTML documents. Normally scraping is quite easy. I have a lot of experience with it and just use the wonderful Simple HTML DOM library. Simple HTML DOM has some issues, though. It chokes on large HTML documents, and when running inside a loop you can easily hit segmentation faults. Even if you manually clear memory through unset(), you can still run into memory exhaustion errors. For some projects Simple HTML DOM just doesn’t cut it.

So I began looking at other solutions, the most promising of which was DOMDocument. DOMDocument wasn’t without its own problems, though: its parser quickly fails on invalid HTML. I tried some additional solutions and then tried optimizing Simple HTML DOM. Nothing was working, and then it hit me. I don’t need the entire HTML document. I only need a specific piece to be parsed. To do this I would fetch the entire page via cURL and then trim it down using stripos() and substr().

Cutting Out The Fat

  • Find the starting point I needed using stripos()
  • Remove all text before that point using substr()
  • Find the ending point I needed, again using stripos()
  • And again remove everything after that point using substr()
Code:
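A minimal sketch of that trimming step. The URL and the start/end markers here are hypothetical placeholders; substitute whatever unique strings bound the fragment you actually need:

// Fetch the full page via cURL first.
$ch = curl_init('http://example.com/listings.html'); // hypothetical URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$scraped_page = curl_exec($ch);
curl_close($ch);

// Hypothetical markers bounding the fragment we care about.
$start_marker = '<table id="results">';
$end_marker   = '</table>';

// 1) Find the starting point.
$start = stripos($scraped_page, $start_marker);

// 2) Remove all text before that point.
$fragment = substr($scraped_page, $start);

// 3) Find the ending point in what remains.
$end = stripos($fragment, $end_marker);

// 4) Remove everything after that point (keeping the closing tag).
$fragment = substr($fragment, 0, $end + strlen($end_marker));

// In real code, check both stripos() calls for false before calling substr().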

Only What We Need

As luck would have it, the portion of HTML I needed was valid enough for DOMDocument to parse. The parser threw some trivial warnings that I decided to just @suppress. Next was looping through the node list the parser returned.

Code:
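A sketch of that step, continuing from the trimmed $fragment above; the tr/td tag names are assumptions for illustration:

// Parse the trimmed fragment; @ suppresses libxml's warnings on imperfect HTML.
$doc = new DOMDocument();
@$doc->loadHTML($fragment);

// getElementsByTagName() returns a DOMNodeList, which loops like an array.
foreach ($doc->getElementsByTagName('tr') as $row) {
    foreach ($row->getElementsByTagName('td') as $cell) {
        echo trim($cell->textContent), "\n";
    }
}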

This REALLY reduced memory consumption and execution time compared to Simple HTML DOM. I don’t have benchmarks, but it’s A LOT faster. The memory overhead from an instance of this script was so low, in fact, that I fired up 8 instances on a quad-core machine with 8 GB of RAM. The overhead was low enough that I could keep programming with NetBeans open, plus multiple browser tabs, while it chugged away. This was aided by the fact that I cut out Apache (no mod_php) and ran the script directly from the PHP CLI.

Like A Bandit

Now for the fun part. To reduce the risk of having my IP blocked, I wanted to disperse the jobs across multiple IP addresses. To do this I used the Tor network. This is really simple to do once you have Tor installed.

Code:
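A minimal sketch, assuming Tor’s SOCKS proxy is listening on its default of 127.0.0.1:9050 (the Tor Browser bundle often listens on 9150 instead, as a commenter notes below):

$ch = curl_init('http://whatismyipaddress.com/');
curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:9050');     // Tor's local SOCKS proxy
curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5); // speak SOCKS5 to it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);
echo $html; // the page reports a Tor exit node's IP, not yours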

That bit of code executes an HTTP GET request to whatismyipaddress.com, and the returned HTML will show some random IP address that isn’t yours. The magic configurations in there are CURLOPT_PROXY and CURLOPT_PROXYTYPE. You can expect some slowness going through Tor, but you can’t optimize peace of mind. Despite Tor’s slowness and the amount of checks and balances I needed when adding the data to the database, I was able to parse hundreds of thousands of pages in roughly a week’s time.


9 Comments

  • CJ says:

    Hey, I know this post was made a long time ago; however, I wanted to know: I am making a web scraper for YouTube, so I can crawl the web pages, pull video IDs, and save them in an XML file. Would this method work for that as well? I know I would be able to get a video ID from the tag; however, I know I will still need to put the ID from the URL string in the tag. I already have the regex code for that, however :)

    • chris says:

      Yeah, it will work with some tweaking. I’ve never looked at YouTube’s source code, so you’ll just need to hack around till you get it. With scraping YouTube I’d be worried about them blocking IPs. I know it’s difficult to scrape Google results in general, so I assume they’ve put similar safeguards in place with YouTube.

  • doug says:

    I used this Tor script. To determine which port is being used (it might not be 9050), open the Tor Browser and go to Preferences; it should say which port it is using for SOCKS v5 (it was 9150 for me).

    Note that with Simple HTML DOM you get many errors, so:

    1) Clean up memory on each loop iteration, or for each new object created, like so:

    $html->clear();
    unset($html);

    or at the end of each loop iteration

    $row->clear();
    unset($row);

    2) Check the object’s integrity before calling find(), especially in big loops or if you are processing many pages.

    e.g.
    $variables = get_object_vars($table);
    if (isset($variables['parent'])) {
        // The object looks intact, so it is safe to call find().
        foreach ($table->find('tr') as $row) {
            // process $row
        }
    } else {
        // handle the broken object (skip it, log it, etc.)
    }

    Check isset($variables['parent']), or some other part of the object you would expect to be set if the object isn’t null; otherwise you can often get errors. Sometimes parts of the object are null, or the object itself is actually null, but you can only easily get at the public properties (e.g. parent), which are all null.

    Once you see it is set, and therefore not null, do the find() in your loop or however you like. Before find() you really should check the object’s integrity or you can get weird errors. I checked, and this bug is documented; the above was just the way I got around it in my scripts.

    3) Many scripts also overrun PHP’s execution timeout, so increase it beforehand to be safe, e.g. set_time_limit(800). I have had scripts overshoot even that when they take a few hours.

    Maybe also limit the rate of your loops with sleep() if you are pelting a small site. If you are using Tor you can also change the IP with shell commands, and can therefore handle failures to connect with a change of IP. You can execute the shell commands from PHP using shell_exec(); Google the commands if you need to do this.
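    For instance, instead of shelling out, one common approach is to ask Tor for a new circuit over its control port. A rough sketch, assuming ControlPort 9051 is enabled in torrc with a control password set (both are assumptions about your setup):

    // Ask Tor for a fresh circuit (and usually a fresh exit IP).
    $fp = fsockopen('127.0.0.1', 9051);               // Tor ControlPort; must be enabled in torrc
    fwrite($fp, "AUTHENTICATE \"yourpassword\"\r\n"); // placeholder password (see HashedControlPassword)
    fwrite($fp, "SIGNAL NEWNYM\r\n");                 // request a new circuit
    fwrite($fp, "QUIT\r\n");
    fclose($fp);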

    Cheers.

  • Aditya says:

    Hi

    This is very interesting information. Can you please answer my two questions:

    1) Will Tor be able to scrape data from sites which aggressively block IPs? (I fetch some pricing info.)

    2) I have written a VBScript crawler. Will I be able to apply Tor to it? To handle blocking, will I be able to rotate the Tor IPs?

    thanks

  • Odi says:

    Hi Chris,

    I’m struggling to implement your “cutting out the fat” section; here is my code:

    $findstart = stripos($scraped_page, '');
    $removestart = substr($scraped_page, $findstart,strlen($scraped_page));
    $findend = stripos($removestart,'');
    $removeend = substr($removestart, 0, $findend);

    The problem is, that code gives me a blank page if I scrape directly from the URL of my target page.

    The funny part is, if I scrape a local copy of my target page, the above code works and my script outputs as expected.

    Echoing $scraped_page gives me the complete HTML page, so there is nothing wrong with scraping the HTML of my target page.

    Could you please help me find a solution for this?

    thanks Chris

    • chris says:

      Is the URL returning HTML or not? If so, figure out what’s wrong with your code; if not, figure out why the server is blocking you by looking at things like cookies and other cURL settings.

  • Maxx Thordendal says:

    Really useful tutorial. One thing I’d like to ask, seeing that you had to scrape thousands of HTML pages: did you enable curl_multi_exec, which would let you do parallel scraping? If you did, was there any significant performance improvement? My recent project also involves large-scale scraping similar to yours, and on top of that the system has a time constraint. I had to use pthreads for parallel processing, and the time constraint still isn’t being met, so I was wondering whether you had used curl_multi_exec and have some useful insight regarding it.
    Anyway, thanks for this tutorial. It really helped a lot.

    Maxx

    • chris says:

      No, I just ran multiple instances of the script in parallel from the PHP command line. I did not know about curl_multi_exec at the time; even still, just running 8 or so instances of the script was sufficient. It still took a while, but considering the number of pages I had to scrape (an enormous amount) I was satisfied with the turnaround, especially considering this was all running through Tor. They never knew what hit them ;-)

      I can’t stress enough that running it through a browser/Apache will only slow you down. The PHP CLI is the way to go on this, or use another language more suited to the task.