Yii scraper

Hi,

Are there any libraries, particularly for Yii, that can scrape web pages for data?

I need to get certain data out of HTML that is fed through a GET request.

Does anyone know of any libraries? And have you personally used the library which you are recommending?

James.

It is not specific to Yii, but Goutte crawls websites and extracts data from the responses (requires PHP 5.3).

From the website:

"Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses."

Example of sending a request:

require_once '/path/to/goutte.phar';

use Goutte\Client;

// create a client and fetch the page with a GET request
$client = new Client();
$crawler = $client->request('GET', 'http://www.example.org/');


Some examples of extracting data using its CSS selector method, filter():




// get all nodes with the class "error_list"
$nodes = $crawler->filter('.error_list');

// get document title
$title = $crawler->filter('title')->text();

// get the form that the submit button belongs to
$form = $crawler->filter('input[type=submit]')->form();
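
If pulling in a library is not an option, the same kind of extraction can be sketched with PHP's bundled DOM extension. This is not Goutte's API: extractTexts() is a made-up helper name, the sample markup stands in for the body of a GET response, and the type declarations assume PHP 7 or later. It uses XPath rather than CSS selectors.

```php
<?php
// Hypothetical helper (not part of Goutte): return the trimmed text of every
// node that matches an XPath query.
function extractTexts(string $html, string $xpath): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect parser warnings quietly
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $texts = [];
    foreach ((new DOMXPath($doc))->query($xpath) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Sample markup standing in for the body of a GET response (an assumption).
$html = '<html><head><title>Example page</title></head>'
      . '<body><ul class="error_list">'
      . '<li>Name is required</li><li>Email is invalid</li>'
      . '</ul></body></html>';

$titles = extractTexts($html, '//title');
$errors = extractTexts($html, '//ul[@class="error_list"]/li');
```

Note that @class="error_list" only matches when the class attribute is exactly that string; elements carrying several classes would need a contains() test instead.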



Looks like a sure-fire way of getting your IP banned - if you’re not very careful, of course. :)

Maybe one could look into using the new local storage (HTML5) and use it to incrementally store the website you’re browsing for offline use?

Thanks, I will look into using the Goutte library.

The script is not for my personal use, but thanks for the heads-up anyway.

I didn’t mean to let it sound that way.

Let’s hope Goutte knows how to handle that issue.

Otherwise you could unintentionally be the cause of a DoS attack. :)

We had the Ogre3d.org server brought to its knees several times due to people scraping the site. We banned their IP(s) and the site went back online…

I believe that you can avoid that by careful planning.
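
One piece of that planning is simply spacing requests out. Here is a rough sketch of the idea, not anything Goutte provides: secondsToWait() and fetchPolitely() are made-up names, the 2 second default is an arbitrary assumption, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: how many seconds to wait so that two requests are at
// least $minInterval seconds apart.
function secondsToWait(float $lastRequestAt, float $now, float $minInterval): float
{
    $elapsed = $now - $lastRequestAt;
    return max(0.0, $minInterval - $elapsed);
}

// Hypothetical helper: fetch URLs one at a time, pausing between requests.
// $fetch is any callable that takes a URL and returns the response body.
function fetchPolitely(array $urls, callable $fetch, float $minInterval = 2.0): array
{
    $bodies = [];
    $lastRequestAt = null;

    foreach ($urls as $url) {
        if ($lastRequestAt !== null) {
            $wait = secondsToWait($lastRequestAt, microtime(true), $minInterval);
            if ($wait > 0) {
                // usleep() takes microseconds
                usleep((int) round($wait * 1000000));
            }
        }
        $bodies[$url] = $fetch($url);
        $lastRequestAt = microtime(true);
    }
    return $bodies;
}
```

The $fetch callable could wrap whatever client you end up using, for example a closure around file_get_contents() if allow_url_fopen is enabled. A fixed delay is only a starting point; checking the site's robots.txt and asking the owner are still worth doing.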

I meant it as a heads-up.

I’ve recently used SimpleHTMLDom for this purpose. It’s fairly easy to integrate as an external library. In my case, I contacted the website owner first to make sure it wasn’t an issue. I’m actually leaning towards a client-side solution for version 2.

My needs were fairly simple, so your mileage may vary.