Yii 1.1: yiiscrapermodule

YiiScraperModule is a base scraper module for retrieving information from the Internet.

YiiScraperModule

Overview

YiiScraperModule is a base scraper module for retrieving information from the Internet. It parses received HTML pages with the simple_html_dom parser component and extracts URLs from each page. The URLs are saved to a DB table and requested later. By default, the HTML content is saved to a DB table as well. You can select the target containers for parsed URLs and stored content, or write your own functions for parsing URLs and storing content.

The module is also designed to run from cron, so you can use it for periodic scraping. If the DB table for links is empty, it is filled with the seeds. The scraper then takes each active URL from the table, marks it as non-active and requests it. The received HTML content is parsed and new URLs are saved to the same table. Scraping terminates when any one of the limits is exhausted; the respective message appears in the scraper logs table. After such a termination, the next run requests the first active URL, so the work continues where it stopped. If there are no active URLs in the table, the scraper stops. On its next run it clears the URL table, inserts the seeds, and the process starts again from the very beginning.
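The run cycle described above can be sketched in plain PHP. This is only an illustration under stated assumptions, not the module's actual code: the links table is replaced by an in-memory array (URL => active flag), the function name runScraperOnce and its parameters are hypothetical, and the only limit modelled is a maximum number of received documents.

```php
<?php
/**
 * One scraper run over an in-memory stand-in for the links table.
 * $table maps URL => true (active) or false (already requested).
 * $fetch($url) returns content; $extractLinks($url, $content) returns URLs.
 * All names here are hypothetical and only illustrate the described cycle.
 */
function runScraperOnce(array &$table, array $seeds, $fetch, $extractLinks, $maxDocuments)
{
    // Empty table or no active URLs left: clear it and start over from the seeds.
    if (!in_array(true, $table, true)) {
        $table = array_fill_keys($seeds, true);
    }

    $received = 0;
    // Take the first active URL until none is left.
    while (($url = array_search(true, $table, true)) !== false) {
        if ($received >= $maxDocuments) {
            return 'limit reached'; // the remaining active URLs wait for the next run
        }
        $table[$url] = false; // mark as non-active before requesting
        $content = call_user_func($fetch, $url);
        $received++;

        // Save newly found URLs to the same table as active entries.
        foreach (call_user_func($extractLinks, $url, $content) as $newUrl) {
            if (!array_key_exists($newUrl, $table)) {
                $table[$newUrl] = true;
            }
        }
    }
    return 'finished';
}
```

Calling the function repeatedly with the same $table mimics consecutive cron runs: a run that hits the limit leaves active URLs behind, the next run picks them up, and a run that finds no active URLs resets the table to the seeds.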

Features

  • Automated install and uninstall.
  • Designed to run periodically, from a browser or cron.
  • Detects the content charset and converts it to UTF-8.
  • Configurable limits for the scraping process (duration, received data size, received document count, maximum scraping depth).
  • Can restrict scraping to specified URLs, if needed.
  • Logging system (when the scraping process started; how many bytes, documents and HTML documents were received; how many new URLs were saved to the DB; the process status).
  • Stores URL relations in a separate table.
  • Uses the simple_html_dom extension to parse received data. You can use CSS selectors to define which URLs are added and which content is stored in the DB table, or write your own callback functions to handle these tasks.
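As an illustration of the charset feature, here is a minimal sketch of how fetched content could be converted to UTF-8 with the mbstring extension. It is not the module's own code: the helper name convertToUtf8 is hypothetical, and the charset is read only from a charset=... declaration near the top of the HTML, which is a simplifying assumption.

```php
<?php
/**
 * Hypothetical helper, not part of YiiScraperModule.
 * Finds a charset declaration in the HTML and converts the content to UTF-8.
 */
function convertToUtf8($content, $defaultCharset = 'UTF-8')
{
    $charset = $defaultCharset;

    // Search only the first 1024 bytes, where <meta> declarations normally live.
    $head = substr($content, 0, 1024);
    if (preg_match('/charset\s*=\s*["\']?([A-Za-z0-9_\-]+)/i', $head, $m)) {
        $charset = $m[1];
    }

    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $content; // nothing to convert
    }

    // Note: a charset name unknown to mbstring is not handled in this sketch.
    return mb_convert_encoding($content, 'UTF-8', $charset);
}
```

A real implementation would likely also look at the Content-Type response header and guard against unsupported charset names.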

Requirements

  • Yii 1.1 or above
  • cURL and mbstring PHP extensions

Usage

To use the module, copy it to the '.../protected/modules/' folder. Then add the following lines to the 'modules' section of your '.../protected/config/main.php' file:

...
    'modules'=>array(
        'gii'=>array(   ...     ),
        ...
        'yscraper' => array(
            'class' => 'application.modules.YiiScraper.YiiScraperModule',
            'installMode' => true,
            'seeds' => 'http://pravda.com.ua',
            'insideUrlsOnly' => 'pravda.com.ua',
            'maxDuration' => 1000,
            'contentSelector' => 'div#content',
        ),
        ...
    ),
...

Note that the 'installMode' option must be true. Next, install the scraper by opening this link in your browser:

http://your.domainname.com/index.php?r=yscraper/default/install

or, to uninstall:

http://your.domainname.com/index.php?r=yscraper/default/uninstall

Then remove the line

'installMode' => true,

from the config, adjust the other settings and run the scraper with the command:

Yii::app()->getModule('yscraper')->run();
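After installation, the adjusted module entry might look like the fragment below. This is a sketch that uses only the options already shown on this page. The array form of 'seeds' follows the author's note in the comments that it may be a string or an array; the second seed URL is a made-up example.

```php
'yscraper' => array(
    'class' => 'application.modules.YiiScraper.YiiScraperModule',
    // the 'installMode' line has been removed after installation
    'seeds' => array(
        'http://pravda.com.ua',
        'http://pravda.com.ua/news/', // hypothetical second seed
    ),
    'insideUrlsOnly' => 'pravda.com.ua',
    'maxDuration' => 1000,
    'contentSelector' => 'div#content',
),
```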

If you want to use callbacks, you can reference static methods of previously imported classes in your config file:

...
    'yscraper' => array(
        ...
        'contentCallback' => 'SomeModelClass::someCallbackFunction',
    ),
...
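A class providing such a static callback might look like the sketch below. The class name PageTitleCollector, its method and the in-memory storage are hypothetical; the only thing taken from this page is that the callback receives the URL and the content received from that URL.

```php
<?php
/**
 * Hypothetical callback class, not part of YiiScraperModule.
 * Collects the <title> of every page passed to the content callback.
 */
class PageTitleCollector
{
    /** @var array Map of URL => page title, filled by collect(). */
    public static $titles = array();

    /**
     * Content callback: receives the URL and the content fetched from it.
     */
    public static function collect($currentURL, $content)
    {
        if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $content, $m)) {
            $title = html_entity_decode($m[1], ENT_QUOTES, 'UTF-8');
            self::$titles[$currentURL] = trim($title);
        }
    }
}
```

With this class imported, the config entry would read 'contentCallback' => 'PageTitleCollector::collect'. A real callback would more likely save the result to a model than to a static array.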

Or you can set callbacks at runtime in your controller:

public function actionIndex()
{
    ...
    Yii::app()->getModule('yscraper')->contentCallback = array($this, 'someCallbackFunction');
    Yii::app()->getModule('yscraper')->run();
    ...
}

...

public function someCallbackFunction($currentURL, $content)
{
    // process content here
}

Note that the linkCallback and contentCallback functions receive two arguments: the URL and the content received from that URL. Also note that the linkCallback function should return an array of URLs to scrape later.
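A linkCallback following this contract could be sketched as below. The class and method names are hypothetical, and the sketch uses PHP's built-in DOM extension instead of simple_html_dom so that it runs on its own. As a simplification, it keeps only absolute http(s) links on the same host as the current page and skips relative links.

```php
<?php
/**
 * Hypothetical link callback, not part of YiiScraperModule.
 * Returns the absolute same-host links found in the page, without duplicates.
 */
class SameHostLinks
{
    public static function extract($currentURL, $content)
    {
        $urls = array();
        if (trim($content) === '') {
            return $urls; // nothing to parse
        }

        // Suppress warnings caused by malformed real-world HTML.
        $previous = libxml_use_internal_errors(true);
        $dom = new DOMDocument();
        $dom->loadHTML($content);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $currentHost = (string) parse_url($currentURL, PHP_URL_HOST);
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $href = trim($anchor->getAttribute('href'));
            if (!preg_match('#^https?://#i', $href)) {
                continue; // relative and non-HTTP links are skipped in this sketch
            }
            $host = (string) parse_url($href, PHP_URL_HOST);
            if (strcasecmp($host, $currentHost) === 0) {
                $urls[$href] = true; // array keys remove duplicates
            }
        }
        return array_keys($urls);
    }
}
```

It could be assigned in the same way as contentCallback, for example as 'SameHostLinks::extract'.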

Total 11 comments

#16689
tihanyilaci at 2014/03/19 04:16pm
great

Hi,

That is a great extension, I have just been digging into it :) I will surely have some questions sometime.

Laszlo from Hungary

#15135
windsurfer at 2013/10/10 10:01am
Must also do this: 'tablePrefix'=>'',// DECLARING THE PREFIX

You must also add tablePrefix, set to two single quotes '' with no space between them, to the 'db' section of your main.php if you don't use a table prefix. If you don't, you'll get an error and spend like an hour trying to figure it out. :) See the code below for an easy add.

'db'=>array(
    'connectionString' => 'mysql:host=localhost;dbname=yourdatabasenamehere',
    'emulatePrepare' => true,
    'username' => 'yourusername',
    'password' => 'yourpassword',
    'charset' => 'utf8',
    'tablePrefix' => '', // DECLARING THE PREFIX
),

#13650
vittron at 2013/06/14 04:28am
Seeds

Hello, samilo! 'seeds' is the URL (or URLs) where scraping starts. It is a string, or an array if there are several seeds. If you want to scrape only inside some area (e.g. just one site, no outside links), you need to specify the 'insideUrlsOnly' option. 'pravda.com.ua' is used only as an example. You can change the URL according to your needs.

Please let me know if you have any questions. Thank you!

#13644
samilo at 2013/06/13 10:07am
Thank you

Thanks for this module. But can you please explain what these lines mean:

'seeds' => 'http://pravda.com.ua',
'insideUrlsOnly' => 'pravda.com.ua',

Thanks in advance

#11468
vittron at 2013/01/15 08:02am
Indeed!

Hi! Yes, you are right, indeed! I will fix that bug. I hope it didn't cause you much inconvenience. Thank you!

#11465
aquasite.pl at 2013/01/15 05:46am
Install Error

Hi, thanks for the extension. You forgot to add the prefix in the Installer SQL query on lines 55 and 56.

It's: tbl_yiiscraper_link
Should be: {$prefix}yiiscraper_link

#11334
beesho at 2013/01/06 07:20pm
Very Nice

Very nice! I will probably be using it sometime. Thank you for your detailed explanation, vittron!

#11333
vittron at 2013/01/06 06:10pm
The goal

Hi, Beesho, thanks for the good question.

A scraper can be used to gather information from other sites, or one site, or part of a site, to process that information and use it for your own purposes.

For example, many portal sites use scrapers. They gather the latest news from other (news) sites and show the news titles with the respective links to portal users. You can use it for your own mini search engine. You can use it to gather info that is dispersed across thousands of site pages in some blocks. Etc...

For example, I developed this scraper because I needed to collect all specimen items from one site. Then I enhanced it, turned it into a module and shared it.

#11330
beesho at 2013/01/06 01:38pm
What can I use this for?

Good work.. Can you please explain to me the practical use of this extension? How could it be of use on a website? Thanks!

#11305
PeRoChAk at 2013/01/05 12:21am
Nice

Nice... will check it... thanks

#11302
vittron at 2013/01/04 09:35am
Feedback is greatly appreciated!

Dear friends! Please feel free to post suggestions, notes and any other feedback.
