yiiscrapermodule: YiiScraperModule is a base scraper module for retrieving information from the Internet.

  1. Overview
  2. Features
  3. Requirements
  4. Usage

YiiScraperModule

Overview

YiiScraperModule is a base scraper module for retrieving information from the Internet. It parses received HTML pages with the simple_html_dom parser component and extracts URLs from each page. The URLs are saved to a DB table and requested later. By default, the HTML content is saved to a DB table as well. You can select target containers for both the parsed URLs and the stored content, or write your own functions for parsing URLs and storing content.

The module is also designed to run from cron, so you can use it for periodic scraping. If the DB table for links is empty, it is filled with the seeds. Each URL is then taken from the table, marked as non-active and requested. The received HTML content is parsed and new URLs are saved to the same table. The scraper stops when any one of its limits is exhausted; the respective message appears in the scraper logs table. After such a stop, the next run requests the first active URL, so the work continues where it left off. If there is no active URL left in the table, the scraper stops. On the following run it clears the URLs table, inserts the seeds and repeats the process from the very beginning.
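
The life cycle described above can be sketched in a few lines. This is only an in-memory illustration, not the module's actual code: the function name runScraperPass(), the row layout, the $fetch and $extract callables and the two limit keys are all assumptions made for the example.

```php
<?php
// Illustrative sketch only: an in-memory model of the URL queue life cycle.
// It does not reproduce the module's real code or DB schema.

/**
 * Runs one scraper pass over the queue.
 *
 * @param array    $queue   rows keyed by URL: array('active' => bool, 'depth' => int)
 * @param array    $seeds   seed URLs used when the queue must be (re)filled
 * @param callable $fetch   hypothetical fetcher: function ($url) returning HTML
 * @param callable $extract hypothetical extractor: function ($url, $html) returning URLs
 * @param array    $limits  array('maxDocuments' => int, 'maxDepth' => int)
 * @return string status message, similar to what a log row might hold
 */
function runScraperPass(array &$queue, array $seeds, $fetch, $extract, array $limits)
{
    // Empty table, or nothing active left from a finished crawl: restart from the seeds.
    $hasActive = false;
    foreach ($queue as $row) {
        if ($row['active']) {
            $hasActive = true;
            break;
        }
    }
    if (!$hasActive) {
        $queue = array();
        foreach ($seeds as $seed) {
            $queue[$seed] = array('active' => true, 'depth' => 0);
        }
    }

    $documents = 0;
    while (true) {
        // Pick the first active URL.
        $current = null;
        foreach ($queue as $url => $row) {
            if ($row['active']) {
                $current = $url;
                break;
            }
        }
        if ($current === null) {
            return 'finished: no active URLs left';
        }
        if ($documents >= $limits['maxDocuments']) {
            return 'stopped: document limit reached';
        }

        // Mark the URL as non-active, then request it.
        $queue[$current]['active'] = false;
        $html = call_user_func($fetch, $current);
        $documents++;

        // Save newly discovered URLs to the same queue, respecting the depth limit.
        $depth = $queue[$current]['depth'] + 1;
        if ($depth > $limits['maxDepth']) {
            continue;
        }
        foreach (call_user_func($extract, $current, $html) as $url) {
            if (!isset($queue[$url])) {
                $queue[$url] = array('active' => true, 'depth' => $depth);
            }
        }
    }
}
```

Calling the function repeatedly with the same $queue mimics successive cron runs: a pass that hits a limit leaves active rows behind for the next pass, and a pass that finds no active rows starts again from the seeds.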

Features

  • Automated install and uninstall.
  • Designed to run periodically, from a browser or cron.
  • Detects the content charset and converts it to UTF-8.
  • You can define limits for the scraping process (duration, received data size, received documents count, max depth to scrape).
  • Can scrape only inside specified URLs, if needed.
  • Logging system (when the scraping process started, how many bytes, documents and HTML documents were received, how many new URLs were saved to DB, what the process status is).
  • Stores URL relations in a separate table.
  • Uses the simple_html_dom extension to parse received data. You can use CSS selectors to define which URLs are added and which content is stored in the DB table, or you can write your own callback functions to handle this task.
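
As an illustration of the charset feature, here is a minimal sketch of one way to detect a charset and convert content to UTF-8 with the mbstring extension. It is not taken from the module: the function name toUtf8(), the order of the checks and the candidate encoding list are assumptions, and charset names unknown to mbstring are not handled.

```php
<?php
// Sketch only: one possible charset detection and UTF-8 conversion with mbstring.
// The function name toUtf8() is hypothetical.

function toUtf8($content, $contentTypeHeader = '')
{
    $charset = null;

    if (preg_match('/charset=["\']?([\w-]+)/i', $contentTypeHeader, $m)) {
        // 1. Prefer the charset declared in the HTTP Content-Type header.
        $charset = $m[1];
    } elseif (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $content, $m)) {
        // 2. Otherwise look for a <meta ... charset=...> declaration in the markup.
        $charset = $m[1];
    }

    if ($charset === null) {
        // 3. Fall back to mbstring's detection over a short candidate list.
        $detected = mb_detect_encoding($content, array('UTF-8', 'Windows-1251', 'ISO-8859-1'), true);
        $charset = $detected !== false ? $detected : 'UTF-8';
    }

    // Nothing to do when the content is already UTF-8.
    if (strcasecmp($charset, 'UTF-8') === 0) {
        return $content;
    }

    return mb_convert_encoding($content, 'UTF-8', $charset);
}
```

For example, toUtf8($body, 'text/html; charset=windows-1251') would return $body re-encoded from Windows-1251 to UTF-8.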

Requirements

  • Yii 1.1 or above
  • CURL and MBSTRING PHP extensions

Usage

To use the module, copy it to the '.../protected/modules/' folder. Then add the following lines to the 'modules' section of your '.../protected/config/main.php' file:

...
	'modules'=>array(
		'gii'=>array(	... 	),
		...
		'yscraper' => array(
			'class' => 'application.modules.YiiScraper.YiiScraperModule',
			'installMode' => true,
			'seeds' => 'http://pravda.com.ua',
			'insideUrlsOnly' => 'pravda.com.ua',
			'maxDuration' => 1000,
			'contentSelector' => 'div#content',
		),
		...
	),
...

Note that the 'installMode' option must be true. Then install the scraper by opening this link in your browser:

http://your.domainname.com/index.php?r=yscraper/default/install

or, to uninstall it, this link:

http://your.domainname.com/index.php?r=yscraper/default/uninstall

After installing, remove the line

'installMode' => true, 

from the config, adjust the other settings and run the scraper with the command:

Yii::app()->getModule('yscraper')->run();

If you want to use callbacks, you can reference static methods of previously imported classes in your config file:

...
	'yscraper' => array(
		...
		'contentCallback' => 'SomeModelClass::someCallbackFunction',
	),
...

Or you can set callbacks at runtime in your controller file:

	public function actionIndex()
	{
		...
		Yii::app()->getModule('yscraper')->contentCallback = array($this, 'someCallbackFunction');
		Yii::app()->getModule('yscraper')->run();
		...
	}

	...

	public function someCallbackFunction($currentURL, $content)
	{
		// process content here
	}

Note that the linkCallback and contentCallback functions receive two arguments: the URL and the content received from that URL. Also note that the linkCallback function should return an array of URLs to scrape later.
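
A linkCallback with that shape might look like the sketch below. The class name ArticleLinks and the method name collect() are hypothetical, and PHP's built-in DOMDocument is used only to keep the example free of dependencies; the module itself parses pages with simple_html_dom. Only absolute http(s) links are returned, and resolving relative links is left out.

```php
<?php
// Sketch of a linkCallback-style function. Class and method names are hypothetical.

class ArticleLinks
{
    /**
     * @param string $currentURL URL the content was received from
     * @param string $content    HTML received from that URL
     * @return array absolute http(s) URLs to scrape later
     */
    public static function collect($currentURL, $content)
    {
        if ($content === '') {
            return array();
        }

        $dom = new DOMDocument();
        // Real-world HTML is rarely valid, so collect libxml warnings silently.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($content);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $urls = array();
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $href = trim($anchor->getAttribute('href'));
            // Keep only absolute links.
            if (preg_match('#^https?://#i', $href)) {
                $urls[$href] = true; // array keys remove duplicates
            }
        }

        return array_keys($urls);
    }
}
```

Following the config example for contentCallback, such a method could presumably be referenced as 'linkCallback' => 'ArticleLinks::collect', provided the class is imported beforehand.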

Yii Version: 1.1
License: MIT
Category: Networking
Developed by: vittron
Created on: Jan 4, 2013
Last updated: 11 years ago
