yiiscrapermodule

YiiScraperModule is a base scraper module for retrieving information from the Internet.

Overview
Features
Requirements
Usage

YiiScraperModule

Overview

YiiScraperModule is a base scraper module for retrieving information from the Internet. It parses each received HTML page with the simple_html_dom parser component and extracts the URLs the page contains. These URLs are saved to a DB table and requested later. By default, the HTML content is saved to a DB table as well. You can select the target containers from which URLs are parsed and content is stored, or write your own functions for parsing URLs and storing content.

The module is also designed to run from cron, so you can use it for periodic scraping. If the DB table for links is empty, it is filled with the seeds. Each URL is then fetched from the table, marked as non-active and requested. The received HTML content is parsed and the new URLs are saved to the same table. Scraping terminates when any one of the limits is exhausted; the respective message appears in the scraper logs table. After such a termination, the scraper can be started again later: it requests the first active URL and continues where it left off. If there is no active URL in the table, the scraper stops. On its next run it clears the DB table for URLs, inserts the seeds, and the process repeats from the very beginning.
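The run cycle described above can be sketched as a small in-memory simulation. This is only an illustration of the described logic, not the module's code: the $links array stands in for the links DB table, $fetch is a hypothetical stand-in for "request the page and extract its URLs", and a document count is the only limit modelled.

```php
<?php
/**
 * Hypothetical in-memory model of one scraper run (not the module's code).
 *
 * @param array    $links        stand-in for the links table: url => is active (bool)
 * @param array    $seeds        URLs used to refill an empty or exhausted table
 * @param callable $fetch        stand-in for "request the page and extract its URLs"
 * @param int      $maxDocuments the only limit modelled in this sketch
 * @return string  status message, similar in spirit to a log entry
 */
function runScraperOnce(array &$links, array $seeds, callable $fetch, $maxDocuments)
{
    // Empty table or no active URL left: clean the table and insert the seeds.
    if (!in_array(true, $links, true)) {
        $links = array_fill_keys($seeds, true);
    }

    $received = 0;
    // array_search() returns the first URL that is still active, or false.
    while (($url = array_search(true, $links, true)) !== false) {
        if ($received >= $maxDocuments) {
            return 'terminated: document limit exhausted';
        }
        $links[$url] = false;              // mark as non-active, then request
        foreach ($fetch($url) as $found) {
            if (!isset($links[$found])) {
                $links[$found] = true;     // save new URLs as active
            }
        }
        $received++;
    }
    return 'finished: no active URLs left';
}

// Fake three-page site used instead of real HTTP requests.
$site = array(
    'http://example.com/'  => array('http://example.com/a', 'http://example.com/b'),
    'http://example.com/a' => array('http://example.com/b'),
    'http://example.com/b' => array(),
);
$fetch = function ($url) use ($site) {
    return isset($site[$url]) ? $site[$url] : array();
};

$links = array();
// First run stops at the limit; the second run continues with the remaining URL.
echo runScraperOnce($links, array('http://example.com/'), $fetch, 2), "\n";
echo runScraperOnce($links, array('http://example.com/'), $fetch, 2), "\n";
```

With a limit of two documents, the first call ends with the "terminated" message while one URL is still active, and the second call picks that URL up and ends with the "finished" message. A third call would find no active URL and start again from the seeds.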

Features

Automated install and uninstall.
Designed to run periodically, from browser or cron.
Detects the content charset and converts it to UTF-8.
You can define limits for the scraping process (duration, received data size, received documents count, max depth to scrape).
Able to scrape only inside specified URLs, if needed.
Logging system (when the scraping process started, how many bytes, documents and HTML documents were received, how many new URLs were saved to the DB, and what the process status is).
Stores URL relations in a separate table.
Uses the simple_html_dom extension to parse received data. You can use CSS selectors to define which URLs are added and which content is stored to the DB table, or you can write your own callback functions to handle this task.

Requirements

Yii 1.1 or above
cURL and mbstring PHP extensions

Usage

To use the module, copy it to the '.../protected/modules/' folder. Then add the following lines to the 'modules' part of your '.../protected/config/main.php' file:

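...
	'modules'=>array(
		'gii'=>array(	... 	),
		...
		'yscraper' => array(
			'class' => 'application.modules.YiiScraper.YiiScraperModule',
			'installMode' => true,                 // needed only for install/uninstall
			'seeds' => 'http://pravda.com.ua',     // URL(s) where scraping starts
			'insideUrlsOnly' => 'pravda.com.ua',   // scrape only inside this area
			'maxDuration' => 1000,                 // limit on the scraping duration
			'contentSelector' => 'div#content',    // CSS selector of the content to store
		),
		...
	),
...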

Note that the 'installMode' option must be true. Next, install the scraper by opening this link in your browser:

http://your.domainname.com/index.php?r=yscraper/default/install

or, to uninstall it:

http://your.domainname.com/index.php?r=yscraper/default/uninstall

After installing, remove the line

'installMode' => true,

from the config, adjust the other settings and run the scraper with the command:

Yii::app()->getModule('yscraper')->run();
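Since the module is designed to run from cron, one possible way to trigger the command above is a Yii 1.1 console command. The sketch below is an assumption, not part of the extension: the class name ScrapeCommand and its file location are hypothetical, the 'yscraper' module would also have to be listed under 'modules' in '.../protected/config/console.php', and whether the module runs unchanged in a console application has not been verified here.

```php
<?php
// .../protected/commands/ScrapeCommand.php
// Hypothetical file, not shipped with the extension.
class ScrapeCommand extends CConsoleCommand
{
    // Default action; invoked by: php .../protected/yiic.php scrape
    public function actionIndex()
    {
        Yii::app()->getModule('yscraper')->run();
    }
}
```

A crontab entry could then call "php .../protected/yiic.php scrape" at whatever interval suits the configured limits.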

If you want to use callbacks, you can reference static methods of previously imported classes in your config file:

...
	'yscraper' => array(
		...
		'contentCallback' => 'SomeModelClass::someCallbackFunction',
	),
...

Or you can set callbacks at runtime in your controller file:

public function actionIndex()
	{
		...
		Yii::app()->getModule('yscraper')->contentCallback = array($this, 'someCallbackFunction');
		Yii::app()->getModule('yscraper')->run();
		...
	}
 
	...
 
	public function someCallbackFunction($currentURL, $content)
	{
		// process content here
	}

Note that the linkCallback and contentCallback functions receive two arguments: the URL and the content received from that URL. The linkCallback function should return an array of URLs to scrape later.
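As an illustration of the linkCallback contract, here is a minimal sketch of a static callback that returns the absolute http(s) links found in a page. It assumes the content argument is the raw HTML string; the class name LinkFilter and the method extractLinks are hypothetical, and PHP's bundled DOM extension is used instead of simple_html_dom only to keep the example standalone.

```php
<?php
class LinkFilter
{
    /**
     * Hypothetical linkCallback: returns the URLs to scrape later.
     *
     * @param string $currentURL URL the content was received from
     * @param string $content    assumed to be the raw HTML of that page
     * @return array unique absolute http(s) URLs found in <a href="...">
     */
    public static function extractLinks($currentURL, $content)
    {
        if (trim($content) === '') {
            return array();
        }

        $dom = new DOMDocument();
        // Real-world HTML is rarely valid, so collect parser errors silently.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($content);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $urls = array();
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $href = trim($anchor->getAttribute('href'));
            // Relative links are skipped in this sketch; resolving them
            // against $currentURL is left out for brevity.
            if (preg_match('#^https?://#i', $href)) {
                $urls[$href] = true; // array keys de-duplicate
            }
        }
        return array_keys($urls);
    }
}
```

Following the contentCallback config example above, such a method could be referenced as 'linkCallback' => 'LinkFilter::extractLinks', provided the class is imported.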

27 followers

831 downloads

Yii Version: 1.1

License: MIT

Category: Networking

Tags: crawling, CURL, module, scraping

Developed by:

vittron

Created on: Jan 4, 2013

Last updated: 13 years ago

Downloads

YiiScraper.zip

User Contributed Notes 11

#11302

Feedback is greatly appreciated!

Dear friends!
Please feel free to post suggestions, notes and any other feedback.

vittron at Jan 4, 2013, 2:35:15 PM

#11305

Nice

Nice... will check it... thanks

PeRoChAk at Jan 5, 2013, 5:21:43 AM

#11330

What can I use this for?

Good work.. Can you please explain to me the practical use of this extension? How could it be of use on a website?
Thanks!

beesho at Jan 6, 2013, 6:38:15 PM

#11333

The goal

Hi Beesho, thanks for the good question.

A scraper can be used to gather information from other sites, from one site, or from part of a site, so that you can process that information and use it for your own purposes.

For example, many portal sites use scrapers. They gather the latest news from other (news) sites and show the news titles with their respective links to the portal user. You can use it for your own mini search engine. You can use it to gather information that is dispersed across thousands of site pages in certain blocks. Etc...

For example, I developed this scraper because I needed to collect all specimen items from one site. Then I enhanced it, turned it into a module and shared it.

vittron at Jan 6, 2013, 11:10:24 PM

#11334

Very Nice

Very nice!
I will probably be using it sometime.
Thank you for your detailed explanation, vittron!

beesho at Jan 7, 2013, 12:20:32 AM

#11465

Install Error

Hi,
Thanks for the extension.
You forgot to add the prefix in the Installer SQL query on lines 55 and 56.

It's: tbl_yiiscraper_link
Should be: {$prefix}yiiscraper_link

aquasite.pl at Jan 15, 2013, 10:46:12 AM

#11468

Indeed!

Hi! Yes, you are right, indeed! I will fix that bug. I hope it didn't cause you much inconvenience. Thank you!

vittron at Jan 15, 2013, 1:02:48 PM

#13644

Thank you

Thanks for this module.
But can you please explain what these lines mean:

'seeds' => 'http://pravda.com.ua',
'insideUrlsOnly' => 'pravda.com.ua',

Thanks in advance

samilo at Jun 13, 2013, 2:07:25 PM

#13650

Seeds

Hello, samilo! 'seeds' is the URL (or URLs) where scraping starts. It is a string, or an array if there are several seeds. If you want to scrape only inside some area (i.e. just one site, no outside links), you need to specify the 'insideUrlsOnly' option. 'pravda.com.ua' is used only as an example. You can change the URL according to your needs.

Please let me know if you have any questions. Thank you!

vittron at Jun 14, 2013, 8:28:51 AM

#15135

Must also do this: 'tablePrefix'=>'',// DECLARING THE PREFIX

You must also add tablePrefix, set to two single quotes '' with no space between them, to your main.php if you don't use any table prefix. If you don't, you'll get an error and spend like an hour trying to figure it out. :) See the code below for an easy add.

'db'=>array(
	'connectionString' => 'mysql:host=localhost;dbname=yourdatabasenamehere',
	'emulatePrepare' => true,
	'username' => 'yourusername',
	'password' => 'yourpassword',
	'charset' => 'utf8',
	'tablePrefix' => '', // DECLARING THE PREFIX
),

windsurfer at Oct 10, 2013, 2:01:42 PM

#16689

great

Hi,

That is a great extension, I have just been digging into it :)
Surely will have some questions sometime

Laszlo from Hungary

tihanyilaci at Mar 19, 2014, 8:16:54 PM
