YiiScraperModule
YiiScraperModule is a base scraper module for gathering information from the Internet. It parses received HTML pages with the simple_html_dom parser component and extracts URLs from each page. The URLs are saved to a DB table and requested later. By default, the HTML content is saved to a DB table as well. You may select target containers for the parsed URLs and the stored content, and you may write your own functions for parsing URLs and storing content.
The module is also designed to run from cron, so you can use it for periodic scraping. If the DB table for links is empty, it is filled with seeds. Each URL is then taken from the table, marked as non-active, and requested. The received HTML content is parsed, and new URLs are saved to the same table. The scraper terminates when any one of its limits is exhausted; the corresponding message is written to the scraper log table. After such a termination, the scraper may be started again later and will request the first active URL, so its work continues where it left off. If there are no active URLs in the table, the scraper stops. The next time it runs, it clears the DB table for URLs, inserts the seeds, and repeats the process from the very beginning.
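The cycle described above can be sketched roughly as follows. This is an illustration only, not the module's actual code: the function name runScraperSketch, the $fetchPage and $extractLinks callables, and the $maxRequests limit are all hypothetical, and an in-memory array stands in for the links DB table.

```php
<?php
// Hypothetical sketch of the scraping cycle; NOT the module's real implementation.
// $links maps URL => active flag and stands in for the links DB table.
function runScraperSketch(array &$links, array $seeds, $fetchPage, $extractLinks, $maxRequests)
{
    // No active URLs left (empty table or a finished pass): clear and reseed.
    if (!in_array(true, $links, true)) {
        $links = array();
        foreach ($seeds as $seed) {
            $links[$seed] = true;
        }
    }

    $requests = 0;
    while (true) {
        // Stop when the (assumed) request limit is exhausted.
        if ($requests >= $maxRequests) {
            return 'limit exhausted';
        }
        // Pick the first active URL, if any.
        $url = array_search(true, $links, true);
        if ($url === false) {
            return 'no active URLs';
        }
        $links[$url] = false; // mark as non-active, then request it
        $html = call_user_func($fetchPage, $url);
        $requests++;
        // Save newly found URLs to the same "table" as active entries.
        foreach (call_user_func($extractLinks, $url, $html) as $found) {
            if (!isset($links[$found])) {
                $links[$found] = true;
            }
        }
    }
}
```

Because the table keeps its state between calls, a run that stops on a limit can be resumed by calling the function again with the same $links array.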
To use the module, copy it to the '.../protected/modules/' folder. Then add the following lines to the 'modules' section of your '.../protected/config/main.php' file:
    ...
    'modules'=>array(
        'gii'=>array(
            ...
        ),
        ...
        'yscraper' => array(
            'class' => 'application.modules.YiiScraper.YiiScraperModule',
            'installMode' => true,
            'seeds' => 'http://pravda.com.ua',
            'insideUrlsOnly' => 'pravda.com.ua',
            'maxDuration' => 1000,
            'contentSelector' => 'div#content',
        ),
        ...
    ),
    ...
Note that the 'installMode' option must be true. Next, install the scraper by opening this link in your browser:
http://your.domainname.com/index.php?r=yscraper/default/install
or, to uninstall it:
http://your.domainname.com/index.php?r=yscraper/default/uninstall
After installation, remove the line
'installMode' => true,
from the config, adjust the other settings, and run the scraper with the command:
Yii::app()->getModule('yscraper')->run();
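For periodic runs from cron, one possible approach is a Yii 1.1 console command that calls the same run() method. This is a sketch under assumptions rather than something the extension documents: the class name ScraperCommand, its location in protected/commands/, and the 'yscraper' module also being configured in protected/config/console.php are all hypothetical, and it has not been confirmed here that the module works in a console application.

```php
<?php
// protected/commands/ScraperCommand.php (hypothetical file and class name)
class ScraperCommand extends CConsoleCommand
{
    public function run($args)
    {
        // Assumes 'yscraper' is listed under 'modules' in config/console.php.
        $module = Yii::app()->getModule('yscraper');
        if ($module === null) {
            echo "Module 'yscraper' is not configured for the console application.\n";
            return 1;
        }
        $module->run();
        return 0;
    }
}
```

With such a command in place, a crontab entry along the lines of `*/30 * * * * php /path/to/protected/yiic.php scraper` would start the scraper every 30 minutes; the path is a placeholder.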
If you want to use callbacks, you may use static methods of previously imported classes in your config file:
    ...
    'yscraper' => array(
        ...
        'contentCallback' => 'SomeModelClass::someCallbackFunction',
    ),
    ...
Or you may set callbacks at runtime in your controller file:
    public function actionIndex()
    {
        ...
        Yii::app()->getModule('yscraper')->contentCallback = array($this, 'someCallbackFunction');
        Yii::app()->getModule('yscraper')->run();
        ...
    }
    ...
    public function someCallbackFunction($currentURL, $content)
    {
        // process content here
    }
Note that the linkCallback and contentCallback functions receive two arguments: the URL and the content received from that URL. Note also that the linkCallback function should return an array of URLs to scrape later.
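As an illustration of the linkCallback contract, here is a minimal sketch of a static callback that returns an array of URLs. It assumes that $content arrives as a raw HTML string (the module may pass something else); the class name ScraperCallbacks and the method collectLinks are hypothetical, and PHP's built-in DOMDocument is used here instead of simple_html_dom.

```php
<?php
// Hypothetical linkCallback: returns the unique absolute http(s) links on a page.
class ScraperCallbacks
{
    public static function collectLinks($currentURL, $content)
    {
        $urls = array();
        $content = (string) $content; // assumption: content is raw HTML text
        if ($content === '') {
            return $urls;
        }

        $dom = new DOMDocument();
        // Real-world HTML is rarely valid; silence libxml parse warnings.
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($content);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $href = trim($anchor->getAttribute('href'));
            // Relative links are skipped in this sketch; only absolute ones are kept.
            if (preg_match('#^https?://#i', $href)) {
                $urls[$href] = true; // array keys deduplicate repeated links
            }
        }
        return array_keys($urls);
    }
}
```

Following the static-method form shown above, such a callback could be referenced in the config as 'linkCallback' => 'ScraperCallbacks::collectLinks', provided the class has been imported.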
Total 7 comments
Hi! Yes, you are right, indeed! I will fix that bug. I hope it didn't cause you much inconvenience. Thank you!
Hi, thanks for the extension. You forgot to add the table prefix in the Installer SQL queries on lines 55 and 56.
It's: tbl_yiiscraper_link
Should be: {$prefix}yiiscraper_link
Very nice! I will probably be using it sometime. Thank you for your detailed explanation, vittron!
Hi, Beesho, thanks for the good question.
A scraper can be used to gather information from other sites, from one site, or from part of a site, and then to process that information and use it for your own purposes.
For example, many portal sites use scrapers. They gather the latest news from other (news) sites and show the news titles with the respective links to the portal's users. You can use it for your own mini search engine. You can use it to gather information that is dispersed across thousands of site pages in certain blocks. And so on.
For example, I developed this scraper because I needed to collect all specimen items from one site. Then I enhanced it, turned it into a module, and shared it.
Good work. Can you please explain to me the practical use of this extension? How could it be of use on a website? Thanks!
Nice... will check it... thanks
Dear friends! Please feel free to post suggestions, notes, and any other feedback.