! Tiki Crawl

https://gitlab.com/tikiwiki/tiki-crawl

This is the alpha version of a crawler tool for checking links or gathering content from websites. It relies on the [https://github.com/spatie/crawler|Crawler library from Spatie]. Kudos to its authors.

! __Installation__

This code has been tested with PHP 8.2.6 and Node.js v18.16.0, so make sure you have those versions installed. It may well work on older versions, since the code is not that complicated.

{CODE()}
composer update
npm install
{CODE}

! __Usage__

First, configure your options. You can override any option from config/config.default.php in a config/config.php file, for example:

{CODE()}
<?php
$options['timeout'] = 600;
$options['js'] = true;
{CODE}

> Note: toscrape.com is useful for testing crawlers

!! __Config options__

{FANCYTABLE(head="Name|Type|Default|Description")}
timeout|integer|60|maximum time to wait for a response from any crawled page
cli|boolean|true|show a dotted progress line when launched from the console
log|boolean|true|keep logs in the `logs/` directory (one for access, one for errors)
store|boolean|false|store crawled pages in the `collected-data/` directory
max_size|integer|2|when `store` is enabled, the maximum size of stored documents; raise it if you want to collect PDFs
limit|?integer|5|number of pages to crawl; set to `false` for unlimited crawling
js|boolean|false|use headless Chrome with Puppeteer (takes much longer, and requires Puppeteer to be installed)
internal|boolean|true|only crawl URLs that are on the same host
concurrency|integer|10|number of concurrent requests to make
max_depth|?integer|false|how deep to crawl from the initial page; set to `1` to follow only the immediate links on the page you are crawling
delay|?integer|false|delay in milliseconds between requests
skip_urls|?array|false|URLs to skip; if you stumble upon problematic URLs that make the crawler crash, list them in this array
allow_redirect|boolean|false|whether or not to follow 30x redirects
ignore_robots|boolean|false|bypass the instructions contained in the `robots.txt` file
allow_nofollow|boolean|false|bypass the `rel="nofollow"` directive in links
user_agent|string|TikiCrawl|the User-Agent header sent with HTTP requests
{FANCYTABLE}

These options are passed on to the Spatie crawler; you can learn more about them at https://github.com/spatie/crawler/

Then you can launch the crawl:

{CODE()}
./bin/crawl https://books.toscrape.com
{CODE}

!! Notes

* It relies on the [https://packagist.org/packages/spatie/crawler|Crawler library from Spatie] (over 5 million downloads).
* It starts off as a tool just for developers and will eventually be integrated into Tiki, making it accessible to power users.

This was used to gather data for an upcoming AI chatbot we are working on (more news on this later, along with some code).
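
!! Example configuration

To illustrate how the config options listed under Usage can be combined, here is a minimal sketch of a config/config.php for a slow, shallow crawl that stores the pages it finds. It only uses option names from the table; the values are illustrative. The entry in `skip_urls` is a hypothetical URL, and the assumption that entries are full URLs has not been checked against the code.

```php
<?php
// config/config.php: overrides for config/config.default.php
// Illustrative values only; adjust them to the site you are crawling.

$options['limit'] = 200;       // crawl at most 200 pages (default: 5)
$options['max_depth'] = 2;     // go one level beyond the links on the start page
$options['concurrency'] = 2;   // fewer parallel requests than the default of 10
$options['delay'] = 500;       // wait 500 milliseconds between requests
$options['internal'] = true;   // stay on the same host (this is the default)
$options['store'] = true;      // save crawled pages in collected-data/

// Hypothetical example: a URL known to crash the crawler.
// Assumes entries are full URLs; check the code before relying on this.
$options['skip_urls'] = [
    'https://books.toscrape.com/some-problematic-page.html',
];
```

With this file in place, the crawl is launched the same way as before, for example with `./bin/crawl https://books.toscrape.com`. A lower `concurrency` combined with a `delay` trades speed for a lighter load on the target site.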