History: Crawl

Source of version: 6

! Tiki Crawl

A crawler tool for checking links or gathering content from websites.

https://gitlab.com/tikiwiki/tiki-crawl

This is the alpha version of the tool. It relies on the [https://github.com/spatie/crawler|Crawler library from Spatie]. Kudos to its authors.


! __Installation__
This code has been tested with PHP 8.2.6 and Node v18.16.0. Make sure you have those versions installed. It may well work on older versions, since the code is not that complicated.


{CODE()}composer update
npm install
{CODE}


! __Usage__
First, configure your options. You can override any option from config/config.default.php in a config/config.php file, for example:

{CODE()}<?php
$options['timeout'] = 600;
$options['js'] = true;
{CODE}

Then you can launch the crawl:

{CODE()}./bin/crawl https://books.toscrape.com{CODE}

> Note: toscrape.com is useful for testing crawlers

!! __Config options__

{FANCYTABLE(head="Name|Type|Default|Description")}
timeout|integer|60|the maximum time to wait for a response from any crawled page
cli|boolean|true|show a dotted progress line when run from the console
log|boolean|true|keep logs in the `logs/` directory (one for access, one for errors)
store|boolean|false|store crawled pages in the `collected-data/` directory
max_size|integer|2|when `store` is enabled, the maximum size of stored documents; raise it if you want to collect PDFs
limit|?integer|5|number of pages to crawl; set to `false` for an unlimited crawl
js|boolean|false|use headless Chrome with Puppeteer (takes much longer and requires Puppeteer to be installed)
internal|boolean|true|only crawl urls that are on the same host
concurrency|integer|10|number of concurrent requests to make 
max_depth|?integer|false|how deep to crawl from the initial page; set to `1` to follow only the immediate links on the page you are crawling
delay|?integer|false|add a delay in milliseconds between requests 
skip_urls|?array|false|if you stumble upon problematic urls that make the crawler crash, you can skip them by listing them in this array
allow_redirect|boolean|false|whether or not to follow 30x redirects
ignore_robots|boolean|false|bypass instructions contained in `robots.txt` file 
allow_nofollow|boolean|false|bypass the `rel="nofollow"` directive in links
user_agent|string|TikiCrawl|the user-agent header sent with http requests
{FANCYTABLE}

These options are passed to the Spatie crawler. You can learn more about them at https://github.com/spatie/crawler/
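
As an illustration of how the options in the table can be combined, here is a sketch of a config/config.php for a site-wide link check. The option names and defaults come from the table above. The chosen values (such as the 250 ms delay and a concurrency of 2) are only assumptions for the example, not recommendations from the project.

```php
<?php
// config/config.php: illustrative overrides for a site-wide link check.
// Option names come from the table above; the values are example choices.

$options['limit'] = false;          // no page limit (the default stops after 5 pages)
$options['internal'] = true;        // only crawl URLs on the starting host (the default)
$options['store'] = false;          // do not keep crawled pages, only check them (the default)
$options['allow_redirect'] = true;  // follow 30x redirects (off by default)
$options['delay'] = 250;            // wait 250 milliseconds between requests
$options['concurrency'] = 2;        // fewer concurrent requests than the default of 10
```

For gathering content instead, the same file would set `store` to `true` and, if larger documents are wanted, raise `max_size`.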



! __Roadmap__
* handle the case where the headless browser fails miserably and the crawler library doesn't catch it
* improve crawling on sites by using the canonical URL that may be declared inside the HTML
* add an option to gather images or check their HTTP status (for finding missing ones)
* publish as a Composer package
* add a feature in Tiki to make use of it


! __License__
Copyright (c) 2023 mose, evoludata
Available under the MIT license. See LICENSE.txt for more details.






!! Notes
* It relies on the [https://packagist.org/packages/spatie/crawler|Crawler library from Spatie] (Over 5 million downloads).
* It starts off as a tool just for developers and will eventually be integrated into Tiki, making it accessible to power users.

This was used to gather data for an upcoming AI Chatbot we are working on (more news on this later, along with some code).


        
