Tiki Crawl

A crawler tool for checking links or gathering content from websites.

This is the alpha version of the tool. It relies on the Crawler library from Spatie. Kudos to those guys.

Installation

This code has been tested with PHP 8.2.6 and Node v18.16.0, so make sure you have those versions installed. It may well work on older versions, since the code is not that complicated.

composer update
npm install

Usage

First you need to configure your options. You can override any options from config/config.default.php in a config/config.php file, for example:

<?php
$options['timeout'] = 600;
$options['js'] = true;

Then you can launch the crawl:

./bin/crawl https://books.toscrape.com

> Note: toscrape.com is useful for testing crawlers
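
For a first trial run against the test site above, a small `config/config.php` along these lines keeps the crawl short while still saving pages. This is only a sketch: the option names come from the options table on this page, and the values are illustrative assumptions, not recommendations from the authors.

```php
<?php
// config/config.php: illustrative values for a short trial crawl
$options['limit'] = 20;      // stop after 20 pages instead of the default 5
$options['max_depth'] = 2;   // follow links up to two levels from the start page
$options['store'] = true;    // keep fetched pages in collected-data/
$options['delay'] = 250;     // wait 250 milliseconds between requests
```

Run `./bin/crawl https://books.toscrape.com` again after saving the file to pick up the overrides.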

Config options

Name	Type	Default	Description
timeout	integer	60	the maximum time to wait for a response from any crawled page
cli	boolean	true	show a dotted progress line when run from the console
log	boolean	true	keep logs in the `logs/` directory (one for access, one for errors)
store	boolean	false	store crawled pages in the `collected-data/` directory
max_size	integer	2	when `store` is enabled, the maximum size of stored documents; raise it if you want to collect PDFs
limit	?integer	5	number of pages to crawl; set to `false` for an unlimited crawl
js	boolean	false	use headless Chrome with Puppeteer (takes much longer and requires Puppeteer to be installed)
internal	boolean	true	only crawl urls that are on the same host
concurrency	integer	10	number of concurrent requests to make
max_depth	?integer	false	how deep to crawl from the initial page; set `1` to follow only the immediate links on the page you are crawling
delay	?integer	false	add a delay in milliseconds between requests
skip_urls	?array	false	if you stumble upon problematic URLs that make the crawler crash, you can skip them by listing them in this array
allow_redirect	boolean	false	whether or not to follow 30x redirects
ignore_robots	boolean	false	bypass instructions contained in `robots.txt` file
allow_nofollow	boolean	false	bypass the `rel="nofollow"` directive in links
user_agent	string	TikiCrawl	the user-agent header sent with http requests
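
As an example of combining these options, the following `config/config.php` sketches a slower, site-wide link check. The values are illustrative assumptions, the skipped URL is hypothetical, and the exact form expected for `skip_urls` entries (full URL or something else) is not documented here, so treat that part as a guess.

```php
<?php
// config/config.php: illustrative values for checking links across a whole site
$options['limit'] = false;        // no page limit
$options['internal'] = true;      // stay on the starting host (already the default)
$options['concurrency'] = 2;      // fewer parallel requests than the default 10
$options['delay'] = 500;          // pause 500 milliseconds between requests
$options['timeout'] = 120;        // allow slow pages more than the default 60
$options['skip_urls'] = [
    'https://example.org/calendar',   // hypothetical URL known to crash the crawler
];
```

Lowering `concurrency` and adding a `delay` trades crawl speed for a lighter load on the target server.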

These options are passed to the Spatie Crawler library; you can learn more about them at https://github.com/spatie/crawler/
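
To give an idea of what passing options to the library can look like, here is a rough sketch of wiring a few of them onto `spatie/crawler` calls. The mapping is an assumption made for illustration, not Tiki Crawl's actual code; it assumes a recent release of the library (v7 or v8) installed through Composer, and it registers no crawl observer, so it reports nothing by itself.

```php
<?php
// Sketch only: a guessed mapping from Tiki Crawl options to spatie/crawler calls.
use GuzzleHttp\RequestOptions;
use Spatie\Crawler\Crawler;
use Spatie\Crawler\CrawlProfiles\CrawlInternalUrls;

require __DIR__ . '/vendor/autoload.php';

$url = 'https://books.toscrape.com';

// A subset of the options from the table, with their documented defaults.
$options = [
    'timeout' => 60, 'allow_redirect' => false, 'user_agent' => 'TikiCrawl',
    'concurrency' => 10, 'internal' => true, 'limit' => 5, 'max_depth' => false,
    'delay' => false, 'ignore_robots' => false, 'allow_nofollow' => false,
];

// HTTP-level settings go to the underlying Guzzle client.
$crawler = Crawler::create([
    RequestOptions::TIMEOUT => $options['timeout'],
    RequestOptions::ALLOW_REDIRECTS => $options['allow_redirect'],
])
    ->setUserAgent($options['user_agent'])
    ->setConcurrency($options['concurrency']);

// Options that default to false are applied only when set.
if ($options['internal']) {
    $crawler->setCrawlProfile(new CrawlInternalUrls($url));
}
if ($options['limit'] !== false) {
    $crawler->setTotalCrawlLimit($options['limit']);
}
if ($options['max_depth'] !== false) {
    $crawler->setMaximumDepth($options['max_depth']);
}
if ($options['delay'] !== false) {
    $crawler->setDelayBetweenRequests($options['delay']);
}
if ($options['ignore_robots']) {
    $crawler->ignoreRobots();
}
if ($options['allow_nofollow']) {
    $crawler->acceptNofollowLinks();
}

$crawler->startCrawling($url);
```

In practice a crawl observer would be attached with `setCrawlObserver()` to log or store each response.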

Roadmap

handle the case where the headless browser fails miserably and the crawler lib doesn't catch it

improve crawling on sites by using the canonical URL that may be declared in the HTML

add an option to try gathering or checking http status on images (for finding missing ones)

publish as a composer package

add a feature in Tiki to make use of it

License
Copyright (c) 2023 mose, evoludata
Available under MIT license. See LICENSE.txt for more details

Notes

It relies on the Crawler library from Spatie (over 5 million downloads).
It starts off as a tool just for developers and will eventually be integrated into Tiki, making it accessible to power users.

This was used to gather data for an upcoming AI Chatbot we are working on (more news on this later, along with some code).

Information	Version
Thu 28 Mar, 2024 19:41 GMT-0000 Marc Laporte Nearly 8	38
Tue 12 Sep, 2023 13:28 GMT-0000 Marc Laporte	37
Sun 30 Jul, 2023 23:02 GMT-0000 Vianney Rwicha	36
Wed 26 Jul, 2023 00:46 GMT-0000 Vianney Rwicha	35
Wed 26 Jul, 2023 00:39 GMT-0000 Vianney Rwicha	34
Wed 26 Jul, 2023 00:39 GMT-0000 Vianney Rwicha	33
Wed 26 Jul, 2023 00:35 GMT-0000 Vianney Rwicha	32
Wed 26 Jul, 2023 00:34 GMT-0000 Vianney Rwicha	31
Wed 26 Jul, 2023 00:33 GMT-0000 Vianney Rwicha	30
Wed 26 Jul, 2023 00:07 GMT-0000 Vianney Rwicha	29
Wed 26 Jul, 2023 00:06 GMT-0000 Vianney Rwicha	28
Wed 26 Jul, 2023 00:06 GMT-0000 Vianney Rwicha	27
Wed 26 Jul, 2023 00:05 GMT-0000 Vianney Rwicha	26
Wed 26 Jul, 2023 00:04 GMT-0000 Vianney Rwicha	25
Wed 26 Jul, 2023 00:01 GMT-0000 Vianney Rwicha	24
Wed 26 Jul, 2023 00:00 GMT-0000 Vianney Rwicha	23
Tue 25 Jul, 2023 23:57 GMT-0000 Vianney Rwicha	22
Tue 25 Jul, 2023 23:56 GMT-0000 Vianney Rwicha	21
Tue 25 Jul, 2023 14:35 GMT-0000 Vianney Rwicha	20
Tue 25 Jul, 2023 14:35 GMT-0000 Vianney Rwicha	19
Tue 25 Jul, 2023 14:34 GMT-0000 Vianney Rwicha	18
Tue 25 Jul, 2023 14:29 GMT-0000 Vianney Rwicha	17
Tue 25 Jul, 2023 14:27 GMT-0000 Vianney Rwicha	16
Tue 25 Jul, 2023 14:26 GMT-0000 Vianney Rwicha	15
Tue 25 Jul, 2023 14:16 GMT-0000 Vianney Rwicha	14
Tue 25 Jul, 2023 14:15 GMT-0000 Vianney Rwicha	13
Tue 25 Jul, 2023 13:59 GMT-0000 Vianney Rwicha	12
Tue 25 Jul, 2023 13:50 GMT-0000 Vianney Rwicha	11
Tue 25 Jul, 2023 13:45 GMT-0000 Vianney Rwicha	10
Tue 25 Jul, 2023 13:44 GMT-0000 Vianney Rwicha	9
Tue 25 Jul, 2023 13:27 GMT-0000 Vianney Rwicha	8
Tue 25 Jul, 2023 13:26 GMT-0000 Vianney Rwicha	7
Tue 25 Jul, 2023 13:25 GMT-0000 Vianney Rwicha	6
Mon 24 Jul, 2023 15:29 GMT-0000 Vianney Rwicha	5
Mon 24 Jul, 2023 15:19 GMT-0000 Vianney Rwicha	4
Mon 24 Jul, 2023 15:17 GMT-0000 Vianney Rwicha	3
Sat 08 Jul, 2023 16:37 GMT-0000 Marc Laporte	2
Sat 08 Jul, 2023 16:37 GMT-0000 Marc Laporte	1

History: Crawl

Preview of version: 6
