
Walker

Walker walks, a.k.a. crawls, through websites and collects performance- and SEO-relevant data. The results can be browsed through a very simple web interface. Apart from that, they are exposed as Prometheus metrics (not implemented yet).

Be careful when crawling your website with walker using aggressive settings; it might take your site down.

Configuration

---
# target of your scrape
target: http://www.bestbytes.de
# number of concurrent go routines
concurrency: 2
# where to run the web interface
addr: ":3001"
# if you want to ignore <meta name="robots" content="noindex,nofollow"/>
ignorerobots: true
# in some cases using cookies is friendlier to the server
usecookies: true

# ignoring urls
## based on query parameters; in this example, all links that contain the query parameter foo are ignored
ignorequerieswith:
  - foo
## skip everything that has a query
ignoreallqueries: true
# paths to ignore (matched as prefixes)
ignore:
  - /foomo
...
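The options above map naturally onto a Go struct. A minimal sketch of what such a config type could look like follows; the actual field names live in the project's config package and may differ, so treat this as an illustration, not the real API.

```go
package main

import "fmt"

// Config mirrors the YAML options shown above (hypothetical names;
// walker's real config package may declare them differently).
type Config struct {
	Target            string   // target of the scrape
	Concurrency       int      // number of concurrent goroutines
	Addr              string   // address of the web interface
	IgnoreRobots      bool     // ignore <meta name="robots" .../>
	UseCookies        bool     // be friendlier to the server
	IgnoreQueriesWith []string // skip links with these query parameters
	IgnoreAllQueries  bool     // skip every link that has a query
	Ignore            []string // path prefixes to skip
}

// defaultConfig returns a config matching the example YAML above.
func defaultConfig() Config {
	return Config{
		Target:            "http://www.bestbytes.de",
		Concurrency:       2,
		Addr:              ":3001",
		IgnoreRobots:      true,
		UseCookies:        true,
		IgnoreQueriesWith: []string{"foo"},
		IgnoreAllQueries:  true,
		Ignore:            []string{"/foomo"},
	}
}

func main() {
	c := defaultConfig()
	fmt.Println(c.Target, c.Concurrency, c.Addr)
}
```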

error detection

  • every response with a status code greater than 400 is tracked as an error
  • external links are checked as well
  • links to forbidden sites, such as a stage system, are reported
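The status-code rule above can be sketched in a few lines. Note that `isTrackedError` is a hypothetical helper, not walker's actual function; it follows the README's wording of "greater than 400" (a stricter crawler might also treat 400 itself as an error).

```go
package main

import "fmt"

// isTrackedError mirrors the rule stated above: any status code
// greater than 400 is tracked as an error.
func isTrackedError(status int) bool {
	return status > 400
}

func main() {
	for _, s := range []int{200, 301, 404, 500} {
		fmt.Println(s, isTrackedError(s))
	}
}
```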

seo validation

  • missing title, description or h1
  • duplicated title, description or h1
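Both checks above are straightforward once the crawled pages are collected. The following is a stdlib-only sketch under assumed names (`page`, `missingFields`, `duplicateTitles` are illustrations; walker's real validation lives in its reports code). Whitespace is trimmed before the emptiness check, so an h1 containing only spaces also counts as missing.

```go
package main

import (
	"fmt"
	"strings"
)

// page holds the SEO-relevant fields extracted from one crawled URL.
type page struct {
	URL, Title, Description, H1 string
}

// missingFields reports which SEO fields are empty on a page,
// trimming whitespace so "   " counts as missing too.
func missingFields(p page) []string {
	var missing []string
	if strings.TrimSpace(p.Title) == "" {
		missing = append(missing, "title")
	}
	if strings.TrimSpace(p.Description) == "" {
		missing = append(missing, "description")
	}
	if strings.TrimSpace(p.H1) == "" {
		missing = append(missing, "h1")
	}
	return missing
}

// duplicateTitles groups URLs that share the same title and keeps
// only titles used by more than one page.
func duplicateTitles(pages []page) map[string][]string {
	byTitle := map[string][]string{}
	for _, p := range pages {
		byTitle[p.Title] = append(byTitle[p.Title], p.URL)
	}
	for t, urls := range byTitle {
		if len(urls) < 2 {
			delete(byTitle, t)
		}
	}
	return byTitle
}

func main() {
	pages := []page{
		{URL: "/a", Title: "Home", H1: "Welcome"},
		{URL: "/b", Title: "Home", Description: "b", H1: "B"},
	}
	fmt.Println(missingFields(pages[0]))
	fmt.Println(duplicateTitles(pages))
}
```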

seo validation schemata

WIP

metrics

Work in progress; exposed on /metrics

  • vector of status codes
  • performance buckets