
Walker

Walker walks, a.k.a. crawls, through websites and collects performance- and SEO-relevant data. The results can be browsed through a very simple web interface. Apart from that, they are exposed as Prometheus metrics (not implemented yet).

Be careful when crawling your website with walker using aggressive settings; it might take your site down.

Configuration

---
# target of your scrape
target: http://www.bestbytes.de
# number of concurrent go routines
concurrency: 2
# where to run the web interface
addr: ":3001"
# if you want to ignore <meta name="robots" content="noindex,nofollow"/>
ignorerobots: true
# in some cases using cookies is friendlier to the server
usecookies: true

# ignoring urls
## based on query parameters; in this example, all links that contain the query parameter foo are ignored
ignorequerieswith:
  - foo
## skip everything that has a query
ignoreallqueries: true
# paths to ignore (matched as prefixes)
ignore:
  - /foomo
...
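The options above map naturally onto a Go struct. A minimal sketch of what such a config type could look like follows; the actual field names live in the project's config package and may differ, so treat this as an illustration, not the real API.

```go
package main

import "fmt"

// Config mirrors the YAML options shown above (hypothetical names;
// walker's real config package may declare them differently).
type Config struct {
	Target            string   // target of the scrape
	Concurrency       int      // number of concurrent goroutines
	Addr              string   // address of the web interface
	IgnoreRobots      bool     // ignore <meta name="robots" .../>
	UseCookies        bool     // be friendlier to the server
	IgnoreQueriesWith []string // skip links with these query parameters
	IgnoreAllQueries  bool     // skip every link that has a query
	Ignore            []string // path prefixes to skip
}

// defaultConfig returns a config matching the example YAML above.
func defaultConfig() Config {
	return Config{
		Target:            "http://www.bestbytes.de",
		Concurrency:       2,
		Addr:              ":3001",
		IgnoreRobots:      true,
		UseCookies:        true,
		IgnoreQueriesWith: []string{"foo"},
		IgnoreAllQueries:  true,
		Ignore:            []string{"/foomo"},
	}
}

func main() {
	c := defaultConfig()
	fmt.Println(c.Target, c.Concurrency, c.Addr)
}
```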

error detection

  • every response with a status code greater than 400 is tracked as an error
  • external links are checked as well
  • links to forbidden sites, such as a stage system, are reported
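The status-code rule above can be sketched in a few lines. Note that `isTrackedError` is a hypothetical helper, not walker's actual function; it follows the README's wording of "greater than 400" (a stricter crawler might also treat 400 itself as an error).

```go
package main

import "fmt"

// isTrackedError mirrors the rule stated above: any status code
// greater than 400 is tracked as an error.
func isTrackedError(status int) bool {
	return status > 400
}

func main() {
	for _, s := range []int{200, 301, 404, 500} {
		fmt.Println(s, isTrackedError(s))
	}
}
```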

seo validation

  • missing title, description or h1
  • duplicated title, description or h1
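Both checks above are straightforward once the crawled pages are collected. The following is a stdlib-only sketch under assumed names (`page`, `missingFields`, `duplicateTitles` are illustrations; walker's real validation lives in its reports code). Whitespace is trimmed before the emptiness check, so an h1 containing only spaces also counts as missing.

```go
package main

import (
	"fmt"
	"strings"
)

// page holds the SEO-relevant fields extracted from one crawled URL.
type page struct {
	URL, Title, Description, H1 string
}

// missingFields reports which SEO fields are empty on a page,
// trimming whitespace so "   " counts as missing too.
func missingFields(p page) []string {
	var missing []string
	if strings.TrimSpace(p.Title) == "" {
		missing = append(missing, "title")
	}
	if strings.TrimSpace(p.Description) == "" {
		missing = append(missing, "description")
	}
	if strings.TrimSpace(p.H1) == "" {
		missing = append(missing, "h1")
	}
	return missing
}

// duplicateTitles groups URLs that share the same title and keeps
// only titles used by more than one page.
func duplicateTitles(pages []page) map[string][]string {
	byTitle := map[string][]string{}
	for _, p := range pages {
		byTitle[p.Title] = append(byTitle[p.Title], p.URL)
	}
	for t, urls := range byTitle {
		if len(urls) < 2 {
			delete(byTitle, t)
		}
	}
	return byTitle
}

func main() {
	pages := []page{
		{URL: "/a", Title: "Home", H1: "Welcome"},
		{URL: "/b", Title: "Home", Description: "b", H1: "B"},
	}
	fmt.Println(missingFields(pages[0]))
	fmt.Println(duplicateTitles(pages))
}
```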

seo validation schemata

WIP

metrics

Work in progress; exposed on /metrics

  • vector of status codes
  • performance buckets