3. Run a spider using the hoaxly-scraping-container

Author: Luis Rosenstrauch

Repository: https://github.com/hoaxly/hoaxly-scraping-container

This is useful for testing your spider locally before using it to retrieve data regularly.

For Portia spiders, use the portiacrawl command (footnote 1). For spiders created programmatically, use the scrapy crawl CLI command.

The general form of the command is:

docker exec portia portiacrawl [SPIDER] [OPTIONS]

If you run it with only the project path, you get a list of the available spiders:

docker exec portia portiacrawl /app/data/projects/Hoaxlyspiders

For example, to run the climatefeedback.org crawler and save its output into /app/data/example-output/output.json using the hoaxly settings, you would run:

docker exec portia portiacrawl /app/data/projects/Hoaxlyspiders climatefeedback.org -o /app/data/example-output/output.json --settings=hoaxly
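When testing several spiders in a row, it can help to wrap the command above in a small shell function. The following is only a sketch: the function name run_spider and the per-spider output file name are assumptions made for illustration and are not part of the repository. The project path, output directory and settings name are taken from the example above.

```shell
#!/bin/sh
# Paths copied from the example above.
PROJECT=/app/data/projects/Hoaxlyspiders
OUT_DIR=/app/data/example-output

# run_spider is a hypothetical helper name, not something the container ships.
# $1 is the spider name, for example climatefeedback.org.
run_spider() {
    # Same command as above, but the output file is named after the spider
    # (an assumption, so that runs of different spiders do not share a file).
    docker exec portia portiacrawl "$PROJECT" "$1" \
        -o "$OUT_DIR/$1.json" --settings=hoaxly
}
```

After sourcing the snippet, run_spider climatefeedback.org would write to /app/data/example-output/climatefeedback.org.json.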

The lower-level command using scrapy looks like this:

scrapy crawl -s PROJECT_DIR=./ -s SPIDER_MANAGER_CLASS=slybot.spidermanager.SlybotSpiderManager snopes.com

You can also deploy exported spiders locally to the scrapingdaemon and schedule a run there to test what would happen in the production environment. A cli container is supplied, so you don't need to install any dependencies on your host.

Run:

docker exec -ti cli /bin/bash

Now you are inside the container and can tell your local scrapydaemon container to run these spiders:

scrapyd-client deploy local

scrapyd-client -t http://scrapydaemon.hoaxly.docksal:6800 schedule -p Hoaxlyspiders climatefeedback.org
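scrapyd also exposes a JSON API over HTTP, so a run can be scheduled and inspected with curl instead of scrapyd-client. This is a hedged sketch: it assumes curl is available wherever you run it and that the daemon is reachable at the URL used above. The function names are hypothetical; schedule.json and listjobs.json are scrapyd's own endpoints.

```shell
#!/bin/sh
# URL and project name copied from the scrapyd-client example above.
SCRAPYD_URL=http://scrapydaemon.hoaxly.docksal:6800
PROJECT=Hoaxlyspiders

# Schedule a run of spider $1 (schedule.json expects a POST, which -d triggers).
schedule_spider() {
    curl -s "$SCRAPYD_URL/schedule.json" -d project="$PROJECT" -d spider="$1"
}

# List pending, running and finished jobs for the project.
list_jobs() {
    curl -s "$SCRAPYD_URL/listjobs.json?project=$PROJECT"
}
```

Calling schedule_spider climatefeedback.org followed by list_jobs lets you check whether the job has been picked up.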

Then view your results in the storage container:

http://elastic.hoaxly.docksal:9200/hoaxly/_search
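The same URL can be queried from a terminal. The sketch below assumes curl is installed and that the hoaxly index lives at the address shown above; the function names are made up for illustration. _count and _search, together with the size and pretty query parameters, are standard Elasticsearch endpoints and options.

```shell
#!/bin/sh
# Address and index name copied from the URL above.
ES_URL=http://elastic.hoaxly.docksal:9200
INDEX=hoaxly

# Print how many documents the index currently holds.
count_items() {
    curl -s "$ES_URL/$INDEX/_count?pretty"
}

# Print the first N stored documents (N defaults to 5 when no argument is given).
show_items() {
    curl -s "$ES_URL/$INDEX/_search?size=${1:-5}&pretty"
}
```

Running count_items before and after a scheduled crawl is a quick way to see whether new items arrived.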

Footnotes

1 http://portia.readthedocs.io/en/latest/spiders.html#running-a-spider
