3. Run a spider using the hoaxly-scraping-container
Author: Luis Rosenstrauch
Repository: https://github.com/hoaxly/hoaxly-scraping-container
This is useful for testing your spider locally before using it to retrieve data regularly.
Portia spiders are run with the `portiacrawl` command (footnote 1); spiders created programmatically are run with the `scrapy crawl` CLI command.
The general usage is:

```
docker exec portia portiacrawl [SPIDER] [OPTIONS]
```

If you omit the spider name, you will get a list of available spiders:

```
docker exec portia portiacrawl /app/data/projects/Hoaxlyspiders
```
For example, to run the climatefeedback.org crawler and save its output into /app/data/example-output/output.json using the hoaxly settings, you would run:
```
docker exec portia portiacrawl /app/data/projects/Hoaxlyspiders climatefeedback.org -o /app/data/example-output/output.json --settings=hoaxly
```
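Once the crawl finishes, you can spot-check the scraped items before wiring the spider into a regular schedule. A minimal sketch, assuming the output file exists at the path above and that `jq` is installed inside the portia container (both are assumptions about your setup):

```shell
# Count the scraped items in the feed file
docker exec portia sh -c "jq 'length' /app/data/example-output/output.json"

# Inspect the first item to verify the extracted fields look right
docker exec portia sh -c "jq '.[0]' /app/data/example-output/output.json"
```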
The equivalent lower-level command, using Scrapy directly, looks like:

```
scrapy crawl -s PROJECT_DIR=./ -s SPIDER_MANAGER_CLASS=slybot.spidermanager.SlybotSpiderManager snopes.com
```
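As with `portiacrawl`, you can persist the items by adding Scrapy's standard `-o` feed-export flag. A sketch, assuming the same slybot project layout; the output path is an assumption, not part of the project:

```shell
# Same low-level crawl, but write items to a JSON feed
# (Scrapy infers the feed format from the .json extension)
scrapy crawl snopes.com \
  -s PROJECT_DIR=./ \
  -s SPIDER_MANAGER_CLASS=slybot.spidermanager.SlybotSpiderManager \
  -o /app/data/example-output/snopes-output.json
```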
You can also locally deploy exported spiders to the scraping daemon and schedule a run there, to test what would happen in a production environment. A CLI container is supplied, so you don't need to install any dependencies on your host.
Run:

```
docker exec -ti cli /bin/bash
```
You are now inside the container and can tell your local scrapyd container to run these spiders:
```
scrapyd-client deploy local
scrapyd-client -t http://scrapydaemon.hoaxly.docksal:6800 schedule -p Hoaxlyspiders climatefeedback.org
```
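scrapyd also exposes a small JSON API, which is handy for checking whether a scheduled run actually started. A sketch, assuming the daemon is reachable at the hostname used above:

```shell
# List pending, running, and finished jobs for the Hoaxlyspiders project
# (scrapyd's listjobs.json endpoint)
curl "http://scrapydaemon.hoaxly.docksal:6800/listjobs.json?project=Hoaxlyspiders"

# List the projects the daemon knows about (listprojects.json endpoint)
curl "http://scrapydaemon.hoaxly.docksal:6800/listprojects.json"
```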
and view your results in the storage container:

```
http://elastic.hoaxly.docksal:9200/hoaxly/_search
```
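To narrow the results, Elasticsearch's URI search syntax works on the same endpoint. A sketch; the `url` field name is an assumption about how hoaxly indexes its items:

```shell
# Return up to 5 indexed items whose url field matches climatefeedback,
# pretty-printed (Elasticsearch URI search: q, size, pretty parameters)
curl -s "http://elastic.hoaxly.docksal:9200/hoaxly/_search?q=url:climatefeedback*&size=5&pretty"
```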
Footnotes
1 http://portia.readthedocs.io/en/latest/spiders.html#running-a-spider