3. Run a spider using the hoaxly-scraping-container

Author: Luis Rosenstrauch

Repository: https://github.com/hoaxly/hoaxly-scraping-container
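
If you do not have the scraping container yet, cloning the repository is the starting point (the environment setup itself is covered in step 1, "Setup environment"):

git clone https://github.com/hoaxly/hoaxly-scraping-container.git
cd hoaxly-scraping-container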

Running spiders inside the hoaxly-scraping-container is useful for testing your spider locally before using it to retrieve data regularly.

There are two ways to run a spider: for Portia spiders, use the portiacrawl command (footnote 1); for spiders created programmatically, use the scrapy crawl CLI command.

The general invocation pattern is shown first below; running the command with only the project path prints a list of the available spiders:

docker exec portia portiacrawl [PROJECT] [SPIDER] [OPTIONS]
docker exec portia portiacrawl /app/data/projects/Hoaxlyspiders

For example, to run the climatefeedback.org crawler and save its output into /app/data/example-output/output.json using the hoaxly settings, you would run:

docker exec portia portiacrawl /app/data/projects/Hoaxlyspiders climatefeedback.org -o /app/data/example-output/output.json --settings=hoaxly
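
To verify that the crawl produced items, you can peek at the output file without leaving the host; this assumes standard coreutils are available inside the portia image:

docker exec portia head /app/data/example-output/output.json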

The more low-level command, using Scrapy directly, looks like this:

scrapy crawl -s PROJECT_DIR=./ -s SPIDER_MANAGER_CLASS=slybot.spidermanager.SlybotSpiderManager snopes.com
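
Note that this form assumes Scrapy and slybot are on your PATH and that you run it from the project directory. A sketch of the equivalent invocation from the host, assuming the project is mounted at /app/data/projects/Hoaxlyspiders as in the portiacrawl example (docker exec -w sets the working directory inside the container; the output path is illustrative):

docker exec -w /app/data/projects/Hoaxlyspiders portia scrapy crawl \
  -s PROJECT_DIR=./ \
  -s SPIDER_MANAGER_CLASS=slybot.spidermanager.SlybotSpiderManager \
  snopes.com -o /app/data/example-output/snopes.json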

You can also locally deploy exported spiders to the scraping daemon (http://scrapydaemon.hoaxly.docksal:6800) and schedule a run there, to test what would happen in the production environment. A CLI container is supplied, so you don't need to install any dependencies on your host.

Run:

docker exec -ti cli /bin/bash

You are now inside the container and can tell your local scrapyd daemon container to run these spiders:

scrapyd-client deploy local
scrapyd-client schedule -p Hoaxlyspiders climatefeedback.org
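
To check that the job was actually scheduled, you can query the daemon's listjobs.json endpoint (part of scrapyd's standard HTTP API), for example:

curl "http://scrapydaemon.hoaxly.docksal:6800/listjobs.json?project=Hoaxlyspiders"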

You can then view your results in the storage container (Elasticsearch) at http://elastic.hoaxly.docksal:9200/hoaxly/_search.
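
For example, a match-all query that pretty-prints the first few stored documents (standard Elasticsearch search parameters; the hoaxly index name comes from the URL above):

curl "http://elastic.hoaxly.docksal:9200/hoaxly/_search?q=*:*&size=5&pretty"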

Footnotes

1. http://portia.readthedocs.io/en/latest/spiders.html#running-a-spider