hoax.ly documentation

2. Create spider

Author: Luis Rosenstrauch

Create a new branch

Repository: https://github.com/hoaxly/hoaxly-scraping-container

After setting up the environment, visit http://hoaxly.docksal:9001/#/projects/hoaxlyPortia

Enter the URL you want to scrape

Using the Portia interface, visit the page where you want to start crawling through links

Create a new spider

Follow a link to a sample item you want to scrape

Create a new sample annotation

Select the appropriate schema (hoaxly)

TODO: screenshot of new schema

Annotate the first element by clicking on the visible project headline

Select the appropriate field from schema

Repeat for all fields in the schema

Close sample

Configure the URL crawling schema

Use regex patterns to select which links the spider follows.
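The original guide does not show the patterns it used, so the following is only a minimal sketch of how follow and exclude regexes might filter links. The domain, the path segments, and the `should_follow` helper are all hypothetical and are not part of Portia or of the hoaxly project.

```python
import re

# Hypothetical patterns for illustration only; replace them with ones that
# match the item pages of the site you are actually crawling.
FOLLOW_PATTERNS = [r"^https?://example\.org/fact-checks/[\w-]+/?$"]
EXCLUDE_PATTERNS = [r"/tag/", r"\?page=\d+"]


def should_follow(url, follow=FOLLOW_PATTERNS, exclude=EXCLUDE_PATTERNS):
    """Return True if url matches a follow pattern and no exclude pattern."""
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return any(re.search(pattern, url) for pattern in follow)


candidate_urls = [
    "https://example.org/fact-checks/some-claim",
    "https://example.org/tag/politics",
    "https://example.org/about",
]
for url in candidate_urls:
    print(url, should_follow(url))
```

Testing patterns against a handful of known URLs like this, before entering them in the Portia interface, can catch a regex that is too broad or too narrow.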

Export the spider as a Scrapy spider (Python code)

Add the new spider to the scrapy_projects directory and commit the new spider

% git add scrapy_projects/hoaxlyPortia/spiders/ -p

% git commit scrapy_projects/hoaxlyPortia/spiders/

Use a commit message that tells us which spider you are adding and which schema it uses
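As one possible convention, a commit message can name both the spider and the schema in a single line. The sketch below composes such a message. The `build_commit_message` helper, the wording of the message, and the spider name `example_org` are assumptions made for illustration, not a format the project prescribes.

```python
def build_commit_message(spider_name, schema_name):
    """Compose a one-line commit message naming the spider and its schema."""
    spider_name = spider_name.strip()
    schema_name = schema_name.strip()
    if not spider_name or not schema_name:
        raise ValueError("both spider name and schema name are required")
    return f"Add {spider_name} spider using the {schema_name} schema"


# "example_org" is a hypothetical spider name; "hoaxly" is the schema
# selected earlier in this guide.
message = build_commit_message("example_org", "hoaxly")
print(message)
```

The resulting string can be passed to `git commit` with the `-m` option.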

Create a merge request

Last updated 7 years ago