2. Create spider
Author: Luis Rosenstrauch
Create a new branch
Repository: https://github.com/hoaxly/hoaxly-scraping-container
After setting up the environment, visit http://hoaxly.docksal:9001/#/projects/hoaxlyPortia
Enter the URL you want to scrape
Using the Portia interface, visit the page where you want to start crawling through links
Create a new spider
Follow a link to a sample item you want to scrape
Create a new sample annotation
Select the appropriate schema (hoaxly)
TODO: screenshot of new schema
Annotate the first element by clicking on the visible project headline
Select the appropriate field from the schema
Repeat for all fields in the schema
Close the sample
Configure which URLs the spider follows, using regular expressions
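Before entering patterns in the interface, it can help to try them against a few sample URLs. The following is a minimal sketch in Python, not the project's actual configuration: the domain, the patterns, and the helper name should_follow are all hypothetical, and it assumes that a URL is followed when it matches at least one follow pattern and no exclude pattern, with patterns matched anywhere in the URL.

```python
import re

# Hypothetical patterns for illustration only; replace them with
# patterns that match the site you are scraping.
FOLLOW_PATTERNS = [r"^https?://example\.com/fact-checks/\d{4}/[\w-]+/?$"]
EXCLUDE_PATTERNS = [r"/tag/", r"\?page="]


def should_follow(url, follow=FOLLOW_PATTERNS, exclude=EXCLUDE_PATTERNS):
    """Return True if url matches any follow pattern and no exclude pattern.

    Assumption: patterns are applied with search semantics (a match
    anywhere in the URL counts), so anchor them with ^ and $ if needed.
    """
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return any(re.search(pattern, url) for pattern in follow)


if __name__ == "__main__":
    samples = [
        "https://example.com/fact-checks/2018/some-claim",
        "https://example.com/tag/politics",
        "https://example.com/fact-checks/2018/some-claim?page=2",
    ]
    for url in samples:
        print(url, should_follow(url))
```

Checking the patterns this way makes it easier to see whether a pattern is too broad (it follows listing or tag pages) or too narrow (it misses item pages) before running a full crawl.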
Export the spider as a Scrapy spider (Python code)
Add the new spider to the scrapy_projects directory and commit the new spider
% git add scrapy_projects/hoaxlyPortia/spiders/
% git commit scrapy_projects/hoaxlyPortia/spiders/
Use a commit message that states which spider you are adding and which schema it uses
Create a merge request