# 2. Create spider

## Create a new branch

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRsEgG7mInKPJqLAI0%2F20180119_143931_3319QVe.png?alt=media\&token=46ffb756-f2ec-43a3-a38a-87a441af319a)

Repository: <https://github.com/hoaxly/hoaxly-scraping-container>

## After setting up the environment, visit <http://hoaxly.docksal:9001/#/projects/hoaxlyPortia>

## Enter the URL you want to scrape

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRsK4HpFy_70ImulJd%2F20180119_144527_3319qpq.png?alt=media\&token=2a731c50-4afe-48f3-a655-e66b65be2b99)

## Using the Portia interface, visit the page where you want to start crawling through links

![Example start url](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRs_OTUXolSTDkmGW-%2F20180119_144652_33193zw.png?alt=media\&token=51507ea9-f783-4b25-baea-ed1dea26e20f)

## Create a new spider

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRsgChVKkFIvph5WVw%2F20180119_144716_3319E-2.png?alt=media\&token=562f2990-4e65-46cb-b637-37491ea8841d)

## Follow a link to a sample item you want to scrape

![Sample item link](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRslG4Z8m1jm5MdP1p%2F20180119_144817_33192HG.png?alt=media\&token=9d5661a0-4f93-45fb-8138-529b8633dbef)

![Sample item page](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRsu5Q7llSB63FpAzH%2F20180119_144832_3319DSM.png?alt=media\&token=c966f744-37dd-4f37-8189-c9a6ab726820)

## Create a new sample annotation

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRtI4SH_RP8irAV2wP%2F20180119_144856_3319QcS.png?alt=media\&token=a8810bd8-c964-4247-90f0-faf215bbaa1c)

## Select the appropriate schema (hoaxly)

TODO: screenshot of new schema

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRtN6559OS8nUjK6zX%2F20180119_144936_3319dmY.png?alt=media\&token=bcddc94f-21db-471f-802f-63005632bb3f)

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRtQEzSVEASGVG5dhm%2F20180119_145019_3319qwe.png?alt=media\&token=e35517ba-acd0-443a-8286-251744a943ac)

## Annotate the first element by clicking on the visible project headline

## Select the appropriate field from schema

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRtqPySxP3MOspfBbf%2F-LGRu4Pswb-zXCeTsz5S%2F20180119_145146_3319EFr.png?alt=media\&token=c3d41f64-a5ba-4610-a030-c02b83a59975)

## Repeat for all fields in the schema

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRrUGLa0x9XxNuWUdn%2F-LGRtiE45AfP8KRW1sjy%2F20180119_145238_3319RPx.png?alt=media\&token=880eea13-c99a-4ecb-a11b-64dc0ded3e37)

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRtqPySxP3MOspfBbf%2F-LGRtxo8sWINUIVvneTU%2F20180119_145415_3319DZA.png?alt=media\&token=a70a0330-041a-4746-8105-28c6c56045dc)

## Close sample

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRtqPySxP3MOspfBbf%2F-LGRuEHsxw1k4t-qhuJq%2F20180119_145433_3319QjG.png?alt=media\&token=033be8c2-a82b-4564-bcca-77232987d72d)

## Configure the URL crawling schema

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRtqPySxP3MOspfBbf%2F-LGRuLyQOc5aTBJr_sff%2F20180119_145501_3319dtM.png?alt=media\&token=21f83268-597a-4a02-9e10-99859d21c6e9)

Use regular expressions (regex) to define which URLs to crawl:

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRtqPySxP3MOspfBbf%2F-LGRuPALp8W9UqecQPWc%2F20180119_145607_3319q3S.png?alt=media\&token=cfb4f441-1ea2-4dbd-b9a6-d43512003609)
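
Before entering a pattern in the interface, it can help to try it against a few URLs locally. The sketch below is only an illustration: the patterns and the `example.org` URLs are hypothetical, and it assumes the patterns are applied with search-style matching (anywhere in the URL), which may differ from how the crawler evaluates them.

```python
import re

# Hypothetical patterns; the real ones depend on the site being scraped.
FOLLOW_PATTERNS = [r"/fact-check/\d{4}/[\w-]+/?$"]
EXCLUDE_PATTERNS = [r"/tag/", r"\?page="]


def should_follow(url, follow=FOLLOW_PATTERNS, exclude=EXCLUDE_PATTERNS):
    """Return True if url matches a follow pattern and no exclude pattern."""
    # Exclusions win: a URL matching any exclude pattern is never followed.
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return any(re.search(pattern, url) for pattern in follow)


if __name__ == "__main__":
    samples = [
        "https://example.org/fact-check/2018/some-claim/",
        "https://example.org/tag/politics/",
        "https://example.org/fact-check/?page=2",
        "https://example.org/about",
    ]
    for sample in samples:
        print(sample, should_follow(sample))
```

Checking both a URL that should match and a few that should not makes it easier to spot a pattern that is too broad before the spider starts crawling.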

## Export the spider as a Scrapy spider (Python code)
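
Since the export is Python code, a quick syntax check can catch a truncated or corrupted file before it is committed. This is a minimal sketch using only the standard library; the helper name `find_syntax_errors` is hypothetical, and it assumes the exported spiders are `.py` files placed directly in `scrapy_projects/hoaxlyPortia/spiders/`.

```python
import ast
from pathlib import Path


def find_syntax_errors(spiders_dir):
    """Return {file name: error message} for every .py file that fails to parse."""
    errors = {}
    for path in sorted(Path(spiders_dir).glob("*.py")):
        try:
            # Parse only; the spider code is never executed here.
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[path.name] = f"line {exc.lineno}: {exc.msg}"
    return errors


if __name__ == "__main__":
    # Assumed location of the spiders; run from the repository root.
    problems = find_syntax_errors("scrapy_projects/hoaxlyPortia/spiders")
    for name, message in problems.items():
        print(f"{name}: {message}")
    print("ok" if not problems else f"{len(problems)} file(s) failed to parse")
```

The check only confirms that the files parse; it does not verify that the spider crawls or extracts the expected fields.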

## Add the new spider to the `scrapy_projects` directory and commit it

![](https://2058394380-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LFcExzxksOgWYgalL6z%2F-LGRtqPySxP3MOspfBbf%2F-LGRuZuxzV3Pm3UEZgiZ%2F20180119_145722_33193BZ.png?alt=media\&token=93934b94-12ef-4c78-98f6-7360bfb3e935)

`% git add scrapy_projects/hoaxlyPortia/spiders/ -p`

`% git commit scrapy_projects/hoaxlyPortia/spiders/`

Use a commit message that states which spider you are adding and which schema it uses.

## Create a merge request
