From 6f3be073212c9063d78557d95806f65112ac615a Mon Sep 17 00:00:00 2001
From: Thibaut Courouble
Date: Fri, 8 Nov 2013 11:34:30 -0800
Subject: [PATCH] Created Scraper Reference (markdown)

---
 Scraper-Reference.md | 182 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 182 insertions(+)
 create mode 100644 Scraper-Reference.md

diff --git a/Scraper-Reference.md b/Scraper-Reference.md
new file mode 100644
index 00000000..96bec984
--- /dev/null
+++ b/Scraper-Reference.md
@@ -0,0 +1,182 @@
+---
+
+**Table of contents:**
+
+* [Overview](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#overview)
+* [Configuration](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#configuration)
+  - [Attributes](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#attributes)
+  - [Filter stacks](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-stacks)
+  - [Filter options](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-options)
+
+## Overview
+
+Starting from a root url, scrapers recursively follow links that match a set of rules, passing each valid response through a chain of filters before writing the file on the local filesystem. They also create an index of the pages' metadata (determined by one filter), which is dumped into a JSON file at the end.
+
+Scrapers rely on the following libraries:
+
+* [Typhoeus](https://github.com/typhoeus/typhoeus) for making HTTP requests
+* [HTML::Pipeline](https://github.com/jch/html-pipeline) for applying filters
+* [Nokogiri](http://nokogiri.org/) for parsing HTML
+
+There are currently two kinds of scrapers: [`UrlScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/url_scraper.rb) which downloads files via HTTP and [`FileScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/file_scraper.rb) which reads them from the local filesystem. They function almost identically (both use URLs), except that `FileScraper` substitutes the base url with a local path before reading a file.
+
+To be processed, a response must meet the following requirements:
+
+* 200 status code
+* HTML content type
+* effective URL (after redirection) contained in the base url (explained below)
+
+(`FileScraper` only checks if the file exists and is not empty.)
+
+Each URL is requested only once (case-insensitive).
+
+## Configuration
+
+Configuration is done via class attributes and divided into three main categories:
+
+* [Attributes](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#attributes) — essential information such as name, version, url, etc.
+* [Filter stacks](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-stacks) — the list of filters that will be applied to each page.
+* [Filter options](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-options) — the options passed to said filters.
+
+**Note:** scrapers are located in the [`lib/docs/scrapers`](https://github.com/Thibaut/devdocs/tree/master/lib/docs/scrapers/) directory. The class's name must be the [CamelCase](http://api.rubyonrails.org/classes/String.html#method-i-camelize) equivalent of the filename.
+
+### Attributes
+
+* `name` [String]
+  Must be unique.
+  Defaults to the class's name.
+
+* `slug` [String]
+  Must be unique, lowercase, and not include dashes (underscores are ok).
+  Defaults to `name` lowercased.
+
+* `type` [String] **(required, inherited)**
+  Defines the CSS class name (`_[type]`) and custom JavaScript class (`app.views.[Type]Page`) that will be added/loaded on each page. Documentations sharing a similar structure (e.g. generated with the same tool or originating from the same website) should use the same `type` to avoid duplicating the CSS and JS.
+  Must include lowercase letters only.
+
+* `version` [String] **(required)**
+  The version of the software at the time the scraper was last run. This is only informational and doesn't affect the scraper's behavior.
+
+* `base_url` [String] **(required)**
+  The documents' location. Only URLs _inside_ the `base_url` will be scraped. "Inside" more or less means "starting with", except that `/docs` is outside `/doc` (but `/doc/` is inside).
+  Unless `root_path` is set, the root/initial URL is equal to `base_url`.
+
+* `root_path` [String]
+  The path of the root URL, relative to `base_url`.
+
+* `dir` [String] **(required, `FileScraper` only)**
+  The absolute path where the files are located on the local filesystem.
+  _Note: `FileScraper` works exactly like `UrlScraper` (manipulating the same kind of URLs) except that it substitutes `base_url` with `dir` in order to read files instead of making HTTP requests._
+
+* `params` [Hash] **(inherited, `UrlScraper` only)**
+  Query string parameters to append to every URL. (e.g. `{ format: 'raw' }` → `?format=raw`)
+  Defaults to `{}`.
+
+* `abstract` [Boolean]
+  Make the scraper abstract / not runnable. Used for sharing behavior with other scraper classes (e.g. all MDN scrapers inherit from the abstract [`Mdn`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/scrapers/mdn/mdn.rb) class).
+  Defaults to `false`.
+
+### Filter stacks
+
+Each scraper has two [filter](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/filter.rb) [stacks](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/filter_stack.rb): `html_filters` and `text_filters`. They are combined into a pipeline (using the [HTML::Pipeline](https://github.com/jch/html-pipeline) library) which causes each filter to hand its output to the next filter's input.
+
+HTML filters are executed first and manipulate a parsed version of the document (a [Nokogiri](http://nokogiri.org/Nokogiri/XML/Node.html) node object), whereas text filters manipulate the document as a string. This separation avoids parsing the document multiple times.
+
+Filter stacks are like sorted sets. They can be modified using the following methods:
+
+```
+push(*names)                   # append one or more filters at the end
+insert_before(index, *names)   # insert one filter before another (index can be a name)
+insert_after(index, *names)    # insert one filter after another (index can be a name)
+replace(index, name)           # replace one filter with another (index can be a name)
+```
+
+"names" are `require` paths relative to `Docs` (e.g. `jquery/clean_html` → `Docs::Jquery::CleanHtml`).
+
+Default `html_filters`:
+
+* [`ContainerFilter`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/filters/core/container.rb) — changes the root node of the document (remove everything outside)
+* [`CleanHtmlFilter`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/filters/core/clean_html.rb) — removes HTML comments, `