**Table of contents:**

* [Overview](#overview)
* [Configuration](#configuration)
  - [Attributes](#attributes)
  - [Filter stacks](#filter-stacks)
  - [Filter options](#filter-options)

## Overview

Starting from a root URL, scrapers recursively follow links that match a set of rules, passing each valid response through a chain of filters before writing the file on the local filesystem. They also create an index of the pages' metadata (determined by one filter), which is dumped into a JSON file at the end.

Scrapers rely on the following libraries:

* [Typhoeus](https://github.com/typhoeus/typhoeus) for making HTTP requests
* [HTML::Pipeline](https://github.com/jch/html-pipeline) for applying filters
* [Nokogiri](http://nokogiri.org/) for parsing HTML

There are currently two kinds of scrapers: [`UrlScraper`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/core/scrapers/url_scraper.rb), which downloads files via HTTP, and [`FileScraper`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/core/scrapers/file_scraper.rb), which reads them from the local filesystem. They function almost identically (both use URLs), except that `FileScraper` substitutes the base URL with a local path before reading a file. `FileScraper` uses the placeholder `localhost` base URL by default and includes a filter that removes any URL pointing to it at the end.

To be processed, a response must meet the following requirements:

* 200 status code
* HTML content type
* effective URL (after redirection) contained in the base URL (explained below)

(`FileScraper` only checks that the file exists and is not empty.)

Each URL is requested only once (URLs are compared case-insensitively).

## Configuration

Configuration is done via class attributes and divided into three main categories:

* [Attributes](#attributes) — essential information such as name, version, URL, etc.
* [Filter stacks](#filter-stacks) — the list of filters that will be applied to each page.
* [Filter options](#filter-options) — the options passed to said filters.

**Note:** scrapers are located in the [`lib/docs/scrapers`](https://github.com/freeCodeCamp/devdocs/tree/main/lib/docs/scrapers/) directory. The class's name must be the [CamelCase](http://api.rubyonrails.org/classes/String.html#method-i-camelize) equivalent of the filename.

### Attributes

A minimal example pulling the main attributes together is shown after this list.

* `name` [String]
  Must be unique.
  Defaults to the class's name.

* `slug` [String]
  Must be unique, lowercase, and not include dashes (underscores are ok).
  Defaults to `name` lowercased.

* `type` [String] **(required, inherited)**
  Defines the CSS class name (`_[type]`) and custom JavaScript class (`app.views.[Type]Page`) that will be added/loaded on each page. Documentations sharing a similar structure (e.g. generated with the same tool or originating from the same website) should use the same `type` to avoid duplicating the CSS and JS.
  Must include lowercase letters only.

* `release` [String] **(required)**
  The version of the software at the time the scraper was last run. This is only informational and doesn't affect the scraper's behavior.

* `base_url` [String] **(required in `UrlScraper`)**
  The documents' location. Only URLs _inside_ the `base_url` will be scraped. "Inside" more or less means "starting with", except that `/docs` is outside `/doc` (but `/doc/` is inside).
  Defaults to `localhost` in `FileScraper`. _(Note: any iframe, image, or skipped link pointing to localhost will be removed by the `CleanLocalUrls` filter; the value should be overridden if the documents are available online.)_
  Unless `root_path` is set, the root/initial URL is equal to `base_url`.

* `base_urls` [Array] **(the `MultipleBaseUrls` module must be included)**
  The documents' locations. Same as `base_url`, except that more than one URL can be given; use it when a documentation is split across several URLs or needs additional URLs to be complete. See [`typescript.rb`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/scrapers/typescript.rb).

* `root_path` [String] **(inherited)**
  The path from the `base_url` of the root URL.

* `initial_paths` [Array] **(inherited)**
  A list of paths (from the `base_url`) to add to the initial queue. Useful for scraping isolated documents.
  Defaults to `[]`. _(Note: the `root_path` is added to the array at runtime.)_

* `dir` [String] **(required, `FileScraper` only)**
  The absolute path where the files are located on the local filesystem.
  _Note: `FileScraper` works exactly like `UrlScraper` (manipulating the same kind of URLs), except that it substitutes `base_url` with `dir` in order to read files instead of making HTTP requests._

* `params` [Hash] **(inherited, `UrlScraper` only)**
  Query string parameters to append to every URL (e.g. `{ format: 'raw' }` → `?format=raw`).
  Defaults to `{}`.

* `abstract` [Boolean]
  Makes the scraper abstract / not runnable. Used for sharing behavior with other scraper classes (e.g. all MDN scrapers inherit from the abstract [`Mdn`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/scrapers/mdn/mdn.rb) class).
  Defaults to `false`.
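Putting the attributes together, here is a minimal sketch of a hypothetical scraper (e.g. `lib/docs/scrapers/example.rb`). The name, URL, paths, and release are placeholders; only the attribute names come from the list above:

```ruby
module Docs
  class Example < UrlScraper
    self.name = 'Example'          # defaults to the class's name
    self.slug = 'example'          # defaults to name.downcase
    self.type = 'example'          # required; lowercase letters only
    self.release = '1.0.0'         # required; informational only
    self.base_url = 'https://example.com/docs/'
    self.root_path = 'index.html'  # root URL = base_url + root_path
    self.initial_paths = %w(changelog) # extra paths added to the queue
  end
end
```

Because the class's name must be the CamelCase equivalent of the filename, this class must live in a file named `example.rb`.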
### Filter stacks

Each scraper has two [filter](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/core/filter.rb) [stacks](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/core/filter_stack.rb): `html_filters` and `text_filters`. They are combined into a pipeline (using the [HTML::Pipeline](https://github.com/jch/html-pipeline) library) which causes each filter to hand its output to the next filter's input.

HTML filters are executed first and manipulate a parsed version of the document (a [Nokogiri](http://nokogiri.org/Nokogiri/XML/Node.html) node object), whereas text filters manipulate the document as a string. This separation avoids parsing the document multiple times.

Filter stacks are like sorted sets. They can be modified using the following methods (a usage sketch follows the default filter list below):

```ruby
push(*names)                  # append one or more filters at the end
insert_before(index, *names)  # insert one filter before another (index can be a name)
insert_after(index, *names)   # insert one filter after another (index can be a name)
replace(index, name)          # replace one filter with another (index can be a name)
```

"names" are `require` paths relative to `Docs` (e.g. `jquery/clean_html` → `Docs::Jquery::CleanHtml`).

Default `html_filters`:

* [`ContainerFilter`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/filters/core/container.rb) — changes the root node of the document (removes everything outside)
* [`CleanHtmlFilter`](https://github.com/freeCodeCamp/devdocs/blob/main/lib/docs/filters/core/clean_html.rb) — removes HTML comments, `<script>`, `<style>`, etc.
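As a usage sketch, a scraper typically appends its own filters to these defaults from its class body. The `example/...` filter names below are hypothetical; only the stack methods themselves come from the API shown above:

```ruby
module Docs
  class Example < UrlScraper
    # ...attributes as in the previous sketch...

    # Append custom HTML filters; names are require paths relative to Docs,
    # so 'example/clean_html' resolves to Docs::Example::CleanHtml.
    html_filters.push 'example/entries', 'example/clean_html'

    # Text filters run after all HTML filters, on the document as a string.
    text_filters.push 'example/fix_whitespace'
  end
end
```

Since HTML filters receive a parsed Nokogiri node and text filters a plain string, DOM manipulation belongs in `html_filters` and string-level tweaks in `text_filters`.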