From 91ef8e774eec2d9847fe629e86fc876c1361af35 Mon Sep 17 00:00:00 2001
From: Thibaut Courouble
Date: Fri, 8 Nov 2013 11:34:41 -0800
Subject: [PATCH] Created Filter Reference (markdown)

---
 Filter-Reference.md | 222 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 222 insertions(+)
 create mode 100644 Filter-Reference.md

diff --git a/Filter-Reference.md b/Filter-Reference.md
new file mode 100644
index 00000000..0bb5ba9d
--- /dev/null
+++ b/Filter-Reference.md
@@ -0,0 +1,222 @@
+---
+
+**Table of contents:**
+
+* [Overview](https://github.com/Thibaut/devdocs/wiki/Filter-Reference#overview)
+* [Instance methods](https://github.com/Thibaut/devdocs/wiki/Filter-Reference#instance-methods)
+* [Core filters](https://github.com/Thibaut/devdocs/wiki/Filter-Reference#core-filters)
+* [Custom filters](https://github.com/Thibaut/devdocs/wiki/Filter-Reference#custom-filters)
+  - [CleanHtmlFilter](https://github.com/Thibaut/devdocs/wiki/Filter-Reference#cleanhtmlfilter)
+  - [EntriesFilter](https://github.com/Thibaut/devdocs/wiki/Filter-Reference#entriesfilter)
+
+## Overview
+
+Filters use the [HTML::Pipeline](https://github.com/jch/html-pipeline) library. They take an HTML string or [Nokogiri](http://nokogiri.org/) node as input, optionally modify it and/or extract information from it, and then output the result. Together they form a pipeline where each filter hands its output to the next filter's input. Every documentation page passes through this pipeline before being copied to the local filesystem.
+
+Filters are subclasses of the [`Docs::Filter`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/filter.rb) class and must implement a `call` method. A basic implementation looks like this:
+
+```ruby
+module Docs
+  class CustomFilter < Filter
+    def call
+      # HTML filter: return the Nokogiri node (unchanged here).
+      doc
+    end
+  end
+end
+```
+
+Filters which manipulate the Nokogiri node object (`doc` and related methods) are _HTML filters_ and must not manipulate the HTML string (`html`).
+Conversely, filters which manipulate the string representation of the document are _text filters_ and must not manipulate the Nokogiri node object. The two types are kept in two separate stacks within the scrapers. These stacks are then combined into a pipeline that calls the HTML filters before the text filters (more details [here](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-stacks)). This ordering avoids parsing the document multiple times.
+
+The `call` method must return either `doc` or `html`, depending on the type of filter.
+
+## Instance methods
+
+* `doc` [Nokogiri::XML::Node]
+  The Nokogiri representation of the container element.
+  See [Nokogiri's API docs](http://nokogiri.org/Nokogiri/XML/Node.html#methods_nav) for the list of available methods.
+
+* `html` [String]
+  The string representation of the container element.
+
+* `context` [Hash] **(frozen)**
+  The scraper's `options` along with a few additional keys: `:base_url`, `:root_url`, `:root_page` and `:url`.
+
+* `result` [Hash]
+  Used to store the page's metadata and pass information back to the scraper.
+  Possible keys:
+
+  - `:path` — the page's normalized path
+  - `:store_path` — the path where the page will be stored (equal to `:path` with `.html` at the end)
+  - `:internal_urls` — the list of distinct internal URLs found within the page
+  - `:entries` — the [`Entry`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/models/entry.rb) objects to add to the index
+
+* `css`, `at_css`, `xpath`, `at_xpath`
+  Shortcuts for `doc.css`, `doc.xpath`, etc.
+
+* `base_url`, `current_url`, `root_url` [Docs::URL]
+  Shortcuts for `context[:base_url]`, `context[:url]`, and `context[:root_url]` respectively.
+
+* `root_path` [String]
+  Shortcut for `context[:root_path]`.
+
+* `subpath` [String]
+  The path of the current URL relative to the base URL.
+  _Example: if `base_url` equals `example.com/docs` and `current_url` equals `example.com/docs/file?raw`, the returned value is `/file`._
+
+* `slug` [String]
+  The `subpath` with any leading slash and `.html` extension removed.
+  _Example: if `subpath` equals `/dir/file.html`, the returned value is `dir/file`._
+
+* `root_page?` [Boolean]
+  Returns `true` if the current page is the root page.
+
+## Core filters
+
+* [`ContainerFilter`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/filters/core/container.rb) — changes the root node of the document (removes everything outside it)
+* [`CleanHtmlFilter`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/filters/core/clean_html.rb) — removes HTML comments, `