From 9baef9a8875d6b638f00c2960908fe39a1896a71 Mon Sep 17 00:00:00 2001
From: Thibaut Courouble
Date: Thu, 24 Oct 2013 11:38:59 -0700
Subject: [PATCH 01/54] Initial Commit

---
 Home.md | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 Home.md

diff --git a/Home.md b/Home.md
new file mode 100644
index 00000000..7d6544a4
--- /dev/null
+++ b/Home.md
@@ -0,0 +1 @@
+Welcome to the devdocs wiki!
\ No newline at end of file

From 0cb6cb918c78be179a9c5ddbb34f9a2b6e3a1c83 Mon Sep 17 00:00:00 2001
From: Thibaut Courouble
Date: Fri, 8 Nov 2013 11:33:53 -0800
Subject: [PATCH 02/54] Created Adding documentations to DevDocs (markdown)

---
 Adding-documentations-to-DevDocs.md | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
 create mode 100644 Adding-documentations-to-DevDocs.md

diff --git a/Adding-documentations-to-DevDocs.md b/Adding-documentations-to-DevDocs.md
new file mode 100644
index 00000000..66eb5f24
--- /dev/null
+++ b/Adding-documentations-to-DevDocs.md
@@ -0,0 +1,27 @@
+---
+
+Adding a documentation may look like a daunting task, but once you get the hang of it, it's actually quite simple. Don't hesitate to ask for help on the [mailing list](https://groups.google.com/d/forum/devdocs) if you ever get stuck.
+
+**Note:** please read the [contributing guidelines](https://github.com/Thibaut/devdocs/blob/master/CONTRIBUTING.md) before submitting a new documentation.
+
+1. Create a subclass of `Docs::UrlScraper` or `Docs::FileScraper` in the `lib/docs/scrapers/` directory. Its name should be the [CamelCase](http://api.rubyonrails.org/classes/String.html#method-i-camelize) equivalent of the filename (e.g. `my_doc` → `MyDoc`).
+2. Add the appropriate class attributes and filter options (see the [Scraper Reference](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference) page).
+3. Check that the scraper is listed in `thor docs:list`.
+4. Create filters specific to the scraper in the `lib/docs/filters/[my_doc]/` directory and add them to the class's [filter stacks](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-stacks). You may create any number of filters but will need at least the following two (a rough sketch of both follows this list):
+  * A [`CleanHtml`](https://github.com/Thibaut/devdocs/wiki/Filter-Reference#cleanhtmlfilter) filter whose task is to clean the HTML markup (e.g. adding `id` attributes to headings) and remove everything superfluous and/or nonessential.
+  * An [`Entries`](https://github.com/Thibaut/devdocs/wiki/Filter-Reference#entriesfilter) filter whose task is to determine the pages' metadata (the list of entries, each with a name, type and path).
+  The [Filter Reference](https://github.com/Thibaut/devdocs/wiki/Filter-Reference) page has all the details about filters.
+5. Using the `thor docs:page [my_doc] [path]` command, check that the scraper works properly. Files will appear in the `public/docs/[my_doc]/` directory (but not inside the app as the command doesn't touch the index).
+6. Generate the full documentation using the `thor docs:download [my_doc] --force` command. Additionally, you can use the `--verbose` option to see which files are being created/updated/deleted (useful to see what changed since the last run), and the `--debug` option to see which URLs are being requested and added to the queue (useful to pin down which page adds unwanted URLs to the queue).
+7. Start the server, open the app, enable the documentation, and see how everything plays out.
+8. Tweak the scraper/filters and repeat 5) and 6) until the pages and metadata are ok.
+9. To customize the pages' styling, create an SCSS file in the `assets/stylesheets/pages/` directory and import it in `application.css.scss`. Both the file and CSS class should be named `_[type]`, where `[type]` is equal to the scraper's `type` attribute (documentations with the same type share the same custom CSS and JS). _(Note: feel free to submit a pull request without custom CSS/JS)_
+10. To add syntax highlighting or execute custom JavaScript on the pages, create a file in the `assets/javascripts/views/pages/` directory (take a look at the other files to see how it works).
+11. Add the documentation's icon in the `public/icons/docs/[my_doc]/` directory, in both 16x16 and 32x32 pixel formats. It'll be added to the icon sprite after your pull request is merged.
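+
+For orientation, here is a rough, hypothetical sketch of those two filters for a made-up `my_doc` documentation. The class names, file layout and helper methods shown below are assumptions modelled on the existing filters in `lib/docs/filters/`; the [Filter Reference](https://github.com/Thibaut/devdocs/wiki/Filter-Reference) page is the authoritative description of the filter API.
+
+```ruby
+# lib/docs/filters/my_doc/clean_html.rb (hypothetical): cleans up the scraped markup
+module Docs
+  class MyDoc
+    class CleanHtmlFilter < Filter
+      def call
+        # Remove superfluous nodes, add ids to headings, etc., then
+        # return the cleaned Nokogiri document.
+        doc
+      end
+    end
+  end
+end
+
+# lib/docs/filters/my_doc/entries.rb (hypothetical): determines each page's metadata
+module Docs
+  class MyDoc
+    class EntriesFilter < Docs::EntriesFilter
+      def get_name
+        'Placeholder name'  # e.g. derived from the page's first heading
+      end
+
+      def get_type
+        'Placeholder type'  # e.g. derived from the page's path
+      end
+    end
+  end
+end
+```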
+
+If the documentation includes more than a few hundred pages and is available for download, try to scrape it locally (e.g. using `FileScraper`). It'll make the development process much faster and avoid putting too much load on the source site. (It's not a problem if your scraper is coupled to your local setup; just explain how it works in your pull request.)
+
+Finally, try to document your scraper and filters' behavior as much as possible using comments (e.g. why certain URLs are ignored, why some HTML markup is removed, why the metadata is determined the way it is, etc.). It'll make updating the documentation much easier.
+
+---
+_Feel free to edit this page or ask for improvements on the [mailing list](https://groups.google.com/d/forum/devdocs)._
\ No newline at end of file

From 6f3be073212c9063d78557d95806f65112ac615a Mon Sep 17 00:00:00 2001
From: Thibaut Courouble
Date: Fri, 8 Nov 2013 11:34:30 -0800
Subject: [PATCH 03/54] Created Scraper Reference (markdown)

---
 Scraper-Reference.md | 182 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 182 insertions(+)
 create mode 100644 Scraper-Reference.md

diff --git a/Scraper-Reference.md b/Scraper-Reference.md
new file mode 100644
index 00000000..96bec984
--- /dev/null
+++ b/Scraper-Reference.md
@@ -0,0 +1,182 @@
+---
+
+**Table of contents:**
+
+* [Overview](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#overview)
+* [Configuration](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#configuration)
+  - [Attributes](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#attributes)
+  - [Filter stacks](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-stacks)
+  - [Filter options](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-options)
+
+## Overview
+
+Starting from a root URL, scrapers recursively follow links that match a set of rules, passing each valid response through a chain of filters before writing the file to the local filesystem. They also create an index of the pages' metadata (determined by one filter), which is dumped into a JSON file at the end.
+
+Scrapers rely on the following libraries:
+
+* [Typhoeus](https://github.com/typhoeus/typhoeus) for making HTTP requests
+* [HTML::Pipeline](https://github.com/jch/html-pipeline) for applying filters
+* [Nokogiri](http://nokogiri.org/) for parsing HTML
+
+There are currently two kinds of scrapers: [`UrlScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/url_scraper.rb), which downloads files via HTTP, and [`FileScraper`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/scrapers/file_scraper.rb), which reads them from the local filesystem. They function almost identically (both use URLs), except that `FileScraper` substitutes the base URL with a local path before reading a file.
+
+To be processed, a response must meet the following requirements:
+
+* 200 status code
+* HTML content type
+* effective URL (after redirection) contained in the base URL (explained below)
+
+(`FileScraper` only checks that the file exists and is not empty.)
+
+Each URL is requested only once (case-insensitive).
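+
+To make the moving parts concrete, here is a minimal, hypothetical scraper. Every name and URL below is made up, and the class attributes and filter stacks it uses are described in the [Configuration](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#configuration) section; treat this as a sketch rather than a template.
+
+```ruby
+# lib/docs/scrapers/my_doc.rb (hypothetical)
+module Docs
+  class MyDoc < UrlScraper   # subclass FileScraper instead to read from disk
+    self.name = 'MyDoc'      # placeholder values
+    self.type = 'my_doc'
+    self.version = '1.0'
+    self.base_url = 'http://example.com/docs/'
+
+    # Scraper-specific filters (see "Filter stacks" below)
+    html_filters.push 'my_doc/clean_html', 'my_doc/entries'
+  end
+end
+```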
+
+## Configuration
+
+Configuration is done via class attributes and divided into three main categories:
+
+* [Attributes](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#attributes) — essential information such as name, version, url, etc.
+* [Filter stacks](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-stacks) — the list of filters that will be applied to each page.
+* [Filter options](https://github.com/Thibaut/devdocs/wiki/Scraper-Reference#filter-options) — the options passed to said filters.
+
+**Note:** scrapers are located in the [`lib/docs/scrapers`](https://github.com/Thibaut/devdocs/tree/master/lib/docs/scrapers/) directory. The class's name must be the [CamelCase](http://api.rubyonrails.org/classes/String.html#method-i-camelize) equivalent of the filename.
+
+### Attributes
+
+* `name` [String]
+  Must be unique.
+  Defaults to the class's name.
+
+* `slug` [String]
+  Must be unique, lowercase, and not include dashes (underscores are ok).
+  Defaults to `name` lowercased.
+
+* `type` [String] **(required, inherited)**
+  Defines the CSS class name (`_[type]`) and custom JavaScript class (`app.views.[Type]Page`) that will be added/loaded on each page. Documentations sharing a similar structure (e.g. generated with the same tool or originating from the same website) should use the same `type` to avoid duplicating the CSS and JS.
+  Must include lowercase letters only.
+
+* `version` [String] **(required)**
+  The version of the software at the time the scraper was last run. This is only informational and doesn't affect the scraper's behavior.
+
+* `base_url` [String] **(required)**
+  The documents' location. Only URLs _inside_ the `base_url` will be scraped. "inside" more or less means "starting with" except that `/docs` is outside `/doc` (but `/doc/` is inside).
+  Unless `root_path` is set, the root/initial URL is equal to `base_url`.
+
+* `root_path` [String]
+  The path from the `base_url` of the root URL.
+
+* `dir` [String] **(required, `FileScraper` only)**
+  The absolute path where the files are located on the local filesystem.
+  _Note: `FileScraper` works exactly like `UrlScraper` (manipulating the same kind of URLs) except that it substitutes `base_url` with `dir` in order to read files instead of making HTTP requests._
+
+* `params` [Hash] **(inherited, `UrlScraper` only)**
+  Query string parameters to append to every URL. (e.g. `{ format: 'raw' }` → `?format=raw`)
+  Defaults to `{}`.
+
+* `abstract` [Boolean]
+  Make the scraper abstract / not runnable. Used for sharing behavior with other scraper classes (e.g. all MDN scrapers inherit from the abstract [`Mdn`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/scrapers/mdn/mdn.rb) class).
+  Defaults to `false`.
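+
+As a hedged illustration of how these attributes combine (all names and URLs below are invented; the existing classes in `lib/docs/scrapers` are the reference for real usage), an abstract parent class can hold the shared, inherited attributes while the concrete scraper defines the rest:
+
+```ruby
+module Docs
+  class MyFramework < UrlScraper
+    self.abstract = true        # shared behavior only; not runnable by itself
+    self.type = 'my_framework'  # `type` is inherited, so related docs share CSS/JS
+  end
+
+  class MyFramework2 < MyFramework
+    self.name = 'MyFramework 2'
+    self.slug = 'my_framework2'                    # lowercase, underscores only
+    self.version = '2.0'
+    self.base_url = 'http://example.com/2.0/docs/'
+    self.root_path = 'index.html'                  # initial page, relative to base_url
+  end
+end
+```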
+
+### Filter stacks
+
+Each scraper has two [filter](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/filter.rb) [stacks](https://github.com/Thibaut/devdocs/blob/master/lib/docs/core/filter_stack.rb): `html_filters` and `text_filters`. They are combined into a pipeline (using the [HTML::Pipeline](https://github.com/jch/html-pipeline) library), which causes each filter to hand its output to the next filter's input.
+
+HTML filters are executed first and manipulate a parsed version of the document (a [Nokogiri](http://nokogiri.org/Nokogiri/XML/Node.html) node object), whereas text filters manipulate the document as a string. This separation avoids parsing the document multiple times.
+
+Filter stacks are like sorted sets. They can be modified using the following methods:
+
+```
+push(*names)                  # append one or more filters at the end
+insert_before(index, *names)  # insert one filter before another (index can be a name)
+insert_after(index, *names)   # insert one filter after another (index can be a name)
+replace(index, name)          # replace one filter with another (index can be a name)
+```
+
+"names" are `require` paths relative to `Docs` (e.g. `jquery/clean_html` → `Docs::Jquery::CleanHtml`).
+
+Default `html_filters`:
+
+* [`ContainerFilter`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/filters/core/container.rb) — changes the root node of the document (remove everything outside)
+* [`CleanHtmlFilter`](https://github.com/Thibaut/devdocs/blob/master/lib/docs/filters/core/clean_html.rb) — removes HTML comments, `