# railgun.Spider

Perhaps the most powerful piece of Railgun is the `Spider` class, which crawls a web page, looks
for links to follow, and then crawls the linked pages in turn. The spider is fully customizable:
you can supply your own search function for scanning pages for links, register middleware that
manipulates requests before they are sent, and add subscribers whose functions run just before
each request is sent. The spider also creates and populates an instance of
`railgun.records.Record` with information about the requests it makes.

Example:

```js
const railgun = require('railgun')

let siteToCrawl = new railgun.requests.Builder('GET', 'target.cat')

// Let us instruct the spider to crawl target.cat and make requests up to a depth of 3.
let spider = new railgun.Spider({maxDepth: 3}, siteToCrawl)

spider.onRequest((request, next) => {
  // Spoof the user agent of each request the spider sends
  request.headers['User-Agent'] = 'Railgun Spider'
  next()
})

// Use a subscriber to count how many requests the spider makes.
let count = 0
spider.subscribe({
  onRequest: function (req) {
    count++
  }
})

// Tell the spider to start crawling asynchronously.
spider.crawl()

// Use another asynchronous function to periodically check if the spider has finished
// running before printing a nested list of all of the URLs for pages the spider crawled.
let id = setInterval(function () {
  if (spider.finished()) {
    console.log(spider.records().toString())
    console.log('Processed', count, 'requests')
    clearInterval(id)
  }
}, 2000)
```

## Static Methods

### railgun.Spider.defaultSearch

The default search function is available as `Spider.defaultSearch`. You can pass it to a spider
explicitly, or call it from your own search function to do some of the work. Search functions are
described in more detail in the constructor section. The default search scans a document for the
`href` attributes of `a` tags and creates a request builder for each one.
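
To make the idea of scanning for `href` attributes concrete, here is a minimal, dependency-free sketch. It is not Railgun's implementation: `extractLinks` is a hypothetical helper, it returns plain URL strings rather than request builders, and it uses a regular expression where a real search function would more likely use an HTML parser.

```js
// Hypothetical helper, not part of Railgun: collect the href values of <a> tags
// from an HTML string. The regular expression keeps the sketch self-contained
// and only handles quoted attribute values.
function extractLinks (documentContent) {
  const pattern = /<a\b[^>]*?\bhref\s*=\s*(?:"([^"]*)"|'([^']*)')/gi
  const links = []
  let match
  while ((match = pattern.exec(documentContent)) !== null) {
    // Group 1 holds a double-quoted value, group 2 a single-quoted one.
    links.push(match[1] !== undefined ? match[1] : match[2])
  }
  return links
}

const html = `<p><a href="/about">About</a> and <a class="x" href='/contact'>Contact</a></p>`
console.log(extractLinks(html)) // [ '/about', '/contact' ]
```

A search function built on a helper like this would still need to turn each URL into a request builder before returning it.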

## Methods

### railgun.Spider#constructor

The constructor for the spider class accepts three arguments:

1. A configuration object for configuring the spider.
2. A request builder defining the first request the spider should make.
3. A search function that scans a document for new links to follow. If not provided, this argument defaults to `Spider.defaultSearch`.

The first parameter, `config`, is expected to be a plain object. Currently, spiders only
act on a `maxDepth` value, which instructs the spider to crawl at most `maxDepth` links deep along
any single path. That is, if site A links to site B, B to C, C to D, and so on, and `maxDepth` is
`3`, then the spider should never go past C after starting at A.
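
The A, B, C, D example can be traced with a small simulation. This is only a sketch of a depth limit, not the spider's actual traversal: the `links` map and `crawlOrder` function are hypothetical, the starting page is assumed to count as depth 1 so that the result matches the description above, and the `seen` set exists only so the sketch terminates on cyclic links.

```js
// Hypothetical in-memory "web": each key is a page, each value the pages it links to.
const links = { A: ['B'], B: ['C'], C: ['D'], D: [] }

// Visit pages level by level, treating the starting page as depth 1 (an assumption).
function crawlOrder (start, maxDepth) {
  const visited = []
  const seen = new Set([start])
  let frontier = [start]
  for (let depth = 1; depth <= maxDepth && frontier.length > 0; depth++) {
    const next = []
    for (const page of frontier) {
      visited.push(page)
      for (const target of links[page] || []) {
        if (!seen.has(target)) {
          seen.add(target)
          next.push(target)
        }
      }
    }
    frontier = next
  }
  return visited
}

console.log(crawlOrder('A', 3)) // [ 'A', 'B', 'C' ]
```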

The request builder parameter is simply an instance of `railgun.requests.Builder` and serves as the
first request the spider makes when it starts running.

Search functions should take the form:

```js
function search (requestBuilder, documentContent) {
  // Scan the document's content for links to follow and collect an array of new request builders
  // to make requests with.
  const newRequests = []
  return newRequests
}
```

Because search functions simply return an array of request builders, you can write your own search
functions and compose them with existing ones, which saves you from repeating work done by others.
For example, if you had your own search function that produced interesting links, you could also
include the results generated by the default search function like so:

```js
const railgun = require('railgun')

function mySearch (requestBuilder, documentContent) {
  // Placeholder: replace with your own work that produces an array of new request builders.
  let myResults = []
  return myResults.concat(railgun.Spider.defaultSearch(requestBuilder, documentContent))
}
```
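
The same concatenation idea extends to any number of search functions. The sketch below assumes only that every search function takes `(requestBuilder, documentContent)` and returns an array; `composeSearches` is a hypothetical helper, not something Railgun provides, and the two stand-in searches return strings in place of real request builders so the example can run on its own.

```js
// Hypothetical helper: combine several search functions into a single one whose
// result is the concatenation of each function's results, in argument order.
function composeSearches (...searches) {
  return function (requestBuilder, documentContent) {
    return searches.reduce(
      (results, search) => results.concat(search(requestBuilder, documentContent)),
      []
    )
  }
}

// Stand-in search functions; real ones would return request builders.
const findFirst = (requestBuilder, documentContent) => ['first:' + documentContent]
const findSecond = (requestBuilder, documentContent) => ['second:' + documentContent]

const combined = composeSearches(findFirst, findSecond)
console.log(combined(null, 'page')) // [ 'first:page', 'second:page' ]
```

A function produced this way could then be passed as the third constructor argument, for example together with `Spider.defaultSearch` as one of the composed searches.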

### railgun.Spider#records

`Spider#records` returns the `railgun.records.Record` instance built by the spider as it makes its requests.

### railgun.Spider#onRequest

`Spider#onRequest` registers a new middleware function with the spider. Middleware functions are invoked
in order of insertion to manipulate request builders before they are used to make new requests. See the
ProxyServer documentation for more information about such middleware.
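
To illustrate what "invoked in order of insertion" means for functions of the form `(request, next)`, here is a small stand-alone sketch. It is not Railgun's code: `runMiddleware` is a hypothetical stand-in for whatever the spider does internally, and the request is modeled as a plain object with a `headers` property.

```js
// Hypothetical stand-in for the spider's middleware handling: call each
// middleware in insertion order, moving on only when it calls next().
function runMiddleware (middleware, request, done) {
  let index = 0
  function next () {
    if (index === middleware.length) return done(request)
    const fn = middleware[index++]
    fn(request, next)
  }
  next()
}

const request = { headers: {} }
const chain = [
  (req, next) => { req.headers['User-Agent'] = 'Railgun Spider'; next() },
  (req, next) => { req.headers['X-Order'] = 'second'; next() }
]

let sent = null
runMiddleware(chain, request, (req) => { sent = req })
console.log(Object.keys(sent.headers)) // [ 'User-Agent', 'X-Order' ]
```

In this sketch, a middleware that never calls `next()` stops the chain, so the request is never handed to `done`.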

### railgun.Spider#subscribe

`Spider#subscribe` registers a new subscriber with the spider. A subscriber must be an object with
an `onRequest` method. Just before the spider issues a request, it calls `onRequest` with the request
builder after the spider's middleware has processed it.

### railgun.Spider#crawl

`Spider#crawl` must be called to make a spider begin crawling. The first request is based on the
request builder passed to the spider's constructor.

### railgun.Spider#finished

`Spider#finished` reports whether the spider has finished executing. Because of the way the spider
works and how it is expected to be used, `Spider#crawl` does not return a promise that resolves to any
particular information about the pages the spider crawled. Instead, if you need to know when the spider
is done, you are encouraged to check `Spider#finished` periodically.
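
If promise-style code is more convenient, the periodic check can be wrapped in a small helper. This is a sketch that relies only on `finished()` as described above; `waitUntilFinished` is a hypothetical name rather than part of Railgun, and `fakeSpider` is a stand-in object used so the example runs without a real crawl.

```js
// Hypothetical helper: poll spider.finished() every intervalMs milliseconds and
// resolve with the spider once it reports that it is done.
function waitUntilFinished (spider, intervalMs = 2000) {
  return new Promise((resolve) => {
    const id = setInterval(() => {
      if (spider.finished()) {
        clearInterval(id)
        resolve(spider)
      }
    }, intervalMs)
  })
}

// Stand-in for a real spider: reports that it is finished on the third check.
let checks = 0
const fakeSpider = { finished: () => ++checks >= 3 }

const done = waitUntilFinished(fakeSpider, 10).then(() => {
  console.log('Spider finished after', checks, 'checks')
  return checks
})
```

With a real spider, you would call `spider.crawl()` first and then read `spider.records()` after the returned promise resolves.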
