Web Crawlers

Version 1.0 11 Feb 2013

Background

Web crawling (or "spidering") from a network associated with SDSC has the potential to negatively impact SDSC's reputation on the Internet. Poor reputation can endanger funding of current and future projects, as well as reduce the availability of external resources to SDSC's users.

While SDSC has adequate network, computing, and financial resources to handle aggressive crawling, the targeted website may not. Crawling a site may reduce the quality of service provided by that site. Additionally, crawlers may create a financial impact. Site operators or other third parties may have to pay for the resources consumed by crawling activity, as well as advertisement click-throughs triggered by the crawler.

This policy attempts to mitigate these impacts and keep SDSC in good standing amongst its neighbors on the Internet.


Scope

This policy applies to projects or individuals employing software to crawl websites on a regular or persistent basis when the crawling activity appears to originate from an IP address registered to or allocated to SDSC.

Casual use of crawling software generally falls outside the scope of this policy unless SDSC Security contacts the individual or project and indicates that their crawling activity must comply with this policy.


Policy

1. The project or individual operating a crawler must ensure that the crawling software obeys the basic robots.txt directives for crawl exclusions (Disallow) and delays described at [1].
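
As an illustrative sketch only (not part of the policy), the following Python example shows one way a crawler could honor a site's Disallow rules before fetching a page, using the standard library's urllib.robotparser. The crawler name and URLs are hypothetical placeholders.

  from urllib import robotparser

  # Load and parse the target site's robots.txt once per site.
  rp = robotparser.RobotFileParser()
  rp.set_url("https://example.org/robots.txt")
  rp.read()

  url = "https://example.org/some/page.html"
  # can_fetch() applies the site's Allow/Disallow rules for the given user agent.
  if rp.can_fetch("ExampleProjectCrawler", url):
      print("robots.txt permits fetching", url)
  else:
      print("robots.txt disallows", url)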


2. The project or individual operating a crawler must ensure that the crawling software honors the "Crawl-delay" directive in a given site's robots.txt. If the site's robots.txt does not contain a Crawl-delay directive, the crawling software must wait at least twenty seconds before issuing another request to the same site.
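
As a complementary sketch (again illustrative only, with hypothetical names and URLs), urllib.robotparser also exposes a site's Crawl-delay value, which makes it straightforward to fall back to the twenty-second minimum required above when the directive is absent.

  import time
  from urllib import robotparser

  AGENT = "ExampleProjectCrawler"
  DEFAULT_DELAY = 20  # seconds; policy minimum when no Crawl-delay is given

  rp = robotparser.RobotFileParser()
  rp.set_url("https://example.org/robots.txt")
  rp.read()

  # crawl_delay() returns None when the site does not specify a Crawl-delay.
  delay = rp.crawl_delay(AGENT) or DEFAULT_DELAY

  for url in ("https://example.org/a", "https://example.org/b"):
      if rp.can_fetch(AGENT, url):
          pass  # the actual fetch of url would happen here
      time.sleep(delay)  # wait between consecutive requests to the same site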


3. The project or individual operating a crawler must configure the crawling software to supply a user-agent string that:

  • Identifies the client as a crawler.
  • Includes a URL with information about the project and its crawling activity.
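
For illustration, a user-agent string meeting both requirements might look like the sketch below, here sent with the third-party requests library. The crawler name, version, and information URL are hypothetical placeholders that a real project would replace with its own.

  import requests

  # Identifies the client as a crawler and links to a page describing the project.
  USER_AGENT = "ExampleProjectCrawler/1.0 (crawler; +https://example.sdsc.edu/crawler-info)"

  response = requests.get(
      "https://example.org/some/page.html",
      headers={"User-Agent": USER_AGENT},
      timeout=30,
  )
  print(response.status_code)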


4. The PI for a project employing a crawler in a manner within the scope of this policy must notify SDSC Security and ENS of the intent to operate a crawler. This notification consists of an email sent to security and noc, and must include:

  • A name and email address for the administrator of the crawler.
  • A brief description of the project and scope of the crawling activity.
  • An assertion that the crawler's capabilities and configuration meet the provisions of this policy.

The PI must keep the crawler administrator's contact address up to date with SDSC Security and ENS. The project must not begin crawling unless SDSC Security and ENS have approved the crawler.


5. SDSC Security or ENS may block or remove the crawling host from the network at any time without advance warning. If SDSC Security or ENS enacts a block against a crawling host, the party enacting the block will notify the registered administrative contact once. The administrative contact is then responsible for contacting the party that enacted the block to resolve the issues that led to the block.


6. SDSC Security may suspend or revoke any or all network access privileges granted to individuals attempting to continue crawling by circumventing the blocks applied to their crawler hosts. Examples of circumvention include changing a blocked host's MAC or IP address, plugging the blocked host into a different switch port, using a proxy to get around a block, or running the crawling software on an unblocked host.


7. No user may use HPC resources for web crawling activity without explicit authorization from the corresponding project lead for the HPC resource in question and SDSC Security.


8. No user may use shared SDSC hosts for crawling without explicit authorization from the host's appropriate IT systems manager and SDSC Security. A shared SDSC host is one used by parties other than the project conducting the crawling activity. Examples include general-purpose login hosts and SDSC-managed web servers.


9. SDSC Security may suspend, limit, or revoke any or all network access privileges granted to individuals who repeatedly violate this policy, or who continue crawling while failing to comply with its stipulations in a timely manner.