Brettb.Com

How to stop automated web robots from visiting ASP/ASP.NET websites

While the growth in website users over the last few years has been spectacular, there has also been a corresponding increase in unwelcome website visitors. Many websites are now plagued by unwanted automated web robot visitors which steal content, interfere with interactive website elements and use up large amounts of bandwidth. This article helps you determine whether your website has a robot problem and what to do about it if it does.

Do you have a robots problem?

The scale of the robots problem largely depends on the type of website as well as the type of content it offers. The following pointers are consistent with robot activity:

  • Large numbers of requests from a single IP address or a range of IP addresses within the same subnet (i.e. the first three numbers of the IP address are identical).
  • Large numbers of requests for database driven content compared to the rest of the website.
  • Many requests made from browsers that do not support ASP Sessions.
  • Large and increasing numbers of website visitors, but no corresponding increase in transactions (e.g. sales).
  • Large numbers of spam or automated requests being generated from online forms.

Although some of these indicators will be identified by web server statistics packages, it is often necessary to look at the log files manually in a text editor or to use a specialized log reporting tool such as Microsoft's Log Parser.
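The first indicator in the list above is straightforward to check with a short script. The following Python sketch counts requests per IP address and per subnet (the first three numbers of the address). It assumes a simplified log format in which the client IP address is the first space-separated field on each line; real IIS log files order their fields differently, so the client_ip function is a placeholder that would need adapting. The function names and sample addresses are illustrative only.

```python
from collections import Counter


def client_ip(log_line):
    """Return the client IP, assumed here to be the first field on the line."""
    return log_line.split(" ", 1)[0]


def subnet_of(ip):
    """Return the first three numbers of an IPv4 address, e.g. '203.0.113'."""
    return ".".join(ip.split(".")[:3])


def busiest(log_lines, top=3):
    """Count requests per IP address and per subnet, most frequent first."""
    ips = Counter(client_ip(line) for line in log_lines if line.strip())
    subnets = Counter()
    for ip, hits in ips.items():
        subnets[subnet_of(ip)] += hits
    return ips.most_common(top), subnets.most_common(top)


# Hypothetical log lines in the simplified format described above.
sample = [
    "203.0.113.7 GET /search.asp?q=a",
    "203.0.113.7 GET /search.asp?q=b",
    "203.0.113.9 GET /search.asp?q=c",
    "198.51.100.4 GET /default.asp",
]
by_ip, by_subnet = busiest(sample)
print(by_ip[0])      # ('203.0.113.7', 2)
print(by_subnet[0])  # ('203.0.113', 3)
```

Counting by subnet as well as by single address helps to catch a robot that spreads its requests across several neighbouring addresses.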

Why robots are a problem

There are a number of problems associated with robots.

Large amounts of web robot traffic cause an increase in the bandwidth consumed by the website. On top of the increased financial cost of bandwidth, the bandwidth usage can reduce overall server performance, especially if the robots are making large numbers of requests to resource intensive pages, such as database search results pages.

Automated web traffic can distort website statistics, especially if there is a large amount of robot traffic or the robot traffic varies significantly from month to month. This can lead to awkward questions from senior management if they notice unusual traffic peaks. It also makes it difficult to gauge the success of marketing campaigns, etc.

The robots may also be harvesting your website's content. Stealing website content and republishing it in order to benefit from pay per click advertising is a highly profitable industry.

Techniques for Stopping Robots

The Web Robots Exclusion Standard

There is a semi-official standard for preventing robots from visiting all or part of a website. This is the Standard for Robot Exclusion and its details are at http://www.robotstxt.org/wc/norobots.html. Under this standard, a website that wants to change the behavior of visiting robots does so through a robots.txt text file placed in the root of the web server (e.g. http://www.foo.com/robots.txt).

Unfortunately, for most of its history the Standard for Robot Exclusion was not an official standard; it was only published by the IETF as RFC 9309 in 2022. More importantly, robots are under no obligation to follow the guidelines in a robots.txt file. Consequently, a robots.txt file will only keep out well behaved robots and is of very limited use against the rest.
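To see the standard from the robot's side, the Python sketch below uses the standard library's urllib.robotparser module to decide whether a URL may be fetched. The robots.txt rules, the /search/ folder and the "ExampleBot" name are made up for illustration. It also shows why the file is only advisory: the check is carried out by the robot itself, and a badly behaved robot simply never makes it.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that asks all robots to stay out of /search/.
rules = """\
User-agent: *
Disallow: /search/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A well behaved robot asks before it requests each URL.
allowed_home = parser.can_fetch("ExampleBot", "http://www.foo.com/index.asp")
allowed_search = parser.can_fetch("ExampleBot", "http://www.foo.com/search/results.asp")

print(allowed_home)    # True
print(allowed_search)  # False
```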

The robots meta tag

Although the web robots exclusion standard is useful for stopping certain robots from visiting all or part of a website, it is not really suited to stopping robots from visiting individual pages. The other drawback is that the robots.txt file must be placed in the root folder of the website - something that is not always possible, depending on the configuration of the web hosting plan or the internal IT regulations of a large corporation.

For this reason it is sometimes better to use the robots meta tag in individual pages of the website. The HTML required for stopping a robot indexing a page is:

<meta name="robots" content="noindex">

This HTML should be placed within the <head> element of the document.

It is also possible to stop a robot from following the links from a particular document using the following syntax.

<meta name="robots" content="nofollow">

The two instructions can also be combined in a single meta tag.

<meta name="robots" content="noindex, nofollow">

However, as with robots.txt, meta tags are only likely to stop the most well behaved robots.
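As a rough illustration of what a well behaved robot does with these tags, the following Python sketch reads the robots meta tag from a page using the standard library's html.parser module. The RobotsMetaReader class and the sample page are hypothetical; the sketch only shows that honouring the tag is a step the robot has to choose to perform.

```python
from html.parser import HTMLParser


class RobotsMetaReader(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma separated, e.g. "noindex, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


page = (
    '<html><head><meta name="robots" content="noindex, nofollow"></head>'
    "<body></body></html>"
)
reader = RobotsMetaReader()
reader.feed(page)

may_index = "noindex" not in reader.directives
may_follow = "nofollow" not in reader.directives
print(may_index, may_follow)  # False False
```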

Make registration mandatory

If you have valuable content on your website and it is appropriate to do so, it may be worthwhile to make all or part of the website content only accessible once a user has logged in.

The main drawback of doing this is that it will also stop search engines' own web robots from reaching the website's content, which will make your website less visible in search engine catalogs. If a significant portion of your website's revenue earning traffic comes from search engine referrals then this technique will obviously be counter-productive.

Do not forget that many web robots can be trained to "log in" to websites provided they have a set of valid login credentials, so it is essential to include some mechanism for distinguishing between human and robotic visitors. A common means of achieving this is a graphical sequence of characters that the user has to type into the form before submission (a CAPTCHA, see http://www.captcha.net/). Many robots do not execute JavaScript either, so the registration or login process could also be made to rely on the execution of a particular JavaScript function.

Slowing robots down

An alternative to stopping robots altogether is to slow them down. Many of the common legitimate robots that visit websites and obey the robots exclusion protocol can be slowed down. For example, to slow down Yahoo!'s robot so that it requests URLs with reduced frequency, the following lines can be added to the robots.txt file.

User-agent: Slurp
Crawl-delay: 10

Note that Crawl-delay is measured in seconds.
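A robot that honours this directive has to read the value for itself. As a sketch of that side of the exchange, Python's urllib.robotparser module exposes the value through its crawl_delay method ("ExampleBot" is a made-up robot name with no matching rule).

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: Slurp", "Crawl-delay: 10"])

slurp_delay = parser.crawl_delay("Slurp")       # rule applies to this robot
other_delay = parser.crawl_delay("ExampleBot")  # no rule for this robot

print(slurp_delay)  # 10
print(other_delay)  # None
```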

Unfortunately, there is no agreed standard for slowing down robots, so it has to be implemented on a robot by robot basis.

For robots that do not understand any instructions to slow down, it is possible to force them to slow down. This could be achieved by writing a custom add-on to a website that introduces a delay in returning content should a specific user make more than a certain number of requests in a specific time period.
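The core of such an add-on is a record of recent requests per visitor. The Python sketch below is one possible way of working out the delay, not a description of any particular product: it keeps the request times for each client within a sliding time window and suggests a delay that grows with every request over the limit. The SlowDown class, its parameter names and the default values are all assumptions made for illustration, and a real add-on would also need to expire idle clients and share its state between server processes.

```python
from collections import defaultdict, deque


class SlowDown:
    """Suggest a delay for clients that exceed a request budget."""

    def __init__(self, max_requests=5, window_seconds=10.0, penalty_seconds=2.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.penalty_seconds = penalty_seconds
        self._history = defaultdict(deque)  # client key -> recent request times

    def delay_for(self, client, now):
        """Record a request at time `now` and return the delay to apply."""
        recent = self._history[client]
        # Forget requests that have fallen out of the time window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        excess = len(recent) - self.max_requests
        return self.penalty_seconds * excess if excess > 0 else 0.0


limiter = SlowDown(max_requests=3, window_seconds=10.0, penalty_seconds=2.0)
delays = [limiter.delay_for("203.0.113.7", t) for t in (0, 1, 2, 3, 4, 20)]
print(delays)  # [0.0, 0.0, 0.0, 2.0, 4.0, 0.0]
```

The time is passed in as a parameter rather than read from the clock so that the behavior can be tested without waiting; in a live website it would be the time at which the request arrived.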

As an alternative to writing a custom add-on, it is possible to find commercial offerings that will accomplish the same. The Slow Down Manager ASP.NET component within VAM: Visual Input Security is able to slow down anyone who makes repeated requests for pages and can be configured to deny them access to the pages if they make more than a certain number of requests. Further details about the Slow Down Manager are available from http://www.peterblum.com/VAM/VISETools.aspx#SDM.

While slowing down robots is in theory a good solution, it is fraught with difficulties. For example, most robots can be configured to visit websites at preset intervals, so a robot user who noticed the slowdown could simply increase the interval between requests to stay under the limit. Slowing down website visitors based on IP address may also slow responses for legitimate users who share a web cache/proxy server with the robot user. Introducing a delay in the response also ties up server resources for as long as each delayed request is held open.

Obfuscating content

While stopping robots from visiting is one solution, the other is to make your website a lot less useful to them. This can be achieved by either making the website structure difficult to navigate, or by obfuscating the content so that it is more difficult to parse and extract content.

Obfuscating the content of the website

Using ASP.NET

A straightforward way of making life more difficult for robots is to use the .NET Framework. The HTML produced by ASP.NET can be more difficult to parse than that created using classic ASP. This is particularly so if the content the robots are interested in can only be displayed after a form is posted back. The .NET Framework gives form fields names such as _ctl10__ctl1_DropDownListPrice which can often be inconsistent if the page contains different numbers of controls each time it is viewed or it contains controls with many subcontrols within them, such as DataGrids.

Using JavaScript

As mentioned previously, many robots are unable to execute JavaScript. Building the website's navigation scheme using JavaScript could, therefore, be used to disguise the website's navigation structure from robots. This does of course have the consequence of making the website's content less visible to search engine robots. The JavaScript navigation system will also only work in web browsers where JavaScript is enabled and there are also accessibility issues to consider.

Blocking robot user-agents

Most requests made to a web server will contain a description of the web browser or automated web robot being used - the "user agent string". This description can be accessed via the HTTP_USER_AGENT server variable, Request.ServerVariables("HTTP_USER_AGENT"), in either VBScript in classic ASP or VB.NET in ASP.NET. Most legitimate robots will identify themselves. For example, Google's content retrieval robot identifies itself as:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

A web browser will generally identify itself as something like:

Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0).

However, there are now so many variants of the user agent string that it can be difficult to keep up with things. Classic ASP used to have a Browser Capabilities component that could be used to identify web browsers, but it relied on manually updating the server's browscap.ini file as new web browsers were released.

Commercial alternatives to the Browser Capabilities component are often much better at identification of user agents. Of the various commercial offerings, BrowserHawk is probably the best known. Its ASP component contains a Crawler property that can be used to determine if the client is a robot.

While it is possible in theory to identify and block robots using the user agent string, the users of robots can easily "fake" it. The usual method is to send the user agent string of a commonly used web browser such as Internet Explorer 6 on Windows. The web server is then unable to distinguish the robot from normal website users unless more sophisticated robot detection techniques are employed.

A further problem is that an increasing number of proxy servers have been configured to strip out information such as the user agent string from the request, so it is not uncommon to see the user agent masked or absent altogether.
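With those limitations in mind, a simple user agent check might look like the following Python sketch. The list of tokens is illustrative and far from complete, and the classify_user_agent function is hypothetical. Note that the function deliberately returns "unknown" rather than "browser" for anything it does not recognise, because a faked user agent string looks exactly like a genuine one.

```python
# Illustrative tokens only; a real list would be much longer and need updating.
ROBOT_TOKENS = ("googlebot", "slurp", "bot", "crawler", "spider")


def classify_user_agent(user_agent):
    """Return 'missing', 'robot' or 'unknown' for a user agent string."""
    if not user_agent or not user_agent.strip():
        return "missing"  # stripped by a proxy, or never sent
    lowered = user_agent.lower()
    if any(token in lowered for token in ROBOT_TOKENS):
        return "robot"  # a robot that identifies itself
    return "unknown"  # a browser, or a robot pretending to be one


print(classify_user_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(classify_user_agent("Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"))
print(classify_user_agent(""))
```

The three calls print robot, unknown and missing respectively.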

Robot honey pot

Since the user agent string is open to abuse, a more sophisticated method of stopping robots is required.

One way of achieving this is by looking for website visitors that request a high ratio of pages to other content such as images. Robots are primarily interested in text content, so this is a good way of identifying robots. The downside is that this is not straightforward to accomplish through ASP or ASP.NET, but it can be accomplished by analysis of the web server's log files. Analysis of robot behavior in log files can be carried out using Microsoft's Log Parser. Alternatively, the analysis could potentially be done in near real time by making use of an ISAPI filter to log requests as they are made to the web server. Logging website requests to a SQL Server could also be used, but for large websites this would require substantial SQL Server resources to log the amount of data generated.
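The ratio test itself is simple once the requests have been extracted from the log files. The Python sketch below assumes the log has already been reduced to (IP address, requested path) pairs; the list of page extensions, the thresholds and the likely_robots function are assumptions chosen for illustration and would need tuning against real traffic.

```python
from collections import Counter

# File extensions treated as "pages" rather than images, stylesheets, etc.
PAGE_EXTENSIONS = (".asp", ".aspx", ".htm", ".html")


def likely_robots(requests, min_requests=3, threshold=0.9):
    """Return the IPs whose requests are almost entirely for pages."""
    pages = Counter()
    total = Counter()
    for ip, path in requests:
        total[ip] += 1
        # Ignore any query string before checking the extension.
        if path.split("?", 1)[0].lower().endswith(PAGE_EXTENSIONS):
            pages[ip] += 1
    return sorted(
        ip
        for ip in total
        if total[ip] >= min_requests and pages[ip] / total[ip] >= threshold
    )


# Hypothetical requests: one visitor fetches only pages, the other also
# fetches the images and stylesheets a web browser would need.
requests = [
    ("203.0.113.7", "/catalog.asp?id=1"),
    ("203.0.113.7", "/catalog.asp?id=2"),
    ("203.0.113.7", "/catalog.asp?id=3"),
    ("198.51.100.4", "/default.asp"),
    ("198.51.100.4", "/images/logo.gif"),
    ("198.51.100.4", "/styles/site.css"),
]
print(likely_robots(requests))  # ['203.0.113.7']
```

The min_requests parameter is there to avoid flagging a visitor on the strength of one or two requests.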

A variant of this is to look for website visitors that just request the dynamic parts of the site. For example, an online store may have product catalog pages that robots will tend to visit in order to extract the product details and republish on another site, such as a shopping comparison site. The exact pattern of robot usage will tend to vary depending on the type of content offered by the website.

Instead of looking through log files, an alternative for identifying robots is to put a hidden link on a page which only robots will follow. This link could then take the robot to an ASP page that logs its IP address to a database. Of course, this technique cannot be effective against robots just visiting specific pages within the website, but it is reasonably good at identifying robots that crawl entire websites.

Once a robot has been identified, it can be blocked from the site. The usual method is to refuse requests from the robot's IP address.
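The hidden link and the IP address block can be combined into one small piece of logic, sketched here in Python. The HoneyPot class and the /trap.asp path are hypothetical, and the in-memory set stands in for the database mentioned above; the same logic could equally be written as an ASP page plus a check at the top of every other page.

```python
class HoneyPot:
    """Remember the IP address of any client that requests the trap URL."""

    def __init__(self, trap_path="/trap.asp"):
        self.trap_path = trap_path  # target of the hidden link
        self.blocked = set()        # stands in for a database table

    def handle(self, ip, path):
        """Return an HTTP status code for a request from `ip` for `path`."""
        if path == self.trap_path:
            self.blocked.add(ip)
        return 403 if ip in self.blocked else 200


pot = HoneyPot()
statuses = [
    pot.handle("203.0.113.7", "/default.asp"),   # normal page
    pot.handle("203.0.113.7", "/trap.asp"),      # follows the hidden link
    pot.handle("203.0.113.7", "/catalog.asp"),   # now blocked
    pot.handle("198.51.100.4", "/catalog.asp"),  # unaffected visitor
]
print(statuses)  # [200, 403, 403, 200]
```

If search engine robots are welcome on the site, the trap URL should also be disallowed in robots.txt so that only robots which ignore the file end up blocked.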

Testing your robot defenses

If you want to robot proof your website and then test the results, I wrote a small utility - The Website Utility - to simulate a robot's eye view of a website.


All content is © 1995 - 2008 Brett Burridge