Safe handling of user input

Your form wants a six-digit number. Someone will put in a shell command. What will your application do?

Goals

Understand the importance of sanitizing user input.

Motivation

Improper handling of user input is one of the most common flaws in web applications. Prevent this type of vulnerability in your application by checking user input first!

Concerns

In this document, we're concerned about preventing vulnerabilities that arise from improper user input handling in internally-developed web applications.

Causes and Consequences

A web application that fails to properly sanitize user input before displaying it on a page, inserting it into a query, or using it in a command line can be vulnerable to the following types of attacks:

SQL injection - an attacker modifies an SQL query to change its results or perform unauthorized modifications to the site's SQL database.
Shell injection - an attacker modifies the shell command executed by a web application to perform unauthorized actions.
Cross-site scripting - an attacker injects malicious JavaScript or other content into a page normally served from a trustworthy URL.
Code injection - an attacker injects arbitrary code evaluated or run by the web application's script interpreter.
Low-level tricks - an attacker injects arbitrary machine-level code or crashes a process.

Prevention

Don't Trust the Client

First, do not trust the user or the user's browser to sanitize input for you. You may use client-side technology such as JavaScript or Flash to pre-screen user input, but you must perform sanity checking on the server side regardless of what goes on at the client side.

Sanity Checking

Sanity checking is the process of ensuring that user input meets your expectations.

Length

Most scripting languages dynamically allocate buffers of sufficient size for user input, so fitting user input within memory buffer constraints is less of a concern than it was with compiled C applications. However, limits still exist. For example, what will happen when your application tries to store a 256-character string in a database column sized for a 16-character string? Make sure any assumptions about the length of input are checked. Also, don't forget the case where NO data for an input is provided by a user.
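
The length checks described above might look like the following minimal Python sketch. The function name check_length and the 16-character limit are illustrative assumptions (the limit mirrors the 16-character column in the example), not part of any particular framework.

```python
from typing import Optional

MAX_LENGTH = 16  # assumed to match the width of the database column


def check_length(value: Optional[str], max_length: int = MAX_LENGTH) -> str:
    """Return the value unchanged if its length is acceptable, else raise ValueError."""
    if value is None or value == "":
        # The field was missing from the request or submitted empty.
        raise ValueError("no input provided")
    if len(value) > max_length:
        # Reject rather than silently truncate, so the caller can report the problem.
        raise ValueError(f"input longer than {max_length} characters")
    return value
```

Note that len() counts Unicode code points, so a column whose size is measured in bytes may need a check on the encoded length instead.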

Content

The content of user input can be broken down into two general types: structured and unstructured.

Structured Content

Structured content has some fixed format that you can check against. For example, you may expect a six-digit number for a particular input, and easily check that the input is indeed a six-digit number. Use these tools to sanitize your structured content:

Regular Expressions

Regular expressions (regex, regexes) can be used to check the format of user input, as well as to parse or extract portions of the input. (If you're not familiar with regular expressions, now is a good time to learn them.) Just keep in mind that regular expressions must be "anchored", and you should match (and optionally extract) what you want, rather than match and reject what you don't want.

Example: You expect a user input field (stored in $yesno) to contain "y" or "n".

Good (perl regex):

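if ( $yesno =~ /^[yn]$/ ) {
    # do stuff with $yesno
}
  regex translation: "Starts with 'y' or 'n' and ends"

The above example "anchors" the regex (the ^ meaning "starts with" and the $ meaning "and ends") and matches only the expected input. One Perl-specific caveat: $ also matches just before a final newline, so "y\n" is accepted as well; use \z in place of $ to exclude that case.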

Bad (perl regex):

if ( $yesno =~ /[yn]$/ ) {
    # do stuff with $yesno
}
  regex translation: "(Contains) 'y' or 'n' then ends"

The above example fails to properly anchor the regex. The clause will match inputs of "y" and "n" as expected, but will also match unexpected input like "day", "on" and "yaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaawn".

Bad (perl regex):

if ( $yesno =~ /^[abcdefghijklmopqrstuvwxz]$/ ) {
    # send the user an error message
}
  regex translation: "Starts with 'a' or 'b' or 'c' ... and ends"

The above example tries to match undesirable input and sends the user an error message. The keen observer will note that the regex will fail to catch undesirable input such as the empty string, as well as any input that has more than one character, and more!
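
The same match-what-you-want approach can be sketched in Python rather than Perl. Here re.fullmatch succeeds only when the entire string matches the pattern, so no explicit anchors are needed. The function names is_yes_no and is_six_digit_number are hypothetical.

```python
import re

YES_NO = re.compile(r"[yn]")
SIX_DIGITS = re.compile(r"[0-9]{6}")


def is_yes_no(value: str) -> bool:
    """True only for exactly 'y' or exactly 'n'."""
    return YES_NO.fullmatch(value) is not None


def is_six_digit_number(value: str) -> bool:
    """True only for exactly six ASCII digits."""
    return SIX_DIGITS.fullmatch(value) is not None
```

With these checks, inputs such as "day", "on", the empty string, and "123456; rm -rf /" are all rejected because they are not an exact match for what is wanted.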

Filters

Some languages (such as PHP) have a preset collection of functions to filter or verify input, such as is_numeric(), ctype_digit(), and the filter framework. These are often a good alternative to using regexes, but make sure you know what the filter does! is_numeric(), for example, accepts (returns true for) "123" as well as "1.23" and "-1.2e3"!
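
Python's built-ins show the same pitfall as the PHP functions above. The sketch below assumes the field should hold a plain non-negative decimal integer. Converting with float() accepts far more than digits, while the hypothetical helper is_plain_digits accepts only ASCII digits.

```python
def parses_as_float(value: str) -> bool:
    """True if float() accepts the string; a very permissive check."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def is_plain_digits(value: str) -> bool:
    """True only for a non-empty string made up of the ASCII digits 0-9."""
    return value.isascii() and value.isdigit()
```

Here parses_as_float accepts "123", but also "1.23", "-1.2e3", "nan", and " 42 " with surrounding spaces, whereas is_plain_digits accepts only the first of these.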

Unstructured Content

Unstructured content doesn't fit an easily described format. Examples include a guestbook message and a forum post. This type of data is particularly dangerous as it may contain metacharacters (characters that have a special meaning) or other structure that isn't immediately recognized as malicious.

Consider this "Subject" line from a guestbook form that will become part of a raw email message sent over SMTP:

I like your page!\nBcc: user1@example.com, user2@example.com, user3@example.com, user4@example.com

(The \n is a newline character.) Part of the resulting message looks like this:

Subject: I like your page!
Bcc:  user1@example.com, user2@example.com, user3@example.com, user4@example.com

BUY ONLINE PH4RMA_CY!!...

Yes, someone just used the form to spam users at example.com!

So, what can you do?

Pay careful attention to how and where unstructured content is used.
Escape or filter metacharacters.
Escape or filter undesired characters.
Parse unstructured content and interpret the parsing results instead.
Don't use unstructured content in command string (shell command, SQL query, etc.) generation.
Encode unstructured content in an innocuous format (base64, uuencode) before further handling.
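
One way to apply "escape or filter metacharacters" to the guestbook example is to refuse any header value that contains a carriage return or line feed before it reaches the message. This is a minimal Python sketch, and the function name safe_header_value is hypothetical.

```python
def safe_header_value(value: str) -> str:
    """Return the value if it is safe to place on a single header line."""
    if "\r" in value or "\n" in value:
        # A line break would let the sender start a new header such as Bcc:.
        raise ValueError("line breaks are not allowed in a header value")
    return value


# The malicious Subject from the example above is rejected.
subject = "I like your page!\nBcc: user1@example.com, user2@example.com"
try:
    safe_header_value(subject)
except ValueError as err:
    print(f"rejected: {err}")
```

Rejecting the input outright is a design choice made here for simplicity; stripping the line breaks is an alternative, at the cost of silently altering what the user sent.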

Input

Never pass unsanitized structured content (or in many cases, any unstructured content) to a code path that will end up building a command string (shell command, SQL query, code executed by eval()...).
Always escape metacharacters in sanitized content used to build a command string.
When passing content to an external process, such as a shell command or SQL query, try to use a path separate from the command string. Use stdin for shell commands (as opposed to passing user data on the command line) and "bound variables" in SQL queries.
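
Both separate paths can be sketched in Python using only the standard library. The guestbook table, the function names, and the use of the Python interpreter itself as the external process are all assumptions made for illustration.

```python
import sqlite3
import subprocess
import sys

# Bound variables: the "?" placeholders keep user data out of the SQL text.
conn = sqlite3.connect(":memory:")  # throwaway in-memory database for the demo
conn.execute("CREATE TABLE guestbook (author TEXT, message TEXT)")


def add_entry(author: str, message: str) -> None:
    """Insert one guestbook entry using bound parameters."""
    conn.execute(
        "INSERT INTO guestbook (author, message) VALUES (?, ?)",
        (author, message),
    )


def count_entries() -> int:
    """Return the number of rows in the guestbook table."""
    return conn.execute("SELECT COUNT(*) FROM guestbook").fetchone()[0]


# stdin instead of the command line: user data never becomes part of the command.
def shout(text: str) -> str:
    """Send text to a child process on stdin and return what it writes to stdout."""
    child_code = "import sys; sys.stdout.write(sys.stdin.read().upper())"
    result = subprocess.run(
        [sys.executable, "-c", child_code],  # fixed argument list, no shell involved
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A message such as "'); DROP TABLE guestbook; --" is stored as ordinary text by add_entry, and a string such as "hello; rm -rf /" passed to shout is only ever data read by the child process.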

Output

Always escape HTML metacharacters (such as <, >, &, and quotes) by converting them to HTML entities, or remove them, in user input destined for a browser.
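
In Python, the standard library's html.escape performs this conversion. The sketch below assumes the comment is placed in HTML element content; the function name render_comment is hypothetical.

```python
import html


def render_comment(comment: str) -> str:
    """Wrap a user comment in a paragraph, escaping HTML metacharacters first."""
    return "<p>" + html.escape(comment, quote=True) + "</p>"
```

With this in place, a comment containing a script tag is displayed as literal text rather than being run by the browser. Other output contexts, such as URLs or inline scripts, need their own escaping rules.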

A word about eval() and the like

If your web application involves incorporating user input into dynamic code generation, please consider a different approach. Any failure in your sanitization methods will result in an attacker being able to run any code they want on the web server. Coupled with poor permissions on your site or on other sites you share disk space with, this lets the attacker spread their intrusion.
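
One such different approach, sketched here in Python, is to look the user's choice up in a fixed table of permitted operations instead of evaluating it. The operation names and the function calculate are hypothetical.

```python
import operator

# Only the operations listed here can ever run, whatever the user sends.
ALLOWED_OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
}


def calculate(operation: str, left: int, right: int) -> int:
    """Apply a whitelisted operation; unknown names are rejected, never evaluated."""
    try:
        func = ALLOWED_OPERATIONS[operation]
    except KeyError:
        raise ValueError(f"unsupported operation: {operation!r}") from None
    return func(left, right)
```

Because the user input is used only as a dictionary key, a value such as "__import__('os').system('id')" simply fails the lookup.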
