14.1. Content vectoring basics

Content vectoring is the inspection of transmitted data (for example, data downloaded by a web browser or sent over SMTP) to detect and reject unwanted content. Depending on the environment and the circumstances, unwanted content can mean viruses, adware or other malware, spam e-mails, client-side scripts (for example, Java or JavaScript), or websites containing information that users are not permitted to access (such as adult sites).

Note

Content vectoring may seem similar to the application-level protocol inspection performed by Zorp proxies. However, it is essentially different: the proxies analyze only the elements of the protocol, not the transferred data itself.

The main types of content vectoring are summarized below.

  • Virus filtering: The classic and best-known form of content vectoring. It examines the files being transferred to verify that they do not contain any software that may harm the user's computer or infrastructure. Most virus filtering engines also detect adware and trojan programs. If a virus is detected, it is often possible to remove it (disinfect the file) without any side effects.

  • Spam filtering: Spam filtering examines e-mails (usually in the SMTP traffic) to delete unwanted advertisements, viruses spreading through e-mails, and so on.

  • Disabling client-side scripts in HTML: Client-side scripting is a popular method for decreasing the load of webservers. It means that certain actions are performed on the client machine (for example, in a submission form a client-side script could check that all fields are filled in, without having to connect to the server). However, such scripts can be exploited to perform virtually any operation on the client machine. Therefore, they are often disabled by removing them completely from the webpages as the pages are downloaded.

  • General HTML content filtering: Access to certain webpages is also often limited based on the contents of the page, usually the keywords occurring in it. Most commonly this takes the form of blacklisting/whitelisting, to deny access to pages containing prohibited or illegal content, or simply to pages not related to the everyday work of the organization.
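
The keyword-based blacklisting/whitelisting described in the last item can be illustrated with a minimal Python sketch. It is not how Zorp or any particular filtering engine implements the check: the host list, the keyword list, and the function name is_page_allowed are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical policy data for illustration; not taken from any real product.
WHITELISTED_HOSTS = {"intranet.example.com"}
BLACKLISTED_KEYWORDS = {"casino", "warez"}


def is_page_allowed(host, html_text):
    """Return True if the page may be delivered to the client."""
    # Whitelisted hosts bypass the keyword check entirely.
    if host.lower() in WHITELISTED_HOSTS:
        return True
    # Split the page into lowercase words so that only whole words match.
    words = set(re.findall(r"[a-z0-9]+", html_text.lower()))
    # The page is rejected if it contains any blacklisted keyword.
    return not (words & BLACKLISTED_KEYWORDS)
```

A real filter would typically weigh keywords and consult categorized URL lists instead of rejecting a page on a single word; the sketch only shows the basic decision.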

Content vectoring can follow two approaches: file-based and stream-based filtering. File-based filtering is used when the complete object (file) is needed to perform the check, such as in virus and spam filtering. (Virus filters cannot work on partial files.) Stream-based filtering monitors a continuous data flow (for example, a webpage being downloaded) and removes the prohibited content (such as JavaScript, images, and so on).
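
The difference between the two approaches can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the mechanism used by Zorp: the function names, the placeholder signature FAKE-VIRUS-SIGNATURE, and the naive tag matching are all hypothetical. The file-based check must collect every chunk before it can decide, while the stream-based filter forwards data as it arrives and holds back only a few bytes in case a tag is split across chunks.

```python
OPEN_TAG = b"<script"
CLOSE_TAG = b"</script>"


def file_based_check(chunks, signature=b"FAKE-VIRUS-SIGNATURE"):
    """File-based: buffer the complete object, then inspect it once.

    The signature is a made-up placeholder standing in for a real engine.
    Returns True when the object is considered clean.
    """
    data = b"".join(chunks)  # nothing can be decided until all data arrived
    return signature not in data


def stream_based_strip(chunks):
    """Stream-based: forward data as it arrives, dropping <script> blocks."""
    buf = b""
    inside_script = False
    for chunk in chunks:
        buf += chunk
        while True:
            lowered = buf.lower()  # bytes.lower() only changes ASCII letters
            if inside_script:
                end = lowered.find(CLOSE_TAG)
                if end == -1:
                    # Discard script content, but keep a short tail in case
                    # the closing tag is split across two chunks.
                    buf = buf[-(len(CLOSE_TAG) - 1):]
                    break
                buf = buf[end + len(CLOSE_TAG):]
                inside_script = False
            else:
                start = lowered.find(OPEN_TAG)
                if start == -1:
                    # Forward everything except a tail that could be the
                    # beginning of a split opening tag.
                    keep = len(OPEN_TAG) - 1
                    if len(buf) > keep:
                        yield buf[:-keep]
                        buf = buf[-keep:]
                    break
                if start:
                    yield buf[:start]
                buf = buf[start + len(OPEN_TAG):]
                inside_script = True
    if buf and not inside_script:
        yield buf  # flush the remaining tail at the end of the stream
```

For example, joining the output of stream_based_strip([b"<p>Hi</p><scr", b"ipt>alert(1)</scr", b"ipt><p>Bye</p>"]) gives b"<p>Hi</p><p>Bye</p>", even though both script tags are split across chunk boundaries. The placeholder signature in file_based_check is detected only because the chunks are joined first, which mirrors why virus filters cannot work on partial files.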