Nuts And Bolts Archiving Email Lead image: .marqs, photocase.com

Techniques for archiving email

Store and Find

Email archiving involves more than just backing up your email directories. It is also a question of classifying email and helping users find their way around overfilled email folders.

By Jörg Fritsch

In the past few years, very few technology developments have been as versatile and necessary as email archiving. In fact, it ranks high on the best practices list of many corporations. Many administrators who hear the term "email archiving" for the first time think about data privacy and automatically file the term away in the category of backup. However, email archiving makes sense outside the scope of backup, if you think of it in terms of email management.

Document Lifecycle Management and Compliance

For decades, corporations have developed their filing systems into something approaching an art form; thus, they can protect records and retain them for a defined period of time. To cope with these mountains of paper, many policies have been introduced to define how long to retain certain document types before they finally end up in the shredder. Little is left to common sense: In the paper world, everything is strictly organized.

Email management applies these policies and processes to email, thus ensuring harmonized document lifecycle management. It is not a question of keeping meaningless messages indefinitely, although some storage vendors might recommend doing so. At the same time, you can't just adopt quotas and a policy of benign neglect. Just like physical filing, email management requires a policy that, ideally, defines which email the system should keep (and for how long) and which it should delete beyond any possibility of recovery.
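Such a policy can be reduced to a lookup table plus a date comparison. The following Python sketch is only an illustration: the class names, the retention periods, and the function `retention_action` are hypothetical placeholders, not values taken from any regulation or product.

```python
from datetime import date, timedelta

# Hypothetical policy table: message class -> retention period in days.
# Classes and periods are placeholders, not legal guidance.
RETENTION_DAYS = {
    "business": 4 * 365,
    "internal": 2 * 365,
    "private": 90,
}

def retention_action(message_class, received, today):
    """Return 'keep' or 'delete' for a message of the given class."""
    days = RETENTION_DAYS.get(message_class)
    if days is None:
        # Unknown classes are kept so that nothing is destroyed by accident.
        return "keep"
    expires = received + timedelta(days=days)
    return "delete" if today >= expires else "keep"
```

Keeping unclassified messages by default is a deliberate choice in this sketch: a policy that deletes beyond recovery should only act on messages it can classify.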

Not many corporations have actually achieved this ideal state. Because of a lack of email classification tools, most businesses are unable to distinguish between Internet email (i.e., email that reached the company across the Internet) and internal email and between business and private correspondence. Currently, only two classes typically exist: spam and non-spam. Administrators who want to try out more granular classifications or experience the look and feel of a professional application (e.g., by Titus Labs [1]) can use POPFile [2] on Linux to generate multiple mail classes. For more details on practical email classification, see the "Practical Email Classification" box.
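To give an idea of how a classifier can sort mail into more than two classes, here is a minimal naive Bayes sketch in Python. It is not POPFile's code and makes no claim to match its behavior; the class `TinyBayes`, the tokenizer, and the training phrases in the usage note are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

class TinyBayes:
    """Minimal multi-class naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of training messages

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, label, text):
        self.doc_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def classify(self, text):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus log likelihood of each word, with Laplace smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

After training with a few labeled messages, for example `train("business", "invoice payment contract quarterly report")` and `train("private", "lunch weekend party birthday")`, a call to `classify()` returns the label whose vocabulary best matches the new message. A real deployment would need far more training data per class.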

Retention Periods

Compliance rules define how long a corporation must retain email (see the "Compliance Sources" box). Rules in the US are stipulated by HIPAA (Health Insurance Portability and Accountability Act) and SOX (Sarbanes-Oxley). HIPAA mandates archiving of patient-related data for two years after the patient's death. SOX mandates retention of relevant data up to four years after an audit. Major banks or pharmaceutical companies typically are multi-national, so compliance requirements of other countries must be taken into account as well.

Most of these frameworks don't refer to email directly, but to data archiving, and opinions differ among professionals as to which rules and periods actually apply to email.

Mail Server Diet

Even if you find the discussion about document lifecycle management and compliance too abstract, you might enjoy looking into email management from a technical point of view.

Simply doing nothing would be the wrong decision, because it would leave everything to the user's discretion, and administrators will be aware that this is not a good idea: Many users keep everything and never delete their email. Quotas that give users a limited amount of space for their inboxes are an initial approach, but in the real world quotas only work well for hosting platforms, Internet service providers, and web mail providers. In an enterprise environment, an administrator is more likely to accept an oversized inbox because automatic "Recipient out of quota" messages look unprofessional to customers or business partners.

Where quotas are in effect, users sometimes resort to self-managed offline archives, such as Microsoft Outlook .pst files. Archives of this type are problematic because a corporation can never know where its employees are storing their email or whether those archives are covered by a backup plan. Users tend to manage offline archives locally on laptops and PCs that are not covered by the backup policy. If the laptop is lost or damaged, the company does not have a backup, and it is difficult to assess how much confidential data has been lost. In Exchange environments, another problem is that users occasionally let their .pst files grow to a size at which they suddenly suffer irreparable damage, independent of the operating system.

Practical Archiving

Basically, there are three methods of archiving email:

Hosted services (i.e., outsourcing) are only interesting for administrators who simply need to archive incoming and outgoing email and can leave everything else to the users. From a technical point of view, the service provider's servers are listed in the DNS MX record for all incoming email and act as the smart host for all outgoing mail. The biggest provider in this field is MessageLabs [6]. This variant is of most interest to corporations that already have accounts with these providers and use their spam and virus filters.

If you're not a customer of one of these services but do want to archive email on Linux, you can capture email at any suitable point and push it into the archive. The easiest option is to let a Linux system capture the mail flow on a SPAN port (Switched Port Analyzer, mirror port). The Linux machine would simply run a tcpdump command to archive all network traffic on TCP port 25 (SMTP).
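A capture of this kind could be started with a command like the one built below. This is a sketch under assumptions: the function name `smtp_capture_command`, the interface name, and the output directory are hypothetical. The tcpdump options used (`-i`, `-n`, `-s`, `-G`, `-w`) are standard, but check them against the man page of your tcpdump version.

```python
import shlex

def smtp_capture_command(interface, out_dir, rotate_seconds=86400):
    """Build a tcpdump argument list that records SMTP traffic to rotating files."""
    return [
        "tcpdump",
        "-i", interface,            # the interface attached to the SPAN port
        "-n",                       # do not resolve addresses to names
        "-s", "0",                  # capture whole packets, not just headers
        "-G", str(rotate_seconds),  # start a new output file at this interval
        "-w", f"{out_dir}/smtp-%Y%m%d-%H%M%S.pcap",  # strftime pattern, used with -G
        "tcp", "port", "25",        # capture filter: SMTP only
    ]

# Hypothetical interface and directory; adjust to the local setup.
command = smtp_capture_command("eth1", "/var/archive/smtp")
print(shlex.join(command))
```

The list can be passed to `subprocess.run()` on a machine where tcpdump is installed and the user has the required capture privileges. Keep in mind that a raw capture stores SMTP sessions as they appear on the wire, so sessions encrypted with STARTTLS cannot be read back from such an archive.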

A medium-sized company could use a simple recorder and about 1TB of disk space to archive one or two years' worth of email. The use of tcpdump for archiving purposes sounds complicated and not very user friendly, but this approach works if you don't need to restore email very often (i.e., one to five times a year). The NetVCR appliance by Niksun [7] provides a commercial alternative to tcpdump.

Setting up a Smart Host

A smart host that copies all your email as it passes through is easy to set up: Many appliances that archive email are smart hosts. The only thing that distinguishes them from a do-it-yourself smart host is more convenience in email searching and, in the high-end sector, the kind of media the software uses for archiving.

Service providers often suggest WORM (Write Once, Read Many) media: WORM-based appliances write to DVDs, to tapes (which lets you go on using existing, legacy backup systems), or to special arrays. Right now, WORM-based systems are only typical in the financial industry.

The typical issues with do-it-yourself solutions are a lack of user-friendliness and possibly a lack of scalability. Good solutions are easy to use and at least offer some kind of user involvement by giving the user access to their archived email throughout the retention period.

Other solutions tie in directly with the email system. Plugins let users access their own archived email transparently using Outlook or Lotus Notes. Symantec Enterprise Vault [8] is an example of this kind of solution. One decision administrators must make before deploying archiving technology relates to the method the software solution will use to archive the raw data and how it will modify the data, if at all.

Metadata and Pointers

Email messages comprise not only payload data and usage information, but also metadata and headers that say where the message originated, which route it took to reach your company, or how it left the company.
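The routing information in question lives mainly in the Received headers, which each relaying server adds at the top of the message. The following sketch uses Python's standard email package to pull this metadata out of a raw message; the sample message, the host names, and the function `extract_metadata` are made up for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Invented sample message; hosts and addresses are placeholders.
RAW = """\
Received: from mail.example.org (mail.example.org [192.0.2.10])
\tby mx.example.com with ESMTP id ABC123; Mon, 06 Jan 2020 10:15:00 +0000
Received: from client.example.org (client.example.org [192.0.2.20])
\tby mail.example.org with ESMTP id XYZ789; Mon, 06 Jan 2020 10:14:58 +0000
From: alice@example.org
To: bob@example.com
Subject: Quarterly report
Date: Mon, 06 Jan 2020 10:14:55 +0000
Message-ID: <1234@example.org>

Body text.
"""

def extract_metadata(raw):
    """Return the headers an archive would typically want to preserve."""
    msg = message_from_string(raw)
    return {
        "message_id": msg["Message-ID"],
        "from": msg["From"],
        "to": msg["To"],
        "date": parsedate_to_datetime(msg["Date"]),
        # One entry per relay, newest hop first; folded lines are joined.
        "route": [" ".join(h.split()) for h in msg.get_all("Received", [])],
    }
```

If an archiving product only sees the message after it has passed the smart host, the `route` list is the part most likely to be shortened or lost, which is worth checking during an evaluation.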

The available technologies and products differ greatly in this respect. In many cases, additional metadata are generated by the archiving process; some solutions completely discard the metadata because they only store the route to the smart host or service provider. Other solutions let you view, but not extract, the metadata.

Because choosing an archiving system will have long-term effects, it is important to check your company's email policy to see which features you need. One major criterion is the way the system archives email. Some solutions write email messages as files to the filesystem (with a SAN or NAS), whereas others store email messages in a database.

Most systems that use databases replace attachments with pointers: If the same attachment is sent with multiple mail messages, the attachment is only stored once rather than duplicated. The same thing applies to messages with multiple recipients in the To, CC, or BCC fields. A combination of pointers and compression can save a huge amount of disk space.
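The pointer idea can be sketched with a content-addressed store: each attachment is saved under the hash of its content, and messages only hold references. The class `AttachmentStore` below is a hypothetical in-memory illustration, not the storage layout of any particular product.

```python
import hashlib

class AttachmentStore:
    """Stores each distinct attachment once; messages keep pointers to it."""

    def __init__(self):
        self.blobs = {}     # SHA-256 digest -> attachment bytes
        self.messages = {}  # message ID -> list of (filename, digest) pointers

    def add_message(self, message_id, attachments):
        pointers = []
        for filename, data in attachments:
            digest = hashlib.sha256(data).hexdigest()
            # Only the first copy is stored; later copies reuse the digest.
            self.blobs.setdefault(digest, data)
            pointers.append((filename, digest))
        self.messages[message_id] = pointers

    def get_attachments(self, message_id):
        """Resolve the pointers of one message back into file contents."""
        return [(name, self.blobs[d]) for name, d in self.messages[message_id]]
```

If the same 5MB presentation is attached to 20 messages, this store holds 5MB of attachment data plus 20 small pointers instead of 100MB.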

With regard to disk space, you have another area of concern: Some solutions copy user email to the archive and still let users manage their "live email." In this case, the archive is simply used for auditing purposes, and the users are just as exposed to the flood of email as they would be without an archiving service. Other solutions move email into the archive and thus save each message once only on the enterprise network.

E-Discovery: Email Archiving and Auditing

An archive is not there just to keep the mail server lean and fast or to help users with email management tasks. The term e-discovery often comes up in the context of email archiving: E-discovery tools keep evidence accessible and confidential. To do so, the software indexes the stored data and then supports full-text searching.
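Indexing and full-text search can be illustrated with a simple inverted index that maps each word to the messages containing it. The class `ArchiveIndex` is a hypothetical, simplified sketch; real e-discovery tools add stemming, phrase search, access control, and persistent storage.

```python
import re
from collections import defaultdict

class ArchiveIndex:
    """Inverted index: term -> set of IDs of the messages that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, message_id, text):
        for term in self.tokenize(text):
            self.postings[term].add(message_id)

    def search(self, query):
        """Return the IDs of messages that contain every term in the query."""
        terms = self.tokenize(query)
        if not terms:
            return set()
        results = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*results)
```

A query then becomes a set intersection rather than a scan through every archived message, which is what keeps searches fast as the archive grows.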

Besides email and instant messages, e-discovery includes messages transmitted via other technologies. Some major banks go so far as to include the voice messages from VoIP phone mailboxes in their document lifecycle and use this material as documented evidence or for audit purposes.

An electronic signature for each email in the archive is not typical – this would add overhead for PKI, PKI policies, and key escrow to email archiving (see the "Key Escrow" box). Some products sign the archive on a daily basis, but this doesn't improve confidentiality unless you have a robust enterprise policy in place with respect to the keys used for this purpose.
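A daily seal over the archive can be sketched as follows. Note the substitution: instead of a PKI-based signature, this example uses an HMAC with a shared secret key, because that is available in the Python standard library. The functions `daily_seal` and `verify_seal` and the way the key is handled are assumptions for illustration only.

```python
import hashlib
import hmac

def daily_seal(key, messages):
    """Compute one keyed digest over all messages archived on a given day."""
    # Hash each message, then sort so the seal does not depend on message order.
    digests = sorted(hashlib.sha256(m).hexdigest() for m in messages)
    manifest = "\n".join(digests).encode("ascii")
    return hmac.new(key, manifest, hashlib.sha256).hexdigest()

def verify_seal(key, messages, seal):
    """Return True if the messages still match the stored seal."""
    return hmac.compare_digest(daily_seal(key, messages), seal)
```

The seal only shows that the day's messages have not been altered since it was computed. As with the signatures described above, its value depends entirely on who holds the key and how that key is protected.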

Email – A Legacy Technology?

Although email is widespread as a communications technology, the technology itself has not developed greatly in the past eight to 10 years. Users today use email for chatting, collaboration, and exchanging documents.

Alternatives that support collaboration and communication, and add the ability to archive files or messages, are listed in Table 1. Besides the benefits stated in column three of the table, these newer tools (at least theoretically) reduce the amount of email that users need to send.

Table 1: Alternative Technologies

| Requirements | Technology/Product | Benefits |
|---|---|---|
| Informal conversations | Instant Messaging (IM) | Brief, informal conversations do not stress the mail server and inboxes |
| Collaboration | IM, WebEx, Zoho, Fuze, Groove, Joomla, SharePoint, Alfresco | More dimensions to collaboration than with email (shared desktops, interactive) |
| Document exchange | SharePoint portal, Alfresco | Easily indexed, managed project spaces, full document lifecycle management |

Some experts predict that this development will make email a legacy technology within the next five to 10 years. Although it is hard to say whether this will happen, after years without any major changes in the use of email, email archiving is at least an innovation that improves and facilitates the use of the medium.