
Archiving email and documents for small businesses

Long-Term Storage

You can easily archive email and attachments, thereby saving overhead and money. This article shows various approaches that use built-in tools on Linux and FreeBSD. By Harald Zisler

In many countries, business owners are required by law to archive email and electronic documents containing business content (see "Private Email Use in the Enterprise"). But even without legal constraints, orderly archiving of email and electronic documents can help reconstruct the facts of a business process without too much trouble, even years later.

File Format Requirements

Email often contains attachments created with various office applications, which can cause problems with long-term storage. The file formats these programs use are constantly changing.

The information must be stored in a format that will remain readable for a long time. Plaintext offers the biggest benefits: Text files are easy to search, to store in databases, and to convert to a different character set. Using Rich Text Format (RTF) as the storage format for office documents preserves most of the formatting and, because RTF is itself text based, retains the benefits of a plaintext file. In OpenOffice/LibreOffice, you can go to Tools | Options | Load/Save | General and select Rich Text Format as the default format in the Always save as field (Figure 1).

Figure 1: Setting RTF as the standard file type.

PDF files can be searched with the use of standard tools such as pdfgrep or pdftotext, but these tools will not work for every document because users can choose to prevent searching when they create documents. Files created using the print function or the CUPS PDF printer also cannot be searched using these standard tools because what looks like text is actually graphical material.
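Whether a given PDF can be searched at all can be checked before it goes into the archive. The following is a minimal sketch, assuming pdftotext is installed; the function name pdf_has_text is made up for this illustration. It asks pdftotext to write the extracted text to standard output and reports success only if that output contains at least one letter or digit.

```shell
#!/bin/sh
# pdf_has_text FILE (hypothetical helper): succeed if pdftotext can
# extract at least one alphanumeric character from FILE
pdf_has_text() {
  # "-" as the output file tells pdftotext to write to standard output
  pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}
```

A loop such as `for f in *.pdf; do pdf_has_text "$f" || echo "$f needs OCR or manual search keys"; done` would then list the documents that the standard tools cannot search.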

Capture Point for Email

The major league email archiving software grabs incoming and outgoing mail directly at the company's internal mail server. At this point, a copy of every message is created and archived. In a small business, you will not typically have the infrastructure for this because the email accounts are hosted by an external provider.

The email client needs to use the Maildir storage format to ensure a single file for each piece of email. One of the following methods can be used for archiving email in small businesses for one or multiple users:

The content of the inbox and outbox is regularly copied to the archive at the local email client. This also works for multiple users.
The mail to be archived is forwarded by the local email client to an archive account. This method requires cooperation on the part of the users.
Incoming and outgoing email that is no longer needed for active correspondence is manually copied to an archive directory.
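The first method, regular copying, could be sketched as follows. This is an illustration under stated assumptions, not a finished tool: the function name archive_copy and the directory layout are hypothetical, and the script relies on the Maildir convention that a file name in cur consists of a unique part, a colon, and status flags. Stripping the flags keeps a message from being copied a second time after the client marks it as read or replied.

```shell
#!/bin/sh
# archive_copy SRC DST (hypothetical helper): copy every message from
# the Maildir folder SRC into DST unless DST already holds it
archive_copy() {
  src=$1
  dst=$2
  mkdir -p "$dst"
  for f in "$src"/cur/*
  do
    # An empty folder leaves the pattern unexpanded; skip it
    [ -f "$f" ] || continue
    name=$(basename "$f")
    # Assumption: Maildir names look like "unique:2,FLAGS";
    # only the unique part identifies the message
    id=${name%%:*}
    [ -e "$dst/$id" ] || cp -p "$f" "$dst/$id"
  done
}
```

Called from cron once per user and folder, for example with the inbox and the sent folder as SRC, this would keep the archive up to date without user cooperation.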

Storing Email and Documents

The archive directories are located on the system data partition or disk, and the content of these directories is backed up regularly. Additional copies of the archive to external media will allow for evaluation of the data on a different system some time later.

Two methods of storing data include the following:

Storing all email in an SQL database. MIME encoding removes the need to use the BLOB field type, which improves the speed of the database. To back up externally, you will typically need to perform a dump. And, you need to watch out for RDBMS version changes, which can cause extra work.
Storing all email and documents in a directory, which can be accessed using suitable search tools. Because these are ordinary files, there is little risk of problems when performing a search even years later.

To store email and documents, you will need to create an Archive folder in the email client. This is a link to a directory to which multiple users have access (Figure 2).

Sample Script split.sh

To allow searching of email with attachments, the Base64-encoded attachments need to be decoded and stored as individual files. Integrating ripmime [1] automates this process in a shell script.

The split.sh script (Listing 1) moves the email messages out of the Archive directory into the storage directory. After that, the messages are no longer accessible to the users. Each email is assigned a separate subdirectory for storage. This prevents accidental overwriting of attachments with the same name, and you can see which attachment belongs to which piece of mail.

Listing 1: split.sh

#! /bin/sh
for f in archive/cur/*
do
  # Skip the unexpanded pattern if the directory is empty
  [ -f "$f" ] || continue
  i=$(basename "$f")

  # Define name component for the individual mail directory
  pref=$(date +%Y-%m-%d-%H-%M-%S)

  # Push mail to storage
  mv "archive/cur/$i" storage

  # Create storage directory for email
  mkdir "storage/$pref.$i"

  # Extract attachments from mail
  ripmime -i "storage/$i" -d "storage/$pref.$i"

  # Push original mail into storage directory
  mv "storage/$i" "storage/$pref.$i"
done

The directory name also contains the date and time at which the script archived the message, which is useful if you are viewing the data manually and need data from a certain period. The script only shows the basic principle and still needs to be refined.
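One possible refinement is to move the original message only after the attachments have been extracted, so that nothing disappears from the Archive folder if extraction fails. The sketch below is an illustration, not part of the original script: the function name store_mail is hypothetical, and it assumes that ripmime signals failure through a non-zero exit status.

```shell
#!/bin/sh
# store_mail FILE STORAGE (hypothetical helper): unpack one message into
# its own directory below STORAGE; move the original only on success
store_mail() {
  mail=$1
  target=$2/$(date +%Y-%m-%d-%H-%M-%S).$(basename "$mail")
  mkdir -p "$target" || return 1
  # Assumption: ripmime returns a non-zero status when extraction fails
  if ripmime -i "$mail" -d "$target"; then
    mv "$mail" "$target"
  else
    echo "ripmime failed for $mail, leaving it in place" >&2
    # Remove the target directory again if it is still empty
    rmdir "$target" 2>/dev/null
    return 1
  fi
}
```

The loop in split.sh would then shrink to a single call such as `store_mail "$f" storage` for each file in archive/cur.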

I applied the sample script in split.sh to three different email messages, one of which had an attachment. Figure 3 shows the directory structure in the storage directory after running the script.

Figure 3: Directory structure in storage after running the script.

Surface Mail in the Electronic Archive

Written correspondence and other documents can also be stored in the electronic archive. You need to scan the documents, store them in PDF format, and tag them with search keys. The search keys are entered manually. You can use automated optical character recognition (OCR), but this will require some manual revision. In a production environment, the search terms are added in dialogs with fixed terms (incoming, customer ID, etc.) and free text.

The Archive.sh script (Listing 2) shows the basic workflow. To begin, gscan2pdf [2] scans the document. The program also lets you integrate OCR, in this case, gocr [3]. You can edit the scan results with an editor and store the results in the clipboard. After saving and quitting gscan2pdf, a text editor is launched. You can then insert the clipboard content and add the search keys (customer ID, date, complaint, return number, etc.). The shell script uses pdflatex [4] to convert the content into a searchable PDF document, then pdftk [5] merges this with the scanned PDF document. After viewing the results, you push the PDF file into the archive transfer directory, where it is picked up and moved to the target directory.

Listing 2: Archive.sh

#! /bin/sh

cd /home/incominggoods/archive || exit 1

# Scan (save the result from gscan2pdf as copy.pdf)
gscan2pdf

# Create placeholder text for editor
echo "Overwrite this and enter search keys!" > searchkeys.tex

# Edit search keys
gedit searchkeys.tex

# Create LaTeX document; the quoted here-document keeps the
# backslashes intact regardless of the shell's echo behavior
cat > att1.tex <<'EOF'
\documentclass[10pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage{ngerman}
\usepackage[official,right]{eurosym}
\begin{document}
EOF
printf '%s\n' '\end{document}' > att3.tex

# Merge LaTeX file components
cat att1.tex searchkeys.tex att3.tex > attach.tex
pdflatex attach.tex

# Generate filename from date and process ID
dn=$(date +%Y-%m-%d-%H-%M-%S).$$

# Merge PDF files
pdftk A=copy.pdf B=attach.pdf cat A B output "archive$dn.pdf"

# Check the results by displaying the PDF file
evince "archive$dn.pdf"
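The final step of the workflow, picking up finished files from the transfer directory and moving them to the target directory, is not part of the listing. It could be sketched as follows; the function name pickup and both directory arguments are hypothetical, and the sketch simply refuses to overwrite a file that already exists in the target.

```shell
#!/bin/sh
# pickup TRANSFER TARGET (hypothetical helper): move finished PDF files
# from TRANSFER to TARGET without overwriting archived documents
pickup() {
  transfer=$1
  target=$2
  mkdir -p "$target"
  for f in "$transfer"/*.pdf
  do
    # An empty directory leaves the pattern unexpanded; skip it
    [ -f "$f" ] || continue
    name=$(basename "$f")
    if [ -e "$target/$name" ]; then
      echo "skipping $name: already in $target" >&2
    else
      mv "$f" "$target/$name"
    fi
  done
}
```

Because the file names generated by Archive.sh contain the date, time, and process ID, a name clash should be rare; the check is a safeguard for files that were renamed by hand.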

Accessing Archived Data

Access to the archived data is provided by a web server, which the users can access on the local network. The service must provide protection against unauthorized access and reading. Potential approaches include encrypting communication and restricting access to localhost. If access is restricted to localhost, users reach the service either at the console or over SSH, NoMachine, or RDP on the local network.

Namazu

Namazu [6] is a full-text search engine that lets you search the archive. The program collaborates with web servers, and the package also includes mknmz for creating an index. The namazu command then uses this index to find the desired documents.

On Debian, the configuration files are located under /etc/namazu:

mknmzrc: Define the maximum file size and text scope to which indexing is performed:

$FILE_SIZE_MAX   = 900000000;
$TEXT_SIZE_MAX   = 900000000;

These values are based on experience. Other configuration options relate to weighting the results, supported and non-supported file types, directory locations, and display options.

namazurc: Enter at least the Index, Template, and Lang details here.

The project website has an exhaustive and very readable guide. The index is built after issuing the mknmz command. The program can also handle MIME-encoded mail with the --decode-base64 option. The sample shell script indexbuilder.sh

#! /bin/sh
mknmz -a storage/* -O namazu/index

uses the -a option, which includes all files in the index. The storage location for the index is specified by the -O option.

In this shell script, mknmz outputs a message for each document. At the end of the run, a statistic is shown. When new documents are added, information for the documents is added to the existing index. The index is not normally rebuilt.

The search.sh script (Listing 3) shows a minimalist query function. The call to namazu points to the configuration file (-f). The output is limited to 500 matches (-n) and is returned in HTML format (-h). See Figure 4 for the results.

Listing 3: search.sh

#! /bin/sh
while true
do
  clear

  # Search key
  printf "Enter search key: "; read -r sube

  # Query: the literal quotes pass the input to namazu as a phrase
  namazu -f /etc/namazu/namazurc -n 500 -h "\"$sube\"" > searchresults.html

  # Display
  iceweasel searchresults.html &

  # Stop or continue
  printf "<<<<<<<<<< Press Ctrl C to stop >>>>>>>>>>>>>>>>"; read -r wn
done

Alternatives

Instead of using Namazu, you can search text files (RTF format) with grep and PDF files with pdfgrep [7]. The pdftotext tool extracts text from PDF files, and you can then use the resulting text file in a database application.
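The grep and pdftotext approach can be combined in a small helper. The following is a minimal sketch under stated assumptions: the function name search_archive is hypothetical, PDF files are recognized by their .pdf suffix, and the search term is treated as a fixed string rather than a regular expression.

```shell
#!/bin/sh
# search_archive TERM DIR (hypothetical helper): print the names of all
# files below DIR that contain the fixed string TERM
search_archive() {
  term=$1
  dir=$2
  # Text-based files (mail, RTF, plaintext): grep reads them directly
  find "$dir" -type f ! -name '*.pdf' -exec grep -lF -- "$term" {} +
  # PDF files: convert to text on the fly, then search the text
  find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf
  do
    if pdftotext "$pdf" - 2>/dev/null | grep -qF -- "$term"; then
      printf '%s\n' "$pdf"
    fi
  done
}
```

A call such as `search_archive "4711" storage` would list every stored message, attachment, and scanned document that mentions the customer ID 4711, provided the PDF actually contains extractable text.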