Use Case 1: Whitelist generator

The purpose of this use case is to generate an email address whitelist file from the addresses found in the “To”, “Cc” and “From” headers of a set of email files.

As input, we have a directory containing email files. As output, we want a text file with one raw email address per line.

In this second version of the use case, we'll use TOS 2.3.0M1. To see the previous version, browse the previous revisions of this wiki page.

Context

We use two context variables to make the final script easy to use anywhere. The first variable is the input directory, the other is the filename of the output whitelist.

We'll use the emaildir in the tFileList and the whitelist in the final tFileOutputDelimited.

Find email files

The component is a tFileList. Here we use the emaildir context variable to tell the tFileList where to search for files. My email filenames are numbers, so my filemask is simply '[0-9]+', which means that the filename must contain at least one digit.

As my email files are dispatched into several sub-directories, I check the Include Sub directories option.

Going out from the tFileList, we use an iterate link because we don't transmit data rows. The iterate link runs the tFileInputMail once for each file found.
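
Outside the Studio, the same recursive search can be sketched in plain Perl with the core File::Find module. This only illustrates what the tFileList is configured to do here, it is not how the component is implemented. The helper name find_email_files is hypothetical, and the mask is applied to the bare filename without anchors, following the "at least one digit" description above.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every regular file under $dir (sub-directories
# included) whose bare filename matches $mask.
sub find_email_files {
    my ($dir, $mask) = @_;
    my @files;
    find(
        sub {
            # Inside the callback, $_ is the bare filename and
            # $File::Find::name is the full path.
            push @files, $File::Find::name if -f $_ && $_ =~ $mask;
        },
        $dir
    );
    my @sorted = sort @files;
    return @sorted;
}
```

A call such as find_email_files($emaildir, qr/[0-9]+/) would then play the role of the tFileList plus its iterate link: one path per email file, sub-directories included.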

Extract email to,from,cc

The component is a tFileInputMail. We set a schema with three columns: “to”, “from” and “cc”. In the Mail parts properties, we map each schema column to the name of the corresponding email header.

In the File Name property, we use the current file path delivered by the preceding tFileList: $_globals{tFileList_1}{CURRENT_FILEPATH}.
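
To make the header mapping concrete, here is a minimal Perl sketch of what reading the “to”, “from” and “cc” headers from a raw email involves. It is an assumption-laden illustration, not the component's code: the helper name extract_headers is hypothetical, it only handles plain header lines and folded continuation lines, and a repeated header simply overwrites the previous value.

```perl
use strict;
use warnings;

# Hypothetical helper: parse the header block of a raw email text and return
# the requested headers as a hash keyed by lower-cased header name.
sub extract_headers {
    my ($raw, @wanted) = @_;
    my (%header, $current);

    foreach my $line (split /\r?\n/, $raw) {
        last if $line eq '';                    # a blank line ends the headers
        if ($line =~ /^\s+(.*)/ && defined $current) {
            $header{$current} .= " $1";         # folded continuation line
        }
        elsif ($line =~ /^([^:\s]+):\s*(.*)/) {
            $current = lc $1;
            $header{$current} = $2;
        }
    }

    # Missing headers (for example no Cc) come back as empty strings.
    my %result;
    foreach my $name (map { lc } @wanted) {
        $result{$name} = defined $header{$name} ? $header{$name} : '';
    }
    return %result;
}
```

Calling extract_headers($raw_text, qw(to from cc)) yields one value per schema column, which is the shape of the row the rest of the job works on.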

N flow to 1 flow

The problem with the iterate link is that it creates several distinct data flows. Here, however, we want to apply global operations (such as duplicate removal and sorting), so we need to merge the N flows into a single flow. tUnite does this job.

Email extractor

The data extracted by the preceding tFileInputMail is not formatted exactly as we would like. We get something like…

"Fabrice BONAN" <fbonan@talendforge.com>, Cedric CARBONE <ccarbone@talendforge.com>,
"Pierrick LE GALL" <plegall@talendforge.com>, "Bertrand DIARD" <bdiard@talendforge.com>

… and we would prefer a clean list of email addresses:

fbonan@talendforge.com,ccarbone@talendforge.com,plegall@talendforge.com,bdiard@talendforge.com

To achieve this, we need a dedicated Perl function. We create a user routine file “mail” and add the extractMailAddresses function to it. Here is the content of the “mail” routine file:

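use Exporter;

use vars qw(@EXPORT @ISA);

@ISA = qw(Exporter);
@EXPORT = qw(
    extractMailAddresses
);

# Return every email address found in the given strings, in order of appearance.
sub extractMailAddresses {
    my @email_addresses = ();

    foreach my $field (@_) {
        # /g resumes after the previous match; /i also accepts upper-case domains
        while ($field =~ m{([\w-]+(?:\.[\w-]+)*@[\w-]+(?:\.[\w-]+)*\.[a-z]+)}gi) {
            push @email_addresses, $1;
        }
    }

    return @email_addresses;
}

1;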

Now we can use the new user routine on the list of fields coming from tFileInputMail.
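
As a quick check outside the Studio, the routine can be exercised on the sample headers shown earlier. The sub below restates the routine so that the script runs standalone; the /i modifier is an addition so that upper-case domains also match. Joining the result with commas is an assumption about how the list becomes a single field again, since the component settings are not reproduced on this page.

```perl
use strict;
use warnings;

# Standalone restatement of the routine's logic.
sub extractMailAddresses {
    my @email_addresses = ();
    foreach my $field (@_) {
        while ($field =~ m{([\w-]+(?:\.[\w-]+)*@[\w-]+(?:\.[\w-]+)*\.[a-z]+)}gi) {
            push @email_addresses, $1;
        }
    }
    return @email_addresses;
}

# The two sample header values from the section above.
my @fields = (
    '"Fabrice BONAN" <fbonan@talendforge.com>, Cedric CARBONE <ccarbone@talendforge.com>,',
    '"Pierrick LE GALL" <plegall@talendforge.com>, "Bertrand DIARD" <bdiard@talendforge.com>',
);

# Assumption: the addresses are joined with commas to form one field per row.
my $emails = join ',', extractMailAddresses(@fields);
print "$emails\n";
```

Run on these two values, the script prints the four bare addresses on one comma-separated line, which is the clean list we were aiming for.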

One address per row

Now we have many email addresses on each row, separated by commas. To perform operations such as duplicate removal, we need exactly one email address per row. This operation is a normalization, which tNormalize performs.
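
In plain Perl, this normalization amounts to splitting each row on the separator and emitting one row per item. The sketch below shows the idea only; the helper name normalize_rows is hypothetical, and trimming whitespace and skipping empty items are assumptions rather than documented tNormalize behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: turn rows holding separator-delimited values into
# rows holding exactly one value each.
sub normalize_rows {
    my ($separator, @rows) = @_;
    my @normalized;
    foreach my $row (@rows) {
        foreach my $item (split /\Q$separator\E/, $row) {
            $item =~ s/^\s+|\s+$//g;                  # trim surrounding whitespace
            push @normalized, $item if length $item;  # skip empty items
        }
    }
    return @normalized;
}
```

For instance, normalize_rows(',', $row1, $row2) turns two rows of comma-separated addresses into a flat list with one address per element.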

Lowercase

To simply lowercase a field, we use the tPerlRow component:

$output_row[email] = lc $input_row[emails];

Remove duplicates

Removing duplicates is the speciality of the tUniqRow component. Connect the next component to its “uniques” output row.
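
The order of the two previous steps matters: lowercasing first means that addresses differing only by case collapse into one entry. A minimal Perl sketch of the duplicate removal, using the common "seen hash" idiom (the helper name unique_rows is hypothetical, and this is not a description of how tUniqRow works internally):

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first occurrence of each value, drop repeats.
sub unique_rows {
    my %seen;
    my @unique = grep { !$seen{$_}++ } @_;
    return @unique;
}
```

Applied to already lowercased rows, for example unique_rows(map { lc } @emails), each address is kept exactly once, in order of first appearance.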

Sort

We use the tSortRow to sort emails alphabetically.

Whitelist

The final tFileOutputDelimited uses the whitelist context variable to know which file to write.
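
The last two steps, sorting and writing one address per line, can be sketched together in plain Perl. The helper name write_whitelist is hypothetical; $path stands in for the value of the whitelist context variable, and the default string sort is assumed to be an acceptable stand-in for the alphabetical sort.

```perl
use strict;
use warnings;

# Hypothetical helper: sort the addresses and write one per line to $path.
sub write_whitelist {
    my ($path, @emails) = @_;
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} "$_\n" for sort @emails;
    close $out or die "Cannot close $path: $!";
    return;
}
```

The resulting file has the shape described at the top of this page: a text file with raw email addresses as lines.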

ChangeLog

Version 2

  • job moved to TOS 2.3.0M1 (one year later)
  • tFileList can include sub-directories to perform a recursive file search
  • no temporary file needed thanks to tUnite + tNormalize

Version 1

  • uses TOS 1.1.0RC1

TODO

  • remove email addresses from a graylist (solution available in version 1 of the job)
  • add email addresses from a “pure whitelist” created manually (people you would really like to receive an email from but who have never sent any… and certainly won't in the future)
  • add an execution summary: number of occurrences of each address, etc.
 
use_case/1.txt · Last modified: 2007/11/29 14:50 by plegall
 
 