The purpose of this use case is to generate an email address whitelist file from the addresses found in the “To”, “Cc” and “From” headers of a set of email files.
As input, we have a directory containing email files. As output, we want a text file with one raw email address per line.
In this second version of the use case, we'll use TOS 2.3.0M1. To see the previous version, browse the previous revisions of this wiki page.
We use two context variables to make the final script easy to use anywhere. The first variable is the input directory, the other is the filename of the output whitelist.
We'll use the emaildir in the tFileList and the whitelist in the final tFileOutputDelimited.
The first component is a tFileList. Here we use the emaildir context variable to tell the tFileList where to search for files. My email filenames are numbers, so my filemask is simply '[0-9]+', which means that the filename must contain at least one digit.
As my email files are dispatched into several sub-directories, I check the Include Sub directories option.
Coming out of the tFileList, we use an iterate link, since no data rows are transmitted. The iterate link runs the tFileInputEmail once for each file found.
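Outside the Studio, the same file search and iteration can be sketched in plain Perl. This is only an illustration of the idea, not what the tFileList generates: the list_email_files helper and the 'emails' directory name are assumptions, and File::Find is used to walk the sub-directories.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return every file under $dir (sub-directories
# included) whose basename matches $mask.
sub list_email_files {
    my ($dir, $mask) = @_;
    my @files;
    find(
        sub {
            # $_ is the basename, $File::Find::name the full path
            push @files, $File::Find::name if -f $_ && /$mask/;
        },
        $dir
    );
    return sort @files;
}

# One iteration per file found, as the iterate link does
# ('emails' is an assumed directory name)
my $emaildir = 'emails';
if (-d $emaildir) {
    foreach my $path (list_email_files($emaildir, qr/[0-9]+/)) {
        print "processing $path\n";
    }
}
```

The mask is not anchored, so, like the filemask above, it accepts any filename containing at least one digit.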
The next component is a tFileInputEmail. We set a schema with three columns: “to”, “from” and “cc”. In the Mail parts properties, we map each schema column to the name of the corresponding email header.
In the File Name property, we use the current file path delivered by the preceding tFileList: $_globals{tFileList_1}{CURRENT_FILEPATH}.
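To make the header mapping more concrete, here is a rough plain-Perl sketch of reading the three headers from one email file. It is not the code of the component: the read_headers helper is hypothetical, and it assumes a simple text email whose header block ends at the first blank line, with folded headers continued on lines that start with whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: read the header block of one email file and return
# a hash with the values of the "to", "from" and "cc" headers.
sub read_headers {
    my ($path) = @_;
    my %row = (to => '', from => '', cc => '');
    my $current;    # header being filled, so folded lines can be appended

    open(my $fh, '<', $path) or die "cannot open $path: $!";
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;
        last if $line eq '';    # a blank line ends the header block

        if ($line =~ /^(To|From|Cc):\s*(.*)$/i) {
            $current = lc $1;
            # a header may appear more than once: join the values
            $row{$current} .= ($row{$current} eq '' ? '' : ', ') . $2;
        }
        elsif ($line =~ /^\s+(.*)$/ && defined $current) {
            $row{$current} .= ' ' . $1;    # continuation of a folded header
        }
        else {
            undef $current;    # some other header we do not keep
        }
    }
    close $fh;
    return %row;
}

# Usage (the path is an assumption):
# my %row = read_headers('emails/1234');
```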
The problem with the iterate link is that it creates several distinct data flows. But here we want to apply global operations (such as duplicate removal and sorting), so we need to merge the N flows into a single flow. tUnite does this job.
The data extracted by the preceding tFileInputMail are not exactly formatted as we would like. Indeed, we have something like…
"Fabrice BONAN" <fbonan@talendforge.com>, Cedric CARBONE <ccarbone@talendforge.com>, "Pierrick LE GALL" <plegall@talendforge.com>, "Bertrand DIARD" <bdiard@talendforge.com>
… and we would prefer a clean list of email addresses:
fbonan@talendforge.com,ccarbone@talendforge.com,plegall@talendforge.com,bdiard@talendforge.com
To achieve this task, we need a dedicated Perl function. We create a user routine file “mail” and we add the extractMailAddresses function. Here follows the content of the “mail” routine file:
use Exporter;
use vars qw(@EXPORT @ISA);
@ISA    = qw(Exporter);
@EXPORT = qw(extractMailAddresses);

# Return every email address found in the given strings.
sub extractMailAddresses {
    my @email_addresses = ();
    foreach (@_) {
        # collect each address-like match in the current string
        while (m{([\w-]+(\.[\w-]+)*@[\w-]+(\.[\w-]+)*\.[a-z]+)}g) {
            push @email_addresses, $1;
        }
    }
    return @email_addresses;
}

1;
Now we can use the new user routine on the list of fields coming from tFileInputEmail.
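As a standalone illustration, the routine can be tried on a sample row. The %input_row hash and its values below are assumptions made for the example (inside a job the fields come from the component), and the routine is repeated so that the snippet runs on its own.

```perl
use strict;
use warnings;

# Copy of the extractMailAddresses routine, repeated so the snippet runs alone
sub extractMailAddresses {
    my @email_addresses = ();
    foreach (@_) {
        while (m{([\w-]+(\.[\w-]+)*@[\w-]+(\.[\w-]+)*\.[a-z]+)}g) {
            push @email_addresses, $1;
        }
    }
    return @email_addresses;
}

# Sample row; the field values are illustrative
my %input_row = (
    to   => '"Fabrice BONAN" <fbonan@talendforge.com>, Cedric CARBONE <ccarbone@talendforge.com>',
    from => '"Pierrick LE GALL" <plegall@talendforge.com>',
    cc   => '',
);

# Extract the addresses of the three fields and join them with commas
my $emails = join ',', extractMailAddresses(@input_row{qw(to from cc)});
print "$emails\n";
# prints fbonan@talendforge.com,ccarbone@talendforge.com,plegall@talendforge.com
```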
Now we have many email addresses on each row, separated by commas. To perform operations such as duplicate removal, we need to have only one email address per row. This operation is a normalization, which tNormalize performs.
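The normalization itself is easy to picture in plain Perl. The sketch below only illustrates the idea and is not the code behind tNormalize; the normalize_rows helper is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn rows holding separated lists into rows
# holding a single value each.
sub normalize_rows {
    my ($separator, @rows) = @_;
    my @normalized;
    foreach my $row (@rows) {
        # skip empty items left by empty fields or doubled separators
        push @normalized, grep { length } split /\Q$separator\E/, $row;
    }
    return @normalized;
}

my @rows = (
    'fbonan@talendforge.com,ccarbone@talendforge.com',
    'plegall@talendforge.com',
);
print "$_\n" for normalize_rows(',', @rows);
# prints three lines, one address per line
```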
To simply lowercase a field, we use the tPerlRow component:
$output_row[email] = lc $input_row[emails];
Removing duplicates is the speciality of the tUniqRow component. Connect the next component to its uniques output row.
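The reason for lowercasing before removing duplicates can be seen in a small plain-Perl sketch. The unique_lowercased helper is hypothetical and only mimics the effect of the two steps; it keeps the first occurrence of each address.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase each address, then keep only the first
# occurrence of each one.
sub unique_lowercased {
    my %seen;
    return grep { !$seen{$_}++ } map { lc $_ } @_;
}

my @emails = (
    'FBonan@talendforge.com',
    'plegall@talendforge.com',
    'fbonan@talendforge.com',
);

my @unique = unique_lowercased(@emails);
print "$_\n" for sort @unique;
# prints fbonan@talendforge.com then plegall@talendforge.com
```

Without the lowercasing step, 'FBonan@talendforge.com' and 'fbonan@talendforge.com' would be kept as two distinct entries.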