I need to load a huge amount of data into memory using a Perl hash. The value for each key is an array of scalars. The key is usually built from a single field of that array, but it can combine several fields. The number of fields per array varies a lot, but it is typically around 5.
My goal is to load as many keys as possible within a limited amount of memory. The Perl interpreter alone takes only 5MB at the start of the process, according to the virtual size reported by Sys::Statistics::Linux::Processes (which is the same as the real size for Perl scripts).
My data to load looks like this (500k lines, 5 columns):
1,1948-10-26,1951-12-08,8TBhXzkOvc,l0YO0ghDND
2,1920-04-16,1959-06-10,eyCFd4IjRo,41YTEnB7Qh
3,1978-12-28,2005-06-23,9LeBBiR2sw,qk30zZdftW
4,2004-01-05,1997-03-25,66K6gvdd5D,bmL3LpuLKT
5,2019-05-11,1995-08-27,rGRtJHioa7,qF7bhwfGeE
Here is my basic script. The hash key is built from a single column, the first field of the line. In the following examples, only the code between mark 1 and mark 2 changes.
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use Sys::Statistics::Linux::Processes;
my $lxs = Sys::Statistics::Linux::Processes->new;
$lxs->init;
my $start = [gettimeofday];
my %cache = ();
# three-argument open keeps the file name from being parsed as a mode
open(my $ifh, '<', $ARGV[0])
    or die "cannot open input file '$ARGV[0]': $!";
while (<$ifh>) {
    chomp;
    my @fields = split ',', $_;
    # mark 1
    $cache{$fields[0]} = \@fields;
    # mark 2
}
close($ifh);
my $stop = [gettimeofday];
my $stat = $lxs->get;
printf(
    "time: %.1f seconds, memory : %uM\n",
    tv_interval($start, $stop),
    $stat->{$$}{vsize} / (1024 * 1024)
);
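The script above keys on a single column. For the case where the key combines several fields, here is a minimal sketch. The helper name `make_key` and the choice of key columns are hypothetical, and it assumes the separator `$;` (`"\034"` by default) never appears in the data.

```perl
use strict;
use warnings;

# Build a hash key from one or more columns (0-based indexes).
# Hypothetical helper; assumes $; never occurs inside a field.
sub make_key {
    my ($fields, @key_columns) = @_;
    return join $;, @{$fields}[@key_columns];
}

my %cache;
my @fields = split /,/, '1,1948-10-26,1951-12-08,8TBhXzkOvc,l0YO0ghDND';

# Composite key built from the first two columns.
$cache{ make_key(\@fields, 0, 1) } = \@fields;

# A lookup has to build its key the same way.
my @probe = ('1', '1948-10-26');
print "found\n" if exists $cache{ make_key(\@probe, 0, 1) };
```

With a single key column the helper simply returns that field, so the same code path can serve both cases.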
We can use less memory if we store a string instead of an array reference:
# mark 1
$cache{$fields[0]} = join $;, @fields;
# mark 2
plegall@miro:~/bench/hash$ perl load-01.pl in_5c_500kl.csv; perl load-02.pl in_5c_500kl.csv
time: 2.1 seconds, memory : 173M
time: 2.3 seconds, memory : 70M
This is of course a major improvement for me: about 2.5 times less memory for very little extra time. To check that memory usage is linear, let's load a data file with only half of the lines:
plegall@miro:~/bench/hash$ head -250000 in_5c_500kl.csv > in_5c_250kl.csv
plegall@miro:~/bench/hash$ perl load-01.pl in_5c_250kl.csv; perl load-02.pl in_5c_250kl.csv
time: 1.0 seconds, memory : 89M
time: 1.1 seconds, memory : 37M
This confirms that memory usage is linear: half the data, half the memory. On 32-bit Linux, a single process can typically use up to about 3GB of memory. At 70MB per 500k records, that limit would allow a hash to hold nearly 22 million records at once.
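The estimate can be reproduced with a few lines of arithmetic. This is only a rough sketch: it takes the 70M measured for 500k lines at face value (interpreter baseline included) and assumes the 3GB per-process limit.

```perl
use strict;
use warnings;

# Figures taken from the load-02.pl run on the 500k-line file.
my $records     = 500_000;
my $measured_mb = 70;
# Assumption: about 3GB usable by a single process.
my $limit_mb    = 3 * 1024;

my $bytes_per_record = $measured_mb * 1024 * 1024 / $records;
my $max_records      = int($limit_mb / $measured_mb * $records);

printf "about %.0f bytes per record, about %.1f million records\n",
    $bytes_per_record, $max_records / 1e6;
```

That works out to roughly 147 bytes per record and just under 22 million records.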
In Talend Open Studio, the tMap has one main link and several lookup links as input. Each lookup link corresponds to a data join. The joined data is stored in memory, in hashes. So I have opened a feature request to implement this improvement.
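To serve as a lookup in such a join, a record stored as a string has to be split back into fields when its key is found. Here is a minimal sketch of that round trip. The helper names `pack_record` and `unpack_record` are hypothetical, and the sketch again assumes `$;` never appears in the data.

```perl
use strict;
use warnings;

# Copy the separator into a lexical so it interpolates cleanly in a regex.
my $SEP = $;;

# Pack a list of fields into one string (what the loading loop stores).
sub pack_record { return join $SEP, @_ }

# Split a packed string back into its fields.
# The -1 limit keeps trailing empty fields.
sub unpack_record { return split /\Q$SEP\E/, $_[0], -1 }

my %cache;
for my $line ('1,1948-10-26,1951-12-08,8TBhXzkOvc,l0YO0ghDND',
              '2,1920-04-16,1959-06-10,eyCFd4IjRo,41YTEnB7Qh') {
    my @fields = split /,/, $line, -1;
    $cache{ $fields[0] } = pack_record(@fields);
}

# Lookup: unpack only the record that is actually needed.
if (exists $cache{2}) {
    my @fields = unpack_record($cache{2});
    print "$fields[3]\n";    # fourth column of record 2
}
```

The split only happens for keys that are actually looked up, which is why the extra cost stays small compared with the memory saved.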