Using Perl or Linux built-in command-line tools, how can I quickly map one integer to another?
I have a text file containing a mapping between pairs of integers, one pair per line, separated by a comma:
123,456
789,555
...
It's 120 MB... it's a long file.
I keep needing to search on the first column and return the second, e.g. 789 --returns--> 555, and I need it to be fast, using regular Linux built-ins.
The way I'm doing it right now takes several seconds per look-up.
If I had a database I could index it. I guess what I need is an indexed text file (see the rough sketch at the end of this question)!
Here's what I'm doing now:
my $linefound = `awk -F, '/^$column1,/ { print \$2 }' ../mybigmappingfile.csv`;
Is there an easy way to pull off this performance improvement?
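For example, something along these lines is what I mean by "indexed" -- a rough, untested sketch that builds a one-time on-disk hash with DB_File (the module choice and file names are just for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;
use DB_File;   # needs Berkeley DB to be available

# One-time build step: load the CSV into an on-disk hash.
my %map;
tie %map, 'DB_File', 'mapping.db', O_CREAT|O_RDWR, 0644, $DB_HASH
    or die "Can't tie mapping.db: $!\n";

open my $fh, '<', '../mybigmappingfile.csv' or die "Can't open file: $!\n";
while ( <$fh> ) {
    chomp;
    my ($key, $val) = split /,/;
    $map{$key} = $val;
}
close $fh;

# After that, every lookup is just a hash access:
print "$map{789}\n";   # prints 555 for the sample above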
The hash suggestions are the natural way an experienced Perler would do this, but they may be suboptimal in this case. A hash scans the entire file and builds a large, flat data structure in linear time. Cruder methods can short-circuit that: they have the same worst-case linear time, but may take less in practice.
I first made a big mapping file:
my $len = shift;
for (1 .. $len) {
    my $rnd = int rand( 999 );
    print "$_,$rnd\n";
}
With $len passed on the command line as 10000000, the file came out to 113 MB.
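For example, with the generator saved as make_map.pl (that script name is just for illustration):

$ perl make_map.pl 10000000 > int_map.txt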
I then benchmarked three implementations. The first is the hash-lookup method. The second slurps the whole file and scans it with a regex. The third reads line-by-line and stops as soon as it matches. The complete implementation:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw{ timethese };

my $file  = shift;
my $count = 100;
my $entry = 40;

slurp();   # initial slurp to get the file into the disk cache

timethese( $count, {
    'Hash'       => sub { hash_lookup( $entry )   },
    'Scalar'     => sub { scalar_lookup( $entry ) },
    'LineByLine' => sub { line_lookup( $entry )   },
});


sub slurp
{
    open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
    local $/;   # slurp mode, restored when the sub returns
    my $s = <$fh>;
    close $fh;
    return $s;
}

sub hash_lookup
{
    my ($entry) = @_;
    my %data;

    open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
    while( <$fh> ) {
        my ($name, $val) = split /,/;
        $data{$name} = $val;
    }
    close $fh;

    return $data{$entry};
}

sub scalar_lookup
{
    my ($entry) = @_;
    my $data = slurp();
    my ($val) = $data =~ /^ $entry , (\d+) $/xm;
    return $val;
}

sub line_lookup
{
    my ($entry) = @_;
    my $found;

    open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
    while( <$fh> ) {
        my ($name, $val) = split /,/;
        if( $name == $entry ) {
            $found = $val;
            last;
        }
    }
    close $fh;

    return $found;
}
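The script takes the data file as its first argument, so it can be run against the generated file like this (the script name is again just for illustration):

$ perl lookup_bench.pl int_map.txt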
The results on my system:
Benchmark: timing 100 iterations of Hash, LineByLine, Scalar...
      Hash: 47 wallclock secs (18.86 usr + 27.88 sys = 46.74 CPU) @  2.14/s (n=100)
LineByLine: 47 wallclock secs (18.86 usr + 27.80 sys = 46.66 CPU) @  2.14/s (n=100)
    Scalar: 42 wallclock secs (16.80 usr + 24.37 sys = 41.17 CPU) @  2.43/s (n=100)
(Note that I'm running this off an SSD, so I/O is very fast, which perhaps makes the initial slurp() unnecessary. YMMV.)

Interestingly, the Hash implementation is about as fast as LineByLine, which isn't what I expected. Because it uses slurping, Scalar may end up being faster on a traditional hard drive.
However, by far the fastest is a simple call to grep:
$ time grep '^40,' int_map.txt
40,795

real    0m0.508s
user    0m0.374s
sys     0m0.046s
Perl could read that output and split it apart at the comma in hardly any time at all.
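A minimal sketch of that approach, assuming GNU grep and the int_map.txt file from above:

#!/usr/bin/perl
use strict;
use warnings;

my $entry = 40;

# Let grep do the scanning; -m 1 stops at the first match.
my $line = `grep -m 1 '^$entry,' int_map.txt`;
chomp $line;

my ( undef, $val ) = split /,/, $line;
print "$val\n";   # 795 for the line shown above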
Edit: Never mind the grep call; I misread the numbers.