Using Perl or Linux built-in command-line tools, how can I quickly map one integer to another?


I have a text file containing a mapping between two integers, separated by commas:

123,456
789,555
...

It's 120 megs... it's a long file.

I keep searching the first column and returning the second, e.g., 789 --returns--> 555, and I need it to be fast, using regular Linux built-ins.

The way I'm doing it right now takes several seconds per look-up.

If I had a database I would just index it. I guess what I need is an indexed text file!
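Something along those lines would be a one-time build of an on-disk DBM index, then cheap lookups afterwards. A rough sketch of that idea, assuming the DB_File module (Berkeley DB) is available; the .db filename here is made up:

#!/usr/bin/perl
# Build an on-disk index once, then look keys up without
# rescanning the 120 MB flat file on every query.
use strict;
use warnings;
use Fcntl;
use DB_File;

my %index;
tie %index, 'DB_File', 'mybigmappingfile.db', O_RDWR | O_CREAT, 0644, $DB_HASH
    or die "Can't tie mybigmappingfile.db: $!\n";

unless ( %index ) {    # only build the index the first time
    open( my $fh, '<', '../mybigmappingfile.csv' ) or die "Can't open: $!\n";
    while ( <$fh> ) {
        chomp;
        my ( $key, $val ) = split /,/;
        $index{$key} = $val;
    }
    close $fh;
}

print "$index{789}\n";    # 555, without scanning the whole file
untie %index;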

Here is what I'm doing now:

my $linefound = `awk -F, '/$column1/ { print \$2 }' ../mybigmappingfile.csv`;

Is there an easy way to pull off a performance improvement here?

The hash suggestions are the natural way an experienced Perler would do this, but they may be suboptimal in this case. They scan the entire file and build a large, flat data structure in linear time. Cruder methods can short-circuit that: they have the same worst-case linear time, but do less work in practice.

I first made a big mapping file:

my $len = shift;
for (1 .. $len) {
    my $rnd = int rand( 999 );
    print "$_,$rnd\n";
}

With $len passed on the command line as 10000000, the file came out to 113 MB. I benchmarked three implementations. The first is the hash lookup method. The second slurps the whole file and scans it with a regex. The third reads line-by-line and stops when it matches. The complete implementation:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw{timethese};

my $file  = shift;
my $count = 100;
my $entry = 40;

slurp(); # initial file slurp, to warm the hard drive cache

timethese( $count, {
    'hash'       => sub { hash_lookup( $entry ) },
    'scalar'     => sub { scalar_lookup( $entry ) },
    'linebyline' => sub { line_lookup( $entry ) },
});

# Read the whole file into a single scalar
sub slurp {
    open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
    local $/;    # slurp mode, scoped to this sub
    my $s = <$fh>;
    close $fh;
    return $s;
}

# Build a hash of the entire file, then look up the entry
sub hash_lookup {
    my ($entry) = @_;
    my %data;

    open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
    while( <$fh> ) {
        my ($name, $val) = split /,/;
        $data{$name} = $val;
    }
    close $fh;

    return $data{$entry};
}

# Slurp the file and scan it with a single multiline regex
sub scalar_lookup {
    my ($entry) = @_;
    my $data = slurp();
    my ($val) = $data =~ /^ $entry , (\d+) $/xm;
    return $val;
}

# Read line-by-line and stop at the first match
sub line_lookup {
    my ($entry) = @_;
    my $found;

    open( my $fh, '<', $file ) or die "Can't open $file: $!\n";
    while( <$fh> ) {
        my ($name, $val) = split /,/;
        if( $name == $entry ) {
            $found = $val;
            last;
        }
    }
    close $fh;

    return $found;
}

The results on my system:

Benchmark: timing 100 iterations of hash, linebyline, scalar...
      hash: 47 wallclock secs (18.86 usr + 27.88 sys = 46.74 CPU) @  2.14/s (n=100)
linebyline: 47 wallclock secs (18.86 usr + 27.80 sys = 46.66 CPU) @  2.14/s (n=100)
    scalar: 42 wallclock secs (16.80 usr + 24.37 sys = 41.17 CPU) @  2.43/s (n=100)

(Note that I'm running off an SSD, so I/O is fast, which perhaps makes that initial slurp() unnecessary. YMMV.)

Interestingly, the hash implementation is just as fast as linebyline, which isn't what I expected. By using slurping, scalar may end up being faster on a traditional hard drive.

However, by far the fastest is a simple call to grep:

$ time grep '^40,' int_map.txt
40,795

real    0m0.508s
user    0m0.374s
sys     0m0.046s

Perl could read that output and split it apart at the comma in hardly any time at all.
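For instance, a quick sketch of what that would look like (-m 1 tells GNU grep to stop after the first match; the filename is the test file from above):

#!/usr/bin/perl
use strict;
use warnings;

# Shell out to grep for the scan, then split the single matching line in Perl.
my $entry = 40;
my $line  = `grep -m 1 '^$entry,' int_map.txt`;
chomp $line;
my ( undef, $val ) = split /,/, $line;
print "$val\n";    # 795 for this test file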

Edit: Never mind about grep. I misread the numbers.

