barvin has asked for the wisdom of the Perl Monks concerning the following question:
Hi all,
I'm doing graph traversal and calculating the Jaccard distance on each pair of vertices. Calculating the intersection and union consumes 98% of my processing time.
I've tried Set::IntSpan, Set::Array and List::Compare, but they are all even slower than my own code. Is there a VERY fast (maybe implemented in C) way to calculate intersection and union on two arrays? Just for interest, my implementation is ($in and $jn are array refs of positive integers):
sub get_int_uni {
    my ($in, $jn) = @_;
    my (%int, %uni);
    for my $i (@$in) {
        $uni{$i}++;
    }
    for my $j (@$jn) {
        $int{$j}++ if $uni{$j};
        $uni{$j}++;
    }
    return ((scalar keys %int), (scalar keys %uni));
}
Re: Fast, Efficient Union and Intersection on arrays
by BrowserUk (Pope) on Nov 20, 2008 at 18:03 UTC

c:\test>unionIntersect N=10 SHOW
2 2 3 4 5 5 8 9 12 14
4 4 7 7 9 9 10 10 11 11
barvin: 5 : 11
buk: 2 : 11
The union contains 11 elements, and the intersection only 2 (4 and 9), but your routine reports 5?
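To see where the extra three come from, here is a minimal sketch reproducing the OP's routine against the data above: a value repeated within the second array bumps %uni on its first occurrence, so its second occurrence is miscounted as an intersection member.

```perl
use strict;
use warnings;

# The OP's routine, verbatim: %int gains a key whenever a value of @$jn
# is already present in %uni -- including when it was put there by an
# earlier duplicate from @$jn itself.
sub get_int_uni {
    my ($in, $jn) = @_;
    my (%int, %uni);
    $uni{$_}++ for @$in;
    for my $j (@$jn) {
        $int{$j}++ if $uni{$j};
        $uni{$j}++;
    }
    return ((scalar keys %int), (scalar keys %uni));
}

my @in = (2, 2, 3, 4, 5, 5, 8, 9, 12, 14);
my @jn = (4, 4, 7, 7, 9, 9, 10, 10, 11, 11);
my ($int, $uni) = get_int_uni(\@in, \@jn);
# True intersection of the distinct values is {4, 9}, but 7, 10 and 11
# are also counted: their first occurrence in @jn primes %uni, so their
# second occurrence looks like a match.
print "$int : $uni\n";  # 5 : 11
```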
Anyway, this is my benchmark which shows a vec implementation is quite a bit faster:
#! perl -slw
use strict;
use Benchmark qw[ cmpthese ];

our $N ||= 100;
our $SHOW;

sub barvin {
    my ($in, $jn) = @_;
    my (%int, %uni);
    for my $i (@$in) {
        $uni{$i}++;
    }
    for my $j (@$jn) {
        $int{$j}++ if $uni{$j};
        $uni{$j}++;
    }
    print 'barvin: ', (scalar keys %int), ' : ', (scalar keys %uni)
        if $SHOW;
    return ((scalar keys %int), (scalar keys %uni));
}

sub buk {
    my( $aRef, $bRef ) = @_;
    my( $aBits, $bBits ) = ('') x 2;
    vec( $aBits, $_, 1 ) = 1 for @$aRef;
    vec( $bBits, $_, 1 ) = 1 for @$bRef;
    my $uCount = unpack '%32b*', ( $aBits | $bBits );
    my $iCount = unpack '%32b*', ( $aBits & $bBits );
    print 'buk: ', $iCount, ' : ', $uCount if $SHOW;
    return( $iCount, $uCount );
}

our @a = map int( rand $N * 1.5 ), 1 .. $N;
our @b = map int( rand $N * 1.5 ), 1 .. $N;
print "@{[ sort { $a <=> $b } @a ]}\n@{[ sort { $a <=> $b } @b ]}\n"
    if $SHOW;

cmpthese $SHOW ? 1 : -1, {
    barvin => q[
        our( @a, @b );
        barvin( \@a, \@b );
    ],
    buk => q[
        our( @a, @b );
        buk( \@a, \@b );
    ],
};
__END__
c:\test>unionIntersect N=10
          Rate barvin    buk
barvin 33212/s     --   -29%
buk    47045/s    42%     --

c:\test>unionIntersect N=100
         Rate barvin    buk
barvin 4111/s     --   -53%
buk    8777/s   113%     --

c:\test>unionIntersect N=1000
        Rate barvin    buk
barvin 357/s     --   -64%
buk    998/s   179%     --
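For those meeting vec for the first time, here is a cut-down sketch of the bitstring technique using the sample data above (the sub name count_int_uni is mine, not from the thread): each integer sets one bit in a string, so duplicates collapse automatically, and unpack '%32b*' sums the set bits.

```perl
use strict;
use warnings;

# Bitstring union/intersection: one bit per possible integer value.
sub count_int_uni {
    my ($aRef, $bRef) = @_;
    my ($aBits, $bBits) = ('') x 2;
    vec($aBits, $_, 1) = 1 for @$aRef;   # setting the same bit twice is a no-op
    vec($bBits, $_, 1) = 1 for @$bRef;
    my $iCount = unpack '%32b*', ($aBits & $bBits);  # bits set in both
    my $uCount = unpack '%32b*', ($aBits | $bBits);  # bits set in either
    return ($iCount, $uCount);
}

my ($i, $u) = count_int_uni(
    [ 2, 2, 3, 4, 5, 5, 8, 9, 12, 14 ],
    [ 4, 4, 7, 7, 9, 9, 10, 10, 11, 11 ],
);
print "$i : $u\n";  # 2 : 11
```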
Examine what is said, not who speaks | Silence betokens consent | Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Ah yes, I'm assuming unique elements in the two arrays, which is correct for my data. Thanks very much for your suggestions. These bit vectors are new to me, so there's some good stuff for me to chew on here.
Re: Fast, Efficient Union and Intersection on arrays
by Roy Johnson (Monsignor) on Nov 20, 2008 at 18:07 UTC

I note two things: you only want the number of elements in the intersection and union, not the actual intersection and union; and your method suggests that there are no duplicates in the sets.
So you don't need to find the union explicitly: it is just the total number of elements in I plus J, minus the intersection.
sub get_int_uni2 {
    my ($in, $jn) = @_;
    my %i;
    @i{@$in} = ();
    my $int = grep exists $i{$_}, @$jn;
    return ($int, @$in + @$jn - $int);
}
Sadly, I don't see any real improvement in performance (thanks to BrowserUK for the benchmarkery).
Caution: Contents may have been coded under pressure.
Re: Fast, Efficient Union and Intersection on arrays
by moritz (Cardinal) on Nov 20, 2008 at 16:45 UTC

If you know that each array itself doesn't contain duplicate elements, you could try something along these lines:
my (@a, @b);
# initialize @a and @b here
my %hash;
@hash{@a} = (1) x @a;
my @union        = (@a, grep { !$hash{$_} } @b);
my @intersection = grep { $hash{$_} } @b;
That has the advantage of using only one hash, which might buy you some performance. But I don't know if it is actually faster than anything else.
(If you provide a small Benchmark with lists of typical sizes you might get much better answers; for example for a large number of small integers it might be more memory efficient to use vec than a hash, which might in turn result in some performance benefit.)

It seems counterproductive to grep the @b array twice, but it's actually faster than using one for loop.
I did a benchmark, and the two greps give about a 20% speed boost over a single for loop with short arrays (fewer than 10 elements). The boost shrinks as the arrays grow: at array size ~50 it is only about 8%.
use strict;
use warnings;
use Benchmark qw(cmpthese clearallcache);

my @a = (1..50);
my @b = (3..80);
#two_greps();
#one_for();
#exit;
print "Range ($a[0],$a[$#a]) vs ($b[0],$b[$#b])\n";
cmpthese( 100_000, {
    two_greps => \&two_greps,
    one_for   => \&one_for,
});

clearallcache();
@a = (1..5);
@b = (3..8);
print "Range ($a[0],$a[$#a]) vs ($b[0],$b[$#b])\n";
cmpthese( 100_000, {
    two_greps => \&two_greps,
    one_for   => \&one_for,
});

sub two_greps {
    my %hash;
    @hash{@a} = (1) x @a;
    my @union        = (@a, grep { !$hash{$_} } @b);
    my @intersection = grep { $hash{$_} } @b;
    # print "union @union\n";
    # print "intersection @intersection\n";
}

sub one_for {
    my %hash;
    @hash{@a} = (1) x @a;
    my @union = @a;
    my @intersection;
    foreach (@b) {
        if( exists $hash{$_} ) {
            push @intersection, $_;
        }
        else {
            push @union, $_;
        }
    }
    # print "union @union\n";
    # print "intersection @intersection\n";
}
__DATA__
Range (1,50) vs (3,80)
            Rate  one_for two_greps
one_for   5829/s       --       -8%
two_greps 6318/s       8%        --
Range (1,5) vs (3,8)
             Rate  one_for two_greps
one_for   57143/s       --      -17%
two_greps 68823/s      20%        --
Thanks for posting this thought-provoking code.

I remember once getting a noticeable speed improvement by using a C-style 'for (...;...;...)' loop instead of a foreach loop for a large array. I wonder if that would have helped here.
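For what it's worth, here is a quick sketch of how one might compare the two loop styles (the sub names and data sizes are mine, not from the thread). On recent perls foreach tends to win, since it avoids the per-iteration index arithmetic and array lookup, but only a benchmark on your own data settles it:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @data = (1 .. 10_000);

# C-style loop: explicit counter plus an indexed array fetch each pass.
sub with_cstyle {
    my $sum = 0;
    for (my $i = 0; $i <= $#data; $i++) { $sum += $data[$i] }
    return $sum;
}

# foreach: aliases $_ to each element directly, no index bookkeeping.
sub with_foreach {
    my $sum = 0;
    $sum += $_ for @data;
    return $sum;
}

die "mismatch" unless with_cstyle() == with_foreach();

cmpthese( 500, {
    cstyle  => \&with_cstyle,
    foreach => \&with_foreach,
});
```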

Re: Fast, Efficient Union and Intersection on arrays
by barvin (Beadle) on Nov 20, 2008 at 19:08 UTC

Thanks, all, for a fabulous lesson in efficiency and creativity. I implemented BrowserUK's vector routine and Roy Johnson's single-grep routine, and at 100,000 iterations with moderately sized arrays (mean 151, stdev 185) the winner is:
            Wall Time
barvin        54.134s
BrowserUK     26.905s
Roy Johnson   21.349s
So for those coming in by Google...
To calculate the count of intersection and union on two arrays (used by me to calculate the Jaccard Distance on the edges of a graph), pass two array references that have no internal duplicates to Roy's subroutine below. Thanks Roy.
sub get_int_uni2 {
    my ($in, $jn) = @_;
    my %i;
    @i{@$in} = ();
    my $int = grep exists $i{$_}, @$jn;
    return ($int, @$in + @$jn - $int);
}
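To go from the two counts to the actual Jaccard distance (1 - |intersection|/|union|), a small sketch wrapping the routine above (jaccard_distance is my name, not from the thread; it assumes duplicate-free inputs, as get_int_uni2 requires):

```perl
use strict;
use warnings;

# Roy Johnson's counting routine, unchanged.
sub get_int_uni2 {
    my ($in, $jn) = @_;
    my %i;
    @i{@$in} = ();
    my $int = grep exists $i{$_}, @$jn;
    return ($int, @$in + @$jn - $int);
}

# Jaccard distance between two duplicate-free sets of integers.
sub jaccard_distance {
    my ($in, $jn) = @_;
    my ($int, $uni) = get_int_uni2($in, $jn);
    return $uni ? 1 - $int / $uni : 0;   # two empty sets: define distance as 0
}

printf "%.2f\n", jaccard_distance([ 1, 2, 3 ], [ 2, 3, 4 ]);  # 0.50
```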

If you can maintain sorted structures, which is painful in Perl, then perhaps you can speed up more.
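For example, with both arrays sorted and duplicate-free, a single two-cursor pass counts the intersection in O(m+n) without building any hash at all; a sketch (merge_count is a hypothetical name):

```perl
use strict;
use warnings;

# Classic sorted-merge: advance whichever cursor points at the smaller
# value; equal values are intersection members, so advance both.
sub merge_count {
    my ($aRef, $bRef) = @_;
    my ($i, $j, $int) = (0, 0, 0);
    while ($i < @$aRef && $j < @$bRef) {
        if    ($aRef->[$i] < $bRef->[$j]) { $i++ }
        elsif ($aRef->[$i] > $bRef->[$j]) { $j++ }
        else                              { $int++; $i++; $j++ }
    }
    # Union size by inclusion-exclusion, as in the thread.
    return ($int, @$aRef + @$bRef - $int);
}

my ($int, $uni) = merge_count([ 1, 3, 5, 7 ], [ 3, 4, 5, 6 ]);
print "$int : $uni\n";  # 2 : 6
```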
Re: Fast, Efficient Union and Intersection on arrays
by JavaFan (Canon) on Nov 20, 2008 at 21:43 UTC

Is there a VERY fast (maybe implemented in C) way to calculate intersection and union on two arrays?
You may want to have a look at PDL. And, no doubt, the fastest routines have probably been written in the early 60s, in FORTRAN. But I wouldn't know how to link FORTRAN routines to perl.
Re: Fast, Efficient Union and Intersection on arrays
by ptoulis (Scribe) on Nov 20, 2008 at 22:44 UTC

That was an excellent exercise, barvin.
As always, the speed of the algorithm depends on the nature of your problem (the input).
All the algorithms presented above theoretically run in O(m+n) time, where m and n are the sizes of the lists. The great ideas above differ considerably in how they exploit the Perl internals, and most are huge space consumers.
In terms of speed the best algorithm is, by far, BrowserUK's (the buk() function). On my Windows machine it is insanely fast compared to the others. However, it has some important weaknesses. First, it can't deal with non-numeric data, since vec needs a numeric offset, and buk() uses the list elements themselves as offsets. Second, it fails on arrays with very large numbers as elements; e.g. if you set @a=(1,2) and @b=(3,4,10000000000), the program dies with the message Negative offset to vec in lvalue context at ....
On the other hand, the other algorithms do not have these weaknesses (they can handle large numbers and non-numeric data), but they are much slower and also fail when the arrays have duplicated elements.
In summary, I guess there is no absolute truth when we speak about algorithms. The algorithms here run in O(m+n), which means they are affected by the larger array. To beat this, use a higher-order data structure, like a heap. I didn't test it, but if l=min(m,n) and r=max(m,n), you could reach O(l*log(r)). You pay extra overhead when inserting (or removing) values, but search is lightning fast.

All the algorithms presented above have theoretically a running time of O(m+n) where m,n are the sizes of the lists.
Not true. Since BrowserUK's algorithm needs to allocate and initialize a string of max(@a, @b) bits, it has O(n + m + max(@a) + max(@b)) run time (you mentioned another problem arising from large numbers, but not in terms of complexity).

Whilst I agree that the algorithm is limited to numerics (the triumph of the specific, the OP's "positive integers", over the generic), I disagree with your big-O assessment.
It is perfectly feasible to preallocate the bitstrings to their largest possible size (2^32/8 = 512MB each), easily and relatively quickly achieved on a fully populated 32-bit machine. More importantly, it need only be done once, regardless of the size of the input arrays, so it is a constant factor and is disregarded by big-O.
Of course, allocating huge chunks of RAM for small input arrays means that constant factor is proportionally quite large, but that's the limitation of big-O.
Conversely, the hashes used by the other algorithms grow by repeated reallocation in geometrically increasing chunks, so the allocation cost grows with the size of the input arrays. In addition, the grep solutions have to build stack-bound lists from each of the arrays.
So, if you are going to try to account for memory allocation in the assessment of the algorithms, then you would have to come up with something like:
O( M+N + SIGMA( 2^3+2^4+...+2^n>N ) + SIGMA( 2^3+...+2^m>M ) + MIN(N,M) )
         memory for 1st hash          memory for 2nd hash     memory for 1 array*
## (*assuming you chose the smallest array to use with the 2-hash/1-list method)
or
O( M+N + SIGMA( 2^3+2^4+...+2^n>N ) + N + M )
         memory for smallest hash     memory for the two arrays
## for the 1-hash/2-arrays method.
All of which reinforces my stance that big-O is rarely done properly outside of academic papers, and is of little use unless it is. A benchmark is more instructive. This run uses arrays of 1 million elements apiece, with the numbers in the range 0 .. 2^27:
C:\test>unionIntersect.pl N=1e6 MAX=27
       s/iter barvin  roy  buk buk2
barvin   8.58     -- -43% -58% -64%
roy      4.86    77%   -- -26% -37%
buk      3.59   139%  35%   -- -15%
buk2     3.05   182%  60%  18%   --
I couldn't go higher than 2^27 because of the need to have enough memory in the process to accommodate the large hashes and lists created by the other algorithms as well. Of course, you do have to go to quite large input arrays before that constant factor for allocating the bitstrings amortises:
C:\test>unionIntersect.pl N=5e5 MAX=27
       s/iter barvin  buk  roy buk2
barvin   3.87     -- -28% -40% -50%
buk      2.80    39%   -- -17% -30%
roy      2.33    66%  20%   -- -16%
buk2     1.95    98%  43%  19%   --
BTW: buk2 simply uses Roy Johnson's trick of counting the bits for only one set and deriving the other number with arithmetic:
sub buk2 {
    my( $aRef, $bRef ) = @_;
    my( $aBits, $bBits );
    for( $aBits, $bBits ) {
        local $\;
        open my $ram, '>', \$_;
        seek $ram, $MAX-1, 0;
        print $ram chr(0);
        close $ram;
    }
    vec( $aBits, $_, 1 ) = 1 for @$aRef;
    vec( $bBits, $_, 1 ) = 1 for @$bRef;
    my $iCount = unpack '%32b*', ( $aBits & $bBits );
    my $uCount = @$aRef + @$bRef - $iCount;
    $aBits = $bBits = '';
    print 'buk2: ', $iCount, ' : ', $uCount if $SHOW;
    return( $iCount, $uCount );
}
Re: Fast, Efficient Union and Intersection on arrays
by ptoulis (Scribe) on Nov 21, 2008 at 21:26 UTC

I am dumping my own code here, which finds the union/intersection of arrays in place. It uses mergesort to order the lists (the Heap module from CPAN is of low quality, so I had to use the built-in mergesort for sorting) and then divide-and-conquer to find the intersection and union. It is about 2 times slower than get_int_uni2 but more frugal in its space requirements, because it does not allocate any more memory than the input arrays (of course, I can't compare it with buk()). For arrays of 500,000 elements it takes 4-5 seconds on my machine, and RAM consumption is 50MB, which is what Perl allocates for the initial arrays (plus, of course, the runtime).
I didn't try to optimize it much; perhaps someone could throw in an idea here. Sorry for the big listing, but I have some inline comments that might help anyone interested.
use warnings;
use strict;
use List::Util 'shuffle';
use sort '_mergesort';

# create the test arrays
!$ARGV[0] and print "No input for array size" and exit;
my @array1 = shuffle 1..$ARGV[0];
my @array2 = shuffle 1..$ARGV[0];
my @sortedA = sort { $a <=> $b } @array1;
my @sortedB = sort { $a <=> $b } @array2;
our $pivot;   # used in binary search

my ($uni, $insc) = p_uni(0, $#sortedA, 0, $#sortedB);
print "\nU = ", $uni, " I = ", $insc;

# >>>>> UNION and INTERSECTION <<<<<
# p_uni() returns a LIST:
#   [0] = cardinality of the union
#   [1] = cardinality of the intersection
sub p_uni {
    # The function processes 2 ordered lists, say A and B, as follows:
    # we split A = A1+A2 and B = B1+B2 around the middle.
    # Since they are ordered, we know that INTERSECTION(A1,B2) = 0 and
    # INTERSECTION(A2,B1) = 0, so we process them independently.
    # The split array Ai is A[aLeft,aRight], the subset of A with those
    # indices.  The same holds for bLeft and bRight.
    my ($aLeft, $aRight, $bLeft, $bRight) = @_;
    my ($tmp_uni, $tmp_insc) = (0) x 2;
    # If the A array is too small, run a linear search and match
    if ($aRight-$aLeft < 100) {
        my $search_result;
        for (my $i = $aLeft, my $bLeft_h = $bLeft; $i <= $aRight; $i++) {
            $search_result = bin_search(\@sortedB, $sortedA[$i], $bLeft_h, $bRight);
            $tmp_uni++;
            $tmp_insc++ if $search_result->[0];
            $bLeft_h = $search_result->[1];
        }
        return ($aRight-$aLeft+1-$tmp_insc+$bRight-$bLeft+1, $tmp_insc);
    }
    # If the B array is too small, run a linear search and match
    elsif ($bRight-$bLeft < 100) {
        my ($tmp_uni, $tmp_insc) = (0) x 2;
        my $search_result;
        for my $i ($bLeft..$bRight) {
            if (defined($sortedB[$i-1]) && $sortedB[$i-1] != $sortedB[$i]) {
                my $search_result = bin_search(\@sortedA, $sortedB[$i], $aLeft, $aRight);
                if ($search_result->[0]) { $tmp_uni++; $tmp_insc++; }
                else                     { $tmp_uni += 1; }
                $aLeft = $search_result->[1];
            }
        }
        return ($aRight-$aLeft+1-$tmp_insc+$bRight-$bLeft+1, $tmp_insc);
    }
    my $bPos = int(($bLeft+$bRight)/2);
    my $aPos = bin_search(\@sortedA, $sortedB[$bPos], $aLeft, $aRight);
    my ($a_insc, $a_uni) = p_uni($aLeft, $aPos->[1], $bLeft, $bPos);
    my ($b_insc, $b_uni) = p_uni($aPos->[1]+1, $aRight, $bPos+1, $bRight);
    return ($a_insc+$b_insc, $a_uni+$b_uni);
}

# Binary search: searches the ordered list reference AR_REF for WHAT,
# between and including indices cLEFT and cRIGHT.
sub bin_search {
    my ($ar_ref, $what, $cLeft, $cRight) = @_;
    my ($left, $right, $value) = ($cLeft, $cRight, 0);
    while ($left <= $right) {
        $pivot = int(($left+$right)/2);
        $value = $ar_ref->[$pivot];
        if    ($value > $what) { $right = $pivot-1; }
        elsif ($value < $what) { $left  = $pivot+1; }
        else                   { return [1, $pivot]; }
    }
    return [0, $pivot];
}