BTER Guide: How to create a BTER graph that matches an existing graph

By the BTER Team (Sandia National Laboratories, Livermore, CA):

Tamara G. Kolda, tgkolda@sandia.gov
Ali Pinar, apinar@sandia.gov
Todd Plantenga, tplante@sandia.gov
C. Seshadhri, scomand@sandia.gov

April 2013

Distributed with FEASTPACK.

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Step 1: Load a graph
Step 2: Compute the degree distribution
Step 3: Calculate the clustering coefficient per degree
Step 4: Generate BTER model
Step 5: Create sparse matrix representation of BTER graph
Step 6: Calculate metrics on BTER graph and compare to original
Step 7: Visualize the data
Step 7 (alternate): Visualize the results with binning
"Blowup" option in BTER for better match to degree-1 vertices
Larger graphs: Deduplicating the edge list
Larger graphs: Calculating the clustering coefficients
References

Step 1: Load a graph

Some preprocessing may be needed to make the graph simple.

G = G + G'; %<- Make the graph undirected
G = spones(G); %<- Make the graph unweighted
G = spdiags(zeros(size(G,1),1),0,G); %<- Remove self-loops

We use an example graph from the SNAP collection which has already been preprocessed and is included in the distribution.

load('data/ca-AstroPh-simple');
nnodes = size(G,1);
nedges = nnz(G)/2;
fprintf('Graph name: %s\n', graphname);
fprintf('Number of nodes: %d\n', nnodes);
fprintf('Number of edges: %d\n', nedges);

Graph name: ca-AstroPh-simple
Number of nodes: 18772
Number of edges: 198050

Step 2: Compute the degree distribution

Calculate the number of nodes per degree.

nd = accumarray(nonzeros(sum(G,2)),1);
maxdegree = find(nd>0,1,'last');
fprintf('Maximum degree: %d\n', maxdegree);

Maximum degree: 504

Step 3: Calculate the clustering coefficient per degree

This can be extremely expensive, especially for large graphs. We provide C code and accompanying mex interfaces to do both exact and estimated calculations. The matlab function ccperdeg is an interface to both; see below for details. The ccperdeg function returns the clustering coefficient per degree (ccd) and the global clustering coefficient (gcc). In the second case, the value of 'nsamples' is the number of sample wedges per bin.

[ccd,gcc] = ccperdeg(G);
fprintf('Global clustering coefficient: %.2f\n', gcc);

Global clustering coefficient: 0.32

Step 4: Generate BTER model

BTER creates two edges lists, corresponding to Phase 1 and Phase 2 edges, respectively.

fprintf('Running BTER...\n');
t1=tic;
[E1,E2] = bter(nd,ccd);
toc(t1)
fprintf('Number of edges created by BTER: %d\n', size(E1,1) + size(E2,1));

Running BTER...
--- BTER HPC Set-up ---
Desired number of nodes: 18771
Desired number of edges: 198050
Multiplier to degree-1 nodes: 1
Maximum degree: 504
Number of groups: 76
Number of blocks: 2095
Phase 1 total weight: 303303
Phase 2 total weight: 52313
Time for setup (sec): 0.02
Determined phase for 355616 edges in 0.018912 seconds
Sampled 303534 edges for phase 1 in 0.185769 seconds
Sampled 52082 edges for phase 2 in 0.058720 seconds
Removed 35 loops in 0.001642 seconds
--- BTER HPC Complete ---
Elapsed time is 0.319444 seconds.
Number of edges created by BTER: 355581

Step 5: Create sparse matrix representation of BTER graph

Converting the edge list to an adjacency matrix is often the most expensive part in MATLAB. We separate this step because the edge list may be sufficient for many applications.

fprintf('Turning edge list into adjacency matrix (including dedup)...\n');
t2=tic;
G_bter = bter_edges2graph(E1,E2);
toc(t2);
fprintf('Number of edges in dedup''d graph: %d\n', nnz(G)/2);

Turning edge list into adjacency matrix (including dedup)...
Elapsed time is 0.770594 seconds.
Number of edges in dedup'd graph: 198050

Step 6: Calculate metrics on BTER graph and compare to original

% Number of nodes and edges
nnodes_bter = size(G_bter,1);
nedges_bter = nnz(G_bter)/2;
fprintf('Comparative Results: %s versus BTER\n', graphname);
fprintf('Number of nodes: %d versus %d\n', nnodes, nnodes_bter);
fprintf('Number of edges: %d versus %d\n', nedges, nedges_bter);

% Degree disttribution
nd_bter = accumarray(nonzeros(sum(G_bter,2)),1);
maxdegree_bter = find(nd_bter>0,1,'last');
fprintf('Maximum degree: %d versus %d\n', maxdegree, maxdegree_bter);

% Clustering coefficients
[ccd_bter,gcc_bter] = ccperdeg(G_bter);
fprintf('Global clustering coefficient: %.2f versus %.2f\n', gcc, gcc_bter);

Comparative Results: ca-AstroPh-simple versus BTER
Number of nodes: 18772 versus 18771
Number of edges: 198050 versus 195795
Maximum degree: 504 versus 423
Global clustering coefficient: 0.32 versus 0.33

Step 7: Visualize the data

% Degree distribution
figure;
loglog(1:length(nd),nd,'ro',1:length(nd_bter),nd_bter,'b*');
title('Degree distribution');
xlabel('Degree');
ylabel('Frequency');
legend(graphname, 'BTER');

% Clustering coefficients
figure;
semilogx(1:length(ccd),ccd,'ro',1:length(ccd_bter),ccd_bter,'b*');
title('Clutering Coefficients');
xlabel('Degree');
ylabel('Mean CC');
legend(graphname, 'BTER');

Step 7 (alternate): Visualize the results with binning

The data can be extremely crowded, especially for the higher degrees. We also include the bindata function that can be used to logrithmically bin the data. Type 'help bindata' for full instructions. We make the following parameter choices here:

tau = 10 : The first 10 bins are singleton bins.
omega = 1.5 : Subsequent bins grow in size by a factor of 1.5.
ybinfun = @mean : Take the average of all entries in the bin.
prebin = true : Ensures every entry in the bin has an assigned value, even if its zero. Especially significant for last bin which may not be completely specified by the inputs.

binparams = {'tau', 10,'omega', 1.5, 'ybinfun', @mean, 'prebin', false};

In the binned version, the degree distibution can be non-integer and less than one because its the average value for the bin.

% Degree distribution
figure;
[xx1,yy1] = bindata([], nd, binparams{:});
[xx2,yy2] = bindata([], nd_bter, binparams{:});
loglog(xx1,yy1,'ro',xx2,yy2,'b*');
title('Degree distribution (logrithmically binned with means)');
xlabel('Degree');
ylabel('Frequency');
legend(graphname, 'BTER');

Clustering coefficients

figure;
[xx1,yy1] = bindata([], ccd, binparams{:});
[xx2,yy2] = bindata([], ccd_bter, binparams{:});
semilogx(xx1,yy1,'ro',xx2,yy2,'b*');
title('Clutering Coefficients (logrithmically binned with means)');
xlabel('Degree');
ylabel('Mean CC');
legend(graphname, 'BTER');

"Blowup" option in BTER for better match to degree-1 vertices

In many cases, BTER underestimates the number of degree-1 nodes. This is a well-understood problem, and we use the correction prescribed by Durak et al. (arXiv:1210.5288, 2012). We can see that now the number of degree-1 vertices is much more accurate.

[E1_alt,E2_alt] = bter(nd,ccd,'verbose',false,'blowup',10);
G_alt = bter_edges2graph(E1_alt, E2_alt);
nd_alt = accumarray(nonzeros(sum(G_alt,2)),1);
figure;
loglog(1:length(nd),nd,'ro',1:length(nd_bter),nd_bter,'b*');
hold on;
loglog(1:length(nd_alt),nd_alt,'g^','Color', [0 .7 .2]);
title('Degree distribution');
xlabel('Degree');
ylabel('Frequency');
legend(graphname, 'BTER', 'BTER with Blowup');

Larger graphs: Deduplicating the edge list

It's expensive to form a sparse adjancey matrix for very large graphs. If you want to remove duplicate edges in the edge list, we suggest the following procedure.

% Before we begin, we have to do some preprocessing to remove any loops
% from Phase 2 (cannot occur in Phase 1), and order the vertices defining
% the edges so that the least index is always first. Note that this step
% can be skipped if 'swap' and 'rmloops' options area set to true when
% running BTER.

% --- Preprocessing for Deduplication ---
% Remove loops in Phase 2
tf = (E2(:,1) == E2(:,2));
E2 = E2(~tf,:);
% Consistently order Phase 1
tf = E1(:,1) > E1(:,2);
tmp = E1(tf,2);
E1(tf,2) = E1(tf,1);
E1(tf,1) = tmp;
% Consistently order Phase 2
tf = E2(:,1) > E2(:,2);
tmp = E2(tf,2);
E2(tf,2) = E2(tf,1);
E2(tf,1) = tmp;
% --- Preprocessing Complete ---

% First, deduplicate Phase 1, which has most of the duplication
E1_dedup = unique(E1,'rows');
fprintf('Number of duplicates in Phase 1: %d\n', size(E1,1) - size(E1_dedup,1));

% Next, deduplicate Phase 2, which does not have much duplication
E2_dedup = unique(E2,'rows');
fprintf('Number of duplicates in Phase 2: %d\n', size(E2,1) - size(E2_dedup,1));

% Finally, remove any duplicates between the 2 phases.
E_dedup = unique([E1_dedup; E2_dedup], 'rows');
fprintf('Number of duplicates across Phasees: %d\n', ...
    size(E1_dedup,1) + size(E2_dedup,1) - size(E_dedup,1));
fprintf('Final number of edges: %d\n', size(E_dedup,1));

Number of duplicates in Phase 1: 157937
Number of duplicates in Phase 2: 776
Number of duplicates across Phasees: 1073
Final number of edges: 195795

Larger graphs: Calculating the clustering coefficients

For larger graphs, calculating the clustering coefficient exactly can be extremely expensive. Here we compare the exact and estimated clustering coefficients using the two different methods.

% Load a bigger graph
load('data/web-NotreDame-simple');
nnodes = size(G,1);
nedges = nnz(G)/2;
fprintf('Graph name: %s\n', graphname);
fprintf('Number of nodes: %d\n', nnodes);
fprintf('Number of edges: %d\n', nedges);

% Calculate the clustering coefficients _exactly_.
tic;
[ccd0,gcc0] = ccperdeg(G,'nsamples',0);
t = toc;
fprintf('      Exact GCC: %e (%f seconds to compute) \n', gcc0, t);

% Calculate the clustering coefficients using wedge sampling.
nsamples = 2000;
tic;
[ccd,gcc] = ccperdeg(G,'nsamples',nsamples);
t = toc;
fprintf('  Estimated GCC: %e (%f seconds to compute) \n', gcc, t);

% Compute the bound on the error and check results.
delta = 0.001; %<-Corresponding to 99.9% confidence.
errorbound = sqrt(log(2/delta)/(2*nsamples));
fprintf(' Expected error: %e (with 99.9%% confidence)\n', errorbound);
fprintf('  Max CCD error: %e\n', max(abs(ccd-ccd0)));
fprintf('      GCC error: %e\n', abs(gcc-gcc0));

% Visualize the differences
figure;
e = ones(size(ccd0));
errorbar(1:length(ccd0),ccd0,errorbound*e,errorbound*e,'b.');
set(gca,'XScale','log')
hold on;
plot(1:length(ccd),ccd,'r*');
xl = xlim;
xlim([1 xl(2)]);
ylim([-errorbound 1+errorbound]);
title('Clutering Coefficients Calculation Comparison');
xlabel('Degree');
ylabel('Mean CC');
legend('Exact', 'Wedge Sampling');

Graph name: web-NotreDame-simple
Number of nodes: 325729
Number of edges: 1090108
      Exact GCC: 8.767355e-02 (2.091621 seconds to compute) 
  Estimated GCC: 8.747799e-02 (0.535013 seconds to compute) 
 Expected error: 4.359158e-02 (with 99.9% confidence)
  Max CCD error: 2.666742e-02
      GCC error: 1.955560e-04

References

SNAP: Stanford Network Analysis Project, http://snap.stanford.edu/
ca-AstroPh: Astro Physics collaboration network, SNAP, http://snap.stanford.edu/data/ca-AstroPh.html
web-NotreDame: Notre Dame web graph, SNAP, http://snap.stanford.edu/data/web-NotreDame.html
BTER Model: C. Seshadhri, T. G. Kolda and A. Pinar. Community structure and scale-free collections of Erdös-Rényi graphs, Physical Review E 85(5):056109, May 2012, doi:10.1103/PhysRevE.85.056109
BTER Implementation: T. G. Kolda, A. Pinar, T. Plantenga and C. Seshadhri. A Scalable Generative Graph Model with Community Structure, arXiv:1302.6636, March 2013, http://arxiv.org/abs/1302.6636
Clustering Coefficient (ccperdeg.m and related MEX files): C. Seshadhri, A. Pinar and T. G. Kolda. Triadic Measures on Graphs: The Power of Wedge Sampling in Proc. 2013 SIAM International Conference on Data Mining (SDM), May 2013, preprint: http://arxiv.org/abs/1202.5230