Implementation Details

The core implementation of minepy is written from scratch in ANSI C, starting from the pseudocode provided in the [Reshef11] Supplementary On-line Material (SOM), as no original Java source code is available. The level of detail of the pseudocode leaves a few ambiguities, and in this section we list and comment on the most crucial choices we adopted for the algorithm steps whenever no explicit description was provided. Our choices are not necessarily the same as in the original Java version (MINE.jar, http://www.exploredata.net/). The resulting differences can account for small numerical discrepancies as well as for differences in performance.

  1. In SOM, Algorithm 5, the characteristic matrix M is computed in the loop starting at line 7 for xy\leq B. This is in contrast with the definition of the MINE measures (see SOM, Sec. 2), where the corresponding bound is xy<B for all four statistics. We adopted the same bound as in the pseudocode, i.e. xy\leq B.
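    The practical effect of the two bounds can be illustrated with a short sketch; grid_sizes() is a hypothetical helper (not part of the minepy API), and B=n^\alpha is assumed as in [Reshef11]:

```python
def grid_sizes(n, alpha=0.6, inclusive=True):
    # Hypothetical helper, not the minepy API: enumerate the (x, y)
    # grid sizes visited when filling the characteristic matrix,
    # assuming B = n**alpha as in [Reshef11].
    # inclusive=True uses the pseudocode bound x*y <= B;
    # inclusive=False uses the strict bound x*y < B of the definition.
    B = n ** alpha
    ok = (lambda p: p <= B) if inclusive else (lambda p: p < B)
    return [(x, y) for x in range(2, int(B) + 1)
                   for y in range(2, int(B) + 1) if ok(x * y)]

# The two bounds differ exactly for the grids where x*y equals B:
# with n=64 and alpha=0.5 (B=8), the grids (2, 4) and (4, 2) are
# visited only under the inclusive bound.
```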

  2. The MINE statistic MCN is defined as follows in SOM, Sec. 2:

    \textrm{MCN}(D,\epsilon) = \min_{xy<B} \{\log(xy)\colon M(D)_{x,y} \geq (1-\epsilon)\textrm{MIC}(D)\}

    As in MINE.jar (inferred from Table S1), we set \epsilon=0 and take \log in base 2. Finally, as specified in Point 1 above, we use the bound xy\leq B as in the SOM pseudocode rather than the xy<B of the definition. This leads to the implemented formula:

    \textrm{MCN}(D,0) = \min_{xy\leq B} \{\log_2(xy)\colon M(D)_{x,y} = \textrm{MIC}(D)\},

    where \textrm{MIC}(D) is the maximum value of the matrix M(D).
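    A minimal sketch of the implemented formula, assuming the characteristic matrix is given as a dict keyed by grid size (mcn() is a hypothetical helper, not the minepy API):

```python
import math

def mcn(M, B):
    # Hypothetical sketch of the implemented MCN(D, 0), not the minepy
    # API: the minimum log2(x*y) over grid sizes with x*y <= B whose
    # characteristic-matrix entry equals MIC(D) = max over M.
    mic = max(M.values())
    return min(math.log2(x * y)
               for (x, y), v in M.items()
               if x * y <= B and v == mic)

# Toy characteristic matrix: MIC is attained at (2, 3) and (2, 4),
# so MCN(D, 0) = log2(6), the smaller of log2(6) and log2(8).
M = {(2, 2): 0.5, (2, 3): 1.0, (3, 2): 0.8, (2, 4): 1.0}
```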

  3. In EquipartitionYAxis() (SOM, Algorithm 3, lines 4 and 10), two ratios are assigned to the variable desiredRowSize, namely \frac{n}{y} and \frac{(n-i+1)}{(y-\textrm{currRow}+1)}. We choose to consider the ratios as real numbers; a possible alternative is to cast desiredRowSize to an integer. The two alternatives can give rise to different Q maps, and thus to slightly different numerical values of the MINE statistics.
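    The difference between the two alternatives can be made concrete with a small sketch (desired_row_size() is a hypothetical helper, not the minepy API):

```python
def desired_row_size(n, y, as_int=False):
    # The two alternatives for desiredRowSize in EquipartitionYAxis()
    # (SOM, Algorithm 3, line 4); hypothetical helper, not the minepy
    # API. We keep the ratio n/y as a real number; the alternative
    # casts it to an integer.
    r = n / y
    return int(r) if as_int else r

# With n=7 points and y=3 rows, the real ratio is 7/3 ~ 2.33 while the
# integer cast gives 2; the greedy test that closes a row once it
# reaches desiredRowSize then fires at different points, producing
# different Q maps and slightly different MINE statistics.
```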

  4. In some cases, the function EquipartitionYAxis() can return a map Q whose number of clumps \hat{y} is smaller than y, e.g. when D contains enough points whose second coordinates coincide. This can lead to underestimating the normalized mutual information matrix M_{x,y} (SOM, Algorithm 5, line 9), where M_{x,y} is obtained by dividing the mutual information I_{x,y} by \min\{\log x, \log y\}. To prevent this issue, we normalize instead by the factor \min\{\log x, \log \hat{y}\}.
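    A minimal sketch of the adopted normalization (normalized_mi() is a hypothetical helper, not the minepy API):

```python
import math

def normalized_mi(i_xy, x, y_hat):
    # Normalize the mutual information by min(log x, log y_hat), where
    # y_hat is the number of clumps actually returned by
    # EquipartitionYAxis(), which may be smaller than the requested y.
    # Hypothetical sketch, not the minepy API.
    return i_xy / min(math.log(x), math.log(y_hat))

# If EquipartitionYAxis() returns y_hat=2 clumps instead of the
# requested y=4, then for I = log(2) normalizing by min(log 4, log 4)
# would give 0.5, while normalizing by min(log 4, log 2) gives 1.
```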

  5. The function GetClumpsPartition(D,Q) is discussed in [Reshef11], SOM page 12, but its pseudocode is not explicitly available. Our implementation is defined in the GetClumpsPartition() algorithm below. The function returns the map P defining the clumps for the set D, with the constraint of keeping points with the same x-value in the same clump.

    [Figure: GetClumpsPartition() algorithm]
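    The behavior described above can also be sketched in a few lines of Python; this is a hypothetical reconstruction of GetClumpsPartition(), not minepy's C code, and it assumes the points are already sorted by x-value:

```python
def get_clumps_partition(xs, Q):
    # xs: x-values of the points, sorted ascending; Q[i]: the row
    # index assigned to point i by EquipartitionYAxis().
    # Returns P[i]: the clump index of point i. A clump is a maximal
    # run of consecutive points sharing the same row label, after
    # forcing points with equal x-values into the same clump.
    # Hypothetical sketch, not minepy's C code.
    n = len(xs)
    q = list(Q)
    # collapse each run of equal x-values to a single row label
    i = 0
    while i < n:
        j = i
        while j + 1 < n and xs[j + 1] == xs[i]:
            j += 1
        for k in range(i, j + 1):
            q[k] = q[i]
        i = j + 1
    # number the maximal runs of equal labels from left to right
    P, c, prev = [], -1, object()
    for label in q:
        if label != prev:
            c += 1
            prev = label
        P.append(c)
    return P
```

    For example, with x-values [1, 1, 2, 3, 3] and row map [0, 1, 1, 1, 2], the two groups of tied x-values are merged and two clumps are returned.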

  6. We also explicitly provide the pseudocode for the GetSuperclumpsPartition() function (discussed in [Reshef11], SOM page 13) in the GetSuperclumpsPartition() algorithm below. This function limits the number of clumps when their number k is larger than a given bound \hat{k}. The function calls GetClumpsPartition() and, for k>\hat{k}, it builds an auxiliary set D_{\tilde{P}} as input for the EquipartitionYAxis() function discussed above (Points 3-4).

    [Figure: GetSuperclumpsPartition() algorithm]
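    The superclump step alone can be sketched as follows; this is a simplified hypothetical version (it equipartitions the clump indices directly rather than building the auxiliary set D_{\tilde{P}}), not minepy's C code:

```python
def get_superclumps_partition(P, k_hat):
    # P: clumps map as returned by GetClumpsPartition(), with clump
    # indices 0..k-1 in nondecreasing order; k_hat: maximum allowed
    # number of superclumps. If k <= k_hat the map is returned
    # unchanged; otherwise the k clumps are merged into at most k_hat
    # superclumps. Simplified hypothetical sketch (equipartition over
    # clump indices), not minepy's C code.
    k = max(P) + 1
    if k <= k_hat:
        return P
    return [p * k_hat // k for p in P]
```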

  7. We observed that the GetSuperclumpsPartition() implemented in MINE.jar may fail to respect the \hat{k} constraint on the maximum number of clumps, and a map P with \hat{k}+1 superclumps is actually returned. As an example, MINE.jar applied in debug mode (d=4 option) with the same parameters (\alpha=0.551, c=10) used in [Reshef11] to the pair of variables (OTU4435, OTU4496) of the Microbiome dataset returns cx+1 clumps, instead of stopping at the bound \hat{k}=cx, for x=12,7,6,5,4\ldots.

  8. The possibly different implementations of the GetSuperclumpsPartition() function described in Points 6-7 can lead to minor numerical differences in the MIC statistics. To confirm this effect, we verified that by reducing the number of calls to the GetSuperclumpsPartition() algorithm, we can also decrease the difference between MIC computed by minepy and by MINE.jar, and they asymptotically converge to the same value.

  9. In our implementation, we use double-precision floating-point numbers (double in C) in the computation of entropy and mutual information values. The internal implementation of the same quantities in MINE.jar is unknown.

  10. In order to speed up the computation of the MINE statistics, we introduced two improvements (with respect to the pseudocode) in OptimizeXAxis(), defined in Algorithm 2 of [Reshef11] SOM:

    • Given a (P,Q) grid, we precalculate the matrix of the number of samples in each cell of the grid, to speed up the computation of the entropy values H(Q), H(\langle c_0,c_s,c_t\rangle), H(\langle c_0,c_s,c_t \rangle, Q) and H(\langle c_s,c_t \rangle, Q).
    • We precalculate the entropy matrix H(\langle c_s,c_t \rangle, Q), \forall s,t, to speed up the computation of F(s,t,l) (see Algorithm 2, lines 10–17 in [Reshef11] SOM).

    These improvements do not affect the final values of the mutual information matrix or of the MINE statistics.
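    The first improvement can be sketched as follows (cell_count_matrix() and entropy() are hypothetical helpers illustrating the idea, not minepy's C code):

```python
import math

def entropy(counts):
    # Shannon entropy (in nats) of a list of nonnegative counts.
    tot = sum(counts)
    return -sum(c / tot * math.log(c / tot) for c in counts if c > 0)

def cell_count_matrix(P, Q, nx, ny):
    # Precompute the number of samples in each cell of the (P, Q)
    # grid; P[i] and Q[i] are the column and row of point i. Entropies
    # such as H(Q) or H(<c_s, c_t>, Q) then reduce to summing a few
    # precomputed cells instead of re-scanning all the points.
    # Hypothetical sketch of the speed-up, not minepy's C code.
    m = [[0] * ny for _ in range(nx)]
    for p, q in zip(P, Q):
        m[p][q] += 1
    return m

# H(Q) obtained from the row sums of the precomputed matrix:
P = [0, 0, 1, 1, 2, 2]
Q = [0, 1, 0, 1, 0, 1]
m = cell_count_matrix(P, Q, 3, 2)
h_q = entropy([sum(m[x][y] for x in range(3)) for y in range(2)])
```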

[Reshef11] David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, and Pardis C. Sabeti. Detecting Novel Associations in Large Data Sets. Science, 334(6062):1518-1524, 16 December 2011. DOI:10.1126/science.1205438.
