Mathlas repository

dimensionality_reduction package

The dimensionality_reduction package contains two modules that can be used for reducing the dimensionality of a set of data: IAMB & floats_to_categories.


The floats_to_categories module contains routines for splitting float data in different ways:

  • By manually providing the intervals
  • By creating equally spaced intervals
  • Information theory-based partitioning


Consider the following diagram symbolizing the relationship between variables:

Markov Blanket example

Bishop1 defines the Markov Blanket based on a graphical model as the set of parents, children and co-parents of a node and it has the property that the conditional probability distribution of \(x_i\) conditioned on all the variables in the graph is only dependent on the variables in the Markov Blanket (a more formal definition can be found in 2 and 3). It can be though of as the minimal set of variables that we need to know in order to model the target variable \(x_i\).

Being able to determine the Markov Blanket allows us to perform variable selection processes. This is important since in our work we'd normally have dozens or hundreds of variables in our sample data and it is generally not possible to train a model with so many variables.

IAMB stands for Incremental Association Markov Blanket and is an algorithm for determining the Markov Blanket for some input data. The iamb module provides both IAMB2 and λ-IAMB3 algorithm implementations for determining the Markov Blanket for a variable and its associated dataset.

The IAMB algorithm computes the Markov Blanket for a target variable in a dataset by measuring how much "knowledge" of the target data is contained in the measurements for each of the rest of the variables. That "knowledge" of the target data is measured by using the mutual information or the conditional mutual information (CMI) 4. These values can be understood as the reduction of uncertainty about the target variable when the value of another variable is known given what we already know about other variables (if anything).

The algorithm comprises a growing and a shrinking phase:

  • The growing phase consists of the following steps:
    1. The first variable in the Markov Blanket is chosen as that which provides the highest value for the mutual information when considered together with the target variable.
    2. A new variable is chosen as that which provides the highest value of the conditional mutual information about the target variable considering what we already know about it, given that that CMI is above a given threshold.
    3. The previous step is repeated until no variable providing a CMI above the given threshold can be found.
  • During the shrinking phase we look for false positives added during the growing phase. We do this by looking at the CMI for each var in the Markov Blanket given the information provided by the rest of the variables in the Markov Blanket. Should that CMI be below a given threshold, we'd remove said vars.

The following code computes the Markov Blanket on some input data using the IAMB algorithm. The target variable is \(x_4\) which is defined as \(1\) when \(x_2>0.5\) and \(0\) otherwise.

import numpy as np
import pandas as pd
from mathlas.dimensionality_reduction.iamb import iamb
from mathlas.statistics.probabilityDistribution import get_probabilities
from mathlas.dimensionality_reduction.floats_to_categories import EquispacedPartition

# We create a DataFrame from random data in the 0-1 range
df = pd.DataFrame(np.random.rand(1000, 4), columns=('x_1', 'x_2', 'x_3', 'x_4'))
# x_4 will be out target variable, and we'll define
# it as the rows whose x_2 value is > 0.5
df['x_4'] = (df['x_2'] > 0.5).astype(int)

# We must categorize the data before being able to pass it to the IAMB algorithm
df = EquispacedPartition(df, variables=['x_1', 'x_2', 'x_3'], n_partitions=10).data
iamb_vars = iamb(df, 'x_4', get_probabilities, eps_g=0.01, verbose=True)
# Print the Markov Blanket we just obtained
print('IAMB vars: {}'.format(iamb_vars))

# Output:
# Growing phase. Tier 1
# Mutual informations calculated in 0.0 seconds.
# Added variable <x_2> whose CMI is 0.97086, larger than the threshold 0.01000.
# Growing phase. Tier 2
# Mutual informations calculated in 0.0 seconds.
# Shrinking phase. Tier 1
# {'x_2'}