A Role for Symbolic Computation in the General Estimation of Statistical Models


In this document we describe how symbolic computation can be used to provide generalizable statistical estimation through a combination of existing open source frameworks.

Specifically, we consider the optimization problems resulting from models with non-smooth objective functions. These problems arise in the context of regularization and shrinkage, and here we’ll address their automated application within the proximal framework (Polson, Scott, and Willard 2015). In (Polson, Scott, and Willard 2015) we outlined a set of seemingly disparate optimization techniques within the fields of statistics, computer vision, and machine learning, that are unified by the types of functional bounds employed, the tools of convex analysis and the language of operator theory. These methods–and the concepts behind them–have found much success in recent times and admit quite a few interesting paths for research.

In at least a few cases, the work required to produce a proximal algorithm overlaps with some highly functional areas in computer algebra and symbolic mathematics. For instance, some proximal operators–the main ingredient within proximal algorithms–can be solved exactly with symbolic differentiation and algebraic equation solvers. With perhaps only the addition of sophisticated variable assumptions and an ability to manipulate inequalities, even larger classes of proximal operators could be solved. We refer the reader to the table in (Polson, Scott, and Willard 2015) for examples of operators that are currently solved–by hand–using roughly the same means.

The kind of automation proposed here only begins to answer a problem that arises somewhat naturally in these areas: how does one provide access to methods that produce–or apply to–numerous distinct problems and solutions? Instead of the common attempt to implement each model or method separately, and then combine them under a loosely organized API or function interface, the symbolic approach offers wider applicability and concise implementations in terms of code that maps directly to the mathematics. It is also one of the few practical means of including the higher- and lower-level considerations made by professionals at all stages of statistical modeling: mathematical, numerical, and computational (e.g. distributed environments, concurrency, etc.). Some steps toward these broader goals are within reach and worth taking now.

That said, statistical modeling and estimation as a whole should seriously consider aligning more with the current tools and offerings of symbolic mathematics. Relative to the subject matter here, symbolic integration provides an excellent example. In computer algebra systems, mappings between basic functional forms and their generalized hypergeometric equivalents are used to exploit convenient convolution identities. In the same vein, it might be possible to use analogous tables for solutions to a wide variety of non-smooth models. We outline how this might be done in the following sections.

A Context

Much recent work in statistical modeling and estimation has revolved around the desire for sparsity, regularization and efficient [automatic] model selection. This is, in some sense, an objective shared with the more specialized areas of Deep Learning and Compressed Sensing. In the former case, we can point to Dropout (Srivastava et al. 2014) and, in the latter, \(\ell_p\) regularization (Donoho 2006).

Without delving into those topics here, we’ll simply assume that a practitioner intends to produce a sparse estimate for a model, which–in our example–results in the LASSO. First, some setup:

import numpy as np
import scipy as sc

import pymc3 as pm
import theano
import theano.tensor as tt

theano.config.mode = 'FAST_COMPILE'

Using PyMC3, the Bayes version of the LASSO (Park and Casella 2008) model is easily specified.

from theano import shared as tt_shared

mu_true = np.zeros(100)
mu_true[:20] = np.exp(-np.arange(20)) * 100

X = np.random.randn(int(np.alen(mu_true) * 0.7), np.alen(mu_true))
y = sc.stats.norm.rvs(loc=X.dot(mu_true), scale=10)

X_tt = tt_shared(X, name='X', borrow=True)
y_tt = tt_shared(y, name='y', borrow=True)

with pm.Model() as lasso_model:
    # Would be nice if we could pass the symbolic y_tt.shape, so
    # that our model would automatically conform to changes in
    # the shared variables X_tt.
    beta_rv = pm.Laplace('beta', mu=0, b=1, shape=X.shape[1])
    y_rv = pm.Normal('y', mu=tt.dot(X_tt, beta_rv), sd=1,
                     shape=y.shape[0], observed=y_tt)

The negative total log likelihood in our example problem has a non-smooth \(\ell_1\) term. The standard means of estimating a [MAP] solution to this problem usually involves the soft-thresholding operator, which is a type of proximal operator. This operator is cheap to compute, which–among other things–makes the proximal approaches that use it quite appealing.
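For reference, soft-thresholding with threshold \(\lambda\) takes the explicit, component-wise form \[ S_\lambda(z) = \operatorname{sgn}(z) \max(|z| - \lambda, 0) \;. \]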

Moving on, let’s say we wanted to produce a MAP estimate in this PyMC3 context. A function is already provided for this generic task: find_MAP.

with lasso_model:
    params_0 = pm.find_MAP(vars=[beta_rv])

In our run of the above, an exception is thrown due to the nan values that arise within the gradient evaluation.

More directly, we can inspect the gradient at \(\beta = \mathbf{0}\) and \(\beta = \mathbf{1}\) to demonstrate the same.

start = pm.Point({'beta': np.zeros(X.shape[1])}, model=lasso_model)
bij = pm.DictToArrayBijection(pm.ArrayOrdering(lasso_model.vars), start)
logp = bij.mapf(lasso_model.fastlogp)
dlogp = bij.mapf(lasso_model.fastdlogp(lasso_model.vars))

# Could also inspect the log likelihood of the prior:
# beta_rv.dlogp().f(np.zeros_like(start['beta']))

grad_at_0 = dlogp(np.zeros_like(start['beta']))
grad_at_1 = dlogp(np.ones_like(start['beta']))

In [1]: print(np.sum(np.isnan(grad_at_0)))
Out[1]: 100
In [2]: print(np.sum(np.isnan(grad_at_1)))
Out[2]: 0
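The source of those nan values is plain in isolation: the symbolic gradient of \(|\beta|\) is \(\beta / |\beta|\), which evaluates to \(0/0\) at the origin. A quick NumPy check of this mechanism, independent of Theano:

```python
import numpy as np

beta = np.array([0.0, 1.0])

# Theano differentiates |beta| as beta / |beta|, which is 0/0 = nan
# at beta = 0 and well-defined elsewhere.
with np.errstate(invalid='ignore'):
    grad_abs = beta / np.abs(beta)

print(grad_abs)  # nan at 0, 1. elsewhere
```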

The Proximal Context

The general form of what we’re calling a proximal problem mirrors a penalized likelihood, i.e. \[\begin{equation} \beta^* = \operatorname*{argmin}_\beta \left\{ l(\beta) + \gamma \phi(\beta) \right\} \;, \label{eq:prox_problem} \end{equation}\] where the entire sum is commonly associated with the negative log likelihood and the functions \(l\) and \(\phi\) with observation and prior terms, or loss and penalty terms, respectively. Within the proximal framework, \(l\) and \(\phi\) are usually convex and lower semi-continuous–although quite a few properties and results can still hold for non-convex functions.

The proximal operator is an explicit form of \(\ref{eq:prox_problem}\) that has \(l(\beta) = \frac{1}{2} (\beta - z)^2\). It is a defining part of the intermediate steps within most proximal algorithms. Exact solutions to proximal operators exist for many \(\phi\). These are the elements that could exist in a table, many entries of which could be generated automatically, in analogy to symbolic integration.
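Written out, with a step parameter \(\gamma\), this is \[ \operatorname{prox}_{\gamma \phi}(z) = \operatorname*{argmin}_\beta \left\{ \frac{1}{2} (\beta - z)^2 + \gamma \phi(\beta) \right\} \;. \]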

The proximal operator relevant to our example, the soft-threshold operator, is implemented in Theano with something like the following:

beta_tt = tt.vector('beta', dtype=tt.config.floatX)
beta_tt.tag.test_value = np.r_[-10, -1, -0.2, 0, 0.2, 1,
                               10].astype(tt.config.floatX)

lambda_tt = tt.scalar('lambda', dtype=tt.config.floatX)
lambda_tt.tag.test_value = np.array(0.5).astype(tt.config.floatX)

def soft_threshold(beta_, lambda_):
    return tt.sgn(beta_) * tt.maximum(tt.abs_(beta_) - lambda_, 0)

In [3]: print(soft_threshold(beta_tt, lambda_tt).tag.test_value)
Out[3]: [-9.5 -0.5 -0.   0.   0.   0.5  9.5]
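As a sanity check, a direct NumPy transcription of the Theano expression above reproduces those values:

```python
import numpy as np

def soft_threshold_np(beta, lambda_):
    # sgn(beta) * max(|beta| - lambda, 0), component-wise
    return np.sign(beta) * np.maximum(np.abs(beta) - lambda_, 0.)

print(soft_threshold_np(np.r_[-10., -1., -0.2, 0., 0.2, 1., 10.], 0.5))
# same values as Out[3] above
```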

Operators like these can be composed with a gradient step to produce a proximal gradient algorithm.

Besides the proximal operator for \(\phi\), the steps in a proximal gradient algorithm are very straightforward and rely primarily on the gradient of \(l(\beta)\). When considering this quantity, a tangible benefit of symbolic computation becomes apparent; complicated gradients can be computed automatically and efficiently. With [backtracking] line search to handle unknown step sizes, the proximal gradient alone provides a surprisingly general means of sparse estimation.

Here is an implementation of a proximal gradient step:

from theano import function as tt_function

def prox_grad_step(logl, beta_tt, lambda_1_tt, gamma_prox_tt,
                   prox_func):
    # Gradient of the smooth part of the negative log-likelihood,
    # i.e. without the non-smooth (\ell_1) term:
    logl_grad = tt.grad(logl, wrt=beta_tt)
    logl_grad.name = "logl_grad"

    from theano.compile.nanguardmode import NanGuardMode
    tt_func_mode = NanGuardMode(nan_is_error=True,
                                inf_is_error=True)

    beta_var_tt = tt.vector(name='beta', dtype=beta_tt.dtype)
    beta_var_tt.tag.test_value = beta_tt.get_value()
    grad_step = tt_function([beta_var_tt], logl_grad,
                            givens={beta_tt: beta_var_tt},
                            mode=tt_func_mode)

    beta_quad_step = beta_tt - lambda_1_tt * logl_grad
    beta_quad_step.name = "beta_quad_step"

    beta_prox = prox_func(beta_quad_step, gamma_prox_tt * lambda_1_tt)
    beta_prox.name = "beta_prox"

    r_tt = y_tt - tt.dot(X_tt, beta_prox)
    r_tt.name = "r"

    prox_step = tt_function([],
                            [beta_prox, logl, logl_grad, r_tt],
                            updates=[(beta_tt, beta_prox)],
                            mode=tt_func_mode)

    return (prox_step, grad_step)
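To make these pieces concrete outside of Theano, here is a minimal NumPy-only proximal gradient (ISTA) loop for a small LASSO problem. The names, the synthetic data, and the fixed step size \(1/L\)–with \(L\) the largest eigenvalue of \(X^\top X\)–are illustrative choices and not part of the code above:

```python
import numpy as np

np.random.seed(42)

# A small synthetic sparse regression problem.
X = np.random.randn(70, 100)
beta_true = np.zeros(100)
beta_true[:5] = 5.
y = X.dot(beta_true) + np.random.randn(70)

gamma = 1.0  # \ell_1 penalty weight

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.)

# Fixed step 1/L, where L is the Lipschitz constant of the smooth
# gradient (largest eigenvalue of X^T X).
L = np.linalg.eigvalsh(X.T.dot(X)).max()
step = 1. / L

def objective(b):
    return 0.5 * np.sum((y - X.dot(b))**2) + gamma * np.sum(np.abs(b))

beta = np.zeros(100)
objs = [objective(beta)]
for _ in range(200):
    grad = -X.T.dot(y - X.dot(beta))          # gradient of the smooth term
    beta = soft_threshold(beta - step * grad,  # proximal (shrinkage) step
                          step * gamma)
    objs.append(objective(beta))

# With step size 1/L the objective is non-increasing.
print("objective: %.2f -> %.2f" % (objs[0], objs[-1]))
```

With [backtracking] line search in place of the eigenvalue computation, the same loop applies when \(L\) is unknown.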

The Symbolic Operations

In order to employ a lookup table, or to even identify a proximal problem and check that its conditions (e.g. convexity) are satisfied, we need to obtain the exact forms of each component: \(l\), \(\phi\) and \(\gamma\). We start with the determination of \(l\) and \(\phi\).

In some cases, we’re able to tease apart our \(l(\beta)\) and \(\phi(\beta)\) using the symbolic log likelihoods attached to PyMC3’s organizational designations of observed and unobserved random variables.

from theano import clone as tt_clone

logl = tt_clone(lasso_model.observed_RVs[0].logpt,
                {beta_rv: beta_tt})
logl.name = "logl"

Instead, let’s assume we’re extending find_MAP with some generality, so that distinguishing \(l\) and \(\phi\) in \(\ref{eq:prox_problem}\) using these designations isn’t reliable. This is necessary for cases in which a user specifies custom distributions or a potential function. In either case, to achieve our desired functionality we need to operate at a more symbolic level.

At this point, it is extremely worthwhile to browse the Theano documentation regarding graphs and their constituent objects.

The total log likelihood is a good place to start. Let’s look at the symbolic graph produced by ours.

from theano import pp as tt_pp
from theano import pprint as tt_pprint

In [4]: print(tt_pp(lasso_model.logpt))
Out[4]: (Sum{acc_dtype=float64}(Sum{acc_dtype=float64}(((-log(TensorConstant{2}))
- (|(\beta - TensorConstant{0})| / TensorConstant{1})))) +
Sum{acc_dtype=float64}(Sum{acc_dtype=float64}(switch(TensorConstant{1},
(((TensorConstant{-1.0} * ((y - (X \dot \beta)) ** TensorConstant{2}))
+ log(TensorConstant{0.159154943092})) / TensorConstant{2.0}),
TensorConstant{-inf}))))

The pretty printed Theano graph tells us, among other things, that we have the anticipated sum of \(\ell_2\) and \(\ell_1\) terms.

As with most graphs produced by symbolic algebra systems, we need to consider exactly how its operations are arranged, so that we can develop a means of matching general structures. The debug printout is better for this.

In [5]: tt.printing.debugprint(lasso_model.logpt)
Out[5]: Elemwise{add,no_inplace} [id A] ''
 |Sum{acc_dtype=float64} [id B] ''
 | |Sum{acc_dtype=float64} [id C] ''
 |   |Elemwise{sub,no_inplace} [id D] ''
 |     |DimShuffle{x} [id E] ''
 |     | |Elemwise{neg,no_inplace} [id F] ''
 |     |   |Elemwise{log,no_inplace} [id G] ''
 |     |     |TensorConstant{2} [id H]
 |     |Elemwise{true_div,no_inplace} [id I] ''
 |       |Elemwise{abs_,no_inplace} [id J] ''
 |       | |Elemwise{sub,no_inplace} [id K] ''
 |       |   |beta [id L]
 |       |   |DimShuffle{x} [id M] ''
 |       |     |TensorConstant{0} [id N]
 |       |DimShuffle{x} [id O] ''
 |         |TensorConstant{1} [id P]
 |Sum{acc_dtype=float64} [id Q] ''
   |Sum{acc_dtype=float64} [id R] ''
     |Elemwise{switch,no_inplace} [id S] ''
       |DimShuffle{x} [id T] ''
       | |TensorConstant{1} [id P]
       |Elemwise{true_div,no_inplace} [id U] ''
       | |Elemwise{add,no_inplace} [id V] ''
       | | |Elemwise{mul,no_inplace} [id W] ''
       | | | |DimShuffle{x} [id X] ''
       | | | | |TensorConstant{-1.0} [id Y]
       | | | |Elemwise{pow,no_inplace} [id Z] ''
       | | |   |Elemwise{sub,no_inplace} [id BA] ''
       | | |   | |y [id BB]
       | | |   | |dot [id BC] ''
       | | |   |   |X [id BD]
       | | |   |   |beta [id L]
       | | |   |DimShuffle{x} [id BE] ''
       | | |     |TensorConstant{2} [id H]
       | | |DimShuffle{x} [id BF] ''
       | |   |Elemwise{log,no_inplace} [id BG] ''
       | |     |TensorConstant{0.159154943092} [id BH]
       | |DimShuffle{x} [id BI] ''
       |   |TensorConstant{2.0} [id BJ]
       |DimShuffle{x} [id BK] ''
         |TensorConstant{-inf} [id BL]

We see that the top-most operator is an Elemwise that applies the scalar add operation. This is the \(+\) in \(l(\beta) + \phi(\beta)\). If we were to consider the inputs to this operator as our candidates for \(l\) and \(\phi\), then we might find this pair with the following:

In [6]: print(lasso_model.logpt.owner.inputs)
Out[6]: [Sum{acc_dtype=float64}.0, Sum{acc_dtype=float64}.0]

Using these two terms, we might simply search for an absolute value operator.

def get_abs_between(input_node):
    # Get all the operations in the sub-tree between our input and the
    # log likelihood output node.
    term_ops = list(tt.gof.graph.ops([input_node],
                                     [lasso_model.logpt]))

    # Is there an absolute value in there?
    return list(filter(lambda x: x.op is tt.abs_, term_ops))

abs_res = [(get_abs_between(in_), in_)
           for in_ in lasso_model.logpt.owner.inputs]

# If no absolute value appears *between* an input and the output, the
# abs_ must lie within that input's own sub-tree, making it our phi.
for r_ in abs_res:
    if len(r_[0]) == 0:
        phi = r_[1]
    else:
        logp = r_[1]

In [7]: tt.printing.debugprint(logp)
Out[7]: Sum{acc_dtype=float64} [id A] ''
 |Sum{acc_dtype=float64} [id B] ''
   |Elemwise{switch,no_inplace} [id C] ''
     |DimShuffle{x} [id D] ''
     | |TensorConstant{1} [id E]
     |Elemwise{true_div,no_inplace} [id F] ''
     | |Elemwise{add,no_inplace} [id G] ''
     | | |Elemwise{mul,no_inplace} [id H] ''
     | | | |DimShuffle{x} [id I] ''
     | | | | |TensorConstant{-1.0} [id J]
     | | | |Elemwise{pow,no_inplace} [id K] ''
     | | |   |Elemwise{sub,no_inplace} [id L] ''
     | | |   | |y [id M]
     | | |   | |dot [id N] ''
     | | |   |   |X [id O]
     | | |   |   |beta [id P]
     | | |   |DimShuffle{x} [id Q] ''
     | | |     |TensorConstant{2} [id R]
     | | |DimShuffle{x} [id S] ''
     | |   |Elemwise{log,no_inplace} [id T] ''
     | |     |TensorConstant{0.159154943092} [id U]
     | |DimShuffle{x} [id V] ''
     |   |TensorConstant{2.0} [id W]
     |DimShuffle{x} [id X] ''
       |TensorConstant{-inf} [id Y]
In [8]: tt.printing.debugprint(phi)
Out[8]: Sum{acc_dtype=float64} [id A] ''
 |Sum{acc_dtype=float64} [id B] ''
   |Elemwise{sub,no_inplace} [id C] ''
     |DimShuffle{x} [id D] ''
     | |Elemwise{neg,no_inplace} [id E] ''
     |   |Elemwise{log,no_inplace} [id F] ''
     |     |TensorConstant{2} [id G]
     |Elemwise{true_div,no_inplace} [id H] ''
       |Elemwise{abs_,no_inplace} [id I] ''
       | |Elemwise{sub,no_inplace} [id J] ''
       |   |beta [id K]
       |   |DimShuffle{x} [id L] ''
       |     |TensorConstant{0} [id M]
       |DimShuffle{x} [id N] ''
         |TensorConstant{1} [id O]

From these terms one can similarly attempt to determine multiplicative constants, or parts thereof (e.g. \(\gamma\)).

This approach is too limiting, however; we need something more robust. The logic above will fail on graphs that are constructed differently (e.g. producing equivalent, but different, representations via associativity) or when our naive use of theano.gof.graph.ops is compromised by certain types of intermediate operations (e.g. non-distributed, non-affine). These are only a few of the many weaknesses inherent to the naive approach, and sufficient coverage of all the necessary conditions–using the same approach–is likely to result in complicated, less approachable code.

What we need to identify the more general patterns suitable for our goal is mostly covered within the areas of graph unification and logic programming. Luckily, Theano has some basic unification capabilities that we’re able to deploy immediately.

As an example, we’ll jump right to the creation of a graph optimization. This is the context in which many of these symbolic operations are better performed, especially if we are required to alter the graph (or a copy thereof) during our search for terms. Consider the phi variable; the printouts show an unnecessary subtraction (with \(0\)). Clearly this part–and the entire graph–hasn’t been simplified. Standard simplifications already exist for these sorts of things, and most are performed in tandem.
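For comparison, a full computer algebra system applies such simplifications automatically on construction; e.g. the \(|\beta - 0| / 1\) sub-expression from the printouts above, in SymPy:

```python
import sympy as sp

beta = sp.Symbol('beta', real=True)

# The sub-expression |beta - 0| / 1 from the printed graph above;
# SymPy removes the spurious subtraction and division immediately.
expr = sp.Abs(beta - 0) / 1
print(expr)  # Abs(beta)
```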

Within the graph optimization framework we have the PatternSub local optimization. It provides basic unification. To demonstrate the kind of operations that could be used to pre-condition a graph and robustly determine a set of supported \(l\) and \(\phi\), we’ll make replacement patterns for multiplicative distribution across two forms of addition: sum and add. These searches and replacements can be applied to an objective function until it is in a sufficiently reduced form.

In the following, we’ll simply demonstrate the steps involved in programming the necessary logic.

test_a_tt = tt.as_tensor_variable(5, name='a')
test_b_tt = tt.as_tensor_variable(2, name='b')
test_c_tt = tt.as_tensor_variable(np.r_[1, 2], name='c')

test_exprs_tt = (test_a_tt * test_b_tt,)
test_exprs_tt += (test_a_tt * (test_b_tt + test_a_tt),)
test_exprs_tt += (test_a_tt * (test_c_tt + test_a_tt),)
test_exprs_tt += (test_a_tt * (test_c_tt + test_c_tt),)

mul_dist_pat_tt = (tt.gof.opt.PatternSub(
    (tt.mul, 'x', (tt.sum, 'y', 'z')),
    (tt.sum, (tt.mul, 'x', 'y'), (tt.mul, 'x', 'z'))),)
mul_dist_pat_tt += (tt.gof.opt.PatternSub(
    (tt.mul, 'x', (tt.add, 'y', 'z')),
    (tt.add, (tt.mul, 'x', 'y'), (tt.mul, 'x', 'z'))),)

The next step involves the repeated application of these operations, so that a non-trivial graph can be completely transformed/reduced in some way. We achieve this with the EquilibriumOptimizer class.

test_sub_eqz_opt_tt = tt.gof.opt.EquilibriumOptimizer(mul_dist_pat_tt,
                                                      max_use_ratio=10)

test_fgraph_tt = tt.gof.fg.FunctionGraph(
    tt.gof.graph.inputs(test_exprs_tt), test_exprs_tt)

In [9]: tt.printing.debugprint(test_fgraph_tt)
Out[9]: Elemwise{mul,no_inplace} [id A] ''   5
 |TensorConstant{5} [id B]
 |TensorConstant{2} [id C]
Elemwise{mul,no_inplace} [id D] ''   8
 |TensorConstant{5} [id B]
 |Elemwise{add,no_inplace} [id E] ''   4
   |TensorConstant{2} [id C]
   |TensorConstant{5} [id B]
Elemwise{mul,no_inplace} [id F] ''   9
 |DimShuffle{x} [id G] ''   3
 | |TensorConstant{5} [id B]
 |Elemwise{add,no_inplace} [id H] ''   7
   |TensorConstant{[1 2]} [id I]
   |DimShuffle{x} [id J] ''   2
     |TensorConstant{5} [id B]
Elemwise{mul,no_inplace} [id K] ''   6
 |DimShuffle{x} [id L] ''   1
 | |TensorConstant{5} [id B]
 |Elemwise{add,no_inplace} [id M] ''   0
   |TensorConstant{[1 2]} [id I]
   |TensorConstant{[1 2]} [id I]

Now, when we apply the optimization, the FunctionGraph should reflect the replacements:

test_fgraph_opt = test_sub_eqz_opt_tt.optimize(test_fgraph_tt)

In [10]: tt.printing.debugprint(test_fgraph_tt)
Out[10]: Elemwise{mul,no_inplace} [id A] ''   5
 |TensorConstant{5} [id B]
 |TensorConstant{2} [id C]
Elemwise{add,no_inplace} [id D] ''   10
 |Elemwise{mul,no_inplace} [id E] ''   4
 | |TensorConstant{5} [id B]
 | |TensorConstant{2} [id C]
 |Elemwise{mul,no_inplace} [id F] ''   3
   |TensorConstant{5} [id B]
   |TensorConstant{5} [id B]
Elemwise{add,no_inplace} [id G] ''   12
 |Elemwise{mul,no_inplace} [id H] ''   9
 | |DimShuffle{x} [id I] ''   2
 | | |TensorConstant{5} [id B]
 | |TensorConstant{[1 2]} [id J]
 |Elemwise{mul,no_inplace} [id K] ''   8
   |DimShuffle{x} [id I] ''   2
   |DimShuffle{x} [id L] ''   1
     |TensorConstant{5} [id B]
Elemwise{add,no_inplace} [id M] ''   11
 |Elemwise{mul,no_inplace} [id N] ''   7
 | |DimShuffle{x} [id O] ''   0
 | | |TensorConstant{5} [id B]
 | |TensorConstant{[1 2]} [id J]
 |Elemwise{mul,no_inplace} [id P] ''   6
   |DimShuffle{x} [id O] ''   0
   |TensorConstant{[1 2]} [id J]

There is much more to consider in the examples above. Nonetheless, standalone libraries, like LogPy and SymPy, can be adapted to Theano graphs to provide the functionality necessary to cover most of these considerations.

Finally, let’s briefly imagine how convexity could be determined symbolically. For differentiable terms, we could start with a simple second derivative test. Within Theano, a “second derivative” can be obtained using the hessian function, and within theano.sandbox.linalg are Optimizer hints for matrix positivity and other properties relevant to determining convexity using this simple idea.
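As a sketch of the idea–using SymPy here, since it exposes symbolic derivatives directly–the second derivative test for the quadratic term \(l(\beta) = \frac{1}{2}(\beta - z)^2\) of the proximal operator:

```python
import sympy as sp

beta, z = sp.symbols('beta z', real=True)

# The smooth term of the proximal operator.
l = (beta - z)**2 / 2

# Second derivative w.r.t. beta: constant and nonnegative everywhere,
# so l is convex in beta.
print(sp.diff(l, beta, 2))  # 1
```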

Other great examples of linear algebra themed optimizations are in theano.sandbox.linalg: for instance, no_transpose_symmetric. Some of these demonstrate exactly how straightforward it can be to add algebraic considerations.

Although our convexity testing idea is far too simple for many \(l\), the point we want to make is that the basic code necessary for simple tests like this may already be in place. Moreover, with the logic programming mentioned earlier in this section comes the possibility of implementing aspects of the convex function calculus, by which one can determine convexity for many more classes of functions.


We’ve sketched out the concepts and mechanisms with which one can develop a robust estimation platform–one that can be transparently guided by the more abstract mathematical frameworks from which new, efficient methods are produced.

Some key steps in the process will most likely require the integration of a symbolic algebra system, so that a much wider array of algebraic machinery can be leveraged to assess–say–convexity of the terms or to solve the proximal operators themselves. Connections between Theano, SymPy and LogPy have already been explored in (Rocklin 2013), as well as many other important aspects of the topics discussed here.

Additionally, more advanced proximal algorithms exist to improve upon the convergence and stability of the most basic proximal gradient given here. These algorithms often involve operator splitting, which requires careful consideration regarding the exact type of splitting and on which terms it is performed. Within this area are the familiar convex-conjugate approaches; these too could be approached by symbolic solvers, or simply addressed by [partially] generated tables.
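To recall the form such splittings take, the scaled-form ADMM iterations for \(\ref{eq:prox_problem}\)–introducing a splitting variable \(z\) with the constraint \(\beta = z\) and a step parameter \(\rho > 0\)–are \[\begin{aligned} \beta^{k+1} &= \operatorname{prox}_{l / \rho}\left(z^k - u^k\right) \;, \\ z^{k+1} &= \operatorname{prox}_{\gamma \phi / \rho}\left(\beta^{k+1} + u^k\right) \;, \\ u^{k+1} &= u^k + \beta^{k+1} - z^{k+1} \;. \end{aligned}\] Each step is itself a proximal operator, so the same symbolic machinery–term identification, table lookups, convexity checks–applies to each sub-problem.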

Overall, there appear to be many avenues to explore just within the space of proximal algorithms and modern symbolic systems. Not all of this work necessitates the inclusion of fully featured symbolic algebra systems; much can be done with the symbolic tools of Theano alone. Furthermore, there are specialized, lightweight logic programming systems–like LogPy–that can serve as a step before full symbolic algebra integration.

Besides the automation of proximal algorithms themselves, there are areas of application involving very large and complicated models or graphs–such as the ones arising in Deep Learning. How might we consider the operator splitting of ADMM within deeply layered or hierarchical models (Polson, Willard, and Heidari 2015)? At which levels and on which terms should the splitting be performed? Beyond simply trying to solve the potentially intractable mathematics arising from related questions, with the symbolic capabilities described here, we can at least begin to experiment with the questions.

Before closing, a very related–and at least as interesting–set of ideas is worth mentioning: the possibility of encoding more symbolic knowledge into probabilistic programming platforms like PyMC3. Using the same optimization mechanisms as the examples here, simple distributional relationships can be encoded. For instance, the convolution of normally distributed random variables:

mu_X = tt.vector('mu_X')
mu_X.tag.test_value = np.array([1.], dtype=tt.config.floatX)
sd_X = tt.vector('sd_X')
sd_X.tag.test_value = np.array([2.], dtype=tt.config.floatX)

mu_Y = tt.vector('mu_Y')
mu_Y.tag.test_value = np.array([1.], dtype=tt.config.floatX)
sd_Y = tt.vector('sd_Y')
sd_Y.tag.test_value = np.array([0.5], dtype=tt.config.floatX)

with pm.Model() as conv_model:
    X_rv = pm.Normal('X', mu_X, sd=sd_X, shape=(1,))
    Y_rv = pm.Normal('Y', mu_Y, sd=sd_Y, shape=(1,))
    Z_rv = X_rv + Y_rv

We create a Theano Op to handle the convolution.

class NormConvOp(tt.Op):
    __props__ = ()

    def make_node(self, *inputs):
        name_new = str.join('+', [getattr(in_, 'name', '')
                                  for in_ in inputs])
        mu_new = tt.add(*[in_.distribution.mu for in_ in inputs])
        sd_new = tt.sqrt(tt.add(*[in_.distribution.sd**2
                                  for in_ in inputs]))
        # Is this another place where automatically/Theano-managed
        # shapes are really needed?  For now, we hack it.
        conv_rv = pm.Normal(name_new, mu=mu_new, sd=sd_new,
                            shape=(1,))

        return tt.Apply(self, inputs, [conv_rv])

    def perform(self, node, inputs, output_storage):
        z = output_storage[0]
        z[0] = np.add(*inputs)

Now, all that’s needed is a PatternSub like before.

def is_normal_dist(x):
    return (hasattr(x, 'distribution') and
            isinstance(x.distribution, pm.Normal))

norm_conv_pat_tt = (tt.gof.opt.PatternSub(
    (tt.add,
     {'pattern': 'x',
      'constraint': lambda x: is_normal_dist(x)},
     {'pattern': 'y',
      'constraint': lambda x: is_normal_dist(x)}),
    (NormConvOp(), 'x', 'y')),)

norm_conv_opt_tt = tt.gof.opt.EquilibriumOptimizer(norm_conv_pat_tt,
                                                   max_use_ratio=10)

Z_fgraph_tt = tt.gof.fg.FunctionGraph([X_rv, Y_rv], [Z_rv])

# We lose the `FreeRV.distribution` attribute when cloning the graph
# with `theano.gof.graph.clone_get_equiv` in `FunctionGraph`, so this
# hackishly reattaches that information:
_ = [setattr(g_in, 'distribution', s_in.distribution)
     for s_in, g_in in zip([X_rv, Y_rv], Z_fgraph_tt.inputs)]
with conv_model:
    _ = norm_conv_opt_tt.optimize(Z_fgraph_tt)

norm_conv_var_dist = Z_fgraph_tt.outputs[0].distribution

The resulting graph:

In [11]: tt.printing.debugprint(Z_fgraph_tt)
Out[11]: NormConvOp [id A] 'X+Y'   0
 |X [id B]
 |Y [id C]

and the convolution’s parameters (for the test values):

In [12]: print(norm_conv_var_dist.mu.tag.test_value)
Out[12]: [ 2.]
In [13]: print(norm_conv_var_dist.sd.tag.test_value)
Out[13]: [ 2.06155281]
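These values agree with the standard convolution identity \(X + Y \sim \operatorname{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)\) evaluated at the test values:

```python
import numpy as np

# mu = 1 + 1 and sd = sqrt(2**2 + 0.5**2), per the test values above.
mu_Z = 1. + 1.
sd_Z = np.sqrt(2.**2 + 0.5**2)

print(mu_Z, sd_Z)  # 2.0 2.0615528128088303
```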

More sophisticated routines–like the example above–could implement parameter expansions, efficient re-parameterizations and equivalent scale-mixture forms in an effort to optimize a graph for sampling or point evaluation. Objectives for these optimizations could be straightforward and computationally based (e.g. reducing the number of operations in computations of the log likelihood and other quantities) or more statistically focused (e.g. highly efficient sampling, improved mixing). These ideas are most definitely not new–one example is given by (Mohasel Afshar 2016) for symbolic Gibbs sampling–but we hope the examples given here make the point that the tools are readily available and quite accessible.

We’ll end on a much more spacey consideration. Namely, that this is a context in which we can start experimenting rapidly with objectives over the space of estimation routines. This space is generated by–but not limited to–the variety of symbolic representations, re-parameterizations, etc., mentioned above. It does not necessarily require the complete estimation of a model at each step, nor even the numeric value of quantities like the gradient or Hessian. It may involve them, but not their evaluation; perhaps, instead, symbolic comparisons of competing gradients and Hessians arising from different representations. What we’re describing lies somewhere between the completely numeric assessments common today, and the entirely symbolic work found within the theorems and manipulations of the mathematics we use to derive methods.


Donoho, David L. 2006. “Compressed Sensing.” IEEE Transactions on Information Theory 52 (4): 1289–1306.

Mohasel Afshar, Hadi. 2016. “Probabilistic Inference in Piecewise Graphical Models.”

Park, Trevor, and George Casella. 2008. “The Bayesian Lasso.” Journal of the American Statistical Association 103 (482): 681–86.

Polson, Nicholas G., James G. Scott, and Brandon T. Willard. 2015. “Proximal Algorithms in Statistics and Machine Learning.” Statistical Science 30 (4): 559–81. doi:10.1214/15-STS530.

Polson, Nicholas G., Brandon T. Willard, and Massoud Heidari. 2015. “A Statistical Theory of Deep Learning via Proximal Splitting.” ArXiv Preprint ArXiv:1509.06061.

Rocklin, Matthew. 2013. “Mathematically Informed Linear Algebra Codes Through Term Rewriting.” PhD thesis, August.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” The Journal of Machine Learning Research 15 (1): 1929–58.

