I didn't even realize DIGITS was a separate project from Nvidia. It appears to be a visualization interface for Caffe, since it loads Caffe prototxt files and generates visualizations from them. I don't think it uses a different version of cuDNN. Even the proposed solution to the running-variance problem was originally implemented in Caffe's CPU implementation.
I think the issue is not batch norm itself, since it works for some data but is unstable for others. I implemented variance clipping (rather easy with JCuda, actually), and it appears to work after some trial and error with the parameters.
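For illustration, here is a minimal Java sketch of the variance-clipping idea (on the CPU, not the JCuda kernel): clamp each per-channel batch variance into a chosen range before normalization, so near-zero or exploding variances don't blow up the running statistics. The class name, method name, and threshold values are all hypothetical, not from the original implementation.

```java
import java.util.Arrays;

public class VarianceClip {
    // Clamp each per-channel variance into [minVar, maxVar].
    // The thresholds are illustrative; in practice they are found
    // by trial and error, as noted above.
    static void clipVariance(float[] variance, float minVar, float maxVar) {
        for (int i = 0; i < variance.length; i++) {
            variance[i] = Math.max(minVar, Math.min(maxVar, variance[i]));
        }
    }

    public static void main(String[] args) {
        // A near-zero variance (1e-8) would make 1/sqrt(var + eps) huge;
        // clipping pulls it up to the floor value.
        float[] var = {1e-8f, 0.5f, 3.0f};
        clipVariance(var, 1e-5f, 2.0f);
        System.out.println(Arrays.toString(var));
    }
}
```

The same elementwise clamp is what a JCuda kernel would do per channel on the GPU.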
Torch appears to use an exponential moving average rather than the cumulative moving average suggested by the batch norm paper.
Using an exponential moving average for the running mean/variance makes less theoretical sense, but I just tested it and it works as well.
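To make the difference concrete, here is a sketch of the two update rules for a running statistic. The cumulative average weights every batch equally (as in the batch norm paper), while the exponential average (Torch-style) weights recent batches more heavily. The momentum value 0.9 is an assumption for illustration, not Torch's actual default.

```java
public class RunningStats {
    // Cumulative moving average: every batch contributes equally.
    // n is the number of batches seen so far (1-based).
    static double cma(double prev, double x, int n) {
        return prev + (x - prev) / n;
    }

    // Exponential moving average: older batches decay geometrically.
    static double ema(double prev, double x, double momentum) {
        return momentum * prev + (1.0 - momentum) * x;
    }

    public static void main(String[] args) {
        double[] batchMeans = {1.0, 2.0, 3.0, 4.0};

        double c = 0.0;
        for (int n = 1; n <= batchMeans.length; n++) {
            c = cma(c, batchMeans[n - 1], n);
        }

        double e = batchMeans[0];
        for (int i = 1; i < batchMeans.length; i++) {
            e = ema(e, batchMeans[i], 0.9);
        }

        // CMA converges to the overall mean (2.5 here); EMA stays close
        // to the earliest values when momentum is high.
        System.out.println(c + " " + e);
    }
}
```

With stationary batch statistics both converge to the same value, which is presumably why the exponential variant works in practice despite the weaker theoretical justification.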
Anyway, I won't dwell on this any more and will consider it a solved problem.