So neural networks and support vector machines are essentially equivalent [1]. Both approaches effectively project the input into a high-dimensional feature space and then draw a hyperplane between two different point sets. How clever this is depends on how the algorithm constructs that feature space. The article's comments could be interpreted as: deep neural networks allow feature spaces that would otherwise require many more neurons.
But here's the thing: being separable by a plane in some feature space is simply a convenient property that many patterns have. It's similar to data you can draw a line through to extrapolate further values. However, unlike that approximately linear data, you can't say *why* your complex data is separated by a particular plane in the feature space, and the reason is that the knowledge is more or less trapped in the neural network or SVM model - it's not going to be processed further, except by using that model for that particular pattern.
This comment is very confusing. First of all, the linked paper doesn't state what you claim it states. The authors show equivalence between two specific frameworks of neural networks: SVM-NN and Regularized-NN, and not equivalence between SVM and NN. Generally, SVM and NN are equivalent only in the sense that all discriminative models are equivalent. The kernel trick in SVM requires your embedding to have an "easily" calculable inner product. I'm not an expert, but I think this places strong constraints on the embeddings you can use.
Second of all, SVM does not create any feature space (i.e., embeddings). It just finds a good separator with a maximal margin. Deep NNs, on the other hand, do create features in their hidden layers.
Anyway, even ignoring these issues, I'm not sure I understood your main point.
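To make the "easily calculable inner product" point above concrete, here is a tiny numerical check (my own illustration, not from the linked paper): for the degree-2 polynomial kernel, the explicit feature map contains all pairwise products of the coordinates, yet the kernel computes the same inner product without ever building that map.

```python
import numpy as np

def phi(x):
    # Explicit feature map: all pairwise products x_i * x_j (d^2 features).
    return np.outer(x, x).ravel()

def poly2_kernel(x, y):
    # Same inner product, computed in O(d) instead of O(d^2): (x . y)^2.
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
assert np.isclose(np.dot(phi(x), phi(y)), poly2_kernel(x, y))
```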
There are many methods. The first thing to tackle is getting your data into the right format. Plotting software like Matplotlib can be really helpful when you're trying to debug.
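For example (a minimal sketch with made-up numbers), plotting the per-epoch training cost is a quick way to see whether learning has stalled:

```python
import matplotlib.pyplot as plt

costs = [2.30, 1.10, 0.75, 0.60, 0.58, 0.57]  # hypothetical cost after each epoch
plt.plot(range(1, len(costs) + 1), costs, marker="o")
plt.xlabel("epoch")
plt.ylabel("training cost")
plt.show()
```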
What happens when you, instead of training the entire network at once, train for a while with a single layer, then add a second layer and train with both layers, then add a third layer and train with all three layers, and so on?
Good intuition! What you are describing sounds like a technique called pretraining (in particular, greedy, layer-wise pretraining). Five years ago, pretraining was how everyone attacked this problem, although they usually did a different kind of pretraining (basically, train a different kind of model, then perform surgery on it, cutting it apart and using some of its layers as the earlier layers of our model).
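In code, the simple version you describe (not the surgery variant) might look roughly like this sketch in PyTorch; the layer sizes, epoch counts, and data loader are all assumptions of mine:

```python
import torch
import torch.nn as nn

def train_for_a_while(model, loader, epochs=5, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

def grow_and_train(loader, in_dim=784, hidden=100, out_dim=10, max_depth=3):
    hidden_layers = []  # layers trained so far are kept and trained further
    for depth in range(max_depth):
        prev_dim = in_dim if depth == 0 else hidden
        hidden_layers += [nn.Linear(prev_dim, hidden), nn.Sigmoid()]
        # Attach a fresh output layer and train the whole (now deeper) stack.
        model = nn.Sequential(*hidden_layers, nn.Linear(hidden, out_dim))
        train_for_a_while(model, loader)
    return model
```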
More recently, people, especially the younger generation of deep learning researchers, tend to be skeptical of how much pretraining helps.
Advocates for pretraining now tend to argue that it helps you find better local minima, instead of focusing on it helping the vanishing gradient problem. For example, see this paper: http://www.jmlr.org/papers/volume11/erhan10a/erhan10a.pdf .
As I'm sure Michael will address in coming chapters, there are a bunch of tricks you can use that make training deep neural networks a lot easier. People now tend to prefer to just use those and a lot of computing power, rather than mess around with pretraining.
* The biggest one is probably to just train for a very long time. Competitive neural nets for many tasks are trained on GPUs, clusters, or GPU clusters, for days or weeks.
* Using convolutional layers really helps. Roughly, convolutional layers have multiple copies of the same neuron applied to different inputs. This means they have much less to learn, and it also kind of concentrates the gradient on just a few neurons. Because of this, if the first few layers of a network are convolutional, they are much easier to train. For a long time, these were the only kind of remotely deep neural networks we could train.
* So far, Michael's book has only talked about sigmoid neurons (I think). But you can use neurons with other activation functions. They still multiply their inputs by weights and add a bias, but instead of applying a sigmoid they apply a different function. Using a different kind of neuron, the ReLU neuron, tends to help a lot. Unlike sigmoid neurons, which tend to have a very small derivative, ReLU neurons have a derivative of 1 a lot of the time (see the sketch after this list). I've had mixed experiences, but most people swear by them.
* Using higher learning rates for early layers may be helpful.
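Here's a quick numerical illustration of the ReLU point above (my own numbers, not from the book): the sigmoid's derivative never exceeds 0.25 and shrinks rapidly away from zero, while the ReLU's derivative is exactly 1 for any positive input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

for z in [-4.0, -1.0, 0.5, 4.0]:
    print(f"z={z:+.1f}  sigmoid'={sigmoid_prime(z):.4f}  relu'={relu_prime(z):.1f}")
```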
Reduce the learning rate exponentially and increase the momentum linearly over the course of training, e.g. learning rate from .5 to .0001 and momentum from .7 to .995. I've seen variations on this, like adjusting along a sigmoid curve.
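A rough sketch of that schedule (the endpoint numbers follow the comment above; the interpolation itself is just one reasonable choice):

```python
def schedule(epoch, n_epochs, lr_start=0.5, lr_end=0.0001,
             mom_start=0.7, mom_end=0.995):
    t = epoch / max(n_epochs - 1, 1)                   # progress in [0, 1]
    lr = lr_start * (lr_end / lr_start) ** t           # exponential decay
    momentum = mom_start + (mom_end - mom_start) * t   # linear increase
    return lr, momentum

for epoch in (0, 10, 19):
    print(epoch, schedule(epoch, n_epochs=20))
```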
Dropout may or may not help, and adjusting the dropout rate (the percentage of activations that are discarded) may or may not help either.
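For reference, a bare-bones (inverted) dropout sketch, just to pin down what "percentage of activations that are discarded" means; `p` here is the drop probability:

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    if not training or p == 0.0:
        return activations
    keep = (np.random.rand(*activations.shape) >= p).astype(activations.dtype)
    return activations * keep / (1.0 - p)  # rescale so the expected activation is unchanged
```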
Mini-batch size can make a difference. Somewhere between 2 and 200?
[1] http://www.scm.keele.ac.uk/staff/p_andras/PAnpl2002.pdf