I finished week 5 of Machine Learning last week – it was part two of the introduction to neural networks. Although it starts with a visual representation of how the neurons in the brain work, course leader Professor Andrew Ng didn’t suggest this is how the brain actually works. I’m not a neuroscientist, but it’s pretty hard to see how the maths and the iterative nature of the forward and backward propagation algorithms he explained can in *any* way be replicated in a biological brain. As I say in this blog, I think there is just too much maths in current machine learning algorithms for them to resemble biological intelligence. For neural networks, the similarity seems to just be that both real brains and neural networks have a large number of nodes/neurons and connections between them. I’m going to guess that human brains have *way* more nodes and connections – but that the processing of each biological neuron is simpler than that in a neural network. Also, the brain has real, physical nodes, rather than mathematical nodes that in practice are just complex matrix calculations.

However, what I found somewhat revealing was Ng’s honest admission at several stages during the week five videos, that he still doesn’t always have an intuitive sense of just what neural networks do, and why they work. He gave us the formulae for these algorithms and asked us to take them “on faith”. He has looked at the (by all accounts very tricky) mathematical proofs that lead to these formulae and says he could understand them. But he has also said on a few other occasions that he has just used this algorithm (and others) for solving machine learning problems without fully understanding how they work.

Is that a problem? Is it really any different to giving a driving license to someone who doesn’t know how an internal combustion engine works? Or a computer to someone who doesn’t know how to program one? I remember my first-year introductory computer science course, when the lecturer, Peter Andreae, explained to all 200 of us 18-year-old nerds how computers worked at an electronics level. It took him maybe 15 minutes – and I remember being astonished at how simple the basic principle was. I can’t remember much of it now of course, but I have been using (and trusting) computers ever since.

The folks who figured out the maths behind these algorithms presumably *do* understand how they work, and we probably can trust that the proofs are mathematically solid. Furthermore, the proof does appear to be in the pudding when these algorithms consistently deliver reliable and believable results. Maybe somewhere else out there in internet-land there is an explanation of how neural networks function that makes more intuitive sense and can make more sense of the vague intuition that I pieced together during the lecture series. Ng’s logical XOR gate example/intuition was semi-helpful, but if it doesn’t even work for him 100%, it’s unlikely that it will help his students either.

At the end of the day, my point here is that I find the fact that it is hard to get an intuitive handle on neural networks somewhat disconcerting. I don’t understand the minute details of a great many technical things in the world, but I *can* usually grasp the principles behind them after reading a well-written article or two. The exception to date would be quantum physics – but I don’t think *anyone* has an intuitive understanding of that Carollian Wonderland!

I am not going to write off neural networks just because I *personally* don’t have an intuitive understanding of how they work. It just leaves me somewhat uneasy trusting them fully until maybe I see with my own eyes that they are pretty reliable. I am happy to take things by “faith” in other areas of my life… Maybe I will have to do it here for a while also!

Blog post ends here – but I did have a few…

## Other Random Thoughts

Ng also said it’s possible to code these backward and forward propagation algorithms wrongly so that it looks like they are working when in actual fact they aren’t. Is this a problem? He said there is a gradient checking algorithm you can use for the first iteration of code to make sure things are heading in the right direction – but then you should switch it off because it is computationally inefficient compared to the backwards propagation algorithm. What if the problems in your code only show up after 20 iterations, say – or 100? Maybe mathematically, it can be shown that the errors will crop up on the first iteration – although my intuition tells me that the nature of the error is going to depend on the nature of the bad algorithm I write…
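To make the gradient checking idea a bit more concrete, here is a minimal sketch in Python. It is not the code from the course: the cost function, the "buggy" gradient and all the names are made up for illustration, and the hand-derived gradient just stands in for what backpropagation would compute in a real network.

```python
import math

def cost(theta):
    # Hypothetical toy cost with two parameters (stands in for the network's cost).
    return (theta[0] - 1.0) ** 2 + math.sin(theta[1]) + theta[0] * theta[1]

def analytic_gradient(theta):
    # Hand-derived gradient of `cost`; plays the role of backpropagation here.
    return [2.0 * (theta[0] - 1.0) + theta[1],
            math.cos(theta[1]) + theta[0]]

def buggy_gradient(theta):
    # Same gradient with a deliberate mistake: the theta[1] term is missing.
    return [2.0 * (theta[0] - 1.0),
            math.cos(theta[1]) + theta[0]]

def numerical_gradient(f, theta, eps=1e-4):
    # Two-sided finite difference: nudge one parameter at a time up and down.
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2.0 * eps))
    return grad

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

theta = [0.5, -1.2]
numeric = numerical_gradient(cost, theta)
good_error = max_abs_diff(numeric, analytic_gradient(theta))
bad_error = max_abs_diff(numeric, buggy_gradient(theta))
print("correct gradient differs by", good_error)
print("buggy gradient differs by", bad_error)
```

The numerical version needs two cost evaluations per parameter, which is why it is far too slow to leave switched on for a network with many parameters. Nothing in the check ties it to the first iteration, though: it could just as well be rerun at a few other theta values later on.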

More interestingly, Professor Ng tells us that backwards propagation isn’t guaranteed to find the global minimum for theta (it can settle into a local minimum instead), but also says it’s not a problem in practice. I guess you could run your training case a few times with a bunch of different randomised initial starting theta values, see what the different optimum thetas are, and then choose the best one. But what if your learning algorithm is super expensive to compute – or the best theta can only be reached from a very small set of initial random thetas that you don’t happen to use in your training…
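The "try a bunch of random starting thetas" idea can be sketched roughly like this. The cost function is a made-up one-parameter curve with two dips of different depth, not a real neural network cost, and the learning rate, step count and seed are arbitrary choices for the illustration.

```python
import random

def cost(theta):
    # Hypothetical non-convex cost: a shallow dip on the right, a deeper one on the left.
    return theta ** 4 - 3.0 * theta ** 2 + theta

def gradient(theta):
    # Derivative of `cost` with respect to theta.
    return 4.0 * theta ** 3 - 6.0 * theta + 1.0

def gradient_descent(theta, learning_rate=0.01, steps=2000):
    # Plain gradient descent from a given starting point.
    for _ in range(steps):
        theta -= learning_rate * gradient(theta)
    return theta

rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
results = [gradient_descent(s) for s in starts]

# Keep whichever run ended at the lowest cost.
best_theta = min(results, key=cost)
for start, end in zip(starts, results):
    print(f"start {start:+.3f} -> end {end:+.3f}, cost {cost(end):+.3f}")
print("best theta:", best_theta)
```

Each run simply rolls downhill into whichever dip it started closest to, so the restarts only help if at least one of them lands in the right basin. That is exactly the worry raised above: the sketch does nothing to address a very narrow basin or a very expensive cost function.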

Most likely I will come to terms with both of these issues as I actually start using neural networks to try and solve real examples.

For me, neural networks have aspects that feel “intuitive” and aspects that don’t feel so intuitive. If I were to give a brief synopsis of neural networks to someone who roughly understands calculus, it would be as follows:

– For several hundred years now, people have had a reasonable grasp of what a “function” is, and that if you compose a function using operators like plus and times and exponentiation, there are algorithms we call “calculus” that allow you to compute another function called the “derivative” that will show you the slope of the first function at a given point for any of the variables you’re interested in.

– Once you know the slope of the curve, it becomes a powerful tool, because you might want to figure out what values to give the variables so that the function computes the value 0. Or you might want to figure out what values to give the variables so that the function computes its minimum possible value. If you know the slope, you have a clue about which direction to try changing your variables. You can do that iteratively to see if you can find a local minimum (Newton’s method, gradient descent, etc.).

– So what happens if we create a really really deeply nested mathematical function with lots of variables? Can we still calculate derivatives? Yes, we can. Ok, great.

– Now, what if we don’t want to directly solve an equation for its minimum point but instead we want to find the best equation? How might we go about doing that?

– Well, as it turns out, we can create a kind of “second order equation” by starting with an equation (x + y + z) and then introducing new variables where there would normally be constants. When we substitute our training data into the original variables, they essentially turn into constants, and so we’ve done a bizarre kind of operation where we’ve turned our variables into constants and our constants into variables, but at the end of the day, we’re still left with an equation involving variables and constants.

– And hey look, we were just saying that we can still calculate derivatives, even if our equation has lots of variables, and even if the equation is deeply nested. Given that, we can just crank the calculus iteratively to see if we can find values for our variables that, when substituted, give us a function that optimally fits our criteria of minimizing the difference between our desired outputs and our actual outputs.

– In conclusion, what we’ve done is:

– Started by wishing there was a way that we could tune the constants of a function to make our function behave in a certain way.

– Discovered that we can turn the constants of our functions into variables and our variables into constants.

– Used the calculus to tune our function to behave the way we want.

If you ask me, that’s actually pretty “intuitive”, almost simple.
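Here is a tiny sketch of that "swap the constants and the variables" trick, using a straight line rather than a deeply nested function. The training data, the names `a` and `b`, the learning rate and the step count are all assumptions made up for the example.

```python
# Training data, assumed for illustration to come from y = 3x + 2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [3.0 * x + 2.0 for x in xs]

def loss(a, b):
    # Mean squared difference between the outputs we want and the outputs we get.
    # The xs and ys are now fixed numbers; a and b are the variables.
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def loss_gradient(a, b):
    # Slope of the loss with respect to a and to b (ordinary calculus).
    n = len(xs)
    da = sum(2.0 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (a * x + b - y) for x, y in zip(xs, ys)) / n
    return da, db

a, b = 0.0, 0.0          # start with arbitrary "constants"
learning_rate = 0.05
for _ in range(5000):
    da, db = loss_gradient(a, b)
    a -= learning_rate * da   # step downhill along each slope
    b -= learning_rate * db

print("a =", a, "b =", b, "loss =", loss(a, b))
```

After the loop, `a` and `b` should sit very close to the 3 and 2 that generated the data. A neural network follows the same recipe, just with a much more deeply nested function and many more constants to tune.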

What neural networks add to the above is simply that if we create a deeply nested equation, it gives us a more powerful mechanism that can model quite rich functions.

Not sure if you’ve covered this aspect yet, but it can also be shown that if you create a deeply nested network of purely linear functions, it is no more powerful than a flat / non-nested linear function. Thus, at certain points in your network, you need to apply a non-linear function, and voila, that’s one of the critical tricks that gives your overall function dramatically more modeling ability.
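A quick way to convince yourself of the "nested linear functions collapse" claim is to try it with two small weight matrices. This is only a sketch: the matrices and the input are arbitrary made-up numbers, bias terms are left out to keep it short, and the sigmoid is just one common choice of non-linear function.

```python
import math

def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def matmul(a, b):
    # Multiply two matrices given as lists of rows.
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

W1 = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]  # hypothetical layer: 2 inputs -> 3 hidden
W2 = [[2.0, 0.0, -1.0]]                      # hypothetical layer: 3 hidden -> 1 output
x = [0.7, -0.3]

# Two nested linear layers give the same answer as one flat layer with weights W2 * W1.
nested = matvec(W2, matvec(W1, x))[0]
flat = matvec(matmul(W2, W1), x)[0]

def with_sigmoid(v):
    # The same two layers with a non-linear function applied in between.
    hidden = [sigmoid(h) for h in matvec(W1, v)]
    return matvec(W2, hidden)[0]

doubled = [2.0 * xi for xi in x]
print("nested linear:", nested, "flat linear:", flat)
print("sigmoid net at 2x:", with_sigmoid(doubled), "vs twice its value at x:", 2.0 * with_sigmoid(x))
```

The two linear answers agree up to floating point rounding, because applying `W1` and then `W2` is the same as applying the single matrix `W2 * W1`. With the sigmoid in between, doubling the input no longer doubles the output, so no single matrix can stand in for the network.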

So here’s my final summary of a neural network, and machine learning in general: Functions are quite a useful tool for modeling phenomena, and if you have a powerful way to “find a function”, that’s quite something. Neural networks are just that: a mechanism for taking a bunch of data and finding a function that can reasonably map your inputs to your outputs.

On the topic of how much we should trust these functions: Not sure if you’ve been following the news over the years, but it is well known that neural networks are very easy to “game”. You can take a photo of a gorilla, and yes, the network may say it’s 95% likely to be a gorilla, but if you subtly change the colors of just the right pixels, a human will still be 99.99999% sure it’s a gorilla, but the neural network will suddenly be convinced that it’s looking at a banana. uhhh… which gives many people the same concern that you express in your blog post: What’s going on? Why is the function so easy to game? Why isn’t it more robust?
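One rough intuition for why this is possible can be shown with a toy linear "classifier" instead of a real image network. Everything here is made up for illustration (the 100 "pixels", the weights, the nudge size), and real attacks on deep networks are more involved. The point is only that many tiny nudges, each in the direction the model is most sensitive to, add up to a big change in the score.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    return 1.0 if v > 0 else -1.0

n_pixels = 100
# Hypothetical trained weights and a hypothetical image with pixel values in [0, 1].
weights = [0.5 if i % 2 == 0 else -0.5 for i in range(n_pixels)]
image = [0.52 if i % 2 == 0 else 0.48 for i in range(n_pixels)]

def probability(pixels):
    # The model's confidence that the image shows the "right" class.
    score = sum(w * p for w, p in zip(weights, pixels))
    return sigmoid(score)

# Nudge every pixel by a small amount against the sign of its weight.
epsilon = 0.05
nudged = [p - epsilon * sign(w) for w, p in zip(weights, image)]

clean_prob = probability(image)
nudged_prob = probability(nudged)
largest_change = max(abs(a - b) for a, b in zip(image, nudged))

print("confidence on original image:", clean_prob)
print("confidence on nudged image:", nudged_prob)
print("largest change to any single pixel:", largest_change)
```

No pixel moves by more than 0.05 on a 0-to-1 scale, yet the model goes from fairly confident to fairly confident in the opposite direction, because the hundred small nudges all push the score the same way.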

People have studied this, and they’ve learned a fair bit. For example, if you train a neural network to classify photos of {swimming, skiing, golfing}, it will generally learn that the images that have a lot of white in them are pictures of skiing. You can then give it a picture of someone golfing in the snow, and it will dutifully claim that it is quite confident that the person is skiing.

I think we can thus conclude that machine learning will basically be “lazy” in its model building if it can get away with it, just like a 10-year-old will be lazy on a Saturday if they can get away with it. The result is that you can get very bad / brittle models with machine learning.

How to overcome that challenge is an interesting thought space. One recent effort was to build a giant data set that doesn’t allow the neural network to cheat as much. It’s full of pictures of things like people golfing in the snow, people playing violin in their bathing suit while standing in a lake, etc, etc. This forces the network to adapt beyond the lazy “white means skiing” kind of model.

All that said, you are right to conclude that machine learned models in 2018 are often very brittle, and so think twice before you trust your life to one.