Feed Forward: Making 60,000 Educated Predictions

This article is the fourth in our series on deep learning. Each post builds on the previous ones and refers back to them, so you should read the series in order, starting from the beginning. Our previous articles:

  • Deep Learning Overview
  • Deep Learning Overview – 2
  • Deep Learning – Building a Working Brain with 28 Lines of Code

If you have read the first three articles, you are ready to continue. So let’s start.

Lines 57-59

This is the portion of our network that predicts. We will consider the same concept from three different perspectives, as it is an exciting part of the deep learning process:

1 – First, we will tell the story of feed forward in fairy-tale language,

2 – Second, we will look at some illustrations of feed forward, and

3 – Then, lifting the engine cover, we will study the matrix products that drive the feed-forward engine.

The Fortress and the Meaning of Life: The Feed-Forward Network

Imagine yourself as a neural network. You are a neural network with a valid driver’s license, a taste for fast cars, and a love of mystical spiritual journeys. You can’t wait to discover the meaning of life. Then, miracle of miracles, you learn that if you travel to a certain castle, a mystical oracle is waiting there to explain the meaning of life to you. You are overjoyed!

Needless to say, you are highly motivated to find the oracle’s stronghold. After all, the oracle represents the truth (the ground truth y), and our mystical question is “Did this customer buy My Cat Mis!?” How mysterious! (In other words, if your guess matches the oracle’s truth, you are at the castle: the l2 error you know from our previous posts is zero, the yellow arrow has zero height, the white ball has reached the bottom of the bowl. These are all the same thing.)

Unfortunately, finding the oracle’s castle takes patience and persistence, because you have already set out for the castle thousands of times and gotten lost.

(Hint: thousands of trips = trials. “Getting lost” means your prediction l2 is still wrong, which keeps you away from the stronghold of truth, y. Let me remind you that we are oversimplifying here: in practice there is no guarantee you will ever reach the oracle. You can only be sure of reaching the global minimum if your error function is convex; otherwise you may settle into a local minimum, and in practice the error typically bounces up and down along the way. For educational purposes, though, we’ll stick with our simplified fairy tale. After all, we all love happy endings.)

But there’s great news: you know that with every outing you’re getting a little closer to the oracle (the y vector, in all its dizzying cuteness). The bad news is that every time you fail to reach the castle, POOF! you wake up at home the next morning, that is, back at layer 0, the input features l0 holding the answers to our 3 survey questions, and you have to start over (a new trial). It looks a bit like this:

Fortunately, this story has a happy ending. But to reach the oracle’s castle and experience enlightenment, you may need another 58,000 or so trials to correct your course. Don’t worry, it will be worth it.

Now let’s look at one of the trips from your house X to the castle y. Every trip you make is one feed forward, a pass through lines 57-59. Each day you arrive somewhere new, only to discover that it is not the castle. Alas! Of course, you want to figure out how to get a little closer to your guru next time. We will see how in the steps below, in the upcoming section on learning from your mistakes.

So, the oracle-and-castle story above was an analogy for the feed-forward process. Next we’ll walk through a simplified example of feed forward.

Stunningly Beautiful Diagrams of Feed Forward

We will consider just one of the 16 weights in the network: the one at the top of syn0’s 12 values, connecting the top neuron of l0 to the top neuron of l1. For simplicity’s sake, let’s call it syn0,1 (technically, the correct name for “row 1, column 1 of matrix syn0” would be syn0(1,1)). Let’s see how it looks:

Why are the circles representing the l1 and l2 neurons split in half? The left half of each circle (marked “LH” for “left half”) is the value fed into the sigmoid function, and the right half is the sigmoid’s output: l1 or l2 itself. Remember that the sigmoid takes the weighted sum coming from the previous layer and squashes it to a value between 0 and 1. This is the following piece of code (defined on line 10 but not called until line 58):


return 1/(1+np.exp(-x))
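
For reference, here is a minimal sketch of what the full nonlin() function defined on line 10 might look like, including the deriv flag mentioned later in this article (your own version of the code may differ slightly):

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        # x is assumed to already be a sigmoid output, so x*(1-x) is the
        # slope of the sigmoid at that point (used later, in backpropagation)
        return x * (1 - x)
    # otherwise squash x to a value between 0 and 1
    return 1 / (1 + np.exp(-x))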

OK, let’s use one of our feed-forward training examples: the first row of l0, i.e. the first customer’s answers to the 3 survey questions: [1,0,1]. We start by multiplying l0 by the current value of syn0. Imagine that our syn0 matrix has already gone through some training trials since we initialized it (in the “Placing Random Numbers” section of our previous post). It now looks like this:

syn0:

[ 3.66 -2.88  3.26 -1.53]

[-4.84  3.54  2.52 -2.55]

[ 0.16 -0.66 -2.82  1.87]

Now you might be thinking, “Why are the values of syn0 so different from the syn0 we created in the previous article?” I’m glad you asked. The earlier matrices held realistic starting values, random numbers freshly generated by the computer. The matrices here and below are not initial values: they have already gone through some training trials, and their values have changed considerably as they were updated during learning.

Beautiful. Let’s multiply the “1” of l0 by the “3.66” of syn0 and see where this nonsense takes us:

Here is what feed forward looks like in pseudocode. You can follow the left-to-right flow in the diagram above. (In the diagram, “LH” stands for the left half of the circle representing a neuron in a given layer; “l1_LH” means “the left half of the circle representing l1”, i.e. the value before it passes through the nonlin() function.)

l1_LH = l0 x syn0, so l1_LH = 1 x 3.66 -> (don’t forget to add the products of the other l0 values times their syn0 weights; for simplicity’s sake, trust me that they add up to 0.16) -> l1_LH = 1 x 3.66 + 0.16 = 3.82

l1 = nonlin(l1_LH) = nonlin(3.82) -> nonlin() = 1/(1+np.exp(-x)) = 1/(1+2.718^-3.82) = 0.98

l2_LH = l1 x syn1 = 0.98 x 12.21 = 11.97 (again, add the results of the other l1 x syn1 products; trust me, they add up to -11.97) -> l2_LH = 11.97 + (-11.97) = 0.00

l2 = nonlin(l2_LH) = 1/(1+np.exp(-x)) = 1/(1+2.718^-0.00) = 0.5

l2 = 0.5 -> l2_error = y - l2 = 1 - 0.5 = 0.5
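
As a quick sanity check, here is a small Python sketch of that single-path walkthrough. The contributions of the other neurons (the 0.16 and the -11.97) are taken straight from the text rather than computed, so this is only an illustration of the arithmetic above:

import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))   # squash to a value between 0 and 1

l0_1, syn0_11 = 1, 3.66        # first survey answer and its top weight
other_l0_terms = 0.16          # sum of the other l0 x syn0 products (given above)
l1_LH = l0_1 * syn0_11 + other_l0_terms     # 3.82
l1 = nonlin(l1_LH)                          # ~0.98

syn1_1 = 12.21
other_l1_terms = -11.97        # sum of the other l1 x syn1 products (given above)
l2_LH = l1 * syn1_1 + other_l1_terms        # ~0.00
l2 = nonlin(l2_LH)                          # ~0.5

y = 1
l2_error = y - l2                           # ~0.5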

OK then. Here’s the basic math behind the pseudocode above:

Let’s Walk Slowly Through the Mathematics of Feed Forward

l0 x syn0 = l1_LH, so 1 x 3.66 = 3.66 in our example, but don’t forget to add the other two products of l0 values times their corresponding syn0 weights. In our example, l0,2 x syn0,2 = 0 x something = 0, so it doesn’t matter. But l0,3 x syn0,3 does make a difference, because l0,3 = 1 and we know from the matrix in the last section that syn0,3 is 0.16, so l0,3 x syn0,3 = 1 x 0.16 = 0.16. Adding the products of l0,1 x syn0,1 and l0,3 x syn0,3 gives 3.66 + 0.16 = 3.82, and 3.82 is l1_LH. Next we need to run l1_LH through our nonlin() function to get a value between 0 and 1. nonlin(l1_LH) uses the code return 1/(1+np.exp(-x)). In our example, this is 1/(1+2.718^-3.82) = 0.98, so l1 (the right half of node l1) is 0.98.

What happened in that equation, 1/(1+np.exp(-x)) = 1/(1+2.718^-3.82) = 0.98? The computer used fancy code, return 1/(1+np.exp(-x)), to do something we could do with our eyes: it found the y value on the sigmoid curve corresponding to x = 3.82 in the figure below:

Note that the corresponding y-value on the blue curve for the point 3.82 on the x-axis is approximately 0.98. Our code turned 3.82 into a statistical probability between 0 and 1. It’s helpful to visualize this graphically so you can tell it’s not an incomprehensible hocus pocus. The computer does what we do: it uses math, not its eyes, to figure out what 3.82 on the x-axis corresponds to on the y-axis, nothing more.

Let’s repeat: nonlin(), with its default argument, is the part of the sigmoid code that converts any number to a value between 0 and 1; that’s the return 1/(1+np.exp(-x)) branch, and it does not compute the slope. In backpropagation, however, we will use the other branch of the function, the one that returns the slope, return x*(1-x), because lines 57 and 71 specifically ask the sigmoid for the slope with (deriv==True).

Let’s stop and go over it once more. We multiply our l1 value by our syn1,1 value: l1 x syn1 = l2_LH, in our example 0.98 x 12.21 = 11.97. Note, however, that we must add to 11.97 the products of all the other l1 neurons times their corresponding syn1 weights. For simplicity’s sake, trust me that those add up to -11.97 (using the same matrix). The result, 11.97 + (-11.97) = 0.00, is l2_LH. We then run l2_LH through our awesome nonlin() function: 1/(1+2.718^-0.00) = 0.5, and here we get l2, our first guess at y, the truth! Congratulations! You’ve completed your first feed forward!


Now for clarity, let’s combine all our variables in one place:

l0,1 = 1

syn0,1 = 3.66

l1 = 0.98

syn1,1 = 12.21

l2_LH = 0.00

l2 = ~0.5

y = 1 (the first customer’s answer to the fourth survey question, “Did you buy My Cat Mis!?”)

l2_error = y - l2 = 1 - 0.5 = 0.5

Okay, now let’s look at the matrix multiplication that created all this.

First, on line 58, we multiply the 4×3 matrix l0 by the 3×4 matrix syn0 to create l1, a 4×4 matrix (the hidden layer):

Now we pass it through the “nonlin()” function on line 58, which is a fancy mathematical expression that compresses all values ​​between 0 and 1 as we explained above:

1/(1 + 2.718281^-x)

This creates l1, the hidden layer of our network:

l1:

[0.98 0.03 0.61 0.58]

[0.01 0.95 0.43 0.34]

[0.54 0.34 0.06 0.87]

[0.27 0.50 0.95 0.10]

If the look of matrix multiplication scares you, fear not. We’re going to start simple and break the product into smaller parts so you can see how it works. Take one simple example from our input, row 1 (the first customer’s survey answers): [1,0,1], a 1×3 matrix. We multiply it by syn0, the 3×4 matrix, and the resulting row of l1 is a 1×4 matrix. The process can be visualized like this:

(multiply row 1 of l0 by column 1 of syn0, then multiply row 1 by column 2 etc.)

row 1 of l0:       column 1 of syn0:

[1 0 1]      x     [ 3.66]          [ 3.82  -3.54   0.44   0.34 ]
                   [-4.84]     =    [ (row 2 of l0 x columns 1, 2, 3 and 4 of syn0…) ]
                   [ 0.16]          [ etc.… ]

Then pass the above 4×4 result through “nonlin()” to get the l1 values.

l1:

[0.98 0.03 0.61 0.58]

[0.01 0.95 0.43 0.34]

[0.54 0.34 0.06 0.87]

[0.27 0.50 0.95 0.10]

Note that on line 58 we pass the dot product through the sigmoid function, so that every value of l1 lands between 0 and 1:

l1=nonlin(np.dot(l0,syn0))

On line 58 we see the first of the four great advantages of the sigmoid function: when we pass the product of the matrices l0 and syn0 through nonlin(), the sigmoid converts each value in the matrix into a statistical probability between 0 and 1.
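
If you’d like to reproduce this step on your own machine, here is a minimal numpy sketch using the first customer’s row and the trained syn0 shown earlier; the other three rows of l0 would be handled exactly the same way:

import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))   # convert each value to a probability between 0 and 1

l0_row1 = np.array([[1, 0, 1]])                      # first customer's answers (1x3)
syn0 = np.array([[ 3.66, -2.88,  3.26, -1.53],
                 [-4.84,  3.54,  2.52, -2.55],
                 [ 0.16, -0.66, -2.82,  1.87]])      # trained 3x4 weights from above

l1_row1 = nonlin(np.dot(l0_row1, syn0))              # 1x4 row of the hidden layer
print(l1_row1.round(2))                              # ~[[0.98 0.03 0.61 0.58]]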

Now you may ask, what is statistical probability? So prepare yourself for the next wonder in deep learning.

Super Key Point: Inference Correlations of the Hidden Layer

Ah yes, that’s right, statistical probabilities. Well, why should we care? Because statistical probability is one of the main reasons a bunch of dumb matrices suddenly comes to life and starts learning like a child’s brain. Other than that, there’s no reason for it to interest you…

Why do we bother with all the different weight values in syn0 when we multiply l0 by syn0 in the first layer? Because by trying various combinations of our original three questions, we want to find out which combination best answers our main question: “What are the odds this customer will buy My Cat Mis!?” Let’s continue with a few silly examples:

We have the customers’ answers to our original three survey questions. What conclusions can we draw from different combinations of those questions that will improve our predictive power? For example, if a customer has a cat, it suggests great taste in pets. If they drink imported beer, it shows they care about the taste of their beer. So we might conclude that such customers would buy the wonderful, tastefully designed odor-absorbing granules of My Cat Mis! just to display them proudly on their floor, even if their cat only poops outside! If linking these two features gives us more accurate predictions, we can strengthen that link.

Another example: if a customer does not have a cat but drinks imported beer and has visited Kedimmis.com, we might deduce that they are technologically savvy. Obviously they drink imported beer simply because they appreciate the logistics chain that brings a Dutch brand to their door, and on top of that they browse websites and everything else… Ah, these people must be tech-crazed geniuses. So we might conclude that technologically sophisticated customers will buy My Cat Mis! simply because they admire the latest technology in poop-absorbing granules, even though they don’t own a cat. Maybe we should change the weights in syn0 to strengthen the links between these fine features and enjoy more accurate predictions.

Do you see? When we multiply the survey answers in l0 by the weights in syn0 (each weight being our best current estimate of how important an inferred correlation is to our prediction), we are trying different combinations of the survey answers to find out which combination is most helpful in predicting who will buy My Cat Mis!. After 60,000 trials it becomes clear that, for example, visitors to Kedimmis.com are the most likely to buy My Cat Mis!, so the corresponding weights increase over the course of the trials, which means their statistical probability ends up closer to 1 than to 0. Imported beer drinkers who don’t own cats, on the other hand, turn out to be less likely to buy My Cat Mis!, so their weights are reduced and their statistical probabilities end up closer to 0 than to 1. Cool, isn’t it? It’s like poetry in numbers. Matrices that reason and think!


Here’s why this is so important. There is no “Pandora’s Box” or “Mysterious magic” under the hood. There is clear, elegant and beautiful mathematics. And you can master it. It just takes patience and persistence.

Calm down and keep multiplying the matrices.

Visualizing Matrix Multiplication with Neurons and Synapses

If we plug these l0 and syn0 values ​​into our analogy of neurons and synapses, it looks like this:

The figure above shows the first step: feeding the first row of inputs in l0 into our network. Row one holds customer one’s answers to the three survey questions. But when we have to multiply those numbers by all 12 values of syn0, and then do the same with the three answers of each of the other three customers, how do we keep track of and organize all these numbers?

The key is to think of the four customers as packages bundled into one “group”. The top package, in the top row, is the first customer. As you can see above, we multiply the three numbers of row one by all 12 values of syn0, add them up column by column, and get the four values in the top package of l1.

What is the second package in the group? The second row, [0,1,1], consisting of the second customer’s answers to the three questions, is the second package. Multiplying those three numbers by the 12 values of syn0 and adding them up gives the four values in the second package of l1.

It goes on like this two more times. The key is to handle one package at a time; done that way, it doesn’t matter whether your group has four packages or four million. You could also say that each feature has its own set of values: in our example, each survey question (feature) comes with a set of four answers, one per customer, and it could just as well have been four million. This “whole group at once” pattern is very common, which is why I’ve tried to spell it out. It may help to remember that whenever you see a feature, there is a whole set of values sitting underneath it.

Exactly the same thing happens on line 59: we take the dot product of the 4×4 l1 and the 4×1 syn1, then run the result through the sigmoid function to produce the 4×1 l2, where each value is a statistical probability between 0 and 1.

l1 (4×4):                   syn1 (4×1):

[0.98 0.03 0.61 0.58]       [ 12.21]
[0.01 0.95 0.43 0.34]   x   [ 10.24]   =   l2_LH (4×1)
[0.54 0.34 0.06 0.87]       [ -6.31]
[0.27 0.50 0.95 0.10]       [-14.52]

Then we pass the above 4×1 result through “nonlin()” to get our guess of l2:

l2:

[ 0.50]

[ 0.90]

[ 0.05]

[ 0.70]
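
Again as a sanity check, here is a short numpy sketch of this last step, using the l1 and syn1 values printed above; small differences in the output come from the rounding of those printed values:

import numpy as np

def nonlin(x):
    return 1 / (1 + np.exp(-x))

l1 = np.array([[0.98, 0.03, 0.61, 0.58],
               [0.01, 0.95, 0.43, 0.34],
               [0.54, 0.34, 0.06, 0.87],
               [0.27, 0.50, 0.95, 0.10]])                # 4x4 hidden layer from above
syn1 = np.array([[12.21], [10.24], [-6.31], [-14.52]])   # 4x1 weights from above

l2 = nonlin(np.dot(l1, syn1))    # line 59: the 4x1 prediction vector
print(l2.round(2))               # roughly [[0.5], [0.9], [0.05], [0.7]]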

So what do these four predictions tell us about our superior cat-cleaning product? The closer a value is to 1, the more certain the network is that the customer will buy My Cat Mis!; the closer it is to 0, the more certain the network is that they will not. A 0.2 can be read as “probably not”, 0.8 as “probably will”, and 0.999 as “definitely will!”

We have completed the feed-forward part of our network. I hope you can now visualize everything we’ve done so far, namely:

1 – The matrices used;

2 – The rows as customers;

3 – The columns as features; and

4 – Each feature carrying its own group of values (e.g. the answers to one survey question).

If you can visualize these four elements, congratulations.

Consider the feed forward above as our first guess. Another 60,000 or so will follow. Our next step is to calculate where our first guess went wrong and figure out how to change the weights of our network so that the next guess is better. We’re going to repeat the same predict-and-optimize process over and over, 60,000 times. This is trial-and-error learning, and it’s a good thing.
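
To give a sense of where we’re headed, here is a hedged sketch of what the surrounding 60,000-trial loop might look like. The weight update itself is only a placeholder, since correcting syn0 and syn1 (backpropagation) is the subject of the next article, and the X, y, and starting weights below are stand-ins rather than the exact values from the series:

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)            # slope of the sigmoid (for backpropagation)
    return 1 / (1 + np.exp(-x))       # squash to a value between 0 and 1

# Stand-in data: 4 customers x 3 survey answers, and whether each bought.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([[1], [0], [0], [1]])

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1   # random 3x4 starting weights
syn1 = 2 * np.random.random((4, 1)) - 1   # random 4x1 starting weights

for trial in range(60000):
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))   # line 58: hidden layer
    l2 = nonlin(np.dot(l1, syn1))   # line 59: the prediction
    l2_error = y - l2               # how wrong was this guess?
    # ...here syn1 and syn0 would be nudged using l2_error
    # (backpropagation, covered in the next article)...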
