After a long introduction to deep learning, it’s time to move on to the coding part. In this example, we have divided neural network construction into 13 steps. Prepare to witness a computer’s process of learning from mistakes and recognizing patterns. Here we go!
Attention: This article is a continuation of the articles “Deep Learning Overview” and “Deep Learning Overview – 2”. To understand the subject, you should read those first.
A Word on the Concepts of Matrix and Linear Algebra
In the code below you will see the word matrix. This is very important. The matrix is like the engine of our car. Without matrices, a neural network can’t go anywhere.
A matrix is a series of numbers. For an analogy, you can consider the Excel spreadsheet. Or imagine a database with many rows and columns. The first matrix you’ll encounter now contains data from our pet shop survey. Our matrix looks like this:
Think of each row as a customer. There are four customers in the matrix above. Each row contains three numbers: 1 for yes and 0 for no. In the previous section, we saw that the first customer’s answers were “Yes/No/Yes”; that is the top row of our matrix. The first column contains the four customers’ answers to the first survey question, “Do you have a cat?” (customers two and three don’t have a cat, so they are shown with a zero). Now let’s schematize the same matrix to show a little more detail.
I hope the diagram above helped to understand the relationship between customer rows and attribute columns. We need to see the matrix as both. Let’s break it down now:
In our matrix, a customer’s data is represented by a row of three. In the neural network diagram, which shows circular neurons connected to each other by linear synapses, the input layer consists of three circular “neurons”. Here it is important to know that each neuron does not represent a customer, i.e. a row of data. Instead, each neuron represents a feature, a column of data. Thus, within a neuron, we have the answers that all customers give to the same question (for example, “do you have a cat?”). Since we are only showing four customers, we see four 1s and 0s corresponding to that question in the figure above. But if we were to chart 1,000,000 customers, the top neuron would contain the one million 1’s and 0’s those customers gave in answer to “do you have a cat?”
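This row/column view is easy to see in numpy. A minimal sketch (the first row and the cat column follow the text; the remaining values of rows two to four are illustrative):

```python
import numpy as np

# The 4x3 survey matrix: one row per customer, one column per question.
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])

first_customer = X[0, :]   # one row: one customer's three answers
cat_question   = X[:, 0]   # one column: all four answers to "Do you have a cat?"

print(first_customer)  # [1 0 1]
print(cat_question)    # [1 0 0 1]
```

Note how the column, not the row, corresponds to a single input neuron.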
I hope it has become clearer why we need matrices: because we have multiple customers. We described four customers in our neural network below, so we needed four rows.
Because we have multiple survey questions in our network, we needed a column corresponding to each question (or feature). The fourth question will appear in a different matrix. We’ll talk about this part later.
To summarize, matrices keep data organized while performing complex calculations.
Important note: The code below contains only a training set. It does not include a validation or test set.
The comments interspersed with the Python code below may serve as a good summary, but it is still complicated. Coding novices may not understand it on first read. Everything will be explained in detail; parts that are unclear here may be explained in later parts of the article.
Let’s spruce up the code first: Let’s import numpy, a powerful mathematical tool.
import numpy as np
#1 Sigmoid Function: Converts numbers to probabilities; with deriv=True it returns the slope (confidence) used in gradient descent.
def nonlin(x,deriv=False):
    if(deriv==True):
        return x*(1-x)
    return 1/(1+np.exp(-x))
#2 X Matrix: These are the answers we got from the questionnaire given to our 4 customers, in a language the computer understands. Row 1 is the sequence of Yes/No responses from the first customer to the first 3 survey questions:
“1” means Yes to “Do you have a cat?”, “0” means No to “Do you drink imported beer?”, and the final “1” means Yes to “Have you visited Kedimmis.com?”. Below that are 3 more rows (for the other 3 customers’ answers).
So these are 4 customers’ Yes/No answers to the first 3 questions (question 4 is used in the next step below).
These are the input set we will use to train our network.
X = np.array([[1,0,1],
              [0,1,1],
              [0,0,1],
              [1,1,1]])  # rows 2-4: the other three customers' answers (customers 2 and 3 have no cat)
#3 Vector y: The four target values are our output. These are the 4 customers’ answers to the fourth survey question, “Did you buy My Cat Is Mis!?” When our neural network outputs a prediction, we compare it to the answer to question 4, the reality. When our network’s predictions hit these 4 target values correctly, it has reached sufficient accuracy and is ready to receive a different dataset. In our example, our second dataset was the surveys from the veterinarian.
y = np.array([[1],
              [1],
              [0],
              [0]])  # customers 1 and 2 answered Yes; customers 3 and 4 are taken as No here
#4 Seed: Seeding the random number generator makes the “random” numbers we place in the synapses during training the same on every run. This makes it easy to debug.
np.random.seed(1)  # any fixed value works; 1 is used here
#5 Synapses: In other words, “weights”. These two matrices are the part of the “brain” that guesses, learns by trial and error, and heals itself on the next try. Remember the crooked red bowl we talked about in previous articles. syn0 and syn1 were the X and Y axes in the white grid below the red bowl. Thus, each time we set these values, we move the grid coordinates from point A to the bottom of the red bowl, where the error is zero (think of the movement of the yellow arrow).
syn0 = 2*np.random.random((3,4)) - 1 # Synapse 0 has 12 weights and connects l0 to l1.
syn1 = 2*np.random.random((4,1)) - 1 # Synapse 1 has 4 weights and connects l1 to l2.
#6 For Loop: This takes our iterator network to 60,000 predictions, comparisons, and optimizations.
for j in range(60000):
#7 Feedforward: Think of l0, l1 and l2 as the 3 “neuron” matrix layers that predict, compare and improve using the synapse matrices from #5. l0, or X, holds the 3 features/questions of our survey recorded for 4 customers.
    l0 = X
    l1 = nonlin(np.dot(l0,syn0))
    l2 = nonlin(np.dot(l1,syn1))
#8 l2 holds our predictions, which we compare against the target values so we can calculate how much we missed. y, the fourth question, “Did you buy My Cat Is Mis!?”, is a 4×1 vector containing the answers given by the 4 customers. Subtracting the vector l2 (our 4 current guesses) from y (the actual buying behavior) gives l2_error, which shows how much our predictions miss the target on a given attempt. In practice the total error equals l2_error squared divided by 2, and l2_error is a derivative of that total error (see the next comment).
    l2_error = y - l2
#9 Print error: Within 60,000 tries, j is evenly divisible by 10,000 six times. Every 10,000 tries we check whether l2_error (the height of the yellow arrow under the white ball at point A) is decreasing, i.e. whether we miss the target y by less with each trial.
    if (j % 10000) == 0:
        print("Total error after 10,000 additional tries: " + str(np.sum(np.abs(l2_error) / 2)))
#10 This is the beginning of backpropagation. All of the following steps share one goal: adjusting the weights in syn0 and syn1 to improve our prediction. To make our tuning as efficient as possible, we must identify the largest errors in our weights. For this, we first calculate the confidence level of each l2 prediction by taking the slope of the sigmoid at each l2 value, then multiply that slope by l2_error. In other words, we calculate l2_delta by multiplying the error by the slope of the sigmoid at a given value. Why? Because l2_error values corresponding to high-confidence predictions (close to 0 or close to 1) must be multiplied by a small number (the low slope that goes with high confidence), so they change little. This way, our network prioritizes improving the worst predictions (lower-confidence predictions closer to 0.5 sit on a steeper slope).
    l2_delta = l2_error*nonlin(l2,deriv=True)
#11 Backpropagation continues. In step 7 we fed our data forward from l0 through l1 and l2 to get our prediction. Now, working backwards, we find how much error l1 contributed. l1_error is the difference between the l1 we last calculated and the ideal l1 that would produce the ideal l2 we want. To find l1_error, we multiply l2_delta (how we want l2 to change on the next try) by the weights (syn1) that we considered best on our last try. In other words, to update syn0 we must account for the effect of syn1 (with its current values) on our network’s predictions. We do this by taking the dot product of l2_delta and the current value of syn1, which gives l1_error. This corresponds to the amount of update we will make in syn0 to change l1 next time.
l1_error = l2_delta.dot(syn1.T)
#12 Similar to #10 above, we want to get a better l2 prediction by adjusting the middle layer l1, so that l2 can better predict the y target. That is, we make large changes to the weights at low-confidence values and small changes at high-confidence values.
For this, we multiply l1_error by the slope of the sigmoid of l1 as in #10. Thus, the network makes larger changes in synapse weights for low confidence (closer to 0.5) predictions.
    l1_delta = l1_error * nonlin(l1,deriv=True)
#13 Updating the synapses: This is gradient descent. In this step the synapses, the real brain of our network, learn from their mistakes, remember, and get better. By multiplying each delta by the corresponding layer, we update our synapses so they predict better on the next try.
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)
Print the results!
print("Our error value y-l2 after 60,000 attempts: ")
print(y - l2)
The nonlinear function in line 8 plays a very important role in how our network learns. But fear not if you haven’t grasped all of it right away; this is our first pass over this material, and we’ll go into more detail in step 10 below. We will use the sigmoid function as our example of a nonlinear function. Nowadays sigmoid is used less than ReLU, but sigmoid is easier to learn, and once you understand sigmoid you will have no trouble grasping the ReLU function.
“nonlin()” is a type of sigmoid function called the logistic function. Logistic functions are widely used in science, statistics, and probability. This particular sigmoid is written in a more complex way than strictly necessary because it serves two purposes:
The first purpose is to take a matrix (represented here by a lowercase x) and convert each value in it to a number between 0 and 1 (a statistical probability). This is done by line 12: return 1/(1+np.exp(-x))
Why do we need statistical probabilities? Remember that our net doesn’t just predict with 0’s and 1’s. Our network doesn’t shout “YES! The first customer will DEFINITELY buy My Cat Is Mis!” Instead, it says: “The probability that the first customer buys My Cat Is Mis! is 74%.”
This is an important distinction, because if you guess with 0’s and 1’s, there’s no room for improvement. You are either right or wrong. But with the possibility, there is room for improvement. You can continue to improve your accuracy if you are able to tune the system to improve the probability by a few decimal places in the right direction each time.
We will see the importance of this below. Converting each number to a value between zero and one gives us four big advantages. For now, just know that the sigmoid function converts every number in the matrix it receives into a number between 0 and 1 that falls on the S-curve. This is shown below:
Thus, Function #1 in the sigmoid function converts every value in the matrix into a statistical probability.
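As a quick sanity check, here is a minimal sketch of that squashing behavior, using the same nonlin() logistic function as the listing (the input values are made up):

```python
import numpy as np

def nonlin(x, deriv=False):
    # Logistic sigmoid: squashes any number onto the S-curve between 0 and 1.
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

raw = np.array([-4.0, 0.0, 4.0])   # arbitrary raw numbers
probs = nonlin(raw)
print(probs)  # roughly [0.018, 0.5, 0.982]: all values land strictly between 0 and 1
```

Large negative inputs land near 0, large positive inputs near 1, and 0 maps exactly to 0.5.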
The second piece of the sigmoid function, Function #2, is on lines 9 and 10.
return x*(1-x)
When called with deriv=True (line 9 of the code), the function takes every value in a given matrix and converts it to the slope of the sigmoid S-curve at that point. The slope is also known as a measure of confidence. In other words, this number answers the question: “How confident are we that this number correctly predicts an outcome?” So what? Our goal was to build a neural network that makes reliably accurate predictions, and the fastest way to get there is to fix the unreliable, wishy-washy, low-confidence estimates first, leaving only accurate and confident ones. The concept of “confidence measures” is very important and we will return to it shortly. Just keep these wishy-washy, unreliable numbers in mind for now.
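A small sketch of this confidence idea, again using the nonlin() function from the listing (the example prediction values are made up):

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)          # slope of the S-curve at a sigmoid OUTPUT x
    return 1 / (1 + np.exp(-x))

confident = np.array([0.02, 0.98])  # predictions near 0 or 1: high confidence
uncertain = np.array([0.5])         # a wishy-washy prediction

print(nonlin(confident, deriv=True))  # small slopes, so small updates
print(nonlin(uncertain, deriv=True))  # [0.25], the largest possible slope, so a big update
```

Multiplying the error by these slopes is what makes the network focus its corrections on the uncertain predictions.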
Creating Input X: Lines 23-26
Lines 23-26 form the 4×3 input values that we will use to train our network. X will be layer zero, or l0 (layer 0), in our network. This is the beginning of the “toy brain” we have created.
Here’s how we show the feature set in our customer survey, in computer understandable language:
Line 23 creates input X (which becomes layer 0 in line 57)
We have four customers who answered our three questions. We have already explained how the first row above reads 1,0,1: these are the Yes/No answers given by the first customer to the survey questions. Each row of this matrix is a training example that we will feed our network; each column is a feature of our input. Thus, our X matrix can be schematized as l0, a 4×3 matrix, as follows.
You may wonder how the matrix X becomes layer 0 (l0) in the diagram above. We’ll come to that shortly. Now let’s create the list of four correct answers that we want our network to be able to guess.
Creating the y Output: Lines 34-37
These are our outputs: the answers to the fourth question of our survey, “Did you buy My Cat Is Mis!?” See the column of 4 numbers below. You will see that the first customer replied Yes and the second customer also replied Yes.
Line 34 forms the vector y; these are the values we are trying to estimate.
To make an analogy, we can think of the target y-values as a target. As our net heals, it shoots arrows closer and closer to twelve. When our network can correctly predict the target 4 values from the values in the X matrix above, it is now ready to make predictions from another database (new surveys we obtained from the vet).
Placing Random Numbers: Line 40
In this step we seed the random number generator before filling the synapses with random numbers. We start with random numbers to avoid correlations between neurons; ideally, each neuron should learn different features. Seeding the generator makes our tests repeatable (if we run multiple trainings with the same inputs, the results will be the same).
Using the random number generator, we produce the random numbers that will form the synapses/weights needed in the next step of our training process. Seeding also simplifies debugging. We don’t need to understand how the seeding code works; we just need to include it.
Building Synapses (Weights for the Brain): Lines 47-48
Looking at the diagram above, you might think that the circles, the “neurons”, are the brain of the network. In fact, the brain of a neural network, the part that actually learns and improves, is the synapses: the lines connecting the circles in the diagram. These two matrices, syn0 and syn1, are the brains of our network. They are the parts that learn by trial and error, make predictions, compare them with the target y-values, and then improve their next predictions!
syn0 = 2*np.random.random((3,4)) - 1
Notice that this code creates a 3×4 matrix and fills it with random values. This will be the first layer of our synapses, or weights: Synapse 0, connecting l0 to l1. It looks similar to the following matrix:
Line 47: syn0 = 2*np.random.random((3,4)) - 1
This line produces synapse 0 or syn0:
[ 0.36 -0.28 0.32 -0.15]
[-0.48 0.35 0.25 -0.25]
[ 0.16 -0.66 -0.28 0.18]
Now, why does syn0 have to be a 3×4 matrix? Is it because we have to multiply the 4×3 matrix l0 by syn0, so that all the numbers line up neatly in rows and columns? It is a mistake to think that multiplying 4×3 by 4×3 will produce neatly arranged numbers. In reality, we need to multiply 4×3 by 3×4 if we want our numbers to line up: this is one of the basic and important rules of matrix multiplication. Now let’s take a closer look at the first neuron in our familiar schematic, “Do you have a cat?”
Inside this neuron are the Yes / No answers given by each of the four customers. This is the first column of our 4×3 layer0 (l0) matrix:
Note that there are four lines (synapses) connecting the “Do you have a cat?” neuron to the four neurons of l1. This means that each of the four values 1,0,0,1 inside “Do you have a cat?” must be multiplied by four different weights to reach l1. So the four numbers in “Do you have a cat?” times four weights = 16, right? Yes, l1 is this 4×4 matrix.
Note that we do exactly the same with the four numbers inside the second neuron, “Do you drink imported beer?” That is another four numbers times four weights = 16 values. We add each of those 16 values to the corresponding value in the 4×4 matrix we already created above.
And for the last time, we repeat with the four numbers inside the third neuron (“Have you visited kedimmis.com?”). So our final 4×4 l1 matrix has 16 values, each of which is the sum of the corresponding values from the three multiplication sets we just completed.
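The per-neuron multiplications just described are exactly what a single np.dot call computes in one step. A minimal sketch (the weights are random and only for illustration):

```python
import numpy as np

np.random.seed(1)
l0   = np.array([[1, 0, 1],
                 [0, 1, 1],
                 [0, 0, 1],
                 [1, 1, 1]])                 # 4 customers x 3 features
syn0 = 2 * np.random.random((3, 4)) - 1      # 3 features, 4 weights each

# One np.dot call...
product = np.dot(l0, syn0)

# ...equals the sum of one 4x4 contribution per feature column,
# exactly the three multiplication sets described in the text:
by_hand = sum(np.outer(l0[:, f], syn0[f, :]) for f in range(3))

print(product.shape)                  # (4, 4)
print(np.allclose(product, by_hand))  # True
```

So “multiply each neuron’s column by its four weights, then add the three results” and “np.dot(l0, syn0)” are the same computation.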
Do you see it now? 3 survey questions with 4 synapses each = 3 neurons times 4 weights = 3 features times 4 weights = a 3×4 matrix.
Is it complicated? If you work on it patiently, you will get used to it. Also, the computer does the multiplications for you. But it’s important that we understand what’s going on under the hood. The lines don’t lie when we look at a neural network diagram like the one below. Consider:
If there are four synapses connecting the “Do you have a cat?” neuron to all four neurons in the next layer, that means each number inside “Do you have a cat?” must be multiplied by four weights. In this example, we know that “Do you have a cat?” holds four numbers, so we know we will obtain a 4×4 matrix. To get there, we need to multiply by a 3×4 matrix: 3 nodes, each with 4 synapses connecting it to the next layer of 4 neurons. Study the diagram until you are familiar with the layout and it is clear to you where each synapse begins and ends:
Always remember that matrix multiplication requires the two inner dimensions to match. For example, a 4×3 matrix can only be multiplied by a 3×_ matrix (3×4 in this example). The two inner numbers (both 3 here) must be the same.
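A quick numpy sketch of this shape rule (the all-ones matrices are just placeholders):

```python
import numpy as np

a = np.ones((4, 3))   # 4x3
b = np.ones((3, 4))   # 3x4: the inner dimensions (3 and 3) match

print(np.dot(a, b).shape)   # (4, 4): outer dimensions survive

try:
    np.dot(a, a)            # 4x3 times 4x3: inner dimensions (3 and 4) clash
except ValueError as e:
    print("shapes do not line up:", e)
```

numpy raises a ValueError for the mismatched case, which is a handy way to catch shape mistakes early.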
So where do the “2*” at the beginning and the “-1” at the end of our equation come from? The np.random.random function produces random numbers between 0 and 1 (so their mean is about 0.5). But we want this initialization to have a mean of zero. Why? Because the initial weights in this matrix should not be biased toward 1 or 0, as that would express confidence we don’t yet have (initially the network has no idea what’s going on, so it shouldn’t show confidence in its predictions until it updates them with each trial).
So how do we convert a set of numbers with a mean of 0.5 into a set with a mean of 0? We first multiply all the random numbers by two (giving a distribution between 0 and 2 with a mean of 1), then subtract 1 (giving a distribution between -1 and 1 with a mean of 0). That is the reason for the 2* at the beginning and the -1 at the end of our equation: the mean goes from 0.5 to 0. Neat, isn’t it?
2*np.random.random((3,4)) – 1
Let’s continue. The code syn1 = 2*np.random.random((4,1)) - 1 creates a 4×1 vector and fills it with random values. This becomes the second layer of weights in our network: Synapse 1, connecting l1 to l2. Meet Synapse 1:
Line 48: syn1 = 2*np.random.random((4,1)) - 1
This line creates synapse 1 or syn1:
It might be a good exercise to figure out what size the matrices must be for this multiplication. Why is syn1 a 4×1? Look at the diagram: there is only one line (weight) connecting the topmost neuron of l1 to the single neuron of l2, so each of the four values inside that neuron is multiplied by just one weight. 4 values times 1 weight = four products, one per customer. For each customer, that product is added to the corresponding products from the other three l1 neurons, and the sum gives that customer’s value inside l2.
Repeat this reasoning for the other 3 neurons in l1 and it is clear that l2 is a 4×1 matrix (also known as a vector when there is only one column).
Let’s repeat: always remember that the inner dimensions must match. A 4×3 matrix must be multiplied by a 3×_ matrix, just as a 4×4 matrix must be multiplied by a 4×_ matrix.
For loop: Line 52
This is the loop that puts our network through 60,000 trials. At each trial, our network receives X, the response data from our customer surveys, and makes its best guess about the probability that each customer bought My Cat Is Mis! It then compares its prediction with the actual answer stored in y. Learning from its mistakes, it makes a slightly better guess on the next try, and repeats this 60,000 times, until it correctly predicts the target values y from the input X by trial and error. At that point, our network is ready for any new input data you provide (like the surveys you get from the vet) and can pick the right people for your targeted ads.
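Putting the thirteen steps together, the whole listing runs roughly as follows. This is a sketch: the y values for customers three and four, and the seed value 1, are assumptions not fixed by the text.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)            # slope of the sigmoid: our confidence measure
    return 1 / (1 + np.exp(-x))       # squash any number to a probability in (0, 1)

# 4 customers x 3 survey questions (X) and their "Did you buy?" answers (y).
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([[1], [1], [0], [0]])    # customers 3 and 4 assumed No

np.random.seed(1)                     # repeatable "random" starting weights
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

for j in range(60000):
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))     # feed forward: hidden layer
    l2 = nonlin(np.dot(l1, syn1))     # feed forward: final predictions

    l2_error = y - l2                 # how far off the targets were we?
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)   # backpropagate the blame to layer 1
    l1_delta = l1_error * nonlin(l1, deriv=True)

    syn1 += l1.T.dot(l2_delta)        # gradient descent: nudge the weights
    syn0 += l0.T.dot(l1_delta)

print(np.round(l2, 3))                # predictions end up very close to y
```

After 60,000 trials the four predictions in l2 sit within a few hundredths of the four target values in y.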
Here we have actually simplified the process. Typically both a training set and a validation set are used: models are trained until the validation-set error starts to increase, a technique called “early stopping”. Here the process is simplified for teaching purposes.
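As a rough, hypothetical sketch of what early stopping looks like in practice (the train/validation split, checking interval, and targets here are made up for illustration):

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# Hypothetical split: train on 3 customers, hold 1 out for validation.
X_train = np.array([[1, 0, 1], [0, 1, 1], [0, 0, 1]])
y_train = np.array([[1], [1], [0]])
X_val   = np.array([[1, 1, 1]])
y_val   = np.array([[0]])

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

best_val_error = np.inf
for j in range(60000):
    l1 = nonlin(np.dot(X_train, syn0))
    l2 = nonlin(np.dot(l1, syn1))
    l2_delta = (y_train - l2) * nonlin(l2, deriv=True)
    l1_delta = l2_delta.dot(syn1.T) * nonlin(l1, deriv=True)
    syn1 += l1.T.dot(l2_delta)
    syn0 += X_train.T.dot(l1_delta)

    # Every 1000 tries, check the held-out customer; stop when it gets worse.
    if j % 1000 == 0:
        val_pred  = nonlin(np.dot(nonlin(np.dot(X_val, syn0)), syn1))
        val_error = float(np.mean(np.abs(y_val - val_pred)))
        if val_error > best_val_error:
            break                      # "early stop": validation error rose
        best_val_error = val_error
```

The idea is simply to keep training only as long as the data the network has never seen keeps getting easier to predict.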
- Feed Forward: Making 60,000 Educated Predictions