
A Tour of Deep Learning Models


Toy, J. (2016, July 28). A Tour of Deep Learning Models.

Takeaway: Deep learning models are teaching computers to think on their own, with some very fun and interesting results.

Deep learning is being applied to more and more domains and industries. From driverless cars, to playing Go, to generating images and music, new deep learning models are coming out every day. Here we go over several popular deep learning models. Scientists and developers are taking these models and modifying them in new and creative ways. We hope this showcase can inspire you to see what is possible. (To learn about advances in artificial intelligence, see Will Computers Be Able to Imitate the Human Brain?)

Neural Style

If you have ever used Instagram or Snapchat, you are familiar with using filters that alter the brightness, saturation, contrast, and so on of your images. Neural style, a deep learning algorithm, goes beyond filters and allows you to transpose the style of one image, perhaps Van Gogh’s “Starry Night,” and apply that style onto any other image.

How Does it Work?

Neural style uses a deep neural network in order to separate and recombine the content and style of any two images. It is one of the first artificial neural networks (ANNs) to provide an algorithm for the creation of artistic imagery. (To learn more about ANNs, see What is the difference between artificial intelligence and neural networks?)

The model is given two input images, one that will be used for styling, the other for content. At each processing stage in the convolutional neural network's (CNN) hierarchy, the images are broken into a set of filtered images. While the number of different filters increases along the processing hierarchy, the overall size of the filtered images is reduced, leading to a decrease in the total number of units per layer of the network.

The above figure visualizes the information at different processing stages in the CNN. The content reconstructions from the lower layers (a, b, c) are almost exact replicas of the original image. In the higher layers of the network, however, the detailed pixel information is lost while the high-level content of the image is preserved (d, e). Meanwhile, the model captures the style of the other input image by measuring correlations between the different filter responses within each layer of the CNN. The model then reconstructs this style on top of the content representations, creating images that match the style on an increasing scale as you move up the network's hierarchy.
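The layer-wise style representation described above is commonly computed as a Gram matrix of filter responses (as in Gatys et al.'s "A Neural Algorithm of Artistic Style"). Below is a minimal sketch of that idea; the random arrays stand in for feature maps that a real implementation would extract from a pretrained CNN.

```python
import numpy as np

def gram_matrix(features):
    """Correlations between filter responses within one layer — the layer's
    'style' summary. features has shape (n_filters, height, width)."""
    n_filters = features.shape[0]
    flat = features.reshape(n_filters, -1)  # each row: one filter's responses
    return flat @ flat.T                    # (n_filters, n_filters)

def style_loss(features_a, features_b):
    """Squared difference between the Gram matrices of two images' features."""
    diff = gram_matrix(features_a) - gram_matrix(features_b)
    return float((diff ** 2).sum())

# Stand-in feature maps (a real model would take these from a pretrained CNN).
rng = np.random.default_rng(0)
style_feats = rng.normal(size=(8, 4, 4))
content_feats = rng.normal(size=(8, 4, 4))

print(gram_matrix(style_feats).shape)        # (8, 8)
print(style_loss(style_feats, style_feats))  # 0.0 — identical styles match exactly
```

The optimization then adjusts a generated image so that its Gram matrices match the style image's while its raw feature maps match the content image's.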

Neural Storyteller

Neural Storyteller is a model that, when given an image, can generate a romance story about the image. It's a fun toy, and yet it lets you imagine the future and see the direction in which these artificial intelligence models are moving.


Building a neural network model to accomplish a goal increasingly involves building larger and more sophisticated pipelines, which can include mixing and matching different algorithms together. Neural storyteller consists of four main parts: skip-thought vectors, image-sentence embedding, style shifting and conditional neural language models.

Skip-Thought Vectors

Skip-thought vectors are a way to encode text in an unsupervised manner (inferring a function from unlabeled data). The system works by exploiting the continuity of text: for any given sentence, it tries to reconstruct the surrounding text. For Neural Storyteller, romance novels are converted into skip-thought vectors.

Image-Sentence Embeddings

Another separate model, a visual semantic embedding model, is built so that when given an image, it outputs a sentence describing that image. The dataset used to train this is called MSCOCO. There are many models that already do this, such as Neural Talk.

With these two models in place, they can be connected to get the result we are looking for. Another program is written that is essentially this function:

F(x) = x - c + b

In this function, x represents the image caption, c represents the "caption style," and b represents the "book style." The idea of the function can be translated to: Keep the "thought" of the caption, but replace the image caption style with that of a story. In the function, c, the caption style, is generated by taking the mean of the top MSCOCO captions generated for the image, while b is the mean of the skip-thought vectors for romance novel passages.
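The function above is plain vector arithmetic in embedding space. The sketch below illustrates it with made-up 4-dimensional toy vectors standing in for real skip-thought embeddings:

```python
import numpy as np

def shift_style(x, caption_vectors, book_vectors):
    """F(x) = x - c + b: keep the caption's 'thought', swap caption style
    for book style. c and b are means over the two sets of vectors."""
    c = np.mean(caption_vectors, axis=0)  # mean of top caption embeddings
    b = np.mean(book_vectors, axis=0)     # mean of romance-passage embeddings
    return x - c + b

# Toy 4-dimensional "embeddings" standing in for real skip-thought vectors.
x = np.array([1.0, 2.0, 3.0, 4.0])
captions = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0]])
books = np.array([[0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])

print(shift_style(x, captions, books))  # [0.5 1.5 3.5 4.5]
```

In the real system, the shifted vector is then decoded by a language model trained on romance passages, which is what turns the caption's "thought" into story prose.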

Style Shifting

The above function is the "style-shifting" operation that allows the model to transfer standard image captions to the style of stories from novels. Style shifting was inspired by "A Neural Algorithm of Artistic Style."


There are two main sources of data used in this model. MSCOCO is a dataset from Microsoft containing around 300,000 images, each accompanied by five captions. MSCOCO is the only supervised data being used, meaning it is the only data where humans had to go in and explicitly write out captions for each image.

The other source of data is called BookCorpus. The model was trained on a subset of BookCorpus, specifically 11 million passages from romance novels. But BookCorpus also contains books from adventure, sci-fi and other genres.

Character RNN

Feed-Forward Network Versus a Recurrent Neural Network

Until fairly recently, most computer scientists experimented primarily with feed-forward neural networks for prediction problems such as whether an email message is spam. In a typical feed-forward neural network, input is given to the model, which processes it behind the scenes in hidden layers and spits out an output. The hidden layers are arranged in a sort of pyramid structure where each higher layer is calculated from the input and calculations of the layers below it, but not vice versa (higher layers do not affect lower layers). For example, a feed-forward network might be used to identify objects in an image: the lower layers analyze the shapes and lines of an object, while the higher layers combine those shapes and classify the object.
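The one-way, layer-by-layer flow can be sketched in a few lines. This is a minimal two-layer feed-forward pass with random stand-in weights (a real network would learn them from data):

```python
import numpy as np

def feed_forward(x, w1, w2):
    """One pass through a two-layer feed-forward network: input -> hidden
    -> output. Information only flows upward, and nothing is remembered
    between calls."""
    hidden = np.maximum(0.0, w1 @ x)  # ReLU hidden layer
    return w2 @ hidden                # output layer (e.g. spam / not spam scores)

rng = np.random.default_rng(1)
w1 = rng.normal(size=(5, 3))  # 3 inputs -> 5 hidden units
w2 = rng.normal(size=(2, 5))  # 5 hidden units -> 2 outputs

x = np.array([0.2, -0.1, 0.7])
print(feed_forward(x, w1, w2))
# Calling again with the same input gives the same answer: no memory.
print(feed_forward(x, w1, w2))
```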

One of the major limitations of a feed-forward neural network is that it has no memory. Each prediction is independent from previous calculations, as if it were the first and only prediction the network ever made. But for many tasks, such as translating a sentence or paragraph, inputs should consist of sequential and contextually related data. For example, it would be difficult to make sense of a single word in a sentence without the context provided by the surrounding words.

RNNs are different because they add another set of connections between the neurons. These links allow the activations from the neurons in a hidden layer to feed back into themselves at the next step in the sequence. In other words, at every step, a hidden layer receives both activation from the layer below it and also from the previous step in the sequence. This structure essentially gives recurrent neural networks memory. So for the task of object detection, an RNN can draw upon its previous classifications of dogs to help determine if the current image is a dog.
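The extra feedback connections amount to one additional term in the hidden-layer update. Here is a minimal sketch of a single recurrent step with random stand-in weights, run over a short sequence:

```python
import numpy as np

def rnn_step(x, h_prev, w_xh, w_hh):
    """One RNN step: the new hidden state mixes the current input (w_xh @ x)
    with the previous step's hidden state (w_hh @ h_prev). That feedback
    term is the network's memory."""
    return np.tanh(w_xh @ x + w_hh @ h_prev)

rng = np.random.default_rng(2)
w_xh = rng.normal(size=(4, 3)) * 0.5  # input -> hidden
w_hh = rng.normal(size=(4, 4)) * 0.5  # hidden -> hidden (the recurrent links)

sequence = [np.array([1.0, 0.0, 0.0]),
            np.array([0.0, 1.0, 0.0]),
            np.array([0.0, 0.0, 1.0])]

h = np.zeros(4)  # start with an empty memory
for x in sequence:
    h = rnn_step(x, h, w_xh, w_hh)
print(h)  # final hidden state depends on the whole sequence, not just the last input
```

Feeding the same inputs in a different order produces a different final hidden state, which is exactly what a memoryless feed-forward network cannot do.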


This flexible structure in the hidden layer makes RNNs very good at character-level language models. Char RNN, originally created by Andrej Karpathy, is a model that takes one text file as input and trains an RNN to predict the next character in a sequence. The RNN can then generate text character by character that looks like the original training data. A demo has been trained using transcripts of various TED Talks. Feed the model one or several keywords and it will generate a passage about the keyword(s) in the voice/style of a TED Talk.
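The character-by-character generation loop can be sketched as follows. The weights here are random and untrained, so the output is gibberish; in a trained char-rnn these weights would have been learned from the training text, and the samples would resemble it.

```python
import numpy as np

# Minimal character-by-character sampling loop in the spirit of char-rnn.
vocab = list("abc ")  # toy alphabet; a real model uses every character in the corpus
rng = np.random.default_rng(3)
hidden_size = 8
w_xh = rng.normal(size=(hidden_size, len(vocab))) * 0.3  # char -> hidden
w_hh = rng.normal(size=(hidden_size, hidden_size)) * 0.3  # hidden -> hidden
w_hy = rng.normal(size=(len(vocab), hidden_size)) * 0.3   # hidden -> next-char scores

def sample(seed_char, length):
    h = np.zeros(hidden_size)
    idx = vocab.index(seed_char)
    out = [seed_char]
    for _ in range(length):
        x = np.zeros(len(vocab))
        x[idx] = 1.0                                   # one-hot current character
        h = np.tanh(w_xh @ x + w_hh @ h)               # update the memory
        logits = w_hy @ h
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax over next character
        idx = rng.choice(len(vocab), p=probs)          # sample the next character
        out.append(vocab[idx])
    return "".join(out)

print(sample("a", 20))
```

Training consists of running this same recurrence over the corpus and nudging the three weight matrices so that the predicted distribution puts high probability on the character that actually comes next.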


These models show new breakthroughs in machine intelligence that have become possible because of deep learning. Deep learning lets us solve problems that we could never solve before, and we have not yet reached a plateau. Expect to see many more exciting things like driverless cars over the next couple of years as a result of deep learning innovation.
