Hexaview Technologies

Evolution of Computer Vision

What is Computer Vision?

The history of vision can be traced back roughly 543 million years, to a time when species drifted through the oceans with no eyes at all. Then, within a span of about 10 million years, the number of animal species exploded from a handful to hundreds of thousands, and it was a puzzle what could have caused this. The fossil record suggests an answer: species developed eyes, and the onset of vision began. In humans, over 50% of our neurons are involved in vision. Likewise, when we build electronics to act as neurons and feed them an image as data input, the output we receive can carry real meaning.


The first computer vision thesis was written in the 1960s; it simplified the visual world into geometric shapes, and its primary goal was to recognize those shapes. Since then, the field of computer vision has blossomed, and this branch of AI is still blooming. We have come a long way, from recognizing shapes to telling a story about an image. And yet gaps remain: security cameras are everywhere, but they do not alert us when a child is drowning in a swimming pool.

Computer vision is what can tell that story. It can give meaning to otherwise meaningless pixels. We have cars that can drive themselves, but we still don't have robots that can work alongside nurses. Computer vision is the branch of AI that extracts high-level information from digital image data. Thanks to advances in neural networks and ML algorithms, computers can now classify images among tens of thousands of different classes.

How Does Computer Vision Work?

We want computers to learn about emotions, relations, actions, and intentions. So how do we make this work? How do we make computers see images and understand their story? In the early days of object modeling, we described objects to algorithms in a mathematical language: a cat has a round face, a chubby body, two pointy ears, and a long tail. A perfect example of such a cat is shown in Fig. 1.b.

Fig. 1.b
Fig. 1.c. (© TheRedMyth / imgur)

But what about the cat in Fig. 1.c? Even something as simple as a household pet presents countless variations that break the object model, and that is for just one object. Rather than chasing ever-better hand-crafted algorithms, we should feed our models a large amount of training data, with both quality and quantity. So, the steps are:

  • Gather Digital Data
  • Deep Learning Models
  • Natural Language Processing
Gather Digital Data

So where do we get this enormous amount of data? The answer is the internet. Luckily, several websites host pools of clean, labeled datasets ready to download. Popular examples include CIFAR-10, ImageNet, and MS COCO, among many others ready for training. Once we have our clean, labeled images, we still can't feed them to a model as they are. We need preprocessing techniques to reduce an image's size and complexity, which can be achieved by grayscale conversion, normalization, and standardization. Several array-based Python libraries, such as NumPy and OpenCV, will do this work for us.
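The three preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration on a toy 2×2 "image", not a production pipeline; the luma weights used for grayscale are one common convention (ITU-R BT.601).

```python
import numpy as np

# A tiny 2x2 RGB "image" with pixel values in the usual 0..255 range.
image = np.array([
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
], dtype=np.float64)

# Grayscale conversion: collapse the three color channels into one
# using the ITU-R BT.601 luma weights.
gray = image @ np.array([0.299, 0.587, 0.114])

# Normalization: rescale pixel values into the 0..1 range.
normalized = gray / 255.0

# Standardization: shift and scale to zero mean and unit variance.
standardized = (normalized - normalized.mean()) / normalized.std()

print(gray.shape)  # (2, 2) -- one channel instead of three
```

The same three operations scale unchanged from this toy array to a full dataset of images.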

Deep Learning Models

In computer vision, we can understand an image at multiple levels: which objects are present, and what meaning do they convey? Deep learning models can answer these questions.

Image Classification

Image classification attempts to identify the most significant object in an image. For example, say we build a model to identify cats among other animals. The model takes an image as input and outputs a label together with a confidence percentage. We employ deep neural networks, mostly CNNs (Convolutional Neural Networks), for this task. But how do we implement all this? We need hands-on experience with Python and tensor libraries, mainly TensorFlow and PyTorch; these widely used ML frameworks handle large datasets well and allow quick implementation. Some popular image classification models are AlexNet, VGG, and Inception V4, listed here in chronological order, each improving on its predecessors' architectural designs.
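The "label with a confidence percentage" comes from a softmax over the network's raw class scores (logits). Here is a minimal sketch of that final step, using hypothetical logits in place of a real CNN's output; the class names and scores are invented for illustration.

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical raw scores ("logits") a trained CNN might output
# for a single input image.
class_names = ["cat", "dog", "rabbit"]
logits = np.array([4.2, 1.1, 0.3])

probs = softmax(logits)
best = int(np.argmax(probs))
print(f"{class_names[best]}: {probs[best]:.1%}")  # label plus its confidence
```

In a real pipeline, models such as AlexNet or VGG produce the logits from convolutional layers; this softmax step at the end is the same.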

Object Detection

Now we know what object is present in the image, but not how many instances there are or where they are located; the computer doesn't know that yet. For this task, we need a model that returns a bounding box (the coordinates enclosing the object) along with a confidence value and a label.

Object Detection = Classification + Localization

We have classified our image; all that's left is to locate the object, which can be done in several ways. We can use a region-proposal or sliding-window approach, but running a classifier over every window is computationally expensive. It can be improved by replacing the dense layers with 1×1 convolutions, so the network processes the whole image in a single pass rather than one cropped window at a time. Both of these approaches work, but they are still slow, and we need something that runs in almost real time; increasing computational power is not always an option. Enter YOLO (You Only Look Once), a clever CNN algorithm for real-time object detection. It uses a single neural network to process the full image, divides the image into regions, and predicts bounding boxes and class probabilities for each region. Algorithms like Fast R-CNN, Faster R-CNN, and SSD can also be used.
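To see why the naive approach is expensive, here is a minimal sketch of the sliding-window enumeration described above. The classifier that would score each window is omitted; the point is that even a small image with one window size already yields many candidate crops, and a real detector must cover multiple scales.

```python
def sliding_windows(img_h, img_w, win, stride):
    """Yield (x, y, w, h) candidate boxes covering the image grid."""
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            yield (x, y, win, win)

# On a 64x64 image, a 32-pixel window with stride 16 gives a 3x3 grid
# of candidate boxes; a classifier would then have to score every crop.
boxes = list(sliding_windows(64, 64, win=32, stride=16))
print(len(boxes))  # 9
```

YOLO sidesteps this per-window cost entirely by predicting all boxes from one forward pass over the full image.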

Natural Language Processing

So far, we have taught the computer to see objects and even classify them among many different classes. This is like a small child learning to utter a few nouns. Instead of just labels and counts, we want sentences with proper meaning, so the computer now has to learn from both pictures and natural sentences. Natural Language Processing is the branch of AI that can understand and produce human-readable sentences, so we combine our computer vision model with NLP to get the desired result. NLP generally employs machine learning algorithms, which learn patterns from large corpora, i.e., collections of sentences. Algorithms like these can do the trick:

  • Support Vector Machine (SVM)
  • Bayesian Network
  • Neural Networks

Within neural networks, RNNs and LSTMs are the two most commonly used architectures for this. What sets them apart is that each step takes the previous step's output as part of its input; in this way, the network keeps track of the history and, on that basis, produces a sentence word by word. You can see the outputs of such computer vision models in the figures.
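The feedback loop described above, where each step consumes the previous step's output, can be sketched without any trained network at all. Here a toy lookup table stands in for a trained RNN/LSTM decoder; the words and the `<start>`/`<end>` tokens are invented for illustration.

```python
# A toy lookup table standing in for a trained RNN/LSTM decoder:
# given the previous word, it "predicts" the next one.
next_word = {
    "<start>": "a",
    "a": "cat",
    "cat": "sitting",
    "sitting": "<end>",
}

def generate_caption(max_len=10):
    word, caption = "<start>", []
    # As in an RNN, each step consumes the previous step's output,
    # carrying the history of the sentence forward.
    for _ in range(max_len):
        word = next_word[word]
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption)

print(generate_caption())  # a cat sitting
```

A real captioning model replaces the table with a learned network conditioned on the image's CNN features, but the generation loop has this same shape.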

Top Technologies Powering Computer Vision

Our intuition is that if we follow these steps, the computer will be able to see images and understand them. Still, many obstacles need to be overcome: which of several programming languages to choose, how much computational power we need, and how to monitor our progress over training.

  • Python
  • Google Colaboratory
  • Monitoring Tools


Python

Python ranks as the most popular and widely used programming language for machine learning tasks; it is also used in data science and web development. It offers concise, readable code, where complex operations can often be expressed in a single line, so developers can put their effort into solving the ML problem instead of focusing on syntax. Besides Python, languages like C++, Java, JavaScript, and R can also be used for ML.

Google Colaboratory

Google Colaboratory is a Google tool that lets anybody write and run Python code in a web browser and use libraries like NumPy for machine learning and data analysis. Its free tier grants access to GPUs, roughly 12 GB of RAM, and temporary storage per session, with sessions that time out after several hours. That suffices for educational purposes.

Monitoring Tools

Just as DevOps monitors software in production, we need MLOps to monitor ML models in production. We need tools to compare several models across large datasets, to see which one performs better and where it falls short. Monitoring tools like Neptune, Arize, and Amazon SageMaker are widely used.


It seems clear that computer vision will be a great part of our lives in the coming years, but it still has a long way to go. Today's computers see roughly as well as a small child; now we need to go further, so a computer can recognize the beauty of nature or understand a special occasion. When machines can see, doctors and nurses will have an extra pair of tireless eyes to help them diagnose and care for patients. Cars will become safer and smarter on the road. Robots, not just humans, will help rescue trapped and wounded people in disaster zones. We will discover new species and better materials in places humans cannot reach. All of this will be achieved little by little as we give sight to machines.

The quest to give computers visual intelligence, and with it a better future, continues.

Sanskar Jaiswal


Sanskar has hands-on experience with Django/Python and Java. He is skilled at creating web APIs and working with Google APIs and scheduling. He is a fast learner with strong time-management and multitasking skills. Apart from this, he has a keen interest in Deep Learning (Computer Vision).