Brief Intro - Computer Vision (CV)

Introduction to Computer Vision (CV)

Computer vision, often abbreviated as CV, is a field of study that seeks to develop techniques to help computers “see” and understand the content of digital images such as photographs and videos.

The problem of CV appears simple because it is trivially solved by people, even very young children. Nevertheless, it largely remains an unsolved problem, both because of our limited understanding of biological vision and because of the complexity of visual perception in a dynamic and nearly infinitely varying physical world.

This post is an introduction to the field of computer vision (CV). It covers:

The goal of the field of CV and its distinctness from image processing.

What makes the problem of CV challenging.

Typical problems or tasks pursued in CV.


The post is divided into four parts:

The desire for Computers to See

What Is CV

Challenge of CV

Tasks in CV

The desire for Computers to See

We are awash in images.

Smartphones have cameras, and taking a photo or video and sharing it has never been easier, resulting in the incredible growth of modern social networks like Instagram.

YouTube might be the second largest search engine: hundreds of hours of video are uploaded every minute, and billions of videos are watched every day.

The internet is comprised of text and images. It is relatively straightforward to index and search text, but in order to index and search images, algorithms need to know what the images contain. For the longest time, the content of images and video has remained opaque, best described using the meta descriptions provided by the person who uploaded them.

To get the most out of image data, we need computers to “see” an image and understand the content.

This is a trivial problem for humans, even young children.

A person can describe the content of a photograph they have seen once.

A person can summarize a video that they have only seen once.

A person can recognize a face that they have only seen once before.

We require at least the same capabilities from computers in order to unlock our images and videos.

What Is CV

CV is a field of study focused on the problem of helping computers to see.

At an abstract level, the goal of computer vision problems is to use the observed image data to infer something about the world.

It is a multidisciplinary field that could broadly be called a subfield of artificial intelligence (AI) and machine learning (ML), and it may draw on both specialized methods and general learning algorithms.

As a multidisciplinary area of study, it can look messy, with techniques borrowed and reused from a range of disparate engineering and computer science fields.

One particular problem in vision may be easily addressed with a hand-crafted statistical method, whereas another may require a large and complex ensemble of generalized ML algorithms.

CV as a field is an intellectual frontier. Like any frontier, it is exciting and disorganized, and there is often no reliable authority to appeal to. Many useful ideas have no theoretical grounding, and some theories are useless in practice; developed areas are widely scattered, and often one looks completely inaccessible from the other.

The goal of CV is to understand the content of digital images. Typically, this involves developing methods that attempt to reproduce the capability of human vision.

Understanding the content of digital images may involve extracting a description from the image, which may be an object, a text description, a three-dimensional model, and so on.

Computer vision is the automated extraction of information from images. Information can mean anything from 3D models, camera position, and object detection and recognition to grouping and searching image content.

Computer Vision (CV) and Image Processing

CV is distinct from image processing.

Image processing is the process of creating a new image from an existing image, typically simplifying or enhancing the content in some way. It is a type of digital signal processing and is not concerned with understanding the content of an image.

A given CV system may require image processing to be applied to raw input, e.g. pre-processing images.

Examples of image processing include:

Normalizing photometric properties of the image, such as brightness or color.

Cropping the bounds of the image, such as centering an object in a photograph.

Removing digital noise from an image, such as digital artifacts from low light levels.
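To make the three examples above concrete, here is a minimal sketch in plain Python. The tiny grayscale image and the function names (normalize_brightness, crop, mean_filter) are made up for illustration, and nested lists stand in for the arrays a real imaging library would use. Note that each function takes an image and returns a new image; none of them tries to say what the image contains.

```python
# A tiny hypothetical grayscale image: rows of pixel intensities from 0 to 255.
image = [
    [10, 12, 11, 10, 12],
    [11, 200, 210, 205, 10],
    [12, 205, 255, 210, 11],
    [10, 210, 205, 200, 12],
    [11, 10, 12, 11, 10],
]


def normalize_brightness(img):
    """Stretch intensities so the darkest pixel maps to 0 and the brightest to 255."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        # A flat image has no contrast to stretch.
        return [[0 for _ in row] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]


def crop(img, top, left, height, width):
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def mean_filter(img):
    """Smooth noise by replacing each pixel with the mean of its 3x3 neighborhood.

    Pixels on the border average over the neighbors that exist.
    """
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        new_row = []
        for c in range(w):
            neighbors = [
                img[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            ]
            new_row.append(sum(neighbors) // len(neighbors))
        out.append(new_row)
    return out


normalized = normalize_brightness(image)
centered = crop(image, 1, 1, 3, 3)  # keep only the bright object in the middle
smoothed = mean_filter(image)
```

A CV system would typically run steps like these first and then hand the cleaned-up image to the part that tries to interpret it.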

Challenge of Computer Vision (CV)

Helping computers to see turns out to be very hard.

The goal of CV is to extract useful information from images. This has proved a surprisingly challenging task; it has occupied thousands of intelligent and creative minds over the last four decades, and despite this we are still far from being able to build a general-purpose “seeing machine.”

CV seems easy, perhaps because it is so effortless for humans.

Initially, it was believed to be a trivially simple problem that could be solved by a student connecting a camera to a computer. After decades of research, computer vision remains unsolved, at least in terms of meeting the capabilities of human vision.

Making a computer see was something that leading experts in the field of AI thought to be at the level of difficulty of a summer student’s project back in the sixties. Forty years later the task is still unsolved and seems formidable.

One reason is that we don’t have a strong grasp of how human vision works.

Studying biological vision requires an understanding of the perception organs like the eyes, as well as the interpretation of the perception within the brain. Much progress has been made, both in charting the process and in terms of discovering the tricks and shortcuts used by the system, although like any study that involves the brain, there is a long way to go.

Perceptual psychologists have spent decades trying to understand how the visual system works and, even though they can devise optical illusions to tease apart some of its principles, a complete solution to this puzzle remains elusive.

Another reason why it is such a challenging problem is the complexity inherent in the visual world.

A given object may be seen from any orientation, in any lighting conditions, with any type of occlusion from other objects, and so on. A true vision system must be able to “see” in any of an infinite number of scenes and still extract something meaningful.

Computers work well for tightly constrained problems, not open unbounded problems like visual perception.

Tasks in Computer Vision (CV)

Nevertheless, there has been progress in the field, especially in recent years, with commodity systems for optical character recognition and face detection in cameras and smartphones.

CV is at an extraordinary point in its development. The subject itself has been around since the 1960s, but only recently has it been possible to build useful computer systems using ideas from CV.

The 2010 textbook on CV titled “Computer Vision: Algorithms and Applications” provides a list of some high-level problems where we have seen success with computer vision.

Optical character recognition (OCR)

Machine inspection

Retail (e.g. automated checkouts)

3D model building (photogrammetry)

Medical imaging

Automotive safety

Match move (e.g. merging CGI with live actors in movies)

Motion capture (mocap)


Fingerprint recognition and biometrics

It is a broad area of study with many specialized tasks and techniques, as well as specializations to target application domains.

It has a wide variety of applications, both old (e.g., mobile robot navigation, industrial inspection, and military intelligence) and new (e.g., human-computer interaction, image retrieval in digital libraries, medical image analysis, and the realistic rendering of synthetic scenes in computer graphics).

It may be helpful to zoom in on some of the simpler CV tasks that you are likely to encounter or be interested in solving, given the vast number of publicly available digital photographs and videos.

Many popular CV applications involve trying to recognize things in photographs; for example:

Object Classification: What broad category of object is in this photograph?

Object Identification: Which type of a given object is in this photograph?

Object Verification: Is the object in the photograph?

Object Detection: Where are the objects in the photograph?

Object Landmark Detection: What are the key points for the object in the photograph?

Object Segmentation: What pixels belong to the object in the image?

Object Recognition: What objects are in this photograph and where are they?
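One way to see how these tasks relate is to look at the shape of the answer each one returns. The sketch below starts from a hypothetical segmentation result (a mask of 0s and 1s, hand-written here rather than produced by any model) and derives a verification answer and a detection-style bounding box from it. The function names are illustrative only.

```python
# Hypothetical segmentation output: 1 marks pixels that belong to the object.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def object_present(mask):
    """Verification: is the object in the image at all? Returns True or False."""
    return any(any(row) for row in mask)


def bounding_box(mask):
    """Detection: where is the object?

    Returns (top, left, bottom, right) with inclusive indices,
    or None when the mask contains no object pixels.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


def pixel_count(mask):
    """Segmentation keeps per-pixel detail, so the object's area can be measured."""
    return sum(sum(row) for row in mask)
```

Classification and identification would return a label instead (a broad category or a specific type), and recognition combines labels with locations.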

Other common examples are related to information retrieval; for example: finding images like an image or images that contain an object.
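As a rough sketch of "finding images like an image", the code below compares images by their intensity histograms and returns the closest match from a small library. This is only one simple way to measure similarity, chosen because it fits in a few lines; the library contents, the four-bin histogram, and the function names are all assumptions made for the example.

```python
def histogram(img, bins=4):
    """Fraction of pixels in each intensity bin (0-255 split into equal bins)."""
    counts = [0] * bins
    total = 0
    for row in img:
        for p in row:
            counts[min(p * bins // 256, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def histogram_distance(h1, h2):
    """L1 distance between two histograms; 0 means identical distributions."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def most_similar(query, library):
    """Return the name of the library image whose histogram is closest to the query's."""
    query_hist = histogram(query)
    return min(
        library,
        key=lambda name: histogram_distance(query_hist, histogram(library[name])),
    )


# Hypothetical 3x3 grayscale images keyed by a descriptive name.
library = {
    "night_sky": [[5, 10, 8], [12, 7, 9], [6, 11, 10]],
    "snow_field": [[250, 245, 240], [248, 252, 246], [251, 249, 247]],
    "checkerboard": [[0, 255, 0], [255, 0, 255], [0, 255, 0]],
}
query = [[3, 9, 14], [8, 6, 12], [10, 7, 5]]  # another dark image

best_match = most_similar(query, library)
```

A histogram ignores where pixels are, so two very different scenes with similar overall brightness would match. That limitation is a small taste of why retrieval by content is hard.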


In this post, you discovered a gentle introduction to the field of CV.

The goal of the field of CV and its distinctness from image processing.

What makes the problem of CV challenging.

Typical problems or tasks pursued in CV.

If I asked you to name the objects in the picture below, you would probably come up with a list of words such as “tablecloth, basket, grass, boy, girl, man, woman, orange juice bottle, tomatoes, lettuce, disposable plates…” without thinking twice. Now, if I told you to describe the picture below, you would probably say, “It’s the picture of a family picnic” again without giving it a second thought.

Those are two very easy tasks that almost anyone above the age of six or seven could accomplish, even a person of below-average intelligence. However, in the background, a very complicated process takes place. Human vision is a very intricate piece of organic technology that involves our eyes and visual cortex, but also takes into account our mental models of objects, our abstract understanding of concepts, and our personal experiences through the billions and trillions of interactions we’ve had with the world in our lives.

Digital equipment can capture images at resolutions and with detail that far surpasses the human vision system. Computers can also detect and measure the difference between colors with very high accuracy. But making sense of the content of those images is a problem that computers have been struggling with for decades. To a computer, the above picture is an array of pixels or numerical values that represent colors.
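Here is a minimal illustration of what "an array of pixels" means in practice. The 2x3 image is hypothetical, with each pixel stored as a red, green, blue triple in the range 0 to 255, and the helper names are made up for this sketch.

```python
# A hypothetical 2x3 color image: each pixel is an (R, G, B) triple, 0-255.
image = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(255, 255, 255), (128, 128, 128), (0, 0, 0)],
]

height = len(image)
width = len(image[0])


def to_grayscale(img):
    """Average the three color channels of each pixel into a single intensity."""
    return [[sum(pixel) // 3 for pixel in row] for row in img]


def color_difference(a, b):
    """Euclidean distance between two RGB colors; 0.0 means identical colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


gray = to_grayscale(image)
red_vs_green = color_difference(image[0][0], image[0][1])
```

Measuring the difference between two colors is simple arithmetic, which is why computers are so precise at it. Nothing in these numbers says "picnic" or "family"; that meaning has to be inferred.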

Computer vision is the field of computer science that focuses on replicating parts of the complexity of the human visual system and enabling computers to identify and process objects in images and videos in the same way that humans do. Until recently, computer vision only worked in a limited capacity.

Thanks to advances in artificial intelligence and innovations in deep learning (DL) and neural networks, the field has been able to take great leaps in recent years and has been able to surpass humans in some tasks related to detecting and labeling objects.

Applications of CV

The importance of CV is in the problems it can solve. It is one of the main technologies that enable the digital world to interact with the physical world.

CV enables self-driving cars to make sense of their surroundings. Cameras capture video from different angles around the car and feed it to computer vision software, which then processes the images in real-time to find the extremities of roads, read traffic signs, detect other cars, objects, and pedestrians. The self-driving car can then steer its way on streets and highways, avoid hitting obstacles, and (hopefully) safely drive its passengers to their destination.

It also plays an important role in facial recognition applications, the technology that enables computers to match images of people’s faces to their identities. CV algorithms detect facial features in images and compare them with databases of face profiles. Consumer devices use facial recognition to authenticate the identities of their owners. Social media apps use facial recognition to detect and tag users. Law enforcement agencies also rely on facial recognition technology to identify criminals in video feeds.

It also plays an important role in augmented and mixed reality, the technology that enables computing devices such as smartphones, tablets, and smart glasses to overlay and embed virtual objects on real-world imagery. Using CV, AR gear detects objects in the real world in order to determine the locations on a device’s display to place a virtual object. For instance, computer vision algorithms can help AR applications detect planes such as tabletops, walls, and floors, a very important part of establishing depth and dimensions and placing virtual objects in the physical world.

Online photo libraries like Google Photos use CV to detect objects and automatically classify your images by the type of content they contain. This can save you much of the time that you would have otherwise spent adding tags and descriptions to your pictures. CV also lets users search video by typing in the type of content they’re looking for instead of manually looking through entire videos.

It has also been an important part of advances in health-tech. CV algorithms can help automate tasks such as detecting cancerous moles in skin images or finding symptoms in X-ray and MRI scans.

It has other, more nuanced applications. For instance, imagine a smart home security camera that is constantly sending a video of your home to the cloud and enables you to remotely review the footage. Using CV, you can configure the cloud application to automatically notify you if something abnormal happens, such as an intruder lurking around your home or something catching fire inside the house. This can save you a lot of time by giving you the assurance that there’s a watchful eye constantly looking at your home. The U.S. military is already using computer vision to analyze and flag video content captured by cameras and drones (though the practice has already become the source of many controversies).

Taking the above example a step further, you can instruct the security application to only store footage that the CV algorithm has flagged as abnormal. This will help you save tons of storage space in the cloud because in nearly all cases, most of the footage your security camera captures is benign and doesn’t need review.

Furthermore, if you can deploy CV at the edge on the security camera itself, you’ll be able to instruct it to only send its video feed to the cloud if it has flagged its content as needing further review and investigation. This will enable you to save network bandwidth by only sending what’s necessary to the cloud.
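The store-or-upload-only-what-is-flagged idea can be sketched with a deliberately crude stand-in for the CV algorithm: flag a frame when it differs a lot from the previous one. Real products would use far more capable models; the frames, the threshold of 20, and the function names below are assumptions made for illustration.

```python
def frame_difference(prev, curr):
    """Mean absolute per-pixel change between two grayscale frames of equal size."""
    total = 0
    count = 0
    for prev_row, curr_row in zip(prev, curr):
        for a, b in zip(prev_row, curr_row):
            total += abs(a - b)
            count += 1
    return total / count


def frames_to_upload(frames, threshold=20.0):
    """Return the indices of frames that changed by more than threshold.

    Only these frames would be stored or sent to the cloud;
    everything else is treated as benign and dropped on the device.
    """
    flagged = []
    for i in range(1, len(frames)):
        if frame_difference(frames[i - 1], frames[i]) > threshold:
            flagged.append(i)
    return flagged


# Hypothetical 3x3 frames: a dark empty room, then a bright shape appears and leaves.
empty_room = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
intruder = [[10, 200, 10], [10, 200, 10], [10, 200, 10]]
frames = [empty_room, empty_room, intruder, intruder, empty_room]

flagged = frames_to_upload(frames)
```

In this toy run only the frames where something appears or disappears are flagged, so two of the five frames would use storage or bandwidth.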

Evolution of CV

Before the advent of deep learning, the tasks that CV could perform were very limited and required a lot of manual coding and effort by developers and human operators.

For instance, if you wanted to perform facial recognition, you would have to perform the following steps:

Create a database: You had to capture individual images of all the subjects you wanted to track in a specific format.

Annotate images: Then for every individual image, you would have to enter several key data points, such as distance between the eyes, the width of the nose bridge, the distance between upper lip and nose, and dozens of other measurements that define the unique characteristics of each person.

Capture new images: Next, you would have to capture new images, whether from photographs or video content. And then you had to go through the measurement process again, marking the key points on the image. You also had to factor in the angle the image was taken.

After all this manual work, the application would finally be able to compare the measurements in the new image with the ones stored in its database and tell you whether it corresponded with any of the profiles it was tracking. In fact, there was very little automation involved and most of the work was being done manually. And the error margin was still large.
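The comparison step at the end of that manual pipeline might have looked something like the sketch below: each person is a short list of hand-entered measurements, and a new face is matched to the nearest stored profile if it is close enough. The names, the measurements, and the cutoff of 2.0 are all invented for the example.

```python
import math

# Hypothetical hand-entered measurements, in millimetres, for each enrolled person:
# (distance between the eyes, width of the nose bridge, upper lip to nose distance)
database = {
    "alice": (62.0, 18.0, 14.0),
    "bob": (68.0, 21.0, 17.0),
    "carol": (59.0, 16.5, 12.5),
}


def distance(a, b):
    """Euclidean distance between two measurement tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(measurements, database, max_distance=2.0):
    """Return the closest enrolled name, or None if nobody is within max_distance."""
    best_name = min(
        database, key=lambda name: distance(measurements, database[name])
    )
    if distance(measurements, database[best_name]) <= max_distance:
        return best_name
    return None


# Measurements taken by hand from a new photograph.
match = identify((62.5, 18.2, 13.8), database)
stranger = identify((80.0, 30.0, 25.0), database)
```

The code itself is trivial. The expensive and error-prone part was producing the numbers that go into it, which is exactly the work that had to be done by hand.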

Machine learning (ML) provided a different approach to solving CV problems. With ML, developers no longer needed to manually code every single rule into their vision applications. Instead, they programmed “features,” smaller applications that could detect specific patterns in images. They then used a statistical learning algorithm such as linear regression, logistic regression, decision trees, or support vector machines (SVM) to detect patterns and classify images and detect objects in them.
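A toy version of this approach is sketched below: a hand-written feature extractor measures how much an image changes horizontally and vertically, and a small logistic regression learns from labeled examples to tell vertical stripes (label 1) from horizontal stripes (label 0). The task, the images, and every function name are assumptions chosen to keep the example tiny.

```python
import math


def edge_features(img):
    """Hand-crafted features: mean absolute change between horizontal neighbors
    and between vertical neighbors, each scaled to the range 0-1."""
    h, w = len(img), len(img[0])
    horiz = sum(abs(img[r][c] - img[r][c + 1]) for r in range(h) for c in range(w - 1))
    vert = sum(abs(img[r][c] - img[r + 1][c]) for r in range(h - 1) for c in range(w))
    return [horiz / (255 * h * (w - 1)), vert / (255 * (h - 1) * w)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, lr=0.5, epochs=200):
    """Fit weights and a bias with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = y - p
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(img, weights, bias):
    """Return 1 for vertical stripes, 0 for horizontal stripes."""
    x = edge_features(img)
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


def vertical_stripes(dark, bright, size=4):
    return [[bright if c % 2 else dark for c in range(size)] for _ in range(size)]


def horizontal_stripes(dark, bright, size=4):
    return [[bright if r % 2 else dark] * size for r in range(size)]


train_images = [
    vertical_stripes(0, 255), vertical_stripes(50, 200),
    horizontal_stripes(0, 255), horizontal_stripes(50, 200),
]
train_labels = [1, 1, 0, 0]
weights, bias = train_logistic_regression(
    [edge_features(img) for img in train_images], train_labels
)
```

The learning algorithm only ever sees the two numbers produced by edge_features. Choosing and coding good features was the developer's job, and for real problems that job could be very large.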

ML helped solve many problems that were historically challenging for classical software development tools and approaches. For instance, years ago, machine learning engineers were able to create software that could predict breast cancer survival windows better than human experts. However, as AI expert Jeremy Howard explains, building the features of the software required the efforts of dozens of engineers and breast cancer experts and took a lot of time to develop.

Deep learning (DL) provided a fundamentally different approach to doing ML. DL relies on neural networks, general-purpose function approximators that can, in principle, learn many problems that are representable through examples. When you provide a neural network with many labeled examples of a specific kind of data, it’ll be able to extract common patterns between those examples and transform them into a mathematical equation that will help classify future pieces of information.

For instance, creating a facial recognition application with deep learning only requires you to develop or choose a preconstructed algorithm and train it with examples of the faces of the people it must detect. Given enough examples (lots of examples), the neural network will be able to detect faces without further instructions on features or measurements.

DL is a very effective method to do computer vision. In most cases, creating a good DL algorithm comes down to gathering a large amount of labeled training data and tuning parameters such as the type and number of layers in the neural network and the number of training epochs. Compared to previous types of machine learning, deep learning is both easier and faster to develop and deploy.
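To show the difference in workflow, here is a very small neural network written in plain Python. It is nowhere near "deep" and is only a sketch: one hidden layer, trained on a handful of made-up 2x2 images whose left or right column is bright. The point is that the network is given raw pixel values and labels, with no hand-written features, and the knobs mentioned above (number of hidden units, number of epochs) appear as ordinary parameters. All names and values are assumptions for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Return the hidden activations and the output probability for one image."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def train(examples, labels, hidden_units=3, epochs=500, lr=0.5, seed=0):
    """Train a one-hidden-layer network with stochastic gradient descent."""
    rng = random.Random(seed)
    n_inputs = len(examples[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(hidden_units)]
    b1 = [0.0] * hidden_units
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_units)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            hidden, output = forward(x, W1, b1, W2, b2)
            # Gradient of the cross-entropy loss at the output pre-activation.
            d_out = output - y
            for j in range(hidden_units):
                # Backpropagate through tanh: its derivative is 1 - tanh(z)^2.
                d_hidden = d_out * W2[j] * (1.0 - hidden[j] ** 2)
                W2[j] -= lr * d_out * hidden[j]
                b1[j] -= lr * d_hidden
                for i in range(n_inputs):
                    W1[j][i] -= lr * d_hidden * x[i]
            b2 -= lr * d_out
    return W1, b1, W2, b2


def predict(x, params):
    """Return 1 if the network thinks the left column is the bright one, else 0."""
    _, output = forward(x, *params)
    return 1 if output >= 0.5 else 0


# Hypothetical 2x2 images flattened row by row, pixel values scaled to 0-1.
# Label 1: left column bright. Label 0: right column bright.
examples = [
    [1.0, 0.0, 1.0, 0.0], [0.9, 0.1, 0.8, 0.2], [0.8, 0.0, 0.9, 0.1],
    [0.0, 1.0, 0.0, 1.0], [0.1, 0.9, 0.2, 0.8], [0.0, 0.8, 0.1, 0.9],
]
labels = [1, 1, 1, 0, 0, 0]

params = train(examples, labels)
```

Changing hidden_units or epochs is the small-scale analogue of the tuning described above. A real vision model would have many more layers and need far more labeled data.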

Most current CV applications such as cancer detection, self-driving cars, and facial recognition make use of DL. DL and deep neural networks have moved from the conceptual realm into practical applications thanks to availability and advances in hardware and cloud computing resources. However, deep learning algorithms have their own limits, most notable among them being lack of transparency and interpretability.

Limits of CV

Thanks to DL, CV has been able to solve the first of the two problems mentioned at the beginning of this article, meaning the detecting and classifying of objects in images and videos. In fact, deep learning has been able to exceed human performance in image classification.

However, despite the nomenclature that is reminiscent of human intelligence, neural networks function in a way that is fundamentally different from the human mind. The human visual system relies on identifying objects based on a 3D model that we build in our minds. We are also able to transfer knowledge from one domain to another. For instance, if we see a new animal for the first time, we can quickly identify some of the body parts found in most animals such as the nose, ears, tail, legs…

Deep neural networks have no notion of such concepts and they develop their knowledge of each class of data individually. At their heart, neural networks are statistical models that compare batches of pixels, though in very intricate ways. That’s why they need to see many examples before they can develop the necessary foundations to recognize every object. Accordingly, neural networks can make stupid (and dangerous) mistakes when not trained properly.

But where CV is really struggling is in understanding the context of images and the relations between the objects in them. We humans can quickly tell without a second thought that the picture at the beginning of the article is that of a family picnic because we have an understanding of the abstract concepts it represents. We know what a family is. We know that a stretch of grass is a pleasant place to be. We know that people usually eat at tables, and an outdoor event sitting on the ground around a tablecloth is probably a leisure event, especially when all the people in the picture are happy. All of that and countless other little experiences we’ve had in our lives quickly go through our minds when we see the picture. Likewise, if I tell you about something unusual, like a “winter picnic” or a “volcano picnic,” you can quickly put together a mental image of what such an exotic event would look like.

For a CV algorithm, pictures are still arrays of color pixels that can be statistically mapped to certain descriptions. Unless you specifically train a neural network on pictures of family picnics, it won’t be able to make the connection between the different objects it sees in a photo. Even when trained, the network will only have a statistical model that will probably label any picture that has a lot of grass, several people, and tablecloths as a “family picnic.” It won’t know what a picnic is contextually. Accordingly, it might mistakenly classify a picture of a poor family with sad looks and sooty faces eating outdoors as a happy family picnic. And it probably won’t be able to tell that the following picture is a drawing of an animal picnic.

Some experts believe that true CV can only be achieved when we crack the code of general AI, AI that has the abstract and commonsense capabilities of the human mind. We don’t know when—or if—that will ever happen. Until then, or until we find some other way to represent concepts in a way that can also leverage the strengths of neural networks, we’ll have to throw more and more data at our CV algorithms, hoping that we can account for every possible type of object and context they should be able to recognize.

