CCS338 Computer Vision Lecture Notes 1
UNIT-1
INTRODUCTION TO IMAGE FORMATION AND PROCESSING
Computer Vision - Geometric primitives and transformations -
Photometric image formation - The digital camera - Point operators -
Linear filtering - More neighborhood operators - Fourier transforms -
Pyramids and wavelets - Geometric transformations - Global
optimization.
1. Computer Vision:
Computer vision is a field of artificial intelligence that enables machines to interpret and understand visual information from digital images and videos.
Computer vision applications are diverse and found in various fields, including
healthcare (medical image analysis), autonomous vehicles, surveillance,
augmented reality, robotics, industrial automation, and more. Advances in deep
learning, especially convolutional neural networks (CNNs), have significantly
contributed to the progress and success of computer vision tasks by enabling
efficient feature learning from large datasets.
2. Geometric primitives and transformations:
Geometric primitives are the basic building blocks used to describe the geometry of a scene. Common primitives include:
1. Points: Individual locations in 2D or 3D space, defined by their coordinates.
2. Lines and Line Segments: Defined by two points or a point and a direction vector.
3. Polygons: Closed shapes with straight sides. Triangles, quadrilaterals, and other
polygons are common geometric primitives.
4. Circles and Ellipses: Defined by a center point and radii (or axes in the case of
ellipses).
5. Curves: Bézier curves, spline curves, and other parametric curves are used to
represent smooth shapes.
Geometric Transformations:
Geometric transformations involve modifying the position, orientation, and scale of
geometric primitives. Common transformations include translation, rotation, scaling, shearing, and affine or projective mappings, which are described in detail in Section 10.
Applications:
Computer Graphics: Geometric primitives and transformations are fundamental
for rendering 2D and 3D graphics in applications such as video games,
simulations, and virtual reality.
Computer-Aided Design (CAD): Used for designing and modeling objects in
engineering and architecture.
Computer Vision: Geometric transformations are applied to align and process
images, correct distortions, and perform other tasks in image analysis.
Robotics: Essential for robot navigation, motion planning, and spatial reasoning.
3. Photometric image formation:
Photometric image formation refers to the process by which light interacts with
surfaces and is captured by a camera, resulting in the creation of a digital image. This
process involves various factors related to the properties of light, the surfaces of
objects, and the characteristics of the imaging system. Understanding photometric image formation is important for tasks such as shading analysis, color constancy, and physically based rendering. The key factors are summarized below.
Illumination:
● Ambient Light: The overall illumination of a scene that comes from all
directions.
● Directional Light: Light coming from a specific direction, which can create
highlights and shadows.
Reflection:
● Diffuse Reflection: Light that is scattered in various directions by rough
surfaces.
● Specular Reflection: Light that reflects off smooth surfaces in a
concentrated direction, creating highlights.
Shading:
● Lambertian Shading: A model that assumes diffuse reflection and
constant shading across a surface.
● Phong Shading: A more sophisticated model that considers specular
reflection, creating more realistic highlights.
Surface Properties:
● Reflectance Properties: Material characteristics that determine how light
is reflected (e.g., diffuse and specular reflectance).
● Albedo: The inherent reflectivity of a surface, representing the fraction of
incident light that is reflected.
Lighting Models:
● Lighting models, such as the Phong model, combine ambient, diffuse, and specular terms to approximate how a surface reflects light toward the camera.
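As a rough illustration of these shading models, the short sketch below (NumPy assumed available; the normal, light, and view directions are made-up values) evaluates a Lambertian diffuse term plus a Phong-style specular term at a single surface point:

import numpy as np

def shade_point(normal, light_dir, view_dir, albedo=0.8,
                specular=0.5, shininess=32, light_intensity=1.0):
    """Lambertian diffuse + Phong-style specular shading at one surface point."""
    n = normal / np.linalg.norm(normal)
    l = light_dir / np.linalg.norm(light_dir)
    v = view_dir / np.linalg.norm(view_dir)
    # Diffuse term: proportional to the cosine of the angle between n and l.
    diffuse = albedo * max(np.dot(n, l), 0.0)
    # Specular term: mirror-reflect l about n and compare with the view direction.
    r = 2.0 * np.dot(n, l) * n - l
    spec = specular * max(np.dot(r, v), 0.0) ** shininess
    return light_intensity * (diffuse + spec)

print(shade_point(np.array([0.0, 0.0, 1.0]),      # surface normal
                  np.array([0.0, 0.5, 1.0]),      # light direction
                  np.array([0.0, 0.0, 1.0])))     # viewing direction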
4. The digital camera:
A digital camera is an electronic device that captures and stores digital images. It
differs from traditional film cameras in that it uses electronic sensors to record images
rather than photographic film. Digital cameras have become widespread due to their
convenience, ability to instantly review images, and ease of sharing and storing photos
digitally. Here are key components and concepts related to digital cameras:
Image Sensor:
● Digital cameras use image sensors (such as CCD or CMOS) to convert
light into electrical signals.
● The sensor captures the image by measuring the intensity of light at each
pixel location.
Lens:
● The lens focuses light onto the image sensor.
● Zoom lenses allow users to adjust the focal length, providing optical
zoom.
Aperture:
● The aperture is an adjustable opening in the lens that controls the amount
of light entering the camera.
5.Point operators:
Point operators, also known as point processing or pixel-wise operations, are basic
image processing operations that operate on individual pixels independently. These
operations are applied to each pixel in an image without considering the values of
neighboring pixels. Point operators typically involve mathematical operations or
functions that transform the pixel values, resulting in changes to the image's
appearance. Here are some common point operators:
Brightness Adjustment:
● Addition/Subtraction: Increase or decrease the intensity of all pixels by
adding or subtracting a constant value.
● Multiplication/Division: Scale the intensity values by multiplying or dividing
them by a constant factor.
Contrast Adjustment:
● Linear Contrast Stretching: Rescale the intensity values to cover the full
dynamic range.
● Histogram Equalization: Adjust the distribution of pixel intensities to
enhance contrast.
Gamma Correction:
● Adjust the gamma value to control the overall brightness and contrast of an image.
Thresholding:
● Convert a grayscale image to binary by setting a threshold value. Pixels
with values above the threshold become white, and those below become
black.
Bit-plane Slicing:
● Decompose an image into its binary representation by considering
individual bits.
Color Mapping:
● Apply color transformations to change the color balance or convert
between color spaces (e.g., RGB to grayscale).
Inversion:
● Invert the intensity values of pixels, turning bright areas dark and vice
versa.
Image Arithmetic:
● Perform arithmetic operations between pixels of two images, such as
addition, subtraction, multiplication, or division.
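The following sketch illustrates several of these point operators with NumPy and OpenCV; the file name is a placeholder, and any 8-bit grayscale image would do:

import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Brightness adjustment: add a constant, clipping to the valid 0..255 range.
brighter = np.clip(img.astype(np.int16) + 40, 0, 255).astype(np.uint8)

# Linear contrast stretching to the full dynamic range.
stretched = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX)

# Histogram equalization.
equalized = cv2.equalizeHist(img)

# Gamma correction: normalize to [0, 1], apply the power law, rescale.
gamma = 0.5
gamma_corrected = np.uint8(255 * (img / 255.0) ** gamma)

# Thresholding: pixels above 128 become white, others black.
_, binary = cv2.threshold(img, 128, 255, cv2.THRESH_BINARY)

# Inversion.
inverted = 255 - img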
Point operators are foundational in image processing and form the basis for more
complex operations. They are often used in combination to achieve desired
enhancements or modifications to images. These operations are computationally
efficient, as they can be applied independently to each pixel, making them suitable for
real-time applications and basic image manipulation tasks.
It's important to note that while point operators are powerful for certain tasks, more
advanced image processing techniques, such as filtering and convolution, involve
considering the values of neighboring pixels and are applied to local image regions.
6. Linear filtering:
Linear filtering computes each output pixel as a weighted sum of the input pixels in a local neighborhood, with the weights given by a filter kernel. The general formula for linear filtering (correlation/convolution) is:
g(i, j) = Σ_k Σ_l f(i + k, j + l) h(k, l)
Where:
● g is the output (filtered) image,
● f is the input image, and
● h is the filter kernel (mask) whose entries are the weights applied to the neighborhood.
Blurring/Smoothing:
● Average filter: Each output pixel is the average of its neighboring pixels.
● Gaussian filter: Applies a Gaussian distribution to compute weights for
pixel averaging.
Edge Detection:
● Sobel filter: Emphasizes edges by computing gradients in the x and y
directions.
● Prewitt filter: Similar to Sobel but uses a different kernel for gradient
computation.
Sharpening:
● Laplacian filter: Enhances high-frequency components to highlight edges.
● High-pass filter: Emphasizes details by subtracting a blurred version of the
image.
Embossing:
● Applies an embossing effect by highlighting changes in intensity.
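A brief sketch of these filters using OpenCV (the kernel sizes and the sharpening kernel are illustrative choices, not prescribed values):

import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Blurring: 5x5 box (average) filter and a Gaussian filter.
box_blur = cv2.blur(img, (5, 5))
gauss_blur = cv2.GaussianBlur(img, (5, 5), sigmaX=1.5)

# Edge detection: Sobel gradients in x and y, combined into a magnitude image.
grad_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
grad_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
edges = cv2.magnitude(grad_x, grad_y)

# Sharpening: convolve with a Laplacian-based kernel via filter2D.
sharpen_kernel = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]], dtype=np.float32)
sharpened = cv2.filter2D(img, -1, sharpen_kernel)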
Linear filtering is a versatile technique and forms the basis for more advanced image
processing operations. The convolution operation can be efficiently implemented using
convolutional neural networks (CNNs) in deep learning, where filters are learned during
the training process to perform tasks such as image recognition, segmentation, and
denoising. The choice of filter kernel and parameters determines the specific effect
achieved through linear filtering.
7. More neighborhood operators:
Neighborhood operators compute each output pixel from the pixel values in a local window. Unlike linear filters, they are not restricted to weighted sums, which makes them useful for edge-preserving smoothing and noise removal. Common examples include:
Median Filter:
● Computes the median value of pixel intensities within a local
neighborhood.
● Effective for removing salt-and-pepper noise while preserving edges.
Gaussian Filter:
● Applies a weighted average to pixel values using a Gaussian distribution.
● Used for blurring and smoothing, with the advantage of preserving edges.
Bilateral Filter:
● Combines spatial and intensity information to smooth images while
preserving edges.
● Uses two Gaussian distributions, one for spatial proximity and one for
intensity similarity.
Non-local Means Filter:
● Computes the weighted average of pixel values based on similarity in a
larger non-local neighborhood.
● Effective for denoising while preserving fine structures.
Anisotropic Diffusion:
● Reduces noise while preserving edges by iteratively diffusing intensity
values along edges.
● Particularly useful for images with strong edges.
Morphological Operators:
● Dilation: Expands bright regions by considering the maximum pixel value in a neighborhood.
● Erosion: Shrinks bright regions by considering the minimum pixel value in a neighborhood.
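A short sketch of these neighborhood operators with OpenCV (the file name and the filter parameters are illustrative):

import cv2
import numpy as np

img = cv2.imread("noisy.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Median filter: effective against salt-and-pepper noise.
median = cv2.medianBlur(img, 5)

# Bilateral filter: spatial and intensity sigmas are illustrative values.
bilateral = cv2.bilateralFilter(img, d=9, sigmaColor=75, sigmaSpace=75)

# Morphological dilation and erosion with a 3x3 structuring element.
kernel = np.ones((3, 3), np.uint8)
dilated = cv2.dilate(img, kernel, iterations=1)
eroded = cv2.erode(img, kernel, iterations=1)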
8.Fourier transforms:
Fourier transforms play a significant role in computer vision for analyzing and
processing images. They are used to decompose an image into its frequency
components, providing valuable information for tasks such as image filtering, feature
extraction, and pattern recognition. Here are some ways Fourier transforms are
employed in computer vision:
Frequency Analysis:
● Fourier transforms help in understanding the frequency content of an
image. High-frequency components correspond to edges and fine details,
while low-frequency components represent smooth regions.
Image Filtering:
● Filtering can be performed in the frequency domain by multiplying the image spectrum with a mask (low-pass to smooth, high-pass to sharpen or emphasize edges) and then applying the inverse transform.
The efficient computation of Fourier transforms, particularly through the use of the Fast
Fourier Transform (FFT) algorithm, has made these techniques computationally feasible
for real-time applications in computer vision. The ability to analyze images in the
frequency domain provides valuable insights and contributes to the development of
advanced image processing techniques.
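The sketch below shows a simple frequency-domain low-pass filter using NumPy's FFT routines; the cutoff radius of 30 pixels is an arbitrary illustrative choice:

import numpy as np
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Forward transform, with the zero frequency shifted to the center.
spectrum = np.fft.fftshift(np.fft.fft2(img))

# Ideal low-pass mask: keep a circle of radius 30 around the center.
rows, cols = img.shape
cy, cx = rows // 2, cols // 2
y, x = np.ogrid[:rows, :cols]
mask = (y - cy) ** 2 + (x - cx) ** 2 <= 30 ** 2

# Apply the mask and invert the transform to obtain a smoothed image.
filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real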
9. Pyramids and wavelets:
Image Pyramids:
Image pyramids are a series of images representing the same scene but at different
resolutions. There are two main types of image pyramids:
Gaussian Pyramid:
● Created by repeatedly applying Gaussian smoothing and downsampling to
an image.
● At each level, the image is smoothed to remove high-frequency
information, and then it is subsampled to reduce its size.
● Useful for tasks like image blending, image matching, and coarse-to-fine
image processing.
Laplacian Pyramid:
● Derived from the Gaussian pyramid.
● Each level of the Laplacian pyramid is obtained by subtracting the expanded (upsampled) version of the next-coarser Gaussian level from the corresponding level of the Gaussian pyramid.
● Useful for image compression and coding, where the Laplacian pyramid
represents the residual information not captured by the Gaussian pyramid.
Image pyramids are especially useful for creating multi-scale representations of images,
which can be beneficial for various computer vision tasks.
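A minimal sketch of building Gaussian and Laplacian pyramids with OpenCV (three levels chosen arbitrarily):

import cv2

img = cv2.imread("input.png")  # placeholder path

# Gaussian pyramid: repeatedly smooth and downsample.
gaussian = [img]
for _ in range(3):
    gaussian.append(cv2.pyrDown(gaussian[-1]))

# Laplacian pyramid: each level is the Gaussian level minus the
# upsampled (expanded) version of the next-coarser level.
laplacian = []
for i in range(len(gaussian) - 1):
    size = (gaussian[i].shape[1], gaussian[i].shape[0])
    expanded = cv2.pyrUp(gaussian[i + 1], dstsize=size)
    laplacian.append(cv2.subtract(gaussian[i], expanded))
laplacian.append(gaussian[-1])  # coarsest level stored as-is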
Wavelets:
Wavelets are mathematical functions that can be used to analyze signals and images.
Wavelet transforms provide a multi-resolution analysis by decomposing an image into
approximation (low-frequency) and detail (high-frequency) components. Key concepts
include:
Wavelet Transform:
● The wavelet transform decomposes an image into different frequency
components by convolving the image with wavelet functions.
● The result is a set of coefficients that represent the image at various
scales and orientations.
Multi-resolution Analysis:
● Wavelet transforms offer a multi-resolution analysis, allowing the
representation of an image at different scales.
● The approximation coefficients capture the low-frequency information,
while detail coefficients capture high-frequency information.
Haar Wavelet:
● The Haar wavelet is a simple wavelet function used in basic wavelet
transforms.
● It represents changes in intensity between adjacent pixels.
Wavelet Compression:
● Wavelet-based image compression techniques, such as JPEG2000, utilize
wavelet transforms to efficiently represent image data in both spatial and
frequency domains.
Image Denoising:
● Wavelet-based thresholding techniques can be applied to denoise images
by thresholding the wavelet coefficients.
Edge Detection:
● Wavelet transforms can be used for edge detection by analyzing the
high-frequency components of the image.
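The following sketch, assuming the PyWavelets package (pywt) is installed, performs a single-level Haar wavelet decomposition and a simple threshold-based denoising; the threshold value is illustrative:

import numpy as np
import pywt  # PyWavelets, assumed installed (pip install PyWavelets)
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Single-level 2D Haar wavelet transform.
cA, (cH, cV, cD) = pywt.dwt2(img, "haar")
# cA: approximation (low frequency); cH/cV/cD: horizontal, vertical,
# and diagonal detail (high frequency) coefficients.

# Simple denoising by hard-thresholding the detail coefficients.
threshold = 20.0
cH, cV, cD = (np.where(np.abs(c) > threshold, c, 0) for c in (cH, cV, cD))
denoised = pywt.idwt2((cA, (cH, cV, cD)), "haar")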
Both pyramids and wavelets offer advantages in multi-resolution analysis, but they differ
in terms of their representation and construction. Pyramids use a hierarchical structure
of smoothed and subsampled images, while wavelets use a transform-based approach
that decomposes the image into frequency components. The choice between pyramids
and wavelets often depends on the specific requirements of the image processing task
at hand.
10. Geometric transformations:
1. Translation:
● Description: Moves an object by a specified distance along the x and/or y axes.
● Transformation Matrix (2D, homogeneous coordinates):
[ 1  0  tx ]
[ 0  1  ty ]
[ 0  0  1  ]
● Applications: Object movement, image registration.
2. Rotation:
● Description: Rotates an object by a specified angle about a fixed point.
● Transformation Matrix (2D, rotation about the origin; rotation about an arbitrary point is obtained by composing with translations):
[ cos θ  -sin θ  0 ]
[ sin θ   cos θ  0 ]
[   0       0    1 ]
3. Scaling:
● Description: Changes the size of an object by multiplying its coordinates by
scaling factors.
● Transformation Matrix (2D):
[ sx  0   0 ]
[ 0   sy  0 ]
[ 0   0   1 ]
4. Shearing:
● Description: Skews an object along the x and/or y axis so that parallel lines remain parallel while angles between them change.
5. Affine Transformation:
● Description: Combines translation, rotation, scaling, and shearing.
6. Perspective Transformation:
● Description: Represents a perspective projection, useful for simulating
three-dimensional effects.
● Transformation Matrix: A 3×3 matrix acting on homogeneous image coordinates (a homography with 8 degrees of freedom).
7. Projective Transformation:
● Description: Generalization of perspective transformation with additional control
points.
● Transformation Matrix (3D): More complex than the perspective transformation
matrix.
● Applications: Computer graphics, augmented reality.
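A compact sketch of these transformations applied to an image with OpenCV and NumPy (the translation offsets, angle, scale factors, and corner correspondences are illustrative):

import cv2
import numpy as np

img = cv2.imread("input.png")  # placeholder path
h, w = img.shape[:2]

# Translation by (tx, ty) = (50, 20) using a 2x3 affine matrix.
T = np.float32([[1, 0, 50],
                [0, 1, 20]])
translated = cv2.warpAffine(img, T, (w, h))

# Rotation by 30 degrees about the image center (uniform scale 1.0).
R = cv2.getRotationMatrix2D((w / 2, h / 2), 30, 1.0)
rotated = cv2.warpAffine(img, R, (w, h))

# Scaling by independent factors along x and y.
scaled = cv2.resize(img, None, fx=1.5, fy=0.75)

# Perspective (projective) warp defined by four point correspondences.
src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
dst = np.float32([[0, 0], [w, 0], [int(0.8 * w), h], [int(0.2 * w), h]])
H = cv2.getPerspectiveTransform(src, dst)
warped = cv2.warpPerspective(img, H, (w, h))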
11. Global optimization:
Global optimization is a branch of optimization that focuses on finding the global
minimum or maximum of a function over its entire feasible domain. Unlike local
optimization, which aims to find the optimal solution within a specific region, global
optimization seeks the best possible solution across the entire search space. Global
optimization problems are often challenging due to the presence of multiple local
optima or complex, non-convex search spaces.
Concepts:
Objective Function:
● The function to be minimized or maximized.
Feasible Domain:
● The set of input values (parameters) for which the objective function is
defined.
Global Minimum/Maximum:
● The lowest or highest value of the objective function over the entire
feasible domain.
Local Minimum/Maximum:
● A minimum or maximum within a specific region of the feasible domain.
Approaches:
Grid Search:
● Dividing the feasible domain into a grid and evaluating the objective
function at each grid point to find the optimal solution.
Random Search:
● Randomly sampling points in the feasible domain and evaluating the
objective function to explore different regions.
Evolutionary Algorithms:
● Genetic algorithms, particle swarm optimization, and other evolutionary
techniques use populations of solutions and genetic operators to
iteratively evolve toward the optimal solution.
Simulated Annealing:
● Inspired by the annealing process in metallurgy, simulated annealing
gradually decreases the temperature to allow the algorithm to escape
local optima.
Ant Colony Optimization:
● Inspired by the foraging behavior of ants, this algorithm uses pheromone
trails to guide the search for the optimal solution.
Genetic Algorithms:
● Inspired by biological evolution, genetic algorithms use mutation,
crossover, and selection to evolve a population of potential solutions.
Particle Swarm Optimization:
● Simulates the social behavior of birds or fish, where a swarm of particles
moves through the search space to find the optimal solution.
Bayesian Optimization:
● Utilizes probabilistic models to model the objective function and guide the
search toward promising regions.
Quasi-Newton Methods:
● Iterative optimization methods that use an approximation of the Hessian
matrix to find the optimal solution efficiently.
The choice of global optimization approach depends on the properties of the objective function, the dimensionality of the search space, and the available computational resources.
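A small self-contained sketch of two of these approaches, random search and simulated annealing, on a made-up multimodal objective (all parameters are illustrative):

import numpy as np

def objective(x):
    """Toy multimodal objective with many local minima."""
    return np.sum(x ** 2) + 10 * np.sum(np.sin(3 * x) ** 2)

rng = np.random.default_rng(0)
bounds = (-5.0, 5.0)

# Random search: sample the feasible domain and keep the best point.
samples = rng.uniform(*bounds, size=(5000, 2))
best = samples[np.argmin([objective(s) for s in samples])]

# Simulated annealing: accept worse moves with a probability that
# decreases as the temperature is lowered.
x = rng.uniform(*bounds, size=2)
temperature = 1.0
for step in range(5000):
    candidate = np.clip(x + rng.normal(scale=0.3, size=2), *bounds)
    delta = objective(candidate) - objective(x)
    if delta < 0 or rng.random() < np.exp(-delta / temperature):
        x = candidate
    temperature *= 0.999  # cooling schedule

print("random search:", best, "simulated annealing:", x)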
UNIT II
FEATURE DETECTION, MATCHING AND SEGMENTATION
Points and patches - Edges - Lines - Segmentation - Active contours - Split and
merge - Mean shift and mode finding - Normalized cuts - Graph cuts and
energy-based methods.
1. Points and Patches
Points:
Usage: Points are often used as key interest points or landmarks. These
can be locations with unique features, such as corners, edges, or
distinctive textures.
Patches:
Usage: Patches are small image regions (for example, square windows centered on
interest points) that serve as local descriptors for matching, tracking, and
recognition.
2. Edges
Definition:
● An edge is a set of pixels at which the image intensity changes sharply, typically corresponding to object boundaries, surface orientation changes, or shadows.
3. Lines
In the context of image processing and computer vision, "lines" refer to straight or
curved segments within an image. Detecting and analyzing lines is a fundamental
aspect of image understanding and is important in various computer vision
applications. Here are key points about lines:
Definition:
● A line is a set of connected pixels with similar characteristics, typically
representing a continuous or approximate curve or straight segment
within an image.
Line Detection:
● Line detection is the process of identifying and extracting lines from an
image. Hough Transform is a popular technique used for line detection,
especially for straight lines.
Types of Lines:
● Straight Lines: Linear segments with a constant slope.
● Curved Lines: Non-linear segments with varying curvature.
● Line Segments: Partial lines with a starting and ending point.
Applications:
● Object Detection: Lines can be important features in recognizing and
understanding objects within an image.
Lines are important features in images and play a crucial role in computer vision
applications. Detecting and understanding lines contribute to tasks such as object
recognition, image segmentation, and analysis of structural patterns. The choice of line
detection methods depends on the specific characteristics of the image and the goals
of the computer vision application.
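A short sketch of straight-line detection with the probabilistic Hough transform in OpenCV (the Canny and Hough parameters are illustrative):

import cv2
import numpy as np

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Edge map first: the Hough transform operates on edge pixels.
edges = cv2.Canny(img, 50, 150)

# Probabilistic Hough transform for straight line segments.
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                        minLineLength=30, maxLineGap=10)

output = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(output, (x1, y1), (x2, y2), (0, 0, 255), 2)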
4. Segmentation
Image segmentation is a computer vision task that involves partitioning an image into
meaningful and semantically coherent regions or segments. The goal is to group
together pixels or regions that share similar visual characteristics, such as color, texture,
or intensity. Image segmentation is a crucial step in various computer vision
applications as it provides a more detailed and meaningful understanding of the content
within an image. Here are key points about image segmentation:
Definition:
● Image segmentation assigns a label to every pixel so that pixels sharing similar visual characteristics receive the same label, partitioning the image into coherent regions.
5. Active Contours
Active contours, also known as snakes, are a concept in computer vision and image
processing that refers to deformable models used for image segmentation. The idea
behind active contours is to evolve a curve or contour within an image in a way that
captures the boundaries of objects or regions of interest. These curves deform under
the influence of internal forces (encouraging smoothness) and external forces
(attracted to features in the image).
Initialization:
● Active contours are typically initialized near the boundaries of the objects
to be segmented. The initial contour can be a closed curve or an open
curve depending on the application.
Energy Minimization:
● The evolution of the active contour is guided by an energy function that
combines internal and external forces. The goal is to minimize this energy
to achieve an optimal contour that fits the boundaries of the object.
Internal Forces:
● Internal forces are associated with the deformation of the contour itself.
They include terms that encourage smoothness and continuity of the
curve. The internal energy helps prevent the contour from oscillating or
exhibiting unnecessary deformations.
External Forces:
● External forces are derived from the image data and drive the contour
toward the boundaries of objects. These forces are attracted to features
such as edges, intensity changes, or texture gradients in the image.
Snakes Algorithm:
● The snakes algorithm is a well-known method for active contour modeling.
It was introduced by Michael Kass, Andrew Witkin, and Demetri
Terzopoulos in 1987. The algorithm involves iterative optimization of the
energy function to deform the contour.
Applications:
● Active contours are used in various image segmentation applications,
such as medical image analysis, object tracking, and computer vision
tasks where precise delineation of object boundaries is required.
Challenges:
● Active contours may face challenges in the presence of noise, weak
edges, or complex object structures. Careful parameter tuning and
initialization are often required.
Variations:
● There are variations of active contours, including geodesic active contours
and level-set methods, which offer different formulations for contour
evolution and segmentation.
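A minimal sketch of the snakes idea using scikit-image's active_contour function (assumed installed); the initial circle, its center and radius, and the energy weights alpha, beta, and gamma are illustrative values:

import numpy as np
from skimage import io, filters, segmentation  # scikit-image assumed installed

img = io.imread("cell.png", as_gray=True)  # placeholder path

# Initialize the snake as a circle around the object of interest
# (center and radius are illustrative values, in row/column coordinates).
theta = np.linspace(0, 2 * np.pi, 200)
init = np.column_stack([120 + 80 * np.sin(theta),   # rows
                        130 + 80 * np.cos(theta)])  # cols

# Evolve the contour on a smoothed image; alpha and beta weight the
# internal (elasticity/smoothness) forces, gamma is the step size.
snake = segmentation.active_contour(filters.gaussian(img, sigma=3),
                                    init, alpha=0.015, beta=10, gamma=0.001)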
While classical active contour formulations have been widely used, the choice of segmentation method depends on the specific characteristics of the images and the requirements of the application.
6. Split and Merge
The Split and Merge algorithm is a region-based segmentation technique that recursively divides an image into blocks (typically using a quadtree) until each block is homogeneous, and then merges adjacent homogeneous blocks.
Splitting Phase:
● The algorithm starts with the entire image as a single block.
● It evaluates a splitting criterion to determine if the block is sufficiently
homogeneous or should be split further.
● If the splitting criterion is met, the block is divided into four equal
sub-blocks (quadrants), and the process is applied recursively to each
sub-block.
Merging Phase:
● Once the recursive splitting reaches a certain level or the splitting criterion
is no longer satisfied, the merging phase begins.
● Adjacent blocks are examined to check if they are homogeneous enough
to be merged.
● If the merging criterion is satisfied, neighboring blocks are merged into a
larger block.
● The merging process continues until no further merging is possible, and
the segmentation is complete.
Homogeneity Criteria:
● The homogeneity of a block or region is determined based on certain
criteria, such as color similarity, intensity, or texture. For example, blocks
may be considered homogeneous if the variance of pixel values within the
block is below a certain threshold.
Recursive Process:
● The splitting and merging phases are applied recursively, leading to a
hierarchical segmentation of the image.
Applications:
● Split and Merge can be used for image segmentation in various
applications, including object recognition, scene analysis, and computer
vision tasks where delineation of regions is essential.
Challenges:
● The performance of Split and Merge can be affected by factors such as
noise, uneven lighting, or the presence of complex structures in the image.
The Split and Merge algorithm provides a way to divide an image into regions of
homogeneous content, creating a hierarchical structure. While it has been used
historically, more recent image segmentation methods often involve advanced
techniques, such as machine learning-based approaches (e.g., convolutional neural
networks) or other region-growing algorithms. The choice of segmentation method
depends on the characteristics of the images and the specific requirements of the
application.
7. Mean Shift and Mode Finding
Mean Shift is a non-parametric, iterative clustering algorithm that shifts each point in a feature space toward the region of highest local density (the nearest mode). In image processing, Mean Shift can be applied to group pixels with similar characteristics into coherent segments.
Mean Shift has been successfully applied to image segmentation, where it effectively
groups pixels with similar color or intensity values into coherent segments.
In statistics and data analysis, a "mode" refers to the value or values that appear most
frequently in a dataset. Mode finding, in the context of Mean Shift or other clustering
algorithms, involves identifying the modes or peaks in the data distribution.
● Each cluster is associated with a mode, and the mean shift vectors guide
the data points toward these modes during the iterations.
Mean Shift is an algorithm that performs mode finding to identify clusters in a dataset.
In image processing, it is often used for segmentation by iteratively shifting towards
modes in the color or intensity distribution, effectively grouping pixels into coherent
segments.
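A one-call sketch of Mean Shift filtering/segmentation using OpenCV; the spatial and range window radii are illustrative:

import cv2

img = cv2.imread("input.png")  # placeholder path; must be a color (BGR) image

# Mean shift filtering in the joint spatial-color domain.
# sp: spatial window radius, sr: color (range) window radius.
segmented = cv2.pyrMeanShiftFiltering(img, sp=21, sr=40)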
8. Normalized Cuts
Normalized Cuts is a graph-based image segmentation algorithm that seeks to divide
an image into meaningful segments by considering both the similarity between pixels
and the dissimilarity between different segments. It was introduced by Jianbo Shi and
Jitendra Malik in 2000 and has been widely used in computer vision and image
processing.
Graph Representation:
● The image is represented as an undirected graph, where each pixel is a
node in the graph, and edges represent relationships between pixels.
Edges are weighted based on the similarity between pixel values.
Affinity Matrix:
● An affinity matrix is constructed to capture the similarity between pixels.
The entries of this matrix represent the weights of edges in the graph, and
the values are determined by a similarity metric, such as color similarity or
texture similarity.
Segmentation Objective:
● The goal is to partition the graph into two or more segments in a way that
minimizes the dissimilarity between segments and maximizes the
similarity within segments.
Normalized Cuts Criterion:
● The algorithm formulates the segmentation problem using the normalized cut criterion, which balances the dissimilarity between segments against the total association of each segment with the whole graph:
Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)
where cut(A, B) is the sum of edge weights between segments A and B, and assoc(A, V) is the sum of edge weights connecting segment A to all nodes V.
● Optimization techniques are applied to find the partition that minimizes this criterion.
Eigenvalue Problem:
● The optimization problem involves solving an eigenvalue problem derived
from the affinity matrix. The eigenvectors corresponding to the smallest
eigenvalues provide information about the optimal segmentation.
Recursive Approach:
● To achieve multi-segmentation, the algorithm employs a recursive
approach. After the initial segmentation, each segment is further divided
into sub-segments by applying the same procedure recursively.
Advantages:
● Normalized Cuts is capable of capturing both spatial and color
information in the segmentation process.
● It avoids the bias towards small, compact segments, making it suitable for
segmenting images with non-uniform structures.
Challenges:
● The computational complexity of solving the eigenvalue problem can be a
limitation, particularly for large images.
Normalized Cuts has been widely used in image segmentation tasks, especially when
capturing global structures and relationships between pixels is essential. It has
applications in computer vision, medical image analysis, and other areas where precise
segmentation is crucial.
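A small self-contained sketch of the two-way normalized cut on a toy affinity matrix, following the generalized eigenvalue formulation described above (the graph weights are made up):

import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    """Two-way normalized cut of a graph given its affinity matrix W.

    Solves the generalized eigenproblem (D - W) y = lambda D y from
    Shi and Malik (2000) and thresholds the second-smallest eigenvector.
    """
    d = W.sum(axis=1)
    D = np.diag(d)
    # Generalized eigenvalue problem; eigh returns eigenvalues in ascending order.
    vals, vecs = eigh(D - W, D)
    second = vecs[:, 1]                 # eigenvector of the second-smallest eigenvalue
    return second > np.median(second)   # boolean segment labels

# Tiny example: 6 nodes forming two well-connected groups.
W = np.array([[0,   1, 1,   0.1, 0, 0],
              [1,   0, 1,   0,   0, 0],
              [1,   1, 0,   0,   0, 0.1],
              [0.1, 0, 0,   0,   1, 1],
              [0,   0, 0,   1,   0, 1],
              [0,   0, 0.1, 1,   1, 0]], dtype=float)
print(ncut_bipartition(W))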
9. Graph Cuts and Energy-Based Methods
Graph Cuts:
Graph cuts involve partitioning a graph into two disjoint sets such that the cut cost (the
sum of weights of edges crossing the cut) is minimized. In image segmentation, pixels
are represented as nodes, and edges are weighted based on the dissimilarity between
pixels.
Graph Representation:
● Each pixel is a node, and edges connect adjacent pixels. The weights of
edges reflect the dissimilarity between pixels (e.g., color, intensity).
Energy Minimization:
● The problem is formulated as an energy minimization task, where the
energy function includes terms encouraging similarity within segments
and dissimilarity between segments.
Binary Graph Cut:
● In the simplest case, the goal is to partition the graph into two sets
(foreground and background) by finding the cut with the minimum energy.
Multiclass Graph Cut:
● The approach can be extended to handle multiple classes or segments by
using techniques like the normalized cut criterion.
Applications:
● Graph cuts are used in image segmentation, object recognition, stereo
vision, and other computer vision tasks.
Energy-Based Methods:
Energy-based methods involve formulating an energy function that measures the quality of a particular configuration or assignment of labels to pixels. The optimization process then searches for the label assignment that minimizes this energy.
Energy Function:
● The energy function is defined based on factors such as data terms
(measuring agreement with observed data) and smoothness terms
(encouraging spatial coherence).
Unary and Pairwise Terms:
● Unary terms are associated with individual pixels and capture the
likelihood of a pixel belonging to a particular class. Pairwise terms model
relationships between neighboring pixels and enforce smoothness.
Markov Random Fields (MRFs) and Conditional Random Fields (CRFs):
● MRFs and CRFs are common frameworks for modeling energy-based
methods. MRFs consider local interactions, while CRFs model
dependencies more globally.
Iterative Optimization:
● Optimization techniques like belief propagation or graph cuts are often
used iteratively to find the label assignment that minimizes the energy.
Applications:
● Energy-based methods are applied in image segmentation, image
denoising, image restoration, and various other vision tasks.
Both graph cuts and energy-based methods provide powerful tools for image
segmentation by incorporating information about pixel relationships and modeling the
desired properties of segmented regions. The choice between them often depends on
the specific characteristics of the problem at hand.
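A tiny sketch of such an energy function for a binary labeling, combining unary data terms with a Potts-style pairwise smoothness term (the costs and labels are made-up values; a real system would minimize this energy with graph cuts or belief propagation):

import numpy as np

def mrf_energy(labels, unary, smoothness=1.0):
    """Energy of a binary labeling: unary data terms plus Potts pairwise terms.

    labels: HxW array of 0/1 labels; unary: HxWx2 array of per-pixel label costs.
    """
    h, w = labels.shape
    # Data term: cost of assigning each pixel its chosen label.
    data = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # Smoothness term: penalize label changes between 4-connected neighbors.
    pairwise = (labels[:, 1:] != labels[:, :-1]).sum() + \
               (labels[1:, :] != labels[:-1, :]).sum()
    return data + smoothness * pairwise

# Example: a 2x2 image with strong foreground evidence in the right column.
unary = np.array([[[0.1, 2.0], [2.0, 0.1]],
                  [[0.1, 2.0], [2.0, 0.1]]])
labels = np.array([[0, 1], [0, 1]])
print(mrf_energy(labels, unary))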
UNIT III
FEATURE-BASED ALIGNMENT & MOTION ESTIMATION
2D and 3D feature-based alignment - Pose estimation - Geometric intrinsic
calibration - Triangulation - Two-frame structure from motion - Factorization
- Bundle adjustment - Constrained structure and motion - Translational
alignment - Parametric motion - Spline-based motion - Optical flow -
Layered motion.
1. 2D and 3D feature-based alignment:
2D Feature-Based Alignment:
● Definition: In 2D feature-based alignment, the goal is to align and match
features in two or more 2D images.
● Features: Features can include points, corners, edges, or other distinctive
patterns.
● Applications: Commonly used in image stitching, panorama creation,
object recognition, and image registration.
3D Feature-Based Alignment:
● Definition: In 3D feature-based alignment, 3D features (for example, points from depth sensors or reconstructed point clouds) are matched and brought into a common coordinate frame, typically by estimating a rigid transformation (rotation and translation).
● Applications: 3D reconstruction, point-cloud registration, and robot localization.
2. Pose estimation:
Pose estimation is a computer vision task that involves determining the position and
orientation of an object or camera relative to a coordinate system. It is a crucial aspect
of understanding the spatial relationships between objects in a scene. Pose estimation
can be applied to both 2D and 3D scenarios, and it finds applications in various fields,
including robotics, augmented reality, autonomous vehicles, and human-computer
interaction.
2D Pose Estimation:
● Definition: In 2D pose estimation, the goal is to estimate the position
(translation) and orientation (rotation) of an object in a two-dimensional
image.
● Methods: Techniques include keypoint-based approaches, where
distinctive points (such as corners or joints) are detected and used to
estimate pose. Common methods include PnP (Perspective-n-Point)
algorithms.
3D Pose Estimation:
● Definition: In 3D pose estimation, the goal is to estimate the position and
orientation of an object in three-dimensional space.
● Methods: Often involves associating 2D keypoints with corresponding 3D
points. PnP algorithms can be extended to 3D, and there are other
methods like Iterative Closest Point (ICP) for aligning a 3D model with a
point cloud.
Applications:
● Robotics: Pose estimation is crucial for robotic systems to navigate and
interact with the environment.
● Augmented Reality: Enables the alignment of virtual objects with the
real-world environment.
3. Geometric intrinsic calibration:
Geometric intrinsic calibration is the process of estimating the internal parameters of a camera that determine how 3D points are projected onto the 2D image plane. Accurate calibration is essential for applications like 3D reconstruction, object tracking, and augmented reality, where knowing the intrinsic properties of the camera is crucial for accurate scene interpretation.
Intrinsic Parameters:
● Focal Length (f): Represents the distance from the camera's optical center
to the image plane. It is a critical parameter for determining the scale of
objects in the scene.
● Principal Point (c): Denotes the coordinates of the image center. It
represents the offset from the top-left corner of the image to the center of
the image plane.
● Lens Distortion Coefficients: Describe imperfections in the lens, such as
radial and tangential distortions, that affect the mapping between 3D
world points and 2D image points.
Camera Model:
● The camera model, often used for intrinsic calibration, is the pinhole
camera model. This model assumes that light enters the camera through
a single point (pinhole) and projects onto the image plane.
Calibration Patterns:
● Intrinsic calibration is typically performed using calibration patterns with
known geometric features, such as chessboard patterns. These patterns
allow for the extraction of corresponding points in both 3D world
coordinates and 2D image coordinates.
Calibration Process:
● Image Capture: Multiple images of the calibration pattern are captured
from different viewpoints.
● Feature Extraction: Detected features (corners, intersections) in the
calibration pattern are identified in both image and world coordinates.
● Parameter Estimation: The intrinsic parameters and distortion coefficients are estimated by minimizing the reprojection error between the observed and projected pattern points.
Accurate geometric intrinsic calibration is a critical step in ensuring that the camera
model accurately represents the mapping between the 3D world and the 2D image,
facilitating precise computer vision tasks
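A condensed sketch of the calibration process with OpenCV using chessboard images; the pattern size and the image folder are placeholders:

import cv2
import numpy as np
import glob

# Chessboard with 9x6 inner corners; square size in arbitrary units.
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):          # placeholder image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimate intrinsics (K), distortion coefficients, and per-view extrinsics.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS error:", rms)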
4. Triangulation:
Basic Concept:
● Triangulation is based on the principle of finding the 3D location of a point
in space by measuring its projection onto two or more image planes.
Camera Setup:
● Triangulation requires at least two cameras (stereo vision) or more to
capture the same scene from different viewpoints. Each camera provides
a 2D projection of the 3D point.
Mathematical Representation:
● Each camera provides a projection x = P X, where P is its 3×4 projection matrix and X is the unknown 3D point in homogeneous coordinates; the point is recovered by intersecting the back-projected rays, in practice by solving a least-squares problem.
Epipolar Geometry:
● Epipolar geometry
is utilized to relate the 2D projections of a point in
different camera views. It defines the geometric relationship between the
two camera views and helps establish correspondences between points.
Triangulation Methods:
● Direct Linear Transform (DLT): An algorithmic approach that involves
solving a system of linear equations to find the 3D coordinates.
● Iterative Methods: Algorithms like the Gauss-Newton algorithm or the
Levenberg-Marquardt algorithm can be used for refining the initial
estimate obtained through DLT.
Accuracy and Precision:
● The accuracy of triangulation is influenced by factors such as the
calibration accuracy of the cameras, the quality of feature matching, and
the level of noise in the image data.
Bundle Adjustment:
● Triangulation is often used in conjunction with bundle adjustment, a
technique that optimizes the parameters of the cameras and the 3D points
simultaneously to minimize the reprojection error.
Applications:
● 3D Reconstruction: Triangulation is fundamental to creating 3D models of
scenes or objects from multiple camera views.
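A minimal sketch of DLT triangulation with OpenCV given two projection matrices and matched image points (all numbers are placeholders):

import cv2
import numpy as np

# 3x4 projection matrices P = K [R | t] for the two calibrated cameras
# (both use the same K here; the second camera is shifted along x).
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0], [0]])])

# Corresponding image points in the two views, shape 2xN (x row, y row).
pts1 = np.array([[300.0, 350.0], [240.0, 260.0]])
pts2 = np.array([[310.0, 365.0], [240.0, 262.0]])

# DLT triangulation; the result is in homogeneous coordinates (4xN).
X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)
X = (X_h[:3] / X_h[3]).T   # Nx3 Euclidean 3D points
print(X)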
5. Two-Frame Structure from Motion
Structure from Motion (SfM) is a computer vision technique that aims to reconstruct the
three-dimensional structure of a scene from a sequence of two-dimensional images.
Two-frame Structure from Motion specifically refers to the reconstruction of scene
geometry using information from only two images (frames) taken from different
viewpoints. This process involves estimating both the 3D structure of the scene and the
camera motion between the two frames.
Basic Concept:
● Two-frame Structure from Motion reconstructs the 3D structure of a scene
by analyzing the information from just two images taken from different
perspectives.
Correspondence Matching:
● Establishing correspondences between points or features in the two
images is a crucial step. This is often done by identifying key features
(such as keypoints) in both images and finding their correspondences.
Epipolar Geometry:
● Epipolar geometry describes the relationship between corresponding
points in two images taken by different cameras. It helps constrain the
possible 3D structures and camera motions.
Essential Matrix:
● The essential matrix is a fundamental matrix in epipolar geometry that
encapsulates the essential information about the relative pose of two
calibrated cameras.
Camera Pose Estimation:
● The camera poses
(positions and orientations) are estimated for both
frames. This involves solving for the rotation and translation between the
two camera viewpoints.
Triangulation:
● Triangulation is applied to find the 3D coordinates of points in the scene.
By knowing the camera poses and corresponding points, the depth of
scene points can be estimated.
Bundle Adjustment:
● Bundle adjustment is often used to refine the estimates of camera poses
and 3D points. It is an optimization process that minimizes the error
between observed and predicted image points.
Depth Ambiguity:
● Two-frame SfM is susceptible to depth ambiguity, meaning that the
reconstructed scene could be scaled or mirrored without affecting the
projections onto the images.
Applications:
● Robotics: Two-frame SfM is used in robotics for environment mapping and
navigation.
● Augmented Reality: Reconstruction of the 3D structure for overlaying
virtual objects onto the real-world scene.
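A brief sketch of the two-frame pipeline with OpenCV: estimating the essential matrix from matched points and recovering the relative pose. The correspondences here are synthesized placeholders; in practice they would come from feature detection and matching:

import cv2
import numpy as np

K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)

# Placeholder correspondences: random 3D points projected into two views
# whose relative pose we then try to recover.
X = np.random.rand(60, 3) * [2, 2, 2] + [0, 0, 5]        # points in front of the cameras
rvec2 = np.array([0.0, 0.1, 0.0]); tvec2 = np.array([-0.2, 0.0, 0.0])
pts1, _ = cv2.projectPoints(X, np.zeros(3), np.zeros(3), K, None)
pts2, _ = cv2.projectPoints(X, rvec2, tvec2, K, None)
pts1, pts2 = pts1.reshape(-1, 2), pts2.reshape(-1, 2)

# Estimate the essential matrix with RANSAC, then recover the relative pose.
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
# R, t give the rotation and (unit-scale) translation between the two frames;
# triangulation (Section 4) can now recover the 3D structure up to scale.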
6. Factorization:
Factorization methods decompose a matrix of observations into a product of lower-rank factors. In computer vision, the classic example is the Tomasi-Kanade approach, which factors a measurement matrix of tracked 2D feature points into a camera-motion factor and a 3D-structure factor using the singular value decomposition (SVD).
Applications:
● Structure from Motion (SfM): Factorization is used to recover camera
poses and 3D scene structure from 2D image correspondences.
● Background Subtraction: Matrix factorization techniques are employed in
background subtraction methods for video analysis.
● Face Recognition: Eigenface and Fisherface methods involve factorizing
covariance matrices for facial feature representation.
Non-Negative Matrix Factorization (NMF):
● Application: NMF is a variant of matrix factorization where the factors are
constrained to be non-negative.
● Use Cases: It is applied in areas such as topic modeling, image
segmentation, and feature extraction.
Tensor Factorization:
● Extension to Higher
Dimensions: In some cases, data is represented as
tensors, and factorization techniques are extended to tensors for
applications like multi-way data analysis.
● Example: Canonical Polyadic Decomposition (CPD) is a tensor
factorization technique.
Robust Factorization:
● Challenges: Noise and outliers in the data can affect the accuracy of
factorization.
● Robust Methods: Robust factorization techniques are designed to handle
noisy data and outliers, providing more reliable results.
Deep Learning Approaches:
● Autoencoders and Neural Networks: Deep learning models, including
autoencoders, can be considered as a form of nonlinear factorization.
Factorization Machine (FM):
● Application: Factorization Machines are used in collaborative filtering and
recommendation systems to model interactions between features.
Factorization plays a crucial role in various computer vision and machine learning tasks,
providing a mathematical framework for extracting meaningful representations from high-dimensional data.
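As a concrete illustration of the rank idea behind measurement-matrix factorization, the following sketch splits a (placeholder) 2F x N measurement matrix into rank-3 motion and structure factors with the SVD:

import numpy as np

# W: 2F x N measurement matrix of N feature points tracked over F frames
# (centered by subtracting each row's mean); placeholder random data here.
F, N = 10, 40
W = np.random.randn(2 * F, N)

# Under orthographic projection the centered measurement matrix has rank 3,
# so a rank-3 SVD splits it into motion and structure factors.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
motion = U[:, :3] * np.sqrt(s[:3])               # 2F x 3 camera-motion factor
structure = np.sqrt(s[:3])[:, None] * Vt[:3]     # 3 x N shape factor

# The product of the factors is the best rank-3 approximation of W.
reconstruction_error = np.linalg.norm(W - motion @ structure)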
7. Bundle adjustment:
Optimization Objective:
● Minimization of Reprojection Error: Bundle Adjustment aims to find the
optimal set of parameters (camera poses, 3D points) that minimizes the
difference between the observed 2D image points and their projections
onto the image planes based on the estimated 3D scene.
Parameters to Optimize:
● Camera Parameters: Intrinsic parameters (focal length, principal point)
and extrinsic parameters (camera poses - rotation and translation).
● 3D Scene Structure: Coordinates of 3D points in the scene.
Reprojection Error:
● Definition: The reprojection error is the difference between the observed
2D image points and the projections of the corresponding 3D points onto
the image planes.
● Sum of Squared Differences: The objective is to minimize the sum of
squared differences between observed and projected points.
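A compressed sketch of bundle adjustment as nonlinear least squares, using OpenCV's projectPoints for the camera model and SciPy's least_squares for the joint refinement (the cameras, points, and observations are synthetic placeholders):

import numpy as np
import cv2
from scipy.optimize import least_squares

K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)
n_cams, n_pts = 2, 20
cam_params0 = np.zeros((n_cams, 6))          # [rvec (3), tvec (3)] per camera
cam_params0[1, 3] = 0.1                      # second camera shifted along x
points3d0 = np.random.rand(n_pts, 3) + [0, 0, 4]
# Observed 2D points; here generated from the model itself as placeholders.
observations = np.array([
    cv2.projectPoints(points3d0, cam_params0[i, :3], cam_params0[i, 3:],
                      K, None)[0].reshape(-1, 2)
    for i in range(n_cams)])                 # shape: (n_cams, n_pts, 2)

def residuals(params):
    """Reprojection errors for all cameras and points, flattened."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    errs = []
    for i in range(n_cams):
        proj, _ = cv2.projectPoints(pts, cams[i, :3], cams[i, 3:], K, None)
        errs.append((proj.reshape(-1, 2) - observations[i]).ravel())
    return np.concatenate(errs)

x0 = np.hstack([cam_params0.ravel(), points3d0.ravel()])
result = least_squares(residuals, x0, method="lm")  # joint refinement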
8. Constrained structure and motion:
Constrained structure and motion incorporates prior knowledge about the scene or the cameras into the reconstruction, restricting the solution space explored by bundle adjustment.
Introduction of Constraints:
● Prior Information: Constraints can be introduced based on prior
knowledge about the scene, such as known distances, planar structures,
or object shapes.
9. Translational alignment
Translational alignment, in the context of computer vision and image processing, refers
to the process of aligning two or more images based on translational transformations.
Translational alignment involves adjusting the position of images along the x and y axes
to bring corresponding features or points into alignment. This type of alignment is often
a fundamental step in various computer vision tasks, such as image registration,
panorama stitching, and motion correction.
Objective:
● The primary goal of translational alignment is to align images by
minimizing the translation difference between corresponding points or
features in the images.
Translation Model:
● The alignment is modeled as x' = x + tx, y' = y + ty, where (tx, ty) is the translation vector to be estimated between the images.
Correspondence Matching:
● Correspondence matching involves identifying corresponding features or
points in the images that can be used as reference for alignment.
Common techniques include keypoint detection and matching.
Alignment Process:
● The translational alignment process typically involves detecting and matching features (or comparing the images directly), estimating the translation that best aligns the corresponding points (for example, by averaging their displacements or by phase correlation), and resampling one image onto the other.
Applications:
● Image Stitching: In panorama creation, translational alignment is used to
align images before merging them into a seamless panorama.
● Motion Correction: In video processing, translational alignment corrects
for translational motion between consecutive frames.
● Registration in Medical Imaging: Aligning medical images acquired from
different modalities or at different time points.
Evaluation:
● The success of translational alignment is often evaluated by measuring
the accuracy of the alignment, typically in terms of the distance between
corresponding points before and after alignment.
Robustness:
● Translational alignment is relatively straightforward and computationally
efficient. However, it may be sensitive to noise and outliers, particularly in
the presence of large rotations or distortions.
Integration with Other Transformations:
● Translational alignment is frequently used as an initial step in more
complex alignment processes that involve additional transformations,
such as rotational alignment or affine transformations.
Automated Alignment:
● In many applications, algorithms for translational alignment are designed
to operate automatically without requiring manual intervention.
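A short sketch of translational alignment via phase correlation in OpenCV (file names are placeholders; the images must be the same size and converted to float):

import cv2
import numpy as np

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Phase correlation estimates the global (sub-pixel) translation between
# the two images from the phase of their cross-power spectrum.
(dx, dy), response = cv2.phaseCorrelate(img1, img2)

# Shift the second image by the negated estimate to align it with the first
# (the sign convention is worth verifying for a given OpenCV version).
M = np.float32([[1, 0, -dx], [0, 1, -dy]])
aligned = cv2.warpAffine(img2, M, (img2.shape[1], img2.shape[0]))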
10. Parametric motion:
Parametric Functions:
● Parametric motion models use mathematical functions with parameters
to represent the motion of objects or scenes over time. These functions
could be simple mathematical equations or more complex models.
Types of Parametric Motion Models:
● Linear Models: Simplest form of parametric motion, where motion is
represented by linear equations. For example, linear interpolation between
keyframes.
● Polynomial Models: Higher-order polynomial functions can be used to
model more complex motion. Cubic splines are commonly used for
smooth motion interpolation.
● Trigonometric Models: Sinusoidal functions can be employed to represent
periodic motion, such as oscillations or repetitive patterns.
● Exponential Models: Capture behaviors that exhibit exponential growth or
decay, suitable for certain types of motion.
Keyframe Animation:
● In parametric motion, keyframes are specified at certain points in time,
and the motion between keyframes is defined by the parametric motion
model. Interpolation is then used to generate frames between keyframes.
Control Points and Handles:
● Parametric models often involve control points and handles that influence
the shape and behavior of the motion curve. Adjusting these parameters
allows for creative control over the motion.
Applications:
● Computer Animation: Used for animating characters, objects, or camera
movements in 3D computer graphics and animation.
● Video Compression: Parametric motion models can be used to describe
the motion between video frames, facilitating efficient compression
techniques.
● Video Synthesis: Generating realistic videos or predicting future frames in
a video sequence based on learned parametric models.
● Motion Tracking: Tracking the movement of objects in a video by fitting
parametric motion models to observed trajectories.
Smoothness and Continuity:
● One advantage of parametric motion models is their ability to provide
smooth and continuous motion, especially when using interpolation
techniques between keyframes.
Constraints and Constraints-Based Motion:
● Parametric models can be extended to include constraints, ensuring that
the motion adheres to specific rules or conditions. For example, enforcing
constant velocity or maintaining specific orientations.
Machine Learning Integration:
● Parametric motion
models can be learned from data using machine
learning techniques. Machine learning algorithms can learn the
parameters of the motion model from observed examples.
Challenges:
● Designing appropriate parametric models that accurately capture the
desired motion can be challenging, especially for complex or non-linear
motions.
● Ensuring that the motion remains physically plausible and visually
appealing is crucial in animation and simulation.
11. Spline-based motion:
Spline-based motion refers to the use of spline curves to model and interpolate motion
in computer graphics, computer-aided design, and animation. Splines are mathematical
curves that provide a smooth and flexible way to represent motion paths and
trajectories. They are widely used in 3D computer graphics and animation for creating
natural and visually pleasing motion, particularly in scenarios where continuous and
smooth paths are desired.
Spline Definition:
● Spline Curve: A spline is a piecewise-defined polynomial curve. It consists
of several polynomial segments (typically low-degree) that are smoothly
connected at specific points called knots or control points.
● Types of Splines: Common types of splines include B-splines, cubic
splines, and Bezier splines.
Spline Interpolation:
● Spline curves are often used to interpolate keyframes or control points in
animation. This means the curve passes through or follows the specified
keyframes, creating a smooth motion trajectory.
B-spline (Basis Spline):
● B-splines are widely used for spline-based motion. They are defined by a
set of control points, and their shape is influenced by a set of basis
functions.
● Local Control: Modifying the position of a control point affects only a local
portion of the curve, making B-splines versatile for animation.
Cubic Splines:
● Cubic splines are a specific type of spline where each polynomial segment
is a cubic (degree-3) polynomial.
● Natural Motion: Cubic splines are often used for creating natural motion
paths due to their smoothness and continuity.
Bezier Splines:
● Bezier splines are a type of spline that is defined by a set of control points.
They have intuitive control handles that influence the shape of the curve.
● Bezier Curves: Cubic Bezier curves, in particular, are frequently used for
creating motion paths in animation.
Spline Tangents and Curvature:
● Spline-based motion allows control over the tangents at control points,
influencing the direction of motion. Curvature continuity ensures smooth
transitions between segments.
Applications:
● Computer Animation: Spline-based motion is extensively used for
animating characters, camera movements, and objects in 3D scenes.
● Path Generation: Designing smooth and visually appealing paths for
objects to follow in simulations or virtual environments.
● Motion Graphics: Creating dynamic and aesthetically pleasing visual
effects in motion graphics projects.
Parametric Representation:
● Spline-based motion is parametric, meaning the position of a point on the
spline is determined by a parameter. This allows for easy manipulation
and control over the motion.
Interpolation Techniques:
● Keyframe Interpolation: Spline curves interpolate smoothly between
keyframes, providing fluid motion transitions.
● Hermite Interpolation: Splines can be constructed using Hermite
interpolation, where both position and tangent information at control
points are considered.
Challenges:
● Overfitting: In some cases, spline curves can be overly flexible and lead to
overfitting if not properly controlled.
● Control Point Placement: Choosing the right placement for control points
is crucial for achieving the desired motion characteristics.
Spline-based motion provides animators and designers with a versatile tool for creating
smooth and controlled motion paths in computer-generated imagery. The ability to
adjust the shape of the spline through control points and handles makes it a popular
choice for a wide range of animation and graphics applications.
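A small sketch of spline-based keyframe interpolation with SciPy's CubicSpline; the keyframe times and positions are made-up values:

import numpy as np
from scipy.interpolate import CubicSpline

# Keyframes: times (seconds) and the corresponding 2D positions of an object.
key_times = np.array([0.0, 1.0, 2.0, 3.0])
key_positions = np.array([[0, 0], [100, 40], [180, 30], [250, 90]], dtype=float)

# One cubic spline per coordinate; the curve passes through every keyframe
# with continuous first and second derivatives (smooth velocity and acceleration).
spline = CubicSpline(key_times, key_positions, axis=0)

# Sample the in-between frames at 30 fps.
t = np.arange(0.0, 3.0, 1.0 / 30.0)
positions = spline(t)          # interpolated positions
velocities = spline(t, 1)      # first derivative: velocity along the path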
12. Optical flow:
Optical flow is a computer vision technique that involves estimating the motion of
objects or surfaces in a visual scene based on the observed changes in brightness or
intensity over time. It is a fundamental concept used in various applications, including
motion analysis, video processing, object tracking, and scene understanding.
Motion Estimation:
● Objective: The primary goal of optical flow is to estimate the velocity
vector (optical flow vector) for each pixel in an image, indicating the
apparent motion of that pixel in the scene.
● Pixel-level Motion: Optical flow provides a dense representation of motion
at the pixel level.
Brightness Constancy Assumption:
● Assumption: Optical flow is based on the assumption of brightness constancy, which states that the brightness of a point in the scene remains constant over time. Linearizing this assumption gives the optical flow constraint equation Ix·u + Iy·v + It = 0, where Ix, Iy, and It are the image derivatives and (u, v) is the flow vector.
Optical flow is a valuable tool for understanding and analyzing motion in visual data.
While traditional methods have been widely used, the integration of deep learning has
brought new perspectives and improved performance in optical flow estimation.
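A minimal sketch of dense optical flow with OpenCV's Farneback method (file names and parameters are illustrative):

import cv2
import numpy as np

prev = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
curr = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow (Farneback): one (u, v) vector per pixel.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2,
                                    flags=0)

# Flow magnitude and direction, e.g. for visualization or motion segmentation.
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])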
13. Layered motion:
Layered motion, in the context of computer vision and motion analysis, refers to the
representation and analysis of a scene where different objects or layers move
independently of each other. It assumes that the motion in a scene can be decomposed
into multiple layers, each associated with a distinct object or surface. Layered motion
models are employed to better capture complex scenes with multiple moving entities,
handling occlusions and interactions between objects.
UNIT IV
3D RECONSTRUCTION
Shape from X - Active range finding - Surface representations - Point-based
representations - Volumetric representations - Model-based reconstruction -
Recovering texture maps and albedos.
1. Shape from X:
"Shape from X" refers to a category of computer vision and computer graphics
techniques that aim to recover the three-dimensional (3D) shape or structure of objects
or scenes from different types of information or cues, represented by the variable "X".
The "X" can stand for various sources or modalities that provide information about the
scene. Some common examples include shape from shading, shape from stereo, shape from motion (structure from motion), shape from texture, and shape from focus/defocus.
2. Active range finding:
Active range finding techniques emit energy (such as laser light, structured light patterns, or ultrasound) into the scene and measure the returned signal to recover depth directly. Common methods include:
Laser Range Finding: This method involves emitting laser beams towards the
target and measuring the time it takes for the laser pulses to travel to the object
and back. By knowing the speed of light, the distance to the object can be
calculated.
Structured Light: In structured light range finding, a known light pattern, often a
grid or a set of stripes, is projected onto the scene. Cameras capture the
deformed pattern on surfaces, and the distortion helps calculate depth
information based on the known geometry of the projected pattern.
Time-of-Flight (ToF) Cameras: ToF cameras emit modulated light signals (often
infrared) and measure the time it takes for the light to travel to the object and
return. The phase shift of the modulated signal is used to determine the distance
to the object.
Ultrasound Range Finding: Ultrasound waves are emitted, and the time it takes
for the waves to bounce back to a sensor is measured. This method is commonly
used in environments where optical methods may be less effective, such as in
low-light conditions.
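A tiny sketch of the distance calculation shared by these time-of-flight style methods (the round-trip times are made-up examples):

# Active range finding: distance from a round-trip time-of-flight measurement.
SPEED_OF_LIGHT = 299_792_458.0        # m/s
SPEED_OF_SOUND = 343.0                # m/s in air, approximate

def distance_from_round_trip(round_trip_time_s, speed):
    """The pulse travels to the object and back, so halve the path length."""
    return speed * round_trip_time_s / 2.0

# Example: a laser pulse returning after 66.7 nanoseconds -> about 10 m.
print(distance_from_round_trip(66.7e-9, SPEED_OF_LIGHT))
# Example: an ultrasound echo returning after 20 milliseconds -> about 3.4 m.
print(distance_from_round_trip(20e-3, SPEED_OF_SOUND))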
Active range finding has various applications, including robotics, 3D scanning,
autonomous vehicles, augmented reality, and industrial inspection. The ability to actively
measure distances is valuable in scenarios where ambient lighting conditions may vary
or when accurate depth information is essential for understanding the environment.
3. Surface representations:
Surface representations in computer vision refer to the ways in which the geometry or
shape of surfaces in a three-dimensional (3D) scene is represented. These
representations are crucial for tasks such as 3D reconstruction, computer graphics, and
virtual reality. Different methods exist for representing surfaces, and the choice often
depends on the application's requirements and the characteristics of the data. Here are
some common surface representations:
Polygonal Meshes:
● Description: Meshes are composed of vertices, edges, and faces that
define the surface geometry. Triangular and quadrilateral meshes are
most common.
● Application: Widely used in computer graphics, gaming, and 3D modeling.
Point Clouds:
● Description: A set of 3D points in space, each representing a sample on
the surface of an object.
● Application: Generated by 3D scanners, LiDAR, or depth sensors; used in
applications like autonomous vehicles, robotics, and environmental
mapping.
Implicit Surfaces:
● Description: Represent surfaces as the zero level set of a scalar function.
Points inside the surface have negative values, points outside have
positive values, and points on the surface have values close to zero.
● Application: Used in physics-based simulations, medical imaging, and
shape modeling.
NURBS (Non-Uniform Rational B-Splines):
● Description: Mathematical representations using control points and basis
functions to define
smooth surfaces.
● Application: Commonly used in computer-aided design (CAD), automotive
design, and industrial design.
Voxel Grids:
● Description: 3D grids where each voxel (volumetric pixel) represents a
small volume in space, and the surface is defined by the boundary
between occupied and unoccupied voxels.
● Application: Used in medical imaging, volumetric data analysis, and
computational fluid dynamics.
Level Set Methods:
● Description: Represent surfaces as the zero level set of a
higher-dimensional function. The evolution of this function over time
captures the motion of the surface.
● Application: Used in image segmentation, shape optimization, and fluid
dynamics simulations.
Octrees:
● Description: Hierarchical tree structures that recursively divide space into
octants. Each leaf node contains information about the geometry within
that region.
The choice of surface representation depends on factors such as the nature of the
scene, the desired level of detail, computational efficiency, and the specific
requirements of the application.
4. Point-based representations:
Point-based representations in computer vision and computer graphics refer to
methods that represent surfaces or objects using a set of individual points in
three-dimensional (3D) space. Instead of explicitly defining the connectivity between
points as in polygonal meshes, point-based representations focus on the spatial
distribution of points to describe the surface geometry. Here are some common
point-based representations:
Point Clouds:
● Description: A collection of 3D points in space, each representing a sample
on the surface of an object or a scene.
● Application: Point clouds are generated by 3D scanners, LiDAR, depth
sensors, or photogrammetry. They find applications in robotics,
autonomous vehicles, environmental mapping, and 3D modeling.
Dense Point Clouds:
● Description: Similar to point clouds but with a high density of points,
providing more detailed surface information.
5. Volumetric representations:
Voxel Grids:
● Description: A regular grid of small volume elements, called voxels, where
each voxel represents a small unit of 3D space.
● Application: Used
in medical imaging, computer-aided design (CAD),
computational fluid dynamics, and robotics. Voxel grids are effective for
representing both the exterior and interior of objects.
Octrees:
● Description: A hierarchical data structure that recursively divides 3D space
into octants. Each leaf node in the octree contains information about the
occupied or unoccupied status of the corresponding volume.
● Application: Octrees are employed for efficient storage and representation
of volumetric data, particularly in real-time rendering, collision detection,
and adaptive resolution.
Signed Distance Fields (SDF):
● Description: Represent the distance from each point in space to the
nearest surface of an object, with positive values inside the object and
negative values outside.
● Application: Used in shape modeling, surface reconstruction, and
physics-based simulations. SDFs provide a compact representation of
geometry and are often used in conjunction with implicit surfaces.
3D Texture Maps:
● Description: Extend 2D texture mapping to three dimensions by storing color or material properties for every point in a volumetric grid (solid textures).
● Application: Volume rendering and visualization of volumetric data such as medical scans.
6. Model-based reconstruction:
Model-based reconstruction in computer vision refers to a category of techniques that
involve creating a 3D model of a scene or object based on predefined models or
templates. These methods leverage prior knowledge about the geometry, appearance,
or structure of the objects being reconstructed. Model-based reconstruction is often
used in scenarios where a known model can be fitted to the observed data, providing a
structured and systematic approach to understanding the scene. Typical examples include fitting parametric shape models, deformable templates, or CAD models to image or depth data.
Model-based reconstruction is valuable when there is prior knowledge about the objects
or scenes being reconstructed, as it allows for more efficient and accurate
reconstruction compared to purely data-driven approaches. This approach is
particularly useful in fields where a well-defined understanding of the underlying
geometry is available.
Texture Maps:
● Description: Texture mapping involves applying a 2D image, known as a
texture map, onto a 3D model's surface to simulate surface details,
patterns, or color variations.
● Recovery Process: Texture maps can be recovered through various
methods, including image-based techniques, photogrammetry, or using
specialized 3D scanners. These methods capture color information
associated with the surface geometry.
● Application: Used in computer graphics, gaming, and virtual reality to
enhance the visual appearance of 3D models by adding realistic surface
details.
Albedo:
● Description: Albedo represents the intrinsic color or reflectance of a
surface, independent of lighting conditions. It is a measure of how much
light a surface reflects.
● Recovery Process: Albedo can be estimated by decoupling surface
reflectance from lighting effects. Photometric stereo, shape-from-shading,
or using multi-view images are common methods to recover albedo
information.
● Application: Albedo information is crucial in computer vision applications,
such as material recognition, object tracking, and realistic rendering in
computer graphics.
Recovering Texture Maps and Albedos often involves the following techniques:
Photometric Stereo:
● Description: Captures several images of the same surface under different known lighting directions and, assuming largely diffuse reflection, solves for the surface normal and albedo at each pixel.
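A compact sketch of Lambertian photometric stereo is given below: with several images of a static surface under known, distant light sources, a per-pixel least-squares solve yields both the surface normal and the albedo. The function signature is illustrative, and the model assumes purely diffuse reflection with no shadows or highlights.

```python
import numpy as np

def photometric_stereo(images, light_dirs):
    """Lambertian photometric stereo (sketch).

    images:     (K, H, W) grayscale images under K known light directions.
    light_dirs: (K, 3) unit light direction vectors.
    Returns:    albedo (H, W) and surface normals (H, W, 3).
    """
    K, H, W = images.shape
    I = images.reshape(K, -1)                          # (K, H*W) intensities
    # Lambert's law: I = L @ (albedo * n); solve for g = albedo * n per pixel.
    G, *_ = np.linalg.lstsq(light_dirs, I, rcond=None) # (3, H*W)
    albedo = np.linalg.norm(G, axis=0)
    normals = (G / np.maximum(albedo, 1e-8)).T.reshape(H, W, 3)
    return albedo.reshape(H, W), normals
```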
Recovering texture maps and albedos is crucial for creating visually appealing and
realistic 3D models. These techniques bridge the gap between the geometry of the
objects and their appearance, contributing to the overall fidelity of virtual or augmented
environments.
UNIT V
IMAGE-BASED RENDERING AND RECOGNITION
View interpolation - Layered depth images - Light fields and Lumigraphs -
Environment mattes - Video-based rendering-Object detection - Face
recognition - Instance recognition - Category recognition - Context and
scene understanding- Recognition databases and test sets.
1. View Interpolation:
View interpolation is a technique used in computer graphics and computer vision to
generate new views of a scene that are not present in the original set of captured or
rendered views. The goal is to create additional viewpoints between existing ones,
providing a smoother transition and a more immersive experience. This is particularly
useful in applications like 3D graphics, virtual reality, and video processing. Here are key
points about view interpolation:
Description:
● View interpolation involves synthesizing views from known viewpoints in a
way that appears visually plausible and coherent.
● The primary aim is to provide a sense of continuity and smooth transitions
between the available views.
Methods:
● Image-Based Methods: These methods use image warping or morphing
techniques to generate new views by blending or deforming existing
images.
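As a deliberately simplified example of the image-based idea above, the sketch below produces intermediate frames by cross-dissolving two views. Real view interpolation first warps each image toward the target viewpoint (using correspondences, optical flow, or depth) and only then blends, so that scene content stays aligned; the loader name in the usage comment is hypothetical.

```python
import numpy as np

def naive_view_blend(view_a, view_b, alpha):
    """Weighted cross-dissolve between two views (a stand-in for proper
    warping plus blending). alpha = 0 returns view_a, alpha = 1 view_b."""
    assert 0.0 <= alpha <= 1.0
    blend = (1.0 - alpha) * view_a.astype(np.float32) + alpha * view_b.astype(np.float32)
    return blend.astype(view_a.dtype)

# Usage (hypothetical loader): synthesize three in-between frames.
# left, right = load_image("view_left.png"), load_image("view_right.png")
# frames = [naive_view_blend(left, right, a) for a in (0.25, 0.5, 0.75)]
```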
View interpolation is a valuable tool for enhancing the visual quality and user experience
in applications where dynamic or interactive viewpoints are essential. It enables the
creation of more natural and fluid transitions between views, contributing to a more
realistic and engaging visual presentation.
2. Layered Depth Images:
Layered Depth Images (LDI) is a technique used in computer graphics for efficiently
representing complex scenes with multiple layers of geometry at varying depths. The
primary goal of Layered Depth Images is to provide an effective representation of
scenes with transparency and occlusion effects. Here are key points about Layered
Depth Images:
Description:
● Layered Representation: LDI represents a scene as a stack of images,
where each image corresponds to a specific depth layer within the scene.
● Depth Information: Each pixel in the LDI contains color information as well
as depth information, indicating the position of the pixel along the view
direction.
Representation:
● 2D Array of Images: Conceptually, an LDI can be thought of as a 2D array
of images, where each image represents a different layer of the scene.
● Depth Slice: The images in the array are often referred to as "depth slices,"
and the order of the slices corresponds to the depth ordering of the layers.
Advantages:
● Efficient Storage: LDIs can provide more efficient storage for scenes with
transparency compared to traditional methods like z-buffers.
● Occlusion Handling: LDIs naturally handle occlusions and transparency,
making them suitable for rendering scenes with complex layering effects.
Use Cases:
● Augmented Reality: LDIs are used in augmented reality applications where
virtual objects need to be integrated seamlessly with the real world,
considering occlusions and transparency.
● Computer Games: LDIs can be employed in video games to efficiently
handle scenes with transparency effects, such as foliage or glass.
Scene Composition:
● Compositing: To render a scene from a particular viewpoint, the images
from different depth slices are composited together, taking into account
the depth values to handle transparency and occlusion.
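The compositing step can be sketched as standard back-to-front alpha blending of the depth slices, as below. The RGBA layout and the assumption that slices are already ordered far-to-near are simplifications made for the example.

```python
import numpy as np

def composite_ldi(slices):
    """Back-to-front "over" compositing of an LDI's depth slices.

    slices: list of (H, W, 4) RGBA arrays ordered from the farthest layer
            to the nearest layer, with alpha values in [0, 1].
    """
    h, w, _ = slices[0].shape
    out = np.zeros((h, w, 3), dtype=np.float32)
    for layer in slices:                             # far -> near
        rgb, alpha = layer[..., :3], layer[..., 3:]
        out = alpha * rgb + (1.0 - alpha) * out      # standard "over" operator
    return out
```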
Challenges:
● Memory Usage: Depending on the complexity of the scene and the
number of depth layers, LDIs can consume a significant amount of
memory.
● Anti-aliasing: Handling smooth transitions between layers, especially when
dealing with transparency, can pose challenges for anti-aliasing.
Extensions:
● Sparse Layered Representations: Some extensions of LDIs involve using
sparse representations to reduce memory requirements while maintaining
the benefits of layered depth information.
Layered Depth Images are particularly useful in scenarios where traditional rendering
techniques, such as z-buffer-based methods, struggle to handle transparency and
complex layering. By representing scenes as a stack of images, LDIs provide a more
natural way to deal with the challenges posed by rendering scenes with varying depths
and transparency effects.
3. Light Fields and Lumigraphs:
Light Fields:
● Definition: A light field is a representation of all the light rays traveling in all
directions through every point in a 3D space.
● Components: It consists of both the intensity and the direction of light at
each point in space.
Lumigraphs:
● Definition: A lumigraph is a type of light field that represents the visual
information in a scene as a function of both space and direction.
● Capture: Lumigraphs are typically captured using a set of images from a
dense camera array, capturing the scene from various viewpoints.
● Components: Similar to light fields, they include information about the
intensity and direction of light at different points in space.
● Applications: Primarily used in computer graphics and computer vision for
3D reconstruction, view interpolation, and realistic rendering of complex
scenes.
Comparison:
● Difference: While the terms are often used interchangeably, a light field generally refers to the complete 4D set of rays passing through a scene, whereas a lumigraph refers to a light-field representation that additionally incorporates approximate scene geometry to improve reconstruction from the captured images (a small indexing sketch follows the Advantages list below).
● Similarities: Both light fields and lumigraphs aim to capture a
comprehensive set of visual information about a scene to enable realistic
rendering and various computational photography applications.
Advantages:
● Realism: Light fields and lumigraphs contribute to realistic rendering by
capturing the full complexity of how light interacts with a scene.
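To make the 4D ray description above concrete, the sketch below stores a two-plane light field L(u, v, s, t) as a dense array captured by a camera grid; fixing (u, v) gives one captured view, and a crude novel view can be formed by bilinearly blending the four nearest cameras. The array sizes are arbitrary, and real light-field renderers resample individual rays rather than whole images.

```python
import numpy as np

# Two-plane parameterisation: L(u, v, s, t) is the radiance of the ray
# through (u, v) on the camera plane and (s, t) on the image plane.
U, V, S, T = 8, 8, 64, 64                      # hypothetical camera grid / image size
light_field = np.zeros((U, V, S, T, 3), dtype=np.float32)

# A captured view is obtained by fixing the camera-plane coordinates.
view = light_field[3, 4]                       # an (S, T, 3) image

# Crude novel view between cameras: bilinear blend of the 4 nearest views.
u, v = 3.4, 4.7
u0, v0, du, dv = int(u), int(v), u - int(u), v - int(v)
novel = ((1 - du) * (1 - dv) * light_field[u0, v0]
         + du * (1 - dv) * light_field[u0 + 1, v0]
         + (1 - du) * dv * light_field[u0, v0 + 1]
         + du * dv * light_field[u0 + 1, v0 + 1])
```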
4. Environment Mattes:
Definition:
● An environment matte separates a foreground element from its background and records how that element transmits, reflects, or blocks light from its surroundings, so that live-action footage can be composited convincingly over new backgrounds.
Techniques:
● Chroma Keying: Commonly used in film and television, chroma keying
involves shooting the subject against a uniformly colored background
(often green or blue) that can be easily removed in post-production (a toy keying sketch follows this list).
● Rotoscoping: Involves manually tracing the outlines of the subject frame
by frame, providing precise control over the matte but requiring significant
labor.
● Depth-based Mattes: In 3D applications, depth information can be used to
create a matte, allowing for more accurate separation of foreground and
background elements.
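The toy keying sketch promised above shows the basic idea behind chroma keying: pixels whose green channel clearly dominates are treated as background and given zero alpha. The threshold is an assumed tuning parameter; production keyers work in more suitable color spaces and produce soft, fractional mattes.

```python
import numpy as np

def green_screen_matte(image, threshold=1.3):
    """Binary chroma-key matte: background where green dominates red and blue."""
    img = image.astype(np.float32) + 1e-6           # avoid division issues
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    background = (g > threshold * r) & (g > threshold * b)
    return (~background).astype(np.float32)          # alpha: 1 = foreground

# Usage: composite the keyed foreground over a new background of equal size.
# alpha = green_screen_matte(frame)[..., None]
# composite = alpha * frame + (1 - alpha) * new_background
```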
Applications:
● Film and Television Production: Widely used in the entertainment industry
to create special effects, insert virtual backgrounds, or composite actors
into different scenes.
● Virtual Studios: In virtual production setups, environment mattes are
crucial for seamlessly integrating live-action footage with
computer-generated backgrounds.
Challenges:
● Soft Edges: Achieving smooth and natural transitions between the
foreground and background is challenging, especially when dealing with
fine details like hair or transparent objects.
● Motion Dynamics: Handling dynamic scenes with moving subjects or
dynamic camera movements requires advanced techniques to maintain
accurate mattes.
Spill Suppression:
● Colored light bouncing off a green or blue screen can tint the edges of the foreground subject; spill suppression removes this color cast so that the keyed element blends naturally with its new background.
Environment mattes play a crucial role in modern visual effects and virtual production,
allowing filmmakers and content creators to seamlessly integrate real and virtual
elements to tell compelling stories.
5. Video-based Rendering:
Definition:
● Video-based rendering uses captured video of a real scene to synthesize new views or frames that were not originally captured, so that the scene can be replayed from various perspectives.
Techniques:
● Captured footage from one or more cameras is warped and blended to synthesize the missing views, building on the view interpolation and layered-depth ideas described earlier.
Applications:
● Free-viewpoint replay of performances and sports, virtual walkthroughs of real environments, and effects that revisit recorded footage from various perspectives.
Challenges:
● Handling moving subjects, changing illumination, and camera motion while keeping the synthesized frames consistent remains a challenge.
Hybrid Approaches:
● Combining video-based rendering with conventional 3D graphics supports interactive experiences.
Future Directions:
● Ongoing work, including learning-based methods, aims to improve the quality and flexibility of view synthesis.
6. Object Detection:
Definition:
● Object detection is the task of locating the objects present in an image and classifying them, typically by predicting a bounding box and a class label for each object instance.
Methods:
● Two-Stage Detectors (e.g., Faster R-CNN): First generate region proposals and then classify and refine them; generally accurate but slower.
● Single Shot Multibox Detector (SSD), You Only Look Once (YOLO): One-stage detectors that are faster and suitable for real-time applications.
Transfer Learning:
● Pre-trained Models: Transfer learning involves using pre-trained models on
large datasets and fine-tuning them for specific object detection tasks.
● Popular Architectures: Models like ResNet, VGG, and MobileNet are often
used as backbone architectures for object detection.
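As an example of using a pre-trained detector, the sketch below runs a COCO-trained Faster R-CNN from torchvision on one image. The exact weights argument depends on the torchvision version (newer releases use weights="DEFAULT", older ones pretrained=True), and the 0.5 confidence threshold is an arbitrary choice; fine-tuning for a custom task would additionally replace the box-predictor head and train on the new labels.

```python
import torch
import torchvision

# COCO pre-trained two-stage detector (torchvision >= 0.13 style weights).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)                  # placeholder RGB image in [0, 1]
with torch.no_grad():
    pred = model([image])[0]                     # dict: boxes, labels, scores

keep = pred["scores"] > 0.5                      # assumed confidence threshold
print(pred["boxes"][keep], pred["labels"][keep])
```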
Recent Advancements:
● EfficientDet: An efficient object detection model that balances accuracy
and efficiency.
● CenterNet: Focuses on predicting object centers and regressing bounding
box parameters.
Object Detection Datasets:
● COCO (Common Objects in Context): Widely used for evaluating object
detection algorithms.
● PASCAL VOC (Visual Object Classes): Another benchmark dataset for
object detection tasks.
● ImageNet: Originally known for image classification, ImageNet has also
been used for object detection challenges.
Object detection is a fundamental task in computer vision with widespread applications
across various industries. Advances in deep learning and the availability of large-scale
datasets have significantly improved the accuracy and efficiency of object detection
models in recent years.
7. Face Recognition:
Definition:
● Face recognition is the task of identifying or verifying a person from an image or video of their face, typically by comparing facial features against a gallery of known individuals.
Methods:
● Eigenfaces: A technique that represents faces as linear combinations of
principal components (a minimal sketch follows this list).
● Local Binary Patterns (LBP): A texture-based method that captures
patterns of pixel intensities in local neighborhoods.
● Deep Learning: Convolutional Neural Networks (CNNs) have significantly
improved face recognition accuracy, with architectures like FaceNet and
VGGFace.
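The Eigenfaces sketch referred to above can be written in a few lines with PCA: faces are flattened to vectors, the principal components act as "eigenfaces", and each face is matched by comparing its low-dimensional projection coefficients. The number of components and the use of scikit-learn are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_eigenfaces(face_images, n_components=50):
    """face_images: (N, H, W) aligned grayscale faces. Returns the PCA model
    and the per-face coefficient vectors used as compact descriptors."""
    X = face_images.reshape(len(face_images), -1).astype(np.float32)
    pca = PCA(n_components=n_components, whiten=True).fit(X)
    return pca, pca.transform(X)

def identify(pca, gallery_codes, probe_image):
    """Return the index of the gallery face closest to the probe face."""
    probe = pca.transform(probe_image.reshape(1, -1).astype(np.float32))
    return int(np.linalg.norm(gallery_codes - probe, axis=1).argmin())
```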
Applications:
● Security and Access Control: Commonly used in secure access systems,
unlocking devices, and building access.
● Law Enforcement: Applied for identifying individuals in criminal
investigations and monitoring public spaces.
● Retail: Used for customer analytics, personalized advertising, and
enhancing customer experiences.
● Human-Computer Interaction: Implemented in applications for facial
expression analysis, emotion recognition, and virtual avatars.
Challenges:
● Variability in Pose: Recognizing faces under different poses and
orientations.
● Illumination Changes: Handling variations in lighting conditions that can
affect the appearance of faces.
Face recognition is a rapidly evolving field with numerous applications and ongoing
research to address challenges and enhance its capabilities. It plays a crucial role in
various industries, from security to personalized services, contributing to the
advancement of biometric technologies.
8. Instance Recognition:
Definition:
● Instance recognition identifies a specific object instance, such as a particular building, product, or vehicle, rather than just its general category, and assigns a unique identity to each occurrence in the scene.
Object Recognition vs. Instance Recognition:
● Object Recognition: Identifies object categories in an image without
distinguishing between different instances of the same category.
● Instance Recognition: Assigns unique identifiers to individual instances of
objects, allowing for differentiation between multiple occurrences of the
same category.
Semantic Segmentation and Instance Segmentation:
● Semantic Segmentation: Assigns a semantic label to each pixel in an
image, indicating the category to which it belongs (e.g., road, person, car).
● Instance Segmentation: Extends semantic segmentation by assigning a
unique identifier to each instance of an object, enabling differentiation
between separate objects of the same category.
Methods:
● Mask R-CNN: A popular instance segmentation method that extends the
Faster R-CNN architecture to provide pixel-level masks for each detected
object instance.
● Point-based Methods: Some instance recognition approaches operate on
point clouds or 3D data to identify and distinguish individual instances.
● Feature Embeddings: Utilizing deep learning methods to learn
discriminative feature embeddings for different instances.
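The feature-embedding idea in the last bullet can be sketched as follows: each detected object is described by an embedding vector (from any suitable network), and two detections are treated as the same physical instance when the cosine similarity of their embeddings exceeds a threshold. The threshold value and the greedy matching rule are simplifying assumptions.

```python
import numpy as np

def match_instances(embeddings_a, embeddings_b, threshold=0.7):
    """Greedy instance matching between two sets of detections.

    embeddings_a: (Na, D) feature vectors, embeddings_b: (Nb, D).
    Returns a list of (index_a, index_b) pairs judged to be the same instance.
    """
    a = embeddings_a / np.linalg.norm(embeddings_a, axis=1, keepdims=True)
    b = embeddings_b / np.linalg.norm(embeddings_b, axis=1, keepdims=True)
    similarity = a @ b.T                                   # cosine similarities
    matches = []
    for i, row in enumerate(similarity):
        j = int(row.argmax())
        if row[j] >= threshold:
            matches.append((i, j))
    return matches
```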
Applications:
● Autonomous Vehicles: Instance recognition is crucial for detecting and
tracking individual vehicles, pedestrians, and other objects in the
environment.
Instance recognition is a fundamental task in computer vision that enhances our ability
to understand and interact with the visual world by providing detailed information about
individual instances of objects or entities within a scene.
9. Category Recognition:
Definition:
● Category recognition (image classification) assigns an image or an object to a general semantic category, such as "car", "dog", or "building", without distinguishing between individual instances of that category.
Methods:
● Convolutional Neural Networks (CNNs): Deep learning methods,
particularly CNNs, have shown significant success in image categorization
tasks, learning hierarchical features.
● Bag-of-Visual-Words: Traditional computer vision approaches that
represent images as histograms of visual words based on local features.
● Transfer Learning: Leveraging pre-trained models on large datasets and
fine-tuning them for specific category recognition tasks.
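A minimal transfer-learning style example of category recognition is shown below, using an ImageNet pre-trained ResNet-18 from torchvision purely for inference. The API shown follows torchvision 0.13 or later (older versions use pretrained=True), and the random tensor stands in for a real photograph.

```python
import torch
import torchvision

weights = torchvision.models.ResNet18_Weights.DEFAULT
model = torchvision.models.resnet18(weights=weights).eval()
preprocess = weights.transforms()              # resize, crop, normalise

image = torch.rand(3, 256, 256)                # placeholder for a real photo
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))

category = weights.meta["categories"][int(logits.argmax())]
print("predicted category:", category)
```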
Applications:
● Image Tagging: Automatically assigning relevant tags or labels to images
for organization and retrieval.
Category recognition forms the basis for various applications in image understanding
and retrieval, providing a way to organize and interpret visual information at a broader
level. Advances in deep learning and the availability of large-scale datasets continue to
drive improvements in the accuracy and scalability of category recognition models.
10. Context and Scene Understanding:
Definition:
● Context and scene understanding go beyond recognizing individual objects: they interpret the scene as a whole, including the scene type, the spatial and semantic relationships between objects, and the surrounding context, to support higher-level reasoning.
Context and scene understanding are essential for creating intelligent systems that can
interpret and interact with the visual world in a manner similar to human perception.
Ongoing research in this field aims to improve the robustness, adaptability, and
interpretability of computer vision systems in diverse real-world scenarios.
11. Recognition Databases and Test Sets:
Recognition databases and test sets play a crucial role in the development and
evaluation of computer vision algorithms, providing standardized datasets for
training, validating, and benchmarking various recognition tasks. These datasets
often cover a wide range of domains, from object recognition to scene
understanding. Here are some commonly used recognition databases and test
sets:
ImageNet:
● Task: Image Classification, Object Recognition
● Description: ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
is a widely used dataset for image classification and object detection. It
includes millions of labeled images across thousands of categories.
COCO (Common Objects in Context):
● Tasks: Object Detection, Instance Segmentation, Keypoint Detection
● Description: COCO is a large-scale dataset that includes complex scenes
with multiple objects and diverse annotations. It is commonly used for
evaluating algorithms in object detection and segmentation tasks.
PASCAL VOC (Visual Object Classes):
● Tasks: Object Detection, Image Segmentation, Object Recognition
● Description: PASCAL VOC datasets provide annotated images with various
object categories. They are widely used for benchmarking object detection
and segmentation algorithms.
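For example, the PASCAL VOC detection split can be loaded directly through torchvision, which parses the XML annotations into nested dictionaries; the local root path and the choice to download are assumptions for the sketch.

```python
import torchvision

# Load (and optionally download) the PASCAL VOC 2012 training split.
voc = torchvision.datasets.VOCDetection(root="data/voc", year="2012",
                                        image_set="train", download=True)

image, target = voc[0]                          # PIL image + parsed annotation
first_object = target["annotation"]["object"][0]
print(first_object["name"], first_object["bndbox"])
```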
MOT (Multiple Object Tracking) Datasets:
● Task: Multiple Object Tracking
● Description: MOT datasets focus on tracking multiple objects in video
sequences. They include challenges related to object occlusion,
appearance changes, and interactions.
These recognition databases and test sets serve as benchmarks for evaluating the
performance of computer vision algorithms. They provide standardized and diverse
data, allowing researchers and developers to compare the effectiveness of different
approaches across a wide range of tasks and applications.