530 Exam #2

Created by Phillip Hyams

First-Order Logic (FOL)
A formal language for representing objects, relations, and functions in the world.
Constant Symbols
Symbols that refer to specific objects (e.g., Cat, Dog, John).
Predicate Symbols
Symbols representing relations or properties that are true or false (e.g., BiggerThan, BrotherOf).
Function Symbols
Symbols representing functions that map objects to other objects (e.g., LeftLeg(John)).
Logical Variables
Symbols like x or y that can refer to any object in the domain.
Atomic Sentence
A predicate applied to terms, representing a basic true/false fact.
Equality Sentence
A statement asserting two terms refer to the same object.
Complex Sentence
A sentence built using logical connectives or quantifiers.
Universal Quantifier (∀)
States that a property holds for all objects in the domain.
Existential Quantifier (∃)
States that there exists at least one object for which a property holds.
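As a quick illustration (the predicates and names here are hypothetical, chosen only to exercise the terms above), FOL sentences combine these pieces like so:

```
∀x King(x) ⇒ Person(x)           (universal quantifier over a predicate)
∃x Crown(x) ∧ OnHead(x, John)    (existential quantifier; John is a constant symbol)
Father(John) = Henry             (equality between a function term and a constant)
```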
Knowledge Base (KB)
A structured collection of facts and rules represented in FOL.
Ontological Engineering
Designing structured representations of concepts and relationships in a domain.
Upper Ontology
A high-level conceptual framework organizing general categories like objects, events, and processes.
Category (in FOL)
A group of objects sharing properties, representable as predicates or reified objects.
Taxonomic Hierarchy
A structure where categories inherit properties from more general categories.
Inheritance (Categories)
The idea that members of a category inherit properties of that category.
PartOf Relationship
A relation indicating that one object is a component of another.
Measurement Representation
Using FOL to express quantities, units, and conversions.
Fluent
A property whose truth value can change over time (e.g., HaveArrow).
Action
An event that changes the truth value of fluents.
Event Calculus
A formalism for representing events, their effects, and how fluents change over time.
T(f, t1, t2)
A predicate stating fluent f is true between times t1 and t2.
Happens(e, t1, t2)
A predicate stating event e occurs from time t1 to t2.
Initiates(e, f, t)
Event e causes fluent f to become true at time t.
Terminates(e, f, t)
Event e causes fluent f to become false at time t.
Modal Logic
A logic that represents knowledge across different possible worlds.
Referential Transparency
The property that substituting equal terms preserves truth (not always desired in knowledge reasoning).
Semantic Network
A graphical representation of objects, categories, and their relationships.
Default Information
Assumptions treated as true unless contradicted by evidence.
Nonmonotonic Logic
A logic where adding new information can invalidate previous conclusions.
Circumscription
A form of nonmonotonic reasoning that minimizes the extension of certain predicates.
Abnormal Predicate
A predicate used in circumscription to represent exceptions to general rules.
FOL for Knowledge Representation
Using First-Order Logic to encode objects, relations, and rules about the world.
Universal Properties in FOL
Statements that apply a property to all objects in a domain.
Existential Properties in FOL
Statements asserting that at least one object has a specific property.
FOL Complexity
The idea that fully using FOL requires extensive background knowledge not covered in this survey course.
Survey Course Limitation
The course introduces concepts at a high level without covering all formal details.
Uncertainty in AI
The idea that real-world environments involve incomplete, noisy, or unpredictable information.
Partial Observability
When an agent cannot directly observe all relevant aspects of the environment.
Noisy Sensors
Sensors that provide imperfect or unreliable measurements.
World Dynamics Uncertainty
Unknown or unpredictable changes in the environment.
Decision Theory
Framework combining probability and utility to choose optimal actions.
Expected Utility
The value of an action computed as the sum of utilities weighted by probabilities.
Possible Worlds
All distinct ways the world could be, used to define probability models.
Probability Model
A function assigning probabilities to each possible world.
Event
A subset of possible worlds representing an outcome of interest.
Random Variable
A function mapping each possible world to a value of interest.
Range of a Random Variable
The set of all possible values a random variable can take.
Probability Distribution
A function giving the probability of each value of a random variable.
Joint Distribution
A table giving probabilities for all combinations of values of multiple variables.
Marginal Distribution
A distribution obtained by summing out (eliminating) other variables.
Marginalization
The process of summing over variables to compute a marginal distribution.
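A minimal Python sketch of marginalization over a small joint table (the variables and numbers are invented for illustration):

```python
# A tiny joint distribution P(Weather, Traffic) stored as a dict.
# The numbers are invented; they only need to sum to 1.
joint = {
    ("sun", "light"): 0.4, ("sun", "heavy"): 0.1,
    ("rain", "light"): 0.1, ("rain", "heavy"): 0.4,
}

def marginal(joint, index):
    """Sum out every variable except the one at position `index`."""
    dist = {}
    for outcome, p in joint.items():
        dist[outcome[index]] = dist.get(outcome[index], 0.0) + p
    return dist

p_weather = marginal(joint, 0)   # {'sun': 0.5, 'rain': 0.5}
p_traffic = marginal(joint, 1)   # {'light': 0.5, 'heavy': 0.5}
```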
Conditional Probability
The probability of an event given that another event has occurred.
Normalization
Scaling probabilities so they sum to 1.
Product Rule
P(a, b) = P(a | b) P(b).
Chain Rule
A method for decomposing joint distributions into conditional probabilities.
Probabilistic Inference
Computing desired probabilities from a probability model.
Inference by Enumeration
A brute-force method that selects the joint-distribution entries consistent with the evidence, sums out the hidden variables, and normalizes the result.
Evidence Variables
Observed variables used to condition a probability query.
Query Variable
The variable whose probability distribution we want to compute.
Hidden Variables
Variables not observed but included in the model.
Bayes’ Rule
A formula for reversing conditional probabilities using P(a | b) = P(b | a) P(a) / P(b).
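The product rule and Bayes' rule are each one line of arithmetic; a sketch with invented numbers for a disease/symptom pair (causal direction known, diagnostic direction wanted):

```python
# All three input probabilities below are made up for illustration.
p_cause = 0.001                 # prior P(disease)
p_effect_given_cause = 0.9      # causal: P(symptom | disease)
p_effect = 0.05                 # P(symptom)

# product rule: P(a, b) = P(a | b) P(b)
p_joint = p_effect_given_cause * p_cause

# Bayes' rule: P(cause | effect) = P(effect | cause) P(cause) / P(effect)
p_cause_given_effect = p_joint / p_effect   # 0.9 * 0.001 / 0.05 = 0.018
```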
Diagnostic Probability
Probability of a cause given an observed effect.
Causal Probability
Probability of an effect given a cause.
Independence
When P(x, y) = P(x) P(y), meaning variables do not influence each other.
Conditional Independence
When two variables are independent given a third variable.
Ghostbusters Domain
A grid-based example illustrating conditional independence and sensor models.
Ghost Location Variable (G)
The random variable representing the ghost’s position.
Sensor Variables (Cxy)
Variables representing the color readings at each grid square.
Sensor Model
The probability distribution P(Cxy | G) depending on distance to the ghost.
Uniform Prior
A prior distribution where all values are equally likely.
Naïve Bayes Model
A model where all evidence variables are conditionally independent given a single cause variable.
Naïve Bayes Joint Formula
P(Cause, Effects) = P(Cause) Π P(Effect_i | Cause).
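A sketch of the naïve Bayes formula in code, using a made-up two-class spam example (the classes, words, and probabilities are all assumptions for illustration):

```python
# Naive Bayes posterior: P(Cause | effects) ∝ P(Cause) * Π P(effect_i | Cause).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {                       # P(word | class), invented numbers
    "spam": {"free": 0.30, "meeting": 0.02},
    "ham":  {"free": 0.05, "meeting": 0.20},
}

def posterior(words):
    scores = {}
    for c in prior:
        p = prior[c]
        for w in words:
            p *= likelihood[c][w]    # conditional independence given the class
        scores[c] = p
    z = sum(scores.values())         # normalization
    return {c: p / z for c, p in scores.items()}

post = posterior(["free"])           # spam: 0.12/0.15 = 0.8, ham: 0.2
```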
Text Classification (Naïve Bayes)
Using word frequencies and conditional independence to classify documents.
Category Prior
The probability of a class before observing evidence.
Word Likelihood
The probability of a word occurring given a category.
Posterior Probability
Updated probability of a category after observing evidence.
Bayes Net (Bayesian Network)
A graphical model representing joint distributions using nodes and directed edges.
Graphical Model
A representation of variables and their dependencies using a graph.
Node (in Bayes Net)
A variable in the model.
Arc (in Bayes Net)
A directed edge indicating direct influence between variables.
Directed Acyclic Graph (DAG)
The structure underlying a Bayesian network.
Conditional Probability Table (CPT)
A table specifying probabilities of a variable given its parents.
Local Conditional Distribution
The probability distribution for a node given its parents.
Global Semantics (Bayes Net)
The joint distribution defined as the product of all local conditionals.
Sparse Bayes Net
A Bayes net where each variable has few parents, reducing complexity.
Markov Blanket
The set of a node’s parents, children, and children’s parents; shields the node from the rest of the network.
Conditional Independence Semantics
Every variable is conditionally independent of its non-descendants given its parents.
Topological Order
An ordering of nodes where parents appear before children.
Alarm Network
A classic Bayes net example involving burglary, earthquake, alarm, and neighbor calls.
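A sketch of the alarm network's global semantics in Python, using the CPT values commonly quoted for this example; the joint probability of any full assignment is just the product of the local conditionals:

```python
import itertools

# CPTs for the burglary/earthquake/alarm example (standard textbook numbers).
P_b = 0.001
P_e = 0.002
P_a_given = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}
P_j_given = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_m_given = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """Global semantics: product of each variable's local conditional."""
    p = (P_b if b else 1 - P_b) * (P_e if e else 1 - P_e)
    p *= P_a_given[(b, e)] if a else 1 - P_a_given[(b, e)]
    p *= P_j_given[a] if j else 1 - P_j_given[a]
    p *= P_m_given[a] if m else 1 - P_m_given[a]
    return p

# P(j, m, a, ¬b, ¬e) ≈ 0.000628
p = joint(False, False, True, True, True)

# sanity check: the 32 full assignments sum to 1
total = sum(joint(*bits) for bits in itertools.product([True, False], repeat=5))
```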
Free Parameters (CPT)
The number of independent probabilities needed to specify a conditional probability table.
Local Causality
The idea that variables interact only with a small number of others.
Hidden Markov model (HMM)
A model with a Markov chain over hidden states X_t and observations E_t emitted from each state via P(E_t | X_t).
Umbrella world state variable
R_t, indicating whether it is raining on day t.
Umbrella world evidence variable
U_t, indicating whether the director carries an umbrella on day t.
Viterbi algorithm
Dynamic programming algorithm that computes, for each state and time, the maximum probability of any path ending there, to find the most likely state sequence.
State trellis
A layered graph whose nodes are states at each time step and whose edges represent transitions with weights from transition and sensor probabilities.
Robot localization problem
Using a transition model over robot positions and a sensor model over wall/no-wall readings to maintain a belief over the robot’s location.
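A minimal Viterbi sketch for the umbrella world, using the commonly quoted model numbers (0.7 state persistence, sensor 0.9/0.2, uniform prior); an illustration, not the course's reference code:

```python
# State is "raining?" (True/False); evidence is "umbrella seen?" per day.
trans = {True: {True: 0.7, False: 0.3}, False: {True: 0.3, False: 0.7}}
sensor = {True: 0.9, False: 0.2}          # P(umbrella = true | rain)

def viterbi(evidence):
    # m[s] = max probability of any state sequence ending in state s
    m = {s: 0.5 * (sensor[s] if evidence[0] else 1 - sensor[s])
         for s in (True, False)}
    back = []                              # back-pointers, one dict per step
    for e in evidence[1:]:
        prev, m, ptr = m, {}, {}
        for s in (True, False):
            obs = sensor[s] if e else 1 - sensor[s]
            best = max((True, False), key=lambda r: prev[r] * trans[r][s])
            m[s] = prev[best] * trans[best][s] * obs
            ptr[s] = best
        back.append(ptr)
    state = max(m, key=m.get)              # best final state
    path = [state]
    for ptr in reversed(back):             # follow back-pointers
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

# umbrella on days 1,2,4,5 but not 3 → rain on days 1,2,4,5 but not 3
path = viterbi([True, True, False, True, True])
```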
Markov Process
A system where the next state depends only on the current state, not the full history.
First-Order Markov Assumption
The assumption that each state depends only on the immediately previous state.
Stationarity Assumption
The rule that transition probabilities stay the same at every time step.
State Variable
A hidden variable representing the true condition of the system at a given time.
Evidence Variable
An observable variable that provides information about the hidden state.
Transition Model
The model describing how the hidden state evolves from one time step to the next.
Sensor Model
The model describing how observations are generated from the current hidden state.
Discrete Time Model
A model that represents the world as a sequence of time slices.
Umbrella World
A simple example where the hidden state is whether it is raining and the evidence is whether the director carries an umbrella.
Filtering
The task of computing the belief about the current state given all evidence so far.
Prediction
Estimating future states based on the current belief without adding new evidence.
Smoothing
Estimating past states using both earlier and later evidence.
Most Likely Explanation
The single most probable sequence of states that explains the observed evidence.
Forward Algorithm
A recursive method for updating beliefs over time using transition and sensor models.
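The forward algorithm for the umbrella world reduces to a prediction step plus an update step; a sketch with the usual model numbers (assumed, not taken from the deck):

```python
# belief = P(rain_t | evidence so far); 0.7 persistence, 0.9/0.2 sensor.
def forward(belief, umbrella, trans=0.7, sens=(0.9, 0.2)):
    # prediction step: push the belief through the transition model
    p_rain = belief * trans + (1 - belief) * (1 - trans)
    # update step: weight by the sensor model, then normalize
    w_rain = (sens[0] if umbrella else 1 - sens[0]) * p_rain
    w_dry = (sens[1] if umbrella else 1 - sens[1]) * (1 - p_rain)
    return w_rain / (w_rain + w_dry)

b = 0.5                     # uniform prior P(rain_0)
b = forward(b, True)        # day 1, umbrella seen → ≈ 0.818
b = forward(b, True)        # day 2, umbrella seen → ≈ 0.883
```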
Belief State
The probability distribution over all possible states at a given time.
Update Step
The filtering step where new evidence is incorporated into the belief state.
Prediction Step
The filtering step where the belief is projected forward using the transition model.
Stationary Distribution
The long-run distribution that a Markov chain converges to after many steps.
Partial Observability
A situation where the agent cannot directly observe the true state of the world.
Dynamic World
A world where the state changes over time while the agent is reasoning.
Atomic Representation
A representation where states are indivisible wholes, used in search algorithms.
Factored Representation
A representation where states are described by variables and their values.
Structured Representation
A representation using objects and relations, as in first-order logic.
Bayesian Network
A factored probabilistic model over a fixed, finite set of random variables.
Probabilistic Programming Language (PPL)
A language that defines probability distributions using programs with random choices.
Generative Program
A program where each random choice defines a random variable in a probability model.
Execution Trace
A sequence of values for all random choices made during one run of a generative program.
Decision-Theoretic Agent
An agent that makes rational decisions using beliefs (probabilities) and utilities.
Utility Function
A function assigning a numerical desirability to each state.
Expected Utility
The average utility of outcomes from a state, weighted by their probabilities.
Maximum Expected Utility (MEU)
The principle that an agent should choose the action with the highest expected utility.
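A toy MEU computation (the actions, outcome probabilities, and utilities are all invented):

```python
# Expected utility = Σ p * u over an action's outcomes; MEU picks the max.
actions = {
    "take_umbrella": [(1.0, 70)],               # (probability, utility) pairs
    "leave_it":      [(0.3, -20), (0.7, 100)],  # 30% chance of getting soaked
}

def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

best = max(actions, key=lambda a: expected_utility(actions[a]))
# EU(take_umbrella) = 70 > EU(leave_it) = 64, so best = "take_umbrella"
```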
Episodic Decision Problem
A one-shot decision problem where each decision is independent.
Sequential Decision Problem
A decision problem requiring a sequence of actions over time.
Learning
The process of improving performance based on observations.
Machine Learning
When a computer learns models from data to solve problems.
Induction
Reasoning from specific observations to general rules.
Deduction
Reasoning from general rules to specific conclusions.
Classification
A learning task where the output is a discrete label.
Regression
A learning task where the output is a numeric value.
Supervised Learning
Learning from labeled input-output pairs.
Unsupervised Learning
Learning patterns or structure without labeled outputs.
Clustering
Grouping similar examples in unsupervised learning.
Reinforcement Learning
Learning from rewards and punishments to determine good actions.
Training Set
A set of labeled examples used to learn a hypothesis.
Test Set
A set of unseen examples used to evaluate generalization.
Ground Truth
The true output value the model is trying to predict.
Hypothesis
A learned function that approximates the true function.
Hypothesis Space
The set of all candidate hypotheses the learner can choose from.
Consistent Hypothesis
A hypothesis that correctly predicts all training examples.
Generalization
How well a learned hypothesis performs on unseen data.
Bias
Systematic tendency of a hypothesis to deviate from the true function.
Variance
How much a hypothesis changes when trained on different data samples.
Underfitting
When a model is too simple to capture the underlying pattern.
Overfitting
When a model fits the training data too closely and fails to generalize.
Bias-Variance Tradeoff
The balance between simplicity (bias) and flexibility (variance).
Rule of Simplicity
Choose models as simple as possible while still capturing important patterns.
Linear Regression
A method for fitting a straight line or hyperplane to data for prediction.
Prediction Function
The linear model output, computed as h(x) = w0 + w1x.
Prediction Error
The difference between the true value y and the predicted value h(x).
Residual
The error on a single instance, defined as y − h(x).
Least Squares Loss
A loss function that sums the squared prediction errors over all examples.
Gradient Descent
An iterative optimization method that updates weights to reduce loss.
Learning Rate
A small constant controlling how large each gradient descent update is.
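A gradient-descent sketch for the linear model h(x) = w0 + w1x, updated per example on toy data that lies exactly on y = 2x + 1 (the data and learning rate are assumptions):

```python
# Per-example gradient descent on least-squares loss; the constant factor
# from the squared-error derivative is folded into the learning rate.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]   # points on y = 2x + 1
w0, w1 = 0.0, 0.0
alpha = 0.05                              # learning rate

for _ in range(2000):
    for x, y in data:
        err = y - (w0 + w1 * x)           # residual on this example
        w0 += alpha * err                 # step on the intercept
        w1 += alpha * err * x             # step on the slope

# w0, w1 approach the true intercept 1 and slope 2
```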
Feature Vector
A vector of input features representing an example.
Linear Classifier
A classifier that makes predictions using a weighted sum of features.
Decision Boundary
A line, plane, or surface separating predicted classes.
Threshold Classifier
A classifier that outputs 1 if w·x ≥ 0 and 0 otherwise.
Activation
The weighted sum w·f(x) used to determine a classifier’s output.
Perceptron
A linear classifier that outputs +1 or -1 based on the sign of w·x.
Weight Vector
The set of learned weights applied to input features.
False Negative
When the classifier predicts negative but the true label is positive.
False Positive
When the classifier predicts positive but the true label is negative.
Perceptron Learning Rule
The update rule w ← w + α(y − h(x))x.
Learning Rate (Perceptron)
A constant α controlling how strongly weights are adjusted.
Linearly Separable Data
Data that can be perfectly separated by a hyperplane.
Perceptron Convergence Theorem
States that perceptron learning converges if data is linearly separable.
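The perceptron rule in a dozen lines, trained on a tiny linearly separable set (the data is invented; a constant-1 feature stands in for the bias, and labels are 0/1 as in the update rule above):

```python
# Threshold classifier: output 1 if w·x >= 0, else 0.
def h(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

# Each example is ((1, f1, f2), label); the label equals f1, so the data
# is linearly separable and training must converge.
data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = [0.0, 0.0, 0.0]
alpha = 1.0

for _ in range(10):                        # a few passes suffice here
    for x, y in data:
        err = y - h(w, x)                  # 0 when correct; ±1 when wrong
        w = [wi + alpha * err * xi for wi, xi in zip(w, x)]

ok = all(h(w, x) == y for x, y in data)    # True once converged
```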
Non-Separable Data
Data that cannot be perfectly separated by any linear boundary.
Overtraining
When training accuracy rises but test accuracy eventually falls.
Averaged Perceptron
A perceptron variant that averages weight vectors to reduce noise.
XOR Problem
A classic example of a function that cannot be learned by a linear classifier.
Sigmoid Function
A smooth function mapping real values to probabilities between 0 and 1.
Logistic Regression
A linear classifier that uses the sigmoid function to output probabilities.
Support Vector Machine (SVM)
A classifier that maximizes the margin between classes.
Ensemble Learning
Using multiple hypotheses together to improve performance.
Bagging
Training multiple models on bootstrap samples and combining their predictions.
Random Forest
An ensemble of decision trees built using random subsets of features.
Boosting
An ensemble method that sequentially trains models to correct previous errors.
Stacking
An ensemble method that combines predictions from multiple models using another learner.
Deep Learning
A broad family of machine learning techniques where hypotheses are complex algebraic circuits with tunable weights.
Neural Network
A model composed of interconnected units that compute weighted sums and apply nonlinear activation functions.
Deep Network
A neural network with many layers, giving a long computation path from input to output.
Advantage of Deep Networks
They allow interactions between variables through multiple layers, unlike linear models which combine inputs independently.
Neuron (McCulloch-Pitts Model)
A simple computational unit that sums weighted inputs and applies a nonlinear activation function.
Bias Weight
An additional fixed input with weight that shifts the activation threshold of a neuron.
Feedforward Network
A neural network where connections flow only forward, forming a directed acyclic graph.
Recurrent Network
A neural network with feedback connections, giving it internal state or memory.
Unit (Node)
A component that computes a weighted sum of inputs and applies a nonlinear activation function.
Activation Function
A nonlinear function applied to a unit’s weighted input; enables networks to represent nonlinear functions.
Universal Approximation Theorem
A two-layer neural network with enough hidden units can approximate any continuous function.
Sigmoid Function
A smooth activation function mapping inputs to values between 0 and 1.
ReLU
Activation function defined as max(0, x), commonly used in modern deep networks.
Softplus
A smooth version of ReLU defined as log(1 + e^x).
Tanh
Activation function mapping inputs to values between -1 and 1.
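The four activation functions above, written out (definitions only; the sample inputs are arbitrary):

```python
import math

def sigmoid(x):  return 1 / (1 + math.exp(-x))
def relu(x):     return max(0.0, x)
def softplus(x): return math.log(1 + math.exp(x))   # smooth version of relu
def tanh(x):     return math.tanh(x)

mid = sigmoid(0.0)        # 0.5: sigmoid is centered at 0
clipped = relu(-3.0)      # 0.0: negative inputs are zeroed
smooth = softplus(10.0)   # ≈ 10.0: softplus ≈ relu for large positive x
```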
Building a Network
Combining units into layers so the overall function is a composition of many algebraic operations.
Gradient Descent in Neural Nets
Adjusting weights by computing gradients of loss through all layers.
Backpropagation
The process of passing output error backward through the network to compute gradients for each weight.
Vanishing Gradient Problem
When gradients become extremely small in deep networks, preventing effective learning.
Automatic Differentiation
Software technique (e.g., TensorFlow, PyTorch) that automatically computes derivatives for all weights.
Computation Graph
A dataflow graph where each node represents an elementary computation in the network.
Input Encoding
The process of converting raw data into numeric inputs for a neural network.
One-Hot Encoding
Representing a categorical variable with a vector where exactly one element is 1 and the rest are 0.
Output Encoding
Representing outputs (labels) in a form suitable for loss computation, such as one-hot vectors.
Squared Error Loss
A loss function used for regression that measures squared differences between predicted and true values.
Negative Log Likelihood
A loss function used for classification that encourages high probability for correct labels.
Maximum Likelihood Learning
Choosing weights that maximize the probability of the observed data.
Softmax
An activation function for multiclass classification that outputs a probability distribution over classes.
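A sketch tying softmax, one-hot targets, and negative log likelihood together (the scores are arbitrary):

```python
import math

def softmax(scores):
    """Turn arbitrary scores into a probability distribution."""
    exps = [math.exp(s - max(scores)) for s in scores]  # shift for stability
    z = sum(exps)
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.0])   # highest score gets the most probability
target = [1, 0, 0]                 # one-hot encoding of the true class
# with a one-hot target, NLL reads off the probability of the true class
nll = -math.log(sum(p * t for p, t in zip(probs, target)))
```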
Hidden Layer
A layer between input and output that transforms representations and may learn meaningful features.
Convolutional Neural Network (CNN)
A neural network with spatially local connections and shared weights (kernels).
Kernel (Filter)
A small set of weights replicated across an image to detect specific features.
Convolution
Operation applying a kernel across an input to produce feature maps.
Stride
The step size of the convolution; larger strides reduce output size.
Receptive Field
The region of input that influences a particular hidden unit.
Pooling Layer
A layer that summarizes adjacent units using a fixed operation such as max or average.
Average Pooling
Pooling method that computes the average of inputs in a region.
Max Pooling
Pooling method that outputs the maximum value in a region.
Downsampling
Reducing spatial resolution by pooling or strided convolution.
Fully Connected Layer
A layer where each unit connects to all units in the previous layer, often used near the output.
Residual Network (ResNet)
A deep architecture that adds small perturbations to previous layer outputs to avoid vanishing gradients.
Stochastic Gradient Descent (SGD)
A training method where weights are updated using gradients computed from small minibatches of data.
Minibatch
A small subset of training examples used to compute an approximate gradient during SGD.
Learning Rate Decay
Gradually reducing the learning rate over time to stabilize training.
Momentum
A technique that smooths SGD updates by incorporating past gradients to reduce variance.
Batch Normalization
A method that normalizes activations within a minibatch to stabilize and speed up training.
Generalization
The ability of a model to perform well on unseen data rather than just the training set.
Adversarial Example
An input intentionally perturbed to cause a neural network to make an incorrect prediction.
Weight Decay
A regularization technique that penalizes large weights to reduce overfitting.
Dropout
A regularization method where random units are deactivated during training to prevent co-adaptation.
Recurrent Neural Network (RNN)
A neural network with cycles that allow information to persist across time steps.
Hidden State (RNN)
The internal memory of an RNN that carries information from previous time steps.
Unrolling an RNN
Expanding an RNN across time steps to form a feedforward computation graph for training.
Vanishing Gradient (RNN)
When gradients shrink during backpropagation through time, preventing learning of long-term dependencies.
Exploding Gradient (RNN)
When gradients grow uncontrollably during training, destabilizing learning.
Long Short-Term Memory (LSTM)
A type of RNN with memory cells and gates designed to avoid vanishing and exploding gradients.
Memory Cell (LSTM)
A component that stores information across time steps with minimal modification.
Gating Units (LSTM)
Mechanisms that control information flow into and out of the memory cell.
Unsupervised Learning
Learning patterns from data without labeled outputs.
Clustering
Grouping similar data points based on inherent structure in the data.
K-Means Clustering
A partitional clustering algorithm that assigns points to the nearest centroid and iteratively updates centroids.
Centroid
The center of a cluster, often the mean of its points.
Cluster Shape Problem
The issue that some clustering methods assume round or blob-like clusters, which may not match real data.
Gaussian Mixture Model
A probabilistic clustering model that represents data as a mixture of Gaussian components.
Hierarchical Clustering
A clustering method that builds nested clusters organized as a tree.
Partitional Clustering
A clustering method that divides data into non-overlapping clusters.
Exclusive Clustering
A clustering method where each point belongs to exactly one cluster.
Non-Exclusive Clustering
A clustering method where points may belong to multiple clusters.
Fuzzy Clustering
A clustering method where each point has a degree of membership in each cluster.
Well-Separated Cluster
A cluster where every point is closer to points in its own cluster than to points outside it.
Prototype-Based Cluster
A cluster defined by similarity to a central prototype such as a centroid or medoid.
Contiguity-Based Cluster
A cluster where each point is closer to at least one other point in the same cluster.
Density-Based Cluster
A cluster defined as a dense region separated by low-density areas.
Objective Function (Clustering)
A function that measures clustering quality, such as sum of squared errors.
Initial Centroid Selection
The process of choosing starting centroids for K-means, which strongly affects results.
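A bare-bones 1-D k-means sketch showing the assign/update loop (points and initial centroids are invented; assumes no cluster goes empty):

```python
# Each round: assign every point to its nearest centroid, then move each
# centroid to the mean of its assigned points.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids = [0.0, 10.0]            # initial guesses (they matter, per above)

for _ in range(10):
    clusters = [[] for _ in centroids]
    for p in points:
        j = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
        clusters[j].append(p)       # assignment step
    centroids = [sum(c) / len(c) for c in clusters]   # update step

# converges to the two obvious cluster centers, 1.5 and 8.5
```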
Autoencoder
A neural network with an encoder and decoder trained to reconstruct its input.
Encoder
The part of an autoencoder that maps input x to a compressed representation z.
Decoder
The part of an autoencoder that reconstructs x from the representation z.
Generative Adversarial Network (GAN)
A system of two networks—a generator and discriminator—trained together to produce realistic data.
Generator (GAN)
A network that produces synthetic data intended to fool the discriminator.
Discriminator (GAN)
A classifier trained to distinguish real data from generated data.
Transfer Learning
Using knowledge from one trained model to improve learning on a related task.
Frozen Layers
Layers whose weights are kept fixed during transfer learning because they capture general features.