First-Order Logic (FOL)
A formal language for representing objects, relations, and functions in the world.
| Term | Definition |
|---|---|
First-Order Logic (FOL) | A formal language for representing objects, relations, and functions in the world. |
Constant Symbols | Symbols that refer to specific objects (e.g., Cat, Dog, John). |
Predicate Symbols | Symbols representing relations or properties that are true or false (e.g., BiggerThan, BrotherOf). |
Function Symbols | Symbols representing functions that map objects to other objects (e.g., LeftLeg(John)). |
Logical Variables | Symbols like x or y that can refer to any object in the domain. |
Atomic Sentence | A predicate applied to terms, representing a basic true/false fact. |
Equality Sentence | A statement asserting two terms refer to the same object. |
Complex Sentence | A sentence built using logical connectives or quantifiers. |
Universal Quantifier (∀) | States that a property holds for all objects in the domain. |
Existential Quantifier (∃) | States that there exists at least one object for which a property holds. |
Knowledge Base (KB) | A structured collection of facts and rules represented in FOL. |
Ontological Engineering | Designing structured representations of concepts and relationships in a domain. |
Upper Ontology | A high-level conceptual framework organizing general categories like objects, events, and processes. |
Category (in FOL) | A group of objects sharing properties, representable as predicates or reified objects. |
Taxonomic Hierarchy | A structure where categories inherit properties from more general categories. |
Inheritance (Categories) | The idea that members of a category inherit properties of that category. |
PartOf Relationship | A relation indicating that one object is a component of another. |
Measurement Representation | Using FOL to express quantities, units, and conversions. |
Fluent | A property whose truth value can change over time (e.g., HaveArrow). |
Action | An event that changes the truth value of fluents. |
Event Calculus | A formalism for representing events, their effects, and how fluents change over time. |
T(f, t1, t2) | A predicate stating fluent f is true between times t1 and t2. |
Happens(e, t1, t2) | A predicate stating event e occurs from time t1 to t2. |
Initiates(e, f, t) | Event e causes fluent f to become true at time t. |
Terminates(e, f, t) | Event e causes fluent f to become false at time t. |
Modal Logic | A logic that represents knowledge across different possible worlds. |
Referential Transparency | The property that substituting equal terms preserves truth (not always desired in knowledge reasoning). |
Semantic Network | A graphical representation of objects, categories, and their relationships. |
Default Information | Assumptions treated as true unless contradicted by evidence. |
Nonmonotonic Logic | A logic where adding new information can invalidate previous conclusions. |
Circumscription | A form of nonmonotonic reasoning that minimizes the extension of certain predicates. |
Abnormal Predicate | A predicate used in circumscription to represent exceptions to general rules. |
FOL for Knowledge Representation | Using First-Order Logic to encode objects, relations, and rules about the world. |
Universal Properties in FOL | Statements that apply a property to all objects in a domain. |
Existential Properties in FOL | Statements asserting that at least one object has a specific property. |
FOL Complexity | The idea that fully using FOL requires extensive background knowledge not covered in this survey course. |
Survey Course Limitation | The course introduces concepts at a high level without covering all formal details. |
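
The quantifier and event-calculus entries above can be made concrete with a few schematic FOL sentences (notation follows the table's own examples; the sentences themselves are illustrative, not from the course):

```
∀x King(x) ⇒ Person(x)                  -- universal: every king is a person
∃x Crown(x) ∧ OnHead(x, John)           -- existential: some crown is on John's head
∀t Happens(Shoot, t, t) ⇒
   Terminates(Shoot, HaveArrow, t)      -- event calculus: shooting ends the fluent HaveArrow
```

| Term | Definition |
|---|---|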
Uncertainty in AI | The idea that real-world environments involve incomplete, noisy, or unpredictable information. |
Partial Observability | When an agent cannot directly observe all relevant aspects of the environment. |
Noisy Sensors | Sensors that provide imperfect or unreliable measurements. |
World Dynamics Uncertainty | Unknown or unpredictable changes in the environment. |
Decision Theory | Framework combining probability and utility to choose optimal actions. |
Expected Utility | The value of an action computed as the sum of utilities weighted by probabilities. |
Possible Worlds | All distinct ways the world could be, used to define probability models. |
Probability Model | A function assigning probabilities to each possible world. |
Event | A subset of possible worlds representing an outcome of interest. |
Random Variable | A function mapping each possible world to a value of interest. |
Range of a Random Variable | The set of all possible values a random variable can take. |
Probability Distribution | A function giving the probability of each value of a random variable. |
Joint Distribution | A table giving probabilities for all combinations of values of multiple variables. |
Marginal Distribution | A distribution obtained by summing out (eliminating) other variables. |
Marginalization | The process of summing over variables to compute a marginal distribution. |
Conditional Probability | The probability of an event given that another event has occurred. |
Normalization | Scaling probabilities so they sum to 1. |
Product Rule | P(a, b) = P(a \| b) P(b). |
Chain Rule | A method for decomposing joint distributions into conditional probabilities. |
Probabilistic Inference | Computing desired probabilities from a probability model. |
Inference by Enumeration | A brute-force method that selects the joint-distribution entries consistent with the evidence, sums out the hidden variables, and normalizes the result. |
Evidence Variables | Observed variables used to condition a probability query. |
Query Variable | The variable whose probability distribution we want to compute. |
Hidden Variables | Variables not observed but included in the model. |
Bayes’ Rule | A formula for reversing conditional probabilities using P(a \| b) = P(b \| a) P(a) / P(b). |
Diagnostic Probability | Probability of a cause given an observed effect. |
Causal Probability | Probability of an effect given a cause. |
Independence | When P(x, y) = P(x) P(y) for all values x and y, meaning the variables do not influence each other. |
Conditional Independence | When two variables are independent given a third variable. |
Ghostbusters Domain | A grid-based example illustrating conditional independence and sensor models. |
Ghost Location Variable (G) | The random variable representing the ghost’s position. |
Sensor Variables (Cxy) | Variables representing the color readings at each grid square. |
Sensor Model | The probability distribution P(Cxy \| G) depending on distance to the ghost. |
Uniform Prior | A prior distribution where all values are equally likely. |
Naïve Bayes Model | A model where all evidence variables are conditionally independent given a single cause variable. |
Naïve Bayes Joint Formula | P(Cause, Effects) = P(Cause) Π P(Effect_i \| Cause). |
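
As a concrete illustration of Bayes' rule, normalization, and the diagnostic/causal distinction, here is a minimal sketch; the medical-test numbers are invented for illustration:

```python
# Bayes' rule with normalization: P(cause | effect) ∝ P(effect | cause) P(cause).
# All numbers below are invented for illustration.
p_disease = 0.01                 # prior P(disease)
p_pos_given_disease = 0.90       # causal probability P(positive | disease)
p_pos_given_healthy = 0.05       # false-positive rate P(positive | ¬disease)

# Unnormalized posterior over the two possible worlds.
joint_disease = p_pos_given_disease * p_disease        # P(positive, disease)
joint_healthy = p_pos_given_healthy * (1 - p_disease)  # P(positive, ¬disease)

# Normalization: scale so the posterior sums to 1.
p_disease_given_pos = joint_disease / (joint_disease + joint_healthy)
```

Even though the causal probability P(positive given disease) is 0.9, the diagnostic probability works out to only about 0.15, because the prior on disease is low.

| Term | Definition |
|---|---|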
Text Classification (Naïve Bayes) | Using word frequencies and conditional independence to classify documents. |
Category Prior | The probability of a class before observing evidence. |
Word Likelihood | The probability of a word occurring given a category. |
Posterior Probability | Updated probability of a category after observing evidence. |
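
The four entries above combine into a tiny naïve Bayes text classifier; the priors and word likelihoods below are invented toy numbers, not estimates from any real corpus:

```python
import math

# Toy naïve Bayes text classifier: posterior ∝ P(Class) · Π P(word_i | Class).
priors = {"sports": 0.5, "politics": 0.5}      # category priors P(Class)
likelihood = {                                  # word likelihoods P(word | Class)
    "sports":   {"ball": 0.10, "vote": 0.01, "win": 0.05},
    "politics": {"ball": 0.01, "vote": 0.10, "win": 0.03},
}

def classify(words):
    # Work in log space so products of many small likelihoods don't underflow.
    scores = {c: math.log(priors[c]) +
                 sum(math.log(likelihood[c][w]) for w in words)
              for c in priors}
    return max(scores, key=scores.get)   # class with the highest posterior
```

| Term | Definition |
|---|---|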
Bayes Net (Bayesian Network) | A graphical model representing joint distributions using nodes and directed edges. |
Graphical Model | A representation of variables and their dependencies using a graph. |
Node (in Bayes Net) | A variable in the model. |
Arc (in Bayes Net) | A directed edge indicating direct influence between variables. |
Directed Acyclic Graph (DAG) | The structure underlying a Bayesian network. |
Conditional Probability Table (CPT) | A table specifying probabilities of a variable given its parents. |
Local Conditional Distribution | The probability distribution for a node given its parents. |
Global Semantics (Bayes Net) | The joint distribution defined as the product of all local conditionals. |
Sparse Bayes Net | A Bayes net where each variable has few parents, reducing complexity. |
Markov Blanket | The set of a node’s parents, children, and children’s parents; shields the node from the rest of the network. |
Conditional Independence Semantics | Every variable is conditionally independent of its non-descendants given its parents. |
Topological Order | An ordering of nodes where parents appear before children. |
Alarm Network | A classic Bayes net example involving burglary, earthquake, alarm, and neighbor calls. |
Free Parameters (CPT) | The number of independent probabilities needed to specify a conditional probability table. |
Local Causality | The idea that variables interact only with a small number of others. |
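
The global-semantics entry can be checked directly on the alarm network: the joint probability of a complete assignment is the product of the local conditionals. The CPT values below are the commonly quoted textbook numbers:

```python
# Joint = product of local conditionals in the burglary/earthquake alarm network.
P_b = 0.001                       # P(Burglary)
P_e = 0.002                       # P(Earthquake)
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_j = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_m = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    pb = P_b if b else 1 - P_b
    pe = P_e if e else 1 - P_e
    pa = P_a[(b, e)] if a else 1 - P_a[(b, e)]
    pj = P_j[a] if j else 1 - P_j[a]
    pm = P_m[a] if m else 1 - P_m[a]
    return pb * pe * pa * pj * pm

# P(john calls, mary calls, alarm, no burglary, no earthquake) ≈ 0.000628
p = joint(False, False, True, True, True)
```

| Term | Definition |
|---|---|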
Hidden Markov model (HMM) | A model with a Markov chain over hidden states X_t and observations E_t emitted from each state via P(E_t \| X_t). |
Umbrella world state variable | R_t, indicating whether it is raining on day t. |
Umbrella world evidence variable | U_t, indicating whether the director carries an umbrella on day t. |
Viterbi algorithm | Dynamic programming algorithm that computes, for each state and time, the maximum probability of any path ending there, to find the most likely state sequence. |
State trellis | A layered graph whose nodes are states at each time step and whose edges represent transitions with weights from transition and sensor probabilities. |
Robot localization problem | Using a transition model over robot positions and a sensor model over wall/no-wall readings to maintain a belief over the robot’s location. |
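
The Viterbi entry can be sketched on the umbrella world, using the standard textbook parameters (rain persists with probability 0.7; an umbrella appears with probability 0.9 given rain, 0.2 otherwise):

```python
# Viterbi: m[s] holds the maximum probability of any state sequence ending in
# state s, per the definition above. True = rain, False = no rain.
TRANS = {True: {True: 0.7, False: 0.3},   # P(R_t | R_{t-1} = true)
         False: {True: 0.3, False: 0.7}}  # P(R_t | R_{t-1} = false)
SENSOR = {True: 0.9, False: 0.2}          # P(U_t = true | R_t)

def viterbi(evidence, prior=0.5):
    m = {True: prior, False: 1 - prior}   # best-path probabilities at day 0
    back = []                             # backpointers, one dict per day
    for u in evidence:
        new_m, ptr = {}, {}
        for s in (True, False):
            obs = SENSOR[s] if u else 1 - SENSOR[s]
            prev = max((True, False), key=lambda r: m[r] * TRANS[r][s])
            new_m[s] = obs * m[prev] * TRANS[prev][s]
            ptr[s] = prev
        back.append(ptr)
        m = new_m
    # Trace backpointers from the best final state.
    state = max(m, key=m.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path[1:]                       # drop the day-0 prior state
```

| Term | Definition |
|---|---|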
Markov Process | A system where the next state depends only on the current state, not the full history. |
First-Order Markov Assumption | The assumption that each state depends only on the immediately previous state. |
Stationarity Assumption | The rule that transition probabilities stay the same at every time step. |
State Variable | A hidden variable representing the true condition of the system at a given time. |
Evidence Variable | An observable variable that provides information about the hidden state. |
Transition Model | The model describing how the hidden state evolves from one time step to the next. |
Sensor Model | The model describing how observations are generated from the current hidden state. |
Discrete Time Model | A model that represents the world as a sequence of time slices. |
Umbrella World | A simple example where the hidden state is whether it is raining and the evidence is whether the director carries an umbrella. |
Filtering | The task of computing the belief about the current state given all evidence so far. |
Prediction | Estimating future states based on the current belief without adding new evidence. |
Smoothing | Estimating past states using both earlier and later evidence. |
Most Likely Explanation | The single most probable sequence of states that explains the observed evidence. |
Forward Algorithm | A recursive method for updating beliefs over time using transition and sensor models. |
Viterbi Algorithm | A dynamic programming method for finding the most likely sequence of hidden states. |
State Trellis | A layered graph showing all possible states at each time step and transitions between them. |
Hidden Markov Model | A model with hidden states that evolve over time and produce observable evidence. |
Robot Localization | A problem where a robot uses noisy sensors and movement models to estimate its position. |
Belief State | The probability distribution over all possible states at a given time. |
Update Step | The filtering step where new evidence is incorporated into the belief state. |
Prediction Step | The filtering step where the belief is projected forward using the transition model. |
Stationary Distribution | The long-run distribution that a Markov chain converges to after many steps (when such a limit exists). |
Partial Observability | A situation where the agent cannot directly observe the true state of the world. |
Dynamic World | A world where the state changes over time while the agent is reasoning. |
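
Filtering in the umbrella world reduces to one prediction step plus one update step per time slice (same standard parameters: rain persists with probability 0.7, umbrella probability 0.9 given rain and 0.2 otherwise):

```python
# Forward-algorithm filtering: belief is P(rain today | all evidence so far).
P_RAIN0 = 0.5                         # prior P(R0)
P_PERSIST = {True: 0.7, False: 0.3}   # transition model P(R_t = true | R_{t-1})
P_UMBRELLA = {True: 0.9, False: 0.2}  # sensor model P(U_t = true | R_t)

def filter_step(belief, umbrella):
    # Prediction step: push the belief through the transition model.
    predicted = belief * P_PERSIST[True] + (1 - belief) * P_PERSIST[False]
    # Update step: weight by the sensor model, then normalize.
    w_rain = P_UMBRELLA[True] if umbrella else 1 - P_UMBRELLA[True]
    w_dry = P_UMBRELLA[False] if umbrella else 1 - P_UMBRELLA[False]
    p_rain = w_rain * predicted
    p_dry = w_dry * (1 - predicted)
    return p_rain / (p_rain + p_dry)

belief = P_RAIN0
for u in [True, True]:                # umbrella observed on days 1 and 2
    belief = filter_step(belief, u)   # ends near 0.883
```

| Term | Definition |
|---|---|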
Atomic Representation | A representation where states are indivisible wholes, used in search algorithms. |
Factored Representation | A representation where states are described by variables and their values. |
Structured Representation | A representation using objects and relations, as in first-order logic. |
Bayesian Network | A factored probabilistic model over a fixed, finite set of random variables. |
Probabilistic Programming Language (PPL) | A language that defines probability distributions using programs with random choices. |
Generative Program | A program where each random choice defines a random variable in a probability model. |
Execution Trace | A sequence of values for all random choices made during one run of a generative program. |
Decision-Theoretic Agent | An agent that makes rational decisions using beliefs (probabilities) and utilities. |
Utility Function | A function assigning a numerical desirability to each state. |
Expected Utility | The average utility of outcomes from a state, weighted by their probabilities. |
Maximum Expected Utility (MEU) | The principle that an agent should choose the action with the highest expected utility. |
Episodic Decision Problem | A one-shot decision problem where each decision is independent. |
Sequential Decision Problem | A decision problem requiring a sequence of actions over time. |
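
A minimal episodic MEU computation; the actions, probabilities, and utilities below are invented for illustration:

```python
# Each action maps to a list of (probability, utility) outcome pairs.
actions = {
    "take_umbrella":  [(0.4, 70), (0.6, 70)],    # dry either way, but encumbered
    "leave_umbrella": [(0.4, 0),  (0.6, 100)],   # rain with probability 0.4
}

def expected_utility(outcomes):
    # Sum of utilities weighted by their probabilities.
    return sum(p * u for p, u in outcomes)

# MEU principle: choose the action with the highest expected utility.
best = max(actions, key=lambda a: expected_utility(actions[a]))
```

| Term | Definition |
|---|---|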
Learning | The process of improving performance based on observations. |
Machine Learning | When a computer learns models from data to solve problems. |
Induction | Reasoning from specific observations to general rules. |
Deduction | Reasoning from general rules to specific conclusions. |
Classification | A learning task where the output is a discrete label. |
Regression | A learning task where the output is a numeric value. |
Supervised Learning | Learning from labeled input-output pairs. |
Unsupervised Learning | Learning patterns or structure without labeled outputs. |
Clustering | Grouping similar examples in unsupervised learning. |
Reinforcement Learning | Learning from rewards and punishments to determine good actions. |
Training Set | A set of labeled examples used to learn a hypothesis. |
Test Set | A set of unseen examples used to evaluate generalization. |
Ground Truth | The true output value the model is trying to predict. |
Hypothesis | A learned function that approximates the true function. |
Hypothesis Space | The set of all candidate hypotheses the learner can choose from. |
Consistent Hypothesis | A hypothesis that correctly predicts all training examples. |
Generalization | How well a learned hypothesis performs on unseen data. |
Bias | Systematic tendency of a hypothesis to deviate from the true function. |
Variance | How much a hypothesis changes when trained on different data samples. |
Underfitting | When a model is too simple to capture the underlying pattern. |
Overfitting | When a model fits the training data too closely and fails to generalize. |
Bias-Variance Tradeoff | The balance between simplicity (bias) and flexibility (variance). |
Rule of Simplicity | Choose models as simple as possible while still capturing important patterns. |
Linear Regression | A method for fitting a straight line or hyperplane to data for prediction. |
Prediction Function | The linear model output computed as w0 + w1x. |
Prediction Error | The difference between the true value y and the predicted value h(x). |
Residual | The error on a single instance, defined as y − h(x). |
Least Squares Loss | A loss function that sums the squared prediction errors over all examples. |
Gradient Descent | An iterative optimization method that updates weights to reduce loss. |
Learning Rate | A small constant controlling how large each gradient descent update is. |
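
A short sketch of least-squares linear regression trained by gradient descent; the data and learning rate are invented for illustration (the data follows y = 2x + 1 exactly, so the weights should approach w0 = 1, w1 = 2):

```python
# Gradient descent on squared-error loss for the prediction function
# h(x) = w0 + w1*x, updating after each example.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w0, w1 = 0.0, 0.0
alpha = 0.05                       # learning rate

for _ in range(2000):
    for x, y in data:
        err = y - (w0 + w1 * x)    # residual on this example
        w0 += alpha * err          # gradient step for the intercept
        w1 += alpha * err * x      # gradient step for the slope
```

| Term | Definition |
|---|---|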
Feature Vector | A vector of input features representing an example. |
Linear Classifier | A classifier that makes predictions using a weighted sum of features. |
Decision Boundary | A line, plane, or surface separating predicted classes. |
Threshold Classifier | A classifier that outputs 1 if w·x ≥ 0 and 0 otherwise. |
Activation | The weighted sum w·f(x) used to determine a classifier’s output. |
Perceptron | A linear classifier that outputs +1 or -1 based on the sign of w·x. |
Weight Vector | The set of learned weights applied to input features. |
False Negative | When the classifier predicts negative but the true label is positive. |
False Positive | When the classifier predicts positive but the true label is negative. |
Perceptron Learning Rule | The update rule w ← w + α(y − h(x))x. |
Learning Rate (Perceptron) | A constant α controlling how strongly weights are adjusted. |
Linearly Separable Data | Data that can be perfectly separated by a hyperplane. |
Perceptron Convergence Theorem | States that perceptron learning converges if data is linearly separable. |
Non-Separable Data | Data that cannot be perfectly separated by any linear boundary. |
Overtraining | When training accuracy rises but test accuracy eventually falls. |
Overfitting | When a model fits training data too closely and generalizes poorly. |
Averaged Perceptron | A perceptron variant that averages weight vectors to reduce noise. |
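
The perceptron learning rule w ← w + α(y − h(x))x can be run on a linearly separable toy problem, the AND function; this sketch uses the 0/1 threshold form of the classifier, with a constant bias feature:

```python
# Perceptron learning on AND. Feature vector is [1, x1, x2]; the leading 1
# carries the bias weight. Labels are 0/1.
examples = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = [0.0, 0.0, 0.0]
alpha = 0.1                        # learning rate

def h(w, x):
    # Threshold classifier: 1 if the activation w·x is nonnegative, else 0.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

for _ in range(100):               # converges because the data is separable
    for x, y in examples:
        err = y - h(w, x)          # 0 when the prediction is already correct
        w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
```

| Term | Definition |
|---|---|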
XOR Problem | A classic example of a function that cannot be learned by a linear classifier. |
Sigmoid Function | A smooth function mapping real values to probabilities between 0 and 1. |
Logistic Regression | A linear classifier that uses the sigmoid function to output probabilities. |
Support Vector Machine (SVM) | A classifier that maximizes the margin between classes. |
Ensemble Learning | Using multiple hypotheses together to improve performance. |
Bagging | Training multiple models on bootstrap samples and combining their predictions. |
Random Forest | An ensemble of decision trees built using random subsets of features. |
Boosting | An ensemble method that sequentially trains models to correct previous errors. |
Stacking | An ensemble method that combines predictions from multiple models using another learner. |
Deep Learning | A broad family of machine learning techniques where hypotheses are complex algebraic circuits with tunable weights. |
Neural Network | A model composed of interconnected units that compute weighted sums and apply nonlinear activation functions. |
Deep Network | A neural network with many layers, giving a long computation path from input to output. |
Advantage of Deep Networks | They allow interactions between variables through multiple layers, unlike linear models which combine inputs independently. |
Neuron (McCulloch-Pitts Model) | A simple computational unit that sums weighted inputs and applies a nonlinear activation function. |
Bias Weight | The weight on a fixed dummy input (always 1) that shifts the activation threshold of a neuron. |
Feedforward Network | A neural network where connections flow only forward, forming a directed acyclic graph. |
Recurrent Network | A neural network with feedback connections, giving it internal state or memory. |
Unit (Node) | A component that computes a weighted sum of inputs and applies a nonlinear activation function. |
Activation Function | A nonlinear function applied to a unit’s weighted input; enables networks to represent nonlinear functions. |
Universal Approximation Theorem | A two-layer neural network with enough hidden units can approximate any continuous function. |
Sigmoid Function | A smooth activation function mapping inputs to values between 0 and 1. |
ReLU | Activation function defined as max(0, x), commonly used in modern deep networks. |
Softplus | A smooth version of ReLU defined as log(1 + e^x). |
Tanh | Activation function mapping inputs to values between -1 and 1. |
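
The four activation functions above, written out directly:

```python
import math

def sigmoid(x):
    # Smooth squashing function: maps any real x into (0, 1).
    return 1 / (1 + math.exp(-x))

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity otherwise.
    return max(0.0, x)

def softplus(x):
    # Smooth approximation of ReLU: log(1 + e^x).
    return math.log(1 + math.exp(x))

def tanh(x):
    # Maps any real x into (-1, 1).
    return math.tanh(x)
```

| Term | Definition |
|---|---|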
Building a Network | Combining units into layers so the overall function is a composition of many algebraic operations. |
Gradient Descent in Neural Nets | Adjusting weights by computing gradients of loss through all layers. |
Backpropagation | The process of passing output error backward through the network to compute gradients for each weight. |
Vanishing Gradient Problem | When gradients become extremely small in deep networks, preventing effective learning. |
Automatic Differentiation | Software technique (e.g., TensorFlow, PyTorch) that automatically computes derivatives for all weights. |
Computation Graph | A dataflow graph where each node represents an elementary computation in the network. |
Input Encoding | The process of converting raw data into numeric inputs for a neural network. |
One-Hot Encoding | Representing a categorical variable with a vector where exactly one element is 1 and the rest are 0. |
Output Encoding | Representing outputs (labels) in a form suitable for loss computation, such as one-hot vectors. |
Squared Error Loss | A loss function used for regression that measures squared differences between predicted and true values. |
Negative Log Likelihood | A loss function used for classification that encourages high probability for correct labels. |
Maximum Likelihood Learning | Choosing weights that maximize the probability of the observed data. |
Softmax | An activation function for multiclass classification that outputs a probability distribution over classes. |
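
A sketch of softmax together with the negative-log-likelihood loss it is usually paired with; the raw scores are invented for illustration:

```python
import math

def softmax(scores):
    # Exponentiate and normalize so the outputs form a probability distribution.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # three-class scores, invented
nll = -math.log(probs[0])          # loss if class 0 is the true label
```

The loss is small when the network gives the correct class high probability, and large otherwise, which is exactly what maximum likelihood learning asks for.

| Term | Definition |
|---|---|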
Hidden Layer | A layer between input and output that transforms representations and may learn meaningful features. |
Convolutional Neural Network (CNN) | A neural network with spatially local connections and shared weights (kernels). |
Kernel (Filter) | A small set of weights replicated across an image to detect specific features. |
Convolution | Operation applying a kernel across an input to produce feature maps. |
Stride | The step size of the convolution; larger strides reduce output size. |
Receptive Field | The region of input that influences a particular hidden unit. |
Pooling Layer | A layer that summarizes adjacent units using a fixed operation such as max or average. |
Average Pooling | Pooling method that computes the average of inputs in a region. |
Max Pooling | Pooling method that outputs the maximum value in a region. |
Downsampling | Reducing spatial resolution by pooling or strided convolution. |
Fully Connected Layer | A layer where each unit connects to all units in the previous layer, often used near the output. |
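
Convolution, stride, and max pooling in one dimension, as a minimal sketch; the kernel and signal are invented for illustration:

```python
def conv1d(signal, kernel, stride=1):
    # Slide the kernel across the signal, taking a weighted sum at each step.
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]

def max_pool(feature_map, size=2):
    # Downsample by keeping the maximum in each window.
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map), size)]

# A difference kernel [-1, 1] responds to rises (+1) and falls (-1) in a step
# signal, acting as a simple edge detector.
fmap = conv1d([0, 0, 1, 1, 1, 0, 0, 0], [-1, 1])
pooled = max_pool(fmap, 2)
```

| Term | Definition |
|---|---|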
Residual Network (ResNet) | A deep architecture that adds small perturbations to previous layer outputs to avoid vanishing gradients. |
Stochastic Gradient Descent (SGD) | A training method where weights are updated using gradients computed from small minibatches of data. |
Minibatch | A small subset of training examples used to compute an approximate gradient during SGD. |
Learning Rate Decay | Gradually reducing the learning rate over time to stabilize training. |
Momentum | A technique that smooths SGD updates by incorporating past gradients to reduce variance. |
Batch Normalization | A method that normalizes activations within a minibatch to stabilize and speed up training. |
Generalization | The ability of a model to perform well on unseen data rather than just the training set. |
Adversarial Example | An input intentionally perturbed to cause a neural network to make an incorrect prediction. |
Weight Decay | A regularization technique that penalizes large weights to reduce overfitting. |
Dropout | A regularization method where random units are deactivated during training to prevent co-adaptation. |
Recurrent Neural Network (RNN) | A neural network with cycles that allow information to persist across time steps. |
Hidden State (RNN) | The internal memory of an RNN that carries information from previous time steps. |
Unrolling an RNN | Expanding an RNN across time steps to form a feedforward computation graph for training. |
Vanishing Gradient (RNN) | When gradients shrink during backpropagation through time, preventing learning of long-term dependencies. |
Exploding Gradient (RNN) | When gradients grow uncontrollably during training, destabilizing learning. |
Long Short-Term Memory (LSTM) | A type of RNN with memory cells and gates designed to avoid vanishing and exploding gradients. |
Memory Cell (LSTM) | A component that stores information across time steps with minimal modification. |
Gating Units (LSTM) | Mechanisms that control information flow into and out of the memory cell. |
Unsupervised Learning | Learning patterns from data without labeled outputs. |
Clustering | Grouping similar data points based on inherent structure in the data. |
K-Means Clustering | A partitional clustering algorithm that assigns points to the nearest centroid and iteratively updates centroids. |
Centroid | The center of a cluster, often the mean of its points. |
Cluster Shape Problem | The issue that some clustering methods assume round or blob-like clusters, which may not match real data. |
Gaussian Mixture Model | A probabilistic clustering model that represents data as a mixture of Gaussian components. |
Hierarchical Clustering | A clustering method that builds nested clusters organized as a tree. |
Partitional Clustering | A clustering method that divides data into non-overlapping clusters. |
Exclusive Clustering | A clustering method where each point belongs to exactly one cluster. |
Non-Exclusive Clustering | A clustering method where points may belong to multiple clusters. |
Fuzzy Clustering | A clustering method where each point has a degree of membership in each cluster. |
Well-Separated Cluster | A cluster where every point is closer to points in its own cluster than to points outside it. |
Prototype-Based Cluster | A cluster defined by similarity to a central prototype such as a centroid or medoid. |
Contiguity-Based Cluster | A cluster where each point is closer to at least one point in the same cluster than to any point in a different cluster. |
Density-Based Cluster | A cluster defined as a dense region separated by low-density areas. |
Objective Function (Clustering) | A function that measures clustering quality, such as sum of squared errors. |
Initial Centroid Selection | The process of choosing starting centroids for K-means, which strongly affects results. |
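
A minimal 1-D K-means sketch showing the assignment and update steps; the points and the fixed initial centroids are invented for illustration (in practice the initial centroids strongly affect the result, as noted above):

```python
def kmeans(points, centroids, iters=10):
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups of points near 1.0 and 9.0.
centroids, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
```

| Term | Definition |
|---|---|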
Autoencoder | A neural network with an encoder and decoder trained to reconstruct its input. |
Encoder | The part of an autoencoder that maps input x to a compressed representation z. |
Decoder | The part of an autoencoder that reconstructs x from the representation z. |
Generative Adversarial Network (GAN) | A system of two networks—a generator and discriminator—trained together to produce realistic data. |
Generator (GAN) | A network that produces synthetic data intended to fool the discriminator. |
Discriminator (GAN) | A classifier trained to distinguish real data from generated data. |
Transfer Learning | Using knowledge from one trained model to improve learning on a related task. |
Frozen Layers | Layers whose weights are kept fixed during transfer learning because they capture general features. |