In human experience, features can be thought of as the observable or measurable characteristics that we use to interpret and make decisions about the world around us. These features can come from different sensory modalities or cognitive processes.
Visual modality: color, shape, size, motion, texture (raw); patterns, symmetry, depth cues (derived)
Auditory modality: pitch, volume, tempo, rhythm (raw); speech patterns, tone of voice (derived)
Other modalities and cognitive processes, such as haptics, emotions, and language, raise the same question: what are their raw and derived features?
Features are the input variables a machine learning model uses to make predictions or classifications. They are the building blocks of the dataset and provide the information the model needs to learn relationships and patterns. Features can be (see the sketch after this list):
Numerical: Continuous or discrete values (e.g., height, number of words).
Categorical: Representing distinct groups (e.g., color, category labels).
Derived: Transformed or engineered values combining raw data (e.g., ratios, log values).
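A minimal sketch of these three kinds, assuming pandas and NumPy are available; the column names and values are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset illustrating the three kinds of features.
df = pd.DataFrame({
    "height_cm": [172.0, 180.5, 165.2],  # numerical (continuous)
    "word_count": [120, 87, 240],        # numerical (discrete)
    "color": ["red", "blue", "red"],     # categorical
})

# Derived feature: log-transform of a raw count.
df["log_word_count"] = np.log1p(df["word_count"])

# Categorical features are typically one-hot encoded before modeling.
df = pd.get_dummies(df, columns=["color"])
print(df)
```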
Feature detection is the process of identifying significant patterns, structures, or attributes in raw data to aid analysis and decision-making.
In images, this includes keypoint detectors such as SIFT and SURF, which locate corners and other distinctive points, and Haar cascades, which detect objects such as faces.
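As a rough illustration of keypoint detection, the sketch below runs OpenCV's SIFT detector on a synthetic checkerboard image; it assumes the opencv-python package (version 4.4 or later, where SIFT_create is available):

```python
import numpy as np
import cv2  # assumes opencv-python >= 4.4 for SIFT_create

# Synthetic 128x128 checkerboard stands in for a real photograph.
pattern = np.kron([[0, 1] * 4, [1, 0] * 4] * 4, np.ones((16, 16)))
image = (pattern * 255).astype(np.uint8)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)

print(f"{len(keypoints)} keypoints detected")
if descriptors is not None:
    print("descriptor shape:", descriptors.shape)  # (n_keypoints, 128)
```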
In audio, MFCCs capture the time-frequency characteristics of a signal, while text features are built from tokenization and n-grams.
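The same idea in audio and text, as a sketch: librosa's MFCC routine summarizes a synthetic tone, and scikit-learn's CountVectorizer extracts word bigrams. Both packages, and the toy inputs, are assumptions of this example rather than requirements of the methods themselves:

```python
import numpy as np
import librosa  # assumed available for the audio part
from sklearn.feature_extraction.text import CountVectorizer

# Audio: one second of a 440 Hz tone stands in for real speech or music.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print("MFCC matrix:", mfcc.shape)  # (13 coefficients, n_frames)

# Text: word bigrams as simple n-gram features.
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(["the quick brown fox", "the quick red fox"])
print("bigrams:", vectorizer.get_feature_names_out())
```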
Feature selection involves choosing the most relevant features from a dataset to improve model accuracy, reduce overfitting, and enhance computational efficiency. Techniques fall into three families: filters (e.g., chi-square tests), wrappers (e.g., recursive feature elimination), and embedded methods such as LASSO. Boosting algorithms (e.g., AdaBoost, Gradient Boosting, XGBoost) also perform an implicit form of feature selection, since training iteratively concentrates on the features with the highest predictive power and exposes per-feature importance scores.
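A compact sketch of all three families plus boosting-based importances on synthetic data, assuming scikit-learn; the dataset and hyperparameters are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# Filter: chi-square scores each feature independently (needs non-negative inputs).
X_pos = MinMaxScaler().fit_transform(X)
filter_keep = SelectKBest(chi2, k=5).fit(X_pos, y).get_support(indices=True)

# Wrapper: recursive feature elimination around a base estimator.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: LASSO's L1 penalty zeroes out weak coefficients
# (here it treats the 0/1 labels as a regression target, purely for illustration).
lasso = Lasso(alpha=0.05).fit(X, y)

# Boosting: per-feature importances fall out of training as a by-product.
gbm = GradientBoostingClassifier(random_state=0).fit(X, y)

print("filter keeps:  ", filter_keep)
print("wrapper keeps: ", np.where(rfe.support_)[0])
print("embedded keeps:", np.where(lasso.coef_ != 0)[0])
print("top boosted:   ", np.argsort(gbm.feature_importances_)[::-1][:5])
```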
The feature selection process is crucial in high-dimensional datasets, enabling models to concentrate on the most impactful data while discarding irrelevant or redundant features.