Why feature engineering is the highest-impact ML discipline
In practice, a simple model (logistic regression, decision tree) trained on well-crafted features often beats a more sophisticated one (XGBoost, neural network) trained on raw data. This pattern shows up repeatedly in Kaggle competitions and industrial projects, and it places feature engineering at the heart of any serious ML workflow.
The reason is fundamental: ML algorithms learn patterns in the feature space. If the features do not capture the right dimensions of the problem, meaning the variables that actually vary with the target, even the most powerful algorithm has little to extract. Conversely, rich and informative features can make the problem tractable even for a simple model.
Pedro Domingos, in his widely cited paper 'A Few Useful Things to Know about Machine Learning', makes the same point: of all the factors that separate successful ML projects from failed ones, he singles out the features used as easily the most important. The observation remains a reliable compass for experienced data scientists.
Domingos, P. (2012). A Few Useful Things to Know about Machine Learning. Communications of the ACM, 55(10), 78-87.
The three major variable families
Before transforming data, you need to understand its nature. Each variable type follows a different processing logic and calls for its own techniques.
Numeric variables (continuous and discrete)
Numeric variables (age, income, temperature, click count) are directly usable by most algorithms, but rarely in their raw form. A highly skewed distribution (salaries, property prices) can disturb linear models and distance-based algorithms.
Typical transformations: logarithm to reduce right skew, square root to compress large values more gently, Box-Cox (which requires strictly positive values) to bring a skewed distribution closer to normal. For discrete variables (number of orders), discretization into deciles can sometimes carry more signal than the raw value.
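The transformations above can be sketched in a few lines. This is a minimal illustration on synthetic data, not a recipe: the `income` column is a hypothetical log-normal sample generated for the example, and the choice of pandas, NumPy and SciPy is an assumption.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical, synthetic right-skewed "income" column (log-normal draw).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10.0, sigma=1.0, size=1_000), name="income")

log_income = np.log1p(income)    # log(1 + x): also safe when x == 0
sqrt_income = np.sqrt(income)    # milder compression of large values

# Box-Cox estimates its own power parameter; input must be strictly positive.
boxcox_income, fitted_lambda = stats.boxcox(income.to_numpy())

# Discretize into deciles: 10 bins holding roughly the same number of rows.
income_decile = pd.qcut(income, q=10, labels=False)

print(f"skew raw:     {income.skew():.2f}")
print(f"skew log1p:   {log_income.skew():.2f}")
print(f"skew sqrt:    {sqrt_income.skew():.2f}")
print(f"skew box-cox: {pd.Series(boxcox_income).skew():.2f} (lambda={fitted_lambda:.2f})")
```

Comparing the printed skewness values before and after each transformation is a quick way to judge which one suits a given column.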
Categorical variables (nominal and ordinal)
Categorical variables (country, customer type, product category) cannot be used directly by algorithms that expect numeric values. Encoding is therefore essential.
For nominal variables (without order), one-hot encoding creates a binary column per category. For high-cardinality variables (thousands of unique values), target encoding replaces each category with the mean of the target variable for that category. It is a powerful technique, but it leaks the target into the features unless the means are computed out-of-fold. For ordinal variables (low/medium/high), ordinal encoding preserves the order relationship.
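As a small illustration of the two simplest cases, the sketch below one-hot encodes a nominal column and ordinally encodes an ordered one. The DataFrame and its column names (`country`, `risk`) are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one nominal column, one ordinal column.
df = pd.DataFrame({
    "country": ["FR", "DE", "FR", "ES"],
    "risk": ["low", "high", "medium", "low"],
})

# One-hot: one binary column per distinct country.
one_hot = pd.get_dummies(df["country"], prefix="country", dtype=int)

# Ordinal: pass the categories explicitly so that low < medium < high
# is preserved (the default would sort them alphabetically).
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["risk_level"] = ordinal.fit_transform(df[["risk"]]).ravel()

print(one_hot)
print(df[["risk", "risk_level"]])
```

Spelling out the category order is the important design choice here: without it, the encoder has no way to know that "medium" sits between "low" and "high".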
Temporal variables and time series
A raw date is generally not informative for a model. From a timestamp, you can extract: the day of the week (strong weekly seasonality in retail), the hour (different morning/evening behaviors), the month, the quarter, the season, whether the day is a public holiday, and the time elapsed since a reference event.
For time series, lag features (value from N periods ago) and moving averages are frequently the most predictive. For churn prediction, for example, the number of days since the customer's last purchase is a much more powerful feature than the absolute date of that purchase.
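A minimal pandas sketch of these ideas follows. The `sales` and `purchases` tables, their values, and the snapshot date are hypothetical and exist only to make the example runnable.

```python
import pandas as pd

# Hypothetical daily sales series (2024-01-01 is a Monday).
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "units": [12, 15, 11, 14, 20, 30, 28, 13, 16, 12, 15, 22, 33, 29],
})

# Calendar features extracted from the timestamp.
sales["day_of_week"] = sales["date"].dt.dayofweek   # Monday = 0
sales["month"] = sales["date"].dt.month
sales["is_weekend"] = sales["day_of_week"] >= 5

# Lag feature: value observed on the same weekday one week earlier.
sales["units_lag_7"] = sales["units"].shift(7)

# Moving average of the previous 3 days. shift(1) comes first so the
# current day's value never leaks into its own feature.
sales["units_ma_3"] = sales["units"].shift(1).rolling(window=3).mean()

# Recency feature: days since each customer's last purchase,
# measured at a hypothetical snapshot date.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "purchased_at": pd.to_datetime(["2024-02-01", "2024-02-23", "2024-01-01"]),
})
snapshot = pd.Timestamp("2024-03-01")
last_purchase = purchases.groupby("customer_id")["purchased_at"].max()
days_since_last = (snapshot - last_purchase).dt.days

print(sales.tail(3))
print(days_since_last)
```

The first rows of the lag and moving-average columns are missing by construction, so they need to be dropped or imputed before training.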
Feature transformation techniques
Beyond the basic treatments for each variable type, more advanced techniques create features that capture interactions, non-linear structure, or domain knowledge.
Normalization and scaling
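Scaling is critical for distance-based algorithms (KNN, SVM) and for models trained by gradient descent (logistic regression, neural networks). Without it, a variable in euros (order of magnitude 100,000) dominates a percentage variable (order 0-1): distances are driven almost entirely by the larger variable, gradient-based training converges more slowly, and regularization penalizes the two coefficients unevenly.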
StandardScaler centers and scales each variable (mean 0, standard deviation 1). MinMaxScaler compresses values between 0 and 1. RobustScaler uses the median and IQR, which makes it resistant to outliers. Tree-based ensemble methods (Random Forest, XGBoost) are invariant to monotone rescaling of the features and do not require it.
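The three scalers can be compared side by side on a small matrix. The values below are made up for illustration: a hypothetical income column in euros with one extreme outlier, and a rate column between 0 and 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical columns: income in euros, conversion rate in [0, 1].
# The last row carries an extreme income outlier.
X = np.array([
    [30_000.0, 0.10],
    [45_000.0, 0.25],
    [52_000.0, 0.40],
    [61_000.0, 0.55],
    [900_000.0, 0.90],
])

X_std = StandardScaler().fit_transform(X)     # mean 0, standard deviation 1
X_minmax = MinMaxScaler().fit_transform(X)    # rescaled to [0, 1]
X_robust = RobustScaler().fit_transform(X)    # (x - median) / IQR

np.set_printoptions(precision=3, suppress=True)
print("standard:\n", X_std)
print("min-max:\n", X_minmax)
print("robust:\n", X_robust)
```

With min-max scaling, the single outlier squeezes the four ordinary incomes into a narrow band near 0, which is the situation RobustScaler is designed for. In a real pipeline the scaler would be fitted on the training split only and then applied to validation and test data.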
Advanced encoding and embedding
For very high-cardinality categorical variables (product IDs, URLs, identifiers), one-hot encoding causes dimensional explosion. Alternatives: the hashing trick, which projects categories into a fixed-size space (available in scikit-learn as FeatureHasher), and entity embeddings (vectors learned by a neural network, which can be reused across models).
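A short sketch of the hashing trick with scikit-learn's `FeatureHasher` is given below. The product identifiers and the choice of 16 output columns are arbitrary assumptions for the example; real projects typically use a much larger space to limit collisions.

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of string tokens; the "product_id=..." values
# are hypothetical identifiers.
rows = [
    ["product_id=A1093"],
    ["product_id=B2277"],
    ["product_id=A1093"],
]

# The output width is fixed up front, whatever the number of distinct IDs.
hasher = FeatureHasher(n_features=16, input_type="string")
X = hasher.transform(rows)   # SciPy sparse matrix, shape (3, 16)

print(X.shape)
print(X.toarray())
```

The hasher is stateless: it needs no fitting, and an identifier never seen during training still maps to one of the 16 columns. The price is that distinct identifiers can collide in the same column.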
Pre-trained embeddings (Word2Vec, FastText, BERT for text; ResNet for images) allow transforming unstructured data into dense vectors directly usable as features for any downstream model.
Interactions and polynomial features
A linear model only captures additive effects. Explicitly creating interaction features (product of two variables, ratio, difference) allows capturing combinatorial effects without switching to a non-linear model.
Classic example: for predicting engagement, the 'session duration / pages viewed' ratio can be more predictive than either variable taken alone. Systematic exploration of interactions can rely on PolynomialFeatures (scikit-learn) or on decision trees to identify relevant pairs.
Feature selection: fewer features, better model
Adding redundant or uninformative features does not improve performance: it increases noise, slows training and can degrade generalization (curse of dimensionality). Feature selection aims to identify a compact subset of variables that retains most of the predictive signal.
Statistical filters
Filters evaluate each feature independently of the algorithm. Pearson correlation measures the linear relationship with the target. The chi-squared test evaluates independence between a categorical feature and a categorical target. Mutual information measures any form of dependence, linear or not.
These methods are fast and scalable, but they ignore interactions between features: a variable may be non-informative alone but highly predictive in combination with another.
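The difference between a linear filter and mutual information shows up clearly on synthetic data. In the sketch below, all three features and the target are simulated; `x_quadratic` influences the target only through its square, so its Pearson correlation with the target is close to zero even though it is informative.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: one linear driver, one purely quadratic driver, one noise column.
rng = np.random.default_rng(0)
n_rows = 2_000
x_linear = rng.normal(size=n_rows)
x_quadratic = rng.uniform(-1.0, 1.0, size=n_rows)
x_noise = rng.normal(size=n_rows)
y = 0.5 * x_linear + 3.0 * x_quadratic**2 + rng.normal(scale=0.1, size=n_rows)

X = np.column_stack([x_linear, x_quadratic, x_noise])
names = ["x_linear", "x_quadratic", "x_noise"]

# Filter 1: Pearson correlation of each column with the target.
pearson = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]

# Filter 2: mutual information (nearest-neighbor estimate, in nats).
mutual_info = mutual_info_regression(X, y, random_state=0)

for name, r, mi in zip(names, pearson, mutual_info):
    print(f"{name:12s} pearson={r:+.3f}  mutual_info={mi:.3f}")
```

A correlation-only filter would discard `x_quadratic`, while mutual information ranks it well above the noise column. Neither score, however, would detect a feature that matters only in combination with another one.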
Embedded methods: Lasso, Random Forest, SHAP
Embedded methods select features during model training. L1 regularization (Lasso) pushes the coefficients of unimportant features to exactly zero, which selects the relevant features automatically. It is very effective for linear regression and, through L1-penalized logistic regression, for classification.
For ensemble models, feature importances from Random Forest or XGBoost give a score per variable. SHAP (SHapley Additive exPlanations) goes further: it calculates the marginal contribution of each feature for each individual prediction, producing local and global interpretation simultaneously.
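The two scikit-learn-based approaches can be sketched on synthetic data where the true answer is known. Everything below is an illustrative assumption: the data is simulated so that only columns 0, 3 and 7 influence the target, and the `alpha` value is picked by hand rather than tuned. SHAP is left out because it lives in a separate package.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic data: 10 standard-normal features, only three of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
true_coef = np.zeros(10)
true_coef[[0, 3, 7]] = [5.0, -4.0, 3.0]
y = X @ true_coef + rng.normal(scale=0.5, size=500)

# Embedded selection with L1: features whose coefficient is exactly zero are dropped.
lasso = Lasso(alpha=0.1).fit(X, y)
selected_by_lasso = np.flatnonzero(lasso.coef_)

# Impurity-based importances from a random forest, highest first.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]
top3_by_forest = ranking[:3]

print("Lasso keeps features:", selected_by_lasso)
print("Forest top 3 features:", top3_by_forest)
```

In practice `alpha` would be chosen by cross-validation (for instance with `LassoCV`), and the features are assumed to be on comparable scales so that the L1 penalty treats them evenly.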
Andrew Ng, co-founder of Google Brain and Coursera, makes a closely related argument in his talks on data-centric AI: in many industrial projects, systematically improving the data fed to a model pays off more than swapping or tuning the algorithm.
Ng, A. (2021). A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. DeepLearning.AI.
Feature stores: industrializing feature engineering
In mature ML organizations, ad-hoc manual feature engineering creates problems: duplicated computation across teams, inconsistency between training and inference (training-serving skew), and no practical way to reuse features from one project to another. The feature store addresses these problems.
A feature store is a centralized system that stores, versions and serves computed features, with a clear separation between batch features (computed periodically) and real-time features (computed on request). It guarantees that the same calculations are used in training and production, eliminating a major source of bugs.
Offline / online architecture
The offline store holds batch features in a data warehouse (BigQuery, Snowflake, Redshift) or data lake. It is used for model training and historical dataset creation. Computations are scheduled (hourly, daily) and features are versioned.
The online store holds low-latency features (Redis, DynamoDB, Cassandra) to serve them in real time during predictions. A central catalog ensures offline and online features are synchronized and each feature is documented (definition, owner, computation date, applied transformations).
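To make the offline/online split concrete, here is a deliberately tiny in-memory sketch. It is not the API of Feast or of any real feature store: `ToyFeatureStore`, its method names, and the `avg_basket` feature are all hypothetical, and plain Python containers stand in for the warehouse and the key-value store.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List, Tuple

FeatureRow = Dict[str, float]


@dataclass
class ToyFeatureStore:
    """Hypothetical in-memory stand-in for an offline + online feature store."""

    transforms: Dict[str, Callable[[dict], float]] = field(default_factory=dict)
    offline: List[Tuple[str, datetime, FeatureRow]] = field(default_factory=list)
    online: Dict[str, FeatureRow] = field(default_factory=dict)

    def register(self, name: str, transform: Callable[[dict], float]) -> None:
        # One definition per feature, shared by training and serving.
        self.transforms[name] = transform

    def ingest(self, entity_id: str, event_time: datetime, raw: dict) -> None:
        # Assumes events arrive in time order, so the latest ingest is the freshest.
        row = {name: fn(raw) for name, fn in self.transforms.items()}
        self.offline.append((entity_id, event_time, row))  # full history
        self.online[entity_id] = row                        # latest values only

    def get_online(self, entity_id: str) -> FeatureRow:
        # Serving path: a single key lookup.
        return self.online[entity_id]

    def get_historical(self, entity_id: str, as_of: datetime) -> FeatureRow:
        # Training path: latest values known at or before `as_of`.
        rows = [(t, r) for e, t, r in self.offline if e == entity_id and t <= as_of]
        if not rows:
            raise KeyError(f"no features for {entity_id!r} as of {as_of}")
        return max(rows, key=lambda item: item[0])[1]


store = ToyFeatureStore()
store.register("avg_basket", lambda raw: raw["revenue"] / max(raw["orders"], 1))
store.ingest("cust_1", datetime(2024, 1, 1), {"revenue": 100.0, "orders": 4})
store.ingest("cust_1", datetime(2024, 2, 1), {"revenue": 300.0, "orders": 6})

online_row = store.get_online("cust_1")
training_row = store.get_historical("cust_1", as_of=datetime(2024, 1, 15))
print(online_row, training_row)
```

The point of the design is that `avg_basket` is defined once and both stores are filled from that single definition, which is how a feature store avoids training-serving skew. The point-in-time lookup in `get_historical` keeps values that were not yet known at training time out of the historical dataset.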
Main open-source and managed feature stores: Feast (open-source), Tecton (enterprise SaaS, founded by members of the team behind Uber's Michelangelo platform), Databricks Feature Store (integrated with MLflow), Vertex AI Feature Store (GCP), SageMaker Feature Store (AWS). The choice depends on the existing data stack.
Memorize feature engineering with memia
Feature engineering relies on a combination of statistical concepts, algorithmic reflexes and domain knowledge. Normalization techniques, encoding methods, selection criteria, feature store architectures: these are all notions that consolidate through repetition.
memia offers flashcard decks covering feature engineering techniques, feature selection and production patterns (feature stores). Each card is AI-generated and validated, with concrete examples and mnemonics. Anchoring these concepts through FSRS spaced repetition turns them into reflexes you can apply immediately on your projects.
Frequently asked questions about feature engineering
What exactly is feature engineering?
Feature engineering is the process of transforming raw data into variables (features) usable by a machine learning algorithm. It includes cleaning, creating new variables, encoding categorical variables, normalization and selecting the most informative features.
Why is feature engineering more important than algorithm choice?
ML algorithms learn patterns in the feature space. If that space does not capture the right dimensions of the problem, even the best algorithm fails. Conversely, rich, well-crafted features let even simple algorithms perform well. Pedro Domingos (Communications of the ACM, 2012) and Andrew Ng have both made this argument on the basis of practical ML experience.
What is the difference between feature engineering and feature selection?
Feature engineering creates new variables from raw data (transformation, combination, extraction). Feature selection chooses among existing variables those that are most informative for the model. Both are complementary: first create as many relevant features as possible, then select those that contribute the most.
Do you always need to normalize your features?
No. Normalization is necessary for distance-based algorithms (KNN, SVM) or gradient-based ones (logistic regression, neural networks). It is unnecessary for tree-based ensemble methods (Random Forest, XGBoost, LightGBM) which are invariant to monotone feature transformations.
What is target encoding and when should you use it?
Target encoding replaces each category of a categorical variable with the mean of the target variable in the training set. It effectively handles high-cardinality variables (hundreds or thousands of categories) where one-hot encoding explodes dimensionality. Beware of data leakage: always compute target encoding within cross-validation folds, never on the entire training set.
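A minimal out-of-fold implementation might look like the sketch below. The function name `target_encode_oof` and its parameters are hypothetical, and the smoothing toward the global mean is an extra assumption that goes beyond the plain category mean described above; recent scikit-learn versions also ship a `TargetEncoder` that applies a similar cross-fitting scheme.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def target_encode_oof(categories: pd.Series, target: pd.Series,
                      n_splits: int = 5, smoothing: float = 10.0,
                      seed: int = 0) -> pd.Series:
    """Encode each row using target means computed on the other folds only."""
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in folds.split(categories):
        fit_cat = categories.iloc[fit_idx]
        fit_y = target.iloc[fit_idx]
        prior = fit_y.mean()  # global mean of the fitting folds
        per_cat = fit_y.groupby(fit_cat).agg(["sum", "count"])
        # Smoothed mean: rare categories are pulled toward the prior (assumption).
        smoothed = (per_cat["sum"] + smoothing * prior) / (per_cat["count"] + smoothing)
        # Categories absent from the fitting folds fall back to the prior.
        values = categories.iloc[enc_idx].map(smoothed).fillna(prior)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Hypothetical data: 200 rows, 4 categories, binary target.
rng = np.random.default_rng(0)
city = pd.Series(rng.choice(["a", "b", "c", "d"], size=200))
churned = pd.Series(rng.integers(0, 2, size=200))
city_encoded = target_encode_oof(city, churned)
print(city_encoded.head())
```

Because each row is encoded from folds that exclude it, a row's own label never contributes to its feature. At inference time, the encoding would instead be fitted once on the full training set and applied to new data.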
What is a feature store and when do you need one?
A feature store is a centralized system that stores, versions and serves ML features, with an offline store (batch, for training) and an online store (real-time, for prediction). It becomes necessary when multiple teams share features, when training-production inconsistencies are observed, or when feature computation is costly and repeated.
How does SHAP help with feature selection?
SHAP (SHapley Additive exPlanations) calculates the contribution of each feature to each individual prediction, then aggregates these contributions to obtain a global importance. Unlike impurity-based Random Forest importances (which are biased toward high-cardinality variables), SHAP is theoretically grounded in cooperative game theory, and its model-agnostic variants (such as Kernel SHAP) work with any model, including black boxes.