Introduction to Machine Learning: core concepts and algorithms

Fundamentals

AI, Machine Learning, Deep Learning: key distinctions

Artificial intelligence (AI) is the general field aimed at creating systems capable of performing tasks that normally require human intelligence. Machine learning (ML) is a subset: rather than explicitly programming rules, you provide data to the system so it can infer patterns on its own. Deep learning is in turn a subset of ML, based on multi-layer artificial neural networks.

This distinction is not purely academic. A traditional ML model like logistic regression is interpretable, fast to train and deployable on standard hardware. A deep learning model like a transformer typically requires GPUs, large volumes of data (often millions of examples) and a significant compute budget. Choosing the right approach depends on the problem, the volume of available data and operational constraints.

Origins

Alan Turing anticipated machine learning in his 1950 paper 'Computing Machinery and Intelligence', which proposed building machines that learn rather than programming every behavior by hand. Arthur Samuel coined the term 'machine learning' in 1959 during his work on a checkers program capable of improving through experience.

Samuel, A.L. (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development.

Taxonomy

The three major learning paradigms

ML is divided into three paradigms based on the nature of the available learning signal. Each paradigm addresses distinct types of problems and involves different algorithms.

Supervised learning

Supervised learning uses (input, expected output) pairs to learn a mapping function. Two cases are distinguished: classification (discrete output: spam/not-spam, positive/negative diagnosis) and regression (continuous output: property price, predicted temperature).

Typical examples: bank fraud detection, credit scoring, image recognition, churn prediction. Label quality is critical: a poorly annotated dataset produces a biased model, regardless of the algorithm chosen.

Unsupervised learning

When labels are absent or too costly to obtain, unsupervised learning extracts latent structures from data alone. The main tasks are clustering (K-Means, DBSCAN), dimensionality reduction (PCA, UMAP, t-SNE) and anomaly detection.
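As an illustration of clustering, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. The toy points and the choice of starting centroids are assumptions made for the example; production code would normally rely on a library implementation with a smarter initialisation.

```python
def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm on 2D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # an empty cluster keeps its previous centroid
                centroids[i] = (
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                )
    return centroids, clusters


# Toy data: two visually obvious groups, no labels involved.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
```

The algorithm never sees a 'correct answer': it only alternates between assigning points and moving centroids, which is why the result depends on the number of clusters and on the starting centroids.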

Concrete applications: customer segmentation, network intrusion detection, content recommendation, data compression. Results are harder to evaluate objectively since there is no reference 'correct answer'.

Reinforcement learning

An agent interacts with an environment, receives rewards or penalties, and learns a policy that maximizes cumulative long-term reward. This paradigm has produced some of the most publicized systems, such as AlphaGo and agents that learn to play Atari games, and it is also applied to problems such as high-frequency trading.
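The agent, environment, reward and policy described above can be sketched with tabular Q-learning. The corridor environment, its reward and every hyperparameter below are hypothetical choices for illustration, not a reference implementation.

```python
import random

N_STATES = 5          # states 0..4 in a corridor; state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Hypothetical toy environment: reward 1 on reaching the goal, else 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore sometimes, otherwise follow current estimates.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            next_state, reward, done = step(state, action)
            # Move Q(s, a) toward reward + discounted best value of the next state.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q = q_learning()
# Greedy policy read off the learned table: best action in each non-goal state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
```

The only feedback is the reward at the goal, yet the discount factor makes states closer to the goal more valuable, so the learned policy moves right everywhere.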

Reinforcement learning is also at the heart of RLHF (Reinforcement Learning from Human Feedback) techniques used to align large language models like GPT or Claude with human preferences.

Enterprise breakdown

Supervised learning is reported to account for approximately 85% of ML use cases in production, a figure attributed to Kaggle and O'Reilly ML surveys. The main reason: the availability of historically labelled data in business systems (CRM, ERP, ticketing tools).

Algorithms

The fundamental algorithms to know

The ML algorithmic landscape is vast, but a few algorithm families cover the vast majority of use cases. Knowing them allows you to quickly choose a good starting point and interpret results.

Linear and logistic regression

Linear regression predicts a continuous value as a weighted sum of the inputs (y = w·x + b), with the weights w and the bias b chosen to minimize squared error. Simple, interpretable and fast, it serves as an excellent baseline for any regression problem. Logistic regression extends this to classification by passing the weighted sum through a sigmoid function to produce a probability between 0 and 1.
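For a single input variable, the least-squares line has a closed-form solution. The sketch below assumes one feature and toy data chosen to lie exactly on y = 2x + 1; the sigmoid is included to show the extra step logistic regression adds.

```python
import math


def fit_line(xs, ys):
    """Closed-form least squares for y = w * x + b with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def sigmoid(z):
    """Squash any real number into (0, 1), as logistic regression does."""
    return 1.0 / (1.0 + math.exp(-z))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]       # toy data lying exactly on y = 2x + 1
w, b = fit_line(xs, ys)          # recovers w = 2, b = 1
probability = sigmoid(w * 0.0 + b - 1.0)   # sigmoid(0) = 0.5
```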

These models have the key advantage of interpretability: the coefficients indicate the direction and size of each variable's effect on the prediction (on the log-odds, for logistic regression), provided the features are on comparable scales. This is crucial in regulated domains (credit, healthcare, justice).

Decision trees, Random Forest and XGBoost

A decision tree learns hierarchical segmentation rules (if age > 35 AND income > 50k THEN ...). Intuitive but prone to overfitting, it is generally replaced by ensemble methods.

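Random Forest builds hundreds of trees, each on a bootstrap sample of the rows and a random subset of the features, then averages their predictions for regression or takes a majority vote for classification (bagging). XGBoost and LightGBM use gradient boosting: trees are added one at a time, each fitted to the errors left by the ensemble built so far. These methods dominate Kaggle competitions on tabular data and are widely used in production for their robustness.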
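The boosting idea can be sketched with one-split trees (stumps) and squared error on a single feature. This is a simplified illustration under those assumptions, not how XGBoost or LightGBM are implemented; the data and hyperparameters are made up for the example.

```python
def fit_stump(xs, residuals):
    """Best single-split regressor on one feature (minimises squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.3):
    base = sum(ys) / len(ys)                 # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Each new stump is fitted to what the current ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.0, 2.0, 2.0, 5.0, 5.0, 6.0, 6.0]
model = gradient_boost(xs, ys)
baseline_error = mse(lambda x: sum(ys) / len(ys), xs, ys)
boosted_error = mse(model, xs, ys)
```

Bagging would instead fit each tree independently on a resampled copy of the data and average them; boosting fits trees sequentially, so the order matters.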

Neural networks and deep learning

A neural network is composed of layers of artificial neurons connected by adjustable weights. Learning occurs through backpropagation: the error is computed at the output, its gradient with respect to each weight is propagated backward through the network using the chain rule, and the weights are updated by gradient descent.
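To make backpropagation concrete, here is a sketch on the smallest possible network: one input, one tanh hidden unit and one output, with a squared-error loss. The architecture, the numbers and the learning rate are assumptions for illustration; the finite-difference value is only there to check the analytic gradient.

```python
import math


def forward(x, w1, w2):
    h = math.tanh(w1 * x)      # hidden activation
    return h, w2 * h           # (hidden value, prediction)


def loss(x, y, w1, w2):
    _, y_hat = forward(x, w1, w2)
    return 0.5 * (y_hat - y) ** 2


def backward(x, y, w1, w2):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y_hat = forward(x, w1, w2)
    d_y_hat = y_hat - y                # dL/dy_hat at the output
    d_w2 = d_y_hat * h                 # dL/dw2
    d_h = d_y_hat * w2                 # error sent back to the hidden unit
    d_w1 = d_h * (1 - h ** 2) * x      # tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2


x, y, w1, w2 = 1.5, 0.8, 0.4, -0.6
d_w1, d_w2 = backward(x, y, w1, w2)

# Numerical check of dL/dw1 with central finite differences.
eps = 1e-6
numeric_d_w1 = (loss(x, y, w1 + eps, w2) - loss(x, y, w1 - eps, w2)) / (2 * eps)

# One gradient-descent step: move each weight against its gradient.
learning_rate = 0.1
new_w1, new_w2 = w1 - learning_rate * d_w1, w2 - learning_rate * d_w2
```

Deep learning frameworks automate exactly this bookkeeping for millions of weights.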

For unstructured data (images, text, audio), specialized architectures like CNNs (vision), transformers (NLP) or GNNs (graphs) generally outperform traditional algorithms, at the cost of much greater computational complexity and opacity.

Practice

The end-to-end ML pipeline

An ML model does not live in isolation. From data collection to production deployment, a structured ML pipeline is essential for producing reliable and reproducible models.

Data preparation and cleaning

Data scientists are often said to spend 60 to 80% of their time on data preparation: collection, cleaning, handling missing values, outlier detection, encoding categorical variables. A clean and representative dataset is worth more than a sophisticated algorithm applied to noisy data.

Typical operations include: imputation (replacing missing values with the median, mean or a model), distribution normalization, correction of input errors and handling class imbalances (SMOTE oversampling, undersampling, class_weight).
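Two of these operations can be sketched in a few lines: median imputation and class weights that compensate for imbalance. Missing values are assumed to be encoded as None, and the weighting formula n_samples / (n_classes * class_count) is the one scikit-learn documents for class_weight="balanced".

```python
from collections import Counter
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


ages = [25, None, 40, 31, None, 58]
imputed_ages = impute_median(ages)             # None becomes 35.5

labels = [0] * 9 + [1]                         # 90% / 10% imbalance
weights = balanced_class_weights(labels)       # the rare class weighs more
```

In a real pipeline the median must be computed on the training set only and then reused on validation and test data, otherwise information leaks across the split.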

Feature engineering

Feature engineering transforms raw variables into more informative representations for the model. For example, from a date of birth, you can extract age, age group, day of the week or season — each potentially more predictive than the raw date.
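The date-of-birth example can be sketched with the standard library. The function name, the ten-year age groups and the Northern Hemisphere season mapping are illustrative assumptions.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
SEASONS = ["winter", "spring", "summer", "autumn"]   # Northern Hemisphere, by month


def birth_date_features(birth_date, today):
    """Derive model-friendly features from a raw date of birth."""
    had_birthday = (today.month, today.day) >= (birth_date.month, birth_date.day)
    age = today.year - birth_date.year - (0 if had_birthday else 1)
    decade = age // 10 * 10
    return {
        "age": age,
        "age_group": f"{decade}-{decade + 9}",
        "birth_weekday": WEEKDAYS[birth_date.weekday()],
        # December, January, February map to index 0 (winter), and so on.
        "birth_season": SEASONS[(birth_date.month % 12) // 3],
    }


features = birth_date_features(date(1990, 7, 14), today=date(2024, 3, 1))
```

Passing the reference date explicitly keeps the feature reproducible: the same row yields the same age whenever the pipeline is rerun.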

This is often the step that produces the greatest performance gain, far more than choosing a more complex algorithm. Feature engineering is covered in a dedicated article in this cluster.

Training and hyperparameter optimization

Training consists of adjusting the model's parameters on the training data (train set). Hyperparameter optimization (learning rate, max tree depth, number of layers) is done on the validation set via grid search, random search or Bayesian methods (Optuna, Hyperopt).

Train/validation/test separation is non-negotiable: the test set must never be seen during training or tuning, to avoid data leakage, which produces artificially flattering metrics.
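A three-way split can be sketched as follows. The 60/20/20 proportions and the fixed seed are arbitrary choices for the example, and a plain shuffle is only appropriate when rows are independent (not for time series).

```python
import random


def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle once, then cut into three disjoint parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)      # fixed seed for reproducibility
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
# Fit on `train`, tune hyperparameters on `val`, and touch `test` only once at the end.
```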

Overfitting

Overfitting occurs when the model memorizes noise from the training set instead of learning generalizable patterns. Symptom: excellent performance on the train set, mediocre on the test set. Remedies: regularization (L1/L2, dropout), more data and model simplification; cross-validation helps detect the problem early.
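As a small illustration of L2 regularization, the sketch below fits a one-variable model without intercept. Minimising sum((y - w*x)^2) + l2 * w^2 gives w = sum(x*y) / (sum(x^2) + l2), so a larger penalty pulls the weight toward zero. The data and the penalty value are made up for the example.

```python
def ridge_slope(xs, ys, l2):
    """Slope of y = w * x (no intercept) under an L2 penalty of strength l2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + l2)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]

w_plain = ridge_slope(xs, ys, l2=0.0)   # ordinary least squares
w_ridge = ridge_slope(xs, ys, l2=5.0)   # same data, shrunken weight
```

The penalty trades a little training accuracy for a simpler model, which is the general mechanism by which regularization limits overfitting.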

Evaluation

How to evaluate an ML model

The choice of evaluation metrics is as important as the choice of algorithm. A poorly chosen metric can lead to deploying a model that performs well on paper but fails on the real business problem.

Classification metrics

Accuracy is misleading on imbalanced datasets: a model always predicting 'negative' achieves 99% accuracy if only 1% of cases are positive. Complementary metrics are precision (among predicted positives, how many are truly positive?), recall (among actual positives, how many are detected?) and F1-score, their harmonic mean.
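The 99% accuracy trap can be reproduced directly. The helper below is a minimal sketch that assumes binary labels with 1 as the positive class and returns 0 when a ratio is undefined.

```python
def classification_metrics(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1] + [0] * 99            # 1% positives
always_negative = [0] * 100        # a "model" that never predicts positive
metrics = classification_metrics(y_true, always_negative)
# accuracy is 0.99 while recall and F1 are 0: no positive case is ever detected.
```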

The ROC curve and AUC (Area Under the Curve) measure the model's discriminative ability across all decision thresholds. An AUC of 0.5 corresponds to random guessing, 1.0 to perfect separation. The Precision-Recall curve is preferred for highly imbalanced problems.
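One way to read AUC: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it from that definition by comparing all pairs, which is fine for small examples but quadratic in the number of samples.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])         # 3 of 4 pairs ordered correctly
auc_constant = roc_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])  # uninformative scores
```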

Regression metrics

For regression, MAE (Mean Absolute Error) measures the average error in absolute value, interpretable in the unit of the target variable. MSE (Mean Squared Error) penalizes large errors more heavily (useful when outliers are costly). RMSE is the square root of MSE, in the same unit as the target variable.

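The R² coefficient measures the proportion of variance explained by the model. It equals 1 for a perfect fit and 0 for a model no better than predicting the mean; on held-out data it can even be negative when the model does worse than that baseline. An R² of 0.85 means the model explains 85% of the variability in the data, the remaining 15% being noise or factors not included.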
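The four regression metrics follow directly from their definitions. The toy values below are assumptions chosen so the arithmetic is easy to check by hand; the second call shows that R² drops below 0 when predictions are worse than simply predicting the mean.

```python
import math


def regression_metrics(y_true, y_pred):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)      # total variation
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


y_true = [3.0, 5.0, 7.0, 9.0]
decent = regression_metrics(y_true, [2.0, 5.0, 8.0, 13.0])   # one large miss
reversed_trend = regression_metrics(y_true, [9.0, 7.0, 5.0, 3.0])
```

In the first call the single error of 4 dominates MSE (4.5) far more than MAE (1.5), which is the sense in which squared metrics penalize large errors more heavily.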

Cross-validation and robustness

K-fold cross-validation divides the data into k partitions, trains k models (each on k-1 partitions and evaluated on the one left out) and averages the scores. It gives a more robust estimate of real-world performance than a simple train/test split, especially on moderately sized datasets.
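The splitting logic can be sketched as an index generator. It assumes the rows have already been shuffled, and `fit_and_score` is a hypothetical callback standing in for whatever training and scoring routine is used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def cross_val_score(n_samples, k, fit_and_score):
    """Average the score returned by the (hypothetical) fit_and_score callback."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))   # validation folds of sizes 4, 3 and 3
```

Every row is used for validation exactly once, which is what makes the averaged score less dependent on one lucky or unlucky split.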

For time series, cross-validation must respect chronological order (TimeSeriesSplit) to avoid look-ahead bias: you cannot use future data to predict the past.
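A chronological split can be sketched as an expanding window: each fold trains on everything up to a cut-off and validates on the block that follows. The sizing rule below is loosely modelled on the default behaviour of scikit-learn's TimeSeriesSplit and assumes rows are already sorted by time.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with training always before testing."""
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        train_end = n_samples - (n_splits - i) * test_size
        yield list(range(train_end)), list(range(train_end, train_end + test_size))


splits = list(time_series_splits(10, 3))
# Training windows grow (4, 6, then 8 rows); each test block comes strictly later.
```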

Memorisation

Memorize machine learning with memia

Machine learning is a dense field: technical vocabulary, mathematical formulas, fine algorithmic distinctions, classic pitfalls like overfitting or data leakage. Spaced repetition is a well-studied and effective method for anchoring these concepts durably.

memia offers flashcard decks on ML fundamentals, algorithms, model evaluation and the production pipeline. Each card is AI-generated and validated, with hints and mnemonics to accelerate memorization. In 10 minutes a day, key concepts become reflexes.


Frequently asked questions about machine learning

What is the difference between machine learning and artificial intelligence?

Artificial intelligence is the general field aimed at creating intelligent systems. Machine learning is a subset that learns patterns from data rather than following explicitly programmed rules. All ML is AI, but not all AI is ML.

Do you need to know how to code to do machine learning?

In practice, yes for most projects: Python is the dominant ML language, with libraries like scikit-learn, PyTorch and TensorFlow, and some knowledge of statistics and linear algebra is also useful. That said, no-code tools like Google AutoML or DataRobot allow building models without writing code.

How much data is needed to train an ML model?

It depends on the complexity of the problem and the algorithm. Logistic regression can work with a few hundred examples; a deep neural network may need millions. Data quality often matters more than quantity: 10,000 well-labelled examples can be worth more than 1 million noisy data points.

What is overfitting and how do you avoid it?

Overfitting occurs when a model memorizes details of the training set instead of learning generalizable patterns. It is detected by a large gap between train and test performance, which cross-validation helps reveal. Main remedies: regularization (L1/L2), dropout (for neural networks), model simplification and increasing the size of the training dataset.

Which ML algorithm should I choose for my problem?

For tabular data, start with logistic regression (classification) or linear regression (regression) as a baseline, then try XGBoost or LightGBM. For images, CNNs or vision transformers. For text, transformers (BERT, RoBERTa). The general rule: favor simplicity and interpretability when business constraints require it.

What is deep learning and when should I use it?

Deep learning uses neural networks with multiple hidden layers. It excels on unstructured data (images, sound, text) and with large volumes of data. It is less suited to classic tabular data (where XGBoost often dominates) and requires more compute resources and expertise.

How do I evaluate whether my model is good?

Choose metrics aligned with the business objective: accuracy for balanced classes; F1-score, AUC-ROC or the precision-recall curve for imbalanced classes; RMSE or MAE for regression. Always compare your model to a simple baseline (predict the mean, the majority class). And test on truly unseen data, since any leakage from the test set inflates the metrics.
