Amazon’s latest AI research improves predictions by learning quantile functions

This research summary is based on the papers 'Learning quantile functions without quantile crossing for distribution-free time series forecasting' and 'Multivariate quantile function forecaster'.


The quantile function is a mathematical function that takes a quantile level (a probability between 0 and 1) as input and returns the corresponding value of a variable as output. It can answer questions such as "How much inventory should I keep if I want to guarantee that 95% of my customers receive their orders within 24 hours?" The quantile function therefore comes up frequently in forecasting problems.
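As a concrete illustration, the quantile function of a simple Gaussian demand model can answer the inventory question directly. The demand parameters below are invented for the example, not taken from the papers:

```python
from statistics import NormalDist

# Hypothetical demand model: daily demand for a product, Gaussian with
# mean 100 units and standard deviation 15 (illustrative numbers only).
demand = NormalDist(mu=100, sigma=15)

# The quantile function (inverse CDF) answers: how much stock covers
# demand with 95% probability? And with 94%?
stock_95 = demand.inv_cdf(0.95)
stock_94 = demand.inv_cdf(0.94)

print(f"stock for 95% service level: {stock_95:.1f}")
print(f"stock for 94% service level: {stock_94:.1f}")
```

Comparing the two outputs is exactly the kind of service-level trade-off analysis discussed in the article.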

However, in practice, there is rarely a neat closed-form way to compute the quantile function. Statisticians usually approximate it using quantile regression, one quantile level at a time. This means that if you want the value at a different quantile level, you have to build a new regression model, which these days usually means retraining a neural network.
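The one-level-at-a-time approach typically minimizes the quantile ("pinball") loss. The minimal sketch below (not the papers' code) shows that minimizing this loss at a single level recovers only that one quantile of the data:

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    # Standard quantile ("pinball") loss for quantile level tau.
    diff = y - q_pred
    return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)  # synthetic data, standard normal

# Grid-search the constant prediction that minimizes the loss at tau = 0.9;
# it lands near the empirical 90th percentile -- and tells us nothing about
# any other quantile level.
grid = np.linspace(-3.0, 3.0, 601)
best = grid[np.argmin([pinball_loss(y, c, 0.9) for c in grid])]
print(best, np.quantile(y, 0.9))
```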

In a pair of recently published papers, Amazon researchers present a method for learning an estimate of the full quantile function at once, rather than approximating it separately at each quantile level.

This means that users can query the function at different quantile levels to explore performance trade-offs. For example, relaxing the 24-hour delivery guarantee from 95% to 94% might allow a much larger reduction in inventory, which could be an attractive trade-off. Alternatively, raising the guarantee level – and thus improving customer satisfaction – might require only a small increase in inventory.

The technique does not care about the shape of the distribution on which the quantile function is based. The distribution can be Gaussian (also known as a bell curve or normal distribution), uniform, or otherwise. Since the technique makes no assumptions about the shape of the distribution, it can follow the data wherever it leads, increasing the accuracy of approximations.

The cumulative distribution function (CDF) is a useful related function that gives the probability of a variable taking a value less than or equal to a given one – for example, the fraction of a population that is 5'6″ or shorter. CDF values range from 0 (nobody is shorter than 0'0″) to 1 (everyone in the population is no taller than its tallest member).

Because the CDF is the integral of the probability density function (PDF), it computes the area under the probability curve up to the target point. At low input values the CDF's output may be smaller than the PDF's, but note that the PDF outputs a density rather than a probability. The CDF, however, is monotonically non-decreasing because it is cumulative: the higher the input value, the higher (or equal) the output value.
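The integral relationship is easy to check numerically. The sketch below integrates the standard normal PDF up to a point with the trapezoid rule and compares the area to the CDF:

```python
import numpy as np
from statistics import NormalDist

d = NormalDist()  # standard normal

# The CDF at x is the area under the PDF up to x; approximate that area
# with the trapezoid rule on a fine grid (the tail below -8 is negligible).
xs = np.linspace(-8.0, 1.0, 100_001)
pdf = np.array([d.pdf(x) for x in xs])
area = float(np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(xs)))

print(f"integral of PDF up to 1.0: {area:.6f}")
print(f"CDF at 1.0:               {d.cdf(1.0):.6f}")
```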


The quantile function is simply the inverse of the CDF, where that inverse exists. Its graph can be obtained by reflecting the CDF's graph across the diagonal line that runs from the lower left corner to the upper right corner of the plot. Like the CDF, the quantile function is monotonically non-decreasing, and the technique is built on this fundamental observation.
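This inverse relationship can be verified numerically for any distribution whose CDF and quantile function are both available:

```python
from statistics import NormalDist

d = NormalDist(mu=10, sigma=2)  # illustrative Gaussian

# The quantile function Q inverts the CDF F: Q(F(x)) = x and F(Q(tau)) = tau.
for x in (7.0, 10.0, 13.5):
    assert abs(d.inv_cdf(d.cdf(x)) - x) < 1e-9
for tau in (0.05, 0.5, 0.95):
    assert abs(d.cdf(d.inv_cdf(tau)) - tau) < 1e-12

print("quantile function is the inverse of the CDF")
```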

Quantile crossing is one of the drawbacks of the traditional method of approximating the quantile function only at specified levels. Because each prediction comes from a separate model, trained on different local data, the predicted value of the variable at a given probability may be lower than the predicted value at a lower probability. This violates the requirement that the quantile function be monotonically non-decreasing.

To avoid quantile crossing, the method fits a predictive model for several distinct input values – quantile levels – at the same time, spaced at regular intervals between 0 and 1. The model is a neural network in which the prediction for each successive quantile is an incremental increase over the prediction for the previous quantile.
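A minimal sketch of this incremental construction follows. The specific choice of softplus increments and a cumulative sum is one common way to enforce monotonicity and is an assumption here, not necessarily the papers' exact parameterization:

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z)) > 0 for all z.
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)

def monotone_quantiles(raw_outputs):
    """Map unconstrained network outputs to non-crossing quantile predictions.

    The first output is the lowest quantile; each later quantile adds a
    strictly positive increment, so the sequence can never cross.
    """
    first = raw_outputs[..., :1]
    increments = softplus(raw_outputs[..., 1:])
    return np.cumsum(np.concatenate([first, increments], axis=-1), axis=-1)

raw = np.array([0.3, -1.2, 0.8, -0.5, 2.0])  # pretend network outputs
q = monotone_quantiles(raw)
assert np.all(np.diff(q) >= 0)  # quantile predictions never cross
print(q)
```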

Once the model has learned estimates at several anchor points (called "nodes" in the literature) that enforce the monotonicity of the quantile function, researchers can estimate the full function using simple linear interpolation between the anchor points and nonlinear extrapolation to handle the tails.

When there is enough training data to support a higher density of anchor points (nodes), linear interpolation provides a more accurate approximation. To demonstrate that no assumptions about the shape of the distribution are needed, the researchers tested the approach on a toy distribution with three arbitrary peaks.
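With learned values at the anchor points, evaluating the quantile function at any level reduces to interpolation. The node values below are invented for illustration, and `np.interp` simply clamps outside the node range, whereas the papers use nonlinear tail extrapolation there:

```python
import numpy as np

# Suppose a model has produced quantile estimates at evenly spaced anchor
# points (nodes); the values here are illustrative, not from the papers.
nodes = np.linspace(0.05, 0.95, 10)
q_at_nodes = np.array([1.2, 1.8, 2.1, 2.3, 2.6, 3.0, 3.3, 3.9, 4.8, 6.5])

def quantile_fn(tau):
    # Linear interpolation between nodes gives the full quantile function
    # (clamped at the outermost nodes; tails would need extrapolation).
    return np.interp(tau, nodes, q_at_nodes)

print(quantile_fn(0.5))  # midway between the nodes at 0.45 and 0.55
```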

So far, we have only considered distributions over a single variable, but many practical forecasting use cases involve multivariate distributions. For example, if a product uses a rare battery that is not included, the demand forecast for that battery will most likely be correlated with the demand forecast for the product itself.

The problem is that the concept of a multivariate quantile function is not well defined. If the CDF maps multiple variables to a single probability, what value do you map back to when you invert it?

The second paper addresses this problem. The key observation is again that the quantile function must be monotonically non-decreasing; accordingly, the researchers define the multivariate quantile function as the gradient of a convex function.

A convex function is one that curves upward everywhere toward a single global minimum; in one dimension, its graph looks like a U-shaped curve. Its derivative gives the slope of that graph: in the one-dimensional case, the slope is negative to the left of the minimum, flattens to zero at the minimum, and becomes increasingly positive on the other side. As a result, the derivative is monotonically increasing.
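This fact is easy to check in one variable, here with the convex function f(x) = x² and finite-difference slopes:

```python
import numpy as np

# Convex function f(x) = x**2 sampled on a grid; np.gradient approximates
# its derivative f'(x) = 2x by finite differences.
xs = np.linspace(-3.0, 3.0, 13)
slopes = np.gradient(xs**2, xs)

# Negative left of the minimum, zero at it, positive to the right,
# and monotonically increasing throughout.
assert slopes[0] < 0 and slopes[-1] > 0
assert np.all(np.diff(slopes) > 0)
print(slopes)
```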

This picture generalizes readily to higher dimensions. In the paper, the team explains how to train a neural network to learn a quantile function that is the gradient of a convex function: convexity is imposed by the architecture of the network, and the model learns the convex function using its gradient as the training signal.
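The toy sketch below illustrates the principle rather than the authors' architecture: it builds a convex-by-construction function (a sum of softplus units over affine maps, which is convex regardless of the weight signs) and checks that its gradient map is monotone, the multivariate analogue of a non-decreasing quantile function:

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(8, 2))  # hypothetical weights of a tiny convex net
b = rng.normal(size=8)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_f(u):
    # f(u) = sum(softplus(W @ u + b)) is convex in u: each term is a convex
    # function composed with an affine map.  Its gradient is W.T @ sigmoid(...).
    return W.T @ sigmoid(W @ u + b)

# The gradient of a convex function is a monotone map:
# (grad_f(u) - grad_f(v)) . (u - v) >= 0 for all u, v.
for _ in range(1000):
    u, v = rng.normal(size=2), rng.normal(size=2)
    assert (grad_f(u) - grad_f(v)) @ (u - v) >= -1e-9

print("gradient map is monotone")
```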

In addition to real-world datasets, the researchers test the technique on the task of simultaneous prediction across multiple time horizons, using a dataset that follows a multivariate Gaussian distribution. In the experiments, the technique indeed captures correlations between consecutive time ranges better than a univariate approach.

Article 1:

Article 2: