Hi @etaioinshrdlu! The worst-case bounds are discussed in \*Section 2.2\* and \*...

Hi @etaioinshrdlu!

The worst-case bounds are discussed in *Section 2.2* and *Theorem 1*. Essentially, we have the following bounds:

Query: O(log_c(m) log_2(ε/B)) I/Os Space of the index: O(m)

where: n = number of input keys B = disk page size ε = user-given maximum error of the piecewise linear approximation (determines how many keys you need to search at each level) m = number of segments in the piecewise linear approximation c = fan out of the data structure (differently from standard B-trees it is not fixed, and it is potentially large)

Intuitively, the query complexity comes from the fact that the PGM-index has O(log_c(m)) levels, and at each level you do a binary search that costs O(log_2(ε/B)) I/Os.

Note that m and c depend on the "linearity" of the given input data. For example, if the input data can be approximated by a few segments, i.e. if m=O(1), and you choose ε=Θ(B), then the PGM-index takes O(1) space and answer queries in O(1) I/Os!

In general, you can remove the dependence from m and c if you can prove a lower bound on the length of a segment (i.e. the number of keys it "covers"), irrespective of the input data. We proved that the length of a single segment is at least 2ε (thus c≥2ε), or equivalently, that the number of segments m is upper bounded by n/(2ε) [Lemma 2, the proof is very straightforward]. Again, if you choose ε=Θ(B), then you have the following (rather pessimistic) worst-case bounds:

Query: O(log_B(n)) I/Os Space of the index: O(n/B)

Basically, these bounds tell you that the PGM-index is *never* worse in time and in space complexity than a B-tree!

---

However, in our experiments, the performance of the PGM-index was better than what the above bounds show, and this motivated us to study what happens when you make some (general) assumptions on the input data. The results of this study are in the ICML20 paper "Why are learned indexes so effective?" (http://pages.di.unipi.it/vinciguerra/publication/learned-ind...).

We found that, if you assume that the gaps between input sorted keys are taken from a distribution with finite mean and variance, then you can prove (Corollary 2 of the ICML20 paper) that the space of the PGM-index is actually O(n/B^2) whp (versus Θ(n/B) of classic B-trees).

Note that the result applies to *any* distribution, as long as the mean and variance of the RVs modelling the gaps are finite. Indeed, we specialised our main result to some well-known distributions, such as Uniform, Lognormal, Pareto, Exponential, and Gamma (Corollary 1 of the paper).