How blocks storage in Cortex reduces operational complexity for running Prometheus at massive scale


how to store cortex

At startup, store-gateways iterate over the entire storage bucket to discover blocks for all tenants and download the meta.json and index-header for each block. During this initial bucket synchronization phase, the store-gateway /ready readiness probe endpoint will fail: a store-gateway is not ready until this initial index-header download is completed. Moreover, while running, the store-gateway periodically looks for newly uploaded blocks in the storage and downloads the index-header for the blocks belonging to its shard. Today, the Cortex blocks storage is still marked experimental, but at Grafana Labs we’re already running it at scale in a few of our clusters, and we expect to mark it stable pretty soon. When running the store-gateways with a replication factor of three, the querier will balance the requests to query a block across the three store-gateways holding it, effectively distributing the workload for a specific block by a factor of three.
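The balancing described above can be sketched in a few lines of Python. This is an illustration only, not Cortex's actual code: the block-to-replica placement is assumed to be already known, and the names (`block_replicas`, `replica_picker`) are made up.

```python
from itertools import cycle

# Illustrative sketch: with a replication factor of 3, each block is held by
# 3 store-gateways, and the querier spreads successive requests for that
# block across all of them. The placement below is an assumed example.
block_replicas = {"block-01": ["sg-1", "sg-4", "sg-5"]}

def replica_picker(block_id):
    """Yield the store-gateways holding a block, round-robin."""
    return cycle(block_replicas[block_id])

picker = replica_picker("block-01")
first_three = [next(picker) for _ in range(3)]  # each replica hit exactly once
```

With three replicas, three consecutive queries for the same block each land on a different store-gateway, which is what spreads the per-block workload by a factor of three.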

What happens is that the store-gateway periodically scans the bucket and, for each block found, downloads a subset of the index, which we call the index-header. The index-header contains just the index’s symbols table (used to intern strings) and the postings offset table (used to look up postings). We know that each ARM instruction is 32 bits long, and all instructions are conditional.

The findings suggest that traditional theories of consolidation may not be accurate, because memories are formed rapidly and simultaneously in the prefrontal cortex and the hippocampus on the day of training. To reduce the likelihood this could happen, the store-gateway waits for a stable ring at startup. A ring is considered stable if no instance is added to or removed from the ring for at least -store-gateway.sharding-ring.wait-stability-min-duration. If the ring keeps changing after -store-gateway.sharding-ring.wait-stability-max-duration, the store-gateway will stop waiting and proceed with startup normally.
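The wait-for-stable-ring logic can be sketched as a small polling loop. This is a minimal sketch, not Cortex's implementation: the function name and the injected clock are illustrative, and only the min/max duration semantics come from the flags named above.

```python
# Sketch of the startup wait: the ring is "stable" once membership has not
# changed for min_duration; give up and start anyway after max_duration.
def wait_ring_stability(ring_members, min_duration, max_duration, clock):
    """Return True if the ring stabilized, False if we gave up waiting."""
    start = clock()
    last_change = start
    members = ring_members()
    while True:
        t = clock()
        if t - last_change >= min_duration:
            return True            # stable long enough: proceed with startup
        if t - start >= max_duration:
            return False           # never stabilized: start up anyway
        current = ring_members()
        if current != members:     # an instance joined or left
            members, last_change = current, t

# A steady ring stabilizes; a constantly churning ring hits the max wait.
ticks = iter(range(100))
stable = wait_ring_stability(lambda: {"sg-1", "sg-2"}, 3, 10, lambda: next(ticks))
ticks2, churn = iter(range(100)), iter(range(1000))
unstable = wait_ring_stability(lambda: {next(churn)}, 3, 5, lambda: next(ticks2))
```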

Generally, LDR is used to load something from memory into a register, and STR is used to store something from a register to a memory address. A recharging activity is something that makes you feel more energetic at the end than when you start. Common recharging activities will include behaviors such as getting good sleep, eating healthy foods, staying hydrated, engaging in physical activity, being in nature, and social connection.

Offset form: Register as the offset.
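The register-offset form can be illustrated by emulating the address arithmetic in Python. This is a sketch only: the register values and memory contents are made up, and real LDR semantics (widths, alignment) are ignored.

```python
# Sketch of LDR r0, [r1, r2]: the effective address is base + offset register.
regs = {"r1": 0x1000, "r2": 0x8}           # made-up register values
memory = {0x1008: 0xDEADBEEF}              # made-up memory contents

def ldr_register_offset(base, offset_reg):
    """Emulate LDR rd, [base, offset_reg]: load from base + offset."""
    addr = regs[base] + regs[offset_reg]
    return memory[addr]

value = ldr_register_offset("r1", "r2")    # reads memory[0x1000 + 0x8]
```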

When we visit a friend or go to the beach, our brain stores a short-term memory of the experience in a part of the brain called the hippocampus. Those memories are later “consolidated” — that is, transferred to another part of the brain for longer-term storage. Sharding can be used to horizontally scale blocks in a large cluster without hitting any vertical scalability limit. However, running a large and scalable index store may add significant operational complexity, and storing per-series chunks in the chunks store generates millions of objects per day, making it difficult to implement features like per-tenant retention or deletions. The typical setup is having one or more Prometheus servers configured to remote-write the scraped series to Cortex, and then configuring Grafana (or your querying tool of choice) to query back the data from Cortex. In this scenario, Prometheus can be configured with a very short retention, because all the queries are actually served by Cortex itself.
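A minimal Prometheus configuration for this setup might look like the following. The Cortex hostname, port, and push path here are illustrative and depend on your deployment:

```yaml
# prometheus.yml (sketch): remote-write the scraped series to Cortex.
# The URL below is an assumed example, not a fixed Cortex endpoint.
remote_write:
  - url: http://cortex:9009/api/v1/push
```

The short local retention is typically set on the Prometheus side via the `--storage.tsdb.retention.time` flag, since queries are served by Cortex rather than by the local TSDB.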

To protect from this, when a healthy store-gateway instance finds another instance in the ring which has been unhealthy for more than 10 times the configured -store-gateway.sharding-ring.heartbeat-timeout, the healthy instance forcibly removes the unhealthy one from the ring. Zone-stable shuffle sharding can be enabled via a CLI flag. The query frontend is where the first layer of query optimization happens. Given a large time range query, for example 30 days, the query frontend splits the query into 30 queries, each covering 1 day. You typically have a load balancer in front of two query frontends, and Grafana (or your querying tool of choice) is configured to run queries against the query frontend through the load balancer.
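The split-by-day step can be sketched as follows. This is an illustration of the idea, not the query frontend's actual code; the function name is made up.

```python
from datetime import datetime, timedelta, timezone

def split_by_day(start, end):
    """Split [start, end) into consecutive per-day sub-ranges."""
    splits = []
    while start < end:
        sub_end = min(start + timedelta(days=1), end)
        splits.append((start, sub_end))
        start = sub_end
    return splits

# A 30-day query becomes 30 one-day sub-queries.
day0 = datetime(2020, 7, 1, tzinfo=timezone.utc)
parts = split_by_day(day0, day0 + timedelta(days=30))
```

The sub-queries can then be executed in parallel and their results merged, which is what makes long-range queries tractable.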

A more recent model, the multiple trace model, suggests that traces of episodic memories remain in the hippocampus. These traces may store details of the memory, while the more general outlines are stored in the neocortex. Beginning in the 1950s, studies of the famous amnesiac patient Henry Molaison, then known only as Patient H.M., revealed that the hippocampus is essential for forming new long-term memories. Molaison, whose hippocampus was damaged during an operation meant to help control his epileptic seizures, was no longer able to store new memories after the operation. However, he could still access some memories that had been formed before the surgery. The store_gateway_config configures the store-gateway service used by the blocks storage.


If a querier tries to query a block which has not been loaded by a store-gateway, the querier will either retry on a different store-gateway (if blocks replication is enabled) or fail the query. The request sent to each store-gateway contains the list of block IDs that are expected to be queried, and the response sent back by the store-gateway to the querier contains the list of block IDs that were actually queried. Given that the received samples are replicated by the distributors to ingesters, typically by a factor of 3, completely losing a single ingester will not lead to any data loss, and thus the WAL wouldn’t be required.
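The expected-vs-actual comparison above boils down to a set difference. A minimal sketch (the function name is illustrative, not Cortex's):

```python
# Sketch of the querier-side check: compare the block IDs we expected a
# store-gateway to query with the IDs it reports having actually queried.
# Any missing block must be fetched from another replica, or the query fails.
def missing_blocks(expected_ids, queried_ids):
    """Return the block IDs that were expected but not actually queried."""
    return set(expected_ids) - set(queried_ids)

remaining = missing_blocks({"01A", "01B", "01C"}, {"01A", "01C"})
```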

In a recent blog post, I wrote about the work we’ve done over the past year on Cortex blocks storage. It provides horizontal scalability, high availability, multi-tenancy, and blazingly fast query performance when querying high-cardinality series or large time ranges. The literal pool is a memory area in the same section (because the literal pool is part of the code) used to store constants, strings, or offsets. Such pseudo-instructions can be used to reference an offset to a function, or to move a 32-bit constant into a register in one instruction.

Offset form: Scaled register as the offset
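The scaled-register form shifts the offset register before adding it to the base. Again a sketch with made-up values, ignoring real load widths:

```python
# Sketch of LDR r0, [r1, r2, LSL#2]: the offset register is shifted left
# by 2 bits before being added to the base register.
regs = {"r1": 0x2000, "r2": 0x3}           # made-up register values
memory = {0x2000 + (0x3 << 2): 42}         # i.e. address 0x200C

def ldr_scaled_offset(base, offset_reg, shift):
    """Emulate LDR rd, [base, offset_reg, LSL#shift]."""
    addr = regs[base] + (regs[offset_reg] << shift)
    return memory[addr]

value = ldr_scaled_offset("r1", "r2", 2)   # reads memory[0x200C]
```

Shifting by 2 multiplies the index by 4, which is handy for indexing arrays of 32-bit words.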

The solution we adopted in Cortex is to query the last 12h only from ingesters (it’s configurable). The idea is to have a cut-off time of 12h between ingesters and long-term storage in order to give the compactor enough time to run the vertical compaction of 2h blocks. Samples with a timestamp more recent than 12h are only queried from ingesters, while older samples are only queried from the long-term storage. With [r1, r2, LSL#2], the memory location is calculated as the base register r1 plus the value of r2 shifted left by two bits. Simon Makin of Scientific American writes that MIT researchers have discovered the brain uses a complementary memory system that simultaneously creates and stores both long- and short-term memories.
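The 12h cut-off between ingesters and long-term storage can be sketched as a range partition. This is an illustration of the idea, not Cortex's code; the names are made up and the boundary is hard-coded where Cortex makes it configurable.

```python
from datetime import datetime, timedelta, timezone

CUTOFF = timedelta(hours=12)  # illustrative; configurable in Cortex

def split_query_range(start, end, now):
    """Partition [start, end) at now - 12h.
    Returns (storage_range, ingester_range); either may be None."""
    boundary = now - CUTOFF
    storage = (start, min(end, boundary)) if start < boundary else None
    ingesters = (max(start, boundary), end) if end > boundary else None
    return storage, ingesters

# A 2-day query is split: older samples from storage, recent from ingesters.
now = datetime(2020, 7, 2, 12, 0, tzinfo=timezone.utc)
storage, ingesters = split_query_range(now - timedelta(days=2), now, now)
```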

For example, for an extrovert, socializing will be a battery recharge, whereas an introvert may find this activity draining. There is also probably an optimal amount of time for something to be recharging. A 30-minute walk might be recharging but a six-hour walk might be draining. Further studies are needed to determine whether memories fade completely from hippocampal cells or if some traces remain. Right now, the researchers can only monitor engram cells for about two weeks, but they are working on adapting their technology to work for a longer period.

  1. “They’re formed in parallel, but then they go different ways from there.”
  2. Instances can be added and removed at any time (it happens whenever you scale up or down the store-gateways) and – whenever the topology changes – blocks are automatically resharded across store-gateways.
  3. Each block ID is hashed and assigned to a store-gateway instance and replicated on other RF-1 instances, where RF is the replication factor (defaults to 3).
  4. In a recent blog post, I wrote about the work we’ve done over the past year on Cortex blocks storage.
  5. Then, they could use light to artificially reactivate these memory cells at different times and see if that reactivation provoked a behavioral response from the mice (freezing in place).
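The hashing and replication described in point 3 can be sketched with a toy hash ring. This is a simplification: Cortex's real ring uses registered per-instance tokens and heartbeats, while here each instance gets a single token derived from its name.

```python
import hashlib
from bisect import bisect_right

REPLICATION_FACTOR = 3  # Cortex default

def _token(s):
    """Map a string onto the ring's key space (illustrative hash choice)."""
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

def shard_block(block_id, instances):
    """Assign a block to RF instances by walking the token ring clockwise."""
    ring = sorted((_token(i), i) for i in instances)
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, _token(block_id)) % len(ring)
    return [ring[(start + k) % len(ring)][1] for k in range(REPLICATION_FACTOR)]

owners = shard_block("01FZB7N3", ["sg-1", "sg-2", "sg-3", "sg-4"])  # made-up ID
```

Because placement depends only on the hash of the block ID and the ring membership, adding or removing an instance moves only the blocks whose ring segment changed, which is what makes automatic resharding cheap.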

Neuroscientists have developed two major models to describe how memories are transferred from short- to long-term memory. The earliest, known as the standard model, proposes that short-term memories are initially formed and stored in the hippocampus only, before being gradually transferred to long-term storage in the neocortex and disappearing from the hippocampus. For each block belonging to a store-gateway shard, the store-gateway loads its meta.json, the deletion-mark.json and the index-header. Once a block is loaded on the store-gateway, it’s ready to be queried by queriers. When the querier queries blocks through a store-gateway, the response will contain the list of actually queried block IDs.
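The per-block loading step can be sketched as follows. The three file names come from the text above; the loader function and the fake downloader are illustrative.

```python
# Files the store-gateway needs per block before it can serve queries for it.
BLOCK_FILES = ("meta.json", "deletion-mark.json", "index-header")

def load_block(bucket, block_id, download):
    """Fetch every per-block file via the supplied download callable."""
    return {name: download(bucket, f"{block_id}/{name}") for name in BLOCK_FILES}

# Usage with a fake downloader that just echoes the object path it was given.
loaded = load_block("tenant-1", "01ABC", lambda bucket, path: f"{bucket}/{path}")
```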


Now, the compactor is the service responsible for merging and deduplicating source blocks into a larger block. It supports both vertical compaction (to compact overlapping 2h blocks) and horizontal compaction (to compact adjacent blocks into a wider one). This value is added to or subtracted from the base register (R1, for instance) to access data at an offset known at compile time. Many of our previous recharging activities are no longer available, or have been altered, during the pandemic. We therefore have to be more creative and deliberate in identifying and integrating recharging activities into our daily lives.
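The compactor's planning step for vertical compaction can be sketched by grouping blocks whose time ranges overlap. This is an illustration only, far simpler than the real planner, and the names are made up.

```python
# Sketch: vertical compaction merges blocks with overlapping time ranges
# (e.g. the same 2h range uploaded by 3 replicating ingesters); horizontal
# compaction would then join adjacent merged blocks into wider ones.
def plan_vertical_compaction(blocks):
    """Group (min_t, max_t) blocks whose ranges overlap into merge jobs."""
    jobs, group_max = [], None
    for min_t, max_t in sorted(blocks):
        if jobs and min_t < group_max:       # overlaps the current group
            jobs[-1].append((min_t, max_t))
            group_max = max(group_max, max_t)
        else:                                # disjoint: start a new job
            jobs.append([(min_t, max_t)])
            group_max = max_t
    return jobs

# Three ingesters uploaded the same 2h range, plus one adjacent block.
jobs = plan_vertical_compaction([(0, 2), (0, 2), (0, 2), (2, 4)])
```

The three overlapping blocks form one merge job (their samples get deduplicated), while the adjacent block is left for horizontal compaction.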

Kitamura says he believes that some trace of memory may stay in the hippocampus indefinitely, storing details that are retrieved only occasionally. “To discriminate two similar episodes, this silent engram may reactivate and people can retrieve the detailed episodic memory, even at very remote time points,” he says. For the general Cortex configuration and references to common config blocks, please refer to the configuration documentation. Cortex exposes a 100% Prometheus-compatible API, so any client tool capable of querying Prometheus can also be used to run the same exact queries against Cortex. Recently, the Prometheus community has legitimately spent significant effort to reduce the TSDB memory footprint, and some of the changes came with a tradeoff between memory and CPU. I personally believe that these changes make a lot of sense in the context of Prometheus, where the compactor runs within Prometheus itself, but it may look counterproductive in systems like Cortex, where the compactor runs isolated from the rest of the system.

The query frontend also offers other capabilities, like start and end timestamp alignment, to make the query cacheable, and supports partial results caching. In this blog post, I will talk about the work we’ve done over the past year on Cortex blocks storage to help solve this problem. GitHub issues tagged with the storage/blocks label are the best source of currently known issues affecting the blocks storage. I’d like to talk about how we implemented sub-object caching for the chunks, or all the nuances of the query consistency check, but it’s getting late and I’ve probably covered more than I should in a single blog post. The first thing we learned is that querying non-compacted blocks is pretty inefficient.
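The timestamp alignment mentioned above can be sketched as rounding the query range to multiples of the step. This is an illustration of why alignment helps caching, not the query frontend's exact rounding rules.

```python
# Sketch: align the range to the step so that repeated dashboard refreshes
# produce identical (and therefore cacheable) sub-query ranges.
def align_range(start_ms, end_ms, step_ms):
    """Round start down and end up to multiples of the step."""
    aligned_start = start_ms - (start_ms % step_ms)
    aligned_end = end_ms + (-end_ms % step_ms)
    return aligned_start, aligned_end

aligned = align_range(1_001, 9_999, 1_000)  # slightly ragged 1s-step range
```

Without alignment, a dashboard refreshing every few seconds asks for a slightly shifted range each time and never hits the cache.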

This is important when querying store-gateways, because querying a given block can be retried at most 3 times. To fetch samples from the long-term storage, the querier analyzes the query start and end time range to compute a list of all known blocks containing at least one sample within this time range. Given the list of blocks, the querier then computes a set of store-gateway instances holding these blocks and sends a request to each matching store-gateway instance, asking it to fetch all the samples for the series matching the query within the start and end time range. Each push request belongs to a tenant, and the ingester appends the received samples to the specific per-tenant TSDB stored on the local disk. The received samples are both kept in memory and written to a write-ahead log (WAL), which is used to recover the in-memory series in case the ingester abruptly terminates. The per-tenant TSDB is lazily created in each ingester as soon as the first samples are received for that tenant.
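The lazy per-tenant TSDB creation and the dual in-memory/WAL write can be sketched as follows. This is a toy, in-memory illustration only; the class and function names are made up and nothing here resembles Prometheus TSDB internals.

```python
# Sketch of the ingester's push path: a tenant's TSDB is created lazily on
# its first push, and every sample is both kept in memory and appended to a
# WAL (here just a list) that would be replayed after a crash.
class TinyTSDB:
    def __init__(self):
        self.samples = []   # in-memory series data
        self.wal = []       # stand-in for the on-disk write-ahead log

    def append(self, series, ts, value):
        self.samples.append((series, ts, value))
        self.wal.append((series, ts, value))   # recovery source after a crash

tsdbs = {}  # one TSDB per tenant, created on first use

def push(tenant, series, ts, value):
    tsdbs.setdefault(tenant, TinyTSDB()).append(series, ts, value)

push("tenant-1", 'up{job="api"}', 1000, 1.0)
push("tenant-2", 'up{job="db"}', 1000, 1.0)
```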