10 Questions and Answers about Data Availability: Celestia as an Example

What is data availability (DA)?
Data availability answers one question: has this data been published? When a node receives a new block that is about to be added to the chain, it checks that the block's data is available by attempting to download all of the block's transaction data. If the download succeeds, the node has verified data availability, which proves that the block data was indeed published to the network.
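The full-download check described above can be sketched in a few lines. This is a simplified illustration, not Celestia's or any real client's code: the commitment is a plain hash over the transactions (real chains use Merkle roots), and `fetch_tx` is a hypothetical callback standing in for a network request.

```python
import hashlib
from typing import Callable, Optional


def commitment(transactions: list[bytes]) -> str:
    """Toy commitment to a block's transactions (assumption: a hash of hashes)."""
    hasher = hashlib.sha256()
    for tx in transactions:
        hasher.update(hashlib.sha256(tx).digest())
    return hasher.hexdigest()


def verify_by_full_download(
    expected_commitment: str,
    tx_count: int,
    fetch_tx: Callable[[int], Optional[bytes]],
) -> bool:
    """Return True only if every transaction can be downloaded and the
    downloaded data matches the commitment in the block header."""
    downloaded = []
    for index in range(tx_count):
        tx = fetch_tx(index)  # hypothetical network call; None means "not served"
        if tx is None:
            return False  # some data was withheld, so the block is not available
        downloaded.append(tx)
    return commitment(downloaded) == expected_commitment
```

The cost of this check grows with the block size, because every node downloads every transaction.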
Modular blockchains such as Celestia (learn more about it here: https://docs.celestia.org/learn/how-celestia-works/data-availability-faq) use additional primitives that let nodes verify data availability more efficiently. Data availability is critical to the security of any blockchain because it ensures that anyone can inspect the transaction ledger and verify it. It becomes particularly challenging as blockchains scale: as blocks get larger, it becomes impractical for the average user to download all the data, so users can no longer verify the chain themselves.
What is the data availability problem?
The problem occurs when the transaction data of a new block cannot be downloaded and verified by nodes on the network. One possible scenario is that the block producer deliberately does not publish the transaction data, which is called a data withholding attack. If the transaction data is not published, the nodes on the network cannot verify and accept the new block, so they cannot update the blockchain to its latest state.
This could cause the blockchain to stop functioning because nodes cannot verify the data of new blocks, or worse, attackers could exploit the situation to steal funds. The severity of the consequences depends on the type of blockchain (L1 or L2) and on whether data availability is kept on-chain or off-chain. Data availability issues are particularly relevant to Layer 2 scaling solutions such as rollups and validiums. These technologies attempt to improve blockchain performance by moving transaction execution off-chain, but this can introduce new data availability challenges.
How do nodes verify data availability in Celestia?
In most blockchains, nodes verify data availability by downloading all of a block's transaction data. If a node can download all of the data, it has verified data availability. In Celestia, light nodes can use a new mechanism to verify data availability without downloading all of a block's data. This mechanism is called data availability sampling.
What is Data Availability Sampling?
Data availability sampling (DAS) is a mechanism that enables light nodes to verify data availability without downloading all of a block's data. Light nodes perform multiple rounds of random sampling, each of which retrieves a small portion of the block's data. As a light node completes more rounds of sampling, its confidence that the data is available increases. Once the light node reaches a predetermined confidence level (e.g. 99%), it considers the block data to be available.
Want a simpler explanation? Check out this discussion thread about how data availability sampling is like flipping a coin. https://twitter.com/nickwh8te/status/1559977957195751424
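The way confidence grows with each sampling round can be illustrated with a small calculation. This is a minimal sketch under a stated assumption, not Celestia's actual parameters: it assumes an attacker must withhold at least a fraction `withheld_fraction` of the block's shares to make the block unrecoverable, and that each sample is drawn independently and uniformly at random. Under that assumption, `samples` successful samples of an unavailable block happen with probability at most `(1 - withheld_fraction) ** samples`.

```python
def confidence_after(samples: int, withheld_fraction: float) -> float:
    """Lower bound on the confidence that data is available after
    `samples` successful random samples (illustrative model only)."""
    return 1.0 - (1.0 - withheld_fraction) ** samples


def samples_needed(target_confidence: float, withheld_fraction: float) -> int:
    """Smallest number of successful samples that reaches the target confidence."""
    if not 0.0 < withheld_fraction <= 1.0:
        raise ValueError("withheld_fraction must be in (0, 1]")
    if not 0.0 <= target_confidence < 1.0:
        raise ValueError("target_confidence must be in [0, 1)")
    samples = 0
    while confidence_after(samples, withheld_fraction) < target_confidence:
        samples += 1
    return samples
```

With `withheld_fraction=0.5`, which matches the coin-flip analogy, `samples_needed(0.99, 0.5)` is 7. With a hypothetical `withheld_fraction=0.25`, it is 17.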
What are some of Celestia's security assumptions regarding data availability sampling?
Data availability sampling (DAS) in Celestia relies on two security assumptions:
Sufficient light nodes: Celestia assumes that enough light nodes are on the network and that they sample newly produced blocks. Each light node downloads only a small random part of a block rather than the whole block. Collectively, the light nodes must sample enough data that a full node can reconstruct the complete block from the sampled data if needed. As a result, larger blocks require more light nodes.
Connection to an honest full node: The second assumption is that each light node is connected to at least one honest full node, so that it can receive fraud proofs for incorrectly erasure-coded blocks. A fraud proof is a security mechanism that lets a node show that a block was constructed incorrectly. If a light node is cut off from every honest full node, for example during an eclipse attack (a network attack in which the attacker isolates the target node so that it can only connect to malicious nodes), it cannot learn that a block was improperly constructed, and its security guarantees no longer hold.
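The first assumption, that larger blocks need more light nodes, can be made concrete with a rough estimate. This is a back-of-the-envelope sketch, not Celestia's actual security analysis: it assumes every light node samples `samples_per_node` distinct shares uniformly at random, and it uses a hypothetical `required_fraction` of shares that must be collectively sampled before reconstruction is possible.

```python
def expected_coverage(total_shares: int, samples_per_node: int, num_nodes: int) -> float:
    """Expected fraction of a block's shares sampled by at least one light node."""
    # Probability that one particular share is missed by one particular node.
    miss_probability = 1.0 - samples_per_node / total_shares
    return 1.0 - miss_probability ** num_nodes


def min_light_nodes(total_shares: int, samples_per_node: int, required_fraction: float) -> int:
    """Smallest number of light nodes whose expected coverage reaches
    `required_fraction` (an illustrative threshold, not a protocol constant)."""
    if not 0 < samples_per_node <= total_shares:
        raise ValueError("samples_per_node must be in (0, total_shares]")
    if not 0.0 <= required_fraction < 1.0:
        raise ValueError("required_fraction must be in [0, 1)")
    nodes = 0
    while expected_coverage(total_shares, samples_per_node, nodes) < required_fraction:
        nodes += 1
    return nodes
```

For a fixed number of samples per node, `min_light_nodes(4096, 16, 0.75)` is larger than `min_light_nodes(1024, 16, 0.75)`, which mirrors the statement that bigger blocks need more light nodes.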
Why is block reconstruction necessary for security?
In a blockchain, reconstructing a block means restoring its complete contents from the data fragments that are available, even when the whole block was never received in one piece. It is like piecing a torn sheet of paper back together: if the pieces overlap enough, the original page can be recovered.
In a system like Celestia, through erasure coding, even if we don’t have the complete block data, as long as there are enough data fragments, we can restore the data of the entire block. Erasure coding creates some additional data redundancy, so that even if some data is lost, the remaining information is enough for us to reconstruct the complete block.
Why is this important for security? Because it ensures that even in imperfect situations, such as malicious nodes trying to hide data or network instability that prevents data from being fully transmitted, we can still verify the integrity and correctness of transactions. If someone tries to tamper with or hide transaction data, then as long as we can reconstruct the block, we can discover and prove the tampering, which keeps the blockchain transparent and trustworthy.
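The idea that a block can be rebuilt from enough fragments can be shown with a toy Reed-Solomon-style code. This is only an illustration of the principle, not Celestia's encoding: it treats `k` data values as points on a polynomial over a prime field (the modulus `P` is an arbitrary choice for this sketch), extends them to `2k` shares, and recovers the original data from any `k` surviving shares by Lagrange interpolation.

```python
P = 2**31 - 1  # prime modulus chosen for this sketch (assumption)


def _interpolate(points: list[tuple[int, int]], x: int) -> int:
    """Evaluate at x the unique polynomial (mod P) passing through `points`."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: list[int]) -> list[int]:
    """Extend k data values to 2k shares; the first k shares equal the data."""
    points = list(enumerate(data))
    return [_interpolate(points, x) for x in range(2 * len(data))]


def reconstruct(shares: dict[int, int], k: int) -> list[int]:
    """Recover the k original values from any k shares (index -> value)."""
    if len(shares) < k:
        raise ValueError("not enough shares to reconstruct the data")
    points = list(shares.items())[:k]
    return [_interpolate(points, x) for x in range(k)]
```

For example, after `extended = encode([5, 17, 42, 7])`, any four of the eight shares, such as those at indices 1, 4, 6 and 7, are enough for `reconstruct` to return the original four values.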
What is data storage? What are the issues regarding data storage?
Data storage involves the ability to store and access data about past transactions.
Data storage and retrieval are needed for several purposes, such as:
Reading information about past transactions
Synchronizing nodes
Indexing and serving transaction data
Retrieving NFT information
The data storage problem is whether past transaction data can be stored and successfully retrieved later. Failure to retrieve historical transaction data can lead to problems such as users being unable to access information about their past transactions, or nodes being unable to sync from the genesis block. Fortunately, the assumptions needed for storing and accessing past data are not demanding. Users only need access to a single copy of the blockchain history in order to obtain historical transaction data. In other words, data storage security is a 1-of-N honesty assumption.
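The 1-of-N assumption can be sketched as a retrieval loop: it does not matter how many providers are offline or dishonest, as long as one of them returns data that matches a hash the user already trusts. The `Provider` callback type and the use of SHA-256 as the block identifier are assumptions made for this illustration.

```python
import hashlib
from typing import Callable, Iterable, Optional

# Hypothetical provider interface: given a block hash, return the data or None.
Provider = Callable[[str], Optional[bytes]]


def retrieve_history(block_hash: str, providers: Iterable[Provider]) -> Optional[bytes]:
    """Ask each provider in turn and accept the first response whose hash matches.

    A single honest provider among N is enough for retrieval to succeed.
    """
    for provider in providers:
        data = provider(block_hash)
        if data is not None and hashlib.sha256(data).hexdigest() == block_hash:
            return data  # verified against the known hash, so the source need not be trusted
    return None  # no provider served valid data
```

Because every response is checked against the known hash, a dishonest provider can refuse to serve data but cannot pass off forged history.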
What is the difference between data availability and data storage? How does blockchain state fit into this question?
Data availability is about verifying that the transaction data of a new block is publicly available. In contrast, data storage involves storing and accessing past transaction data of older blocks.
So far, we have been discussing transaction data, but blockchain state is a related topic. State is different from transaction data. Specifically, state is like a current snapshot of the network, including account balances, smart contract balances, and validator set information. The problems caused by state size are qualitatively different from data availability and retrievability issues.
Why doesn't Celestia incentivize the storage of historical data? Who would store historical data if there is no reward?
Most blockchains do not incentivize data storage, because it should not be the blockchain’s responsibility to guarantee that historical data is permanently retrievable. Furthermore, data storage only requires a single party to store the data and serve it to users, which is a weak assumption. Celestia therefore focuses on providing a secure and scalable way to verify that data is available. Once the data is verified as available, the task of storing and retrieving historical data is left to other entities that need the data. Even though Celestia itself does not directly provide incentives (e.g., token payments or other rewards) for storing and serving data, other parties have their own reasons to store historical data and make it available to users who need it.
There are many types of participants that may store historical data. Some of these include:
A block explorer that provides access to past transaction data.
An indexer that provides API queries for past data.
Applications or rollups that require historical data to perform some processing.
Users who wish to have guaranteed access to their transaction history.
What can a blockchain do to provide stronger guarantees of data retrievability?
Reward nodes based on the amount of transaction data they store and the data requests they serve (as some data storage blockchains, such as Filecoin, do).
Publish transaction data to a data storage blockchain that incentivizes storage and provides historical data request services.
Reference link: https://docs.celestia.org/learn/how-celestia-works/data-availability-faq