
Launching Our Next Generation Data Platform to Scale Blockchain and Asset Support

One of our primary goals is to support every blockchain and every interaction. Today, layer 2s and rollups are the driving force behind blockchain scalability, fueling a proliferation of new blockchains, with many more in development in 2023. As activity continues to increase on chains beyond Bitcoin and Ethereum, we recognized the need for a generic platform that supports new chains and their corresponding tokens more quickly and comprehensively. We are proud to announce that our research and development team recently launched a next generation data platform to advance this vision.

This platform makes step function improvements in three key areas: 

  1. Onboarding speed: Our knowledge graph can now integrate new blockchains at a faster rate. 
  2. Token coverage: For smart contract-based networks, our platform will automatically support all fungible and non-fungible tokens. As new tokens go live, they will also be supported dynamically. 
  3. Address clustering speed: Chainalysis employs a clustering process to group together addresses controlled by the same real-world entity or service. With our new platform upgrades, we are able to support comprehensive clustering for all blockchains more quickly than before. 

Keep reading to learn more about the features of this platform and how it impacts our customers.  

Our new platform capabilities 

Our research and development team implemented a plug-and-play framework to onboard the next set of blockchains by generalizing the format and structure of the data at each stage of our pipelines. 

The three main stages are: 

  • Stage 1 – Raw Archival Data: Establish generalized frameworks for ingesting and onboarding blockchains at scale  
  • Stage 2 – Derived Datasets: Process raw data and derive data objects that drive all Chainalysis solutions, such as clustering and exposure 
  • Stage 3 – Solutions: Normalize the derived datasets so they can flow seamlessly across our portfolio in a predictable and easily understandable format

Stage 1 – Raw archival data ingestion

In the initial stage of the pipeline, our framework, now serverless, ingests the raw archival data. To scale this stage properly, we designed the framework to generalize by blockchain data type, such as account-based or unspent transaction output (UTXO).

For example, the first step to generically onboard account-based blockchains involves ingesting block, receipt, and trace data. For trace data, the framework also accommodates the ingestion of both Parity and Geth data as it relates to EVMs.
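To make the idea concrete, here is a minimal sketch of what configuration-driven onboarding by data type could look like. The `ChainConfig` class, field names, and RPC URL are hypothetical illustrations, not Chainalysis's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch: onboarding a chain reduced to a configuration
# object keyed by its data model.
@dataclass
class ChainConfig:
    name: str
    model: str                  # "account" or "utxo"
    rpc_url: str
    trace_format: str = "geth"  # account-based chains: "geth" or "parity"

    def datasets(self) -> list:
        # Account-based chains need blocks, receipts, and traces;
        # UTXO chains need blocks and transactions.
        if self.model == "account":
            return ["blocks", "receipts", "traces"]
        return ["blocks", "transactions"]

bsc = ChainConfig("bnb-smart-chain", "account", "https://rpc.example", "geth")
btc = ChainConfig("bitcoin", "utxo", "https://node.example")
```

Under this design, adding a new EVM-compatible chain is a new `ChainConfig` entry rather than new pipeline code.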

There are three unique features of the ingestion stage framework: 

  • Node integrations: The first challenge with building any blockchain data pipeline is defining an interface with RPC nodes to know when new data is available and then to request it. To streamline this process going forward, we abstracted away the repetitive work needed to integrate with nodes and encapsulated it as configuration. This component is also flexible and can be modified to simply read data pushed into a storage service, such as Amazon Web Services S3.
  • Event-driven consumers: Once new data is ingested, all downstream systems immediately start processing, as opposed to relying on a polling mechanism. We achieved this via an event-based architecture whereby messages are triggered, thus alerting downstream consumers. 
  • Reorg-handling: Any newly confirmed block is always at risk of being rejected via a reorg event, for instance when miners or validators agree to fork the chain and discard the most recent blocks. For blockchains with higher transaction speeds, the ingestion stage seamlessly handles reorgs and ensures that forked blocks are removed from the data. When a reorg occurs, event messages alert downstream consumers so they can take action accordingly, such as deleting data related to the forked blocks and consuming data for the newly formed chain.
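The event-driven ingestion and reorg handling described above can be sketched together: a parent-hash mismatch signals a fork, and invalidation events tell consumers what to delete. The function and event names below are illustrative assumptions, and a production system would walk back to the true fork point and re-fetch replacement blocks:

```python
def handle_block(ledger, block, publish):
    """Hypothetical sketch of event-driven ingestion with reorg detection.

    ledger maps block number -> block hash for ingested blocks;
    publish(event, payload) stands in for a message bus that alerts
    downstream consumers.
    """
    parent_hash = ledger.get(block["number"] - 1)
    if parent_hash is not None and parent_hash != block["parent_hash"]:
        # Reorg detected: the stored blocks from the fork point forward
        # are invalidated so consumers can delete derived data.
        for n in sorted(n for n in ledger if n >= block["number"] - 1):
            publish("block_invalidated", {"number": n, "hash": ledger.pop(n)})
    ledger[block["number"]] = block["hash"]
    publish("block_ingested", {"number": block["number"], "hash": block["hash"]})

events, ledger = [], {}
pub = lambda event, payload: events.append((event, payload["number"]))
handle_block(ledger, {"number": 1, "parent_hash": "g", "hash": "a"}, pub)
handle_block(ledger, {"number": 2, "parent_hash": "a", "hash": "b"}, pub)
# A competing block 2 arrives whose parent is not our block 1:
handle_block(ledger, {"number": 2, "parent_hash": "a2", "hash": "b2"}, pub)
```

Because consumers react to published events rather than polling, downstream processing starts the moment a block is ingested or invalidated.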

Stage 2 – Derived data representations

After the platform ingests the raw archival data, it generates the core derived datasets that are commonly surfaced throughout our solutions, such as value transfers, clustering, and exposure. To achieve this, we leverage numerous transformers, which are generic to the dataset type and therefore can effectively be reused by all blockchains that inherit the same structure. For instance, the EVM transformer can be leveraged to onboard Ethereum, BNB Smart Chain, Polygon PoS, and more. 
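One way to picture this reuse is a registry that maps each data model to one transformer class, so onboarding a chain that shares a model needs no new transformer code. The class and function names here are hypothetical illustrations of the pattern, not the actual Chainalysis codebase:

```python
# Hypothetical sketch: one transformer per data model, registered once
# and reused by every chain that inherits the same structure.
TRANSFORMERS = {}

def register(model):
    def wrap(cls):
        TRANSFORMERS[model] = cls
        return cls
    return wrap

@register("evm")
class EvmTransformer:
    """Derives core datasets (e.g. value transfers) from EVM blocks,
    receipts, and traces."""
    def __init__(self, chain):
        self.chain = chain

@register("utxo")
class UtxoTransformer:
    def __init__(self, chain):
        self.chain = chain

def onboard(chain, model):
    # Onboarding reduces to a lookup plus configuration.
    return TRANSFORMERS[model](chain)

evm_chains = [onboard(c, "evm")
              for c in ("ethereum", "bnb-smart-chain", "polygon-pos")]
```

The same `EvmTransformer` instance type serves Ethereum, BNB Smart Chain, and Polygon PoS; only the configuration differs.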

Each blockchain is unique, however, and may have distinguishing characteristics such as its treatment of genesis transactions or whether it is deflationary or inflationary. These nuances are important because they can impact metrics such as cluster balances. In the case of genesis, omitting some transactions could make certain clusters appear to hold significant negative balances.

We also designed the transformer to identify and capture activity from all predefined assets for a given blockchain type. Therefore, an EVM blockchain will automatically support ERC-20s, ERC-721s, ERC-1155s, and all equivalents. These token standards typically cover most fungible and non-fungible tokens on any blockchain and result in millions of tokens supported per chain. 
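Detecting these standards generically is possible because each one emits well-known event signatures. The sketch below classifies a log by its first topic; the topic hashes are the standard keccak-256 event signatures from the ERC-20/721/1155 specifications (ERC-20 and ERC-721 share the same `Transfer` topic and differ in topic count, since ERC-721 indexes the token ID). The `classify` function itself is an illustrative assumption:

```python
# Standard event signature topics (keccak-256 of the event signature):
TRANSFER = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"
TRANSFER_SINGLE = "0xc3d58168c5ae7397731d063d5bbf3d657854427343f4c083240f7aacaa2d0f62"
TRANSFER_BATCH = "0x4a39dc06d4c0dbc64b70af90fd698a233a518aa5d07e595d983b8c0526c8f7fb"

def classify(log):
    """Hypothetical sketch: map a raw EVM log to a token standard."""
    topics = log["topics"]
    if topics[0] == TRANSFER:
        # ERC-721 indexes tokenId, giving 4 topics instead of 3.
        return "erc721" if len(topics) == 4 else "erc20"
    if topics[0] in (TRANSFER_SINGLE, TRANSFER_BATCH):
        return "erc1155"
    return None

erc20_log = {"topics": [TRANSFER, "0xfrom", "0xto"]}
erc721_log = {"topics": [TRANSFER, "0xfrom", "0xto", "0xtokenid"]}
```

Because new tokens reuse these same signatures, a newly deployed contract is recognized the moment its first transfer log appears, which is what makes dynamic token support possible.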

Supporting transfers for millions of assets comes at the cost of latency if the data must be transformed one record at a time in a strict order. The new pipeline instead uses parallelism and processes transactions in a non-sequential order to increase throughput. As a result, the system is not halted by most edge cases and processes most transactions in a matter of hours for high-throughput blockchains. Additionally, the system records edge cases in a dead-letter queue (DLQ) to ensure that they can be addressed later for data completeness.
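The parallel-with-DLQ pattern can be sketched in a few lines: records are transformed out of order across workers, and any record that raises is diverted rather than halting the run. The function names and the list-backed queue are illustrative stand-ins for a real message queue:

```python
from concurrent.futures import ThreadPoolExecutor

def transform_all(transactions, transform, dlq, workers=8):
    """Hypothetical sketch: parallel, non-sequential transformation
    where failing records land in a dead-letter queue (here a plain
    list) to be replayed later for data completeness."""
    def safe(tx):
        try:
            return transform(tx)
        except Exception as exc:
            dlq.append((tx, repr(exc)))  # park the edge case, keep going
            return None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [r for r in pool.map(safe, transactions) if r is not None]

dlq = []
def times_ten(tx):
    if tx == 3:
        raise ValueError("malformed record")  # simulated edge case
    return tx * 10

results = transform_all([1, 2, 3, 4], times_ten, dlq)
```

The run completes despite the bad record, and the DLQ retains enough context to reprocess it once the edge case is handled.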

Lastly, the combination of the event-driven architecture and the use of generic transformers allows us to roll out derived datasets for a new blockchain in a matter of days rather than weeks. Clustering data is one example: each new blockchain can simply leverage the same generalized heuristics already written. The main bottleneck for clustering coverage is now whether the desired blockchain has data flowing through the pipeline.
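As an illustration of a heuristic written once and reused everywhere, here is the classic co-spend rule for UTXO chains (all inputs of a transaction are assumed to be controlled by the same entity) implemented with union-find. This is a well-known public clustering heuristic used as a stand-in; the class and function names are hypothetical:

```python
class Clusters:
    """Union-find over addresses: chain-agnostic cluster bookkeeping."""
    def __init__(self):
        self._parent = {}

    def find(self, addr):
        self._parent.setdefault(addr, addr)
        while self._parent[addr] != addr:
            # Path halving keeps lookups near-constant time.
            self._parent[addr] = self._parent[self._parent[addr]]
            addr = self._parent[addr]
        return addr

    def union(self, a, b):
        self._parent[self.find(a)] = self.find(b)

def apply_cospend(transactions, clusters):
    """Hypothetical sketch: inputs spent together belong to one entity."""
    for tx in transactions:
        first, *rest = tx["inputs"]
        for addr in rest:
            clusters.union(first, addr)

c = Clusters()
apply_cospend([{"inputs": ["a1", "a2"]},
               {"inputs": ["a2", "a3"]},
               {"inputs": ["b1"]}], c)
```

Because the heuristic only consumes the normalized transaction data from stage 2, any chain whose data reaches the pipeline gets clustering at no extra development cost.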

Stage 3 – Availability throughout our solutions

After the transformers described in stage 2 have abstracted away all the blockchain-specific information, they generate blockchain-agnostic datasets, such as transfers, clustering, and exposure. This data normalization is a requirement for stage 3, allowing the final datasets to flow seamlessly across our investigation, risk, and growth solutions.

Additionally, data normalization ensures a consistent UI/UX experience when interacting with Chainalysis products, regardless of whether the blockchain is UTXO-based or EVM-based, has a hierarchical account structure, or supports the minting of fungible tokens. Datasets remain universal across all blockchains.
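A normalized dataset of this kind can be pictured as a single record shape shared by every chain. The field names below are an illustrative assumption of what such a schema might contain, not the actual Chainalysis data model:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transfer:
    """Hypothetical sketch of one chain-agnostic transfer record."""
    chain: str      # e.g. "bitcoin", "ethereum"
    block: int
    tx_hash: str
    asset: str      # native asset symbol or token contract address
    sender: str
    receiver: str
    amount: int     # smallest denomination (satoshi, wei, ...)

# The same shape serves a UTXO chain and an account-based chain:
btc_t = Transfer("bitcoin", 780_000, "txid1", "BTC", "bc1q_src", "bc1q_dst", 50_000)
eth_t = Transfer("ethereum", 17_000_000, "0xabc", "ETH", "0xfrom", "0xto", 10**18)
```

Downstream products then query one schema instead of one schema per chain, which is what keeps the experience predictable across the portfolio.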

How our platform impacts customers 

With the launch of this next generation data platform, Chainalysis has begun rolling out support for new blockchains in quick succession, each with advanced features. BNB Smart Chain will be the first blockchain delivered through this platform, with more than three million tokens supported by our most advanced capabilities. After this, we will roll out Polygon, Avalanche, Arbitrum, and Optimism.

These capabilities will strengthen blockchain and token analysis by broadening data availability. The pace with which we incorporate these blockchains and implement asset coverage will demonstrate the robustness of our data platform and its ability to accelerate support of new blockchains in the future. 
