Inside LYS: Building A Robust Data Pipeline

In the vast, high-speed world of DeFi, data is king. Being able to process and analyze massive datasets comprehensively gives you a serious edge, whether that’s finding the best yield opportunities or avoiding unnecessary risks. At LYS, we’ve designed our data pipeline to do exactly that. We created a solution that ensures data flows efficiently, stays accurate, and is always ready for our AI models to make smart decisions that protect you.

Let’s take a look behind the scenes of the LYS data pipeline and break down each piece of our tech, showing how it all fits together and why it’s a big deal for risk management and yield optimization.

What Is A Data Pipeline?

If you’re new to data pipelines, imagine them as highways for your data. They are core components of any platform that harnesses large, complex datasets, responsible for preparing and processing data effectively. In DeFi, these highways are indispensable. Without them, platforms can drown in all the incoming data, leading to missed opportunities or even costly mistakes.

At LYS, we built our data pipeline to handle blockchain’s unique complexities (high-speed processing, high availability, and tons of data coming from all directions). Our design keeps everything well-oiled so that our AI can do its job: find yield and manage risk effectively.

How Does The LYS Data Pipeline Work?

The LYS Data Pipeline relies on a modular architecture that facilitates data extraction from the blockchain, subsequent processing through several services, and eventually feeding the data into specialized databases. These include PostgreSQL for structured historical storage, Neo4J for graph data, and Redis for in-memory, real-time access. This infrastructure supports real-time analytics, ensuring that data is aggregated, indexed, and analyzed within the time frame of a single blockchain block (typically 12-15 seconds).

  1. EVM Nodes

The data journey starts at the source, with the EVM nodes. These nodes act as our primary data gateways, fetching essential on-chain information such as block headers, transactions, and logs. This is where the Node Connector microservices come in: they connect directly to these blockchain nodes using JSON-RPC over WebSocket, receiving real-time feeds of on-chain activity. The Node Connectors listen for events such as newly minted blocks or ongoing mempool transactions, capturing the data immediately and feeding it into the pipeline with minimal latency. To ensure reliability, LYS connects to several nodes simultaneously, so the data stream remains uninterrupted even when individual nodes go down. This redundancy gives us multiple active sources of data, resilience against node failures, and a complete, uninterrupted view of the blockchain.
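To make that concrete, here’s a minimal sketch of what a Node Connector subscription can look like, assuming a standard Ethereum JSON-RPC WebSocket endpoint and Python’s websockets library. The endpoint URL and the handle_event hook are placeholders for illustration, not our production code:

```python
import asyncio
import json

import websockets  # third-party: pip install websockets

# Placeholder endpoint; a real deployment holds one connection per node.
WS_URL = "wss://example-ethereum-node/ws"


async def node_connector(ws_url: str) -> None:
    """Subscribe to new block headers and pending transactions over JSON-RPC."""
    async with websockets.connect(ws_url) as ws:
        # eth_subscribe is the standard JSON-RPC method for push notifications.
        await ws.send(json.dumps(
            {"jsonrpc": "2.0", "id": 1, "method": "eth_subscribe", "params": ["newHeads"]}
        ))
        await ws.send(json.dumps(
            {"jsonrpc": "2.0", "id": 2, "method": "eth_subscribe", "params": ["newPendingTransactions"]}
        ))
        async for raw in ws:
            msg = json.loads(raw)
            # Subscription pushes arrive as eth_subscription notifications.
            if msg.get("method") == "eth_subscription":
                payload = msg["params"]["result"]
                # Hand the payload to the rest of the pipeline with minimal delay.
                handle_event(payload)


def handle_event(payload) -> None:
    # Hypothetical downstream hook; in this sketch we simply print.
    print("ingested:", payload)


if __name__ == "__main__":
    asyncio.run(node_connector(WS_URL))
```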

  2. Block Indexer

Once the blockchain data is ingested, it is aggregated and indexed by the Block Indexer. The Block Indexer is essentially the organizing force of the data pipeline. It takes raw blockchain data and transforms it into a structured format that can be queried effectively further down the line. It maintains an in-memory database of the latest blocks (typically the last 100) for quick access, while also storing data in PostgreSQL for persistent, historical records, enabling sophisticated analysis and deep insight retrieval. This matters because PostgreSQL allows us to run complex queries, such as searching historical transaction records or analyzing the performance of yield opportunities over time. Alongside the structured storage, the pipeline uses Redis to hold the latest fetched data (blocks and mempool data). Because Redis is in-memory, this real-time information is available right away.
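As a rough illustration of that storage approach, the sketch below keeps a rolling in-memory window of recent blocks, mirrors the latest data into Redis, and persists everything to PostgreSQL. The connection settings and the blocks table schema are hypothetical, included only to show the idea:

```python
import json
from collections import deque

import psycopg2  # pip install psycopg2-binary
import redis     # pip install redis

# Hypothetical connection settings, for illustration only.
pg = psycopg2.connect("dbname=lys user=indexer password=secret host=localhost")
cache = redis.Redis(host="localhost", port=6379)

# Rolling in-memory window of the most recent blocks (the post mentions ~100).
recent_blocks: deque = deque(maxlen=100)


def index_block(block: dict) -> None:
    """Keep a block hot in memory and Redis, and persist it to PostgreSQL."""
    recent_blocks.append(block)

    # Redis holds the freshest data for real-time consumers.
    cache.set("latest_block", json.dumps(block))
    cache.set(f"block:{block['number']}", json.dumps(block))

    # PostgreSQL keeps the structured historical record for deeper queries.
    # The `blocks` table is a made-up schema for this sketch.
    with pg, pg.cursor() as cur:
        cur.execute(
            """
            INSERT INTO blocks (number, hash, parent_hash, data)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (number) DO UPDATE
                SET hash = EXCLUDED.hash,
                    parent_hash = EXCLUDED.parent_hash,
                    data = EXCLUDED.data
            """,
            (block["number"], block["hash"], block["parentHash"], json.dumps(block)),
        )
```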

Another key capability of the Block Indexer is managing blockchain reorganizations, or so-called re-orgs, which temporarily fork the blockchain into a multi-headed monster. The indexer identifies which chain is canonical and invalidates data from outdated or losing branches. This step is crucial for maintaining data consistency and ensuring the correctness of everything that follows.
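Here’s a deliberately simplified sketch of that idea: if a competing block arrives at a height we’ve already indexed, everything from that height upward on the old branch gets invalidated. The real Block Indexer tracks far more state; this only shows the core mechanism:

```python
from typing import Dict, List

# Simplified in-memory view: block number -> block header dict.
# A stand-in for the indexer's state, used only to illustrate re-org handling.
canonical: Dict[int, dict] = {}


def apply_block(block: dict) -> List[dict]:
    """Apply a new head block and return any blocks invalidated by a re-org."""
    number = block["number"]
    invalidated: List[dict] = []

    existing = canonical.get(number)
    if existing is not None and existing["hash"] != block["hash"]:
        # A competing block at a height we already indexed: the chain forked.
        # Everything from this height upward on the old branch is now stale.
        for n in sorted(k for k in canonical if k >= number):
            invalidated.append(canonical.pop(n))

    canonical[number] = block
    return invalidated


# Usage: the caller would mark the returned blocks as orphaned in PostgreSQL
# and evict them from Redis so downstream consumers never see stale data.
```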

  3. Real-Time Data Streaming

Kafka acts as the real-time streaming backbone of our infrastructure. Everything in the pipeline relies on the flow of information between components, and Kafka's job is to provide this continuous, asynchronous communication from data producers to consumers, creating a decoupled and scalable system in the process. Once the Block Indexer has cleaned up and validated the data, it gets streamed into the next layer of services, such as the BlockProcessor. Kafka enables each component to scale independently with its workload, without direct dependencies. It provides a reliable, distributed framework that ensures blockchain data (blocks, transactions, and receipts) is delivered in real time to the components that need it most, ready for immediate processing or deeper analysis.
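A minimal sketch of the producer side might look like the following, assuming the kafka-python client and hypothetical topic names (indexed-blocks, indexed-transactions):

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker address; a real cluster would list several brokers.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def publish_indexed_block(block: dict) -> None:
    """Stream a validated block to downstream consumers such as the BlockProcessor."""
    producer.send("indexed-blocks", value=block)

    # Transactions can go to their own topic so each consumer group
    # scales independently of the others.
    for tx in block.get("transactions", []):
        producer.send("indexed-transactions", value=tx)

    producer.flush()  # wait until the broker has acknowledged the batch
```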

  4. BlockProcessor

Once data is streamed through Kafka, it goes straight into the processing stage. Here’s where the BlockProcessor kicks in. It takes the incoming blockchain data (confirmed blocks) and breaks it down into detailed components, such as transactions, smart contract interactions, and event logs. This is where all the valuable metadata gets extracted, which is important for the downstream AI models that handle risk analysis and yield optimization.

The BlockProcessor works through several key steps. First, Transaction Decomposition takes apart each transaction, focusing on details like sender, receiver, and value. Then comes Smart Contract Interaction Analysis, which looks at which contracts were triggered and which functions they called. The Bytecode Breakdown goes even deeper, decoding what the contract is actually doing.

Next is Data Enrichment and Categorization, which adds context by labeling transactions, such as identifying liquidity moves or governance actions, making the data actionable. Finally, the enriched insights are published back through Kafka so that every component that needs them receives them in real time (including end users), especially the AI models that assess risks and optimize strategies.
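To give a feel for the decomposition and labeling steps, here’s a toy version of transaction decomposition: it pulls out sender, receiver, and value, checks whether the call targets a contract, and attaches a coarse category based on the 4-byte function selector. The selector table is illustrative; the real enrichment uses full ABIs and many more signals:

```python
from typing import Optional

# Illustrative 4-byte selectors; a coarse stand-in for real categorization.
KNOWN_SELECTORS = {
    "0xa9059cbb": "erc20_transfer",  # transfer(address,uint256)
    "0x095ea7b3": "erc20_approve",   # approve(address,uint256)
    "0xe8e33700": "liquidity_add",   # addLiquidity(...) on Uniswap V2-style routers
}


def decompose_transaction(tx: dict) -> dict:
    """Break a raw transaction into the fields downstream models care about."""
    input_data: str = tx.get("input", "0x")
    selector: Optional[str] = input_data[:10] if len(input_data) >= 10 else None

    return {
        "hash": tx["hash"],
        "sender": tx["from"],
        "receiver": tx.get("to"),  # None for contract creation
        "value_wei": int(tx["value"], 16) if isinstance(tx["value"], str) else tx["value"],
        "is_contract_call": input_data != "0x",
        "selector": selector,
        # Enrichment step: attach a coarse label that makes the data actionable.
        "category": KNOWN_SELECTORS.get(selector, "unlabeled"),
    }
```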

  5. Real-Time Risk Assessment

With the data provided by the BlockProcessor, the next phase is risk analysis. This data feeds directly into our custom AI models, which are built for analyzing on-chain activity to identify and calculate risk scores in real time. These models use large-scale datasets from both on-chain and off-chain sources, improved by insights gathered from extensive Exploratory Data Analysis (EDA) and automated data labeling methods.

The AI-driven Real-Time Risk Assessment focuses on detecting unusual or risky activities by evaluating a range of on-chain metrics, including transaction sizes, address behaviors, and historical patterns. The integration of data augmentation techniques improves prediction accuracy, while semi-supervised methods like Generative Adversarial Networks (GANs) allow the system to learn from both labeled and unlabeled data, which ensures continuous adaptation.

By assigning risk scores to on-chain interactions and recalculating them as new data streams in, we can help ensure up-to-the-minute risk management. This ability to adapt in real time helps LYS safeguard user assets effectively, providing dynamic risk-reward optimization, even in a highly volatile DeFi environment.
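As a highly simplified illustration of that recalculation loop, the sketch below updates a per-address risk score every time an enriched transaction arrives. The feature choices, weights, and smoothing factor are stand-ins for the trained models described above, not the models themselves:

```python
from collections import defaultdict
from typing import Dict, List


def risk_model(features: List[float]) -> float:
    """Stand-in for the trained risk model: maps a feature vector to [0, 1]."""
    weights = [0.4, 0.35, 0.25]  # hypothetical weights, purely for illustration
    score = sum(w * f for w, f in zip(weights, features))
    return max(0.0, min(1.0, score))


# Latest risk score per address, recalculated as new data streams in.
risk_scores: Dict[str, float] = defaultdict(float)


def update_risk(enriched_tx: dict) -> float:
    """Recompute the sender's risk score from a newly enriched transaction."""
    features = [
        min(enriched_tx["value_wei"] / 1e21, 1.0),                # transaction size (scaled)
        1.0 if enriched_tx["category"] == "unlabeled" else 0.0,   # unfamiliar interaction
        risk_scores[enriched_tx["sender"]],                       # historical pattern (prior score)
    ]
    score = risk_model(features)
    # Exponential smoothing so a single transaction does not dominate the history.
    risk_scores[enriched_tx["sender"]] = 0.8 * risk_scores[enriched_tx["sender"]] + 0.2 * score
    return risk_scores[enriched_tx["sender"]]
```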

  6. AI Pathfinder

Once risk analysis has finished, it’s time to optimize yield opportunities with the AI Pathfinder. Think of it like a sophisticated recommendation engine, using enriched data and calculated risk scores to chart optimal yield paths across various DeFi protocols. By utilizing Neo4J for graph-based analysis, the Pathfinder evaluates different potential deposit routes, analyzing how nodes (protocols) are connected and calculating cumulative risks and APY.

It combines multiple data sources (market conditions, historical performance, and real-time risk metrics) to generate actionable insights. Using methods like gradient boosting and decision trees for regression analysis, the Pathfinder identifies which protocols or liquidity pools are most promising for increasing returns while managing risk exposure.
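A toy version of that graph traversal, using the official Neo4j Python driver and a hypothetical schema of Protocol nodes linked by ROUTES_TO relationships carrying risk and apy properties, might look like this (summing per-hop risk and APY is a simplification of the real scoring):

```python
from neo4j import GraphDatabase  # pip install neo4j

# Hypothetical connection details, for illustration only.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH path = (start:Protocol {name: $entry})-[:ROUTES_TO*1..3]->(end:Protocol)
WITH path,
     reduce(risk = 0.0, r IN relationships(path) | risk + r.risk) AS total_risk,
     reduce(apy  = 0.0, r IN relationships(path) | apy  + r.apy)  AS total_apy
WHERE total_risk <= $max_risk
RETURN [n IN nodes(path) | n.name] AS route, total_risk, total_apy
ORDER BY total_apy DESC
LIMIT 5
"""


def best_routes(entry: str, max_risk: float):
    """Return the highest-APY deposit routes whose cumulative risk stays within budget."""
    with driver.session() as session:
        result = session.run(CYPHER, entry=entry, max_risk=max_risk)
        return [record.data() for record in result]


# Example usage: print(best_routes("Aave", max_risk=0.3))
```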

By continuously integrating real-time updates and adapting to emerging patterns, the AI Pathfinder doesn’t just react; it anticipates, helping users stay ahead of market fluctuations.

Security, Redundancy, and High Availability

At LYS, maintaining the integrity and availability of data is a top priority, especially in the high-stakes environment of DeFi. Data is encrypted both in transit and at rest, protecting all sensitive information, including transactions and user data, from unauthorized access or breaches.

Redundancy is a core design principle in our architecture. We never rely on a single source of data, ensuring that any point of failure won’t bring the system down. LYS connects to multiple EVM nodes and uses parallel Node Connectors, so even if individual nodes experience disruptions, the data stream remains uninterrupted.
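In code, that redundancy can be as simple as running one connector per node, restarting any that drop, and deduplicating blocks by hash so downstream services see each block exactly once. The endpoints below are placeholders, and the connector argument refers to a subscription coroutine like the one sketched earlier in this post:

```python
import asyncio
from typing import Awaitable, Callable, Set

# Placeholder endpoints; the real deployment uses geographically distributed nodes.
NODE_ENDPOINTS = [
    "wss://node-a.example/ws",
    "wss://node-b.example/ws",
    "wss://node-c.example/ws",
]

seen_hashes: Set[str] = set()


def dedupe(block: dict) -> bool:
    """Several nodes will report the same block; only the first copy moves on."""
    if block["hash"] in seen_hashes:
        return False
    seen_hashes.add(block["hash"])
    return True


async def resilient(connector: Callable[[str], Awaitable[None]], url: str) -> None:
    """Keep one connector alive per node, reconnecting with a short backoff."""
    while True:
        try:
            await connector(url)
        except Exception:
            await asyncio.sleep(5)  # back off, then reconnect to the same node


async def run_all(connector: Callable[[str], Awaitable[None]]) -> None:
    """Run every connector in parallel so no single node is a point of failure."""
    await asyncio.gather(*(resilient(connector, url) for url in NODE_ENDPOINTS))


# Usage (with the node_connector coroutine from the earlier sketch):
#   asyncio.run(run_all(node_connector))
```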

We ensure high availability by implementing real-time monitoring and anomaly detection. This approach continuously tracks unusual activity, allowing LYS to detect and mitigate risks as they arise. To make sure there's no data loss, we use geographically distributed nodes and redundancy measures, preventing service disruptions even during peak periods or stress scenarios.

And, as a standard practice, LYS undergoes regular and rigorous security audits to proactively identify vulnerabilities and implement necessary fixes. These proactive audits help ensure that the protocol remains secure and compliant with evolving best practices.

Cross-Chain Scalability and Expansion

The LYS pipeline isn’t limited to just one blockchain. Right now, our focus is on Ethereum, but the whole architecture is built with cross-chain capabilities in mind. Thanks to our modular and scalable setup (Node Connectors, the Block Indexer, and so on), we can bring in new chains easily.

This means that as more networks appear and DeFi keeps growing across them, the LYS Protocol can adapt without much hassle. Our goal is to offer comprehensive cross-chain analytics, giving users a full picture of what’s happening everywhere. With cross-chain integration, we can make sure that no matter how multi-chain DeFi gets, we can still deliver the best insights and opportunities to our users.

Summary

LYS Protocol's custom-built data pipeline is all about processing data efficiently and providing real-time analytics, even at massive scale. This pipeline is really the engine that powers everything, making sure we can offer users top-tier yield optimization and risk management. By tying everything together with AI and machine learning, we can ensure that every decision is backed by solid, up-to-the-minute data.

What also makes this pipeline stand out is how well it handles real-time data streaming: users get fresh insights just as events happen, which is important for making informed decisions in the fast-paced world of DeFi. We've built everything with security and redundancy baked in, so even if things go wrong, whether it’s node failures or unexpected network hiccups, everything keeps running smoothly.

The modular architecture means we can easily scale when needed, adapting to increased data loads or adding support for more networks without any headaches. By integrating AI throughout and focusing on scalable, secure architecture, LYS Protocol isn't just keeping up with DeFi, it's making sure users have the tools they need to stay ahead, optimize their returns, and minimize risks… all in one place.