Synthetic Data Generation for AI-based O-RAN Apps
Introduction
The O-RAN architecture comes with native support for Artificial Intelligence/Machine Learning (AI/ML) [1], with non-real-time (non-RT) and near-RT control loops targeting rApps and xApps, respectively (see our previous post ML Framework in O-RAN [2]). This idea is further extended by the AI-RAN ALLIANCE, which incorporates AI/ML algorithms directly into the Radio Access Network (RAN) following the AI-for-RAN concept [3]. However, even the most sophisticated xApp/rApp or RAN-internal algorithm based on AI will perform poorly when trained on low-quality data [4]. This was a motivation behind the AI-RAN ALLIANCE’s launch of the Data-for-AI (D4AI) initiative, announced during the Mobile World Congress (MWC) in Barcelona, 2025 (see the press release). One of the topics highlighted by D4AI is how to build unified, exchangeable, and reliable datasets, which can be used to train, evaluate, and compare AI/ML-based algorithms developed by multiple vendors, or to benchmark GPU/CPU hardware. This leads us to the topic of synthetic data generation.
This blog post discusses synthetic data generation for AI-based O-RAN xApps/rApps. We start by outlining the need for such data in the training and testing of O-RAN x/rApps. This is followed by the two postprocessing steps that precede synthetic data generation, namely, data alignment and data preparation. Finally, some opportunities and challenges are discussed.
Why do we need synthetic data?

Suppose that an xApp/rApp developer wants to train algorithms on real-world datasets. A straightforward approach is to use datasets captured either from a real network or from an advanced testbed. Usually, such data contains artifacts or missing values (see Figure 1) and requires clean-up. After that, the developer has a single dataset representing certain network conditions, tailored to a specific configuration, e.g., spectrum band, bandwidth, or antennas. Moreover, the network runs a particular set of algorithms, e.g., a radio resource scheduling algorithm or a user-to-cell association algorithm (traffic steering/load balancing), and some data is directly bound to their operation, e.g., Physical Resource Block (PRB) utilization resulting from a Round Robin or Proportional Fair scheduling scheme.

This is not enough for the reliable development and testing of innovative AI-based O-RAN xApps/rApps. To deliver a high-quality product, the xApp/rApp developer must test them under different network conditions and validate their interoperability with other Radio Resource Management (RRM) algorithms. For this purpose, the training and test datasets should be adjustable and configurable in terms of, e.g., cell configuration, propagation environment, traffic, or user mobility patterns. The datasets should also allow the creation of different representations to be used by the various types of algorithms operating in the network. Achieving this directly with network or testbed data would require running many drive tests or trials, and a significant amount of storage would be needed to keep the results. One possible solution to this challenge is the concept of synthetic data generation.
The key idea behind synthetic data generation is to feed the existing real-world datasets into a dedicated framework, based either on a Network Digital Twin (NDT) [5] or on a Generative AI (GenAI) model, as depicted in Figure 2. The framework can be flexibly configured to output multiple realizations of synthetic data, which the xApp/rApp developer can use to train algorithms under various conditions. The crucial requirement is that the synthetic data must preserve the statistical properties of the original dataset collected from the network.
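As a toy illustration of the "same statistical properties" requirement, the sketch below fits only the mean and standard deviation of a measured KPI and samples synthetic values from a Gaussian with those moments. This is a deliberately minimal stand-in for an NDT or GenAI generator; real frameworks capture much richer structure (temporal correlation, spatial patterns), and the throughput values here are illustrative, not from any real network.

```python
import random
import statistics

def generate_synthetic(real_samples, n, rng=random.Random(42)):
    """Draw n synthetic KPI samples matching the mean and standard
    deviation of the real measurements (Gaussian assumption)."""
    mu = statistics.mean(real_samples)
    sigma = statistics.stdev(real_samples)
    # Clip at zero: a throughput KPI cannot be negative
    return [max(0.0, rng.gauss(mu, sigma)) for _ in range(n)]

# "Real" Average DL UE throughput measurements in Mbit/s (illustrative)
real = [42.1, 38.7, 45.3, 40.2, 39.8, 44.0, 41.5, 43.2, 37.9, 42.8]

synthetic = generate_synthetic(real, 10_000)

# The synthetic set reproduces the long-term statistics of the input
print(statistics.mean(real), statistics.mean(synthetic))
```

With many realizations drawn from the fitted model, the synthetic mean and standard deviation converge to those of the measured series, which is exactly the property the generated data must retain.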

Postprocessing Step 1: Data Alignment
To produce synthetic data for xApp/rApp developers, a few postprocessing steps must be completed so that the collected data can be consumed by the NDT or GenAI model. The first step after collecting the data is the so-called data alignment. It addresses the fact that data obtained either from a testbed or a real network is usually not a single unified dataset, but is composed of a few subsets. For example:
- Site Configuration: contains the information about the static configuration of the sites. This can include:
- cells supported
- Radio Access Technology (4G, 5G)
- carrier frequencies
- bandwidth
- antenna type
- antenna configuration, e.g., beamwidth, tilt, and orientation, MIMO configuration
- Coverage: the maps of the coverage related to the particular sites and cells. These can be delivered, e.g., as shapefiles.
- Time Series: containing the network Key Performance Indicators (KPIs) or, more specifically, Performance Metrics (PMs). The PMs are usually time series of counters compliant with 3GPP definitions [6]. A few representative examples of PMs are:
- UL/DL Total PRB Usage
- Average UL/DL UE throughput
- Mean number of RRC Connections
- Physical Network Function (PNF) Power Consumption
(Note: These PMs can also be reported per sub-counter, e.g., per cell, per slice, or per Quality of Service (QoS) indicator such as 5QI.)
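Assuming a simple in-memory representation, the subsets above might be modeled as follows. All field and PM names here are illustrative for the sketch, not a normative 3GPP data model:

```python
from dataclasses import dataclass

@dataclass
class CellConfig:
    """Static per-cell configuration extracted from the
    Site Configuration subset (illustrative fields)."""
    cell_id: str
    rat: str                # Radio Access Technology: "4G" or "5G"
    carrier_mhz: float      # carrier frequency
    bandwidth_mhz: float
    antenna_type: str
    tilt_deg: float
    mimo_layers: int

# One Time Series subset entry: counters keyed by (cell, PM name);
# the second key shows a per-5QI sub-counter breakdown
pm_series = {
    ("cell-A", "DL.TotalPrbUsage"): [61.2, 58.4, 70.1],
    ("cell-A", "DRB.UEThpDl.5QI9"): [38.5, 41.2, 36.7],
}

cell_a = CellConfig("cell-A", "5G", 3500.0, 100.0, "AAS", 6.0, 4)
```

The coverage subset would typically arrive as shapefiles and be loaded separately; for alignment purposes, it can be keyed by the same `cell_id` as the configuration and PM records.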
The challenge in data alignment is to process all of these subsets into a unified representation of the data, which can then be fed, e.g., into the NDT. Example steps are to:
- Extract network configuration parameters and create a unified configuration file for each cell to recreate the network topology in NDT.
- Match the coverage files with the cell configuration and identify the neighbor relations, e.g., which cells overlap, which cells are potential candidates for load balancing operations, or which cells serve as coverage or capacity layers.
- Identify the KPIs/PMs that are available irrespective of the particular network configuration. Each should be analyzed individually, e.g.:
- The UL/DL Total PRB Usage is probably not a good candidate because it highly depends on the underlying algorithms, like the scheduler or traffic steering.
- The Average UL/DL UE throughput in gNB, together with the mean number of RRC Connections, can be aggregated over multiple cells and carriers to reflect a spatial characteristic of traffic. As such, it can be potentially used as a base to create synthetic traffic, e.g., to be handled by AI-based xApps/rApps in the NDT environment.
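A minimal sketch of the neighbor-relation step above, assuming cell coverage is simplified to axis-aligned bounding boxes. A real pipeline would operate on actual coverage polygons loaded from the shapefiles (e.g., with a GIS library) rather than boxes, but the overlap logic is analogous:

```python
def boxes_overlap(a, b):
    """a, b: coverage bounding boxes as (min_x, min_y, max_x, max_y)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or
                a[3] <= b[1] or b[3] <= a[1])

def neighbor_relations(coverage):
    """coverage: {cell_id: bbox}. Returns pairs of cells with
    overlapping coverage, i.e., candidates for load balancing
    in the recreated NDT topology."""
    cells = sorted(coverage)
    return [(a, b) for i, a in enumerate(cells) for b in cells[i + 1:]
            if boxes_overlap(coverage[a], coverage[b])]

coverage = {
    "cell-A": (0, 0, 10, 10),
    "cell-B": (8, 0, 18, 10),   # overlaps cell-A
    "cell-C": (30, 0, 40, 10),  # isolated
}
print(neighbor_relations(coverage))  # [('cell-A', 'cell-B')]
```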
Postprocessing Step 2: Data Preparation
After the input data has been aligned across the provided subsets and the KPIs/PMs of interest have been identified, the next step is to perform data preparation for each KPI/PM. As already mentioned, the provided KPIs/PMs are sometimes corrupted (see Figure 1). Therefore, a few steps must be followed before they can be used for synthetic data generation for AI-based O-RAN xApps/rApps. This involves state-of-the-art steps like data clean-up, averaging, and normalization, as depicted in Figure 3. After the data has been cleaned up, the crucial point is to extract the underlying statistical properties of the selected PMs, like Average UL/DL UE throughput in gNB or Mean number of RRC Connections, to build a generalized model which can be fed into the NDT. Alternatively, the prepared KPIs/PMs can be used to train a GenAI synthetic data generation tool. In both cases, the models used for synthetic data generation should reflect the same long-term statistical properties as the raw measurement data, while allowing flexible configuration of datasets for the various evaluation scenarios supporting the development and testing of AI-based O-RAN xApps/rApps.
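The preparation steps just described might be sketched for a single PM time series as below, using pandas. The window size, scaling choice, and example values are illustrative assumptions; an actual pipeline would pick them per PM and per reporting granularity:

```python
import numpy as np
import pandas as pd

def prepare_pm(series: pd.Series):
    """Clean-up, averaging, and normalization of one PM time series,
    plus extraction of its long-term statistics (illustrative)."""
    # 1. Clean-up: fill missing samples by linear interpolation
    clean = series.interpolate(limit_direction="both")
    # 2. Averaging: smooth short-term fluctuations with a rolling mean
    smooth = clean.rolling(window=3, min_periods=1).mean()
    # 3. Normalization: min-max scale to [0, 1]
    norm = (smooth - smooth.min()) / (smooth.max() - smooth.min())
    # 4. Statistical properties to feed the NDT / GenAI model
    stats = {"mean": float(clean.mean()), "std": float(clean.std())}
    return norm, stats

# Mean number of RRC Connections with two corrupted (missing) samples
raw = pd.Series([120.0, np.nan, 131.0, 127.0, np.nan, 140.0])
norm, stats = prepare_pm(raw)
```

The returned `stats` dictionary is a stand-in for the "generalized model" mentioned above; in practice, richer descriptors (e.g., daily traffic profiles or fitted distributions) would be extracted per PM.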

Data alignment, together with data preparation, constitutes the postprocessing block from Figure 2. To generate the synthetic data itself, the aligned and prepared datasets are then fed to the NDT or the GenAI model. Both approaches have their pros and cons, which we will discuss in detail in the next blog post.
Conclusions
Synthetic data generation is expected to play an important role in developing future AI-based automated networks. It enables the creation of various realistic network topologies and configurations, without the need to collect the data directly from the network or a testbed. However, the synthetic data must be based on real-world KPIs and reflect their long-term statistical properties. The key enabler for its generation is input from the live network, which requires cooperation with a Mobile Network Operator (MNO). A proper automated methodology should be applied for the alignment of data from multiple subsets and for data preparation, covering clean-up, averaging, and normalization, as well as the extraction of the data's statistical properties. This can be challenging, as the provided information can differ significantly between MNOs: e.g., coverage maps might not be provided, some site configurations can be expressed using different parameters, and, most importantly, the PMs can support different subsets of counters. Moreover, the PMs’ time granularity, or support for per-slice or per-QoS-Flow sub-counters, may differ. Finally, the extracted unified statistical models of network KPIs, augmented with the network topology and configuration, can be fed to the NDT or used to train the GenAI model.
In the upcoming post, we will discuss how, based on the aligned and prepared real-world data, the synthetic datasets can be generated using either Digital Twin or GenAI.
References
- “Artificial Intelligence (AI) and Its Applications”, online: https://rimedolabs.com/blog/artificial-intelligence-ai-and-its-applications/
- “ML Framework in O-RAN”, online: https://rimedolabs.com/blog/ml-framework-in-o-ran/
- AI-RAN ALLIANCE, “Vision and Mission White Paper”, online: https://ai-ran.org/wp-content/uploads/2024/12/AI-RAN_Alliance_Whitepaper.pdf
- L. Bonati, S. D’Oro, M. Polese, S. Basagni, and T. Melodia, “Intelligence and Learning in O-RAN for Data-Driven NextG Cellular Networks,” IEEE Communications Magazine, vol. 59, no. 10, pp. 21-27, October 2021
- “Digital Twin – What Is It and How Can It Affect Future Networks”, online: https://rimedolabs.com/blog/digital-twin-what-is-it-and-how-can-it-affect-future-networks/
- 3GPP, TS 28.552, “Management and orchestration; 5G performance measurements”, v19.4.0, June 2025
Author Bio
Marcin Hoffmann is a Technical Solution Manager at Rimedo Labs, working on O-RAN software development solutions and R&D projects covering energy savings, traffic steering, and massive MIMO. Marcin is a Graduate Student Member of IEEE and received the M.Sc. degree (Hons.) in electronics and telecommunication from Poznań University of Technology in 2019, where he is currently pursuing a Ph.D. degree with the Institute of Radiocommunications. He has been involved in many national and international research projects. His research interests include the use of machine learning and location-dependent information for network management. He has coauthored many scientific articles published in top journals such as IEEE Journal on Selected Areas in Communications, IEEE Transactions on Intelligent Transportation Systems, IEEE Communications Magazine, and IEEE Access.