Considerations in Utilizing Synthetic Data for AI/ML
The use of synthetic data is expected to grow in the coming years, according to research from IT analyst firms such as Gartner.
What is synthetic data?
Synthetic data is artificially generated data that mimics real-world data. Synthetic datasets can be created through various methods, including computer simulations, statistical methods, and algorithmic generation (such as generative adversarial networks (GANs) or large language models (LLMs)).
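As a minimal illustration of the statistical-methods approach (the dataset and all numeric values here are invented for the sketch, not taken from the text above), one can fit a simple distribution to real data and then sample new synthetic records from the fitted distribution:

```python
import numpy as np

# Hypothetical "real" dataset: 200 samples of two correlated features
# (say, height in cm and weight in kg); purely illustrative values.
rng = np.random.default_rng(42)
real = rng.multivariate_normal(mean=[170.0, 70.0],
                               cov=[[80.0, 40.0], [40.0, 60.0]],
                               size=200)

# Statistical generation: estimate the mean and covariance of the real
# data, then sample synthetic records from the fitted Gaussian.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=500)

print(synthetic.shape)  # (500, 2)
```

A Gaussian fit is the simplest possible generator; real tabular data usually calls for richer models (copulas, GANs, or LLM-based generators), but the fit-then-sample pattern is the same.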
The use and role of synthetic data in machine learning are generally well understood:
- Preserve data privacy: When using real data is not feasible due to privacy concerns, synthetic data can be a viable alternative that enables model training and experimentation in a "lab" environment.
- Expand training data: Synthetic data can augment real-world datasets, providing additional examples to improve model training. This is particularly useful when real data is scarce, expensive to obtain, or sensitive (as in the privacy-preservation case).
- Expand testing data: Synthetic data can be added to the unseen dataset used to test machine learning models. By generating data with a defined set of characteristics and an expected distribution, we can augment testing data to evaluate how well models perform under different conditions. However, if the synthetic dataset is too similar to the training data, we risk inflated performance estimates due to data leakage.
- Mitigate bias: Where real-world data are biased and additional real-world data cannot be obtained, synthetic data can be added to help mitigate bias when training machine learning models. This requires that humans understand the bias that exists in the data in the first place.
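To make the augmentation and bias-mitigation points above concrete, here is a hedged sketch (the data and the `oversample_with_noise` helper are hypothetical, not from the text) of rebalancing an imbalanced training set by jittering real minority-class samples, a simplified cousin of techniques such as SMOTE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced training set: 90 samples of class 0,
# only 10 samples of class 1, each with 3 numeric features.
X_major = rng.normal(loc=0.0, scale=1.0, size=(90, 3))
X_minor = rng.normal(loc=2.0, scale=1.0, size=(10, 3))

def oversample_with_noise(X, n_new, noise_scale=0.1, rng=rng):
    """Create synthetic minority samples by adding small Gaussian
    noise to randomly chosen real samples."""
    idx = rng.integers(0, len(X), size=n_new)
    return X[idx] + rng.normal(scale=noise_scale, size=(n_new, X.shape[1]))

# Generate 80 synthetic minority samples so both classes have 90.
X_synth = oversample_with_noise(X_minor, n_new=80)
X_balanced = np.vstack([X_major, X_minor, X_synth])
y_balanced = np.array([0] * 90 + [1] * (10 + 80))

print(X_balanced.shape, np.bincount(y_balanced))  # (180, 3) [90 90]
```

Note the caveat from the list above: this only helps if the under-representation itself has been identified by a human; naive oversampling can just as easily amplify an unnoticed bias.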
While synthetic data can be valuable, challenges remain in its evaluation. The following are questions we should ask and strive to answer when using synthetic data:
- How good is the synthetic data, and can we actually use it (to add value rather than inject potential harm)?
- How do we evaluate the quality and utility of synthetic datasets?
- How do we create enough diversity and variability in synthetic data while still aligning to its original purpose and value in augmenting real-world data?
- How do we effectively evaluate the characteristics and representation of the synthetic data to ensure that the mixed data (real and synthetic) provides the right distribution and bias mitigation in training/test/experiment data for machine learning and AI?
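One concrete starting point for the distribution question above is a two-sample statistical test comparing a real feature against its synthetic counterpart. The sketch below (all data invented; assumes SciPy is available) uses the Kolmogorov-Smirnov test, where a small p-value suggests the synthetic distribution diverges from the real one:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)

# Hypothetical real and synthetic samples of a single numeric feature.
real = rng.normal(loc=50.0, scale=10.0, size=1000)
synthetic_good = rng.normal(loc=50.0, scale=10.0, size=1000)   # same distribution
synthetic_bad = rng.normal(loc=65.0, scale=10.0, size=1000)    # shifted mean

# Two-sample Kolmogorov-Smirnov test: the statistic measures the largest
# gap between the empirical CDFs of the two samples.
for name, synth in [("good", synthetic_good), ("bad", synthetic_bad)]:
    stat, p = ks_2samp(real, synth)
    print(f"{name}: KS statistic={stat:.3f}, p-value={p:.3g}")
```

Per-feature tests like this do not capture cross-feature correlations or downstream utility, so in practice they would be combined with multivariate checks and "train on synthetic, test on real" evaluations.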
Synthetic data can be a powerful component in the machine learning "toolkit," but opportunities remain to deepen our understanding of this area of data intelligence and enablement for AI/ML.