Synthetic Data – Are You Real?

Forward-looking marketers are probably reading a lot about synthetic data, especially as a way to train AI models. Synthetic data is just what it sounds like – artificially generated data rather than data collected from the real world. It’s increasingly used in modeling to reach scale quickly and to work around privacy concerns. Keep in mind that it’s typically blended with real-world data.

For example, let’s say you are PepsiCo and you want focus group feedback on a new product. Organizing focus groups internationally can be really time-consuming and expensive. Enter synthetic data. A cohort of real focus group respondents is queried, and then an AI model uses their answers to simulate responses from the rest of the world. Add the fact that the cohort can be queried online, and the results come back really fast.

The challenge in using synthetic data – and some AI, for that matter – is the potential for hallucinations and confabulations. If the primary (authentic) dataset is not large enough, the simulated version – where the model combines datasets and components and morphs them into new insights – can become a bit Frankenstein-ish. I think the official term is “suboptimal MMM output.”

We see a future for synthetic data at enterprise brands that have large volumes of training data. But for nimble, scrappy marketers, we recommend sticking with the real stuff.