Evaluating Gen AI Applications - Part I
When I worked in mergers & acquisitions way back, one of the recurring tasks was generating project code names to help preserve the confidentiality of these highly sensitive projects. I remember that we went through all kinds of names, drawing inspiration from mountains, parks, birds, and cars. Tonight, for fun and out of curiosity, I gave ChatGPT a prompt to help me find some names.
"Generate fun but appropriate investment project code names around the theme of wine but not name of wine. Start with Project and give me one word names"
Here is what I received:
- Project Grape
- Project Vinify
- Project Cellar
- Project Terroir
- Project Cork
- Project Vino
- Project Bubbly
- Project Meritage
- Project Terroir
- Project Vinum
- Project Magnum
- Project Vinify
- Project Decant
- Project Cabernet
- Project Terroir
What do you think? How would you evaluate the answers?
I asked ChatGPT to do a self-evaluation. 😁
"Evaluate using relevance"
- Project Grape - Highly Relevant
- Project Vinify - Highly Relevant
- Project Cellar - Relevant
- Project Terroir - Highly Relevant
- Project Cork - Relevant
- Project Vino - Highly Relevant
- Project Bubbly - Relevant
- Project Meritage - Relevant
- Project Terroir - Highly Relevant
- Project Vinum - Highly Relevant
- Project Magnum - Relevant
- Project Vinify - Highly Relevant
- Project Decant - Relevant
- Project Cabernet - Relevant
- Project Terroir - Highly Relevant
Without being asked, it also provided a justification for its evaluation:
The relevance of the code names to the wine theme varies. Some, like "Project Grape," "Project Vinify," "Project Terroir," "Project Vino," and "Project Vinum," are highly relevant as they directly relate to wine-related terms or concepts. Others, such as "Project Cellar," "Project Cork," "Project Bubbly," and "Project Meritage," are relevant but may not be as specific to wine. "Project Magnum" and "Project Decant" are somewhat relevant due to their connection to wine-related activities, and "Project Cabernet" is relevant because it's a wine varietal. Overall, the list contains a mix of highly relevant and relevant names to the wine theme.
Evaluation is one of the big challenges in generative AI. For a fun, creative exercise like the one above, formal and rigorous evaluation metrics are unnecessary: I can pick whatever I like in the moment, as long as the name is appropriate and does not overlap with an existing code name.
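Even in a casual exercise like this one, a few of the criteria are mechanical enough to check in code: whether names repeat, whether they follow the "Project" plus one word format the prompt asked for, and whether they clash with a code name already in use. Here is a minimal Python sketch of those checks. The `existing` set and the `excluded_terms` set are assumptions made up for illustration (the only excluded term is "Cabernet", which the model's own justification calls a wine varietal), and `check_names` is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Names copied from the ChatGPT response above, in the order received.
generated = [
    "Project Grape", "Project Vinify", "Project Cellar", "Project Terroir",
    "Project Cork", "Project Vino", "Project Bubbly", "Project Meritage",
    "Project Terroir", "Project Vinum", "Project Magnum", "Project Vinify",
    "Project Decant", "Project Cabernet", "Project Terroir",
]

# Assumption: code names already in use (invented for this example).
existing = {"Project Cork"}

# Assumption: words treated as "name of wine", which the prompt excluded.
excluded_terms = {"Cabernet"}


def check_names(names, existing_names, excluded):
    """Run simple, rule-based checks on a list of generated code names."""
    counts = Counter(names)
    unique = list(dict.fromkeys(names))  # de-duplicate, keep first-seen order
    return {
        "unique": unique,
        # Names the model returned more than once, with their counts.
        "repeated": {name: n for name, n in counts.items() if n > 1},
        # Names that are not exactly "Project" followed by one word.
        "bad_format": [
            name for name in unique
            if not (name.startswith("Project ") and len(name.split()) == 2)
        ],
        # Names that clash with a code name already in use.
        "taken": [name for name in unique if name in existing_names],
        # Names whose second word is on the excluded list.
        "excluded": [name for name in unique if name.split()[-1] in excluded],
    }


report = check_names(generated, existing, excluded_terms)
for key, value in report.items():
    print(f"{key}: {value}")
```

Checks like these cover only the objective part of the brief. Whether a name is "fun but appropriate" still needs a human (or at least a more careful evaluator) to decide.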
As organizations experiment with and implement generative AI applications that are even more non-deterministic and varied (and less explainable in how a particular instance of a "generation engine" works), we have an opportunity to rethink and innovate on how we evaluate these applications going forward. I am excited to see opportunities in this space.