GenAI Models Quality Evaluations: Text and Image
Evaluating Generative AI (GenAI) models is challenging due to their complex and diverse outputs across different modalities (text, image, and multimodal generation). Unlike traditional supervised learning models, where direct comparison with ground truth labels is feasible, GenAI models often require implicit evaluation techniques to assess quality, coherence, and usability.
Automatic Evaluation Metrics
1. ML Metrics for Text Generation
Evaluating text-based GenAI models involves multiple automatic metrics that measure fluency, coherence, and factual accuracy.
a. BLEU Score (Bilingual Evaluation Understudy)
- Type: N-gram-based similarity measure.
- Usage: Commonly used in machine translation.
- Intuition: BLEU evaluates how similar the generated text is to the reference text by comparing n-grams (contiguous word sequences). The more overlap in n-grams, the higher the BLEU score.
- Example:
- Reference: “The cat is on the mat.”
- Generated: “A cat sits on the mat.”
- Unigram overlap: {“cat”, “on”, “the”, “mat”} → 4 matches
- Bigram overlap: {“on the”, “the mat”} → 2 matches
- Formula: \(BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)\) where:
- \( BP \) is the brevity penalty,
- \( p_n \) is the n-gram precision at level \( n \),
- \( w_n \) is the weight assigned to each n-gram level.
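For a quick sanity check of the toy example above, NLTK ships a reference implementation. The sketch below assumes the nltk package is installed and weights only unigrams and bigrams to match the example; it is an illustration, not a full corpus-level evaluation.

```python
# A minimal sketch using NLTK's BLEU implementation (assumes nltk is installed).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()   # reference tokens
candidate = "a cat sits on the mat".split()   # generated tokens

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, weights=(0.5, 0.5), smoothing_function=smooth)
print(f"BLEU (up to bigrams): {score:.3f}")
```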
b. BERTScore
- Type: Semantic similarity metric.
- Usage: Evaluates text similarity at a deeper level using pretrained contextual embeddings (e.g., BERT).
- Advantage: Recognizes word order changes and distant dependencies.
- Example:
- Reference: “The cat is on the mat.”
- Generated: “The feline is lying on the rug.”
- BLEU scores this pair poorly because the surface words differ, whereas BERTScore captures the underlying semantic similarity.
- Formula: \(BERTScore = \frac{1}{N} \sum_{i} \max_{j} \text{cosine}(E(x_i), E(y_j))\) where \( E(x_i) \) and \( E(y_j) \) are the contextualized token embeddings of the generated and reference sentences.
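In practice the metric is usually computed with the bert-score package; the sketch below assumes it is installed and uses its default English model to score the example sentences above.

```python
# A minimal sketch using the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["The feline is lying on the rug."]
references = ["The cat is on the mat."]

# Returns precision, recall, and F1 tensors computed from contextual embeddings.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.item():.3f}")
```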
c. ROUGE Score
- Type: Recall-oriented metric for summarization.
- Usage: Measures overlap between generated and reference summaries.
- Variants:
- ROUGE-1 (Unigrams),
- ROUGE-2 (Bigrams),
- ROUGE-L (Longest common subsequence).
- Example:
- Reference Summary: “AI models generate text.”
- Generated Summary: “Text is generated by AI models.”
- ROUGE-L rewards the preserved subsequence “AI models” even though the overall word order changes.
- Formula (ROUGE-N, recall-oriented): \(ROUGE\text{-}N = \frac{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}_{match}(gram_n)}{\sum_{S \in \text{References}} \sum_{gram_n \in S} \text{Count}(gram_n)}\)
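Google's rouge-score package computes all three variants; the sketch below assumes it is installed and reuses the summary example above.

```python
# A minimal sketch using the rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "AI models generate text.",           # reference summary
    "Text is generated by AI models.",    # generated summary
)
for name, result in scores.items():
    print(f"{name}: precision={result.precision:.2f} recall={result.recall:.2f} f1={result.fmeasure:.2f}")
```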
d. Perplexity (PPL)
- Type: Measures how well the model predicts unseen text.
- Usage: Lower values indicate better fluency and lower uncertainty.
- Example:
- A model trained on grammatical English will have low perplexity on normal sentences.
- If given random word sequences, perplexity will be high.
- Formula: \(PPL = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(y_i \mid y_{<i})\right)\)
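Perplexity is straightforward to compute with any autoregressive language model; the sketch below assumes the transformers and torch packages are installed and uses GPT-2 purely as an illustrative scorer.

```python
# A minimal sketch: perplexity of a sentence under GPT-2 via Hugging Face transformers.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The cat is on the mat."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels provided, the model returns the mean cross-entropy loss,
    # i.e. -1/N * sum log P(y_i | y_<i); exponentiating gives perplexity.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```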
2. ML Metrics for Image Generation
Evaluating images generated by GenAI models involves measuring realism, diversity, and fidelity compared to real-world distributions.
a. Fréchet Inception Distance (FID)
- Type: Measures how close generated images are to real images in feature space.
- Usage: Lower FID scores indicate better image realism.
- Example:
- GAN-generated images of cats should have an FID close to real cat images.
- Formula: \(FID = \| \mu_r - \mu_g \|^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\right)\) where:
- \( \mu_r, \mu_g \) are the mean feature vectors of real and generated images,
- \( \Sigma_r, \Sigma_g \) are the covariance matrices of the real and generated image features.
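The formula translates almost directly into NumPy once you have feature vectors for both image sets; the sketch below assumes the features have already been extracted (a real evaluation would use Inception-v3 pool activations) and uses small random arrays only as placeholders.

```python
# A minimal sketch of the FID formula over precomputed feature matrices
# of shape [num_images, feature_dim].
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; discard tiny imaginary
    # parts introduced by numerical error.
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean))

# Toy usage with random "features"; a real run would use 2048-dim Inception-v3 features.
real = np.random.randn(256, 16)
fake = np.random.randn(256, 16)
print(f"FID: {frechet_inception_distance(real, fake):.2f}")
```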
b. Inception Score (IS)
- Type: Measures image quality and diversity.
- Example:
- A model generating only dog images will have high quality but low diversity.
- A good generative model should create varied yet realistic images.
- Formula: \(IS = \exp\left( \mathbb{E}_x\left[ KL\big(p(y|x) \,\|\, p(y)\big) \right] \right)\)
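Given a matrix of class probabilities \( p(y|x) \) from a pretrained classifier, the score reduces to a few lines of NumPy; the sketch below uses randomly generated probabilities as a stand-in for real Inception-v3 outputs.

```python
# A minimal sketch of the Inception Score formula over a probability matrix
# of shape [num_images, num_classes].
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))  # KL(p(y|x) || p(y)) per image
    return float(np.exp(kl.sum(axis=1).mean()))             # exp of the expected KL

# Toy usage: sharper, more varied class predictions -> higher IS.
probs = np.random.dirichlet(alpha=np.ones(1000) * 0.05, size=256)
print(f"Inception Score: {inception_score(probs):.2f}")
```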
3. Multi-Modal Model Evaluations
For multimodal models (e.g., vision-language models, text-to-image generation), evaluation metrics must capture cross-modal coherence and alignment.
a. LLM Judge
- Uses a pretrained LLM (like GPT-4) as a judge to rate outputs based on predefined criteria.
- Applicable for chatbots, conversational AI, and text-to-image ranking.
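A minimal judging loop can be written against any chat-completion API; the sketch below uses the OpenAI Python SDK, and the model name, rubric, and scoring scale are illustrative assumptions rather than a fixed standard.

```python
# A minimal sketch of an LLM-as-judge loop using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Rate the RESPONSE to the PROMPT on a 1-5 scale for helpfulness, "
    "factual accuracy, and fluency. Reply with only the integer score."
)

def judge(prompt: str, response: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"PROMPT: {prompt}\nRESPONSE: {response}"},
        ],
        temperature=0,  # deterministic scoring
    )
    return completion.choices[0].message.content

print(judge("Summarize the water cycle.", "Water evaporates, condenses, and falls as rain."))
```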
b. MLM Judge (Masked Language Model Judge)
- Measures how well a masked language model recovers missing (masked) tokens given the multimodal context.
- Used in retrieval-augmented generation (RAG) for factual consistency.
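To make the idea concrete, the sketch below shows the text-only core of masked-token scoring with the Hugging Face fill-mask pipeline; a genuine multimodal judge would additionally condition on image or retrieval features, which is omitted here.

```python
# A minimal, text-only sketch of masked-token scoring using Hugging Face transformers.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# How confidently does the model recover the masked fact from the surrounding context?
candidates = fill_mask("The Eiffel Tower is located in [MASK].", top_k=3)
for c in candidates:
    print(f"{c['token_str']:>10}  score={c['score']:.3f}")
```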
Conclusion
Evaluating GenAI models requires a mix of automatic metrics, ML classifiers, and sometimes human-in-the-loop evaluations. Different models and modalities demand tailored evaluation strategies to measure accuracy, realism, fluency, and alignment with human expectations.
🤖 Disclaimer: This post was generated with the help of AI but reviewed, refined, and enhanced by Dr. Rebecca Li, blending AI efficiency with human expertise for a balanced perspective.