Bases: MetricWithLLM
Context relevance metric for RAG evaluation.
This metric evaluates how relevant the retrieved context is to the user's question,
on a scale from 1 (not relevant) to 5 (excellent). It uses an LLM as a judge to
assess how well the context aligns with the question.
Attributes:

| Name | Type | Description |
| ---- | ---- | ----------- |
| name | str | The name of the metric. |
Source code in ragbot\evaluation\metrics\context_relevance.py
@dataclass
class ContextRelevance(MetricWithLLM):
    """Context relevance metric for RAG evaluation.

    This metric evaluates the relevance of the retrieved context to the user's question
    based on a scoring system from 1 to 5. The metric uses an LLM to assess how well
    the context aligns with the question.

    Attributes:
        name (str): The name of the metric.
    """

    name: str = field(default="context relevance", repr=True)
    _required_columns: Set[str] = field(
        default_factory=lambda: {"question", "retrieved_context"}
    )

    def score(self, sample: Sample, **kwargs: Any) -> float:
        """Compute the context relevance score for a given sample.

        Args:
            sample (Sample): A sample containing the user question and retrieved context.
            **kwargs: Optional keyword arguments (not used here).

        Returns:
            float: A score (1-5) indicating how relevant the retrieved context is to the user's question.
        """
        question, context = sample.question, sample.retrieved_context
        output = self.llm.invoke(
            f"""
            You are an expert evaluator assessing how relevant a retrieved context is to a given user question.
            Context relevance is defined as how well the retrieved information aligns with the question.

            **Scoring Guidelines:**
            - **5 (Excellent):** The context is fully relevant to the question, containing directly useful and necessary information.
            - **4 (Good):** The context is mostly relevant but may include minor irrelevant details or miss slight nuances.
            - **3 (Acceptable):** The context is somewhat relevant but may include noticeable irrelevant parts or lack some important details.
            - **2 (Poor):** The context is only partially relevant, with significant irrelevant or missing information.
            - **1 (Not Relevant):** The context is mostly or entirely unrelated to the question.

            **User Question:** {question}
            **Retrieved Context:** {context}

            Assign a single integer score (1-5) based on the above criteria.
            Only return the score as a number, without any extra text.
            Verdict [1 | 2 | 3 | 4 | 5]:
            """
        )
        score = int(output.content.strip())
        return score
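A minimal usage sketch, not taken from the library's docs: it assumes `MetricWithLLM` exposes an `llm` field accepting any LangChain-style chat model whose `invoke` returns a message with a `.content` attribute (as the source above suggests), that `Sample` is a plain container with `question` and `retrieved_context` fields, and that the `Sample` import path shown is hypothetical.

    from langchain_openai import ChatOpenAI  # any chat model with .invoke() works

    from ragbot.evaluation.metrics.context_relevance import ContextRelevance
    from ragbot.evaluation.samples import Sample  # hypothetical import path

    # Build a sample; the source only requires `question` and `retrieved_context`.
    sample = Sample(
        question="What year did Apollo 11 land on the Moon?",
        retrieved_context="Apollo 11 landed on the Moon on July 20, 1969.",
    )

    # Assumption: the `llm` field is inherited from MetricWithLLM.
    metric = ContextRelevance(llm=ChatOpenAI(model="gpt-4o-mini"))

    print(metric.score(sample))  # expected: 5, a fully relevant context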
score(sample, **kwargs)
Compute the context relevance score for a given sample.
Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| sample | Sample | A sample containing the user question and retrieved context. | required |
| **kwargs | Any | Optional keyword arguments (not used here). | {} |
Returns:

| Type | Description |
| ---- | ----------- |
| float | A score (1-5) indicating how relevant the retrieved context is to the user's question. |
Source code in ragbot\evaluation\metrics\context_relevance.py
def score(self, sample: Sample, **kwargs: Any) -> float:
    """Compute the context relevance score for a given sample.

    Args:
        sample (Sample): A sample containing the user question and retrieved context.
        **kwargs: Optional keyword arguments (not used here).

    Returns:
        float: A score (1-5) indicating how relevant the retrieved context is to the user's question.
    """
    question, context = sample.question, sample.retrieved_context
    output = self.llm.invoke(
        f"""
        You are an expert evaluator assessing how relevant a retrieved context is to a given user question.
        Context relevance is defined as how well the retrieved information aligns with the question.

        **Scoring Guidelines:**
        - **5 (Excellent):** The context is fully relevant to the question, containing directly useful and necessary information.
        - **4 (Good):** The context is mostly relevant but may include minor irrelevant details or miss slight nuances.
        - **3 (Acceptable):** The context is somewhat relevant but may include noticeable irrelevant parts or lack some important details.
        - **2 (Poor):** The context is only partially relevant, with significant irrelevant or missing information.
        - **1 (Not Relevant):** The context is mostly or entirely unrelated to the question.

        **User Question:** {question}
        **Retrieved Context:** {context}

        Assign a single integer score (1-5) based on the above criteria.
        Only return the score as a number, without any extra text.
        Verdict [1 | 2 | 3 | 4 | 5]:
        """
    )
    score = int(output.content.strip())
    return score
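One fragility worth noting in the listing above: `int(output.content.strip())` raises `ValueError` whenever the judge model replies with anything other than a bare digit (e.g. "Verdict: 4" or "The score is 5."). A defensive-parsing sketch, written here for illustration and not part of ragbot:

    import re

    def parse_verdict(text: str) -> float:
        """Hypothetical helper: extract a 1-5 verdict from raw LLM output.

        Tries a strict integer parse first, then falls back to the first
        standalone digit in the 1-5 range anywhere in the reply.
        """
        text = text.strip()
        if text.isdigit() and 1 <= int(text) <= 5:
            return float(text)
        match = re.search(r"\b([1-5])\b", text)
        if match:
            return float(match.group(1))
        raise ValueError(f"No verdict in range 1-5 found in LLM output: {text!r}")

With such a helper, the last two lines of `score` could be replaced by `return parse_verdict(output.content)`, trading strictness for robustness against chatty judge models.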