
Context Relevance

Context relevance metric for evaluating how relevant the retrieved context is to the user's question.

ContextRelevance dataclass

Bases: MetricWithLLM

Context relevance metric for RAG evaluation.

This metric evaluates the relevance of the retrieved context to the user's question based on a scoring system from 1 to 5. The metric uses an LLM to assess how well the context aligns with the question.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | The name of the metric. |

Source code in ragbot\evaluation\metrics\context_relevance.py
@dataclass
class ContextRelevance(MetricWithLLM):
    """Context relevance metric for RAG evaluation.

    This metric evaluates the relevance of the retrieved context to the user's question
    based on a scoring system from 1 to 5. The metric uses an LLM to assess how well
    the context aligns with the question.

    Attributes:
        name (str): The name of the metric.
    """

    name: str = field(default="context relevance", repr=True)
    _required_columns: Set[str] = field(
        default_factory=lambda: {"question", "retrieved_context"}
    )

    def score(self, sample: Sample, **kwargs: Any) -> float:
        """Compute the context relevance score for a given sample.

        Args:
            sample (Sample): A sample containing the user question and retrieved context.
            **kwargs: Optional keyword arguments (not used here).

        Returns:
            float: A score (1-5) indicating how relevant the retrieved context is to the user's question.
        """
        question, context = sample.question, sample.retrieved_context
        output = self.llm.invoke(
            f"""
            You are an expert evaluator assessing how relevant a retrieved context is to a given user question. 
            Context relevance is defined as how well the retrieved information aligns with the question. 

            **Scoring Guidelines:**  
            - **5 (Excellent):** The context is fully relevant to the question, containing directly useful and necessary information.  
            - **4 (Good):** The context is mostly relevant but may include minor irrelevant details or miss slight nuances.  
            - **3 (Acceptable):** The context is somewhat relevant but may include noticeable irrelevant parts or lack some important details.  
            - **2 (Poor):** The context is only partially relevant, with significant irrelevant or missing information.  
            - **1 (Not Relevant):** The context is mostly or entirely unrelated to the question.  

            **User Question:** {question}  
            **Retrieved Context:** {context}  

            Assign a single integer score (1-5) based on the above criteria.
            Only return the score as a number, without any extra text.

            Verdict [1 | 2 | 3 | 4 | 5]: 
            """
        )

        score = int(output.content.strip())
        return score
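
A minimal usage sketch follows. It assumes Sample accepts question and retrieved_context keyword arguments, that the llm field inherited from MetricWithLLM can be passed at construction, and that any LangChain-style chat model (one whose invoke() returns a message with a .content string, such as ChatOpenAI) is acceptable; the Sample import path is illustrative.

from langchain_openai import ChatOpenAI  # assumed: any chat model exposing .invoke()/.content works

from ragbot.evaluation.metrics.context_relevance import ContextRelevance
from ragbot.evaluation.sample import Sample  # assumed import path for Sample

# Assumption: the llm field is defined on MetricWithLLM and settable here.
metric = ContextRelevance(llm=ChatOpenAI(model="gpt-4o-mini", temperature=0))

sample = Sample(
    question="What is the capital of France?",
    retrieved_context="Paris is the capital and most populous city of France.",
)

print(metric.name)           # "context relevance"
print(metric.score(sample))  # an integer verdict between 1 and 5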

score(sample, **kwargs)

Compute the context relevance score for a given sample.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| sample | Sample | A sample containing the user question and retrieved context. | required |
| **kwargs | Any | Optional keyword arguments (not used here). | {} |

Returns:

| Type | Description |
| --- | --- |
| float | A score (1-5) indicating how relevant the retrieved context is to the user's question. |
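
Continuing the sketch above, score can be applied one sample at a time over a small evaluation set and the verdicts averaged; the samples and the averaging step below are illustrative and not part of ragbot. Note that score converts the raw model reply with int(), so a reply that is not a bare digit will raise a ValueError.

# Illustrative evaluation set; reuses `metric` and `Sample` from the sketch above.
samples = [
    Sample(
        question="How do I reset my password?",
        retrieved_context="Open Settings > Security and choose 'Reset password'.",
    ),
    Sample(
        question="How do I reset my password?",
        retrieved_context="Our office is open Monday to Friday, 9am to 5pm.",
    ),
]

# Each call sends one prompt to the LLM and parses the single-integer verdict.
scores = [metric.score(s) for s in samples]
mean_relevance = sum(scores) / len(scores)
print(scores, mean_relevance)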
