Implement a pairwise evaluator that uses an LLM as a judge to compare two generations against each other. In the case of experiments, this would presumably mean judging the generation against the expected>
https://docs.llamaindex.ai/en/stable/examples/evaluation/pairwise_eval/
Note that there should be a parameter for consensus, e.g. force the LLM to judge the same pair again with the two answers flipped and check whether its verdict stays the same.
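
A minimal sketch of how the consensus parameter could work, written independently of any particular library. Everything named here is hypothetical and only for illustration: the `Judge` callable signature, `PairwiseResult`, `pairwise_evaluate`, and the `enforce_consensus` flag. The judge is passed in as a plain function so the logic can be shown without a real LLM call. Treating a disagreement between the two orderings as a tie is also an assumption; it could just as well be surfaced as an error or a separate "inconclusive" verdict.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Optional

Verdict = Literal["A", "B", "TIE"]
# Hypothetical judge signature: (query, answer_a, answer_b) -> verdict.
# In practice this would wrap an LLM call; here it is any callable.
Judge = Callable[[str, str, str], Verdict]

# Maps a verdict from the flipped run back to the original ordering.
_FLIP = {"A": "B", "B": "A", "TIE": "TIE"}


@dataclass
class PairwiseResult:
    winner: Verdict
    # None means the flipped check was not run.
    consistent: Optional[bool]


def pairwise_evaluate(
    judge: Judge,
    query: str,
    first: str,
    second: str,
    enforce_consensus: bool = True,
) -> PairwiseResult:
    """Judge `first` against `second`, optionally re-judging in flipped order."""
    forward = judge(query, first, second)
    if not enforce_consensus:
        return PairwiseResult(winner=forward, consistent=None)

    # Second pass: same pair, positions swapped, verdict mapped back.
    flipped = _FLIP[judge(query, second, first)]
    if forward == flipped:
        return PairwiseResult(winner=forward, consistent=True)
    # Assumption: disagreement between orderings is reported as a tie.
    return PairwiseResult(winner="TIE", consistent=False)


# Two toy judges standing in for an LLM.
def always_first(query: str, a: str, b: str) -> Verdict:
    """Position-biased judge: always prefers whatever is shown first."""
    return "A"


def prefers_longer(query: str, a: str, b: str) -> Verdict:
    """Order-independent judge: prefers the longer answer."""
    if len(a) == len(b):
        return "TIE"
    return "A" if len(a) > len(b) else "B"


if __name__ == "__main__":
    print(pairwise_evaluate(prefers_longer, "q", "short", "a longer answer"))
    print(pairwise_evaluate(always_first, "q", "x", "y"))
```

The position-biased toy judge shows why the flag matters: without the flipped pass its verdict looks like a clean win for the first answer, while with the flipped pass the inconsistency is detected. For experiments, the second argument could be the expected value from the dataset item.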