[evals] pairwise evaluator

Implement a pairwise evaluator that uses an LLM as a judge to compare two generations against each other. In the case of experiments, this would mean performing the judgement against the expected>

https://docs.llamaindex.ai/en/stable/examples/evaluation/pairwise_eval/

Note that there should be a parameter for consensus, e.g. force the LLM to judge the pair a second time with the answers flipped and see whether its verdict stays the same.
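
A rough sketch of how the consensus parameter could behave, to make the request concrete. Everything here is an assumption for illustration: `pairwise_evaluate`, `PairwiseResult`, the `consensus` flag name, and the judge signature are hypothetical and not taken from LlamaIndex or any existing API. The judge is any callable that would wrap the LLM call and return "first", "second", or "tie". Downgrading a disagreement to a tie is one possible policy, not the only one.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical judge signature (an assumption, not an existing API):
# (query, first_answer, second_answer) -> "first" | "second" | "tie".
# In a real evaluator this callable would wrap the LLM-as-judge prompt.
Judge = Callable[[str, str, str], str]

_ORIGINAL_ORDER = {"first": "a", "second": "b", "tie": "tie"}
_FLIPPED_ORDER = {"first": "b", "second": "a", "tie": "tie"}


@dataclass
class PairwiseResult:
    winner: str                 # "a", "b", or "tie"
    consistent: Optional[bool]  # None when the consensus check is disabled


def pairwise_evaluate(
    judge: Judge,
    query: str,
    answer_a: str,
    answer_b: str,
    consensus: bool = True,
) -> PairwiseResult:
    """Compare two answers; optionally re-judge with the order flipped."""
    first_pass = judge(query, answer_a, answer_b)
    if first_pass not in _ORIGINAL_ORDER:
        raise ValueError(f"unexpected judge output: {first_pass!r}")
    verdict = _ORIGINAL_ORDER[first_pass]

    if not consensus:
        return PairwiseResult(winner=verdict, consistent=None)

    # Second pass: same pair, positions swapped, to expose position bias.
    second_pass = judge(query, answer_b, answer_a)
    if second_pass not in _FLIPPED_ORDER:
        raise ValueError(f"unexpected judge output: {second_pass!r}")
    flipped_verdict = _FLIPPED_ORDER[second_pass]

    if verdict == flipped_verdict:
        return PairwiseResult(winner=verdict, consistent=True)
    # The two passes disagree: treat the comparison as inconclusive.
    return PairwiseResult(winner="tie", consistent=False)
```

With this shape, a judge that always prefers whichever answer is shown first would be reported as an inconsistent tie when `consensus=True`, while `consensus=False` would return its first-pass verdict unchecked.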