Set up a testing protocol: a set of test queries, the expected sources for each query, metrics at three levels (retrieval, citation, and answer), and a V12 regression loop.
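The pieces of such a protocol can be sketched as a small harness. This is a minimal illustration under stated assumptions, not a prescribed implementation: the names `EvalCase`, `SystemOutput`, `evaluate`, and `find_regressions` are hypothetical, the metric definitions (recall at k for retrieval, precision for citations, keyword coverage for answers) are one plausible choice among many, and "V12" is treated simply as the label of the candidate build being compared against a stored baseline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One protocol entry (hypothetical structure)."""
    query: str
    expected_sources: List[str]   # document IDs that should be retrieved and cited
    expected_keywords: List[str]  # crude proxy for answer correctness (assumption)


@dataclass
class SystemOutput:
    """What the system under test returns for one query (hypothetical structure)."""
    retrieved: List[str]  # ranked document IDs from the retriever
    cited: List[str]      # document IDs cited in the final answer
    answer: str


def retrieval_recall(case: EvalCase, out: SystemOutput, k: int = 5) -> float:
    """Fraction of expected sources that appear in the top-k retrieved results."""
    expected = set(case.expected_sources)
    if not expected:
        return 1.0
    return len(expected & set(out.retrieved[:k])) / len(expected)


def citation_precision(case: EvalCase, out: SystemOutput) -> float:
    """Fraction of cited sources that are in the expected set."""
    if not out.cited:
        return 0.0
    expected = set(case.expected_sources)
    return sum(1 for c in out.cited if c in expected) / len(out.cited)


def answer_score(case: EvalCase, out: SystemOutput) -> float:
    """Fraction of expected keywords found in the answer (case-insensitive)."""
    if not case.expected_keywords:
        return 1.0
    text = out.answer.lower()
    hits = sum(1 for kw in case.expected_keywords if kw.lower() in text)
    return hits / len(case.expected_keywords)


def evaluate(cases: List[EvalCase],
             system: Callable[[str], SystemOutput]) -> Dict[str, float]:
    """Run every query through the system and average the three metrics."""
    if not cases:
        raise ValueError("the protocol needs at least one test case")
    totals = {"retrieval": 0.0, "citation": 0.0, "answer": 0.0}
    for case in cases:
        out = system(case.query)
        totals["retrieval"] += retrieval_recall(case, out)
        totals["citation"] += citation_precision(case, out)
        totals["answer"] += answer_score(case, out)
    return {name: total / len(cases) for name, total in totals.items()}


def find_regressions(baseline: Dict[str, float],
                     candidate: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Names of metrics where the candidate drops below baseline by more than tolerance."""
    return [name for name, base in baseline.items()
            if candidate.get(name, 0.0) < base - tolerance]
```

In a regression loop, `evaluate` would be run on each new build with the same fixed case list, and `find_regressions` would compare the result against the metrics stored from the last accepted build; a non-empty list blocks the change until it is investigated. The `tolerance` value here is an arbitrary placeholder and should be tuned to the noise level of the system.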