Elicit is a literature assistant, primarily focused on “systematic reviews”.
Because it retrieves existing papers via embedding search, the problem of fabricated references is avoided.
At its core, however, there is still an LLM, so its ability to understand text rests entirely on a transformer's capacity to soak up context; Elicit's results must therefore be evaluated carefully. In my experience (bioinformatics) this is particularly true, probably because my field has poorly written papers and ambiguous use of language (being multidisciplinary, biologists and computer scientists often use different terms for similar concepts).
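To make the "embedding search" step concrete, here is a minimal sketch of how retrieval over pre-embedded papers typically works: both the query and each paper are mapped to vectors, and papers are ranked by cosine similarity. The paper titles, vectors, and the `retrieve` function are all hypothetical toy values for illustration; a real system like Elicit would use a learned embedding model over full abstracts and a vector index, not hand-written 3-dimensional vectors.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "paper" embeddings (in reality produced by an embedding model).
papers = {
    "LLM summarization benchmark": [0.9, 0.1, 0.0],
    "Protein folding review":      [0.1, 0.9, 0.2],
    "Factual accuracy of GPT-4":   [0.8, 0.2, 0.1],
}

def retrieve(query_vec, k=2):
    # Rank papers by similarity to the query and return the top k titles.
    ranked = sorted(papers.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]

query = [0.85, 0.15, 0.05]  # hypothetical embedding of the user's question
print(retrieve(query))
```

Because retrieval can only return papers that exist in the index, hallucinated citations are ruled out by construction; what is *not* ruled out is the LLM misreading the retrieved text, which is exactly the caveat above.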
Example prompt:
I’m interested in the use of large language models (e.g., GPT-4, Claude) for summarizing scientific research, especially in the biomedical field. Please find peer-reviewed studies that evaluate how factually accurate these models are when generating abstracts or summaries of full-text articles. Focus on papers that include benchmarking, error analysis, or human expert evaluation.
Elicit takes a multi-step approach:

Elicit output for “Evaluating LLMs in Biomedical Summarization”