March 15, 2025

Deep research - the good, the bad and the ugly

The “deep research” feature became available in late February for those paying for a “Plus” subscription to ChatGPT. I wrote about this, and about similar features from other providers, in an earlier blog post.

How good is the feature, have I personally found it useful, are there pitfalls, and are there risks? The short answers are “good”, “yes”, “yes” and “absolutely”.

The longer answer follows in this blog post.

In another blog post I wanted to write about whether a well-crafted prompt makes a difference in the quality of an AI’s response. I started from a paper that I came across in my feed. The paper argued, among other things, that giving the AI a role (“you are a highly intelligent assistant”) made no difference, and that there was no measurable effect from phrasing the prompt politely (“would you be so kind as to…”) or threateningly (“do as I say, or else…”).

Okay, that sounds reasonable, and the authors seem to have done a solid job backing up their claims with empirical data. Still, a host of new questions spring to mind:

The study was conducted using GPT-4o and GPT-4o-mini. But what about other models, such as Grok, Claude or Llama?
The study tested two prompting tricks (phrasing the prompt politely or threateningly, and giving the AI a role as “a highly intelligent assistant”), but what about other kinds of similar techniques?
Is there a difference between smaller models (such as Gemma) and larger models (such as GPT-4o)?
Is there a difference between older models and newer ones?
…and so on.

If I were to find and read research answering these questions on my own, it would take me at least a couple of days. A couple of days’ work for a blog post feels a bit much, so I used ChatGPT’s “deep research” instead.

This was the prompt I fed into “deep research”:

“Please help me find information and research on how important basic prompt engineering is for the result of output of an LLM. Specific prompting techniques I want to evaluate are:

-provide role in prompt (e.g. “you are a top expert in the field”).

-specify the format of output (e.g. “the answer should be in a bullet list”)

-politeness (e.g. “please help me”)

-adding virtual incentives (e.g. “i’ll give you 100 dollar if you are correct”)

-prompt the AI to interview the user for clarification (e.g. “ask me questions to ensure you have all information you need”)

I want information mainly from research and measurements, not speculations and guesses.”

I got a few follow-up questions from ChatGPT to clarify my query, which I answered with:

1: I want you to focus on recent studies. Preferably for models as old, or newer, than GPT-4o (i.e. released may 13 2024). only if there are no studies or very few you should look at older studies/models.

2: focus on the models in the ChatGPT family. only expand if too few data points are given by that.

3: structured summary please

4: no preference. peer-reivew, conference paper and whitepapers are all accepted!

After five minutes I received a thorough report that formed the basis for the blog post “The prompt - how important is it really?”

Note that I did not read all the articles that were linked, and therefore cannot be entirely certain that ChatGPT accurately represented their content. What I did do, however, was read the articles that made claims I specifically wanted to highlight in the blog - in particular, that asking the AI to pose follow-up questions to the user yields better results.

My wife also ran a search with “deep research”, and the question she wanted answered was of a more local nature:

“Find out who has lived in (our neighbourhood) in (our village) since 1905 and what they worked with.”

She got a fairly good answer that described our house and the neighbouring houses. However, she got an unpleasant surprise when she checked the sources - one of them was a website formerly known as “Radio Islam”. Wikipedia describes the website as follows:

“Radio Islam was a Swedish community radio channel and media venture launched in 1987 by Ahmed Rami. The channel gained international attention for its grossly antisemitic content, which prompted several police investigations and prosecutions. Two of these led to convictions of its responsible publisher for incitement against an ethnic group. The radio channel was eventually shut down in 1997, but parts of its operations continued on its website and on social media.”

When asked whether the site is a reliable source, ChatGPT itself says:

“No, (the website address) is not a reliable source. The website, formerly known as Radio Islam, has a history of spreading antisemitic content and denying the Holocaust. It has been described as one of the most radical far-right antisemitic websites on the internet. Its founder, Ahmed Rami, has been convicted of incitement against an ethnic group in Sweden. Information from this source should therefore be regarded with great scepticism.”

How did this happen?

The site was included in the results at all because one of our neighbouring houses was built by a man from a well-known family of Jewish descent. The sources for this were Wikipedia and texts from small websites run by local heritage associations. The house was later inherited by his daughter, and the source ChatGPT cited for that detail was a page on Radio Islam titled “Jewish influence over economic life in Sweden”.

The report presented by ChatGPT did not contain any antisemitic language, and the facts it presented (the names of the daughter and her fiancé) appear to be correct.

However, there is every reason to object that the “deep research” feature (as it stands today) selects its sources without filtering out a site that ChatGPT itself describes as “one of the most radical far-right antisemitic websites on the internet”, and whose information it says should be “regarded with great scepticism”.

So how should a regular user regard and use “deep research”?

When it works well, it is a tremendously valuable feature that can find and summarise information quickly.

It is evident that the feature does not critically evaluate its sources. The user therefore needs to review which sources were used, and to read the sources themselves, to make sure that the generated report rests on sound, reliable material and that the sources actually say what the report claims they do.
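
As a rough illustration of the first step, reviewing which sources were used, a short Python script can list the domains a report links to and mark any that appear on a list you maintain yourself. This is a minimal sketch and not a feature of ChatGPT: the function names, the sample report and the placeholder domains are all made up for the example. A list like this only catches sites you already know about, so it is no substitute for actually reading the sources.

```python
import re
from urllib.parse import urlparse

# Hypothetical, user-maintained set of domains to distrust (placeholder entry).
DISTRUSTED_DOMAINS = {"example-unreliable.org"}

# Simple pattern for http(s) links; stops at whitespace and common closing characters.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def source_domains(report_text):
    """Return the unique domains linked in the report, in order of first appearance."""
    domains = []
    for url in URL_PATTERN.findall(report_text):
        # Drop trailing sentence punctuation that the pattern may have picked up.
        host = (urlparse(url.rstrip(".,;")).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host and host not in domains:
            domains.append(host)
    return domains


def flag_sources(report_text, distrusted=DISTRUSTED_DOMAINS):
    """Pair each linked domain with True if it is on the distrusted list."""
    return [(domain, domain in distrusted) for domain in source_domains(report_text)]


# Made-up report text, standing in for the text of a generated report.
report = (
    "The house was built in 1905 (https://www.example-heritage.se/houses). "
    "It was later inherited, see https://example-unreliable.org/page."
)

for domain, distrusted in flag_sources(report):
    print(("CHECK " if distrusted else "ok    ") + domain)
```

Even a domain marked as fine here still needs a human to open the link and confirm that the page says what the report claims.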

Finally, I want to take the opportunity to state something that is hopefully obvious, which is that I detest antisemitism, and I firmly distance myself from antisemitic websites, radio channels and other material.