The Best LLM for Content Creation…

Evaluating LLMs for content creation use cases like copywriting, script writing, blog writing, email writing, etc.

Harshit Tyagi
7 min read · May 3, 2024

DISCLAIMER: This is just a fun project that I wanted to do.

I was working on a project that required me to figure out the best LLM for content creation. I checked out the top models on the lmsys leaderboard, read what other people were saying about them, and went through the model cards of the top LLMs. With no clear answer, I decided to run my own test on these LLMs across different content creation tasks.

Models to evaluate

The models I wanted to assess (given their cost, ease of use, and rankings on the lmsys leaderboard):

  1. Llama-3-70B
  2. Mixtral-8x7B
  3. Gemini 1.5 Pro
  4. Claude 3 Sonnet

Here’s what I did…

Firstly, I broke down the field of content creation into 5 varied use cases:

  1. Blog writing
  2. Email writing
  3. Copywriting — with Advertisement, SEO, Website, Technical and Social Media
  4. Script writing
  5. Content Summarisation

And within each of these use cases, I created multiple categories that are either sub-use cases or steps in that use case's workflow.

Here’s what each of these use cases looked like:

Simple Evaluation Framework

1. GPT-4 Turbo serves as the first judge, scoring every response out of 10 using an evaluation prompt that I wrote for each use case.

2. I serve as the second judge.

3. Each judge scores the response out of 10.

4. The final score is the average of the two scores.
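To make the framework concrete, here is a minimal sketch of the scoring logic in Python. The judge labels and score values below are made up purely for illustration.

```python
# Minimal sketch of the two-judge scoring: a GPT-4 Turbo score and my own score,
# both out of 10, averaged into the final score. Values are illustrative only.

from statistics import mean

scores = {
    "Llama-3-70B":     {"gpt4_judge": 9.0, "human_judge": 9.5},
    "Mixtral-8x7B":    {"gpt4_judge": 7.5, "human_judge": 7.0},
    "Gemini 1.5 Pro":  {"gpt4_judge": 8.5, "human_judge": 8.0},
    "Claude 3 Sonnet": {"gpt4_judge": 8.5, "human_judge": 9.0},
}

final_scores = {model: mean(judge_scores.values()) for model, judge_scores in scores.items()}

for model, score in sorted(final_scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.2f} / 10")
```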

Crafting and curating prompts

After expanding each use case into its categories, I had to carefully craft the prompts that would be given to each of these LLMs. And not just the creation prompts: I knew that if I were the only one evaluating the responses, the results would be highly biased and unreliable, so I joined hands with one of the strongest LLMs out there, GPT-4 Turbo.

Now, there will be:

  1. Creation prompt for each category
  2. Evaluation prompt for each category

The evaluation will be done by another LLM. I know it sounds weird, but benchmarks like MT-Bench (note that this evaluation is nowhere close to MT-Bench) also use strong LLMs as judges to automate the evaluation process.

To craft the creation prompts, I used prompt engineering techniques like persona adoption, clear instructions, giving the model time to think, and delimited reference text.

For example,

Social Media Copy Prompt: Imagine you are the social media manager for a boutique coffee shop that prides itself on using fair-trade, organic coffee beans. Your goal is to engage with a young, hip audience that frequents coffee shops as social hubs. Craft a series of social media posts: — Introduce a new seasonal blend with vibrant visuals and enticing descriptions. — Promote an upcoming live music evening, highlighting the cozy ambiance and quality coffee. — Share a customer testimonial about their favorite coffee and study spot. Ensure each post is engaging, uses a conversational tone, and includes hashtags that enhance visibility and drive interactions.
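If you want to templatise prompts like this, here is a rough Python sketch that combines persona adoption, clear instructions, and delimited reference text. The function and field names are purely illustrative; I wrote my prompts by hand.

```python
# Rough sketch of a creation-prompt template: persona adoption, clear step-by-step
# instructions, and optional reference text delimited by triple quotes.
# Function and field names are illustrative, not what I actually used.

def build_creation_prompt(persona, goal, instructions, reference_text=None):
    lines = [f"Imagine you are {persona}.", f"Your goal is {goal}.", "Tasks:"]
    lines += [f"- {step}" for step in instructions]
    if reference_text:
        # Delimiting reference text keeps the model from confusing it with instructions.
        lines.append(f'Reference text:\n"""{reference_text}"""')
    return "\n".join(lines)

prompt = build_creation_prompt(
    persona="the social media manager for a boutique coffee shop",
    goal="to engage a young, hip audience that treats coffee shops as social hubs",
    instructions=[
        "Introduce a new seasonal blend with vibrant visuals and enticing descriptions.",
        "Promote an upcoming live music evening, highlighting the cozy ambiance.",
        "Share a customer testimonial about their favorite coffee and study spot.",
    ],
)
print(prompt)
```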

The evaluation prompts used similar techniques and a consistent framework. I broke each evaluation rubric into 5 components, each worth 2 marks, with partial marks awarded for partially meeting a criterion.

Example:

Social Media Copy response evaluation prompt: You are an expert copywriter and editor. Score the below social media copy(delimited by triple quotes below) out of 10 based on the following criteria where each point has 2 points, give 0 if the outline fails to capture that element altogether, 1 if it covers it partially and 2 if it perfectly covers all the essence of that criterion: Assess the social media copy for the following elements: — **Relevance:** Is the content aligned with current trends, popular hashtags, and audience interests? — **Conversational Tone:** Does the copy use a friendly, casual tone that resonates with social media users? — **Visual Impact:** Does the copy mention about engaging visuals such as images, videos, or GIFs to boost engagement? — **Brevity:** Is the copy short, concise, and easy to consume at a glance? — **Shareability:** Is the content crafted in a way that encourages likes, shares, and comments to expand reach? “””{text}”””

This was done for all 22 categories.
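In other words, every response is graded on five criteria at 0, 1, or 2 points each, so the maximum per category is 10. A tiny sketch of that aggregation (criterion names and values are made up):

```python
# Each criterion is scored 0 (missing), 1 (partial), or 2 (fully met);
# five criteria give a per-category score out of 10. Values are illustrative.

criterion_scores = {
    "relevance": 2,
    "conversational_tone": 2,
    "visual_impact": 1,
    "brevity": 2,
    "shareability": 1,
}

assert all(points in (0, 1, 2) for points in criterion_scores.values())
category_score = sum(criterion_scores.values())  # out of 10
print(f"Social media copy score: {category_score}/10")
```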

Generation and Evaluation

Now came the time to generate and evaluate the responses.

  • I used Groq to run Llama-3-70B and Mixtral-8x7B,
  • Google Vertex AI Studio to run Gemini 1.5 Pro, and
  • Anthropic's Workbench and chat interface to run the Claude models.

For the evaluation side, I used ChatGPT, which runs GPT-4 Turbo by default.
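I did all of this through the web UIs, but if you would rather script the loop, a minimal sketch using the Groq and OpenAI Python SDKs could look like the following. The model IDs, prompt strings, and the naive score parsing are assumptions for illustration, not what I actually ran.

```python
# Minimal sketch of the generate-then-judge loop. I used web UIs instead;
# model IDs and the naive score parsing here are assumptions for illustration.

import re
from groq import Groq
from openai import OpenAI

generator = Groq()   # expects GROQ_API_KEY in the environment
judge = OpenAI()     # expects OPENAI_API_KEY in the environment

creation_prompt = "Imagine you are the social media manager for a boutique coffee shop..."
evaluation_prompt = 'You are an expert copywriter and editor. Score the below social media copy out of 10...\n"""{text}"""'

# 1. Generate the content with the model under test (Llama-3-70B on Groq here).
response = generator.chat.completions.create(
    model="llama3-70b-8192",
    messages=[{"role": "user", "content": creation_prompt}],
)
copy_text = response.choices[0].message.content

# 2. Ask GPT-4 Turbo to score it with the evaluation prompt.
verdict = judge.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": evaluation_prompt.format(text=copy_text)}],
)
judge_text = verdict.choices[0].message.content

# 3. Naively pull the first "x/10"-style number out of the judge's answer.
match = re.search(r"(\d+(?:\.\d+)?)\s*/\s*10", judge_text)
gpt_score = float(match.group(1)) if match else None
print(gpt_score, judge_text[:200])
```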

Here are the results I got, use case by use case:

1. Blog Writing

GPT’s evaluation scores:

My scores:

And then the average of the above two scores as the final score:

Verdict for Blog Writing — Llama-3-70B

Llama-3-70B scored 48.5 and turned up as the winner, with its very thorough outlines, its ability to learn from reference text, and the quality of its generated text.

Sonnet and Gemini also gave very good responses, but Llama's responses have that nuance and attention to detail that one looks for when reading real-world texts.

2. Email Writing

This was a somewhat disappointing category, partly because of the prompts; I should have put more effort into crafting more detailed email prompts. However, they were the same for all models, so let's see the results:

GPT scores:

My scores:

Final scores:

Verdict for Email Writing — Llama-3-70B

Here again, Llama-3-70B, with 41.5 out of 50, outperformed its competitors. I was not really convinced of the quality against modern email writing practices, which prioritise concise and direct messages, but given the prompts, the models did quite well.

3. Copywriting

All models did fairly well when it comes to copywriting.

GPT scores:

My scores:

Final scores:

Verdict on Copywriting — Llama-3-70B

There is something about the quality and instruction-following ability of Llama-3. It nails every small detail in the prompt and thus scores high on GPT's evaluation, and I also found its copies more detailed, structured, coherent, and appealing.

4. Script Writing

GPT Scores:

My scores:

Final scores:

Verdict on Script Writing — Llama-3-70B

All the models did fairly well at producing a first draft, but they need a lot of improvement at following another author's writing style, something I missed testing this time but will definitely check out.

We had three winners here: Llama-3-70B, Claude 3 Sonnet, and Gemini 1.5 Pro.

5. Content Summarisation

This was one of the most important tasks I had on my plate, and here are the results:

GPT Scores:

My scores:

Final scores:

Verdict on Content Summarisation — Claude & Gemini 1.5 Pro

I was surprised here by the quality of the summaries the Claude models generate. Claude 3 Sonnet did very well, and I also tried Claude 3 Opus (their best model, but very expensive); Opus's summary is structured, detail-oriented, and captures the essence of the document as much as it can. These models can definitely perform really well when fine-tuned.

Winner: Gemini 1.5 Pro and Claude 3 Sonnet

Final Winner — Llama-3-70B

With a total score of 199.5 out of 220 (22 categories × 10 points each), Llama-3-70B does great overall on content creation.

  • Claude 3 Sonnet scored 191.25. It does well on summarisation and script writing.
  • Gemini 1.5 Pro scored 194. It does well on summarisation and script writing.
  • Mixtral 8x7B scored 181.5. It felt a bit out of place; it would have been better to try Mixtral 8x22B.

Video Version of this post

If you found this helpful, follow me on YouTube for more!

