Lech Mazur

Advameg, Inc. CEO 

Founder, city-data.com 

https://twitter.com/LechMazur

Author: County-level COVID-19 machine learning case prediction model. 

Author: AI assistant for melody composition.

Posts

Sorted by New

Wiki Contributions

Comments

Sorted by

I've just created a NYT Connections benchmark. 267 puzzles, 3 prompts for each, uppercase and lowercase.

Results:

GPT-4 Turbo: 31.0

Claude 3 Opus: 27.3

Mistral Large: 17.7

Mistral Medium: 15.3

Gemini Pro: 14.2

Qwen 1.5 72B Chat: 10.7

Claude 3 Sonnet: 7.6

GPT-3.5 Turbo: 4.2

Mixtral 8x7B Instruct: 4.2

Llama 2 70B Chat: 3.5

Nous Hermes 2 Yi 34B: 1.5

  • Partial credit is given if the puzzle is not fully solved
  • There is only one attempt allowed per puzzle, 0-shot. Humans get 4 attempts and a hint when they are one step away from solving a group
  • Gemini Advanced is not yet available through the API

(Edit: I've added bigger models from together.ai and from Mistral)