The Intuitive AI Revolution: Less Guessing, More Understanding

Neurog
3 min read · Jun 12, 2024


Imagine you’re chatting with a smart assistant, asking it to help plan your best friend’s surprise birthday party. But instead of suggesting party themes and cake designs, it starts giving you gardening advice. Frustrating, right? This is what happens when AI models, like the smart assistant in our story, aren’t properly aligned with what we humans expect or prefer. That’s where a groundbreaking approach comes in, aimed at making these AI models, especially the big and brainy ones called large language models (LLMs), more in tune with us.

In a fascinating study titled “Reinforcement Learning from Human Feedback with Active Queries,” a trio of researchers, Kaixuan Ji, Jiafan He, and Quanquan Gu, introduce a clever way to teach these AI brains to better understand and cater to our preferences. Think of it as giving the AI a more refined compass to navigate the vast sea of human desires and expectations.

The traditional method, known as Reinforcement Learning from Human Feedback (RLHF), is like training a puppy with an endless supply of treats: it works, but you need a truckload of treats and a lot of time, because every comparison has to be labeled by a human. The researchers saw this issue and thought, “Why not make every treat count?” By borrowing ideas from active learning, where a model asks for labels only on the examples it can learn the most from, they found a way to train our AI puppy with far fewer treats.

They tackle this by framing the challenge as a game of choice: the AI is shown two candidate responses, much like two flavors of ice cream, and learns from which one a human prefers. Their algorithm, Active-query-based Proximal Policy Optimization (APPO), only asks for human feedback on the comparisons it is genuinely uncertain about, which sharply reduces the number of guesses (or queries) needed to learn what we like. Its practical counterpart, ADPO, applies the same active-querying idea to Direct Preference Optimization (DPO). In their experiments, ADPO needed only about 32,000 queries, roughly half the number used by DPO, while scoring higher on benchmarks such as ARC, TruthfulQA, and HellaSwag: the average benchmark score rose from 57.29 to 59.01, surpassing DPO’s 58.66 with similar or fewer queries.
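To make the idea concrete, here is a minimal, purely illustrative Python sketch of the active-querying principle. It is not the authors’ actual APPO/ADPO implementation: the function names, the logistic (Bradley-Terry style) preference probability, and the uncertainty threshold are assumptions made for the sake of the example.

```python
import math
import random

def preference_probability(margin: float, beta: float = 1.0) -> float:
    """Probability that response A is preferred over B, given the model's
    estimated reward margin r(A) - r(B), under a logistic preference model."""
    return 1.0 / (1.0 + math.exp(-beta * margin))

def should_query_human(margin: float, uncertainty_threshold: float = 0.2) -> bool:
    """Illustrative active-query rule: only ask the human annotator when the
    model is genuinely unsure which response is better, i.e. when its predicted
    preference probability is close to 0.5."""
    p = preference_probability(margin)
    return abs(p - 0.5) < uncertainty_threshold

# Toy comparison: label every pair (naive RLHF-style data collection)
# versus labeling only the ambiguous pairs (active querying).
random.seed(0)
pairs = [random.gauss(0.0, 1.5) for _ in range(10_000)]  # simulated reward margins

naive_queries = len(pairs)
active_queries = sum(should_query_human(m) for m in pairs)

print(f"Naive labeling:  {naive_queries} human queries")
print(f"Active querying: {active_queries} human queries")
```

In this toy simulation, only the ambiguous comparisons trigger a human query while confident ones are skipped, which is the intuition behind how an active-query method can match a standard approach like DPO with roughly half the labels.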

This efficiency is crucial because it implies AI can be trained faster and with less data to better align with human expectations, potentially revolutionizing sectors like healthcare for personalized patient care, education for tailored learning experiences, and customer service for more effective problem resolution.

The societal impact of aligning AI models more closely with human preferences is profound. Technologies that are easier to use, more helpful, and ethically aligned could become a more integral and accepted part of our daily lives, enhancing everything from how we learn to how we manage our health.

In essence, this study isn’t just about making AI smarter; it’s about making technology more human-friendly, ensuring that as we continue to advance in the digital age, we’re building a world where technology understands us better and makes our lives easier in more meaningful ways.
