New approaches to weighting drive innovation in large language models

By EconLearner | December 10, 2025

Experts studying the evolving design of neural networks are taking an interest in “higher-order attention mechanisms” as a replacement for the attention used in artificial intelligence transformers to date.

Earlier this month, a group of academic authors presented what they call “Nexus,” a solution to a hurdle in standard attention mechanisms, which they claim “struggle to capture complex, multiple relationships at a single level.”

“Unlike standard approaches that use static linear representations for queries and keys, Nexus dynamically refines these representations through nested self-attention mechanisms,” they wrote. “Specifically, the query and key vectors are themselves outputs of inner attention loops, allowing tokens to gather global context and model high-order correlations before the final attention computation.”

For non-academics, I ran this through ChatGPT twice to simplify and came up with this:

“Nexus does not generate queries and keys in a single, fixed step.
It performs additional mini attention passes to refine them first.
So the tokens gather more context before the main attention.”
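
To make that concrete, here is a minimal NumPy sketch of the general pattern: the queries and keys are first refined by an inner attention pass over the sequence before the outer attention is computed. This is an illustrative toy under assumed names and shapes, not the authors’ actual Nexus architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def nested_attention(x, W_q, W_k, W_v):
    # Toy "higher-order" attention: the queries and keys are themselves
    # outputs of inner attention passes, so every token gathers some global
    # context before the outer (final) attention scores are computed.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    q = attention(q, q, q)      # inner pass refines the queries
    k = attention(k, k, k)      # inner pass refines the keys
    return attention(q, k, v)   # outer pass, as in a standard transformer

rng = np.random.default_rng(0)
seq_len, d = 6, 8
x = rng.normal(size=(seq_len, d))                            # token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))  # stand-in learned weights
print(nested_attention(x, W_q, W_k, W_v).shape)              # (6, 8)
```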

Queries, keys and values

It turns out that queries, keys, and values are all parts of an attention mechanism that helps a neural network “focus” on the right things.

This guide on Medium is a great reference. Let’s start with this:

“In artificial intelligence terms, the queries ask, ‘What is relevant here?’” writes Thiksiga Ragulakaran. “The keys answer, ‘Here’s what I have.’ The values are the raw data used to create the output. All three are created by multiplying the input embeddings with learned weight matrices. This allows the model to project ‘views’ into spaces where similarities become apparent.”

So the QKV trio comes from “learned matrices.”

Here’s more:

“All three — Query (Q), Key (K), and Value (V) — start with the same position embeddings. They are then transformed into unique matrices using separate trainable linear layers. These layers act as adjustable weights, updated during training, to allow the model to learn how to focus on different parts of the input.”
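
In code, those trainable linear layers are just learned weight matrices applied to the same starting embeddings. A minimal sketch in NumPy, with random matrices standing in for trained weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
seq_len, d_model, d_head = 4, 16, 8

X = rng.normal(size=(seq_len, d_model))    # the same embeddings feed Q, K, and V
W_q = rng.normal(size=(d_model, d_head))   # learned projection: "what is relevant here?"
W_k = rng.normal(size=(d_model, d_head))   # learned projection: "here's what I have"
W_v = rng.normal(size=(d_model, d_head))   # learned projection: the raw content

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_head)   # how well each query matches each key
weights = softmax(scores)            # each token's attention weights sum to 1
output = weights @ V                 # each output is a weighted blend of the values
print(weights.shape, output.shape)   # (4, 4) (4, 8)
```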

You can see how the weighting of inputs is central to how a neural network is designed and how it works.

Ragulakaran goes further into how these systems use multi-head attention to capture multiple perspectives. And then there’s something called matmul, which I also looked into with GPT.

“Matmul is short for matrix multiplication,” the model explained. “In AI, it’s the basic math behind how neural networks combine inputs with learned weights. During training and inference, massive matmuls power operations like linear layers and attention. That’s why GPUs/TPUs are optimized for fast, parallel matmul.”

Then I asked: do higher order attention mechanisms use matmul?

“Yes, almost always,” GPT replied. “‘Higher-order’ variants (multi-head, tensor/outer product, factorized/low-rank, etc.) are still based on matrix multiplications or generalized tensor contractions (often written as einsum), which the hardware performs using matmul-type kernels.”

So the next time you hear this phrase, or are asked about it, you’ll have some ballast.

As for the generalized tensor contractions often written as einsum, I’ll mostly leave those alone.
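
(For readers who do want a peek: einsum is just a compact way of writing the same matmul-style contractions. The two lines below compute identical attention scores; plain NumPy, purely illustrative.)

```python
import numpy as np

rng = np.random.default_rng(2)
Q = rng.normal(size=(4, 8))   # queries
K = rng.normal(size=(4, 8))   # keys

scores_matmul = Q @ K.T                             # ordinary matrix multiplication
scores_einsum = np.einsum("qd,kd->qk", Q, K)        # the same contraction as an einsum
print(np.allclose(scores_matmul, scores_einsum))    # True
```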

Making it real

So what can people do with these architectures?

Some experts thinking about neural networks equipped with this attention design talk about richer global context for summarization or Q&A, better tracking of dependencies across functions and files in code, and improved reasoning. In other applications, systems like Nexus could capture higher-order structure in molecules, proteins, or knowledge graphs, or help maintain a coherent global state across multi-step tasks in the agent era.

A source from the Boston Institute of Analytics explains it like this:

“Attention mechanisms have become a key part of many of the most advanced artificial intelligence models, including large language models (LLMs) such as GPT or BERT. Attention mechanisms allow a model to achieve a high degree of accuracy in various tasks such as translation, question answering, text summarization, image captioning, and other systems.”

By any other name

Who knows what we’ll call these LLM innovations years from now? Will we see the world of artificial intelligence consisting of Markov states, or arrays, or key-value pairs? Or all of the above? And what are we going to use all this for? For many people, this is the biggest question. Stay tuned as we head into the new year.
