Talk:Mixture of experts
Some redacted stuff
Pretty useless, but might be interesting...
In Hash MoE, routing is performed deterministically by a hash function, fixed before learning begins. For example, if the model is a 4-layered Transformer, the input is a token for the word "eat", and the hash of "eat" is $(1, 4, 2, 3)$, then the token would be routed to the 1st expert in layer 1, the 4th expert in layer 2, etc. Despite its simplicity, it achieves performance competitive with sparsely gated MoE with $k = 1$.
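A minimal sketch of this style of fixed, hash-based routing (the hash function, per-layer salt, and expert count below are illustrative assumptions, not the exact scheme from the Hash Layers paper):

```python
import hashlib

NUM_LAYERS = 4    # toy 4-layer Transformer, as in the example above
NUM_EXPERTS = 8   # experts per MoE layer (illustrative)

def hash_route(token: str) -> list[int]:
    """Map a token to one expert index per layer, fixed before training.

    The route depends only on the token string and a per-layer salt,
    never on learned parameters, so it stays constant throughout training.
    """
    route = []
    for layer in range(NUM_LAYERS):
        digest = hashlib.sha256(f"layer{layer}:{token}".encode()).digest()
        route.append(int.from_bytes(digest[:4], "big") % NUM_EXPERTS)
    return route

# The token "eat" gets the same 4-entry route (0-based indices) on every
# call, analogous to the (1, 4, 2, 3) example in the text.
print(hash_route("eat"))
```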
In soft MoE, suppose in each batch each expert can process $p$ queries; then there are $n \times p$ queries that can be assigned per batch. Now for each batch of queries $v_1, v_2, \dots, v_m$, the soft MoE layer computes an array $w_{i,j,k}$, such that $(w_{i,j,1}, \dots, w_{i,j,m})$ is a probability distribution over the queries, and the $i$-th expert's $j$-th query is $\sum_k w_{i,j,k} v_k$. However, this does not work with autoregressive modelling, since the weights for one token depend on all the other tokens in the batch.
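A shape-level sketch of that dispatch step, assuming the mixing logits would come from a learned projection of the inputs in a real layer (random logits stand in here, and the function name is hypothetical):

```python
import numpy as np

def soft_moe_dispatch(v, n_experts, p, rng):
    """Toy soft-MoE dispatch: each of the n_experts * p expert slots
    receives a convex combination of ALL m input queries.

    v : (m, d) array of queries; returns an (n_experts, p, d) array.
    """
    m, d = v.shape
    # In a real layer these logits are a learned function of v;
    # random values are used only to demonstrate the shapes.
    logits = rng.standard_normal((n_experts, p, m))
    # w[i, j, :] is a probability distribution over the m queries.
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    # The i-th expert's j-th query is sum_k w[i, j, k] * v[k].
    return np.einsum("ijk,kd->ijd", w, v)

rng = np.random.default_rng(0)
v = rng.standard_normal((10, 16))    # m = 10 queries of width d = 16
slots = soft_moe_dispatch(v, n_experts=4, p=2, rng=rng)
print(slots.shape)                   # (4, 2, 16): n * p = 8 slots in total
```

Because each slot mixes over every query in the batch, the input seen by an expert for one token depends on all the other tokens, which is the conflict with causal (autoregressive) decoding noted above.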