Talk:Mixture of experts
Some redacted stuff
Pretty useless, but might be interesting...
In Hash MoE, routing is performed deterministically by a hash function, fixed before learning begins. For example, if the model is a 4-layered Transformer, the input is a token for the word "eat", and the hash of "eat" is $(1, 4, 2, 3)$, then the token would be routed to the 1st expert in layer 1, the 4th expert in layer 2, etc. Despite its simplicity, it achieves performance competitive with sparsely gated MoE with $k = 1$.
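A minimal sketch of this style of fixed, hash-based routing (the hash function, per-layer salt, and expert count below are illustrative assumptions, not the exact scheme from the Hash Layers paper):

```python
import hashlib

NUM_LAYERS = 4    # toy 4-layer Transformer, as in the example above
NUM_EXPERTS = 8   # experts per MoE layer (illustrative)

def hash_route(token: str) -> list[int]:
    """Map a token to one expert index per layer, fixed before training.

    The route depends only on the token string and a per-layer salt,
    never on learned parameters, so it stays constant throughout training.
    """
    route = []
    for layer in range(NUM_LAYERS):
        digest = hashlib.sha256(f"layer{layer}:{token}".encode()).digest()
        route.append(int.from_bytes(digest[:4], "big") % NUM_EXPERTS)
    return route

# The token "eat" gets the same 4-entry route (0-based indices) on every
# call, analogous to the (1, 4, 2, 3) example in the text.
print(hash_route("eat"))
```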
In soft MoE, suppose in each batch each expert can process $p$ queries; then there are $n \times p$ queries that can be assigned per batch. Now for each batch of queries $v_1, v_2, \dots, v_m$, the soft MoE layer computes an array $w_{i,j,k}$, such that $(w_{i,j,1}, \dots, w_{i,j,m})$ is a probability distribution over the queries, and the $i$-th expert's $j$-th query is $\sum_k w_{i,j,k} v_k$. However, this does not work with autoregressive modelling, since the weights for one token depend on all the other tokens in the batch.
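A shape-level sketch of that dispatch step, assuming the mixing logits would come from a learned projection of the inputs in a real layer (random logits stand in here, and the function name is hypothetical):

```python
import numpy as np

def soft_moe_dispatch(v, n_experts, p, rng):
    """Toy soft-MoE dispatch: each of the n_experts * p expert slots
    receives a convex combination of ALL m input queries.

    v : (m, d) array of queries; returns an (n_experts, p, d) array.
    """
    m, d = v.shape
    # In a real layer these logits are a learned function of v;
    # random values are used only to demonstrate the shapes.
    logits = rng.standard_normal((n_experts, p, m))
    # w[i, j, :] is a probability distribution over the m queries.
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    # The i-th expert's j-th query is sum_k w[i, j, k] * v[k].
    return np.einsum("ijk,kd->ijd", w, v)

rng = np.random.default_rng(0)
v = rng.standard_normal((10, 16))    # m = 10 queries of width d = 16
slots = soft_moe_dispatch(v, n_experts=4, p=2, rng=rng)
print(slots.shape)                   # (4, 2, 16): n * p = 8 slots in total
```

Because each slot mixes over every query in the batch, the input seen by an expert for one token depends on all the other tokens, which is the conflict with causal (autoregressive) decoding noted above.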