Repository Details
HelloGitHub Rating: 9.0 (8 ratings)
Free • MIT
Stars: 95.6k
Chinese: Yes
Language: Python
Active: Yes
Contributors: 21
Issues: 112
Organization: Yes
Latest: None
Forks: 2w
License: MIT
This is an open-source large language model built on the Mixture of Experts (MoE) and Multi-Head Latent Attention (MLA) architectures, and it performs exceptionally well on complex tasks such as mathematical reasoning and code generation. The model has 671B parameters in total, but only 37B are activated per token: rather than every 'expert' participating in the computation, a router selects a small subset of experts to process each input. Activating only this fraction of the parameters significantly reduces the cost of both training and inference.
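To make the routing idea concrete, here is a minimal, hypothetical sketch of top-k expert routing in PyTorch (the repository's language is Python). It is not the project's actual implementation; the class name TopKMoE, the 8-expert/top-2 configuration, and the expert networks are illustrative assumptions. It only shows how a router can activate a few experts per token, which is why the active parameter count stays far below the total.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts layer (illustrative, not DeepSeek's code):
    a router picks the top-k experts per token, so only a small fraction
    of the layer's parameters is used for any single token."""

    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # The router scores every expert for every token.
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        scores = self.router(x)                           # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)    # keep only k experts per token
        weights = F.softmax(weights, dim=-1)              # normalize the kept scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Only top_k of the num_experts experts run for each token, mirroring how a
# 671B-parameter MoE model can activate only ~37B parameters per token.
layer = TopKMoE(dim=16, num_experts=8, top_k=2)
tokens = torch.randn(4, 16)
print(layer(tokens).shape)  # torch.Size([4, 16])
```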