Barcelona, Spain – 25 June 2024. Semidynamics, the European RISC-V custom core AI specialist, has announced Tensor Unit efficiency data for its ‘All-In-One’ AI IP running a LlaMA-2 7B-parameter Large Language Model (LLM).
Roger Espasa, Semidynamics’ CEO, explained, “The traditional AI design uses three separate computing elements: a CPU, a GPU (Graphics Processing Unit) and an NPU (Neural Processing Unit) connected through a bus. This traditional architecture requires DMA-intensive programming, which is error-prone, slow and energy-hungry, and it adds the challenge of having to integrate three different software stacks and architectures. In addition, NPUs are fixed-function hardware that cannot adapt to future, yet-to-be-invented AI algorithms.
“In contrast, Semidynamics has re-invented AI architecture and integrates the three elements into a single, scalable processing element. We combine a RISC-V core, a Tensor Unit that handles matrix multiplication (playing the role of the NPU) and a Vector Unit that handles activation-like computations (playing the role of the GPU) into a fully integrated, all-in-one compute element, as shown in Figure 1. Our new architecture is DMA-free, uses a single software stack based on ONNX and RISC-V and offers direct, zero-latency connectivity between the three elements. The result is higher performance, lower power, better area and a much easier-to-program environment, lowering overall development costs. In addition, because the Tensor and Vector Units are under the direct control of a flexible CPU, we can deploy any existing or future AI algorithm, providing great protection to our customer’s investments.”
Figure 1 Comparison of traditional AI architecture to Semidynamics’ new All-In-One integrated solution
Large Language Models (LLMs) have emerged as a key element of AI applications. LLMs are computationally dominated by self-attention layers, shown in detail in Figure 2. These layers consist of five matrix multiplications (MatMul), a matrix Transpose and a SoftMax activation function. In Semidynamics’ All-In-One solution, the Tensor Unit (TU) takes care of the matrix multiplications, whereas the Vector Unit (VU) efficiently handles the Transpose and SoftMax. Since the Tensor and Vector Units share the vector registers, expensive memory copies can be largely avoided. Hence, zero latency and zero energy are spent transferring data from the MatMul layers to the activation layers and vice versa. To keep the TU and the VU continuously busy, weights and inputs must be fetched efficiently from memory into the vector registers. To this end, Semidynamics’ Gazzillion™ Misses technology provides unprecedented ability to move data: by supporting a large number of in-flight cache misses, data can be fetched ahead of time, yielding high resource utilization. Furthermore, Semidynamics’ custom tensor extension includes new vector instructions optimized for fetching and transposing 2D tiles, greatly improving tensor processing.
Figure 2 Attention Layer in LLM
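To make the computational structure concrete, the sketch below implements a single self-attention layer in plain NumPy. It is only an illustration of the math described above, not Semidynamics code: the five matrix multiplications (the Q, K and V projections, the score computation and the final weighted sum) correspond to the work mapped onto the Tensor Unit, while the transpose and the softmax correspond to the activation-like work mapped onto the Vector Unit. All tensor names and dimensions here are assumed for illustration.

```python
# Minimal single-head self-attention sketch (illustrative only).
# The five MatMuls would map to the Tensor Unit; the transpose and
# softmax would map to the Vector Unit in the All-In-One element.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax (an activation-like, Vector Unit task).
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # x: (seq_len, d_model); wq/wk/wv: (d_model, d_head)
    q = x @ wq                        # MatMul 1: query projection
    k = x @ wk                        # MatMul 2: key projection
    v = x @ wv                        # MatMul 3: value projection
    scores = q @ k.T                  # Transpose + MatMul 4: attention scores
    scores = scores / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)   # SoftMax activation
    return attn @ v                   # MatMul 5: weighted sum of values

# Example with assumed sizes.
x  = np.random.randn(16, 64).astype(np.float32)
wq = np.random.randn(64, 64).astype(np.float32)
wk = np.random.randn(64, 64).astype(np.float32)
wv = np.random.randn(64, 64).astype(np.float32)
out = self_attention(x, wq, wk, wv)   # shape (16, 64)
```

Because the MatMuls and the SoftMax/Transpose alternate within the same layer, keeping both kinds of work in shared vector registers is what removes the memory copies the text refers to.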
Semidynamics has run the full LlaMA-2 7B-parameter model (BF16 weights) on its All-In-One element, using Semidynamics’ ONNX Runtime Execution Provider, and calculated the utilization of the Tensor Unit for all the MatMul layers in the model. The results are shown in Figure 3, aggregated and organized by the A-tensor shape. There are a total of six different shapes in LlaMA-2, as shown in the x-axis labels of Figure 3. As can be seen, utilization is above 80% for most shapes, in sharp contrast with other architectures. The results were collected under the most challenging conditions, i.e., with a batch size of 1 and for the first-token computation. To complement this data, Figure 4 presents the Tensor Unit efficiency for large matrix sizes, demonstrating the combined efficiency of the Tensor Unit and the Gazzillion™ technology. Figure 4 is annotated with the A+B matrix size. As the number of elements in the N, M and P dimensions of the matrices increases, the total size in MB quickly exceeds any possible cache/scratchpad. The noteworthy aspect of the chart is that performance remains stable at slightly above 70%, irrespective of the total size of the matrices. This surprising result is thanks to the Gazzillion technology’s ability to sustain a high streaming data rate between main memory and the Tensor Unit.
Figure 3 LlaMA-2 Tensor Unit efficiency organized by Tensor-A shape
Figure 4 Tensor Unit utilization for 8-bit (left side) and 16-bit matrices (right side) for different matrix sizes
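As a rough illustration of what a "Tensor Unit utilization" figure means for a single MatMul layer, the sketch below compares the ideal cycle count of an M×N by N×P multiplication against a measured cycle count. The peak MAC rate, the measured cycle count and the example sizes are assumptions for illustration only; the announcement does not describe the exact methodology behind the numbers in Figures 3 and 4.

```python
# Hypothetical utilization estimate for one MatMul layer (illustrative).
def matmul_utilization(m, n, p, measured_cycles, peak_macs_per_cycle):
    """Fraction of the unit's peak MAC throughput actually achieved.

    An (m x n) @ (n x p) product needs m*n*p multiply-accumulates, so a
    unit sustaining `peak_macs_per_cycle` needs at least
    m*n*p / peak_macs_per_cycle cycles; utilization is ideal / measured.
    """
    ideal_cycles = (m * n * p) / peak_macs_per_cycle
    return ideal_cycles / measured_cycles

# Example with assumed numbers: a 4096x4096 by 4096x4096 MatMul on a
# unit with an assumed peak of 256 MACs/cycle, measured at 3.3e8 cycles.
u = matmul_utilization(4096, 4096, 4096,
                       measured_cycles=3.3e8,
                       peak_macs_per_cycle=256)
print(f"Tensor Unit utilization: {u:.1%}")   # ~81% with these assumed numbers
```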
Espasa concluded, “Our new All-In-One AI IP not only delivers outstanding AI performance, it is also much easier to program, because there is now a single software stack instead of three. Developers can use the familiar RISC-V stack, and they do not have to worry about software-managed local SRAMs or DMAs. In addition, Semidynamics provides an ONNX runtime optimized for the All-In-One AI IP, which allows programmers to run their ML models with ease. Our solution is therefore a big step forward in programmer friendliness and in ease of integration into new SoC designs. With the All-In-One AI IP, our customers will be able to pass all these benefits on to their own customers, developers and users in the form of better, easier-to-program silicon.
“Furthermore, our All-In-One design is fully resilient to future changes in AI/ML algorithms and workloads. For customers starting a silicon project that will not reach the market for several years, this is enormous risk protection. Knowing that your AI IP will still be relevant when your silicon goes into volume production is a unique advantage of our technology.”
Founded in Barcelona, Spain, in 2016, Semidynamics is the only provider of fully customizable RISC-V processor IP, specializing in high-bandwidth, high-performance cores with Vector Units and Tensor Units aimed at machine learning and AI applications. The company is privately owned and is a strategic member of the RISC-V Alliance.