
Semidynamics Announces Tensor Unit Efficiency Data for its New All-In-One AI IP



Barcelona, Spain – 25 June 2024. Semidynamics, the European RISC-V custom core AI specialist, has announced Tensor Unit efficiency data for its ‘All-In-One’ AI IP running a LlaMA-2 7B-parameter Large Language Model (LLM).

Roger Espasa, Semidynamics’ CEO, explained, “The traditional AI design uses three separate computing elements: a CPU, a GPU (Graphical Processor Unit) and an NPU (Neural Processor Unit) connected through a bus. This traditional architecture requires DMA-intensive programming, which is error-prone, slow, and energy-hungry plus the challenge of having to integrate three different software stacks and architectures. In addition, NPUs are fixed-function hardware that cannot adapt to future AI algorithms yet-to-be-invented.

“In contrast, Semidynamics has re-invented AI architecture and integrates the three elements into a single, scalable processing element. We combine a RISC-V core, a Tensor Unit that handles matrix multiplication (playing the role of the NPU) and a Vector Unit that handles activation-like computations (playing the role of the GPU) into a fully integrated, all-in-one compute element, as shown in Figure 1. Our new architecture is DMA-free, uses a single software stack based on ONNX and RISC-V and offers direct, zero-latency connectivity between the three elements. The result is higher performance, lower power, better area and a much easier-to-program environment, lowering overall development costs. In addition, because the Tensor and Vector Units are under the direct control of a flexible CPU, we can deploy any existing or future AI algorithm, providing great protection to our customer’s investments.”

Figure 1 Comparison of traditional AI architecture to Semidynamics’ new All-In-One integrated solution

Large Language Models (LLMs) have emerged as a key element of AI applications. LLMs are computationally dominated by self-attention layers, shown in detail in Figure 2. These layers consist of five matrix multiplications (MatMul), a matrix Transpose and a SoftMax activation function. In Semidynamics’ All-In-One solution, the Tensor Unit (TU) takes care of matrix multiplication, whereas the Vector Unit (VU) efficiently handles Transpose and SoftMax. Since the Tensor and Vector Units share the vector registers, expensive memory copies can be largely avoided. Hence, there is zero latency and zero energy spent in transferring data from the MatMul layers to the activation layers and vice versa. To keep the TU and the VU continuously busy, weights and inputs must be fetched efficiently from memory into the vector registers. To this end, Semidynamics’ Gazzillion? Misses technology provides an unprecedented ability to move data: by supporting a large number of in-flight cache misses, data can be fetched ahead of time, yielding high resource utilization. Furthermore, Semidynamics’ custom tensor extension includes new vector instructions optimized for fetching and transposing 2D tiles, greatly improving tensor processing.


Figure 2 Attention Layer in LLM
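To make the layer structure in Figure 2 concrete, the sketch below spells out the five MatMuls, the Transpose and the SoftMax of one self-attention head in plain NumPy. It is purely illustrative of the computation pattern described above; the shapes, names and scaling are generic Transformer conventions, not Semidynamics code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable SoftMax (the activation handled by the Vector Unit)
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv, d_head):
    """One self-attention head: five MatMuls, one Transpose, one SoftMax."""
    q = x @ wq                                 # MatMul 1: query projection
    k = x @ wk                                 # MatMul 2: key projection
    v = x @ wv                                 # MatMul 3: value projection
    scores = q @ k.T                           # Transpose of K, then MatMul 4: attention scores
    probs = softmax(scores / np.sqrt(d_head))  # SoftMax activation
    return probs @ v                           # MatMul 5: weighted sum of values

# Toy usage with generic shapes: sequence length 8, width 16
x = np.random.randn(8, 16).astype(np.float32)
wq, wk, wv = (np.random.randn(16, 16).astype(np.float32) for _ in range(3))
out = self_attention(x, wq, wk, wv, d_head=16)
print(out.shape)  # (8, 16)
```

Under the mapping described above, the five MatMuls would run on the Tensor Unit while the Transpose and SoftMax run on the Vector Unit, with the intermediates staying in the shared vector registers rather than being copied through memory.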

Semidynamics has run the full LlaMA-2 7B-parameter model (BF16 weights) on its All-In-One element, using Semidynamics’ ONNX Runtime Execution Provider, and calculated the utilization of the Tensor Unit for all the MatMul layers in the model. The results are shown in Figure 3, aggregated and organized by the A-tensor shape. There are a total of 6 different shapes in LlaMA-2, as shown in the x-axis labels in Figure 3. As can be seen, utilization is above 80% for most shapes, in sharp contrast with other architectures. Results were collected under the most challenging conditions, i.e., with a batch size of 1 and for the first-token computation. To complement this data, Figure 4 presents the Tensor Unit efficiency for large matrix sizes, demonstrating the combined efficiency of the Tensor Unit and the Gazzillion? technology. Figure 4 is annotated with the combined A+B matrix size. As the number of elements in the N, M and P dimensions of the matrices increases, the total size in MBs quickly exceeds any possible cache or scratchpad. The noteworthy aspect of the chart is that performance is stable at slightly above 70%, irrespective of the total size of the matrices. This quite surprising result is thanks to the Gazzillion technology being capable of sustaining a high streaming data rate between main memory and the Tensor Unit.

Figure 3 LlaMA-2 Tensor Unit efficiency organized by Tensor-A shape


Figure 4 Tensor Unit utilization for 8-bit (left side) and 16-bit matrices (right side) for different matrix sizes
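As a back-of-the-envelope illustration of what “Tensor Unit utilization” means and why the matrix sizes in Figure 4 dwarf any on-chip cache, the sketch below estimates both quantities for a generic MatMul, using the usual definition of utilization as achieved MACs divided by peak MACs over the measured cycles. The peak throughput and the cycle count are hypothetical placeholders; the announcement does not disclose these parameters.

```python
# Illustrative only: utilization of a MatMul A(N x M) @ B(M x P) on a tensor engine.
# The peak throughput and cycle count below are made-up placeholders, not
# Semidynamics figures.
def matmul_utilization(n, m, p, cycles, peak_macs_per_cycle):
    required_macs = n * m * p                 # multiply-accumulates the MatMul needs
    peak_macs = cycles * peak_macs_per_cycle  # what the unit could do in that many cycles
    return required_macs / peak_macs

def ab_footprint_mb(n, m, p, bytes_per_element=2):  # 2 bytes per element for BF16
    return (n * m + m * p) * bytes_per_element / 2**20

# A 4096 x 4096 x 4096 BF16 MatMul already needs 64 MB just for A and B, far beyond
# any realistic cache/scratchpad, so operands must stream from main memory.
print(f"A+B footprint: {ab_footprint_mb(4096, 4096, 4096):.0f} MB")
print(f"Utilization:   {matmul_utilization(4096, 4096, 4096,
                                            cycles=375_000_000,
                                            peak_macs_per_cycle=256):.0%}")
```

In this sense, the 70–80% figures above mean the Tensor Unit spends most of its cycles on useful multiply-accumulates even while all operands stream from main memory.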

Espasa concluded, “Our new All-In-One AI IP not only delivers outstanding AI performance but is also easier to program, because there is now just one software stack instead of three. Developers can use the RISC-V stack they already know, and they no longer have to worry about software-managed local SRAMs or DMAs. In addition, Semidynamics provides an ONNX runtime optimized for the All-In-One AI IP, which lets programmers run their ML models with ease. Our solution is therefore a big step forward in programmer friendliness and in ease of integration into new SoC designs. With the All-In-One AI IP, our customers will be able to pass all these benefits on to their own customers, developers and users in the form of better, easier-to-program silicon.

“Furthermore, our All-In-One design is fully resilient to future changes in AI/ML algorithms and workloads. For customers starting a silicon project that will not reach the market for several years, this is enormous risk protection. Knowing that your AI IP will still be relevant when your silicon goes into volume production is a unique advantage of our technology.”
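Espasa’s single-software-stack point is easiest to see from the developer’s side: with an ONNX Runtime execution provider, application code stays standard and the provider decides how layers are dispatched. The sketch below is a minimal, hypothetical example; the provider identifier SemidynamicsExecutionProvider, the model file name and the input name are assumptions for illustration, not published details.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical provider name and model file; the identifiers Semidynamics ships may differ.
wanted = ["SemidynamicsExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in wanted if p in ort.get_available_providers()]

# The execution provider, not the application code, decides how MatMul layers and
# activations are dispatched to the underlying hardware.
session = ort.InferenceSession("llama2-7b.onnx", providers=providers)

input_ids = np.array([[1, 15043, 3186]], dtype=np.int64)  # toy token IDs; real input names depend on the exported graph
outputs = session.run(None, {"input_ids": input_ids})
print(outputs[0].shape)
```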

Founded in Barcelona, Spain, in 2016, Semidynamics is the only provider of fully customizable RISC-V processor IP, specializing in high-bandwidth, high-performance cores with Vector Units and Tensor Units aimed at machine learning and AI applications. The company is privately owned and is a strategic member of the RISC-V Alliance.

 
