<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[JOY.VN: AI - Machine Learning]]></title><description><![CDATA[AI for dummies: for people with no technical or programming background.]]></description><link>https://joy.vn/s/ai-ml</link><image><url>https://substackcdn.com/image/fetch/$s_!vuHc!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd1685a53-dbd0-46c7-9e4a-013d015d2d5e_512x512.png</url><title>JOY.VN: AI - Machine Learning</title><link>https://joy.vn/s/ai-ml</link></image><generator>Substack</generator><lastBuildDate>Sat, 11 Apr 2026 07:38:43 GMT</lastBuildDate><atom:link href="https://joy.vn/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[JOY]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[joyvn@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[joyvn@substack.com]]></itunes:email><itunes:name><![CDATA[JOY]]></itunes:name></itunes:owner><itunes:author><![CDATA[JOY]]></itunes:author><googleplay:owner><![CDATA[joyvn@substack.com]]></googleplay:owner><googleplay:email><![CDATA[joyvn@substack.com]]></googleplay:email><googleplay:author><![CDATA[JOY]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Benchmarking Qwen3.5-35B-A3B-GPTQ-Int4 on an RTX Pro 6000 Blackwell]]></title><description><![CDATA[This is probably the best-suited model for running a local service]]></description><link>https://joy.vn/p/benchmark-qwen35-35b-a3b</link><guid isPermaLink="false">https://joy.vn/p/benchmark-qwen35-35b-a3b</guid><dc:creator><![CDATA[JOY]]></dc:creator><pubDate>Sun, 29 Mar 2026 12:13:03 
GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6a2W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbac3f22-cf54-4027-bf1b-05b9d5a24ce3_1200x648.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post was written by Claude and edited by me.</em></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!6a2W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbac3f22-cf54-4027-bf1b-05b9d5a24ce3_1200x648.png" width="1200" height="648" alt="Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 &#183; Hugging Face" title="Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 &#183; Hugging Face"></figure></div><h2>Hardware &amp; Model</h2><p><strong>GPU:</strong> NVIDIA RTX Pro 6000 Blackwell - 96GB GDDR7<br><strong>CPU:</strong> AMD Threadripper Pro 9965WX (24C/48T, Zen 5)<br><strong>RAM:</strong> Micron 768GB ECC DDR5-5600<br><strong>SSD:</strong> Crucial T705 4TB PCIe Gen5</p><p>To be clear, this machine was <strong>not built for benchmarking</strong>: it serves a production workload every day, while also hosting a few small (4B) models and an AI detection pipeline for DOSafe.</p><p><strong>Model: Qwen3.5-35B-A3B-GPTQ-Int4</strong>, a Mixture-of-Experts (MoE) model with 35 billion total parameters but only <strong>3 billion active parameters</strong> per token. This is why it is much faster than dense models in the same class.</p><p><strong>Framework:</strong> vLLM 0.18.0<br><strong>Quantization:</strong> GPTQ-Int4 (Marlin kernel)<br><strong>KV cache:</strong> FP8<br><strong>Max context:</strong> 65,536 tokens<br><strong>VRAM:</strong> 21 GiB (under a quarter of the card)</p><div><hr></div><h2>Results: Decode speed stays stable at any context length</h2><p>This is the most impressive property of MoE here: <strong>decode speed is nearly constant</strong> even as the context grows from 1K to 32K tokens.</p><pre><code>Context    TTFT        TPOT       Decode       E2E (256 tokens)
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
1,024      104 ms      6.1 ms     163 tok/s    1.7s
4,096      182 ms      6.0 ms     166 tok/s    1.7s
8,192      275 ms      6.2 ms     162 tok/s    1.9s
16,384     496 ms      6.2 ms     162 tok/s    2.1s
32,768     1,104 ms    6.3 ms     160 tok/s    2.7s</code></pre><p><strong>TPOT (Time Per Output Token) = 6.0&#8211;6.3 ms</strong>: each token takes ~6 ms to generate, whether you send 1K or 32K of context. For a typical chatbot (2K&#8211;4K), users get a response almost instantly.</p><p>For comparison: the GPT-4o API usually has a TPOT of around 10&#8211;20 ms. A self-hosted model on a single GPU is faster than many cloud APIs.</p><div><hr></div><h2>Throughput: How many users can it serve at once?</h2><pre><code>Concurrency    Total tok/s   Per-user tok/s
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
1              184           184
5              570           114
10             800           80
20             1,868         93
50             3,373         67</code></pre><p>At <strong>50 concurrent requests</strong>, the system still sustains <strong>3,373 tok/s aggregate</strong>, or ~67 tok/s per user. More than enough for comfortable reading (the average human reads ~4 words/second &#8776; ~5 tok/s).</p><p>Prefill (input processing) speed is impressive as well:</p><pre><code>Context    Prefill tok/s
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
1K         9,867
4K         22,503
8K         29,802
16K        33,005  &#8592; peak
32K        29,693</code></pre><p>Prefill peaks at 16K (~33K tok/s), then dips slightly at 32K as memory bandwidth starts to become the bottleneck.</p><div><hr></div><h2>Comparison with an independent benchmark: Millstone AI</h2><p><a href="https://www.millstoneai.com/inference-benchmark/qwen3-5-35b-a3b-fp8-1x-rtx-pro-6000-blackwell">Millstone AI</a> is a professional benchmarking service that tested the <strong>same model on the same GPU</strong>, but with <strong>FP8 quantization</strong>. They ran 4,600 requests with a 100% success rate, so the data is highly trustworthy. (<a href="https://cdn.millstoneai.cloud/benchmarks/qwen3-5-35b-a3b-fp8-1x-rtx-pro-6000-blackwell/qwen3-5-35b-a3b-fp8-1x-rtx-pro-6000-blackwell.pdf">Full PDF report</a>)</p><p>Head-to-head:</p><pre><code>Metric              DOS.AI (GPTQ-Int4)    Millstone (FP8)    Winner
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
TPOT @1K            6.1 ms                ~7 ms              GPTQ +13%
VRAM                21 GiB                ~35 GiB            GPTQ -40%
Prefill @32K        29,693 tok/s          34,509 tok/s       FP8 +16%
MMLU                84.6%                 84.8%              Tie</code></pre><h3>How many concurrent users?</h3><p>Millstone measures the serving limit for each workload type: the number of concurrent requests before the experience starts to degrade:</p><pre><code>Workload               Context    Capacity         TTFT     Gen speed
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
Code Completion        1K         13 requests      2.1s     63 tok/s
Short-form Chat        8K         125+ requests    4.8s     15 tok/s
General Chatbot        32K        ~31 requests     ~7.9s    ~20 tok/s
Document Processing    64K        13 requests      11.9s    27 tok/s
Coding Assistant       96K        ~4 requests      ~10.9s   ~47 tok/s</code></pre><p>Note: each request slot typically serves <strong>4&#8211;5 real users</strong> (people pause naturally between messages). So &#8220;31 concurrent requests&#8221; at 32K actually serves <strong>~120&#8211;155 users</strong>.</p><h3>TTFT: How long does the user wait for the first token?</h3><pre><code>Context    1 user     5 concurrent   10 concurrent
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
1K         0.8s       &lt;0.1s          0.2s
8K         0.8s       0.3s           0.4s
32K        1.0s       2.7s           4.7s
64K        2.5s       7.1s           12.8s
96K        4.6s       12.9s          18.2s</code></pre><p>Interesting: with a few concurrent requests, TTFT at short contexts is actually <strong>lower</strong> than for a single request, because continuous batching works more efficiently with more than one request in flight.</p><p>And importantly: <strong>these are worst-case numbers</strong> (no prompt caching). In practice a chatbot builds its context gradually over the conversation, so only the new tokens need processing and TTFT is much lower.</p><h3>My own FP8 benchmark</h3><p>I also benchmarked FP8 on this same machine (before choosing GPTQ-Int4 for production):</p><pre><code>Metric                    GPTQ-Int4       FP8            Delta
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
Single tok/s              183.7           162.3          GPTQ +13%
@10 concurrent tok/s      800             642            GPTQ +25%
@50 concurrent tok/s      3,373           2,607          GPTQ +29%
@30 rps (vllm bench)      2,531           1,730          GPTQ +46%
TPOT @1 rps               6.6 ms          7.2 ms         GPTQ -8%
VRAM                      21.06 GiB       34.23 GiB      GPTQ -39%
MMLU (5-shot)             84.6%           84.8%          Tie (0.2%)
Scam reasoning            0.970           0.970          Identical</code></pre><p>GPTQ-Int4 wins on <strong>every performance metric</strong> except MMLU (a 0.2% gap, negligible). The lead grows under heavy load: <strong>29%</strong> faster at 50 concurrent, <strong>46%</strong> faster at 30 rps.</p><p>My FP8 numbers are also consistent with Millstone AI's (independent benchmark, same GPU, same quantization), which confirms the results are trustworthy.</p><h3>Why is GPTQ-Int4 faster than FP8 at decode, but slower at prefill?</h3><p>Fewer bits in Int4 means reading less data from VRAM, which means going faster. True, but only when the bottleneck is <strong>memory bandwidth</strong> (weight reads). 
At prefill, the bottleneck is <strong>compute</strong> (matrix multiplication), and FP8 has the advantage there: Blackwell tensor cores have a native FP8 path, while GPTQ-Int4 must dequantize Int4 &#8594; FP16 before computing.</p><ul><li><p><strong>Decode</strong> (bandwidth-bound): the GPU reads the weights for every token &#8594; Int4 reads ~2x less data &#8594; <strong>GPTQ wins by 13%</strong></p></li><li><p><strong>Prefill</strong> (compute-bound): the GPU processes thousands of tokens at once &#8594; FP8 computes faster on native tensor cores &#8594; <strong>FP8 wins by 16%</strong></p></li></ul><p>On top of that, the GPTQ kernels (Marlin/ExLlama) have been optimized for years, while FP8 kernels on Blackwell are still relatively new. 
The gap may narrow as the FP8 kernels mature.</p><p>For chatbot workloads (mostly decode, short-to-medium contexts), GPTQ-Int4 is the best choice today.</p><div><hr></div><h2>GPTQ-Int4 vs FP8: When to choose which?</h2><ul><li><p><strong>Chatbot (2K&#8211;8K)</strong> &#8594; GPTQ-Int4: faster decode, less VRAM</p></li><li><p><strong>RAG / document processing (32K+)</strong> &#8594; FP8: stronger prefill at long contexts</p></li><li><p><strong>Several models on one GPU</strong> &#8594; GPTQ-Int4: 21 GiB vs 35 GiB leaves room for other models</p></li><li><p><strong>High quality requirements</strong> &#8594; Either: the 0.2% MMLU gap is negligible</p></li></ul><div><hr></div><h2>MoE vs Dense: Why is 35B faster than 27B?</h2><p>Qwen3.5 comes in two variants: <strong>35B-A3B</strong> (MoE) and <strong>27B</strong> (dense). Comparison:</p><pre><code>Model           Arch         Active params        Single tok/s    @10 tok/s
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
35B-A3B-GPTQ    MoE          3B                   184             800
27B-GPTQ        Dense        27B                  52              394</code></pre><p>The 35B is <strong>3.5x faster</strong> even though its total parameter count is larger. The trick: MoE activates only 3B of its 35B parameters per token, so the GPU reads ~9x fewer weights than for the 27B dense model.</p><p>And quality? The 35B wins on complex tasks (scam reasoning: 0.970 vs 0.880), while the 27B wins on structured output (JSON: 100% vs 95%).</p><div><hr></div><h2>Comparison with other GPUs</h2><p>Benchmarks aggregated from the community, Millstone AI, and the NVIDIA Forums:</p><pre><code>GPU                              Quant        Single tok/s    Peak concurrent
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
RTX Pro 6000 Blackwell (96GB)    GPTQ-Int4    184             3,373 (@50)
RTX 5090 (32GB)                  GPTQ-Int4    194&#8211;197         &#8212;
RTX 4090 (24GB)                  Q4_K GGUF    122             &#8212;
RTX 3090 (24GB)                  Q4_K GGUF    111             &#8212;
H100 SXM (80GB)                  FP8          &#8212;               908 (peak)
H200 SXM (141GB)                 FP8          &#8212;               1,479 (peak)
2x DGX Spark (GB10)              FP8          65.5            515
M3 Ultra 512GB                   8-bit MLX    80.6            &#8212;</code></pre><p>A few points worth noting:</p><ul><li><p><strong>The RTX 5090 is ~6% faster for a single request</strong> (197 vs 184 tok/s), but with only 32GB of VRAM it cannot host many models at once</p></li><li><p><strong>The DGX Spark (GB10) is very slow</strong>: 65 tok/s for a single request, bottlenecked by LPDDR5x bandwidth. It costs about as much as an RTX Pro 6000 but performs ~3x worse</p></li><li><p><strong>H100 SXM</strong>: peaks at 908 tok/s of concurrent throughput (FP8), while the RTX Pro 6000 with GPTQ-Int4 reaches 800 tok/s at 10 concurrent. Nearly H100-class for about 3.5x less money</p></li><li><p><strong>H200 SXM</strong> is the king (1,479 tok/s peak), but it costs ~$35K and is hard to buy at retail</p></li></ul><p>For self-hosting, the RTX Pro 6000 Blackwell offers the <strong>best price/performance ratio</strong>: a single GPU strong enough for a production chatbot, spare VRAM for several models, at a reasonable price.</p><div><hr></div><h2>GPU Memory Utilization: Finding the sweet spot</h2><p>I tried <code>gpu-memory-utilization</code> values from 0.65 to 0.90:</p><pre><code>Setting   Gen tok/s     TTFT
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
0.65      298.5         78.6 ms
0.75      294.9         71.0 ms
0.80      305.4         100.1 ms
0.85      315.8         71.4 ms  &#8592; peak
0.90      264.4         74.3 ms  &#8592; drop</code></pre><p><strong>Sweet spot: 0.85</strong>. But I run at <strong>0.70</strong>, because the machine also hosts an observer model and an image-detection pipeline. Giving up ~6% throughput buys headroom to run several models at once.</p><div><hr></div><h2>Real-world cost</h2><p>This workstation costs <strong>over $50,000</strong> (the RAM alone is more than $20,000). Not cheap at all. But compare it to renting an API:</p><pre><code>Service                     Cost / 1M output tokens
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
Claude 4 Sonnet             $15.00
GPT-4o                      $10.00
Qwen3.5-35B via Alibaba     $2.00
Self-hosted Qwen3.5-35B     ~$0.05 (electricity)</code></pre><p>Interesting: even against the <strong>same model served via API</strong> (Alibaba: $2.00/1M), self-hosting is still <strong>40x cheaper</strong>. And per <a href="https://artificialanalysis.ai/models/qwen3-5-35b-a3b">Artificial Analysis</a>, Qwen3.5-35B-A3B ranks <strong>#3 of 96 models</strong> on their intelligence index, so quality is not the trade-off.</p><p>At 3,373 tok/s (@50 concurrent), the machine can generate <strong>~291 million tokens/day</strong>. Cost comparison:</p><pre><code>                    API (GPT-4o)    Self-hosted
&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;&#9472;
1M output tokens    $10.00          ~$0.05
291M tokens/day     $2,910          ~$3 (power)
Month at full load  $87,300         ~$90</code></pre><p>Millstone AI measured power draw in detail: a peak of <strong>487W</strong> at 256K context / 5 concurrent. Under a typical chatbot load (8K&#8211;32K), draw sits around 200&#8211;350W.</p><p><strong>ROI</strong>: the machine pays for itself within days if run at full load. This is of course a rough comparison: model quality differs, and self-hosting takes operational effort. But for the right workload, the numbers speak for themselves.</p><div><hr></div><h2>Caveats</h2><p>To interpret these benchmarks properly, keep in mind:</p><ol><li><p><strong>Two different measurement tools</strong>: the internal benchmark (500 max_tokens) and <code>vllm bench serve</code> (256 output tokens) report slightly different numbers, which is entirely normal</p></li><li><p><strong>&#8220;Prefill tok/s&#8221; is a composite metric</strong>: computed as <code>input_tokens / TTFT</code>, so it includes vLLM&#8217;s scheduling overhead</p></li><li><p><strong>Not directly comparable</strong> to benchmarks on an RTX 5090, 8x3090, or other frameworks (SGLang, TensorRT-LLM)</p></li><li><p><strong>This machine runs production traffic</strong>: it is not a dedicated benchmark box, and other processes run alongside</p></li><li><p><strong>Millstone uses 1024 output tokens</strong>, I use 256. Longer generations amortize TTFT, 
making the tok/s figure look higher. Comparing TPOT (per token) is more accurate than comparing total tok/s</p></li><li><p><strong>Quality differs</strong>: Qwen3.5-35B &#8800; GPT-4o &#8800; Claude Sonnet. Cost comparisons only make sense when the model is good enough for the specific task</p></li></ol><div><hr></div><h2>Conclusion</h2><p>Qwen3.5-35B-A3B + GPTQ-Int4 + vLLM 0.18.0 on the RTX Pro 6000 Blackwell:</p><ul><li><p><strong>160+ tok/s decode</strong>, stable at every context length</p></li><li><p><strong>3,373 tok/s</strong> aggregate throughput at 50 concurrent</p></li><li><p><strong>21 GiB VRAM</strong>: plenty of room to run more models</p></li><li><p><strong>6 ms TPOT</strong>: faster than most cloud APIs</p></li><li><p>Validated by <a href="https://www.millstoneai.com/inference-benchmark/qwen3-5-35b-a3b-fp8-1x-rtx-pro-6000-blackwell">Millstone AI</a> (independent benchmark, FP8)</p></li></ul><p>Self-hosted LLM inference is not for everyone. But if you have enough load, need your data to stay private, and are willing to invest in hardware, a single Blackwell GPU + an MoE model + GPTQ quantization is a very powerful combination.</p>]]></content:encoded></item><item><title><![CDATA[How to build a Docker image for vLLM with CUDA 12.8 and RTX 5090 (SM120) support]]></title><description><![CDATA[Nightly build v0.8.3rc2.dev172+g3cdc57669]]></description><link>https://joy.vn/p/cach-build-docker-image-cho-vllm-rtx-5090</link><guid isPermaLink="false">https://joy.vn/p/cach-build-docker-image-cho-vllm-rtx-5090</guid><dc:creator><![CDATA[JOY]]></dc:creator><pubDate>Fri, 16 May 2025 12:52:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ubZo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ubZo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ubZo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png 424w, 
https://substackcdn.com/image/fetch/$s_!ubZo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!ubZo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!ubZo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ubZo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png" width="1200" height="600" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:600,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Feature]: Support for RTX 5090 (CUDA 12.8) &#183; Issue #13306 &#183; vllm-project/ vllm &#183; GitHub&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Feature]: Support for RTX 5090 (CUDA 12.8) &#183; Issue #13306 &#183; vllm-project/ vllm &#183; GitHub" title="Feature]: Support for RTX 5090 (CUDA 12.8) &#183; Issue #13306 &#183; vllm-project/ vllm &#183; GitHub" 
srcset="https://substackcdn.com/image/fetch/$s_!ubZo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png 424w, https://substackcdn.com/image/fetch/$s_!ubZo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png 848w, https://substackcdn.com/image/fetch/$s_!ubZo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png 1272w, https://substackcdn.com/image/fetch/$s_!ubZo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4563790a-44d5-4bbf-85e8-3e2c32d05c65_1200x600.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>At first I used Ollama for AI inference, combined with Open WebUI. It just runs once installed &#8212; nothing to think about. But now that I need to run dual RTX 5090s, I had to look for a replacement. When I asked ChatGPT "L&#234; Ky" (the name I gave it), it recommended the top-shelf option: vLLM. I'll have a post comparing the two later.</p><p>The catch is that vLLM doesn't yet officially support the Blackwel&#8230;</p>
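The SM120 problem comes down to compute-capability coverage: a PyTorch/vLLM build only ships CUDA kernels for the architectures listed in `TORCH_CUDA_ARCH_LIST` at compile time, and the RTX 5090 (Blackwell) reports capability 12.0, which only CUDA 12.8+ toolchains can target. A minimal sketch of that check; the helper `covers_gpu` and both arch-list strings are illustrative, not vLLM's actual build flags:

```python
def covers_gpu(torch_cuda_arch_list: str, capability: tuple) -> bool:
    """Return True if a build's TORCH_CUDA_ARCH_LIST includes the GPU's
    compute capability (ignoring any +PTX suffix)."""
    want = f"{capability[0]}.{capability[1]}"
    archs = [a.replace("+PTX", "") for a in torch_cuda_arch_list.split()]
    return want in archs

# A stock build that stops at Ada/Hopper arches cannot load on SM120:
old_build = "7.0 7.5 8.0 8.6 8.9 9.0+PTX"
# A CUDA 12.8 build can add Blackwell's 12.0:
new_build = "7.0 7.5 8.0 8.6 8.9 9.0 12.0+PTX"

print(covers_gpu(old_build, (12, 0)))  # False
print(covers_gpu(new_build, (12, 0)))  # True
```

This is why the post rebuilds the Docker image from a nightly with a CUDA 12.8 toolchain instead of using the stock wheel.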
      <p>
          <a href="https://joy.vn/p/cach-build-docker-image-cho-vllm-rtx-5090">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Build a Dual System to Train AI, Work, and Game - Part 1: Preparation]]></title><description><![CDATA[Two systems in one case with a custom water-cooling loop, used to train AI, play games, and work.]]></description><link>https://joy.vn/p/dual-system-tan-nhiet-nuoc-train-ai-choi-game-p1</link><guid isPermaLink="false">https://joy.vn/p/dual-system-tan-nhiet-nuoc-train-ai-choi-game-p1</guid><dc:creator><![CDATA[JOY]]></dc:creator><pubDate>Sat, 15 Mar 2025 19:35:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!z_7J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z_7J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z_7J!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg 424w, https://substackcdn.com/image/fetch/$s_!z_7J!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg 848w,
https://substackcdn.com/image/fetch/$s_!z_7J!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!z_7J!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z_7J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg" width="1456" height="1092" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:700356,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://joyvn.substack.com/i/152384377?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z_7J!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!z_7J!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg 848w, https://substackcdn.com/image/fetch/$s_!z_7J!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!z_7J!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19282992-4235-4ede-bb7b-f11e7ea4d694_2518x1888.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Recently my team and I have been researching and applying a bit of AI.</p><p>The build's configuration:</p><h3>Main system for AI training &amp; home server</h3><ul><li><p>CPU: Intel Core Ultra 9 285K</p></li><li><p>Mainboard: ASRock Z890 Taichi AQUA</p></li><li><p>RAM: CORSAIR VENGEANCE RGB DDR5 192GB (4 x 48GB)</p></li><li><p>SSD:</p><ul><li><p>Crucial T705 Gen5 4TB</p></li><li><p>Acer Pre&#8230;</p></li></ul></li></ul>
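When sizing a training box like the one above, a back-of-envelope VRAM estimate is useful. A rough sketch, assuming the common ~16 bytes/parameter rule of thumb for full fine-tuning with mixed-precision Adam; the helper name and the rule of thumb are my assumptions, not from the post:

```python
def full_finetune_vram_gib(params_billion: float,
                           bytes_per_param: float = 16.0) -> float:
    """Rough VRAM for full fine-tuning with mixed-precision Adam:
    ~2 B (fp16 weights) + 2 B (fp16 grads) + 4 B (fp32 master copy)
    + 8 B (fp32 Adam moments) = ~16 bytes per parameter,
    ignoring activations and batch-size effects."""
    return params_billion * 1e9 * bytes_per_param / 2**30

# A 7B model already wants roughly 104 GiB just for model/optimizer
# states, so even a dual-GPU consumer build leans on LoRA/QLoRA or
# CPU offload rather than full fine-tuning.
print(round(full_finetune_vram_gib(7)))  # 104
```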
      <p>
          <a href="https://joy.vn/p/dual-system-tan-nhiet-nuoc-train-ai-choi-game-p1">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>