A comprehensive guide to sourcing, building, and running NVIDIA V100 SXM2 NVLink systems for local AI workloads — distilled from months of research across English and Chinese hardware communities.
| Topic | Key Fact |
|---|---|
| Best entry point | 1CATai TAQ-SXM2-4P5A5 quad board + 4× V100 SXM2 16GB |
| Entry cost | ~$1,000–1,200 for 64GB NVLink VRAM (board + GPUs + IO + cooling) |
| Mid-tier | Supermicro 4029GP-TVRT — 8× V100 SXM2, full NVLink cube mesh, runs on 120V |
| Ceiling | DGX-2 — 16× V100 32GB, 512GB unified NVSwitch domain, ~$15–30K used |
| V100 16GB price | ~$56–99 each (eBay/Taobao, early 2026) |
| V100 32GB price | ~$200–450 each (volatile, watch for decommissioning waves) |
| NVLink 2.0 bandwidth | ~300 GB/s bidirectional per GPU (six links combined) |
| HBM2 bandwidth per GPU | ~900 GB/s (full TDP), ~650–720 GB/s at 150W power limit |
| Compute capability | 7.0 — no bfloat16, use float16. Q4/Q8 quantized models run efficiently. |
| Software | Ollama, llama.cpp, vLLM all confirmed working |
| OS | Linux strongly recommended. Windows has Code 43 / resource errors on SXM2 boards. |
| Power per GPU | 300W TDP, runs well at 150W limit via nvidia-smi |
| Key principle | NVLink domain size is the governing metric. Beyond ~3 PCIe-connected GPUs, additional cards are expensive VRAM storage, not useful compute. |
The NVIDIA Tesla V100 SXM2 was the flagship datacenter GPU from 2017–2020. It powered the Summit and Sierra supercomputers, trained early GPT models, and was the workhorse of an entire generation of AI research. Now it's being decommissioned in waves, and the modules are hitting the secondary market at absurdly low prices.
What makes the V100 SXM2 special compared to other cheap GPUs:
It has NVLink. The SXM2 form factor includes NVLink 2.0 connectors on the module itself: a high-bandwidth GPU-to-GPU interconnect that blows PCIe out of the water. Each V100 exposes six NVLink 2.0 links for a combined ~300 GB/s bidirectional, versus ~32 GB/s bidirectional for PCIe 3.0 x16. This is what allows multiple V100s to function as a unified VRAM pool rather than isolated GPUs that happen to share a motherboard.
It has HBM2. Each V100 SXM2 has 900 GB/s of memory bandwidth. Since LLM token generation is almost entirely memory-bandwidth bound (each token requires reading the full model weights from VRAM once), this raw bandwidth translates directly to tokens per second. Four V100s over NVLink deliver ~2,600+ GB/s aggregate usable bandwidth, well beyond what any single consumer GPU can achieve alone (an RTX 4090 tops out around 1,000 GB/s).
The modules are universal. A V100 SXM2 module pulled from a DGX-1 is physically identical to one from a Supermicro 4029GP, an Inspur NF5288M5, or a Dell C4140. Buy them once, use them across any platform that accepts the SXM2 socket. This makes them the foundation of an incremental build strategy.
They're practically free. As of early 2026, V100 SXM2 16GB modules sell for $56–99 each. That's less than a nice dinner for two in exchange for 16GB of HBM2 at 900 GB/s.
NVLink is a direct GPU-to-GPU interconnect that bypasses the CPU and PCIe bus entirely. When four V100s are connected via NVLink on a 1CATai quad board, each GPU has up to ~300 GB/s of combined bidirectional NVLink bandwidth to its peers. For tensor parallelism (splitting a single model across multiple GPUs), this means each GPU reads its shard from local HBM in parallel, and the NVLink fabric handles the inter-GPU communication (all-reduce operations) fast enough that you get near-linear scaling.
Two separate quad boards plugged into the same motherboard via PCIe do not have NVLink between them. Each board is an isolated NVLink domain — the GPUs within a board talk at NVLink speed, but cross-board communication falls back to PCIe 3.0 through PLX switches (~12–16 GB/s). This means:
NVLink domain size is the single most important metric for hardware decisions. Beyond approximately three PCIe-connected GPUs, additional cards become expensive VRAM storage rather than useful compute, because the interconnect degrades faster than the compute scales. Every hardware purchase should be evaluated against: "How many GPUs are in a single NVLink domain?"
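To make the interconnect gap concrete, here is a rough, bandwidth-only model of a ring all-reduce, the collective that tensor parallelism leans on. It is a sketch under stated assumptions, not a benchmark: the function name, the 64 MiB message size, and the per-direction figures (150 GB/s for NVLink, taken as half of the ~300 GB/s bidirectional number, and 16 GB/s for the PCIe path) are all illustrative, and latency is ignored.

```python
def ring_allreduce_seconds(message_bytes: float, n_gpus: int, link_gb_per_s: float) -> float:
    """Bandwidth-only estimate of one ring all-reduce.

    In a ring all-reduce each GPU sends 2 * (N - 1) / N times the message
    size. Latency and protocol overhead are ignored.
    """
    if n_gpus < 2:
        return 0.0  # nothing to reduce across a single GPU
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return bytes_on_wire / (link_gb_per_s * 1e9)


# Hypothetical message: 64 MiB of FP16 activations.
message = 64 * 1024 * 1024

nvlink = ring_allreduce_seconds(message, n_gpus=4, link_gb_per_s=150.0)  # assumed NVLink rate
pcie = ring_allreduce_seconds(message, n_gpus=4, link_gb_per_s=16.0)     # assumed PCIe rate

print(f"NVLink: {nvlink * 1e3:.3f} ms  PCIe: {pcie * 1e3:.3f} ms  ratio: {pcie / nvlink:.1f}x")
```

Because the model is purely bandwidth-driven, the slowdown equals the bandwidth ratio; real systems add latency on top, which usually makes small messages look even worse over PCIe.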
| Platform | NVLink Domain Size | Topology |
|---|---|---|
| 1CATai quad board | 4 GPUs | Full 4-way NVLink mesh |
| Supermicro 4029GP-TVRT | 8 GPUs | Hybrid Cube Mesh (HCM) — same as DGX-1 |
| Inspur NF5288M5 | 8 GPUs | HCM NVLink |
| Inspur NF5488M5 | 8 GPUs | NVSwitch full crossbar |
| DGX-2 | 16 GPUs | NVSwitch unified domain (V100 SXM3) |
The 1CATai TAQ-SXM2-4P5A5 quad board is the centerpiece of the budget build. It is manufactured by 1CATai TECH (一猫之下科技), a Chinese company that reverse-engineered NVIDIA's NVLink 2.0 protocol and built custom SXM2 adapter boards. Their partner company 39com handles the NVLink implementation, which is proprietary and closed-source.
Key specs:
Important distinction: 39com makes the dual (2-GPU) NVLink board. 1CATai makes the quad (4-GPU) board. These are different products from affiliated but separate entities. The dual board is widely available on eBay (~$230–380); the quad board is currently Taobao-only.
1CATai has also built a 16-GPU university prototype and mentioned plans for an 8-GPU board. The 8-GPU board is currently vaporware — without NVSwitch silicon, scaling NVLink beyond 4 GPUs is an architecturally much harder problem. Don't wait for it.
The quad board connects to your host system via a PLX8749 PCIe switch card. This card presents as a single x16 PCIe 3.0 device to the motherboard, then the onboard switch fans out to 4× SFF-8654 8i downstream ports — one per GPU.
Key facts:
For a dual-board setup (two quad boards), you need two PLX8749 cards. On a bifurcated riser, one can run in the x8 output slot and the other in an x4 output slot; the switch doesn't care about upstream bandwidth for inference.
| Module | VRAM | HBM2 BW | Price Range | Notes |
|---|---|---|---|---|
| V100 SXM2 16GB | 16GB | ~900 GB/s | $56–99 | Best value. Most price erosion already happened. |
| V100 SXM2 32GB | 32GB | ~900 GB/s | $200–450 | Volatile pricing. Watch for DOE lab decommissioning batches. |
Strategy: Buy 16GB cards now to prove the concept cheaply. When a batch of 32GB cards hits the secondary market at favorable prices, upgrade. Move the 16GB cards into an Inspur server as the second node. Never sell GPUs — accumulate.
If you want a true 8-way NVLink domain without building from scratch, the Supermicro 4029GP-TVRT is the answer. It uses the X10DGO-SXMV GPU baseboard, which implements NVIDIA's hybrid cube mesh NVLink topology and links all 8 GPUs in a single NVLink domain.
Key specs:
The 120V revelation: At 120V, the Titanium-rated PSUs auto-derate to ~1,100W each. With V100s power-limited to 150W via nvidia-smi, total system draw is ~1,600–1,700W. Four derated PSUs provide ~4,400W capacity. You're at 39% utilization. Run two PSUs per 15A circuit on two separate circuits and you're well under code.
V100 SXM2 modules are bare mezzanine cards — no fans, no heatsinks in the box. You need aftermarket cooling. Options from cheapest to most elaborate:
LLM inference is memory-bandwidth bound. The formula is roughly: tokens/sec ≈ aggregate_bandwidth / model_size_in_bytes. Real-world numbers are 60–75% of theoretical due to framework overhead, attention computation, and KV cache management.
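The rule of thumb above can be written as a small helper. This is a minimal sketch: the function name and the `efficiency` parameter are introduced here for illustration, and the inputs (four GPUs at ~650 GB/s each, a ~37 GB Q4 model) are picked from figures quoted elsewhere in this guide.

```python
def estimate_tokens_per_sec(aggregate_bw_gb_s: float,
                            model_size_gb: float,
                            efficiency: float = 1.0) -> float:
    """Bandwidth-bound estimate: each token reads the whole model once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return aggregate_bw_gb_s / model_size_gb * efficiency


# Assumed inputs: 4 GPUs x ~650 GB/s (150W power limit), ~37 GB Q4 model.
aggregate = 4 * 650.0
for eff in (0.60, 0.75):
    rate = estimate_tokens_per_sec(aggregate, 37.0, eff)
    print(f"efficiency {eff:.0%}: ~{rate:.0f} tok/s upper bound")
```

Treat the result as a ceiling rather than a prediction: it assumes perfect sharding and ignores inter-GPU communication, so measured rates will come in lower.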
| Model | Quant | VRAM Needed | Fits in 64GB? | Est. tok/s |
|---|---|---|---|---|
| Llama 3.1 70B | Q4 | ~37 GB | Yes | 20–30 |
| Llama 3.1 70B | Q8 | ~70 GB | No | — |
| DeepSeek V3.2 685B (MoE) | Q4 | ~55 GB active | Yes | 25–35 |
| Qwen 2.5 72B | Q4 | ~38 GB | Yes | 20–30 |
| Model | Quant | VRAM Needed | Est. tok/s | Notes |
|---|---|---|---|---|
| Llama 3.1 70B | Q4 | ~37 GB | 20–30 | No speed gain over 1 board — PP doesn't add throughput for single-stream |
| Llama 3.1 70B | Q8 | ~70 GB | 12–18 | Better quality, lower speed |
| Llama 3.1 405B | Q4 | ~210 GB | 5–10 | Fits but PCIe bottleneck hurts |
| DeepSeek V3.2 685B (MoE) | Q4 | ~110 GB stored | 15–25 | MoE only activates ~37B per token |
| Model | Quant | VRAM Needed | Est. tok/s | Notes |
|---|---|---|---|---|
| Llama 3.1 70B | Q4 | ~37 GB | 40–50+ | Full NVLink bandwidth across all 8 GPUs |
| Llama 3.1 70B | Q8 | ~70 GB | 25–35 | Higher quality at usable speeds |
| Llama 3.1 405B | Q4 | ~210 GB | Doesn't fit | Need 32GB modules |
| Model | Quant | VRAM Needed | Est. tok/s |
|---|---|---|---|
| Llama 3.1 70B | FP16 | ~140 GB | 15–20 |
| Llama 3.1 405B | Q4 | ~210 GB | 10–15 |
| DeepSeek V3.2 685B (MoE) | Q4 | ~200 GB stored | 20–30 |
Mixture-of-Experts (MoE) models are transformative for V100 hardware because they decouple storage requirements from inference bandwidth.
A dense 405B model requires reading all 405B parameters from VRAM for every single token. An MoE model like DeepSeek V3.2 has ~685B total parameters, but only activates ~37B parameters per token — the router selects which expert sub-networks to use. This means:
This flips the V100 value proposition. The limiting factor on V100 hardware was always VRAM capacity relative to model size. MoE models let you store massive models in VRAM (using the cheap capacity) while only paying the bandwidth cost for a fraction of the parameters (where V100's 900 GB/s HBM2 excels).
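The bandwidth side of that argument is simple arithmetic. The sketch below compares the weight bytes touched per token for a dense 405B model and for an MoE model with ~37B active parameters, both at 4 bits per weight. The function name is made up for this example, and the calculation deliberately ignores KV cache reads, router overhead, and quantization metadata.

```python
def weights_read_per_token_gb(active_params_billion: float, bits_per_weight: float) -> float:
    """GB of weights read per generated token (1e9 params * bits / 8 bits per byte)."""
    return active_params_billion * bits_per_weight / 8


dense_405b = weights_read_per_token_gb(405, 4)  # dense: every parameter is active
moe_active = weights_read_per_token_gb(37, 4)   # MoE: only the routed experts

print(f"dense 405B: {dense_405b:.1f} GB/token")
print(f"MoE (~37B active): {moe_active:.1f} GB/token")
print(f"bandwidth cost ratio: {dense_405b / moe_active:.1f}x")
```

Note that this only covers the per-token read cost; the full set of expert weights still has to be stored somewhere the GPUs can reach.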
| Model | Total Params | Active Params | Q4 Storage | Notes |
|---|---|---|---|---|
| DeepSeek V3.2 | ~685B | ~37B | ~200 GB | Flagship open MoE. Faster than dense 405B. |
| Llama 4 Maverick | ~400B | ~17B | ~120 GB | Meta's MoE entry. Very fast inference. |
| Llama 4 Behemoth | ~2T | ~288B | ~600 GB | Requires massive VRAM. Fantasy tier. |
| Kimi K2.5 | ~1T | varies | ~300 GB | Moonshot AI. Research frontier. |
V100 SXM2 TDP is 300W, but for inference workloads you can power-limit to 150W via:
`sudo nvidia-smi -pl 150`

HBM2 runs on its own clock domain, separate from the SMs. At 150W, SM clocks drop significantly but memory bandwidth retains roughly 70–80% of peak, or about 650–720 GB/s per GPU. Since inference is bandwidth-bound, the performance hit is only ~20–30% for a 50% power reduction.
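If you prefer to set the limit per GPU rather than for all devices at once, a small helper can generate the commands. This is a sketch: `power_limit_commands` is a hypothetical name, and the code only builds and prints the command lines (using nvidia-smi's `-i` and `-pl` options) so you can review them before running anything.

```python
import shlex


def power_limit_commands(gpu_indices, watts):
    """Build one `nvidia-smi -i <index> -pl <watts>` command per GPU."""
    return [
        ["sudo", "nvidia-smi", "-i", str(index), "-pl", str(watts)]
        for index in gpu_indices
    ]


# Assumed layout: one quad board whose GPUs enumerate as 0 through 3.
for cmd in power_limit_commands(range(4), 150):
    print(shlex.join(cmd))
```

The limit set this way does not survive a reboot, so in practice you would reapply it at startup, for example from a boot-time script.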
| Component | Full TDP | At 150W Limit |
|---|---|---|
| 4× V100 SXM2 | 1,200W | 600W |
| Host system (CPU, RAM, fans) | ~200W | ~200W |
| PLX switch + misc | ~50W | ~50W |
| Total | ~1,450W | ~850W |
A single quad board at 150W fits comfortably on a standard US 120V/15A circuit (1,800W capacity, NEC 80% continuous rule = 1,440W).
Two quad boards with every GPU limited to 150W: ~1,450W total system draw. That is just over the 1,440W continuous limit of a single 15A circuit, but it fits on a single 20A circuit or split across two 15A circuits.
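The circuit arithmetic is easy to check for your own configuration. The sketch below applies the 80% continuous-load rule quoted above; the function names are illustrative, and the load figures are the estimates from this section rather than measurements.

```python
def continuous_capacity_watts(volts: float, amps: float, derate: float = 0.8) -> float:
    """Continuous-load capacity of a branch circuit under the 80% rule."""
    return volts * amps * derate


def fits_on_circuit(load_watts: float, volts: float = 120.0, amps: float = 15.0) -> bool:
    """True if the load stays within the circuit's continuous capacity."""
    return load_watts <= continuous_capacity_watts(volts, amps)


for label, load in (("1 quad board", 850), ("2 quad boards", 1450)):
    print(f"{label} ({load}W): 15A -> {fits_on_circuit(load, amps=15.0)}, "
          f"20A -> {fits_on_circuit(load, amps=20.0)}")
```

Measure actual wall draw with a meter before relying on these numbers; the host system figure in particular varies a lot between builds.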
| Component | Full TDP | At 150W Limit |
|---|---|---|
| 8× V100 SXM2 | 2,400W | 1,200W |
| 2× Xeon CPUs | ~330W | ~330W |
| Fans, RAM, misc | ~200W | ~200W |
| Total | ~2,930W | ~1,730W |
The Supermicro 4029GP-TVRT has 4× wide-input PSUs. At 120V, each derates to ~1,100W. With all 4 active: 4,400W capacity for ~1,730W load (39% utilization). Two PSUs per 15A circuit across two circuits = ~3.5A per PSU. Well under code.
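The utilization and per-PSU current figures follow directly from the numbers in the paragraph above. A minimal sketch, assuming the load is shared evenly across all four supplies (the helper names are illustrative):

```python
def psu_utilization(load_watts: float, psu_count: int, watts_per_psu: float) -> float:
    """Fraction of total derated PSU capacity in use."""
    return load_watts / (psu_count * watts_per_psu)


def amps_per_psu(load_watts: float, psu_count: int, volts: float = 120.0) -> float:
    """Input current per PSU, assuming the load is shared evenly."""
    return load_watts / psu_count / volts


load = 1730.0  # estimated system draw with GPUs limited to 150W
print(f"utilization: {psu_utilization(load, 4, 1100.0):.0%}")
print(f"current per PSU: {amps_per_psu(load, 4):.1f} A")
print(f"current per circuit (2 PSUs): {2 * amps_per_psu(load, 4):.1f} A")
```

The calculation treats output watts as input watts; real input current is slightly higher once PSU conversion losses are included.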
If you want headroom or plan to scale further: have an electrician run a NEMA 6-30 (240V/30A) outlet. Your breaker panel almost certainly already has 240V on the bus (it's how your dryer/oven work). Cost: $200–500. This gives you 7,200W of clean power and eliminates all server PSU compatibility questions permanently.
At $0.03/kWh (cheap) to $0.12/kWh (average):
| Config | Draw at 150W | $/month @ $0.03 | $/month @ $0.12 |
|---|---|---|---|
| 1 quad board | ~850W | ~$18 | ~$73 |
| 2 quad boards | ~1,450W | ~$31 | ~$125 |
| 8-GPU server | ~1,730W | ~$37 | ~$149 |
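The table's monthly figures can be reproduced with a short calculation, which is handy for plugging in your own electricity rate. This sketch assumes 24/7 operation and a 30-day month; the function name is illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # assumes continuous operation over a 30-day month


def monthly_cost_usd(draw_watts: float, usd_per_kwh: float) -> float:
    """Electricity cost for one month of continuous draw."""
    kwh = draw_watts / 1000 * HOURS_PER_MONTH
    return kwh * usd_per_kwh


for label, watts in (("1 quad board", 850), ("2 quad boards", 1450), ("8-GPU server", 1730)):
    low = monthly_cost_usd(watts, 0.03)
    high = monthly_cost_usd(watts, 0.12)
    print(f"{label}: ${low:.0f} to ${high:.0f} per month")
```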
Ollama — Works out of the box. Supports multi-GPU via NVLink for automatic model splitting. Easiest path to get started.
llama.cpp — Works well. GGUF quantized models. Flexible memory management, good control over layer distribution across GPUs. Best for fine-tuned control.
vLLM — Supports V100. Use --dtype float16 (V100 lacks bfloat16 tensor cores). Tensor parallel across NVLink GPUs works. Best for serving workloads or multi-user scenarios.
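As an illustration of the vLLM settings mentioned above, the helper below assembles a `vllm serve` command line with FP16 weights and tensor parallelism across one NVLink domain. It is a sketch: the helper name and the model identifier are placeholders, and the code only prints the command rather than launching a server.

```python
import shlex


def vllm_serve_command(model: str, tensor_parallel_size: int) -> list:
    """Build a `vllm serve` invocation suited to V100 (FP16, no bfloat16)."""
    return [
        "vllm", "serve", model,
        "--dtype", "float16",
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


# "your-org/your-model" is a hypothetical placeholder, not a real checkpoint.
cmd = vllm_serve_command("your-org/your-model", tensor_parallel_size=4)
print(shlex.join(cmd))
```

Keeping the tensor-parallel size equal to the number of GPUs in one NVLink domain (4 on a quad board, 8 on an 8-GPU server) matches the domain-size principle described earlier.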
The quad board and many V100 accessories are only available through Chinese domestic marketplaces. Here's how to buy from the US.
Go to world.taobao.com — Taobao's international version works from US IPs with improving English support. Register with a US phone number. Search using Chinese terms (see below). This is for browsing and price-checking only.
Copy the Taobao product URL and paste it into an agent's search bar. They buy it, warehouse it in China, take QC photos, then ship it to you internationally.
| Agent | Service Fee | Shipping | Payment | Best For |
|---|---|---|---|---|
| Superbuy (superbuy.me) | 0% | Higher rates | PayPal, cards | Electronics, established trust, 180-day free storage |
| CSSBuy (cssbuy.com) | 6% | Cheapest rates | PayPal, cards | Heavy items (GPU hardware), best net cost |
| Basetao (basetao.com) | 5% | Mid-range | PayPal, cards | Hands-on seller communication |
| | — | — | — | Avoid. Confirmed 2024 data breach (1.3M records), police actions. |
Expect $50–150 for international shipping on GPU hardware (heavy PCBs + copper heatsinks).
Xianyu (闲鱼) requires a Chinese phone number, Alipay verification, and is mobile-app-only. 1CATai sells the same products on Taobao. Don't bother.
| English | Chinese | Use For |
|---|---|---|
| Four-card | 四卡 | Quad board searches |
| Dual-card | 双卡 | Dual board searches |
| Adapter board | 转接板 | Board searches |
| Interconnect | 互联 | NVLink board searches |
| Motherboard/baseboard | 主板 | Board searches |
| Water-cooled | 水冷 | Cooling searches |
| Heatsink | 散热器 | Cooling searches |
Best search strings:
- `V100 SXM2 四卡 NVLink 转接板` (V100 SXM2 four-card NVLink adapter board)
- `39com V100 四卡` (39com V100 four-card)
- `一猫智星 V100 四卡` (1CATai V100 four-card)

The dual NVLink board is available on eBay from Chinese sellers (~$230–380). The quad board is NOT on eBay as of early 2026. Individual V100 SXM2 modules, PLX8749 cards, and cables are all available on eBay with buyer protection.
Key eBay sellers:
Good news: Section 301 tariff exclusions for computer parts are active through November 2026. This significantly reduces the landed cost of Chinese GPU hardware.
| Cost Component | Per Board |
|---|---|
| Board price (Taobao) | ~$280–390 |
| Agent fee (Superbuy, 0%) | $0 |
| International shipping (share across 10 units) | ~$30–50 |
| US customs duty (with Section 301 exclusion) | Minimal |
| Total landed | ~$367–442 |
Compared to ~$700–800 from US-facing resellers, the savings are substantial — especially at volume.
The core strategy is accumulation, not sell-and-upgrade. V100 SXM2 modules are physically identical across all platforms. Buy them once and move them between systems as you scale.
Each GPU on the quad board has an independent SFF-8654 connection. This means you can split quad boards across multiple PLX cards, and each PLX card can serve a different board. The theoretical scaling on an AMD ROMED8-2T platform:
| Config | GPUs | PLX Cards | Host Lanes Per GPU | Notes |
|---|---|---|---|---|
| 1 quad board, 1 PLX | 4 | 1 | x4 | Standard setup |
| 2 quad boards, 2 PLX | 8 | 2 | x4 | Two NVLink islands |
| 4 quad boards, 4 PLX | 16 | 4 | x4 | Four NVLink islands |
| 17 quad boards at x4/GPU | 68 | 17 | x1 | Theoretical maximum on ROMED8-2T |
At x4 per GPU, the theoretical maximum is 140 GPUs on a ROMED8-2T. Obviously impractical, but the math illuminates the architecture's flexibility.
Beyond ~2 quad boards (8 GPUs), each additional board is adding VRAM capacity, not useful compute bandwidth. Pipeline parallelism across NVLink islands works for fitting models that exceed a single island's capacity, but the PCIe bottleneck between islands limits how much additional performance you actually get. For most practical homelab workloads, 4–8 GPUs in a single NVLink domain is the sweet spot.
For true scale-out, connect two 8-GPU servers via InfiniBand or high-speed Ethernet. Each node has 8 GPUs in an NVLink domain, with data parallelism across nodes. Tensor parallelism stays within-node (NVLink), data parallelism spans nodes (network). This is how actual training clusters work.
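The split between within-node tensor parallelism and cross-node data parallelism can be sketched as a rank layout. The function below is a hypothetical illustration, not any framework's API: it assumes global ranks are numbered node by node and simply lists which ranks would share a tensor-parallel group (same node) and which would share a data-parallel group (same local GPU slot on each node).

```python
def parallel_groups(n_nodes: int, gpus_per_node: int):
    """Return (tensor_parallel_groups, data_parallel_groups) as lists of global ranks."""
    # Tensor-parallel groups: all GPUs inside one node (one NVLink domain).
    tp_groups = [
        [node * gpus_per_node + gpu for gpu in range(gpus_per_node)]
        for node in range(n_nodes)
    ]
    # Data-parallel groups: the same local GPU index on every node (network link).
    dp_groups = [
        [node * gpus_per_node + gpu for node in range(n_nodes)]
        for gpu in range(gpus_per_node)
    ]
    return tp_groups, dp_groups


tp, dp = parallel_groups(n_nodes=2, gpus_per_node=8)
print("tensor-parallel groups:", tp)
print("data-parallel groups:  ", dp)
```

With this layout the heavy all-reduce traffic of tensor parallelism never leaves a node, and only the data-parallel gradient exchange crosses InfiniBand or Ethernet.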
V100 SXM2 systems can train models, not just run inference. The economics differ significantly from inference though.
When training with pipeline parallelism across multiple NVLink islands (e.g., two quad boards), only the activation tensors at the stage boundary (and their gradients on the backward pass) cross PCIe, NOT weight gradients. Weight gradients stay local to each pipeline stage. This makes the PCIe bottleneck largely irrelevant for training overhead, which is a much better situation than naive data-parallel training, where gradient all-reduce would hammer the PCIe link.
| Method | Memory Per Parameter | 70B Model | 405B Model |
|---|---|---|---|
| Full training (Adam, FP16) | ~16 bytes | ~1,120 GB | ~6,480 GB |
| QLoRA (4-bit base + LoRA) | ~4.5 bytes + LoRA overhead | ~315 GB + ~50 GB | Impractical |
Full training is limited to roughly 140–280B parameters across a full cluster. QLoRA dramatically extends this by keeping base model weights quantized and only training the small adapter matrices.
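The full-training row of the table is a straight multiplication, which makes it easy to size other models. The sketch below uses the table's ~16 bytes per parameter figure as an input; the helper names are illustrative, and the GPU count ignores activations, framework overhead, and any sharding inefficiency.

```python
import math


def training_memory_gb(params_billion: float, bytes_per_param: float = 16.0) -> float:
    """Memory for weights, gradients, and optimizer state, in GB."""
    return params_billion * bytes_per_param


def gpus_needed(memory_gb: float, vram_per_gpu_gb: float) -> int:
    """Minimum GPU count if memory could be sharded perfectly."""
    return math.ceil(memory_gb / vram_per_gpu_gb)


for size in (70, 405):
    mem = training_memory_gb(size)
    print(f"{size}B: ~{mem:,.0f} GB, at least {gpus_needed(mem, 32)} x 32GB GPUs")
```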
| Tier | Server | Interconnect | Training Scaling (8 GPUs) |
|---|---|---|---|
| Budget | NF5468M5 | PCIe only | ~5–6× (gradient sync bottleneck) |
| Mid | NF5288M5 | NVLink HCM | ~6–7× (adequate for LoRA/fine-tuning) |
| Premium | NF5488M5 | NVSwitch crossbar | ~7–7.5× (near-linear, communication-heavy OK) |
The NF5288M5 (same topology as DGX-1) is the training sweet spot. It trained everything from ResNet to early GPT variants. More than adequate for the kind of fine-tuning and LoRA work that makes sense at homelab scale.
| | V100 SXM2 Quad (4× 16GB) | Strix Halo (128GB config) |
|---|---|---|
| VRAM / allocatable | 64GB | ~96GB |
| Memory bandwidth | ~2,600+ GB/s aggregate | ~256 GB/s |
| 70B Q4 tok/s | 20–30 | ~15 |
| Entry cost | ~$1,000 | ~$2,500+ (laptop) |
| Form factor | Open frame + power supplies | Laptop |
| Expandable | Yes — add more boards/servers | No |
The Strix Halo wins on total allocatable memory and power efficiency, but the V100 quad board delivers 10× the memory bandwidth at lower cost and scales incrementally. The Strix Halo is a dead end: you can't add more GPUs to a laptop.
| | V100 SXM2 Quad (4× 16GB) | 2× RTX 4090 |
|---|---|---|
| VRAM | 64GB (NVLink unified) | 48GB (PCIe isolated) |
| Interconnect | NVLink 300 GB/s | PCIe ~32 GB/s |
| 70B Q4 | Runs smoothly, TP=4 | Tight fit, PCIe bottleneck |
| Cost | ~$1,000 | ~$3,500 |
| Scaling | Add more boards/servers | No SLI, no NVLink |
Consumer GPUs lack NVLink. Two RTX 4090s connected only by PCIe will never match the effective bandwidth of four V100s connected by NVLink, despite the 4090s having faster individual compute.
The AC922 has native NVLink from CPU to GPU — sounds appealing. In practice, the CPU↔GPU NVLink is largely irrelevant for pipeline-parallel LLM inference (model weights live in GPU memory, not CPU memory). The POWER9's real problem is its dead-end ecosystem: ppc64le architecture means constant software compatibility headaches, limited community support, and no upgrade path. EPYC is the better CPU foundation.
At Vast.ai rates of ~$0.02/hr per V100:
The local cluster doesn't make sense for API-tier frontier reasoning (that's still $200/month to Anthropic/OpenAI). It makes sense for running large open models locally where privacy matters, latency matters, or you're running 24/7 workloads.
16GB modules have experienced most of their price erosion. At $56–99 each, they're near floor pricing. The ITAD (IT Asset Disposition) broker ecosystem controls supply, so large decommissioning events (like the retirement of Summit and Sierra at the DOE labs) don't create fire sales. Brokers warehouse inventory and drip-feed it to maintain floor prices.
32GB modules are more volatile. Best buying windows come from catching specific decommissioning batches before brokers absorb and reprice inventory. Monitor eBay "V100 SXM2 32GB" searches with alerts enabled.
Strategy: Buy 16GB now. The 16GB → 32GB upgrade is purely a capacity decision, not a speed decision (same HBM2 bandwidth). Timing the 32GB purchase to a batch arrival can save hundreds per module.
The price spread between Chinese wholesale and US retail is approximately 2×:
This spread has been identified as a potential distribution business opportunity — purchasing at Chinese wholesale via Taobao agents and reselling in the US market. Contact channels for 1CATai TECH: Bilibili DM, Taobao chat, or Rex Yuan (hello@rexyuan.com) as an English-language bridge with existing manufacturer relationships.
NVIDIA does not want third parties cloning NVLink. However, V100 NVLink 2.0 is ~8-year-old technology running on decommissioned hardware. NVIDIA's current moat is NVLink 4/5 and NVSwitch on H100/B200. Going after hobbyists recycling retired Volta hardware is low priority compared to H100 export enforcement. That said, if you're planning to buy PLX8749 cards or quad boards, don't wait — buy both sooner rather than later.
| Component | Qty | Unit Cost | Total |
|---|---|---|---|
| TAQ-SXM2-4P5A5 quad board (Taobao) | 1 | ~$400 | $400 |
| V100 SXM2 16GB (eBay) | 4 | ~$99 | $396 |
| PLX8749 card (eBay, fastdeal8899) | 1 | ~$130 | $130 |
| SFF-8654 8i cables, 75cm | 4 | ~$19 | $76 |
| A100 passive heatsinks (China) | 4 | ~$25 | $100 |
| Mining frame / open frame | 1 | ~$50 | $50 |
| Total | | | ~$1,152 |
Add existing ATX PSU (850W+ recommended). Runs on single 120V/15A circuit.
| Component | Qty | Unit Cost | Total |
|---|---|---|---|
| TAQ-SXM2-4P5A5 quad board (Taobao) | 2 | ~$400 | $800 |
| V100 SXM2 16GB (eBay) | 8 | ~$99 | $792 |
| PLX8749 cards (eBay) | 2 | ~$130 | $260 |
| SFF-8654 8i cables, 75cm | 8 | ~$19 | $152 |
| A100 passive heatsinks (China) | 8 | ~$25 | $200 |
| Mining frame / open frame | 1 | ~$75 | $75 |
| ATX PSU (1200W+) | 1 | ~$150 | $150 |
| Total | | | ~$2,429 |
Requires bifurcated riser or two PCIe slots. Two 120V/15A circuits recommended.
| Component | Qty | Unit Cost | Total |
|---|---|---|---|
| 4029GP-TVRT barebones (eBay) | 1 | ~$500–1,000 | $750 |
| V100 SXM2 16GB (eBay/Taobao) | 8 | ~$75 | $600 |
| Xeon Gold CPUs (if not included) | 2 | ~$50 | $100 |
| DDR4 ECC RAM (if not included) | 128GB | ~$100 | $100 |
| Total | | | ~$1,550 |
Runs on 120V. Two standard circuits. Full 8-way NVLink cube mesh. Best price-to-NVLink-domain-size ratio available.
| Component | Qty | Unit Cost | Total |
|---|---|---|---|
| 4029GP-TVRT barebones (eBay) | 1 | ~$750 | $750 |
| V100 SXM2 32GB | 8 | ~$350 | $2,800 |
| Xeon Gold CPUs | 2 | ~$50 | $100 |
| DDR4 ECC RAM | 256GB | ~$200 | $200 |
| Total | | | ~$3,850 |
256GB unified NVLink VRAM. Runs Llama 3.1 405B at Q4. DeepSeek V3.2 at Q4. On 120V.
"V100 is too old for modern models." Wrong. Compute capability 7.0 is supported by Ollama, llama.cpp, and vLLM. The V100's strength was never its compute — it's the 900 GB/s HBM2 bandwidth and NVLink interconnect. Inference is bandwidth-bound, not compute-bound.
"V100 can't run quantized models because it lacks FP8/FP4 tensor cores." Wrong. Quantization is a memory/bandwidth optimization. Weights are stored in Q4/Q8, dequantized to FP16 on the fly. The dequantization overhead is ~5–15%. The bandwidth savings from smaller weights far outweigh this cost.
"Two quad boards give you 8-GPU tensor parallelism." Wrong. Two quad boards create two separate NVLink islands connected only by PCIe. Cross-board tensor parallelism is impractical due to the 20× bandwidth gap. You get pipeline parallelism (bigger models) but not more speed for single-stream inference.
"The system sees NVLink GPUs as a single GPU."
Misleading. NVLink GPUs appear as separate devices in nvidia-smi. The unified VRAM pool is managed by the inference framework (vLLM, llama.cpp) using tensor parallelism. The framework distributes model layers/shards and handles inter-GPU communication. It's not automatic OS-level memory pooling.
"You need 220V for server hardware." Not always. Many datacenter PSUs (including the Supermicro 4029GP-TVRT's Titanium units) accept 100–240V input. At 120V they auto-derate to ~1,100W per PSU. With V100s power-limited to 150W, total draw fits within the derated capacity. The server literally ships with standard US wall plugs.
"8-GPU servers are too loud for residential use." True at stock fan curves. Most server BMCs allow fan curve adjustment, or you can replace stock fans with Noctua equivalents. The 4029GP is a 4U chassis — there's room. Still louder than a desktop, but manageable in a closet or garage.
"Buying from Taobao is risky." Manageable risk. Purchasing agents (Superbuy, CSSBuy) provide QC photos before international shipping. You can inspect the board visually before it leaves China. PayPal adds buyer protection. The real risk is DOA hardware with no easy return process — budget for that possibility.
"The 8-GPU NVLink backplane from 1CATai is coming soon." No evidence. Without NVSwitch silicon, scaling NVLink beyond 4 GPUs is an architecturally harder problem. 39com's NVLink work is proprietary and closed-source. Any 8-card development would happen in private WeChat/QQ channels. Monitor 1CATai's Bilibili (space.bilibili.com/335717767) for announcements, but don't hold your breath or delay purchasing.
Last updated: March 2026. Prices and availability are snapshots — the V100 secondary market moves in waves. Check current listings before purchasing.
This guide was compiled from extensive research across English and Chinese hardware communities, hands-on planning sessions, and direct sourcing work. Primary English reference: Rex Yuan's blog at jekyll.rexyuan.com.