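Yitian 710 Performance Monitoring: DDR PMU Subsystem
1. The DDR5 Subsystem of Yitian 710
Yitian 710 supports state-of-the-art DDR5 DRAM, providing very high memory bandwidth for cloud computing and HPC. Yitian 710 has 8 DDR5 channels, 4 on each die. Each channel services memory requests independently of the others, and supports DDR5-4400 at 1DPC (DIMM Per Channel) and DDR5-4000 at 2DPC.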
1.1 DDR5 Architecture
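A major change in DDR5 is the new DIMM channel structure (see Channel Architecture in Fig 2). A DDR4 DIMM has a 72-bit bus, made up of 64 data bits and 8 ECC bits. Each DDR5 DIMM has two independent sub-channels, each with a 40-bit bus: 32 data bits and 8 ECC bits. Although the total data width is the same for DDR4 and DDR5 (64 bits), the two independent sub-channels improve memory access efficiency and reduce latency. A single channel can only read or write at any one time, whereas the two sub-channels of DDR5 allow a read and a write to proceed simultaneously.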
1.2 DDR5 Theoretical Bandwidth
The theoretical bandwidth of Yitian 710 with DDR5-4000 at 2DPC is:
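    4000 MHz * 32 bit / 8 * 8 * 2 = 128 * 10^9 bytes/s * 2 = 128 GB/s * 2 = 256 GB/s

That is: effective memory frequency (4000 MHz) * sub-channel width (32 bit) / 8 * number of sub-channels per die (8) * number of dies (2).

Note the difference between GB and GiB:

    1 GB  = 1000000000 bytes (= 1000^3 B = 10^9 B)
    1 GiB = 1073741824 bytes (= 1024^3 B = 2^30 B)

2. Yitian 710 DDRSS PMU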
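The Yitian 710 DDRSS implements an independent PMU for every sub-channel, used for performance analysis and functional debugging. Each sub-channel PMU contains 16 general-purpose counters.
Bandwidth is calculated as follows: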
    DRAM Read Bandwidth  = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle
    DRAM Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle

DDRC_WIDTH: units of 64 bytes

3. Cloud-kernel Support for the DDRSS PMU
    # lscpu
    Architecture:        aarch64
    Byte Order:          Little Endian
    CPU(s):              128
    On-line CPU(s) list: 0-127
    Thread(s) per core:  1
    Core(s) per socket:  128
    Socket(s):           1
    NUMA node(s):        2
    ...
The test environment has 1 socket with 2 dies, exposed as two NUMA nodes.
    # numactl -H
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
    node 0 size: 257416 MB
    node 0 free: 187991 MB
    node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
    node 1 size: 257014 MB
    node 1 free: 194504 MB
    node distances:
    node   0   1
      0:  10  15
      1:  15  10
Each NUMA node has 256 GB of memory.
#dmidecode|grep -P -A5 "Memorys+Device"|grep Size|grep -v Range Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: 32 GB Size: No Module Installed ...#dmidecode -t memory | grep Speed: Speed: 4000 MHz Configured Clock Speed: 4000 MHz
The system runs at 2DPC with 16 DIMMs installed in total, 8 per die, at an effective frequency of 4000 MHz.
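For this configuration, the theoretical peak from section 1.2 can be recomputed and expressed in both GB/s and GiB/s. The following is a minimal sketch; the function name `theoretical_bandwidth_bytes_per_s` and its parameters are illustrative and not part of any tool mentioned in this article.

```python
GB = 1000 ** 3   # decimal gigabyte
GiB = 1024 ** 3  # binary gibibyte


def theoretical_bandwidth_bytes_per_s(mega_transfers_per_s,
                                      subchannel_bits=32,
                                      subchannels_per_die=8,
                                      dies=2):
    """Peak bytes/s: transfer rate * bytes per transfer * sub-channels * dies."""
    bytes_per_transfer = subchannel_bits // 8
    return (mega_transfers_per_s * 1_000_000 * bytes_per_transfer
            * subchannels_per_die * dies)


peak = theoretical_bandwidth_bytes_per_s(4000)
print(f"{peak / GB:.1f} GB/s")    # 256.0 GB/s
print(f"{peak / GiB:.1f} GiB/s")  # 238.4 GiB/s
```

Under the same assumptions, passing 4400 gives the corresponding figure for DDR5-4400 at 1DPC.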
    # ls /sys/bus/event_source/devices/ | grep drw
    ali_drw_21000
    ali_drw_21080
    ali_drw_23000
    ali_drw_23080
    ali_drw_25000
    ali_drw_25080
    ali_drw_27000
    ali_drw_27080
    ali_drw_40021000
    ali_drw_40021080
    ali_drw_40023000
    ali_drw_40023080
    ali_drw_40025000
    ali_drw_40025080
    ali_drw_40027000
    ali_drw_40027080
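With 2DPC fully populated there are 16 PMU devices in total. ali_drw_21000 and ali_drw_21080 are the two sub-channels of the same DIMM on Die 0. Devices named ali_drw_2X000 belong to Die 0, and devices named ali_drw_4002X000 belong to Die 1.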
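Because the device names follow a regular pattern, the long perf command lines used in section 4 can be generated rather than typed by hand. The sketch below assumes the naming seen in the device list above (an empty prefix for Die 0, the prefix `400` for Die 1) and the four events used in section 4. The helper names `pmu_devices` and `perf_stat_args` are hypothetical.

```python
# Naming pattern as observed in /sys/bus/event_source/devices on the test machine
DIE_PREFIX = {0: "", 1: "400"}
CHANNELS = ("21", "23", "25", "27")
SUBCHANNELS = ("000", "080")
EVENTS = ("perf_hif_wr", "perf_hif_rd", "perf_hif_rmw", "perf_cycle")


def pmu_devices(die):
    """Return the eight sub-channel PMU device names of one die."""
    return [f"ali_drw_{DIE_PREFIX[die]}{ch}{sub}"
            for ch in CHANNELS for sub in SUBCHANNELS]


def perf_stat_args(die, seconds=1):
    """Build a system-wide perf stat argument list for every event on one die."""
    args = ["perf", "stat"]
    for dev in pmu_devices(die):
        for ev in EVENTS:
            args += ["-e", f"{dev}/{ev}/"]
    args += ["-a", "--", "sleep", str(seconds)]
    return args


print(" ".join(perf_stat_args(0)))
```

The returned list can be passed to `subprocess.run` on a machine where these PMU devices exist.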
4. Verifying DDR Bandwidth Accuracy
4.1 TL;DR
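Bandwidth unit: MB/s
As the results show, the bandwidth reported by the DDR PMU deviates from the bw_mem figure by no more than 1%. For the test methodology, see "Yitian 710 Performance Monitoring: CMN Flit Traffic Trace with Watchpoint Event".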
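The claim can be checked against the figures printed in sections 4.2 to 4.5. The sketch below copies the PMU-derived bandwidth and the bw_mem bandwidth of each test from those sections and computes the relative deviation. The names `RESULTS` and `relative_error_pct` are illustrative.

```python
# (PMU-derived MB/s, bw_mem MB/s), copied from sections 4.2 to 4.5
RESULTS = {
    "C0M0 rd":  (20561.675, 20507.82),
    "C1M1 rd":  (20516.303, 20492.53),
    "C0M0 fwr": (21969.226, 21936.50),
    "C1M1 fwr": (21966.889, 21934.51),
}


def relative_error_pct(pmu, ref):
    """Deviation of the PMU figure from the reference, in percent."""
    return abs(pmu - ref) / ref * 100.0


for name, (pmu, ref) in RESULTS.items():
    err = relative_error_pct(pmu, ref)
    print(f"{name:9s} PMU={pmu:10.3f}  bw_mem={ref:10.2f}  error={err:.2f}%")
```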
4.2 C0M0 rd
    # First, run bw_mem as a background workload:
    # numactl --cpubind=0 --membind=0 ./bw_mem 40960M rd
    # Then run the perf command in another console:
    perf stat \
      -e ali_drw_21000/perf_hif_wr/ -e ali_drw_21000/perf_hif_rd/ -e ali_drw_21000/perf_hif_rmw/ -e ali_drw_21000/perf_cycle/ \
      -e ali_drw_21080/perf_hif_wr/ -e ali_drw_21080/perf_hif_rd/ -e ali_drw_21080/perf_hif_rmw/ -e ali_drw_21080/perf_cycle/ \
      -e ali_drw_23000/perf_hif_wr/ -e ali_drw_23000/perf_hif_rd/ -e ali_drw_23000/perf_hif_rmw/ -e ali_drw_23000/perf_cycle/ \
      -e ali_drw_23080/perf_hif_wr/ -e ali_drw_23080/perf_hif_rd/ -e ali_drw_23080/perf_hif_rmw/ -e ali_drw_23080/perf_cycle/ \
      -e ali_drw_25000/perf_hif_wr/ -e ali_drw_25000/perf_hif_rd/ -e ali_drw_25000/perf_hif_rmw/ -e ali_drw_25000/perf_cycle/ \
      -e ali_drw_25080/perf_hif_wr/ -e ali_drw_25080/perf_hif_rd/ -e ali_drw_25080/perf_hif_rmw/ -e ali_drw_25080/perf_cycle/ \
      -e ali_drw_27000/perf_hif_wr/ -e ali_drw_27000/perf_hif_rd/ -e ali_drw_27000/perf_hif_rmw/ -e ali_drw_27000/perf_cycle/ \
      -e ali_drw_27080/perf_hif_wr/ -e ali_drw_27080/perf_hif_rd/ -e ali_drw_27080/perf_hif_rmw/ -e ali_drw_27080/perf_cycle/ \
      -a -- sleep 1

     Performance counter stats for 'system wide':

             12398 ali_drw_21000/perf_hif_wr/
          40160751 ali_drw_21000/perf_hif_rd/
               743 ali_drw_21000/perf_hif_rmw/
         500620725 ali_drw_21000/perf_cycle/
             12252 ali_drw_21080/perf_hif_wr/
          40161013 ali_drw_21080/perf_hif_rd/
               767 ali_drw_21080/perf_hif_rmw/
         500619340 ali_drw_21080/perf_cycle/
             11960 ali_drw_23000/perf_hif_wr/
          40159522 ali_drw_23000/perf_hif_rd/
               737 ali_drw_23000/perf_hif_rmw/
         500613505 ali_drw_23000/perf_cycle/
             12044 ali_drw_23080/perf_hif_wr/
          40159066 ali_drw_23080/perf_hif_rd/
               773 ali_drw_23080/perf_hif_rmw/
         500607620 ali_drw_23080/perf_cycle/
             12698 ali_drw_25000/perf_hif_wr/
          40160138 ali_drw_25000/perf_hif_rd/
               709 ali_drw_25000/perf_hif_rmw/
         500601240 ali_drw_25000/perf_cycle/
             12521 ali_drw_25080/perf_hif_wr/
          40160169 ali_drw_25080/perf_hif_rd/
               727 ali_drw_25080/perf_hif_rmw/
         500594755 ali_drw_25080/perf_cycle/
             12171 ali_drw_27000/perf_hif_wr/
          40159404 ali_drw_27000/perf_hif_rd/
               706 ali_drw_27000/perf_hif_rmw/
         500589945 ali_drw_27000/perf_cycle/
             12290 ali_drw_27080/perf_hif_wr/
          40157620 ali_drw_27080/perf_hif_rd/
               710 ali_drw_27080/perf_hif_rmw/
         500583305 ali_drw_27080/perf_cycle/

       1.000923276 seconds time elapsed

    >>> 40159522*8*64/1000/1000.0
    20561.675

    # Set CPU and memory to the same NUMA node
    numactl --cpubind=0 --membind=0 ./bw_mem 40960M rd
    40960.00 20507.82
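The Python one-liner in this section scales a single sub-channel's perf_hif_rd count by 8 and treats the interval as exactly one second. A slightly more careful sketch sums the count over all eight sub-channels and divides by the elapsed time reported by perf. Using wall-clock time in place of DDRC_Cycle / DDRC_Freq is a simplification made here, not something the formula in section 2 prescribes. The function name `bandwidth_mb_per_s` is illustrative.

```python
HIF_BYTES = 64  # DDRC_WIDTH: counters are in units of 64 bytes

# perf_hif_rd counts of the eight Die 0 sub-channels, copied from the output above
RD_COUNTS = [40160751, 40161013, 40159522, 40159066,
             40160138, 40160169, 40159404, 40157620]
ELAPSED_S = 1.000923276  # "seconds time elapsed" from the same run


def bandwidth_mb_per_s(counts, elapsed_s, width_bytes=HIF_BYTES):
    """Aggregate bandwidth in MB/s (10^6 bytes per second)."""
    return sum(counts) * width_bytes / elapsed_s / 1e6


read_bw = bandwidth_mb_per_s(RD_COUNTS, ELAPSED_S)
print(f"{read_bw:.1f} MB/s")
```

With these inputs the result falls between the one-liner's 20561.675 MB/s and the 20507.82 MB/s reported by bw_mem.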
4.3 C1M1 rd
    # First, run bw_mem as a background workload:
    # numactl --cpubind=1 --membind=1 ./bw_mem 40960M rd
    # Then run the perf command in another console:
    perf stat \
      -e ali_drw_40021000/perf_hif_wr/ -e ali_drw_40021000/perf_hif_rd/ -e ali_drw_40021000/perf_hif_rmw/ -e ali_drw_40021000/perf_cycle/ \
      -e ali_drw_40021080/perf_hif_wr/ -e ali_drw_40021080/perf_hif_rd/ -e ali_drw_40021080/perf_hif_rmw/ -e ali_drw_40021080/perf_cycle/ \
      -e ali_drw_40023000/perf_hif_wr/ -e ali_drw_40023000/perf_hif_rd/ -e ali_drw_40023000/perf_hif_rmw/ -e ali_drw_40023000/perf_cycle/ \
      -e ali_drw_40023080/perf_hif_wr/ -e ali_drw_40023080/perf_hif_rd/ -e ali_drw_40023080/perf_hif_rmw/ -e ali_drw_40023080/perf_cycle/ \
      -e ali_drw_40025000/perf_hif_wr/ -e ali_drw_40025000/perf_hif_rd/ -e ali_drw_40025000/perf_hif_rmw/ -e ali_drw_40025000/perf_cycle/ \
      -e ali_drw_40025080/perf_hif_wr/ -e ali_drw_40025080/perf_hif_rd/ -e ali_drw_40025080/perf_hif_rmw/ -e ali_drw_40025080/perf_cycle/ \
      -e ali_drw_40027000/perf_hif_wr/ -e ali_drw_40027000/perf_hif_rd/ -e ali_drw_40027000/perf_hif_rmw/ -e ali_drw_40027000/perf_cycle/ \
      -e ali_drw_40027080/perf_hif_wr/ -e ali_drw_40027080/perf_hif_rd/ -e ali_drw_40027080/perf_hif_rmw/ -e ali_drw_40027080/perf_cycle/ \
      -a -- sleep 1

     Performance counter stats for 'system wide':

              2329 ali_drw_40021000/perf_hif_wr/
          40071983 ali_drw_40021000/perf_hif_rd/
                58 ali_drw_40021000/perf_hif_rmw/
         500572165 ali_drw_40021000/perf_cycle/
              2374 ali_drw_40021080/perf_hif_wr/
          40071737 ali_drw_40021080/perf_hif_rd/
                39 ali_drw_40021080/perf_hif_rmw/
         500569615 ali_drw_40021080/perf_cycle/
              2330 ali_drw_40023000/perf_hif_wr/
          40071063 ali_drw_40023000/perf_hif_rd/
                74 ali_drw_40023000/perf_hif_rmw/
         500565635 ali_drw_40023000/perf_cycle/
              2372 ali_drw_40023080/perf_hif_wr/
          40070344 ali_drw_40023080/perf_hif_rd/
                54 ali_drw_40023080/perf_hif_rmw/
         500561355 ali_drw_40023080/perf_cycle/
              2362 ali_drw_40025000/perf_hif_wr/
          40070906 ali_drw_40025000/perf_hif_rd/
                45 ali_drw_40025000/perf_hif_rmw/
         500557480 ali_drw_40025000/perf_cycle/
              2385 ali_drw_40025080/perf_hif_wr/
          40070168 ali_drw_40025080/perf_hif_rd/
                46 ali_drw_40025080/perf_hif_rmw/
         500552550 ali_drw_40025080/perf_cycle/
              2333 ali_drw_40027000/perf_hif_wr/
          40069233 ali_drw_40027000/perf_hif_rd/
                28 ali_drw_40027000/perf_hif_rmw/
         500548745 ali_drw_40027000/perf_cycle/
              2211 ali_drw_40027080/perf_hif_wr/
          40068227 ali_drw_40027080/perf_hif_rd/
                30 ali_drw_40027080/perf_hif_rmw/
         500544450 ali_drw_40027080/perf_cycle/

       1.000863258 seconds time elapsed

    >>> 40070906*8*64/1000/1000.0
    20516.303

    numactl --cpubind=1 --membind=1 ./bw_mem 40960M rd
    40960.00 20492.53
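Copying counter values by hand from output this long is error-prone. The sketch below parses text in the shape shown in this section (a count followed by `device/event/`) into a nested dictionary. The regular expression and the function name `parse_perf_stat` are illustrative. Accepting thousands separators is a precaution, since perf may print them depending on locale.

```python
import re
from collections import defaultdict

# Matches lines such as "  40071983 ali_drw_40021000/perf_hif_rd/"
LINE_RE = re.compile(r"^\s*([\d,]+)\s+(ali_drw_\w+)/(\w+)/")


def parse_perf_stat(text):
    """Return {device: {event: count}} for every DDR PMU line in the text."""
    counts = defaultdict(dict)
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            value, device, event = match.groups()
            counts[device][event] = int(value.replace(",", ""))
    return dict(counts)


SAMPLE = """\
      2329 ali_drw_40021000/perf_hif_wr/
  40071983 ali_drw_40021000/perf_hif_rd/
        58 ali_drw_40021000/perf_hif_rmw/
 500572165 ali_drw_40021000/perf_cycle/

       1.000863258 seconds time elapsed
"""

parsed = parse_perf_stat(SAMPLE)
print(parsed["ali_drw_40021000"]["perf_hif_rd"])  # 40071983
```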
4.4 C0M0 fwr
    # First, run bw_mem as a background workload:
    # numactl --cpubind=0 --membind=0 ./bw_mem 40960M fwr
    # Then run the perf command in another console:
    perf stat \
      -e ali_drw_21000/perf_hif_wr/ -e ali_drw_21000/perf_hif_rd/ -e ali_drw_21000/perf_hif_rmw/ -e ali_drw_21000/perf_cycle/ \
      -e ali_drw_21080/perf_hif_wr/ -e ali_drw_21080/perf_hif_rd/ -e ali_drw_21080/perf_hif_rmw/ -e ali_drw_21080/perf_cycle/ \
      -e ali_drw_23000/perf_hif_wr/ -e ali_drw_23000/perf_hif_rd/ -e ali_drw_23000/perf_hif_rmw/ -e ali_drw_23000/perf_cycle/ \
      -e ali_drw_23080/perf_hif_wr/ -e ali_drw_23080/perf_hif_rd/ -e ali_drw_23080/perf_hif_rmw/ -e ali_drw_23080/perf_cycle/ \
      -e ali_drw_25000/perf_hif_wr/ -e ali_drw_25000/perf_hif_rd/ -e ali_drw_25000/perf_hif_rmw/ -e ali_drw_25000/perf_cycle/ \
      -e ali_drw_25080/perf_hif_wr/ -e ali_drw_25080/perf_hif_rd/ -e ali_drw_25080/perf_hif_rmw/ -e ali_drw_25080/perf_cycle/ \
      -e ali_drw_27000/perf_hif_wr/ -e ali_drw_27000/perf_hif_rd/ -e ali_drw_27000/perf_hif_rmw/ -e ali_drw_27000/perf_cycle/ \
      -e ali_drw_27080/perf_hif_wr/ -e ali_drw_27080/perf_hif_rd/ -e ali_drw_27080/perf_hif_rmw/ -e ali_drw_27080/perf_cycle/ \
      -a -- sleep 1

     Performance counter stats for 'system wide':

          42910737 ali_drw_21000/perf_hif_wr/
            108397 ali_drw_21000/perf_hif_rd/
               495 ali_drw_21000/perf_hif_rmw/
         500708510 ali_drw_21000/perf_cycle/
          42911223 ali_drw_21080/perf_hif_wr/
            117280 ali_drw_21080/perf_hif_rd/
               515 ali_drw_21080/perf_hif_rmw/
         500706780 ali_drw_21080/perf_cycle/
          42910038 ali_drw_23000/perf_hif_wr/
            109179 ali_drw_23000/perf_hif_rd/
               516 ali_drw_23000/perf_hif_rmw/
         500702100 ali_drw_23000/perf_cycle/
          42911620 ali_drw_23080/perf_hif_wr/
            111038 ali_drw_23080/perf_hif_rd/
               523 ali_drw_23080/perf_hif_rmw/
         500697340 ali_drw_23080/perf_cycle/
          42910435 ali_drw_25000/perf_hif_wr/
            111748 ali_drw_25000/perf_hif_rd/
               469 ali_drw_25000/perf_hif_rmw/
         500692500 ali_drw_25000/perf_cycle/
          42908786 ali_drw_25080/perf_hif_wr/
            110177 ali_drw_25080/perf_hif_rd/
               456 ali_drw_25080/perf_hif_rmw/
         500686595 ali_drw_25080/perf_cycle/
          42908903 ali_drw_27000/perf_hif_wr/
            114093 ali_drw_27000/perf_hif_rd/
               490 ali_drw_27000/perf_hif_rmw/
         500681405 ali_drw_27000/perf_cycle/
          42908156 ali_drw_27080/perf_hif_wr/
            109668 ali_drw_27080/perf_hif_rd/
               489 ali_drw_27080/perf_hif_rmw/
         500676420 ali_drw_27080/perf_cycle/

       1.001100811 seconds time elapsed

    >>> (42908156+489)*8*64/1000/1000.0
    21969.226

    numactl --cpubind=0 --membind=0 ./bw_mem 40960M fwr
    40960.00 21936.50
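For writes, the formula in section 2 adds perf_hif_rmw to perf_hif_wr. The sketch below applies that to all eight Die 0 sub-channels from this run, again dividing by the elapsed time reported by perf rather than by DDRC_Cycle / DDRC_Freq (a simplification assumed here). The name `write_bandwidth_mb_per_s` is illustrative.

```python
HIF_BYTES = 64  # DDRC_WIDTH: counters are in units of 64 bytes

# (perf_hif_wr, perf_hif_rmw) per Die 0 sub-channel, copied from the output above
WR_RMW = [
    (42910737, 495), (42911223, 515), (42910038, 516), (42911620, 523),
    (42910435, 469), (42908786, 456), (42908903, 490), (42908156, 489),
]
ELAPSED_S = 1.001100811  # "seconds time elapsed" from the same run


def write_bandwidth_mb_per_s(wr_rmw_pairs, elapsed_s, width_bytes=HIF_BYTES):
    """Write bandwidth in MB/s: (wr + rmw) transactions * 64 bytes / time."""
    transactions = sum(wr + rmw for wr, rmw in wr_rmw_pairs)
    return transactions * width_bytes / elapsed_s / 1e6


write_bw = write_bandwidth_mb_per_s(WR_RMW, ELAPSED_S)
print(f"{write_bw:.1f} MB/s")
```

The read-modify-write count is small compared with plain writes in this workload, so it changes the result only marginally.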
4.5 C1M1 fwr
    # First, run bw_mem as a background workload:
    # numactl --cpubind=1 --membind=1 ./bw_mem 40960M fwr
    # Then run the perf command in another console:
    perf stat \
      -e ali_drw_40021000/perf_hif_wr/ -e ali_drw_40021000/perf_hif_rd/ -e ali_drw_40021000/perf_hif_rmw/ -e ali_drw_40021000/perf_cycle/ \
      -e ali_drw_40021080/perf_hif_wr/ -e ali_drw_40021080/perf_hif_rd/ -e ali_drw_40021080/perf_hif_rmw/ -e ali_drw_40021080/perf_cycle/ \
      -e ali_drw_40023000/perf_hif_wr/ -e ali_drw_40023000/perf_hif_rd/ -e ali_drw_40023000/perf_hif_rmw/ -e ali_drw_40023000/perf_cycle/ \
      -e ali_drw_40023080/perf_hif_wr/ -e ali_drw_40023080/perf_hif_rd/ -e ali_drw_40023080/perf_hif_rmw/ -e ali_drw_40023080/perf_cycle/ \
      -e ali_drw_40025000/perf_hif_wr/ -e ali_drw_40025000/perf_hif_rd/ -e ali_drw_40025000/perf_hif_rmw/ -e ali_drw_40025000/perf_cycle/ \
      -e ali_drw_40025080/perf_hif_wr/ -e ali_drw_40025080/perf_hif_rd/ -e ali_drw_40025080/perf_hif_rmw/ -e ali_drw_40025080/perf_cycle/ \
      -e ali_drw_40027000/perf_hif_wr/ -e ali_drw_40027000/perf_hif_rd/ -e ali_drw_40027000/perf_hif_rmw/ -e ali_drw_40027000/perf_cycle/ \
      -e ali_drw_40027080/perf_hif_wr/ -e ali_drw_40027080/perf_hif_rd/ -e ali_drw_40027080/perf_hif_rmw/ -e ali_drw_40027080/perf_cycle/ \
      -a -- sleep 1

     Performance counter stats for 'system wide':

          42906048 ali_drw_40021000/perf_hif_wr/
             33939 ali_drw_40021000/perf_hif_rd/
                76 ali_drw_40021000/perf_hif_rmw/
         500629355 ali_drw_40021000/perf_cycle/
          42905967 ali_drw_40021080/perf_hif_wr/
             34018 ali_drw_40021080/perf_hif_rd/
                63 ali_drw_40021080/perf_hif_rmw/
         500631900 ali_drw_40021080/perf_cycle/
          42905422 ali_drw_40023000/perf_hif_wr/
             33843 ali_drw_40023000/perf_hif_rd/
                75 ali_drw_40023000/perf_hif_rmw/
         500628540 ali_drw_40023000/perf_cycle/
          42905547 ali_drw_40023080/perf_hif_wr/
             33858 ali_drw_40023080/perf_hif_rd/
                68 ali_drw_40023080/perf_hif_rmw/
         500623970 ali_drw_40023080/perf_cycle/
          42905230 ali_drw_40025000/perf_hif_wr/
             34028 ali_drw_40025000/perf_hif_rd/
                56 ali_drw_40025000/perf_hif_rmw/
         500620630 ali_drw_40025000/perf_cycle/
          42904734 ali_drw_40025080/perf_hif_wr/
             34141 ali_drw_40025080/perf_hif_rd/
                61 ali_drw_40025080/perf_hif_rmw/
         500615840 ali_drw_40025080/perf_cycle/
          42903390 ali_drw_40027000/perf_hif_wr/
             33712 ali_drw_40027000/perf_hif_rd/
                84 ali_drw_40027000/perf_hif_rmw/
         500610635 ali_drw_40027000/perf_cycle/
          42903975 ali_drw_40027080/perf_hif_wr/
             33916 ali_drw_40027080/perf_hif_rd/
               106 ali_drw_40027080/perf_hif_rmw/
         500606645 ali_drw_40027080/perf_cycle/

       1.000953335 seconds time elapsed

    >>> (42903975+106)*8*64/1000/1000.0
    21966.889

    # numactl --cpubind=1 --membind=1 ./bw_mem 40960M fwr
    40960.00 21934.51