Use stricter host buffer alignment (64B) required by modern CPUs. #19

Closed


pioto1225

No description provided.

@pioto1225
Author

Hi,

I would like to propose a change that improves the reported PCIe D2H (device-to-host) bandwidth in OpenCL on some CPUs.
Before the change:

$ ./make.sh 
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | Intel(R) Arc(TM) A770 Graphics                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Intel(R) Arc(TM) A770 Graphics                             |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 24.39.31294.12 (Linux)                                     |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 512 at 2400 MHz (4096 cores, 19.661 TFLOPs/s)              |
| Memory, Cache  | 16287 MB VRAM, 16384 KB global / 64 KB local               |
| Buffer Limits  | 4095 MB global, 4194296 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                          not supported        |
| FP32  compute                                        11.650 TFLOPs/s (2/3 ) |
| FP16  compute                                        18.109 TFLOPs/s ( 1x ) |
| INT64 compute                                         1.237  TIOPs/s (1/16) |
| INT32 compute                                         5.454  TIOPs/s (1/4 ) |
| INT16 compute                                        30.325  TIOPs/s ( 2x ) |
| INT8  compute                                        11.432  TIOPs/s (1/2 ) |
| Memory Bandwidth ( coalesced read      )                        220.53 GB/s |
| Memory Bandwidth ( coalesced      write)                        432.96 GB/s |
| Memory Bandwidth (misaligned read      )                        397.36 GB/s |
| Memory Bandwidth (misaligned      write)                        451.17 GB/s |
| PCIe   Bandwidth (send                 )                         17.11 GB/s |
| PCIe   Bandwidth (   receive           )                          1.68 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)    3.08 GB/s |
|-----------------------------------------------------------------------------|
'-----------------------------------------------------------------------------'

and after:

$ ./make.sh 
.-----------------------------------------------------------------------------.
|----------------.------------------------------------------------------------|
| Device ID    0 | Intel(R) Arc(TM) A770 Graphics                             |
|----------------'------------------------------------------------------------|
|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Intel(R) Arc(TM) A770 Graphics                             |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 24.39.31294.12 (Linux)                                     |
| OpenCL Version | OpenCL C 1.2                                               |
| Compute Units  | 512 at 2400 MHz (4096 cores, 19.661 TFLOPs/s)              |
| Memory, Cache  | 16287 MB VRAM, 16384 KB global / 64 KB local               |
| Buffer Limits  | 4095 MB global, 4194296 KB constant                        |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                          not supported        |
| FP32  compute                                        11.594 TFLOPs/s (2/3 ) |
| FP16  compute                                        17.806 TFLOPs/s ( 1x ) |
| INT64 compute                                         1.243  TIOPs/s (1/16) |
| INT32 compute                                         5.418  TIOPs/s (1/4 ) |
| INT16 compute                                        30.375  TIOPs/s ( 2x ) |
| INT8  compute                                        11.463  TIOPs/s (1/2 ) |
| Memory Bandwidth ( coalesced read      )                        218.42 GB/s |
| Memory Bandwidth ( coalesced      write)                        419.85 GB/s |
| Memory Bandwidth (misaligned read      )                        387.50 GB/s |
| Memory Bandwidth (misaligned      write)                        434.06 GB/s |
| PCIe   Bandwidth (send                 )                         17.32 GB/s |
| PCIe   Bandwidth (   receive           )                         20.91 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen4 x16)   19.45 GB/s |
|-----------------------------------------------------------------------------|
'-----------------------------------------------------------------------------'

please see:

@ProjectPhysX
Owner

Hi @pioto1225,

thanks a lot for this finding! Took some time to reproduce; I see a moderate improvement in D2H bandwidth on Raptor Lake, not nearly as much as you observe on Arrow Lake. I've made the changes in this commit, with some additions:

  • host_alignment doesn't have to be a member variable
  • additional 64 Byte padding is only applied when using CL_MEM_USE_HOST_PTR
  • simplified delete_host_buffer(), as now host_buffer_unaligned is always used
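
The padding scheme described in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the commit: the `HostBuffer` struct and `allocate_host_buffer()` are hypothetical, and only the names `host_alignment`, `host_buffer_unaligned` and `delete_host_buffer()` are borrowed from the points above. The aligned pointer is assumed to be the one that would be handed to OpenCL together with CL_MEM_USE_HOST_PTR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Assumed alignment: one 64-byte cache line. A plain constant, not a member variable.
constexpr std::size_t host_alignment = 64;

// Hypothetical container for illustration only.
struct HostBuffer {
    void* host_buffer_unaligned = nullptr; // pointer returned by malloc, always used for freeing
    void* host_buffer = nullptr;           // 64-byte-aligned pointer inside the allocation
    std::size_t capacity = 0;              // usable bytes starting at host_buffer
};

inline HostBuffer allocate_host_buffer(std::size_t bytes) {
    HostBuffer b;
    // Round the requested size up to a multiple of the alignment.
    const std::size_t padded = (bytes + host_alignment - 1) / host_alignment * host_alignment;
    // Over-allocate by one alignment unit so an aligned start always fits.
    b.host_buffer_unaligned = std::malloc(padded + host_alignment);
    if (b.host_buffer_unaligned == nullptr) return b;
    // Round the raw address up to the next multiple of the alignment.
    const std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(b.host_buffer_unaligned);
    const std::uintptr_t aligned = (raw + host_alignment - 1) / host_alignment * host_alignment;
    b.host_buffer = reinterpret_cast<void*>(aligned);
    b.capacity = padded;
    return b;
}

inline void delete_host_buffer(HostBuffer& b) {
    std::free(b.host_buffer_unaligned); // free(nullptr) is a no-op
    b = HostBuffer{};
}
```

Keeping the unaligned pointer around means the free path never has to guess which pointer the allocator originally returned.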

Thank you, and kind regards,
Moritz

@pioto1225
Author

Hi Moritz,

What GPU are you using with Raptor Lake, and what improvement do you observe after adding 64B alignment?
I had an i9-14900K before, and I recall about 8 GB/s D2H with the A770 before adding alignment (much better than Arrow Lake's 1-2 GB/s). I am wondering whether Raptor Lake's D2H and H2D speeds now match.

Many thanks.
Piotr

@ProjectPhysX
Owner

Hi @pioto1225,

I'm pairing an i7-13700K with an A770 in PCIe 4.0 x8 (bifurcation with the 2nd PCIe x8 slot). With an unaligned pointer I measured ~10GB/s H2D and ~7GB/s D2H. With a 64B-aligned pointer it is ~10GB/s for both.

Strangely, the B580 in the 2nd PCIe 4.0 x8 slot didn't show the slowdown in the first place; it reaches ~12GB/s for both D2H and H2D, with or without 64B alignment.

Have you seen the D2H slowdown on other GPUs? Could it be specific to Arc Alchemist?

Kind regards,
Moritz

@pioto1225
Author

Hi there,

Thanks for the info!
I reckon the A770 should do 20-21 GB/s in both directions when used in a PCIe 4.0 x16 configuration with alignment ON. With alignment OFF (and PCIe 4.0 x16) I expect to see H2D at ~20GB/s and D2H drop back to ~7GB/s.
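
For context on these expectations, the raw per-direction line rate of a PCIe link can be estimated from the per-lane transfer rate, the line encoding and the lane count. The sketch below is only an upper bound; the function name is made up for illustration, and measured bandwidth is lower because of packet headers and other protocol overhead.

```cpp
#include <cassert>
#include <cmath>

// Raw per-direction line rate in GB/s for PCIe generations 1 to 5.
// Gen1/Gen2 use 8b/10b encoding, Gen3 and later use 128b/130b.
// Returns 0.0 for generations this sketch does not cover.
inline double pcie_raw_gbytes_per_s(int generation, int lanes) {
    double gt_per_s = 0.0; // giga-transfers per second per lane
    double encoding = 0.0; // fraction of transferred bits that carry data
    switch (generation) {
        case 1: gt_per_s = 2.5;  encoding = 8.0 / 10.0;    break;
        case 2: gt_per_s = 5.0;  encoding = 8.0 / 10.0;    break;
        case 3: gt_per_s = 8.0;  encoding = 128.0 / 130.0; break;
        case 4: gt_per_s = 16.0; encoding = 128.0 / 130.0; break;
        case 5: gt_per_s = 32.0; encoding = 128.0 / 130.0; break;
        default: return 0.0;
    }
    return gt_per_s * encoding * lanes / 8.0; // 8 bits per byte
}
```

With this estimate, a Gen4 x16 link has roughly twice the raw rate of Gen4 x8 (about 31.5 vs. 15.8 GB/s), which is in line with expecting roughly double the ~10GB/s measured on the x8 slot.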

Very interesting point about the B580; it is a shame that it has only an x8 interface. I am hoping for a B770 to show up, which I would like to get. Unfortunately I do not have any other dGPU to test.

I had a feeling that this was host related, with Arrow Lake being far more sensitive to the lack of host buffer alignment (1.6GB/s D2H) than Raptor Lake (/Refresh) with 7GB/s D2H. But your data might prove me wrong.
One thing to keep in mind is that even with the old version of the software (without the explicit alignment request), the allocated buffer could still happen to be aligned, and then the transfer speed will be optimal. You might want to experiment with allocating a larger aligned buffer and adding an unaligned offset (8B) to make sure the buffer is not aligned. This should help to settle it.
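
The experiment suggested above could be set up along these lines. It is a rough sketch with hypothetical helper names (`is_aligned`, `TestBuffers`, `make_test_buffers`): it over-allocates, derives a 64-byte-aligned pointer, and then adds an 8-byte offset so that the second pointer is guaranteed not to be 64-byte aligned. Benchmarking transfers from both pointers would then separate the aligned and misaligned cases reliably.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// True if p is a multiple of alignment (alignment must be non-zero).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical bundle of pointers for the experiment.
struct TestBuffers {
    void* raw = nullptr;        // pointer to pass to std::free
    void* aligned = nullptr;    // 64-byte aligned, at least `bytes` usable
    void* misaligned = nullptr; // aligned + offset, at least `bytes` usable
};

// offset_bytes is assumed to satisfy 0 < offset_bytes < 64.
inline TestBuffers make_test_buffers(std::size_t bytes, std::size_t offset_bytes = 8) {
    constexpr std::size_t alignment = 64;
    TestBuffers t;
    // Headroom: up to 63 bytes for rounding up, plus up to 63 bytes for the offset.
    t.raw = std::malloc(bytes + 2 * alignment);
    if (t.raw == nullptr) return t;
    const std::uintptr_t raw = reinterpret_cast<std::uintptr_t>(t.raw);
    const std::uintptr_t up = (raw + alignment - 1) / alignment * alignment;
    t.aligned = reinterpret_cast<void*>(up);
    t.misaligned = static_cast<unsigned char*>(t.aligned) + offset_bytes;
    return t;
}
```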

Nevertheless, it looks like the B580 has a better PCIe implementation.
Could you please share the lspci -vvv -s output for the B580? I am wondering what MPS (Max Payload Size) they use, and whether they report the PCIe speed correctly.
Many thanks.
Piotr

@ProjectPhysX
Owner

Hi @pioto1225, here's the lspci -vvv output:

03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Acer Incorporated [ALI] Device 3888
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 220
        IOMMU group: 21
        Region 0: Memory at 78000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at 6800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at 79000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express (v2) Endpoint, IntMsgNum 0
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W TEE-IO-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, LnkDisable- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1
                        TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
                Address: 00000000fee00ed8  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [d0] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [420 v1] Physical Resizable BAR
                BAR 2: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
        Capabilities: [400 v1] Latency Tolerance Reporting
                Max snoop latency: 15728640ns
                Max no snoop latency: 15728640ns
        Kernel driver in use: i915
        Kernel modules: i915, xe

07:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Intel Graphics] (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1100
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 225
        IOMMU group: 26
        Region 0: Memory at 76000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at 6000000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at 77000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express (v2) Endpoint, IntMsgNum 0
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W TEE-IO-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1
                        TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
                Address: 00000000fee00f38  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [d0] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [110 v1] Null
        Capabilities: [200 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable-, Smallest Translation Unit: 00
        Capabilities: [420 v1] Physical Resizable BAR
                BAR 2: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
        Capabilities: [400 v1] Latency Tolerance Reporting
                Max snoop latency: 15728640ns
                Max no snoop latency: 15728640ns
        Kernel driver in use: xe
        Kernel modules: xe

77:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 11df
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 228
        IOMMU group: 37
        Region 0: Memory at 7a000000 (32-bit, non-prefetchable) [size=16M]
        Region 1: Memory at 6c60000000 (64-bit, prefetchable) [size=256M]
        Region 3: Memory at 6c70000000 (64-bit, prefetchable) [size=32M]
        Region 5: I/O ports at 3000 [size=128]
        Expansion ROM at 7b000000 [virtual] [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee00f78  Data: 0000
        Capabilities: [78] Express (v2) Legacy Endpoint, IntMsgNum 0
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- TEE-IO-
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (downgraded), Width x4 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via message, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [250 v1] Latency Tolerance Reporting
                Max snoop latency: 34326183936ns
                Max no snoop latency: 34326183936ns
        Capabilities: [128 v1] Power Budgeting <?>
        Capabilities: [420 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP-
                        ECRC- UnsupReq- ACSViol- UncorrIntErr- BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+
                        ECRC- UnsupReq- ACSViol- UncorrIntErr+ BlockedTLP- AtomicOpBlocked- TLPBlockedErr-
                        PoisonTLPBlocked- DMWrReqBlocked- IDECheck- MisIDETLP- PCRC_CHECK- TLPXlatBlocked-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr- CorrIntErr- HeaderOF-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+ CorrIntErr- HeaderOF+
                AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
        Capabilities: [900 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

@pioto1225
Author

Thanks!
It is disappointing that Battlemage is still plagued by the same issue as Arc Alchemist
(see ref: https://community.intel.com/t5/Graphics/Intel-Arc-A770-PCIe-Speed-2-5GT-s-Width-x1/m-p/1455448):
both Intel cards incorrectly report their PCIe link capability and status, claiming PCIe 1.0 x1:

LnkCap: Port #0, Speed 2.5GT/s, Width x1,
...
LnkSta: Speed 2.5GT/s, Width x1

Thanks for posting the Nvidia card too. It correctly advertises its PCIe 3.0 x16 capability:
LnkCap: Port #0, Speed 8GT/s, Width x16,

The issue is only with the reporting on the Intel cards; they do operate at the correct speed, as shown in your PCIe benchmarks.

One of the reasons Battlemage is faster than Arc Alchemist in D2H and H2D transfers is its higher MPS (Battlemage: 256 bytes, Alchemist: 128 bytes).
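
To illustrate why a larger MPS helps, the share of link traffic that is actual payload can be approximated as payload / (payload + per-packet overhead). The sketch below assumes a round figure of 24 bytes of overhead per packet (framing, header, CRC); that value and the function name are assumptions for illustration, and the real overhead depends on the PCIe generation and header format.

```cpp
#include <cassert>
#include <cmath>

// Approximate fraction of link bytes that carry payload when every packet
// is filled up to the Max Payload Size.
// overhead_bytes = 24 is an assumed round number, not a measured value.
inline double tlp_payload_efficiency(unsigned max_payload_bytes, unsigned overhead_bytes = 24) {
    const double total = static_cast<double>(max_payload_bytes) + overhead_bytes;
    if (total == 0.0) return 0.0; // avoid dividing by zero
    return max_payload_bytes / total;
}
```

Under that assumption, 128-byte payloads give about 84% efficiency and 256-byte payloads about 91%, so a higher MPS can account for part of the difference, though probably not all of it.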

Regards,
Piotr

@ProjectPhysX
Owner

Hi @pioto1225,

thanks for pointing out the PCIe interface reporting issue! I'll forward this internally at Intel.

Cheers,
Moritz
