Thursday, July 14, 2016

Scale-Out Vs Scale-Up

What is Scale-Out Architecture
Scale-out, or horizontal scaling, means adding more identical units of the same type (for example, cloud controllers) so that total capacity grows roughly linearly with the number of units.
[Figure: "Current" and "New" configurations]

This way of scaling is usually cheaper overall and can, in principle, keep scaling without a fixed upper bound, although in practice there are usually limits imposed by software or other attributes of an environment’s infrastructure.
  • Pros
    • Much cheaper than scaling vertically
    • Easier to make fault tolerant
    • Easy to upgrade
  • Cons
    • More licensing fees
    • Bigger footprint in the Data Center
    • Higher utility cost (Electricity and cooling)
    • Possible need for more networking equipment (switches/routers)


What is Scale-Up Architecture
Scale-up, or vertical scaling, means adding more powerful controller(s) to increase the overall throughput. Newer controllers typically come with more CPU power and more memory.

[Figure: "Current" and "New" configurations]

  • Pros
    • Lower power consumption than running multiple servers
    • Lower cooling costs than scaling horizontally
    • Generally less challenging to implement
    • Lower licensing costs
    • Sometimes uses less network hardware than scaling horizontally (a separate topic for a later discussion)
  • Cons
    • PRICE, PRICE, PRICE
    • Greater risk of hardware failure causing bigger outages
    • Generally severe vendor lock-in and limited upgradeability in the future


Scaling out takes the infrastructure you’ve got, and replicates it to work in parallel. This has the effect of increasing infrastructure capacity roughly linearly. Data centers often scale out using pods. Build a compute pod, spin up applications to use it, then scale out by building another pod to add capacity. Actual application performance may not be linear, as application architectures must be written to work effectively in a scale-out environment.
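
To make the gap between linear capacity and actual application performance concrete, here is a minimal sketch that uses Amdahl's law as a stand-in model. The choice of model and the names speedup, parallel_fraction and pods are my assumptions for illustration; real applications may behave quite differently.

```python
def speedup(parallel_fraction: float, pods: int) -> float:
    """Amdahl's law: overall speedup when only part of the work parallelizes.

    parallel_fraction is the share of the work (0.0 to 1.0) that can be
    spread across pods; the rest stays serial. Both names are hypothetical.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if pods < 1:
        raise ValueError("pods must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / pods)


if __name__ == "__main__":
    # Capacity grows linearly with pods; the modeled speedup does not.
    for pods in (1, 2, 4, 8, 16):
        print(pods, round(speedup(0.9, pods), 2))
```

Under this model, an application that is 90% parallel can never run more than 10 times faster, no matter how many pods are added.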

Application delivery controllers (A10 Networks, F5 Networks) are examples of networking tools that help with scaling out. ADCs host a virtual IP that is the front end to pool members (real servers) on the back end. As the demand for an application grows, the application can be scaled out by adding additional pool members behind the virtual IP.
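
The virtual IP and pool member idea can be sketched in a few lines. This is only a toy model, not the A10 or F5 API: the class name VirtualServer, its methods, the addresses, and the round-robin selection are all assumptions made for illustration (real ADCs offer several load balancing methods plus health checks).

```python
class VirtualServer:
    """Toy model of an ADC virtual IP fronting a pool of real servers."""

    def __init__(self, vip, members):
        self.vip = vip                # address the clients connect to
        self.members = list(members)  # pool members (real servers)
        self._next = 0                # round-robin counter

    def add_member(self, member):
        """Scale out: add one more real server behind the same VIP."""
        self.members.append(member)

    def pick(self):
        """Return the pool member that should take the next connection."""
        if not self.members:
            raise RuntimeError("no pool members available")
        member = self.members[self._next % len(self.members)]
        self._next += 1
        return member


if __name__ == "__main__":
    vs = VirtualServer("192.0.2.10", ["10.0.0.1", "10.0.0.2"])
    print([vs.pick() for _ in range(4)])
    vs.add_member("10.0.0.3")  # demand grew, so add a pool member
    print([vs.pick() for _ in range(3)])
```

Clients keep using the same virtual IP before and after add_member is called, which is the point of scaling out behind an ADC.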

Leaf-spine network architectures are also “scale out” designs. As a new pod or rack is installed, top of rack leaf switches plumbed to the spine layer add capacity.

Scaling up is taking what you’ve got, and replacing it with something more powerful. From a networking perspective, this could be taking a 1GbE switch, and replacing it with a 10GbE switch. Same number of switchports, but the bandwidth has been scaled up via bigger pipes. The 1GbE bottleneck has been relieved by the 10GbE replacement.

Scaling up is a viable scaling solution until it is impossible to scale up individual components any larger. For example, 10GbE is a practical limit for uplinking hosts to the network until such time as 25GbE and higher ports are readily available on hosts. In that context, what happens when 10GbE is no longer enough bandwidth for the uplinked host? Rather than scaling up, you scale out.
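
The arithmetic behind that decision is simple enough to sketch. The function name links_needed and the idealized assumption that traffic spreads evenly across parallel links are mine; in practice, link aggregation rarely balances perfectly.

```python
import math


def links_needed(demand_gbps: float, link_gbps: float) -> int:
    """Number of parallel links of a given speed needed to carry a demand.

    Assumes traffic spreads evenly across the links, which is optimistic.
    """
    if demand_gbps < 0 or link_gbps <= 0:
        raise ValueError("demand must be >= 0 and link speed must be > 0")
    return max(1, math.ceil(demand_gbps / link_gbps))


if __name__ == "__main__":
    # A host that needs 25 Gb/s but only has 10GbE ports available
    print(links_needed(25, 10))
```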

Tuesday, July 12, 2016

Is Chelsio dead?

The introduction of Intel Omni-Path has caused a lot of disruption in the high-speed networking world. Mellanox, the industry leader in high-speed networking, still has the largest market share in terms of shipped InfiniBand/Ethernet network cards and switches.

The new entrant has completely wiped out older players like Chelsio, which suggests that there is no place for older technologies in the high-speed market.

The next generation of Mellanox cards targets 200Gb/sec for both InfiniBand and Ethernet. Older players like Chelsio are still stuck at 40Gb/sec, caught up in their RoCE versus iWARP debate.


Everything you wanted to know about Intel Omni-Path Host Fabric Interface


What is Intel Omni-Path?
 
Intel Omni-Path is the technology behind Intel's push into high-speed networking for the HPC market.


The Hardware Host Fabric Interface, or the hfi
 
Once the Omni-Path HFI is installed in a Linux system, it shows up in the lspci output as follows:

[root@sjsc-xxx ~]# lspci | grep -i Omni-Path
82:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 10)


For a detailed view of the hardware spec:

[root@sjsc-xxx ~]# lspci -vvv -s 82:00.0
82:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 10)
Subsystem: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete]
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 11
Region 0: Memory at ec000000 (64-bit, non-prefetchable) [size=64M]
Expansion ROM at f0000000 [disabled] [size=128K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <8us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L0s <4us, L1 <64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 4s to 13s, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b0] MSI-X: Enable- Count=256 Masked-
Vector table: BAR=0 offset=00100000
PBA: BAR=0 offset=00110000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [148 v1] #19
Capabilities: [178 v1] Transaction Processing Hints
Device specific mode supported
No steering table available

[root@sjsc-xxx ~]# 
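
One detail worth noticing in the output above: LnkCap reports 8GT/s (PCIe Gen3) while LnkSta reports 5GT/s (PCIe Gen2), so on this system the link has trained below what the card can do. Here is a small sketch that spots this automatically. The function names and the regular expression are my own, and the sample text is the two relevant lines copied from the output above.

```python
import re

SAMPLE = """\
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L1, Exit Latency L0s <4us, L1 <64us
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""


def _speed_and_width(label, text):
    """Extract (speed in GT/s, lane width) from a LnkCap or LnkSta line."""
    match = re.search(label + r":.*?Speed ([\d.]+)GT/s.*?Width x(\d+)", text)
    if match is None:
        raise ValueError("no %s line found" % label)
    return float(match.group(1)), int(match.group(2))


def link_report(lspci_text):
    """Compare what the device can do (LnkCap) with what it negotiated (LnkSta)."""
    capable = _speed_and_width("LnkCap", lspci_text)
    current = _speed_and_width("LnkSta", lspci_text)
    below = current[0] < capable[0] or current[1] < capable[1]
    return {"capable": capable, "current": current, "below_capability": below}


if __name__ == "__main__":
    print(link_report(SAMPLE))
```

To use it on a live system, feed it the text of lspci -vvv -s for the device instead of SAMPLE.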



The Software Stack
Omni-Path drivers can be downloaded from the Intel Download Center website:
https://downloadcenter.intel.com/product/92007/Intel-Omni-Path-Host-Fabric-Interface-Adapter-100-Series-1-Port-PCIe-x16

The Intel drivers for the hfi are at https://downloadcenter.intel.com/download/26064/Intel-Omni-Path-Fabric-Software-Including-Intel-Omni-Path-Host-Fabric-Interface-Driver-?product=92007

The driver package contains the hfi1 driver along with the firmware for the host fabric interface (hfi).

Once the low-level driver and firmware are loaded properly and an Omni-Path compatible cable is connected, the port comes up to the active state, as the kernel log shows:

[root@localhost ~]#


[   12.964126] hfi1 0000:82:00.0: hfi1_0: set_link_state: current INIT, new ARMED
[   12.964131] hfi1 0000:82:00.0: hfi1_0: logical state changed to PORT_ARMED (0x3)
[   12.964134] hfi1 0000:82:00.0: hfi1_0: send_idle_message: sending idle message 0x103
[   12.964212] hfi1 0000:82:00.0: hfi1_0: read_idle_message: read idle message 0x103
[   12.964215] hfi1 0000:82:00.0: hfi1_0: handle_sma_message: SMA message 0x1
[   12.964681] hfi1 0000:82:00.0: hfi1_0: set_link_state: current ARMED, new ACTIVE
[   12.964684] hfi1 0000:82:00.0: hfi1_0: logical state changed to PORT_ACTIVE (0x4)
[   12.964697] hfi1 0000:82:00.0: hfi1_0: send_idle_message: sending idle message 0x203
[   12.965279] hfi1 0000:82:00.0: hfi1_0: read_idle_message: read idle message 0x203
[   12.965281] hfi1 0000:82:00.0: hfi1_0: handle_sma_message: SMA message 0x2
[   16.143492] hfi1 0000:82:00.0: hfi1_0: Switching to NO_DMA_RTAIL
[root@localhost ~]#

For Intel Omni-Path to work, the port on the hfi card needs to be connected to an Omni-Path switch, or directly to another hfi card on another system for point-to-point access.
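
The set_link_state lines in the kernel log above show the port moving from INIT to ARMED to ACTIVE. Below is a rough sketch for pulling those transitions out of a log. The regular expression is written against the lines shown in this post only, the function names are mine, and other driver versions may log differently.

```python
import re

SAMPLE = """\
[   12.964126] hfi1 0000:82:00.0: hfi1_0: set_link_state: current INIT, new ARMED
[   12.964131] hfi1 0000:82:00.0: hfi1_0: logical state changed to PORT_ARMED (0x3)
[   12.964681] hfi1 0000:82:00.0: hfi1_0: set_link_state: current ARMED, new ACTIVE
"""

# Matches only the set_link_state lines, in the format shown in this post.
TRANSITION = re.compile(
    r"\[\s*([\d.]+)\] hfi1 (\S+): (\w+): set_link_state: current (\w+), new (\w+)"
)


def transitions(log_text):
    """Return (timestamp, old state, new state) for each link state change."""
    return [
        (float(m.group(1)), m.group(4), m.group(5))
        for m in TRANSITION.finditer(log_text)
    ]


def reached_active(log_text):
    """True if the most recent transition in the log ends in ACTIVE."""
    found = transitions(log_text)
    return bool(found) and found[-1][2] == "ACTIVE"


if __name__ == "__main__":
    for when, old, new in transitions(SAMPLE):
        print("%.6f: %s -> %s" % (when, old, new))
    print("active:", reached_active(SAMPLE))
```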

The IPoIB port then appears in the ifconfig output:

[root@sjsc-xxx ~]# ifconfig ib0
ib0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 2044
       inet 172.18.51.69  netmask 255.255.224.0  broadcast 172.18.63.255
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
       infiniband 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
       RX packets 0  bytes 0 (0.0 B)
       RX errors 0  dropped 0  overruns 0  frame 0
       TX packets 0  bytes 0 (0.0 B)
       TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@sjsc-xxx ~]#
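
The flags value in the ifconfig output is a bitmask. The sketch below decodes it using the standard Linux interface flag bits (only a handful are listed; the function names are mine). For the value 4099 shown above it yields UP, BROADCAST and MULTICAST but not RUNNING, which usually means the interface is administratively up while the kernel does not yet see an operational link on it.

```python
import re

# A subset of the interface flag bits from the Linux <linux/if.h> header.
IFF_FLAGS = {
    "UP": 0x1,
    "BROADCAST": 0x2,
    "LOOPBACK": 0x8,
    "RUNNING": 0x40,
    "MULTICAST": 0x1000,
}


def decode_flags(value):
    """Return the names of the known flag bits set in value."""
    return [name for name, bit in IFF_FLAGS.items() if value & bit]


def parse_flags(ifconfig_text):
    """Pull the numeric flags field out of ifconfig output."""
    match = re.search(r"flags=(\d+)<", ifconfig_text)
    if match is None:
        raise ValueError("no flags field found")
    return int(match.group(1))


if __name__ == "__main__":
    line = "ib0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 2044"
    names = decode_flags(parse_flags(line))
    print(names, "link running:", "RUNNING" in names)
```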

 

The driver and its dependencies:

[root@sjsc-xxx ~]# lsmod |grep hfi
hfi1                  655730  1
ib_mad                 47817  4 hfi1,ib_cm,ib_sa,ib_umad
ib_core                98787  11 hfi1,rdma_cm,ib_cm,ib_sa,iw_cm,ib_mad,ib_ucm,ib_umad,ib_uverbs,ib_ipoib
[root@sjsc-xxx ~]#



[root@sjsc-xxx ~]# cat /etc/sysconfig/network-scripts/ifcfg-ib0
DEVICE=ib0
NAME="Infiniband ib0"
TYPE=InfiniBand
ONBOOT=yes
NM_CONTROLLED=no
BOOTPROTO=static
PREFIX=19
IPADDR=172.18.51.69
[root@sjsc-xxx ~]#
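
As a sanity check, the PREFIX=19 and IPADDR values in this file should give exactly the netmask and broadcast address that ifconfig reports (255.255.224.0 and 172.18.63.255). Here is a short sketch using Python's standard ipaddress module; the function name describe is mine.

```python
import ipaddress


def describe(ipaddr, prefix):
    """Derive network, netmask and broadcast from an address and prefix length."""
    iface = ipaddress.ip_interface("%s/%s" % (ipaddr, prefix))
    net = iface.network
    return {
        "network": str(net.network_address),
        "netmask": str(net.netmask),
        "broadcast": str(net.broadcast_address),
    }


if __name__ == "__main__":
    # Values taken from the ifcfg-ib0 file above
    print(describe("172.18.51.69", 19))
```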

[root@sjsc-xxx ~]# ethtool  ib0
Settings for ib0:
No data available
[root@sjsc-xxx ~]# ifconfig ib0
ib0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 2044
       inet 172.18.51.69  netmask 255.255.224.0  broadcast 172.18.63.255
Infiniband hardware address can be incorrect! Please read BUGS section in ifconfig(8).
       infiniband 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  txqueuelen 256  (InfiniBand)
       RX packets 0  bytes 0 (0.0 B)
       RX errors 0  dropped 0  overruns 0  frame 0
       TX packets 0  bytes 0 (0.0 B)
       TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

[root@sjsc-xxx ~]#



The hfi firmware: 
[root@sjsc-xxx ~]# dmesg |grep -i firmware |grep hfi |grep version

[   21.337504] hfi1 0000:82:00.0: hfi1_0: 8051 firmware version 0.38
[   21.344356] hfi1 0000:82:00.0: hfi1_0: Lane 0 firmware: version 0x1055, prod_id 0x0041
[   21.353209] hfi1 0000:82:00.0: hfi1_0: Lane 1 firmware: version 0x1055, prod_id 0x0041
[   21.362050] hfi1 0000:82:00.0: hfi1_0: Lane 2 firmware: version 0x1055, prod_id 0x0041
[   21.370900] hfi1 0000:82:00.0: hfi1_0: Lane 3 firmware: version 0x1055, prod_id 0x0041
[root@sjsc-xxx ~]#

 The hfi1 driver info:


[root@localhost ~]# modinfo hfi1
filename:       /lib/modules/3.10.0-327.4.4.el7.x86_64/updates/hfi1.ko
version:        0.10-121
description:    Intel Omni-Path Architecture driver
license:        Dual BSD/GPL
rhelversion:    7.2
srcversion:     E2F417E21B6A8F9F673CC41
alias:          pci:v00008086d000024F1sv*sd*bc*sc*i*
alias:          pci:v00008086d000024F0sv*sd*bc*sc*i*
depends:        ib_core,ib_mad
vermagic:       3.10.0-327.4.4.el7.x86_64 SMP mod_unload modversions
parm:           lkey_table_size:LKEY table size in bits (2^n, 1 <= n <= 23) (uint)
parm:           max_pds:Maximum number of protection domains to support (uint)
parm:           max_ahs:Maximum number of address handles to support (uint)
parm:           max_cqes:Maximum number of completion queue entries to support (uint)
parm:           max_cqs:Maximum number of completion queues to support (uint)
parm:           max_qp_wrs:Maximum number of QP WRs to support (uint)
parm:           max_qps:Maximum number of QPs to support (uint)
parm:           max_sges:Maximum number of SGEs to support (uint)
parm:           max_mcast_grps:Maximum number of multicast groups to support (uint)
parm:           max_mcast_qp_attached:Maximum number of attached QPs to support (uint)
parm:           max_srqs:Maximum number of SRQs to support (uint)
parm:           max_srq_sges:Maximum number of SRQ SGEs to support (uint)
parm:           max_srq_wrs:Maximum number of SRQ WRs support (uint)
parm:           piothreshold:size used to determine sdma vs. pio (ushort)
parm:           sdma_comp_size:Size of User SDMA completion ring. Default: 128 (uint)
parm:           sdma_descq_cnt:Number of SDMA descq entries (uint)
parm:           sdma_idle_cnt:sdma interrupt idle delay (ns,default 250) (uint)
parm:           num_sdma:Set max number SDMA engines to use (uint)
parm:           desct_intr:Number of SDMA descriptor before interrupt (uint)
parm:           qp_table_size:QP table size (uint)
parm:           pcie_caps:Max PCIe tuning: Payload (0..3), ReadReq (4..7) (int)
parm:           aspm:PCIe ASPM: 0: disable, 1: enable, 2: dynamic (uint)
parm:           pcie_target:PCIe target speed (0 skip, 1-3 Gen1-3) (uint)
parm:           pcie_force:Force driver to do a PCIe firmware download even if already at target speed (uint)
parm:           pcie_retry:Driver will try this many times to reach requested speed (uint)
parm:           pcie_pset:PCIe Eq Pset value to use, range is 0-10 (uint)
parm:           num_user_contexts:Set max number of user contexts to use (uint)
parm:           krcvqs:Array of the number of non-control kernel receive queues by VL (array of uint)
parm:           rcvarr_split:Percent of context's RcvArray entries used for Eager buffers (uint)
parm:           eager_buffer_size:Size of the eager buffers, default: 2MB (uint)
parm:           rcvhdrcnt:Receive header queue count (default 2048) (uint)
parm:           hdrq_entsize:Size of header queue entries: 2 - 8B, 16 - 64B (default), 32 - 128B (uint)
parm:           user_credit_return_threshold:Credit return threshold for user send contexts, return when unreturned credits passes this many blocks (in percent of allocated blocks, 0 is off) (uint)
parm:           max_mtu:Set max MTU bytes, default is 8192 (uint)
parm:           cu:Credit return units (uint)
parm:           prescan_rxq:Used to toggle rx prescan. Set to 1 to enable prescan (uint)
parm:           cap_mask:Bit mask of enabled/disabled HW features
parm:           kdeth_qp:Set the KDETH queue pair prefix (uint)
parm:           num_vls:Set number of Virtual Lanes to use (1-8) (uint)
parm:           rcv_intr_timeout:Receive interrupt mitigation timeout in ns (uint)
parm:           rcv_intr_count:Receive interrupt mitigation count (uint)
parm:           link_crc_mask:CRCs to use on the link (ushort)
parm:           loopback:Put into loopback mode (1 = serdes, 3 = external cable (uint)
parm:           mem_affinity:Bitmask for memory affinity control: 0 - device, 1 - process (uint)
[root@localhost ~]#
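
That is a long list of module parameters. To search or compare them across driver versions, the parm: lines can be turned into a dictionary. This is a minimal sketch; the function name parse_parms is mine, and the sample is three lines copied from the modinfo output above.

```python
SAMPLE = """\
parm:           max_mtu:Set max MTU bytes, default is 8192 (uint)
parm:           aspm:PCIe ASPM: 0: disable, 1: enable, 2: dynamic (uint)
parm:           num_vls:Set number of Virtual Lanes to use (1-8) (uint)
"""


def parse_parms(modinfo_text):
    """Map each module parameter name to its description text."""
    parms = {}
    for line in modinfo_text.splitlines():
        if not line.startswith("parm:"):
            continue  # skip filename, version, alias and other fields
        body = line[len("parm:"):].strip()
        # The name ends at the first colon; the description may contain more.
        name, _, description = body.partition(":")
        parms[name] = description
    return parms


if __name__ == "__main__":
    for name, description in parse_parms(SAMPLE).items():
        print("%-10s %s" % (name, description))
```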

The kernel version info:
[root@localhost ~]# uname -r
3.10.0-327.4.4.el7.x86_64
[root@localhost ~]#   

 Further reading: http://www.anandtech.com/show/9561/exploring-intels-omnipath-network-fabric