


Efficiently Running AI Workloads Using Long SIMD and Matrix ISAs


Speaker: Marc Casas Guix, Barcelona Supercomputing Center


Venue: Main Building, Room B1421



Marc Casas is a technical research lead at the Barcelona Supercomputing Center (BSC) and a lecturer at the Universitat Politècnica de Catalunya (UPC). His research lies between computer architecture (e.g., memory address translation and vector architectures) and high-performance computing (e.g., sparse linear algebra and parallel deep learning). He is the technical lead of the SONAR (parallel SOftware and New ARchitectures) research group, composed of PhD students, engineers, and postdocs. Marc has led BSC contributions to several European projects (Mont-Blanc 2020, European Processor Initiative, etc.) and research collaborations with Intel and IBM.

Marc has been at BSC since 2013. He was a postdoctoral research scholar at the Lawrence Livermore National Laboratory (LLNL) from 2010 to 2013. He received the Marie Curie and Ramón y Cajal Fellowships in 2014 and 2018, respectively. He obtained a 5-year degree in mathematics in 2004 and a PhD degree in Computer Science in 2010 from the Universitat Politècnica de Catalunya (UPC).


This talk will show how state-of-the-art proposals to compute convolutions on CPU architectures that support SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. The talk will propose new algorithmic approaches that mitigate this limitation in two ways: adapting the amount of computation exposed to the microarchitecture to reduce cache misses, and redefining the activation memory layout to improve the memory access pattern. These algorithmic approaches will motivate the Matrix Tile Extension (MTE), a novel matrix Instruction-Set Architecture (ISA) that completely decouples the instruction set architecture from the microarchitecture and seamlessly interacts with existing vector ISAs. MTE incurs minimal implementation overhead since it only requires a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state, and beats the best state-of-the-art matrix ISA by 1.20x.
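
As background for the cache conflict misses mentioned in the abstract, the following is a minimal sketch of how a strided access pattern can concentrate many cache lines in one set of a set-associative cache, while a contiguous layout spreads them out. It is not the algorithm or the memory layout presented in the talk. The cache geometry (64-byte lines, 64 sets, 8 ways) and the helper names `set_index`, `max_lines_per_set`, and `has_conflicts` are illustrative assumptions.

```python
# Hypothetical cache geometry, chosen only for illustration:
# 64 sets * 8 ways * 64 bytes per line = 32 KiB.
LINE_BYTES = 64
NUM_SETS = 64
WAYS = 8


def set_index(addr):
    """Cache set that a byte address maps to in this simple model."""
    return (addr // LINE_BYTES) % NUM_SETS


def max_lines_per_set(base, stride_bytes, count):
    """Largest number of distinct cache lines that land in a single set
    when touching `count` addresses spaced `stride_bytes` apart."""
    lines_by_set = {}
    for i in range(count):
        addr = base + i * stride_bytes
        lines_by_set.setdefault(set_index(addr), set()).add(addr // LINE_BYTES)
    return max(len(lines) for lines in lines_by_set.values())


def has_conflicts(base, stride_bytes, count):
    """True if some set must hold more lines than it has ways, so lines
    evict each other even though the cache as a whole has room."""
    return max_lines_per_set(base, stride_bytes, count) > WAYS


# A 4096-byte stride sends all 16 accesses to the same set.
strided = max_lines_per_set(0, 4096, 16)
# A contiguous layout (64-byte stride) gives each line its own set.
contiguous = max_lines_per_set(0, 64, 16)
print(strided, has_conflicts(0, 4096, 16))
print(contiguous, has_conflicts(0, 64, 16))
```

In this model the strided pattern needs 16 lines in a set that holds only 8, whereas the contiguous pattern needs one line per set. This is the general kind of effect that a change of data layout can avoid.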
