I am reporting a significant performance degradation when multiplying a vector with a 2D tensor column-wise (254 TOPS measured) compared with doing it row-wise (419 TOPS measured). It is an int8 matmul ...