I am reporting a significant performance degradation when multiplying a 2D tensor by a vector column-wise (254 TOPS measured) compared with doing it row-wise (419 TOPS measured). This is an int8 matmul ...
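
To make the two cases concrete, here is a minimal NumPy sketch of what I mean by the two multiply directions. It only illustrates the broadcast patterns and is not the kernel I measured. The shapes, the names (`acc`, `row_vec`, `col_vec`), and the reading of "row-wise" as one vector entry per column of the result (and "column-wise" as one entry per row) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sizes, chosen only for illustration.
M, N, K = 4, 6, 8

# int8 operands; accumulate in int32 to avoid overflow.
a = rng.integers(-128, 128, size=(M, K), dtype=np.int8)
b = rng.integers(-128, 128, size=(K, N), dtype=np.int8)
acc = a.astype(np.int32) @ b.astype(np.int32)  # shape (M, N)

# Assumed "row-wise" case: the vector has one entry per column of acc
# and is broadcast across every row.
row_vec = rng.random(N, dtype=np.float32)
row_wise = acc * row_vec[None, :]

# Assumed "column-wise" case: the vector has one entry per row of acc
# and is broadcast across every column.
col_vec = rng.random(M, dtype=np.float32)
col_wise = acc * col_vec[:, None]
```

Both results have the same shape and the same number of multiplies; only the direction in which the vector is broadcast differs.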