All Questions
Tagged with blas performance
28 questions
5 votes · 2 answers · 188 views
crossprod(m1, m2) is running slower than t(m1) %*% m2 on my machine
Why does t(mat1) %*% mat2 run faster than crossprod(mat1, mat2)? Isn't the whole point of the latter that it calls a more efficient low-level routine?
r$> mat1 <- array(rnorm(100 * 600), dim = ...
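A minimal NumPy sketch of the same comparison (not the asker's R session): both spellings compute the identical cross-product matrix, and NumPy's `m1.T` is a zero-copy view, so the backend can be handed a transpose flag rather than a rearranged copy.

```python
import numpy as np

# NumPy analogue of t(m1) %*% m2 vs crossprod(m1, m2) in R.
rng = np.random.default_rng(0)
m1 = rng.standard_normal((100, 600))
m2 = rng.standard_normal((100, 600))

explicit = m1.T @ m2        # m1.T is a view; dgemm receives a transpose flag
via_dot = np.dot(m1.T, m2)  # same computation through the dot interface

assert np.allclose(explicit, via_dot)
```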
14 votes · 1 answer · 576 views
Faster evaluation of matrix multiplication from right to left
I noticed that evaluating matrix operations in quadratic form from right to left is significantly faster than left to right in R, depending on how the parentheses are placed. Obviously they both ...
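The effect being asked about can be sketched with a matrix chain ending in a vector: grouping right to left keeps every intermediate small, so the operation count drops from cubic to quadratic while the result is unchanged.

```python
import numpy as np

# For A (n x n), B (n x n), v (n x 1):
# (A @ B) @ v costs O(n^3) for the matrix-matrix product, while
# A @ (B @ v) is two matrix-vector products, O(n^2).
n = 300
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
v = rng.standard_normal((n, 1))

left_to_right = (A @ B) @ v   # builds the full n x n product first
right_to_left = A @ (B @ v)   # never forms anything bigger than n x 1

assert np.allclose(left_to_right, right_to_left)
```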
2 votes · 1 answer · 213 views
Armadillo: Inefficient chaining of .t()
Consider the following two ways of doing the same thing.
arma::Mat<double> B(5000,5000,arma::fill::randu);
arma::Mat<double> C(5000,500, arma::fill::randu);
Okay, two dense matrices in ...
1 vote · 0 answers · 648 views
How does cblas_dcopy fare against memcpy_s and std::copy?
There is a lot of discussion on the comparison between std::copy and memcpy_s in terms of efficiency in copying one array to another.
I'd like to know where Intel MKL's cblas_dcopy stands in all of ...
1 vote · 1 answer · 348 views
Optimize eigen recomposition (Matrix - Diagonal Matrix - Matrix) product C++ with BLAS and OpenMP
I wrote a C++ code to solve a linear system A.x = b where A is a symmetric matrix by first diagonalizing the matrix A = V.D.V^T with LAPACK(E) (because I need the eigenvalues later) and then solving x ...
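A NumPy sketch of the same pipeline (an assumed setup, not the asker's C++/LAPACKE code): with the symmetric eigendecomposition A = V D V^T in hand, the solve x = V D^{-1} V^T b can be grouped right to left so every intermediate stays a vector and the dense V·D·V^T recomposition is never formed.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
M = rng.standard_normal((n, n))
A = M + M.T                      # symmetric test matrix
b = rng.standard_normal(n)

d, V = np.linalg.eigh(A)         # A = V @ diag(d) @ V.T
x = V @ ((V.T @ b) / d)          # two gemv calls + one elementwise divide

assert np.allclose(A @ x, b)     # x solves the original system
```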
1 vote · 1 answer · 1k views
Why is numpy's kron so fast?
I was trying to implement a kronecker product function. Below are three ideas that I have:
def kron(arr1, arr2):
    """Columnwise outer product, avoiding relocating elements."""
    r1, c1 = arr1....
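One way to finish the "columnwise outer product" idea with pure broadcasting, checked against np.kron — whose speed comes from doing exactly this kind of vectorised outer product and a single reshape, entirely in compiled code:

```python
import numpy as np

def kron_broadcast(a, b):
    r1, c1 = a.shape
    r2, c2 = b.shape
    # outer product over both axes, then fold the blocks back into place:
    # element [i, p, j, q] is a[i, j] * b[p, q], matching kron's layout
    return (a[:, None, :, None] * b[None, :, None, :]).reshape(r1 * r2, c1 * c2)

rng = np.random.default_rng(3)
a = rng.standard_normal((4, 5))
b = rng.standard_normal((3, 2))

assert np.allclose(kron_broadcast(a, b), np.kron(a, b))
```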
1 vote · 1 answer · 79 views
C - array function evaluation
Aloha! I am working in C, applying basic functions to every element of an array with a for loop, and I was wondering whether it is possible to speed up this calculation (e.g. with cblas functions). I am ...
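The same pattern sketched in NumPy terms (not the asker's C code): a per-element loop pays interpreter or call overhead on every element, while the vectorised call makes one pass in compiled code, which is what a BLAS- or VML-style routine would do in C.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10000)

looped = np.empty_like(x)
for i in range(x.size):
    looped[i] = np.exp(x[i])      # one round-trip per element

vectorised = np.exp(x)            # single compiled loop over the array

assert np.allclose(looped, vectorised)
```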
0 votes · 1 answer · 202 views
A * B computation when B is a symmetric matrix in armadillo
Is there any way to multiply a symmetric matrix by a dense one in Armadillo (and use the fact that we have a symmetric matrix)? I know about the DSYMM routine in BLAS, but the matrices I'm dealing with are ...
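Not Armadillo, but the DSYMM routine the question mentions can be reached from Python via SciPy's BLAS wrappers; it reads only one triangle of the symmetric operand, unlike a general dgemm.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(4)
n = 100
M = rng.standard_normal((n, n))
A = M + M.T                        # symmetric operand
B = rng.standard_normal((n, n))    # dense, general operand

C_symm = blas.dsymm(1.0, A, B)     # C = alpha * A * B, A treated as symmetric
C_gemm = A @ B                     # general matrix product, for comparison

assert np.allclose(C_symm, C_gemm)
```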
2 votes · 0 answers · 271 views
Eigenlib and performance of small matrix operations
I chose eigenlib for my project, since I deal with a lot of small-scale vector and matrix operations. Naturally, I implemented the simple vector-matrix-vector product in eigenlib as this function:...
2 votes · 0 answers · 106 views
cython_blas level 1 routine orders of magnitude faster than Cython for loop
I've come across a performance difference between a call to cblas (namely daxpy: perform y += alpha * x where y and x are vectors of the same length, and alpha is a scalar) and the same operation ...
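The two sides of that comparison, sketched through SciPy's BLAS wrappers rather than the asker's Cython: level-1 daxpy (y ← alpha·x + y) versus an explicit per-element loop. Both produce the same vector; daxpy runs the whole update in compiled, vectorised code.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(5)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)
alpha = 2.5

via_daxpy = blas.daxpy(x, y.copy(), a=alpha)   # y + alpha * x, in one BLAS call

via_loop = y.copy()
for i in range(x.size):                        # the slow, per-element way
    via_loop[i] += alpha * x[i]

assert np.allclose(via_daxpy, via_loop)
```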
1 vote · 1 answer · 4k views
Optimizing numpy array multiplication: * faster than numpy.dot?
Questions:
1) How is it that numpy.dot() is slower than * in the example code below when BLAS is being used?
2) Is there a way that numpy.dot() can be implemented instead of * in this case for ...
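Without the full example it is hard to say what the asker's arrays look like, but a common reason `*` beats numpy.dot is that the dot call performs a full matrix product where an elementwise broadcast would do. A hypothetical illustration: scaling the rows of A by a vector d.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
A = rng.standard_normal((n, n))
d = rng.standard_normal(n)

# np.diag(d) @ A is an O(n^3) dgemm on a mostly-zero matrix;
# d[:, None] * A is one O(n^2) elementwise pass. Same result.
via_dot = np.dot(np.diag(d), A)
via_mul = d[:, None] * A

assert np.allclose(via_dot, via_mul)
```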
0 votes · 1 answer · 234 views
Avoid blas when involving temporary memory allocation?
I have a program that computes the matrix product x'Ay repeatedly. Is it better practice to compute this by making calls to MKL's BLAS, i.e. cblas_dgemv and cblas_ddot, which requires allocating ...
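The preallocation idea sketched in NumPy terms (assumed names, not the asker's MKL code): compute x'Ay repeatedly while reusing one scratch vector for A·y via dot's `out=` argument, so no temporary is allocated per call.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
y = rng.standard_normal(n)

tmp = np.empty(n)                 # allocated once, reused across calls

def quad_form(x, A, y, tmp):
    np.dot(A, y, out=tmp)         # gemv into the preallocated buffer
    return np.dot(x, tmp)         # ddot; no allocation at all

assert np.isclose(quad_form(x, A, y, tmp), x @ A @ y)
```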
5 votes · 1 answer · 3k views
Is numpy.einsum efficient compared to fortran or C?
I have written a numpy program which is very time consuming. After profiling it, I found that most of the time is spent in numpy.einsum.
Although numpy is a wrapper of LAPACK or BLAS, I don't know ...
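One relevant comparison: for contractions that are really a plain matrix product, the einsum spelling and the dot spelling are equivalent, and the dot path goes straight to BLAS dgemm, which is often the faster route.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((50, 60))
B = rng.standard_normal((60, 70))

via_einsum = np.einsum('ij,jk->ik', A, B)   # explicit index notation
via_dot = A @ B                             # straight BLAS dgemm

assert np.allclose(via_einsum, via_dot)
```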
1 vote · 1 answer · 1k views
performance in linear algebra with python
Benchmarks of different languages and related questions are everywhere on the Internet. However, I still cannot figure out whether I should switch to C in my program.
Basically, the most ...
9 votes · 1 answer · 2k views
How to measure overall performance of parallel programs (with papi)
I asked myself what would be the best way to measure the performance (in flops) of a parallel program. I read about papi_flops. This seems to work fine for a serial program. But I don't know how I can ...