All Questions
Tagged with blas performance
28 questions
5 votes · 2 answers · 188 views
crossprod(m1, m2) is running slower than t(m1) %*% m2 on my machine
Why does t(mat1) %*% mat2 run faster than crossprod(mat1, mat2)? Isn't the whole point of the latter that it calls a more efficient low-level routine?
r$> mat1 <- array(rnorm(100 * 600), dim = ...
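A minimal NumPy sketch of the same comparison (not the asker's R session): both spellings compute the identical cross-product matrix, and NumPy's `m1.T` is a zero-copy view, so the backend can be handed a transpose flag rather than a rearranged copy.

```python
import numpy as np

# NumPy analogue of t(m1) %*% m2 vs crossprod(m1, m2) in R.
rng = np.random.default_rng(0)
m1 = rng.standard_normal((100, 600))
m2 = rng.standard_normal((100, 600))

explicit = m1.T @ m2        # m1.T is a view; dgemm receives a transpose flag
via_dot = np.dot(m1.T, m2)  # same computation through the dot interface

assert np.allclose(explicit, via_dot)
```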
14 votes · 1 answer · 576 views
Faster evaluation of matrix multiplication from right to left
I noticed that evaluating matrix operations in quadratic form from right to left is significantly faster than left to right in R, depending on how the parentheses are placed. Obviously they both ...
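The effect being asked about can be sketched with a matrix chain ending in a vector: grouping right to left keeps every intermediate small, so the operation count drops from cubic to quadratic while the result is unchanged.

```python
import numpy as np

# For A (n x n), B (n x n), v (n x 1):
# (A @ B) @ v costs O(n^3) for the matrix-matrix product, while
# A @ (B @ v) is two matrix-vector products, O(n^2).
n = 300
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
v = rng.standard_normal((n, 1))

left_to_right = (A @ B) @ v   # builds the full n x n product first
right_to_left = A @ (B @ v)   # never forms anything bigger than n x 1

assert np.allclose(left_to_right, right_to_left)
```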
2 votes · 1 answer · 213 views
Armadillo: Inefficient chaining of .t()
Consider the following two ways of doing the same thing.
arma::Mat<double> B(5000,5000,arma::fill::randu);
arma::Mat<double> C(5000,500, arma::fill::randu);
Okay, two dense matrices in ...
1 vote · 0 answers · 648 views
How does cblas_dcopy fare against memcpy_s and std::copy?
There is a lot of discussion on the comparison between std::copy and memcpy_s in terms of efficiency in copying one array to another.
I'd like to know where Intel MKL's cblas_dcopy stands in all of ...
1 vote · 1 answer · 348 views
Optimize eigen recomposition (Matrix - Diagonal Matrix - Matrix) product C++ with BLAS and OpenMP
I wrote a C++ code to solve a linear system A.x = b where A is a symmetric matrix by first diagonalizing the matrix A = V.D.V^T with LAPACK(E) (because I need the eigenvalues later) and then solving x ...
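A NumPy sketch of the same pipeline (an assumed setup, not the asker's C++/LAPACKE code): with the symmetric eigendecomposition A = V D V^T in hand, the solve x = V D^{-1} V^T b can be grouped right to left so every intermediate stays a vector and the dense V·D·V^T recomposition is never formed.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
M = rng.standard_normal((n, n))
A = M + M.T                      # symmetric test matrix
b = rng.standard_normal(n)

d, V = np.linalg.eigh(A)         # A = V @ diag(d) @ V.T
x = V @ ((V.T @ b) / d)          # two gemv calls + one elementwise divide

assert np.allclose(A @ x, b)     # x solves the original system
```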
1 vote · 1 answer · 1k views
Why is numpy's kron so fast?
I was trying to implement a kronecker product function. Below are three ideas that I have:
def kron(arr1, arr2):
    """Columnwise outer product, avoiding relocating elements."""
    r1, c1 = arr1....
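One way to finish the "columnwise outer product" idea with pure broadcasting, checked against np.kron — whose speed comes from doing exactly this kind of vectorised outer product and a single reshape, entirely in compiled code:

```python
import numpy as np

def kron_broadcast(a, b):
    r1, c1 = a.shape
    r2, c2 = b.shape
    # outer product over both axes, then fold the blocks back into place:
    # element [i, p, j, q] is a[i, j] * b[p, q], matching kron's layout
    return (a[:, None, :, None] * b[None, :, None, :]).reshape(r1 * r2, c1 * c2)

rng = np.random.default_rng(3)
a = rng.standard_normal((4, 5))
b = rng.standard_normal((3, 2))

assert np.allclose(kron_broadcast(a, b), np.kron(a, b))
```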
1 vote · 1 answer · 79 views
C - array function evaluation
Aloha! I am working in C, applying basic functions to every element of an array with a for loop, and I was wondering whether it is possible to speed up this calculation (e.g. with cblas functions). I am ...
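The same pattern sketched in NumPy terms (not the asker's C code): a per-element loop pays interpreter or call overhead on every element, while the vectorised call makes one pass in compiled code, which is what a BLAS- or VML-style routine would do in C.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10000)

looped = np.empty_like(x)
for i in range(x.size):
    looped[i] = np.exp(x[i])      # one round-trip per element

vectorised = np.exp(x)            # single compiled loop over the array

assert np.allclose(looped, vectorised)
```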
0 votes · 1 answer · 202 views
A * B computation when B is a symmetric matrix in armadillo
Is there any way to multiply a symmetric matrix by a dense one in Armadillo (and use the fact that we have a symmetric matrix)? I know about the DSYMM routine in BLAS, but the matrices I'm dealing with are ...
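Not Armadillo, but the DSYMM routine the question mentions can be reached from Python via SciPy's BLAS wrappers; it reads only one triangle of the symmetric operand, unlike a general dgemm.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(4)
n = 100
M = rng.standard_normal((n, n))
A = M + M.T                        # symmetric operand
B = rng.standard_normal((n, n))    # dense, general operand

C_symm = blas.dsymm(1.0, A, B)     # C = alpha * A * B, A treated as symmetric
C_gemm = A @ B                     # general matrix product, for comparison

assert np.allclose(C_symm, C_gemm)
```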
2 votes · 0 answers · 271 views
Eigenlib and performance of small matrix operations
I chose eigenlib for my project, since I deal with a lot of small-scale vector and matrix operations. Naturally, I implemented the simple vector-matrix-vector product in eigenlib as this function:...
2 votes · 0 answers · 106 views
cython_blas level 1 routine orders of magnitude faster than Cython for loop
I've come across a performance difference between a call to cblas (namely daxpy: perform y += alpha * x where y and x are vectors of the same length, and alpha is a scalar) and the same operation ...
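The two sides of that comparison, sketched through SciPy's BLAS wrappers rather than the asker's Cython: level-1 daxpy (y ← alpha·x + y) versus an explicit per-element loop. Both produce the same vector; daxpy runs the whole update in compiled, vectorised code.

```python
import numpy as np
from scipy.linalg import blas

rng = np.random.default_rng(5)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)
alpha = 2.5

via_daxpy = blas.daxpy(x, y.copy(), a=alpha)   # y + alpha * x, in one BLAS call

via_loop = y.copy()
for i in range(x.size):                        # the slow, per-element way
    via_loop[i] += alpha * x[i]

assert np.allclose(via_daxpy, via_loop)
```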
1 vote · 1 answer · 4k views
Optimizing numpy array multiplication: * faster than numpy.dot?
Questions:
1) How is it that numpy.dot() is slower than * in the example code below when BLAS is being used?
2) Is there a way that numpy.dot() can be implemented instead of * in this case for ...
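Without the full example it is hard to say what the asker's arrays look like, but a common reason `*` beats numpy.dot is that the dot call performs a full matrix product where an elementwise broadcast would do. A hypothetical illustration: scaling the rows of A by a vector d.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
A = rng.standard_normal((n, n))
d = rng.standard_normal(n)

# np.diag(d) @ A is an O(n^3) dgemm on a mostly-zero matrix;
# d[:, None] * A is one O(n^2) elementwise pass. Same result.
via_dot = np.dot(np.diag(d), A)
via_mul = d[:, None] * A

assert np.allclose(via_dot, via_mul)
```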
0 votes · 1 answer · 234 views
Avoid blas when involving temporary memory allocation?
I have a program that computes the matrix product x'Ay repeatedly. Is it better practice to compute this by making calls to MKL's BLAS, i.e. cblas_dgemv and cblas_ddot, which requires allocating ...
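The preallocation idea sketched in NumPy terms (assumed names, not the asker's MKL code): compute x'Ay repeatedly while reusing one scratch vector for A·y via dot's `out=` argument, so no temporary is allocated per call.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
y = rng.standard_normal(n)

tmp = np.empty(n)                 # allocated once, reused across calls

def quad_form(x, A, y, tmp):
    np.dot(A, y, out=tmp)         # gemv into the preallocated buffer
    return np.dot(x, tmp)         # ddot; no allocation at all

assert np.isclose(quad_form(x, A, y, tmp), x @ A @ y)
```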
5 votes · 1 answer · 3k views
Is numpy.einsum efficient compared to fortran or C?
I have written a numpy program which is very time consuming. After profiling it, I found that most of the time is spent in numpy.einsum.
Although numpy is a wrapper of LAPACK or BLAS, I don't know ...
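One relevant comparison: for contractions that are really a plain matrix product, the einsum spelling and the dot spelling are equivalent, and the dot path goes straight to BLAS dgemm, which is often the faster route.

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((50, 60))
B = rng.standard_normal((60, 70))

via_einsum = np.einsum('ij,jk->ik', A, B)   # explicit index notation
via_dot = A @ B                             # straight BLAS dgemm

assert np.allclose(via_einsum, via_dot)
```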
1 vote · 1 answer · 1k views
performance in linear algebra with python
Benchmarks of different languages and related questions are everywhere on the Internet. However, I still cannot figure out whether I should switch to C in my program.
Basically, the most ...
9 votes · 1 answer · 2k views
How to measure overall performance of parallel programs (with papi)
I asked myself what would be the best way to measure the performance (in flops) of a parallel program. I read about papi_flops. This seems to work fine for a serial program. But I don't know how I can ...