2021-09-24, 06:17  #56
Jul 2003
So Cal
2^{4}×139 Posts 
Quote:


2021-09-24, 06:21  #57
Jul 2003
So Cal
8B0_{16} Posts 
Quote:


2021-09-24, 12:36  #58
Apr 2020
1000100011_{2} Posts 

2021-09-24, 13:40  #59
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
3·1,663 Posts 

2021-09-24, 15:13  #60
Jul 2003
So Cal
2224_{10} Posts 
A large fraction encounter issues when exceeding 1GB/thread, so I stay a little below that.

2021-09-24, 15:50  #61
Apr 2020
1000100011_{2} Posts 
If lims have to stay at 250M, it would probably be possible to stretch the upper limit of doable jobs a bit by using 3LP on both sides to catch some of the relations that are lost due to the low lims. This makes sec/rel ~30% worse but increases yield by ~50%, while also increasing the number of relations needed by some unknown amount (almost certainly below 50%) and making LA that bit harder as a result.
But as long as you can cope with lpb 34/34 and 3LP on only one side, there shouldn't be any need for this. 
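The trade-off above can be put into rough numbers. This is only a sketch under two assumptions that are not stated in the post: total sieving time scales as (relations needed) x (sec/rel), and the special-q range needed scales as (relations needed) / (yield). The function name and the parameter `extra_rels_needed` (the unknown fractional increase in relations needed) are made up for illustration.

```python
def three_lp_tradeoff(extra_rels_needed, sec_per_rel_penalty=0.30, yield_gain=0.50):
    """Compare 3LP on both sides against the baseline.

    Returns (time_factor, q_range_factor), both relative to the baseline:
      time_factor    - total sieving time, assumed = rels needed * sec/rel
      q_range_factor - special-q range needed, assumed = rels needed / yield
    """
    rels = 1.0 + extra_rels_needed                    # relations needed vs. baseline
    time_factor = rels * (1.0 + sec_per_rel_penalty)  # ~30% worse sec/rel
    q_range_factor = rels / (1.0 + yield_gain)        # ~50% better yield
    return time_factor, q_range_factor


if __name__ == "__main__":
    # Try a few guesses for the unknown increase in relations needed.
    for extra in (0.1, 0.2, 0.3, 0.4):
        t, q = three_lp_tradeoff(extra)
        print(f"extra rels {extra:.0%}: time x{t:.2f}, special-q range x{q:.2f}")
```

Under these assumptions the special-q range shrinks whenever the extra relations needed stay below 50%, which is what would stretch the upper limit of doable jobs, at the cost of more total sieving time.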
2021-10-22, 13:34  #62
Jun 2012
Boulder, CO
2^{4}·3·7 Posts 
In general, given a GPU with X GB RAM, and an N x N matrix, is there a way to determine (reasonably) optimal VBITS and block_nnz values?

2021-10-22, 23:00  #63
Jul 2003
So Cal
8B0_{16} Posts 
Technically it's an MxN matrix with M slightly less than N, but for this question we can approximate it as NxN.
Volta (and I'm hoping Turing and Ampere) GPUs aren't very sensitive to the block_nnz value, so just keep it at its default of 1.75 billion. The actual limit is that the number of nonzeros in a cub SpMV call is stored in an int32, so each matrix block must have fewer than 2^31 nonzeros. block_nnz sets an estimate, especially for the transpose matrix, so I've been a bit conservative in setting it at 1.75B. We want to keep the number of blocks reasonably small, since each block for both the normal and transpose matrix needs a 4*(N+1)-byte row offset array in addition to the 4*num_nonzeros-byte column array in GPU memory.
For VBITS, a global memory fetch on current NVIDIA GPUs by default moves 64 bytes into the L2 cache (although this can be reduced to 32 bytes on A100). With VBITS=128, we are only using 16 bytes of that data, with little chance of cache reuse in most of the matrix. Increasing VBITS uses more of the data and thus uses global memory bandwidth more efficiently in the SpMV. However, each iteration also has multiple (VBITS x N) • (N x VBITS) dense matrix multiplications, which require strided access to arrays. This strided access has a larger impact at VBITS=512. Also, the vectors require 7*N*VBITS/8 bytes of GPU memory.
In practice on the V100 I've gotten about equal performance from VBITS of 384 and 512, and poorer performance with decreasing values. Of the two I use 384 since it requires less GPU memory. However, lower VBITS values are useful if GPU memory is tight. Once I have access to an A100 I will compare using VBITS=256 with cudaLimitMaxL2FetchGranularity of 32 to VBITS=384 or 512 with the default.
So, in short, unless GPU memory is tight, use VBITS=384 and the default block_nnz on V100 and likely on A100 as well.
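For a rough feel of the memory arithmetic above, here is a minimal Python sketch. It only adds up the three terms mentioned in the post (row offset arrays, column arrays, vectors) and ignores everything else the program may allocate, so treat the result as a lower bound. The function names, the assumption that the number of blocks is about ceil(nnz / block_nnz), and the candidate list of VBITS values are illustrative and not taken from the msieve source.

```python
import math

INT32_NNZ_LIMIT = 2**31  # each matrix block must hold fewer nonzeros than this


def estimate_gpu_bytes(n, nnz, vbits=384, block_nnz=1_750_000_000):
    """Rough lower bound on GPU memory for an n x n matrix with nnz nonzeros."""
    if block_nnz >= INT32_NNZ_LIMIT:
        raise ValueError("block_nnz must stay below 2^31")
    # Assumption: blocks ~ ceil(nnz / block_nnz); the real count may differ,
    # since block_nnz is only an estimate.
    blocks = max(1, math.ceil(nnz / block_nnz))
    # Normal and transpose matrices: one 4*(n+1)-byte row offset array per block...
    row_offsets = 2 * blocks * 4 * (n + 1)
    # ...plus 4 bytes per nonzero for the column arrays.
    columns = 2 * 4 * nnz
    # Vectors: 7*N*VBITS/8 bytes.
    vectors = 7 * n * vbits // 8
    return {
        "blocks": blocks,
        "matrix": row_offsets + columns,
        "vectors": vectors,
        "total": row_offsets + columns + vectors,
    }


def suggest_vbits(n, nnz, gpu_bytes, candidates=(384, 256, 128)):
    """Return the first candidate VBITS whose estimate fits in gpu_bytes, else None."""
    for vbits in candidates:
        if estimate_gpu_bytes(n, nnz, vbits)["total"] <= gpu_bytes:
            return vbits
    return None


if __name__ == "__main__":
    est = estimate_gpu_bytes(n=50_000_000, nnz=5_000_000_000, vbits=384)
    print({k: v for k, v in est.items()})
    print("suggested VBITS for 32 GB:", suggest_vbits(50_000_000, 5_000_000_000, 32 * 10**9))
```

The candidate order follows the advice above: try 384 first and fall back to lower values only when memory is tight.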
2021-10-26, 04:05  #64
Jul 2003
So Cal
2^{4}·139 Posts 
2,2174M is in LA, so here's one more data point. Running on eight NVLink-connected V100s,
Code:
Sun Oct 24 01:15:27 2021 matrix is 106764994 x 106765194 (56998.7 MB) with weight 16127184931 (151.05/col)
Sun Oct 24 01:15:27 2021 sparse part has weight 13874205635 (129.95/col)
...
Sun Oct 24 23:03:59 2021 commencing linear algebra
Sun Oct 24 23:03:59 2021 using VBITS=384
Sun Oct 24 23:03:59 2021 skipping matrix build
Sun Oct 24 23:03:59 2021 initialized process (0,0) of 2 x 4 grid
Sun Oct 24 23:09:35 2021 matrix starts at (0, 0)
Sun Oct 24 23:09:39 2021 matrix is 53382681 x 25338016 (8267.4 MB) with weight 2435546404 (96.12/col)
Sun Oct 24 23:09:39 2021 sparse part has weight 1913870759 (75.53/col)
Sun Oct 24 23:09:39 2021 saving the first 368 matrix rows for later
Sun Oct 24 23:09:46 2021 matrix includes 384 packed rows
Sun Oct 24 23:10:15 2021 matrix is 53382313 x 25338016 (7468.9 MB) with weight 1554978635 (61.37/col)
Sun Oct 24 23:10:15 2021 sparse part has weight 1451172382 (57.27/col)
Sun Oct 24 23:10:15 2021 using GPU 0 (Tesla V100-SXM2-32GB)
Sun Oct 24 23:10:15 2021 selected card has CUDA arch 7.0
Sun Oct 24 23:12:44 2021 commencing Lanczos iteration
Sun Oct 24 23:12:47 2021 memory use: 20898.7 MB
Sun Oct 24 23:12:56 2021 linear algebra at 0.0%, ETA 90h17m
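For anyone collecting data points like this from several logs, a small Python sketch for pulling the matrix size and weight out of a "matrix is ..." line may help. The regular expression is written against the lines shown above only; the function name is made up, and other log formats are not handled.

```python
import re

# Matches e.g. "matrix is 106764994 x 106765194 (56998.7 MB) with weight 16127184931 (151.05/col)"
MATRIX_LINE = re.compile(
    r"matrix is (\d+) x (\d+) \(([\d.]+) MB\) with weight (\d+) \(([\d.]+)/col\)"
)


def parse_matrix_line(line):
    """Return matrix stats from one log line, or None if the line does not match."""
    m = MATRIX_LINE.search(line)
    if m is None:
        return None
    rows, cols, weight = int(m.group(1)), int(m.group(2)), int(m.group(4))
    return {
        "rows": rows,
        "cols": cols,
        "size_mb": float(m.group(3)),
        "weight": weight,
        "reported_per_col": float(m.group(5)),
        "computed_per_col": weight / cols,  # cross-check against the logged value
    }


if __name__ == "__main__":
    line = ("Sun Oct 24 01:15:27 2021 matrix is 106764994 x 106765194 "
            "(56998.7 MB) with weight 16127184931 (151.05/col)")
    print(parse_matrix_line(line))
```

Comparing `computed_per_col` with `reported_per_col` is a cheap sanity check that a line was parsed correctly.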
2021-10-26, 06:21  #65
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
3×1,663 Posts 
And I suppose you will be comparing the other sieve run with higher LPs; there are probably still some leftovers.

2021-10-26, 07:48  #66
Jul 2003
So Cal
2^{4}×139 Posts 
We didn't sieve it twice. Only a little at the beginning was sieved with 33-bit LPs, and all the relations were combined. There are a few stragglers that I'm not worrying about.
