I found that ifort (and gfortran) create a temporary for the following array assignment:
a(1:n-inc:inc)= a(inc+1:n:inc)+b(1:n-inc:inc)
presumably because of the possibility that inc is less than zero. The result is stored in a stride-1 temporary and then copied to the destination; both the evaluation and the copy are reported as vectorized.
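Written out by hand, the temporary version would look roughly like the sketch below. The names tmp and m and the values of n and inc are my own, chosen only for illustration; what the compiler actually generates may well differ.

```fortran
program temp_sketch
  implicit none
  ! Illustrative sizes only; not the values used in the original test.
  integer, parameter :: n = 17, inc = 2
  real :: a(n), b(n), a_ref(n)
  real, allocatable :: tmp(:)      ! hypothetical name for the stride-1 temporary
  integer :: i, m

  do i = 1, n
     a(i) = real(i)
     b(i) = 0.5 * real(i)
  end do
  a_ref = a

  ! The array assignment as written in the post.
  a_ref(1:n-inc:inc) = a_ref(inc+1:n:inc) + b(1:n-inc:inc)

  ! Hand-written equivalent with an explicit stride-1 temporary.
  m = size(a(1:n-inc:inc))                 ! element count of the strided section
  allocate(tmp(m))
  tmp = a(inc+1:n:inc) + b(1:n-inc:inc)    ! evaluate the whole right-hand side first
  a(1:n-inc:inc) = tmp                     ! then copy into the strided destination

  if (any(a /= a_ref)) error stop "mismatch"
  print *, "explicit temporary matches array assignment"
end program temp_sketch
```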
If I write
do i= 1,n-inc,inc
a(i)= a(i+inc)+b(i)
enddo
ifort decides not to vectorize the loop when compiling with /QxAVX2. Apparently that's a good decision: adding a !dir$ simd to force vectorization with simulated gather-scatter makes it slower, even in the case inc==1 (though not as slow as the array assignment with the temporary).
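For anyone who wants to try this, here is a minimal sketch of the loop with the directive in place, checked against the array assignment. The values of n and inc and the name a_ref are assumptions for illustration, and it only covers inc > 0. I believe newer Intel compilers prefer !$omp simd over !dir$ simd, so treat the directive spelling as specific to the compiler version I used.

```fortran
program loop_sketch
  implicit none
  ! Illustrative sizes only; not the values used in the original test.
  integer, parameter :: n = 17, inc = 2
  real :: a(n), b(n), a_ref(n)
  integer :: i

  do i = 1, n
     a(i) = real(i)
     b(i) = 0.5 * real(i)
  end do
  a_ref = a

  ! Reference result from the array assignment (may use a temporary).
  a_ref(1:n-inc:inc) = a_ref(inc+1:n:inc) + b(1:n-inc:inc)

  ! Explicit loop. For inc > 0 each iteration reads a(i+inc) before any
  ! later iteration overwrites it, so no temporary is needed.
  ! The directive below is Intel-specific; gfortran treats it as a comment.
  !dir$ simd
  do i = 1, n-inc, inc
     a(i) = a(i+inc) + b(i)
  end do

  if (any(a /= a_ref)) error stop "mismatch"
  print *, "loop matches array assignment"
end program loop_sketch
```

Removing the directive line gives the version that ifort leaves unvectorized, so the two variants can be timed against each other.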
Intel's vecanalysis script:
http://software.intel.com/en-us/articles/vecanalysis-python-script-for-a...
reports heavy-overhead vectorization.
Just one more data point in the continuing question about marginal vectorization.