Cosine similarity between two ndarrays
I have two numpy arrays: the first is of size 100*4*200 and the second is of size 150*6*200. In other words, array 1 stores 100 samples, each with 4 fields represented as 200-dimensional vectors, and array 2 stores 150 samples, each with 6 fields represented as 200-dimensional vectors.
Now I want to compute similarity vectors between the samples and assemble them into a similarity matrix. For each pair of samples, I would like to calculate the similarity between each combination of fields and store it, so that I end up with a 15000*24-dimensional array.
The first 150 rows will be the similarity vectors between the 1st row of array 1 and the 150 rows of array 2, the next 150 rows will be the similarity vectors between the 2nd row of array 1 and the 150 rows of array 2, and so on. Each similarity vector has (# fields in array 1) * (# fields in array 2) elements: the 1st element is the cosine similarity between field 1 of array 1 and field 1 of array 2, the 2nd element is the similarity between field 1 of array 1 and field 2 of array 2, and so on, with the last element being the similarity between the last field of array 1 and the last field of array 2.
What is the best way to do this using numpy arrays?
Best answer
So every "row" (I assume along the first axis, which I'll call axis 0) is a sample. That means you have 100 samples, each a fields × dimensions block of shape 4 × 200.
Doing it the way you describe, the first row of the first array has shape (4, 200), while the second array as a whole has shape (150, 6, 200). You would then be taking a cosine distance between an (m, n) array and an (m, n, k) array, which does not make sense (the closest thing to a dot product here would be the tensor product, which I'm fairly sure is not what you want).
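To see why the per-sample shapes work out, here is a minimal sketch (on made-up random data, using the same cosine_similarity the answer below relies on) comparing one sample from each array:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)

# One sample from each array: 4 field vectors vs. 6 field vectors.
sample_a = a[0]   # shape (4, 200)
sample_b = b[0]   # shape (6, 200)

# cosine_similarity compares every row of its first argument with every
# row of its second, so two same-width 2-D inputs yield a 4 x 6 matrix:
# one similarity per (field of array 1, field of array 2) combination.
sim = cosine_similarity(sample_a, sample_b)
print(sim.shape)  # (4, 6)
```

Raveling that 4 x 6 matrix gives exactly the 24-element similarity vector the question asks for.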
So we have to extract the per-sample rows first and then iterate over all the pairings.
To do this I actually recommend just splitting the arrays with np.split and iterating over both of them, simply because I've never come across a faster way in numpy. You could use tensorflow to gain efficiency, but I won't go into that in this answer.
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)

# We know the output will be (100*150) x (4*6)
c = np.empty([15000, 24])

# Make a list with the rows of a, and the same for b
a_splitted = np.split(a, a.shape[0], 0)
b_splitted = np.split(b, b.shape[0], 0)

i = 0
for alpha in a_splitted:
    for beta in b_splitted:
        # Gives a 4 x 6 matrix
        sim = cosine_similarity(alpha[0], beta[0])
        c[i, :] = sim.ravel()
        i += 1
```

For the similarity function above I just chose what @StefanFalk suggested: sklearn.metrics.pairwise.cosine_similarity. If this similarity measure is not sufficient, you could write your own.
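If the double loop turns out to be too slow, the whole computation can also be expressed in pure numpy. This is a sketch (not part of the original answer): normalize every field vector to unit length, then let np.einsum do all 15000 × 24 dot products in one call:

```python
import numpy as np

a = np.random.rand(100, 4, 200)
b = np.random.rand(150, 6, 200)

# Normalize each 200-dim field vector to unit length, so that plain
# dot products become cosine similarities.
a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
b_n = b / np.linalg.norm(b, axis=-1, keepdims=True)

# Contract the 200-dim axis for every (sample of a, sample of b,
# field of a, field of b) combination: result shape (100, 150, 4, 6).
sims = np.einsum('ifd,jgd->ijfg', a_n, b_n)

# Flatten to the requested layout: row i*150 + j holds the raveled
# 4 x 6 similarity matrix for sample i of a and sample j of b.
c = sims.reshape(100 * 150, 4 * 6)
print(c.shape)  # (15000, 24)
```

The row ordering matches the split-and-loop version above, since reshape flattens the sample axes in the same order the nested loops visit them.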
I am not at all claiming that this is the best way to do this in Python. I think the most efficient way would be to do it symbolically using, as mentioned, tensorflow.
Anyway, hope it helps!