Python, solr and massive amounts of queries: need some suggestions
I'm facing a design problem in my project.
PROBLEM I need to query Solr with all the possible combinations (more or less 20 million) of some parameters extracted from our lists, to test whether they give at least 1 result. If they don't, that combination is inserted into a blacklist (used for statistical analysis and sitemap creation).
HOW I'M DOING IT NOW Nested for loops combine the parameters (extracted from Python lists) and pass them to a method (the same one I use in the production environment to query the DB within the website) that tests for 0 results. If the count is 0, another method inserts the combination into the blacklist. No threading involved.
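The current single-threaded approach can be sketched roughly like this; `colors`, `sizes`, `count_results` and the in-memory `blacklist` are hypothetical stand-ins for the real parameter lists, the production query method, and the DB insert:

```python
# Minimal sketch of the brute-force approach: enumerate every parameter
# combination and blacklist the ones that return zero results.
from itertools import product

# Hypothetical parameter lists (the real ones come from the site's data).
colors = ["red", "blue"]
sizes = ["S", "M", "L"]
shapes = ["round"]

def count_results(combo):
    # Placeholder for the real Solr query method; returns a fake hit count.
    return 0 if combo == ("blue", "L", "round") else 1

blacklist = []
for combo in product(colors, sizes, shapes):   # replaces the nested for loops
    if count_results(combo) == 0:
        blacklist.append(combo)                # placeholder for the DB insert
```

`itertools.product` generates the combinations lazily, so nothing like the full 20-million-item cross product ever needs to be held in memory at once.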
HOW I'D LIKE TO DO THIS I'd like to put all the combinations into a queue and let worker threads pull them, query, and insert, for better performance.
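That queue-plus-workers design could look something like the sketch below. The worker count, queue bound, and `has_results` function are placeholder choices, not part of the original post; note that a *bounded* queue sidesteps the size worry, because the producer simply blocks until workers drain it:

```python
# Sketch of a producer/consumer design: one producer generates combinations,
# several worker threads query and blacklist the empty ones.
import threading
from itertools import product
from queue import Queue  # the `Queue` module on Python 2

def has_results(combo):
    # Placeholder for the real Solr query.
    return combo != (1, 2)

def worker(q, blacklist, lock):
    while True:
        combo = q.get()
        if combo is None:          # sentinel: no more work for this thread
            q.task_done()
            break
        if not has_results(combo):
            with lock:             # the list is shared between threads
                blacklist.append(combo)
        q.task_done()

q = Queue(maxsize=1000)            # bounded: never holds 20M items at once
blacklist, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(q, blacklist, lock))
           for _ in range(4)]
for t in threads:
    t.start()
for combo in product([1, 2], [1, 2]):   # toy parameter lists
    q.put(combo)                   # blocks while the queue is full
for _ in threads:
    q.put(None)                    # one sentinel per worker
for t in threads:
    t.join()
```

Because the producer generates combinations lazily and `put` blocks on a full queue, memory use stays flat no matter how many combinations exist in total.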
WHAT PROBLEMS I'M EXPERIENCING Slowness: being single-threaded, it currently takes a very long time to complete (when and if it completes).
Connection reset by peer [104]: an error thrown by Solr after it has been queried for a while (I increased the pool size, but nothing changed). This is the most recurrent (and annoying) error at the moment.
Python hanging: I worked around this with a timeout decorator (not a correct solution, but it at least lets me get through the whole run and have quick test output for now; I'll drop it as soon as I find a smarter solution).
Queue max size: a queue object can contain up to 32k elements, so it won't fit my numbers.
WHAT I'M USING Python 2.7, MySQL, Apache Solr, sunburnt (a Python interface to Solr), a Linux box.
I don't need any code debugging, since I'd rather throw away what I did and make a fresh start than patch it over and over... trial and error is not what I like.
I'd welcome any suggestion you can think of for designing this the correct way. Links, websites, and guides are also very welcome, since my experience with this kind of script is still growing.
Thanks in advance for your help! If something is unclear, just ask; I'll answer/update the post if needed!
EDIT BASED ON SOME ANSWERS (will keep this updated) I'll probably drop Python threads for the multiprocessing lib: this could solve my performance issues.
Divide-and-conquer-based construction method: this should add some logic to my parameter construction, without needing any brute-force approach.
What I still need to know: where can I store my combinations to feed the worker threads? Maybe this is no longer an issue, since the divide-and-conquer approach may let me generate the combinations at runtime and split them between the worker threads.
NB: I won't accept any answer for now, since I'd like to keep this post alive for a while, just to gather more and more ideas (not only for me, but maybe as a future reference for others, given its generic nature).
Thanks again, everyone!
BEST ANSWER
Instead of brute force, switch to a divide-and-conquer approach while keeping track of the number of hits for each search. If you subdivide into partial combinations, some of those sets will be empty, so you can eliminate many subtrees at once. Add the missing parameters to the remaining searches and repeat until you are done. It takes more bookkeeping but far fewer searches.
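A hedged sketch of that pruning idea: query with a partial combination first, and only extend it with the next parameter when the partial query has hits, so an empty prefix blacklists its whole subtree with a single query. `DATA` and `hit_count` are toy stand-ins for the index and the real Solr count query:

```python
# Divide-and-conquer enumeration: prune every subtree whose prefix has 0 hits.
DATA = {("red", "S"), ("red", "M")}   # toy index: the only combos with hits

def hit_count(partial):
    # Stand-in for a Solr count query on a partial combination (prefix).
    return sum(1 for d in DATA if d[:len(partial)] == partial)

def expand(params, partial):
    # Enumerate every full combination that extends `partial`.
    if not params:
        yield partial
        return
    for value in params[0]:
        for combo in expand(params[1:], partial + (value,)):
            yield combo

def find_empty(params, partial=()):
    """Yield all full combinations with 0 hits, pruning empty prefixes."""
    if hit_count(partial) == 0:
        # Every extension of this prefix is empty: blacklist the whole
        # subtree without issuing one query per combination.
        for combo in expand(params, partial):
            yield combo
        return
    if not params:                     # full combination with hits: keep it
        return
    for value in params[0]:
        for combo in find_empty(params[1:], partial + (value,)):
            yield combo

blacklist = sorted(find_empty([["red", "blue"], ["S", "M"]]))
```

In this toy run the prefix `("blue",)` has 0 hits, so both `("blue", "S")` and `("blue", "M")` are blacklisted after a single count query, which is the bookkeeping-for-fewer-searches trade-off the answer describes.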