How to do multiple Kafka topics to multiple Spark jobs in parallel
Please forgive me if this question doesn't make sense, as I am just starting out with Spark and trying to understand it.
From what I've read, Spark is a good fit for doing real-time analytics on streaming data, which can then be pushed to a downstream sink such as HDFS/Hive/HBase, etc.
I have two questions about that. I am not clear on whether only one Spark Streaming job can run at any given time, or whether there can be multiple. Say I have different analytics I need to perform for each Kafka topic, or for each source that is streaming into Kafka, and I then need to push the results of those downstream.
Does Spark allow you to run multiple streaming jobs in parallel, so you can keep the aggregate analytics separate for each stream, or in this case for each Kafka topic? If so, how is that done, and is there any documentation you could point me to?
Just to be clear, my use case is to stream from different sources, and each source could have potentially different analytics I need to perform, as well as a different data structure. I want to be able to have multiple Kafka topics and partitions. I understand that each Kafka partition maps to a Spark partition, and that this can be parallelized.
What I am not sure about is how you run multiple Spark Streaming jobs in parallel, so that you can read from multiple Kafka topics and tabulate separate analytics on those topics/streams.
If not in Spark, is this something that's possible to do in Flink?
Second, how does one get started with Spark? It seems there is a company and/or distro to choose for each component: Confluent for Kafka, Databricks for Spark, and Hortonworks (HW)/Cloudera (CDH)/MapR for Hadoop. Does one really need all of these, or what is the minimal and easiest way to get going with a big data pipeline while limiting the number of vendors? It seems like such a huge task to even start on a POC.
Best answer
You have asked multiple questions so I'll address each one separately.
Does Spark allow you to run multiple streaming jobs in parallel? Yes.
Is there any documentation on Spark Streaming with Kafka? https://spark.apache.org/docs/latest/streaming-kafka-integration.html
How does one get started?
a. Book: https://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624/
b. Easy way to run/learn Spark: https://community.cloud.databricks.com
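To make the "yes" above concrete, here is a minimal sketch of one way to do it: a single application starts two independent streaming queries, one per Kafka topic, each with its own analytics and its own sink. Note that the linked documentation covers the DStream-based Kafka integration, while this sketch uses the Structured Streaming API, which is just one of several ways to run multiple streams in parallel. Everything specific in it (the broker address, the topic names topicA/topicB, and the output paths) is a hypothetical placeholder, and it assumes the spark-sql-kafka-0-10 connector package is available when you submit the job.

```scala
// A minimal sketch, not a drop-in implementation: one Spark application
// running two independent Structured Streaming queries, each reading its
// own Kafka topic and applying its own analytics. Broker address, topic
// names, and output paths are hypothetical placeholders.
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiTopicStreams {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("multi-topic-demo")
      .getOrCreate()
    import spark.implicits._

    // Read one Kafka topic as a streaming DataFrame of string values.
    def readTopic(topic: String): DataFrame =
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092") // hypothetical broker
        .option("subscribe", topic)
        .load()
        .selectExpr("CAST(value AS STRING) AS value")

    // Query 1: a running count per distinct value from topicA, printed
    // to the console (an aggregation, hence "complete" output mode).
    val queryA = readTopic("topicA")
      .groupBy($"value")
      .count()
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()

    // Query 2: raw events from topicB appended to Parquet files,
    // entirely independent of query 1 (its own offsets and checkpoint).
    val queryB = readTopic("topicB")
      .writeStream
      .outputMode("append")
      .format("parquet")
      .option("path", "/tmp/topicB-out")                 // hypothetical paths
      .option("checkpointLocation", "/tmp/topicB-chk")
      .start()

    // Both queries run concurrently inside this one application;
    // block until any of them terminates.
    spark.streams.awaitAnyTermination()
  }
}
```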
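As a design note: a single query can also subscribe to several topics at once, for example with .option("subscribe", "topicA,topicB") or with the subscribePattern option, which fits when the same analytics apply to every topic. Separate queries as sketched above, or even separate spark-submit applications if you want independent scheduling and failure isolation, are the natural fit when each topic needs different processing.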