# apache-spark

Apache Spark is an open-source, distributed, general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Here are 1,092 public repositories matching this topic...
Coolplay Spark: Spark source code analysis, Spark libraries, and more
Updated May 26, 2019 - Scala
Interactive and Reactive Data Science using Scala and Spark.
Updated Mar 31, 2021 - JavaScript
Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
Topics: python, scala, apache-spark, pytorch, keras-tensorflow, bigdl, distributed-deep-learning, deep-neural-network, analytics-zoo
Updated Aug 19, 2021 - Jupyter Notebook
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
Updated Aug 16, 2021 - Java
GoEddie commented on Dec 30, 2019
This is to track implementation of the ML-Features: https://spark.apache.org/docs/latest/ml-features
Bucketizer has been implemented in dotnet/spark#378 but there are more features that should be implemented.
- Feature Extractors
  - TF-IDF
  - Word2Vec (dotnet/spark#491)
  - CountVectorizer (https://github.com/dotnet/spark/p
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
Topics: kubernetes, spark, apache-spark, kubernetes-operator, kubernetes-controller, kubernetes-crd, google-cloud-dataproc
Updated Aug 18, 2021 - Go
Apache Spark docker image
Updated Apr 20, 2021 - Dockerfile
A curated list of awesome Apache Spark packages and resources.
Updated Aug 18, 2021
PySpark + Scikit-learn = Sparkit-learn
Updated Dec 31, 2020 - Python
(Deprecated) Scikit-learn integration package for Apache Spark
Updated Dec 3, 2019 - Python
C# and F# language binding and extensions to Apache Spark
Topics: streaming, spark, apache-spark, csharp, fsharp, bigdata, dataset, spark-streaming, eventhubs, mapreduce, dataframe, rdd, dstream, mobius, kafka-streaming, near-real-time
Updated Jan 29, 2021 - C#
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Topics: python, airflow, spark, apache-spark, scheduler, s3, data-engineering, data-lake, warehouse, redshift, data-migration, livy, etl-framework, apache-airflow, emr-cluster, etl-pipeline, etl-job, data-engineering-pipeline, airflow-dag, goodreads-data-pipeline
Updated Mar 9, 2020 - Python
R interface for Apache Spark
Updated Aug 18, 2021 - R
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Updated Jan 24, 2017 - Scala
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
Topics: data-science, machine-learning, spark, apache-spark, deep-learning, hadoop, tensorflow, keras, keras-models, optimization-algorithms, data-parallelism, distributed-optimizers
Updated Jul 25, 2018 - Python
Apache Spark enhanced with a native Kubernetes scheduler back-end. NOTE: this repository is being ARCHIVED, as all new development for the Kubernetes scheduler back-end now happens at https://github.com/apache/spark/
Updated Jan 8, 2020 - Scala
A reading list of papers related to streaming systems
Topics: streaming, apache-spark, storm, stream-processing, spark-streaming, dataflow, flink, heron, drizzle, millwheel, s4, streaming-engine, spe, stream-processing-engine
Updated Mar 31, 2018
A command-line tool for launching Apache Spark clusters.
Updated Jun 13, 2021 - Python
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
Updated Feb 22, 2021 - Java
This is the GitHub repo for Learning Spark: Lightning-Fast Data Analytics [2nd Edition]
Updated Apr 15, 2021 - Scala
A list about Apache Kafka
Topics: infrastructure, kafka, apache-spark, stream-processing, apache-kafka, kafka-streams, data-processing, data-pipeline, streaming-data
Updated Jul 27, 2021
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
Topics: python, vagrant, data-science, data, machine-learning, airflow, kafka, spark, apache-spark, analytics, machine-learning-algorithms, python3, amazon-ec2, python-3, apache-kafka, amazon-web-services, predictive-analytics, agile-data, data-syndrome, agile-data-science
Updated May 28, 2021 - Jupyter Notebook
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.
Updated Jun 15, 2021 - Scala
The Internals of Spark Structured Streaming
Updated May 23, 2021
A boilerplate for writing PySpark Jobs
Updated Mar 30, 2021 - Python
Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
Updated Sep 14, 2015 - Shell
Created by Matei Zaharia
Released May 26, 2014
- Repository: apache/spark
- Website: spark.apache.org
- Wikipedia
URLs with the issue:
Description of proposal:
Document the maximum value and legal characters for log_param, log_metric and set_tag. Note that log_metric's value i