[go: up one dir, main page]

Portability Framework Roadmap

Overview

Interoperability between SDKs and runners is a key aspect of Apache Beam. Previously, the reality was that most runners supported the Java SDK only, because each SDK-runner combination required non-trivial work on both sides. Most runners are also currently written in Java, which makes support of non-Java SDKs far more expensive. The portability framework rectified this situation and provided full interoperability across the Beam ecosystem.

The portability framework introduces well-defined, language-neutral data structures and protocols between the SDK and runner. This interop layer – called the portability API – ensures that SDKs and runners can work with each other uniformly, reducing the interoperability burden for both SDKs and runners to a constant effort. It notably ensures that new SDKs automatically work with existing runners and vice versa. The framework introduces a new runner, the Universal Local Runner (ULR), as a practical reference implementation that complements the direct runners. Finally, it enables cross-language pipelines (sharing I/O or transformations across SDKs) and user-customized execution environments (“custom containers”).

The portability API consists of a set of smaller contracts that isolate SDKs and runners for job submission, management and execution. These contracts use protobufs and gRPC for broad language support.

The goal is that all (non-direct) runners and SDKs eventually support the portability API, perhaps exclusively.

If you are interested in digging in to the designs, you can find them on the Beam developers’ wiki. Another overview can be found here.

Status

All SDKs currently support the portability framework. There is also a Python Universal Local Runner and shared java runners library. Performance is good and multi-language pipelines are supported. Currently, the Flink and Spark runners support portable pipeline execution (which is used by default for SDKs other than Java), as does Dataflow when using the Dataflow Runner v2. See the Portability support table for details.

Issues

The portability effort touches every component, so the “portability” label is used to identify all portability-related issues. Pure design or proto definitions should use the “beam-model” component. A common pattern for new portability features is that the overall feature is in “beam-model” with subtasks for each SDK and runner in their respective components.

Issues: query

Prerequisites: Docker, Python, Java 8

The Beam Flink runner can run Python pipelines in batch and streaming modes. Please see the Flink Runner page for more information on how to run portable pipelines on top of Flink.

Running Python wordcount on Spark

The Beam Spark runner can run Python pipelines in batch mode. Please see the Spark Runner page for more information on how to run portable pipelines on top of Spark.

Python streaming mode is not yet supported on Spark.

SDK Harness Configuration

See here for more information on SDK harness deployment options and here for what goes into writing a portable SDK.