[go: up one dir, main page]

Apache Beam Typescript SDK

The Typescript SDK for Apache Beam provides a simple, powerful API for building batch and streaming data processing pipelines.

Get started with the Typescript SDK

Get started with the Beam Typescript SDK quickstart to set up your development environment, get the Beam SDK for Typescript, and run an example pipeline. Then, read through the Beam programming guide to learn the basic concepts that apply to all SDKs in Beam.

Overview

We generally try to apply the concepts from the Beam API in a TypeScript idiomatic way. In addition, some notable departures are taken from the traditional SDKs:

An example pipeline can be found at wordcount.ts and more documentation can be found in the beam programming guide.

Pipeline I/O

See the Beam-provided I/O Transforms page for a list of the currently available I/O transforms.

Supported Features

The Typescript SDK is still under development but already supports many, but not all, features currently supported by the Beam model, both batch and streaming. It also has extensive support for cross-language transforms which can be leveraged to use more advanced features from Typescript pipelines.

Serialization

As Beam is designed to run in a distributed environment, all functions and data are required to be serializable.

By default, data is serialized using a BSON encoding, though this can be customized by applying the withRowCoder or withCoderInternal transforms to a PCollection.

Functions that are used in transforms (such as map), including closures and their captured data, are serialized via ts-serialize-closures. While this handles most cases well, it still has limitations and can capture, and in its walk of the transitive closure of referenced objects may capture objects that are better imported rather than serialized. To avoid these limitations one, one can explicitly register references with the requireForSerialization function as follows.

// in module my_package/module_to_be_required
import { requireForSerialization } from "apache-beam/serialization";

// define or import various objects_to_register here

requireForSerialization(
    "my_package/module_to_be_required", { objects_to_register });

The starter project has such an example.