inc-join Python/PySpark library released

Incremental join

During my time at ABN, one of the most complex problems I worked on was joining large datasets that were refreshed incrementally. The complexity comes from the fact that data may arrive late, or not at all.

There is a tradeoff here between completeness and performance: the more data you include in the join, the more complete the result will be, but the join also becomes slower and more expensive.

Since I now have some time between jobs, I decided to start over with a new implementation that addresses this tradeoff. It is now available as an open-source Python/PySpark library.

By using this library you will get:

– Better performance (by joining only the data you actually need to look back on, instead of everything)

– Consistency in joining (by using a clear definition of how far you want to look back and how long you want to wait for late-arriving data)

– Understandable concepts (sliding join window, output window, timed-out records, etc.)

– Cleaner code (because the complexity is contained in the library)
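To make the completeness versus performance tradeoff concrete, here is a minimal plain-Python sketch of joining two incrementally loaded datasets with a look-back window. This is not the library's API: the function `windowed_join`, the parameter `look_back_days`, and the sample data are all hypothetical and only illustrate the general idea.

```python
from datetime import date, timedelta

# Hypothetical sample data: (key, arrival_date) for two incrementally loaded datasets.
A = [("t1", date(2024, 1, 1)), ("t2", date(2024, 1, 2)), ("t3", date(2024, 1, 3))]
# In B, t2 arrives three days after its counterpart in A, and t3 never arrives.
B = [("t1", date(2024, 1, 1)), ("t2", date(2024, 1, 5))]


def windowed_join(a_rows, b_rows, run_date, look_back_days):
    """Join A and B on key, using only rows that arrived inside the
    look-back window [run_date - look_back_days, run_date].

    Returns (matched keys, unmatched A keys, number of rows scanned).
    """
    start = run_date - timedelta(days=look_back_days)
    a_win = {k: d for k, d in a_rows if start <= d <= run_date}
    b_win = {k: d for k, d in b_rows if start <= d <= run_date}
    matched = sorted(k for k in a_win if k in b_win)
    unmatched = sorted(k for k in a_win if k not in b_win)
    scanned = len(a_win) + len(b_win)  # rough proxy for the cost of the join
    return matched, unmatched, scanned


if __name__ == "__main__":
    for days in (1, 3, 4):
        print(days, windowed_join(A, B, date(2024, 1, 5), days))
```

Running the join on 5 January with a 1-day look-back finds no matches, a 3-day look-back recovers the late-arriving t2, and a 4-day look-back recovers t1 as well, at the cost of scanning more rows each time. The key t3 stays unmatched however far back you look, so a real implementation also has to decide how long to keep waiting before treating such a record as timed out.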

Detailed documentation and a getting-started guide can be found here:

https://github.com/basvdberg/IncrementalJoin