How Map-Reduce Uses Micro-Services with Pub-Sub

Damindu Lakmal
Apr 20, 2020

Micro-Service Overview

Map-Reduce

MapReduce is a programming model and processing technique that takes in large data sets and generates output data with a parallel, distributed algorithm on a cluster.

Map

The map function takes input data as <key,value> pairs, processes them on a node, and produces intermediate <key,value> pairs as output.

Shuffle

Worker nodes redistribute the map output based on the output keys, so that all data belonging to one key ends up on the same worker node.

Reduce

The reduce function processes the redistributed data in parallel, one group per key.
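
To make the three steps concrete, here is a minimal, single-process word-count sketch in plain Java. The grouping map stands in for the shuffle: it gathers all values for one key together before they are reduced. The input sentences and class name are illustrative assumptions, not part of any framework.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCount {
    public static void main(String[] args) {
        List<String> input = List.of("hello world", "hello map reduce");

        // Map: emit an intermediate <word, 1> pair for every word
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : input) {
            for (String word : line.split(" ")) {
                mapped.add(Map.entry(word, 1));
            }
        }

        // Shuffle: group intermediate pairs so all values for one key sit together
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            shuffled.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce: process each key's group, here by summing the counts
        shuffled.forEach((word, counts) ->
            System.out.println(word + " = " + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}

In the rest of this article Kafka plays exactly these roles: mappers produce intermediate pairs, topic partitions carry the shuffle, and reducers consume one partition each.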

What is Kafka

Kafka is a distributed streaming platform with the following capabilities:

  • Publish and subscribe to streams of records on a given topic (a minimal producer sketch follows this list)
  • Store streams of records in a fault-tolerant, durable way, with support for features such as log compaction
  • Process streams of records in real time
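
As a taste of the publish side, here is a minimal producer sketch using the org.apache.kafka.kafka-clients package. The broker address and topic name are assumptions for illustration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSketch {
    public static void main(String[] args) {
        final Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Publish one <key, value> record to Topic_1
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("Topic_1", "some-key", "some-value"));
        }
    }
}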

Why Kafka

Communication between clients and servers uses a simple, high-performance, language-agnostic TCP protocol. The protocol is versioned and maintains backwards compatibility with older versions. Kafka ships with a Java client, but clients are available in many languages.

  • Kafka runs as a cluster on one or more servers that can span multiple datacenters
  • The Kafka cluster stores streams of records in categories called topics
  • Each record consists of a key, a value, and a timestamp (see the consumer sketch after this list)
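
And the subscribe side: a minimal consumer sketch that prints each record's key, value and timestamp. The group id here is an illustrative assumption; it matters later, because instances sharing a group id split a topic's partitions between them.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ConsumerSketch {
    public static void main(String[] args) {
        final Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // assumed broker address
        props.put("group.id", "mapper-group");           // instances with the same group id share partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("Topic_1"));
            while (true) {
                // Each record carries a key, a value and a timestamp
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s (ts=%d)%n", record.key(), record.value(), record.timestamp());
                }
            }
        }
    }
}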

How Map-Reduce Works

Overview

Map Reduce Abstract View

Basic Implementation

Configure Kafka Environment

Go through the Kafka documentation to set up the foundation for this approach. Kafka uses ZooKeeper, so you need to start a ZooKeeper server first if you don't already have one. Then add the topics to publish and subscribe to. You can choose the partition count of each topic, which will be important later on.

Before creating the topics, we need a rough idea of the working flow.

Flow of map-reduce

The diagram above shows one iteration of the map-reduce flow. There are two mapper instances, each subscribed to the two topics. Topic_1 has two partitions, which are divided equally between the two instances, but Topic_2 forwards its messages only to instance 2, because both instances share the same client id (group id). Each instance processes messages through its map function and produces the shuffled output to sub_topic.

In this scenario sub_topic has three partitions, which means at most three reducers can work in parallel. The shuffled messages are processed by the reducer you define to produce the desired output; the output can be turned into any result by changing the reducer function.
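
Topic creation can also be scripted with the AdminClient from kafka-clients. Below is a minimal sketch, assuming a single local broker (hence replication factor 1) and the partition counts from the diagram; giving Topic_2 a single partition is an assumption consistent with all of its messages going to one instance.

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws Exception {
        final Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Partition counts follow the flow above: Topic_1 has two partitions,
            // sub_topic has three, so at most three reducers can run in parallel
            admin.createTopics(List.of(
                new NewTopic("Topic_1", 2, (short) 1),
                new NewTopic("Topic_2", 1, (short) 1),
                new NewTopic("sub_topic", 3, (short) 1)
            )).all().get(); // wait for creation to complete
        }
    }
}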

Java Code Example for Beginners in Kafka

Basic Build Properties

The Kafka API provides the foundation for this example. First, create a Java project with the org.apache.kafka.kafka-streams and org.apache.kafka.kafka-clients packages. Then insert the configuration below under the main function of each mapper and reducer project.
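
With Maven, for example, the two packages can be declared like this (the version number is an assumption; use whatever release is current):

<dependencies>
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>2.4.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.4.1</version>
  </dependency>
</dependencies>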

1. Stream Build Configuration

The implementation below is used by each map and reduce function to begin the stream configuration.

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

final Properties config = new Properties();
// the application id also acts as the consumer group id of the instance
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-starter-app");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");
config.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
final StreamsBuilder builder = new StreamsBuilder();

2. Shuffle Function

Create a new Partitioner class that implements the shuffle function. The map function emits <key, value> output that can be redirected to a desired destination, i.e. a specific reducer instance, so that the instances work in parallel. From the map output's key and value and the topic's partition count, the partitioner returns the partition to which the record is published.

import org.apache.kafka.streams.processor.StreamPartitioner;

class Partitioner implements StreamPartitioner<String, String> {
    public Integer partition(final String topic, final String key, final String value, final int numPartitions) {
        /* Shuffle function: choose the partition for a record from its key,
           value and the number of partitions in the topic. The mapper's final
           <key, value> output is produced to the partition returned here;
           hashing the key, for example, sends every record with the same key
           to the same reducer. */
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}

3. Map Function

After adding the stream build configuration above with a specific client id, the map function maps each value to a specific key for the shuffle step.

final KStream<String, String> topic1 = builder.stream("Topic_1");
final KStream<String, String> topic2 = builder.stream("Topic_2");

// Store, in case some additional data is needed
final GlobalKTable<String, String> table = builder.globalTable("service_constraint",
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("store" /* table/store name */)
        .withKeySerde(Serdes.String())   /* key serde */
        .withValueSerde(Serdes.String()) /* value serde */);

/* Perform the join, transform, filter or other operations that implement the
   mapper. The KStream "topic" refers to the final aggregated branch of the
   stream; here the two input topics are simply merged. */
final KStream<String, String> topic = topic1.merge(topic2);

final Partitioner partitioner = new Partitioner(); // shuffle logic
topic.to("sub_topic", Produced.with(Serdes.String(), Serdes.String(), partitioner));

4. Reduce Function

After adding the stream build configuration above with a specific client id, the reducer fetches the mapper's shuffled data into the reduce function and produces the desired output.

final KStream<String, String> subTopic = builder.stream("sub_topic");

// Store, in case some additional data is needed
final GlobalKTable<String, String> table = builder.globalTable("service_constraint",
    Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("store" /* table/store name */)
        .withKeySerde(Serdes.String())   /* key serde */
        .withValueSerde(Serdes.String()) /* value serde */);

/* Perform the join, transform, filter or other operations that implement the
   reducer functionality. */
subTopic.to("output", Produced.with(Serdes.String(), Serdes.String()));
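
One step the snippets above leave out: each mapper and reducer instance still has to build and start its topology. A minimal sketch:

final KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

// close the streams client cleanly when the service shuts down
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));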

Conclusion

This was only an overview of the functions you can find in the Kafka and KStream packages. I'd recommend reading the Kafka documentation and getting to know the KStream package, especially around custom types, which can make your code safe in a production environment.

If you want the same functionality in your production code, the only real solution is to write your own functions for your custom types (the commented sections in the snippets above).

Thank you for reading this article. I hope you enjoyed it!
