Leveraging Fluent Bit in Large-Scale Machine Learning Model Pipeline

Sophia
8 min read · Mar 25, 2024

Overview

This post introduces Fluent Bit, a lightweight log processing and forwarding tool. It then walks through a movie recommendation scenario to show how Fluent Bit can be integrated with a machine learning model in the pipeline design of a large-scale recommendation system. Finally, it compares Fluent Bit with Fluentd and Logstash, discussing Fluent Bit's advantages and limitations.

What is Fluent Bit?

Introduction

Fluent Bit is a scalable observability tool that processes and forwards logs, metrics, and other data in real time. Its lightweight footprint, event-driven architecture, and secure communication make it well suited to large-scale production scenarios, including MLOps pipelines and the monitoring of the data that flows through them.

Architecture Overview

To process logs and metrics, Fluent Bit runs data through a pipeline of stages: Input, Parser, Filter, Buffer, Router, and Output. The data pipeline can be illustrated as follows:

Data Pipeline Steps in Fluent Bit

Problems that Fluent Bit Addresses

The main advantages of using Fluent Bit are processing logs for further data analysis and gathering metrics on system performance. Logs can come from many sources, such as the user interface, login access, system performance, and database activity. Fluent Bit can process logs from servers, containers, and embedded systems, and one of its most common uses is Kubernetes log analysis.

When Kubernetes is used for large-scale system development, a deployment may span many nodes, pods, and containers. This can make it difficult to trace logs back to their source and identify which part of the system has an error or needs improvement. Fluent Bit helps by collecting the log data, processing it into key-value pairs, and storing it for further use.
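As a minimal sketch of this use case (assuming containers write CRI-format logs under the conventional /var/log/containers path), a configuration could tail the container logs and let Fluent Bit's built-in kubernetes filter enrich each record with pod metadata:

```
# Tail container log files; the built-in "cri" parser handles CRI-formatted lines
[INPUT]
    name    tail
    path    /var/log/containers/*.log
    parser  cri
    tag     kube.*

# Enrich records with Kubernetes metadata (namespace, pod name, labels, ...)
[FILTER]
    name       kubernetes
    match      kube.*
    merge_log  on
```

With merge_log enabled, application logs that are themselves JSON are lifted into top-level keys, so downstream filters can match on individual fields.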

These features make Fluent Bit a good fit for large-scale machine learning system design. Let's first identify some common problems in such systems in the following section.

Problems in Designing a Large-Scale Machine Learning System

Data Pipeline

For machine learning projects, data analysis and pipeline design influence model training and performance. In production systems with large-scale data, the pipeline design also affects system performance and, in turn, the user experience. Here are some common problems at different stages of a production machine learning system.

Input Streaming Data

Possible issues in input data gathering include high data velocity and large amounts of streaming data arriving in parallel, which can make the system hard to scale or introduce latency into the overall process.

ETL Process

The ETL (Extract, Transform, Load) process is one of the most important parts of data analysis for machine learning training and testing. Data transformation can cause issues if the process is not stable and well structured. In addition, data may be lost if the system loses its connection to the input source, or become inconsistent if the system cannot handle a large amount of data in parallel.

Model Endpoint and Training

Potential issues in model endpoint deployment and the training process include model versioning, model performance monitoring, and data drift detection.

Others

Other issues in production machine learning systems include the security and privacy of data, and the management and optimization of resource utilization. Security covers not only data collection from the input and storage in the database, but also the entire communication process in the data pipeline. For resource management, maximizing efficiency while minimizing cost is also an important consideration.

Advantages of using Fluent Bit

Fluent Bit's lightweight architecture addresses the resource-efficiency concerns of production machine learning systems. It can collect data from different input sources, perform the ETL pipeline in a structured way, and deliver processed data to different output destinations. It also provides security features for data processing and for saving data to storage systems.

Application Example: Using Fluent Bit for Movie Recommendation System

Installation

To install Fluent Bit, one can follow the documentation on the Fluent Bit website. The following is an example of installation on macOS with Homebrew:

brew install fluent-bit

You can also build Fluent Bit from source on macOS, and installation guides for Windows, Ubuntu, and other operating systems are available in the official documentation. Fluent Bit also supports installation on Kubernetes, Docker, and many other platforms.

Log Data Processing

I tried processing a log file with Fluent Bit that records which movies users watched and the ratings users gave them. Here are the files I used in the experiments:

parsers.conf

[PARSER]
    Name   custom_parser
    Format regex
    Regex  ^(?<time>[^,]+),(?<user>[^,]+),(?<method>[^ ]+) (?<path>.*)

In the parsers.conf file, I defined a customized parser that fits the format of my input data; it can be adjusted to other data structures. The parser is referenced in the fluent-bit.conf file for data processing.
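To illustrate how the regex splits a record, here is a hypothetical input line in the comma-separated format the parser expects (the values are made up), together with the key-value pairs it would produce:

```
# hypothetical input line
2024-03-25T10:15:00,12345,GET /rate/movie=42

# resulting key-value pairs
time   = 2024-03-25T10:15:00
user   = 12345
method = GET
path   = /rate/movie=42
```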

fluent-bit.conf

[SERVICE]
    flush        1
    daemon       Off
    log_level    info
    parsers_file parsers.conf

[INPUT]
    name           tail
    path           ./sample_kafka_dset.log
    parser         custom_parser
    read_from_head true
    tag            movie_recommendation
    exit_on_eof    on

[FILTER]
    name  record_modifier
    match movie_recommendation

[OUTPUT]
    name   file
    match  *
    format csv
    file   processed_kafka_dset.csv
    path   ./output

The fluent-bit.conf file contains an INPUT section that specifies the path of the data to analyze and the plugin used to read it. In this example, the input file sample_kafka_dset.log sits in the same folder as the config file; it comes from a Kafka stream recording the movies users watched and the ratings users gave. The custom_parser is the parser defined in the parsers.conf file.

In the FILTER section, one can apply filtering criteria to the input data, or customize the filtering with a Lua script.
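For example, a Lua filter could be attached as follows. The script name filters.lua and the function name keep_rating_events are assumptions for illustration; the function receives (tag, timestamp, record) and returns -1 to drop a record, or 1 together with a modified timestamp and record to keep it:

```
# Hypothetical Lua filter: filters.lua would define keep_rating_events(tag, ts, record)
[FILTER]
    name    lua
    match   movie_recommendation
    script  filters.lua
    call    keep_rating_events
```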

The OUTPUT section specifies the format in which the processed data should be stored. In this example, I store the output in .csv format, which can be used for further machine learning model training. The path option specifies the directory for the output .csv file; this directory must exist before the Fluent Bit server is started.

To run the Fluent Bit server, type the following command in the terminal:

fluent-bit -c fluent-bit.conf

Part of the result with the above settings is shown below. The data is parsed and organized in .csv format, ready for machine learning model training or testing.

Sample Output (.csv format) of Using Fluent Bit for Movie Recommendation Data Processing

The output can also be stored in other formats. For example, the following prints the output to the console in JSON format as key-value pairs.
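A sketch of an OUTPUT section that prints to the console instead of writing a file (json_lines is one of the formats supported by the stdout output plugin):

```
# Print each record to the console as one JSON object per line
[OUTPUT]
    name    stdout
    match   movie_recommendation
    format  json_lines
```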

Sample Output of Using Fluent Bit for Movie Recommendation Data Processing

Example of Fluent Bit Pipeline for Movie Recommendation System

An example pipeline that integrates Fluent Bit with a machine learning model for a movie recommendation system is shown below:

Example of System Design Architecture for a Movie Recommendation System Leveraging Fluent Bit

Here is how Fluent Bit can support a production machine learning system in a movie recommendation scenario:

Step 1: Connect the input data source to Fluent Bit, using the customized parser to parse the Kafka log that contains the users' movie-watching history. Other information, such as login-system logs capturing users' login patterns or movie features from the TMDB API, can also be included; TMDB data can be obtained through its RESTful API and parsed accordingly.

Step 2: Parse and filter the input data with Fluent Bit, performing the ETL process and converting the data into the format the ML model expects.

Step 3: Feed the data to a machine learning model for inference, and receive its output: a list of 20 recommended movies for the user. The machine learning model itself is not part of the Fluent Bit pipeline; it is served separately on other servers.

Step 4: Use Fluent Bit again to buffer the results and output them to a Kafka stream that delivers the recommended movies to the user interface.
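The ingestion and delivery ends of this pipeline could be wired up with Fluent Bit's Kafka plugins. The broker address and topic names below are placeholders, and the two sections would live in the Step 1 and Step 4 stages of the pipeline, respectively:

```
# Step 1: consume the movie-watching events from a Kafka topic
[INPUT]
    name     kafka
    brokers  kafka-broker:9092
    topics   movie_watch_events
    tag      movie_recommendation

# Step 4: publish the processed recommendation results back to Kafka
[OUTPUT]
    name     kafka
    match    movie_recommendation
    brokers  kafka-broker:9092
    topics   movie_recommendation_results
```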

Comparison with Fluentd

What is Fluentd?

Fluentd is also an observability tool for log and metric processing, developed by the same company as Fluent Bit. Compared with Fluent Bit, Fluentd provides a more comprehensive set of plugins and features for data processing. See the Fluentd website for more details about the tool.

Strength of Fluent Bit

Compared with Fluentd, Fluent Bit is a more lightweight and resource-efficient tool for log monitoring and processing. It provides high performance with lower cost, zero dependencies, and low memory usage (around 650 KB), which makes it well suited to containerized deployments.

Limitations of Fluent Bit

Since Fluent Bit is designed to be lightweight, it has a smaller plugin ecosystem than Fluentd. Its data-processing features are also more limited, so Fluentd may be a better fit for enterprise-level development.

Comparison with Logstash

What is Logstash?

Logstash is also a data processing pipeline tool for collecting data from different sources, transforming it with specific preprocessing steps, and sending it to the preferred output target. It is often used in the ELK (Elasticsearch, Logstash, and Kibana) stack for log collection and data analysis. More details can be found on the Logstash website.

Strength of Fluent Bit

In terms of resource usage such as CPU and memory, Fluent Bit incurs lower overhead than Logstash. It is also better suited to microservice architectures, edge computing environments, and IoT devices.

Limitations of Fluent Bit

Because Fluent Bit is so lightweight, it is less suited to acting as a centralized aggregation service. In addition, Logstash has a more complete ecosystem for integration with other tools, such as the rest of the ELK stack.

Conclusion

This post introduced basic information about Fluent Bit, used the example of a large-scale production machine learning system (a movie recommendation system) to show how Fluent Bit can help in such a scenario, and compared Fluent Bit with two tools offering similar features. Its lightweight design, secure data processing and storage, and structured processing pipeline make it a useful building block in data pipeline design.

References

https://fluentbit.io/

https://github.com/fluent/fluent-bit

https://docs.fluentbit.io/manual

https://www.fluentd.org/

https://www.elastic.co/logstash


Sophia

Human-Computer Interaction | Bioinformatics | Software Engineering | DevOps