CASE STUDY

Scalable data pipeline

A scalable data pipeline enhances the operational efficiency of a social media platform that receives more than 1 million visits per day

Background

Successful social media platforms generate large volumes of data every day.

This data needs to be logged, processed, and managed regularly. These data management and processing tasks can take up a large share of the team's time and carry the risk of human error.

Beyond that, some tasks, such as collecting meta-information for data already in the system, are practically impossible to carry out manually.
This meta-information is essential for enriching the raw data in the system. Without it, the user experience cannot be improved.

The client engaged us to set up a data processing pipeline to process and manage this data effectively.

Solution

We created a scalable data pipeline that could collect, manage, and enhance information effectively.

Solution highlights:

  •   Parallel processing of data - The implementation was lock-free to maximize data processing throughput.
  •   New scoring and ranking algorithm - This algorithm quickly assigned scores to the information and ranked it by those scores.
  •   Bad-word filtering - Filtered out "bad words" along with their misspellings.
  •   Data cleanup - Removed unrelated information and noise from the raw data.
  •   Data analysis
  •   Meta-information gathering - Collected meta-information for existing content from multiple sources on the internet.
  •   Asynchronous logging of activity - Logging was made asynchronous so that it did not impact the performance of the website.
  •   Asynchronous log management
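
As an illustration of the bad-word filtering highlight above, the following is a minimal sketch of how misspelled or disguised words might be caught. It is not the client's implementation: the blocklist (mild placeholder words), the character substitutions, the 0.8 similarity threshold, and the function names are all assumptions made for this example. It uses only Python's standard library.

```python
import re
from difflib import SequenceMatcher

# Hypothetical blocklist: mild placeholder words stand in for a real list.
BLOCKLIST = {"darn", "heck", "blasted"}

# Assumed character swaps that are often used to disguise a word.
SUBSTITUTIONS = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "@": "a", "$": "s"}
)


def normalize(token: str) -> str:
    """Lowercase, undo common character swaps, drop non-letters,
    and collapse runs of a repeated letter ("daaarn" -> "darn")."""
    token = token.lower().translate(SUBSTITUTIONS)
    token = re.sub(r"[^a-z]", "", token)
    return re.sub(r"(.)\1+", r"\1", token)


def is_flagged(token: str, threshold: float = 0.8) -> bool:
    """Return True if the normalized token is close to a blocklisted word.

    The threshold is an assumed value; a lower one catches more
    misspellings but also flags more innocent words.
    """
    word = normalize(token)
    if not word:
        return False
    return any(
        SequenceMatcher(None, word, normalize(bad)).ratio() >= threshold
        for bad in BLOCKLIST
    )


def censor(text: str) -> str:
    """Replace each flagged whitespace-separated token with asterisks."""
    return " ".join(
        "*" * len(token) if is_flagged(token) else token
        for token in text.split()
    )
```

For example, `censor("what the h3ck is blastd")` masks both "h3ck" (a character swap) and "blastd" (a dropped letter), while a nearby innocent word such as "dark" stays below the assumed threshold and is left alone. Comparing every token against every blocklist entry is simple but slow, so a production filter would likely need an index or precomputed variants.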

Results

With our scalable data pipeline, the client was able to better organize the data flow, automate tasks, scale up its data gathering and data processing operations, and make better use of the gathered data.

Some of the benefits realized by the client:

  •   Improved operating efficiency - Tasks previously done by people on the website were automated.
  •   Improved user experience - The new data pipeline collected rich information that was not gathered previously.
  •   Scaled-up data operations - The client was able to handle more data in less time with no human involvement.