Incorporated 50 years ago, the client is one of the major streaming companies, with more than 50 million subscribers, and has continuously improved its products and services across generations. Global Data & AI ("DAI") is a cross-vertical organization built around an integrated, full-stack data and analytics platform. We focus primarily on end-to-end data pipelines feeding products and services that influence decisions through analytical modeling and both probabilistic and deterministic signals. DAI mission: build best-in-class data & AI products and solutions to enhance storytelling and experiences for the client's audiences globally.
AWS is the leading cloud provider, and to solve the business challenge described above we needed a platform that would let us develop and deploy faster. We needed storage for vast amounts of data (petabytes), which is where S3 helps. We also needed to process and transform that data without incurring additional costs once the work is done, which is where transient EMR clusters, provisioned on demand and terminated afterward, fit in. AWS additionally provides MSK and integrates with Databricks for scale, which fits very well with our automation framework.
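A transient EMR cluster can be provisioned with boto3 by setting `KeepJobFlowAliveWhenNoSteps` to `False`, so the cluster shuts down as soon as its steps finish. The sketch below shows one way to build such a request; the cluster name, log bucket, instance types, and step arguments are illustrative, not taken from our production setup.

```python
def build_transient_emr_request(name, log_uri, step_args):
    """Build a boto3 run_job_flow request for a transient EMR cluster
    that auto-terminates after its steps complete (no idle cost)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Transient cluster: shut down when the last step completes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "transform",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": step_args},
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# Usage (requires boto3 and AWS credentials; bucket/script paths are examples):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.run_job_flow(**build_transient_emr_request(
#     "nightly-transform", "s3://my-logs/emr/",
#     ["spark-submit", "s3://my-code/transform.py"]))
```

Because `ActionOnFailure` is set to `TERMINATE_CLUSTER`, a failed step also tears the cluster down rather than leaving it running.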
We architected the solution following best practices and the AWS Well-Architected Framework. A great deal of useful information is available on the internet and from third parties (free and paid), in addition to what we already hold in our content metadata repository. The challenge is integrating data from these sources and linking it to create a unified, consistent representation. Content metadata (data about movies/series, talent, etc.) is one of the most foundational datasets, powering use cases such as Content Valuation, Portfolio Optimization, Demand Prediction, Marketing Analytics, Search, and Recommendation.
This unified data representation helps address the use cases specified below:
We assessed the volume, velocity, variety, and veracity needs, arrived at a test strategy, and designed the framework accordingly.
We prepare test data and store it in an S3 bucket; from there, data is pushed to MSK at regular, configured intervals.
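The push from S3 to MSK boils down to serializing each test record into a keyed Kafka message. The sketch below separates the pure serialization step (testable without a broker) from the produce step; the broker address, topic name, and `content_id` key field are illustrative assumptions, and the producer shown is the open-source `kafka-python` client.

```python
import json

def to_kafka_records(rows, key_field="content_id"):
    """Serialize test rows (dicts) into (key, value) byte pairs for Kafka.
    The key drives partitioning so updates for one title stay ordered."""
    records = []
    for row in rows:
        key = str(row[key_field]).encode("utf-8")
        value = json.dumps(row, sort_keys=True).encode("utf-8")
        records.append((key, value))
    return records

# Producing to MSK with kafka-python (endpoint and topic are examples):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="b-1.msk.example.com:9092")
# for key, value in to_kafka_records(rows):
#     producer.send("content-metadata-test", key=key, value=value)
# producer.flush()
```

In practice the commented produce loop would run on a schedule to match the regular intervals mentioned above.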
Data is enriched using Databricks on AWS, and at each stage of the transfer, data is stored in Delta Lake test tables.
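One way to structure the enrichment is as a pure per-record function, which can be unit-tested locally and then applied at scale on Databricks, with each stage's output written to a Delta Lake test table. Everything below is a sketch under that assumption; the reference lookup, field names, and table names are hypothetical.

```python
def enrich_record(record, reference):
    """Enrich a raw metadata record with fields from a reference lookup
    (e.g., third-party metadata keyed by content_id)."""
    enriched = dict(record)
    extra = reference.get(record.get("content_id"), {})
    enriched.update(extra)
    enriched["enriched"] = bool(extra)  # flag records that found no match
    return enriched

# On Databricks, each stage reads the previous Delta test table and
# writes its own (table names are illustrative):
# df = spark.read.format("delta").table("test.stage1_raw")
# stage2 = df.rdd.map(lambda r: enrich_record(r.asDict(), ref)).toDF()
# stage2.write.format("delta").mode("overwrite").saveAsTable("test.stage2_enriched")
```

Persisting every intermediate stage as a Delta table is what makes the downstream validation step possible: each stage can be checked independently.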
A validation script runs on top of the test tables and the actual target table to identify any feature mismatch or data loss.
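The core of such a validation script is a keyed comparison: keys present in the test table but absent from the target indicate data loss, and differing field values indicate a feature mismatch. A minimal sketch of that comparison, assuming both tables can be materialized as lists of dicts keyed by a hypothetical `content_id` column:

```python
def validate_tables(test_rows, target_rows, key="content_id"):
    """Compare a test table against the actual target table.
    Returns keys missing from the target (data loss) and, for shared
    keys, the fields whose values disagree (feature mismatch)."""
    test_by_key = {r[key]: r for r in test_rows}
    target_by_key = {r[key]: r for r in target_rows}
    missing = sorted(set(test_by_key) - set(target_by_key))
    mismatches = {}
    for k in set(test_by_key) & set(target_by_key):
        diffs = {f: (v, target_by_key[k].get(f))
                 for f, v in test_by_key[k].items()
                 if target_by_key[k].get(f) != v}
        if diffs:
            mismatches[k] = diffs  # field -> (expected, actual)
    return {"missing": missing, "mismatches": mismatches}
```

An empty `missing` list and empty `mismatches` dict means the run passed; anything else is reported back to the engineer before the change reaches production.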
This end-to-end testing framework has reduced data-related bugs by 90%, and engineers feel more confident pushing changes to production after executing the jobs in a developer environment where end-to-end testing is configured.
In terms of cost, we have saved more than 80% by spinning up the Databricks cluster on AWS with spot instances only when we need to run an integration test, and terminating it automatically once the job completes.
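In Databricks on AWS, this pattern maps to a job cluster (which terminates automatically when the run completes) whose `aws_attributes` request spot capacity. The spec below is a sketch of such a `new_cluster` block; the node type, worker count, and Spark version are illustrative choices, not our exact configuration.

```python
def build_test_cluster_spec():
    """Databricks job-cluster spec for integration tests: spot instances
    for workers, created per run and released when the job finishes."""
    return {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "m5.xlarge",
        "num_workers": 2,
        "aws_attributes": {
            # Prefer spot capacity; fall back to on-demand if spot is unavailable.
            "availability": "SPOT_WITH_FALLBACK",
            "spot_bid_price_percent": 100,
            # Keep the driver on an on-demand node so a spot reclaim
            # cannot kill the whole run.
            "first_on_demand": 1,
        },
    }
```

Because a job cluster exists only for the duration of the run, there is nothing to clean up and no idle cost between integration-test executions.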
You’re one step away from building great software. This case study will help you learn more about how Infoservices helps successful companies extend their tech teams.
Want to talk more? Get in touch today!