Incorporated 50 years ago, the client is one of the major streaming companies, with more than 50 million subscribers, and has continuously improved its products and services across generations. Global Data & AI ("DAI") is a cross-vertical organization built around an integrated, full-stack data and analytics platform. We focus primarily on end-to-end data pipelines feeding products and services that influence decisions through analytical modeling and both probabilistic and deterministic signals. DAI mission: build best-in-class data & AI products and solutions to enhance storytelling and experiences for the client's audiences globally.
AWS is the leading cloud provider, and to solve the business challenge described above we needed a platform that would let us develop and deploy faster. We needed storage for vast amounts of data (petabytes), which is where S3 helps. We also needed to process and transform that data without incurring additional costs once the work is done, which is where transient EMR clusters, provisioned on demand and terminated afterward, fit in. AWS additionally provides MSK and integrates with Databricks for scale, which fits very well with our automation framework.
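A transient EMR cluster can be provisioned with boto3 by setting `KeepJobFlowAliveWhenNoSteps` to `False`, so the cluster shuts down as soon as its steps finish. The sketch below shows one way to build such a request; the cluster name, log bucket, instance types, and step arguments are illustrative, not taken from our production setup.

```python
def build_transient_emr_request(name, log_uri, step_args):
    """Build a boto3 run_job_flow request for a transient EMR cluster
    that auto-terminates after its steps complete (no idle cost)."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # Transient cluster: shut down when the last step completes.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "transform",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": step_args},
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# Usage (requires boto3 and AWS credentials; bucket/script paths are examples):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.run_job_flow(**build_transient_emr_request(
#     "nightly-transform", "s3://my-logs/emr/",
#     ["spark-submit", "s3://my-code/transform.py"]))
```

Because `ActionOnFailure` is set to `TERMINATE_CLUSTER`, a failed step also tears the cluster down rather than leaving it running.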
We architected the solution following best practices and the AWS Well-Architected Framework. A great deal of useful information is available on the internet and from third parties (free and paid), in addition to what we already hold in our content metadata repository. The challenge is integrating data from these sources and linking it to create a unified, consistent representation. Content metadata (data about movies/series, talent, etc.) is one of the most foundational datasets, powering use cases such as Content Valuation, Portfolio Optimization, Demand Prediction, Marketing Analytics, Search, and Recommendation.
This unified data representation helps address the use cases specified below:
We assessed the volume, velocity, variety, and veracity needs, arrived at a test strategy, and designed the framework accordingly.
We prepare test data and store it in an S3 bucket; from there, data is pushed to MSK at regular, configured intervals.
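The push from S3 to MSK boils down to serializing each test record into a keyed Kafka message. The sketch below separates the pure serialization step (testable without a broker) from the produce step; the broker address, topic name, and `content_id` key field are illustrative assumptions, and the producer shown is the open-source `kafka-python` client.

```python
import json

def to_kafka_records(rows, key_field="content_id"):
    """Serialize test rows (dicts) into (key, value) byte pairs for Kafka.
    The key drives partitioning so updates for one title stay ordered."""
    records = []
    for row in rows:
        key = str(row[key_field]).encode("utf-8")
        value = json.dumps(row, sort_keys=True).encode("utf-8")
        records.append((key, value))
    return records

# Producing to MSK with kafka-python (endpoint and topic are examples):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="b-1.msk.example.com:9092")
# for key, value in to_kafka_records(rows):
#     producer.send("content-metadata-test", key=key, value=value)
# producer.flush()
```

In practice the commented produce loop would run on a schedule to match the regular intervals mentioned above.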
Data is enriched using Databricks on AWS, and at each stage of the transfer, data is stored in Delta Lake test tables.
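One way to structure the enrichment is as a pure per-record function, which can be unit-tested locally and then applied at scale on Databricks, with each stage's output written to a Delta Lake test table. Everything below is a sketch under that assumption; the reference lookup, field names, and table names are hypothetical.

```python
def enrich_record(record, reference):
    """Enrich a raw metadata record with fields from a reference lookup
    (e.g., third-party metadata keyed by content_id)."""
    enriched = dict(record)
    extra = reference.get(record.get("content_id"), {})
    enriched.update(extra)
    enriched["enriched"] = bool(extra)  # flag records that found no match
    return enriched

# On Databricks, each stage reads the previous Delta test table and
# writes its own (table names are illustrative):
# df = spark.read.format("delta").table("test.stage1_raw")
# stage2 = df.rdd.map(lambda r: enrich_record(r.asDict(), ref)).toDF()
# stage2.write.format("delta").mode("overwrite").saveAsTable("test.stage2_enriched")
```

Persisting every intermediate stage as a Delta table is what makes the downstream validation step possible: each stage can be checked independently.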
A validation script runs on top of the test tables and the actual target table to identify any feature mismatch or data loss.
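The core of such a validation script is a keyed comparison: keys present in the test table but absent from the target indicate data loss, and differing field values indicate a feature mismatch. A minimal sketch of that comparison, assuming both tables can be materialized as lists of dicts keyed by a hypothetical `content_id` column:

```python
def validate_tables(test_rows, target_rows, key="content_id"):
    """Compare a test table against the actual target table.
    Returns keys missing from the target (data loss) and, for shared
    keys, the fields whose values disagree (feature mismatch)."""
    test_by_key = {r[key]: r for r in test_rows}
    target_by_key = {r[key]: r for r in target_rows}
    missing = sorted(set(test_by_key) - set(target_by_key))
    mismatches = {}
    for k in set(test_by_key) & set(target_by_key):
        diffs = {f: (v, target_by_key[k].get(f))
                 for f, v in test_by_key[k].items()
                 if target_by_key[k].get(f) != v}
        if diffs:
            mismatches[k] = diffs  # field -> (expected, actual)
    return {"missing": missing, "mismatches": mismatches}
```

An empty `missing` list and empty `mismatches` dict means the run passed; anything else is reported back to the engineer before the change reaches production.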
This end-to-end testing framework has reduced data-related bugs by 90%, and engineers feel more confident pushing changes to production after executing the jobs in a developer environment where end-to-end testing is configured.
In terms of cost, we have saved more than 80% by spinning up the Databricks cluster on AWS with spot instances only when we need to run an integration test, and terminating it automatically once the job completes.
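In Databricks on AWS, this pattern maps to a job cluster (which terminates automatically when the run completes) whose `aws_attributes` request spot capacity. The spec below is a sketch of such a `new_cluster` block; the node type, worker count, and Spark version are illustrative choices, not our exact configuration.

```python
def build_test_cluster_spec():
    """Databricks job-cluster spec for integration tests: spot instances
    for workers, created per run and released when the job finishes."""
    return {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "m5.xlarge",
        "num_workers": 2,
        "aws_attributes": {
            # Prefer spot capacity; fall back to on-demand if spot is unavailable.
            "availability": "SPOT_WITH_FALLBACK",
            "spot_bid_price_percent": 100,
            # Keep the driver on an on-demand node so a spot reclaim
            # cannot kill the whole run.
            "first_on_demand": 1,
        },
    }
```

Because a job cluster exists only for the duration of the run, there is nothing to clean up and no idle cost between integration-test executions.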
You’re one step away from building great software. This case study will help you learn more about how Infoservices helps successful companies extend their tech teams.
Want to talk more? Get in touch today!