K Group A team: Data Engineering Stories


3 seminars
Talk 1: Bridging the gap between data science and engineering in tooling and automation perspectives (배재현)
Talk 2: Rdist: a scalable and distributed batch framework for R (최현식)
Talk 3: Operational Analytics with Apache Druid (손지훈)


Talk 1: Bridging the gap between data science and engineering in tooling and automation perspectives

expensive A/B test

Data engineers and data scientists are siloed
Data engineers: Teradata, Kafka
Data scientists: Keras, PyTorch

data science is expensive

A/B Testing

which UI is better
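
One common way to answer "which UI is better" is a two-proportion z-test on conversion counts. The sketch below is only an illustration of that idea, not what the talk showed: the function name, the argument names, and the choice of a pooled z-test are all assumptions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of UI A and UI B (hypothetical helper).

    conv_*: number of converting users, n_*: number of users exposed.
    Returns (z, two-sided p-value) under a pooled-variance normal approximation.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "no difference".
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
    print(f"z={z:.3f}, p={p:.4f}")
```

A small p-value suggests the difference between the two UIs is unlikely to be noise. The normal approximation assumes reasonably large samples in both groups.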

Netflix experimentation lifecycle

design, testing, execution, analysis and monitoring, decision making

Sci Task 1: experiment design

Scala, Spark

ABlaze, Meson -> workflow engine: in-house tool

Data science self-serve with Python & SQL

Python to SQL

R, Python integration report builder

pinhall, Druid, Tableau


Talk 2: Rdist: a scalable and distributed batch framework for R

## Our current approach
S3 -> Hive -> S3 partition -> EC2 R nodes -> S3
11 EC2 nodes with single R, data partitioned by key
Oozie job control?
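
To make "data partitioned by key" concrete, here is a minimal sketch of assigning rows to 11 workers by hashing the key. The notes do not say how keys were actually mapped to nodes, so the CRC32 hash, the function names, and the row format are all assumptions.

```python
import zlib
from collections import defaultdict

NUM_NODES = 11  # node count taken from the notes


def node_for_key(key, num_nodes=NUM_NODES):
    """Map a key to a node index with a stable hash (assumed scheme)."""
    # zlib.crc32 is deterministic across processes, unlike built-in hash().
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def partition_by_key(rows, key_field, num_nodes=NUM_NODES):
    """Group rows (dicts) into per-node lists so each key lands on one node."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[node_for_key(row[key_field], num_nodes)].append(row)
    return dict(partitions)


if __name__ == "__main__":
    # Hypothetical rows; "user_id" is an illustrative key name.
    rows = [{"user_id": f"u{i % 5}", "value": i} for i in range(20)]
    for node, part in sorted(partition_by_key(rows, "user_id").items()):
        print(node, len(part))
```

Because every row with the same key goes to the same node, each single-process R worker can handle its keys without talking to the others.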

Rdist?

Rdist cli command

Why Rust?

Rdist


Talk 3: Operational Analytics with Apache Druid

Apache Druid

5 components: Overlord, Broker, Coordinator, MiddleManager, Historical
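
Of these components, the Broker is the one that clients query. As a rough sketch, a Druid SQL query can be sent to the Broker's `/druid/v2/sql` HTTP endpoint as JSON. The helper name, the `localhost:8082` address, and the `events` datasource below are assumptions for illustration; the code only builds the request and does not send it.

```python
import json
import urllib.request


def build_druid_sql_request(broker_url, sql):
    """Build (but do not send) a POST request for Druid's SQL HTTP API."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # "events" is a hypothetical datasource; __time is Druid's timestamp column.
    sql = (
        "SELECT FLOOR(__time TO HOUR) AS hr, COUNT(*) AS cnt "
        "FROM events GROUP BY 1 ORDER BY 1"
    )
    req = build_druid_sql_request("http://localhost:8082", sql)
    print(req.get_method(), req.full_url)
    # To actually run it against a live cluster:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp))
```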

https://medium.com/tecnolog%C3%ADa/how-we-built-a-streaming-analytics-solution-using-apache-kafka-druid-66c257adcd9a

demo


action item