3 seminar talks
- Talk 1: Bridging the gap between data science and engineering in tooling and automation perspectives (배재현)
- Talk 2: Rdist: a scalable and distributed batch framework for R (최현식)
- Talk 3: Operational Analytics with Apache Druid (손지훈)

Talk 1: Bridging the gap between data science and engineering in tooling and automation perspectives

expensive A/B test

data engineers and data scientists are siloed
- data engineers: Teradata, Kafka
- data scientists: Keras, PyTorch

data science is expensive

A/B Testing

which UI is better

Netflix experimentation lifecycle

design, testing, execution, analysis and monitoring, making decisions

Sci task 1: experiment design

define facts and metrics
Kafka, executing tests
ABlaze
Eng task 2: annotate user exp
Eng task 3: aggregate facts
Sci task 3: statistics formulation
Eng task 3: visualize stats
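
The "statistics formulation" step for a "which UI is better" test can be sketched as a two-proportion z-test. This is only a minimal illustration under assumptions: the talk did not say which test ABlaze uses, and the function name and the conversion-count inputs are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / conv_b: number of converting users in variants A and B.
    n_a / n_b: number of users exposed to variants A and B.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value of a standard normal, computed with erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"z={z:.2f}, p={p:.4f}")
```

A small p-value suggests the two UIs really differ in conversion rate; the threshold for "small" is a design decision made at experiment design time, not something the code decides.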

Scala, Spark

ABlaze, Meson -> workflow engine: in-house tools

Data science self-serve with python & sql

Python to SQL
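
A rough idea of what "Python to SQL" self-serve could look like: a metric is declared in Python and rendered into a SQL query. Everything here is an assumption for illustration (the `Metric` class, `metric_to_sql`, and the table and column names are hypothetical); the notes do not describe the actual Netflix tooling.

```python
from dataclasses import dataclass

# Aggregates this sketch is willing to render.
ALLOWED_AGGREGATES = {"AVG", "SUM", "COUNT", "MIN", "MAX"}


@dataclass
class Metric:
    name: str       # output column name
    aggregate: str  # SQL aggregate function, e.g. "avg"
    column: str     # fact column to aggregate


def metric_to_sql(metric, table, group_by="variant"):
    """Render a metric definition as a SQL query grouped by test variant.

    Identifiers are interpolated as-is, so they must come from trusted
    definitions, never from user input.
    """
    aggregate = metric.aggregate.upper()
    if aggregate not in ALLOWED_AGGREGATES:
        raise ValueError(f"unsupported aggregate: {metric.aggregate}")
    return (
        f"SELECT {group_by}, {aggregate}({metric.column}) AS {metric.name} "
        f"FROM {table} GROUP BY {group_by}"
    )


if __name__ == "__main__":
    metric = Metric(name="avg_watch_minutes", aggregate="avg", column="watch_minutes")
    print(metric_to_sql(metric, table="experiment_facts"))
```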

R, Python integration report builder

pinhall, Druid, Tableau

Talk 2: Rdist: a scalable and distributed batch framework for R

R, production pipeline
how our prediction pipeline is working
- model training using R on GPU machine
- daily prediction batch jobs
- increasing data volume: ~340G for 18 months, 340G every day
R: single-threaded. 340G is too big

Our current approach
- S3 -> Hive -> S3 partition -> EC2 R nodes -> S3
- 11 EC2 nodes with single R, data partitioned by key
- Oozie job control?
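
"Data partitioned by key" across 11 single-threaded R nodes can be pictured as stable hash partitioning. The sketch below is an assumption about the general technique, not the speakers' Hive job: the function names and the `user_id` key are hypothetical, and only the node count (11) comes from the notes.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 11  # the notes mention 11 EC2 nodes


def node_for_key(key, num_nodes=NUM_NODES):
    """Map a key to a node index with a hash that is stable across runs.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used instead.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def partition_rows(rows, key_field, num_nodes=NUM_NODES):
    """Group rows by target node so each R process sees whole keys only."""
    partitions = defaultdict(list)
    for row in rows:
        node = node_for_key(str(row[key_field]), num_nodes)
        partitions[node].append(row)
    return dict(partitions)


if __name__ == "__main__":
    rows = [{"user_id": f"u{i % 5}", "value": i} for i in range(20)]
    for node, part in sorted(partition_rows(rows, "user_id").items()):
        print(node, len(part))
```

Because all rows for a key land on the same node, each R process can run its prediction independently, which is what makes a single-threaded runtime workable at this volume.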

Rdist?

Rust: Rdist (from user perspective)
- To run a job, submit a job config to the Rdist master
- A job configuration (job1.toml): -> yaml with config: package, resource, data

Rdist CLI command

Why Rust?

Foreign function interface (FFI)
C binding

Rdist

in-house tool
Early July: in production; open-source? maybe
similar => SparkR

Talk 3: Operational Analytics with Apache Druid

https server high traffic
operational analytics (OA)
- understanding data
Generally in OA
- users need to explore data because they don’t know
OA use cases
- root cause analysis
  - diagnosing and troubleshooting problems
- what-if analysis
  - A/B testing
- top-k/heavy hitter analysis
  - customer segmentation
- behavioural analysis
  - unique users, retention rates
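
As a concrete picture of the top-k/heavy hitter use case: count events per dimension value and keep the k most frequent. This is an exact in-memory sketch for illustration only, not how Druid computes it; the `top_k` helper and the `page` field are hypothetical.

```python
from collections import Counter


def top_k(events, dimension, k):
    """Return the k most frequent values of `dimension` with their counts."""
    counts = Counter(event[dimension] for event in events)
    return counts.most_common(k)


if __name__ == "__main__":
    events = [
        {"page": "/home"},
        {"page": "/search"},
        {"page": "/home"},
        {"page": "/cart"},
        {"page": "/home"},
        {"page": "/search"},
    ]
    print(top_k(events, "page", 2))
```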

Apache Druid

high-performance distributed analytics data store
https://adtmag.com/articles/2015/07/30/yahoo-picks-druid.aspx
designed for OA
- 1. scalability
  - shared nothing
- 2. stream/batch ingestion
  - stream ingestion from Kafka
  - batch ingestion: HDFS, S3, HTTP
  - roll-up
    - pre-aggregating data at ingestion time
    - reducing data size and faster query processing
- 3. declarative query language
  - JSON-based DSL
  - SQL
- 4. high performance
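
The roll-up idea from the list above (pre-aggregating at ingestion time) can be sketched in plain Python: raw events are truncated to an hourly bucket and collapsed per dimension combination. This only illustrates the concept, not Druid's implementation; the `roll_up` function, the hourly granularity, and the field names are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events, dimensions, metric):
    """Pre-aggregate raw events into one row per (hour, dimension values).

    Each output row keeps an event count and the sum of `metric`,
    so later queries scan far fewer rows than the raw input.
    """
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate the timestamp to the start of its hour.
        bucket = ts.replace(minute=0, second=0, microsecond=0).isoformat()
        key = (bucket,) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["sum"] += event[metric]
    return dict(totals)


if __name__ == "__main__":
    events = [
        {"timestamp": "2019-06-01T10:05:00", "country": "KR", "bytes": 100},
        {"timestamp": "2019-06-01T10:45:00", "country": "KR", "bytes": 50},
        {"timestamp": "2019-06-01T11:01:00", "country": "KR", "bytes": 10},
    ]
    for key, row in roll_up(events, ["country"], "bytes").items():
        print(key, row)
```

The trade-off is that individual events are no longer recoverable after roll-up; only the chosen granularity and aggregates survive.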

5 components: Overlord, Broker, Coordinator, MiddleManager, Historical

deep storage (HDFS, S3, NFS)

https://medium.com/tecnolog%C3%ADa/how-we-built-a-streaming-analytics-solution-using-apache-kafka-druid-66c257adcd9a

demo

https://demo.imply.io
Imply Cloud -> Druid, data manager
Pivot
crawler -> Kafka -> Druid

action items

Apache Druid: http://druid.io/
- https://medium.com/@leventov/comparison-of-the-open-source-olap-systems-for-big-data-clickhouse-druid-and-pinot-8e042a5ed1c7
Tableau
Oozie
https://simply.io

data engineering

data science

K Group A team: Data Engineering