Three seminar talks
Talk 1: Bridging the gap between data science and engineering in tooling and automation perspectives (배재현)
Talk 2: Rdist: a scalable and distributed batch framework for R (최현식)
Talk 3: Operational Analytics with Apache Druid (손지훈)
Talk 1: Bridging the gap between data science and engineering in tooling and automation perspectives
expensive A/B tests
data engineers and data scientists are siloed
data engineers: Teradata, Kafka
data scientists: Keras, PyTorch
data science is expensive
A/B Testing
which UI is better
netflix experimentation lifecycle
design, testing, execution, analysis and monitoring, decision making
Sci Task 1: experiment design
- define facts and metrics
- Kafka, executing tests
- ABlaze
- eng task 2: annotate user exp
- eng task 3: aggregate facts
- sci task 3: statistics formulation
- eng task 3: visualize stats
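A minimal sketch of the "aggregate facts" and "statistics formulation" steps, assuming a single binary conversion metric and a two-proportion z-test. The talk did not say which statistics are used; the function names and the event shape here are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_facts(events):
    """Aggregate per-user facts into (conversions, users) per variant.

    `events` is an iterable of (variant, converted) pairs, one per user.
    This event shape is an assumption made for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # variant -> [conversions, users]
    for variant, converted in events:
        totals[variant][0] += int(converted)
        totals[variant][1] += 1
    return {variant: tuple(counts) for variant, counts in totals.items()}


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value of a standard normal, written with the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Synthetic data: variant A converts 100/1000 users, variant B 130/1000.
events = (
    [("A", True)] * 100 + [("A", False)] * 900
    + [("B", True)] * 130 + [("B", False)] * 870
)
facts = aggregate_facts(events)
z, p_value = two_proportion_z_test(*facts["A"], *facts["B"])
```

In a real pipeline the aggregation would run in the data platform (the notes mention Spark) and only the small per-variant totals would reach the statistics step.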
Scala, Spark
ABlaze, Meson -> workflow engine: in-house tools
Data science self-serve with Python & SQL
Python to SQL
R, Python integration report builder
pinhall, Druid, Tableau
Talk 2: Rdist: a scalable and distributed batch framework for R
- R, production pipeline
- how our prediction pipeline is working
- model training using R on GPU machine
- daily prediction batch jobs
- increasing data volume, from ~340G for 18 months; 340G daily
- R: single-threaded. 340G is too big
Our current approach
S3 -> Hive -> S3 partition -> EC2 R nodes -> S3
11 EC2 nodes, each with a single R, data partitioned by key
Oozie job control?
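A rough sketch of "data partitioned by key" across the 11 nodes, assuming a stable hash of the key modulo the node count. The talk did not describe the actual partitioning scheme (in the notes it happens in Hive); `worker_for`, `partition`, and the row shape are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_WORKERS = 11  # from the notes: 11 EC2 nodes, each with a single R


def worker_for(key, num_workers=NUM_WORKERS):
    """Map a key to a worker index in [0, num_workers)."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(str(key).encode("utf-8")) % num_workers


def partition(rows, key_field, num_workers=NUM_WORKERS):
    """Group rows by worker so that all rows sharing a key land together."""
    parts = defaultdict(list)
    for row in rows:
        parts[worker_for(row[key_field], num_workers)].append(row)
    return dict(parts)


# Synthetic rows: 100 rows spread over 25 distinct keys.
rows = [{"user_id": f"user-{i % 25}", "value": i} for i in range(100)]
parts = partition(rows, "user_id")
```

Because every row for a given key goes to the same node, each single-threaded R process can work on its partition without talking to the others.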
Rdist?
-
Rust Rdist (from a user perspective)
-
To run a job, submit a job config to the Rdist master
A job configuration (job1.toml): -> yaml with config: package, resource, data
Rdist CLI command
Why Rust?
- Foreign function interface (FFI)
- C binding
Rdist
- in-house tool
- in production since early July, open source? maybe
- similar => SparkR
Talk 3: Operational Analytics with Apache Druid
- https server high traffic
- operational analytics
- understanding data
- generally in OA
- users need to explore data because they don’t know
- OA use cases
- root cause analysis
- diagnosing and troubleshooting problems
- what-if analysis
- a/b testing
- top-k/heavy hitter analysis
- customer segmentation
- behavioural analysis
- unique users, retention rates
Apache Druid
- high-performance distributed analytics data store
- https://adtmag.com/articles/2015/07/30/yahoo-picks-druid.aspx
- designed for OA
-
- scalability
- shared nothing
-
- stream/batch ingestion
- stream ingestion from Kafka
- batch ingestion: HDFS, S3, HTTP
- roll-up
- pre-aggregating data at ingestion time
- reducing data size and faster query processing
-
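A small sketch of the roll-up idea in plain Python, not Druid's implementation: raw events are truncated to an hourly granularity and pre-aggregated per combination of dimension values, so fewer rows are stored and scanned. The field names (`timestamp`, `country`, `bytes`) and the hourly granularity are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events, dimensions, metric):
    """Pre-aggregate events by (hour, dimension values).

    Each output row keeps a row count and the sum of `metric`.
    """
    agg = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        # Truncate the timestamp to the hour: this is the roll-up granularity.
        hour = event["timestamp"].replace(minute=0, second=0, microsecond=0)
        key = (hour,) + tuple(event[d] for d in dimensions)
        agg[key]["count"] += 1
        agg[key]["sum"] += event[metric]
    return dict(agg)


# Synthetic events: two fall in the 10:00 hour, one in the 11:00 hour.
events = [
    {"timestamp": datetime(2019, 6, 1, 10, 5), "country": "KR", "bytes": 100},
    {"timestamp": datetime(2019, 6, 1, 10, 40), "country": "KR", "bytes": 50},
    {"timestamp": datetime(2019, 6, 1, 11, 0), "country": "KR", "bytes": 7},
]
rolled = roll_up(events, ["country"], "bytes")
```

Three raw events become two stored rows here. The trade-off is that individual events inside a rolled-up hour can no longer be queried separately.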
- declarative query language
- JSON-based DSL
- SQL
-
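To illustrate the two query styles, here is the same hourly aggregation written once as a native JSON query and once as SQL. The datasource `pageviews`, the metric column `view_count`, and the time interval are hypothetical. The sketch only builds the request bodies: native queries are POSTed to the Broker at `/druid/v2/`, SQL queries to `/druid/v2/sql`.

```python
import json

# Native (JSON-based) timeseries query: hourly sum of a metric column.
native_query = {
    "queryType": "timeseries",
    "dataSource": "pageviews",  # hypothetical datasource
    "granularity": "hour",
    "intervals": ["2019-06-01/2019-06-02"],
    "aggregations": [
        # "view_count" is a hypothetical metric column.
        {"type": "longSum", "name": "total_views", "fieldName": "view_count"}
    ],
}

# The same question in Druid SQL, wrapped in the JSON body the SQL API expects.
sql_query = {
    "query": (
        "SELECT FLOOR(__time TO HOUR) AS hour_start, "
        "SUM(view_count) AS total_views "
        "FROM pageviews "
        "WHERE __time >= TIMESTAMP '2019-06-01 00:00:00' "
        "AND __time < TIMESTAMP '2019-06-02 00:00:00' "
        "GROUP BY 1"
    )
}

native_body = json.dumps(native_query)
sql_body = json.dumps(sql_query)
```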
- high perf
-
5 components: Overlord, Broker, Coordinator, MiddleManager, Historical
- deep storage (HDFS, S3, NFS)
demo
-
Imply Cloud -> Druid, data manager
-
Pivot
-
crawler -> Kafka -> Druid