K Group A team: Data Engineering Stories

Three seminars:

  • Talk 1: Bridging the gap between data science and engineering in tooling and automation perspectives (배재현)
  • Talk 2: Rdist: a scalable and distributed batch framework for R (최현식)
  • Talk 3: Operational Analytics with Apache Druid (손지훈)


Talk 1: Bridging the gap between data science and engineering in tooling and automation perspectives

Expensive A/B tests

Data engineers and data scientists are siloed:

  • data engineers: Teradata, Kafka
  • data scientists: Keras, PyTorch

data science is expensive

A/B Testing

which UI is better

Netflix experimentation lifecycle

design, testing, execution, analysis and monitoring, decision making

Sci Task 1: experiment design

  • define facts and metrics
  • Kafka, executing tests
  • ABlaze
  • eng task 2: annotate user exp
  • eng task 3: aggregate facts
  • sci task 3: statistics formulation
  • eng task 3: visualize stats
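
The "statistics formulation" step for a "which UI is better" experiment can be sketched as a two-proportion z-test. This is only a minimal illustration under my own assumptions: the talk did not say which statistic Netflix uses, and the function name and the conversion counts below are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rates of UI A and UI B (hypothetical inputs).

    conv_*: number of users who converted, n_*: number of users exposed.
    Returns (z, two_sided_p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both UIs convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    z, p = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=150, n_b=1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two UIs is unlikely to be noise; the threshold is a decision made in the design step.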

Scala, Spark

ABlaze, Meson -> workflow engine: in-house tool

Data science self-serve with Python & SQL

Python to SQL

R, Python integration report builder

pinhall, Druid, Tableau


Talk 2: Rdist: a scalable and distributed batch framework for R

  • R, production pipeline
  • how our prediction pipeline is working
    • model training using R on GPU machine
    • daily prediction batch jobs
    • increasing data volume from ~340G for 18 months (340G every day)
  • R is single-threaded; 340G is too big

Our current approach

S3 -> Hive -> S3 partition -> EC2 R nodes -> S3

  • 11 EC2 nodes, each with a single R; data partitioned by key
  • Oozie job control?
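
"Data partitioned by key" across 11 single-R nodes can be sketched as below. The talk did not say how keys are mapped to nodes, so the hash-modulo scheme here is my assumption, and `node_for_key`, `partition_rows` and the `user_id` field are hypothetical names.

```python
import zlib
from collections import defaultdict

NUM_NODES = 11  # the 11 EC2 nodes mentioned in the talk


def node_for_key(key, num_nodes=NUM_NODES):
    """Map a partition key to a node index with a stable hash (assumed scheme)."""
    # crc32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_rows(rows, key_field, num_nodes=NUM_NODES):
    """Group rows so that all rows sharing a key land on the same node."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[node_for_key(str(row[key_field]), num_nodes)].append(row)
    return partitions


if __name__ == "__main__":
    rows = [{"user_id": f"u{i}", "value": i} for i in range(100)]
    for node, part in sorted(partition_rows(rows, "user_id").items()):
        print(node, len(part))
```

Because each R process only sees its own partition, the single-thread limit applies per node rather than to the whole 340G.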

Rdist?

  • Rust

Rdist (from user perspective)

  • To run a job, submit a job config to the Rdist master
  • A job configuration (job1.toml): -> yaml with config: package, resource, data

Rdist CLI command

Why Rust?

  • Foreign function interface (FFI)
  • C binding

Rdist

  • in-house tool
  • early July: in production; open source? maybe
  • similar to SparkR

Talk 3: Operational Analytics with Apache Druid

  • https server high traffic
  • operational analytics
    • understanding data
  • Generally in OA
    • users need to explore data because they don’t know
  • OA use cases
    • root cause analysis
      • diagnosing and troubleshooting problems
    • what-if analysis
      • a/b testing
    • top-k/heavy hitter analysis
      • customer segmentation
    • behavioural analysis
      • unique users, retention rates
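
As a tiny illustration of the top-k / heavy hitter use case above, here is an exact in-memory count. It is only a sketch: real OA stores such as Druid typically answer this over far more data, often approximately, and the `top_k` name and the sample events are hypothetical.

```python
from collections import Counter


def top_k(events, k):
    """Return the k most frequent values with their counts."""
    return Counter(events).most_common(k)


if __name__ == "__main__":
    # Hypothetical stream of customer segments seen in request logs.
    events = ["mobile", "desktop", "mobile", "tablet", "mobile", "desktop"]
    print(top_k(events, 2))
```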

Apache Druid

  • high-performance distributed analytics data store
  • https://adtmag.com/articles/2015/07/30/yahoo-picks-druid.aspx
  • designed for OA
      1. scalability
        • shared nothing
      2. stream/batch ingestion
        • stream ingestion from Kafka
        • batch ingestion: HDFS, S3, HTTP
        • roll-up
          • pre-aggregating data at ingestion time
          • reducing data size and faster query processing
      3. declarative query language
        • JSON-based DSL
        • SQL
      4. high performance
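
The roll-up idea above (pre-aggregating at ingestion time) can be sketched in plain Python. This is not Druid's implementation, just the concept: timestamps are truncated to the hour and rows with the same (hour, dimension) are merged. The event layout `(timestamp, page, bytes)` and the `roll_up` name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events):
    """Pre-aggregate raw events into one row per (hour, page).

    events: iterable of (iso_timestamp, page, bytes) tuples (assumed layout).
    Returns a dict mapping (hour_iso, page) to {"count": ..., "bytes": ...}.
    """
    agg = defaultdict(lambda: {"count": 0, "bytes": 0})
    for ts, page, nbytes in events:
        # Truncate the timestamp to hourly granularity.
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        row = agg[(hour.isoformat(), page)]
        row["count"] += 1
        row["bytes"] += nbytes
    return dict(agg)


if __name__ == "__main__":
    raw = [
        ("2019-06-01T10:15:00", "/home", 100),
        ("2019-06-01T10:45:00", "/home", 50),
        ("2019-06-01T11:05:00", "/home", 10),
    ]
    for key, row in roll_up(raw).items():
        print(key, row)
```

Three raw events become two stored rows here. The trade-off is that individual events can no longer be queried, only the aggregates.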

5 components: Overlord, Broker, Coordinator, MiddleManager, Historical

  • deep storage (HDFS, S3, NFS)

https://medium.com/tecnolog%C3%ADa/how-we-built-a-streaming-analytics-solution-using-apache-kafka-druid-66c257adcd9a

demo

  • https://demo.imply.io
  • Imply Cloud -> Druid, data manager
  • Pivot

  • crawler -> Kafka -> Druid

action item
