flink-sql-benchmark

Flink TPC-DS benchmark

Step 1: Environment preparation

  • Recommended configuration for Hadoop cluster

    • Resource allocation
      • master *1:
        • vCPU: 32 cores, memory: 128 GiB / system disk: 120 GB *1, data disk: 80 GB *1
      • worker *15:
        • vCPU: 80 cores, memory: 352 GiB / system disk: 120 GB *1, data disk: 7300 GB *30
      • This document was tested with Hadoop 3.2.1 and Hive 3.1.2
    • Hadoop environment preparation
      • Install Hadoop, then configure HDFS and Yarn according to the configuration documentation
      • Set the Hadoop environment variables: ${HADOOP_HOME} and ${HADOOP_CLASSPATH}
      • Configure Yarn, then start ResourceManager and NodeManager
        • Modify yarn.application.classpath in yarn-site.xml, adding the MapReduce dependencies to the classpath:
          • $HADOOP_YARN_HOME/share/hadoop/yarn/lib/*,$HADOOP_YARN_HOME/share/hadoop/mapreduce/*
    • Hive environment preparation
      • Run the Hive metastore and HiveServer2
        • Set the environment variable ${HIVE_CONF_DIR} to the directory containing hive-site.xml.
        • Execute nohup hive --service metastore & and nohup hive --service hiveserver2 & (see the sketch at the end of this step)
    • Others
      • gcc is needed on the master node to build the TPC-DS data generator
  • TPC-DS benchmark project preparation

    • Download the flink-sql-benchmark project to your master node via git clone https://github.com/ververica/flink-sql-benchmark.git
      • In this document, the absolute path of the benchmark project is referred to as ${INSTALL_PATH}
    • Build the execution jar: cd ${INSTALL_PATH} and run mvn clean package -DskipTests (please make sure Maven is installed)
  • Flink environment preparation

    • Download Flink-1.16 to folder ${INSTALL_PATH}/packages
      • Flink-1.16 download path
      • Replace flink-conf.yaml
        • mv ${INSTALL_PATH}/tools/flink/flink-conf.yaml ${INSTALL_PATH}/packages/flink-1.16/conf/flink-conf.yaml
      • Download flink-sql-connector-hive-3.1.2_2.12
      • Start yarn session cluster
        • cd ${INSTALL_PATH}/packages/flink-1.16/bin and run ./yarn-session.sh -d -qu default
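
A minimal sketch of the environment setup above. The install locations under /opt are placeholders, not part of the project; replace them with the paths on your cluster, and adjust how HADOOP_CLASSPATH is derived if your distribution differs:

```bash
#!/usr/bin/env bash
# Hypothetical install locations -- replace with the paths on your cluster.
export HADOOP_HOME=/opt/hadoop
export HADOOP_CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath)   # one common way to set it
export HIVE_CONF_DIR=/opt/hive/conf                              # directory containing hive-site.xml

# Start the Hive services in the background.
nohup hive --service metastore   > metastore.log   2>&1 &
nohup hive --service hiveserver2 > hiveserver2.log 2>&1 &

# Start the Flink yarn session cluster (Flink-1.16 unpacked under ${INSTALL_PATH}/packages).
cd ${INSTALL_PATH}/packages/flink-1.16/bin
./yarn-session.sh -d -qu default
```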

Step 2: Generate TPC-DS dataset

Please cd ${INSTALL_PATH} first.

  • Set common environment variables
    • vim tools/common/env.sh (see the sketch after this list):
      • Non-partitioned table
        • SCALE is the size of the generated dataset in GB. The recommended value is 10000 (10 TB); a smaller value such as 1000 (1 TB) can also be used.
        • FLINK_TEST_DB is the Hive database name that will be used by Flink
          • It is recommended to keep the default name: export FLINK_TEST_DB=tpcds_bin_orc_$SCALE
      • Partitioned table
        • FLINK_TEST_DB needs to be changed to export FLINK_TEST_DB=tpcds_bin_partitioned_orc_$SCALE
  • Data generation
    • Non-partitioned table
      • Run ./tools/datagen/init_db_for_none_partition_tables.sh. The script first launches a MapReduce job that generates the data in text format and stores it in HDFS, then creates external Hive databases based on the generated text files.
        • The original text-format data is stored in the HDFS folder /tmp/tpcds-generate/${SCALE}
        • The Hive external database that points to the original text data is tpcds_text_${SCALE}
        • The Hive ORC-format external database built from the text data is tpcds_bin_orc_${SCALE}
    • Partitioned table
      • Run ./tools/datagen/init_db_for_partition_tables.sh. If the original text-format data does not yet exist in HDFS, the script generates it first; it then creates external Hive databases with partitioned tables based on the generated text files.
        • The original text-format data is also stored in the HDFS folder /tmp/tpcds-generate/${SCALE}
        • The Hive external database that points to the original text data is tpcds_text_${SCALE}
        • The Hive ORC-format external database with partitioned tables, built from the text data, is tpcds_bin_partitioned_orc_${SCALE}
          • This command can be very slow because Hive dynamic-partition writes are slow
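
A sketch of the relevant part of tools/common/env.sh, limited to the two variables discussed above; leave the rest of the file as shipped:

```bash
# tools/common/env.sh (excerpt) -- only the variables discussed above.

# Size of the generated dataset in GB (10000 = 10 TB; e.g. 1000 for a smaller run).
export SCALE=10000

# Hive database that Flink will query.
# Non-partitioned tables:
export FLINK_TEST_DB=tpcds_bin_orc_$SCALE
# Partitioned tables (use this line instead when benchmarking partitioned tables):
# export FLINK_TEST_DB=tpcds_bin_partitioned_orc_$SCALE
```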

Step 3: Generate table statistics for TPC-DS dataset

Please cd ${INSTALL_PATH} first.

  • Generate statistics for each table and every column:
    • Run ./tools/stats/analyze_table_stats.sh
      • Note: this step uses Flink's ANALYZE TABLE syntax to generate statistics, which is only supported since Flink 1.16
      • For partitioned tables, this step is very slow. It is recommended to use Hive's ANALYZE TABLE syntax instead, which produces the same statistics as Flink's ANALYZE TABLE (see the sketch after this list)
        • The documentation for Hive's ANALYZE TABLE syntax
        • In Hive 3.x, you can run ANALYZE TABLE partition_table_name COMPUTE STATISTICS FOR COLUMNS; in the Hive client to generate statistics for all partitions at once, instead of specifying each partition
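
A hedged sketch of driving Hive's ANALYZE TABLE from the shell for the partitioned database. The table list is only illustrative (standard TPC-DS table names); run the statement for every table that exists in your ${FLINK_TEST_DB}:

```bash
#!/usr/bin/env bash
# Sketch: compute table- and column-level statistics with Hive instead of Flink.
# Assumes you are in ${INSTALL_PATH} and FLINK_TEST_DB is set in tools/common/env.sh,
# e.g. tpcds_bin_partitioned_orc_10000.
source tools/common/env.sh

# Illustrative subset of TPC-DS tables -- extend to every table in the database.
TABLES="store_sales store_returns catalog_sales catalog_returns web_sales web_returns inventory"

for t in ${TABLES}; do
  # In Hive 3.x, omitting the partition spec computes stats for all partitions.
  hive -e "USE ${FLINK_TEST_DB}; ANALYZE TABLE ${t} COMPUTE STATISTICS FOR COLUMNS;"
done
```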

Step 4: Flink run TPC-DS queries

Please cd ${INSTALL_PATH} first.

  • Run TPC-DS queries:
    • Run a specific query: ./tools/flink/run_query.sh 1 q1.sql
      • 1 is the number of iterations (the default is 1); q1.sql is the query to run (q1.sql -> q99.sql).
      • If the number of iterations is 1, the total execution time will be longer than the pure query execution time because of the Flink cluster warm-up cost
    • Run all queries: ./tools/flink/run_query.sh 2
      • Because running all queries takes a long time, it is recommended to run the command with nohup: nohup ./run_all_queries.sh 2 > partition_result.log 2>&1 &
  • Modify tools/flink/run_query.sh to pass other optional flags:
    • --location: path of the SQL queries. If you only want to execute a subset of the queries, point this to a folder containing just those .sql files (see the sketch after this list).
    • --mode: execute or explain; the default is execute
  • Kill yarn jobs:
    • Run yarn application -list to get the yarn application id of the current Flink job
    • Run yarn application -kill application_xxx to kill that yarn application
    • If you stop the current Flink job this way, you need to restart the yarn session cluster the next time you run TPC-DS queries:
      • cd ${INSTALL_PATH}/packages/flink-1.16/bin and run ./yarn-session.sh -d -qu default
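
An illustrative sketch of running only a subset of queries via --location. The folder name my_queries and the source location of the .sql files are hypothetical; the flag itself is passed by editing tools/flink/run_query.sh as described above:

```bash
#!/usr/bin/env bash
# Sketch: run only q1 and q2 by pointing --location at a folder with just those files.
# "my_queries" is a hypothetical folder name; copy the .sql files from wherever they
# live in your checkout.
mkdir -p ${INSTALL_PATH}/my_queries
cp /path/to/q1.sql /path/to/q2.sql ${INSTALL_PATH}/my_queries/

# Edit tools/flink/run_query.sh to pass the extra flags to the job, e.g. add:
#   --location ${INSTALL_PATH}/my_queries --mode execute
./tools/flink/run_query.sh 1
```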
