
refactor(dw_base): clean up legacy business-coupled files (mg2es/DS/DingTalk/tid/td_spark_init) and streamline spark_mmq_udf

tianyu.chu, 2 weeks ago
Parent
Commit
f20d9c39e6
44 changed files with 7 additions and 5582 deletions
  1. + 0 - 83    bin/dingtalk-work-alert.sh
  2. + 0 - 13    bin/hive-exec-job-starter.py
  3. + 0 - 186   bin/hive-exec.sh
  4. + 0 - 3     dw_base/ds/__init__.py
  5. + 0 - 19    dw_base/ds/config/base_config.yaml
  6. + 0 - 9     dw_base/ds/config/process_code.yaml
  7. + 0 - 76    dw_base/ds/ds_start_workflow.py
  8. + 0 - 186   dw_base/scheduler/country_count_dingtalk.py
  9. + 0 - 240   dw_base/scheduler/dingtalk_mirror_monitor.py
  10. + 0 - 102  dw_base/scheduler/dingtalk_notifier.py
  11. + 0 - 370  dw_base/scheduler/dingtalk_task_monitor.py
  12. + 0 - 368  dw_base/scheduler/dingtalk_task_monitor_new.py
  13. + 0 - 498  dw_base/scheduler/ent_interface_dingtalk.py
  14. + 0 - 132  dw_base/scheduler/ent_interface_dingtalk_call.py
  15. + 0 - 141  dw_base/scheduler/ent_interface_dingtalk_top10.py
  16. + 0 - 242  dw_base/scheduler/ent_interface_dingtalk_update.py
  17. + 0 - 185  dw_base/scheduler/get_oldmongo_cjfs.py
  18. + 0 - 185  dw_base/scheduler/get_oldmongo_sldw.py
  19. + 0 - 90   dw_base/scheduler/get_oldmongo_sldw_detail.py
  20. + 0 - 102  dw_base/scheduler/get_oldmongo_stat.py
  21. + 0 - 139  dw_base/scheduler/get_oldmongo_ysfs.py
  22. + 0 - 0    dw_base/scheduler/mg2es/__init__.py
  23. + 0 - 53   dw_base/scheduler/mg2es/conf_reader.py
  24. + 0 - 37   dw_base/scheduler/mg2es/dict_redis2hive.py
  25. + 0 - 47   dw_base/scheduler/mg2es/es_index_backup.py
  26. + 0 - 214  dw_base/scheduler/mg2es/es_operator.py
  27. + 0 - 250  dw_base/scheduler/mg2es/es_tmpl_gen.py
  28. + 0 - 82   dw_base/scheduler/mg2es/git_helper.py
  29. + 0 - 54   dw_base/scheduler/mg2es/hive_sql.py
  30. + 0 - 39   dw_base/scheduler/mg2es/path_util.py
  31. + 0 - 61   dw_base/scheduler/mg2es/redis_operator.py
  32. + 0 - 178  dw_base/scheduler/mg2es/to_es.py
  33. + 0 - 49   dw_base/scheduler/mg_company_alias_init.py
  34. + 0 - 71   dw_base/spark/td_spark_init.py
  35. + 0 - 74   dw_base/spark/udf/enterprise/unique/spark_tid_match_udf.py
  36. + 0 - 99   dw_base/spark/udf/spark_id_generate_udf.py
  37. + 0 - 492  dw_base/spark/udf/spark_mmq_udf.py
  38. + 0 - 96   dw_base/spark/udf/spark_read_hive_columns_cnt.py
  39. + 0 - 47   dw_base/utils/hive_file_merge.py
  40. + 0 - 169  dw_base/utils/spark_parse_json_to_hive.py
  41. + 0 - 92   dw_base/utils/tid_utils.py
  42. + 1 - 4    kb/00-项目架构.md
  43. + 2 - 4    kb/90-重构路线.md
  44. + 4 - 1    kb/92-重构进度.md

+ 0 - 83
bin/dingtalk-work-alert.sh

@@ -1,83 +0,0 @@
-#!/bin/bash
-#--------------------------------------------------------------------------------------------------
-# 使用企业微信群机器人发送告警的脚本
-# 腾讯云数仓机器人:-key=19e30ec1-d001-4437-ac41-63dc07f78520
-# 小可爱:-key=cc3653b1-78cb-465a-bf95-bf5f5303a37a
-#--------------------------------------------------------------------------------------------------
-BASE_DIR=$(
-  cd "$(dirname "$(realpath "$0")")/.." || exit
-  pwd
-)
-. "${BASE_DIR}"/bin/common/init.sh
-function usage() {
-  echo -e "${NORM_MGT}Usage: $0
-  ${NORM_CYN}\t[-h/-H/--h/--H/--help]             打印脚本使用方法${DO_RESET}"
-  echo -e "${NORM_MGT}Usage: $0
-  ${NORM_GRN}\t<-key[ /=] robot hook key>         机器人url后的key(lxy/common/alerter_constants.py中有记录)
-  ${NORM_GRN}\t<-msg[=/] message need to send>    要发送的消息
-  ${NORM_GRN}\t<-f[=/] file message need to send> 要发送的文件消息
-  ${DO_RESET}"
-  exit "$1"
-}
-
-function parse_args() {
-  for index in $(seq 1 $#); do
-    arg=${*:index:1}
-    case $arg in
-    -key)
-      index=$((index + 1))
-      KEY="${*:index:1}"
-      ;;
-    -key=*)
-      KEY="${arg#*=}"
-      ;;
-    -msg)
-      index=$((index + 1))
-      MSG+=("${*:index:1}")
-      ;;
-    -msg=*)
-      MSG+=("${arg#*=}")
-      ;;
-    -f)
-      index=$((index + 1))
-      FILE_PATH+=("${*:index:1}")
-      ;;
-    -f=*)
-      FILE_PATH+=("${arg#*=}")
-      ;;
-    -h | -H | --help)
-      usage 0
-      ;;
-    *) ;;
-    esac
-  done
-}
-
-function build_message() {
-  if [ -z "${KEY}" ] || [ "${#MSG[@]}" -eq 0 ]; then
-    usage 1
-  fi
-  msg=${MSG[0]}
-  for ((i = 1; i < ${#MSG[@]}; i++)); do
-    msg="${msg}\n${MSG[$i]}"
-  done
-  message=("{
-	  \"msgtype\": \"text\",
-	  \"text\": {
-		  \"content\": \"异常告警:\n${msg[*]}\"
-	  },
-	  \"at\":{
-	    \"isAtAll\":true
-	  }
-  }")
-  url="http://m1.node.cdh/dingtalk/api/robot/send?access_token=${KEY}"
-}
-
-# shellcheck disable=SC2034
-AT=()
-MSG=()
-parse_args "${@}"
-build_message
-
-#echo -e "${NORM_GRN} Send message using ${RED}${url}${DO_RESET}"
-curl "$url" -H 'Content-Type: application/json' -d "${message[*]}"

+ 0 - 13
bin/hive-exec-job-starter.py

@@ -1,13 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-"""
-  Note:为方便本地调试设计,请勿在调度中使用
-"""
-import os
-import sys
-
-project_root_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
-sys.path.append(project_root_dir)
-
-if __name__ == '__main__':
-    os.system(f'{project_root_dir}/bin/hive-exec.sh {" ".join(sys.argv[1:])}')

+ 0 - 186
bin/hive-exec.sh

@@ -1,186 +0,0 @@
-#!/bin/bash
-#--------------------------------------------------------------------------------------------------
-#--------------------------------------------------------------------------------------------------
-set -e
-BASE_DIR=$(
-  cd "$(dirname "$(realpath "$0")")/.." || exit
-  pwd
-)
-. "${BASE_DIR}"/bin/common/init.sh
-function usage() {
-  echo -e "${NORM_MGT}Usage: $0
-  ${NORM_GRN}\t<-e[ /=]HQL语句>   HQL语句,需要使用''包括sql语句
-  ${NORM_CYN}\t[-dt[ /=]日期]     %Y%m%d 或 yyyyMMdd 格式的日期(命令行 > 默认)
-  ${NORM_CYN}\t                  可以以四种形式传入日期:
-  ${NORM_CYN}\t                      1. 20211101,表示具体日期
-  ${NORM_CYN}\t                      2. 20211101-,表示20211101至昨天
-  ${NORM_CYN}\t                      3. 20211101-20211107,表示20211101至20211107
-  ${NORM_CYN}\t                      4. 20211101,20211103,表示离散的日期20211101、20211103
-  ${NORM_CYN}\t[-c 参数名:参数值]  Hive参数
-  ${NORM_CYN}\t[-v 变量名:变量值]  Hive变量
-  ${DO_RESET}"
-  echo -e "${NORM_MGT}Usage: $0
-  ${NORM_GRN}\t<-f[ /=]HQL文件>   HQL文件
-  ${NORM_CYN}\t[-dt[ /=]日期]     %Y%m%d 或 yyyyMMdd 格式的日期(命令行 > 默认)
-  ${NORM_CYN}\t                  可以以四种形式传入日期:
-  ${NORM_CYN}\t                      1. 20211101,表示具体日期
-  ${NORM_CYN}\t                      2. 20211101-,表示20211101至昨天
-  ${NORM_CYN}\t                      3. 20211101-20211107,表示20211101至20211107
-  ${NORM_CYN}\t                      4. 20211101,20211103,表示离散的日期20211101、20211103
-  ${NORM_CYN}\t[-c 参数名:参数值]  Hive参数
-  ${NORM_CYN}\t[-v 变量名:变量值]  Hive变量
-  ${DO_RESET}"
-  exit "$1"
-}
-
-function parse_args() {
-  for index in $(seq 1 $#); do
-    arg=${*:index:1}
-    case $arg in
-    -c)
-      index=$((index + 1))
-      HIVE_CONF+=("--hiveconf")
-      HIVE_CONF+=("${*:index:1}")
-      ;;
-    -c=*)
-      HIVE_CONF+=("--hiveconf")
-      HIVE_CONF+=("${arg#*=}")
-      ;;
-    -dt)
-      index=$((index + 1))
-      if [ -z "${DT}" ]; then
-        DT="${*:index:1}"
-      fi
-      ;;
-    -dt=*)
-      if [ -z "${DT}" ]; then
-        DT="${arg#*=}"
-      fi
-      ;;
-    -v)
-      index=$((index + 1))
-      # 例如:dt=20220101、dt:20220101
-      KEY_VALUE="${*:index:1}"
-      # 截取 dt
-      KEY="${KEY_VALUE%%[:|=]*}"
-      # 截取 20220101
-      VALUE="${KEY_VALUE#*[:|=]}"
-      if [ "${KEY}" = "dt" ]; then
-        if [ -z "${DT}" ]; then
-          DT="${VALUE}"
-        fi
-      else
-        HIVE_GLOBAL_VAR+=("--hivevar")
-        HIVE_GLOBAL_VAR+=("${KEY_VALUE}")
-      fi
-      ;;
-    -v=*)
-      KEY_VALUE="${arg#*=}"
-      KEY="${KEY_VALUE%%[:|=]*}"
-      VALUE="${KEY_VALUE#*[:|=]}"
-      if [ "${KEY}" = "dt" ]; then
-        if [ -z "${DT}" ]; then
-          DT="${VALUE}"
-        fi
-      else
-        HIVE_GLOBAL_VAR+=("--hivevar")
-        HIVE_GLOBAL_VAR+=("${KEY_VALUE}")
-      fi
-      ;;
-    -e)
-      index=$((index + 1))
-      HIVE_SQL="${*:index:1}"
-      ;;
-    -e=*)
-      HIVE_SQL="${arg#*=}"
-      ;;
-    -f)
-      index=$((index + 1))
-      HIVE_FILE="${*:index:1}"
-      ;;
-    -f=*)
-      HIVE_FILE="${arg#*=}"
-      ;;
-    -h | -H | --h | --H | --help)
-      usage 0
-      ;;
-    *) ;;
-
-    esac
-  done
-  pretty_print "${NORM_MGT}${0} 收到参数:${NORM_GRN}${*}"
-}
-
-function run_execute() {
-  if [ -n "${HIVE_SQL}" ]; then
-    pretty_print "${NORM_MGT}执行Shell命令 ${NORM_GRN}hive -e ${HIVE_SQL} ${HIVE_CONF[*]} ${HIVE_LOCAL_VAR[*]}"
-    # 执行HQL语句
-    hive -e "${HIVE_SQL}" "${HIVE_CONF[@]}" "${HIVE_LOCAL_VAR[@]}" 2>&1 | tee -a "${LOG_FULL_PATH}"
-    exit "${PIPESTATUS[0]}"
-  elif [ -n "${HIVE_FILE}" ]; then
-    # 执行HQL文件
-    pretty_print "${NORM_MGT}执行Shell命令 ${NORM_GRN}hive -f ${HIVE_FILE} ${HIVE_CONF[*]} ${HIVE_LOCAL_VAR[*]}"
-    if [ "${USER}" == "${RELEASE_USER}" ]; then
-      hive -f "/home/${USER}/release/tendata-warehouse/${HIVE_FILE}" "${HIVE_CONF[@]}" "${HIVE_LOCAL_VAR[@]}" 2>&1 | tee -a "${LOG_FULL_PATH}"
-    else
-      hive -f "/home/${USER}/tendata-warehouse/${HIVE_FILE}" "${HIVE_CONF[@]}" "${HIVE_LOCAL_VAR[@]}" 2>&1 | tee -a "${LOG_FULL_PATH}"
-    fi
-    EXIT_CODE="${PIPESTATUS[0]}"
-    if [ "${EXIT_CODE}" -ne 0 ]; then
-      if [[ "${HIVE_FILE}" =~ .*stg_es_mapping.sql ]]; then
-        exit $((EXIT_CODE))
-      fi
-      if [[ "${HIVE_FILE}" =~ .*stage_es_mapping.sql ]]; then
-        exit $((EXIT_CODE))
-      fi
-      # RELEASE_USER="dev005"
-      if [ "${USER}" == "${RELEASE_USER}" ]; then
-        DINGTALK_ALTER_KEY="4eb576296e66f49628447c8f2931c8892583f3283c96fef872577148aa5f88fa"
-        MESSAGE="在 ${CURRENT_HOST} 上执行HQL文件 /home/${USER}/tendata-warehouse/${HIVE_FILE} 失败"
-        "${BASE_DIR}"/bin/dingtalk-work-alert.sh -key="${DINGTALK_ALTER_KEY}" -msg="${MESSAGE}"
-      else
-        pretty_print "${NORM_MGT}执行HQL文件 ${NORM_GRN}${HIVE_FILE}${NORM_MGT} 失败"
-      fi
-      exit $((EXIT_CODE))
-    fi
-  else
-    usage 1
-  fi
-}
-
-function pretty_print() {
-    # 设置文本颜色和格式
-    NORM_GRN='\033[0;32m'  # 绿色
-    NORM_CYN='\033[0;36m'  # 青色
-    NORM_MGT='\033[0m'   # 重置颜色和格式
-    # 打印带颜色和格式的消息
-    echo -e "${1}"
-}
-
-
-HIVE_CONF=()
-HIVE_GLOBAL_VAR=()
-HIVE_SQL=""
-HIVE_FILE=""
-parse_args "${@}"
-if [ -z "${DT}" ]; then
-  DT=$(date -d '-1 day' +%Y%m%d)
-fi
-date_range "${DT}"
-for DT in "${DATE_RANGE[@]}"; do
-  HIVE_LOCAL_VAR=("${HIVE_GLOBAL_VAR[@]}")
-  HIVE_LOCAL_VAR+=("--hivevar")
-  HIVE_LOCAL_VAR+=("dt=${DT}")
-  LOG_DIR="${LOG_ROOT_DIR}/hive-exec/${DT}"
-  if [ -n "${HIVE_SQL}" ]; then
-    HIVE_FILE_SIMPLE_NAME=$(echo "${HIVE_SQL}" | base64)
-    LOG_FILE_NAME="${HIVE_FILE_SIMPLE_NAME}.log"
-  elif [ -n "${HIVE_FILE}" ]; then
-    HIVE_FILE_SIMPLE_NAME=$(basename "${HIVE_FILE}" .sql)
-    LOG_FILE_NAME="${HIVE_FILE_SIMPLE_NAME}.log"
-  fi
-  mkdir -p "${LOG_DIR}"
-  LOG_FULL_PATH="${LOG_DIR}/${LOG_FILE_NAME}"
-  pretty_print "${NORM_MGT}日志文件将写入 ${NORM_GRN}${LOG_FULL_PATH}${NORM_MGT}"
-  run_execute
-done

+ 0 - 3
dw_base/ds/__init__.py

@@ -1,3 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-

+ 0 - 19
dw_base/ds/config/base_config.yaml

@@ -1,19 +0,0 @@
-base_url: http://xxxx:12345/dolphinscheduler
-project_code:
-request_params:
-  processDefinitionCode:
-  scheduleTime:
-  failureStrategy: END
-  taskDependType: TASK_POST
-  execType: START_PROCESS
-  warningType: NONE
-  runMode: RUN_MODE_SERIAL
-  processInstancePriority: MEDIUM
-  workerGroup: cdh
-  tenantCode: alvis
-  startParams:
-  dryRun: 0
-  testFlag: 0
-  complementDependentMode: OFF_MODE
-  allLevelDependent: false
-  executionOrder: DESC_ORDER

+ 0 - 9
dw_base/ds/config/process_code.yaml

@@ -1,9 +0,0 @@
-project_code:
-  customs-data-mix: 15867179893120
-
-process_code:
-  customs-data-mix:
-    数据融合his_dataX: 15876817825536
-    数据融合fix_dataX: 15876835811200
-    数据融合del_dataX: 15876840305666
-    mix_fill_fix: 130549782670528

+ 0 - 76
dw_base/ds/ds_start_workflow.py

@@ -1,76 +0,0 @@
-"""
-start_workflow(project_name, process_name, start_params):
-    project_name: 项目名称
-    process_name: 工作流名称
-    start_params: 工作流参数(dict)
-"""
-
-import json
-import requests
-import yaml
-import re
-import os
-import logging
-
-logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse/", abspath)
-config_path = os.getenv("CONFIG_PATH", "dw_base/ds/config/base_config.yaml")
-process_code_path = os.getenv("PROCESS_CODE_PATH", "dw_base/ds/config/process_code.yaml")
-
-
-def load_yaml_config(path):
-    try:
-        with open(path, 'r') as file:
-            config = yaml.safe_load(file)
-        return config
-    except FileNotFoundError:
-        logging.error(f"配置文件 {path} 未找到")
-        return {}
-    except Exception as e:
-        logging.error(f"读取配置文件时发生错误: {e}")
-        return {}
-
-
-def init_params(config):
-    params: dict = config.get("request_params")
-    for key in params.keys():
-        if params[key] is None:
-            params[key] = ""
-        else:
-            params[key] = str(params.get(key))
-    return params
-
-
-def send_request(url, headers, params):
-    if params is None:
-        return
-    try:
-        result = requests.post(url=url, headers=headers, params=params)
-        result.raise_for_status()
-        logging.info(result.json())
-    except requests.exceptions.RequestException as e:
-        logging.error(f"请求失败: {e}")
-
-
-def get_request_base(project_name, process_name, token):
-    base_config: dict = load_yaml_config(root_path + config_path)
-    base_url = base_config.get("base_url")
-    headers = {
-        "token": token
-    }
-    params = init_params(base_config)
-    process_code_config: dict = load_yaml_config(root_path + process_code_path)
-    project_code = process_code_config.get("project_code").get(project_name)
-    url = f"{base_url}/projects/{project_code}/executors/start-process-instance"
-    process_code = process_code_config.get("process_code").get(project_name).get(process_name)
-    params["project_code"] = str(project_code)
-    params["processDefinitionCode"] = str(process_code)
-    return url, headers, params
-
-
-def start_workflow(project_name, process_name, start_params, token):
-    url, headers, params = get_request_base(project_name, process_name, token)
-    params["startParams"] = json.dumps(start_params)
-    send_request(url, headers, params)

+ 0 - 186
dw_base/scheduler/country_count_dingtalk.py

@@ -1,186 +0,0 @@
-# 指标
-# 参数示例: -mgdb kazakhstan -dt 20240304
-import sys
-import re
-import os
-import requests
-import json
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.config_utils import parse_args
-
-
-def send_dingtalk_notification(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    url = f'http://m1.node.cdh/dingtalk/api/robot/send?access_token=72cbdfb0a30fa51defca1dcba1c7b68feaace79c08e69da8cf9a7ea321481b06'
-    # 下面的url用于测试
-    # url = f'http://m1.node.cdh/dingtalk/api/robot/send?access_token=89974c66ec5a33c67acd71c0544fe323dd76c5d7a6f0b92acd09175745b737a0'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-def main():
-    # 解析命令行参数
-    CONFIG, _ = parse_args(sys.argv[1:])
-    mgdb = CONFIG.get('mgdb')
-    dt = CONFIG.get('dt')
-
-    with SparkSQL() as spark:
-        country_im_colm = {
-            'russia': 'shrmc',
-            'india': 'jksmc',
-            'india_exp': 'jksmc',
-            'vietnam': 'jksmc',
-            'turkey': 'jksmc',
-            'kazakhstan': 'jksmc',
-            'mexico': 'jksmc',
-            'mexico_bol': 'jksmc'
-        }
-        country_ex_colm = {
-            'russia': 'fhrmc',
-            'india': 'cksmc',
-            'india_exp': 'cksmc',
-            'vietnam': 'cksmc',
-            'turkey': 'cksmc',
-            'kazakhstan': 'cksmc',
-            'mexico': 'cksmc',
-            'mexico_bol': 'cksmc'
-        }
-
-        sql_query1 = (f"select count(1) AS total_tid_count from ( select id "
-                      f"from (select jkstid as id "
-                      f"      from dwd.cts_{mgdb}_im "
-                      f"      where dt in ('19700101', '20240303') "
-                      f"      union all "
-                      f"      select ckstid as id "
-                      f"      from dwd.cts_{mgdb}_ex "
-                      f"      where dt in ('19700101', '20240303')) a "
-                      f"group by id)b ")
-        res = spark.query(sql_query1)[0].collect()
-        cnt1 = res[0]['total_tid_count']
-
-        sql_query2 = (f"select count(1) AS total_tid_count "
-                      f"from ( "
-                      f"select id "
-                      f"from (select id, count(1) "
-                      f"      from (select jkstid as id, {country_im_colm[mgdb]} as mc  "
-                      f"            from dwd.cts_{mgdb}_im "
-                      f"            where dt in ('19700101', '20240303') "
-                      f"            union all "
-                      f"            select ckstid as id, {country_ex_colm[mgdb]} as mc "
-                      f"            from dwd.cts_{mgdb}_ex "
-                      f"            where dt in ('19700101', '20240303')) a "
-                      f"      group by id, mc) b "
-                      f"group by id "
-                      f"having count(1) = 1)b ")
-        res = spark.query(sql_query2)[0].collect()
-        cnt2 = res[0]['total_tid_count']
-
-        sql_query3 = (f"select count(1) AS total_tid_count "
-                      f"from ( "
-                      f"select id "
-                      f"from (select id "
-                      f"      from (select jkstid as id, {country_im_colm[mgdb]} as mc "
-                      f"            from dwd.cts_{mgdb}_im "
-                      f"            where dt in ('19700101', '20240303') "
-                      f"            union all "
-                      f"            select ckstid as id, {country_ex_colm[mgdb]} as mc "
-                      f"            from dwd.cts_{mgdb}_ex "
-                      f"            where dt in ('19700101', '20240303')) a "
-                      f"      group by id, mc) b "
-                      f"group by id "
-                      f"having count(1) > 1)b ")
-        res = spark.query(sql_query3)[0].collect()
-        cnt3 = res[0]['total_tid_count']
-
-        sql_query4 = (f"select (select count(1) "
-                      f"        from dwd.cts_{mgdb}_im "
-                      f"        where dt in ('19700101', '20240303')) + "
-                      f"       (select count(1) "
-                      f"        from dwd.cts_{mgdb}_ex "
-                      f"        where dt in ('19700101', '20240303')) as total_tid_count ")
-        res = spark.query(sql_query4)[0].collect()
-        cnt4 = res[0]['total_tid_count']
-        sql_query5 = (f"select count(1) AS total_tid_count  "
-                      f"from (select jkstid as id "
-                      f"            from dwd.cts_{mgdb}_im "
-                      f"            where dt in ('19700101', '20240303') "
-                      f"            union all "
-                      f"            select ckstid as id "
-                      f"            from dwd.cts_{mgdb}_ex "
-                      f"            where dt in ('19700101', '20240303'))c "
-                      f"where id in (select id "
-                      f"                 from (select id "
-                      f"                       from (select jkstid as id, {country_im_colm[mgdb]} as mc "
-                      f"                             from dwd.cts_{mgdb}_im "
-                      f"                             where dt in ('19700101', '20240303') "
-                      f"                             union all "
-                      f"                             select ckstid as id, {country_ex_colm[mgdb]} as mc "
-                      f"                             from dwd.cts_{mgdb}_ex "
-                      f"                             where dt in ('19700101', '20240303')) a "
-                      f"                       group by id, mc) b "
-                      f"                 group by id "
-                      f"                 having count(1) = 1) ")
-        res = spark.query(sql_query5)[0].collect()
-        cnt5 = res[0]['total_tid_count']
-
-        sql_query6 = (f"select count(1) AS total_tid_count "
-                      f"from (select jkstid as id "
-                      f"            from dwd.cts_{mgdb}_im "
-                      f"            where dt in ('19700101', '20240303') "
-                      f"            union all "
-                      f"            select ckstid as id "
-                      f"            from dwd.cts_{mgdb}_ex "
-                      f"            where dt in ('19700101', '20240303'))c "
-                      f"where id in (select id "
-                      f"                 from (select id "
-                      f"                       from (select jkstid as id, {country_im_colm[mgdb]} as mc "
-                      f"                             from dwd.cts_{mgdb}_im "
-                      f"                             where dt in ('19700101', '20240303') "
-                      f"                             union all "
-                      f"                             select ckstid as id, {country_ex_colm[mgdb]} as mc "
-                      f"                             from dwd.cts_{mgdb}_ex "
-                      f"                             where dt in ('19700101', '20240303')) a "
-                      f"                       group by id, mc) b "
-                      f"                 group by id "
-                      f"                 having count(1) > 1) ")
-        res = spark.query(sql_query6)[0].collect()
-        cnt6 = res[0]['total_tid_count']
-        sql_query7 = (f"select count(1) AS total_tid_count "
-                      f"from (select jkstid as id "
-                      f"            from dwd.cts_{mgdb}_im "
-                      f"            where dt in ('19700101', '20240303') "
-                      f"            union all "
-                      f"            select ckstid as id "
-                      f"            from dwd.cts_{mgdb}_ex "
-                      f"            where dt in ('19700101', '20240303'))c "
-                      f"where id is null ")
-        res = spark.query(sql_query7)[0].collect()
-        cnt7 = res[0]['total_tid_count']
-
-        msg = (f"{mgdb}数据量指标 \n"
-               f"-----------------------------------\n"
-               f"{mgdb}进出口统计:\n\n"
-               f"总tid数量:\t\t\t{cnt1}\n"
-               f"一对一的tid数量:\t\t{cnt2}\n"
-               f"一对多的tid数量:\t\t{cnt3}\n\n"
-               f"详单总数据量:\t\t{cnt4}\n"
-               f"一对一的tid的详单数量:\t{cnt5}\n"
-               f"一对多的tid的详单数量:\t{cnt6}\n"
-               f"tid为空的详单数量:\t\t{cnt7}\n"
-               f"  \n"
-               )
-        send_dingtalk_notification(msg)
-
-
-if __name__ == '__main__':
-    main()

+ 0 - 240
dw_base/scheduler/dingtalk_mirror_monitor.py

@@ -1,240 +0,0 @@
-# 用于钉钉监控T+1任务是否需要重跑
-import sys
-import re
-import os
-import requests
-import json
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.log_utils import pretty_print
-from configparser import ConfigParser
-from datetime import time
-from pymongo import MongoClient
-from dw_base import *
-from dw_base.scheduler.polling_scheduler import get_mongo_client
-from dw_base.utils.config_utils import parse_args
-from dw_base.scheduler.mg2es.conf_reader import ConfReader
-from dw_base.scheduler.mg2es.es_operator import ESOperator
-from elasticsearch.exceptions import NotFoundError
-
-call_count = 0
-
-
-def check_call_count():
-    global call_count
-    if call_count == 0:
-        pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                     f'{NORM_MGT}向后传递参数: {NORM_GRN}is_run => 1 '
-                     f'{NORM_MGT} call_count =>{call_count}')
-        print('${setValue(is_run=%s)}' % '1')
-    else:
-        pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                     f'{NORM_MGT}向后传递参数: {NORM_GRN}is_run => 0 '
-                     f'{NORM_MGT} call_count =>{call_count}')
-        print('${setValue(is_run=%s)}' % '0')
-
-
-def send_dingtalk_notification(msg):
-    global call_count
-    call_count += 1
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    # 下面的url用于测试
-    url = 'http://m1.node.cdh/dingtalk/api/robot/send?access_token=a4a48ed82627149f3317ee86e249fd7d973f5bed40fcac55cc2e7ca8d9ae0c61'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-def send_dingtalk_notification_es(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    # 下面的url用于测试
-    url = 'http://m1.node.cdh/dingtalk/api/robot/send?access_token=a4a48ed82627149f3317ee86e249fd7d973f5bed40fcac55cc2e7ca8d9ae0c61'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-def get_mongo_client(conf_path):
-    config_parser = ConfigParser()
-    config_parser.read(root_path + conf_path)
-    url = config_parser.get('base', 'address')
-    return MongoClient(url)
-
-
-def get_count(client, mgdb, mgtbl):
-    db = client[mgdb]
-    collection = db[mgtbl]
-    return collection.count()
-
-
-def get_count_null(client, mgdb, mgtbl):
-    db = client[mgdb]
-    collection = db[mgtbl]
-    # 计数`date`字段不为null的文档
-    # return  collection.count_documents({'date': {'$ne': None}})
-    # 计数`date` 为null的文档
-    return collection.count_documents({'date': None})
-
-
-def get_old_count(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-prod-old.ini')
-    result = get_count(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}{result}')
-    return result
-
-
-def get_clu_count_null(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cluster-cts-prod.ini')
-    result = get_count_null(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} 集群 mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} 集群date字段为空 count: {NORM_GRN}{result}')
-    return result
-
-
-def get_dev_count_null(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-dev-rw-200-test.ini')
-    result = get_count_null(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} dev source mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} dev data count: {NORM_GRN}{result}')
-    return result
-
-
-def get_clu_count(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cluster-cts-prod.ini')
-    result = get_count(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} 大数据集群mongo sink mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} 大数据集群mongo data count: {NORM_GRN}{result}')
-    return result
-
-
-
-
-def get_diff_logic(spark,record,dt):
-    mgdb = record['mgdb']
-    catalog = record['catalog']
-    bigdata_count = record['cnt']
-    clu_cnt = get_clu_count(mgdb, catalog)
-
-    date_null_cnt = get_clu_count_null(mgdb, catalog)
-
-    # 两个mongo数据量对比
-    cnt_diff = clu_cnt - bigdata_count
-
-    # if cnt_diff != 0 or date_null_cnt != 0:
-    if  date_null_cnt != 0:
-        msg3 = (
-            f"\n"
-            f"--------------------------------\n"
-            f"镜像_mir 数据一致性警告\n"
-            f"--------------------------------\n"
-            f"在 {mgdb}_{catalog}  详细差异报告:\n\n"
-            f"\n"
-            f"--------------------------------\n"
-            f"计数对比:\n"
-            f"  大数据_镜像mongo 计数: {clu_cnt}\n"
-            f"  大数据平台 DWD 计数: {bigdata_count}\n"
-            f"  大数据_镜像mongo `date`字段为空 计数: {date_null_cnt}\n"
-            f"\n"
-            f"请检查原因 \n"
-            f"\n"
-            f"--------------------------------\n"
-        )
-        print(msg3)
-        # send_dingtalk_notification(msg3)
-
-    # 添加最终各个国家的统计数据量
-    statistical_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
-    sql_insert_cnt = f"""
-
-    insert into table task.cts_mirror_count 
-    select '{mgdb}','{catalog}',{bigdata_count},{clu_cnt},'{statistical_time}','{dt}'
-
-    """
-    spark.query(sql_insert_cnt)[0].collect()
-def main():
-    CONFIG, _ = parse_args(sys.argv[1:])
-    dt = CONFIG.get('dt')
-    ydt = CONFIG.get('ydt')
-    spark = SparkSQL()
-    spark._final_spark_config = {'hive.exec.dynamic.partition': 'true',
-                                 'hive.exec.dynamic.partition.mode': 'nonstrict',
-                                 'spark.yarn.queue': 'cts',
-                                 'spark.sql.crossJoin.enabled': 'true',
-                                 'spark.executor.memory': '8g',
-                                 'spark.executor.memoryOverhead': '2048',
-                                 'spark.driver.memory': '4g',
-                                 'spark.executor.instances': "12",
-                                 'spark.executor.cores': '4',
-                                 "spark.sql.hive.filesourcePartitionFileCacheSize":"536870912"
-                                 }
-    im_sql = (
-        f"select i.code3 as code3,code.english_name as country_name,concat(code.english_name,'_mir') as mgdb,cnt,'shipments_imports' as catalog"
-        f"  from"
-        f"( select   country_code as code3 ,count(1) as cnt from (select country_code from dwd.cts_mirror_country_im  where dt ='{ydt}') im "
-        f"group by country_code) i left join dim.cts_mirror_monitor code"
-        f" on i.code3 = code.code3 where code.english_name is not null")
-    ex_sql = (
-        f"select i.code3 as code3,code.english_name as country_name,concat(code.english_name,'_mir') as mgdb,cnt,'shipments_exports' as catalog "
-        f" from"
-        f"( select   country_code as code3 ,count(1) as cnt from (select country_code from dwd.cts_mirror_country_ex  where dt ='{ydt}') ex "
-        f"group by country_code) i left join dim.cts_mirror_monitor code"
-        f" on i.code3 = code.code3 where code.english_name is not null")
-
-    res_im = spark.query(im_sql)[0].collect()
-    res_ex = spark.query(ex_sql)[0].collect()
-
-
-    for record in res_im:
-        get_diff_logic(spark, record,dt)
-    for record in res_ex:
-        get_diff_logic(spark, record,dt)
-
-    sql_overwrite_cnt = f"""
-
-INSERT overwrite TABLE task.cts_mirror_count
-SELECT country,
-       catalog,
-       dwd_cnt,
-       mongo_cnt,
-       creat_time,
-       dt
-FROM
-  ( SELECT *,
-           row_number() over (partition BY country,catalog
-                              ORDER BY `creat_time` DESC) AS rk
-  FROM task.cts_mirror_count
-  WHERE dt ={dt}   ) tmp 
-where rk =1
-           """
-    spark.query(sql_overwrite_cnt)[0].collect()
-    check_call_count()
-
-
-if __name__ == '__main__':
-    main()
-
-# CREATE TABLE task.cts_mirror_count
-# (
-#     `country`    string COMMENT 'mgdb',
-#     `catalog`    string COMMENT '进出口类型',
-#     `cnt`        bigint comment '数据量',
-#     `creat_time` STRING COMMENT '统计时间'
-# )
-#     PARTITIONED BY ( `dt` string )
-#     TBLPROPERTIES ( 'COMMENT' = '同步到大数据平台的数据量统计');

+ 0 - 102
dw_base/scheduler/dingtalk_notifier.py

@@ -1,102 +0,0 @@
-# 调用钉钉机器人通知相关人员更新ES
-
-import sys
-import re
-import os
-
-
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.utils.log_utils import pretty_print
-from dw_base import *
-import requests
-import json
-import time
-from dw_base.scheduler.polling_scheduler import get_sink_count
-from dw_base.utils.config_utils import parse_args
-from dw_base.spark.spark_sql import SparkSQL
-import random
-def send_dingtalk_notification(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg},
-        "at": {"atMobiles": ["13924570409"]}
-    }
-    json_data = json.dumps(data)
-    url = 'https://oapi.dingtalk.com/robot/send?access_token=bda512e1f980c8d126361afbae9d744e9885705ce6ed047395a1f6bc4114114d'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-
-if __name__ == '__main__':
-    CONFIG, _ = parse_args(sys.argv[1:])
-    start_date = CONFIG.get('start-date')
-    stop_date = CONFIG.get('stop-date')
-    mgdb = CONFIG.get('mgdb')
-    mgtbl = CONFIG.get('mgtbl')
-    batch_id = CONFIG.get('batch_id')
-    cdt =f"{time.strftime('%Y%m%d', time.localtime())}"
-    count = get_sink_count(mgdb, mgtbl, start_date, stop_date)
-
-    spark = SparkSQL()
-    spark._final_spark_config = {'hive.exec.dynamic.partition': 'true',
-                                 'hive.exec.dynamic.partition.mode': 'nonstrict',
-                                 'spark.yarn.queue': 'cts',
-                                 'spark.sql.crossJoin.enabled': 'true',
-                                 'spark.executor.memory': '6g',
-                                 'spark.executor.memoryOverhead': '2048',
-                                 'spark.driver.memory': '4g',
-                                 'spark.executor.instances': "15",
-                                 'spark.executor.cores': '2'
-                                 }
-    if count > 0 :
-        try:
-            # 定义延时时间列表
-            delay_times = [0,10,18,26,39,45,52,60,70,80,90,100]
-            delay = random.choice(delay_times)
-            print(f"随机延时时间为:{delay}秒")
-            time.sleep(delay)
-
-            sql = (f"select count(1) as cnt from task.cts_incr_updated_data_cnt  "
-                   f"where  dt = '{cdt}'")
-            res = spark.query(sql)[0].collect()
-            order_id= int(res[0].cnt +1)
-
-            sql_insert_cnt = f"""
-
-            insert into table task.cts_incr_updated_data_cnt 
-            select '{mgdb}','{mgtbl}',{count},'{time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())}','{cdt}'
-
-            """
-            spark.query(sql_insert_cnt)[0].collect()
-            msg = (f"数据上新提醒 @13924570409\n"
-                   f"{time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())}   ({order_id})\n"
-                   f"{mgdb}.{mgtbl} 今日新增数据量: {count} 已入库完毕,\n调用接口成功,正在刷es索引!"
-                   f"本批数据batch_id为: {batch_id} "
-                   )
-            pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} ({order_id})'
-                         f'{NORM_MGT}已发送通知: {NORM_GRN} {msg} ')
-            send_dingtalk_notification(msg)
-
-        except Exception as e:
-            print(f"发生错误: {e}")
-
-
-#
-#
-# CREATE TABLE task.`cts_incr_updated_data_cnt`
-# (
-#     `mgdb`         STRING COMMENT 'mgdb',
-#     `mgtbl`        STRING COMMENT 'mgtbl',
-#     `count`        int COMMENT '计数',
-#     `created_time` STRING COMMENT '统计时间'
-# )
-#     COMMENT 'cts_incr_updated_data_cnt'
-#     PARTITIONED BY (`dt` STRING)
-#     STORED AS ORC
-#     tblproperties ('orc.compress' = 'ZLIB')
-

+ 0 - 370
dw_base/scheduler/dingtalk_task_monitor.py

@@ -1,370 +0,0 @@
-# 用于钉钉监控T+1任务是否需要重跑
-
-import sys
-import re
-import os
-import requests
-import json
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.log_utils import pretty_print
-from configparser import ConfigParser
-from datetime import time,datetime
-from pymongo import MongoClient
-from dw_base import *
-from dw_base.scheduler.polling_scheduler import get_mongo_client
-from dw_base.utils.config_utils import parse_args
-from dw_base.scheduler.mg2es.conf_reader import ConfReader
-from dw_base.scheduler.mg2es.es_operator import ESOperator
-from elasticsearch.exceptions import NotFoundError
-
-call_count = 0
-
-
-def check_call_count():
-    global call_count
-    if call_count == 0:
-        pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                     f'{NORM_MGT}向后传递参数: {NORM_GRN}is_run => 1 '
-                     f'{NORM_MGT} call_count =>{call_count}')
-        print('${setValue(is_run=%s)}' % '1')
-    else:
-        pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                     f'{NORM_MGT}向后传递参数: {NORM_GRN}is_run => 0 '
-                     f'{NORM_MGT} call_count =>{call_count}')
-        print('${setValue(is_run=%s)}' % '0')
-
-
-def send_dingtalk_notification(msg):
-    global call_count
-    call_count += 1
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    # 下面的url用于测试
-    url = 'http://m1.node.cdh/dingtalk/api/robot/send?access_token=a4a48ed82627149f3317ee86e249fd7d973f5bed40fcac55cc2e7ca8d9ae0c61'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-def send_dingtalk_notification_es(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    # 下面的url用于测试
-    url = 'http://m1.node.cdh/dingtalk/api/robot/send?access_token=a4a48ed82627149f3317ee86e249fd7d973f5bed40fcac55cc2e7ca8d9ae0c61'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-def get_mongo_client(conf_path):
-    config_parser = ConfigParser()
-    config_parser.read(root_path + conf_path)
-    url = config_parser.get('base', 'address')
-    return MongoClient(url)
-
-
-def get_count(client, mgdb, mgtbl):
-    db = client[mgdb]
-    collection = db[mgtbl]
-    return collection.count()
-def get_count_null(client, mgdb, mgtbl):
-    db = client[mgdb]
-    collection = db[mgtbl]
-    # 计数`date`字段不为null的文档
-    # return  collection.count_documents({'date': {'$ne': None}})
-    # 计数`date` 为null的文档
-    return  collection.count_documents({'date': None})
-
-def get_count_range_date(mgdb, mgtbl, target_date):
-    """
-    统计 date 字段值小于目标日期的文档总数
-    Args:
-        client: MongoDB客户端实例
-        mgdb: 数据库名称
-        mgtbl: 集合名称
-        target_date_str: 目标日期字符串 (格式: "YYYYMMDD")
-
-    Returns:
-        int: 符合条件的文档数量
-    """
-    client = get_mongo_client('/../datasource/mongo/mongo-cluster-cts-prod.ini')
-    db = client[mgdb]
-    collection = db[mgtbl]
-
-    # 将输入的字符串转换为 datetime 对象
-    target_date = datetime.strptime(target_date, "%Y%m%d").replace(
-        tzinfo=None  # 如果数据库时间不带时区,可以移除此行
-    )
-
-    count = collection.count_documents({'date': {'$lt': target_date}})
-    return count
-
-
-def get_old_count(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-prod-old.ini')
-    result = get_count(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}{result}')
-    return result
-def get_clu_count_null(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cluster-cts-prod.ini')
-    result = get_count_null(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}{result}')
-    return result
-def get_dev_count_null(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-dev-rw-200-test.ini')
-    result = get_count_null(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}{result}')
-    return result
-
-
-def get_clu_count(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cluster-cts-prod.ini')
-    result = get_count(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} 大数据集群mongo sink mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} 大数据集群mongo data count: {NORM_GRN}{result}')
-    return result
-
-
-def get_bigdata_count(mgdb, mgtbl, dt, spark,cdt):
-    sql = (f"select count(1) cnt "
-           f"from dwd.cts_{mgdb}_{mgtbl} "
-           f" where dt in ('19700101', {dt},{cdt}) ")
-    res = spark.query(sql)[0].collect()
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} 大数据dwd表名: {NORM_GRN}dwd.cts_{mgdb}_{mgtbl} '
-                 f'{NORM_MGT} 大数据dwd 1970+昨日分区+当日分区 count: {NORM_GRN}{res[0].cnt}')
-    return res[0].cnt
-
-
-def get_bigdata_global_bol_count(catalog, dt, spark):
-    sql = (f"""
-    select sum(cnt) cnt from (select count(1) cnt from dwd.`cts_north_america_bol_{catalog}`   where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_central_america_bol_{catalog}`  where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_south_america_bol_{catalog}`    where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_asia_bol_{catalog}`             where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_middle_east_bol_{catalog}`      where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_europe_bol_{catalog}`           where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_africa_bol_{catalog}`           where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_oceania_bol_{catalog}`          where dt in ('19700101', {dt})                 ) a""")
-    res = spark.query(sql)[0].collect()
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} 大数据dwd表名: global_bol 1拆8 '
-                 f'{NORM_MGT} 大数据dwd 1970+昨日分区count: {NORM_GRN}{res[0].cnt}')
-    return res[0].cnt
-
-
-def get_year_count(mgdb, catalog, dt, spark):
-    if mgdb != "global_bol":
-        sql = (f"select from_unixtime(cast(`date`/1000 as int)- 8 * 60 * 60, 'yyyy') as year,count(1) hive_cnt "
-               f"from dwd.cts_{mgdb}_{catalog} "
-               f" where dt in ('19700101', {dt}) "
-               f" group by from_unixtime(cast(`date`/1000 as int)- 8 * 60 * 60, 'yyyy')"
-               f" order by from_unixtime(cast(`date`/1000 as int)- 8 * 60 * 60, 'yyyy')")
-        res = spark.query(sql)[0].collect()
-        hive_year_cnt_dict = {}
-        es_year_cnt_dict = {}
-
-        host, port = ConfReader().get_es_conf()
-        es_operator = ESOperator(host, port)
-        for record in res:
-            year = record['year']
-            hive_cnt = record['hive_cnt']
-            hive_year_cnt_dict[year] = hive_cnt
-            # index_name = 'customs_' + str(catalogs[catalog]) + '_' + mgdb + '-' + year
-            index_name = str(catalog) + '_' + mgdb + '-' + year
-            try:
-                ES_year_cnt = es_operator.get_index_document_count(index_name)
-            except NotFoundError:
-                # 因为钉钉关键词所以没有发钉钉
-                msg7 = (f"ES Index {index_name} not found.\n"
-                        f" 请检查原因\n"
-                        )
-                # print(msg7)
-
-                send_dingtalk_notification_es(msg7)
-                ES_year_cnt = 0
-            if ES_year_cnt is None:
-                ES_year_cnt = 0
-            es_year_cnt_dict[year] = ES_year_cnt
-            es_diff = ES_year_cnt - hive_cnt
-            if es_diff != 0:
-                msg5 = (
-                    f"-----------------------------\n"
-                    f"\n"
-                    f"{mgdb}_{catalog} - 数据一致性警告:ES{year}与大数据DWD的{year}数量不一致。\n\n"
-                    f"详细差异报告:\n"
-                    f"-----------------------------------------------------------------------\n"
-                    f"年份:{year}\n"
-                    f"ES{year} 计数:{ES_year_cnt}\n"
-                    f"大数据{year} 计数:{hive_cnt}\n"
-                    f"差异值:{es_diff}\n"
-                    f"-----------------------------------------------------------------------\n"
-                    f"\n"
-                    f"请检查原因 \n"
-                    f"\n"
-                    f"-----------------------------\n"
-                )
-                # print(msg5)
-                send_dingtalk_notification_es(msg5)
-        pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                     f'{NORM_MGT} 大数据dwd表名: {NORM_GRN}dwd.cts_{mgdb}_{catalog} '
-                     f'{NORM_MGT} 大数据hive_year_cnt_dict  {NORM_GRN}{hive_year_cnt_dict}'
-                     f'{NORM_MGT} es_year_cnt_dict  {NORM_GRN}{es_year_cnt_dict}'
-                     )
-
-
-def main():
-    CONFIG, _ = parse_args(sys.argv[1:])
-    dt = CONFIG.get('dt')
-    cdt = CONFIG.get('cdt')
-    spark = SparkSQL()
-    spark._final_spark_config = {'hive.exec.dynamic.partition': 'true',
-                                 'hive.exec.dynamic.partition.mode': 'nonstrict',
-                                 'spark.yarn.queue': 'cts',
-                                 'spark.sql.crossJoin.enabled': 'true',
-                                 'spark.executor.memory': '6g',
-                                 'spark.executor.memoryOverhead': '2048',
-                                 'spark.driver.memory': '4g',
-                                 'spark.executor.instances': "15",
-                                 'spark.executor.cores': '2'
-                                 }
-    sql = (f"select mgdb, catalog from task.mg_count_monitor "
-           f"where is_deleted = '0'")
-    res = spark.query(sql)[0].collect()
-    mgdbs_prod = {
-        'dwd表名': '大数据mongo库名',
-        'un_global_trade_tatistics': 'united_nations_stat',
-        "global_bol": "global_bol"
-    }
-    mgdbs_old = {
-        'dwd表名': 'old_mongo库名',
-        'un_global_trade_tatistics': 'united_nations_stat',
-        "global_bol": "global_sea"
-    }
-    catalogs = {
-        'im': 'shipments_imports',
-        'ex': 'shipments_exports',
-    }
-    #  添加需要排除的读 old_mongo 的数据库名称
-    excluded_dbs = ["un_global_trade_tatistics",
-                    "north_america_bol",
-                    "central_america_bol",
-                    "south_america_bol",
-                    "asia_bol",
-                    "middle_east_bol",
-                    "europe_bol",
-                    "africa_bol",
-                    "oceania_bol"]
-    # 以下用于测试
-    # res = [{"mgdb": "global_bol", "catalog": "im"}]
-    # res = [{"mgdb": "ethiopia", "catalog": "ex"}]
-    mirror_dbs = ["fiji"]
-    mirror_dbs_date = {"fiji_im": "20211101", "fiji_ex": "20211101"}
-    for record in res:
-        mgdb = record['mgdb']
-        catalog = record['catalog']
-
-        prod_mgdb = mgdbs_prod.get(record['mgdb'], mgdb)
-        old_mgdb = mgdbs_old.get(record['mgdb'], mgdb)
-
-        if mgdb == "global_bol":
-            old_cnt = get_old_count(old_mgdb, catalogs[catalog])
-            # oldmongo和dwd拆分表
-            clu_cnt = get_bigdata_global_bol_count(catalog, dt, spark)
-            bigdata_count = get_bigdata_global_bol_count(catalog, dt, spark)
-            date_null_cnt=get_clu_count_null(mgdb, catalogs[catalog])
-        else:
-            old_cnt = get_old_count(prod_mgdb, catalogs[catalog])
-            clu_cnt = get_clu_count(prod_mgdb, catalogs[catalog])
-            bigdata_count = get_bigdata_count(mgdb, catalog, dt, spark,cdt)
-            date_null_cnt = get_clu_count_null(mgdb, catalogs[catalog])
-            # get_year_count(mgdb, catalog, dt, spark)
-        if mgdb in mirror_dbs:
-            clu_cnt = get_count_range_date(mgdb, catalogs[catalog], target_date=mirror_dbs_date[f"{mgdb}_{catalog}"])
-            print(f"{mgdb}{catalogs[catalog]} clu_cnt: {clu_cnt}")
-        # 两个mongo数据量对比
-        cnt_diff = old_cnt - clu_cnt
-        # oldmongo和dwd 对比
-        bd_diff = old_cnt - bigdata_count
-
-        if bd_diff != 0 or cnt_diff != 0 or date_null_cnt != 0:
-            msg3 = (
-                f"\n"
-                f"--------------------------------\n"
-                f"数据一致性警告\n"
-                f"--------------------------------\n"
-                f"在 {mgdb}_{catalog}  详细差异报告:\n\n"
-                f"\n"
-                f"--------------------------------\n"
-                f"计数对比:\n"
-                f"  old_mongo 计数: {old_cnt}\n"
-                f"  大数据_mongo 计数: {clu_cnt}\n"
-                f"  大数据平台 DWD 计数: {bigdata_count}\n"
-                f"  大数据_mongo `date`字段为空 计数: {date_null_cnt}\n"
-                f"\n"
-                f"请检查原因 \n"
-                f"\n"
-                f"--------------------------------\n"
-            )
-            if mgdb not in excluded_dbs:
-                send_dingtalk_notification(msg3)
-
-        # 添加最终各个国家的统计数据量
-        statistical_time=time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
-        sql_insert_cnt=f"""
-        
-        insert into table task.cts_country_count 
-        select '{mgdb}','{catalog}',{clu_cnt},'{statistical_time}','{dt}'
-        
-        """
-        spark.query(sql_insert_cnt)[0].collect()
-
-
-    sql_overwrite_cnt = f"""
-
-INSERT overwrite TABLE task.cts_country_count
-SELECT country,
-       catalog,
-       cnt,
-       creat_time,
-       dt
-FROM
-  ( SELECT *,
-           row_number() over (partition BY country,catalog
-                              ORDER BY `creat_time` DESC) AS rk
-  FROM task.cts_country_count
-  WHERE dt ={dt}   ) tmp 
-where rk =1
-           """
-    spark.query(sql_overwrite_cnt)[0].collect()
-    check_call_count()
-
-if __name__ == '__main__':
-    main()
-
-
-# CREATE TABLE task.cts_country_count
-# (
-#     `country`    string COMMENT 'mgdb',
-#     `catalog`    string COMMENT '进出口类型',
-#     `cnt`        bigint comment '数据量',
-#     `creat_time` STRING COMMENT '统计时间'
-# )
-#     PARTITIONED BY ( `dt` string )
-#     TBLPROPERTIES ( 'COMMENT' = '同步到大数据平台的数据量统计');

+ 0 - 368
dw_base/scheduler/dingtalk_task_monitor_new.py

@@ -1,368 +0,0 @@
-# 用于钉钉监控T+1任务是否需要重跑
-import sys
-import re
-import os
-import requests
-import json
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.log_utils import pretty_print
-from configparser import ConfigParser
-from datetime import time, datetime
-from pymongo import MongoClient
-from dw_base import *
-from dw_base.scheduler.polling_scheduler import get_mongo_client
-from dw_base.utils.config_utils import parse_args
-from dw_base.scheduler.mg2es.conf_reader import ConfReader
-from dw_base.scheduler.mg2es.es_operator import ESOperator
-from elasticsearch.exceptions import NotFoundError
-
-call_count = 0
-
-
-def check_call_count():
-    global call_count
-    if call_count == 0:
-        pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                     f'{NORM_MGT}向后传递参数: {NORM_GRN}is_run => 1 '
-                     f'{NORM_MGT} call_count =>{call_count}')
-        print('${setValue(is_run=%s)}' % '1')
-    else:
-        pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                     f'{NORM_MGT}向后传递参数: {NORM_GRN}is_run => 0 '
-                     f'{NORM_MGT} call_count =>{call_count}')
-        print('${setValue(is_run=%s)}' % '0')
-
-
-def send_dingtalk_notification(msg):
-    global call_count
-    call_count += 1
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    # 下面的url用于测试
-    url = 'https://oapi.dingtalk.com/robot/send?access_token=d4955560edf9d78fbf5273fe3ea4022ecf5955570a68ff710f7fe81926dff71e'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-def send_dingtalk_notification_es(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    # 下面的url用于测试
-    url = 'http://m1.node.cdh/dingtalk/api/robot/send?access_token=a4a48ed82627149f3317ee86e249fd7d973f5bed40fcac55cc2e7ca8d9ae0c61'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-def get_mongo_client(conf_path):
-    config_parser = ConfigParser()
-    config_parser.read(root_path + conf_path)
-    url = config_parser.get('base', 'address')
-    return MongoClient(url)
-
-
-def get_count(client, mgdb, mgtbl):
-    db = client[mgdb]
-    collection = db[mgtbl]
-    return collection.count()
-def get_count_null(client, mgdb, mgtbl):
-    db = client[mgdb]
-    collection = db[mgtbl]
-    # 计数`date`字段不为null的文档
-    # return  collection.count_documents({'date': {'$ne': None}})
-    # 计数`date` 为null的文档
-    return  collection.count_documents({'date': None})
-
-
-def get_old_count(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-prod-old.ini')
-    result = get_count(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}{result}')
-    return result
-def get_clu_count_null(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cluster-cts-prod.ini')
-    result = get_count_null(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}{result}')
-    return result
-def get_dev_count_null(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-dev-rw-200-test.ini')
-    result = get_count_null(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}{result}')
-    return result
-
-
-def get_clu_count(mgdb, mgtbl):
-    client = get_mongo_client('/../datasource/mongo/mongo-cluster-cts-prod.ini')
-    result = get_count(client, mgdb, mgtbl)
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} 大数据集群mongo sink mongo: {NORM_GRN}{mgdb}.{mgtbl} '
-                 f'{NORM_MGT} 大数据集群mongo data count: {NORM_GRN}{result}')
-    return result
-
-
-def get_bigdata_count(mgdb, mgtbl, dt, spark,cdt):
-    sql = (f"select count(1) cnt "
-           f"from dwd.cts_{mgdb}_{mgtbl} "
-           f" where dt in ('19700101', {dt},{cdt}) ")
-    res = spark.query(sql)[0].collect()
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} 大数据dwd表名: {NORM_GRN}dwd.cts_{mgdb}_{mgtbl} '
-                 f'{NORM_MGT} 大数据dwd 1970+昨日分区+当日分区 count: {NORM_GRN}{res[0].cnt}')
-    return res[0].cnt
-
-
-def get_bigdata_global_bol_count(catalog, dt, spark):
-    sql = (f"""
-    select sum(cnt) cnt from (select count(1) cnt from dwd.`cts_north_america_bol_{catalog}`   where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_central_america_bol_{catalog}`  where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_south_america_bol_{catalog}`    where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_asia_bol_{catalog}`             where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_middle_east_bol_{catalog}`      where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_europe_bol_{catalog}`           where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_africa_bol_{catalog}`           where dt in ('19700101', {dt}) 
-union all select count(1) from dwd.`cts_oceania_bol_{catalog}`          where dt in ('19700101', {dt})                 ) a""")
-    res = spark.query(sql)[0].collect()
-    pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                 f'{NORM_MGT} 大数据dwd表名: global_bol 1拆8 '
-                 f'{NORM_MGT} 大数据dwd 1970+昨日分区count: {NORM_GRN}{res[0].cnt}')
-    return res[0].cnt
-
-
-def get_year_count(mgdb, catalog, dt, spark):
-    if mgdb != "global_bol":
-        sql = (f"select from_unixtime(cast(`date`/1000 as int)- 8 * 60 * 60, 'yyyy') as year,count(1) hive_cnt "
-               f"from dwd.cts_{mgdb}_{catalog} "
-               f" where dt in ('19700101', {dt}) "
-               f" group by from_unixtime(cast(`date`/1000 as int)- 8 * 60 * 60, 'yyyy')"
-               f" order by from_unixtime(cast(`date`/1000 as int)- 8 * 60 * 60, 'yyyy')")
-        res = spark.query(sql)[0].collect()
-        hive_year_cnt_dict = {}
-        es_year_cnt_dict = {}
-
-        host, port = ConfReader().get_es_conf()
-        es_operator = ESOperator(host, port)
-        for record in res:
-            year = record['year']
-            hive_cnt = record['hive_cnt']
-            hive_year_cnt_dict[year] = hive_cnt
-            # index_name = 'customs_' + str(catalogs[catalog]) + '_' + mgdb + '-' + year
-            index_name = str(catalog) + '_' + mgdb + '-' + year
-            try:
-                ES_year_cnt = es_operator.get_index_document_count(index_name)
-            except NotFoundError:
-                # Not sent via the regular DingTalk robot because of its keyword filter
-                msg7 = (f"ES Index {index_name} not found.\n"
-                        f" 请检查原因\n"
-                        )
-                # print(msg7)
-
-                send_dingtalk_notification_es(msg7)
-                ES_year_cnt = 0
-            if ES_year_cnt is None:
-                ES_year_cnt = 0
-            es_year_cnt_dict[year] = ES_year_cnt
-            es_diff = ES_year_cnt - hive_cnt
-            if es_diff != 0:
-                msg5 = (
-                    f"-----------------------------\n"
-                    f"\n"
-                    f"{mgdb}_{catalog} - 数据一致性警告:ES{year}与大数据DWD的{year}数量不一致。\n\n"
-                    f"详细差异报告:\n"
-                    f"-----------------------------------------------------------------------\n"
-                    f"年份:{year}\n"
-                    f"ES{year} 计数:{ES_year_cnt}\n"
-                    f"大数据{year} 计数:{hive_cnt}\n"
-                    f"差异值:{es_diff}\n"
-                    f"-----------------------------------------------------------------------\n"
-                    f"\n"
-                    f"请检查原因 \n"
-                    f"\n"
-                    f"-----------------------------\n"
-                )
-                # print(msg5)
-                send_dingtalk_notification_es(msg5)
-        pretty_print(f'{NORM_CYN}{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())} '
-                     f'{NORM_MGT} 大数据dwd表名: {NORM_GRN}dwd.cts_{mgdb}_{catalog} '
-                     f'{NORM_MGT} 大数据hive_year_cnt_dict  {NORM_GRN}{hive_year_cnt_dict}'
-                     f'{NORM_MGT} es_year_cnt_dict  {NORM_GRN}{es_year_cnt_dict}'
-                     )
-def get_count_range_date(mgdb, mgtbl, target_date):
-    """
-    统计 date 字段值小于目标日期的文档总数
-    Args:
-        client: MongoDB客户端实例
-        mgdb: 数据库名称
-        mgtbl: 集合名称
-        target_date_str: 目标日期字符串 (格式: "YYYYMMDD")
-
-    Returns:
-        int: 符合条件的文档数量
-    """
-    client = get_mongo_client('/../datasource/mongo/mongo-cluster-cts-prod.ini')
-    db = client[mgdb]
-    collection = db[mgtbl]
-
-    # 将输入的字符串转换为 datetime 对象
-    target_date = datetime.strptime(target_date, "%Y%m%d").replace(
-        tzinfo=None  # 如果数据库时间不带时区,可以移除此行
-    )
-
-    count = collection.count_documents({'date': {'$lt': target_date}})
-    return count
-
-
-
-def main():
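-    # Flow: read the monitored (mgdb, catalog) pairs from task.mg_count_monitor, compare the
-    # old mongo / cluster mongo / dwd counts plus the null-`date` count, alert on DingTalk when
-    # they diverge, then record the per-country counts into task.cts_country_count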
-    CONFIG, _ = parse_args(sys.argv[1:])
-    dt = CONFIG.get('dt')
-    cdt = CONFIG.get('cdt')
-    spark = SparkSQL()
-    spark._final_spark_config = {'hive.exec.dynamic.partition': 'true',
-                                 'hive.exec.dynamic.partition.mode': 'nonstrict',
-                                 'spark.yarn.queue': 'cts',
-                                 'spark.sql.crossJoin.enabled': 'true',
-                                 'spark.executor.memory': '6g',
-                                 'spark.executor.memoryOverhead': '2048',
-                                 'spark.driver.memory': '4g',
-                                 'spark.executor.instances': "15",
-                                 'spark.executor.cores': '2'
-                                 }
-    sql = (f"select mgdb, catalog from task.mg_count_monitor "
-           f"where is_deleted = '0'")
-    res = spark.query(sql)[0].collect()
-    # key: dwd table name, value: mongo db name on the big-data cluster
-    mgdbs_prod = {
-        'un_global_trade_tatistics': 'united_nations_stat',
-        "global_bol": "global_bol"
-    }
-    # key: dwd table name, value: db name on the old mongo
-    mgdbs_old = {
-        'un_global_trade_tatistics': 'united_nations_stat',
-        "global_bol": "global_sea"
-    }
-    catalogs = {
-        'im': 'shipments_imports',
-        'ex': 'shipments_exports',
-    }
-    # Databases excluded from the old_mongo count-comparison alerts
-    excluded_dbs = ["un_global_trade_tatistics",
-                    "north_america_bol",
-                    "central_america_bol",
-                    "south_america_bol",
-                    "asia_bol",
-                    "middle_east_bol",
-                    "europe_bol",
-                    "africa_bol",
-                    "oceania_bol"]
-    # The lines below are for testing
-    # res = [{"mgdb": "global_bol", "catalog": "im"}]
-    # res = [{"mgdb": "ethiopia", "catalog": "ex"}]
-    mirror_dbs = ["fiji"]
-    mirror_dbs_date = {"fiji_im": "20211101", "fiji_ex": "20211101"}
-    for record in res:
-        mgdb = record['mgdb']
-        catalog = record['catalog']
-
-        prod_mgdb = mgdbs_prod.get(record['mgdb'], mgdb)
-        old_mgdb = mgdbs_old.get(record['mgdb'], mgdb)
-
-        if mgdb == "global_bol":
-            old_cnt = get_old_count(old_mgdb, catalogs[catalog])
-            # oldmongo和dwd拆分表
-            clu_cnt = get_bigdata_global_bol_count(catalog, dt, spark)
-            bigdata_count = get_bigdata_global_bol_count(catalog, dt, spark)
-            date_null_cnt=get_clu_count_null(mgdb, catalogs[catalog])
-        else:
-            old_cnt = get_old_count(prod_mgdb, catalogs[catalog])
-            clu_cnt = get_clu_count(prod_mgdb, catalogs[catalog])
-            bigdata_count = get_bigdata_count(mgdb, catalog, dt, spark, cdt)
-            date_null_cnt = get_clu_count_null(mgdb, catalogs[catalog])
-            # get_year_count(mgdb, catalog, dt, spark)
-        if mgdb in mirror_dbs:
-            clu_cnt = get_count_range_date(mgdb, catalogs[catalog], target_date=mirror_dbs_date[f"{mgdb}_{catalog}"])
-            print(f"{mgdb}{catalogs[catalog]} clu_cnt: {clu_cnt}")
-        # Compare the two mongo counts
-        cnt_diff = old_cnt - clu_cnt
-        # Compare old mongo against the dwd count
-        bd_diff = old_cnt - bigdata_count
-
-        if bd_diff != 0 or cnt_diff != 0 or date_null_cnt != 0:
-            msg3 = (
-                f"\n"
-                f"--------------------------------\n"
-                f"数据一致性警告\n"
-                f"--------------------------------\n"
-                f"在 {mgdb}_{catalog}  详细差异报告:\n\n"
-                f"\n"
-                f"--------------------------------\n"
-                f"计数对比:\n"
-                f"  old_mongo 计数: {old_cnt}\n"
-                f"  大数据_mongo 计数: {clu_cnt}\n"
-                f"  大数据平台 DWD 计数: {bigdata_count}\n"
-                f"  大数据_mongo `date`字段为空 计数: {date_null_cnt}\n"
-                f"\n"
-                f"请检查原因 \n"
-                f"\n"
-                f"--------------------------------\n"
-            )
-            if mgdb not in excluded_dbs:
-                send_dingtalk_notification(msg3)
-
-        # Record the final per-country count
-        statistical_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
-        sql_insert_cnt = f"""
-        insert into table task.cts_country_count
-        select '{mgdb}','{catalog}',{clu_cnt},'{statistical_time}','{dt}'
-        """
-        spark.query(sql_insert_cnt)[0].collect()
-
-
-    sql_overwrite_cnt = f"""
-
-INSERT overwrite TABLE task.cts_country_count
-SELECT country,
-       catalog,
-       cnt,
-       creat_time,
-       dt
-FROM
-  ( SELECT *,
-           row_number() over (partition BY country,catalog
-                              ORDER BY `creat_time` DESC) AS rk
-  FROM task.cts_country_count
-  WHERE dt = '{dt}' ) tmp
-where rk = 1
-           """
-    spark.query(sql_overwrite_cnt)[0].collect()
-    check_call_count()
-
-if __name__ == '__main__':
-    main()
-
-# CREATE TABLE task.cts_country_count
-# (
-#     `country`    string COMMENT 'mgdb',
-#     `catalog`    string COMMENT '进出口类型',
-#     `cnt`        bigint comment '数据量',
-#     `creat_time` STRING COMMENT '统计时间'
-# )
-#     PARTITIONED BY ( `dt` string )
-#     TBLPROPERTIES ( 'COMMENT' = '同步到大数据平台的数据量统计');

+ 0 - 498
dw_base/scheduler/ent_interface_dingtalk.py

@@ -1,498 +0,0 @@
-import base64
-import hashlib
-import hmac
-import sys
-import re
-import os
-import urllib
-import time
-import requests
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.utils.config_utils import parse_args
-from dw_base.spark.spark_sql import SparkSQL
-import http.client
-import json
-
-from cryptography.hazmat.primitives.asymmetric import rsa, padding
-from cryptography.hazmat.primitives import serialization
-from base64 import b64encode
-
-# RSA public key used to encrypt user ids for the account service
-public_key_pem = b"""
------BEGIN PUBLIC KEY-----
-MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDSaL/mqfq/30d5w6/05EL4073z
-ZgsomKTDI9wKUyz+ETkGwWzaNQm8BAXk9nJMCPz25fCTPd2BkifrS2KFKK2+e4hU
-pQxs+FQGaSeR8YEBWsCwh8bWaFWgxKuWpPPdfP6Vcnid/pTAsjbnw0KIHT7x83WZ
-qQTu3GUdyXkfyB41CQIDAQAB
------END PUBLIC KEY-----
-"""
-
-
-class UserInfo:
-    """公司名称"""
-    company_name: str
-    """真实名称"""
-    name: str
-    """用户id"""
-    user_id: int
-    """用户名"""
-    username: str
-
-    def __init__(self, company_name: str, name: str, user_id: int, username: str) -> None:
-        self.company_name = company_name
-        self.name = name
-        self.user_id = user_id
-        self.username = username
-
-    def __str__(self) -> str:
-        return (f"UserInfo:\n"
-                f"  Company Name: {self.company_name}\n"
-                f"  Name: {self.name}\n"
-                f"  User ID: {self.user_id}\n"
-                f"  Username: {self.username}")
-
-
-def encrypt_user_id(user_id):
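-    # RSA-encrypt the user id with the public key above (PKCS#1 v1.5 padding) and base64-encode it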
-    public_key = serialization.load_pem_public_key(public_key_pem)
-    encrypted = public_key.encrypt(
-        user_id.encode(),
-        padding.PKCS1v15()
-    )
-    return b64encode(encrypted).decode()
-
-
-def get_user_info(user_id):
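-    # Resolve user details from the account service (POST /account/personal) using the encrypted id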
-    encrypted_user_id = encrypt_user_id(user_id)
-    conn = http.client.HTTPConnection("192.168.11.6", 18080)
-    payload = json.dumps({
-        "encryptUserId": encrypted_user_id
-    })
-    headers = {
-        'User-Agent': 'Apifox/1.0.0 (https://apifox.com)',
-        'Content-Type': 'application/json'
-    }
-
-    try:
-        conn.request("POST", "/account/personal", payload, headers)
-        res = conn.getresponse()
-        resdata = res.read().decode("utf-8")
-        res_json = json.loads(resdata)
-        user_info = UserInfo(res_json['companyName'], res_json['name'], res_json['userId'], res_json['username'])
-        return user_info
-    except Exception as e:
-        print("Error:", e)
-    finally:
-        conn.close()
-
-
-spark = SparkSQL(udf_files=['dw_base/spark/udf/contacts/ctc_common.py',
-                            'dw_base/spark/udf/spark_id_generate_udf.py'])
-
-
-def get_sign(secret):
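-    # DingTalk signed webhook: HMAC-SHA256 over "{timestamp}\n{secret}" keyed with the secret,
-    # base64- and URL-encoded, then appended to the robot URL as &timestamp=...&sign=...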
-    timestamp = str(round(time.time() * 1000))
-    secret_enc = secret.encode('utf-8')
-    string_to_sign = '{}\n{}'.format(timestamp, secret)
-    string_to_sign_enc = string_to_sign.encode('utf-8')
-    hmac_code = hmac.new(secret_enc, string_to_sign_enc, digestmod=hashlib.sha256).digest()
-    sign = urllib.parse.quote_plus(base64.b64encode(hmac_code))
-    return timestamp, sign
-
-
-def send_dingtalk_markdown(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "markdown",
-        "markdown": {"title": '企业库告警', "text": msg, }
-    }
-    json_data = json.dumps(data)
-    secret = 'SECffb7fe1b4c3aacc7be85d3b03de88fdbf93dfb48fe1c13ea7dba34a84847675e'
-    timestamp, sign = get_sign(secret)
-    url = f'https://oapi.dingtalk.com/robot/send?access_token=ffdb7df856220a925196e911107a4aa259acb2fd1160fee8b11d0c3c800974fc&timestamp={timestamp}&sign={sign}'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-def send_dingtalk_notification(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    url = 'https://oapi.dingtalk.com/robot/send?access_token=5183dfe1ecbe06261bcac7b45c1a6b5ae101fec67877d74120a6a95c88d1f917'
-    # url = 'https://oapi.dingtalk.com/robot/send?access_token=c4086d8ba377fdade2dff869e71063733095bc718d3bafdfbe8be0966aa050d6'
-    # url = 'https://oapi.dingtalk.com/robot/send?access_token=bee997dbf61e839a17de087830ffef6e864c3109fef62a956703bdfe043b0e10'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-# Number of shh non-core business interface calls
-def get_shh_non_core_interface_cnt(dt):
-    sql = f'''
-SELECT sum(cnt) cnt
-FROM
-  (SELECT count(1) cnt
-   FROM ent_raw.interface_base
-   WHERE topic = "ent_monitor_interface"
-     AND dt = "{dt}"
-     AND GET_JSON_OBJECT(ori_json, "$.type") != "EXPORT"
-     AND GET_JSON_OBJECT(ori_json, "$.source")= 'CONTACT'
-   UNION ALL SELECT count(1) cnt
-   FROM ent_raw.interface_base
-   WHERE topic = "ent_shh_bizr_interface"
-     AND dt = "{dt}"
-     AND GET_JSON_OBJECT(ori_json, "$.type") IN("ROOT",
-                                                "COMPANY_COUNT")
-     AND GET_JSON_OBJECT(ori_json, "$.source")= 'BIZR'
-   UNION ALL SELECT count(1) cnt
-   FROM ent_raw.interface_base
-   WHERE topic = "ent_shh_mecs_interface"
-     AND dt = "{dt}"
-     AND GET_JSON_OBJECT(ori_json, "$.type") IN("CORP",
-                                                "SITE")
-     AND GET_JSON_OBJECT(ori_json, "$.source")= 'MECS'
-      UNION ALL SELECT count(1) cnt
-   FROM ent_raw.interface_base
-   WHERE topic = "ent_shh_interface"
-     AND dt = "{dt}"
-     AND GET_JSON_OBJECT(ori_json, "$.type")= "BIZR"
-     AND GET_JSON_OBJECT(ori_json, "$.source")= "BIZR"
-     )t
-     '''
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_shh_company_interface_cnt(dt):
-    sql = f'select count(1) cnt from ent_ods.ent_shh_api_company_logs where dt = "{dt}" and source != "SCRIPT"'
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_shh_company_interface_script_cnt(dt):
-    sql = f'select count(1) cnt from ent_ods.ent_shh_api_company_logs where dt = "{dt}" and source = "SCRIPT"'
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_shh_contact_interface_cnt(dt):
-    sql = f'select count(1) cnt from ent_raw.interface_base where topic = "ctc_shh_interface" and dt = "{dt}" and GET_JSON_OBJECT(ori_json, "$.source") != "SCRIPT"'
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_shh_contact_interface_script_cnt(dt):
-    sql = f'select count(1) cnt from ent_raw.interface_base where topic = "ctc_shh_interface" and dt = "{dt}" and GET_JSON_OBJECT(ori_json, "$.source") = "SCRIPT"'
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_snv_contact_interface_cnt(dt):
-    sql = f'select count(1) cnt from ent_raw.interface_base where topic = "ctc_snovio_interface" and dt = "{dt}" and GET_JSON_OBJECT(ori_json, "$.source") != "MANUAL_CONSUME" '
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_snv_contact_interface_script_cnt(dt):
-    sql = f'select count(1) cnt from ent_raw.interface_base where topic = "ctc_snovio_interface" and dt = "{dt}" and GET_JSON_OBJECT(ori_json, "$.source") = "MANUAL_CONSUME" '
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def ent_user_top(dt):
-    sql = (f"select GET_JSON_OBJECT(ori_json, '$.params.userId') as  user ,count(1) as cnt from ent_raw.interface_base "
-           f"where dt='{dt}' and  topic = 'ent_tendata_interface' and GET_JSON_OBJECT(ori_json, '$.type') = 'BRIEF_RESULT' group by GET_JSON_OBJECT(ori_json, '$.params.userId') order by count(1) desc limit 10"
-           )
-    body = ''
-    for row in spark.query(sql)[0].collect():
-        userid = row.user
-        user_info = get_user_info(userid)
-        body += f'{user_info.username},{user_info.name},{user_info.company_name},**{row.cnt}**次 \n\n'
-    return body
-
-def get_manual_request_cnt(dt):
-    sql = f'''SELECT count(DISTINCT GET_JSON_OBJECT(ori_json, '$.params.traceId')) manual_request_cnt,
-              count(distinct GET_JSON_OBJECT(ori_json, '$.params.userId'))                  as user_cnt
-   FROM ent_raw.interface_base
-   WHERE topic = 'ent_tendata_interface'
-     AND dt = '{dt}'
-     AND get_json_object(ori_json, '$.source') = 'BING'
-     AND get_json_object(ori_json, '$.type') = 'MANUAL_REFRESH'
-     AND get_json_object(ori_json, '$.result.canRefresh') = 'true'
-     '''
-    return spark.query(sql)[0].collect()[0]
-def get_ggl_res(dt):
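-    # Join the google-crawler interface log to the manual/auto refresh trace ids, count the
-    # HTTP 200 responses per trigger type, and count the google-sourced contacts in ctc_main_pre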
-    sql = f'''WITH MANUAL AS
-  (SELECT DISTINCT GET_JSON_OBJECT(ori_json, '$.params.traceId') trace_id
-   FROM ent_raw.interface_base
-   WHERE topic = 'ent_tendata_interface'
-     AND dt = '{dt}'
-     AND get_json_object(ori_json, '$.source') = 'BING'
-     AND get_json_object(ori_json, '$.type') = 'MANUAL_REFRESH'
-     AND get_json_object(ori_json, '$.result.canRefresh') = 'true'), auto AS
-  (SELECT DISTINCT GET_JSON_OBJECT(ori_json, '$.params.traceId') trace_id
-   FROM ent_raw.interface_base
-   WHERE topic = 'ent_tendata_interface'
-     AND dt = '{dt}'
-     AND get_json_object(ori_json, '$.source') = 'BING'
-     AND get_json_object(ori_json, '$.type') = 'AUTO_REFRESH'
-     AND get_json_object(ori_json, '$.result.canRefresh') = 'true'),
-                                                                     ods AS
-  (SELECT GET_JSON_OBJECT(ori_json, '$.params.traceId') trace_id,
-          GET_JSON_OBJECT(ori_json, '$.result.status_code') res_code
-   FROM ent_raw.interface_base
-   WHERE topic = 'ctc_google_interface'
-     AND dt = '{dt}'),
-                                                                     manual_res AS
-  (SELECT 'manual',
-          sum(if(ods.res_code = '200',1,0))  as cnt
-   FROM ods
-   JOIN MANUAL ON ods.trace_id = manual.trace_id),
-                                                                     auto_res AS
-  (SELECT 'auto',
-          sum(if(ods.res_code = '200',1,0))  as cnt
-   FROM ods
-   JOIN auto ON ods.trace_id = auto.trace_id),
-                                                                     ctc_cnt AS
-  (SELECT 'ctc_cnt',
-          count(1) as cnt
-   FROM ctc_mid.ctc_main_pre
-   WHERE dt = '{dt}'
-     AND SOURCE LIKE '%google%' )
-SELECT *
-FROM manual_res
-UNION ALL
-SELECT *
-FROM auto_res
-UNION ALL
-SELECT *
-FROM ctc_cnt
-    '''
-    row = spark.query(sql)[0].collect()
-    return row
-
-
-def get_manual_base(dt):
-    sql = f'''
-    select count(1)                                                                      as cnt,
-       count(distinct GET_JSON_OBJECT(ori_json, '$.params.userId'))                  as user_cnt,
-       nvl(sum(if(get_json_object(ori_json, '$.result.data.website') is not null, 1, 0)),0) as web_cnt
-from ent_raw.interface_base
-where topic = 'ent_tendata_interface'
-  and dt = '{dt}'
-  and get_json_object(ori_json, '$.source') = 'BING'
-  and get_json_object(ori_json, '$.type') = 'MANUAL'
-  '''
-    row = spark.query(sql)[0].collect()[0]
-    return row
-
-
-def get_auto_base(dt):
-    sql = f'''
-    select count(1)                                                                      as cnt,
-       count(distinct GET_JSON_OBJECT(ori_json, '$.params.userId'))                  as user_cnt,
-       sum(if(get_json_object(ori_json, '$.result.data.website') is not null, 1, 0)) as web_cnt
-from ent_raw.interface_base
-where topic = 'ent_tendata_interface'
-  and dt = '{dt}'
-  and get_json_object(ori_json, '$.source') = 'BING'
-  and get_json_object(ori_json, '$.type') = 'AUTO'
-  '''
-    row = spark.query(sql)[0].collect()[0]
-    return row
-
-
-def get_manual_cnt(dt):
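-    # Map the bing-returned websites to tids via the clean_website/generate_tid UDFs, then join
-    # ctc_main_pre on tid to count hits and contacts per source (shh / snovio / combined)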
-    sql = f'''
-    with webs as (select distinct get_json_object(ori_json, '$.result.data.website') as website
-              from ent_raw.interface_base
-              where topic = 'ent_tendata_interface'
-                and dt = '{dt}'
-                and get_json_object(ori_json, '$.source') = 'BING'
-                and get_json_object(ori_json, '$.type') = 'MANUAL'
-                and get_json_object(ori_json, '$.result.data.website') is not null),
-     tids as (select website, generate_tid(clean_website(website), 'not_null', null) as tid
-              from webs),
-     pre as (select i.id, i.tid, i.source
-             from ctc_mid.ctc_main_pre i
-                      join tids t on i.tid = t.tid
-             where i.dt = '{dt}'),
-     shh as (select 'shh'               as source,
-                    count(distinct tid) as res_cnt,
-                    count(id)           as ctc_cnt
-             from pre
-             where source like '%shh_%'),
-     snovio as (select 'snovio'            as source,
-                       count(distinct tid) as res_cnt,
-                       count(id)           as ctc_cnt
-                from pre
-                where source like '%snovio%'),
-     all_t as (select 'all'               as source,
-                    count(distinct tid) as res_cnt,
-                    count(id)           as ctc_cnt
-             from pre
-             where source like '%snovio%'
-                or source like '%shh_%')
-select *
-from shh
-union all
-select *
-from snovio
-union all
-select *
-from all_t
-  '''
-    res = spark.query(sql)[0].collect()
-    return res
-
-
-def get_auto_cnt(dt):
-    sql = f'''
-    with webs as (select distinct get_json_object(ori_json, '$.result.data.website') as website
-              from ent_raw.interface_base
-              where topic = 'ent_tendata_interface'
-                and dt = '{dt}'
-                and get_json_object(ori_json, '$.source') = 'BING'
-                and get_json_object(ori_json, '$.type') = 'AUTO'
-                and get_json_object(ori_json, '$.result.data.website') is not null),
-     tids as (select website, generate_tid(clean_website(website), 'not_null', null) as tid
-              from webs),
-     pre as (select i.id, i.tid, i.source
-             from ctc_mid.ctc_main_pre i
-                      join tids t on i.tid = t.tid
-             where i.dt = '{dt}'),
-     shh as (select 'shh'               as source,
-                    count(distinct tid) as res_cnt,
-                    count(id)           as ctc_cnt
-             from pre
-             where source like '%shh_%'),
-     snovio as (select 'snovio'            as source,
-                       count(distinct tid) as res_cnt,
-                       count(id)           as ctc_cnt
-                from pre
-                where source like '%snovio%'),
-     all_t as (select 'all'               as source,
-                    count(distinct tid) as res_cnt,
-                    count(id)           as ctc_cnt
-             from pre
-             where source like '%snovio%'
-                or source like '%shh_%')
-select *
-from shh
-union all
-select *
-from snovio
-union all
-select *
-from all_t
-  '''
-    res = spark.query(sql)[0].collect()
-    return res
-
-
-if __name__ == '__main__':
-    CONFIG, _ = parse_args(sys.argv[1:])
-    dts = CONFIG.get('dt').split(',')
-    for dt in dts:
-        format_dt = f'{dt[:4]}-{dt[4:6]}-{dt[6:]}'
-        shh_company_interface_cnt = get_shh_company_interface_cnt(dt)
-        shh_company_interface_script_cnt = get_shh_company_interface_script_cnt(dt)
-        shh_contact_interface_cnt = get_shh_contact_interface_cnt(dt)
-        shh_contact_interface_script_cnt = get_shh_contact_interface_script_cnt(dt)
-        snv_contact_interface_cnt = get_snv_contact_interface_cnt(dt)
-        snv_contact_interface_script_cnt = get_snv_contact_interface_script_cnt(dt)
-        shh_non_core_interface_cnt = get_shh_non_core_interface_cnt(dt)
-        msg = f'''【接口调用量统计】------------------------------------------
-统计日期: {format_dt}
-1、单接口调用公司信息次数: {shh_company_interface_cnt + shh_company_interface_script_cnt}
-①自然调用次数: {shh_company_interface_cnt}
-②脚本调用次数: {shh_company_interface_script_cnt}
-
-2、单接口调用联系人次数: {shh_contact_interface_cnt + shh_contact_interface_script_cnt}
-①自然调用次数: {shh_contact_interface_cnt}
-②脚本调用次数: {shh_contact_interface_script_cnt}
-
-3、snovio调用联系人次数: {snv_contact_interface_cnt + snv_contact_interface_script_cnt}
-①自然调用次数: {snv_contact_interface_cnt}
-②脚本调用次数: {snv_contact_interface_script_cnt}
-
-4、单接口非核心业务调用次数:{shh_non_core_interface_cnt}
----------------------------------------------------------------'''
-        print(msg)
-        send_dingtalk_notification(msg)
-        ent_user_top_cnt = ent_user_top(dt)
-        msg = f'''### 企业主页接口调用统计top10
-> **统计日期 :  {format_dt}**
-
-{ent_user_top_cnt}
-                '''
-        print(msg)
-        send_dingtalk_markdown(msg)
-
-        manual_base = get_manual_base(dt)
-        auto_base = get_auto_base(dt)
-        manual_cnt = get_manual_cnt(dt)
-        auto_cnt = get_auto_cnt(dt)
-        manual_request_res = get_manual_request_cnt(dt)
-        manual_request_cnt = manual_request_res['manual_request_cnt']
-        manual_user_cnt = manual_request_res['user_cnt']
-        # Guard against zero denominators below
-        # manual_user_cnt = manual_base['user_cnt']
-        manual_cnt_total = manual_base['cnt']
-        manual_web_cnt = manual_base['web_cnt']
-        auto_user_cnt = auto_base['user_cnt']
-        auto_cnt_total = auto_base['cnt']
-        auto_web_cnt = auto_base['web_cnt']
-
-        manual_avg_requests = manual_request_cnt / manual_user_cnt if manual_user_cnt != 0 else 0
-        manual_web_percentage = 100 * manual_web_cnt / manual_cnt_total if manual_cnt_total != 0 else 0
-        manual_single_interface_percentage = 100 * manual_cnt[0]['res_cnt'] / manual_web_cnt if manual_web_cnt != 0 else 0
-        manual_snovio_percentage = 100 * manual_cnt[1]['res_cnt'] / manual_web_cnt if manual_web_cnt != 0 else 0
-        manual_solution_percentage = 100 * manual_cnt[2]['res_cnt'] / manual_cnt_total if manual_cnt_total != 0 else 0
-
-        auto_avg_requests = auto_cnt_total / auto_user_cnt if auto_user_cnt != 0 else 0
-        auto_web_percentage = 100 * auto_web_cnt / auto_cnt_total if auto_cnt_total != 0 else 0
-        auto_single_interface_percentage = 100 * auto_cnt[0]['res_cnt'] / auto_web_cnt if auto_web_cnt != 0 else 0
-        auto_snv_base = auto_web_cnt - auto_cnt[0]['res_cnt']
-        auto_snovio_percentage = 100 * auto_cnt[1]['res_cnt'] / auto_snv_base if auto_snv_base != 0 else 0
-        auto_solution_percentage = 100 * (auto_cnt[0]['res_cnt'] + auto_cnt[1]['res_cnt']) / auto_cnt_total if auto_cnt_total != 0 else 0
-        ggl_res = get_ggl_res(dt)
-        manual_ggl_cnt = ggl_res[0]['cnt']
-        auto_ggl_cnt = ggl_res[1]['cnt']
-        ctc_ggl_cnt = ggl_res[2]['cnt']
-        msg = f'''【手动/自动更新效果统计】------------------------------------------
-统计日期: {format_dt}
-1、手动更新
-①手动更新请求总人数:{manual_user_cnt}人
-②手动更新请求总次数:{manual_request_cnt}次
-③人均请求次数:{manual_avg_requests:.2f}次
-④手动请求bing网址总次数:{manual_cnt_total}次
-⑤bing获取到网址的次数及占比:{manual_web_cnt}次,{manual_web_percentage:.2f}%
-⑥单接口获取到联系人次数及占比:{manual_cnt[0]['res_cnt']}次,{manual_single_interface_percentage:.2f}%
-⑦单接口获取到联系人去重总数:{manual_cnt[0]['ctc_cnt']}
-⑧snovio接口获取到联系人次数及占比:{manual_cnt[1]['res_cnt']}次,{manual_snovio_percentage:.2f}%
-⑨snovio接口获取到联系人去重总数:{manual_cnt[1]['ctc_cnt']}
-⑩当日手动更新获得联系方式的总次数:{manual_cnt[2]['res_cnt']}
-⑪当日手动更新解决问题的百分比:{manual_solution_percentage:.2f}%
-
-2、自动更新
-①自动更新对应的总人数:{auto_user_cnt}人
-②自动更新请求总次数:{auto_cnt_total}次
-③人均对应自动更新次数:{auto_avg_requests:.2f}次
-④bing获取到网址的次数及占比:{auto_web_cnt}次,{auto_web_percentage:.2f}%
-⑤单接口获取到联系人次数及占比:{auto_cnt[0]['res_cnt']}次,{auto_single_interface_percentage:.2f}%
-⑥单接口获取到联系人去重总数:{auto_cnt[0]['ctc_cnt']}
-⑦snovio接口获取到联系人次数及占比:{auto_cnt[1]['res_cnt']}次,{auto_snovio_percentage:.2f}%
-⑧snovio接口获取到联系人去重总数:{auto_cnt[1]['ctc_cnt']}
-⑨当日自动更新获得联系方式的总次数:{auto_cnt[0]['res_cnt'] + auto_cnt[1]['res_cnt']}
-⑩当日自动更新解决问题的百分比:{auto_solution_percentage:.2f}%
-
-3、google补充
-①手动触发google爬虫获取到联系人次数:{manual_ggl_cnt}次
-②自动触发google爬虫获取到联系人次数:{auto_ggl_cnt}次
-③google爬虫获取到联系人去重数:{ctc_ggl_cnt}人
---------------------------------------------------------------- '''
-        print(msg)
-        send_dingtalk_notification(msg)

+ 0 - 132
dw_base/scheduler/ent_interface_dingtalk_call.py

@@ -1,132 +0,0 @@
-import base64
-import hashlib
-import hmac
-import sys
-import re
-import os
-import urllib
-import time
-import requests
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.utils.config_utils import parse_args
-from dw_base.spark.spark_sql import SparkSQL
-import json
-
-spark = SparkSQL(udf_files=['dw_base/spark/udf/contacts/ctc_common.py',
-                            'dw_base/spark/udf/spark_id_generate_udf.py'])
-
-
-
-def send_dingtalk_notification(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    url = 'https://oapi.dingtalk.com/robot/send?access_token=5183dfe1ecbe06261bcac7b45c1a6b5ae101fec67877d74120a6a95c88d1f917'
-    # url = 'https://oapi.dingtalk.com/robot/send?access_token=c4086d8ba377fdade2dff869e71063733095bc718d3bafdfbe8be0966aa050d6'
-    # url = 'https://oapi.dingtalk.com/robot/send?access_token=bee997dbf61e839a17de087830ffef6e864c3109fef62a956703bdfe043b0e10'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-# Number of shh non-core business interface calls
-def get_shh_non_core_interface_cnt(dt):
-    sql = f'''
-SELECT sum(cnt) cnt
-FROM
-  (SELECT count(1) cnt
-   FROM ent_raw.interface_base
-   WHERE topic = "ent_monitor_interface"
-     AND dt = "{dt}"
-     AND GET_JSON_OBJECT(ori_json, "$.type") != "EXPORT"
-     AND GET_JSON_OBJECT(ori_json, "$.source")= 'CONTACT'
-   UNION ALL SELECT count(1) cnt
-   FROM ent_raw.interface_base
-   WHERE topic = "ent_shh_bizr_interface"
-     AND dt = "{dt}"
-     AND GET_JSON_OBJECT(ori_json, "$.type") IN("ROOT",
-                                                "COMPANY_COUNT")
-     AND GET_JSON_OBJECT(ori_json, "$.source")= 'BIZR'
-   UNION ALL SELECT count(1) cnt
-   FROM ent_raw.interface_base
-   WHERE topic = "ent_shh_mecs_interface"
-     AND dt = "{dt}"
-     AND GET_JSON_OBJECT(ori_json, "$.type") IN("CORP",
-                                                "SITE")
-     AND GET_JSON_OBJECT(ori_json, "$.source")= 'MECS'
-      UNION ALL SELECT count(1) cnt
-   FROM ent_raw.interface_base
-   WHERE topic = "ent_shh_interface"
-     AND dt = "{dt}"
-     AND GET_JSON_OBJECT(ori_json, "$.type")= "BIZR"
-     AND GET_JSON_OBJECT(ori_json, "$.source")= "BIZR"
-     )t
-     '''
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_shh_company_interface_cnt(dt):
-    sql = f'select count(1) cnt from ent_ods.ent_shh_api_company_logs where dt = "{dt}" and source != "SCRIPT"'
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_shh_company_interface_script_cnt(dt):
-    sql = f'select count(1) cnt from ent_ods.ent_shh_api_company_logs where dt = "{dt}" and source = "SCRIPT"'
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_shh_contact_interface_cnt(dt):
-    sql = f'select count(1) cnt from ent_raw.interface_base where topic = "ctc_shh_interface" and dt = "{dt}" and GET_JSON_OBJECT(ori_json, "$.source") != "SCRIPT"'
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_shh_contact_interface_script_cnt(dt):
-    sql = f'select count(1) cnt from ent_raw.interface_base where topic = "ctc_shh_interface" and dt = "{dt}" and GET_JSON_OBJECT(ori_json, "$.source") = "SCRIPT"'
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_snv_contact_interface_cnt(dt):
-    sql = f'select count(1) cnt from ent_raw.interface_base where topic = "ctc_snovio_interface" and dt = "{dt}" and GET_JSON_OBJECT(ori_json, "$.source") != "MANUAL_CONSUME" '
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-def get_snv_contact_interface_script_cnt(dt):
-    sql = f'select count(1) cnt from ent_raw.interface_base where topic = "ctc_snovio_interface" and dt = "{dt}" and GET_JSON_OBJECT(ori_json, "$.source") = "MANUAL_CONSUME" '
-    return spark.query(sql)[0].collect()[0]['cnt']
-
-
-if __name__ == '__main__':
-    CONFIG, _ = parse_args(sys.argv[1:])
-    dts = CONFIG.get('dt').split(',')
-    for dt in dts:
-        format_dt = f'{dt[:4]}-{dt[4:6]}-{dt[6:]}'
-        shh_company_interface_cnt = get_shh_company_interface_cnt(dt)
-        shh_company_interface_script_cnt = get_shh_company_interface_script_cnt(dt)
-        shh_contact_interface_cnt = get_shh_contact_interface_cnt(dt)
-        shh_contact_interface_script_cnt = get_shh_contact_interface_script_cnt(dt)
-        snv_contact_interface_cnt = get_snv_contact_interface_cnt(dt)
-        snv_contact_interface_script_cnt = get_snv_contact_interface_script_cnt(dt)
-        shh_non_core_interface_cnt = get_shh_non_core_interface_cnt(dt)
-        msg = f'''【接口调用量统计】------------------------------------------
-统计日期: {format_dt}
-1、单接口调用公司信息次数: {shh_company_interface_cnt + shh_company_interface_script_cnt}
-①自然调用次数: {shh_company_interface_cnt}
-②脚本调用次数: {shh_company_interface_script_cnt}
-
-2、单接口调用联系人次数: {shh_contact_interface_cnt + shh_contact_interface_script_cnt}
-①自然调用次数: {shh_contact_interface_cnt}
-②脚本调用次数: {shh_contact_interface_script_cnt}
-
-3、snovio调用联系人次数: {snv_contact_interface_cnt + snv_contact_interface_script_cnt}
-①自然调用次数: {snv_contact_interface_cnt}
-②脚本调用次数: {snv_contact_interface_script_cnt}
-
-4、单接口非核心业务调用次数:{shh_non_core_interface_cnt}
----------------------------------------------------------------'''
-        print(msg)
-        send_dingtalk_notification(msg)

+ 0 - 141
dw_base/scheduler/ent_interface_dingtalk_top10.py

@@ -1,141 +0,0 @@
-import base64
-import hashlib
-import hmac
-import sys
-import re
-import os
-import urllib
-import time
-import requests
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.utils.config_utils import parse_args
-from dw_base.spark.spark_sql import SparkSQL
-import http.client
-import json
-
-from cryptography.hazmat.primitives.asymmetric import rsa, padding
-from cryptography.hazmat.primitives import serialization
-from base64 import b64encode
-
-# RSA public key used to encrypt user ids for the account service
-public_key_pem = b"""
------BEGIN PUBLIC KEY-----
-MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDSaL/mqfq/30d5w6/05EL4073z
-ZgsomKTDI9wKUyz+ETkGwWzaNQm8BAXk9nJMCPz25fCTPd2BkifrS2KFKK2+e4hU
-pQxs+FQGaSeR8YEBWsCwh8bWaFWgxKuWpPPdfP6Vcnid/pTAsjbnw0KIHT7x83WZ
-qQTu3GUdyXkfyB41CQIDAQAB
------END PUBLIC KEY-----
-"""
-
-
-class UserInfo:
-    """公司名称"""
-    company_name: str
-    """真实名称"""
-    name: str
-    """用户id"""
-    user_id: int
-    """用户名"""
-    username: str
-
-    def __init__(self, company_name: str, name: str, user_id: int, username: str) -> None:
-        self.company_name = company_name
-        self.name = name
-        self.user_id = user_id
-        self.username = username
-
-    def __str__(self) -> str:
-        return (f"UserInfo:\n"
-                f"  Company Name: {self.company_name}\n"
-                f"  Name: {self.name}\n"
-                f"  User ID: {self.user_id}\n"
-                f"  Username: {self.username}")
-
-
-def encrypt_user_id(user_id):
-    public_key = serialization.load_pem_public_key(public_key_pem)
-    encrypted = public_key.encrypt(
-        user_id.encode(),
-        padding.PKCS1v15()
-    )
-    return b64encode(encrypted).decode()
-
-
-def get_user_info(user_id):
-    encrypted_user_id = encrypt_user_id(user_id)
-    conn = http.client.HTTPConnection("192.168.11.6", 18080)
-    payload = json.dumps({
-        "encryptUserId": encrypted_user_id
-    })
-    headers = {
-        'User-Agent': 'Apifox/1.0.0 (https://apifox.com)',
-        'Content-Type': 'application/json'
-    }
-
-    try:
-        conn.request("POST", "/account/personal", payload, headers)
-        res = conn.getresponse()
-        resdata = res.read().decode("utf-8")
-        res_json = json.loads(resdata)
-        user_info = UserInfo(res_json['companyName'], res_json['name'], res_json['userId'], res_json['username'])
-        return user_info
-    except Exception as e:
-        print("Error:", e)
-    finally:
-        conn.close()
-
-
-spark = SparkSQL(udf_files=['dw_base/spark/udf/contacts/ctc_common.py',
-                            'dw_base/spark/udf/spark_id_generate_udf.py'])
-
-
-def get_sign(secret):
-    timestamp = str(round(time.time() * 1000))
-    secret_enc = secret.encode('utf-8')
-    string_to_sign = '{}\n{}'.format(timestamp, secret)
-    string_to_sign_enc = string_to_sign.encode('utf-8')
-    hmac_code = hmac.new(secret_enc, string_to_sign_enc, digestmod=hashlib.sha256).digest()
-    sign = urllib.parse.quote_plus(base64.b64encode(hmac_code))
-    return timestamp, sign
-
-
-def send_dingtalk_markdown(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "markdown",
-        "markdown": {"title": '企业库告警', "text": msg, }
-    }
-    json_data = json.dumps(data)
-    secret = 'SECffb7fe1b4c3aacc7be85d3b03de88fdbf93dfb48fe1c13ea7dba34a84847675e'
-    timestamp, sign = get_sign(secret)
-    url = f'https://oapi.dingtalk.com/robot/send?access_token=ffdb7df856220a925196e911107a4aa259acb2fd1160fee8b11d0c3c800974fc&timestamp={timestamp}&sign={sign}'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-def ent_user_top(dt):
-    sql = (f"select GET_JSON_OBJECT(ori_json, '$.params.userId') as  user ,count(1) as cnt from ent_raw.interface_base "
-           f"where dt='{dt}' and  topic = 'ent_tendata_interface' and GET_JSON_OBJECT(ori_json, '$.type') = 'BRIEF_RESULT' group by GET_JSON_OBJECT(ori_json, '$.params.userId') order by count(1) desc limit 10"
-           )
-    body = ''
-    for row in spark.query(sql)[0].collect():
-        userid = row.user
-        user_info = get_user_info(userid)
-        body += f'{user_info.username},{user_info.name},{user_info.company_name},**{row.cnt}**次 \n\n'
-    return body
-
-if __name__ == '__main__':
-    CONFIG, _ = parse_args(sys.argv[1:])
-    dts = CONFIG.get('dt').split(',')
-    for dt in dts:
-        format_dt = f'{dt[:4]}-{dt[4:6]}-{dt[6:]}'
-        ent_user_top_cnt = ent_user_top(dt)
-        msg = f'''### 企业主页接口调用统计top10
-> **统计日期 :  {format_dt}**
-
-{ent_user_top_cnt}
-                '''
-        print(msg)
-        send_dingtalk_markdown(msg)

+ 0 - 242
dw_base/scheduler/ent_interface_dingtalk_update.py

@@ -1,242 +0,0 @@
-import sys
-import re
-import os
-import requests
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.utils.config_utils import parse_args
-from dw_base.spark.spark_sql import SparkSQL
-import json
-
-spark = SparkSQL(udf_files=['dw_base/spark/udf/contacts/ctc_common.py',
-                            'dw_base/spark/udf/spark_id_generate_udf.py'],
-                 extra_spark_config={'spark.sql.crossJoin.enabled': True})
-
-
-def send_dingtalk_notification(msg):
-    headers = {"Content-Type": "application/json"}
-    data = {
-        "msgtype": "text",
-        "text": {"content": msg}
-    }
-    json_data = json.dumps(data)
-    # Enterprise-library data product line group
-    url = 'https://oapi.dingtalk.com/robot/send?access_token=c4086d8ba377fdade2dff869e71063733095bc718d3bafdfbe8be0966aa050d6'
-    # Enterprise-library management group
-    # url = 'https://oapi.dingtalk.com/robot/send?access_token=5183dfe1ecbe06261bcac7b45c1a6b5ae101fec67877d74120a6a95c88d1f917'
-    # url = 'https://oapi.dingtalk.com/robot/send?access_token=c4086d8ba377fdade2dff869e71063733095bc718d3bafdfbe8be0966aa050d6'
-    # Enterprise & contacts robot test group
-    # url = 'https://oapi.dingtalk.com/robot/send?access_token=bee997dbf61e839a17de087830ffef6e864c3109fef62a956703bdfe043b0e10'
-    response = requests.post(url=url, data=json_data, headers=headers)
-    response.raise_for_status()
-
-
-def get_base_cnt(dt, trigger_type):
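-    # Union the shh / snovio / google interface logs and count distinct users and trace ids
-    # for the given trigger type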
-    sql = f'''
-SELECT count(DISTINCT user_id) as user_cnt, count(distinct trace_id) as trace_cnt
-FROM (SELECT user_id, trace_id
-      FROM ctc_ods.ctc_shh_interface_log
-      WHERE dt = '{dt}'
-        AND trigger_type = '{trigger_type}'
-      UNION ALL
-      SELECT user_id, trace_id
-      FROM ctc_ods.ctc_snv_interface_log
-      WHERE dt = '{dt}'
-        AND trigger_type = '{trigger_type}'
-      UNION ALL
-      SELECT user_id, trace_id
-      FROM ctc_ods.ctc_google_interface_log
-      WHERE dt = '{dt}'
-        AND trigger_type = '{trigger_type}') t
-    '''
-    return spark.query(sql)[0].collect()[0]
-
-
-def get_web_cnt(dt, trigger_type):
-    sql = f'''
-select count(1)                                                                              as request_web_cnt,
-       nvl(sum(if(get_json_object(ori_json, '$.result.data.website') is not null, 1, 0)), 0) as get_web_cnt
-from ent_raw.interface_base
-where topic = 'ent_tendata_interface'
-  and dt = '{dt}'
-  and get_json_object(ori_json, '$.source') = 'BING'
-  and get_json_object(ori_json, '$.type') = '{trigger_type}'
-    '''
-    return spark.query(sql)[0].collect()[0]
-
-
-def get_auto_user_cnt(dt):
-    sql = f'''
-      SELECT 
-       count(DISTINCT get_json_object(ori_json, '$.params.userId')) AS request_user_cnt
-FROM ent_raw.interface_base
-WHERE topic = 'ent_tendata_interface'
-  AND dt = '{dt}'
-  AND get_json_object(ori_json, '$.source') = 'BING'
-  AND get_json_object(ori_json, '$.type') = 'AUTO'
-    '''
-    return spark.query(sql)[0].collect()[0]['request_user_cnt']
-
-
-def get_auto_source_cnt(dt):
-    sql = f'''
-    SELECT *
-from (select count(distinct trace_id) as shh_cnt
-      FROM ctc_ods.ctc_shh_interface_log
-      WHERE dt = '{dt}'
-        AND trigger_type = 'AUTO') shh
-         join (SELECT count(distinct trace_id) as snv_cnt
-               FROM ctc_ods.ctc_snv_interface_log
-               WHERE dt = '{dt}'
-                 AND trigger_type = 'AUTO') snv
-         join (SELECT count(distinct trace_id) as ggl_cnt
-               FROM ctc_ods.ctc_google_interface_log
-               WHERE dt = '{dt}'
-                 AND trigger_type = 'AUTO') ggl
-        '''
-    return spark.query(sql)[0].collect()[0]
-
-def get_res_cnt(dt, trigger_type):
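-    # Explode the trace_id array on ctc_main_pre and, per trace, flag whether shh / snovio /
-    # google contributed a contact; the sums give per-source and overall hit counts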
-    sql = f'''
-with init as (select ti,
-                     if(source like '%shh_%', 1, 0)   as shh_flag,
-                     if(source like '%snovio%', 1, 0) as snv_flag,
-                     if(source like '%google%', 1, 0) as ggl_flag
-              from ctc_mid.ctc_main_pre
-                       LATERAL VIEW explode(trace_id) exploded_table1 AS ti
-              where dt = '{dt}'
-                and array_contains(trigger_type, '{trigger_type}')),
-     flag as (select ti
-                   , if(sum(shh_flag) > 0, 1, 0) as shh_get_flag
-                   , if(sum(snv_flag) > 0, 1, 0) as snv_get_flag
-                   , if(sum(ggl_flag) > 0, 1, 0) as ggl_get_flag
-              from init
-              group by ti)
-select nvl(sum(shh_get_flag),0) as shh_get_cnt
-     , nvl(sum(snv_get_flag),0) as snv_get_cnt
-     , nvl(sum(ggl_get_flag),0) as ggl_get_cnt
-     , count(ti)         as all_get_cnt
-from flag
-        '''
-    return spark.query(sql)[0].collect()[0]
-
-
-def get_ctc_cnt(dt, trigger_type):
-    sql = f'''
-select nvl(sum(if(source like '%shh_%', 1, 0)), 0)   as shh_ctc_cnt,
-       nvl(sum(if(source like '%snovio%', 1, 0)), 0) as snv_ctc_cnt,
-       nvl(sum(if(source like '%google%', 1, 0)), 0) as ggl_ctc_cnt
-from ctc_mid.ctc_main_pre
-where dt = '{dt}'
-  and array_contains(trigger_type, '{trigger_type}')
-            '''
-    return spark.query(sql)[0].collect()[0]
-
-
-if __name__ == '__main__':
-    CONFIG, _ = parse_args(sys.argv[1:])
-    dts = CONFIG.get('dt').split(',')
-    for dt in dts:
-        format_dt = f'{dt[:4]}-{dt[4:6]}-{dt[6:]}'
-
-        manual_base_cnt = get_base_cnt(dt, 'MANUAL')
-        manual_web_cnt = get_web_cnt(dt, 'MANUAL')
-        manual_res_cnt = get_res_cnt(dt, 'MANUAL')
-        manual_ctc_cnt = get_ctc_cnt(dt, 'MANUAL')
-
-        manual_user_cnt = manual_base_cnt['user_cnt']
-        manual_trace_cnt = manual_base_cnt['trace_cnt']
-        manual_trace_avg = manual_trace_cnt / manual_user_cnt if manual_user_cnt > 0 else 0
-
-        manual_web_request_cnt = manual_web_cnt['request_web_cnt']
-        manual_web_get_cnt = manual_web_cnt['get_web_cnt']
-        manual_web_get_pct = 100 * manual_web_get_cnt / manual_web_request_cnt if manual_web_request_cnt > 0 else 0
-
-        manual_shh_get_cnt = manual_res_cnt['shh_get_cnt']
-        manual_shh_get_pct = 100 * manual_shh_get_cnt / manual_trace_cnt if manual_trace_cnt > 0 else 0
-        manual_snv_get_cnt = manual_res_cnt['snv_get_cnt']
-        manual_snv_get_pct = 100 * manual_snv_get_cnt / manual_trace_cnt if manual_trace_cnt > 0 else 0
-        manual_ggl_get_cnt = manual_res_cnt['ggl_get_cnt']
-        manual_ggl_get_pct = 100 * manual_ggl_get_cnt / manual_trace_cnt if manual_trace_cnt > 0 else 0
-        manual_all_get_cnt = manual_res_cnt['all_get_cnt']
-        manual_all_get_pct = 100 * manual_all_get_cnt / manual_trace_cnt if manual_trace_cnt > 0 else 0
-
-        manual_ctc_shh_cnt = manual_ctc_cnt['shh_ctc_cnt']
-        manual_ctc_snv_cnt = manual_ctc_cnt['snv_ctc_cnt']
-        manual_ctc_ggl_cnt = manual_ctc_cnt['ggl_ctc_cnt']
-        ############################################################
-        auto_base_cnt = get_base_cnt(dt, 'AUTO')
-        auto_web_cnt = get_web_cnt(dt, 'AUTO')
-        auto_res_cnt = get_res_cnt(dt, 'AUTO')
-        auto_ctc_cnt = get_ctc_cnt(dt, 'AUTO')
-
-        auto_user_cnt = auto_base_cnt['user_cnt']
-        auto_trace_cnt = auto_base_cnt['trace_cnt']
-        auto_trace_avg = auto_trace_cnt / auto_user_cnt if auto_user_cnt > 0 else 0
-
-        auto_web_request_cnt = auto_web_cnt['request_web_cnt']
-        auto_web_get_cnt = auto_web_cnt['get_web_cnt']
-        auto_web_get_pct = 100 * auto_web_get_cnt / auto_web_request_cnt if auto_web_request_cnt > 0 else 0
-
-
-        auto_source_cnt = get_auto_source_cnt(dt)
-        auto_request_shh_cnt = auto_source_cnt['shh_cnt']
-        auto_request_snv_cnt = auto_source_cnt['snv_cnt']
-        auto_request_ggl_cnt = auto_source_cnt['ggl_cnt']
-
-        auto_shh_get_cnt = auto_res_cnt['shh_get_cnt']
-        auto_shh_get_pct = 100 * auto_shh_get_cnt / auto_request_shh_cnt if auto_request_shh_cnt > 0 else 0
-        auto_snv_get_cnt = auto_res_cnt['snv_get_cnt']
-        auto_snv_get_pct = 100 * auto_snv_get_cnt / auto_request_snv_cnt if auto_request_snv_cnt > 0 else 0
-        auto_ggl_get_cnt = auto_res_cnt['ggl_get_cnt']
-        auto_ggl_get_pct = 100 * auto_ggl_get_cnt / auto_request_ggl_cnt if auto_request_ggl_cnt > 0 else 0
-        auto_all_get_cnt = auto_res_cnt['all_get_cnt']
-        auto_all_get_pct = 100 * auto_all_get_cnt / auto_trace_cnt if auto_trace_cnt > 0 else 0
-
-        auto_ctc_shh_cnt = auto_ctc_cnt['shh_ctc_cnt']
-        auto_ctc_snv_cnt = auto_ctc_cnt['snv_ctc_cnt']
-        auto_ctc_ggl_cnt = auto_ctc_cnt['ggl_ctc_cnt']
-
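-        # Note: this overwrites the earlier interface-log user count with the bing AUTO request
-        # user count that is used in the message below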
-        auto_user_cnt = get_auto_user_cnt(dt)
-
-        msg = f'''【手动/自动更新效果统计】------------------------------------------
-统计日期: {format_dt}
-1、手动更新
-①手动更新请求总人数:{manual_user_cnt}人
-②手动更新请求总次数:{manual_trace_cnt}次
-③人均请求次数:{manual_trace_avg:.2f}次
-④手动请求bing网址总次数:{manual_web_request_cnt}次
-⑤bing获取到网址的次数及占比:{manual_web_get_cnt}次,{manual_web_get_pct:.2f}%
-⑥单接口获取到联系人次数及占比:{manual_shh_get_cnt}次,{manual_shh_get_pct:.2f}%
-⑦单接口获取到联系人去重总数:{manual_ctc_shh_cnt}
-⑧snovio接口获取到联系人次数及占比:{manual_snv_get_cnt}次,{manual_snv_get_pct:.2f}%
-⑨snovio接口获取到联系人去重总数:{manual_ctc_snv_cnt}
-⑩google爬虫获取到联系人次数及占比:{manual_ggl_get_cnt}次,{manual_ggl_get_pct:.2f}%
-⑪google爬虫获取到联系人去重总数:{manual_ctc_ggl_cnt}
-⑫当日手动更新获得联系方式的总次数:{manual_all_get_cnt}  
-⑬当日手动更新解决联系人问题的百分比:{manual_all_get_pct:.2f}%
-
-2、自动更新
-① 自动更新请求总人数:{auto_user_cnt}人
-② 自动更新请求总次数:{auto_web_request_cnt}次
-③ 人均请求次数:{auto_trace_avg:.2f}次
-④ 自动请求bing网址总次数:{auto_web_request_cnt}次
-⑤ bing获取到网址的次数及占比:{auto_web_get_cnt}次,{auto_web_get_pct:.2f}%
-⑥ 自动请求单接口的总次数:{auto_request_shh_cnt} 次
-⑦ 单接口获取到联系人次数及占比:{auto_shh_get_cnt}次,{auto_shh_get_pct:.2f}%
-⑧ 单接口获取到联系人去重总数:{auto_ctc_shh_cnt}
-⑨ 自动请求snovio接口的总次数:{auto_request_snv_cnt} 次
-⑩ snovio接口获取到联系人次数及占比:{auto_snv_get_cnt}次,{auto_snv_get_pct:.2f}%
-⑪ snovio接口获取到联系人去重总数:{auto_ctc_snv_cnt}
-⑫ 自动请求google爬虫的总次数:{auto_request_ggl_cnt} 次
-⑬ google爬虫获取到联系人次数及占比:{auto_ggl_get_cnt}次,{auto_ggl_get_pct:.2f}%
-⑭ google爬虫获取到联系人去重总数:{auto_ctc_ggl_cnt}
-⑮ 当日自动更新请求联系方式的总次数:{auto_trace_cnt}
-⑯ 当日自动更新获得联系方式的总次数:{auto_all_get_cnt}
-⑰ 当日自动更新解决联系人问题的百分比:{auto_all_get_pct:.2f}%
---------------------------------------------------------------- '''
-        print(msg)
-        send_dingtalk_notification(msg)

+ 0 - 185
dw_base/scheduler/get_oldmongo_cjfs.py

@@ -1,185 +0,0 @@
-import argparse
-import sys
-import re
-import os
-from pyhive import hive
-import pandas as pd
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.utils.log_utils import pretty_print
-from configparser import ConfigParser
-from pymongo import MongoClient
-from dw_base import *
-from dw_base.scheduler.polling_scheduler import get_mongo_client
-
-
-
-# List of source mongo databases to scan
-my_array = [
-    'america_stat',
-    'australia',
-    'brazil',
-    'brazil_stat',
-    'canada',
-    'canada_stat',
-    'china_stat',
-    'cis',
-    'dominica',
-    'england',
-    'ethiopia',
-    'eurasian_bol',
-    'european_union',
-    'fiji',
-    'guatemala',
-    'honduras',
-    'honduras_stat',
-    'hongkong_stat',
-    'indonesia_stat',
-    'japan',
-    'kyrghyzstan',
-    'new_zealand',
-    'nicaragua',
-    'peru_exp',
-    'philippines_stat',
-    'russia_rail',
-    'salvador',
-    'salvador_stat',
-    'south_africa_stat',
-    'south_korea',
-    'south_korea_stat',
-    'spain',
-    'taiwan',
-    'thailand',
-    'thailand_stat',
-    'turkey_stat',
-    'zimbabwe',
-    'taiwan_stat',
-    'tanzania',
-    'tanzania_tboe',
-    'bolivia_stat',
-    'spain_co',
-    'congo_kinshasa',
-    'south_korea_co',
-    'england_stat',
-    'angola_stat',
-    'guatemala_stat',
-    'brazil_air',
-    'egypt_co',
-    'uruguay_nboe',
-    'panama_exp',
-    'bahrain_stat',
-    'dominican_republic_stat',
-    'qatar_stat'
-]
-
-
-def parse_arguments():
-    # Create the ArgumentParser
-    parser = argparse.ArgumentParser(description='Process some parameters.')
-
-    # Add arguments
-    parser.add_argument('-mgdb', dest='mgdb', required=True, help='Parameter 1')
-
-    # Parse and return
-    return parser.parse_args()
-
-def get_mongo_client(conf_path):
-    config_parser = ConfigParser()
-    config_parser.read(root_path + conf_path)
-    url = config_parser.get('base', 'address')
-    return MongoClient(url)
-
-
-
-def get_count(client, mgdb):
-    # Select the database
-    db = client[mgdb]
-    # Select the import and export collections
-    collection1 = db['shipments_imports']
-    collection2 = db['shipments_exports']
-
-    # Group and count with an aggregation pipeline
-    pipeline = [
-        {
-            "$group": {
-                "_id": "$cjfs",  # group by the cjfs field
-                "count": {"$sum": 1}  # number of documents per group
-            }
-        }
-    ]
-
-    # Run the aggregation against both collections
-    results1 = list(collection1.aggregate(pipeline))
-    results2 = list(collection2.aggregate(pipeline))
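-    # Each aggregation result has the shape {'_id': <cjfs value>, 'count': <doc count>}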
-
-    pretty_print(f'开始合并结果-------------------------------------------------------------------------')
-
-    # Merge the two result sets
-    combined_results = list(results1) + list(results2)
-
-
-    # combined_results is a list of dicts
-    # Convert the results to DataFrames
-    df = pd.DataFrame(combined_results)
-    df1 = pd.DataFrame(results1)
-    df2 = pd.DataFrame(results2)
-
-    # Connect to Hive
-    hive_conn = hive.Connection(host='192.168.30.3', port=10000, username='hive', database='dim')
-
-    # Write to the Hive table
-    cursor = hive_conn.cursor()
-
-    pretty_print(f'开始插入结果-------------------------------------------------------------------------')
-
-    # Insert the import-side rows
-    for index, row in df1.iterrows():
-        insert_query = f"""
-        INSERT INTO dim.cts_cjfs_global_old (cjfs, cnt, gj, jck)
-        VALUES ('{row['_id']}' , '{row['count']}','{mgdb}', 'im')
-        """
-        pretty_print(f'{insert_query}')
-        cursor.execute(insert_query)
-    # Insert the export-side rows
-    for index, row in df2.iterrows():
-        insert_query = f"""
-        INSERT INTO dim.cts_cjfs_global_old (cjfs, cnt, gj, jck)
-        VALUES ('{row['_id']}' , '{row['count']}','{mgdb}', 'ex')
-        """
-        cursor.execute(insert_query)
-
-    # Close the connections
-    cursor.close()
-    hive_conn.close()
-
-    jgj = ('----------------------'+
-           '\n结果1-->' + str(results1) +
-           '结果1end\n结果2-->' + str(results2) +
-           '结果2end\n合并后结果-->'+str(combined_results)+
-           '\n----------------------'
-    )
-    pretty_print(f'{jgj}')
-    return jgj
-
-def get_old_count(client,mgdb):
-    result = get_count(client, mgdb)
-    pretty_print(f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}')
-    return result
-
-
-def main():
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-prod-old.ini')
-    pretty_print(f'开始循环调用-------------------------------------------------------------------------')
-    pretty_print(f'{my_array}')
-    # Iterate over the databases and collect counts for each
-    for item in my_array:
-        pretty_print(f'开始执行:{item}')
-        get_old_count(client,item)
-    client.close()
-    return 0
-
-if __name__ == '__main__':
-    main()

+ 0 - 185
dw_base/scheduler/get_oldmongo_sldw.py

@@ -1,185 +0,0 @@
-import argparse
-import sys
-import re
-import os
-from pyhive import hive
-import pandas as pd
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.utils.log_utils import pretty_print
-from configparser import ConfigParser
-from pymongo import MongoClient
-from dw_base import *
-from dw_base.scheduler.polling_scheduler import get_mongo_client
-
-
-
-# List of source mongo databases to scan
-my_array = [
-    'america_stat',
-    'australia',
-    'brazil',
-    'brazil_stat',
-    'canada',
-    'canada_stat',
-    'china_stat',
-    'cis',
-    'dominica',
-    'england',
-    'ethiopia',
-    'eurasian_bol',
-    'european_union',
-    'fiji',
-    'guatemala',
-    'honduras',
-    'honduras_stat',
-    'hongkong_stat',
-    'indonesia_stat',
-    'japan',
-    'kyrghyzstan',
-    'new_zealand',
-    'nicaragua',
-    'peru_exp',
-    'philippines_stat',
-    'russia_rail',
-    'salvador',
-    'salvador_stat',
-    'south_africa_stat',
-    'south_korea',
-    'south_korea_stat',
-    'spain',
-    'taiwan',
-    'thailand',
-    'thailand_stat',
-    'turkey_stat',
-    'zimbabwe',
-    'taiwan_stat',
-    'tanzania',
-    'tanzania_tboe',
-    'bolivia_stat',
-    'spain_co',
-    'congo_kinshasa',
-    'south_korea_co',
-    'england_stat',
-    'angola_stat',
-    'guatemala_stat',
-    'brazil_air',
-    'egypt_co',
-    'uruguay_nboe',
-    'panama_exp',
-    'bahrain_stat',
-    'dominican_republic_stat',
-    'qatar_stat'
-]
-
-
-def parse_arguments():
-    # 创建 ArgumentParser 对象
-    parser = argparse.ArgumentParser(description='Process some parameters.')
-
-    # 添加参数
-    parser.add_argument('-mgdb', dest='mgdb', required=True, help='Parameter 1')
-
-    # 解析参数
-    return parser.parse_args()
-
-def get_mongo_client(conf_path):
-    config_parser = ConfigParser()
-    config_parser.read(root_path + conf_path)
-    url = config_parser.get('base', 'address')
-    return MongoClient(url)
-
-
-
-def get_count(client, mgdb):
-    # 选择数据库
-    db = client[mgdb]
-    # 选择集合
-    collection1 = db['shipments_imports']
-    collection2 = db['shipments_exports']
-
-    # 使用聚合管道进行分组和计数
-    pipeline = [
-        {
-            "$group": {
-                "_id": "$sldw",  # 按sldw字段分组
-                "count": {"$sum": 1}  # 计算每个组的数量
-            }
-        }
-    ]
-
-    # 执行聚合查询
-    results1 = list(collection1.aggregate(pipeline))
-    results2 = list(collection2.aggregate(pipeline))
-
-    pretty_print(f'开始合并结果-------------------------------------------------------------------------')
-
-    # 合并结果
-    combined_results = list(results1) + list(results2)
-
-
-    # 假设 combined_results 是一个字典列表
-    # 将结果转换为 DataFrame
-    df = pd.DataFrame(combined_results)
-    df1 = pd.DataFrame(results1)
-    df2 = pd.DataFrame(results2)
-
-    # 连接到 Hive
-    hive_conn = hive.Connection(host='192.168.30.3', port=10000, username='hive', database='dim')
-
-    # 写入 Hive 表
-    cursor = hive_conn.cursor()
-
-    pretty_print(f'开始插入结果-------------------------------------------------------------------------')
-
-    # 插入数据
-    for index, row in df1.iterrows():
-        insert_query = f"""
-        INSERT INTO dim.cts_sldw_global_old (sldw, cnt, gj, jck)
-        VALUES ('{row['_id']}' , '{row['count']}','{mgdb}', 'im')
-        """
-        pretty_print(f'{insert_query}')
-        cursor.execute(insert_query)
-    # 插入数据
-    for index, row in df2.iterrows():
-        insert_query = f"""
-        INSERT INTO dim.cts_sldw_global_old (sldw, cnt, gj, jck)
-        VALUES ('{row['_id']}' , '{row['count']}','{mgdb}', 'ex')
-        """
-        cursor.execute(insert_query)
-
-    # 关闭连接
-    cursor.close()
-    hive_conn.close()
-
-    jgj = ('----------------------'+
-           '\n结果1-->' + str(results1) +
-           '结果1end\n结果2-->' + str(results2) +
-           '结果2end\n合并后结果-->'+str(combined_results)+
-           '\n----------------------'
-    )
-    pretty_print(f'{jgj}')
-    return jgj
-
-def get_old_count(client,mgdb):
-    result = get_count(client, mgdb)
-    pretty_print(f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}')
-    return result
-
-
-def main():
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-prod-old.ini')
-    pretty_print(f'开始循环调用-------------------------------------------------------------------------')
-    pretty_print(f'{my_array}')
-    # 使用for循环遍历数组,并调用函数
-    for item in my_array:
-        pretty_print(f'开始执行:{item}')
-        get_old_count(client,item)
-    client.close()
-    return 0
-
-if __name__ == '__main__':
-    main()

+ 0 - 90
dw_base/scheduler/get_oldmongo_sldw_detail.py

@@ -1,90 +0,0 @@
-import argparse
-import sys
-import re
-import os
-from pyhive import hive
-import pandas as pd
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.utils.log_utils import pretty_print
-from configparser import ConfigParser
-from pymongo import MongoClient
-from dw_base import *
-from dw_base.scheduler.polling_scheduler import get_mongo_client
-
-
-
-# 定义一个数组(列表)
-my_array = [
-    'japan'
-]
-
-
-def parse_arguments():
-    # 创建 ArgumentParser 对象
-    parser = argparse.ArgumentParser(description='Process some parameters.')
-
-    # 添加参数
-    parser.add_argument('-mgdb', dest='mgdb', required=True, help='Parameter 1')
-
-    # 解析参数
-    return parser.parse_args()
-
-def get_mongo_client(conf_path):
-    config_parser = ConfigParser()
-    config_parser.read(root_path + conf_path)
-    url = config_parser.get('base', 'address')
-    return MongoClient(url)
-
-
-
-def get_count(client, mgdb):
-    # 选择数据库
-    db = client[mgdb]
-    # 选择集合
-    # collection1 = db['shipments_imports']
-    collection2 = db['shipments_exports']
-
-    # 使用聚合管道进行分组和计数
-    pipeline = [
-        {
-            "$group": {
-                "_id": "$sldw",  # 按sldw字段分组
-                "count": {"$sum": 1} , # 计算每个组的数量
-                "maxid": {"$max": "$_id"} # 计算每个组的最大id值
-            }
-        }
-    ]
-
-    # 执行聚合查询
-    # results1 = list(collection1.aggregate(pipeline))
-    results2 = list(collection2.aggregate(pipeline))
-
-    pretty_print(f'开始合并结果-------------------------------------------------------------------------')
-
-    # 合并结果
-    combined_results =  list(results2)
-    pretty_print(f'结果-------------------------------------------------------------------------{combined_results}')
-
-def get_old_count(client,mgdb):
-    result = get_count(client, mgdb)
-    pretty_print(f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}')
-    return result
-
-
-def main():
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-prod-old.ini')
-    pretty_print(f'开始循环调用-------------------------------------------------------------------------')
-    pretty_print(f'{my_array}')
-    # 使用for循环遍历数组,并调用函数
-    for item in my_array:
-        pretty_print(f'开始执行:{item}')
-        get_old_count(client,item)
-    client.close()
-    return 0
-
-if __name__ == '__main__':
-    main()

+ 0 - 102
dw_base/scheduler/get_oldmongo_stat.py

@@ -1,102 +0,0 @@
-# Count totals and distinct company names (jksmc / cksmc) in an old CTS MongoDB source database
-import argparse
-import sys
-import re
-import os
-import requests
-import json
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.log_utils import pretty_print
-from configparser import ConfigParser
-from datetime import time
-from pymongo import MongoClient
-from dw_base import *
-from dw_base.scheduler.polling_scheduler import get_mongo_client
-from dw_base.utils.config_utils import parse_args
-from dw_base.scheduler.mg2es.conf_reader import ConfReader
-from dw_base.scheduler.mg2es.es_operator import ESOperator
-from elasticsearch.exceptions import NotFoundError
-
-
-# sql = "SELECT mgdb, mgtbl_name FROM tmp.tmp_zjh_1011"
-# spark = SparkSQL()
-# res = spark.query(sql)[0].collect()
-
-def parse_arguments():
-    # 创建 ArgumentParser 对象
-    parser = argparse.ArgumentParser(description='Process some parameters.')
-
-    # 添加参数
-    parser.add_argument('-mgdb', dest='mgdb', required=True, help='Parameter 1')
-
-    # 解析参数
-    return parser.parse_args()
-
-def get_mongo_client(conf_path):
-    config_parser = ConfigParser()
-    config_parser.read(root_path + conf_path)
-    url = config_parser.get('base', 'address')
-    return MongoClient(url)
-
-
-
-def get_count(client, mgdb):
-    # 选择数据库
-    db = client[mgdb]
-    # 选择集合
-    collection1 = db['shipments_imports']
-    collection2 = db['shipments_exports']
-    # 根据 mgtbl 确定字段名
-    fields_name1 =  'jksmc'
-    fields_name2 =  'cksmc'
-
-    # # 获取字段名
-    # field_name = fields_name.get(mgtbl)
-    # if field_name is None:
-    #     # 如果集合名不存在,抛出 ValueError 异常
-    #     raise ValueError(f"No field name found for mgtbl: {mgtbl}")
-    # 使用 distinct 方法获取字段的去重后个数
-
-    data1 = collection1.distinct(fields_name1)
-    data2 = collection2.distinct(fields_name2)
-    stat1 = len(data1)
-    stat2 = len(data2)
-    cnt1 = collection1.count()
-    cnt2 = collection2.count()
-    combined_data = len(set(data1 + data2))
-    jgj = ('----------------------'+
-           '\n进口总条数-->'+str(cnt1)+
-           ',\n出口总条数-->'+str(cnt2)+
-           ',\n进口去重企业数-->'+str(stat1)+
-           ',\n出口去重企业数-->'+str(stat2)+
-           ',\n进出口去重企业数-->'+str(combined_data)+
-           '\n----------------------'
-    )
-    pretty_print(f'{jgj}')
-    return jgj
-
-def get_old_count(mgdb):
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-prod-old.ini')
-    result = get_count(client, mgdb)
-    pretty_print(f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}')
-    return result
-
-
-def main():
-    # CONFIG, _ = parse_args(sys.argv[1:])
-    # for record in res:
-    # mgtbl = record['mgtbl_name']
-
-    # 解析命令行参数
-    args = parse_arguments()
-    mgdb = args.mgdb
-    old_cnt = get_old_count(mgdb)
-    return 0
-
-if __name__ == '__main__':
-    main()
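
get_oldmongo_stat.py above relies on Collection.count(), which pymongo deprecated in 3.7 and removed in 4.0. Should this kind of old-source check be needed again, a sketch with the current API (collection and field names are from the removed script; the helper name is invented):

def old_source_stats(db):
    imports, exports = db['shipments_imports'], db['shipments_exports']
    im_firms = imports.distinct('jksmc')   # distinct importer names
    ex_firms = exports.distinct('cksmc')   # distinct exporter names
    return {
        'im_total': imports.count_documents({}),   # count_documents replaces the deprecated count()
        'ex_total': exports.count_documents({}),
        'im_firms': len(im_firms),
        'ex_firms': len(ex_firms),
        'all_firms': len(set(im_firms) | set(ex_firms)),
    }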

+ 0 - 139
dw_base/scheduler/get_oldmongo_ysfs.py

@@ -1,139 +0,0 @@
-# Group-count the ysfs field in an old CTS MongoDB source and load the result into dim.cts_ysfs_global_old
-import argparse
-import sys
-import re
-import os
-import requests
-import json
-from pyhive import hive
-import pandas as pd
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.log_utils import pretty_print
-from configparser import ConfigParser
-from datetime import time
-from pymongo import MongoClient
-from dw_base import *
-from dw_base.scheduler.polling_scheduler import get_mongo_client
-from dw_base.utils.config_utils import parse_args
-from dw_base.scheduler.mg2es.conf_reader import ConfReader
-from dw_base.scheduler.mg2es.es_operator import ESOperator
-from elasticsearch.exceptions import NotFoundError
-
-
-# sql = "SELECT mgdb, mgtbl_name FROM tmp.tmp_zjh_1011"
-# spark = SparkSQL()
-# res = spark.query(sql)[0].collect()
-
-def parse_arguments():
-    # 创建 ArgumentParser 对象
-    parser = argparse.ArgumentParser(description='Process some parameters.')
-
-    # 添加参数
-    parser.add_argument('-mgdb', dest='mgdb', required=True, help='Parameter 1')
-
-    # 解析参数
-    return parser.parse_args()
-
-def get_mongo_client(conf_path):
-    config_parser = ConfigParser()
-    config_parser.read(root_path + conf_path)
-    url = config_parser.get('base', 'address')
-    return MongoClient(url)
-
-
-
-def get_count(client, mgdb):
-    # 选择数据库
-    db = client[mgdb]
-    # 选择集合
-    collection1 = db['shipments_imports']
-    collection2 = db['shipments_exports']
-
-    # 使用聚合管道进行分组和计数
-    pipeline = [
-        {
-            "$group": {
-                "_id": "$ysfs",  # 按ysfs字段分组
-                "count": {"$sum": 1}  # 计算每个组的数量
-            }
-        }
-    ]
-
-    # 执行聚合查询
-    results1 = list(collection1.aggregate(pipeline))
-    results2 = list(collection2.aggregate(pipeline))
-
-    pretty_print(f'开始合并结果-------------------------------------------------------------------------')
-
-    # 合并结果
-    combined_results = list(results1) + list(results2)
-
-
-    # 假设 combined_results 是一个字典列表
-    # 将结果转换为 DataFrame
-    df = pd.DataFrame(combined_results)
-    df1 = pd.DataFrame(results1)
-    df2 = pd.DataFrame(results2)
-
-    # 连接到 Hive
-    hive_conn = hive.Connection(host='192.168.30.3', port=10000, username='hive', database='dim')
-
-    # 写入 Hive 表
-    cursor = hive_conn.cursor()
-
-    pretty_print(f'开始插入结果-------------------------------------------------------------------------')
-
-    # 插入数据
-    for index, row in df1.iterrows():
-        insert_query = f"""
-        INSERT INTO dim.cts_ysfs_global_old (ysfs, cnt, gj, jck)
-        VALUES ('{row['_id']}' , '{row['count']}','{mgdb}', 'im')
-        """
-        pretty_print(f'{insert_query}')
-        cursor.execute(insert_query)
-    # 插入数据
-    for index, row in df2.iterrows():
-        insert_query = f"""
-        INSERT INTO dim.cts_ysfs_global_old (ysfs, cnt, gj, jck)
-        VALUES ('{row['_id']}' , '{row['count']}','{mgdb}', 'ex')
-        """
-        cursor.execute(insert_query)
-
-    # 关闭连接
-    cursor.close()
-    hive_conn.close()
-
-    jgj = ('----------------------'+
-           '\n结果1-->' + str(results1) +
-           '结果1end\n结果2-->' + str(results2) +
-           '结果2end\n合并后结果-->'+str(combined_results)+
-           '\n----------------------'
-    )
-    pretty_print(f'{jgj}')
-    return jgj
-
-def get_old_count(mgdb):
-    client = get_mongo_client('/../datasource/mongo/mongo-cts-prod-old.ini')
-    result = get_count(client, mgdb)
-    pretty_print(f'{NORM_MGT} old source mongo: {NORM_GRN}{mgdb} '
-                 f'{NORM_MGT} old data count: {NORM_GRN}')
-    return result
-
-
-def main():
-    # CONFIG, _ = parse_args(sys.argv[1:])
-    # for record in res:
-    # mgtbl = record['mgtbl_name']
-
-    # 解析命令行参数
-    args = parse_arguments()
-    mgdb = args.mgdb
-    old_cnt = get_old_count(mgdb)
-    return 0
-
-if __name__ == '__main__':
-    main()

+ 0 - 0
dw_base/scheduler/mg2es/__init__.py


+ 0 - 53
dw_base/scheduler/mg2es/conf_reader.py

@@ -1,53 +0,0 @@
-import sys
-import os
-import re
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-import json
-from configparser import ConfigParser
-
-import yaml
-
-from dw_base.scheduler.mg2es.path_util import PathUtil
-
-
-class ConfReader():
-
-    def get_yml_data(self, yml_file_path):
-        with open(yml_file_path, 'r') as file:
-            data = yaml.safe_load(file)
-        return data
-
-    def get_json_data(self, json_file_path):
-        with open(json_file_path) as f:
-            data = json.load(f)
-        return data
-
-    def get_es_conf(self):
-        path = PathUtil.get_es_conn_path()
-        print(path)
-        config_parser = ConfigParser()
-        config_parser.read(path)
-        host = config_parser.get('base', 'host')
-        port = int(config_parser.get('base', 'port'))
-        return host, port
-
-    def get_redis_conf(self):
-        path = PathUtil.get_redis_conn_path()
-        print(path)
-        config_parser = ConfigParser()
-        config_parser.read(path)
-        host = config_parser.get('base', 'host')
-        port = int(config_parser.get('base', 'port'))
-        db = int(config_parser.get('base', 'db'))
-        password = config_parser.get('base', 'password')
-        # 将空字符串密码转换为 None
-        password = password if password != '' else None
-        return host, port,db, password
-
-if __name__ == '__main__':
-    cf = ConfReader()
-    print(cf.get_es_conf())
-    print(cf.get_redis_conf())

+ 0 - 37
dw_base/scheduler/mg2es/dict_redis2hive.py

@@ -1,37 +0,0 @@
-import sys
-import os
-import re
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.scheduler.mg2es.conf_reader import ConfReader
-from dw_base.scheduler.mg2es.redis_operator import RedisOperator
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.config_utils import parse_args
-
-if __name__ == '__main__':
-    CONFIG, _ = parse_args(sys.argv[1:])
-    dt = CONFIG.get('dt')
-    cf = ConfReader()
-    host, port, db, password = cf.get_redis_conf()
-    redis_client = RedisOperator(host, port, db, password)
-    spark = SparkSQL().get_spark_session()
-    state_dict = redis_client.get_hash_table_all('customs:state:dict')
-    country_dict = redis_client.get_hash_table_all('customs:country:dict')
-    state_df_dict = [{"field": k.decode(), "value": v.decode()} for k, v in state_dict.items()]
-    country_df_dict = [{"field": k.decode(), "value": v.decode()} for k, v in country_dict.items()]
-    state_df = spark.createDataFrame(state_df_dict)
-    country_df = spark.createDataFrame(country_df_dict)
-    # 注册DataFrame为临时视图
-    state_df.createOrReplaceTempView("redis_state_data")
-    country_df.createOrReplaceTempView("redis_country_data")
-    # 将数据写入Hive表
-    spark.sql("set hive.exec.dynamic.partition=true")
-    spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
-    spark.sql("set spark.yarn.queue=cts")
-    print("开始写入Hive表")
-    spark.sql(f"INSERT overwrite TABLE dim.redis_cts_state_dict SELECT *,'{dt}' FROM redis_state_data")
-    spark.sql(f"INSERT overwrite TABLE dim.redis_cts_country_dict SELECT *,'{dt}' FROM redis_country_data")
-    # 停止SparkSession
-    spark.stop()
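
dict_redis2hive.py decodes every hash key and value by hand and lets Spark infer the schema from a list of dicts. A sketch of the same dictionary load with redis-py doing the decoding (decode_responses=True) and an explicit two-column schema; the Redis address below is a placeholder, not the production endpoint:

import redis
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

r = redis.Redis(host='127.0.0.1', port=6379, db=0, decode_responses=True)  # placeholder address
state_dict = r.hgetall('customs:state:dict')   # already {str: str}, no .decode() calls needed

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
schema = StructType([StructField('field', StringType()), StructField('value', StringType())])
state_df = spark.createDataFrame(list(state_dict.items()), schema=schema)
state_df.createOrReplaceTempView('redis_state_data')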

+ 0 - 47
dw_base/scheduler/mg2es/es_index_backup.py

@@ -1,47 +0,0 @@
-import sys
-import os
-import re
-# 配置参数示例  -catalog=imports -database_name=venezuela_bol -year=2023
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-import sys
-from time import sleep
-
-from dw_base.scheduler.mg2es.conf_reader import ConfReader
-from dw_base.scheduler.mg2es.es_operator import ESOperator
-from dw_base.utils.config_utils import parse_args
-
-
-if __name__ == '__main__':
-    CONFIG, _ = parse_args(sys.argv[1:])
-    catalog = CONFIG.get('catalog')
-    database_name = CONFIG.get('database_name')
-    env = CONFIG.get('env','test')
-    host='192.168.0.200'
-    port='9201'
-    if env == 'prod':
-        host = '192.168.11.100'
-        port = '9003'
-    year = CONFIG.get('year')
-    bak_suffix = CONFIG.get('bak_suffix','bak')
-    es_operator = ESOperator(host, port)
-    index_name = f'customs_{catalog}_{database_name}-{year}'
-    bak_index_name = f'{index_name}-{bak_suffix}'
-    es_operator.create_index(bak_index_name)
-    task_id = es_operator.reindex(index_name, bak_index_name)['task']
-    total_time = 0
-    while True:
-        sleep(60)
-        total_time += 60
-        task_info = es_operator.get_task_status(task_id)
-        if task_info['completed'] == True:
-            print('迁移完成--------------------------')
-            print(f'迁移耗时:{total_time}秒')
-            cnt = es_operator.get_index_document_count(bak_index_name)
-            print(f'迁移文档数:{cnt}')
-            break
-        else:
-            print('迁移中----------------------------')
-            print(task_info)
-    # es_operator.delete_index(index_name)

+ 0 - 214
dw_base/scheduler/mg2es/es_operator.py

@@ -1,214 +0,0 @@
-import sys
-import os
-import re
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-import json
-from elasticsearch import Elasticsearch
-from elasticsearch.exceptions import NotFoundError
-
-from dw_base import NORM_CYN, NORM_RED, NORM_BLU, NORM_MGT
-
-
-class ESOperator:
-
-    def __init__(self, host, port, timeout=30):
-
-        self.es = Elasticsearch([{'host': host, 'port': port}], timeout=timeout)
-
-    def get_all_indices(self):
-        try:
-            indices = self.es.indices.get_alias("*").keys()
-            return list(indices)
-        except Exception as e:
-            print("Error:", e)
-            return []
-
-    def get_cts_indices(self, catalog, database_name):
-        indices = self.get_all_indices()
-        return [index for index in indices if catalog in index and database_name in index]
-
-    def get_aliases_for_index(self, index_name):
-        aliases = self.es.indices.get_alias(index=index_name)
-        return list(aliases[index_name]['aliases'].keys())
-
-    def add_alias_to_index(self, index_name, alias_name):
-        self.es.indices.put_alias(index=index_name, name=alias_name)
-
-    def get_indices_by_alias(self, alias_name):
-        result = self.es.indices.get_alias(name=alias_name)
-        return list(result.keys())
-
-    def get_random_documents(self, index_name, size=10):
-        try:
-            # 构造随机排序查询
-            query_body = {
-                "size": size,
-                "query": {
-                    "function_score": {
-                        "query": {"match_all": {}},
-                        "random_score": {}
-                    }
-                }
-            }
-            result = self.es.search(index=index_name, body=query_body)
-            return result['hits']['hits']
-        except Exception as e:
-            print("Error:", e)
-            return []
-
-    def get_random_doc_with_id(self, index_name, size=10):
-        doc_list = self.get_random_documents(index_name, size)
-        return {d['_id']: d['_source'] for d in doc_list}
-
-    # 注意!此方法为异步方法,要想获取任务执行结果需配合get_task_status方法使用
-    def reindex(self, source_index, target_index):
-        body = {
-            "source": {
-                "index": source_index
-            },
-            "dest": {
-                "index": target_index
-            }
-        }
-        response = self.es.reindex(body=body, wait_for_completion=False)
-        return response
-
-    def get_task_status(self, task_id):
-        try:
-            task_info = self.es.tasks.get(task_id)
-            return task_info
-        except NotFoundError:
-            return None
-
-    def get_data_from_ids(self, index_name, id_list):
-        doc_list = [self.es.get(index=index_name, id=id) for id in id_list]
-        return {d['_id']: d['_source'] for d in doc_list}
-
-    def delete_index(self, index_name):
-        try:
-            response = self.es.indices.delete(index=index_name, ignore=[400, 404])
-            if response['acknowledged']:
-                print(f"Index '{index_name}' deleted successfully.")
-            else:
-                print(f"Failed to delete index '{index_name}'.")
-        except Exception as e:
-            print("Error:", e)
-
-    def dict_diff(self, old_dict, new_dict):
-        old_keys = set(old_dict.keys())
-        new_keys = set(new_dict.keys())
-        old_only_keys = old_keys - new_keys
-        new_only_keys = new_keys - old_keys
-        common_keys = old_keys & new_keys
-        if old_only_keys:
-            print(f"{NORM_CYN} old_only_keys:")
-            for key in old_only_keys:
-                print(f"{NORM_BLU}      {key} :{new_dict[key]}")
-        if new_only_keys:
-            print(f"{NORM_CYN} new_only_keys:")
-            for key in new_only_keys:
-                print(f"{NORM_BLU}      {key} :{new_dict[key]}")
-        diff_data = {}
-        for key in common_keys:
-            if old_dict[key] != new_dict[key] or type(old_dict[key]) != type(new_dict[key]):
-                diff_data[key] = (old_dict[key], new_dict[key])
-        if diff_data:
-            print(f"{NORM_CYN} diff_data:")
-            for key in diff_data:
-                print(f"{NORM_BLU}      {key}: ")
-                print(
-                    f"{NORM_RED}          value:{NORM_MGT} {diff_data[key][0]}{NORM_RED} -> {NORM_MGT}{diff_data[key][1]}")
-                print(
-                    f"{NORM_RED}          type:{NORM_MGT} {type(old_dict[key])}{NORM_RED} -> {NORM_MGT}{type(new_dict[key])}")
-
-    def get_data_from_id(self, index_name, id):
-        response = self.es.get(index=index_name, id=id)
-        return response['_source']
-
-    def create_index_from_json(self, index_name, settings_and_mappings):
-        try:
-            self.es.indices.create(index=index_name, body=settings_and_mappings)
-            print(f"Index '{index_name}' created successfully.")
-        except Exception as e:
-            print(f"Error creating index '{index_name}':", e)
-
-    def create_index(self, index_name):
-        try:
-            if self.es.indices.exists(index=index_name):
-                print(f"Index '{index_name}' already exists.")
-                return False
-            self.es.indices.create(index=index_name)
-            print(f"Index '{index_name}' created successfully.")
-            return True
-        except Exception as e:
-            print(f"Error creating index '{index_name}': {e}")
-            return False
-
-    def get_index_document_count(self, index_name):
-        try:
-            result = self.es.count(index=index_name)
-            return result['count']
-        except Exception as e:
-            print("Error:", e)
-            return None
-
-    def random_diff(self, new_index, old_index):
-        new_dicts = self.get_random_doc_with_id(new_index)
-        id_list = [id for id in new_dicts.keys()]
-        old_dicts = self.get_data_from_ids(old_index, id_list)
-        for id in id_list:
-            print(f'【id:{id}】------------------------------------------------------')
-            self.dict_diff(old_dicts[id], new_dicts[id])
-
-    def refresh(self, index):
-        if not index:
-            raise ValueError("Index name must be specified.")
-
-        try:
-            self.es.indices.refresh(index=index)
-            print(f"Index {index} refreshed.")
-        except Exception as e:
-            print("Error during refresh:", e)
-
-    # 此方法很重,不建议在生产环境中使用(可能导致线上性能下降)
-    def flush(self, index):
-        if not index:
-            raise ValueError("Index name must be specified.")
-
-        try:
-            self.es.indices.flush(index=index)
-            print(f"Index {index} flushed.")
-        except Exception as e:
-            print("Error during flush:", e)
-
-
-if __name__ == '__main__':
-    # es_operator = ESOperator('192.168.0.200', 9201)
-    es_operator = ESOperator('192.168.11.99', 9005)
-    es_operator.get_data_from_id('corp','b7730f7f75f47296e9261eb5934b140a')
-    # es_operator.refresh('customs_imports_venezuela-2020test')
-    # es_operator.refresh('customs_exports_pakistan-2020test')
-    es_operator.random_diff('customs_exports_mexico-2020test', 'customs_exports_mexico-2020')
-    # print(es_operator.get_cts_indices('exports', 'kazakhstan'))
-    # es_operator.add_alias_to_index('customs_exports_kazakhstan-2023-ctytest', 'cts_kazakhstan_ex-2023-ctytest')
-    # print(es_operator.get_aliases_for_index('customs_exports_kazakhstan-2023-ctytest'))
-    # print(es_operator.get_indices_by_alias('cts_kazakhstan_ex-2023-ctytest'))
-    # es_operator.reindex('customs_exports_kazakhstan-2023','customs_exports_kazakhstan-2023-bak')
-    # old_dict = es_operator.get_random_doc_with_id('customs_exports_kazakhstan-2023-bak')
-    # id_list = [id for id in old_dict.keys()]
-    # new_dict = es_operator.get_data_from_ids('customs_exports_kazakhstan-2023-ctytest', id_list)
-    # for id in id_list:
-    #     print(f'【id:{id}】------------------------------------------------------')
-    #     print(old_dict[id])
-    #     print(new_dict[id])
-    #     es_operator.dict_diff( old_dict[id],new_dict[id])
-    # old = es_operator.get_data_from_id('customs_exports_kazakhstan-2023-benchmark', '656d8f637e0d39686b8206e2')
-    # new = es_operator.get_data_from_id('customs_exports_kazakhstan-2023-bak', '656d8f637e0d39686b8206e2')
-    # print(old['exporterOrig'])
-    # print(new['exporterOrig'])
-    # rp = es_operator.reindex('customs_exports_kazakhstan-2023-ctytest','customs_exports_kazakhstan-2023-bak')
-    # print(rp)
-    # es_operator.delete_index('customs_exports_kazakhstan-2023-bak')

+ 0 - 250
dw_base/scheduler/mg2es/es_tmpl_gen.py

@@ -1,250 +0,0 @@
-import sys
-import os
-import re
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-import json
-
-from dw_base.scheduler.mg2es.path_util import PathUtil
-from dw_base.scheduler.mg2es.conf_reader import ConfReader
-
-
-class EsTmplGen:
-    def __init__(self, catalog, database_name):
-        self.catalog = catalog
-        self.database_name = database_name
-        es_json_path, mg2es_mapping_path = PathUtil.get_conf_abspath(catalog, database_name)
-        conf_reader = ConfReader()
-        self.yml_dict = conf_reader.get_yml_data(mg2es_mapping_path)
-        self.es_json = conf_reader.get_json_data(es_json_path)
-        self.type_dict = {
-            'date': 'string',
-            'text': 'string',
-            'keyword': 'string',
-            'scaled_float': 'double',
-        }
-
-        self.catalog_dict = {
-            'exports': 'ex',
-            'imports': 'im'}
-
-    def get_clos_with_type(self):
-        yml_fields = self.yml_dict['transformer']['mapping']['fields']
-        for field in yml_fields:
-            if field.get('name') == 'productTag':
-                yml_fields.remove(field)
-        yml_clos = [c['name'] for c in yml_fields]
-        # 将_id替换成id
-        yml_clos[0] = 'id'
-        # 数组类型:字段handler为text_split_handler,或source类型为数组
-        arr_clos = [c['name'] for c in yml_fields if
-                    c.get('handler') == 'text_split_handler' or isinstance(c.get('source'), list)]
-        # print(f'全字段:{yml_clos}')
-        # print(f'数组类型字段:{arr_clos}')
-        # 创建一个新的列表,用于存放键值对
-        json_list = {}
-        # 遍历 JSON 字典的键值对
-        for key, value in self.es_json['mappings']['properties'].items():
-            # 提取键和 type 的值,并组成新的键值对
-            json_list[key] = value['type']
-        # print(f'es类型:{json_list}')
-        res_list = []
-        for c in yml_clos:
-            if c in arr_clos:
-                res_list.append((c, 'array<string>'))
-            elif c in json_list:
-                res_list.append((c, self.type_dict[json_list[c]]))
-            else:
-                res_list.append((c, 'string'))
-        return res_list
-
-    def make_ddl_body(self):
-        clos_with_type = self.get_clos_with_type()
-        clos_len = [len(c[0]) for c in clos_with_type]
-        max_len = max(clos_len) + 2
-        formatted_clos = ['\t{:<{width}} {}'.format(f'`{c[0]}`', c[1], width=max_len) for c in clos_with_type]
-        clos_str = ",\n".join(formatted_clos)
-        return clos_str
-
-    def make_2es_ddl(self):
-        clos_str = self.make_ddl_body()
-        ddl = (f'create table to_es.cts_{self.database_name}_{self.catalog_dict[self.catalog]}\n'
-               f'(\n'
-               f'{clos_str}'
-               f'\n) PARTITIONED BY ( `dt` string,year_from_date string) \n'
-               f'\tSTORED AS ORC'
-               )
-        return ddl
-
-    def make_es_mapping_ddl(self):
-        clos_str = self.make_ddl_body()
-        clos = [f'{ct[0]}:{ct[0]}' for ct in self.get_clos_with_type()]
-        clos = clos[1:]
-        mapping_prop = ','.join(clos)
-        ddl = (
-            f'create external table if not exists to_es.es_cts_{self.database_name}_{self.catalog_dict[self.catalog]}_yearNeedReplace\n'
-            f'(\n'
-            f'{clos_str}'
-            f"\n) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' \n"
-            f'\tTBLPROPERTIES ('
-            f'''\n'es.nodes' = '192.168.11.100',
-        'es.port' = '9003',
-        'es.http.timeout' = '100m',
-        'es.input.use.sliced.partitions' = 'false',
-        'es.input.json' = 'false',
-        'es.index.auto.create' = 'true',
-        'es.write.operation' = 'upsert',
-        'es.mapping.date.rich' = 'false',
-        'es.batch.write.refresh' = 'false',
-        'es.batch.size.bytes' = '60mb',
-        'es.batch.size.entries' = '5000',
-        'es.batch.write.retry.count' = '10',
-        'es.batch.write.retry.wait' = '60s',
-        'es.update.retry.on.conflict' = '5' ,
-        'es.resource' = 'customs_{self.catalog}_{self.database_name}-yearNeedReplace/_doc',
-        'es.mapping.id' = 'id',
-        'es.mapping.names' =
-            '{mapping_prop}')'''
-        )
-        return ddl
-
-    def make_2es_dml(self):
-        ct_list = self.get_clos_with_type()[2:]
-        yml_fields = self.yml_dict['transformer']['mapping']['fields'][2:]
-        field_list = []
-        filed_dict = {}
-        handler_list = []
-        for i in range(len(ct_list)):
-            yml_field = yml_fields[i]
-            field_tuple = self.get_field(yml_field)
-            field = field_tuple[0]
-            if ct_list[i][1] == 'string':
-                field = f"merge_ws({field})"
-            field_sql = f"{field} as `{field_tuple[1]}`"
-            field_list.append(field_sql)
-            filed_dict[field_tuple[1]] = field_tuple[0]
-            if 'handler' in yml_field:
-                if 'dict_handler' in yml_field['handler']:
-                    handler_list.append(yml_field)
-        dim_join_sql = self.get_dim_join_sql(handler_list, filed_dict)
-        dml_body = '\n\t, '.join(field_list)
-        dml = (f'insert overwrite table to_es.cts_{self.database_name}_{self.catalog_dict[self.catalog]}'
-               f'\nselect i.`id`'
-               f"\n\t, concat(replace(from_unixtime((i.`date` / 1000) - 8 * 60 * 60, 'yyyy-MM-dd HH:mm:ss'),' ','T'),'Z') as `date`"
-               f'\n\t, {dml_body}'
-               f'\n\t, i.`dt`'
-               f"\n\t, from_unixtime((i.`date` / 1000) - 8 * 60 * 60, 'yyyy')                         as `year_from_date`"
-               f'\nfrom to_mongo.cts_{self.database_name}_{self.catalog_dict[self.catalog]} i'
-               f'\n{dim_join_sql}'
-               f'\nwhere i.dt = "dtNeedReplace"'
-               )
-        return dml
-
-    def make_es_mapping_dml(self):
-        clos_with_type = self.get_clos_with_type()
-        clos = [f'i.`{c[0]}`' for c in clos_with_type]
-        dml_body = '\n     , '.join(clos)
-        dml = (
-            f'insert overwrite table to_es.es_cts_{self.database_name}_{self.catalog_dict[self.catalog]}_yearNeedReplace'
-            f'\nselect {dml_body}'
-            f'\nfrom to_es.cts_{self.database_name}_{self.catalog_dict[self.catalog]} i'
-            '\nwhere dt = "dtNeedReplace" and year_from_date = "yearNeedReplace"')
-        return dml
-
-    def get_field(self, fields):
-        name = fields["name"]
-        field = ''
-        if 'default' in fields and 'source' not in fields:
-            field = f"'{fields['default']}'"
-        if 'handler' in fields:
-            if 'dict_handler' in fields['handler']:
-                field = f'{fields["name"]}_dim.`value`'
-            elif fields['handler'] == 'text_split_handler':
-                d = fields['delimiter']
-                if len(d) == 1:
-                    delimiter = d[0]
-                    field = f"array_distinct(split(i.`{fields['source']}`,'{delimiter}'))"
-                else:
-                    delimiter = fields['delimiter'][1]
-                    delimiters = fields['delimiter'][2:-1]
-                    field = f"array_distinct(split(regexp_replace(i.`{fields['source']}`, '[{delimiters}]', '{delimiter}'),'{delimiter}'))"
-        if 'source' in fields and 'handler' not in fields:
-            source = fields['source']
-            if isinstance(source, list):
-                # source = [f"`i.{s}`" for s in source]
-                source = [f"merge_ws(i.`{s}`)" for s in source]
-                # field = f"coalesce({','.join(source)})"
-                field = f"filter(array_distinct(array({','.join(source)})),x -> x is not null)"
-            else:
-                field = f"i.`{source}`"
-        if 'default' in fields and 'source' in fields:
-            field = f"coalesce({field}, '{fields['default']}')"
-        return (field, name)
-
-    def get_dim_join_sql(self, handler_list, filed_dict):
-        sql_list = []
-        for field in handler_list:
-            handler = field['handler']
-            dim = f'{field["name"]}_dim'
-            source = field['source']
-            if '__' in source:
-                source = filed_dict.get(source.split('__')[1])
-            else:
-                source = f'i.`{source}`'
-            if handler == 'country_dict_handler':
-                sql_list.append(
-                    f'left join dim.redis_cts_country_dict as {dim} on {dim}.dt = "dtNeedReplace" and lower({source}) = {dim}.`field`')
-            elif handler == 'state_dict_handler':
-                sql_list.append(
-                    f'left join dim.redis_cts_state_dict as {dim} on {dim}.dt = "dtNeedReplace" and lower({source}) = {dim}.`field`')
-        return '\n'.join(sql_list)
-
-    def replace_sql(self, es_bak_ddl, es_mapping_ddl, es_bak_dml, data_source):
-        if data_source == 'india_im':
-            es_bak_ddl = es_bak_ddl.replace("`importerAddress`          string,",
-                                            "`importerAddress`          array<string>,")
-            es_bak_ddl = es_bak_ddl.replace("`exporterAddress`          string,",
-                                            "`exporterAddress`          array<string>,")
-            es_mapping_ddl = es_mapping_ddl.replace("`importerAddress`          string,",
-                                                    "`importerAddress`          array<string>,")
-            es_mapping_ddl = es_mapping_ddl.replace("`exporterAddress`          string,",
-                                                    "`exporterAddress`          array<string>,")
-            es_bak_dml = es_bak_dml.replace("merge_ws(i.`jksdz`) as `importerAddress`",
-                                            "str_to_arr(i.`jksdz`) as `importerAddress`")
-            es_bak_dml = es_bak_dml.replace("merge_ws(i.`cksdz`) as `exporterAddress`",
-                                            "str_to_arr(i.`cksdz`) as `exporterAddress`")
-        if data_source == 'america_im':
-            es_bak_ddl = es_bak_ddl.replace("`importerAddress`          string,",
-                                            "`importerAddress`          array<string>,")
-            es_bak_ddl = es_bak_ddl.replace("`exporterAddress`          string,",
-                                            "`exporterAddress`          array<string>,")
-            es_bak_ddl = es_bak_ddl.replace("`notifyPartyAddress`       string,",
-                                            "`notifyPartyAddress`       array<string>,")
-            es_mapping_ddl = es_mapping_ddl.replace("`importerAddress`          string,",
-                                                    "`importerAddress`          array<string>,")
-            es_mapping_ddl = es_mapping_ddl.replace("`exporterAddress`          string,",
-                                                    "`exporterAddress`          array<string>,")
-            es_mapping_ddl = es_mapping_ddl.replace("`notifyPartyAddress`       string,",
-                                                    "`notifyPartyAddress`       array<string>,")
-            es_bak_dml = es_bak_dml.replace("merge_ws(i.`shrdz`) as `importerAddress`",
-                                            "str_to_arr(i.`shrdz`) as `importerAddress`")
-            es_bak_dml = es_bak_dml.replace("merge_ws(i.`fhrdz`) as `exporterAddress`",
-                                            "str_to_arr(i.`fhrdz`) as `exporterAddress`")
-            es_bak_dml = es_bak_dml.replace("merge_ws(i.`tzrdz`) as `notifyPartyAddress`",
-                                            "str_to_arr(i.`tzrdz`) as `notifyPartyAddress`")
-        return es_bak_ddl, es_mapping_ddl, es_bak_dml
-
-
-if __name__ == '__main__':
-    # es = EsDDLGen('exports', 'america')
-    es = EsTmplGen('imports', 'america')
-    print('\n\n--2es_ddl-------------------------------------------------------')
-    print(es.make_2es_ddl())
-    print('\n\n--es_mapping_ddl-------------------------------------------------------')
-    print(es.make_es_mapping_ddl())
-    print('\n\n--2es_dml-------------------------------------------------------')
-    print(es.make_2es_dml())
-    print('\n\n--es_mapping_dml-------------------------------------------------------')
-    print(es.make_es_mapping_dml())
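
For the record, the field-to-SQL translation that get_field() performed can be traced without loading any settings.yml, because the method never reads self; the two field definitions below are invented purely for illustration:

from dw_base.scheduler.mg2es.es_tmpl_gen import EsTmplGen  # only importable from a checkout that predates this commit

split_field = {'name': 'products', 'source': 'cpms',
               'handler': 'text_split_handler', 'delimiter': [';']}
plain_field = {'name': 'vesselName', 'source': 'cbmc'}

print(EsTmplGen.get_field(None, split_field))
# -> ("array_distinct(split(i.`cpms`,';'))", 'products')
print(EsTmplGen.get_field(None, plain_field))
# -> ('i.`cbmc`', 'vesselName')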

+ 0 - 82
dw_base/scheduler/mg2es/git_helper.py

@@ -1,82 +0,0 @@
-import sys
-import os
-import re
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-import subprocess
-from datetime import datetime
-
-from dw_base.scheduler.mg2es.path_util import PathUtil
-
-
-class GitHelper:
-
-    # def git_pull(self, working_dir):
-    #     subprocess.run(["git", "pull"], cwd=working_dir, check=True)
-
-    def git_pull(self, working_dir):
-        """
-        从远程仓库拉取最新的更改。
-
-        参数:
-            working_dir (str): Git 仓库的目录。
-
-        异常:
-            subprocess.CalledProcessError: 如果 git pull 命令失败。
-            FileNotFoundError: 如果工作目录不存在。
-        """
-        # 检查工作目录是否存在
-        if not os.path.exists(working_dir):
-            raise FileNotFoundError(f"指定的目录不存在: {working_dir}")
-        try:
-            subprocess.run(["git", "pull"], cwd=working_dir, check=True)
-            print("成功拉取最新的更改。")
-        except subprocess.CalledProcessError as e:
-            print(f"git pull 过程中出错: {e}")
-
-    def git_pull_etlconfig(self):
-        root_path = PathUtil.get_project_root_path()
-        # 调用函数并指定文件路径和工作目录
-        working_dir = root_path + '/../mongo2es-customs'
-        print(f'【git pull】 working_dir: {working_dir}')
-        self.git_pull(working_dir)
-    def get_last_commit_date(self, file_path, working_dir):
-        try:
-            # 构建Git命令
-            git_command = ["git", "log", "-1", "--format=%cd", "--", file_path]
-            # 执行命令并获取输出
-            output = subprocess.check_output(git_command, cwd=working_dir, stderr=subprocess.STDOUT,
-                                             universal_newlines=True)
-            date_object = datetime.strptime(output.strip(), '%a %b %d %H:%M:%S %Y %z')
-            # 格式化日期时间字符串
-            formatted_date = date_object.strftime('%Y-%m-%d %H:%M:%S')
-            print("最近一次提交的日期为: "+formatted_date)
-            # 返回输出(即最近一次提交的日期)
-            return formatted_date
-        except subprocess.CalledProcessError as e:
-            # 如果命令执行失败,输出错误信息
-            print("Error:", e.output)
-            return None
-
-    def get_etlconfig_last_uptime(self, catalog, database_name):
-        root_path = PathUtil.get_project_root_path()
-        # 调用函数并指定文件路径和工作目录
-        es_json_path,mg2es_mapping_path = PathUtil.get_conf_path(catalog,database_name)
-        working_dir = root_path + '/../mongo2es-customs'
-        es_json_date = self.get_last_commit_date(es_json_path, working_dir)
-        mg2es_mapping_date = self.get_last_commit_date(mg2es_mapping_path, working_dir)
-        if es_json_date and mg2es_mapping_date:
-            last_commit_date = max(es_json_date, mg2es_mapping_date)
-            print("最近一次提交的日期:", last_commit_date)
-            return last_commit_date
-        else:
-            print("获取最近一次提交的日期时出错。")
-            return None
-
-
-if __name__ == '__main__':
-    git_helper = GitHelper()
-    # last_commit_date = git_helper.get_etlconfig_last_uptime('exports', 'kazakhstan')
-    git_helper.git_pull_etlconfig()
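
get_last_commit_date() above parses git's default date output with a hard-coded strptime pattern. A sketch of the same lookup that asks git for the strict ISO-8601 committer date (%cI), which datetime.fromisoformat parses directly; the function name and the empty-output handling are assumptions:

import subprocess
from datetime import datetime

def last_commit_date(file_path, working_dir):
    out = subprocess.check_output(
        ['git', 'log', '-1', '--format=%cI', '--', file_path],
        cwd=working_dir, universal_newlines=True,
    ).strip()
    # empty output means the path has no commit history in this repository
    return datetime.fromisoformat(out).strftime('%Y-%m-%d %H:%M:%S') if out else None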

+ 0 - 54
dw_base/scheduler/mg2es/hive_sql.py

@@ -1,54 +0,0 @@
-import sys
-import os
-import re
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from pyhive import hive
-
-class HiveSQL():
-    def __init__(self, host, port, username, password, database):
-        self.host = host
-        self.port = port
-        self.username = username
-        self.password = password
-        self.database = database
-        self.conn = self.create_connection()
-    def create_connection(self):
-        conn = hive.Connection(
-            host=self.host,
-            port=self.port,
-            username=self.username,
-            password=self.password,
-            database=self.database
-        )
-        return conn
-
-    def query(self, query):
-        print("Executing query:", query)
-        try:
-            cursor = self.conn.cursor()
-            cursor.execute(query)
-            result = cursor.fetchall()
-            return result
-        except Exception as e:
-            print("【error】:", e)
-            return None
-    def execute(self, sql):
-        print("Executing sql:", sql)
-        try:
-            cursor = self.conn.cursor()
-            cursor.execute(sql)
-        except Exception as e:
-            print("【error】:", e)
-            return None
-
-if __name__ == '__main__':
-    hive = HiveSQL('192.168.15.3',10000, 'chutianyu', None, 'test')
-    print(hive.query('select current_user()'))
-    queue_sql = 'SET mapreduce.job.queuename=vip'
-    hive.execute(queue_sql)
-    print(hive.query('select * from test.test_table1  a cross join  test.test_table1 b'))
-    print(hive.query('SET mapreduce.job.queuename'))
-    hive.execute('insert into table test.test_table1 values (12345)')

+ 0 - 39
dw_base/scheduler/mg2es/path_util.py

@@ -1,39 +0,0 @@
-import sys
-import os
-import re
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-import os
-import re
-class PathUtil():
-
-    @staticmethod
-    def get_project_root_path():
-        abspath = os.path.abspath(__file__)
-        root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-        return root_path
-
-    @staticmethod
-    def get_conf_path(catalog,database_name):
-        es_json_path = f"customs/{catalog}/{database_name}/customs_{catalog}_{database_name}-es7.json"
-        mg2es_mapping_path = f"customs/{catalog}/{database_name}/settings.yml"
-        return es_json_path,mg2es_mapping_path
-
-    @staticmethod
-    def get_conf_abspath(catalog, database_name):
-        working_dir = PathUtil.get_project_root_path() + '/../mongo2es-customs'
-        es_json_path = f"{working_dir}/customs/{catalog}/{database_name}/customs_{catalog}_{database_name}-es7.json"
-        mg2es_mapping_path = f"{working_dir}/customs/{catalog}/{database_name}/settings.yml"
-        return es_json_path, mg2es_mapping_path
-
-    @staticmethod
-    def get_es_conn_path():
-        es_conf_path = PathUtil.get_project_root_path() + '/../datasource/elasticsearch/es-prod-cts.ini'
-        return es_conf_path
-
-    @staticmethod
-    def get_redis_conn_path():
-        es_conf_path = PathUtil.get_project_root_path() + '/../datasource/redis/redis-prod-cts.ini'
-        return es_conf_path

+ 0 - 61
dw_base/scheduler/mg2es/redis_operator.py

@@ -1,61 +0,0 @@
-import redis
-import sys
-import os
-import re
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-
-class RedisOperator:
-    def __init__(self, host, port=6379, db=0, password=None):
-        self.redis_client = redis.Redis(host=host, port=port, db=db, password=password)
-
-    def get_hash_table_all(self, table_name):
-        """
-        获取哈希表的所有字段和值
-        :param table_name: 哈希表名称
-        :return: 哈希表的所有字段和值,以字典形式返回
-        """
-        return self.redis_client.hgetall(table_name)
-
-    def get_hash_table_field(self, table_name, field):
-        """
-        获取哈希表的指定字段
-        :param table_name: 哈希表名称
-        :param field: 字段名称
-        :return: 指定字段的值
-        """
-        return self.redis_client.hget(table_name, field)
-    def zadd_batch(self, zset_name, mapping):
-        """
-        批量插入有序集合 ZSET 中的多个成员和分数
-        :param zset_name: 有序集合名称
-        :param mapping: 包含多个成员和分数的字典
-        :return: 插入成功的成员数量
-        """
-        return self.redis_client.zadd(zset_name, mapping)
-
-
-    def delete_zset(self, zset_name):
-        """
-        删除整个有序集合 ZSET
-        :param zset_name: 有序集合名称
-        :return: True if the key was removed, False if the key does not exist
-        """
-        return self.redis_client.delete(zset_name) > 0
-
-if __name__ == '__main__':
-       redis_host = '192.168.30.1'
-       redis_port = 8000
-       redis_db = 10
-       redis_password = '111111'
-
-       redis_operator = RedisOperator(redis_host, redis_port, redis_db, redis_password)
-
-       # 示例:批量插入有序集合 ZSET
-       zset_name = 'my_test_zset1'
-       mapping = {'member1': 10, 'member2': 20, 'member3': 30}
-
-       result = redis_operator.zadd_batch(zset_name, mapping)
-       print(f"Inserted {result} members into {zset_name}")

+ 0 - 178
dw_base/scheduler/mg2es/to_es.py

@@ -1,178 +0,0 @@
-import sys
-import os
-import re
-from datetime import datetime
-
-# 传参示例测试环境 -catalog=imports -database_name=venezuela_bol -run_type=test -dt=20200101 -cdt=20200102
-# 生产环境        -catalog=imports -database_name=america -run_type=print -dt=20240808 -cdt=20240808
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.scheduler.mg2es.es_operator import ESOperator
-from dw_base.scheduler.mg2es.es_tmpl_gen import EsTmplGen
-from dw_base.scheduler.mg2es.git_helper import GitHelper
-from dw_base.scheduler.mg2es.hive_sql import HiveSQL
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.config_utils import parse_args
-
-if __name__ == '__main__':
-    git_helper = GitHelper()
-    git_helper.git_pull_etlconfig()
-    catalog_dict = {
-        'exports': 'ex',
-        'imports': 'im'}
-    CONFIG, _ = parse_args(sys.argv[1:])
-    run_type = CONFIG.get('run_type', 'print')
-    dt = CONFIG.get('dt')
-    cdt = CONFIG.get('cdt')
-    ydt = CONFIG.get('ydt')
-    run_year = CONFIG.get('run_year')
-    catalog = CONFIG.get('catalog')
-    database_name = CONFIG.get('database_name')
-    data_source = f'{database_name}_{catalog_dict.get(catalog)}'
-    es_gen = EsTmplGen(catalog, database_name)
-    es_bak_ddl = es_gen.make_2es_ddl().replace("'", '"')
-    es_mapping_ddl = es_gen.make_es_mapping_ddl().replace("'", '"')
-    es_bak_dml = es_gen.make_2es_dml().replace("'", '"')
-    es_mapping_dml = es_gen.make_es_mapping_dml().replace("'", '"')
-    if data_source == 'india_im' or data_source == 'america_im':
-        es_bak_ddl, es_mapping_ddl, es_bak_dml = es_gen.replace_sql(es_bak_ddl, es_mapping_ddl, es_bak_dml, data_source)
-    hive_host = '192.168.30.3'
-    es_operator = ESOperator('192.168.11.100', 9003)
-    if run_type not in ['print', 'test', 'prod']:
-        print('【error】 run_type 参数错误,请检查!')
-    if run_type != 'prod':
-        es_mapping_ddl = es_mapping_ddl.replace('192.168.11.100', '192.168.0.200')
-        es_mapping_ddl = es_mapping_ddl.replace('9003', '9201')
-        hive_host = '192.168.15.3'
-        es_operator = ESOperator('192.168.0.200', 9201)
-    if run_type == 'print':
-        print(f'\n\n【es_bak_ddl】\n{es_bak_ddl}')
-        print(f'\n\n【es_mapping_ddl】\n{es_mapping_ddl}')
-        print(f'\n\n【es_bak_dml】\n{es_bak_dml}')
-        print(f'\n\n【es_mapping_dml】\n{es_mapping_dml}')
-        exit()
-    incr_tbl = f'to_mongo.cts_{database_name}_{catalog_dict[catalog]}'
-    spark = SparkSQL(spark_driver_memory='4g',
-                     spark_executor_memory='6g',
-                     spark_executor_memory_overhead='2048',
-                     spark_driver_cores=1,
-                     spark_executor_cores=2,
-                     spark_executor_instances=10,
-                     udf_files=['dw_base/spark/udf/customs/cts_common.py'])
-    hive = HiveSQL(hive_host, 10000, 'hive', None, 'to_es')
-    spark._final_spark_config = {'hive.exec.dynamic.partition': 'true', 'hive.exec.dynamic.partition.mode': 'nonstrict',
-                                 'spark.yarn.queue': 'cts', 'spark.sql.crossJoin.enabled': 'true'}
-    cnt_sql = f'select count(1) cnt from {incr_tbl} where dt = "{dt}"'
-    cnt = spark.query(cnt_sql)[0].collect()[0]['cnt']
-    if cnt == 0:
-        print('\n\n【info】Today, there is no incremental data,exit')
-        exit()
-    else:
-        print(f'\n\n【info】Today, there are {cnt} incremental data,continue')
-    actived_sql = f'select git_last_time from task.es_template where `data_source` ="{data_source}"  and is_actived = "1"'
-    res = spark.query(actived_sql)[0].collect()
-    tbl_git_last_time = None
-    if len(res) != 0:
-        tbl_git_last_time = res[0]['git_last_time']
-    git_last_time = git_helper.get_etlconfig_last_uptime(catalog, database_name)
-    if tbl_git_last_time is None:
-        print('\n\n【info】No active template, insert template')
-        insert_sql = f'''
-        insert overwrite table task.es_template
-        select '{es_bak_ddl}'                                        as `es_bak_ddl`
-             , '{es_mapping_ddl}'                                    as `es_mapping_ddl`
-             , '{es_bak_dml}'                                        as `es_bak_dml`
-             , '{es_mapping_dml}'                                    as `es_mapping_dml`
-             , '{git_last_time}'                                     as `git_last_time`
-             , '1'                                                   as `is_actived`
-             , date_format(CURRENT_TIMESTAMP, 'yyyy-MM-dd HH:mm:ss') as `updated_time`
-             , '{data_source}'                                       as `data_source`
-             , '{cdt}'                                               as `created_dt`
-        '''
-        spark.execute(insert_sql)
-        print('\n\n【info】execute es_bak_ddl')
-        hive.execute(es_bak_ddl)
-    else:
-        print(f'\n\n【info】git_last_time: {git_last_time}  tbl_git_last_time: {tbl_git_last_time}')
-        if git_last_time > tbl_git_last_time:
-            print('\n\n【info】Template is updated, update template')
-            update_sql = f'''
-            insert overwrite table task.es_template
-            select '{es_bak_ddl}'                                        as `es_bak_ddl`
-             , '{es_mapping_ddl}'                                    as `es_mapping_ddl`
-             , '{es_bak_dml}'                                        as `es_bak_dml`
-             , '{es_mapping_dml}'                                    as `es_mapping_dml`
-             , '{git_last_time}'                                     as `git_last_time`
-             , '1'                                                   as `is_actived`
-             , date_format(CURRENT_TIMESTAMP, 'yyyy-MM-dd HH:mm:ss') as `updated_time`
-             , '{data_source}'                                       as `data_source`
-             , '{cdt}'                                               as `created_dt`
-            UNION ALL
-            SELECT i.`es_bak_ddl`
-                 , i.`es_mapping_ddl`
-                 , i.`es_bak_dml`
-                 , i.`es_mapping_dml`
-                 , i.`git_last_time`
-                 , if(i.`is_actived` = '1' , '0', i.`is_actived`)
-                 , if(i.`is_actived` = '1' ,
-                      date_format(CURRENT_TIMESTAMP, 'yyyy-MM-dd HH:mm:ss'), i.`updated_time`)
-                 , i.`data_source`
-                 , i.`created_dt`
-            FROM task.es_template i
-            where data_source = '{data_source}'  and created_dt != '{cdt}'
-            '''
-            spark.execute(update_sql)
-            print('\n\n【info】es_bak is rename')
-            rename_sql = f'''
-                alter table to_es.cts_{database_name}_{catalog_dict[catalog]} rename to to_es.cts_{database_name}_{catalog_dict[catalog]}_{dt}_bak
-            '''
-            hive.execute(rename_sql)
-            print('\n\n【info】execute es_bak_ddl')
-            hive.execute(es_bak_ddl)
-            rs = hive.query('show tables')
-            tbl_prefix = f'es_cts_{database_name}_{catalog_dict[catalog]}_'
-            for tbl in rs:
-                if tbl[0].startswith(tbl_prefix):
-                    drop_sql = f'drop table {tbl[0]}'
-                    hive.execute(drop_sql)
-        else:
-            print('\n\n【info】Template is not updated')
-
-    print('\n\n【info】execute es_bak_dml')
-    es_bak_dml = es_bak_dml.replace('dtNeedReplace', dt)
-    spark.execute(es_bak_dml)
-    years_sql = f"select  year_from_date,count(1) as cnt from to_es.cts_{database_name}_{catalog_dict[catalog]} where dt = '{dt}' group by year_from_date order by year_from_date desc"
-    rows = spark.query(years_sql)[0].collect()
-    if run_year:
-        years = [run_year]
-    else:
-        years = [row['year_from_date'] for row in rows]
-    print(f'\n\n【info】run_year is {years}')
-    year_cnt_dict = {row['year_from_date']: row['cnt'] for row in rows}
-    print(f'\n\n【info】year and cnt :{year_cnt_dict}')
-    spark._spark_session.stop()
-    print('\n\n【info】execute es_mapping_ddl&dml')
-    mem_sql = 'SET mapreduce.map.memory.mb=16384'
-    hive.execute(mem_sql)
-    jvm_sql = 'SET mapreduce.map.java.opts=-Xmx8192m'
-    hive.execute(jvm_sql)
-    queue_sql = 'SET mapreduce.job.queuename=cts'
-    hive.execute(queue_sql)
-    print(hive.query('SET mapreduce.job.queuename'))
-    print(hive.query('select current_user()'))
-    for year in years:
-        print(f'\n\n【info】execute year :{year} ')
-        start_time = datetime.now()
-        year_es_mapping_ddl = es_mapping_ddl.replace('yearNeedReplace', year)
-        year_es_mapping_dml = es_mapping_dml.replace('yearNeedReplace', year)
-        year_es_mapping_dml = year_es_mapping_dml.replace('dtNeedReplace', dt)
-        hive.execute(year_es_mapping_ddl)
-        hive.execute(year_es_mapping_dml)
-        end_time = datetime.now()
-        time_taken = (end_time - start_time).seconds
-        print(
-            f'\n\n【info】execute year :{year} cnt : {year_cnt_dict[year]} time_taken : {time_taken}  seconds-----------------------')
-        index_name = f'customs_{catalog}_{database_name}-{year}'
-        es_operator.refresh(index_name)
-        print(f'\n\n【info】index_name : {index_name} refresh done')

+ 0 - 49
dw_base/scheduler/mg_company_alias_init.py

@@ -1,49 +0,0 @@
-# mongo公司别名表索引初始化
-import sys
-import re
-import os
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-from dw_base.utils.config_utils import parse_args
-from configparser import ConfigParser
-from pymongo import MongoClient
-
-
-def get_mongo_client(conf_path):
-    config_parser = ConfigParser()
-    config_parser.read(root_path + conf_path)
-    print('conf_path:', root_path + conf_path)
-    url = config_parser.get('base', 'address')
-    return MongoClient(url)
-
-
-def create_index(client, tbl_name):
-    collection = client['tendata_corp'][tbl_name]
-    collection.create_index([("tid", 1)], unique=True)
-    collection.create_index([("qybzmc", 1)])
-
-
-def check_index(client, tbl_name):
-    collection = client['tendata_corp'][tbl_name]
-    index_info = str(collection.index_information())
-    cnt = collection.count()
-    print(f'{tbl_name} count: {cnt}')
-    if "'key': [('tid', 1)], 'unique': True" in index_info and "'key': [('qybzmc', 1)]" in index_info:
-        return True
-    else:
-        return False
-
-
-if __name__ == '__main__':
-    ent_mg_conf = '/../datasource/mongo/mongo-ent-prod-alias-rw.ini'
-    CONFIG, _ = parse_args(sys.argv[1:])
-    country = CONFIG.get('country')
-    tbl_name = f'company_alias_{country}'
-    client = get_mongo_client(ent_mg_conf)
-    create_index(client, tbl_name)
-    if check_index(client, tbl_name):
-        print(f'{tbl_name}: index ok')
-    else:
-        print(f'{tbl_name}: index not ok,plz check!')

+ 0 - 71
dw_base/spark/td_spark_init.py

@@ -1,71 +0,0 @@
-# -*- coding:utf-8 -*-
-
-from typing import Dict, Union, List
-
-from dw_base.utils.config_utils import parse_args
-from dw_base.spark.spark_sql import SparkSQL
-from pyspark.sql import SparkSession
-
-"""
-@author xunxu
-提供一种类似spark-submit提交方式的SparkSession初始化工具类
-xxx.py \
---name "spark-sql-test-job" \
---queue cts \
---num-executors 2 \
---executor-memory 1g \
---executor-cores 1 \
---conf spark.sql.shuffle.partitions=300 \
---conf spark.default.parallelism=300 \
---conf spark.dynamicAllocation.enabled=true \
---py-files dw_base/spark/udf/customs/common_clean.py,dw_base/spark/udf/spark_eng_ent_name_clean.py \
--dt="" -cdt="" -ydt="" ...
-"""
-
-
-def get_spark(argv: list) -> (SparkSQL, SparkSession):
-    """
-    Args:
-        argv: sys.argv parsed from the command line
-    Returns: tendata SparkSQL and SparkSession Tuple
-    """
-    conf_args: Dict[str, Union[str, bool, List[str]]]
-    conf_args, _ = parse_args(argv[1:])
-
-    spark_conf_dict = {
-        "hive.exec.dynamic.partition": "true",
-        "hive.exec.dynamic.partition.mode": "nonstrict",
-        "spark.dynamicAllocation.enabled": "true"
-    }
-
-    # 添加所有的--conf配置到extra_spark_conf中
-    if conf_args.__contains__('conf'):
-        spark_conf = conf_args['conf']
-        if isinstance(spark_conf, list):
-            spark_conf_dict.update(
-                dict(map(lambda kv_str: kv_str.split("="), spark_conf))
-            )
-        elif isinstance(spark_conf, str):
-            k, v = spark_conf.split("=")
-            spark_conf_dict[k] = v
-
-    td_spark = SparkSQL(
-        session_name=conf_args.get("name", argv[0]),
-        master=conf_args.get("master", "yarn"),
-        spark_yarn_queue=conf_args.get("queue", "default"),
-        spark_driver_memory=conf_args.get("driver-memory", "1g"),
-        spark_driver_cores=conf_args.get("driver-core", 1),
-        spark_executor_instances=conf_args.get("num-executors", 2),
-        spark_executor_cores=conf_args.get("executor-cores", 2),
-        spark_executor_memory=conf_args.get("executor-memory", "6g"),
-        extra_spark_config=spark_conf_dict,
-        udf_files=conf_args['py-files'].split(",") if conf_args.__contains__('py-files') else None
-    )
-    spark: SparkSession = td_spark.get_spark_session()
-    return td_spark, spark
-
-
-# if __name__ == "__main__":
-#     spark: SparkSession
-#     td_spark, spark = get_spark(sys.argv)
-#     spark.sql("show databases").show(100, truncate=False)
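For reference, the removed `td_spark_init.get_spark` helper mapped spark-submit style flags onto a SparkSession. A minimal sketch of an equivalent session setup with plain `pyspark`, assuming YARN and the same dynamic-partition defaults; the function name `build_session` and the default values are illustrative only, not project settings:

```python
# Minimal sketch: a SparkSession roughly equivalent to what the removed
# td_spark_init.get_spark() wrapper produced. Values here are illustrative
# defaults, not the project's settings.
from pyspark.sql import SparkSession


def build_session(name: str = "spark-sql-test-job",
                  queue: str = "default",
                  extra_conf: dict = None) -> SparkSession:
    builder = (SparkSession.builder
               .appName(name)
               .master("yarn")
               .config("spark.yarn.queue", queue)
               .config("hive.exec.dynamic.partition", "true")
               .config("hive.exec.dynamic.partition.mode", "nonstrict")
               .config("spark.dynamicAllocation.enabled", "true")
               .enableHiveSupport())
    for key, value in (extra_conf or {}).items():
        builder = builder.config(key, value)  # --conf style overrides
    return builder.getOrCreate()


if __name__ == "__main__":
    spark = build_session(extra_conf={"spark.sql.shuffle.partitions": "300"})
    spark.sql("show databases").show(100, truncate=False)
```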

+ 0 - 74
dw_base/spark/udf/enterprise/unique/spark_tid_match_udf.py

@@ -1,74 +0,0 @@
-"""
-批量匹配tid
-"""
-import hashlib
-import json
-from functools import lru_cache
-
-from dw_base.spark.udf.enterprise.unique.ent_offline_udf_indonesia import clean_company_name_idn, generate_tid_idn
-from dw_base.utils.tid_utils import TidGeneratorFactory
-
-mapping = {}
-
-tid_generator = TidGeneratorFactory().createTidGenerator('Enterprise')
-
-
-def generate_tid(website, name, country_code3):
-    if not name:
-        return None
-    if country_code3 in ['IDN']:
-        cleaned_name = clean_company_name_idn(name)
-        return match_tid(name, cleaned_name, country_code3)
-    else:
-        return old_generate_tid(website, name, country_code3)
-
-
-def generate_md5_hash(input_str: str):
-    md5_hash = hashlib.md5(input_str.encode('utf-8'))
-    return md5_hash.hexdigest()
-
-
-def old_generate_tid(website, name, country_code3):
-    if not name:
-        return None
-    input_str = website if website else f"{name}-{country_code3 if country_code3 else ''}"
-    return generate_md5_hash(input_str)
-
-
-def match_tid(name: str, cleaned_name: str, country: str):
-    tid = cache_tid(name, cleaned_name, country)
-    if not tid:
-        return generate_tid_idn(cleaned_name, None, None)
-    return tid
-
-
-@lru_cache(maxsize=1000000)
-def cache_tid(name: str, cleaned_name: str, country: str):
-    key = '%s--%s' % (
-        name if name else "",
-        country if country else ""
-    )
-    cleaned_key = '%s--%s' % (cleaned_name if cleaned_name else "",
-                              country if country else "")
-    tid = mapping.get(key) or mapping.get(cleaned_key)
-    if tid is None:
-        # 如果mapping里没有该tid,进行匹配
-        tid = tid_generator.match_tid(name, country)
-        if tid is None:
-            # 如果匹配结果为null,则向mapping写入一个空字符串
-            tid = tid_generator.match_tid(cleaned_name, country)
-            if tid is None:
-                mapping[key] = ''
-                mapping[cleaned_key] = ''
-            else:
-                mapping[cleaned_key] = tid
-        else:
-            mapping[key] = tid
-    elif tid == '':
-        # 对于第一次没有匹配到tid的公司,第二次进入该方法会得到一个空字符串,此时应返回null
-        return None
-    return tid
-
-
-if __name__ == '__main__':
-    print(generate_tid('', 'KENCANA LINTASINDO INTERNASIONAL', 'IDN'))

+ 0 - 99
dw_base/spark/udf/spark_id_generate_udf.py

@@ -1,99 +0,0 @@
-"""
-批量匹配tid
-"""
-import hashlib
-import json
-from functools import lru_cache
-
-from dw_base.spark.udf.enterprise.unique.ent_offline_udf_america import generate_tid_usa, clean_company_name_usa
-from dw_base.spark.udf.enterprise.unique.ent_offline_udf_india import clean_company_name_ind, generate_tid_ind
-from dw_base.spark.udf.enterprise.unique.ent_offline_udf_indonesia import clean_company_name_idn, generate_tid_idn
-from dw_base.spark.udf.enterprise.unique.ent_offline_udf_russia import clean_company_name_rus, generate_tid_rus
-from dw_base.spark.udf.enterprise.unique.ent_offline_udf_turkey import clean_company_name_tur, generate_tid_tur
-from dw_base.utils.tid_utils import TidGeneratorFactory
-
-mapping = {}
-
-tid_generator = TidGeneratorFactory().createTidGenerator('Enterprise')
-
-
-def generate_tid(website, name, country_code3):
-    if not name:
-        return None
-    if country_code3 in ['IDN']:
-        cleaned_name = clean_company_name_idn(name)
-        return match_tid(name, cleaned_name, country_code3)
-    elif country_code3 in ['USA']:
-        cleaned_name = clean_company_name_usa(name)
-        return match_tid(name, cleaned_name, country_code3)
-    elif country_code3 in ['TUR']:
-        cleaned_name = clean_company_name_tur(name)
-        return match_tid(name, cleaned_name, country_code3)
-    elif country_code3 in ['IND']:
-        cleaned_name = clean_company_name_ind(name)
-        return match_tid(name, cleaned_name, country_code3)
-    elif country_code3 in ['RUS']:
-        cleaned_name = clean_company_name_rus(name)
-        return match_tid(name, cleaned_name, country_code3)
-    else:
-        return old_generate_tid(website, name, country_code3)
-
-
-def generate_md5_hash(input_str: str):
-    md5_hash = hashlib.md5(input_str.encode('utf-8'))
-    return md5_hash.hexdigest()
-
-
-def old_generate_tid(website, name, country_code3):
-    if not name:
-        return None
-    input_str = website if website else f"{name}-{country_code3 if country_code3 else ''}"
-    return generate_md5_hash(input_str)
-
-
-def match_tid(name: str, cleaned_name: str, country: str):
-    tid = cache_tid(name, cleaned_name, country)
-    if not tid:
-        if country == 'IDN':
-            return generate_tid_idn(cleaned_name, None, None)
-        elif country == 'USA':
-            return generate_tid_usa(cleaned_name, None, None)
-        elif country == 'TUR':
-            return generate_tid_tur(cleaned_name, None)
-        elif country == 'IND':
-            return generate_tid_ind(cleaned_name, None)
-        elif country == 'RUS':
-            return generate_tid_rus(cleaned_name, None, None)
-    return tid
-
-
-@lru_cache(maxsize=1000000)
-def cache_tid(name: str, cleaned_name: str, country: str):
-    key = '%s--%s' % (
-        name if name else "",
-        country if country else ""
-    )
-    cleaned_key = '%s--%s' % (cleaned_name if cleaned_name else "",
-                              country if country else "")
-    tid = mapping.get(key) or mapping.get(cleaned_key)
-    if tid is None:
-        # 如果mapping里没有该tid,进行匹配
-        tid = tid_generator.match_tid(name, country)
-        if tid is None:
-            # 如果匹配结果为null,则向mapping写入一个空字符串
-            tid = tid_generator.match_tid(cleaned_name, country)
-            if tid is None:
-                mapping[key] = ''
-                mapping[cleaned_key] = ''
-            else:
-                mapping[cleaned_key] = tid
-        else:
-            mapping[key] = tid
-    elif tid == '':
-        # 对于第一次没有匹配到tid的公司,第二次进入该方法会得到一个空字符串,此时应返回null
-        return None
-    return tid
-
-
-if __name__ == '__main__':
-    print(generate_tid('', 'KENCANA LINTASINDO INTERNASIONAL', 'IDN'))
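The inline comments above describe a negative-cache lookup: a process-local dict remembers both matched tids and confirmed misses (stored as an empty string) so a name that failed once is not sent back to Mongo again. A stripped-down sketch of that pattern; `backend_lookup` stands in for the removed `MongoTidGenerator.match_tid`, and all names here are illustrative:

```python
# Sketch of the negative-caching lookup used by the removed UDFs: hits and
# misses are both cached so the backing store is queried at most once per key.
# `backend_lookup` stands in for the removed MongoTidGenerator.match_tid().
from typing import Callable, Dict, Optional

_MISS = ""  # sentinel stored for names with no tid in the backing store


def make_cached_matcher(backend_lookup: Callable[[str, str], Optional[str]]):
    cache: Dict[str, str] = {}

    def match(name: str, country: str) -> Optional[str]:
        key = f"{name or ''}--{country or ''}"
        if key in cache:
            return cache[key] or None       # '' sentinel means "known miss"
        tid = backend_lookup(name, country)
        cache[key] = tid if tid else _MISS  # remember misses as well as hits
        return tid

    return match


if __name__ == "__main__":
    match = make_cached_matcher(lambda n, c: None)  # backend that never matches
    print(match("ACME", "USA"))  # None; the miss is now cached
    print(match("ACME", "USA"))  # None again, without a second backend call
```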

+ 0 - 492
dw_base/spark/udf/spark_mmq_udf.py

@@ -1,12 +1,8 @@
 #!/usr/bin/env /usr/bin/python3
 # -*- coding:utf-8 -*-
-import hashlib
 import json
-import re
 from typing import List
-from urllib.parse import urlparse
 
-import requests
 from pyspark.sql.functions import udf
 from pyspark.sql.types import *
 
@@ -15,255 +11,6 @@ def array_to_json(arr: List):
     return json.dumps(arr, ensure_ascii=False)
 
 
-def get_hs_code(arr: List):
-    url = 'https://api.tendata.cn/data/customs/v1/imports/china_stat,panama,kenya,uganda,liberia,botswana,lesotho,namibia,south_africa_stat,new_zealand,australia,ivory_coast,turkey,thailand,venezuela_bol,moldova,costarica,nigeria,indonesia_stat,russia_rail,canada_stat,honduras,hongkong_stat,fiji,zimbabwe,ghana,cameroon,chad,honduras_stat,central_african_republic,maritime_silk_bol,burundi,eurasian_eu,taiwan_stat,ecuador_bol,tanzania,tanzania_tboe,bolivia_stat,spain_co,mexico,rwanda,malawi,congo_kinshasa,south_korea_co,england_stat,angola_stat,mexico_bol,nicaragua,canada,salvador_stat,salvador,guatemala,argentina,america,paraguay,brazil_stat,brazil,brazil_bol,peru,peru_exp,bolivia,ecuador,colombia,venezuela,uruguay,america_stat,chile,russia,ukraine,england,spain,european_union,eurasian_bol,cis,pakistan,pakistan_bol,south_korea,south_korea_stat,india,india_exp,vietnam,taiwan,philippines,dominica,philippines_stat,kazakhstan,kyrghyzstan,sri_lanka,uzbekistan,indonesia,japan,bangladesh,turkey_stat,thailand_stat,ethiopia/report?page=1&size=100000'
-    # 请求参数
-    params = {
-        "reportType": "hs_code",
-        "parameters": {
-            "sort": "sum_of_money,desc",
-            "reportName": "海关编码汇总报告"
-        },
-        "query": {
-            "startDate": "2023-01-01",
-            "endDate": "2023-12-31",
-            "filterBlank": False,
-            "filterLogistics": False,
-            "conditionGroups": [
-                {
-                    "conditions": [
-                        {
-                            "param": "exporter",
-                            "value": arr
-                        }
-                    ]
-                }
-            ]
-        },
-        "sortField": "trades,desc",
-        "reportName": "海关编码汇总报告"
-    }
-    headers = {
-        'x-api-key': 'l0EEiokwMKLywfwbW08ESzCzea1egvMreXmehIII'
-    }
-    try:
-        response = requests.post(url=url, json=params, headers=headers)
-        res = response.json()
-        if res:
-            items = res["results"]["items"]
-            return json.dumps(items, ensure_ascii=False)
-        return None
-    except Exception as e:
-        return None
-
-
-def get_mdg_code(arr: List):
-    url = 'https://api.tendata.cn/data/customs/v1/imports/china_stat,panama,kenya,uganda,liberia,botswana,lesotho,namibia,south_africa_stat,new_zealand,australia,ivory_coast,turkey,thailand,venezuela_bol,moldova,costarica,nigeria,indonesia_stat,russia_rail,canada_stat,honduras,hongkong_stat,fiji,zimbabwe,ghana,cameroon,chad,honduras_stat,central_african_republic,maritime_silk_bol,burundi,eurasian_eu,taiwan_stat,ecuador_bol,tanzania,tanzania_tboe,bolivia_stat,spain_co,mexico,rwanda,malawi,congo_kinshasa,south_korea_co,england_stat,angola_stat,mexico_bol,nicaragua,canada,salvador_stat,salvador,guatemala,argentina,america,paraguay,brazil_stat,brazil,brazil_bol,peru,peru_exp,bolivia,ecuador,colombia,venezuela,uruguay,america_stat,chile,russia,ukraine,england,spain,european_union,eurasian_bol,cis,pakistan,pakistan_bol,south_korea,south_korea_stat,india,india_exp,vietnam,taiwan,philippines,dominica,philippines_stat,kazakhstan,kyrghyzstan,sri_lanka,uzbekistan,indonesia,japan,bangladesh,turkey_stat,thailand_stat,ethiopia/report?page=1&size=100000'
-    # 请求参数
-    params = {
-        "reportType": "country_of_destination_code",
-        "parameters": {
-            "sort": "trades,desc",
-            "reportName": "目的国汇总报告"
-        },
-        "query": {
-            "startDate": "2023-01-01",
-            "endDate": "2023-12-31",
-            "filterBlank": False,
-            "filterLogistics": False,
-            "conditionGroups": [
-                {
-                    "conditions": [
-                        {
-                            "param": "exporter",
-                            "value": arr
-                        }
-                    ]
-                }
-            ]
-        },
-        "sortField": "trades,desc",
-        "reportName": "目的国汇总报告"
-    }
-    headers = {
-        'x-api-key': 'l0EEiokwMKLywfwbW08ESzCzea1egvMreXmehIII'
-    }
-    try:
-        response = requests.post(url=url, json=params, headers=headers)
-        res = response.json()
-        if res:
-            items = res["results"]["items"]
-            return json.dumps(items, ensure_ascii=False)
-        return None
-    except Exception as e:
-        return None
-
-
-@udf(returnType=StructType([
-    StructField("hs_code_4_list", ArrayType(StringType()), False),
-    StructField("trades_sum_list_str", StringType()),
-    StructField("trades_sum_total", IntegerType(), False),
-    StructField("sumOfMoney_sum_total", FloatType(), False),
-    StructField("sumOfMoney_sum_list_str", StringType()),
-    StructField("weight_sum_total", FloatType(), False),
-    StructField("weight_sum_list_str", StringType()),
-    StructField("quantity_sum_total", FloatType(), False),
-    StructField("quantity_sum_list_str", StringType())
-]))
-def get_hs_code_count(content: str):
-    if content:
-        content_json_arr = json.loads(content)
-        result_dict_trades_sum = {}
-        result_dict_sumOfMoney_sum = {}
-        result_dict_weight_sum = {}
-        result_dict_quantity_sum = {}
-
-        for item in content_json_arr:
-            hs_code_4 = item['__gk'][:4]
-            item_new = {
-                "hs_code": hs_code_4,
-                "trades_sum": int(item['trades_sum']),
-                "sumOfMoney_sum": round(item['sumOfMoney_sum'], 2),
-                "weight_sum": round(item['weight_sum'], 2),
-                "quantity_sum": round(item['quantity_sum'], 2)
-            }
-
-            # 处理 trades_sum
-            if hs_code_4 in result_dict_trades_sum:
-                existing_item = result_dict_trades_sum[hs_code_4]
-                existing_item["trades_sum"] += item_new["trades_sum"]
-            else:
-                result_dict_trades_sum[hs_code_4] = {"hs_code": hs_code_4, "trades_sum": item_new["trades_sum"]}
-
-            # 处理 sumOfMoney_sum
-            if hs_code_4 in result_dict_sumOfMoney_sum:
-                existing_item = result_dict_sumOfMoney_sum[hs_code_4]
-                existing_item["sumOfMoney_sum"] += item_new["sumOfMoney_sum"]
-            else:
-                result_dict_sumOfMoney_sum[hs_code_4] = {"hs_code": hs_code_4,
-                                                         "sumOfMoney_sum": item_new["sumOfMoney_sum"]}
-
-            # 处理 weight_sum
-            if hs_code_4 in result_dict_weight_sum:
-                existing_item = result_dict_weight_sum[hs_code_4]
-                existing_item["weight_sum"] += item_new["weight_sum"]
-            else:
-                result_dict_weight_sum[hs_code_4] = {"hs_code": hs_code_4, "weight_sum": item_new["weight_sum"]}
-
-            # 处理 quantity_sum
-            if hs_code_4 in result_dict_quantity_sum:
-                existing_item = result_dict_quantity_sum[hs_code_4]
-                existing_item["quantity_sum"] += item_new["quantity_sum"]
-            else:
-                result_dict_quantity_sum[hs_code_4] = {"hs_code": hs_code_4, "quantity_sum": item_new["quantity_sum"]}
-            # 对每个列表按照 "trades_sum" 的值降序排序
-        trades_sum_list = sorted(list(result_dict_trades_sum.values()), key=lambda x: x["trades_sum"], reverse=True)
-        sumOfMoney_sum_list = sorted(list(result_dict_sumOfMoney_sum.values()), key=lambda x: x["sumOfMoney_sum"],
-                                     reverse=True)
-        weight_sum_list = sorted(list(result_dict_weight_sum.values()), key=lambda x: x["weight_sum"], reverse=True)
-        quantity_sum_list = sorted(list(result_dict_quantity_sum.values()), key=lambda x: x["quantity_sum"],
-                                   reverse=True)
-        # return list(result_dict_trades_sum.values()), list(result_dict_sumOfMoney_sum.values()), list(
-        #     result_dict_weight_sum.values()), list(result_dict_quantity_sum.values())
-        hs_code_4_list = [obj['hs_code'] for obj in trades_sum_list]
-        if 'N/A' in hs_code_4_list:
-            hs_code_4_list.remove('N/A')
-        total_trades_sum = 0
-        for obj in trades_sum_list:
-            total_trades_sum += obj["trades_sum"]
-        trades_sum_tatal = int(total_trades_sum)
-        trades_sum_list_str = ",".join(
-            ['{' + f'{i["hs_code"]},贸易次数:{i["trades_sum"]}' + '}' for i in trades_sum_list])
-
-        total_sumOfMoney_sum = 0.0
-        for obj in sumOfMoney_sum_list:
-            total_sumOfMoney_sum += obj["sumOfMoney_sum"]
-        sumOfMoney_sum_tatal = round(total_sumOfMoney_sum, 2)
-        sumOfMoney_sum_list_str = ",".join(
-            ['{' + f'{i["hs_code"]},美元总价:{i["sumOfMoney_sum"]}' + '}' for i in sumOfMoney_sum_list])
-
-        total_weight_sum = 0.0
-        for obj in weight_sum_list:
-            total_weight_sum += obj["weight_sum"]
-        weight_sum_tatal = round(total_weight_sum, 2)
-        weight_sum_list_str = ",".join(
-            ['{' + f'{i["hs_code"]},千克毛重:{i["weight_sum"]}' + '}' for i in weight_sum_list])
-
-        total_quantity_sum = 0.0
-        for obj in quantity_sum_list:
-            total_quantity_sum += obj["quantity_sum"]
-        quantity_sum_tatal = round(total_quantity_sum, 2)
-        quantity_sum_list_str = ",".join(
-            ['{' + f'{i["hs_code"]},数量:{i["quantity_sum"]}' + '}' for i in quantity_sum_list])
-        return hs_code_4_list, trades_sum_list_str, trades_sum_tatal, sumOfMoney_sum_tatal, sumOfMoney_sum_list_str, weight_sum_tatal, weight_sum_list_str, quantity_sum_tatal, quantity_sum_list_str
-
-    return None, None, None, None, None, None, None, None, None
-
-
-def calculate_total_and_list(data_list, key, unit):
-    total = sum(item[key] for item in data_list)
-    formatted_list = ', '.join([f"{{{i['hs_code']},{unit}:{i[key]}}}" for i in data_list])
-    return total, formatted_list
-
-
-@udf(returnType=StructType([
-    StructField("hs_code_4", ArrayType(StringType()), False),
-    StructField("sum_str", StringType()),
-    StructField("total", FloatType(), False),
-]))
-def get_hs_code_count_str(content: str, target_key: str, unit: str):
-    if content:
-        content_json_arr = json.loads(content)
-        result_dict = {}
-
-        for item in content_json_arr:
-            hs_code_4 = item['__gk'][:4]
-            item_new = {
-                "hs_code": hs_code_4,
-                target_key: int(item[target_key]) if target_key == "trades_sum" else round(item[target_key], 2),
-            }
-
-            if hs_code_4 in result_dict:
-                existing_item = result_dict[hs_code_4]
-                existing_item[target_key] += item_new[target_key]
-            else:
-                result_dict[hs_code_4] = {"hs_code": hs_code_4, target_key: item_new[target_key], "unit": unit}
-
-        # 生成结果的列表
-        hs_code_4_list = [key for key in result_dict.keys() if key != 'N/A']
-        total, formatted_list = calculate_total_and_list(result_dict.values(), target_key, unit)
-
-        return hs_code_4_list, formatted_list, total
-
-    return None, None, None
-
-
-@udf(returnType=StructType([
-    StructField("des_ctry_code", StringType()),
-    StructField("sum_str", StringType())
-]))
-def get_destination_ctry_count(content: str):
-    result_dict_trades_sum = {}
-    if content:
-        content_json_arr = json.loads(content)
-        for item in content_json_arr:
-            des_ctry_code = item['__gk']
-            item_new = {
-                "des_ctry_code": des_ctry_code,
-                "trades_sum": int(item['trades_sum']),
-            }
-            if des_ctry_code in result_dict_trades_sum:
-                existing_item = result_dict_trades_sum[des_ctry_code]
-                existing_item["trades_sum"] += item_new["trades_sum"]
-            else:
-                result_dict_trades_sum[des_ctry_code] = {"des_ctry_code": des_ctry_code,
-                                                         "trades_sum": item_new["trades_sum"]}
-        trades_sum_list = sorted(list(result_dict_trades_sum.values()), key=lambda x: x["trades_sum"], reverse=True)
-        destination_ctry_5_list = [obj['des_ctry_code'] for obj in trades_sum_list][0:5]
-        return json.dumps(destination_ctry_5_list, ensure_ascii=False), json.dumps(trades_sum_list, ensure_ascii=False)
-    return None, None
-
-
 @udf(returnType=ArrayType(StringType()))
 def arr_str_to_arr(json_str: str) -> list:
     if json_str:
@@ -279,102 +26,6 @@ def array_slice(input_array, start, end):
     return []
 
 
-def get_union_tax_no(tax_no1, tax_no2):
-    tax_set = set()
-    if tax_no1:
-        tax_set.add(tax_no1)
-
-    if tax_no2:
-        try:
-            tax_no2_arr = json.loads(tax_no2)
-            if isinstance(tax_no2_arr, list):
-                tax_set.update(tax_no2_arr)
-            else:
-                tax_set.add(tax_no2)
-        except json.JSONDecodeError:
-            tax_set.add(tax_no2)
-    if not tax_set:
-        return None
-    return json.dumps(list(tax_set))
-
-
-@udf(ArrayType(StringType()))
-def tran_social_media(social_media):
-    if social_media:
-        try:
-            social_media_arr = json.loads(social_media)
-            if isinstance(social_media_arr, list):
-                name_link_map = {
-                    'fb': ('facebook', 'facebook.com'),
-                    'yt': ('youtube', 'youtube.com'),
-                    'li': ('linkedin', 'linkedin.com'),
-                    'gp': ('google', 'google.com'),
-                    'tw': ('twitter', 'twitter.com'),
-                    'eb': ('ebay', 'ebay.com'),
-                    'ig': ('instagram', 'instagram.com'),
-                    'wa': ('whatsapp', 'whatsapp.com'),
-                    'pi': ('pinterest', 'pinterest.com')
-                }
-
-                cleaned_social_media_arr = []
-                for social_media in social_media_arr:
-                    name = social_media.get('name')
-                    link = social_media.get('link')
-                    if name in name_link_map:
-                        new_name, new_domain = name_link_map[name]
-                        social_media['name'] = new_name
-                        social_media['link'] = link.replace(f'{name}.com', new_domain)
-
-                        # 如果是whatsapp,处理包含<br>的情况
-                        if name == 'wa':
-                            phone_numbers = link.split('<br>')
-                            for number in phone_numbers:
-                                cleaned_social_media_arr.append(json.dumps({
-                                    'name': new_name,
-                                    'link': number.strip()
-                                }))
-                        else:
-                            cleaned_social_media_arr.append(json.dumps(social_media))
-                return cleaned_social_media_arr
-        except json.JSONDecodeError:
-            return []
-    return []
-
-
-@udf(StringType())
-def clean_phone_string(input_str):
-    if not input_str:
-        return None
-
-    # 转英文逗号
-    cleaned_str = input_str.replace(',', ',')
-
-    # 将前后都是数字的逗号替换为空
-    cleaned_str = re.sub(r'(\d),(\d)', r'\1\2', cleaned_str)
-
-    # "<br>" 转换为英文逗号
-    cleaned_str = cleaned_str.replace('<br>', ',').replace('<br', ',').replace('br>', ',')
-
-    # 去除所有的空格
-    cleaned_str = cleaned_str.replace(' ', '')
-
-    # 去除所有输入法的特殊字符
-    cleaned_str = re.sub(r'[-()()]', '', cleaned_str)
-
-    # 去除“.”
-    cleaned_str = cleaned_str.replace('.', '')
-
-    # "+"号前面若是数字,增加英文逗号
-    cleaned_str = re.sub(r'(\d)\+', r'\1,+', cleaned_str)
-
-    # 去除所有的引号
-    cleaned_str = cleaned_str.replace('"', '').replace("'", '').replace("‘", '').replace("’", '').replace("“",
-                                                                                                          '').replace(
-        "”", '').replace("[", '').replace("]", '')
-
-    return json.dumps(cleaned_str.split(','))
-
-
 @udf(ArrayType(StringType()))
 def str_to_json_arr(json_str):
     if json_str:
@@ -385,146 +36,3 @@ def str_to_json_arr(json_str):
         except json.JSONDecodeError:
             return []
     return []
-
-
-def extract_domain(url):
-    if not url:
-        return None
-    if not url.startswith(('http://', 'https://')):
-        url = 'http://' + url
-    try:
-        domain = urlparse(url).netloc
-        return domain[4:] if domain.startswith('www.') else domain
-    except Exception:
-        return None
-
-
-def add_company_item(input_string: str, input_list_str: str):
-    if not input_list_str:
-        input_list = []
-    else:
-        try:
-            input_list = json.loads(input_list_str)
-            if not isinstance(input_list, list):
-                raise ValueError("input_list_str must be a JSON representation of a list")
-        except json.JSONDecodeError:
-            raise ValueError("input_list_str is not a valid JSON")
-
-    if input_string:
-        try:
-            potential_list = json.loads(input_string)
-            if isinstance(potential_list, list):
-                input_list.extend(potential_list)
-            else:
-                raise ValueError
-        except (json.JSONDecodeError, ValueError):
-            input_list.append(input_string)
-
-    unique_list = list(set(input_list))
-
-    return json.dumps(unique_list)
-
-
-def taiwan_company_status_mapping(status):
-    mapping = {
-        "撤回認許已清算完結": "Dissolved",
-        "臺中市政府": None,  # 特殊值,置为 None
-        "廢止": "Dissolved",
-        "撤銷許可": "Status unknown",
-        "廢止許可": "Dissolved",
-        "廢止登記已清算完結": "Dissolved",
-        "核准登記": "Active",
-        "廢止認許": "Dissolved",
-        "解散": "Dissolved",
-        "破產": "Bankruptcy",
-        "撤銷登記": "Dissolved",
-        "合併解散": "Dissolved (merger or take-over)",
-        "撤回登記": "Dissolved",
-        "撤銷公司設立": "Status unknown",
-        "停業": "Inactive (no precision)",
-        "破產已清算完結": "Dissolved (bankruptcy)",
-        "廢止已清算完結": "Dissolved",
-        "Dissolution / Closed / Deregistration": "Dissolved",
-        "廢止登記": "Dissolved",
-        "撤回認許": "Dissolved",
-        "核准設立,但已命令解散": "Dissolved",
-        "解散已清算完結": "Dissolved",
-        "撤銷已清算完結": "Dissolved",
-        "撤銷": "Dissolved",
-        "核准設立": "Active",
-        "Establishment approved": "Active"
-    }
-    return mapping.get(status, None)
-
-
-def clean_ven_website(website):
-    if not website:
-        return None
-
-        # 1. 过滤掉包含 @ 的网址
-    if '@' in website:
-        return None
-
-        # 2. 过滤掉没有 `.` 的网址
-    if '.' not in website:
-        return None
-
-        # 3. 清洗成域名格式,去掉 "http", "www." 并转换为小写
-    cleaned_url = website.lower()
-    cleaned_url = re.sub(r'^https?://|^https?//|^https?:\\\\|^https?:', '', cleaned_url)  # 去掉 http 或 https
-    # cleaned_url = cleaned_url.replace('http:\\\\', '')
-    cleaned_url = re.sub(r'^www\.|^wwww\.|^www\d*\.|^www,|^www//|^www/|^www |^www:|^www', '', cleaned_url)  #
-    if '.' not in cleaned_url:
-        return None
-    if not re.search(r'[a-zA-Z]', cleaned_url):
-        return None
-    cleaned_url = cleaned_url.replace('; www.antriol.com.ve', '').replace(', www.velastindari.com.ve', '')
-
-    return cleaned_url
-
-
-def common_clean_website(url):
-    if url:
-        cleaned_url = url.lower()
-        # 去除前缀符号
-        cleaned_url = re.sub(r'^[^a-z0-9]*', '', cleaned_url)
-        # 去除前缀http
-        cleaned_url = re.sub(r'^(web)?h?https?[^a-z0-9]*', '', cleaned_url)
-        # 去除前缀www
-        cleaned_url = re.sub(r'^www[0-9]*[^a-z0-9]*', '', cleaned_url)
-        cleaned_url = re.sub(r'^www[^a-z0-9]*', '', cleaned_url)
-
-        # 删除匹配符号后的内容
-        pattern = r'[?&,,/](.*)'
-        match = re.search(pattern, cleaned_url)
-        if match:
-            cleaned_url = cleaned_url[:match.start()]
-        # 去除后缀符号
-        cleaned_url = re.sub(r'[^a-z0-9]*$', '', cleaned_url)
-        # 去除后缀http
-        cleaned_url = re.sub(r'[^a-z0-9]*https?$', '', cleaned_url)
-        # 去除后缀www
-        cleaned_url = re.sub(r'[^a-z0-9]*www$', '', cleaned_url)
-
-        if '.' not in cleaned_url:
-            return None
-        if '@' in cleaned_url:
-            return None
-        if not re.search(r'[a-z]', cleaned_url):
-            return None
-        return cleaned_url
-    return None
-
-
-def format_state_name(state_name):
-    if not state_name:
-        return None
-    words = state_name.split()
-    formatted_words = [word.capitalize() for word in words]
-    return ' '.join(formatted_words)
-
-
-if __name__ == '__main__':
-    result = "http://elcore.kr"
-    print(common_clean_website(result))
-    pass

+ 0 - 96
dw_base/spark/udf/spark_read_hive_columns_cnt.py

@@ -1,96 +0,0 @@
-from dw_base.spark.spark_sql import SparkSQL
-import pandas as pd
-import os
-
-spark = SparkSQL()
-
-def get_hive_cols(tbl):
-    """
-    Args:
-        tbl: hive表名称
-    Returns:
-        返回拼接select count(col*) from table union all select count(distinct col*) from table 字符串
-    """
-    if not tbl:
-        print(f"参数异常 tbl = {tbl}")
-        return None
-    sql = f'SHOW CREATE TABLE {tbl}'
-
-    # 解析show create table输出结果
-    create_table_statement = spark.query(sql)[0].collect()[0][0]
-    fields_start_index = create_table_statement.find("(") + 1
-    fields_end_index = create_table_statement.rfind(")")
-    fields_str = create_table_statement[fields_start_index:fields_end_index]
-    fields_list = [field.split()[0] for field in fields_str.split(",")]
-    new_fields_list = [item.replace("`", "") for item in fields_list]
-    # 输出字段名称
-    select_query = ", ".join([f"COUNT({field}) as {field} " for field in new_fields_list])
-    dist_select_query = ", ".join([f"COUNT(DISTINCT {field}) as {field}" for field in new_fields_list])
-    # 字符串拼接
-    count_query = f"SELECT {select_query} FROM {tbl}"
-    distinct_count_query = f"SELECT {dist_select_query} FROM {tbl}"
-
-    return f"{count_query} UNION ALL {distinct_count_query}"
-
-
-def querySQLAndInsert2excel(sql, excel_file_path, sheet_name, mode):
-    """
-    Args:
-        sql: 传入需要跑批的结果,将结果写入到excel文件中
-        excel_file_path: 写入到linux工作目录的地址
-        sheet_name: excel的sheet
-    Returns:
-    """
-    if not sql:
-        print(f"参数异常 sql = {sql}")
-        return
-    if not sheet_name:
-        print(f"参数异常 sheet_name = {sheet_name}")
-        return
-    if not excel_file_path:
-        print(f"参数异常 excel_file_path = {excel_file_path}")
-        return
-
-    # spark sql执行结果转化为pandas excel
-    df_pandas = spark.query(sql)[0].toPandas()
-    # 覆盖写入到linux指定工作目录
-    with pd.ExcelWriter(excel_file_path, mode=mode) as writer:
-        df_pandas.to_excel(writer, sheet_name=sheet_name, index=False)
-
-
-def save2hdfs(file_path, hdfs_path):
-    """
-    Args:
-        file_path:源linux文件位置
-        hdfs_path: 写入hdfs位置
-    """
-    if file_path and hdfs_path:
-        os.system(f"hadoop fs -put -f {file_path} {hdfs_path}")
-    else:
-        print(f"参数异常 file_path = {file_path} hdfs_path = {hdfs_path}")
-
-
-if __name__ == '__main__':
-
-    file_name = "/home/dev005/tendata-warehouse/workspace/data/tables.txt"
-    file_path = "/home/dev005/tendata-warehouse/workspace/data/cnt.xlsx"
-    hdfs_path = "/user/dev005/workspace/"
-
-    first_line = True  # 设置标识,初始为 True 表示第一行
-    # 打开文件
-    with open(file_name, 'r') as file:
-        # 逐行读取文件内容并进行循环处理
-        for line in file:
-            # 获取每一行中表名称配置
-            table_name = line.strip()
-            sheet_name = table_name.split('.', 1)[1][:31] if '.' in table_name else "未找到小数点"
-            # 获取表中字段名称,并返回拼接字段
-            cols = get_hive_cols(table_name)
-            # 当为第一行时,进行excel覆盖,其他行则为追加
-            if first_line:
-                querySQLAndInsert2excel(cols, file_path, sheet_name, 'w')
-                first_line = False
-            else:
-                querySQLAndInsert2excel(cols, file_path, sheet_name, 'a')
-
-    save2hdfs(file_path, hdfs_path)

+ 0 - 47
dw_base/utils/hive_file_merge.py

@@ -1,47 +0,0 @@
-import sys
-import os
-import re
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-
-from dw_base.scheduler.mg2es.hive_sql import HiveSQL
-from dw_base.utils.config_utils import parse_args
-
-
-def merge_hdfs_file(group, db, dt):
-    hive = HiveSQL('192.168.30.3', 10000, 'alvis', None, db)
-    if group == 'cts' and db == 'dwd' and dt is not None:
-        tbl_list = hive.query("show tables")
-        for tbl in tbl_list:
-            tbl_name = tbl[0]
-            if tbl_name.startswith('cts_') and (tbl_name.endswith('_ex') or tbl_name.endswith('_im')) and 'mirror_country' not in tbl_name:
-                sql = f"alter table {db}.{tbl_name} partition (dt='{dt}') concatenate"
-                # print(sql)
-                hive.execute(sql)
-    else:
-        tbl_list = hive.query("show tables")
-        for tbl in tbl_list:
-            tbl_name = tbl[0]
-            if tbl_name.startswith(f'{group}_'):
-                partition_list = hive.query(f"show partitions {db}.{tbl_name}")
-                if partition_list is None or len(partition_list) == 0:
-                    continue
-                for partition in partition_list:
-                    pt = partition[0]
-                    pt_arr = pt.split('/')
-                    pt_str = ''
-                    for p in pt_arr:
-                        p_key = p.split('=')[0]
-                        p_value = p.split('=')[1]
-                        pt_str = f"{pt_str}{p_key}='{p_value}',"
-                    pt_str = pt_str[:-1]
-                    hive.execute(f"alter table {db}.{tbl_name} partition ({pt_str}) concatenate")
-
-if __name__ == '__main__':
-    CONFIG, _ = parse_args(sys.argv[1:])
-    group = CONFIG.get('group')
-    db = CONFIG.get('db')
-    dt = CONFIG.get('dt')
-    merge_hdfs_file(group, db, dt)

+ 0 - 169
dw_base/utils/spark_parse_json_to_hive.py

@@ -1,169 +0,0 @@
-import os
-import re
-from datetime import datetime, timedelta
-
-import requests
-import sys
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-
-from dw_base.utils.hdfs_merge_small_file import hdfs_estimate_num_partitions_absolute_path
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.config_utils import parse_args
-from dw_base.utils.log_utils import pretty_print
-
-"""
- author: HQL
- create_time:2024-04-28
- update_time:2024-04-28
- remarks:
-    该脚本用于将flume从kafka拉取到HDFS的文件,从ent_raw.ent_crawler_base表中,读取对应topic的表,自动将建过表的数据,抽取到相应表中
-    -base:【可选】默认为 ent_raw.ent_crawler_base,kafka原始数据表得位置
-    -topics:【必须】需要同步的topic,可以传入多个,使用逗号','分隔;例如-topics=topic1,topic2
-    -dt:【可选】需要同步的日期,默认是当日;例如-dt=20240401
-    -table:【可选】需要同步的数据表,当-table不为空时,只能传入一个topic;例如-table=table1
-"""
-
-# 获取spark对象
-spark = SparkSQL()
-NORM_MGT: str = '\033[0;35m'
-NORM_GRN: str = '\033[0;32m'
-
-
-def executor(topics, dt, table, base_table):
-    """
-    Args:
-        topics: hive表中kafka topic的名称
-        dt: 数据同步日期
-        table: 指定同步的数据表
-    Returns:无返回,直接执行写入程序
-    """
-    # 0.当传入table不为空,直接执行当前表的插入
-    if table:
-        exe_sql = get_execute_sql(table, topics[0], dt)
-        if exe_sql is not None:
-            spark.query(exe_sql)
-    else:
-        # 1.for循环topic,读取topic表中存在多少表
-        for topic in topics:
-            # 读取每个topic中存在多少table
-            sql = f"""
-                SELECT get_json_object(ori_json,'$.table') table_name FROM {base_table}
-                WHERE dt = '{dt}'
-                AND topic = '{topic}'
-                group by get_json_object(ori_json,'$.table')
-            """
-            tables = spark.query(sql)[0].collect()
-
-            if not tables:
-                pretty_print(f'{NORM_MGT}该topic: {topic} 在 {dt} 暂时没有数据 \n{NORM_GRN}')
-            else:
-                # 2.for循环读取每个table,获取hive中表的列信息
-                for topic_table in tables:
-                    topic_table = topic_table[0].replace('ods_', 'ent_')
-                    exe_sql = get_execute_sql(topic_table, topic, dt)
-                    if exe_sql is not None:
-                        spark.query(exe_sql)
-
-
-def get_execute_sql(tbl, topic, dt):
-    """
-    Args:
-        tbl: hive表名称
-    Returns:
-        返回拼接select cols from tablename where dt = {dt} and topic = {topic} lateral view json_tuple(ori_json,cols) parse_json as cols
-    """
-    if not tbl:
-        pretty_print(f'{NORM_MGT}参数异常 tbl = {tbl}\n{NORM_GRN}')
-        return None
-    sql = f'DESC ent_ods.{tbl}'
-
-    # 解析show create table输出结果
-    try:
-        spark.query(sql)[0]
-    except Exception as e:
-        pretty_print(f'{NORM_MGT}未发现此表, 请建表 {tbl} 后执行\n{NORM_GRN}')
-        # 异常信息写入钉钉
-        response_text = f"异常播报:\n\t未发现ent_ods.{tbl} 数据表,请先建表后使用"
-        dingtalk(response_text)
-    else:
-        # 解析字段描述信息,提取字段名称
-        column_names = []
-        for row in spark.query(sql)[0].collect():
-            if row.col_name == "# Partition Information":
-                break
-            if row.col_name == "dt":
-                continue
-            # if row.col_name in ('_id', 'date', 'desc'):
-            #     column_names.append("`" + row.col_name + "`")
-            else:
-                column_names.append(row.col_name)
-
-        # 拼接字段名称
-        select_query = ",".join(column_names)
-        lateral_select_query = ",".join([f"'{column}'" for column in column_names])
-        kafka_table = tbl.replace('ent_', 'ods_')
-
-        sql = f"""
-            INSERT OVERWRITE TABLE ent_ods.{tbl} PARTITION (dt={dt})
-            SELECT {select_query} FROM ( SELECT ori_json FROM {base_table} 
-            WHERE dt = '{dt}' 
-            and topic = '{topic}' 
-            and get_json_object(ori_json, '$.table') = '{kafka_table}' ) t
-            LATERAL VIEW json_tuple(ori_json, {lateral_select_query}) parse_json AS {select_query}
-        """
-
-        return sql
-
-
-def dingtalk(response_text):
-    """
-    Args:
-        response_text:写入钉钉机器人的内容
-    Returns:无返回
-    """
-    webhook_url = 'http://m1.node.cdh/dingtalk/api/robot/send?access_token=166d3462282cb6382ef88e7b67d9e06903095172612d44d8a7b94b5ab96976e2'
-    # 构建发送到钉钉机器人的 JSON 数据
-    json_data = {
-        "msgtype": "text",
-        "text": {
-            "content": response_text
-        },
-        "at": {
-            "atMobiles": ['15333978057'],
-            "isAtAll": False
-        }
-    }
-    headers = {"Content-Type": "application/json"}
-    # 发送 HTTP POST 请求到钉钉机器人
-    response = requests.post(webhook_url, json=json_data, headers=headers)
-
-
-if __name__ == '__main__':
-    # 解析命令行参数
-    CONFIG, _ = parse_args(sys.argv[1:])
-    base_table = CONFIG.get('base')
-    topics = CONFIG.get('topics')
-    dt = CONFIG.get('dt')
-    table = CONFIG.get('table')
-
-    if base_table is None or base_table == '':
-        base_table = 'ent_raw.ent_crawler_base'
-
-    if topics is None or topics == '':
-        pretty_print(f'{NORM_MGT}请输入正确的topic名称!\n{NORM_GRN}')
-        pretty_print(f'{NORM_MGT}-dt=topic1, topic2\n{NORM_GRN}')
-        sys.exit()
-    topics = topics.split(',')
-    if dt is None:
-        dt = (datetime.now() - timedelta(days=1)).strftime("%Y%m%d")
-    if table:
-        if len(topics) > 1:
-            pretty_print(f'{NORM_MGT}当传入-table时, 必须只能传入一个-topic\n{NORM_GRN}')
-            pretty_print(f'{NORM_MGT}-topic = {topics} -table = {table}\n{NORM_GRN}')
-            sys.exit()
-
-    # 执行插入脚本
-    executor(topics, dt, table, base_table)

+ 0 - 92
dw_base/utils/tid_utils.py

@@ -1,92 +0,0 @@
-# coding=utf-8
-"""
-udf
-"""
-import json
-import logging
-
-from dw_base.database.mongodb_utils import MongoDBHandler
-
-
-class TidGenerator(object):
-    def match_pid(self,
-                  company_name: str, country: str) -> str:
-        raise Exception("not implemented yet")
-
-
-class MongoTidGenerator(TidGenerator):
-    def __init__(self):
-        self.tid_field = None
-        self.alias_field = None
-        self.country_field = None
-        self.company_aliases = None
-        logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
-
-    def match_tid(self,
-                  company_name: str, country: str) -> str:
-        pid = None
-        if company_name:
-            pid = self.__match_by_company_name(company_name, country)
-        return pid
-
-    def __match_by_company_name(self,
-                                company_name: str, country: str) -> str:
-        if not company_name:
-            return None
-
-        documents = []
-        find_result_by_name = self.company_aliases.find({self.alias_field: {"$eq": company_name}})
-        for document in find_result_by_name:
-            tid_value = document.get(self.tid_field)
-            if tid_value and tid_value[:3] == country:
-                documents.append(document)
-        if len(documents) > 0:
-            max_document = max(documents, key=lambda x: x.get(self.tid_field, 0))
-            return max_document.get(self.tid_field)
-        return None
-
-
-class EnterpriseTidGenerator(MongoTidGenerator):
-    def __init__(self):
-        super().__init__()
-        self.uri = 'mongodb://tendata_corp:TD_corpqyk22@192.168.11.27:21868/?authSource=tendata_corp'
-        self.database = "tendata_corp"
-        self.collection_alias = "company_aliases"
-        self.tid_field = 'tid'
-        self.alias_field = 'alias'
-        self.country_field = 'country_code3'
-        self.mongo_client = MongoDBHandler(self.uri).mongo_client
-        self.company_aliases = self.mongo_client.get_database(self.database).get_collection(self.collection_alias)
-
-
-class HBaseTidGenerator(TidGenerator):
-    def __init__(self):
-        raise Exception("not implemented yet")
-
-
-# 自定义异常类
-class UnsupportedDimensionError(Exception):
-    def __init__(self, dimension):
-        self.dimension = dimension
-        super().__init__(f"Not supported generator dimension: {dimension}")
-
-
-class TidGeneratorFactory(object):
-
-    @staticmethod
-    def createTidGenerator(dimension: str):
-        if dimension is None:
-            raise ValueError("Dimension cannot be None")
-
-        switch_generator_dict = {
-            'Enterprise': EnterpriseTidGenerator(),
-        }
-
-        if dimension not in switch_generator_dict:
-            raise UnsupportedDimensionError(dimension)
-
-        return switch_generator_dict[dimension]
-
-
-if __name__ == '__main__':
-    pass

+ 1 - 4
kb/00-项目架构.md

@@ -71,15 +71,13 @@ poyee-data-warehouse/              # 项目根目录(仓库名 = 部署名)
 |------|-----------|------|
 | 全局初始化 | `dw_base/__init__.py` | 环境检测、颜色常量、findspark 初始化、用户/权限判断 |
 | SparkSQL 引擎 | `dw_base/spark/spark_sql.py` | SparkSession 管理、UDF 注册、SQL 执行、数据导出 |
-| Spark 快捷初始化 | `dw_base/spark/td_spark_init.py` | 类 spark-submit 风格的 Session 创建 |
 | UDF 库 | `dw_base/spark/udf/` | 按业务线分类的 Spark 自定义函数 |
 | DataX 引擎 | `dw_base/datax/` | ini 配置解析 → json 作业文件生成 |
 | DataX 数据源 | `dw_base/datax/datasources/` | 各类数据源的连接参数抽象 |
 | DataX 插件 | `dw_base/datax/plugins/` | Reader/Writer 工厂 + 各数据源实现 |
 | 数据库工具 | `dw_base/database/` | MongoDB、MySQL 原生客户端封装 |
-| 调度辅助 | `dw_base/scheduler/` | 钉钉/企微通知、轮询调度、分区清理等 |
+| 调度辅助 | `dw_base/scheduler/` | 轮询调度、分区清理等(告警模块已删,待重写) |
 | Hive 工具 | `dw_base/hive/` | DDL 生成、库表命名规则 |
-| DS 工作流 | `dw_base/ds/` | DolphinScheduler REST API 触发工作流 |
 | 通用工具 | `dw_base/utils/` | 参数解析、日期、文件、日志、SQL 解析、字符串等 |
 
 ## 3. 模块关系图
@@ -538,7 +536,6 @@ jobs/
 | Spark 单作业覆盖 | 对应 `jobs/*.sql` 文件内 `SET spark.x.y=z` | 是 | 开发   |
 | 环境变量 / 路径 | `dw_base/__init__.py`、`bin/common/init.sh` | 是(待改为conf) | 开发   |
 | 告警 Webhook | `dw_base/common/alerter_constants.py` | 是(待改为conf) | 开发   |
-| DS 工作流配置 | `dw_base/ds/config/*.yaml` | 是 | 开发   |
 
 ### 6.2 Spark 参数优先级(三级覆盖)
 

+ 2 - 4
kb/90-重构路线.md

@@ -45,9 +45,7 @@
 | `DATAX_WORKERS=(m3 d1 d2 d3 d4)` + `DATAX_WORKERS_WEIGHTS` 权重 map | `init.sh:18-31`(含展开 `DATAX_WORKERS_QUEUE` 的循环) | workers 列表 + 权重 map **整体**移入 `conf/workers.conf`(ini 或 yaml 格式),`init.sh` 仅保留读取 + 展开逻辑 |
 | `HADOOP_CONF_DIR='/etc/hadoop/conf'` | `__init__.py` | 使用系统环境变量 |
 | `LOG_ROOT_DIR="/opt/data/log"` + whoami 分流 | `init.sh`、`__init__.py` | 删除 whoami 分支,单值改为 `${HOME}/log` 并迁入 `conf/env.sh`,见 §7.2.1 |
-| 钉钉 access_token | `dingtalk_notifier.py` | 移入 `conf/alerter.conf`(敏感项) |
-| 企微 Webhook Key | `dw_base/common/alerter_constants.py` | 外移到 `conf/alerter.ini`(**入库**——部署靠 git pull,gitignore 会拉不到;webhook key 不算高敏感,最多被拿去发垃圾消息),Python 侧改 ConfigParser 加载;`alerter_constants.py` 整个删除 |
-| DS API 地址 | `ds/config/base_config.yaml` | 已在 yaml,保持即可 |
+| 告警 Webhook(钉钉 / 企微 Key) | `dw_base/common/alerter_constants.py`(老告警模块已于 2026-04-20 删除,含 `dingtalk_notifier.py` / `ent_interface_dingtalk*` / `bin/dingtalk-work-alert.sh`) | 新告警模块重写时 Webhook Key 外移到 `conf/alerter.ini`(**入库**——部署靠 git pull,gitignore 会拉不到;webhook key 不算高敏感,最多被拿去发垃圾消息),Python 侧改 ConfigParser 加载;`alerter_constants.py` 整个删除;新项目不再使用钉钉 |
 | Spark 默认参数(executor/driver/shuffle/sql.*) | `dw_base/spark/spark_sql.py` 构造函数 + `.config(...)` 链 | 移入 `conf/spark-defaults.yaml`,SQL 文件可用 `SET` 覆盖,见 §2.3 |
 | DataX ini 路径前缀剥离 `conf/datax/config/` | `bin/datax-single-job-starter.sh`(TEMP 处理)、`bin/datax-job-config-generator.py`(`replace('conf/datax/config/', '')`)、`bin/datax-multiple-job-starter.sh`(日志路径派生) | 原目录已整体挪到 `conf/bak/` 并 gitignore,脚本里 replace 现在是 no-op 死逻辑。去除前缀假设,改为靠 ini 文件名(= 任务唯一标识,见 `21-命名规范.md` §3.9)识别用途 |
 | DataX 生成 JSON 输出目录名 `conf/datax/generated` | `bin/datax-job-config-generator.py` 末尾 `default_output_dir`、`bin/datax-single-job-starter.sh` 第 89/118 行、`bin/datax-multiple-job-starter.sh` 第 187 行、`.gitignore` | 目录改名 `conf/datax-json/`;子路径扁平化为 `conf/datax-json/{env}/{ini_basename}.json`(仅按 env 分一级,去掉 src_dst / project_layer_env 等派生层级);`.gitignore` 同步改 |
@@ -549,7 +547,7 @@ tests/
 
 **后续事项**:
 
-- LAZY 类依赖关联的老代码(`tendata/scheduler/get_oldmongo_*`、`mg2es/`、`ent_interface_dingtalk*`、`customs/similarity.py`、`tendata/oss/oss2_util.py`、`tendata/utils/excel_to_hive_utils.py`)在阶段 4 / 阶段 5 清理废弃代码时一并删除,删完后即可彻底告别这些弱依赖
+- LAZY 类依赖关联的老代码:`get_oldmongo_*` / `mg2es/` / `ent_interface_dingtalk*` 已于 2026-04-20 提前清理(见 92-进度 变更记录);剩余 `customs/similarity.py`、`dw_base/oss/oss2_util.py`、`dw_base/utils/excel_to_hive_utils.py` 等在阶段 4 / 阶段 5 一并清理
 - 不需要 `requirements-base.txt` / `requirements-dev.txt` 分文件——当前依赖规模下单文件已经足够
 - pyspark 2.4.0 暂保留(CDH 集群一致),等集群升级再一并上调
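The hardcode table above proposes moving the webhook key into `conf/alerter.ini` and loading it with ConfigParser once the alert module is rewritten. A minimal sketch of that loading step; the `[wechat_work]` section name and `webhook_key` option are assumptions, since the file schema has not been decided yet:

```python
# Sketch of reading a webhook key from conf/alerter.ini with ConfigParser,
# as proposed in the refactor notes. The section/option names used here
# ([wechat_work] / webhook_key) are assumptions, not a decided schema.
from configparser import ConfigParser
from pathlib import Path


def load_webhook_key(conf_path: str = "conf/alerter.ini",
                     section: str = "wechat_work") -> str:
    parser = ConfigParser()
    if not parser.read(Path(conf_path), encoding="utf-8"):
        raise FileNotFoundError(f"alerter config not found: {conf_path}")
    return parser.get(section, "webhook_key")


if __name__ == "__main__":
    key = load_webhook_key()
    print(f"loaded webhook key ending in ...{key[-4:]}")
```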
 

+ 4 - 1
kb/92-重构进度.md

@@ -42,7 +42,7 @@
 - [x] 全局替换 SQL 中的 `ADD FILE tendata/...` → `ADD FILE dw_base/...`(2026-04-15)
 - [x] 全局替换 `zip -qr tendata.zip tendata` → `zip -qr dw_base.zip dw_base`(2026-04-15,spark_sql.py f-string 形式已手工修正)
 - [x] 全局替换 `addPyFile('tendata.zip')` → `addPyFile('dw_base.zip')`(2026-04-15,publish.sh 同步更新)
-- [ ] 全局替换路径正则 `re.sub(r"tendata-warehouse.*", ...)` → 使用新项目名(绑定仓库改名,~40 处待处理:dw_base/scheduler/*、dw_base/utils/*、bin/doris-*-starter.py、bin/hive-exec.sh)
+- [ ] 全局替换路径正则 `re.sub(r"tendata-warehouse.*", ...)` → 使用新项目名(绑定仓库改名,2026-04-20 老业务文件批量清理后剩余约 15 处:`dw_base/scheduler/polling_scheduler.py` / `drop_*.py`、`dw_base/utils/*`、`dw_base/ds/ds_start_workflow.py`(已删)、`bin/doris-*-starter.py`、`dw_base/spark/udf/customs/company_abbr.py`)——另有 `dw_base/utils/diff_utils.py` 的 `target_folder='tendata-warehouse'` 字符串字面量、`dw_base/spark/td_spark_init.py`(已删)docstring 等非 re.sub 形式的引用一并处理
 - [x] 排查 `tendata_corp` 等数据库名/表名引用,**确认不要误替换**(2026-04-15,已确认保留:`tendata_corp`、`tendata_bigdata256!`、`ent_tendata_interface`、`api.tendata.cn`)
 - [x] 新建 `jobs/` 目录 + `jobs/{raw,ods,dim,dwd,dws,tdm,ads}/` 子目录(2026-04-15,已放 `.gitkeep`,`dim/` 为顶层独立分层)
 - [x] 新建 `manual/` 目录 + 5 个子目录(`ddl/`、`backfill/`、`fix/`、`adhoc/`、`archive/`)(2026-04-15,已放 `.gitkeep`;`manual/ddl/` 是所有 DDL 的唯一来源)
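The checklist item above tracks replacing the `re.sub(r"tendata-warehouse.*", ...)` root-path detection (visible in several files deleted by this commit) with something that does not hard-code the repository name. One possible shape, sketched below: walk upward from `__file__` until a marker is found; using the `dw_base` package directory as the marker is an assumption, not a decided convention:

```python
# Sketch of a repo-name-agnostic replacement for the old
# re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath) pattern:
# walk upward from the current file until a directory containing the
# `dw_base` package is found. The marker choice is illustrative only.
import sys
from pathlib import Path


def find_project_root(start: Path = None, marker: str = "dw_base") -> Path:
    current = (start or Path(__file__)).resolve()
    for parent in [current] + list(current.parents):
        if (parent / marker).is_dir():
            return parent
    raise RuntimeError(f"project root not found (no '{marker}' directory above {current})")


ROOT_PATH = find_project_root()
if str(ROOT_PATH) not in sys.path:
    sys.path.append(str(ROOT_PATH))
```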
@@ -111,6 +111,8 @@
 - [ ] DataX 配置生成单测
 - [ ] `__contains__` → `in` 全局替换
 - [ ] 删除废弃空模块和注释代码
+- [ ] **重新实现 Hive HDFS 小文件合并工具**:原 `dw_base/utils/hive_file_merge.py`(2026-04-20 随老业务批量清理一并删除)提供 `alter table ... partition (...) concatenate` 压实能力,但硬编码了老 HiveServer 连接 / `cts_*_ex/_im` 表名规则 / `mirror_country` 过滤。新版需通用化:HiveServer 连接从 `conf/` 读取、表过滤参数化,剥离业务命名假设
+- [ ] **重写告警模块**:老钉钉告警文件(`dingtalk_*` / `ent_interface_dingtalk*` / `country_count_dingtalk` / `spark_parse_json_to_hive` 里的 `dingtalk()` / `bin/dingtalk-work-alert.sh`)已于 2026-04-20 全部删除;新项目不再使用钉钉,Webhook Key 走 `conf/alerter.ini`(见 `90-重构路线.md §2.1`)
 - [ ] Spark / HMS 侧 Ranger Hive 策略验证(低优先级,见 `90-重构路线.md` §7.5)
 - [x] 精简 `requirements.txt`(2026-04-15 提前完成:48 行 → 10 个强依赖,老清单备份到 `requirements.txt.bak` 并逐行打标)
 
@@ -153,3 +155,4 @@
 | 2026-04-18 | **§2.8 改造降级为"条件触发"**(第三轮修正):用户提供老项目真实生产 json 样例显示只写 `defaultFS`(无 `hadoopConfig`)也能跑 HA —— 说明老 worker 节点 `hdfs-site.xml` 配置完整,`hadoopConfig` 是**可选覆盖**而非 HA 必要条件。前两轮论断("必须加 `hadoopConfig`"、"运维把 xml 写死单 NN")都被推翻。§2.8 加"新环境 HDFS HA 自检清单"(`echo $HADOOP_CONF_DIR` / grep xml HA keys / `hadoop fs -ls hdfs://nameservice1/`),三项全过则整节改造不做;仅任何一项失败才启动 ini schema 升级 + `HDFSDataSource` 改造。92 阶段 2 checklist 相应改为"自检前置 + 条件触发"4 条子项 | — |
 | 2026-04-18 | **§2.8 锁定 Path B(第四轮,实测决定)**:新 CDH 环境三连实测(json 含/不含 `hadoopConfig` × `HADOOP_CONF_DIR` 设/不设),结论:对 DataX JVM,仅 json 的 `hadoopConfig` 块有效,`HADOOP_CONF_DIR` 无效(`datax.py` 不把 conf 目录入 classpath,与 `hadoop` 命令行不同)。老项目能纯 `defaultFS` 跑通最可能是老运维把 `hdfs-site.xml` 塞进了 DataX classpath 目录,新环境 `/opt/datax` 没这类预置文件。改造要点:(a) `HDFSDataSource.get_datasource_dict()` 吃 `[hadoop_config]` 整节注入 `hadoopConfig`;(b) 删除 `dw_base/__init__.py:16` `os.environ['HADOOP_CONF_DIR']` 死代码。简化 §2.8 文本:去掉 `ha_enabled` 开关(用 `[hadoop_config]` 节存在性代替)、去掉自检决策树(已决定)、去掉"运维手工改 IP"误记 | — |
 | 2026-04-20 | **§7.2.1 再次反转**:删除 `whoami == RELEASE_USER` 分流,`LOG_ROOT_DIR` 改为单值默认 `${HOME}/log` 并保留在 `conf/env.sh`(外配后期可改)。理由:`$HOME` 天然按用户隔离(bigdata/个人用户家目录不同),代码判断是多余一层;`bigdata` 本身就是专属调度账号,其 `$HOME` 即是生产日志合法归宿,不需要系统级 `/opt/data/log` 那条路。同步更新 `90-重构路线.md §7.2.1`(核心段)+ `§2.1 硬编码表行` + `§2.4 env.sh 草稿` + `00-项目架构.md §6 部署段` + `92 阶段 2 checklist` | — |
+| 2026-04-20 | **老业务耦合代码批量清理(重构计划外)**:排查 `tendata` 残留时发现一批与 `tendata_corp` / `ent_tendata_interface` / DolphinScheduler / 钉钉告警强耦合的存量文件,逐项核对后批量删除 40 个文件 + 精简 1 个:**老业务模块 34**(`dw_base/scheduler/` 下 `get_oldmongo_*` ×5、`dingtalk_*` / `ent_interface_dingtalk*` / `country_count_dingtalk` / `mg_company_alias_init` ×10、`mg2es/` 整目录 11 文件;`dw_base/ds/` 整目录 4 文件;`dw_base/spark/udf/spark_read_hive_columns_cnt.py`;`dw_base/utils/tid_utils.py`;`dw_base/spark/td_spark_init.py`(老同事 xunxu 所写未被调用);`bin/hive-exec.sh`),**级联清理 6**(`dw_base/spark/udf/spark_id_generate_udf.py` + `dw_base/spark/udf/enterprise/unique/spark_tid_match_udf.py` 依赖已删 `tid_utils`;`dw_base/utils/hive_file_merge.py` + `dw_base/utils/spark_parse_json_to_hive.py` 依赖已删 `mg2es`/钉钉告警;`bin/hive-exec-job-starter.py` 调用已删 `hive-exec.sh`;`bin/dingtalk-work-alert.sh`),**精简 1**:`dw_base/spark/udf/spark_mmq_udf.py` 从 530 行裁到 4 个数据类型转换函数(phone/domain/website/statname 等场景相关 UDF 与 Mongo 相关逻辑全删)。同步更新:`00-项目架构.md`(移除 `td_spark_init` / DS 相关条目)、`90-重构路线.md`(钉钉 + 企微 Webhook 合并表述、删除 DS API 行、§5.2 依赖清理清单标记提前完成)、`92-进度.md` 阶段 1 第 6 行 `re.sub` checklist 更新残留范围(~15 处)。**阶段 4 新增两项任务**:(1) 重新实现 Hive HDFS 小文件合并工具(通用化连接 / 剥离 `cts_*_ex/_im` 表名假设);(2) 重写告警模块(弃钉钉走 `conf/alerter.ini` Webhook) | — |
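
The phase-4 task recorded above calls for a generalized rewrite of the deleted `hive_file_merge.py`: connection settings read from `conf/`, the table filter supplied by the caller, and the same `ALTER TABLE ... PARTITION (...) CONCATENATE` compaction. A rough sketch of that shape; the duck-typed `hive` client (with `query`/`execute`) and the `FakeHive` dry-run stub are assumptions, and the conf-driven connection setup is left out:

```python
# Sketch of a generalized small-file merge, per the phase-4 task above:
# the table filter is a caller-supplied predicate instead of the old
# hard-coded cts_*_ex/_im rule, and the client is passed in (its connection
# would come from conf/). The client interface (query/execute) is assumed.
from typing import Callable, Optional


def merge_partitions(hive, db: str,
                     table_filter: Callable[[str], bool],
                     partition_spec: Optional[str] = None) -> None:
    """Run ALTER TABLE ... CONCATENATE on every matching table/partition."""
    for (table_name,) in hive.query(f"show tables in {db}"):
        if not table_filter(table_name):
            continue
        if partition_spec:  # e.g. "dt='20260420'" for a single known partition
            hive.execute(f"alter table {db}.{table_name} "
                         f"partition ({partition_spec}) concatenate")
            continue
        for (partition,) in hive.query(f"show partitions {db}.{table_name}") or []:
            spec = ", ".join(f"{k}='{v}'" for k, v in
                             (kv.split("=", 1) for kv in partition.split("/")))
            hive.execute(f"alter table {db}.{table_name} "
                         f"partition ({spec}) concatenate")


if __name__ == "__main__":
    class FakeHive:  # dry-run stub standing in for a real HiveServer2 client
        def query(self, sql):
            print("QUERY:", sql)
            return [("dwd_demo_table",)] if sql.startswith("show tables") else [("dt=20260420",)]

        def execute(self, sql):
            print("EXEC :", sql)

    merge_partitions(FakeHive(), "dwd", table_filter=lambda t: t.startswith("dwd_"))
```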