
refactor(spark/udf): consolidate common UDFs into a single file, remove legacy business UDF directories

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tianyu.chu, 2 weeks ago
parent
commit
c6db00f6cf
74 changed files with 331 additions and 12757 deletions
  1. +1 -1      dw_base/__init__.py
  2. +3 -0      dw_base/spark/udf/business/__init__.py
  3. +2 -0      dw_base/spark/udf/common/__init__.py
  4. +320 -225  dw_base/spark/udf/common/spark_common_udf.py
  5. +0 -0      dw_base/spark/udf/contacts/__init__.py
  6. +0 -419    dw_base/spark/udf/contacts/ctc_common.py
  7. +0 -33     dw_base/spark/udf/contacts/test/ctc_common_test.py
  8. +0 -14     dw_base/spark/udf/contacts/test/ent_logistics_label_test.py
  9. +0 -0      dw_base/spark/udf/customs/__init__.py
  10. +0 -82    dw_base/spark/udf/customs/calculation_repetition_trade.py
  11. +0 -130   dw_base/spark/udf/customs/clean_crawler_data.py
  12. +0 -257   dw_base/spark/udf/customs/common_clean.py
  13. +0 -253   dw_base/spark/udf/customs/common_clean2.py
  14. +0 -1556  dw_base/spark/udf/customs/company_abbr.py
  15. +0 -45    dw_base/spark/udf/customs/cts_common.py
  16. +0 -21    dw_base/spark/udf/customs/india_xx_restoration.py
  17. +0 -24    dw_base/spark/udf/customs/indonesia_qymc_judge.py
  18. +0 -16    dw_base/spark/udf/customs/mirror.py
  19. +0 -42    dw_base/spark/udf/customs/similarity.py
  20. +0 -23    dw_base/spark/udf/customs/str_trans_eng.py
  21. +0 -22    dw_base/spark/udf/customs/test/clean_crawler_data_test.py
  22. +0 -13    dw_base/spark/udf/customs/test/company_abbr_test.py
  23. +0 -14    dw_base/spark/udf/customs/test/cts_common_test.py
  24. +0 -26    dw_base/spark/udf/customs/test/indonesia_qymc_judge_test.py
  25. +0 -52    dw_base/spark/udf/customs/tjsldw.py
  26. +0 -0     dw_base/spark/udf/enterprise/__init__.py
  27. +0 -143   dw_base/spark/udf/enterprise/ent_clean_name_logistics.py
  28. +0 -561   dw_base/spark/udf/enterprise/ent_clean_text.py
  29. +0 -277   dw_base/spark/udf/enterprise/ent_company_abbr.py
  30. +0 -24    dw_base/spark/udf/enterprise/ent_india_offline_udf.py
  31. +0 -70    dw_base/spark/udf/enterprise/ent_logistics_label.py
  32. +0 -167   dw_base/spark/udf/enterprise/ent_spider_clean.py
  33. +0 -386   dw_base/spark/udf/enterprise/spark_eng_ent_ctstel_clean.py
  34. +0 -273   dw_base/spark/udf/enterprise/spark_eng_ent_date_clean_indonesia.py
  35. +0 -34    dw_base/spark/udf/enterprise/spark_eng_ent_json_array_append_udf.py
  36. +0 -762   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_america.py
  37. +0 -666   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_common.py
  38. +0 -132   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_compant.py
  39. +0 -307   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_germany.py
  40. +0 -312   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_hongkong.py
  41. +0 -321   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_indonesia.py
  42. +0 -309   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_italy.py
  43. +0 -311   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_japan.py
  44. +0 -308   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_malaysia.py
  45. +0 -307   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_south_korea.py
  46. +0 -307   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_taiwan.py
  47. +0 -308   dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_uae.py
  48. +0 -50    dw_base/spark/udf/enterprise/spark_eng_ent_shareholder_clean_russia.py
  49. +0 -156   dw_base/spark/udf/enterprise/test/ent_clean_text_test.py
  50. +0 -48    dw_base/spark/udf/enterprise/test/ent_india_offline_udf_test.py
  51. +0 -100   dw_base/spark/udf/enterprise/test/spark_eng_ent_ctstel_clean_test.py
  52. +0 -180   dw_base/spark/udf/enterprise/unique/ent_offline_udf_america.py
  53. +0 -113   dw_base/spark/udf/enterprise/unique/ent_offline_udf_india.py
  54. +0 -91    dw_base/spark/udf/enterprise/unique/ent_offline_udf_indonesia.py
  55. +0 -182   dw_base/spark/udf/enterprise/unique/ent_offline_udf_russia.py
  56. +0 -90    dw_base/spark/udf/enterprise/unique/ent_offline_udf_turkey.py
  57. +0 -4     dw_base/spark/udf/main_test.py
  58. +0 -142   dw_base/spark/udf/product/cpc_clean_udf.py
  59. +0 -19    dw_base/spark/udf/product/cpms_lang_detect.py
  60. +0 -5     dw_base/spark/udf/product/escape_udf.py
  61. +0 -76    dw_base/spark/udf/product/inflect_udf.py
  62. +0 -38    dw_base/spark/udf/product/spark_string_retrieval_trie.py
  63. +0 -52    dw_base/spark/udf/productApplication/cts_data_clean.py
  64. +0 -34    dw_base/spark/udf/solr_similar_match_udf.py
  65. +0 -666   dw_base/spark/udf/spark_eng_ent_name_clean.py
  66. +0 -132   dw_base/spark/udf/spark_india_format_phone_udf.py
  67. +0 -516   dw_base/spark/udf/spark_json_array_udf.py
  68. +0 -38    dw_base/spark/udf/spark_mmq_udf.py
  69. +0 -188   dw_base/spark/udf/test/common_clean.py
  70. +0 -259   dw_base/spark/udf/test/d2str.py
  71. +0 -20    dw_base/spark/udf/test/test_common_clean.py
  72. +1 -1     kb/00-项目架构.md
  73. +2 -3     kb/90-重构路线.md
  74. +2 -1     kb/92-重构进度.md

+ 1 - 1
dw_base/__init__.py

@@ -24,7 +24,7 @@ PROJECT_ROOT_PATH = os.path.abspath(os.path.dirname(os.path.dirname(__file__)))
 PROJECT_NAME = os.path.basename(PROJECT_ROOT_PATH)
 sys.path.append(PROJECT_ROOT_PATH)
 # Common Spark UDF file
-COMMON_SPARK_UDF_FILE = 'dw_base/spark/udf/spark_common_udf.py'
+COMMON_SPARK_UDF_FILE = 'dw_base/spark/udf/common/spark_common_udf.py'
 BANNED_USER = 'root'
 RELEASE_USER = 'alvis'
 USER = os.environ['USER']
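
The constant change above moves the common UDF file to `dw_base/spark/udf/common/`. As a minimal sketch of how an entry point could turn this constant into the SparkSQL registration statement (the deployment root and the helper function here are hypothetical, not from the repo; only `COMMON_SPARK_UDF_FILE` comes from the diff):

```python
import os

# From the diff; the rest of this sketch is illustrative.
COMMON_SPARK_UDF_FILE = 'dw_base/spark/udf/common/spark_common_udf.py'
PROJECT_ROOT_PATH = '/opt/dw'  # assumed deployment root, hypothetical


def add_file_statement(root: str, rel_path: str) -> str:
    # Build the "ADD FILE <path>" statement SparkSQL uses to ship a Python UDF module.
    return f'ADD FILE {os.path.join(root, rel_path)}'


print(add_file_statement(PROJECT_ROOT_PATH, COMMON_SPARK_UDF_FILE))
# → ADD FILE /opt/dw/dw_base/spark/udf/common/spark_common_udf.py
```

Business UDFs under `dw_base/spark/udf/business/` would be loaded the same way, but only from the jobs that need them.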

+ 3 - 0
dw_base/spark/udf/business/__init__.py

@@ -0,0 +1,3 @@
+#!/usr/bin/env /usr/bin/python3
+# -*- coding:utf-8 -*-
+# Business-specific UDF directory: loaded explicitly via ADD FILE in SQL as needed; not auto-registered

+ 2 - 0
dw_base/spark/udf/common/__init__.py

@@ -0,0 +1,2 @@
+#!/usr/bin/env /usr/bin/python3
+# -*- coding:utf-8 -*-

+ 320 - 225
dw_base/spark/udf/spark_common_udf.py → dw_base/spark/udf/common/spark_common_udf.py

@@ -1,104 +1,55 @@
 #!/usr/bin/env /usr/bin/python3
 # -*- coding:utf-8 -*-
+"""
+Common UDFs: business-agnostic data type / format operations (JSON / Array / String / Numeric / Date / Hash).
+Auto-registered via ADD FILE at the SparkSQL entry point; business-specific UDFs go under dw_base/spark/udf/business/ and are loaded on demand.
+"""
 
 import difflib
+import hashlib
+import html
 import json
 import random
 import re
 import traceback
+from collections import Counter
 from datetime import datetime
-from typing import Union, List, Dict
+from typing import Dict, List, Union
 
 import pygeohash
 from pyspark.sql.functions import udf
-from pyspark.sql.types import StringType, ArrayType, BooleanType, FloatType, LongType, MapType
+from pyspark.sql.types import (
+    ArrayType, BooleanType, FloatType, IntegerType, LongType, MapType,
+    StringType, StructField, StructType,
+)
 
 from dw_base.utils.datetime_utils import parse_datetime
 
 
-def add_random_number_prefix(datum: str, separator: str, floor: int, ceiling: int) -> str:
-    """
-    为字段添加随机数字前缀
-    Args:
-        datum:
-        separator: 原数据与随机前缀的分隔符
-        floor: 随机数字前缀下限
-        ceiling: 随机数字前缀上限
-
-    Returns:
-
-    """
-    return f'{random.randint(floor, ceiling)}{separator}{datum}'
-
-
-def append_to_json_array(json_array_string: str, new_element, remove_duplicate: bool = False) -> str:
-    """
-    向JSON array添加元素
-    Args:
-        json_array_string: JSON array字符串
-        new_element: 要添加的元素
-        remove_duplicate: 是否去重
-    Returns:
-    """
-    if not new_element:
-        return json_array_string
-    if not json_array_string:
-        return json.dumps([new_element], ensure_ascii=False)
-    json_array = json.loads(json_array_string)  # type: list
-    json_array.append(new_element)
-    if remove_duplicate is True:
-        result = []
-        for elem in json_array:
-            if result.__contains__(elem):
-                continue
-            result.append(elem)
-        return json.dumps(result, ensure_ascii=False)
-    return json.dumps(json_array, ensure_ascii=False)
+# ==================== JSON ====================
 
-
-def array_append(array: List, new_element,
-                 ignore_null: bool = False,
-                 remove_duplicate: bool = False,
-                 need_sort: bool = False) -> List:
-    if not array or len(array) == 0:
-        if new_element or ignore_null is not True:
-            return [new_element]
-        return []
-    if not new_element:
-        if ignore_null is True:
-            return array
-    else:
-        if array.__contains__(new_element) and remove_duplicate is True:
-            return array
-    array.append(new_element)
-    if need_sort:
-        array.sort()
-    return array
+@udf(returnType=BooleanType())
+def is_json(data) -> bool:
+    try:
+        json.loads(data)
+    except:
+        return False
+    return True
 
 
-def field_merge(delimiter: str, *fields_values):
-    """
-    两个字段合并,如果相同只取一个,不同用delimiter分隔
-    Args:
-        delimiter:
-        *fields_values:
-    Returns:
-    """
-    if not fields_values:
+@udf(returnType=ArrayType(StringType()))
+def json_object_keys(json_str: str) -> List[str]:
+    if not json_str:
+        return None
+    try:
+        json_dict = json.loads(json_str)  # type:dict
+        return [k for k in json_dict.keys()]
+    except:
         return None
-    result = []
-    [result.append(value.strip()) for value in fields_values if value and value.strip() not in result]
-    return delimiter.join(result)
 
 
 def flatten_json(json_str: str, reserve_parent: bool = True) -> str:
-    """
-    展平json
-    Args:
-        json_str: 待展平的json
-        reserve_parent: 是否保留父key,默认保留
-    Returns:
-    """
+    """Flatten json; reserve_parent controls whether parent keys are kept"""
 
     def flatten_json_node(parent, json_element) -> Union[float, int, str, Dict, List]:
         if isinstance(json_element, dict):
@@ -133,42 +84,63 @@ def flatten_json(json_str: str, reserve_parent: bool = True) -> str:
         return json_str
 
 
-def geo_hash(latitude: float, longitude: float, precision: int) -> str:
-    return pygeohash.encode(latitude, longitude, precision)
+def remove_empty_key(info):
+    """Recursively drop keys whose value is empty from a json"""
+    json_info = json.loads(info)
 
+    def internal_remove(json_info):
+        try:
+            if isinstance(json_info, dict):
+                info_re = dict()
+                for key, value in json_info.items():
+                    if isinstance(value, dict) or isinstance(value, list):
+                        re = internal_remove(value)
+                        if len(re):
+                            info_re[key] = re
+                    elif value not in ['', {}, [], 'null', None]:
+                        info_re[key] = str(value)
+                return info_re
+            elif isinstance(json_info, list):
+                info_re = list()
+                for value in json_info:
+                    if isinstance(value, dict) or isinstance(value, list):
+                        re = internal_remove(value)
+                        if len(re):
+                            info_re.append(re)
+                    elif value not in ['', {}, [], 'null', None]:
+                        info_re.append(str(value))
+                return info_re
+            else:
+                return None
+        except Exception as e:
+            return None
 
-@udf(returnType=BooleanType())
-def has_chinese(datum: str) -> bool:
-    if datum:
-        pattern = re.compile(u'[\u4e00-\u9fa5]')
-        match = pattern.search(datum)
-        if match:
-            return True
-    return False
+    return json.dumps(internal_remove(json_info), ensure_ascii=False)
 
 
-@udf(returnType=BooleanType())
-def is_json(data) -> bool:
-    try:
-        json.loads(data)
-    except:
-        return False
-    return True
+def append_to_json_array(json_array_string: str, new_element, remove_duplicate: bool = False) -> str:
+    """Append an element to a JSON array string, optionally de-duplicating"""
+    if not new_element:
+        return json_array_string
+    if not json_array_string:
+        return json.dumps([new_element], ensure_ascii=False)
+    json_array = json.loads(json_array_string)  # type: list
+    json_array.append(new_element)
+    if remove_duplicate is True:
+        result = []
+        for elem in json_array:
+            if result.__contains__(elem):
+                continue
+            result.append(elem)
+        return json.dumps(result, ensure_ascii=False)
+    return json.dumps(json_array, ensure_ascii=False)
 
 
 def json_array_subset(json_array_string: str,
                       subset_fields: Union[List, str],
                       as_list: bool = False,
                       skip_null: bool = False) -> str:
-    """
-    获取json object array string的子集
-    Args:
-        json_array_string:
-        subset_fields: 子集字段
-        as_list: 如果子集字段只有1个,是否以list返回
-        skip_null: 字段的值是None,是否添加在返回的数据中
-    Returns: 子集数组的字符串
-    """
+    """Extract a subset of fields from a json object array string"""
     if not json_array_string:
         return None
     if not subset_fields:
@@ -203,17 +175,183 @@ def json_array_subset(json_array_string: str,
     return json.dumps(list_subset, ensure_ascii=False)
 
 
+@udf(returnType=ArrayType(StructType([
+    StructField("idx", IntegerType(), False),
+    StructField("obj", StringType(), False),
+])))
+def parse_jsonarr_to_arr(s: str):
+    return [(i + 1, json.dumps(obj)) for i, obj in enumerate(json.loads(s))]
+
+
+@udf(returnType=ArrayType(StructType([
+    StructField("idx", IntegerType(), False),
+    StructField("obj", StringType(), False),
+])))
+def parse_jsonarr_to_strarr(s: str):
+    return [(i + 1, obj) for i, obj in enumerate(json.loads(s))]
+
+
+# ==================== ARRAY ====================
+
 @udf(returnType=ArrayType(StringType()))
-def json_object_keys(json_str: str) -> List[str]:
-    if not json_str:
-        return None
-    try:
-        json_dict = json.loads(json_str)  # type:dict
-        return [k for k in json_dict.keys()]
-    except:
+def array_intersect(arr1, arr2):
+    return list(set(arr1) & set(arr2))
+
+
+def array_append(array: List, new_element,
+                 ignore_null: bool = False,
+                 remove_duplicate: bool = False,
+                 need_sort: bool = False) -> List:
+    if not array or len(array) == 0:
+        if new_element or ignore_null is not True:
+            return [new_element]
+        return []
+    if not new_element:
+        if ignore_null is True:
+            return array
+    else:
+        if array.__contains__(new_element) and remove_duplicate is True:
+            return array
+    array.append(new_element)
+    if need_sort:
+        array.sort()
+    return array
+
+
+@udf(ArrayType(StringType()))
+def array_slice(input_array, start, end):
+    if input_array:
+        return input_array[start:end]
+    return []
+
+
+@udf(returnType=ArrayType(StringType()))
+def merge_list(arr_list: List):
+    res = set()
+    for e in arr_list:
+        if e is not None:
+            for i in e:
+                if i is not None and i != "":
+                    res.add(i)
+    return list(res)
+
+
+@udf(returnType=ArrayType(StringType()))
+def merge_source(incr_source: List, old_source: List):
+    res = set()
+    if incr_source is not None:
+        for i in incr_source:
+            if i is not None and i != "":
+                res.add(i)
+    if old_source is not None:
+        for i in old_source:
+            if i is not None and i != "":
+                res.add(i)
+    return list(res)
+
+
+@udf(returnType=StructType([
+    StructField("k", ArrayType(StringType()), False),
+    StructField("kv", StringType()),
+]))
+def parse_arr_and_count(arr, tag: str, return_count: int = -1):
+    ele_cnt_dict = Counter(arr)
+    json_list = sorted([{"code": key, "num": value} for key, value in ele_cnt_dict.items()], key=lambda x: x["num"], reverse=True)
+    if return_count < 0:
+        return [obj['code'] for obj in json_list], ",".join(['{' + f'{i["code"]},{tag}:{i["num"]}' + '}' for i in json_list])
+    list_len = len(json_list)
+    index = list_len if return_count >= list_len else return_count
+    return [obj['code'] for obj in json_list][:index], ",".join(['{' + f'{i["code"]},{tag}:{i["num"]}' + '}' for i in json_list[:index]])
+
+
+@udf(returnType=StructType([
+    StructField("sum", FloatType(), False),
+    StructField("list", StringType()),
+]))
+def parse_arr_and_sum(struct_arr, tag: str):
+    sum_dict = {}
+    for s in struct_arr:
+        key = s[0]
+        value: float = s[1]
+        if key not in sum_dict:
+            sum_dict[key] = 0.0
+        if value is not None:
+            sum_dict[key] += value
+    json_list = sorted([{"code": key, "num": value} for key, value in sum_dict.items()], key=lambda x: x["num"], reverse=True)
+    total = 0.0
+    for obj in json_list:
+        total += obj["num"]
+    return round(total, 2), ",".join(['{' + f'{i["code"]},{tag}:{round(i["num"], 2)}' + '}' for i in json_list])
+
+
+# ==================== STRING ====================
+
+@udf(returnType=BooleanType())
+def has_chinese(datum: str) -> bool:
+    if datum:
+        pattern = re.compile(u'[\u4e00-\u9fa5]')
+        if pattern.search(datum):
+            return True
+    return False
+
+
+@udf(returnType=FloatType())
+def similarity(left: str, right: str) -> float:
+    return difflib.SequenceMatcher(None, left, right).quick_ratio()
+
+
+@udf(returnType=ArrayType(StringType()))
+def regexp_extract_all(col: str, ptn: str, g: int = 0):
+    return [e.group(g) for e in re.compile(ptn).finditer(col if col else '')]
+
+
+def add_random_number_prefix(datum: str, separator: str, floor: int, ceiling: int) -> str:
+    return f'{random.randint(floor, ceiling)}{separator}{datum}'
+
+
+def field_merge(delimiter: str, *fields_values):
+    """Merge fields: duplicates kept once, distinct values joined with delimiter"""
+    if not fields_values:
         return None
+    result = []
+    [result.append(value.strip()) for value in fields_values if value and value.strip() not in result]
+    return delimiter.join(result)
 
 
+def space2null(text):
+    if text and not text.isspace():
+        return text
+    return None
+
+
+def merge_ws(text: str):
+    if text:
+        return ' '.join(text.split())
+    return None
+
+
+def remove_special_char(text, char):
+    if text is not None and text.endswith(char):
+        return text[:-1]
+    return text
+
+
+@udf(returnType=ArrayType(StringType()))
+def explode_str_to_arr(text: str) -> list:
+    """For strings longer than 8 chars, take prefixes shrinking one char at a time into an array (for prefix-match scenarios)"""
+    if text is None:
+        return []
+    if len(text) <= 8:
+        return [text]
+    return [text[:i] for i in range(len(text), 7, -1)]
+
+
+def html_unescape(text):
+    return html.unescape(text)
+
+
+# ==================== NUMERIC / DATE / HASH ====================
+
 def max_value(*args):
     maxi_value = None
     for elem in args:
@@ -234,6 +372,10 @@ def min_value(*args):
     return mini_value
 
 
+def geo_hash(latitude: float, longitude: float, precision: int) -> str:
+    return pygeohash.encode(latitude, longitude, precision)
+
+
 def millis_timestamp_to_str(ts: int, str_format: str = None) -> str:
     date_time = datetime.fromtimestamp(ts / 1000.0)
     if str_format:
@@ -243,21 +385,13 @@ def millis_timestamp_to_str(ts: int, str_format: str = None) -> str:
 
 @udf(returnType=LongType())
 def parse_datetime_to_timestamp(date_time: str, in_milli_seconds: bool = False, original_format: str = None) -> int:
-    """
-    把字符串表达的日期转为时间戳
-    Args:
-        date_time: 日期
-        original_format: 原日期格式,不传则智能识别
-        in_milli_seconds: 是否返回毫秒
-    Returns:
-        转换后的日期
-    """
+    """String date to timestamp; heuristic recognition of YY.MM.DD and YYYY年M月D日 formats"""
     try:
         if date_time:
             d = date_time.split('.')
             if len(date_time) == 8 and len(d) == 3 and len(d[0]) == 2:
                 date_time = '20' + date_time
-            ret = re.match('(\d+)年(\d+)月(\d+)日', date_time)
+            ret = re.match(r'(\d+)年(\d+)月(\d+)日', date_time)
             if ret:
                 date_time = ret.group().replace('年', '-').replace('月', '-').replace('日', '')
 
@@ -270,100 +404,36 @@ def parse_datetime_to_timestamp(date_time: str, in_milli_seconds: bool = False,
     except:
         try:
             date_time = int(date_time)
-            # 当前时间小于传入的时间戳,认为是毫秒
             if datetime.now().timestamp() < date_time:
-                if in_milli_seconds is True:
-                    return date_time
-                else:
-                    return int(date_time / 1000)
-            else:
-                if in_milli_seconds is True:
-                    return date_time * 1000
-                else:
-                    return date_time
-        except Exception as e:
+                return date_time if in_milli_seconds else int(date_time / 1000)
+            return date_time * 1000 if in_milli_seconds else date_time
+        except Exception:
             return None
 
 
-@udf(returnType=FloatType())
-def similarity(left: str, right: str) -> float:
-    """
-    计算两个字符串的相似度
-    Args:
-        left:
-        right:
-    Returns:
-
-    """
-    return difflib.SequenceMatcher(None, left, right).quick_ratio()
-
-
-def remove_empty_key(info):
-    """
-    删除json中value为空的key
-    Returns: json
-    """
-    json_info = json.loads(info)
-
-    def internal_remove(json_info):
-        try:
-            if isinstance(json_info, dict):
-                info_re = dict()
-                for key, value in json_info.items():
-                    if isinstance(value, dict) or isinstance(value, list):
-                        re = internal_remove(value)
-                        if len(re):
-                            info_re[key] = re
-                    elif value not in ['', {}, [], 'null', None]:
-                        info_re[key] = str(value)
-                return info_re
-            elif isinstance(json_info, list):
-                info_re = list()
-                for value in json_info:
-                    if isinstance(value, dict) or isinstance(value, list):
-                        re = internal_remove(value)
-                        if len(re):
-                            info_re.append(re)
-                    elif value not in ['', {}, [], 'null', None]:
-                        info_re.append(str(value))
-                return info_re
-            else:
-                return None
-        except Exception as e:
-            return None
+@udf(returnType=StringType())
+def get_md5(*cols: str) -> str:
+    """md5 of concatenated columns, each length-prefixed to avoid collisions"""
+    col_and_len_list = []
+    for col in cols:
+        if col is not None:
+            col_and_len_list.append(str(len(col)))
+            col_and_len_list.append(col)
+    key = ''.join(col_and_len_list)
+    if not key:
+        return ''
+    md5 = hashlib.md5()
+    md5.update(key.encode("utf-8"))
+    return md5.hexdigest()
 
-    return json.dumps(internal_remove(json_info), ensure_ascii=False)
-
-
-@udf(returnType=ArrayType(StringType()))
-def regexp_extract_all(col: str, ptn: str, g: int = 0):
-    return [e.group(g) for e in re.compile(ptn).finditer(col if col else '')]
-
-
-@udf(returnType=ArrayType(StringType()))
-def array_intersect(arr1, arr2):
-    """
-    计算两个数组的交集
-    :param arr1:
-    :param arr2:
-    :return:
-    """
-    return list(set(arr1) & set(arr2))
 
+# ==================== CROSS-TYPE CONVERTERS ====================
 
 def array_to_json(arr: List):
-    """
-    数组转为jsonstring
-    :param arr:
-    :return:
-    """
     return json.dumps(arr, ensure_ascii=False)
 
 
 def map_to_json(map: dict):
-    """
-    map转为jsonstring
-    """
     return json.dumps(map, ensure_ascii=False)
 
 
@@ -372,34 +442,59 @@ def struct_to_json(struct):
     return json.dumps(json_dict, ensure_ascii=False)
 
 
-@udf(returnType=ArrayType(MapType(StringType(), StringType())))
-def str_to_map_arr(json_str: str) -> list:
+def num_to_str(number):
+    if isinstance(number, float) and number.is_integer():
+        return '{:.0f}'.format(number)
+    return str(int(number)) if isinstance(number, int) else str(number)
+
+
+@udf(returnType=ArrayType(StringType()))
+def str_to_arr(json_str: str) -> list:
     if json_str:
         return json.loads(json_str)
     return []
 
 
-def num_to_str(number):
-    # 确保 number 是 float 类型
-    if isinstance(number, float) and number.is_integer():
-        return '{:.0f}'.format(number)
-    else:
-        return str(int(number)) if isinstance(number, int) else str(number)
+@udf(returnType=ArrayType(StringType()))
+def str_to_json_arr(json_str):
+    """JSON array string to list of json strings (each element re-serialized via json.dumps)"""
+    if json_str:
+        try:
+            str_arr = json.loads(json_str)
+            if isinstance(str_arr, list):
+                return [json.dumps(sm) for sm in str_arr]
+        except json.JSONDecodeError:
+            return []
+    return []
 
 
-def space2null(text):
-    if text and not text.isspace():
-        return text
-    return None
+@udf(returnType=ArrayType(MapType(StringType(), StringType())))
+def str_to_map_arr(json_str: str) -> list:
+    if json_str:
+        return json.loads(json_str)
+    return []
 
 
-if __name__ == '__main__':
-    cases = [
-        '',
-        None,
-        '  ',
-        '   ',
-        'hello'
-    ]
-    for case in cases:
-        print(space2null(case))
+@udf(returnType=StringType())
+def split_str_to_jsonstr(str_list: List):
+    """Split each element on ':' into k:v and aggregate into one JSON string"""
+    res = []
+    for kv_str in str_list:
+        arr = kv_str.split(':')
+        if len(arr) == 2:
+            res.append({arr[0]: arr[1]})
+    return json.dumps(res, ensure_ascii=False)
+
+
+@udf(returnType=MapType(StringType(), ArrayType(StringType())))
+def split_str_to_maparr(str_list: List):
+    """Split each element on ':' into k:v; values under the same key are appended to a list"""
+    res = {}
+    for kv_str in str_list:
+        arr = kv_str.split(':')
+        if len(arr) == 2:
+            if arr[0] not in res:
+                res[arr[0]] = [arr[1]]
+            else:
+                res[arr[0]].append(arr[1])
+    return res

+ 0 - 0
dw_base/spark/udf/contacts/__init__.py


+ 0 - 419
dw_base/spark/udf/contacts/ctc_common.py

@@ -1,419 +0,0 @@
-import hashlib
-import json
-import re
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-special_chars = ['.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '‘',
-                 '’',
-                 '“',
-                 '”',
-                 '`',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 '`',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '&',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '=',
-                 "'",
-                 '±',
-                 '°',
-                 '«',
-                 '»',
-                 'µ',
-                 '¶',
-                 '·',
-                 '€',
-                 '£',
-                 '¥',
-                 '¢',
-                 '×',
-                 '÷',
-                 '±',
-                 '¬',
-                 '…',
-                 '→',
-                 '←',
-                 '↑',
-                 '↓',
-                 '↔',
-                 '⇒',
-                 '⇐',
-                 '≈',
-                 '≠',
-                 '≤',
-                 '≥',
-                 '¨',
-                 '´',
-                 '.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 "'",
-                 "'",
-                 '"',
-                 '"',
-                 '`',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 "'",
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '&',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '=',
-                 "'",
-                 '±',
-                 '°',
-                 '«',
-                 '»',
-                 'µ',
-                 '¶',
-                 '·',
-                 '€',
-                 '£',
-                 '¥',
-                 '¢',
-                 '×',
-                 '÷',
-                 '±',
-                 '¬',
-                 '…',
-                 '→',
-                 '←',
-                 '↑',
-                 '↓',
-                 '↔',
-                 '⇒',
-                 '⇐',
-                 '≈',
-                 '≠',
-                 '≤',
-                 '≥']
-special_chars = set(special_chars)
-
-
-@udf(returnType=ArrayType(StringType()))
-def str_to_json_arr(json_str: str) -> list:
-    try:
-        if json_str:
-            res = []
-            for j in json.loads(json_str):
-                res.append(json.dumps(j, ensure_ascii=False))
-            return res
-    except json.JSONDecodeError as e:
-        # 处理JSON解析错误
-        print(f"JSONDecodeError: {e}")
-    except Exception as e:
-        # 处理其他异常
-        print(f"Unexpected error: {e}")
-    return []
-
-
-@udf(returnType=ArrayType(StringType()))
-def str_to_arr(json_str: str) -> list:
-    try:
-        if json_str:
-            return json.loads(json_str)
-    except json.JSONDecodeError as e:
-        # 处理JSON解析错误
-        print(f"JSONDecodeError: {e}")
-    except Exception as e:
-        # 处理其他异常
-        print(f"Unexpected error: {e}")
-    return []
-
-
-@udf(returnType=ArrayType(MapType(StringType(), StringType())))
-def str_to_map_arr(json_str: str) -> list:
-    try:
-        if json_str:
-            return json.loads(json_str)
-        return []
-    except json.JSONDecodeError as e:
-        # Handle JSON decoding error
-        print(f"JSONDecodeError: {e}")
-        return []
-    except Exception as e:
-        # Handle other exceptions
-        print(f"Unexpected error: {e}")
-        return []
-
-
-def merge_ws(text: str):
-    if text:
-        return ' '.join(text.split())
-    return None
-
-
-def uppercase_first_letter(word):
-    word = word.lower()
-    return word[:1].upper() + word[1:]
-
-
-def remove_special_chars(word):
-    return ''.join(ch for ch in word if ch not in special_chars)
-
-
-def clean_contact_name(contact_name):
-    if contact_name:
-        names = contact_name.split()
-        cleaned_names = [remove_special_chars(name) for name in names]
-        upper_names = [uppercase_first_letter(name) for name in cleaned_names]
-        cleaned_names = ' '.join(upper_names)
-        return ' '.join(cleaned_names.split())
-    return None
-
-
-def clean_email_status(source, match_level):
-    if match_level:
-        if source == 'shh':
-            try:
-                match_level = float(match_level)
-                if match_level == 1:
-                    return 'PERFECT_MATCH'
-                elif match_level in (2, -1):
-                    return 'SPECULATION_VERIFICATION'
-                elif match_level >= 0.9 and match_level < 1:
-                    return 'POSSIBLE_MATCH'
-                else:
-                    return 'LOW_MATCH'
-            except ValueError:
-                return None
-        elif source == 'snovio':
-            if match_level in ('valid', 'verified'):
-                return 'PERFECT_MATCH'
-            elif match_level in ('not_valid', 'greylisted', 'notVerified'):
-                return 'SPECULATION_VERIFICATION'
-            else:
-                return 'LOW_MATCH'
-    return None
-
-
-def clean_shh_ep(ep):
-    if ep:
-        if ep.endswith(('^EMX', '^ESD')):
-            return ep[:-4]
-        return ep
-    return None
-
-
-def get_shh_email_status(inv, level):
-    if level is not None:
-        try:
-            level = int(level)
-            if level <= -7:
-                if inv:
-                    return 'low'
-                else:
-                    return 'high'
-            elif level <= 0:
-                if inv:
-                    return 'low'
-                else:
-                    return 'middle'
-        except ValueError:
-            return 'low'
-    return 'low'
-
-
-def extract_name_from_email(email):
-    if email and '@' in email:
-        return email.split('@')[0][:20]
-    return None
-
-
-def generate_md5_hash(input_str: str):
-    md5_hash = hashlib.md5(input_str.encode('utf-8'))
-    return md5_hash.hexdigest()
-
-
-def generate_ctc_id(tid, name, position):
-    name = clean_contact_name(name)
-    if not tid:
-        return None
-    if not name:
-        return None
-    if not position:
-        input_str = f"{tid}-{name}"
-    else:
-        input_str = f"{tid}-{name}-{position}"
-    return generate_md5_hash(input_str)
-
-
-def generate_ctc_id_fake_name(tid, name, position):
-    name = clean_contact_name(name)
-    if not tid:
-        return None
-    if not name:
-        return None
-    if not position:
-        input_str = f"{tid}-{name}"
-    else:
-        input_str = f"{tid}-{name}-{position}"
-    return generate_md5_hash(input_str)
-
-
-def clean_website(website):
-    """
-    Parse the crawler API response and extract the company website.
-
-    :param website: raw website string from the crawler response
-    :return: normalized company website
-    """
-    if website and website.strip():
-        # Strip http://, https:// and www. prefixes
-        website = re.sub(r'^(https?://)?(www\.)?', '', website)
-        if website.endswith('/'):
-            website = website[:-1]
-    return website
-
-
-if __name__ == '__main__':
-    cases = [
-        'http://aaa.com',
-        'https://aaa.com',
-        'https://www.aaa.com',
-        'http://www.aaa.com',
-        'www.aaa.com',
-        'www.aaa.com/asda/asda',
-        'www.aaa.com/asda/asda/',
-        'www.aaa.com/',
-        'https://locations.jackinthebox.com/us/wa/blaine/8140-birch-bay-square-st?utm_source=bing\u0026utm_medium=local\u0026utm_campaign=bing-local'
-    ]
-    for case in cases:
-        print(case, '->', clean_website(case))
-
-if __name__ == '__main__1':
-    case_list = ['andy zhu',
-                 'henry    liu',
-                 'JENS HESSELBERG LUND',
-                 '  TONY   li',
-                 ' Boy.  YU  .',
-                 'MARK KLINDERA @chief executive officer!'
-                 ]
-    for case in case_list:
-        res = clean_contact_name(case)
-        print("{:<30} ->  |{}|".format(case, res))
-    snovio_case_list = ['unknown',
-                        'valid',
-                        'not_valid',
-                        'greylisted',
-                        'abcsd',
-                        '']
-    shh_case_list = ['',
-                     'abc',
-                     '.81',
-                     '1',
-                     '.89',
-                     '.92',
-                     '.97',
-                     '2',
-                     '.85',
-                     '.93',
-                     '.95',
-                     '.8',
-                     '.9',
-                     '.98',
-                     '.84',
-                     '-1'
-                     ]
-    for case in snovio_case_list:
-        res = clean_email_status('snovio', case)
-        print("{:<30} ->  |{}|".format(case, res))
-    for case in shh_case_list:
-        res = clean_email_status('shh', case)
-        print("{:<30} ->  |{}|".format(case, res))
-    ep_case_list = ['daze@exemail.com.au^ESD',
-                    'ub3erl33trisser@hotmail.com^ESD',
-                    'noel.thompson@orange.net^EMX',
-                    'Potso.Makgatho@eskom.co.za^ESD',
-                    'dcsupplychain@yahoo.co.uk^ESD',
-                    'sunny.patel@i2ieventsgroup.com^EMX',
-                    '_zig_@bellsouth.net^ESD',
-                    'manish.pandey@ge.com^ESD',
-                    'amy_salzman@comcast.com^ESD',
-                    'kpretzer@thestrategicsolution.com^ESD']
-    for case in ep_case_list:
-        res = clean_shh_ep(case)
-        print("{:<30} ->  |{}|".format(case, res))
-    print(extract_name_from_email('12345678901234567890abcdef@q.com'))
-    print(extract_name_from_email('12345678901234567890abcdefq.com'))
-    print(get_shh_email_status('eae', 0))

+ 0 - 33
dw_base/spark/udf/contacts/test/ctc_common_test.py

@@ -1,33 +0,0 @@
-import pytest
-from dw_base.spark.udf.contacts.ctc_common import get_shh_email_status
-from dw_base.spark.udf.contacts.ctc_common import clean_email_status
-
-
-@pytest.mark.parametrize("inv, level, expected", [
-    (True, '-8', 'low'),  # level <= -7 and inv is True
-    (False, '-8', 'high'),  # level <= -7 and inv is False
-    (True, '-1', 'low'),  # -7 < level <= 0 and inv is True
-    (False, '-1', 'middle'),  # -7 < level <= 0 and inv is False
-    (True, 'invalid', 'low'),  # invalid level value (raises ValueError)
-])
-def test_get_shh_email_status(inv, level, expected):
-    result = get_shh_email_status(inv, level)
-    assert result == expected
-
-
-@pytest.mark.parametrize("source, match_level, expected", [
-    ('shh', 1, 'PERFECT_MATCH'),
-    ('shh', 2, 'SPECULATION_VERIFICATION'),
-    ('shh', -1, 'SPECULATION_VERIFICATION'),
-    ('shh', 0.95, 'POSSIBLE_MATCH'),
-    ('shh', 0.5, 'LOW_MATCH'),
-    ('snovio', 'valid', 'PERFECT_MATCH'),
-    ('snovio', 'verified', 'PERFECT_MATCH')
-])
-def test_clean_email_status_functionality(source, match_level, expected):
-    result = clean_email_status(source, match_level)
-    assert result == expected
-
-
-if __name__ == '__main__':
-    pytest.main()

+ 0 - 14
dw_base/spark/udf/contacts/test/ent_logistics_label_test.py

@@ -1,14 +0,0 @@
-import pytest
-
-from dw_base.spark.udf.enterprise.ent_logistics_label import is_logistic_match
-
-
-@pytest.mark.parametrize("name, expected",[
-    ('ALINE', False)])
-def test_is_logistic_match(name, expected):
-    result = is_logistic_match(name)
-    assert result == expected
-
-
-if __name__ == '__main__':
-    pytest.main()

+ 0 - 0
dw_base/spark/udf/customs/__init__.py


+ 0 - 82
dw_base/spark/udf/customs/calculation_repetition_trade.py

@@ -1,82 +0,0 @@
-import sys
-import re
-from dw_base.spark.spark_sql import SparkSQL
-from dw_base.utils.config_utils import parse_args
-
-
-# When two rows share the same md5, decide as follows:
-# 1. If any core field is missing, treat them as distinct and generate a new MD5.
-# 2. If no core field is missing and at least 2 of the quantity/price fields are present, treat them as duplicates and share the original md5; otherwise generate a new MD5.
-# -tbl  dim.cts_trade_distinct_end
-def get_md5_sql(tbl):
-    sql = f'SHOW CREATE TABLE {tbl}'
-    spark = SparkSQL()
-    spark._final_spark_config = {'hive.exec.dynamic.partition': 'true',
-                                 'hive.exec.dynamic.partition.mode': 'nonstrict',
-                                 'spark.yarn.queue': 'cts',
-                                 'spark.sql.crossJoin.enabled': 'true',
-                                 'spark.executor.memory': '6g',
-                                 'spark.executor.memoryOverhead': '2048',
-                                 'spark.driver.memory': '4g',
-                                 'spark.executor.instances': "5",
-                                 'spark.executor.cores': '2'
-                                 }
-    ctbl = spark.query(sql)[0].collect()[0]['createtab_stmt']
-    cols_list = re.findall(r'(`[^`]+`) ', ctbl)
-    suff = 'i.'
-    modified_list = [suff + s for s in cols_list]
-    remove_id_sql = 'md5(concat_ws(\'-\',\n     {}) )  '.format(
-        ', '.join([f'nvl(cast({col} as string), "null")' for col in modified_list if col != "i.`id`"]))
-    contain_id_sql = 'md5(concat_ws(\'-\',\n     {}) ) '.format(
-        ', '.join([f'nvl(cast({col} as string), "null")' for col in modified_list]))
-    sel_end = (
-        f"if((check_core_fields(i.`date`, array(i.jksmc, i.cksmc), array(i.cpms, i.hgbm)) = 1 AND"
-        f" check_non_core_fields(i.`myzj`, i.`zl`, i.`sl`) = 0) OR "
-        f"check_core_fields(i.`date`, array(i.jksmc, i.cksmc), array(i.cpms, i.hgbm)) = 0,"
-        f"{contain_id_sql}  ,{remove_id_sql}  ) as md5")
-    return sel_end
-
-
-def check_core_fields(date: str, company_names: list, products: list):
-    """
-    Check the core fields: date, importer or exporter name, product description or HS code.
-    Args:
-        date:
-        company_names:
-        products:
-
-    Returns:
-        1 if all core fields are present, otherwise 0.
-    """
-
-    if date is None or str(date).strip() == '':
-        return 0
-
-    if not any(name is not None and str(name).strip() != '' for name in company_names):
-        return 0
-
-    if not any(product is not None and str(product).strip() != '' for product in products):
-        return 0
-
-    return 1
-
-
-def check_non_core_fields(total_dollars: str, weights: str, quantities: str):
-    """
-    Check the non-core fields (amount, weight, quantity).
-    Args:
-        total_dollars:
-        weights:
-        quantities:
-
-    Returns:
-        1 if at least two of the three fields are present, otherwise 0.
-    """
-    non_empty_count = 0
-    if total_dollars is not None and str(total_dollars).strip() != '':
-        non_empty_count += 1
-    if weights is not None and str(weights).strip() != '':
-        non_empty_count += 1
-    if quantities is not None and str(quantities).strip() != '':
-        non_empty_count += 1
-
-    return 1 if non_empty_count >= 2 else 0

+ 0 - 130
dw_base/spark/udf/customs/clean_crawler_data.py

@@ -1,130 +0,0 @@
-import re
-from typing import Set, Any
-
-germany_clean_dict = {
-    '&#232;': 'è',
-    '&#163;': '£',
-    '&#249;': 'ù',
-    '&#238;': 'î',
-    '&#212;': 'Ô',
-    '&#251;': 'û',
-    '&#227;': 'ã',
-    '&#229;': 'å',
-    '&#248;': 'ø',
-    '&#223;': 'ß',
-    '&#179;': '³',
-    '&#245;': 'õ',
-    '&#214;': 'Ö',
-    '&#209;': 'Ñ',
-    '&#234;': 'ê',
-    '&#240;': 'ð',
-    '&#192;': 'À',
-    '&#235;': 'ë',
-    '\u003e': '>',
-    '&#244;': 'ô',
-    '&#202;': 'Ê',
-    '&#226;': 'â',
-    '&#224;': 'à',
-    '&#197;': 'Å',
-    '&#191;': '¿',
-    '&#221;': 'Ý',
-    '&#230;': 'æ',
-    '&#253;': 'ý',
-    '&#242;': 'ò',
-    '&#216;': 'Ø',
-    '&#239;': 'ï',
-    '&#171;': '«',
-    '&#236;': 'ì',
-    '&#201;': 'É',
-    '&#180;': '´',
-    '&#218;': 'Ú',
-    '&#187;': '»',
-    '&#213;': 'Õ',
-    '&#200;': 'È',
-    '&#178;': '²',
-    '&#176;': '°',
-    '&#204;': 'Ì',
-    '&#173;': '',
-    '&#233;': 'é',
-    '&#250;': 'ú',
-    '&#246;': 'ö',
-    '&#225;': 'á',
-    '&#243;': 'ó',
-    '&#228;': 'ä',
-    '&#252;': 'ü',
-    '&#220;': 'Ü',
-    '&#231;': 'ç',
-    '&#241;': 'ñ',
-    '&#205;': 'Í',
-    '&#199;': 'Ç',
-    '&#193;': 'Á',
-    '&#174;': '®',
-    '&#183;': '·',
-    '&#196;': 'Ä',
-    '&#188;': '¼',
-    '&#194;': 'Â',
-    '&#169;': '©',
-    '&#237;': 'í',
-    '&#211;': 'Ó',
-    '&#195;': 'Ã',
-    '&#182;': '¶',
-    '\u0027': '"',
-    '\u0022': "'"
-}
-
-
-def clean_germany_company_name(name) -> str:
-    for key, value in germany_clean_dict.items():
-        if key in name:
-            name = name.replace(key, value)
-    return name
-
-
-def get_regex_match(text) -> Set[Any]:
-    regex_list = set()
-    pattern_list = [r'&#\d{3};', r'\\u[0-9A-Fa-f]{4}']
-    for pattern in pattern_list:
-        match_list = re.findall(pattern, text)
-        if len(match_list) == 0:
-            continue
-        for match in match_list:
-            regex_list.add(match)
-    return regex_list
-
-
-if __name__ == '__main__':
-    test_cases = [
-        'S&#233;cheron SA',
-        'Beiersdorf Ind&#250;stria Com&#233;rcio',
-        'Mitan Mineral&#246;l GmbH',
-        'Atmos Chr&#225;st',
-        'Damatic Automatizaci&#243;n S.L',
-        'Wibre Elektroger&#228;te Edmund Breuninger GmbH & Co. KG',
-        'Aslant&#252;rk Kau&#231;uk San. Tic., Limited Şti.',
-        'eurokomplekt O&#220;',
-        'Aslant&#252;rk Kau&#231;uk San. Tic., Limited Şti.',
-        'Tiru&#241;a',
-        'Baader &#205;sland Ehf',
-        '&#199;ınar Ecza Deposu',
-        'TATAB&#193;NYAI RUG&#211;GY&#193;RT&#211; KFT.',
-        'Aquagart&#174; Trading GmbH',
-        'Wessel&#183;Werk GmbH',
-        '&#196;tztechnik Herz',
-        'GPS Pr&#195;&#188;ftechnik Rhein/Main GmbH',
-        'GEHS GR&#220;N ENERGİE HEIZUNG UND SANİT&#194;R',
-        '@Sartorius Stedim Biotech Wunderland AG G&#246;ttingen /&#169;ss',
-        'Concesionaria Vuela Compa&#241;&#237;a de Aviaci&#243;n SAPI de CV',
-        'TATAB&#193;NYAI RUG&#211;GY&#193;RT&#211; KFT.',
-        'GPS Pr&#195;&#188;ftechnik Rhein/Main GmbH',
-        'Fr&#195;&#182;lich + Kl&#195;Œpfel Drucklufttechnik GmbH & Co. KG',
-        'DE\u0027 LONGHI APPLIANCES S.R.L.',
-        'FREY WILLE\u0022 GmbH & Co.KG.',
-        'Concesionaria Vuela Compa&#241;&#237;a de Aviaci&#243;n SAPI de CV',
-        'TATAB&#193;NYAI RUG&#211;GY&#193;RT&#211; KFT.',
-        'GPS Pr&#195;&#188;ftechnik Rhein/Main GmbH'
-
-    ]
-    for test_case in test_cases:
-        print("{:<50} {:>50}".format(test_case, clean_germany_company_name(test_case)))
-
-# print(get_regex_match("S&#233;cheron SA\\u0022ss"))
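The hand-maintained entity table in the removed `clean_crawler_data.py` can be covered by the standard library: `html.unescape` decodes numeric character references such as `&#233;` generically. A minimal sketch of an equivalent cleaner (the function name is illustrative); the one behavioural difference is `&#173;`, which the dict dropped outright while `html.unescape` yields a soft hyphen:

```python
import html


def decode_entities(name: str) -> str:
    # html.unescape decodes both named and numeric HTML character
    # references, covering every '&#NNN;' key in germany_clean_dict.
    return html.unescape(name)


print(decode_entities('Mitan Mineral&#246;l GmbH'))  # Mitan Mineralöl GmbH
```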

+ 0 - 257
dw_base/spark/udf/customs/common_clean.py

@@ -1,257 +0,0 @@
-# Generic company-name denoising
-
-special_chars = ['.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '‘',
-                 '’',
-                 '“',
-                 '”',
-                 '`',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '=',
-                 '\'',
-                 '±',
-                 '°',
-                 '«',
-                 '»',
-                 'µ',
-                 '¶',
-                 '·',
-                 '€',
-                 '£',
-                 '¥',
-                 '¢',
-                 '×',
-                 '÷',
-                 '¬',
-                 '…',
-                 '→',
-                 '←',
-                 '↑',
-                 '↓',
-                 '↔',
-                 '⇒',
-                 '⇐',
-                 '≈',
-                 '≠',
-                 '≤',
-                 '≥',
-                 '.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '"',
-                 '\'',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '¨',
-                 '´',
-                 '',
-                 '¿',
-                 '‰',
-                 '¯',
-                 '\x1A',
-                 '£',
-                 '>',
-                 '¿',
-                 '«',
-                 '´',
-                 '»',
-                 '°',
-                 '®',
-                 '·',
-                 '¼',
-                 '©',
-                 '¶',
-                 "'",
-                 '"'
-                 ]
-special_char_dict = {c: ' ' for c in set(special_chars)}
-special_char_dict['&'] = ' and '
-special_char_dict['&'] = ' and '
-special_chars_trans = str.maketrans(special_char_dict)
-
-sub_after_list = ['O/B OF', 'B/O OF', 'O/B', 'B/O', 'BY ORDER OF', 'BY ORDER', 'ON BEHALF OF', 'ON BEHALF', 'П/П']
-sub_before_str = 'C/O'
-# Enclosing-quote delimiters
-same_enclosers = ['"', '’', '”', "'"]
-diff_enclosers = ['«»', '《》']
-
-head_list = ['КОМПАНІЯ ', 'ООО ', 'СП ООО ', 'ТОО ', 'ТОВ ', 'ФИРМА ', 'КОМПАНИЯ ', 'ФІРМА ', 'КОМПАНИЯ ',
-             'CÔNG TY TNHH ', 'CONG TY CO PHAN ', 'ИП ООО ', 'АО ', 'M S ', 'СП ', 'JV ', 'MS ']
-
-
-def sub_head(text: str):
-    if text:
-        for head in head_list:
-            if text.startswith(head):
-                return text[len(head):].strip()
-        return text.strip()
-    else:
-        return None
-
-
-def extract_text_from_enclosers(text):
-    num = 0
-    result = text
-    for encloser in same_enclosers:
-        cnt = text.count(encloser)
-        open_inx = text.find(encloser)
-        close_inx = text.rfind(encloser)
-        if cnt > 2:
-            return text.strip()
-        elif cnt == 2 and close_inx - open_inx > 1:
-            num += 1
-            if num > 1:
-                return text.strip()
-            result = text[open_inx + 1:close_inx]
-    for encloser in diff_enclosers:
-        open_str, close_str = encloser[0], encloser[1]
-        open_cnt = text.count(open_str)
-        close_cnt = text.count(close_str)
-        open_inx = text.find(open_str)
-        close_inx = text.rfind(close_str)
-        if (open_cnt == 1 and close_cnt > 1) or (open_cnt > 1 and close_cnt == 1) or (open_cnt > 1 and close_cnt > 1):
-            return text.strip()
-        elif open_cnt == 1 and close_cnt == 1 and close_inx - open_inx > 1:
-            num += 1
-            if num > 1:
-                return text.strip()
-            result = text[open_inx + 1:close_inx]
-    return result.strip()
-
-
-def clean_company_name(name):
-    if name:
-        # 特殊字符替换为空格
-        name = name.translate(special_chars_trans)
-        # 转大写,去除连续空格,去除首尾空格
-        name = ' '.join(name.upper().split())
-        return name
-    else:
-        return None
-
-
-def sub_start_end(main_str, sub_str):
-    if main_str.startswith(sub_str):
-        main_str = main_str[len(sub_str):]
-    if main_str.endswith(sub_str):
-        main_str = main_str[:-len(sub_str)]
-    return main_str.strip()
-
-
-def get_sub_after(main_str, sub_str):
-    index = main_str.find(sub_str)
-    if index == -1:
-        return main_str
-    return main_str[index + len(sub_str):].strip()
-
-
-def get_sub_before(main_str, sub_str):
-    index = main_str.find(sub_str)
-    if index == -1:
-        return main_str
-    return main_str[:index].strip()
-
-
-def clean_pre_join(name):
-    if name:
-        name = name.upper().strip()
-        for sub_str in sub_after_list:
-            name = sub_start_end(name, sub_str)
-            name = get_sub_after(name, sub_str)
-        name = sub_start_end(name, sub_before_str)
-        name = get_sub_before(name, sub_before_str)
-        name = extract_text_from_enclosers(name)
-        name = clean_company_name(name)
-        name = sub_head(name)
-        return name
-    return None
-
-
-if __name__ == '__main__':
-    print(clean_pre_join('ASF INC ON BEH¿ BY ORDER OF'))
-
-if __name__ == '__main__2':
-    input_str1 = 'a<b>c'
-    input_str2 = 'a<b>c<d>e<f>gh'
-    input_str3 = 'a<"x>"b'
-    input_str4 = 'This <is a test <example> string.'
-    input_str5 = 'This is a test «aaa» string.'
-    case_list = [input_str1, input_str2, input_str3, input_str4, input_str5]
-    case_list.append('sss"adsd"ddd')
-    case_list.append('This is a test ""aaa» string.')
-    case_list.append('a<"x">b  ')
-    case_list.append('""abcd')
-    case_list.append('a>bc<d')
-    case_list.append('abcd<>')
-    case_list.append('abcd<bbbb》b>')
-    case_list.append('abcd<b\'b“bb》b>')
-
-    for case in case_list:
-        extract_text = extract_text_from_enclosers(case)
-        print("{:<50} ->  {}".format(case, extract_text))
-
-if __name__ == '__main__1':
-    case1 = ' AB    cde .((!)  '
-    assert clean_company_name(case1) == 'AB CDE'
-    case2 = None
-    assert clean_company_name(case2) is None
-    case3 = '    '
-    assert clean_company_name(case3) == ''
-    case4 = '~ab#c≥'
-    assert clean_company_name(case4) == 'AB C'
-    case5 = '÷  &            !  '
-    assert clean_company_name(case5) == 'AND'
-    case6 = 'abc&def'
-    assert clean_company_name(case6) == 'ABC AND DEF'
-    case = 'abc&def'
-    assert clean_company_name(case) == 'ABC AND DEF'
-    print('all test cases passed')
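The substitution step shared by both copies of this module boils down to a single `str.maketrans` table. A minimal sketch of the same normalisation (upper-case, punctuation to spaces, `&` to ` and `, whitespace collapsed), covering only the ASCII punctuation subset of the full `special_chars` list; the names here are illustrative:

```python
# ASCII subset of the module's special_chars list; each maps to a space.
SPECIAL = '.,-()@?#+!$|:/;*<>%^_[]{}\\~=\'"'
TRANS = str.maketrans({**{c: ' ' for c in SPECIAL}, '&': ' and '})


def normalize_company_name(name):
    """Upper-case, punctuation to spaces, '&' -> ' and ', whitespace collapsed."""
    if not name:
        return None
    return ' '.join(name.translate(TRANS).upper().split())


print(normalize_company_name('abc&def'))  # ABC AND DEF
```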

+ 0 - 253
dw_base/spark/udf/customs/common_clean2.py

@@ -1,253 +0,0 @@
-import re
-
-# Generic company-name denoising
-
-special_chars = ['.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '‘',
-                 '’',
-                 '“',
-                 '”',
-                 '`',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '=',
-                 '\'',
-                 '±',
-                 '°',
-                 '«',
-                 '»',
-                 'µ',
-                 '¶',
-                 '·',
-                 '€',
-                 '£',
-                 '¥',
-                 '¢',
-                 '×',
-                 '÷',
-                 '¬',
-                 '…',
-                 '→',
-                 '←',
-                 '↑',
-                 '↓',
-                 '↔',
-                 '⇒',
-                 '⇐',
-                 '≈',
-                 '≠',
-                 '≤',
-                 '≥',
-                 '.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '"',
-                 '\'',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '¨',
-                 '´',
-                 '',
-                 '¿',
-                 '‰',
-                 '¯',
-                 '\x1A'
-                 ]
-special_char_dict = {c: ' ' for c in set(special_chars)}
-special_char_dict['&'] = ' and '
-special_char_dict['&'] = ' and '
-special_chars_trans = str.maketrans(special_char_dict)
-
-sub_after_list = ['O/B OF', 'B/O OF', 'O/B', 'B/O', 'BY ORDER OF', 'BY ORDER', 'ON BEHALF OF', 'ON BEHALF', 'П/П']
-sub_before_str = 'C/O'
-# Enclosing-quote delimiters
-same_enclosers = ['"', '’', '”', "'"]
-diff_enclosers = ['«»', '《》']
-
-head_list = ['КОМПАНІЯ ', 'ООО ', 'СП ООО ', 'ТОО ', 'ТОВ ', 'ФИРМА ', 'КОМПАНИЯ ', 'ФІРМА ', 'КОМПАНИЯ ',
-             'CÔNG TY TNHH ', 'CONG TY CO PHAN ', 'ИП ООО ', 'АО ', 'PT ', 'CV ', 'M S ', 'СП ', 'JV ', 'MS ']
-
-
-def sub_head(text: str):
-    if text:
-        for head in head_list:
-            if text.startswith(head):
-                return text[len(head):].strip()
-        return text.strip()
-    else:
-        return None
-
-
-def extract_text_from_enclosers(text):
-    num = 0
-    result = text
-    for encloser in same_enclosers:
-        cnt = text.count(encloser)
-        open_inx = text.find(encloser)
-        close_inx = text.rfind(encloser)
-        if cnt > 2:
-            return text.strip()
-        elif cnt == 2 and close_inx - open_inx > 1:
-            num += 1
-            if num > 1:
-                return text.strip()
-            result = text[open_inx + 1:close_inx]
-    for encloser in diff_enclosers:
-        open_str, close_str = encloser[0], encloser[1]
-        open_cnt = text.count(open_str)
-        close_cnt = text.count(close_str)
-        open_inx = text.find(open_str)
-        close_inx = text.rfind(close_str)
-        if (open_cnt == 1 and close_cnt > 1) or (open_cnt > 1 and close_cnt == 1) or (open_cnt > 1 and close_cnt > 1):
-            return text.strip()
-        elif open_cnt == 1 and close_cnt == 1 and close_inx - open_inx > 1:
-            num += 1
-            if num > 1:
-                return text.strip()
-            result = text[open_inx + 1:close_inx]
-    return result.strip()
-
-
-def clean_company_name(name):
-    if name:
-        # 特殊字符替换为空格
-        name = name.translate(special_chars_trans)
-        # 转大写,去除连续空格,去除首尾空格
-        name = ' '.join(name.upper().split())
-        return name
-    else:
-        return None
-
-
-def sub_start_end(main_str, sub_str):
-    # print(sub_str)
-    re_str1 = re.sub(r'^[^a-zA-Z]*', '', main_str)
-    # print('re_str1' + '=' + re_str1)
-    if re_str1.startswith(sub_str):
-        main_str = re_str1[len(sub_str):]
-        # print('first truncation=' + main_str)
-
-    re_str2 = re.sub(r'[^a-zA-Z]*$', '', main_str)
-    # print('re_str2'+'='+re_str2)
-    if re_str2.endswith(sub_str):
-        main_str = re_str2[:-len(sub_str)]
-        # print('second truncation=' + main_str)
-    return main_str
-
-
-def get_sub_after(main_str, sub_str):
-    index = main_str.find(sub_str)
-    if index == -1:
-        return main_str
-    return main_str[index + len(sub_str):].strip()
-
-
-def get_sub_before(main_str, sub_str):
-    index = main_str.find(sub_str)
-    if index == -1:
-        return main_str
-    return main_str[:index].strip()
-
-
-def clean_pre_join(name):
-    if name:
-        name = name.upper().strip()
-        for sub_str in sub_after_list:
-            name = sub_start_end(name, sub_str)
-            name = get_sub_after(name, sub_str)
-        name = sub_start_end(name, sub_before_str)
-        name = get_sub_before(name, sub_before_str)
-        name = extract_text_from_enclosers(name)
-        name = clean_company_name(name)
-        name = sub_head(name)
-        return name
-    return None
-
-
-if __name__ == '__main__':
-    print(clean_pre_join('JAMES RIVER WHSE 8974 D6 ON BEHALF ..'))
-
-if __name__ == '__main__2':
-    input_str1 = 'a<b>c'
-    input_str2 = 'a<b>c<d>e<f>gh'
-    input_str3 = 'a<"x>"b'
-    input_str4 = 'This <is a test <example> string.'
-    input_str5 = 'This is a test «aaa» string.'
-    case_list = [input_str1, input_str2, input_str3, input_str4, input_str5]
-    case_list.append('sss"adsd"ddd')
-    case_list.append('This is a test ""aaa» string.')
-    case_list.append('a<"x">b  ')
-    case_list.append('""abcd')
-    case_list.append('a>bc<d')
-    case_list.append('abcd<>')
-    case_list.append('abcd<bbbb》b>')
-    case_list.append('abcd<b\'b“bb》b>')
-
-    for case in case_list:
-        extract_text = extract_text_from_enclosers(case)
-        print("{:<50} ->  {}".format(case, extract_text))
-
-if __name__ == '__main__1':
-    case1 = ' AB    cde .((!)  '
-    assert clean_company_name(case1) == 'AB CDE'
-    case2 = None
-    assert clean_company_name(case2) is None
-    case3 = '    '
-    assert clean_company_name(case3) == ''
-    case4 = '~ab#c≥'
-    assert clean_company_name(case4) == 'AB C'
-    case5 = '÷  &            !  '
-    assert clean_company_name(case5) == 'AND'
-    case6 = 'abc&def'
-    assert clean_company_name(case6) == 'ABC AND DEF'
-    case = 'abc&def'
-    assert clean_company_name(case) == 'ABC AND DEF'
-    print('all test cases passed')

+ 0 - 1556
dw_base/spark/udf/customs/company_abbr.py

@@ -1,1556 +0,0 @@
-import sys
-import re
-import os
-
-abspath = os.path.abspath(__file__)
-root_path = re.sub(r"tendata-warehouse.*", "tendata-warehouse", abspath)
-sys.path.append(root_path)
-
-from dw_base.spark.udf.customs.common_clean import clean_company_name
-
-kaz_enclosers = [('""', '""'), ('"', '"'), ('<<', '>>'), ('?', '?')]
-
-pakistan_suffix_list = [
-    'GROUPCOMPANYLIMITED',
-    'LIMITEDPARTNERSHIP',
-    'CORPORATIONLIMITED',
-    'SMCPRIVATE',
-    'OFCOMPANY',
-    'PRIVATELIMIT',
-    'PRIVATECO',
-    'LIABILITYCOMPANY',
-    'LIMITEDCOMPANY',
-    'COMPANYLIMITED',
-    'INCORPORAT',
-    'CORPORATION',
-    'GROUPCOLTD',
-    'COMPANYLTD',
-    'COLIMITED',
-    'GROUPLTD',
-    'SMCPVT',
-    'PVTLIMIT',
-    'PVTCOLTD',
-    'PVTLTD',
-    'FACTORY',
-    'CORPLTD',
-    'COMPANY',
-    'PTYLTD',
-    'AGENCY',
-    'OFFICE',
-    'CENTER',
-    'COLTD',
-    'COINC',
-    'C0LTD',
-    'LIMIT',
-    'CORP',
-    'LLC',
-    'LTD',
-    'COLT'
-]
-SECOND_AMERICA_SUFFIX_LIST = [
-    ' UNLIMITED',
-    ' LIMITED',
-    ' CO LTD',
-    ' COMPANY LTD',
-    ' AND COMPANY',
-    ' CORPORATION',
-    ' CORP',
-    ' COMPANY INC',
-    ' COMPANY',
-    ' LLC',
-    ' CO INC',
-    ' CO',
-    ' MD',
-    ' LTD',
-    ' INC',
-    ' LLP',
-    ' PLC',
-    ' EST',
-]
-
-third_AMERICA_SUFFIX_LIST = [
-    ' CORPORATION',
-    ' COMPANY LTD',
-    ' COMPANY INC',
-    ' UNLIMITED',
-    ' LIMITED',
-    ' CO LTD',
-    ' COMPANY',
-    ' CO INC',
-    ' CORP',
-    ' LLC',
-    ' LTD',
-    ' INC',
-    ' LLP',
-    ' PLC',
-    ' EST',
-    ' CO',
-    ' MD',
-]
-
-first_chile_SUFFIX_LIST = [
-    ' SPA',
-    ' S A',
-    ' SA',
-    ' LTDA',
-    ' LIMITADA',
-    ' LLC',
-    ' SOCIEDAD ANONIMA',
-    ' CO LTD',
-    ' LTD',
-    ' LIMI',
-    ' E I R'
-]
-
-first_bangladesh_suffix_list = [
-    'CHANGED FROM',
-    'CHANGED',
-    'CHANGE FROM',
-    'CHANGE',
-    'EXCHANGE'
-]
-ukraine_suffix_first = [
-    ' М КИЇВ ВУЛ ',
-    ' ВУЛ '
-]
-ukraine_suffix_second = [
-    ' S R O ',
-    ' Z O O '
-]
-
-second_bangladesh_suffix_list = [
-    'PVT CO LIMITED',
-    'PVT LIMITED',
-    'LIMITED',
-    'PVT LTD',
-    'LTD',
-    'PVT',
-    'CO LTD',
-    'CO',
-    'PLC'
-]
-
-FIRST_Rwanda_suffix_list = [
-    'COMPANY RWANDA LTD',
-    'COMPANY LTD',
-    ' CO LTD',
-    'LTD',
-    'LIMITED'
-]
-FIRST_england_suffix_list = [
-    ' COMPANY LIMITED',
-    ' ENTERPRISES LTD',
-    ' LIMITED',
-    ' COMPANY',
-    ' CO LTD',
-    ' LTD',
-    ' LLP'
-]
-FIRST_philippines_suffix_list = [
-    ' CO INC',
-    ' CO LTD',
-    'INC',
-    'CORPORATION',
-    'CORP',
-    'LLC',
-    'ENTERPRISES',
-    'INCORPORATED',
-    ' CO',
-    'PTE LTD',
-    'PTY LTD',
-    'LTD',
-    'GMBH',
-    'S R L',
-    'SRL'
-]
-FIRST_colombia_suffix_list = [
-    "LIMITADA",
-    "S A S",
-    "LITDA",
-    "LTDA",
-    "SAS",
-    "S A",
-    "LLC"
-]
-frist_america_suffix_list = [
-    'PRODUCT',
-    'UNION OF THE UNITED STATES',
-    ' FOUNDATION',
-    'SA DE CV',
-    ' UNLIMITED',
-    ' LIMITED',
-    'CENTERS OF AMERICA',
-    ' AMERICA CORP',
-    ' USA CORP',
-    ' CORP',
-    ' CORPORATION',
-    'FOUNDATION',
-    ' PLLC',
-    ' LP',
-    ' PA',
-    ' CO',
-    'ENTERPRISE',
-    'COMPANY',
-    ' AMERICA LLC',
-    ' AMERICA INC',
-    ' USA LLC',
-    ' USA INC',
-    ' FL LLC',
-    ' FL INC',
-    ' 2 LLC',
-    ' 2 INC',
-    ' 3 LLC',
-    ' 3 INC',
-    ' 2022 LLC',
-    ' 2022 INC',
-    ' 2021 LLC',
-    ' 2021 INC',
-    ' 2020 LLC',
-    ' 2020 INC',
-    ' CO LLC',
-    ' CO INC',
-    ' LLC',
-    ' INC',
-    ' CO LTD',
-    ' LTD'
-]
-
-indonesia_suffix_list = [
-    'AGENC',
-    'COMPANY',
-    'DEVELOPMENT',
-    'ORGANIZATION',
-    'ASSOCIATION',
-    'SERVICE',
-    'GROUP',
-    'PTY LTD',
-    'PTY LIMIT',
-    ' CO LTD',
-    ' CO LIMIT',
-    ' PTE LTD',
-    'INDONESIA CO',
-    'INDONESIA INCORP',
-    'INDONESIA LTD',
-    'PHILS CO',
-    'INDONESIA UNLIMIT',
-    ' ASIA CO',
-    ' ASIA UNLIMITED',
-    'INCORPORATED',
-    'ENTERPRISE',
-    ' INDONESIA INC',
-    ' ASIA INC',
-    ' INDONESIA CO INC',
-    ' CO',
-    ' CORP',
-    'CORPORATION',
-    ' INC',
-    ' INDONESIA',
-    ' TBK'
-]
-
-venezuela_suffix_list = [
-    'S A',
-    'C A',
-    'R L',
-    'R S',
-    'F P',
-    'S R L',
-    'LTD',
-    'INC',
-    'COMPANY C A',
-    'COMPAÑIA ANONIMA',
-    'CORPORATION C A',
-    'COOPERATIVA',
-    'INTERNATIONAL',
-    'CORPORACIÓN',
-    'REPRESENTACIONES',
-    'ASOCIACION CIVIL',
-    'FUNDACION'
-]
-
-kaz_heads = ["TOO",
-             "ООО",
-             "АО",
-             "ФХ",
-             "ИП OOO",
-             "НПЦ ООО",
-             "СП OOO",
-             "ЧП"]
-
-moldova_suffix_list = [
-    'ASOCIATIA GOSPODARIILOR TARANESTI',
-    'COOPERATIVA DE ÎNTREPRINZATOR',
-    'COOPERATIVA DE PRODUCERE',
-    'COOPERATIVA DE',
-    'COOPERATIVA AGRICOLA DE INTREPRINZATOR',
-    'COOPERATIVA AGRICOLA',
-    'CENTRUL TEHNIC',
-    'COMPANIA',
-    'FIRMA COOPERATISTA TEHNICO-STIINTIFICA DE PRODUCTIE',
-    'FIRMA DE PRODUCTIE',
-    'FIRMA DE PRODUCŢIE ŞI COMERŢ',
-    'FIRMA',
-    'SOCIETATEA COMERCIALĂ',
-    'SOCIETATEA CU RASPUNDERE LIMITATA FIRMA',
-    'SOCIETATEA CU RĂSPUNDERE LIMITATĂ',
-    'SOCIETATEA CU RASPUNDERE LIMITATA',
-    'SOCIETATEA PE ACTIUNI',
-    'SOCIETATEA IN NUME COLECTIV AGENTIA',
-    'INTREPRINDEREA INDIVIDUALA',
-    'ÎNTREPRINZĂTOR INDIVIDUAL',
-    'ÎNTREPRINDEREA INDIVIDUALĂ',
-    'ÎNTREPRINDEREA MUNICIPALĂ',
-    'ÎNTREPRINDEREA CU CAPITAL STRĂIN',
-    'INSTITUŢIA MEDICO-SANITARĂ PUBLICĂ',
-    'REDACTIA GAZETEI',
-    'ORGANIZATIA DE ADMINISTRARE FIDUCIARA A INVESTITILOR',
-    'S R L',
-    'SOCIETATEA CU RESPONSABILITATE LIMITATA',
-    'SOCIETATE CU RĂSPUNDERE LIMITATĂ'
-]
-
-moldova_suffix_list2 = [
-    'S R L',
-    'SOCIETATEA CU RESPONSABILITATE LIMITATA',
-    'SOCIETATE CU RĂSPUNDERE LIMITATĂ'
-]
-
-singapore_suffix_list = [
-    'SINGAPORE PTE LTD',
-    'S PTE LTD',
-    'PTE LTD',
-    'ENTERPRISES',
-    'ENTERPRISE',
-    'ENT',
-    'AGENCIES',
-    'AGENCY',
-    'PRIVATE LIMITED',
-    'COMPANY',
-    'LLP',
-    'CO'
-]
-
-hongkong_suffix_list = [
-    ' CO LIMITED',
-    ' LIMITED',
-    ' CO LTD',
-    ' COMPANY',
-    ' LTD'
-]
-
-china_suffix_list = [
-    ' GROUP CORPORATION LIMITED',
-    ' CORPORATION LIMITED',
-    ' GROUP CORPORATION',
-    ' GROUP CO LIMITED',
-    ' LIMITED COMPANY',
-    ' COMPANY LIMITED',
-    ' GROUP CO LTD',
-    ' CORPORATION',
-    ' CO LIMITED',
-    ' GROUP CORP',
-    ' CORP LTD',
-    ' LIMITED',
-    ' COMPANY',
-    ' FACTORY',
-    ' CO LTD',
-    ' CO INC',
-    ' CORP',
-    ' INC',
-    ' CO'
-]
-
-vietnam_right_separator_list = [
-    'COMPANY LIMITED ',
-    'COMPANY LTD '
-]
-
-vietnam_left_separator_list = [
-    ' CO LTD',
-    ' PTE LTD',
-    ' JOINT STOCK COMPANY',
-    ' COMPANY'
-]
-
-vietnam_suffix_list = [
-    ' CORP',
-    ' LLC',
-    ' CO JSC',
-    ' JSC',
-    ' LTD'
-]
-
-ind_head = [
-    'M S',
-    'MS'
-]
-
-india_suffix_list = [
-    ' CO I PVT L',
-    ' CO PVT L',
-    ' CO PRIVATE L',
-    ' CO I LTD',
-    ' I LTD',
-    ' I LIMITED',
-    ' I PVT L',
-    ' I PRIVATE L',
-    ' COMPANY PRIVATE L',
-    ' COMPANY PVT L',
-    ' P LTD',
-    ' PRIVATE L',
-    ' PVT L',
-    ' CO',
-    ' INC',
-    ' CO LIMITED',
-    ' LTD',
-    ' LIMITED',
-    ' CO I',
-    ' I'
-]
-
-mexico_suffix_list = [
-    ' S P R DE R L DE C V',
-    ' S DE R L DE C V',
-    ' S DE RL DE CV',
-    ' S A P I DE CV',
-    ' S P R DE R L',
-    ' S A DE C V',
-    ' SA DE CV'
-]
-
-nigeria_suffix_list = [
-    ' COMPANY LIMITED',
-    ' COMPANY LTD',
-    ' COMPANY',
-    ' LIMITED',
-    ' PTE LTD',
-    ' CO LTD',
-    ' LTD',
-    ' LLC'
-]
-
-peru_suffix_list = [
-    'SOCIEDAD ANONIMA CERRADA',
-    'SOCIEDAD ANONIMA CER',
-    'E I R LTDA',
-    'S R LTDA',
-    'E I R L',
-    'S R L',
-    'S A C',
-    'SAC',
-    'S A'
-]
-lesotho_suffix_list = [
-    ' LLC (EXTERNAL COMPANY) LTD',
-    ' LLC (EXTERNAL COMPANY)',
-    ' (PROPRIETARY) LIMITED',
-    ' COMPANY (PTY) LTD',
-    ' COMPANY LIMITED',
-    ' COMPANY LTD',
-    ' LIMITED',
-    ' PTY LTD',
-    ' CO LTD'
-]
-
-germany_suffix_list = [
-    'GMBH AND CO KGAA',
-    'GMBH AND CO OHG',
-    'GMBH AND CO KG',
-    'AG AND CO KGAA',
-    'AG AND CO OHG',
-    'LIMITED ŞTI',
-    'GMBH AND CO',
-    'S A DE C V',
-    'CO LIMITED',
-    'LIMITED',
-    'S R L',
-    'GMBH',
-    'GBR',
-    'SRL',
-    'INC',
-    'LLC',
-    'OHG',
-    'A S',
-    'E K',
-    'AG',
-    'SA',
-    'UG'
-]
-
-
-def kaz_extract_text_from_enclosers(text):
-    # Return the text between the first matching pair of enclosers defined in
-    # kaz_enclosers (e.g. quotes/brackets); fall back to the input unchanged.
-    for open_str, close_str in kaz_enclosers:
-        open_inx = text.find(open_str)
-        close_inx = text.rfind(close_str)
-        if close_inx - open_inx > 1:
-            return text[open_inx + 1:close_inx]
-    return text
-
-
-def remove_prefix(text, prefix):
-    if text.startswith(prefix):
-        return text[len(prefix):]
-    return text
-
-
-def truncate_at_suffix(text, suffix_list):
-    # Cut the text at the first listed suffix found anywhere in it,
-    # e.g. ('ACME LTD KARACHI', [' LTD']) -> 'ACME'.
-    for suffix in suffix_list:
-        if suffix in text:
-            return text.split(suffix, 1)[0]
-    return text
-
-
-def pakistan_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        upper_name = company_name.upper()
-        cleaned_name = re.sub(r'[^A-Z0-9]', '', upper_name)
-        # The name is uppercased above, so the 'M/S' marker must be matched
-        # in upper case ('ms' could never match).
-        removed_prefix_name = remove_prefix(cleaned_name, 'MS')
-        truncated_name = truncate_at_suffix(removed_prefix_name, pakistan_suffix_list).strip()
-        if len(truncated_name) > 4:
-            return truncated_name
-        elif len(removed_prefix_name) > 4:
-            return removed_prefix_name
-    return None
-
-
-def mirror_pakistan_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        upper_name = company_name.upper()
-        cleaned_name = re.sub(r'[^A-Z0-9 ]', '', upper_name)
-        # The name is uppercased above, so the 'M/S' marker must be matched
-        # in upper case ('ms' could never match).
-        removed_prefix_name = remove_prefix(cleaned_name, 'MS').strip()
-        truncated_name = truncate_at_suffix(removed_prefix_name, pakistan_suffix_list).strip()
-        if len(truncated_name) > 4:
-            return truncated_name
-        elif len(removed_prefix_name) > 4:
-            return removed_prefix_name
-    return None
-
-
-def split_last(text, suffix):
-    # Return the text before the LAST occurrence of suffix; return the text
-    # unchanged if the suffix is absent, and None for empty input.
-    if text:
-        last_occurrence_index = text.rfind(suffix)
-        if last_occurrence_index != -1:
-            return text[:last_occurrence_index]
-        return text
-    return None
-
-
-# mc_org handling logic for Namibia imports
-def split_first_dtp(text):
-    if text:
-        # Cut at the first '---DTP'-style marker, longest variant first.
-        for marker in (" ---DTP", "---DTP", "--DTP"):
-            if marker in text:
-                return text.split(marker, 1)[0]
-        return text
-    return None
-
-
-def america_truncate_at_suffix_first(text, suffix_list):
-    # These generic suffixes are only stripped when they terminate the name;
-    # any other matching suffix is stripped wherever it last appears.
-    endswith_only = {
-        ' FOUNDATION', ' UNLIMITED', ' AMERICA CORP', ' USA CORP', ' CORP',
-        ' CORPORATION', 'FOUNDATION', ' PLLC', ' LP', ' PA', ' CO',
-        'ENTERPRISE', 'COMPANY', ' LLC', ' INC',
-    }
-    for suffix in suffix_list:
-        if suffix in text:
-            if suffix not in endswith_only or text.endswith(suffix):
-                return split_last(text, suffix)
-    return text
-
-
-def america_truncate_at_suffix_second(text, suffix_list):
-    # These generic suffixes are only stripped when they terminate the name;
-    # any other matching suffix is stripped wherever it last appears.
-    endswith_only = {
-        ' UNLIMITED', ' LIMITED', ' AND COMPANY', ' CORPORATION', ' CORP',
-        ' COMPANY', ' LLC', ' CO', ' MD', ' LTD', ' INC', ' PLC', ' LLP',
-        ' EST',
-    }
-    for suffix in suffix_list:
-        if suffix in text:
-            if suffix not in endswith_only or text.endswith(suffix):
-                return split_last(text, suffix)
-    return text
-
-
-def america_truncate_at_suffix_third(text, suffix_list):
-    # These generic suffixes are only stripped when they terminate the name;
-    # any other matching suffix is stripped wherever it last appears.
-    endswith_only = {
-        ' UNLIMITED', ' LIMITED', ' CORPORATION', ' CORP', ' COMPANY',
-        ' LLC', ' CO', ' MD', ' LTD', ' INC', ' PLC', ' LLP', ' EST',
-    }
-    for suffix in suffix_list:
-        if suffix in text:
-            if suffix not in endswith_only or text.endswith(suffix):
-                return split_last(text, suffix)
-    return text
-
-
-def bangladesh_truncate_at_suffix_first(text, suffix_list):
-    # Bare 'CHANGED'/'CHANGE'/'EXCHANGE' are only stripped when they
-    # terminate the name; the 'CHANGE(D) FROM' markers are stripped wherever
-    # they last appear.
-    endswith_only = {'CHANGED', 'CHANGE', 'EXCHANGE'}
-    for suffix in suffix_list:
-        if suffix in text:
-            if suffix not in endswith_only or text.endswith(suffix):
-                return split_last(text, suffix)
-    return text
-
-
-def bangladesh_truncate_at_suffix_second(text, suffix_list):
-    # 'PVT' is stripped wherever it last appears; every other suffix is only
-    # stripped when it terminates the name.
-    for suffix in suffix_list:
-        if suffix in text:
-            if suffix == 'PVT' or text.endswith(suffix):
-                return split_last(text, suffix)
-    return text
-
-
-def indonesia_truncate_at_suffix(text, suffix_list):
-    # These short/generic suffixes are only stripped when they terminate the
-    # name; any other matching suffix is stripped wherever it last appears.
-    endswith_only = {' CO', ' CORP', 'CORPORATION', ' INC', ' INDONESIA', ' TBK'}
-    for suffix in suffix_list:
-        if suffix in text:
-            if suffix not in endswith_only or text.endswith(suffix):
-                return split_last(text, suffix)
-    return text
-
-
-def rwanda_truncate_at_suffix(text, suffix_list):
-    # These three forms are only stripped when they terminate the name; any
-    # other matching suffix (including the list entry ' CO LTD', which has a
-    # leading space and so does not equal the bare 'CO LTD' below) is
-    # stripped wherever it last appears.
-    endswith_only = {'COMPANY RWANDA LTD', 'COMPANY LTD', 'CO LTD'}
-    for suffix in suffix_list:
-        if suffix in text:
-            if suffix not in endswith_only or text.endswith(suffix):
-                return split_last(text, suffix)
-    return text
-
-
-def philippines_truncate_at_suffix(text, suffix_list):
-    # Strip the first listed suffix that terminates the name.
-    for suffix in suffix_list:
-        if text.endswith(suffix):
-            return split_last(text, suffix)
-    return text
-
-
-def england_truncate_at_suffix(text, suffix_list):
-    # Strip the first listed suffix that terminates the name.
-    for suffix in suffix_list:
-        if text.endswith(suffix):
-            return split_last(text, suffix)
-    return text
-
-
-def colombia_truncate_at_suffix(text, suffix_list):
-    # Strip the first listed suffix that terminates the name.
-    for suffix in suffix_list:
-        if text.endswith(suffix):
-            return split_last(text, suffix)
-    return text
-
-
-def chile_truncate_at_suffix(text, suffix_list):
-    # Every Chile suffix is only stripped when it terminates the name.
-    for suffix in suffix_list:
-        if text.endswith(suffix):
-            return split_last(text, suffix)
-    return text
-
-
-def venezuela_truncate_at_suffix(text, suffix_list):
-    # Legal-form markers fall into three groups: organisation types stripped
-    # from the front of the name, abbreviations stripped only from the end,
-    # and everything else stripped wherever it last appears.
-    startswith_prefixes = {'COOPERATIVA', 'INTERNATIONAL', 'CORPORACIÓN',
-                           'REPRESENTACIONES', 'ASOCIACION CIVIL', 'FUNDACION'}
-    endswith_only = {'S A', 'C A', 'R L', 'R S', 'F P', 'S R L', 'INC',
-                     'COMPANY C A', 'COMPAÑIA ANONIMA', 'CORPORATION C A'}
-    for suffix in suffix_list:
-        if suffix in text:
-            if suffix in startswith_prefixes:
-                if text.startswith(suffix):
-                    return text.split(suffix, 1)[1]
-            elif suffix not in endswith_only or text.endswith(suffix):
-                return split_last(text, suffix)
-    return text
-
-
-def moldova_truncate_at_suffix(text, suffix_list):
-    # The long legal-form names are prefixes stripped from the front of the
-    # name; the three limited-liability forms are stripped only from the end.
-    endswith_only = {'S R L', 'SOCIETATEA CU RESPONSABILITATE LIMITATA',
-                     'SOCIETATE CU RĂSPUNDERE LIMITATĂ'}
-    for suffix in suffix_list:
-        if suffix in text:
-            if suffix in endswith_only:
-                if text.endswith(suffix):
-                    return split_last(text, suffix)
-            elif text.startswith(suffix):
-                return text.split(suffix, 1)[1]
-    return text
-
-
-def moldova_truncate_at_suffix_second(text, suffix_list2):
-    # All three limited-liability forms are only stripped from the end.
-    for suffix in suffix_list2:
-        if text.endswith(suffix):
-            return split_last(text, suffix)
-    return text
-
-
-def singapore_truncate_at_suffix(text, suffix_list):
-    # Every Singapore suffix is only stripped when it terminates the name.
-    for suffix in suffix_list:
-        if text.endswith(suffix):
-            return split_last(text, suffix)
-    return text
-
-
-def hongkong_truncate_at_suffix(text, suffix_list):
-    # Every Hong Kong suffix is only stripped when it terminates the name.
-    for suffix in suffix_list:
-        if text.endswith(suffix):
-            return split_last(text, suffix)
-    return text
-
-
-def china_truncate_at_suffix(text, suffix_list):
-    # Every China suffix is only stripped when it terminates the name.
-    for suffix in suffix_list:
-        if text.endswith(suffix):
-            return split_last(text, suffix)
-    return text
-
-
-def vietnam_take_right_half(company_name: str):
-    for separator in vietnam_right_separator_list:
-        if separator in company_name:
-            return company_name.split(separator, 1)[1].strip()
-    return company_name.strip()
-
-
-def vietnam_take_left_half(company_name: str):
-    for separator in vietnam_left_separator_list:
-        if separator in company_name:
-            return company_name.rsplit(separator, 1)[0].strip()
-    return company_name.strip()
-
-
-def vietnam_truncate_at_suffix(company_name: str):
-    for suffix in vietnam_suffix_list:
-        if suffix in company_name and company_name.endswith(suffix):
-            return company_name.rsplit(suffix, 1)[0].strip()
-    return company_name.strip()
-
-
-def india_truncate_at_suffix(text, suffix_list):
-    for suffix in suffix_list:
-        if suffix in text:
-            if (
-                    suffix != ' CO' and suffix != ' INC' and suffix != ' CO LIMITED' and suffix != ' LTD'
-                    and suffix != ' LIMITED' and suffix != ' CO I' and suffix != ' I'
-            ):
-                return split_last(text, suffix)
-            elif suffix == ' CO' and text.endswith(' CO'):
-                return split_last(text, suffix)
-            elif suffix == ' INC' and text.endswith(' INC'):
-                return split_last(text, suffix)
-            elif suffix == ' CO LIMITED' and ' AND CO LIMITED' not in text:
-                return split_last(text, suffix)
-            elif suffix == ' LTD' and text.endswith(' LTD'):
-                return split_last(text, suffix)
-            elif suffix == ' LIMITED' and text.endswith(' LIMITED'):
-                return split_last(text, suffix)
-            elif suffix == ' CO I' and text.endswith(' CO I'):
-                return split_last(text, suffix)
-            elif suffix == ' I' and text.endswith(' I'):
-                return split_last(text, suffix)
-    return text
-
-
-def mexico_truncate_at_suffix(cleaned_name):
-    for suffix in mexico_suffix_list:
-        if suffix in cleaned_name and cleaned_name.endswith(suffix):
-            return cleaned_name.rsplit(suffix, 1)[0].strip()
-    return cleaned_name.strip()
-
-
-def nigeria_truncate_at_suffix(cleaned_name):
-    for suffix in nigeria_suffix_list:
-        if cleaned_name.endswith(suffix):
-            return cleaned_name.rsplit(suffix, 1)[0].strip()
-    return cleaned_name.strip()
-
-
-def peru_truncate_at_suffix(cleaned_name, peru_suffix_list):
-    for suffix in peru_suffix_list:
-        if cleaned_name.endswith(suffix):
-            return cleaned_name.rsplit(suffix, 1)[0].strip()
-    return cleaned_name.strip()
-
-
-def lesotho_truncate_at_suffix(cleaned_name, lesotho_suffix_list):
-    for suffix in lesotho_suffix_list:
-        if cleaned_name.endswith(suffix):
-            return cleaned_name.rsplit(suffix, 1)[0].strip()
-    return cleaned_name.strip()
-
-
-def germany_truncate_at_suffix(cleaned_name, germany_suffix_list):
-    for suffix in germany_suffix_list:
-        if cleaned_name.endswith(suffix):
-            return cleaned_name.rsplit(suffix, 1)[0].strip()
-    return cleaned_name.strip()
-
-
-def america_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_first_name = america_truncate_at_suffix_first(cleaned_name, frist_america_suffix_list)
-        if len(truncated_first_name.strip()) < 8:
-            return cleaned_name
-        else:
-            return truncated_first_name
-    return None
-
-
-def america_company_abbr_second(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_first_name = america_truncate_at_suffix_second(cleaned_name, SECOND_AMERICA_SUFFIX_LIST)
-        if len(truncated_first_name.strip()) < 5:
-            return cleaned_name
-        else:
-            return truncated_first_name.strip()
-    return None
-
-
-def america_company_abbr_third(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_first_name = america_truncate_at_suffix_third(cleaned_name, third_AMERICA_SUFFIX_LIST)
-        if 9 < len(truncated_first_name.strip()) < 12:
-            return cleaned_name
-        elif len(truncated_first_name.strip()) <= 9:
-            return None
-        elif len(truncated_first_name.strip()) >= 12:
-            return truncated_first_name.strip()
-    return None
-
-
-def bangladesh_company_abbr_first(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_first_name = bangladesh_truncate_at_suffix_first(cleaned_name, first_bangladesh_suffix_list)
-        return truncated_first_name.strip()
-    return None
-
-
-def bangladesh_company_abbr_second(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_first_name = bangladesh_truncate_at_suffix_first(cleaned_name, first_bangladesh_suffix_list)
-        truncated_second_name = bangladesh_truncate_at_suffix_second(truncated_first_name.strip(),
-                                                                     second_bangladesh_suffix_list)
-        if len(truncated_second_name.strip()) < 6:
-            return truncated_first_name.strip()
-        else:
-            return truncated_second_name.strip()
-    return None
-
-
-def chile_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_first_name = chile_truncate_at_suffix(cleaned_name, first_chile_SUFFIX_LIST)
-        if len(truncated_first_name.strip()) < 8:
-            return cleaned_name
-        else:
-            return truncated_first_name.strip()
-    return None
-
-
-def rwanda_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_first_name = rwanda_truncate_at_suffix(cleaned_name, FIRST_Rwanda_suffix_list)
-        if len(truncated_first_name.strip()) < 6:
-            return cleaned_name
-        else:
-            return truncated_first_name.strip()
-    return None
-
-
-def philippines_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_first_name = philippines_truncate_at_suffix(cleaned_name, FIRST_philippines_suffix_list)
-        if len(truncated_first_name.strip()) < 6:
-            return cleaned_name
-        else:
-            return truncated_first_name.strip()
-    return None
-
-
-def colombia_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_first_name = colombia_truncate_at_suffix(cleaned_name, FIRST_colombia_suffix_list)
-        if len(truncated_first_name.strip()) < 6:
-            return cleaned_name
-        else:
-            return truncated_first_name.strip()
-    return None
-
-
-def indonesia_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = indonesia_truncate_at_suffix(cleaned_name, indonesia_suffix_list)
-        if len(truncated_name.strip()) >= 8:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def venezuela_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = venezuela_truncate_at_suffix(cleaned_name, venezuela_suffix_list)
-        if len(truncated_name.strip()) >= 6:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def uzbekistan_company_abbr(company_name):
-    if company_name:
-        bak_name = company_name.upper()
-        company_name = kaz_extract_text_from_enclosers(bak_name)
-        company_name = clean_company_name(company_name)
-        for head in kaz_heads:
-            if company_name.startswith(head):
-                company_name = remove_prefix(company_name, head)
-                break
-        if len(company_name) < 8:
-            return clean_company_name(bak_name)
-        else:
-            return company_name.strip()
-    return None
-
-
-def kazakhstan_company_abbr(company_name):
-    if company_name:
-        bak_name = company_name.upper()
-        company_name = kaz_extract_text_from_enclosers(bak_name)
-        company_name = clean_company_name(company_name)
-        for head in kaz_heads:
-            if company_name.startswith(head):
-                company_name = remove_prefix(company_name, head)
-                break
-        if len(company_name) < 8:
-            return clean_company_name(bak_name)
-        else:
-            return company_name.strip()
-    return None
-
-
-def moldova_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        first_truncated_name = moldova_truncate_at_suffix(cleaned_name, moldova_suffix_list)
-        truncated_name = moldova_truncate_at_suffix_second(first_truncated_name, moldova_suffix_list2)
-        if len(truncated_name.strip()) >= 6:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def singapore_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = singapore_truncate_at_suffix(cleaned_name, singapore_suffix_list)
-        if len(truncated_name.strip()) >= 8:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def hongkong_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = hongkong_truncate_at_suffix(cleaned_name, hongkong_suffix_list)
-        if len(truncated_name.strip()) >= 6:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def china_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = china_truncate_at_suffix(cleaned_name, china_suffix_list)
-        if len(truncated_name.strip()) >= 6:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def vietnam_company_abbr(company_name: str) -> str or None:
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        right_half = vietnam_take_right_half(cleaned_name)
-        left_half = vietnam_take_left_half(right_half)
-        truncated_name = vietnam_truncate_at_suffix(left_half)
-        if len(truncated_name) >= 8:
-            return truncated_name
-        else:
-            return cleaned_name
-    return None
-
-
-def india_company_abbr(company_name):
-    if company_name:
-        bak_name = company_name.upper()
-        company_name = clean_company_name(bak_name)
-        for head in ind_head:
-            if company_name.startswith(head):
-                company_name = remove_prefix(company_name, head)
-                break
-        truncated_name = india_truncate_at_suffix(company_name, india_suffix_list)
-        if (len(truncated_name.strip()) < 8):
-            return clean_company_name(bak_name)
-        else:
-            return truncated_name.strip()
-    return None
-
-
-def ukraine_truncate_at_suffix_first(text, suffix_list):
-    for suffix in suffix_list:
-        if suffix in text:
-            return split_last(text, suffix)
-    return text
-
-
-def ukraine_truncate_at_suffix_second(text, suffix_list):
-    for suffix in suffix_list:
-        if suffix in text:
-            return split_last(text, suffix) + suffix
-    return text
-
-
-def ukraine_company_abbr_first(company_name):
-    if company_name:
-        bak_name = company_name.upper()
-        truncated_name = ukraine_truncate_at_suffix_first(bak_name, ukraine_suffix_first)
-        return truncated_name.strip()
-    return None
-
-
-def ukraine_company_abbr_second(company_name):
-    if company_name:
-        bak_name = company_name.upper()
-        truncated_name = ukraine_truncate_at_suffix_second(bak_name, ukraine_suffix_second)
-        return truncated_name.strip()
-    return None
-
-
-def mexico_company_abbr(company_name):
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = mexico_truncate_at_suffix(cleaned_name)
-        if len(truncated_name) >= 8:
-            return truncated_name
-        else:
-            return cleaned_name
-    return None
-
-
-def nigeria_company_abbr(company_name):
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = nigeria_truncate_at_suffix(cleaned_name)
-        if len(truncated_name) >= 4:
-            return truncated_name
-        else:
-            return cleaned_name
-    return None
-
-
-def philippines_company_abbr_second(company_name):
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = philippines_truncate_at_suffix(cleaned_name, FIRST_philippines_suffix_list)
-        if len(truncated_name) >= 6:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def england_company_abbr(company_name):
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = england_truncate_at_suffix(cleaned_name, FIRST_england_suffix_list)
-        if len(truncated_name) >= 8:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def peru_company_abbr(company_name):
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = peru_truncate_at_suffix(cleaned_name, peru_suffix_list)
-        if len(truncated_name) >= 6:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def lesotho_company_abbr(company_name):
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = lesotho_truncate_at_suffix(cleaned_name, lesotho_suffix_list)
-        if len(truncated_name) >= 6:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def germany_company_abbr(company_name):
-    if company_name:
-        cleaned_name = clean_company_name(company_name)
-        truncated_name = germany_truncate_at_suffix(cleaned_name, germany_suffix_list)
-        if len(truncated_name) >= 8:
-            return truncated_name.strip()
-        else:
-            return cleaned_name
-    return None
-
-
-def company_abbr(country_name: str, company_name: str) -> str or None:
-    if country_name == 'pakistan':
-        return pakistan_company_abbr(company_name)
-    if country_name == 'mirror_pakistan':
-        return mirror_pakistan_company_abbr(company_name)
-    elif country_name == 'america':
-        return america_company_abbr(company_name)
-    elif country_name == 'indonesia':
-        return indonesia_company_abbr(company_name)
-    elif country_name == 'venezuela':
-        return venezuela_company_abbr(company_name)
-    elif country_name == 'america_second':
-        return america_company_abbr_second(company_name)
-    elif country_name == 'uzbekistan':
-        return uzbekistan_company_abbr(company_name)
-    elif country_name == 'kazakhstan':
-        return kazakhstan_company_abbr(company_name)
-    elif country_name == 'chile':
-        return chile_company_abbr(company_name)
-    elif country_name == 'moldova':
-        return moldova_company_abbr(company_name)
-    elif country_name == 'bangladesh_fist':
-        return bangladesh_company_abbr_first(company_name)
-    elif country_name == 'bangladesh_second':
-        return bangladesh_company_abbr_second(company_name)
-    elif country_name == 'rwanda':
-        return rwanda_company_abbr(company_name)
-    elif country_name == 'singapore':
-        return singapore_company_abbr(company_name)
-    elif country_name == 'hongkong':
-        return hongkong_company_abbr(company_name)
-    elif country_name == 'philippines':
-        return philippines_company_abbr(company_name)
-    elif country_name == 'china':
-        return china_company_abbr(company_name)
-    elif country_name == 'vietnam':
-        return vietnam_company_abbr(company_name)
-    elif country_name == 'india':
-        return india_company_abbr(company_name)
-    elif country_name == 'ukraine_first':
-        return ukraine_company_abbr_first(company_name)
-    elif country_name == 'ukraine_second':
-        return ukraine_company_abbr_second(company_name)
-    elif country_name == 'america_third':
-        return america_company_abbr_third(company_name)
-    elif country_name == 'mexico':
-        return mexico_company_abbr(company_name)
-    elif country_name == 'colombia':
-        return colombia_company_abbr(company_name)
-    elif country_name == 'nigeria':
-        return nigeria_company_abbr(company_name)
-    elif country_name == 'philippines_second':
-        return philippines_company_abbr_second(company_name)
-    elif country_name == 'peru':
-        return peru_company_abbr(company_name)
-    elif country_name == 'lesotho':
-        return lesotho_company_abbr(company_name)
-    elif country_name == 'germany':
-        return germany_company_abbr(company_name)
-    elif country_name == 'england':
-        return england_company_abbr(company_name)
-    else:
-        return company_name
-
-
-if __name__ == '__main__':
-    test_cases = [
-        'Wilhelm Manz GmbH & Co. KG',
-        'Wilhelm Zuleeg GmbH',
-        'Aba Air Group Llc',
-        'CAMUSAT (MAURICE) LIMITED',
-        'BMTS Technology Austria GmbH & Co',
-        'Arhetipo Grup SRL',
-        'Boegli-Gravures SA',
-        'Kronos International Inc.',
-        'YAHO AUTO EXCHANGE CO. LIMITED',
-        'Radpar Otomotiv Sanayi ve Ticaret Limited Şti.',
-        'SERVICIOS INTERSEC S.A. DE C.V.',
-        'PLASTIC SOLUTIONS DI MARTOCCIA CRISTIANS.A.S.',
-        'C-Solution Elektrotechnik GbR',
-        'Baumer Hhs S.R.L.',
-        'AJH Druck & Technik Helge Klemt e.K.',
-        'ADM Hamburg AG',
-        'Lauer Ventilation UG',
-        'Bankhaus J. Faisst OHG',
-        'Continental Teves AG & Co.OHG',
-        'Dow Produktions und Vertriebs GmbH & Co. OHG',
-        'Springer Nature AG & Co. KGaA',
-        'Paragon GmbH & Co. KGaA'
-    ]
-    for test_case in test_cases:
-        print("{:<50} {:>50}".format(test_case, company_abbr('germany', test_case)))
-
-    # test_cases = [
-    #     'COMPANY LIMITED NGOC PHAT TM',
-    #     'COMPANY LTD PHAM',
-    #     'TAIHING MOULDS CO LTD',
-    #     'REPRESENTATIVE OFFICE OF HETTICH SINGAPORE SEA PTE LTD IN HO CHI MINH CITY',
-    #     'SAI GON WASTE SOLUTION JOINT STOCK COMPANY',
-    #     'ENTERTAINMENT FISHING ROD IMPORT EXPORT TRADING COMPANY LIMI',
-    #     'TPP PLUS CORP',
-    #     'VILOMIX VIETNAM LLC',
-    #     'SMILETECH JSC',
-    #     'DUC MINH CTI CO JSC',
-    #     'SMILETECH JSC',
-    #     'HUVICO LTD'
-    # ]
-    # for test in test_cases:
-    #     print(vietnam_company_abbr(test))
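The deleted `company_abbr` dispatcher above is a long if/elif chain over country keys. A dict-based dispatch with the same fall-through-to-input behavior can be sketched as follows; the `'demo'` handler is a hypothetical stand-in, not one of the real per-country functions:

```python
def make_company_abbr(handlers):
    """Build a dispatcher from a mapping of country key -> abbreviation function.

    Unknown countries fall through to the input name, mirroring the
    `else: return company_name` branch of the deleted if/elif chain.
    """
    def company_abbr(country_name, company_name):
        handler = handlers.get(country_name, lambda name: name)
        return handler(company_name)
    return company_abbr


# 'demo' is a hypothetical handler used only for illustration.
abbr = make_company_abbr({'demo': str.upper})
```

This keeps per-country logic in one mapping, so adding a country is a dict entry rather than another elif branch.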

+ 0 - 45
dw_base/spark/udf/customs/cts_common.py

@@ -1,45 +0,0 @@
-import json
-import re
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-
-@udf(returnType=ArrayType(StringType()))
-def str_to_arr(json_str: str) -> list:
-    if json_str:
-        return json.loads(json_str)
-    return []
-
-@udf(returnType=ArrayType(MapType(StringType(), StringType())))
-def str_to_map_arr(json_str: str) -> list:
-    if json_str:
-        return json.loads(json_str)
-    return []
-def merge_ws(text: str):
-    if text:
-        return ' '.join(text.split())
-    return None
-
-
-@udf(returnType=ArrayType(StringType()))
-def explode_str_to_arr(text: str) -> list:
-    if text is None:
-        return []
-    if len(text) <= 8:
-        return [text]
-    # Longer than 8 characters: collect every prefix, from the full string down to length 8
-    return [text[:i] for i in range(len(text), 7, -1)]
-
-
-def remove_special_char(text,char):
-    if text is not None and text.endswith(char):
-        return text[:-1]
-    return text
-
-
-if __name__ == '__main__':
-    # arr = str_to_arr('[{"email":"aline@forusi.com.br","type":"prospect","status":"verified","position":"Analista de Recursos Humanos","firstName":"Aline","lastName":"Cavalheiro","companyName":"Forusi","sourcePage":"https://www.linkedin.com/in/aline-cavalheiro-bb3644b8"},{"email":"karina@forusi.com.br","type":"prospect","status":"verified","position":"Coordenadora de vendas","firstName":"Karina","lastName":"Evangelista de Oliveira","companyName":"Forusi","sourcePage":"https://www.linkedin.com/in/karina-evangelista-de-oliveira-412934a6"},{"email":"raphael@forusi.com.br","type":"prospect","status":"verified","position":"Comprador Pleno","firstName":"Raphael","lastName":"Mendonça","companyName":"Forusi","sourcePage":"https://www.linkedin.com/in/raphael-mendon%C3%A7a-a7b882116"}]')
-    # print(type(arr))
-    arr = explode_str_to_arr('fsdfsafas')
-    print(arr)

+ 0 - 21
dw_base/spark/udf/customs/india_xx_restoration.py

@@ -1,21 +0,0 @@
-def transform_string(s):
-    result = []
-    i = 0
-    length = len(s)
-    while i < length:
-        # Mask the next two characters
-        if i + 2 > length:
-            result.append('X')
-        else:
-            result.append('XX')
-        i += 2
-        if i < length:
-            # Keep the following three characters
-            result.append(s[i:i + 3])
-            i += 3
-    return ''.join(result)
-
-
-if __name__ == '__main__':
-    print(transform_string('PRESTIGE MULTI ALLOYS T&C W.L.L'))
-    print(transform_string('FAR EASTERN POLYTEX(VIETNAM) LIMITED'))
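The deleted `transform_string` masks names in a repeating cycle of two hidden characters followed by three kept ones, with a trailing lone character masked as a single 'X'. A self-contained sketch of that cycle:

```python
def mask_two_keep_three(s: str) -> str:
    """Mask characters in a 2-hidden / 3-kept cycle, e.g. 'ABCDEFG' -> 'XXCDEXX'."""
    result = []
    i = 0
    while i < len(s):
        # Mask the next two characters ('X' only, if a single character remains).
        result.append('XX' if i + 2 <= len(s) else 'X')
        i += 2
        # Keep the following three characters (empty slice past the end is harmless).
        result.append(s[i:i + 3])
        i += 3
    return ''.join(result)
```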

+ 0 - 24
dw_base/spark/udf/customs/indonesia_qymc_judge.py

@@ -1,24 +0,0 @@
-prefix_list = [
-    'PT.',
-    'PT ',
-    'CV.',
-    'CV ']
-
-
-def is_prefix_and_concat(qymc, qymc_org):
-    if qymc is None or qymc_org is None:
-        return qymc_org
-    for prefix in prefix_list:
-        if qymc.startswith(prefix) and not qymc_org.startswith(prefix.replace('.', ' ')):
-            return prefix.replace('.', ' ') + qymc_org
-    return qymc_org
-
-
-def get_qymc_prefix(qymcs: list):
-    if qymcs is None or len(qymcs) == 0:
-        return None
-    for prefix in prefix_list:
-        for qymc in qymcs:
-            if qymc.startswith(prefix):
-                return prefix.replace('.', ' ')
-    return None

+ 0 - 16
dw_base/spark/udf/customs/mirror.py

@@ -1,16 +0,0 @@
-from datetime import datetime
-import time
-
-
-def mirror_flag(date,start_date):
-    start_date_dt = datetime.strptime(start_date, '%Y%m%d')
-    unix_timestamp_start_date_ms = int(time.mktime(start_date_dt.timetuple())) * 1000
-    unix_timestamp_start_date_ms += 8 * 60 * 60 * 1000
-    if date >= unix_timestamp_start_date_ms:
-        return 1
-    else:
-        return 0
-
-
-if __name__ == '__main__':
-    print(mirror_flag(1635728400000, '20211101'))
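The deleted `mirror_flag` converts a `YYYYMMDD` cutoff to epoch milliseconds via `time.mktime`, which depends on the host machine's local timezone, and then shifts by a hard-coded 8 hours. A timezone-aware sketch that pins the cutoff to UTC+8 explicitly (the offset the original code assumed) behaves the same on any host:

```python
from datetime import datetime, timezone, timedelta

CST = timezone(timedelta(hours=8))  # assumed offset: the original hard-coded +8h


def mirror_flag(event_ms: int, start_date: str) -> int:
    """Return 1 if event_ms (Unix epoch, ms) is on or after start_date midnight UTC+8."""
    cutoff = datetime.strptime(start_date, '%Y%m%d').replace(tzinfo=CST)
    return 1 if event_ms >= int(cutoff.timestamp() * 1000) else 0
```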

+ 0 - 42
dw_base/spark/udf/customs/similarity.py

@@ -1,42 +0,0 @@
-# Option 1: Levenshtein ratio; a threshold above 80 matches well (perhaps 70?)
-from fuzzywuzzy import fuzz
-
-
-def levenshtein_similarity(str1, str2):
-    if str1 is not None and str2 is not None:
-        return fuzz.ratio(str1.upper(), str2.upper())
-    return None
-
-# # Example
-# str1 = "o python"
-# str2 = "hello python"
-# similarity = levenshtein_similarity(str1, str2)
-# # print(f"similarity: {similarity}")
-#
-#
-# from sklearn.feature_extraction.text import TfidfVectorizer
-# from sklearn.metrics.pairwise import cosine_similarity
-#
-# def cosine_similarity_strings(str1, str2):
-#     vectorizer = TfidfVectorizer()
-#     tfidf = vectorizer.fit_transform([str1, str2])
-#     return cosine_similarity(tfidf)[0, 1]
-#
-# # Example
-# str1 = "hello pytho"
-# str2 = "hello python"
-# similarity = cosine_similarity_strings(str1, str2)
-# print(f"similarity: {similarity}")
-#
-#
-#
-# import difflib
-#
-# def string_similarity(str1, str2):
-#     return difflib.SequenceMatcher(None, str1, str2).ratio()
-#
-# # Example
-# str1 = "hello world"
-# str2 = "hello python"
-# similarity = string_similarity(str1, str2)
-# print(f"similarity: {similarity}")

+ 0 - 23
dw_base/spark/udf/customs/str_trans_eng.py

@@ -1,23 +0,0 @@
-# Character replacement map: Turkish letters to ASCII equivalents
-turkey_replace_dict = {
-    'ç': 'c', 'Ç': 'C',
-    'ğ': 'g', 'Ğ': 'G',
-    'ı': 'i', 'İ': 'I',
-    'ö': 'o', 'Ö': 'O',
-    'ş': 's', 'Ş': 'S',
-    'ü': 'u', 'Ü': 'U'
-}
-turkey_to_english_trans = str.maketrans(turkey_replace_dict)
-
-
-def replace_str_english(text: str, country_name: str) -> str or None:
-    if text:
-        if country_name == 'turkey':
-            return text.translate(turkey_to_english_trans)
-    return None
-
-
-if __name__ == '__main__':
-    text = 'SÖZAL KİMYA SANAYİ VE TİCARET ANONİM ŞİRKETİ'
-    english_text = replace_str_english(text, 'turkey')
-    print(english_text)
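`replace_str_english` above supports a single country via an if branch. A registry of `str.maketrans` tables extends it without new branches; only the `'turkey'` mapping comes from the deleted code, any further entries would be additions:

```python
# Per-country transliteration tables; only 'turkey' is taken from the deleted code.
TRANSLITERATIONS = {
    'turkey': str.maketrans({
        'ç': 'c', 'Ç': 'C', 'ğ': 'g', 'Ğ': 'G', 'ı': 'i', 'İ': 'I',
        'ö': 'o', 'Ö': 'O', 'ş': 's', 'Ş': 'S', 'ü': 'u', 'Ü': 'U',
    }),
}


def replace_str_english(text, country_name):
    table = TRANSLITERATIONS.get(country_name)
    if text and table:
        return text.translate(table)
    return None
```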

+ 0 - 22
dw_base/spark/udf/customs/test/clean_crawler_data_test.py

@@ -1,22 +0,0 @@
-import pytest
-from typing import Set
-from dw_base.spark.udf.customs.clean_crawler_data import get_regex_match, clean_germany_company_name
-
-
-@pytest.mark.parametrize("company_name, expected", [
-    ('S&#233;cheron SA', {'&#233;'}),
-    ('S&#233;cheron SA\\u0022ss',{'&#233;','\\u0022'}),
-    ('GEHS GR&#220;N ENERGİE HEIZUNG UND SANİT&#194;R',{'&#220;','&#194;'})
-])
-def test_get_regex_match(company_name: str, expected: Set[str]):
-    result = get_regex_match(company_name)
-    assert result == expected
-
-
-@pytest.mark.parametrize("company_name, expected", [
-    ('Beiersdorf Ind&#250;stria Com&#233;rcio', 'Beiersdorf Indústria Comércio'),
-    ('GPS Pr&#195;&#188;ftechnik Rhein/Main GmbH', 'GPS Prüftechnik Rhein/Main GmbH')
-])
-def test_clean_germany_company_name(company_name: str, expected: str):
-    result = clean_germany_company_name(company_name)
-    assert result == expected

+ 0 - 13
dw_base/spark/udf/customs/test/company_abbr_test.py

@@ -1,13 +0,0 @@
-import pytest
-
-from dw_base.spark.udf.customs.company_abbr import company_abbr
-
-
-@pytest.mark.parametrize("country_name, company_name, expected", [
-    ('germany', 'Wilhelm Manz GmbH & Co. KG', 'WILHELM MANZ'),
-    ('germany', 'Wilhelm Zuleeg GmbH', 'WILHELM ZULEEG'),
-    ('germany', 'Paragon GmbH & Co. KGaA', 'PARAGON GMBH AND CO KGAA'),
-])
-def test_company_abbr(country_name: str, company_name: str, expected: str):
-    result = company_abbr(country_name, company_name)
-    assert result == expected

+ 0 - 14
dw_base/spark/udf/customs/test/cts_common_test.py

@@ -1,14 +0,0 @@
-import pytest
-from typing import List
-
-from dw_base.spark.udf.customs.cts_common import explode_str_to_arr
-
-@pytest.mark.parametrize("text, expected", [
-    ('abc', ['abc']),
-    ('abcdefghi',['abcdefghi', 'abcdefgh']),
-    ('xxxxaaaa',['xxxxaaaa']),
-    (None,[]),
-    ('',[''])
-])
-def test_explode_str_to_arr(text:str, expected: List[str]):
-    assert explode_str_to_arr(text) == expected

+ 0 - 26
dw_base/spark/udf/customs/test/indonesia_qymc_judge_test.py

@@ -1,26 +0,0 @@
-import pytest
-
-from dw_base.spark.udf.customs.indonesia_qymc_judge import is_prefix_and_concat, get_qymc_prefix
-
-
-@pytest.mark.parametrize("qymc, qymc_org, expected", [
-    ('PT. ABC', 'PT ABC', 'PT ABC'),
-    ('PT  ABC', 'ABC', 'PT ABC'),
-    ('CV. ABC', 'CV ABC', 'CV ABC'),
-    ('CV ABC', 'CV ABC', 'CV ABC'),
-    (None, 'ABC', 'ABC'),
-    ('ABC', None, None),
-    (None, None, None),
-])
-def test_is_prefix(qymc: str, qymc_org: str, expected: str):
-    assert is_prefix_and_concat(qymc, qymc_org) == expected
-
-
-@pytest.mark.parametrize("qymcs, expected", [
-    (['CV ABC', 'PT ABC'], 'PT '),
-    (['CV. ABC', 'ABC'], 'CV '),
-    (['ABC'], None),
-    (['PT. XXX','CV CV'],'PT ')
-])
-def test_get_qymc_prefix(qymcs: list, expected: str):
-    assert get_qymc_prefix(qymcs) == expected

+ 0 - 52
dw_base/spark/udf/customs/tjsldw.py

@@ -1,52 +0,0 @@
-def get_tjsldwdm(sldwdm, sldwdm_1, sldwdm_2, qkzl):
-    if sldwdm == sldwdm_1 or sldwdm == sldwdm_2:
-        return sldwdm
-    if sldwdm_1 != 'KGS' and sldwdm_2 != 'KGS':
-        return sldwdm
-    if qkzl is None or qkzl == '':
-        return sldwdm
-    return 'KGS'
-
-
-def get_tjsldwdm_en(sldwdm, sldwdm_1, sldwdm_2, qkzl, sldw_en):
-    if sldwdm == sldwdm_1 or sldwdm == sldwdm_2:
-        return sldw_en
-    if sldwdm_1 != 'KGS' and sldwdm_2 != 'KGS':
-        return sldw_en
-    if qkzl is None or qkzl == '':
-        return sldw_en
-    return 'KILOGRAM'
-
-
-def get_tjsldwdm_cn(sldwdm, sldwdm_1, sldwdm_2, qkzl, sldw_cn):
-    if sldwdm == sldwdm_1 or sldwdm == sldwdm_2:
-        return sldw_cn
-    if sldwdm_1 != 'KGS' and sldwdm_2 != 'KGS':
-        return sldw_cn
-    if qkzl is None or qkzl == '':
-        return sldw_cn
-    return '千克'
-
-
-def get_tjsldwdm_sl(sldwdm, sldwdm_1, sldwdm_2, qkzl, sl):
-    if sldwdm == sldwdm_1 or sldwdm == sldwdm_2:
-        return sl
-    if sldwdm_1 != 'KGS' and sldwdm_2 != 'KGS':
-        return sl
-    if qkzl is None or qkzl == '':
-        return sl
-    return qkzl
-
-
-def get_tjsldwdm_sldj(sldwdm, sldwdm_1, sldwdm_2, qkzl, sldj, zldj):
-    if sldwdm == sldwdm_1 or sldwdm == sldwdm_2:
-        return sldj
-    if sldwdm_1 != 'KGS' and sldwdm_2 != 'KGS':
-        return sldj
-    if qkzl is None or qkzl == '':
-        return sldj
-    return zldj
-
-
-if __name__ == '__main__':
-    print(get_tjsldwdm('XXX', 'KGS', None, ''))

+ 0 - 0
dw_base/spark/udf/enterprise/__init__.py


+ 0 - 143
dw_base/spark/udf/enterprise/ent_clean_name_logistics.py

@@ -1,143 +0,0 @@
-special_chars = ['.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '‘',
-                 '’',
-                 '“',
-                 '”',
-                 '`',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '=',
-                 '\'',
-                 '±',
-                 '°',
-                 '«',
-                 '»',
-                 'µ',
-                 '¶',
-                 '·',
-                 '€',
-                 '£',
-                 '¥',
-                 '¢',
-                 '×',
-                 '÷',
-                 '¬',
-                 '…',
-                 '→',
-                 '←',
-                 '↑',
-                 '↓',
-                 '↔',
-                 '⇒',
-                 '⇐',
-                 '≈',
-                 '≠',
-                 '≤',
-                 '≥',
-                 '.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '"',
-                 '\'',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '¨',
-                 '´',
-                 '¿',
-                 '‰',
-                 '¯',
-                 '\x1A',
-                 '£',
-                 '>',
-                 '¿',
-                 '«',
-                 '´',
-                 '»',
-                 '°',
-                 '®',
-                 '·',
-                 '¼',
-                 '©',
-                 '¶',
-                 "'",
-                 '"',
-                 '–',
-                 '='
-                 ]
-special_char_dict = {c: ' ' for c in set(special_chars)}
-special_char_dict['&'] = ' and '
-special_chars_trans = str.maketrans(special_char_dict)
-
-multi_char_replacements = {
-    'Ï ½Ï ½Ï ½': ' ',
-    'Ï ½Ï ½': ' '
-}
-
-
-def clean_company_name(name):
-    if name:
-
-        for multi_char, replacement in multi_char_replacements.items():
-            name = name.replace(multi_char, replacement)
-
-        # replace special characters with spaces
-        name = name.translate(special_chars_trans)
-        # uppercase, collapse consecutive spaces, strip leading/trailing spaces
-        name = ' '.join(name.upper().split())
-        return name
-    else:
-        return None
-
-
-if __name__ == "__main__":
-    print(clean_company_name('BOLLORE LOGISTICSÏ ½Ï ½Ï ½ DUNKERQUE'))
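Both deleted name-cleaning modules rely on the same `str.maketrans`-based pattern: map punctuation to spaces, expand `&` to ` and `, then uppercase and collapse whitespace. A self-contained sketch of that pattern (with a deliberately reduced character set for illustration):

```python
# Minimal sketch of the shared cleaning pattern; the real modules use a much
# larger special-character list.
SPECIAL = {c: ' ' for c in '.,-()@?#'}
SPECIAL['&'] = ' and '
TABLE = str.maketrans(SPECIAL)


def clean_name(name):
    if not name:
        return None
    # translate punctuation to spaces, uppercase, collapse runs of whitespace
    return ' '.join(name.translate(TABLE).upper().split())
```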

+ 0 - 561
dw_base/spark/udf/enterprise/ent_clean_text.py

@@ -1,561 +0,0 @@
-import codecs
-import re
-import json
-from pyspark.sql.functions import udf
-from pyspark.sql.types import ArrayType, StringType
-from dw_base.spark.udf.customs.common_clean import clean_company_name
-
-url_bad_list = [
-    'www.,'
-    , 'www.'
-    , '/web:'
-    , 'http:////'
-    , 'http:///'
-    , 'http://'
-    , 'https://'
-    , 'web:'
-    , 'ww�w.'
-    , 'www�'
-    , 'w�'
-]
-
-china_url_suff_list = [
-    '.com.cn',
-    '.com',
-    '.cn',
-    '.org.cn',
-    '.org',
-    '.net',
-    '.info',
-]
-
-# Russian domain suffixes
-russia_url_suff_list = [
-    # -- generic top-level domains
-    '.group',
-    '.hu',
-    '.it',
-    '.com.cy',
-    '.com',
-    '.net',
-    '.org',
-    '.int',
-    '.edu',
-    '.tech',
-    '.group',
-    '.eco',
-    '.eu',
-    '.info',
-    '.company',
-    # -- Russian domains
-    '.рф',
-    '.ru',
-    '.su',
-    '.by',
-    '.biz',
-    '.pro',
-    '.coop',
-    '.aero',
-    '.museum',
-    '.xyz',
-    '.online',
-    '.site'
-]
-
-
-# China business-registry URL cleaning
-def clean_url_china(url):
-    if url is not None:
-        url = url.lower()
-        if url in ['ltd.', '.ltd.']:
-            return None
-        if url.endswith(',ltd.'):
-            return None
-        for bad in url_bad_list:
-            url = url.replace(bad, '')
-        for suffix in china_url_suff_list:
-            if suffix in url:
-                return url[:url.index(suffix)] + suffix
-    return url
-
-
-# Russia URL cleaning
-def clean_url_russia(url):
-    if url is not None and url != '':
-        url = url.lower()
-        for bad in url_bad_list:
-            url = url.replace(bad, '')
-        if '.' not in url:
-            return None
-        for suffix in russia_url_suff_list:
-            if suffix in url:
-                return url[:url.index(suffix)] + suffix
-    return url
-
-
-# US business-registry URL cleaning
-def clean_url_america(url):
-    if url:
-        url = url.lower()
-        for bad in url_bad_list:
-            url = url.replace(bad, '')
-        if ':' in url:
-            # split the URL to get the domain part
-            parts = url.split(':', 1)
-            url = parts[0]  # keep only the domain before the port
-            # then check for a slash; if present, keep only the part before it
-        if '/' in url:
-            parts = url.split('/', 1)
-            url = parts[0]
-        if re.search(r'(\d+\.\d+\.\d+\.\d+)', url):
-            return None
-        return url
-    return None
-
-
-# generic URL cleaning rules
-def clean_url_common(url):
-    if url:
-        url = url.lower()
-        for bad in url_bad_list:
-            url = url.replace(bad, '')
-        if not url:
-            return None
-        if '/' in url:
-            parts = url.split('/', 1)
-            return parts[0]
-        else:
-            return url
-    return None
-
-
-# URL tests
-# if __name__ == '__main__':
-#     test_case_list = [
-#         'https://www.ianshaw.biz/p/contact-management.php',
-#         'https://charnleyfertilisers.co.uk/',
-#         'https://nyulangone.org/doctors/1205925765/carol-dunetz?cid=syn_yext\u0026y_entity_id=1205925765-primary\u0026y_source=1_MjU0NTEyNzEtNDgzLWxvY2F0aW9uLndlYnNpdGU%3D',
-#         'https://www.carolleviandcompany.it/',
-#         'https://schrotthandel-heinen.de/',
-#         'http://201.149.15.54:88/',
-#         'http://190.107.176.73/~prodinwe/www2/inicio.html',
-#     'https://findadoctor.atlantichealth.org/provider/Joseph+C+Lugo/1140352?unified=lugo\u0026sort=networks%2Crelevance\u0026_ga=2.142101431.428278081.1637589591-505885973.1636636554\u0026_gac=1.36491028.1637590169.EAIaIQobChMImKmH3JKs9AIVl4TICh2yrwEXEAAYASAAEgKlDvD_BwE'
-#
-#
-#      ]
-#     for url in test_case_list:
-#         print(f'url: {url} ---->   {clean_url_america(url)}')
-
-# per-country business-registry URL cleaning
-def clean_url(country, url):
-    if country == 'China':
-        return clean_url_china(url)
-    if country == 'Russia':
-        return clean_url_russia(url)
-    if country == 'America':
-        return clean_url_america(url)
-    return None
-
-
-# Vietnam phone: substrings to replace with a delimiter
-vietnam_tel_split_list = [
-    'faxno'
-    , 'fax-'
-    , '-fax'
-    , 'fax.'
-    , 'fax'
-    , 'tele'
-]
-
-vietnam_tel_bad_list = [
-
-    'f'
-    , 'awelexports@gmailcom'
-    , 'm-'
-    , 'axno'
-    , 'ax'
-    , 'no'
-    , '(ext'
-    , 'linhkt'
-    , '.'
-    , 'nhnh3'
-]
-
-reverse_str_list = [
-    '.',
-    '/'
-]
-
-
-# reverse the order of delimited string segments
-def reverse_str(str):
-    if str:
-        for str1 in reverse_str_list:
-            if str1 in str:
-                parts = str.split(str1)
-                # reverse the split parts
-                reversed_parts = parts[::-1]
-                # rejoin the reversed parts with '-' using join
-                str = '-'.join(reversed_parts)
-        return str
-    return None
-
-
-# replace English letters and whitespace with ''
-def replace_english_and_space(str):
-    result = re.sub(r'[a-zA-Z\s]', '', str)
-    return result
-
-
-# deduplicate array elements
-def array_remove_duplicates(str):
-    if str:
-        str_array = str.split(',')
-        unique_str = list(set(str_array))
-        return ','.join(unique_str)
-    return None
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '[]',
-        '[91220101123911541QCHN]',
-        '[, 91220101123911541QCHN]'
-    ]
-    for arraystr in test_case_list:
-        print(f'tel: {arraystr} ---->  {array_remove_duplicates(arraystr)}')
-
-company_name_pattern1 = r'(^[0-9]{2}\.[0-9]{3}\.[0-9]{3})(.*)'  # 12.575.462 ALVARO PEREIRA DA SILVEIRA FILHO
-company_name_pattern2 = r'.+( [0-9]+)$'  # HEBE DE ABREU VILELA CPF 027116806149
-
-
-# company-name cleaning: remove a leading xx.xxx.xxx numeric prefix
-def clean_brazil_company_name(name):
-    if name:
-        namepattern1_match = re.search(company_name_pattern1, name)
-        if namepattern1_match:
-            namepattern1 = namepattern1_match.group(2)
-            return clean_company_name(namepattern1)
-        namepattern2_match = re.search(company_name_pattern2, name)
-        if namepattern2_match:
-            namepattern2 = namepattern2_match.group(1)
-            if len(namepattern2) > 8:
-                return clean_company_name(name.replace(namepattern2, ''))
-        return clean_company_name(name)
-    else:
-        return None
-
-
-# Turkey: comma-separated phones; entries whose length is not 10 are dropped
-def phone_clean_turkey(phone):
-    if phone:
-        # split the input string into a list
-        phone_arr = phone.split(',')
-        # keep only elements that are exactly 10 characters long
-        phone_arr_new = [str for str in phone_arr if len(str) == 10]
-        # rejoin the filtered list; return None if nothing remains
-        phone_str = ','.join(phone_arr_new) if phone_arr_new else None
-        return phone_str
-    return None
-
-
-# Turkey fax: 11 digits starting with 9 -> null; 11 digits starting with 0 -> drop the leading 0; other lengths -> null
-def fax_clean_turkey(fax):
-    if fax:
-        fax_len = len(fax)
-        if fax_len == 10:
-            return fax
-        elif fax_len == 11 and fax.startswith('0'):
-            return fax[1:]
-    return None
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        # turkey-phone-alltype
-        '4443361',
-        '2164708444',
-        '2122772674,4',
-        '2123518966,67',
-        '4449911,4441311',
-        '2126944565,4444080',
-        '214511936,2125037861',
-        '2123225997,2123228911',
-        '2165274671,2162663626,4441158',
-        '2163782062,2163782649,2163787830',
-        '',
-        None,
-        # turkey-fax-alltype
-        '021648847322',
-        '02164884732',
-        '92164884732',
-        '2164884732',
-        '0'
-
-    ]
-    for str in test_case_list:
-        print(f'tel: {str} ---->  {fax_clean_turkey(str)}')
-
-# industry-code cleaning
-pattern = r'\d{2}\.\d{2}\.\d{2}'
-
-
-def turkey_nicecode(nicecode):
-    if nicecode:
-        codes = re.findall(pattern, nicecode)
-        result = ', '.join(codes)
-        result = result.replace('.', '')
-        return result
-    return None
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '["10.71.01-Taze pastane ürünleri imalatı (yaş pasta, kuru pasta, poğaça, kek, börek, pay, turta, waffles vb.)"]',
-        '["15.12.07-Deri, kösele, karma deri ve diğer malzemelerden bavul, el çantası, cüzdan, okul çantası, evrak çantası, deriden sigaralık, deri ayakkabı bağı, kişisel bakım, dikiş, vb. amaçlı seyahat seti, vb. ürünlerin imalatı"]',
-        '["07.29.06-Krom madenciliği"]',
-        '["35.12.13-Elektrik enerjisinin iletimi (elektrik üretim kaynağından dağıtım sistemine aktaran iletim sistemlerinin işletilmesi)","42.22.02-Enerji santralleri inşaatı (hidroelektrik santrali, termik santral, nükleer enerji üretim santralleri vb.)","35.11.19-Elektrik enerjisi üretimi"]'
-    ]
-    # for str in test_case_list:
-    #     print(f'tel: {str} ---->  {turkey_nicecode(str)}')
-
-email_pattern1 = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9]+@[a-zA-Z.]+'  # CONTADOR@JANSENANDRE@HOTMAIL.COM
-email_pattern2 = r'.*@$'  # XXXXXXXXX@XXXXX@  @@@@@@@@@@2  @@@@@@@@@@
-
-brazil_bad_email = [
-    '@', '*', '-', '.', ','
-]
-
-
-# Brazil email cleaning
-def email_clean_brazil(email):
-    if email:
-        email = email.lower().replace('@@', '@').replace(',.', '.').replace('.,', '.')
-        for badstr in brazil_bad_email:
-            if email.startswith(badstr) | email.endswith(badstr):
-                return None
-        if '.' not in email:
-            return None
-        if email == 'flr@flr.@bol.com.br':
-            return None
-        if email.count('@') == 1:
-            email = email.replace(',', '.')
-            if re.search(r'[a-zA-Z0-9]+[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._%+-]+\.[a-zA-Z]*', email):
-                return email
-        if re.search(email_pattern1, email):
-            return None
-        email_pattern = re.compile(r'[a-zA-Z0-9]+[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._%+-]+\.[a-zA-Z]*')
-        emails = email_pattern.findall(email)
-        return emails
-    return None
-
-
-if __name__ == '__main__':
-    test_case_list = [
-        'HUGO.SANSIL@GMAIL.COM   E HUGO@SISTEMAFIEG.ORG.BR',
-        "laaltenhofen@brturbo.com.br   ou luialtenhofen@hotmail.com",
-        "SANDRA_MMC@BOL.COM.BR               NAIRMOTADIAS@HOTMAIL.COM",
-        "fundesco@ig.com.br /e ou juliocesarcoelho@ig.com.br",
-        "choco.mixgold@hotmail.com / ou  elisangela.sena@gmail.com",
-        "SALES@ZGC.COM / REPAIR@ZGC.COM / WWW.ZGC.COM",
-        "veronica@beereayres.com.br      veronicabeer@uol.com.br        advocacia@beereayres.com.br",
-        "emanoel@amazoniaim.aginaria@org.br",
-        "emerson.pires@contabilidadepires@.com.br"
-        , '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
-        , '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@2'
-        , 'XXXXXXXXXXXXX@XXXXX@'
-        , '@'
-        , 'lcregina@terra.com.br+phcontabil@brturb'
-        , 'lcregina@terra.com.br, phcontabil@brturb'
-        , 'abc.abc@abcabc@brturb'
-        , 'alguizardi@hotmail.com -Terceiros.10@hotmail.com'
-        , 'JURIDICO@LSCONTABILIDADE,COM.BR'
-        , 'LUCIANO.KLEIMAN@B4WASTE.COM.,BR'
-        , 'flr@flr.@bol.com.br'
-        , 'aurenirrodrigues@ig,com.br'
-        , 'kallfotosdigital@hotmailcom'
-        , 'jurandireleicao2020@gmail'
-        , ',,CICERO.BONFIM@HOTMAIL.COM'
-        , ',japsjcampos@ig.combr'
-        , 'FERNANDO@TWINFORMATICA.COM.BR<FERNANDO@TWINFORMATICA.COM.BR>'
-    ]
-    for str in test_case_list:
-        print(f'tel: {str} ---->  {email_clean_brazil(str)}')
-
-
-def arr_str_to_str(str):
-    # check whether the input string is empty
-    if str:
-        str = str.replace('[]', '')
-    if str:
-        # parse the JSON string with json.loads(), then join the list into a string
-        return ','.join(json.loads(str))
-    # return None for empty input
-    return None
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '[]',
-        '["accounting","financial services"]',
-        '["staffing & recruiting"]',
-        '["management consulting","business consulting & services"]',
-        '',
-        None
-    ]
-    for arraystr in test_case_list:
-        print(f'tel: {arraystr} ---->  {arr_str_to_str(arraystr)}')
-
-bad_tel_part1 = r'[^0-9+]'
-bad_tel_part2 = re.compile(r'^(.*?)([^\d]+)$')  # (r'^(.*?)([a-zA-Z\-\(\) ]+)$')
-
-
-def clean_tel_apollo(str):
-    if str:
-        # str = str.lower()
-        # for bad_tel_str in bad_tel_list:
-        #     str= str.replace(bad_tel_str,'')
-        clean_str1 = re.sub(bad_tel_part1, ' ', str)
-        str = ' '.join(clean_str1.split())
-        bad_match = bad_tel_part2.search(str)
-        if bad_match:
-            str = bad_match.group(1).strip()
-        else:
-            str = str.strip()
-        # check the digit count
-        cleane_str2 = re.sub(r'[^\d]', '', str)
-        if len(cleane_str2) < 7:
-            return None
-        else:
-            return str
-    return None
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '+1-866-344-7857 ext. 311',
-        '(678)826-BUY1',
-        '(844)800-BULL',
-        '+ (373) 68 488 807 MDA',
-        '++420 606 075 787 (Po - Pá)',
-        '+1 412-281-4100 ext 212',
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f'tel: {str_tel} ---->  {clean_tel_apollo(str_tel)}')
-
-type_url = {
-    "author": "tw.com/",
-    "facebook": "facebook.com/",
-    "google": "google.com/",
-    "google|twcamp": "tw.com/",
-    "instagram": "instagram.com/",
-    "linkedin": "linkedin.com/",
-    "pinterest": "pinterest.com/",
-    "serp|twgr": "tw.com/",
-    "tfw": "tw.com/",
-    "tfw&screen_name=ferrespanola&tw_p=followbutton": "tw.com/",
-    "twitter": "twitter.com/",
-    "youtube": "youtube.com/",
-    "crunchbase": "crunchbase.com/",
-    "angellist": "angel.co/"
-}
-bad_url_list = [
-    'https:', 'https://www', 'www'
-]
-
-
-def socialmedia_url(socialtype, url):
-    if not url:
-        return None
-    # check whether the category exists in the mapping
-    if socialtype in type_url:
-        url_split = type_url[socialtype]
-        url = url.lower()
-        if url_split in url:
-            url_clean = url.split(url_split)[-1].rstrip('/|#>+-.;?@}')
-            if url_clean in bad_url_list:
-                return None
-            else:
-                return url_clean
-    url = url.lower().rstrip('/|#>+-.;?@}')
-    if url in bad_url_list:
-        return None
-    else:
-        return url
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        ("youtube", "https://youtube.com/user/BrotherCanadaEn"),
-        ("facebook", "https://www.facebook.com/eastwesteng/"),
-        ("google", "https://google.com/search?q=test"),
-        ("author", "https://tw.com/SRAMroad?ref_src=twsrc"),
-        ("tfw&screen_name=ferrespanola&tw_p=followbutton", "https://tw.com/search?q=test"),
-        ("serp|twgr", "https://tw.com/search?q=test"),
-        ("twitter", "https://twitter.com/#"),
-        ("linkedin", "https://www.linkedin.com/in/meb-jsc/#"),
-        ("instagram", "https://www.instagram.com/##############/"),
-        ("facebook", "https://www.facebook.com/https://www.facebook.com/komlider38/"),
-        ("pinterest", "https://www.pinterest.com/lampstore/https://www.pinterest.com/lampstore/"),
-        ("linkedin",
-         "https://www.linkedin.com/start/join?session_redirect=https://www.linkedin.com/company/swelect-energy-systems-ltd?trk=biz-companies-cym&source=D8E90337EA&trk=login_reg"),
-        ("google", "https://twitter.com/search?q=test"),
-        ("whatsapp", "919822025525"),
-        ("nonexistent", "https://nonexistent.com/page"),
-        ("", "919822025525"),
-        ("twitter", "https://twitter.com/92342/3#4"),
-        ("twitter", "https://twitter.com/@#dfw}kdn|"),
-        ("twitter", "https://twitter.com/euroledwwwhttps:"),
-        ("facebook", "https://facebook.com/alburoojrealestate/"),
-        (None, ""),
-        ("", None),
-        (None, None)
-    ]
-
-    for socialtype, url in test_case_list:
-        suffix = socialmedia_url(socialtype, url)
-        print(f'category: {socialtype}, url: {url} ---->  {suffix}')
-
-
-def hongkong_previous_name_clean(str):
-    if str:
-        if str.startswith('-- '):
-            str = str[3:]
-        else:
-            str = str[12:]
-        return str
-    return None
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '-- PACIFIC PRODUCTS LIMITED AUSTRALIAN PRODUCTS LIMITED',
-        '03-MAY-2013 Fuente Union Import And Export Limited 福恩特聯合進出口有限公司',
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f':{str_tel}---->{hongkong_previous_name_clean(str_tel)}')
-
-# UK crawler: extract shareholding percentage
-sharepercent_pattern = re.compile(r'\["ownership-of-shares-(.+?)-percent')
-def uk_sharepercent(str):
-    if str:
-        sharepercent_match = re.search(sharepercent_pattern, str)
-        if sharepercent_match:
-            sharepercent = sharepercent_match.group(1)
-            return sharepercent
-        else:
-            return None
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"]',
-        '["ownership-of-shares-more-than-25-percent-registered-overseas-entity"]',
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f':{str_tel}---->{uk_sharepercent(str_tel)}')

+ 0 - 277
dw_base/spark/udf/enterprise/ent_company_abbr.py

@@ -1,277 +0,0 @@
-import re
-
-special_chars = ['.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '‘',
-                 '’',
-                 '“',
-                 '”',
-                 '`',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '=',
-                 '\'',
-                 '±',
-                 '°',
-                 '«',
-                 '»',
-                 'µ',
-                 '¶',
-                 '·',
-                 '€',
-                 '£',
-                 '¥',
-                 '¢',
-                 '×',
-                 '÷',
-                 '¬',
-                 '…',
-                 '→',
-                 '←',
-                 '↑',
-                 '↓',
-                 '↔',
-                 '⇒',
-                 '⇐',
-                 '≈',
-                 '≠',
-                 '≤',
-                 '≥',
-                 '.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '"',
-                 '\'',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '¨',
-                 '´',
-                 '¿',
-                 '‰',
-                 '¯',
-                 '\x1A',
-                 '£',
-                 '>',
-                 '¿',
-                 '«',
-                 '´',
-                 '»',
-                 '°',
-                 '®',
-                 '·',
-                 '¼',
-                 '©',
-                 '¶',
-                 "'",
-                 '"'
-                 ]
-special_char_dict = {c: ' ' for c in set(special_chars)}
-special_char_dict['&'] = ' and '
-special_chars_trans = str.maketrans(special_char_dict)
-
-ind_head = [
-    'THE ',
-    'M S',
-    'MS'
-]
-
-india_suffix_list = [
-    ' PRIVATELIMITED',
-    ' LLP',
-    ' CO I PVT L',
-    ' CO PVT L',
-    ' CO PRIVATE L',
-    ' CO I LTD',
-    ' I LTD',
-    ' I LIMITED',
-    ' I PVT L',
-    ' I PRIVATE L',
-    ' COMPANY PRIVATE L',
-    ' COMPANY PVT L',
-    ' P LTD',
-    ' PRIVATE L',
-    ' PVT L',
-    ' CO LTD',
-    ' CO',
-    ' INC',
-    ' CO LIMITED',
-    ' LTD',
-    ' LIMITED',
-    ' CO I',
-    ' I'
-]
-
-
-def clean_company_name(name):
-    if name:
-        # replace special characters with spaces
-        name = name.translate(special_chars_trans)
-        # uppercase, collapse consecutive spaces, strip leading/trailing spaces
-        name = ' '.join(name.upper().split())
-        return name
-    else:
-        return None
-
-
-def split_last(text, suffix):
-    if text:
-        last_occurrence_index = text.rfind(suffix)
-        if last_occurrence_index != -1:
-            return text[:last_occurrence_index]
-        return text
-    return None
-
-
-def india_truncate_at_suffix(text, suffix_list):
-    for suffix in suffix_list:
-        if suffix in text:
-            if (
-                    suffix != ' CO' and suffix != ' INC' and suffix != ' CO LIMITED' and suffix != ' LTD'
-                    and suffix != ' LIMITED' and suffix != ' CO I' and suffix != ' I'
-            ):
-                return split_last(text, suffix)
-            elif suffix == ' CO' and text.endswith(' CO'):
-                return split_last(text, suffix)
-            elif suffix == ' INC' and text.endswith(' INC'):
-                return split_last(text, suffix)
-            elif suffix == ' CO LIMITED' and ' AND CO LIMITED' not in text:
-                return split_last(text, suffix)
-            elif suffix == ' LTD' and text.endswith(' LTD'):
-                return split_last(text, suffix)
-            elif suffix == ' LIMITED' and text.endswith(' LIMITED'):
-                return split_last(text, suffix)
-            elif suffix == ' CO I' and text.endswith(' CO I'):
-                return split_last(text, suffix)
-            elif suffix == ' I' and text.endswith(' I'):
-                return split_last(text, suffix)
-    return text
-
-
-def remove_prefix(text, prefix):
-    if text.startswith(prefix):
-        return text[len(prefix):]
-    return text
-
-
-def india_company_abbr(company_name):
-    if company_name:
-        bak_name = company_name.upper()
-        # remove_dots_name = remove_dots_from_abbr(bak_name)
-        company_name = clean_company_name(bak_name)
-        for head in ind_head:
-            if company_name.startswith(head):
-                company_name = remove_prefix(company_name, head)
-                break
-        truncated_name = india_truncate_at_suffix(company_name, india_suffix_list)
-        if (len(truncated_name.strip()) < 8):
-            return clean_company_name(bak_name)
-        else:
-            return truncated_name.strip()
-    return None
-
-
-def company_abbr(country_name: str, company_name: str) -> str or None:
-    if country_name == 'india':
-        return india_company_abbr(company_name)
-
-
-def remove_dots_from_abbr(text):
-    # define the regex pattern
-    pattern = r'(([A-Z]\.)+) .*'
-    # first check whether the string matches the pattern
-    match = re.search(pattern, text)
-    if match:
-        # if it matches, extract the matched part and strip its dots
-        matched_text = match.group(1)
-        # remove the dots from the matched part
-        modified_text = matched_text.replace('.', '')
-        # substitute the modified part back into the original text
-        result = text.replace(matched_text, modified_text)
-        return result
-    else:
-        # otherwise return the original string
-        return text
-
-
-if __name__ == '__main__':
-    # example usage
-    case_list = ['X.X. XXXXXX',
-                 'A.A.A. some text B.B.B.B. more text',
-                 'X.X.X. XXXXXX',
-                 'K.N. TEXFAB',
-                 'AAKASH OIL FIELD SERVICES PVT.LTD.',
-                 'PARVEEN TRADING CO.',
-                 'KONNET SOLUTIONS PVT. LTD.',
-                 'DAINICHI COLOR INDIA PVT.LTD.',
-                 'NOVA IRON & STEEL LTD.',
-                 'RPA COPPER DISTRIBUTORS PVT.LTD.',
-                 'DURA AUTO SYSTEMS INDIA PV.LTD.',
-                 'SPG CORPORATION PVT.LTD.',
-                 'MESSRS.K. KRISHNAMURTHY BOOKS & PERIODICALS',
-                 'MALHAR FASHIONS (INDIA) PVT. LTD.',
-                 'ELITE BREADS PVT. LTD',
-                 'MINILEC INDIA PVT.LTD.',
-                 'CALISTA PROPERTIES PVT.LTD.',
-                 'PRADIP ENTERPRISES LTD.',
-                 'ESTEE AUTO PRESSINGS PRIVATE LTD.',
-                 'DR.(MS)BUNTY M.JAVA',
-                 'INDUSTRADE(PROP.PHADKE SANJAY ARAVIND)',
-                 'LEDER FX.',
-                 'PINNACLE TELE SERVICES PVT. LTD.',
-                 'HARIBHARAT EQUIPMENTS PVT.LTD.',
-                 'CECáINTERNATIONALáCORPORATIONá(I)áPVT.áLTD.',
-                 'BRUNOS COMPUTER SOLUTIONS & SOFTWARE PVT. LTD.',
-                 'DREAMS ENTERPRISES.',
-                 'SKR FOODS PVT. LTD.',
-                 ]
-    for case in case_list:
-        print(case + " ===> " + company_abbr('india', case))

+ 0 - 24
dw_base/spark/udf/enterprise/ent_india_offline_udf.py

@@ -1,24 +0,0 @@
-# enterprise DB India uniqueness adjustment: offline-data UDFs
-from datetime import datetime
-
-
-def clean_zau_date(zau_date):
-    try:
-        # try to parse the date string
-        date_obj = datetime.strptime(zau_date, "%d %B %Y")
-        # reformat as an ISO date string
-        return date_obj.strftime("%Y-%m-%d")
-    except (ValueError, TypeError):
-        # return None on parse failure or wrong type
-        return None
-
-
-def clean_his_gs_date(his_gs_date):
-    try:
-        # parse the input date string
-        date_obj = datetime.strptime(his_gs_date, "%d-%m-%Y")
-        # reformat as an ISO date string
-        return date_obj.strftime("%Y-%m-%d")
-    except ValueError:
-        # return None on parse failure
-        return None
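The two date cleaners in this deleted file follow one pattern: parse a known source format and reformat to ISO `YYYY-MM-DD`, returning `None` on failure. A generalized sketch (the `normalize_date` helper is an illustration, not part of the original module):

```python
from datetime import datetime


def normalize_date(value, source_fmt):
    # Parse `value` using the given source format and reformat it as an
    # ISO YYYY-MM-DD string; return None on any parse or type failure.
    try:
        return datetime.strptime(value, source_fmt).strftime("%Y-%m-%d")
    except (ValueError, TypeError):
        return None
```

With this helper, `clean_zau_date` becomes `normalize_date(s, "%d %B %Y")` and `clean_his_gs_date` becomes `normalize_date(s, "%d-%m-%Y")`.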

+ 0 - 70
dw_base/spark/udf/enterprise/ent_logistics_label.py

@@ -1,70 +0,0 @@
-from dw_base.spark.udf.enterprise.ent_clean_name_logistics import clean_company_name
-
-LOGISTIC_MATCH = [
-    "AIR & SEA",
-    "AIR + OCEAN",
-    "APEX",
-    "C.H. ROBINSON",
-    "CARGO",
-    "CONTAINER",
-    "DELIVERY",
-    "DHL",
-    "EXPEDITORS",
-    "EXPRESS",
-    "FEDEX",
-    "FORWARD",
-    "FORWARDER",
-    "FORWARDING",
-    "FREIGHT",
-    "KUEHNE NAGEL",
-    "LINE",
-    "LINES",
-    "LOGISTIC",
-    "LOGISTICAL",
-    "LOGISTICS",
-    "MAERSK",
-    "OOCL",
-    "ORDER",
-    "SCHENKER",
-    "SHIP",
-    "SHIPPING",
-    "SUPPLY CHAIN",
-    "TRANSPORT",
-    "TRANSPORTATION",
-    "LOGISTICOS",
-    "TRANSPORTES",
-    "NVOCC",
-    "AIR AND SEA",
-    "AIR SEA",
-    "AIRSEA",
-    "DSV AIR SEA",
-    "LOGISTICĂ",
-    'LOJISTIK'
-]
-
-REMOVE_LOGISTIC_MATCH = ['VISAGE LINES PERSONAL CARE PRIVATE LIMITED']
-
-def contains_all_tokens(source_tokens, target_tokens):
-    source_set = set(source_tokens)
-    return all(token in source_set for token in target_tokens)
-
-
-def is_logistic_match(name):
-    company_name = clean_company_name(name)
-    name_tokens = company_name.split()
-
-    for logistic_match in LOGISTIC_MATCH:
-        logistic_tokens = clean_company_name(logistic_match).split()
-
-        if contains_all_tokens(name_tokens, logistic_tokens):
-            if 'CONTAINER BAG' in company_name:
-                return False
-            for remove_logistic_match in REMOVE_LOGISTIC_MATCH:
-                if company_name == remove_logistic_match:
-                    return False
-            return True
-
-    return False
-
-if __name__ == '__main__':
-    print(is_logistic_match('ALINE'))
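The deleted label logic matches on whole tokens rather than substrings, which is why `ALINE` does not trigger the `LINE` keyword. The core check, reproduced standalone for illustration:

```python
def contains_all_tokens(source_tokens, target_tokens):
    # True only when every target token appears as a whole token in source;
    # substrings of a token ('LINE' inside 'ALINE') never match.
    source_set = set(source_tokens)
    return all(token in source_set for token in target_tokens)
```

Multi-word keywords such as `"SUPPLY CHAIN"` are handled by splitting them into tokens and requiring all of them to be present.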

File diff suppressed because it is too large
+ 0 - 167
dw_base/spark/udf/enterprise/ent_spider_clean.py


+ 0 - 386
dw_base/spark/udf/enterprise/spark_eng_ent_ctstel_clean.py

@@ -1,386 +0,0 @@
-import codecs
-import re
-import json
-from pyspark.sql.functions import udf
-from pyspark.sql.types import ArrayType, StringType
-
-# scientific notation to plain number
-
-scientific_pattern = r'([0-9]*\.?[0-9]+)[eE]([-+]?[0-9]+)'
-def scientific_to_number(input_str):
-    if input_str:
-        match = re.match(scientific_pattern, input_str)
-        if match:
-            base_number = float(match.group(1))
-            exponent = int(match.group(2))
-            result = base_number * (10 ** exponent)
-            return str(int(result))
-        else:
-            return input_str
-    return None
-
-
-
-pattern_space = r'(?<!\d)\s+|\s+(?!\d)'
-# pattern_keep_space = r'(\d)\s+(\d)'
-# pattern_remove_space = r'([^\d])\s+([^\d])'
-# phone-delimiter detection: when the segments on both sides of a delimiter are long enough, join them with @@ so they can be exploded later
-def judge_delimiter(tel_str):
-    if tel_str:
-        # regex pass over whitespace: keep only spaces between digits
-        tel_str = re.sub(pattern_space, '', tel_str)
-        # split on delimiter characters
-        parts = re.split(r'[/,\s@&;]+', tel_str)
-        # holds the processed parts
-        new_parts = []
-        # walk the split parts
-        for i in range(len(parts)):
-            # skip empty parts
-            if not parts[i]:
-                continue
-            # check whether this part and the next are both at least 6 chars long
-            if i < len(parts) - 1 and len(parts[i]) >= 6 and len(parts[i + 1]) >= 6:
-                # if so, append this part followed by the @@ joiner
-                new_parts.append(parts[i] + '@@')
-            else:
-                # otherwise append the part followed by a space
-                new_parts.append(parts[i] + ' ')
-        # recombine the processed parts into one string
-        return ''.join(new_parts)
-    return None
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '666666 7777777',
-        '262-255- 7177 // 273308256',
-        'abc 123 456 def' ,
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f'tel: {str_tel} ---->{judge_delimiter(str_tel)}')
-
-# Check the digit count of a phone or fax number
-def judge_tel_length(str):
-    if str:
-        length_str = re.sub(r'[^\d]', '', str)
-        if len(length_str) < 6:
-            return None
-        else:
-            return str
-    return None
-
-
-# Strip special characters from the head and tail of the string
-remove_chars = ' :/-;?@#>.,*'
-def clean_headtail(str):
-    if str:
-        remove_str = str.strip(remove_chars)
-        str = remove_str.lstrip(')').rstrip('(')
-        return str
-    return None
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '123-243',
-        '(123345)',
-        '(010)1(2)3345',
-        '(010)1(2)334(5)',
-        '(010)123345)',
-        '(010)123345(',
-        '(010123345',
-        ')010123345',
-        '472601(',
-        '',
-        None,
-        '913207067724649993*'
-    ]
-    for str_tel in test_case_list:
-        print(f':{str_tel}---->{clean_headtail(str_tel)}')
-
-tel_bad_list=[
-    ':',
-    ';',
-    ',',
-    '.',
-    '?',
-    '//',
-    '()',
-    '( )',
-    '�'
-]
-
-def col_tel_clean(tel_str):
-    if tel_str:
-        if 'e+' in tel_str.lower():
-            tel_str = scientific_to_number(tel_str)
-        cleaned_zero = re.sub(r'\.0+$', '', tel_str)
-        for bad in tel_bad_list:
-            cleaned_zero = cleaned_zero.replace(bad, ' ')
-        clean_letter = re.sub(r'[a-zA-Z]', '', cleaned_zero)
-        clean_headtail = clean_letter.lstrip('/-;?@#>').rstrip('/-;?@#>')
-        clean_blank = re.sub(r'\s+', ' ', clean_headtail).strip()
-        tel_str = judge_delimiter(clean_blank)
-        if tel_str:
-            # Check the digit count
-            length_str = re.sub(r'[^\d]', '', tel_str)
-            if len(length_str) < 6:
-                return None
-            else:
-                return tel_str
-        return None
-    return None
-
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        'Fax: +1 780 468 9165',
-        'FAX.9545852544',
-        'FAX/5618446131',
-        'Fax : 833.338.8901',
-        'Fax/: 833.338.8901',
-        '(615) 316-5100 // FAX (615) 31',
-        'TEL: 507-69828001',
-        '6914 1002 TAX ID:200514854D',
-        'Fax No: +86 (0) 527.84495888',
-        'RUT:76.631.726-K',
-        'FAX. 41 32 392 51 07B>',
-        'FAX9545852544/46',
-        'FAXSIN FAX',
-        'LONGROnO 3871232',
-        '6910500.0000',
-        '6910500.0',
-        '3.203177e+11',
-        '3.19213916545e+11',
-        '1230000',
-        '5397-4880,5397-1333',
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f'tel: {str_tel} ---->{col_tel_clean(str_tel)}')
-
-
-col_bad_email=[
-    '@','*','-','.',','
-]
-def col_email_clean(email):
-    if email:
-        email = email.lower().replace('@@', '@').replace(',.', '.').replace('.,', '.')
-        for badstr in col_bad_email:
-            if email.startswith(badstr) or email.endswith(badstr):
-                email=email.replace(badstr,'')
-        if '.' not in email:
-            return None
-        if email.count('@') == 1:
-            email = email.replace(',', '.')
-            # Standard email format
-            if re.search(r'[a-zA-Z0-9]+[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._%+-]+\.[a-zA-Z]*', email):
-                return email
-        if email.count('@') >= 2:
-            # CONTADOR@JANSENANDRE@HOTMAIL.COM
-            if re.search( r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9]+@[a-zA-Z.]+', email):
-                return None
-            # Standard email format
-            email_pattern = re.compile(r'[a-zA-Z0-9]+[a-zA-Z0-9._%+-]+@[a-zA-Z0-9._%+-]+\.[a-zA-Z]{1,3}')
-            email = email_pattern.findall(email)
-            if email:
-                return ','.join(email)
-    return None
-
-if __name__ == '__main__1':
-    test_case_list = [
-        'info@gcrcompact@gcrieber.com',
-        'HITECH@MOLDER.COM.HK, HITECH@M',
-        'siki.huang@byd.com / betty.qiu@b',
-        'sales@dayusainc.com <sales@day',
-        'cora@38f.net,fini@39f.net',
-        'italmaq.amm@gmail.com',
-        'info@papeleradelpacifico.com,',
-        'WWW,IMPERIO_CARGO@GMAIL.COM',
-        'SUCDEN@SUCDEN.COM. AMERICAS@SUCD',
-        'COMEX@TELMAC.COM.BR/ COMEX4@TE',
-        'abby@jinshen.cnabby@jinshenmc',
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f'tel: {str_tel} ---->{col_email_clean(str_tel)}')
-
-
-# Match fax numbers mixed with a phone number (the string contains both tel/ph
-# and fax); extract the fax number that follows "fax"
-tel_fax_pattern1 = re.compile(r'(ph|tel)(.*)[(]?fax[)]?(.*)', re.IGNORECASE)
-tel_fax_pattern2 = re.compile(r'[^ph|tel]tel[e]?[\s]?[&(]?fax[\s:.)]?[n]?[o]?[\s:.]?', re.IGNORECASE)
-# Match fax numbers that contain only "fax"
-fax_pattern = re.compile(r'(fax)', re.IGNORECASE)
-
-# Extract fax from the India jksdh field
-def ind_getfax_jksdh(tel_str):
-    if tel_str:
-        tel_fax_match1 = re.search(tel_fax_pattern1, tel_str)
-        tel_fax_match2 = re.search(tel_fax_pattern2, tel_str)
-        fax_match1 = re.search(fax_pattern, tel_str)
-        # When both phone and fax are present, take the fax
-        if tel_fax_match1:
-            # If tel and fax are joined together, treat it as fax and replace that part with @@
-            if tel_fax_match2:
-                split_fax = re.sub(tel_fax_pattern2, '@@', tel_str)
-                # Strip the remaining letters
-                split_fax_cleanletter = re.sub(r'[a-zA-Z]', '', split_fax)
-                split_fax_enter = re.sub(r'\s+', ' ', split_fax_cleanletter)
-                return split_fax_enter.strip(remove_chars)
-            get_afterfax = tel_fax_match1.group(3)
-            clean_afterfax = re.sub(r'[a-zA-Z]', '', get_afterfax)
-            return clean_afterfax.strip(remove_chars)
-        # Only fax present
-        if fax_match1:
-            split_fax = re.sub(fax_pattern, '@@', tel_str)
-            split_fax_cleanletter = re.sub(r'[a-zA-Z]', '', split_fax)
-            return split_fax_cleanletter.strip(remove_chars)
-    return None
-
-def ind_fax_jksdh_clean(jksdh):
-    fax = judge_delimiter(jksdh)
-    if fax:
-        for bad in tel_bad_list:
-            fax = fax.replace(bad, '')
-        return fax
-    return None
-
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '011-23557208,telefax0129-2279612 to 615',
-        '02-65111032/020-65111033 tel fax',
-        '033-2358-7784, 03323587789(telefax)',
-        '431222 EXTEN:201/431821(D)/FAX NO.091-0422-431672',
-        'PH 080-91133444  FAX 080-91133502',
-        'PH. 011-514 6164, 514 6165, 540 0984, FAX. 011- 549 2977',
-        'TEL:91-22-4921900-05 FAX:91-22-4939284,4950594',
-        'TELE FAX-91-22-27681365/27686787',
-        'TELE:66011640, FAX:26057125',
-        'TELEFAX+914428143552/+914443502454',
-        'TELEFAX-3432782/02223423810/02223432782',
-        'TELEFAX-4074329',
-        'Tel : (044) 823 2117 Fax (044) 823 4411',
-        'Tel : 080-3349348 / Fax : 080-3348607',
-        'Tel : 080-3349348; Fax : 080-3348607',
-        'Tel : 22-8731998 Fax : 022-8711911',
-        'Tel: 344 3644, Fax no: 342 9023',
-        'Telefax 4930742',
-        '42011184,42011135,42157331/TELEFAX NO.28584954/MOBILE 9840104275',
-        '07232-44134/44247fax45430',
-        '011-23557208,telefax0129-2279612 to 615',
-        '40460655 tel fax no 21021042',
-        '022-25890222 FAX NO.022-25890411',
-        '28271933 FAX NO. 28302531/32',
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f'str: {str_tel} ---->{ind_getfax_jksdh(str_tel)}')
-
-# Extract phone from the India jksdh field
-def ind_gettel_jksdh(tel_str):
-    if tel_str:
-        tel_fax_match1 = re.search(tel_fax_pattern1, tel_str)
-        # Both phone and fax present: take the phone
-        if tel_fax_match1:
-            get_aftertel = tel_fax_match1.group(2)
-            clean_aftertel = re.sub(r'[a-zA-Z]', '', get_aftertel)
-            return clean_aftertel.strip(remove_chars)
-        # jksdh contains no fax
-        clean_letter = re.sub(r'[a-zA-Z]', ' ', tel_str)
-        clean_enter = re.sub(r'\s+', ' ', clean_letter)
-        return clean_enter.strip(remove_chars)
-    return None
-
-def ind_tel_jksdh_clean(jksdh):
-    tel_str = ind_gettel_jksdh(jksdh)
-    tel = judge_delimiter(tel_str)
-    if tel:
-        for bad in tel_bad_list:
-            tel = tel.replace(bad, '')
-        return tel
-    return None
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '(0161) 662154, 660637 &amp; 664538',
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f'tel: {str_tel} ---->{ind_tel_jksdh_clean(str_tel)}')
-
-if __name__ == '__main__':
-    test_case_list = [
-        '011-23557208,telefax0129-2279612 to 615',
-        '431222 EXTEN:201/431821(D)/FAX NO.091-0422-431672',
-        'PH 080-91133444  FAX 080-91133502',
-        'PH. 011-514 6164, 514 6165, 540 0984, FAX. 011- 549 2977',
-        'TEL:91-22-4921900-05 FAX:91-22-4939284,4950594',
-        'TELE FAX-91-22-27681365/27686787',
-        'TELE:66011640, FAX:26057125',
-        'Tel : (044) 823 2117 Fax (044) 823 4411',
-        'Tel : 080-3349348 / Fax : 080-3348607',
-        'Tel : 080-3349348; Fax : 080-3348607',
-        'Tel : 22-8731998 Fax : 022-8711911',
-        'Tel: 344 3644, Fax no: 342 9023',
-        '25594911 TO 916',
-        '8012997/f-8626376',
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f'tel: {str_tel} ---->{ind_tel_jksdh_clean(str_tel)}')
-
-def ind_fax_jkscz_clean(jksdh):
-    if jksdh:
-        clean_letter = re.sub(r'[a-zA-Z]', ' ', jksdh)
-        clean_enter = re.sub(r'\s+', ' ', clean_letter)
-        tel = judge_delimiter(clean_enter)
-        if tel:
-            for bad in tel_bad_list:
-                tel = tel.replace(bad, '')
-            return tel
-    return None
-
-def pry_phone_clean(jksdh):
-    if jksdh:
-        clean_letter = re.sub(r'[a-zA-Z]', ' ', jksdh)
-        clean_enter = re.sub(r'\s+', ' ', clean_letter)
-        tel = judge_tel_length(clean_enter)
-        if tel:
-            for bad in tel_bad_list:
-                tel = tel.replace(bad, '')
-            return tel
-    return None
-
-if __name__ == '__main__1':
-    test_case_list = [
-        '1234to2',
-        '',
-        None
-    ]
-    for str_tel in test_case_list:
-        print(f'tel: {str_tel} ---->{pry_phone_clean(str_tel)}')
-
-month_dict = {
-              'JAN': '01',
-              'FEB': '02',
-              'MAR': '03',
-              'APR': '04',
-              'MAY': '05',
-              'JUN': '06',
-              'JUL': '07',
-              'AUG': '08',
-              'SEP': '09',
-              'OCT': '10',
-              'NOV': '11',
-              'DEC': '12'
-              }
-
-

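The scientific-notation helper deleted above is small enough to sketch standalone (the commit message suggests the generic pieces now live in `spark_common_udf.py`, though the consolidated version is not shown here). This is a minimal reproduction of its behavior, not the consolidated implementation:

```python
import re

# Pattern from the removed module: mantissa, then e/E, then a signed exponent.
scientific_pattern = r'([0-9]*\.?[0-9]+)[eE]([-+]?[0-9]+)'

def scientific_to_number(input_str):
    """Convert a scientific-notation string like '1.5e+3' to a plain integer string.

    Strings without scientific notation pass through unchanged; None stays None.
    """
    if input_str:
        match = re.match(scientific_pattern, input_str)
        if match:
            base_number = float(match.group(1))
            exponent = int(match.group(2))
            return str(int(base_number * (10 ** exponent)))
        return input_str
    return None

print(scientific_to_number('1.5e+3'))  # -> 1500
```

Note that `int()` truncates, so very large mantissas can lose precision through the intermediate float; the original code shares that limitation.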
+ 0 - 273
dw_base/spark/udf/enterprise/spark_eng_ent_date_clean_indonesia.py

@@ -1,273 +0,0 @@
-import re
-
-pattern1 = r'(\d+)[- /\']?([A-Za-z\d]+)[- /\'\.]+(\d+ ?\d+)'
-pattern2 = r'[,-;\']?(\d+)[- /\']?([A-Za-z\d]+)[- /\'\.]+(\d+ ?\d+)'
-pattern3 = r'[A-Za-z]+, ([A-Za-z]+) (\d+), (\d+)'
-pattern4 = r'([A-Za-z\d]+) ([A-Za-z\d]+)\.? (\d+)'
-pattern5 = r'(!|\d+)[- ]+([A-Za-z]+)[ ]?(\d+)'
-
-month_dict = {'Agsts': '08',
-              'Agsutus': '08',
-              'Agts': '08',
-              'Agust': '08',
-              'Agustus': '08',
-              'Apr': '04',
-              'April': '04',
-              'Aprl': '04',
-              'Aprll': '04',
-              'Aug': '08',
-              'August': '08',
-              'Deaember': '12',
-              'Dec': '12',
-              'December': '12',
-              'Des': '12',
-              'Desember': '12',
-              'Feb': '02',
-              'Febrauri': '02',
-              'Februari': '02',
-              'Februaru': '02',
-              'February': '02',
-              'Febuari': '02',
-              'JULI': '07',
-              'Jan': '01',
-              'Januari': '01',
-              'January': '01',
-              'Jul': '07',
-              'Juli': '07',
-              'July': '07',
-              'Jun': '06',
-              'June': '06',
-              'Juni': '06',
-              'MAy': '05',
-              'Mar': '03',
-              'March': '03',
-              'Maret': '03',
-              'Mart': '03',
-              'May': '05',
-              'Mei': '05',
-              'Mrt': '03',
-              'No': '11',
-              'Nof': '11',
-              'Nop': '11',
-              'Nopember': '11',
-              'Nov': '11',
-              'November': '11',
-              'Oct': '10',
-              'October': '10',
-              'Okober': '10',
-              'Okt': '10',
-              'Okt0ber': '10',
-              'Oktober': '10',
-              'Pebruari': '02',
-              'Sep': '09',
-              'Sepetember': '09',
-              'Sept': '09',
-              'September': '09',
-              'Septembver': '09',
-              'agust': '08',
-              'des': '12',
-              'desmb': '12',
-              'juli': '07',
-              'maret': '03',
-              'mei': '05',
-              'november': '11',
-              'oct': '10',
-              'oktober': '10'
-              }
-
-
-def get_date(text: str):
-    match1 = re.match(pattern1, text)
-    if match1:
-        day, month, year = match1.groups()
-        return year, month, day
-    match2 = re.match(pattern2, text)
-    if match2:
-        day, month, year = match2.groups()
-        return year, month, day
-    match3 = re.match(pattern3, text)
-    if match3:
-        month, day, year = match3.groups()
-        return year, month, day
-    match4 = re.match(pattern4, text)
-    if match4:
-        day, month, year = match4.groups()
-        return year, month, day
-    match5 = re.match(pattern5, text)
-    if match5:
-        day, month, year = match5.groups()
-        return year, month, day
-    return None, None, None
-
-
-def clean_date_indonesia(text):
-    if text:
-        year, month, day = get_date(text)
-        year = clean_year(year)
-        month = clean_month(month)
-        day = clean_day(day)
-        if year and month and day:
-            return f'{year}-{month}-{day}'
-    else:
-        return None
-
-
-def clean_year(year: str):
-    if year:
-        year = year.replace(' ', '')
-        if len(year) == 1:
-            return f'200{year}'
-        elif len(year) == 2:
-            if year < '30':
-                return f'20{year}'
-            else:
-                return f'19{year}'
-        elif len(year) == 3:
-            if year[0] == '0':
-                return f'2{year}'
-            else:
-                return f'{year[0]}0{year[1:]}'
-        try:
-            year_int = int(year)
-            if year_int > 2024 or year_int <= 1900:
-                return None
-        except ValueError:
-            return None
-        return year
-    else:
-        return None
-
-
-def clean_month(month: str):
-    if month:
-        if len(month) == 1:
-            month = f'0{month}'
-        elif re.match( r'^\d{2}$', month):
-            month = month
-        else:
-            month = month_dict.get(month)
-            # month_dict.get returns None for unknown month names; bail out
-            # before int(month) raises a TypeError
-            if month is None:
-                return None
-        try:
-            month_int = int(month)
-            if month_int < 1 or month_int > 12:
-                return None
-        except ValueError:
-            return None
-        return month
-    else:
-        return None
-
-
-def clean_day(day: str):
-    if day:
-        if len(day) == 1:
-            if day in ['!', 'I', '1']:
-                return '01'
-            else:
-                return f'0{day}'
-        elif len(day) == 2:
-            try:
-                day_int = int(day)
-                if day_int < 1 or day_int > 31:
-                    return None
-            except ValueError:
-                return None
-            return day
-    else:
-        return None
-
-
-
-if __name__ == '__main__':
-    test_cases = [
-        'Monday, November 03, 2014',
-        'Tuesday, September 08, 2015',
-        "28 Agustus' 09",
-        "19Juli 2011",
-        '25September 2027',
-        'I Februari 2011',
-        ',30 April 2013',
-        '25 Agsts. 08',
-        "4'Nov 08",
-        "15 Des.08",
-        "'06-Sept-10",
-        "! Dec 09",
-        "06- Mei 09",
-        "1 Desember2009",
-        "18 MAy09",
-        "22-Jan013",
-        "21 Okober-20 10",
-        "01 Oktober 2 013",
-        "19-No-13",
-        "8 oct 9 ",
-        '6-May-08',
-        '28 Agustus\' 09',
-        '2-Feb-10',
-        '1/04/2014',
-        '3 Nopember \' 09',
-        '1-5-03',
-        '5-Desember 13',
-        'Wednesday, May 27, 2015',
-        'AHU-0046993.AH.01.08.TAHUN 2018',
-        '25 Agsts. 08',
-        '29-01-2020',
-        '21/10/204',
-        '11Desember 2012',
-        '18 MAy09'
-    ]
-    for test_case in test_cases:
-        print(test_case + '    ->    ', get_date(test_case))
-    year_cases = [
-        '00',
-        '01',
-        '09',
-        '19',
-        '20',
-        '79',
-        '85',
-        '96',
-        '99',
-        '013',
-        '204',
-        '209',
-        '210',
-        '2 013',
-        '20 10',
-        '1028',
-        '2116',
-        '10209',
-        '13',
-    ]
-    for year_case in year_cases:
-        print(year_case + '    ->    ' + str(clean_year(year_case)))
-
-    month_cases = [
-        '01',
-        '09',
-        '11',
-        '5',
-        '7',
-        'Agsts',
-        'January',
-        'Okt',
-        'No',
-        '17'
-    ]
-    for month_case in month_cases:
-        print(month_case + '    ->    ' + str(clean_month(month_case)))
-    day_cases = [
-        '01',
-        '09',
-        '11',
-        '5',
-        '7',
-        '!',
-        'I',
-        '31',
-        '35',
-        '13',
-        '898']
-    for day_case in day_cases:
-        print(day_case + '    ->    ' + str(clean_day(day_case)))
-    print('----------------------------------------------------------------|')
-    for test_case in test_cases:
-        print(test_case + '    ->    ', clean_date_indonesia(test_case))

+ 0 - 34
dw_base/spark/udf/enterprise/spark_eng_ent_json_array_append_udf.py

@@ -1,34 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-
-@udf(returnType=StringType())
-def json_array_append(str_list: str):
-    """
-    Args:
-        str_list: 多级嵌套json数组,非嵌套json数组不需要使用
-    Returns:
-        返回使用特殊字符拼接的字符串,方便hive 使用explode拆分
-    """
-    if str_list is not None and str_list:
-        old_json = json.loads(str_list)
-        new_json = ''
-        for json_single in old_json:
-            new_json += json.dumps(json_single) + '@@@@'  # 拼接特殊字符,用于拼接
-        return new_json
-    else:
-        return ''
-
-
-@udf(returnType=ArrayType(StringType()))
-def json_array_trans(col):
-    if col:
-        return json.loads(col)
-    return []

+ 0 - 762
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_america.py

@@ -1,762 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '`',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = [
-    'internationalinc',
-    'enterprisesinc',
-    'international',
-    'manufacturing',
-    'industriesinc',
-    'incorporation',
-    'incorporated',
-    'distribution',
-    'technologies',
-    'distributors',
-    'construction',
-    'distributing',
-    'logisticsinc',
-    'corporation',
-    'enterprises',
-    'productsinc',
-    'electronics',
-    'engineering',
-    'development',
-    'productsllc',
-    'christopher',
-    'fulfillment',
-    'accessories',
-    'industries',
-    'technology',
-    'americainc',
-    'industrial',
-    'associates',
-    'enterprise',
-    'companyinc',
-    'management',
-    'operations',
-    'automotive',
-    'components',
-    'collection',
-    'richardson',
-    'corporatio',
-    'california',
-    'systemsinc',
-    'logistics',
-    'warehouse',
-    'solutions',
-    'furniture',
-    'equipment',
-    'rodriguez',
-    'hernandez',
-    'packaging',
-    'wholesale',
-    'marketing',
-    'materials',
-    'transport',
-    'alexander',
-    'worldwide',
-    'interiors',
-    'gutierrez',
-    'companies',
-    'resources',
-    'fernandez',
-    'chemicals',
-    'products',
-    'services',
-    'williams',
-    'martinez',
-    'gonzalez',
-    'division',
-    'supplies',
-    'anderson',
-    'americas',
-    'thompson',
-    'shipping',
-    'lighting',
-    'robinson',
-    'plastics',
-    'phillips',
-    'mitchell',
-    'groupinc',
-    'holdings',
-    'campbell',
-    'hardware',
-    'solution',
-    'peterson',
-    'logistic',
-    'chemical',
-    'flooring',
-    'richards',
-    'company',
-    'limited',
-    'america',
-    'trading',
-    'systems',
-    'imports',
-    'johnson',
-    'service',
-    'sanchez',
-    'jackson',
-    'express',
-    'houston',
-    'ramirez',
-    'edwards',
-    'roberts',
-    'michael',
-    'charles',
-    'chicago',
-    'designs',
-    'freight',
-    'storage',
-    'collins',
-    'apparel',
-    'granite',
-    'angeles',
-    'atlanta',
-    'stewart',
-    'anonima',
-    'medical',
-    'morales',
-    'product',
-    'bennett',
-    'gallery',
-    'factory',
-    'florida',
-    'produce',
-    'mendoza',
-    'russell',
-    'jenkins',
-    'supply',
-    'center',
-    'canada',
-    'usainc',
-    'mexico',
-    'miller',
-    'design',
-    'garcia',
-    'thomas',
-    'nguyen',
-    'global',
-    'export',
-    'wilson',
-    'martin',
-    'taylor',
-    'office',
-    'branch',
-    'brands',
-    'harris',
-    'joseph',
-    'torres',
-    'nelson',
-    'walker',
-    'wright',
-    'sports',
-    'system',
-    'flores',
-    'metals',
-    'morris',
-    'rivera',
-    'parker',
-    'street',
-    'energy',
-    'morgan',
-    'daniel',
-    'usallc',
-    'carter',
-    'marine',
-    'direct',
-    'compan',
-    'studio',
-    'motors',
-    'limite',
-    'murphy',
-    'george',
-    'rogers',
-    'centre',
-    'beauty',
-    'marble',
-    'robert',
-    'bailey',
-    'turner',
-    'market',
-    'jordan',
-    'watson',
-    'graham',
-    'powell',
-    'barnes',
-    'cooper',
-    'group',
-    'smith',
-    'coinc',
-    'foods',
-    'plant',
-    'jones',
-    'zhang',
-    'brown',
-    'sales',
-    'davis',
-    'stone',
-    'parts',
-    'huang',
-    'lopez',
-    'lewis',
-    'perez',
-    'james',
-    'moore',
-    'scott',
-    'white',
-    'north',
-    'green',
-    'young',
-    'agent',
-    'trade',
-    'adams',
-    'farms',
-    'store',
-    'clark',
-    'allen',
-    'works',
-    'house',
-    'patel',
-    'baker',
-    'singh',
-    'miami',
-    'reyes',
-    'evans',
-    'zheng',
-    'gomez',
-    'david',
-    'cargo',
-    'limit',
-    'glass',
-    'texas',
-    'steel',
-    'ramos',
-    'power',
-    'mills',
-    'tools',
-    'kelly',
-    'chain',
-    'henry',
-    'lines',
-    'collc',
-    'ortiz',
-    'depot',
-    'world',
-    'paper',
-    'price',
-    'wines',
-    'compa',
-    'myers',
-    'south',
-    'decor',
-    'drive',
-    'corp',
-    'ltda',
-    'gmbh',
-    'chen',
-    'wang',
-    'bank',
-    'yang',
-    'intl',
-    'york',
-    'west',
-    'king',
-    'home',
-    'hall',
-    'line',
-    'john',
-    'cruz',
-    'diaz',
-    'park',
-    'tran',
-    'road',
-    'tile',
-    'hill',
-    'tech',
-    'food',
-    'cook',
-    'shop',
-    'wood',
-    'wong',
-    'sarl',
-    'rico',
-    'sons',
-    'east',
-    'chan',
-    'city',
-    'limi',
-    'ross',
-    'long',
-    'ryan',
-    'ruiz',
-    'gray',
-    'reed',
-    'ward',
-    'ltee',
-    'dist',
-    'bell',
-    'comp',
-    'plus',
-    'rose',
-    'farm',
-    'tire',
-    'care',
-    'inc',
-    'llc',
-    'ltd',
-    'usa',
-    'sas',
-    'srl',
-    'dba',
-    'spa',
-    'sac',
-    'liu',
-    'lee',
-    'and',
-    'for',
-    'lin',
-    'mfg',
-    'bhd',
-    'ulc',
-    'kim',
-    'lcc',
-    'sun',
-    'ave',
-    'com',
-    'lim',
-    'col',
-    'cor',
-    'lax',
-    'llp',
-    'div',
-    'cox',
-    'the',
-    'sro',
-    'iii',
-    'nyc',
-    'int',
-    'co',
-    'cv',
-    'sa',
-    'de',
-    'lp',
-    'in',
-    'ca',
-    'li',
-    'of',
-    'bv',
-    'us',
-    'll',
-    'as',
-    'nv',
-    'jr',
-    'lt',
-    'dc',
-    'wu',
-    'sl',
-    'na',
-    'ag',
-    'rl',
-    'yu',
-    'ny',
-    'st',
-    'sc',
-    'xu',
-    'ma',
-    'nc',
-    'ab',
-    'lu',
-    'kg',
-    'la',
-    'le',
-    'pr',
-    'rd',
-    'he',
-    'ii',
-    'go',
-    'ho',
-    'nj',
-    'lc',
-    'pa',
-    'c',
-    'a',
-    'l',
-    's',
-    'i',
-    'd',
-    'o',
-    'f',
-    'v',
-    'p',
-    'm',
-    'e',
-    'g',
-    'r',
-    'b',
-    'n',
-    't',
-    'w',
-    '1',
-    '2',
-    'j',
-    'y',
-    'h'
-]
-# tail_character = ['companylimited',
-#                   'corporationco',
-#                   'establishment',
-#                   'incorporated',
-#                   'sadecvmexico',
-#                   'corporation',
-#                   'foundation',
-#                   'sadecvblvd',
-#                   'abudhabi',
-#                   'derldecv',
-#                   'limitedc',
-#                   'limitda',
-#                   'limited',
-#                   'agroch',
-#                   'berhad',
-#                   'fzesro',
-#                   'incinc',
-#                   'ltdsti',
-#                   'sadecv',
-#                   'sdnbhd',
-#                   'pvtltd',
-#                   'ptyltd',
-#                   'collc',
-#                   'dhabi',
-#                   'fzllc',
-#                   'spzoo',
-#                   'agro',
-#                   'cjsc',
-#                   'comp',
-#                   'corp',
-#                   'dmcc',
-#                   'eirl',
-#                   'fzco',
-#                   'gmbh',
-#                   'ltda',
-#                   'ojsc',
-#                   'pjsc',
-#                   'sarl',
-#                   'aps',
-#                   'bhd',
-#                   'col',
-#                   'doo',
-#                   'est',
-#                   'fzc',
-#                   'jsc',
-#                   'lda',
-#                   'llp',
-#                   'mfg',
-#                   'mfy',
-#                   'nik',
-#                   'plc',
-#                   'psc',
-#                   'pte',
-#                   'pty',
-#                   'pvt',
-#                   'sas',
-#                   'sac',
-#                   'sau',
-#                   'sdn',
-#                   'slu',
-#                   'srl',
-#                   'sro',
-#                   'tbk',
-#                   'tic',
-#                   'wll',
-#                   'spa',
-#                   'inc',
-#                   'ltd',
-#                   'ab',
-#                   'ad',
-#                   'ag',
-#                   'as',
-#                   'bd',
-#                   'bv',
-#                   'ca',
-#                   'co',
-#                   'dc',
-#                   'hm',
-#                   'kk',
-#                   'lc',
-#                   'lt',
-#                   'na',
-#                   'nv',
-#                   'oy',
-#                   'pt',
-#                   'sa',
-#                   'sl',
-#                   'yk'
-#                   ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'ptelimited',
-    'corpration',
-    'colimited',
-    'limited',
-    'company',
-    'sdnbhd',
-    'pteltd',
-    'coltd',
-    'c0ltd',
-    'ltda',
-    'gmbh',
-    'fzco',
-    'inc',
-    'llc',
-    'fze',
-    'ltd',
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if len(ent_name_cut) > 5:
-        #             return ent_name_cut
-        #         else:
-        #             return eng_name
-        # return eng_name
-        original_length = len(eng_name)
-
-        for tail in tail_character_cut:
-            pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if original_length > 5 and len(ent_name_cut) > 5:
-                    eng_name = ent_name_cut
-
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return remove_tail_char(remove_punctuation(remove_space(eng_name)))
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ADAM HALL GMBH  GMBH ПО ПОРУЧ."EVOL GROUP TR YAZILIM LIMITED SIRKETI" ТУРЦИЯ'
-    print(get_clean_eng_ent_name(a))
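The deleted file above composes its cleanup as `remove_tail_char(remove_punctuation(remove_space(...)))`. A hypothetical minimal re-creation of that pipeline is sketched below; the punctuation class and the suffix list are abbreviated stand-ins for the full production lists, and `clean_name` is an illustrative name, not the original function:

```python
import re

# Sketch of the deleted pipeline (suffix list abbreviated, longest first):
# lowercase, drop spaces and punctuation, then strip one known legal-form suffix.
PUNCT = r"[.,\-()@?'\"`#+!$|:/;*<>%^&_\[\]{}\\~=]"
TAILS = ['companylimited', 'corporation', 'company', 'coltd', 'ltd', 'co']

def clean_name(name: str) -> str:
    if not name:
        return ''
    name = re.sub(PUNCT, '', name.lower().replace(' ', ''))
    for tail in TAILS:
        if name.endswith(tail):
            return name[:-len(tail)]  # strip at most one suffix
    return name

print(clean_name('Acme Trading Co., Ltd.'))  # -> acmetrading
```

Ordering the suffixes longest-first matters: checking `'co'` before `'coltd'` would leave a stray `ltd` behind.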

+ 0 - 666
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_common.py

@@ -1,666 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '＇',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'colt',
-                  'corp',
-                  'llc',
-                  'ltd',
-                  'co',
-                  ]
-
-chian_ent_label = [
-    'shanghai',
-    'peking',
-    'chongqing',
-    'tianjin',
-    'wuhan',
-    'harbin',
-    'shenyang',
-    'guangzhou',
-    'chengdu',
-    'nanjing]',
-    'changchun',
-    'xian',
-    'dalian',
-    'qingdao',
-    'jinan',
-    'hangzhou',
-    'zhengzhou',
-    'shijiazhuang',
-    'taiyuan',
-    'kunming',
-    'changsha',
-    'nanchang',
-    'fuzhou',
-    'lanzhou',
-    'guiyang',
-    'ningbo',
-    'hefei',
-    'anshan',
-    'fushun',
-    'nanning',
-    'zibo',
-    'qiqihar',
-    'jilin',
-    'tangshan',
-    'baotou',
-    'shenzhen',
-    'hohhot',
-    'handan',
-    'wuxi',
-    'xuzhou',
-    'datong',
-    'yichun',
-    'benxi',
-    'luoyang',
-    'suzhou',
-    'xining',
-    'huainan',
-    'jixi',
-    'daqing',
-    'fuxin',
-    'xiamen',
-    'liuzhou',
-    'shantou',
-    'jinzhou',
-    'mudanjiang',
-    'yinchuan',
-    'changzhou',
-    'zhangjiakou',
-    'dandong',
-    'hegang',
-    'kaifeng',
-    'jiamusi',
-    'liaoyang',
-    'hengyang',
-    'baoding',
-    'hunjiang',
-    'xinxiang',
-    'huangshi',
-    'haikou',
-    'yantai',
-    'bengbu',
-    'xiangtan',
-    'weifang',
-    'wuhu',
-    'pingxiang',
-    'yingkou',
-    'anyang',
-    'panzhihua',
-    'pingdingshan',
-    'xiangfan',
-    'zhuzhou',
-    'jiaozuo',
-    'wenzhou',
-    'zhangjiang',
-    'zigong',
-    'shuangyashan',
-    'zaozhuang',
-    'yakeshi',
-    'yichang',
-    'zhenjiang',
-    'huaibei',
-    'qinhuangdao',
-    'guilin',
-    'liupanshui',
-    'panjin',
-    'yangquan',
-    'jinxi',
-    'liaoyuan',
-    'lianyungang',
-    'xianyang',
-    'tai´an',
-    'chifeng',
-    'shaoguan',
-    'nantong',
-    'leshan',
-    'baoji',
-    'linyi',
-    'tonghua',
-    'siping',
-    'changzhi',
-    'tengzhou',
-    'chaozhou',
-    'yangzhou',
-    'dongwan',
-    'ma´anshan',
-    'foshan',
-    'yueyang',
-    'xingtai',
-    'changde',
-    'shihezi',
-    'yancheng',
-    'jiujiang',
-    'dongying',
-    'shashi',
-    'xintai',
-    'jingdezhen',
-    'tongchuan',
-    'zhongshan',
-    'shiyan',
-    'tieli',
-    'jining',
-    'wuhai',
-    'mianyang',
-    'luzhou',
-    'zunyi',
-    'shizuishan',
-    'neijiang',
-    'tongliao',
-    'tieling',
-    'wafangdian',
-    'anqing',
-    'shaoyang',
-    'laiwu',
-    'chengde',
-    'tianshui',
-    'nanyang',
-    'cangzhou',
-    'yibin',
-    'huaiyin',
-    'dunhua',
-    'yanji',
-    'jiangmen',
-    'tongling',
-    'suihua',
-    'gongziling',
-    'xiantao',
-    'chaoyang',
-    'ganzhou',
-    'huzhou',
-    'baicheng',
-    'shangzi',
-    'yangjiang',
-    'qitaihe',
-    'gejiu',
-    'jiangyin',
-    'hebi',
-    'jiaxing',
-    'wuzhou',
-    'meihekou',
-    'xuchang',
-    'liaocheng',
-    'haicheng',
-    'qianjiang',
-    'baiyin',
-    'bei´an',
-    'yixing',
-    'laizhou',
-    'qaramay',
-    'acheng',
-    'dezhou',
-    'nanping',
-    'zhaoqing',
-    'beipiao',
-    'fengcheng',
-    'fuyu',
-    'xinyang',
-    'dongtai',
-    'yuci',
-    'honghu',
-    'ezhou',
-    'heze',
-    'daxian',
-    'linfen',
-    'tianmen',
-    'yiyang',
-    'quanzhou',
-    'rizhao',
-    'deyang',
-    'guangyuan',
-    'changshu',
-    'zhangzhou',
-    'hailar',
-    'nanchong',
-    'jiutai',
-    'zhaodong',
-    'shaoxing',
-    'fuyang',
-    'maoming',
-    'qujing',
-    'ghulja',
-    'jiaohe',
-    'puyang',
-    'huadian',
-    'jiangyou',
-    'qashqar',
-    'anshun',
-    'fuling',
-    'xinyu',
-    'hanzhong',
-    'danyang',
-    'chenzhou',
-    'xiaogan',
-    'shangqiu',
-    'zhuhai',
-    'qingyuan',
-    'aqsu',
-    'xiaoshan',
-    'zaoyang',
-    'xinghua',
-    'hami',
-    'huizhou',
-    'jinmen',
-    'sanming',
-    'ulanhot',
-    'korla',
-    'wanxian',
-    'ruian',
-    'zhoushan',
-    'liangcheng',
-    'jiaozhou',
-    'taizhou',
-    'taonan',
-    'pingdu',
-    'ji´an',
-    'longkou',
-    'langfang',
-    'zhoukou',
-    'suining',
-    'yulin',
-    'jinhua',
-    'liu´an',
-    'shuangcheng',
-    'suizhou',
-    'ankang',
-    'weinan',
-    'longjing',
-    'daan',
-    'lengshuijiang',
-    'laiyang',
-    'xianning',
-    'dali',
-    'anda',
-    'jincheng',
-    'longyan',
-    'xichang',
-    'wendeng',
-    'hailun',
-    'binzhou',
-    'linhe',
-    'wuwei',
-    'duyun',
-    'mishan',
-    'shangrao',
-    'changji',
-    'meixian',
-    'yushu',
-    'tiefa',
-    'huai´an',
-    'leiyang',
-    'zalantun',
-    'weihai',
-    'loudi',
-    'qingzhou',
-    'qidong',
-    'huaihua',
-    'luohe',
-    'chuzhou',
-    'kaiyuan',
-    'linqing',
-    'chaohu',
-    'laohekou',
-    'dujiangyan',
-    'zhumadian',
-    'linchuan',
-    'jiaonan',
-    'sanmenxia',
-    'heyuan',
-    'manzhouli',
-    'lhasa',
-    'lianyuan',
-    'kuytun',
-    'puqi',
-    'hongjiang',
-    'qinzhou',
-    'renqiu',
-    'yuyao',
-    'guigang',
-    'kaili',
-    'yan´an',
-    'beihai',
-    'xuangzhou',
-    'quzhou',
-    'yong´an',
-    'zixing',
-    'liyang',
-    'yizheng',
-    'yumen',
-    'liling',
-    'yuncheng',
-    'shanwei',
-    'cixi',
-    'yuanjiang',
-    'bozhou',
-    'jinchang',
-    'fuan',
-    'suqian',
-    'shishou',
-    'hengshui',
-    'danjiangkou',
-    'fujin',
-    'sanya',
-    'guangshui',
-    'huangshan',
-    'xingcheng',
-    'zhucheng',
-    'kunshan',
-    'haining',
-    'pingliang',
-    'fuqing',
-    'xinzhou',
-    'jieyang',
-    'zhangjiagang',
-    'tong xian',
-    'yaan',
-    'emeishan',
-    'enshi',
-    'bose',
-    'yuzhou',
-    'tumen',
-    'putian',
-    'linhai',
-    'shaowu',
-    'junan',
-    'huaying',
-    'pingyi',
-    'huangyan'
-]
-
-brazil_tail_character_cut = [
-    'industriais ltda',
-    'brasil indstria',
-    'e comercializacao',
-    'brasil ltda',
-    'industria',
-    'eireli',
-    'cia ltda',
-    'ind e com',
-    'brasil ltda epp',
-    'importacao',
-    'e comercio',
-    'comercio',
-    # 'sa',
-    'do brasi',
-    'brasil sa',
-    'limitada',
-    'ltda me',
-    'ltda epp',
-    'ltda'
-]
-
-brazil_tail_character_remove = [
-    'sa',
-    'ltda',
-    'casa'
-]
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        # eng_name = eng_name.lower()
-        eng_name = eng_name.lower().replace(' ', '')
-
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-@udf(returnType=BooleanType())
-def filter_china_ent(name_abb: str) -> bool:
-    if name_abb:
-        for char in chian_ent_label:
-            if char in name_abb:
-                return True
-    return False
-
-
-def cut_tail_char_brazil(eng_name: str) -> str or None:
-    if eng_name:
-        for tail in brazil_tail_character_cut:
-            pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if len(ent_name_cut) > 5:
-                    return ent_name_cut
-                else:
-                    return eng_name
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_tail_char_brazil(eng_name: str) -> str or None:
-    if eng_name:
-        for char in brazil_tail_character_remove:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)].replace(' ', '')
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ABC ltda  epp industriais ltdaltda  me'
-    print(remove_tail_char_brazil(a))
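The file above also carried the only Spark-decorated function in this group, `filter_china_ent`. A plain-Python sketch of its check follows; the real version wraps the same logic in `@udf(returnType=BooleanType())`, and the city list here is a short excerpt of the full `chian_ent_label` list, with an illustrative function name:

```python
# Sketch of the deleted filter_china_ent heuristic (city list abbreviated).
CITY_LABELS = ['shanghai', 'shenzhen', 'ningbo', 'qingdao', 'hangzhou']

def looks_like_china_ent(name_abb: str) -> bool:
    # Substring match: any city label anywhere in the cleaned name marks
    # the record as a likely mainland-China enterprise.
    return bool(name_abb) and any(label in name_abb for label in CITY_LABELS)

print(looks_like_china_ent('ningbohaitiantrading'))  # -> True
```

Note that a plain substring test can also match inside unrelated words, which is an inherent limitation of this heuristic.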

+ 0 - 132
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_compant.py

@@ -1,132 +0,0 @@
-# Generic company-name denoising
-# List of full-width special characters
-
-# Generic company-name denoising
-
-special_chars = ['.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '‘',
-                 '’',
-                 '“',
-                 '”',
-                 '`',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '=',
-                 '\'',
-                 '±',
-                 '°',
-                 '«',
-                 '»',
-                 'µ',
-                 '¶',
-                 '·',
-                 '€',
-                 '£',
-                 '¥',
-                 '¢',
-                 '×',
-                 '÷',
-                 '¬',
-                 '…',
-                 '→',
-                 '←',
-                 '↑',
-                 '↓',
-                 '↔',
-                 '⇒',
-                 '⇐',
-                 '≈',
-                 '≠',
-                 '≤',
-                 '≥',
-                 '.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '"',
-                 '\'',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '¨',
-                 '´']
-special_char_dict = {c: ' ' for c in set(special_chars)}
-special_char_dict['&'] = ' and '
-special_char_dict['&'] = ' and '
-special_chars_trans = str.maketrans(special_char_dict)
-
-
-def clean_company_name(name):
-    if name:
-        # Replace special characters with spaces
-        name = name.translate(special_chars_trans)
-        # Uppercase, collapse repeated spaces, strip leading/trailing spaces
-        name = ' '.join(name.upper().split())
-        return name
-    else:
-        return None
-
-
-if __name__ == '__main__':
-    case1 = ' AB    cde .((!)  '
-    assert clean_company_name(case1) == 'AB CDE'
-    case2 = None
-    assert clean_company_name(case2) is None
-    case3 = '    '
-    assert clean_company_name(case3) == ''
-    case4 = '~ab#c≥'
-    assert clean_company_name(case4) == 'AB C'
-    case5 = '÷  &            !  '
-    assert clean_company_name(case5) == 'AND'
-    case6 = 'abc&def'
-    assert clean_company_name(case6) == 'ABC AND DEF'
-    case = 'abc&def'
-    assert clean_company_name(case6) == 'ABC AND DEF'
-    print('all test cases passed')
-
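Unlike its siblings, the deleted `spark_eng_ent_name_clean_compant.py` builds one translation table up front instead of running `re.sub` once per character. A minimal sketch of that design, with an abbreviated character set, assuming the same names as the deleted code:

```python
# One-pass cleanup via str.translate (character set abbreviated).
special_chars = list(".,-()@?#+!$|:/;*<>%^_[]{}~=")
table = {c: ' ' for c in special_chars}
table['&'] = ' and '          # '&' becomes the word AND, not a space
trans = str.maketrans(table)

def clean_company_name(name):
    if not name:
        return None
    # Replace specials with spaces, uppercase, collapse and strip whitespace.
    return ' '.join(name.translate(trans).upper().split())

print(clean_company_name('abc&def'))  # -> ABC AND DEF
```

Building the table once and calling `translate` is a single linear pass over the string, which is noticeably cheaper than ~90 successive `re.sub` calls when the UDF runs per row.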

+ 0 - 307
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_germany.py

@@ -1,307 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '＇',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'pteltd',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'colt',
-                  'corp',
-                  'gmbh',
-                  'llc',
-                  'ltd',
-                  'inc',
-                  'fze',
-                  'co',
-                  'bv',
-                  ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'corpration',
-    'colimited',
-    'limited',
-    'pteltd',
-    'coltd',
-    'c0ltd',
-    'ltda',
-    'gmbh',
-    'inc',
-    'llc',
-    'fze',
-    'ltd',
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if len(ent_name_cut) > 5:
-        #             return ent_name_cut
-        #         else:
-        #             return eng_name
-        # return eng_name
-        original_length = len(eng_name)
-
-        for tail in tail_character_cut:
-            pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if original_length > 5 and len(ent_name_cut) > 5:
-                    eng_name = ent_name_cut
-
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return remove_tail_char(
-            cut_tail_char(remove_begin_num(remove_space(remove_punctuation(remove_begin_brackets(eng_name))))))
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ADAM HALL GMBH  GMBH ПО ПОРУЧ."EVOL GROUP TR YAZILIM LIMITED SIRKETI" ТУРЦИЯ'
-    print(get_clean_eng_ent_name(a))
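The `cut_tail_char` loop in the Germany variant above differs from plain suffix stripping: every marker found *anywhere* in the string truncates the name at the match, guarded by a minimum length of 5 on both the original name and the cut result. A sketch with an abbreviated marker list and an illustrative function name:

```python
import re

# Sketch of the deleted cut_tail_char loop (marker list abbreviated).
TAILS_CUT = ['limited', 'gmbh', 'ltd']

def cut_tail(name: str) -> str:
    if not name:
        return ''
    original_length = len(name)
    for tail in TAILS_CUT:
        match = re.search(tail, name, flags=re.IGNORECASE)
        if match:
            cut = name[:match.start()].strip()
            # Only keep the cut when both old and new names exceed 5 chars.
            if original_length > 5 and len(cut) > 5:
                name = cut
    return name

print(cut_tail('adamhallgmbh'))  # -> adamhall
```

Because the loop keeps going after a successful cut, a name like `ADAM HALL GMBH GMBH ...` loses every marker in turn rather than stopping at the first one.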

+ 0 - 312
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_hongkong.py

@@ -1,312 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '＇',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'sadecv',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'corp',
-                  'llc',
-                  'ltd'
-                  ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'ptelimited',
-    'corpration',
-    'colimited',
-    'limited',
-    'company',
-    'sdnbhd',
-    'pteltd',
-    'coltd',
-    'c0ltd',
-    'ltda',
-    'gmbh',
-    'fzco',
-    'inc',
-    'llc',
-    'fze',
-    'ltd',
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if len(ent_name_cut) > 5:
-        #             return ent_name_cut
-        #         else:
-        #             return eng_name
-        # return eng_name
-        original_length = len(eng_name)
-
-        for tail in tail_character_cut:
-            pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if original_length > 5 and len(ent_name_cut) > 5:
-                    eng_name = ent_name_cut
-
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def replace_hk(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace('hongkong', 'hk')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return replace_hk(remove_tail_char(
-            cut_tail_char(remove_begin_num(remove_space(remove_punctuation(remove_begin_brackets(eng_name)))))))
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ADAM HALL GMBH  GMBH ПО ПОРУЧ."EVOL GROUP TR YAZILIM LIMITED SIRKETI" ТУРЦИЯ'
-    print(get_clean_eng_ent_name(a))

+ 0 - 321
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_indonesia.py

@@ -1,321 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '`',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'sadecv',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'ptelt',
-                  'colt',
-                  'corp',
-                  'llc',
-                  'ltd',
-                  'spa',
-                  'pt',
-                  ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'ptelimited',
-    'corpration',
-    'colimited',
-    'limited',
-    'pteltd',
-    'sdnbhd',
-    'coltd',
-    'c0ltd',
-    'cesro',
-    'collc',
-    'ltda',
-    'gmbh',
-    'fzco',
-    'babv',
-    'inc',
-    'spa',
-    'llc',
-    'fze',
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if len(ent_name_cut) > 5:
-        #             return ent_name_cut
-        #         else:
-        #             return eng_name
-        # return eng_name
-        original_length = len(eng_name)
-
-        for tail in tail_character_cut:
-            pattern = re.compile(rf'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if original_length > 5 and len(ent_name_cut) > 5:
-                    eng_name = ent_name_cut
-
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return remove_tail_char(
-            cut_tail_char(remove_begin_num(remove_space(remove_punctuation(remove_begin_brackets(eng_name))))))
-    else:
-        return ''
-
-
-def get_pre_clean_ent_name(ent_name: str) -> str or None:
-    if not ent_name:
-        return None
-    while True:
-        if re.search(r"\([^()]*\)$", ent_name):
-            ent_name = re.sub(r'\s*\([^)]*\)$', '', ent_name).strip()
-        else:
-            return re.sub(r'^Perusahaan.*?\( *persero *\)', '', ent_name, flags=re.IGNORECASE).strip()
-
-
-if __name__ == '__main__':
-    a = 'Perusahaan Peroan (   Persero    ) Pt. Perkebunan Nusantara ViiKios Lapan-lapan (KOTA TANGERANG SELATAN) (855103) ()'
-    print(get_pre_clean_ent_name(a))

+ 0 - 309
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_italy.py

@@ -1,309 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '`',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'pteltd',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'colt',
-                  'corp',
-                  'gmbh',
-                  'llc',
-                  'ltd',
-                  'inc',
-                  'fze',
-                  'srl',
-                  'spa',
-                  'co',
-                  'bv',
-                  ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'corpration',
-    'colimited',
-    'limited',
-    'pteltd',
-    'coltd',
-    'c0ltd',
-    'ltda',
-    'gmbh',
-    'inc',
-    'llc',
-    'fze',
-    'ltd',
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if len(ent_name_cut) > 5:
-        #             return ent_name_cut
-        #         else:
-        #             return eng_name
-        # return eng_name
-        original_length = len(eng_name)
-
-        for tail in tail_character_cut:
-            pattern = re.compile(rf'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if original_length > 5 and len(ent_name_cut) > 5:
-                    eng_name = ent_name_cut
-
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return remove_tail_char(
-            cut_tail_char(remove_begin_num(remove_space(remove_punctuation(remove_begin_brackets(eng_name))))))
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ADAM HALL GMBH  GMBH ПО ПОРУЧ."EVOL GROUP TR YAZILIM LIMITED SIRKETI" ТУРЦИЯ'
-    print(get_clean_eng_ent_name(a))

+ 0 - 311
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_japan.py

@@ -1,311 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '`',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'pteltd',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'colt',
-                  'corp',
-                  'gmbh',
-                  'llc',
-                  'ltd',
-                  'inc',
-                  'fze',
-                  'co',
-                  'bv',
-                  ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'corpration',
-    'colimited',
-    'holdings',
-    'limited',
-    'pteltd',
-    'limied',
-    'coltd',
-    'spzoo',
-    'c0ltd',
-    'ltda',
-    'gmbh',
-    'inc',
-    'llc',
-    'fze',
-    'ltd',
-    'sdn'
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if len(ent_name_cut) > 5:
-        #             return ent_name_cut
-        #         else:
-        #             return eng_name
-        # return eng_name
-        original_length = len(eng_name)
-
-        for tail in tail_character_cut:
-            pattern = re.compile(rf'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if original_length > 5 and len(ent_name_cut) > 5:
-                    eng_name = ent_name_cut
-
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return remove_tail_char(
-            cut_tail_char(remove_begin_num(remove_space(remove_punctuation(remove_begin_brackets(eng_name))))))
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ADAM HALL GMBH  GMBH ПО ПОРУЧ."EVOL GROUP TR YAZILIM LIMITED SIRKETI" ТУРЦИЯ'
-    print(get_clean_eng_ent_name(a))

+ 0 - 308
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_malaysia.py

@@ -1,308 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '`',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'pteltd',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'colt',
-                  'corp',
-                  'gmbh',
-                  'llc',
-                  'ltd',
-                  'inc',
-                  'fze',
-                  'co',
-                  'bv',
-                  ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'corpration',
-    'colimited',
-    'limited',
-    'pteltd',
-    'sdnbhd',
-    'coltd',
-    'c0ltd',
-    'ltda',
-    'gmbh',
-    'inc',
-    'llc',
-    'fze',
-    'ltd',
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if len(ent_name_cut) > 5:
-        #             return ent_name_cut
-        #         else:
-        #             return eng_name
-        # return eng_name
-        original_length = len(eng_name)
-
-        for tail in tail_character_cut:
-            pattern = re.compile(rf'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if original_length > 5 and len(ent_name_cut) > 5:
-                    eng_name = ent_name_cut
-
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return remove_tail_char(
-            cut_tail_char(remove_begin_num(remove_space(remove_punctuation(remove_begin_brackets(eng_name))))))
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ADAM HALL GMBH  GMBH ПО ПОРУЧ."EVOL GROUP TR YAZILIM LIMITED SIRKETI" ТУРЦИЯ'
-    print(get_clean_eng_ent_name(a))

+ 0 - 307
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_south_korea.py

@@ -1,307 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '`',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'pteltd',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'colt',
-                  'corp',
-                  'gmbh',
-                  'llc',
-                  'ltd',
-                  'inc',
-                  'fze',
-                  'co',
-                  'bv',
-                  ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'corpration',
-    'colimited',
-    'limited',
-    'pteltd',
-    'coltd',
-    'c0ltd',
-    'ltda',
-    'gmbh',
-    'inc',
-    'llc',
-    'fze',
-    'ltd',
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if len(ent_name_cut) > 5:
-        #             return ent_name_cut
-        #         else:
-        #             return eng_name
-        # return eng_name
-        original_length = len(eng_name)
-
-        for tail in tail_character_cut:
-            pattern = re.compile(rf'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if original_length > 5 and len(ent_name_cut) > 5:
-                    eng_name = ent_name_cut
-
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return remove_tail_char(
-            cut_tail_char(remove_begin_num(remove_space(remove_punctuation(remove_begin_brackets(eng_name))))))
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ADAM HALL GMBH  GMBH ПО ПОРУЧ."EVOL GROUP TR YAZILIM LIMITED SIRKETI" ТУРЦИЯ'
-    print(get_clean_eng_ent_name(a))
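The per-country modules deleted in this commit all compose the same cleaning chain (leading brackets, punctuation, spaces, leading digits, then legal-form tails). As a rough standalone sketch of that chain: the token list below is a small illustrative subset, and the function name is ours, not the consolidated spark_common_udf API.

```python
import re

# Small illustrative subset of the deleted tail-token lists.
TAIL_TOKENS = ['companylimited', 'coltd', 'gmbh', 'ltd', 'inc', 'co']

def clean_eng_ent_name(name: str) -> str:
    """Sketch of the deleted pipeline: brackets, punctuation, spaces,
    leading digits, then one trailing legal-form token."""
    if not name:
        return ''
    name = re.sub(r'^\([^)]*\)', '', name)   # drop a leading parenthetical
    name = re.sub(r'\W', '', name.lower())   # lowercase, strip punctuation and spaces
    name = re.sub(r'^\d{3,}', '', name)      # drop a long leading number
    for tail in TAIL_TOKENS:                 # strip one trailing legal-form token
        if name.endswith(tail) and len(name) > len(tail):
            name = name[:-len(tail)]
            break
    return name

print(clean_eng_ent_name('(HQ) Adam Hall GmbH'))  # adamhall
```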

+ 0 - 307
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_taiwan.py

@@ -1,307 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '`',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'pteltd',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'colt',
-                  'corp',
-                  'gmbh',
-                  'llc',
-                  'ltd',
-                  'inc',
-                  'fze',
-                  'co',
-                  'bv',
-                  ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'corpration',
-    'colimited',
-    'limited',
-    'pteltd',
-    'coltd',
-    'c0ltd',
-    'ltda',
-    'gmbh',
-    'inc',
-    'llc',
-    'fze',
-    'ltd',
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for tail in tail_character_cut:
-            pattern = re.compile(rf'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if len(ent_name_cut) > 5:
-                    return ent_name_cut
-                else:
-                    return eng_name
-        return eng_name
-        # original_length = len(eng_name)
-        #
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if original_length > 5 and len(ent_name_cut) > 5:
-        #             eng_name = ent_name_cut
-        #
-        # return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return remove_tail_char(
-            cut_tail_char(remove_begin_num(remove_space(remove_punctuation(remove_begin_brackets(eng_name))))))
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ADAM HALL GMBH  GMBH ПО ПОРУЧ."EVOL GROUP TR YAZILIM LIMITED SIRKETI" ТУРЦИЯ'
-    print(get_clean_eng_ent_name(a))
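Worth noting for the consolidation: the deleted files disagreed on `cut_tail_char`. The Taiwan variant returns at the first matching tail, while the South-Korea/UAE variant keeps trimming for every tail that still matches, so the two can produce different results for the same input. A minimal side-by-side sketch, using our own simplified tail list:

```python
import re

TAILS = ['limited', 'coltd', 'ltd']

def cut_first_match(name: str) -> str:
    # Taiwan variant: cut at the first matching tail and return immediately.
    for tail in TAILS:
        m = re.search(tail, name, flags=re.IGNORECASE)
        if m:
            cut = name[:m.start()].strip()
            return cut if len(cut) > 5 else name
    return name

def cut_iteratively(name: str) -> str:
    # South-Korea/UAE variant: keep cutting for every tail that still matches,
    # gated on the *original* length as in the deleted code.
    original_length = len(name)
    for tail in TAILS:
        m = re.search(tail, name, flags=re.IGNORECASE)
        if m:
            cut = name[:m.start()].strip()
            if original_length > 5 and len(cut) > 5:
                name = cut
    return name

print(cut_first_match('acme trading coltd limited'))   # acme trading coltd
print(cut_iteratively('acme trading coltd limited'))   # acme trading
```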

+ 0 - 308
dw_base/spark/udf/enterprise/spark_eng_ent_name_clean_uae.py

@@ -1,308 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '`',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'pteltd',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'colt',
-                  'corp',
-                  'gmbh',
-                  'llc',
-                  'ltd',
-                  'inc',
-                  'fze',
-                  'co',
-                  'bv',
-                  ]
-
-tail_character_cut = [
-    'corporration',
-    'corporation',
-    'corpration',
-    'colimited',
-    'limited',
-    'trading',
-    'pteltd',
-    'coltd',
-    'c0ltd',
-    'ltda',
-    'gmbh',
-    'inc',
-    'llc',
-    'fze',
-    'ltd',
-]
-
-
-def remove_begin_brackets(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\([^)]*\)', '', eng_name)
-    else:
-        return ''
-
-
-def remove_begin_num(eng_name: str) -> str or None:
-    if eng_name:
-        return re.sub(r'^\d{3,}', '', eng_name)
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-def cut_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        # for tail in tail_character_cut:
-        #     pattern = re.compile(f'{tail}\s*', flags=re.IGNORECASE)
-        #     match = re.search(pattern, eng_name)
-        #     if match:
-        #         ent_name_cut = eng_name[:match.start()].strip()
-        #         if len(ent_name_cut) > 5:
-        #             return ent_name_cut
-        #         else:
-        #             return eng_name
-        # return eng_name
-        original_length = len(eng_name)
-
-        for tail in tail_character_cut:
-            pattern = re.compile(rf'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if original_length > 5 and len(ent_name_cut) > 5:
-                    eng_name = ent_name_cut
-
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_space(eng_name: str) -> str or None:
-    if eng_name:
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        return remove_tail_char(
-            cut_tail_char(remove_begin_num(remove_space(remove_punctuation(remove_begin_brackets(eng_name))))))
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ADAM HALL GMBH  GMBH ПО ПОРУЧ."EVOL GROUP TR YAZILIM LIMITED SIRKETI" ТУРЦИЯ'
-    print(get_clean_eng_ent_name(a))

+ 0 - 50
dw_base/spark/udf/enterprise/spark_eng_ent_shareholder_clean_russia.py

@@ -1,50 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import re
-
-"""
-Matches the investment amount and share percentage in the Russian
-shareholder column.
-
-Returns:
-    investment: investment amount; percentage: share percentage;
-    russian: the Cyrillic shareholder name
-"""
-investment_pattern = re.compile(r'([\d\s]*(\.)?[\d\s]*) руб\.')
-percentage_pattern = re.compile(r'(\d+(\.)?(\d)*)%')
-russian_pattern = re.compile(r'(.*?)\(\d')       #([а-яА-ЯёЁ]*)(\(\d)
-
-
-def investment(text):
-    investment_match = re.search(investment_pattern, text)
-    if investment_match:
-        investment_amount = investment_match.group(1).replace(' ', '')
-        return investment_amount
-    else:
-        return None
-
-
-def percentage(text):
-    percentage_match = re.search(percentage_pattern, text)
-    if percentage_match:
-        percentage_value = percentage_match.group(1)
-        return percentage_value + '%'
-    else:
-        return None
-
-def russian_sh(text):
-    russian_match = re.search(russian_pattern, text)
-    if russian_match:
-        russian_value = russian_match.group(1).strip()
-        return russian_value
-    else:
-        return None
-
-# Extract the shareholder name
-if __name__ == '__main__':
-    test_case_list = ['Донской Олег Валерьевич (4 500 руб., 45%)',
-                      'ООО "ЕКСМО" (100 000 руб., 100%)',
-                      'Заблоцкий Михаил Александрович (5 000 руб., 50%)',
-                      'ГК ДСГ, ООО (5 000 руб., 50%)'
-    ]
-    for case in test_case_list:
-        print(f'shareholder: {case} ---->   {russian_sh(case)}')
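The three extractors in the deleted Russian-shareholder module can be exercised together. The sketch below mirrors its patterns (tidied to non-capturing groups) behind a hypothetical `parse_shareholder` helper of our own:

```python
import re

# Patterns mirroring the deleted module: amount in rubles, percentage, and
# the shareholder name preceding the "(digits" parenthetical.
investment_pattern = re.compile(r'([\d\s]*(?:\.)?[\d\s]*) руб\.')
percentage_pattern = re.compile(r'(\d+(?:\.\d*)?)%')
name_pattern = re.compile(r'(.*?)\(\d')

def parse_shareholder(text):
    """Return (investment, percentage, name); each field is None if absent."""
    inv = investment_pattern.search(text)
    pct = percentage_pattern.search(text)
    name = name_pattern.search(text)
    return (
        inv.group(1).replace(' ', '') if inv else None,
        pct.group(1) + '%' if pct else None,
        name.group(1).strip() if name else None,
    )

print(parse_shareholder('Донской Олег Валерьевич (4 500 руб., 45%)'))
# ('4500', '45%', 'Донской Олег Валерьевич')
```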

+ 0 - 156
dw_base/spark/udf/enterprise/test/ent_clean_text_test.py

@@ -1,156 +0,0 @@
-import pytest
-from dw_base.spark.udf.enterprise.ent_clean_text import *
-
-
-@pytest.mark.parametrize("country, url, expected", [
-    ('China', 'https://www.ianshaw.biz/p/contact-management.php', 'ianshaw.biz/p/contact-management.php')
-])
-def test_clean_url(country, url, expected):
-    result = clean_url(country, url)
-    assert result == expected
-
-@pytest.mark.parametrize("url, expected", [
-    ('https://charnleyfertilisers.co.uk/', 'charnleyfertilisers.co.uk')
-])
-def test_clean_url_common(url, expected):
-    result = clean_url_common(url)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('13.02.2024', '2024-02-13'),
-    ('13/02/2024', '2024-02-13'),
-    ('', None),
-    (None, None)
-])
-def test_reverse_str(str, expected):
-    result = reverse_str(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('13a02b20 24', '13022024'),
-])
-def test_replace_english_and_space(str, expected):
-    result = replace_english_and_space(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('981617611,981617611,981617611', '981617611'),
-    ('', None),
-    (None, None)
-])
-def test_array_remove_duplicates(str, expected):
-    result = array_remove_duplicates(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('12.575.462 ALVARO PEREIRA DA SILVEIRA FILHO', 'ALVARO PEREIRA DA SILVEIRA FILHO'),
-    ('HEBE DE ABREU VILELA CPF 027116806149', 'HEBE DE ABREU VILELA CPF'),
-    ('HEBE DE ABREU VILELA CPF 027116', 'HEBE DE ABREU VILELA CPF 027116'),
-    ('', None),
-    (None, None)
-])
-def test_clean_brazil_company_name(str, expected):
-    result = clean_brazil_company_name(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('2124859522,2124859523', '2124859522,2124859523'),
-    ('123456,2124859523', '2124859523'),
-    ('123456789,12345678', None),
-    ('', None),
-    (None, None)
-])
-def test_phone_clean_turkey(str, expected):
-    result = phone_clean_turkey(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('91234567891', None),
-    ('01234567891', '1234567891'),
-    ('123456789012', None),
-    ('21', None),
-    ('', None),
-    (None, None)
-])
-def test_fax_clean_turkey(str, expected):
-    result = fax_clean_turkey(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('["10.71.01-Taze pastane ürünleri imalatı (yaş pasta, kuru pasta, poğaça, kek, börek, pay, turta, waffles vb.)"]', '107101'),
-    ('["35.12.13-Elektrik enerjisinin iletimi (elektrik üretim kaynağından dağıtım sistemine aktaran iletim sistemlerinin işletilmesi)","42.22.02-Enerji santralleri inşaatı (hidroelektrik santrali, termik santral, nükleer enerji üretim santralleri vb.)","35.11.19-Elektrik enerjisi üretimi"]', '351213, 422202, 351119'),
-    ('', None),
-    (None, None)
-])
-def test_turkey_nicecode(str, expected):
-    result = turkey_nicecode(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('HUGO.SANSIL@GMAIL.COM   E HUGO@SISTEMAFIEG.ORG.BR', ['hugo.sansil@gmail.com', 'hugo@sistemafieg.org.br']),
-    ('SANDRA_MMC@BOL.COM.BR               NAIRMOTADIAS@HOTMAIL.COM', ['sandra_mmc@bol.com.br', 'nairmotadias@hotmail.com']),
-    ('', None),
-    (None, None)
-])
-def test_email_clean_brazil(str, expected):
-    result = email_clean_brazil(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('["medical practice","medical practices","hospital & health care"]', 'medical practice,medical practices,hospital & health care'),
-    ('["construction"]', 'construction'),
-    ('', None),
-    (None, None)
-])
-def test_arr_str_to_str(str, expected):
-    result = arr_str_to_str(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('+1-866-344-7857 ext. 311', '+1 866 344 7857 311'),
-    ('(844)800-BULL', None),
-    ('', None),
-    (None, None)
-])
-def test_clean_tel_apollo(str, expected):
-    result = clean_tel_apollo(str)
-    assert result == expected
-
-@pytest.mark.parametrize("socialtype, url, expected", [
-    ("youtube", "https://youtube.com/user/BrotherCanadaEn", 'user/brothercanadaen'),
-    ("whatsapp", "919822025525", '919822025525'),
-    ("twitter", "https://twitter.com/#", ''),
-    ("linkedin", "https://www.linkedin.com/in/meb-jsc/#", 'in/meb-jsc'),
-    ("instagram", "https://www.instagram.com/##############/", ''),
-    ("facebook", "https://www.facebook.com/https://www.facebook.com/komlider38/", 'komlider38'),
-    ("pinterest", "https://www.pinterest.com/lampstore/https://www.pinterest.com/lampstore/", 'lampstore'),
-    (None, "",None),
-    ("", None,None),
-    (None, None,None)
-])
-def test_socialmedia_url(socialtype, url, expected):
-    result = socialmedia_url(socialtype, url)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('-- PACIFIC PRODUCTS LIMITED AUSTRALIAN PRODUCTS LIMITED', 'PACIFIC PRODUCTS LIMITED AUSTRALIAN PRODUCTS LIMITED'),
-    ('03-MAY-2013 Fuente Union Import And Export Limited 福恩特聯合進出口有限公司', 'Fuente Union Import And Export Limited 福恩特聯合進出口有限公司'),
-    ('', None),
-    (None, None)
-])
-def test_hongkong_previous_name_clean(str, expected):
-    result = hongkong_previous_name_clean(str)
-    assert result == expected
-
-@pytest.mark.parametrize("str, expected", [
-    ('["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"]', '25-to-50'),
-    ('["ownership-of-shares-more-than-25-percent-registered-overseas-entity"]', 'more-than-25'),
-    ('', None),
-    (None, None)
-])
-def test_uk_sharepercent(str, expected):
-    result = uk_sharepercent(str)
-    assert result == expected
-
-if __name__ == '__main__':
-    pytest.main()

+ 0 - 48
dw_base/spark/udf/enterprise/test/ent_india_offline_udf_test.py

@@ -1,48 +0,0 @@
-import pytest
-
-from dw_base.spark.udf.enterprise.ent_india_offline_udf import *
-
-
-@pytest.mark.parametrize("zau_date, expected", [
-    ("", None),
-    ("N/A", None),
-    ("  ", None),
-    ("xasdfasd", None),
-    ("2020-01-10", None),
-    ("26 February 2010", "2010-02-26"),
-    ("27 May 2013", "2013-05-27"),
-    ("10 January 2020", "2020-01-10"),
-    ("11 July 2018", "2018-07-11"),
-    ("27 August 1996", "1996-08-27"),
-    ("30 January 1987", "1987-01-30"),
-    ("18 August 2005", "2005-08-18"),
-    ("22 February 2010", "2010-02-22"),
-    ("16 May 2012", "2012-05-16"),
-    ("01 August 2017", "2017-08-01"),
-    ("11 September 2021", "2021-09-11"),
-    ("04 August 1984", "1984-08-04"),
-    ("19 July 2014", "2014-07-19"),
-    ("24 October 1985", "1985-10-24"),
-])
-def test_clean_zau_date_functionality(zau_date, expected):
-    assert clean_zau_date(zau_date) == expected
-
-
-@pytest.mark.parametrize("his_gs_date, expected", [
-    ("26-02-2010", "2010-02-26"),
-    ("27-05-2013", "2013-05-27"),
-    ("10-01-2020", "2020-01-10"),
-    ("11-07-2018", "2018-07-11"),
-    ("27-08-1996", "1996-08-27"),
-    ("30-01-1987", "1987-01-30"),
-    ("18-08-2005", "2005-08-18"),
-    ("22-02-2010", "2010-02-22"),
-    ("16-05-2012", "2012-05-16"),
-    ("01-08-2017", "2017-08-01"),
-    ("11-09-2021", "2021-09-11"),
-    ("04-08-1984", "1984-08-04"),
-    ("19-07-2014", "2014-07-19"),
-    ("24-10-1985", "1985-10-24"),
-])
-def test_clean_his_gs_date_functionality(his_gs_date, expected):
-    assert clean_his_gs_date(his_gs_date) == expected

+ 0 - 100
dw_base/spark/udf/enterprise/test/spark_eng_ent_ctstel_clean_test.py

@@ -1,100 +0,0 @@
-import pytest
-from dw_base.spark.udf.enterprise.spark_eng_ent_ctstel_clean import *
-
-
-@pytest.mark.parametrize("input_str, expected", [
-    ('2.0640004433E10', '20640004433'),
-    ('', None),
-    (None, None)
-])
-def test_scientific_to_number(input_str, expected):
-    result = scientific_to_number(input_str)
-    assert result == expected
-
-@pytest.mark.parametrize("input_str, expected", [
-    ('666666 7777777', '666666@@7777777 '),
-    ('262-255- 7177 // 273308256', '262-255-7177@@273308256 '),
-    ('', None),
-    (None, None)
-])
-def test_judge_delimiter(input_str, expected):
-    result = judge_delimiter(input_str)
-    assert result == expected
-
-@pytest.mark.parametrize("input_str, expected", [
-    ('123-243', '123-243'),
-    ('(010)1(2)3345', '(010)1(2)3345'),
-    ('(010)1(2)', None),
-    ('', None),
-    (None, None)
-])
-def test_judge_tel_length(input_str, expected):
-    result = judge_tel_length(input_str)
-    assert result == expected
-
-@pytest.mark.parametrize("input_str, expected", [
-    ('(010)123345(', '(010)123345'),
-    ('(010)1(2)3345#', '(010)1(2)3345'),
-    (')010123345', '010123345'),
-    ('', None),
-    (None, None)
-])
-def test_clean_headtail(input_str, expected):
-    result = clean_headtail(input_str)
-    assert result == expected
-
-@pytest.mark.parametrize("input_str, expected", [
-    ('6914 1002 TAX ID:200514854D', '6914 1002 200514854 '),
-    ('FAX. 41 32 392 51 07B', '41 32 392 51 07 '),
-    ('', None),
-    (None, None)
-])
-def test_col_tel_clean(input_str, expected):
-    result = col_tel_clean(input_str)
-    assert result == expected
-
-@pytest.mark.parametrize("input_str, expected", [
-    ('info@gcrcompact@gcrieber.com', None),
-    ('siki.huang@byd.com / betty.qiu@b', 'siki.huang@byd.com'),
-    ('', None),
-    (None, None)
-])
-def test_col_email_clean(input_str, expected):
-    result = col_email_clean(input_str)
-    assert result == expected
-
-@pytest.mark.parametrize("input_str, expected", [
-    ('PH 080-91133444  FAX 080-91133502', '080-91133502'),
-    ('Telefax 4930742', '4930742'),
-    ('011-23557208,telefax0129-2279612 to 615','011-23557208@@0129-2279612 615'),
-    ('', None),
-    (None, None)
-])
-def test_ind_getfax_jksdh(input_str, expected):
-    result = ind_getfax_jksdh(input_str)
-    assert result == expected
-
-@pytest.mark.parametrize("input_str, expected", [
-    ('Tel : (044) 823 2117 Fax (044) 823 4411', '(044)823 2117 '),
-    ('8012997/f-8626376', '8012997@@-8626376 '),
-    ('431222 EXTEN:201/431821(D)/FAX NO.091-0422-431672','431222201@@431821@@091-0422-431672 '),
-    ('', None),
-    (None, None)
-])
-def test_ind_tel_jksdh_clean(input_str, expected):
-    result = ind_tel_jksdh_clean(input_str)
-    assert result == expected
-
-@pytest.mark.parametrize("input_str, expected", [
-    ('8021613420', '8021613420'),
-    ('300836', '300836'),
-    ('612368','612368'),
-    ('', None),
-    (None, None)
-])
-def test_pry_phone_clean(input_str, expected):
-    result = pry_phone_clean(input_str)
-    assert result == expected
-
-if __name__ == '__main__':
-    pytest.main()

+ 0 - 180
dw_base/spark/udf/enterprise/unique/ent_offline_udf_america.py

@@ -1,180 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-import hashlib
-import re
-# 企业库唯一性调整,离线数据udf
-from datetime import datetime
-from dw_base.spark.udf.customs.common_clean import clean_company_name
-
-
-def generate_md5_hash(input_str: str):
-    input_data = input_str.encode('utf-8')
-    md5_hash = hashlib.md5()
-    md5_hash.update(input_data)
-    return md5_hash.hexdigest()
-
-
-def generate_tid_usa(company_name: str,
-                     business_number: str,
-                     state: str) -> str or None:
-    if not company_name:
-        return None
-    if business_number:
-        input_str = business_number + 'AAA'
-    else:
-        if state:
-            input_str = f"{company_name}-{state}BBB"
-        else:
-            input_str = company_name + 'CCC'
-    return 'USA' + generate_md5_hash(input_str)
-
-
-def clean_company_name_extra(s: str) -> str or None:
-    if s:
-        suffixes = ["INCORPORATED",
-                    "LIMITED LIABILITY COMPANY",
-                    "PUBLIC LIMITED COMPANY",
-                    "LIMITED LIABILITY PARTNERSHIP",
-                    "LIMITED PARTNERSHIP",
-                    "GENERAL PARTNERSHIP",
-                    "PROFESSIONAL CORPORATION",
-                    "NON PROFIT ORGANIZATION",
-                    "S CORPORATION",
-                    "BENEFIT CORPORATION",
-                    "DOING BUSINESS AS",
-                    "COMPANY LIMITE",
-                    "CORPORATION",
-                    "COMPANY",
-                    "LIMITED",
-                    "S CORP",
-                    "B CORP",
-                    "CO LTD",
-                    "INC",
-                    "LLC",
-                    "CORP",
-                    "CO",
-                    "LTD",
-                    "PLC",
-                    "LLP",
-                    "LP",
-                    "GP",
-                    "PC",
-                    "NPO",
-                    "DBA"]
-
-        # 去除后缀
-        for suffix in suffixes:
-            if s.endswith(suffix):
-                s = s[:-len(suffix)]
-                break
-
-        # 去除字符串前后的空格
-        s = s.strip()
-
-        return s
-
-
-def clean_company_name_usa(company_name: str) -> str or None:
-    if company_name:
-        name = clean_company_name(company_name)
-        if name:
-            name = clean_company_name_extra(name)
-            return name
-    return None
-
-
-state_abbr_to_full = {
-    "FL": "Florida", "Fla.": "Florida",
-    "GA": "Georgia", "Ga.": "Georgia",
-    "HI": "Hawaii",
-    "ID": "Idaho",
-    "IL": "Illinois", "Ill.": "Illinois",
-    "IN": "Indiana", "Ind.": "Indiana",
-    "IA": "Iowa",
-    "KS": "Kansas", "Kan.": "Kansas",
-    "KY": "Kentucky", "Ky.": "Kentucky",
-    "LA": "Louisiana", "La.": "Louisiana",
-    "ME": "Maine",
-    "MD": "Maryland", "Md.": "Maryland",
-    "MA": "Massachusetts", "Mass.": "Massachusetts",
-    "MI": "Michigan", "Mich.": "Michigan",
-    "MN": "Minnesota", "Minn.": "Minnesota",
-    "MS": "Mississippi", "Miss.": "Mississippi",
-    "MO": "Missouri", "Mo.": "Missouri",
-    "MT": "Montana", "Mont.": "Montana",
-    "NE": "Nebraska", "Neb.": "Nebraska",
-    "NV": "Nevada", "Nev.": "Nevada",
-    "NH": "New Hampshire", "N.H.": "New Hampshire",
-    "NJ": "New Jersey", "N.J.": "New Jersey",
-    "NM": "New Mexico", "N.M.": "New Mexico",
-    "NY": "New York", "N.Y.": "New York",
-    "NC": "North Carolina", "N.C.": "North Carolina",
-    "ND": "North Dakota", "N.D.": "North Dakota",
-    "OH": "Ohio",
-    "OK": "Oklahoma", "Okla.": "Oklahoma",
-    "OR": "Oregon", "Ore.": "Oregon",
-    "PA": "Pennsylvania", "Pa.": "Pennsylvania",
-    "RI": "Rhode Island", "R.I.": "Rhode Island",
-    "SC": "South Carolina", "S.C.": "South Carolina",
-    "SD": "South Dakota", "S.D.": "South Dakota",
-    "TN": "Tennessee", "Tenn.": "Tennessee",
-    "TX": "Texas", "Tex.": "Texas",
-    "UT": "Utah",
-    "VT": "Vermont", "Vt.": "Vermont",
-    "VA": "Virginia", "Va.": "Virginia",
-    "WA": "Washington", "Wash.": "Washington",
-    "WV": "West Virginia", "W.Va.": "West Virginia",
-    "WI": "Wisconsin", "Wis.": "Wisconsin",
-    "WY": "Wyoming", "Wyo.": "Wyoming"
-}
-
-search_terms = list(state_abbr_to_full.keys()) + list(state_abbr_to_full.values())
-
-
-def get_country_state(address: str) -> str or None:
-    if not address:
-        return None
-
-    address_upper = address.upper()
-    patterns = {}
-
-    for abbr, full_name in state_abbr_to_full.items():
-        abbr_upper = abbr.upper()
-        full_name_upper = full_name.upper()
-        patterns[abbr_upper] = re.compile(r'(?:\s|^)' + re.escape(abbr_upper) + r'(?:\s|\.|,|;|$)')
-        patterns[full_name_upper] = re.compile(r'(?:\s|^)' + re.escape(full_name_upper) + r'(?:\s|\.|,|;|$)')
-
-    for name_upper, pattern in patterns.items():
-        if pattern.search(address_upper):
-            for abbr, full_name in state_abbr_to_full.items():
-                if name_upper == abbr.upper() or name_upper == full_name.upper():
-                    return full_name
-
-    return None
-
-
-if __name__ == '__main__':
-    name = '	326 GRAND ST9735535523PATERSON NJ 07505'
-    print(get_country_state(name))
-    pass
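Editor's note: the tid-generation pattern repeated across these deleted country UDFs (registration number first, then name plus region, then bare name, with a distinct salt per branch) can be sketched generically. This is an illustrative consolidation only; `generate_tid` and its parameter names are hypothetical, not the actual signature of the new common module:

```python
import hashlib


def generate_tid(prefix, name, number=None, region=None):
    # Key priority mirrors the deleted UDFs: registration number wins,
    # then name + region, then name alone; each branch appends its own
    # salt ("AAA"/"BBB"/"CCC") so the three key spaces cannot collide.
    if not name:
        return None
    if number:
        key = number + "AAA"
    elif region:
        key = f"{name}-{region}BBB"
    else:
        key = name + "CCC"
    return prefix + hashlib.md5(key.encode("utf-8")).hexdigest()
```

The same function then serves USA/IND/RUS by varying only the prefix.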

+ 0 - 113
dw_base/spark/udf/enterprise/unique/ent_offline_udf_india.py

@@ -1,113 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-import hashlib
-# 企业库唯一性调整,离线数据udf
-from dw_base.spark.udf.customs.common_clean import clean_company_name
-
-ind_head = [
-    'M S',
-    'MS'
-]
-
-india_suffix_list = [
-    ' CO I PVT L',
-    ' CO PVT L',
-    ' CO PRIVATE L',
-    ' CO I LTD',
-    ' I LTD',
-    ' I LIMITED',
-    ' I PVT L',
-    ' I PRIVATE L',
-    ' COMPANY PRIVATE L',
-    ' COMPANY PVT L',
-    ' P LTD',
-    ' PRIVATE L',
-    ' PVT L',
-    ' CO',
-    ' INC',
-    ' CO LIMITED',
-    ' LTD',
-    ' LIMITED',
-    ' CO I',
-    ' I'
-]
-
-
-def generate_md5_hash(input_str: str):
-    input_data = input_str.encode('utf-8')
-    md5_hash = hashlib.md5()
-    md5_hash.update(input_data)
-    return md5_hash.hexdigest()
-
-
-def generate_tid_ind(company_name: str,
-                     business_number: str) -> str or None:
-    if not company_name:
-        return None
-    if business_number:
-        input_str = business_number + 'AAA'
-    else:
-        input_str = company_name + 'BBB'
-    return 'IND' + generate_md5_hash(input_str)
-
-
-def clean_company_name_ind(company_name: str) -> str or None:
-    if company_name:
-        bak_name = company_name.upper()
-        company_name = clean_company_name(bak_name)
-        for head in ind_head:
-            if company_name.startswith(head):
-                company_name = remove_prefix(company_name, head)
-                break
-        truncated_name = india_truncate_at_suffix(company_name, india_suffix_list)
-        if (len(truncated_name.strip()) < 8):
-            return clean_company_name(bak_name)
-        else:
-            return truncated_name.strip()
-    return None
-
-
-def india_truncate_at_suffix(text, suffix_list):
-    for suffix in suffix_list:
-        if suffix in text:
-            if (
-                    suffix != ' CO' and suffix != ' INC' and suffix != ' CO LIMITED' and suffix != ' LTD'
-                    and suffix != ' LIMITED' and suffix != ' CO I' and suffix != ' I'
-            ):
-                return split_last(text, suffix)
-            elif suffix == ' CO' and text.endswith(' CO'):
-                return split_last(text, suffix)
-            elif suffix == ' INC' and text.endswith(' INC'):
-                return split_last(text, suffix)
-            elif suffix == ' CO LIMITED' and ' AND CO LIMITED' not in text:
-                return split_last(text, suffix)
-            elif suffix == ' LTD' and text.endswith(' LTD'):
-                return split_last(text, suffix)
-            elif suffix == ' LIMITED' and text.endswith(' LIMITED'):
-                return split_last(text, suffix)
-            elif suffix == ' CO I' and text.endswith(' CO I'):
-                return split_last(text, suffix)
-            elif suffix == ' I' and text.endswith(' I'):
-                return split_last(text, suffix)
-    return text
-
-
-def split_last(text, suffix):
-    if text:
-        last_occurrence_index = text.rfind(suffix)
-        if last_occurrence_index != -1:
-            return text[:last_occurrence_index]
-        return text
-    return None
-
-
-def remove_prefix(text, prefix):
-    if text.startswith(prefix):
-        return text[len(prefix):]
-    return text
-
-
-if __name__ == '__main__':
-    name = 'P.T.ACEH KIAT BEUTARI JL INDONESI'
-    print(clean_company_name_ind(name))
-    pass
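Editor's note: the truncation above cuts at the *last* occurrence of the matched suffix via `str.rfind`. A minimal sketch of that cut, using a shortened, hypothetical suffix list rather than the full Indian one:

```python
# Hypothetical, abbreviated suffix list for illustration only.
SUFFIXES = [" PRIVATE L", " PVT L", " LIMITED", " LTD"]


def truncate_at_suffix(text, suffixes):
    # Cut at the LAST occurrence of the first suffix that matches,
    # as the deleted split_last helper did with str.rfind.
    for suffix in suffixes:
        if text.endswith(suffix):
            return text[: text.rfind(suffix)]
    return text
```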

+ 0 - 91
dw_base/spark/udf/enterprise/unique/ent_offline_udf_indonesia.py

@@ -1,91 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-import hashlib
-# 企业库唯一性调整,离线数据udf
-from datetime import datetime
-from dw_base.spark.udf.customs.common_clean import clean_company_name
-
-
-def generate_md5_hash(input_str: str):
-    input_data = input_str.encode('utf-8')
-    md5_hash = hashlib.md5()
-    md5_hash.update(input_data)
-    return md5_hash.hexdigest()
-
-
-def generate_tid_idn(company_name: str,
-                     business_number: str,
-                     city: str) -> str or None:
-    if not company_name:
-        return None
-    if business_number:
-        if city:
-            input_str = f"{business_number}-{city}AAA"
-        else:
-            input_str = business_number + 'BBB'
-    else:
-        input_str = company_name + 'CCC'
-    return 'IDN' + generate_md5_hash(input_str)
-
-
-def clean_company_name_extra(s: str) -> str or None:
-    if s:
-        prefixes = ['PT', 'PT.', 'CV', 'CV.']
-        suffixes = ['PT', ',PT', 'CV', ',CV']
-
-        # 去除前缀
-        for prefix in prefixes:
-            if s.startswith(prefix):
-                s = s[len(prefix):]
-                break
-
-        # 去除后缀
-        for suffix in suffixes:
-            if s.endswith(suffix):
-                s = s[:-len(suffix)]
-                break
-
-        # 截断字符:如果DI前后有空格,就把DI及后面的字符截掉
-        if ' DI ' in s:
-            s = s[:s.index(' DI ')]
-
-        # 去除字符串前后的空格
-        s = s.strip()
-
-        return s
-
-
-def clean_company_name_idn(company_name: str) -> str or None:
-    if company_name:
-        name = clean_company_name(company_name)
-        if name:
-            name = clean_company_name_extra(name)
-            return name
-    return None
-
-
-def get_standard_company_name(s: str) -> str:
-    s = s.strip()
-
-    if s.upper().startswith('PT ') or s.upper().startswith('PT.'):
-        s = 'PT.' + s[3:].strip()
-    elif s.upper().startswith('P.T.') or s.upper().startswith('P T '):
-        s = 'PT.' + s[4:].strip()
-    elif s.upper().startswith('CV ') or s.upper().startswith('CV.'):
-        s = 'CV.' + s[3:].strip()
-    else:
-        s = 'PT.' + s
-
-    # 检查并去除后缀
-    suffixes_to_remove = ['., PT', ', PT', ',PT', 'PT', ',CV', 'CV']
-    for suffix in suffixes_to_remove:
-        if s.upper().endswith(suffix):
-            s = s[:-len(suffix)].strip()
-            break
-    return s
-
-
-if __name__ == '__main__':
-    name = 'P.T.ACEH KIAT BEUTARI JL INDONESI'
-    print(get_standard_company_name(name))
-    pass
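Editor's note: `get_standard_company_name` normalizes the Indonesian legal-form spelling variants onto a canonical prefix. A compact sketch of just that prefix step (`standardize_prefix` is an illustrative name):

```python
def standardize_prefix(name):
    # Map the PT/CV spelling variants onto the canonical "PT."/"CV."
    # prefix, defaulting to "PT." when no legal form is present.
    s = name.strip()
    upper = s.upper()
    if upper.startswith(("PT ", "PT.")):
        return "PT." + s[3:].strip()
    if upper.startswith(("P.T.", "P T ")):
        return "PT." + s[4:].strip()
    if upper.startswith(("CV ", "CV.")):
        return "CV." + s[3:].strip()
    return "PT." + s
```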

+ 0 - 182
dw_base/spark/udf/enterprise/unique/ent_offline_udf_russia.py

@@ -1,182 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-import hashlib
-# 企业库唯一性调整,离线数据udf
-from dw_base.spark.udf.customs.common_clean import clean_company_name
-
-escape_mapping = {
-    "&quot;": '"',
-    "&lt;": "<",
-    "&gt;": ">",
-    "&amp;": "&",
-    "&apos;": "'"
-}
-
-rus_head = [
-    'ООО',
-    'АО',
-    'ПАО',
-    'ЗАО',
-    'ИП',
-    'Общество с ограниченной ответственностью',
-    'Акционерное общество',
-    'Публичное акционерное общество',
-    'Закрытое акционерное общество',
-    'Индивидуальный предприниматель'
-]
-
-
-def replace_escape_char(input_str):
-    if not input_str:
-        return None
-    for key, value in escape_mapping.items():
-        input_str = input_str.replace(key, value)
-    return input_str
-
-
-def generate_md5_hash(input_str: str):
-    input_data = input_str.encode('utf-8')
-    md5_hash = hashlib.md5()
-    md5_hash.update(input_data)
-    return md5_hash.hexdigest()
-
-
-def generate_tid_rus(company_name: str,
-                     business_number: str,
-                     region: str) -> str or None:
-    if not company_name:
-        return None
-    if business_number:
-        input_str = business_number + 'AAA'
-    else:
-        if region:
-            input_str = f"{company_name}-{region}BBB"
-        else:
-            input_str = company_name + 'CCC'
-    return 'RUS' + generate_md5_hash(input_str)
-
-
-def clean_company_name_extra(s: str) -> str or None:
-    if s:
-        # 去除前缀
-        for prefix in rus_head:
-            if s.startswith(prefix.upper()):
-                s = s[len(prefix):]
-                break
-        # 去除字符串前后的空格
-        s = s.strip()
-        return s
-    return None
-
-
-def clean_company_name_rus(company_name: str) -> str or None:
-    if company_name:
-        replace_name = replace_escape_char(company_name)
-        name = clean_company_name(replace_name)
-        if name:
-            name = clean_company_name_extra(name)
-            return name
-    return None
-
-
-region_mapping = {
-    'Нижегородская область': 'Нижегородская область',
-    'Тамбовская область': 'Тамбовская область',
-    'Саратовская область': 'Саратовская область',
-    'Пермский край': 'Пермский край',
-    'Кировская область': 'Кировская область',
-    'Еврейская автономная область': 'Еврейская автономная область',
-    'Херсонская область': None,
-    'Красноярский край': 'Красноярский край',
-    'Тюменская область': 'Тюменская область',
-    'Липецкая область': 'Липецкая область',
-    'Коми (Республика)': 'Республика Коми',
-    'Владимирская область': 'Владимирская область',
-    'Башкортостан (Республика)': 'Республика Башкортостан',
-    'Санкт-Петербург': 'Город Санкт-Петербург',
-    'Калининградская область': 'Калининградская область',
-    'Бурятия (Республика)': 'Республика Бурятия',
-    'Амурская область': 'Амурская область',
-    'Камчатский край': 'Камчатский край',
-    'Оренбургская область': 'Оренбургская область',
-    'Севастополь': 'Севастополь',
-    'Марий Эл (Республика)': 'Республика Марий Эл',
-    'Астраханская область': 'Астраханская область',
-    'Алтай (Республика)': 'Республика Алтай',
-    'Ивановская область': 'Ивановская область',
-    'Тульская область': 'Тульская область',
-    'Мурманская область': 'Мурманская область',
-    'Саха (Республика) (Якутия)': 'Республика Саха (Якутия)',
-    'Чеченская Республика': 'Чеченская Республика',
-    'Ленинградская область': 'Ленинградская область',
-    'Луганская Народная Республика': None,
-    'Челябинская область': 'Челябинская область',
-    'Томская область': 'Томская область',
-    'Мордовия (Республика)': 'Республика Мордовия',
-    'Кабардино-Балкарская Республика': 'Кабардино-Балкарская Республика',
-    'Республика Татарстан': 'Республика Татарстан',
-    'Северная Осетия-Алания (Республика)': 'Республика Северная Осетия-Алания',
-    'Тверская область': 'Тверская область',
-    'Ярославская область': 'Ярославская область',
-    'Иркутская область': 'Иркутская область',
-    'Орловская область': 'Орловская область',
-    'Рязанская область': 'Рязанская область',
-    'Сахалинская область': 'Сахалинская область',
-    'Волгоградская область': 'Волгоградская область',
-    'Ставропольский край': 'Ставропольский край',
-    'Псковская область': 'Псковская область',
-    'Московская область': 'Московская область',
-    'Магаданская область': 'Магаданская область',
-    'Приморский край': 'Приморский край',
-    'Хакасия (Республика)': 'Республика Хакасия',
-    'Самарская область': 'Самарская область',
-    'Карачаево-Черкесская Республика': 'Карачаево-Черкесская Республика',
-    'Чувашская Республика-Чувашия': 'Чувашская Республика',
-    'Смоленская область': 'Смоленская область',
-    'Свердловская область': 'Свердловская область',
-    'Ингушетия (Республика)': 'Республика Ингушетия',
-    'Новгородская область': 'Новгородская область',
-    'Ростовская область': 'Ростовская область',
-    'Пензенская область': 'Пензенская область',
-    'Вологодская область': 'Вологодская область',
-    'Кемеровская область': 'Кемеровская область - Кузбасс',
-    'Москва': 'Город Москва',
-    'Курганская область': 'Курганская область',
-    'Белгородская область': 'Белгородская область',
-    'Курская область': 'Курская область',
-    'Архангельская область': 'Архангельская область',
-    'Ульяновская область': 'Ульяновская область',
-    'Воронежская область': 'Воронежская область',
-    'Брянская область': 'Брянская область',
-    'Донецкая Народная Республика': None,
-    'Республика Крым': 'Республика Крым',
-    'Костромская область': 'Костромская область',
-    'Дагестан (Республика)': 'Республика Дагестан',
-    'Калужская область': 'Калужская область',
-    'Запорожская область': None,
-    'Забайкальский край': 'Забайкальский край',
-    'Адыгея (Республика) (Адыгея)': 'Республика Адыгея',
-    'Хабаровский край': 'Хабаровский край',
-    'Чукотский автономный округ': 'Чукотский автономный округ',
-    'Краснодарский край': 'Краснодарский край',
-    'Алтайский край': 'Алтайский край',
-    'Калмыкия (Республика)': 'Республика Калмыкия',
-    'Омская область': 'Омская область',
-    'Новосибирская область': 'Новосибирская область',
-    'Байконур': 'Байконур',
-    'Удмуртская Республика': 'Удмуртская Республика',
-    'Тыва (Республика)': 'Республика Тыва',
-    'Карелия (Республика)': 'Республика Карелия'
-}
-
-
-def get_standard_region(region: str) -> str or None:
-    if not region:
-        return None
-    return region_mapping.get(region)
-
-
-if __name__ == '__main__':
-    name = 'Хакасия (Республика)'
-    print(get_standard_region(name))
-    pass
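Editor's note: `replace_escape_char` walks the five-entry mapping with sequential `str.replace`, which double-unescapes inputs such as `&amp;apos;` (first `&amp;` → `&`, then the freshly produced `&apos;` → `'`). The standard library resolves each entity exactly once per scan; a sketch using `html.unescape`, which the deleted `product/escape_udf.py` already wrapped:

```python
import html


def replace_escape_char(s):
    # html.unescape resolves entities in a single pass, so
    # "&amp;apos;" becomes "&apos;" rather than "'".
    return html.unescape(s) if s else None
```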

+ 0 - 90
dw_base/spark/udf/enterprise/unique/ent_offline_udf_turkey.py

@@ -1,90 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-import hashlib
-import re
-# 企业库唯一性调整,离线数据udf
-from datetime import datetime
-from dw_base.spark.udf.customs.common_clean import clean_company_name
-
-turkey_replace_dict = {
-    'ç': 'c', 'Ç': 'C',
-    'ğ': 'g', 'Ğ': 'G',
-    'ı': 'i', 'İ': 'I',
-    'ö': 'o', 'Ö': 'O',
-    'ş': 's', 'Ş': 'S',
-    'ü': 'u', 'Ü': 'U'
-}
-
-
-def replace_str_english(text: str) -> str or None:
-    if text:
-        return text.translate(str.maketrans(turkey_replace_dict))
-    return None
-
-
-def generate_md5_hash(input_str: str):
-    input_data = input_str.encode('utf-8')
-    md5_hash = hashlib.md5()
-    md5_hash.update(input_data)
-    return md5_hash.hexdigest()
-
-
-def generate_tid_tur(company_name: str,
-                     business_number: str) -> str or None:
-    if not company_name:
-        return None
-    if business_number:
-        input_str = business_number + 'AAA'
-    else:
-        input_str = company_name + 'BBB'
-    return 'TUR' + generate_md5_hash(input_str)
-
-
-def clean_company_name_extra(s: str) -> str or None:
-    if s:
-        suffixes = [
-            "Tahmini Anonim Sirket",
-            "Anonim Sirket",
-            "Halka Acık Sirket",
-            "Komandit Sirket",
-            "Limited Sirket",
-            "Baslangıc Sirket",
-            "Tahmini Anonim Sirketi",
-            "Anonim Sirketi",
-            "Halka Acık Sirketi",
-            "Komandit Sirketi",
-            "Limited Sirketi",
-            "Baslangıc Sirketi",
-            " LTD Sti",
-            " T A S",
-            " A S",
-            " H S",
-            " K S",
-            " B S"]
-
-        # 去除后缀
-        for suffix in suffixes:
-            if s.endswith(suffix.upper()):
-                s = s[:-len(suffix)]
-                break
-
-        # 去除字符串前后的空格
-        s = s.strip()
-
-        return s
-
-
-def clean_company_name_tur(company_name: str) -> str or None:
-    if company_name:
-        company_name = replace_str_english(company_name)
-        name = clean_company_name(company_name)
-        if name:
-            name = clean_company_name_extra(name)
-            return name
-    return None
-
-
-if __name__ == '__main__':
-    name = 'KBR İNŞAAT METAL VE ELEKTRİK SANAYİ TİCARET LİMİTED ŞİRKETİ'
-    print(clean_company_name_tur(name))
-    pass
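Editor's note: `replace_str_english` rebuilds its translation table with `str.maketrans` on every call; building the table once at module load is the usual idiom. A sketch with the same character mapping (`ascii_fold` is an illustrative name):

```python
TURKISH_TO_ASCII = str.maketrans({
    "ç": "c", "Ç": "C", "ğ": "g", "Ğ": "G",
    "ı": "i", "İ": "I", "ö": "o", "Ö": "O",
    "ş": "s", "Ş": "S", "ü": "u", "Ü": "U",
})


def ascii_fold(text):
    # One shared table; str.translate then runs as a single C-level pass.
    return text.translate(TURKISH_TO_ASCII) if text else None
```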

+ 0 - 4
dw_base/spark/udf/main_test.py

@@ -1,4 +0,0 @@
-import pytest
-
-if __name__ == '__main__':
-    pytest.main()

+ 0 - 142
dw_base/spark/udf/product/cpc_clean_udf.py

@@ -1,142 +0,0 @@
-# encoding: utf8
-import re
-from inflect_udf import phrase_singular
-
-COMMA_STR = ","
-PIPE_SYMBOL = "|"
-
-# 是否包含数字和逗号 - 等
-chemical_re_1 = re.compile(r'\d+([.\']\d*)?\s*([,-])\s*\d+([.\']\d*)?')
-# 是否包含化学词汇
-chemical_re_2 = re.compile(r'ETHYL|ACID|AMINE|SALT|DIOXANE|AMINO')
-# 是否只有数字和逗号
-chemical_re_3 = re.compile(r'^[\d,]*$')
-
-
-def is_chemical_expression(word):
-    if bool(chemical_re_3.match(word)):
-        return False
-
-    # 检查是否包含数字和逗号
-    has_digits_and_commas = bool(chemical_re_1.search(word))
-
-    # 检查是否包含一些特定的化学词汇
-    has_chemical_word = bool(chemical_re_2.search(word))
-
-    result = (has_digits_and_commas or has_chemical_word)
-
-    return result
-
-
-# 中文字符
-chinese_char_pattern = re.compile(r'[\u4e00-\u9fff]+')
-# 数字
-number_pattern = re.compile(r'^-?\d+(\.\d+)?$')
-# 包含数字
-digit_pattern = re.compile(r'\d')
-# 保留词
-need_str_pattern = re.compile(r'(^3D)|DDR|VITAMIN|CONNECTOR|MP3|LAPTOP|RYZEN|INTEL|PHONE|NYLON|COVID|CAT|CABLE')
-# 切分符
-split_pattern = re.compile(r'\s*,\s*(?:OF|FOR)|\s*[,;&]\s*')
-# 特殊符号
-special_chars_pattern = re.compile(r'[¥$#~!!=??@><》《{}【】]')
-# 不包含空格和逗号,包含-得特殊化学
-no_space_pattern = re.compile(r'^(?!.*[ ,])(?=.*-).*$')
-
-# 需要保留的短字符产品词
-SHORT_CHAR_CPC = {
-    "TV", "CD", "PC", "TF", "MP", "SD", "VR", "MR", "IC", "PU"
-}
-
-
-def contains_chinese(text):
-    # 匹配中文字符的正则表达式
-    return bool(chinese_char_pattern.search(text))
-
-
-def is_number(s):
-    # 匹配整数、小数和负数
-    return bool(number_pattern.match(s))
-
-
-def contains_digit(s):
-    # 匹配字符串中是否包含数字
-    return bool(digit_pattern.search(s))
-
-
-def need_str(s):
-    # 匹配是否是保留词
-    return bool(need_str_pattern.search(s))
-
-
-def special_char_remove(cpc):
-    # 提取特殊字符产品词数据
-    return bool(special_chars_pattern.search(cpc))
-
-
-def no_space(cpc):
-    return bool(no_space_pattern.search(cpc))
-
-
-def is_invalid_cpc(word):
-    """
-    判断一个产品词是否是不合法的产品词
-    """
-    if (is_number(word)  # 纯数字
-            or (len(word.replace('.', '')) < 3 and word not in SHORT_CHAR_CPC)  # 长度<3,且不是例举产品词的
-            or contains_chinese(word)  # 包含中文的
-            or special_char_remove(word)):  # 包含特殊字符的
-        return True
-
-
-def multi_cpc_clean(cpc, force=False):
-    if cpc is None:
-        return None
-    cpc = cpc.strip()
-    if cpc == '':
-        return None
-    if COMMA_STR not in cpc and PIPE_SYMBOL not in cpc:
-        if is_invalid_cpc(cpc):
-            return None
-        if is_chemical_expression(cpc):
-            return cpc
-        if not contains_digit(cpc) or need_str(cpc) or no_space(cpc):
-            return phrase_singular(cpc)
-        else:
-            return None
-
-    cpc_list = []
-    # 先按管道符切分
-    for cpc_i in cpc.split("|"):
-        cpc_i = cpc_i.strip()
-        # 是否是化学表达式
-        if is_chemical_expression(cpc_i):
-            cpc_list.append(
-                phrase_singular(cpc_i)
-            )
-            continue
-        # cpc中包含多种产品
-        for cpc_j in re.split(split_pattern, cpc_i):
-            cpc_j = cpc_j.strip()
-            if is_invalid_cpc(cpc_j):
-                continue
-
-            cpc_j = phrase_singular(cpc_j)
-            if cpc_j not in cpc_list:
-                if not contains_digit(cpc_j) or need_str(cpc_j):
-                    cpc_list.append(cpc_j)
-
-    if len(cpc_list) == 0:
-        return None
-    else:
-        return " | ".join(cpc_list)
-
-
-if __name__ == '__main__':
-    # print(multi_cpc_clean('SEMICARBAZIDE-13C, 15N2 HYDROCHLORIDE'))
-    print(multi_cpc_clean('FAN'))
-    # print(multi_cpc_clean('4-METHYLBENZALDEHYDE'))
-    # print(multi_cpc_clean('1-ALLYL-2-THIOUREA'))
-    # print(multi_cpc_clean('1HHH'))
-    # print(multi_cpc_clean('BALL OR ROLLER BEARINGS'))
-    # print(special_char_remove("AAAA》hhhh"))
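Editor's note: `split_pattern` treats `, OF` / `, FOR` as part of the delimiter, so qualifier clauses lose their connective when a multi-product cell is split. Its behavior in isolation:

```python
import re

# Same delimiter expression as the deleted module.
split_pattern = re.compile(r'\s*,\s*(?:OF|FOR)|\s*[,;&]\s*')

# Plain comma/ampersand delimiters split cleanly...
parts = [p.strip() for p in split_pattern.split("NUTS, BOLTS & WASHERS") if p.strip()]
# ...while ", OF"/"", FOR" is consumed as a whole, dropping the connective.
qualified = [p.strip() for p in split_pattern.split("PARTS, OF IRON") if p.strip()]
```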

+ 0 - 19
dw_base/spark/udf/product/cpms_lang_detect.py

@@ -1,19 +0,0 @@
-import re
-
-
-def is_russian(text):
-    # 正则表达式匹配西里尔字母
-    cyrillic_pattern = re.compile('[\u0400-\u04FF]')
-    if bool(cyrillic_pattern.search(text)):
-        return 1
-    else:
-        return 0
-
-
-def is_spanish(text):
-    # 正则表达式匹配西班牙语字符
-    spanish_pattern = re.compile(r'[ñÑáéíóúÁÉÍÓÚ]')
-    if bool(spanish_pattern.search(text)):
-        return 1
-    else:
-        return 0
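Editor's note: both detectors above recompile their regex on every call; hoisting the compile to module scope is the cheap fix. A sketch for the Russian case, keeping the deleted UDF's integer return contract:

```python
import re

# Compiled once at import time; matches any Cyrillic code point.
CYRILLIC = re.compile('[\u0400-\u04FF]')


def is_russian(text):
    return 1 if CYRILLIC.search(text) else 0
```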

+ 0 - 5
dw_base/spark/udf/product/escape_udf.py

@@ -1,5 +0,0 @@
-import html
-
-
-def html_unescape(text):
-    return html.unescape(text)

+ 0 - 76
dw_base/spark/udf/product/inflect_udf.py

@@ -1,76 +0,0 @@
-# encoding: utf8
-
-import inflect
-
-# 创建inflect引擎实例
-p = inflect.engine()
-
-# 自定义单复数词表
-p.defnoun("appendix", "appendices")
-p.defnoun("bus", "buses")
-p.defnoun("thesis", "theses")
-p.defnoun("index", "indices")
-p.defnoun("axis", "axes")
-p.defnoun("cactus", "cacti")
-p.defnoun("focus", "foci")
-p.defnoun("fungus", "fungi")
-p.defnoun("radius", "radii")
-p.defnoun("nucleus", "nuclei")
-p.defnoun("synopsis", "synopses")
-p.defnoun("crisis", "crises")
-p.defnoun("analysis", "analyses")
-p.defnoun("diagnosis", "diagnoses")
-p.defnoun("phenomenon", "phenomena")
-p.defnoun("criterion", "criteria")
-p.defnoun("matrix", "matrices")
-p.defnoun("die", "dies")
-
-# 用户自定义的单数单词词表
-USER_DEFINED_SINGULAR_WORDS = [singular_word.lower() for singular_word in p.pl_sb_user_defined[::2]]
-
-
-def singular(word: str):
-    """
-    将复数名词转换为单数形式。
-
-    :param word: 需要转换的名词, eg:"COMPONENTS"
-    :return: 单数形式的名词, "COMPONENT"
-    """
-    if word is None or word.strip() == '':
-        return word
-    try:
-        word_l = word.lower()
-        # 用户自定义的单数单词列表
-        if word_l in USER_DEFINED_SINGULAR_WORDS:
-            return word
-        # ss结尾, 's结尾
-        if word_l.endswith("ss") or word_l.endswith("'s"):
-            return word
-        # 单词长度小于3
-        if len(word) <= 3 and word_l not in ["men"]:
-            return word
-        singular_form = p.singular_noun(word)
-        if singular_form is False:
-            # 如果word本身就是单数形式,则直接返回原字符串
-            return word
-        return singular_form
-    except Exception as _:
-        return word
-
-
-def phrase_singular(phrase: str):
-    """
-    将词组的最后一个单词复数转单数
-    :param phrase: eg:"GEARBOX COMPONENTS"
-    :return: "GEARBOX COMPONENT"
-    """
-    if phrase is None or phrase == '':
-        return None
-
-    words = phrase.split()
-    if len(words) > 1:
-        tmp = words[0: -1]
-        tmp.append(singular(words[-1]))
-        return " ".join(tmp)
-    else:
-        return singular(phrase)
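Editor's note: `phrase_singular` only touches the final word of a phrase. Without the third-party `inflect` dependency, the guard rules already encoded above (keep short words and `ss`/`'s` endings) can be approximated naively; this sketch is far cruder than `inflect.singular_noun` and is for illustration only:

```python
def naive_singular(word):
    # Crude stand-in for the inflect-backed singular(): leave short
    # words and ss/'s endings alone, otherwise strip a trailing "s".
    lower = word.lower()
    if len(word) <= 3 or lower.endswith(("ss", "'s")):
        return word
    if lower.endswith("s"):
        return word[:-1]
    return word


def phrase_singular(phrase):
    # Only the last word of the phrase is singularized.
    words = phrase.split()
    if not words:
        return phrase
    return " ".join(words[:-1] + [naive_singular(words[-1])])
```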

+ 0 - 38
dw_base/spark/udf/product/spark_string_retrieval_trie.py

@@ -1,38 +0,0 @@
-from pyspark.sql.functions import udf
-from pyspark.sql.types import IntegerType
-
-class TrieNode:
-    def __init__(self):
-        self.children = {}
-        self.is_end_of_word = False
-
-def build_trie(words):
-    root = TrieNode()
-    for word in words:
-        node = root
-        for char in word:
-            if char not in node.children:
-                node.children[char] = TrieNode()
-            node = node.children[char]
-        node.is_end_of_word = True
-    return root
-
-def search_in_trie(root, text):
-    # Scan every start offset so trie words are found anywhere in the
-    # text, not only as a prefix; report a hit as soon as one matches.
-    for start in range(len(text)):
-        node = root
-        for char in text[start:]:
-            if char not in node.children:
-                break
-            node = node.children[char]
-            if node.is_end_of_word:
-                return 1
-    return 0
-
-@udf(returnType=IntegerType())
-def trie_contains(text, words):
-    root = build_trie(words)
-    return search_in_trie(root, text)
-
-
-if __name__ == '__main__':
-    print(trie_contains('halhudiohah iohfsnihf ohfhoefoi','ioh'))
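Editor's note: `trie_contains` rebuilds the trie for every row, which defeats the point of the structure. For a per-row keyword check, plain substring search is simpler and rides on CPython's optimized `str.__contains__`; a sketch of the equivalent check:

```python
def contains_any(text, words):
    # Returns 1 if any keyword occurs as a substring of text, else 0.
    return int(any(word in text for word in words))
```

If the keyword set is large and shared across rows, building the automaton once per partition (e.g. via `mapPartitions`) is the usual remedy.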

+ 0 - 52
dw_base/spark/udf/productApplication/cts_data_clean.py

@@ -1,52 +0,0 @@
-import codecs
-import re
-import json
-from pyspark.sql.functions import udf
-from pyspark.sql.types import ArrayType, StringType
-
-
-
-hgbm_pattern=r'[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+[a-zA-Z]+$'
-def hgbm_clean(hgbm):
-    if hgbm:
-        if not any(char.isdigit() for char in hgbm):
-            # 如果 hgbm 不含数字,返回 None
-            return None
-        if 'e+' in hgbm.lower():
-            # 如果是科学计数法,截取前 8 位数字
-            return hgbm.split('e+')[0][:8]
-        if re.search(hgbm_pattern, hgbm):
-            return hgbm.replace('.','')
-        hgbm = re.sub(r'[\.;;,,/\s+]', ',', hgbm)
-        hgbm = hgbm.strip(')&\'_')
-        if '-' in hgbm:
-            if len(hgbm) >= 12:
-                hgbm = hgbm.replace('-',',')
-            else:
-                hgbm = hgbm.replace('-', '')
-        return hgbm
-    return None
-
-if __name__ == '__main__':
-
-
-    test_list = [
-        '843290108512e+31',
-        '6212.90.00.000A',
-        '6212.90.00.000AQ',
-        '6212.90.00.000A1',
-        '94069/730830/854449',
-        '940161000019 940360100000 940350000019',
-        '630532;630532,630532,630532,630532,630532,630532,630532,630532,630532',
-        '320649/320690/38245905/320619',
-        '39173900-39209990-39269097',
-        '68030000)',
-        '39262000001',
-        '9403-6000',
-        '33011920&',
-        '34013000_',
-        '\'39262000001',
-        '69072100+44152020'
-    ]
-    for str in test_list:
-        print(f'{str}---->{hgbm_clean(str)}')
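Editor's note: the scientific-notation branch exists because HS codes round-tripped through spreadsheets come back as floats; the recovery is necessarily lossy, keeping only the first 8 digits. That branch in isolation (`truncate_sci_notation` is an illustrative name, mirroring the deleted code's lowercase-`e+` handling):

```python
def truncate_sci_notation(code):
    # "843290108512e+31" is an HS code mangled by spreadsheet export;
    # only the leading 8 digits are recoverable.
    if code and 'e+' in code.lower():
        return code.split('e+')[0][:8]
    return code
```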

+ 0 - 34
dw_base/spark/udf/solr_similar_match_udf.py

@@ -1,34 +0,0 @@
-import json
-import requests
-from pyspark.sql.functions import udf
-from pyspark.sql.types import StringType, ArrayType, StructType, StructField, BooleanType
-
-
-def edismax_call(collection: str, q_alt: str, q: str, qf: str, mm: str = '70%', rows: int = 1, stopwords: str = 'true',
-                 tie: float = 0.2, wt: str = 'json'):
-    def_type: str = 'edismax'
-    params = {"defType": def_type, "mm": mm, "q.alt": q_alt, "q": q, 'qf': qf, 'rows': rows, 'stopwords': stopwords,
-              'tie': tie, 'wt': wt}
-    resp = requests.get(f'http://m2.node.dev:8886/solr/{collection}/select', params=params)
-    return resp
-
-@udf(returnType=StructType([
-    StructField("is_finded",BooleanType(),False),
-    StructField("basic_arr",ArrayType(StringType()),True)
-]))
-def get_china_company_name_match(raw_name:str,mm:str = '70%', rows: int = 1):
-    solr_resp = edismax_call('ent_china_biz_basic', raw_name,
-                             raw_name, 'ent_name_en_abb^1.0',mm,rows)
-    if solr_resp.status_code != 200:
-        return False, None
-    else:
-        resp = json.loads(solr_resp.text)['response']
-        if resp['numFound'] == 0:
-            return False, None
-        else:
-            most_match_one = resp['docs'][0]
-            return True, [most_match_one['ent_name_chn'],most_match_one['ent_name_en'],most_match_one['ent_name_en_abb'],most_match_one['unc_id']]
-
-
-if __name__ == '__main__':
-    print(get_china_company_name_match('SAMSUNG ELECTRONICS CO. LTD.,.  '))
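The Solr-matching UDF above boils down to unwrapping the edismax response body. A minimal sketch of just that step, run on a fabricated sample payload (field names are taken from the deleted code; the document values are illustrative only):

```python
import json

def extract_best_match(solr_json: str):
    """Unwrap an edismax response: (found, [chn, en, abb, unc_id]) or (False, None)."""
    resp = json.loads(solr_json)['response']
    if resp['numFound'] == 0:
        return False, None
    doc = resp['docs'][0]  # rows=1, so the first doc is the best match
    return True, [doc['ent_name_chn'], doc['ent_name_en'],
                  doc['ent_name_en_abb'], doc['unc_id']]

# Fabricated sample payloads for illustration
sample = json.dumps({"response": {"numFound": 1, "docs": [{
    "ent_name_chn": "三星电子", "ent_name_en": "Samsung Electronics Co., Ltd.",
    "ent_name_en_abb": "samsungelectronics", "unc_id": "U123"}]}})
empty = json.dumps({"response": {"numFound": 0}})
```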

+ 0 - 666
dw_base/spark/udf/spark_eng_ent_name_clean.py

@@ -1,666 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-
-import json
-import re
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-full_width_character = ['.',
-                        ',',
-                        '-',
-                        '(',
-                        ')',
-                        '@',
-                        '?',
-                        '‘',
-                        '’',
-                        '“',
-                        '”',
-                        '｀',
-                        '＃',
-                        '＋',
-                        '！',
-                        '＄',
-                        '｜',
-                        '：',
-                        '／',
-                        '；',
-                        '＊',
-                        '《',
-                        '》',
-                        '＜',
-                        '＞',
-                        '`',
-                        '#',
-                        '+',
-                        '!',
-                        '$',
-                        '|',
-                        ':',
-                        '/',
-                        ';',
-                        '*',
-                        '《',
-                        '》',
-                        '<',
-                        '>',
-                        '%',
-                        '^',
-                        '&',
-                        '_',
-                        '[',
-                        ']',
-                        '{',
-                        '}',
-                        '\\',
-                        '~',
-                        '=',
-                        "'",
-                        '±',
-                        '°',
-                        '«',
-                        '»',
-                        'µ',
-                        '¶',
-                        '·',
-                        '€',
-                        '£',
-                        '¥',
-                        '¢',
-                        '×',
-                        '÷',
-                        '±',
-                        '¬',
-                        '…',
-                        '→',
-                        '←',
-                        '↑',
-                        '↓',
-                        '↔',
-                        '⇒',
-                        '⇐',
-                        '≈',
-                        '≠',
-                        '≤',
-                        '≥'
-                        ]
-half_width_character = [
-
-    '.',
-    ',',
-    '-',
-    '(',
-    ')',
-    '@',
-    '?',
-    "'",
-    "'",
-    '"',
-    '"',
-    '`',
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    "'",
-    '#',
-    '+',
-    '!',
-    '$',
-    '|',
-    ':',
-    '/',
-    ';',
-    '*',
-    '<',
-    '>',
-    '%',
-    '^',
-    '&',
-    '_',
-    '[',
-    ']',
-    '{',
-    '}',
-    '\\',
-    '~',
-    '=',
-    "'",
-    '±',
-    '°',
-    '«',
-    '»',
-    'µ',
-    '¶',
-    '·',
-    '€',
-    '£',
-    '¥',
-    '¢',
-    '×',
-    '÷',
-    '±',
-    '¬',
-    '…',
-    '→',
-    '←',
-    '↑',
-    '↓',
-    '↔',
-    '⇒',
-    '⇐',
-    '≈',
-    '≠',
-    '≤',
-    '≥'
-]
-tail_character = ['groupcompanylimited',
-                  'limitedpartnership',
-                  'corporationlimited',
-                  'researchinstitute',
-                  'liabilitycompany',
-                  'limitedcompany',
-                  'companylimited',
-                  'youxiangongsi',
-                  'incorporated',
-                  'shanghaiinc',
-                  'corporation',
-                  'groupcoltd',
-                  'companyltd',
-                  'shlimited',
-                  'colimited',
-                  'groupltd',
-                  'chinaltd',
-                  'chinainc',
-                  'factory',
-                  'corpltd',
-                  'company',
-                  'ptyltd',
-                  'agency',
-                  'office',
-                  'center',
-                  'coltd',
-                  'coinc',
-                  'c0ltd',
-                  'colt',
-                  'corp',
-                  'llc',
-                  'ltd',
-                  'co',
-                  ]
-
-chian_ent_label = [
-    'shanghai',
-    'peking',
-    'chongqing',
-    'tianjin',
-    'wuhan',
-    'harbin',
-    'shenyang',
-    'guangzhou',
-    'chengdu',
-    'nanjing',
-    'changchun',
-    'xian',
-    'dalian',
-    'qingdao',
-    'jinan',
-    'hangzhou',
-    'zhengzhou',
-    'shijiazhuang',
-    'taiyuan',
-    'kunming',
-    'changsha',
-    'nanchang',
-    'fuzhou',
-    'lanzhou',
-    'guiyang',
-    'ningbo',
-    'hefei',
-    'anshan',
-    'fushun',
-    'nanning',
-    'zibo',
-    'qiqihar',
-    'jilin',
-    'tangshan',
-    'baotou',
-    'shenzhen',
-    'hohhot',
-    'handan',
-    'wuxi',
-    'xuzhou',
-    'datong',
-    'yichun',
-    'benxi',
-    'luoyang',
-    'suzhou',
-    'xining',
-    'huainan',
-    'jixi',
-    'daqing',
-    'fuxin',
-    'xiamen',
-    'liuzhou',
-    'shantou',
-    'jinzhou',
-    'mudanjiang',
-    'yinchuan',
-    'changzhou',
-    'zhangjiakou',
-    'dandong',
-    'hegang',
-    'kaifeng',
-    'jiamusi',
-    'liaoyang',
-    'hengyang',
-    'baoding',
-    'hunjiang',
-    'xinxiang',
-    'huangshi',
-    'haikou',
-    'yantai',
-    'bengbu',
-    'xiangtan',
-    'weifang',
-    'wuhu',
-    'pingxiang',
-    'yingkou',
-    'anyang',
-    'panzhihua',
-    'pingdingshan',
-    'xiangfan',
-    'zhuzhou',
-    'jiaozuo',
-    'wenzhou',
-    'zhangjiang',
-    'zigong',
-    'shuangyashan',
-    'zaozhuang',
-    'yakeshi',
-    'yichang',
-    'zhenjiang',
-    'huaibei',
-    'qinhuangdao',
-    'guilin',
-    'liupanshui',
-    'panjin',
-    'yangquan',
-    'jinxi',
-    'liaoyuan',
-    'lianyungang',
-    'xianyang',
-    'tai´an',
-    'chifeng',
-    'shaoguan',
-    'nantong',
-    'leshan',
-    'baoji',
-    'linyi',
-    'tonghua',
-    'siping',
-    'changzhi',
-    'tengzhou',
-    'chaozhou',
-    'yangzhou',
-    'dongwan',
-    'ma´anshan',
-    'foshan',
-    'yueyang',
-    'xingtai',
-    'changde',
-    'shihezi',
-    'yancheng',
-    'jiujiang',
-    'dongying',
-    'shashi',
-    'xintai',
-    'jingdezhen',
-    'tongchuan',
-    'zhongshan',
-    'shiyan',
-    'tieli',
-    'jining',
-    'wuhai',
-    'mianyang',
-    'luzhou',
-    'zunyi',
-    'shizuishan',
-    'neijiang',
-    'tongliao',
-    'tieling',
-    'wafangdian',
-    'anqing',
-    'shaoyang',
-    'laiwu',
-    'chengde',
-    'tianshui',
-    'nanyang',
-    'cangzhou',
-    'yibin',
-    'huaiyin',
-    'dunhua',
-    'yanji',
-    'jiangmen',
-    'tongling',
-    'suihua',
-    'gongziling',
-    'xiantao',
-    'chaoyang',
-    'ganzhou',
-    'huzhou',
-    'baicheng',
-    'shangzi',
-    'yangjiang',
-    'qitaihe',
-    'gejiu',
-    'jiangyin',
-    'hebi',
-    'jiaxing',
-    'wuzhou',
-    'meihekou',
-    'xuchang',
-    'liaocheng',
-    'haicheng',
-    'qianjiang',
-    'baiyin',
-    'bei´an',
-    'yixing',
-    'laizhou',
-    'qaramay',
-    'acheng',
-    'dezhou',
-    'nanping',
-    'zhaoqing',
-    'beipiao',
-    'fengcheng',
-    'fuyu',
-    'xinyang',
-    'dongtai',
-    'yuci',
-    'honghu',
-    'ezhou',
-    'heze',
-    'daxian',
-    'linfen',
-    'tianmen',
-    'yiyang',
-    'quanzhou',
-    'rizhao',
-    'deyang',
-    'guangyuan',
-    'changshu',
-    'zhangzhou',
-    'hailar',
-    'nanchong',
-    'jiutai',
-    'zhaodong',
-    'shaoxing',
-    'fuyang',
-    'maoming',
-    'qujing',
-    'ghulja',
-    'jiaohe',
-    'puyang',
-    'huadian',
-    'jiangyou',
-    'qashqar',
-    'anshun',
-    'fuling',
-    'xinyu',
-    'hanzhong',
-    'danyang',
-    'chenzhou',
-    'xiaogan',
-    'shangqiu',
-    'zhuhai',
-    'qingyuan',
-    'aqsu',
-    'xiaoshan',
-    'zaoyang',
-    'xinghua',
-    'hami',
-    'huizhou',
-    'jinmen',
-    'sanming',
-    'ulanhot',
-    'korla',
-    'wanxian',
-    'ruian',
-    'zhoushan',
-    'liangcheng',
-    'jiaozhou',
-    'taizhou',
-    'taonan',
-    'pingdu',
-    'ji´an',
-    'longkou',
-    'langfang',
-    'zhoukou',
-    'suining',
-    'yulin',
-    'jinhua',
-    'liu´an',
-    'shuangcheng',
-    'suizhou',
-    'ankang',
-    'weinan',
-    'longjing',
-    'daan',
-    'lengshuijiang',
-    'laiyang',
-    'xianning',
-    'dali',
-    'anda',
-    'jincheng',
-    'longyan',
-    'xichang',
-    'wendeng',
-    'hailun',
-    'binzhou',
-    'linhe',
-    'wuwei',
-    'duyun',
-    'mishan',
-    'shangrao',
-    'changji',
-    'meixian',
-    'yushu',
-    'tiefa',
-    'huai´an',
-    'leiyang',
-    'zalantun',
-    'weihai',
-    'loudi',
-    'qingzhou',
-    'qidong',
-    'huaihua',
-    'luohe',
-    'chuzhou',
-    'kaiyuan',
-    'linqing',
-    'chaohu',
-    'laohekou',
-    'dujiangyan',
-    'zhumadian',
-    'linchuan',
-    'jiaonan',
-    'sanmenxia',
-    'heyuan',
-    'manzhouli',
-    'lhasa',
-    'lianyuan',
-    'kuytun',
-    'puqi',
-    'hongjiang',
-    'qinzhou',
-    'renqiu',
-    'yuyao',
-    'guigang',
-    'kaili',
-    'yan´an',
-    'beihai',
-    'xuangzhou',
-    'quzhou',
-    'yong´an',
-    'zixing',
-    'liyang',
-    'yizheng',
-    'yumen',
-    'liling',
-    'yuncheng',
-    'shanwei',
-    'cixi',
-    'yuanjiang',
-    'bozhou',
-    'jinchang',
-    'fuan',
-    'suqian',
-    'shishou',
-    'hengshui',
-    'danjiangkou',
-    'fujin',
-    'sanya',
-    'guangshui',
-    'huangshan',
-    'xingcheng',
-    'zhucheng',
-    'kunshan',
-    'haining',
-    'pingliang',
-    'fuqing',
-    'xinzhou',
-    'jieyang',
-    'zhangjiagang',
-    'tong xian',
-    'yaan',
-    'emeishan',
-    'enshi',
-    'bose',
-    'yuzhou',
-    'tumen',
-    'putian',
-    'linhai',
-    'shaowu',
-    'junan',
-    'huaying',
-    'pingyi',
-    'huangyan'
-]
-
-brazil_tail_character_cut = [
-    'industriais ltda',
-    'brasil indstria',
-    'e comercializacao',
-    'brasil ltda',
-    'industria',
-    'eireli',
-    'cia ltda',
-    'ind e com',
-    'brasil ltda epp',
-    'importacao',
-    'e comercio',
-    'comercio',
-    # 'sa',
-    'do brasi',
-    'brasil sa',
-    'limitada',
-    'ltda me',
-    'ltda epp',
-    'ltda'
-]
-
-brazil_tail_character_remove = [
-    'sa',
-    'ltda',
-    'casa'
-]
-
-
-def get_clean_eng_ent_name(eng_name: str) -> str or None:
-    if eng_name:
-        # eng_name = eng_name.lower()
-        eng_name = eng_name.lower().replace(' ', '')
-
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_tail_char(eng_name: str) -> str or None:
-    if eng_name:
-        for char in tail_character:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)]
-        return eng_name
-    else:
-        return ''
-
-
-@udf(returnType=BooleanType())
-def filter_china_ent(name_abb: str) -> bool:
-    if name_abb:
-        for char in chian_ent_label:
-            if char in name_abb:
-                return True
-    return False
-
-
-def cut_tail_char_brazil(eng_name: str) -> str or None:
-    if eng_name:
-        for tail in brazil_tail_character_cut:
-            pattern = re.compile(rf'{tail}\s*', flags=re.IGNORECASE)
-            match = re.search(pattern, eng_name)
-            if match:
-                ent_name_cut = eng_name[:match.start()].strip()
-                if len(ent_name_cut) > 5:
-                    return ent_name_cut
-                else:
-                    return eng_name
-        return eng_name
-    return ''
-
-
-def remove_punctuation(eng_name: str) -> str or None:
-    if eng_name:
-        eng_name = eng_name.lower()
-        for char in full_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        for char in half_width_character:
-            eng_name = re.sub(re.escape(char), '', eng_name)
-
-        return eng_name
-    else:
-        return ''
-
-
-def remove_tail_char_brazil(eng_name: str) -> str or None:
-    if eng_name:
-        for char in brazil_tail_character_remove:
-            if eng_name.endswith(char):
-                return eng_name[:-len(char)].replace(' ', '')
-        return eng_name.replace(' ', '')
-    else:
-        return ''
-
-
-if __name__ == '__main__':
-    a = 'ABC ltda  epp industriais ltdaltda  me'
-    print(remove_tail_char_brazil(a))
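The Brazilian suffix-stripping above can be sketched standalone (the name `strip_brazil_tail` is hypothetical). Note that list order matters: 'casa' also ends with 'sa', so a name ending in 'casa' only loses 'sa', exactly as in the deleted code.

```python
# Trailing corporate suffixes to drop, in the original check order
BRAZIL_TAIL_REMOVE = ['sa', 'ltda', 'casa']

def strip_brazil_tail(name):
    """Drop one trailing corporate suffix, then remove all spaces."""
    if not name:
        return ''
    for tail in BRAZIL_TAIL_REMOVE:
        if name.endswith(tail):
            return name[:-len(tail)].replace(' ', '')
    return name.replace(' ', '')
```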

+ 0 - 132
dw_base/spark/udf/spark_india_format_phone_udf.py

@@ -1,132 +0,0 @@
-import re
-from pyspark.sql.functions import udf
-from pyspark.sql.types import StructType, StructField, IntegerType , StringType
-
-# Define the struct return type
-# schema = StructType([
-#     StructField("type", IntegerType(), False),
-#     StructField("contact", StringType(), True)
-# ])
-
-@udf(returnType=StructType([
-    StructField("type", IntegerType(), False),
-    StructField("contact", StringType(), True)
-]))
-def format_phone(s:str)->(int,str):
-    # "国区编号+分隔符+10位手机号码
-    # 国区编号:91、+91、(+91)、(91)或不展示
-    # 分隔符:一个半角空格或横杠或不展示
-    # 10位手机号码:10位数字或5个数字+分隔符+5个数字,首个数字为6-9其中的数字"
-    # +91 9987654321
-    # 919987654321
-    # +91 99876-54321
-
-    # type:
-    # 1-手机号
-    # 2-座机
-    # 3-邮箱
-    # 4-地址
-    # 99-其他
-    type,res = parse_phone(s)
-    if type != 99:
-        return (type,res)
-    else:
-        type,res = parse_fixed_phone(s)
-        if type != 99:
-            return (type,res)
-        else:
-            return parse_email(s)
-
-@udf(returnType=IntegerType())
-def check_email_type(s:str)->int:
-    return parse_email(s)[0]
-
-def parse_email(s:str)->(int,str):
-    rex = re.search(r'^[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)*@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)*(\.[a-zA-Z]+)+$',s)
-    if rex:
-        return (3,rex.group(0))
-    else:
-        return (99,s)
-
-
-def parse_fixed_phone(s:str)->(int,str):
-    s_rex = re.search(r'^(\+91|91|\(91\)|\(\+91\))(.*)', s)
-    start_pattern = r'^([- ]*)(011|11|022|22|033|33|044|44|020|20|040|40|080|80|0141|141|\(011\)|\(11\)|\(022\)|\(22\)|\(033\)|\(33\)|\(044\)|\(44\)|\(020\)|\(20\)|\(040\)|\(40\)|\(080\)|\(80\)|\(0141\)|\(141\))?[- ]*(.*)'
-    if s_rex:
-        last_sub = s_rex.group(2)
-        return get_fixed_phone_res(s, last_sub, start_pattern)
-    else:
-        return get_fixed_phone_res(s, s, start_pattern)
-
-def parse_phone(s:str)->(int,str):
-    s_rex = re.search(r'^(\+91|91|\(91\)|\(\+91\))(.*)', s)
-    if s_rex:
-        last_sub = s_rex.group(2)
-        return get_phone_res(s, last_sub, r'^[6-9- 0]')
-    else:
-        return get_phone_res(s, s, r'^[6-9- 0]')
-
-
-def get_fixed_phone_res(phone_str:str,last_sub_phone:str,start_pattern:str)->(int,str):
-    # Landline: country code + separator + area code + separator + 7/8/10-digit number
-    # Country code / separator: same as for mobiles above
-    # Area code: 011, 022, 033, 044, 020, 040, 080, 0141; the leading 0 may be
-    # hidden, and the area code may be wrapped in parentheses
-    # 7-digit number: 7 digits, or 3 digits + separator + 4 digits
-    # 8-digit number: 8 digits, or 4 digits + separator + 4 digits
-    # 10-digit number: 10 digits, or 4 digits + separator + 6 digits
-    re_search_res = re.search(start_pattern, last_sub_phone)
-    if re_search_res:
-        area,fixed_phone = re_search_res.group(2),re_search_res.group(3)
-        left,right = split_fixed_phone(fixed_phone)
-        if left is None:
-            if area:
-                left,right = split_fixed_phone(area + fixed_phone)
-                if left:
-                    return 2, "{} {} {}".format("+91", left, right)
-            return 99,phone_str
-        else:
-            if not area:
-                return 2, "{} {} {}".format("+91", left,right)
-            area_num_rex = re.search(r'\d+',area)
-            if area_num_rex:
-                area_num = area_num_rex.group(0)
-                if area_num.startswith('0'):
-                    area_str="".join(["(",area_num,")"])
-                else:
-                    area_str = "".join(["(0", area_num, ")"])
-            return 2, "{} {} {} {}".format("+91", area_str, left,right)
-    else:
-        return 99,phone_str
-
-
-def get_phone_res(phone_str:str,last_sub_phone:str,start_pattern:str)->(int,str):
-    # Separator: one half-width space, a hyphen, or absent
-    if not bool(re.match(start_pattern, last_sub_phone)):
-        return 99,phone_str
-    rex_last_sub_phone = re.search(r'([0]?)(\d.*$)', last_sub_phone)
-    if rex_last_sub_phone:
-        phone = rex_last_sub_phone.group(2)
-        # Mobile number: 10 digits, or 5 digits + separator + 5 digits; first digit is 6-9
-        if not bool(re.match(r'^[6-9]{1}[0-9]{4}[- ]*[0-9]{5}$', phone)):
-            return 99,phone_str
-        return 1,"{} {} {}".format("+91",phone[0:5],phone[-5:])
-    else:
-        return 99, phone_str
-
-def split_fixed_phone(fixed_phone:str)->(str,str):
-    p1 = re.search(r'^([1-9]{1}[0-9]{3})([- ]*)([0-9]{6})$', fixed_phone)
-    if p1:
-        return p1.group(1),p1.group(3)
-
-    p2 = re.search(r'^([1-9]{1}[0-9]{3})([- ]*)([0-9]{4})$', fixed_phone)
-    if p2:
-        return p2.group(1),p2.group(3)
-    p3 = re.search(r'^([1-9]{1}[0-9]{2})([- ]*)([0-9]{4})$', fixed_phone)
-    if p3:
-        return p3.group(1),p3.group(3)
-    return None,None
-
-# Register the UDF and specify its return type
-get_type_and_format_phone = udf(format_phone)
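The mobile branch of the deleted parser reduces to two regexes. A condensed sketch (the name `format_india_mobile` is hypothetical) that reproduces the "+91 XXXXX XXXXX" normalization for the mobile case only, without the landline/email fallbacks:

```python
import re

# Optional country prefix: 91, +91, (91), (+91)
COUNTRY = re.compile(r'^(\+91|91|\(91\)|\(\+91\))(.*)')
# 10-digit mobile, optionally split 5+5 by a space or hyphen, first digit 6-9
MOBILE = re.compile(r'^[6-9][0-9]{4}[- ]*[0-9]{5}$')

def format_india_mobile(raw):
    """Return '+91 XXXXX XXXXX' for a valid Indian mobile number, else None."""
    m = COUNTRY.match(raw)
    rest = m.group(2) if m else raw
    digits = re.search(r'0?(\d.*$)', rest)  # skip a single leading zero
    if not digits:
        return None
    phone = digits.group(1)
    if not MOBILE.match(phone):
        return None
    return f'+91 {phone[:5]} {phone[-5:]}'
```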

+ 0 - 516
dw_base/spark/udf/spark_json_array_udf.py

@@ -1,516 +0,0 @@
-import hashlib
-import json
-from collections import Counter
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import StringType, ArrayType, StructType, StructField, IntegerType, FloatType, MapType,BooleanType,TimestampType
-
-
-@udf(returnType=ArrayType(StructType([
-    StructField("idx",IntegerType(),False),
-    StructField("obj",StringType(),False)
-])))
-def parse_jsonarr_to_arr(s:str)->[(int,str)]:
-    res_arr = [(i+1,json.dumps(obj)) for i,obj in enumerate(json.loads(s))]
-    return res_arr
-
-
-@udf(returnType=ArrayType(StructType([
-    StructField("idx",IntegerType(),False),
-    StructField("obj",StringType(),False)
-])))
-def parse_jsonarr_to_strarr(s:str)->[(int,str)]:
-    res_arr = [(i+1,obj) for i,obj in enumerate(json.loads(s))]
-    return res_arr
-
-@udf(returnType=StructType([
-    StructField("k",ArrayType(StringType()),False),
-    StructField("kv",StringType())
-    ]))
-def parse_arr_and_count(arr,tag:str,return_count:int=-1):
-    ele_cnt_dict = Counter(arr)
-    json_list = sorted([{"code": key, "num": value} for key, value in ele_cnt_dict.items()],key=lambda x:x["num"], reverse=True)
-    if return_count < 0:
-        return [obj['code'] for obj in json_list],",".join(['{'+f'{i["code"]},{tag}:{i["num"]}'+'}' for i in json_list])
-    else:
-        list_len = len(json_list)
-        index = list_len
-        if return_count < list_len:
-            index = return_count
-        return [obj['code'] for obj in json_list][:index],",".join(['{'+f'{i["code"]},{tag}:{i["num"]}'+'}' for i in json_list[:index]])
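The tally-and-render pattern used by `parse_arr_and_count` can be sketched without the UDF wrapper (the name `tally_codes` is hypothetical; the `{code,tag:n}` output format matches the deleted code):

```python
from collections import Counter

def tally_codes(arr, tag, top_n=-1):
    """Count occurrences, sort descending, render '{code,tag:n}' pairs."""
    ranked = sorted(({"code": k, "num": v} for k, v in Counter(arr).items()),
                    key=lambda x: x["num"], reverse=True)
    if 0 <= top_n < len(ranked):
        ranked = ranked[:top_n]  # keep only the most frequent top_n codes
    codes = [o["code"] for o in ranked]
    rendered = ",".join('{' + f'{o["code"]},{tag}:{o["num"]}' + '}' for o in ranked)
    return codes, rendered
```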
-
-
-@udf(returnType=StructType([
-    StructField("sum",FloatType(),False),
-    StructField("list",StringType())
-    ]))
-def parse_arr_and_sum(struct_arr,tag:str):
-    sum_dict={}
-    for s in struct_arr:
-        key = s[0]
-        value:float = s[1]
-        if key not in sum_dict:
-            sum_dict[key]=0.0
-        if value is not None:
-            sum_dict[key] += value
-    json_list = sorted([{"code": key, "num": value} for key, value in sum_dict.items()],key=lambda x:x["num"], reverse=True)
-    total = 0.0
-    for obj in json_list:
-        total += obj["num"]
-    return round(total,2),",".join(['{'+f'{i["code"]},{tag}:{round(i["num"],2)}'+'}' for i in json_list])
-
-@udf(returnType=StringType())
-def split_str_to_jsonstr(str_list: List):
-    res = []
-    for kv_str in str_list:
-        arr = kv_str.split(':')
-        if len(arr) == 2:
-            res.append({arr[0]: arr[1]})
-    return json.dumps(res,ensure_ascii=False)
-
-
-@udf(returnType=MapType(StringType(), ArrayType(StringType())))
-def split_str_to_maparr(str_list: List):
-    res = {}
-    for kv_str in str_list:
-        arr = kv_str.split(':')
-        if len(arr) == 2:
-            if arr[0] not in res:
-                res[arr[0]]=[arr[1]]
-            else:
-                res[arr[0]].append(arr[1])
-    return res
-
-@udf(returnType=MapType(StringType(), StringType()))
-def distinct_arrmap(map_list: List):
-    res = {}
-    for kv_map in map_list:
-        if 'time' in res:
-            if int(kv_map["time"]) > int(res["time"]):
-                res = kv_map
-        else:
-            res = kv_map
-    if len(res)==0:
-        return {}
-    else:
-        return {"snovio":res["value"]}
-
-
-@udf(returnType=MapType(StringType(), StringType()))
-def distinct_arrlist(arr_list: List):
-    # [inv,similar,field,last_time]
-    res = {}
-    for arr in arr_list:
-        if len(res)==0:
-            res["inv"] = [arr[0],arr[2],arr[3]]
-            if arr[2] == 'email':
-                res["similar"] = [arr[1],arr[2],arr[3]]
-            continue
-        if arr[2] == 'email':
-            if arr[3] is None:
-                continue
-            if int(arr[3]) > int(res["inv"][2]):
-                if arr[0] != '':
-                    res["inv"] = [arr[0],arr[2],arr[3]]
-            else:
-                if res["inv"][0] == '':
-                    res["inv"] = [arr[0], arr[2], arr[3]]
-            if int(arr[3]) > int(res["similar"][2]):
-                if arr[1] != '':
-                    res["similar"] = [arr[1], arr[2], arr[3]]
-            else:
-                if res["similar"][0] == '':
-                    res["similar"] = [arr[1], arr[2], arr[3]]
-        else:
-            if res["inv"][1]=='ep1':
-                if arr[3] is None:
-                    continue
-                if int(arr[3])>int(res["inv"][2]):
-                    res["inv"] = [arr[0],arr[2],arr[3]]
-            else:
-                if arr[2]=='ep1':
-                    res["inv"] = [arr[0],arr[2],arr[3]]
-                else:
-                    if arr[3] is None:
-                        continue
-                    if int(arr[3]) > int(res["inv"][2]):
-                        res["inv"] = [arr[0], arr[2], arr[3]]
-            # if res["similar"][1] == 'ep1':
-            #     if arr[3] is None:
-            #         continue
-            #     if int(arr[3]) > int(res["similar"][2]):
-            #         res["similar"] = [arr[1], arr[2], arr[3]]
-            #     else:
-            #         if arr[2] == 'ep1':
-            #             res["similar"] = [arr[1], arr[2], arr[3]]
-            #         else:
-            #             if arr[3] is None:
-            #                 continue
-            #             if int(arr[3]) > int(res["similar"][2]):
-            #                 res["similar"] = [arr[1], arr[2], arr[3]]
-
-    wrap_res = {}
-    if "similar" in res and res["similar"][0] is not None and res["similar"][0]!='':
-        wrap_res["mail_sou_linkedin_similar"] = res["similar"][0]
-    if "inv" in res and res["inv"][0] is not None and res["inv"][0]!='':
-        wrap_res["mail_sou_linkedin_inv"] = res["inv"][0]
-    return wrap_res
-
-
-
-@udf(returnType=StructType([
-    StructField("status",BooleanType(),True),
-    StructField("hidden",BooleanType(),True),
-    StructField("start_date",TimestampType(),True),
-    StructField("end_date",TimestampType(),True),
-    StructField("insert_time",TimestampType(),True)
-    ]))
-def merge_status_info(status_info_list: list):
-    res = {
-        "status":None,
-        "hidden":None,
-        "start_date":None,
-        "end_date":None,
-        "insert_time":None
-    }
-    if status_info_list is not None and len(status_info_list)>1:
-        for status_info in status_info_list:
-            if res["insert_time"] is None:
-                res = status_info
-            else:
-                if res["insert_time"] < status_info["insert_time"]:
-                    res = status_info
-    return res
-
-
-@udf(returnType=MapType(StringType(), StringType()))
-def merge_email(map_list: List):
-    res = {}
-    if map_list is not None:
-        for kv_map in map_list:
-            for k in kv_map.keys():
-                if k not in res:
-                    res[k] = kv_map[k]
-    return res
-
-
-@udf(returnType=MapType(StringType(), ArrayType(StringType())))
-def merge_source_p_id(map_obj_list: List[dict]):
-
-    tmp_res = {}
-    if map_obj_list is not None:
-        for map_obj in map_obj_list:
-            if map_obj is not None:
-                for k, v in map_obj.items():
-                    if k not in tmp_res:
-                        tmp_res[k] = set(v)
-                    else:
-                        tmp_res[k].update(v)
-    res = {}
-    for key,value in tmp_res.items():
-        res[key] = list(value)
-    return res
-
-
-
-@udf(returnType=ArrayType(StringType()))
-def merge_source(incr_source: List,old_source: List):
-    res = set()
-    if incr_source is not None:
-        for i in incr_source:
-            if i is not None and i != "":
-                res.add(i)
-    if old_source is not None:
-        for i in old_source:
-            if i is not None and i != "":
-                res.add(i)
-    return list(res)
-
-
-@udf(returnType=ArrayType(StringType()))
-def merge_list(arr_list: List):
-    res = set()
-    for e in arr_list:
-        if e is not None:
-            for i in e:
-                if i is not None and i != "":
-                    res.add(i)
-    return list(res)
-
-@udf(returnType=MapType(StringType(),ArrayType(StringType())))
-def merge_position_map(left_dict_list :list):
-    res = {}
-    if left_dict_list is not None:
-        for kv_map in left_dict_list:
-            if kv_map is not None:
-                for k,v in kv_map.items():
-                    if v is not None:
-                        if k not in res:
-                            res[k] = set(v)
-                        else:
-                            res[k].update(set(v))
-    for k,v in res.items():
-        res[k] = list(v)
-    return res
-
-@udf(returnType=ArrayType(StringType()))
-def merge_location(arr_list: List):
-    # [location,time]
-    res = []
-    for arr in arr_list:
-        if arr is not None and len(arr) > 1:
-            if len(res) == 0:
-                res.extend(arr)
-            else:
-                if arr[1]>res[1]:
-                    res = arr
-    return res
-
-
-@udf(returnType=ArrayType(StructType([
-    StructField("channel",StringType(),False),
-    StructField("channel_ids",ArrayType(StringType()),True),
-])))
-def split_channel_to_arr(channels:list,channel_ids: dict):
-    rest = []
-    if channels is None:
-        return rest
-    for channel in channels:
-        if channel_ids is not None and channel in channel_ids:
-            rest.append({"channel":channel,"channel_ids":channel_ids[channel]})
-        else:
-            rest.append({"channel": channel, "channel_ids": None})
-    return rest
-
-@udf(returnType=StructType([
-    StructField("original",MapType(StringType(), StringType()),False),
-    StructField("zh",MapType(StringType(), StringType()))
-    ]))
-def merge_parse_email(incr_map: dict,old_dict: dict):
-    if old_dict is None:
-        res = {}
-    else:
-        res = old_dict
-    parse_res = {}
-    if incr_map is not None:
-        for k in incr_map.keys():
-            res[k] = incr_map[k]
-    if "mail_sou_linkedin_inv" in res and res["mail_sou_linkedin_inv"]!='' and res["mail_sou_linkedin_inv"] is not None:
-        parse_res["mail_sou"] = "可能退信"
-    else:
-        if "mail_sou_linkedin_similar" in res and res["mail_sou_linkedin_similar"] is not None:
-            if res["mail_sou_linkedin_similar"]=='2':
-                parse_res["mail_sou"] = "推测+验证"
-            elif res["mail_sou_linkedin_similar"]=='-1':
-                parse_res["mail_sou"] = "推测"
-            elif res["mail_sou_linkedin_similar"]=='1':
-                parse_res["mail_sou"] = "匹配"
-            else:
-                similar_f = float('0' + res["mail_sou_linkedin_similar"])
-                if similar_f>=0.8 and similar_f<0.9:
-                    parse_res["mail_sou"] = "匹配度低"
-                elif similar_f>=0.9 and similar_f<1:
-                    parse_res["mail_sou"] = "可能匹配"
-    if "snovio" in res:
-        if res["snovio"]=='valid':
-            parse_res["snovio"]="匹配"
-        elif res["snovio"]=="unknown":
-            parse_res["snovio"]="匹配度低"
-        elif res["snovio"]=="not valid":
-            parse_res["snovio"]="可能退信"
-        elif res["snovio"]=="greylisted":
-            parse_res["snovio"]="推测"
-    return res,parse_res
-
-@udf(returnType=StringType())
-def get_email_status(status_map: dict):
-    if status_map is None or len(status_map) == 0:
-        return None
-    if 'snovio' in status_map:
-        status_zh = status_map['snovio']
-    elif 'mail_sou' in status_map:
-        status_zh = status_map['mail_sou']
-    else:
-        status_zh = None
-    if status_zh == '推测':
-        return 'SPECULATION'
-    elif status_zh == '推测+验证':
-        return 'SPECULATION_VERIFICATION'
-    elif status_zh == '匹配':
-        return 'PERFECT_MATCH'
-    elif status_zh == '可能匹配':
-        return 'POSSIBLE_MATCH'
-    elif status_zh == '匹配度低':
-        return 'LOW_MATCH'
-    elif status_zh == '可能退信':
-        return 'POSSIBLE_REFUND'
-    else:
-        return None
-
-
-@udf(returnType=StringType())
-def get_media_type(social_media:str):
-    # Decide from this row's [social_media] value:
-    # 1. contains keyword "linkedin" --> linkedin
-    # 2. contains keyword "twitter" --> twitter
-    # 3. contains keyword "facebook" --> facebook
-    if social_media is None:
-        return None
-    if 'linkedin' in social_media:
-        return 'linkedin'
-    elif 'twitter' in social_media:
-        return 'twitter'
-    elif 'facebook' in social_media:
-        return 'facebook'
-
-
-@udf(returnType=StringType())
-def get_black_white_grey_status(action_status:str,action_status_code:str):
-    def judge_by_action_status(action_status:str):
-        if action_status=='CLICK' or action_status=='OPEN' or action_status=='SENT_SUCCESS':
-            return "WHITE"
-        elif action_status=='MISSING' or action_status=='INIT':
-            return "GREY"
-        elif action_status=='HARD_BOUNCE' or action_status=='SYSTEM_HARD' or action_status=='SYSTEM_SOFT' or action_status=='SYSTEM_BOUNCE' \
-            or action_status=='UNSUBSCRIBE' or action_status=='SOFT_BOUNCE' or action_status=='SPAM_COMPLAINT' or action_status=='SYSTEM_UNSUBSCRIBE':
-            return "BLACK"
-        else:
-            return "GREY"
-
-    if action_status_code is None:
-        return judge_by_action_status(action_status)
-    else:
-        if action_status_code == '503' or action_status_code == '506' or action_status_code == '507' or action_status_code == '508' or action_status_code == '401' \
-            or action_status_code == '402' or action_status_code == '403' or action_status_code == '404' or action_status_code == '406' or action_status_code == '407' or action_status_code == '408' \
-                or action_status_code == '509' or action_status_code == '409':
-            return "BLACK"
-        elif action_status_code == '505' or action_status_code == '405':
-            return "GREY"
-        else:
-            return judge_by_action_status(action_status)
-
-@udf(returnType=IntegerType())
-def get_mail_status_priority(action_status:str,action_status_code:str):
-    status_dict = {
-        'MISSING':0,
-        'SENT_SUCCESS':2,
-        'OPEN':4,
-        'CLICK':5,
-        'SOFT_BOUNCE':6,
-        'SYSTEM_SOFT':6,
-        'UNSUBSCRIBE':7,
-        'SYSTEM_UNSUBSCRIBE':7,
-        'SPAM_COMPLAINT':8,
-        'SYSTEM_BOUNCE':9,
-        'SYSTEM_HARD':9,
-        'HARD_BOUNCE':10
-    }
-    code_dict = {
-        '506':10,
-        '406':10,
-        '404':10,
-        '503':9,
-        '401':9,
-        '403':9,
-        '402':8,
-        '507':7,
-        '407':7,
-        '508':7,
-        '408':7,
-        '509':6,
-        '409':6,
-        '505':1,
-        '405':1
-
-    }
-
-
-
-    if action_status_code is not None:
-        if action_status_code in code_dict:
-            return code_dict[action_status_code]
-        else:
-            if action_status in status_dict:
-                return status_dict[action_status]
-            else:
-                return -1
-    else:
-        if action_status in status_dict:
-            return status_dict[action_status]
-        else:
-            return -1
-
-
-
-@udf(returnType=StringType())
-def get_md5(*cols:str) -> str:
-    col_and_len_list = []
-    for col in cols:
-        if col is not None:
-            l = len(col)
-            col_and_len_list.append(str(l))
-            col_and_len_list.append(col)
-
-    key = ''.join(col_and_len_list)
-    if key is None or len(key) == 0:
-        return ''
-    md5 = hashlib.md5()
-    md5.update(key.encode("utf-8"))
-    return md5.hexdigest()
-
-@udf(returnType=StringType())
-def get_mail_usable(black_white_grey_status:str):
-    if black_white_grey_status == 'WHITE':
-        return 'Usable'
-    elif black_white_grey_status == 'BLACK':
-        return 'Disable'
-    elif black_white_grey_status == 'GREY':
-        return 'Uncertain'
-    else:
-        return None
-
-
-@udf(returnType=StructType(
-    [
-        StructField("info",
-            ArrayType(StructType(
-            [
-                StructField("same",StringType(),False),
-                StructField("name",StringType(),False),
-                StructField("staff_count",IntegerType(),False)
-            ]
-            ),True)),
-        StructField("num",StringType(),False)
-    ])
-)
-def get_similar_comanynames(linkedin_related_companies:str):
-    res_dict = {}
-    similar_companies = []
-    total_num = 0
-    if linkedin_related_companies is None:
-        res_dict["info"] = None
-        res_dict["num"] = total_num
-        return res_dict
-    for company in json.loads(linkedin_related_companies):
-        if company is None:
-            continue
-
-        if 'same' in company and 'name' in company and 'staffCount' in company:
-            similar_companies.append({'same':company['same'],'name':company['name'],'staff_count':company['staffCount']})
-            total_num += company['staffCount']
-    res_dict["info"] = similar_companies
-    res_dict["num"] = total_num
-    return res_dict
-
-
-
-# Register UDFs and specify their return types
-# get_json_arr = udf(parse_jsonarr_to_arr)
-# get_json_strarr = udf(parse_jsonarr_to_strarr)
-
-
-if __name__ == '__main__':
-    a= parse_jsonarr_to_arr('[{"aaa":"bbb"},{"aaa":"bbb"}]')
-    for i in a:
-        print(i)
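One detail worth keeping from the deleted `get_md5` UDF is its length-prefixed key construction, which prevents collisions between different column splits of the same concatenation. A plain-Python sketch of just that technique (function name illustrative; the consolidated `common/spark_common_udf.py` may differ):

```python
import hashlib

def length_prefixed_md5(*cols):
    """Length-prefix each non-null column before hashing, so that
    e.g. ('ab', 'c') and ('a', 'bc') build different keys
    ('2ab1c' vs. '1a2bc')."""
    parts = []
    for col in cols:
        if col is not None:
            parts.append(str(len(col)))
            parts.append(col)
    key = ''.join(parts)
    if not key:
        return ''
    return hashlib.md5(key.encode('utf-8')).hexdigest()
```

Without the length prefix, any two argument tuples that concatenate to the same string would hash identically.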

+ 0 - 38
dw_base/spark/udf/spark_mmq_udf.py

@@ -1,38 +0,0 @@
-#!/usr/bin/env /usr/bin/python3
-# -*- coding:utf-8 -*-
-import json
-from typing import List
-
-from pyspark.sql.functions import udf
-from pyspark.sql.types import *
-
-
-def array_to_json(arr: List):
-    return json.dumps(arr, ensure_ascii=False)
-
-
-@udf(returnType=ArrayType(StringType()))
-def arr_str_to_arr(json_str: str) -> list:
-    if json_str:
-        return json.loads(json_str)
-    return []
-
-
-@udf(ArrayType(StringType()))
-def array_slice(input_array, start, end):
-    if input_array:
-        result_array = input_array[start:end]
-        return result_array
-    return []
-
-
-@udf(ArrayType(StringType()))
-def str_to_json_arr(json_str):
-    if json_str:
-        try:
-            str_arr = json.loads(json_str)
-            if isinstance(str_arr, list):
-                return [json.dumps(sm) for sm in str_arr]
-        except json.JSONDecodeError:
-            return []
-    return []
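The deleted `str_to_json_arr` re-serializes each element so Spark always receives `ArrayType(StringType())`, even when the source array mixes objects, numbers, and strings. The same logic without the `@udf` wrapper (a sketch for illustration, not the consolidated module's exact code):

```python
import json

def str_to_json_arr(json_str):
    """Parse a JSON string; when it holds a list, re-serialize each
    element so objects, numbers and strings all come back as strings."""
    if json_str:
        try:
            arr = json.loads(json_str)
            if isinstance(arr, list):
                return [json.dumps(el) for el in arr]
        except json.JSONDecodeError:
            return []
    return []  # None/empty input, or valid JSON that is not a list
```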

+ 0 - 188
dw_base/spark/udf/test/common_clean.py

@@ -1,188 +0,0 @@
-# Generic company-name denoising
-
-special_chars = ['.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '‘',
-                 '’',
-                 '“',
-                 '”',
-                 '`',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '《',
-                 '》',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '=',
-                 '\'',
-                 '±',
-                 '°',
-                 '«',
-                 '»',
-                 'µ',
-                 '¶',
-                 '·',
-                 '€',
-                 '£',
-                 '¥',
-                 '¢',
-                 '×',
-                 '÷',
-                 '¬',
-                 '…',
-                 '→',
-                 '←',
-                 '↑',
-                 '↓',
-                 '↔',
-                 '⇒',
-                 '⇐',
-                 '≈',
-                 '≠',
-                 '≤',
-                 '≥',
-                 '.',
-                 ',',
-                 '-',
-                 '(',
-                 ')',
-                 '@',
-                 '?',
-                 '"',
-                 '\'',
-                 '#',
-                 '+',
-                 '!',
-                 '$',
-                 '|',
-                 ':',
-                 '/',
-                 ';',
-                 '*',
-                 '<',
-                 '>',
-                 '%',
-                 '^',
-                 '_',
-                 '[',
-                 ']',
-                 '{',
-                 '}',
-                 '\\',
-                 '~',
-                 '¨',
-                 '´',
-                 '',
-                 '¿',
-                 '‰',
-                 '¯',
-                 ]
-special_char_dict = {c: ' ' for c in set(special_chars)}
-special_char_dict['&'] = ' and '
-special_chars_trans = str.maketrans(special_char_dict)
-
-head_list = ['MS ', 'M S ']
-
-tail_list = [' I PRIVATE LIMITED',
-             ' I PRIVATELIMITED',
-             ' PrivateATE LIMITED',
-             ' COMPANY LIMITED',
-             ' PRIVATE LIMITED',
-             ' PRIVATELIMITED',
-             ' COMPANY PRIVATE L',
-             ' COMPANY I PRIVATE L',
-             ' CO I PRIVATE L',
-             ' CO PRIVATE L',
-             ' I PRIVATE L',
-             ' I PRIVATE',
-             ' PRIVATE L',
-             ' COMPANY PVT L',
-             ' I LIMITED',
-             ' LIMITED',
-             ' P LTD',
-             ' CO I LTD',
-             ' I LTD',
-             ' CO I PVT L',
-             ' CO PVT L',
-             ' PVT L',
-             ' LTD',
-             ' CO I',
-             ' I PVT L',
-             ' I PVT',
-             ' PVT LTD',
-             ' PVT L',
-             ' PVT',
-             ' PRIVATE',
-             ' CO',
-             ' INC',
-             ' I']
-
-special_tail_list = [' CO LIMITED',
-                     ' CO LTD',
-                     ' COLTD']
-
-
-def sub_head(name):
-    for head in head_list:
-        if name.startswith(head):
-            name = name[len(head):]
-            break
-    return name
-
-
-def sub_tail(name):
-    for tail in special_tail_list:
-        no_tail = f'AND{tail}'
-        if name.endswith(tail):
-            if name.endswith(no_tail):
-                return name
-            else:
-                return name[:-len(tail)]
-    for tail in tail_list:
-        if name.endswith(tail):
-            return name[:-len(tail)]
-    return name
-
-
-def clean_company_name(name):
-    if name:
-        # replace special characters with spaces
-        name = name.translate(special_chars_trans)
-        # uppercase, collapse repeated spaces, strip leading/trailing spaces
-        name = ' '.join(name.upper().split())
-        return name
-    else:
-        return None
-
-
-def clean_pre_join(name):
-    o_name = clean_company_name(name)
-    if not o_name:
-        return None
-    name = sub_head(o_name)
-    name = sub_tail(name)
-    if len(name) < 8:
-        return o_name
-    return name
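The cleaning above hinges on a `str.maketrans` table built once at import time, which is much cheaper per row than chained `replace` calls. A minimal standalone sketch with a reduced character set (the deleted module mapped roughly 90 punctuation characters plus `&` → ` and `):

```python
# Reduced character set for illustration; punctuation maps to a
# space and '&' expands to ' and '.
SPECIAL = {c: ' ' for c in '.,-()@?#'}
SPECIAL['&'] = ' and '
TRANS = str.maketrans(SPECIAL)

def clean_company_name(name):
    if not name:
        return None
    # punctuation -> spaces, then uppercase and collapse whitespace
    return ' '.join(name.translate(TRANS).upper().split())
```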

+ 0 - 259
dw_base/spark/udf/test/d2str.py

@@ -1,259 +0,0 @@
-import re
-
-pattern1 = r'(\d+)[- /\']?([A-Za-z\d]+)[- /\'\.]+(\d+ ?\d+)'
-pattern2 = r'[,-;\']?(\d+)[- /\']?([A-Za-z\d]+)[- /\'\.]+(\d+ ?\d+)'
-pattern3 = r'[A-Za-z]+, ([A-Za-z]+) (\d+), (\d+)'
-pattern4 = r'([A-Za-z\d]+) ([A-Za-z\d]+)\.? (\d+)'
-pattern5 = r'(!|\d+)[- ]+([A-Za-z]+)[ ]?(\d+)'
-
-month_dict = {'Agsts': '08',
-              'Agsutus': '08',
-              'Agts': '08',
-              'Agust': '08',
-              'Agustus': '08',
-              'Apr': '04',
-              'April': '04',
-              'Aprl': '04',
-              'Aprll': '04',
-              'Aug': '08',
-              'August': '08',
-              'Deaember': '12',
-              'Dec': '12',
-              'December': '12',
-              'Des': '12',
-              'Desember': '12',
-              'Feb': '02',
-              'Febrauri': '02',
-              'Februari': '02',
-              'Februaru': '02',
-              'February': '02',
-              'Febuari': '02',
-              'JULI': '07',
-              'Jan': '01',
-              'Januari': '01',
-              'January': '01',
-              'Jul': '07',
-              'Juli': '07',
-              'July': '07',
-              'Jun': '06',
-              'June': '06',
-              'Juni': '06',
-              'MAy': '05',
-              'Mar': '03',
-              'March': '03',
-              'Maret': '03',
-              'Mart': '03',
-              'May': '05',
-              'Mei': '05',
-              'Mrt': '03',
-              'No': '11',
-              'Nof': '11',
-              'Nop': '11',
-              'Nopember': '11',
-              'Nov': '11',
-              'November': '11',
-              'Oct': '10',
-              'October': '10',
-              'Okober': '10',
-              'Okt': '10',
-              'Okt0ber': '10',
-              'Oktober': '10',
-              'Pebruari': '02',
-              'Sep': '09',
-              'Sepetember': '09',
-              'Sept': '09',
-              'September': '09',
-              'Septembver': '09',
-              'agust': '08',
-              'des': '12',
-              'desmb': '12',
-              'juli': '07',
-              'maret': '03',
-              'mei': '05',
-              'november': '11',
-              'oct': '10',
-              'oktober': '10'
-              }
-
-
-def get_date(text: str):
-    match1 = re.match(pattern1, text)
-    if match1:
-        day, month, year = match1.groups()
-        return year, month, day
-    match2 = re.match(pattern2, text)
-    if match2:
-        day, month, year = match2.groups()
-        return year, month, day
-    match3 = re.match(pattern3, text)
-    if match3:
-        month, day, year = match3.groups()
-        return year, month, day
-    match4 = re.match(pattern4, text)
-    if match4:
-        day, month, year = match4.groups()
-        return year, month, day
-    match5 = re.match(pattern5, text)
-    if match5:
-        day, month, year = match5.groups()
-        return year, month, day
-    return None, None, None
-
-
-def clean_date_indonesia(text):
-    if text:
-        year, month, day = get_date(text)
-        year = clean_year(year)
-        month = clean_month(month)
-        day = clean_day(day)
-        if year and month and day:
-            return f'{year}-{month}-{day}'
-    else:
-        return None
-
-
-def clean_year(year: str):
-    if year:
-        year = year.replace(' ', '')
-        if len(year) == 1:
-            return f'200{year}'
-        elif len(year) == 2:
-            if year < '30':
-                return f'20{year}'
-            else:
-                return f'19{year}'
-        elif len(year) == 3:
-            if year[0] == '0':
-                return f'2{year}'
-            else:
-                return f'{year[0]}0{year[1:]}'
-        try:
-            year_int = int(year)
-            if year_int > 2024 or year_int <= 1900:
-                return None
-        except ValueError:
-            return None
-        return year
-    else:
-        return None
-
-
-def clean_month(month: str):
-    if month:
-        if len(month) == 1:
-            month = f'0{month}'
-        elif re.match( r'^\d{2}$', month):
-            month = month
-        else:
-            month = month_dict.get(month)
-        try:
-            month_int = int(month)
-            if month_int < 1 or month_int > 12:
-                return None
-        except ValueError:
-            return None
-        return month
-    else:
-        return None
-
-
-def clean_day(day: str):
-    if day:
-        if len(day) == 1:
-            if day in ['!', 'I', '1']:
-                return '01'
-            else:
-                return f'0{day}'
-        elif len(day) == 2:
-            try:
-                day_int = int(day)
-                if day_int < 1 or day_int > 31:
-                    return None
-            except ValueError:
-                return None
-            return day
-    else:
-        return None
-
-
-
-if __name__ == '__main__':
-    test_cases = [
-        'Monday, November 03, 2014',
-        'Tuesday, September 08, 2015',
-        "28 Agustus' 09",
-        "19Juli 2011",
-        '25September 2027',
-        'I Februari 2011',
-        ',30 April 2013',
-        '25 Agsts. 08',
-        "4'Nov 08",
-        "15 Des.08",
-        "'06-Sept-10",
-        "! Dec 09",
-        "06- Mei 09",
-        "1 Desember2009",
-        "18 MAy09",
-        "22-Jan013",
-        "21 Okober-20 10",
-        "01 Oktober 2 013",
-        "19-No-13",
-        "8 oct 9 ",
-    ]
-    for test_case in test_cases:
-        print(test_case + '    ->    ', get_date(test_case))
-    year_cases = [
-        '00',
-        '01',
-        '09',
-        '19',
-        '20',
-        '79',
-        '85',
-        '96',
-        '99',
-        '013',
-        '204',
-        '209',
-        '210',
-        '2 013',
-        '20 10',
-        '1028',
-        '2116',
-        '10209',
-        '13',
-    ]
-    for year_case in year_cases:
-        print(year_case + '    ->    ' + str(clean_year(year_case)))
-
-    month_cases = [
-        '01',
-        '09',
-        '11',
-        '5',
-        '7',
-        'Agsts',
-        'January',
-        'Okt',
-        'No',
-        '17'
-    ]
-    for month_case in month_cases:
-        print(month_case + '    ->    ' + str(clean_month(month_case)))
-    day_cases = [
-        '01',
-        '09',
-        '11',
-        '5',
-        '7',
-        '!',
-        'I',
-        '31',
-        '35',
-        '13',
-        '898']
-    for day_case in day_cases:
-        print(day_case + '    ->    ' + str(clean_day(day_case)))
-    print('----------------------------------------------------------------|')
-    for test_case in test_cases:
-        print(test_case + '    ->    ', clean_date_indonesia(test_case))
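The normalization flow in the deleted `d2str.py` is regex capture followed by per-part cleanup against a spelling dictionary. A trimmed sketch of that flow (only one of the five layout patterns and three month spellings; names are illustrative, and the year-window cleanup is omitted):

```python
import re

# Trimmed month-spelling table (the deleted module mapped ~60
# Indonesian/English variants) and one of its five layout regexes.
MONTHS = {'Januari': '01', 'Juli': '07', 'Des': '12'}
PATTERN = re.compile(r"(\d+)[- /']?([A-Za-z\d]+)[- /'\.]+(\d+ ?\d+)")

def normalize(text):
    """Return 'YYYY-MM-DD' when the layout and month spelling are known."""
    m = PATTERN.match(text)
    if not m:
        return None
    day, month, year = m.groups()
    # month is either a known spelling or already numeric
    month = MONTHS.get(month, month if month.isdigit() else None)
    if month is None:
        return None
    return f"{year.replace(' ', '')}-{month.zfill(2)}-{day.zfill(2)}"
```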

+ 0 - 20
dw_base/spark/udf/test/test_common_clean.py

@@ -1,20 +0,0 @@
-import pytest
-from dw_base.spark.udf.test.common_clean import clean_pre_join
-
-@pytest.mark.parametrize("name, expected", [
-    ('MS ABC Ltd.', 'ABC'),
-    ('MS ABC I PRIVATE LIMITED', 'ABC'),
-    ("M S ABC COMPANY PRIVATE L", 'ABC'),
-    ('ABC Ltd.', 'ABC'),
-    ('ABC P LTD', 'ABC'),
-    ('ABC PRIVATE LIMITED', 'ABC'),
-    ('ABC LIMITED', 'ABC'),
-    ('ABC INC', 'ABC'),
-    ('ABC AAA', 'ABC AAA'),
-    ('ABC CO LIMITED', 'ABC'),
-    ('ABC AND CO LIMITED', 'ABC AND CO LIMITED'),
-    ('ABC COLTD', 'ABC'),
-
-])
-def test_clean_pre_join(name, expected):
-    assert clean_pre_join(name) == expected

+ 1 - 1
kb/00-项目架构.md

@@ -71,7 +71,7 @@ poyee-data-warehouse/              # project root (repo name = deploy name)
 |------|-----------|------|
 | Global init | `dw_base/__init__.py` | environment detection, color constants, findspark init, user/permission checks |
 | SparkSQL engine | `dw_base/spark/spark_sql.py` | SparkSession management, UDF registration, SQL execution, data export |
-| UDF library | `dw_base/spark/udf/` | Spark user-defined functions organized by business line |
+| UDF library | `dw_base/spark/udf/` | `common/` shared UDFs (auto-registered by the entry point) + `business/` business-specific UDFs (loaded on demand via `ADD FILE`) |
 | DataX engine | `dw_base/datax/` | ini config parsing → json job-file generation |
 | DataX datasources | `dw_base/datax/datasources/` | connection-parameter abstraction for each datasource |
 | DataX plugins | `dw_base/datax/plugins/` | Reader/Writer factories + per-datasource implementations |

+ 2 - 3
kb/90-重构路线.md

@@ -508,9 +508,8 @@ sql = "... WHERE TABLE_SCHEMA='%s' ..." % (database, table_name)
 tests/
 ├── conftest.py                    # shared pytest fixtures
 ├── unit/
-│   ├── test_udf_trd.py            # UDF unit tests (organized by business domain; pure functions, no Spark dependency)
-│   ├── test_udf_usr.py
-│   ├── test_udf_pub.py
+│   ├── test_udf_common.py         # common-UDF unit tests (pure functions, no Spark dependency)
+│   ├── test_udf_business.py       # business-UDF unit tests (if any, organized per file)
 │   ├── test_config_utils.py       # utility-function unit tests
 │   ├── test_datetime_utils.py
 │   ├── test_sql_utils.py

+ 2 - 1
kb/92-重构进度.md

@@ -42,7 +42,7 @@
 - [x] Globally replace `ADD FILE tendata/...` in SQL → `ADD FILE dw_base/...` (2026-04-15)
 - [x] Globally replace `zip -qr tendata.zip tendata` → `zip -qr dw_base.zip dw_base` (2026-04-15; the f-string form in spark_sql.py was fixed by hand)
 - [x] Globally replace `addPyFile('tendata.zip')` → `addPyFile('dw_base.zip')` (2026-04-15; publish.sh updated in sync)
-- [ ] Globally replace the path regex `re.sub(r"tendata-warehouse.*", ...)` → use the new project name (tied to the repo rename; after the 2026-04-20 legacy-business cleanup batch the remainder narrows further to the files left under `dw_base/utils/`, `bin/doris-*-starter.py`, `dw_base/spark/udf/customs/company_abbr.py`, etc.; the string literals in `diff_utils.py` and in `polling_scheduler.py` / `drop_*.py` were zeroed out along with their deletion)
+- [ ] Globally replace the path regex `re.sub(r"tendata-warehouse.*", ...)` → use the new project name (tied to the repo rename; after the 2026-04-20 legacy-business cleanup batch the remainder narrows further to the files left under `dw_base/utils/` and `bin/doris-*-starter.py`; the string literals in `diff_utils.py` / `polling_scheduler.py` / `drop_*.py` / `spark/udf/customs/*` were zeroed out along with their deletion)
 - [x] Audit database/table-name references such as `tendata_corp`, **confirm they are not replaced by mistake** (2026-04-15; confirmed kept: `tendata_corp`, `tendata_bigdata256!`, `ent_tendata_interface`, `api.tendata.cn`)
 - [x] Create the `jobs/` directory + `jobs/{raw,ods,dim,dwd,dws,tdm,ads}/` subdirectories (2026-04-15; `.gitkeep` placed, `dim/` is an independent top-level layer)
 - [x] Create the `manual/` directory + 5 subdirectories (`ddl/`, `backfill/`, `fix/`, `adhoc/`, `archive/`) (2026-04-15; `.gitkeep` placed; `manual/ddl/` is the single source of truth for all DDL)
@@ -158,3 +158,4 @@
 | 2026-04-20 | **§7.2.1 reversed again**: drop the `whoami == RELEASE_USER` branch; `LOG_ROOT_DIR` becomes a single default `${HOME}/log` and stays in `conf/env.sh` (externally configurable later). Rationale: `$HOME` is naturally isolated per user (bigdata and personal accounts have different home directories), so the code-level branch was a redundant layer; `bigdata` is itself the dedicated scheduling account, so its `$HOME` is the legitimate home for production logs and the system-level `/opt/data/log` path is unnecessary. Updated in sync: `90-重构路线.md §7.2.1` (core section) + §2.1 hardcoded-values table row + §2.4 env.sh draft + `00-项目架构.md` §6 deployment section + kb/92 stage-2 checklist | — |
 | 2026-04-20 | **Second batch of legacy-business-coupled code cleanup (outside the refactor plan)**: while discussing UDF/module decoupling, also surveyed the `dw_base/` submodules and deleted 16 files in bulk: **3 whole directories** — `oss/` (oss2_util.py + __init__; the new business needs no object storage), `scheduler/` (the three business files polling_scheduler / drop_partitions / drop_daily_full_snapshot_tbls; the first is hard-wired to the old Mongo polling, and the latter two's "drop partitions older than N days" capability is already recorded as a stage-4 rewrite task), `hive/` (hive_utils + hive_constants; `get_hive_create_table_ddl*` in hive_utils has zero references + depends on the legacy field dictionary `COLUMN_NAME_COMMENT_DICT`, and the DDL generator will not be rebuilt as a whole; the naming-convention semantics of `get_hive_database_name` / `get_hive_table_prefix` are already ruled in `kb/21-命名规范.md`, so no code rebuild — the from-scratch rewrite of `bin/datax-gc-generator.py` will implement the new convention); **7 files under utils/** — data_distinct / diff_utils / excel_to_hive_utils / hive_diff_database / hive_to_excel_utils / pdt_check_table / pdt_check_table_multis, all with zero external references + strong business coupling (hardcoded tendata paths / old-cluster IP `192.168.30.3` / Chinese-table-name pinyin conversion / customs `cts_*` table-name patterns). **Knock-on effect**: the hive_utils import at `bin/datax-gc-generator.py:26` becomes broken, covered by the "rewrite from scratch" task in 90-路线 §2.7, not fixed separately. **New stage-4 task**: re-implement the partition-retention tool (meta-table driven + parameterized day count; the directory may not be called scheduler). **CLAUDE.md rule appended**: first application of the "delete empty modules outright" principle is deferred (elasticsearch/flink/ml/validation/common/ kept for now, to be reorganized at finer granularity later) | — |
 | 2026-04-20 | **Bulk cleanup of legacy-business-coupled code (outside the refactor plan)**: while hunting for `tendata` remnants, found a batch of legacy files tightly coupled to `tendata_corp` / `ent_tendata_interface` / DolphinScheduler / DingTalk alerts; after item-by-item review, deleted 40 files + slimmed 1: **34 legacy-business modules** (under `dw_base/scheduler/`: `get_oldmongo_*` ×5, `dingtalk_*` / `ent_interface_dingtalk*` / `country_count_dingtalk` / `mg_company_alias_init` ×8, the whole `mg2es/` directory of 13 files; the whole `dw_base/ds/` directory of 4 files; `dw_base/spark/udf/spark_read_hive_columns_cnt.py`; `dw_base/utils/tid_utils.py`; `dw_base/spark/td_spark_init.py` (written by former colleague xunxu, never called); `bin/hive-exec.sh`), **6 cascading deletions** (`dw_base/spark/udf/spark_id_generate_udf.py` + `dw_base/spark/udf/enterprise/unique/spark_tid_match_udf.py` depend on the deleted `tid_utils`; `dw_base/utils/hive_file_merge.py` + `dw_base/utils/spark_parse_json_to_hive.py` depend on the deleted `mg2es`/DingTalk alerts; `bin/hive-exec-job-starter.py` calls the deleted `hive-exec.sh`; `bin/dingtalk-work-alert.sh`), **1 slimmed**: `dw_base/spark/udf/spark_mmq_udf.py` cut from 530 lines down to 4 data-type-conversion functions (all scenario-specific UDFs for phone/domain/website/statname etc. and the Mongo-related logic removed). Updated in sync: `00-项目架构.md` (removed `td_spark_init` / DS-related entries), `90-重构路线.md` (merged DingTalk + WeCom Webhook wording, removed the DS API row, marked the §5.2 dependency-cleanup checklist done early), `92-进度.md` stage-1 line 6 `re.sub` checklist updated with the remaining scope (~15 spots). **Two new stage-4 tasks**: (1) re-implement the Hive HDFS small-file merge tool (generalize connections / strip the `cts_*_ex/_im` table-name assumption); (2) rewrite the alerting module (drop DingTalk, go through the `conf/alerter.ini` Webhook) | — |
+| 2026-04-20 | **UDF module reorganization (outside the refactor plan)**: restructure `dw_base/spark/udf/` into two categories: `common/` (shared UDFs, auto-registered via `ADD FILE` by the SparkSQL entry point) + `business/` (business-specific UDFs, loaded on demand with `ADD FILE` in SQL). (a) 6 source files (root `spark_common_udf.py` 24 functions + `spark_json_array_udf.py` 23 functions + `spark_mmq_udf.py` 3 functions + `customs/cts_common.py` + `product/escape_udf.py` + `enterprise/spark_eng_ent_json_array_append_udf.py`) were read through, deduplicated, and stripped of business coupling, then merged into the single file `common/spark_common_udf.py` (500 lines, 40 functions, in 5 sections: JSON / Array / String / Numeric-Date-Hash / Cross-type-converters). Single file rather than a split by type, rationale: the cross-type converters (`json2str` / `arr2json` / `str2map` etc., ~9 functions, 20%+) have no clear home, and forcing a split would only create boundary disputes. (b) Cleared out all legacy-business UDF subdirectories and root-level business files under `dw_base/spark/udf/`, 60 in total: whole directories `contacts/` / `customs/` / `enterprise/` / `product/` / `productApplication/` / `test/`; at the root, `spark_eng_ent_name_clean.py` / `spark_india_format_phone_udf.py` / `solr_similar_match_udf.py` / `main_test.py` plus the 3 source UDF files. (c) The `COMMON_SPARK_UDF_FILE` constant at `dw_base/__init__.py:27` changed from `dw_base/spark/udf/spark_common_udf.py` to `dw_base/spark/udf/common/spark_common_udf.py` (the two usages at `bin/spark-sql-starter.py:172-173` pick it up automatically through the constant). (d) UDF business-coupled files missed by the earlier `dingtalk_*` / `mg2es` cascading cleanups are zeroed out together in this batch. `business/` is a skeleton for now, to be filled when new business UDFs actually appear | — |

Some files were not shown because too many files changed in this diff