
refactor(spider): refactor proxy and request retry logic; optimize data parsing and database operations

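- Replace retrying with tenacity for the retry mechanism and unify the log output of the retry callback
- Add a logger parameter to the proxy-fetching function and improve exception logging
- Adjust the request and parsing flow to accept a logger argument, giving more complete debug output
- Improve the date conversion function with exception handling and logging
- Update the data parsing logic: correct the XPath paths and standardize the field mapping
- Change how data is saved: write to the database directly from dictionary parameters
- Pass a logger through thread pool task submission and exception handling for finer-grained task status tracking
- Raise the database connection pool's maximum connections and cached connections to improve connection reuse
- Refactor the batch insert logic: add a fallback on exceptions plus detailed logging, fixing the duplicate data insertion problem
- Update how scheduled tasks are invoked: support passing a logger and log the first run explicitly
- Add the launch command for the new spider task to the README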
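
The batch insert fallback described in the list above is not part of the diff shown here, so the following is only a minimal sketch of the general idea, not the project's implementation. It assumes a hypothetical table `ags_record` with columns `cert_id` and `grade`, uses the standard-library `sqlite3` and `logging` modules as stand-ins for the project's `MySQLConnectionPool` and its logger, and the function name `insert_many_with_fallback` is invented for illustration.

```python
import logging
import sqlite3

logger = logging.getLogger("batch_insert_sketch")


def insert_many_with_fallback(conn, rows):
    """Insert rows in one batch; on a duplicate-key error, retry row by row.

    `rows` is a list of dicts keyed by column name (hypothetical schema).
    Returns a tuple (inserted, skipped).
    """
    sql = "INSERT INTO ags_record (cert_id, grade) VALUES (:cert_id, :grade)"
    try:
        # The connection context manager commits on success and rolls the
        # whole batch back if any statement raises.
        with conn:
            conn.executemany(sql, rows)
        return len(rows), 0
    except sqlite3.IntegrityError as exc:
        logger.warning("batch insert failed (%s); inserting row by row", exc)

    inserted = skipped = 0
    for row in rows:
        try:
            with conn:
                conn.execute(sql, row)
            inserted += 1
        except sqlite3.IntegrityError:
            # Duplicate key: log it and keep going instead of failing the batch.
            skipped += 1
            logger.debug("skipping duplicate cert_id=%s", row["cert_id"])
    return inserted, skipped
```

With a MySQL driver the same shape would apply, but the exception class and placeholder syntax differ, so treat those details as assumptions to check against the driver actually in use.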
charley 1 month ago
parent
commit
a231253ec0
1 file changed with 5 additions and 5 deletions

+ 5 - 5
ags_spider/ags_new_daily.py

@@ -14,10 +14,10 @@ from mysql_pool import MySQLConnectionPool
 from tenacity import retry, stop_after_attempt, wait_fixed
 
 
-# logger.remove()
-# logger.add("logs/{time:YYYYMMDD}.log", encoding='utf-8', rotation="00:00",
-#            format="[{time:YYYY-MM-DD HH:mm:ss.SSS}] {level} {message}",
-#            level="DEBUG", retention="7 day")
+logger.remove()
+logger.add("logs/{time:YYYYMMDD}.log", encoding='utf-8', rotation="00:00",
+           format="[{time:YYYY-MM-DD HH:mm:ss.SSS}] {level} {message}",
+           level="DEBUG", retention="7 day")
 
 
 def after_log(retry_state):
@@ -232,7 +232,7 @@ def get_new_task(sql_pool):
     Daily update task range: +2000, -1000
     """
     ags_id_list = sql_pool.select_all(
-        f"SELECT id, cert_id FROM ags_task WHERE state != 1 AND cert_id <= '{end_max_cert_str}' ORDER BY id DESC LIMIT 10000") # Don't forget 3000!
+        f"SELECT id, cert_id FROM ags_task WHERE state != 1 AND cert_id <= '{end_max_cert_str}' ORDER BY id DESC LIMIT 6000") # Don't forget 3000!
     # ags_id_list = sql_pool.select_all("SELECT id,cert_id FROM ags_task WHERE id < 927059 AND state = 0 LIMIT 10000")
     ags_id_list = [i for i in ags_id_list]
     return ags_id_list