Automatic data cleaning: principle, source code, and use with large language models
-----------------------
Principle: convert the characters or words of each string into a count vector, then take the cosine of the angle between the two vectors as the similarity. In fast mode, similarity is computed only for entries whose fingerprints are identical.
Authors: check whether they belong to the same institution, the same country, and the same discipline.
------------------------
Source code:

using System;
using System.Collections.Generic;
using System.Linq;

namespace SciM.NLP
{
    public class StringSimilarity
    {
        // Overall similarity: the smaller of the character-based and word-based scores.
        public static double ComputeSimmilarity(string str1, string str2)
        {
            return Math.Min(ComputeSimmilarityBaseChar(str1, str2), ComputeSimmilarityBaseWord(str1, str2));
        }

        /// <summary>
        /// Computes similarity based on characters.
        /// </summary>
        public static double ComputeSimmilarityBaseChar(string str1, string str2)
        {
            // Remove spaces and punctuation. The source repeats the parenthesis pair twice;
            // the second pair is assumed here to have been the full-width forms.
            str1 = str1.Replace(" ", "").Replace("(", "").Replace(")", "").Replace("(", "").Replace(")", "").Replace("[", "").Replace("]", "").Replace("-", "").Replace(".", "").Replace(",", "");
            str2 = str2.Replace(" ", "").Replace("(", "").Replace(")", "").Replace("(", "").Replace(")", "").Replace("[", "").Replace("]", "").Replace("-", "").Replace(".", "").Replace(",", "");
            double ret = 0;
            // Collect every distinct character that occurs in either string.
            List<char> allchars = new List<char>();
            foreach (char c in str1.ToCharArray())
            {
                if (!allchars.Contains(c))
                    allchars.Add(c);
            }
            foreach (char c in str2.ToCharArray())
            {
                if (!allchars.Contains(c))
                    allchars.Add(c);
            }
            // Build one count vector per string over the shared character set.
            double[] d1 = new double[allchars.Count];
            double[] d2 = new double[allchars.Count];
            for (int i = 0; i < allchars.Count; i++)
            {
                d1[i] = str1.Count(c => c == allchars[i]);
                d2[i] = str2.Count(c => c == allchars[i]);
            }
            // Math2Vector is defined elsewhere in the project (not shown here);
            // cosSim is the cosine similarity of the two vectors.
            Math2Vector m = new Math2Vector(d1, d2);
            ret = m.cosSim;
            return ret;
        }

        /// <summary>
        /// Computes similarity based on words.
        /// </summary>
        public static double ComputeSimmilarityBaseWord(string str1, string str2)
        {
            // Turn punctuation into spaces so that it acts as a word separator.
            str1 = str1.Replace("(", " ").Replace(")", " ").Replace("(", " ").Replace(")", " ").Replace("[", " ").Replace("]", " ").Replace("-", " ").Replace(".", " ").Replace(",", " ");
            str2 = str2.Replace("(", " ").Replace(")", " ").Replace("(", " ").Replace(")", " ").Replace("[", " ").Replace("]", " ").Replace("-", " ").Replace(".", " ").Replace(",", " ");
            List<string> strarray1 = new List<string>(str1.Split(' ', StringSplitOptions.RemoveEmptyEntries).Distinct());
            List<string> strarray2 = new List<string>(str2.Split(' ', StringSplitOptions.RemoveEmptyEntries).Distinct());
            return ComputeSimmilarity(strarray1, strarray2);
        }

        // Cosine similarity of two word lists over their combined vocabulary.
        public static double ComputeSimmilarity(List<string> str1, List<string> str2)
        {
            double ret = 0;
            List<string> allchars = new List<string>();
            foreach (string c in str1)
            {
                if (!allchars.Contains(c))
                    allchars.Add(c);
            }
            foreach (string c in str2)
            {
                if (!allchars.Contains(c))
                    allchars.Add(c);
            }
            double[] d1 = new double[allchars.Count];
            double[] d2 = new double[allchars.Count];
            for (int i = 0; i < allchars.Count; i++)
            {
                d1[i] = str1.Count(c => c == allchars[i]);
                d2[i] = str2.Count(c => c == allchars[i]);
            }
            Math2Vector m = new Math2Vector(d1, d2);
            ret = m.cosSim;
            return ret;
        }
    }
}
---------------------------
LLM prompt:
The personal names below need to be normalized. If several names refer to the same person, group them in the following format:
A|A1|A2
C|C1|C2
B
Here A1 and A2 refer to the same person as A, C1 and C2 refer to the same person as C, and B is just B, with no other spellings.
Do not change capitalization. Output the groups that have multiple spellings first, then the names with a single spelling.
The list of names is as follows:
Shi, Ji-Long Li, Tao Yang, Tengzhou Pu, Jialing Zhang, Jun Wang, Wenguang Liu, Haoxue Huang, Min Cui, Guihua Luo, M. Ronnier Melgosa, Manuel Xu, Wencai Li, Dongli Fu, Yabo Liu, Zunzhong Wang, Yajun Yu, Xuemei Shang, Wei Wei, Hua Zhang, Xinlin Zhao, Junyan Luo, Shiyong Li, Lixiang Yang, Chunyu Hui, Sili Yu, Wenwen Kurths, Juergen Peng, Haipeng Yang, Yixian Ma, Lin Yang, Lizhen Li, Yonghe Xiong Yuqing Sang Lijun Chen Qiang Yang Lizhen Wang Zhengduo Liu Zhongwei Lei, Wenwen Chen, Qiang Liu Fuping Meng Xianjun Xiao Jiaqi Wang Yumei Shen Guoqiang Zhang Gaimei He Cunfu Wu Bin Wei, Jia Huang, Guohe Zhu, Lei Zhao, Shan An, Chunjiang Fan, Yurui Yue Shi-Juan Su Xiao Jiang Han-Jie Liu Shao-Xuan Hong You-Li Zhang Kai Huang Wan-Xia Xiong Zu-Jiang Zhao Ying Liu Cui-Ge Wei Yong-Ju Meng Tao Xu Yi-Zhuang Wu Jin-Guang Xiao, Kaida Liao, Ningfang Zardawi, Faraedon Van Noort, Richard Yang, Zhixiong Yates, Julian M.
Fei Fei Solodovnyk, Anastasiia Xiong, Yu-Qing Li, Xing-Cun Lei, Wen-Wen Zhao, Qiao Sang, Li-Jun Liu, Zhong-Wei Wang, Zheng-Duo Yang, Li-Zhen Wang Anling Yang Changchun Liu, Xiao-Bo Liu, Fu-Ping Meng, Xian-Jun Xiao, Jia-Qi Ding Jie Shi Yong-Zheng Yu Jiang Zhang Fubin Cai Huiping Lei Wenwen Qi Fengyang Li Xingcun Fei, Fei Zhang, Wenguan He, Zhiqun Wang, Yongsheng Pang, Hui Zhao, Shengmin Li Sen Liu Fu-Ping Meng Xian-Jun Wang Yu-Mei Shen Guo-Qiang Xia Jia-Qi Liu Zhong-Wei Wang Zheng-Duo Yang Li-Zhen Xin, Zhiqing Li, Luhai Deng, Pujun Yi, Fang Tang, Xiaojun Zhao, Wen Du, Peng Li, Sen Li, Bin Liu, Zhongwei Cai, Huiping Liu, Xuying Zhao, Yue-Qing Liang, Ying-Hua Zhao, Xi-Zhe Jia, Qian-Yi Li, Hong-Sheng Pang, Hua Yang, Size Wang, Xiaohua Chai, Chengwen Qi, Yuansheng Wang, An-Ling Yu, Zhao-Xian Mo, Lixin Liu, Dongzhi Li, Wei Wang, Lichang Zhou, Xueqin Dong, L. F. Fan, W. L. Wang, S. Ji, Y. F. Liu, Z. W. Chen, Q. Dong Li-Fang Yang Yu-Jie Liu Wei-Yuan Yue Han Wang Shuai Guang, Li Zhang, Chunxiu Wu, Hao Cheng, Shudan Zhang, Rui Zhang, A. Zhang, Mingxia Li Bin Li, Juan Wang, Zhenduo Wu Wei-Xia Zhan Yong Zhao Tong-Jun Han Ying-Rong Chen Ya-Fei Kong Qing-Feng Yang Chang-Chun Liang Chun-Jun Zou Hui He Zhi-Qun Zhang Chun-Xiu Li Dan Wang Yong-Sheng Cao Shaozhong Li Yang Tu Xuyan Mu, Linping Zou, Ye Kang Ling-Hua Zhou Hui Zhang Ao Wang Du-Jin Li Xiao-Wei Zou Jing Xu Duan-Fu Tao, Minli Zhang, Minghua Liang, Chunjun Li Ji-Cheng Lin Li Li Xi-Meng Li Guang Lei Ming-Kai Wu, Z. F. Wu, X. M. Zhuge, L. J. Hong, B. Yang, X. M. Chen, X. M. Li, B. Yu, T. He, J. J. 
Yang Li Wang Zhen-Duo Zhang Shou-Ye Xie, Yun-Feng Yang, Yi-Min Wang, Chang-Sui Fang, Xiao-Yang He, Qiu-Ju Jiao, Zhi-Yong Huo Chunqing Zhang Yuefei Meng Yuedong Wang An-Ling Li Rui-Zhong Chen Hui-Guo Zhang Wenguan He Zhiqun Hui Guanbao Mu Linping Wang Yongsheng Zhao Shengmin Jing Xiping Zhang, Junfeng Zhang, Yuefei Liu, Fuping Hu, Wenjuan Dang, Zhi-Min Zhou, Tao Yao, Sheng-Hong Yuan, Jin-Kai Zha, Jun-Wei Song, Hong-Tao Li, Jian-Ying Yang, Wan-Tai Bai, Jinbo Yue, Lei Zhou, Meili Weng, Jing Chen, Siguang Jin, Shuo Hu Wen-Juan Xie Fen-Yan Weng Jing Zhang Jun-Feng Bian Xin-Chao Ma, Lan-Jie Xie, Dan Duan, Xue Wang Anxuan Cao Yuezu Zhang, Chunmei Yan, Huan Wu Jun Zeng Fengcai Chen Bingqiang Song Jianfeng Yao Yingue Xie Dagang Gao Bo Yuan Zhejun Li, Rui-Zhong Li, Jin-Yao Chen, Hui-Guo Yang, Chang-Chun Gao Jie Bao Dezhou Cheng Xi Zhang Yuling Luo Shiyong Wang Ning Xu Wencai Lv Yong Wei Xian-Fu Wang Na Huang Bei-Qing Sun Cheng-Bo Liu, Jiang-Fan Yang, Hai-Jian Wang, Wei Li, Zhongxiao Bian, Xinchao Sang, Lijun Tang, Wenjie Xie, Fenyan Huo, Chunqing Yu Zhao-Xian Jiao Zhi-Yong Xie, Feiyan Chun-Mei, Zhang Xin-Chao, Bian Qiang, Chen Ya-Bo, Fu Yue-Fei, Zhang Bao De-Zhou Feng, Yuguang Yang, Haijian Sun Yun-Jin Fu Ya-Bo Zhang Chun-Mei Sang Li-Jun Zhang Yue-Fei Li Ruizhong Chen Huiguo Tian, Yimin Qin, Mengzhao Zhang, Yongming Ma, Tao Chen Meng
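---------------------------
Sketch of the cosine step and the fast mode:
The source code calls Math2Vector and its cosSim property without showing them, and the principle mentions a fingerprint without defining it. The block below is a minimal sketch of what those two pieces might look like, not the author's implementation. The names SimilaritySketch, CosSim, Fingerprint and GroupByFingerprint are hypothetical, and the fingerprint rule (lowercase, drop hyphens, split on non-letters, sort the tokens) is an assumption chosen so that spellings such as "Xiong, Yu-Qing" and "Xiong Yuqing" land in the same bucket.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper class; not part of the SciM.NLP source shown above.
public static class SimilaritySketch
{
    // Cosine of the angle between two equal-length vectors:
    // dot(a, b) / (|a| * |b|). Returns 0 when either vector is all zeros.
    public static double CosSim(double[] a, double[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0)
            return 0;
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Assumed fingerprint rule: lowercase, drop hyphens, split on anything
    // that is not a letter, sort the tokens, and join them with '|'.
    public static string Fingerprint(string name)
    {
        string lowered = name.ToLowerInvariant().Replace("-", "");
        var tokens = new List<string>();
        var current = new StringBuilder();
        foreach (char c in lowered)
        {
            if (char.IsLetter(c))
            {
                current.Append(c);
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
            tokens.Add(current.ToString());
        tokens.Sort(StringComparer.Ordinal);
        return string.Join("|", tokens);
    }

    // Fast mode: bucket names by fingerprint so that similarity only needs
    // to be computed between names that share a bucket.
    public static Dictionary<string, List<string>> GroupByFingerprint(IEnumerable<string> names)
    {
        var buckets = new Dictionary<string, List<string>>();
        foreach (string name in names)
        {
            string key = Fingerprint(name);
            if (!buckets.TryGetValue(key, out List<string> bucket))
            {
                bucket = new List<string>();
                buckets[key] = bucket;
            }
            if (!bucket.Contains(name))
                bucket.Add(name);
        }
        return buckets;
    }
}
```

Under this rule, "Chun-Mei, Zhang" and "Zhang Chun-Mei" share a fingerprint, while an initials-only form such as "Liu, Z. W." does not match "Liu Zhongwei". Cases like that are the ones left to the similarity score or to the LLM prompt.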
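---------------------------
Sketch of reading the model's answer:
The prompt asks for one group per line in the form A|A1|A2, so the reply can be turned back into a lookup table. The block below is a minimal sketch under that assumption. NameGroupParser and Parse are hypothetical names, and treating the first spelling on each line as the canonical one is a design choice made here, not something the prompt states.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class; not part of the SciM.NLP source shown above.
public static class NameGroupParser
{
    // Parses lines such as "A|A1|A2" into a map from each spelling to the
    // first spelling on its line. Matching is case-sensitive because the
    // prompt tells the model not to change capitalization.
    public static Dictionary<string, string> Parse(string llmOutput)
    {
        var canonicalOf = new Dictionary<string, string>(StringComparer.Ordinal);
        string[] lines = llmOutput.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string rawLine in lines)
        {
            string[] parts = rawLine.Split('|');
            string canonical = parts[0].Trim();
            if (canonical.Length == 0)
                continue;
            foreach (string part in parts)
            {
                string variant = part.Trim();
                if (variant.Length == 0)
                    continue;
                // If the model repeats a spelling, the first mapping wins.
                if (!canonicalOf.ContainsKey(variant))
                    canonicalOf[variant] = canonical;
            }
        }
        return canonicalOf;
    }
}
```

For a reply consisting of the lines "Xiong, Yu-Qing|Xiong Yuqing" and "Fei Fei", the map sends both "Xiong, Yu-Qing" and "Xiong Yuqing" to "Xiong, Yu-Qing", and "Fei Fei" to itself. Any name from the original list that is missing from the map is a sign the model dropped or altered it, so it is worth checking the map against the input list.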