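In an age of information overload, the internet is full of sensitive content that can have serious negative effects. To meet this challenge, platforms have long looked for effective ways to replace or filter out sensitive words, and the DFA (Deterministic Finite Automaton) algorithm, implemented here in Java, plays a key role.
A DFA, or Deterministic Finite Automaton, is a finite-state machine commonly used for string-matching problems. In Java it can be used to search for and replace specific patterns in text, such as sensitive words or keywords. The algorithm walks the text character by character against a predefined sensitive-word list, which lets it detect and replace sensitive words quickly.
The DFA algorithm is based on state transitions. It first builds a state-transition graph in which each state represents how far the algorithm has progressed through a candidate word. Starting at the beginning of the input text, it looks up the next state from the current state and the current character, and acts according to the state it reaches. Once every character of the input has been processed, the result is either the text with sensitive words replaced or a flag indicating whether the text contains a sensitive word.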
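The state-transition idea described above can be sketched with an explicit transition table. This is only a minimal illustration, not the article's implementation: the class name StateTableSketch, the method longestMatchAt, and the short English stand-in words are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: class and method names are hypothetical.
final class StateTableSketch {
    // transitions.get(s) maps an input character to the next state; state 0 is the start state.
    private final List<Map<Character, Integer>> transitions = new ArrayList<>();
    // States reached right after a complete word has been read.
    private final Set<Integer> accepting = new HashSet<>();

    StateTableSketch(List<String> words) {
        transitions.add(new HashMap<>()); // create the start state
        for (String word : words) {
            int state = 0;
            for (char c : word.toCharArray()) {
                Integer next = transitions.get(state).get(c);
                if (next == null) {
                    // No transition yet for this character: allocate a new state.
                    next = transitions.size();
                    transitions.add(new HashMap<>());
                    transitions.get(state).put(c, next);
                }
                state = next;
            }
            accepting.add(state);
        }
    }

    /** Returns the length of the longest word that starts at index start, or 0 if there is none. */
    int longestMatchAt(String text, int start) {
        int state = 0;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            Integer next = transitions.get(state).get(text.charAt(i));
            if (next == null) {
                break; // no transition for this character: stop scanning
            }
            state = next;
            if (accepting.contains(state)) {
                longest = i - start + 1; // remember the last complete word seen
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        StateTableSketch dfa = new StateTableSketch(Arrays.asList("bad", "badge"));
        System.out.println(dfa.longestMatchAt("a badge", 2)); // length of "badge"
    }
}
```

Each step depends only on the current state and the current character, which is why a lookup costs the same no matter how many words are in the list.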
For example, to replace the sensitive words in a piece of text:
Text: Java新視界,為你開啟Java世界的大門。實用技巧,深度解析,讓Java更簡單,更強大!一起攀登Java技術高峰,實現編程夢想!
Sensitive-word list: ["新視界", "新視野", "技術", "技術高峰", "編程夢想", "實現夢想"]
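Build a forest from the sensitive words, one tree per distinct first character ("end" marks the last character of a word):
新
  視
    界 (end)
    野 (end)
技
  術 (end)
    高
      峰 (end)
編
  程
    夢
      想 (end)
實
  現
    夢
      想 (end)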
Build a JSON object from the forest:
{
  "技": {
    "isEnd": "0",
    "術": {
      "高": {
        "峰": { "isEnd": "1" },
        "isEnd": "0"
      },
      "isEnd": "1"
    }
  },
  "新": {
    "isEnd": "0",
    "視": {
      "界": { "isEnd": "1" },
      "isEnd": "0",
      "野": { "isEnd": "1" }
    }
  },
  "編": {
    "isEnd": "0",
    "程": {
      "isEnd": "0",
      "夢": {
        "isEnd": "0",
        "想": { "isEnd": "1" }
      }
    }
  },
  "實": {
    "現": {
      "isEnd": "0",
      "夢": {
        "isEnd": "0",
        "想": { "isEnd": "1" }
      }
    },
    "isEnd": "0"
  }
}
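The nested object above mixes child characters and the "isEnd" flag in the same map. As a possible alternative, the same forest could be modelled with a small typed node class. The sketch below is an assumption for illustration only: TrieNode, insert, and contains are hypothetical names, and short English stand-in words replace the article's word list.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative alternative to the nested-map layout; TrieNode is a hypothetical name.
final class TrieNode {
    // One entry per possible next character.
    final Map<Character, TrieNode> children = new HashMap<>();
    // Plays the role of "isEnd": "1" in the JSON object.
    boolean end;

    /** Inserts one word, creating nodes along its path as needed. */
    void insert(String word) {
        TrieNode node = this;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), k -> new TrieNode());
        }
        node.end = true;
    }

    /** Returns true only if exactly this word was inserted. */
    boolean contains(String word) {
        TrieNode node = this;
        for (int i = 0; i < word.length() && node != null; i++) {
            node = node.children.get(word.charAt(i));
        }
        return node != null && node.end;
    }

    public static void main(String[] args) {
        TrieNode root = new TrieNode();
        for (String w : new String[] {"bad", "badge", "ban"}) {
            root.insert(w);
        }
        System.out.println(root.contains("bad")); // a full word
        System.out.println(root.contains("ba"));  // only a prefix
    }
}
```

Words that share a prefix share nodes, so "bad" and "badge" differ only in where the end flag is set, just as 技術 and 技術高峰 do in the JSON object.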
import java.util.*;

/**
 * Sensitive-word utility: DFA algorithm implementation
 * @author Java新視界
 * @modifier Java新視界
 * @date 2023/10/25 16:58
 */
public class SensitiveWordUtil {

    /**
     * Matching rules
     */
    // Minimum match: stop at the shortest word. E.g. word list ["技術", "技術高峰"],
    // text "Java技術高峰", result: Java[技術]高峰
    public static final int MIN_MATCH_TYPE = 1;
    // Maximum match: keep going to the longest word. E.g. word list ["技術", "技術高峰"],
    // text "Java技術高峰", result: Java[技術高峰]
    public static final int MAX_MATCH_TYPE = 2;

    /**
     * Builds the nested-map forest. Keys are Character (a transition) or the
     * String "isEnd" (the end-of-word flag), hence Map<Object, Object>.
     */
    @SuppressWarnings("unchecked")
    private static Map<Object, Object> initSensitiveWordMap(Set<String> sensitiveWordSet) {
        // Pre-size the container to reduce rehashing
        Map<Object, Object> map = new HashMap<>(Math.max((int) (sensitiveWordSet.size() / .75f) + 1, 16));
        for (String keyWord : sensitiveWordSet) {
            Map<Object, Object> nowMap = map;
            for (int i = 0; i < keyWord.length(); i++) {
                char keyChar = keyWord.charAt(i);
                Object wordMap = nowMap.get(keyChar);
                if (wordMap != null) {
                    // The key already exists: descend into it
                    nowMap = (Map<Object, Object>) wordMap;
                } else {
                    // Otherwise create a new node with isEnd set to "0"
                    Map<Object, Object> newWordMap = new HashMap<>(3);
                    newWordMap.put("isEnd", "0");
                    nowMap.put(keyChar, newWordMap);
                    nowMap = newWordMap;
                }
                if (i == keyWord.length() - 1) {
                    // Last character of the word
                    nowMap.put("isEnd", "1");
                }
            }
        }
        return map;
    }

    /** Returns the set of sensitive words found in txt. */
    public static Set<String> getSensitiveWord(Set<String> sensitiveWordSet, String txt, int matchType) {
        Set<String> sensitiveWordList = new HashSet<>();
        Map<Object, Object> map = initSensitiveWordMap(sensitiveWordSet);
        for (int i = 0; i < txt.length(); i++) {
            // Check whether a sensitive word starts at position i
            int length = checkSensitiveWord(map, txt, i, matchType);
            if (length > 0) {
                sensitiveWordList.add(txt.substring(i, i + length));
                i = i + length - 1; // minus 1 because the for loop increments i
            }
        }
        return sensitiveWordList;
    }

    /** Replaces every character of each sensitive word with replaceChar. */
    public static String replaceSensitiveWord(Set<String> sensitiveWordSet, String txt, char replaceChar, int matchType) {
        String resultTxt = txt;
        // Collect all sensitive words first
        for (String word : getSensitiveWord(sensitiveWordSet, txt, matchType)) {
            String replaceString = getReplaceChars(replaceChar, word.length());
            // replace() is literal; replaceAll() would interpret the word as a regex
            resultTxt = resultTxt.replace(word, replaceString);
        }
        return resultTxt;
    }

    /** Replaces each sensitive word with replaceStr. */
    public static String replaceSensitiveWord(Set<String> sensitiveWordSet, String txt, String replaceStr, int matchType) {
        String resultTxt = txt;
        // Collect all sensitive words first
        for (String word : getSensitiveWord(sensitiveWordSet, txt, matchType)) {
            resultTxt = resultTxt.replace(word, replaceStr);
        }
        return resultTxt;
    }

    /** Returns replaceChar repeated length times. */
    private static String getReplaceChars(char replaceChar, int length) {
        char[] chars = new char[length];
        Arrays.fill(chars, replaceChar);
        return new String(chars);
    }

    /**
     * Returns the length of the sensitive word starting at beginIndex, or 0 if there is none.
     */
    @SuppressWarnings("unchecked")
    private static int checkSensitiveWord(Map<Object, Object> nowMap, String txt, int beginIndex, int matchType) {
        int walked = 0;      // characters consumed so far
        int matchLength = 0; // length of the last complete word seen
        for (int i = beginIndex; i < txt.length(); i++) {
            // Follow the transition for the current character
            nowMap = (Map<Object, Object>) nowMap.get(txt.charAt(i));
            if (nowMap == null) {
                break; // no transition: stop
            }
            walked++;
            if ("1".equals(nowMap.get("isEnd"))) {
                // Record the match here so that characters walked past the last
                // complete word are not counted
                matchLength = walked;
                // Minimum match returns at once; maximum match keeps looking
                if (MIN_MATCH_TYPE == matchType) {
                    break;
                }
            }
        }
        // A word must be at least two characters long
        return matchLength >= 2 ? matchLength : 0;
    }
}
import java.util.*;

// Demo class; the name SensitiveWordDemo is illustrative
public class SensitiveWordDemo {
    public static void main(String[] args) {
        Set<String> sensitiveWordSet = new HashSet<>(Arrays.asList("新視界", "新視野", "技術", "技術高峰", "編程夢想", "實現夢想"));
        String string = "Java新視界,為你開啟Java世界的大門。實用技巧,深度解析,讓Java更簡單,更強大!一起攀登Java技術高峰,實現編程夢想!";

        // Find the sensitive words in the text
        Set<String> set = SensitiveWordUtil.getSensitiveWord(sensitiveWordSet, string, SensitiveWordUtil.MAX_MATCH_TYPE);
        System.out.println("Number of sensitive words in the text: " + set.size() + ". Words: " + set);
        set = SensitiveWordUtil.getSensitiveWord(sensitiveWordSet, string, SensitiveWordUtil.MIN_MATCH_TYPE);
        System.out.println("Number of sensitive words in the text: " + set.size() + ". Words: " + set);

        // Replace the sensitive words in the text
        String filterStr = SensitiveWordUtil.replaceSensitiveWord(sensitiveWordSet, string, '*', SensitiveWordUtil.MAX_MATCH_TYPE);
        System.out.println(filterStr);
        filterStr = SensitiveWordUtil.replaceSensitiveWord(sensitiveWordSet, string, '*', SensitiveWordUtil.MIN_MATCH_TYPE);
        System.out.println(filterStr);
    }
}
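Output:
With MAX_MATCH_TYPE, three sensitive words are found: 新視界, 技術高峰, and 編程夢想. With MIN_MATCH_TYPE, three are found as well: 新視界, 技術, and 編程夢想. The words are collected in a HashSet, so the order in which they are printed is not guaranteed. The two filtered strings are:
Java***,為你開啟Java世界的大門。實用技巧,深度解析,讓Java更簡單,更強大!一起攀登Java****,實現****!
Java***,為你開啟Java世界的大門。實用技巧,深度解析,讓Java更簡單,更強大!一起攀登Java**高峰,實現****!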
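The utility class first collects the matched words and then replaces each one throughout the text. A possible variation is to mask while scanning, so the text is traversed only once. The sketch below is an assumption for illustration rather than the article's method: SinglePassMasker, mask, and longestMatchAt are hypothetical names, it always uses the longest match, and it uses short English stand-in words.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; SinglePassMasker and its members are hypothetical names.
final class SinglePassMasker {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean end; // true if a word ends at this node
    }

    private final Node root = new Node();

    SinglePassMasker(List<String> words) {
        for (String word : words) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.end = true;
        }
    }

    /** Masks the longest word found at each position, scanning left to right once. */
    String mask(String text, char maskChar) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int len = longestMatchAt(text, i);
            if (len == 0) {
                out.append(text.charAt(i)); // not the start of a word: copy as is
                i++;
            } else {
                for (int k = 0; k < len; k++) {
                    out.append(maskChar);
                }
                i += len; // skip past the masked word
            }
        }
        return out.toString();
    }

    /** Returns the length of the longest word starting at index start, or 0 if none. */
    private int longestMatchAt(String text, int start) {
        Node node = root;
        int longest = 0;
        for (int i = start; i < text.length(); i++) {
            node = node.next.get(text.charAt(i));
            if (node == null) {
                break;
            }
            if (node.end) {
                longest = i - start + 1;
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        SinglePassMasker masker = new SinglePassMasker(Arrays.asList("bad", "badge"));
        System.out.println(masker.mask("a badge, a bad bat", '*')); // prints "a *****, a *** bat"
    }
}
```

Because only the positions actually matched during the scan are masked, this variant can differ from a collect-then-replace approach, which replaces every occurrence of a collected word wherever it appears in the text.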
Advantages:
Challenges:
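Overall, the DFA algorithm is widely used for sensitive-word replacement. It gives online communities, financial institutions, government bodies, and other organizations a powerful tool for filtering and replacing sensitive information, which helps maintain social order, protect user privacy, and keep the internet safe and harmonious. As technology continues to develop, the DFA algorithm will keep playing an important role in meeting changing requirements and challenges.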