Ask HN: Is there a better way to do plagiarism detection in a self-hosted LMS?

1 point by pigon1002, 4 days ago
I'm building an open-source LMS and added plagiarism detection using OpenSearch's more_like_this query plus character n-grams for similarity scoring.

Basically, when a student submits an answer, I search for similar answers from other students on the same question. It works decently but feels a bit hacky, since I'm just reusing the search engine I already had.

Current setup:

```python
import re

# Restrict candidates to answers for the same question
search = cls.search().filter(
    "nested",
    path="answers",
    query={"term": {"answers.question_id": str(question_id)}},
)
# Let more_like_this pull in loosely similar answers as candidates
search = search.query(
    "nested",
    path="answers",
    query={
        "more_like_this": {
            "fields": ["answers.answer"],
            "like": text,
            "min_term_freq": 1,
            "minimum_should_match": "1%",
        }
    },
)

# get top 10, then re-rank in Python
response = search[:10].execute()

def normalize(t):
    # Remove all whitespace so spacing changes don't affect the score
    return re.sub(r"\s+", "", t.strip())

def char_ngrams(t, n=3):
    return set(t[i:i+n] for i in range(len(t)-n+1))

norm_text = normalize(text)
text_ngrams = char_ngrams(norm_text)

for hit in response.hits:
    norm_answer = normalize(hit.answer)
    answer_ngrams = char_ngrams(norm_answer)
    intersection = len(text_ngrams & answer_ngrams)
    union = len(text_ngrams | answer_ngrams)
    # Jaccard similarity as a percentage; guard against very short texts
    ratio = int((intersection / union) * 100) if union else 0
    if ratio >= 60:
        ...  # flag as similar
```

Constraints:
- Self-hosted only, no external APIs
- A few thousand students
- Want simple operations; already running OpenSearch anyway

Questions:
- Is this approach reasonable, or am I missing something obvious?
- What do other self-hosted systems use? I checked the Moodle docs, but their plagiarism plugins mostly call external services.
- Has anyone tried lightweight ML models for this that don't need a GPU?

The search engine approach works, but I'm curious whether there's a better way that fits our constraints.
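For anyone who wants to poke at the re-ranking step without an OpenSearch cluster, here is a minimal standalone sketch of it. The names `jaccard_percent` and `flag_similar`, the sample answers, and the `(student_id, answer)` candidate format are made up for illustration and are not part of the actual codebase; the trigram size and the 60 threshold follow the setup described in the post.

```python
import re


def normalize(text):
    """Strip all whitespace so spacing differences are ignored."""
    return re.sub(r"\s+", "", text.strip())


def char_ngrams(text, n=3):
    """Return the set of character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_percent(a, b, n=3):
    """Jaccard similarity of the n-gram sets of a and b, as an integer 0-100."""
    grams_a = char_ngrams(normalize(a), n)
    grams_b = char_ngrams(normalize(b), n)
    union = grams_a | grams_b
    if not union:  # both texts are shorter than n characters
        return 0
    return int(len(grams_a & grams_b) / len(union) * 100)


def flag_similar(text, candidates, threshold=60):
    """Return (student_id, score) pairs at or above threshold, best first.

    `candidates` is a hypothetical list of (student_id, answer) tuples,
    standing in for the top hits returned by the search engine.
    """
    scored = [(sid, jaccard_percent(text, answer)) for sid, answer in candidates]
    flagged = [(sid, score) for sid, score in scored if score >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    submission = "Photosynthesis converts light energy into chemical energy."
    candidates = [
        ("s1", "Photosynthesis  converts light energy\ninto chemical energy."),
        ("s2", "Mitochondria produce ATP through cellular respiration."),
    ]
    print(flag_similar(submission, candidates))
```

In this sample only `s1` clears the threshold, because it differs from the submission in whitespace alone. Returning 0 for texts shorter than three characters is one possible choice for the empty-union case, not necessarily the right one for very short answers.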