Korcen

korcen-ml is a project that uses deep learning to improve detection accuracy and to overcome the main weakness of the existing keyword-based korcen, which is easy to bypass.
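To illustrate why plain keyword matching is easy to bypass, here is a minimal sketch. It is a hypothetical filter written for this example only, not korcen's actual matching logic, and `BLOCKLIST` and `keyword_filter` are made-up names.

```python
# Hypothetical illustration only: this is NOT korcen's actual matching logic.
BLOCKLIST = {"badword"}  # placeholder term, not taken from korcen


def keyword_filter(text: str) -> bool:
    """Return True if any blocklisted keyword appears verbatim in the text."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKLIST)


print(keyword_filter("this is a badword"))     # True: exact substring match
print(keyword_filter("this is a b.a.d.word"))  # False: the inserted dots break the match
```

A learned model scores the whole sentence instead of looking for exact strings, which is the motivation for this project.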

Only the KOGPT2 model is publicly available, and the model file can be found here.

If you would like to download more model files or the training data, please get in touch.

Model	Number of training sentences
VDCNN(23.4.30)	200,000
VDCNN_KOGPT2(23.06.15)	2,000,000
VDCNN_LLAMA2(23.09.30)	5,000,000
VDCNN_LLAMA2_V2(24.06.04)	13,000,000
LSTM_EXAONE3(24.08.16)	13,000,000

Original keyword-based library: py version, ts version

Support Discord server

Model validation

Each dataset uses a different standard for what counts as profanity, so please keep in mind that the scores below carry some error.

Model	korean-malicious-comments-dataset	Curse-detection-data	kmhas_korean_hate_speech	Korean Extremist Website Womad Hate Speech Data	LGBT-targeted HateSpeech Comments Dataset (Korean)
korcen	0.7121	0.8415	0.6800	0.6305	0.4479
TF VDCNN(23.4.30)	0.6900	0.4885		0.4885
TF VDCNN_KOGPT2(23.06.15)	0.7545	0.7824		0.7055	0.6875
TF VDCNN_LLAMA2(23.09.30)	0.7762	0.8104	0.7296
TF VDCNN_LLAMA2_V2(24.06.04)	0.8322	0.8420	0.7837	0.7120	0.7477
TF LSTM_EXAONE3(24.08.16)	0.8395	0.8432	0.8851	0.7130	0.6919
TF BIDIRECTIONAL_LSTM_EXAONE3(in testing)
TF TRANSFORMER_EXAONE3(in testing)
JAX LSTM_EXAONE3(in development)
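The table does not say which metric is reported or how it was computed. As a rough sketch only, assuming the scores are plain accuracy on binary labels with a 0.5 decision threshold, an evaluation loop might look like the following. `accuracy`, `toy_predict`, and the toy data are hypothetical names and values made up for this example; in practice `predict` would be a function that returns the model's score for one sentence.

```python
from typing import Callable, Sequence


def accuracy(predict: Callable[[str], float],
             texts: Sequence[str],
             labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of texts whose thresholded score matches the 0/1 label."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    if not texts:
        raise ValueError("empty dataset")
    correct = sum(
        int(predict(text) >= threshold) == label
        for text, label in zip(texts, labels)
    )
    return correct / len(texts)


# Toy stand-in for a real model: scores 1.0 when a placeholder token is present.
def toy_predict(text: str) -> float:
    return 1.0 if "badword" in text else 0.0


# Made-up data, not taken from any of the datasets in the table.
toy_texts = ["hello", "badword here", "nice day", "b.a.d.word"]
toy_labels = [0, 1, 0, 1]
print(accuracy(toy_predict, toy_texts, toy_labels))  # 0.75
```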

Example

# Python 3.10, TensorFlow 2.10
import pickle

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 1000  # sequence length passed to the tokenizer and to pad_sequences

model_path = "vdcnn_model.h5"
tokenizer_path = "tokenizer.pickle"

# Load the trained Keras model and the pickled tokenizer.
model = tf.keras.models.load_model(model_path)
with open(tokenizer_path, "rb") as f:
    tokenizer = pickle.load(f)


def preprocess_text(text):
    """Lowercase the input before tokenization."""
    return text.lower()


def predict_text(text):
    """Return the model's score for one sentence; 0.5 or higher is treated as abusive."""
    sentence = preprocess_text(text)
    # Tokenize, padding or truncating to exactly maxlen token ids.
    encoded_sentence = tokenizer.encode_plus(sentence,
                                             max_length=maxlen,
                                             padding="max_length",
                                             truncation=True)["input_ids"]
    # Wrap the ids in a batch of one: an array of shape (1, maxlen).
    sentence_seq = pad_sequences([encoded_sentence], maxlen=maxlen, truncating="post")
    return float(model.predict(sentence_seq)[0][0])


if __name__ == "__main__":
    # Interactive loop; press Ctrl+C to stop.
    while True:
        text = input("Enter the sentence you want to test: ")
        result = predict_text(text)
        if result >= 0.5:
            print("This sentence contains abusive language.")
        else:
            print("It's a normal sentence.")
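The 0.5 cutoff in the example above is a choice rather than a fixed property of the model. The sketch below shows how a different threshold changes which sentences get flagged. `flag_sentences` and `fake_scores` are hypothetical names, and the scores are made up so the snippet runs without the model; with the real model you would pass `predict_text` as `predict`.

```python
from typing import Callable, Iterable, List


def flag_sentences(sentences: Iterable[str],
                   predict: Callable[[str], float],
                   threshold: float = 0.5) -> List[str]:
    """Return the sentences whose score is at or above the threshold."""
    return [s for s in sentences if predict(s) >= threshold]


# Made-up scores standing in for real model outputs.
fake_scores = {"sentence a": 0.10, "sentence b": 0.55, "sentence c": 0.95}

print(flag_sentences(fake_scores, fake_scores.__getitem__, threshold=0.5))
# ['sentence b', 'sentence c']
print(flag_sentences(fake_scores, fake_scores.__getitem__, threshold=0.9))
# ['sentence c']
```

A higher threshold flags fewer sentences, which generally reduces false positives at the cost of missing some abusive text.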

Maker

Tanat

github:   Tanat05
discord:  Tanat05
email:    tanat@tanat.kr

Reference

@misc {l._junbum_2023,
    author       = { {L. Junbum} },
    title        = { llama-2-ko-70b },
    year         = 2023,
    url          = { https://huggingface.co/beomi/llama-2-ko-70b },
    doi          = { 10.57967/hf/1130 },
    publisher    = { Hugging Face }
}

@article{exaone-3.0-7.8B-instruct,
  title={EXAONE 3.0 7.8B Instruction Tuned Language Model},
  author={LG AI Research},
  journal={arXiv preprint arXiv:2408.03541},
  year={2024}
}

License

All korcen releases are published under the Apache-2.0 license. Please comply with the license terms when using the models and code.
