python - What is the right way to analyse a text for adult content identification?
I want to filter out adult content tweets (or any text, for that matter).
For spam detection, we have datasets that label whether a particular text is spam or ham.
For adult content, I found this dataset that I want to use (extract below):
arrbad = [ 'acrotomophilia', 'anal', 'anilingus', 'anus', . . etc. . 'zoophilia']
Question
How can I use that dataset to filter text instances?
I would approach this as a text classification problem, because blacklists of words typically do not work well for classifying full texts. The main reason blacklists don't work is that you will get a lot of false positives (one example: your list contains the word 'sexy', which alone isn't enough to flag a document as being for adults). To do so you need a training set with documents tagged as "adult content" and others tagged as "safe for work". So here is what I would do:
- Check whether an existing labelled dataset can be used. You need several thousand documents of each class.
- If you don't find any, create one. For instance, you can write a scraper and download Reddit content. Read, for instance, "Text Classification of NSFW Reddit Posts".
- Build a text classifier with NLTK. If you don't know how, read: "Learning to Classify Text".
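
As a rough illustration of the last step, here is a minimal sketch of a bag-of-words Naive Bayes classifier built with NLTK. The six training sentences, the `features` and `classify` helper names, and the regex tokenizer are all assumptions made up for this example. A real model would need the several thousand labelled documents per class mentioned above, plus a held-out set to measure accuracy.

```python
import re

import nltk

# Toy, hand-written examples for illustration only (hypothetical data).
# A real training set needs several thousand labelled documents per class.
TRAIN = [
    ("explicit xxx porn video click here", "adult"),
    ("hot nude pics uncensored porn", "adult"),
    ("xxx adult webcam nude show", "adult"),
    ("meeting notes for the quarterly budget review", "safe"),
    ("sexy new design for our project dashboard", "safe"),
    ("recipe for a quick weeknight pasta dinner", "safe"),
]


def features(text):
    """Bag-of-words features: one boolean entry per lowercase word."""
    words = re.findall(r"[a-z']+", text.lower())
    return {f"contains({w})": True for w in words}


# NLTK classifiers train on a list of (feature dict, label) pairs.
train_set = [(features(text), label) for text, label in TRAIN]
classifier = nltk.NaiveBayesClassifier.train(train_set)


def classify(text):
    """Return the predicted label ('adult' or 'safe') for a raw string."""
    return classifier.classify(features(text))


if __name__ == "__main__":
    print(classify("uncensored nude porn video"))
    print(classify("budget review meeting for the project"))
```

Note that 'sexy' appears in a "safe" training example here: the classifier weighs every word in the document together, which is what lets it avoid the false positives a plain blacklist would produce.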