python - What is the right way to analyse a text for adult content identification?

I want to filter out adult-content tweets (or any other text).

For spam detection, there are datasets that let us check whether a particular text is spam or ham.

For adult content, I found a dataset that I want to use (extract below):

arrbad = [ 'acrotomophilia', 'anal', 'anilingus', 'anus', . . etc. . 'zoophilia']

Question

How can I use this dataset to filter text instances?

I would approach this as a text classification problem, because blacklists of words typically do not work well for classifying full texts. The main reason is that blacklists produce a lot of false positives (one example: your list contains the word 'sexy', which alone isn't enough to flag a document as adult content). Instead, you need a training set with some documents tagged as "adult content" and others tagged as "safe for work". Here is what I would do:

Check whether an existing labelled dataset can be used. You need several thousand documents of each class.
If you don't find any, create one. For instance, you can write a scraper and download Reddit content. Read, for instance, "Text Classification of NSFW Reddit Posts".
Build a text classifier with NLTK. If you don't know how, read "Learning to Classify Text".
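
As a rough illustration of the last step, here is a minimal sketch that contrasts a blacklist check with a Naive Bayes classifier from NLTK. Everything in it is an assumption made for the example: the six-sentence corpus is hand-written toy data (a real training set needs thousands of documents per class, as noted above), the three-word blacklist is a stand-in for the full list, and the names `tokenize`, `features` and `blacklist_flag` are hypothetical helpers, not part of NLTK.

```python
import re

import nltk

# Toy, hand-written corpus (an assumption for illustration only).
TRAIN = [
    ("explicit hardcore porn videos", "adult"),
    ("nude cam girls explicit porn", "adult"),
    ("hardcore xxx nude videos", "adult"),
    ("sexy new phone design looks great", "safe"),
    ("great weather for a walk today", "safe"),
    ("new python release looks great", "safe"),
]

# Tiny stand-in for the full word list from the question.
BLACKLIST = frozenset({"anal", "anus", "sexy"})


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def features(text):
    """Bag-of-words feature dict in the format NLTK classifiers expect."""
    return {f"contains({word})": True for word in tokenize(text)}


def blacklist_flag(text, blacklist=BLACKLIST):
    """Naive approach: flag the text if any token is on the blacklist."""
    return any(word in blacklist for word in tokenize(text))


# Train on (feature dict, label) pairs.
train_set = [(features(text), label) for text, label in TRAIN]
classifier = nltk.NaiveBayesClassifier.train(train_set)

sample = "that sexy new phone looks great"
print("blacklist flags it:", blacklist_flag(sample))
print("classifier says:", classifier.classify(features(sample)))
```

On this toy data the blacklist flags the harmless sentence because of the single word 'sexy', while the classifier weighs all the words together and labels it safe. With a realistic corpus you would also hold out a test set and measure accuracy before trusting the result.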
