in AI and Deep Learning by (48.7k points)

I wanted some input on an interesting problem I've been assigned. The task is to analyze hundreds, and eventually thousands, of privacy policies and identify their core characteristics. For example: Do they collect the user's location? Do they share or sell data to third parties?

I've talked to a few people, read a lot about privacy policies, and thought about this myself. Here is my current plan of attack:

First, read a lot of privacy policies and find the major "cues" or indicators that a certain characteristic is met. For example, if hundreds of privacy policies have the same line, "We will take your location.", that line could be a cue with 100% confidence that the policy includes taking the user's location. Other cues would give much smaller degrees of confidence about a certain characteristic. For example, the presence of the word "location" might increase the likelihood that the user's location is stored by 25%.

The idea would be to keep developing these cues and their associated confidence levels to the point where I could categorize all privacy policies with a high degree of confidence. An analogy could be made to email spam filters that use Bayesian filtering to identify which mail is likely commercial and unsolicited.

I wanted to ask whether you guys think this is a good approach to this problem. How exactly would you approach a problem like this? Furthermore, are there any specific tools or frameworks you'd recommend using? Any input is welcome. This is my first time doing a project that touches on artificial intelligence, specifically machine learning and NLP.

1 Answer

by (105k points)

What you are describing is a text classification problem. Since each document can have several labels at once (for example, "collects location" and "shares with third parties"), it is more precisely a multilabel classification problem in machine learning.

The usual approach is supervised learning: you manually label a set of documents with the classes/labels you want to predict, then train a classifier on features such as word or n-gram occurrences or counts, possibly weighted by tf-idf.
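As a rough illustration of that workflow, here is a minimal standard-library sketch. It makes several assumptions that are not part of the question: the six training sentences and the label names "location" and "third_party" are invented, the features are plain unigram/bigram presence rather than tf-idf weights, and the classifier is a small Naive Bayes model trained once per label (the "binary relevance" way of doing multilabel classification), which ties in with the spam-filter analogy. On a real project you would more likely use an established library such as scikit-learn and a much larger labeled set.

```python
import math
import re
from collections import Counter

def featurize(text):
    """Return the set of lowercase unigrams and bigrams present in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return set(words) | set(bigrams)

class BinaryNaiveBayes:
    """Naive Bayes over feature presence/absence for a single yes/no label."""

    def fit(self, feature_sets, targets):
        self.vocab = set().union(*feature_sets)
        self.doc_counts = {True: 0, False: 0}
        self.feat_counts = {True: Counter(), False: Counter()}
        for feats, y in zip(feature_sets, targets):
            self.doc_counts[y] += 1
            self.feat_counts[y].update(feats)
        return self

    def probability(self, feats):
        """Estimated probability that the label applies to this feature set."""
        total = sum(self.doc_counts.values())
        log_score = {}
        for y in (True, False):
            n = self.doc_counts[y]
            score = math.log((n + 1) / (total + 2))  # smoothed class prior
            for f in self.vocab:
                p = (self.feat_counts[y][f] + 1) / (n + 2)  # Laplace smoothing
                score += math.log(p if f in feats else 1 - p)
            log_score[y] = score
        diff = log_score[False] - log_score[True]
        return 1 / (1 + math.exp(min(diff, 700)))  # cap avoids float overflow

def train(examples, label_names):
    """Train one independent binary classifier per label (binary relevance)."""
    feature_sets = [featurize(text) for text, _ in examples]
    return {
        label: BinaryNaiveBayes().fit(
            feature_sets, [label in labels for _, labels in examples]
        )
        for label in label_names
    }

def predict(models, text, threshold=0.5):
    """Return every label whose estimated probability reaches the threshold."""
    feats = featurize(text)
    return {name for name, m in models.items() if m.probability(feats) >= threshold}

# Hypothetical hand-labeled snippets; real training data would be full policies.
TRAIN = [
    ("We collect your location to improve our services.", {"location"}),
    ("We share your data with third parties for advertising.", {"third_party"}),
    ("We collect your precise location and share it with third parties.",
     {"location", "third_party"}),
    ("We only store your email address and never sell it.", set()),
    ("Your GPS location may be recorded when you use the app.", {"location"}),
    ("We may sell personal information to third parties.", {"third_party"}),
]

MODELS = train(TRAIN, ["location", "third_party"])
print(predict(MODELS, "We record your location while the app is open."))
print(predict(MODELS, "We sell your information to third parties."))
```

The per-feature probabilities learned here play the role of your hand-written "cues": a phrase seen mostly in location-collecting policies pushes the score for that label up, and the weights come from the labeled data instead of being guessed by hand.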

For a worked example on this exact kind of data, you can refer to "A Machine Learning Solution to Assess Privacy Policy Completeness".
