I work at a public health department that receives and stores a large volume of medical data every day. I've written a program that uses regular expressions to determine whether particular fields in the incoming data are valid. For example, dates of birth come in as YYYYMMDD, so they should match the regex ^[0-9]{8}$.
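To make the setup concrete, here is a minimal sketch of the kind of check I mean. Python is used purely for illustration, and the names `DOB_PATTERN` and `is_valid_dob` are hypothetical; only the pattern itself comes from the description above.

```python
import re

# The date-of-birth pattern from the description: exactly eight digits.
DOB_PATTERN = re.compile(r"^[0-9]{8}$")

def is_valid_dob(value):
    """Return True if the field is exactly eight digits (YYYYMMDD shape).

    fullmatch is used so that a trailing newline is rejected as well.
    This checks only the shape, not whether the date actually exists.
    """
    return DOB_PATTERN.fullmatch(value) is not None
```

For instance, `is_valid_dob("19800101")` is accepted, while `"1980-01-01"` and `"1980011"` are flagged as invalid.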
I want to analyze the invalid data to help identify problems in our system (we get far too much data to go through each bad record row by row). Can anyone suggest AI or machine learning techniques that can monitor the bad data and find patterns in what is wrong? One approach I've considered is writing a set of regular expressions for the ways the data could be invalid (e.g. too few or too many characters) and keeping track of how often each one matches. But rather than thinking up every failure mode myself, I'm curious whether the patterns can be learned from the bad data itself.
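The hand-written approach I have in mind would look roughly like the sketch below. The failure categories, their labels, and the function names `classify` and `tally` are all assumptions made up for illustration; this is the manual baseline I would like to replace or supplement, not a learned model.

```python
import re
from collections import Counter

# Hypothetical failure categories for the date-of-birth field.
# Order matters: the first pattern that matches the whole value wins.
FAILURE_PATTERNS = [
    ("empty", re.compile(r"\s*")),
    ("too_short", re.compile(r"[0-9]{1,7}")),
    ("too_long", re.compile(r"[0-9]{9,}")),
    ("has_separators", re.compile(r"[0-9]+([-/.][0-9]+)+")),
    ("has_letters", re.compile(r".*[A-Za-z].*")),
]

def classify(value):
    """Return the label of the first failure pattern matching the whole value."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.fullmatch(value):
            return label
    return "other"  # anything I did not think of in advance

def tally(invalid_values):
    """Count how often each failure category occurs among invalid values."""
    return Counter(classify(v) for v in invalid_values)
```

Running `tally` over a day's rejected values gives a count per category. The weakness is the "other" bucket: everything I failed to anticipate lands there, which is exactly the part I would like to have discovered automatically.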
Are there any known techniques that do this?