Personally Identifiable Information Hides in Dark Data

To my mind, HIPAA has the most sophisticated view of PII of all the US laws on the books. Their working definition encompasses vanilla identifiers: social security and credit card...
Michael Buckbee
3 min read
Last updated May 9, 2022

To my mind, HIPAA has the most sophisticated view of PII of all the US laws on the books. Their working definition encompasses vanilla identifiers: social security and credit card numbers, and all the other usual suspects. With the additional words “reasonable basis to believe that the information can be used to identify the individual”, HIPAA’s definition takes in digital handles such as emails, IP addresses and even facial imagery. But there’s a little more to HIPAA’s PII definition, and it applies specifically to free form text (commonly found in word processing documents, spreadsheets, presentations, etc.)

The complete list of HIPAA’s PIIs is enumerated in the law’s Safe Harbor guidelines. In plain-speak, these guidelines tell health IT administrators what information is considered private, requiring special authorization to view or process. It includes the aforementioned identifiers, as well as medical record numbers, health insurance IDs, and some others. By the way, we’ve conveniently put this PII list in our omnibus data protection compliance whitepaper.

Get the Free Pen Testing Active Directory Environments EBook

“This really opened my eyes to AD security in a way defensive work never did.”

An unstated assumption made by many is that PII only lives in structured formats—in other words, fields in a database. Readers of this blog of course know that PIIs are often likely to be harvested from the massive amounts of human generated dark data found on corporate files servers.

The HIPAA regulators have understood this as well. In clarifying the rules for removing PII —“de-identifying”—data for publication and general usage, they explicitly cover the possibility that PII can also reside in free-form text. I’ve excerpted the key paragraph from their de-identification best practices below :

PHI [protected health information] may exist in different types of data in a multitude of forms and formats in a covered entity.  This data may reside in highly structured database tables, such as billing records. Yet, it may also be stored in a wide range of documents with less structure and written in natural language, such as discharge summaries, progress notes, and laboratory test interpretations … The de-identification standard makes no distinction between data entered into standardized fields and information entered as free text (i.e., structured and unstructured text)— an identifier listed in the Safe Harbor standard must be removed regardless of its location.

Got that? PHI, which is essentially PII along with other sensitive medical information, embedded in spreadsheets, docs, and presentations is just as worthy of HIPAA privacy protections as fields in databases.

So if we follow these ideas—PIIs can be anything that reasonably links to an individual, and this data can exist in text—to their logical conclusion, then we need to consider a new possibility. Suppose this sentence from a doctor’s notes were uploaded to a file server:

The patient, a technical content specialist at Varonis, a software company, has been complaining about tennis elbow.

The natural question to ask is whether “technical content specialist at Varonis” is a PII?

It’s not a PII in the sense of a uniquely coded key such as social security number or health insurance ID that links back to a person. But in another sense, it acts very much like PII. Don’t believe me? Try typing that phrase into Google and see what comes up.

We’re really talking more about the meaning of the text—or as experts would say, the semantic value—rather than actual letters, numbers, and other syntax. But HIPAA’s Safe Harbor rule even takes this into account: it specifically notes that the “knowledge” in free text can also be used to point back to a person.

As a practical matter, the HIPAA rules mean that any reference to a patient’s job title and company is a violation of the law’s privacy protections.

This leads to a broader discussion on what’s called the “semantic web”. In brief, Google and a few others are already doing leading edge work on extracting meaning and knowledge from web content. You can see for yourself how well Google does this by entering the keywords “height of the empire state building” in a search. You’ll get back an actual answer, 1454’, in addition to all the docs with that exact phrase.

The larger point is that along with stealing PIIs, hackers and cyber thieves are also getting better at mining and interpreting human generated text for personal details, and then building more convincing fake identities to be used in social attacks, such as phishing and pretexting.

Bottom line: these bits and pieces of personal information that are scattered across file servers in clear-text documents can be used to identify an individual with very high likelihood.

That’s important to keep in mind when someone in your company asks, “do we know what’s in our files and the risks involved if our servers are breached?”

What should I do now?

Below are three ways you can continue your journey to reduce data risk at your company:

1

Schedule a demo with us to see Varonis in action. We'll personalize the session to your org's data security needs and answer any questions.

2

See a sample of our Data Risk Assessment and learn the risks that could be lingering in your environment. Varonis' DRA is completely free and offers a clear path to automated remediation.

3

Follow us on LinkedIn, YouTube, and X (Twitter) for bite-sized insights on all things data security, including DSPM, threat detection, AI security, and more.

Try Varonis free.

Get a detailed data risk report based on your company’s data.
Deploys in minutes.

Keep reading

Varonis tackles hundreds of use cases, making it the ultimate platform to stop data breaches and ensure compliance.

speed-data:-when-lives-depend-on-strong-security-with-john-mason
Speed Data: When Lives Depend on Strong Security With John Mason
Tempo Technology Services' John Mason shares why strong cybersecurity in healthcare is so critical and how organizations can combat malicious actors.
strengthening-resilience:-data-security-vs-data-resilience-tools
Strengthening Resilience: Data Security vs Data Resilience Tools
Learn the difference between backup tools and true DSPs and what to look for when you’re choosing a DSP.
speed-data:-exploring-the-virtual-ciso-role-with-robert-blythe
Speed Data: Exploring the Virtual CISO Role With Robert Blythe
Bob Blythe talks about the virtual CISO market, gen AI, and where the future of tech is headed.
speed-data:-developing-‘security-as-a-service’-with-alexis-bonnell
Speed Data: Developing ‘Security as a Service’ With Alexis Bonnell
Alexis Bonnell, CIO for the U.S. Air Force Research Laboratory, shares her thoughts on the relationship between knowledge and AI.