Have you ever requested a screening or people search service allowing you to see public records on an individual? Probably yes, even if you did not realize, because it is essentially a background check. If you have ever applied for a job, taken out a bank loan, rented a house or a car, or performed any action that puts another person or group at risk, you have probably been vetted using a background check. And if you didn’t pass it, you may have been refused the service. How does this usually happen?
A traditional background check is usually based on past events that were documented. If an incident was never documented, it will not appear in the check at all. For instance, did you know that only 3% of the U.S. population have felony records?
There is another issue: when we operate solely on the data contained in a background check, we still may not understand the person’s personality in terms of their intentions, beliefs, and behaviors—psychological traits that may or may not translate into future actions. This can be a very important part of cultural fit and of assessing an individual’s character.
The ability to gain greater insight into a person’s nature is a core feature of our unique Socialprofiler service. Where else could you find alternative data about a person that is public and accessible to everyone? What information could show potential issues associated with a person’s temperament that might create conflict in the future?
Today, the products changing the landscape of the internet are machine learning (“ML”) services built on public data. At the dawn of the internet, website ranking led to the emergence of search engines; now they may be replaced by large language models (“LLMs”), neural networks that can answer user questions. What will happen over time to the search algorithms used within social networks? Will they also be replaced by LLMs?
Thanks to machine learning, we can assess other people’s social network profiles, interests, and activities from a new angle. Thus, it is our pleasure to introduce you to one of the latest advancements in machine-learned analysis tools contained within Socialprofiler.
Using the strength of ML algorithms and Big Data processing, Socialprofiler is an automation tool that allows you to search an individual’s social media profiles on various platforms and analyze that individual’s interests based on what they follow, like, and post themselves. It’s the development of this complex proprietary technology—an ML system for interest detection—that provides the additional insight and information requested by our users.
Data Acquisition
All of our services work by collecting publicly available data from the internet, i.e. by indexing it. Like Google and Bing, we acquire public information, but from social media platforms rather than the wider web. LLMs, which some predict will soon replace search algorithms, are also trained on public data collected by specialized robots. Thus, the use of public data is at the heart of technological progress, something we may not always think about.
“Indexing is the automated collection and structuring of information from websites using a program or service,” says Wikipedia. This means the robots that do the indexing have been with us since the dawn of the internet: every search engine has to collect a page’s content, follow its links to other websites, and gather the signals that determine where the page appears in the search results shown to people. Socialprofiler does the same thing, only in the last step we apply our machine learning algorithms to assess a person’s interests, categorize them, and present them in an easy-to-use graphic format.
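As a rough illustration of the link-collection part of indexing described above, here is a minimal Python sketch that extracts outgoing links from a page's HTML using only the standard library. The class name LinkCollector, the helper extract_links, and the sample HTML are hypothetical and chosen for illustration; this is not Socialprofiler's crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (illustrative only)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in the given HTML, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


sample_html = '<p><a href="/about">About</a> <a href="https://example.org/x">X</a></p>'
links = extract_links(sample_html, "https://example.com/")
```

A real indexer would also fetch pages, respect site rules, and store the extracted text; the sketch covers only the step of discovering which pages a given page links to.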
Attitudes towards indexing have changed along with the openness of the internet itself. Until about ten years ago, every reputable website or social network had a public API that allowed a huge amount of automation and data collection for analysis. At some point, social networks realized that their data was highly valuable and sought after, so they started restricting robots. In the USA, however, courts have been receptive to the collection of publicly available data: in hiQ Labs v. LinkedIn, the 9th Circuit Court of Appeals declined to let LinkedIn block the scraping of its users’ public profiles, holding that accessing publicly available data likely does not violate the Computer Fraud and Abuse Act.
Today, it’s more difficult than ever to work with sites, which have become more defensive. However, the lessons from LLMs show that online data collection is an important part of creating new life-changing products and of achieving the next big breakthroughs in the field in which Socialprofiler operates.
Working with collected data
The acquired data are essentially profiles and the referral links that lead from one Instagram account to another. More often than not, accounts with similar themes link to each other, and such links indicate that their content is closely related.
For example, one blogger can have up to 80 referral links to similar accounts. Based on a list of 1000 bloggers in the United States, similarity measures were used iteratively to identify accounts belonging to a common category and representing a common interest or subject.
The described processing of blogger-originated data produced a graph with 3,428,453 vertices (bloggers) and 96,967,974 edges (each representing a measure of the similarity of two bloggers). As an example, when a user visits the National Geographic account on Instagram, Instagram recommends similar accounts to follow. This makes National Geographic the first node in a graph of relationships, and each account recommended by Instagram becomes a further node.
The National Geographic node is associated with one or more edges, with each corresponding to a recommendation/node. Next, the process takes the first recommendation and determines what recommendations Instagram generates from it. This produces a new set of nodes and edges between recommendations. This process flow can be repeated for multiple iterations. Given data with a large set of nodes and edges, the process analyzes the network formed by these nodes to identify common communities or groupings. This information can then be used to identify the common interests within the nodes in that community.
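The iterative expansion described above can be sketched as a breadth-first traversal. The following Python example is a minimal sketch under stated assumptions: the account names, the SAMPLE_RECOMMENDATIONS table, and the build_graph function are hypothetical stand-ins, and a real pipeline would obtain recommendations from collected public data rather than a hard-coded dictionary.

```python
from collections import deque

# Hypothetical stand-in for the similar accounts recommended for each account.
SAMPLE_RECOMMENDATIONS = {
    "natgeo": ["natgeotravel", "natgeowild"],
    "natgeotravel": ["natgeo", "lonelyplanet"],
    "natgeowild": ["natgeo", "bbcearth"],
    "lonelyplanet": [],
    "bbcearth": ["natgeowild"],
}


def build_graph(seed, get_recommendations, max_depth):
    """Breadth-first expansion from a seed account, up to max_depth iterations."""
    nodes = {seed}
    edges = set()
    queue = deque([(seed, 0)])
    while queue:
        account, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand accounts found in the last iteration
        for rec in get_recommendations(account):
            # Store each undirected edge once, regardless of direction.
            edges.add(tuple(sorted((account, rec))))
            if rec not in nodes:
                nodes.add(rec)
                queue.append((rec, depth + 1))
    return nodes, edges


nodes, edges = build_graph(
    "natgeo", lambda account: SAMPLE_RECOMMENDATIONS.get(account, []), max_depth=2
)
```

Each pass over the queue corresponds to one iteration of the process: the seed yields its recommendations, those yield their own, and so on until the depth limit is reached.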
Connections between nodes may be indicative of a common group. That group may indicate a strong common interest, illustrated by the topics discussed within the community, posted content, or traits assigned to the community. Therefore, connections may be used to identify a group having a specific common interest, particularly if there is supporting “confirmation” provided by membership in other distinct communities having a similar interest.
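The article does not say which community detection method is used, so the following Python sketch uses a deliberately simple stand-in: it drops edges whose similarity falls below a threshold and treats each remaining connected component as one community. The function name, the sample edges, and the threshold value are all assumptions made for illustration.

```python
from collections import defaultdict


def communities(weighted_edges, min_similarity):
    """Group nodes into connected components, using only edges with
    similarity at or above min_similarity."""
    adjacency = defaultdict(set)
    for a, b, similarity in weighted_edges:
        adjacency[a]  # touching the key ensures isolated nodes still appear
        adjacency[b]
        if similarity >= min_similarity:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    groups = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over the current component
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(adjacency[node] - seen)
        groups.append(group)
    return groups


# Hypothetical similarity scores between pairs of accounts.
sample_edges = [
    ("natgeo", "bbcearth", 0.9),
    ("bbcearth", "natgeowild", 0.8),
    ("natgeowild", "vogue", 0.1),
    ("vogue", "gucci", 0.7),
]
groups = communities(sample_edges, min_similarity=0.5)
```

With these sample values the weak link between the nature accounts and the fashion accounts is discarded, leaving two groups. On a graph with millions of vertices a dedicated community detection algorithm would be the more realistic choice.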
Building a Hierarchy of Interests
How do we determine the category to assign to an identified community when there are so many communities that labeling them manually is impractical?
Socialprofiler encountered this problem during the development of our service and solved it using advanced LLMs combined with several complementary techniques.
Socialprofiler’s State of the Art LLMs
LLMs provide elaborate query capabilities for building hierarchies of interests, yielding excellent results for hundreds of interest categories. The hierarchy is built from LLM-generated taxonomies that are then verified by humans, as shown in the operator chat below (pic.1.).

pic 1. LLM output for the ‘common list of interests’ request
Using recommendation tables, Socialprofiler matches identified communities that share similar interests to names from the resulting taxonomies. The options described next complement this step and can be implemented together.
Alternative Realities w/o LLMs
Let’s delve deeper into how to determine what category should be assigned to an identified community. Socialprofiler starts by compiling a list of the most common categories, to which more detailed interests will then be attached. What kind of general categories can these be? Sports, cars, fashion, photography, etc.
Socialprofiler then details the general categories with more specific interests and uses them to start building a word2vec model. Every community highlighted on the graph carries textual information from its accounts’ biographies. Through word2vec analysis, Socialprofiler measures the distance from that text to each of the keywords describing a general category. The smaller the distance, the closer the category and the interest are to each other in meaning. Learn more about how it works in the next paragraph.
Word2vec is a natural language processing technique used to represent words as vectors of numerical values in a multi-dimensional space. These vectors are created by training a neural network on a large corpus of text, such as a collection of news articles or books.
Word2Vec allows machines to understand the meaning of words and their relationships with other words in a way similar to how humans understand language. This has many practical applications like improving search engine performance, sentiment analysis, and language translation.
A benefit of Word2Vec is that it can capture the semantic relationships between words. For example, it can recognize that “king” is related to “queen” in a similar way that “man” is related to “woman” (pic.2.). This makes it a powerful tool for natural language processing and understanding.

pic 2. Word2vec exploration in multidimensional space.
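The king/queen relationship can be reproduced with simple vector arithmetic. The Python sketch below uses hand-made two-dimensional vectors (an assumption made purely for illustration; a trained word2vec model learns vectors with hundreds of dimensions from text) and cosine similarity to find the word closest to "king" minus "man" plus "woman".

```python
import math

# Hand-made toy vectors, NOT trained embeddings. Axes: (royalty, masculinity).
TOY_VECTORS = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity: 1.0 for the same direction, lower for unrelated vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (word for word in vectors if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


result = analogy("king", "man", "woman", TOY_VECTORS)
print(result)  # queen
```

The same cosine measure is what makes "distance between words" concrete: words whose vectors point in similar directions are treated as close in meaning.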
A key output of word2vec is the selection of keywords for each interest in the hierarchy using a semantic language model. Then, Socialprofiler uses another tool: regular expressions, a set of rules for constructing search patterns that find specific pieces of text within a larger body of text (pic.3).

pic 3. Regular expression detects email pattern in text.
Keywords from word2vec are used to write regular expressions in semi-automatic mode to verify and search for taxa among the communities obtained from the graph. Based on the regular expressions, the primary candidates for addition to a particular interest will be selected.
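To make the keyword-to-pattern step concrete, here is a minimal Python sketch using the standard re module. The keyword lists, the interest names, and the helper functions are hypothetical; in the workflow described above the keywords would come from word2vec and the resulting patterns would be reviewed by a person.

```python
import re

# Hypothetical keyword lists; in the article these would come from word2vec.
INTEREST_KEYWORDS = {
    "photography": ["photographer", "photography", "camera"],
    "cars": ["car", "cars", "racing"],
}


def build_pattern(keywords):
    """Compile one case-insensitive, whole-word pattern from a keyword list."""
    # Escape each keyword so special characters are matched literally.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(keyword) for keyword in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


PATTERNS = {name: build_pattern(words) for name, words in INTEREST_KEYWORDS.items()}


def candidate_interests(bio):
    """Return the interests whose pattern matches somewhere in the bio text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(bio))


print(candidate_interests("Travel photographer with a Canon camera"))  # ['photography']
print(candidate_interests("Carpet store owner"))  # []
```

The word boundaries (\b) keep "car" from matching inside "Carpet", which is the kind of false positive that semi-automatic review of the patterns is meant to catch.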
How do LLMs Assist Socialprofiler?
Before we learn more about LLMs, you should gain an understanding of the embedding process. Embedding is a technique used in natural language processing and machine learning to represent words or phrases as vectors in a multi-dimensional space. These vectors capture the semantic meaning of the words or phrases, which allows for more effective processing and analysis.
Embeddings are created by training a neural network on a large corpus of text, such as a collection of news articles or books. The network learns to represent each word or phrase as a vector, based on its context within the text. As you recall, Socialprofiler has already done this with word2vec, so we can get embeddings for every word in the model’s vocabulary. Since we are now operating with embeddings, we can measure the distance between words and even whole paragraphs of text. The smaller the distance between the objects being compared, the more they have in common; in the limit, a distance of zero means the same word or a close synonym.
Now let’s return to LLMs. An LLM can suggest keywords that indicate each interest (pic.4), and, being an advanced neural network, it can also return embeddings for words and concepts. Using these keywords, we obtain an embedding from the LLM for each interest. These embeddings are then used to search for nearby clusters in the community graph.

pic 4. LLM output for the ‘keywords that indicate interests’ request.
How is this accomplished? First, each community receives its own embedding, computed from the bio texts of its accounts. Then the similarity between this embedding and an interest’s embedding serves as a secondary attribute confirming that the candidate belongs to the selected interest.
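The confirmation step can be sketched as follows. This Python example makes several assumptions that are not stated in the article: the community embedding is taken as the plain average of its bio embeddings, similarity is measured with cosine similarity, and a fixed threshold decides confirmation. The three-dimensional vectors are made-up toy values, not real LLM embeddings.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def confirm_interest(bio_embeddings, interest_embedding, threshold):
    """Return (confirmed, score) for a candidate community and one interest."""
    community_embedding = mean_vector(bio_embeddings)  # assumption: simple average
    score = cosine_similarity(community_embedding, interest_embedding)
    return score >= threshold, score


# Toy embeddings standing in for vectors returned by an embedding model.
community_bios = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
photography = [1.0, 0.0, 0.0]
cars = [0.0, 0.0, 1.0]

is_photography, photo_score = confirm_interest(community_bios, photography, threshold=0.8)
is_cars, cars_score = confirm_interest(community_bios, cars, threshold=0.8)
```

Here the community scores high against the photography vector and low against the cars vector, so only the first candidate label would be confirmed. The threshold of 0.8 is arbitrary and would need tuning on real data.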
A Human Readable Description of Every Interest via Instagram
Having obtained the interest label for every major account on Instagram, it is possible to proceed with a human-readable description of each interest. This can be done using LLM summarization. For example, collect the bios of all the people in a cluster and ask the LLM to describe what they share, leaving out anything that is not common to them (pic.5). Several clusters can then be combined into a common category and built into a hierarchy.

pic 5. LLM output for the task ‘convert the chain of interests to the human language’.
A Logical Conclusion
Now that Socialprofiler has identified a person’s interests, you can apply them as you see fit, determining how they align with the real-life situations you may be evaluating. Would a team composed of far-left and far-right members easily get along? If you need to decide what does and doesn’t suit a group you interact with, review your Socialprofiler results to assess commonality in beliefs, traits, and partisanship.