Scraping Job Postings for API Signals

I am gathering signals from leading enterprises to understand what they are investing in. One of the ways I gather signals is by harvesting the job postings they publish. While not entirely honest, job postings are the most honest public signal available online today about a company. They are something companies have to standardize and make public because they want to attract candidates, but like other signals they are also shaped to fit the larger enterprise narrative. With that said, I find that both the signals gathered and the ritual of job postings tell a very interesting story about the state of technology today, and specifically the online theater of Internet technology.

I have pulled the jobs for 100 companies. 75% of those companies use similar software to publish their jobs, with Workday being the most restrictive to scraping. Fewer than 5% use Cloudflare to get in your way. Most do a bunch of really wonky things that make it difficult to harvest the jobs, though the majority of it is unintentional: HTML and other scripting that just gets in the way. 90% of the companies publish JSON-LD on the detail page of a job, making it extremely easy and consistent to actually scrape the job posting, once you have managed to find the URL. It is an interesting dance between the companies, the tools they use, and I’d say LinkedIn and other job posting websites.
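Here is a minimal sketch of that last step, pulling the JSON-LD out of a job detail page once you have the HTML. It assumes the page embeds a schema.org JobPosting object inside a script tag of type application/ld+json, which is the common convention but not something every site follows. The class and function names, the sample HTML, and "Example Corp" are all made up for illustration, and it uses only the Python standard library.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def extract_job_postings(html):
    """Return every JSON-LD object on the page whose @type is JobPosting."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()

    postings = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # Skip malformed blocks rather than failing the whole page.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if not isinstance(item, dict):
                continue
            types = item.get("@type")
            types = types if isinstance(types, list) else [types]
            if "JobPosting" in types:
                postings.append(item)
    return postings


# Hypothetical detail page, assumed for illustration only.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "JobPosting",
 "title": "API Platform Engineer",
 "hiringOrganization": {"@type": "Organization", "name": "Example Corp"},
 "datePosted": "2024-01-15"}
</script>
</head><body><h1>API Platform Engineer</h1></body></html>
"""

if __name__ == "__main__":
    for job in extract_job_postings(SAMPLE_HTML):
        print(job["title"], "at", job["hiringOrganization"]["name"])
```

This sketch does not handle pages that nest the posting inside an "@graph" array, and it does nothing about the harder problem of finding the detail URL in the first place.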

It is ironic that I am using the signals gathered via scraping job postings to understand the state of APIs. Nobody wants to standardize and just provide a default API or feed for their jobs, so they go all in on making it more difficult to do, while simultaneously adding a standard just to the detail page. Why don’t we just have a standard XML or JSON feed for our jobs as a default for companies? Why not sell access to your jobs data? I’d pay for it from an authoritative source. You can tell people put a lot of work into making job postings hard to scrape, and that is one of the reasons I consider them to be such a truthful representation: people see them as possessing value and will invest in obfuscating them.

I have given up all hope that companies will understand the importance of APIs and invest accordingly. They will only do it when there is a selfish incentive, which isn’t always about money, or when they are made to. You can demonstrate it will make them a bunch of money and they still won’t do it. You need regulation or some completely selfish reason that benefits them as a company and nobody else. I think blogs, calendars, press releases, and job postings provide all the evidence we need that companies do not want interoperability, and that there is a lot of money to be made just fighting over the scraps around something rather than doing any work that might actually benefit someone else.