A fellow Cal Poly Pomona "senpai" once told the participants of the HackPoly 2017 hackathon that he had created an API called "Food2Fork." The API provided users with popular recipes shared online. As a sophomore studying Computer Engineering, I admired his accomplishment.

Ever since then, I wondered: how do I make an API? It wasn't until a year and a half later that the question resurfaced, while I was planning out the web backend for my startup, Goal Striver.

I eventually learned how to do it using PHP, PHP frameworks (e.g. Laravel), and Node.js via Express.js.

What is an API?

An API, or application programming interface, is a way for software programs to interact with one another. There are different types of APIs out there, each with its own functionality. For instance, if you wanted to give your web app, mobile app, or program the ability to identify objects in an image, you would use a machine learning API such as Clarifai.
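To make that concrete, here's a minimal sketch of what calling an API looks like from the consumer's side in Node.js. The endpoint and response shape are made up purely for illustration; a real service like Clarifai has its own URL, authentication, and response format.

```javascript
// Hypothetical example: ask some API for data over HTTPS and read back JSON.
// The URL and fields below are placeholders, not a real service.
const https = require('https');

https.get('https://api.example.com/recipes?query=pasta', (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    // The API responds with structured data (usually JSON) our program can use.
    const data = JSON.parse(body);
    console.log(data);
  });
}).on('error', (err) => console.error(err));
```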

Fast forward to 2018: I started to wonder where informational APIs get their information in the first place. I also decided to avoid Googling for an answer, to see if I could figure it out by myself.

My First Hypothesis

My first hypothesis was that the developer decided to one day visit every website on a particular topic, copy/paste every single bit of information, and store it in his or her database.

Now, I thought to myself, this can't be it. After all, these informational APIs hold several thousand records of information. There's no way anyone has the sanity and patience to do this by hand. It was at that moment that I remembered something I learned back in high school, when I was teaching myself C++: "if something is repetitive, automate it."

My Second Hypothesis

This led me to revisit a random project I had experimented with: a web crawler. Back in my sophomore year of college, I built a web crawler in Python with BeautifulSoup by following Bucky's tutorial on YouTube. I made it back then because I wanted to build a Twitter bot. So with that in mind, I modified my hypothesis just slightly.

What if these developers actually created a web crawler, scraped the websites of interest, and stored the data in their database? This method seemed more feasible. Then, during the golden year of crypto, 2017, when Bitcoin hit five figures, I found out that CoinMarketCap has an API, and I also wondered how they built their site. It turns out they simply connect to separate APIs from almost all the exchanges around the world and average the prices out. No scraping involved, just an easier way to get things done.
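Here's a rough sketch of that averaging idea in Node.js, using the "request" library that shows up again later in this post. The exchange URLs and the `price` field are hypothetical; every real exchange has its own endpoints and JSON layout.

```javascript
// Hypothetical example: fetch the same ticker from several exchange APIs
// and average the reported prices. URLs and response fields are made up.
const request = require('request');

const endpoints = [
  'https://api.exchange-one.example/ticker/BTC-USD',
  'https://api.exchange-two.example/ticker/BTC-USD',
  'https://api.exchange-three.example/ticker/BTC-USD',
];

function fetchPrice(url) {
  return new Promise((resolve, reject) => {
    request(url, (err, res, body) => {
      if (err) return reject(err);
      resolve(Number(JSON.parse(body).price)); // assumes a { "price": ... } field
    });
  });
}

Promise.all(endpoints.map(fetchPrice)).then((prices) => {
  const average = prices.reduce((sum, p) => sum + p, 0) / prices.length;
  console.log('Average BTC price:', average);
});
```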

Creating My First Informational API

By 2018, I knew I had to make an informational, data-driven API like Food2Fork for a project of mine. It was time I stopped procrastinating and did what I had to do: build a web crawler to scrape data off the internet so I would have something to work with. While I was at it, I figured I might as well create a simple interface for adding entries manually, so I wouldn't have to write a new scraper every time.

It had been several years since I last used Python, and by then I had forgotten nearly everything about it, much less the syntax for writing a simple class! All I could recall was that Python 2 and Python 3 syntax differ in annoying ways. So I decided to use my new favorite tool, Node.js, to accomplish the task. Besides, I would have to use it eventually, since I wanted to deploy the API to a web server in the very near future.

To speed up the process, I used a library called "cheerio", which, together with "request", does most of the heavy lifting when it comes to fetching pages and parsing the data. I spent the weekend working on this, and at last I got it working by Sunday, before lunchtime. I ran the program, scraped a few websites, checked my database, and bam! Everything was there.
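For the curious, here's roughly what that request + cheerio combination looks like. The URL, selectors, and fields below are placeholders, since the actual site and data I scraped stay a secret; the real script follows the same shape, though.

```javascript
// Sketch of a request + cheerio scraper. URL and selectors are placeholders.
const request = require('request');
const cheerio = require('cheerio');

request('https://example.com/some-listing-page', (err, res, body) => {
  if (err || res.statusCode !== 200) {
    return console.error('Failed to fetch page:', err || (res && res.statusCode));
  }

  // cheerio gives us a jQuery-like API over the downloaded HTML.
  const $ = cheerio.load(body);
  const entries = [];

  $('.entry').each((i, el) => {
    entries.push({
      title: $(el).find('h2').text().trim(),
      link: $(el).find('a').attr('href'),
    });
  });

  // In the real project, these rows get inserted into the database here.
  console.log(entries);
});
```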

Now, what did I make specifically? I will keep that a secret. But, for my two friends who do know, should you happen to come across this post, shhh!!! Don't tell them. :)

Waking up this morning, I decided to take a look at Food2Fork, and I realized that I could now practically make a clone of the entire website thanks to my newfound knowledge of web scraping. muwahhahahahaha!

Conclusion

If you plan to create your own informational API, one that isn't about exposing the data behind an existing product or service but about making a particular type of information easier to access, here is the recipe: first build a web crawler, store the information it collects, and then create a REST API for users to access it. However, before you go and build your crawler, make sure you check the terms of service and privacy policy of every website you plan to crawl. They spell out whether third-party crawlers, spiders, or web scrapers are allowed to operate on the site.
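As for that last step, the REST API on top of the stored data can be as small as this Express sketch. The in-memory array here stands in for whatever database your crawler writes to.

```javascript
// Minimal sketch of a read-only REST API over scraped records.
// An in-memory array stands in for the real database.
const express = require('express');
const app = express();

const records = [
  { id: 1, title: 'First scraped entry' },
  { id: 2, title: 'Second scraped entry' },
];

// List every record.
app.get('/api/records', (req, res) => {
  res.json(records);
});

// Fetch a single record by id, or 404 if it doesn't exist.
app.get('/api/records/:id', (req, res) => {
  const record = records.find((r) => r.id === Number(req.params.id));
  if (!record) return res.status(404).json({ error: 'Not found' });
  res.json(record);
});

app.listen(3000, () => console.log('API listening on port 3000'));
```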

Moving forward, what should I create next? I want to create more informational APIs like Food2Fork, but what is there to create when almost everything has already been made? Hmmm... I guess only time will tell. But if you have any ideas or suggestions, feel free to leave them in the comments, and I'll think about giving them a shot if I find a practical use for them.