Before any code is written, the data source must be studied carefully: its structure, the volume of data, the relationships between units of information, and so on. You also need to determine which technologies the site relies on. The approach to writing the program is then chosen based on this analysis.
For a single site of medium complexity, this usually takes from two days to a week, depending on the current workload.
Most of the time is spent loading pages (data) from the server; once a page is loaded, extracting the information is almost instantaneous. So if you already know the source site, you can measure the average load time and multiply it by the number of pages you need to collect from. The total can also be affected by connection instability, problems on the source site's server, blocking, and so on.
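The estimate described above can be sketched in a few lines. This is a minimal illustration, not production scraping code: it assumes you time a handful of sample pages and extrapolate, ignoring retries, throttling, and blocking.

```python
import time
import urllib.request

def estimate_total_seconds(sample_load_times, total_pages):
    """Extrapolate total scraping time from a few measured page loads."""
    avg = sum(sample_load_times) / len(sample_load_times)
    return avg * total_pages

def measure_load(url):
    """Time a single page download (requires network access)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()
    return time.monotonic() - start
```

For example, if three sample pages took 1, 2, and 3 seconds, a 100-page site gives `estimate_total_seconds([1.0, 2.0, 3.0], 100)`, i.e. about 200 seconds; in practice you should pad this for the unstable-connection cases mentioned above.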
Collecting information from other sources (API services, text files, databases, tabular data) is quite fast and takes minutes; here most of the time goes into writing the parsing program itself.
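To show why API sources are fast, here is a small parsing sketch. The payload shape and field names (`products`, `sku`, `price`) are hypothetical; a real API will differ, but the pattern of decoding a structured response is the same.

```python
import json

# A hypothetical API response; real endpoints and field names will differ.
payload = '{"products": [{"sku": "A-1", "price": "12.50"}, {"sku": "B-2", "price": "7.00"}]}'

def parse_products(raw):
    """Turn a raw JSON response into a list of (sku, price) tuples."""
    data = json.loads(raw)
    return [(p["sku"], float(p["price"])) for p in data["products"]]
```

Unlike HTML scraping, there is no page rendering or load time to wait for here, which is why the work reduces to writing the parser.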
In fact, this step may not be needed at all if your site's structure and the source site's structure match exactly. That is extremely rare, however. Besides, you often need to collect data from several sources, whose structures are also rarely identical.
The collected data must be brought to a common format: define common units of measurement for everything, eliminate duplicates and synonyms, and finally assemble exactly the kind of catalog you need.
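A minimal normalization pass over such records might look like the following sketch. The field names (`sku`, `weight`, `unit`) and the synonym table are assumptions for illustration; real data will need its own mapping.

```python
def normalize(records):
    """Bring records to a common format: convert grams to kilograms,
    map synonym field names to one canonical name, and drop
    duplicate records by SKU."""
    SYNONYMS = {"colour": "color", "hue": "color"}
    seen, result = set(), []
    for r in records:
        r = dict(r)                        # avoid mutating the caller's data
        if r.get("unit") == "g":           # common unit: kilograms
            r["weight"] = r["weight"] / 1000
            r["unit"] = "kg"
        r = {SYNONYMS.get(k, k): v for k, v in r.items()}
        if r["sku"] in seen:               # eliminate duplicates
            continue
        seen.add(r["sku"])
        result.append(r)
    return result
```

The same structure scales to the real cases: each rule (units, synonyms, deduplication) is a separate, testable transformation applied to every record.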
Sometimes it is useful to build such a catalog as a separate small program, a kind of database you always have at hand for quick searches and for experimenting with the data structure. It will also come in handy if you ever need to quickly compare your existing data with new sources.
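One simple way to keep such a "catalog at hand" is a small SQLite database. This is a sketch under assumed columns (`sku`, `name`, `category`); an in-memory database is used here, but pointing `connect` at a file path gives you a persistent catalog you can query any time.

```python
import sqlite3

def build_catalog(rows):
    """Load normalized (sku, name, category) rows into a small SQLite
    database for quick searches and future comparisons with new sources."""
    db = sqlite3.connect(":memory:")  # use a file path for a persistent catalog
    db.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, name TEXT, category TEXT)")
    db.executemany("INSERT INTO products VALUES (?, ?, ?)", rows)
    db.commit()
    return db
```

A quick search is then a one-liner, e.g. `db.execute("SELECT sku FROM products WHERE category = ?", ("tools",))`, and comparing a new source against the catalog is a join away.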
When your data structure is ready, it is still an abstraction: the catalog has categories and the products have attributes, but to import all of this into the site you still need to create a driver for the site's specific database architecture.
In other words, you need a kind of map that tells the program how to lay out your data (usually specially prepared Excel tables) in the site's database.
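Such a "map" can be as simple as a dictionary from spreadsheet columns to database fields. The column and field names below are hypothetical (loosely WooCommerce-flavored, since WordPress comes up later); the point is the mapping pattern, not the specific schema.

```python
# Hypothetical column map: keys are fields in the prepared spreadsheet,
# values are columns in the target site's database.
COLUMN_MAP = {
    "Product name": "post_title",
    "Price": "regular_price",
    "Category": "product_cat",
}

def map_row(spreadsheet_row):
    """Translate one prepared spreadsheet row into the site's schema."""
    return {db_col: spreadsheet_row[src] for src, db_col in COLUMN_MAP.items()}
```

In a real driver this map is usually the only part that changes between target platforms; the surrounding import loop stays the same.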
The import process itself depends largely on your hosting, its speed and the restrictions imposed on it, as well as on the complexity of the catalog's structure as a whole and of each product in particular. The more products that have already been loaded, the slower the process becomes.
For example, uploading a thousand products, each with a single picture, to an empty WordPress-based online store will take no more than an hour even on weak hosting. Performing the same operation when the site already holds 150,000-200,000 products may take several hours and will most likely require splitting the import into smaller batches.
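Splitting an import into smaller batches, as suggested above, is a one-function job. The batch size is something you tune to your hosting's limits; 500 is just a placeholder.

```python
def batches(items, size):
    """Split a large import into smaller batches so a heavily loaded
    site (or restrictive hosting) can process them one at a time."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Example: feed the importer 500 products at a time.
# for batch in batches(all_products, 500):
#     import_batch(batch)   # hypothetical importer call
```

Batching also makes failures cheaper: if the hosting times out mid-import, you re-run one batch rather than the whole catalog.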
There are no restrictions on how you personally “consume and assimilate” publicly available information: with your eyes, ears, fingers, or technical devices, or on whether and how you memorize it. No one forbids you to analyze or structure information with your own brain or, again, with the help of technical means.
The question of legality arises at the moment of use. Here everything depends on the information itself: what rights to it or to related products exist (copyright, trademark, license, and so on), whether it can be distributed as your own or under your own name, whether it can be sold without a license or the rights holder's permission, and so forth. Since every situation raises many questions, it is better to consult a specialized lawyer for each specific case; I strongly recommend doing so to avoid trouble later.