GitHub link: Exungsh/032002337: a COVID-19 data analysis program based on the National Health Commission (NHC) announcements
1. PSP form
| Personal Software Process Stages | Estimated time (minutes) | Actual time (minutes) |
|---|---|---|
| Planning | ∞ | 0 |
| Estimate (estimate how much time the task will take) | 0 | 0 |
| Development | 0 | 0 |
| Analysis (requirements analysis, including learning new technologies) | 0 | 0 |
| Design Spec (generate design documentation) | 0 | 0 |
| Design Review | 0 | 0 |
| Coding Standard | 0 | 0 |
| Design | 0 | 0 |
| Coding (the actual coding) | 0 | 0 |
| Code Review | 0 | 0 |
| Test (self-test, modify code, commit changes) | 0 | 0 |
| Test Report | 0 | 0 |
| Size Measurement (estimate the workload) | 0 | 0 |
| Postmortem & Process Improvement Plan | 0 | 0 |
| Total | 0 | 0 |
2. Implementation of task requirements
1. Project design and technology stack
1.1. Project Design
The project is divided into three main parts:
- Crawl the NHC epidemic-report pages to collect each announcement's URL (value) and date (key), build a dictionary, and store it in a JSON file
- Traverse the dictionary, crawl the content of each historical announcement, and write it to a txt file
- Interact with the user (who specifies a date), parse the corresponding txt file, and generate a table, a bar chart, and a map
1.2. Technology stack
- Python
- HTML + CSS (BeautifulSoup)
- "Deep learning" (in the sense that I had to study the announcement format in depth)
2. Crawler and data processing
The work is split across three scripts:
- get_url.py: Get a list of URLs
- get_text.py: Get the announcement content
- get_result.py: user interaction, get the final result
get_url.py --> url_list.json --> get_text.py --> txt --> get_result.py --> table, bar chart, map
2.1. get_url.py
Core code:
```python
area = requests.get(target_url, headers=headers).text  # Crawl the web page text
bs = BeautifulSoup(area, "html.parser")
# Get the a tags (to obtain the announcement URLs)
a_url = bs.select('.list > ul > li > a')
# Get the span tags (to obtain the dates)
span_date = bs.select('.list > ul > li > span')
for a in a_url:
    url.append('http://www.nhc.gov.cn/' + a['href'])
for span in span_date:
    date.append(span.text)
# Pack into a dictionary
result = dict(zip(date, url))
return result
```
Traverse each list page to build a {date: url} dictionary.
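The page-by-page traversal itself is not shown above. A minimal sketch of how it might look, assuming the core code is wrapped in a hypothetical `get_page()` function and that the NHC list pages follow the `list_gzbd.shtml`, `list_gzbd_2.shtml`, ... naming pattern (both are assumptions for illustration, not taken from the repo):

```python
# Hypothetical wrapper around the core code above; the URL pattern, page
# count, and sleep interval are assumptions for illustration only.
import time

BASE = "http://www.nhc.gov.cn/xcs/yqtb/"

def build_url_dict(pages=10):
    url_dict = {}
    for page in range(1, pages + 1):
        if page == 1:
            target_url = BASE + "list_gzbd.shtml"
        else:
            target_url = BASE + f"list_gzbd_{page}.shtml"
        url_dict.update(get_page(target_url))  # get_page() is the core code above, returning {date: url}
        time.sleep(3)                          # slow down to avoid the anti-crawling mechanism
    return url_dict

url_dict = build_url_dict()
```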
```python
with open("url_list.json", "w") as f:
    f.write(json.dumps(url_dict, indent=4))
```
Export it as a JSON file.
JSON file content (excerpt):
{ "2022-09-12": "http://www.nhc.gov.cn//xcs/yqtb/202209/093a5fe2183b42169296326741d81565.shtml", "2022-09-11": "http://www.nhc.gov.cn//xcs/yqtb/202209/338e611615da4998a1202694eee8f196.shtml", "2022-09-10": "http://www.nhc.gov.cn//xcs/yqtb/202209/8ac84d72227c4a318694ddae45412c9a.shtml", "2022-09-09": "http://www.nhc.gov.cn//xcs/yqtb/202209/0702822269e648a882c267aa672cebf8.shtml", "2022-09-08": "http://www.nhc.gov.cn//xcs/yqtb/202209/78ea88c5c23e41c391376ee9b103cfec.shtml", "2022-09-07": "http://www.nhc.gov.cn//xcs/yqtb/202209/b9867ea1be624141b41f461a431239d7.shtml", "2022-09-06": "http://www.nhc.gov.cn//xcs/yqtb/202209/892ec8bb4db44a96bd06169ac2d7de09.shtml", "2022-09-05": "http://www.nhc.gov.cn//xcs/yqtb/202209/9a6ef43336a2401ca6dc4f2e6f97e5a6.shtml", "2022-09-04": "http://www.nhc.gov.cn//xcs/yqtb/202209/cb9e0c28d4b2467fac0ca2871bbfd95b.shtml", "2022-09-03": "http://www.nhc.gov.cn//xcs/yqtb/202209/97243736d6e94317810ac51ba23fe189.shtml", "2022-09-02": "http://www.nhc.gov.cn//xcs/yqtb/202209/e0a18445e0ab47608527b9c910f77699.shtml", "2022-09-01": "http://www.nhc.gov.cn//xcs/yqtb/202209/b236ae4939f24155a506f0cfb0f16ace.shtml", "2022-08-31": "http://www.nhc.gov.cn//xcs/yqtb/202208/8fbbe614bd0c4a5ca9cf8a9e4c289e9a.shtml", "2022-08-30": "http://www.nhc.gov.cn//xcs/yqtb/202208/2cc3c1a07dd348b09afac2a880ca72ca.shtml" }
2.2. get_text.py
Core code:
Function to fetch the announcement content:
```python
# Pass in the URL and return the content of all p tags on the page
def get(target_url):
    area = requests.get(target_url, headers=headers).text  # Crawl the web page text
    bs = BeautifulSoup(area, "html.parser")
    d = bs.findAll('p')  # d stores the contents of all p tags
    return d
```
Merge the returned list into one complete text and write it to a txt file named after the date:
```python
for p in p_data:
    data = data + p.text
fh = open(date + '.txt', 'w', encoding='utf-8')
fh.write(data)
fh.close()
```
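The loop that drives `get()` over every announcement is not shown in the post. A minimal sketch, assuming the `url_list.json` produced by get_url.py and the `get()` function above (the sleep interval is an assumption added for illustration):

```python
# Hypothetical driver loop over url_list.json; the sleep interval is an
# assumption, not taken from the repo.
import json
import time

with open("url_list.json", encoding="utf-8") as f:
    url_dict = json.load(f)

for date, url in url_dict.items():
    p_data = get(url)              # list of <p> tags for one announcement
    data = ""
    for p in p_data:
        data = data + p.text
    with open(date + ".txt", "w", encoding="utf-8") as fh:
        fh.write(data)
    time.sleep(3)                  # avoid hammering the NHC site
```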
txt content:
9 From 0:00 to 24:00 on March 8, 31 provinces (autonomous regions and municipalities directly under the Central Government) and the Xinjiang Production and Construction Corps reported 301 new confirmed cases. Among them, 42 were imported cases (16 in Guangdong, 9 in Fujian, 6 in Shanghai, 4 in Beijing, 2 in Yunnan, 1 in Tianjin, 1 in Liaoning, 1 in Heilongjiang, 1 in Shandong, and 1 in Shaanxi), including 2 259 local cases (59 in Sichuan, 42 in Inner Mongolia, 36 in Guangdong, 22 in Guangxi, 20 in Tibet, 17 in Beijing, and 17 in Shandong). 16 cases, 13 cases in Heilongjiang, 7 cases in Qinghai, 6 cases in Guizhou, 4 cases in Shaanxi, 3 cases in Liaoning, 3 cases in Jiangsu, 3 cases in Yunnan, 3 cases in Xinjiang, 1 case in Tianjin, 1 case in Jiangxi, 1 case in Henan, and 1 case in Hainan , 1 in Chongqing), including 41 cases from asymptomatic infections to confirmed cases (15 in Inner Mongolia, 7 in Heilongjiang, 4 in Sichuan, 3 in Guangdong, 3 in Xinjiang, 2 in Tibet, 1 in Liaoning and 1 in Shandong , 1 in Henan, 1 in Chongqing, 1 in Guizhou, 1 in Shaanxi, and 1 in Qinghai). No new deaths were reported. No new suspected cases were reported. There were 336 newly cured and discharged cases on the same day, including 30 imported cases and 306 local cases (88 in Hainan, 71 in Sichuan, 30 in Tibet, 24 in Heilongjiang, 20 in Shaanxi, 17 in Chongqing, 14 in Guangdong, and 14 in Qinghai). 14 cases, 5 in Henan, 4 in Xinjiang, 3 in Inner Mongolia, 2 in Shanxi, 2 in Liaoning, 2 in Zhejiang, 2 in Fujian, 2 in Shandong, 2 in Hubei, 2 in Yunnan, 1 in Tianjin and 1 in Hunan ), 32,566 close contacts were released from medical observation, and severe cases decreased by 10 compared with the previous day. There are 560 confirmed cases imported from abroad (no severe cases), and no existing suspected cases. A total of 22,906 confirmed cases, 22,346 cured and discharged cases, and no deaths. As of 24:00 on September 8, according to reports from 31 provinces (autonomous regions and municipalities) and the Xinjiang Production and Construction Corps, there were 6,226 confirmed cases (including 28 severe cases), 234,876 cured and discharged cases, and 5,226 deaths. , a total of 246,328 confirmed cases have been reported, and there are no existing suspected cases. A total of 5,632,275 close contacts have been traced, and 278,389 close contacts are still under medical observation. 31 provinces (autonomous regions and municipalities) and Xinjiang Production and Construction Corps reported 1,103 new cases of asymptomatic infections, including 70 imported cases and 1,033 local cases (347 in Tibet, 109 in Heilongjiang, 81 in Shandong, 77 in Inner Mongolia, and 77 in Liaoning). 77 cases, 77 cases in Guangxi, 71 cases in Sichuan, 56 cases in Jiangxi, 28 cases in Guangdong, 27 cases in Xinjiang, 15 cases in Qinghai, 13 cases in Guizhou, 12 cases in Hubei, 11 cases in Shaanxi, 8 cases in Henan, 5 cases in Hainan and 4 cases in Tianjin , 3 cases in Fujian, 3 cases in Gansu, 2 cases in Beijing, 2 cases in Jilin, 2 cases in Shanghai, 2 cases in Yunnan, and 1 case in Jiangsu). There were 1,710 asymptomatic infections released from medical observation that day, including 109 imported from abroad and 1,601 from China (902 in Tibet, 155 in Qinghai, 135 in Hainan, 59 in Shaanxi, 50 in Xinjiang, 49 in Hebei, and 38 in Henan. 
34 in Heilongjiang, 34 in Hubei, 28 in Shandong, 27 in Sichuan, 14 in Jiangxi, 13 in Chongqing, 12 in Gansu, 11 in Guangxi, 8 in Liaoning, 6 in Shanghai, 5 in Guangdong, 4 in Jiangsu and 4 in Hunan 3 cases in Inner Mongolia, 2 cases in Shanxi, 2 cases in Zhejiang, 2 cases in Anhui, 2 cases in Yunnan, 1 case in Tianjin, and 1 case in Jilin). There were 24,048 asymptomatic infections (648 imported from abroad). A total of 5,977,507 confirmed cases have been reported from Hong Kong, Macao and Taiwan. Among them, there were 396,687 cases in the Hong Kong Special Administrative Region (77,564 discharged and 9,769 deaths), 793 in the Macau Special Administrative Region (787 discharged and 6 deaths), and 5,580,027 in Taiwan (13,742 discharged and 10,170 deaths). (Note: When citing by the media, please mark "Information from the official website of the National Health Commission". ) Address: No. 1, Xizhimenwai South Road, Xicheng District, Beijing Postcode: 100044 Tel: 010-68792114 ICP Record number: Beijing ICP Prepare No. 18052910 Beijing Public Network Security No. 11010202000005 National Health Commission of the People's Republic of China all rights reserved Technical Support: Statistical Information Center of the National Health Commission Website ID: bm24000006
2.3. get_result.py
Core code:
Use the re library to locate the required text segment, then use the jieba library to segment the Chinese text and turn the result into a list:
```python
# Grab the "local cases (...)" segment of the announcement text, from the
# phrase for local cases up to the closing parenthesis (in the actual code
# the pattern matches the Chinese wording)
new_text = re.findall(r'Indigenous cases.*?\)', text)[0]
print(new_text)
cut = jieba.cut(new_text, cut_all=False)  # jieba word segmentation
new_result = ' '.join(cut)
new_result = new_result.split()
print(new_result)
```
Result:
['native', 'case', '259', 'example', '(', 'Sichuan', '59', 'example', ',', 'Inner Mongolia', '42', 'example', ',', 'Guangdong', '36', 'example', ',', 'Guangxi', '22', 'example', ',', 'Tibet', '20', 'example', ',', 'Beijing', '17', 'example', ',', 'Shandong', '16', 'example', ',', 'Heilongjiang', '13', 'example', ',', 'Qinghai', '7', 'example', ',', 'Guizhou', '6', 'example', ',', 'Shaanxi', '4', 'example', ',', 'Liaoning', '3', 'example', ',', 'Jiangsu', '3', 'example', ',', 'Yunnan', '3', 'example', ',', 'Xinjiang', '3', 'example', ',', 'Tianjin', '1', 'example', ',', 'Jiangxi', '1', 'example', ',', 'Henan', '1', 'example', ',', 'Hainan', '1', 'example', ',', 'chongqing', '1', 'example', ')']
Process the list to extract the final data, writing it to the Excel sheet at the same time:
```python
count = 0  # index of the list element currently being traversed
# write the Excel header row
title = ['province', 'Newly diagnosed']
col = 0
for i in title:
    sheet.write(0, col, i)
    col += 1
new_row = 1
# start processing the list
for word in new_result:
    if word.isdigit():
        if new_result[count - 2] == 'native':  # two tokens back is 'native' (local), so this is the total
            new_sum = int(word)                # new_sum stores the total number of people
        else:
            new_num.append(int(word))                       # add the count to the new_num list
            new_prov.append(new_result[count - 1])          # add the province to the new_prov list
            sheet.write(new_row, 0, new_result[count - 1])  # write to the sheet
            sheet.write(new_row, 1, int(word))              # write to the sheet
            new_row += 1
    count += 1
```
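The `sheet` object is not created in the snippet above. A minimal sketch of the surrounding workbook handling, assuming the xlwt library (an assumption inferred from the `sheet.write(row, col, value)` call pattern; the sheet and file names are illustrative):

```python
# Hypothetical workbook setup/save around the loop above; xlwt, the sheet
# name, and the file name are assumptions, not taken from the repo.
import xlwt

book = xlwt.Workbook(encoding="utf-8")
sheet = book.add_sheet("new confirmed")

# ... the header row and the for-word loop shown above go here ...

book.save(input_date + " new confirmed cases.xls")
```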
3. Performance improvement of the data statistics interface
Profiling get_result.py with Python's cProfile module gives the following results:
```
         214 function calls (207 primitive calls) in 0.001 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.001    0.001 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 enum.py:358(__call__)
        2    0.000    0.000    0.000    0.000 enum.py:670(__new__)
        1    0.000    0.000    0.000    0.000 enum.py:977(__and__)
        1    0.000    0.000    0.001    0.001 re.py:250(compile)
        1    0.000    0.000    0.001    0.001 re.py:289(_compile)
        1    0.000    0.000    0.000    0.000 sre_compile.py:249(_compile_charset)
        1    0.000    0.000    0.000    0.000 sre_compile.py:276(_optimize_charset)
        2    0.000    0.000    0.000    0.000 sre_compile.py:453(_get_iscased)
        1    0.000    0.000    0.000    0.000 sre_compile.py:461(_get_literal_prefix)
        1    0.000    0.000    0.000    0.000 sre_compile.py:492(_get_charset_prefix)
        1    0.000    0.000    0.000    0.000 sre_compile.py:536(_compile_info)
        2    0.000    0.000    0.000    0.000 sre_compile.py:595(isstring)
        1    0.000    0.000    0.000    0.000 sre_compile.py:598(_code)
      3/1    0.000    0.000    0.000    0.000 sre_compile.py:71(_compile)
        1    0.000    0.000    0.001    0.001 sre_compile.py:759(compile)
        3    0.000    0.000    0.000    0.000 sre_parse.py:111(__init__)
        7    0.000    0.000    0.000    0.000 sre_parse.py:160(__len__)
       18    0.000    0.000    0.000    0.000 sre_parse.py:164(__getitem__)
        7    0.001    0.000    0.001    0.000 sre_parse.py:172(append)
      3/1    0.000    0.000    0.000    0.000 sre_parse.py:174(getwidth)
        1    0.000    0.000    0.000    0.000 sre_parse.py:224(__init__)
        8    0.000    0.000    0.000    0.000 sre_parse.py:233(__next)
        2    0.000    0.000    0.000    0.000 sre_parse.py:249(match)
        6    0.000    0.000    0.000    0.000 sre_parse.py:254(get)
        1    0.000    0.000    0.000    0.000 sre_parse.py:286(tell)
        1    0.000    0.000    0.001    0.001 sre_parse.py:435(_parse_sub)
        2    0.000    0.000    0.001    0.000 sre_parse.py:493(_parse)
        1    0.000    0.000    0.000    0.000 sre_parse.py:76(__init__)
        2    0.000    0.000    0.000    0.000 sre_parse.py:81(groups)
        1    0.000    0.000    0.000    0.000 sre_parse.py:923(fix_flags)
        1    0.000    0.000    0.001    0.001 sre_parse.py:939(parse)
        1    0.000    0.000    0.000    0.000 {built-in method _sre.compile}
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
       25    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
    29/26    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        2    0.000    0.000    0.000    0.000 {built-in method builtins.max}
        9    0.000    0.000    0.000    0.000 {built-in method builtins.min}
        6    0.000    0.000    0.000    0.000 {built-in method builtins.ord}
       48    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        5    0.000    0.000    0.000    0.000 {method 'find' of 'bytearray' objects}
        1    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
```
- The profile shows that the user-facing program performs well; its response time is very short.
- The other two programs, get_url.py and get_text.py, contain many retry loops and sleep() calls to make sure the data can actually be fetched, so their performance is poor.
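For reference, a profile like the one above can be produced with the standard cProfile/pstats APIs. The exact invocation used here is not shown in the post, so `main()` below is only a placeholder entry point:

```python
# One way to generate and inspect a profile; main() is a placeholder for
# whatever get_result.py actually runs.
import cProfile
import pstats

cProfile.run("main()", "get_result.prof")
stats = pstats.Stats("get_result.prof")
stats.sort_stats("cumulative").print_stats(20)   # top 20 entries by cumulative time
```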
4. Implementation idea for the daily hot spot feature
Boss Ke mentioned producing statistics on the epidemic areas within a seven-day window. The main idea is as follows:
- Because the statistics span seven days, the original get_result.py alone can no longer meet the requirement.
- However, get_result.py can be adapted to organize its analysis results into a dictionary and store them as JSON.
- Traversing every date with that program yields a JSON file containing each day's newly confirmed and newly asymptomatic figures.
- The user enters a date; the data for the seven days up to that date is queried and merged (as a set), and all provinces with new cases are printed (a minimal sketch of this merge follows).
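A minimal sketch of that seven-day merge, assuming a hypothetical `results.json` keyed by date with per-day province lists (the file name and structure are assumptions, not part of the current code):

```python
# Hypothetical seven-day merge; results.json and its structure are assumed:
#   {"2022-09-08": {"new_prov": ["Sichuan", ...], "wzz_prov": ["Tibet", ...]}, ...}
import json
from datetime import date, timedelta

def provinces_last_seven_days(end_date_str, result_file="results.json"):
    with open(result_file, encoding="utf-8") as f:
        daily = json.load(f)
    end = date.fromisoformat(end_date_str)
    provinces = set()
    for offset in range(7):                        # the seven days up to and including end_date
        day = (end - timedelta(days=offset)).isoformat()
        record = daily.get(day, {})                # a day may be missing from the file
        provinces.update(record.get("new_prov", []))
        provinces.update(record.get("wzz_prov", []))
    return sorted(provinces)

print(provinces_last_seven_days("2022-09-08"))
```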
5. Data visualization interface
- The visualization comes in two versions: a map and a bar chart
- Visualization is done with the pyecharts library
- The incoming data consists of four lists: newly confirmed provinces, newly confirmed counts, newly added asymptomatic provinces, and newly added asymptomatic counts
Core code:
Bar chart:
```python
bar_new = (
    Bar()
    .add_xaxis(new_prov)                        # The x-axis is the province name
    .add_yaxis("new confirmed cases", new_num)  # The y-axis is the number of people
    .set_global_opts(
        title_opts=opts.TitleOpts(
            title=input_date + "A summary of the number of newly diagnosed and asymptomatic people nationwide",
            subtitle='Newly diagnosed cases nationwide' + str(new_sum) + "people, new asymptomatic" + str(wzz_sum) + 'people'),
        legend_opts=opts.LegendOpts(pos_top="8%")
    )
)
bar_wzz = (
    Bar()
    .add_xaxis(wzz_prov)
    .add_yaxis("New asymptomatic number", wzz_num)
    .set_global_opts(legend_opts=opts.LegendOpts(pos_bottom="40%"))
)
# Display the two charts together
(
    Grid(init_opts=opts.InitOpts(width='1500px', height='600px'))
    .add(bar_new, grid_opts=opts.GridOpts(pos_top="90px", pos_bottom="60%", height="200px"))
    .add(bar_wzz, grid_opts=opts.GridOpts(pos_top="60%", height="200px"))
).render(input_date + 'Histogram of the number of newly diagnosed and asymptomatic people nationwide.html')
```
result:
map:
```python
# Map
x = []
# Pair each province with its number of infections
for z in zip(list(new_prov), list(new_num)):
    x.append(z)
area_map = Map()
area_map.add("Distribution map of new confirmed cases in China", x, "china", is_map_symbol_show=False)
area_map.set_global_opts(
    title_opts=opts.TitleOpts(title="Distribution map of new confirmed cases in China", subtitle=date),
    visualmap_opts=opts.VisualMapOpts(
        is_piecewise=True,
        pieces=[
            {"min": 1500, "label": '>10000 people', "color": "black"},
            {"min": 500, "max": 15000, "label": '500-1000 people', "color": "#6F171F"},
            {"min": 100, "max": 499, "label": '100-499 people', "color": "#C92C34"},
            {"min": 10, "max": 99, "label": '10-99 people', "color": "#E35B52"},
            {"min": 1, "max": 9, "label": '1-9 people', "color": "#F39E86"}]))
area_map.render(input_date + 'Newly diagnosed map nationwide.html')
```
result:
3. Experience
- At first I thought I could crawl the epidemic data from any webpage (honestly, I did not want to face the plain-text data of the National Health Commission), so I went straight for NetEase's JSON feed, parsed it, built a table, and called it done.
- Later it became clear that crawling the NHC site is genuinely painful. When I first tested the crawler, its anti-crawling mechanism was annoying: it either returned a garbled page or simply responded with 412 and nothing could be fetched (I later found this does not happen as long as you do not crawl too aggressively). The workaround is to switch to a new cookie or keep retrying until a meaningful result comes back.
- So the most basic question is how to judge whether what I crawled is garbage or the data I want. My solution is to extract the page text with BeautifulSoup: if the returned list is empty, the page is garbled and must be re-crawled (a minimal sketch of this check appears at the end of this section).
- Given the high failure rate of crawling, this big task is split into three programs, so that a single error does not throw everything away, as it would if it were all in one program.
- As for the text analysis, I have to complain about the format of the NHC website. In recent announcements each paragraph has its own p tag, so I took it for granted that I could grab all the p-tag contents with BeautifulSoup and tell the paragraphs apart by index. When I later ran a random test, I found that announcements from a few months earlier put all the text under a single p tag, which kills that logic outright, so it had to be rewritten (which also shows how important testing is).
- The second version of the program does not split by paragraph at all: all the text is lumped together and regular expressions do the segmentation.
- I would like to thank the jieba library, which my teacher introduced when I took Python as an elective in my freshman year. It segments text according to Chinese semantics and saved me a lot of work.
- The final program is still not perfect. The further back an announcement is, the less consistent its format and the more likely data is to be missed, since the extraction ultimately relies on regular expressions. It shows that hard-coded rules only cover so much; only by updating the parsing logic as the format changes can every day be handled well (obviously).
- Of course, Python is really fun. Exhausting, but fun.
- Last but not least, this is a lot of work!
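A minimal sketch of the garbled-page check mentioned above; the retry count and sleep interval are illustrative assumptions:

```python
# Hypothetical retry helper for the garbled-page check described above;
# retry count and sleep interval are assumptions, not taken from the repo.
import time
import requests
from bs4 import BeautifulSoup

def fetch_paragraphs(target_url, headers, retries=5):
    for _ in range(retries):
        area = requests.get(target_url, headers=headers).text
        p_tags = BeautifulSoup(area, "html.parser").findAll("p")
        if p_tags:              # non-empty list: we got a real page
            return p_tags
        time.sleep(5)           # garbled or 412-style response: wait and try again
    return []
```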